doc-exports/docs/mrs/umn/ALM-19006.html
Yang, Tong 2195db241c MRS UMN 20231220 version update
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com>
Reviewed-by: Rechenburg, Matthias <matthias.rechenburg@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2024-05-16 09:40:21 +00:00

<a name="ALM-19006"></a>
<h1 class="topictitle1">ALM-19006 HBase Replication Sync Failed</h1>
<div id="body19137257"><div class="section" id="ALM-19006__se4824cd5d196465b8d73f2eb2bee6f27"><h4 class="sectiontitle">Description</h4><p id="ALM-19006__en-us_topic_0070543520_p34566005">The alarm module checks the HBase disaster recovery (DR) data synchronization status every 30 seconds. This alarm is generated when DR data fails to be synchronized to a standby cluster.</p>
<p id="ALM-19006__en-us_topic_0070543520_p42658596">When DR data synchronization succeeds, the alarm is cleared.</p>
</div>
<div class="section" id="ALM-19006__s6baa8c91d3d941128fb7b103a0e72522"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-19006__en-us_topic_0070543520_table36247516" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-19006__en-us_topic_0070543520_row18100201"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-19006__en-us_topic_0070543520_p56830139">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-19006__en-us_topic_0070543520_p39838562">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-19006__en-us_topic_0070543520_p5698118">Automatically Cleared</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-19006__en-us_topic_0070543520_row58894414"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-19006__en-us_topic_0070543520_p5718243">19006</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-19006__en-us_topic_0070543520_p60524524">Critical</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-19006__en-us_topic_0070543520_p3539410">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-19006__se44896fecc0247b18c87100c926924ac"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-19006__en-us_topic_0070543520_table18256764" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-19006__en-us_topic_0070543520_row6882862"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-19006__en-us_topic_0070543520_p20640979">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-19006__en-us_topic_0070543520_p61306625">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-19006__row13141184315104"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-19006__p192431315431">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-19006__p692551319435">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-19006__en-us_topic_0070543520_row66889567"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-19006__en-us_topic_0070543520_p49345813">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-19006__en-us_topic_0070543520_p37587916">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-19006__en-us_topic_0070543520_row2746927"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-19006__en-us_topic_0070543520_p21174538">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-19006__en-us_topic_0070543520_p37415993">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-19006__en-us_topic_0070543520_row1199620"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-19006__en-us_topic_0070543520_p30060365">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-19006__en-us_topic_0070543520_p18970462">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-19006__row055895518017"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-19006__p26086497">Trigger Condition</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-19006__p14558185516011">Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-19006__s5887104fc2c449a8824e76eea7339a25"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-19006__en-us_topic_0070543520_p60212448">HBase data in the active cluster fails to be synchronized to the standby cluster, causing data inconsistency between the active and standby clusters.</p>
</div>
<div class="section" id="ALM-19006__s8035331f27bb4b889e704fe8d58f8d58"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-19006__en-us_topic_0070543520_ul45370106"><li id="ALM-19006__en-us_topic_0070543520_li5677776">The HBase service on the standby cluster is abnormal.</li><li id="ALM-19006__en-us_topic_0070543520_li51099989">A network exception occurs.</li></ul>
</div>
<div class="section" id="ALM-19006__s71895df776534d31b2da3b7303304ca7"><h4 class="sectiontitle">Procedure</h4><p class="tableheading" id="ALM-19006__en-us_topic_0070543520_p45458457"><strong id="ALM-19006__b25525818194110">Observe whether the system automatically clears the alarm.</strong></p>
<ol id="ALM-19006__ol64528750194119"><li id="ALM-19006__li3534916119413"><span>On the <span id="ALM-19006__text34789336432">MRS</span> Manager portal of the active cluster, click <span class="menucascade" id="ALM-19006__menucascade9622183132519"><b><span class="uicontrol" id="ALM-19006__uicontrol862223115259">O&amp;M</span></b> &gt; <b><span class="uicontrol" id="ALM-19006__uicontrol106221731182512">Alarm</span></b> &gt; <b><span class="uicontrol" id="ALM-19006__uicontrol9622153122518">Alarms</span></b></span>.</span></li><li id="ALM-19006__li5777190419413"><span>In the alarm list, click the alarm and obtain its generation time from <strong id="ALM-19006__b4970699819413">Generated</strong>. Check whether the alarm has existed for at least five minutes.</span><p><ul class="subitemlist" id="ALM-19006__ul641910019413"><li id="ALM-19006__li6472421119413">If yes, go to <a href="#ALM-19006__li2065263819413">4</a>.</li><li id="ALM-19006__li816977319413">If no, go to <a href="#ALM-19006__li5327925819413">3</a>.</li></ul>
</p></li><li id="ALM-19006__li5327925819413"><a name="ALM-19006__li5327925819413"></a><a name="li5327925819413"></a><span>Wait five minutes and check whether the system automatically clears the alarm.</span><p><ul class="subitemlist" id="ALM-19006__ul1059982219413"><li id="ALM-19006__li5018509019413">If yes, no further action is required.</li><li id="ALM-19006__li3846046019413">If no, go to <a href="#ALM-19006__li2065263819413">4</a>.</li></ul>
</p></li></ol>
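<p>The branch between steps 3 and 4 can be sketched as a small time comparison. This is only an illustration: the function name <strong>next_step</strong> and the timestamps are hypothetical, and the five-minute threshold is taken from the steps above.</p>

```python
from datetime import datetime, timedelta

# Waiting period taken from the procedure text.
AUTO_CLEAR_WAIT = timedelta(minutes=5)


def next_step(generated, now):
    """Return 4 if the alarm has existed for five minutes or longer, else 3."""
    return 4 if now - generated >= AUTO_CLEAR_WAIT else 3


# Hypothetical timestamps, for illustration only.
generated = datetime(2024, 5, 16, 9, 40, 0)
print(next_step(generated, datetime(2024, 5, 16, 9, 42, 0)))  # alarm is two minutes old
print(next_step(generated, datetime(2024, 5, 16, 9, 46, 0)))  # alarm is six minutes old
```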
<p class="tableheading" id="ALM-19006__p2828953819413"><strong id="ALM-19006__b28700433194131">Check the HBase service status of the standby cluster.</strong></p>
<ol start="4" id="ALM-19006__ol64924621194153"><li id="ALM-19006__li2065263819413"><a name="ALM-19006__li2065263819413"></a><a name="li2065263819413"></a><span>Log in to the <span id="ALM-19006__text1637123475613">MRS</span> Manager portal of the active cluster, and click <span class="menucascade" id="ALM-19006__menucascade11443462617"><b><span class="uicontrol" id="ALM-19006__uicontrol341034192618">O&amp;M</span></b> &gt; <b><span class="uicontrol" id="ALM-19006__uicontrol241434202620">Alarm</span></b> &gt; <b><span class="uicontrol" id="ALM-19006__uicontrol16493412620">Alarms</span></b></span>.</span></li><li id="ALM-19006__li916715319413"><span>In the alarm list, click the alarm to obtain <strong id="ALM-19006__b5165602119413">HostName</strong> from <strong id="ALM-19006__b6225101119413">Location</strong>.</span></li><li id="ALM-19006__li1113073719413"><span>Log in to the node where the HBase client of the active cluster resides as user <strong id="ALM-19006__b1539551419413">omm</strong>.</span><p><p class="litext" id="ALM-19006__p1614982919413">If the cluster is in security mode, perform security authentication first and then access the <strong id="ALM-19006__b434190319413">hbase shell</strong> interface as user <strong id="ALM-19006__b3907712719413">hbase</strong>.</p>
<p id="ALM-19006__p9416156152318"><strong id="ALM-19006__b1525104553116">cd <span id="ALM-19006__ph381512063917">/opt/client</span></strong></p>
<p class="litext" id="ALM-19006__p81551352143820"><strong id="ALM-19006__b8155185210380">source ./bigdata_env</strong></p>
<p class="litext" id="ALM-19006__p19979739162423"><strong id="ALM-19006__b1925362210265">kinit </strong><em id="ALM-19006__i10521323162611">hbaseuser</em></p>
</p></li><li id="ALM-19006__li4024786019413"><span>Run the <strong id="ALM-19006__b3306777219413">status 'replication', 'source'</strong> command to check the DR synchronization status of the faulty node.</span><p><p class="litext" id="ALM-19006__p2917449519413">The DR synchronization status of a node is as follows.</p>
<pre class="screen" id="ALM-19006__screen1938506519413"><strong id="ALM-19006__b6124386719413">10-10-10-153</strong>:
SOURCE: PeerID=abc, SizeOfLogQueue=0, ShippedBatches=2, ShippedOps=2, ShippedBytes=320, LogReadInBytes=1636, LogEditsRead=5, LogEditsFiltered=3, SizeOfLogToReplicate=0, TimeForLogToReplicate=0, ShippedHFiles=0, SizeOfHFileRefsQueue=0, AgeOfLastShippedOp=0, TimeStampsOfLastShippedOp=Mon Jul 18 09:53:28 CST 2016, Replication Lag=0, FailedReplicationAttempts=0
SOURCE: <strong id="ALM-19006__b1432389819413">PeerID=abc1</strong>, SizeOfLogQueue=0, ShippedBatches=1, ShippedOps=1, ShippedBytes=160, LogReadInBytes=1636, LogEditsRead=5, LogEditsFiltered=3, SizeOfLogToReplicate=0, TimeForLogToReplicate=0, ShippedHFiles=0, SizeOfHFileRefsQueue=0, AgeOfLastShippedOp=16788, TimeStampsOfLastShippedOp=Sat Jul 16 13:19:00 CST 2016, Replication Lag=16788, <strong id="ALM-19006__b6180621919413">FailedReplicationAttempts=5</strong></pre>
</p></li><li id="ALM-19006__li2323816019413"><span>Obtain the <strong id="ALM-19006__b2668642119413">PeerID</strong> of each record whose <strong id="ALM-19006__b3885119719413">FailedReplicationAttempts</strong> value is greater than 0.</span><p><p class="litext" id="ALM-19006__p258201719413">In the example output of the preceding step, data on the faulty node 10-10-10-153 fails to be synchronized to the standby cluster whose <strong id="ALM-19006__b1411645319413">PeerID</strong> is <strong id="ALM-19006__b5993921419413">abc1</strong>.</p>
</p></li><li id="ALM-19006__li6555881219413"><a name="ALM-19006__li6555881219413"></a><a name="li6555881219413"></a><span>Run the <strong id="ALM-19006__b781685019413">list_peers</strong> command to find the cluster and the HBase instance corresponding to the <strong id="ALM-19006__b324279419413">PeerID</strong> value.</span><p><pre class="screen" id="ALM-19006__screen1518712219413">PEER_ID CLUSTER_KEY STATE TABLE_CFS
<strong id="ALM-19006__b2918515219413">abc1</strong> 10.10.10.110,10.10.10.119,10.10.10.133:2181:<strong id="ALM-19006__b6133978119413">/hbase2</strong> ENABLED
abc 10.10.10.110,10.10.10.119,10.10.10.133:2181:/hbase ENABLED </pre>
<p class="litext" id="ALM-19006__p2219739319413">In the preceding information, <strong id="ALM-19006__b246637719413">/hbase2</strong> indicates that data is synchronized to the HBase2 instance of the standby cluster.</p>
</p></li><li id="ALM-19006__li5088708919413"><span>In the service list of <span id="ALM-19006__text122001536185610">MRS</span> Manager of the standby cluster, check whether the running status of the HBase instance obtained in <a href="#ALM-19006__li6555881219413">9</a> is <strong id="ALM-19006__b7892151111569">Normal</strong>.</span><p><ul class="subitemlist" id="ALM-19006__ul565412119413"><li id="ALM-19006__li3066242219413">If yes, go to <a href="#ALM-19006__li2284519319413">14</a>.</li><li id="ALM-19006__li62823519413">If no, go to <a href="#ALM-19006__li448244019413">11</a>.</li></ul>
</p></li><li id="ALM-19006__li448244019413"><a name="ALM-19006__li448244019413"></a><a name="li448244019413"></a><span>In the alarm list, check whether the <strong id="ALM-19006__b1988616223112">ALM-19000 HBase Service Unavailable</strong> alarm is generated.</span><p><ul class="subitemlist" id="ALM-19006__ul2286767019413"><li id="ALM-19006__li5259537519413">If yes, go to <a href="#ALM-19006__li2753337519413">12</a>.</li><li id="ALM-19006__li3236701419413">If no, go to <a href="#ALM-19006__li2284519319413">14</a>.</li></ul>
</p></li><li id="ALM-19006__li2753337519413"><a name="ALM-19006__li2753337519413"></a><a name="li2753337519413"></a><span>Follow troubleshooting procedures in <strong id="ALM-19006__b11821135219302">ALM-19000 HBase Service Unavailable</strong> to rectify the fault.</span></li><li id="ALM-19006__li1519512019413"><span>Wait for a few minutes and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-19006__ul5652589919413"><li id="ALM-19006__li4647379019413">If yes, no further action is required.</li><li id="ALM-19006__li628065519413">If no, go to <a href="#ALM-19006__li2284519319413">14</a>.</li></ul>
</p></li></ol>
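<p>Steps 7 to 9 can be sketched as a small text filter over saved shell output. This is a minimal sketch and not part of the product: the function names <strong>failed_peers</strong> and <strong>peer_znodes</strong> are hypothetical, the sample strings are shortened copies of the example output above, and the parsing assumes the output keeps the same layout (a host line ending with a colon, followed by <strong>SOURCE:</strong> records).</p>

```python
import re

# Shortened copies of the example output shown in the procedure.
STATUS_SAMPLE = """10-10-10-153:
SOURCE: PeerID=abc, SizeOfLogQueue=0, AgeOfLastShippedOp=0, Replication Lag=0, FailedReplicationAttempts=0
SOURCE: PeerID=abc1, SizeOfLogQueue=0, AgeOfLastShippedOp=16788, Replication Lag=16788, FailedReplicationAttempts=5
"""

PEERS_SAMPLE = """PEER_ID CLUSTER_KEY STATE TABLE_CFS
abc1 10.10.10.110,10.10.10.119,10.10.10.133:2181:/hbase2 ENABLED
abc 10.10.10.110,10.10.10.119,10.10.10.133:2181:/hbase ENABLED
"""


def failed_peers(status_output):
    """Return (host, peer_id, failed_attempts) for every SOURCE record
    whose FailedReplicationAttempts value is greater than 0."""
    failures = []
    host = None
    for raw in status_output.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("SOURCE:"):
            peer = re.search(r"PeerID=([^,\s]+)", line)
            failed = re.search(r"FailedReplicationAttempts=(\d+)", line)
            if peer and failed and int(failed.group(1)) > 0:
                failures.append((host, peer.group(1), int(failed.group(1))))
        elif line.endswith(":"):
            # Assumption: a non-SOURCE line ending with ":" names the node.
            host = line[:-1]
    return failures


def peer_znodes(list_peers_output):
    """Map each PEER_ID to the last field of its CLUSTER_KEY (for example /hbase2)."""
    znodes = {}
    for raw in list_peers_output.splitlines():
        parts = raw.split()
        if len(parts) < 2 or parts[0] == "PEER_ID":
            continue  # skip blank lines and the header row
        peer_id, cluster_key = parts[0], parts[1]
        znodes[peer_id] = cluster_key.rsplit(":", 1)[-1]
    return znodes


znodes = peer_znodes(PEERS_SAMPLE)
for host, peer_id, attempts in failed_peers(STATUS_SAMPLE):
    print(host, peer_id, attempts, znodes.get(peer_id))
```

<p>With the sample data, the only record reported is the one for <strong>abc1</strong>, which maps to <strong>/hbase2</strong>, matching the conclusion drawn in steps 8 and 9.</p>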
<p class="tableheading" id="ALM-19006__p3897104819413"><strong id="ALM-19006__b55853063194210">Check network connections between RegionServers on active and standby clusters.</strong></p>
<ol start="14" id="ALM-19006__ol41239844194223"><li id="ALM-19006__li2284519319413"><a name="ALM-19006__li2284519319413"></a><a name="li2284519319413"></a><span>Log in to the <span id="ALM-19006__text36271837125614">MRS</span> Manager portal of the active cluster, and click <span class="menucascade" id="ALM-19006__menucascade45089526261"><b><span class="uicontrol" id="ALM-19006__uicontrol19508195219261">O&amp;M</span></b> &gt; <b><span class="uicontrol" id="ALM-19006__uicontrol175081752112617">Alarm</span></b> &gt; <b><span class="uicontrol" id="ALM-19006__uicontrol1250845262617">Alarms</span></b></span>.</span></li><li id="ALM-19006__li3322104919413"><a name="ALM-19006__li3322104919413"></a><a name="li3322104919413"></a><span>In the alarm list, click the alarm to obtain <strong id="ALM-19006__b428014919413">HostName</strong> from <strong id="ALM-19006__b3852134319413">Location</strong>.</span></li><li id="ALM-19006__li5895388719413"><span>Use the IP address of the host obtained in <a href="#ALM-19006__li3322104919413">15</a> to log in to the faulty RegionServer node as user <strong id="ALM-19006__b3055398719413">omm</strong>.</span></li><li id="ALM-19006__li1392399119413"><span>Run the <strong id="ALM-19006__b6082293919413">ping</strong> command to check whether the network connection between the faulty RegionServer node and the host where the RegionServer of the standby cluster resides is normal.</span><p><ul class="subitemlist" id="ALM-19006__ul154711019413"><li id="ALM-19006__li2771099419413">If yes, go to <a href="#ALM-19006__li342888619413">20</a>.</li><li id="ALM-19006__li2999806219413">If no, go to <a href="#ALM-19006__li5820706019413">18</a>.</li></ul>
</p></li><li id="ALM-19006__li5820706019413"><a name="ALM-19006__li5820706019413"></a><a name="li5820706019413"></a><span>Contact the network administrator to restore the network.</span></li><li id="ALM-19006__li5579067219413"><span>After the network is running properly, check whether the alarm is cleared in the alarm list.</span><p><ul class="subitemlist" id="ALM-19006__ul4708502519413"><li id="ALM-19006__li5410149219413">If yes, no further action is required.</li><li id="ALM-19006__li2014475019413">If no, go to <a href="#ALM-19006__li342888619413">20</a>.</li></ul>
</p></li></ol>
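<p>The connectivity check in step 17 can be scripted when several standby hosts need to be tested. The following is a minimal sketch under stated assumptions: it relies on the Linux <strong>ping</strong> options <strong>-c</strong> (packet count) and <strong>-W</strong> (reply timeout in seconds), the helper names are hypothetical, and the IP addresses in the usage line are placeholders for the standby RegionServer hosts.</p>

```python
import subprocess


def ping_command(host, count=3, timeout_s=2):
    """Build a Linux ping command: -c is the packet count, -W the reply timeout in seconds."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def is_reachable(host, run=subprocess.run):
    """Return True when ping exits with status 0, that is, a reply was received."""
    result = run(ping_command(host), stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def unreachable_hosts(hosts, run=subprocess.run):
    """Return the hosts that did not answer, in the order given."""
    return [host for host in hosts if not is_reachable(host, run)]


if __name__ == "__main__":
    # Placeholder addresses; replace with the standby RegionServer hosts.
    print(unreachable_hosts(["10.10.10.110", "10.10.10.119"]))
```

<p>An empty result corresponds to the "yes" branch of step 17. Note that a successful ping only shows that the host answers ICMP requests; it does not prove that the HBase ports are reachable.</p>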
<p class="tableheading" id="ALM-19006__p2111204419413"><strong id="ALM-19006__b15875429194233">Collect fault information.</strong></p>
<ol start="20" id="ALM-19006__ol17286287194237"><li id="ALM-19006__li342888619413"><a name="ALM-19006__li342888619413"></a><a name="li342888619413"></a><span>On the <span id="ALM-19006__text1398573825619">MRS</span> Manager interface of the active and standby clusters, choose <strong id="ALM-19006__b87211374515">O&amp;M</strong> &gt; <strong id="ALM-19006__b1558615387574">Log</strong> &gt; <strong id="ALM-19006__b155871538105719">Download</strong>.</span></li><li id="ALM-19006__li1545387419413"><span>In the <strong id="ALM-19006__b3085997719413">Service</strong> drop-down list box, select <strong id="ALM-19006__b18463135412711">HBase</strong> in the required cluster.</span></li><li id="ALM-19006__li1145664103113"><span>Click <span><img id="ALM-19006__image1945644173117" src="en-us_image_0000001532607922.png"></span> in the upper right corner, and set <strong id="ALM-19006__b6456941173117">Start Date</strong> and <strong id="ALM-19006__b11456154113318">End Date</strong> for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click <strong id="ALM-19006__b13456164113319">Download</strong>.</span></li><li id="ALM-19006__li3956018919413"><span>Contact <span id="ALM-19006__text4614151421417">O&amp;M personnel</span> and send them the collected fault logs.</span></li></ol>
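<p>The <strong>Start Date</strong> and <strong>End Date</strong> values for step 22 can be derived from the alarm generation time as sketched below. The function name <strong>log_window</strong> and the timestamp are hypothetical; the 10-minute margin comes from the step above.</p>

```python
from datetime import datetime, timedelta

# Margin taken from the procedure text: 10 minutes on each side.
LOG_MARGIN = timedelta(minutes=10)


def log_window(generated):
    """Return (start, end) covering 10 minutes before and after the alarm generation time."""
    return generated - LOG_MARGIN, generated + LOG_MARGIN


# Hypothetical alarm generation time, for illustration only.
start, end = log_window(datetime(2024, 1, 1, 0, 5, 0))
print("Start Date:", start.isoformat(sep=" "))
print("End Date:  ", end.isoformat(sep=" "))
```

<p>Using <strong>datetime</strong> arithmetic keeps the window correct when it crosses midnight or a month boundary, as in this example.</p>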
</div>
<div class="section" id="ALM-19006__section1529716184534"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-19006__p4677152685316">After the fault is rectified, the system automatically clears this alarm.</p>
</div>
<div class="section" id="ALM-19006__s3ecfd41e58d645de9a90fdb22deb2672"><h4 class="sectiontitle">Related Information</h4><p id="ALM-19006__en-us_topic_0070543520_p963478">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>