doc-exports/docs/mrs/umn/alm_14009.html
Yang, Tong 3b1f73dece MRS UMN 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-13 12:03:34 +00:00


<a name="alm_14009"></a><a name="alm_14009"></a>
<h1 class="topictitle1">ALM-14009 Number of Faulty DataNodes Exceeds the Threshold</h1>
<div id="body8662426"><div class="section" id="alm_14009__en-us_topic_0191813881_section4477025"><h4 class="sectiontitle">Description</h4><p id="alm_14009__en-us_topic_0191813881_p60892825">The system checks the number of faulty DataNodes in the HDFS cluster every 30 seconds and compares it with the threshold, for which a default value is provided. This alarm is generated when the number of faulty DataNodes in the HDFS cluster exceeds the threshold.</p>
<p id="alm_14009__en-us_topic_0191813881_p33371781">This alarm is cleared when the number of faulty DataNodes in the HDFS cluster is less than or equal to the threshold.</p>
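<p>As a minimal sketch of the check described above, the following Python snippet models one evaluation cycle. The class name, the method name, and the threshold value of 3 are hypothetical and chosen only for illustration; the actual default threshold is defined by the cluster and is not stated here.</p>
<pre class="screen">from dataclasses import dataclass


@dataclass
class FaultyDataNodeAlarm:
    # Hypothetical threshold value; the real default is set by the cluster.
    threshold: int
    active: bool = False

    def evaluate(self, faulty_count: int) -> bool:
        """Apply one periodic check and return whether the alarm is active."""
        # Generated only when the count strictly exceeds the threshold;
        # cleared when the count is less than or equal to the threshold.
        self.active = faulty_count > self.threshold
        return self.active


alarm = FaultyDataNodeAlarm(threshold=3)
# Four consecutive checks with 2, 3, 4, and 3 faulty DataNodes.
states = [alarm.evaluate(count) for count in (2, 3, 4, 3)]
print(states)  # [False, False, True, False]
</pre>
<p>In this sketch, a count equal to the threshold does not generate the alarm, and an active alarm is cleared at the next check in which the count is back at or below the threshold.</p>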
</div>
<div class="section" id="alm_14009__en-us_topic_0191813881_section40293226"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="alm_14009__en-us_topic_0191813881_table18759701" frame="border" border="1" rules="all"><thead align="left"><tr id="alm_14009__en-us_topic_0191813881_row48616240"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="alm_14009__en-us_topic_0191813881_p45601378">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="alm_14009__en-us_topic_0191813881_p2724161">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="alm_14009__en-us_topic_0191813881_p19330484">Auto Clear</p>
</th>
</tr>
</thead>
<tbody><tr id="alm_14009__en-us_topic_0191813881_row22265380"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="alm_14009__en-us_topic_0191813881_p58665336">14009</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="alm_14009__en-us_topic_0191813881_p54271769">Major</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="alm_14009__en-us_topic_0191813881_p33937134">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="alm_14009__en-us_topic_0191813881_section27094719"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="alm_14009__en-us_topic_0191813881_table64553311" frame="border" border="1" rules="all"><thead align="left"><tr id="alm_14009__en-us_topic_0191813881_row25037822"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="alm_14009__en-us_topic_0191813881_p14797678">Parameter</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="alm_14009__en-us_topic_0191813881_p57761279">Description</p>
</th>
</tr>
</thead>
<tbody><tr id="alm_14009__en-us_topic_0191813881_row48152024"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="alm_14009__en-us_topic_0191813881_p7999906">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="alm_14009__en-us_topic_0191813881_p44012677">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="alm_14009__en-us_topic_0191813881_row60569775"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="alm_14009__en-us_topic_0191813881_p7204713">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="alm_14009__en-us_topic_0191813881_p46710863">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="alm_14009__en-us_topic_0191813881_row17744591"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="alm_14009__en-us_topic_0191813881_p28025763">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="alm_14009__en-us_topic_0191813881_p55494360">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
<tr id="alm_14009__en-us_topic_0191813881_row29687198"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="alm_14009__en-us_topic_0191813881_p55852871">Trigger condition</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="alm_14009__en-us_topic_0191813881_p27788708">Generates an alarm when the actual indicator value exceeds the specified threshold.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="alm_14009__en-us_topic_0191813881_section42525879"><h4 class="sectiontitle">Impact on the System</h4><p id="alm_14009__en-us_topic_0191813881_p36292894">Faulty DataNodes cannot provide HDFS services.</p>
</div>
<div class="section" id="alm_14009__en-us_topic_0191813881_section47188597"><h4 class="sectiontitle">Possible Causes</h4><ul id="alm_14009__en-us_topic_0191813881_ul54043292"><li id="alm_14009__en-us_topic_0191813881_li16627584">DataNodes are faulty or overloaded.</li><li id="alm_14009__en-us_topic_0191813881_li15430535">The network between the NameNode and the DataNode is disconnected or busy.</li><li id="alm_14009__en-us_topic_0191813881_li4657088">NameNodes are overloaded.</li></ul>
</div>
<div class="section" id="alm_14009__en-us_topic_0191813881_section22044193"><h4 class="sectiontitle">Procedure</h4><ol id="alm_14009__en-us_topic_0191813881_ol30498500155859"><li class="tableheading" id="alm_14009__en-us_topic_0191813881_li13131878155859"><span>Check whether DataNodes are faulty.</span><p><ol type="a" id="alm_14009__en-us_topic_0191813881_ol39574466"><li id="alm_14009__en-us_topic_0191813881_li60083059">On a cluster node where the client is installed, run the <strong id="alm_14009__b187821571655">hdfs dfsadmin -report</strong> command to check whether any DataNodes are faulty. <ul class="subitemlist" id="alm_14009__en-us_topic_0191813881_ul3876622"><li id="alm_14009__en-us_topic_0191813881_li34889605">If yes, go to <a href="#alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step4">1.b</a>.</li><li id="alm_14009__en-us_topic_0191813881_li7485743">If no, go to <a href="#alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step6">2.a</a>.</li></ul>
</li><li id="alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step4"><a name="alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step4"></a><a name="en-us_topic_0191813881_alm14007_3_mmccppss_step4"></a>On the MRS cluster details page, choose <strong id="alm_14009__b2012016291456">Components</strong> &gt; <strong id="alm_14009__b512116298518">HDFS</strong> &gt; <strong id="alm_14009__b41211929450">Instances</strong> to check whether the DataNode is stopped.<div class="note" id="alm_14009__en-us_topic_0191813881_note163984233303"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="alm_14009__en-us_topic_0191813881_p103987239302">For MRS 1.7.2 or earlier, log in to MRS Manager and choose <strong id="alm_14009__b6852016978">Services</strong> &gt; <strong id="alm_14009__b78619165720">HDFS</strong> &gt; <strong id="alm_14009__b88791618718">Instances</strong>.</p>
</div></div>
<ul class="subitemlist" id="alm_14009__en-us_topic_0191813881_ul21289199"><li id="alm_14009__en-us_topic_0191813881_li57385070">If yes, go to <a href="#alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step5">1.c</a>.</li><li id="alm_14009__en-us_topic_0191813881_li17679057">If no, go to <a href="#alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step6">2.a</a>.</li></ul>
</li><li id="alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step5"><a name="alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step5"></a><a name="en-us_topic_0191813881_alm14007_3_mmccppss_step5"></a>Select the DataNode instance, and choose <strong id="alm_14009__b1881742617149">More</strong> &gt; <strong id="alm_14009__b12817112614143">Restart Instance</strong> to restart it. Wait 5 minutes and check whether the alarm is cleared.<ul class="subitemlist" id="alm_14009__en-us_topic_0191813881_ul3131361"><li id="alm_14009__en-us_topic_0191813881_li28182251">If yes, no further action is required.</li><li id="alm_14009__en-us_topic_0191813881_li52313669">If no, go to <a href="#alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step6">2.a</a>.</li></ul>
</li></ol>
</p></li><li class="tableheading" id="alm_14009__en-us_topic_0191813881_li19177414155938"><span>Check the status of the network between the NameNode and the DataNode.</span><p><ol type="a" id="alm_14009__en-us_topic_0191813881_ol34769881155956"><li id="alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step6"><a name="alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step6"></a><a name="en-us_topic_0191813881_alm14007_3_mmccppss_step6"></a>Log in to the service IP address of the node where the faulty DataNode is located, and run the <strong id="alm_14009__b2048174421416">ping</strong> <em id="alm_14009__i749144421411">IP address of the NameNode</em> command to check whether the network between the DataNode and the NameNode is abnormal.<ul class="subitemlist" id="alm_14009__en-us_topic_0191813881_ul47571555"><li id="alm_14009__en-us_topic_0191813881_li25490811">If yes, go to <a href="#alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step7">2.b</a>.</li><li id="alm_14009__en-us_topic_0191813881_li51489795">If no, go to <a href="#alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step8">3.a</a>.</li></ul>
</li><li id="alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step7"><a name="alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step7"></a><a name="en-us_topic_0191813881_alm14007_3_mmccppss_step7"></a>Rectify the network fault. Wait 5 minutes and check whether the alarm is cleared.<ul class="subitemlist" id="alm_14009__en-us_topic_0191813881_ul22205728"><li id="alm_14009__en-us_topic_0191813881_li65633828">If yes, no further action is required.</li><li id="alm_14009__en-us_topic_0191813881_li53833544">If no, go to <a href="#alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step8">3.a</a>.</li></ul>
</li></ol>
</p></li><li class="tableheading" id="alm_14009__en-us_topic_0191813881_li15599852155949"><span>Check whether the DataNode is overloaded.</span><p><ol type="a" id="alm_14009__en-us_topic_0191813881_ol31229633155956"><li id="alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step8"><a name="alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step8"></a><a name="en-us_topic_0191813881_alm14007_3_mmccppss_step8"></a>On the MRS cluster details page, click <strong id="alm_14009__b948473931618">Alarms</strong> and check whether the alarm ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold exists.<ul class="subitemlist" id="alm_14009__en-us_topic_0191813881_ul7935270"><li id="alm_14009__en-us_topic_0191813881_li4308570">If yes, go to <a href="#alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step13">3.b</a>.</li><li id="alm_14009__en-us_topic_0191813881_li13449869">If no, go to <a href="#alm_14009__en-us_topic_0191813881_step9">4.a</a>.</li></ul>
</li><li id="alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step13"><a name="alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step13"></a><a name="en-us_topic_0191813881_alm14007_3_mmccppss_step13"></a>Follow the procedure in <a href="alm_14008.html">ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold</a> to handle that alarm, and check whether it is cleared.<ul class="subitemlist" id="alm_14009__en-us_topic_0191813881_ul63544148"><li id="alm_14009__en-us_topic_0191813881_li35026424">If yes, go to <a href="#alm_14009__en-us_topic_0191813881_ss10">3.c</a>.</li><li id="alm_14009__en-us_topic_0191813881_li18568063">If no, go to <a href="#alm_14009__en-us_topic_0191813881_step9">4.a</a>.</li></ul>
</li><li id="alm_14009__en-us_topic_0191813881_ss10"><a name="alm_14009__en-us_topic_0191813881_ss10"></a><a name="en-us_topic_0191813881_ss10"></a>Wait 5 minutes and check whether the alarm is cleared. <ul class="subitemlist" id="alm_14009__en-us_topic_0191813881_ul47236811"><li id="alm_14009__en-us_topic_0191813881_li22478116">If yes, no further action is required.</li><li id="alm_14009__en-us_topic_0191813881_li976457">If no, go to <a href="#alm_14009__en-us_topic_0191813881_step9">4.a</a>.</li></ul>
</li></ol>
</p></li><li class="tableheading" id="alm_14009__en-us_topic_0191813881_li53071569155956"><span>Check whether the NameNode is overloaded.</span><p><ol type="a" id="alm_14009__en-us_topic_0191813881_ol21483741155956"><li id="alm_14009__en-us_topic_0191813881_step9"><a name="alm_14009__en-us_topic_0191813881_step9"></a><a name="en-us_topic_0191813881_step9"></a>On the MRS cluster details page, click <strong id="alm_14009__b5859183916175">Alarms</strong> and check whether the alarm ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold exists.<ul class="subitemlist" id="alm_14009__en-us_topic_0191813881_ul31192822"><li id="alm_14009__en-us_topic_0191813881_li12299945">If yes, go to <a href="#alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step14">4.b</a>.</li><li id="alm_14009__en-us_topic_0191813881_li56771492">If no, go to <a href="#alm_14009__en-us_topic_0191813881_li572522141314">5</a>.</li></ul>
</li><li id="alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step14"><a name="alm_14009__en-us_topic_0191813881_alm14007_3_mmccppss_step14"></a><a name="en-us_topic_0191813881_alm14007_3_mmccppss_step14"></a>Follow the procedure in <a href="alm_14007.html">ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold</a> to handle that alarm, and check whether it is cleared.<ul class="subitemlist" id="alm_14009__en-us_topic_0191813881_ul23565055"><li id="alm_14009__en-us_topic_0191813881_li10758903">If yes, go to <a href="#alm_14009__en-us_topic_0191813881_ss13">4.c</a>.</li><li id="alm_14009__en-us_topic_0191813881_li66164776">If no, go to <a href="#alm_14009__en-us_topic_0191813881_li572522141314">5</a>.</li></ul>
</li><li id="alm_14009__en-us_topic_0191813881_ss13"><a name="alm_14009__en-us_topic_0191813881_ss13"></a><a name="en-us_topic_0191813881_ss13"></a>Wait 5 minutes and check whether the alarm is cleared. <ul class="subitemlist" id="alm_14009__en-us_topic_0191813881_ul49957660"><li id="alm_14009__en-us_topic_0191813881_li46965764">If yes, no further action is required.</li><li id="alm_14009__en-us_topic_0191813881_li20038698">If no, go to <a href="#alm_14009__en-us_topic_0191813881_li572522141314">5</a>.</li></ul>
</li></ol>
</p></li><li id="alm_14009__en-us_topic_0191813881_li572522141314"><a name="alm_14009__en-us_topic_0191813881_li572522141314"></a><a name="en-us_topic_0191813881_li572522141314"></a><span>Collect fault information.</span><p><ol type="a" id="alm_14009__en-us_topic_0191813881_en-us_topic_0191813935_ol6089206913036"><li id="alm_14009__en-us_topic_0191813881_en-us_topic_0191813935_li4478836213036">On MRS Manager, choose <span class="menucascade" id="alm_14009__menucascade1159223713188"><b><span class="uicontrol" id="alm_14009__uicontrol059173714182">System</span></b> &gt; <b><span class="uicontrol" id="alm_14009__uicontrol459173710188">Export Log</span></b></span>.</li><li id="alm_14009__li18574327401">Contact technical support engineers for help. For details, see <a href="https://docs.otc.t-systems.com/en-us/public/learnmore.html" target="_blank" rel="noopener noreferrer">technical support</a>.</li></ol>
</p></li></ol>
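<p>The first check in the procedure above relies on reading the output of <strong>hdfs dfsadmin -report</strong>. The following Python sketch shows one way to pull the hostnames of dead DataNodes out of that output. It assumes the report layout printed by Hadoop 2.x, in which a "Dead datanodes (N):" heading is followed by one "Hostname:" line per node, and it assumes that the dead section is the last one in the text. The sample report, IP addresses, host names, and the function name are hypothetical; verify the layout against the output of your own cluster before relying on it.</p>
<pre class="screen">import re
from typing import List

# Abbreviated sample in the assumed Hadoop 2.x report layout (hypothetical hosts).
SAMPLE_REPORT = """\
Live datanodes (2):

Name: 192.168.0.11:50010 (node-1)
Hostname: node-1

Name: 192.168.0.12:50010 (node-2)
Hostname: node-2

Dead datanodes (1):

Name: 192.168.0.13:50010 (node-3)
Hostname: node-3
"""


def dead_datanodes(report: str) -> List[str]:
    """Return the hostnames listed under the 'Dead datanodes' heading."""
    heading = re.search(r"^Dead datanodes \(\d+\):$", report, re.MULTILINE)
    if heading is None:
        # No dead section in the report: nothing is reported as dead.
        return []
    # Assumption: everything after the heading belongs to the dead section.
    dead_section = report[heading.end():]
    return re.findall(r"^Hostname: (\S+)$", dead_section, re.MULTILINE)


dead = dead_datanodes(SAMPLE_REPORT)
print(dead)  # ['node-3']
</pre>
<p>In practice, the text passed to the function would be the captured output of the command rather than a sample string. The resulting list identifies the instances to inspect in step 1.b.</p>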
</div>
<div class="section" id="alm_14009__en-us_topic_0191813881_section64180012"><h4 class="sectiontitle">Reference</h4><p id="alm_14009__en-us_topic_0191813881_p43901854">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_0241.html">Alarm Reference (Applicable to Versions Earlier Than MRS 3.x)</a></div>
</div>
</div>