Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
<a name="ALM-14009"></a>
<h1 class="topictitle1">ALM-14009 Number of Dead DataNodes Exceeds the Threshold</h1>
<div id="body59830422"><div class="section" id="ALM-14009__s6c60596b6a184e788c2dd162512396be"><h4 class="sectiontitle">Description</h4><p id="ALM-14009__en-us_topic_0070543645_p54311227">The system checks the number of DataNodes in the Dead state in the HDFS cluster every 30 seconds and compares it with the threshold, which has a default value. This alarm is generated when the number exceeds the threshold.</p>
<p id="ALM-14009__en-us_topic_0070543645_p19039002">To change the threshold, choose <strong id="ALM-14009__en-us_topic_0070543638_b55978213">O&M</strong> > <strong id="ALM-14009__b18216526383">Alarm ></strong> <strong id="ALM-14009__b122075817202">Thresholds</strong> > <em id="ALM-14009__i10674629125819">Name of the desired cluster</em><strong id="ALM-14009__b76731229185816"> ></strong> <strong id="ALM-14009__en-us_topic_0070543638_b5927966">HDFS</strong>.</p>
<p id="ALM-14009__p12972664104520">When the <strong id="ALM-14009__b48421890111935">Trigger Count</strong> is 1, this alarm is cleared when the number of Dead DataNodes is less than or equal to the threshold. When the <strong id="ALM-14009__b11395929193813">Trigger Count</strong> is greater than 1, this alarm is cleared when the number of Dead DataNodes is less than or equal to 90% of the threshold.</p>
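<p>The generation and clearance rules above can be sketched as two small predicates. This is only an illustration of the stated rules, not the monitoring code used by the system; the function and parameter names are hypothetical. Integer arithmetic is used for the 90% comparison to avoid floating-point rounding.</p>

```python
def exceeds_threshold(dead_count: int, threshold: int) -> bool:
    """Return True when the number of Dead DataNodes exceeds the threshold."""
    return dead_count > threshold


def alarm_should_clear(dead_count: int, threshold: int, trigger_count: int) -> bool:
    """Return True when the alarm would be cleared under the rule described above.

    All names here are hypothetical and chosen for illustration only.
    """
    if trigger_count < 1:
        raise ValueError("trigger_count must be at least 1")
    if trigger_count == 1:
        # Trigger Count is 1: clear at or below the threshold.
        return dead_count <= threshold
    # Trigger Count is greater than 1: clear at or below 90% of the threshold.
    return dead_count * 10 <= threshold * 9
```

<p>For example, with a threshold of 10 and a Trigger Count of 2, the alarm would be cleared at 9 Dead DataNodes but not at 10.</p>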
</div>
<div class="section" id="ALM-14009__s6d29b71a12fd427fbd58bafdc5b07e55"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14009__en-us_topic_0070543645_table56038149" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14009__en-us_topic_0070543645_row2506715"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-14009__en-us_topic_0070543645_p1717399">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-14009__en-us_topic_0070543645_p4891598">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-14009__en-us_topic_0070543645_p60675178">Automatically Cleared</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14009__en-us_topic_0070543645_row15742419"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-14009__en-us_topic_0070543645_p67568">14009</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-14009__en-us_topic_0070543645_p5473066">Major</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-14009__en-us_topic_0070543645_p40665192">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14009__sf5df6fd306aa4d2ba9ff73d4462172b4"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14009__en-us_topic_0070543645_table5546259" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14009__en-us_topic_0070543645_row9802558"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-14009__en-us_topic_0070543645_p55809695">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-14009__en-us_topic_0070543645_p24291441">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14009__row17339212163615"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14009__p192431315431">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14009__p692551319435">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14009__en-us_topic_0070543645_row21449691"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14009__en-us_topic_0070543645_p59703430">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14009__en-us_topic_0070543645_p4139674">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14009__en-us_topic_0070543645_row37257066"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14009__en-us_topic_0070543645_p65032365">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14009__en-us_topic_0070543645_p33130221">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14009__en-us_topic_0070543645_row29736535"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14009__en-us_topic_0070543645_p59849116">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14009__en-us_topic_0070543645_p15940242">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14009__row0467709217"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14009__p8343823141916">NameServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14009__en-us_topic_0070543642_p6743286">Specifies the NameService for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14009__en-us_topic_0070543645_row9244455"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14009__en-us_topic_0070543645_p10603368">Trigger condition</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14009__en-us_topic_0070543645_p53566511">Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14009__sa95ff4cf43484aac8ae79c576f58a4a0"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-14009__en-us_topic_0070543645_p43920097">DataNodes that are in the Dead state cannot provide HDFS services.</p>
</div>
<div class="section" id="ALM-14009__se5a15867de754e7f911aac51efa5e0e9"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-14009__en-us_topic_0070543645_ul758120"><li id="ALM-14009__en-us_topic_0070543645_li6823087">DataNodes are faulty or overloaded.</li><li id="ALM-14009__en-us_topic_0070543645_li61407788">The network between the NameNode and the DataNode is disconnected or busy.</li><li id="ALM-14009__en-us_topic_0070543645_li15799184">NameNodes are overloaded.</li><li id="ALM-14009__li9560152410149">The NameNodes are not restarted after the DataNode is deleted.</li></ul>
</div>
<div class="section" id="ALM-14009__s138d81c310f84c30a740bfc9499f0053"><h4 class="sectiontitle">Procedure</h4><p class="tableheading" id="ALM-14009__en-us_topic_0070543645_p4665529"><strong id="ALM-14009__b4893980117543">Check whether DataNodes are faulty.</strong></p>
<ol id="ALM-14009__ol5822525517553"><li id="ALM-14009__li5192402917545"><span>On the FusionInsight Manager portal, choose <strong id="ALM-14009__b93631013193213">Cluster > </strong><em id="ALM-14009__i03651713133217">Name of the desired cluster</em><strong id="ALM-14009__b7363111310321"> > Services</strong> > <strong id="ALM-14009__b1638262217545">HDFS</strong>. The <strong id="ALM-14009__b1322587717545">HDFS Status</strong> page is displayed.</span></li><li id="ALM-14009__li321816517545"><span>In the <strong id="ALM-14009__b30282645162958">Basic Information</strong> area, click <strong id="ALM-14009__b4509681617545">NameNode(Active)</strong> to go to the HDFS WebUI.</span><p><div class="note" id="ALM-14009__note840916461457"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="ALM-14009__en-us_topic_0193189480_p91833832915">By default, the <strong id="ALM-14009__en-us_topic_0193189480_b4780151814294">admin</strong> user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component.</p>
</div></div>
</p></li><li id="ALM-14009__li4777189917545"><span>On the HDFS WebUI, click the <strong id="ALM-14009__b4836693537">Datanodes</strong> tab. In the <strong id="ALM-14009__b88363905318">In operation</strong> area, click <strong id="ALM-14009__b168369914531">Filter</strong> to check whether <strong id="ALM-14009__b58363919534">down</strong> is in the drop-down list.</span><p><ul class="subitemlist" id="ALM-14009__ul2767761017545"><li id="ALM-14009__li4219963917545">If yes, select <strong id="ALM-14009__b13228123517534">down</strong>, record the information about the filtered DataNodes, and go to <a href="#ALM-14009__li4499900717545">4</a>.</li><li id="ALM-14009__li6272761317545">If no, go to <a href="#ALM-14009__li2034924617545">8</a>.</li></ul>
</p></li><li id="ALM-14009__li4499900717545"><a name="ALM-14009__li4499900717545"></a><a name="li4499900717545"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-14009__b853912714543">Cluster > </strong><em id="ALM-14009__i3539167105419">Name of the desired cluster</em> > <strong id="ALM-14009__b65390717548">Services</strong> > <strong id="ALM-14009__b17539207115420">HDFS</strong> > <strong id="ALM-14009__b1553915755413">Instance</strong> to check whether recorded DataNodes exist in the instance list.</span><p><ul class="subitemlist" id="ALM-14009__ul13221514112118"><li id="ALM-14009__li522101412117">If all recorded DataNodes exist, go to <a href="#ALM-14009__li22951519113013">5</a>.</li><li id="ALM-14009__li1922314162114">If none of the recorded DataNodes exists, go to <a href="#ALM-14009__li4226377546">6</a>.</li><li id="ALM-14009__li14230147210">If some of the recorded DataNodes exist, go to <a href="#ALM-14009__li992618717545">7</a>.</li></ul>
</p></li><li id="ALM-14009__li22951519113013"><a name="ALM-14009__li22951519113013"></a><a name="li22951519113013"></a><span>Locate the DataNode instance, click <strong id="ALM-14009__b6295101911303">More</strong> > <strong id="ALM-14009__b929510191306">Restart Instance</strong> to restart it and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-14009__ul11295131963020"><li id="ALM-14009__li14295151913301">If yes, no further action is required.</li><li id="ALM-14009__li529631918308">If no, go to <a href="#ALM-14009__li2034924617545">8</a>.</li></ul>
</p></li><li id="ALM-14009__li4226377546"><a name="ALM-14009__li4226377546"></a><a name="li4226377546"></a><span>Select all NameNode instances, choose <strong id="ALM-14009__b192782052155416">More</strong> > <strong id="ALM-14009__b22781152145418">Instance Rolling Restart</strong> to restart them and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-14009__ul411831665510"><li id="ALM-14009__li14514191675613">If yes, no further action is required.</li><li id="ALM-14009__li19118121605511">If no, go to <a href="#ALM-14009__li4607504917545">16</a>.</li></ul>
</p></li><li id="ALM-14009__li992618717545"><a name="ALM-14009__li992618717545"></a><a name="li992618717545"></a><span>Select all NameNode instances, choose <strong id="ALM-14009__b11330174310311">More</strong> > <strong id="ALM-14009__b2330174383112">Instance Rolling Restart</strong> to restart them. Locate the DataNode instance, click <strong id="ALM-14009__b233788317545">More</strong> > <strong id="ALM-14009__b2104095117545">Restart Instance</strong> to restart it and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-14009__ul675058117545"><li id="ALM-14009__li5515083617545">If yes, no further action is required.</li><li id="ALM-14009__li3803276617545">If no, go to <a href="#ALM-14009__li2034924617545">8</a>.</li></ul>
</p></li></ol>
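<p>The branching in step 4 can be sketched as a set comparison between the DataNodes recorded from the HDFS WebUI and the DataNodes in the instance list. This is a minimal illustration only; the function name and the host names in the usage note are hypothetical, and the returned numbers are the step numbers of this procedure.</p>

```python
from typing import Set


def next_step_for_recorded_datanodes(recorded: Set[str], instances: Set[str]) -> int:
    """Return the procedure step to go to after step 4.

    recorded  -- DataNodes recorded from the "down" filter (hypothetical input)
    instances -- DataNodes shown in the HDFS instance list (hypothetical input)
    """
    if not recorded:
        raise ValueError("no DataNodes were recorded in step 3")
    existing = recorded & instances
    if existing == recorded:
        return 5  # all recorded DataNodes exist: restart the DataNode instances
    if not existing:
        return 6  # none exist: rolling-restart the NameNode instances
    return 7      # some exist: restart both NameNodes and DataNodes
```

<p>For example, if <strong>dn-1</strong> and <strong>dn-2</strong> were recorded but only <strong>dn-1</strong> appears in the instance list, the sketch returns 7.</p>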
<p class="tableheading" id="ALM-14009__p6075523317545"><strong id="ALM-14009__b4599985117558">Check the status of the network between the NameNode and the DataNode.</strong></p>
<ol start="8" id="ALM-14009__ol257612321769"><li id="ALM-14009__li2034924617545"><a name="ALM-14009__li2034924617545"></a><a name="li2034924617545"></a><span>Log in to the faulty DataNode on the management page as user <strong id="ALM-14009__b239531216515">root</strong>, and run the <strong id="ALM-14009__b2222682017545">ping </strong><em id="ALM-14009__i6582365317545">IP address of the NameNode</em> command to check whether the network between the DataNode and the NameNode is abnormal. <span id="ALM-14009__text101733453110"></span></span><p><p id="ALM-14009__p126174131814">To obtain the service plane IP address of the faulty DataNode, choose <strong id="ALM-14009__b229319363815">Cluster > </strong><em id="ALM-14009__i19293136886">Name of the desired cluster</em><strong id="ALM-14009__b329303613814"> > Services</strong> > <strong id="ALM-14009__b2482112111194">HDFS</strong> > <strong id="ALM-14009__b1148215211193">Instance</strong> on FusionInsight Manager and view the instance list.</p>
<ul class="subitemlist" id="ALM-14009__ul971756717545"><li id="ALM-14009__li3011565917545">If yes, go to <a href="#ALM-14009__li3193609617545">9</a>.</li><li id="ALM-14009__li2344935117545">If no, go to <a href="#ALM-14009__li4029888217545">10</a>.</li></ul>
</p></li><li id="ALM-14009__li3193609617545"><a name="ALM-14009__li3193609617545"></a><a name="li3193609617545"></a><span>Rectify the network fault, and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-14009__ul3187744317545"><li id="ALM-14009__li4892549217545">If yes, no further action is required.</li><li id="ALM-14009__li354193817545">If no, go to <a href="#ALM-14009__li4029888217545">10</a>.</li></ul>
</p></li></ol>
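<p>The connectivity check in step 8 could be scripted on the DataNode roughly as follows. This is a sketch under assumptions: it assumes a Linux <strong>ping</strong> that accepts <strong>-c</strong> for the packet count and treats a non-zero exit status as an abnormal network. The function names are hypothetical, and the <strong>run</strong> parameter exists only so that the check can be exercised without sending packets.</p>

```python
import subprocess
from typing import List


def ping_command(namenode_ip: str, count: int = 3) -> List[str]:
    """Build the ping command line (assumes Linux ping with the -c option)."""
    return ["ping", "-c", str(count), namenode_ip]


def network_is_abnormal(namenode_ip: str, run=subprocess.run) -> bool:
    """Return True when ping exits with a non-zero status.

    A non-zero status is treated as "abnormal" here; this does not detect
    a network that is reachable but busy.
    """
    result = run(
        ping_command(namenode_ip),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode != 0


def next_step_after_ping(abnormal: bool) -> int:
    """Map the result of step 8 to the next step of this procedure."""
    return 9 if abnormal else 10
```

<p>Because the exit status only reflects whether replies were received, packet loss and latency in the <strong>ping</strong> output still need to be reviewed manually when the network is suspected to be busy.</p>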
<p class="tableheading" id="ALM-14009__p1846153617545"><strong id="ALM-14009__b4397501817615">Check whether the DataNode is overloaded.</strong></p>
<ol start="10" id="ALM-14009__ol6601875217627"><li id="ALM-14009__li4029888217545"><a name="ALM-14009__li4029888217545"></a><a name="li4029888217545"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-14009__b1898941417545">O&M > Alarm<strong id="ALM-14009__b27872374104950"> > Alarms</strong></strong> and check whether the alarm <strong id="ALM-14009__b3668700217545">ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold</strong> exists.</span><p><ul class="subitemlist" id="ALM-14009__ul5667343617545"><li id="ALM-14009__li1885717617545">If yes, go to <a href="#ALM-14009__li3775267317545">11</a>.</li><li id="ALM-14009__li5103629117545">If no, go to <a href="#ALM-14009__li2641038017545">13</a>.</li></ul>
</p></li><li id="ALM-14009__li3775267317545"><a name="ALM-14009__li3775267317545"></a><a name="li3775267317545"></a><span>Handle the alarm by referring to <strong id="ALM-14009__b1074012219571">ALM-14008 HDFS DataNode Memory Usage Exceeds the Threshold</strong>, and check whether ALM-14008 is cleared.</span><p><ul class="subitemlist" id="ALM-14009__ul2656436217545"><li id="ALM-14009__li5131166217545">If yes, go to <a href="#ALM-14009__li4983258617545">12</a>.</li><li id="ALM-14009__li6260391917545">If no, go to <a href="#ALM-14009__li2641038017545">13</a>.</li></ul>
</p></li><li id="ALM-14009__li4983258617545"><a name="ALM-14009__li4983258617545"></a><a name="li4983258617545"></a><span>Check whether this alarm (ALM-14009) is cleared from the alarm list.</span><p><ul class="subitemlist" id="ALM-14009__ul6358155817545"><li id="ALM-14009__li422973917545">If yes, no further action is required.</li><li id="ALM-14009__li706461717545">If no, go to <a href="#ALM-14009__li2641038017545">13</a>.</li></ul>
</p></li></ol>
<p class="tableheading" id="ALM-14009__p3536311517545"><strong id="ALM-14009__b5411870017633">Check whether the NameNode is overloaded.</strong></p>
<ol start="13" id="ALM-14009__ol5392475917644"><li id="ALM-14009__li2641038017545"><a name="ALM-14009__li2641038017545"></a><a name="li2641038017545"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-14009__b4584009017545">O&M > Alarm<strong id="ALM-14009__b6261132018197"> > Alarms</strong></strong> and check whether the alarm <strong id="ALM-14009__b990762917545">ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold</strong> exists.</span><p><ul class="subitemlist" id="ALM-14009__ul4767372917545"><li id="ALM-14009__li6432052217545">If yes, go to <a href="#ALM-14009__li1070095917545">14</a>.</li><li id="ALM-14009__li4257978317545">If no, go to <a href="#ALM-14009__li4607504917545">16</a>.</li></ul>
</p></li><li id="ALM-14009__li1070095917545"><a name="ALM-14009__li1070095917545"></a><a name="li1070095917545"></a><span>Handle the alarm by referring to <strong id="ALM-14009__b19917103315576">ALM-14007 HDFS NameNode Memory Usage Exceeds the Threshold</strong>, and check whether ALM-14007 is cleared.</span><p><ul class="subitemlist" id="ALM-14009__ul864553517545"><li id="ALM-14009__li6003240717545">If yes, go to <a href="#ALM-14009__li5612534017545">15</a>.</li><li id="ALM-14009__li3078677617545">If no, go to <a href="#ALM-14009__li4607504917545">16</a>.</li></ul>
</p></li><li id="ALM-14009__li5612534017545"><a name="ALM-14009__li5612534017545"></a><a name="li5612534017545"></a><span>Check whether this alarm (ALM-14009) is cleared from the alarm list.</span><p><ul class="subitemlist" id="ALM-14009__ul1312047217545"><li id="ALM-14009__li2919976717545">If yes, no further action is required.</li><li id="ALM-14009__li1637091117545">If no, go to <a href="#ALM-14009__li4607504917545">16</a>.</li></ul>
</p></li></ol>
<p class="tableheading" id="ALM-14009__p5097539117545"><strong id="ALM-14009__b1811747517650">Collect fault information.</strong></p>
<ol start="16" id="ALM-14009__ol3147822117653"><li id="ALM-14009__li4607504917545"><a name="ALM-14009__li4607504917545"></a><a name="li4607504917545"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-14009__b39977366113627">O&M</strong> > <strong id="ALM-14009__b24251979113627">Log > Download</strong>.</span></li><li id="ALM-14009__li4007909417545"><span>For <strong id="ALM-14009__b4109152317545">Service</strong>, select <strong id="ALM-14009__b1202226517545">HDFS</strong> in the required cluster.</span></li><li id="ALM-14009__li1145664103113"><span>Click <span><img id="ALM-14009__image1945644173117" src="en-us_image_0269383964.png"></span> in the upper right corner, and set <strong id="ALM-14009__b6456941173117">Start Date</strong> and <strong id="ALM-14009__b11456154113318">End Date</strong> for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click <strong id="ALM-14009__b13456164113319">Download</strong>.</span></li><li id="ALM-14009__li5868395217545"><span>Contact <span id="ALM-14009__text4614151421417">O&M personnel</span> and send them the collected logs.</span></li></ol>
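<p>The log collection window in step 18 is the alarm generation time minus and plus 10 minutes. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the alarm time would be taken from the alarm list.</p>

```python
from datetime import datetime, timedelta
from typing import Tuple


def log_collection_window(alarm_time: datetime,
                          margin_minutes: int = 10) -> Tuple[datetime, datetime]:
    """Return (start, end) for log collection around the alarm generation time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin
```

<p>For an alarm generated at 00:05, the window would start at 23:55 on the previous day and end at 00:15.</p>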
</div>
<div class="section" id="ALM-14009__section1529716184534"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-14009__p4677152685316">After the fault is rectified, the system automatically clears this alarm.</p>
</div>
<div class="section" id="ALM-14009__s4a1927215e974cd7bfd0dc1cb7e27881"><h4 class="sectiontitle">Related Information</h4><p id="ALM-14009__en-us_topic_0070543645_p60020356">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>