doc-exports/docs/mrs/umn/ALM-14026.html
Yang, Tong 3b1f73dece MRS UMN 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-13 12:03:34 +00:00


<a name="ALM-14026"></a>
<h1 class="topictitle1">ALM-14026 Blocks on DataNode Exceed the Threshold</h1>
<div id="body1541485623369"><div class="section" id="ALM-14026__section14753556"><h4 class="sectiontitle">Description</h4><p id="ALM-14026__p42796248">The system checks the number of blocks on each DataNode every 30 seconds. This alarm is generated when the number of blocks on the DataNode exceeds the threshold.</p>
<p id="ALM-14026__p43944044">If <strong id="ALM-14026__b109201323847">Trigger Count</strong> is <strong id="ALM-14026__b033415261747">1</strong> and the number of blocks on the DataNode is less than or equal to the threshold, this alarm is cleared. If <strong id="ALM-14026__b765543210420">Trigger Count</strong> is greater than <strong id="ALM-14026__b1319418351744">1</strong> and the number of blocks on the DataNode is less than or equal to 90% of the threshold, this alarm is cleared.</p>
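<p>The clearing rule above can be sketched as a small check. This is only an illustration of the described behavior, not the alarm module's actual code; the function name <strong>should_clear</strong> and its parameters are hypothetical.</p>

```python
def should_clear(block_count: int, threshold: int, trigger_count: int) -> bool:
    """Return True if the alarm would be cleared under the rule described above.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    if trigger_count == 1:
        # Trigger Count is 1: clear at or below the threshold itself.
        return block_count <= threshold
    # Trigger Count is greater than 1: clear at or below 90% of the threshold.
    # Integer arithmetic avoids floating-point rounding.
    return block_count * 10 <= threshold * 9
```

<p>For example, with a threshold of 2,000,000 blocks and <strong>Trigger Count</strong> set to 3, a DataNode holding 1,900,000 blocks would keep the alarm, while one holding 1,800,000 blocks would clear it.</p>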
</div>
<div class="section" id="ALM-14026__section65673142"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14026__table2697805" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14026__row10450762"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-14026__p41205356">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-14026__p49299555">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-14026__p33841047">Auto Clear</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14026__row56770287"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-14026__p34990548">14026</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-14026__p15662125">Minor</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-14026__p60672611">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14026__section54187374"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14026__table15534429" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14026__row48561591"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-14026__p41174828">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-14026__p46826794">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14026__row135623149261"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14026__p156438591896">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14026__p187931338134115">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14026__row34873944"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14026__p65062640">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14026__p33829733">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14026__row36032144"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14026__p35626567">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14026__p49481274">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14026__row42678285"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14026__p51620924">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14026__p34048007">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14026__row37996610"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14026__p57826595">Trigger condition</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14026__p53442657">Specifies the threshold for triggering the alarm.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14026__section17924324"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-14026__p33887971">If this alarm is reported, there are too many blocks on the DataNode. In this case, data writing into the HDFS may fail due to insufficient disk space.</p>
</div>
<div class="section" id="ALM-14026__section26588641115124"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-14026__ul19685536194610"><li id="ALM-14026__li11686936144611">The alarm threshold is improperly configured.</li></ul>
<ul id="ALM-14026__ul6823812161"><li id="ALM-14026__li10413180125610">Data skew occurs among DataNodes.</li><li id="ALM-14026__li48816389104827">The disk space configured for the HDFS cluster is insufficient.</li></ul>
</div>
<div class="section" id="ALM-14026__section290805983653"><h4 class="sectiontitle">Procedure</h4><p id="ALM-14026__p170994995717"><strong id="ALM-14026__b117706317645042">Change the threshold.</strong></p>
<ol id="ALM-14026__ol12639154321312"><li id="ALM-14026__li12638164313138"><span>On FusionInsight Manager, choose <strong id="ALM-14026__b2096016328819">Cluster</strong>, click the name of the desired cluster, and choose <strong id="ALM-14026__b19384553585">HDFS</strong>. Then choose <strong id="ALM-14026__b636825891">Configurations</strong> &gt; <strong id="ALM-14026__b933692716916">All Configurations</strong>. On the displayed page, find the <strong id="ALM-14026__b17534337792">GC_OPTS</strong> parameter under <strong id="ALM-14026__b9784441697">HDFS-&gt;DataNode</strong>.</span></li><li id="ALM-14026__li0197164317255"><span>Set the threshold for the number of DataNode blocks by changing the value of <strong id="ALM-14026__b1113312348106">Xmx</strong> in the <strong id="ALM-14026__b438113821017">GC_OPTS</strong> parameter. <strong id="ALM-14026__b69761940151012">Xmx</strong> specifies the maximum JVM heap memory, and each GB of memory supports a maximum of 500,000 DataNode blocks. Set the memory as required. Confirm that <strong id="ALM-14026__b326113751116">GC_PROFILE</strong> is set to <strong id="ALM-14026__b622612181112">custom</strong>, and then save the configuration.</span></li><li id="ALM-14026__li1263984391312"><span>Choose <strong id="ALM-14026__b18649823171115">Cluster</strong>, click the name of the desired cluster, and choose <strong id="ALM-14026__b228624214116">HDFS</strong> &gt; <strong id="ALM-14026__b2025944441119">Instance</strong>. 
Select the DataNode instance whose status is <strong id="ALM-14026__b11301778123">Expired</strong>, click <strong id="ALM-14026__b16949171515128">More</strong>, and select <strong id="ALM-14026__b1729052411210">Restart Instance</strong> to make the <strong id="ALM-14026__b20381929101212">GC_OPTS</strong> configuration take effect.</span></li><li id="ALM-14026__li18639843121316"><span>Check whether the alarm is cleared 5 minutes later.</span><p><ul id="ALM-14026__ul1563994341319"><li id="ALM-14026__li663954315133">If yes, no further action is required.</li><li id="ALM-14026__li93765525195">If no, go to <a href="#ALM-14026__li10750133111389">5</a>.</li></ul>
</p></li></ol>
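<p>Before pasting a new value into the <strong>GC_OPTS</strong> field, it can help to prepare the edited string offline so that only <strong>Xmx</strong> changes. The following is a minimal sketch; the helper name <strong>set_xmx</strong> is hypothetical, and the change itself is still made on FusionInsight Manager as described in the steps above.</p>

```python
import re


def set_xmx(gc_opts: str, xmx_gb: int) -> str:
    """Return gc_opts with its single -Xmx option replaced by -Xmx<xmx_gb>G.

    Hypothetical helper for illustration; all other options are left untouched.
    """
    new_opts, count = re.subn(r"-Xmx\d+[kKmMgG]?", f"-Xmx{xmx_gb}G", gc_opts)
    if count != 1:
        # Refuse to guess if -Xmx is missing or appears more than once.
        raise ValueError(f"expected exactly one -Xmx option, found {count}")
    return new_opts


# Shortened example value; the real GC_OPTS string contains more options.
print(set_xmx("-Xms2G -Xmx4G -XX:NewSize=128M -XX:MaxNewSize=256M", 6))
```

<p>The helper only rewrites the string. It does not check that the new <strong>Xmx</strong> value is at least as large as <strong>Xms</strong>, so verify that manually.</p>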
<p id="ALM-14026__p138318135386"><strong id="ALM-14026__b123356131314">Check whether associated alarms are reported.</strong></p>
<ol start="5" id="ALM-14026__ol1575115316381"><li id="ALM-14026__li10750133111389"><a name="ALM-14026__li10750133111389"></a><a name="li10750133111389"></a><span>On FusionInsight Manager, choose <strong id="ALM-14026__b6601103414131">O&amp;M</strong> &gt; <strong id="ALM-14026__b2060214345137">Alarm</strong> &gt; <strong id="ALM-14026__b3602153481316">Alarms</strong>, and check whether the <strong id="ALM-14026__b156031134131315">ALM-14002 DataNode Disk Usage Exceeds the Threshold</strong> alarm exists.</span><p><ul id="ALM-14026__ul207501731183818"><li id="ALM-14026__li10750193113817">If yes, go to <a href="#ALM-14026__li5750123115384">6</a>.</li><li id="ALM-14026__li375083117381">If no, go to <a href="#ALM-14026__li4795431151710">8</a>.</li></ul>
</p></li><li id="ALM-14026__li5750123115384"><a name="ALM-14026__li5750123115384"></a><a name="li5750123115384"></a><span>Handle the alarm by following the instructions in <strong id="ALM-14026__b135258200144">ALM-14002 DataNode Disk Usage Exceeds the Threshold</strong> and check whether the ALM-14002 alarm is cleared.</span><p><ul id="ALM-14026__ul4750173183819"><li id="ALM-14026__li675053118387">If yes, go to <a href="#ALM-14026__li10751231113815">7</a>.</li><li id="ALM-14026__li1875043183812">If no, go to <a href="#ALM-14026__li4795431151710">8</a>.</li></ul>
</p></li><li id="ALM-14026__li10751231113815"><a name="ALM-14026__li10751231113815"></a><a name="li10751231113815"></a><span>Wait 5 minutes, and then check whether this alarm is cleared.</span><p><ul id="ALM-14026__ul9751331153816"><li id="ALM-14026__li3750193173818">If yes, no further action is required.</li><li id="ALM-14026__li1375120312389">If no, go to <a href="#ALM-14026__li4795431151710">8</a>.</li></ul>
</p></li></ol>
<p id="ALM-14026__p13794120144032"><strong id="ALM-14026__b136411505151">Expand the DataNode capacity.</strong></p>
<ol start="8" id="ALM-14026__ol1795231111716"><li id="ALM-14026__li4795431151710"><a name="ALM-14026__li4795431151710"></a><a name="li4795431151710"></a><span>Expand the DataNode capacity.</span></li><li id="ALM-14026__li1179513171717"><span>Wait 5 minutes, and then check on FusionInsight Manager whether the alarm is cleared.</span><p><ul id="ALM-14026__ul0795631141711"><li id="ALM-14026__li279533110174">If yes, no further action is required.</li><li id="ALM-14026__li11795153181712">If no, go to <a href="#ALM-14026__li10844183481711">10</a>.</li></ul>
</p></li></ol>
<p id="ALM-14026__p27663650144049"><strong id="ALM-14026__b47646260144049">Collect the fault information.</strong></p>
<ol start="10" id="ALM-14026__ol584593418178"><li id="ALM-14026__li10844183481711"><a name="ALM-14026__li10844183481711"></a><a name="li10844183481711"></a><span>On FusionInsight Manager, choose <strong id="ALM-14026__b14137132331618">O&amp;M</strong>. In the navigation pane on the left, choose <strong id="ALM-14026__b1113792361614">Log</strong> &gt; <strong id="ALM-14026__b9138142317162">Download</strong>.</span></li><li id="ALM-14026__li1384513349173"><span>Expand the drop-down list next to the <strong id="ALM-14026__b4764217175">Service</strong> field. In the <strong id="ALM-14026__b983152115179">Services</strong> dialog box that is displayed, select <strong id="ALM-14026__b118362112176">HDFS</strong> for the target cluster.</span></li><li id="ALM-14026__li184523419174"><span>Click <span><img id="ALM-14026__image122464011945042" src="en-us_image_0263895589.png"></span> in the upper right corner, and set <strong id="ALM-14026__b187202489845042">Start Date</strong> and <strong id="ALM-14026__b164554226145042">End Date</strong> for log collection to 20 minutes before and 20 minutes after the alarm generation time, respectively. Then, click <strong id="ALM-14026__b69821886145042">Download</strong>.</span></li><li id="ALM-14026__li38457344172"><span>Contact <span id="ALM-14026__text126301214142412">O&amp;M personnel</span> and provide the collected logs.</span></li></ol>
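<p>The log collection window in the steps above can be computed from the alarm generation time as follows. This is a small sketch for convenience; the function name <strong>log_window</strong> and the sample timestamp are hypothetical.</p>

```python
from datetime import datetime, timedelta


def log_window(alarm_time: datetime, margin_minutes: int = 20):
    """Return (start, end) covering margin_minutes before and after alarm_time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Sample alarm generation time, chosen only for illustration.
start, end = log_window(datetime(2022, 12, 13, 12, 0))
print("Start Date:", start)
print("End Date:", end)
```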
</div>
<div class="section" id="ALM-14026__section169311343318"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-14026__p754913417333">This alarm is automatically cleared after the fault is rectified.</p>
</div>
<div class="section" id="ALM-14026__section12763521144142"><h4 class="sectiontitle">Related Information</h4><p id="ALM-14026__p18511159312"><strong id="ALM-14026__b135301313181816">Configuration rules of the DataNode JVM parameter.</strong></p>
<p id="ALM-14026__a37bf1ed1126f4e43a4894e2e5072886d">Default value of the DataNode JVM parameter <strong id="ALM-14026__b184152265645042">GC_OPTS</strong>: </p>
<p id="ALM-14026__p172211247121910">-Xms2G -Xmx4G -XX:NewSize=128M -XX:MaxNewSize=256M -XX:MetaspaceSize=128M -XX:MaxMetaspaceSize=128M -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=65 -XX:+PrintGCDetails -Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFE -Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFE -XX:-OmitStackTraceInFastThrow -XX:+PrintGCDateStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=1M -Djdk.tls.ephemeralDHKeySize=2048</p>
<p id="ALM-14026__ae45046f416574593bab3e1e7c29b634a">The average number of blocks stored in each DataNode instance in the cluster is calculated as follows: (Number of HDFS blocks x 3)/Number of DataNodes, where 3 is the default number of replicas of each block. If the average number of blocks changes, adjust <strong id="ALM-14026__b92201813345042">-Xms2G -Xmx4G -XX:NewSize=128M -XX:MaxNewSize=256M</strong> in the default value accordingly. The following table lists the reference values. </p>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14026__td322224489704ac6a8d18c21146d6686" frame="border" border="1" rules="all"><caption><b>Table 1 </b>DataNode JVM configuration</caption><thead align="left"><tr id="ALM-14026__r5fc0d03f19d54d1082a536369f32ab95"><th align="left" class="cellrowborder" valign="top" width="42.29%" id="mcps1.3.8.6.2.3.1.1"><p id="ALM-14026__a55dc1e45920f45a4b1fadbcc65df9c71">Average Number of Blocks in a DataNode Instance</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="57.709999999999994%" id="mcps1.3.8.6.2.3.1.2"><p id="ALM-14026__a6dca2c4d47794c5eb0adffb5d0324682">Reference Value</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14026__r00929d1b71584e8e8f30505a9e2872b3"><td class="cellrowborder" valign="top" width="42.29%" headers="mcps1.3.8.6.2.3.1.1 "><p id="ALM-14026__a416aecf87e5e458c996f243204ef02ff">2,000,000</p>
</td>
<td class="cellrowborder" valign="top" width="57.709999999999994%" headers="mcps1.3.8.6.2.3.1.2 "><p id="ALM-14026__a3d30bc343740499aaaee64ad55b721e9">-Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M</p>
</td>
</tr>
<tr id="ALM-14026__r0e83ba2dd9ed4f1695730d71f1457c24"><td class="cellrowborder" valign="top" width="42.29%" headers="mcps1.3.8.6.2.3.1.1 "><p id="ALM-14026__a67fbb2a910984d69b4b8d77a94c40c8d">5,000,000</p>
</td>
<td class="cellrowborder" valign="top" width="57.709999999999994%" headers="mcps1.3.8.6.2.3.1.2 "><p id="ALM-14026__a692858cc5d8746d1b0136271ef662183">-Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G</p>
</td>
</tr>
</tbody>
</table>
</div>
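<p>To decide which row of Table 1 applies, the average number of blocks per DataNode instance can be estimated with the formula given before the table. The sketch below assumes the factor 3 in that formula is the replica count and exposes it as a parameter; the function and parameter names are hypothetical.</p>

```python
def average_blocks_per_datanode(hdfs_blocks: int, datanodes: int, replicas: int = 3) -> float:
    """Average blocks per DataNode = HDFS blocks x replicas / number of DataNodes.

    Hypothetical helper; replicas defaults to 3 to match the formula in the text.
    """
    if datanodes <= 0:
        raise ValueError("datanodes must be a positive integer")
    return hdfs_blocks * replicas / datanodes


# Illustrative numbers only: 10,000,000 HDFS blocks spread over 15 DataNodes.
print(average_blocks_per_datanode(10_000_000, 15))
```

<p>With these illustrative numbers, the average is 2,000,000 blocks per DataNode instance, which corresponds to the first row of Table 1.</p>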
<p id="ALM-14026__p2063804341310"><strong id="ALM-14026__b118786255245042">Xmx</strong> specifies the maximum JVM heap memory, which determines the threshold for the number of DataNode blocks. Each GB of memory supports a maximum of 500,000 DataNode blocks. Set the memory as required.</p>
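<p>The 500,000-blocks-per-GB rule can be turned into a quick sizing check. This is a sketch under that rule only, with hypothetical function names. Note that the reference values in Table 1 are larger than the minimum this rule yields, so treat the result as a lower bound rather than a recommendation.</p>

```python
BLOCKS_PER_GB = 500_000  # from the rule above: each GB supports at most 500,000 blocks


def max_blocks(xmx_gb: int) -> int:
    """Block threshold implied by an Xmx value given in GB."""
    return xmx_gb * BLOCKS_PER_GB


def min_xmx_gb(block_count: int) -> int:
    """Smallest whole-GB Xmx whose implied threshold covers block_count."""
    # Ceiling division using integers only.
    return -(-block_count // BLOCKS_PER_GB)


print(max_blocks(4))          # threshold implied by the default -Xmx4G
print(min_xmx_gb(2_500_000))  # lower bound on Xmx (GB) for 2,500,000 blocks
```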
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>