doc-exports/docs/mrs/umn/ALM-14003.html
Yang, Tong 3b1f73dece MRS UMN 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-13 12:03:34 +00:00

<a name="ALM-14003"></a><a name="ALM-14003"></a>
<h1 class="topictitle1">ALM-14003 Number of Lost HDFS Blocks Exceeds the Threshold</h1>
<div id="body38198915"><div class="section" id="ALM-14003__section8243740"><h4 class="sectiontitle">Description</h4><p id="ALM-14003__p7104400">The system checks the number of lost HDFS blocks every 30 seconds and compares it with the threshold. The lost blocks indicator has a default threshold. This alarm is generated when the number of lost HDFS blocks exceeds the threshold.</p>
<p id="ALM-14003__p63939602">To change the threshold, choose <strong id="ALM-14003__en-us_topic_0070543638_b55978213">O&amp;M</strong> &gt; <strong id="ALM-14003__b18216526383">Alarm &gt;</strong> <strong id="ALM-14003__b122075817202">Thresholds</strong> &gt; <em id="ALM-14003__i10674629125819">Name of the desired cluster</em><strong id="ALM-14003__b76731229185816"> &gt;</strong> <strong id="ALM-14003__en-us_topic_0070543638_b5927966">HDFS</strong>.</p>
<p id="ALM-14003__p58579705104741">If <strong id="ALM-14003__b48421890111935">Trigger Count</strong> is <strong id="ALM-14003__b9678101312317">1</strong>, this alarm is cleared when the number of lost HDFS blocks is less than or equal to the threshold. If <strong id="ALM-14003__b048141313383">Trigger Count</strong> is greater than <strong id="ALM-14003__b166528222316">1</strong>, this alarm is cleared when the number of lost HDFS blocks is less than or equal to 90% of the threshold.</p>
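<p>The clearing rule described above can be sketched in Python as follows. This is a minimal illustration and not FusionInsight Manager code: the function names <strong>is_cleared</strong> and <strong>next_state</strong> are hypothetical, and alarm generation is reduced to a single comparison that does not model how <strong>Trigger Count</strong> affects generation.</p>
<pre class="screen">
def is_cleared(lost_blocks, threshold, trigger_count):
    """Return True if an active alarm would be cleared at this check."""
    if trigger_count == 1:
        # Cleared when the lost-block count is at or below the threshold.
        return threshold >= lost_blocks
    # Trigger Count greater than 1: cleared at or below 90% of the threshold.
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return 9 * threshold >= 10 * lost_blocks


def next_state(alarm_active, lost_blocks, threshold, trigger_count):
    """Evaluate one periodic check and return the new alarm state."""
    if alarm_active:
        return not is_cleared(lost_blocks, threshold, trigger_count)
    # Assumption: generation is simplified to one comparison per check.
    return lost_blocks > threshold
</pre>
<p>With a threshold of 100 and a <strong>Trigger Count</strong> greater than 1, an active alarm in this sketch stays active at 95 lost blocks and clears only once the count falls to 90 or lower.</p>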
</div>
<div class="section" id="ALM-14003__section7084804"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14003__table38418539" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14003__row53418480"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-14003__p31929608">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-14003__p36161432">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-14003__p43394889">Auto Clear</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14003__row25325122"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-14003__p38069036">14003</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-14003__p63693103">Major</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-14003__p58867698">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14003__section63763242"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14003__table3554205" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14003__row22865724"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-14003__p40184376">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-14003__p33709057">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14003__row5538101328"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14003__p156438591896">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14003__p187931338134115">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14003__row46079102"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14003__p65062640">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14003__p66669494">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14003__row63154538"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14003__p35626567">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14003__p26802723">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14003__row39897916"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14003__p51620924">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14003__p45657288">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14003__row8262415"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14003__p65275865">NameServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14003__p52853732">Specifies the NameService for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14003__row5921545"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14003__p9883160">Trigger Condition</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14003__p62338491">Specifies the threshold for triggering the alarm.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14003__section36998271"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-14003__p16253019">Data stored in HDFS is lost. HDFS may enter safe mode, in which it cannot provide write services. Lost block data cannot be restored.</p>
</div>
<div class="section" id="ALM-14003__section64548988"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-14003__ul41426143"><li id="ALM-14003__li37290970">The DataNode instance is abnormal.</li><li id="ALM-14003__li74416">Data is deleted.</li></ul>
</div>
<div class="section" id="ALM-14003__section44069987"><h4 class="sectiontitle">Procedure</h4><p class="tableheading" id="ALM-14003__p6027701"><strong id="ALM-14003__b23450184163346">Check the DataNode instance.</strong></p>
<ol id="ALM-14003__ol359796216321"><li id="ALM-14003__li49283749163156"><span>On FusionInsight Manager, choose <strong id="ALM-14003__b19149021309">Cluster</strong> &gt; <em id="ALM-14003__i51512212305">Name of the desired cluster</em> &gt; <strong id="ALM-14003__b1715015215309">Services</strong> &gt; <strong id="ALM-14003__b63585940163350">HDFS</strong> &gt; <strong id="ALM-14003__b35402554163350">Instance</strong>.</span></li><li id="ALM-14003__li23401293163156"><span>Check whether the <strong id="ALM-14003__b7757161910547">Running </strong><strong id="ALM-14003__b13759619105413">Status</strong> of all DataNode instances is <strong id="ALM-14003__b49034625163350">Normal</strong>.</span><p><ul class="subitemlist" id="ALM-14003__ul47339386163156"><li id="ALM-14003__li32560724163156">If yes, go to <a href="#ALM-14003__li19356361163156">11</a>.</li><li id="ALM-14003__li20173012163156">If no, go to <a href="#ALM-14003__li6471267163156">3</a>.</li></ul>
</p></li><li id="ALM-14003__li6471267163156"><a name="ALM-14003__li6471267163156"></a><a name="li6471267163156"></a><span>Restart the DataNode instance and check whether the DataNode instance restarts successfully.</span><p><ul class="subitemlist" id="ALM-14003__ul51447170163156"><li id="ALM-14003__li16456550163156">If yes, go to <a href="#ALM-14003__li177391556152310">4</a>.</li><li id="ALM-14003__li57912135163156">If no, go to <a href="#ALM-14003__li58241411163156">5</a>.</li></ul>
</p></li><li class="subitemlist" id="ALM-14003__li177391556152310"><a name="ALM-14003__li177391556152310"></a><a name="li177391556152310"></a><span>Choose <strong id="ALM-14003__b559311995517">O&amp;M</strong> &gt; <strong id="ALM-14003__b145995917551">Alarm </strong>&gt; <strong id="ALM-14003__b1759916912553">Alarms </strong>and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-14003__ul19166544162414"><li id="ALM-14003__li17166244142411">If yes, no further action is required.</li><li id="ALM-14003__li11166244102419">If no, go to <a href="#ALM-14003__li58241411163156">5</a>.</li></ul>
</p></li></ol>
<p class="tableheading" id="ALM-14003__p60371353163156"><strong id="ALM-14003__b217792316344">Delete the damaged file.</strong></p>
<ol start="5" id="ALM-14003__ol12872311163220"><li id="ALM-14003__li58241411163156"><a name="ALM-14003__li58241411163156"></a><a name="li58241411163156"></a><span>On FusionInsight Manager, choose <strong id="ALM-14003__b155495162307">Cluster</strong> &gt; <em id="ALM-14003__i1055111619302">Name of the desired cluster</em> &gt; <strong id="ALM-14003__b5550111603013">Services</strong> &gt; <strong id="ALM-14003__b32382621163350">HDFS</strong> &gt; <strong id="ALM-14003__b23008138163350">NameNode(Active)</strong>. On the WebUI page of the HDFS, view the information about lost blocks.</span><p><div class="note" id="ALM-14003__note537419483164"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><ul id="ALM-14003__ul84816374214"><li id="ALM-14003__li1948114373218">If a block is lost, a line in red is displayed on the WebUI.</li><li id="ALM-14003__li8646239162116">By default, the <strong id="ALM-14003__b4780151814294">admin</strong> user does not have the permissions to manage other components. If the page cannot be opened or the displayed content is incomplete when you access the native UI of a component due to insufficient permissions, you can manually create a user with the permissions to manage that component.</li></ul>
</div></div>
</p></li><li id="ALM-14003__li58058213163156"><span>The user checks whether the file containing the lost data block is useful.</span><p><div class="note" id="ALM-14003__note2734490411313"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="ALM-14003__p64673579105059">Files generated in directories <strong id="ALM-14003__b48679603104139">/mr-history</strong>, <strong id="ALM-14003__b54268200104145">/tmp/hadoop-yarn</strong>, and <strong id="ALM-14003__b65952513104150">/tmp/logs</strong> during MapReduce task execution are unnecessary.</p>
</div></div>
<ul class="subitemlist" id="ALM-14003__ul36277074163156"><li id="ALM-14003__li19933870163156">If yes, go to <a href="#ALM-14003__li7098948163156">7</a>.</li><li id="ALM-14003__li4030786163156">If no, go to <a href="#ALM-14003__li2696171714538">8</a>.</li></ul>
</p></li><li id="ALM-14003__li7098948163156"><a name="ALM-14003__li7098948163156"></a><a name="li7098948163156"></a><span>The user checks whether the file containing the lost data block is backed up.</span><p><ul class="subitemlist" id="ALM-14003__ul23158393163156"><li id="ALM-14003__li5094850163156">If yes, go to <a href="#ALM-14003__li2696171714538">8</a>.</li><li id="ALM-14003__li10029695163156">If no, go to <a href="#ALM-14003__li19356361163156">11</a>.</li></ul>
</p></li><li id="ALM-14003__li2696171714538"><a name="ALM-14003__li2696171714538"></a><a name="li2696171714538"></a><span>Log in to the HDFS client as user <strong id="ALM-14003__b5509195720511">root</strong>. The user password is defined by the user before the installation. Contact the MRS cluster administrator to obtain the password. Run the following commands:</span><p><ul id="ALM-14003__ul136531649134219"><li id="ALM-14003__li1665316493421">Security mode:<p id="ALM-14003__p15448113195317"><a name="ALM-14003__li1665316493421"></a><a name="li1665316493421"></a><strong id="ALM-14003__b5899001435951">cd </strong><em id="ALM-14003__i675970895951">Client installation directory</em></p>
<p id="ALM-14003__p242017192533"><strong id="ALM-14003__b1749214015534">source bigdata_env</strong></p>
<p id="ALM-14003__p1383154718581"><strong id="ALM-14003__b272414499589">kinit hdfs</strong></p>
</li><li id="ALM-14003__li99822554425">Normal mode:<p id="ALM-14003__p19874185710584"><a name="ALM-14003__li99822554425"></a><a name="li99822554425"></a><strong id="ALM-14003__b187411310205919">su - omm</strong></p>
<p id="ALM-14003__p1481214635916"><strong id="ALM-14003__b7688421085951">cd </strong><em id="ALM-14003__i20930854935951">Client installation directory</em></p>
<p id="ALM-14003__p188128685916"><strong id="ALM-14003__b208122655912">source bigdata_env</strong></p>
</li></ul>
</p></li><li id="ALM-14003__li15776911191511"><span>On the node client, run <strong id="ALM-14003__b2278181512514">hdfs fsck / -delete</strong> to delete the files that contain lost blocks. If a deleted file is still needed, write the file again to restore the data.</span><p><div class="note" id="ALM-14003__note744895410814"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="ALM-14003__p7449185412811"><span id="ALM-14003__text5875253174515">Deleting a file or folder is a high-risk operation. Ensure that the file or folder is no longer required before performing this operation.</span></p>
</div></div>
</p></li><li id="ALM-14003__li9607247163156"><span>Choose <strong id="ALM-14003__b18752161915515">O&amp;M</strong> &gt; <strong id="ALM-14003__b17758619125519">Alarm </strong>&gt; <strong id="ALM-14003__b1175821920556">Alarms </strong>and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-14003__ul13374679163156"><li id="ALM-14003__li7751055163156">If yes, no further action is required.</li><li id="ALM-14003__li23855696163156">If no, go to <a href="#ALM-14003__li19356361163156">11</a>.</li></ul>
</p></li></ol>
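<p>Because <strong>hdfs fsck / -delete</strong> is a high-risk operation, it may help to list the affected files before deleting anything. The Python sketch below only builds the command lines and checks a path against the directories that the note in this procedure describes as unnecessary. The helper names are hypothetical. <strong>-list-corruptfileblocks</strong> is a standard <strong>hdfs fsck</strong> option, but its use here is a suggestion and not part of the documented procedure.</p>
<pre class="screen">
import posixpath

# Directories whose MapReduce-generated files the procedure calls unnecessary.
DISPOSABLE_DIRS = ("/mr-history", "/tmp/hadoop-yarn", "/tmp/logs")


def is_disposable(hdfs_path):
    """Return True if the path is one of, or lies under, the listed directories."""
    path = posixpath.normpath(hdfs_path)
    return any(path == d or path.startswith(d + "/") for d in DISPOSABLE_DIRS)


def fsck_list_command(hdfs_path="/"):
    """Build a read-only command that reports files with corrupt or missing blocks."""
    return ["hdfs", "fsck", hdfs_path, "-list-corruptfileblocks"]


def fsck_delete_command(hdfs_path="/"):
    """Build the deletion command from the procedure, optionally scoped to a path."""
    return ["hdfs", "fsck", hdfs_path, "-delete"]
</pre>
<p>The returned lists can be passed to <strong>subprocess.run</strong> from a shell session in which <strong>bigdata_env</strong> has already been sourced. Review the listing output before running the deletion command.</p>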
<p class="tableheading" id="ALM-14003__p53263255163156"><strong id="ALM-14003__b22972001966">Collect the fault information.</strong></p>
<ol start="11" id="ALM-14003__ol60451969163226"><li id="ALM-14003__li19356361163156"><a name="ALM-14003__li19356361163156"></a><a name="li19356361163156"></a><span>On FusionInsight Manager, choose <strong id="ALM-14003__b23611331494">O&amp;M</strong> &gt; <strong id="ALM-14003__b7363173312915">Log </strong>&gt;<strong id="ALM-14003__b336514333916"> Download</strong>.</span></li><li id="ALM-14003__li39989527163156"><span>Expand the drop-down list next to the <strong id="ALM-14003__b77861235152010">Service</strong> field. In the <strong id="ALM-14003__b7793535152019">Services</strong> dialog box that is displayed, select <strong id="ALM-14003__b17794935112012">HDFS</strong> for the target cluster.</span></li><li id="ALM-14003__li24361424163156"><span>Click <span><img id="ALM-14003__image1945644173117" src="en-us_image_0269383960.png"></span> in the upper right corner, and set <strong id="ALM-14003__b6456941173117">Start Date</strong> and <strong id="ALM-14003__b11456154113318">End Date</strong> for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click <strong id="ALM-14003__b13456164113319">Download</strong>.</span></li><li id="ALM-14003__li27118356163156"><span>Contact <span id="ALM-14003__text35101124194217">O&amp;M personnel</span> and provide the collected logs.</span></li></ol>
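<p>The log collection window in the procedure above spans 10 minutes on either side of the alarm generation time. As a small illustration, the values for <strong>Start Date</strong> and <strong>End Date</strong> can be computed as follows. The function name <strong>log_window</strong> and the sample timestamp are hypothetical.</p>
<pre class="screen">
from datetime import datetime, timedelta


def log_window(alarm_time, margin_minutes=10):
    """Return (start, end) covering the margin before and after the alarm time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Hypothetical alarm generation time, used for illustration only.
start, end = log_window(datetime(2024, 1, 1, 12, 0, 0))
print(start.isoformat(), end.isoformat())
</pre>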
</div>
<div class="section" id="ALM-14003__section169311343318"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-14003__p754913417333">This alarm is automatically cleared after the fault is rectified.</p>
</div>
<div class="section" id="ALM-14003__section61085563"><h4 class="sectiontitle">Related Information</h4><p id="ALM-14003__p11393601">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>