doc-exports/docs/mrs/umn/ALM-14027.html
Yang, Tong 3b1f73dece MRS UMN 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-13 12:03:34 +00:00


<a name="ALM-14027"></a>
<h1 class="topictitle1">ALM-14027 DataNode Disk Fault</h1>
<div id="body1541645491805"><div class="section" id="ALM-14027__section14753556"><h4 class="sectiontitle">Description</h4><p id="ALM-14027__p42796248">The system checks the disk status on DataNodes every 60 seconds. This alarm is generated when a disk is faulty.</p>
<p id="ALM-14027__p9884153073515">After all faulty disks on the DataNode have recovered, you must manually clear the alarm and restart the DataNode.</p>
</div>
<div class="section" id="ALM-14027__section65673142"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14027__table2697805" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14027__row10450762"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-14027__p41205356">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-14027__p49299555">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-14027__p33841047">Auto Clear</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14027__row56770287"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-14027__p34990548">14027</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-14027__p15662125">Major</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-14027__p60672611">No</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14027__section54187374"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-14027__table15534429" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-14027__row48561591"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-14027__p41174828">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-14027__p46826794">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-14027__row7847914269"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14027__p156438591896">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14027__p187931338134115">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14027__row34873944"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14027__p65062640">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14027__p33829733">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14027__row36032144"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14027__p35626567">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14027__p49481274">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14027__row42678285"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14027__p51620924">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14027__p34048007">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-14027__row37996610"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-14027__p9643152813466">Failed Volumes</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-14027__p21777528466">Specifies the list of faulty disks.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-14027__section17924324"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-14027__p33887971">This alarm indicates that one or more disk partitions on the DataNode are abnormal. Files that have been written may be lost.</p>
</div>
<div class="section" id="ALM-14027__section26588641115124"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-14027__ul6823812161"><li id="ALM-14027__li55766936115124">The hard disk is faulty.</li><li id="ALM-14027__li48816389104827">The disk permissions are configured improperly.</li></ul>
</div>
<div class="section" id="ALM-14027__section290805983653"><h4 class="sectiontitle">Procedure</h4><p id="ALM-14027__p158750818188"><strong id="ALM-14027__b74301931238">Check whether a disk alarm is generated.</strong></p>
<ol id="ALM-14027__ol9670103102713"><li id="ALM-14027__li14669113152720"><span>On FusionInsight Manager, choose <strong id="ALM-14027__b39091837833">O&amp;M</strong> &gt; <strong id="ALM-14027__b1362841839">Alarm</strong> &gt; <strong id="ALM-14027__b21741946137">Alarms</strong> and check whether <strong id="ALM-14027__b9759151347">ALM-12014 Partition Lost</strong> or <strong id="ALM-14027__b136158151742">ALM-12033 Slow Disk Fault</strong> exists.</span><p><ul id="ALM-14027__ul156692319270"><li id="ALM-14027__li7669173152713">If yes, go to <a href="#ALM-14027__li106705312711">2</a>.</li><li id="ALM-14027__li146692031273">If no, go to <a href="#ALM-14027__li76681531273">4</a>.</li></ul>
</p></li><li id="ALM-14027__li106705312711"><a name="ALM-14027__li106705312711"></a><a name="li106705312711"></a><span>Rectify the fault by referring to the handling procedure of <strong id="ALM-14027__b1799144716413">ALM-12014 Partition Lost</strong> or <strong id="ALM-14027__b34578581241">ALM-12033 Slow Disk Fault</strong>. Then, check whether the alarm is cleared.</span><p><ul id="ALM-14027__ul10670437270"><li id="ALM-14027__li9669203102714">If yes, go to <a href="#ALM-14027__li1067073192717">3</a>.</li><li id="ALM-14027__li176703314272">If no, go to <a href="#ALM-14027__li76681531273">4</a>.</li></ul>
</p></li><li id="ALM-14027__li1067073192717"><a name="ALM-14027__li1067073192717"></a><a name="li1067073192717"></a><span>Wait 5 minutes and check whether the alarm is cleared.</span><p><ul id="ALM-14027__ul56703317274"><li id="ALM-14027__li146701838278">If yes, no further action is required.</li><li id="ALM-14027__li367011318279">If no, go to <a href="#ALM-14027__li76681531273">4</a>.</li></ul>
</p></li></ol>
<p id="ALM-14027__p16407172385916"><strong id="ALM-14027__b179019115616">Modify disk permissions.</strong></p>
<ol start="4" id="ALM-14027__ol866973192715"><li id="ALM-14027__li76681531273"><a name="ALM-14027__li76681531273"></a><a name="li76681531273"></a><span>Choose <strong id="ALM-14027__b10519192616">O&amp;M</strong> &gt; <strong id="ALM-14027__b10925820760">Alarm</strong> &gt; <strong id="ALM-14027__b196161221611">Alarms</strong> and view <strong id="ALM-14027__b1579843417611">Location</strong> and <strong id="ALM-14027__b235713401368">Additional Information</strong> of the alarm to obtain the location of the faulty disk.</span></li><li id="ALM-14027__li186681733277"><span>Log in to the node for which the alarm is generated as user <strong id="ALM-14027__b166611100917">root</strong>. <span id="ALM-14027__text9421319626"></span> Go to the directory where the faulty disk is located, and run the <strong id="ALM-14027__b197313514368">ll</strong> command to check whether the permission of the faulty disk directory is <strong id="ALM-14027__b1256719276117">711</strong> and whether its owner is <strong id="ALM-14027__b3428133181118">omm</strong>.</span><p><ul id="ALM-14027__ul186687322713"><li id="ALM-14027__li13668173102712">If yes, go to <a href="#ALM-14027__li206502049133310">8</a>.</li><li id="ALM-14027__li12668163152711">If no, go to <a href="#ALM-14027__li188961329122819">6</a>.</li></ul>
</p></li><li id="ALM-14027__li188961329122819"><a name="ALM-14027__li188961329122819"></a><a name="li188961329122819"></a><span>Modify the permission of the faulty disk. For example, if the faulty disk is <strong id="ALM-14027__b117069387128">data1</strong>, run the following commands:</span><p><p id="ALM-14027__p91663582917"><strong id="ALM-14027__b2053685882914">chown omm:wheel data1</strong></p>
<p id="ALM-14027__p185691031112815"><strong id="ALM-14027__b95371858152913">chmod 711 data1</strong></p>
</p></li><li id="ALM-14027__li3669433276"><span>In the alarm list on Manager, click <strong id="ALM-14027__b1905171561315">Clear</strong> in the <strong id="ALM-14027__b10101902976435">Operation</strong> column of the alarm to manually clear the alarm. Choose <strong id="ALM-14027__b18844123171310">Cluster </strong>&gt; <strong id="ALM-14027__b17773163471316">Services </strong>&gt; <strong id="ALM-14027__b15786153671314">HDFS </strong>&gt; <strong id="ALM-14027__b1911443831310">Instance</strong>, select the DataNode, choose <strong id="ALM-14027__b9891544101310">More </strong>&gt; <strong id="ALM-14027__b19907114714137">Restart Instance</strong>, wait for 5 minutes, and check whether a new alarm is reported.</span><p><ul id="ALM-14027__ul466917317274"><li id="ALM-14027__li1866816382710">If no, no further action is required.</li><li id="ALM-14027__li1866911342720">If yes, go to <a href="#ALM-14027__li206502049133310">8</a>.</li></ul>
</p></li></ol>
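<p>The permission check in the preceding steps can be sketched as a small shell helper. This is a minimal sketch under stated assumptions, not part of the product: the function name <strong>check_volume_dir</strong> is hypothetical, GNU <strong>stat</strong> is assumed to be available on the node, and <strong>data1</strong> is the example disk directory used above.</p>

```shell
#!/bin/bash
# Sketch: report whether a disk directory has the expected mode and owner.
# check_volume_dir is a hypothetical helper name; GNU coreutils stat is assumed.

check_volume_dir() {
    local dir="$1"                   # directory of the disk, for example data1
    local expected_owner="${2:-omm}" # owner required by this procedure
    local expected_mode="${3:-711}"  # permission required by this procedure
    local mode owner

    if [ ! -d "$dir" ]; then
        echo "MISSING"
        return 2
    fi

    mode=$(stat -c '%a' "$dir")      # octal permission bits, for example 711
    owner=$(stat -c '%U' "$dir")     # owning user name

    if [ "$mode" = "$expected_mode" ] && [ "$owner" = "$expected_owner" ]; then
        echo "OK"
        return 0
    fi

    echo "MISMATCH mode=$mode owner=$owner"
    return 1
}
```

<p>For example, running <strong>check_volume_dir data1</strong> in the directory where the faulty disk is located prints <strong>OK</strong> when the permission is 711 and the owner is omm. A <strong>MISMATCH</strong> line suggests running the <strong>chown</strong> and <strong>chmod</strong> commands from the preceding step.</p>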
<p id="ALM-14027__p27663650144049"><strong id="ALM-14027__b47646260144049">Collect the fault information.</strong></p>
<ol start="8" id="ALM-14027__ol665014911339"><li id="ALM-14027__li206502049133310"><a name="ALM-14027__li206502049133310"></a><a name="li206502049133310"></a><span>On FusionInsight Manager, choose <strong id="ALM-14027__b21531434283">O&amp;M</strong>. In the navigation pane on the left, choose <strong id="ALM-14027__b31559314281">Log</strong> &gt; <strong id="ALM-14027__b61554362810">Download</strong>.</span></li><li id="ALM-14027__li14650749103316"><span>Expand the <strong id="ALM-14027__b77326208140">Service</strong> drop-down list, and select <strong id="ALM-14027__b187328209149">HDFS</strong> and <strong id="ALM-14027__b162421346101420">OMS</strong> for the target cluster.</span></li><li id="ALM-14027__li9650249183317"><span>Click <span><img id="ALM-14027__image104601319175315" src="en-us_image_0263895589.png"></span> in the upper right corner, and set <strong id="ALM-14027__b5339985386435">Start Date</strong> and <strong id="ALM-14027__b618036566435">End Date</strong> for log collection to 20 minutes before and 20 minutes after the alarm generation time, respectively. Then, click <strong id="ALM-14027__b17975714556435">Download</strong>.</span></li><li id="ALM-14027__li17650149123318"><span>Contact <span id="ALM-14027__text118351845104211">O&amp;M personnel</span> and provide the collected logs.</span></li></ol>
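<p>The start and end times for log collection can be worked out with a short shell sketch. It assumes GNU <strong>date</strong>; the function name <strong>log_window</strong> and the sample timestamp are hypothetical. The input is treated as UTC only to keep the arithmetic free of daylight saving shifts, so the result reads as the same wall-clock time minus and plus 20 minutes.</p>

```shell
#!/bin/bash
# Sketch: print the log collection window around an alarm generation time.
# log_window is a hypothetical helper name; GNU date is assumed.

log_window() {
    local alarm_time="$1"        # for example "2024-01-15 10:00:00"
    local margin_min="${2:-20}"  # minutes before and after the alarm
    local epoch

    # Convert the alarm time to seconds since the epoch (interpreted as UTC).
    epoch=$(date -u -d "$alarm_time" +%s) || return 1

    # First line: Start Date. Second line: End Date.
    date -u -d "@$((epoch - margin_min * 60))" '+%Y-%m-%d %H:%M:%S'
    date -u -d "@$((epoch + margin_min * 60))" '+%Y-%m-%d %H:%M:%S'
}

# Sample alarm time (hypothetical):
log_window "2024-01-15 10:00:00"
# prints:
# 2024-01-15 09:40:00
# 2024-01-15 10:20:00
```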
</div>
<div class="section" id="ALM-14027__section169311343318"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-14027__p754913417333">The system does not automatically clear this alarm after the fault is rectified. You must clear it manually.</p>
</div>
<div class="section" id="ALM-14027__section12763521144142"><h4 class="sectiontitle">Related Information</h4><p id="ALM-14027__p56138825202324">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>