doc-exports/docs/mrs/umn/ALM-12101.html
Yang, Tong 3b1f73dece MRS UMN 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-13 12:03:34 +00:00


<a name="ALM-12101"></a>
<h1 class="topictitle1">ALM-12101 AZ Unhealthy</h1>
<div id="body1584325551595"><div class="section" id="ALM-12101__section1506154710501"><h4 class="sectiontitle">Description</h4><p id="ALM-12101__p16655114712506">After the AZ DR function is enabled, the system checks the AZ health status every 5 minutes. This alarm is generated when the system detects that the AZ is subhealthy or unhealthy. This alarm is cleared when the AZ becomes healthy.</p>
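<p>The generate-and-clear behavior described above can be pictured with a minimal Python sketch. It is illustrative only and does not reflect how the system implements the check: the function name <strong>alarm_events</strong> and the status strings are assumptions made for this example, and the sketch assumes that the alarm is not generated a second time when an AZ moves between subhealthy and unhealthy.</p>
<pre>
```python
from typing import Iterable, List, Tuple

CHECK_INTERVAL_MINUTES = 5  # check interval stated in the description


def alarm_events(statuses: Iterable[str]) -> List[Tuple[int, str]]:
    """Map successive health check results to alarm transitions.

    Each item in `statuses` is the result of one periodic check:
    'Healthy', 'Subhealthy', or 'Unhealthy' (names assumed for this sketch).
    Returns (minute, event) pairs, where event is 'generated' or 'cleared'.
    """
    events = []
    active = False
    for index, status in enumerate(statuses):
        minute = index * CHECK_INTERVAL_MINUTES
        degraded = status in ("Subhealthy", "Unhealthy")
        if degraded and not active:
            # First check that finds the AZ subhealthy or unhealthy.
            events.append((minute, "generated"))
            active = True
        elif not degraded and active:
            # The AZ is healthy again, so the alarm is cleared.
            events.append((minute, "cleared"))
            active = False
    return events
```
</pre>
<p>For example, the check results Healthy, Subhealthy, Unhealthy, Healthy produce one alarm at the second check and one clearance at the fourth.</p>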
</div>
<div class="section" id="ALM-12101__section0507144755019"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-12101__table1508174795018" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-12101__row4655184705015"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-12101__p186551347165011">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-12101__p146552473502">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-12101__p12655124775017">Auto Clear</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-12101__row15655114720507"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-12101__p1656154715507">12101</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-12101__p186561477509">Major</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-12101__p0656184715010">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-12101__section252384735016"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-12101__table14524174795012" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-12101__row665624745014"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-12101__p26561347105010">Parameter</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-12101__p1165624713504">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-12101__row26562471503"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-12101__p13656247195020">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-12101__p1165611479507">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-12101__row176561147145014"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-12101__p0656547105011">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-12101__p1365624712502">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-12101__row06562476508"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-12101__p365617477504">AZName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-12101__p265644795016">Specifies the AZ for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-12101__row292217131438"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-12101__en-us_topic_0070543632_p56851411">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-12101__en-us_topic_0070543632_p41561572">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-12101__section45337477501"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-12101__p1282529125011">The health status of an AZ is determined by whether the health status of storage resources (HDFS), computing resources (Yarn), and key roles in the AZ exceeds the configured threshold.</p>
<p id="ALM-12101__p665611474507">An AZ is subhealthy when:</p>
<ul id="ALM-12101__ul36562477505"><li id="ALM-12101__li10656104705011">The computing resources (Yarn) are unhealthy, but the storage resources (HDFS) are healthy. Tasks cannot be submitted to the local AZ, but data can still be read and written in the local AZ.</li><li id="ALM-12101__li14656154765016">The computing resources (Yarn) are healthy, but some storage resources (HDFS) are unhealthy. Tasks can be submitted to the local AZ, but only some data can be read and written there. Whether a task can access its data depends on the data locality detected by Spark/Hive scheduling.</li></ul>
<p id="ALM-12101__p1165684775015">An AZ is unhealthy when:</p>
<ul id="ALM-12101__ul76561547105015"><li id="ALM-12101__li765684745012">The computing resources (Yarn) are healthy, but the storage resources (HDFS) are unhealthy. Although tasks can be submitted to the local AZ, data cannot be read or written in the local AZ. As a result, the tasks submitted to the local AZ are ineffective.</li><li id="ALM-12101__li265619479502">Both the computing resources (Yarn) and the storage resources (HDFS) are unhealthy. Tasks cannot be submitted to the local AZ, and data cannot be read or written in the local AZ.</li><li id="ALM-12101__li33861213103719">The health status of key roles other than Yarn and HDFS is lower than the configured threshold.</li></ul>
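<p>The rules above can be summarized in a short Python sketch. It is an illustration under stated assumptions, not the FusionInsight Manager implementation: the <strong>Hdfs</strong> enumeration, the <strong>classify_az</strong> function, and the boolean inputs are hypothetical names introduced here. The combination of unhealthy Yarn with partially unhealthy HDFS is not described in this section, so the sketch rejects it instead of guessing.</p>
<pre>
```python
from enum import Enum


class Hdfs(Enum):
    """Storage state of one AZ (names assumed for this sketch)."""
    HEALTHY = "healthy"
    PARTIAL = "some storage resources unhealthy"
    UNHEALTHY = "unhealthy"


def classify_az(yarn_healthy: bool, hdfs: Hdfs, key_roles_ok: bool) -> str:
    """Return 'Healthy', 'Subhealthy', or 'Unhealthy' for one AZ."""
    # Key roles other than Yarn and HDFS below the threshold: unhealthy.
    if not key_roles_ok:
        return "Unhealthy"
    # HDFS unhealthy: the AZ is unhealthy whether or not Yarn is healthy.
    if hdfs is Hdfs.UNHEALTHY:
        return "Unhealthy"
    if yarn_healthy:
        # Yarn healthy: partially unhealthy HDFS makes the AZ subhealthy.
        return "Healthy" if hdfs is Hdfs.HEALTHY else "Subhealthy"
    # Yarn is unhealthy from here on.
    if hdfs is Hdfs.HEALTHY:
        return "Subhealthy"
    # Yarn unhealthy with partially unhealthy HDFS is not documented here.
    raise ValueError("combination not covered by the documented rules")
```
</pre>
<p>For example, <strong>classify_az(False, Hdfs.HEALTHY, True)</strong> returns Subhealthy, which matches the first subhealthy case listed above.</p>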
</div>
<div class="section" id="ALM-12101__section115431347145017"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-12101__ul26561947155013"><li id="ALM-12101__li17656164716502">The computing resources (Yarn) are unhealthy.</li><li id="ALM-12101__li6656124712508">The storage resources (HDFS) are unhealthy.</li><li id="ALM-12101__li19656154718506">Some storage resources (HDFS) are unhealthy.</li><li id="ALM-12101__li16181310195315">Key roles other than Yarn and HDFS are unhealthy.</li></ul>
</div>
<div class="section" id="ALM-12101__section15475478505"><h4 class="sectiontitle">Procedure</h4><p id="ALM-12101__p19656164745011"><strong id="ALM-12101__b1265612478508">Disable the DR drill.</strong></p>
<ol id="ALM-12101__ol11761181319520"><li id="ALM-12101__li1476141317524"><span>On FusionInsight Manager, choose <strong id="ALM-12101__b1576012133528">Cluster</strong> &gt; <em id="ALM-12101__i11315144243811">Name of the desired cluster</em> &gt; <strong id="ALM-12101__b1043416444389">Cross-AZ HA</strong>. The Cross-AZ HA page is displayed.</span></li><li id="ALM-12101__li37611513165219"><span>In the AZ DR list, locate the AZ whose health status is <strong id="ALM-12101__b1576113139526">Unhealthy</strong> and check whether <strong id="ALM-12101__b6761191317528">Perform DR Drill</strong> in its <strong id="ALM-12101__b276121375219">Operation</strong> column is grayed out.</span><p><ul id="ALM-12101__ul13761713155215"><li id="ALM-12101__li187619139525">If yes, go to <a href="#ALM-12101__li57606134528">4</a>.</li><li id="ALM-12101__li676181395212">If no, go to <a href="#ALM-12101__li1076171313521">3</a>.</li></ul>
</p></li><li id="ALM-12101__li1076171313521"><a name="ALM-12101__li1076171313521"></a><a name="li1076171313521"></a><span>Click <strong id="ALM-12101__b19761141305218">Restore</strong> in the <strong id="ALM-12101__b176141365212">Operation</strong> column of the target AZ. Wait 2 minutes and refresh the page to view the health status of the AZ. Check whether the health status is normal.</span><p><ul id="ALM-12101__ul15761171315522"><li id="ALM-12101__li17761181395216">If yes, no further action is required.</li><li id="ALM-12101__li676141305217">If no, go to <a href="#ALM-12101__li57606134528">4</a>.</li></ul>
</p></li></ol>
<p id="ALM-12101__p21349383518"><strong id="ALM-12101__b1165714710509">Collect the fault information.</strong></p>
<ol start="4" id="ALM-12101__ol19760201318528"><li id="ALM-12101__li57606134528"><a name="ALM-12101__li57606134528"></a><a name="li57606134528"></a><span>Log in to the active management node as user <strong id="ALM-12101__b14760191320523">root</strong>. <span id="ALM-12101__text54736122102"></span><span id="ALM-12101__text108402160108"></span></span></li><li id="ALM-12101__li2380124144911"><span>View the logs of the unhealthy services.</span><p><ul id="ALM-12101__ul1414161445612"><li id="ALM-12101__li490418475582">The HDFS AZ state log is <strong id="ALM-12101__b64085511588">/var/log/Bigdata/hdfs/nn/hdfs-az-state.log</strong>.</li><li id="ALM-12101__li12414114125613">The Yarn AZ state log is <strong id="ALM-12101__b19941321135916">/var/log/Bigdata/yarn/rm/yarn-az-state.log</strong>.</li><li id="ALM-12101__li7414111419565">For other services, view the service health check logs in the corresponding service log directory.</li></ul>
</p></li><li id="ALM-12101__li17601013145215"><span>Contact <span id="ALM-12101__text4614151421417">O&amp;M personnel</span> and provide detailed log file information.</span></li></ol>
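<p>As a convenience when collecting the log information, the following Python sketch prints the last lines of the two AZ state logs named in the procedure. The log paths come from this procedure; the helper names <strong>tail</strong> and <strong>show_az_state_logs</strong>, the default of 50 lines, and the UTF-8 encoding are assumptions made for this example. The sketch does not interpret the log contents.</p>
<pre>
```python
from collections import deque
from pathlib import Path
from typing import List

# Log paths as given in the procedure.
AZ_STATE_LOGS = {
    "HDFS": Path("/var/log/Bigdata/hdfs/nn/hdfs-az-state.log"),
    "Yarn": Path("/var/log/Bigdata/yarn/rm/yarn-az-state.log"),
}


def tail(path: Path, count: int = 50) -> List[str]:
    """Return the last `count` lines of a text file, without newlines."""
    # UTF-8 is an assumption; undecodable bytes are replaced, not fatal.
    with path.open(encoding="utf-8", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


def show_az_state_logs(count: int = 50) -> None:
    """Print the tail of each AZ state log that exists on this node."""
    for service, path in AZ_STATE_LOGS.items():
        print(f"== {service}: {path} ==")
        if not path.is_file():
            print("(log file not found on this node)")
            continue
        for line in tail(path, count):
            print(line)
```
</pre>
<p>Calling <strong>show_az_state_logs()</strong> on the active management node prints up to 50 trailing lines per log, which can then be passed on to O&amp;M personnel.</p>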
</div>
<div class="section" id="ALM-12101__section456064705013"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-12101__p1665710475509">After the fault is rectified, the system automatically clears this alarm.</p>
</div>
<div class="section" id="ALM-12101__section156124711508"><h4 class="sectiontitle">Related Information</h4><p id="ALM-12101__p065724765011">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>