doc-exports/docs/mrs/umn/ALM-18000.html
Yang, Tong 3b1f73dece MRS UMN 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-13 12:03:34 +00:00


<a name="ALM-18000"></a>
<h1 class="topictitle1">ALM-18000 Yarn Service Unavailable</h1>
<div id="body54853885"><div class="section" id="ALM-18000__s1c5b2f91bc4c49c297fe8d0721dc1913"><h4 class="sectiontitle">Description</h4><p id="ALM-18000__en-us_topic_0070543681_p31120856">This alarm is generated when the Yarn service is unavailable. The alarm module checks the Yarn service status every 60 seconds.</p>
<p id="ALM-18000__en-us_topic_0070543681_p11652250">The alarm is cleared when the Yarn service recovers.</p>
</div>
<div class="section" id="ALM-18000__s3930db61023f4348afdb706dae50a1e8"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-18000__en-us_topic_0070543681_table4308221" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-18000__en-us_topic_0070543681_row12288967"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-18000__en-us_topic_0070543681_p55882267">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-18000__en-us_topic_0070543681_p30169808">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-18000__en-us_topic_0070543681_p27835406">Automatically Cleared</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-18000__en-us_topic_0070543681_row40075388"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-18000__en-us_topic_0070543681_p24880962">18000</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-18000__en-us_topic_0070543681_p2092041">Critical</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-18000__en-us_topic_0070543681_p35237624">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-18000__s10a5e866e42d490ea172cecdb0354082"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-18000__en-us_topic_0070543681_table35675320" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-18000__en-us_topic_0070543681_row44771354"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-18000__en-us_topic_0070543681_p2601030">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-18000__en-us_topic_0070543681_p9356903">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-18000__row1715792410238"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-18000__p192431315431">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-18000__p692551319435">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-18000__en-us_topic_0070543681_row19711717"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-18000__en-us_topic_0070543681_p53145255">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-18000__en-us_topic_0070543681_p9798422">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-18000__en-us_topic_0070543681_row21076937"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-18000__en-us_topic_0070543681_p29510374">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-18000__en-us_topic_0070543681_p41530083">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-18000__en-us_topic_0070543681_row38226429"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-18000__en-us_topic_0070543681_p9333038">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-18000__en-us_topic_0070543681_p17778579">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-18000__se63a4ca7918e4c63b34e71166c6d509a"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-18000__en-us_topic_0070543681_p30778772">The cluster cannot provide Yarn services. Users cannot run new applications, and applications that have already been submitted cannot run.</p>
</div>
<div class="section" id="ALM-18000__s68c1a5e3a46c4fac86815c54a610299b"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-18000__en-us_topic_0070543681_ul10052635"><li id="ALM-18000__en-us_topic_0070543681_li23364851">The ZooKeeper service is abnormal.</li><li id="ALM-18000__en-us_topic_0070543681_li8957069">The HDFS service is abnormal.</li><li id="ALM-18000__en-us_topic_0070543681_li13504763">There is no active ResourceManager instance in the Yarn cluster.</li><li id="ALM-18000__en-us_topic_0070543681_li54434004">All the NodeManagers in the Yarn cluster are abnormal.</li></ul>
</div>
<div class="section" id="ALM-18000__sc0f25eebb8744e519b7fb96a61670ee6"><h4 class="sectiontitle">Procedure</h4><p class="tableheading" id="ALM-18000__en-us_topic_0070543681_p47078232"><strong id="ALM-18000__b12267826174737">Check the ZooKeeper service status.</strong></p>
<ol id="ALM-18000__ol3997015174744"><li id="ALM-18000__li45230096174725"><span>On FusionInsight Manager, check whether the alarm list contains <strong id="ALM-18000__b9339544174725">ALM-13000 ZooKeeper Service Unavailable</strong>.</span><p><ul class="subitemlist" id="ALM-18000__ul57221349174725"><li id="ALM-18000__li18305591174725">If yes, go to <a href="#ALM-18000__li311182174725">2</a>.</li><li id="ALM-18000__li6357927174725">If no, go to <a href="#ALM-18000__li19148237174725">3</a>.</li></ul>
</p></li><li id="ALM-18000__li311182174725"><a name="ALM-18000__li311182174725"></a><a name="li311182174725"></a><span>Rectify the fault by following the steps provided in <strong id="ALM-18000__b16316142351414">ALM-13000 ZooKeeper Service Unavailable</strong>, and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-18000__ul60484669174725"><li id="ALM-18000__li39759129174725">If yes, no further action is required.</li><li id="ALM-18000__li66372842174725">If no, go to <a href="#ALM-18000__li19148237174725">3</a>.</li></ul>
</p></li></ol>
<p class="tableheading" id="ALM-18000__p7491116174725"><strong id="ALM-18000__b1787260817482">Check the HDFS service status.</strong></p>
<ol start="3" id="ALM-18000__ol28880248174816"><li id="ALM-18000__li19148237174725"><a name="ALM-18000__li19148237174725"></a><a name="li19148237174725"></a><span>On FusionInsight Manager, check whether the alarm list contains any HDFS alarms.</span><p><ul class="subitemlist" id="ALM-18000__ul54323365174725"><li id="ALM-18000__li25205820174725">If yes, go to <a href="#ALM-18000__li13219687174725">4</a>.</li><li id="ALM-18000__li28405550174725">If no, go to <a href="#ALM-18000__li40584762174725">5</a>.</li></ul>
</p></li><li id="ALM-18000__li13219687174725"><a name="ALM-18000__li13219687174725"></a><a name="li13219687174725"></a><span>Choose <strong id="ALM-18000__b15623531132319">O&amp;M &gt; Alarm<strong id="ALM-18000__b27872374104950"> &gt; Alarms</strong></strong>, handle HDFS alarms based on the alarm help, and check whether the Yarn alarm is cleared.</span><p><ul class="subitemlist" id="ALM-18000__ul34131890174725"><li id="ALM-18000__li7503360174725">If yes, no further action is required.</li><li id="ALM-18000__li3792432174725">If no, go to <a href="#ALM-18000__li40584762174725">5</a>.</li></ul>
</p></li></ol>
<p class="tableheading" id="ALM-18000__p38751556174725"><strong id="ALM-18000__b14545250174838">Check the ResourceManager status in the Yarn cluster.</strong></p>
<ol start="5" id="ALM-18000__ol62918647174855"><li id="ALM-18000__li40584762174725"><a name="ALM-18000__li40584762174725"></a><a name="li40584762174725"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-18000__b1656204320386">Cluster &gt; </strong><em id="ALM-18000__i1065884373814">Name of the desired cluster</em><strong id="ALM-18000__b13657184314381"> &gt; Services</strong> &gt; <strong id="ALM-18000__b64161741174725">Yarn</strong>.</span></li><li id="ALM-18000__li40403515174725"><span>In <strong id="ALM-18000__b4308232137">Dashboard</strong>, check whether there is an active ResourceManager instance in the Yarn cluster.</span><p><ul class="subitemlist" id="ALM-18000__ul20382916174725"><li id="ALM-18000__li58391498174725">If yes, go to <a href="#ALM-18000__li7454663174725">7</a>.</li><li id="ALM-18000__li32090930174725">If no, go to <a href="#ALM-18000__li46526163174725">10</a>.</li></ul>
</p></li></ol>
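<p>The ResourceManager check above can also be approximated outside the portal. The following is a minimal sketch and not part of the documented procedure. It assumes that each ResourceManager web address is reachable over plain HTTP without authentication (which is often not the case in a security-mode cluster) and reads the <code>haState</code> field returned by the ResourceManager REST endpoint <code>/ws/v1/cluster/info</code>. The function names, host names, and port are hypothetical placeholders.</p>

```python
import json
from urllib.request import urlopen


def ha_state(cluster_info: dict) -> str:
    """Return the haState found in a /ws/v1/cluster/info response body."""
    return cluster_info.get("clusterInfo", {}).get("haState", "UNKNOWN")


def has_active_rm(responses) -> bool:
    """True if at least one response reports an ACTIVE ResourceManager."""
    return any(ha_state(r) == "ACTIVE" for r in responses)


def fetch_cluster_info(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch cluster info from one ResourceManager.

    Assumption: unauthenticated HTTP access; adapt for HTTPS or Kerberos.
    """
    with urlopen(f"{base_url}/ws/v1/cluster/info", timeout=timeout) as resp:
        return json.load(resp)


# Offline usage with hand-written sample bodies (no network needed).
# Against a live cluster, the list could instead be built with
# fetch_cluster_info("http://rm1.example.com:8088") and so on
# (hypothetical addresses).
samples = [
    {"clusterInfo": {"haState": "STANDBY"}},
    {"clusterInfo": {"haState": "ACTIVE"}},
]
print("Active ResourceManager present:", has_active_rm(samples))
```

<p>If no response reports <code>ACTIVE</code>, the result corresponds to the "no" branch of the check above.</p>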
<p class="tableheading" id="ALM-18000__p49228522174725"><strong id="ALM-18000__b18887241174919">Check the NodeManager node status in the Yarn cluster.</strong></p>
<ol start="7" id="ALM-18000__ol23522848174955"><li id="ALM-18000__li7454663174725"><a name="ALM-18000__li7454663174725"></a><a name="li7454663174725"></a><span>On the FusionInsight Manager portal, choose <strong id="ALM-18000__b10772144983818">Cluster &gt; </strong><em id="ALM-18000__i6773649113815">Name of the desired cluster</em><strong id="ALM-18000__b17772184973810"> &gt; Services</strong> &gt; <strong id="ALM-18000__b51459311174725">Yarn</strong> &gt; <strong id="ALM-18000__b60480619174725">Instance</strong>.</span></li><li id="ALM-18000__li15977364174725"><span>Check the <strong id="ALM-18000__b2020550269">Running Status</strong> of each NodeManager instance to see whether any node is unhealthy.</span><p><ul class="subitemlist" id="ALM-18000__ul9231803174725"><li id="ALM-18000__li65740686174725">If yes, go to <a href="#ALM-18000__li25011012174725">9</a>.</li><li id="ALM-18000__li23395377174725">If no, go to <a href="#ALM-18000__li46526163174725">10</a>.</li></ul>
</p></li><li id="ALM-18000__li25011012174725"><a name="ALM-18000__li25011012174725"></a><a name="li25011012174725"></a><span>Rectify the fault by following the steps provided in <strong id="ALM-18000__b828076131515">ALM-18002 NodeManager Heartbeat Lost</strong> or <strong id="ALM-18000__b7236118131510">ALM-18003 NodeManager Unhealthy</strong>. After the fault is rectified, check whether the Yarn alarm is cleared.</span><p><ul class="subitemlist" id="ALM-18000__ul10250831174725"><li id="ALM-18000__li37664987174725">If yes, no further action is required.</li><li id="ALM-18000__li30965143174725">If no, go to <a href="#ALM-18000__li46526163174725">10</a>.</li></ul>
</p></li></ol>
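<p>The NodeManager status check above can be summarized programmatically when the ResourceManager REST API is reachable. The sketch below is illustrative only. It assumes a response body shaped like the one returned by <code>/ws/v1/cluster/nodes</code>, where each node entry carries a <code>state</code> field such as <code>RUNNING</code>, <code>UNHEALTHY</code>, or <code>LOST</code>. The function names and the sample data are hypothetical.</p>

```python
from collections import Counter


def node_states(nodes_response: dict) -> Counter:
    """Count NodeManagers per state in a /ws/v1/cluster/nodes response body."""
    # The "nodes" value may be null when the ResourceManager lists no nodes.
    nodes = (nodes_response.get("nodes") or {}).get("node") or []
    return Counter(n.get("state", "UNKNOWN") for n in nodes)


def all_nodemanagers_abnormal(nodes_response: dict) -> bool:
    """True if no NodeManager is reported in the RUNNING state."""
    return node_states(nodes_response).get("RUNNING", 0) == 0


# Hand-written sample body; node IDs are placeholders.
sample = {
    "nodes": {
        "node": [
            {"id": "node-1:8041", "state": "UNHEALTHY"},
            {"id": "node-2:8041", "state": "LOST"},
        ]
    }
}
print(dict(node_states(sample)))
print("All NodeManagers abnormal:", all_nodemanagers_abnormal(sample))
```

<p>A result with no <code>RUNNING</code> node matches the possible cause "All the NodeManagers in the Yarn cluster are abnormal".</p>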
<p class="tableheading" id="ALM-18000__p25148622174725"><strong id="ALM-18000__b51806081175029">Collect fault information.</strong></p>
<ol start="10" id="ALM-18000__ol40182044175016"><li id="ALM-18000__li46526163174725"><a name="ALM-18000__li46526163174725"></a><a name="li46526163174725"></a><span>On the FusionInsight Manager portal of the active cluster, choose <strong id="ALM-18000__b977113042710">O&amp;M</strong> &gt; <strong id="ALM-18000__b12626114174725">Log &gt; Download</strong>.</span></li><li id="ALM-18000__li47045674174725"><span>Select <strong id="ALM-18000__b16082288174725">Yarn</strong> in the required cluster from <strong id="ALM-18000__b10522864174725">Service</strong>.</span></li><li id="ALM-18000__li1145664103113"><span>Click <span><img id="ALM-18000__image1945644173117" src="en-us_image_0269417390.png"></span> in the upper right corner, and set <strong id="ALM-18000__b6456941173117">Start Date</strong> and <strong id="ALM-18000__b11456154113318">End Date</strong> for log collection to 10 minutes before and after the alarm generation time, respectively. Then, click <strong id="ALM-18000__b13456164113319">Download</strong>.</span></li><li id="ALM-18000__li56082450174725"><span>Contact the <span id="ALM-18000__text4614151421417">O&amp;M personnel</span> and send the collected logs.</span></li></ol>
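<p>The log collection window described above is simple date arithmetic. The following sketch shows one way to compute it. The function name and the sample alarm time are hypothetical, and the 10-minute margin is taken from the step above.</p>

```python
from datetime import datetime, timedelta


def log_window(alarm_time: datetime, margin_minutes: int = 10):
    """Return (start, end) covering margin_minutes before and after the alarm."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Hypothetical alarm generation time, chosen to show the window crossing midnight.
start, end = log_window(datetime(2024, 1, 1, 0, 5, 0))
print("Start Date:", start.isoformat(sep=" "))  # 2023-12-31 23:55:00
print("End Date:  ", end.isoformat(sep=" "))    # 2024-01-01 00:15:00
```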
</div>
<div class="section" id="ALM-18000__section1529716184534"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-18000__p4677152685316">After the fault is rectified, the system automatically clears this alarm.</p>
</div>
<div class="section" id="ALM-18000__sb7f9c95269284e9eac231cf07c9638a5"><h4 class="sectiontitle">Related Information</h4><p id="ALM-18000__en-us_topic_0070543681_p31573738">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>