doc-exports/docs/mrs/umn/ALM-18022.html
Yang, Tong 3b1f73dece MRS UMN 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-13 12:03:34 +00:00


<a name="ALM-18022"></a>
<h1 class="topictitle1">ALM-18022 Insufficient Yarn Queue Resources</h1>
<div id="body1536649207855"><div class="section" id="ALM-18022__section137072329478"><h4 class="sectiontitle">Description</h4><p id="ALM-18022__p2447344144712">The alarm module checks Yarn queue resources every 60 seconds. This alarm is generated when available resources or ApplicationMaster (AM) resources of a queue are insufficient.</p>
<p id="ALM-18022__p24476445479">This alarm is cleared when the available resources of the queue become sufficient again.</p>
</div>
<div class="section" id="ALM-18022__section2100630125019"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-18022__table1589914219502" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-18022__row16901174214501"><th align="left" class="cellrowborder" valign="top" width="33.333333333333336%" id="mcps1.3.2.2.1.4.1.1"><p id="ALM-18022__p09031642175015">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.333333333333336%" id="mcps1.3.2.2.1.4.1.2"><p id="ALM-18022__p179041042165011">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.333333333333336%" id="mcps1.3.2.2.1.4.1.3"><p id="ALM-18022__p1590520422506">Auto Clear</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-18022__row199071142165011"><td class="cellrowborder" valign="top" width="33.333333333333336%" headers="mcps1.3.2.2.1.4.1.1 "><p id="ALM-18022__p6909042135012">18022</p>
</td>
<td class="cellrowborder" valign="top" width="33.333333333333336%" headers="mcps1.3.2.2.1.4.1.2 "><p id="ALM-18022__p1791164215505">Minor</p>
</td>
<td class="cellrowborder" valign="top" width="33.333333333333336%" headers="mcps1.3.2.2.1.4.1.3 "><p id="ALM-18022__p14912104255011">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-18022__section1248164755314"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-18022__table1257815617538" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-18022__row7583165655311"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.1"><p id="ALM-18022__p9583125611539">Parameter Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.3.2.1.3.1.2"><p id="ALM-18022__p8585155665317">Description</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-18022__row11912195951013"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-18022__p192431315431">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-18022__p692551319435">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-18022__row25871956175310"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-18022__p8588175625318">QueueName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-18022__p18588135616538">Specifies the queue for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-18022__row4590165617538"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-18022__p105911956185319">QueueMetric</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-18022__p1159210565534">Specifies the metric of the queue for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-18022__row19404132515113"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.1 "><p id="ALM-18022__p353873093910">Trigger Condition</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.3.2.1.3.1.2 "><p id="ALM-18022__p540415257511">Specifies the threshold that triggers the alarm. If the current indicator value exceeds this threshold, the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-18022__section7984510115416"><h4 class="sectiontitle">Impact on the System</h4><ul id="ALM-18022__ul4225423185414"><li id="ALM-18022__li82257234549">Running applications take longer to complete.</li><li id="ALM-18022__li16227122316542">Submitted applications may remain pending for a long time before they start running.</li></ul>
</div>
<div class="section" id="ALM-18022__section732675013546"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-18022__ul351612613552"><li id="ALM-18022__li1151736195510">The resources of the NodeManager nodes are insufficient.</li><li id="ALM-18022__li1851810611556">The configured maximum resource capacity of the queue is too small.</li><li id="ALM-18022__li0518664552">The configured maximum AM resource percentage of the queue is too small.</li></ul>
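<p>The first cause can be checked on a NodeManager host with the <strong>free -g</strong> and <strong>cat /proc/cpuinfo</strong> commands used in step 3 of the procedure. The following Python snippet is a minimal sketch for parsing their captured output. It is not a FusionInsight tool, the function names are hypothetical, and it assumes a procps-ng version of <strong>free</strong> whose header includes an <strong>available</strong> column.</p>

```python
"""Parse captured `free -g` and `cat /proc/cpuinfo` output (hypothetical helpers)."""


def available_memory_gb(free_g_output: str) -> int:
    """Return the 'available' value of the 'Mem:' row of `free -g` output.

    Assumes procps-ng `free`, whose header ends with an 'available' column.
    """
    lines = free_g_output.strip().splitlines()
    header = lines[0].split()  # e.g. total used free shared buff/cache available
    for line in lines[1:]:
        fields = line.split()
        if fields and fields[0] == "Mem:":
            # The row label 'Mem:' has no header entry, so shift the index by one.
            return int(fields[header.index("available") + 1])
    raise ValueError("no 'Mem:' row found in free output")


def cpu_count(cpuinfo_text: str) -> int:
    """Count logical CPUs: /proc/cpuinfo has one 'processor : N' line per CPU."""
    return sum(
        1
        for line in cpuinfo_text.splitlines()
        if line.split(":", 1)[0].strip() == "processor"
    )


if __name__ == "__main__":
    sample_free = (
        "              total        used        free      shared  buff/cache   available\n"
        "Mem:             62          20          10           1          31          40\n"
        "Swap:             0           0           0\n"
    )
    sample_cpuinfo = "processor\t: 0\nmodel name\t: demo\n\nprocessor\t: 1\nmodel name\t: demo\n"
    print(available_memory_gb(sample_free), cpu_count(sample_cpuinfo))  # prints: 40 2
```

<p>Note that <strong>free -g</strong> reports memory in GB, whereas <strong>yarn.nodemanager.resource.memory-mb</strong> is specified in MB, so convert the value before comparing it with the NodeManager configuration.</p>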
</div>
<div class="section" id="ALM-18022__section822512515510"><h4 class="sectiontitle">Procedure</h4><p id="ALM-18022__p783931917564"><strong id="ALM-18022__b1088812527596">View alarm details</strong><strong id="ALM-18022__b15291324013">.</strong></p>
<ol id="ALM-18022__ol16330712012"><li id="ALM-18022__li18633373010"><span>On FusionInsight Manager, choose <strong id="ALM-18022__b10633573014">O&amp;M</strong> &gt; <strong id="ALM-18022__b16331871302">Alarm</strong> &gt; <strong id="ALM-18022__b27872374104950">Alarms</strong>.</span></li><li id="ALM-18022__li10633147109"><span>View the location information of this alarm and check whether <strong id="ALM-18022__b4633177903">QueueName</strong> is <strong id="ALM-18022__b663367200">root</strong> and <strong id="ALM-18022__b8633679017">QueueMetric</strong> is <strong id="ALM-18022__b463311717020">Memory</strong> or <strong id="ALM-18022__b3633171507">vCores</strong>.</span><p><ul id="ALM-18022__ul17959728904"><li id="ALM-18022__li62971516367">If yes, go to <a href="#ALM-18022__li1118842213014">3</a>.</li><li id="ALM-18022__li0961528401">If no, go to <a href="#ALM-18022__li550716317319">4</a>.</li></ul>
</p></li></ol><ol start="3" id="ALM-18022__ol111885221707"><li id="ALM-18022__li1118842213014"><a name="ALM-18022__li1118842213014"></a><a name="li1118842213014"></a><span>The memory or CPU resources of the Yarn cluster are insufficient. In this case, log in to the node where NodeManager resides and run the <strong id="ALM-18022__b718713221701">free -g</strong> and <strong id="ALM-18022__b41870221008">cat /proc/cpuinfo</strong> commands to query the available memory and the number of CPUs of the node, respectively. Based on the query results, increase the values of <strong id="ALM-18022__b121881221306">yarn.nodemanager.resource.memory-mb</strong> and <strong id="ALM-18022__b7188222901">yarn.nodemanager.resource.cpu-vcores</strong> for the Yarn NodeManager on FusionInsight Manager, restart the NodeManager instance, and check whether the alarm is cleared.</span><p><ul id="ALM-18022__ul135411321720"><li id="ALM-18022__li1541832322">If yes, no further action is required.</li><li id="ALM-18022__li3542153215213">If no, go to <a href="#ALM-18022__li550716317319">4</a>.</li></ul>
</p></li></ol><ol start="4" id="ALM-18022__ol1450719312319"><li id="ALM-18022__li550716317319"><a name="ALM-18022__li550716317319"></a><a name="li550716317319"></a><span>In the <strong id="ALM-18022__b18506535311">Location</strong> information of this alarm, check whether <strong id="ALM-18022__b45061332312">QueueName</strong> is <strong id="ALM-18022__b35061319316">&lt;<em id="ALM-18022__i195065319318">Tenant Queue</em>&gt;</strong> and <strong id="ALM-18022__b105062314320">QueueMetric</strong> is <strong id="ALM-18022__b11506183135">Memory</strong> or <strong id="ALM-18022__b5506131314">vCores</strong>, and whether <strong id="ALM-18022__b75073320319">Additional Information</strong> contains <strong id="ALM-18022__b12507434315">available Memory =</strong> or <strong id="ALM-18022__b45073319320">available vCores =</strong>.</span><p><ul id="ALM-18022__ul15681104114315"><li id="ALM-18022__li1168218411331">If yes, go to <a href="#ALM-18022__li11123116735">5</a>.</li><li id="ALM-18022__li6683541337">If no, go to <a href="#ALM-18022__li1189109935">7</a>.</li></ul>
</p></li></ol><ol start="5" id="ALM-18022__ol1312317619316"><li id="ALM-18022__li11123116735"><a name="ALM-18022__li11123116735"></a><a name="li11123116735"></a><span>The memory or CPU of the tenant queue is insufficient. In this case, choose <strong id="ALM-18022__b201221862310">Tenant <strong id="ALM-18022__b68631685519">Resources</strong></strong> &gt; <strong id="ALM-18022__b4122562035">Dynamic Resource Plan &gt; Resource Distribution Policy</strong> and increase the value of <strong id="ALM-18022__b612236539">Maximum Capacity</strong>. Then, check whether the alarm is cleared.</span><p><ul id="ALM-18022__ul1457919121041"><li id="ALM-18022__li165801612640">If yes, no further action is required.</li><li id="ALM-18022__li05821112547">If no, go to <a href="#ALM-18022__li109354114148">6</a>.</li></ul>
</p></li></ol><ol start="6" id="ALM-18022__ol58911910318"><li id="ALM-18022__li109354114148"><a name="ALM-18022__li109354114148"></a><a name="li109354114148"></a><span>Choose <strong id="ALM-18022__b15301441298">Cluster</strong> &gt; <em id="ALM-18022__i123017442917">Name of the desired cluster</em> &gt; <strong id="ALM-18022__b73019492911">Services</strong> &gt; <strong id="ALM-18022__b20301149290">Yarn</strong> &gt; <strong id="ALM-18022__b183014472914">Configurations</strong> &gt; <strong id="ALM-18022__b63020420298">All Configurations</strong>. Enter the keyword "threshold", click <strong id="ALM-18022__b0306452919">ResourceManager</strong>, and adjust the following alarm threshold parameters:</span><p><p id="ALM-18022__p1055172612917">If <strong id="ALM-18022__b13551126202914">Additional Information</strong> contains <strong id="ALM-18022__b15511526152910">available Memory =</strong>, set <strong id="ALM-18022__b13551202612298">yarn.queue.memory.alarm.threshold</strong> to a value smaller than the <strong id="ALM-18022__b8551102611294">available Memory =</strong> value in <strong id="ALM-18022__b4551152616291">Additional Information</strong>.</p>
<p id="ALM-18022__p455172632910">If <strong id="ALM-18022__b155512026132910">Additional Information</strong> contains <strong id="ALM-18022__b85513268298">available vCores =</strong>, set <strong id="ALM-18022__b755110264294">yarn.queue.vcore.alarm.threshold</strong> to a value smaller than the <strong id="ALM-18022__b2551112682914">available vCores =</strong> value in <strong id="ALM-18022__b18551142632919">Additional Information</strong>.</p>
<div class="p" id="ALM-18022__p4680132113012">Wait for five minutes and check whether the alarm is cleared.<ul id="ALM-18022__ul2149122317298"><li id="ALM-18022__li214911234292">If yes, no further action is required.</li><li id="ALM-18022__li5149172382915">If no, go to <a href="#ALM-18022__li1973131339">9</a>.</li></ul>
</div>
</p></li><li id="ALM-18022__li1189109935"><a name="ALM-18022__li1189109935"></a><a name="li1189109935"></a><span>If <strong id="ALM-18022__b4891396320">Additional Information</strong> contains <strong id="ALM-18022__b4891291534">available AmMemory =</strong> or <strong id="ALM-18022__b1289189839">available AmvCores =</strong>, the ApplicationMaster (AM) memory or CPU resources of the tenant queue are insufficient. In this case, choose <strong id="ALM-18022__b148949135">Tenant Resources</strong> &gt; <strong id="ALM-18022__b208911920312">Dynamic Resource Plan</strong> &gt; <strong id="ALM-18022__b18892915310">Queue Configuration</strong> and increase the value of <strong id="ALM-18022__b16891795318">Maximum Am Resource Percent</strong>. Then, check whether the alarm is cleared.</span><p><ul id="ALM-18022__ul168451656643"><li id="ALM-18022__li1848195618413">If yes, no further action is required.</li><li id="ALM-18022__li158491561548">If no, go to <a href="#ALM-18022__li1382974791617">8</a>.</li></ul>
</p></li><li id="ALM-18022__li1382974791617"><a name="ALM-18022__li1382974791617"></a><a name="li1382974791617"></a><span>Choose <strong id="ALM-18022__b174946812293">Cluster</strong> &gt; <em id="ALM-18022__i24946852913">Name of the desired cluster</em> &gt; <strong id="ALM-18022__b7494168202917">Services</strong> &gt; <strong id="ALM-18022__b949448192915">Yarn</strong> &gt; <strong id="ALM-18022__b134941086299">Configurations</strong> &gt; <strong id="ALM-18022__b34942084292">All Configurations</strong>. Enter the keyword "threshold", click <strong id="ALM-18022__b949418810296">ResourceManager</strong>, and adjust the following alarm threshold parameters:</span><p><p id="ALM-18022__p143552315315">If <strong id="ALM-18022__b8355133173116">Additional Information</strong> contains <strong id="ALM-18022__b035510363118">available AmMemory =</strong>, set <strong id="ALM-18022__b335610311313">yarn.queue.memory.alarm.threshold</strong> to a value smaller than the <strong id="ALM-18022__b143565315312">available AmMemory =</strong> value in <strong id="ALM-18022__b1356537311">Additional Information</strong>.</p>
<p id="ALM-18022__p635616313317">If <strong id="ALM-18022__b1356173203118">Additional Information</strong> contains <strong id="ALM-18022__b935612318319">available AmvCores =</strong>, set <strong id="ALM-18022__b135618343112">yarn.queue.vcore.alarm.threshold</strong> to a value smaller than the <strong id="ALM-18022__b63568319316">available AmvCores =</strong> value in <strong id="ALM-18022__b13356238315">Additional Information</strong>.</p>
<div class="p" id="ALM-18022__p1119455673116">Wait for five minutes and check whether the alarm is cleared.<ul id="ALM-18022__ul1519435663118"><li id="ALM-18022__li21947563313">If yes, no further action is required.</li><li id="ALM-18022__li18195556173115">If no, go to <a href="#ALM-18022__li1973131339">9</a>.</li></ul>
</div>
</p></li></ol>
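<p>The yes/no branches of steps 2 and 4 can be summarized as a small routing function. The Python snippet below is a hedged sketch, not part of FusionInsight Manager: it assumes that the <strong>QueueName</strong>, <strong>QueueMetric</strong>, and <strong>Additional Information</strong> values have already been copied from the alarm as plain strings, and the function names and the sample "available AmMemory = 512" text are hypothetical. The extracted number is the value that <strong>yarn.queue.memory.alarm.threshold</strong> or <strong>yarn.queue.vcore.alarm.threshold</strong> must be set below in steps 6 and 8.</p>

```python
"""Route an ALM-18022 alarm to a procedure step (hypothetical helper, not a Manager API)."""
import re
from typing import Optional

# Prefixes quoted in the procedure; the AM variants are checked in step 7.
QUEUE_KEYS = ("available Memory =", "available vCores =")
AM_KEYS = ("available AmMemory =", "available AmvCores =")
METRICS = ("Memory", "vCores")


def next_step(queue_name: str, queue_metric: str, additional_info: str) -> int:
    """Follow the yes/no branches of steps 2 and 4 and return the step to go to."""
    if queue_name == "root" and queue_metric in METRICS:
        return 3  # step 3: cluster-wide NodeManager memory or CPU is short
    if queue_metric in METRICS and any(key in additional_info for key in QUEUE_KEYS):
        return 5  # step 5: tenant queue capacity is short
    return 7  # step 7: check the AM resources of the tenant queue


def available_value(additional_info: str, key: str) -> Optional[float]:
    """Return the number that follows `key` in Additional Information, or None."""
    match = re.search(re.escape(key) + r"\s*(\d+(?:\.\d+)?)", additional_info)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    info = "available AmMemory = 512, available AmvCores = 1"  # hypothetical sample text
    step = next_step("tenant_a", "Memory", info)
    # Steps 6 and 8 set the alarm threshold below this extracted value.
    print(step, available_value(info, AM_KEYS[0]))  # prints: 7 512.0
```

<p>Plain substring checks are enough here because the queue-level prefixes (for example, <strong>available Memory =</strong>) do not occur inside the AM prefixes (for example, <strong>available AmMemory =</strong>).</p>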
<p id="ALM-18022__p944875110417"><strong id="ALM-18022__b1688041925616">Collect fault information.</strong></p>
<ol start="9" id="ALM-18022__ol7988135315"><li id="ALM-18022__li1973131339"><a name="ALM-18022__li1973131339"></a><a name="li1973131339"></a><span>Log in to FusionInsight Manager of the active cluster and choose <strong id="ALM-18022__b29611133310">O&amp;M</strong> &gt; <strong id="ALM-18022__b169611311320">Log</strong> &gt; <strong id="ALM-18022__b39614133319">Download</strong>.</span></li><li id="ALM-18022__li29791314320"><span>From <strong id="ALM-18022__b4975131531">Service</strong>, select <strong id="ALM-18022__b199761320313">Yarn</strong> of the required cluster.</span></li><li id="ALM-18022__li1145664103113"><span>Click <span><img id="ALM-18022__image1945644173117" src="en-us_image_0269417410.png"></span> in the upper right corner, set <strong id="ALM-18022__b6456941173117">Start Date</strong> and <strong id="ALM-18022__b11456154113318">End Date</strong> for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively, and click <strong id="ALM-18022__b13456164113319">Download</strong>.</span></li><li id="ALM-18022__li2980132318"><span>Contact the <span id="ALM-18022__text4614151421417">O&amp;M personnel</span> and send them the collected logs.</span></li></ol>
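<p>The collection window in step 11 is simple time arithmetic. As a minimal Python sketch (the function name is hypothetical), the <strong>Start Date</strong> and <strong>End Date</strong> values can be derived from the alarm generation time; <strong>timedelta</strong> also handles windows that cross midnight.</p>

```python
"""Compute the log-collection window for step 11 (hypothetical helper)."""
from datetime import datetime, timedelta
from typing import Tuple


def log_window(alarm_time: datetime, margin_minutes: int = 10) -> Tuple[datetime, datetime]:
    """Return (Start Date, End Date): margin_minutes before and after the alarm time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


if __name__ == "__main__":
    start, end = log_window(datetime(2022, 12, 13, 0, 5))
    print(start, end)  # prints: 2022-12-12 23:55:00 2022-12-13 00:15:00
```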
</div>
<div class="section" id="ALM-18022__section1529716184534"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-18022__p4677152685316">After the fault is rectified, the system automatically clears this alarm.</p>
</div>
<div class="section" id="ALM-18022__section1436217493589"><h4 class="sectiontitle">Reference</h4><p id="ALM-18022__p179001314115919">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>