Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
<a name="ALM-45646"></a><a name="ALM-45646"></a>
<h1 class="topictitle1">ALM-45646 Pending Compaction Size of RocksDB Continuously Exceeds the Threshold</h1>
<div id="body0000002008101577"><p id="ALM-45646__p12261122253615">This section applies to MRS 3.3.0 or later.</p>
<div class="section" id="ALM-45646__section663215"><h4 class="sectiontitle"><span id="ALM-45646__text516373020197">Alarm Description</span></h4><p id="ALM-45646__p66405588">The system checks the RocksDB monitoring data of jobs at the user-specified alarm reporting interval (<strong id="ALM-45646__b15572101814514">metrics.reporter.alarm.job.alarm.rocksdb.metrics.duration</strong>, 180s by default). This alarm is generated when the number of pending compaction requests of RocksDB for a job continuously reaches <em id="ALM-45646__i19573818135118">n</em> times the number of flush/compaction threads. This alarm is cleared when the number of pending compaction requests of RocksDB for the job is less than or equal to the threshold.</p>
<ul id="ALM-45646__ul1712241415015"><li id="ALM-45646__li1712310147020">The number of flush/compaction threads is the value of <strong id="ALM-45646__b1690210135312">state.backend.rocksdb.thread.num</strong>. The default value is <strong id="ALM-45646__b1990319065318">2</strong>. If <strong id="ALM-45646__b17903140125317">SPINNING_DISK_OPTIMIZED_HIGH_MEM</strong> is enabled, the default value is <strong id="ALM-45646__b159031904539">4</strong>.</li><li id="ALM-45646__li112319141009">The <strong id="ALM-45646__b67749675313">metrics.reporter.alarm.job.alarm.rocksdb.background.jobs.multiplier</strong> parameter specifies the multiplier <em id="ALM-45646__i8774186185318">n</em> that is applied to the number of flush/compaction threads to obtain the alarm threshold. The default value is <strong id="ALM-45646__b137740614538">2</strong>.</li></ul>
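<p>The threshold described above can be sketched as follows. This is a minimal illustration, not the product's implementation: the function names are hypothetical, and treating "continuously" as "every sample collected in the checking window reached the threshold" is an assumption.</p>

```python
def default_thread_num(spinning_disk_high_mem: bool) -> int:
    """Default of state.backend.rocksdb.thread.num as documented above."""
    return 4 if spinning_disk_high_mem else 2


def pending_threshold(thread_num: int, multiplier: int = 2) -> int:
    """Alarm threshold: n (the multiplier) times the flush/compaction thread count."""
    return multiplier * thread_num


def should_raise_alarm(samples, thread_num: int, multiplier: int = 2) -> bool:
    """Assumption: the alarm fires only if every sample reached the threshold."""
    threshold = pending_threshold(thread_num, multiplier)
    return bool(samples) and all(s >= threshold for s in samples)
```

<p>With the default values (2 threads, multiplier 2), the threshold in this sketch is 4 pending requests. With <strong>SPINNING_DISK_OPTIMIZED_HIGH_MEM</strong> enabled (4 threads), it is 8.</p>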
</div>
<div class="section" id="ALM-45646__section5968939"><h4 class="sectiontitle"><span id="ALM-45646__text20591447192117">Alarm Attributes</span></h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45646__table10143581" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-45646__row61411666"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.1"><p id="ALM-45646__p17386810"><span id="ALM-45646__text1864783145211">Alarm ID</span></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.2"><p id="ALM-45646__p66154394"><span id="ALM-45646__text297913110521">Alarm Severity</span></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.3"><p id="ALM-45646__p49230886"><span id="ALM-45646__text0890175712305">Auto Cleared</span></p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45646__row49774232"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.1 "><p id="ALM-45646__p5180964">45646</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.2 "><p id="ALM-45646__p713185015583">Minor</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.3 "><p id="ALM-45646__p35224963">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-45646__section53720453"><h4 class="sectiontitle"><span id="ALM-45646__text18171442142214">Alarm Parameters</span></h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45646__table34649765" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-45646__row18974100"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.2.1.3.1.1"><p id="ALM-45646__p42699947"><span id="ALM-45646__text6203173410617">Parameter</span></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.2.1.3.1.2"><p id="ALM-45646__p36143663"><span id="ALM-45646__text10819164319610">Description</span></p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45646__row16272251424"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45646__p9447153994219">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45646__p144723994214">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45646__row38292076"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45646__p164471639194216">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45646__p44471639174211">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45646__row73049270124"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45646__p24471539104219">ApplicationName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45646__p1944763954213">Specifies the name of the application for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45646__row9875225"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45646__p1244715394427">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45646__p44471439144216">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45646__row13243689"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45646__p1244716397426">JobName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45646__p244713917425">Specifies the job for which the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-45646__section13722030"><h4 class="sectiontitle"><span id="ALM-45646__text98201443182317">Impact on the System</span></h4><p id="ALM-45646__p19628114371712">This alarm has no adverse impact on the system.</p>
</div>
<div class="section" id="ALM-45646__section56389407"><h4 class="sectiontitle"><span id="ALM-45646__text11871546172411">Possible Causes</span></h4><p id="ALM-45646__p97889351461">The number of pending compaction requests of RocksDB for the Flink job is too large.</p>
</div>
<div class="section" id="ALM-45646__section178029519112"><h4 class="sectiontitle"><span id="ALM-45646__text79051154102518">Handling Procedure</span></h4><p id="ALM-45646__en-us_topic_0000001472937921_p13057358914"><strong id="ALM-45646__b638381410348">Check TaskManager logs for the number of pending compaction requests and collect logs.</strong></p>
<ol id="ALM-45646__en-us_topic_0000001472937921_ol72082246508"><li id="ALM-45646__en-us_topic_0000001472937921_li1020812242501"><span>Log in to FusionInsight Manager as a user who has the FlinkServer management permission.</span></li><li id="ALM-45646__en-us_topic_0000001472937921_li11208132410508"><span>Choose <strong id="ALM-45646__b111921624103412">O&M</strong> > <strong id="ALM-45646__b10192142412349">Alarm</strong> > <strong id="ALM-45646__b2193524143412">Alarms</strong> > <strong id="ALM-45646__b121931024133413">ALM-45646 Pending Compaction Size of RocksDB Continuously Exceeds the Threshold</strong>, view <strong id="ALM-45646__b20193172413417">Location</strong>, and obtain the name of the task for which the alarm is generated.</span></li><li id="ALM-45646__en-us_topic_0000001472937921_li2208524185013"><span>Choose <strong id="ALM-45646__b10308118175319">Cluster</strong> > <strong id="ALM-45646__b830881816536">Services</strong> > <strong id="ALM-45646__b93081189538">Yarn</strong> and click the link next to <strong id="ALM-45646__b153081118155319">ResourceManager WebUI</strong> to go to the native Yarn page.</span></li></ol><ol start="4" id="ALM-45646__en-us_topic_0000001472937921_ol52081124175014"><li id="ALM-45646__en-us_topic_0000001472937921_li10207524155012"><span>Locate the abnormal task based on its name displayed in <strong id="ALM-45646__b9851185133410">Location</strong>, search for and record the application ID of the job, and check whether the job logs are available on the Yarn page.</span><p><div class="fignone" id="ALM-45646__en-us_topic_0000001472937921_fig52072244500"><span class="figcap"><b>Figure 1 </b>Application ID of a job</span><br><span><img id="ALM-45646__image17434505513" src="en-us_image_0000001971808542.png"></span></div>
<ul id="ALM-45646__en-us_topic_0000001472937921_ul350914323712"><li id="ALM-45646__en-us_topic_0000001472937921_li13509103163718">If yes, go to <a href="#ALM-45646__en-us_topic_0000001472937921_li192082249500">5</a>.</li><li id="ALM-45646__en-us_topic_0000001472937921_li12509193133710">If no, go to <a href="#ALM-45646__en-us_topic_0000001472937921_li15924173318501">6</a>.</li></ul>
</p></li><li id="ALM-45646__en-us_topic_0000001472937921_li192082249500"><a name="ALM-45646__en-us_topic_0000001472937921_li192082249500"></a><a name="en-us_topic_0000001472937921_li192082249500"></a><span>Click the application ID of the abnormal job to go to the job page.</span><p><ol type="a" id="ALM-45646__en-us_topic_0000001472937921_ol2208624155017"><li id="ALM-45646__en-us_topic_0000001472937921_li18207192475010">Click <strong id="ALM-45646__b183236803518">Logs</strong> in the <strong id="ALM-45646__b13246817353">Logs</strong> column to view JobManager logs.<div class="fignone" id="ALM-45646__en-us_topic_0000001472937921_fig620782485019"><span class="figcap"><b>Figure 2 </b>Clicking Logs</span><br><span><img id="ALM-45646__en-us_topic_0000001472937921_image9207142417503" src="en-us_image_0000002008248553.png"></span></div>
</li><li id="ALM-45646__en-us_topic_0000001472937921_li5208102419500">Click the ID in the <strong id="ALM-45646__b491851433517">Attempt ID</strong> column and click <strong id="ALM-45646__b49191414123510">Logs</strong> in the <strong id="ALM-45646__b091961413510">Logs</strong> column to view and save TaskManager logs. Then go to <a href="#ALM-45646__en-us_topic_0000001472937921_li174771425105416">7</a>.<div class="fignone" id="ALM-45646__en-us_topic_0000001472937921_fig92071724195013"><span class="figcap"><b>Figure 3 </b>Clicking the ID in the Attempt ID column</span><br><span><img id="ALM-45646__en-us_topic_0000001472937921_image17207142475012" src="en-us_image_0000001971648806.png"></span></div>
<div class="fignone" id="ALM-45646__en-us_topic_0000001472937921_fig1720811242508"><span class="figcap"><b>Figure 4 </b>Clicking Logs</span><br><span><img id="ALM-45646__en-us_topic_0000001472937921_image1820882414506" src="en-us_image_0000002008129121.png"></span></div>
<div class="note" id="ALM-45646__en-us_topic_0000001472937921_note1320882412506"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="ALM-45646__en-us_topic_0000001472937921_p132081324105018">You can also log in to Manager as a user who has the management permission for the current Flink job. Choose <strong id="ALM-45646__b891811275355">Cluster</strong> > <strong id="ALM-45646__b2918182713357">Services</strong> > <strong id="ALM-45646__b1191892703518">Flink</strong>, and click the link next to <strong id="ALM-45646__b15919102719355">Flink WebUI</strong>. On the displayed Flink web UI, click <strong id="ALM-45646__b1791962733516">Job Management</strong>, click <strong id="ALM-45646__b7919827173513">More</strong> in the <strong id="ALM-45646__b4920122717359">Operation</strong> column, and select <strong id="ALM-45646__b792072793517">Job Monitoring</strong> to view TaskManager logs.</p>
</div></div>
</li></ol>
</p></li></ol>
<p id="ALM-45646__en-us_topic_0000001472937921_p197819554109"><strong id="ALM-45646__b1456911327359">If logs are unavailable on the Yarn page, download logs from HDFS.</strong></p>
<ol start="6" id="ALM-45646__en-us_topic_0000001472937921_ol69241133165017"><li id="ALM-45646__en-us_topic_0000001472937921_li15924173318501"><a name="ALM-45646__en-us_topic_0000001472937921_li15924173318501"></a><a name="en-us_topic_0000001472937921_li15924173318501"></a><span>On Manager, choose <strong id="ALM-45646__b895218354353">Cluster</strong> > <strong id="ALM-45646__b695313516359">Services</strong> > <strong id="ALM-45646__b795317355354">HDFS</strong>, click the link next to <strong id="ALM-45646__b395323543515">NameNode WebUI</strong> to go to the HDFS page, choose <strong id="ALM-45646__b89541635163518">Utilities</strong> > <strong id="ALM-45646__b109541335123517">Browse the file system</strong>, and download logs in the <strong id="ALM-45646__b495414353352">/tmp/logs/</strong><em id="ALM-45646__i14955203523511">Username</em><strong id="ALM-45646__b129551035193511">/bucket-logs-tfile/</strong><em id="ALM-45646__i695593543518">Last four digits of the task application ID/Application ID of the task</em> directory.</span></li></ol>
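<p>The HDFS directory in the preceding step is built from the username and the application ID. The following sketch shows the composition. The function name, the username, and the application ID used in the usage note are hypothetical examples.</p>

```python
def yarn_log_dir(username: str, application_id: str) -> str:
    """Build the aggregated-log directory described in the step above."""
    last_four = application_id[-4:]
    if not last_four.isdigit():
        raise ValueError("application ID is expected to end with four digits")
    return f"/tmp/logs/{username}/bucket-logs-tfile/{last_four}/{application_id}"
```

<p>For example, for a user named <strong>flinkuser</strong> and the application ID <strong>application_1700000000000_0012</strong>, the function returns <strong>/tmp/logs/flinkuser/bucket-logs-tfile/0012/application_1700000000000_0012</strong>.</p>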
<p id="ALM-45646__en-us_topic_0000001472937921_p1737814291119"><strong id="ALM-45646__b81661138153510">Check whether there are too many pending compaction requests.</strong></p>
<ol start="7" id="ALM-45646__en-us_topic_0000001472937921_ol673654225011"><li id="ALM-45646__en-us_topic_0000001472937921_li174771425105416"><a name="ALM-45646__en-us_topic_0000001472937921_li174771425105416"></a><a name="en-us_topic_0000001472937921_li174771425105416"></a><span>Check whether the sum of the values of <strong id="ALM-45646__b9668144823520">rocksdb.mem-table-flush-pending</strong> and <strong id="ALM-45646__b2668154818357">rocksdb.compaction-pending</strong> in TaskManager monitoring logs (keyword <strong id="ALM-45646__b1166814873517">RocksDBMetricPrint</strong>) is greater than <strong id="ALM-45646__b67441617121811">n</strong> times the number of RocksDB threads, where <em>n</em> is the value of <strong id="ALM-45646__b76692481359">metrics.reporter.alarm.job.alarm.rocksdb.background.jobs.multiplier</strong> (<strong>2</strong> by default).</span><p><ul id="ALM-45646__en-us_topic_0000001472937921_ul173694225012"><li id="ALM-45646__en-us_topic_0000001472937921_li9736442105010">If yes, increase the number of RocksDB threads by adjusting the following custom parameter on the job development page of the Flink web UI, save the settings, and go to <a href="#ALM-45646__en-us_topic_0000001472937921_li4736164255016">8</a>.
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45646__table8213135018497" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Custom parameters</caption><thead align="left"><tr id="ALM-45646__row22148505496"><th align="left" class="cellrowborder" valign="top" width="31.36%" id="mcps1.3.7.8.1.2.1.1.2.2.4.1.1"><p id="ALM-45646__p14214250174913">Parameter</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="25.36%" id="mcps1.3.7.8.1.2.1.1.2.2.4.1.2"><p id="ALM-45646__p82149501496">Default Value</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="43.28%" id="mcps1.3.7.8.1.2.1.1.2.2.4.1.3"><p id="ALM-45646__p62153502499">Description</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45646__row22151050184918"><td class="cellrowborder" valign="top" width="31.36%" headers="mcps1.3.7.8.1.2.1.1.2.2.4.1.1 "><p id="ALM-45646__p1215155018496">state.backend.rocksdb.thread.num</p>
</td>
<td class="cellrowborder" valign="top" width="25.36%" headers="mcps1.3.7.8.1.2.1.1.2.2.4.1.2 "><ul id="ALM-45646__ul542417561121"><li id="ALM-45646__li94242561824"><strong id="ALM-45646__b1281754121810">2</strong></li><li id="ALM-45646__li174258561020"><strong id="ALM-45646__b120219432183">4</strong> if <strong id="ALM-45646__b1620224313189">SPINNING_DISK_OPTIMIZED_HIGH_MEM</strong> is enabled</li></ul>
</td>
<td class="cellrowborder" valign="top" width="43.28%" headers="mcps1.3.7.8.1.2.1.1.2.2.4.1.3 "><ul id="ALM-45646__ul61470391386"><li id="ALM-45646__li217318532515">Number of flush/compaction threads. Increase the number of threads to flush in-memory data to disks faster.</li><li id="ALM-45646__li1577014227598">If you increase the number of threads, also increase the number of vCores.</li><li id="ALM-45646__li2779182219585">A value from <strong id="ALM-45646__b161044482188">2</strong> to <strong id="ALM-45646__b1910420480183">10</strong> is recommended.</li></ul>
</td>
</tr>
</tbody>
</table>
</div>
</li><li id="ALM-45646__en-us_topic_0000001472937921_li1673694219506">If no, go to <a href="#ALM-45646__en-us_topic_0000001472937921_li573684212503">9</a>.</li></ul>
</p></li><li id="ALM-45646__en-us_topic_0000001472937921_li4736164255016"><a name="ALM-45646__en-us_topic_0000001472937921_li4736164255016"></a><a name="en-us_topic_0000001472937921_li4736164255016"></a><span>Restart the job and check whether the alarm is cleared.</span><p><ul id="ALM-45646__en-us_topic_0000001472937921_ul773611421507"><li id="ALM-45646__en-us_topic_0000001472937921_li6736184215505">If yes, no further action is required.</li><li id="ALM-45646__en-us_topic_0000001472937921_li17736204295010">If no, go to <a href="#ALM-45646__en-us_topic_0000001472937921_li573684212503">9</a>.</li></ul>
</p></li><li id="ALM-45646__en-us_topic_0000001472937921_li573684212503"><a name="ALM-45646__en-us_topic_0000001472937921_li573684212503"></a><a name="en-us_topic_0000001472937921_li573684212503"></a><span>Contact <span id="ALM-45646__en-us_topic_0000001472937921_text1573624215019">O&M personnel</span> and provide the collected logs.</span></li></ol>
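<p>The check in step 7 can be expressed as a simple comparison. The sketch below assumes that the two metric values have already been read from the <strong>RocksDBMetricPrint</strong> entries in the TaskManager logs. The function name is hypothetical, and the log format itself is not modeled here.</p>

```python
def too_many_pending(flush_pending: int, compaction_pending: int,
                     thread_num: int = 2, multiplier: int = 2) -> bool:
    """Return True if the sum of rocksdb.mem-table-flush-pending and
    rocksdb.compaction-pending is greater than n times the number of
    RocksDB threads, where n is the multiplier."""
    return flush_pending + compaction_pending > multiplier * thread_num
```

<p>With the default values, a sum of 5 exceeds the limit of 4 and points to step 8, whereas a sum of 4 does not.</p>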
</div>
<div class="section" id="ALM-45646__section169311343318"><h4 class="sectiontitle"><span id="ALM-45646__text195945622616">Alarm Clearance</span></h4><p id="ALM-45646__p754913417333">This alarm is automatically cleared after the fault is rectified.</p>
</div>
<div class="section" id="ALM-45646__section4139237"><h4 class="sectiontitle"><span id="ALM-45646__text143698488285">Related Information</span></h4><p id="ALM-45646__p33559471"><span id="ALM-45646__text19275105817121">None.</span></p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>