Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
<a name="ALM-45644"></a><a name="ALM-45644"></a>
<h1 class="topictitle1">ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold</h1>
<div id="body0000001971781038"><p id="ALM-45644__p12261122253615">This section applies to MRS 3.3.0 or later.</p>
<div class="section" id="ALM-45644__section663215"><h4 class="sectiontitle"><span id="ALM-45644__text516373020197">Alarm Description</span></h4><p id="ALM-45644__p66405588">The system checks the RocksDB monitoring data of jobs at the user-specified alarm reporting interval (<strong id="ALM-45644__b3335201121214">metrics.reporter.alarm.job.alarm.rocksdb.metrics.duration</strong>, 180s by default). This alarm is generated when the number of SST files at level 0 of RocksDB for a job continuously exceeds the threshold (<strong id="ALM-45644__b9335151116124">state.backend.rocksdb.level0_slowdown_writes_trigger</strong>, 20 by default). This alarm is cleared when the number of SST files at level 0 of RocksDB for the job is less than or equal to the threshold.</p>
</div>
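<p>The trigger and clear conditions above can be sketched as follows. This is a minimal illustration, not the monitoring implementation: the function names are hypothetical, and it assumes that "continuously exceeds" means every sample collected during the reporting interval is above the threshold.</p>

```python
from typing import Sequence

# Default of state.backend.rocksdb.level0_slowdown_writes_trigger, per this section.
DEFAULT_THRESHOLD = 20


def should_raise(samples: Sequence[int], threshold: int = DEFAULT_THRESHOLD) -> bool:
    """Return True when the alarm would be generated.

    Assumption: "continuously exceeds" means that every level-0 SST file
    count sampled in the reporting interval is strictly above the threshold.
    """
    return len(samples) > 0 and all(count > threshold for count in samples)


def should_clear(latest: int, threshold: int = DEFAULT_THRESHOLD) -> bool:
    """Return True when the count is less than or equal to the threshold."""
    return latest <= threshold
```

<p>For example, under this assumption <code>should_raise([21, 25, 30])</code> is true, while a single sample at or below 20 in the window prevents the alarm.</p>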
<div class="section" id="ALM-45644__section5968939"><h4 class="sectiontitle"><span id="ALM-45644__text20591447192117">Alarm Attributes</span></h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45644__table10143581" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-45644__row61411666"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.1"><p id="ALM-45644__p17386810"><span id="ALM-45644__text1864783145211">Alarm ID</span></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.2"><p id="ALM-45644__p66154394"><span id="ALM-45644__text297913110521">Alarm Severity</span></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.3"><p id="ALM-45644__p49230886"><span id="ALM-45644__text0890175712305">Auto Cleared</span></p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45644__row49774232"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.1 "><p id="ALM-45644__p5180964">45644</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.2 "><p id="ALM-45644__p17004965">Minor</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.3 "><p id="ALM-45644__p35224963">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-45644__section53720453"><h4 class="sectiontitle"><span id="ALM-45644__text18171442142214">Alarm Parameters</span></h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45644__table34649765" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-45644__row18974100"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.2.1.3.1.1"><p id="ALM-45644__p42699947"><span id="ALM-45644__text6203173410617">Parameter</span></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.2.1.3.1.2"><p id="ALM-45644__p36143663"><span id="ALM-45644__text10819164319610">Description</span></p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45644__row16272251424"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45644__p9447153994219">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45644__p144723994214">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45644__row38292076"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45644__p164471639194216">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45644__p44471639174211">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45644__row73049270124"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45644__p24471539104219">ApplicationName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45644__p1944763954213">Specifies the name of the application for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45644__row9875225"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45644__p1244715394427">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45644__p44471439144216">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45644__row13243689"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45644__p1244716397426">JobName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45644__p244713917425">Specifies the job for which the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-45644__section13722030"><h4 class="sectiontitle"><span id="ALM-45644__text98201443182317">Impact on the System</span></h4><p id="ALM-45644__p25040094">This alarm has no adverse impact on the system.</p>
</div>
<div class="section" id="ALM-45644__section56389407"><h4 class="sectiontitle"><span id="ALM-45644__text11871546172411">Possible Causes</span></h4><p id="ALM-45644__p15577205516">Possible causes are as follows:</p>
<ul id="ALM-45644__ul9253205855515"><li id="ALM-45644__li1325425805511">The compaction pressure of RocksDB is too high, and <strong id="ALM-45644__b4930155611315">ALM-45646 Pending Compaction Size of RocksDB Continuously Exceeds the Threshold</strong> and <strong id="ALM-45644__b171461541415">ALM-45647 Estimated Pending Compaction Size of RocksDB Continuously Exceeds the Threshold</strong> are generated.</li><li id="ALM-45644__li4254135815516">There are too many SST files at level 0.</li></ul>
</div>
<div class="section" id="ALM-45644__section178029519112"><h4 class="sectiontitle"><span id="ALM-45644__text79051154102518">Handling Procedure</span></h4><p id="ALM-45644__p69711949311"><strong id="ALM-45644__b55026872012">Check whether the compaction pressure of RocksDB is too high and ALM-45646 is generated.</strong></p>
<ol id="ALM-45644__ol157414375320"><li id="ALM-45644__li95744376312"><span>On FusionInsight Manager, choose <strong id="ALM-45644__b205699912204">O&M</strong> > <strong id="ALM-45644__b8569179172013">Alarm</strong> > <strong id="ALM-45644__b95691198205">Alarms</strong>.</span></li><li id="ALM-45644__li357415376311"><span>In the alarm list, check whether <strong id="ALM-45644__b83331414112015">ALM-45646 Pending Compaction Size of RocksDB Continuously Exceeds the Threshold</strong> exists.</span><p><ul id="ALM-45644__ul3574337237"><li id="ALM-45644__li45741937436">If yes, go to <a href="#ALM-45644__li125749379316">3</a>.</li><li id="ALM-45644__li5574737431">If no, go to <a href="#ALM-45644__li95737371433">5</a>.</li></ul>
</p></li><li id="ALM-45644__li125749379316"><a name="ALM-45644__li125749379316"></a><a name="li125749379316"></a><span>Handle the alarm by following the instructions provided in section <strong id="ALM-45644__b681413382510">ALM-45646 Pending Compaction Size of RocksDB Continuously Exceeds the Threshold</strong>.</span></li><li id="ALM-45644__li18574137033"><span>After ALM-45646 is cleared, wait a few minutes and check whether this alarm is cleared.</span><p><ul id="ALM-45644__ul145741737836"><li id="ALM-45644__li457433717317">If yes, no further action is required.</li><li id="ALM-45644__li1357410373318">If no, go to <a href="#ALM-45644__li95737371433">5</a>.</li></ul>
</p></li></ol>
<p id="ALM-45644__p3916636204711"><strong id="ALM-45644__b432171414317">Check whether the compaction pressure of RocksDB is too high and ALM-45647 is generated.</strong></p>
<ol start="5" id="ALM-45644__ol1257323718320"><li id="ALM-45644__li95737371433"><a name="ALM-45644__li95737371433"></a><a name="li95737371433"></a><span>In the alarm list, check whether <strong id="ALM-45644__b8880102120318">ALM-45647 Estimated Pending Compaction Size of RocksDB Continuously Exceeds the Threshold</strong> exists.</span><p><ul id="ALM-45644__ul75739371237"><li id="ALM-45644__li16573337434">If yes, go to <a href="#ALM-45644__li1857319375312">6</a>.</li><li id="ALM-45644__li175731937137">If no, go to <a href="#ALM-45644__li168007412119">8</a>.</li></ul>
</p></li><li id="ALM-45644__li1857319375312"><a name="ALM-45644__li1857319375312"></a><a name="li1857319375312"></a><span>Handle the alarm by following the instructions provided in section <strong id="ALM-45644__b115491836105917">ALM-45647 Estimated Pending Compaction Size of RocksDB Continuously Exceeds the Threshold</strong>.</span></li><li id="ALM-45644__li105734374318"><span>After ALM-45647 is cleared, wait a few minutes and check whether this alarm is cleared.</span><p><ul id="ALM-45644__ul125732371736"><li id="ALM-45644__li65734371314">If yes, no further action is required.</li><li id="ALM-45644__li2057315371136">If no, go to <a href="#ALM-45644__li168007412119">8</a>.</li></ul>
</p></li></ol>
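<p>The branching in steps 1 to 7 can be summarized as a small transition table. The sketch below is only an illustration of the flow described above; the names <code>TRANSITIONS</code> and <code>next_step</code> are hypothetical, and <code>None</code> stands for "no further action is required".</p>

```python
from typing import Dict, Optional

# Maps a decision step to the step to go to for a "yes" (True) or "no" (False) answer.
# None means that no further action is required.
TRANSITIONS: Dict[int, Dict[bool, Optional[int]]] = {
    2: {True: 3, False: 5},     # Does ALM-45646 exist?
    4: {True: None, False: 5},  # Is this alarm cleared after ALM-45646 is handled?
    5: {True: 6, False: 8},     # Does ALM-45647 exist?
    7: {True: None, False: 8},  # Is this alarm cleared after ALM-45647 is handled?
}


def next_step(step: int, answer: bool) -> Optional[int]:
    """Return the next step number for a decision step, or None when done."""
    return TRANSITIONS[step][answer]
```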
<p id="ALM-45644__p1861816531348"><strong id="ALM-45644__b5513151844111">Check TaskManager logs for the number of SST files at level 0 and collect logs.</strong></p>
<ol start="8" id="ALM-45644__ol88001541511"><li id="ALM-45644__li168007412119"><a name="ALM-45644__li168007412119"></a><a name="li168007412119"></a><span>Log in to FusionInsight Manager as a user who has the FlinkServer management permission.</span></li><li id="ALM-45644__li143241423112412"><span>Choose <strong id="ALM-45644__b1476632616595">O&M</strong> > <strong id="ALM-45644__b8767326185912">Alarm</strong> > <strong id="ALM-45644__b1768162695915">Alarms</strong> > <strong id="ALM-45644__b7769152617597">ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold</strong>, view <strong id="ALM-45644__b13770152614598">Location</strong>, and obtain the name of the task for which the alarm is generated.</span></li><li id="ALM-45644__li18001241614"><span>Choose <strong id="ALM-45644__b1544013513612">Cluster</strong> > <strong id="ALM-45644__b15440535123610">Services</strong> > <strong id="ALM-45644__b14404358364">Yarn</strong> and click the link next to <strong id="ALM-45644__b10440173515363">ResourceManager WebUI</strong> to go to the native Yarn page.</span></li></ol><ol start="11" id="ALM-45644__ol204835294814"><li id="ALM-45644__li174855264820"><span>Locate the abnormal task based on its name displayed in <strong id="ALM-45644__b7672194381417">Location</strong>, search for and record the application ID of the job, and check whether the job logs are available on the Yarn page.</span><p><div class="fignone" id="ALM-45644__en-us_topic_0000001445372489_fig1390461517192"><span class="figcap"><b>Figure 1 </b>Application ID of a job</span><br><span><img id="ALM-45644__image87471018115519" src="en-us_image_0000001971808474.png"></span></div>
<ul id="ALM-45644__ul292912466360"><li id="ALM-45644__li6929146173619">If yes, go to <a href="#ALM-45644__li14941184217233">12</a>.</li><li id="ALM-45644__li10929144611365">If no, go to <a href="#ALM-45644__li0378678118">13</a>.</li></ul>
</p></li><li id="ALM-45644__li14941184217233"><a name="ALM-45644__li14941184217233"></a><a name="li14941184217233"></a><span>Click the application ID of the job for which the alarm is generated to go to the job page.</span><p><ol type="a" id="ALM-45644__en-us_topic_0000001445372489_ol18905161513191"><li id="ALM-45644__en-us_topic_0000001445372489_li090431510192">Click <strong id="ALM-45644__b1822831141513">Logs</strong> in the <strong id="ALM-45644__b172291218155">Logs</strong> column to view JobManager logs.<div class="fignone" id="ALM-45644__en-us_topic_0000001445372489_fig0904115131915"><span class="figcap"><b>Figure 2 </b>Clicking Logs</span><br><span><img id="ALM-45644__en-us_topic_0000001445372489_image290471501913" src="en-us_image_0000002008248489.png"></span></div>
</li><li id="ALM-45644__en-us_topic_0000001445372489_li232434015269">Click the ID in the <strong id="ALM-45644__b971916701520">Attempt ID</strong> column and click <strong id="ALM-45644__b671977151520">Logs</strong> in the <strong id="ALM-45644__b5720157191517">Logs</strong> column to view and save TaskManager logs. Then go to <a href="#ALM-45644__li1924461021119">14</a>.<div class="fignone" id="ALM-45644__en-us_topic_0000001445372489_fig16904101571920"><span class="figcap"><b>Figure 3 </b>Clicking the ID in the Attempt ID column</span><br><span><img id="ALM-45644__en-us_topic_0000001445372489_image1890411511199" src="en-us_image_0000001971648738.png"></span></div>
<div class="fignone" id="ALM-45644__en-us_topic_0000001445372489_fig67971748144610"><span class="figcap"><b>Figure 4 </b>Clicking Logs</span><br><span><img id="ALM-45644__en-us_topic_0000001445372489_image1620681118112" src="en-us_image_0000002008129057.png"></span></div>
<div class="note" id="ALM-45644__en-us_topic_0000001445372489_note126111528152718"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="ALM-45644__en-us_topic_0000001445372489_p14611162814277">You can also log in to Manager as a user who has the management permission for the current Flink job. Choose <strong id="ALM-45644__b1179171981516">Cluster</strong> > <strong id="ALM-45644__b980121918151">Services</strong> > <strong id="ALM-45644__b168013195154">Flink</strong>, and click the link next to <strong id="ALM-45644__b180161941513">Flink WebUI</strong>. On the displayed Flink web UI, click <strong id="ALM-45644__b10801219141514">Job Management</strong>, click <strong id="ALM-45644__b1980119181518">More</strong> in the <strong id="ALM-45644__b780171912153">Operation</strong> column, and select <strong id="ALM-45644__b158014193159">Job Monitoring</strong> to view TaskManager logs.</p>
</div></div>
</li></ol>
</p></li></ol>
<p id="ALM-45644__p197819554109"><strong id="ALM-45644__b13590922151514">If logs are unavailable on the Yarn page, download logs from HDFS.</strong></p>
<ol start="13" id="ALM-45644__ol193787714114"><li id="ALM-45644__li0378678118"><a name="ALM-45644__li0378678118"></a><a name="li0378678118"></a><span>On Manager, choose <strong id="ALM-45644__b12329191131911">Cluster</strong> > <strong id="ALM-45644__b133012117192">Services</strong> > <strong id="ALM-45644__b12331181131913">HDFS</strong>, click the link next to <strong id="ALM-45644__b1133213119196">NameNode WebUI</strong> to go to the HDFS page, choose <strong id="ALM-45644__b633391121914">Utilities</strong> > <strong id="ALM-45644__b1933314112190">Browse the file system</strong>, and download the logs from the <strong id="ALM-45644__b33341811191917">/tmp/logs/</strong><em id="ALM-45644__i1933491181910">Username</em><strong id="ALM-45644__b12335311201915">/bucket-logs-tfile/</strong><em id="ALM-45644__i5336711121911">Last four digits of the task application ID/Application ID of the task</em> directory.</span></li></ol>
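<p>As an illustration of how the log directory in step 13 is composed, the following sketch builds the path from a username and an application ID. The function name <code>yarn_log_dir</code> is hypothetical, as are the username and application ID used in the usage example after the code.</p>

```python
import posixpath


def yarn_log_dir(username: str, application_id: str) -> str:
    """Build /tmp/logs/<Username>/bucket-logs-tfile/<last 4 digits>/<application ID>."""
    suffix = application_id[-4:]
    if len(suffix) != 4 or not suffix.isdigit():
        raise ValueError("application ID must end with four digits")
    return posixpath.join(
        "/tmp/logs", username, "bucket-logs-tfile", suffix, application_id
    )
```

<p>For a hypothetical user <code>flinkuser</code> and application ID <code>application_1700000000000_0042</code>, the function returns <code>/tmp/logs/flinkuser/bucket-logs-tfile/0042/application_1700000000000_0042</code>.</p>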
<p id="ALM-45644__p1737814291119"><strong id="ALM-45644__b051614167196">Check whether the number of SST files at level 0 is too large.</strong></p>
<ol start="14" id="ALM-45644__ol9245610181117"><li id="ALM-45644__li1924461021119"><a name="ALM-45644__li1924461021119"></a><a name="li1924461021119"></a><span>Check whether the value of <strong id="ALM-45644__b524624015195">rocksdb.num-files-at-level0</strong> in TaskManager monitoring logs (keyword <strong id="ALM-45644__b18438657111914">RocksDBMetricPrint</strong>) is greater than or equal to the value of <strong id="ALM-45644__b53981476212">state.backend.rocksdb.level0_slowdown_writes_trigger</strong> or <strong id="ALM-45644__b12545212202114">state.backend.rocksdb.level0_stop_writes_trigger</strong>.</span><p><ul id="ALM-45644__ul1069154719136"><li id="ALM-45644__li5674185319137">If yes, adjust the values of the following custom parameters on the job development page of the Flink web UI, save the settings, and go to <a href="#ALM-45644__li82441810181117">15</a>.
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45644__table8213135018497" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Custom parameters</caption><thead align="left"><tr id="ALM-45644__row22148505496"><th align="left" class="cellrowborder" valign="top" width="38.81%" id="mcps1.3.7.12.1.2.1.1.2.2.4.1.1"><p id="ALM-45644__p14214250174913">Parameter</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="16.950000000000003%" id="mcps1.3.7.12.1.2.1.1.2.2.4.1.2"><p id="ALM-45644__p82149501496">Default Value</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="44.24%" id="mcps1.3.7.12.1.2.1.1.2.2.4.1.3"><p id="ALM-45644__p62153502499">Description</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45644__row22151050184918"><td class="cellrowborder" valign="top" width="38.81%" headers="mcps1.3.7.12.1.2.1.1.2.2.4.1.1 "><p id="ALM-45644__p1215155018496">state.backend.rocksdb.level0_slowdown_writes_trigger</p>
</td>
<td class="cellrowborder" valign="top" width="16.950000000000003%" headers="mcps1.3.7.12.1.2.1.1.2.2.4.1.2 "><p id="ALM-45644__p14215125024916">20</p>
</td>
<td class="cellrowborder" valign="top" width="44.24%" headers="mcps1.3.7.12.1.2.1.1.2.2.4.1.3 "><ul id="ALM-45644__ul61470391386"><li id="ALM-45644__li81479394813">Number of SST files at level 0 that triggers a write slowdown.</li><li id="ALM-45644__li898616258011">A value from <strong id="ALM-45644__b28941949153111">20</strong> to <strong id="ALM-45644__b146391752173114">30</strong> is recommended.</li></ul>
</td>
</tr>
<tr id="ALM-45644__row72157509496"><td class="cellrowborder" valign="top" width="38.81%" headers="mcps1.3.7.12.1.2.1.1.2.2.4.1.1 "><p id="ALM-45644__p18215135019494">state.backend.rocksdb.level0_stop_writes_trigger</p>
</td>
<td class="cellrowborder" valign="top" width="16.950000000000003%" headers="mcps1.3.7.12.1.2.1.1.2.2.4.1.2 "><p id="ALM-45644__p10215185074913">36</p>
</td>
<td class="cellrowborder" valign="top" width="44.24%" headers="mcps1.3.7.12.1.2.1.1.2.2.4.1.3 "><ul id="ALM-45644__ul411510421788"><li id="ALM-45644__li151157421785">Number of SST files at level 0 at which writes are stopped.</li><li id="ALM-45644__li1833573517012">A value from <strong id="ALM-45644__b379952843219">36</strong> to <strong id="ALM-45644__b19799128133215">46</strong> is recommended.</li></ul>
</td>
</tr>
</tbody>
</table>
</div>
</li><li id="ALM-45644__li669194717137">If no, go to <a href="#ALM-45644__li6245710111114">16</a>.</li></ul>
</p></li><li id="ALM-45644__li82441810181117"><a name="ALM-45644__li82441810181117"></a><a name="li82441810181117"></a><span>Restart the job and check whether the alarm is cleared.</span><p><ul id="ALM-45644__ul132441510131114"><li id="ALM-45644__li32441310151110">If yes, no further action is required.</li><li id="ALM-45644__li524451091120">If no, go to <a href="#ALM-45644__li6245710111114">16</a>.</li></ul>
</p></li><li id="ALM-45644__li6245710111114"><a name="ALM-45644__li6245710111114"></a><a name="li6245710111114"></a><span>Contact <span id="ALM-45644__text169973595325">O&M personnel</span> and send the collected logs.</span></li></ol>
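<p>The check in step 14 can be scripted against saved TaskManager logs. The sketch below is a rough aid under stated assumptions: the exact layout of the monitoring log lines is not specified here, so the code assumes only that a line containing the keyword <code>RocksDBMetricPrint</code> carries the metric name <code>rocksdb.num-files-at-level0</code> followed by a short separator and an integer. The function names are hypothetical, and the default thresholds are the defaults listed in Table 1.</p>

```python
import re
from typing import Iterable, List

KEYWORD = "RocksDBMetricPrint"
METRIC = "rocksdb.num-files-at-level0"

# Assumption: the metric name is followed by up to 10 non-digit separator
# characters (for example "=" or ": ") and then the integer value.
_VALUE = re.compile(re.escape(METRIC) + r"\D{0,10}?(\d+)")


def level0_counts(lines: Iterable[str]) -> List[int]:
    """Extract level-0 SST file counts from lines that contain the keyword."""
    counts = []
    for line in lines:
        if KEYWORD not in line:
            continue
        match = _VALUE.search(line)
        if match:
            counts.append(int(match.group(1)))
    return counts


def classify(count: int, slowdown: int = 20, stop: int = 36) -> str:
    """Compare a count with the slowdown and stop triggers (defaults from Table 1)."""
    if count >= stop:
        return "stop"
    if count >= slowdown:
        return "slowdown"
    return "ok"
```

<p>If any extracted count is classified as <code>slowdown</code> or <code>stop</code>, the answer to the check in step 14 is yes.</p>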
</div>
<div class="section" id="ALM-45644__section169311343318"><h4 class="sectiontitle"><span id="ALM-45644__text195945622616">Alarm Clearance</span></h4><p id="ALM-45644__p754913417333">This alarm is automatically cleared after the fault is rectified.</p>
</div>
<div class="section" id="ALM-45644__section4139237"><h4 class="sectiontitle"><span id="ALM-45644__text143698488285">Related Information</span></h4><p id="ALM-45644__p33559471"><span id="ALM-45644__text19275105817121">None.</span></p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>