Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
<a name="ALM-45650"></a><a name="ALM-45650"></a>
<h1 class="topictitle1">ALM-45650 P95 Latency of RocksDB Write Requests Continuously Exceeds the Threshold</h1>
<div id="body0000002008101581"><p id="ALM-45650__p12261122253615">This section applies to MRS 3.3.0 or later.</p>
<div class="section" id="ALM-45650__section663215"><h4 class="sectiontitle"><span id="ALM-45650__text516373020197">Alarm Description</span></h4><p id="ALM-45650__p66405588">The system checks the RocksDB monitoring data of jobs at the user-specified alarm reporting interval (<strong id="ALM-45650__b8681125911231">metrics.reporter.alarm.job.alarm.rocksdb.metrics.duration</strong>, 180s by default). This alarm is generated when the P95 latency of RocksDB write requests exceeds the threshold (<strong id="ALM-45650__b2681165962317">metrics.reporter.alarm.job.alarm.rocksdb.write.micros.threshold</strong>, 50000 microseconds by default). This alarm is cleared when the P95 latency of RocksDB write requests is less than or equal to the threshold.</p>
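<p>As a rough illustration of the rule described above, the following Python sketch compares the P95 of a set of write-latency samples with the default threshold. It is not how the system itself computes the metric: the nearest-rank percentile method, the function names <strong>p95</strong> and <strong>alarm_state</strong>, and any sample values are assumptions made for this example. Only the parameter name and the default threshold of 50000 microseconds come from this section.</p>

```python
# Default of metrics.reporter.alarm.job.alarm.rocksdb.write.micros.threshold
DEFAULT_THRESHOLD_MICROS = 50000


def p95(samples):
    """Return the 95th percentile using the nearest-rank method (an assumption)."""
    if not samples:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(samples)
    # Nearest rank = ceil(0.95 * n), computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def alarm_state(samples, threshold=DEFAULT_THRESHOLD_MICROS):
    """Return "generated" if the P95 exceeds the threshold, otherwise "cleared"."""
    return "generated" if p95(samples) > threshold else "cleared"
```

<p>A P95 that is exactly equal to the threshold yields "cleared", which matches the "less than or equal to" wording of the clearance condition.</p>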
</div>
<div class="section" id="ALM-45650__section5968939"><h4 class="sectiontitle"><span id="ALM-45650__text20591447192117">Alarm Attributes</span></h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45650__table10143581" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-45650__row61411666"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.1"><p id="ALM-45650__p17386810"><span id="ALM-45650__text1864783145211">Alarm ID</span></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.2"><p id="ALM-45650__p66154394"><span id="ALM-45650__text297913110521">Alarm Severity</span></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.3"><p id="ALM-45650__p49230886"><span id="ALM-45650__text0890175712305">Auto Cleared</span></p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45650__row49774232"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.1 "><p id="ALM-45650__p5180964">45650</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.2 "><p id="ALM-45650__p17004965">Minor</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.3 "><p id="ALM-45650__p35224963">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-45650__section53720453"><h4 class="sectiontitle"><span id="ALM-45650__text18171442142214">Alarm Parameters</span></h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45650__table34649765" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-45650__row18974100"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.2.1.3.1.1"><p id="ALM-45650__p42699947"><span id="ALM-45650__text6203173410617">Parameter</span></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.2.1.3.1.2"><p id="ALM-45650__p36143663"><span id="ALM-45650__text10819164319610">Description</span></p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45650__row16272251424"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45650__p9447153994219">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45650__p144723994214">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45650__row38292076"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45650__p164471639194216">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45650__p44471639174211">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45650__row73049270124"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45650__p24471539104219">ApplicationName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45650__p1944763954213">Specifies the name of the application for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45650__row9875225"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45650__p1244715394427">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45650__p44471439144216">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45650__row13243689"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45650__p1244716397426">JobName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45650__p244713917425">Specifies the job for which the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-45650__section13722030"><h4 class="sectiontitle"><span id="ALM-45650__text98201443182317">Impact on the System</span></h4><p id="ALM-45650__p25040094">This alarm has no adverse impact on the system.</p>
</div>
<div class="section" id="ALM-45650__section56389407"><h4 class="sectiontitle"><span id="ALM-45650__text11871546172411">Possible Causes</span></h4><p id="ALM-45650__p197241321135718">The possible causes are as follows:</p>
<ul id="ALM-45650__ul1366381545311"><li id="ALM-45650__li161428512312">There are too many MemTables. As a result, RocksDB throttles or stops writes, and <strong id="ALM-45650__b7885625102817">ALM-45643 MemTable Size of RocksDB Continuously Exceeds the Threshold</strong> is generated.</li></ul>
<ul id="ALM-45650__ul11262005419"><li id="ALM-45650__li151421651638">There are too many SST files at level 0, and <strong id="ALM-45650__b78961832112819">ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold</strong> is generated.</li></ul>
<ul id="ALM-45650__ul1014235115317"><li id="ALM-45650__li114216511339">The estimated pending compaction size exceeds the threshold, and <strong id="ALM-45650__b423218380287">ALM-45647 Estimated Pending Compaction Size of RocksDB Continuously Exceeds the Threshold</strong> is generated.</li></ul>
</div>
<div class="section" id="ALM-45650__section4437121113517"><h4 class="sectiontitle"><span id="ALM-45650__text79051154102518">Handling Procedure</span></h4><p id="ALM-45650__p69711949311"><strong id="ALM-45650__b690784214411">Check whether too many MemTables are throttling or stopping writes.</strong></p>
<ol id="ALM-45650__ol188231431101818"><li id="ALM-45650__li19823113101810"><span>On FusionInsight Manager, choose <strong id="ALM-45650__b11738174819446">O&amp;M</strong> &gt; <strong id="ALM-45650__b2738748154412">Alarm</strong> &gt; <strong id="ALM-45650__b1873911481446">Alarms</strong>.</span></li><li id="ALM-45650__li1282393131814"><span>In the alarm list, check whether <strong id="ALM-45650__b1569084214288">ALM-45643 MemTable Size of RocksDB Continuously Exceeds the Threshold</strong> exists.</span><p><ul id="ALM-45650__ul3823173118186"><li id="ALM-45650__li6823183117187">If yes, go to <a href="#ALM-45650__li6823133191818">3</a>.</li><li id="ALM-45650__li28231314187">If no, go to <a href="#ALM-45650__li88241531181819">5</a>.</li></ul>
</p></li><li id="ALM-45650__li6823133191818"><a name="ALM-45650__li6823133191818"></a><a name="li6823133191818"></a><span>Handle the alarm by following the instructions provided in section <strong id="ALM-45650__b122927478281">ALM-45643 MemTable Size of RocksDB Continuously Exceeds the Threshold</strong>.</span></li><li id="ALM-45650__li88231631161811"><span>After ALM-45643 is cleared, wait a few minutes and check whether this alarm is cleared.</span><p><ul id="ALM-45650__ul482318318185"><li id="ALM-45650__li78239313180">If yes, no further action is required.</li><li id="ALM-45650__li178231131131815">If no, go to <a href="#ALM-45650__li88241531181819">5</a>.</li></ul>
</p></li></ol>
<p class="tableheading" id="ALM-45650__p135760113337"><strong id="ALM-45650__b8798191004510">Check whether the number of SST files at level 0 is too large.</strong></p>
<ol start="5" id="ALM-45650__ol982443111184"><li id="ALM-45650__li88241531181819"><a name="ALM-45650__li88241531181819"></a><a name="li88241531181819"></a><span>On FusionInsight Manager, choose <strong id="ALM-45650__b1776382215456">O&amp;M</strong> &gt; <strong id="ALM-45650__b7763102214510">Alarm</strong> &gt; <strong id="ALM-45650__b1876372213451">Alarms</strong>.</span></li><li id="ALM-45650__li19824331181813"><span>In the alarm list, check whether <strong id="ALM-45650__b10451135102811">ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold</strong> exists.</span><p><ul id="ALM-45650__ul1824231121813"><li id="ALM-45650__li138241531191813">If yes, go to <a href="#ALM-45650__li1982483112189">7</a>.</li><li id="ALM-45650__li882419316180">If no, go to <a href="#ALM-45650__li1485215516483">9</a>.</li></ul>
</p></li><li id="ALM-45650__li1982483112189"><a name="ALM-45650__li1982483112189"></a><a name="li1982483112189"></a><span>Handle the alarm by following the instructions provided in section <strong id="ALM-45650__b18913195510283">ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold</strong>.</span></li><li id="ALM-45650__li1382415317188"><span>After ALM-45644 is cleared, wait a few minutes and check whether this alarm is cleared.</span><p><ul id="ALM-45650__ul172191622161214"><li id="ALM-45650__li8219122211121">If yes, no further action is required.</li><li id="ALM-45650__li3219922111211">If no, go to <a href="#ALM-45650__li1485215516483">9</a>.</li></ul>
</p></li></ol>
<p id="ALM-45650__p3916636204711"><strong id="ALM-45650__b1924513475457">Check whether the estimated pending compaction size exceeds the threshold.</strong></p>
<ol start="9" id="ALM-45650__ol1929554484711"><li id="ALM-45650__li1485215516483"><a name="ALM-45650__li1485215516483"></a><a name="li1485215516483"></a><span>In the alarm list, check whether <strong id="ALM-45650__b679718062911">ALM-45647 Estimated Pending Compaction Size of RocksDB Continuously Exceeds the Threshold</strong> exists.</span><p><ul id="ALM-45650__ul188529516481"><li id="ALM-45650__li158521854484">If yes, go to <a href="#ALM-45650__li1285312554818">10</a>.</li><li id="ALM-45650__li4853759485">If no, go to <a href="#ALM-45650__li1826072651812">12</a>.</li></ul>
</p></li><li id="ALM-45650__li1285312554818"><a name="ALM-45650__li1285312554818"></a><a name="li1285312554818"></a><span>Handle the alarm by following the instructions provided in section <strong id="ALM-45650__b93528492914">ALM-45647 Estimated Pending Compaction Size of RocksDB Continuously Exceeds the Threshold</strong>.</span></li><li id="ALM-45650__li1585311516489"><span>After ALM-45647 is cleared, wait a few minutes and check whether this alarm is cleared.</span><p><ul id="ALM-45650__ul1560851651215"><li id="ALM-45650__li260821611124">If yes, no further action is required.</li><li id="ALM-45650__li176081216161213">If no, go to <a href="#ALM-45650__li1826072651812">12</a>.</li></ul>
</p></li></ol>
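<p>The three checks above follow a fixed order, which can be summarized in a short Python sketch. The function name <strong>next_step</strong> and the idea of passing in a list of active alarm IDs are hypothetical and used only for illustration. The alarm IDs and step numbers are taken from this procedure, and FusionInsight Manager does not necessarily expose the alarm list in this form.</p>

```python
# Related alarms in the order the handling procedure checks them,
# mapped to the step that handles each one.
RELATED_ALARMS = [
    ("ALM-45643", 3),   # MemTable size exceeds the threshold
    ("ALM-45644", 7),   # number of level-0 SST files exceeds the threshold
    ("ALM-45647", 10),  # estimated pending compaction size exceeds the threshold
]
COLLECT_FAULT_INFO_STEP = 12


def next_step(active_alarm_ids):
    """Return (alarm_id, step) for the first related alarm that is active.

    If none of the related alarms is active, return (None, 12), meaning
    that fault information should be collected.
    """
    active = set(active_alarm_ids)
    for alarm_id, step in RELATED_ALARMS:
        if alarm_id in active:
            return alarm_id, step
    return None, COLLECT_FAULT_INFO_STEP
```

<p>For example, if both ALM-45644 and ALM-45647 are active, the sketch returns step 7 first. The compaction alarm is considered only after ALM-45644 has been handled and this alarm still persists.</p>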
<p id="ALM-45650__p17751320141817"><strong id="ALM-45650__b1439041919464">Collect fault information.</strong></p>
<ol start="12" id="ALM-45650__ol626019269181"><li id="ALM-45650__li1826072651812"><a name="ALM-45650__li1826072651812"></a><a name="li1826072651812"></a><span>Log in to FusionInsight Manager as a user who has the FlinkServer management permission.</span></li><li id="ALM-45650__li143241423112412"><span>Choose <strong id="ALM-45650__b189484135290">O&amp;M</strong> &gt; <strong id="ALM-45650__b1994911136290">Alarm</strong> &gt; <strong id="ALM-45650__b109492013122918">Alarms</strong> &gt; <strong id="ALM-45650__b79491137299">ALM-45650 P95 Latency of RocksDB Write Requests Continuously Exceeds the Threshold</strong>, view <strong id="ALM-45650__b394981314297">Location</strong>, and obtain the name of the task for which the alarm is generated.</span></li><li id="ALM-45650__li112601426191813"><span>Choose <strong id="ALM-45650__b17846142610561">Cluster</strong> &gt; <strong id="ALM-45650__b184632615619">Services</strong> &gt; <strong id="ALM-45650__b68461226205610">Yarn</strong> and click the link next to <strong id="ALM-45650__b13846926165617">ResourceManager WebUI</strong> to go to the native Yarn page.</span></li></ol><ol start="15" id="ALM-45650__ol53961455194712"><li id="ALM-45650__li18750195794719"><span>Locate the abnormal task based on its name displayed in <strong id="ALM-45650__b1454012546461">Location</strong>, search for and record the application ID of the job, and check whether the job logs are available on the Yarn page.</span><p><div class="fignone" id="ALM-45650__en-us_topic_0000001445372489_fig1390461517192"><span class="figcap"><b>Figure 1 </b>Application ID of a job</span><br><span><img id="ALM-45650__image178248376568" src="en-us_image_0000002008248601.png"></span></div>
<ul id="ALM-45650__ul112631311128"><li id="ALM-45650__li22611331219">If yes, go to <a href="#ALM-45650__li14941184217233">16</a>.</li><li id="ALM-45650__li1626151371216">If no, go to <a href="#ALM-45650__li42141433171631">18</a>.</li></ul>
</p></li><li id="ALM-45650__li14941184217233"><a name="ALM-45650__li14941184217233"></a><a name="li14941184217233"></a><span>Click the application ID of the failed job to go to the job page.</span><p><ol type="a" id="ALM-45650__en-us_topic_0000001445372489_ol18905161513191"><li id="ALM-45650__en-us_topic_0000001445372489_li090431510192">Click <strong id="ALM-45650__b1716112128479">Logs</strong> in the <strong id="ALM-45650__b171611712164715">Logs</strong> column to view JobManager logs.<div class="fignone" id="ALM-45650__en-us_topic_0000001445372489_fig0904115131915"><span class="figcap"><b>Figure 2 </b>Clicking Logs</span><br><span><img id="ALM-45650__en-us_topic_0000001445372489_image290471501913" src="en-us_image_0000001971648850.png"></span></div>
</li><li id="ALM-45650__en-us_topic_0000001445372489_li232434015269">Click the ID in the <strong id="ALM-45650__b386519154713">Attempt ID</strong> column and click <strong id="ALM-45650__b8861319184714">Logs</strong> in the <strong id="ALM-45650__b1686141924717">Logs</strong> column to view and save TaskManager logs.<div class="fignone" id="ALM-45650__en-us_topic_0000001445372489_fig16904101571920"><span class="figcap"><b>Figure 3 </b>Clicking the ID in the Attempt ID column</span><br><span><img id="ALM-45650__en-us_topic_0000001445372489_image1890411511199" src="en-us_image_0000002008129157.png"></span></div>
<div class="fignone" id="ALM-45650__en-us_topic_0000001445372489_fig67971748144610"><span class="figcap"><b>Figure 4 </b>Clicking Logs</span><br><span><img id="ALM-45650__en-us_topic_0000001445372489_image1620681118112" src="en-us_image_0000001971808618.png"></span></div>
<div class="note" id="ALM-45650__en-us_topic_0000001445372489_note126111528152718"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="ALM-45650__en-us_topic_0000001445372489_p14611162814277">You can also log in to Manager as a user who has the FlinkServer management permission. Choose <strong id="ALM-45650__b182591130144712">Cluster</strong> &gt; <strong id="ALM-45650__b1525983094712">Services</strong> &gt; <strong id="ALM-45650__b02591430104710">Flink</strong>, and click the link next to <strong id="ALM-45650__b14259830154717">Flink WebUI</strong>. On the displayed Flink web UI, click <strong id="ALM-45650__b19259143004716">Job Management</strong>, click <strong id="ALM-45650__b1025973016479">More</strong> in the <strong id="ALM-45650__b1126033044719">Operation</strong> column, and select <strong id="ALM-45650__b1260103016478">Job Monitoring</strong> to view TaskManager logs.</p>
</div></div>
</li></ol>
</p></li><li id="ALM-45650__li7740734122412"><span>View the job logs to rectify the fault, or contact <span id="ALM-45650__text717863264715">O&amp;M personnel</span> and send them the collected fault logs. No further action is required.</span></li></ol>
<p class="tableheading" id="ALM-45650__p22784204171641"><strong id="ALM-45650__b1588514341477">If logs are unavailable on the Yarn page, download logs from HDFS.</strong></p>
<ol start="18" id="ALM-45650__ol34389030171648"><li id="ALM-45650__li42141433171631"><a name="ALM-45650__li42141433171631"></a><a name="li42141433171631"></a><span>On Manager, choose <strong id="ALM-45650__b108351136144712">Cluster</strong> &gt; <strong id="ALM-45650__b383533614470">Services</strong> &gt; <strong id="ALM-45650__b1583517360470">HDFS</strong>, click the link next to <strong id="ALM-45650__b183503620470">NameNode WebUI</strong> to go to the HDFS page, choose <strong id="ALM-45650__b9836193610474">Utilities</strong> &gt; <strong id="ALM-45650__b183616367474">Browse the file system</strong>, and download the logs from the <strong id="ALM-45650__b1983620364474">/tmp/logs/</strong><em id="ALM-45650__i8836143610474">Username</em><strong id="ALM-45650__b14836153674712">/bucket-logs-tfile/</strong><em id="ALM-45650__i98361336124710">Last four digits of the task application ID/Application ID of the task</em> directory.</span></li><li id="ALM-45650__li12839885171631"><span>View the logs of the failed job to rectify the fault, or contact <span id="ALM-45650__text3385438134717">O&amp;M personnel</span> and send them the collected fault logs.</span></li></ol>
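<p>The log directory named in the preceding step can be assembled from the username and the application ID. The following Python sketch shows one way to do this. The function name <strong>yarn_log_dir</strong> is hypothetical, and the check on the <strong>application_</strong> prefix assumes the usual Yarn application ID format. The path layout itself is the one given in this procedure.</p>

```python
def yarn_log_dir(username, application_id):
    """Build the HDFS directory that holds the aggregated logs of a job.

    Layout (from the procedure above):
    /tmp/logs/<Username>/bucket-logs-tfile/<last four digits of the ID>/<application ID>
    """
    # Assumption: application IDs look like application_<timestamp>_<sequence>.
    if not application_id.startswith("application_"):
        raise ValueError("unexpected application ID: " + application_id)
    last_four = application_id[-4:]
    return f"/tmp/logs/{username}/bucket-logs-tfile/{last_four}/{application_id}"
```

<p>Pass the application ID exactly as recorded on the Yarn page. The sketch only builds the path string and does not access HDFS.</p>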
</div>
<div class="section" id="ALM-45650__section169311343318"><h4 class="sectiontitle"><span id="ALM-45650__text195945622616">Alarm Clearance</span></h4><p id="ALM-45650__p754913417333">This alarm is automatically cleared after the fault is rectified.</p>
</div>
<div class="section" id="ALM-45650__section4139237"><h4 class="sectiontitle"><span id="ALM-45650__text143698488285">Related Information</span></h4><p id="ALM-45650__p33559471"><span id="ALM-45650__text19275105817121">None.</span></p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>