Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
<a name="ALM-45649"></a><a name="ALM-45649"></a>
<h1 class="topictitle1">ALM-45649 P95 Latency of RocksDB Get Requests Continuously Exceeds the Threshold</h1>
<div id="body0000001971621306"><p id="ALM-45649__p12261122253615">This section applies to MRS 3.3.0 or later.</p>
<div class="section" id="ALM-45649__section663215"><h4 class="sectiontitle"><span id="ALM-45649__text516373020197">Alarm Description</span></h4><p id="ALM-45649__p66405588">The system checks the RocksDB monitoring data of jobs at the user-specified alarm reporting interval (<strong id="ALM-45649__b3365358114610">metrics.reporter.alarm.job.alarm.rocksdb.metrics.duration</strong>, 180 seconds by default). This alarm is generated when the P95 latency of RocksDB Get requests exceeds the threshold (<strong id="ALM-45649__b183651358134617">metrics.reporter.alarm.job.alarm.rocksdb.get.micros.threshold</strong>, 50000 microseconds, that is, 50 ms, by default). This alarm is cleared when the P95 latency is less than or equal to the threshold.</p>
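<p>The generate-and-clear rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the MRS implementation: the function names are hypothetical, and the P95 readings are assumed to arrive as one value in microseconds per check interval.</p>

```python
# Illustrative sketch only; the names below are hypothetical, not part of MRS.
# Default of metrics.reporter.alarm.job.alarm.rocksdb.get.micros.threshold.
DEFAULT_THRESHOLD_MICROS = 50_000


def alarm_active(p95_micros, threshold_micros=DEFAULT_THRESHOLD_MICROS):
    """Return True when the P95 Get latency exceeds the threshold.

    A reading equal to the threshold does not raise the alarm, because the
    alarm is cleared when the latency is less than or equal to the threshold.
    """
    return p95_micros > threshold_micros


def evaluate(p95_readings, threshold_micros=DEFAULT_THRESHOLD_MICROS):
    """Map one P95 reading per check interval to the resulting alarm state."""
    return [alarm_active(value, threshold_micros) for value in p95_readings]


if __name__ == "__main__":
    # Hypothetical readings taken at three consecutive check intervals.
    print(evaluate([40_000, 60_000, 50_000]))
```

<p>In this sketch, a reading of exactly 50000 microseconds counts as cleared, which matches the "less than or equal to" wording of the clearance condition.</p>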
</div>
<div class="section" id="ALM-45649__section5968939"><h4 class="sectiontitle"><span id="ALM-45649__text20591447192117">Alarm Attributes</span></h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45649__table10143581" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-45649__row61411666"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.1"><p id="ALM-45649__p17386810"><span id="ALM-45649__text1864783145211">Alarm ID</span></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.2"><p id="ALM-45649__p66154394"><span id="ALM-45649__text297913110521">Alarm Severity</span></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.3"><p id="ALM-45649__p49230886"><span id="ALM-45649__text0890175712305">Auto Cleared</span></p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45649__row49774232"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.1 "><p id="ALM-45649__p5180964">45649</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.2 "><p id="ALM-45649__p17004965">Minor</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.3 "><p id="ALM-45649__p35224963">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-45649__section53720453"><h4 class="sectiontitle"><span id="ALM-45649__text18171442142214">Alarm Parameters</span></h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45649__table34649765" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-45649__row18974100"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.2.1.3.1.1"><p id="ALM-45649__p42699947"><span id="ALM-45649__text6203173410617">Parameter</span></p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.2.1.3.1.2"><p id="ALM-45649__p36143663"><span id="ALM-45649__text10819164319610">Description</span></p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45649__row16272251424"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45649__p9447153994219">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45649__p144723994214">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45649__row38292076"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45649__p164471639194216">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45649__p44471639174211">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45649__row73049270124"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45649__p24471539104219">ApplicationName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45649__p1944763954213">Specifies the name of the application for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45649__row9875225"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45649__p1244715394427">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45649__p44471439144216">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45649__row13243689"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45649__p1244716397426">JobName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45649__p244713917425">Specifies the job for which the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-45649__section13722030"><h4 class="sectiontitle"><span id="ALM-45649__text98201443182317">Impact on the System</span></h4><p id="ALM-45649__p25040094">This alarm has no adverse impact on the system.</p>
</div>
<div class="section" id="ALM-45649__section56389407"><h4 class="sectiontitle"><span id="ALM-45649__text11871546172411">Possible Causes</span></h4><p id="ALM-45649__p1434716211337">The possible causes are as follows:</p>
<ul id="ALM-45649__ul242513305334"><li id="ALM-45649__li842513010334">There are too many SST files at level 0, which slows down queries. In this case, <strong id="ALM-45649__b15165124055819">ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold</strong> is also generated.</li><li id="ALM-45649__li242523043310">The block cache hit ratio is lower than 60%, so blocks are frequently swapped in and out of the block cache.</li></ul>
</div>
<div class="section" id="ALM-45649__section178029519112"><h4 class="sectiontitle"><span id="ALM-45649__text79051154102518">Handling Procedure</span></h4><p class="tableheading" id="ALM-45649__p135760113337"><strong id="ALM-45649__b19962181617176">Check whether the number of SST files at level 0 is too large.</strong></p>
<ol id="ALM-45649__ol1393337134019"><li id="ALM-45649__li19921237194012"><span>On FusionInsight Manager, choose <strong id="ALM-45649__b1524652014171">O&amp;M</strong> &gt; <strong id="ALM-45649__b7246620111718">Alarm</strong> &gt; <strong id="ALM-45649__b192461204175">Alarms</strong>.</span></li><li id="ALM-45649__li7921537174015"><span>In the alarm list, check whether <strong id="ALM-45649__b1450716215222">ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold</strong> exists.</span><p><ul id="ALM-45649__ul792123716404"><li id="ALM-45649__li13921437154012">If yes, go to <a href="#ALM-45649__li19933375403">3</a>.</li><li id="ALM-45649__li892173718400">If no, go to <a href="#ALM-45649__li2091153718402">5</a>.</li></ul>
</p></li><li id="ALM-45649__li19933375403"><a name="ALM-45649__li19933375403"></a><a name="li19933375403"></a><span>Handle the alarm by following the instructions provided in section <strong id="ALM-45649__b185529253226">ALM-45644 Number of SST Files at Level 0 of RocksDB Continuously Exceeds the Threshold</strong>.</span></li><li id="ALM-45649__li79323754014"><span>After ALM-45644 is cleared, wait a few minutes and check whether this alarm is cleared.</span><p><ul id="ALM-45649__ul179383744011"><li id="ALM-45649__li2931337194010">If yes, no further action is required.</li><li id="ALM-45649__li1931837104010">If no, go to <a href="#ALM-45649__li2091153718402">5</a>.</li></ul>
</p></li></ol>
<p id="ALM-45649__p13057358914"><strong id="ALM-45649__b1654684381910">Check the cache hit ratio in TaskManager logs and collect logs.</strong></p>
<ol start="5" id="ALM-45649__ol1928375404"><li id="ALM-45649__li2091153718402"><a name="ALM-45649__li2091153718402"></a><a name="li2091153718402"></a><span>Log in to FusionInsight Manager as a user who has the FlinkServer management permission.</span></li><li id="ALM-45649__li189213375409"><span>Choose <strong id="ALM-45649__b17888531102710">O&amp;M</strong> &gt; <strong id="ALM-45649__b8888163111270">Alarm</strong> &gt; <strong id="ALM-45649__b788863117279">Alarms</strong> &gt; <strong id="ALM-45649__b1788813316270">ALM-45649 P95 Latency of RocksDB Get Requests Continuously Exceeds the Threshold</strong>, view <strong id="ALM-45649__b1688815312277">Location</strong>, and obtain the name of the task for which the alarm is generated.</span></li><li id="ALM-45649__li192103704016"><span>Choose <strong id="ALM-45649__b95362139564">Cluster</strong> &gt; <strong id="ALM-45649__b15537121315618">Services</strong> &gt; <strong id="ALM-45649__b1853781317563">Yarn</strong> and click the link next to <strong id="ALM-45649__b1537111312563">ResourceManager WebUI</strong> to go to the native Yarn page.</span></li></ol><ol start="8" id="ALM-45649__ol52081124175014"><li id="ALM-45649__li10207524155012"><span>Locate the abnormal task based on its name displayed in <strong id="ALM-45649__b228110447455">Location</strong>, search for and record the application ID of the job, and check whether the job logs are available on the Yarn page.</span><p><div class="fignone" id="ALM-45649__fig52072244500"><span class="figcap"><b>Figure 1 </b>Application ID of a job</span><br><span><img id="ALM-45649__image421413235569" src="en-us_image_0000002008248597.png"></span></div>
<ul id="ALM-45649__ul167097362388"><li id="ALM-45649__li47097365384">If yes, go to <a href="#ALM-45649__li192082249500">9</a>.</li><li id="ALM-45649__li470913362388">If no, go to <a href="#ALM-45649__li15924173318501">10</a>.</li></ul>
</p></li><li id="ALM-45649__li192082249500"><a name="ALM-45649__li192082249500"></a><a name="li192082249500"></a><span>Click the application ID of the abnormal job to go to the job page.</span><p><ol type="a" id="ALM-45649__ol2208624155017"><li id="ALM-45649__li18207192475010">Click <strong id="ALM-45649__b62194594610">Logs</strong> in the <strong id="ALM-45649__b1219145104613">Logs</strong> column to view JobManager logs.<div class="fignone" id="ALM-45649__fig620782485019"><span class="figcap"><b>Figure 2 </b>Clicking Logs</span><br><span><img id="ALM-45649__image9207142417503" src="en-us_image_0000001971648838.png"></span></div>
</li><li id="ALM-45649__li5208102419500">Click the ID in the <strong id="ALM-45649__b154161610174615">Attempt ID</strong> column and click <strong id="ALM-45649__b34165109461">Logs</strong> in the <strong id="ALM-45649__b104165105468">Logs</strong> column to view and save TaskManager logs. Then go to <a href="#ALM-45649__li133768138478">11</a>.<div class="fignone" id="ALM-45649__fig92071724195013"><span class="figcap"><b>Figure 3 </b>Clicking the ID in the Attempt ID column</span><br><span><img id="ALM-45649__image17207142475012" src="en-us_image_0000002008129145.png"></span></div>
<div class="fignone" id="ALM-45649__fig1720811242508"><span class="figcap"><b>Figure 4 </b>Clicking Logs</span><br><span><img id="ALM-45649__image1820882414506" src="en-us_image_0000001971808602.png"></span></div>
<div class="note" id="ALM-45649__note1320882412506"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="ALM-45649__p132081324105018">You can also log in to Manager as a user who has the management permission for the current Flink job. Choose <strong id="ALM-45649__b1358212144814">Cluster</strong> &gt; <strong id="ALM-45649__b14582721134816">Services</strong> &gt; <strong id="ALM-45649__b13582172113488">Flink</strong>, and click the link next to <strong id="ALM-45649__b4583112114481">Flink WebUI</strong>. On the displayed Flink web UI, click <strong id="ALM-45649__b16583162119482">Job Management</strong>, click <strong id="ALM-45649__b45831521154817">More</strong> in the <strong id="ALM-45649__b1258342116485">Operation</strong> column, and select <strong id="ALM-45649__b95831521114819">Job Monitoring</strong> to view TaskManager logs.</p>
</div></div>
</li></ol>
</p></li></ol>
<p id="ALM-45649__p197819554109"><strong id="ALM-45649__b26321028174813">If logs are unavailable on the Yarn page, download logs from HDFS.</strong></p>
<ol start="10" id="ALM-45649__ol69241133165017"><li id="ALM-45649__li15924173318501"><a name="ALM-45649__li15924173318501"></a><a name="li15924173318501"></a><span>On Manager, choose <strong id="ALM-45649__b0839123064814">Cluster</strong> &gt; <strong id="ALM-45649__b17839330114811">Services</strong> &gt; <strong id="ALM-45649__b148391430204816">HDFS</strong>, click the link next to <strong id="ALM-45649__b178391630194814">NameNode WebUI</strong> to go to the HDFS page, choose <strong id="ALM-45649__b383915302485">Utilities</strong> &gt; <strong id="ALM-45649__b68394303489">Browse the file system</strong>, and download logs in the <strong id="ALM-45649__b183912309487">/tmp/logs/</strong><em id="ALM-45649__i198391830154813">Username</em><strong id="ALM-45649__b17839930174820">/bucket-logs-tfile/</strong><em id="ALM-45649__i148391030154811">Last four digits of the task application ID/Application ID of the task</em> directory.</span></li></ol>
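<p>The directory named in the step above can be assembled from the username and the application ID recorded earlier. The following Python sketch shows one way to build it. The helper name, the sample username, and the sample application ID are hypothetical; only the path layout comes from this procedure.</p>

```python
# Illustrative sketch only; the helper name and sample values are hypothetical.
def aggregated_log_dir(username, application_id):
    """Build the HDFS directory that holds the aggregated logs of a job.

    Layout taken from this procedure:
    /tmp/logs/<Username>/bucket-logs-tfile/<last four digits of the ID>/<ID>
    """
    suffix = application_id[-4:]
    if len(suffix) != 4 or not suffix.isdigit():
        raise ValueError(f"unexpected application ID: {application_id!r}")
    return f"/tmp/logs/{username}/bucket-logs-tfile/{suffix}/{application_id}"


if __name__ == "__main__":
    # Hypothetical user and application ID.
    print(aggregated_log_dir("flinkuser", "application_1700000000000_0042"))
```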
<p id="ALM-45649__p1737814291119"><strong id="ALM-45649__b179031935164817">Check whether the cache hit ratio is too low.</strong></p>
<ol start="11" id="ALM-45649__ol673654225011"><li id="ALM-45649__li133768138478"><a name="ALM-45649__li133768138478"></a><a name="li133768138478"></a><span>Check the values of <strong id="ALM-45649__b19893736134311">rocksdb.block.cache.hit</strong> (cache hit) and <strong id="ALM-45649__b19431154014432">rocksdb.block.cache.miss</strong> (cache miss) in TaskManager monitoring logs (keyword <strong id="ALM-45649__b160272611113">RocksDBMetricPrint</strong>). Calculate the hit ratio using the following formula and check whether it is less than 60%:</span><p><div class="p" id="ALM-45649__p1545314124711"><strong id="ALM-45649__b667211521472">rocksdb.block.cache.hit/(rocksdb.block.cache.hit+rocksdb.block.cache.miss)</strong><ul id="ALM-45649__ul173694225012"><li id="ALM-45649__li9736442105010">If yes, adjust the values of the following custom parameters on the job development page of the Flink web UI, save the settings, and go to <a href="#ALM-45649__li4736164255016">12</a>.
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45649__table202994919105" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Custom parameters</caption><thead align="left"><tr id="ALM-45649__row730018981018"><th align="left" class="cellrowborder" valign="top" width="31.803180318031803%" id="mcps1.3.7.10.1.2.1.2.1.2.2.4.1.1"><p id="ALM-45649__p14214250174913">Parameter</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="27.052705270527056%" id="mcps1.3.7.10.1.2.1.2.1.2.2.4.1.2"><p id="ALM-45649__p82149501496">Default Value</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="41.14411441144114%" id="mcps1.3.7.10.1.2.1.2.1.2.2.4.1.3"><p id="ALM-45649__p62153502499">Description</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45649__row330019910105"><td class="cellrowborder" valign="top" width="31.803180318031803%" headers="mcps1.3.7.10.1.2.1.2.1.2.2.4.1.1 "><p id="ALM-45649__p2030010914108">state.backend.rocksdb.block.cache-size</p>
</td>
<td class="cellrowborder" valign="top" width="27.052705270527056%" headers="mcps1.3.7.10.1.2.1.2.1.2.2.4.1.2 "><ul id="ALM-45649__ul1777815215188"><li id="ALM-45649__li27786527184"><strong id="ALM-45649__b1277513518167">8MB</strong></li><li id="ALM-45649__li5778145217186"><strong id="ALM-45649__b19678135381617">256MB</strong> when <strong id="ALM-45649__b1067845311610">SPINNING_DISK_OPTIMIZED_HIGH_MEM</strong> is enabled</li></ul>
</td>
<td class="cellrowborder" valign="top" width="41.14411441144114%" headers="mcps1.3.7.10.1.2.1.2.1.2.2.4.1.3 "><ul id="ALM-45649__ul11524125071317"><li id="ALM-45649__li12524150181317">Size of the block cache</li><li id="ALM-45649__li57941337025">A value from <strong id="ALM-45649__b689919421819">8MB</strong> to <strong id="ALM-45649__b941915917187">1GB</strong> is recommended.</li></ul>
</td>
</tr>
<tr id="ALM-45649__row153000961020"><td class="cellrowborder" valign="top" width="31.803180318031803%" headers="mcps1.3.7.10.1.2.1.2.1.2.2.4.1.1 "><p id="ALM-45649__p23002931016">state.backend.rocksdb.block.blocksize</p>
</td>
<td class="cellrowborder" valign="top" width="27.052705270527056%" headers="mcps1.3.7.10.1.2.1.2.1.2.2.4.1.2 "><ul id="ALM-45649__ul153291814191913"><li id="ALM-45649__li0329014101919"><strong id="ALM-45649__b94300249185">4KB</strong></li><li id="ALM-45649__li1132911481917"><strong id="ALM-45649__b1525252610186">128KB</strong> when <strong id="ALM-45649__b20252142601813">SPINNING_DISK_OPTIMIZED_HIGH_MEM</strong> is enabled</li></ul>
</td>
<td class="cellrowborder" valign="top" width="41.14411441144114%" headers="mcps1.3.7.10.1.2.1.2.1.2.2.4.1.3 "><ul id="ALM-45649__ul17883144716138"><li id="ALM-45649__li198831847191319">Block size</li><li id="ALM-45649__li5482195014210">A value from <strong id="ALM-45649__b1157534031814">4KB</strong> to <strong id="ALM-45649__b957654012184">256KB</strong> is recommended.</li></ul>
</td>
</tr>
<tr id="ALM-45649__row53005914105"><td class="cellrowborder" valign="top" width="31.803180318031803%" headers="mcps1.3.7.10.1.2.1.2.1.2.2.4.1.1 "><p id="ALM-45649__p33004912104">state.backend.rocksdb.use-bloom-filter</p>
</td>
<td class="cellrowborder" valign="top" width="27.052705270527056%" headers="mcps1.3.7.10.1.2.1.2.1.2.2.4.1.2 "><p id="ALM-45649__p1730019181019"><strong id="ALM-45649__b1061514525181">false</strong></p>
</td>
<td class="cellrowborder" valign="top" width="41.14411441144114%" headers="mcps1.3.7.10.1.2.1.2.1.2.2.4.1.3 "><ul id="ALM-45649__ul377917466112"><li id="ALM-45649__li1278014467119">Whether to use Bloom filters to speed up lookups. If this parameter is set to <strong id="ALM-45649__b14337147121914">true</strong>, each newly created SST file contains a Bloom filter.</li><li id="ALM-45649__li86692211434"><strong id="ALM-45649__b364318207200">true</strong> is recommended.</li></ul>
</td>
</tr>
</tbody>
</table>
</div>
</li><li id="ALM-45649__li1673694219506">If no, go to <a href="#ALM-45649__li573684212503">13</a>.</li></ul>
</div>
</p></li><li id="ALM-45649__li4736164255016"><a name="ALM-45649__li4736164255016"></a><a name="li4736164255016"></a><span>Restart the job and check whether the alarm is cleared.</span><p><ul id="ALM-45649__ul773611421507"><li id="ALM-45649__li6736184215505">If yes, no further action is required.</li><li id="ALM-45649__li17736204295010">If no, go to <a href="#ALM-45649__li573684212503">13</a>.</li></ul>
</p></li><li id="ALM-45649__li573684212503"><a name="ALM-45649__li573684212503"></a><a name="li573684212503"></a><span>Contact <span id="ALM-45649__text129153352011">O&amp;M personnel</span> and send the collected logs.</span></li></ol>
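<p>The hit ratio check in step 11 and the recommended ranges in Table 1 can be sketched in Python as follows. This is an illustrative sketch only: the function names are hypothetical, the counter values must be read manually from the TaskManager monitoring logs, and size units are assumed to be binary (1 KB = 1024 bytes).</p>

```python
# Illustrative sketch only; function names are hypothetical.
HIT_RATIO_THRESHOLD = 0.60  # the 60% limit used in this procedure

# Assumption: binary units (1 KB = 1024 bytes).
UNITS = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}

# Recommended ranges taken from Table 1.
RECOMMENDED_RANGES = {
    "state.backend.rocksdb.block.cache-size": ("8MB", "1GB"),
    "state.backend.rocksdb.block.blocksize": ("4KB", "256KB"),
}


def block_cache_hit_ratio(hit, miss):
    """Compute hit / (hit + miss) from the two RocksDB counters."""
    total = hit + miss
    if total == 0:
        raise ValueError("no block cache accesses recorded")
    return hit / total


def needs_cache_tuning(hit, miss):
    """Return True when the hit ratio is below 60%."""
    return block_cache_hit_ratio(hit, miss) < HIT_RATIO_THRESHOLD


def to_bytes(size):
    """Convert a size such as '256MB' to a number of bytes."""
    text = size.strip().upper()
    for unit, factor in UNITS.items():
        if text.endswith(unit):
            return int(text[: -len(unit)]) * factor
    raise ValueError(f"unsupported size: {size!r}")


def in_recommended_range(parameter, value):
    """Check a proposed value against the recommended range in Table 1."""
    low, high = RECOMMENDED_RANGES[parameter]
    return to_bytes(low) <= to_bytes(value) <= to_bytes(high)


if __name__ == "__main__":
    # Hypothetical counter values read from a RocksDBMetricPrint log entry.
    print(needs_cache_tuning(hit=450, miss=550))
    print(in_recommended_range("state.backend.rocksdb.block.cache-size", "256MB"))
```

<p>A ratio of exactly 60% is not treated as too low in this sketch, because the procedure only asks whether the ratio is less than 60%.</p>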
</div>
<div class="section" id="ALM-45649__section169311343318"><h4 class="sectiontitle"><span id="ALM-45649__text195945622616">Alarm Clearance</span></h4><p id="ALM-45649__p754913417333">This alarm is automatically cleared after the fault is rectified.</p>
</div>
<div class="section" id="ALM-45649__section4139237"><h4 class="sectiontitle"><span id="ALM-45649__text143698488285">Related Information</span></h4><p id="ALM-45649__p33559471"><span id="ALM-45649__text19275105817121">None.</span></p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>