doc-exports/docs/mrs/umn/ALM-38009.html
Yang, Tong 3b1f73dece MRS UMN 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-13 12:03:34 +00:00


<a name="ALM-38009"></a>
<h1 class="topictitle1">ALM-38009 Busy Broker Disk I/Os (Applicable to Versions Later Than MRS 3.1.0)</h1>
<div id="body1551088161469"><div class="note" id="ALM-38009__note11591101591011"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="ALM-38009__p184731720494">This section applies to versions later than MRS 3.1.0.</p>
</div></div>
<div class="section" id="ALM-38009__section20231696"><h4 class="sectiontitle">Description</h4><p id="ALM-38009__p5536185">The system checks the I/O status of each Kafka disk every 60 seconds. This alarm is generated when the disk I/O of a Kafka data directory on a broker exceeds the threshold (80% by default).</p>
<p id="ALM-38009__p49825671">The <strong id="ALM-38009__b191401379718">Trigger Count</strong> of this alarm is <strong id="ALM-38009__b825663881217">3</strong>. This alarm is cleared when the disk I/O falls below the threshold (80% by default).</p>
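<p>The interaction between the 60-second check, the threshold, and the trigger count can be illustrated with a minimal Python sketch. It assumes that a Trigger Count of 3 means the alarm is raised only after three consecutive checks exceed the threshold, and that a single check at or below the threshold clears it; this section does not spell out either detail. The function name <strong>evaluate_alarm</strong> is hypothetical and is not part of MRS.</p>

```python
from typing import List


def evaluate_alarm(samples: List[float],
                   threshold: float = 80.0,
                   trigger_count: int = 3) -> List[bool]:
    """Return the alarm state after each periodic disk I/O sample.

    Assumption: the alarm is raised once `trigger_count` consecutive
    samples exceed `threshold`, and is cleared by the first sample
    that does not exceed it.
    """
    active = False
    consecutive = 0
    states = []
    for util in samples:
        if util > threshold:
            consecutive += 1
            if consecutive >= trigger_count:
                active = True
        else:
            # One sample at or below the threshold resets the counter
            # and clears an active alarm.
            consecutive = 0
            active = False
        states.append(active)
    return states


# One sample per 60-second check (values are illustrative).
print(evaluate_alarm([85, 90, 70, 85, 88, 91, 95, 60]))
# prints [False, False, False, False, False, True, True, False]
```

<p>In this example the first two high readings do not raise the alarm because the third reading drops to 70. The alarm is raised only at the sixth sample, after three high readings in a row.</p>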
</div>
<div class="section" id="ALM-38009__section47867540"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-38009__table9347550" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-38009__row4979446"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.1"><p id="ALM-38009__p681951">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.2"><p id="ALM-38009__p55238032">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.3"><p id="ALM-38009__p45095602">Auto Clear</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-38009__row28865132"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.1 "><p id="ALM-38009__p56374375">38009</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.2 "><p id="ALM-38009__p2921675">Major</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.3 "><p id="ALM-38009__p35329086">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-38009__section28154684"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-38009__table43083755" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-38009__row22744955"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.2.1.3.1.1"><p id="ALM-38009__p30402032">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.2.1.3.1.2"><p id="ALM-38009__p46645505">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-38009__row1919991618711"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-38009__p17935380415">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-38009__p187931338134115">Specifies the cluster for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-38009__row20189592"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-38009__p41293795">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-38009__p58124657">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-38009__row53359872"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-38009__p23892775">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-38009__p54289578">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-38009__row18844162"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-38009__p14847206">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-38009__p22021457">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-38009__row63975386"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-38009__p166506381974">DataDirectoryName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-38009__p43673025">Specifies the name of the Kafka data directory with frequent disk I/Os.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-38009__section52065568"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-38009__p47854133">The disk partition is handling frequent I/O operations. As a result, data may fail to be written to the Kafka topic for which the alarm is generated.</p>
</div>
<div class="section" id="ALM-38009__section65936928"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-38009__ul15764171217348"><li id="ALM-38009__li79711515153417">Too many replicas are configured for the topic.</li><li id="ALM-38009__li495141919347">The producer parameters for writing messages in batches are inappropriately configured. The service traffic of this topic is too heavy, and the current partition configuration is inappropriate.</li></ul>
</div>
<div class="section" id="ALM-38009__section15975143119164"><h4 class="sectiontitle">Procedure</h4><p id="ALM-38009__p8959174619813"><strong id="ALM-38009__b1482185183320">Check the number of topic replicas.</strong></p>
<ol id="ALM-38009__ol13742833013"><li id="ALM-38009__li16740123103"><span>On FusionInsight Manager, choose <strong id="ALM-38009__b204635361844">O&amp;M</strong> &gt; <strong id="ALM-38009__b111547591348">Alarm</strong> &gt; <strong id="ALM-38009__b11547211953">Alarms</strong>. Locate the row that contains this alarm, click <span><img id="ALM-38009__image12740153301" src="en-us_image_0263895771.png"></span>, and view the host name in <strong id="ALM-38009__b1594258121811">Location</strong>.</span></li><li id="ALM-38009__li186981043114513"><span>On FusionInsight Manager, choose <strong id="ALM-38009__b10725747152110">Cluster</strong>, click the name of the desired cluster, choose <strong id="ALM-38009__b1127081352216">Services</strong> &gt; <strong id="ALM-38009__b1345381542216">Kafka</strong> &gt; <strong id="ALM-38009__b1757862214229">KafkaTopic Monitor</strong>, search for the topic for which the alarm is generated, and check the number of replicas.</span></li><li id="ALM-38009__li8398191175118"><a name="ALM-38009__li8398191175118"></a><a name="li8398191175118"></a><span>If the number of replicas is greater than 3, reduce the replication factor of the topic (for example, to <strong id="ALM-38009__b37311692413">3</strong>).</span><p><p id="ALM-38009__p026891365116">Run the following command on the FusionInsight client to replan the replicas of the Kafka topic:</p>
<p id="ALM-38009__p174173905"><strong id="ALM-38009__b7741193408">kafka-reassign-partitions.sh </strong><strong id="ALM-38009__b13741143602">--zookeeper </strong><em id="ALM-38009__i47411735017">{zk_host}:{port}</em><strong id="ALM-38009__b137411315013">/kafka</strong> <strong id="ALM-38009__b17741183401">--reassignment-json-file<em id="ALM-38009__i167412031409"> </em></strong><em id="ALM-38009__i187411931501">{manual assignment json file path}</em> <strong id="ALM-38009__b17411433014">--</strong><strong id="ALM-38009__b47414318010">execute</strong></p>
<p id="ALM-38009__p57411437020">For example:</p>
<p id="ALM-38009__p147412031305"><strong id="ALM-38009__b1374117318013"><span id="ALM-38009__ph381512063917">/opt/client</span>/Kafka/kafka/bin/kafka-reassign-partitions.sh </strong><strong id="ALM-38009__b57411139014">--zookeeper 10.149.0.90:2181,10.149.0.91:2181,10.149.0.92:2181/kafka </strong><strong id="ALM-38009__b67411631804">--reassignment-json-file expand-cluster-reassignment.json </strong><strong id="ALM-38009__b8741731804">--execute</strong></p>
<div class="note" id="ALM-38009__note1137643519511"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="ALM-38009__p3377123518512">In the <strong id="ALM-38009__b422138263">expand-cluster-reassignment.json</strong> file, describe the brokers to which the partitions of the topic are migrated in the following format: {"partitions":[{"topic": "<em id="ALM-38009__i6519649115120">topicName</em>","partition": 1,"replicas": [1,2,3] }],"version":1}</p>
</div></div>
</p></li><li id="ALM-38009__li18741632013"><span>Observe for a period of time and check whether the alarm is cleared. If the alarm persists, go to <a href="#ALM-38009__li15319131241119">5</a>.</span></li></ol>
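<p>The reassignment file described in step 3 can also be generated instead of written by hand. The following minimal Python sketch produces JSON in the format shown in the note above. It assumes that you have already looked up the current replica list of each partition, and that keeping the first replicas in each list is acceptable for your cluster; verify that the retained brokers are healthy and in sync before using the result. The function name <strong>build_reassignment</strong> and the sample broker IDs are hypothetical.</p>

```python
import json
from typing import Dict, List


def build_reassignment(topic: str,
                       current_replicas: Dict[int, List[int]],
                       target_rf: int = 3) -> dict:
    """Build a reassignment plan that trims each partition to target_rf replicas.

    current_replicas maps a partition number to its current list of
    broker IDs. Keeping the first target_rf entries is an illustrative
    choice, not a recommendation from the product documentation.
    """
    partitions = []
    for partition in sorted(current_replicas):
        replicas = current_replicas[partition][:target_rf]
        partitions.append(
            {"topic": topic, "partition": partition, "replicas": replicas}
        )
    return {"partitions": partitions, "version": 1}


# Hypothetical topic with two partitions and four replicas each.
plan = build_reassignment("topicName", {0: [1, 2, 3, 4], 1: [2, 3, 4, 1]})

# Save this output as the file passed to --reassignment-json-file.
print(json.dumps(plan, indent=2))
```

<p>Each partition keeps its first three brokers, so the first broker in each original list remains first in the new replica list.</p>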
<p id="ALM-38009__p1229810127112"><strong id="ALM-38009__b3554172013293">Check the partition planning of the topic.</strong></p>
<ol start="5" id="ALM-38009__ol16320151213113"><li id="ALM-38009__li15319131241119"><a name="ALM-38009__li15319131241119"></a><a name="li15319131241119"></a><span>On the <strong id="ALM-38009__b272953720298">KafkaTopic Monitor</strong> page, view <strong id="ALM-38009__b132671720301">Topic Input Traffic</strong> in the <strong id="ALM-38009__b42441356152911">Topic Traffic</strong> area of each topic, identify the topic with the largest value, and check the partitions of this topic and the hosts where these partitions reside.</span></li><li id="ALM-38009__li7320112121118"><a name="ALM-38009__li7320112121118"></a><a name="li7320112121118"></a><span>Log in to each host identified in <a href="#ALM-38009__li15319131241119">5</a> and run the <strong id="ALM-38009__b1663311420377">iostat -d -x</strong> command to check the <strong id="ALM-38009__b12333202583712">%util</strong> value of each disk.</span><p><div class="p" id="ALM-38009__p932061219118"><span><img id="ALM-38009__image532216377471" src="en-us_image_0000001441218685.png"></span><ul id="ALM-38009__ul15320121218113"><li id="ALM-38009__li2319161218111">If the <strong id="ALM-38009__b32410093913">%util</strong> value of each disk exceeds the threshold (<strong id="ALM-38009__b11331199796">80%</strong> by default), expand the Kafka disk capacity. After the capacity expansion, replan the topic partitions by referring to <a href="#ALM-38009__li8398191175118">3</a>.</li><li id="ALM-38009__li17320712181116">If the <strong id="ALM-38009__b064311171407">%util</strong> values of the disks vary greatly, check the disk partition configuration of Kafka. 
For example, check the value of <strong id="ALM-38009__b17368156416">log.dirs</strong> in the <strong id="ALM-38009__b378191211415">${BIGDATA_HOME}/FusionInsight_HD_<span id="ALM-38009__text1031961281116">8.1.0.1</span>/1_14_Broker/etc/server.properties</strong> file.<p id="ALM-38009__p731918125111">Run the following command to view the <strong id="ALM-38009__b3687122924114">Filesystem</strong> information:</p>
<p id="ALM-38009__p532071271113"><strong id="ALM-38009__b193191912141112">df -h</strong> <em id="ALM-38009__i132051211114">log.dirs value</em></p>
<p id="ALM-38009__p14320171211119">The command output is as follows.</p>
<p id="ALM-38009__p15615112094816"><span><img id="ALM-38009__image1971821164812" src="en-us_image_0000001441098753.png"></span></p>
</li><li id="ALM-38009__li193201612141119">If the partition shown under Filesystem is the partition with a high <strong id="ALM-38009__b1117242418424">%util</strong> value, plan Kafka partitions on idle disks, set <strong id="ALM-38009__b142991854154212">log.dirs</strong> to a directory on an idle disk, and replan the topic partitions by referring to <a href="#ALM-38009__li8398191175118">3</a>. Ensure that the partitions of the topic are evenly distributed across the disks.</li></ul>
</div>
</p></li><li id="ALM-38009__li8320191217110"><span>Observe for a period of time and check whether the alarm is cleared.</span><p><ul id="ALM-38009__ul8320181212113"><li id="ALM-38009__li332041211114">If yes, no further action is required.</li><li id="ALM-38009__li4320181211116">If no, repeat <a href="#ALM-38009__li15319131241119">5</a> to <a href="#ALM-38009__li7320112121118">6</a> three times. Then, go to <a href="#ALM-38009__li1032011218115">8</a>.</li></ul>
</p></li><li id="ALM-38009__li1032011218115"><a name="ALM-38009__li1032011218115"></a><a name="li1032011218115"></a><span>Observe for a period of time and check whether the alarm is cleared.</span><p><ul id="ALM-38009__ul1320201214113"><li id="ALM-38009__li1632021241114">If yes, no further action is required.</li><li id="ALM-38009__li1332011211114">If no, go to <a href="#ALM-38009__li1473912318017">9</a>.</li></ul>
</p></li></ol>
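<p>The decision in step 6 can be summarized as a small Python sketch that takes the <strong>%util</strong> value of each disk, as read from the <strong>iostat -d -x</strong> output, and suggests which branch applies. The procedure does not quantify "vary greatly", so the 30-point spread used here is an assumed value chosen only for illustration. The function name <strong>diagnose</strong>, the return labels, and the device names are hypothetical.</p>

```python
from typing import Dict


def diagnose(util_by_disk: Dict[str, float],
             threshold: float = 80.0,
             spread_limit: float = 30.0) -> str:
    """Suggest which branch of step 6 applies to a set of %util readings.

    threshold matches the default alarm threshold (80%).
    spread_limit is an assumed cut-off for "vary greatly"; the
    procedure itself gives no number.
    """
    if not util_by_disk:
        raise ValueError("at least one disk reading is required")
    values = list(util_by_disk.values())
    if all(value > threshold for value in values):
        # Every disk is busy: expand capacity, then replan partitions.
        return "expand_capacity"
    if max(values) - min(values) > spread_limit:
        # Load is uneven: review log.dirs and the partition layout.
        return "check_log_dirs"
    return "no_action"


# Readings typed in by hand from the iostat output (illustrative values).
print(diagnose({"sda": 95.0, "sdb": 91.5}))
print(diagnose({"sda": 95.0, "sdb": 12.0}))
```

<p>The first call represents a host where every disk is saturated, and the second represents a host where one disk carries most of the load.</p>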
<p class="tableheading" id="ALM-38009__p545517465591"><strong id="ALM-38009__b6171115114513">Collect fault information.</strong></p>
<ol start="9" id="ALM-38009__ol97391932018"><li id="ALM-38009__li1473912318017"><a name="ALM-38009__li1473912318017"></a><a name="li1473912318017"></a><span>On FusionInsight Manager, choose <strong id="ALM-38009__b0544684455">O&amp;M</strong>. In the navigation pane on the left, choose <strong id="ALM-38009__b154418124519">Log</strong> &gt; <strong id="ALM-38009__b2054518164514">Download</strong>.</span></li><li id="ALM-38009__li1673920310018"><span>Expand the <strong id="ALM-38009__b2383116144512">Service</strong> drop-down list, and select <strong id="ALM-38009__b4383151664511">Kafka</strong> for the target cluster.</span></li><li id="ALM-38009__li187391313012"><span>Click <span><img id="ALM-38009__image97391337014" src="en-us_image_0263895859.png"></span> in the upper right corner, and set <strong id="ALM-38009__b86225975718">Start Date</strong> and <strong id="ALM-38009__b18621459185719">End Date</strong> for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click <strong id="ALM-38009__b1862259125713">Download</strong>.</span></li><li id="ALM-38009__li127391831307"><span>Contact <span id="ALM-38009__text126301214142412">O&amp;M personnel</span> and provide the collected logs.</span></li></ol>
</div>
<div class="section" id="ALM-38009__section169311343318"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-38009__p754913417333">This alarm is automatically cleared after the fault is rectified.</p>
</div>
<div class="section" id="ALM-38009__section39290981"><h4 class="sectiontitle">Related Information</h4><p id="ALM-38009__p43705280">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>