doc-exports/docs/mrs/umn/ALM-45641.html
Yang, Tong 2195db241c MRS UMN 20231220 version update
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com>
Reviewed-by: Rechenburg, Matthias <matthias.rechenburg@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2024-05-16 09:40:21 +00:00


<a name="ALM-45641"></a><a name="ALM-45641"></a>
<h1 class="topictitle1">ALM-45641 Data Synchronization Exception Between the Active and Standby FlinkServer Nodes</h1>
<div id="body0000001349190156"><p id="ALM-45641__p15858191817246">This section applies to MRS 3.2.0<span id="ALM-45641__ph174355293719">-LTS.2</span> or later.</p>
<div class="section" id="ALM-45641__section663215"><h4 class="sectiontitle">Description</h4><p id="ALM-45641__p7378398">The system checks data synchronization between the active and standby FlinkServer nodes every 60 seconds. This alarm is generated when the standby FlinkServer node fails to synchronize files with the active FlinkServer node.</p>
<p id="ALM-45641__p66405588">This alarm is cleared when the standby FlinkServer node successfully synchronizes files with the active FlinkServer node.</p>
</div>
<div class="section" id="ALM-45641__section5968939"><h4 class="sectiontitle">Attribute</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45641__table10143581" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-45641__row61411666"><th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.1"><p id="ALM-45641__p8289053">Alarm ID</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.2"><p id="ALM-45641__p324668">Alarm Severity</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="33.33333333333333%" id="mcps1.3.3.2.1.4.1.3"><p id="ALM-45641__p26298136">Auto Clear</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45641__row49774232"><td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.1 "><p id="ALM-45641__p5180964">45641</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.2 "><p id="ALM-45641__p17004965">Major</p>
</td>
<td class="cellrowborder" valign="top" width="33.33333333333333%" headers="mcps1.3.3.2.1.4.1.3 "><p id="ALM-45641__p35224963">Yes</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-45641__section53720453"><h4 class="sectiontitle">Parameters</h4>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="ALM-45641__table34649765" frame="border" border="1" rules="all"><thead align="left"><tr id="ALM-45641__row18974100"><th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.2.1.3.1.1"><p id="ALM-45641__p60507121">Name</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="50%" id="mcps1.3.4.2.1.3.1.2"><p id="ALM-45641__p2129750">Meaning</p>
</th>
</tr>
</thead>
<tbody><tr id="ALM-45641__row16272251424"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45641__p17935380415">Source</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45641__p187931338134115">Specifies the cluster or system for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45641__row38292076"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45641__p14650458">ServiceName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45641__p45836489">Specifies the service for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45641__row9875225"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45641__p61695723">RoleName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45641__p31297682">Specifies the role for which the alarm is generated.</p>
</td>
</tr>
<tr id="ALM-45641__row13243689"><td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.1 "><p id="ALM-45641__p66105898">HostName</p>
</td>
<td class="cellrowborder" valign="top" width="50%" headers="mcps1.3.4.2.1.3.1.2 "><p id="ALM-45641__p52977523">Specifies the host for which the alarm is generated.</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="section" id="ALM-45641__section13722030"><h4 class="sectiontitle">Impact on the System</h4><p id="ALM-45641__p25040094">Because the configuration files on the standby FlinkServer are not updated, some configurations will be lost after an active/standby switchover. FlinkServer and some components may not run properly.</p>
</div>
<div class="section" id="ALM-45641__section56389407"><h4 class="sectiontitle">Possible Causes</h4><ul id="ALM-45641__ul7427111141811"><li id="ALM-45641__li124271512180">The link between the active and standby FlinkServer nodes is interrupted.</li><li id="ALM-45641__li1067055151815">The synchronization file does not exist or lacks the required permissions.</li></ul>
</div>
<div class="section" id="ALM-45641__section37742617"><h4 class="sectiontitle">Procedure</h4><p class="tableheading" id="ALM-45641__p5563313"><strong id="ALM-45641__b12641827698">Check whether the network between the active and standby FlinkServer nodes is normal.</strong></p>
<ol id="ALM-45641__ol30860579171637"><li id="ALM-45641__li46586441171631"><span>On <span id="ALM-45641__text34789336432">MRS</span> Manager, choose <strong id="ALM-45641__b218121284115">Cluster</strong> &gt; <strong id="ALM-45641__b122591110154111">Services</strong> &gt; <strong id="ALM-45641__b14930475412">Flink</strong> &gt; <strong id="ALM-45641__b103821615154113">Instance</strong>. View and record the IP addresses of the active and standby FlinkServer nodes.</span></li><li id="ALM-45641__li15405401171631"><span>Log in to the active FlinkServer node as the <strong id="ALM-45641__b161068521793">root</strong> user. <span id="ALM-45641__text7780723164819"></span></span></li><li id="ALM-45641__li35813712171631"><span>Run the following command to check whether the standby FlinkServer is reachable:</span><p><p class="subitemlist" id="ALM-45641__p516782102610"><strong id="ALM-45641__b3429315141116">ping</strong> <em id="ALM-45641__i18435141513110">IP address of the standby FlinkServer</em></p>
<ul class="subitemlist" id="ALM-45641__ul18892382171631"><li id="ALM-45641__li12913763171631">If yes, go to <a href="#ALM-45641__li18750195794719">6</a>.</li><li id="ALM-45641__li39381855171631">If no, go to <a href="#ALM-45641__li63406959171631">4</a>.</li></ul>
</p></li><li id="ALM-45641__li63406959171631"><a name="ALM-45641__li63406959171631"></a><a name="li63406959171631"></a><span>Contact the network administrator to check whether the network is faulty.</span><p><ul class="subitemlist" id="ALM-45641__ul29414839171631"><li id="ALM-45641__li15229598171631">If yes, go to <a href="#ALM-45641__li19595462171631">5</a>.</li><li id="ALM-45641__li25637936171631">If no, go to <a href="#ALM-45641__li18750195794719">6</a>.</li></ul>
</p></li><li id="ALM-45641__li19595462171631"><a name="ALM-45641__li19595462171631"></a><a name="li19595462171631"></a><span>Rectify the network fault and check whether the alarm is cleared from the alarm list.</span><p><ul class="subitemlist" id="ALM-45641__ul5212946171631"><li id="ALM-45641__li33791722171631">If yes, no further action is required.</li><li id="ALM-45641__li52774999171631">If no, go to <a href="#ALM-45641__li18750195794719">6</a>.</li></ul>
</p></li></ol>
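<p>The reachability check in the preceding steps can also be scripted. The following is a minimal sketch, assuming a Linux <strong>ping</strong> that accepts the <strong>-c</strong> and <strong>-W</strong> options. The function name <strong>check_reachable</strong>, the <strong>PING_BIN</strong> override, and the example address are hypothetical and are not part of MRS.</p>

```shell
#!/usr/bin/env bash
# Sketch only: map the result of a ping probe to the next step of this procedure.
# PING_BIN is a hypothetical override that lets you substitute another probe command.

check_reachable() {
    local ip="$1"
    # -c 3: send three echo requests; -W 2: wait up to 2 seconds for each reply.
    if "${PING_BIN:-ping}" -c 3 -W 2 "$ip" >/dev/null 2>&1; then
        echo "reachable: go to step 6"
    else
        echo "unreachable: go to step 4"
    fi
}

# Example (replace the placeholder with the standby FlinkServer IP address):
#   check_reachable 192.0.2.10
```

<p>The function only reports the result. Whether a failed probe indicates a network fault still needs to be confirmed with the network administrator.</p>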
<p id="ALM-45641__p1812717455469"><strong id="ALM-45641__b13271131164218">Check whether the storage space of the /srv/BigData/LocalBackup directory is full.</strong></p>
<ol start="6" id="ALM-45641__ol53961455194712"><li id="ALM-45641__li18750195794719"><a name="ALM-45641__li18750195794719"></a><a name="li18750195794719"></a><span>Run the following command to check whether the storage space of the <strong id="ALM-45641__b17298152354319">/srv/BigData/LocalBackup</strong> directory is full:</span><p><div class="p" id="ALM-45641__p173459562613"><strong id="ALM-45641__b7694131719303">df -hl /srv/BigData/LocalBackup</strong><ul id="ALM-45641__ul8957005261"><li id="ALM-45641__li99573012263">If yes, go to <a href="#ALM-45641__li7740734122412">7</a>.</li><li id="ALM-45641__li19572018264">If no, go to <a href="#ALM-45641__li13330195272015">10</a>.</li></ul>
</div>
</p></li><li id="ALM-45641__li7740734122412"><a name="ALM-45641__li7740734122412"></a><a name="li7740734122412"></a><span>Run the following command to clear unnecessary backup files:</span><p><p id="ALM-45641__p175572566244"><strong id="ALM-45641__b1455705612241">rm -rf</strong><strong id="ALM-45641__b355712563240"> </strong> <em id="ALM-45641__i1323334417433">Directory to be cleared</em></p>
<p id="ALM-45641__p20279915153115">The following is an example:</p>
<p id="ALM-45641__p542742082519"><strong id="ALM-45641__b3427132016256">rm -rf </strong><strong id="ALM-45641__b124271920172511">/srv/BigData/LocalBackup/0/default-oms_20191211143443</strong></p>
</p></li><li id="ALM-45641__li16472102242512"><span>On <span id="ALM-45641__text10815337244">MRS</span> Manager, choose <strong id="ALM-45641__b1081245611437">O&amp;M</strong> &gt; <strong id="ALM-45641__b188121456114318">Backup and Restoration</strong> &gt; <strong id="ALM-45641__b1981319568435">Backup Management</strong>.</span><p><p id="ALM-45641__p3483142211255">In the <strong id="ALM-45641__b101881230204411">Operation</strong> column of the backup task, click <strong id="ALM-45641__b3871114174410">Configure</strong> and change the value of <strong id="ALM-45641__b48461046174415">Maximum Number of Backup Copies</strong> to reduce the number of backup file sets.</p>
</p></li><li id="ALM-45641__li652028155317"><span>Wait for 1 minute and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-45641__ul1766732614557"><li id="ALM-45641__li18667162610554">If yes, no further action is required.</li><li id="ALM-45641__li5667112610551">If no, go to <a href="#ALM-45641__li13330195272015">10</a>.</li></ul>
</p></li></ol>
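<p>The <strong>df</strong> output in the preceding steps has to be read manually. The following is a minimal sketch of the same check as a script, assuming that a usage of 95% or higher counts as full. This threshold and the function names <strong>usage_percent</strong> and <strong>is_nearly_full</strong> are assumptions for illustration, not values defined by MRS.</p>

```shell
#!/usr/bin/env bash
# Sketch only: report how full the file system that holds a directory is.

# Print the used percentage (integer without the % sign) for the file system containing $1.
usage_percent() {
    # -P forces the POSIX one-line-per-file-system format; field 5 is the capacity, for example "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if usage is at or above the threshold (default 95, an assumed value).
is_nearly_full() {
    local dir="$1" threshold="${2:-95}" used
    used=$(usage_percent "$dir")
    [ -n "$used" ] || return 2
    [ "$used" -ge "$threshold" ]
}

# Example:
#   is_nearly_full /srv/BigData/LocalBackup && echo "go to step 7"
```

<p>Deleting backup files is intentionally left out of the sketch, because only the operator can decide which backup sets are unnecessary.</p>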
<p id="ALM-45641__p17556109182017"><strong id="ALM-45641__b170194734313">Check whether the synchronization file exists and whether the file permission is valid.</strong></p>
<ol start="10" id="ALM-45641__ol103301752162013"><li id="ALM-45641__li13330195272015"><a name="ALM-45641__li13330195272015"></a><a name="li13330195272015"></a><span>Run the following commands to check whether the synchronization file exists:</span><p><p id="ALM-45641__p0988112182215"><strong id="ALM-45641__b195657722215">find /srv/BigData/ -name "sed*"</strong></p>
<p id="ALM-45641__p179881124222"><strong id="ALM-45641__b45671722213">find /opt -name "sed*"</strong></p>
<ul id="ALM-45641__ul895161872213"><li id="ALM-45641__li19511818172212">If yes, go to <a href="#ALM-45641__li6383747162115">11</a>.</li><li id="ALM-45641__li92071822102216">If no, go to <a href="#ALM-45641__li210153713915">12</a>.</li></ul>
</p></li><li id="ALM-45641__li6383747162115"><a name="ALM-45641__li6383747162115"></a><a name="li6383747162115"></a><span>Run the following command to check the details and permissions of the synchronization file found in <a href="#ALM-45641__li13330195272015">10</a>:</span><p><p id="ALM-45641__p18606720364"><strong id="ALM-45641__b101251241173620">ll </strong> <em id="ALM-45641__i19573152920466">Path of the file to be checked</em></p>
<ul id="ALM-45641__ul193435212343"><li id="ALM-45641__li43431214342">If the file size is 0 and all values in the permission column are -, the file is a junk file. Run the following command to delete it:<p id="ALM-45641__p757404610368"><a name="ALM-45641__li43431214342"></a><a name="li43431214342"></a><strong id="ALM-45641__b2059920416372">rm -rf </strong><em id="ALM-45641__i11323171812474">Files to be deleted</em></p>
<p id="ALM-45641__p2971420151819">Wait for several minutes and check whether the alarm is cleared. If the alarm persists, go to <a href="#ALM-45641__li210153713915">12</a>.</p>
</li><li id="ALM-45641__li16324131020345">If the file size is not 0, go to <a href="#ALM-45641__li210153713915">12</a>.</li></ul>
</p></li><li id="ALM-45641__li210153713915"><a name="ALM-45641__li210153713915"></a><a name="li210153713915"></a><span>View the log file generated when the alarm is reported.</span><p><ol type="a" id="ALM-45641__ol430319264420"><li id="ALM-45641__li1303026154215">Run the following command to go to the HA run log file path of the current cluster:<p id="ALM-45641__p18441029183711"><a name="ALM-45641__li1303026154215"></a><a name="li1303026154215"></a><strong id="ALM-45641__b1744629193712">cd /var/log/Bigdata/flink/flinkserver/ha/runlog</strong></p>
</li><li id="ALM-45641__li4350341502">Decompress the log file and view the logs generated when the alarm was reported.<p id="ALM-45641__p144701722175010"><a name="ALM-45641__li4350341502"></a><a name="li4350341502"></a>For example, if the name of the file is <strong id="ALM-45641__b1669441615512">ha.log.2021-03-22_12-00-07.gz</strong>, run the following commands:</p>
<p id="ALM-45641__p111371341514"><strong id="ALM-45641__b2561510535">gunzip </strong><em id="ALM-45641__i106521648531">ha.log.2021-03-22_12-00-07.gz</em></p>
<p id="ALM-45641__p636791434012"><strong id="ALM-45641__b32779764611">vi </strong><em id="ALM-45641__i12778171285311">ha.log.2021-03-22_12-00-07</em></p>
<p id="ALM-45641__p116435174017">Check whether the logs contain error information around the alarm generation time.</p>
<ul id="ALM-45641__ul7287186411"><li id="ALM-45641__li8287148154113">If yes, rectify the fault based on the error information and go to <a href="#ALM-45641__li259318693811">13</a>.<p id="ALM-45641__p12593194025515">For example, if the following error information is displayed, the directory lacks the required permissions. In this case, set the directory permissions to match those on a normal node.</p>
<p id="ALM-45641__p67601417115616"><span><img id="ALM-45641__image1029712410578" src="en-us_image_0000001532767430.png"></span></p>
</li><li id="ALM-45641__li68301723174120">If no, go to <a href="#ALM-45641__li42141433171631">14</a>.</li></ul>
</li></ol>
</p></li><li id="ALM-45641__li259318693811"><a name="ALM-45641__li259318693811"></a><a name="li259318693811"></a><span>Wait for about 10 minutes and check whether the alarm is cleared.</span><p><ul class="subitemlist" id="ALM-45641__ul195936613819"><li id="ALM-45641__li155931062381">If yes, no further action is required.</li><li id="ALM-45641__li155931160384">If no, go to <a href="#ALM-45641__li42141433171631">14</a>.</li></ul>
</p></li></ol>
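<p>The junk-file criteria in step 11 (file size 0 and no permission bits set) can be expressed directly as <strong>find</strong> tests. The following is a minimal sketch that only lists matching files and does not delete them. The function name <strong>list_junk_sed_files</strong> is hypothetical.</p>

```shell
#!/usr/bin/env bash
# Sketch only: list sed* files that are empty and have no permission bits set.

list_junk_sed_files() {
    # -type f    : regular files only
    # -name      : same name pattern as in step 10
    # -size 0    : empty files
    # -perm 000  : permission bits are exactly "---------"
    find "$1" -type f -name 'sed*' -size 0 -perm 000 2>/dev/null
}

# Example:
#   list_junk_sed_files /srv/BigData/
#   list_junk_sed_files /opt
```

<p>Review the listed paths before deleting anything. A non-empty file or a file with normal permissions is not reported by this sketch and should be handled as described in step 12.</p>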
<p class="tableheading" id="ALM-45641__p22784204171641"><strong id="ALM-45641__b29668577171645">Collect fault information.</strong></p>
<ol start="14" id="ALM-45641__ol34389030171648"><li id="ALM-45641__li42141433171631"><a name="ALM-45641__li42141433171631"></a><a name="li42141433171631"></a><span>On <span id="ALM-45641__text1575259249">MRS</span> Manager, choose <strong id="ALM-45641__b1017816443537">O&amp;M</strong> &gt; <strong id="ALM-45641__b19187044145320">Log</strong> &gt; <strong id="ALM-45641__b121891044135313">Download</strong>.</span></li><li id="ALM-45641__li12839885171631"><span>Select FlinkServer information from <strong id="ALM-45641__b154492467534">Services</strong> and click <strong id="ALM-45641__b5451154625318">OK</strong>.</span></li><li id="ALM-45641__li7999192713221"><span>Expand the <strong id="ALM-45641__b1554191105420">Hosts</strong> drop-down list. In the <strong id="ALM-45641__b175412115543">Select Host</strong> dialog box that is displayed, select the hosts to which the role belongs, and click <strong id="ALM-45641__b85411211115419">OK</strong>.</span></li><li id="ALM-45641__li48450108171631"><span>Click <span><img id="ALM-45641__image104601319175315" src="en-us_image_0000001583127333.png"></span> in the upper right corner, and set <strong id="ALM-45641__b99459142545">Start Date</strong> and <strong id="ALM-45641__b1946111445417">End Date</strong> for log collection to 10 minutes ahead of and after the alarm generation time, respectively. Then, click <strong id="ALM-45641__b13946181420548">Download</strong>.</span></li><li id="ALM-45641__li32144708171631"><span>Contact <span id="ALM-45641__text1963932010540">O&amp;M personnel</span> and provide the collected logs.</span></li></ol>
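<p>The log collection window in step 17 (10 minutes before and after the alarm generation time) can be calculated rather than worked out by hand. The following is a minimal sketch, assuming GNU <strong>date</strong>. The function name <strong>log_window</strong> and the sample timestamp are hypothetical.</p>

```shell
#!/usr/bin/env bash
# Sketch only: print the start and end of a 10-minute window around an alarm time.

log_window() {
    local alarm_time="$1" ts
    # Convert to epoch seconds first so that the offset is plain integer arithmetic.
    ts=$(date -d "$alarm_time" +%s) || return 1
    date -d "@$((ts - 600))" '+%Y-%m-%d %H:%M:%S'   # Start Date
    date -d "@$((ts + 600))" '+%Y-%m-%d %H:%M:%S'   # End Date
}

# Example:
#   log_window "2021-03-22 12:00:07"
```

<p>The first output line is the value for <strong>Start Date</strong> and the second is the value for <strong>End Date</strong>, both in the time zone of the node where the script runs.</p>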
</div>
<div class="section" id="ALM-45641__section169311343318"><h4 class="sectiontitle">Alarm Clearing</h4><p id="ALM-45641__p754913417333">This alarm is automatically cleared after the fault is rectified.</p>
</div>
<div class="section" id="ALM-45641__section4139237"><h4 class="sectiontitle">Related Information</h4><p id="ALM-45641__p33559471">None</p>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1298.html">Alarm Reference (Applicable to MRS 3.x)</a></div>
</div>
</div>