doc-exports/docs/mrs/umn/admin_guide_000040.html
yangtong c285e88a17 MRS UMN 20250806 version
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com>
Co-authored-by: yangtong <yangtong2@huawei.com>
Co-committed-by: yangtong <yangtong2@huawei.com>
2025-09-02 10:43:57 +00:00


<a name="admin_guide_000040"></a>
<h1 class="topictitle1">Decommissioning and Recommissioning an Instance</h1>
<div id="body1529658735911"><div class="section" id="admin_guide_000040__s8f2c2e4ee54d4d929e0f420df2a78b8e"><h4 class="sectiontitle">Scenario</h4><p id="admin_guide_000040__en-us_topic_0046737055_p31725606">Some role instances serve external requests in distributed and parallel mode, and each service independently records whether each of its instances can be used. To change the running status of such an instance, decommission or recommission it on <span id="admin_guide_000040__text154091129192818">MRS</span> Manager.</p>
<p id="admin_guide_000040__p16987115143417">Some instances do not support the recommissioning and decommissioning functions.</p>
<div class="note" id="admin_guide_000040__en-us_topic_0046737055_note42517832"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><div class="p" id="admin_guide_000040__p1318115181419">The following roles support decommissioning and recommissioning: HDFS DataNode, YARN NodeManager, and HBase RegionServer.<ul id="admin_guide_000040__ul52127872144628"><li class="text" id="admin_guide_000040__li6273615318113">By default, DataNodes cannot be decommissioned if the number of DataNodes is less than or equal to the number of HDFS replicas. For example, if the number of HDFS replicas is three and the system has fewer than four DataNodes, decommissioning cannot be performed. In this case, an error is reported and <span id="admin_guide_000040__text1527173263414">MRS</span> Manager is forced to exit the decommissioning 30 minutes after <span id="admin_guide_000040__text19754045183412">MRS</span> Manager attempts the operation.</li><li class="text" id="admin_guide_000040__li1513104122610">For clusters of MRS 3.3.0 or later, you can enable quick decommissioning before decommissioning DataNodes. In this case, when the number of DataNodes meets the value of <strong id="admin_guide_000040__b0589781121">dfs.namenode.decommission.force.replication.min</strong>, the system decommissions the nodes and adds HDFS replicas at the same time. <strong id="admin_guide_000040__b125841155663">If data is written during quick decommissioning, data may be lost. Exercise caution when performing this operation.</strong> The parameters related to quick decommissioning are as follows. You can search for and view them on the HDFS configuration page on MRS Manager.<p id="admin_guide_000040__p18131410268"><strong id="admin_guide_000040__b459994015710">dfs.namenode.decommission.force.enabled</strong>: whether to enable quick decommissioning for DataNodes. If this parameter is set to <strong id="admin_guide_000040__b117651379">true</strong>, the function is enabled.</p>
<p id="admin_guide_000040__p313194142619"><strong id="admin_guide_000040__b175309545711">dfs.namenode.decommission.force.replication.min</strong>: minimum number of available copies of a block required for DataNode quick decommissioning. The value ranges from 1 to 3.</p>
</li><li class="text" id="admin_guide_000040__li139655581558">During MapReduce task execution, files with 10 replicas are generated. Therefore, if the number of DataNode instances is less than 10, decommissioning cannot be performed.</li><li id="admin_guide_000040__li64342367144634">If the DataNodes span more than one rack before decommissioning (the number of racks is determined by the rack configured for each DataNode) and the remaining DataNodes would all belong to a single rack afterward, the decommissioning fails. Therefore, before decommissioning DataNode instances, evaluate the impact on the number of racks and adjust the DataNodes to be decommissioned accordingly.</li><li id="admin_guide_000040__li2382750145324">If multiple DataNodes are decommissioned at the same time and each of them stores a large volume of data, the decommissioning may fail due to timeout. To avoid this problem, decommission one DataNode at a time and repeat the operation as needed.</li><li id="admin_guide_000040__li17949129201411">During broker decommissioning, if the number of brokers remaining after decommissioning is less than the number of built-in topic replicas (3 by default) of the Kafka service, the broker cannot be decommissioned. If the instance is forcibly deleted, the service functions become unavailable. (The following broker constraints were added in MRS 3.5.0.)</li><li id="admin_guide_000040__li394912291144">If multiple brokers are decommissioned at the same time and the data volume of the topic partitions maintained by each broker is large, the decommissioning may fail due to timeout. To avoid this problem, decommission only one broker at a time.</li>
<li id="admin_guide_000040__li101651833123917">During broker decommissioning, you can view the partition migration progress on the Kafka UI and increase the traffic limit to accelerate partition migration during off-peak hours. To avoid service interruption, you can reduce the partition migration traffic or cancel ongoing partition migration tasks.</li><li id="admin_guide_000040__li14313748124316">Do not delete topics during broker decommissioning. Otherwise, metadata of the migration tasks will be left behind.</li><li id="admin_guide_000040__li168681246125011">After a broker instance is added or a broker is recommissioned, the system triggers partition balancing in 10 minutes. (You can use the <strong id="admin_guide_000040__b1027201415344">auto.reassign.check.interval.ms</strong> parameter of the Kafka component on Manager to adjust the trigger time.)</li><li id="admin_guide_000040__li146381831192013">Decommissioning or recommissioning constraints for Doris BE nodes<ul id="admin_guide_000040__ul18121184444020"><li id="admin_guide_000040__li1798884110406">After decommissioning, the number of remaining normal BE nodes must be no less than the number of replicas of any table. Otherwise, decommissioning will fail.</li></ul>
<ul id="admin_guide_000040__ul1669801614288"><li id="admin_guide_000040__li13698816202811"><strong id="admin_guide_000040__b631710017338">BE node storage space</strong><p id="admin_guide_000040__p166987160285">Before decommissioning, the disk space of the non-decommissioned BE nodes in the cluster must be sufficient to store the data of all BE nodes to be decommissioned. In addition, about 10% of the storage space of each non-decommissioned BE node must remain free after decommissioning so that the remaining instances can run properly.</p>
</li></ul>
</li></ul>
</div>
</div></div>
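<p>The DataNode count and rack constraints in the preceding note can be sketched as a simple pre-check. This is a minimal illustration and not part of MRS: the function names and argument order are hypothetical, and the rule "the remaining DataNodes must be at least the replica count" is an assumed generalization of the single-node case described in the note. You need to collect the input numbers yourself, for example, from the HDFS pages on MRS Manager.</p>

```shell
#!/bin/sh
# Hypothetical pre-check helpers; names and argument order are illustrative.

# datanode_count_ok TOTAL_DATANODES HDFS_REPLICAS NODES_TO_DECOMMISSION
# Succeeds when the DataNodes left after decommissioning are still at least
# as many as the HDFS replica count (assumed generalization, see lead-in).
datanode_count_ok() {
    dn_remaining=$(( $1 - $3 ))
    [ "$dn_remaining" -ge "$2" ]
}

# rack_count_ok RACKS_BEFORE RACKS_AFTER
# Fails when a multi-rack layout would collapse to a single rack.
rack_count_ok() {
    if [ "$1" -gt 1 ]; then
        if [ "$2" -eq 1 ]; then
            return 1
        fi
    fi
    return 0
}
```

<p>For example, <strong>datanode_count_ok 4 3 1</strong> succeeds (three DataNodes remain for three replicas), whereas <strong>datanode_count_ok 3 3 1</strong> fails. Similarly, <strong>rack_count_ok 2 1</strong> fails because two racks would be reduced to one.</p>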
</div>
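<p>The Doris BE constraints described above lend themselves to a small arithmetic check. The sketch below is illustrative only: the function names are hypothetical, all sizes are assumed to be whole gigabytes that you gather yourself, and the 10% reserve is applied to cluster-wide totals, which assumes that data is spread evenly across the remaining BE nodes (the constraint itself is stated per node).</p>

```shell
#!/bin/sh
# Hypothetical capacity helpers for Doris BE decommissioning.

# be_count_ok REMAINING_BE_NODES MAX_TABLE_REPLICAS
# Succeeds when the remaining normal BE nodes are no fewer than the largest
# replica count of any table.
be_count_ok() {
    [ "$1" -ge "$2" ]
}

# be_space_ok REMAINING_CAPACITY_GB REMAINING_USED_GB DATA_TO_MIGRATE_GB
# Succeeds when the remaining nodes can absorb the migrated data while
# keeping about 10% of their total capacity free.
be_space_ok() {
    be_usable=$(( $1 * 90 / 100 ))
    be_needed=$(( $2 + $3 ))
    [ "$be_needed" -le "$be_usable" ]
}
```

<p>For example, with 1000 GB of remaining capacity, 500 GB already used, and 300 GB to migrate, <strong>be_space_ok 1000 500 300</strong> succeeds because 800 GB fits within the 900 GB usable limit. With 700 GB already used, the same call fails.</p>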
<div class="section" id="admin_guide_000040__section114941822101815"><h4 class="sectiontitle">Procedure</h4><ol id="admin_guide_000040__en-us_topic_0046737055_ol25797834"><li id="admin_guide_000040__en-us_topic_0046737055_li30853914"><span>Perform a health check on the DataNodes before decommissioning:</span><p><ol type="a" id="admin_guide_000040__en-us_topic_0046737055_ol9249770"><li id="admin_guide_000040__en-us_topic_0046737055_li16139067">Log in to the client installation node as a client user and switch to the client installation directory.</li><li id="admin_guide_000040__en-us_topic_0046737055_li11033878">Configure the client environment variables. For a security cluster, also use user <strong id="admin_guide_000040__b20367658122716">hdfs</strong> for permission authentication.<pre class="screen" id="admin_guide_000040__s3aa4cfc64a104f838a377cdd8f3fd3ca"><strong id="admin_guide_000040__b7754283397">source bigdata_env</strong>  # Configure the client environment variables.
<strong id="admin_guide_000040__b8817101407">kinit hdfs</strong>  # Authenticate as user hdfs (security clusters only).
Password for hdfs@HADOOP.COM:  # Enter the login password of user <strong id="admin_guide_000040__b24051841104010">hdfs</strong>.</pre>
</li><li id="admin_guide_000040__en-us_topic_0046737055_li21328951">Run the <strong id="admin_guide_000040__b48572019104114">hdfs fsck / -list-corruptfileblocks</strong> command, and check the returned result.<ul class="subitemlist" id="admin_guide_000040__en-us_topic_0046737055_ul49425459"><li id="admin_guide_000040__en-us_topic_0046737055_li42175949">If "has 0 CORRUPT files" is displayed, go to <a href="#admin_guide_000040__en-us_topic_0046737055_step_2">2</a>.</li><li id="admin_guide_000040__en-us_topic_0046737055_li60808702">If the result does not contain "has 0 CORRUPT files" and the name of the damaged file is returned, go to <a href="#admin_guide_000040__en-us_topic_0046737055_step_1e">1.d</a>.</li></ul>
</li><li id="admin_guide_000040__en-us_topic_0046737055_step_1e"><a name="admin_guide_000040__en-us_topic_0046737055_step_1e"></a><a name="en-us_topic_0046737055_step_1e"></a>Run the <strong id="admin_guide_000040__b15965174945012">hdfs dfs -rm</strong> <em id="admin_guide_000040__i10196137135217">Name of the damaged file</em> command to delete the damaged file.<div class="note" id="admin_guide_000040__note173526369243"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="admin_guide_000040__p1347174062416"><span id="admin_guide_000040__text3694101514014">Deleting a file or folder is a high-risk operation. Ensure that the file or folder is no longer required before performing this operation.</span></p>
</div></div>
</li></ol>
</p></li><li id="admin_guide_000040__en-us_topic_0046737055_step_2"><a name="admin_guide_000040__en-us_topic_0046737055_step_2"></a><a name="en-us_topic_0046737055_step_2"></a><span>Log in to <span id="admin_guide_000040__text8596144912343">MRS</span> Manager.</span></li><li id="admin_guide_000040__li164975055814"><span>Choose <strong id="admin_guide_000040__b136897294101">Cluster</strong> &gt; <em id="admin_guide_000040__i136901429141014">Name of the desired cluster</em> &gt; <strong id="admin_guide_000040__b5690929121016">Services</strong>.</span></li><li id="admin_guide_000040__en-us_topic_0046736984_li56314915"><span>Click the name of the target service on the service management page. On the displayed page, click the <strong id="admin_guide_000040__b150913481087">Instance</strong> tab.</span></li><li id="admin_guide_000040__en-us_topic_0046737055_li62108324"><span>Select the role instance to be decommissioned or recommissioned.</span></li><li id="admin_guide_000040__li9268766452"><span>Select <strong id="admin_guide_000040__b541555725615">Decommission</strong> or <strong id="admin_guide_000040__b36567075720">Recommission</strong> from the <strong id="admin_guide_000040__b16766753135618">More</strong> drop-down list.</span><p><p id="admin_guide_000040__p110318408145">In the displayed dialog box, enter the password of the current login user and click <strong id="admin_guide_000040__b125111749145819">OK</strong>.</p>
<div class="p" id="admin_guide_000040__p1132163518143">Select <strong id="admin_guide_000040__b19831904315">I confirm to decommission these instances and accept the consequence of service performance deterioration</strong> and click <strong id="admin_guide_000040__b64461162317">OK</strong> to perform the corresponding operation.<div class="note" id="admin_guide_000040__note113211435161419"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p class="text" id="admin_guide_000040__p19321143519146">If the service that the instance belongs to is restarted from another browser while the instance is being decommissioned, <span id="admin_guide_000040__text1764875493415">MRS</span> Manager displays a message indicating that the instance decommissioning has stopped, but the operating status of the instance is still displayed as <strong id="admin_guide_000040__b169157161355">Started</strong>. In this case, the instance has already been decommissioned in the background. Decommission the instance again to synchronize the operating status.</p>
</div></div>
</div>
</p></li></ol>
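<p>The health check in step 1 can also be scripted. The following is a minimal sketch rather than an official tool: the helper name <strong>fsck_is_clean</strong> is hypothetical, and it relies only on the "has 0 CORRUPT files" text that the procedure tells you to look for. It reads the command output from standard input, so the decision logic can be tried without a cluster.</p>

```shell
#!/bin/sh
# Hypothetical helper: reads the output of
# "hdfs fsck / -list-corruptfileblocks" on standard input and succeeds only
# when the summary reports zero corrupt files.
fsck_is_clean() {
    grep -q "has 0 CORRUPT files"
}

# Typical use on the client node, after "source bigdata_env" (and
# "kinit hdfs" on a security cluster):
#
#   if hdfs fsck / -list-corruptfileblocks | fsck_is_clean; then
#       echo "No corrupt files found. Continue with step 2."
#   else
#       echo "Corrupt files reported. Handle them before decommissioning."
#   fi
```

<p>The sketch deliberately stops at reporting the result. Deleting damaged files remains a manual, high-risk decision, as noted in step 1.d.</p>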
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="admin_guide_000037.html">Instance Management</a></div>
</div>
</div>