Yang, Tong 6182f91ba8 MRS component operation guide_normal 2.0.38.SP20 version
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Yang, Tong <yangtong2@huawei.com>
Co-committed-by: Yang, Tong <yangtong2@huawei.com>
2022-12-09 14:55:21 +00:00


<a name="mrs_01_2018"></a>
<h1 class="topictitle1">Why Do the Executors Fail to Register Shuffle Services During the Shuffle of a Large Amount of Data?</h1>
<div id="body1595920219933"><div class="section" id="mrs_01_2018__s8e19beb503804d668dfb349bd955cced"><h4 class="sectiontitle">Question</h4><p id="mrs_01_2018__add4f1e974a444c1b92b968b950a6213e">When more than 50 TB of data is shuffled, some executors fail to register with the shuffle service because of a timeout, and the shuffle tasks then fail. Why does this happen? The error log is as follows:</p>
<pre class="screen" id="mrs_01_2018__sd6544b3b04804d1f82e348e3c0f36178">2016-10-19 01:33:34,030 | WARN | ContainersLauncher #14 | Exception from container-launch with container ID: container_e1452_1476801295027_2003_01_004512 and exit code: 1 | LinuxContainerExecutor.java:397
ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
at org.apache.hadoop.util.Shell.run(Shell.java:472)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:381)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:312)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:88)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-10-19 01:33:34,031 | INFO | ContainersLauncher #14 | Exception from container-launch. | ContainerExecutor.java:300
2016-10-19 01:33:34,031 | INFO | ContainersLauncher #14 | Container id: container_e1452_1476801295027_2003_01_004512 | ContainerExecutor.java:300
2016-10-19 01:33:34,031 | INFO | ContainersLauncher #14 | Exit code: 1 | ContainerExecutor.java:300
2016-10-19 01:33:34,031 | INFO | ContainersLauncher #14 | Stack trace: ExitCodeException exitCode=1: | ContainerExecutor.java:300</pre>
</div>
<div class="section" id="mrs_01_2018__sa3156d591bc642a0a6737ffc375e99a1"><h4 class="sectiontitle">Answer</h4><p id="mrs_01_2018__a3faf260bccc546ff8f8bf05073c03357">The volume of imported data is greater than 50 TB, which exceeds the processing capability of the shuffle service. Under this heavy load, the shuffle service may fail to respond to an executor's registration request in time.</p>
<p id="mrs_01_2018__a1826c93d741143098db4ae7fb9650375">The timeout interval for an executor to register with the shuffle service is 5 seconds, and the maximum number of retries is 3. Neither value is configurable.</p>
<p id="mrs_01_2018__a14eca68381e7449ea82075a7899a2817">Because these values cannot be changed, you are advised to increase the number of task retries and the number of allowed executor failures instead.</p>
<p id="mrs_01_2018__adfe6b6c121df4e4bbdc2727f74e0393a">Configure the following parameters in the <span class="filepath" id="mrs_01_2018__f43735bd8318d4b5bb428a12ace44abf9"><b>spark-defaults.conf</b></span> file on the client. If <span class="parmname" id="mrs_01_2018__pceace18ee32141d39fe2b79291a6a269"><b>spark.yarn.max.executor.failures</b></span> does not exist in the file, add it manually.</p>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="mrs_01_2018__t1aba15bf45cc42aeb22687574f994c6b" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Parameter Description</caption><thead align="left"><tr id="mrs_01_2018__r97550aaffd4246c89ced15b92159ae7d"><th align="left" class="cellrowborder" valign="top" width="39.756024397560246%" id="mcps1.3.2.6.2.4.1.1"><p id="mrs_01_2018__a05b5f2f76a6848ac9aca6347896fee38">Parameter</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="40.29597040295971%" id="mcps1.3.2.6.2.4.1.2"><p id="mrs_01_2018__ae5f3a94e594247b59519eb42c5dfc1a1">Description</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="19.948005199480054%" id="mcps1.3.2.6.2.4.1.3"><p id="mrs_01_2018__a0e9025526a8e42a1a112c3d2520c2db1">Default <span id="mrs_01_2018__p3fc6982d09124a97aaecdff0d8fb0f4f">Value</span></p>
</th>
</tr>
</thead>
<tbody><tr id="mrs_01_2018__r87aad5cef5044429a9c1d07d78bb833d"><td class="cellrowborder" valign="top" width="39.756024397560246%" headers="mcps1.3.2.6.2.4.1.1 "><p id="mrs_01_2018__a8ef6c7fc70ba4258804d7bf403d185c1">spark.task.maxFailures</p>
</td>
<td class="cellrowborder" valign="top" width="40.29597040295971%" headers="mcps1.3.2.6.2.4.1.2 "><p id="mrs_01_2018__a902b7cd0f02645978c0b5c7e4129d651">Specifies the number of times a task can fail before the job fails.</p>
</td>
<td class="cellrowborder" valign="top" width="19.948005199480054%" headers="mcps1.3.2.6.2.4.1.3 "><p id="mrs_01_2018__aac3b849f1fd04a8e94ee7b4e9b618b76">4</p>
</td>
</tr>
<tr id="mrs_01_2018__rb6678c3a3a584973820e97b37b9b94c2"><td class="cellrowborder" rowspan="2" valign="top" width="39.756024397560246%" headers="mcps1.3.2.6.2.4.1.1 "><p id="mrs_01_2018__a36c29ba27fa54bbbb453815912d418e7">spark.yarn.max.executor.failures</p>
</td>
<td class="cellrowborder" valign="top" width="40.29597040295971%" headers="mcps1.3.2.6.2.4.1.2 "><p id="mrs_01_2018__a088d5e423a694e66859ecede67f15b1d">Specifies the maximum number of executor failures allowed before the application fails.</p>
<p id="mrs_01_2018__acd7ed58e003c41798689dbf467d04600">The default value in this row applies when <span class="parmname" id="mrs_01_2018__pd83d552f4c5c44138b9b04316b073c28"><b>spark.dynamicAllocation.enabled</b></span> is set to <span class="parmvalue" id="mrs_01_2018__pd4b3f00309be40c0a4cb14e6b5466960"><b>false</b></span>, that is, when dynamic allocation of executors is disabled.</p>
</td>
<td class="cellrowborder" valign="top" width="19.948005199480054%" headers="mcps1.3.2.6.2.4.1.3 "><p id="mrs_01_2018__a250b327b58c745f1893dbcbb22a2d2b0">numExecutors * 2, with a minimum of 3</p>
</td>
</tr>
<tr id="mrs_01_2018__r3e4f25ef59ae417eb0940470638fe62f"><td class="cellrowborder" valign="top" headers="mcps1.3.2.6.2.4.1.1 "><p id="mrs_01_2018__a16ce32b7b1db42a2a420d42493dde4e3">Specifies the maximum number of executor failures allowed before the application fails.</p>
<p id="mrs_01_2018__ad280a963ec0b400e8b0ad68fde9d0a36">The default value in this row applies when <span class="parmname" id="mrs_01_2018__pa13ff90128464063ad085535d3efa0ba"><b>spark.dynamicAllocation.enabled</b></span> is set to <span class="parmvalue" id="mrs_01_2018__p7bf830c7279f462184c76cf3548924e4"><b>true</b></span>, that is, when dynamic allocation of executors is enabled.</p>
</td>
<td class="cellrowborder" valign="top" headers="mcps1.3.2.6.2.4.1.2 "><p id="mrs_01_2018__ab4d5755a07e94d3e8bf25897c5bd27fd">3</p>
</td>
</tr>
</tbody>
</table>
</div>
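<p>As a minimal sketch of the settings described in Table 1, the <span class="filepath"><b>spark-defaults.conf</b></span> file on the client could contain entries like the following. The values <span class="parmvalue"><b>8</b></span> and <span class="parmvalue"><b>100</b></span> are illustrative assumptions, not tuned recommendations. Choose values that suit the job size and the cluster.</p>
<pre class="screen"># Illustrative values only (assumptions, not recommendations from this guide).
# Allow each task to fail more times before the job fails (default: 4).
spark.task.maxFailures            8
# Allow more executor failures before the application fails.
# Add this line manually if the parameter is not present in the file.
spark.yarn.max.executor.failures  100</pre>
<p>The same properties can also be supplied for a single job with the <b>--conf</b> option of <b>spark-submit</b>, for example <b>--conf spark.task.maxFailures=8</b>, which avoids changing the client-wide defaults.</p>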
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_2003.html">Spark Core</a></div>
</div>
</div>