Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
<a name="mrs_01_2018"></a><a name="mrs_01_2018"></a>
<h1 class="topictitle1">Why Do the Executors Fail to Register Shuffle Services During the Shuffle of a Large Amount of Data?</h1>
<div id="body1595920219933"><div class="section" id="mrs_01_2018__s8e19beb503804d668dfb349bd955cced"><h4 class="sectiontitle">Question</h4><p id="mrs_01_2018__add4f1e974a444c1b92b968b950a6213e">When more than 50 TB of data is shuffled, some executors time out while registering with the shuffle service, and the shuffle tasks then fail. Why does this happen? The error log is as follows:</p>
<pre class="screen" id="mrs_01_2018__sd6544b3b04804d1f82e348e3c0f36178">2016-10-19 01:33:34,030 | WARN | ContainersLauncher #14 | Exception from container-launch with container ID: container_e1452_1476801295027_2003_01_004512 and exit code: 1 | LinuxContainerExecutor.java:397
ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
at org.apache.hadoop.util.Shell.run(Shell.java:472)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:381)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:312)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:88)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-10-19 01:33:34,031 | INFO | ContainersLauncher #14 | Exception from container-launch. | ContainerExecutor.java:300
2016-10-19 01:33:34,031 | INFO | ContainersLauncher #14 | Container id: container_e1452_1476801295027_2003_01_004512 | ContainerExecutor.java:300
2016-10-19 01:33:34,031 | INFO | ContainersLauncher #14 | Exit code: 1 | ContainerExecutor.java:300
2016-10-19 01:33:34,031 | INFO | ContainersLauncher #14 | Stack trace: ExitCodeException exitCode=1: | ContainerExecutor.java:300</pre>
</div>
<div class="section" id="mrs_01_2018__sa3156d591bc642a0a6737ffc375e99a1"><h4 class="sectiontitle">Answer</h4><p id="mrs_01_2018__a3faf260bccc546ff8f8bf05073c03357">The data volume exceeds 50 TB, which is beyond the processing capability of the shuffle service. Under this heavy load, the shuffle service may fail to respond to the registration requests of executors in time.</p>
<p id="mrs_01_2018__a1826c93d741143098db4ae7fb9650375">The timeout interval for an executor to register with the shuffle service is 5 seconds, and the maximum number of retries is 3. These values are not configurable.</p>
<p id="mrs_01_2018__a14eca68381e7449ea82075a7899a2817">You are advised to increase the number of task retries and the number of allowed executor failures.</p>
<p id="mrs_01_2018__adfe6b6c121df4e4bbdc2727f74e0393a">Configure the following parameters in the <span class="filepath" id="mrs_01_2018__f43735bd8318d4b5bb428a12ace44abf9"><b>spark-defaults.conf</b></span> file on the client. If <span class="parmname" id="mrs_01_2018__pceace18ee32141d39fe2b79291a6a269"><b>spark.yarn.max.executor.failures</b></span> does not exist, add it manually.</p>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="mrs_01_2018__t1aba15bf45cc42aeb22687574f994c6b" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Parameter Description</caption><thead align="left"><tr id="mrs_01_2018__r97550aaffd4246c89ced15b92159ae7d"><th align="left" class="cellrowborder" valign="top" width="39.756024397560246%" id="mcps1.3.2.6.2.4.1.1"><p id="mrs_01_2018__a05b5f2f76a6848ac9aca6347896fee38">Parameter</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="40.29597040295971%" id="mcps1.3.2.6.2.4.1.2"><p id="mrs_01_2018__ae5f3a94e594247b59519eb42c5dfc1a1">Description</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="19.948005199480054%" id="mcps1.3.2.6.2.4.1.3"><p id="mrs_01_2018__a0e9025526a8e42a1a112c3d2520c2db1">Default <span id="mrs_01_2018__p3fc6982d09124a97aaecdff0d8fb0f4f">Value</span></p>
</th>
</tr>
</thead>
<tbody><tr id="mrs_01_2018__r87aad5cef5044429a9c1d07d78bb833d"><td class="cellrowborder" valign="top" width="39.756024397560246%" headers="mcps1.3.2.6.2.4.1.1 "><p id="mrs_01_2018__a8ef6c7fc70ba4258804d7bf403d185c1">spark.task.maxFailures</p>
</td>
<td class="cellrowborder" valign="top" width="40.29597040295971%" headers="mcps1.3.2.6.2.4.1.2 "><p id="mrs_01_2018__a902b7cd0f02645978c0b5c7e4129d651">Specifies the number of times a single task can fail before the job is aborted.</p>
</td>
<td class="cellrowborder" valign="top" width="19.948005199480054%" headers="mcps1.3.2.6.2.4.1.3 "><p id="mrs_01_2018__aac3b849f1fd04a8e94ee7b4e9b618b76">4</p>
</td>
</tr>
<tr id="mrs_01_2018__rb6678c3a3a584973820e97b37b9b94c2"><td class="cellrowborder" rowspan="2" valign="top" width="39.756024397560246%" headers="mcps1.3.2.6.2.4.1.1 "><p id="mrs_01_2018__a36c29ba27fa54bbbb453815912d418e7">spark.yarn.max.executor.failures</p>
</td>
<td class="cellrowborder" valign="top" width="40.29597040295971%" headers="mcps1.3.2.6.2.4.1.2 "><p id="mrs_01_2018__a088d5e423a694e66859ecede67f15b1d">Specifies the maximum number of executor failures allowed before the application fails.</p>
<p id="mrs_01_2018__acd7ed58e003c41798689dbf467d04600">This default applies when <span class="parmname" id="mrs_01_2018__pd83d552f4c5c44138b9b04316b073c28"><b>spark.dynamicAllocation.enabled</b></span> is set to <span class="parmvalue" id="mrs_01_2018__pd4b3f00309be40c0a4cb14e6b5466960"><b>false</b></span>, that is, when dynamic allocation of executors is disabled.</p>
</td>
<td class="cellrowborder" valign="top" width="19.948005199480054%" headers="mcps1.3.2.6.2.4.1.3 "><p id="mrs_01_2018__a250b327b58c745f1893dbcbb22a2d2b0">numExecutors * 2, with a minimum of 3</p>
</td>
</tr>
<tr id="mrs_01_2018__r3e4f25ef59ae417eb0940470638fe62f"><td class="cellrowborder" valign="top" headers="mcps1.3.2.6.2.4.1.1 "><p id="mrs_01_2018__a16ce32b7b1db42a2a420d42493dde4e3">Specifies the maximum number of executor failures allowed before the application fails.</p>
<p id="mrs_01_2018__ad280a963ec0b400e8b0ad68fde9d0a36">This default applies when <span class="parmname" id="mrs_01_2018__pa13ff90128464063ad085535d3efa0ba"><b>spark.dynamicAllocation.enabled</b></span> is set to <span class="parmvalue" id="mrs_01_2018__p7bf830c7279f462184c76cf3548924e4"><b>true</b></span>, that is, when dynamic allocation of executors is enabled.</p>
</td>
<td class="cellrowborder" valign="top" headers="mcps1.3.2.6.2.4.1.2 "><p id="mrs_01_2018__ab4d5755a07e94d3e8bf25897c5bd27fd">3</p>
</td>
</tr>
</tbody>
</table>
</div>
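<p>As a rough illustration of the settings above, the client-side <b>spark-defaults.conf</b> file might contain entries like the following. The values 8 and 200 are assumptions chosen only for illustration, not recommendations from this guide; choose values that suit the size of your job and cluster.</p>
<pre class="screen"># Illustrative values only (assumed, not tuned recommendations).
# Allow each task more failures than the default of 4 before the job is aborted.
spark.task.maxFailures            8
# Allow more executor failures before the application fails.
# For comparison, with dynamic allocation disabled and 50 executors,
# the default would be max(50 * 2, 3) = 100.
spark.yarn.max.executor.failures  200</pre>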
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_2003.html">Spark Core</a></div>
</div>
</div>