Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
<a name="mrs_01_1977"></a>
<h1 class="topictitle1">Optimizing Memory Configuration</h1>
<div id="body1595920217124"><div class="section" id="mrs_01_1977__s5fde5a20f8c247cdbec462861ab51526"><h4 class="sectiontitle">Scenario</h4><p id="mrs_01_1977__a9b1c962fed8646ab8cbe0dfbbbdddad4">Spark is a memory-based computing framework. If memory is insufficient during computing, Spark execution efficiency is adversely affected. You can determine whether memory has become the performance bottleneck by monitoring garbage collection (GC) and evaluating the size of resilient distributed datasets (RDDs) in memory, and then take performance optimization measures.</p>
<p id="mrs_01_1977__a099afabbe3fb46ffabaac424b4e86042">To monitor GC of node processes, add the <b>-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</b> parameters to <span class="parmname" id="mrs_01_1977__pe1704a981da84210b5e064c2090643d9"><b>spark.driver.extraJavaOptions</b></span> and <span class="parmname" id="mrs_01_1977__pa405702460b849f99943d9b6ca4d2ca4"><b>spark.executor.extraJavaOptions</b></span> in the client configuration file <span class="filepath" id="mrs_01_1977__f49d1d72bc13e48fc95a42881eac27bad"><b>conf/spark-defaults.conf</b></span>. If "Full GC" is reported frequently, GC needs to be optimized. Cache the RDD and check its size in the log; if the value is large, change the RDD storage level.</p>
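<p>A minimal sketch of the GC-monitoring settings described above, as lines in <b>conf/spark-defaults.conf</b> (append the flags to any values already present for these properties rather than overwriting them):</p>
<pre class="screen">spark.driver.extraJavaOptions   -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.executor.extraJavaOptions -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</pre>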
</div>
<div class="section" id="mrs_01_1977__s2736a5d232a5457f8719683142605398"><h4 class="sectiontitle">Procedure</h4><ul id="mrs_01_1977__uafa092e78d714d8c9dd0de16ad1de156"><li id="mrs_01_1977__lb82f8cb135c24dcb819da01c40dd3f1c">To optimize GC, adjust the ratio of the young generation to the tenured generation. Add the <span class="parmname" id="mrs_01_1977__p90c0901d52804670b8237e127c37c159"><b>-XX:NewRatio</b></span> parameter to <span class="parmname" id="mrs_01_1977__p98013e512f354d4686d79da6b5eb09b5"><b>spark.driver.extraJavaOptions</b></span> and <span class="parmname" id="mrs_01_1977__p7f94a1e0dfd3428d90b5ee80d12ed454"><b>spark.executor.extraJavaOptions</b></span> in the client configuration file <span class="filepath" id="mrs_01_1977__f8ec3142d90c04d41b14537dbe80892c1"><b>conf/spark-defaults.conf</b></span>. For example, with <b>-XX:NewRatio=2</b>, the young generation accounts for 1/3 of the heap and the tenured generation accounts for 2/3.</li><li id="mrs_01_1977__l80a2c77e72e3488fab0d33f64bdbf8f9">Optimize the RDD data structure when compiling Spark programs.<ul id="mrs_01_1977__uee12aa6f2e39472ca6dd421ee97dfcb5"><li id="mrs_01_1977__l52deed5e2cf744d585532b5664ec6f15">Use arrays of primitive types instead of collection classes, for example, by using the fastutil library.</li><li id="mrs_01_1977__l027f8d1fffbf4e87ae1668f3109b9335">Avoid nested structures.</li><li id="mrs_01_1977__l98e65015abae495bb30271c810f656bc">Avoid using String for keys.</li></ul>
</li><li id="mrs_01_1977__l92f26e3ef0504a4295b2dd2e8df436c7">Serialize RDDs when developing Spark programs.<p id="mrs_01_1977__a46bfed62bd1d41258a399f35e9817383"><a name="mrs_01_1977__l92f26e3ef0504a4295b2dd2e8df436c7"></a><a name="l92f26e3ef0504a4295b2dd2e8df436c7"></a>By default, data is not serialized when RDDs are cached. You can set the storage level to store RDDs in serialized form and minimize memory usage. For example:</p>
<pre class="screen" id="mrs_01_1977__sfed76b1bce4540dc99edef414d8f73b9">testRDD.persist(StorageLevel.MEMORY_ONLY_SER)</pre>
</li></ul>
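<p>The serialized-caching advice above can be sketched as a self-contained Scala example. The application name and RDD contents are illustrative; enabling Kryo via <b>spark.serializer</b> is an optional extra step that makes the serialized data more compact, provided the cached classes are Kryo-compatible:</p>
<pre class="screen">import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Sketch: enable Kryo serialization (more compact than Java serialization)
// and cache an RDD in serialized form to reduce its memory footprint.
val conf = new SparkConf()
  .setAppName("MemoryTuningSketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val testRDD = sc.parallelize(1 to 1000000)
testRDD.persist(StorageLevel.MEMORY_ONLY_SER) // stored serialized in memory
println(testRDD.count())</pre>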
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1975.html">Spark Core Tuning</a></div>
</div>
</div>