
<a name="mrs_01_1989"></a>

<h1 class="topictitle1">Optimizing the INSERT...SELECT Operation</h1>
<div id="body1595920218430"><div class="section" id="mrs_01_1989__sf684523b485d487f95f5161b60e7d6e1"><h4 class="sectiontitle">Scenario</h4><p id="mrs_01_1989__ab8b14763c2e94537976db71e8295eecd">The INSERT...SELECT operation needs to be optimized if any of the following conditions is true:</p>
<ul id="mrs_01_1989__u4aba18d43be8470fb27f235aa71e2a40"><li id="mrs_01_1989__lc405a48494ed447faffc8c9ed5ce4775">Many small files need to be queried.</li><li id="mrs_01_1989__l21c7c29047aa4adeb72b9be56e88c68e">A few large files need to be queried.</li><li id="mrs_01_1989__la832a58cdc44435fa6d5be50756f16e5">The INSERT...SELECT operation is performed by a non-spark user in Beeline/JDBCServer mode.</li></ul>
</div>
<div class="section" id="mrs_01_1989__sc832b2c8d7254fddb6ca7d9688318c62"><h4 class="sectiontitle">Procedure</h4><p id="mrs_01_1989__aeebbd204e27e44fcbf73d9edb9868c66">Optimize the INSERT...SELECT operation as follows:</p>
<ul id="mrs_01_1989__u4e5c67d1d47f49e381dd9ed41d2a6a8a"><li id="mrs_01_1989__lc17299a875024029bdc0938c6ba2f9c2">If the table to be created is a Hive table, set its storage type to Parquet. INSERT...SELECT statements then run faster.</li><li id="mrs_01_1989__lfeaee548dfb748909473eef9b64d81b3">Perform the INSERT...SELECT operation as a spark-sql user or, in Beeline/JDBCServer mode, as a spark user. The file owner then no longer needs to be changed repeatedly, which speeds up INSERT...SELECT statements.<div class="note" id="mrs_01_1989__ne822253d443846889e6dd26bc51d684d"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="mrs_01_1989__aba50ec55f716410ca7774de11bb5ee6a">In Beeline/JDBCServer mode, the executor user is the same as the driver user. The driver runs as a spark user because it is part of the JDBCServer service, which is started by a spark user. The executor is unaware of the Beeline user, so if the Beeline user is not a spark user, the owner of the written files must be changed to the Beeline user (the actual user).</p>
</div></div>
</li><li id="mrs_01_1989__le4f42ac06f9346469feb019098fefa75">If many small files need to be queried, set spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes to control the maximum number of bytes packed into a partition, so that multiple small files are combined into one partition and fewer files are produced. This speeds up file renaming and, in turn, INSERT...SELECT statements.</li></ul>
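<p>The following is a minimal sketch of how these settings might be combined in a spark-sql or Beeline session. The table names <strong>source_tbl</strong> and <strong>target_tbl</strong> and their columns are hypothetical, and the parameter values are illustrative starting points rather than recommendations. In open-source Spark, the default value of spark.sql.files.maxPartitionBytes is 134217728 (128 MB) and the default value of spark.sql.files.openCostInBytes is 4194304 (4 MB).</p>
<pre class="screen">-- Hypothetical target table stored as Parquet.
CREATE TABLE IF NOT EXISTS target_tbl (id INT, name STRING) STORED AS PARQUET;

-- Allow up to 256 MB of input per partition (illustrative value).
SET spark.sql.files.maxPartitionBytes=268435456;

-- Estimated cost of opening a file, in bytes, added to each file's size
-- when files are packed into a partition (shown here at the default, 4 MB).
SET spark.sql.files.openCostInBytes=4194304;

-- Hypothetical source table that consists of many small files.
INSERT INTO TABLE target_tbl SELECT id, name FROM source_tbl;</pre>
<p>The effect of these values depends on the data volume and the available parallelism. Compare the number of files in the target table directory before and after the change to check that the settings have the intended effect.</p>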
<div class="note" id="mrs_01_1989__ne20f611435d04ea0946a8dc1bd0e5bd3"><img src="public_sys-resources/note_3.0-en-us.png"><span class="notetitle"> </span><div class="notebody"><p id="mrs_01_1989__a7d6c0990019a42e8a62b26d7679071f7">The preceding optimizations are not a one-size-fits-all solution. In the following scenario, the INSERT...SELECT operation still takes a long time:</p>
<p id="mrs_01_1989__a26d5b4bb706d4a6c8d3b1259af92aa9c">The target is a dynamic partitioned table that contains many partitions.</p>
</div></div>
</div>
<p id="mrs_01_1989__a55cdc08aed39429b943fdcf5d68ca849"></p>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1985.html">Spark SQL and DataFrame Tuning</a></div>
</div>
</div>