forked from docs/doc-exports
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Su, Xiaomeng <suxiaomeng1@huawei.com> Co-committed-by: Su, Xiaomeng <suxiaomeng1@huawei.com>
101 lines
11 KiB
HTML
101 lines
11 KiB
HTML
<a name="dli_08_0266"></a><a name="dli_08_0266"></a>
|
|
|
|
<h1 class="topictitle1">Common Configuration Items of Batch SQL Jobs</h1>
|
|
<div id="body1586845347201"><p id="dli_08_0266__p36393515233">This section describes the common configuration items of the SQL syntax for DLI batch jobs.</p>
|
|
|
|
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="dli_08_0266__table334825142314" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Common configuration items</caption><thead align="left"><tr id="dli_08_0266__row66396519230"><th align="left" class="cellrowborder" valign="top" width="28.349999999999998%" id="mcps1.3.2.2.4.1.1"><p id="dli_08_0266__p463913514234">Item</p>
|
|
</th>
|
|
<th align="left" class="cellrowborder" valign="top" width="9.56%" id="mcps1.3.2.2.4.1.2"><p id="dli_08_0266__p2639158232">Default Value</p>
|
|
</th>
|
|
<th align="left" class="cellrowborder" valign="top" width="62.09%" id="mcps1.3.2.2.4.1.3"><p id="dli_08_0266__p1963920542313">Description</p>
|
|
</th>
|
|
</tr>
|
|
</thead>
|
|
<tbody><tr id="dli_08_0266__row1363935122313"><td class="cellrowborder" valign="top" width="28.349999999999998%" headers="mcps1.3.2.2.4.1.1 "><p id="dli_08_0266__p56404542316">spark.sql.files.maxRecordsPerFile</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.56%" headers="mcps1.3.2.2.4.1.2 "><p id="dli_08_0266__p1864016519235">0</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="62.09%" headers="mcps1.3.2.2.4.1.3 "><p id="dli_08_0266__p186405592312">Maximum number of records to be written into a single file. If the value is zero or negative, there is no limit.</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="dli_08_0266__row364020522314"><td class="cellrowborder" valign="top" width="28.349999999999998%" headers="mcps1.3.2.2.4.1.1 "><p id="dli_08_0266__p146401254236">spark.sql.shuffle.partitions</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.56%" headers="mcps1.3.2.2.4.1.2 "><p id="dli_08_0266__p176409516238">200</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="62.09%" headers="mcps1.3.2.2.4.1.3 "><p id="dli_08_0266__p186400514233">Default number of partitions used to filter data for join or aggregation.</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="dli_08_0266__row1664018513237"><td class="cellrowborder" valign="top" width="28.349999999999998%" headers="mcps1.3.2.2.4.1.1 "><p id="dli_08_0266__p464005132313">spark.sql.dynamicPartitionOverwrite.enabled</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.56%" headers="mcps1.3.2.2.4.1.2 "><p id="dli_08_0266__p4640185132312">false</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="62.09%" headers="mcps1.3.2.2.4.1.3 "><p id="dli_08_0266__p011513355428">Whether DLI overwrites the partitions where data will be written into during runtime. If you set this parameter to <strong id="dli_08_0266__b1764369104317">false</strong>, all partitions that meet the specified condition will be deleted before data overwrite starts. For example, if you set <strong id="dli_08_0266__b4874153035118">false</strong> and use INSERT OVERWRITE to write partition 2021-02 to a partitioned table that has the 2021-01 partition, this partition will be deleted.</p>
|
|
<p id="dli_08_0266__p1611519355427">If you set this parameter to <strong id="dli_08_0266__b462663524314">true</strong>, DLI does not delete partitions before overwrite starts.</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="dli_08_0266__row56403510234"><td class="cellrowborder" valign="top" width="28.349999999999998%" headers="mcps1.3.2.2.4.1.1 "><p id="dli_08_0266__p106403522310">spark.sql.files.maxPartitionBytes</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.56%" headers="mcps1.3.2.2.4.1.2 "><p id="dli_08_0266__p9640175112311">134217728</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="62.09%" headers="mcps1.3.2.2.4.1.3 "><p id="dli_08_0266__p6640759237">Maximum number of bytes to be packed into a single partition when a file is read.</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="dli_08_0266__row1364118532316"><td class="cellrowborder" valign="top" width="28.349999999999998%" headers="mcps1.3.2.2.4.1.1 "><p id="dli_08_0266__p1641751235">spark.sql.badRecordsPath</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.56%" headers="mcps1.3.2.2.4.1.2 "><p id="dli_08_0266__p206411656232">-</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="62.09%" headers="mcps1.3.2.2.4.1.3 "><p id="dli_08_0266__p46411552315">Path of bad records.</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="dli_08_0266__row8407125716274"><td class="cellrowborder" valign="top" width="28.349999999999998%" headers="mcps1.3.2.2.4.1.1 "><p id="dli_08_0266__p10133536151012">dli.sql.sqlasync.enabled</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.56%" headers="mcps1.3.2.2.4.1.2 "><p id="dli_08_0266__p213318364108">true</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="62.09%" headers="mcps1.3.2.2.4.1.3 "><p id="dli_08_0266__p13134236121018">Whether DDL and DCL statements are executed asynchronously. The value <strong id="dli_08_0266__b785816319547">true</strong> indicates that asynchronous execution is enabled.</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="dli_08_0266__row023695911279"><td class="cellrowborder" valign="top" width="28.349999999999998%" headers="mcps1.3.2.2.4.1.1 "><p id="dli_08_0266__p350815439108">dli.sql.job.timeout</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.56%" headers="mcps1.3.2.2.4.1.2 "><p id="dli_08_0266__p850814351015">-</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="62.09%" headers="mcps1.3.2.2.4.1.3 "><p id="dli_08_0266__p10508043181014">Job running timeout interval, in seconds. If the job times out, it will be canceled.</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="dli_08_0266__row1866915514510"><td class="cellrowborder" valign="top" width="28.349999999999998%" headers="mcps1.3.2.2.4.1.1 "><p id="dli_08_0266__p3670125144516">spark.sql.keep.distinct.expandThreshold</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.56%" headers="mcps1.3.2.2.4.1.2 "><p id="dli_08_0266__p667010517458">-</p>
|
|
</td>
|
|
<td class="cellrowborder" rowspan="2" valign="top" width="62.09%" headers="mcps1.3.2.2.4.1.3 "><ul id="dli_08_0266__ul55481243174717"><li id="dli_08_0266__li554834374720"><strong id="dli_08_0266__b1015158134716">Parameter description:</strong><p id="dli_08_0266__p1408759125610">When running queries with multidimensional analysis that include the <strong id="dli_08_0266__b2095889103718">count(distinct)</strong> function using the cube structure in Spark, the typical execution plan involves using the <strong id="dli_08_0266__b3758141893717">expand</strong> operator. However, this operation can cause query inflation. To avoid this issue, you are advised to configure the following settings:</p>
|
|
<ul id="dli_08_0266__ul1418495175713"><li id="dli_08_0266__li135501458854"><strong id="dli_08_0266__b1977704812">spark.sql.keep.distinct.expandThreshold</strong>:<p id="dli_08_0266__p20462109069">Default value: <strong id="dli_08_0266__b89471211515">-1</strong>, indicating that Spark's default <strong id="dli_08_0266__b14404152112115">expand</strong> operator is used.</p>
|
|
<p id="dli_08_0266__p9739859456">Setting the parameter to a specific value, such as <strong id="dli_08_0266__b1340915616310">512</strong>, defines the threshold for query inflation. If the threshold is exceeded, the <strong id="dli_08_0266__b184511541443">count(distinct)</strong> function will use the <strong id="dli_08_0266__b96911913412">distinct</strong> aggregation operator to execute the query instead of the <strong id="dli_08_0266__b7556201315419">expand</strong> operator.</p>
|
|
</li><li id="dli_08_0266__li112991222578"><strong id="dli_08_0266__b1640416201347">spark.sql.distinct.aggregator.enabled</strong>: whether to forcibly use the <strong id="dli_08_0266__b1212318270412">distinct</strong> aggregation operator. If set to <strong id="dli_08_0266__b169087406418">true</strong>, <strong id="dli_08_0266__b7749114615129">spark.sql.keep.distinct.expandThreshold</strong> is not used.</li></ul>
|
|
</li><li id="dli_08_0266__li35491043114716"><strong id="dli_08_0266__b18108540141417">Use case</strong>: Queries with multidimensional analysis that use the cube structure and may include multiple <strong id="dli_08_0266__b14122913917">count(distinct)</strong> functions, as well as the <strong id="dli_08_0266__b86031436163911">cube</strong> or <strong id="dli_08_0266__b118853916397">rollup</strong> operator.</li><li id="dli_08_0266__li4549543174711"><strong id="dli_08_0266__b1717215834714">Example of a typical use case:</strong><pre class="screen" id="dli_08_0266__screen13423490471">SELECT a1, a2, count(distinct b), count(distinct c) FROM test_distinct group by a1, a2 with cube</pre>
|
|
</li></ul>
|
|
</td>
|
|
</tr>
|
|
<tr id="dli_08_0266__row1564151410467"><td class="cellrowborder" valign="top" headers="mcps1.3.2.2.4.1.1 "><p id="dli_08_0266__p156401417463">spark.sql.distinct.aggregator.enabled</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" headers="mcps1.3.2.2.4.1.2 "><p id="dli_08_0266__p16441413464">false</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="dli_08_0266__row11135125418304"><td class="cellrowborder" valign="top" width="28.349999999999998%" headers="mcps1.3.2.2.4.1.1 "><p id="dli_08_0266__p2013525463013">spark.sql.optimizer.dynamicPartitionPruning.enabled</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.56%" headers="mcps1.3.2.2.4.1.2 "><p id="dli_08_0266__p1135854183018">true</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="62.09%" headers="mcps1.3.2.2.4.1.3 "><p id="dli_08_0266__p0120112114312">This parameter is used to control whether to enable dynamic partition pruning. Dynamic partition pruning can help reduce the amount of data that needs to be scanned and improve query performance when executing SQL queries.</p>
|
|
<ul id="dli_08_0266__ul10431123219399"><li id="dli_08_0266__li34312325392">When set to <strong id="dli_08_0266__b1293118303259">true</strong>, dynamic partition pruning is enabled. SQL automatically detects and deletes partitions that do not meet the WHERE clause conditions during query. This is useful for tables that have a large number of partitions.</li><li id="dli_08_0266__li18431163263919">If SQL queries contain a large number of nested left join operations and the table has a large number of dynamic partitions, a large number of memory resources may be consumed during data parsing. As a result, the memory of the driver node is insufficient and there are frequent Full GCs.<p id="dli_08_0266__p9628429134018"><a name="dli_08_0266__li18431163263919"></a><a name="li18431163263919"></a>To avoid such issues, you can disable dynamic partition pruning by setting this parameter to <strong id="dli_08_0266__b155611420352">false</strong>.</p>
|
|
<p id="dli_08_0266__p317112359398">However, disabling this optimization may reduce query performance. Once disabled, Spark does not automatically prun the partitions that do not meet the requirements.</p>
|
|
</li></ul>
|
|
</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
</div>
|
|
</div>
|
|
<div>
|
|
<div class="familylinks">
|
|
<div class="parentlink"><strong>Parent topic:</strong> <a href="dli_08_0221.html">Spark SQL Syntax Reference</a></div>
|
|
</div>
|
|
</div>
|
|
|