Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Lu, Huayi <luhuayi@huawei.com> Co-committed-by: Lu, Huayi <luhuayi@huawei.com>
<a name="EN-US_TOPIC_0000001233761697"></a><a name="EN-US_TOPIC_0000001233761697"></a>
<h1 class="topictitle1">Case: Selecting an Appropriate Distribution Column</h1>
<div id="body8662426"><p id="EN-US_TOPIC_0000001233761697__p12280182616569">The distribution column determines which node each row of a table is stored on. A well-chosen distribution key avoids data skew.</p>
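<p>Whether a table is skewed can be checked by counting its rows on each DN. The statement below is a minimal sketch rather than a definitive procedure: <strong>t1</strong> stands for any hash-distributed table, and it assumes that the <strong>xc_node_id</strong> system column is available in your cluster version.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Count the rows of t1 stored on each DN (assumes the xc_node_id system column exists).
SELECT xc_node_id, count(*) AS row_count
FROM t1
GROUP BY xc_node_id
ORDER BY row_count DESC;
</pre></div></div>
<p>Roughly equal counts indicate an even distribution. One DN holding far more rows than the others indicates skew.</p>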
<p id="EN-US_TOPIC_0000001233761697__p1723010242111">For join queries, you are advised to use the column in the join condition as the distribution key. Rows that join with each other are then stored on the same DN, so the join can be performed locally. This reduces data transfer between DNs and speeds up the query.</p>
<div class="section" id="EN-US_TOPIC_0000001233761697__s901340775fea4ae28a81b049149a33e9"><h4 class="sectiontitle">Before optimization</h4><p id="EN-US_TOPIC_0000001233761697__p183649362019">Column <strong id="EN-US_TOPIC_0000001233761697__b04511822192516">a</strong> is the distribution column of both <strong id="EN-US_TOPIC_0000001233761697__b5744624102515">t1</strong> and <strong id="EN-US_TOPIC_0000001233761697__b13199202719256">t2</strong>. The tables are defined as follows:</p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001233761697__screen93641536317"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span>
<span class="normal">2</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">t1</span><span class="w"> </span><span class="p">(</span><span class="n">a</span><span class="w"> </span><span class="nb">int</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="nb">int</span><span class="p">)</span><span class="w"> </span><span class="n">DISTRIBUTE</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">HASH</span><span class="w"> </span><span class="p">(</span><span class="n">a</span><span class="p">);</span>
<span class="k">CREATE</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">t2</span><span class="w"> </span><span class="p">(</span><span class="n">a</span><span class="w"> </span><span class="nb">int</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="nb">int</span><span class="p">)</span><span class="w"> </span><span class="n">DISTRIBUTE</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">HASH</span><span class="w"> </span><span class="p">(</span><span class="n">a</span><span class="p">);</span>
</pre></div></td></tr></table></div>
</div>
<p id="EN-US_TOPIC_0000001233761697__p1345675064813">The following query is executed:</p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001233761697__screen154565503485"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="k">FROM</span><span class="w"> </span><span class="n">t1</span><span class="p">,</span><span class="w"> </span><span class="n">t2</span><span class="w"> </span><span class="k">WHERE</span><span class="w"> </span><span class="n">t1</span><span class="p">.</span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">t2</span><span class="p">.</span><span class="n">b</span><span class="p">;</span>
</pre></div></td></tr></table></div>
</div>
<p id="EN-US_TOPIC_0000001233761697__p236411361313">In this case, the execution plan contains <strong id="EN-US_TOPIC_0000001233761697__b8474145472616">Streaming(type: REDISTRIBUTE)</strong>. Because <strong>t2</strong> is distributed by <strong>a</strong> but joined on <strong>b</strong>, matching rows of the two tables are not guaranteed to be on the same DN, so the DNs redistribute data among themselves based on the join column. This causes a large amount of data to be transmitted between DNs, as shown in <a href="#EN-US_TOPIC_0000001233761697__fig1836515367112">Figure 1</a>.</p>
<div class="fignone" id="EN-US_TOPIC_0000001233761697__fig1836515367112"><a name="EN-US_TOPIC_0000001233761697__fig1836515367112"></a><a name="fig1836515367112"></a><span class="figcap"><b>Figure 1 </b>Selecting an appropriate distribution column (1)</span><br><span><img id="EN-US_TOPIC_0000001233761697__idf486d549da44693a8002910d7a4ce4c" src="figure/en-us_image_0000001595721561.png"></span></div>
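<p>To check whether a query triggers redistribution, its plan can be inspected with <strong>EXPLAIN</strong>. The statement below is a minimal sketch that assumes <strong>t1</strong> and <strong>t2</strong> have been created as defined above. The plan text varies with the cluster and the data, so no output is reproduced here.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Show the plan without running the query, then look for a
-- Streaming(type: REDISTRIBUTE) operator in the output.
EXPLAIN SELECT * FROM t1, t2 WHERE t1.a = t2.b;
</pre></div></div>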
</div>
<div class="section" id="EN-US_TOPIC_0000001233761697__s1326784d799a47f096ef937a2f080b30"><h4 class="sectiontitle">After optimization</h4><p id="EN-US_TOPIC_0000001233761697__a4b60cd6312214c31b227e63a21a87d59">Use the join condition in the query as the distribution key. Run the following statement to change the distribution key of <strong id="EN-US_TOPIC_0000001233761697__b20682635112814">t2</strong> to <strong id="EN-US_TOPIC_0000001233761697__b1775154718287">b</strong>:</p>
<div class="codecoloring" codetype="Sql" id="EN-US_TOPIC_0000001233761697__sc717a036c9964a7782660e4fc821ea23"><div class="highlight"><table class="highlighttable"><tr><td class="linenos"><div class="linenodiv"><pre><span class="normal">1</span></pre></div></td><td class="code"><div><pre><span></span><span class="k">ALTER</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="n">t2</span><span class="w"> </span><span class="n">DISTRIBUTE</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="n">HASH</span><span class="w"> </span><span class="p">(</span><span class="n">b</span><span class="p">);</span>
</pre></div></td></tr></table></div>
</div>
<p id="EN-US_TOPIC_0000001233761697__a5dae7da84fc040a6965434eb00d83702">After the distribution column of table <strong id="EN-US_TOPIC_0000001233761697__b177791359152819">t2</strong> is changed to column <strong id="EN-US_TOPIC_0000001233761697__b169787118292">b</strong>, the execution plan no longer contains <strong id="EN-US_TOPIC_0000001233761697__b1922016914295">Streaming(type: REDISTRIBUTE)</strong>. Less data is transferred between DNs, and the execution time drops from 8.7 ms to 2.7 ms, as shown in <a href="#EN-US_TOPIC_0000001233761697__f31ac330ea3754611a58f39357c543877">Figure 2</a>.</p>
<div class="fignone" id="EN-US_TOPIC_0000001233761697__f31ac330ea3754611a58f39357c543877"><a name="EN-US_TOPIC_0000001233761697__f31ac330ea3754611a58f39357c543877"></a><a name="f31ac330ea3754611a58f39357c543877"></a><span class="figcap"><b>Figure 2 </b>Selecting an appropriate distribution column (2)</span><br><span><img id="EN-US_TOPIC_0000001233761697__iab51373b65f84777be4c0135e10a0d15" src="figure/en-us_image_0000001188163808.png"></span></div>
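<p>If the join pattern is known before a table is created, the join column can be declared as the distribution key in the table definition, which avoids changing it later. The following is a sketch only: <strong>t2_by_b</strong> is a hypothetical table name used for illustration, with the same columns as <strong>t2</strong>.</p>
<div class="codecoloring" codetype="Sql"><div class="highlight"><pre>-- Hypothetical table t2_by_b: same columns as t2, hash-distributed on the join column b.
CREATE TABLE t2_by_b (a int, b int) DISTRIBUTE BY HASH (b);

-- The join condition now compares the distribution keys of both tables.
-- Inspect the plan to confirm that no redistribution step appears.
EXPLAIN SELECT * FROM t1, t2_by_b WHERE t1.a = t2_by_b.b;
</pre></div></div>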
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="dws_04_0474.html">Optimization Cases</a></div>
</div>
</div>