
<a name="mrs_01_1467"></a>

<h1 class="topictitle1">Why Do I Fail to Create a Hive Table?</h1>
<div id="body1595920216190"><div class="section" id="mrs_01_1467__s783c77a8524d4997a662899fe758426e"><h4 class="sectiontitle">Question</h4><p id="mrs_01_1467__a008284e0c00a452987fd4bfb7065e93a">Why do I fail to create a Hive table?</p>
</div>
<div class="section" id="mrs_01_1467__sba5f416af0cc403c8f4ea3553eaf957b"><h4 class="sectiontitle">Answer</h4><p id="mrs_01_1467__a98dbbdeb26a8437c807b2f8a42df3449">Creating a Hive table fails when the source table or subquery has a large number of partitions. Executing the query then requires a large number of tasks, which output a large number of files and cause an out-of-memory (OOM) error in the driver.</p>
<p id="mrs_01_1467__aa498099d8d364ae289e2ecc7ee20da6b">To solve this problem, use <i><b><span class="cmdname" style="font-family:Arial" id="mrs_01_1467__c5888c10569144049a9c1443457169514">distribute by</span></b></i> on a column with suitable cardinality (number of distinct values) in the statement that creates the Hive table.</p>
<p id="mrs_01_1467__a662064ce559d4325aa0bdcda1f4781be">The <i><b><span class="cmdname" style="font-family:Arial" id="mrs_01_1467__c69041cdb278b4ddfa4a93e7696b24f2a">distribute by</span></b></i> clause limits the number of Hive table partitions. The number of partitions that contain data is the smaller of the cardinality of the given column and <span class="parmname" id="mrs_01_1467__p6856ae7d93a944109a0cc7ff83c20d12"><b>spark.sql.shuffle.partitions</b></span>. For example, if <span class="parmname" id="mrs_01_1467__pefde51c39bc647eb910d85d1d953e289"><b>spark.sql.shuffle.partitions</b></span> is 200 but the cardinality of the column is 100, 200 output files are generated, but the other 100 files are empty. Therefore, using a column with very low cardinality, for example 1, causes data skew and affects the distribution of later queries.</p>
<p id="mrs_01_1467__afbb4afb6db23476eb74513b9c735899d">You are advised to use a column whose cardinality is greater than <span class="parmname" id="mrs_01_1467__pfa5f8d6cb8384306953b83d29993c25d"><b>spark.sql.shuffle.partitions</b></span>. The cardinality can be two to three times the parameter value.</p>
<p id="mrs_01_1467__a498cf68a4cd142ff8473c91f633e95dc">Example:</p>
<p id="mrs_01_1467__aff091b9e25d84a7f9e05afcb53dd5ab5"><i><b><span class="cmdname" style="font-family:Arial" id="mrs_01_1467__c4c503c185dee417e83904ce78cee4c0a">create table hivetable1 as select * from sourcetable1 distribute by col_age;</span></b></i></p>
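<p>Before running the statement, it can help to compare the cardinality of the candidate column with the current shuffle partition setting. The following is a minimal sketch that assumes the table and column names from the example above (<b>sourcetable1</b> and <b>col_age</b>). The alias <b>col_age_cardinality</b> is only illustrative.</p>
<pre>-- Show the current value of spark.sql.shuffle.partitions in this session.
SET spark.sql.shuffle.partitions;

-- Count the distinct values of the candidate column (names taken from the example above).
SELECT COUNT(DISTINCT col_age) AS col_age_cardinality FROM sourcetable1;</pre>
<p>If the count is much lower than the parameter value, consider a column with more distinct values, in line with the guidance above.</p>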
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="mrs_01_1457.html">CarbonData FAQ</a></div>
</div>
</div>