forked from docs/doc-exports
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Su, Xiaomeng <suxiaomeng1@huawei.com> Co-committed-by: Su, Xiaomeng <suxiaomeng1@huawei.com>
306 lines
35 KiB
HTML
<a name="dli_08_15049"></a><a name="dli_08_15049"></a>
|
|
|
|
<h1 class="topictitle1">Hive Source Table</h1>
|
|
<div id="body0000001760320197"><div class="section" id="dli_08_15049__section194845717396"><h4 class="sectiontitle">Introduction</h4><p id="dli_08_15049__p29131852188"><a href="https://hive.apache.org/" target="_blank" rel="noopener noreferrer">Apache Hive</a> has established itself as a focal point of the data warehousing ecosystem. It serves not only as a SQL engine for big data analytics and ETL, but also as a data management platform where data is discovered, defined, and evolved.</p>
|
|
<p id="dli_08_15049__p1097343817105">Flink offers a two-fold integration with Hive. The first is to leverage Hive's Metastore as a persistent catalog. The second is to offer Flink as an alternative engine for reading and writing Hive tables. For details, see <a href="https://nightlies.apache.org/flink/flink-docs-release-1.15/zh/docs/connectors/table/hive/overview/" target="_blank" rel="noopener noreferrer">Overview | Apache Flink</a>.</p>
|
|
<p id="dli_08_15049__p10588031164617">Starting from 1.11.0, Flink allows users to write SQL statements in Hive syntax when the Hive dialect is used. Compatibility with Hive syntax improves interoperability with Hive and reduces the cases in which users need to switch between Flink and Hive to execute different statements. For details, see <a href="https://nightlies.apache.org/flink/flink-docs-release-1.11/dev/table/hive/hive_dialect.html" target="_blank" rel="noopener noreferrer">Apache Flink Hive Dialect</a>.</p>
|
|
<p id="dli_08_15049__p34209344502">Using the HiveCatalog, Apache Flink can be used for unified BATCH and STREAM processing of Apache Hive tables. This means Flink can be used as a more performant alternative to Hive's batch engine, or to continuously read and write data into and out of Hive tables to power real-time data warehousing applications. For details, see <a href="https://nightlies.apache.org/flink/flink-docs-release-1.15/zh/docs/connectors/table/hive/hive_read_write/" target="_blank" rel="noopener noreferrer">Apache Flink Hive Read &amp; Write</a>.</p>
|
|
</div>
|
|
<div class="section" id="dli_08_15049__section77915506517"><h4 class="sectiontitle">Function</h4><p id="dli_08_15049__p1529893619479">This section describes how to use Flink to read and write Hive tables, the definition of the Hive source table, parameters used for creating the source table, and sample code. For details, see <a href="https://nightlies.apache.org/flink/flink-docs-release-1.15/zh/docs/connectors/table/hive/hive_read_write/" target="_blank" rel="noopener noreferrer">Apache Flink Hive Read &amp; Write</a>.</p>
|
|
<p id="dli_08_15049__p1851164112525">Flink supports reading data from Hive in both <strong id="dli_08_15049__b3622619202">BATCH</strong> and <strong id="dli_08_15049__b145453410206">STREAMING</strong> modes. When running as a <strong id="dli_08_15049__b236311812202">BATCH</strong> application, Flink will execute its query over the state of the table at the point in time when the query is executed. <strong id="dli_08_15049__b88834117202">STREAMING</strong> reads will continuously monitor the table and incrementally fetch new data as it is made available. Flink will read tables as bounded by default.</p>
|
|
<p id="dli_08_15049__p3490142614581"><strong id="dli_08_15049__b10240218152017">STREAMING</strong> reads support consuming both partitioned and non-partitioned tables. For partitioned tables, Flink will monitor the generation of new partitions, and read them incrementally when available. For non-partitioned tables, Flink will monitor the generation of new files in the folder and read new files incrementally.</p>
|
|
</div>
|
|
<div class="section" id="dli_08_15049__dli_08_0256_en-us_topic_0132788972_section2579142713429"><h4 class="sectiontitle">Prerequisites</h4><div class="p" id="dli_08_15049__p12653145115206">To create a Hive source table, an enhanced datasource connection is required. You can set security group rules as required when you configure the connection.
|
|
</div>
|
|
</div>
|
|
<div class="section" id="dli_08_15049__section1230618441125"><h4 class="sectiontitle">Caveats</h4><ul id="dli_08_15049__ul181711754173415"><li id="dli_08_15049__li6647155118407">When you create a Flink OpenSource SQL job, set <strong id="dli_08_15049__b194668124622539">Flink Version</strong> to <strong id="dli_08_15049__b98533376922539">1.15</strong> in the <strong id="dli_08_15049__b175124672822539">Running Parameters</strong> tab. Select <strong id="dli_08_15049__b158491806322539">Save Job Log</strong>, and specify the OBS bucket for saving job logs.</li><li id="dli_08_15049__li118441048194615">For details about how to use data types, see <a href="dli_08_15014.html">Format</a>.</li><li id="dli_08_15049__li1877614152149">With Hive syntax (Hive dialect DDL statements), Flink 1.15 currently supports creating only OBS tables and DLI lakehouse tables.<ul id="dli_08_15049__ul52301539191612"><li id="dli_08_15049__li10367153891614">To create an OBS table using Hive syntax:<ul id="dli_08_15049__ul1176113556160"><li id="dli_08_15049__li11290555171618">For the default dialect, set <strong id="dli_08_15049__b314183432513">hive.is-external</strong> to <strong id="dli_08_15049__b438953820258">true</strong> in the WITH properties.</li><li id="dli_08_15049__li147011552191818">For the Hive dialect, use the <strong id="dli_08_15049__b193521328102620">EXTERNAL</strong> keyword in the CREATE TABLE statement.</li></ul>
|
|
</li><li id="dli_08_15049__li1083810474167">To create a DLI lakehouse table using Hive syntax:<ul id="dli_08_15049__ul1010114179193"><li id="dli_08_15049__li15304829161914">For the Hive dialect, add <strong id="dli_08_15049__b873392514278">'is_lakehouse'='true'</strong> to the table properties.</li></ul>
|
|
</li></ul>
|
|
</li><li id="dli_08_15049__li1488154619315">Enable checkpointing.</li><li id="dli_08_15049__li42041865253">You are advised to switch to the Hive dialect to create Hive-compatible tables. If you create Hive-compatible tables with the default dialect, set <strong id="dli_08_15049__b19196327182011">'connector'='hive'</strong> in the table properties. Otherwise, the table is considered generic by default in HiveCatalog. The connector property is not required if you use the Hive dialect.</li><li id="dli_08_15049__li52049642510">The monitoring strategy scans all directories and files currently in the location path. A large number of partitions may degrade performance.</li><li id="dli_08_15049__li420412619252">Streaming reads of non-partitioned tables require that each file be written atomically into the target directory.</li><li id="dli_08_15049__li22048612257">Streaming reads of partitioned tables require that each partition be added atomically from the perspective of the Hive metastore. Otherwise, new data added to an existing partition will be consumed.</li><li id="dli_08_15049__li11204362252">Streaming reads do not support the watermark grammar in Flink DDL, so these tables cannot be used for window operators.</li></ul>
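<p>The dialect-related caveats above can be sketched as follows. This is a minimal illustration, not a definitive template: the table names, columns, and OBS path are hypothetical, the <strong>SET</strong> syntax follows the open-source Flink 1.15 SQL client and may need adapting to the DLI job editor, and storage location settings for the default-dialect table are omitted.</p>
<pre class="screen">-- Hive dialect: the EXTERNAL keyword marks an OBS table (names and path are hypothetical).
SET 'table.sql-dialect' = 'hive';

CREATE EXTERNAL TABLE IF NOT EXISTS demo.orders_obs (
  order_id STRING,
  amount   DOUBLE
)
STORED AS PARQUET
LOCATION 'obs://your-bucket/path/to/orders_obs';

-- Hive dialect: a DLI lakehouse table is marked through a table property.
CREATE TABLE IF NOT EXISTS demo.orders_lakehouse (
  order_id STRING,
  amount   DOUBLE
)
STORED AS PARQUET
TBLPROPERTIES ('is_lakehouse' = 'true');

-- Default dialect: 'connector' = 'hive' makes the table Hive-compatible,
-- and 'hive.is-external' = 'true' replaces the EXTERNAL keyword.
SET 'table.sql-dialect' = 'default';

CREATE TABLE IF NOT EXISTS demo.orders_default (
  order_id STRING,
  amount   DOUBLE
) WITH (
  'connector' = 'hive',
  'hive.is-external' = 'true'
);</pre>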
|
|
</div>
|
|
<div class="section" id="dli_08_15049__section394917571396"><h4 class="sectiontitle">Syntax</h4><div class="codecoloring" codetype="Sql" id="dli_08_15049__screen1694920577399"><div class="highlight"><table class="highlighttable"><tr><td class="code"><div><pre><span></span><span class="k">CREATE</span><span class="w"> </span><span class="k">EXTERNAL</span><span class="w"> </span><span class="k">TABLE</span><span class="w"> </span><span class="p">[</span><span class="k">IF</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">EXISTS</span><span class="p">]</span><span class="w"> </span><span class="k">table_name</span>
<span class="w"> </span><span class="p">[(</span><span class="n">col_name</span><span class="w"> </span><span class="n">data_type</span><span class="w"> </span><span class="p">[</span><span class="n">column_constraint</span><span class="p">]</span><span class="w"> </span><span class="p">[</span><span class="k">COMMENT</span><span class="w"> </span><span class="n">col_comment</span><span class="p">],</span><span class="w"> </span><span class="p">...</span><span class="w"> </span><span class="p">[</span><span class="n">table_constraint</span><span class="p">])]</span>
<span class="w"> </span><span class="p">[</span><span class="k">COMMENT</span><span class="w"> </span><span class="n">table_comment</span><span class="p">]</span>
<span class="w"> </span><span class="p">[</span><span class="n">PARTITIONED</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="p">(</span><span class="n">col_name</span><span class="w"> </span><span class="n">data_type</span><span class="w"> </span><span class="p">[</span><span class="k">COMMENT</span><span class="w"> </span><span class="n">col_comment</span><span class="p">],</span><span class="w"> </span><span class="p">...)]</span>
<span class="w"> </span><span class="p">[</span>
<span class="w"> </span><span class="p">[</span><span class="k">ROW</span><span class="w"> </span><span class="n">FORMAT</span><span class="w"> </span><span class="n">row_format</span><span class="p">]</span>
<span class="w"> </span><span class="p">[</span><span class="n">STORED</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="n">file_format</span><span class="p">]</span>
<span class="w"> </span><span class="p">]</span>
<span class="w"> </span><span class="p">[</span><span class="k">LOCATION</span><span class="w"> </span><span class="n">obs_path</span><span class="p">]</span>
<span class="w"> </span><span class="p">[</span><span class="n">TBLPROPERTIES</span><span class="w"> </span><span class="p">(</span><span class="n">property_name</span><span class="o">=</span><span class="n">property_value</span><span class="p">,</span><span class="w"> </span><span class="p">...)]</span>

<span class="n">row_format</span><span class="p">:</span>
<span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">DELIMITED</span><span class="w"> </span><span class="p">[</span><span class="n">FIELDS</span><span class="w"> </span><span class="n">TERMINATED</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="nb">char</span><span class="w"> </span><span class="p">[</span><span class="n">ESCAPED</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="nb">char</span><span class="p">]]</span><span class="w"> </span><span class="p">[</span><span class="n">COLLECTION</span><span class="w"> </span><span class="n">ITEMS</span><span class="w"> </span><span class="n">TERMINATED</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="nb">char</span><span class="p">]</span>
<span class="w"> </span><span class="p">[</span><span class="k">MAP</span><span class="w"> </span><span class="n">KEYS</span><span class="w"> </span><span class="n">TERMINATED</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="nb">char</span><span class="p">]</span><span class="w"> </span><span class="p">[</span><span class="n">LINES</span><span class="w"> </span><span class="n">TERMINATED</span><span class="w"> </span><span class="k">BY</span><span class="w"> </span><span class="nb">char</span><span class="p">]</span>
<span class="w"> </span><span class="p">[</span><span class="k">NULL</span><span class="w"> </span><span class="k">DEFINED</span><span class="w"> </span><span class="k">AS</span><span class="w"> </span><span class="nb">char</span><span class="p">]</span>
<span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">SERDE</span><span class="w"> </span><span class="n">serde_name</span><span class="w"> </span><span class="p">[</span><span class="k">WITH</span><span class="w"> </span><span class="n">SERDEPROPERTIES</span><span class="w"> </span><span class="p">(</span><span class="n">property_name</span><span class="o">=</span><span class="n">property_value</span><span class="p">,</span><span class="w"> </span><span class="p">...)]</span>

<span class="n">file_format</span><span class="p">:</span>
<span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="n">SEQUENCEFILE</span>
<span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">TEXTFILE</span>
<span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">RCFILE</span>
<span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">ORC</span>
<span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">PARQUET</span>
<span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">AVRO</span>
<span class="w"> </span><span class="o">|</span><span class="w"> </span><span class="n">INPUTFORMAT</span><span class="w"> </span><span class="n">input_format_classname</span><span class="w"> </span><span class="n">OUTPUTFORMAT</span><span class="w"> </span><span class="n">output_format_classname</span>

<span class="n">column_constraint</span><span class="p">:</span>
<span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="k">NOT</span><span class="w"> </span><span class="k">NULL</span><span class="w"> </span><span class="p">[[</span><span class="n">ENABLE</span><span class="o">|</span><span class="n">DISABLE</span><span class="p">]</span><span class="w"> </span><span class="p">[</span><span class="n">VALIDATE</span><span class="o">|</span><span class="n">NOVALIDATE</span><span class="p">]</span><span class="w"> </span><span class="p">[</span><span class="n">RELY</span><span class="o">|</span><span class="n">NORELY</span><span class="p">]]</span>

<span class="n">table_constraint</span><span class="p">:</span>
<span class="w"> </span><span class="p">:</span><span class="w"> </span><span class="p">[</span><span class="k">CONSTRAINT</span><span class="w"> </span><span class="k">constraint_name</span><span class="p">]</span><span class="w"> </span><span class="k">PRIMARY</span><span class="w"> </span><span class="k">KEY</span><span class="w"> </span><span class="p">(</span><span class="n">col_name</span><span class="p">,</span><span class="w"> </span><span class="p">...)</span><span class="w"> </span><span class="p">[[</span><span class="n">ENABLE</span><span class="o">|</span><span class="n">DISABLE</span><span class="p">]</span><span class="w"> </span><span class="p">[</span><span class="n">VALIDATE</span><span class="o">|</span><span class="n">NOVALIDATE</span><span class="p">]</span><span class="w"> </span><span class="p">[</span><span class="n">RELY</span><span class="o">|</span><span class="n">NORELY</span><span class="p">]]</span>
</pre></div></td></tr></table></div>
|
|
</div>
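<p>As a concrete instance of the grammar above, the following sketch creates a partitioned, comma-delimited text table on OBS and enables streaming reads through <strong>TBLPROPERTIES</strong>. The table name, columns, OBS path, and the 5-minute interval are assumptions chosen for illustration, and the statement is expected to run with the Hive dialect enabled.</p>
<pre class="screen">-- Hypothetical table; adjust names, columns, and the OBS path to your environment.
CREATE EXTERNAL TABLE IF NOT EXISTS demo.page_views (
  user_id   STRING COMMENT 'visitor identifier',
  url       STRING,
  view_time TIMESTAMP
)
COMMENT 'page views stored on OBS'
PARTITIONED BY (dt STRING COMMENT 'day of the visit')
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'obs://your-bucket/path/to/page_views'
TBLPROPERTIES (
  'streaming-source.enable' = 'true',            -- read the table as an unbounded source
  'streaming-source.monitor-interval' = '5 min'  -- how often new partitions are looked up
);</pre>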
|
|
</div>
|
|
<div class="section" id="dli_08_15049__section1794965793911"><h4 class="sectiontitle">Parameter Description</h4>
|
|
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="dli_08_15049__table82421231587" frame="border" border="1" rules="all"><caption><b>Table 1 </b>TBLPROPERTIES parameters</caption><thead align="left"><tr id="dli_08_15049__row13242831486"><th align="left" class="cellrowborder" valign="top" width="19.25%" id="mcps1.3.6.2.2.6.1.1"><p id="dli_08_15049__p524213118810">Parameter</p>
|
|
</th>
|
|
<th align="left" class="cellrowborder" valign="top" width="9.51%" id="mcps1.3.6.2.2.6.1.2"><p id="dli_08_15049__p1124218311813">Mandatory</p>
|
|
</th>
|
|
<th align="left" class="cellrowborder" valign="top" width="9.92%" id="mcps1.3.6.2.2.6.1.3"><p id="dli_08_15049__p8335102131613">Default Value</p>
|
|
</th>
|
|
<th align="left" class="cellrowborder" valign="top" width="9.28%" id="mcps1.3.6.2.2.6.1.4"><p id="dli_08_15049__p12962019151617">Data Type</p>
|
|
</th>
|
|
<th align="left" class="cellrowborder" valign="top" width="52.04%" id="mcps1.3.6.2.2.6.1.5"><p id="dli_08_15049__p15242163119815">Description</p>
|
|
</th>
|
|
</tr>
|
|
</thead>
|
|
<tbody><tr id="dli_08_15049__row561201665917"><td class="cellrowborder" valign="top" width="19.25%" headers="mcps1.3.6.2.2.6.1.1 "><p id="dli_08_15049__p6538167598">streaming-source.enable</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.51%" headers="mcps1.3.6.2.2.6.1.2 "><p id="dli_08_15049__p721115265018">No</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.92%" headers="mcps1.3.6.2.2.6.1.3 "><p id="dli_08_15049__p17537163593">false</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.28%" headers="mcps1.3.6.2.2.6.1.4 "><p id="dli_08_15049__p2053716165913">Boolean</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="52.04%" headers="mcps1.3.6.2.2.6.1.5 "><p id="dli_08_15049__p112762304119">Whether to enable the streaming source. Make sure that each partition or file is written atomically. Otherwise, the reader may get incomplete data.</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="dli_08_15049__row1061816175912"><td class="cellrowborder" valign="top" width="19.25%" headers="mcps1.3.6.2.2.6.1.1 "><p id="dli_08_15049__p25371645910">streaming-source.partition.include</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.51%" headers="mcps1.3.6.2.2.6.1.2 "><p id="dli_08_15049__p42101026805">No</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.92%" headers="mcps1.3.6.2.2.6.1.3 "><p id="dli_08_15049__p15371615598">all</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.28%" headers="mcps1.3.6.2.2.6.1.4 "><p id="dli_08_15049__p165331675920">String</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="52.04%" headers="mcps1.3.6.2.2.6.1.5 "><p id="dli_08_15049__p15743125915810">Partitions to read. The supported values are <strong id="dli_08_15049__b63861449113411">all</strong> and <strong id="dli_08_15049__b17954115673413">latest</strong>. The default value is <strong id="dli_08_15049__b23641918113712">all</strong>.</p>
<p id="dli_08_15049__p26511454183010"><strong id="dli_08_15049__b137062036163511">all</strong> reads all partitions.</p>
<p id="dli_08_15049__p721810391319"><strong id="dli_08_15049__b490324143920">latest</strong> takes effect only when the streaming Hive source table is used as a temporal table. <strong id="dli_08_15049__b736612518377">latest</strong> reads the latest partition in the order specified by <strong id="dli_08_15049__b356317915389">streaming-source.partition-order</strong>.</p>
<p id="dli_08_15049__p3751664317">Flink supports temporal joins against the latest Hive partition when <strong id="dli_08_15049__b5683632203812">streaming-source.enable</strong> is enabled and <strong id="dli_08_15049__b1146811411387">streaming-source.partition.include</strong> is set to <strong id="dli_08_15049__b1057114467388">latest</strong>. You can also specify the partition comparison order and the data update interval using the partition-related options described below.</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="dli_08_15049__row156114163593"><td class="cellrowborder" valign="top" width="19.25%" headers="mcps1.3.6.2.2.6.1.1 "><p id="dli_08_15049__p1253111695918">streaming-source.monitor-interval</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.51%" headers="mcps1.3.6.2.2.6.1.2 "><p id="dli_08_15049__p17209122614019">No</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.92%" headers="mcps1.3.6.2.2.6.1.3 "><p id="dli_08_15049__p15341695914">None</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.28%" headers="mcps1.3.6.2.2.6.1.4 "><p id="dli_08_15049__p1353201616592">Duration</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="52.04%" headers="mcps1.3.6.2.2.6.1.5 "><p id="dli_08_15049__p1030318531096">Interval at which partitions or files are monitored. If this parameter is not set, the default interval is '1 m' for Hive streaming reads and '60 m' for Hive streaming temporal joins. The longer default for temporal joins reflects a framework limitation: in the current implementation, every TaskManager (TM) accesses the Hive metastore, which may put pressure on the metastore. This is expected to improve in the future.</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="dli_08_15049__row961101625912"><td class="cellrowborder" valign="top" width="19.25%" headers="mcps1.3.6.2.2.6.1.1 "><p id="dli_08_15049__p125391619594">streaming-source.partition-order</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.51%" headers="mcps1.3.6.2.2.6.1.2 "><p id="dli_08_15049__p920816261106">No</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.92%" headers="mcps1.3.6.2.2.6.1.3 "><p id="dli_08_15049__p2539167596">partition-name</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.28%" headers="mcps1.3.6.2.2.6.1.4 "><p id="dli_08_15049__p653101645915">String</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="52.04%" headers="mcps1.3.6.2.2.6.1.5 "><p id="dli_08_15049__p281172219134">Partition order of the streaming source. The supported values are <strong id="dli_08_15049__b9425738194011">create-time</strong>, <strong id="dli_08_15049__b12342114019405">partition-time</strong>, and <strong id="dli_08_15049__b201790456400">partition-name</strong>.</p>
<p id="dli_08_15049__p4280104111113"><strong id="dli_08_15049__b3906115034016">create-time</strong> compares the partition or file creation time. This is not the partition creation time recorded in the Hive metastore, but the modification time of the folder or file in the file system. If the partition folder is updated, for example, when a new file is added to it, how the data is consumed can be affected.</p>
<p id="dli_08_15049__p2792141711814"><strong id="dli_08_15049__b179901916154114">partition-time</strong> compares the time extracted from the partition name.</p>
<p id="dli_08_15049__p125172558134"><strong id="dli_08_15049__b9617202715411">partition-name</strong> compares partition names in alphabetical order.</p>
<p id="dli_08_15049__p132261513161413">For a non-partitioned table, this value should always be <strong id="dli_08_15049__b8838135224116">create-time</strong>.</p>
<p id="dli_08_15049__p057034110116">The default value is <strong id="dli_08_15049__b5731195916419">partition-name</strong>. This option is equivalent to the deprecated option <strong id="dli_08_15049__b047991834219">streaming-source.consume-order</strong>.</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="dli_08_15049__row260816155910"><td class="cellrowborder" valign="top" width="19.25%" headers="mcps1.3.6.2.2.6.1.1 "><p id="dli_08_15049__p155314163598">streaming-source.consume-start-offset</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.51%" headers="mcps1.3.6.2.2.6.1.2 "><p id="dli_08_15049__p2020619261707">No</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.92%" headers="mcps1.3.6.2.2.6.1.3 "><p id="dli_08_15049__p1653121665911">None</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.28%" headers="mcps1.3.6.2.2.6.1.4 "><p id="dli_08_15049__p253111655911">String</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="52.04%" headers="mcps1.3.6.2.2.6.1.5 "><p id="dli_08_15049__p11638151119170">Start offset for streaming consumption. How the offset is parsed and compared depends on the partition order. For <strong id="dli_08_15049__b15433183917438">create-time</strong> and <strong id="dli_08_15049__b2142741174316">partition-time</strong>, the value must be a timestamp string (yyyy-[m]m-[d]d [hh:mm:ss]).</p>
<p id="dli_08_15049__p739713771611">For <strong id="dli_08_15049__b9127173554315">partition-time</strong>, the partition time extractor is used to extract the time from the partition. For <strong id="dli_08_15049__b19801133020438">partition-name</strong>, the value is a partition name string (for example, pt_year=2020/pt_mon=10/pt_day=01).</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="dli_08_15049__row5846124142510"><td class="cellrowborder" valign="top" width="19.25%" headers="mcps1.3.6.2.2.6.1.1 "><p id="dli_08_15049__p128472417258">is_lakehouse</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.51%" headers="mcps1.3.6.2.2.6.1.2 "><p id="dli_08_15049__p1184711415255">No</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.92%" headers="mcps1.3.6.2.2.6.1.3 "><p id="dli_08_15049__p208477452516">None</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="9.28%" headers="mcps1.3.6.2.2.6.1.4 "><p id="dli_08_15049__p1984712418251">Boolean</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="52.04%" headers="mcps1.3.6.2.2.6.1.5 "><p id="dli_08_15049__p6847046253">Set this parameter to <strong id="dli_08_15049__b1142115371119">true</strong> for DLI lakehouse tables created using Hive syntax.</p>
|
|
</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
</div>
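<p>The options in Table 1 combine as in the following temporal-join sketch, modeled on the pattern in the open-source Flink documentation. The table and column names are hypothetical, and <strong>orders</strong> is assumed to be an existing streaming table that exposes a processing-time attribute named <strong>proctime</strong>. The DDL uses the Hive dialect, while the join query uses the default dialect.</p>
<pre class="screen">-- Dimension table: only the latest partition is loaded and refreshed every 12 hours.
SET 'table.sql-dialect' = 'hive';

CREATE TABLE IF NOT EXISTS demo.product_dim (
  product_id   STRING,
  product_name STRING,
  unit_price   DECIMAL(10, 4)
)
PARTITIONED BY (pt_year STRING, pt_month STRING, pt_day STRING)
TBLPROPERTIES (
  'streaming-source.enable' = 'true',
  'streaming-source.partition.include' = 'latest',
  'streaming-source.monitor-interval' = '12 h',
  'streaming-source.partition-order' = 'partition-name'  -- default value, shown for clarity
);

-- Temporal join: each order is enriched with the latest partition of the dimension table.
SET 'table.sql-dialect' = 'default';

SELECT o.order_id, o.product_id, dim.product_name, dim.unit_price
FROM orders AS o
JOIN demo.product_dim FOR SYSTEM_TIME AS OF o.proctime AS dim
  ON o.product_id = dim.product_id;</pre>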
|
|
<ul id="dli_08_15049__ul12473174043716"><li id="dli_08_15049__li8473174016371"><strong id="dli_08_15049__b5941151634416">Source Parallelism Inference</strong><p id="dli_08_15049__p18476974338">By default, Flink infers the Hive source parallelism from the number of splits, which in turn depends on the number of files and the number of blocks in those files.</p>
|
|
<p id="dli_08_15049__p134761072331">Flink allows you to flexibly configure the policy of parallelism inference. You can configure the following parameters in TableConfig (note that these parameters affect all sources of the job):</p>
|
|
|
|
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="dli_08_15049__table1440818196330" frame="border" border="1" rules="all"><thead align="left"><tr id="dli_08_15049__row1942516191330"><th align="left" class="cellrowborder" valign="top" width="25%" id="mcps1.3.6.3.1.4.1.5.1.1"><p id="dli_08_15049__p15425131923318">Key</p>
|
|
</th>
|
|
<th align="left" class="cellrowborder" valign="top" width="25%" id="mcps1.3.6.3.1.4.1.5.1.2"><p id="dli_08_15049__p16425171911336">Default</p>
|
|
</th>
|
|
<th align="left" class="cellrowborder" valign="top" width="13.209999999999999%" id="mcps1.3.6.3.1.4.1.5.1.3"><p id="dli_08_15049__p17425151993311">Type</p>
|
|
</th>
|
|
<th align="left" class="cellrowborder" valign="top" width="36.79%" id="mcps1.3.6.3.1.4.1.5.1.4"><p id="dli_08_15049__p94253192334">Description</p>
|
|
</th>
|
|
</tr>
|
|
</thead>
|
|
<tbody><tr id="dli_08_15049__row7425101983310"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.6.3.1.4.1.5.1.1 "><p id="dli_08_15049__p2042531916333">table.exec.hive.infer-source-parallelism</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.6.3.1.4.1.5.1.2 "><p id="dli_08_15049__p4425111914337">true</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="13.209999999999999%" headers="mcps1.3.6.3.1.4.1.5.1.3 "><p id="dli_08_15049__p164251219113312">Boolean</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="36.79%" headers="mcps1.3.6.3.1.4.1.5.1.4 "><p id="dli_08_15049__p106102448334">If the value is <strong id="dli_08_15049__b18150151164919">true</strong>, the source parallelism is inferred from the number of splits. If the value is <strong id="dli_08_15049__b1917917251499">false</strong>, the source parallelism is set by <strong id="dli_08_15049__b647793134919">config</strong>.</p>
|
|
</td>
|
|
</tr>
|
|
<tr id="dli_08_15049__row1942511993312"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.6.3.1.4.1.5.1.1 "><p id="dli_08_15049__p842591915330">table.exec.hive.infer-source-parallelism.max</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.6.3.1.4.1.5.1.2 "><p id="dli_08_15049__p9425419113314">1000</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="13.209999999999999%" headers="mcps1.3.6.3.1.4.1.5.1.3 "><p id="dli_08_15049__p1342591933312">Integer</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="36.79%" headers="mcps1.3.6.3.1.4.1.5.1.4 "><p id="dli_08_15049__p92841251173313">Maximum inferred parallelism for the source operator.</p>
|
|
</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
</div>
|
|
</li></ul>
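<p>A minimal sketch of adjusting parallelism inference is shown below. The <strong>SET</strong> syntax follows the open-source Flink 1.15 SQL client and is an assumption here; on DLI, the same keys may instead need to be supplied as job runtime configuration. The cap of 64 is an arbitrary example value.</p>
<pre class="screen">-- Keep inference enabled but cap the inferred source parallelism (example value).
SET 'table.exec.hive.infer-source-parallelism' = 'true';
SET 'table.exec.hive.infer-source-parallelism.max' = '64';

-- Alternatively, disable inference so that the source parallelism is taken from the configuration.
-- SET 'table.exec.hive.infer-source-parallelism' = 'false';</pre>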
|
|
<ul id="dli_08_15049__ul121810191377"><li id="dli_08_15049__li31811819133716"><strong id="dli_08_15049__b731192315443">Load Partition Splits</strong><p id="dli_08_15049__p1018654213411">Multiple threads are used to split Hive partitions. You can use <strong id="dli_08_15049__b1457519511517">table.exec.hive.load-partition-splits.thread-num</strong> to configure the number of threads. The default value is <strong id="dli_08_15049__b04681117165114">3</strong>, and the configured value must be greater than 0.</p>
|
|
|
|
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="dli_08_15049__table1250801212555" frame="border" border="1" rules="all"><thead align="left"><tr id="dli_08_15049__row13508191255519"><th align="left" class="cellrowborder" valign="top" width="25%" id="mcps1.3.6.4.1.3.1.5.1.1"><p id="dli_08_15049__p1150831216553">Key</p>
|
|
</th>
|
|
<th align="left" class="cellrowborder" valign="top" width="25%" id="mcps1.3.6.4.1.3.1.5.1.2"><p id="dli_08_15049__p1250851215519">Default</p>
|
|
</th>
|
|
<th align="left" class="cellrowborder" valign="top" width="13.209999999999999%" id="mcps1.3.6.4.1.3.1.5.1.3"><p id="dli_08_15049__p95081012115511">Type</p>
|
|
</th>
|
|
<th align="left" class="cellrowborder" valign="top" width="36.79%" id="mcps1.3.6.4.1.3.1.5.1.4"><p id="dli_08_15049__p18508312165512">Description</p>
|
|
</th>
|
|
</tr>
|
|
</thead>
|
|
<tbody><tr id="dli_08_15049__row1550817128555"><td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.6.4.1.3.1.5.1.1 "><p id="dli_08_15049__p1166081819556">table.exec.hive.load-partition-splits.thread-num</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="25%" headers="mcps1.3.6.4.1.3.1.5.1.2 "><p id="dli_08_15049__p16508161265516">3</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="13.209999999999999%" headers="mcps1.3.6.4.1.3.1.5.1.3 "><p id="dli_08_15049__p6508912105510">Integer</p>
|
|
</td>
|
|
<td class="cellrowborder" valign="top" width="36.79%" headers="mcps1.3.6.4.1.3.1.5.1.4 "><p id="dli_08_15049__p115080126553">Number of threads used to split Hive partitions. The configured value must be greater than 0.</p>
|
|
</td>
|
|
</tr>
|
|
</tbody>
|
|
</table>
|
|
</div>
|
|
<p id="dli_08_15049__p11844011910">SQL hints can be used to apply configurations to a Hive table without changing its definition in the Hive metastore. See <a href="https://nightlies.apache.org/flink/flink-docs-release-1.15/zh/docs/dev/table/sql/queries/hints/" target="_blank" rel="noopener noreferrer">Hints | Apache Flink</a>.</p>
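<p>For example, the following sketch uses an <strong>OPTIONS</strong> hint to turn a single query into a streaming read without altering the table definition. The table name <strong>demo.page_views</strong> and the start offset are hypothetical. Because the partition order is set to <strong>create-time</strong>, the offset is written as a timestamp string.</p>
<pre class="screen">-- The hint applies to this query only; the definition in the Hive metastore is unchanged.
SELECT *
FROM demo.page_views /*+ OPTIONS(
  'streaming-source.enable' = 'true',
  'streaming-source.partition-order' = 'create-time',
  'streaming-source.consume-start-offset' = '2020-05-20 00:00:00'
) */;</pre>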
|
|
</li><li id="dli_08_15049__li4891112212370"><strong id="dli_08_15049__b6791193018223">Vectorized Optimization upon Read</strong><p id="dli_08_15049__p11144319307">Flink will automatically used vectorized reads of Hive tables when the following conditions are met:</p>
|
|
<ul id="dli_08_15049__ul27471666378"><li id="dli_08_15049__li87474612374">Format: ORC or Parquet.</li><li id="dli_08_15049__li127475613379">Columns without complex data type, like hive types: List, Map, Struct, Union.<div class="p" id="dli_08_15049__p41486373210"><a name="dli_08_15049__li127475613379"></a><a name="li127475613379"></a>This feature is enabled by default. It may be disabled with the following configuration.<pre class="screen" id="dli_08_15049__screen1123014327339">table.exec.hive.fallback-mapred-reader=true</pre>
|
|
</div>
|
|
</li></ul>
|
|
</li></ul>
|
|
</div>
|
|
<ul id="dli_08_15049__ul1350013284378"><li id="dli_08_15049__li1350112813717"><strong id="dli_08_15049__b1789165711233">Reading Hive Views</strong><p id="dli_08_15049__p334491372511">Flink can read from views defined in Hive, with the following limitations:</p>
<ul id="dli_08_15049__ul8253145973610"><li id="dli_08_15049__li1525355963611">The Hive catalog must be set as the current catalog before you can query the view. To do so, use either <strong id="dli_08_15049__b1341823201512">tableEnv.useCatalog(...)</strong> in the Table API or <strong id="dli_08_15049__b1034523161516">USE CATALOG ...</strong> in the SQL Client.</li><li id="dli_08_15049__li192531599362">Hive and Flink SQL have different syntax, for example, different reserved keywords and literals. Make sure the view's query is compatible with Flink grammar.</li></ul>
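<div class="p">The first limitation can be sketched in Flink SQL as follows. The catalog name myhive and the database demo are assumed to exist already, and the view name student_view is hypothetical: it stands for any view created in Hive beforehand.<pre class="screen">-- Assumption: a Hive catalog named myhive has already been created.
-- Make it the current catalog before querying the view.
USE CATALOG myhive;
USE demo;

-- student_view is a hypothetical Hive view. Its defining query must also be valid Flink SQL.
SELECT * FROM student_view;</pre>
</div>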
</li></ul>
<div class="section" id="dli_08_15049__section198994135216"><h4 class="sectiontitle">Example</h4><ol id="dli_08_15049__ol1027415591377"><li id="dli_08_15049__li15274125915372">Use Spark SQL to create an OBS table in Hive syntax and insert 10 records to simulate the data source.<pre class="screen" id="dli_08_15049__screen23261341195114">CREATE TABLE IF NOT EXISTS demo.student(
  name STRING,
  score DOUBLE)
PARTITIONED BY (classNo INT)
STORED AS PARQUET
LOCATION 'obs://demo/spark.db/student';

INSERT INTO demo.student PARTITION(classNo=1) VALUES ('Alice', 90.0), ('Bob', 80.0), ('Charlie', 70.0), ('David', 60.0), ('Eve', 50.0), ('Frank', 40.0), ('Grace', 30.0), ('Hank', 20.0), ('Ivy', 10.0), ('Jack', 0.0);</pre>
</li><li id="dli_08_15049__li84491597393">Use Flink SQL to read data from the Hive-syntax OBS table demo.student in batch mode and print it. Checkpointing is required.<pre class="screen" id="dli_08_15049__screen924645616211">CREATE CATALOG myhive WITH (
    'type' = 'hive',
    'default-database' = 'demo',
    'hive-conf-dir' = '/opt/flink/conf'
);

USE CATALOG myhive;

create table if not exists print (
    name STRING,
    score DOUBLE,
    classNo INT)
with ('connector' = 'print');

insert into print
select * from student;</pre>
<div class="p" id="dli_08_15049__p1165115161717">Result (in the .out log of the TaskManager):<pre class="screen" id="dli_08_15049__screen1754143915111">+I[Alice, 90.0, 1]
+I[Bob, 80.0, 1]
+I[Charlie, 70.0, 1]
+I[David, 60.0, 1]
+I[Eve, 50.0, 1]
+I[Frank, 40.0, 1]
+I[Grace, 30.0, 1]
+I[Hank, 20.0, 1]
+I[Ivy, 10.0, 1]
+I[Jack, 0.0, 1]</pre>
</div>
</li><li id="dli_08_15049__li8247192917391">Use Flink SQL to read data from the Hive-syntax OBS table demo.student in streaming mode and print it.<pre class="screen" id="dli_08_15049__screen121076163171">CREATE CATALOG myhive WITH (
    'type' = 'hive',
    'default-database' = 'demo',
    'hive-conf-dir' = '/opt/flink/conf'
);

USE CATALOG myhive;

create table if not exists print (
    name STRING,
    score DOUBLE,
    classNo INT)
with ('connector' = 'print');

insert into print
select * from student /*+ OPTIONS('streaming-source.enable' = 'true', 'streaming-source.monitor-interval' = '3 m') */;</pre>
</li></ol>
</div>
<p id="dli_08_15049__p19851915155410">The streaming example uses SQL hints, which apply configurations to a Hive table without changing its definition in the Hive metastore. For details, see <a href="https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/table/sql/queries/hints/" target="_blank" rel="noopener noreferrer">SQL Hints</a>.</p>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="dli_08_15046.html">Hive</a></div>
</div>
</div>