Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Su, Xiaomeng <suxiaomeng1@huawei.com> Co-committed-by: Su, Xiaomeng <suxiaomeng1@huawei.com>
<a name="dli_08_15052"></a><a name="dli_08_15052"></a>
<h1 class="topictitle1">Using Temporal Join to Associate the Latest Partition of a Dimension Table</h1>
<div id="body0000001715192008"><div class="section" id="dli_08_15052__section17209510185415"><h4 class="sectiontitle">Function</h4><p id="dli_08_15052__p536418191175">A partitioned table that changes over time can be read as an unbounded stream. If each partition contains the complete data set for a given version, the partition can be treated as one version of the temporal table, and that version holds the data of the partition. In processing-time temporal joins, Flink can automatically track the latest partition (version) of the temporal table.</p>
<p id="dli_08_15052__p1882274373020">The <strong id="dli_08_15052__b5887931163015">streaming-source.partition-order</strong> parameter defines which partition (version) counts as the latest.</p>
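<p>As a hedged illustration of <strong>streaming-source.partition-order</strong>: the open-source Flink Hive connector documents three values for it, <strong>partition-name</strong> (the default), <strong>create-time</strong>, and <strong>partition-time</strong>. The sketch below assumes that the same values are accepted here and borrows the table names from the example in this topic. It sets the default ordering explicitly and is not part of the original job script.</p>

```sql
-- Sketch only: assumes the open-source Flink Hive connector options are
-- accepted unchanged. 'partition-name' (the default) treats the partition
-- whose name sorts last alphabetically as the latest, so create_time_2
-- would supersede create_time_1.
SELECT
  orders.product_id,
  dim.product_name,
  dim.create_time
FROM ordersSource AS orders
LEFT JOIN dimension_hive_table /*+ OPTIONS(
    'streaming-source.enable' = 'true',
    'streaming-source.partition.include' = 'latest',
    'streaming-source.partition-order' = 'partition-name',
    'streaming-source.monitor-interval' = '10 m') */
  FOR SYSTEM_TIME AS OF orders.proctime AS dim
  ON orders.product_id = dim.product_id;
```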
<p id="dli_08_15052__p3784144716406">This is the most common way to use Hive tables as dimension tables in Flink streaming applications.</p>
</div>
<div class="section" id="dli_08_15052__section2093155184317"><h4 class="sectiontitle">Caveats</h4><p id="dli_08_15052__p6175145417308">Using a temporal join to associate the latest partition of a dimension table is supported only in Flink STREAMING mode.</p>
</div>
<div class="section" id="dli_08_15052__section9498161303"><h4 class="sectiontitle">Example</h4><p id="dli_08_15052__p5801172117424">The following example shows a typical business pipeline. The dimension table is stored in Hive and is updated once a day by a batch job or a Flink job. The Kafka stream carries real-time online business data or logs and is joined with the dimension table to enrich each record.</p>
<ol id="dli_08_15052__ol1111312138184"><li id="dli_08_15052__li1311351314183">Create a Hive OBS external table using Spark SQL and insert data.<pre class="screen" id="dli_08_15052__screen7111832173410">CREATE TABLE if not exists dimension_hive_table (
  product_id STRING,
  product_name STRING,
  unit_price DECIMAL(10, 4),
  pv_count BIGINT,
  like_count BIGINT,
  comment_count BIGINT,
  update_time TIMESTAMP,
  update_user STRING
)
STORED AS PARQUET
LOCATION 'obs://demo/spark.db/dimension_hive_table'
PARTITIONED BY (
  create_time STRING
);</pre>
<div class="p" id="dli_08_15052__p43071726181517"><pre class="screen" id="dli_08_15052__screen8596135117201">INSERT INTO dimension_hive_table PARTITION (create_time='create_time_1') VALUES ('product_id_11', 'product_name_11', 1.2345, 100, 50, 20, '2023-11-25 02:10:58', 'update_user_1');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_1') VALUES ('product_id_12', 'product_name_12', 2.3456, 200, 100, 40, '2023-11-25 02:10:58', 'update_user_2');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_1') VALUES ('product_id_13', 'product_name_13', 3.4567, 300, 150, 60, '2023-11-25 02:10:58', 'update_user_3');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_1') VALUES ('product_id_14', 'product_name_14', 4.5678, 400, 200, 80, '2023-11-25 02:10:58', 'update_user_4');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_1') VALUES ('product_id_15', 'product_name_15', 5.6789, 500, 250, 100, '2023-11-25 02:10:58', 'update_user_5');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_1') VALUES ('product_id_16', 'product_name_16', 6.7890, 600, 300, 120, '2023-11-25 02:10:58', 'update_user_6');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_1') VALUES ('product_id_17', 'product_name_17', 7.8901, 700, 350, 140, '2023-11-25 02:10:58', 'update_user_7');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_1') VALUES ('product_id_18', 'product_name_18', 8.9012, 800, 400, 160, '2023-11-25 02:10:58', 'update_user_8');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_1') VALUES ('product_id_19', 'product_name_19', 9.0123, 900, 450, 180, '2023-11-25 02:10:58', 'update_user_9');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_1') VALUES ('product_id_10', 'product_name_10', 10.1234, 1000, 500, 200, '2023-11-25 02:10:58', 'update_user_10');</pre>
</div>
</li></ol><ol start="2" id="dli_08_15052__ol174419258189"><li id="dli_08_15052__li474416258183">Create a Flink OpenSource SQL job. Enter the following job script and submit the job. This job reads data from Kafka, joins it with the Hive dimension table to denormalize the data, and writes the result to Print.<div class="p" id="dli_08_15052__p1896991015915"><a name="dli_08_15052__li474416258183"></a><a name="li474416258183"></a>Change the values of the parameters in bold as needed in the following script.<pre class="screen" id="dli_08_15052__screen7459733164314">-- Register the Hive catalog that holds the dimension table.
CREATE CATALOG myhive WITH (
  'type' = 'hive',
  'default-database' = 'demo',
  'hive-conf-dir' = '/opt/flink/conf'
);

USE CATALOG myhive;

-- Kafka source with a processing-time attribute for the temporal join.
CREATE TABLE IF NOT EXISTS ordersSource (
  product_id STRING,
  user_name STRING,
  proctime AS PROCTIME()
) WITH (
  'connector' = '<strong id="dli_08_15052__b936031994820"><em id="dli_08_15052__i1438519124818">kafka</em></strong>',
  'topic' = '<strong id="dli_08_15052__b833052224817"><em id="dli_08_15052__i141048226485">TOPIC</em></strong>',
  'properties.bootstrap.servers' = '<strong id="dli_08_15052__b1758312318489"><em id="dli_08_15052__i1360573014819">KafkaIP:PORT,KafkaIP:PORT,KafkaIP:PORT</em></strong>',
  'properties.group.id' = '<strong id="dli_08_15052__b6696183464820"><em id="dli_08_15052__i13955347486">GroupId</em></strong>',
  'scan.startup.mode' = '<strong id="dli_08_15052__b11550183710486"><em id="dli_08_15052__i53501379482">latest-offset</em></strong>',
  'format' = '<em id="dli_08_15052__i0687124015485"><strong id="dli_08_15052__b18477124064810">json</strong></em>'
);

-- Print sink that receives the enriched records.
CREATE TABLE IF NOT EXISTS print (
  product_id STRING,
  user_name STRING,
  product_name STRING,
  unit_price DECIMAL(10, 4),
  pv_count BIGINT,
  like_count BIGINT,
  comment_count BIGINT,
  update_time TIMESTAMP,
  update_user STRING,
  create_time STRING
) WITH (
  'connector' = 'print'
);

-- Join each order with the latest partition of the dimension table.
-- The job checks for a newer partition every 10 minutes.
INSERT INTO print
SELECT
  orders.product_id,
  orders.user_name,
  dim.product_name,
  dim.unit_price,
  dim.pv_count,
  dim.like_count,
  dim.comment_count,
  dim.update_time,
  dim.update_user,
  dim.create_time
FROM ordersSource orders
LEFT JOIN dimension_hive_table /*+ OPTIONS('streaming-source.enable'='true',
  'streaming-source.partition.include' = 'latest', 'streaming-source.monitor-interval' = '10 m') */
  FOR SYSTEM_TIME AS OF orders.proctime AS dim ON orders.product_id = dim.product_id;</pre>
</div>
</li><li id="dli_08_15052__li2029611521181">Connect to the Kafka cluster and insert the following test data into the source topic in Kafka:<pre class="screen" id="dli_08_15052__screen871542733219">{"product_id": "product_id_11", "user_name": "name11"}
{"product_id": "product_id_12", "user_name": "name12"}</pre>
</li><li id="dli_08_15052__li1704119199">View the data in the Print result table.<pre class="screen" id="dli_08_15052__screen321591810294">+I[product_id_11, name11, product_name_11, 1.2345, 100, 50, 20, 2023-11-24T18:10:58, update_user_1, create_time_1]
+I[product_id_12, name12, product_name_12, 2.3456, 200, 100, 40, 2023-11-24T18:10:58, update_user_2, create_time_1]</pre>
</li><li id="dli_08_15052__li15709131091913">Simulate inserting new partition data into the Hive dimension table.<pre class="screen" id="dli_08_15052__screen8451151912714">INSERT INTO dimension_hive_table PARTITION (create_time='create_time_2') VALUES ('product_id_21', 'product_name_21', 1.2345, 100, 50, 20, '2023-11-25 02:10:58', 'update_user_1');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_2') VALUES ('product_id_22', 'product_name_22', 2.3456, 200, 100, 40, '2023-11-25 02:10:58', 'update_user_2');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_2') VALUES ('product_id_23', 'product_name_23', 3.4567, 300, 150, 60, '2023-11-25 02:10:58', 'update_user_3');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_2') VALUES ('product_id_24', 'product_name_24', 4.5678, 400, 200, 80, '2023-11-25 02:10:58', 'update_user_4');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_2') VALUES ('product_id_25', 'product_name_25', 5.6789, 500, 250, 100, '2023-11-25 02:10:58', 'update_user_5');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_2') VALUES ('product_id_26', 'product_name_26', 6.7890, 600, 300, 120, '2023-11-25 02:10:58', 'update_user_6');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_2') VALUES ('product_id_27', 'product_name_27', 7.8901, 700, 350, 140, '2023-11-25 02:10:58', 'update_user_7');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_2') VALUES ('product_id_28', 'product_name_28', 8.9012, 800, 400, 160, '2023-11-25 02:10:58', 'update_user_8');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_2') VALUES ('product_id_29', 'product_name_29', 9.0123, 900, 450, 180, '2023-11-25 02:10:58', 'update_user_9');
INSERT INTO dimension_hive_table PARTITION (create_time='create_time_2') VALUES ('product_id_20', 'product_name_20', 10.1234, 1000, 500, 200, '2023-11-25 02:10:58', 'update_user_10');</pre>
</li><li id="dli_08_15052__li720721781910">Connect to the Kafka cluster and insert the following test data into the source topic in Kafka. This record matches a product in the previous partition, <strong id="dli_08_15052__b17887155153114">create_time='create_time_1'</strong>:<pre class="screen" id="dli_08_15052__screen52237583273">{"product_id": "product_id_13", "user_name": "name13"}</pre>
</li><li id="dli_08_15052__li182072017121918">View the data in the Print result table. The temporal table no longer holds the data of the previous partition, <strong id="dli_08_15052__b694316218335">create_time='create_time_1'</strong>, so the dimension fields are null.<pre class="screen" id="dli_08_15052__screen101335187293">+I[product_id_13, name13, null, null, null, null, null, null, null, null]</pre>
</li><li id="dli_08_15052__li9207181713191">Connect to the Kafka cluster and insert the following test data into the source topic in Kafka. This record matches a product in the latest partition, <strong id="dli_08_15052__b1914795814338">create_time='create_time_2'</strong>:<pre class="screen" id="dli_08_15052__screen15719421123110">{"product_id": "product_id_21", "user_name": "name21"}</pre>
</li><li id="dli_08_15052__li1420715172197">View the data in the Print result table. The temporal table retains the data of the latest partition, <strong id="dli_08_15052__b11958234345">create_time='create_time_2'</strong>, so the record is joined with its dimension data.<pre class="screen" id="dli_08_15052__screen13385543203315">+I[product_id_21, name21, product_name_21, 1.2345, 100, 50, 20, 2023-11-24T18:10:58, update_user_1, create_time_2]</pre>
</li></ol>
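<p>Step 7 prints a row padded with nulls because the job uses a left join, which keeps stream records that have no match in the latest partition. If such records should be dropped instead, an inner join could be used. The following is a minimal sketch that reuses the tables and options from step 2. It is an illustrative variant, not part of the original job.</p>

```sql
-- Inner-join variant (illustrative): a stream record without a match in the
-- latest partition produces no output row instead of a row padded with nulls.
INSERT INTO print
SELECT
  orders.product_id,
  orders.user_name,
  dim.product_name,
  dim.unit_price,
  dim.pv_count,
  dim.like_count,
  dim.comment_count,
  dim.update_time,
  dim.update_user,
  dim.create_time
FROM ordersSource AS orders
JOIN dimension_hive_table /*+ OPTIONS(
    'streaming-source.enable' = 'true',
    'streaming-source.partition.include' = 'latest',
    'streaming-source.monitor-interval' = '10 m') */
  FOR SYSTEM_TIME AS OF orders.proctime AS dim
  ON orders.product_id = dim.product_id;
```

<p>With this variant, the product_id_13 record from step 6 would produce no output once create_time_2 has become the latest partition, while the product_id_21 record from step 8 would still be joined with its dimension data.</p>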
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="dli_08_15046.html">Hive</a></div>
</div>
</div>