forked from docs/doc-exports
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Su, Xiaomeng <suxiaomeng1@huawei.com> Co-committed-by: Su, Xiaomeng <suxiaomeng1@huawei.com>
66 lines
6.8 KiB
HTML
66 lines
6.8 KiB
HTML
<a name="dli_08_15073"></a><a name="dli_08_15073"></a>
|
|
|
|
<h1 class="topictitle1">Window Deduplication</h1>
|
|
<div id="body0000001870833485"><div class="section" id="dli_08_15073__section18950161910272"><h4 class="sectiontitle">Function</h4><p id="dli_08_15073__p564812277274">Window Deduplication is a special Deduplication which removes rows that duplicate over a set of columns, keeping the first one or the last one for each window and partitioned keys.</p>
|
|
<p id="dli_08_15073__p11648827142715">For streaming queries, unlike regular Deduplicate on continuous tables, Window Deduplication does not emit intermediate results but only a final result at the end of the window. Moreover, window Deduplication purges all intermediate state when no longer needed. Therefore, Window Deduplication queries have better performance if users do not need results updated per record. Usually, Window Deduplication is used with Windowing TVF directly. Besides, Window Deduplication could be used with other operations based on Windowing TVF, such as Window Aggregation, Window TopN and Window Join.</p>
|
|
<p id="dli_08_15073__p364813278271">Window Top-N can be defined in the same syntax as regular Top-N, see Top-N documentation for more information. Besides that, Window Deduplication requires the <strong id="dli_08_15073__b98510234712">PARTITION BY</strong> clause contains <strong id="dli_08_15073__b165561926274">window_start</strong> and <strong id="dli_08_15073__b10816301877">window_end</strong> columns of the relation. Otherwise, the optimizer will not be able to translate the query.</p>
|
|
<p id="dli_08_15073__p13648142717275">Flink uses <strong id="dli_08_15073__b56478461710">ROW_NUMBER()</strong> to remove duplicates, just like the way of Window Top-N query. In theory, Window Deduplication is a special case of Window Top-N in which the N is one and order by the processing time or event time.</p>
|
|
<p id="dli_08_15073__p44591322112816">For more information, see <a href="https://nightlies.apache.org/flink/flink-docs-release-1.15/zh/docs/dev/table/sql/queries/window-deduplication/" target="_blank" rel="noopener noreferrer">Window Deduplication</a>.</p>
|
|
</div>
|
|
<div class="section" id="dli_08_15073__section1817521282919"><h4 class="sectiontitle">Syntax</h4><pre class="screen" id="dli_08_15073__screen7632171910297">SELECT [column_list]
|
|
FROM (
|
|
SELECT [column_list],
|
|
ROW_NUMBER() OVER (PARTITION BY window_start, window_end [, col_key1...]
|
|
ORDER BY time_attr [asc|desc]) AS rownum
|
|
FROM table_name) -- relation applied windowing TVF
|
|
WHERE (rownum = 1 | rownum <=1 | rownum < 2) [AND conditions]</pre>
|
|
</div>
|
|
<p id="dli_08_15073__p9435183212912">Parameter description:</p>
|
|
<ul id="dli_08_15073__ul132292441292"><li id="dli_08_15073__li182291444162910"><strong id="dli_08_15073__b153513481385">ROW_NUMBER()</strong>: Assigns an unique, sequential number to each row, starting with one.</li><li id="dli_08_15073__li2229544142912"><strong id="dli_08_15073__b04801021499">PARTITION BY window_start, window_end [, col_key1...]</strong>: Specifies the partition columns which contain <strong id="dli_08_15073__b116496472918">window_start</strong>, <strong id="dli_08_15073__b2914529914">window_end</strong> and other partition keys.</li><li id="dli_08_15073__li82291644122917"><strong id="dli_08_15073__b187798108103">ORDER BY time_attr [asc|desc]</strong>: Specifies the ordering column, it must be a time attribute. Currently Flink supports processing time attribute and event time attribute. Ordering by ASC means keeping the first row, ordering by DESC means keeping the last row.</li><li id="dli_08_15073__li162294444293"><strong id="dli_08_15073__b144891345101012">WHERE (rownum = 1 | rownum <=1 | rownum < 2)</strong>: The <strong id="dli_08_15073__b1822232201112">rownum = 1 | rownum <=1 | rownum < 2</strong> is required for the optimizer to recognize the query could be translated to Window Deduplication.</li></ul>
|
|
<div class="section" id="dli_08_15073__section192391151103017"><h4 class="sectiontitle">Caveats</h4><ul id="dli_08_15073__ul1549177153113"><li id="dli_08_15073__li1654912753119">Flink can only perform window deduplication on window table value functions that are based on tumble, hop, or cumulate windows.</li><li id="dli_08_15073__li114471913118">Window deduplication is only supported when sorting based on the event time attribute.</li></ul>
|
|
</div>
|
|
<div class="section" id="dli_08_15073__section9855161143017"><h4 class="sectiontitle">Example</h4><p id="dli_08_15073__p197081413163020">The following example shows how to keep last record for every 10 minutes tumbling window.</p>
|
|
<pre class="screen" id="dli_08_15073__screen17116172119305">-- tables must have time attribute, e.g. `bidtime` in this table
|
|
Flink SQL> DESC Bid;
|
|
+-------------+------------------------+------+-----+--------+---------------------------------+
|
|
| name | type | null | key | extras | watermark |
|
|
+-------------+------------------------+------+-----+--------+---------------------------------+
|
|
| bidtime | TIMESTAMP(3) *ROWTIME* | true | | | `bidtime` - INTERVAL '1' SECOND |
|
|
| price | DECIMAL(10, 2) | true | | | |
|
|
| item | STRING | true | | | |
|
|
+-------------+------------------------+------+-----+--------+---------------------------------+
|
|
|
|
Flink SQL> SELECT * FROM Bid;
|
|
+------------------+-------+------+
|
|
| bidtime | price | item |
|
|
+------------------+-------+------+
|
|
| 2020-04-15 08:05 | 4.00 | C |
|
|
| 2020-04-15 08:07 | 2.00 | A |
|
|
| 2020-04-15 08:09 | 5.00 | D |
|
|
| 2020-04-15 08:11 | 3.00 | B |
|
|
| 2020-04-15 08:13 | 1.00 | E |
|
|
| 2020-04-15 08:17 | 6.00 | F |
|
|
+------------------+-------+------+
|
|
|
|
Flink SQL> SELECT *
|
|
FROM (
|
|
SELECT bidtime, price, item, supplier_id, window_start, window_end,
|
|
ROW_NUMBER() OVER (PARTITION BY window_start, window_end ORDER BY bidtime DESC) AS rownum
|
|
FROM TABLE(
|
|
TUMBLE(TABLE Bid, DESCRIPTOR(bidtime), INTERVAL '10' MINUTES))
|
|
) WHERE rownum <= 1;
|
|
+------------------+-------+------+-------------+------------------+------------------+--------+
|
|
| bidtime | price | item | supplier_id | window_start | window_end | rownum |
|
|
+------------------+-------+------+-------------+------------------+------------------+--------+
|
|
| 2020-04-15 08:09 | 5.00 | D | supplier4 | 2020-04-15 08:00 | 2020-04-15 08:10 | 1 |
|
|
| 2020-04-15 08:17 | 6.00 | F | supplier5 | 2020-04-15 08:10 | 2020-04-15 08:20 | 1 |
|
|
+------------------+-------+------+-------------+------------------+------------------+--------+</pre>
|
|
</div>
|
|
</div>
|
|
<div>
|
|
<div class="familylinks">
|
|
<div class="parentlink"><strong>Parent topic:</strong> <a href="dli_08_15069.html">Window</a></div>
|
|
</div>
|
|
</div>
|
|
|