
<a name="EN-US_TOPIC_0000001764491932"></a><a name="EN-US_TOPIC_0000001764491932"></a>

<h1 class="topictitle1">Introduction to Hudi</h1>
<div id="body0000001543060694"><p id="EN-US_TOPIC_0000001764491932__p197574483344">Apache Hudi stands for Hadoop Upserts Deletes and Incrementals. It manages large analytical datasets stored on the distributed file system (DFS) in Hadoop.</p>
<p id="EN-US_TOPIC_0000001764491932__p117581248133418">Hudi is not just a data format. It is also a set of data access methods (similar to the access layer of GaussDB(DWS) storage). In Apache Hudi 0.9, big data components such as Spark and Flink have their own clients. The following figure shows the logical storage of Hudi.</p>
<p id="EN-US_TOPIC_0000001764491932__p11804033194118"><span><img id="EN-US_TOPIC_0000001764491932__image62210366411" src="figure/en-us_image_0000001811491577.png"></span></p>
<ul id="EN-US_TOPIC_0000001764491932__ul193736162469"><li id="EN-US_TOPIC_0000001764491932__li8373171611469">Write Mode<p id="EN-US_TOPIC_0000001764491932__p45541615104312"><a name="EN-US_TOPIC_0000001764491932__li8373171611469"></a><a name="li8373171611469"></a><strong id="EN-US_TOPIC_0000001764491932__b1692717448397">COW</strong>: copy-on-write, applicable to scenarios with few updates.</p>
<p id="EN-US_TOPIC_0000001764491932__p2554111534313"><strong id="EN-US_TOPIC_0000001764491932__b0479547153918">MOR</strong>: merge-on-read. UPDATE and DELETE operations are written incrementally to delta log files. At read time, the base and delta log files are merged, and compaction of the two runs asynchronously.</p>
</li></ul>
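<p>The difference between the two write modes can be illustrated with a toy in-memory model. This is only a sketch: the class and method names (<code>CowTable</code>, <code>MorTable</code>, <code>upsert</code>, <code>compact</code>) are hypothetical and are not part of the Hudi API, and Python dictionaries stand in for base files.</p>

```python
# Toy in-memory model of the two write modes; not Hudi's actual implementation.
# All class and method names here are hypothetical.


class CowTable:
    """Copy-on-write: every upsert rewrites the whole base file."""

    def __init__(self):
        self.base = {}      # record key -> row (stands in for a base file)
        self.rewrites = 0   # how many times the base file was rewritten

    def upsert(self, key, row):
        new_base = dict(self.base)  # copy the existing base file
        new_base[key] = row         # apply the change to the copy
        self.base = new_base        # the copy becomes the new version
        self.rewrites += 1

    def read(self):
        return dict(self.base)      # reads need no merging


class MorTable:
    """Merge-on-read: changes are appended to a delta log."""

    def __init__(self):
        self.base = {}       # record key -> row (stands in for a base file)
        self.delta_log = []  # (key, row) entries; row None marks a delete

    def upsert(self, key, row):
        self.delta_log.append((key, row))   # cheap append, base untouched

    def delete(self, key):
        self.delta_log.append((key, None))

    def read(self):
        merged = dict(self.base)            # merge base and log at read time
        for key, row in self.delta_log:
            if row is None:
                merged.pop(key, None)
            else:
                merged[key] = row
        return merged

    def compact(self):
        self.base = self.read()             # fold the log into a new base
        self.delta_log = []
```

<p>In this model, the COW table pays the rewrite cost on every write, which is why it fits tables with few updates. The MOR table keeps writes cheap and defers the cost to reads and to compaction.</p>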
<ul id="EN-US_TOPIC_0000001764491932__ul2873211104719"><li id="EN-US_TOPIC_0000001764491932__li128731911154714">Storage Format<p id="EN-US_TOPIC_0000001764491932__p45559151439"><a name="EN-US_TOPIC_0000001764491932__li128731911154714"></a><a name="li128731911154714"></a><strong id="EN-US_TOPIC_0000001764491932__b1833482618417">index</strong>: index on the primary key. The default index is a Bloom filter at the file group level.</p>
<p id="EN-US_TOPIC_0000001764491932__p115567151431"><strong id="EN-US_TOPIC_0000001764491932__b8564195064113">data files</strong>: base file + delta log file (the delta log file records updates and deletes applied to the base file)</p>
<p id="EN-US_TOPIC_0000001764491932__p1855891519434"><strong id="EN-US_TOPIC_0000001764491932__b1323217553410">timeline metadata</strong>: manages version logs.</p>
</li><li id="EN-US_TOPIC_0000001764491932__li510184184715">Views<p id="EN-US_TOPIC_0000001764491932__p125591115114318"><a name="EN-US_TOPIC_0000001764491932__li510184184715"></a><a name="li510184184715"></a>Read-optimized view: reads only the base files generated by compaction. Data that has not been compacted yet is not returned, so results can lag behind the latest writes. This view suits efficient reads.</p>
<p id="EN-US_TOPIC_0000001764491932__p8560315174319">Real-time view: reads the latest data by merging the base file and delta log file during the read. This view suits frequently updated data.</p>
<p id="EN-US_TOPIC_0000001764491932__p15560191510439">Incremental view: reads only the incremental data written to Hudi, similar to change data capture (CDC). This view suits stream and batch integration.</p>
</li></ul>
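<p>The three views can be sketched as different ways of reading the same base file and timeline. The following toy model is an illustration under simplifying assumptions, not Hudi's API: <code>ToyHudiTable</code> and its methods are hypothetical names, instants are plain integers, and a dictionary stands in for the compacted base file.</p>

```python
# Toy model of the three views over a commit timeline; not Hudi's real API.
# ToyHudiTable and all of its methods are hypothetical names.


class ToyHudiTable:
    def __init__(self):
        self.base = {}            # compacted base file: record key -> row
        self.timeline = []        # ordered commits: (instant, key, row)
        self.compacted_upto = 0   # number of timeline entries already in base
        self._next_instant = 1    # integer instants, for simplicity

    def upsert(self, key, row):
        """Record a change on the timeline and return its instant."""
        instant = self._next_instant
        self._next_instant += 1
        self.timeline.append((instant, key, row))
        return instant

    def compact(self):
        """Fold all uncompacted changes into the base file."""
        for _, key, row in self.timeline[self.compacted_upto:]:
            self.base[key] = row
        self.compacted_upto = len(self.timeline)

    def read_optimized(self):
        """Read only the compacted base file (fast, possibly stale)."""
        return dict(self.base)

    def real_time(self):
        """Merge the base file with uncompacted changes (latest data)."""
        merged = dict(self.base)
        for _, key, row in self.timeline[self.compacted_upto:]:
            merged[key] = row
        return merged

    def incremental(self, after_instant):
        """Return only the changes committed after the given instant."""
        return [(i, k, r) for i, k, r in self.timeline if i > after_instant]


table = ToyHudiTable()
first = table.upsert("k1", "v1")
table.compact()
table.upsert("k1", "v2")          # written after compaction

stale = table.read_optimized()    # still sees v1 until the next compaction
latest = table.real_time()        # sees v2 through the merge
changes = table.incremental(first)  # only the commit made after `first`
```

<p>Until the next compaction, the read-optimized view in this sketch keeps returning the older row, while the real-time view pays for a merge on each read to return the newest one. The incremental view returns only what changed after a chosen instant, which is what makes it usable as a CDC-style feed.</p>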
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="dws_04_1069.html">SQL on Hudi</a></div>
</div>
</div>