This statement is used to insert the SELECT query result or a certain data record into a table.
To implement data operations in self-read and self-write scenarios, you are advised to use a temporary table to handle the data. See Figure 1.
Self-read and self-write means that the destination table and the data source table are the same table. For example, suppose you want to extract information of students with class_no = 1 from the student table and overwrite the original table, the following statements represent typical operations in self-read and self-write scenarios:
INSERT OVERWRITE TABLE student SELECT name FROM student WHERE class_no = 1;
To avoid this situation, you are advised to prioritize data sources that support transactions, such as Hudi data sources. This type of data source has Atomicity, Consistency, Isolation, Durability (ACID) capabilities, which helps ensure data consistency and accuracy.
To learn more, refer to How Do I Handle Duplicate Records After Executing the INSERT INTO Statement?
1 2 | INSERT INTO [TABLE] [db_name.]table_name [PARTITION part_spec] select_statement; |
1 2 | INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement; |
part_spec: : (part_col_name1=val1 [, part_col_name2=val2, ...])
1 2 | INSERT INTO [TABLE] [db_name.]table_name [PARTITION part_spec] VALUES values_row [, values_row ...]; |
1 2 | INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] VALUES values_row [, values_row ...]; |
values_row: : (val1 [, val2, ...])
Parameter |
Description |
|---|---|
db_name |
Name of the database where the target table resides. |
table_name |
Name of the target table. |
part_spec |
Detailed partition information. If there are multiple partition fields, all fields must be contained, but the corresponding values are optional. The system matches the corresponding partition. A maximum of 100,000 partitions can be created in a single table. |
select_statement |
SELECT query on the source table (DLI and OBS tables). |
values_row |
Value to be inserted to a table. Use commas (,) to separate columns. |
1 | insert overwrite table tb1 partition(part1='v1', part2='v2') select * from ... |
On the DLI management console, click SQL Editor. In the upper right corner of the editing window, click Settings to configure parameters.
insert into table table_target select * from table_source distribute by cast(rand() * N as int);
Before importing data, you must create a table. For details, see Creating an OBS Table or Creating a DLI Table.
CREATE TABLE data_source_tab1 (col1 INT, p1 INT, p2 INT) USING PARQUET PARTITIONED BY (p1, p2);
INSERT INTO data_source_tab1 PARTITION (p1 = 3, p2 = 4) SELECT id FROM RANGE(1, 3);
INSERT OVERWRITE TABLE data_source_tab1 PARTITION (p1 = 3, p2 = 4) SELECT id FROM RANGE(3, 5);
CREATE TABLE hive_serde_tab1 (col1 INT, p1 INT, p2 INT) USING HIVE OPTIONS(fileFormat 'PARQUET') PARTITIONED BY (p1, p2);
INSERT INTO hive_serde_tab1 PARTITION (p1 = 3, p2 = 4) VALUES (1), (2);
INSERT OVERWRITE TABLE hive_serde_tab1 PARTITION (p1 = 3, p2 = 4) VALUES (3), (4);
When using Hive and Datasource tables (excluding Hudi), executing data modification commands (such as insert into and load data) may result in data duplication or inconsistency if the data source does not support transactions and there is a system failure or queue restart.
If queue resources are restarted in the data commit phase, data may have been restored to a formal directory. If an insert into statement is executed and a retry is triggered after a resource restart, there is a possibility that data will be repeatedly written.