You can create a sink stream to export data to a file system such as HDFS or OBS. After the data is generated, a non-DLI table can be created directly according to the generated directory. The table can be processed through DLI SQL, and the output data directory can be stored in partitioned tables. It is applicable to scenarios such as data dumping, big data analysis, data backup, and active, deep, or cold archiving.
OBS is an object-based storage service. It provides massive, secure, highly reliable, and low-cost data storage capabilities.
1 2 3 4 5 6 7 8 9 | CREATE SINK STREAM stream_id (attr_name attr_type (',' attr_name attr_type)* ) [PARTITIONED BY (attr_name (',' attr_name)*] WITH ( type = "filesystem", file.path = "obs://bucket/xx", encode = "parquet", ak = "", sk = "" ); |
Parameter |
Mandatory |
Description |
---|---|---|
type |
Yes |
Output stream type. If type is set to filesystem, data is exported to the file system. |
file.path |
Yes |
Output directory in the form: schema://file.path. Currently, Schema supports only OBS and HDFS.
|
encode |
Yes |
Output data encoding format. Currently, only the parquet and csv formats are supported.
|
ak |
No |
Access key. This parameter is mandatory when data is exported to OBS. Global variables can be used to mask the access key used for OBS authentication. |
sk |
No |
Secret access key. This parameter is mandatory when data is exported to OBS. Secret key for accessing OBS authentication. Global variables can be used to mask sensitive information. |
krb_auth |
No |
Authentication name for creating a datasource connection authentication. This parameter is mandatory when Kerberos authentication is enabled. If Kerberos authentication is not enabled for the created MRS cluster, ensure that the /etc/hosts information of the master node in the MRS cluster is added to the host file of the DLI queue. |
field_delimiter |
No |
Separator used to separate every two attributes. This parameter needs to be configured if the CSV encoding format is adopted. It can be user-defined, for example, a comma (,). |
You should carefully select the OBS bucket because of the preceding behavior differences. Data exceptions may occur after abnormal job restart.
In the preceding information, myname in the core-site values hadoop.proxyuser.myname.hosts and hadoop.proxyuser.myname.groups is the name of the krb authentication user.
Ensure that the permission on the HDFS data write path is 777.
The following example dumps the car_info data to OBS, with the buyday field as the partition field and parquet as the encoding format.
1 2 3 4 5 6 7 8 9 10 11 12 13 | create sink stream car_infos ( carId string, carOwner string, average_speed double, buyday string ) partitioned by (buyday) with ( type = "filesystem", file.path = "obs://obs-sink/car_infos", encode = "parquet", ak = "{{myAk}}", sk = "{{mySk}}" ); |
The data is ultimately stored in OBS. Directory: obs://obs-sink/car_infos/buyday=xx/part-x-x.
After the data is generated, the OBS partitioned table can be established for subsequent batch processing through the following SQL statements:
1 2 3 4 5 6 7 8 | create table car_infos ( carId string, carOwner string, average_speed double ) partitioned by (buyday string) stored as parquet location 'obs://obs-sink/car_infos'; |
1 | alter table car_infos recover partitions; |
The following example dumps the car_info data to HDFS, with the buyday field as the partition field and csv as the encoding format.
1 2 3 4 5 6 7 8 9 10 11 12 | create sink stream car_infos ( carId string, carOwner string, average_speed double, buyday string ) partitioned by (buyday) with ( type = "filesystem", file.path = "hdfs://node-master1sYAx:9820/user/car_infos", encode = "csv", field_delimiter = "," ); |
The data is ultimately stored in HDFS. Directory: /user/car_infos/buyday=xx/part-x-x.