Reviewed-by: Kacur, Michal <michal.kacur@t-systems.com> Co-authored-by: Yang, Tong <yangtong2@huawei.com> Co-committed-by: Yang, Tong <yangtong2@huawei.com>
Creating a CDL Data Synchronization Job
Scenario
The CDLService web UI provides a visual page where you can quickly create CDL jobs that import real-time data into the data lake.
Prerequisites
For a cluster with Kerberos authentication enabled, a user with the CDL management permission has been created.
Procedure
- Access the CDLService web UI as a user with the CDL management permission, or as the admin user if Kerberos authentication is not enabled for the cluster. For details, see Logging In to the CDLService WebUI.
- Choose Job Management > Data synchronization task and click Add Job. In the displayed dialog box, set related job parameters and click Next.
Parameter
Description
Example Value
Name
Job name
job_pgsqltokafka
Desc
Job description
xxx
- On the Job Management page, select and drag the target element from Source and Sink to the GUI on the right.
Double-click the two elements to connect them and set related parameters as required.
To delete an element, select the element to be deleted and click Delete in the lower right corner of the page.
Table 3 Source Hudi job parameters
Parameter
Description
Example Value
Link
Link used by the Hudi app
hudilink
Interval
Interval for synchronizing the Hudi table, in seconds
10
Start Time
Start time for synchronizing tables
2022/03/16 11:40:52
Max Commit Number
Maximum number of commits that can be pulled from an incremental view at a time.
10
Hudi Custom Config
Customized configuration related to Hudi.
-
Table Info
Detailed configuration information about the synchronization table. Hudi and DWS must have the same table names and field types.
{"table1":[{"source.database":"base1","source.tablename":"table1"}],"table2":[{"source.database":"base2","source.tablename":"table2"}],"table3":[{"source.database":"base3","source.tablename":"table3"}]}
Execution Env
Environment variable required for running the Hudi App. If no ENV is available, manually create one.
defaultEnv
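The Table Info value is a JSON object keyed by table name. The following is a minimal Python sketch for generating that string from a list of (database, table) pairs. It assumes the layout shown in the Table 3 example value; the helper name build_table_info is hypothetical and is not part of CDL.

```python
import json


def build_table_info(tables):
    """Build a Table Info JSON string from (database, table) pairs.

    The layout is inferred from the example value in Table 3,
    not from a published CDL schema.
    """
    info = {}
    for database, table in tables:
        if table in info:
            raise ValueError(f"duplicate table name: {table}")
        info[table] = [
            {"source.database": database, "source.tablename": table}
        ]
    # Compact separators produce the single-line form used in the example.
    return json.dumps(info, separators=(",", ":"))


table_info = build_table_info([("base1", "table1"), ("base2", "table2")])
print(table_info)
```

Generating the value this way avoids unbalanced brackets or quotes when many tables are listed. The resulting string can be pasted into the Table Info field.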
Table 4 Source Kafka job parameters
Parameter
Description
Example Value
Link
Created Kafka link
kafkalink
Table 5 thirdparty-kafka job parameters
Parameter
Description
Example Value
Link
Created thirdparty-kafka link
thirdparty-kafkalink
DB Name
Name of the database to be connected to.
opengaussdb
Schema
Schema of the database to be checked
opengaussschema
Datastore Type
Type of the upper-layer source. Value options are as follows:
- opengauss
- ogg
opengauss
Avro Schema Topic
Schema topic used by OGG Kafka to store table schemas in JSON format.
NOTE: This parameter is available only when Datastore Type is set to ogg.
ogg_topic
Source Topics
Source topics. A topic name can contain letters, digits, hyphens (-), and underscores (_). Separate multiple topics with commas (,).
topic1
Tasks Max
Maximum number of tasks that can be created by a connector. For a connector of the database type, this parameter must be set to 1.
10
Tolerance
Fault tolerance policy.
- none: indicates low tolerance and the Connector task will fail if an error occurs.
- all: indicates high tolerance and all failed records will be ignored if an error occurs.
all
Start Time
Start time for synchronizing tables
2022/03/16 14:14:50
Multi Partition
Whether to enable multi-partitioning for topics. If it is enabled, you must set Topic Table Mapping and specify the number of topic partitions. The data of a single table is then distributed across multiple partitions.
No
Topic Table Mapping
Mapping between topics and tables.
If configured, table data can be sent to the specified topic. If multi-partitioning is enabled, you need to set the number of partitions, which must be greater than 1.
testtable
testtable_topic
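Several rules in Table 5 can be checked before the values are entered in the web UI. The following is a minimal Python sketch of such checks. The function names are hypothetical, and rejecting whitespace around commas is an assumption, because the table does not say whether spaces are allowed.

```python
import re

# Letters, digits, hyphen, and underscore, per the Source Topics description.
_TOPIC_RE = re.compile(r"[A-Za-z0-9_-]+")


def parse_source_topics(value):
    """Split a comma-separated Source Topics value and validate each name.

    Assumption: no whitespace is allowed around the commas.
    """
    topics = value.split(",")
    bad = [t for t in topics if not _TOPIC_RE.fullmatch(t)]
    if bad:
        raise ValueError(f"invalid topic name(s): {bad}")
    return topics


def check_partition_count(multi_partition, partitions):
    """With multi-partitioning enabled, the partition count must exceed 1."""
    if multi_partition and partitions <= 1:
        raise ValueError("partition count must be greater than 1")


def check_tasks_max(tasks_max, database_connector):
    """A connector of the database type requires Tasks Max to be 1."""
    if database_connector and tasks_max != 1:
        raise ValueError("Tasks Max must be 1 for a database connector")


print(parse_source_topics("topic1,ogg_topic,my-topic"))
```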
Table 6 Sink Hudi job parameters
Parameter
Description
Example Value
Link
Created Hudi link.
hudilink
Path
Path for storing data.
/cdldata
Interval
Spark RDD execution interval, in seconds.
1
Max Rate Per Partition
Maximum rate for reading data from each Kafka partition using the Kafka direct stream API. It is the number of records per second. 0 indicates that the rate is not limited.
0
Parallelism
Parallelism for writing data to Hudi.
100
Target Hive Database
Database of the target Hive
default
Configuring Hudi Table Attributes
View for configuring attributes of the Hudi table. The value can be:
- Visual View
- JSON View
Visual View
Global Configuration of Hudi Table Attributes
Global parameters on Hudi.
-
Configuring the Attributes of the Hudi Table
Configuration of the Hudi table attributes.
-
Configuring the Attributes of the Hudi Table: Table Name
Hudi table name, which must be the same as the source table name.
-
Configuring the Attributes of the Hudi Table: Table Type Opt Key
Hudi table type. The options are as follows:
- COPY_ON_WRITE
- MERGE_ON_READ
MERGE_ON_READ
Configuring the Attributes of the Hudi Table: Hudi TableName Mapping
Hudi table name. If this parameter is not set, the name of the Hudi table is the same as that of the source table by default.
-
Configuring the Attributes of the Hudi Table: Hive TableName Mapping
Mapping between Hudi tables and Hive tables.
-
Configuring the Attributes of the Hudi Table: Table Primarykey Mapping
Primary key mapping of the Hudi table
id
Configuring the Attributes of the Hudi Table: Table Hudi Partition Type
Mapping between the Hudi table and partition fields. If the Hudi table is partitioned, you must configure the mapping between the table name and the partition fields. The value can be time or customized.
time
Configuring the Attributes of the Hudi Table: Custom Config
Custom configuration
-
Execution Env
Environment variable required for running the Hudi App. If no ENV is available, create one by referring to Managing ENV.
defaultEnv
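Interval and Max Rate Per Partition in Table 6 together bound how much data one batch can read. The following is a rough Python sketch of that arithmetic. It assumes the usual Kafka direct stream behavior, where the per-partition rate limit is applied over the whole batch interval for every partition, and it ignores backpressure. The function name is hypothetical.

```python
def max_records_per_batch(max_rate_per_partition, kafka_partitions, interval_seconds):
    """Return the upper bound on records read in one batch.

    Returns None when the rate is 0, which means the rate is not limited.
    """
    if max_rate_per_partition < 0 or kafka_partitions < 1 or interval_seconds <= 0:
        raise ValueError("invalid input")
    if max_rate_per_partition == 0:
        return None
    # records/second/partition * partitions * seconds per batch
    return max_rate_per_partition * kafka_partitions * interval_seconds


# Example: 1000 records/s per partition, 3 partitions, 1-second interval.
print(max_records_per_batch(1000, 3, 1))
```

With the example values in Table 6 (rate 0), the function returns None, so the batch size depends only on how much data has accumulated in Kafka.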
Table 7 Sink Kafka job parameters
Parameter
Description
Example Value
Link
Created Kafka link
kafkalink
Table 8 DWS job parameters
Parameter
Description
Example Value
Link
Link used by Connector
dwslink
Query Timeout
Timeout interval for connecting to DWS, in milliseconds
180000
Batch Size
Amount of data written to DWS in each batch
50
Sink Task Number
Maximum number of concurrent jobs when a table is written to DWS.
-
DWS Custom Config
Custom configuration
-
Table 11 ClickHouse job parameters
Parameter
Description
Example Value
Link
Link used by Connector
clickhouselink
Query Timeout
Timeout interval for connecting to ClickHouse, in milliseconds
60000
Batch Size
Amount of data written to ClickHouse in each batch
NOTE: Set this parameter to a large value. The recommended value range is 10000-100000.
100000
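The effect of the ClickHouse Batch Size recommendation can be illustrated with a short Python sketch. It assumes that Batch Size is counted in records, which the table does not state explicitly. The function names are hypothetical.

```python
import math

# Recommended range from the NOTE in the ClickHouse parameter table.
RECOMMENDED_MIN = 10_000
RECOMMENDED_MAX = 100_000


def in_recommended_range(batch_size):
    """Check a Batch Size value against the recommended range."""
    return RECOMMENDED_MIN <= batch_size <= RECOMMENDED_MAX


def batches_needed(total_records, batch_size):
    """Number of write batches needed for total_records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(total_records / batch_size)


for size in (50, 10_000, 100_000):
    print(size, in_recommended_range(size), batches_needed(1_000_000, size))
```

A larger batch size means far fewer write operations for the same volume of data, which is the reason for the recommendation.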
- After the job parameters are configured, drag the two icons to associate the job parameters and click Save. The job configuration is complete.
- In the job list on the Job Management page, locate the created jobs, click Start in the Operation column, and wait until the jobs are started.
Check whether data is transmitted as expected. For example, insert data into a table in the MySQL database and view the content of the file imported to Hudi.