DataGen

Function

DataGen is used to generate random data for debugging and testing.

Table 1 Supported types

Type                     Description

Supported Table Types    Source table

Caveats

Syntax

create table dataGenSource(
  attr_name attr_type 
  (',' attr_name attr_type)* 
  (',' WATERMARK FOR rowtime_column_name AS watermark-strategy_expression)
)
with (
  'connector' = 'datagen'
);
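As an illustration of the syntax above, the following is a minimal sketch of a concrete DataGen source table with a watermark. The table name orderSource, the column names order_id and order_time, and the 5-second watermark delay are hypothetical and chosen only for this example.

```sql
-- Hypothetical table and column names, for illustration only.
create table orderSource(
  order_id bigint,
  order_time timestamp(3),
  -- Assumed strategy: tolerate events arriving up to 5 seconds late.
  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) with (
  'connector' = 'datagen'
);
```

Because no fields.#.kind is set here, both columns would use the default random generator described in Table 2.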

Parameter Description

Table 2 Parameters

Parameter

Mandatory

Default Value

Data Type

Description

connector

Yes

None

String

Connector to be used. Set this parameter to datagen.

rows-per-second

No

10000

Long

Number of rows emitted per second. This parameter controls the emit rate.

number-of-rows

No

None

Long

The total number of rows to emit. By default, the total number of rows of generated data is not limited. If the generator type is a sequence generator, data generation will stop when either the maximum number of rows has been reached or the sequence number has reached its end value.

fields.#.kind

No

random

String

Generator for the # field. Here, # is a placeholder: replace it with the name of an actual field in the DataGen table. The # placeholder has the same meaning in the other parameters.

The value can be sequence or random.

  • random is the default value, indicating an unbounded random generator. You can use the fields.#.max and fields.#.min parameters to specify the maximum and minimum values that are randomly generated. If the specified field type is char, varchar, or string, you can also use the fields.#.length parameter to specify the length. If the specified field type is a timestamp, you can use the fields.#.max-past parameter to specify the maximum offset from the current time towards the past.
  • sequence represents a bounded sequence generator. You can specify the start and end values of the sequence using fields.#.start and fields.#.end. Once the sequence number reaches the end value, no more data will be generated.

fields.#.min

No

Minimum value of the field type specified by #

Field type specified by #

This parameter is valid only when fields.#.kind is set to random.

Minimum value of the random generator. It applies only to numeric field types specified by #.

fields.#.max

No

Maximum value of the field type specified by #

Field type specified by #

This parameter is valid only when fields.#.kind is set to random.

Maximum value of the random generator. It applies only to numeric field types specified by #.

fields.#.max-past

No

0

Duration

This parameter is valid only when fields.#.kind is set to random.

Maximum offset from the current time towards the past for values generated by the random generator. It applies only to timestamp field types specified by #.

fields.#.length

No

100

Integer

This parameter is valid only when fields.#.kind is set to random.

Length of the strings generated by the random generator. It applies only to char, varchar, and string types specified by #.

fields.#.start

No

None

Field type specified by #

This parameter is valid only when fields.#.kind is set to sequence.

Start value of a sequence generator.

fields.#.end

No

None

Field type specified by #

This parameter is valid only when fields.#.kind is set to sequence.

End value of a sequence generator.
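The random-generator parameters in Table 2 can be combined in one table definition. The following is a minimal sketch under stated assumptions: the table name randomSource, the column names, and all values are hypothetical, and the duration string '1 h' assumes the standard Flink duration format is accepted for fields.#.max-past.

```sql
-- Hypothetical table, columns, and values, for illustration only.
create table randomSource(
  user_id string,
  score int,
  event_time timestamp(3)
) with (
  'connector' = 'datagen',
  'rows-per-second' = '5', --Emits five rows per second.
  'fields.user_id.kind' = 'random', --Random generator (also the default).
  'fields.user_id.length' = '8', --Random strings of length 8.
  'fields.score.min' = '0', --Lower bound for the numeric field.
  'fields.score.max' = '100', --Upper bound for the numeric field.
  'fields.event_time.max-past' = '1 h' --Timestamps up to one hour in the past (assumed duration format).
);
```

Each fields.#.* key replaces # with the field name, so a key that names a column not declared in the table would not match any field.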

Example

Create a Flink OpenSource SQL job. Run the following script to generate random data through the DataGen table and output the data to the Print result table.

create table dataGenSource(
  user_id string,
  amount int
) with (
  'connector' = 'datagen',
  'rows-per-second' = '1', --Generates one row per second.
  'fields.user_id.kind' = 'random', --Specifies a random generator for the user_id field.
  'fields.user_id.length' = '3', --Limits the length of the user_id field to 3.
  'fields.amount.kind' = 'sequence', --Specifies a sequence generator for the amount field.
  'fields.amount.start' = '1', --Start value of the amount field.
  'fields.amount.end' = '1000' --End value of the amount field.
);

create table printSink(
  user_id string,
  amount int
) with (
  'connector' = 'print'
);

insert into printSink select * from dataGenSource;
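As a variation on the source table in the script above, number-of-rows can bound the output before the sequence reaches its end value. This is a sketch only: the table name boundedSource and the value 100 are hypothetical, and it assumes the connector version in use supports number-of-rows as listed in Table 2.

```sql
-- Hypothetical bounded variant of dataGenSource, for illustration only.
create table boundedSource(
  user_id string,
  amount int
) with (
  'connector' = 'datagen',
  'number-of-rows' = '100', --Stops after 100 rows in total.
  'fields.user_id.length' = '3',
  'fields.amount.kind' = 'sequence',
  'fields.amount.start' = '1',
  'fields.amount.end' = '1000' --Not reached, because the row limit applies first.
);
```

According to the number-of-rows description, generation stops at whichever limit is reached first, so this table would be expected to stop at the row limit rather than at the sequence end value.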

After the job is submitted, the job status changes to Running. You can use either of the following methods to view the output: