This section provides suggestions based on more than 50 test cases to help you create CarbonData tables with higher query performance.
Column name |
Data type |
Cardinality |
Attribution |
|---|---|---|---|
msisdn |
String |
30 million |
dimension |
BEGIN_TIME |
bigint |
10,000 |
dimension |
host |
String |
1 million |
dimension |
dime_1 |
String |
1,000 |
dimension |
dime_2 |
String |
500 |
dimension |
dime_3 |
String |
800 |
dimension |
counter_1 |
numeric(20,0) |
NA |
measure |
... |
... |
NA |
measure |
counter_100 |
numeric(20,0) |
NA |
measure |
implement optimization as follows:
Place this column in the first column of sort_columns.
For example, if msisdn is the most frequently used filter criterion in a query, it is placed in the first column. Run the following command to create a table. The query performance is good if msisdn is used as the filter condition.
create table carbondata_table(
msisdn String,
...
)STORED AS carbondata TBLPROPERTIES ('SORT_COLUMS'='msisdn');
implement optimization as follows:
Create an index for the columns.
For example, if msisdn, host, and dime_1 are frequently used columns, the sort_columns column sequence is "dime_1-> host-> msisdn..." based on cardinality. Run the following command to create a table. The following command can improve the filtering performance of dime_1, host, and msisdn.
create table carbondata_table(
dime_1 String,
host String,
msisdn String,
dime_2 String,
dime_3 String,
...
)STORED AS carbondata
TBLPROPERTIES ('SORT_COLUMS'='dime_1,host,msisdn');
implement optimization as follows:
sort_columns is sorted in ascending order of cardinality.
Run the following command to create a table:
create table carbondata_table(
Dime_1 String,
BEGIN_TIME bigint,
HOST String,
MSISDN String,
...
)STORED AS carbondata
TBLPROPERTIES ('SORT_COLUMS'='dime_2,dime_3,dime_1, BEGIN_TIME,host,msisdn');
create index carbondata_table_index_msidn on tablecarbondata_table (
MSISDN String) as 'carbondata' PROPERTIES ('table_blocksize'='128');
create index carbondata_table_index_host on tablecarbondata_table (
host String) as 'carbondata' PROPERTIES ('table_blocksize'='128');
The result of performance analysis of test-case shows reduction in query execution time from 15 to 3 seconds, thereby improving performance by nearly 5 times. The command for creating a table is as follows:
create table carbondata_table(
Dime_1 String,
BEGIN_TIME bigint,
HOST String,
MSISDN String,
counter_1 double,
counter_2 double,
...
counter_100 double,
)STORED AS carbondata
;
For example, if data is loaded to CarbonData every day, start_time is incremental for each load. In this case, it is recommended that the start_time column be put at the end of sort_columns, because incremental values are efficient in using min/max index. The command for creating a table is as follows:
create table carbondata_table(
Dime_1 String,
HOST String,
MSISDN String,
counter_1 double,
counter_2 double,
BEGIN_TIME bigint,
...
counter_100 double,
)STORED AS carbondata
TBLPROPERTIES ( 'SORT_COLUMS'='dime_2,dime_3,dime_1..BEGIN_TIME');