Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: luhuayi <luhuayi@huawei.com> Co-committed-by: luhuayi <luhuayi@huawei.com>
27 KiB
TABLE Object Design (Prioritized)
Rule 2.9: Selecting the Optimal Distribution Method and Columns During Table Creation
Impact of rule violation:
Incorrect distribution method and column selection can cause storage skew, deteriorate access performance, and even overload storage and computing resources.
Solution:
- When creating a table, it is important to use the DISTRIBUTE BY clause to explicitly specify the distribution method and distribution columns. The table below provides principles for selecting the distribution columns.
Distribution Method |
Description |
Scenario |
|---|---|---|
Hash |
Table data is distributed to each DN based on the mapping between hash values generated by distribution columns and DNs.
|
Large tables and fact tables |
RoundRobin |
Table data is distributed to DNs in polling mode.
|
Large tables, fact tables, and tables without proper distribution columns |
Replication |
Full data in a table is copied to each DN in the cluster.
|
Small tables and dimension tables |
Rule 2.10 Selecting an Optimal Storage Type During Table Creation
Impact of rule violation:
Row-store tables are not properly used. As a result, the query performance is poor and resources are overloaded.
- Improper use of column-store tables causes CU expansion, poor performance, and resource overload.
Solution:
- When creating a table, use the orientation parameter to explicitly specify the storage type. The following table describes the rules for selecting a storage type.
Storage Type |
Applicable Scenario |
Inapplicable Scenario |
|---|---|---|
Row storage |
|
DML query: statistical analysis query (with mass data involved in GROUP and JOIN processes) CAUTION:
When creating a row-store table (orientation is set to row), do not specify the compress attribute or use a row-store compressed table. |
Column storage |
|
|
Rule 2.11 Selecting an Optimal Partitioning Policy During Table Creation
Impact of rule violation:
Without partitioning, query performance and data governance efficiency will deteriorate. The larger the data volume, the greater the deterioration. The advantages of partitioning include:
- High query performance: The system queries only the relevant partitions rather than the whole table, improving the query efficiency.
- Improved data governance efficiency: In the data lifecycle management scenario, performing TRUNCATE or DROP PARTITION on historical partitions is much more efficient and effective than using DELETE.
Solution:
- Design partitions for tables that contain fields of the time type.
Partitioning Policy |
Description |
Scenario |
|---|---|---|
Range partitioning |
Data is stored in different partitions based on the range of partition key values. The partition key ranges are consecutive but not overlapped. |
|
List partitioning |
Partitioning is performed based on a unique list of partition key values. |
|
Suggestion 2.12: Designing Table Columns for Fast and Accurate Queries
Impact of rule violation:
- The system may have limited storage space and low query efficiency.
Solution:
- Design the table columns well for fast queries.
- If you can select an integer, do not select the floating point or character type.
- When using variable-length character type, specify the maximum length based on data features.
- Design the table columns well for accurate queries.
- Use the time type instead of the character type to store time data.
- Use the minimum numeric type that meets the requirements. Avoid using bigint if int or smallint is sufficient to save space.
- Correctly use the constraints.
- Add NOT NULL constraints to columns that never have NULL values. The optimizer automatically optimizes the columns in certain scenarios.
- Do not use the DEFAULT constraint for fields that can be supplemented at the service layer. Otherwise, unexpected results may be generated during data loading.
- Avoid unnecessary data type conversion.
- In tables that are logically related, columns having the same meaning should use the same data type.
- Different types of comparison operations cause data type conversion, which may cause index and partition pruning failures and affect query performance.
Suggestion 2.13: Avoiding the Usage of Auto-increment Columns or Data Types
Impact of rule violation:
When auto-increment sequences or data types are heavily used, the GTM may become overloaded and slow down sequence generation.
Solution:
- Set a UUID to obtain a unique ID.
- If the auto-increment sequence must be used and there is no strict requirement for increasing order, you can set the cache, for example, 1000, to reduce the pressure on GTM.