Updating Statistics

In a database, statistics are the source data that the planner uses to generate execution plans. If statistics are unavailable or out of date, plan quality may deteriorate seriously, leading to poor performance.

Scenario

The ANALYZE statement collects statistics on the contents of database tables and stores them in the PG_STATISTIC system catalog. The query optimizer then uses these statistics to work out the most efficient execution plan.

After batch INSERT and DELETE operations, you are advised to run ANALYZE on the table or the entire database to update statistics. By default, 30,000 rows are sampled, which corresponds to the default value of 100 for the GUC parameter default_statistics_target. If the table has more than 1,600,000 rows, you are advised to set default_statistics_target to -2, which samples 2% of the rows.
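As a minimal sketch of the workflow above, the following statements refresh statistics after a batch load and then switch to percentage sampling for a large table. The table name sales_fact is hypothetical; the value -2 is the 2% setting described above.

```sql
-- Hypothetical table name; replace with your own.
-- Collect statistics with the default sample size (default_statistics_target = 100).
ANALYZE sales_fact;

-- For a table with more than 1,600,000 rows, sample 2% of the rows instead.
SET default_statistics_target = -2;
ANALYZE sales_fact;

-- Restore the default for the rest of the session.
RESET default_statistics_target;

-- Running ANALYZE without a table name analyzes the entire database.
ANALYZE;
```

SET changes the parameter for the current session only, so other sessions keep the default sample size.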

You also need to run ANALYZE on intermediate tables that are generated while batch scripts or stored procedures are running.

If a table contains multiple correlated columns and queries filter or group on those columns, collect multi-column statistics on them so that the query optimizer can accurately estimate the number of rows and generate an effective execution plan.

Generating Statistics

  • After multi-column statistics are declared with the ALTER TABLE tablename ADD STATISTICS statement, the system collects statistics on these columns the next time ANALYZE is run on the table or the entire database.
  • Use EXPLAIN to show the execution plan of each SQL statement. If rows=10 (the default value, which probably indicates that the table has not been analyzed) appears in the Seq Scan output of a table, run ANALYZE on that table.
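The two items above might look as follows in practice. This is a sketch that assumes a hypothetical table orders whose columns region and city are correlated; the names and filter values are illustrative, and the double-parenthesis column list is an assumption about the exact form of the ADD STATISTICS clause.

```sql
-- Declare multi-column statistics on two correlated columns (hypothetical names).
ALTER TABLE orders ADD STATISTICS ((region, city));

-- The declared statistics are collected the next time the table is analyzed.
ANALYZE orders;

-- Check the plan: an estimate of rows=10 on the Seq Scan node suggests
-- that the table has not been analyzed yet.
EXPLAIN SELECT * FROM orders WHERE region = 'EU' AND city = 'Berlin';
```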

Improving the Quality of Statistics

ANALYZE samples table data with a random sampling algorithm and calculates the table's data characteristics from the samples. The sample size is controlled by the default_statistics_target parameter, which ranges from -100 to 10000 and defaults to 100. A negative value specifies the sample as a percentage of the rows; for example, -2 samples 2%.
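The sketch below shows one way to inspect and raise the sampling target before analyzing a table with skewed data. The table and column names are hypothetical, the value 1000 is illustrative, and the per-column SET STATISTICS form is the PostgreSQL-style clause, assumed here to be available.

```sql
-- Show the sampling target currently in effect for this session.
SHOW default_statistics_target;

-- Raise the target for the session, then re-collect statistics.
SET default_statistics_target = 1000;
ANALYZE orders;
RESET default_statistics_target;

-- Alternatively, raise the target for one column only (hypothetical column),
-- leaving the session default untouched.
ALTER TABLE orders ALTER COLUMN region SET STATISTICS 1000;
ANALYZE orders;
```

A larger target improves estimates for skewed columns at the cost of a longer ANALYZE run.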

Automatic Statistics Collection

When the autoanalyze parameter is enabled, the optimizer automatically collects statistics if a table has no statistics or if its data changes exceed a threshold. This ensures that the optimizer has the information it needs to make precise decisions.

In a cost-based optimizer (CBO) model, statistics largely determine which query plan is generated. Therefore, timely and effective statistics are essential.

Collection condition: If the data volume changes substantially (the default threshold is 10%), indicating a shift in data characteristics, the system collects statistics again.
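To get a feel for the collection condition, the sketch below reads the relevant parameters and estimates how many changed rows would trigger re-collection for one table. It assumes the PostgreSQL-style rule (autovacuum_analyze_threshold + autovacuum_analyze_scale_factor × row count), which this document does not state explicitly, and a hypothetical table named orders.

```sql
-- Inspect the parameters that control automatic collection.
SHOW autoanalyze;
SHOW autovacuum_analyze_threshold;
SHOW autovacuum_analyze_scale_factor;

-- Estimate the number of changed rows that would trigger re-collection,
-- ASSUMING threshold + scale_factor * reltuples (PostgreSQL-style rule).
SELECT c.relname,
       c.reltuples::float8 AS estimated_rows,
       current_setting('autovacuum_analyze_threshold')::float8
         + current_setting('autovacuum_analyze_scale_factor')::float8
           * c.reltuples::float8 AS assumed_trigger_rows
FROM pg_class c
WHERE c.relname = 'orders';  -- hypothetical table name
```

Under this assumption, a scale factor of 0.1 corresponds to the 10% default mentioned above.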

Overall policy: The system uses dynamic sampling to collect statistics promptly and polling sampling to persist them. For workloads that require query responses within seconds, manual sampling is recommended.

Basic Rules

Table 1 Typical sampling methods

Function

Description

Feature

Constraint

Manual sampling

After making significant changes to the data in a job, manually run the ANALYZE command.

  • In normal mode, statistics are stored in system catalogs and shared globally. A level-4 lock is applied, preventing concurrent operations on a table.
  • In light mode, statistics are stored in memory and shared globally. A level-1 lock is applied, allowing concurrent operations on a table.
  • In force mode, you can perform forcible sampling even when statistics are locked, in addition to the normal mode functionalities.

Syntax: ANALYZE tablename; ANALYZE (light|force) tablename;

N/A

Polling sampling

A background thread collects statistics when the change threshold is reached.

Polling maintenance statistics

Only the normal mode is supported. Statistics are stored in system catalogs and shared. A level-4 lock is applied, preventing concurrent operations on a table.

Related GUC parameters:

  • autovacuum
  • autovacuum_mode
  • autovacuum_analyze_threshold
  • autovacuum_analyze_scale_factor

Asynchronous polling triggering

Dynamic sampling

Depending on the threshold, the query parsing process can take several dozen seconds.

Real-time maintenance statistics

  • In normal mode, statistics are stored in system catalogs and shared globally. A level-4 lock is applied, preventing concurrent operations on a table.
  • In light mode, statistics are stored in memory and shared globally. A level-1 lock is applied, allowing concurrent operations on a table.

Related GUC parameters:

  • autoanalyze
  • autoanalyze_mode

Real-time triggering upon query

In lightweight scenarios, persistence relies on polling sampling.

Forcible sampling

Uses SQL hints to forcefully gather statistics for each query.

Used in data feature-sensitive scenarios to ensure real-time and up-to-date query statistics.

Usage: select /*+ lightanalyze (t1 1) */ * from t1; (1: forcible sampling; 0: sampling disabled)

The SQL statement needs to be modified.

Collecting partition statistics

Collects incremental information by partition and combines it globally.

Used in ultra-large partitioned tables to ensure accurate query cost estimation after partition pruning.

This method takes up more storage space but provides greater accuracy.

Collecting statistics from multiple columns

Collects statistics on combinations of multiple columns.

Used when queries filter on multiple columns simultaneously, to ensure accurate query cost estimation.

You need to select target columns manually and use temporary tables.

Collecting expression statistics

Collects statistics on a column based on expression functions.

Used in batch expression filtering scenarios to ensure accurate query cost estimation.

Manual identification is required.

Collecting expression index statistics

Automatically collects statistics for created expression indexes.

Used in the point query expression filtering scenario to ensure accurate query cost estimation.

Manual identification is required.

Freezing statistics

Freezes table-level statistics to prevent changes.

Used in scenarios where data features are extremely stable to prevent sampling and query plan changes.

Used in scenarios where data features are highly variable to ensure sampling for each query.

Parameter: table-level attribute analyze_mode

N/A

Modifying statistics

Directly modifies statistics after manual calculation.

Used to maintain a low sampling ratio with manual calibration. Usage:

select approx_count_distinct(col_name) from table_name;

alter table table_name alter column col_name set (n_distinct=xxx);

N/A

Copying partition information

Copies statistics from old partitions to new ones.

Used for partitioned tables with minimal data feature changes to reduce statistics collection overhead.

N/A

Statistical information inference

Automatically calculates more accurate statistics based on existing data.

Controlled by the GUC parameter enable_extrapolation_stats.

N/A

Backing up and restoring statistics

Backs up statistics to an SQL statement using the EXPLAIN (STAT ON) command.

Used for scenario reproduction or statistics restoration.

Statistics are exported as SQL statements.
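The statements below gather the syntax from Table 1 into one sketch. The table t1 comes from the hint example in the table; the column col1 and the value 5000 are hypothetical, and the column-level ALTER TABLE form is the PostgreSQL-style clause, assumed here to apply.

```sql
-- Manual sampling in normal mode: statistics are stored in system catalogs.
ANALYZE t1;

-- Light mode: statistics are kept in memory under a level-1 lock.
ANALYZE (light) t1;

-- Force mode: samples even when the table's statistics are locked.
ANALYZE (force) t1;

-- Forcible sampling for a single query through a hint (1 enables, 0 disables).
SELECT /*+ lightanalyze (t1 1) */ * FROM t1;

-- Calibrate the distinct-value estimate by hand (col1 and 5000 are hypothetical).
SELECT approx_count_distinct(col1) FROM t1;
ALTER TABLE t1 ALTER COLUMN col1 SET (n_distinct = 5000);
```

In practice, the value passed to n_distinct would be taken from the result of the approx_count_distinct query.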

Scenarios and Strategies

The table below outlines typical data processing scenarios and the corresponding strategies for collecting statistics.

Table 2 Statistics collection strategies

Scenario

Description

Strategy

Incremental stream processing

Data changes arrive as a continuous incremental stream, leaving no suitable window to run ANALYZE.

Enable dynamic sampling to automatically collect and share statistics globally.

Online batch processing

(Data lake)

Data processing and querying occur concurrently, requiring stable queries.

Enable dynamic sampling or complete data processing and ANALYZE within a transaction.

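begin;
-- Clear the target table (or truncate a single partition instead).
truncate table tablename;
-- Reload the data here using COPY, MERGE INTO, or INSERT OVERWRITE.
analyze (light) tablename;
end;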

Partition parallel processing

Concurrent data processing in different partitions

Enable dynamic or manual light sampling and collect statistics concurrently for the same table.

Flat-wide table scenario

Wide table with over 100 columns

1. Enable automatic predicate management for dynamic sampling.

2. Collect statistics only on the first N columns.

3. Set column-level participation in sampling based on common query predicates.

Large table scenario

Large data volume with changes not reaching the threshold

Variable statistics

Lower the threshold for triggering dynamic sampling.

Feature-sensitive scenario

Changeable data features causing unstable query plans, requiring forcible collection.

1. Lower the threshold for triggering dynamic sampling.

2. Use the HINT mode in SQL statements for light dynamic sampling.

3. Clear and freeze statistics, re-collecting them for each query without sharing.

High-concurrency scenario

More than 10 concurrent queries run on the same table, and each triggers dynamic sampling and consumes resources.

1. Disable concurrent sampling so that other queries use the existing, outdated statistics.

2. Generate the latest statistics before querying (under development).

Streaming performance sensitivity

Stream processing that requires query responses within seconds or has high resource usage

Disable dynamic sampling at the table or SQL level and use background polling sampling.

Batch performance sensitivity

Batch processing that requires query responses within seconds or has high resource usage

Manually collect statistics during processing.

Resource Consumption

Table 3 Resource consumption

Category

Sub-Category

Description

CPU

Predicate column management

Automatically manage predicates and collect statistics only on queried columns.

Manually mask non-predicate columns.

Ultra-long column statistics

For data types that can be truncated, only the first 1,024 characters are counted.

I/O

30,000 samples are collected by default.

I/O cost depends on the number of columns, partitions, and small CUs, not on table size.

Memory

Buffer usage

At most one slot in the cstore buffer is occupied.

Memory zero copy

Directly calculate statistics from buffer samples without organizing into tuples.

Memory adaptation

Configure the system to use temporary tables for sampling when memory is insufficient. Prevent temporary table creation triggered by queries using the analyze_stats_mode parameter.

Memory size

Control maximum memory usage during ANALYZE with the maintenance_work_mem parameter. Exceeding memory limits results in data being written to disks or reduced samples.

Lock

Level-4 lock

(Normal mode) Applied in distributed mode, conflicting with DDL, VACUUM, ANALYZE, and REINDEX but not with INSERT, DELETE, or UPDATE operations.

Level-1 lock

(Light mode) Only local level-1 lock is supported, conflicting only with DDL statements.
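As a small illustration of the memory size item in Table 3, the sketch below raises the memory ceiling for one session before analyzing a large table. The table name and the 1GB value are illustrative assumptions; choose a value that fits your cluster.

```sql
-- Check the current memory ceiling for maintenance operations such as ANALYZE.
SHOW maintenance_work_mem;

-- Raise it for this session only (illustrative value), then analyze.
SET maintenance_work_mem = '1GB';
ANALYZE orders;  -- hypothetical table name

-- Return to the configured default.
RESET maintenance_work_mem;
```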

Accuracy and Reliability

Table 4 Accuracy/Reliability

Accuracy/Reliability

Item

Description

Accuracy

Sampling size

Configurable to adapt to table size with the default_statistics_target parameter.

Sampling randomness

  • Optimize reservoir and range sampling with the analyze_sample_mode parameter.
  • Enhance randomness of random number calculation with the random_function_version parameter.

Global sharing

Statistics can be shared across sessions and nodes.

Modification count broadcast

Background thread checks and broadcasts the global modification count in polling mode.

The job thread can also directly broadcast the modification count by specifying the tuple_change_sync_threshold parameter.

Cross-CN modification and query have minimal impact. The modification count is broadcast and synchronized in asynchronous mode.

Adjusting the CU sampling ratio

Increase CU sampling ratio if the CU filling rate is low, using the cstore_cu_sample_ratio parameter.

Stabilizing distinct values

Use the n_distinct parameter to stabilize distinct values after random sampling without increasing the sampling ratio.

Statistical information calculation

Use the enable_extrapolation_stats parameter to calculate more accurate statistics from old statistics when estimates are distorted.

Reliability

CN fault

Dynamic sampling is unaffected by other CN faults, and statistics are not synchronized. Query quality on the current CN remains unaffected.

CN restoration

Forcibly perform dynamic sampling and global synchronization during queries after CN recovery.

DN fault

Dynamic sampling of the logical cluster is unaffected by faults in other logical clusters.
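One way to judge whether the accuracy settings in Table 4 need tuning is to compare the sampled distinct-value estimate with the actual value. The sketch below assumes a hypothetical table orders with a column region and uses the standard pg_stats view.

```sql
-- Sampled estimate recorded by ANALYZE. A negative n_distinct is a
-- fraction of the row count rather than an absolute number.
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'region';

-- Actual number of distinct values (a full scan; run sparingly on large tables).
SELECT count(DISTINCT region) AS actual_distinct FROM orders;

-- Parameters from Table 4 that influence the estimate.
SHOW default_statistics_target;
SHOW enable_extrapolation_stats;
```

If the two values diverge widely, fixing n_distinct manually or raising the sampling ratio are the options listed in Table 4.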

O&M Monitoring

GaussDB(DWS) exposes the running mode and execution stage of ANALYZE by appending a comment to the ANALYZE command. This information can be viewed through system views such as pgxc_stat_activity.

The format of the appended comment is --Action-RunMode-StatsMode-SyncMode.

Example:

SELECT coorname,datid,datname,pid,usename,application_name,query_id,query
FROM pgxc_stat_activity WHERE query like '%analyze%' and query not like '%application_name%';
   coorname   | datid | datname  |       pid       | usename | application_name |     query_id      |                       query
--------------+-------+----------+-----------------+---------+------------------+-------------------+----------------------------------------------------
 coordinator1 | 15676 | postgres | 139919333779200 | test    | gsql             | 73183493944770822 | analyze t_1;
 coordinator2 | 15676 | postgres | 140217336461056 | test    | coordinator1     | 73183493944770822 | analyze public.t_1;--push stats-manual-memory-sync
 coordinator3 | 15676 | postgres | 139944245847808 | test    | coordinator1     | 73183493944770822 | analyze public.t_1;--push stats-manual-memory-sync
(3 rows)

Viewing Statistics