Reviewed-by: Eotvos, Oliver <oliver.eotvos@t-systems.com> Co-authored-by: qiujiandong1 <qiujiandong1@huawei.com> Co-committed-by: qiujiandong1 <qiujiandong1@huawei.com>
GPU Metrics
The CCE AI Suite (NVIDIA GPU) add-on provides GPU monitoring metrics that give you additional GPU observability options. This section describes the metrics provided by the add-on.
GPU Metrics Provided by CCE
| Category | Metric | Type | Unit | Monitoring Level | Description |
|---|---|---|---|---|---|
| Utilization | cce_gpu_utilization | Gauge | % | GPU cards | GPU compute usage |
| Utilization | cce_gpu_memory_utilization | Gauge | % | GPU cards | GPU memory usage |
| Utilization | cce_gpu_encoder_utilization | Gauge | % | GPU cards | GPU encoding usage |
| Utilization | cce_gpu_decoder_utilization | Gauge | % | GPU cards | GPU decoding usage |
| Utilization | cce_gpu_utilization_process | Gauge | % | GPU processes | GPU compute usage of each process |
| Utilization | cce_gpu_memory_utilization_process | Gauge | % | GPU processes | GPU memory usage of each process |
| Utilization | cce_gpu_encoder_utilization_process | Gauge | % | GPU processes | GPU encoding usage of each process |
| Utilization | cce_gpu_decoder_utilization_process | Gauge | % | GPU processes | GPU decoding usage of each process |
| Memory | cce_gpu_memory_used | Gauge | Byte | GPU cards | Used GPU memory. NOTE: If the NVIDIA driver version is 510 or later, the cce_gpu_memory_used value may be inaccurate in full GPU mode. The details are as follows: |
| Memory | cce_gpu_memory_total | Gauge | Byte | GPU cards | Total GPU memory |
| Memory | cce_gpu_memory_free | Gauge | Byte | GPU cards | Free GPU memory |
| Memory | cce_gpu_bar1_memory_used | Gauge | Byte | GPU cards | Used GPU BAR1 memory |
| Memory | cce_gpu_bar1_memory_total | Gauge | Byte | GPU cards | Total GPU BAR1 memory |
| Frequency | cce_gpu_clock | Gauge | MHz | GPU cards | GPU clock frequency |
| Frequency | cce_gpu_memory_clock | Gauge | MHz | GPU cards | GPU memory clock frequency, that is, the speed at which the GPU memory operates |
| Frequency | cce_gpu_graphics_clock | Gauge | MHz | GPU cards | GPU graphics clock frequency |
| Frequency | cce_gpu_video_clock | Gauge | MHz | GPU cards | GPU video processor frequency |
| Physical status | cce_gpu_temperature | Gauge | °C | GPU cards | GPU temperature |
| Physical status | cce_gpu_power_usage | Gauge | Milliwatt | GPU cards | GPU power |
| Physical status | cce_gpu_total_energy_consumption | Gauge | Millijoule | GPU cards | Total GPU energy consumption |
| Bandwidth | cce_gpu_pcie_link_bandwidth | Gauge | bit | GPU cards | GPU PCIe bandwidth |
| Bandwidth | cce_gpu_nvlink_bandwidth | Gauge | Gbit/s | GPU cards | GPU NVLink bandwidth |
| Bandwidth | cce_gpu_pcie_throughput_rx | Gauge | KB/s | GPU cards | GPU PCIe RX bandwidth |
| Bandwidth | cce_gpu_pcie_throughput_tx | Gauge | KB/s | GPU cards | GPU PCIe TX bandwidth |
| Bandwidth | cce_gpu_nvlink_utilization_counter_rx | Gauge | KB/s | GPU cards | GPU NVLink RX bandwidth |
| Bandwidth | cce_gpu_nvlink_utilization_counter_tx | Gauge | KB/s | GPU cards | GPU NVLink TX bandwidth |
| Memory isolation page | cce_gpu_retired_pages_sbe | Gauge | N/A | GPU cards | Number of isolated (retired) GPU memory pages with single-bit errors |
| Memory isolation page | cce_gpu_retired_pages_dbe | Gauge | N/A | GPU cards | Number of isolated (retired) GPU memory pages with double-bit errors |
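
The units in the table above (bytes, milliwatts) are often converted before they are shown on a dashboard. The following is a minimal sketch of that conversion, assuming the metrics are exposed in the Prometheus text format. The sample values and the `gpu_index` label are hypothetical and only illustrate the parsing; the real label set depends on the add-on.

```python
import re

# Hypothetical scrape output. The gpu_index label and all values are made up
# for illustration; only the metric names come from the table above.
SAMPLE = """\
# HELP cce_gpu_memory_used Used GPU memory
cce_gpu_memory_used{gpu_index="0"} 8589934592
cce_gpu_memory_total{gpu_index="0"} 17179869184
cce_gpu_power_usage{gpu_index="0"} 125000
cce_gpu_temperature{gpu_index="0"} 61
"""

# name, optional {labels}, then the sample value
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)


def parse_samples(text):
    """Return {metric_name: value}, skipping comments and blank lines.

    Labels are ignored, so this only suits input that describes one GPU card.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match:
            samples[match.group("name")] = float(match.group("value"))
    return samples


def summarize(samples):
    """Convert raw units (Byte, Milliwatt) into dashboard-friendly ones."""
    used = samples["cce_gpu_memory_used"]
    total = samples["cce_gpu_memory_total"]
    return {
        "memory_used_gib": used / 2**30,
        # Ratio derived here from used/total, not the cce_gpu_memory_utilization metric.
        "memory_used_percent": 100.0 * used / total,
        "power_watts": samples["cce_gpu_power_usage"] / 1000.0,
        "temperature_celsius": samples["cce_gpu_temperature"],
    }


summary = summarize(parse_samples(SAMPLE))
print(summary)
```

In a real setup the same arithmetic would more likely be done in the monitoring system's query language; the Python form is used here only to make the unit handling explicit.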
| Metric | Type | Unit | Monitoring Level | Description |
|---|---|---|---|---|
| xgpu_memory_total | Gauge | Byte | GPU processes | Total xGPU memory |
| xgpu_memory_used | Gauge | Byte | GPU processes | Used xGPU memory |
| xgpu_core_percentage_total | Gauge | % | GPU processes | Total xGPU cores |
| xgpu_core_percentage_used | Gauge | % | GPU processes | Used xGPU cores |
| gpu_schedule_policy | Gauge | N/A | GPU cards | xGPU scheduling policy. Options: |
| xgpu_device_health | Gauge | N/A | GPU cards | xGPU device health. Options: |
- To use the metrics listed in Table 3, the CCE AI Suite (NVIDIA GPU) add-on must be of version 2.1.30, 2.7.46, or later. If you require these metrics and your add-on version is earlier, upgrade the add-on.
- Cloud Native Cluster Monitoring does not automatically collect GPU pod monitoring metrics.
- If the NVIDIA driver version is 510 or later, the gpu_pod_memory_used value may be inaccurate in full GPU mode. The details are as follows:
  - In CCE AI Suite (NVIDIA GPU) of a version earlier than 2.7.60 or 2.1.44, the gpu_pod_memory_used value might be approximately 250 MB higher than the actual usage. This discrepancy reflects the memory reserved by the system for the GPU driver or firmware.
  - In CCE AI Suite (NVIDIA GPU) of version 2.7.60, 2.1.44, or later, the gpu_pod_memory_used value may be about 100 KB higher than the actual value.
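
The version-dependent discrepancy in the notes above can be turned into a rough correction. The sketch below is one possible reading, not an official formula. It assumes that 2.1.44 and 2.7.60 are the thresholds for the 2.1.x and 2.7.x release lines respectively, that "250 MB" and "100 KB" are decimal units, and that the caller has already confirmed the driver is version 510 or later and the workload runs in full GPU mode. All function names are hypothetical.

```python
# Assumption: each threshold applies to its own release line (2.1.x, 2.7.x).
FIXED_IN = {(2, 1): (2, 1, 44), (2, 7): (2, 7, 60)}

# Approximate over-reporting taken from the notes; decimal units are assumed.
APPROX_OVERHEAD_OLD = 250 * 10**6  # "approximately 250 MB"
APPROX_OVERHEAD_NEW = 100 * 10**3  # "about 100 KB"


def parse_version(version):
    """Turn '2.7.60' into (2, 7, 60) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def expected_overhead_bytes(addon_version):
    """Return the approximate over-reporting in bytes, or None if unknown.

    None means the release line is not covered by the notes, so no
    correction can be derived from them.
    """
    version = parse_version(addon_version)
    threshold = FIXED_IN.get(version[:2])
    if threshold is None:
        return None
    if version >= threshold:
        return APPROX_OVERHEAD_NEW
    return APPROX_OVERHEAD_OLD


def estimate_actual_used(reported_bytes, addon_version):
    """Subtract the approximate overhead from a gpu_pod_memory_used reading."""
    overhead = expected_overhead_bytes(addon_version)
    if overhead is None:
        return reported_bytes  # leave the reading untouched when unsure
    return max(0, reported_bytes - overhead)


print(estimate_actual_used(1_000_000_000, "2.7.59"))
print(estimate_actual_used(1_000_000_000, "2.7.60"))
```

Because both offsets are approximate, the result is an estimate for dashboards or alert thresholds rather than an exact measurement.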
| Metric | Type | Unit | Monitoring Level | Description |
|---|---|---|---|---|
| gpu_pod_core_percentage_total | Gauge | % | GPU processes | GPU compute allocated by a GPU card to GPU workloads. It is measured as a percentage of the compute of an entire GPU card. For example, a value of 30% means that 30% of the GPU card's compute is dedicated to processing GPU virtualization workloads. |
| gpu_pod_core_percentage_used | Gauge | % | GPU processes | Used GPU compute, that is, the GPU compute used by the GPU workloads. It is measured as a percentage of the compute of an entire GPU card. For example, a value of 30% means that the GPU workloads are actively using 30% of the GPU card's compute. |
| gpu_pod_memory_total | Gauge | Byte | GPU processes | GPU memory allocated by a GPU card to the GPU workloads. It is measured in bytes. |
| gpu_pod_memory_used | Gauge | Byte | GPU processes | Used GPU memory, that is, the GPU memory used by the GPU workloads. It is measured in bytes. |
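
Both compute metrics above are expressed relative to the entire GPU card, so seeing how much of its own allocation a workload consumes requires dividing the used value by the total value. The following minimal sketch shows that calculation with hypothetical sample values. The helper name `share_of_allocation` is an assumption for illustration and is not part of the add-on.

```python
def share_of_allocation(used, total):
    """Return used/total as a percentage, or None when nothing is allocated."""
    if total <= 0:
        return None
    return 100.0 * used / total


# Hypothetical readings for one GPU workload.
pod = {
    "gpu_pod_core_percentage_total": 30.0,  # 30% of the card is allocated
    "gpu_pod_core_percentage_used": 12.0,   # 12% of the card is in use
    "gpu_pod_memory_total": 4 * 2**30,      # 4 GiB allocated
    "gpu_pod_memory_used": 1 * 2**30,       # 1 GiB in use
}

core_share = share_of_allocation(
    pod["gpu_pod_core_percentage_used"], pod["gpu_pod_core_percentage_total"]
)
memory_share = share_of_allocation(
    pod["gpu_pod_memory_used"], pod["gpu_pod_memory_total"]
)

print(f"compute: {core_share:.1f}% of allocation")
print(f"memory:  {memory_share:.1f}% of allocation")
```

With these sample values, the workload uses 12% of the card, which is 40% of its 30% compute allocation, and a quarter of its allocated memory.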