BMS
- |
-ECC uncorrectable errors generated on GPU SRAM
- |
-SRAMUncorrectableEccError
- |
-Major
- |
-There are ECC uncorrectable errors generated on GPU SRAM.
- |
-If services are affected, submit a service ticket.
- |
-The GPU hardware may be faulty. As a result, the GPU memory is faulty, and services exit abnormally.
+ | The GPU hardware may be faulty. As a result, the GPU memory is faulty, and services exit abnormally.
|
-osShutdown
+ |
osShutdown
|
-osShutdown
+ | osShutdown
|
-Major
+ | Major
|
-The BMS was stopped
+ | The BMS was stopped
- on the management console.
- by calling APIs.
|
-- Deploy service applications in HA mode.
- After the BMS is started, check whether services recover.
+ | - Deploy service applications in HA mode.
- After the BMS is started, check whether services recover.
|
-Services are interrupted.
+ | Services are interrupted.
|
-Abnormal shutdown
+ |
Abnormal shutdown
|
-serverShutdown
+ | serverShutdown
|
-Major
+ | Major
|
-The BMS was stopped unexpectedly, which may be caused by
+ | The BMS was stopped unexpectedly, which may be caused by
- unexpected power-off.
- hardware faults.
|
-- Deploy service applications in HA mode.
- After the BMS is started, check whether services recover.
+ | - Deploy service applications in HA mode.
- After the BMS is started, check whether services recover.
|
-Services are interrupted.
+ | Services are interrupted.
|
-Abnormal reboot
+ |
Abnormal reboot
|
-serverReboot
+ | serverReboot
|
-Major
+ | Major
|
-The BMS restarted unexpectedly, which may be caused by
+ | The BMS restarted unexpectedly, which may be caused by
- OS faults.
- hardware faults.
|
-- Deploy service applications in HA mode.
- After the BMS is restarted, check whether services recover.
+ | - Deploy service applications in HA mode.
- After the BMS is restarted, check whether services recover.
|
-Services are interrupted.
+ | Services are interrupted.
|
-Network interruption
+ |
Network interruption
|
-linkDown
+ | linkDown
|
-Major
+ | Major
|
-The BMS network was disconnected. Possible causes are as follows:
+ | The BMS network was disconnected. Possible causes are as follows:
- The BMS was unexpectedly stopped or restarted.
- The switch was faulty.
- The gateway was faulty.
|
-- Deploy service applications in HA mode.
- After the BMS is started, check whether services recover.
+ | - Deploy service applications in HA mode.
- After the BMS is started, check whether services recover.
|
-Services are interrupted.
+ | Services are interrupted.
|
-PCIE error
+ |
PCIE error
|
-pcieError
+ | pcieError
|
-Major
+ | Major
|
-The PCIe devices or main board of the BMS was faulty.
+ | The PCIe devices or main board of the BMS was faulty.
|
-- Deploy service applications in HA mode.
- After the BMS is started, check whether services recover.
+ | - Deploy service applications in HA mode.
- After the BMS is started, check whether services recover.
|
-The network or disk read/write services are affected.
+ | The network or disk read/write services are affected.
|
-Disk error
+ |
Disk error
|
-diskError
+ | diskError
|
-Major
+ | Major
|
-The disk backplane or disks of the BMS were faulty.
+ | The disk backplane or disks of the BMS were faulty.
|
-- Deploy service applications in HA mode.
- After the fault is rectified, check whether services recover.
+ | - Deploy service applications in HA mode.
- After the fault is rectified, check whether services recover.
|
-Data read/write services are affected, or the BMS cannot be started.
+ | Data read/write services are affected, or the BMS cannot be started.
|
-Storage error
+ |
Storage error
|
-storageError
+ | storageError
|
-Major
+ | Major
|
-The BMS failed to connect to EVS disks. Possible causes are as follows:
+ | The BMS failed to connect to EVS disks. Possible causes are as follows:
- The SDI card was faulty.
- Remote storage devices were faulty.
|
-- Deploy service applications in HA mode.
- After the fault is rectified, check whether services recover.
+ | - Deploy service applications in HA mode.
- After the fault is rectified, check whether services recover.
|
-Data read/write services are affected, or the BMS cannot be started.
+ | Data read/write services are affected, or the BMS cannot be started.
|
-OS reboot
+ |
OS reboot
|
-osReboot
+ | osReboot
|
-Major
+ | Major
|
-The BMS was restarted
+ | The BMS was restarted
- on the management console.
- by calling APIs.
|
-- Deploy service applications in HA mode.
- After the BMS is restarted, check whether services recover.
+ | - Deploy service applications in HA mode.
- After the BMS is restarted, check whether services recover.
|
-Services are interrupted.
+ | Services are interrupted.
|
-Inforom alarm generated on GPU
+ |
Inforom alarm generated on GPU
|
-gpuInfoROMAlarm
+ | gpuInfoROMAlarm
|
-Major
+ | Major
|
-The driver failed to read inforom information due to GPU faults.
+ | The driver failed to read inforom information due to GPU faults.
|
-Non-critical services can continue to use the GPU card. For critical services, submit a service ticket to resolve this issue.
+ | Non-critical services can continue to use the GPU card. For critical services, submit a service ticket to resolve this issue.
|
-Services will not be affected if inforom information cannot be read. If error correction code (ECC) errors are reported on GPU, faulty pages may not be automatically retired and services are affected.
+ | Services will not be affected if inforom information cannot be read. If error correction code (ECC) errors are reported on GPU, faulty pages may not be automatically retired and services are affected.
|
-Double-bit ECC alarm generated on GPU
+ |
Double-bit ECC alarm generated on GPU
|
-doubleBitEccError
+ | doubleBitEccError
|
-Major
+ | Major
|
-A double-bit ECC error occurred on GPU.
+ | A double-bit ECC error occurred on GPU.
|
-- If services are interrupted, restart the services to restore.
- If services cannot be restarted, restart the VM where services are running.
- If services still cannot be restored, submit a service ticket.
+ | - If services are interrupted, restart the services to restore.
- If services cannot be restarted, restart the VM where services are running.
- If services still cannot be restored, submit a service ticket.
|
-Services may be interrupted. After faulty pages are retired, the GPU card can continue to be used.
+ | Services may be interrupted. After faulty pages are retired, the GPU card can continue to be used.
|
-Too many retired pages
+ |
Too many retired pages
|
-gpuTooManyRetiredPagesAlarm
+ | gpuTooManyRetiredPagesAlarm
|
-Major
+ | Major
|
-An ECC page retirement error occurred on GPU.
+ | An ECC page retirement error occurred on GPU.
|
-If services are affected, submit a service ticket.
+ | If services are affected, submit a service ticket.
|
-Services may be affected.
+ | Services may be affected.
|
-ECC alarm generated on GPU A100
+ |
ECC alarm generated on GPU A100
|
-gpuA100EccAlarm
+ | gpuA100EccAlarm
|
-Major
+ | Major
|
-An ECC error occurred on GPU.
+ | An ECC error occurred on GPU.
|
-- If services are interrupted, restart the services to restore.
- If services cannot be restarted, restart the VM where services are running.
- If services still cannot be restored, submit a service ticket.
+ | - If services are interrupted, restart the services to restore.
- If services cannot be restarted, restart the VM where services are running.
- If services still cannot be restored, submit a service ticket.
|
-Services may be interrupted. After faulty pages are retired, the GPU card can continue to be used.
+ | Services may be interrupted. After faulty pages are retired, the GPU card can continue to be used.
|
-GPU ECC memory page retirement failure
+ |
GPU ECC memory page retirement failure
|
-eccPageRetirementRecordingFailure
+ | eccPageRetirementRecordingFailure
|
-Major
+ | Major
|
-Automatic page retirement failed due to ECC errors.
+ | Automatic page retirement failed due to ECC errors.
|
-- If services are interrupted, restart the services to restore.
- If services cannot be restarted, restart the VM where services are running.
- If services still cannot be restored, submit a service ticket.
+ | - If services are interrupted, restart the services to restore.
- If services cannot be restarted, restart the VM where services are running.
- If services still cannot be restored, submit a service ticket.
|
-Services may be interrupted, and memory page retirement fails. As a result, services can no longer use the GPU card.
+ | Services may be interrupted, and memory page retirement fails. As a result, services can no longer use the GPU card.
|
-GPU ECC page retirement alarm generated
+ |
GPU ECC page retirement alarm generated
|
-eccPageRetirementRecordingEvent
+ | eccPageRetirementRecordingEvent
|
-Minor
+ | Minor
|
-Memory pages are automatically retired due to ECC errors.
+ | Memory pages are automatically retired due to ECC errors.
|
-- If services are interrupted, restart the services to restore.
- If services cannot be restarted, restart the VM where services are running.
- If services still cannot be restored, submit a service ticket.
+ | - If services are interrupted, restart the services to restore.
- If services cannot be restarted, restart the VM where services are running.
- If services still cannot be restored, submit a service ticket.
|
-Generally, this alarm is generated together with the ECC error alarm. If this alarm is generated independently, services are not affected.
+ | Generally, this alarm is generated together with the ECC error alarm. If this alarm is generated independently, services are not affected.
|
-Too many single-bit ECC errors on GPU
+ |
Too many single-bit ECC errors on GPU
|
-highSingleBitEccErrorRate
+ | highSingleBitEccErrorRate
|
-Major
+ | Major
|
-There are too many single-bit ECC errors.
+ | There are too many single-bit ECC errors.
|
-- If services are interrupted, restart the services to restore.
- If services cannot be restarted, restart the VM where services are running.
- If services still cannot be restored, submit a service ticket.
+ | - If services are interrupted, restart the services to restore.
- If services cannot be restarted, restart the VM where services are running.
- If services still cannot be restored, submit a service ticket.
|
-Single-bit errors can be automatically rectified and do not affect GPU-related applications.
+ | Single-bit errors can be automatically rectified and do not affect GPU-related applications.
|
-GPU card not found
+ |
GPU card not found
|
-gpuDriverLinkFailureAlarm
+ | gpuDriverLinkFailureAlarm
|
-Major
+ | Major
|
-A GPU link is normal, but the NVIDIA driver cannot find the GPU card.
+ | A GPU link is normal, but the NVIDIA driver cannot find the GPU card.
|
-- Restart the VM to restore services.
- If services still cannot be restored, submit a service ticket.
+ | - Restart the VM to restore services.
- If services still cannot be restored, submit a service ticket.
|
-The GPU card cannot be found.
+ | The GPU card cannot be found.
|
-GPU link faulty
+ |
GPU link faulty
|
-gpuPcieLinkFailureAlarm
+ | gpuPcieLinkFailureAlarm
|
-Major
+ | Major
|
-GPU hardware information cannot be queried through lspci due to a GPU link fault.
+ | GPU hardware information cannot be queried through lspci due to a GPU link fault.
|
-If services are affected, submit a service ticket.
+ | If services are affected, submit a service ticket.
|
-The driver cannot use the GPU.
+ | The driver cannot use the GPU.
|
-GPU card lost
+ |
GPU card lost
|
-vmLostGpuAlarm
+ | vmLostGpuAlarm
|
-Major
+ | Major
|
-The number of GPU cards on the VM is less than the number specified in the specifications.
+ | The number of GPU cards on the VM is less than the number specified in the specifications.
|
-If services are affected, submit a service ticket.
+ | If services are affected, submit a service ticket.
|
-GPU cards are lost.
+ | GPU cards are lost.
|
-GPU memory page faulty
+ |
GPU memory page faulty
|
-gpuMemoryPageFault
+ | gpuMemoryPageFault
|
-Major
+ | Major
|
-The GPU memory page is faulty, which may be caused by applications, drivers, or hardware.
+ | The GPU memory page is faulty, which may be caused by applications, drivers, or hardware.
|
-If services are affected, submit a service ticket.
+ | If services are affected, submit a service ticket.
|
-The GPU hardware may be faulty. As a result, the GPU memory is faulty, and services exit abnormally.
+ | The GPU hardware may be faulty. As a result, the GPU memory is faulty, and services exit abnormally.
|
-GPU image engine faulty
+ |
GPU image engine faulty
|
-graphicsEngineException
+ | graphicsEngineException
|
-Major
+ | Major
|
-The GPU image engine is faulty, which may be caused by applications, drivers, or hardware.
+ | The GPU image engine is faulty, which may be caused by applications, drivers, or hardware.
|
-If services are affected, submit a service ticket.
+ | If services are affected, submit a service ticket.
|
-The GPU hardware may be faulty. As a result, the image engine is faulty, and services exit abnormally.
+ | The GPU hardware may be faulty. As a result, the image engine is faulty, and services exit abnormally.
|
-GPU temperature too high
+ |
GPU temperature too high
|
-highTemperatureEvent
+ | highTemperatureEvent
|
-Major
+ | Major
|
-The GPU temperature is too high.
+ | The GPU temperature is too high.
|
-If services are affected, submit a service ticket.
+ | If services are affected, submit a service ticket.
|
-If the GPU temperature exceeds the threshold, the GPU performance may deteriorate.
+ | If the GPU temperature exceeds the threshold, the GPU performance may deteriorate.
|
-GPU NVLink faulty
+ |
GPU NVLink faulty
|
-nvlinkError
+ | nvlinkError
|
-Major
+ | Major
|
-A hardware fault occurs on the NVLink.
+ | A hardware fault occurs on the NVLink.
|
-If services are affected, submit a service ticket.
+ | If services are affected, submit a service ticket.
|
-The NVLink is faulty and unavailable.
+ | The NVLink is faulty and unavailable.
|
-nvidia-smi suspended
+ |
nvidia-smi suspended
|
-nvidiaSmiHangEvent
+ | nvidiaSmiHangEvent
|
-Major
+ | Major
|
-nvidia-smi timed out.
+ | nvidia-smi timed out.
|
-If services are affected, submit a service ticket.
+ | If services are affected, submit a service ticket.
|
-The driver may report an error during service running.
+ | The driver may report an error during service running.
|