Reviewed-by: Rumpler, Mihály <mihaly.rumpler@t-systems.com> Co-authored-by: qiujiandong1 <qiujiandong1@huawei.com> Co-committed-by: qiujiandong1 <qiujiandong1@huawei.com>
Events Supported by Event Monitoring
- Events in Event Monitoring come from operations on cloud service resources and are not collected by the Agent in Server Monitoring.
- The name of a resource that supports event reporting can contain a maximum of 128 characters. Only letters, digits, underscores (_), hyphens (-), and periods (.) are allowed. If the name contains other characters, the event may fail to be reported to Cloud Eye.
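The naming rule above can be checked before a resource is created. The following is a minimal sketch, not part of any Cloud Eye SDK: `is_reportable_name` is a hypothetical helper, and it assumes that "letters" means ASCII letters and that an empty name is invalid, neither of which the rule states explicitly.

```python
import re

# Assumption: "letters" means ASCII letters, and a name needs at least one character.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_.-]{1,128}")


def is_reportable_name(name: str) -> bool:
    """Return True if the name uses only allowed characters and has 1 to 128 characters."""
    return _NAME_PATTERN.fullmatch(name) is not None


print(is_reportable_name("ecs-web_01.prod"))
print(is_reportable_name("ecs web 01"))
```

Running a check like this in provisioning scripts catches names that could cause events to go unreported.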
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
ECS |
SYS.ECS |
Restart triggered due to hardware fault |
startAutoRecovery |
Major |
ECSs on a faulty host were automatically migrated to another properly running host. During the migration, the ECSs were restarted. |
Wait for the event to end and check whether services are affected. |
Services may be interrupted. |
Restart completed due to hardware failure |
endAutoRecovery |
Major |
The ECS was recovered after the automatic migration. |
This event indicates that the ECS has recovered and is working properly. |
None |
||
Auto recovery timeout (being processed on the backend) |
faultAutoRecovery |
Major |
Migrating the ECS to a normal host timed out. |
Migrate services to other ECSs. |
Services are interrupted. |
||
ECS deleted |
deleteServer |
Major |
The ECS was deleted. |
Check whether the deletion was performed intentionally by a user. |
Services are interrupted. |
||
ECS restarted |
rebootServer |
Minor |
The ECS was restarted. |
Check whether the restart was performed intentionally by a user. |
Services are interrupted. |
||
ECS stopped |
stopServer |
Minor |
The ECS was stopped. |
|
Services are interrupted. |
||
NIC deleted |
deleteNic |
Major |
The ECS NIC was deleted. |
|
Services may be interrupted. |
||
ECS resized |
resizeServer |
Minor |
The ECS specifications were modified. |
|
Services are interrupted. |
||
GuestOS restarted |
RestartGuestOS |
Minor |
The guest OS was restarted. |
Contact O&M personnel. |
Services may be interrupted. |
||
ECS failure caused by system faults |
VMFaultsByHostProcessExceptions |
Critical |
The host where the ECS resides is faulty. The system will automatically try to start the ECS. |
After the ECS is started, check whether this ECS and services on it can run properly. |
The ECS is faulty. |
||
Startup failure |
faultPowerOn |
Major |
The ECS failed to start. |
Start the ECS again. If the problem persists, contact O&M personnel. |
The ECS cannot start. |
||
Host breakdown risk |
hostMayCrash |
Major |
The host where the ECS resides may break down, and the risk cannot be eliminated through live migration. |
Migrate services running on the ECS first and delete or stop the ECS. Start the ECS only after the O&M personnel eliminate the risk. |
The host may break down, causing service interruption. |
||
Scheduled migration completed |
instance_migrate_completed |
Major |
Scheduled ECS migration is completed. |
Wait until the ECSs become available and check whether services are affected. |
Services may be interrupted. |
||
Scheduled migration being executed |
instance_migrate_executing |
Major |
ECSs are being migrated as scheduled. |
Wait until the event is complete and check whether services are affected. |
Services may be interrupted. |
||
Scheduled migration canceled |
instance_migrate_canceled |
Major |
Scheduled ECS migration is canceled. |
None |
None |
||
Scheduled migration failed |
instance_migrate_failed |
Major |
ECSs failed to be migrated as scheduled. |
Contact O&M personnel. |
Services are interrupted. |
||
Scheduled migration to be executed |
instance_migrate_scheduled |
Major |
ECSs will be migrated as scheduled. |
Clarify the impact on services during the execution window. |
None |
||
Scheduled specification modification failed |
instance_resize_failed |
Major |
Specifications failed to be modified as scheduled. |
Contact O&M personnel. |
Services are interrupted. |
||
Scheduled specification modification completed |
instance_resize_completed |
Major |
Scheduled specifications modification is completed. |
None |
None |
||
Scheduled specification modification being executed |
instance_resize_executing |
Major |
Specifications are being modified as scheduled. |
Wait until the event is completed and check whether services are affected. |
Services are interrupted. |
||
Scheduled specification modification canceled |
instance_resize_canceled |
Major |
Scheduled specifications modification is canceled. |
None |
None |
||
Scheduled specification modification to be executed |
instance_resize_scheduled |
Major |
Specifications will be modified as scheduled. |
Check the impact on services during the execution window. |
None |
||
Scheduled redeployment to be executed |
instance_redeploy_scheduled |
Major |
ECSs will be redeployed on new hosts as scheduled. |
Check the impact on services during the execution window. |
None |
||
Scheduled restart to be executed |
instance_reboot_scheduled |
Major |
ECSs will be restarted as scheduled. |
Check the impact on services during the execution window. |
None |
||
Scheduled stop to be executed |
instance_stop_scheduled |
Major |
ECSs will be stopped as scheduled because they are affected by underlying hardware or system O&M. |
Check the impact on services during the execution window. |
None |
||
Live migration started |
liveMigrationStarted |
Major |
The host where the ECS is located may be faulty. The ECS is live migrated in advance to prevent service interruptions caused by a host breakdown. |
Wait for the event to end and check whether services are affected. |
Services may be interrupted for less than 1s. |
||
Live migration completed |
liveMigrationCompleted |
Major |
The live migration is complete, and the ECS is running properly. |
Check whether services are running properly. |
None |
||
Live migration failure |
liveMigrationFailed |
Major |
An error occurred during the live migration of an ECS. |
Check whether services are running properly. |
There is a low probability that services are interrupted. |
||
FPGA link fault |
FPGALinkFault |
Critical |
The FPGA of the host on which the ECS is located was
|
Deploy service applications in HA mode. After the FPGA fault is rectified, check whether services are restored. |
Services are interrupted. |
||
Scheduled redeployment to be authorized |
instance_redeploy_inquiring |
Major |
Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled. |
Authorize scheduled redeployment. |
None |
||
Local disk replacement canceled |
localdisk_recovery_canceled |
Major |
Local disk failure |
None |
None |
||
Local disk replacement to be executed |
localdisk_recovery_scheduled |
Major |
Local disk failure |
Clarify the impact on services during the execution window. |
None |
||
nvidia-smi suspended |
nvidiaSmiHangEvent |
Major |
nvidia-smi timed out. |
If services are affected, submit a service ticket. |
The driver may report an error during service running. |
||
NPU: uncorrectable ECC error |
UncorrectableEccErrorCount |
Major |
There are uncorrectable ECC errors on the NPU. |
If services are affected, replace the NPU with another one. |
Services may be interrupted. |
||
Scheduled redeployment canceled |
instance_redeploy_canceled |
Major |
Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled. |
None |
None |
||
Scheduled redeployment being executed |
instance_redeploy_executing |
Major |
Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled. |
Wait until the event is complete and check whether services are affected. |
Services are interrupted. |
||
Scheduled redeployment completed |
instance_redeploy_completed |
Major |
Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled. |
Wait until the redeployed ECSs are available and check whether services are affected. |
None |
||
Scheduled redeployment failed |
instance_redeploy_failed |
Major |
Because they are affected by underlying hardware or system O&M, ECSs will be redeployed on new hosts as scheduled. |
Contact O&M personnel. |
Services are interrupted. |
||
Local disk replacement to be authorized |
localdisk_recovery_inquiring |
Major |
Local disks are faulty. |
Authorize local disk replacement. |
Local disks are unavailable. |
||
Local disks being replaced |
localdisk_recovery_executing |
Major |
Local disk failure |
Wait until the local disks are replaced and check whether the local disks are available. |
Local disks are unavailable. |
||
Local disks replaced |
localdisk_recovery_completed |
Major |
Local disk failure |
Wait until the services are running properly and check whether local disks are available. |
None |
||
Local disk replacement failed |
localdisk_recovery_failed |
Major |
Local disks are faulty. |
Contact O&M personnel. |
Local disks are unavailable. |
||
NPU: device not found by npu-smi info |
NPUSMICardNotFound |
Major |
The Ascend driver is faulty or the NPU is disconnected. |
Transfer this issue to the Ascend or hardware team for handling. |
The NPU cannot be used normally. |
||
NPU: PCIe link error |
PCIeErrorFound |
Major |
The possible cause is deskew_fifo overflow, symbol_unlock, deskew_unlock event, or phystatus timeout. |
Transfer this issue to the hardware team for handling. |
The NPU cannot be used properly. |
||
NPU: device not found by lspci |
LspciCardNotFound |
Major |
The NPU is disconnected. |
Transfer this issue to the hardware team for handling. |
The NPU cannot be used normally. |
||
NPU: overtemperature |
TemperatureOverUpperLimit |
Major |
The temperature of DDR or software is too high. |
Stop services, restart the BMS, check the heat dissipation system, and reset the devices. |
The ECS may be powered off due to overtemperature and devices may not be found. |
||
NPU: request for instance restart |
RebootVirtualMachine |
Informational |
A fault occurs and the BMS needs to be restarted. |
Collect the fault information, and restart the BMS. |
Services may be interrupted. |
||
NPU: request for SoC reset |
ResetSOC |
Informational |
A fault occurs and the SoC needs to be reset. |
Collect the fault information, and reset the SoC. |
Services may be interrupted. |
||
NPU: request for restart AI process |
RestartAIProcess |
Informational |
A fault occurs and the AI process needs to be restarted. |
Collect the fault information, and restart the AI process. |
The current AI task will be interrupted. |
||
NPU: error codes |
NPUErrorCodeWarning |
Major |
A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes. |
Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition. |
Services may be interrupted. |
||
DAVP: die device node not found by vasmi |
DAVPSMICardNotFound |
Major |
The driver may be faulty or the card may be disconnected. |
Restart the VM. If the device still cannot be loaded, transfer this issue to the hardware team for handling. |
The DAVP cannot be used properly. |
||
DAVP: device not found by lspci |
DAVPLspciCardNotFound |
Major |
The DAVP is disconnected. |
Transfer this issue to the hardware team for handling. |
The DAVP cannot be used properly. |
||
DAVP: temperature higher than the threshold 85°C |
TemperatureOverDfLimit |
Major |
The core module temperature exceeds 85°C, which causes frequency reduction. |
Stop services. Contact the hardware team to check the heat dissipation system and reset the device. |
The DAVP card frequency is reduced. |
||
DAVP: temperature higher than the threshold 105°C |
TemperatureOverSdLimit |
Major |
The core module temperature exceeds 105°C, which generates a high temperature alarm. |
Stop services. Contact the hardware team to check the heat dissipation system and reset the device. |
Power-off protection is triggered. The DAVP cannot be used properly. |
||
DAVP: core unit exception of the device node |
DeviceCoreAbnormal |
Major |
You may need to restart the die device node. |
Collect the fault information and restart the die. |
Services may be interrupted. |
||
VM deletion failure |
faultDeleteServer |
Major |
Failed to delete the ECS. |
Check whether services are affected. |
The ECS resources fail to be deleted. |
Automatic recovery: If the hardware where an ECS is located is faulty, the system automatically migrates it to a normal physical host. The ECS will restart during the migration.
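The automatic recovery flow maps onto three event IDs from the ECS table: `startAutoRecovery`, `endAutoRecovery`, and `faultAutoRecovery`. The following is a minimal sketch of how a consumer might derive the current recovery state from a list of event IDs. The function name, the state labels, and the assumption that events arrive in chronological order are illustrative and not defined by Cloud Eye.

```python
from typing import Iterable, Optional

# Event IDs come from the ECS table; the state labels are illustrative only.
_RECOVERY_STATE = {
    "startAutoRecovery": "migrating",
    "endAutoRecovery": "recovered",
    "faultAutoRecovery": "timed_out",
}


def recovery_state(event_ids: Iterable[str]) -> Optional[str]:
    """Return the state implied by the latest auto-recovery event, or None if there is none."""
    state = None
    for event_id in event_ids:  # assumed to be in chronological order
        state = _RECOVERY_STATE.get(event_id, state)
    return state


print(recovery_state(["startAutoRecovery", "rebootServer", "endAutoRecovery"]))
```

A `timed_out` result corresponds to the `faultAutoRecovery` row, whose suggested solution is to migrate services to other ECSs.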
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
BMS |
SYS.BMS |
BMS restarted |
osReboot |
Major |
The BMS was restarted
|
|
Services are interrupted. |
Unexpected restart |
serverReboot |
Major |
The BMS restarted unexpectedly, which may be caused by
|
|
Services are interrupted. |
||
BMS stopped |
osShutdown |
Major |
The BMS was stopped
|
|
Services are interrupted. |
||
Unexpected shutdown |
serverShutdown |
Major |
The BMS was stopped unexpectedly, which may be caused by
|
|
Services are interrupted. |
||
Network disconnection |
linkDown |
Major |
The BMS network was disconnected. Possible causes are as follows:
|
|
Services are interrupted. |
||
PCIe error |
pcieError |
Major |
The PCIe devices or main board of the BMS was faulty. |
|
The network or disk read/write services are affected. |
||
Disk fault |
diskError |
Major |
The disk backplane or disks of the BMS were faulty. |
|
Data read/write services are affected, or the BMS cannot be started. |
||
EVS error |
storageError |
Major |
The BMS failed to connect to EVS disks. Possible causes are as follows:
|
|
Data read/write services are affected, or the BMS cannot be started. |
||
System maintenance inquiring |
system_maintenance_inquiring |
Major |
The scheduled BMS maintenance task is awaiting authorization. |
Authorize the maintenance. |
None |
||
System maintenance waiting |
system_maintenance_scheduled |
Major |
The scheduled BMS maintenance task is waiting to be executed. |
Clarify the impact on services during the execution window. |
None |
||
System maintenance canceled |
system_maintenance_canceled |
Major |
The scheduled BMS maintenance is canceled. |
None |
None |
||
System maintenance executing |
system_maintenance_executing |
Major |
BMSs are being maintained as scheduled. |
After the maintenance is complete, check whether services are affected. |
Services are interrupted. |
||
System maintenance completed |
system_maintenance_completed |
Major |
The scheduled BMS maintenance is completed. |
Wait until the BMSs become available and check whether services recover. |
None |
||
System maintenance failure |
system_maintenance_failed |
Major |
The scheduled BMS maintenance task failed. |
Contact O&M personnel. |
Services are interrupted. |
||
NPU: device not found by npu-smi info |
NPUSMICardNotFound |
Major |
The Ascend driver is faulty or the NPU is disconnected. |
Transfer this issue to the Ascend or hardware team for handling. |
The NPU cannot be used normally. |
||
NPU: PCIe link error |
PCIeErrorFound |
Major |
The lspci command returns rev ff indicating that the NPU is abnormal. |
Restart the BMS. If the issue persists, transfer it to the hardware team for processing. |
The NPU cannot be used normally. |
||
NPU: device not found by lspci |
LspciCardNotFound |
Major |
The NPU is disconnected. |
Transfer this issue to the hardware team for handling. |
The NPU cannot be used normally. |
||
NPU: overtemperature |
TemperatureOverUpperLimit |
Major |
The temperature of DDR or software is too high. |
Stop services, restart the BMS, check the heat dissipation system, and reset the devices. |
The BMS may be powered off and devices may not be found. |
||
NPU: uncorrectable ECC error |
UncorrectableEccErrorCount |
Major |
There are uncorrectable ECC errors on the NPU. |
If services are affected, replace the NPU with another one. |
Services may be interrupted. |
||
NPU: request for BMS restart |
RebootVirtualMachine |
Informational |
A fault occurs and the BMS needs to be restarted. |
Collect the fault information, and restart the BMS. |
Services may be interrupted. |
||
NPU: request for SoC reset |
ResetSOC |
Informational |
A fault occurs and the SoC needs to be reset. |
Collect the fault information, and reset the SoC. |
Services may be interrupted. |
||
NPU: request for restart AI process |
RestartAIProcess |
Informational |
A fault occurs and the AI process needs to be restarted. |
Collect the fault information, and restart the AI process. |
The current AI task will be interrupted. |
||
NPU: error codes |
NPUErrorCodeWarning |
Major |
A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes. |
Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition. |
Services may be interrupted. |
||
nvidia-smi suspended |
nvidiaSmiHangEvent |
Major |
nvidia-smi timed out. |
If services are affected, submit a service ticket. |
The driver may report an error during service running. |
||
nv_peer_mem loading error |
NvPeerMemException |
Minor |
The NVLink or nv_peer_mem cannot be loaded. |
Restore or reinstall the NVLink. |
nv_peer_mem cannot be used. |
||
Fabric Manager error |
NvFabricManagerException |
Minor |
The BMS meets the NVLink conditions and NVLink is installed, but Fabric Manager is abnormal. |
Restore or reinstall the NVLink. |
NVLink cannot be used normally. |
||
IB card error |
InfinibandStatusException |
Major |
The IB card or its physical status is abnormal. |
Transfer this issue to the hardware team for handling. |
The IB card cannot work normally. |
||
Local disk replacement to be authorized |
localdisk_recovery_inquiring |
Major |
The local disk is faulty. Local disk replacement authorization is in progress. |
Authorize local disk replacement. |
Local disks are unavailable. |
||
Local disks being replaced |
localdisk_recovery_executing |
Major |
The local disk is faulty and is being replaced. |
When the replacement is complete, check whether the local disks are available. |
Local disks are unavailable. |
||
Local disks replaced |
localdisk_recovery_completed |
Major |
The faulty local disk has been replaced. |
Wait until the services are running properly and check whether local disks are available. |
None |
||
Local disk replacement failed |
localdisk_recovery_failed |
Major |
The local disk is faulty and fails to be replaced. |
Contact O&M personnel. |
Local disks are unavailable. |
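The scheduled-task event IDs in the ECS and BMS tables (for example `system_maintenance_inquiring` or `localdisk_recovery_failed`) share a `<task>_<phase>` shape. The following is a minimal sketch of splitting such an ID so that alarm handling can branch on the phase. The helper and type names are hypothetical, and the phase list is only what appears in the tables, not an official enumeration.

```python
from typing import NamedTuple, Optional

# Phase suffixes that appear in the scheduled-task event IDs listed in the tables.
PHASES = ("inquiring", "scheduled", "executing", "completed", "canceled", "failed")


class ScheduledTaskEvent(NamedTuple):
    task: str   # for example "system_maintenance" or "localdisk_recovery"
    phase: str  # one of PHASES


def parse_scheduled_task_event(event_id: str) -> Optional[ScheduledTaskEvent]:
    """Split an ID such as "localdisk_recovery_failed" into its task and phase.

    Returns None for IDs that do not follow the <task>_<phase> shape, such as "osReboot".
    """
    task, separator, phase = event_id.rpartition("_")
    if separator and task and phase in PHASES:
        return ScheduledTaskEvent(task, phase)
    return None


print(parse_scheduled_task_event("system_maintenance_inquiring"))
print(parse_scheduled_task_event("osReboot"))
```

According to the tables, the `inquiring` phase is the one that asks for authorization, while `failed` calls for contacting O&M personnel.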
| Event Source | Namespace | Event Name | Event ID | Event Severity | Description | Solution | Impact |
|---|---|---|---|---|---|---|---|
| AAD | SYS.DDOS | DDoS Attack Events | ddosAttackEvents | Major | A DDoS attack occurred on the AAD-protected lines. | Assess the impact on services based on the attack traffic and attack type. If the attack traffic exceeds your purchased elastic bandwidth, change to another line or increase your bandwidth. | Services may be interrupted. |
| AAD | SYS.DDOS | Domain name scheduling event | domainNameDispatchEvents | Major | The high-defense CNAME corresponding to the domain name is scheduled, and the domain name is resolved to another high-defense IP address. | Pay attention to the workloads involving the domain name. | Services are not affected. |
| AAD | SYS.DDOS | Blackhole event | blackHoleEvents | Major | The attack traffic exceeds the purchased AAD protection threshold. | A blackhole is canceled after 30 minutes by default. The actual blackhole duration depends on how many times blackholes were triggered and on the peak attack traffic on the current day. The maximum duration is 24 hours. If you need to permit access before the blackhole is canceled, contact technical support. | Services may be interrupted. |
| AAD | SYS.DDOS | Cancel Blackhole | cancelBlackHole | Informational | The customer's AAD instance recovers from the blackhole state. | This is only a prompt and no action is required. | Customer services recover. |
| AAD | SYS.DDOS | IP address scheduling triggered | ipDispatchEvents | Major | IP route changed | Check the workloads of the IP address. | Services are not affected. |
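The blackhole timing given for `blackHoleEvents` (canceled after 30 minutes by default, 24 hours at most) can be turned into a rough release window. The following is a minimal sketch under those two figures only. The function name is hypothetical, and the real duration also depends on the trigger count and peak attack traffic, which this sketch does not model.

```python
from datetime import datetime, timedelta
from typing import Tuple

DEFAULT_BLACKHOLE = timedelta(minutes=30)  # default duration stated in the AAD table
MAX_BLACKHOLE = timedelta(hours=24)        # maximum duration stated in the AAD table


def blackhole_release_window(triggered_at: datetime) -> Tuple[datetime, datetime]:
    """Return (default release time, latest possible release time) for a blackhole."""
    return triggered_at + DEFAULT_BLACKHOLE, triggered_at + MAX_BLACKHOLE


default_release, latest_release = blackhole_release_window(datetime(2024, 1, 1, 12, 0))
print(default_release.isoformat(), latest_release.isoformat())
```

The `cancelBlackHole` event remains the authoritative signal that the blackhole has ended. The window is only useful for planning.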
| Event Source | Namespace | Event Name | Event ID | Event Severity | Description | Solution | Impact |
|---|---|---|---|---|---|---|---|
| ELB | SYS.ELB | The backend servers are unhealthy. | healthCheckUnhealthy | Major | Generally, this problem occurs because backend server services are offline. After being reported several times, this event is no longer reported. | Ensure that the backend servers are running properly. | ELB does not forward requests to unhealthy backend servers. If all backend servers in the backend server group are detected as unhealthy, services will be interrupted. |
| ELB | SYS.ELB | The backend server is detected healthy. | healthCheckRecovery | Minor | The backend server is detected as healthy. | No further action is required. | The load balancer can properly route requests to the backend server. |
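Because `healthCheckUnhealthy` stops being reported after several occurrences, the absence of new events does not mean a backend server has recovered. Only `healthCheckRecovery` does. The following is a minimal sketch of tracking unhealthy backend servers from the two event IDs. It assumes each event can be associated with a backend server name, which is a hypothetical input format rather than a documented event field.

```python
from typing import Iterable, Set, Tuple


def unhealthy_backends(events: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return backend servers whose latest health event is healthCheckUnhealthy.

    Each item is (event_id, backend_name), assumed to be in chronological order.
    """
    unhealthy: Set[str] = set()
    for event_id, backend in events:
        if event_id == "healthCheckUnhealthy":
            unhealthy.add(backend)
        elif event_id == "healthCheckRecovery":
            unhealthy.discard(backend)
    return unhealthy


def group_is_down(group: Set[str], unhealthy: Set[str]) -> bool:
    """Return True if every server in a non-empty backend server group is unhealthy."""
    return bool(group) and group <= unhealthy


events = [
    ("healthCheckUnhealthy", "srv-a"),
    ("healthCheckUnhealthy", "srv-b"),
    ("healthCheckRecovery", "srv-a"),
]
print(unhealthy_backends(events))
```

`group_is_down` mirrors the impact described for the unhealthy event: services are interrupted only when all backend servers in the group are unhealthy.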
| Event Source | Namespace | Event Name | Event ID | Event Severity | Description | Solution | Impact |
|---|---|---|---|---|---|---|---|
| CBR | SYS.CBR | Failed to create the backup. | backupFailed | Critical | The backup failed to be created. | Manually create a backup or contact customer service. | Data loss may occur. |
| CBR | SYS.CBR | Failed to restore the resource using a backup. | restorationFailed | Critical | The resource failed to be restored using a backup. | Restore the resource using another backup or contact customer service. | Data loss may occur. |
| CBR | SYS.CBR | Failed to delete the backup. | backupDeleteFailed | Critical | The backup failed to be deleted. | Try again later or contact customer service. | Charging may be abnormal. |
| CBR | SYS.CBR | Failed to delete the vault. | vaultDeleteFailed | Critical | The vault failed to be deleted. | Try again later or contact technical support. | Charging may be abnormal. |
| CBR | SYS.CBR | Replication failure | replicationFailed | Critical | The backup failed to be replicated. | Try again later or contact technical support. | Data loss may occur. |
| CBR | SYS.CBR | The backup is created successfully. | backupSucceeded | Major | The backup was created. | None | None |
| CBR | SYS.CBR | Resource restoration using a backup succeeded. | restorationSucceeded | Major | The resource was restored using a backup. | Check whether the data is successfully restored. | None |
| CBR | SYS.CBR | The backup is deleted successfully. | backupDeletionSucceeded | Major | The backup was deleted. | None | None |
| CBR | SYS.CBR | The vault is deleted successfully. | vaultDeletionSucceeded | Major | The vault was deleted. | None | None |
| CBR | SYS.CBR | Replication success | replicationSucceeded | Major | The backup was replicated successfully. | None | None |
| CBR | SYS.CBR | Client offline | agentOffline | Critical | The backup client was offline. | Ensure that the Agent status is normal and the backup client can be connected to . | Backup tasks may fail. |
| CBR | SYS.CBR | Client online | agentOnline | Major | The backup client was online. | None | None |
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
RDS |
SYS.RDS |
DB instance creation failure |
createInstanceFailed |
Major |
Generally, the cause is that the number of disks is insufficient due to quota limits, or underlying resources are exhausted. |
The selected resource specifications are insufficient. Select other available specifications and try again. |
DB instances cannot be created. |
Full backup failure |
fullBackupFailed |
Major |
A single full backup failure does not affect the files that have been successfully backed up, but it prolongs the incremental backup time during point-in-time restore (PITR). |
Try again. |
Full backup failed. |
||
Read replica promotion failure |
activeStandBySwitchFailed |
Major |
The standby DB instance does not take over workloads from the primary DB instance due to network or server failures. The original primary DB instance continues to provide services within a short time. |
Perform the switchover again during off-peak hours. |
The primary/standby switchover will fail. |
||
Replication status abnormal |
abnormalReplicationStatus |
Major |
The possible causes are as follows: (1) The replication delay between the primary instance and the standby instance or a read replica is too long. This usually occurs when a large amount of data is being written to databases or a large transaction is being processed. During peak hours, data may be blocked. (2) The network between the primary instance and the standby instance or a read replica is disconnected. |
Database replication is being repaired. You will be notified immediately after the repair. |
The replication status is abnormal. |
||
Replication status recovered |
replicationStatusRecovered |
Major |
The replication delay between the primary and standby instances is within the normal range, or the network connection between them has been restored. |
Check whether services are running properly. |
Replication status is recovered. |
||
DB instance faulty |
faultyDBInstance |
Major |
A single or primary DB instance was faulty due to a catastrophic failure, for example, server failure. |
Instance status is being repaired. You will be notified immediately after the repair. |
The instance status is abnormal. |
||
DB instance recovered |
DBInstanceRecovered |
Major |
RDS rebuilds the standby DB instance using its high availability mechanism. This event is reported after the instance is rebuilt. |
The DB instance status is normal. Check whether services are running properly. |
The instance is recovered. |
||
Failure of changing single DB instance to primary/standby |
singleToHaFailed |
Major |
A fault occurs when RDS is creating the standby DB instance or configuring replication between the primary and standby DB instances. The fault may occur because resources are insufficient in the data center where the standby DB instance is located. |
Automatic retry is in progress. |
Changing a single DB instance to primary/standby failed. |
||
Database process restarted |
DatabaseProcessRestarted |
Major |
The database process is stopped due to insufficient memory or high load. |
Check whether services are running properly. |
The primary instance is restarted. Services are interrupted for a short period of time. |
||
Instance storage full |
instanceDiskFull |
Major |
Generally, the cause is that the data space usage is too high. |
Scale up the storage. |
The instance storage is used up. No data can be written into databases. |
||
Instance storage full recovered |
instanceDiskFullRecovered |
Major |
The instance disk is recovered. |
Check whether services are running properly. |
The instance has available storage. |
||
Kafka connection failed |
kafkaConnectionFailed |
Major |
The network is unstable or the Kafka server does not work properly. |
Check whether services are affected. |
None |
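Several RDS events come in fault and recovery pairs, for example `instanceDiskFull` and `instanceDiskFullRecovered`. The following is a minimal sketch of listing faults that are still open in a stream of event IDs. The pairing table is an assumption inferred from the event names and descriptions above, not a mapping documented by RDS, and the function name is hypothetical.

```python
from typing import Iterable, List

# Assumption: each recovery event clears the fault event paired with it here.
CLEARED_BY = {
    "replicationStatusRecovered": "abnormalReplicationStatus",
    "DBInstanceRecovered": "faultyDBInstance",
    "instanceDiskFullRecovered": "instanceDiskFull",
}
FAULT_IDS = set(CLEARED_BY.values())


def open_faults(event_ids: Iterable[str]) -> List[str]:
    """Return fault event IDs not yet followed by their paired recovery event."""
    pending: List[str] = []
    for event_id in event_ids:  # assumed to be in chronological order
        if event_id in FAULT_IDS:
            if event_id not in pending:
                pending.append(event_id)
        elif event_id in CLEARED_BY:
            fault = CLEARED_BY[event_id]
            if fault in pending:
                pending.remove(fault)
    return pending


print(open_faults(["instanceDiskFull", "abnormalReplicationStatus", "instanceDiskFullRecovered"]))
```

Events without a counterpart in the table, such as `kafkaConnectionFailed`, are ignored by this sketch and need separate handling.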
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
DDS |
SYS.DDS |
DB instance creation failure |
DDSCreateInstanceFailed |
Major |
A DDS instance fails to be created due to insufficient disks, quotas, or underlying resources. |
Check the number and quota of disks. Release resources and create DDS instances again. |
DDS instances cannot be created. |
Replication failed |
DDSAbnormalReplicationStatus |
Major |
The possible causes are as follows:
|
Submit a service ticket. |
|
||
Replication status recovered |
DDSReplicationStatusRecovered |
Major |
The replication delay between the primary and standby instances is within the normal range, or the network connection between them has been restored. |
No action is required. |
None |
||
DB instance failed |
DDSFaultyDBInstance |
Major |
This event is a key alarm event and is reported when an instance is faulty due to a disaster or a server failure. |
Submit a service ticket. |
The database service may be unavailable. |
||
DB instance recovered |
DDSDBInstanceRecovered |
Major |
If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported. |
No action is required. |
None |
||
Faulty node |
DDSFaultyDBNode |
Major |
This event is a key alarm event and is reported when a database node is faulty due to a disaster or a server failure. |
Check whether the database service is available and submit a service ticket. |
The database service may be unavailable. |
||
Node recovered |
DDSDBNodeRecovered |
Major |
If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported. |
No action is required. |
None |
||
Primary/standby switchover or failover |
DDSPrimaryStandbySwitched |
Major |
This event is reported when a primary/standby switchover or failover is triggered. |
No action is required. |
None |
||
Insufficient storage space |
DDSRiskyDataDiskUsage |
Major |
The storage space is insufficient. |
Scale up storage space. For details, see section "Scaling Up Storage Space" in the corresponding user guide. |
The instance is set to read-only and data cannot be written to the instance. |
||
Data disk expanded and being writable |
DDSDataDiskUsageRecovered |
Major |
The capacity of a data disk has been expanded and the data disk becomes writable. |
No further action is required. |
No adverse impact. |
||
Schedule for deleting a KMS key |
planDeleteKmsKey |
Major |
A request to schedule deletion of a KMS key was submitted. |
After the KMS key is scheduled for deletion, either decrypt the data encrypted with the KMS key in a timely manner or cancel the key deletion. |
After the KMS key is deleted, users cannot encrypt disks. |
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
GeminiDB |
SYS.NoSQL |
DB instance creation failed |
NoSQLCreateInstanceFailed |
Major |
The instance quota or underlying resources are insufficient. |
Release the instances that are no longer used and try to provision them again, or submit a service ticket to adjust the quota. |
DB instances cannot be created. |
Specifications modification failed |
NoSQLResizeInstanceFailed |
Major |
The underlying resources are insufficient. |
Submit a service ticket. The O&M personnel will coordinate resources in the background, and then you need to change the specifications again. |
Services are interrupted. |
||
Node adding failed |
NoSQLAddNodesFailed |
Major |
The underlying resources are insufficient. |
Submit a service ticket. The O&M personnel will coordinate resources in the background. Then delete the node that failed to be added and add a new node. |
None |
||
Node deletion failed |
NoSQLDeleteNodesFailed |
Major |
The underlying resources fail to be released. |
Delete the node again. |
None |
||
Storage space scale-up failed |
NoSQLScaleUpStorageFailed |
Major |
The underlying resources are insufficient. |
Submit a service ticket. The O&M personnel will coordinate resources in the background. Then scale up the storage space again. |
Services may be interrupted. |
||
Password reset failed |
NoSQLResetPasswordFailed |
Major |
Resetting the password times out. |
Reset the password again. |
None |
||
Parameter group change failed |
NoSQLUpdateInstanceParamGroupFailed |
Major |
Changing a parameter group times out. |
Change the parameter group again. |
None |
||
Backup policy configuration failed |
NoSQLSetBackupPolicyFailed |
Major |
The database connection is abnormal. |
Configure the backup policy again. |
None |
||
Manual backup creation failed |
NoSQLCreateManualBackupFailed |
Major |
The backup files fail to be exported or uploaded. |
Submit a service ticket to the O&M personnel. |
Data cannot be backed up. |
||
Automated backup creation failed |
NoSQLCreateAutomatedBackupFailed |
Major |
The backup files fail to be exported or uploaded. |
Submit a service ticket to the O&M personnel. |
Data cannot be backed up. |
||
Faulty DB instance |
NoSQLFaultyDBInstance |
Major |
This event is a key alarm event and is reported when an instance is faulty due to a disaster or a server failure. |
Submit a service ticket. |
The database service may be unavailable. |
||
DB instance recovered |
NoSQLDBInstanceRecovered |
Major |
If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported. |
No action is required. |
None |
||
Faulty node |
NoSQLFaultyDBNode |
Major |
This event is a key alarm event and is reported when a database node is faulty due to a disaster or a server failure. |
Check whether the database service is available and submit a service ticket. |
The database service may be unavailable. |
||
Node recovered |
NoSQLDBNodeRecovered |
Major |
If a disaster occurs, NoSQL provides an HA tool to automatically or manually rectify the fault. After the fault is rectified, this event is reported. |
No action is required. |
None |
||
Primary/standby switchover or failover |
NoSQLPrimaryStandbySwitched |
Major |
This event is reported when a primary/standby switchover or failover is triggered. |
No action is required. |
None |
||
HotKey occurred |
HotKeyOccurs |
Major |
The primary key is improperly configured. As a result, hotspot data is distributed in one partition. The improper application design causes frequent read and write operations on a key. |
1. Choose a proper partition key. 2. Add service cache. The service application reads hotspot data from the cache first. |
The service request success rate is affected, and the cluster performance and stability may also be affected. |
||
BigKey occurred |
BigKeyOccurs |
Major |
The primary key design is improper. The number of records or data in a single partition is too large, causing unbalanced node loads. |
1. Choose a proper partition key. 2. Add a new partition key for hashing data. |
As the data in the large partition increases, the cluster stability deteriorates. |
||
Insufficient storage space |
NoSQLRiskyDataDiskUsage |
Major |
The storage space is insufficient. |
Scale up storage space. For details, see section "Scaling Up Storage Space" in the corresponding user guide. |
The instance is set to read-only and data cannot be written to the instance. |
||
Data disk expanded and being writable |
NoSQLDataDiskUsageRecovered |
Major |
The capacity of a data disk has been expanded and the data disk becomes writable. |
No operation is required. |
None |
||
Index creation failed |
NoSQLCreateIndexFailed |
Major |
The service load exceeds what the instance specifications can handle, and creating indexes consumes additional instance resources. As a result, the instance responds slowly or even stops responding, and the index creation times out. |
Select instance specifications that match the service load. Create indexes during off-peak hours. Create indexes in the background. Create only the indexes that are required. |
The index fails to be created or is incomplete. As a result, the index is invalid. Delete the index and create it again. |
||
Write speed decreased |
NoSQLStallingOccurs |
Major |
The write speed is close to the maximum write capability allowed by the cluster scale and instance specifications. As a result, the flow control mechanism of the database is triggered, and requests may fail. |
1. Adjust the cluster scale or node specifications based on the maximum write rate of services. 2. Measure the maximum write rate of services. |
The success rate of service requests is affected. |
||
Data write stopped |
NoSQLStoppingOccurs |
Major |
Data is written too fast, reaching the maximum write capability allowed by the cluster scale and instance specifications. As a result, the flow control mechanism of the database is triggered, and requests may fail. |
1. Adjust the cluster scale or node specifications based on the maximum write rate of services. 2. Measure the maximum write rate of services. |
The success rate of service requests is affected. |
||
Database restart failed |
NoSQLRestartDBFailed |
Major |
The instance status is abnormal. |
Submit a service ticket to the O&M personnel. |
The DB instance status may be abnormal. |
||
Restoration to new DB instance failed |
NoSQLRestoreToNewInstanceFailed |
Major |
The underlying resources are insufficient. |
Submit a service ticket to ask the O&M personnel to coordinate resources in the background and add new nodes. |
Data cannot be restored to a new DB instance. |
||
Restoration to existing DB instance failed |
NoSQLRestoreToExistInstanceFailed |
Major |
The backup file fails to be downloaded or restored. |
Submit a service ticket to the O&M personnel. |
The current DB instance may be unavailable. |
||
Backup file deletion failed |
NoSQLDeleteBackupFailed |
Major |
The backup files fail to be deleted from OBS. |
Delete the backup files again. |
None |
||
Failed to enable Show Original Log |
NoSQLSwitchSlowlogPlainTextFailed |
Major |
The DB engine does not support this function. |
Refer to the GaussDB NoSQL User Guide to ensure that the DB engine supports Show Original Log. Submit a service ticket to the O&M personnel. |
None |
||
EIP binding failed |
NoSQLBindEipFailed |
Major |
The node status is abnormal, an EIP has been bound to the node, or the EIP to be bound is invalid. |
Check whether the node is normal and whether the EIP is valid. |
The DB instance cannot be accessed from the Internet. |
||
EIP unbinding failed |
NoSQLUnbindEipFailed |
Major |
The node status is abnormal or the EIP has been unbound from the node. |
Check whether the node and EIP status are normal. |
None |
||
Parameter modification failed |
NoSQLModifyParameterFailed |
Major |
The parameter value is invalid. |
Check whether the parameter value is within the valid range and submit a service ticket to the O&M personnel. |
None |
||
Parameter group application failed |
NoSQLApplyParameterGroupFailed |
Major |
The instance status is abnormal. As a result, the parameter group cannot be applied. |
Submit a service ticket to the O&M personnel. |
None |
||
Failed to enable or disable SSL |
NoSQLSwitchSSLFailed |
Major |
Enabling or disabling SSL times out. |
Try again or submit a service ticket. Do not change the connection mode. |
The connection mode cannot be changed. |
||
Row size too large |
LargeRowOccurs |
Major |
If there is too much data in a single row, queries may time out, causing faults such as out-of-memory (OOM) errors. |
1. Control the length of each column and row so that the sum of key and value lengths in each row does not exceed the preset threshold. 2. Check whether there are invalid writes or encoding resulting in large keys or values. |
If there are rows that are too large, the cluster performance will deteriorate as the data volume grows. |
||
Schedule for deleting a KMS key |
planDeleteKmsKey |
Major |
A request to schedule deletion of a KMS key was submitted. |
After the KMS key is scheduled to be deleted, either decrypt the data encrypted by the KMS key in a timely manner or cancel the key deletion. |
After the KMS key is deleted, users cannot encrypt disks. |
||
Too many query tombstones |
TooManyQueryTombstones |
Major |
If there are too many query tombstones, queries may time out, affecting query performance. |
Select appropriate query and deletion methods and avoid long range queries. |
Queries may time out, affecting query performance. |
||
Too large collection column |
TooLargeCollectionColumn |
Major |
If there are too many elements in a collection column, queries to the column will fail. |
|
Queries to the collection column will fail. |
||
GeminiDB Influx instance connection limit reached |
InfluxDBConnectionFull |
Major |
The connections on the instance node reach the upper limit. |
1. Upgrade specifications if they cannot meet service requirements. 2. Check whether the client properly manages connections, for example, whether there are unreleased or long connections. |
If no new connection can be created on a node, the client may fail to connect to a GeminiDB Influx instance. As a result, services may become unstable. |
||
High availability switchover |
nodeHaSwitch |
Major |
The high availability switchover is triggered by underlying network jitters. |
Check whether services are running properly and whether they recover automatically. |
The network jitter causes a few seconds of delay. |
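The HotKey row above suggests adding a service cache so that the application reads hotspot data from the cache first. A minimal sketch of that idea follows, assuming an in-process read-through cache with a short time-to-live. The class name `ReadThroughCache`, the `load` callback, and the TTL value are hypothetical and are not part of any GeminiDB API; a production setup would more likely use a shared cache service.

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """Tiny in-process read-through cache with a time-to-live (TTL).

    Hypothetical helper for illustration only; `load` stands in for
    whatever function reads a key from the database.
    """

    def __init__(
        self,
        load: Callable[[str], str],
        ttl_seconds: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._load = load
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry time, cached value)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            # Fresh entry: serve from the cache, no database read.
            return entry[1]
        # Missing or expired: read from the database and cache the result.
        value = self._load(key)
        self._entries[key] = (now + self._ttl, value)
        return value
```

With this shape, repeated reads of the same hot key within the TTL reach the database only once. The TTL bounds how stale a cached value can be, so it should be chosen according to how much staleness the service tolerates.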
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
TaurusDB |
SYS.GAUSSDB |
Incremental backup failure |
TaurusIncrementalBackupInstanceFailed |
Major |
The network between the instance and the management plane (or the OBS) is disconnected, or the backup environment created for the instance is abnormal. |
Submit a service ticket. |
Backup jobs fail. |
Read replica creation failure |
addReadonlyNodesFailed |
Major |
The quota is insufficient or underlying resources are exhausted. |
Check the read replica quota. Release resources and create read replicas again. |
Read replicas fail to be created. |
||
DB instance creation failure |
createInstanceFailed |
Major |
The instance quota or underlying resources are insufficient. |
Check the instance quota. Release resources and create instances again. |
DB instances fail to be created. |
||
Read replica promotion failure |
activeStandBySwitchFailed |
Major |
The read replica fails to be promoted to the primary node due to network or server failures. The original primary node takes over services quickly. |
Submit a service ticket. |
The read replica fails to be promoted to the primary node. |
||
Instance specifications change failure |
flavorAlterationFailed |
Major |
The quota is insufficient or underlying resources are exhausted. |
Submit a service ticket. |
Instance specifications fail to be changed. |
||
Faulty DB instance |
TaurusInstanceRunningStatusAbnormal |
Major |
The instance process is faulty or the communications between the instance and the DFV storage are abnormal. |
Submit a service ticket. |
Services may be affected. |
||
DB instance recovered |
TaurusInstanceRunningStatusRecovered |
Major |
The instance is recovered. |
Observe the service running status. |
None |
||
Faulty node |
TaurusNodeRunningStatusAbnormal |
Major |
The node process is faulty or the communications between the node and the DFV storage are abnormal. |
Observe the instance and service running statuses. |
A read replica may be promoted to the primary node. |
||
Node recovered |
TaurusNodeRunningStatusRecovered |
Major |
The node is recovered. |
Observe the service running status. |
None |
||
Read replica deletion failure |
TaurusDeleteReadOnlyNodeFailed |
Major |
The communications between the management plane and the read replica are abnormal or the VM fails to be deleted from IaaS. |
Submit a service ticket. |
Read replicas fail to be deleted. |
||
Password reset failure |
TaurusResetInstancePasswordFailed |
Major |
The communications between the management plane and the instance are abnormal or the instance is abnormal. |
Check the instance status and try again. If the fault persists, submit a service ticket. |
Passwords fail to be reset for instances. |
||
DB instance reboot failure |
TaurusRestartInstanceFailed |
Major |
The network between the management plane and the instance is abnormal or the instance is abnormal. |
Check the instance status and try again. If the fault persists, submit a service ticket. |
Instances fail to be rebooted. |
||
Restoration to new DB instance failure |
TaurusRestoreToNewInstanceFailed |
Major |
The instance quota is insufficient, underlying resources are exhausted, or the data restoration logic is incorrect. |
If the new instance fails to be created, check the instance quota, release resources, and try to restore to a new instance again. In other cases, submit a service ticket. |
Backup data fails to be restored to new instances. |
||
EIP binding failure |
TaurusBindEIPToInstanceFailed |
Major |
The binding task fails. |
Submit a service ticket. |
EIPs fail to be bound to instances. |
||
EIP unbinding failure |
TaurusUnbindEIPFromInstanceFailed |
Major |
The unbinding task fails. |
Submit a service ticket. |
EIPs fail to be unbound from instances. |
||
Parameter modification failure |
TaurusUpdateInstanceParameterFailed |
Major |
The network between the management plane and the instance is abnormal or the instance is abnormal. |
Check the instance status and try again. If the fault persists, submit a service ticket. |
Instance parameters fail to be modified. |
||
Parameter template application failure |
TaurusApplyParameterGroupToInstanceFailed |
Major |
The network between the management plane and instances is abnormal or the instances are abnormal. |
Check the instance status and try again. If the fault persists, submit a service ticket. |
Parameter templates fail to be applied to instances. |
||
Full backup failure |
TaurusBackupInstanceFailed |
Major |
The network between the instance and the management plane (or the OBS) is disconnected, or the backup environment created for the instance is abnormal. |
Submit a service ticket. |
Backup jobs fail. |
||
Primary/standby failover |
TaurusActiveStandbySwitched |
Major |
When the network, physical machine, or database of the primary node is faulty, the system promotes a read replica to primary based on the failover priority to ensure service continuity. |
|
During the failover, database connection is interrupted for a short period of time. After the failover is complete, you can reconnect to the database. |
||
Database read-only |
NodeReadonlyMode |
Major |
The database supports only query operations. |
Submit a service ticket. |
After the database becomes read-only, write operations cannot be processed. |
||
Database read/write |
NodeReadWriteMode |
Major |
The database supports both write and read operations. |
Submit a service ticket. |
None |
||
Instance DR switchover |
DisasterSwitchOver |
Major |
If an instance is faulty and unavailable, a switchover is performed to ensure that the instance continues to provide services. |
Contact technical support. |
The database connection is intermittently interrupted. The HA service switches workloads from the primary node to a read replica and continues to provide services. |
||
Database process restarted |
TaurusDatabaseProcessRestarted |
Major |
The database process is stopped due to insufficient memory or high load. |
Log in to the Cloud Eye console. Check whether the memory usage increases sharply or the CPU usage is too high for a long time. You can increase the specifications or optimize the service logic. |
When the database process is suspended, workloads on the node are interrupted. In this case, the HA service automatically restarts the database process and attempts to recover the workloads. |
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
GaussDB(for openGauss) |
SYS.GAUSSDBV5 |
Process status alarm |
ProcessStatusAlarm |
Major |
Key processes exit, including CMS/CMA, ETCD, GTM, CN, and DN processes. |
Wait until the process is automatically recovered or a primary/standby failover is automatically performed. Check whether services are recovered. If not, contact SRE engineers. |
If processes on primary nodes are faulty, services are interrupted and then rolled back. If processes on standby nodes are faulty, services are not affected. |
Component status alarm |
ComponentStatusAlarm |
Major |
Key components do not respond, including CMA, ETCD, GTM, CN, and DN components. |
Wait until the process is automatically recovered or a primary/standby failover is automatically performed. Check whether services are recovered. If not, contact SRE engineers. |
If processes on primary nodes do not respond, neither do the services. If processes on standby nodes are faulty, services are not affected. |
||
Cluster status alarm |
ClusterStatusAlarm |
Major |
The cluster status is abnormal. For example, the cluster is read-only, a majority of ETCDs are faulty, or the cluster resources are unevenly distributed. |
Contact SRE engineers. |
If the cluster status is read-only, only read services are processed. If a majority of ETCDs are faulty, the cluster is unavailable. If resources are unevenly distributed, the instance performance and reliability deteriorate. |
||
Hardware resource alarm |
HardwareResourceAlarm |
Major |
A major hardware fault occurs in the instance, such as disk damage or GTM network fault. |
Contact SRE engineers. |
Some or all services are affected. |
||
Status transition alarm |
StateTransitionAlarm |
Major |
The following events occur in the instance: DN build failure, forcible DN promotion, primary/standby DN switchover/failover, or primary/standby GTM switchover/failover. |
Wait until the fault is automatically rectified and check whether services are recovered. If not, contact SRE engineers. |
Some services are interrupted. |
||
Other abnormal alarm |
OtherAbnormalAlarm |
Major |
Disk usage threshold alarm |
Focus on service changes and scale up storage space as needed. |
If the used storage space exceeds the threshold, storage space cannot be scaled up. |
||
DB instance creation failure |
GaussDBV5CreateInstanceFailed |
Major |
Instances fail to be created because the quota is insufficient or underlying resources are exhausted. |
Release the instances that are no longer used and try to provision them again, or submit a service ticket to adjust the quota. |
DB instances cannot be created. |
||
Node adding failure |
GaussDBV5ExpandClusterFailed |
Major |
The underlying resources are insufficient. |
Submit a service ticket. After the O&M personnel coordinate resources in the background, delete the node that failed to be added and add a new node. |
None |
||
Storage scale-up failure |
GaussDBV5EnlargeVolumeFailed |
Major |
The underlying resources are insufficient. |
Submit a service ticket. After the O&M personnel coordinate resources in the background, scale up the storage space again. |
Services may be interrupted. |
||
Reboot failure |
GaussDBV5RestartInstanceFailed |
Major |
The network is abnormal. |
Retry the reboot operation or submit a service ticket to the O&M personnel. |
The database service may be unavailable. |
||
Full backup failure |
GaussDBV5FullBackupFailed |
Major |
The backup files fail to be exported or uploaded. |
Submit a service ticket to the O&M personnel. |
Data cannot be backed up. |
||
Differential backup failure |
GaussDBV5DifferentialBackupFailed |
Major |
The backup files fail to be exported or uploaded. |
Submit a service ticket to the O&M personnel. |
Data cannot be backed up. |
||
Backup deletion failure |
GaussDBV5DeleteBackupFailed |
Major |
The backup files fail to be deleted from OBS. |
Delete the backup files again. |
None |
||
EIP binding failure |
GaussDBV5BindEIPFailed |
Major |
The EIP is bound to another resource. |
Submit a service ticket to the O&M personnel. |
The instance cannot be accessed from the public network. |
||
EIP unbinding failure |
GaussDBV5UnbindEIPFailed |
Major |
The network is faulty or EIP is abnormal. |
Unbind the IP address again or submit a service ticket to the O&M personnel. |
Residual IP addresses may remain. |
||
Parameter template application failure |
GaussDBV5ApplyParamFailed |
Major |
Modifying a parameter template times out. |
Modify the parameter template again. |
None |
||
Parameter modification failure |
GaussDBV5UpdateInstanceParamGroupFailed |
Major |
Modifying a parameter template times out. |
Modify the parameter template again. |
None |
||
Backup and restoration failure |
GaussDBV5RestoreFromBcakupFailed |
Major |
The underlying resources are insufficient or backup files fail to be downloaded. |
Submit a service ticket. |
The database service may be unavailable during the restoration failure. |
||
Failed to upgrade the hot patch |
GaussDBV5UpgradeHotfixFailed |
Major |
Generally, this fault is caused by an error reported during kernel upgrade. |
View the error information about the workflow and redo or skip the job. |
None |
||
DB instance faulty |
GaussDBV5FaultyDBInstance |
Major |
This event is a key alarm event and is reported when an instance is faulty due to a disaster or a server failure. |
Submit a service ticket. |
The database service may be unavailable. |
||
DB instance recovered |
GaussDBV5InstanceRecovered |
Major |
GaussDB(for openGauss) provides an HA tool for automated or manual rectification of faults. After the fault is rectified, this event is reported. |
No action is required. |
None |
||
Faulty node |
GaussDBV5FaultyDBNode |
Major |
This event is a key alarm event and is reported when a database node is faulty due to a disaster or a server failure. |
Check whether the database service is available and submit a service ticket. |
The database service may be unavailable. |
||
Node recovered |
GaussDBV5FaultyDBNodeRecovered |
Major |
GaussDB(for openGauss) provides an HA tool for automated or manual rectification of faults. After the fault is rectified, this event is reported. |
No action is required. |
None |
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
DDM |
SYS.DDM (DDM 1.0) SYS.DDMS (DDM 2.0) |
Failed to create a DDM instance |
createDdmInstanceFailed |
Major |
The underlying resources are insufficient. |
Release resources and create the instance again. |
DDM instances cannot be created. |
Failed to change class of a DDM instance |
resizeFlavorFailed |
Major |
The underlying resources are insufficient. |
Submit a service ticket to the O&M personnel to coordinate resources and try again. |
Services on some nodes are interrupted. |
||
Failed to scale out a DDM instance |
enlargeNodeFailed |
Major |
The underlying resources are insufficient. |
Submit a service ticket to the O&M personnel to coordinate resources, delete the node that fails to be added, and add a node again. |
The instance fails to be scaled out. |
||
Failed to scale in a DDM instance |
reduceNodeFailed |
Major |
The underlying resources fail to be released. |
Submit a service ticket to the O&M personnel to release resources. |
The instance fails to be scaled in. |
||
Failed to restart a DDM instance |
restartInstanceFailed |
Major |
The associated DB instances are abnormal. |
Check whether the associated DB instances are normal. If they are normal, submit a service ticket to the O&M personnel. |
Services on some nodes are interrupted. |
||
Failed to create a schema |
createLogicDbFailed |
Major |
The possible causes are as follows:
|
Check whether
|
Services cannot run properly. |
||
Failed to bind an EIP |
bindEipFailed |
Major |
The EIP is abnormal. |
Try again later. In case of emergency, contact O&M personnel to rectify the fault. |
The DDM instance cannot be accessed from the Internet. |
||
Failed to scale out a schema |
migrateLogicDbFailed |
Major |
The underlying resources fail to be processed. |
Submit a service ticket to the O&M personnel. |
The schema cannot be scaled out. |
||
Failed to re-scale out a schema |
retryMigrateLogicDbFailed |
Major |
The underlying resources fail to be processed. |
Submit a service ticket to the O&M personnel. |
The schema cannot be scaled out. |
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
EVS |
SYS.EVS |
Update disk |
updateVolume |
Minor |
Update the name and description of an EVS disk. |
No further action is required. |
None |
Expand disk |
extendVolume |
Minor |
Expand an EVS disk. |
No further action is required. |
None |
||
Delete disk |
deleteVolume |
Major |
Delete an EVS disk. |
No further action is required. |
Deleted disks cannot be recovered. |
||
QoS upper limit reached NOTE:
This event is no longer supported for EVS and will be removed from Cloud Eye. |
reachQoS |
Major |
The I/O latency increases because the QoS upper limits of the disk are frequently reached and flow control is triggered. |
Change the disk type to one with a higher specification. |
The current disk may fail to meet service requirements. |
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
|---|---|---|---|---|
IAM |
SYS.IAM |
Login |
login |
Minor |
Logout |
logout |
Minor |
||
Password changed |
changePassword |
Major |
||
User created |
createUser |
Minor |
||
User deleted |
deleteUser |
Major |
||
User updated |
updateUser |
Minor |
||
User group created |
createUserGroup |
Minor |
||
User group deleted |
deleteUserGroup |
Major |
||
User group updated |
updateUserGroup |
Minor |
||
Identity provider created |
createIdentityProvider |
Minor |
||
Identity provider deleted |
deleteIdentityProvider |
Major |
||
Identity provider updated |
updateIdentityProvider |
Minor |
||
Metadata updated |
updateMetadata |
Minor |
||
Security policy updated |
updateSecurityPolicies |
Major |
||
Credential added |
addCredential |
Major |
||
Credential deleted |
deleteCredential |
Major |
||
Project created |
createProject |
Minor |
||
Project updated |
updateProject |
Minor |
||
Project suspended |
suspendProject |
Major |
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
KMS |
SYS.KMS |
Key disabled |
disableKey |
Major |
A key is disabled and cannot be used. |
If the customer needs to disable the key, no action is required. However, if the key is disabled by mistake, the customer needs to log in to the DEW console and enable it again. |
Services may be affected if the key is being used. |
Key deletion scheduled |
scheduleKeyDeletion |
Minor |
A key is scheduled to be deleted and cannot be used. |
If the customer needs to delete the key, no action is required. However, if the deletion of the key is scheduled by mistake, the customer needs to log in to the DEW console, cancel the scheduled deletion, and enable the key again. |
Services may be affected if the key is being used. |
||
Grant retired |
retireGrant |
Major |
A grant is retired and the key cannot be used. |
If the customer needs to cancel the key grant, no action is required. However, if the grant is canceled by mistake, the customer needs to log in to the DEW console and create the grant again. |
Services may be affected if the key is being used. |
||
Grant revoked |
revokeGrant |
Major |
A grant is revoked and the key cannot be used. |
If the customer needs to cancel the key grant, no action is required. However, if the grant is canceled by mistake, the customer needs to log in to the DEW console and create the grant again. |
Services may be affected if the key is being used. |
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
Cloud Eye |
SYS.CES |
Agent heartbeat interruption |
agentHeartbeatInterrupted |
Major |
The collecting process of the Agent is faulty. |
|
The Agent will stop collecting and reporting metrics. |
Agent back to normal |
agentResumed |
Informational |
The Agent is back to normal. |
No action is required. |
None |
||
Agent faulty |
agentFaulted |
Major |
The Agent was faulty and this status was reported to Cloud Eye. |
The Agent process is faulty. Restart the Agent. If the Agent process is still faulty after the restart, the Agent files may be damaged. In this case, reinstall the Agent. Update the Agent to the latest version. |
The Agent will stop collecting and reporting metrics. |
||
Agent disconnected |
agentDisconnected |
Major |
The communication process of the Agent is faulty. |
Check whether the Agent domain name can be resolved. Check whether your account is in arrears. If the Agent process is faulty, restart the Agent. If the Agent process is still faulty after the restart, the Agent files may be damaged. In this case, reinstall the Agent. Check whether the server time is consistent with the local standard time. Update the Agent to the latest version. |
The Agent will stop collecting and reporting metrics. |
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
DCS |
SYS.DCS |
Full sync retry during online migration |
migrationFullResync |
Minor |
If online migration fails, full synchronization will be triggered because incremental synchronization cannot be performed. |
Check whether full sync retries are triggered repeatedly. Check whether the source instance is connected and whether it is overloaded. If full sync retries are triggered repeatedly, contact O&M personnel. |
The migration task is disconnected from the source instance, triggering another full sync. As a result, the CPU usage of the source instance may increase sharply. |
Memcached master/standby switchover |
memcachedMasterStandbyFailover |
Minor |
The master node was abnormal, promoting the standby node to master. |
Check whether services can recover by themselves. If applications cannot recover, restart them. |
Persistent connections to the instance will be interrupted. |
||
Redis server abnormal |
redisNodeStatusAbnormal |
Major |
The Redis server status was abnormal. |
Check whether services are affected. If yes, contact O&M personnel. |
If the master node is abnormal, an automatic failover is performed. If a standby node is abnormal and the client directly connects to the standby node for read/write splitting, no data can be read. |
||
Redis server recovered |
redisNodeStatusNormal |
Major |
The Redis server status recovered. |
Check whether services can recover. If the applications are not reconnected, restart them. |
The Redis server has recovered from the exception. |
||
Sync failure in data migration |
migrateSyncDataFail |
Major |
Online migration failed. |
Reconfigure the migration task and migrate data again. If the fault persists, contact O&M personnel. |
Data migration fails. |
||
Memcached instance abnormal |
memcachedInstanceStatusAbnormal |
Major |
The Memcached node status was abnormal. |
Check whether services are affected. If yes, contact O&M personnel. |
The Memcached instance is abnormal and may not be accessed. |
||
Memcached instance recovered |
memcachedInstanceStatusNormal |
Major |
The Memcached node status recovered. |
Check whether services can recover. If the applications are not reconnected, restart them. |
The Memcached instance has recovered from the exception. |
||
Instance backup failure |
instanceBackupFailure |
Major |
The DCS instance fails to be backed up due to an OBS access failure. |
Retry backup manually. |
Automated backup fails. |
||
Instance node abnormal restart |
instanceNodeAbnormalRestart |
Major |
DCS nodes restarted unexpectedly when they became faulty. |
Check whether services can recover by themselves. If applications cannot recover, restart them. |
Persistent connections to the instance will be interrupted. |
||
Long-running Lua scripts stopped |
scriptsStopped |
Informational |
Lua scripts that had timed out automatically stopped running. |
Optimize Lua scripts to prevent execution timeout. |
If Lua scripts take a long time to execute, they will be forcibly stopped to avoid blocking the entire instance. |
||
Node restarted |
nodeRestarted |
Informational |
After write operations had been performed, the node automatically restarted to stop Lua scripts that had timed out. |
Check whether services can recover by themselves. If applications cannot recover, restart them. |
Persistent connections to the instance will be interrupted. |
||
Automatic failover |
masterStandbyFailover |
Major |
The master node failed due to a hardware or software fault, triggering the replica node to take over services. |
Check that the application reconnected to the instance and the fault was rectified. Otherwise, restart the application. |
Access errors interrupt persistent connections to the instance. |
||
Manual switchover |
masterStandbySwitchover |
Major |
Performing master/standby switchovers on the console or calling the master/standby switchover API triggers these events. Master/Standby switchovers occur during specification changes or after instance restarts. Manual O&M on the backend required by fault drills or resource migration initiates master/standby switchovers. |
Check that the application reconnected to the instance and the fault was rectified. Otherwise, restart the application. |
Access errors interrupt persistent connections to the instance. |
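The failover and switchover rows above ask you to confirm that the application reconnected to the instance. A minimal sketch of client-side reconnection with exponential backoff follows. The `connect` callable, the retry counts, and the delays are illustrative assumptions and are not part of DCS or Cloud Eye; pass in whatever function your client library uses to open a connection.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def reconnect_with_backoff(
    connect: Callable[[], T],          # hypothetical: opens a connection or raises ConnectionError
    attempts: int = 5,                 # assumed retry budget
    base_delay: float = 0.5,           # assumed first wait, in seconds
    max_delay: float = 8.0,            # assumed upper bound for a single wait
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, doubling the wait after each failure."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts:
                # Retries are exhausted; restarting the application is the fallback.
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise ValueError("attempts must be at least 1")
```

If the last attempt still fails, the exception propagates, which corresponds to the "Otherwise, restart the application" step in the Solution column.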
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
HSS |
SYS.HSS |
HSS agent disconnected |
hssAgentAbnormalOffline |
Major |
The communication between the agent and the server is abnormal, or the agent process on the server is abnormal. |
Fix your network connection. If the agent is still offline for a long time after the network recovers, the agent process may be abnormal. In this case, log in to the server and restart the agent process. |
Services are interrupted. |
Abnormal HSS agent status |
hssAgentAbnormalProtection |
Major |
The agent is abnormal probably because it does not have sufficient resources. |
Log in to the server and check your resources. If the usage of memory or other system resources is too high, increase their capacity first. If the resources are sufficient but the fault persists after the agent process is restarted, submit a service ticket to the O&M personnel. |
Services are interrupted. |
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
IMS |
SYS.IMS |
Create Image |
createImage |
Major |
An image was created. |
None |
You can use this image to create cloud servers. |
Update Image |
updateImage |
Major |
Metadata of an image was modified. |
None |
Cloud servers may fail to be created from this image. |
||
Delete Image |
deleteImage |
Major |
An image was deleted. |
None |
This image will be unavailable on the management console. |
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
Elastic IP and bandwidth |
SYS.VPC |
Delete VPC |
deleteVpc |
Major |
The VPC resources were deleted. |
Check whether the VPC resources were deleted by mistake. |
Deleting the VPC may affect customer services. |
Modify VPC |
modifyVpc |
Minor |
The VPC information was modified. |
Check whether the VPC information was modified by mistake. |
Modifying the VPC may affect customer services. |
||
Delete subnet |
deleteSubnet |
Minor |
The subnets were deleted. |
Check whether the subnets were deleted by mistake. |
Deleting the VPC subnets may affect customer services. |
||
Modify subnet |
modifySubnet |
Minor |
The subnet information was modified. |
Check whether the subnet information was modified by mistake. |
Modifying the VPC subnets may affect customer services. |
||
Modify bandwidth |
modifyBandwidth |
Minor |
The bandwidth information was modified. |
Check whether the bandwidth information was modified by mistake. |
Services may be interrupted. |
Event Source |
Namespace |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|---|
OBS |
SYS.OBS |
Delete bucket |
deleteBucket |
Major |
An event is reported when a bucket deletion takes place. |
Once deleted, buckets cannot be restored. Create a new one if needed. CAUTION: If you want to reuse the name of a deleted bucket for a new bucket, wait at least 30 minutes after the bucket is deleted. |
Deleting buckets may affect your services. Before deleting a bucket, make sure that your services do not depend on it. |
Delete bucket policy |
deleteBucketPolicy |
Major |
An event is reported when a bucket policy deletion takes place. |
|
After a bucket policy is deleted, some users may fail to access the associated bucket and the objects in it. |
||
Set bucket ACL |
setBucketAcl |
Minor |
An event is reported when a bucket ACL configuration takes place. |
If you do not want an account to access a bucket or the objects in it, you can delete the bucket ACL. |
A bucket ACL grants an account access to the bucket and the objects in it. |
||
Set bucket policy |
setBucketPolicy |
Minor |
An event is reported when a bucket policy configuration takes place. |
If you do not need a bucket policy to perform fine-grained access control over a bucket and the objects in it, you can delete the bucket policy. |
A bucket policy grants an account some operation permissions for the bucket or the objects in it under certain conditions. |
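The deleteBucket row above states that a deleted bucket's name can be reused only after at least 30 minutes. A small sketch of that check follows. The function names are hypothetical helpers, not OBS API calls, and the deletion time is assumed to come from the event timestamp.

```python
from datetime import datetime, timedelta, timezone

# Waiting period before a deleted bucket's name can be reused (from the deleteBucket row).
BUCKET_NAME_REUSE_WAIT = timedelta(minutes=30)


def earliest_reuse_time(deleted_at: datetime) -> datetime:
    """Return the earliest time at which the bucket name can be used again."""
    return deleted_at + BUCKET_NAME_REUSE_WAIT


def can_reuse_name(deleted_at: datetime, now: datetime) -> bool:
    """Return True once the waiting period since deletion has fully elapsed."""
    return now >= earliest_reuse_time(deleted_at)


# Example: a bucket deleted at 12:00 UTC can have its name reused from 12:30 UTC.
deleted = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(earliest_reuse_time(deleted).isoformat())
```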
Event Source |
Event Name |
Event ID |
Event Severity |
Description |
Solution |
Impact |
|---|---|---|---|---|---|---|
EIP |
EIP bandwidth overflow |
EIPBandwidthOverflow |
Major |
The used bandwidth exceeded the purchased one, which may slow down the network or cause packet loss. The value of this event is the maximum value in a monitoring period, and the value of the EIP inbound and outbound bandwidth is the value at a specific time point in the period. The metrics are described as follows: egressDropBandwidth: dropped outbound packets (bytes); egressAcceptBandwidth: accepted outbound packets (bytes); egressMaxBandwidthPerSec: peak outbound bandwidth (byte/s); ingressAcceptBandwidth: accepted inbound packets (bytes); ingressMaxBandwidthPerSec: peak inbound bandwidth (byte/s); ingressDropBandwidth: dropped inbound packets (bytes). |
Check whether the EIP bandwidth keeps increasing and whether services are normal. Increase bandwidth if necessary. |
The network becomes slow or packets are lost. |
Delete EIP |
deleteEip |
Minor |
The EIP was released. |
Check whether the EIP was released by mistake. |
The server that has the EIP bound cannot access the Internet. |
|
EIP blocked |
blockEIP |
Critical |
The used bandwidth of an EIP exceeded 5 Gbit/s, so the EIP was blocked and packets were discarded. Such an event may be caused by DDoS attacks. |
Replace the EIP to prevent services from being affected. Locate and deal with the fault. |
Services are impacted. |
|
EIP unblocked |
unblockEIP |
Critical |
The EIP was unblocked. |
Use the previous EIP again. |
None |
|
Start DDoS traffic scrubbing |
ddosCleanEIP |
Major |
Traffic scrubbing on the EIP was started to prevent DDoS attacks. |
Check whether the EIP was attacked. |
Services may be interrupted. |
|
Stop DDoS traffic scrubbing |
ddosEndCleanEip |
Major |
Traffic scrubbing on the EIP to prevent DDoS attacks ended. |
Check whether the EIP was attacked. |
Services may be interrupted. |
|
Enterprise-class QoS bandwidth limit exceeded |
EIPBandwidthRuleOverflow |
Major |
The used QoS bandwidth exceeded the allocated one, which may slow down the network or cause packet loss. The value of this event is the maximum value in a monitoring period, and the value of the EIP inbound and outbound bandwidth is the value at a specific time point in the period. The metrics are described as follows: egressDropBandwidth: dropped outbound packets (bytes); egressAcceptBandwidth: accepted outbound packets (bytes); egressMaxBandwidthPerSec: peak outbound bandwidth (byte/s); ingressAcceptBandwidth: accepted inbound packets (bytes); ingressMaxBandwidthPerSec: peak inbound bandwidth (byte/s); ingressDropBandwidth: dropped inbound packets (bytes). |
Check whether the EIP bandwidth keeps increasing and whether services are normal. Increase bandwidth if necessary. |
The network becomes slow or packets are lost. |
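To help decide whether a bandwidth increase is necessary after an EIPBandwidthOverflow or EIPBandwidthRuleOverflow event, the six metrics listed above can be reduced to a drop ratio and a peak rate per direction. The sketch below assumes the metrics arrive as a plain dictionary keyed by the metric names. That payload shape and the `summarize_overflow` helper are illustrative assumptions, not a documented Cloud Eye format. Peak rates are converted from byte/s to Mbit/s using 1 Mbit = 1,000,000 bits.

```python
def summarize_overflow(detail: dict) -> dict:
    """Summarize dropped-traffic ratio and peak rate for each direction.

    `detail` is assumed to map the metric names from the table
    (for example "egressDropBandwidth") to numeric values.
    """
    summary = {}
    for direction in ("egress", "ingress"):
        dropped = detail.get(f"{direction}DropBandwidth", 0)      # bytes
        accepted = detail.get(f"{direction}AcceptBandwidth", 0)   # bytes
        peak = detail.get(f"{direction}MaxBandwidthPerSec", 0)    # byte/s
        total = dropped + accepted
        summary[direction] = {
            # Share of traffic that was dropped in the monitoring period.
            "drop_ratio": dropped / total if total else 0.0,
            # byte/s -> Mbit/s (multiply by 8 bits, divide by 10^6).
            "peak_mbit_per_sec": peak * 8 / 1_000_000,
        }
    return summary


# Example with made-up numbers: 250 of 1000 outbound bytes were dropped.
sample = {
    "egressDropBandwidth": 250,
    "egressAcceptBandwidth": 750,
    "egressMaxBandwidthPerSec": 1_250_000,
}
print(summarize_overflow(sample))
```

A drop ratio that stays high across consecutive monitoring periods suggests that the purchased or allocated bandwidth is too small for the workload.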