
yangtong c285e88a17 MRS UMN 20250806 version

Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com>
Co-authored-by: yangtong <yangtong2@huawei.com>
Co-committed-by: yangtong <yangtong2@huawei.com>

2025-09-02 10:43:57 +00:00

ALM-19034 Number of RegionServer WAL Write Timeouts Exceeds the Threshold

Alarm Description

The system checks the number of RegionServer WAL write timeouts in each HBase service every 30 seconds. This alarm is generated when the number of WAL write timeouts on a RegionServer instance exceeds the threshold in 10 consecutive checks.

This alarm is cleared when the number of WAL write timeouts on a RegionServer instance is less than or equal to the threshold.

This alarm applies only to MRS 3.3.1 or later.
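The trigger and clear rules described above can be sketched as a small state machine. This is an illustrative model only, not the MRS implementation: the class name `WalTimeoutAlarm` and its fields are hypothetical, and resetting the streak on any sample at or below the threshold is an assumption drawn from the word "consecutive".

```python
from dataclasses import dataclass

CHECK_INTERVAL_SECONDS = 30  # sampling period stated in the alarm description
CONSECUTIVE_CHECKS = 10      # breaches in a row needed to raise the alarm


@dataclass
class WalTimeoutAlarm:
    """Hypothetical model of the raise/clear rules for ALM-19034."""
    threshold: int
    breaches: int = 0    # consecutive samples above the threshold so far
    active: bool = False

    def observe(self, timeouts: int) -> bool:
        """Feed one periodic sample; return whether the alarm is active."""
        if timeouts > self.threshold:
            self.breaches += 1
            if self.breaches >= CONSECUTIVE_CHECKS:
                self.active = True
        else:
            # At or below the threshold: the streak ends and the alarm clears.
            self.breaches = 0
            self.active = False
        return self.active
```

Under this model, nine samples above the threshold leave the alarm inactive, the tenth raises it, and the first sample at or below the threshold clears it.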

Alarm Attributes

Alarm ID	Alarm Severity	Auto Cleared
19034	Critical (default threshold: 500); Major (default threshold: 300)	Yes
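As a rough illustration of how the two default thresholds map a timeout count to a severity, consider the sketch below. The function name `severity` is hypothetical, and the strict "greater than" comparison is an assumption based on the wording "exceeds the threshold"; thresholds changed on a real cluster would replace the defaults shown here.

```python
from typing import Optional

CRITICAL_THRESHOLD = 500  # default from the Alarm Attributes table
MAJOR_THRESHOLD = 300     # default from the Alarm Attributes table


def severity(timeouts: int) -> Optional[str]:
    """Return the severity for a timeout count, or None if below both defaults."""
    if timeouts > CRITICAL_THRESHOLD:
        return "Critical"
    if timeouts > MAJOR_THRESHOLD:
        return "Major"
    return None
```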

Alarm Parameters

Type	Parameter	Description
Location Information	Source	Specifies the cluster for which the alarm was generated.
	ServiceName	Specifies the service for which the alarm was generated.
	RoleName	Specifies the role for which the alarm was generated.
	HostName	Specifies the host for which the alarm was generated.
Additional Information	Threshold	Specifies the threshold for generating the alarm.

Impact on the System

Write latency increases. A large number of WAL write timeouts can severely degrade data write performance.

Possible Causes

A slow disk fault occurred.
RegionServer GC is abnormal.
HBase is overloaded.
The WAL configuration is improper.

Handling Procedure

1. Log in to MRS Manager and choose O&M. In the navigation pane on the left, choose Alarm > Alarms. On the page that is displayed, locate the row containing the alarm whose Alarm ID is 19034, and view the service instance and host name in Location.

Check whether a slow disk fault occurred.

2. In the alarm list on MRS Manager, check whether "ALM-12033 Slow Disk Fault" or "ALM-12063 Disk Unavailable" is displayed for the instance you identified in 1.
   - If yes, go to 3.
   - If no, go to 5.
3. Rectify the fault by following the handling procedure of "ALM-12033 Slow Disk Fault" or "ALM-12063 Disk Unavailable".
4. Wait several minutes and check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 5.

Check whether RegionServer GC is abnormal.

5. In the alarm list on MRS Manager, check whether "ALM-19007 HBase GC Duration Exceeds the Threshold" is displayed.
   - If yes, go to 6.
   - If no, go to 8.
6. Rectify the fault by following the handling procedure of "ALM-19007 HBase GC Duration Exceeds the Threshold".
7. Wait several minutes and check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 8.

Check the HBase load.

8. In the alarm list on MRS Manager, check whether "ALM-19018 HBase Compaction Queue Size Exceeds the Threshold" is displayed.
   - If yes, go to 9.
   - If no, go to 11.
9. Rectify the fault by following the handling procedure of "ALM-19018 HBase Compaction Queue Size Exceeds the Threshold".
10. Wait several minutes and check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 11.
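The three related-alarm checks above follow a fixed order: disk alarms first, then GC, then compaction load. A minimal sketch of that ordering is shown below. The function `next_related_alarm` is hypothetical, and it assumes the caller has already collected the IDs of the alarms currently active for the affected instance (for example, by reading them from the alarm list).

```python
from typing import Iterable, Optional

# Related alarms in the order the handling procedure checks them.
RELATED_ALARMS = [
    ("12033", "ALM-12033 Slow Disk Fault"),
    ("12063", "ALM-12063 Disk Unavailable"),
    ("19007", "ALM-19007 HBase GC Duration Exceeds the Threshold"),
    ("19018", "ALM-19018 HBase Compaction Queue Size Exceeds the Threshold"),
]


def next_related_alarm(active_ids: Iterable[str]) -> Optional[str]:
    """Return the first related alarm to handle, or None if none is active."""
    active = set(active_ids)
    for alarm_id, title in RELATED_ALARMS:
        if alarm_id in active:
            return title
    return None  # No related alarm: continue with the WAL configuration check.
```

If ALM-19034 persists after the returned alarm is handled, the procedure moves on to the next check rather than stopping, so the function would be called again with the updated set of active alarms.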

Check the WAL configuration.

11. On MRS Manager, choose Cluster > Service > HBase, click Configurations > All Configurations, and check whether hbase.wal.hsync and hbase.hfile.hsync are set to true.
   - If yes, go to 12.
   - If no, go to 14.
12. Set both hbase.wal.hsync and hbase.hfile.hsync to false and click Save. Then click Dashboard and choose More > Restart Service to restart the HBase service.

   NOTE: While the HBase service restarts, it is unavailable: data cannot be read or written, table operations cannot be performed, and the HBase web UI is inaccessible.

13. Wait several minutes and check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 14.
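The procedure performs the WAL configuration check in the MRS Manager UI. If the HBase configuration is also available as a Hadoop-style XML file (an assumption; the source does not say where such a file lives), the same check can be sketched offline. The helper names `hsync_settings` and `needs_change` are hypothetical, and treating either key being true as a reason to change the configuration is an interpretation of the step above.

```python
import xml.etree.ElementTree as ET

HSYNC_KEYS = ("hbase.wal.hsync", "hbase.hfile.hsync")


def hsync_settings(hbase_site_xml: str) -> dict:
    """Return {key: bool} for each hsync key present in the XML text."""
    root = ET.fromstring(hbase_site_xml)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip().lower()
        if name in HSYNC_KEYS:
            found[name] = value == "true"
    return found


def needs_change(settings: dict) -> bool:
    """True if any reported hsync key is enabled (interpretation, see lead-in)."""
    return any(settings.values())


# Illustrative input only; not taken from a real cluster.
SAMPLE = """<configuration>
  <property><name>hbase.wal.hsync</name><value>true</value></property>
  <property><name>hbase.hfile.hsync</name><value>false</value></property>
</configuration>"""
```

Keys that are absent from the XML are simply not reported; the sketch makes no claim about their effective default values.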

Collect fault information.

14. On MRS Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
15. Expand the Service drop-down list, and select HBase for the target cluster.
16. Click the edit icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
17. Contact O&M personnel and provide the collected logs.
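The log collection window is the alarm generation time plus and minus 10 minutes. A small helper such as the one below can compute the Start Date and End Date values to enter; the function name `log_window` is hypothetical, and the timestamp in the usage line is made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Tuple

WINDOW = timedelta(minutes=10)  # margin stated in the log collection step


def log_window(alarm_time: datetime) -> Tuple[datetime, datetime]:
    """Return (start, end) covering 10 minutes before and after the alarm."""
    return alarm_time - WINDOW, alarm_time + WINDOW


# Example with an invented alarm generation time.
start, end = log_window(datetime(2025, 1, 1, 12, 0, 0))
print(start.isoformat(sep=" "), "to", end.isoformat(sep=" "))
```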

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None.

Parent topic: Alarm Reference (Applicable to MRS 3.x)
