The CCE Node Problem Detector add-on (formerly NPD) monitors abnormal events on cluster nodes and can be connected to a third-party monitoring platform. It runs as a daemon on each node, collects node problems from various daemons, and reports them to the API server. It can run either as a DaemonSet or as a standalone daemon.
The CCE Node Problem Detector add-on is developed based on the open-source project node-problem-detector. For details, see node-problem-detector.
To monitor kernel logs, the NPD add-on needs to read the host /dev/kmsg. Therefore, the privileged mode must be enabled. For details, see privileged.
In addition, CCE mitigates risks according to the principle of least privilege. Only the following privileges are available to NPD at run time:
You can adjust the number of add-on pods and resource quotas as required. High availability is not possible with a single pod. If an error occurs on the node where the add-on instance runs, the add-on will fail.
Maximum Number of Isolated Nodes in a Fault: specifies the maximum number of nodes that can be isolated to prevent avalanches in case of a fault occurring on multiple nodes. You can configure this parameter either by percentage or quantity.
| Parameter | Description |
|---|---|
| Multi-AZ Deployment | |
| Node Affinity | |
| Toleration | Taints and tolerations together allow (but do not force) the add-on Deployment to be scheduled to a node with matching taints, and control how the Deployment is evicted after the node it runs on is tainted. By default, the add-on adds a toleration for each of the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints, with a toleration time window of 60s. For details, see Configuring Tolerance Policies. |
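As an illustration, the default tolerations described above could be written out as follows. This is only a sketch, not the add-on's actual manifest: the field names follow the Kubernetes toleration schema, the two taint keys and the 60-second window come from the text above, and the `Exists` operator and `NoExecute` effect are assumptions. The helper name `default_tolerations` is hypothetical.

```python
import json
from typing import Dict, List, Union

# Values taken from the text above.
TOLERATION_SECONDS = 60
DEFAULT_TAINT_KEYS = (
    "node.kubernetes.io/not-ready",
    "node.kubernetes.io/unreachable",
)


def default_tolerations(
    seconds: int = TOLERATION_SECONDS,
) -> List[Dict[str, Union[str, int]]]:
    """Build one toleration entry per default taint key."""
    return [
        {
            "key": key,
            "operator": "Exists",      # assumption, not stated in this document
            "effect": "NoExecute",     # assumption, not stated in this document
            "tolerationSeconds": seconds,
        }
        for key in DEFAULT_TAINT_KEYS
    ]


if __name__ == "__main__":
    # Print the entries in the shape they would take inside a pod spec.
    print(json.dumps(default_tolerations(), indent=2))
```

With entries like these, a pod would stay bound to a not-ready or unreachable node for the toleration window and be evicted afterwards.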
| Component | Description | Resource Type |
|---|---|---|
| node-problem-controller | Isolates faults based on fault detection results. | Deployment |
| node-problem-detector | Detects node faults. | DaemonSet |
Check items are supported only in add-on versions 1.16.0 and later.
Check items cover events and statuses.
For event-related check items, when a problem occurs, NPD reports an event to the API server. The event type can be Normal (normal event) or Warning (abnormal event).
| Check Item | Function | Description |
|---|---|---|
| OOMKilling | Listen to the kernel logs and report OOM events. Typical scenario: The memory used by a process in a container exceeds the limit, which triggers OOM and terminates the process. | Warning event. Listening object: /dev/kmsg. Matching rule: "Killed process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB.*" |
| TaskHung | Listen to the kernel logs and report taskHung events. Typical scenario: Disk I/O suspension causes process suspension. | Warning event. Listening object: /dev/kmsg. Matching rule: "task \\S+:\\w+ blocked for more than \\w+ seconds\\." |
| ReadonlyFilesystem | Listen to the kernel logs and check whether the kernel reports a "Remount root filesystem read-only" error. Typical scenario: A user detaches a data disk from a node by mistake on the ECS while applications keep writing data to the mount point of that disk. As a result, an I/O error occurs in the kernel and the disk is remounted read-only. NOTE: If the rootfs of node pods is of the device mapper type, detaching a data disk causes an error in the thin pool. This affects NPD, which will then be unable to detect node faults. | Warning event. Listening object: /dev/kmsg. Matching rule: Remounting filesystem read-only |
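The matching rules in the table above can be tried out against sample log lines. The sketch below copies the three rules into Python raw strings, so the doubled backslashes shown in the table become single ones. It assumes the rules are applied as a substring search on each kernel log line. The sample lines and the `classify` helper are made up for illustration.

```python
import re

# Matching rules copied from the table above.
RULES = {
    "OOMKilling": re.compile(
        r"Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.*"
    ),
    "TaskHung": re.compile(r"task \S+:\w+ blocked for more than \w+ seconds\."),
    "ReadonlyFilesystem": re.compile(r"Remounting filesystem read-only"),
}


def classify(line: str) -> list:
    """Return the names of all check items whose rule matches the log line."""
    return [name for name, rule in RULES.items() if rule.search(line)]


# Hypothetical kernel log lines, invented for this example.
SAMPLES = [
    "Killed process 4321 (java) total-vm:204800kB, anon-rss:102400kB, "
    "file-rss:0kB, shmem-rss:0kB",
    "INFO: task jbd2/vdb-8:1234 blocked for more than 120 seconds.",
    "EXT4-fs (vdb): Remounting filesystem read-only",
    "usb 1-1: new high-speed USB device number 2 using xhci_hcd",
]

if __name__ == "__main__":
    for line in SAMPLES:
        print(classify(line), "<-", line)
```

Note that `(.+)` in the OOMKilling rule is a capture group rather than a pair of literal parentheses, so it captures the whole `(java)` token, parentheses included.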
For status-related check items, when a problem occurs, NPD reports an event to the API server and changes the node status synchronously. This function can be used together with Node-problem-controller fault isolation to isolate nodes.
If the check period is not specified in the following check items, the default period is 30 seconds.
| Check Item | Function | Description |
|---|---|---|
| Conntrack table full (ConntrackFullProblem) | Check whether the conntrack table is full. | |
| Insufficient disk resources (DiskProblem) | Check the usage of the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) on the node. | Currently, additional data disks are not supported. |
| Insufficient file handles (FDProblem) | Check whether file descriptors (FDs) are used up. | |
| Insufficient node memory (MemoryProblem) | Check whether memory is used up. | |
| Insufficient process resources (PIDProblem) | Check whether PIDs are exhausted. | |
| Check Item | Function | Description |
|---|---|---|
| Abnormal NTP (NTPProblem) | Check whether the node clock synchronization service (ntpd or chronyd) is running properly and whether the system time has drifted. | Default clock offset threshold: 8000 ms |
| Process D error (ProcessD) | Check whether the node has processes in D state. | Default threshold: 10 abnormal processes detected for three consecutive times. Source: |
| Process Z error (ProcessZ) | Check whether the node has processes in Z state. | |
| ResolvConf error (ResolvConfFileProblem) | Check whether the resolv.conf file is missing or abnormal. Abnormal means that the file contains no upstream domain name resolution server (nameserver). | Object: /etc/resolv.conf |
| Existing scheduled event (ScheduledEvent) | Check whether scheduled live migration events exist on the node. A live migration plan event is usually triggered by a hardware fault and is an automatic fault rectification method at the IaaS layer. Typical scenario: The host is faulty, for example, a fan is damaged or a disk has bad sectors. As a result, live migration is triggered for VMs. | Source: This check item is an Alpha feature and is disabled by default. |
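The ProcessD default threshold in the table above (10 abnormal processes detected three consecutive times) reads as a debounce rule: a single spike does not raise the problem, only a sustained one does. The sketch below shows one way to express that rule. It is not NPD's implementation. The class and function names are hypothetical, and treating "10 abnormal processes" as "10 or more" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveThreshold:
    """Report a problem only after `required` consecutive samples reach `limit`."""

    limit: int = 10     # abnormal processes per sample (default from the table)
    required: int = 3   # consecutive samples needed (default from the table)
    streak: int = 0     # current run of samples at or above the limit

    def observe(self, abnormal_count: int) -> bool:
        """Feed one sample; return True while the problem condition holds."""
        if abnormal_count >= self.limit:   # assumption: the limit is inclusive
            self.streak += 1
        else:
            self.streak = 0                # any healthy sample resets the run
        return self.streak >= self.required


def count_d_state(states) -> int:
    """Count D-state entries in an iterable of process state strings (e.g. "D", "S", "D+")."""
    return sum(1 for state in states if state.startswith("D"))


if __name__ == "__main__":
    check = ConsecutiveThreshold()
    for sample in [12, 11, 3, 10, 15, 10]:
        print(sample, check.observe(sample))
```

In this run only the last sample reports a problem, because the healthy third sample resets the streak.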
The kubelet component has the following default check items, which have bugs or defects. You can fix them by upgrading the cluster or using NPD.
| Check Item | Function | Description |
|---|---|---|
| Insufficient PID resources (PIDPressure) | Check whether PIDs are sufficient. | |
| Insufficient memory (MemoryPressure) | Check whether the allocable memory for the containers is sufficient. | |
| Insufficient disk resources (DiskPressure) | Check the disk usage and inodes usage of the kubelet and Docker disks. | |
Fault isolation is supported only by add-ons of 1.16.0 and later versions.
By default, if multiple nodes become faulty, NPC adds taints to up to 10% of the nodes. You can set npc.maxTaintedNode to increase the threshold.
The open-source NPD plugin provides fault detection but not fault isolation. CCE adds the node-problem-controller (NPC) on top of the open-source NPD. This component is implemented based on the Kubernetes node controller. For faults reported by NPD, NPC automatically adds taints to the affected nodes to isolate them.
| Parameter | Description | Default Value |
|---|---|---|
| npc.enable | Whether to enable NPC. This parameter is not supported in 1.18.0 or later versions. | true |
| npc.maxTaintedNode | The maximum number of nodes that NPC can taint when the same fault occurs on multiple nodes, to minimize the impact. The value can be an integer or a percentage. | 10% Value range: |
| npc.nodeAffinity | Node affinity of the controller | N/A |
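To make the effect of npc.maxTaintedNode concrete, the sketch below resolves an integer-or-percentage value into a node count and caps the set of nodes to taint. It is not NPC's implementation. Both function names are hypothetical, and rounding the percentage down is an assumption, since the rounding rule is not described here.

```python
def max_tainted_nodes(setting, total_nodes: int) -> int:
    """Resolve an int-or-percentage setting (e.g. 3 or "10%") into a node count."""
    if isinstance(setting, str) and setting.endswith("%"):
        percent = int(setting[:-1])
        # Assumption: fractions are rounded down.
        return total_nodes * percent // 100
    return int(setting)


def pick_nodes_to_taint(faulty, already_tainted, setting, total_nodes: int) -> list:
    """Return the faulty nodes that may still be tainted without exceeding the limit."""
    budget = max_tainted_nodes(setting, total_nodes) - len(already_tainted)
    candidates = [node for node in faulty if node not in already_tainted]
    return candidates[: max(budget, 0)]


if __name__ == "__main__":
    nodes_total = 50
    faulty_nodes = [f"node-{i}" for i in range(8)]
    # With the default "10%" and 50 nodes, at most 5 nodes are selected.
    print(pick_nodes_to_taint(faulty_nodes, set(), "10%", nodes_total))
```

Under the round-down assumption, a percentage value resolves to zero on a small cluster (for example, "10%" of 7 nodes), which is one reason to use an absolute number instead.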
Events reported by the NPD add-on can be queried on the Nodes page.
The NPD daemon pod exposes Prometheus metric data on port 19901. By default, the following annotation is added to the NPD pod: metrics.alpha.kubernetes.io/custom-endpoints: '[{"api":"prometheus","path":"/metrics","port":"19901","names":""}]'. You can build a Prometheus collector that discovers NPD pods and scrapes their metrics from http://{{NpdPodIP}}:{{NpdPodPort}}/metrics.
If the NPD add-on version is earlier than 1.16.5, the exposed port of Prometheus metrics is 20257.
The metric data includes problem_counter and problem_gauge, as shown below.
```
# HELP problem_counter Number of times a specific type of problem has occurred.
# TYPE problem_counter counter
problem_counter{reason="DockerHung"} 0
problem_counter{reason="DockerStart"} 0
problem_counter{reason="EmptyDirVolumeGroupStatusError"} 0
...
# HELP problem_gauge Whether a specific type of problem is affecting the node or not.
# TYPE problem_gauge gauge
problem_gauge{reason="CNIIsDown",type="CNIProblem"} 0
problem_gauge{reason="CNIIsUp",type="CNIProblem"} 0
problem_gauge{reason="CRIIsDown",type="CRIProblem"} 0
problem_gauge{reason="CRIIsUp",type="CRIProblem"} 0
...
```
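The following sketch shows how a simple collector might pick the metrics port and read the problem_counter and problem_gauge samples shown above. It is a hand-rolled parser for illustration only; a real setup would normally rely on a Prometheus server or client library. The function names are hypothetical. Treating a gauge value greater than 0 as "problem active" is an assumption based on the HELP text.

```python
import re

NEW_PORT = 19901  # add-on 1.16.5 and later, per the text above
OLD_PORT = 20257  # add-on versions earlier than 1.16.5

SAMPLE_LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def metrics_url(pod_ip: str, addon_version: str) -> str:
    """Build the scrape URL, choosing the port from the add-on version."""
    version = tuple(int(part) for part in addon_version.split("."))
    port = NEW_PORT if version >= (1, 16, 5) else OLD_PORT
    return f"http://{pod_ip}:{port}/metrics"


def parse_problem_metrics(text: str) -> list:
    """Parse labelled samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # skip elisions such as "..." and unlabelled samples
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL.findall(raw_labels)), float(value)))
    return samples


def active_problems(samples) -> list:
    """List the reasons of problem_gauge samples that are currently set."""
    return [
        labels["reason"]
        for name, labels, value in samples
        if name == "problem_gauge" and value > 0  # assumption: >0 means active
    ]


if __name__ == "__main__":
    # Hypothetical scrape result, modelled on the sample output above.
    scraped = (
        'problem_counter{reason="DockerHung"} 0\n'
        'problem_gauge{reason="CNIIsDown",type="CNIProblem"} 1\n'
        'problem_gauge{reason="CNIIsUp",type="CNIProblem"} 0\n'
    )
    print(metrics_url("10.0.0.8", "1.19.33"))
    print(active_problems(parse_problem_metrics(scraped)))
```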
| Add-on Version | Supported Cluster Version | New Feature | Community Version |
|---|---|---|---|
| 1.19.33 | v1.27, v1.28, v1.29, v1.30, v1.31, v1.32, v1.33 | Fixed some issues. | |
| 1.19.25 | v1.25, v1.27, v1.28, v1.29, v1.30, v1.31, v1.32 | CCE clusters v1.32 are supported. | |
| 1.19.20 | v1.25, v1.27, v1.28, v1.29, v1.30, v1.31 | Fixed some issues. | |
| 1.19.11 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29, v1.30 | Fixed some issues. | |
| 1.19.1 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29 | Fixed some issues. | |
| 1.19.0 | v1.21, v1.23, v1.25, v1.27, v1.28 | Fixed some issues. | |
| 1.18.48 | v1.21, v1.23, v1.25, v1.27, v1.28 | Fixed some issues. | |
| 1.18.46 | v1.21, v1.23, v1.25, v1.27, v1.28 | CCE clusters v1.28 are supported. | |
| 1.18.22 | v1.19, v1.21, v1.23, v1.25, v1.27 | None | |
| 1.17.4 | v1.17, v1.19, v1.21, v1.23, v1.25 | Optimized DiskHung check item. |