CCE Node Problem Detector

Introduction

The CCE Node Problem Detector add-on (formerly NPD) monitors abnormal events on cluster nodes and can connect to a third-party monitoring platform. It runs as a daemon on each node, collects node problems from different daemons, and reports them to the API server. It can run as a DaemonSet or a daemon.

The CCE Node Problem Detector add-on is developed based on the open-source project node-problem-detector. For details, see node-problem-detector.

Notes and Constraints

Permissions

To monitor kernel logs, the NPD add-on needs to read the host /dev/kmsg. Therefore, the privileged mode must be enabled. For details, see privileged.

In addition, CCE mitigates risks according to the least privilege principle. NPD runs with only the following privileges:

Installing the Add-on

  1. Log in to the CCE console and click the cluster name to access the cluster console.
  2. In the navigation pane, choose Add-ons. In the right pane, find the CCE Node Problem Detector add-on and click Install.
  3. On the Install Add-on page, configure the specifications as needed.

    You can adjust the number of add-on pods and resource quotas as required. High availability is not possible with a single pod: if the node where that pod runs becomes faulty, the add-on fails.

  4. Configure the add-on parameters.

    Maximum Number of Isolated Nodes in a Fault: specifies the maximum number of nodes that can be isolated when the same fault occurs on multiple nodes, which prevents avalanches. You can configure this parameter as either a percentage or a quantity.

  5. Configure deployment policies for the add-on pods.

    • Scheduling policies do not take effect on add-on pods of the DaemonSet type.
    • When configuring multi-AZ deployment or node affinity, ensure that there are nodes meeting the scheduling policy and that resources are sufficient in the cluster. Otherwise, the add-on cannot run.
    Table 1 Configurations for add-on scheduling

    Parameter

    Description

    Multi-AZ Deployment

    • Preferred: Deployment pods of the add-on will be preferentially scheduled to nodes in different AZs. If all the nodes in the cluster are deployed in the same AZ, the pods will be scheduled to different nodes in that AZ.
    • Equivalent mode: Deployment pods of the add-on are evenly scheduled to the nodes in the cluster in each AZ. If a new AZ is added, you are advised to increase add-on pods for cross-AZ HA deployment. With the Equivalent multi-AZ deployment, the difference between the number of add-on pods in different AZs will be less than or equal to 1. If resources in one of the AZs are insufficient, pods cannot be scheduled to that AZ.
    • Forcible: Deployment pods of the add-on are forcibly scheduled to nodes in different AZs. There can be at most one pod in each AZ. If nodes in a cluster are not in different AZs, some add-on pods cannot run properly. If a node is faulty, add-on pods on it may fail to be migrated.

    Node Affinity

    • Not configured: Node affinity is disabled for the add-on.
    • Specify node: Specify the nodes where the add-on is deployed. If you do not specify the nodes, the add-on will be randomly scheduled based on the default cluster scheduling policy.
    • Specify node pool: Specify the node pool where the add-on is deployed. If you do not specify the node pools, the add-on will be randomly scheduled based on the default cluster scheduling policy.
    • Customize affinity: Enter the labels of the nodes where the add-on is to be deployed for more flexible scheduling policies. If you do not specify node labels, the add-on will be randomly scheduled based on the default cluster scheduling policy.

      If multiple custom affinity policies are configured, ensure that there are nodes that meet all the affinity policies in the cluster. Otherwise, the add-on cannot run.

    Toleration

    Taints and tolerations work together to allow (but not force) the add-on Deployment to be scheduled to nodes with matching taints. They also control how the Deployment is evicted after the node where it runs is tainted.

    By default, the add-on adds a tolerance policy for each of the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints. The tolerance time window is 60s.

    For details, see Configuring Tolerance Policies.

  6. Click Install.
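
The default tolerance policy described in Table 1 can be pictured as two toleration entries. The following is a minimal sketch, not the add-on's actual manifest: the two taint keys and the 60s window come from the text above, while the operator and effect values are assumptions based on standard Kubernetes toleration fields, and default_tolerations is a hypothetical helper name.

```python
import json


def default_tolerations(seconds: int = 60) -> list:
    """Build toleration entries for the two taints named in Table 1.

    The operator and effect values are assumptions based on standard
    Kubernetes toleration fields, not taken from the add-on manifest.
    """
    keys = ("node.kubernetes.io/not-ready", "node.kubernetes.io/unreachable")
    return [
        {
            "key": key,
            "operator": "Exists",
            "effect": "NoExecute",
            "tolerationSeconds": seconds,
        }
        for key in keys
    ]


if __name__ == "__main__":
    # Print the entries as JSON for comparison with a pod spec.
    print(json.dumps(default_tolerations(), indent=2))
```

With entries like these, a pod stays bound to a not-ready or unreachable node for the tolerance window and is evicted afterward.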

Components

Table 2 Add-on components

| Component | Description | Resource Type |
|---|---|---|
| node-problem-controller | Isolates faults based on fault detection results. | Deployment |
| node-problem-detector | Detects node faults. | DaemonSet |

NPD Check Items

Check items are supported only in 1.16.0 and later versions.

Check items cover events and statuses.

Node-problem-controller Fault Isolation

Fault isolation is supported only by add-ons of 1.16.0 and later versions.

By default, if multiple nodes become faulty, the node-problem-controller (NPC) adds taints to at most 10% of the nodes. You can set npc.maxTaintedNode to raise this threshold.

The open-source NPD plugin provides fault detection but not fault isolation. CCE adds the node-problem-controller (NPC) on top of the open-source NPD. NPC is implemented based on the Kubernetes node controller. For faults reported by NPD, NPC automatically adds taints to the affected nodes to isolate them.

Table 9 Parameters

| Parameter | Description | Default Value |
|---|---|---|
| npc.enable | Whether to enable NPC. This parameter is not supported in 1.18.0 or later versions. | true |
| npc.maxTaintedNode | The maximum number of nodes that NPC can add taints to when the same fault occurs on multiple nodes, which limits the impact. The value can be an integer or a percentage. Value range: an integer from 1 to infinity, or a percentage from 1% to 100%. The percentage multiplied by the number of cluster nodes has a minimum value of 1. | 10% |
| npc.nodeAffinity | Node affinity of the controller | N/A |
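
The two accepted formats of npc.maxTaintedNode can be illustrated with a short calculation. This is a sketch under stated assumptions, not NPC's actual implementation: the rounding direction for percentages (rounded down here) is not documented above, and max_tainted_nodes is a hypothetical helper name. The floor of 1 follows the value range in Table 9.

```python
def max_tainted_nodes(setting: str, total_nodes: int) -> int:
    """Resolve an npc.maxTaintedNode-style value to a node count.

    Assumption: a percentage is rounded down. The result is never
    below 1, matching the documented minimum.
    """
    value = setting.strip()
    if value.endswith("%"):
        percent = int(value[:-1])
        if not 1 <= percent <= 100:
            raise ValueError("percentage must be between 1% and 100%")
        # Integer arithmetic avoids floating-point rounding surprises.
        return max(1, total_nodes * percent // 100)
    count = int(value)
    if count < 1:
        raise ValueError("integer value must be at least 1")
    return count
```

For example, with the default of 10%, a 50-node cluster resolves to 5 nodes, and a 5-node cluster resolves to the minimum of 1.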

Viewing NPD Events

Events reported by the NPD add-on can be queried on the Nodes page.

  1. Log in to the CCE console and click the cluster name to access the cluster console.
  2. In the navigation pane, choose Nodes. In the right pane, click the Nodes tab, locate the row containing the target node, and click View Events in the Operation column.

Collecting Prometheus Metrics

The NPD daemon pod exposes Prometheus metrics on port 19901. By default, the NPD pod carries the annotation metrics.alpha.kubernetes.io/custom-endpoints: '[{"api":"prometheus","path":"/metrics","port":"19901","names":""}]'. You can build a Prometheus collector that identifies NPD pods by this annotation and obtains NPD metrics from http://{{NpdPodIP}}:{{NpdPodPort}}/metrics.

If the NPD add-on version is earlier than 1.16.5, the exposed port of Prometheus metrics is 20257.
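
A collector could derive the scrape URL from the annotation and fall back to a version-dependent port. The sketch below assumes the annotation value is the JSON array shown above and that add-on versions are dotted integers. The function names are hypothetical and not part of the add-on.

```python
import json

ANNOTATION_KEY = "metrics.alpha.kubernetes.io/custom-endpoints"


def default_metrics_port(addon_version: str) -> int:
    """Return 20257 for add-on versions earlier than 1.16.5, else 19901."""
    parts = tuple(int(p) for p in addon_version.split("."))
    return 20257 if parts < (1, 16, 5) else 19901


def metrics_urls(pod_ip: str, annotations: dict) -> list:
    """Build scrape URLs from the custom-endpoints annotation of a pod."""
    endpoints = json.loads(annotations.get(ANNOTATION_KEY, "[]"))
    return [
        f"http://{pod_ip}:{ep['port']}{ep['path']}"
        for ep in endpoints
        if ep.get("api") == "prometheus"
    ]
```

Reading the port from the annotation rather than hard-coding it keeps the collector working across add-on versions that expose different ports.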

The metric data includes problem_counter and problem_gauge, as shown below.

# HELP problem_counter Number of times a specific type of problem has occurred.
# TYPE problem_counter counter
problem_counter{reason="DockerHung"} 0
problem_counter{reason="DockerStart"} 0
problem_counter{reason="EmptyDirVolumeGroupStatusError"} 0
...
# HELP problem_gauge Whether a specific type of problem is affecting the node or not.
# TYPE problem_gauge gauge
problem_gauge{reason="CNIIsDown",type="CNIProblem"} 0
problem_gauge{reason="CNIIsUp",type="CNIProblem"} 0
problem_gauge{reason="CRIIsDown",type="CRIProblem"} 0
problem_gauge{reason="CRIIsUp",type="CRIProblem"} 0
...
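
Output in this format can be turned into a list of active problems with a small parser. The following is a minimal sketch that handles only labeled samples like those shown above. It does not cover the full Prometheus text format (for example, escaped quotes or samples without labels). Treating a non-zero problem_gauge value as an active problem is an assumption drawn from the HELP text, and the function names are hypothetical.

```python
import re

# A sample line looks like: problem_gauge{reason="CNIIsDown",type="CNIProblem"} 0
SAMPLE_RE = re.compile(r'^(\w+)\{([^}]*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str) -> list:
    """Return (name, labels, value) tuples, skipping comments and elisions."""
    samples = []
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None:
            continue  # "# HELP", "# TYPE", "..." and blank lines land here
        name, raw_labels, value = match.groups()
        samples.append((name, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples


def active_problems(text: str) -> list:
    """List the reasons whose problem_gauge value is non-zero."""
    return [
        labels.get("reason", "")
        for name, labels, value in parse_metrics(text)
        if name == "problem_gauge" and value != 0
    ]
```

Applied to the sample above, active_problems returns an empty list because every problem_gauge value is 0.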

Release History

Table 10 CCE Node Problem Detector add-on

| Add-on Version | Supported Cluster Version | New Feature | Community Version |
|---|---|---|---|
| 1.19.33 | v1.27, v1.28, v1.29, v1.30, v1.31, v1.32, v1.33 | Fixed some issues. | 0.8.10 |
| 1.19.25 | v1.25, v1.27, v1.28, v1.29, v1.30, v1.31, v1.32 | CCE clusters v1.32 are supported. | 0.8.10 |
| 1.19.20 | v1.25, v1.27, v1.28, v1.29, v1.30, v1.31 | Fixed some issues. | 0.8.10 |
| 1.19.11 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29, v1.30 | Fixed some issues. | 0.8.10 |
| 1.19.1 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29 | Fixed some issues. | 0.8.10 |
| 1.19.0 | v1.21, v1.23, v1.25, v1.27, v1.28 | Fixed some issues. | 0.8.10 |
| 1.18.48 | v1.21, v1.23, v1.25, v1.27, v1.28 | Fixed some issues. | 0.8.10 |
| 1.18.46 | v1.21, v1.23, v1.25, v1.27, v1.28 | CCE clusters v1.28 are supported. | 0.8.10 |
| 1.18.22 | v1.19, v1.21, v1.23, v1.25, v1.27 | None | 0.8.10 |
| 1.17.4 | v1.17, v1.19, v1.21, v1.23, v1.25 | Optimized DiskHung check item. | 0.8.10 |
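
The release history can also be used as a lookup table when picking an add-on version for a given cluster. This sketch copies the version data from Table 10. The helper name newest_addon_for is hypothetical, and the console may offer versions not listed here.

```python
# Add-on version -> supported cluster versions, copied from Table 10.
SUPPORTED = {
    "1.19.33": ["v1.27", "v1.28", "v1.29", "v1.30", "v1.31", "v1.32", "v1.33"],
    "1.19.25": ["v1.25", "v1.27", "v1.28", "v1.29", "v1.30", "v1.31", "v1.32"],
    "1.19.20": ["v1.25", "v1.27", "v1.28", "v1.29", "v1.30", "v1.31"],
    "1.19.11": ["v1.21", "v1.23", "v1.25", "v1.27", "v1.28", "v1.29", "v1.30"],
    "1.19.1": ["v1.21", "v1.23", "v1.25", "v1.27", "v1.28", "v1.29"],
    "1.19.0": ["v1.21", "v1.23", "v1.25", "v1.27", "v1.28"],
    "1.18.48": ["v1.21", "v1.23", "v1.25", "v1.27", "v1.28"],
    "1.18.46": ["v1.21", "v1.23", "v1.25", "v1.27", "v1.28"],
    "1.18.22": ["v1.19", "v1.21", "v1.23", "v1.25", "v1.27"],
    "1.17.4": ["v1.17", "v1.19", "v1.21", "v1.23", "v1.25"],
}


def _version_key(version: str) -> tuple:
    """Turn '1.19.11' into (1, 19, 11) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def newest_addon_for(cluster_version: str):
    """Return the newest listed add-on version for a cluster, or None."""
    candidates = [
        addon for addon, clusters in SUPPORTED.items()
        if cluster_version in clusters
    ]
    return max(candidates, key=_version_key) if candidates else None
```

Comparing versions as integer tuples matters here: a plain string comparison would rank 1.19.1 above 1.19.11.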