npd

Introduction

node-problem-detector (npd for short) is an add-on that monitors abnormal events on cluster nodes and can connect to a third-party monitoring platform. It runs as a daemon on each node, collects node issues from different daemons, and reports them to the API server. The npd add-on can run as a DaemonSet or as a daemon.

For more information, see node-problem-detector.

Notes and Constraints

Permission Description

To monitor kernel logs, the npd add-on needs to read the host /dev/kmsg. Therefore, the privileged mode must be enabled. For details, see privileged.
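As an illustration of what reading /dev/kmsg involves, the following minimal sketch parses a single kernel log record. It relies on the record layout documented for the Linux /dev/kmsg interface (a comma-separated header of priority, sequence number, and timestamp in microseconds, then ";" and the message). The names KmsgRecord and parse_kmsg_record are hypothetical, and this is not how npd itself is implemented.

```python
from dataclasses import dataclass


@dataclass
class KmsgRecord:
    level: int         # syslog severity: 0 (emergency) to 7 (debug)
    facility: int      # syslog facility: 0 is the kernel
    sequence: int      # 64-bit record sequence number
    timestamp_us: int  # monotonic timestamp in microseconds
    message: str


def parse_kmsg_record(line: str) -> KmsgRecord:
    """Parse one /dev/kmsg record of the form 'prio,seq,ts,flags;message'."""
    header, _, message = line.rstrip("\n").partition(";")
    fields = header.split(",")
    # The first field packs facility and severity as (facility << 3) | level.
    priority = int(fields[0])
    return KmsgRecord(
        level=priority & 7,
        facility=priority >> 3,
        sequence=int(fields[1]),
        timestamp_us=int(fields[2]),
        message=message,
    )
```

Only the first three header fields are read, so additional comma-separated fields before the ";" are ignored.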

In addition, CCE mitigates risks according to the least privilege principle. Only the following privileges are available to npd at runtime:

Installing the Add-on

  1. Log in to the CCE console and access the cluster console. Choose Add-ons in the navigation pane, locate npd on the right, and click Install.
  2. On the Install Add-on page, select the add-on specifications and set related parameters.

    • Pods: Set the number of pods based on service requirements.
    • Containers: Select a proper container quota based on service requirements.

  3. Set the npd parameters and click Install.

    The parameters are configurable only in 1.16.0 and later versions. For details, see Table 7.

npd Check Items

Check items are supported only in 1.16.0 and later versions.

Check items cover events and statuses.

Node-problem-controller Fault Isolation

Fault isolation is supported only by add-ons of 1.16.0 and later versions.

By default, if multiple nodes become faulty, NPC adds taints to up to 10% of the nodes. You can set npc.maxTaintedNode to increase the threshold.

The open source NPD plug-in provides fault detection but not fault isolation. CCE extends the open source NPD with the node-problem-controller (NPC), which is implemented based on the Kubernetes node controller. For faults reported by NPD, NPC automatically adds taints to the affected nodes to isolate them.

Table 7 Parameters

| Parameter | Description | Default |
|---|---|---|
| npc.enable | Whether to enable NPC. NPC cannot be disabled in 1.18.0 or later versions. | true |
| npc.maxTaintedNode | Maximum number of nodes to which NPC can add taints when the same fault occurs on multiple nodes. This limits the impact of fault isolation. The int format and percentage format are supported. | 10%. Value range: in int format, from 1 to infinity; in percentage format, from 1% to 100%. The minimum value of this parameter multiplied by the number of cluster nodes is 1. |
| npc.affinity | Node affinity of the controller | N/A |
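The two formats accepted by npc.maxTaintedNode can be illustrated with a short sketch that resolves a setting to an absolute node count. The function name max_tainted_nodes is hypothetical, and rounding the percentage result down is an assumption: the documentation only states that the result is never lower than 1. NPC's actual rounding may differ.

```python
def max_tainted_nodes(setting: str, node_count: int) -> int:
    """Resolve an npc.maxTaintedNode value to an absolute number of nodes."""
    value = setting.strip()
    if value.endswith("%"):
        percent = int(value[:-1])
        if not 1 <= percent <= 100:
            raise ValueError("percentage must be between 1% and 100%")
        # Assumption: round down, then apply the documented minimum of 1.
        return max(1, node_count * percent // 100)
    count = int(value)
    if count < 1:
        raise ValueError("integer value must be at least 1")
    return count
```

Under these assumptions, the default of 10% allows 5 tainted nodes in a 50-node cluster, and still allows 1 tainted node in a 5-node cluster.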

Collecting Prometheus Metrics

The NPD daemon pod exposes Prometheus metrics on port 19901. By default, the annotation metrics.alpha.kubernetes.io/custom-endpoints: '[{"api":"prometheus","path":"/metrics","port":"19901","names":""}]' is added to the NPD pod. You can build a Prometheus collector that discovers NPD pods and scrapes their metrics from http://{{NpdPodIP}}:{{NpdPodPort}}/metrics.

If the npd add-on version is earlier than 1.16.5, Prometheus metrics are exposed on port 20257.
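A collector could derive the scrape address from the pod annotation rather than hard-coding the port, which also covers add-on versions that use a different port. The following is a minimal sketch of that idea. The function name scrape_urls and the pod IP in the usage note are hypothetical, and retrieving the annotation from the Kubernetes API is left out.

```python
import json

# Default value of the metrics.alpha.kubernetes.io/custom-endpoints annotation.
DEFAULT_ANNOTATION = (
    '[{"api":"prometheus","path":"/metrics","port":"19901","names":""}]'
)


def scrape_urls(pod_ip: str, annotation_value: str) -> list[str]:
    """Build one metrics URL per Prometheus endpoint listed in the annotation."""
    endpoints = json.loads(annotation_value)
    return [
        f"http://{pod_ip}:{endpoint['port']}{endpoint['path']}"
        for endpoint in endpoints
        if endpoint.get("api") == "prometheus"
    ]
```

For example, scrape_urls("10.0.0.12", DEFAULT_ANNOTATION) yields a single URL that points to port 19901 and the /metrics path on that pod.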

Currently, the metric data includes problem_counter and problem_gauge, as shown below.

# HELP problem_counter Number of times a specific type of problem have occurred.
# TYPE problem_counter counter
problem_counter{reason="DockerHung"} 0
problem_counter{reason="DockerStart"} 0
problem_counter{reason="EmptyDirVolumeGroupStatusError"} 0
...
# HELP problem_gauge Whether a specific type of problem is affecting the node or not.
# TYPE problem_gauge gauge
problem_gauge{reason="CNIIsDown",type="CNIProblem"} 0
problem_gauge{reason="CNIIsUp",type="CNIProblem"} 0
problem_gauge{reason="CRIIsDown",type="CRIProblem"} 0
problem_gauge{reason="CRIIsUp",type="CRIProblem"} 0
...
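Output in this format can be processed with a few lines of code. The sketch below parses labeled samples from the text and lists the reasons whose problem_gauge value is non-zero. It assumes that a non-zero gauge means the problem is currently affecting the node, which matches the HELP text but is not stated explicitly above. The function names are hypothetical, the gauge value of 1 in SAMPLE is made up for illustration, and the regular expressions handle only simple label values like those shown.

```python
import re

# Matches lines such as: problem_gauge{reason="CNIIsDown",type="CNIProblem"} 0
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Return (metric name, labels, value) for every labeled sample line."""
    samples = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything that is not a labeled sample
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def active_problems(samples) -> list[str]:
    """List the reasons of problem_gauge samples with a non-zero value."""
    return [
        labels["reason"]
        for name, labels, value in samples
        if name == "problem_gauge" and value != 0 and "reason" in labels
    ]


# Illustrative input; the value 1 is invented to show a reported problem.
SAMPLE = """\
# TYPE problem_counter counter
problem_counter{reason="DockerHung"} 0
# TYPE problem_gauge gauge
problem_gauge{reason="CNIIsDown",type="CNIProblem"} 1
problem_gauge{reason="CNIIsUp",type="CNIProblem"} 0
"""

print(active_problems(parse_samples(SAMPLE)))
```

A production collector would normally rely on Prometheus itself or an existing exposition-format parser instead of hand-written regular expressions.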