Volcano is a batch processing platform based on Kubernetes. It provides a series of features required by machine learning, deep learning, bioinformatics, genomics, and other big data applications, as a powerful supplement to Kubernetes capabilities.
Volcano provides general computing capabilities such as high-performance job scheduling, heterogeneous chip management, and job running management. It integrates with computing frameworks used in industries such as AI, big data, genomics, and rendering, and schedules up to 1000 pods per second for end users, greatly improving scheduling efficiency and resource utilization.
Volcano provides job scheduling, job management, and queue management for computing applications.
Volcano is open source and available on GitHub at https://github.com/volcano-sh/volcano.
Install and configure the Volcano add-on in CCE clusters. For details, see Volcano Scheduling.
When using Volcano as a scheduler, use it to schedule all workloads in the cluster. This prevents resource scheduling conflicts caused by multiple schedulers working at the same time.
If the Volcano Scheduler add-on is upgraded from 1.4.7 or earlier to a version later than 1.4.7, the webhooks.admissionReviewVersions field in the new version may be incompatible with that in the old version. As a result, VolcanoJob (vcjob) resources cannot be created.
The resource quotas of the volcano-admission component are related to the cluster scale. For details, see Table 1. The resource quotas of volcano-controller and volcano-scheduler are related to the number of cluster nodes and pods. The recommended values are as follows:
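Recommended method for calculating the requested values:
Requested vCPUs: Multiply the number of target nodes by the number of target pods, compare the result with Number of cluster nodes × Number of pods in the following table, and round up to the closest specification. For example, for 2000 nodes and 20,000 pods, Number of target nodes × Number of target pods = 40 million, which rounds up to the specification of 700/70,000 (Number of cluster nodes × Number of pods = 49 million). According to the following table, set the requested vCPUs to 4000m and the limit value to 5500m.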
Requested memory = Number of target nodes/1000 × 2.4 GiB + Number of target pods/10,000 × 1 GiB
For example, for 2000 nodes and 20,000 pods, the requested memory is 6.8 GiB (2000/1000 × 2.4 GiB + 20,000/10,000 × 1 GiB).
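The two calculations above can be sketched in Python. This is only an illustrative sketch, not part of the add-on: the names `SPECS`, `recommended_cpu`, and `requested_memory_gib` are hypothetical, the `SPECS` rows are copied from the volcano-controller/volcano-scheduler table in this section, and rounding up to the next listed specification is an assumption inferred from the 2000-node example.

```python
# (nodes, pods, cpu_request_m, cpu_limit_m), copied from the
# volcano-controller / volcano-scheduler table in this section.
SPECS = [
    (50, 5000, 500, 2000),
    (100, 10000, 1000, 2500),
    (200, 20000, 1500, 3000),
    (300, 30000, 2000, 3500),
    (400, 40000, 2500, 4000),
    (500, 50000, 3000, 4500),
    (600, 60000, 3500, 5000),
    (700, 70000, 4000, 5500),
]


def recommended_cpu(target_nodes, target_pods):
    """Return (request_m, limit_m) of the first spec whose nodes x pods
    product is at least the target product (assumed round-up rule)."""
    target = target_nodes * target_pods
    for nodes, pods, request_m, limit_m in SPECS:
        if nodes * pods >= target:
            return request_m, limit_m
    # Assumption: beyond the largest listed spec, reuse the last row.
    return SPECS[-1][2], SPECS[-1][3]


def requested_memory_gib(target_nodes, target_pods):
    """Requested memory = nodes/1000 x 2.4 GiB + pods/10,000 x 1 GiB."""
    return target_nodes / 1000 * 2.4 + target_pods / 10000 * 1.0


cpu_request, cpu_limit = recommended_cpu(2000, 20000)
memory_gib = requested_memory_gib(2000, 20000)
print(cpu_request, cpu_limit, round(memory_gib, 1))
```

For the 2000-node, 20,000-pod example, the sketch selects the 700/70000 row for CPU and evaluates the memory formula for the same inputs.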
| Cluster Scale | CPU Request (m) | CPU Limit (m) | Memory Request (MiB) | Memory Limit (MiB) |
|---|---|---|---|---|
| 50 nodes | 200 | 500 | 500 | 500 |
| 200 nodes | 500 | 1000 | 1000 | 2000 |
| 1000 or more nodes | 1500 | 2500 | 3000 | 4000 |
| Nodes/Pods in a Cluster | CPU Request (m) | CPU Limit (m) | Memory Request (MiB) | Memory Limit (MiB) |
|---|---|---|---|---|
| 50/5000 | 500 | 2000 | 500 | 2000 |
| 100/10000 | 1000 | 2500 | 1500 | 2500 |
| 200/20000 | 1500 | 3000 | 2500 | 3500 |
| 300/30000 | 2000 | 3500 | 3500 | 4500 |
| 400/40000 | 2500 | 4000 | 4500 | 5500 |
| 500/50000 | 3000 | 4500 | 5500 | 6500 |
| 600/60000 | 3500 | 5000 | 6500 | 7500 |
| 700/70000 | 4000 | 5500 | 7500 | 8500 |
| Parameter | Description |
|---|---|
| Multi-AZ Deployment | |
| Node Affinity | |
| Toleration | Using both taints and tolerations allows (but does not force) the add-on Deployment to be scheduled to nodes with matching taints, and controls the Deployment eviction policy after the node where the Deployment is located is tainted. The add-on adds default toleration policies for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints. The toleration time window is 60s. For details, see Configuring Tolerance Policies. |
admission_kube_api_qps: 200
admissions: /jobs/mutate,/jobs/validate,/podgroups/mutate,/pods/validate,/pods/mutate,/queues/mutate,/queues/validate,/eas/pods/mutate,/eas/pods/validate,/npu/jobs/validate,/resource/validate,/resource/mutate,/workloadbalancer/balancer/validate,/workloadbalancer/balancerpolicytemplate/validate
annotations: {}
colocation_enable: 'false'
controller_kube_api_qps: 200
default_scheduler_conf:
  actions: allocate, backfill, preempt
  metrics:
    interval: 30s
    type: ''
  tiers:
  - plugins:
    - name: priority
    - enableJobStarving: false
      enablePreemptable: false
      name: gang
    - name: conformance
  - plugins:
    - enablePreemptable: false
      name: drf
    - name: predicates
    - name: nodeorder
  - plugins:
    - name: cce-gpu-topology-predicate
    - name: cce-gpu-topology-priority
    - name: xgpu
  - plugins:
    - name: nodelocalvolume
    - name: nodeemptydirvolume
    - name: nodeCSIscheduling
    - name: networkresource
deschedulerPolicy:
  profiles:
  - name: ProfileName
    pluginConfig:
    - args:
        nodeFit: true
      name: DefaultEvictor
    - args:
        evictableNamespaces:
          exclude:
          - kube-system
        thresholds:
          cpu: 20
          memory: 20
      name: HighNodeUtilization
    - args:
        evictableNamespaces:
          exclude:
          - kube-system
        metrics:
          type: prometheus_adaptor
        nodeFit: true
        targetThresholds:
          cpu: 80
          memory: 85
        thresholds:
          cpu: 30
          memory: 30
      name: LoadAware
    plugins:
      balance:
        enabled: null
descheduler_enable: 'false'
deschedulingInterval: 10m
enable_workload_balancer: false
oversubscription_method: nodeResource
oversubscription_profile_period: 300
oversubscription_ratio: 60
recommendation_enable: ''
scheduler_kube_api_qps: 200
update_pod_status_qps: 50
workload_balancer_score_annotation_key: ''
workload_balancer_third_party_types: ''
| Category | Parameter | Function | Description |
|---|---|---|---|
| Basic scheduling functions | admission_kube_api_qps | QPS of requests sent by volcano-admission to the Kubernetes API server | Default value: 200; parameter type: float |
| | controller_kube_api_qps | QPS of requests sent by volcano-controller to the Kubernetes API server | Default value: 200; parameter type: float |
| | scheduler_kube_api_qps | QPS of requests sent by volcano-scheduler to the Kubernetes API server | Default value: 200; parameter type: float |
| | update_pod_status_qps | QPS of pod status update requests sent by volcano-scheduler | Default value: 50; parameter type: float |
| | default_scheduler_conf | Used to schedule pods. It consists of a series of actions and plugins and features high scalability. You can specify and implement actions and plugins based on your requirements. | It consists of actions and plugins, which are described in the following two rows. |
| | default_scheduler_conf.actions | Actions to be executed in each scheduling phase. The configured action sequence is the scheduler execution sequence. For details, see Actions. The scheduler traverses all jobs to be scheduled and performs actions such as enqueue, allocate, preempt, and backfill in the configured sequence to find the most appropriate node for each job. | Example: actions: 'allocate, backfill, preempt'. NOTE: When configuring actions, use either preempt or enqueue. |
| | default_scheduler_conf.tier.plugin | Implementation details of algorithms in actions based on different scenarios. For details, see Plugins. | For details, see Table 5. |
| | descheduler_enable | Used to enable descheduling. | This function is disabled by default (descheduler_enable: 'false'). |
| | deschedulerPolicy | Descheduling policy | For details about the parameters, see Table 2. |
| | deschedulingInterval | Descheduling period | Value range: > 0s; parameter type: time |
| | colocation_enable | Used to enable cloud native hybrid deployment. | This function is disabled by default (colocation_enable: 'false'). |
| | oversubscription_method | Method for calculating the oversubscription | nodeResource and podProfile are supported. The default value is nodeResource. |
| | oversubscription_ratio | Percentage of idle resources on a node that can be oversubscribed | Value range: 1 to 100; parameter type: int. For example, 60 indicates that the maximum oversubscribed resources on a node are calculated as 60% × Idle resources on the node. |
| | oversubscription_profile_period | Period of pod profiling | Value range: 60 to 2592000, in seconds, that is, from 1 minute to 1 month. If a pod's metrics are not collected for the entire period, the node's resources will be evaluated according to the resources requested by the pod. When the oversubscription algorithm based on pod profiling is enabled for the first time, the amount of collected data may not be sufficient to cover the entire period. In this case, the oversubscription on the node is temporarily 0 due to the lack of initialization data. After the data of the first period is collected, the oversubscription is updated to the actual value. |
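Two of the rules in the preceding table lend themselves to a quick check before a configuration is applied. The following is a minimal sketch under stated assumptions: `parse_actions` and `max_oversubscription` are hypothetical helper names, the action check only enforces the "either preempt or enqueue" note and does not validate action names, and the idle-resource figure is made up for illustration.

```python
def parse_actions(actions):
    """Split an actions string such as 'allocate, backfill, preempt'
    and reject configurations that contain both preempt and enqueue."""
    names = [part.strip() for part in actions.split(",") if part.strip()]
    if "preempt" in names and "enqueue" in names:
        raise ValueError("use either preempt or enqueue, not both")
    return names


def max_oversubscription(idle_resources, oversubscription_ratio):
    """Maximum oversubscribed resources = ratio% x idle resources."""
    if not 1 <= oversubscription_ratio <= 100:
        raise ValueError("oversubscription_ratio must be between 1 and 100")
    return idle_resources * oversubscription_ratio / 100


print(parse_actions("allocate, backfill, preempt"))
# Hypothetical node with 10 idle vCPUs and the default ratio of 60.
print(max_oversubscription(10.0, 60))
```

The action order returned by `parse_actions` is the order in which the scheduler would execute the actions, as described for default_scheduler_conf.actions.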
Plugins |
Function |
Description |
Demonstration |
|---|---|---|---|
binpack |
Schedule pods to nodes with high resource usage (instead of allocating pods to lightly loaded nodes) to reduce resource fragmentation. |
arguments:
|
- plugins:
  - name: binpack
    arguments:
      binpack.weight: 10
      binpack.cpu: 1
      binpack.memory: 1
      binpack.resources: nvidia.com/gpu, example.com/foo
      binpack.resources.nvidia.com/gpu: 2
      binpack.resources.example.com/foo: 3
|
conformance |
Prevent key pods, such as the pods in the kube-system namespace, from being preempted. |
None |
- plugins:
  - name: 'priority'
  - name: 'gang'
    enablePreemptable: false
  - name: 'conformance'
|
lifecycle |
Schedule pods with similar lifecycles to the same node, based on statistics collected on service scaling rules. Combined with the horizontal scaling capability of Autoscaler, this allows resources to be scaled in and released quickly, which reduces costs and improves resource utilization. 1. Collect statistics on the lifecycle of pods in the service workload and schedule pods with similar lifecycles to the same node. 2. For a cluster configured with an auto scaling policy, adjust the scale-in annotation of nodes so that nodes with low usage are scaled in preferentially. |
arguments:
|
- plugins:
  - name: priority
  - name: gang
    enablePreemptable: false
  - name: conformance
  - name: lifecycle
    arguments:
      lifecycle.MaxGrade: 3
      lifecycle.MaxScore: 200.0
      lifecycle.SaturatedTresh: 0.8
      lifecycle.WindowSize: 10
NOTE:
|
gang |
Consider a group of pods as a whole for resource allocation. This plugin checks whether the number of scheduled pods in a job meets the minimum requirements for running the job. If yes, all pods in the job will be scheduled. If no, the pods will not be scheduled. NOTE:
If a gang scheduling policy is used and the remaining resources in the cluster are greater than or equal to half of the minimum resources required for running a job but less than that minimum, Autoscaler scale-outs will not be triggered. |
|
- plugins:
  - name: priority
  - name: gang
    enablePreemptable: false
    enableJobStarving: false
  - name: conformance
|
priority |
Schedule workloads based on custom workload priorities. |
None |
- plugins:
  - name: priority
  - name: gang
    enablePreemptable: false
  - name: conformance
|
overcommit |
Cluster resources are multiplied by a factor before scheduling to improve workload enqueuing efficiency. If all workloads are Deployments, remove this plugin or set the overcommit factor to 2.0. NOTE:
This plugin is supported in Volcano 1.6.5 and later versions. |
arguments:
|
- plugins:
  - name: overcommit
    arguments:
      overcommit-factor: 2.0
|
drf |
The Dominant Resource Fairness (DRF) scheduling algorithm, which schedules jobs based on their dominant resource share. Jobs with a smaller resource share will be scheduled with a higher priority. |
None |
- plugins:
  - name: 'drf'
  - name: 'predicates'
  - name: 'nodeorder'
|
predicates |
Determine whether a task can be bound to a node by using a series of evaluation algorithms, such as node/pod affinity, taint toleration, node repetition, volume limits, and volume zone matching. |
None |
- plugins:
  - name: 'drf'
  - name: 'predicates'
  - name: 'nodeorder'
|
nodeorder |
A common algorithm for selecting nodes. Nodes are scored in simulated resource allocation to find the most suitable node for the current job. |
Scoring parameters:
|
- plugins:
  - name: nodeorder
    arguments:
      leastrequested.weight: 1
      mostrequested.weight: 0
      nodeaffinity.weight: 2
      podaffinity.weight: 2
      balancedresource.weight: 1
      tainttoleration.weight: 3
      imagelocality.weight: 1
      podtopologyspread.weight: 2
|
cce-gpu-topology-predicate |
GPU-topology scheduling preselection algorithm |
None |
- plugins:
  - name: 'cce-gpu-topology-predicate'
  - name: 'cce-gpu-topology-priority'
  - name: 'xgpu'
|
cce-gpu-topology-priority |
GPU-topology scheduling priority algorithm |
None |
- plugins:
  - name: 'cce-gpu-topology-predicate'
  - name: 'cce-gpu-topology-priority'
  - name: 'xgpu'
|
cce-gpu |
GPU resource allocation that supports decimal GPU configurations by working with the CCE AI Suite (NVIDIA GPU) add-on. |
None |
- plugins:
  - name: 'cce-gpu-topology-predicate'
  - name: 'cce-gpu-topology-priority'
  - name: 'cce-gpu'
|
numa-aware |
NUMA affinity scheduling. For details, see NUMA Affinity Scheduling. |
arguments:
|
- plugins:
  - name: 'nodelocalvolume'
  - name: 'nodeemptydirvolume'
  - name: 'nodeCSIscheduling'
  - name: 'networkresource'
    arguments:
      NetworkType: vpc-router
  - name: numa-aware
    arguments:
      weight: 10
|
networkresource |
Filter out nodes that require elastic network interfaces. The parameters are transferred by CCE and do not need to be manually configured. |
arguments:
|
- plugins:
  - name: 'nodelocalvolume'
  - name: 'nodeemptydirvolume'
  - name: 'nodeCSIscheduling'
  - name: 'networkresource'
    arguments:
      NetworkType: vpc-router
|
nodelocalvolume |
Filter out nodes that do not meet local volume requirements. |
None |
- plugins:
  - name: 'nodelocalvolume'
  - name: 'nodeemptydirvolume'
  - name: 'nodeCSIscheduling'
  - name: 'networkresource'
|
nodeemptydirvolume |
Filter out nodes that do not meet the emptyDir requirements. |
None |
- plugins:
  - name: 'nodelocalvolume'
  - name: 'nodeemptydirvolume'
  - name: 'nodeCSIscheduling'
  - name: 'networkresource'
|
nodeCSIscheduling |
Filter out nodes on which Everest is malfunctioning. |
None |
- plugins:
  - name: 'nodelocalvolume'
  - name: 'nodeemptydirvolume'
  - name: 'nodeCSIscheduling'
  - name: 'networkresource'
|
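The ordering idea behind the drf plugin in the preceding table can be illustrated with a short sketch. It is a simplified illustration of Dominant Resource Fairness in general, not the plugin's actual implementation: the function names, the job names, and the capacity and allocation figures are all hypothetical.

```python
def dominant_share(allocated, capacity):
    """Largest fraction of any single cluster resource held by a job."""
    return max(
        allocated.get(resource, 0) / total
        for resource, total in capacity.items()
        if total > 0
    )


def drf_order(jobs, capacity):
    """Job names sorted so that the smallest dominant share comes first."""
    return sorted(jobs, key=lambda name: dominant_share(jobs[name], capacity))


# Hypothetical cluster capacity and per-job allocations.
capacity = {"cpu": 100, "memory": 400}
jobs = {
    "job-a": {"cpu": 10, "memory": 20},  # shares: cpu 0.10, memory 0.05
    "job-b": {"cpu": 5, "memory": 80},   # shares: cpu 0.05, memory 0.20
}

print(drf_order(jobs, capacity))
```

Here job-a has the smaller dominant share (CPU at 0.10 versus memory at 0.20 for job-b), so it would be considered first.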
| Component | Description | Resource Type |
|---|---|---|
| volcano-scheduler | Schedule pods. | Deployment |
| volcano-controller | Synchronize CRDs. | Deployment |
| volcano-admission | Webhook server, which verifies and modifies resources such as pods and jobs | Deployment |
| volcano-agent | Cloud native hybrid agent, which is used for node QoS assurance, CPU burst, and dynamic resource oversubscription | DaemonSet |
| resource-exporter | Report the NUMA topology information of nodes. | DaemonSet |
| volcano-descheduler | Reschedule pods in a cluster. After the rescheduling capability is enabled, pods will be automatically deployed on nodes. | Deployment |
| volcano-recommender | Generate recommendations for CPU and memory requests based on the historical CPU and memory usage of a container. | Deployment |
| volcano-recommender-prometheus-adapter | Collect historical CPU and memory metrics of containers from Prometheus. | Deployment |
volcano-scheduler is the component responsible for pod scheduling. It consists of a series of actions and plugins. Actions define what is executed in each scheduling step, and plugins provide the algorithm details of the actions in different scenarios. volcano-scheduler is highly scalable. You can specify and implement actions and plugins based on your requirements.
After the add-on is installed, you can choose Settings in the navigation pane, switch to the Scheduling tab, and configure the basic scheduling capabilities. You can also use the expert mode to customize advanced scheduling policies based on service scenarios.
This section describes how to configure volcano-scheduler.
Only Volcano v1.7.1 and later versions support this function.
Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Settings and click the Scheduling tab. In the Default Cluster Scheduler area, find the expert mode and click Try Now.
...
"default_scheduler_conf": {
    "actions": "allocate, backfill, preempt",
    "tiers": [
        {
            "plugins": [
                {
                    "name": "priority"
                },
                {
                    "name": "gang"
                },
                {
                    "name": "conformance"
                }
            ]
        },
        {
            "plugins": [
                {
                    "name": "drf"
                },
                {
                    "name": "predicates"
                },
                {
                    "name": "nodeorder"
                }
            ]
        },
        {
            "plugins": [
                {
                    "name": "cce-gpu-topology-predicate"
                },
                {
                    "name": "cce-gpu-topology-priority"
                },
                {
                    "name": "cce-gpu"
                },
                {
                    "name": "numa-aware" # Add this plugin and also enable resource_exporter.
                }
            ]
        },
        {
            "plugins": [
                {
                    "name": "nodelocalvolume"
                },
                {
                    "name": "nodeemptydirvolume"
                },
                {
                    "name": "nodeCSIscheduling"
                },
                {
                    "name": "networkresource"
                }
            ]
        }
    ]
},
...
After this function is enabled, both the numa-aware plugin and resource_exporter can be used.
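The change shown in the configuration above (adding numa-aware to a plugin tier) can also be expressed programmatically. The following is a hedged sketch: `add_plugin` is a hypothetical helper, the `conf` dictionary is a shortened stand-in for default_scheduler_conf rather than a full configuration, and enabling resource_exporter still has to be done separately.

```python
import copy


def add_plugin(conf, tier_index, name, arguments=None):
    """Return a copy of conf with the plugin appended to the given tier.
    The copy is returned unchanged if the plugin is already present."""
    updated = copy.deepcopy(conf)
    plugins = updated["tiers"][tier_index]["plugins"]
    if any(plugin.get("name") == name for plugin in plugins):
        return updated
    entry = {"name": name}
    if arguments:
        entry["arguments"] = arguments
    plugins.append(entry)
    return updated


# Shortened stand-in for default_scheduler_conf (illustration only).
conf = {
    "actions": "allocate, backfill, preempt",
    "tiers": [
        {"plugins": [{"name": "priority"}, {"name": "gang"}, {"name": "conformance"}]},
        {"plugins": [{"name": "drf"}, {"name": "predicates"}, {"name": "nodeorder"}]},
        {"plugins": [{"name": "cce-gpu-topology-predicate"},
                     {"name": "cce-gpu-topology-priority"},
                     {"name": "cce-gpu"}]},
    ],
}

updated = add_plugin(conf, 2, "numa-aware")
print([plugin["name"] for plugin in updated["tiers"][2]["plugins"]])
```

Working on a deep copy keeps the original configuration intact, so the two versions can be compared before the new one is applied.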
volcano-scheduler exposes Prometheus metrics through port 8080. You can build a Prometheus collector to identify and obtain volcano-scheduler scheduling metrics from http://{{volcano-schedulerPodIP}}:{{volcano-schedulerPodPort}}/metrics.
Only the Volcano add-on of version 1.8.5 or later exposes Prometheus metrics.
| Metric | Type | Description | Label |
|---|---|---|---|
| e2e_scheduling_latency_milliseconds | Histogram | E2E scheduling latency (ms) (scheduling algorithm + binding) | None |
| e2e_job_scheduling_latency_milliseconds | Histogram | E2E job scheduling latency (ms) | None |
| e2e_job_scheduling_duration | Gauge | E2E job scheduling duration | labels=["job_name", "queue", "job_namespace"] |
| plugin_scheduling_latency_microseconds | Histogram | Plugin scheduling latency (µs) | labels=["plugin", "OnSession"] |
| action_scheduling_latency_microseconds | Histogram | Action scheduling latency (µs) | labels=["action"] |
| task_scheduling_latency_milliseconds | Histogram | Task scheduling latency (ms) | None |
| schedule_attempts_total | Counter | Number of pod scheduling attempts. unschedulable indicates that the pods cannot be scheduled, and error indicates that the internal scheduler is faulty. | labels=["result"] |
| pod_preemption_victims | Gauge | Number of selected preemption victims | None |
| total_preemption_attempts | Counter | Total number of preemption attempts in a cluster | None |
| unschedule_task_count | Gauge | Number of unschedulable tasks | labels=["job_id"] |
| unschedule_job_count | Gauge | Number of unschedulable jobs | None |
| job_retry_counts | Counter | Number of job retries | labels=["job_id"] |
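As a rough illustration of collecting these metrics without a full Prometheus setup, the sketch below fetches the /metrics endpoint and parses the text output into a dictionary. It rests on several assumptions: `fetch_metrics` and `parse_samples` are hypothetical helpers, the parser handles only simple `name{labels} value` lines without timestamps, the `SAMPLE` text is made up for the example, and the metric names actually exposed by a given add-on version may differ from those in the table (for example, they may carry a prefix).

```python
import urllib.request


def fetch_metrics(pod_ip, port=8080, timeout=5):
    """Download the raw metrics text from a volcano-scheduler pod."""
    url = f"http://{pod_ip}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_samples(text):
    """Map each 'name{labels}' series to its float value.
    Comment lines (# HELP, # TYPE) and blank lines are skipped."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Made-up sample output, used so the sketch runs without a cluster.
SAMPLE = """\
# TYPE schedule_attempts_total counter
schedule_attempts_total{result="unschedulable"} 3
schedule_attempts_total{result="error"} 0
# TYPE unschedule_job_count gauge
unschedule_job_count 1
"""

samples = parse_samples(SAMPLE)
print(samples['schedule_attempts_total{result="unschedulable"}'])
```

In a cluster, `parse_samples(fetch_metrics(pod_ip))` would be called with the IP address of a volcano-scheduler pod that is reachable from where the script runs.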
After the add-on is uninstalled, all custom Volcano resources (Table 8) will be deleted, including resources that have already been created. Reinstalling the add-on will not inherit or restore the tasks that existed before the uninstallation. It is a good practice to uninstall the Volcano add-on only when no custom Volcano resources are being used in the cluster.
| Item | API Group | API Version | Resource Level |
|---|---|---|---|
| Command | bus.volcano.sh | v1alpha1 | Namespaced |
| Job | batch.volcano.sh | v1alpha1 | Namespaced |
| Numatopology | nodeinfo.volcano.sh | v1alpha1 | Cluster |
| PodGroup | scheduling.volcano.sh | v1beta1 | Namespaced |
| Queue | scheduling.volcano.sh | v1beta1 | Cluster |
It is a good practice to upgrade Volcano to the latest version that is supported by the cluster.
| Add-on Version | Supported Cluster Version | New Feature |
|---|---|---|
| 1.21.2 | v1.28, v1.29, v1.30, v1.31, v1.32, v1.33, v1.34 | |
| 1.19.6 | v1.27, v1.28, v1.29, v1.30, v1.31, v1.32, v1.33 | |
| 1.18.3 | v1.27, v1.28, v1.29, v1.30, v1.31, v1.32 | |
| 1.16.17 | v1.25, v1.27, v1.28, v1.29, v1.30, v1.31 | Supported even scheduling in virtual GPUs. |
| 1.15.11 | v1.23, v1.25, v1.27, v1.28, v1.29, v1.30 | Fixed some issues. |
| 1.15.6 | v1.23, v1.25, v1.27, v1.28, v1.29, v1.30 | Resources can be oversubscribed based on pod profiling. |
| 1.13.3 | v1.21, v1.23, v1.25, v1.27, v1.28, v1.29 | |
| 1.12.1 | v1.19.16, v1.21, v1.23, v1.25, v1.27, v1.28 | Optimized application auto scaling performance. |
| 1.11.21 | v1.19.16, v1.21, v1.23, v1.25, v1.27, v1.28 | |
| 1.11.6 | v1.19.16, v1.21, v1.23, v1.25, v1.27 | |
| 1.9.1 | v1.19.16, v1.21, v1.23, v1.25 | |
| 1.7.1 | v1.19.16, v1.21, v1.23, v1.25 | Supported clusters v1.25. |
| 1.4.5 | v1.17, v1.19, v1.21 | Changed the deployment mode of volcano-scheduler from StatefulSet to Deployment, and fixed the issue where pods cannot be automatically migrated when the node is abnormal. |
| 1.3.7 | v1.15, v1.17, v1.19, v1.21 | |