Preparing Virtualized GPU Resources
CCE uses xGPU virtualization technologies to dynamically divide the GPU memory and computing power. A single GPU can be virtualized into a maximum of 20 virtual GPU devices. This section describes how to implement GPU scheduling and isolation capabilities on GPU nodes.
Prerequisites
| Item | Supported Version |
|---|---|
| Cluster version | v1.23.8-r0, v1.25.3-r0, or later |
| OS | HCE OS 2.0 with kernel version 5.10 or later |
| GPU type | Tesla T4 and Tesla V100 |
| Driver version | 570.86.15, 535.216.03, 535.54.03, 510.47.03, and 470.57.02 |
| CUDA version | CUDA 12.2.0 to 12.8.0 |
| Runtime | containerd |
| Add-on | The following add-ons must be installed in the cluster: CCE AI Suite (NVIDIA GPU) and Volcano Scheduler |
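
The prerequisites above can be expressed as a quick pre-flight check. The following Python sketch is illustrative only: the function name `unmet_prerequisites`, the dictionary keys, and the way the node facts are gathered are assumptions and not part of CCE. Only the supported values are taken from the table. The cluster version and add-on rows are not checked here.

```python
# Supported values copied from the prerequisites table.
# All names below are hypothetical and chosen for illustration.
SUPPORTED_GPUS = ("Tesla T4", "Tesla V100")
SUPPORTED_DRIVERS = {"570.86.15", "535.216.03", "535.54.03", "510.47.03", "470.57.02"}
CUDA_MIN, CUDA_MAX = (12, 2, 0), (12, 8, 0)
KERNEL_MIN = (5, 10)


def parse_version(text, parts):
    """Return the first `parts` numeric fields of a dotted version as a tuple.

    Anything after the first "-" is ignored, and missing fields are padded
    with zeros, so "12.2" with parts=3 becomes (12, 2, 0).
    """
    fields = text.split("-")[0].split(".")[:parts]
    numbers = [int(field) for field in fields]
    numbers += [0] * (parts - len(numbers))
    return tuple(numbers)


def unmet_prerequisites(node):
    """Return the names of the checks that the described node fails."""
    failed = []
    # Assumption: the GPU model string begins with the product name.
    if not node["gpu"].startswith(SUPPORTED_GPUS):
        failed.append("gpu")
    if node["driver"] not in SUPPORTED_DRIVERS:
        failed.append("driver")
    if not CUDA_MIN <= parse_version(node["cuda"], 3) <= CUDA_MAX:
        failed.append("cuda")
    if parse_version(node["kernel"], 2) < KERNEL_MIN:
        failed.append("kernel")
    if node["runtime"] != "containerd":
        failed.append("runtime")
    return failed
```

An empty list means that the node facts you supplied satisfy the GPU type, driver, CUDA, kernel, and runtime rows of the table.
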
Step 1: Enable GPU Virtualization
Both CCE AI Suite (NVIDIA GPU) and Volcano Scheduler must be installed in the cluster.
- Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Settings.
- Switch to the Heterogeneous Resources tab and enable GPU Virtualization.
  - Node pool-level GPU virtualization: If CCE AI Suite (NVIDIA GPU) of version 2.7.2 or later is installed, GPU virtualization can be configured by node pool.
    - In Node Pool Configurations under GPU Settings, click Add.
    - In the Node Pools list, select the node pool where you want to enable GPU virtualization, and choose a driver that supports GPU virtualization from Driver. After you customize a GPU driver for a node pool, nodes in that pool preferentially use the custom driver. Nodes for which no driver is specified use the cluster's default driver.
    - Click the icon under GPU Virtualization to enable GPU virtualization for the node pool. To configure GPU virtualization for multiple node pools, click Add.
- In the lower right corner of the page, click Confirm Settings.
- After configuring GPU virtualization, verify the settings.
  In the navigation pane, choose Cluster > Nodes. In the right pane, click the Nodes tab and find the node where GPU virtualization has been configured. In the Operation column of the target node, choose More > View YAML. If the node-status.volcano.sh/nvidia value in the YAML file is {"enableXGPU":true}, GPU virtualization has been configured on the node.
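
The verification step can also be scripted. The sketch below is a minimal example, assuming that `node-status.volcano.sh/nvidia` is stored under `metadata.annotations` of the node object and that you already have the node as parsed JSON. The function name `xgpu_enabled` is hypothetical.

```python
import json

# Key named in the verification step above.
STATUS_KEY = "node-status.volcano.sh/nvidia"


def xgpu_enabled(node):
    """Return True if the node object reports {"enableXGPU": true}."""
    metadata = node.get("metadata") or {}
    # Assumption: the value lives in annotations as a JSON string.
    annotations = metadata.get("annotations") or {}
    raw = annotations.get(STATUS_KEY)
    if raw is None:
        return False
    try:
        status = json.loads(raw)
    except (ValueError, TypeError):
        # The value is not valid JSON, so treat the node as not configured.
        return False
    return isinstance(status, dict) and status.get("enableXGPU") is True
```

One way to obtain the input is to save the output of `kubectl get node <node-name> -o json` to a file and load it with `json.load`, then pass the resulting dictionary to `xgpu_enabled`.
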
Step 2: Create a GPU Node
To use GPU virtualization, create nodes that support it in the cluster. For details, see Creating a Node or Creating a Node Pool. If your cluster already has GPU nodes that meet the prerequisites, skip this step.
