The following exceptions occur when services are deployed on the GPU nodes in a CCE cluster:
After you install the gpu-beta (gpu-device-plugin) add-on on a node, nvidia-smi is installed automatically. If an error is reported during GPU deployment, the usual cause is a failed NVIDIA driver installation. Run the following command to check whether the NVIDIA driver has been downloaded:
# If the add-on version is earlier than 2.0.0, run the following command:
cd /opt/cloud/cce/nvidia/bin && ./nvidia-smi
# If the add-on version is 2.0.0 or later and the driver installation path is changed, run the following command:
cd /usr/local/nvidia/bin && ./nvidia-smi
If GPU information is returned, the device is available and the add-on has been installed successfully.
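If you are not sure which add-on version is installed on a node, the check above can be sketched as a small shell helper that tries both installation paths in turn. This is only an illustrative sketch: the function names find_nvidia_smi and check_gpu_driver are hypothetical and are not part of the add-on, and the script assumes that only the two paths named above need to be probed.

```shell
#!/bin/sh
# Sketch only: find_nvidia_smi and check_gpu_driver are hypothetical helpers,
# not commands shipped with the add-on.

# Print the first directory in the argument list that contains an
# executable nvidia-smi. Return 1 if none of the directories has one.
find_nvidia_smi() {
    for dir in "$@"; do
        if [ -x "$dir/nvidia-smi" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Probe the two installation paths named in this section and run
# nvidia-smi from whichever one exists.
check_gpu_driver() {
    if bin_dir=$(find_nvidia_smi /opt/cloud/cce/nvidia/bin /usr/local/nvidia/bin); then
        "$bin_dir/nvidia-smi"
    else
        echo "nvidia-smi not found; the NVIDIA driver may not have been installed" >&2
        return 1
    fi
}
```

After loading these definitions on the GPU node, run check_gpu_driver. If it prints GPU information, the driver is in place. If it reports that nvidia-smi was not found, check the driver address as described below.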
If the driver address is incorrect, uninstall the add-on, reinstall it, and configure the correct address.
You are advised to store the NVIDIA driver in an OBS bucket and set the bucket policy to public read.