Reviewed-by: Gergo-Bence Lorincz <a200452876@noreply.gitea.eco.tsi-dev.otc-service.com> Co-authored-by: qiujiandong1 <qiujiandong1@huawei.com> Co-committed-by: qiujiandong1 <qiujiandong1@huawei.com>
<a name="cce_faq_00501"></a><a name="cce_faq_00501"></a>
<h1 class="topictitle1">What Can I Do If a GPU Card Is Unavailable on a GPU Node?</h1>
<div id="body0000002339703465"><div class="section" id="cce_faq_00501__section147601620115118"><h4 class="sectiontitle">Symptom</h4><p id="cce_faq_00501__p1729091719462">A GPU card on a GPU node is unavailable. The possible causes include:</p>
<ul id="cce_faq_00501__ul8263165516491"><li id="cce_faq_00501__li1126313554499">The CCE AI Suite (NVIDIA GPU) add-on is not ready or is malfunctioning.</li><li id="cce_faq_00501__li18489105165020">The node driver is not ready.</li><li id="cce_faq_00501__li1511142119504">The GPU card is abnormal.</li></ul>
<p id="cce_faq_00501__p1444113184515"></p>
</div>
<div class="section" id="cce_faq_00501__section62266221517"><h4 class="sectiontitle">Solution</h4><p id="cce_faq_00501__p11779234133318">Check whether the driver is faulty. Then, check the <strong id="cce_faq_00501__b1862611513468">device-plugin</strong> component of the CCE AI Suite (NVIDIA GPU) add-on. Finally, check the GPU card.</p>
</div>
<div class="section" id="cce_faq_00501__section1089010162568"><div class="dropdownexpand"><div class="dropdowntitle" onclick="ExpandorCollapseNode(this)"><h4 class="sectiontitle">Handling a Driver Fault</h4></div><div class="dropdowncontext"></div><div class="dropdowncontext"><ol id="cce_faq_00501__ol96855561317"><li id="cce_faq_00501__li2068518561637"><span><strong id="cce_faq_00501__b188871116195616">Check the status of the nvidia-driver-installer pod</strong>.</span><p><div class="p" id="cce_faq_00501__p108871716205613">Log in to the CCE console and click the cluster name to access the cluster <strong id="cce_faq_00501__b19545429124716">Overview</strong> page. In the navigation pane, choose <strong id="cce_faq_00501__b1024463711475">Nodes</strong>. In the right pane, click the <strong id="cce_faq_00501__b152441337104712">Nodes</strong> tab. Locate the row containing the target node, choose <strong id="cce_faq_00501__b177162721">More</strong> &gt; <strong id="cce_faq_00501__b017075121">Pods</strong> in the <strong id="cce_faq_00501__b10498127424">Operation</strong> column, and check whether the <strong id="cce_faq_00501__b1764016272213">nvidia-driver-installer</strong> pod runs on the node. If the <strong id="cce_faq_00501__b748685114913">nvidia-driver-installer</strong> pod is present, proceed based on its state:<ul id="cce_faq_00501__ul1088791617562"><li id="cce_faq_00501__li128871816185610"><strong id="cce_faq_00501__b4195136586">Running</strong>: The pod is functioning properly. Go to <a href="#cce_faq_00501__li13685195617313">2</a> to verify that the driver has been installed.</li><li id="cce_faq_00501__li4887316185619">Not <strong id="cce_faq_00501__b54131557786">Running</strong> for an extended period: Check the pod events and troubleshoot based on the reported errors.</li></ul>
</div>
<p id="cce_faq_00501__p78871116205614">The name of the <strong id="cce_faq_00501__b425712313411">nvidia-driver-installer</strong> pod varies depending on the OS. The details are listed in the table below.</p>
<div class="tablenoborder"><table cellpadding="4" cellspacing="0" summary="" id="cce_faq_00501__table1288921620561" frame="border" border="1" rules="all"><caption><b>Table 1 </b>Names of the nvidia-driver-installer pod</caption><thead align="left"><tr id="cce_faq_00501__row38881916205610"><th align="left" class="cellrowborder" valign="top" width="23.53%" id="mcps1.3.3.2.1.2.3.2.3.1.1"><p id="cce_faq_00501__p19888111611563">OS</p>
</th>
<th align="left" class="cellrowborder" valign="top" width="76.47%" id="mcps1.3.3.2.1.2.3.2.3.1.2"><p id="cce_faq_00501__p5888121695617">Pod Name</p>
</th>
</tr>
</thead>
<tbody><tr id="cce_faq_00501__row78881616125611"><td class="cellrowborder" valign="top" width="23.53%" headers="mcps1.3.3.2.1.2.3.2.3.1.1 "><p id="cce_faq_00501__p888881695616"><span id="cce_faq_00501__ph14203823133013">HCE OS 2.0</span></p>
</td>
<td class="cellrowborder" valign="top" width="76.47%" headers="mcps1.3.3.2.1.2.3.2.3.1.2 "><p id="cce_faq_00501__p1388819162568">hce20-nvidia-driver-installer</p>
</td>
</tr>
<tr id="cce_faq_00501__row2889121635611"><td class="cellrowborder" valign="top" width="23.53%" headers="mcps1.3.3.2.1.2.3.2.3.1.1 "><p id="cce_faq_00501__p1188915160563">Ubuntu</p>
</td>
<td class="cellrowborder" valign="top" width="76.47%" headers="mcps1.3.3.2.1.2.3.2.3.1.2 "><p id="cce_faq_00501__p6889181618569">ubuntu22-nvidia-driver-installer</p>
</td>
</tr>
<tr id="cce_faq_00501__row3889171610569"><td class="cellrowborder" valign="top" width="23.53%" headers="mcps1.3.3.2.1.2.3.2.3.1.1 "><p id="cce_faq_00501__p5889151614565">Others</p>
</td>
<td class="cellrowborder" valign="top" width="76.47%" headers="mcps1.3.3.2.1.2.3.2.3.1.2 "><p id="cce_faq_00501__p588914163569">nvidia-driver-installer</p>
</td>
</tr>
</tbody>
</table>
</div>
</p></li><li id="cce_faq_00501__li13685195617313"><a name="cce_faq_00501__li13685195617313"></a><a name="li13685195617313"></a><span><strong id="cce_faq_00501__b13889101619563">Check whether the GPU driver has been installed.</strong></span><p><ol type="a" id="cce_faq_00501__ol168891816135618"><li id="cce_faq_00501__li9889111619566">In the node list, click the name of the target node. In the dialog box displayed, click <strong id="cce_faq_00501__b115888591452">OK</strong>. On the node details page, click <strong id="cce_faq_00501__b2463105631210">Remote Login</strong> in the upper right corner.</li><li id="cce_faq_00501__li134474241219">Check the driver installation directory.<ol class="substepthirdol" id="cce_faq_00501__ol1124413277217"><li id="cce_faq_00501__li18164153841716">Check whether the driver installation directory exists. If it does, run the following command to go to it. If it does not, skip this step and go to <a href="#cce_faq_00501__li92194510417">3</a> to check whether an error occurred during the driver installation.<pre class="screen" id="cce_faq_00501__screen1817634420176">cd &lt;Driver installation directory&gt;</pre>
<div class="p" id="cce_faq_00501__p1571073891718">The driver installation directory depends on the CCE AI Suite (NVIDIA GPU) add-on version:<ul id="cce_faq_00501__ul139072026403"><li id="cce_faq_00501__li3907202617011">If the add-on version is 2.0.0 or later, the driver installation directory is <strong id="cce_faq_00501__b1829910171154">/usr/local/nvidia</strong>.</li><li id="cce_faq_00501__li1773041215">If the add-on version is earlier than 2.0.0, the driver installation directory is <strong id="cce_faq_00501__b65791530151512">/opt/cloud/cce/nvidia</strong>.</li></ul>
</div>
</li><li id="cce_faq_00501__li937916517618">Run the following command in the driver installation directory to view all files in the directory:<pre class="screen" id="cce_faq_00501__screen768616216229">ls -l</pre>
<p id="cce_faq_00501__p05961145192217">The figure below shows a typical directory listing. <strong id="cce_faq_00501__b199292052867">nvidia.run</strong> is the driver installation file. <strong id="cce_faq_00501__b1946755814612">nvidia-installer.log</strong> contains the installation logs generated by the NVIDIA driver. <strong id="cce_faq_00501__b14853121179">nvidia-uninstall.log</strong> contains the corresponding uninstallation logs and may not be present. <strong id="cce_faq_00501__b14697134989">If any file other than </strong><strong id="cce_faq_00501__b774720326714">nvidia-uninstall.log</strong><strong id="cce_faq_00501__b17697345816"> is missing, go to <a href="#cce_faq_00501__li92194510417">3</a> to check whether an error occurred during the driver installation.</strong></p>
<p id="cce_faq_00501__p1877673922315"><span><img id="cce_faq_00501__image1414111221235" src="en-us_image_0000002484118014.png"></span></p>
</li><li id="cce_faq_00501__li1564721172518">Run the following commands to go to the NVIDIA <strong id="cce_faq_00501__b1756102615167">bin</strong> directory and check whether <strong id="cce_faq_00501__b87951630161611">nvidia-smi</strong> is functioning properly. <strong id="cce_faq_00501__b14361728173">If the add-on version is earlier than 2.0.0, replace the path with /opt/cloud/cce/nvidia/bin.</strong><pre class="screen" id="cce_faq_00501__screen1189123773012">cd /usr/local/nvidia/bin
./nvidia-smi</pre>
<p id="cce_faq_00501__p12388172411918">If information similar to that shown in the figure below is not displayed, go to <a href="#cce_faq_00501__li92194510417">3</a> to check whether there is an error during the driver installation.</p>
<p id="cce_faq_00501__p1531851103216"><span><img id="cce_faq_00501__image195741347133114" src="en-us_image_0000002516077983.png"></span></p>
</li></ol>
</li></ol>
</p></li><li id="cce_faq_00501__li92194510417"><a name="cce_faq_00501__li92194510417"></a><a name="li92194510417"></a><span><strong id="cce_faq_00501__b762819216351">View the node driver installation logs to check whether an error occurred during the driver installation.</strong></span><p><div class="p" id="cce_faq_00501__p1679831235415">Run the following command to view the driver installation logs written by the <strong id="cce_faq_00501__b16349201218368">nvidia-driver-installer</strong> pod. <strong id="cce_faq_00501__b192346527299">If the add-on version is earlier than 2.0.0, replace the path with /opt/cloud/cce/nvidia</strong><strong id="cce_faq_00501__b12352052182917">/nvidia-installer.log</strong><strong id="cce_faq_00501__b1236155219291">.</strong><pre class="screen" id="cce_faq_00501__screen6187514131519">cat /usr/local/nvidia/nvidia-installer.log</pre>
</div>
<p id="cce_faq_00501__p19661136171617">If the command output contains the following information, the driver installation completed without errors. Otherwise, an error occurred during the installation.</p>
<pre class="screen" id="cce_faq_00501__screen5992104320221">...
&gt; Installation of the NVIDIA Accelerated Graphics Driver for xxx (version: x.x.x) is now complete.</pre>
</p></li></ol>
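<p>The checks in steps 2 and 3 can be combined into a small script that is run on the node after remote login. The following is a minimal sketch and not part of the add-on: the function names are hypothetical, a POSIX shell is assumed, and the script relies only on the two installation directories, the file names, and the completion message described above.</p>

```shell
#!/bin/sh
# Illustrative sketch only: helper names are hypothetical, not part of the add-on.

# Print the first existing directory among the arguments.
find_driver_dir() {
    for candidate in "$@"; do
        if [ -d "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Report the expected files in a driver directory.
# nvidia-uninstall.log is optional, so it is not checked.
check_driver_files() {
    missing=0
    for name in nvidia.run nvidia-installer.log; do
        if [ -e "$1/$name" ]; then
            echo "found: $name"
        else
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Succeed if the installer log contains the completion message.
install_completed() {
    grep -q 'Installation of the NVIDIA Accelerated Graphics Driver .* is now complete' "$1" 2>/dev/null
}

# Try the directory used by newer add-on versions first,
# then the one used by versions earlier than 2.0.0.
if driver_dir=$(find_driver_dir /usr/local/nvidia /opt/cloud/cce/nvidia); then
    check_driver_files "$driver_dir" || echo "Some driver files are missing in $driver_dir"
    if install_completed "$driver_dir/nvidia-installer.log"; then
        echo "Driver installation completed"
    else
        echo "No completion message in $driver_dir/nvidia-installer.log"
    fi
    "$driver_dir/bin/nvidia-smi" || echo "nvidia-smi failed or is not present"
else
    echo "No driver installation directory found"
fi
```

<p>If the script reports a missing file, a missing completion message, or an <strong>nvidia-smi</strong> failure, review the full installation log as described in step 3.</p>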
</div></div></div>
<div class="section" id="cce_faq_00501__section15595182212250"><div class="dropdownexpand"><div class="dropdowntitle" onclick="ExpandorCollapseNode(this)"><h4 class="sectiontitle">Handling a device-plugin Fault</h4></div><div class="dropdowncontext"></div><div class="dropdowncontext"><p id="cce_faq_00501__p11434141319289">In a CCE cluster, <strong id="cce_faq_00501__b186217341129">device-plugin</strong> is responsible for reporting hardware resource statuses. In GPU scenarios, <strong id="cce_faq_00501__b2013045081216">nvidia-gpu-device-plugin</strong> in the <strong id="cce_faq_00501__b1012020592127">kube-system</strong> namespace reports the available GPU resources on each node. If the reported GPU resources are incorrect or devices fail to be mounted, check <strong id="cce_faq_00501__b17152931171316">device-plugin</strong> first.</p>
<div class="p" id="cce_faq_00501__p1713212493288">Run the following command to <strong id="cce_faq_00501__b179688753115">check the device-plugin status</strong>:<pre class="screen" id="cce_faq_00501__screen139711511173116">kubectl get po -A -o wide | grep nvidia</pre>
</div>
<ul id="cce_faq_00501__ul1728519011260"><li id="cce_faq_00501__li13911418264">If the <strong id="cce_faq_00501__b653968272">device-plugin</strong> pod is in the <strong id="cce_faq_00501__b82615544222">Running</strong> state, run the following command to check its logs for errors:<pre class="screen" id="cce_faq_00501__screen92091511288">kubectl logs -n kube-system <i><span class="varname" id="cce_faq_00501__varname82101314288">nvidia-gpu-device-plugin-9xmhr</span></i></pre>
<p id="cce_faq_00501__p4761141913288">If "gpu driver wasn't ready. will re-check" is displayed in the command output, go to <a href="#cce_faq_00501__li13685195617313">2</a> and check whether the <strong id="cce_faq_00501__b18232413122319">/usr/local/nvidia/bin/nvidia-smi</strong> or <strong id="cce_faq_00501__b208061910233">/opt/cloud/cce/nvidia/bin/nvidia-smi</strong> file exists in the driver installation directory.</p>
<pre class="screen" id="cce_faq_00501__screen19761151917288">...
I0527 11:29:06.420714 3336959 nvidia_gpu.go:76] device-plugin started
I0527 11:29:06.521884 3336959 nodeinformer.go:124] "nodeInformer started"
I0527 11:29:06.521964 3336959 nvidia_gpu.go:262] "gpu driver wasn't ready. will re-check in %s" 5s="(MISSING)"
I0527 11:29:11.524882 3336959 nvidia_gpu.go:262] "gpu driver wasn't ready. will re-check in %s" 5s="(MISSING)"
...</pre>
</li></ul>
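<p>Both device-plugin checks can be scripted by filtering the command output. The following is a minimal sketch with hypothetical function names. It assumes that STATUS is the fourth column in the output of <strong>kubectl get po -A -o wide</strong>, and it matches only the "gpu driver wasn't ready" message shown above.</p>

```shell
#!/bin/sh
# Illustrative sketch only: helper names are hypothetical.

# Read "kubectl get po -A -o wide" output on stdin and print the nvidia pods
# whose STATUS (assumed to be column 4) is not Running.
list_not_running() {
    awk '/nvidia/ && $4 != "Running" { print $1 "/" $2 ": " $4 }'
}

# Read device-plugin logs on stdin and print how many lines report that the
# GPU driver was not ready.
count_driver_not_ready() {
    grep -c "gpu driver wasn't ready" || true
}

# Example with log lines like those shown above.
count_driver_not_ready <<'EOF'
I0527 11:29:06.420714 3336959 nvidia_gpu.go:76] device-plugin started
I0527 11:29:06.521964 3336959 nvidia_gpu.go:262] "gpu driver wasn't ready. will re-check in %s" 5s="(MISSING)"
I0527 11:29:11.524882 3336959 nvidia_gpu.go:262] "gpu driver wasn't ready. will re-check in %s" 5s="(MISSING)"
EOF
```

<p>On a cluster, pipe the real command output into the helpers, for example <strong>kubectl get po -A -o wide | list_not_running</strong>. A non-zero count from <strong>count_driver_not_ready</strong> indicates that the driver checks in <a href="#cce_faq_00501__li13685195617313">2</a> are needed.</p>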
</div></div></div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="cce_faq_00281.html">Node Running</a></div>
</div>
</div>