doc-exports/docs/modelarts/umn/modelarts_trouble_0044.html
Lai, Weijian 6aa966a79a ModelArts UMN 24.3.0 version
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com>
Co-authored-by: Lai, Weijian <laiweijian4@huawei.com>
Co-committed-by: Lai, Weijian <laiweijian4@huawei.com>
2024-11-02 09:04:52 +00:00

Training Job Failed Due to OOM

Symptom

If a training job fails due to out of memory (OOM), the possible symptoms are as follows:
  1. Error code 137 is returned.
  2. The log file contains error information with the keyword "killed".
    Figure 1 Error log
  3. Error message "RuntimeError: CUDA out of memory." is displayed in logs.
    Figure 2 Error log
  4. Error message "Dst tensor is not initialized" is displayed in TensorFlow logs.
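The symptoms above can be checked mechanically when a job log and exit code are available. The following is a minimal sketch, not part of ModelArts: the function names and labels are hypothetical, and it assumes the log is available as a plain string. It relies on the shell convention that a process terminated by signal N exits with code 128 + N, so 137 corresponds to SIGKILL (signal 9), which is the signal the Linux OOM killer sends.

```python
import re
from typing import List, Optional

SIGKILL = 9  # POSIX signal number; 128 + 9 = 137

# Log fragments quoted in the symptom list; matching is case-insensitive.
# The label strings are hypothetical names chosen for this sketch.
OOM_PATTERNS = {
    "process killed": re.compile(r"\bkilled\b", re.IGNORECASE),
    "CUDA out of memory": re.compile(r"CUDA out of memory", re.IGNORECASE),
    "TensorFlow Dst tensor": re.compile(r"Dst tensor is not initialized", re.IGNORECASE),
}


def killed_by_sigkill(exit_code: int) -> bool:
    """Return True if a shell-style exit code indicates termination by SIGKILL.

    Exit code 137 often means the kernel OOM killer ended the process,
    but any other SIGKILL produces the same code.
    """
    return exit_code == 128 + SIGKILL


def find_oom_symptoms(log_text: str, exit_code: Optional[int] = None) -> List[str]:
    """Return the labels of all OOM symptoms found, in a fixed order."""
    found = []
    if exit_code is not None and killed_by_sigkill(exit_code):
        found.append("exit code 137 (SIGKILL)")
    for label, pattern in OOM_PATTERNS.items():
        if pattern.search(log_text):
            found.append(label)
    return found
```

An empty result does not rule out OOM; it only means none of the four listed symptoms appeared in the supplied log text.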

Possible Causes

The possible causes are as follows:

  • GPU memory is insufficient.
  • OOM occurred only on certain nodes. This issue is typically caused by a node fault.

Solution

  1. Reduce memory usage by adjusting hyperparameters and releasing unnecessary tensors.
    1. Reduce memory-heavy network parameters, such as batch_size, hide_layer, and cell_nums.
    2. Release unnecessary tensors. For example, in PyTorch:
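      import torch
      del tmp_tensor  # tmp_tensor: any tensor that is no longer needed
      torch.cuda.empty_cache()  # return unused cached GPU memory to the driver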
  2. Use a local PyCharm installation to remotely access a notebook instance for debugging.
  3. If the fault persists, submit a service ticket so that the fault can be located and, if necessary, the affected node isolated.
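When debugging, it can help to find the largest batch_size that fits in memory before fixing the hyperparameter for the job. The following is a minimal sketch of one common approach: retry a training step with half the batch size each time an OOM error is raised. The names run_with_smaller_batches, is_oom_error, and train_step are hypothetical. The sketch assumes that the framework reports OOM as a RuntimeError whose message contains "out of memory", as in the PyTorch message quoted in the symptoms.

```python
from typing import Callable


def is_oom_error(exc: BaseException) -> bool:
    """Heuristic check: a RuntimeError whose message mentions 'out of memory'."""
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def run_with_smaller_batches(
    train_step: Callable[[int], None],
    batch_size: int,
    min_batch_size: int = 1,
) -> int:
    """Call train_step(batch_size), halving batch_size after each OOM error.

    Returns the first batch size that completed without error. Re-raises the
    OOM error if min_batch_size also fails, and re-raises any non-OOM error
    immediately.
    """
    while True:
        try:
            train_step(batch_size)
            return batch_size
        except RuntimeError as exc:
            if not is_oom_error(exc) or batch_size <= min_batch_size:
                raise
            # Only compute the new size here; the retry happens after the
            # except block exits, so the failed attempt's traceback is released.
            batch_size = max(min_batch_size, batch_size // 2)
```

In a real PyTorch job, train_step would typically also delete temporary tensors and call torch.cuda.empty_cache() before the next attempt. This loop is a debugging aid for choosing a batch size, not a fix for OOM caused by a faulty node.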

Summary and Suggestions

Before creating a training job, use the ModelArts development environment to debug the training code and eliminate as many code migration errors as possible.
