A training job fails, and the logs display error information similar to the following:
[Modelarts Service Log]Training end with return code: 137
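The log shows that the training job exited with code 137. The training process is launched when the user code runs, so the exit codes described in this section are produced by the training code itself. By the usual Linux convention, an exit code above 128 means the process was terminated by a signal: 137 is 128 + 9 (SIGKILL), which is typical of a process killed for using too much memory. Other common exit codes include 247 and 139.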
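The relationship between an exit code and the signal behind it can be checked with a short script. This is a minimal sketch that assumes a Linux host and the common "128 + signal number" convention; the helper name describe_exit_code is hypothetical and is not part of ModelArts.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the common '128 + signal number' convention."""
    if code == 0:
        return "success"
    if code > 128:
        try:
            # Look up the signal name for the offset above 128.
            name = signal.Signals(code - 128).name
        except ValueError:
            # The offset does not match a signal known on this platform.
            return f"exit code {code} (no matching signal on this platform)"
        return f"terminated by {name} (signal {code - 128})"
    return f"exited with application error code {code}"


for code in (137, 139):
    print(code, "->", describe_exit_code(code))
```

On Linux, 137 maps to SIGKILL and 139 maps to SIGSEGV (a segmentation fault), which helps separate memory exhaustion from crashes in native code.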
A likely cause is that the job ran out of memory. To resolve this issue, reduce the data volume, decrease the batch_size value, optimize the code so that it uses less memory, or aggregate and replicate the data.
The size of the data files on disk is not the same as the memory they occupy once loaded. Evaluate the actual memory usage of the job instead of relying on file size.
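One way to evaluate memory usage is to measure the peak allocation while a small sample of the data is loaded, then extrapolate to the full dataset. The following is a rough sketch using the standard tracemalloc module; load_sample is a hypothetical stand-in for a real data loader, and tracemalloc only sees allocations made through Python's allocator, so memory held by native libraries or on accelerators may not be counted.

```python
import tracemalloc


def peak_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak traced allocation in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


def load_sample(n_rows):
    # Hypothetical loader: builds n_rows rows of 10 floats in memory.
    return [[float(i)] * 10 for i in range(n_rows)]


rows, peak = peak_memory_mib(load_sample, 10_000)
print(f"{len(rows)} rows loaded, peak about {peak:.1f} MiB")
```

Comparing the measured peak for a known number of rows with the memory available to the training container gives a first estimate of whether the full dataset or the chosen batch_size can fit.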
Check the versions of the installed packages. A package conflict may be the cause.
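Installed versions can be listed from inside the training environment so that they can be compared with the versions the code was developed against. This sketch uses the standard importlib.metadata module; the package names shown are placeholders, not a list taken from this document.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Placeholder names: replace with the packages your training code imports.
for name, version in installed_versions(["numpy", "torch"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Printing this at the start of the training script records the versions in the job log, which makes a mismatch between the local and the training environment easier to spot.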
The error information indicates that the fault lies in the user code.
You can use either of the following methods to locate the fault: