Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Lai, Weijian <laiweijian4@huawei.com> Co-committed-by: Lai, Weijian <laiweijian4@huawei.com>
Stopped Training Job Process
Symptom
The training job process is stopped and the logs are interrupted.
Possible Causes
- CPU soft lock
Decompressing a large number of files may cause a CPU soft lock and restart the node. To avoid this, pause the decompression periodically by calling a sleep method. For example, pause for 1 second after every 10,000 files are decompressed.
- Storage limitation
Use data disks based on specifications. For details about a data disk size, see
- CPU overload
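The periodic pause described under "CPU soft lock" can be sketched as follows. This is a minimal illustration, assuming the dataset is a zip archive and the training code is Python. The function name `extract_throttled` and its parameters are hypothetical and are not part of any platform API.

```python
import time
import zipfile


def extract_throttled(archive_path, dest_dir, batch_size=10000,
                      pause_seconds=1.0, sleep=time.sleep):
    """Extract a zip archive, pausing after every `batch_size` files.

    Hypothetical helper: the defaults mirror the example in the text
    (1 second of sleep per 10,000 decompressed files).
    Returns the number of pauses taken.
    """
    pauses = 0
    with zipfile.ZipFile(archive_path) as archive:
        for count, member in enumerate(archive.infolist(), start=1):
            archive.extract(member, dest_dir)
            # Yield the CPU after each full batch of files.
            if count % batch_size == 0:
                sleep(pause_seconds)
                pauses += 1
    return pauses
```

The batch size and pause length are tuning values. A smaller batch or a longer pause lowers the sustained CPU load at the cost of a slower decompression.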
Troubleshooting
According to the error information, the error is caused by the user code.
You can use either of the following methods to locate the fault:
- Debug the code online (available only for non-distributed code).
  - Create a development environment (notebook) instance with the same specifications as the training job.
  - Debug the user code in the notebook and find the faulty code snippet.
  - Search for the key code snippet and the exit code in a search engine to find a solution.
- Locate the fault based on the training logs.
  - Identify the faulty code snippet based on the logs.
  - Add print statements around the faulty code snippet to obtain more detailed log information.
  - Run the training job again to pinpoint the faulty code snippet.
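One way to obtain more detailed log information around a suspect code snippet is to wrap it in a small logging helper. The sketch below assumes the training code is Python and that standard output is collected into the training logs. The helper `run_step` and the logger name `train` are hypothetical.

```python
import logging
import sys

# Assumption: the platform collects stdout into the training logs.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("train")


def run_step(step_fn, *args, **kwargs):
    """Run one suspect step, logging its entry, exit, and any exception."""
    logger.info("Entering %s args=%r kwargs=%r", step_fn.__name__, args, kwargs)
    try:
        result = step_fn(*args, **kwargs)
    except Exception:
        # logger.exception records the full traceback before re-raising.
        logger.exception("Step %s failed", step_fn.__name__)
        raise
    logger.info("Finished %s", step_fn.__name__)
    return result
```

If the logs stop after an "Entering" line with no matching "Finished" line and no traceback, the process was probably terminated from outside the Python code during that step, which narrows the search on the next run.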
Parent topic: Service Code Issues