Copying Data Using MoXing Is Slow and the Log Is Repeatedly Printed in a Training Job

Symptom

Possible Cause

  1. The possible causes for slow data copying are as follows:
    • Reading data directly from OBS makes data reading the training bottleneck, which slows down each iteration.
    • Data cannot be read from OBS because of environment or network issues, which causes the job to fail.
  2. The repeated log indicates that files are being read from the remote end. The download starts only after the file list has been read, so if there are many files, this process takes a long time.

Solution

When creating a training job, you can store the training data in OBS. However, you are advised not to use the OBS APIs of TensorFlow, MXNet, or PyTorch to read data directly from OBS.
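
One way to follow this advice is to copy the dataset to local storage once before training starts, so the training loop reads only local files. The following is a minimal sketch, not an official procedure. The helper name stage_dataset, the ".staged" marker file, and the example paths are hypothetical. The sketch also assumes that MoXing is available in the training image and that mox.file.copy_parallel(src, dst) is used as the copy function.

```python
import os


def stage_dataset(src_url, local_dir, copy_fn):
    """Copy a dataset to local_dir once and return the local path.

    copy_fn(src, dst) performs the actual transfer. In a training job this
    could be MoXing's mox.file.copy_parallel (assumption: MoXing is
    installed); any callable that takes (src, dst) works.
    """
    # Hypothetical marker file recording that the copy already finished.
    marker = os.path.join(local_dir, ".staged")
    if os.path.exists(marker):
        return local_dir  # data is already local, skip the remote copy

    copy_fn(src_url, local_dir)
    os.makedirs(local_dir, exist_ok=True)  # in case copy_fn created nothing
    with open(marker, "w") as f:
        f.write(src_url)
    return local_dir


# Example use inside a training script (bucket name and paths are placeholders):
#
#   import moxing as mox
#   data_dir = stage_dataset("obs://my-bucket/dataset/", "/cache/dataset",
#                            mox.file.copy_parallel)
#   # Build the TensorFlow, MXNet, or PyTorch input pipeline from data_dir.
```

Because the copy runs a single time, the file list is read from the remote end only once, and each iteration then reads from local disk.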