When a training job saves a model, the following error appears in the log:
InternalError (see above for traceback): : Unable to connect to endpoint
This error can occur when the connection to OBS is unstable.
To work around unstable OBS connections, add the following code at the beginning of your existing code so that TensorFlow reads and writes checkpoint (ckpt) and summary data in local cache mode:
import moxing.tensorflow as mox
mox.cache()
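If the same training script is also run outside ModelArts, where the moxing package may not be installed, the call can be guarded as in the minimal sketch below. The helper name enable_obs_cache is hypothetical and used only for illustration; the only calls taken from the text above are the moxing.tensorflow import and mox.cache().

```python
# Sketch: enable MoXing local cache mode before any checkpoint or summary code runs.
# Assumption: moxing is importable only inside the ModelArts training environment.
try:
    import moxing.tensorflow as mox
except ImportError:
    mox = None  # running somewhere moxing is not installed


def enable_obs_cache():
    """Turn on local cache mode if moxing is available (hypothetical helper).

    Returns True if mox.cache() was called, False if moxing is missing.
    """
    if mox is None:
        return False
    mox.cache()
    return True


# Call this at the very beginning of the training script,
# before the code that builds the model and saves checkpoints.
cache_enabled = enable_obs_cache()
```

With this guard, the script still starts when moxing is absent; it just runs without local cache mode.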