doc-exports/docs/modelarts/umn/develop-modelarts-0023.html
Lai, Weijian 6aa966a79a ModelArts UMN 24.3.0 version
Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com>
Co-authored-by: Lai, Weijian <laiweijian4@huawei.com>
Co-committed-by: Lai, Weijian <laiweijian4@huawei.com>
2024-11-02 09:04:52 +00:00


<a name="EN-US_TOPIC_0000002079097949"></a>
<h1 class="topictitle1">Resumable Training and Incremental Training</h1>
<div id="body0000001166070856"><div class="section" id="EN-US_TOPIC_0000002079097949__section3282182114914"><h4 class="sectiontitle">Overview</h4><p id="EN-US_TOPIC_0000002079097949__p182526352494">Resumable training means that an interrupted training job can be automatically resumed from the last checkpoint saved before the interruption. This method is suitable for model training that takes a long time.</p>
<p id="EN-US_TOPIC_0000002079097949__p310013774516">Incremental training is a method in which new input data is continuously used to further train an existing model and extend its knowledge.</p>
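<p>As a rough illustration of incremental training, the following sketch fits a one-parameter model <code>y = w * x</code> and then continues training from the learned weight when new data arrives. The function <code>train</code>, the data sets, and the hyperparameters are all hypothetical and are not part of ModelArts or MindSpore.</p>

```python
# Minimal sketch of incremental training on a one-parameter model y = w * x.
# All names and values here are illustrative assumptions.

def train(w, data, lr=0.01, epochs=200):
    """Fit y = w * x by gradient descent on mean squared error, starting from w."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

# Initial training on the data that is available first.
old_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_initial = train(0.0, old_data)

# New data arrives later. Training continues from the existing weight
# instead of starting again from zero.
new_data = [(4.0, 8.4), (5.0, 10.5)]
w_incremental = train(w_initial, new_data)

print(w_initial, w_incremental)
```

<p>In a real job, the existing weight would be loaded from a checkpoint file rather than kept in memory.</p>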
<p id="EN-US_TOPIC_0000002079097949__p1453271013459">Checkpoints are used to resume model training or incrementally train a model.</p>
<p id="EN-US_TOPIC_0000002079097949__p12114201290">During model training, the training state (including but not limited to the epoch number, model weights, optimizer state, and scheduler state) is continuously saved as checkpoints. If a training job is interrupted, it can then be resumed from the most recent checkpoint instead of starting over.</p>
<p id="EN-US_TOPIC_0000002079097949__p913110121994">To resume a training job, load a checkpoint and use its contents to initialize the training state. To do so, add checkpoint reloading logic (reload ckpt) to the training code.</p>
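<p>The save-and-reload cycle described above can be sketched without any deep learning framework. In the following example, the state dictionary, the JSON file format, and the functions <code>save_checkpoint</code>, <code>load_checkpoint</code>, and <code>run_training</code> are hypothetical stand-ins for real model weights and framework APIs. The <code>stop_after</code> parameter only simulates an interruption.</p>

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Save the full training state (a stand-in for a real checkpoint)."""
    with open(path, "w") as f:
        json.dump(state, f)

def load_checkpoint(path):
    with open(path) as f:
        return json.load(f)

def run_training(ckpt_path, total_epochs, stop_after=None):
    """Train up to total_epochs, resuming from ckpt_path if it exists.

    stop_after simulates an interruption after that many epochs in this run.
    """
    # State for a first run: epoch counter, a "weight", and "optimizer" state.
    state = {"epoch": 0, "weight": 0.0, "optimizer": {"momentum": 0.0}}
    if os.path.exists(ckpt_path):
        state = load_checkpoint(ckpt_path)  # reload ckpt
    epochs_run = 0
    while state["epoch"] < total_epochs:
        if stop_after is not None and epochs_run >= stop_after:
            break  # simulated interruption
        # Placeholder for one epoch of real training.
        state["optimizer"]["momentum"] = 0.9 * state["optimizer"]["momentum"] + 1.0
        state["weight"] += 0.1 * state["optimizer"]["momentum"]
        state["epoch"] += 1
        epochs_run += 1
        save_checkpoint(ckpt_path, state)
    return state, epochs_run

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = os.path.join(tmp_dir, "state.json")
    _, first_run_epochs = run_training(ckpt, total_epochs=5, stop_after=2)
    resumed_state, second_run_epochs = run_training(ckpt, total_epochs=5)
    # Reference run without any interruption, using a separate file.
    full_state, _ = run_training(os.path.join(tmp_dir, "full.json"), total_epochs=5)
```

<p>Because the optimizer state is restored together with the weight and the epoch counter, the resumed run in this sketch ends in the same state as the uninterrupted run.</p>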
</div>
<div class="section" id="EN-US_TOPIC_0000002079097949__section174663914495"><h4 class="sectiontitle">Resumable Training and Incremental Training in ModelArts</h4><p id="EN-US_TOPIC_0000002079097949__p10109318202416">To resume model training or incrementally train a model in ModelArts, configure <strong id="EN-US_TOPIC_0000002079097949__b1698716298394">Training Output</strong>.</p>
<p id="EN-US_TOPIC_0000002079097949__p15375361946">When creating a training job, configure the data path to the training output, save checkpoints in this data path, and set <strong id="EN-US_TOPIC_0000002079097949__b1155753318392">Predownload</strong> to <strong id="EN-US_TOPIC_0000002079097949__b6557113311393">Yes</strong>. If you set <strong id="EN-US_TOPIC_0000002079097949__b1910917519813">Predownload</strong> to <strong id="EN-US_TOPIC_0000002079097949__b510916518818">Yes</strong>, the system automatically downloads the <strong id="EN-US_TOPIC_0000002079097949__b207545513816">checkpoint</strong> file in the training output data path to a local directory of the training container before the training job is started.</p>
<div class="fignone" id="EN-US_TOPIC_0000002079097949__fig585518111466"><span class="figcap"><b>Figure 1 </b>Training Output</span><br><span><img id="EN-US_TOPIC_0000002079097949__image1380018141262" src="figure/en-us_image_0000002079098061.png" width="469.49" height="79.60050000000001" title="Click to enlarge" class="imgResize"></span></div>
<p id="EN-US_TOPIC_0000002079097949__p898511611422">For resumable training, also enable the fault tolerance check (auto restart): on the training job creation page, enable <strong id="EN-US_TOPIC_0000002079097949__b6501727184314">Auto Restart</strong>. If the environment pre-check fails, a hardware fault occurs, or the training job fails, ModelArts automatically resubmits the training job.</p>
<div class="fignone" id="EN-US_TOPIC_0000002079097949__fig9466161462319"><span class="figcap"><b>Figure 2 </b>Auto Restart</span><br><span><img id="EN-US_TOPIC_0000002079097949__image18587183313251" src="figure/en-us_image_0000002043177360.png"></span></div>
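<p>Because a job can fail and be restarted at any moment, it may help to write checkpoints so that a restart never finds a half-written file. The following sketch shows one common pattern: write to a temporary file, then rename it over the target. The function <code>save_checkpoint_atomically</code> and the JSON format are hypothetical. Whether this pattern is needed depends on how your framework writes its checkpoint files.</p>

```python
import json
import os
import tempfile

def save_checkpoint_atomically(state, path):
    """Write state to a temporary file, then rename it over path.

    The temporary file uses a .tmp suffix so that code that looks for
    *.ckpt files does not pick up an incomplete file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        # Replacing a file within the same file system is atomic on POSIX systems.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

<p>A reader of <code>path</code> then sees either the previous complete checkpoint or the new complete checkpoint.</p>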
</div>
<div class="section" id="EN-US_TOPIC_0000002079097949__en-us_topic_0000001130616266_section87151342141114"><h4 class="sectiontitle">reload ckpt for MindSpore</h4><pre class="screen" id="EN-US_TOPIC_0000002079097949__en-us_topic_0000001130616266_screen421282141216">import os
import argparse

from mindspore.nn import Momentum, SoftmaxCrossEntropyWithLogits
from mindspore.train.callback import CheckpointConfig, LossMonitor, ModelCheckpoint
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net

# resnet50, create_dataset, and epoch_size are assumed to be defined in your own training project.
# The batch_size, num_classes, and do_train arguments are assumed to be declared in the same way as train_url.
parser = argparse.ArgumentParser()
parser.add_argument("--train_url", type=str)
# parse_known_args() returns a tuple: (parsed arguments, unrecognized arguments).
args_opt, _ = parser.parse_known_args()
# <strong id="EN-US_TOPIC_0000002079097949__b1668181372218">train_url</strong> is set to <strong id="EN-US_TOPIC_0000002079097949__b12682101317223">/home/ma-user/modelarts/outputs/train_url_0</strong>.
train_url = args_opt.train_url
# Initially defined network, loss function, and optimizer
net = resnet50(args_opt.batch_size, args_opt.num_classes)
ls = SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), 0.01, 0.9)
# Initial epoch value for the first training. A custom initial value of <strong id="EN-US_TOPIC_0000002079097949__b454822152211">epoch_size</strong> is supported in MindSpore 1.3 and later versions.
# cur_epoch_num = 0
# Check whether there is a model file in the OBS output path. If there is no file, the model is trained from the beginning by default. If there is a model file, the CKPT file with the maximum epoch value is loaded as the pre-trained model.
# Note: file names are sorted as strings, so the last name matches the maximum epoch only if the epoch numbers have the same number of digits.
ckpt_files = sorted(file for file in os.listdir(train_url) if file.endswith(".ckpt"))
if ckpt_files:
    last_ckpt = ckpt_files[-1]
    print('last_ckpt:', last_ckpt)
    last_ckpt_file = os.path.join(train_url, last_ckpt)
    # Load the checkpoint.
    param_dict = load_checkpoint(last_ckpt_file)
    print('&gt; load last ckpt and continue training!!')
    # Load model parameters to the network.
    load_param_into_net(net, param_dict)
    # Load model parameters to the optimizer.
    load_param_into_net(opt, param_dict)
    # Obtain the saved epoch value. The model continues to be trained based on the epoch value. This function is supported in MindSpore 1.3 and later versions.
    # if param_dict.get("epoch_num"):
    #     cur_epoch_num = int(param_dict["epoch_num"].data.asnumpy())
model = Model(net, loss_fn=ls, optimizer=opt, metrics={'acc'})
# To train the model, call model.train.
if args_opt.do_train:
    dataset = create_dataset()
    batch_num = dataset.get_dataset_size()
    # Save one checkpoint per epoch and keep at most 35 checkpoint files.
    config_ck = CheckpointConfig(save_checkpoint_steps=batch_num,
                                 keep_checkpoint_max=35)
    # <strong id="EN-US_TOPIC_0000002079097949__b10281940162215">append_info=[{"epoch_num": cur_epoch_num}]</strong>: <strong id="EN-US_TOPIC_0000002079097949__b1528174017222">append_info</strong> is supported in MindSpore 1.3 and later versions and saves the current epoch value in the checkpoint.
    ckpoint_cb = ModelCheckpoint(prefix="train_resnet_cifar10",
                                 directory=args_opt.train_url,
                                 config=config_ck)
    loss_cb = LossMonitor()
    model.train(epoch_size, dataset, callbacks=[ckpoint_cb, loss_cb])
    # In MindSpore 1.3 and later versions, use <strong id="EN-US_TOPIC_0000002079097949__b21995425220">model.train(epoch_size-cur_epoch_num, dataset, callbacks=[ckpoint_cb, loss_cb])</strong> to resume training from the breakpoint.</pre>
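<p>The listing above picks the last file name in sorted order. Sorting file names as strings places epoch 9 after epoch 10, so a numeric comparison may be safer. The following sketch assumes that checkpoint files are named <code>&lt;prefix&gt;-&lt;epoch&gt;_&lt;step&gt;.ckpt</code>. Both the pattern and the helper <code>latest_checkpoint</code> are assumptions for illustration, so verify the file names that your job actually produces.</p>

```python
import re

# Assumed naming pattern: <prefix>-<epoch>_<step>.ckpt
CKPT_PATTERN = re.compile(r"-(\d+)_(\d+)\.ckpt$")

def latest_checkpoint(file_names):
    """Return the file name with the highest (epoch, step), or None if none match."""
    candidates = []
    for name in file_names:
        match = CKPT_PATTERN.search(name)
        if match:
            epoch, step = int(match.group(1)), int(match.group(2))
            candidates.append((epoch, step, name))
    if not candidates:
        return None
    # Tuples compare by epoch first, then by step.
    return max(candidates)[2]

names = [
    "train_resnet_cifar10-9_195.ckpt",
    "train_resnet_cifar10-10_195.ckpt",
    "train_resnet_cifar10-2_195.ckpt",
    "notes.txt",
]
by_number = latest_checkpoint(names)
by_string = sorted(n for n in names if n.endswith(".ckpt"))[-1]
print(by_number)
print(by_string)
```

<p>In a training script, the input would be <code>os.listdir(train_url)</code>, and a return value of <code>None</code> would mean that training starts from the beginning.</p>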
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="develop-modelarts-0021.html">Advanced Training Operations</a></div>
</div>
</div>
<script language="JavaScript">
<!--
image_size('.imgResize');
var msg_imageMax = "view original image";
var msg_imageClose = "close";
//--></script>