ModelArts training management enables you to create training jobs, view training statuses, and manage job versions. Model training is an iterative optimization process. Through unified training management, you can flexibly select algorithms, data, and hyperparameters to obtain the optimal input configuration and model. After comparing metrics between training versions, you can determine the most satisfactory training job.
Parameter | Description |
---|---|
Name | Name of a training job. The system automatically generates a name. You can rename it based on the following naming rules: |
Description | Description of a training job. |
Parameter | Sub-Parameter | Description |
---|---|---|
Algorithm Type > Custom algorithm > Boot Mode | Preset image | If Boot Mode is set to Preset image, select a preset engine and configure the code directory and boot file. |
Algorithm Type > Custom algorithm > Boot Mode | Custom image | If Boot Mode is set to Custom image, specify the image, code directory, and boot command. |
Algorithm Type > Custom algorithm | Local Code Directory | Optional. Local directory in the training container. When a training job starts, the system automatically downloads the code directory to this directory. The default local code directory is /home/ma-user/modelarts/user-job-dir. |
Algorithm Type > Custom algorithm | Work Directory | Directory in the training container where the boot file is located. When a training job starts, the system automatically runs the cd command to switch to this directory. |
Created By | My algorithms | Select an existing algorithm or create one. For details, see Creating an Algorithm. |
Parameter | Sub-Parameter | Description |
---|---|---|
Input | Name | The recommended value is data_url. The training input must match the data input configuration set in your selected algorithm. For details, see Table 2. For example, if you use argparse in the training code to parse data_url into the data input, set the data input parameter to data_url when creating the algorithm. You can select a dataset or a data path for data input. When the training job starts, ModelArts automatically downloads the data in the input path to the container directory for training. |
Input | Dataset | Select an available dataset and its version from the ModelArts Data Management module. Click Dataset and select the target dataset and its version in the dialog box displayed. NOTE: If Dataset is unavailable, the selected algorithm cannot take its training data from a dataset. |
Input | Data path | Select the training data from your OBS bucket. Click Data path and select the OBS bucket and folder in the dialog box displayed. NOTE: If Data path is unavailable, the selected algorithm cannot take its training data from a data path. |
Input | Obtained from | The following uses the training input data_path as an example. If you select Hyperparameters, parse it from the command line with argparse: call parser.add_argument('--data_path'), then args, unknown = parser.parse_known_args(), and read args.data_path. If you select Environment variables, read it with os.getenv("data_path", ""). |
Output | Name | The algorithm code reads the local path to the training output based on this parameter. The recommended value is train_url. The training output must match the data output configuration set in your selected algorithm. For details, see Table 3. For example, if you use argparse in the algorithm code to parse train_url into the data output, set the data output parameter to train_url when creating the algorithm. You can select an OBS path for data output. During training, ModelArts automatically uploads the training output to the OBS path. |
Output | Data path | Storage path of the training output. During and after training, the system automatically synchronizes files from the local directory to this path. Currently, only OBS paths are supported. To minimize errors, select an empty directory. |
Output | Obtained from | The following uses the training output train_url as an example. To obtain it from hyperparameters, parse it with argparse: call parser.add_argument('--train_url'), then args, unknown = parser.parse_known_args(), and read args.train_url. To obtain it from environment variables, read it with os.getenv("train_url", ""). |
Output | Predownload | If you set Predownload to Yes, the system automatically downloads the files in the training output data path to a local directory of the training container before the training job starts. Select Yes for resumable training and incremental training. |
Hyperparameters | None | The value of this parameter varies according to the selected algorithm. If you defined hyperparameters when creating the algorithm, all of them are displayed. Whether a hyperparameter can be modified or deleted depends on the constraints you configured when creating the algorithm. For details, see Defining Hyperparameters. |
Environment Variable | None | Environment variables, which you can add as required. For details about the environment variables preset in the training container, see Viewing Environment Variables of a Training Container. |
Auto Restart | None | Number of retries for a failed training job. If this parameter is enabled, a failed training job is automatically resubmitted and rerun. On the training job details page, you can view the number of retries for a failed training job. |
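The two "Obtained from" options above can be combined in one helper. The following is a minimal sketch, not ModelArts code: the function name resolve_io_paths and the rule of falling back to the environment variable when the hyperparameter is absent are assumptions made for illustration. The names data_path and train_url follow the examples in the table.

```python
import argparse
import os


def resolve_io_paths(argv=None, environ=None):
    """Return (data_path, train_url).

    Hypothetical helper: a command-line hyperparameter wins; otherwise the
    environment variable of the same name is used; otherwise "".
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", default=None)
    parser.add_argument("--train_url", default=None)
    # parse_known_args ignores hyperparameters this script does not declare.
    args, _unknown = parser.parse_known_args(argv)
    data_path = args.data_path or environ.get("data_path", "")
    train_url = args.train_url or environ.get("train_url", "")
    return data_path, train_url
```

Calling resolve_io_paths() with no arguments reads sys.argv and os.environ, which is what a boot file would normally do. Passing argv and environ explicitly makes the helper easy to test outside a training container.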
The training input, training output, and hyperparameters vary according to the selected algorithm.
If the system displays a message for Training Input, indicating there is no input channel for the selected algorithm, you do not need to set data input on this page.
If the system displays a message for Training Output, indicating there is no output channel for the selected algorithm, you do not need to set data output on this page.
If the system displays a message for Hyperparameters, indicating the selected algorithm does not support custom hyperparameters, you do not need to set hyperparameters on this page.
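When Predownload is set to Yes, earlier output is already present in the local output directory at startup, so training code can look there for a checkpoint before starting from scratch. The sketch below illustrates one way to do this under stated assumptions: the file naming scheme ckpt_epoch_N.json, the JSON state, and both function names are hypothetical, and the loop body stands in for real training work.

```python
import json
import os
import re

# Hypothetical checkpoint naming scheme, used only for this sketch.
CKPT_PATTERN = re.compile(r"^ckpt_epoch_(\d+)\.json$")


def find_latest_checkpoint(output_dir):
    """Return (epoch, path) of the newest checkpoint, or (0, None) if none exists."""
    best_epoch, best_path = 0, None
    if not os.path.isdir(output_dir):
        return best_epoch, best_path
    for name in os.listdir(output_dir):
        match = CKPT_PATTERN.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = os.path.join(output_dir, name)
    return best_epoch, best_path


def train(output_dir, total_epochs):
    """Run the remaining epochs, resuming from the latest checkpoint if one exists."""
    start_epoch, ckpt_path = find_latest_checkpoint(output_dir)
    state = {"epoch": 0}
    if ckpt_path is not None:
        with open(ckpt_path) as f:
            state = json.load(f)
    os.makedirs(output_dir, exist_ok=True)
    for epoch in range(start_epoch + 1, total_epochs + 1):
        state = {"epoch": epoch}  # placeholder for one epoch of real training
        with open(os.path.join(output_dir, f"ckpt_epoch_{epoch}.json"), "w") as f:
            json.dump(state, f)
    return state["epoch"]
```

In a training job, output_dir would be the local path that the training output parameter (for example, train_url) resolves to.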
Parameter | Description |
---|---|
Resource Pool | Select a resource pool for the job. Public and dedicated resource pools are available. If you select a dedicated resource pool, you can view details about the pool. If the pool does not have enough available cards, jobs may need to be queued. In this case, use another resource pool or reduce the number of cards required. NOTE: Dedicated resource pools can be connected to your VPCs and subnets. To set up or change the VPC accessible to your dedicated resource pool, see (Optional) Interconnecting a VPC with a ModelArts Network. |
Resource Type | Select CPU or GPU as needed. Set this parameter based on the resource type specified in your training code. |
Instance Flavor | Select a resource flavor based on the resource type. If the type of resources to be used has been specified in your training code, only the options that comply with the constraints of the selected algorithm are available. For example, if GPU is selected in the training code but you select CPU here, the training may fail. During training, ModelArts mounts NVMe SSDs to the /cache directory. You can use this directory to store temporary files. The data disk size varies depending on the resource type. To avoid running out of disk space during training, click Check Input Size to check whether the disk size of the selected instance flavor is sufficient for the input size. |
Compute Nodes | Set the number of compute nodes. The default value is 1. |
Job Priority | When using a new-version dedicated resource pool, you can set the priority of a training job. The value ranges from 1 to 3. The default priority is 1, and the highest priority is 3. By default, the job priority can be set to 1 or 2. After the permission to set the highest job priority is configured, the priority can be set to any value from 1 to 3. You can change the priority of a pending job. |
SFS Turbo | When a dedicated resource pool is used for training, multiple SFS Turbo file systems can be mounted to one training job. A file system can be mounted only once and to only one path. Each mount path must be unique. A maximum of 8 disks can be mounted to a training job. |
Persistent Log Saving | Available when you select CPU or GPU flavors. This function is disabled by default. ModelArts automatically stores the logs for 30 days. You can download all logs on the job details page. After this function is enabled, select an empty OBS path for storing training logs. Ensure that you have read and write permissions on the selected OBS directory. |
Auto Stop | |
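The disk-size concern described for Instance Flavor can also be checked from inside training code before copying data to the temporary directory. The following sketch is a local analogue, not the console's Check Input Size function: both function names and the 1.2 headroom factor are assumptions for illustration, and /cache is the mount point named in the table.

```python
import os
import shutil


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path, skipping symbolic links."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def has_room_for(input_dir, scratch_dir, headroom=1.2):
    """Return True if scratch_dir's file system has at least headroom times
    the size of input_dir free. The headroom factor is an arbitrary margin."""
    free_bytes = shutil.disk_usage(scratch_dir).free
    return free_bytes >= directory_size_bytes(input_dir) * headroom
```

For example, a boot file could call has_room_for(data_path, "/cache") and stop early with a clear message instead of failing partway through a copy.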
A training job generally runs for a period of time. To view the real-time status and basic information of a training job, switch to the training job list.