Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Wuwan, Qi <wuwanqi1@noreply.gitea.eco.tsi-dev.otc-service.com> Co-committed-by: Wuwan, Qi <wuwanqi1@noreply.gitea.eco.tsi-dev.otc-service.com>
Creating a Training Job
Function
This API is used to create a training job.
URI
POST /v2/{project_id}/training-jobs

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| project_id | Yes | String | Project ID. For details, see Obtaining a Project ID and Name. |

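
A minimal sketch of how the request URL might be assembled from the URI template above. The endpoint `https://modelarts.example.com` and the project ID used here are placeholders, not values taken from this reference; substitute the endpoint of your own region.

```python
from urllib.parse import quote


def training_jobs_url(endpoint: str, project_id: str) -> str:
    """Return the URL for POST /v2/{project_id}/training-jobs."""
    # Strip a trailing slash so the path is not doubled, and percent-encode
    # the project ID in case it contains reserved characters.
    return f"{endpoint.rstrip('/')}/v2/{quote(project_id, safe='')}/training-jobs"


# Placeholder endpoint and project ID (assumptions for illustration only).
print(training_jobs_url("https://modelarts.example.com", "0123456789abcdef"))
```
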
Request Parameters

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| kind | Yes | String | Training job type. The default value is job. Options: job (training job) and visualization_job (visualization job). |
| metadata | Yes | JobMetadata object | Metadata of a training job. |
| algorithm | No | JobAlgorithm object | Algorithm used by a training job. The options are as follows: |
| tasks | No | Array of Task objects | Task list. This function is not implemented currently. |
| spec | No | Spec object | Specifications of a training job. If this parameter is specified, leave the tasks parameter blank. |
| endpoints | No | JobEndpointsReq object | Configurations required for remotely accessing a training job. |

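
The top-level fields above could be combined into a request body roughly as follows. This is a sketch only: the engine ID, flavor ID, and job name are hypothetical placeholders, and the nested field names are taken from the parameter tables in this section. `tasks` is omitted because `spec` is set.

```python
import json


def build_training_job_body(name, code_dir, boot_file, engine_id, flavor_id, node_count=1):
    """Assemble a minimal request body for a job that uses code_dir + boot_file."""
    return {
        "kind": "job",
        "metadata": {"name": name},
        "algorithm": {
            "code_dir": code_dir,
            "boot_file": boot_file,
            "engine": {"engine_id": engine_id},
        },
        # spec is set, so the tasks parameter is left out entirely.
        "spec": {"resource": {"flavor_id": flavor_id, "node_count": node_count}},
    }


# All values below are placeholders for illustration.
body = build_training_job_body(
    "demo-job", "/usr/app/", "/usr/app/boot.py", "example-engine-id", "example-flavor-id"
)
print(json.dumps(body, indent=2))
```
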
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| name | Yes | String | Name of a training job. The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-). |
| workspace_id | No | String | Workspace where a job is located. The default value is 0. |
| description | No | String | Training job description. The value must contain 0 to 256 characters. The default value is NULL. |
| annotations | No | Map<String,String> | Advanced configurations of a training job. The options are as follows: |

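
The `name` and `description` constraints above can be checked on the client before the request is sent. The sketch below assumes that "letters" means ASCII letters; the reference does not say whether other alphabets are accepted.

```python
import re

# 1 to 64 characters: digits, ASCII letters, underscores, and hyphens.
_JOB_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_valid_job_name(name: str) -> bool:
    return _JOB_NAME.fullmatch(name) is not None


def is_valid_description(description) -> bool:
    # The description is optional and may hold 0 to 256 characters.
    return description is None or len(description) <= 256


print(is_valid_job_name("resnet50_run-01"))
print(is_valid_job_name("bad name!"))
```
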
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| id | No | String | Algorithm ID. |
| name | No | String | Algorithm name. Leave it blank. |
| subscription_id | No | String | Subscription ID of a subscribed algorithm, which must be used with item_version_id. |
| item_version_id | No | String | Version ID of the subscribed algorithm, which must be used with subscription_id. |
| code_dir | No | String | Code directory of a training job, for example, /usr/app/. This parameter must be used together with boot_file. If id or subscription_id+item_version_id is set, you do not need to set this parameter. |
| boot_file | No | String | Boot file of a training job, which must be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used together with code_dir. If id or subscription_id+item_version_id is set, you do not need to set this parameter. |
| autosearch_config_path | No | String | YAML configuration path of auto search jobs. An OBS URL is required. |
| autosearch_framework_path | No | String | Framework code directory of auto search jobs. An OBS URL is required. |
| command | No | String | Command for starting the container when a training job uses a custom image. |
| parameters | No | Array of Parameters objects | Running parameters of a training job. |
| policies | No | JobPolicies object | Policies supported by jobs, which are used for hyperparameter search. |
| inputs | No | Array of Input objects | Input of a training job. |
| outputs | No | Array of Output objects | Output of a training job. |
| engine | No | JobEngine object | Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm in algorithm management, or subscription_id+item_version_id of the subscribed algorithm. |
| local_code_dir | No | String | Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows: |
| working_dir | No | String | Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode. |
| environments | No | Map<String,String> | Environment variables of a training job. Format: "key":"value". The key can contain a maximum of 8,192 characters, and the value can contain a maximum of 4,096 characters. A maximum of 100 key-value pairs are allowed. The variable name can contain only letters, digits, and underscores (_), and must start with a letter or underscore (_). Note: Variables cannot contain $. |
| summary | No | Summary object | Visualization log summary. |

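
The pairing rules and the `environments` limits in the table above lend themselves to a client-side check. The following is a sketch under two assumptions that the reference does not spell out: variable names use ASCII letters, and the "cannot contain $" rule is applied to values (a `$` in a name is already rejected by the naming rule).

```python
import re

_ENV_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def pairing_errors(algorithm: dict) -> list:
    """Report parameters that must be set together but are not."""
    errors = []
    for a, b in (("code_dir", "boot_file"), ("subscription_id", "item_version_id")):
        if (a in algorithm) != (b in algorithm):
            errors.append(f"{a} and {b} must be set together")
    return errors


def environment_errors(env: dict) -> list:
    """Report violations of the documented environment variable limits."""
    errors = []
    if len(env) > 100:
        errors.append("more than 100 key-value pairs")
    for key, value in env.items():
        if not _ENV_NAME.fullmatch(key):
            errors.append(f"invalid variable name: {key!r}")
        if len(key) > 8192:
            errors.append(f"key longer than 8192 characters: {key[:20]!r}...")
        if len(value) > 4096:
            errors.append(f"value longer than 4096 characters for {key!r}")
        if "$" in value:
            errors.append(f"value of {key!r} contains '$'")
    return errors


print(pairing_errors({"code_dir": "/usr/app/"}))
print(environment_errors({"EPOCHS": "10", "1BAD": "x"}))
```
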
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| name | No | String | Parameter name. |
| value | No | String | Parameter value. |
| description | No | String | Parameter description. |
| constraint | No | ParametersConstraint object | Parameter constraint. |
| i18n_description | No | I18nDescription object | Internationalization description. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| type | No | String | Parameter type. |
| editable | No | Boolean | Whether the parameter is editable. |
| required | No | Boolean | Whether the parameter is mandatory. |
| sensitive | No | Boolean | Whether the parameter is sensitive. This function is not implemented currently. |
| valid_type | No | String | Valid type. |
| valid_range | No | Array of strings | Valid range. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| language | No | String | Internationalization language. |
| description | No | String | Description. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| auto_search | No | AutoSearch object | Hyperparameter search configuration. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| skip_search_params | No | String | Hyperparameters to be skipped during the search. |
| reward_attrs | No | Array of RewardAttrs objects | Search metrics. |
| search_params | No | Array of SearchParams objects | Search parameters. |
| algo_configs | No | Array of AlgoConfigs objects | Search algorithm configurations. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| name | No | String | Metric name. |
| mode | No | String | Search mode. max: the larger the metric value, the better. min: the smaller the metric value, the better. |
| regex | No | String | Regular expression of a metric. |

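
To illustrate how `mode` and `regex` relate, the sketch below pulls metric values out of log text and picks the best one. It assumes the regular expression has exactly one capturing group holding the numeric value; the reference does not specify how the service applies `regex`, and the log lines and pattern are made up for the example.

```python
import re


def extract_metric(log_text: str, regex: str) -> list:
    """Return every value captured by the single group in regex, as floats."""
    return [float(m) for m in re.findall(regex, log_text)]


def best_metric(values, mode: str) -> float:
    """Pick the best value: the largest for max, the smallest for min."""
    if mode == "max":
        return max(values)
    if mode == "min":
        return min(values)
    raise ValueError(f"unsupported mode: {mode!r}")


# Hypothetical training log and pattern.
log = "epoch 1 accuracy=0.81\nepoch 2 accuracy=0.93\n"
values = extract_metric(log, r"accuracy=(\d+\.\d+)")
print(best_metric(values, "max"))
```
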
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| name | No | String | Hyperparameter name. |
| param_type | No | String | Parameter type. continuous: The hyperparameter is of the continuous type and is displayed as a text box on the console when the algorithm is used in a training job. discrete: The hyperparameter is of the discrete type and is displayed as a drop-down list on the console when the algorithm is used in a training job. |
| lower_bound | No | String | Lower bound of the hyperparameter. |
| upper_bound | No | String | Upper bound of the hyperparameter. |
| discrete_points_num | No | String | Number of discrete points of a hyperparameter with continuous values. |
| discrete_values | No | Array of strings | Discrete hyperparameter values. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| name | No | String | Name of the search algorithm. |
| params | No | Array of AutoSearchAlgoConfigParameter objects | Search algorithm parameters. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| key | No | String | Parameter key. |
| value | No | String | Parameter value. |
| type | No | String | Parameter type. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| engine_id | No | String | Engine ID selected for a training job. Specify the engine using engine_id, engine_name + engine_version, or image_url. |
| engine_name | No | String | Name of the engine selected for a training job. If engine_id has been set, you do not need to set this parameter. If you use a preset framework and custom image to create a training job, you must set both this parameter and image_url. |
| engine_version | No | String | Version of the engine selected for a training job. If engine_id has been set, you do not need to set this parameter. |
| image_url | No | String | Custom image URL selected for a training job. The URL is obtained from SWR. You can select an image or enter an image in the format of "Organization name/Image name:tag". |
| install_sys_packages | No | Boolean | Whether to install the MoXing version specified by the training platform. Value true means to install the specified MoXing version. This parameter is available only when engine_name, engine_version, and image_url are set. |

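
The three ways of selecting an engine can be told apart with a small helper such as the one below. The order in which the alternatives are checked is an assumption of this sketch, not documented service behavior, and the sample values are placeholders.

```python
def engine_selection(engine: dict) -> str:
    """Return which of the documented selection methods an engine object uses."""
    if engine.get("engine_id"):
        return "engine_id"
    if engine.get("engine_name") and engine.get("engine_version"):
        return "engine_name+engine_version"
    if engine.get("image_url"):
        return "image_url"
    raise ValueError("set engine_id, engine_name + engine_version, or image_url")


def may_set_install_sys_packages(engine: dict) -> bool:
    # Available only when engine_name, engine_version, and image_url are all set.
    return all(engine.get(k) for k in ("engine_name", "engine_version", "image_url"))


# Placeholder values for illustration.
print(engine_selection({"image_url": "example-org/example-image:1.0"}))
```
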
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| log_type | No | String | Visualization log type of a training job. After this parameter is configured, the training job can be used as the data source of a visualization job. The options are as follows: |
| log_dir | No | LogDir object | Visualization log output of a training job. This parameter is mandatory when log_type is not empty. |
| data_sources | No | Array of DataSource objects | Visualization log input of a visualization job or debug training job. This parameter is mandatory when tensorboard/enable or mindstudio-insight/enable is set to true for advanced training functions. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| pfs | Yes | PFSSummary object | Output of an OBS parallel file system. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| pfs_path | Yes | String | URL of an OBS parallel file system. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| role | No | String | Task role. This function is not supported currently. |
| algorithm | No | algorithm object | Algorithm management and configuration. |
| task_resource | No | task_resource object | Resource flavors of a training job. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| job_config | No | job_config object | Algorithm configuration, such as the boot file. |
| code_dir | No | String | Algorithm code directory, for example, /usr/app/. This parameter must be used together with boot_file. |
| boot_file | No | String | Boot file of the algorithm, which must be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used together with code_dir. |
| engine | No | engine object | Engine of a heterogeneous job algorithm. |
| inputs | No | Array of inputs objects | Data input of an algorithm. |
| outputs | No | Array of outputs objects | Data output of an algorithm. |
| local_code_dir | No | String | Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows: |
| working_dir | No | String | Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| parameters | No | Array of Parameter objects | Running parameters of an algorithm. |
| inputs | No | Array of Input objects | Data input of an algorithm. |
| outputs | No | Array of Output objects | Data output of an algorithm. |
| engine | No | engine object | Algorithm engine. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| name | No | String | Parameter name. |
| value | No | String | Parameter value. |
| description | No | String | Parameter description. |
| constraint | No | constraint object | Parameter constraint. |
| i18n_description | No | i18n_description object | Internationalization description. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| type | No | String | Parameter type. |
| editable | No | Boolean | Whether the parameter is editable. |
| required | No | Boolean | Whether the parameter is mandatory. |
| sensitive | No | Boolean | Whether the parameter is sensitive. This function is not implemented currently. |
| valid_type | No | String | Valid type. |
| valid_range | No | Array of strings | Valid range. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| language | No | String | Internationalization language. |
| description | No | String | Description in the specified language. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| name | Yes | String | Name of the data input channel. |
| description | No | String | Description of the data input channel. |
| local_dir | No | String | Local directory of the container to which the data input channel is mapped. Example: /home/ma-user/modelarts/inputs/data_url_0. |
| remote | Yes | InputDataInfo object | Information of the data input. Enums: |
| remote_constraint | No | Array of remote_constraint objects | Data input constraint. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| dataset | No | dataset object | Dataset as the data input. |
| obs | No | obs object | OBS location where input and output data is stored. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| id | Yes | String | Dataset ID of a training job. |
| version_id | Yes | String | Dataset version ID of a training job. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| obs_url | Yes | String | OBS URL of the dataset required by a training job. For example, /usr/data/. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| data_type | No | String | Data input type, including the data storage location and dataset. |
| attributes | No | String | Attributes if a dataset is used as the data input. Options: |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| name | Yes | String | Name of the data output channel. |
| description | No | String | Description of the data output channel. |
| local_dir | No | String | Local directory of the container to which the data output channel is mapped. |
| remote | Yes | Remote object | Description of the actual data output. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| obs | Yes | RemoteObs object | OBS to which data is actually exported. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| obs_url | Yes | String | OBS URL to which data is exported. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| engine_id | No | String | Engine ID selected for an algorithm. |
| engine_name | No | String | Engine name selected for an algorithm. If engine_id is specified, leave this parameter blank. |
| engine_version | No | String | Engine version selected for an algorithm. If engine_id is specified, leave this parameter blank. |
| image_url | No | String | Custom image URL selected by an algorithm. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| engine_id | No | String | Engine ID of a heterogeneous job, for example, caffe-1.0.0-python2.7. |
| engine_name | No | String | Engine name of a heterogeneous job, for example, Caffe. |
| engine_version | No | String | Engine version of a heterogeneous job. |
| image_url | No | String | Custom image URL selected by an algorithm. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| name | Yes | String | Name of the data input channel. |
| description | No | String | Description of the data input channel. |
| local_dir | No | String | Local directory of the container to which the data input channel is mapped. |
| remote | Yes | remote object | Information of the data input. Enums: |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| obs | No | obs object | OBS location where input and output data is stored. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| obs_url | Yes | String | OBS URL of the dataset required by a training job. For example, /usr/data/. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| name | Yes | String | Name of the data output channel. |
| description | No | String | Description of the data output channel. |
| local_dir | No | String | Local directory of the container to which the data output channel is mapped. |
| remote | Yes | remote object | Description of the actual data output. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| obs | Yes | obs object | OBS to which data is actually exported. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| obs_url | Yes | String | OBS URL to which data is exported. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| flavor_id | No | String | Resource flavor ID of a training job. |
| node_count | Yes | Integer | Number of resource replicas selected for a training job. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| resource | No | SpecResource object | Resource flavor of a training job. Select either flavor_id or pool_id+flavor_id. |
| volumes | No | Array of SpecVolumes objects | Volumes attached to a training job. |
| log_export_path | No | LogExportPath object | Export path of training job logs. |
| auto_stop | No | AutoStop object | Auto stop configuration of a training job. |
| schedule_policy | No | SchedulePolicy object | Training job scheduling policy. |
| notification | No | Notification object | Training event notification. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| flavor_id | No | String | ID of the resource flavor selected for a training job. flavor_id cannot be specified for dedicated resource pools with CPU specifications. The options for dedicated resource pools with GPU specifications are as follows: |
| node_count | No | Integer | Number of nodes used for creating a training job in a pool. By default, a single node is used. |
| pool_id | No | String | Dedicated resource pool ID. |

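
A sketch of how the `resource` object might be assembled from the fields above. It accepts `flavor_id` alone, `pool_id` with `flavor_id`, or `pool_id` alone (for the CPU dedicated pool case where `flavor_id` cannot be specified); which of these combinations the service actually accepts for a given pool is not verified here, and the IDs are placeholders.

```python
def build_resource(flavor_id=None, pool_id=None, node_count=1) -> dict:
    """Build a SpecResource-style dict, leaving out fields that are not set."""
    if flavor_id is None and pool_id is None:
        raise ValueError("set flavor_id, pool_id, or both")
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    resource = {"node_count": node_count}  # a single node is the default
    if flavor_id is not None:
        resource["flavor_id"] = flavor_id
    if pool_id is not None:
        resource["pool_id"] = pool_id
    return resource


# Placeholder IDs for illustration.
print(build_resource(flavor_id="example-flavor-id"))
print(build_resource(flavor_id="example-flavor-id", pool_id="example-pool-id", node_count=2))
```
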
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| nfs | No | Nfs object | NFS volumes attached to a training job. |
| pfs | No | Pfs object | obsfs volumes attached to a training job. |
| obs | No | Obs object | OBS volumes attached to a training job. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| nfs_server_path | No | String | NFS server path, for example, 10.10.10.10:/example/path. |
| local_path | No | String | Path for attaching volumes to the training container, for example, /example/path. |
| read_only | No | Boolean | Whether the disks attached to the container in NFS mode are read-only. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| pfs_path | No | String | obsfs path, for example, /test-bucket/path. |
| local_path | No | String | Path for attaching volumes to the training container, for example, /example/path. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| obs_path | No | String | OBS path to be attached, for example, /test-bucket/path. |
| local_path | No | String | Path for attaching volumes to the training container, for example, /example/path. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| obs_url | No | String | OBS path for storing training job logs, for example, obs://example/path. |
| host_path | No | String | Path of the host where training job logs are stored, for example, /example/path. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| time_unit | Yes | String | Time unit. The options are as follows: |
| duration | Yes | Integer | Running duration. The minimum value is 1. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| required_affinity | No | RequiredAffinity object | Affinity requirements for training jobs. |
| priority | No | Integer | Priority of the training job. |
| preemptible | No | Boolean | Whether preemption is allowed. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| affinity_type | No | String | Affinity scheduling policy. Possible values are as follows: |
| affinity_group_size | No | Integer | Affinity group size. This parameter is mandatory when affinity_type is set to hyperinstance. In this case, the system schedules the number of tasks specified by affinity_group_size to one supernode to form an affinity group. If the affinity group size is not set when a training job is submitted to a supernode resource pool, the system uses 1 by default. |

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| topic_urn | No | String | URN of the selected topic in SMN. |
| events | No | Array of strings | Training events that trigger message notification. Options: JobStarted (the job is started), JobCompleted (the job is completed), JobFailed (the job failed), JobTerminated (the job is terminated), JobRestarted (the job is restarted), JobHanged (the job is suspended), JobPreempted (the job is preempted). |

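
The event names listed for `events` can be validated before a notification object is built. A minimal sketch follows; the topic URN shown is a made-up placeholder, not a real SMN topic.

```python
ALLOWED_EVENTS = frozenset({
    "JobStarted", "JobCompleted", "JobFailed", "JobTerminated",
    "JobRestarted", "JobHanged", "JobPreempted",
})


def build_notification(topic_urn: str, events) -> dict:
    """Build a Notification-style dict, rejecting event names not in the list above."""
    unknown = sorted(set(events) - ALLOWED_EVENTS)
    if unknown:
        raise ValueError(f"unsupported events: {unknown}")
    return {"topic_urn": topic_urn, "events": list(events)}


# Hypothetical URN for illustration.
print(build_notification("urn:smn:example-topic", ["JobCompleted", "JobFailed"]))
```
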
| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| ssh | No | SSHReq object | SSH connection information. |

Response Parameters
Status code: 201

| Parameter | Type | Description |
|---|---|---|
| kind | String | Training job type, which is job by default. Options: |
| metadata | JobMetadata object | Metadata of a training job. |
| status | Status object | Status of a training job. You do not need to set this parameter when creating a job. |
| algorithm | JobAlgorithmResponse object | Algorithm used by a training job. The options are as follows: |
| tasks | Array of TaskResponse objects | List of tasks in heterogeneous training jobs. |
| spec | SpecResponce object | Specifications of a training job. |
| endpoints | JobEndpointsResp object | Configurations required for remotely accessing a training job. |

| Parameter | Type | Description |
|---|---|---|
| id | String | Training job ID, which is generated and returned by ModelArts after the training job is created. |
| name | String | Name of a training job. The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-). |
| workspace_id | String | Workspace where a job is located. The default value is 0. |
| description | String | Training job description. The value must contain 0 to 256 characters. The default value is NULL. |
| create_time | Long | Time when a training job was created, in milliseconds. The value is generated and returned by ModelArts after a training job is created. |
| user_name | String | Username for creating a training job. The username is generated and returned by ModelArts after a training job is created. |
| annotations | Map<String,String> | Advanced configurations of a training job. The options are as follows: |

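
Because `create_time` is returned in milliseconds, it has to be divided by 1,000 before it is passed to most date libraries. The sketch below assumes the value counts from the Unix epoch, which the reference does not state explicitly.

```python
from datetime import datetime, timezone


def ms_to_utc(ms: int) -> datetime:
    """Convert a millisecond timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Example value, not taken from a real response.
print(ms_to_utc(1_700_000_000_000).isoformat())  # 2023-11-14T22:13:20+00:00
```
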
| Parameter | Type | Description |
|---|---|---|
| phase | String | Level-1 status of a training job. The options are: |
| secondary_phase | String | Level-2 status of a training job. This is an internal detailed status whose values may be added, modified, or deleted, so do not depend on it. The options are: |
| duration | Long | Running duration of a training job, in milliseconds. |
| node_count_metrics | Array<Array<Integer>> | Node count changes during the training job running period. |
| tasks | Array of strings | Tasks of a training job. |
| start_time | Long | Start time of a training job. The value is in timestamp format. |
| task_statuses | Array of TaskStatuses objects | Status of a training job task. |
| running_records | Array of RunningRecord objects | Running and fault recovery records of a training job. |

| Parameter | Type | Description |
|---|---|---|
| task | String | Task of a training job. |
| exit_code | Integer | Exit code of a training job task. |
| message | String | Error message of a training job task. |

| Parameter | Type | Description |
|---|---|---|
| start_at | Integer | Unix timestamp of the start time in the current running record, in seconds. |
| end_at | Integer | Unix timestamp of the end time in the current running record, in seconds. |
| start_type | String | Startup mode of the current running record. |
| end_reason | String | Reason why the current running record ends. |
| end_related_task | String | ID of the task worker that causes the end of the current running record, for example, worker-0. |
| end_recover | String | Fault tolerance policy used after the current running record ends. The enums are as follows: |
| end_recover_before_downgrade | String | Fault tolerance policy used after the current running record ends and before the fault tolerance policy is degraded. The options are the same as those of end_recover. |

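
Note the unit difference: `start_at` and `end_at` in a running record are in seconds, whereas `duration` in the status object is in milliseconds. The sketch below sums the time covered by finished running records; the sample records are invented, and skipping records that lack `end_at` is an assumption about how an in-progress record would look.

```python
def total_running_seconds(records) -> int:
    """Sum end_at - start_at over records that have both timestamps."""
    total = 0
    for record in records:
        start, end = record.get("start_at"), record.get("end_at")
        if start is None or end is None:
            continue  # assumed to be a record that has not ended yet
        total += end - start
    return total


# Invented sample data for illustration.
records = [
    {"start_at": 1_700_000_000, "end_at": 1_700_000_600, "end_related_task": "worker-0"},
    {"start_at": 1_700_000_700, "end_at": 1_700_001_000},
    {"start_at": 1_700_001_100},
]
seconds = total_running_seconds(records)
print(seconds, "s =", seconds * 1000, "ms")
```
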
| Parameter | Type | Description |
|---|---|---|
| id | String | Algorithm ID. |
| name | String | Algorithm name. |
| subscription_id | String | Subscription ID of a subscribed algorithm, which must be used with item_version_id. |
| item_version_id | String | Version ID of the subscribed algorithm, which must be used with subscription_id. |
| code_dir | String | Code directory of a training job, for example, /usr/app/. This parameter must be used together with boot_file. If id or subscription_id+item_version_id has been set, you do not need to set this parameter. |
| boot_file | String | Boot file of a training job, which must be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used together with code_dir. If id or subscription_id+item_version_id has been set, you do not need to set this parameter. |
| autosearch_config_path | String | YAML configuration path of an auto search job. An OBS URL is required. For example, obs://bucket/file.yaml. |
| autosearch_framework_path | String | Framework code directory of auto search jobs. An OBS URL is required. For example, obs://bucket/files/. |
| command | String | Boot command for starting the container of a custom image for a training job. For example, python train.py. |
| parameters | Array of Parameter objects | Running parameters of a training job. |
| policies | policies object | Policies supported by jobs. |
| inputs | Array of Input objects | Input of a training job. |
| outputs | Array of Output objects | Output of a training job. |
| engine | JobEngine object | Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm in algorithm management, or subscription_id+item_version_id of the subscribed algorithm. |
| local_code_dir | String | Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows: |
| working_dir | String | Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode. |
| environments | Array of Map<String,String> objects | Environment variables of a training job. The format is key:value. Leave this parameter blank. |
| summary | Summary object | Visualization log summary. |

| Parameter | Type | Description |
|---|---|---|
| name | String | Parameter name. |
| value | String | Parameter value. |
| description | String | Parameter description. |
| constraint | constraint object | Parameter constraint. |
| i18n_description | i18n_description object | Internationalization description. |

| Parameter | Type | Description |
|---|---|---|
| type | String | Parameter type. |
| editable | Boolean | Whether the parameter is editable. |
| required | Boolean | Whether the parameter is mandatory. |
| sensitive | Boolean | Whether the parameter is sensitive. This function is not implemented currently. |
| valid_type | String | Valid type. |
| valid_range | Array of strings | Valid range. |

| Parameter | Type | Description |
|---|---|---|
| language | String | Internationalization language. |
| description | String | Description in the specified language. |

| Parameter | Type | Description |
|---|---|---|
| auto_search | auto_search object | Hyperparameter search configuration. |

| Parameter | Type | Description |
|---|---|---|
| skip_search_params | String | Hyperparameters to be skipped during the search. |
| reward_attrs | Array of reward_attrs objects | List of search metrics. |
| search_params | Array of search_params objects | Search parameters. |
| algo_configs | Array of algo_configs objects | Search algorithm configurations. |

| Parameter | Type | Description |
|---|---|---|
| name | String | Metric name. |
| mode | String | Search mode. |
| regex | String | Regular expression of a metric. |

| Parameter | Type | Description |
|---|---|---|
| name | String | Name of the search algorithm. |
| params | Array of AutoSearchAlgoConfigParameter objects | Search algorithm parameters. |

| Parameter | Type | Description |
|---|---|---|
| key | String | Parameter key. |
| value | String | Parameter value. |
| type | String | Parameter type. |

| Parameter | Type | Description |
|---|---|---|
| name | String | Name of the data input channel. |
| description | String | Description of the data input channel. |
| local_dir | String | Local directory of the container to which the data input channel is mapped. Example: /home/ma-user/modelarts/inputs/data_url_0. |
| remote | InputDataInfo object | Information of the data input. Enums: |
| remote_constraint | Array of remote_constraint objects | Data input constraint. |

| Parameter | Type | Description |
|---|---|---|
| dataset | dataset object | Dataset as the data input. |
| obs | obs object | OBS location where input and output data is stored. |

| Parameter | Type | Description |
|---|---|---|
| id | String | Dataset ID of a training job. |
| version_id | String | Dataset version ID of a training job. |
| obs_url | String | OBS URL of the dataset for a training job. It is automatically parsed by ModelArts based on the dataset ID and dataset version ID. For example, /usr/data/. |

| Parameter | Type | Description |
|---|---|---|
| obs_url | String | OBS URL of the dataset required by a training job. For example, /usr/data/. |

| Parameter | Type | Description |
|---|---|---|
| data_type | String | Data input type, including the data storage location and dataset. |
| attributes | String | Attributes if a dataset is used as the data input. Options: |

| Parameter | Type | Description |
|---|---|---|
| name | String | Name of the data output channel. |
| description | String | Description of the data output channel. |
| local_dir | String | Local directory of the container to which the data output channel is mapped. |
| remote | Remote object | Description of the actual data output. |

| Parameter | Type | Description |
|---|---|---|
| engine_id | String | Engine ID selected for a training job. Specify the engine using engine_id, engine_name + engine_version, or image_url. |
| engine_name | String | Name of the engine selected for a training job. If engine_id has been set, you do not need to set this parameter. If you use a preset framework and custom image to create a training job, you must set both this parameter and image_url. |
| engine_version | String | Version of the engine selected for a training job. If engine_id has been set, you do not need to set this parameter. |
| image_url | String | Custom image URL selected for a training job. The URL is obtained from SWR. You can select an image or enter an image in the format of "Organization name/Image name:tag". |
| install_sys_packages | Boolean | Whether to install the MoXing version specified by the training platform. Value true means to install the specified MoXing version. This parameter is available only when engine_name, engine_version, and image_url are set. |

| Parameter | Type | Description |
|---|---|---|
| log_type | String | Visualization log type of a training job. After this parameter is configured, the training job can be used as the data source of a visualization job. The options are as follows: |
| log_dir | LogDir object | Visualization log output of a training job. This parameter is mandatory when log_type is not empty. |
| data_sources | Array of DataSource objects | Visualization log input of a visualization job or debug training job. This parameter is mandatory when tensorboard/enable or mindstudio-insight/enable is set to true for advanced training functions. |

Parameter |
Type |
Description |
|---|---|---|
pfs |
PFSSummary object |
Output of an OBS parallel file system. |
Parameter |
Type |
Description |
|---|---|---|
role |
String |
Task role. This function is not supported currently. |
algorithm |
TaskResponseAlgorithm object |
Algorithm management and configuration. |
task_resource |
FlavorResponse object |
Flavors of a training job or an algorithm. |
| Parameter | Type | Description |
|---|---|---|
| code_dir | String | Absolute path of the directory where the algorithm boot file is stored. |
| boot_file | String | Absolute path of the algorithm boot file. |
| inputs | AlgorithmInput object | Algorithm input channel. |
| outputs | AlgorithmOutput object | Algorithm output channel. |
| engine | AlgorithmEngine object | Engine on which a heterogeneous job depends. |
| local_code_dir | String | Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows: |
| working_dir | String | Working directory in which an algorithm is executed. This parameter does not take effect in v1 compatibility mode. |
| Parameter | Type | Description |
|---|---|---|
| name | String | Name of the data input channel. |
| local_dir | String | Local path of the container to which the data input and output channels are mapped. |
| remote | AlgorithmRemote object | Actual data input, which can only be OBS for heterogeneous jobs. |
| Parameter | Type | Description |
|---|---|---|
| obs | RemoteObs object | OBS in which data input and output are stored. |
| Parameter | Type | Description |
|---|---|---|
| name | String | Name of the data output channel. |
| local_dir | String | Local directory of the container to which the data output channel is mapped. |
| remote | Remote object | Description of the actual data output. |
| mode | String | Data transmission mode. The default value is upload_periodically. |
| period | String | Data transmission period. The default value is 30s. |
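The defaults documented for mode and period can be applied when an output channel is read back. The following is a minimal Python sketch. The helper names are hypothetical, and the assumption that period always has the form of an integer followed by s is inferred only from the documented default 30s.

```python
import re

# Defaults taken from the parameter descriptions above.
DEFAULT_MODE = "upload_periodically"
DEFAULT_PERIOD = "30s"


def with_output_defaults(output: dict) -> dict:
    """Return a copy of an output channel with the documented defaults filled in."""
    merged = {"mode": DEFAULT_MODE, "period": DEFAULT_PERIOD}
    # Keep only values that are actually set so that None does not hide a default.
    merged.update({key: value for key, value in output.items() if value is not None})
    return merged


def period_seconds(period: str) -> int:
    """Convert a period such as "30s" to seconds (assumed "<integer>s" format)."""
    match = re.fullmatch(r"(\d+)s", period)
    if match is None:
        raise ValueError(f"unsupported period format: {period!r}")
    return int(match.group(1))
```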
| Parameter | Type | Description |
|---|---|---|
| obs | RemoteObs object | OBS to which data is actually exported. |
| Parameter | Type | Description |
|---|---|---|
| engine_id | String | Engine ID, for example, caffe-1.0.0-python2.7. |
| engine_name | String | Engine name, for example, Caffe. |
| engine_version | String | Engine version. An engine with a given name can have multiple versions, for example, Caffe-1.0.0-python2.7 for Python 2.7. |
| v1_compatible | Boolean | Whether the v1 compatibility mode is used. |
| run_user | String | UID of the user that the engine runs as by default. |
| image_url | String | Custom image URL selected for an algorithm. |
| Parameter | Type | Description |
|---|---|---|
| flavor_id | String | ID of the resource flavor. |
| flavor_name | String | Name of the resource flavor. |
| max_num | Integer | Maximum number of nodes in a resource flavor. |
| flavor_type | String | Resource flavor type. Options: |
| billing | BillingInfo object | Billing information of a resource flavor. |
| flavor_info | FlavorInfoResponse object | Resource flavor details. |
| attributes | Map<String,String> | Other specification attributes. |
| Parameter | Type | Description |
|---|---|---|
| max_num | Integer | Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported. |
| cpu | Cpu object | CPU specifications. |
| gpu | Gpu object | GPU specifications. |
| npu | Npu object | NPU specifications. |
| memory | Memory object | Memory information. |
| disk | DiskResponse object | Disk information. |
| Parameter | Type | Description |
|---|---|---|
| size | Integer | Disk size. |
| unit | String | Unit of the disk size. |
| Parameter | Type | Description |
|---|---|---|
| resource | Resource object | Resource flavors of a training job. Select either flavor_id or pool_id+[flavor_id]. |
| volumes | Array of JobVolume objects | Volumes attached for a training job. |
| log_export_path | LogExportPath object | Export path of training job logs. |
| schedule_policy | SchedulePolicy object | Training job scheduling policy. |
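The rule on the resource parameter (either flavor_id, or pool_id with an optional flavor_id) can be expressed as a small client-side check. The following Python sketch is one reading of that rule, not the service's own validation. The function name resource_mode and the returned labels "pool" and "flavor" are hypothetical.

```python
def resource_mode(resource: dict) -> str:
    """Classify a resource object by how its flavor is selected.

    Returns "pool" when pool_id is set (flavor_id is then optional),
    "flavor" when only flavor_id is set, and raises ValueError otherwise.
    The labels are illustrative and are not API values.
    """
    if resource.get("pool_id"):
        return "pool"
    if resource.get("flavor_id"):
        return "flavor"
    raise ValueError("resource must set flavor_id, or pool_id with an optional flavor_id")
```

With the values used in the request examples of this section, a resource that contains only flavor_id modelarts.p3.large.public.free is classified as "flavor", and a resource that contains pool_id poolfaf38d76 is classified as "pool".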
| Parameter | Type | Description |
|---|---|---|
| policy | String | Resource specification mode of a training job. The value can be regular, indicating the standard mode. |
| flavor_id | String | ID of the resource flavor selected for a training job. flavor_id cannot be specified for dedicated resource pools with CPU specifications. The options for dedicated resource pools with GPU specifications are as follows: |
| flavor_name | String | Read-only flavor name returned by ModelArts when flavor_id is used. |
| node_count | Integer | Number of resource replicas selected for a training job. |
| pool_id | String | Resource pool ID selected for a training job. |
| flavor_detail | FlavorDetail object | Flavor details of a training job or algorithm. This parameter is available only for public resource pools. |
| Parameter | Type | Description |
|---|---|---|
| flavor_type | String | Resource flavor type. The options are as follows: |
| billing | BillingInfo object | Billing information of a resource flavor. |
| flavor_info | FlavorInfo object | Resource flavor details. |
| Parameter | Type | Description |
|---|---|---|
| code | String | Billing code. |
| unit_num | Integer | Number of billing units. |
| Parameter | Type | Description |
|---|---|---|
| max_num | Integer | Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported. |
| cpu | Cpu object | CPU specifications. |
| gpu | Gpu object | GPU specifications. |
| npu | Npu object | NPU specifications. |
| memory | Memory object | Memory information. |
| disk | Disk object | Disk information. |
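Because a max_num value of 1 indicates that the distributed mode is not supported, a client can read this field to decide whether a multi-node job is possible with a flavor. The following is a minimal Python sketch. The function name is hypothetical, and treating a missing max_num as 1 is an assumption made here for safety, not documented behavior.

```python
def supports_distributed(flavor_info: dict) -> bool:
    """Return True if the flavor allows more than one node.

    Assumption: a missing max_num is treated like 1 (no distributed mode).
    """
    return int(flavor_info.get("max_num", 1)) > 1
```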
| Parameter | Type | Description |
|---|---|---|
| arch | String | CPU architecture. |
| core_num | Integer | Number of cores. |
| Parameter | Type | Description |
|---|---|---|
| unit_num | Integer | Number of GPUs. |
| product_name | String | Product name. |
| memory | String | Memory. |
| Parameter | Type | Description |
|---|---|---|
| unit_num | String | Number of NPUs. |
| product_name | String | Product name. |
| memory | String | Memory. |
| Parameter | Type | Description |
|---|---|---|
| size | Integer | Memory size. |
| unit | String | Unit of the memory size, for example, GB. |
| Parameter | Type | Description |
|---|---|---|
| size | String | Disk size. |
| unit | String | Unit of the disk size, which is GB generally. |
| Parameter | Type | Description |
|---|---|---|
| nfs_server_path | String | NFS server path, for example, 10.10.10.10:/example/path. |
| local_path | String | Path for attaching volumes to the training container, for example, /example/path. |
| read_only | Boolean | Whether the disks attached to the container in NFS mode are read-only. |
| Parameter | Type | Description |
|---|---|---|
| obs_url | String | OBS path for storing training job logs, for example, obs://example/path. |
| host_path | String | Path of the host where training job logs are stored, for example, /example/path. |
| Parameter | Type | Description |
|---|---|---|
| required_affinity | RequiredAffinity object | Affinity requirements for training jobs. |
| priority | Integer | Priority of the training job. |
| preemptible | Boolean | Whether preemption is allowed. |
| Parameter | Type | Description |
|---|---|---|
| affinity_type | String | Affinity scheduling policy. Possible values are as follows: |
| affinity_group_size | Integer | Affinity group size. This parameter is mandatory when affinity_type is set to hyperinstance. In this case, the system schedules the number of tasks specified by affinity_group_size onto one supernode to form an affinity group. If a training job is submitted to a supernode resource pool without an affinity group size, the system sets the value to 1. |
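The dependency between affinity_type and affinity_group_size can be checked before a request is built. The following is a minimal Python sketch of such a check. The function name is hypothetical, and the requirement that the size be a positive integer is an assumption. The sketch does not model the server-side default of 1 that applies to supernode resource pools.

```python
def check_required_affinity(affinity: dict) -> None:
    """Raise ValueError if affinity_group_size is missing or invalid.

    Only the documented rule is encoded: the size is mandatory when
    affinity_type is "hyperinstance". Requiring a positive integer is
    an assumption of this sketch.
    """
    size = affinity.get("affinity_group_size")
    if affinity.get("affinity_type") == "hyperinstance" and size is None:
        raise ValueError("affinity_group_size is required for hyperinstance")
    if size is not None and (not isinstance(size, int) or size < 1):
        raise ValueError("affinity_group_size must be a positive integer")
```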
| Parameter | Type | Description |
|---|---|---|
| ssh | SSHResp object | SSH connection information. |
| jupyter_lab | JupyterLab object | JupyterLab connection information. |
| tensorboard | Tensorboard object | TensorBoard connection information. |
| mindstudio_insight | MindStudioInsight object | MindStudio Insight connection information. |
| Parameter | Type | Description |
|---|---|---|
| key_pair_names | Array of strings | Names of the SSH key pairs. Key pairs can be created and viewed on the Key Pair page of the ECS console. |
| task_urls | Array of TaskUrls objects | SSH connection address information. |
| Parameter | Type | Description |
|---|---|---|
| task | String | ID of a training job. |
| url | String | SSH connection address of a training job. |
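The ssh and task_urls structures above can be flattened into a lookup from task to SSH address. The following is a minimal Python sketch. The function name is hypothetical, and the format of the url value is not described here, so it is passed through unchanged.

```python
def ssh_urls_by_task(endpoints: dict) -> dict:
    """Map each task value to its SSH connection address.

    Expects the endpoints object of a response. Missing or null ssh
    and task_urls fields yield an empty mapping.
    """
    ssh = endpoints.get("ssh") or {}
    return {item["task"]: item["url"] for item in ssh.get("task_urls") or []}
```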
| Parameter | Type | Description |
|---|---|---|
| url | String | JupyterLab address of a training job. |
| token | String | JupyterLab token of a training job. |
| Parameter | Type | Description |
|---|---|---|
| url | String | TensorBoard URL of a training job. |
| token | String | TensorBoard token of a training job. |
| Parameter | Type | Description |
|---|---|---|
| url | String | MindStudio Insight URL of a training job. |
| token | String | MindStudio Insight token of a training job. |
Status code: 400
| Parameter | Type | Description |
|---|---|---|
| error_msg | String | Error message. |
| error_code | String | Error code. |
| error_solution | String | Suggested solution. |
Example Requests
The following is an example of how to create a training job with free specifications. The job name has been set to TestModelArtsJob and the description has been set to This is a ModelArts job. The required algorithm's ID is 3f5d6706-7b67-408d-8ba0-ec08048c45ed. The inputs and outputs have not been defined for the algorithm.
POST https://endpoint/v2/{project_id}/training-jobs
{
  "kind" : "job",
  "metadata" : {
    "name" : "TestModelArtsJob",
    "description" : "This is a ModelArts job"
  },
  "algorithm" : {
    "id" : "3f5d6706-7b67-408d-8ba0-ec08048c45ed",
    "parameters" : [ {
      "name" : "input_dir",
      "value" : "obs://test/moxingtest-dir/"
    }, {
      "name" : "input_file",
      "value" : "obs://test/moxingtest/"
    }, {
      "name" : "large_file_method",
      "value" : "1"
    } ],
    "policies" : {
      "auto_search" : null
    },
    "environments" : { }
  },
  "spec" : {
    "resource" : {
      "flavor_id" : "modelarts.p3.large.public.free",
      "node_count" : 1
    },
    "log_export_path" : {
      "obs_url" : ""
    }
  }
}
The following is an example of how to use a custom image to create a training job whose name is TestModelArtsJob2 and description is This is a ModelArts job2. A dedicated resource pool and NFS mounting are used.
POST https://endpoint/v2/{project_id}/training-jobs
{
  "kind" : "job",
  "metadata" : {
    "name" : "TestModelArtsJob2",
    "description" : "This is a ModelArts job2"
  },
  "algorithm" : {
    "engine" : {
      "image_url" : "xxxxxxxx/fastseq:1.2"
    },
    "command" : "cd /home/ma-user/ddp_demo && sh run_ddp.sh",
    "parameters" : [ ],
    "policies" : {
      "auto_search" : null
    },
    "environments" : {
      "NCCL_DEBUG" : "INFO",
      "NCCL_IB_DISABLE" : "0"
    }
  },
  "spec" : {
    "resource" : {
      "flavor_id" : "modelarts.pool.visual.xlarge",
      "node_count" : 1,
      "pool_id" : "poolfaf38d76"
    },
    "log_export_path" : {
      "obs_url" : "/training-test/limou/ddp-demo-log/"
    },
    "volumes" : [ {
      "nfs" : {
        "nfs_server_path" : "192.168.0.82:/",
        "local_path" : "/home/ma-user/nfs/",
        "read_only" : false
      }
    } ]
  }
}
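The first request example can also be assembled programmatically. The following Python sketch builds the POST request with the standard library only and does not send it. The function name build_create_job_request is hypothetical, and passing the credential in an X-Auth-Token header is an assumption, because authentication is not described in this section.

```python
import json
import urllib.request


def build_create_job_request(endpoint, project_id, token, name, description,
                             algorithm_id, flavor_id, node_count=1):
    """Build (but do not send) a request that creates a training job."""
    body = {
        "kind": "job",
        "metadata": {"name": name, "description": description},
        "algorithm": {"id": algorithm_id},
        "spec": {"resource": {"flavor_id": flavor_id, "node_count": node_count}},
    }
    url = f"https://{endpoint}/v2/{project_id}/training-jobs"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed authentication header
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would submit the job. According to the status codes in this section, a successful call returns 201.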
Example Responses
Status code: 201
ok
{
"kind" : "job",
"metadata" : {
"id" : "425b7087-83de-49ed-9e40-5bb642be956f",
"name" : "TestModelArtsJob",
"description" : "This is a ModelArts job",
"create_time" : 1637045545982,
"workspace_id" : "0",
"user_name" : ""
},
"status" : {
"phase" : "Creating",
"secondary_phase" : "Creating",
"duration" : 0,
"start_time" : 0,
"node_count_metrics" : null,
"tasks" : [ "worker-0", "server-0" ]
},
"algorithm" : {
"id" : "3f5d6706-7b67-408d-8ba0-ec08048c45ed",
"name" : "ttt-obs-gpu",
"code_dir" : "/test/moxingtest-code/",
"boot_file" : "/test/moxingtest-code/test_obs_gpu.py",
"parameters" : [ {
"name" : "input_dir",
"description" : "",
"i18n_description" : null,
"value" : "obs://test/moxingtest-dir/",
"constraint" : {
"type" : "String",
"editable" : true,
"required" : true,
"sensitive" : false,
"valid_type" : "None",
"valid_range" : [ ]
}
}, {
"name" : "input_file",
"description" : "",
"i18n_description" : null,
"value" : "obs://test/moxingtest/",
"constraint" : {
"type" : "String",
"editable" : true,
"required" : true,
"sensitive" : false,
"valid_type" : "None",
"valid_range" : [ ]
}
}, {
"name" : "large_file_method",
"description" : "",
"i18n_description" : null,
"value" : "1",
"constraint" : {
"type" : "Integer",
"editable" : true,
"required" : true,
"sensitive" : false,
"valid_type" : "None",
"valid_range" : [ ]
}
} ],
"engine" : {
"engine_id" : "horovod-cp36-tf-1.16.2",
"engine_name" : "Horovod",
"engine_version" : "0.16.2-TF-1.13.1-python3.6"
},
"policies" : { }
},
"spec" : {
"resource" : {
"policy" : "regular",
"flavor_id" : "modelarts.p3.large.public.free",
"flavor_name" : "Computing GPU(Vnt1) instance",
"node_count" : 1,
"flavor_detail" : {
"flavor_type" : "GPU",
"billing" : {
"code" : "modelarts.vm.gpu.free",
"unit_num" : 1
},
"flavor_info" : {
"cpu" : {
"arch" : "x86",
"core_num" : 8
},
"gpu" : {
"unit_num" : 1,
"product_name" : "GP-Vnt1",
"memory" : "32GB"
},
"memory" : {
"size" : 64,
"unit" : "GB"
}
}
}
},
"log_export_path" : { }
}
}
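A client usually needs only a few fields of the 201 response, such as the job ID and the current phase. The following Python sketch extracts them from a parsed response body. The function name summarize_job and the selection of fields are illustrative choices, not part of the API.

```python
def summarize_job(response: dict) -> dict:
    """Pick the most commonly needed fields out of a create-job response."""
    metadata = response.get("metadata") or {}
    status = response.get("status") or {}
    resource = (response.get("spec") or {}).get("resource") or {}
    return {
        "id": metadata.get("id"),
        "name": metadata.get("name"),
        "phase": status.get("phase"),
        "flavor_id": resource.get("flavor_id"),
        "node_count": resource.get("node_count"),
    }
```

For the example response above, the summary contains the ID 425b7087-83de-49ed-9e40-5bb642be956f and the phase Creating.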
Status code: 400
Format of the body for a common error response. The following shows the returned information when an algorithm with ID 3f5d6706-7b67-408d-8ba0-ec08048c45ee is not found.
{
"error_msg" : "algorithm not found.",
"error_code" : "ModelArts.2755",
"error_solution" : "Check whether the training project information in the request is valid."
}
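The three fields of the error body can be wrapped in an exception so that callers can branch on the error code. The following is a minimal Python sketch. The class name ModelArtsError and the function name parse_error_body are hypothetical.

```python
import json


class ModelArtsError(Exception):
    """Carries the fields of a common error response body."""

    def __init__(self, code: str, message: str, solution: str):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.solution = solution


def parse_error_body(raw: str) -> ModelArtsError:
    """Turn a JSON error body into a ModelArtsError (missing fields become "")."""
    data = json.loads(raw)
    return ModelArtsError(
        data.get("error_code", ""),
        data.get("error_msg", ""),
        data.get("error_solution", ""),
    )
```

Applied to the example body above, the resulting exception has the code ModelArts.2755 and the message "algorithm not found.".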
Status Codes
| Status Code | Description |
|---|---|
| 201 | ok |
| 400 | Format of the body for a common error response. |
Error Codes
See Error Codes.