Wuwan, Qi 78d5ebfce8 ModelArts API 25.3.0 20250710

Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com>
Co-authored-by: Wuwan, Qi <wuwanqi1@noreply.gitea.eco.tsi-dev.otc-service.com>
Co-committed-by: Wuwan, Qi <wuwanqi1@noreply.gitea.eco.tsi-dev.otc-service.com>

2025-07-28 12:24:54 +00:00

Creating a Training Job

Function

This API is used to create a training job.

URI

POST /v2/{project_id}/training-jobs

**Table 1** Path Parameters
Parameter	Mandatory	Type	Description
project_id	Yes	String	Project ID. For details, see Obtaining a Project ID and Name.
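
As a rough illustration, the request URL can be assembled from a service endpoint and the project ID. The endpoint below is a placeholder and `training_jobs_url` is a hypothetical helper name; neither comes from this reference.

```python
from urllib.parse import quote


def training_jobs_url(endpoint: str, project_id: str) -> str:
    """Return the URL for POST /v2/{project_id}/training-jobs."""
    if not project_id:
        raise ValueError("project_id is mandatory")
    # Percent-encode the path parameter so unexpected characters cannot alter the path.
    return f"{endpoint.rstrip('/')}/v2/{quote(project_id, safe='')}/training-jobs"


# Placeholder endpoint and project ID, for illustration only.
url = training_jobs_url("https://modelarts.example.com/", "0123456789abcdef")
print(url)
```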

Request Parameters

**Table 2** Request body parameters
Parameter	Mandatory	Type	Description
kind	Yes	String	Training job type. The options are as follows: job: training job (default value). visualization_job: visualization job.
metadata	Yes	JobMetadata object	Metadata of a training job.
algorithm	No	JobAlgorithm object	Algorithm used by a training job. The options are as follows: id: Only the algorithm ID is used. subscription_id+item_version_id: The subscription ID and version ID of the algorithm are used. code_dir+boot_file: The code directory and boot file of the training job are used.
tasks	No	Array of Task objects	Task list. This function is not implemented currently.
spec	No	Spec object	Specifications of a training job. If this parameter is specified, leave the tasks parameter blank.
endpoints	No	JobEndpointsReq object	This section describes the configurations required for remotely accessing a training job.
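
A minimal sketch of a request body built from the top-level parameters in Table 2, using the code_dir+boot_file option. The helper name, the job name, and the flavor ID are placeholders rather than values from this reference, and a real request may need further fields (for example, an engine).

```python
import json


def build_training_job_body(name, code_dir, boot_file, flavor_id, node_count=1):
    """Assemble a request body that uses the code_dir+boot_file algorithm option."""
    return {
        "kind": "job",
        "metadata": {"name": name},
        "algorithm": {"code_dir": code_dir, "boot_file": boot_file},
        # spec is specified, so the tasks parameter is left out entirely.
        "spec": {"resource": {"flavor_id": flavor_id, "node_count": node_count}},
    }


body = build_training_job_body(
    name="example-job",             # placeholder job name
    code_dir="/usr/app/",
    boot_file="/usr/app/boot.py",
    flavor_id="example-flavor-id",  # placeholder, not a real flavor
)
print(json.dumps(body, indent=2))
```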

**Table 3** JobMetadata
Parameter	Mandatory	Type	Description
name	Yes	String	Name of a training job. The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-).
workspace_id	No	String	Workspace where a job is located. The default value is 0.
description	No	String	Training job description. The value must contain 0 to 256 characters. The default value is NULL.
annotations	No	Map<String,String>	Advanced configurations of a training job. The options are as follows: job_template: Template RL (heterogeneous job); fault-tolerance/job-retry-num: 3 (number of retries upon a fault); fault-tolerance/job-unconditional-retry: true (unconditional restart); fault-tolerance/hang-retry: true (restart upon a suspension); jupyter-lab/enable: true (JupyterLab training application); tensorboard/enable: true (TensorBoard training application); mindstudio-insight/enable: true (MindStudio Insight training application).
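
The name and description limits in Table 3 can be checked on the client before a request is sent. This is only a sketch: `validate_job_metadata` is a hypothetical helper, and the service may enforce additional rules.

```python
import re

# 1 to 64 characters: digits, letters, underscores, and hyphens only.
_NAME_RE = re.compile(r"[0-9A-Za-z_-]{1,64}")


def validate_job_metadata(metadata: dict) -> list[str]:
    """Return a list of problems found in a JobMetadata dictionary."""
    errors = []
    name = metadata.get("name")
    if not isinstance(name, str) or not _NAME_RE.fullmatch(name):
        errors.append("name must be 1 to 64 characters of digits, letters, _ and -")
    description = metadata.get("description")
    if description is not None and len(description) > 256:
        errors.append("description must contain 0 to 256 characters")
    return errors


print(validate_job_metadata({"name": "example-job_01"}))
print(validate_job_metadata({"name": "bad name!", "description": "x" * 300}))
```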

**Table 4** JobAlgorithm
Parameter	Mandatory	Type	Description
id	No	String	Algorithm ID.
name	No	String	Algorithm name. Leave it blank.
subscription_id	No	String	Subscription ID of a subscribed algorithm, which must be used with item_version_id.
item_version_id	No	String	Version ID of the subscribed algorithm, which must be used with subscription_id.
code_dir	No	String	Code directory of a training job, for example, /usr/app/. This parameter must be used together with boot_file. If id or subscription_id+item_version_id is set, you do not need to set this parameter.
boot_file	No	String	Boot file of a training job, which needs to be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used together with code_dir. If id or subscription_id+item_version_id is set, you do not need to set this parameter.
autosearch_config_path	No	String	YAML configuration path of auto search jobs. An OBS URL is required.
autosearch_framework_path	No	String	Framework code directory of auto search jobs. An OBS URL is required.
command	No	String	Command for starting the container of the custom image of a training job in the custom image scenario.
parameters	No	Array of Parameters objects	Running parameters of a training job.
policies	No	JobPolicies object	Policies supported by jobs, which are used for hyperparameter search.
inputs	No	Array of Input objects	Input of a training job.
outputs	No	Array of Output objects	Output of a training job.
engine	No	JobEngine object	Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm in algorithm management, or subscription_id+item_version_id of the subscribed algorithm.
local_code_dir	No	String	Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows: The directory must be under /home. In v1 compatibility mode, this field does not take effect. When code_dir is prefixed with file://, this field does not take effect.
working_dir	No	String	Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode.
environments	No	Map<String,String>	Environment variables of a training job. Format: "key":"value". The key can contain a maximum of 8,192 characters, and the value can contain a maximum of 4,096 characters. A maximum of 100 key-value pairs are allowed. The variable name can contain only letters, digits, and underscores (_), and must start with a letter or underscore (_). Note: Variables cannot contain $.
summary	No	Summary object	Visualization log summary.
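
The limits stated for environments can be expressed as a small pre-flight check. The sketch below is not part of the API. It assumes that the `$` restriction also applies to values, which the table does not state explicitly, and `validate_environments` is a hypothetical name.

```python
import re

# Letters, digits, and underscores only; must start with a letter or underscore.
_ENV_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def validate_environments(env: dict[str, str]) -> list[str]:
    """Return a list of problems found in an environments map."""
    errors = []
    if len(env) > 100:
        errors.append("a maximum of 100 key-value pairs are allowed")
    for key, value in env.items():
        if not _ENV_NAME_RE.fullmatch(key):
            errors.append(f"{key!r}: invalid variable name")
        if len(key) > 8192:
            errors.append(f"{key[:20]!r}...: key exceeds 8,192 characters")
        if len(value) > 4096:
            errors.append(f"{key!r}: value exceeds 4,096 characters")
        # Assumption: the "$" restriction covers values as well as names.
        if "$" in value:
            errors.append(f"{key!r}: value must not contain $")
    return errors


print(validate_environments({"BATCH_SIZE": "32", "_DEBUG": "1"}))
print(validate_environments({"1BAD": "x", "HOME_DIR": "$HOME"}))
```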

**Table 5** Parameters
Parameter	Mandatory	Type	Description
name	No	String	Parameter name.
value	No	String	Parameter value.
description	No	String	Parameter description.
constraint	No	ParametersConstraint object	Parameter constraint.
i18n_description	No	I18nDescription object	Internationalization description.

**Table 6** ParametersConstraint
Parameter	Mandatory	Type	Description
type	No	String	Parameter type.
editable	No	Boolean	Whether the parameter is editable.
required	No	Boolean	Whether the parameter is mandatory.
sensitive	No	Boolean	Whether the parameter is sensitive. This function is not implemented currently.
valid_type	No	String	Valid type.
valid_range	No	Array of strings	Valid range.

**Table 7** I18nDescription
Parameter	Mandatory	Type	Description
language	No	String	Internationalization language.
description	No	String	Description.

**Table 8** JobPolicies
Parameter	Mandatory	Type	Description
auto_search	No	AutoSearch object	Hyperparameter search configuration.

**Table 9** AutoSearch
Parameter	Mandatory	Type	Description
skip_search_params	No	String	Hyperparameter parameters that need to be skipped.
reward_attrs	No	Array of RewardAttrs objects	Search metrics.
search_params	No	Array of SearchParams objects	Search parameters.
algo_configs	No	Array of AlgoConfigs objects	Search algorithm configurations.

**Table 10** RewardAttrs
Parameter	Mandatory	Type	Description
name	No	String	Metric name.
mode	No	String	Search mode. max: A larger metric value is preferred. min: A smaller metric value is preferred.
regex	No	String	Regular expression of a metric.

**Table 11** SearchParams
Parameter	Mandatory	Type	Description
name	No	String	Hyperparameter name.
param_type	No	String	Parameter type. continuous: The hyperparameter is of the continuous type. When an algorithm is used in a training job, continuous hyperparameters are displayed as text boxes on the console. discrete: The hyperparameter is of the discrete type. When an algorithm is used in a training job, discrete hyperparameters are displayed as drop-down lists on the console.
lower_bound	No	String	Lower bound of the hyperparameter.
upper_bound	No	String	Upper bound of the hyperparameter.
discrete_points_num	No	String	Number of discrete points of a hyperparameter with continuous values.
discrete_values	No	Array of strings	Discrete hyperparameter values.

**Table 12** AlgoConfigs
Parameter	Mandatory	Type	Description
name	No	String	Name of the search algorithm.
params	No	Array of AutoSearchAlgoConfigParameter objects	Search algorithm parameters.

**Table 13** AutoSearchAlgoConfigParameter
Parameter	Mandatory	Type	Description
key	No	String	Parameter key.
value	No	String	Parameter value.
type	No	String	Parameter type.
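
Tables 8 to 11 nest as policies > auto_search > reward_attrs/search_params. The sketch below builds such a structure. Bounds and discrete values are declared as strings in Table 11, so numbers are converted. The helper names, metric name, regular expression, and hyperparameter names are invented for illustration.

```python
def continuous_param(name, lower, upper, points):
    """Continuous hyperparameter; bounds and point count are sent as strings."""
    return {
        "name": name,
        "param_type": "continuous",
        "lower_bound": str(lower),
        "upper_bound": str(upper),
        "discrete_points_num": str(points),
    }


def discrete_param(name, values):
    """Discrete hyperparameter with an explicit list of candidate values."""
    return {
        "name": name,
        "param_type": "discrete",
        "discrete_values": [str(v) for v in values],
    }


def build_policies(metric_name, mode, regex, search_params):
    """Wrap one search metric and the search parameters in a JobPolicies dict."""
    if mode not in ("max", "min"):
        raise ValueError("mode must be 'max' or 'min'")
    reward = {"name": metric_name, "mode": mode, "regex": regex}
    return {"auto_search": {"reward_attrs": [reward],
                            "search_params": list(search_params)}}


policies = build_policies(
    "accuracy", "max", r"accuracy=(\d+\.\d+)",  # hypothetical metric and pattern
    [continuous_param("learning_rate", 0.0001, 0.1, 5),
     discrete_param("batch_size", [16, 32, 64])],
)
print(policies)
```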

**Table 14** JobEngine
Parameter	Mandatory	Type	Description
engine_id	No	String	Engine ID selected for a training job. The engine can be specified using engine_id, engine_name + engine_version, or image_url.
engine_name	No	String	Name of the engine selected for a training job. If engine_id has been set, you do not need to set this parameter. If you use a preset framework and custom image to create a training job, you must set both this parameter and image_url.
engine_version	No	String	Version of the engine selected for a training job. If engine_id has been set, you do not need to set this parameter.
image_url	No	String	Custom image URL selected for a training job. The URL is obtained from SWR. You can select an image or enter an image in the format of "Organization name/Image name:tag".
install_sys_packages	No	Boolean	Whether to install the MoXing version specified by the training platform. Value true means to install the specified MoXing version. This parameter is available only when engine_name, engine_version, and image_url are set.
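
The combinations described in Table 14 can be summarized as a small classifier. This is a sketch of one reading of the table, not service behavior: `engine_selection` is a hypothetical helper, and giving engine_id precedence when several fields are present is an assumption.

```python
def engine_selection(engine: dict) -> str:
    """Report which JobEngine combination a dictionary uses."""
    has_id = bool(engine.get("engine_id"))
    has_name_version = bool(engine.get("engine_name")) and bool(engine.get("engine_version"))
    has_image = bool(engine.get("image_url"))

    # install_sys_packages is available only with engine_name, engine_version, and image_url.
    if "install_sys_packages" in engine and not (has_name_version and has_image):
        raise ValueError(
            "install_sys_packages requires engine_name, engine_version, and image_url")

    if has_id:  # assumption: engine_id wins if other fields are also set
        return "engine_id"
    if has_name_version and has_image:
        return "engine_name+engine_version+image_url"
    if has_name_version:
        return "engine_name+engine_version"
    if has_image:
        return "image_url"
    raise ValueError("set engine_id, engine_name+engine_version, or image_url")


print(engine_selection({"engine_id": "example-engine-id"}))
print(engine_selection({"image_url": "example-org/example-image:tag"}))
```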

**Table 15** Summary
Parameter	Mandatory	Type	Description
log_type	No	String	Visualization log type of a training job. After this parameter is configured, the training job can be used as the data source of a visualization job. The options are as follows: tensorboard, mindstudio-insight.
log_dir	No	LogDir object	Visualization log output of a training job. This parameter is mandatory when log_type is not empty.
data_sources	No	Array of DataSource objects	Visualization log input of a visualization job or debug training job. This parameter is mandatory when tensorboard/enable or mindstudio-insight/enable is set to true for advanced training functions.

**Table 16** LogDir
Parameter	Mandatory	Type	Description
pfs	Yes	PFSSummary object	Output of an OBS parallel file system.

**Table 17** PFSSummary
Parameter	Mandatory	Type	Description
pfs_path	Yes	String	URL of an OBS parallel file system.
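
Tables 15 to 17 nest as summary > log_dir > pfs > pfs_path, and log_dir is mandatory once log_type is set. A minimal sketch of that structure follows; `build_summary` is a hypothetical helper and the path is a placeholder.

```python
ALLOWED_LOG_TYPES = ("tensorboard", "mindstudio-insight")


def build_summary(log_type: str, pfs_path: str) -> dict:
    """Build a Summary dict with the log output on an OBS parallel file system."""
    if log_type not in ALLOWED_LOG_TYPES:
        raise ValueError(f"log_type must be one of {ALLOWED_LOG_TYPES}")
    if not pfs_path:
        # log_dir is mandatory when log_type is not empty.
        raise ValueError("pfs_path is required when log_type is set")
    return {"log_type": log_type, "log_dir": {"pfs": {"pfs_path": pfs_path}}}


summary = build_summary("tensorboard", "/example-bucket/logs/")  # placeholder path
print(summary)
```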

**Table 18** DataSource
Parameter	Mandatory	Type	Description
job	Yes	JobSummary object	Job data source.

**Table 19** JobSummary
Parameter	Mandatory	Type	Description
job_id	Yes	String	Training job ID.

**Table 20** Task
Parameter	Mandatory	Type	Description
role	No	String	Task role. This function is not supported currently.
algorithm	No	algorithm object	Algorithm management and configuration.
task_resource	No	task_resource object	Resource flavors of a training job.

**Table 21** algorithm
Parameter	Mandatory	Type	Description
job_config	No	job_config object	Algorithm configuration, such as the boot file.
code_dir	No	String	Algorithm code directory, for example, /usr/app/. This parameter must be used together with boot_file.
boot_file	No	String	Code boot file of the algorithm, which needs to be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used together with code_dir.
engine	No	engine object	Engine of a heterogeneous job algorithm.
inputs	No	Array of inputs objects	Data input of an algorithm.
outputs	No	Array of outputs objects	Data output of an algorithm.
local_code_dir	No	String	Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows: The directory must be under /home. In v1 compatibility mode, the current field does not take effect. When code_dir is prefixed with file://, the current field does not take effect.
working_dir	No	String	Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode.

**Table 22** job_config
Parameter	Mandatory	Type	Description
parameters	No	Array of Parameter objects	Running parameter of an algorithm.
inputs	No	Array of Input objects	Data input of an algorithm.
outputs	No	Array of Output objects	Data output of an algorithm.
engine	No	engine object	Algorithm engine.

**Table 23** Parameter
Parameter	Mandatory	Type	Description
name	No	String	Parameter name.
value	No	String	Parameter value.
description	No	String	Parameter description.
constraint	No	constraint object	Parameter constraint.
i18n_description	No	i18n_description object	Internationalization description.

**Table 24** constraint
Parameter	Mandatory	Type	Description
type	No	String	Parameter type.
editable	No	Boolean	Whether the parameter is editable.
required	No	Boolean	Whether the parameter is mandatory.
sensitive	No	Boolean	Whether the parameter is sensitive. This function is not implemented currently.
valid_type	No	String	Valid type.
valid_range	No	Array of strings	Valid range.

**Table 25** i18n_description
Parameter	Mandatory	Type	Description
language	No	String	Internationalization language.
description	No	String	Description in the specified language.

**Table 26** Input
Parameter	Mandatory	Type	Description
name	Yes	String	Name of the data input channel.
description	No	String	Description of the data input channel.
local_dir	No	String	Local directory of the container to which the data input channel is mapped, for example, /home/ma-user/modelarts/inputs/data_url_0.
remote	Yes	InputDataInfo object	Data input information. The options are as follows: dataset: The data input is a dataset. obs: The data input is an OBS path.
remote_constraint	No	Array of remote_constraint objects	Data input constraint.

**Table 27** InputDataInfo
Parameter	Mandatory	Type	Description
dataset	No	dataset object	Dataset as the data input.
obs	No	obs object	OBS path in which the input and output data is stored.

**Table 28** dataset
Parameter	Mandatory	Type	Description
id	Yes	String	Dataset ID of a training job.
version_id	Yes	String	Dataset version ID of a training job.

**Table 29** obs
Parameter	Mandatory	Type	Description
obs_url	Yes	String	OBS URL of the dataset required by a training job. For example, /usr/data/.

**Table 30** remote_constraint
Parameter	Mandatory	Type	Description
data_type	No	String	Data input type, including the data storage location and dataset.
attributes	No	String	Attributes if a dataset is used as the data input. The options are as follows: data_format: data format; data_segmentation: data segmentation; dataset_type: labeling type.

**Table 31** Output
Parameter	Mandatory	Type	Description
name	Yes	String	Name of the data output channel.
description	No	String	Description of the data output channel.
local_dir	No	String	Local directory of the container to which the data output channel is mapped.
remote	Yes	Remote object	Description of the actual data output.

**Table 32** Remote
Parameter	Mandatory	Type	Description
obs	Yes	RemoteObs object	OBS to which data is actually exported.

**Table 33** RemoteObs
Parameter	Mandatory	Type	Description
obs_url	Yes	String	OBS URL to which data is exported.

**Table 34** engine
Parameter	Mandatory	Type	Description
engine_id	No	String	Engine ID selected for an algorithm.
engine_name	No	String	Engine name selected for an algorithm. If engine_id is specified, leave this parameter blank.
engine_version	No	String	Engine version selected for an algorithm. If engine_id is specified, leave this parameter blank.
image_url	No	String	Custom image URL selected by an algorithm.

**Table 35** engine
Parameter	Mandatory	Type	Description
engine_id	No	String	Engine ID of a heterogeneous job, for example, caffe-1.0.0-python2.7.
engine_name	No	String	Engine name of a heterogeneous job, for example, Caffe.
engine_version	No	String	Engine version of a heterogeneous job.
image_url	No	String	Custom image URL selected by an algorithm.

**Table 36** inputs
Parameter	Mandatory	Type	Description
name	Yes	String	Name of the data input channel.
description	No	String	Description of the data input channel.
local_dir	No	String	Local directory of the container to which the data input channel is mapped.
remote	Yes	remote object	Information of the data input. Enums: dataset: The data input is a dataset. obs: The data input is an OBS path.

**Table 37** remote
Parameter	Mandatory	Type	Description
obs	No	obs object	OBS path in which the input and output data is stored.

**Table 38** obs
Parameter	Mandatory	Type	Description
obs_url	Yes	String	OBS URL of the dataset required by a training job. For example, /usr/data/.

**Table 39** outputs
Parameter	Mandatory	Type	Description
name	Yes	String	Name of the data output channel.
description	No	String	Description of the data output channel.
local_dir	No	String	Local directory of the container to which the data output channel is mapped.
remote	Yes	remote object	Description of the actual data output.

**Table 40** remote
Parameter	Mandatory	Type	Description
obs	Yes	obs object	OBS to which data is actually exported.

**Table 41** obs
Parameter	Mandatory	Type	Description
obs_url	Yes	String	OBS URL to which data is exported.

**Table 42** task_resource
Parameter	Mandatory	Type	Description
flavor_id	No	String	Resource flavor ID of a training job.
node_count	Yes	Integer	Number of resource replicas selected for a training job.

**Table 43** Spec
Parameter	Mandatory	Type	Description
resource	No	SpecResource object	Resource flavor of a training job. Specify either flavor_id alone, or pool_id together with flavor_id. For a public resource pool, select an available public resource flavor (flavor_id). For a dedicated resource pool, specify the pool (pool_id) and then the flavor (flavor_id) that matches the number of PUs you need. For example, if the pool flavor has 8 PUs, you can select 1, 2, 4, or 8 PUs so that no more resources are occupied than required.
volumes	No	Array of SpecVolumes objects	Volumes attached for a training job.
log_export_path	No	LogExportPath object	Export path of training job logs.
auto_stop	No	AutoStop object	Auto stop configuration of a training job.
schedule_policy	No	SchedulePolicy object	Training job scheduling policy.
notification	No	Notification object	Training event notification.

**Table 44** SpecResource
Parameter	Mandatory	Type	Description
flavor_id	No	String	ID of the resource flavor selected for a training job. flavor_id cannot be specified for dedicated resource pools with CPU specifications. The options for dedicated resource pools with GPU specifications are as follows: modelarts.pool.visual.xlarge (1 card); modelarts.pool.visual.2xlarge (2 cards); modelarts.pool.visual.4xlarge (4 cards); modelarts.pool.visual.8xlarge (8 cards); modelarts.pool.visual.16xlarge (16 cards, only for the 910A3 supernode resource pool).
node_count	No	Integer	Number of nodes used for creating a training job in a pool. By default, a single node is used.
pool_id	No	String	Dedicated resource pool ID.
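
The resource options in Table 44 can be sketched as a builder that accepts a flavor, a dedicated pool, or both. `build_resource` is a hypothetical helper and the pool ID is a placeholder. The check that at least one of the two IDs is present is an assumption drawn from the descriptions above.

```python
def build_resource(flavor_id=None, pool_id=None, node_count=1):
    """Build a SpecResource dict for a public or dedicated resource pool."""
    if flavor_id is None and pool_id is None:
        raise ValueError("set flavor_id, pool_id, or both")
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    resource = {"node_count": node_count}  # a single node is used by default
    if flavor_id is not None:
        resource["flavor_id"] = flavor_id
    if pool_id is not None:
        resource["pool_id"] = pool_id
    return resource


# Dedicated GPU pool: the pool plus the 1-card flavor listed in Table 44.
print(build_resource(flavor_id="modelarts.pool.visual.xlarge",
                     pool_id="example-pool-id"))  # placeholder pool ID
```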

**Table 45** SpecVolumes
Parameter	Mandatory	Type	Description
nfs	No	Nfs object	NFS volumes attached for a training job.
pfs	No	Pfs object	obsfs volumes attached for a training job.
obs	No	Obs object	OBS volumes attached for a training job.

**Table 46** Nfs
Parameter	Mandatory	Type	Description
nfs_server_path	No	String	NFS server path, for example, 10.10.10.10:/example/path.
local_path	No	String	Path for attaching volumes to the training container, for example, /example/path.
read_only	No	Boolean	Whether the disks attached to the container in NFS mode are read-only.

**Table 47** Pfs
Parameter	Mandatory	Type	Description
pfs_path	No	String	obsfs path, for example, /test-bucket/path.
local_path	No	String	Path for attaching volumes to the training container, for example, /example/path.

**Table 48** Obs
Parameter	Mandatory	Type	Description
obs_path	No	String	OBS path to be attached, for example, /test-bucket/path.
local_path	No	String	Path for attaching volumes to the training container, for example, /example/path.

**Table 49** LogExportPath
Parameter	Mandatory	Type	Description
obs_url	No	String	OBS path for storing training job logs, for example, obs://example/path.
host_path	No	String	Path of the host where training job logs are stored, for example, /example/path.

**Table 50** AutoStop
Parameter	Mandatory	Type	Description
time_unit	Yes	String	Time unit. The options are as follows: HOURS
duration	Yes	Integer	Running duration. The minimum value is 1.
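
A small sketch of an auto_stop value that respects the constraints in Table 50 (HOURS as the only listed unit, minimum duration of 1). `build_auto_stop` is a hypothetical helper name.

```python
def build_auto_stop(hours: int) -> dict:
    """Build an AutoStop dict; HOURS is the only time unit listed in Table 50."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(hours, bool) or not isinstance(hours, int) or hours < 1:
        raise ValueError("duration must be an integer of at least 1")
    return {"time_unit": "HOURS", "duration": hours}


print(build_auto_stop(8))
```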

**Table 51** SchedulePolicy
Parameter	Mandatory	Type	Description
required_affinity	No	RequiredAffinity object	Affinity requirements for training jobs.
priority	No	Integer	Priority of the training job.
preemptible	No	Boolean	Whether preemption is allowed.

**Table 52** RequiredAffinity
Parameter	Mandatory	Type	Description
affinity_type	No	String	Affinity scheduling policy. The options are as follows: cabinet: strong cabinet scheduling; hyperinstance: supernode affinity scheduling.
affinity_group_size	No	Integer	Affinity group size. This parameter is mandatory when affinity_type is set to hyperinstance. In this case, the system schedules tasks specified by affinity_group_size to a supernode to form an affinity group. When a user delivers a training job to the supernode resource pool, if the affinity group size is not set, the system sets the value to 1 by default.

**Table 53** Notification
Parameter	Mandatory	Type	Description
topic_urn	No	String	URN of the selected topic in SMN.
events	No	Array of strings	Training events that trigger message notification. The options are as follows: JobStarted: The job has started. JobCompleted: The job is completed. JobFailed: The job failed. JobTerminated: The job is terminated. JobRestarted: The job is restarted. JobHanged: The job is suspended. JobPreempted: The job is preempted.
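
Because events accepts only the names listed in Table 53, a typo is easy to catch before the request is sent. The sketch below uses a hypothetical `build_notification` helper and a placeholder topic URN.

```python
KNOWN_EVENTS = frozenset({
    "JobStarted", "JobCompleted", "JobFailed", "JobTerminated",
    "JobRestarted", "JobHanged", "JobPreempted",
})


def build_notification(topic_urn: str, events: list[str]) -> dict:
    """Build a Notification dict, rejecting event names not listed in Table 53."""
    unknown = sorted(set(events) - KNOWN_EVENTS)
    if unknown:
        raise ValueError(f"unknown events: {unknown}")
    return {"topic_urn": topic_urn, "events": list(events)}


notification = build_notification(
    "example-topic-urn",  # placeholder, not a real SMN URN
    ["JobFailed", "JobCompleted"],
)
print(notification)
```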

**Table 54** JobEndpointsReq
Parameter	Mandatory	Type	Description
ssh	No	SSHReq object	SSH connection information.

**Table 55** SSHReq
Parameter	Mandatory	Type	Description
key_pair_names	No	Array of strings	SSH key pair names. Key pairs can be created and viewed on the Key Pair page of the ECS console.

Response Parameters

Status code: 201

**Table 56** Response body parameters
Parameter	Type	Description
kind	String	Training job type. The default value is job. The options are as follows: job: training job.
metadata	JobMetadata object	Metadata of a training job.
status	Status object	Status of a training job. You do not need to set this parameter when creating a job.
algorithm	JobAlgorithmResponse object	Algorithm used by a training job. The options are as follows: id: Only the algorithm ID is used. subscription_id+item_version_id: The subscription ID and version ID of the algorithm are used. code_dir+boot_file: The code directory and boot file of the training job are used.
tasks	Array of TaskResponse objects	List of tasks in heterogeneous training jobs.
spec	SpecResponce object	Specifications of a training job.
endpoints	JobEndpointsResp object	This section describes the configurations required for remotely accessing a training job.

**Table 57** JobMetadata
Parameter	Type	Description
id	String	Training job ID, which is generated and returned by ModelArts after the training job is created.
name	String	Name of a training job. The value must contain 1 to 64 characters consisting of only digits, letters, underscores (_), and hyphens (-).
workspace_id	String	Workspace where a job is located. The default value is 0.
description	String	Training job description. The value must contain 0 to 256 characters. The default value is NULL.
create_time	Long	Time when a training job was created, in milliseconds. The value is generated and returned by ModelArts after a training job is created.
user_name	String	Name of the user who created the training job. The username is generated and returned by ModelArts after a training job is created.
annotations	Map<String,String>	Advanced configurations of a training job. The options are as follows: job_template: Template RL (heterogeneous job); fault-tolerance/job-retry-num: 3 (number of retries upon a fault); fault-tolerance/job-unconditional-retry: true (unconditional restart); fault-tolerance/hang-retry: true (restart upon a suspension); jupyter-lab/enable: true (JupyterLab training application); tensorboard/enable: true (TensorBoard training application); mindstudio-insight/enable: true (MindStudio Insight training application).
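
After a 201 response, the job ID and creation time are found under metadata. The sketch below reads them from an already-parsed response dictionary and converts create_time from milliseconds to a timezone-aware datetime. The sample values and the `job_identity` helper are invented.

```python
from datetime import datetime, timezone


def job_identity(response: dict):
    """Return (id, name, creation time in UTC) from a create-job response body."""
    meta = response["metadata"]
    # create_time is expressed in milliseconds.
    created = datetime.fromtimestamp(meta["create_time"] / 1000, tz=timezone.utc)
    return meta["id"], meta["name"], created


# Invented sample response fragment, for illustration only.
sample = {"kind": "job",
          "metadata": {"id": "example-job-id", "name": "example-job",
                       "create_time": 1700000000000}}
job_id, job_name, created_at = job_identity(sample)
print(job_id, job_name, created_at.isoformat())
```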

**Table 58** Status
Parameter	Type	Description
phase	String	Level-1 status of a training job. The options are as follows: Creating: The gateway is being created. Pending: The job is waiting. Running: The job is running. Failed: The task fails to be executed. Completed: The job is completed. Terminating: The task is being stopped. Terminated: The job is stopped. Abnormal: The job is abnormal.
secondary_phase	String	Level-2 status of a training job. This is an internal detailed status that may be added, modified, or deleted, so do not depend on it. The options are as follows: Creating: The gateway is being created. Queuing: The job is queuing. Running: The job is running. Failed: The task fails to be executed. Completed: The job is completed. Terminating: The task is being stopped. Terminated: The job is stopped. CreateFailed: The creation fails. TerminatedFailed: The service fails to be stopped. Unknown: The status is unknown. Lost: The job is abnormal.
duration	Long	Running duration of a training job, in milliseconds.
node_count_metrics	Array<Array<Integer>>	Node count changes during the training job running period.
tasks	Array of strings	Tasks of a training job.
start_time	Long	Start time of a training job. The value is in timestamp format.
task_statuses	Array of TaskStatuses objects	Status of a training job task.
running_records	Array of RunningRecord objects	Running and fault recovery records of a training job.
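
A client that reads the status object typically needs the phase and the running duration. The sketch below treats Completed, Failed, and Terminated as finished phases. That grouping is an assumption, because this reference does not say which phases are final, and `summarize_status` is a hypothetical helper.

```python
from datetime import timedelta

# Assumption: these level-1 phases mean the job will not change state again.
FINISHED_PHASES = frozenset({"Completed", "Failed", "Terminated"})


def summarize_status(status: dict):
    """Return (phase, finished?, running duration) from a Status dict."""
    phase = status.get("phase", "Unknown")
    # duration is reported in milliseconds.
    running = timedelta(milliseconds=status.get("duration", 0))
    return phase, phase in FINISHED_PHASES, running


print(summarize_status({"phase": "Running", "duration": 90000}))
```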

**Table 59** TaskStatuses
Parameter	Type	Description
task	String	Task of a training job.
exit_code	Integer	Exit code of a training job task.
message	String	Error message of a training job task.

**Table 60** RunningRecord
Parameter	Type	Description
start_at	Integer	Unix timestamp of the start time in the current running record, in seconds.
end_at	Integer	Unix timestamp of the end time in the current running record, in seconds.
start_type	String	Startup mode of the current running record. The options are as follows: init_or_rescheduled: The first run after scheduling, including the initial startup and a run after scheduling recovery. restarted: A run after a process restart, not the first run after scheduling.
end_reason	String	Reason why the current running record ends.
end_related_task	String	ID of the task worker that causes the end of the current running record, for example, worker-0.
end_recover	String	Fault tolerance policy used after the current running record ends. The options are as follows: npu_proc_restart: NPU in-place hot recovery; gpu_proc_restart: GPU in-place hot recovery; proc_restart: process in-place recovery; pod_reschedule: pod-level rescheduling; job_reschedule: job-level rescheduling; job_reschedule_with_taint: isolated job-level rescheduling.
end_recover_before_downgrade	String	Tolerance policy used after the current running record ends and before the fault tolerance policy is degraded. The options are the same as those of end_recover.
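
Since start_at and end_at are Unix timestamps in seconds, the records in Table 60 can be reduced to a total running time and a list of recovery actions. The sketch below assumes every record carries both timestamps, which may not hold for a run that is still in progress. The sample records and the `summarize_records` helper are invented.

```python
def summarize_records(records: list[dict]):
    """Return (total running seconds, [(task, recovery policy), ...])."""
    total_seconds = 0
    recoveries = []
    for record in records:
        # Both fields are Unix timestamps in seconds.
        total_seconds += record["end_at"] - record["start_at"]
        if record.get("end_recover"):
            recoveries.append((record.get("end_related_task"), record["end_recover"]))
    return total_seconds, recoveries


# Invented sample data, for illustration only.
sample_records = [
    {"start_at": 1000, "end_at": 1600, "start_type": "init_or_rescheduled",
     "end_related_task": "worker-0", "end_recover": "proc_restart"},
    {"start_at": 1650, "end_at": 2650, "start_type": "restarted"},
]
total, recovered = summarize_records(sample_records)
print(total, recovered)
```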

**Table 61** JobAlgorithmResponse
Parameter	Type	Description
id	String	Algorithm ID.
name	String	Algorithm name.
subscription_id	String	Subscription ID of a subscribed algorithm, which must be used with item_version_id.
item_version_id	String	Version ID of the subscribed algorithm, which must be used with subscription_id.
code_dir	String	Code directory of a training job, for example, /usr/app/. This parameter must be used together with boot_file. If id or subscription_id+item_version_id is set, you do not need to set this parameter.
boot_file	String	Boot file of a training job, which needs to be stored in the code directory, for example, /usr/app/boot.py. This parameter must be used together with code_dir. If id or subscription_id+item_version_id is set, you do not need to set this parameter.
autosearch_config_path	String	YAML configuration path of an auto search job. An OBS URL is required. For example, obs://bucket/file.yaml.
autosearch_framework_path	String	Framework code directory of auto search jobs. An OBS URL is required. For example, obs://bucket/files/.
command	String	Boot command for starting the container of a custom image for a training job. For example, python train.py.
parameters	Array of Parameter objects	Running parameters of a training job.
policies	policies object	Policies supported by jobs.
inputs	Array of Input objects	Input of a training job.
outputs	Array of Output objects	Output of a training job.
engine	JobEngine object	Engine of a training job. Leave this parameter blank if the job is created using id of the algorithm in algorithm management, or subscription_id+item_version_id of the subscribed algorithm.
local_code_dir	String	Local directory of the training container to which the algorithm code directory is downloaded. The rules are as follows: The directory must be under /home. In v1 compatibility mode, the current field does not take effect. When code_dir is prefixed with file://, the current field does not take effect.
working_dir	String	Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode.
environments	Array of Map<String,String> objects	Environment variables of a training job. The format is key:value. Leave this parameter blank.
summary	Summary object	Visualization log summary.
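
The three algorithm options (id, subscription_id+item_version_id, code_dir+boot_file) can be told apart with a short check that also enforces the "must be used together" pairs. The order in which the options are tested is an assumption, and `algorithm_source` is a hypothetical helper.

```python
def algorithm_source(algorithm: dict) -> str:
    """Report which of the three algorithm options a dictionary uses."""
    if algorithm.get("id"):
        return "id"

    sub, ver = algorithm.get("subscription_id"), algorithm.get("item_version_id")
    if sub or ver:
        if not (sub and ver):
            raise ValueError("subscription_id and item_version_id must be used together")
        return "subscription_id+item_version_id"

    code, boot = algorithm.get("code_dir"), algorithm.get("boot_file")
    if code or boot:
        if not (code and boot):
            raise ValueError("code_dir and boot_file must be used together")
        return "code_dir+boot_file"

    raise ValueError("no algorithm option is set")


print(algorithm_source({"code_dir": "/usr/app/", "boot_file": "/usr/app/boot.py"}))
```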

**Table 62** Parameter
Parameter	Type	Description
name	String	Parameter name.
value	String	Parameter value.
description	String	Parameter description.
constraint	constraint object	Parameter constraint.
i18n_description	i18n_description object	Internationalization description.

**Table 63** constraint
Parameter	Type	Description
type	String	Parameter type.
editable	Boolean	Whether the parameter is editable.
required	Boolean	Whether the parameter is mandatory.
sensitive	Boolean	Whether the parameter is sensitive. This function is not implemented currently.
valid_type	String	Valid type.
valid_range	Array of strings	Valid range.

**Table 64** i18n_description
Parameter	Type	Description
language	String	Internationalization language.
description	String	Description in the specified language.

**Table 65** policies
Parameter	Type	Description
auto_search	auto_search object	Hyperparameter search configuration.

**Table 66** auto_search
Parameter	Type	Description
skip_search_params	String	Hyperparameter parameters that need to be skipped.
reward_attrs	Array of reward_attrs objects	List of search metrics.
search_params	Array of search_params objects	Search parameters.
algo_configs	Array of algo_configs objects	Search algorithm configurations.

**Table 67** reward_attrs
Parameter	Type	Description
name	String	Metric name.
mode	String	Search mode. max: A larger metric value is preferred. min: A smaller metric value is preferred.
regex	String	Regular expression of a metric.

**Table 68** search_params
Parameter	Type	Description
name	String	Hyperparameter name.
param_type	String	Parameter type. continuous: The hyperparameter is of the continuous type. When an algorithm is used in a training job, continuous hyperparameters are displayed as text boxes on the console. discrete: The hyperparameter is of the discrete type. When an algorithm is used in a training job, discrete hyperparameters are displayed as drop-down lists on the console.
lower_bound	String	Lower bound of the hyperparameter.
upper_bound	String	Upper bound of the hyperparameter.
discrete_points_num	String	Number of discrete points of a continuous hyperparameter.
discrete_values	Array of strings	List of discrete hyperparameter values.

**Table 69** algo_configs
Parameter	Type	Description
name	String	Name of the search algorithm.
params	Array of AutoSearchAlgoConfigParameter objects	Search algorithm parameters.

**Table 70** AutoSearchAlgoConfigParameter
Parameter	Type	Description
key	String	Parameter key.
value	String	Parameter value.
type	String	Parameter type.

**Table 71** Input
Parameter	Type	Description
name	String	Name of the data input channel.
description	String	Description of the data input channel.
local_dir	String	Local directory of the container to which the data input channel is mapped, for example, /home/ma-user/modelarts/inputs/data_url_0.
remote	InputDataInfo object	Data input information. The options are as follows: dataset: The data input is a dataset. obs: The data input is an OBS path.
remote_constraint	Array of remote_constraint objects	Data input constraint.

**Table 72** InputDataInfo
Parameter	Type	Description
dataset	dataset object	Dataset as the data input.
obs	obs object	OBS in which data input and output are stored.

**Table 73** dataset
Parameter	Type	Description
id	String	Dataset ID of a training job.
version_id	String	Dataset version ID of a training job.
obs_url	String	OBS URL of the dataset for a training job. It is automatically parsed by ModelArts based on the dataset ID and dataset version ID. For example, /usr/data/.

**Table 74** obs
Parameter	Type	Description
obs_url	String	OBS URL of the dataset required by a training job. For example, /usr/data/.
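
The relationship between Tables 71 to 74 can be sketched as a small reader that reports where an input channel gets its data. The function name input_source is hypothetical, and treating dataset and obs as mutually exclusive is an assumption drawn from the description of remote, not a documented rule.

```python
def input_source(channel: dict) -> tuple:
    """Return (kind, location) for one entry shaped like Table 71."""
    remote = channel.get("remote") or {}
    dataset = remote.get("dataset")
    obs = remote.get("obs")
    if dataset and obs:
        # Assumption: a channel uses a dataset or an OBS path, not both.
        raise ValueError("expected either dataset or obs, not both")
    if dataset:
        # obs_url is described as parsed by ModelArts from the dataset IDs,
        # so fall back to the dataset ID when it is absent.
        return "dataset", dataset.get("obs_url") or dataset["id"]
    if obs:
        return "obs", obs["obs_url"]
    raise ValueError("remote has neither dataset nor obs")
```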

**Table 75** remote_constraint
Parameter	Type	Description
data_type	String	Data input type, including the data storage location and dataset.
attributes	String	Attributes if a dataset is used as the data input. Options: data_format (data format); data_segmentation (data segmentation); dataset_type (labeling type).

**Table 76** Output
Parameter	Type	Description
name	String	Name of the data output channel.
description	String	Description of the data output channel.
local_dir	String	Local directory of the container to which the data output channel is mapped.
remote	Remote object	Description of the actual data output.

**Table 77** JobEngine
Parameter	Type	Description
engine_id	String	Engine ID selected for a training job. The engine can be specified using engine_id, engine_name together with engine_version, or image_url.
engine_name	String	Name of the engine selected for a training job. If engine_id has been set, you do not need to set this parameter. If you use a preset framework and custom image to create a training job, you must set both this parameter and image_url.
engine_version	String	Version of the engine selected for a training job. If engine_id has been set, you do not need to set this parameter.
image_url	String	Custom image URL selected for a training job. The URL is obtained from SWR. You can select an image or enter an image in the format of "Organization name/Image name:tag".
install_sys_packages	Boolean	Whether to install the MoXing version specified by the training platform. Value true means to install the specified MoXing version. This parameter is available only when engine_name, engine_version, and image_url are set.
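
The ways of specifying an engine described in Table 77 can be sketched as a classifier over a JobEngine-shaped dictionary. The function name engine_selection_mode is hypothetical, and giving engine_id precedence when several fields are present is an assumption based on the note that the other fields are unnecessary once engine_id is set.

```python
def engine_selection_mode(engine: dict) -> str:
    """Return which documented way of selecting the engine an object uses."""
    if engine.get("engine_id"):
        # Assumption: engine_id wins when it appears with other fields.
        return "engine_id"
    has_name = bool(engine.get("engine_name"))
    has_version = bool(engine.get("engine_version"))
    has_image = bool(engine.get("image_url"))
    if has_name and has_image:
        # Preset framework combined with a custom image.
        return "engine_name + image_url"
    if has_name and has_version:
        return "engine_name + engine_version"
    if has_image:
        return "image_url"
    raise ValueError("engine is not specified in any documented way")
```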

**Table 78** Summary
Parameter	Type	Description
log_type	String	Visualization log type of a training job. After this parameter is configured, the training job can be used as the data source of a visualization job. The options are tensorboard and mindstudio-insight.
log_dir	LogDir object	Visualization log output of a training job. This parameter is mandatory when log_type is not empty.
data_sources	Array of DataSource objects	Visualization log input of a visualization job or debug training job. This parameter is mandatory when tensorboard/enable or mindstudio-insight/enable is set to true for advanced training functions.
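
The dependency between log_type and log_dir in Table 78 can be checked with a few lines. This is a minimal sketch; the names check_summary and VISUALIZATION_LOG_TYPES are hypothetical and not part of the API.

```python
# The two log types listed in the log_type description.
VISUALIZATION_LOG_TYPES = {"tensorboard", "mindstudio-insight"}


def check_summary(summary: dict) -> list:
    """Return a list of problems found in a Summary-shaped dictionary."""
    problems = []
    log_type = summary.get("log_type")
    if log_type:
        if log_type not in VISUALIZATION_LOG_TYPES:
            problems.append(f"unsupported log_type: {log_type}")
        if not summary.get("log_dir"):
            problems.append("log_dir is mandatory when log_type is not empty")
    return problems
```

An empty list means that the two documented conditions hold; the sketch does not validate data_sources.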

**Table 79** LogDir
Parameter	Type	Description
pfs	PFSSummary object	Output of an OBS parallel file system.

**Table 80** PFSSummary
Parameter	Type	Description
pfs_path	String	URL of an OBS parallel file system.

**Table 81** DataSource
Parameter	Type	Description
job	JobSummary object	Job data source.

**Table 82** JobSummary
Parameter	Type	Description
job_id	String	Training job ID.

**Table 83** TaskResponse
Parameter	Type	Description
role	String	Task role. This function is not supported currently.
algorithm	TaskResponseAlgorithm object	Algorithm management and configuration.
task_resource	FlavorResponse object	Flavors of a training job or an algorithm.

**Table 84** TaskResponseAlgorithm
Parameter	Type	Description
code_dir	String	Absolute path of the directory where the algorithm boot file is stored.
boot_file	String	Absolute path of the algorithm boot file.
inputs	AlgorithmInput object	Algorithm input channel.
outputs	AlgorithmOutput object	Algorithm output channel.
engine	AlgorithmEngine object	Engine on which a heterogeneous job depends.
local_code_dir	String	Local directory of the training container to which the algorithm code directory is downloaded. The following rules apply: the directory must be under /home; in v1 compatibility mode, this field does not take effect; when code_dir is prefixed with file://, this field does not take effect.
working_dir	String	Work directory where an algorithm is executed. Note that this parameter does not take effect in v1 compatibility mode.

**Table 85** AlgorithmInput
Parameter	Type	Description
name	String	Name of the data input channel.
local_dir	String	Local path of the container to which the data input and output channels are mapped.
remote	AlgorithmRemote object	Actual data input, which can only be OBS for heterogeneous jobs.

**Table 86** AlgorithmRemote
Parameter	Type	Description
obs	RemoteObs object	OBS in which data input and output are stored.

**Table 87** AlgorithmOutput
Parameter	Type	Description
name	String	Name of the data output channel.
local_dir	String	Local directory of the container to which the data output channel is mapped.
remote	Remote object	Description of the actual data output.
mode	String	Data transmission mode. The default value is upload_periodically.
period	String	Data transmission period. The default value is 30s.

**Table 88** Remote
Parameter	Type	Description
obs	RemoteObs object	OBS to which data is actually exported.

**Table 89** RemoteObs
Parameter	Type	Description
obs_url	String	OBS URL to which data is exported.

**Table 90** AlgorithmEngine
Parameter	Type	Description
engine_id	String	Engine ID, for example, caffe-1.0.0-python2.7.
engine_name	String	Engine name, for example, Caffe.
engine_version	String	Engine version. An engine name can have multiple versions, for example, Caffe-1.0.0-python2.7 for Python 2.7.
v1_compatible	Boolean	Whether the v1 compatibility mode is used.
run_user	String	UID of the user that the engine runs as by default.
image_url	String	Custom image URL selected for an algorithm.

**Table 91** FlavorResponse
Parameter	Type	Description
flavor_id	String	ID of the resource flavor.
flavor_name	String	Name of the resource flavor.
max_num	Integer	Maximum number of nodes in a resource flavor.
flavor_type	String	Resource flavor type. The options are CPU and GPU.
billing	BillingInfo object	Billing information of a resource flavor.
flavor_info	FlavorInfoResponse object	Resource flavor details.
attributes	Map<String,String>	Other specification attributes.

**Table 92** FlavorInfoResponse
Parameter	Type	Description
max_num	Integer	Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported.
cpu	Cpu object	CPU specifications.
gpu	Gpu object	GPU specifications.
npu	Npu object	NPU specifications.
memory	Memory object	Memory information.
disk	DiskResponse object	Disk information.

**Table 93** DiskResponse
Parameter	Type	Description
size	Integer	Disk size.
unit	String	Unit of the disk size.

**Table 94** SpecResponce
Parameter	Type	Description
resource	Resource object	Resource flavors of a training job. Select either flavor_id or pool_id+[flavor_id].
volumes	Array of JobVolume objects	Volumes attached for a training job.
log_export_path	LogExportPath object	Export path of training job logs.
schedule_policy	SchedulePolicy object	Training job scheduling policy.

**Table 95** Resource
Parameter	Type	Description
policy	String	Resource specification mode of a training job. The value can be regular, indicating the standard mode.
flavor_id	String	ID of the resource flavor selected for a training job. flavor_id cannot be specified for dedicated resource pools with CPU specifications. The options for dedicated resource pools with GPU specifications are as follows: modelarts.pool.visual.xlarge (1 card), modelarts.pool.visual.2xlarge (2 cards), modelarts.pool.visual.4xlarge (4 cards), and modelarts.pool.visual.8xlarge (8 cards).
flavor_name	String	Read-only flavor name returned by ModelArts when flavor_id is used.
node_count	Integer	Number of resource replicas selected for a training job.
pool_id	String	Resource pool ID selected for a training job.
flavor_detail	FlavorDetail object	Flavor details of a training job or algorithm. This parameter is available only for public resource pools.
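
The card counts listed for the dedicated-pool GPU flavors in Table 95 can be kept in a lookup table. The sketch below uses hypothetical names (VISUAL_FLAVOR_CARDS, total_cards) and assumes that the listed card count applies per node, so that the total is the card count multiplied by node_count; the source does not confirm that interpretation.

```python
# Card counts copied from the flavor_id description.
VISUAL_FLAVOR_CARDS = {
    "modelarts.pool.visual.xlarge": 1,
    "modelarts.pool.visual.2xlarge": 2,
    "modelarts.pool.visual.4xlarge": 4,
    "modelarts.pool.visual.8xlarge": 8,
}


def total_cards(resource: dict):
    """Return the assumed total card count, or None for unlisted flavors."""
    cards = VISUAL_FLAVOR_CARDS.get(resource.get("flavor_id"))
    if cards is None:
        return None
    # Assumption: the listed card count is per node.
    return cards * resource.get("node_count", 1)
```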

**Table 96** FlavorDetail
Parameter	Type	Description
flavor_type	String	Resource flavor type. The options are CPU and GPU.
billing	BillingInfo object	Billing information of a resource flavor.
flavor_info	FlavorInfo object	Resource flavor details.

**Table 97** BillingInfo
Parameter	Type	Description
code	String	Billing code.
unit_num	Integer	Billing unit.

**Table 98** FlavorInfo
Parameter	Type	Description
max_num	Integer	Maximum number of nodes that can be selected. The value 1 indicates that the distributed mode is not supported.
cpu	Cpu object	CPU specifications.
gpu	Gpu object	GPU specifications.
npu	Npu object	NPU specifications.
memory	Memory object	Memory information.
disk	Disk object	Disk information.

**Table 99** Cpu
Parameter	Type	Description
arch	String	CPU architecture.
core_num	Integer	Number of cores.

**Table 100** Gpu
Parameter	Type	Description
unit_num	Integer	Number of GPUs.
product_name	String	Product name.
memory	String	Memory.

**Table 101** Npu
Parameter	Type	Description
unit_num	String	Number of NPUs.
product_name	String	Product name.
memory	String	Memory.

**Table 102** Memory
Parameter	Type	Description
size	Integer	Memory size.
unit	String	Unit of the memory size.

**Table 103** Disk
Parameter	Type	Description
size	String	Disk size.
unit	String	Unit of the disk size, which is typically GB.

**Table 104** JobVolume
Parameter	Type	Description
nfs	Nfs object	Volumes attached in NFS mode.

**Table 105** Nfs
Parameter	Type	Description
nfs_server_path	String	NFS server path, for example, 10.10.10.10:/example/path.
local_path	String	Path for attaching volumes to the training container, for example, /example/path.
read_only	Boolean	Whether the disks attached to the container in NFS mode are read-only.
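
The nfs_server_path format shown in Table 105 (server address, a colon, then the exported path) can be split as sketched below. The function name split_nfs_server_path is hypothetical, and the sketch assumes an IPv4 address or host name before the first colon; it would not handle an IPv6 literal.

```python
def split_nfs_server_path(nfs_server_path: str) -> tuple:
    """Split 'host:/export/path' into (host, export_path)."""
    host, sep, export = nfs_server_path.partition(":")
    if not sep or not host or not export.startswith("/"):
        raise ValueError(f"unexpected nfs_server_path: {nfs_server_path!r}")
    return host, export
```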

**Table 106** LogExportPath
Parameter	Type	Description
obs_url	String	OBS path for storing training job logs, for example, obs://example/path.
host_path	String	Path of the host where training job logs are stored, for example, /example/path.

**Table 107** SchedulePolicy
Parameter	Type	Description
required_affinity	RequiredAffinity object	Affinity requirements for training jobs.
priority	Integer	Priority of the training job.
preemptible	Boolean	Whether preemption is allowed.

**Table 108** RequiredAffinity
Parameter	Type	Description
affinity_type	String	Affinity scheduling policy. Possible values: cabinet (strong cabinet scheduling); hyperinstance (supernode affinity scheduling).
affinity_group_size	Integer	Affinity group size. This parameter is mandatory when affinity_type is set to hyperinstance. In this case, the system schedules the number of tasks specified by affinity_group_size to one supernode to form an affinity group. If a user submits a training job to a supernode resource pool without setting the affinity group size, the system uses 1 by default.

**Table 109** JobEndpointsResp
Parameter	Type	Description
ssh	SSHResp object	SSH connection information.
jupyter_lab	JupyterLab object	JupyterLab connection information.
tensorboard	Tensorboard object	TensorBoard connection information.
mindstudio_insight	MindStudioInsight object	MindStudio Insight connection information.

**Table 110** SSHResp
Parameter	Type	Description
key_pair_names	Array of strings	SSH key pair names. Key pairs can be created and viewed on the Key Pair page of the ECS console.
task_urls	Array of TaskUrls objects	SSH connection address information.

**Table 111** TaskUrls
Parameter	Type	Description
task	String	ID of a training job.
url	String	SSH connection address of a training job.

**Table 112** JupyterLab
Parameter	Type	Description
url	String	JupyterLab address of a training job.
token	String	JupyterLab token of a training job.

**Table 113** Tensorboard
Parameter	Type	Description
url	String	TensorBoard URL of a training job.
token	String	TensorBoard token of a training job.

**Table 114** MindStudioInsight
Parameter	Type	Description
url	String	MindStudio Insight URL of a training job.
token	String	MindStudio Insight token of a training job.

Status code: 400

**Table 115** Response body parameters
Parameter	Type	Description
error_msg	String	Error message
error_code	String	Error code
error_solution	String	Solution

Example Requests

The following is an example of how to create a training job with free specifications. The job name has been set to TestModelArtsJob and the description has been set to This is a ModelArts job. The required algorithm's ID is 3f5d6706-7b67-408d-8ba0-ec08048c45ed. The inputs and outputs have not been defined for the algorithm.

POST https://endpoint/v2/{project_id}/training-jobs

{
  "kind" : "job",
  "metadata" : {
    "name" : "TestModelArtsJob",
    "description" : "This is a ModelArts job"
  },
  "algorithm" : {
    "id" : "3f5d6706-7b67-408d-8ba0-ec08048c45ed",
    "parameters" : [ {
      "name" : "input_dir",
      "value" : "obs://test/moxingtest-dir/"
    }, {
      "name" : "input_file",
      "value" : "obs://test/moxingtest/"
    }, {
      "name" : "large_file_method",
      "value" : "1"
    } ],
    "policies" : {
      "auto_search" : null
    },
    "environments" : { }
  },
  "spec" : {
    "resource" : {
      "flavor_id" : "modelarts.p3.large.public.free",
      "node_count" : 1
    },
    "log_export_path" : {
      "obs_url" : ""
    }
  }
}
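
As a rough sketch, the request above could be assembled in Python without sending it. The function name build_create_job_request is hypothetical, the host example.invalid and the project ID my-project are placeholders, and the X-Auth-Token header is an assumption about how the call is authenticated; this section of the source does not describe authentication. The body is a shortened form of the example above.

```python
import json
import urllib.request


def build_create_job_request(endpoint: str, project_id: str, token: str, body: dict):
    """Build (but do not send) the POST request for creating a training job."""
    url = f"https://{endpoint}/v2/{project_id}/training-jobs"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: token-based authentication via this header.
            "X-Auth-Token": token,
        },
        method="POST",
    )


# Shortened version of the example request body.
body = {
    "kind": "job",
    "metadata": {
        "name": "TestModelArtsJob",
        "description": "This is a ModelArts job",
    },
    "algorithm": {"id": "3f5d6706-7b67-408d-8ba0-ec08048c45ed"},
    "spec": {
        "resource": {
            "flavor_id": "modelarts.p3.large.public.free",
            "node_count": 1,
        }
    },
}

request = build_create_job_request("example.invalid", "my-project", "<token>", body)
```

Passing the resulting object to urllib.request.urlopen would send it; that step is left out here because it needs a real endpoint and credentials.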

The following is an example of how to use a custom image to create a training job whose name is TestModelArtsJob2 and description is This is a ModelArts job2. A dedicated resource pool and NFS mounting are used.

POST https://endpoint/v2/{project_id}/training-jobs

{
  "kind" : "job",
  "metadata" : {
    "name" : "TestModelArtsJob2",
    "description" : "This is a ModelArts job2"
  },
  "algorithm" : {
    "engine" : {
      "image_url" : "xxxxxxxx/fastseq:1.2"
    },
    "command" : "cd /home/ma-user/ddp_demo && sh run_ddp.sh",
    "parameters" : [ ],
    "policies" : {
      "auto_search" : null
    },
    "environments" : {
      "NCCL_DEBUG" : "INFO",
      "NCCL_IB_DISABLE" : "0"
    }
  },
  "spec" : {
    "resource" : {
      "flavor_id" : "modelarts.pool.visual.xlarge",
      "node_count" : 1,
      "pool_id" : "poolfaf38d76"
    },
    "log_export_path" : {
      "obs_url" : "/training-test/limou/ddp-demo-log/"
    },
    "volumes" : [ {
      "nfs" : {
        "nfs_server_path" : "192.168.0.82:/",
        "local_path" : "/home/ma-user/nfs/",
        "read_only" : false
      }
    } ]
  }
}

Example Responses

Status code: 201

{
  "kind" : "job",
  "metadata" : {
    "id" : "425b7087-83de-49ed-9e40-5bb642be956f",
    "name" : "TestModelArtsJob",
    "description" : "This is a ModelArts job",
    "create_time" : 1637045545982,
    "workspace_id" : "0",
    "user_name" : ""
  },
  "status" : {
    "phase" : "Creating",
    "secondary_phase" : "Creating",
    "duration" : 0,
    "start_time" : 0,
    "node_count_metrics" : null,
    "tasks" : [ "worker-0", "server-0" ]
  },
  "algorithm" : {
    "id" : "3f5d6706-7b67-408d-8ba0-ec08048c45ed",
    "name" : "ttt-obs-gpu",
    "code_dir" : "/test/moxingtest-code/",
    "boot_file" : "/test/moxingtest-code/test_obs_gpu.py",
    "parameters" : [ {
      "name" : "input_dir",
      "description" : "",
      "i18n_description" : null,
      "value" : "obs://test/moxingtest-dir/",
      "constraint" : {
        "type" : "String",
        "editable" : true,
        "required" : true,
        "sensitive" : false,
        "valid_type" : "None",
        "valid_range" : [ ]
      }
    }, {
      "name" : "input_file",
      "description" : "",
      "i18n_description" : null,
      "value" : "obs://test/moxingtest/",
      "constraint" : {
        "type" : "String",
        "editable" : true,
        "required" : true,
        "sensitive" : false,
        "valid_type" : "None",
        "valid_range" : [ ]
      }
    }, {
      "name" : "large_file_method",
      "description" : "",
      "i18n_description" : null,
      "value" : "1",
      "constraint" : {
        "type" : "Integer",
        "editable" : true,
        "required" : true,
        "sensitive" : false,
        "valid_type" : "None",
        "valid_range" : [ ]
      }
    } ],
    "engine" : {
      "engine_id" : "horovod-cp36-tf-1.16.2",
      "engine_name" : "Horovod",
      "engine_version" : "0.16.2-TF-1.13.1-python3.6"
    },
    "policies" : { }
  },
  "spec" : {
    "resource" : {
      "policy" : "regular",
      "flavor_id" : "modelarts.p3.large.public.free",
      "flavor_name" : "Computing GPU(Vnt1) instance",
      "node_count" : 1,
      "flavor_detail" : {
        "flavor_type" : "GPU",
        "billing" : {
          "code" : "modelarts.vm.gpu.free",
          "unit_num" : 1
        },
        "flavor_info" : {
          "cpu" : {
            "arch" : "x86",
            "core_num" : 8
          },
          "gpu" : {
            "unit_num" : 1,
            "product_name" : "GP-Vnt1",
            "memory" : "32GB"
          },
          "memory" : {
            "size" : 64,
            "unit" : "GB"
          }
        }
      }
    },
    "log_export_path" : { }
  }
}
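
A client would typically keep only a few fields from this response. The sketch below uses a hypothetical helper, summarize_created_job, and assumes that create_time is a Unix timestamp in milliseconds; the 13-digit value in the example suggests this, but this section does not state the unit.

```python
from datetime import datetime, timezone


def summarize_created_job(response: dict) -> dict:
    """Extract the job ID, name, phase, and creation time from a 201 body."""
    metadata = response["metadata"]
    # Assumption: create_time is milliseconds since the Unix epoch.
    created = datetime.fromtimestamp(metadata["create_time"] / 1000, tz=timezone.utc)
    return {
        "id": metadata["id"],
        "name": metadata["name"],
        "phase": response["status"]["phase"],
        "created_at": created,
    }
```

The returned id is the value that later calls for the same job would need.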

Status code: 400

The response body uses the common error format. The following example shows the body returned when an algorithm with ID 3f5d6706-7b67-408d-8ba0-ec08048c45ee is not found.

{
  "error_msg" : "algorithm not found.",
  "error_code" : "ModelArts.2755",
  "error_solution" : "Check whether the training project information in the request is valid."
}
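
One way to surface the three error fields in client code is to wrap them in an exception. This is a minimal sketch; TrainingJobError and raise_for_error are hypothetical names, and treating every status other than 201 as a failure is an assumption based on the two status codes listed for this API.

```python
class TrainingJobError(Exception):
    """Carries the fields of the common error response body."""

    def __init__(self, status: int, error_code: str, error_msg: str, error_solution: str):
        super().__init__(f"[{status}] {error_code}: {error_msg}")
        self.status = status
        self.error_code = error_code
        self.error_msg = error_msg
        self.error_solution = error_solution


def raise_for_error(status: int, body: dict) -> None:
    """Raise TrainingJobError unless the job was created (status 201)."""
    if status == 201:
        return
    raise TrainingJobError(
        status,
        body.get("error_code", ""),
        body.get("error_msg", ""),
        body.get("error_solution", ""),
    )
```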

Status Codes

Status Code	Description
201	ok
400	Bad request. The response body uses the common error format, for example, when the algorithm specified in the request is not found.

Error Codes

See Error Codes.

Parent topic: Training Management
