Reviewed-by: Pruthi, Vineet <vineet.pruthi@t-systems.com> Co-authored-by: Wuwan, Qi <wuwanqi1@noreply.gitea.eco.tsi-dev.otc-service.com> Co-committed-by: Wuwan, Qi <wuwanqi1@noreply.gitea.eco.tsi-dev.otc-service.com>
<a name="EN-US_TOPIC_0000002340898518"></a><a name="EN-US_TOPIC_0000002340898518"></a>
|
|
|
|
<h1 class="topictitle1">Using PyTorch to Create a Training Job (New-Version Training)</h1>
|
|
<div id="body0000001239563733"><p id="EN-US_TOPIC_0000002340898518__p15380191918816">This section describes how to train a model by calling ModelArts APIs.</p>
|
|
<div class="section" id="EN-US_TOPIC_0000002340898518__section1584656102611"><h4 class="sectiontitle">Overview</h4><p id="EN-US_TOPIC_0000002340898518__p6171183914104">The process for creating a training job using PyTorch is as follows:</p>
|
|
<ol id="EN-US_TOPIC_0000002340898518__ol51731432121217"><li id="EN-US_TOPIC_0000002340898518__li15913193817196">Obtain a user token, which is added to the request header for authentication.</li><li id="EN-US_TOPIC_0000002340898518__li6173123211219">Call the API for <a href="ShowTrainingJobFlavors.html">obtaining general flavors supported by a training job</a> to obtain the required flavors.</li><li id="EN-US_TOPIC_0000002340898518__li11901135831418">Call the API for <a href="ShowTrainingJobEngines.html">obtaining the preset AI frameworks supported by a training job</a> to view the supported engines and their versions.</li><li id="EN-US_TOPIC_0000002340898518__li33722031111515"><a name="EN-US_TOPIC_0000002340898518__li33722031111515"></a><a name="li33722031111515"></a>Call the API for <a href="CreateAlgorithm.html">creating an algorithm</a> to create an algorithm and record the algorithm ID.</li><li id="EN-US_TOPIC_0000002340898518__li62310211161"><a name="EN-US_TOPIC_0000002340898518__li62310211161"></a><a name="li62310211161"></a>Call the API for <a href="CreateTrainingJob.html">creating a training job</a> to create a training job using the algorithm ID (UUID) returned when the algorithm was created, and record the job ID.</li><li id="EN-US_TOPIC_0000002340898518__li1565892716201">Call the API for <a href="ShowTrainingJobDetails.html">querying details about a training job</a> to query the job status using the job ID.</li><li id="EN-US_TOPIC_0000002340898518__li10490131814338">Call the API for <a href="ShowObsUrlOfTrainingJobLogs.html">querying the logs of a specified task in a training job (OBS link)</a> to obtain the OBS path of the training job logs.</li><li id="EN-US_TOPIC_0000002340898518__li413871186">Call the API for <a href="ShowTrainingJobMetrics.html">querying the running metrics of a specified task in a training job</a> to view detailed metrics of the job.</li><li id="EN-US_TOPIC_0000002340898518__li8603640104114">Call the API for <a href="DeleteTrainingJob.html">deleting a training job</a> to delete the job if it is no longer needed.</li></ol>
|
|
</div>
|
|
<div class="section" id="EN-US_TOPIC_0000002340898518__section8774173316262"><h4 class="sectiontitle">Prerequisites</h4><ul id="EN-US_TOPIC_0000002340898518__ul1645122742017"><li id="EN-US_TOPIC_0000002340898518__en-us_topic_0000001121150482_li1054032119297">You have obtained the required service endpoints, including the ModelArts endpoint (<em>ma_endpoint</em>) used in the requests below.</li><li id="EN-US_TOPIC_0000002340898518__en-us_topic_0000001121150482_li178815407564">The following information is available: region where ModelArts is deployed, <a href="modelarts_03_0147.html">project ID and name</a>, <a href="modelarts_03_0148.html">account name and ID</a>, and <a href="modelarts_03_0006.html">username and user ID</a>.</li></ul>
|
|
<ul id="EN-US_TOPIC_0000002340898518__ul11540821132915"><li id="EN-US_TOPIC_0000002340898518__li205401321122913">The training code of PyTorch is available. For example, the startup file <span class="filepath" id="EN-US_TOPIC_0000002340898518__filepath5359161912189"><b>test-pytorch.py</b></span> has been stored in the <span class="filepath" id="EN-US_TOPIC_0000002340898518__filepath10655133181816"><b>obs://cnnorth4-job-test-v2/pytorch/fast_example/code/cpu</b></span> directory of OBS.</li><li id="EN-US_TOPIC_0000002340898518__li12540921112918">A data file for the training job is available. For example, a training dataset has been stored in the <span class="filepath" id="EN-US_TOPIC_0000002340898518__filepath723634814188"><b>obs://cnnorth4-job-test-v2/pytorch/fast_example/data</b></span> directory of OBS.</li><li id="EN-US_TOPIC_0000002340898518__li15944809390">A path for outputting the training job model has been created, for example, <span class="filepath" id="EN-US_TOPIC_0000002340898518__filepath1612695611188"><b>obs://cnnorth4-job-test-v2/pytorch/fast_example/outputs</b></span>.</li><li id="EN-US_TOPIC_0000002340898518__li9405152753420">A path for outputting the training job logs has been created, for example, <span class="filepath" id="EN-US_TOPIC_0000002340898518__filepath530362181911"><b>obs://cnnorth4-job-test-v2/pytorch/fast_example/log</b></span>.</li></ul>
|
|
</div>
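<p>The OBS locations above are written as <b>obs://</b> URIs, while the request bodies later in this section pass the same locations as <b>/bucket/path/</b> strings. The following Python sketch shows one way to convert between the two forms. The helper name <b>obs_to_api_path</b> is hypothetical and is not part of any ModelArts SDK.</p>

```python
def obs_to_api_path(obs_uri, is_dir=True):
    """Convert 'obs://bucket/key' to the '/bucket/key/' form used in the request bodies.

    Hypothetical helper for illustration only.
    """
    prefix = "obs://"
    if not obs_uri.startswith(prefix):
        raise ValueError(f"not an OBS URI: {obs_uri!r}")
    # Drop the scheme and any leading or trailing slashes, then add a single leading slash.
    path = "/" + obs_uri[len(prefix):].strip("/")
    # Directories end with a slash in this section's examples; files do not.
    return path + "/" if is_dir else path


code_dir = obs_to_api_path("obs://cnnorth4-job-test-v2/pytorch/fast_example/code/cpu")
boot_file = obs_to_api_path(
    "obs://cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",
    is_dir=False,
)
print(code_dir)
print(boot_file)
```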
|
|
<div class="section" id="EN-US_TOPIC_0000002340898518__section10212145710414"><h4 class="sectiontitle">Procedure</h4><ol id="EN-US_TOPIC_0000002340898518__ol577211102012"><li id="EN-US_TOPIC_0000002340898518__li676316281367"><a name="EN-US_TOPIC_0000002340898518__li676316281367"></a><a name="li676316281367"></a>Call the API for <a href="ShowTrainingJobFlavors.html">obtaining general flavors supported by a training job</a> to obtain the required flavors.<ol type="a" id="EN-US_TOPIC_0000002340898518__ol201258223404"><li id="EN-US_TOPIC_0000002340898518__li6125122212402">Request:<p id="EN-US_TOPIC_0000002340898518__p18282102520461"><a name="EN-US_TOPIC_0000002340898518__li6125122212402"></a><a name="li6125122212402"></a>URI: GET https://<em id="EN-US_TOPIC_0000002340898518__i4282182554617">{ma_endpoint}</em>/v2/<em id="EN-US_TOPIC_0000002340898518__i82821225124610">{project_id}</em>/training-job-flavors?flavor_type=CPU</p>
|
|
<p id="EN-US_TOPIC_0000002340898518__p328212504617">Request header: X-Auth-Token →<strong id="EN-US_TOPIC_0000002340898518__b52821125114614"><em id="EN-US_TOPIC_0000002340898518__i1928216255468">MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...</em></strong></p>
|
|
<p id="EN-US_TOPIC_0000002340898518__p1628252512468">Set the following parameters based on site requirements:</p>
|
|
<ul id="EN-US_TOPIC_0000002340898518__ul172825251463"><li id="EN-US_TOPIC_0000002340898518__li92821625144612"><em id="EN-US_TOPIC_0000002340898518__i57452613818">ma_endpoint</em>: ModelArts endpoint</li><li id="EN-US_TOPIC_0000002340898518__li1728222594610"><em id="EN-US_TOPIC_0000002340898518__i48361432172819">project_id</em>: user's project ID</li><li id="EN-US_TOPIC_0000002340898518__li11282225154616"><strong id="EN-US_TOPIC_0000002340898518__b1287273603216">X-Auth-Token</strong>: user token obtained for authentication</li></ul>
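<p>The request above can be assembled with the Python standard library. This is a minimal sketch: <b>build_flavor_request</b> is a hypothetical helper, and the endpoint, project ID, and token values are placeholders. The request object is only constructed here, not sent.</p>

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_flavor_request(ma_endpoint, project_id, token, flavor_type="CPU"):
    """Build (but do not send) the GET request for training job flavors."""
    query = urlencode({"flavor_type": flavor_type})
    url = f"https://{ma_endpoint}/v2/{project_id}/training-job-flavors?{query}"
    # The user token travels in the X-Auth-Token request header.
    return Request(url, headers={"X-Auth-Token": token}, method="GET")


# Placeholder values (assumptions); replace them with your own endpoint, project ID, and token.
req = build_flavor_request("modelarts.example.com", "my-project-id", "MIIZ-placeholder")
print(req.get_method(), req.full_url)
# To send the request, pass req to urllib.request.urlopen().
```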
|
|
</li><li id="EN-US_TOPIC_0000002340898518__li140424184017">Status code <strong id="EN-US_TOPIC_0000002340898518__b1635414020338">200</strong> is returned. The response body is as follows:<pre class="screen" id="EN-US_TOPIC_0000002340898518__screen1132151913475">{
|
|
"total_count": 2,
|
|
"flavors": [
|
|
{
|
|
"flavor_id": "modelarts.vm.cpu.2u",
|
|
"flavor_name": "Computing CPU(2U) instance",
|
|
"flavor_type": "CPU",
|
|
"billing": {
|
|
"code": "modelarts.vm.cpu.2u",
|
|
"unit_num": 1
|
|
},
|
|
"flavor_info": {
|
|
"max_num": 1,
|
|
"cpu": {
|
|
"arch": "x86",
|
|
"core_num": 2
|
|
},
|
|
"memory": {
|
|
"size": 8,
|
|
"unit": "GB"
|
|
},
|
|
"disk": {
|
|
"size": 50,
|
|
"unit": "GB"
|
|
}
|
|
}
|
|
},
|
|
{
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b167116488199">flavor_id</strong>": "modelarts.vm.cpu.8u",
|
|
"flavor_name": "Computing CPU(8U) instance",
|
|
"flavor_type": "CPU",
|
|
"billing": {
|
|
"code": "modelarts.vm.cpu.8u",
|
|
"unit_num": 1
|
|
},
|
|
"flavor_info": {
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b1964311172013">max_num</strong>": 16,
|
|
"cpu": {
|
|
"arch": "x86",
|
|
"core_num": 8
|
|
},
|
|
"memory": {
|
|
"size": 32,
|
|
"unit": "GB"
|
|
},
|
|
"disk": {
|
|
"size": 50,
|
|
"unit": "GB"
|
|
}
|
|
}
|
|
}
|
|
]
|
|
}</pre>
|
|
<ul id="EN-US_TOPIC_0000002340898518__ul87433342473"><li id="EN-US_TOPIC_0000002340898518__li1574303414713">Select the flavor required for the training job and record its <strong id="EN-US_TOPIC_0000002340898518__b11757529163520">flavor_id</strong> value. This section uses flavor <strong id="EN-US_TOPIC_0000002340898518__b1341545620353">modelarts.vm.cpu.8u</strong>, whose <strong id="EN-US_TOPIC_0000002340898518__b14423165983712">max_num</strong> value is <strong id="EN-US_TOPIC_0000002340898518__b1799351611345">16</strong>, as an example.</li></ul>
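<p>Selecting the flavor from the response can be scripted. The sketch below parses a shortened copy of the response body shown above and looks up a flavor by its <b>flavor_id</b>. The function name <b>pick_flavor</b> is hypothetical.</p>

```python
import json

# Shortened copy of the response body above (only the fields used here).
response_text = """
{
  "total_count": 2,
  "flavors": [
    {"flavor_id": "modelarts.vm.cpu.2u", "flavor_type": "CPU", "flavor_info": {"max_num": 1}},
    {"flavor_id": "modelarts.vm.cpu.8u", "flavor_type": "CPU", "flavor_info": {"max_num": 16}}
  ]
}
"""


def pick_flavor(body, flavor_id):
    """Return the flavor entry whose flavor_id matches, or raise LookupError."""
    for flavor in body["flavors"]:
        if flavor["flavor_id"] == flavor_id:
            return flavor
    raise LookupError(f"flavor not found: {flavor_id}")


flavor = pick_flavor(json.loads(response_text), "modelarts.vm.cpu.8u")
max_num = flavor["flavor_info"]["max_num"]
print(flavor["flavor_id"], max_num)
```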
|
|
</li></ol>
|
|
</li><li id="EN-US_TOPIC_0000002340898518__li1750593718369"><a name="EN-US_TOPIC_0000002340898518__li1750593718369"></a><a name="li1750593718369"></a>Call the API for <a href="ShowTrainingJobEngines.html">obtaining the preset AI frameworks supported by a training job</a> to view the engines and their versions supported by a training job.<ol type="a" id="EN-US_TOPIC_0000002340898518__ol363293194113"><li id="EN-US_TOPIC_0000002340898518__li3632931124119">Request:<p id="EN-US_TOPIC_0000002340898518__p147636905114"><a name="EN-US_TOPIC_0000002340898518__li3632931124119"></a><a name="li3632931124119"></a>URI: GET https://<em id="EN-US_TOPIC_0000002340898518__i67632945116">{ma_endpoint}</em>/v2/<em id="EN-US_TOPIC_0000002340898518__i187633919516">{project_id}</em>/job/training-job-engines</p>
|
|
<p id="EN-US_TOPIC_0000002340898518__p376339135117">Request header:</p>
|
|
<p id="EN-US_TOPIC_0000002340898518__p37633935119">X-Auth-Token→<strong id="EN-US_TOPIC_0000002340898518__b1776329145118"><em id="EN-US_TOPIC_0000002340898518__i17630975116">MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...</em></strong></p>
|
|
<p id="EN-US_TOPIC_0000002340898518__p2763109185113">Content-Type →application/json</p>
|
|
<p id="EN-US_TOPIC_0000002340898518__p776316911511">Set the bold parameters based on site requirements.</p>
|
|
</li><li id="EN-US_TOPIC_0000002340898518__li126641726195119">Status code <span class="parmvalue" id="EN-US_TOPIC_0000002340898518__parmvalue1061764712266"><b>200</b></span> is returned. The response body is as follows (only part of the response body is displayed because there are many engines):<pre class="screen" id="EN-US_TOPIC_0000002340898518__screen12217125045114">{
|
|
"total": 28,
|
|
"items": [
|
|
......
|
|
{
|
|
"engine_id": "mindspore_1.6.0-cann_5.0.3.6-py_3.7-euler_2.8.3-aarch64",
|
|
"engine_name": "Powered-Engine",
|
|
"engine_version": "mindspore_1.6.0-cann_5.0.3.6-py_3.7-euler_2.8.3-aarch64",
|
|
"v1_compatible": false,
|
|
"run_user": "1000",
|
|
"image_info": {
|
|
"cpu_image_url": "",
|
|
"gpu_image_url": "atelier/mindspore_1_6_0:train",
|
|
"image_version": "mindspore_1.6.0-cann_5.0.3.6-py_3.7-euler_2.8.3-aarch64-snt9-roma-20211231193205-33131ee"
|
|
}
|
|
},
|
|
......
|
|
{
|
|
"engine_id": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
|
|
"engine_name": "PyTorch",
|
|
"engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
|
|
"tags": [
|
|
{
|
|
"key": "auto_search",
|
|
"value": "True"
|
|
}
|
|
],
|
|
"v1_compatible": false,
|
|
"run_user": "1102",
|
|
"image_info": {
|
|
"cpu_image_url": "aip/pytorch_1_8:train",
|
|
"gpu_image_url": "aip/pytorch_1_8:train",
|
|
"image_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64-20210912152543-1e0838d"
|
|
}
|
|
},
|
|
......
|
|
{
|
|
"engine_id": "tensorflow_2.1.0-cuda_10.1-py_3.7-ubuntu_18.04-x86_64",
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b16113619162017">engine_name</strong>": "TensorFlow",
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b282942210203">engine_version</strong>": "tensorflow_2.1.0-cuda_10.1-py_3.7-ubuntu_18.04-x86_64",
|
|
"tags": [
|
|
{
|
|
"key": "auto_search",
|
|
"value": "True"
|
|
}
|
|
],
|
|
"v1_compatible": false,
|
|
"run_user": "1102",
|
|
"image_info": {
|
|
"cpu_image_url": "aip/tensorflow_2_1:train",
|
|
"gpu_image_url": "aip/tensorflow_2_1:train",
|
|
"image_version": "tensorflow_2.1.0-cuda_10.1-py_3.7-ubuntu_18.04-x86_64-20210912152543-1e0838d"
|
|
}
|
|
},
|
|
......
|
|
]
|
|
}</pre>
|
|
<p id="EN-US_TOPIC_0000002340898518__p1563719418515">Select the engine required for creating a training job based on the <span class="parmname" id="EN-US_TOPIC_0000002340898518__parmname1848073915283"><b>engine_name</b></span> and <span class="parmname" id="EN-US_TOPIC_0000002340898518__parmname157006457285"><b>engine_version</b></span> fields, and record the field values. This section uses the PyTorch engine as an example to describe how to create a job. In this example, the <span class="parmname" id="EN-US_TOPIC_0000002340898518__parmname516063262913"><b>engine_name</b></span> value is <span class="parmvalue" id="EN-US_TOPIC_0000002340898518__parmvalue201122035132914"><b>PyTorch</b></span>, and the <span class="parmname" id="EN-US_TOPIC_0000002340898518__parmname270383913293"><b>engine_version</b></span> value is <span class="parmvalue" id="EN-US_TOPIC_0000002340898518__parmvalue133604455298"><b>pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64</b></span>.</p>
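<p>The engine selection can be sketched in the same way. The example below filters a shortened copy of the <b>items</b> array from the response above by <b>engine_name</b> and keeps the two values needed later. The function name <b>find_engines</b> is hypothetical.</p>

```python
# Shortened copy of the "items" array from the response above.
items = [
    {
        "engine_name": "Powered-Engine",
        "engine_version": "mindspore_1.6.0-cann_5.0.3.6-py_3.7-euler_2.8.3-aarch64",
    },
    {
        "engine_name": "PyTorch",
        "engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
    },
    {
        "engine_name": "TensorFlow",
        "engine_version": "tensorflow_2.1.0-cuda_10.1-py_3.7-ubuntu_18.04-x86_64",
    },
]


def find_engines(engine_items, engine_name):
    """Return every engine entry whose engine_name matches exactly."""
    return [item for item in engine_items if item["engine_name"] == engine_name]


matches = find_engines(items, "PyTorch")
# Keep only the two fields that the algorithm's "engine" object needs.
engine_ref = {
    "engine_name": matches[0]["engine_name"],
    "engine_version": matches[0]["engine_version"],
}
print(engine_ref)
```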
|
|
</li></ol>
|
|
</li><li id="EN-US_TOPIC_0000002340898518__li62251239143618">Call the API for <a href="CreateAlgorithm.html">creating an algorithm</a> to create an algorithm and record the algorithm ID.<ol type="a" id="EN-US_TOPIC_0000002340898518__ol1915626164220"><li id="EN-US_TOPIC_0000002340898518__li791510268427">Request:<p id="EN-US_TOPIC_0000002340898518__p1425461125417"><a name="EN-US_TOPIC_0000002340898518__li791510268427"></a><a name="li791510268427"></a>URI: POST https://<em id="EN-US_TOPIC_0000002340898518__i1025471115417">{ma_endpoint}</em>/v2/<em id="EN-US_TOPIC_0000002340898518__i172541116549">{project_id}</em>/algorithms</p>
|
|
<p id="EN-US_TOPIC_0000002340898518__p8254191125418">Request header:</p>
|
|
<p id="EN-US_TOPIC_0000002340898518__p4254510546">X-Auth-Token→<strong id="EN-US_TOPIC_0000002340898518__b102544175411"><em id="EN-US_TOPIC_0000002340898518__i172548118540">MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...</em></strong></p>
|
|
<p id="EN-US_TOPIC_0000002340898518__p725417113547">Content-Type →application/json</p>
|
|
<p id="EN-US_TOPIC_0000002340898518__p12548125411">Set the bold parameters based on site requirements.</p>
|
|
<p id="EN-US_TOPIC_0000002340898518__p1572015143540">Request body:</p>
|
|
<pre class="screen" id="EN-US_TOPIC_0000002340898518__screen189001632135416">{
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b1017018272211">metadata</strong>": {
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b1615302922114">name</strong>": "test-pytorch-cpu",
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b78456341215">description</strong>": "test pytorch job in cpu in mode gloo"
|
|
},
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b1388113362212">job_config</strong>": {
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b863953912116">boot_file</strong>": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b1344512411212">code_dir</strong>": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b1724217513215">engine</strong>": {
|
|
"engine_name": "PyTorch",
|
|
"engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64"
|
|
},
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b825741417227">inputs</strong>": [{
|
|
"name": "data_url",
|
|
"description": "Data source 1"
|
|
}],
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b681413156224">outputs</strong>": [{
|
|
"name": "train_url",
|
|
"description": "Output data 1"
|
|
}],
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b7270329112211">parameters</strong>": [{
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b204333219228">name</strong>": "dist",
|
|
"description": "",
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b2034113714224">value</strong>": "False",
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b1130711393222">constraint</strong>": {
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b8607307236">editable</strong>": true,
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b14983172122316">required</strong>": false,
|
|
"sensitive": false,
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b38044416232">type</strong>": "Boolean",
|
|
"valid_range": [],
|
|
"valid_type": "None"
|
|
}
|
|
},
|
|
{
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b2499441112217">name</strong>": "world_size",
|
|
"description": "",
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b3181174332213">value</strong>": "1",
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b1767234410224">constraint</strong>": {
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b18676651132215">editable</strong>": true,
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b2051134962215">required</strong>": false,
|
|
"sensitive": false,
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b15823154182210">type</strong>": "Integer",
|
|
"valid_range": [],
|
|
"valid_type": "None"
|
|
}
|
|
}
|
|
],
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b123369244225">parameters_customization</strong>": true
|
|
},
|
|
"resource_requirements": []
|
|
}</pre>
|
|
<p id="EN-US_TOPIC_0000002340898518__p20834192920544">Set the following parameters based on site requirements:</p>
|
|
<ul id="EN-US_TOPIC_0000002340898518__ul2832124320579"><li id="EN-US_TOPIC_0000002340898518__li68327437578"><strong id="EN-US_TOPIC_0000002340898518__b5312151162016">name</strong> and <strong id="EN-US_TOPIC_0000002340898518__b1931275102015">description</strong> in the <strong id="EN-US_TOPIC_0000002340898518__b9832184312577">metadata</strong> field indicate the algorithm name and description, respectively.</li><li id="EN-US_TOPIC_0000002340898518__li983254385712"><strong id="EN-US_TOPIC_0000002340898518__b9750111415443">code_dir</strong> and <strong id="EN-US_TOPIC_0000002340898518__b13857416194417">boot_file</strong> in the <strong id="EN-US_TOPIC_0000002340898518__b083244315571">job_config</strong> field indicate the code directory and the startup file of the algorithm, respectively. The startup file must be stored directly in the code directory, as in this example.</li><li id="EN-US_TOPIC_0000002340898518__li858365345712"><strong id="EN-US_TOPIC_0000002340898518__b68322434579">inputs</strong> and <strong id="EN-US_TOPIC_0000002340898518__b12832174318578">outputs</strong> in the <strong id="EN-US_TOPIC_0000002340898518__b5832114325712">job_config</strong> field indicate the input and output of the algorithm, respectively. In this example, the input is named <span class="parmname" id="EN-US_TOPIC_0000002340898518__parmname441002643017"><b>data_url</b></span> and the output is named <span class="parmname" id="EN-US_TOPIC_0000002340898518__parmname14408281303"><b>train_url</b></span>. The training code parses these hyperparameters to obtain the local path of the data file required for training and the local output path of the model generated during training.</li></ul>
|
|
<ul id="EN-US_TOPIC_0000002340898518__ul13832124317577"><li id="EN-US_TOPIC_0000002340898518__li883294313578"><strong id="EN-US_TOPIC_0000002340898518__b1883217435570">parameters_customization</strong> in the <strong id="EN-US_TOPIC_0000002340898518__b1583216431572">job_config</strong> field indicates whether to support custom hyperparameters. Set this parameter to <strong id="EN-US_TOPIC_0000002340898518__b167854212491">true</strong>.</li><li id="EN-US_TOPIC_0000002340898518__li58321243125715"><strong id="EN-US_TOPIC_0000002340898518__b38321343145719">parameters</strong> in the <strong id="EN-US_TOPIC_0000002340898518__b198325437579">job_config</strong> field indicates the hyperparameters of the algorithm. Set <strong id="EN-US_TOPIC_0000002340898518__b7832194319575">name</strong> to the hyperparameter name (a maximum of 64 characters, including uppercase letters, lowercase letters, digits, underscores (_), and hyphens (-)). Set <strong id="EN-US_TOPIC_0000002340898518__b983214385712">value</strong> to the default value of the hyperparameter. Set <strong id="EN-US_TOPIC_0000002340898518__b4832124365718">constraint</strong> to the constraints of the hyperparameter. 
For example, set <strong id="EN-US_TOPIC_0000002340898518__b483234385715">type</strong> to <strong id="EN-US_TOPIC_0000002340898518__b3183142341213">String</strong> (<strong id="EN-US_TOPIC_0000002340898518__b2784122891220">String</strong>, <strong id="EN-US_TOPIC_0000002340898518__b87386291129">Integer</strong>, <strong id="EN-US_TOPIC_0000002340898518__b272823061211">Float</strong>, and <strong id="EN-US_TOPIC_0000002340898518__b1337283391213">Boolean</strong> are supported), set <strong id="EN-US_TOPIC_0000002340898518__b178321243185718">editable</strong> to <strong id="EN-US_TOPIC_0000002340898518__b9953646181210">true</strong>, and set <strong id="EN-US_TOPIC_0000002340898518__b1383210438572">required</strong> to <strong id="EN-US_TOPIC_0000002340898518__b18833171181311">false</strong>.</li><li id="EN-US_TOPIC_0000002340898518__li2832243115716"><strong id="EN-US_TOPIC_0000002340898518__b383214365712">engine</strong> in the <strong id="EN-US_TOPIC_0000002340898518__b198321943195718">job_config</strong> field indicates the engine on which the algorithm depends. Use the <strong id="EN-US_TOPIC_0000002340898518__b1683324310571">engine_name</strong> and <strong id="EN-US_TOPIC_0000002340898518__b1833124311571">engine_version</strong> values recorded in <a href="#EN-US_TOPIC_0000002340898518__li1750593718369">2</a>.</li></ul>
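<p>The hyperparameter rules above (a name of at most 64 characters made up of letters, digits, underscores, and hyphens, and one of four supported types) can be checked before the request is sent. The following sketch builds the <b>parameters</b> entries used in this example. The helper <b>make_parameter</b> is hypothetical, and the remaining constraint fields are copied unchanged from the request body above.</p>

```python
import re

# Up to 64 characters: uppercase and lowercase letters, digits, underscores, hyphens.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")
SUPPORTED_TYPES = {"String", "Integer", "Float", "Boolean"}


def make_parameter(name, value, type_name, editable=True, required=False):
    """Build one entry of job_config.parameters (hypothetical helper)."""
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid hyperparameter name: {name!r}")
    if type_name not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported hyperparameter type: {type_name!r}")
    return {
        "name": name,
        "description": "",
        # Default values are passed as strings in the example request body.
        "value": str(value),
        "constraint": {
            "type": type_name,
            "editable": editable,
            "required": required,
            "sensitive": False,
            "valid_range": [],
            "valid_type": "None",
        },
    }


parameters = [
    make_parameter("dist", False, "Boolean"),
    make_parameter("world_size", 1, "Integer"),
]
print([(p["name"], p["value"]) for p in parameters])
```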
|
|
</li><li id="EN-US_TOPIC_0000002340898518__li663551645812">Status code <span class="parmvalue" id="EN-US_TOPIC_0000002340898518__parmvalue141507162720"><b>200 OK</b></span> is returned, indicating that the algorithm is successfully created. The response body is as follows:<pre class="screen" id="EN-US_TOPIC_0000002340898518__screen33781461700">{
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b13809181717233">metadata</strong>": {
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b553317191236">id</strong>": "01c399ae-8593-4ef5-9e4d-085950aacde1",
|
|
"name": "test-pytorch-cpu",
|
|
"description": "test pytorch job in cpu in mode gloo",
|
|
"create_time": 1641890623262,
|
|
"workspace_id": "0",
|
|
"ai_project": "default-ai-project",
|
|
"user_name": "",
|
|
"domain_id": "0659fbf6de00109b0ff1c01fc037d240",
|
|
"source": "custom",
|
|
"api_version": "",
|
|
"is_valid": true,
|
|
"state": "",
|
|
"size": 4790,
|
|
"tags": null,
|
|
"attr_list": null,
|
|
"version_num": 0,
|
|
"update_time": 0
|
|
},
|
|
"share_info": {},
|
|
"job_config": {
|
|
"code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
|
|
"boot_file": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",
|
|
"parameters": [
|
|
{
|
|
"name": "dist",
|
|
"description": "",
|
|
"i18n_description": null,
|
|
"value": "False",
|
|
"constraint": {
|
|
"type": "Boolean",
|
|
"editable": true,
|
|
"required": false,
|
|
"sensitive": false,
|
|
"valid_type": "None",
|
|
"valid_range": []
|
|
}
|
|
},
|
|
{
|
|
"name": "world_size",
|
|
"description": "",
|
|
"i18n_description": null,
|
|
"value": "1",
|
|
"constraint": {
|
|
"type": "Integer",
|
|
"editable": true,
|
|
"required": false,
|
|
"sensitive": false,
|
|
"valid_type": "None",
|
|
"valid_range": []
|
|
}
|
|
}
|
|
],
|
|
"parameters_customization": true,
|
|
"inputs": [
|
|
{
|
|
"name": "data_url",
|
|
"description": "Data source 1"
|
|
}
|
|
],
|
|
"outputs": [
|
|
{
|
|
"name": "train_url",
|
|
"description": "Output data 1"
|
|
}
|
|
],
|
|
"engine": {
|
|
"engine_id": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
|
|
"engine_name": "PyTorch",
|
|
"engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
|
|
"tags": [
|
|
{
|
|
"key": "auto_search",
|
|
"value": "True"
|
|
}
|
|
],
|
|
"v1_compatible": false,
|
|
"run_user": "1102",
|
|
"image_info": {
|
|
"cpu_image_url": "aip/pytorch_1_8:train",
|
|
"gpu_image_url": "aip/pytorch_1_8:train",
|
|
"image_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64-20210912152543-1e0838d"
|
|
}
|
|
},
|
|
"code_tree": {
|
|
"name": "cpu/",
|
|
"children": [
|
|
{
|
|
"name": "test-pytorch.py"
|
|
}
|
|
]
|
|
}
|
|
},
|
|
"resource_requirements": [],
|
|
"advanced_config": {}
|
|
}</pre>
|
|
<p id="EN-US_TOPIC_0000002340898518__p10999581307">Record the value of <strong id="EN-US_TOPIC_0000002340898518__b149997820011">id</strong> (the algorithm ID, a UUID of 32 hexadecimal digits) in the <strong id="EN-US_TOPIC_0000002340898518__b59991381208">metadata</strong> field for subsequent steps.</p>
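<p>Recording the algorithm ID can be automated. This sketch reads <b>metadata.id</b> from a shortened copy of the response above and checks that it is a well-formed UUID before it is reused. The function name <b>extract_algorithm_id</b> is hypothetical.</p>

```python
import json
import uuid

# Shortened copy of the response body above.
response_text = (
    '{"metadata": {"id": "01c399ae-8593-4ef5-9e4d-085950aacde1", '
    '"name": "test-pytorch-cpu"}}'
)


def extract_algorithm_id(body):
    """Return metadata.id after verifying that it parses as a UUID."""
    algorithm_id = body["metadata"]["id"]
    uuid.UUID(algorithm_id)  # raises ValueError if the string is malformed
    return algorithm_id


algorithm_id = extract_algorithm_id(json.loads(response_text))
print(algorithm_id)
```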
|
|
</li></ol>
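<p>On the training-code side, the inputs, outputs, and hyperparameters declared in the algorithm need to be read by the startup file. The sketch below assumes that they reach the script as command-line arguments of the form <b>--name=value</b>. That delivery mechanism is an assumption of this example rather than something stated in this section, so verify it against your environment. The local paths are the <b>local_dir</b> values that appear in the job creation response.</p>

```python
import argparse


def str_to_bool(text):
    """Interpret strings such as 'True' or 'False' as booleans."""
    return text.strip().lower() in ("true", "1", "yes")


def parse_args(argv=None):
    """Parse the arguments this example's algorithm declares (assumed CLI form)."""
    parser = argparse.ArgumentParser(description="test-pytorch.py argument sketch")
    parser.add_argument("--data_url", type=str, default="", help="local training data path")
    parser.add_argument("--train_url", type=str, default="", help="local model output path")
    parser.add_argument("--dist", type=str_to_bool, default=False)
    parser.add_argument("--world_size", type=int, default=1)
    # Ignore any extra arguments the platform may append.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args([
    "--data_url=/home/ma-user/modelarts/inputs/data_url_0",
    "--train_url=/home/ma-user/modelarts/outputs/train_url_0",
    "--dist=False",
    "--world_size=1",
])
print(args.data_url, args.train_url, args.dist, args.world_size)
```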
|
|
</li><li id="EN-US_TOPIC_0000002340898518__li142568414367">Call the API for <a href="CreateTrainingJob.html">creating a training job</a> to create a training job using the algorithm ID (UUID) returned when the algorithm was created, and record the job ID.<ol type="a" id="EN-US_TOPIC_0000002340898518__ol86082047164214"><li id="EN-US_TOPIC_0000002340898518__li86081147154218">Request:<p id="EN-US_TOPIC_0000002340898518__p482613486319"><a name="EN-US_TOPIC_0000002340898518__li86081147154218"></a><a name="li86081147154218"></a>URI: POST https://<em id="EN-US_TOPIC_0000002340898518__i1682614481638">{ma_endpoint}</em>/v2/<em id="EN-US_TOPIC_0000002340898518__i18261848537">{project_id}</em>/training-jobs</p>
|
|
<p id="EN-US_TOPIC_0000002340898518__p6826104819310">Request header:</p>
|
|
<ul id="EN-US_TOPIC_0000002340898518__ul2082694812312"><li id="EN-US_TOPIC_0000002340898518__li138261448238">X-Auth-Token →<strong id="EN-US_TOPIC_0000002340898518__b1982612489317"><em id="EN-US_TOPIC_0000002340898518__i18826648335">MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...</em></strong></li><li id="EN-US_TOPIC_0000002340898518__li118264481337">Content-Type →application/json</li></ul>
|
|
<p id="EN-US_TOPIC_0000002340898518__p4826148939">Set the bold parameters based on site requirements.</p>
|
|
<p id="EN-US_TOPIC_0000002340898518__p147991622748">Request body:</p>
|
|
<pre class="screen" id="EN-US_TOPIC_0000002340898518__screen595415416416">{
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b74904307236">kind</strong>": "job",
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b1139325237">metadata</strong>": {
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b179712036162312">name</strong>": "test-pytorch-cpu01",
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b1683883818231">description</strong>": "test pytorch work cpu in mode gloo"
|
|
},
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b117381540122317">algorithm</strong>": {
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b145451643112317">id</strong>": "01c399ae-8593-4ef5-9e4d-085950aacde1",
|
|
"parameters": [{
|
|
"name": "dist",
|
|
"value": "False"
|
|
},
|
|
{
|
|
"name": "world_size",
|
|
"value": "1"
|
|
}
|
|
],
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b1190924912234">inputs</strong>": [{
|
|
"name": "data_url",
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b1019765822312">remote</strong>": {
|
|
"obs": {
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b640511415242">obs_url</strong>": "/cnnorth4-job-test-v2/pytorch/fast_example/data/"
|
|
}
|
|
}
|
|
}],
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b1573815162314">outputs</strong>": [{
|
|
"name": "train_url",
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b1071361102417">remote</strong>": {
|
|
"obs": {
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b6413512142416">obs_url</strong>": "/cnnorth4-job-test-v2/pytorch/fast_example/outputs/"
|
|
}
|
|
}
|
|
}]
|
|
},
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b1491291913247">spec</strong>": {
|
|
"resource": {
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b165951222142410">flavor_id</strong>": "modelarts.vm.cpu.8u",
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b5680825102415">node_count</strong>": 1
|
|
},
|
|
"<strong id="EN-US_TOPIC_0000002340898518__b647517287244">log_export_path</strong>": {
|
|
"obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
|
|
}
|
|
}
|
|
}</pre>
|
|
<p id="EN-US_TOPIC_0000002340898518__p787914143518">Set the following parameters based on site requirements:</p>
|
|
<ul id="EN-US_TOPIC_0000002340898518__ul587941412510"><li id="EN-US_TOPIC_0000002340898518__li4879614351">Set <strong id="EN-US_TOPIC_0000002340898518__b187920141510">kind</strong> to the type of the training job. The default value is <strong id="EN-US_TOPIC_0000002340898518__b12917134915391">job</strong>.</li><li id="EN-US_TOPIC_0000002340898518__li1487913146512">Set <strong id="EN-US_TOPIC_0000002340898518__b4879414559">name</strong> and <strong id="EN-US_TOPIC_0000002340898518__b38791414958">description</strong> in the <strong id="EN-US_TOPIC_0000002340898518__b148791214355">metadata</strong> field to the name and description of the training job.</li><li id="EN-US_TOPIC_0000002340898518__li68796145519">Set <strong id="EN-US_TOPIC_0000002340898518__b88791514855">id</strong> in the <strong id="EN-US_TOPIC_0000002340898518__b1879121420516">algorithm</strong> field to the algorithm ID obtained in <a href="#EN-US_TOPIC_0000002340898518__li33722031111515">4</a>.</li><li id="EN-US_TOPIC_0000002340898518__li12879191414512">Set <strong id="EN-US_TOPIC_0000002340898518__b178791814656">inputs</strong> and <strong id="EN-US_TOPIC_0000002340898518__b188791149518">outputs</strong> in the <strong id="EN-US_TOPIC_0000002340898518__b68793143513">algorithm</strong> field to the information about the input and output URLs of the training job. In this example, <strong id="EN-US_TOPIC_0000002340898518__b158790147511">obs_url</strong> in <strong id="EN-US_TOPIC_0000002340898518__b18796141250">remote</strong> of the <strong id="EN-US_TOPIC_0000002340898518__b98797148517">inputs</strong> parameter indicates the OBS path for selecting the training data from the OBS bucket. 
<strong id="EN-US_TOPIC_0000002340898518__b128790147520">obs_url</strong> in <strong id="EN-US_TOPIC_0000002340898518__b78798143512">remote</strong> of the <strong id="EN-US_TOPIC_0000002340898518__b208791814157">outputs</strong> parameter indicates the OBS path for storing the training output.</li><li id="EN-US_TOPIC_0000002340898518__li137099227512">Set <strong id="EN-US_TOPIC_0000002340898518__b16879714256">flavor_id</strong> in <strong>resource</strong> of the <strong id="EN-US_TOPIC_0000002340898518__b20879191420515">spec</strong> field to the flavor on which the training job depends. Use the <strong id="EN-US_TOPIC_0000002340898518__b122714482081">flavor_id</strong> recorded in <a href="#EN-US_TOPIC_0000002340898518__li676316281367">1</a>. <strong id="EN-US_TOPIC_0000002340898518__b1687917146516">node_count</strong> specifies the number of training nodes, which determines whether multi-node (distributed) training is used. The default value <strong id="EN-US_TOPIC_0000002340898518__b105045715167">1</strong> runs single-node training. <strong id="EN-US_TOPIC_0000002340898518__b687971413518">log_export_path</strong> specifies the OBS path to which logs are uploaded.</li></ul>
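<p>The request body above can be generated from the values recorded in the earlier steps. The following sketch assembles it as a Python dictionary and serializes it to JSON. The function <b>build_training_job</b> is a hypothetical helper, and the hyperparameter values are the ones used in this example.</p>

```python
import json


def build_training_job(name, description, algorithm_id, flavor_id,
                       data_obs, output_obs, log_obs, node_count=1):
    """Assemble the training job request body shown above (hypothetical helper)."""
    return {
        "kind": "job",
        "metadata": {"name": name, "description": description},
        "algorithm": {
            "id": algorithm_id,
            "parameters": [
                {"name": "dist", "value": "False"},
                {"name": "world_size", "value": "1"},
            ],
            "inputs": [{"name": "data_url", "remote": {"obs": {"obs_url": data_obs}}}],
            "outputs": [{"name": "train_url", "remote": {"obs": {"obs_url": output_obs}}}],
        },
        "spec": {
            "resource": {"flavor_id": flavor_id, "node_count": node_count},
            "log_export_path": {"obs_url": log_obs},
        },
    }


job = build_training_job(
    name="test-pytorch-cpu01",
    description="test pytorch work cpu in mode gloo",
    algorithm_id="01c399ae-8593-4ef5-9e4d-085950aacde1",
    flavor_id="modelarts.vm.cpu.8u",
    data_obs="/cnnorth4-job-test-v2/pytorch/fast_example/data/",
    output_obs="/cnnorth4-job-test-v2/pytorch/fast_example/outputs/",
    log_obs="/cnnorth4-job-test-v2/pytorch/fast_example/log/",
)
payload = json.dumps(job, indent=2)
print(payload)
```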
|
|
</li><li id="EN-US_TOPIC_0000002340898518__li18838652184211">Status code <span class="parmvalue" id="EN-US_TOPIC_0000002340898518__parmvalue1329814311185"><b>201 Created</b></span> is returned, indicating that the training job has been created. The response body is as follows:<pre class="screen" id="EN-US_TOPIC_0000002340898518__screen4127113219619">{
    "kind": "job",
    "<strong id="EN-US_TOPIC_0000002340898518__b552534711246">metadata</strong>": {
        "<strong id="EN-US_TOPIC_0000002340898518__b16385349152415">id</strong>": "66ff6991-fd66-40b6-8101-0829a46d3731",
        "name": "test-pytorch-cpu01",
        "description": "test pytorch work cpu in mode gloo",
        "create_time": 1641892642625,
        "workspace_id": "0",
        "ai_project": "default-ai-project",
        "user_name": "",
        "annotations": {
            "job_template": "Template DL",
            "key_task": "worker"
        }
    },
    "<strong id="EN-US_TOPIC_0000002340898518__b823834392412">status</strong>": {
        "phase": "Creating",
        "secondary_phase": "Creating",
        "duration": 0,
        "start_time": 0,
        "node_count_metrics": null,
        "tasks": [
            "worker-0"
        ]
    },
    "algorithm": {
        "id": "01c399ae-8593-4ef5-9e4d-085950aacde1",
        "name": "test-pytorch-cpu",
        "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
        "boot_file": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",
        "parameters": [
            {
                "name": "dist",
                "description": "",
                "i18n_description": null,
                "value": "False",
                "constraint": {
                    "type": "Boolean",
                    "editable": true,
                    "required": false,
                    "sensitive": false,
                    "valid_type": "None",
                    "valid_range": []
                }
            },
            {
                "name": "world_size",
                "description": "",
                "i18n_description": null,
                "value": "1",
                "constraint": {
                    "type": "Integer",
                    "editable": true,
                    "required": false,
                    "sensitive": false,
                    "valid_type": "None",
                    "valid_range": []
                }
            }
        ],
        "parameters_customization": true,
        "inputs": [
            {
                "name": "data_url",
                "description": "Data source 1",
                "local_dir": "/home/ma-user/modelarts/inputs/data_url_0",
                "remote": {
                    "obs": {
                        "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/data/"
                    }
                }
            }
        ],
        "outputs": [
            {
                "name": "train_url",
                "description": "Output data 1",
                "local_dir": "/home/ma-user/modelarts/outputs/train_url_0",
                "remote": {
                    "obs": {
                        "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/outputs/"
                    }
                },
                "mode": "upload_periodically",
                "period": 30
            }
        ],
        "engine": {
            "engine_id": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
            "engine_name": "PyTorch",
            "engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
            "usage": "training",
            "support_groups": "public",
            "tags": [
                {
                    "key": "auto_search",
                    "value": "True"
                }
            ],
            "v1_compatible": false,
            "run_user": "1102"
        }
    },
    "spec": {
        "resource": {
            "flavor_id": "modelarts.vm.cpu.8u",
            "flavor_name": "Computing CPU(8U) instance",
            "node_count": 1,
            "flavor_detail": {
                "flavor_type": "CPU",
                "billing": {
                    "code": "modelarts.vm.cpu.8u",
                    "unit_num": 1
                },
                "flavor_info": {
                    "cpu": {
                        "arch": "x86",
                        "core_num": 8
                    },
                    "memory": {
                        "size": 32,
                        "unit": "GB"
                    },
                    "disk": {
                        "size": 50,
                        "unit": "GB"
                    }
                }
            }
        },
        "log_export_path": {
            "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
        },
        "is_hosted_log": true
    }
}</pre>
<ul id="EN-US_TOPIC_0000002340898518__ul632414183717"><li id="EN-US_TOPIC_0000002340898518__li1532414189710">Record the <strong id="EN-US_TOPIC_0000002340898518__b1732419181670">id</strong> value (training job ID) in the <strong id="EN-US_TOPIC_0000002340898518__b153243181274">metadata</strong> field for subsequent steps.</li><li id="EN-US_TOPIC_0000002340898518__li19932162115713"><span class="parmname" id="EN-US_TOPIC_0000002340898518__parmname125321343105212"><b>phase</b></span> and <span class="parmname" id="EN-US_TOPIC_0000002340898518__parmname1049864519523"><b>secondary_phase</b></span> under <span class="parmname" id="EN-US_TOPIC_0000002340898518__parmname126821141195220"><b>status</b></span> indicate the primary status and the more detailed secondary status of the training job, respectively. In the example, <span class="parmvalue" id="EN-US_TOPIC_0000002340898518__parmvalue93002218535"><b>Creating</b></span> indicates that the training job is being created.</li></ul>
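<p>As an illustration, the fields to record can be read from the response with a few lines of Python. This is a minimal sketch: the helper name <strong>summarize_job</strong> is hypothetical, and the sample string is a trimmed copy of the response body above.</p>

```python
import json


def summarize_job(response_text):
    """Pick out the job ID and status fields from a create-job response."""
    job = json.loads(response_text)
    status = job["status"]
    return {
        "job_id": job["metadata"]["id"],          # needed by all later calls
        "phase": status["phase"],
        "secondary_phase": status["secondary_phase"],
        "tasks": status["tasks"],                  # task names such as worker-0
    }


# Trimmed copy of the example response body.
sample_response = json.dumps({
    "kind": "job",
    "metadata": {"id": "66ff6991-fd66-40b6-8101-0829a46d3731"},
    "status": {
        "phase": "Creating",
        "secondary_phase": "Creating",
        "tasks": ["worker-0"],
    },
})

summary = summarize_job(sample_response)
print(summary["job_id"], summary["phase"])
```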
</li></ol>
</li><li id="EN-US_TOPIC_0000002340898518__li445311427369">Call the API for <a href="ShowTrainingJobDetails.html">querying details about a training job</a> to query the job status using the job ID.<ol type="a" id="EN-US_TOPIC_0000002340898518__ol1482812534313"><li id="EN-US_TOPIC_0000002340898518__li18828102504320">Request body:<p id="EN-US_TOPIC_0000002340898518__p143521581711"><a name="EN-US_TOPIC_0000002340898518__li18828102504320"></a><a name="li18828102504320"></a>URI: GET https://<em id="EN-US_TOPIC_0000002340898518__i193521058378">{ma_endpoint}</em>/v2/<em id="EN-US_TOPIC_0000002340898518__i1435211588710">{project_id}</em>/training-jobs/<strong id="EN-US_TOPIC_0000002340898518__b6352258171"><em id="EN-US_TOPIC_0000002340898518__i23527586713">{training_job_id}</em></strong></p>
<p id="EN-US_TOPIC_0000002340898518__p23522581720">Request header: X-Auth-Token →<strong id="EN-US_TOPIC_0000002340898518__b3536151122415"><em id="EN-US_TOPIC_0000002340898518__i175362116249">MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...</em></strong></p>
<p id="EN-US_TOPIC_0000002340898518__p163527581275">Set the following parameter based on site requirements:</p>
<p id="EN-US_TOPIC_0000002340898518__p1435218581771">Set <strong id="EN-US_TOPIC_0000002340898518__b1235216581677">training_job_id</strong> to the training job ID recorded in <a href="#EN-US_TOPIC_0000002340898518__li62310211161">5</a>.</p>
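<p>The request can be assembled with the Python standard library, for example. This is a minimal sketch: the helper name <strong>build_job_details_request</strong>, the endpoint <strong>modelarts.example.com</strong>, the project ID, and the token value are placeholders, not real values.</p>

```python
import urllib.request


def build_job_details_request(ma_endpoint, project_id, training_job_id, token):
    """Build (but do not send) the GET request for training job details."""
    url = f"https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}"
    return urllib.request.Request(
        url,
        headers={"X-Auth-Token": token},  # user token obtained in the first step
        method="GET",
    )


request = build_job_details_request(
    ma_endpoint="modelarts.example.com",      # placeholder endpoint
    project_id="my-project-id",               # placeholder project ID
    training_job_id="66ff6991-fd66-40b6-8101-0829a46d3731",
    token="token-placeholder",
)
print(request.get_method(), request.full_url)
# To send the request, pass it to urllib.request.urlopen() and decode
# the JSON body of the response.
```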
</li><li id="EN-US_TOPIC_0000002340898518__li16902211085">Status code <span class="parmvalue" id="EN-US_TOPIC_0000002340898518__parmvalue1468122413278"><b>200 OK</b></span> is returned. The response body is as follows:<pre class="screen" id="EN-US_TOPIC_0000002340898518__screen1468712151490">{
    "kind": "job",
    "metadata": {
        "id": "66ff6991-fd66-40b6-8101-0829a46d3731",
        "name": "test-pytorch-cpu01",
        "description": "test pytorch work cpu in mode gloo",
        "create_time": 1641892642625,
        "workspace_id": "0",
        "ai_project": "default-ai-project",
        "user_name": "hwstaff_z00424192",
        "annotations": {
            "job_template": "Template DL",
            "key_task": "worker"
        }
    },
    "<strong id="EN-US_TOPIC_0000002340898518__b35485812250">status</strong>": {
        "phase": "Running",
        "secondary_phase": "Running",
        "duration": 268000,
        "start_time": 1641892655000,
        "node_count_metrics": [
            [
                1641892645000,
                0
            ],
            [
                1641892654000,
                0
            ],
            [
                1641892655000,
                1
            ],
            [
                1641892922000,
                1
            ],
            [
                1641892923000,
                1
            ]
        ],
        "tasks": [
            "worker-0"
        ]
    },
    "algorithm": {
        "id": "01c399ae-8593-4ef5-9e4d-085950aacde1",
        "name": "test-pytorch-cpu",
        "code_dir": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/",
        "boot_file": "/cnnorth4-job-test-v2/pytorch/fast_example/code/cpu/test-pytorch.py",
        "parameters": [
            {
                "name": "dist",
                "description": "",
                "i18n_description": null,
                "value": "False",
                "constraint": {
                    "type": "Boolean",
                    "editable": true,
                    "required": false,
                    "sensitive": false,
                    "valid_type": "None",
                    "valid_range": []
                }
            },
            {
                "name": "world_size",
                "description": "",
                "i18n_description": null,
                "value": "1",
                "constraint": {
                    "type": "Integer",
                    "editable": true,
                    "required": false,
                    "sensitive": false,
                    "valid_type": "None",
                    "valid_range": []
                }
            }
        ],
        "parameters_customization": true,
        "inputs": [
            {
                "name": "data_url",
                "description": "Data source 1",
                "local_dir": "/home/ma-user/modelarts/inputs/data_url_0",
                "remote": {
                    "obs": {
                        "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/data/"
                    }
                }
            }
        ],
        "outputs": [
            {
                "name": "train_url",
                "description": "Output data 1",
                "local_dir": "/home/ma-user/modelarts/outputs/train_url_0",
                "remote": {
                    "obs": {
                        "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/outputs/"
                    }
                },
                "mode": "upload_periodically",
                "period": 30
            }
        ],
        "engine": {
            "engine_id": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
            "engine_name": "PyTorch",
            "engine_version": "pytorch_1.8.0-cuda_10.2-py_3.7-ubuntu_18.04-x86_64",
            "usage": "training",
            "support_groups": "public",
            "tags": [
                {
                    "key": "auto_search",
                    "value": "True"
                }
            ],
            "v1_compatible": false,
            "run_user": "1102"
        }
    },
    "spec": {
        "resource": {
            "flavor_id": "modelarts.vm.cpu.8u",
            "flavor_name": "Computing CPU(8U) instance",
            "node_count": 1,
            "flavor_detail": {
                "flavor_type": "CPU",
                "billing": {
                    "code": "modelarts.vm.cpu.8u",
                    "unit_num": 1
                },
                "flavor_info": {
                    "cpu": {
                        "arch": "x86",
                        "core_num": 8
                    },
                    "memory": {
                        "size": 32,
                        "unit": "GB"
                    },
                    "disk": {
                        "size": 50,
                        "unit": "GB"
                    }
                }
            }
        },
        "log_export_path": {
            "obs_url": "/cnnorth4-job-test-v2/pytorch/fast_example/log/"
        },
        "is_hosted_log": true
    }
}</pre>
<p id="EN-US_TOPIC_0000002340898518__p52847510916">The response contains the details of the training job. In this example, <strong>phase</strong> under <strong id="EN-US_TOPIC_0000002340898518__b1952544214282">status</strong> is <strong id="EN-US_TOPIC_0000002340898518__b2525242142811">Running</strong>, indicating that the training job is running.</p>
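<p>Because the status changes over time, this query is usually repeated until the job finishes. The following is a minimal polling sketch under stated assumptions: <strong>wait_for_job</strong> and <strong>fetch_job</strong> are hypothetical names, <strong>fetch_job</strong> stands for any function that returns the decoded response body of this API, and the set of final phase names is an assumption that should be checked against the API reference (only <strong>Creating</strong> and <strong>Running</strong> appear in this example).</p>

```python
import time

# Assumption: phase names that mean the job will not change any further.
# Verify these against the API reference before relying on them.
TERMINAL_PHASES = frozenset({"Completed", "Failed", "Terminated", "Abnormal"})


def wait_for_job(fetch_job, interval_seconds=30.0, max_polls=120,
                 terminal_phases=TERMINAL_PHASES, sleep=time.sleep):
    """Poll fetch_job() until status.phase is terminal, then return the phase."""
    phase = None
    for attempt in range(max_polls):
        phase = fetch_job()["status"]["phase"]
        if phase in terminal_phases:
            return phase
        if attempt < max_polls - 1:
            sleep(interval_seconds)  # wait before the next query
    raise TimeoutError(f"job still in phase {phase!r} after {max_polls} polls")


# Demonstration with canned responses instead of real API calls.
canned = iter([
    {"status": {"phase": "Creating"}},
    {"status": {"phase": "Running"}},
    {"status": {"phase": "Completed"}},
])
final_phase = wait_for_job(lambda: next(canned), interval_seconds=0,
                           sleep=lambda seconds: None)
print(final_phase)
```

<p>Passing the query function in as an argument keeps the loop independent of the HTTP client and makes it testable without network access.</p>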
</li></ol>
</li><li id="EN-US_TOPIC_0000002340898518__li151494303619">Call the API for <a href="ShowObsUrlOfTrainingJobLogs.html">querying the logs of a specified task in a training job (OBS link)</a> to obtain the OBS path of the training job logs.<ol type="a" id="EN-US_TOPIC_0000002340898518__ol17877123964317"><li id="EN-US_TOPIC_0000002340898518__li13877539164319">Request body:<p id="EN-US_TOPIC_0000002340898518__p1429371371010"><a name="EN-US_TOPIC_0000002340898518__li13877539164319"></a><a name="li13877539164319"></a>URI format: GET https://<em id="EN-US_TOPIC_0000002340898518__i162931213171015">{ma_endpoint}</em>/v2/<em id="EN-US_TOPIC_0000002340898518__i62931513181019">{project_id}</em>/training-jobs/<em id="EN-US_TOPIC_0000002340898518__i429331381017">{training_job_id}</em>/tasks/<em id="EN-US_TOPIC_0000002340898518__i8293313151014">{task_id}</em>/logs/url</p>
<p id="EN-US_TOPIC_0000002340898518__p5293191316101">Request header:</p>
<p id="EN-US_TOPIC_0000002340898518__p0293201315104">X-Auth-Token→<strong id="EN-US_TOPIC_0000002340898518__b4293101351014"><em id="EN-US_TOPIC_0000002340898518__i16293613191014">MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...</em></strong></p>
<p id="EN-US_TOPIC_0000002340898518__p1529361316103">Content-Type→<strong id="EN-US_TOPIC_0000002340898518__b1629311134108"><em id="EN-US_TOPIC_0000002340898518__i16293171351014">text/plain</em></strong></p>
<p id="EN-US_TOPIC_0000002340898518__p32931413121019">Set the following parameters based on site requirements:</p>
<ul id="EN-US_TOPIC_0000002340898518__ul10293191361011"><li id="EN-US_TOPIC_0000002340898518__li10293191391018"><strong id="EN-US_TOPIC_0000002340898518__b0293131314107">task_id</strong> indicates the name of a task in the training job, as listed under tasks in the job details response. In this example, it is <strong id="EN-US_TOPIC_0000002340898518__b45504573211">worker-0</strong>.</li><li id="EN-US_TOPIC_0000002340898518__li287113194101"><strong id="EN-US_TOPIC_0000002340898518__b8858121863218">Content-Type</strong> can be set to either <strong id="EN-US_TOPIC_0000002340898518__b7654144233414">text/plain</strong> or <strong id="EN-US_TOPIC_0000002340898518__b12803174418346">application/octet-stream</strong>. <strong id="EN-US_TOPIC_0000002340898518__b16601442103511">text/plain</strong> indicates that a temporary OBS preview URL is returned. <strong id="EN-US_TOPIC_0000002340898518__b8274756123511">application/octet-stream</strong> indicates that a temporary OBS download URL is returned.</li></ul>
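<p>The choice between the two <strong>Content-Type</strong> values can be made explicit in code. This is a minimal sketch: the helper name <strong>build_log_url_request</strong>, the <strong>mode</strong> argument, the endpoint, the project ID, and the token are hypothetical, and the default task name <strong>worker-0</strong> is taken from the tasks list in the example job details response.</p>

```python
import urllib.request

# Content-Type values described above, keyed by what the caller wants.
CONTENT_TYPES = {
    "preview": "text/plain",                  # temporary OBS preview URL
    "download": "application/octet-stream",   # temporary OBS download URL
}


def build_log_url_request(ma_endpoint, project_id, training_job_id, token,
                          task_id="worker-0", mode="preview"):
    """Build (but do not send) the GET request for the OBS log URL of a task."""
    if mode not in CONTENT_TYPES:
        raise ValueError(f"mode must be one of {sorted(CONTENT_TYPES)}")
    url = (f"https://{ma_endpoint}/v2/{project_id}/training-jobs/"
           f"{training_job_id}/tasks/{task_id}/logs/url")
    headers = {"X-Auth-Token": token, "Content-Type": CONTENT_TYPES[mode]}
    return urllib.request.Request(url, headers=headers, method="GET")


log_request = build_log_url_request(
    ma_endpoint="modelarts.example.com",      # placeholder endpoint
    project_id="my-project-id",               # placeholder project ID
    training_job_id="66ff6991-fd66-40b6-8101-0829a46d3731",
    token="token-placeholder",
    mode="download",
)
print(log_request.full_url)
```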
</li><li id="EN-US_TOPIC_0000002340898518__li16140124319108">Status code <span class="parmvalue" id="EN-US_TOPIC_0000002340898518__parmvalue6297142913367"><b>200 OK</b></span> is returned. <p id="EN-US_TOPIC_0000002340898518__p106717261119">The returned field contains the temporary OBS URL of the logs. Open the URL in a browser to view the result.</p>
</li></ol>
</li><li id="EN-US_TOPIC_0000002340898518__li659220440367">Call the API for <a href="ShowTrainingJobMetrics.html">querying the running metrics of a specified task in a training job</a> to view detailed metrics of the job.<ol type="a" id="EN-US_TOPIC_0000002340898518__ol16371568431"><li id="EN-US_TOPIC_0000002340898518__li13637155611436">Request body:<p id="EN-US_TOPIC_0000002340898518__p6441194411114"><a name="EN-US_TOPIC_0000002340898518__li13637155611436"></a><a name="li13637155611436"></a>URI format: GET https://<em id="EN-US_TOPIC_0000002340898518__i844115446113">{ma_endpoint}</em>/v2/<em id="EN-US_TOPIC_0000002340898518__i34411244191118">{project_id}</em>/training-jobs/<em id="EN-US_TOPIC_0000002340898518__i1844254414112">{training_job_id}</em>/metrics/<em id="EN-US_TOPIC_0000002340898518__i10442194431116">{task_id}</em></p>
<p id="EN-US_TOPIC_0000002340898518__p10442344181113">Request header: X-Auth-Token →<strong id="EN-US_TOPIC_0000002340898518__b5233125610384"><em id="EN-US_TOPIC_0000002340898518__i1623335616385">MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...</em></strong></p>
<p id="EN-US_TOPIC_0000002340898518__p84421844171113">Set the parameters in the URI and the token in the request header based on site requirements.</p>
</li><li id="EN-US_TOPIC_0000002340898518__li118671455181118">Status code <span class="parmvalue" id="EN-US_TOPIC_0000002340898518__parmvalue129141911133917"><b>200 OK</b></span> is returned. The response body is as follows:<pre class="screen" id="EN-US_TOPIC_0000002340898518__screen996352715138">{
    "metrics": [
        {
            "metric": "cpuUsage",
            "value": [
                -1,
                -1,
                28.622,
                35.053,
                39.988,
                40.069,
                40.082,
                40.094
            ]
        },
        {
            "metric": "memUsage",
            "value": [
                -1,
                -1,
                0.544,
                0.641,
                0.736,
                0.737,
                0.738,
                0.739
            ]
        },
        {
            "metric": "npuUtil",
            "value": [
                -1,
                -1,
                -1,
                -1,
                -1,
                -1,
                -1,
                -1
            ]
        },
        {
            "metric": "npuMemUsage",
            "value": [
                -1,
                -1,
                -1,
                -1,
                -1,
                -1,
                -1,
                -1
            ]
        },
        {
            "metric": "gpuUtil",
            "value": [
                -1,
                -1,
                -1,
                -1,
                -1,
                -1,
                -1,
                -1
            ]
        },
        {
            "metric": "gpuMemUsage",
            "value": [
                -1,
                -1,
                -1,
                -1,
                -1,
                -1,
                -1,
                -1
            ]
        }
    ]
}</pre>
<p id="EN-US_TOPIC_0000002340898518__p326014212137">The response contains the resource metrics of the task, such as the CPU usage and memory usage.</p>
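<p>A response of this shape can be condensed into a few numbers per metric. This is a minimal sketch: the helper name <strong>summarize_metrics</strong> is hypothetical, and treating <strong>-1</strong> as a sample with no data is an assumption suggested by the example (the GPU and NPU series of this CPU job contain only <strong>-1</strong>) rather than something stated here.</p>

```python
def summarize_metrics(response, missing=-1):
    """Return latest, peak, and mean per metric, skipping missing samples."""
    summary = {}
    for entry in response["metrics"]:
        # Assumption: values equal to `missing` carry no measurement.
        samples = [value for value in entry["value"] if value != missing]
        if samples:
            summary[entry["metric"]] = {
                "latest": samples[-1],
                "peak": max(samples),
                "mean": sum(samples) / len(samples),
            }
    return summary


# Trimmed copy of the example response body.
sample_metrics = {
    "metrics": [
        {"metric": "cpuUsage",
         "value": [-1, -1, 28.622, 35.053, 39.988, 40.069, 40.082, 40.094]},
        {"metric": "memUsage",
         "value": [-1, -1, 0.544, 0.641, 0.736, 0.737, 0.738, 0.739]},
        {"metric": "gpuUtil", "value": [-1, -1, -1, -1, -1, -1, -1, -1]},
    ]
}

metric_summary = summarize_metrics(sample_metrics)
for name, stats in metric_summary.items():
    print(name, stats["latest"], stats["peak"])
```

<p>Metrics whose series contain no usable samples are left out of the result, so a CPU job reports only the CPU and memory metrics.</p>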
</li></ol>
</li><li id="EN-US_TOPIC_0000002340898518__li12755144593615">Call the API for <a href="DeleteTrainingJob.html">deleting a training job</a> to delete the job if it is no longer needed.<ol type="a" id="EN-US_TOPIC_0000002340898518__ol1691722014417"><li id="EN-US_TOPIC_0000002340898518__li0917152017446">Request body:<p id="EN-US_TOPIC_0000002340898518__p10560118141410"><a name="EN-US_TOPIC_0000002340898518__li0917152017446"></a><a name="li0917152017446"></a>URI: DELETE https://<em id="EN-US_TOPIC_0000002340898518__i0560128101412">{ma_endpoint}</em>/v2/<em id="EN-US_TOPIC_0000002340898518__i7560118111410">{project_id}</em>/training-jobs/<em id="EN-US_TOPIC_0000002340898518__i75607811142">{training_job_id}</em></p>
<p id="EN-US_TOPIC_0000002340898518__p3560178151416">Request header: X-Auth-Token →<strong id="EN-US_TOPIC_0000002340898518__b1290018343404"><em id="EN-US_TOPIC_0000002340898518__i1890083416407">MIIZmgYJKoZIhvcNAQcCoIIZizCCGYcCAQExDTALBglghkgBZQMEAgEwgXXXXXX...</em></strong></p>
<p id="EN-US_TOPIC_0000002340898518__p185601583147">Set the bold parameters based on site requirements.</p>
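<p>The deletion request differs from the details query only in its HTTP method. This is a minimal sketch: the helper name <strong>build_delete_job_request</strong>, the endpoint, the project ID, and the token are placeholders.</p>

```python
import urllib.request


def build_delete_job_request(ma_endpoint, project_id, training_job_id, token):
    """Build (but do not send) the DELETE request for a training job."""
    url = f"https://{ma_endpoint}/v2/{project_id}/training-jobs/{training_job_id}"
    return urllib.request.Request(
        url,
        headers={"X-Auth-Token": token},
        method="DELETE",  # same URI as the details query, different method
    )


delete_request = build_delete_job_request(
    ma_endpoint="modelarts.example.com",      # placeholder endpoint
    project_id="my-project-id",               # placeholder project ID
    training_job_id="66ff6991-fd66-40b6-8101-0829a46d3731",
    token="token-placeholder",
)
print(delete_request.get_method(), delete_request.full_url)
```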
</li><li id="EN-US_TOPIC_0000002340898518__li165051226154418">Status code <span class="parmvalue" id="EN-US_TOPIC_0000002340898518__parmvalue10907772281"><b>202 No Content</b></span> is returned, indicating that the job is successfully deleted.</li></ol>
</li></ol>
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="modelarts_03_0400.html">Use Cases</a></div>
</div>
</div>