doc-exports/docs/modelarts/umn/develop-modelarts-0104.html
Lai, Weijian 4e4b2d5f6d ModelArts UMN 23.3.0 Version.
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Lai, Weijian <laiweijian4@huawei.com>
Co-committed-by: Lai, Weijian <laiweijian4@huawei.com>
2024-06-26 07:03:02 +00:00

Viewing Environment Variables of a Training Container

What Is an Environment Variable

This section describes environment variables preset in a training container. The environment variables include:

  • Path environment variables
  • Environment variables of a distributed training job
  • NVIDIA Collective Communications Library (NCCL) environment variables
  • OBS environment variables
  • Environment variables of the pip source
  • Environment variables of the API Gateway address
  • Environment variables of job metadata
  • Precheck environment variables

Configuring Environment Variables

When you create a training job, you can add environment variables or modify environment variables preset in the training container.

Figure 1 Setting environment variables

Environment Variables Preset in a Training Container

The following tables list environment variables preset in a training container.

The values shown in the tables are examples only.

Table 1 Path environment variables

Variable

Description

Example

PATH

Executable file paths

PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

LD_LIBRARY_PATH

Search paths for dynamically loaded (shared) libraries

LD_LIBRARY_PATH=/usr/local/seccomponent/lib:/usr/local/cuda/lib64:/usr/local/cuda/compat:/root/miniconda3/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64

LIBRARY_PATH

Library search paths used at link time

LIBRARY_PATH=/usr/local/cuda/lib64/stubs

MA_HOME

Main directory of a training job

MA_HOME=/home/ma-user

MA_JOB_DIR

Parent directory of the training algorithm folder

MA_JOB_DIR=/home/ma-user/modelarts/user-job-dir

MA_MOUNT_PATH

Path mounted to a ModelArts training container, which is used to temporarily store training algorithms, algorithm input, algorithm output, and logs

MA_MOUNT_PATH=/home/ma-user/modelarts

MA_LOG_DIR

Training log directory

MA_LOG_DIR=/home/ma-user/modelarts/log

MA_SCRIPT_INTERPRETER

Training script interpreter

MA_SCRIPT_INTERPRETER=

WORKSPACE

Training algorithm directory

WORKSPACE=/home/ma-user/modelarts/user-job-dir/code
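
As a minimal sketch, a training script might read the path variables in Table 1 through Python's os.environ. The helper name resolve_job_paths is hypothetical and not part of ModelArts, and the fallback values are simply the example values from the table, so they may differ in a real container.

```python
import os
from pathlib import Path


def resolve_job_paths(env=None):
    """Return key job directories as Path objects.

    Fallbacks are the example values from Table 1 (assumptions, not
    guaranteed defaults). Pass a dict as `env` to test without a container.
    """
    env = os.environ if env is None else env
    mount = Path(env.get("MA_MOUNT_PATH", "/home/ma-user/modelarts"))
    return {
        "mount": mount,
        "job_dir": Path(env.get("MA_JOB_DIR", str(mount / "user-job-dir"))),
        "log_dir": Path(env.get("MA_LOG_DIR", str(mount / "log"))),
    }


if __name__ == "__main__":
    # Print the resolved directories for the current process environment.
    for name, path in resolve_job_paths().items():
        print(f"{name}: {path}")
```

Accepting the environment as a parameter keeps the helper usable outside a training container, for example in unit tests.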

Table 2 Environment variables of a distributed training job

Variable

Description

Example

MA_CURRENT_IP

IP address of the physical node on which a job container is running.

MA_CURRENT_IP=192.168.23.38

MA_NUM_GPUS

Number of GPUs used by a job container.

MA_NUM_GPUS=8

MA_TASK_NAME

Name of a job container, for example:

  • worker in MindSpore and PyTorch.
  • learner or worker in reinforcement learning engines.
  • ps or worker in TensorFlow.

MA_TASK_NAME=worker

MA_NUM_HOSTS

Number of compute nodes required for a training job.

MA_NUM_HOSTS=4

VC_TASK_INDEX

Index of a job container in multi-node training. Indexing starts at 0 for the first container.

VC_TASK_INDEX=0

VC_WORKER_NUM

Number of compute nodes required for a training job.

VC_WORKER_NUM=4

VC_WORKER_HOSTS

Domain names of the nodes used for multi-node training, listed in node order and separated by commas (,). You can obtain the IP address of a node by resolving its domain name.

VC_WORKER_HOSTS=modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-0.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-1.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-2.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-3.modelarts-job-a0978141-1712-4f9b-8a83-000000000000
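
The variables in Table 2 can be combined to derive a launch layout for multi-node training. The following is a sketch only, under three assumptions that the table does not state: the first entry of VC_WORKER_HOSTS acts as the rendezvous (master) address, every node has the same number of GPUs, and one training process runs per GPU. The function name distributed_layout is hypothetical.

```python
import os


def distributed_layout(env=None):
    """Derive a simple launch layout from the variables in Table 2.

    Assumptions (not stated by ModelArts): the first host is used as the
    master, all nodes have MA_NUM_GPUS GPUs, one process runs per GPU.
    """
    env = os.environ if env is None else env
    # VC_WORKER_HOSTS is a comma-separated list of node domain names.
    hosts = [h for h in env.get("VC_WORKER_HOSTS", "").split(",") if h]
    num_nodes = int(env.get("MA_NUM_HOSTS", len(hosts) or 1))
    node_rank = int(env.get("VC_TASK_INDEX", 0))
    gpus_per_node = int(env.get("MA_NUM_GPUS", 1))
    return {
        "master_addr": hosts[0] if hosts else "127.0.0.1",
        "node_rank": node_rank,
        "num_nodes": num_nodes,
        "world_size": num_nodes * gpus_per_node,
        # Global rank of the first process on this node.
        "first_global_rank": node_rank * gpus_per_node,
    }


if __name__ == "__main__":
    print(distributed_layout())
```

The resulting values map onto the arguments that most distributed launchers expect (master address, node rank, number of nodes), but the exact mapping depends on the framework in use.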

Table 3 NCCL environment variables

Variable

Description

Example

NCCL_VERSION

NCCL version

NCCL_VERSION=2.7.8

NCCL_DEBUG

NCCL log level

NCCL_DEBUG=INFO

NCCL_IB_HCA

InfiniBand NICs to use for communication. A leading caret (^) excludes the listed devices instead of selecting them.

NCCL_IB_HCA=^mlx5_bond_0

NCCL_SOCKET_IFNAME

IP interface to use for communication

NCCL_SOCKET_IFNAME=bond0,eth0
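
When troubleshooting communication problems, it can help to record the NCCL settings that the container actually received. A minimal sketch, assuming only that the relevant variables share the NCCL_ prefix as in Table 3; the helper name nccl_settings is hypothetical.

```python
import os


def nccl_settings(env=None):
    """Return all NCCL_* variables, sorted by name, as a dict."""
    env = os.environ if env is None else env
    return {k: v for k, v in sorted(env.items()) if k.startswith("NCCL_")}


if __name__ == "__main__":
    # Write the settings to standard output so they appear in the job log.
    for name, value in nccl_settings().items():
        print(f"{name}={value}")
```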

Table 4 OBS environment variables

Variable

Description

Example

S3_ENDPOINT

OBS endpoint

S3_ENDPOINT=https://obs.region.xxx.com

S3_VERIFY_SSL

Whether to verify the SSL certificate when accessing OBS

S3_VERIFY_SSL=0

S3_USE_HTTPS

Whether to use HTTPS to access OBS

S3_USE_HTTPS=1
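
The OBS variables in Table 4 can be gathered into a single settings object before an OBS client is created. The sketch below assumes that the flag variables use 1 for enabled and 0 for disabled, as the examples suggest; the function names are hypothetical, and unset variables are reported as None rather than guessed.

```python
import os


def _flag(value):
    """Interpret "1" as True and any other set value as False; None stays None."""
    if value is None:
        return None
    return value.strip() == "1"


def obs_settings(env=None):
    """Collect the OBS-related variables from Table 4."""
    env = os.environ if env is None else env
    return {
        "endpoint": env.get("S3_ENDPOINT"),
        "use_https": _flag(env.get("S3_USE_HTTPS")),
        "verify_ssl": _flag(env.get("S3_VERIFY_SSL")),
    }


if __name__ == "__main__":
    print(obs_settings())
```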

Table 5 Environment variables of the pip source and API Gateway address

Variable

Description

Example

MA_PIP_HOST

Domain name of the pip source

MA_PIP_HOST=repo.xxx.com

MA_PIP_URL

Address of the pip source

MA_PIP_URL=http://repo.xxx.com/repository/pypi/simple/

MA_APIGW_ENDPOINT

ModelArts API Gateway address

MA_APIGW_ENDPOINT=https://modelarts.region.xxx.xxx.com
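
A training script that installs extra packages at startup might point pip at the preset source. The following sketch builds the command from MA_PIP_URL and MA_PIP_HOST using pip's standard --index-url and --trusted-host options; the helper name pip_install_command is hypothetical, and whether the host must be marked as trusted depends on your environment.

```python
import os
import sys


def pip_install_command(packages, env=None):
    """Build a pip install command that uses the preset pip source, if any."""
    env = os.environ if env is None else env
    cmd = [sys.executable, "-m", "pip", "install"]
    url = env.get("MA_PIP_URL")
    host = env.get("MA_PIP_HOST")
    if url:
        cmd += ["--index-url", url]
    if host:
        # Lets pip use the host even without valid HTTPS (assumed to be needed).
        cmd += ["--trusted-host", host]
    return cmd + list(packages)


if __name__ == "__main__":
    # Print the command only; nothing is installed here.
    print(" ".join(pip_install_command(["numpy"])))
```

To run the installation, the returned list can be passed to subprocess.run(cmd, check=True).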

Table 6 Environment variables of job metadata

Variable

Description

Example

MA_CURRENT_INSTANCE_NAME

Name of the current node for multi-node training

MA_CURRENT_INSTANCE_NAME=modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-1

Table 7 Precheck environment variables

Variable

Description

Example

MA_DETECT_TRAIN_INJECT_CODE

Whether to enable ModelArts precheck.

  • 1 (default): Precheck is enabled.
  • 0: Precheck is disabled.

Keep precheck enabled to detect node and driver faults before they affect services.

MA_DETECT_TRAIN_INJECT_CODE=1