Environment Variables
Learn about common and PyTorch-specific environment variables for distributed training tasks.
Introduction
When you submit a distributed training task, the system builds a containerized compute environment and sets the corresponding environment variables. This section introduces the common environment variables; you can also define custom environment variables to suit your training task.
Common Environment Variables
| Variable Name | Description |
| --- | --- |
| TENSORBOARD_LOG_PATH | The storage path for TensorBoard logs. To view training details in TensorBoard, your code must write its log files to the path given by this variable, as sketched below. |
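As a minimal sketch, a training script could direct its TensorBoard logs to this path as follows. It uses PyTorch's bundled `torch.utils.tensorboard` writer; the loss values and the `./runs` fallback for local runs are illustrative assumptions, not part of the platform.

```python
import os

from torch.utils.tensorboard import SummaryWriter

# TENSORBOARD_LOG_PATH is set by the platform; the "./runs" fallback is
# an assumption for running the same script locally.
log_dir = os.environ.get("TENSORBOARD_LOG_PATH", "./runs")
writer = SummaryWriter(log_dir=log_dir)

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder metric for illustration
    writer.add_scalar("train/loss", loss, global_step=step)

writer.close()
```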
PyTorch Environment Variables
| Variable Name | Description |
| --- | --- |
| MASTER_ADDR | The IP address or hostname of the master node in distributed training, for example, tn-xxxxx-worker-0. |
| MASTER_PORT | The port on the master node used for inter-node communication. |
| WORLD_SIZE | The total number of nodes participating in distributed training, counting both master and worker nodes. For example, 1 master node and 3 worker nodes give WORLD_SIZE = 4. |
| RANK | The unique rank of the current node in distributed training. The master node's RANK is typically 0; worker nodes are ranked 1, 2, 3, and so on. |
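The sketch below shows how these variables are typically consumed. With `init_method="env://"`, PyTorch's `torch.distributed.init_process_group` reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE directly from the environment. The example assumes one training process per node, so the node-level RANK and WORLD_SIZE described above map directly onto PyTorch's per-process values.

```python
import torch
import torch.distributed as dist

# With init_method="env://", init_process_group reads MASTER_ADDR,
# MASTER_PORT, RANK, and WORLD_SIZE from the environment, so nothing
# needs to be hard-coded in the script.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend, init_method="env://")

rank = dist.get_rank()              # value taken from RANK
world_size = dist.get_world_size()  # value taken from WORLD_SIZE

# Sanity check: sum a ones tensor across all participants; every node
# should end up with a value equal to world_size.
device = "cuda" if backend == "nccl" else "cpu"
t = torch.ones(1, device=device)
dist.all_reduce(t, op=dist.ReduceOp.SUM)
if rank == 0:
    print(f"world_size={world_size}, all_reduce sum={t.item()}")

dist.destroy_process_group()
```

Note that PyTorch itself interprets RANK and WORLD_SIZE per process; if you launch several processes per node (for example, one per GPU), use a launcher such as torchrun, which derives the per-process values for you.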