Environment Variables
Learn about common and PyTorch-specific environment variables for distributed training tasks.
Introduction
When you submit a distributed training task, the system builds a containerized compute environment and sets the corresponding environment variables. This section introduces the common environment variables; you can also define custom environment variables to suit your training task.
Common Environment Variables
| Variable Name | Description |
| --- | --- |
| TENSORBOARD_LOG_PATH | The storage path for TensorBoard logs. To view training details in TensorBoard, your code must write its log files to the path given by this variable, as sketched below. |
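As a minimal sketch, a training script could direct its TensorBoard logs to this path as follows. It uses PyTorch's bundled `torch.utils.tensorboard` writer; the loss values and the `./runs` fallback for local runs are illustrative assumptions, not part of the platform.

```python
import os

from torch.utils.tensorboard import SummaryWriter

# TENSORBOARD_LOG_PATH is set by the platform; the "./runs" fallback is
# an assumption for running the same script locally.
log_dir = os.environ.get("TENSORBOARD_LOG_PATH", "./runs")
writer = SummaryWriter(log_dir=log_dir)

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder metric for illustration
    writer.add_scalar("train/loss", loss, global_step=step)

writer.close()
```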
PyTorch Environment Variables
| Variable Name | Description |
| --- | --- |
| MASTER_ADDR | The IP address or hostname of the master node in distributed training, for example, tn-xxxxx-worker-0. |
| MASTER_PORT | The port on the master node used for inter-node communication. |
| WORLD_SIZE | The total number of nodes participating in distributed training, counting both master and worker nodes. For example, 1 master node and 3 worker nodes give WORLD_SIZE = 4. |
| RANK | The unique rank of the current node in distributed training. The master node's RANK is typically 0; worker nodes are ranked 1, 2, 3, and so on. |
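The sketch below shows how these variables are typically consumed. With `init_method="env://"`, PyTorch's `torch.distributed.init_process_group` reads MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE directly from the environment. The example assumes one training process per node, so the node-level RANK and WORLD_SIZE described above map directly onto PyTorch's per-process values.

```python
import torch
import torch.distributed as dist

# With init_method="env://", init_process_group reads MASTER_ADDR,
# MASTER_PORT, RANK, and WORLD_SIZE from the environment, so nothing
# needs to be hard-coded in the script.
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend, init_method="env://")

rank = dist.get_rank()              # value taken from RANK
world_size = dist.get_world_size()  # value taken from WORLD_SIZE

# Sanity check: sum a ones tensor across all participants; every node
# should end up with a value equal to world_size.
device = "cuda" if backend == "nccl" else "cpu"
t = torch.ones(1, device=device)
dist.all_reduce(t, op=dist.ReduceOp.SUM)
if rank == 0:
    print(f"world_size={world_size}, all_reduce sum={t.item()}")

dist.destroy_process_group()
```

Note that PyTorch itself interprets RANK and WORLD_SIZE per process; if you launch several processes per node (for example, one per GPU), use a launcher such as torchrun, which derives the per-process values for you.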