
Environment Variables

Learn about common and PyTorch-specific environment variables for distributed training tasks.

Introduction

When you submit a distributed training task, the system builds a container-based computing environment and sets the corresponding environment variables. This section introduces the common environment variables. You can also define custom environment variables to suit your training task.

Common Environment Variables

| Variable Name | Description |
| --- | --- |
| TENSORBOARD_LOG_PATH | Storage path for TensorBoard logs. To view task training details in TensorBoard, your code must write its log files to the path given by this environment variable. |
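As a minimal sketch, the snippet below reads `TENSORBOARD_LOG_PATH` from the environment, with a local fallback so the same script also runs outside a submitted task. The helper name and the default directory are illustrative, not part of the platform's API:

```python
import os

def tensorboard_log_dir(default="./runs"):
    """Return the TensorBoard log directory for the current task.

    The platform exports TENSORBOARD_LOG_PATH when the task starts;
    the default keeps the script usable for local runs.
    """
    return os.environ.get("TENSORBOARD_LOG_PATH", default)

# Typical use with PyTorch's TensorBoard writer (assuming torch is installed):
#   from torch.utils.tensorboard import SummaryWriter
#   writer = SummaryWriter(log_dir=tensorboard_log_dir())
#   writer.add_scalar("train/loss", loss.item(), global_step=step)
#   writer.close()
```

Pointing the writer's `log_dir` at this path is what makes the logs visible in the platform's TensorBoard view.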

PyTorch Environment Variables

| Variable Name | Description |
| --- | --- |
| MASTER_ADDR | IP address or hostname of the master node in distributed training, for example `tn-xxxxx-worker-0`. |
| MASTER_PORT | Port number used for communication with the master node. |
| WORLD_SIZE | Total number of nodes participating in distributed training, including both the master and worker nodes. For example, with 1 master node and 3 worker nodes, WORLD_SIZE is 4. |
| RANK | Unique identifier (rank) of the current node in distributed training. The master node's RANK is usually 0; worker ranks start at 1 and increase sequentially. |
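A sketch of how a training script might consume these variables, assuming PyTorch. The helper names and the fallback defaults are illustrative; `torch.distributed.init_process_group` with `init_method="env://"` reads `MASTER_ADDR` and `MASTER_PORT` from the environment itself:

```python
import os

def ddp_env_config():
    """Read the distributed-training variables the platform injects.

    Falls back to single-process defaults when a variable is unset,
    so the same script also runs outside the cluster.
    """
    return {
        "master_addr": os.environ.get("MASTER_ADDR", "localhost"),
        "master_port": int(os.environ.get("MASTER_PORT", "29500")),
        "world_size": int(os.environ.get("WORLD_SIZE", "1")),
        "rank": int(os.environ.get("RANK", "0")),
    }

def init_distributed():
    """Initialize torch.distributed from the injected variables (requires PyTorch)."""
    import torch.distributed as dist

    cfg = ddp_env_config()
    dist.init_process_group(
        backend="nccl",          # or "gloo" for CPU-only training
        init_method="env://",    # resolves MASTER_ADDR / MASTER_PORT from env
        world_size=cfg["world_size"],
        rank=cfg["rank"],
    )
```

With this pattern, the same entry script works both when launched locally and when submitted as a distributed task, because the platform-set variables take effect automatically.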
