
Creating a Distributed Training Task

Prerequisites

  • You have obtained a management console account and password.
  • You have completed personal real-name authentication, and your account balance is greater than 0 yuan.

Procedure

  1. Log in to the management console.

  2. In the top navigation bar, click Products and Services > AI Computing Cloud Service to open its overview page.

  3. In the left navigation bar, select Distributed Training. The Distributed Training Task List page is displayed by default. Click Create Training Task.

  4. On the Create Training Task page, configure the parameters and click OK.

    Refer to the following parameter descriptions for configuration.

  • Task Name: A user-defined name for the task.
  • Image: Select a public image, a custom image, or a private image address.
  • Storage and Data (optional): Select the user directory where the dataset is located and the corresponding mount directory.
  • Code: Click Upload to select the code file to be executed. With a public image, the code is mounted to /root/code.
  • Startup Command: Enter the command that runs the uploaded file. With a public image, the default command is python3 /root/code/main.py.
  • Environment Variables: Custom environment variables passed to the distributed training task.
  • TensorBoard: Enable TensorBoard to view task result details.
  • Automatic Retry: Automatically retry the task if it fails.
  • Timeout Configuration: The maximum run time for the task. If it is exceeded, the task is automatically canceled.
  • Framework: TensorFlow, PyTorch, MXNet, MPI, XGBoost, and other frameworks are supported.
  • Resource Group: Select a public resource pool or your own resource group.
  5. Return to the Distributed Training task list page. The newly created training task is displayed in the list.
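As an illustration of what the uploaded code file might contain, here is a minimal main.py skeleton that the default startup command (python3 /root/code/main.py) could run. The environment-variable names (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) follow common PyTorch launcher conventions and are assumptions here; check which variables the platform's public images actually inject, or set them yourself under Environment Variables.

```python
import os


def read_dist_config(env=os.environ):
    """Read the rendezvous settings a distributed launcher typically
    injects as environment variables (names are assumptions; verify
    them against the image you selected)."""
    return {
        "rank": int(env.get("RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
    }


def main():
    cfg = read_dist_config()
    print(f"worker {cfg['rank']}/{cfg['world_size']} "
          f"rendezvous at {cfg['master_addr']}:{cfg['master_port']}")
    # Initialize your framework's process group and run training here,
    # e.g. torch.distributed.init_process_group(...) for PyTorch.


if __name__ == "__main__":
    main()
```

Each worker in the task runs the same entry point; the rank and world size distinguish the workers, so the script can decide which shard of data to train on.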
