Creating a Distributed Training Task
Prerequisites
- The management console account and password have been obtained.
- Personal real-name authentication has been completed and the account balance is greater than 0 yuan.
Procedure
- Log in to the management console.
- In the top navigation bar, choose Products and Services > AI Computing Cloud Service > AI Computing Cloud Service to open its overview page.
- In the left navigation pane, choose Distributed Training. The Distributed Training Task List page is displayed by default. Click Create Training Task.
- On the Create Training Task page, configure the parameters and click OK. Refer to the following parameter descriptions.
| Parameter | Description |
|---|---|
| Task Name | User-defined name for the task. |
| Image | Select a public image, a custom image, or a private image address. |
| Storage and Data (optional) | Select the user directory where the dataset is located and the corresponding mount directory. |
| Code | Upload the code file to be executed. Click Upload to select the code file. If a public image is used, the code is mounted to `/root/code`. |
| Startup Command | Enter the command that runs the uploaded file. If a public image is used, the default command is `python3 /root/code/main.py`. |
| Environment Variables | Custom environment variables for the distributed training task. |
| TensorBoard | Enable TensorBoard to view task result details. |
| Automatic Retry | Configure automatic retries in case of failure. |
| Timeout Configuration | Set the maximum run time for the task. If exceeded, the task is automatically canceled. |
| Framework | Supports TensorFlow, PyTorch, MXNet, MPI, XGBoost, etc. |
| Resource Group | Select a public resource pool or your own resource group. |
- Return to the Distributed Training Task List page. The successfully created training task appears in the list.
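For reference, the entry script launched by the startup command (for example, `python3 /root/code/main.py`) typically reads its coordination settings from environment variables injected into each worker. The sketch below is a minimal, hypothetical example: the variable names `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` follow common PyTorch distributed conventions and are an assumption here, not names confirmed by this platform; check your image's documentation for the exact variables it provides.

```python
import os

def read_dist_config(env=os.environ):
    """Read common distributed-training settings from environment variables.

    The variable names follow PyTorch torch.distributed conventions;
    the platform's injected variables may differ (an assumption here).
    """
    return {
        "rank": int(env.get("RANK", "0")),                    # this process's global index
        "world_size": int(env.get("WORLD_SIZE", "1")),        # total number of processes
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),   # rendezvous host
        "master_port": int(env.get("MASTER_PORT", "29500")),  # rendezvous port
    }

if __name__ == "__main__":
    cfg = read_dist_config()
    # By convention, only rank 0 writes logs and checkpoints,
    # so the workers do not overwrite each other's output.
    if cfg["rank"] == 0:
        print(f"Coordinating {cfg['world_size']} worker(s) via "
              f"{cfg['master_addr']}:{cfg['master_port']}")
```

Defaulting to rank 0 and a world size of 1 lets the same script run unmodified on a single machine, which is convenient for debugging before submitting the task.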