
Creating a Distributed Training Task

Prerequisites

  • You have obtained a management console account and password.
  • You have completed personal real-name authentication, and your account balance is greater than 0 yuan.

Procedure

  1. Log in to the management console.

  2. In the top navigation bar, click Products and Services > AI Computing Cloud Service to open its overview page.

  3. In the left navigation bar, select Distributed Training. The Distributed Training Task List page is displayed by default. Click Create Training Task.

  4. On the Create Training Task page, configure the parameters and click OK.

    Refer to the following parameter descriptions for configuration.

  • Task Name: A user-defined name for the task.
  • Image: Select a public image, a custom image, or a private image address.
  • Storage and Data (optional): Select the user directory where the dataset is located and the corresponding mount directory.
  • Code: Click Upload to select the code file to be executed. With a public image, the code is mounted to /root/code.
  • Startup Command: Enter the command that runs the uploaded file. With a public image, the default command is python3 /root/code/main.py.
  • Environment Variables: Custom environment variables passed to the distributed training task.
  • TensorBoard: Enable TensorBoard to view task result details.
  • Automatic Retry: Automatically retry the task if it fails.
  • Timeout Configuration: The maximum run time for the task. If it is exceeded, the task is automatically canceled.
  • Framework: TensorFlow, PyTorch, MXNet, MPI, XGBoost, and other frameworks are supported.
  • Resource Group: Select a public resource pool or your own resource group.
  5. Return to the Distributed Training task list page. The newly created training task is displayed in the list.
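As an illustration of what the uploaded code file might contain, here is a minimal main.py skeleton that the default startup command (python3 /root/code/main.py) could run. The environment-variable names (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) follow common PyTorch launcher conventions and are assumptions here; check which variables the platform's public images actually inject, or set them yourself under Environment Variables.

```python
import os


def read_dist_config(env=os.environ):
    """Read the rendezvous settings a distributed launcher typically
    injects as environment variables (names are assumptions; verify
    them against the image you selected)."""
    return {
        "rank": int(env.get("RANK", "0")),
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
    }


def main():
    cfg = read_dist_config()
    print(f"worker {cfg['rank']}/{cfg['world_size']} "
          f"rendezvous at {cfg['master_addr']}:{cfg['master_port']}")
    # Initialize your framework's process group and run training here,
    # e.g. torch.distributed.init_process_group(...) for PyTorch.


if __name__ == "__main__":
    main()
```

Each worker in the task runs the same entry point; the rank and world size distinguish the workers, so the script can decide which shard of data to train on.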
