PyTorch Distributed Training Task with minGPT

Introduction

This guide walks through setting up and submitting a PyTorch distributed training task that uses the minGPT model. It covers environment preparation, task submission, and training configurations for both single-node and multi-node setups.

Environment Preparation

  1. Get the sample code.

  2. Create a file storage and upload the sample code.

  3. Create a container instance of the PyTorch image.

    Notice:

    • The storage and dataset of the container instance must be in the user directory where the sample code was uploaded.
    • Select the PyTorch image for the container instance.
  4. Log in to the container instance via Jupyter and run the following command to install environment dependencies:

    pip install -r /root/epfs/examples/distributed/minGPT-ddp/requirements.txt
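After the install finishes, it can help to confirm that every listed package is actually present in the container. The following is a minimal sketch, not part of the sample code: it assumes the requirements file sits at the path used in the command above and contains plain `name` or `name==version` style entries. The helper names `requirement_names` and `missing_packages` are hypothetical and introduced only for illustration.

```python
from __future__ import annotations

import re
from importlib import metadata
from pathlib import Path

# Path taken from the pip command above; adjust it if your storage is mounted elsewhere.
REQUIREMENTS = Path("/root/epfs/examples/distributed/minGPT-ddp/requirements.txt")


def requirement_names(text: str) -> list[str]:
    """Extract distribution names from the text of a requirements file.

    Assumption: entries are simple specifiers such as "torch>=1.0" or
    "fsspec[s3]". URL or VCS entries are not handled by this sketch.
    """
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e, ...)
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names: list[str]) -> list[str]:
    """Return the names that have no installed distribution metadata."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    if REQUIREMENTS.exists():
        missing = missing_packages(requirement_names(REQUIREMENTS.read_text()))
        print("Missing packages:", ", ".join(missing) if missing else "none")
    else:
        print(f"Requirements file not found: {REQUIREMENTS}")
```

Run the script in the same Jupyter terminal used for the install. If it reports missing packages, rerunning the `pip install` command is a reasonable first step.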
