PyTorch Distributed Training Task with minGPT

This guide walks through setting up and submitting a PyTorch distributed training task using the minGPT model. It covers environment preparation, task submission, and training configurations for both single-node and multi-node setups.

Introduction

Environment Preparation

Get the sample code.
Create a file storage and upload the sample code.
Create a container instance of the PyTorch image.
Notice:
- The storage and dataset of the container instance must be in the user directory where the sample code was uploaded.
- Select the PyTorch image for the container instance.
Log in to the container instance via Jupyter and run the following command to install environment dependencies:
pip install -r /root/epfs/examples/distributed/minGPT-ddp/requirements.txt
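After the dependencies are installed, it can help to confirm that PyTorch and its distributed package are usable before submitting a task. The following is a minimal sketch, not part of the sample code: the function name check_environment and the keys of the report are hypothetical, chosen for illustration, and the script assumes only that the container image provides a standard PyTorch build.

```python
import importlib.util


def check_environment():
    """Collect basic facts about the local PyTorch installation.

    Hypothetical helper for illustration; not part of the minGPT sample code.
    """
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if report["torch_installed"]:
        import torch
        import torch.distributed as dist

        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
        report["gpu_count"] = torch.cuda.device_count()
        report["distributed_available"] = dist.is_available()
        # Query the NCCL backend only when the distributed package is built in.
        report["nccl_available"] = dist.is_available() and dist.is_nccl_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Running this in a Jupyter cell or terminal of the container instance prints one line per check. If torch_installed or distributed_available is False, the image selection or dependency installation is worth revisiting before moving on to task submission.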
