singlefabric
Best Practices

LLaMA 2 Model Fine-tuning Based on PyTorch

Learn how to fine-tune the LLaMA 2 model based on the pre-trained Atom-7B-Chat model using PyTorch on a distributed training system.

Introduction

LLaMA is a widely used open-source model family. This best practice describes how to fine-tune Atom-7B-Chat, a language model pre-trained on LLaMA 2 7B, by submitting a distributed training task.

Preparation

Procedure

  1. Log in to the management console.

  2. Decompress the obtained training model and code files locally, then upload them via SFTP to the designated file-storage directory. In this example, a user directory named xxxx0002 has been created, and the training model and the corresponding code are uploaded to the /xxxx0002/Atom-7B-Chat and /xxxx0002/Llama-Language folders, respectively.
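The upload in step 2 can be scripted with SFTP batch mode. This is a minimal sketch: the batch file contents match the example directory layout above, but `USER@HOST` is a placeholder for your actual file-storage endpoint.

```shell
# Write an SFTP batch file that mirrors the example layout:
# model -> /xxxx0002/Atom-7B-Chat, code -> /xxxx0002/Llama-Language.
cat > upload.batch <<'EOF'
mkdir /xxxx0002/Atom-7B-Chat
mkdir /xxxx0002/Llama-Language
put -r ./Atom-7B-Chat /xxxx0002/Atom-7B-Chat
put -r ./Llama-Language /xxxx0002/Llama-Language
EOF

# Run against your storage endpoint (placeholder credentials):
# sftp -b upload.batch USER@HOST
cat upload.batch
```

`sftp -b` aborts on the first failed command, so a partial upload is easy to detect and rerun.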

  3. Create a distributed training task and configure the following parameters:

| Configuration Item | Parameter | Description |
| --- | --- | --- |
| Task Name | Customizable | Set according to actual conditions. |
| Mirrors | jz-dockerhub.singlefabric.com/public/llama2-train:pytorch-2.1.2-cuda12.1-cudnn8 | Specifies the image address. |
| Storage and Data | Select user directory | Choose the directory where you uploaded the code files. |
| Code | None required | No code files need to be uploaded in this example. |
| Environment Variables | None required | No setup is needed in this example. |
| Startup Command | bash /root/epfs/Llama-Language/train/sft/torchrun_finetune_lora.sh | Modify according to the actual path. |
| Automatic Retry | Disabled | Select "Disabled". |
| Timeout Configuration | Disabled | Select "Disabled". |
| Computing Resources | PyTorch | Select appropriate resources. |
| Resource Group | Public resource pool | NVIDIA GPU model 4090 is recommended; set resources to 4 and nodes to 2. |
  4. After configuring the parameters, click OK and wait for the task to complete.

  5. Once training completes successfully, the fine-tuned model can be found in the path /Llama-Language/train/sft/save_folder.
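After the task reports success, a quick sanity check from a shell on the storage side confirms that the fine-tuned weights were actually written. The path below follows this example's layout; adjust it to your user directory.

```shell
# Confirm the fine-tuned model landed in save_folder (example path).
SAVE_DIR=/root/epfs/Llama-Language/train/sft/save_folder

if [ -d "$SAVE_DIR" ]; then
    # Adapter weights and config files should be listed here.
    ls -lh "$SAVE_DIR"
else
    echo "save_folder not found: $SAVE_DIR"
fi
```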

Appendix

Example Startup Script Description

The startup script used in the training task startup command is torchrun_finetune_lora.sh. Its content can be viewed on the Storage and Data Services page.

Some parameters in the startup script:

  • output_model: Output path of the fine-tuned model. In this example, it is /Llama-Language/train/sft/save_folder.
  • model_name_or_path: Pre-trained model path. In this example, it is /root/epfs/Atom-7B-Chat.
  • train_files: Training dataset.
  • validation_files: Validation dataset.
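For orientation, the parameters above can be pictured in a startup script roughly like the sketch below. This is an illustrative assumption, not the shipped script: the Python entry-point name (finetune_lora.py) and the dataset file names are hypothetical, while the node/GPU counts match the resource recommendation in this example (2 nodes, 4 GPUs each). Always consult the actual torchrun_finetune_lora.sh on the Storage and Data Services page.

```shell
#!/bin/bash
# Illustrative sketch only -- entry-point and dataset names are assumptions;
# the real flags are defined by the script shipped with the code package.
output_model=/root/epfs/Llama-Language/train/sft/save_folder

# 2 nodes x 4 GPUs, matching the recommended resource-group settings.
# (The platform is assumed to supply the multi-node rendezvous settings.)
torchrun --nnodes 2 --nproc_per_node 4 \
    finetune_lora.py \
    --model_name_or_path /root/epfs/Atom-7B-Chat \
    --train_files data/train_sft.csv \
    --validation_files data/dev_sft.csv \
    --output_model "$output_model"
```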
