singlefabric
Best Practices

LLaMA 2 Model Fine-tuning Based on PyTorch

Learn how to fine-tune the LLaMA 2 model based on the pre-trained Atom-7B-Chat model using PyTorch on a distributed training system.

Introduction

LLaMA is a widely used open-source model family. This best practice describes how to fine-tune Atom-7B-Chat, a language model pre-trained on LLaMA 2 7B, by submitting a distributed training task.

Preparation

Procedure

  1. Log in to the management console.

  2. Decompress the obtained training model and code files locally, then upload them via SFTP to the designated file-storage directory. In this example, a user directory named xxxx0002 has been created, and the training model and the corresponding code are uploaded to the /xxxx0002/Atom-7B-Chat and /xxxx0002/Llama-Language folders, respectively.
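The upload in step 2 can be scripted with SFTP batch mode. This is a minimal sketch: the batch file contents match the example directory layout above, but `USER@HOST` is a placeholder for your actual file-storage endpoint.

```shell
# Write an SFTP batch file that mirrors the example layout:
# model -> /xxxx0002/Atom-7B-Chat, code -> /xxxx0002/Llama-Language.
cat > upload.batch <<'EOF'
mkdir /xxxx0002/Atom-7B-Chat
mkdir /xxxx0002/Llama-Language
put -r ./Atom-7B-Chat /xxxx0002/Atom-7B-Chat
put -r ./Llama-Language /xxxx0002/Llama-Language
EOF

# Run against your storage endpoint (placeholder credentials):
# sftp -b upload.batch USER@HOST
cat upload.batch
```

`sftp -b` aborts on the first failed command, so a partial upload is easy to detect and rerun.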

  3. Create a distributed training task and configure the following parameters:

| Configuration Item | Parameter | Description |
| --- | --- | --- |
| Task Name | Customizable | Set according to actual conditions. |
| Mirrors | jz-dockerhub.singlefabric.com/public/llama2-train:pytorch-2.1.2-cuda12.1-cudnn8 | Specifies the image address. |
| Storage and Data | Select user directory | Choose the directory where you uploaded the code files. |
| Code | None required | No code files need to be uploaded in this example. |
| Environment Variables | None required | No setup is needed in this example. |
| Startup Command | bash /root/epfs/Llama-Language/train/sft/torchrun_finetune_lora.sh | Modify according to the actual path. |
| Automatic Retry | Disabled | Select "Disabled". |
| Timeout Configuration | Disabled | Select "Disabled". |
| Computing Resources | PyTorch | Select appropriate resources. |
| Resource Group | Public resource pool | NVIDIA GPU model 4090 is recommended; set resources to 4 and nodes to 2. |
  4. After configuring the parameters, click OK and wait for the task to complete.

  5. Once training completes successfully, the fine-tuned model can be found in the path /Llama-Language/train/sft/save_folder.
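After the task reports success, a quick sanity check from a shell on the storage side confirms that the fine-tuned weights were actually written. The path below follows this example's layout; adjust it to your user directory.

```shell
# Confirm the fine-tuned model landed in save_folder (example path).
SAVE_DIR=/root/epfs/Llama-Language/train/sft/save_folder

if [ -d "$SAVE_DIR" ]; then
    # Adapter weights and config files should be listed here.
    ls -lh "$SAVE_DIR"
else
    echo "save_folder not found: $SAVE_DIR"
fi
```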

Appendix

Example Startup Script Description

The startup script used in the training task startup command is torchrun_finetune_lora.sh. Its content can be viewed on the Storage and Data Services page.

Some parameters in the startup script:

  • output_model: Output path of the fine-tuned model. In this example, it is /Llama-Language/train/sft/save_folder.
  • model_name_or_path: Pre-trained model path. In this example, it is /root/epfs/Atom-7B-Chat.
  • train_files: Training dataset.
  • validation_files: Validation dataset.
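For orientation, the parameters above can be pictured in a startup script roughly like the sketch below. This is an illustrative assumption, not the shipped script: the Python entry-point name (finetune_lora.py) and the dataset file names are hypothetical, while the node/GPU counts match the resource recommendation in this example (2 nodes, 4 GPUs each). Always consult the actual torchrun_finetune_lora.sh on the Storage and Data Services page.

```shell
#!/bin/bash
# Illustrative sketch only -- entry-point and dataset names are assumptions;
# the real flags are defined by the script shipped with the code package.
output_model=/root/epfs/Llama-Language/train/sft/save_folder

# 2 nodes x 4 GPUs, matching the recommended resource-group settings.
# (The platform is assumed to supply the multi-node rendezvous settings.)
torchrun --nnodes 2 --nproc_per_node 4 \
    finetune_lora.py \
    --model_name_or_path /root/epfs/Atom-7B-Chat \
    --train_files data/train_sft.csv \
    --validation_files data/dev_sft.csv \
    --output_model "$output_model"
```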
