How to Fine-Tune (SFT) an LLM: PyTorch Tutorial
What is an LLM
Large Language Models (LLMs) are advanced artificial intelligence systems designed to understand and generate human language supplied as speech or text input. They use deep learning techniques, particularly transformer architectures, to analyze vast amounts of text data, enabling them to perform tasks such as translation, summarization, and content generation. LLMs learn from patterns in data, which allows them to produce coherent and contextually relevant responses to user prompts. Examples include ChatGPT and Google Bard, which generate text based on user inputs while continuously improving as they process more information.
They are no longer good only at specific tasks like translation and summarization; recently they have also become quite capable in other domains such as code generation and math problem solving. For example, Perplexity, which I used while writing this blog, generates JavaScript or Python code far more reliably than earlier strong LLMs like GPT.
What is PyTorch

PyTorch is an open-source machine learning framework developed by Meta AI, based on the Python programming language and the Torch library. It is widely used for developing deep neural networks, particularly in research settings, thanks to its flexibility and ease of use. PyTorch supports dynamic computation graphs, enabling real-time code testing and modification, and offers strong GPU acceleration for efficient computation. It is popular for applications in natural language processing and computer vision.
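To make that concrete, here is a minimal sketch (my own illustration, not part of the original tutorial) of PyTorch's define-by-run style and device handling; the toy layer and input shapes are arbitrary.
import torch
import torch.nn as nn
# Pick a GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# A tiny linear layer moved to the selected device
layer = nn.Linear(4, 2).to(device)
# A toy batch of 3 samples with 4 features each
x = torch.randn(3, 4, device=device)
# The computation graph is built on the fly as this line executes,
# so intermediate results can be inspected like ordinary Python values
y = layer(x)
print(y.shape)  # torch.Size([3, 2])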
What is Hugging Face
Hugging Face is an American company and platform that specializes in machine learning and data science. It is renowned for its open-source tools and community-driven approach, enabling users to build, deploy, and train machine learning models, particularly in natural language processing (NLP) (and recently, it seems, especially LLMs). The platform hosts a vast collection of pre-trained models and datasets, simplifying the process of developing AI applications. Hugging Face's Transformers library is a key component, providing efficient ways to integrate ML models into workflows.
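As a quick taste of the Transformers library, the short sketch below loads a default sentiment-analysis pipeline from the Hub; the task and the example sentence are my own choices for illustration.
from transformers import pipeline
# Downloads a default pre-trained sentiment model from the Hugging Face Hub
classifier = pipeline("sentiment-analysis")
# Run inference on a sample sentence
print(classifier("Fine-tuning LLMs with PyTorch is fun!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]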
How to Fine-tune LLM with PyTorch & Hugging Face
To fine-tune a large language model (LLM) using PyTorch and Hugging Face, follow these steps:
- Install Required Libraries: Ensure you have the transformers, datasets, and torch libraries installed.
pip install transformers datasets torch
- Load a Pre-trained Model and Dataset: Use Hugging Face's Transformers library to load a pre-trained model and dataset.
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased')
dataset = load_dataset('glue', 'mrpc')
- Prepare the Data: Tokenize the dataset.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
def tokenize_function(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'], truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
- Set Up Training Arguments: Define your training parameters.
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)
- Initialize Trainer and Train: Use the Trainer API to fine-tune the model.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
)
trainer.train()
This process leverages the flexibility of PyTorch with the extensive model library of Hugging Face to fine-tune LLMs for specific tasks.
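One detail the steps above leave implicit: with evaluation_strategy='epoch' but no metrics function, the Trainer only reports the evaluation loss. If you also want accuracy, a minimal sketch of an optional compute_metrics hook (my own addition, not required for training) looks like this:
import numpy as np
def compute_metrics(eval_pred):
    # The Trainer passes a (logits, labels) pair covering the whole eval set
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}
# Pass it when constructing the Trainer:
# trainer = Trainer(..., compute_metrics=compute_metrics)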
Real Example with PyTorch & HuggingFace Packages
OK, now let's fine-tune DistilBERT, which is introduced on its Hugging Face model card. DistilBERT is a smaller version of BERT developed by the Hugging Face team. It is designed to be faster and lighter while retaining most of BERT's performance. The "base" part indicates a moderate size compared to other variants such as small or large, and "uncased" means it ignores the case of words during training. This makes it more robust to variations in text casing, but it may affect its ability to handle proper nouns and acronyms accurately. Overall, distilbert-base-uncased is a popular choice for many NLP tasks because of its balance between speed, efficiency, and accuracy.
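To see what "uncased" means in practice, here is a small check of my own (not from the original post) showing that the tokenizer lowercases its input before splitting it into subwords:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Both calls print the same tokens, because casing is stripped first
print(tokenizer.tokenize("Hugging Face"))
print(tokenizer.tokenize("hugging face"))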
I already explained how to use the PyTorch and Hugging Face APIs for LLM fine-tuning, so here I am introducing a simple example with helpful comments in the code. What I want to do today is fine-tune DistilBERT on the emotion dataset provided by Hugging Face; in other words, it will be a kind of emotion checker.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the emotion dataset
# I load the dataset first so I can read the number of labels (which is 6 in this case)
dataset = load_dataset("emotion")
num_labels = dataset["train"].features["label"].num_classes
# Hugging Face Model DistilBERT
model_name = "distilbert-base-uncased"
# Load tokenizer from the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load model with 6 labels
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels).to(device)
# Calculate model size
model_size = sum(param.numel() for param in model.parameters())
print(f"Model: {model_size/1000**2:.1f}M parameters")
# How to tokenize
'''
With padding=True and truncation=True, all input texts are padded to the same
length and any excessively long texts are truncated to fit within the maximum
sequence length supported by the model. This function converts raw text into
tokenized sequences suitable for training.
'''
def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True)
print(tokenized_datasets)
batch_size = 16
# Log roughly once per epoch
logging_steps = len(tokenized_datasets["train"]) // batch_size
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=5,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    optim="adamw_hf",
    weight_decay=0.01,  # L2 regularization
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps=logging_steps,
    disable_tqdm=False,
    load_best_model_at_end=True,
    fp16=torch.cuda.is_available(),
    dataloader_num_workers=4,
    dataloader_prefetch_factor=2,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
)
trainer.train()
# Save the fine-tuned model (and tokenizer) to ./results so it can be reloaded below
trainer.save_model("./results")
Let's do an inference test.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load my fine-tuned model
tokenizer = AutoTokenizer.from_pretrained("./results")
model = AutoModelForSequenceClassification.from_pretrained("./results").to(device)
# Put the model in evaluation mode
model.eval()
# 6 emotions
emotion_labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]
def predict_emotion(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(probabilities, dim=-1).item()
    predicted_emotion = emotion_labels[predicted_class]
    confidence = probabilities[0][predicted_class].item()
    return predicted_emotion, confidence
test_texts = [
"I am really glad today",
"I am going to work",
"It's really annoying",
"This event is really exciting"
]
for text in test_texts:
    emotion, confidence = predict_emotion(text)
    print("*" * 30)
    print(f"Text: {text}")
    print(f"Predicted emotion: {emotion}")
    print(f"Confidence: {confidence:.2f}")
Here we have the result.
******************************
Text: I am really glad today
Predicted emotion: joy
Confidence: 1.00
******************************
Text: I am going to work
Predicted emotion: anger # With epoch 1 it shows joy :)
Confidence: 0.71
******************************
Text: It's really annoying
Predicted emotion: anger
Confidence: 0.55
******************************
Text: This event is really exciting
Predicted emotion: joy
Confidence: 1.00
Conclusion
So today we learned how to fine-tune an LLM. If I get a chance to write another fine-tuning post, I'd like to cover LoRA, which is a nice way to leave the original weights untouched and train small adapter layers instead; a rough sketch of what that could look like follows below.
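As a teaser, here is a rough, hedged sketch of LoRA fine-tuning with the peft library; the target module names and hyperparameters are my own assumptions for DistilBERT and are not covered in this post.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification
base_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=6)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,          # sequence classification
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=16,                       # scaling factor for the update
    lora_dropout=0.1,
    target_modules=["q_lin", "v_lin"],   # DistilBERT attention projections (assumed)
)
# Wrap the frozen base model with small trainable LoRA adapters
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only a small fraction of weights are trainable
# peft_model can then be passed to the same Trainer setup used above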