Supervised Fine-Tuning (SFT) for LLMs: A Practical Guide

In this article, we explore the crucial process of Supervised Fine-Tuning (SFT) for Large Language Models (LLMs). Building on our previous work constructing a Transformer Decoder-only model in PyTorch and pre-training it on OpenWebText, we now focus on transforming this base model into an instruction-tuned model and, ultimately, a chat model capable of natural, human-like conversation.

The Evolutionary Path of LLMs: From Base Model to Chat Model

To effectively fine-tune a model, it's essential to understand the stages of LLM evolution: Base Model, Instruct Model, and Chat Model. Each stage represents a step in aligning the model's behavior with human intent through techniques like SFT and reinforcement learning.

What is a Pre-trained Base Model?

A pre-trained or base model is the direct output of the pre-training phase. Built on transformer architecture and trained on large-scale datasets (e.g., OpenWebText), the base model excels at next-word prediction and text completion but lacks understanding of user intent.

Instruction-Tuned Model: Aligning with Human Commands

Through Supervised Fine-Tuning (SFT), we convert the base model into an instruction-tuned model. SFT involves:

  1. Preparing Data: Curate high-quality instruction-response pairs.
  2. Continuing Training: Fine-tune the pre-trained model on this supervised dataset.
  3. Refining the Objective: Shift the model's focus from generic text completion to generating appropriate responses to instructions.

After SFT, the model can follow instructions and complete tasks, not just continue patterns.

Chat Model: Mastering Conversational AI

A chat model is a further specialization, optimized for multi-turn dialogue. Key differences include:

  • Data Format: Trained on multi-turn dialogue data.
  • Training Methods: Uses SFT and often Reinforcement Learning from Human Feedback (RLHF) to improve helpfulness, honesty, and harmlessness (HHH framework).

Model Type | Training Data | Core Capability | Primary Goal | Example Behavior
Base Model | Unlabeled text | Next-word prediction, completion | Learn language patterns | Completes: "The capital of France is..."
Instruct Model | Instruction-response pairs | Task/instruction completion | Align with user commands | Answers: "What is the capital of France?" → "Paris."
Chat Model | Multi-turn dialogue | Coherent, safe conversation | Align with conversational norms | Engages in natural dialogue about France
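
To make the data-format difference concrete, the sketch below shows a messages-style structure commonly used for multi-turn dialogue data; the exact field names and roles vary by dataset and are an illustrative assumption here.

# Illustrative multi-turn record in the common "messages" style; the field
# names ("role", "content") differ between chat datasets.
conversation = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What is the city famous for?"},
    {"role": "assistant", "content": "Paris is known for landmarks such as the Eiffel Tower and the Louvre."},
]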

Selecting an SFT Dataset: Key Considerations

Choosing the right SFT dataset is critical for successful LLM fine-tuning. Datasets typically consist of structured instruction-response pairs in JSON or JSONL format.
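
As an illustration, one record in an Alpaca-style JSONL file could look like the sketch below; the field names (instruction, input, output) follow the Alpaca convention, and the content is invented for the example.

import json

# An illustrative Alpaca-style record; other datasets use different field names.
record = {
    "instruction": "Summarize the following text.",
    "input": "The Transformer architecture relies on self-attention rather than recurrence.",
    "output": "Transformers process sequences with self-attention instead of recurrent layers.",
}

# JSONL stores one such JSON object per line.
with open("sft_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")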

Types of SFT Datasets

  1. Human-Curated: High-quality, diverse, but smaller scale (e.g., Dolly-15k).
  2. Model-Generated: Large-scale, cost-effective, but may inherit biases (e.g., Alpaca, Open-Orca).

SFT Dataset Comparison

  • Alpaca: 52k model-generated entries; diverse instructions; research-only license.
  • Dolly-15k: 15k human-written entries; high quality; CC BY-SA 3.0 license (commercial use permitted).
  • Open-Orca: ~4.2M entries; includes chain-of-thought reasoning; mixed licenses.

Dataset selection should balance quality, scale, licensing, and project goals.
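
As a sketch, several of these datasets can be pulled from the Hugging Face Hub with the datasets library; the Hub IDs below are the commonly cited ones and should be verified (along with their licenses) before use.

from datasets import load_dataset

# Hub IDs are assumptions; confirm names and licenses on the Hub before training.
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")

print(alpaca[0])    # one instruction-response record
print(len(dolly))   # roughly 15k human-written records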

The Four-Step SFT Training Workflow

Follow these steps to fine-tune your LLM using SFT:

1. Data Formatting (Prompt Templating)

Convert structured instruction-response pairs into a consistent prompt template. For Alpaca-style data, the template looks like this:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}

This template serves three purposes:

  • Consistency: Ensures a uniform input structure across all training examples.
  • Role Separation: Helps the model distinguish between instruction, input, and response.
  • Controlled Generation: Guides the model toward the expected response format during inference.
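
A minimal formatting function for Alpaca-style records (assuming instruction, input, and output fields) might look like this sketch:

ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def format_example(example: dict) -> str:
    # Fill the Alpaca prompt template with one instruction-response record.
    return ALPACA_TEMPLATE.format(
        instruction=example["instruction"],
        input=example.get("input", ""),
        output=example["output"],
    )

In practice, Alpaca-style pipelines often use a second, shorter template when the input field is empty.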

2. Tokenization and Loss Masking

  • Tokenization: Converts text to token IDs.
  • Loss Masking: Only calculates loss on response tokens, ignoring instruction/input tokens. This focuses learning on generating accurate responses.
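
A minimal sketch of both steps, assuming a Hugging Face tokenizer and the common convention of marking masked positions with -100 so the cross-entropy loss ignores them:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder; use your own tokenizer

def tokenize_with_loss_mask(prompt: str, response: str, max_length: int = 512):
    # Tokenize the prompt (instruction + input) and the response separately.
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token, add_special_tokens=False)["input_ids"]

    input_ids = (prompt_ids + response_ids)[:max_length]
    # -100 is the ignore index for cross-entropy: prompt tokens contribute no loss,
    # so training focuses on generating the response.
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_length]
    return {"input_ids": input_ids, "labels": labels}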

3. Configuring Training Parameters

  • Learning Rate: Lower than pre-training (1e-5 to 5e-5).
  • Epochs: 1-3 to prevent overfitting.
  • Batch Size: As large as hardware allows for stable training.
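
A sketch of these settings in PyTorch, with a placeholder model standing in for the pre-trained decoder from the earlier article:

import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # placeholder; substitute the pre-trained decoder model

learning_rate = 2e-5   # lower than pre-training, typically 1e-5 to 5e-5
num_epochs = 2         # 1-3 epochs helps avoid overfitting
batch_size = 8         # as large as the hardware allows

optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=0.01)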

4. Training and Evaluation

  • Run the training loop: feed batches, compute loss, update weights.
  • Save checkpoints and evaluate using test instructions.
  • Assess output quality qualitatively and iteratively.
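
A condensed sketch of the loop, assuming the model, optimizer, and settings above plus a train_loader that yields batches with input_ids and labels from the tokenization step (the model is assumed to return next-token logits):

import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.train()

for epoch in range(num_epochs):
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        labels = batch["labels"].to(device)

        logits = model(input_ids)  # (batch, seq_len, vocab_size)
        # Shift so each position predicts the next token; -100 labels are ignored.
        loss = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            labels[:, 1:].reshape(-1),
            ignore_index=-100,
        )

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Save a checkpoint each epoch, then spot-check generations on held-out prompts.
    torch.save(model.state_dict(), f"sft_checkpoint_epoch{epoch}.pt")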

Ethical Considerations in SFT

Be aware of dataset biases and limitations. Model-generated datasets may propagate errors or biases from source models. Responsible curation and ongoing evaluation are essential for safe, reliable AI.

Conclusion

Supervised Fine-Tuning is the key to transforming a pre-trained LLM into an instruction-following or chat model. By selecting the right SFT dataset and following a structured workflow—prompt templating, loss masking, parameter tuning, and iterative evaluation—you can align your model with human intent and build advanced conversational AI systems.

In the next article, we'll apply these concepts to fine-tune our own instruction-following model.