Our initial model, R1 Zero, demonstrated foundational reasoning using the GRPO reinforcement learning strategy. However, its logic was often muddled and its language lacked coherence. To elevate this model from a promising experiment to a robust reasoning engine, we needed a powerful language model fine-tuning strategy.
This article details the critical Supervised Fine-Tuning (SFT) pipeline that transforms the model, reproducing the methods used for DeepSeek R1. We will explore how SFT instills the ability to 'think out loud' with clarity and precision. Our journey covers three key stages:
- Cold-Start SFT: Using the high-quality Bespoke-Stratos-17k dataset to teach the model structured, logical argumentation.
- Advanced SFT with Rejection Sampling: Refining the model with an elite dataset curated through a rigorous quality control process that incorporates principles from reinforcement learning.
- Knowledge Distillation: Transferring the sophisticated reasoning abilities of the final model to a smaller, more efficient version for real-world deployment.
By the end, you will have a comprehensive, reproducible blueprint for fine-tuning a language model that excels at complex reasoning.
Stage 1: Cold-Start SFT with the Bespoke-Stratos-17k Dataset
To build the R1 model, the DeepSeek research team started with Supervised Fine-Tuning (SFT). The first critical step was constructing a high-quality dataset for a "cold start"—training the model from a foundational state.
While building a custom dataset is possible, the open-source community provides an excellent alternative: Bespoke-Stratos-17k. Using this dataset directly significantly cuts down on preparation time and complexity for our SFT pipeline.
SFT is a supervised learning method where we provide the model with training pairs, each containing:
- Input (Prompt): The question or instruction.
- Output (Completion): The ideal, expert-level response we want the model to generate.
Through this process, the model learns the patterns of methodical thinking and logical reasoning required to solve complex problems.
The SFT training process follows four key steps in a classic machine learning loop (a minimal code sketch follows the list):
- Forward Pass: The model takes an input (Prompt) and generates its own output.
- Loss Calculation: This predicted output is compared against the ideal answer (Completion) to calculate the difference, or "loss."
- Backward Pass: Based on this loss, the algorithm calculates the gradient—the direction and magnitude of adjustments needed for the model's internal parameters.
- Parameter Update: An optimizer, like AdamW, updates the model's parameters according to the gradient, nudging its future predictions closer to the target answer.
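To make the loop concrete, here is a minimal sketch of a single SFT step in plain PyTorch with Hugging Face Transformers. The toy prompt/completion pair and the small Qwen checkpoint are placeholders for illustration, not the exact R1 training setup:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative small model; any causal LM works the same way here
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-0.5B-Instruct")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One prompt/completion pair, concatenated into a single training sequence
text = "Question: What is 2 + 2?\nAnswer: 2 + 2 = 4"
batch = tokenizer(text, return_tensors="pt")

# Forward pass + loss: passing labels makes the model return the
# cross-entropy loss between its next-token predictions and the target text
outputs = model(**batch, labels=batch["input_ids"])
loss = outputs.loss

# Backward pass: compute gradients of the loss w.r.t. every parameter
loss.backward()

# Parameter update: AdamW nudges the weights toward the target answer
optimizer.step()
optimizer.zero_grad()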
This feedback loop is fundamental to language model fine-tuning. Through thousands of these cycles, the R1 model acquires the ability to reason methodically. The initial problems in R1 Zero—chaotic reasoning and inconsistent style—are precisely what this SFT stage is designed to fix. By learning from the standardized examples in Bespoke-Stratos-17k, the model's reasoning clarity improves. This dataset, with 17,000 samples focused on math and programming, is ideal due to its:
- Clear Chain-of-Thought: Detailed, step-by-step breakdowns.
- Consistent Language Style: A uniform, professional tone.
- Structured Output Format: A standardized template for responses.
The following code snippet shows how to load this dataset:
from datasets import load_dataset
# Load the dataset from the Hugging Face Hub
dataset = load_dataset("TIGER-Lab/Bespoke-Stratos-17k", split="train")
# Take a look at the first entry
print(dataset[0])
Here’s an example of a data entry:
{
"messages": [
{
"content": "Let the sequence be $a_1, a_2, \\dots$. We are given that $a_1 = 2$ and $a_{n+1} = \\frac{a_n^2+6}{2a_n}$ for $n \\ge 1$. We want to find the value of $a_n$ as $n \\to \\infty$. Let $L = \\lim_{n \\to \\infty} a_n$. If this limit exists, then as $n \\to \\infty$, $a_{n+1} \\to L$ and $a_n \\to L$. So, we can take the limit of the recurrence relation: $$L = \\frac{L^2+6}{2L}$$ Now we solve for $L$. $$2L^2 = L^2+6$$ $$L^2 = 6$$ $$L = \\pm \\sqrt{6}$$ Since $a_1 = 2 > 0$, and the recurrence relation $a_{n+1} = \\frac{a_n^2+6}{2a_n} = \\frac{a_n}{2} + \\frac{3}{a_n}$ will always produce a positive value if $a_n$ is positive, all terms in the sequence are positive. Therefore, the limit must be positive. So, $L = \\sqrt{6}$. We should check if the sequence converges. Let's examine the first few terms. $a_1 = 2$ $a_2 = \\frac{2^2+6}{2(2)} = \\frac{10}{4} = 2.5$ $a_3 = \\frac{2.5^2+6}{2(2.5)} = \\frac{6.25+6}{5} = \\frac{12.25}{5} = 2.45$ $a_4 = \\frac{2.45^2+6}{2(2.45)} = \\frac{6.0025+6}{4.9} = \\frac{12.0025}{4.9} \\approx 2.44948979...$ $\\sqrt{6} \\approx 2.44948974...$ The sequence appears to be converging to \\sqrt{6}.",
"role": "assistant"
},
{
"content": "Let the sequence be $a_1, a_2, \\dots$. We are given that $a_1 = 2$ and $a_{n+1} = \\frac{a_n^2+6}{2a_n}$ for $n \\ge 1$. We want to find the value of $a_n$ as $n \\to \\infty$.",
"role": "user"
}
],
"source": "unknown",
"prompt_id": "unknown"
}
Each assistant response provides a clear reasoning process, making it an excellent template for training.
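To see how one of these conversational entries becomes the flat text the model is actually trained on, you can render it with the chat template of the tokenizer for the model we fine-tune later (Qwen/Qwen1.5-0.5B-Instruct). This is only an illustration; the exact string the trainer builds may differ depending on the TRL version:
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("TIGER-Lab/Bespoke-Stratos-17k", split="train")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Instruct")

# Render one "messages" list into a single training string
formatted = tokenizer.apply_chat_template(
    dataset[0]["messages"],
    tokenize=False,               # Return plain text instead of token IDs
    add_generation_prompt=False,  # The assistant turn is already present
)
print(formatted[:500])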
Configuring the SFT Training Environment with TRL
We use Hugging Face's trl library to build the training pipeline. The configuration for the SFT run is defined using SFTConfig as follows:
from trl import SFTConfig
sft_config = SFTConfig(
    output_dir="./sft_model",          # Directory to save the model
    num_train_epochs=1,                # Number of times to train on the full dataset
    per_device_train_batch_size=4,     # Number of samples per batch on each GPU
    gradient_accumulation_steps=2,     # Accumulate gradients over 2 steps to simulate a larger batch size
    optim="adamw_torch",               # The optimizer to use
    save_steps=500,                    # Save a checkpoint every 500 steps
    logging_steps=10,                  # Log training metrics every 10 steps
    learning_rate=2e-5,                # The starting learning rate
    max_grad_norm=0.3,                 # Clip gradients to prevent them from exploding
    max_steps=-1,                      # -1 means training steps are determined by epochs
    warmup_ratio=0.03,                 # A small portion of training to gradually increase the learning rate
    lr_scheduler_type="cosine",        # The learning rate scheduler type
    report_to="tensorboard",           # Log results to TensorBoard
    remove_unused_columns=False,       # Keep all original columns in the dataset
    gradient_checkpointing=True,       # A memory-saving technique
    bf16=True,                         # Use bf16 mixed-precision for faster training
    # Note: Flash Attention 2 is not an SFTConfig option; it is requested when loading the model (shown below)
)
This configuration mirrors our previous GRPO phase. Gradient checkpointing and bf16 mixed precision keep memory usage manageable, and Flash Attention 2, enabled when loading the model below, further boosts training efficiency.
Executing the SFT Training Loop with SFTTrainer
Before training, we initialize the tokenizer and load the dataset.
from transformers import AutoTokenizer
from datasets import load_dataset
# Load the tokenizer for the Qwen1.5-0.5B-Instruct model
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen1.5-0.5B-Instruct",
    trust_remote_code=True,
)
# Set the padding token to be the same as the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
# Load our high-quality reasoning dataset
dataset = load_dataset("TIGER-Lab/Bespoke-Stratos-17k", split="train")
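One piece the snippets above do not show is the model itself. A minimal sketch that defines the model variable used by the trainer, assuming the same Qwen/Qwen1.5-0.5B-Instruct checkpoint as the tokenizer, looks like this (Flash Attention 2 requires the optional flash-attn package and a supported GPU; drop the attn_implementation argument to fall back to the default attention):
import torch
from transformers import AutoModelForCausalLM

# Load the base model in bf16 and request Flash Attention 2 at load time
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-0.5B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)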
With the prerequisites handled, we initialize the SFTTrainer and begin training.
from trl import SFTTrainer
# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,                    # The model we're training
    train_dataset=dataset,          # The training dataset
    dataset_text_field="messages",  # The dataset field containing the text data
    max_seq_length=2048,            # The maximum sequence length for tokenization
    tokenizer=tokenizer,            # The tokenizer
    args=sft_config,                # Our training configuration
)
# Let's start training!
trainer.train()
After training is complete, we save the fine-tuned model.
# Save the final model to disk
trainer.save_model("./sft_model_final")
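As an optional sanity check, not part of the original pipeline, you can reload the saved checkpoint and prompt it with a small reasoning problem:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the fine-tuned checkpoint saved above
model = AutoModelForCausalLM.from_pretrained("./sft_model_final")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-0.5B-Instruct")

prompt = [{"role": "user", "content": "If 3x + 5 = 20, what is x? Think step by step."}]
inputs = tokenizer.apply_chat_template(prompt, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))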
This completes the first SFT stage, providing a solid foundation for more advanced training.
Stage 2: Advanced SFT with Rejection Sampling
While the initial SFT stage improves reasoning, the second stage in the DeepSeek R1 pipeline introduces advanced techniques: rejection sampling for data curation and the integration of reinforcement learning principles for better LLM alignment.
How Rejection Sampling Creates an Elite Dataset
To source the highest-quality reasoning samples, DeepSeek implemented rejection sampling—a rigorous quality control filter.
The process is methodical:
- Generation: The model generates a massive volume of reasoning examples.
- Evaluation: A reward model, often supplemented by human reviewers, evaluates these examples against criteria like correctness and logical coherence.
- Selection: Only the top-tier samples that meet this high standard are selected for the next round of training.
This elite dataset further sharpens the model's comprehension and expressive power.
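The sketch below illustrates this generate-evaluate-select loop in simplified form. The generate_candidates and score callables stand in for the real sampling procedure and reward model, which are not public, so treat it as pseudo-structure rather than DeepSeek's actual code:
def rejection_sample(prompts, generate_candidates, score, threshold=0.9, k=8):
    """Keep only the highest-scoring completion for each prompt.

    generate_candidates(prompt, k) -> list of k candidate responses (hypothetical)
    score(prompt, response)        -> quality score in [0, 1] from a reward model (hypothetical)
    """
    curated = []
    for prompt in prompts:
        candidates = generate_candidates(prompt, k)               # Generation
        scored = [(score(prompt, c), c) for c in candidates]      # Evaluation
        best_score, best = max(scored, key=lambda pair: pair[0])  # Selection
        if best_score >= threshold:                               # Quality gate
            curated.append({"prompt": prompt, "completion": best})
    return curated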
Integrating Reinforcement Learning Principles into SFT
This second stage cleverly integrates concepts from reinforcement learning to align the model more closely with human values. The goal is to create an AI assistant that is not just accurate but also helpful and harmless.
To achieve this, the evaluation process incorporates a reward mechanism that considers:
- Language Consistency: The model's response must match the language of the user's prompt.
- Helpfulness: The response must effectively address the user's problem.
- Harmlessness: The output must avoid biased, inappropriate, or harmful content.
By integrating these reward mechanisms into the SFT process, the model becomes deeply aligned with human expectations, learning to communicate safely, reliably, and effectively.
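A simplified illustration of how such criteria could be folded into a single reward is shown below. The individual scoring functions and the weights are hypothetical placeholders, since the actual reward models are not public:
def combined_reward(prompt, response,
                    language_score, helpfulness_score, harmlessness_score,
                    weights=(0.2, 0.5, 0.3)):
    """Blend the three alignment criteria into one scalar reward.

    Each *_score argument is a hypothetical callable returning a value in [0, 1].
    """
    scores = (
        language_score(prompt, response),     # Response language matches the prompt
        helpfulness_score(prompt, response),  # Response actually addresses the problem
        harmlessness_score(response),         # Response avoids biased or unsafe content
    )
    return sum(w * s for w, s in zip(weights, scores))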
Stage 3: Knowledge Distillation for Efficient LLM Deployment
To make DeepSeek R1 accessible for real-world applications, the team employed knowledge distillation in the final training stage. This technique transfers the intelligence of a large, powerful model to a smaller, more lightweight one.
The core idea involves a powerful "teacher" model (the full-sized DeepSeek R1) imparting its wisdom to a smaller "student" model. The outputs of the teacher model are used as the "ground truth" for training the student model via supervised fine-tuning. The student learns to mimic the teacher's high-quality performance.
After this process, the student model retains the core reasoning strengths of its teacher, making it practical for business applications where efficiency is key.
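In practice, this usually means using the teacher to label a pool of prompts and then running ordinary SFT on the resulting pairs. The sketch below shows that data-generation step, with teacher_generate standing in for inference on the full-sized teacher model:
import json

def build_distillation_dataset(prompts, teacher_generate, out_path="distill_sft.jsonl"):
    """Write teacher completions as chat-style SFT pairs for the student model.

    teacher_generate(prompt) -> str is a hypothetical call into the teacher model.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            completion = teacher_generate(prompt)  # The teacher's answer becomes the training target
            record = {"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": completion},
            ]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
The student model is then fine-tuned on this file with the same SFTTrainer setup used in Stage 1.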
Conclusion
Across this series, we have systematically charted the development of a reasoning-capable large language model by reconstructing the DeepSeek R1 training process. We began with a reinforcement learning framework and, in this final article, detailed the supervised fine-tuning pipeline used to refine its logic and align it with human values.
We have now laid out the core mechanisms of the DeepSeek R1 SFT pipeline, establishing a clear paradigm for building reasoning-focused models. While a full-scale replication is a significant undertaking, this series provides a technical roadmap for your own work in language model fine-tuning, helping you build AI that is powerful, practical, and generalizable.
Key Takeaways
• Implement the Supervised Fine-Tuning (SFT) pipeline to enhance LLM reasoning capabilities.
• Utilize the DeepSeek R1 process for effective language model fine-tuning.
• Focus on improving clarity and coherence in model outputs through SFT techniques.