In the previous article, we examined the design of the Generative Reasoning Policy Optimization (GRPO) reward function. Now, we transition from theory to practice. This article provides a comprehensive guide to implementing a full GRPO training pipeline, from Supervised Fine-Tuning (SFT) to the final reinforcement learning stage.
Using the Qwen2.5-0.5B-Instruct model as a case study, we will cover training parameter configuration, model management, and the GRPOTrainer. A key focus will be on constructing high-quality "cold-start" data using prompt engineering techniques such as few-shot Chain-of-Thought. This initial SFT phase is critical for establishing a strong reasoning baseline before applying Reinforcement Learning (RL).
This article is the third in our series on replicating the DeepSeek R1 methodology. For a complete understanding, we recommend reading the first two installments.
Why SFT is Crucial for Improving Model Reasoning
While the technical setup of a training pipeline is straightforward, its success hinges on the model's foundational capabilities. The development of DeepSeek R1 highlights the challenges of enhancing model reasoning and the critical role of data quality. The DeepSeek team's work shows why a preliminary Supervised Fine-Tuning (SFT) phase using high-quality "cold-start" data is essential before proceeding to reinforcement learning.
In the original research, DeepSeek-R1-Zero performed exceptionally well on reasoning benchmarks. This provides strong validation for a key conclusion:
Guiding language models through reasoning training using Reinforcement Learning (RL) is a highly promising technical path.
Despite this strong performance, the team's simple reasoning template led to two practical issues.
First, the reasoning content generated by the model often appeared unstructured and logically jumbled, making it difficult to interpret the model's internal thought processes.
Second, the model exhibited a tendency to mix languages in multilingual scenarios, degrading the user experience.
Building Cold-Start Data for SFT: A Practical Guide
To address these issues, the DeepSeek team implemented a crucial preparatory step: Supervised Fine-Tuning (SFT) using "cold-start" data. The principle is to establish a strong reasoning foundation before beginning reinforcement learning. By training the model on meticulously annotated, high-quality examples, it learns what constitutes a clear, systematic reasoning process. This SFT phase equips the base model with a foundational understanding of reasoning, preventing unproductive exploration during the subsequent RL phase.
The Bespoke-Stratos-17k dataset is a prime example of a cold-start dataset used for this purpose. Let's explore the methods for constructing such data.
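As a quick reference, the dataset can be pulled from the Hugging Face Hub. Here is a minimal sketch, assuming the "bespokelabs/Bespoke-Stratos-17k" hub identifier (adjust the path if the dataset is hosted elsewhere):
from datasets import load_dataset

# Load the cold-start dataset from the Hub (hub path assumed)
cold_start_ds = load_dataset("bespokelabs/Bespoke-Stratos-17k", split="train")

# Inspect one record to see how questions and reasoning traces are stored
print(cold_start_ds[0])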
Method 1: Few-shot Chain-of-Thought (CoT) Prompting
In building their cold-start data, the DeepSeek team employed few-shot Chain-of-Thought (CoT) prompting. This technique involves providing the model with a few examples, each containing a problem and a detailed, step-by-step reasoning process. By showing the model examples with clear thought paths, it learns how to break down problems and organize logic.
The following code demonstrates how to use the Qwen2.5-0.5B-Instruct model to generate these "step-by-step reasoning" answers.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the Qwen2.5-0.5B-Instruct model and its tokenizer
model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Prepare few-shot examples that demonstrate step-by-step reasoning
few_shot_prompt = """
Question: What is 1 + 1?
Answer:
1. Identify the operation: Addition.
2. Perform the calculation: 1 + 1 = 2.
3. Final Answer: 2.
<|special_token|>
Question: What is 5 * 3?
Answer:
1. Identify the operation: Multiplication.
2. Perform the calculation: 5 * 3 = 15.
3. Final Answer: 15.
<|special_token|>
Question: What is 10 - 4?
Answer:
1. Identify the operation: Subtraction.
2. Perform the calculation: 10 - 4 = 6.
3. Final Answer: 6.
<|special_token|>
Question: What is 2 + 3 * 4?
Answer:
"""
# Call the model to generate an answer
inputs = tokenizer(few_shot_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
The model's output demonstrates the effectiveness of this technique. With just a few high-quality examples, the model can grasp the desired format and logic.
Question: What is 2 + 3 * 4?
Answer:
1. Identify the operations: Addition and Multiplication.
2. Follow the order of operations (PEMDAS/BODMAS): Multiplication before Addition.
3. Perform the multiplication: 3 * 4 = 12.
4. Perform the addition: 2 + 12 = 14.
5. Final Answer: 14.
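Once an output like this has been checked for correctness, it can be appended to a cold-start training file. The record schema and file name below are illustrative assumptions rather than part of the original pipeline:
import json

# Keep only the newly generated answer (the text after the final few-shot question)
generated_answer = response.split("Question: What is 2 + 3 * 4?")[-1].strip()

# Store the verified pair as one cold-start SFT example (schema is illustrative)
record = {"question": "What is 2 + 3 * 4?", "answer": generated_answer}
with open("cold_start_data.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")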
Method 2: Direct Prompting for Step-by-Step Reasoning
Another flexible approach is Direct Prompting. The core idea is to use a carefully crafted prompt to guide the model to generate an answer that includes its reasoning process. We instruct the model to analyze the problem step-by-step, reason clearly, and self-verify its result.
Here is a Python code example:
# Construct a direct prompt
direct_prompt = """
Question: What is 2 + 3 * 4?
Please provide a step-by-step reasoning process and verify your final answer.
"""
# Call the model (reusing the model and tokenizer loaded earlier)
inputs = tokenizer(direct_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
The model's output:
Question: What is 2 + 3 * 4?
Please provide a step-by-step reasoning process and verify your final answer.
Answer:
Step 1: Identify the order of operations (PEMDAS/BODMAS). Multiplication should be performed before addition.
Step 2: Calculate the multiplication part: 3 * 4 = 12.
Step 3: Calculate the addition part: 2 + 12 = 14.
The final answer is 14.
Verification:
To verify, we can re-check the order of operations.
3 * 4 is indeed 12.
2 + 12 is indeed 14.
The calculation is correct.
This method significantly enhances the interpretability and trustworthiness of the model's output.
Implementing the Full GRPO Training Pipeline
After preparing a base model with a strong reasoning foundation via SFT, we can proceed to the reinforcement learning phase. We modularize the model parameters, training hyperparameters, and callback functions for a flexible and maintainable process.
Step 1: Configuring Model Parameters and Callbacks
We define a ModelConfig data class to centralize the management of all parameters related to loading the model. We also implement a simple LoggingCallback to periodically print logs during training, making it easy to monitor key metrics such as loss and learning rate in real time.
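A minimal sketch of both components, assuming the standard transformers TrainerCallback interface (the specific fields and the logging interval are illustrative choices):
from dataclasses import dataclass
from transformers import TrainerCallback

@dataclass
class ModelConfig:
    # Centralizes everything needed to load the policy model (fields are illustrative)
    model_name_or_path: str = "Qwen/Qwen2.5-0.5B-Instruct"
    torch_dtype: str = "bfloat16"
    trust_remote_code: bool = True

class LoggingCallback(TrainerCallback):
    # Prints key metrics (loss, learning rate) every `logging_interval` steps
    def __init__(self, logging_interval: int = 10):
        self.logging_interval = logging_interval

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and state.global_step % self.logging_interval == 0:
            print(f"step {state.global_step}: loss={logs.get('loss')}, lr={logs.get('learning_rate')}")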
With this structured design, we can quickly swap out models, tweak training strategies, and flexibly integrate various tools.
Step 2: Launching the GRPOTrainer for RL
After initializing the model, loading the data, and setting up our training parameters, we are ready to begin the GRPO training process. The entire workflow is orchestrated by the GRPOTrainer, which integrates all prepared modules: the model, dataset, reward function, and training configuration.
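A minimal sketch of the launch step, assuming TRL's GRPOTrainer and GRPOConfig (exact argument names can vary slightly between TRL versions; reward_function refers to the reward defined in the previous article, and train_dataset is the prompt dataset prepared earlier):
from trl import GRPOConfig, GRPOTrainer

# Training hyperparameters (values are illustrative)
training_args = GRPOConfig(
    output_dir="qwen2.5-0.5b-grpo",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=1e-5,
    logging_steps=10,
)

# Assemble the trainer from the prepared modules
trainer = GRPOTrainer(
    model=model,                   # SFT-initialized Qwen2.5-0.5B-Instruct
    args=training_args,
    train_dataset=train_dataset,   # prompts used for RL rollouts
    reward_funcs=reward_function,  # GRPO reward from the previous article
    callbacks=[LoggingCallback()],
)

trainer.train()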
Since this example sets num_train_epochs = 1 and uses a small-scale model, the training process is relatively quick, which is ideal for rapid validation. In a production setting such as the one used for DeepSeek R1, training would involve many more epochs.
Once training is complete, we save the model so it can be loaded later for inference or further fine-tuning. We can then write a simple test function to call our newly trained model for inference and quickly verify its responsiveness and output quality.
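A small sketch of this save-and-verify step (the output directory and test prompt are illustrative):
# Save the trained policy and tokenizer for later inference or further fine-tuning
trainer.save_model("qwen2.5-0.5b-grpo")
tokenizer.save_pretrained("qwen2.5-0.5b-grpo")

def quick_test(prompt: str) -> str:
    # Run a single generation to verify that the trained model responds sensibly
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(quick_test("What is 7 + 6 * 2? Please reason step by step."))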
Conclusion
In this article, we detailed the workflow for enhancing a model's reasoning capabilities, from cold-start data generation to reinforcement learning. We analyzed the challenges faced by DeepSeek R1 and explored how to address them with Supervised Fine-Tuning (SFT). By using few-shot Chain-of-Thought prompting and direct prompting, we can establish a strong foundation before the GRPO training pipeline optimizes the model's policy.
Our next article will focus on the SFT stage in greater detail, using the high-quality cold-start data we've designed to train and optimize the base model. Stay tuned as we delve deeper into building a powerful reasoning model.
Key Takeaways
• Implement a full GRPO training pipeline, transitioning from SFT to reinforcement learning.
• Utilize cold-start data for effective model training and improved reasoning capabilities.
• Configure training parameters and manage models using the GRPOTrainer for optimal results.