LLM Reasoning Series: How DeepSeek-R1 Uses Reinforcement Learning to Supercharge Reasoning

Isaac Kargar
7 min read · Jan 30, 2025


Large language models (LLMs) like ChatGPT, Claude, and Gemini have dazzled the world with their ability to write essays, solve math problems, and even write code. But behind the scenes, most of these models rely heavily on supervised fine-tuning (SFT) — a process where humans manually provide correct examples to teach the model how to respond. While effective, this approach is time-consuming, expensive, and limits the model’s ability to “think” independently.

Enter DeepSeek-R1, a groundbreaking project by researchers at DeepSeek-AI. Instead of relying on pre-labeled data, this model learns to reason through reinforcement learning (RL) — a method inspired by how humans learn from trial and error. By rewarding the model for correct answers and penalizing mistakes, DeepSeek-R1 develops sophisticated problem-solving skills on its own, beating even top-tier models like OpenAI’s o1-1217 on some benchmarks. OpenAI’s o1 models also lean heavily on RL, but OpenAI has published few details about their approach and how they use it.

But what makes DeepSeek-R1 unique isn’t just its performance. The team also distilled its capabilities into smaller, more efficient models (as tiny as 1.5 billion parameters), making advanced reasoning accessible to everyone. Let’s dive into how this system works, why it matters, and what it means for the future of AI.

The Problem With Traditional Training

Most LLMs follow a two-step training process:

  1. Pre-training: The model learns general language patterns by analyzing vast amounts of text from books, websites, and articles.
  2. Supervised Fine-Tuning (SFT): Humans manually curate examples to teach the model specific tasks, like solving math problems or writing code.

While SFT works, it has limitations. Creating high-quality training data is labor-intensive, and the model’s performance is capped by the examples it sees. Worse, SFT doesn’t encourage the model to explore new strategies — it simply mimics what humans show it.
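To make the SFT step concrete, here is a minimal sketch of what it boils down to: plain next-token cross-entropy on human-curated prompt/answer pairs. The model name and the training pair below are placeholders of my own, not DeepSeek’s actual setup.

```python
# Minimal SFT sketch: imitate curated examples with next-token cross-entropy.
# The model/tokenizer names and the (prompt, answer) pair are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")   # stand-in base model
tokenizer = AutoTokenizer.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "Q: What is 17 * 3?\nA:"
answer = " 17 * 3 = 51"

batch = tokenizer(prompt + answer, return_tensors="pt")
# Using the input ids as labels means the loss rewards reproducing the
# human-written text token by token: mimicry, not exploration.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
```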

DeepSeek-AI’s team wondered: What if we skip SFT entirely and let the model learn reasoning through trial and error?

How DeepSeek-R1-Zero Learned to Reason on Its Own

The first breakthrough came with DeepSeek-R1-Zero, a model trained without any supervised fine-tuning. Starting from the DeepSeek-V3 base model, researchers used a reinforcement learning framework called Group Relative Policy Optimization (GRPO). Here’s how it works:

Step 1: Define the Rules of the Game

Imagine teaching a child to solve puzzles. You wouldn’t give them the answers — you’d reward them for correct solutions and gently correct mistakes. Similarly, GRPO uses two types of “rewards” to guide the model (both are sketched in code right after this list):

  • Accuracy Rewards: Points for correct answers (e.g., solving a math problem).
  • Format Rewards: Points for structuring its thoughts clearly (e.g., writing reasoning steps inside <think> tags).
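Here is a rough sketch of what these two rule-based checks could look like. It is a simplified illustration under my own assumptions (exact string matching for correctness, a single <think>…</think> / <answer>…</answer> layout, and an arbitrary weighting), not DeepSeek’s actual reward implementation.

```python
import re

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Reward 1.0 if the text inside <answer> tags matches the expected answer."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match and match.group(1).strip() == ground_truth.strip():
        return 1.0
    return 0.0

def format_reward(response: str) -> float:
    """Reward 1.0 if the reasoning is wrapped in <think> tags before the answer."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$"
    return 1.0 if re.match(pattern, response, re.DOTALL) else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # The 0.5 weighting is an assumption for illustration, not the paper's value.
    return accuracy_reward(response, ground_truth) + 0.5 * format_reward(response)
```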

Step 2: Let the Model Experiment

For every question (like “Solve √(a − √(a + x)) = x”), the model generates multiple attempts. GRPO evaluates these attempts as a group, comparing each answer to the others. Instead of judging answers in isolation, it calculates an advantage score based on how much better or worse an attempt is compared to the group average. This encourages the model to iteratively refine its strategies.
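In other words, each attempt’s advantage is simply its reward standardized against the group’s mean and spread. A minimal sketch, assuming each attempt has already been scored with a single scalar reward:

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Standardize each reward against its group: (reward - mean) / std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-spread group
    return [(r - mean) / std for r in rewards]

# Five attempts at the same question, scored 1.0 for correct and 0.0 otherwise:
print(group_advantages([1.0, 0.0, 0.0, 1.0, 1.0]))
# Attempts that beat the group average get positive advantages; the rest, negative.
```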

Step 3: Repeat and Improve

Over thousands of training cycles, the model discovers sophisticated reasoning tactics on its own. For example, it learns to:

  • Self-verify: Double-check intermediate steps for errors.
  • Reflect: Re-examine flawed approaches and try alternatives.
  • Chain Thoughts: Break complex problems into smaller, logical steps.

The “Aha Moment”

During training, researchers observed a fascinating phenomenon: the model spontaneously began rethinking its mistakes. In one instance, after making an error while solving an equation, it paused and wrote: “Wait, wait. Let’s reevaluate this step-by-step…” — a behaviour no human had explicitly taught it. I’ve also been playing a lot with the model and really love these “wait, wait” moments!

From Raw Potential to Polished Performance: DeepSeek-R1

While R1-Zero was a remarkable proof of concept, it had some problems. Its answers were often hard to read, mixed languages (e.g., Chinese and English), or included irrelevant code snippets. To fix this, the team developed DeepSeek-R1, a more user-friendly version trained in four stages:

Cold Start:
Training began with SFT on cold-start data, which included curated chain-of-thought (CoT) examples (one illustrative example is sketched after this list) drawn from:
- Few-shot prompting of LLMs for detailed CoT.
- Post-processed outputs of DeepSeek-R1-Zero.
- Manual refinement by human annotators.
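The paper describes the cold-start data only at a high level (long, readable CoT ending in a clear answer), so the tags, wording, and layout below are my own illustration of what one such SFT example might look like, not the actual released format.

```python
# Hypothetical cold-start SFT example (illustrative only, not the released data).
cold_start_example = {
    "prompt": "Solve: if 3x + 5 = 20, what is x?",
    "response": (
        "<think>\n"
        "Subtract 5 from both sides: 3x = 15.\n"
        "Divide both sides by 3: x = 5.\n"
        "Check: 3 * 5 + 5 = 20. Correct.\n"
        "</think>\n"
        "<answer>x = 5</answer>"
    ),
}
```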

Reasoning-Oriented RL:
After SFT, reinforcement learning was applied with a language consistency reward to ensure outputs aligned with the target language, improving readability.
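The paper describes the language consistency reward as roughly the proportion of target-language words in the chain of thought. Below is a crude sketch of that idea; the ASCII-based language check is my own stand-in for a real language-identification step, not DeepSeek’s implementation.

```python
def language_consistency_reward(cot_text: str, target: str = "english") -> float:
    """Fraction of whitespace-separated tokens that look like the target language."""
    tokens = cot_text.split()
    if not tokens:
        return 0.0
    if target == "english":
        in_target = [t for t in tokens if t.isascii()]  # crude heuristic
    else:
        in_target = [t for t in tokens if not t.isascii()]
    return len(in_target) / len(tokens)

# A chain of thought that drifts into another language scores lower:
print(language_consistency_reward("Let us solve 这个方程 step by step"))  # ~0.86
```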

Rejection Sampling and SFT:
- Rejection Sampling: Multiple responses were generated for each prompt, and only the best (most correct and clear) were retained for further fine-tuning (a minimal sketch of this step follows the list).
Datasets:
- Reasoning data: about 600,000 samples on reasoning tasks like math and logic.
- Non-reasoning data: about 200,000 samples on tasks like writing, role-playing, and QA.
- The combined dataset was used to fine-tune the model for two epochs.
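Rejection sampling itself is simple to express: sample several candidates per prompt, score them, and keep only the top ones for the next SFT round. A minimal sketch, where generate and score are placeholder functions standing in for the RL checkpoint’s sampler and the grading step:

```python
from typing import Callable

def rejection_sample(
    prompts: list[str],
    generate: Callable[[str, int], list[str]],  # placeholder: prompt, n -> candidates
    score: Callable[[str, str], float],         # placeholder: rates one candidate
    n_samples: int = 8,
    keep_top: int = 1,
) -> list[tuple[str, str]]:
    """Keep only the best-scoring responses per prompt for the next SFT round."""
    kept = []
    for prompt in prompts:
        candidates = generate(prompt, n_samples)
        ranked = sorted(candidates, key=lambda c: score(prompt, c), reverse=True)
        kept.extend((prompt, c) for c in ranked[:keep_top])
    return kept
```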

Reinforcement Learning for All Scenarios:
A final RL stage aligned the model with user preferences by incorporating:
- Preference-Aligned Prompts: Designed to reflect real-world tasks emphasizing helpfulness, harmlessness, and clarity.
- Safety Rewards: Ensured outputs avoided biases or harmful content.
- No LLM as Reward: Rewards were not explicitly generated by LLMs.

How GRPO Works and Why It Is a Game-Changer

Most RL methods require two models: a policy model (which generates answers) and a critic model (which evaluates them). This roughly doubles computational costs. GRPO simplifies this by eliminating the critic model. Instead of relying on a separate evaluator, it compares responses within a group, using the group’s mean and standard deviation to turn raw rewards into advantages.

GRPO is a reinforcement learning algorithm designed to enhance the reasoning capabilities of large language models. It modifies the traditional Proximal Policy Optimization (PPO) algorithm by eliminating the need for a separate value function model. Instead, it estimates baselines from group scores, reducing memory usage and computational overhead. This makes GRPO particularly well-suited for training LLMs, which often involve large-scale models and datasets.
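For readers who want the math, the GRPO objective (lightly simplified from the form given in the DeepSeek papers) maximizes a clipped, advantage-weighted likelihood ratio while penalizing KL divergence from a reference model, with the advantage computed from group statistics rather than a learned critic:

```latex
J_{\mathrm{GRPO}}(\theta) =
  \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}
    \min\Big(r_i(\theta)\,A_i,\;
             \mathrm{clip}\big(r_i(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,A_i\Big)
    \;-\;\beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big)\right],
\quad
r_i(\theta) = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},
\quad
A_i = \frac{R_i - \mathrm{mean}(R_1,\dots,R_G)}{\mathrm{std}(R_1,\dots,R_G)}.
```

Here q is the question, o_1 … o_G are the G sampled responses, R_i their rewards, and π_ref a frozen reference model that keeps the policy from drifting too far.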

Here’s an example of how GRPO can be applied in the LLM domain:
- Scenario: Fine-tuning an LLM for generating creative stories.
- Prompt selection: Define a set of prompts related to creative storytelling, such as “Write a short story about a talking cat who goes on an adventure.”
- Response generation: Generate multiple story variations (e.g., five) for each prompt using the LLM’s current policy.
- Reward scoring: Evaluate each generated story with a reward function that assesses aspects like creativity, coherence, and engagement. This could combine rule-based metrics (e.g., story length, use of descriptive language, presence of a clear narrative arc) with human evaluation.
- Group the stories: For each prompt, group the five generated stories.
- Calculate the average reward: Compute the average reward for each group of stories.
- Calculate advantages: For each story within a group, calculate its advantage by comparing its reward to the group average.
- Update the policy: Adjust the LLM’s policy based on the calculated advantages, increasing the probability of generating stories with higher advantages.

By using GRPO, the LLM can learn to generate more creative and engaging stories by iteratively refining its policy based on the feedback received from the reward model and the group comparisons.
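Putting those steps together, a toy version of the training loop might look like the sketch below. The policy object and the generate and score functions are placeholders of my own, standing in for an actual LLM, its sampler, and a reward model; the point is only to show how grouping, averaging, and advantages fit together.

```python
def grpo_training_step(policy, prompts, generate, score, group_size=5):
    """One toy GRPO-style step over a batch of prompts.

    `policy`, `generate`, and `score` are placeholders for a real LLM, its
    sampling routine, and a reward function; an illustration, not a
    production implementation.
    """
    for prompt in prompts:
        # 1. Sample a group of stories from the current policy.
        stories = generate(policy, prompt, n=group_size)

        # 2. Score each story (creativity, coherence, engagement, ...).
        rewards = [score(prompt, s) for s in stories]

        # 3. Use the group itself as the baseline: standardize the rewards.
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
        advantages = [(r - mean) / std for r in rewards]

        # 4. Nudge the policy toward stories with positive advantage
        #    (in practice via the clipped, KL-regularized objective above).
        policy.update(prompt, stories, advantages)
```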

Think of it like grading students on a curve: if one student’s answer is far better than the class average, they get a high score. This approach not only saves resources but also stabilizes training, preventing the model from chasing unrealistic rewards.

Results: From Math Olympiads to Coding Competitions

DeepSeek-R1-Zero’s Raw Talent

  • AIME 2024: A math competition for top high school students. R1-Zero scored 71% on its first try, rivaling OpenAI’s o1-0912 model. With majority voting over 64 sampled answers, it hit 86.7% — outperforming most human participants.
  • MATH-500: A dataset of challenging math problems. The model achieved 95.9% accuracy, comparable to OpenAI’s o1-0912.

DeepSeek-R1’s Refined Mastery

  • AIME 2024: 79.8% accuracy, slightly beating OpenAI’s flagship o1–1217.
  • Codeforces: A competitive programming platform. R1 achieved a 2029 Elo rating, outperforming 96.3% of human participants.
  • General Intelligence: Excelled at creative writing (87.6% win-rate on AlpacaEval) and factual questions (90.8% on MMLU, a college-level exam benchmark).

Small Models, Big Brains

By distilling R1’s knowledge into smaller models, the team achieved surprising results:

  • DeepSeek-R1-Distill-Qwen-7B (7 billion parameters) outperformed GPT-4o on math tasks.
  • DeepSeek-R1-Distill-Llama-70B matched the performance of models twice its size.

Conclusion: A New Era of Self-Learning AI

DeepSeek-R1 isn’t just another LLM — it’s a paradigm shift. By open-sourcing their models and methods, DeepSeek-AI has invited the global community to build on this foundation. As one researcher noted, “We’re not just creating smarter AI — we’re creating AI that learns how to learn.”

Written by Isaac Kargar

Co-Founder and Chief AI Officer @ Resoniks | Ph.D. candidate at the Intelligent Robotics Group at Aalto University | https://kargarisaac.github.io/
