Some Thoughts on Reinforcement Learning in Large Language Models

Isaac Kargar
9 min read · Feb 15, 2025

--

As someone with a background in reinforcement learning (RL) and having witnessed its rising prominence in the large language model (LLM) domain, I have been thinking a lot over the last few weeks about how RL is used to refine LLM behavior. From fine-tuning policies based on human feedback to ensuring robust generalization across diverse scenarios, RL methods have carved out an essential niche in modern AI. In this post, I’ll discuss how different RL methods differ, why a purely supervised approach is insufficient, why combining supervised fine-tuning (SFT) with RL — as seen in systems like DeepSeek-R1 — yields better results, and how exploration inherent in RL enhances generalization and out-of-distribution handling.

1. Diverse Reinforcement Learning Methods: A Closer Look

At a high level, many RL methods share the same update principle: they adjust a model’s parameters by taking a gradient step on an objective function. In its simplest form, this is expressed as

θ ← θ + α ∇_θ J(θ)

where θ represents the model parameters, α is the learning rate, and ∇_θ J(θ) is the gradient of the objective (often the expected reward). However, the nuances of how this gradient is computed, and what additional components are included, can vary widely between methods.
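
As a minimal sketch of that update (PyTorch, purely illustrative; every method below simply plugs in its own objective J), the step looks like this:

```python
import torch

def gradient_ascent_step(params, objective, lr=1e-5):
    """Generic θ ← θ + α ∇θ J(θ) step, assuming `objective` is a scalar
    tensor J(θ) built from the current parameters (illustrative only)."""
    grads = torch.autograd.grad(objective, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(lr * g)  # ascent: step in the direction of the gradient
```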

Proximal Policy Optimization (PPO), for example, is a policy-gradient method that seeks to optimize a surrogate objective while ensuring that policy updates remain “proximal” to the previous policy. It does this by computing a probability ratio between the new policy and the old one:

r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)
This ratio is then multiplied by an advantage estimate (often calculated using techniques like Generalized Advantage Estimation, or GAE), and a clipping operation is applied to prevent the update from moving too far from the old policy. The resulting update, derived from this carefully designed objective, provides stability even when policy changes are substantial.
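
For concreteness, here is a minimal sketch of the clipped surrogate loss (the tensor names and clipping value are illustrative, not tied to any particular library):

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate: take the pessimistic minimum of the unclipped
    and clipped ratio-weighted advantages, averaged over sampled tokens."""
    ratio = torch.exp(logp_new - logp_old)                        # π_new / π_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # negate: optimizers minimize
```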

Reinforcement Learning from Human Feedback (RLHF) is another popular technique that builds on top of methods like PPO by integrating human preference data. Here, a reward model is first trained using pairwise comparisons or ratings provided by human annotators. The subsequent RL phase then optimizes the model using this reward signal, often combining it with techniques from PPO — such as clipping and KL divergence penalties — to ensure safe, incremental updates.
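
The reward model itself is commonly trained with a Bradley-Terry-style pairwise loss on (chosen, rejected) completions; the sketch below is a generic version of that idea, with hypothetical tensor names:

```python
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss: push the reward model to score the
    human-preferred completion above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```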

Another approach, Group Relative Policy Optimization (GRPO), which has been used in DeepSeek-R1, modifies this idea further by eliminating the need for a separate value function (critic). Instead of relying on an external estimate of state value, GRPO generates a group of responses per prompt and computes a group-relative advantage by normalizing rewards across these samples. This method simplifies the architecture and reduces computational overhead while still capturing the essential variability in the responses.
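
In spirit (this is a simplified sketch, not DeepSeek’s exact implementation), the group-relative advantage is just a per-prompt standardization of the sampled rewards:

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: shape (num_samples,), one reward per sampled response to a
    single prompt. Each sample's advantage is its reward standardized against
    the group, so no separate learned value function (critic) is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```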

One of the main limitations in training LLMs with RL is reward assignment. In domains where the model’s output can be verified by code or some kind of test, defining the reward is straightforward: we can prompt the model, check its answer automatically, and let it keep searching for solutions. This makes it possible to train the model with RL almost indefinitely, which is where much of the recent “magic” comes from (a toy sketch of such a verifiable reward is shown at the end of this section). In open domains where the output is not easily verifiable, we usually train a reward model to judge the output. Research shows that this leads to a phenomenon called “reward hacking”: the model learns to produce outputs that score highly under the reward model even though they are not what we actually want, so in these domains we cannot run RL training endlessly. If you are interested in learning more, watch the following short course by Andrej Karpathy:

This talk by Nathan Lambert is also interesting:

Each of these methods addresses the core challenge of learning from sequential decisions but does so in a way that tailors credit assignment, variance reduction, and stability to the specific context in which they are applied.
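
To make the verifiable-reward idea from above concrete, here is a toy sketch (the exact-match check is only an illustration; real setups execute code, run unit tests, or call a symbolic checker):

```python
def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Toy rule-based reward: 1.0 if the model's final answer matches the
    reference exactly, else 0.0. No learned reward model is involved, so
    there is nothing for the policy to 'hack'."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0
```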

2. Why Not Just Supervised Fine-Tuning on the Best Generated Answer?

It might seem intuitive that if a human or reward model clearly identifies a “best” answer, then the simplest solution would be to perform supervised fine-tuning (SFT) on that answer. However, this approach has several shortcomings when it comes to training LLMs.

Limited Diversity and Overfitting:
Supervised fine-tuning typically involves training the model to mimic a single target output. Language generation is inherently stochastic, with multiple valid responses often available for a given prompt. If you focus solely on the “best” answer, you risk overfitting the model to that narrow output. The model may fail to capture the full diversity of the language, potentially missing out on valid variations that could be preferable in different contexts. In autonomous driving, during my PhD, imitation learning (what we now call SFT in the LLM domain) was a popular technique, but it suffered from the same problem: the learned policy couldn’t drive the car once the trajectory deviated even slightly from the training data. This out-of-distribution generalization problem is a big research topic, and RL and offline RL are among the ways to address it.

Sequential Credit Assignment:
In language generation tasks, the reward or preference signal is often determined by the quality of the entire generated sequence rather than isolated tokens. RL methods, especially those employing policy gradients, are designed to assign credit across the sequence of decisions. By optimizing for long-term reward (using advantage estimates and temporal difference methods), RL effectively propagates feedback back to each individual decision. A pure supervised approach would struggle to perform such nuanced temporal credit assignment, which is critical when the quality of the output is only measurable once the entire sequence is generated.
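
A standard way to spread a single sequence-level reward back over individual tokens is through discounted returns; the sketch below assumes the reward arrives only at the end of generation:

```python
def discounted_returns(final_reward: float, num_tokens: int, gamma: float = 1.0):
    """Return-to-go for each token position when the only reward is received
    after the last token: earlier tokens get a (discounted) share of the same
    sequence-level signal, which is how credit flows back through the sequence."""
    return [final_reward * gamma ** (num_tokens - 1 - t) for t in range(num_tokens)]
```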

Handling Non-Differentiable Signals:
Many human feedback signals are non-differentiable or are available only as scalar evaluations (e.g., a “like” or “dislike” rating). Supervised methods require differentiable targets, whereas policy gradient methods can optimize expected rewards directly even if the reward function is not differentiable. This makes RL a more flexible choice when integrating human feedback into the training process.
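
The mechanism that makes this possible is the score-function (REINFORCE) estimator: gradients flow through the log-probability of the sampled output, not through the reward itself, so the reward can be any scalar. A minimal sketch with hypothetical names:

```python
def reinforce_loss(token_logps, scalar_reward, baseline=0.0):
    """Score-function estimator: weight the sequence log-probability (a tensor
    of per-token log-probs) by a possibly non-differentiable scalar reward
    minus a baseline. Only the log-probabilities need gradients, so a raw
    human rating works as the reward."""
    return -(scalar_reward - baseline) * token_logps.sum()
```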

Exploration vs. Exploitation:
RL inherently includes an exploration component. Instead of relying on a single “gold” answer, RL encourages the model to sample and evaluate multiple trajectories. Through exploration, the model learns not only which outputs are preferred but also the landscape of possible responses. This results in a more robust policy that generalizes better to unseen prompts and adapts to varying contexts.

Thus, while SFT on a “best” answer might work for simple tasks, it does not capture the full complexity of language generation. RL methods, through their sophisticated use of policy gradients and exploration, provide a richer framework for aligning models with human preferences.

3. The Advantage of SFT Followed by RL: Insights from DeepSeek-R1

One of the most effective training paradigms in modern LLM development is the two-stage process that starts with supervised fine-tuning (SFT) and is followed by reinforcement learning (RL). DeepSeek-R1 is a prime example of this approach. Here’s why this combination works so well.

Establishing a Strong Base with SFT:
The first stage, SFT, involves fine-tuning the model on high-quality, human-curated data. This step is crucial because it provides the model with a solid foundation — a reliable base model that captures the essential patterns and structures of language as demonstrated by experts. The supervised data helps in reducing the initial variance in the model’s outputs and sets a baseline distribution for further refinement.

Safe and Effective Policy Updates in RL:
Once the model has a robust base, the RL phase refines the policy by incorporating a reward signal: one that reflects human preferences directly, a trained reward model that simulates them, or even a defined rule-based reward. During RL, techniques like policy gradients are used to optimize an objective function that includes mechanisms such as clipping and KL divergence penalties. These mechanisms ensure that the policy updates remain close to the original SFT model, preventing drastic changes that could destabilize performance. Essentially, the RL phase acts as a fine adjustment, nudging the model towards better performance and better reasoning, depending on the defined reward, without losing the valuable information already captured during SFT.
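
A common form of that constraint is a per-token KL-style penalty against the frozen SFT (reference) model, folded into the reward; the sketch below is a generic version with illustrative names and coefficient:

```python
def kl_penalized_reward(task_reward, logp_policy, logp_reference, beta=0.1):
    """Subtract a KL-style penalty (estimated per token as the log-prob gap
    between the current policy and the frozen SFT reference model) from the
    task reward, so RL updates stay close to the SFT starting point."""
    kl_estimate = (logp_policy - logp_reference).sum()   # Monte Carlo KL estimate
    return task_reward - beta * kl_estimate
```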

Curriculum Learning Perspective:
This two-step approach is akin to curriculum learning, in my opinion, where the model first learns simpler, well-defined tasks (via SFT) and then progressively tackles more complex decision-making challenges (via RL). With a good base in place, the RL phase can explore alternative trajectories and subtle improvements while staying within a “safe” region defined by the reference model. The combination of a strong initial policy and cautious policy updates through clipping and KL penalties leads to a more stable and effective learning process.

Practical Advantages:
Practically speaking, this strategy leverages the best of both worlds. SFT provides clear guidance based on expert data. At the same time, RL optimizes long-term rewards that capture the nuances of human preferences, lets the model explore alternatives that may be better than the human-generated data, and helps it learn a distribution around the trajectories in the dataset. This results in a model that not only mimics high-quality outputs but is also capable of refining its behavior in response to dynamic and sometimes ambiguous reward signals.

In summary, the synergy between SFT and RL in systems like DeepSeek-R1 exemplifies how starting with a strong base model makes the subsequent RL phase more effective, leading to better alignment and improved performance.

4. Enhanced Generalization and Out-of-Distribution Handling Through RL Exploration

One of the remarkable benefits of incorporating reinforcement learning into the training pipeline of LLMs is its inherent ability to enhance generalization and handle out-of-distribution scenarios.

Broad Exploration of the Output Space:
Unlike supervised fine-tuning, which trains the model on a fixed set of target responses, RL emphasizes exploration. During training, the model is exposed to a wide range of outputs through sampling and policy gradient updates. This exploration allows the model to learn about the entire distribution of possible responses — not just a single “correct” answer. By encountering multiple trajectories, the model understands what constitutes a high-reward sequence versus a suboptimal one.

Learning Relative Advantages:
RL methods compute an advantage estimate for each trajectory, which is essentially a relative measure of how much better one sequence is compared to an average outcome. This relative comparison helps the model assign credit more effectively across different parts of the sequence. Even if the base model initially produces a mediocre response, the exploration during the RL phase can uncover alternative trajectories that yield higher rewards. These discoveries are then reinforced, improving the overall quality of the policy.

Robustness to Novel Inputs:
When a model is trained solely via supervised learning, it can be prone to overfitting on the training data. In contrast, the exploration inherent in RL means that the model is regularly exposed to a variety of inputs and outputs. This exposure enables it to build a richer representation of the language and its nuances. As a result, the model becomes better equipped to handle out-of-distribution scenarios and generalize to new, unseen prompts. It knows not only the “best” trajectory but also the surrounding landscape of plausible responses.

Improving Through Exploration:
It is often during the exploration phase that the model discovers superior trajectories compared to those seen in the supervised dataset. Even if the SFT phase provides a strong baseline, RL can help the model refine its policy further by exploring alternatives and learning from the relative differences in rewards. This continual refinement process is what ultimately leads to improvements in performance and robustness.

In essence, the integration of RL into LLM training enables the model to go beyond simply mimicking human-curated examples. It learns a more holistic view of the output space, capturing both high-reward trajectories and the structure of less optimal ones. This richer understanding facilitates better generalization, improved adaptability, and a more robust response to novel inputs.

Conclusion

Reinforcement learning has emerged as a critical component in fine-tuning large language models, offering unique advantages over traditional supervised fine-tuning. In this post, we explored four key aspects: the differences among various RL methods, the limitations of relying solely on supervised fine-tuning for generating the best answer, the benefits of a two-stage training approach as demonstrated by DeepSeek-R1, and the enhanced generalization capabilities enabled by the exploration inherent in RL.

By delving into the nuances of methods like PPO, GRPO, and RLHF, we see that the true power of RL lies in its ability to assign credit over sequential decisions, safely update policies with mechanisms like clipping and KL divergence, and explore a wide range of trajectories that help the model generalize better. For practitioners with an RL background, understanding these mechanisms is key to effectively applying RL techniques to LLMs, ensuring that models not only mimic high-quality outputs but also adapt robustly to complex, real-world scenarios.

As the field of LLMs continues to evolve, the integration of reinforcement learning will remain essential for achieving both alignment with human preferences and superior performance on challenging tasks. This dual-stage approach — leveraging the stability of supervised learning and the dynamic refinement of RL — represents a promising direction for future research and practical applications in AI.
