Reinforcement Learning Fine-Tuning Technique by OpenAI

Isaac Kargar
Dec 15, 2024


OpenAI just announced Reinforcement Fine-Tuning and showed a live demo.

But what is Reinforcement Fine-Tuning (RFT)? ⁣

RFT is a new model customization technique that lets users fine-tune OpenAI’s models (specifically the O1 series) with reinforcement learning rather than traditional supervised fine-tuning. The key difference is that while supervised fine-tuning teaches a model to mimic the examples it is shown, RFT teaches the model to develop new reasoning capabilities over custom domains.

𝐇𝐨𝐰 𝐢𝐭 𝐰𝐨𝐫𝐤𝐬:⁣

1. 𝘋𝘢𝘵𝘢 𝘗𝘳𝘦𝘱𝘢𝘳𝘢𝘵𝘪𝘰𝘯:⁣

- Users provide a dataset in JSONL format⁣

- Each entry contains:

  - Input data (e.g., a problem or case)

  - Instructions for the model

  - The correct answer (used for grading, not shown to the model during training); see the sketch of one entry below
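
To make the data format concrete, here is a minimal sketch of what one JSONL entry and the file-writing step might look like. The field names and the case text are illustrative assumptions, not OpenAI’s exact RFT schema.

```python
import json

# Illustrative only: these field names are assumptions, not OpenAI's exact
# RFT schema. Each JSONL line pairs an input case and instructions with a
# reference answer, which is used only by the grader, never shown to the model.
examples = [
    {
        "case_report": "Adult patient with progressive hearing loss and "
                       "night blindness; no cardiac findings.",
        "instructions": "List the genes most likely responsible, ranked by "
                        "likelihood, and explain your reasoning.",
        "reference_answer": "USH2A",
    }
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```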

2. 𝘎𝘳𝘢𝘥𝘪𝘯𝘨 𝘚𝘺𝘴𝘵𝘦𝘮:⁣

- Users define or use pre-built “graders”⁣

- Graders score model outputs from 0 to 1⁣

- Scores can be binary or partial credit⁣

- The grading helps reinforce correct reasoning patterns (a toy grader sketch follows this list)
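
Conceptually, a grader is just a function that compares the model’s output with the reference answer and returns a score between 0 and 1. The toy example below (not one of OpenAI’s built-in graders) shows both binary and partial credit:

```python
# Toy grader, not one of OpenAI's built-in graders: exact match earns full
# credit, a mere mention of the reference answer earns partial credit.
def grade(model_output: str, reference_answer: str) -> float:
    predicted = model_output.strip().lower()
    reference = reference_answer.strip().lower()
    if predicted == reference:
        return 1.0   # full credit
    if reference in predicted:
        return 0.5   # partial credit
    return 0.0       # no credit

print(grade("USH2A", "USH2A"))                         # 1.0
print(grade("Top candidates: USH2A, MYO7A", "USH2A"))  # 0.5
print(grade("GJB2", "USH2A"))                          # 0.0
```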

3. 𝘛𝘳𝘢𝘪𝘯𝘪𝘯𝘨 𝘗𝘳𝘰𝘤𝘦𝘴𝘴:⁣

- When the model sees a problem, it’s given space to think⁣

- The model’s answer is graded⁣

- Reinforcement learning algorithms then:

  - Reinforce thinking patterns that led to correct answers

  - Disincentivize patterns that led to incorrect answers (a conceptual sketch of this loop follows below)
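
OpenAI has not published the exact algorithm behind RFT, so the snippet below is only a conceptual sketch under the assumption of a simple REINFORCE-style update: sample an answer, grade it, and shift probability toward answers the grader rewards. It is not OpenAI’s training code.

```python
import math
import random

# Conceptual sketch only (an assumption, not OpenAI's actual training code):
# a softmax "policy" over two candidate answers is nudged by a REINFORCE-style
# update so that answers the grader rewards become more likely.
candidates = ["USH2A", "WRONG_GENE"]
logits = [0.0, 0.0]
reference = "USH2A"
lr = 0.5

def grade(answer: str) -> float:
    return 1.0 if answer == reference else 0.0  # binary grader

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

for step in range(100):
    probs = softmax(logits)
    idx = random.choices(range(len(candidates)), weights=probs)[0]
    reward = grade(candidates[idx])

    # REINFORCE: d(log prob of sampled answer) / d(logits) = one_hot - probs.
    for j in range(len(logits)):
        grad_log_prob = (1.0 if j == idx else 0.0) - probs[j]
        logits[j] += lr * reward * grad_log_prob

print(softmax(logits))  # probability mass has shifted toward the correct answer
```

In the real system this happens over an entire chain of thought produced by an O1 model, but the principle is the same: reasoning that leads to well-graded answers gets reinforced.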

𝐊𝐞𝐲 𝐁𝐞𝐧𝐞𝐟𝐢𝐭𝐬:⁣

- Can achieve results with very small datasets (as few as dozens of examples)⁣

- Allows models to learn new reasoning capabilities, not just mimicry⁣

- Can make smaller models perform better than larger base models⁣

- Works especially well for tasks requiring deep expertise⁣⁣

The example in the demo was genetic disease diagnosis:

𝘐𝘯𝘱𝘶𝘵:⁣

- Patient case reports containing:

  - Symptoms present

  - Symptoms absent

  - Patient history

𝘛𝘢𝘴𝘬:⁣

- Identify genes potentially responsible for genetic diseases⁣

- Rank genes by likelihood⁣

- Provide reasoning for selections (a sketch of how such a ranking could be graded follows below)
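
Because the task asks for a ranked list, one natural way to grade it is partial credit based on where the correct gene lands in the ranking. The formula below is an illustrative assumption, not necessarily the grader used in the demo.

```python
# Hypothetical rank-based grader for this task (an illustrative assumption,
# not necessarily the grader OpenAI used in the demo): full credit if the
# correct gene is ranked first, decaying partial credit otherwise.
def grade_ranking(ranked_genes: list[str], correct_gene: str) -> float:
    if correct_gene not in ranked_genes:
        return 0.0
    rank = ranked_genes.index(correct_gene) + 1  # 1-based position
    return 1.0 / rank                            # 1.0, 0.5, 0.33, ...

print(grade_ranking(["USH2A", "MYO7A", "GJB2"], "USH2A"))  # 1.0
print(grade_ranking(["MYO7A", "USH2A", "GJB2"], "USH2A"))  # 0.5
print(grade_ranking(["MYO7A", "GJB2"], "USH2A"))           # 0.0
```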

𝘙𝘦𝘴𝘶𝘭𝘵𝘴 𝘴𝘩𝘰𝘸𝘯 𝘪𝘯 𝘵𝘩𝘦 𝘥𝘦𝘮𝘰:⁣

- Base O1-mini: 17.7% accuracy⁣

- Base O1: 25% accuracy⁣

- RFT-trained O1-mini: 31% accuracy⁣⁣

The impressive part is that after reinforcement fine-tuning, the smaller model (O1-mini) outperformed the larger base model (O1).

This technique represents a significant advancement in model customization, allowing organizations to create highly specialized AI models for complex domain-specific tasks while requiring relatively small amounts of training data.

OpenAI’s RFT research program: https://openai.com/form/rft-research-program/


Written by Isaac Kargar

Co-Founder and Chief AI Officer @ Resoniks | Ph.D. candidate at the Intelligent Robotics Group at Aalto University | https://kargarisaac.github.io/
