Research & Papers

How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data

New method fixes the 'style gap' that makes AI fine-tuning fail, turning performance drops into gains of up to 11%.

Deep Dive

A research team has introduced TESSY, a novel framework that solves a critical flaw in fine-tuning smaller AI reasoning models. The standard approach—using synthetic data from a stronger 'teacher' model like GPT-OSS-120B to train a 'student' like Qwen3-8B—often backfires, causing performance to drop. The paper identifies a 'stylistic divergence' between the teacher's complex outputs and the student's simpler distribution as the culprit. TESSY bridges this gap by having the teacher and student models collaborate, alternately generating tokens to create training data that retains the teacher's advanced reasoning but matches the student's style.
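The alternating-generation idea can be sketched as a toy decoding loop. This is a minimal illustration under stated assumptions, not the paper's implementation: `make_model`, the fixed `segment` length, and the simple turn-taking rule are all hypothetical stand-ins for TESSY's actual models and switching criterion.

```python
def make_model(vocab):
    """Toy stand-in for a language model: continues a prefix by emitting
    tokens from its own vocabulary (its 'style'). Hypothetical, for
    illustration only."""
    def generate(prefix, n_tokens):
        # Deterministic toy rule: pick tokens based on current prefix length.
        return [vocab[(len(prefix) + i) % len(vocab)] for i in range(n_tokens)]
    return generate

def cooperative_synthesize(teacher, student, prompt, total_tokens, segment=4):
    """Alternate generation between teacher and student: the teacher
    contributes reasoning segments, and the student continues them in its
    own distribution, yielding one style-consistent training trace.
    (Assumption: fixed-length turn-taking; the real switching rule in
    TESSY may differ.)"""
    trace = list(prompt)
    use_teacher = True
    while len(trace) - len(prompt) < total_tokens:
        model = teacher if use_teacher else student
        remaining = total_tokens - (len(trace) - len(prompt))
        trace.extend(model(trace, min(segment, remaining)))
        use_teacher = not use_teacher  # hand control to the other model
    return trace

# Usage: teacher and student 'speak' in visibly different token styles,
# and the synthesized trace interleaves both.
teacher = make_model(["T0", "T1", "T2"])
student = make_model(["s0", "s1"])
trace = cooperative_synthesize(teacher, student, ["<p>"], 8, segment=2)
```

The point of the sketch is only the control flow: the student's own continuations are woven into the teacher's reasoning so that the resulting SFT data stays inside the student's distribution.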

In code generation experiments, the results were stark. Fine-tuning Qwen3-8B on standard teacher-generated data led to significant performance regressions: a 3.25% drop on LiveCodeBench-Pro and a 10.02% drop on OJBench. In contrast, using TESSY's collaboratively synthesized data flipped these losses into substantial gains, achieving improvements of 11.25% and 6.68% on the same benchmarks. This demonstrates that data quality and stylistic alignment are more important than raw data quantity for effective supervised fine-tuning (SFT).

The TESSY framework represents a significant shift in how we think about knowledge distillation for reasoning models. It moves beyond simple imitation learning to a cooperative synthesis process, ensuring the training data is pedagogically appropriate for the student model's current capabilities. This method could unlock more efficient fine-tuning for a wide range of specialized, smaller models, making advanced reasoning more accessible without the performance penalty previously seen.

Key Points
  • Solves the 'style gap': Standard fine-tuning with a stronger teacher's data causes performance drops (e.g., -10.02% on OJBench) due to stylistic mismatch.
  • TESSY's cooperative synthesis: Alternates token generation between teacher (GPT-OSS-120B) and student (Qwen3-8B) to create style-consistent training data.
  • Delivers major gains: Flips performance losses into substantial improvements (+11.25% on LiveCodeBench-Pro, +6.68% on OJBench) for code generation tasks.

Why It Matters

Enables efficient fine-tuning of smaller, specialized AI models without the performance degradation that has plagued standard knowledge distillation techniques.