New paper released by WizardLM
New framework uses RL to dynamically choose reasoning structures, cutting token waste by matching task type.
WizardLM has released a new research paper, 'Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models,' challenging the prevailing assumption that simply forcing longer Chain-of-Thought (CoT) reasoning is the optimal path to better LLM-as-a-Judge systems. The research identifies a fundamental flaw in the one-size-fits-all length-scaling approach: real-world evaluation tasks fall into two distinct categories. Subjective preference tasks (like chat) require broad, multi-dimensional analysis (B-CoT), while objective correctness tasks (like math) demand deep, step-by-step verification (D-CoT). Forcing depth-style reasoning onto a chat task, or breadth-style reasoning onto a math problem, tends to either accumulate noise or miss critical flaws.
The paper's solution is Mix-GRM, a framework that equips a Generative Reward Model with both Breadth and Depth reasoning capabilities. The key breakthrough came through training with Reinforcement Learning from Verdict Reward (RLVR), using only final judgment supervision without any explicit routing labels. Remarkably, the model autonomously learned to 'polarize' its reasoning, achieving 95% structural alignment by dynamically selecting the appropriate Breadth or Depth pathway based on the task. This compute-efficient design matches or surpasses the performance of token-heavy baselines like Self-Consistency while keeping token consumption in the same order of magnitude as standard single-pass reasoning, presenting a smarter, more adaptable architecture for AI evaluation.
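To make the training signal concrete: verdict-only supervision means the reward depends solely on the model's final judgment, never on which reasoning pathway it took. A minimal sketch of what such a reward function could look like (the verdict format and function names here are illustrative assumptions, not taken from the paper):

```python
def verdict_reward(model_output: str, gold_verdict: str) -> float:
    """Hypothetical verdict-only reward: score the final judgment and
    nothing else. No term rewards or penalizes the structure (Breadth
    vs. Depth) or length of the reasoning that preceded it."""
    predicted = None
    # Assume the judge ends its output with a line like "Verdict: A".
    for line in reversed(model_output.strip().splitlines()):
        if line.lower().startswith("verdict:"):
            predicted = line.split(":", 1)[1].strip().lower()
            break
    return 1.0 if predicted == gold_verdict.strip().lower() else 0.0

# A Breadth-style evaluation of a chat response, judged only on its verdict:
output = (
    "Helpfulness: response A is more actionable.\n"
    "Tone: both are acceptable.\n"
    "Verdict: A"
)
print(verdict_reward(output, "A"))  # 1.0
```

Because nothing in this signal prescribes a reasoning template, any consistent routing behavior the model develops under RL is emergent rather than supervised, which is what makes the reported polarization notable.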
- Mix-GRM framework uses B-CoT for subjective tasks and D-CoT for objective ones, achieving 95% structural alignment via RL training.
- Model autonomously learns to route tasks without explicit labels, cutting wasteful token use vs. length-scaling methods like Self-Consistency.
- Maintains high performance while keeping token consumption similar to standard single-pass reasoning, offering a more efficient evaluation paradigm.
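The 95% structural-alignment figure implies some way of checking whether the chosen reasoning structure matches the task type. A rough sketch of how such a metric could be computed, assuming a simple surface heuristic to classify each trace (the classifier and all names here are hypothetical, not the paper's method):

```python
def reasoning_structure(output: str) -> str:
    # Hypothetical heuristic: Depth-style verification tends to proceed in
    # numbered steps, while Breadth-style evaluation lists parallel dimensions.
    lines = [line.strip().lower() for line in output.splitlines()]
    return "D-CoT" if any(line.startswith("step") for line in lines) else "B-CoT"

def structural_alignment(samples) -> float:
    # samples: (task_type, model_output) pairs, where "subjective" tasks
    # are expected to get B-CoT and "objective" tasks D-CoT.
    expected = {"subjective": "B-CoT", "objective": "D-CoT"}
    hits = sum(reasoning_structure(out) == expected[task] for task, out in samples)
    return hits / len(samples)

samples = [
    ("subjective", "Helpfulness: clear.\nTone: polite.\nVerdict: A"),
    ("objective", "Step 1: expand the product.\nStep 2: compare.\nVerdict: correct"),
]
print(structural_alignment(samples))  # 1.0
```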
Why It Matters
Provides a smarter, more efficient architecture for AI evaluation, reducing computational waste and improving accuracy across diverse task types.