PRISM: Pushing the Frontier of Deep Think via Process Reward Model-Guided Inference
New algorithm uses process-level scoring to fix AI's biggest reasoning bottleneck, matching larger models.
A team of researchers including Rituraj Sharma and Weiyuan Chen has published a new paper introducing PRISM (Process Reward Model-Guided Inference), an algorithm designed to overcome a critical bottleneck in advanced AI reasoning systems known as Deep Think. These systems generate and refine multiple candidate solutions to complex problems, but they lack reliable feedback signals during the reasoning process itself, so deeper deliberation can amplify errors and suppress correct answers. PRISM addresses this by applying a Process Reward Model (PRM) that scores each individual step of the AI's reasoning, creating an 'energy landscape' that guides the refinement and aggregation of solutions and steers the population of candidate answers toward higher-quality logic.
The technical result is significant: PRISM enabled a 20-billion-parameter model (gpt-oss-20b) to achieve performance competitive with a model six times larger (gpt-oss-120b) on demanding benchmarks, scoring 90.0% on the advanced mathematics tests AIME25, 75.4% on HMMT25, and 71.4% on the graduate-level science benchmark GPQA Diamond. The algorithm works by reshaping the population of candidate solutions through score-guided resampling and stochastic refinement, which concentrates probability on correct reasoning while preserving diversity. As a result, the system can reliably correct its own course during deliberation, even when it starts with few correct candidates, making it a more compute-efficient and accurate method for complex problem-solving.
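The score-guided resampling step described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: it assumes per-step PRM scores are aggregated by a simple mean per candidate (the paper's aggregation and energy definition may differ) and resamples the population in proportion to a softmax over those scores, so stronger candidates are duplicated while weaker ones tend to drop out.

```python
import math
import random

def prm_guided_resample(candidates, prm_step_scores, temperature=1.0, rng=None):
    """Sketch of PRM score-guided resampling over a candidate population.

    candidates: list of candidate solutions (any objects).
    prm_step_scores: list of lists; per-step PRM scores for each candidate.
    temperature: lower values concentrate probability on top candidates.
    """
    rng = rng or random.Random(0)
    # Aggregate per-step scores into one score per candidate.
    # Mean over steps is an illustrative choice, not the paper's rule.
    agg = [sum(steps) / len(steps) for steps in prm_step_scores]
    # Numerically stable softmax over aggregated scores.
    m = max(agg)
    weights = [math.exp((s - m) / temperature) for s in agg]
    # Resample with replacement: high-scoring candidates are likely to be
    # duplicated, low-scoring ones to vanish, while stochastic sampling
    # preserves some diversity in the population.
    return rng.choices(candidates, weights=weights, k=len(candidates))

# Hypothetical usage: candidate "A" has the best step scores, so after
# resampling it tends to dominate the new population.
population = ["A", "B", "C"]
step_scores = [[0.9, 0.8], [0.2, 0.1], [0.5, 0.6]]
new_population = prm_guided_resample(population, step_scores, temperature=0.2)
```

In the full algorithm this resampling alternates with stochastic refinement of each surviving candidate, so the population both concentrates on high-scoring reasoning and continues to explore.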
- PRISM uses a Process Reward Model (PRM) to score each reasoning step, guiding solution refinement and aggregation.
- A 20B parameter model using PRISM matched the performance of a 120B model, scoring 90.0% on AIME25 and 71.4% on GPQA Diamond.
- The algorithm provides consistent 'net-directional correction' during reasoning, fixing a major bottleneck where deeper thinking previously amplified errors.
Why It Matters
Enables smaller, cheaper AI models to perform complex reasoning at the level of much larger models, reducing costs and improving accuracy for math and science.