[P] NanoJudge: Instead of prompting a big LLM once, it prompts a tiny LLM thousands of times.
Open-source tool replaces single complex prompts with thousands of simple 1v1 matchups for mathematically rigorous rankings.
A new open-source tool called NanoJudge fundamentally rethinks how large language models handle complex ranking tasks. Instead of asking a single LLM to process and rank hundreds or thousands of items in one go—an approach prone to hallucinations, context loss, and clichéd outputs—NanoJudge breaks the problem down into thousands of simple, focused 1v1 comparisons. Built as a pure-computation Rust engine, it connects to any OpenAI-compatible local API (like vLLM or Ollama) and runs exhaustive pairwise tournaments, asking the model to choose between just two options at a time. This method transforms an intractable ranking problem into a series of manageable micro-decisions.
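A minimal sketch of what one such matchup could look like in Rust against a local OpenAI-compatible endpoint. The endpoint URL, model name, prompt wording, and the `judge_pair` helper are illustrative assumptions, not NanoJudge's actual internals:

```rust
// Hypothetical sketch of a single 1v1 comparison against a local
// OpenAI-compatible server (Ollama shown; vLLM works the same way).
// deps: reqwest = { version = "0.12", features = ["blocking", "json"] },
//       serde_json = "1"
use serde_json::json;

fn judge_pair(
    client: &reqwest::blocking::Client,
    item_a: &str,
    item_b: &str,
) -> Result<serde_json::Value, reqwest::Error> {
    let body = json!({
        "model": "qwen2.5:0.5b",   // any tiny local model
        "max_tokens": 1,           // only the answer token is needed
        "logprobs": true,          // request raw token logprobs
        "top_logprobs": 5,
        "messages": [{
            "role": "user",
            "content": format!(
                "Which item is better?\nA: {item_a}\nB: {item_b}\nAnswer with A or B."
            )
        }]
    });
    client
        .post("http://localhost:11434/v1/chat/completions")
        .json(&body)
        .send()?
        .json() // raw response; the answer-token logprobs live in `choices`
}
```

Each call is tiny and independent, which is what lets the engine fire off thousands of them while keeping every prompt focused on exactly two options.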
The technical sophistication lies in how NanoJudge compiles these thousands of individual judgments. It uses Bradley-Terry scoring and Bayesian Markov chain Monte Carlo (MCMC) sampling to produce a mathematically rigorous final leaderboard complete with confidence intervals. The engine includes critical optimizations: it extracts win probabilities from raw token logprobs instead of parsing text, employs a Gaussian Gibbs sampler to isolate and subtract positional bias (LLMs tend to favor the first option), and uses a top-heavy matchmaking algorithm that avoids O(n²) comparisons by focusing compute on high-information matchups between top contenders. This lets it rank tens of thousands of items, and because each comparison uses a tiny context window, users can inject entire documents (like a game's Wikipedia page) as context for informed, retrieval-augmented decisions. The developer is already applying it to build an ML research assistant that can compare thousands of arXiv papers.
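The scoring step is easiest to see in miniature. In the sketch below, the additive logit form of the bias correction and both helper functions are illustrative assumptions rather than the engine's code; it shows how two answer-token logprobs become a soft outcome and how Bradley-Terry with a positional-bias term predicts a matchup:

```rust
// Bradley-Terry: P(first beats second) = sigmoid(s_first - s_second),
// with an additive positional-bias offset on the logit (hypothetical
// form; the project uses a Gaussian Gibbs sampler to estimate it).
fn bt_win_prob(score_first: f64, score_second: f64, position_bias: f64) -> f64 {
    let logit = (score_first - score_second) + position_bias;
    1.0 / (1.0 + (-logit).exp())
}

// Soft outcome from raw token logprobs instead of parsed text:
// a 2-way softmax over the logprobs of the "A" and "B" answer tokens.
fn outcome_from_logprobs(logprob_a: f64, logprob_b: f64) -> f64 {
    1.0 / (1.0 + (logprob_b - logprob_a).exp())
}

fn main() {
    // e.g. the model put logprob -0.2 on "A" and -1.8 on "B"
    println!("soft win for A: {:.3}", outcome_from_logprobs(-0.2, -1.8)); // ~0.832
    println!("BT prediction:  {:.3}", bt_win_prob(1.0, 0.4, 0.15));       // ~0.679
}
```

MCMC then draws posterior samples of the scores under this likelihood, and the spread of those samples is what supplies the leaderboard's confidence intervals.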
- Replaces single complex LLM prompts with thousands of simple 1v1 comparisons via a Rust engine and local APIs
- Uses Bradley-Terry scoring and Bayesian MCMC to compile results into leaderboards with confidence intervals, correcting for positional bias
- Enables ranking of tens of thousands of items with full document context per comparison, optimizing compute with a top-heavy matchmaking algorithm (see the sketch after this list)
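The top-heavy matchmaking mentioned above can be pictured as rank-weighted sampling: spend most comparisons near the top of the current standings, where an extra matchup carries the most information. The geometric weighting below is an illustrative assumption, not the tool's documented scheme:

```rust
// Hypothetical top-heavy matchmaker: pick pairs with probability decaying
// geometrically in current rank, keeping total comparisons far below the
// n(n-1)/2 of an exhaustive O(n^2) schedule.
// deps: rand = "0.8"
use rand::Rng;

/// Sample a rank in 0..n with weight decay^rank, where 0 < decay < 1
/// (inverse-CDF of a truncated geometric distribution).
fn sample_rank(rng: &mut impl Rng, n: usize, decay: f64) -> usize {
    let u: f64 = rng.gen();
    let total = 1.0 - decay.powi(n as i32);
    let idx = ((1.0 - u * total).ln() / decay.ln()).floor() as usize;
    idx.min(n - 1)
}

/// One matchup (i, j), i != j, biased toward the top of `ranking`
/// (item ids sorted best-first; assumes at least two items).
fn next_matchup(rng: &mut impl Rng, ranking: &[usize], decay: f64) -> (usize, usize) {
    let i = sample_rank(rng, ranking.len(), decay);
    loop {
        let j = sample_rank(rng, ranking.len(), decay);
        if j != i {
            return (ranking[i], ranking[j]);
        }
    }
}
```

With `decay` near 1 this approaches uniform pairing; pushing it toward 0 concentrates almost all matchups among the current leaders.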
Why It Matters
Enables reliable, large-scale ranking and comparison tasks, from research to product recommendations, that were previously impractical for a single LLM prompt.