TRACER: Learn-to-Defer for LLM Classification with Formal Teacher-Agreement Guarantees
New open-source tool promises 91.4% coverage while guaranteeing 92% agreement with expensive teacher models.
A new open-source library called TRACER (Trace-Based Adaptive Cost-Efficient Routing) offers a systematic answer to one of the most pressing issues in applied AI: the high cost of running large language models. Developed by researcher Adr-740, the tool lets developers build routing policies that automatically decide when a cheap, local surrogate model can stand in for a costly, powerful 'teacher' LLM on classification tasks. The core innovation is a formal guarantee: users set a target (e.g., 92%) for how often the surrogate's prediction must agree with the teacher's, and TRACER uses conformal prediction techniques to calibrate an 'acceptor gate' on a held-out dataset so that, under the standard exchangeability assumptions of conformal methods, that agreement rate carries over to future traffic. This provides a rigorous, data-driven way to manage the trade-off between cost and accuracy.
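The calibration step can be pictured as choosing a confidence threshold on held-out data such that the surrogate's accepted predictions empirically meet the agreement target. The sketch below is illustrative only: it uses a simple empirical threshold search rather than TRACER's actual API or its conformal finite-sample machinery, and the function name and signature are assumptions.

```python
import numpy as np

def calibrate_acceptor_gate(confidences, agrees, target=0.92):
    """Pick the lowest confidence threshold whose accepted set still
    meets the target teacher-agreement rate on held-out data.

    confidences[i]: surrogate confidence on held-out example i
    agrees[i]:      1 if the surrogate matched the teacher, else 0
    Returns (threshold, coverage); (None, 0.0) means defer everything.
    """
    order = np.argsort(-confidences)            # most confident first
    sorted_agrees = agrees[order]
    # Agreement rate of every "accept the top-k most confident" prefix.
    cum_agree = np.cumsum(sorted_agrees) / np.arange(1, len(agrees) + 1)
    ok = np.where(cum_agree >= target)[0]       # prefixes meeting the target
    if len(ok) == 0:
        return None, 0.0
    k = ok[-1]                                  # largest qualifying prefix
    threshold = confidences[order][k]
    coverage = (k + 1) / len(agrees)
    return threshold, coverage
```

At serve time, a query whose surrogate confidence clears the returned threshold is answered locally; everything else is deferred to the teacher. Coverage is then the fraction of traffic the cheap model handles.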
TRACER implements three distinct pipeline architectures for developers to choose from. The 'Global' policy accepts every surrogate prediction, 'L2D' (Learn-to-Defer) pairs a surrogate with a gating mechanism, and 'RSB' (Residual Surrogate Boosting) runs a two-stage cascade for harder cases. The library ships a model zoo with options such as logistic regression, decision trees, and gradient-boosted trees (XGBoost) to serve as the surrogate or the gate. In a benchmark on the Banking77 intent classification dataset using BGE-M3 embeddings, the system automatically selected the L2D pipeline and achieved 91.4% coverage (the share of traffic it could safely route to the cheap model) while formally guaranteeing 92% agreement with the teacher LLM and maintaining a high 96.4% end-to-end macro-F1 score. A paper detailing the methodology is currently in progress.
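To make the three policies concrete, here is a hypothetical routing layer showing how Global, L2D, and RSB differ in when they fall back to the expensive teacher. All class, function, and parameter names are illustrative assumptions, not TRACER's real interface.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Router:
    surrogate: Callable[[Any], tuple[str, float]]  # -> (label, confidence)
    teacher: Callable[[Any], str]                  # expensive LLM call
    threshold: float                               # calibrated acceptor gate

    def route_global(self, x):
        # Global: always trust the cheap surrogate.
        label, _ = self.surrogate(x)
        return label

    def route_l2d(self, x):
        # L2D: accept the surrogate only above the calibrated gate;
        # otherwise defer the query to the teacher.
        label, conf = self.surrogate(x)
        return label if conf >= self.threshold else self.teacher(x)

    def route_rsb(self, x, residual):
        # RSB: a second-stage residual model takes the cases the first
        # surrogate is unsure about; the teacher is the last resort.
        label, conf = self.surrogate(x)
        if conf >= self.threshold:
            return label
        label2, conf2 = residual(x)
        return label2 if conf2 >= self.threshold else self.teacher(x)
```

The 91.4%-coverage result reported for Banking77 corresponds to the L2D path: roughly nine in ten queries clear the gate and never touch the teacher.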
- Provides formal, calibrated guarantees that a cheap surrogate model agrees with an expensive LLM (teacher) a user-defined percentage of the time (e.g., 92%).
- Achieved 91.4% coverage on Banking77 dataset, routing most queries to cheap models while maintaining 96.4% macro-F1 score.
- Offers three pipeline families (Global, L2D, RSB) and a model zoo including XGBoost and MLPs for flexible deployment strategies.
Why It Matters
Enables companies to drastically reduce LLM inference costs for classification tasks with mathematically proven accuracy safeguards, moving cost optimization from guesswork to engineering.