Open Source

Neo AI Engineer evaluation reveals RAG tweaks boost quality 19% and cut costs 79%

Most expensive model performed worst; retrieval & deduplication mattered more.

Deep Dive

A common RAG pitfall is a retrieval problem disguised as an LLM problem. The chatbot's similarity threshold (0.7 cosine) was too strict and returned zero documents for casual queries. Logging the retrieved context showed that nothing had been retrieved, so the weak answers were a retrieval failure rather than a model flaw. Heuristic keyword evaluators gave false confidence; switching to an LLM judge (Claude Haiku via OpenRouter) that scores relevance and accuracy, at a few cents per run, provided a real signal.
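The failure mode above is easier to spot when the retriever logs how many documents survive the threshold. The following is a minimal sketch, not the evaluated bot's code: the function names, the `(text, vector)` document layout, and the reading of 0.7 as a minimum cosine similarity are all assumptions made for illustration.

```python
import logging
import math

logger = logging.getLogger("rag.retrieval")  # hypothetical logger name


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, docs, min_similarity=0.7, top_k=3):
    """Return up to top_k (score, text) pairs scoring at least min_similarity.

    docs is a list of (text, vector) pairs. Treating 0.7 as a minimum
    similarity is an assumption for this sketch.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    kept = [(score, text) for score, text in scored if score >= min_similarity]
    kept = kept[:top_k]

    best = scored[0][0] if scored else None
    logger.info(
        "retrieved %d of %d docs (min_similarity=%.2f, best_score=%s)",
        len(kept), len(docs), min_similarity, best,
    )
    if not kept:
        # An empty context points at retrieval, not at the model.
        logger.warning("empty context: no document passed the threshold")
    return kept
```

With a log line like this in place, a run of empty-context warnings on casual queries suggests loosening the threshold before swapping the model.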

Deduplicating chunks with more than 80% token overlap cleaned the context, eliminating hallucination. Stricter grounding (answering only with facts from the docs) improved accuracy but reduced helpfulness on knowledge gaps, a deliberate design choice. In the model sweep, Gemma 4 26B outperformed Gemini 3.1 Flash Lite Preview (7.88 vs 7.33) at 75% lower cost. The end result was a 19% quality gain and a 79% cost reduction, achieved through systematic evaluation with Neo AI Engineer.
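The deduplication step can be sketched as follows. The source does not say how token overlap was measured, so this example assumes whitespace tokens, lowercased, with overlap defined as shared tokens divided by the size of the smaller chunk; `token_overlap` and `dedupe_chunks` are hypothetical names.

```python
def token_overlap(a, b):
    """Fraction of the smaller chunk's unique tokens that also appear in the other.

    Assumption: whitespace tokenization on lowercased text; the source
    does not specify the tokenizer or the exact overlap formula.
    """
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / min(len(tokens_a), len(tokens_b))


def dedupe_chunks(chunks, max_overlap=0.8):
    """Keep chunks in order, dropping any that overlap a kept chunk by more than max_overlap."""
    kept = []
    for chunk in chunks:
        if all(token_overlap(chunk, earlier) <= max_overlap for earlier in kept):
            kept.append(chunk)
    return kept
```

Because chunks are processed in order, passing them sorted by retrieval score keeps the highest-ranked copy of each near-duplicate.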

Key Points
  • Retrieval similarity threshold (0.7 cosine) blocked casual queries; logging context revealed zero docs returned, not an LLM issue.
  • LLM judge (Claude Haiku) gave meaningful scores for a few cents per run, unlike misleading keyword matching.
  • Model sweep: Gemma 4 26B scored 7.88 vs original 7.33, costing 75% less per session.
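
An LLM judge of the kind described above can be sketched without committing to a particular client library. In this example the prompt wording, the 1 to 10 scale, the JSON reply format, and the `call_llm` callable (any function that takes a prompt string and returns the model's text) are assumptions for illustration, not the setup used in the evaluation.

```python
import json
import re

# Hypothetical judge prompt; the 1-10 scale and JSON reply are assumptions.
JUDGE_PROMPT = """You are grading a RAG chatbot answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Score relevance and accuracy from 1 to 10.
Reply with JSON only: {{"relevance": <int>, "accuracy": <int>}}"""


def judge_answer(question, context, answer, call_llm):
    """Ask a judge model for relevance and accuracy scores.

    call_llm is any callable mapping a prompt string to the model's reply
    text, for example a thin wrapper around a hosted chat API.
    """
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    reply = call_llm(prompt)

    # Tolerate extra prose around the JSON object in the reply.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("judge reply contained no JSON object")
    scores = json.loads(match.group(0))
    return {
        "relevance": float(scores["relevance"]),
        "accuracy": float(scores["accuracy"]),
    }
```

Injecting `call_llm` keeps the scoring logic testable offline with a stub, and lets the judge model be swapped without touching the parser.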

Why It Matters

Practical RAG improvements (better retrieval, evaluation, and model selection) delivered a 19% quality gain and a 79% cost reduction.