Research & Papers

Assessing LLM Reliability on Temporally Recent Open-Domain Questions

A 90-plus-point gap between semantic and lexical scores exposes a flaw in how we evaluate LLMs.

Deep Dive

A new benchmark, RECOM, tested four open-source LLMs on 15,000 recent Reddit questions. Researchers identified a 'semantic-lexical paradox': models achieved over 99% semantic similarity with human answers but under 8% BLEU score for word overlap. This gap of more than 90 percentage points suggests the models paraphrase heavily while preserving meaning. Surprisingly, model scale didn't predict performance: the 7B-parameter Mistral beat a 20B model. The findings challenge reliance on surface-level lexical metrics like BLEU for judging answer quality.
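The paradox hinges on how BLEU works: it rewards exact n-gram overlap, so a faithful paraphrase can score near zero. Below is a toy, from-scratch sentence-level BLEU (simplified, no smoothing; not the paper's actual evaluation pipeline, and the example sentences are invented) that shows the effect:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Toy sentence-level BLEU: geometric mean of n-gram precisions
    times a brevity penalty. No smoothing, single reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped overlap: each candidate n-gram counts at most as
        # often as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision collapses the geometric mean
        log_precisions.append(math.log(overlap / total))
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "You should update the firmware before pairing the headset"
paraphrase = "Make sure the headset firmware is upgraded prior to pairing it"

print(f"BLEU: {bleu(paraphrase, reference):.3f}")  # → 0.000
```

The paraphrase shares almost no trigrams with the reference, so BLEU collapses to zero even though a human (or an embedding-based semantic-similarity metric, which compares meaning vectors rather than surface tokens) would rate the two answers as equivalent.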

Why It Matters

This forces a rethink of AI evaluation, suggesting that benchmarks built on surface-level lexical overlap may be deeply misleading about true understanding.