Assessing LLM Reliability on Temporally Recent Open-Domain Questions
A massive 90+ point gap exposes a critical flaw in how we test AI.
A new benchmark, RECOM, tested four open-source LLMs on 15,000 recent Reddit questions. Researchers discovered a 'semantic-lexical paradox': models achieved over 99% semantic similarity with human answers but BLEU scores below 8% for word-level overlap. This 90+ percentage point gap shows the models paraphrase heavily while preserving meaning. Surprisingly, model scale didn't predict performance, with the 7B-parameter Mistral outperforming a 20B model. The findings challenge the reliance on surface-level lexical metrics for evaluating LLM answers.
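To make the paradox concrete, here is a minimal sketch of how the two metrics can diverge on a paraphrased answer. It assumes NLTK's `sentence_bleu` for lexical overlap and a sentence-transformers embedding model (`all-MiniLM-L6-v2`) for semantic similarity; RECOM's exact metric configuration is not specified here, so the model choice, smoothing, and example sentences are illustrative only.

```python
# Illustrative comparison of lexical overlap (BLEU) vs. semantic similarity
# for a reference answer and a heavy paraphrase of it.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer, util

reference = "You should update the firmware before pairing the controller."
candidate = "Make sure the controller's firmware is upgraded prior to pairing it."

# Lexical overlap: BLEU on whitespace-tokenized strings.
# Smoothing avoids zero scores when short sentences share few n-grams.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# Semantic similarity: cosine similarity between sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb_ref, emb_cand = model.encode([reference, candidate], convert_to_tensor=True)
cosine = util.cos_sim(emb_ref, emb_cand).item()

print(f"BLEU: {bleu:.3f}")                   # low: little word-for-word overlap
print(f"Semantic similarity: {cosine:.3f}")  # high: the meaning is preserved
```

On pairs like this, BLEU stays near zero while embedding similarity stays high, which is the pattern the benchmark reports at scale.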
Why It Matters
This forces a rethink of AI evaluation, showing that surface-level benchmark metrics can be deeply misleading about whether a model truly understands.