
LLMScholarBench: New Benchmark Audits LLM Scholar Recommendations

Tests of 22 LLMs across 9 metrics reveal hidden trade-offs in expert recommendations.

Deep Dive

Researchers Lisette Espín-Noboa and Gonzalo Gabriel Méndez have released LLMScholarBench, a comprehensive benchmark for auditing how large language models (LLMs) recommend academic experts. The work, accepted at KDD '26, addresses a critical gap: existing audits evaluate recommendations in isolation, ignoring end-user interventions like temperature settings, constrained prompting, or retrieval-augmented generation (RAG). The benchmark jointly evaluates model infrastructure and user-level tweaks across 22 LLMs, measuring technical quality (validity, consistency, factuality) and social representation (diversity, parity) using 9 distinct metrics.
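To make the two metric families concrete, here is a minimal sketch of one metric from each. The definitions are assumptions chosen for illustration and are not taken from the paper: consistency is approximated as the mean pairwise Jaccard overlap between repeated runs of the same prompt, and diversity as the normalized Shannon entropy of a group label attached to each recommended scholar. The function names and the labels in the usage note are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def jaccard(a, b):
    """Jaccard overlap between two lists of recommended names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty answers are treated as identical
    return len(a & b) / len(a | b)


def consistency(runs):
    """Mean pairwise Jaccard overlap across repeated runs (assumed definition)."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def normalized_entropy(labels):
    """Shannon entropy of group labels, scaled to [0, 1] (assumed diversity proxy)."""
    counts = Counter(labels)
    n = len(labels)
    k = len(counts)
    if k <= 1:
        return 0.0  # zero or one observed group: no diversity
    h = -sum((c / n) * log(c / n) for c in counts.values())
    return h / log(k)  # normalized by the number of groups actually observed
```

For example, `consistency([["A", "B"], ["A", "C"]])` compares two runs that share one of three distinct names, and `normalized_entropy(["g1", "g1", "g2", "g2"])` scores an evenly split list. A real audit would also need validity and factuality checks against a ground-truth scholar database, which this sketch omits.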

When instantiated in physics expert recommendation, the benchmark reveals clear trade-offs. Higher temperature reduces validity, consistency, and factuality. Representation-constrained prompting improves diversity but hurts factuality. RAG via web search primarily boosts technical quality while reducing diversity and parity. The key takeaway: end-user interventions reshape trade-offs rather than providing uniform improvements. LLMScholarBench makes these dynamics auditable, helping developers and organizations choose the right model and settings for fair, accurate scholar recommendations.
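The idea that an intervention reshapes trade-offs instead of improving everything can be checked mechanically by comparing per-metric scores before and after the intervention. The sketch below assumes every metric is scaled so that higher is better; the function names are hypothetical and the scores in the usage note are made-up placeholders, not results from the benchmark.

```python
def metric_deltas(baseline, intervention):
    """Per-metric change (intervention minus baseline) for metrics present in both."""
    return {m: intervention[m] - baseline[m] for m in baseline if m in intervention}


def classify_tradeoff(deltas, tol=1e-9):
    """Split metrics into improved and worsened; flag a trade-off if both are non-empty.

    Assumes higher is better for every metric. `tol` ignores negligible changes.
    """
    improved = sorted(m for m, d in deltas.items() if d > tol)
    worsened = sorted(m for m, d in deltas.items() if d < -tol)
    return bool(improved) and bool(worsened), improved, worsened


# Placeholder scores for illustration only (NOT values reported by the benchmark).
toy_baseline = {"factuality": 0.70, "diversity": 0.40}
toy_with_rag = {"factuality": 0.80, "diversity": 0.30}

flag, improved, worsened = classify_tradeoff(metric_deltas(toy_baseline, toy_with_rag))
```

With these placeholder numbers, `flag` is `True`, `improved` is `["factuality"]`, and `worsened` is `["diversity"]`, which mirrors the direction of the RAG pattern described above without reproducing its magnitude.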

Key Points
  • LLMScholarBench evaluates 22 LLMs across 9 metrics covering technical quality and social representation.
  • Higher temperature reduces validity, consistency, and factuality; representation-constrained prompting improves diversity at the cost of factuality.
  • RAG boosts technical quality but reduces diversity and parity, showing no free lunch in intervention design.

Why It Matters

Organizations using LLMs for expert matching can now audit bias and accuracy trade-offs with a standardized benchmark.