FineRef: Fine-Grained Error Reflection and Correction for Long-Form Generation with Citations
A new 7B-parameter model outperforms GPT-4 in citation F1 score, addressing a major flaw in AI-generated reports.
A research team has introduced FineRef, a novel framework designed to solve a critical problem in AI-generated long-form content: inaccurate citations. Current large language models (LLMs) like GPT-4 often produce mismatched or irrelevant citations when generating reports, research summaries, or articles that require multiple source attributions. This undermines trust and factual accuracy. FineRef addresses this by explicitly teaching models to self-identify and correct two key error types—citation mismatch and irrelevance—on a per-citation basis.
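The per-citation reflection step can be pictured as labeling each citation attached to a claim as supported, mismatched, or irrelevant, then swapping in a better-supporting passage when needed. The sketch below is a hypothetical illustration, not FineRef's actual implementation: a real system would use an entailment or NLI model to judge support, whereas simple token overlap stands in here, and all function names and thresholds are assumptions.

```python
# Toy sketch of per-citation reflection and correction. Token overlap is a
# stand-in for a learned support classifier; thresholds are illustrative.

def _overlap(claim: str, passage: str) -> float:
    """Fraction of claim tokens that also appear in the passage."""
    claim_tokens = set(claim.lower().split())
    passage_tokens = set(passage.lower().split())
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & passage_tokens) / len(claim_tokens)

def reflect_on_citations(claim: str, cited: dict,
                         support_thresh: float = 0.5,
                         relevance_thresh: float = 0.2) -> dict:
    """Label each cited passage for a single claim, per citation."""
    labels = {}
    for cid, passage in cited.items():
        score = _overlap(claim, passage)
        if score >= support_thresh:
            labels[cid] = "SUPPORTED"
        elif score >= relevance_thresh:
            labels[cid] = "MISMATCH"    # related, but does not back the claim
        else:
            labels[cid] = "IRRELEVANT"  # unrelated to the claim entirely
    return labels

def correct_citation(claim: str, corpus: dict) -> str:
    """Correction step: pick the best-supporting passage id from the pool."""
    return max(corpus, key=lambda cid: _overlap(claim, corpus[cid]))
```

Separating detection (`reflect_on_citations`) from repair (`correct_citation`) mirrors the attempt-reflect-correct pattern the framework instills: the model first flags which citations are faulty and why, then replaces only those.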
The framework employs a sophisticated two-stage training strategy. First, it uses supervised fine-tuning with data constructed by specialized lightweight models to instill an 'attempt-reflect-correct' behavioral pattern in the AI. An online bootstrapping strategy iteratively enriches training data with verified examples to improve generalization. The second stage applies process-level reinforcement learning with a multi-dimensional reward scheme that promotes reflection accuracy, answer quality, and correction gain.
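A multi-dimensional process-level reward of the kind described above might combine the three stated components linearly. The weights and component definitions below are assumptions for illustration only; the paper's exact formulation may differ.

```python
# Hypothetical process-level reward combining the three dimensions named in
# the text: reflection accuracy, answer quality, and correction gain.

def process_reward(reflection_correct: bool,
                   quality_before: float,   # answer quality before correction, in [0, 1]
                   quality_after: float,    # answer quality after correction, in [0, 1]
                   w_reflect: float = 0.3,
                   w_quality: float = 0.4,
                   w_gain: float = 0.3) -> float:
    # Reward accurate self-identification of errors.
    reflection_term = 1.0 if reflection_correct else 0.0
    # Reward the quality of the final (corrected) answer.
    answer_term = quality_after
    # Reward only improvements from the correction step, never regressions.
    correction_gain = max(0.0, quality_after - quality_before)
    return (w_reflect * reflection_term
            + w_quality * answer_term
            + w_gain * correction_gain)
```

Clipping the gain term at zero keeps the policy from being rewarded for corrections that make the answer worse, which is one plausible way to encode "correction gain" as a reward signal.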
On the ALCE (Automatic LLMs' Citation Evaluation) benchmark, the results are striking. The team's 7-billion-parameter FineRef model outperforms the much larger GPT-4 by up to 18% in Citation F1 score—a key metric for citation accuracy—and by 4% in Exact Match (EM) Recall. It also surpasses other state-of-the-art citation-focused models. The system demonstrates strong robustness, maintaining performance even when the retrieved content is noisy or irrelevant, which is common in real-world applications like enterprise search or legal document analysis. This represents a significant step toward more reliable, verifiable AI assistants for research, analysis, and content creation.
- The 7B FineRef model outperforms GPT-4 by up to 18% in Citation F1 score on the ALCE benchmark, a major leap in accuracy.
- Uses a two-stage 'attempt-reflect-correct' training strategy with process-level reinforcement learning for per-citation error correction.
- Demonstrates strong robustness in noisy retrieval scenarios, crucial for real-world applications like enterprise search and legal docs.
Why It Matters
Enables trustworthy, verifiable AI for critical long-form tasks like research summaries, financial reports, and legal document drafting.