Research & Papers

Gemini 1.5 Pro tops LLM safety data extraction benchmark at 84% accuracy

Best model still falls short of the 90% threshold needed for real-world industrial use.

Deep Dive

A new benchmarking study on arXiv (Grill et al., April 2026) evaluates how well four state-of-the-art LLMs (Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B) extract structured safety data from Safety Data Sheets (SDS). The researchers tested text-only and multimodal (text plus images) pipelines with zero-shot, few-shot, and chain-of-thought (CoT) prompting across more than 50,000 data fields.
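
To make the three prompting strategies concrete, here is a minimal Python sketch of how a text-only extraction prompt might be assembled for each one. It is an illustration, not the prompts used in the study: the field names in FIELDS, the wording of the instructions, and the build_prompt helper are all assumptions.

```python
from enum import Enum
from typing import Sequence, Tuple


class Strategy(Enum):
    ZERO_SHOT = "zero-shot"
    FEW_SHOT = "few-shot"
    COT = "chain-of-thought"


# Hypothetical target fields; the study's actual schema is not given in this summary.
FIELDS = ("product_name", "cas_number", "signal_word")


def build_prompt(
    sds_text: str,
    strategy: Strategy,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a text-only extraction prompt for one SDS document."""
    parts = [
        "Extract the following fields from the Safety Data Sheet as JSON: "
        + ", ".join(FIELDS)
        + "."
    ]
    if strategy is Strategy.FEW_SHOT:
        # Few-shot: show worked (document, answer) pairs before the real input.
        for example_text, example_json in examples:
            parts.append(
                f"Example SDS:\n{example_text}\nExample output:\n{example_json}"
            )
    elif strategy is Strategy.COT:
        # Chain-of-thought: ask for intermediate reasoning before the answer.
        parts.append(
            "First reason step by step about where each field appears, "
            "then give the final JSON."
        )
    # Zero-shot adds nothing beyond the task instruction and the document.
    parts.append(f"SDS:\n{sds_text}")
    return "\n\n".join(parts)
```

The resulting string would be sent to whichever model is under test; the model call itself is left out because it depends on the provider.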

Results show that text-based extraction consistently outperforms multimodal processing on accuracy, latency, and cost. Gemini 1.5 Pro combined with CoT achieved the highest accuracy at 84%, followed by GPT-4o (81%) and Claude 3.7 Sonnet (79%). None of the models, however, reached the 90% accuracy threshold considered necessary for reliable real-world deployment in safety-critical contexts. The authors conclude that general-purpose LLMs are not yet robust enough for unsupervised industrial use, but that task-specific fine-tuning and human-in-the-loop verification could bridge the gap.
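
As a rough illustration of how a field-level accuracy figure can be checked against the 90% bar, the Python sketch below scores one extracted record against a reference record. The exact-match scoring rule, the normalization step, and the function names are assumptions; this summary does not describe how the study scored individual fields.

```python
from typing import Mapping

# The 90% threshold comes from the study summary; everything else is illustrative.
DEPLOYMENT_THRESHOLD = 0.90


def _normalize(value: str) -> str:
    """Collapse whitespace and ignore case so trivial formatting differences do not count as errors."""
    return " ".join(value.split()).casefold()


def field_accuracy(predicted: Mapping[str, str], reference: Mapping[str, str]) -> float:
    """Fraction of reference fields whose predicted value matches after normalization."""
    if not reference:
        raise ValueError("reference must contain at least one field")
    correct = sum(
        1
        for name, expected in reference.items()
        # A missing field is treated as an empty prediction.
        if _normalize(predicted.get(name, "")) == _normalize(expected)
    )
    return correct / len(reference)


def meets_threshold(accuracy: float, threshold: float = DEPLOYMENT_THRESHOLD) -> bool:
    """True when accuracy is at or above the deployment threshold."""
    return accuracy >= threshold
```

Under this rule, the best reported result of 84% gives meets_threshold(0.84) == False, which is why the authors point to human-in-the-loop verification for fields the model gets wrong.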

Key Points
  • Gemini 1.5 Pro with chain-of-thought prompting achieved the highest accuracy, 84%, on SDS data extraction.
  • Text-based extraction outperformed multimodal processing on accuracy, latency, and cost across more than 50,000 data fields.
  • No model reached the 90% accuracy threshold required for reliable unsupervised industrial deployment.

Why It Matters

The study shows that even top LLMs need domain adaptation, such as task-specific fine-tuning, before they can automate safety-critical document processing.
