Research & Papers

HalalBench: A Multilingual OCR Benchmark for Food Packaging Ingredient Extraction

Every OCR engine evaluated scores F1=0.000 on Japanese ingredient labels.

Deep Dive

Hasan Arief's new paper introduces HalalBench, the first open multilingual benchmark designed specifically for OCR on food packaging ingredient extraction. The benchmark includes 1,043 images (50 real and 993 synthetic) with 36,438 annotations in COCO format, spanning 14 languages. It targets challenges that existing document and scene-text benchmarks miss: curved surfaces, dense multilingual text, and fonts smaller than 8pt.
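Because the annotations are distributed in COCO format, basic dataset statistics can be computed with a few lines of Python. The sketch below is illustrative only: it assumes the standard COCO layout (top-level `images` and `annotations` lists, with each annotation carrying an `image_id` and a `bbox` of `[x, y, width, height]`), and the function names and sample data are hypothetical rather than taken from the HalalBench release.

```python
import json
from collections import Counter


def load_coco(path):
    """Read a COCO-style JSON annotation file (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_coco(coco):
    """Return basic counts for a COCO-style annotation dictionary."""
    annotations = coco["annotations"]
    per_image = Counter(ann["image_id"] for ann in annotations)
    n_images = len(coco["images"])
    return {
        "images": n_images,
        "annotations": len(annotations),
        "mean_per_image": len(annotations) / n_images if n_images else 0.0,
        "max_per_image": max(per_image.values(), default=0),
        # bbox is [x, y, width, height]; small heights hint at small fonts.
        "min_box_height": min((ann["bbox"][3] for ann in annotations), default=0),
    }


# Tiny in-memory stand-in for a real annotation file (values are made up).
sample = {
    "images": [
        {"id": 1, "file_name": "label_001.jpg", "width": 640, "height": 480},
        {"id": 2, "file_name": "label_002.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [10, 20, 80, 12]},
        {"id": 2, "image_id": 1, "category_id": 1, "bbox": [10, 40, 95, 10]},
        {"id": 3, "image_id": 1, "category_id": 1, "bbox": [10, 60, 60, 6]},
        {"id": 4, "image_id": 2, "category_id": 1, "bbox": [15, 25, 70, 14]},
    ],
    "categories": [{"id": 1, "name": "text"}],
}

summary = summarize_coco(sample)
print(summary)
```

Applied to the figures reported for the benchmark, the same ratio gives 36,438 / 1,043, or roughly 35 annotations per image, which is consistent with the dense text the paper describes.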

Evaluations of four OCR engines, including docTR (F1=0.193), ML Kit (0.180), and EasyOCR (0.167), show poor performance overall, with every engine failing completely on Japanese (F1=0.000). A clustering-based post-processing algorithm improved F1 by 36%. Results are validated through HalalLens, a production halal scanner used in over 20 countries. The dataset and code are released under open licenses, providing a critical resource for improving automated halal food verification and multilingual OCR systems.
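For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes a simple token-level F1 between OCR output and a reference transcription. It is an assumption-laden illustration: the summary does not say how the paper matches predictions to ground truth (for example by bounding box overlap or by text), so the exact-match token scheme and the `token_f1` helper here are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def token_f1(predicted: List[str], reference: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact token matches.

    Tokens are matched as multisets, so a repeated word only counts
    as often as it appears in both lists.
    """
    overlap = Counter(predicted) & Counter(reference)
    matched = sum(overlap.values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: one token is misread and one is missed entirely.
reference_tokens = ["sugar", "wheat", "flour", "gelatin", "salt"]
predicted_tokens = ["sugar", "wheat", "fl0ur", "salt"]

precision, recall, f1 = token_f1(predicted_tokens, reference_tokens)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# When no predicted token matches the reference, F1 is 0.
_, _, f1_no_match = token_f1(["???"], reference_tokens)
```

Under this scheme a score of 0.000 means that not a single predicted token matched the reference, which is what the reported Japanese results amount to.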

Key Points
  • HalalBench includes 1,043 images and 36,438 annotations across 14 languages
  • The top-scoring engine, docTR, reaches only F1=0.193, and every engine scores 0.000 on Japanese
  • Clustering-based post-processing boosts F1 by 36%; results are validated via HalalLens, a production scanner used in 20+ countries

Why It Matters

First standardized benchmark for halal food verification OCR, revealing critical gaps in multilingual ingredient reading.