Research & Papers

SIMMER: Cross-Modal Food Image–Recipe Retrieval via MLLM-Based Embedding

New MLLM-based system boosts food image-to-recipe retrieval accuracy (R@1) by 5.7 percentage points on the Recipe1M 1k test.

Deep Dive

Researchers Keisuke Gomi and Keiji Yanai have introduced SIMMER (Single Integrated Multimodal Model for Embedding Recipes), a system for matching food images to their corresponding recipes. Unlike traditional methods that pair separate encoders for images and text and then require complex alignment strategies, SIMMER employs a single unified Multimodal Large Language Model (MLLM) encoder, VLM2Vec, which processes both visual and textual data through one model. This significantly simplifies the system while improving performance: on the benchmark Recipe1M dataset, the best model achieves state-of-the-art results, improving image-to-recipe retrieval accuracy (R@1) from 81.8% to 87.5% in the 1k test setting and from 56.5% to 65.5% in the more challenging 10k setting.
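
As a rough illustration of this single-encoder retrieval setup, the sketch below embeds a query image and a set of candidate recipes with one (placeholder) encoder and ranks the recipes by cosine similarity. The embed_image and embed_recipe functions are stand-ins for calls to a unified MLLM encoder such as VLM2Vec; the model's actual API and the paper's exact scoring are not reproduced here.

    import hashlib
    import numpy as np

    EMBED_DIM = 512  # illustrative; the real encoder's dimensionality may differ

    def _placeholder_encode(content: str) -> np.ndarray:
        # Stand-in for a unified MLLM encoder call (e.g. VLM2Vec); it derives a
        # deterministic pseudo-random unit vector so the sketch runs end to end.
        seed = int.from_bytes(hashlib.sha256(content.encode()).digest()[:4], "big")
        v = np.random.default_rng(seed).standard_normal(EMBED_DIM)
        return v / np.linalg.norm(v)

    def embed_image(image_path: str) -> np.ndarray:
        # In the real system the encoder would consume the image pixels, not the path.
        return _placeholder_encode("image:" + image_path)

    def embed_recipe(recipe_text: str) -> np.ndarray:
        # The same model embeds both modalities, so image and recipe vectors
        # share a single embedding space and can be compared directly.
        return _placeholder_encode("recipe:" + recipe_text)

    def rank_recipes(query_image: str, recipe_corpus: list[str]) -> list[int]:
        q = embed_image(query_image)
        R = np.stack([embed_recipe(r) for r in recipe_corpus])
        scores = R @ q                    # cosine similarity (all vectors unit-norm)
        return list(np.argsort(-scores))  # recipe indices, best match first

    if __name__ == "__main__":
        corpus = ["Margherita pizza: dough, tomato, mozzarella, basil ...",
                  "Miso ramen: noodles, miso broth, scallions ...",
                  "Caesar salad: romaine, croutons, parmesan ..."]
        print(rank_recipes("photo_of_ramen.jpg", corpus))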

The key innovation lies in SIMMER's specialized prompt engineering and training approach. The team designed prompt templates tailored to the structured nature of recipes, which typically include titles, ingredient lists, and cooking instructions. This allows the MLLM to generate more effective embeddings that capture the semantic relationships between food images and recipe components. Additionally, they introduced a component-aware data augmentation strategy that trains the model on both complete recipes and partial versions, making the system more robust to the incomplete or imperfect inputs that commonly occur in real-world applications.
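
The sketch below illustrates both ideas in a simplified form: a prompt template that serializes a recipe's title, ingredients, and instructions into text for the encoder, and a component-dropout augmentation that sometimes removes one component so the model also sees partial recipes during training. The template wording, field names, and dropout probability are illustrative assumptions, not the authors' exact settings.

    import random

    def build_recipe_prompt(title=None, ingredients=None, instructions=None) -> str:
        # Assumed template structure (the paper's exact prompts are not shown here).
        # Only the components that are present are serialized, so the same
        # template also handles partial recipes.
        parts = ["Represent this recipe for matching it to a food image."]
        if title:
            parts.append("Title: " + title)
        if ingredients:
            parts.append("Ingredients: " + "; ".join(ingredients))
        if instructions:
            parts.append("Instructions: " + " ".join(instructions))
        return "\n".join(parts)

    def component_dropout(recipe: dict, rng: random.Random) -> dict:
        # Component-aware augmentation (illustrative): keep the complete recipe
        # half of the time, otherwise drop one of the three components so the
        # model also learns embeddings for incomplete inputs.
        if rng.random() < 0.5:
            return recipe
        dropped = rng.choice(["title", "ingredients", "instructions"])
        return {k: v for k, v in recipe.items() if k != dropped}

    rng = random.Random(0)
    recipe = {
        "title": "Miso ramen",
        "ingredients": ["noodles", "miso paste", "scallions"],
        "instructions": ["Simmer the broth.", "Cook the noodles.", "Assemble."],
    }
    print(build_recipe_prompt(**component_dropout(recipe, rng)))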

This technological advancement has immediate practical implications for the growing field of food technology. By achieving higher accuracy with a simpler architecture, SIMMER enables more reliable nutritional management systems, dietary logging applications, and cooking assistance tools. The model's ability to handle partial information makes it particularly valuable for real-world scenarios where users might have incomplete recipe data or imperfect food photos. The research demonstrates how leveraging modern MLLMs can solve specialized cross-modal retrieval problems more effectively than previous task-specific architectures.

Key Points
  • Uses single VLM2Vec encoder instead of dual encoders, simplifying architecture while improving performance
  • Achieves 87.5% R@1 on the Recipe1M 1k test (a 5.7-point improvement) and 65.5% on the 10k test (see the metric sketch after this list)
  • Features component-aware data augmentation for robustness with incomplete recipes and specialized prompt templates
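
For readers unfamiliar with the metric, R@1 (more generally Recall@K) counts a query image as correct if its ground-truth recipe appears among the top K retrieved candidates in the test pool. The sketch below is a generic implementation of Recall@K over a synthetic 1,000-item pool standing in for the 1k test setting; it is not the paper's evaluation code.

    import numpy as np

    def recall_at_k(image_emb: np.ndarray, recipe_emb: np.ndarray, k: int) -> float:
        # image_emb[i] and recipe_emb[i] are the embeddings of a matching pair,
        # both produced by the same encoder and assumed L2-normalized.
        sims = image_emb @ recipe_emb.T          # (N, N) similarity matrix
        ranks = (-sims).argsort(axis=1)          # candidate recipes, best first
        targets = np.arange(len(image_emb))      # ground-truth index per image
        hits = (ranks[:, :k] == targets[:, None]).any(axis=1)
        return float(hits.mean())

    # Toy usage: a synthetic pool of 1,000 pairs; in the benchmark these would
    # be real image and recipe embeddings from the 1k (or 10k) test setting.
    rng = np.random.default_rng(0)
    E = rng.standard_normal((1000, 512))
    E /= np.linalg.norm(E, axis=1, keepdims=True)
    noisy = E + 0.1 * rng.standard_normal(E.shape)
    noisy /= np.linalg.norm(noisy, axis=1, keepdims=True)
    print("R@1:", recall_at_k(E, noisy, 1))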

Why It Matters

Enables more accurate dietary apps and cooking assistants by reliably matching food photos to recipes with simpler AI architecture.