Research & Papers

FoodSense: A Multisensory Food Dataset and Benchmark for Predicting Taste, Smell, Texture, and Sound from Images

A new AI dataset of 66,842 human ratings teaches models to predict multisensory experiences from photos.

Deep Dive

A research team from the University of Central Florida has published FoodSense, a dataset and benchmark designed to train AI models on a task humans perform effortlessly: predicting multisensory experiences from a single image. Prior computer vision work on food has focused on recognition tasks such as identifying meals or estimating calories. FoodSense tackles the largely unexplored challenge of cross-sensory inference, compiling 66,842 human ratings across 2,987 unique food images. Each rating includes a numeric score (1-5) and a free-text descriptor for each of four sensory dimensions: taste, smell, texture, and the sound a food makes when eaten.
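
To make the annotation structure concrete, here is a minimal sketch of what a single FoodSense record might look like. The field names and layout are hypothetical, since the paper's exact schema isn't reproduced here; only the counts, the 1-5 scale, and the four dimensions come from the description above.

```python
from dataclasses import dataclass

# Hypothetical layout for one participant-image rating.
# Field names are illustrative; the released dataset may differ.
@dataclass
class SensoryRating:
    image_id: str        # one of the 2,987 unique food images
    participant_id: str  # anonymized rater
    # Numeric scores on a 1-5 scale, one per sensory dimension.
    taste_score: int
    smell_score: int
    texture_score: int
    sound_score: int
    # Short free-text descriptors, one per dimension.
    taste_text: str = ""
    smell_text: str = ""
    texture_text: str = ""
    sound_text: str = ""

# Example: one of the 66,842 participant-image pairs.
rating = SensoryRating(
    image_id="img_01234",
    participant_id="p_0042",
    taste_score=4, smell_score=3, texture_score=5, sound_score=2,
    taste_text="sweet, slightly tangy",
    smell_text="buttery",
    texture_text="crisp outside, soft inside",
    sound_text="quiet crunch",
)
```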

To enable models not just to predict but also to explain their sensory expectations, the researchers used a large language model to expand the short human annotations into detailed, image-grounded reasoning traces. This process created a rich training corpus that connects visual cues to sensory language. Using this data, the team built and released FoodSense-VL, a benchmark vision-language model that outputs both predicted sensory ratings and the visual justifications for those predictions. The work also shows that standard evaluation metrics are insufficient for this task, pushing multimodal AI toward more human-like, holistic perception.
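
The sketch below illustrates how such a reasoning-trace prompt might be assembled from a short human annotation. The prompt wording, the build_trace_prompt helper, and the use of a text caption as the image-grounding signal are all assumptions for illustration, not the paper's actual pipeline.

```python
# Hypothetical construction of an LLM prompt that expands a short
# human annotation into an image-grounded reasoning trace.

def build_trace_prompt(dimension: str, score: int, descriptor: str,
                       caption: str) -> str:
    """Combine one sensory rating with visual context into an instruction."""
    return (
        f"Image description: {caption}\n"
        f"A participant rated the food's {dimension} as {score}/5 "
        f"and described it as '{descriptor}'.\n"
        "Explain, step by step, which visual cues in the image "
        "support this sensory expectation."
    )

prompt = build_trace_prompt(
    dimension="texture",
    score=5,
    descriptor="crisp outside, soft inside",
    caption="a golden-brown croissant with flaky, layered pastry",
)
print(prompt)
# The resulting text would be sent to a large language model, and its
# output stored as a reasoning trace paired with the image and rating.
```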

Key Points
  • The FoodSense dataset contains 66,842 human ratings (participant-image pairs) across 2,987 unique food images.
  • It captures numeric ratings and text descriptors for four sensory dimensions: taste, smell, texture, and sound.
  • The team trained the FoodSense-VL model to predict these multisensory attributes and generate grounded explanations from images.

Why It Matters

By teaching models to predict subjective human experiences, this research lays groundwork for personalized nutrition, richer food e-commerce, and robotic cooking.