Research & Papers

V-Nutri: Dish-Level Nutrition Estimation from Egocentric Cooking Videos

New AI model uses first-person cooking videos to track hidden ingredients like oils and sauces for better nutrition estimates.

Deep Dive

A team of researchers led by Chengkun Yue, Chuanzhi Xu, and Jiangpeng He has introduced V-Nutri, an AI framework that tackles a major limitation in automated dietary monitoring. Traditional methods rely on a single image of a finished dish, which often fails to account for visually ambiguous but nutritionally significant ingredients such as cooking oils, sauces, and mixed components. V-Nutri addresses this by analyzing the entire cooking process captured in first-person, or egocentric, video.

V-Nutri's staged framework first uses a model pre-trained on the Nutrition5K dataset to analyze the final dish. Crucially, it also employs a VideoMamba-based event-detection module to automatically identify and select keyframes from the video that show moments of ingredient addition. A lightweight fusion module then aggregates features from both the final dish and these process keyframes to generate a more accurate nutrition estimate. The team manually annotated the HD-EPIC dataset to create the first benchmark for video-based nutrition estimation, and their experiments show that process cues provide complementary evidence that improves results under controlled conditions.
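The staged pipeline described above can be sketched in a few lines: score each video frame for "ingredient-addition" events, keep the top-scoring keyframes, and fuse their features with the final-dish features. This is a minimal illustration, not the paper's implementation; the function names, the top-k selection rule, and the `alpha` mixing weight are all hypothetical stand-ins for the VideoMamba event detector and the learned fusion module.

```python
import numpy as np

def select_keyframes(event_scores, k=3):
    # Keep indices of the k frames most likely to show ingredient addition.
    # Hypothetical stand-in for the VideoMamba-based event detector.
    k = min(k, len(event_scores))
    top = np.argsort(event_scores)[-k:]
    return sorted(top.tolist())

def fuse_features(dish_feat, keyframe_feats, alpha=0.7):
    # Weighted average of final-dish features and the mean of the
    # process-keyframe features. alpha is an illustrative mixing weight,
    # not a value from the paper.
    if len(keyframe_feats) == 0:
        return dish_feat
    process_feat = np.mean(keyframe_feats, axis=0)
    return alpha * dish_feat + (1 - alpha) * process_feat

# Toy example: 5 frames with 4-dimensional features.
scores = [0.1, 0.9, 0.2, 0.8, 0.05]
frames = np.arange(5 * 4, dtype=float).reshape(5, 4)
dish = np.ones(4)

keys = select_keyframes(scores, k=2)   # frames 1 and 3 score highest
fused = fuse_features(dish, frames[keys])
```

In the real system, a nutrition-regression head would map `fused` to calorie and macronutrient estimates; here it is just a vector showing how process cues are blended with the final-dish representation.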

The research, accepted to the MetaFood Workshop at CVPR 2026, highlights that the benefit of using video depends heavily on the quality of the visual backbone and the accuracy of event detection. By moving beyond a single snapshot, V-Nutri opens the door to more reliable, passive dietary-logging tools that could integrate with smart glasses or kitchen cameras, providing a clearer picture of what we actually consume.

Key Points
  • Uses egocentric cooking videos to track visually ambiguous ingredients like oils and sauces, which are often missed in single-image analysis.
  • Features a VideoMamba-based event-detection model to automatically identify keyframes showing ingredient-addition moments during the cooking process.
  • Establishes the first benchmark for video-based nutrition estimation on the manually annotated HD-EPIC dataset, with code and data publicly released.

Why It Matters

Enables more accurate, passive dietary tracking for health apps, moving beyond flawed photo-based logging to account for hidden calories.