MARINER: A 3E-Driven Benchmark for Fine-Grained Perception and Complex Reasoning in Open-Water Environments
New 16,629-image benchmark shows even advanced MLLMs fail at fine-grained vessel identification and causal reasoning in marine environments.
A research team led by Xingming Liao has introduced MARINER, a benchmark designed to test AI systems on fine-grained perception and complex reasoning in open-water environments. Built on the novel Entity-Environment-Event (3E) paradigm, MARINER contains 16,629 multi-source maritime images spanning 63 fine-grained vessel categories, diverse adverse weather conditions, and 5 typical dynamic maritime incident types, such as collisions and groundings. The benchmark covers three critical tasks: fine-grained classification, object detection, and visual question answering (VQA), creating what the researchers call the first "realistic and cognitive-level evaluation" for maritime multimodal understanding.
The team conducted extensive evaluations of mainstream multimodal large language models (MLLMs), including GPT-4V, Claude 3, and others, establishing baseline performance metrics. Their findings reveal significant shortcomings: even the most advanced models struggle to discriminate between similar vessel types and fail at causal reasoning about complex marine incidents. For instance, models could not reliably distinguish among cargo ship subtypes or explain why a vessel might be drifting dangerously. This research fills a critical gap in AI evaluation: previous benchmarks lacked the realistic complexity of actual maritime operations, where environmental factors, vessel details, and dynamic events interact.
MARINER arrives as maritime industries increasingly explore AI for autonomous navigation, port management, and maritime surveillance. The benchmark's comprehensive design, combining visual data with reasoning challenges, pushes beyond simple object recognition toward genuine situational understanding. By exposing current AI limitations in this safety-critical domain, MARINER provides a roadmap for developing more robust vision-language models capable of handling the nuanced demands of open-water applications, potentially accelerating progress toward reliable maritime AI systems.
- MARINER contains 16,629 multi-source images across 63 fine-grained vessel categories and 5 maritime incident types
- Testing revealed even advanced MLLMs like GPT-4V struggle with fine-grained discrimination and causal reasoning in marine scenes
- The benchmark uses a novel Entity-Environment-Event (3E) paradigm to evaluate both perception and reasoning capabilities
**Why It Matters**
Exposes critical gaps in AI's ability to handle safety-critical maritime applications, guiding development of more reliable systems for autonomous navigation and surveillance.