No multimodality yet in DeepSeek-V4. But I'll wait.
DeepSeek-V4 skips image and audio processing, leaving fans hoping for v4.1
DeepSeek has released the technical report for its V4 model, and the AI community is buzzing with mixed reactions. The report confirms that DeepSeek-V4 does not include multimodality—meaning it cannot process images, audio, or video inputs. This is a notable omission given that competitors like GPT-4o, Claude 3.5, and Gemini 1.5 all support multimodal interactions. The company seems to have prioritized text-based reasoning and efficiency gains over expanding input modalities.
Despite this gap, many users on Reddit and other platforms are expressing patience and even optimism. The prevailing sentiment is that DeepSeek-V4's text-only focus may allow for deeper reasoning capabilities or faster inference. The community is now looking ahead to a potential v4.1 or v4.5 release that could add multimodality. For now, developers and researchers who rely on multimodal workflows will need to stick with other models or wait for DeepSeek's next update.
- DeepSeek-V4's technical report confirms no multimodality (no image or audio processing)
- Competitors such as GPT-4o and Claude 3.5 already support multimodal inputs
- The community hopes multimodality arrives in a v4.1 or v4.5 follow-up
Why It Matters
DeepSeek-V4's lack of multimodality rules it out for vision and audio tasks, leaving developers to rely on other models or wait for a future release.