No multimodality yet in DeepSeek-V4. But I'll wait.
DeepSeek-V4 skips image and audio processing, leaving fans hoping for v4.1
DeepSeek has released the technical report for its V4 model, and the AI community is buzzing with mixed reactions. The report confirms that DeepSeek-V4 does not include multimodality—meaning it cannot process images, audio, or video inputs. This is a notable omission given that competitors like GPT-4o, Claude 3.5, and Gemini 1.5 all support multimodal interactions. The company seems to have prioritized text-based reasoning and efficiency gains over expanding input modalities.
Despite this gap, many users on Reddit and other platforms are expressing patience and even optimism. The prevailing sentiment is that DeepSeek-V4's text-only focus may allow for deeper reasoning capabilities or faster inference. The community is now looking ahead to a potential v4.1 or v4.5 release that could add multimodality. For now, developers and researchers who rely on multimodal workflows will need to stick with other models or wait for DeepSeek's next update.
- DeepSeek-V4's technical report confirms no multimodality (no image or audio processing)
- Competitors such as GPT-4o and Claude 3.5 already support multimodal inputs
- The community hopes multimodality arrives in a v4.1 or v4.5 follow-up
Why It Matters
DeepSeek-V4's lack of multimodality rules it out for vision and audio tasks, leaving developers to rely on other models or wait for a future release.