ByteDance Officially Introduces Doubao-Seed-2.0-lite Full-Modal Large Model
The model watches videos, hears audio, and executes GUI actions like clicking buttons.
ByteDance's Volcano Engine unveiled Doubao-Seed-2.0-lite, the first native full-modal model in the Doubao family. Unlike earlier models that processed each modality separately, it achieves unified understanding of video, images, audio, and text from the ground up. In complex reasoning tests in physics and medicine it significantly surpassed the Pro version released earlier this year, and it reaches industry-leading levels in fine-grained perception and embodied understanding. Integrated speech understanding enables 'synchronized audio-visual' joint reasoning: the model can judge whether video content is consistent with its background audio, locate specific events in long videos, and reconstruct complex character relationships. It also supports transcription in 19 languages and translation between 14 languages, and can detect emotional fluctuations and ambient sounds.
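To make the 'synchronized audio-visual' idea concrete, here is a minimal sketch of what a single request pairing video, audio, and a text question might look like. This assumes a hypothetical OpenAI-style multimodal chat payload; the field names (`video_url`, `input_audio`) and the model id are illustrative, not a documented Doubao API.

```python
# Hypothetical sketch: one user turn that mixes video, audio, and text parts
# so the model can reason over them jointly. All field names are assumptions.
def build_omni_request(model, video_url, audio_b64, question):
    """Assemble a single-turn multimodal request (hypothetical schema)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "video_url", "video_url": {"url": video_url}},
                    {"type": "input_audio",
                     "input_audio": {"data": audio_b64, "format": "wav"}},
                    {"type": "text", "text": question},
                ],
            }
        ],
    }

req = build_omni_request(
    "doubao-seed-2.0-lite",          # model name as announced; exact API id may differ
    "https://example.com/clip.mp4",  # placeholder URL
    "UklGRg==",                      # placeholder base64 audio
    "Does the background audio match what happens on screen?",
)
print(len(req["messages"][0]["content"]))  # 3 parts: video, audio, text
```

The point of a unified payload like this is that the model sees all three parts in one context window, rather than routing each modality to a separate encoder and merging results afterwards.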
Beyond perception, Doubao-Seed-2.0-lite brings upgraded Agent and Coding capabilities. It can decompose multi-turn complex instructions, verify intermediate results, and generate fully engineered deliverables such as front-end pages, 3D scenes, and games. Crucially, it achieves integrated GUI understanding and execution for the first time: it recognizes buttons and menus in web pages or apps and performs actions such as clicking, dragging, and text input, closing the loop from 'understanding the interface' to 'completing tasks end-to-end.' The technology is already deployed in e-sports (an AI coach analyzing 25-hour match videos), education, and cross-border e-commerce. A lighter version, Doubao-Seed-2.0-mini, is also available for cost-effective large-scale full-modal inference.
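The "understand the interface, then act" loop described above can be sketched as a simple observe-decide-execute cycle. This is a minimal illustration, not ByteDance's implementation: the model call is replaced by a stub that returns a canned action sequence for a hypothetical login task, and the executor is left as a comment.

```python
# Minimal sketch of a GUI agent loop: observe the screen, ask the model for
# the next action, execute it, repeat until the model signals completion.
# The model is stubbed with fixed actions; names like Action and
# query_model_stub are illustrative, not a real Doubao API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str        # "click", "drag", "input", or "done"
    target: str = "" # UI element the model identified, e.g. a button label
    text: str = ""   # payload for "input" actions

def query_model_stub(step):
    """Stand-in for the model: a fixed plan for a hypothetical login task."""
    script = [
        Action("click", target="username field"),
        Action("input", target="username field", text="alice"),
        Action("click", target="Sign in"),
        Action("done"),
    ]
    return script[min(step, len(script) - 1)]

def run_agent(max_steps=10):
    """Drive the loop until the model emits "done" or the step budget runs out."""
    log = []
    for step in range(max_steps):
        action = query_model_stub(step)  # real agent: model(screenshot, goal)
        if action.kind == "done":
            break
        log.append(f"{action.kind}:{action.target}")
        # Real agent: dispatch the action to a browser/OS automation layer,
        # then capture a fresh screenshot for the next iteration.
    return log

print(run_agent())
```

In a real deployment the stub would be replaced by a model call that receives the current screenshot and task goal, and the executor would translate each action into actual mouse/keyboard events, which is what closes the end-to-end loop the article describes.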
- Native unified understanding of video, images, audio, and text, outperforming the previous Pro version in physics and medicine reasoning.
- Supports 19-language transcription and 14-language translation with emotion and ambient sound detection.
- First integrated GUI understanding and execution, enabling end-to-end tasks such as clicking, dragging, and text input on web and app interfaces.
Why It Matters
ByteDance's full-modal AI enables practical agents that see, hear, and interact with digital interfaces at scale.