Alibaba's Qwen3.7-Plus multimodal agent codes and verifies itself
The model reads images and video, then writes code to complete tasks autonomously.
Alibaba's Qwen team has launched Qwen3.7-Plus, a multimodal large language model designed for autonomous task completion. Unlike its text-only sibling Qwen3.7-Max, this model understands images and video, which lets it handle visual tasks such as OCR, chart reading, and video-frame analysis. The model is available via Alibaba Cloud's Bailian platform (Model Studio internationally) and is positioned as a multimodal hybrid agent: a system that plans, acts, and iterates without human intervention.
Qwen3.7-Plus integrates five core abilities: deep reasoning (step-by-step problem solving), self-programming (writing and revising code), tool invocation (calling external APIs), verification and testing (checking outputs), and autonomous iteration (looping until completion). This agentic loop is supported by Bailian's Agentic RL mechanism, which uses real-world execution feedback to refine accuracy, and by built-in safety guardrails that keep autonomous operations within limits. On LM Arena's Vision Arena leaderboard, the preview ranked #16 overall, making Alibaba the #5 lab in vision understanding. For professionals, this means a model that can autonomously build applications, process visual data, and execute complex workflows with minimal oversight.
- Qwen3.7-Plus is a multimodal agent that understands images and video, not just text.
- It writes and revises its own code, calls APIs, and loops until tasks are done — no human in the loop.
- The preview ranked #16 in Vision Arena, making Alibaba the #5 lab, and Bailian's Agentic RL uses real-world execution feedback to refine accuracy.
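The write, verify, and iterate loop described above can be illustrated with a minimal, generic sketch. This is not Qwen's or Bailian's API: `run_agent_loop`, `toy_propose`, and `toy_verify` are hypothetical names, the "model" is a hard-coded stub that returns a buggy draft first and a fixed one after feedback, and the iteration cap stands in for a very simple guardrail.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentResult:
    succeeded: bool
    attempts: int
    final_code: str
    log: List[str] = field(default_factory=list)


def run_agent_loop(
    task: str,
    propose: Callable[[str, List[str]], str],
    verify: Callable[[str], str],
    max_iterations: int = 5,
) -> AgentResult:
    """Generic propose / verify / iterate loop (illustrative only).

    propose(task, feedback) -> candidate code; stands in for a model call.
    verify(candidate)       -> "" on success, otherwise an error message.
    max_iterations          -> hard cap so the loop always terminates.
    """
    feedback: List[str] = []
    candidate = ""
    for attempt in range(1, max_iterations + 1):
        candidate = propose(task, feedback)
        error = verify(candidate)
        if not error:
            return AgentResult(True, attempt, candidate, feedback)
        # Feed the failure back so the next proposal can react to it.
        feedback.append(f"attempt {attempt}: {error}")
    return AgentResult(False, max_iterations, candidate, feedback)


def toy_propose(task: str, feedback: List[str]) -> str:
    # Hypothetical stand-in for the model: the first draft has a bug,
    # and it "revises" once it has seen any feedback.
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def toy_verify(candidate: str) -> str:
    # Runs the candidate and checks one test case. exec() is acceptable
    # here only because the candidate is our own toy string; a real
    # agent would need a proper sandbox.
    namespace: dict = {}
    try:
        exec(candidate, namespace)
        if namespace["add"](2, 3) != 5:
            return "add(2, 3) did not return 5"
    except Exception as exc:
        return f"raised {exc!r}"
    return ""


result = run_agent_loop("write add(a, b)", toy_propose, toy_verify)
print(result.succeeded, result.attempts)  # True 2
```

In this toy run the first draft fails verification, the failure message is fed back, and the second draft passes. A production agent would replace the stubs with real model calls and sandboxed execution, which this sketch does not attempt to model.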
Why It Matters
Autonomous app-building from visual input reduces developer time and enables complex workflows.