Qwen 3.5 0.8B - small enough to run on a watch. Cool enough to play DOOM.
A tiny 0.8B parameter model from Alibaba successfully plays classic FPS DOOM using vision and tools.
A developer has successfully created an AI agent using Alibaba's Qwen 3.5 0.8B model that can play the classic first-person shooter DOOM. The system, built with Python and VizDoom, works by taking a screenshot of the game, overlaying a numbered grid for spatial reference, and sending the image to the vision-language model. The model is equipped with just two tools, 'shoot' and 'move', and must decide which action to take based on the visual input. Remarkably, this 0.8-billion-parameter model, compact enough in theory to run on a smartwatch, demonstrates basic competency: in simple scenarios, it correctly identifies enemies, selects the corresponding grid column, and fires its weapon to secure kills.
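A minimal sketch of the two-tool interface such an agent might expose to the model, assuming an OpenAI-compatible tools schema like the one LM Studio serves. The tool names 'shoot' and 'move' come from the write-up; the grid size, parameter names, and action mapping here are illustrative assumptions, not the project's actual code:

```python
# Hypothetical tool definitions for the VLM agent. Only the tool
# names ('shoot', 'move') are from the project; everything else
# is an assumed sketch.

GRID_COLS = 8  # assumption: screen split into 8 numbered columns

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "shoot",
            "description": "Fire at the enemy in the given grid column.",
            "parameters": {
                "type": "object",
                "properties": {
                    "column": {"type": "integer", "minimum": 0,
                               "maximum": GRID_COLS - 1},
                },
                "required": ["column"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "move",
            "description": "Strafe in the given direction.",
            "parameters": {
                "type": "object",
                "properties": {
                    "direction": {"type": "string",
                                  "enum": ["left", "right"]},
                },
                "required": ["direction"],
            },
        },
    },
]

def dispatch(name: str, args: dict) -> list[int]:
    """Translate a model tool call into a VizDoom-style action vector
    [MOVE_LEFT, MOVE_RIGHT, ATTACK], as used in the 'basic' scenario."""
    if name == "shoot":
        return [0, 0, 1]
    if name == "move":
        return [1, 0, 0] if args["direction"] == "left" else [0, 1, 0]
    raise ValueError(f"unknown tool: {name}")
```

Each game step, the overlaid screenshot and `TOOLS` would go to the model over HTTP, and the returned tool call would be fed through `dispatch` into the VizDoom environment.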
The project reveals both the capabilities and current limitations of small-scale VLMs acting as agents. While the model shows moments of unexpected self-awareness, outputting statements like "I see a fireball but I'm not sure if it's an enemy," it struggles with higher-level strategy. In the 'defend_the_center' scenario, it fails to conserve ammo and continues to issue shoot commands even when no enemies are present. The current latency is about 10 seconds per decision step when running via HTTP calls to LM Studio on an M1 Mac. To address its shortcomings, the developer is now implementing a 'reason' field, forcing the model to describe its visual observations before committing to an action, a step aimed at improving its decision-making logic and resource management.
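One plausible shape for the planned 'reason' field, sketched under the assumption that the agent validates the model's tool-call JSON before acting: require a non-empty description of what the model sees and reject the call otherwise. The function and field names below are illustrative, not the developer's implementation:

```python
import json

def validate_call(raw: str) -> dict:
    """Parse a model tool call and insist the hypothetical 'reason'
    field is present and non-empty before the action is executed."""
    call = json.loads(raw)
    reason = call.get("arguments", {}).get("reason", "")
    if not reason.strip():
        raise ValueError("model must describe its observation before acting")
    return call

# A call that passes validation: the model states what it observed.
good = validate_call(
    '{"name": "shoot",'
    ' "arguments": {"column": 4, "reason": "imp visible in column 4"}}'
)
```

Forcing this field turns each decision into a tiny observe-then-act chain, which is the mechanism the developer hopes will curb reflexive shooting when no enemy is on screen.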
- Uses Alibaba's Qwen 3.5 0.8B, a sub-1B parameter model small enough for edge devices like smartwatches.
- Functions as a Vision-Language Model (VLM) agent with 'shoot' and 'move' tools to play DOOM via the VizDoom framework.
- Shows basic competency and surprising self-awareness but struggles with complex strategy, prompting the addition of a 'reason' field.
Why It Matters
Demonstrates the potential for tiny, efficient AI models to power complex, interactive agents on consumer hardware, moving AI closer to the edge.