Open Source

Qwen 3.6-35B-A3B plays DCSS roguelike game, MTP version stumbles

Non-MTP Qwen navigates Dungeon Crawl Stone Soup with solid performance...

Deep Dive

Alibaba's Qwen 3.6-35B-A3B model (q4_k_xl quantization) was tested playing the open-source roguelike Dungeon Crawl Stone Soup (DCSS) on an RTX 5090 via LM Studio. The MTP (Multi-Token Prediction) version suffers from tool-call bugs that corrupt its output and trigger repeated wrong actions, while the standard non-MTP version handles the game reasonably well. The agent operates through a terminal, sending text commands and receiving ASCII or screenshot updates. The logged session shows a Minotaur Fighter named BunnyLvl114032 reaching XL 5 on Dungeon:3 with 47/47 HP, 65 gold, and kills including snakes, bats, and a kobold.

This real-world test exposes a gap between benchmark scores and practical agent reliability. The MTP variant, despite offering 240k context and 8k output tokens, produced mangled tool calls that made it unusable for sequential gameplay, where each action depends on the result of the previous one. The non-MTP version succeeded, suggesting that smaller quantized models can handle complex, long-horizon tasks when the tool-calling implementation is robust. The author aims to develop a reliable remote-play workflow for DCSS and to use the game as a benchmark for LLM agentic capabilities, something current standard benchmarks fail to measure.
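One way to contain the failure mode described above (mangled tool calls and repeated wrong actions) is to validate each call before it reaches the game. The following is a minimal sketch, not the author's implementation: the `send_keys` tool name, the `{"name": ..., "arguments": {"keys": ...}}` shape, the `ToolCallGuard` class, and the limits are all hypothetical, since the source does not describe the actual tool schema.

```python
import json
from collections import deque

# Hypothetical tool schema; the real harness may use different names and fields.
ALLOWED_TOOLS = {"send_keys"}
MAX_KEYS = 16  # assumed upper bound on keystrokes per call


class ToolCallGuard:
    """Rejects malformed tool calls and flags runs of identical actions."""

    def __init__(self, repeat_limit=3):
        self.repeat_limit = repeat_limit
        # Only the most recent `repeat_limit` accepted key strings are kept.
        self.recent = deque(maxlen=repeat_limit)

    def check(self, raw):
        """Return (ok, keys_or_reason) for one raw tool-call string."""
        try:
            call = json.loads(raw)
        except json.JSONDecodeError as exc:
            return False, f"invalid JSON: {exc.msg}"
        if not isinstance(call, dict):
            return False, "tool call must be a JSON object"
        name = call.get("name")
        if not isinstance(name, str) or name not in ALLOWED_TOOLS:
            return False, f"unknown tool: {name!r}"
        args = call.get("arguments")
        keys = args.get("keys") if isinstance(args, dict) else None
        if not isinstance(keys, str) or not (0 < len(keys) <= MAX_KEYS):
            return False, "arguments.keys must be a short non-empty string"
        self.recent.append(keys)
        # A full window of identical key strings suggests the agent is stuck.
        if len(self.recent) == self.repeat_limit and len(set(self.recent)) == 1:
            return False, f"same action repeated {self.repeat_limit} times"
        return True, keys
```

In a loop, a rejected call would be returned to the model as an error message rather than forwarded to the terminal, so a corrupted call costs one turn instead of a wrong move. Legitimate play can repeat a key (for example `o`, auto-explore in DCSS), so the repeat limit would need tuning in practice.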

Key Points
  • Non-MTP Qwen 3.6-35B-A3B (q4_k_xl) plays the roguelike DCSS reasonably well on an RTX 5090
  • MTP version produces corrupted tool calls, nullifying its speed benefits and causing repeated errors
  • The agent reached Dungeon:3 with an XL 5 Minotaur Fighter, demonstrating real-world agentic competency

Why It Matters

Real-game tests reveal critical tool-call reliability differences between model variants, guiding deployment decisions for autonomous agents.