
I had LLMs GM/DM solo campaigns for 50+ hours so you didn't have to. AMA

A gamer tested five major LLMs as solo Dungeon Masters, logging over 50 hours across multiple RPG systems.

Deep Dive

A Reddit user, earthwulf, ran a 50+ hour experiment to determine which large language model (LLM) makes the best AI Dungeon Master for solo tabletop role-playing games. They tested five major models: ChatGPT (in both 'Instant' and 'Thinking' modes), Claude Opus 4.6, Gemini, and Grok (in 'Fast' and 'Expert' modes). The tests spanned multiple RPG systems, including Dungeons & Dragons 5e, Pathfinder 2e, RuneQuest, and Shadowrun. The user ran pre-published modules, rolled their own dice, and set explicit rules to counter common AI issues such as repetition and hallucination.

The results gave a ranking for AI-run solo RPGs. OpenAI's ChatGPT in 'Thinking' mode was the top performer, scoring 4/5: it maintained longer, more coherent story threads, gave party members deeper personalities, and repeated itself less than its 'Instant' counterpart. Grok's 'Fast' mode was surprisingly competitive, matching ChatGPT Thinking in quality. The other models showed significant flaws. Claude Opus 4.6 stopped responding entirely after about an hour, and Google's Gemini began forgetting chat history almost immediately, rendering it unusable for a sustained campaign. The experiment shows both what general-purpose LLMs can do and where they fall short on complex, stateful narrative tasks.

Key Points
  • ChatGPT 'Thinking' mode scored 4/5 as the best AI GM, praised for longer, coherent storytelling and deeper character personalities.
  • Claude Opus 4.6 failed after ~1 hour of play, and Gemini's memory issues made it nearly unusable for a sustained campaign.
  • The tester identified key LLM flaws for RPGs: repetitive dialogue, narrative drift into 'homebrew,' and a tendency to enable power fantasies.

Why It Matters

This hands-on stress test indicates which AI models can sustain complex, long-form interactive storytelling, a useful proxy for how they might handle stateful agent applications.