Claude tops GPT, Gemini, and Grok in 8-week HTML generation showdown
Claude won 9 out of 17 prompts across 68 single-file HTML tests…
Researchers at the 'HTML AI Battle' project ran a controlled experiment from December 2025 to February 2026, collecting 68 single-file HTML outputs (17 prompts across four models) from GPT, Gemini, Grok, and Claude. Each model received identical prompts, with no per-model customization and no repair of the outputs. Human evaluators and a Gemini LLM-as-a-judge scored the outputs on prompt adherence, functional correctness, and UI quality. The results were then published on X, TikTok, and YouTube to measure social reach.
Claude consistently outperformed its rivals, leading mean performance and winning 9 of 17 prompts on the primary human score. Counterintuitively, longer reasoning times did not yield higher quality. The Gemini judge was more lenient than the human evaluators, and the question of whether it favored its own outputs remained unresolved. Predictive models for 24-hour X impressions did worse than simply guessing the average (R² = -0.377), while code verbosity was better predicted by model family than by prompt wording (R² = 0.576). The study highlights the limits of current LLMs in web generation and of predicting social virality.
- Claude won 9/17 prompts under human evaluation, leading mean performance across all tests.
- Longer reasoning time did not produce higher output quality.
- Model family explained 57.6% of the variance in code verbosity (R² = 0.576), but 24-hour X impressions were unpredictable (R² = -0.377).
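A negative R² can look like a typo, so here is a minimal sketch of how it arises. R² compares a model's squared error against the error of always predicting the mean, and it drops below zero whenever the model does worse than that baseline. The function name `r_squared` and all numbers below are invented for illustration; they are not data from the study, and the study's exact evaluation setup is not described here.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    # Squared error of the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of the "always predict the mean" baseline.
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical impression counts, made up for this example.
actual = [100.0, 200.0, 300.0, 400.0]

mean_baseline = [250.0] * len(actual)       # always guess the average
poor_model = [400.0, 100.0, 400.0, 100.0]   # guesses that miss badly
good_model = [110.0, 190.0, 310.0, 390.0]   # guesses that track the data

print(r_squared(actual, mean_baseline))  # 0.0: no better than the mean
print(r_squared(actual, poor_model))     # negative: worse than the mean
print(r_squared(actual, good_model))     # close to 1: most variance explained
```

Read this way, an R² of -0.377 means the impression models were less accurate than guessing the average every time, while 0.576 means model family accounted for a bit over half of the variation in code verbosity.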
Why It Matters
For developers relying on LLMs for web generation, Claude delivered the strongest results in this test, though how widely the outputs spread on social media remained unpredictable.