Claude tops GPT, Gemini, and Grok in 8-week HTML generation showdown

arXiv cs.SE May 11, 2026

⚡Claude won 9 out of 17 prompts across 68 single-file HTML tests…

Deep Dive

Researchers at the 'HTML AI Battle' project ran a controlled experiment from December 2025 to February 2026, collecting 68 single-file HTML outputs (17 prompts × 4 models) from GPT, Gemini, Grok, and Claude. Each model received the same prompt, with no customization and no repair of its output. Human evaluators and a Gemini LLM-as-a-judge scored the outputs on prompt adherence, functional correctness, and UI quality. The results were then published on X, TikTok, and YouTube to measure social reach.

Claude led on mean performance and won 9 of 17 prompts on the primary human score. Counterintuitively, longer reasoning times did not yield higher quality. The Gemini judge scored more leniently than the human evaluators, and the question of self-favoring bias remained unresolved. Predictive models for 24-hour X impressions performed poorly (R² = -0.377; a negative R² means the model did worse than simply predicting the mean), while code verbosity was better predicted by model family than by prompt wording (R² = 0.576). The study highlights the limits of current LLMs in web generation and the difficulty of predicting social virality.
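To make the negative R² concrete, here is a minimal sketch that assumes the reported figure is the standard coefficient of determination, R² = 1 - SS_res / SS_tot. The summary does not say how the study computed it, so that is an assumption. The function name and the impression counts are invented for illustration and are not data from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical 24-hour impression counts (illustrative only).
actual = [100.0, 200.0, 300.0, 400.0]

# Baseline: predict the mean for every post.
baseline = [250.0] * 4

# A poor model that ranks the posts exactly backwards.
poor_model = [400.0, 300.0, 200.0, 100.0]

r2_baseline = r_squared(actual, baseline)   # 0.0: no better than the mean
r2_poor = r_squared(actual, poor_model)     # -3.0: worse than the mean

print(r2_baseline, r2_poor)
```

Under this definition, a score of 0 matches the mean-only baseline and anything below 0 is worse than that baseline, which is the sense in which the X-impression models were uninformative.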

Key Points

Claude won 9/17 prompts under human evaluation, leading mean performance across all tests.
Longer reasoning time showed no correlation with higher output quality.
Model family explained 57.6% of code verbosity variance, but X impressions were unpredictable (R² = -0.377).

Why It Matters

For developers relying on LLMs for web generation, Claude produced the strongest results in this test, though social virality remained unpredictable.
