GPT-5.3-chat shows a surprising and severe regression on EQ-Bench and longform writing: a flood of partial refusals, and prose that devolves into tiny 1-5 word paragraphs.
Users report a 40% drop in EQ-Bench scores and a bizarre, fragmented writing style in the latest model.
OpenAI's latest iteration, GPT-5.3-chat, is facing significant criticism from early users who report a surprising and severe regression in key performance areas. The model, intended as an incremental update, is showing a dramatic 40% drop in scores on the EQ-Bench emotional intelligence benchmark compared to its predecessor. More notably, its long-form writing capability has degraded, with prose frequently devolving into bizarre, fragmented paragraphs of just 1-5 words, making outputs unusable for professional writing tasks. This has sparked concern within the AI community about the stability of OpenAI's rapid release cycle.
Technical analysis points to potential over-tuning of safety and alignment protocols, leading to a high frequency of 'partial refusals,' where the model stops generating coherent text mid-response. Unlike a full refusal, these interruptions cripple workflows without clear warning. The regression highlights the inherent challenges of multi-objective optimization for LLMs, where improving safety can inadvertently damage core capabilities like reasoning and coherence. For developers and enterprises relying on consistent API performance, this incident underscores the risks of deploying unproven incremental updates without extensive beta testing.
- GPT-5.3-chat scores ~40% lower on the EQ-Bench emotional intelligence test than previous versions.
- Long-form writing outputs are fragmented into unusable 1-5 word paragraphs, a severe quality regression.
- Users report a high rate of 'partial refusals,' where the model stops tasks mid-completion without warning.
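For teams that want to catch this kind of regression before it reaches production, the fragmentation pattern described above is easy to screen for automatically. Below is a minimal heuristic sketch (the function name and thresholds are illustrative assumptions, not part of any OpenAI API): it flags a response whose paragraphs are mostly 1-5 word fragments.

```python
def looks_fragmented(text: str, short_word_limit: int = 5,
                     max_short_ratio: float = 0.5) -> bool:
    """Heuristic check: does this output consist mostly of tiny paragraphs?

    Splits on blank lines, counts paragraphs of `short_word_limit` words
    or fewer, and flags the text if they make up at least
    `max_short_ratio` of all paragraphs. Thresholds are illustrative.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return False
    short = sum(1 for p in paragraphs if len(p.split()) <= short_word_limit)
    return short / len(paragraphs) >= max_short_ratio
```

A check like this could run over sampled completions from a candidate model version, gating rollout if the fragmentation rate jumps relative to the previous release.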
Why It Matters
For businesses, inconsistent model performance disrupts workflows and undermines trust in AI as a reliable tool.