GPT-5.4 is more likely to refuse than any other OpenAI model to date.
SpeechMap benchmark reveals GPT-5.4 gives complete answers just 29.6% of the time on sensitive topics.
OpenAI's GPT-5.4 has emerged as the most refusal-prone model in the company's lineup according to the SpeechMap compliance benchmark, which measures how freely AI models discuss controversial topics. The benchmark, developed by researchers and covered by TechCrunch, tests models across 1,000 prompts covering sensitive subjects, categorizing responses as 'Complete' (full answer), 'Evasive', 'Denial', or 'Error'. GPT-5.4 scored just 29.6% 'Complete' answers, a dramatic drop from the original GPT-5 Chat's 78.9% and even lower than GPT-5.1 Chat's 42.0%. This represents a clear trend toward increased caution in OpenAI's model iterations.
The technical implications are significant for developers and enterprises using OpenAI's API. The SpeechMap methodology reveals that OpenAI has systematically tightened safety guardrails with each model update, particularly between GPT-5.3 Chat (62.8% Complete) and GPT-5.4. This creates a trade-off where improved safety alignment comes at the cost of reduced willingness to engage with complex, nuanced topics. For applications requiring balanced discussion of controversial subjects—from educational tools to research assistants—developers may need to implement additional prompt engineering or consider alternative models. The benchmark data suggests OpenAI is prioritizing safety over conversational freedom, potentially influencing how other AI companies approach model alignment and setting new industry standards for what constitutes 'responsible' AI behavior.
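For teams wanting to track this kind of regression themselves, the scoring described above is straightforward to reproduce. The sketch below is a minimal, hypothetical illustration of computing a SpeechMap-style 'Complete' rate from categorized responses; the four category labels come from the benchmark, but the helper function and sample data are invented for illustration.

```python
from collections import Counter

# The four response categories SpeechMap reportedly uses.
CATEGORIES = {"Complete", "Evasive", "Denial", "Error"}

def complete_rate(labels):
    """Return the percentage of responses labeled 'Complete'.

    `labels` is one category string per benchmark prompt
    (e.g. 1,000 entries for the full SpeechMap suite).
    """
    counts = Counter(labels)
    unknown = set(counts) - CATEGORIES
    if unknown:
        raise ValueError(f"unexpected labels: {unknown}")
    return 100.0 * counts["Complete"] / sum(counts.values())

# Made-up labels for 10 prompts, purely to show the calculation:
sample = ["Complete"] * 3 + ["Evasive"] * 4 + ["Denial"] * 2 + ["Error"]
print(f"{complete_rate(sample):.1f}%")  # 30.0%
```

Running the same tally over successive model versions is what makes the drop from 78.9% to 29.6% directly comparable: the prompt set and categories stay fixed while only the model changes.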
- GPT-5.4 scores only 29.6% 'Complete' answers on SpeechMap's 1,000-prompt controversial topics benchmark
- This represents a 49.3 percentage point drop from the original GPT-5 Chat model's 78.9% score
- The trend shows systematic guardrail tightening across OpenAI's GPT-5 series iterations
Why It Matters
Developers building applications requiring nuanced discussion of sensitive topics face new challenges with increasingly cautious AI models.