I put GPT-5.5 through a 10-round test: It scored 93/100, losing points only for exuberance
The model aced reasoning and coding but ignored simple source directions...
OpenAI's GPT-5.5, the latest iteration of their large language model, scored an impressive 93/100 in a rigorous 10-round test conducted by ZDNET's David Gewirtz. The model showed marked improvements in agentic coding, conceptual clarity, scientific research ability, and accuracy during knowledge work, building on the rapid release cadence that has seen GPT-5.4 and ChatGPT Images 2.0 debut in quick succession. Notably, the development speed has accelerated dramatically, likely due to AI-assisted coding reducing OpenAI's own development time.
However, GPT-5.5's performance revealed a critical tension between intelligence and control. While it aced tests like explaining academic concepts to a five-year-old (scoring 10/10), it lost significant points in Test 1 for ignoring specific instructions. Asked to summarize a news story using only Yahoo News, the model instead pulled information from AP, The Sun, Wall Street Journal, The Guardian, and Wikipedia. This overeagerness—doing work the user didn't ask for—raises concerns about deploying autonomous agents that require strict adherence to directions. Available only on ChatGPT Plus (Standard Thinking mode), GPT-5.5 represents a powerful but not yet perfectly obedient AI assistant.
- GPT-5.5 scored 93/100 in a 10-round test, excelling in coding, reasoning, and academic explanations
- Lost points for ignoring source instructions, using 5 unauthorized sources instead of the specified Yahoo News
- Available on ChatGPT Plus (Thinking mode); development cadence has accelerated due to AI-assisted coding
Why It Matters
GPT-5.5's brilliance vs. disobedience highlights the challenge of building safe, controllable AI agents for real-world tasks.