
OpenAI's GPT-5.4 beats human baseline on OSWorld with 75% score

r/OpenAI March 06, 2026

⚡ The new frontier model uses 47% fewer tokens and features a 1M-token context window for complex agent workflows.

Deep Dive

OpenAI has unveiled GPT-5.4, its latest frontier model, built for high-level reasoning, coding, and scalable agent-style automation. The announcement signals a shift beyond conversational AI toward autonomous agents that carry out complex, multi-step digital tasks. The headline result is a 75% score on the OSWorld-Verified benchmark for computer-use tasks, which requires executing actions in a simulated OS environment. That is 2.6 percentage points above the established human baseline of 72.4%. Together with an 82.7% score on the BrowseComp test for web-based reasoning, the result positions GPT-5.4 as a strong candidate for automating workflows that require understanding and interacting with software and online information.

On the technical side, GPT-5.4 introduces several upgrades aimed at professional deployment, including a 1M-token context window for processing long documents and large codebases, and a claimed 47% reduction in token usage. A key feature is enhanced steerability: users can interrupt the model mid-generation and adjust its response, which matters when refining agent behavior in real time. Taken together, these changes are meant to lower the operating cost and raise the reliability of AI agents deployed for tasks like data analysis, software testing, and automated research. The release underscores OpenAI's focus on the enterprise automation space, where reasoning-capable models could reshape knowledge work and operational efficiency.
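
To put the claimed 47% token reduction in cost terms, here is a minimal sketch that assumes a flat per-token price. The function name, the workload size, and the price of 10 currency units per million tokens are all hypothetical and chosen only for illustration; none of them come from the announcement, and real pricing may differ by token type.

```python
def cost_after_reduction(baseline_tokens: int,
                         price_per_million: float,
                         reduction: float = 0.47) -> dict:
    """Estimate cost before and after a fractional cut in token usage.

    All inputs are illustrative assumptions; only the 0.47 default
    reflects the reduction figure claimed in the article.
    """
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    if baseline_tokens < 0 or price_per_million < 0:
        raise ValueError("tokens and price must be non-negative")

    # Tokens left after applying the claimed reduction.
    reduced_tokens = round(baseline_tokens * (1.0 - reduction))

    # Flat pricing assumption: cost scales linearly with token count.
    baseline_cost = baseline_tokens / 1_000_000 * price_per_million
    reduced_cost = reduced_tokens / 1_000_000 * price_per_million

    return {
        "reduced_tokens": reduced_tokens,
        "baseline_cost": baseline_cost,
        "reduced_cost": reduced_cost,
        "savings": baseline_cost - reduced_cost,
    }


if __name__ == "__main__":
    # Hypothetical workload: 1M tokens at a made-up price of 10 per 1M tokens.
    estimate = cost_after_reduction(1_000_000, 10.0)
    for key, value in estimate.items():
        print(f"{key}: {value}")
```

Under flat pricing the cost falls by the same fraction as the token count, so the savings depend entirely on what baseline the 47% figure is measured against.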

Key Points

Scores 75% on OSWorld computer-use benchmark, surpassing the 72.4% human baseline for the first time
Features a 1M-token context window and uses 47% fewer tokens for improved cost-efficiency
Engineered for agent workflows with enhanced steerability, allowing real-time interruption and adjustment of tasks

Why It Matters

Enables more reliable and cost-effective AI agents for automating complex software and web-based tasks, transforming enterprise workflows.
