What Makes a Good Terminal Bench Task

A viral benchmark task forces AI agents to install Windows XP SP3 in a QEMU virtual machine under a strict 2-hour timeout.

Deep Dive

A detailed breakdown of the 'install-windows-xp' task for Terminal Bench 3 has gone viral, illustrating the extreme complexity required to properly benchmark modern AI agents. The task, designed by a contributor and reviewer for the benchmark, requires an AI agent to install and run a 32-bit Windows XP SP3 virtual machine using QEMU. The environment is a multi-layered challenge: Windows XP must be installed inside QEMU, which is itself running inside a Docker container on a Linux host. The agent has a strict 2-hour timeout to complete the process, which includes using an OCR API to read the product key from a virtual CD-ROM box, creating a custom unattended installation ISO, and configuring VNC and nginx for remote monitoring.
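
To make the layering concrete, the innermost step might reduce to a QEMU launch along these lines. This is a minimal Python sketch, not the task's actual harness: the image paths, memory size, and VNC display number are all illustrative assumptions.

```python
import subprocess

# Minimal sketch of the innermost layer: a 32-bit QEMU guest booting a
# rebuilt unattended ISO inside the task's Docker container. All paths,
# the memory size, and the VNC display number are assumptions.
vm = subprocess.Popen([
    "qemu-system-i386",             # 32-bit guest, as the task requires
    "-m", "512",                    # XP-era RAM allocation
    "-hda", "xp.qcow2",             # virtual disk the agent later watches grow
    "-cdrom", "xp_unattended.iso",  # ISO rebuilt with the injected answer file
    "-boot", "d",                   # boot from the CD-ROM for the install pass
    "-vnc", ":1",                   # console exposed for VNC/nginx monitoring
])
# Popen keeps the VM running in the background so the agent can continue
# issuing commands (e.g., the disk-growth polling sketched further below).
```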

The core difficulty lies in creating a flawless unattended installation. The agent must extract the original Windows XP ISO, inject an answer file with dozens of configuration settings to suppress every GUI popup, add OEM scripts that create the required 'tb-admin' user account, and correctly rebuild the ISO. Any mistake causes the installer to fall back to interactive mode, which fails the task. After initiating the 30-60 minute install, the agent must monitor the virtual disk's growth, waiting for it to exceed 1 GB and stop expanding before rebooting the VM into the final login screen. The task is designed not to help the AI succeed but to rigorously test its ability to execute long, precise, and error-prone sequences of system commands, a key differentiator from standard prompting.
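
For a sense of what "injecting an answer file" involves: XP's CD installer reads a winnt.sif file from the I386 directory, and with OemPreinstall enabled, a $OEM$\cmdlines.txt script can create accounts at the end of setup. The Python sketch below is a heavily abbreviated version of that pipeline under stated assumptions: the real answer file needs far more settings, the product key and password are placeholders, the boot sector must come from the original disc, and the mkisofs flags shown are the commonly used ones for XP discs rather than the task's verified command line.

```python
import subprocess
from pathlib import Path

tree = Path("extracted_iso")  # contents of the original XP ISO (assumed path)

# Heavily abbreviated winnt.sif: the real answer file needs dozens of
# settings to suppress every prompt. The product key is a placeholder
# for the value the agent reads via OCR.
(tree / "I386" / "winnt.sif").write_text(
    "[Data]\n"
    'UnattendedInstall="Yes"\n'
    "AutoPartition=1\n"
    "[Unattended]\n"
    "UnattendMode=FullUnattended\n"
    "OemSkipEula=Yes\n"
    "OemPreinstall=Yes\n"
    "[UserData]\n"
    "ProductKey=XXXXX-XXXXX-XXXXX-XXXXX-XXXXX\n"
)

# With OemPreinstall=Yes, setup runs $OEM$\cmdlines.txt near the end of
# the GUI phase; one command there can create the 'tb-admin' account
# (placeholder password).
oem = tree / "$OEM$"
oem.mkdir(exist_ok=True)
(oem / "cmdlines.txt").write_text(
    "[Commands]\n"
    '"net user tb-admin Passw0rd! /add"\n'
)

# Rebuild a bootable ISO. boot.img must be the El Torito boot sector
# extracted from the original disc and placed in the tree; these flags
# are typical for XP ISOs, not the task's verified invocation.
subprocess.run(
    ["mkisofs", "-o", "xp_unattended.iso",
     "-b", "boot.img", "-no-emul-boot", "-boot-load-size", "4",
     "-iso-level", "2", "-J", "-N", "-joliet-long", "-relaxed-filenames",
     "-V", "XPSP3", str(tree)],
    check=True,
)
```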
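
The "watch the disk grow" step is equally mechanical. A minimal polling loop might look like the following; the filename, one-minute interval, and exact stop condition are assumptions, and a real agent would also enforce the overall 2-hour budget.

```python
import os
import time

DISK = "xp.qcow2"      # assumed filename of the VM's virtual disk
ONE_GB = 1024 ** 3     # the growth threshold the breakdown describes

# qcow2 images allocate space lazily, so host-side file size is a usable
# proxy for install progress: wait until the disk passes 1 GB and stops
# growing between samples, then reboot the VM to reach the login screen.
last_size = -1
while True:
    size = os.path.getsize(DISK)
    if size > ONE_GB and size == last_size:
        break          # large enough and no longer expanding
    last_size = size
    time.sleep(60)     # one sample per minute across the 30-60 minute install
```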

Key Points
  • The 'install-windows-xp' task for Terminal Bench 3 sets a 2-hour timeout for AI agents to complete a highly complex, multi-layered system installation.
  • Success requires building a custom unattended Windows XP ISO by injecting answer files and OEM scripts; any error drops the installer back into interactive GUI mode and fails the task.
  • The task is designed to benchmark true agentic capability, pushing models like GPT-4 and Claude on long-horizon reasoning and precise environment manipulation.

Why It Matters

It sets a new standard for evaluating AI agents on real-world, complex system administration tasks beyond simple chat benchmarks.