MLSN #19: Honesty, Disempowerment, & Cybersecurity
GPT-5 learns to confess safety violations; AI agent ARTEMIS outperforms human penetration testers for $59/hour.
OpenAI researchers have developed a novel training method to make GPT-5 more honest. Into a standard reinforcement learning (RL) loop they introduced a 'confession' phase 25% of the time: the model performs a task and then reports whether it complied with safety policies. A separate LLM judge scores the accuracy of these confessions, and GPT-5 is trained to maximize honesty regardless of whether it actually violated policy. The approach aims to create a transparency layer that helps detect misbehavior such as hallucination or cheating in controlled test environments. The researchers note a critical challenge: the training must incentivize only honesty, so they carefully separate confession training from capability-focused RL to avoid inadvertently selecting for sophisticated, deceptive models.
In a separate development in cybersecurity, researchers from Stanford, Carnegie Mellon, and Gray Swan AI benchmarked AI agents against human professionals. Their open-source agent, ARTEMIS, built from models available in August 2025, was tested on a large, real-world university computer network. In an evaluation that scored participants on the complexity and severity of the vulnerabilities they found, ARTEMIS configurations outperformed 9 of the 10 human penetration testers. The most successful configuration operated at roughly $59 per hour, a fraction of the cost of top human experts. By moving beyond toy environments, this research validates AI's competitive edge in real-world cyberoffense: a dual-use capability that can empower both defenders and attackers, raising urgent questions about the democratization of such powerful tools.
- OpenAI trained GPT-5 with a 'confession' loop 25% of the time to report safety policy violations honestly, judged by a separate LLM.
- The ARTEMIS AI agent, developed by Stanford/CMU/Gray Swan AI, outperformed 9 out of 10 human pros in real-world network penetration testing.
- The top-performing ARTEMIS configuration cost $59/hour, significantly less than human experts, demonstrating affordable, advanced automated cyberoffense.
Why It Matters
These developments offer a new tool for AI safety monitoring and show that affordable AI agents can now outperform most professional human experts at critical, real-world cybersecurity tasks.