MindCopilot paper introduces new metrics for human-AI co-writing evaluation
Beyond output quality: new framework measures how users actually interact with AI suggestions.
A new research paper accepted to IJCAI 2026 tackles a critical blind spot in evaluating AI writing assistants. Traditional benchmarks focus solely on output quality (e.g., fluency, relevance), ignoring how users actually accept, edit, or reject suggestions in real time. To address this, the team, from the University of Science and Technology of China and Shanghai AI Laboratory, introduces MindCopilot, a formal framework that models human-LLM co-writing as a Human-in-the-Loop Markov Decision Process (MDP). In this sequential, behavior-centered view, each interaction is a state shaped by user decisions, which lets the framework measure both alignment and cognitive effort.
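To make the sequential view concrete, here is a minimal Python sketch of a co-writing session as a chain of states driven by user decisions. It is an illustration only, not the paper's formalization: the `State`, `Decision`, `transition`, and `run_episode` names are hypothetical, and rewards and transition probabilities are left out entirely.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    """Hypothetical user actions on a single suggestion."""
    ACCEPT = "accept"
    EDIT = "edit"
    REJECT = "reject"


@dataclass(frozen=True)
class State:
    text: str  # document written so far
    step: int  # number of suggestions the user has seen


def transition(state: State, suggestion: str, decision: Decision, edited: str = "") -> State:
    """Apply one user decision to one suggestion and return the next state."""
    if decision is Decision.ACCEPT:
        added = suggestion   # suggestion taken as-is
    elif decision is Decision.EDIT:
        added = edited       # user's rewritten version is kept instead
    else:
        added = ""           # rejected: the document is unchanged
    return State(text=state.text + added, step=state.step + 1)


def run_episode(suggestions, decisions):
    """Replay a session; `decisions` holds (Decision, edited_text) pairs."""
    state = State(text="", step=0)
    trajectory = [state]
    for suggestion, (decision, edited) in zip(suggestions, decisions):
        state = transition(state, suggestion, decision, edited)
        trajectory.append(state)
    return trajectory
```

For example, replaying the suggestions `["Hello ", "wrld.", "Extra."]` with an accept, an edit to `"world."`, and a reject yields a four-state trajectory ending in the text `"Hello world."`. Interaction-aware metrics would be computed over trajectories like this one rather than over the final text alone.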
The paper's key technical contribution is the Co-Writing Fidelity Suite, a set of interaction-aware metrics that go beyond simple acceptance rates. It includes Hierarchical Acceptance Rate (which accounts for partial edits and multi-level acceptance) and Knowledge-aware Editing Distance (which measures how much a user's edits diverge from the suggestion in terms of semantic knowledge). To validate the framework, the researchers ran a large-scale simulation using 1,688 controlled continuation queries across 16 diverse writing domains (from fiction to technical reports). They also ran a user study with 30 participants, confirming that the behavioral patterns captured by the new metrics align closely with real user experience and perceived usability. The results show that interaction structure, such as suggestion granularity and timing, has systematic effects on acceptance behavior and editing cost, offering actionable guidance for designing AI assistants that collaborate rather than interrupt.
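The difference between a simple acceptance rate and one that credits partial edits can be shown with a toy calculation. The Python sketch below is an assumption-laden stand-in, not the paper's Hierarchical Acceptance Rate or Knowledge-aware Editing Distance: `kept_fraction` and `acceptance_rates` are hypothetical helpers, and the overlap is measured on surface words with the standard library's `difflib`, not on semantic knowledge.

```python
from difflib import SequenceMatcher


def kept_fraction(suggestion: str, final: str) -> float:
    """Share of the suggestion's words that survive, in order, in the final text."""
    s_words, f_words = suggestion.split(), final.split()
    if not s_words:
        return 0.0
    matcher = SequenceMatcher(a=s_words, b=f_words, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(s_words)


def acceptance_rates(pairs):
    """Return (strict, graded) rates over (suggestion, final_text) pairs.

    strict: fraction of suggestions whose words were all retained.
    graded: mean kept fraction, so partial edits earn partial credit.
    """
    fractions = [kept_fraction(s, f) for s, f in pairs]
    if not fractions:
        return 0.0, 0.0
    strict = sum(1 for x in fractions if x == 1.0) / len(fractions)
    graded = sum(fractions) / len(fractions)
    return strict, graded
```

With one suggestion kept whole, one with a single word changed out of four, and one rejected outright, the strict rate is 1/3 while the graded rate is about 0.58. That gap is the kind of information a binary accept/reject count discards.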
- Formalizes co-writing as a Human-in-the-Loop Markov Decision Process to model user acceptance and editing decisions.
- Introduces the Co-Writing Fidelity Suite, which includes Hierarchical Acceptance Rate and Knowledge-aware Editing Distance, for interaction-aware evaluation.
- Validated with 1,688 simulated continuation queries across 16 writing domains plus a 30-participant user study, which found that the metrics align closely with real user experience and perceived usability.
Why It Matters
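Metrics that capture how users accept, edit, and reject suggestions give designers a concrete basis for tuning suggestion granularity and timing, which could make AI writing assistants less disruptive for the professionals who use them.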