Simulate realistic users to evaluate multi-turn AI agents in Strands Evals
New tool generates goal-driven simulated users that adapt conversations based on agent responses, solving multi-turn testing challenges.
Strands has introduced ActorSimulator, a new component in its Evaluation SDK designed specifically for testing multi-turn AI agent interactions. Unlike single-turn evaluations, where inputs and expected outputs are fixed in advance, multi-turn conversations present a unique challenge: each user response depends on what the agent just said. Traditional testing methods fail at scale, since manual testing can't cover hundreds of conversation paths and simple LLM prompts produce inconsistent simulated users. ActorSimulator addresses this by creating structured, goal-driven simulated users that maintain consistent personas and adapt their behavior based on agent responses.
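The pattern is easy to picture in code. The sketch below is illustrative only, not the actual Strands Evals API: it shows a minimal goal-driven simulated user whose persona and goals are re-injected on every turn, with the LLM client (`llm`) left as a pluggable callable.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Persona:
    """Who the simulated user is and how they should behave."""
    name: str
    traits: str        # e.g. "impatient, budget-conscious"
    goals: list[str]   # explicit goals the actor works toward

@dataclass
class SimulatedUser:
    """Produces the next user turn from the conversation so far."""
    persona: Persona
    llm: Callable[[str, list[dict]], str]     # any chat-completion client
    history: list[dict] = field(default_factory=list)

    def next_turn(self, agent_reply: Optional[str]) -> str:
        if agent_reply is not None:
            self.history.append({"role": "agent", "content": agent_reply})
        # The persona and goals are baked into every prompt, which is what
        # keeps the simulated user consistent across turns.
        prompt = (
            f"You are {self.persona.name} ({self.persona.traits}). "
            f"Your goals: {'; '.join(self.persona.goals)}. "
            "Given the conversation so far, write your next message, "
            "staying in character and pushing toward your goals."
        )
        message = self.llm(prompt, self.history)
        self.history.append({"role": "user", "content": message})
        return message
```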
The tool enables teams to programmatically generate realistic user simulations that handle complex conversational dynamics such as follow-up questions, changes of direction, and expressions of frustration. The approach mirrors established simulation techniques from fields like flight simulation and game testing, bringing controlled realism to AI evaluation. By integrating directly into evaluation pipelines, ActorSimulator allows developers to systematically test how agents perform when users ask "Actually, can we look at trains instead?" after initially requesting flights, or when they express dissatisfaction with incomplete answers: scenarios that static test cases miss entirely.
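Building on the sketch above, a simulation loop might alternate turns between the agent under test and the simulated user until the goals appear satisfied or a turn budget runs out. Every name here (`run_simulation`, `goals_met`, `my_agent`, `my_llm`) is a hypothetical illustration of the technique, not Strands code; the naive keyword-based goal check in particular stands in for a proper LLM judge.

```python
from typing import Callable

def goals_met(user: SimulatedUser, reply: str) -> bool:
    """Naive keyword check; a real harness would use an LLM judge or rubric."""
    return all(goal.lower() in reply.lower() for goal in user.persona.goals)

def run_simulation(agent: Callable[[str], str],
                   user: SimulatedUser,
                   max_turns: int = 10) -> list[dict]:
    """Alternate turns between the agent under test and the simulated user
    until the user's goals appear satisfied or the turn budget runs out."""
    agent_reply = None
    for _ in range(max_turns):
        agent_reply = agent(user.next_turn(agent_reply))
        if goals_met(user, agent_reply):
            break
    user.history.append({"role": "agent", "content": agent_reply})
    return user.history

# A change-of-direction scenario can be scripted directly into the goals.
# `my_agent` and `my_llm` stand in for your agent under test and LLM client.
dana = Persona(
    name="Dana",
    traits="polite but easily frustrated",
    goals=["get travel options from Boston to NYC",
           "switch the request from flights to trains partway through"],
)
transcript = run_simulation(my_agent, SimulatedUser(dana, llm=my_llm))
```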
This structured simulation approach provides the repeatability and scalability needed for continuous agent development while capturing the adaptive nature of real human conversations. Teams can now run hundreds of multi-turn evaluations automatically after each agent change, identifying regressions and improvements in how agents handle the dynamic, unpredictable nature of production conversations that unfold over multiple exchanges.
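Scaling that loop to a regression suite is then a matter of fanning out over personas. The following sketch, again hypothetical rather than the Strands API, runs one simulation per persona in parallel and collects pass/fail results from a user-supplied `evaluate` scorer, which is the shape of harness you would wire into CI after every agent change.

```python
from concurrent.futures import ThreadPoolExecutor

def run_suite(agent, personas, evaluate, llm, max_workers: int = 8):
    """Run one simulation per persona and collect pass/fail results."""
    def one(persona: Persona) -> dict:
        transcript = run_simulation(agent, SimulatedUser(persona, llm=llm))
        return {"persona": persona.name,
                "turns": len(transcript),
                "passed": evaluate(transcript)}  # your rubric or LLM judge
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(one, personas))
```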
- ActorSimulator creates consistent, goal-driven simulated users that adapt to agent responses in real-time
- Solves the scalability problem: replaces manual testing of hundreds of conversation paths with automated evaluation
- Provides structured persona definitions and explicit goal tracking for reliable, repeatable testing results
Why It Matters
Enables systematic testing of how AI agents handle real-world conversational complexity at scale, catching failures before deployment.