Launch HN: Cekura (YC F24) – Testing and monitoring for voice and chat AI agents
Simulates thousands of user conversations to catch regressions in voice and chat agents before they reach production.
Cekura, a Y Combinator F24 startup founded by Tarush, Sidhant, and Shashij, has launched a platform for testing and monitoring voice and chat AI agents. The core problem is that manual QA and scripted tests fail to scale for AI agents, whose behavior can vary across thousands of potential user interactions. Cekura's answer is simulation: it deploys synthetic users to interact with an agent, while LLM-based judges evaluate the correctness of the entire conversational arc. This approach is built to catch regressions from prompt changes, model swaps, or new tool integrations before they impact real users.
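To make the approach concrete, here is a minimal sketch of the simulated-user-plus-judge loop. It is an illustration of the general technique, not Cekura's actual implementation; `call_llm` is a hypothetical stand-in for any chat-completion API, and the persona and rubric strings are made up for the example.

```python
def call_llm(system_prompt: str, messages: list[dict]) -> str:
    """Hypothetical placeholder for a chat-completion call (any provider)."""
    raise NotImplementedError

def simulate_conversation(agent_prompt: str, persona: str, max_turns: int = 10) -> list[dict]:
    """A synthetic user (one LLM) converses with the agent under test (another LLM)."""
    transcript: list[dict] = []
    user_msg = call_llm(f"You are a user: {persona}. Open the conversation.", transcript)
    for _ in range(max_turns):
        transcript.append({"role": "user", "content": user_msg})
        agent_msg = call_llm(agent_prompt, transcript)
        transcript.append({"role": "assistant", "content": agent_msg})
        user_msg = call_llm(f"You are a user: {persona}. Continue, or say DONE.", transcript)
        if "DONE" in user_msg:
            break
    return transcript

def judge_session(transcript: list[dict], rubric: str) -> dict:
    """An LLM judge scores the *entire* conversation against a rubric,
    e.g. 'the agent must verify identity before discussing the account'."""
    verdict = call_llm(
        f"Judge this conversation against the rubric: {rubric}. "
        "Reply PASS or FAIL, then a one-line reason.",
        transcript,
    )
    return {"passed": verdict.startswith("PASS"), "reason": verdict}
```

The key point of the design is that the judge sees the full transcript rather than isolated turns, which is what lets it flag flow-level failures like a skipped step.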
The platform's effectiveness hinges on three technical pillars. First, it generates test scenarios from agent descriptions and, crucially, automatically extracts test cases from real production conversations, so coverage evolves with user behavior. Second, its mock tool platform lets developers define tool schemas and canned return values, enabling fast, deterministic testing of an agent's decision logic without hitting live APIs (sketched below). Third, it structures test cases as conditional action trees for consistent, repeatable evaluations, avoiding the flakiness that stochastic LLM output would otherwise introduce (see the example after the takeaways). Unlike turn-based tracing platforms like Langfuse, Cekura evaluates full sessions to identify failures in conversational flow, such as a banking agent skipping a verification step. The service offers a 7-day free trial, with paid plans starting at $30 per month.
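A rough sketch of what mock tools look like under these assumptions: each tool is declared with a schema and a fixed return value, so every test run sees the same world state. The tool names, schemas, and dispatcher here are illustrative, not Cekura's API.

```python
MOCK_TOOLS = {
    "lookup_order": {
        "schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
        # Canned return value: deterministic across runs.
        "returns": {"order_id": "A-1001", "status": "shipped", "eta": "2025-06-02"},
    },
    "issue_refund": {
        "schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}, "amount": {"type": "number"}},
            "required": ["order_id", "amount"],
        },
        "returns": {"ok": True, "refund_id": "R-77"},
    },
}

def dispatch_tool_call(name: str, args: dict) -> dict:
    """Stand-in for the tool-execution step of the agent loop: check the call
    against the declared schema, then return the canned payload instead of
    hitting a live API."""
    tool = MOCK_TOOLS[name]
    missing = [k for k in tool["schema"]["required"] if k not in args]
    if missing:
        raise ValueError(f"agent called {name} without required args: {missing}")
    return tool["returns"]
```

Because the agent's tool calls are validated but never leave the test harness, a test failure points at the agent's decision logic rather than at a flaky downstream service.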
- Uses LLM-based judges and synthetic users to simulate and evaluate full conversational sessions, not just single turns.
- Features a mock tool platform and deterministic test cases to ensure consistent, non-flaky CI/CD testing for AI agents.
- Automatically generates test coverage from both agent descriptions and ingested real user conversations to catch evolving failure modes.
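And here is a hypothetical encoding of a test case as a conditional action tree, using the banking-verification scenario from above. Each node pairs a condition on the agent's last reply with the synthetic user's next action; the field names are invented for illustration. In practice the `if_agent` condition matching would likely be done by a small LLM classifier, but because the branching is explicit, each run follows one deterministic path and is scored the same way every time.

```python
VERIFICATION_TEST = {
    "say": "I'd like to check my account balance.",
    "branches": [
        {
            # Expected path: the agent asks for identity verification first.
            "if_agent": "asks for verification",
            "then": {
                "say": "My date of birth is 1990-01-15.",
                "branches": [
                    {"if_agent": "states the balance", "then": {"verdict": "pass"}},
                ],
            },
        },
        {
            # Failure path: the agent reveals the balance without verifying.
            "if_agent": "states the balance",
            "then": {"verdict": "fail: skipped verification step"},
        },
    ],
}
```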
Why It Matters
Enables reliable, scalable QA for AI agents, preventing costly behavioral regressions that turn-based monitoring tools can miss.