Scalable and Personalized Oral Assessments Using Voice AI
A multi-agent AI system graded student oral exams at $0.42 each, achieving high reliability.
Researchers Panos Ipeirotis and Konstantinos Rizakos have developed a voice AI system that, for the first time, makes oral examinations scalable and affordable. The system conducted 36 oral assessments for an undergraduate AI/ML course at a total cost of $15, or about $0.42 per student. At that price, oral comprehension checks can be attached to every assignment rather than reserved for high-stakes finals, fundamentally changing assessment practice.
The system employs a multi-agent architecture that decomposes each examination into structured phases. A key innovation is a "council" of three different LLM families (likely including models such as GPT-4, Claude, and Llama) that grades each transcript through a deliberation process in which models revise their scores after reviewing peer evidence. The council achieved an inter-rater reliability of Krippendorff's α = 0.86, exceeding conventional thresholds for human grading consistency.
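The council-with-deliberation idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the three LLM graders are stood in for by numeric scores, "revising after reviewing peer evidence" is approximated by nudging each score toward the mean of its peers, and the function names and revision rule are hypothetical.

```python
from statistics import mean, median

def deliberation_round(scores):
    """One deliberation round: each grader revises its score halfway
    toward the mean of its peers' scores (a stand-in for revising
    after reviewing peer evidence)."""
    revised = []
    for i, s in enumerate(scores):
        peers = [x for j, x in enumerate(scores) if j != i]
        revised.append((s + mean(peers)) / 2)
    return revised

def council_grade(scores, rounds=1):
    """Run the given number of deliberation rounds, then report the
    median as the council's final grade."""
    for _ in range(rounds):
        scores = deliberation_round(scores)
    return median(scores)

# Three hypothetical graders score a transcript 6, 8, and 10 out of 10.
print(deliberation_round([6, 8, 10]))  # scores pull toward one another
print(council_grade([6, 8, 10]))       # the council's final grade
```

Taking the median after deliberation, rather than a plain average of the initial scores, keeps a single outlier grader from dominating the final grade, which is one plausible reason a deliberating council can reach high inter-rater agreement.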
Despite its successes, the system revealed important limitations of current AI capabilities. The researchers found that behavioral constraints on LLMs must be enforced through system architecture rather than prompting alone: the examiner agent sometimes stacked questions despite explicit prohibitions and could not properly randomize case selection. Using a cloned professorial voice also backfired; students perceived it as aggressive rather than familiar.
Student feedback showed that 70% agreed the format tested genuine understanding, though 83% found it more stressful than written exams, unsurprising given that 83% had never taken an oral examination before. The researchers document the full system design, failure modes, and student experience, and provide all prompts as appendices to support further development in this area of educational technology.
- Conducted 36 oral exams for $15 total ($0.42 per student), making frequent oral assessments economically viable
- Achieved inter-rater reliability of α = 0.86 using a council of three LLM families with deliberation rounds
- 70% of students agreed it tested genuine understanding, though 83% found it more stressful than written exams
Why It Matters
Provides a scalable, low-cost solution to verify genuine student understanding in the age of LLM-assisted cheating on written assignments.