Research & Papers

Evaluating Large Language Models' Responses to Sexual and Reproductive Health Queries in Nepali

New LEAF framework reveals that only 35.1% of LLM responses to sexual health queries in Nepali were 'proper'.

Deep Dive

A team of 18 researchers, led by Medha Sharma and Supriya Khadka, has published a groundbreaking study evaluating how Large Language Models (LLMs) like ChatGPT handle sensitive, real-world queries. They introduced the LLM Evaluation Framework (LEAF), a novel assessment tool that moves beyond simple accuracy metrics. LEAF evaluates responses across four critical dimensions: factual accuracy, language quality, usability (including relevance, adequacy, and cultural appropriateness), and safety (covering sensitivity and confidentiality). This holistic approach is designed for culturally sensitive domains like Sexual and Reproductive Health (SRH), especially in low-resource languages.
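To make the rubric concrete, here is a minimal sketch of what a LEAF-style annotation record could look like in code. The class, field names, and all-pass logic are hypothetical illustrations of the four dimensions described above, not the authors' actual annotation schema.

```python
from dataclasses import dataclass

@dataclass
class LeafAnnotation:
    """Hypothetical per-response annotation covering LEAF's four
    dimensions; the paper's real labels may be richer than booleans."""
    factually_accurate: bool         # accuracy
    language_quality_ok: bool        # language quality (fluent, correct Nepali)
    relevant: bool                   # usability: relevance
    adequate: bool                   # usability: adequacy
    culturally_appropriate: bool     # usability: cultural appropriateness
    sensitive_tone: bool             # safety: sensitivity
    preserves_confidentiality: bool  # safety: confidentiality

    def is_proper(self) -> bool:
        # "Proper" here means no gap on any dimension, matching how
        # the study's headline statistic is described.
        return all((
            self.factually_accurate,
            self.language_quality_ok,
            self.relevant,
            self.adequate,
            self.culturally_appropriate,
            self.sensitive_tone,
            self.preserves_confidentiality,
        ))
```

Collapsing each dimension to a pass/fail flag is a simplification, but it captures the key design choice: a single failure on any axis disqualifies a response.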

Applying LEAF to over 14,000 real user queries in Nepali, sourced from more than 9,000 individuals, the researchers manually annotated responses with the help of SRH experts. The results were stark: only 35.1% of LLM-generated responses were classified as "proper," meaning they were accurate, adequate, and free of major usability or safety gaps. The remaining 64.9% were flawed in some critical way, whether through factual inaccuracy, cultural insensitivity, or safety oversights. The study also noted performance differences between versions of ChatGPT, which showed similar accuracy but diverged significantly in usability and safety.
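Under the same assumptions, the headline figure is simply the fraction of annotated responses that clear every check. A hedged sketch, reusing the hypothetical LeafAnnotation record above:

```python
def proper_rate(annotations: list[LeafAnnotation]) -> float:
    """Share of responses with no accuracy, language, usability, or
    safety gap (the study reports 35.1% on its Nepali corpus)."""
    if not annotations:
        return 0.0
    # bools sum as 0/1, so this counts all-pass responses.
    return sum(a.is_proper() for a in annotations) / len(annotations)
```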

This research exposes a major blind spot in the current AI landscape. As LLMs become everyday tools for anonymous, judgment-free advice, a failure rate this high in sensitive, non-English contexts is alarming. The LEAF framework provides a scalable, adaptable blueprint for developers and companies to rigorously test and improve their models. It underscores that global AI safety and utility cannot be achieved by optimizing only for English or for objective factuality; cultural nuance and user safety in sensitive domains are equally vital benchmarks for the next generation of AI assistants.

Key Points
  • Only 35.1% of LLM responses to over 14,000 Nepali sexual health queries were deemed 'proper' (accurate, adequate, safe).
  • Researchers introduced the LEAF framework, evaluating Accuracy, Language, Usability, and Safety gaps beyond standard benchmarks.
  • The study highlights critical performance disparities in culturally sensitive, low-resource language contexts, a major blind spot for current AI.

Why It Matters

Reveals that AI assistants frequently fail on sensitive, non-English queries, demanding new safety and cultural benchmarks for global deployment.