AI therapy bots blocked by moderation tools flagging real session content
Moderation systems from OpenAI, Meta, and Google flag real therapy conversations as undesirable.
A new study from researchers at an undisclosed institution (arXiv:2605.25454, submitted May 2026) evaluated how three leading content moderation systems handle real therapy conversations. The team tested OpenAI's moderation endpoint, Meta's Llama Guard, and Google's ShieldGemma by feeding them transcripts from actual therapy sessions. They found that all three systems consistently flagged emotionally heavy content, such as discussions of self-harm, abuse, and trauma, as violating safety policies, even though such topics are essential to therapeutic work.
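For teams that want to run a similar check on their own data, the kind of audit described above can be sketched roughly as follows. This is a minimal illustration, not the study's method: `flag_rates`, the `Moderator` type, and `keyword_moderator` are hypothetical names, and the keyword filter is only a toy stand-in for a real moderation system, which would be called through its own client library.

```python
from typing import Callable, Dict, List

# Hypothetical interface: a moderator takes text and returns True if it flags it.
Moderator = Callable[[str], bool]


def flag_rates(transcripts: List[str], moderators: Dict[str, Moderator]) -> Dict[str, float]:
    """Return, for each moderator, the fraction of transcripts it flags."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    rates = {}
    for name, moderate in moderators.items():
        flagged = sum(1 for text in transcripts if moderate(text))
        rates[name] = flagged / len(transcripts)
    return rates


def keyword_moderator(text: str) -> bool:
    """Toy stand-in for a real moderation system (assumption, not a real API)."""
    sensitive_terms = ("self-harm", "abuse", "trauma")
    return any(term in text.lower() for term in sensitive_terms)


# Invented sample lines for illustration only, not data from the study.
sample = [
    "Client described childhood trauma and how it affects sleep.",
    "We reviewed the breathing exercise from last week.",
]
rates = flag_rates(sample, {"toy_keyword_filter": keyword_moderator})
print(rates)  # {'toy_keyword_filter': 0.5}
```

A high flag rate on transcripts that clinicians consider appropriate would point to the same mismatch the study reports.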
This creates a fundamental paradox: the very guardrails that make LLMs safe for general use make them ineffective as therapists. The authors argue that if AI is to play a role in mental health, current moderation approaches must be redesigned to distinguish between harmful content and clinically necessary discussions. The study has significant implications for startups and organizations building AI therapists, as they may need to develop custom moderation or risk liability. The paper is available on arXiv and has been submitted for peer review.
- Audited OpenAI's moderation endpoint, Meta's Llama Guard, and Google's ShieldGemma on real therapy transcripts
- All three systems frequently flagged sensitive but clinically necessary topics (trauma, suicide, abuse) as undesirable
- Raises a fundamental conflict between safety guardrails and the need for open discussion in AI therapy contexts
Why It Matters
The conflict could stall the development of effective AI therapists unless moderation systems evolve to handle clinical conversations.