Research & Papers

Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection

A new analysis shows an AI moderation model with 94% accuracy still fails on indirect toxicity and political discourse.

Deep Dive

Researchers Trishita Dhara and Siddhesh Sheth have published a paper titled 'Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection,' accepted at TrustNet 2026. The study performs a deep, explainability-driven audit of a neural harmful content detection model, specifically a RoBERTa-based classifier trained on the Civil Comments dataset. Using two popular post-hoc explanation methods—Shapley Additive Explanations (SHAP) and Integrated Gradients—the researchers analyzed not just correct predictions but, more critically, the model's systematic failure cases. This approach moves beyond aggregate metrics to understand *why* the model makes its decisions, especially in challenging, contextual scenarios.
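
The audit pipeline itself is not included in this summary, but the setup the paper describes can be approximated with standard tooling. The sketch below shows how SHAP token attributions might be obtained from a RoBERTa-style toxicity classifier through a Hugging Face pipeline; the checkpoint name and example comments are illustrative placeholders, not the authors' released model or data.

```python
import shap
from transformers import pipeline

# Hypothetical fine-tuned checkpoint; any RoBERTa sequence classifier would do.
clf = pipeline(
    "text-classification",
    model="your-org/roberta-civil-comments",  # placeholder, not the paper's model
    top_k=None,        # return scores for every label, not just the top one
    truncation=True,
)

# SHAP wraps the pipeline and perturbs the input text to estimate
# per-token contributions to each class score.
explainer = shap.Explainer(clf)

examples = [
    "People like you should not be allowed to vote.",   # implicitly hostile
    "The senator's immigration policy is a disaster.",  # heated but non-toxic
]
shap_values = explainer(examples)

# Token-level attributions for the first example, one column per output label.
print(shap_values[0])
```

Inspecting which tokens push the toxicity score up or down on examples like these is the kind of per-prediction evidence such an audit relies on, rather than aggregate test-set scores.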

Despite the model's strong overall performance, with an Area Under the Curve (AUC) of 0.93 and an accuracy of 0.94, the explainability analysis uncovered significant limitations invisible to standard evaluation. The two explanation methods produced divergent attributions: SHAP tended to focus on explicit lexical cues, while Integrated Gradients extracted more diffuse contextual signals. These attribution patterns lined up with real-world failure modes, including false negatives on indirect toxicity (where harmful intent is implied rather than stated) and false positives on political discourse and other cases of lexical over-attribution, where charged but non-toxic wording drives the prediction.
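
Integrated Gradients attributions, by contrast, are usually computed against the model's embedding layer. A minimal Captum-based sketch under the same assumptions (placeholder checkpoint, toxic class assumed at index 1) is shown below; lining up its per-token scores against SHAP's output for the same comment is one way to surface the kind of divergence the paper reports.

```python
import torch
from captum.attr import LayerIntegratedGradients
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "your-org/roberta-civil-comments"  # placeholder RoBERTa checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

def toxic_logit(input_ids, attention_mask):
    # Forward pass returning the logit of the assumed 'toxic' class (index 1).
    return model(input_ids=input_ids, attention_mask=attention_mask).logits[:, 1]

text = "People like you should not be allowed to vote."
enc = tokenizer(text, return_tensors="pt")

# Baseline: the same sequence with every non-special token replaced by <pad>.
baseline = torch.full_like(enc["input_ids"], tokenizer.pad_token_id)
baseline[0, 0] = enc["input_ids"][0, 0]    # keep <s>
baseline[0, -1] = enc["input_ids"][0, -1]  # keep </s>

# Attribute the toxic logit to the RoBERTa embedding layer's output.
lig = LayerIntegratedGradients(toxic_logit, model.roberta.embeddings)
attributions = lig.attribute(
    enc["input_ids"],
    baselines=baseline,
    additional_forward_args=(enc["attention_mask"],),
    n_steps=50,
)

# Sum over the embedding dimension to get a single score per token.
scores = attributions.sum(dim=-1).squeeze(0)
for token, score in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]), scores.tolist()):
    print(f"{token:>12s} {score:+.3f}")
```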

The paper's core argument is that explainable AI (XAI) should be viewed primarily as a transparency and diagnostic resource for building trust in automated moderation systems, not merely as a tool to enhance raw accuracy. By exposing model uncertainty and the rationale behind decisions, XAI can foster effective human-in-the-loop moderation. This work provides a crucial framework for developers and platform operators to audit and improve their content safety systems by understanding their specific weaknesses.

Key Points
  • Study of a RoBERTa classifier with 0.93 AUC and 94% accuracy revealed hidden failure modes through explainability methods.
  • SHAP and Integrated Gradients produced divergent attributions, exposing failures on indirect toxicity and political discourse.
  • Posits explainability as a key diagnostic tool for transparency in moderation systems, not just a performance lever.

Why It Matters

For platforms using AI moderation, this research provides a method to audit and improve systems by understanding their specific, contextual weaknesses.