AERIC uses 387 parameters to spot implicit harmful AI dialogue early
New monitor reads hidden states during generation, catching harm before it surfaces.
Current safety guards for language models either check completed text or monitor output token by token, but both can miss implicit harm: subtle phrasing that is not overtly toxic yet leads to dangerous completions. AERIC, an anticipatory hidden-state monitor, addresses this by reading the generator's internal hidden states during the same forward pass used for decoding, so it needs no additional forward pass. Its default linear monitor has only 387 trainable head parameters, yet it outperforms the much larger Qwen3GuardStream-4B on balanced benchmarks: AUROC is 0.7143 versus 0.6830 on DiaSafety and 0.8582 versus 0.8219 on Harmful Advice. AERIC also uses a source-side safe-budget rule to maximize harmful trigger coverage while keeping false positives under 10%.
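The false-positive budget mentioned above can be illustrated with a small calibration routine. This is a minimal sketch of one common way to enforce such a budget, not AERIC's actual rule: it assumes the monitor emits one scalar score per example, and the function names and toy scores are hypothetical.

```python
import math


def calibrate_threshold(safe_scores, max_fpr=0.10):
    """Return the lowest threshold that flags at most max_fpr of the safe scores.

    Assumption: an example is flagged when its score is strictly greater
    than the threshold. This is an illustrative rule, not the paper's.
    """
    if not safe_scores:
        raise ValueError("need at least one safe score")
    ordered = sorted(safe_scores, reverse=True)
    # Number of safe examples we may flag without exceeding the budget.
    allowed = min(math.floor(max_fpr * len(ordered)), len(ordered) - 1)
    # Only scores strictly above this value are flagged, so at most
    # `allowed` safe examples can trigger, even when scores are tied.
    return ordered[allowed]


def flagged_rate(scores, threshold):
    """Fraction of scores strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


# Toy monitor scores (hypothetical data, for illustration only).
safe = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.6, 0.9]
harmful = [0.3, 0.7, 0.8, 0.95]

threshold = calibrate_threshold(safe, max_fpr=0.10)
false_positive_rate = flagged_rate(safe, threshold)
coverage = flagged_rate(harmful, threshold)
print(threshold, false_positive_rate, coverage)
```

Choosing the lowest threshold that still fits the budget is what maximizes coverage here: any lower value would flag more harmful examples but also push the safe-side false-positive rate past the limit.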
On efficiency, AERIC adds only 2.34% mean latency on a 63-prompt harmful-generation benchmark using Qwen3-8B, compared with a 79.40% increase for Qwen3GuardStream-4B. It withholds between 23.53 and 41.86 answer tokens on average across HarmBench DirectRequest and SocialHarmBench for both Qwen and Gemma models. This makes AERIC practical for real-time deployment where compute and time are constrained. The approach is transfer-oriented, meaning it can be applied to different base models without retraining the entire safety system. By forecasting implicit harmful drift from internal trajectories, AERIC closes a critical gap in AI safety: it catches dangerous content early enough to prevent exposure, without the heavy cost of full streaming guard models.
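To make the idea of monitoring during decoding concrete, here is a minimal sketch of a linear probe that scores each step's hidden state and stops releasing tokens once the score crosses a threshold. Everything in it is an assumption for illustration: the two-dimensional hidden states, the weights, the threshold, and the function names are hypothetical, and the layout of AERIC's 387 parameters is not described here.

```python
import math


def probe_score(hidden_state, weights, bias):
    """Linear probe followed by a sigmoid, giving a score in (0, 1)."""
    logit = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def monitored_decode(steps, weights, bias, threshold):
    """Release tokens until the probe score exceeds the threshold.

    `steps` yields (token, hidden_state) pairs, standing in for what a
    generator would produce in its normal decoding pass.
    Returns (released_tokens, triggered).
    """
    released = []
    for token, hidden_state in steps:
        if probe_score(hidden_state, weights, bias) > threshold:
            return released, True  # stop before this token is shown
        released.append(token)
    return released, False


# Toy trajectory: the hidden state drifts toward the flagged direction.
steps = [
    ("t0", [0.0, 1.0]),
    ("t1", [0.2, 0.5]),
    ("t2", [2.0, 0.0]),
    ("t3", [3.0, 0.0]),
]
weights, bias = [1.0, -1.0], 0.0

released, triggered = monitored_decode(steps, weights, bias, threshold=0.8)
withheld = len(steps) - len(released)
print(released, triggered, withheld)
```

Because the probe is a single dot product on a vector the generator has already computed, its cost per step is small next to the forward pass itself, which is the intuition behind the low latency overhead reported above.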
- AERIC uses only 387 trainable parameters and requires no additional forward passes.
- Improves AUROC on DiaSafety from 0.6830 to 0.7143 and on Harmful Advice from 0.8219 to 0.8582 over Qwen3GuardStream-4B.
- Adds just 2.34% mean latency, versus 79.40% for Qwen3GuardStream-4B, on a fixed 63-prompt harmful-generation benchmark.
Why It Matters
Enables real-time, lightweight safety monitoring for LLMs, catching implicit harmful dialogue before it reaches users.