AI Safety

SafeMath: Inference-time Safety improves Math Accuracy

A new inference-time method reduces harmful outputs in math word problems by roughly 40% while actually improving reasoning scores.

Deep Dive

A research team from IIT Kharagpur and Microsoft has identified a critical vulnerability in how large language models (LLMs) handle mathematical word problems. Their paper, "SafeMath: Inference-time Safety improves Math Accuracy," reveals that seemingly benign math questions framed as natural-language narratives can subtly propagate biased, unethical, or psychologically harmful content. This is particularly dangerous in educational settings, where children might encounter these AI-generated problems. To study the phenomenon systematically, the team created ToxicGSM, a dataset of 1,900 arithmetic problems in which harmful context is embedded in the problem narrative while the underlying mathematical reasoning task is left intact.
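
The summary does not include a sample item, but a ToxicGSM-style record can be pictured as a benign GSM8K-style problem paired with a harmfully reframed variant that shares the same numbers and answer. The sketch below is purely illustrative: the field names and example wording are assumptions, not the dataset's published schema.

```python
# Hypothetical ToxicGSM-style record. Field names and wording are
# illustrative assumptions, not the dataset's actual schema; the point
# is that both framings pose the identical arithmetic task.
record = {
    "id": "toxicgsm-0001",
    "benign_question": (
        "A teacher hands out 4 worksheets to each of her 12 students. "
        "How many worksheets does she hand out in total?"
    ),
    "toxic_question": (
        "A teacher hands out 4 detention slips to each of the 12 students "
        "she considers slow learners. How many slips does she hand out in total?"
    ),
    "answer": 48,  # 4 * 12 in either framing; only the narrative differs
}
```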

Using this dataset, the researchers audited existing LLMs and found significant safety failures. They then developed SafeMath, a novel safety alignment technique applied at inference time (when the model generates answers). Unlike traditional alignment methods, which often degrade task performance, SafeMath actually improves mathematical accuracy while reducing harmful outputs by approximately 40%. The technique works by separating the evaluation of linguistic safety from the mathematical reasoning process, allowing the model to filter toxic content without compromising its ability to solve the underlying math problem. The team has publicly released both their source code and the ToxicGSM dataset to advance research in this area.
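
The summary describes the mechanism only at a high level, so the following is a minimal sketch of one way to disentangle safety evaluation from reasoning at inference time: a first pass neutralizes harmful framing in the narrative, and a second pass solves only the sanitized problem. The `generate` callable, both prompt templates, and the two-pass structure are assumptions for illustration, not the authors' released implementation.

```python
from typing import Callable

SAFETY_PROMPT = (
    "Rewrite the following math word problem so that any biased, unethical, "
    "or harmful framing is replaced with neutral context. Keep every number "
    "and the question being asked exactly the same.\n\nProblem: {problem}"
)

SOLVER_PROMPT = (
    "Solve the following math word problem step by step, then state the "
    "final numeric answer.\n\nProblem: {problem}"
)

def safe_math_inference(problem: str, generate: Callable[[str], str]) -> str:
    """Two-pass inference: sanitize the narrative, then solve the math.

    `generate` is any text-in/text-out LLM call (an assumed interface).
    """
    # Pass 1: linguistic safety only -- neutralize toxic context while
    # preserving the quantities and the question being asked.
    sanitized = generate(SAFETY_PROMPT.format(problem=problem))
    # Pass 2: mathematical reasoning only -- the solver never sees the
    # original harmful framing, so filtering cannot suppress the math.
    return generate(SOLVER_PROMPT.format(problem=sanitized))
```

Because safety filtering in this sketch acts on the narrative alone and the solver conditions only on the sanitized text, decoupling of this kind is at least consistent with the paper's reported result that harmful outputs drop without hurting accuracy.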

Key Points
  • Created the ToxicGSM dataset: 1,900 math problems with embedded harmful contexts for testing AI safety
  • SafeMath reduces harmful outputs by roughly 40% while maintaining or improving mathematical accuracy
  • Operates at inference time by disentangling safety evaluation from the mathematical reasoning process

Why It Matters

Enables safer AI tutors and educational tools by preventing hidden bias in math problems without sacrificing accuracy.