MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning
A new 571K-instruction dataset tackles the scarcity of French medical data for AI, showing native sources are best.
A research team from Aix-Marseille University and partners has released MedInjection-FR, a large-scale dataset designed to solve a critical bottleneck in non-English AI: the scarcity of high-quality, domain-specific instruction data. The dataset comprises 571,000 instruction-response pairs for French biomedical tasks, meticulously compiled from three distinct sources: native French medical texts, AI-generated synthetic data, and professionally translated English medical instructions. This structure allows for a controlled study on how data provenance impacts model performance.
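A minimal sketch of how such a three-source corpus might be blended for instruction tuning. The field names (`instruction`, `response`, `source`), the toy records, and the sampling weights are illustrative assumptions, not the dataset's actual schema or the paper's exact recipe:

```python
import random

# Hypothetical records standing in for the three provenance types.
# The schema here is an assumption for illustration only.
native = [{"instruction": f"Question clinique {i}", "response": "...", "source": "native"} for i in range(100)]
synthetic = [{"instruction": f"Question generee {i}", "response": "...", "source": "synthetic"} for i in range(100)]
translated = [{"instruction": f"Question traduite {i}", "response": "...", "source": "translated"} for i in range(100)]

def mix_sources(pools, weights, n, seed=0):
    """Sample n examples across provenance pools in proportion to weights."""
    rng = random.Random(seed)
    total = sum(weights)
    mixed = []
    for pool, w in zip(pools, weights):
        k = round(n * w / total)
        mixed.extend(rng.sample(pool, min(k, len(pool))))
    rng.shuffle(mixed)  # interleave sources before training
    return mixed

# Weight native supervision most heavily, in line with the paper's finding
# that native data drives the strongest performance.
train_mix = mix_sources([native, synthetic, translated], weights=[2, 1, 1], n=200)
```

Varying the weights per configuration is one simple way to reproduce the kind of controlled provenance study the paper describes.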
In their experiments, the team fine-tuned the Qwen-4B-Instruct model across seven different configurations using these data sources. The results were clear: models trained primarily on native French data performed best. However, combining native and translated data offered complementary benefits, while synthetic data alone was less effective but could still contribute positively when balanced with native supervision. The evaluation combined automatic metrics, LLM-as-a-judge scoring, and human expert review, revealing that while LLM judgments correlated well with human ratings, they were sensitive to response verbosity.
This research provides a crucial roadmap for developing capable AI in specialized, non-English domains like medicine. It demonstrates that while authentic, native data is irreplaceable for top performance, strategic use of translated and synthetic data can effectively mitigate data scarcity. The findings and the open-sourced MedInjection-FR dataset are a significant step toward more equitable and capable multilingual AI in critical fields.
- Dataset contains 571,000 French biomedical instruction pairs from native, synthetic, and translated sources.
- Fine-tuning Qwen-4B-Instruct showed native data yields the strongest performance, with a native+translated mix adding complementary gains.
- Evaluation used automatic metrics, LLM-as-a-judge, and human experts, finding LLM judgments correlate with humans but are sensitive to verbosity.
Why It Matters
Provides a blueprint for building high-quality, specialized AI in languages beyond English, crucial for global healthcare equity.