The Last Fingerprint: How Markdown Training Shapes LLM Prose
New study shows em dash frequency is a hidden signature of how models like GPT-4 and Claude were fine-tuned.
A new research paper titled 'The Last Fingerprint: How Markdown Training Shapes LLM Prose' by E. M. Freeburg provides the first mechanistic explanation for why large language models (LLMs) overuse em dashes. The study proposes that the em dash is the smallest surviving unit of structural formatting that 'leaks' from the markdown-saturated training corpora into plain prose. This connects two previously isolated observations: that LLMs default to markdown-formatted output and that em dash frequency is a widely discussed marker of AI-generated text.
The research tested this hypothesis with a suppression experiment across twelve major models from five providers: Anthropic, OpenAI, Meta, Google, and DeepSeek. When models were instructed to avoid markdown formatting, overt features like headers and bold text disappeared, but em dashes persisted, except in Meta's Llama models, which produced none at all. Em dash frequency under suppression varied dramatically, from 0.0 per 1,000 words for Llama models to 9.1 for GPT-4.1, giving each model family a distinctive signature.
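The study's core metric, em dashes per 1,000 words, is easy to reproduce on any text sample. Below is a minimal sketch, not the paper's actual code: the whitespace-based word count and the decision to count only the Unicode em dash (U+2014) are my assumptions.

```python
def em_dash_rate(text: str) -> float:
    """Return em dashes per 1,000 words.

    Assumptions (not from the paper): words are whitespace-separated
    tokens, and only the Unicode em dash U+2014 is counted, ignoring
    en dashes and double hyphens.
    """
    words = text.split()
    if not words:
        return 0.0
    dashes = text.count("\u2014")
    return 1000 * dashes / len(words)


# Demo: 100 words containing a single em dash -> rate of 10.0
sample = ("word " * 99) + "dash\u2014here"
print(em_dash_rate(sample))  # → 10.0
```

On this scale, the reported spread is stark: a Llama output would score 0.0, while a GPT-4.1 output under suppression would still score around 9.1.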
Further experiments, including a three-condition suppression gradient and a base-vs-instruct model comparison, showed that explicit prohibition often fails to eliminate the artifact and that the latent tendency exists even before reinforcement learning from human feedback (RLHF). The findings recast em dash frequency from a mere stylistic quirk into a diagnostic tool for reverse-engineering the fine-tuning methodology and data composition behind different AI models, offering a new lens for AI forensics and model analysis.
Key Findings
- Em dash overuse is a 'leak' from markdown training data, not random style, with frequency varying from 0.0 (Llama) to 9.1 per 1k words (GPT-4.1).
- Meta's Llama models produced zero em dashes under suppression, a distinctive fingerprint relative to every other provider tested.
- The tendency exists pre-RLHF, making it a diagnostic for fine-tuning methods rather than just a post-training quirk.
Why It Matters
Provides a forensic tool for AI detection and model analysis by linking a common 'tell' directly to training and fine-tuning processes.