Founder effects shape the evolutionary dynamics of multimodality in open LLM families
Study of 1.8M models finds multimodal AI emerges through rare 'founder events', then explodes within lineages.
A new study by researcher Manuel Cebrian, analyzing a dataset of 1.8 million AI models from Hugging Face, reveals that multimodal capabilities, embodied in vision-language models (VLMs), don't emerge gradually in open-source LLM families: they explode from rare starting points. The research, titled 'Founder effects shape the evolutionary dynamics of multimodality in open LLM families,' used model lineage metadata to track how families such as Llama, Gemma, and GLM evolved. It found that the first VLM variant in a family typically appears months to over two years after the initial text-only release, with a lag of roughly one month for Gemma and about 26 months for GLM.
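The lineage graph behind these numbers comes from Hugging Face metadata. The paper's exact extraction pipeline isn't detailed here, but the Hub's public metadata is enough to sketch the idea: a model declares its parent through base_model tags, and its pipeline_tag (e.g., text-generation vs. image-text-to-text) identifies modality. A minimal sketch, assuming a recent huggingface_hub client; the search term and edge format are illustrative choices, not the study's:

```python
# Sketch: reconstruct parent -> child fine-tune edges from Hub metadata.
# Assumes a recent huggingface_hub client; "base_model:" tags and
# pipeline_tag are standard Hub metadata, though the paper's pipeline may differ.
from huggingface_hub import HfApi

api = HfApi()
edges = []
for m in api.list_models(search="gemma", limit=100, full=True):
    # Tags look like "base_model:org/name" or "base_model:finetune:org/name";
    # the parent repo id is the segment after the last colon.
    parents = [t.rsplit(":", 1)[1] for t in (m.tags or []) if t.startswith("base_model:")]
    for parent in parents:
        # pipeline_tag is e.g. "text-generation" or "image-text-to-text" (a VLM).
        edges.append((parent, m.id, m.pipeline_tag))

for parent, child, tag in edges[:10]:
    print(f"{parent} -> {child} [{tag}]")
```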
Once a VLM 'founder' appears, however, multimodality spreads rapidly but almost exclusively within its own descendant lineage. The data shows a stark divide: fine-tuning a text-only model has only a 0.218% chance of producing a VLM child, while 94.5% of the fine-tuning edges that produce a VLM child originate from a VLM parent. This creates a 'punctuated' adoption pattern in which capabilities are concentrated, not broadly transferred. About 60% of VLM releases appear as new roots without recorded parents, suggesting many are built from scratch or from undisclosed bases, while most of the rest are fine-tuned from existing VLMs.
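To make the two headline rates concrete, here is how they would be computed from a labeled edge list like the one sketched above. The edges below are toy placeholders for illustration, not the study's data:

```python
# Sketch: the two conditional rates reported by the study, computed from a
# parent -> child edge list labeled with each model's modality.
# These edges are toy placeholders, not data from the paper.
edges = [
    # (parent_modality, child_modality)
    ("text", "text"), ("text", "text"), ("text", "vlm"),
    ("vlm", "vlm"), ("vlm", "vlm"), ("vlm", "text"),
]

text_parent = [e for e in edges if e[0] == "text"]
vlm_children = [e for e in edges if e[1] == "vlm"]

# P(VLM child | text-only parent): the study reports 0.218%.
p_text_to_vlm = sum(e[1] == "vlm" for e in text_parent) / len(text_parent)

# Share of VLM children whose parent is already a VLM: the study reports 94.5%.
p_vlm_parent = sum(e[0] == "vlm" for e in vlm_children) / len(vlm_children)

print(f"P(VLM child | text parent) = {p_text_to_vlm:.1%}")
print(f"Share of VLM children with VLM parents = {p_vlm_parent:.1%}")
```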
This evolutionary dynamic has significant implications for the AI ecosystem. It suggests that the spread of multimodal capability is gated by these founder lineages, creating potential bottlenecks for capability transfer. For developers, it means that choosing a base model from a strong multimodal lineage matters far more for building vision-capable AI than hoping to add the capability by fine-tuning a text model. The research provides a data-driven framework for understanding how complex AI capabilities propagate through the open-source community.
- Only 0.218% of fine-tunes from text-generation parents yield Vision-Language Model (VLM) children, showing weak cross-capability transfer.
- 94.5% of VLM-child fine-tuning edges originate from VLM parents, demonstrating that multimodality expands almost exclusively within established VLM lineages.
- The first VLM in a family appears months to years after the text model (e.g., ~1 month for Gemma, ~26 months for GLM), followed by rapid within-lineage diversification.
Why It Matters
For builders, this means choosing the right multimodal lineage is critical; you can't easily fine-tune vision into a text model.