Large genome model: Open source AI trained on trillions of bases
Open source AI trained on 8.8 trillion DNA bases identifies regulatory sequences humans struggle to spot.
The research team behind the original Evo AI system has released Evo 2, an open source model trained on genomic data from bacteria, archaea, and eukaryotes. This is a significant step beyond their previous work, which focused solely on bacterial genomes, and it targets the harder problem of analyzing complex eukaryotic genomes. The model was trained on the OpenGenome2 dataset, which contains 8.8 trillion DNA bases. The team deliberately excluded eukaryotic viruses to reduce the risk of the model being misused to create threats to humans.
The system uses a convolutional neural network called StripedHyena 2, trained in two stages: first on 8,000-base chunks to identify local features, then on million-base sequences to recognize large-scale patterns. Researchers created two versions: a 7-billion-parameter model trained on 2.4 trillion bases and a full 40-billion-parameter version trained on the complete dataset. By learning evolutionary conservation patterns across species, Evo 2 can identify subtle genomic features, such as weakly defined splice sites and regulatory sequences, that are notoriously difficult for humans to detect in complex eukaryotic genomes.
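To make the two context lengths concrete, here is a minimal sketch of cutting a sequence into fixed-length windows at each scale. It is an illustration only, not Evo 2's actual data pipeline: the function name `chunk_sequence`, the toy genome, and the choice of non-overlapping windows are all assumptions made for this example.

```python
from typing import Iterator

SHORT_CONTEXT = 8_000      # stage-one chunk length reported in the article
LONG_CONTEXT = 1_000_000   # stage-two sequence length reported in the article


def chunk_sequence(seq: str, size: int) -> Iterator[str]:
    """Yield consecutive, non-overlapping windows of at most `size` bases.

    Hypothetical helper: the real training pipeline may window differently.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


# Toy stand-in for a genome: 20,000 bases of a repeating motif.
toy_genome = "ACGT" * 5_000

# Short windows expose local features; long windows keep distant bases together.
short_chunks = list(chunk_sequence(toy_genome, SHORT_CONTEXT))
long_chunks = list(chunk_sequence(toy_genome, LONG_CONTEXT))

print([len(c) for c in short_chunks])  # [8000, 8000, 4000]
print([len(c) for c in long_chunks])   # [20000]
```

At the short scale, two bases 10,000 positions apart never appear in the same window, so a model trained only at that scale cannot relate them. That is the gap the million-base stage is described as closing.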
- Trained on 8.8 trillion DNA bases from the OpenGenome2 dataset, covering all three domains of life
- Uses the StripedHyena 2 convolutional neural network, with 40 billion parameters in the full version
- Identifies complex eukaryotic features like regulatory sequences and splice sites that challenge existing tools
Why It Matters
Evo 2 could enable more accurate analysis of human and other complex genomes, which may accelerate biomedical research and therapeutic development.