Genome modeling and design across all domains of life with Evo 2

Garyk Brixi(Arc Research Institute), Matthew G. Durrant(Arc Research Institute), Ja‐Lok Ku(Arc Research Institute), Michael Poli(Air Liquide (United Kingdom)), Greg Brockman, Daniel Chang(Arc Research Institute), Gabriel González(Arc Research Institute), S. B. King(Arc Research Institute), David Li(Arc Research Institute), S. B. King(Arc Research Institute), Mohsen Naghipourfar(Berkeley College), Eric Nguyen(Stanford University), Chiara Ricci-Tam(Arc Research Institute), David W. Romero, Gwanggyu Sun(Arc Research Institute), Ali Taghibakshi, Anton Vorontsov, B. S. Yang, Myra Deng(Sunfire (Germany)), Liv Gorton(Sunfire (Germany)), Nam C. Nguyen(Sunfire (Germany)), Nicholas K. Wang(Sunfire (Germany)), Etowah Adams(Columbia University), Stephen A. Baccus(Stanford University), Steven Dillmann(Stanford University), Stefano Ermon(Stanford University), Daniel Guo(Arc Research Institute), Rajesh Ilango(Arc Research Institute), Ken Janik, Amy X. Lu(Berkeley College), Reshma Mehta, Mohammad R. K. Mofrad(Berkeley College), Madelena Y. Ng(Stanford University), Jaspreet Pannu(Stanford University), Christopher Ré(Stanford University), Jonathan C. Schmok(Arc Research Institute), John St. John, Jeremy A. Sullivan(Arc Research Institute), Kevin Zhu(Berkeley College), Greg Zynda, Daniel Balsam(Sunfire (Germany)), Patrick Collison(Arc Research Institute), Anthony Costa, Tina Hernandez‐Boussard(Stanford University), Eric Ho(Sunfire (Germany)), Mingyu Liu, Thomas McGrath(Sunfire (Germany)), Kimberly Powell, Dave P. Burke(Arc Research Institute), Hani Goodarzi(University of California, San Francisco), Patrick D. Hsu(Berkeley College), Brian Hie(Arc Research Institute)
bioRxiv (Cold Spring Harbor Laboratory)
February 21, 2025
Cited by 190 | Open Access

Abstract

All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation, from noncoding pathogenic mutations to clinically significant BRCA1 variants, without task-specific finetuning. Applying mechanistic interpretability analyses, we reveal that Evo 2 autonomously learns a breadth of biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions. Beyond its predictive capabilities, Evo 2 generates mitochondrial, prokaryotic, and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Guiding Evo 2 via inference-time search enables controllable generation of epigenomic structure, for which we demonstrate the first inference-time scaling results in biology. We make Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.

