Sequence modeling and design from molecular to genome scale with Evo

Eric Nguyen (Palo Alto Institute), Michael Poli (Together), Matthew G. Durrant (Palo Alto Institute), Brian Kang (Palo Alto Institute), Dhruva Katrekar (Palo Alto Institute), David Li (Palo Alto Institute), Liam J. Bartie (Palo Alto Institute), Armin W. Thomas (Stanford University), S. B. King (Palo Alto Institute), Garyk Brixi (Palo Alto Institute), Jeremy A. Sullivan (Palo Alto Institute), Madelena Y. Ng (Stanford Medicine), Ashley Lewis (Stanford University), Aaron Lou (Stanford University), Stefano Ermon (Chan Zuckerberg Biohub San Francisco), Stephen A. Baccus (Stanford University), Tina Hernandez‐Boussard (Stanford University), Christopher Ré (Stanford University), Patrick D. Hsu (Palo Alto Institute), Brian Hie (Palo Alto Institute)
Science
November 14, 2024
Cited by 400
Open Access

Abstract

The genome is a sequence that encodes the DNA, RNA, and proteins that orchestrate an organism's function. We present Evo, a long-context genomic foundation model with a frontier architecture trained on millions of prokaryotic and phage genomes, and report scaling laws on DNA to complement observations in language and vision. Evo generalizes across DNA, RNA, and proteins, enabling zero-shot function prediction competitive with domain-specific language models and the generation of functional CRISPR-Cas and transposon systems, representing the first examples of protein-RNA and protein-DNA codesign with a language model. Evo also learns how small mutations affect whole-organism fitness and generates megabase-scale sequences with plausible genomic architecture. These prediction and generation capabilities span molecular to genomic scales of complexity, advancing our understanding and control of biology.

