Sequence modeling and design from molecular to genome scale with Evo
The genome is a sequence that encodes the DNA, RNA, and proteins that orchestrate an organism's function. We present Evo, a long-context genomic foundation model with a frontier architecture trained on millions of prokaryotic and phage genomes, and report scaling laws on DNA to complement observations in language and vision. Evo generalizes across DNA, RNA, and proteins, enabling zero-shot function prediction competitive with domain-specific language models and the generation of functional CRISPR-Cas and transposon systems, representing the first examples of protein-RNA and protein-DNA codesign with a language model. Evo also learns how small mutations affect whole-organism fitness and generates megabase-scale sequences with plausible genomic architecture. These prediction and generation capabilities span molecular to genomic scales of complexity, advancing our understanding and control of biology.
Genome modeling and design across all domains of life with Evo 2
Garyk Brixi, Matthew G. Durrant, Ja‐Lok Ku et al. | bioRxiv (Cold Spring Harbor Laboratory) | 2025
Abstract: All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation, from noncoding pathogenic mutations to clinically significant BRCA1 variants, without task-specific finetuning. Applying mechanistic interpretability analyses, we reveal that Evo 2 autonomously learns a breadth of biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions. Beyond its predictive capabilities, Evo 2 generates mitochondrial, prokaryotic, and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Guiding Evo 2 via inference-time search enables controllable generation of epigenomic structure, for which we demonstrate the first inference-time scaling results in biology. We make Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.
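The abstract above does not say how variant effects are scored. One common zero-shot approach with autoregressive sequence models is to compare the log-likelihood of the variant sequence with that of the reference. The sketch below illustrates that idea only: a tiny k-mer Markov model stands in for the language model, and the names `train_kmer_model`, `log_likelihood`, and `variant_score` are hypothetical, not part of any Evo 2 API.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train_kmer_model(sequences, k=3):
    """Count next-base occurrences for every length-k context (toy stand-in model)."""
    counts = {}
    for seq in sequences:
        for i in range(k, len(seq)):
            context = seq[i - k:i]
            counts.setdefault(context, Counter())[seq[i]] += 1
    return counts

def log_likelihood(seq, counts, k=3):
    """Sum of log P(base | previous k bases), with add-one smoothing."""
    total = 0.0
    for i in range(k, len(seq)):
        context_counts = counts.get(seq[i - k:i], Counter())
        numerator = context_counts[seq[i]] + 1
        denominator = sum(context_counts.values()) + len(ALPHABET)
        total += math.log(numerator / denominator)
    return total

def variant_score(ref, pos, alt, counts, k=3):
    """Log-likelihood of the single-base variant minus that of the reference.

    Negative values mean the model finds the variant less plausible.
    """
    variant = ref[:pos] + alt + ref[pos + 1:]
    return log_likelihood(variant, counts, k) - log_likelihood(ref, counts, k)

# Toy usage: a model trained on a repetitive sequence penalizes a disruption.
model = train_kmer_model(["ACGTACGTACGTACGT"] * 5)
reference = "ACGTACGTACGT"
score = variant_score(reference, 5, "A", model)
```

With a real genomic language model, `log_likelihood` would be replaced by the model's own per-token log-probabilities over a window around the variant; the subtraction step stays the same.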
Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale
We introduce convolutional multi-hybrid architectures, with a design grounded in two simple observations. First, operators in hybrid models can be tailored to token manipulation tasks such as in-context recall, multi-token recall, and compression, with input-dependent convolutions and attention offering complementary performance. Second, co-designing convolution operators and hardware-aware algorithms enables efficiency gains in regimes where previous alternative architectures struggle to surpass Transformers. At the 40 billion parameter scale, we train end-to-end 1.2 to 2.9 times faster than optimized Transformers, and 1.1 to 1.4 times faster than previous generation hybrids. On H100 GPUs and model width 4096, individual operators in the proposed multi-hybrid StripedHyena 2 architecture achieve two-fold throughput improvement over linear attention and state-space models. Multi-hybrids excel at sequence modeling over byte-tokenized data, as demonstrated by the Evo 2 line of models. We discuss the foundations that enable these results, including architecture design, overlap-add blocked kernels for tensor cores, and dedicated all-to-all and point-to-point context parallelism strategies.
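The overlap-add idea mentioned in the abstract above can be illustrated with the textbook algorithm: split a long input into blocks, convolve each block with the filter independently, and sum the overlapping tails. The following is a minimal pure-Python sketch of that generic technique, not the paper's tensor-core kernel; `direct_conv`, `overlap_add_conv`, and `block_size` are names chosen here for illustration.

```python
def direct_conv(x, h):
    """Reference full linear convolution: y[n] = sum_k h[k] * x[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, x_i in enumerate(x):
        for k, h_k in enumerate(h):
            y[i + k] += x_i * h_k
    return y

def overlap_add_conv(x, h, block_size=4):
    """Same result as direct_conv, computed one input block at a time."""
    y = [0.0] * (len(x) + len(h) - 1)
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        # Each block is convolved on its own; on a GPU this small, fixed-size
        # tile is what would be mapped onto a dense matrix multiply.
        partial = direct_conv(block, h)
        # The last len(h) - 1 outputs of a block overlap the next block's
        # outputs, so partial results are accumulated rather than assigned.
        for j, value in enumerate(partial):
            y[start + j] += value
    return y

signal = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
taps = [1, -1, 0.5]
blocked = overlap_add_conv(signal, taps, block_size=4)
reference = direct_conv(signal, taps)
```

Because every block is processed independently, the blocks can be computed in parallel and the filter length, not the sequence length, determines how much neighboring blocks interact.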
Designing AI-programmable therapeutics with the EDEN family of foundation models
Geraldene Munsamy, Gavin Ayres, Carla Greco et al. | bioRxiv (Cold Spring Harbor Laboratory) | 2026
Abstract: The ability to interpret, modify, and design DNA has driven many of the most significant advances in modern medicine, from diagnostics, biologics, and vaccines to cell and gene therapies. However, the inherent complexity of biological systems means that most modern medicines are still engineered using bespoke, labor-intensive processes. To address the need for a generalisable and programmable approach to therapeutic design, we introduce the EDEN (environmentally-derived evolutionary network) family of metagenomic foundation models, including a 28 billion parameter model trained on 9.7 trillion nucleotide tokens from BaseData 1. This dataset, at the time of training, contained more than 10 billion novel genes from over 1 million new species, and is intentionally enriched for environmental and host-associated metagenomes, phage sequences, and mobile genetic elements, enabling the model to learn from diverse and novel cross-species evolutionary mechanisms and apply them to key challenges in human health. EDEN achieves state-of-the-art performance across a series of predictive and generative genomic and protein benchmarks. To demonstrate the models’ broad applicability across biology, we evaluate EDEN’s capacity for programmable therapeutic design by challenging a single architecture to design biological novelty across three distinct therapeutic modalities, disease areas and biological scales: (i) large gene insertion, (ii) antibiotic peptide design, and (iii) microbiome design. First, we demonstrate AI-programmable Gene Insertion (aiPGI), in which EDEN designs de novo large serine recombinases (LSRs) capable of inserting large pieces of DNA at desired target sites in the human genome when prompted only on 30 nucleotides of DNA sequence from the desired target site.
In low-N experimental validation, EDEN generated multiple active recombinases for all tested disease-associated genomic loci (ATM, DMD, F9, FANCC, GALC, IDS, P4HA1, PHEX, RYR2, USH2A) and 4 potential safe harbor sites in the human genome. EDEN achieves an overall functional hit rate of 63.2% across diverse DNA prompts when prompted on only 30 bp of DNA from outside the training data. 50% of EDEN-generated LSRs were active in human cells, achieving therapeutically relevant levels of CAR insertion in primary human T cells. We also show that EDEN can generate active bridge recombinases when prompted on the associated guide RNA alone, with sequence identities to training and public data as low as 65%. These results pave the way for a new generation of cell and gene therapies by opening the door to rapid, programmable and site-specific integration of large genetic payloads without double-strand breaks. This offers an alternative to the safety, efficiency and payload limitations inherent in viral or nuclease-based editing at thousands of currently intractable human therapeutic targets. Second, we use the same model to generate a focused low-N library of novel antimicrobial peptides where 97% showed activity, with top candidates achieving single-digit micromolar potency against critical-priority multidrug-resistant pathogens. Third, to demonstrate that EDEN captures inter-genomic features, we design a gigabase-scale microbiome with over 94,000 synthetic metagenomic assemblies, including prophage genomes and correct cross-species metabolic pathway completions. The EDEN-generated synthetic microbiome covers 9,067 species with a biome-specific taxonomic accuracy of 99%. Over 1,500 of the generated species were outside the fine-tuning dataset while retaining the correct microecological properties and biome association, thus significantly expanding genetic and taxonomic diversity.
Together, these results establish a new strategic direction for AI-programmable therapeutics, in which a single foundation model architecture designs candidate therapeutics across diverse modalities and disease areas. This suggests that the combination of billions of years of evolutionary data with specific therapeutic records offers a clear, scaling-driven path to making therapeutic design a predictable engineering discipline.