Palo Alto Institute
ORCID: 0000-0001-8433-1760Publishes on RNA Research and Splicing, CRISPR and Genetic Engineering, RNA and protein synthesis mechanisms. 62 papers and 2.3k citations.
Add your photo, update your bio, and get notified when your ranking changes.
The genome is a sequence that encodes the DNA, RNA, and proteins that orchestrate an organism's function. We present Evo, a long-context genomic foundation model with a frontier architecture trained on millions of prokaryotic and phage genomes, and report scaling laws on DNA to complement observations in language and vision. Evo generalizes across DNA, RNA, and proteins, enabling zero-shot function prediction competitive with domain-specific language models and the generation of functional CRISPR-Cas and transposon systems, representing the first examples of protein-RNA and protein-DNA codesign with a language model. Evo also learns how small mutations affect whole-organism fitness and generates megabase-scale sequences with plausible genomic architecture. These prediction and generation capabilities span molecular to genomic scales of complexity, advancing our understanding and control of biology.
Large serine recombinases (LSRs) are DNA integrases that facilitate the site-specific integration of mobile genetic elements into bacterial genomes. Only a few LSRs, such as Bxb1 and PhiC31, have been characterized to date, with limited efficiency as tools for DNA integration in human cells. In this study, we developed a computational approach to identify thousands of LSRs and their DNA attachment sites, expanding known LSR diversity by >100-fold and enabling the prediction of their insertion site specificities. We tested their recombination activity in human cells, classifying them as landing pad, genome-targeting or multi-targeting LSRs. Overall, we achieved up to seven-fold higher recombination than Bxb1 and genome integration efficiencies of 40-75% with cargo sizes over 7 kb. We also demonstrate virus-free, direct integration of plasmid or amplicon libraries for improved functional genomics applications. This systematic discovery of recombinases directly from microbial sequencing data provides a resource of over 60 LSRs experimentally characterized in human cells for large-payload genome insertion without exposed DNA double-stranded breaks.
Abstract All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation—from noncoding pathogenic mutations to clinically significant BRCA1 variants—without task-specific finetuning. Applying mechanistic interpretability analyses, we reveal that Evo 2 autonomously learns a breadth of biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions. Beyond its predictive capabilities, Evo 2 generates mitochondrial, prokaryotic, and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Guiding Evo 2 via inference-time search enables controllable generation of epigenomic structure, for which we demonstrate the first inference-time scaling results in biology. We make Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.
Coming soon — researchers in similar fields and career stages