Simulating 500 million years of evolution with a language modelMore than 3 billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here, we show that language models trained at scale on evolutionary data can generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is highly responsive to alignment to improve its fidelity. We have prompted ESM3 to generate fluorescent proteins. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% sequence identity) from known fluorescent proteins, which we estimate is equivalent to simulating 500 million years of evolution.
Simulating 500 million years of evolution with a language modelThomas Hayes, Roshan Rao, Halil Akin et al.|bioRxiv (Cold Spring Harbor Laboratory)|2024 Abstract More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained on tokens generated by evolution can act as evolutionary simulators to generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is highly responsive to biological alignment. We have prompted ESM3 to generate fluorescent proteins with a chain of thought. Among the generations that we synthesized, we found a bright fluorescent protein at far distance (58% identity) from known fluorescent proteins. Similarly distant natural fluorescent proteins are separated by over five hundred million years of evolution.
Mini‐Metagenomics and Nucleotide Composition Aid the Identification and Host Association of Novel Bacteriophage SequencesA broad spectrum of metagenomic and single cell sequencing techniques have become popular for dissecting environmental microbial diversity, leading to the characterization of thousands of novel microbial lineages. In addition to recovering bacterial and archaeal genomes, metagenomic assembly can also produce genomes of viruses that infect microbial cells. Because of their diversity, lack of marker genes, and small genome size, identifying novel bacteriophage sequences from metagenomic data is often challenging, especially when the objective is to establish phage-host relationships. The present work describes a computational approach that uses supervised learning to classify metagenomic contigs as phage or non-phage as well as assigning phage taxonomy based on tetranucleotide frequencies. Furthermore, the method assigns phage-host relationships using co-occurrence statistics derived from a recently developed mini-metagenomic experimental technique. This work evaluates method performance at identifying viral contigs and predicting taxonomic classification using publicly available references. Then, using two mini-metagenomic datasets, over 100 novel phage contigs from hot spring samples of Yellowstone National Park are identified and assigned to putative microbial hosts. Results of this work demonstrate the value of combining viral sequence identification with mini-metagenomic experimental methods to understand the microbial ecosystem.
PhaMers identifies novel bacteriophage sequences from thermophilic hot springsJonathan Deaton, Feiqiao Brian Yu, Stephen R. Quake|bioRxiv (Cold Spring Harbor Laboratory)|2017 Abstract Metagenomic sequencing approaches have become popular for the purpose of dissecting environmental microbial diversity, leading to the characterization of novel microbial lineages. In addition of bacterial and fungal genomes, metagenomic analysis can also reveal genomes of viruses that infect microbial cells. Because of their small genome size and limited knowledge of phage diversity, discovering novel phage sequences from metagenomic data is often challenging. Here we describe PhaMers ( Phage k - Mers ). a phage identification tool that uses supervised learning to classify metagenomic contigs as phage or non-phage on the basis of tetranucleotide frequencies. a technique that does not depend on existing gene annotations. PhaMers compares the tetranucleotide frequencies of metagenomic contigs to phage and bacteria references from online databases. resulting in assignments of lower level phage taxonomy based on sequence similarity. Using PhaMers. we identified 103 novel phage sequences from hot spring samples of Yellowstone National Park based on data generated from a microfluidic-based minimetagenomic approach. We analyzed assembled contigs over 5 kbp in length using PhaMers and compared the results with those generated by VirSorter, a publicly available phage identification and annotation package. We analyzed the performance of phage genome prediction and taxonomic classification using PhaMers. and presented putative hosts and taxa for some of the novel phage sequences. Finally. mini-metagenomic occurrence profiles of phage and prokaryotic genomes were used to verify putative hosts.