Simulating 500 million years of evolution with a language model

Thomas Hayes (New York Consortium in Evolutionary Primatology), Roshan Rao (New York Consortium in Evolutionary Primatology), Halil Akin (New York Consortium in Evolutionary Primatology), Nicholas Sofroniew (New York Consortium in Evolutionary Primatology), Deniz Oktay (New York Consortium in Evolutionary Primatology), Zeming Lin (New York Consortium in Evolutionary Primatology), Robert Verkuil (New York Consortium in Evolutionary Primatology), Vincent Q. Tran (Palo Alto Institute), Jonathan Deaton (New York Consortium in Evolutionary Primatology), Marius Wiggert (New York Consortium in Evolutionary Primatology), Rohil Badkundri (New York Consortium in Evolutionary Primatology), Irhum Shafkat (New York Consortium in Evolutionary Primatology), Jun Gong (New York Consortium in Evolutionary Primatology), Alexander Derry (New York Consortium in Evolutionary Primatology), Raul S. Molina (New York Consortium in Evolutionary Primatology), Neil Thomas (New York Consortium in Evolutionary Primatology), Yousuf A. Khan (New York Consortium in Evolutionary Primatology), Chetan Mishra (New York Consortium in Evolutionary Primatology), Carolyn Kim (New York Consortium in Evolutionary Primatology), Liam J. Bartie (Palo Alto Institute), Matthew Nemeth (Palo Alto Institute), Patrick D. Hsu (Palo Alto Institute), Tom Sercu (New York Consortium in Evolutionary Primatology), Salvatore Candido (New York Consortium in Evolutionary Primatology), Alexander Rives (New York Consortium in Evolutionary Primatology)
Science
January 16, 2025
Cited by 565

Abstract

More than 3 billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here, we show that language models trained at scale on evolutionary data can generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is highly responsive to alignment to improve its fidelity. We have prompted ESM3 to generate fluorescent proteins. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% sequence identity) from known fluorescent proteins, which we estimate is equivalent to simulating 500 million years of evolution.

