Simulating 500 million years of evolution with a language model

Thomas Hayes (Institut de Biologia Evolutiva), Roshan Rao (Institut de Biologia Evolutiva), Halil Akin (Institut de Biologia Evolutiva), Nicholas J. Sofroniew (Institut de Biologia Evolutiva), Deniz Oktay (Institut de Biologia Evolutiva), Zeming Lin (Institut de Biologia Evolutiva), Robert Verkuil (Institut de Biologia Evolutiva), Vincent Q. Tran (Southern California Institute of Architecture), Jonathan Deaton (Institut de Biologia Evolutiva), Marius Wiggert (Institut de Biologia Evolutiva), Rohil Badkundri (Institut de Biologia Evolutiva), Irhum Shafkat (Institut de Biologia Evolutiva), Jun Gong (Institut de Biologia Evolutiva), Alexander Derry (Institut de Biologia Evolutiva), Raul S. Molina (Institut de Biologia Evolutiva), Neil Thomas (Institut de Biologia Evolutiva), Yousuf A. Khan (Institut de Biologia Evolutiva), Chetan Mishra (Institut de Biologia Evolutiva), Carolyn Kim (Institut de Biologia Evolutiva), Liam J. Bartie (Arc Research Institute), Matthew Nemeth (Arc Research Institute), Patrick D. Hsu (Southern California Institute of Architecture), Tom Sercu (Institut de Biologia Evolutiva), Salvatore Candido (Institut de Biologia Evolutiva), Alexander Rives (Institut de Biologia Evolutiva)
bioRxiv (Cold Spring Harbor Laboratory)
July 2, 2024
Cited by 195 · Open Access

Abstract

More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins. Here we show that language models trained on tokens generated by evolution can act as evolutionary simulators to generate functional proteins that are far away from known proteins. We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins. ESM3 can follow complex prompts combining its modalities and is highly responsive to biological alignment. We have prompted ESM3 to generate fluorescent proteins with a chain of thought. Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% identity) from known fluorescent proteins. Similarly distant natural fluorescent proteins are separated by over five hundred million years of evolution.
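For readers unfamiliar with the "58% identity" figure, the following is a minimal sketch of one common way to compute percent identity between two already-aligned sequences. The abstract does not state which convention or alignment tool was used, so the function name `percent_identity`, the choice to skip gapped columns, and the toy sequences are all assumptions made for illustration only.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap.

    Assumption: the inputs are already aligned to equal length. Other
    conventions (e.g. dividing by the full alignment length or by the
    shorter sequence length) give different numbers.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        if a == b:
            matches += 1

    # No comparable columns: report 0.0 rather than dividing by zero.
    return 100.0 * matches / compared if compared else 0.0


# Toy aligned fragments (hypothetical, not real fluorescent proteins).
identity = percent_identity("MSKGE-L", "MSRGEAL")
print(f"{identity:.1f}% identity")
```

Under this convention, two proteins at 58% identity share the same residue at 58 of every 100 compared alignment positions.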
