Sarah A. Fahlberg

Neural networks to learn protein sequence–function relationships from deep mutational scanning data

Sam Gelman, Sarah A. Fahlberg, Pete Heinzelman et al.|Proceedings of the National Academy of Sciences|2021

Cited by 177 | Open Access

The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein's behavior and properties. We present a supervised deep learning framework to learn the sequence-function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network's internal representation affects its ability to learn the sequence-function mapping. Our supervised learning approach displays superior performance over physics-based and unsupervised prediction methods. We find that networks that capture nonlinear interactions and share parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks' ability to learn biologically meaningful information about protein structure and mechanism. Finally, we demonstrate the models' ability to navigate sequence space and design new proteins beyond the training set. We applied the protein G B1 domain (GB1) models to design a sequence that binds to immunoglobulin G with substantially higher affinity than wild-type GB1.

Machine learning to navigate fitness landscapes for protein engineering

Chase R. Freschlin, Sarah A. Fahlberg, Philip A. Romero|Current Opinion in Biotechnology|2022

Cited by 121 | Open Access

Machine learning-guided acyl-ACP reductase engineering for improved in vivo fatty alcohol production

Jonathan C. Greenhalgh, Sarah A. Fahlberg, Brian F. Pfleger et al.|Nature Communications|2021

Cited by 102 | Open Access

Alcohol-forming fatty acyl reductases (FARs) catalyze the reduction of thioesters to alcohols and are key enzymes for microbial production of fatty alcohols. Many metabolic engineering strategies utilize FARs to produce fatty alcohols from intracellular acyl-CoA and acyl-ACP pools; however, enzyme activity, especially on acyl-ACPs, remains a significant bottleneck to high-flux production. Here, we engineer FARs with enhanced activity on acyl-ACP substrates by implementing a machine learning (ML)-driven approach to iteratively search the protein fitness landscape. Over the course of ten design-test-learn rounds, we engineer enzymes that produce over twofold more fatty alcohols than the starting natural sequences. We characterize the top sequence and show that it has an enhanced catalytic rate on palmitoyl-ACP. Finally, we analyze the sequence-function data to identify features, like the net charge near the substrate-binding site, that correlate with in vivo activity. This work demonstrates the power of ML to navigate the fitness landscape of traditionally difficult-to-engineer proteins.

Neural network extrapolation to distant regions of the protein fitness landscape

Chase R. Freschlin, Sarah A. Fahlberg, Pete Heinzelman et al.|Nature Communications|2024

Cited by 43 | Open Access

Machine learning (ML) has transformed protein engineering by constructing models of the underlying sequence-function landscape to accelerate the discovery of new biomolecules. ML-guided protein design requires models, trained on local sequence-function information, to accurately predict distant fitness peaks. In this work, we evaluate neural networks’ capacity to extrapolate beyond their training data. We perform model-guided design using a panel of neural network architectures trained on protein G (GB1)-Immunoglobulin G (IgG) binding data and experimentally test thousands of GB1 designs to systematically evaluate the models’ extrapolation. We find each model architecture infers markedly different landscapes from the same data, which give rise to unique design preferences. We find simpler models excel in local extrapolation to design high fitness proteins, while more sophisticated convolutional models can venture deep into sequence space to design proteins that fold but are no longer functional. We also find that implementing a simple ensemble of convolutional neural networks enables robust design of high-performing variants in the local landscape. Our findings highlight how each architecture’s inductive biases prime them to learn different aspects of the protein fitness landscape and how a simple ensembling approach makes protein engineering more robust. Machine learning accelerates protein engineering by predicting sequence-function relationships. Here, authors evaluate neural network architectures’ ability to extrapolate beyond training data, finding simpler models excel in local design while convolutional models explore deeper sequence spaces.

Neural networks to learn protein sequence-function relationships from deep mutational scanning data

Sam Gelman, Sarah A. Fahlberg, Pete Heinzelman et al.|bioRxiv (Cold Spring Harbor Laboratory)|2020

Cited by 17 | Open Access

Abstract: The mapping from protein sequence to function is highly complex, making it challenging to predict how sequence changes will affect a protein’s behavior and properties. We present a supervised deep learning framework to learn the sequence-function mapping from deep mutational scanning data and make predictions for new, uncharacterized sequence variants. We test multiple neural network architectures, including a graph convolutional network that incorporates protein structure, to explore how a network’s internal representation affects its ability to learn the sequence-function mapping. Our supervised learning approach displays superior performance over physics-based and unsupervised prediction methods. We find networks that capture nonlinear interactions and share parameters across sequence positions are important for learning the relationship between sequence and function. Further analysis of the trained models reveals the networks’ ability to learn biologically meaningful information about protein structure and mechanism. Finally, we demonstrate the models’ ability to navigate sequence space and design new proteins beyond the training set. We applied the GB1 models to design a sequence that binds to IgG with substantially higher affinity than wild-type GB1. Our software is available from https://github.com/gitter-lab/nn4dms.

Significance: Understanding the relationship between protein sequence and function is necessary to design new and useful proteins with applications in bioenergy, medicine, and agriculture. The mapping from sequence to function is tremendously complex because it involves thousands of molecular interactions that are coupled over multiple lengths and timescales. In this work, we show neural networks can learn the sequence-function mapping from large protein datasets. Neural networks are appealing for this task because they can learn complicated relationships from data, make few assumptions about the nature of the sequence-function relationship, and can learn general rules that apply across the length of the protein sequence. We demonstrate the learned models can be applied to design new proteins with properties that exceed natural sequences.

