Deep learning for improving knowledge of the protein families
Abstract
Rapid developments in sequencing technologies have enabled high-throughput identification of the primary structure of proteins. Accordingly, the number of known protein sequences has risen greatly over the past few decades. Unlocking this wealth of data requires improved methods for predicting the function of protein sequences, both those already in the databases and those newly obtained. Traditional methods for the prediction of protein function have relied on sequence similarity. However, recently developed deep learning methods, such as the ProtENN model developed by Bileschi et al. [2019], can extract more information about protein sequences and functional protein domains, and by leveraging what these models learn we can improve our understanding of protein family relationships. Recent advances in Large Language Model (LLM) development offer another potential path to understanding protein function through the generative abilities of these models. This work presents two projects that focus on improving the understanding of protein function using two different modalities: at the level of textual descriptions of protein functions (Chapter 2) and at the level of deep protein representations (Chapter 3). Chapter 2 discusses frameworks for LLM-generated protein family annotations and evaluates them against human-written annotations, showing the potential of automated techniques. Chapter 3 is dedicated to the ProtCNNSim method, which uses ProtENN2 per-residue embeddings to produce new high-dimensional per-family embeddings and can be used to calculate inter-family similarity scores based on these representations. ProtCNNSim is applied to Pfam annotation and helps to refine clan membership for Pfam families, suggesting both new members of existing clans and potential new clans for future Pfam releases. In summary, this thesis explores the benefits of employing deep learning techniques for understanding protein functions.
It provides empirical evidence of performance improvements that can be achieved by such methods, and contributes towards assessing the feasibility of broader adoption of such methods for protein function annotation.