InterPro: the protein sequence classification resource in 2025
InterPro (https://www.ebi.ac.uk/interpro) is a freely accessible resource for the classification of protein sequences into families. It integrates predictive models, known as signatures, from multiple member databases to classify sequences into families and predict the presence of domains and significant sites. The InterPro database provides annotations for over 200 million sequences, ensuring extensive coverage of UniProtKB, the standard repository of protein sequences, and includes mappings to several other major resources, such as Gene Ontology (GO), Protein Data Bank in Europe (PDBe) and the AlphaFold Protein Structure Database. In this publication, we report on the status of InterPro (version 101.0), detailing new developments in the database, associated web interface and software. Notable updates include the increased integration of structures predicted by AlphaFold and the enhanced description of protein families using artificial intelligence. Over the past two years, more than 5000 new InterPro entries have been created. The InterPro website now offers access to 85 000 protein families and domains from its member databases and serves as a long-term archive for retired databases. InterPro data, software and tools are freely available.
Investigation of protein family relationships with deep learning
Motivation: In this article, we propose a method for finding similarities between Pfam families based on the pre-trained neural network ProtENN2. We use ProtENN2 per-residue embeddings to produce new high-dimensional per-family embeddings, develop an approach for calculating inter-family similarity scores from these embeddings, and evaluate its predictions using structure comparison. Results: We apply our method to Pfam annotation by refining clan membership for Pfam families, suggesting both new members of existing clans and potential new clans for future Pfam releases. We investigate some of the failure modes of our approach, which suggest directions for future improvements. Our method is relatively simple, has few parameters and could be applied to other protein family classification models. Overall, our work suggests potential benefits of employing deep learning for improving our understanding of protein family relationships and the functions of previously uncharacterized families. Availability and implementation: github.com/iponamareva/ProtCNNSim, 10.5281/zenodo.10091909.
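The pipeline described in this abstract (per-residue embeddings, then per-family embeddings, then inter-family similarity scores) can be illustrated with a minimal sketch. The abstract does not say how residues are pooled or which similarity measure is used, so the mean pooling and cosine similarity below are assumptions chosen for illustration, not the authors' method. The family names and two-dimensional vectors are toy data; real ProtENN2 embeddings are high-dimensional.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def family_embedding(domain_embeddings):
    """Pool per-residue embeddings into a single per-family vector.

    domain_embeddings: list of domains, each a list of per-residue vectors.
    Assumption: average over residues first, then over domains.
    """
    per_domain = [mean_vector(residues) for residues in domain_embeddings]
    return mean_vector(per_domain)


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy per-residue embeddings (hypothetical values, hypothetical family names).
toy_families = {
    "FAM_A": [[[1.0, 0.0], [0.8, 0.2]], [[0.9, 0.1]]],
    "FAM_B": [[[0.9, 0.1], [0.7, 0.3]]],
    "FAM_C": [[[0.0, 1.0], [0.1, 0.9]]],
}

family_vectors = {
    name: family_embedding(domains) for name, domains in toy_families.items()
}

# Score every unordered pair of families.
names = sorted(family_vectors)
similarity_scores = {
    (a, b): cosine_similarity(family_vectors[a], family_vectors[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
```

In this toy data, FAM_A and FAM_B point in nearly the same direction and score close to 1, while FAM_C scores much lower against both.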
ProtCNNSim embeddings
Irina Ponamareva|Zenodo (CERN European Organization for Nuclear Research)|2024
This entry contains:
- Family-specific embeddings and corresponding Pfam labels used in the ProtCNNSim method
- Predicted similarity scores for Pfam 35 families
ProtCNNSim embeddings
Irina Ponamareva|Zenodo (CERN European Organization for Nuclear Research)|2023
Family-specific embeddings and corresponding Pfam labels used in the ProtCNNSim method
Deep learning for improving knowledge of the protein families
Irina Ponamareva|Apollo (University of Cambridge)|2025
Rapid developments in sequencing technologies have enabled high-throughput identification of the primary structure of proteins. Accordingly, the number of known protein sequences has risen greatly over the past few decades. Unlocking this wealth of data requires improved methods for predicting the function of protein sequences, whether already present in databases or newly obtained. Traditional methods for the prediction of protein function have relied on sequence similarity. However, recently developed deep learning methods, such as the ProtENN model developed by Bileschi et al. [2019], can extract more information about protein sequences and functional protein domains, and by leveraging what these models have learned we can improve our understanding of protein family relationships. Recent advancements in Large Language Model (LLM) development represent another potential path for understanding protein functions through their generative abilities. This work presents two projects that focus on improving protein function understanding using two different modalities: at the level of textual descriptions of protein functions (Chapter 2) and at the level of deep protein representations (Chapter 3). Chapter 2 discusses frameworks for LLM-generated protein family annotations and evaluates them against human-written annotations, showing the potential of automated techniques. Chapter 3 is dedicated to the ProtCNNSim method, which uses ProtENN2 per-residue embeddings to produce new high-dimensional per-family embeddings and can be used for calculating inter-family similarity scores based on these representations. ProtCNNSim is applied to Pfam annotation and helps to refine clan membership for Pfam families, suggesting both new members of existing clans and potential new clans for future Pfam releases.
In summary, this thesis explores the benefits of employing deep learning techniques for understanding protein functions. It provides empirical evidence of performance improvements that can be achieved by such methods, and contributes towards assessing the feasibility of broader adoption of such methods for protein function annotation.
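The clan-refinement step mentioned for Chapter 3 (using inter-family similarity scores to suggest clan members and potential new clans) could be approximated as follows. This is only a sketch under stated assumptions: the abstract does not describe how candidates are derived from the scores, so the single-linkage grouping, the threshold value and the family names here are all hypothetical.

```python
def candidate_clans(scores, threshold):
    """Group families into candidate clans by thresholded similarity.

    scores: dict mapping (family_a, family_b) pairs to a similarity score.
    Assumption: two families belong to the same candidate group if they are
    linked by a chain of pairs scoring at or above `threshold`
    (connected components of the thresholded similarity graph).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        root_a, root_b = find(a), find(b)
        if score >= threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for family in list(parent):
        groups.setdefault(find(family), set()).add(family)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Toy similarity scores (hypothetical values and family names).
toy_scores = {
    ("FAM_A", "FAM_B"): 0.99,
    ("FAM_A", "FAM_C"): 0.16,
    ("FAM_B", "FAM_C"): 0.20,
    ("FAM_D", "FAM_C"): 0.95,
}

print(candidate_clans(toy_scores, threshold=0.9))
```

Single-linkage grouping is simple but can chain loosely related families together at low thresholds, so any groups it proposes would still need checking, for example by the structure comparison the abstract mentions.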