Irina Ponamareva

European Bioinformatics Institute

ORCID: 0009-0009-6041-5869

Publishes on Machine Learning in Bioinformatics, Bioinformatics and Genomic Networks, semigroups and automata theory. 5 papers and 822 citations.


Top publications by citations

InterPro: the protein sequence classification resource in 2025
Matthias Blum, Antonina Andreeva, Laise Cavalcanti Florentino et al.|Nucleic Acids Research|2024
Cited by 832 | Open Access

InterPro (https://www.ebi.ac.uk/interpro) is a freely accessible resource for the classification of protein sequences into families. It integrates predictive models, known as signatures, from multiple member databases to classify sequences into families and predict the presence of domains and significant sites. The InterPro database provides annotations for over 200 million sequences, ensuring extensive coverage of UniProtKB, the standard repository of protein sequences, and includes mappings to several other major resources, such as Gene Ontology (GO), Protein Data Bank in Europe (PDBe) and the AlphaFold Protein Structure Database. In this publication, we report on the status of InterPro (version 101.0), detailing new developments in the database, associated web interface and software. Notable updates include the increased integration of structures predicted by AlphaFold and the enhanced description of protein families using artificial intelligence. Over the past two years, more than 5000 new InterPro entries have been created. The InterPro website now offers access to 85 000 protein families and domains from its member databases and serves as a long-term archive for retired databases. InterPro data, software and tools are freely available.

Investigation of protein family relationships with deep learning
Irina Ponamareva, Antonina Andreeva, Maxwell L. Bileschi et al.|Bioinformatics Advances|2024
Cited by 2 | Open Access

Motivation: In this article, we propose a method for finding similarities between Pfam families based on the pre-trained neural network ProtENN2. We use ProtENN2 per-residue embeddings to produce new high-dimensional per-family embeddings, develop an approach for calculating inter-family similarity scores based on these embeddings, and evaluate its predictions using structure comparison. Results: We apply our method to Pfam annotation by refining clan membership for Pfam families, suggesting both new members of existing clans and potential new clans for future Pfam releases. We investigate some of the failure modes of our approach, which suggests directions for future improvements. Our method is relatively simple with few parameters and could be applied to other protein family classification models. Overall, our work suggests potential benefits of employing deep learning for improving our understanding of protein family relationships and functions of previously uncharacterized families. Availability and implementation: github.com/iponamareva/ProtCNNSim, 10.5281/zenodo.10091909.
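
The pipeline described in this abstract (per-residue embeddings, then per-family embeddings, then inter-family similarity scores) can be illustrated with a minimal sketch. The abstract does not say how residues are aggregated or which similarity measure is used, so the mean pooling and cosine similarity below are assumptions chosen for illustration, not the ProtCNNSim implementation. All function names and the toy family labels are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average a set of equal-length vectors (assumed aggregation step)."""
    if not vectors:
        raise ValueError("need at least one vector to pool")
    dim = len(vectors[0])
    totals = [0.0] * dim
    for vec in vectors:
        if len(vec) != dim:
            raise ValueError("all vectors must have the same dimension")
        for i, value in enumerate(vec):
            totals[i] += value
    return [total / len(vectors) for total in totals]


def family_embedding(domains: Sequence[Sequence[Sequence[float]]]) -> Vector:
    """Pool residues within each domain, then pool domains within the family."""
    return mean_pool([mean_pool(residues) for residues in domains])


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_scores(families: Dict[str, Vector]) -> Dict[Tuple[str, str], float]:
    """Score every unordered pair of families, keyed by sorted name pair."""
    names = sorted(families)
    return {
        (a, b): cosine_similarity(families[a], families[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Toy 2-dimensional data; real per-residue embeddings would be much larger.
toy_families = {
    "FAM_A": family_embedding([[[1.0, 0.0], [1.0, 0.0]]]),
    "FAM_B": family_embedding([[[2.0, 0.0]]]),
    "FAM_C": family_embedding([[[0.0, 3.0]]]),
}
scores = similarity_scores(toy_families)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

In this toy example, FAM_A and FAM_B point in the same direction and score highest, while FAM_C is orthogonal to both. High-scoring pairs of this kind are the candidates that would then be checked against structure comparison.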

ProtCNNSim embeddings
Irina Ponamareva|Zenodo (CERN European Organization for Nuclear Research)|2024
Cited by 0 | Open Access

This entry contains:
- Family-specific embeddings and corresponding Pfam labels used in the ProtCNNSim method
- Predicted similarity scores for Pfam 35 families

ProtCNNSim embeddings
Irina Ponamareva|Zenodo (CERN European Organization for Nuclear Research)|2023
Cited by 0 | Open Access

Family-specific embeddings and corresponding Pfam labels used in the ProtCNNSim method

Deep learning for improving knowledge of the protein families
Irina Ponamareva|Apollo (University of Cambridge)|2025
Cited by 0 | Open Access

Rapid developments in sequencing technologies have enabled high-throughput identification of the primary structure of proteins. Accordingly, the number of known protein sequences has risen greatly over the past few decades. Unlocking this wealth of data requires improved methods for predicting the function of protein sequences, whether already in the databases or newly obtained. Traditional methods for the prediction of protein function have relied on sequence similarity. However, recently developed deep learning methods, such as the ProtENN model developed by Bileschi et al. [2019], have the ability to extract more information about protein sequences and functional protein domains, and by leveraging their knowledge we can improve our understanding of protein family relationships. Recent advancements in Large Language Model (LLM) development represent another potential path for understanding protein functions using their generative abilities. This work presents two projects that focus on improving protein function understanding using two different modalities: at the level of textual descriptions of protein functions (Chapter 2) and at the level of deep protein representations (Chapter 3). Chapter 2 discusses frameworks for LLM-generated protein family annotations and evaluates them against human-written annotations, showing the potential of automated techniques. Chapter 3 is dedicated to the ProtCNNSim method, which uses ProtENN2 per-residue embeddings to produce new high-dimensional per-family embeddings and can be used for calculating inter-family similarity scores based on these representations. ProtCNNSim is applied to Pfam annotation and helps to refine clan membership for Pfam families, suggesting both new members of existing clans and potential new clans for future Pfam releases. In summary, this thesis explores the benefits of employing deep learning techniques for understanding protein functions. It provides empirical evidence of performance improvements that can be achieved by such methods, and contributes towards assessing the feasibility of broader adoption of such methods for protein function annotation.