SNARE-CNN: a 2D convolutional neural network architecture to identify SNARE proteins from high-throughput sequencing data
Deep learning is increasingly used to solve problems in many fields, often with state-of-the-art performance. In bioinformatics, it can reduce the need for hand-crafted feature extraction while still reaching high performance. This study uses deep learning to predict SNARE proteins, which carry out one of the most vital molecular functions in life science. Functional loss of SNARE proteins has been implicated in a variety of human diseases (e.g., neurodegenerative diseases, mental illness, and cancer). A precise model for identifying these proteins is therefore important for understanding these diseases and for designing drug targets. Our SNARE-CNN model, which combines two-dimensional convolutional neural networks with position-specific scoring matrix (PSSM) profiles, identified SNARE proteins with a sensitivity of 76.6%, specificity of 93.5%, accuracy of 89.7%, and MCC of 0.7 on the cross-validation dataset. We also evaluated the model on an independent dataset, and the results show that overfitting was avoided. Compared with other state-of-the-art methods, this approach achieved significant improvement in all of the metrics. The study provides an effective model for identifying SNARE proteins and a basis for further research applying deep learning in bioinformatics, especially in protein function prediction. SNARE-CNN is freely available at https://github.com/khanhlee/snare-cnn.
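The four metrics reported above (sensitivity, specificity, accuracy, MCC) all follow from the counts of a binary confusion matrix. A minimal sketch of that arithmetic is given below; the function name `binary_metrics` and the counts in the usage lines are hypothetical illustrations, not values or code from the SNARE-CNN paper or repository.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, accuracy and MCC from confusion counts.

    tp/fn count positives (e.g. SNARE proteins) predicted correctly/incorrectly;
    tn/fp count negatives predicted correctly/incorrectly.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total if total else 0.0

    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only to exercise the function.
example = binary_metrics(tp=30, fn=10, tn=50, fp=10)
```

Because MCC uses all four counts, it is less affected by class imbalance than accuracy, which may be why abstracts of this kind report it alongside sensitivity and specificity.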
SuccSite: Incorporating Amino Acid Composition and Informative k-Spaced Amino Acid Pairs to Identify Protein Succinylation Sites
Hui-Ju Kao, Van-Nui Nguyen, Kai-Yao Huang et al.|Genomics Proteomics & Bioinformatics|2020
Protein succinylation is a biochemical reaction in which a succinyl group (-CO-CH2-CH2-CO-) is attached to the lysine residue of a protein molecule. Lysine succinylation plays important regulatory roles in living cells. However, studies in this field are limited by the difficulty in experimentally identifying the substrate site specificity of lysine succinylation. To facilitate this process, several tools have been proposed for the computational identification of succinylated lysine sites. In this study, we developed an approach to investigate the substrate specificity of lysine succinylated sites based on amino acid composition. Using experimentally verified lysine succinylated sites collected from public resources, the significant differences in position-specific amino acid composition between succinylated and non-succinylated sites were represented using the Two Sample Logo program. These findings enabled the adoption of an effective machine learning method, support vector machine, to train a predictive model with not only the amino acid composition, but also the composition of k-spaced amino acid pairs. After the selection of the best model using a ten-fold cross-validation approach, the selected model significantly outperformed existing tools based on an independent dataset manually extracted from published research articles. Finally, the selected model was used to develop a web-based tool, SuccSite, to aid the study of protein succinylation. Two proteins were used as case studies on the website to demonstrate the effective prediction of succinylation sites. We will regularly update SuccSite by integrating more experimental datasets. SuccSite is freely accessible at http://csb.cse.yzu.edu.tw/SuccSite/.
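The "composition of k-spaced amino acid pairs" feature mentioned above can be illustrated with a short sketch. For a sequence window and a gap k, it counts every residue pair separated by exactly k positions and normalizes the counts into a 400-dimensional frequency vector. The function name, the skipping of non-standard residues, and the normalization by the number of valid pairs are assumptions made for illustration; the abstract does not specify SuccSite's exact encoding.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def k_spaced_pair_composition(window, k):
    """Return the frequency of each residue pair separated by k positions.

    For k = 0 the pairs are adjacent residues; for k = 1 there is one
    residue between them, and so on. Pairs containing a non-standard
    symbol (e.g. padding 'X') are skipped (an assumption of this sketch).
    """
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=2)}
    valid_pairs = 0
    for i in range(len(window) - k - 1):
        pair = window[i] + window[i + k + 1]
        if pair in counts:
            counts[pair] += 1
            valid_pairs += 1
    if valid_pairs == 0:
        return {pair: 0.0 for pair in counts}
    return {pair: n / valid_pairs for pair, n in counts.items()}


# Hypothetical toy window centred on a lysine; real windows are longer.
features = k_spaced_pair_composition("AKAKA", k=1)
```

In practice, vectors for several values of k would typically be concatenated and passed to a classifier such as a support vector machine, together with the plain amino acid composition.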
Integrating ESM-2 and Graph Neural Networks with AlphaFold-2 Structures for Enhanced Protein Function Prediction
Protein function prediction is essential for elucidating biological processes and accelerating drug discovery. However, the vast number of unannotated protein sequences and the limited availability of experimentally validated functional data remain major challenges. Although deep learning models based on protein sequences or protein-protein interaction networks have shown promise, their performance is still restricted, particularly for proteins without interaction data. Furthermore, many existing approaches treat sequence and structural information separately, potentially resulting in suboptimal feature representations. To address these limitations, we propose an improved graph-based framework that integrates two key innovations: (i) ESM-2, a state-of-the-art protein language model, to generate semantically rich sequence embeddings; and (ii) a hybrid pooling mechanism within graph convolutional blocks to better capture both global and local structural features from AlphaFold2-predicted structures. Experiments on the human proteome demonstrate that our model consistently outperforms existing methods in predicting molecular function, cellular component, and biological process annotations. These findings highlight the advantages of combining advanced sequence representations with enhanced structural learning for accurate and generalizable protein function prediction.
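The abstract does not describe how its hybrid pooling mechanism is built. One common way to combine a global summary with locally salient signals is to concatenate mean pooling and max pooling over the per-residue (node) feature vectors of a protein graph. The sketch below shows only that generic idea; the function name `hybrid_pool` and the mean-plus-max choice are assumptions, not the authors' method.

```python
def hybrid_pool(node_features):
    """Concatenate mean-pooled and max-pooled node features.

    node_features: list of equal-length feature vectors, one per graph node
    (e.g. one per residue). Mean pooling summarizes the whole graph, while
    max pooling keeps the strongest activation in each feature dimension.
    """
    if not node_features:
        raise ValueError("node_features must contain at least one node")
    n_nodes = len(node_features)
    columns = list(zip(*node_features))  # one tuple per feature dimension
    mean_part = [sum(col) / n_nodes for col in columns]
    max_part = [max(col) for col in columns]
    return mean_part + max_part


# Hypothetical two-node graph with two features per node.
graph_vector = hybrid_pool([[1.0, 4.0], [3.0, 2.0]])
```

The output is twice as wide as a single node vector, so any layer that consumes it must be sized accordingly.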
Integrating CNN and Bi-LSTM for protein succinylation sites prediction based on Natural Language Processing technique
Exploiting Two-Layer Support Vector Machine to Predict Protein SUMOylation Sites
Van-Nui Nguyen, Huy-Khoi Do, Thi-Xuan Tran et al.|Lecture notes in networks and systems|2018