An expanded evaluation of protein function prediction methods shows an improvement in accuracy
BACKGROUND: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. RESULTS: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. CONCLUSIONS: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.
UTurku: Drug Named Entity Recognition and Drug-Drug Interaction Extraction Using SVM Classification and Domain Knowledge
The DDIExtraction 2013 task in the SemEval conference concerns the detection of drug names and statements of drug-drug interactions (DDI) from text. Extraction of DDIs is important for providing up-to-date knowledge on adverse interactions between coadministered drugs. We apply the machine-learning-based Turku Event Extraction System (TEES) to both tasks. We evaluate three feature sets: syntactic features derived from deep parsing, optionally enhanced with features derived from DrugBank or from both DrugBank and MetaMap. TEES achieves F-scores of 60% for the drug name recognition task and 59% for the DDI extraction task.
Lysophosphatidic acid and sphingosine-1-phosphate promote morphogenesis and block invasion of prostate cancer cells in three-dimensional organotypic models
Normal prostate and some malignant prostate cancer (PrCa) cell lines undergo acinar differentiation and form spheroids in three-dimensional (3-D) organotypic culture. Acini formed by PC-3 and PC-3M, and to a lesser extent by other PrCa cell lines, spontaneously undergo an invasive switch, leading to the disintegration of epithelial structures and the basal lamina, and formation of invadopodia. This demonstrates the highly dynamic nature of epithelial plasticity, balancing epithelial-to-mesenchymal transition against metastable acinar differentiation. This study assessed the role of lipid metabolites in epithelial maturation. PC-3 cells completely failed to form acinar structures in delipidated serum. Adding back lysophosphatidic acid (LPA) and sphingosine-1-phosphate (S1P) rescued acinar morphogenesis and repressed invasion effectively. Blocking LPA receptor 1 (LPAR1) functions by siRNA (small interfering RNA) or the specific LPAR1 inhibitor Ki16425 promoted invasion, while silencing of other G-protein-coupled receptors responsive to LPA or S1P mainly caused growth arrest or had no effects. The G-proteins Gα(12/13) and Gα(i) were identified as key mediators of LPA signalling via stimulation of RhoA and Rho kinases ROCK1 and 2, activating Rac1, while inhibition of adenylate cyclase and accumulation of cAMP may be secondary. Interfering with these pathways specifically impeded epithelial polarization in transformed cells. In contrast, blocking the same pathways in non-transformed, normal cells promoted differentiation. We conclude that LPA and LPAR1 effectively promote epithelial maturation and block invasion of PrCa cells in 3-D culture. The analysis of clinical transcriptome data confirmed reduced expression of LPAR1 in a subset of PrCas. Our study demonstrates a metastasis-suppressor function for LPAR1 and Gα(12/13) signalling, regulating cell motility and invasion versus epithelial maturation.
Cell line name recognition in support of the identification of synthetic lethality in cancer from text
MOTIVATION: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus. RESULTS: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers. AVAILABILITY AND IMPLEMENTATION: The manually annotated datasets, the cell line dictionary, derived corpora, NERsuite models and the results of the large-scale run on unannotated texts are available under open licenses at http://turkunlp.github.io/Cell-line-recognition/. CONTACT: sukaew@utu.fi.
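As an illustration of how an F-score such as those reported for the cell line tagger can be computed, the following is a minimal sketch in Python. It assumes exact-match scoring of mention spans, which may differ from the matching criterion actually used in the study; the `Span` type, the function name `span_f_score` and the example offsets are hypothetical.

```python
from typing import Set, Tuple

# Hypothetical mention representation: (document id, start offset, end offset).
Span = Tuple[str, int, int]


def span_f_score(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) under exact span matching.

    Exact matching is an assumption made for this sketch; relaxed
    (overlap-based) matching would need a different comparison.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # The F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Invented toy data: three gold mentions, four predicted mentions, two in common.
gold_mentions = {("doc1", 0, 4), ("doc1", 10, 15), ("doc2", 3, 8)}
predicted_mentions = {("doc1", 0, 4), ("doc2", 3, 8), ("doc2", 20, 25), ("doc2", 30, 34)}

precision, recall, f_score = span_f_score(gold_mentions, predicted_mentions)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_score:.3f}")
```

Sets are used so that duplicate predictions of the same span are counted once; a tagger evaluated per mention occurrence would keep document offsets distinct, as the tuples here do.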
Neural Network and Random Forest Models in Protein Function Prediction
Kai Hakala, Suwisa Kaewphan, Jari Björne et al.|IEEE/ACM Transactions on Computational Biology and Bioinformatics|2020
Over the past decade, the demand for automated protein function prediction has increased due to the volume of newly sequenced proteins. In this paper, we address the function prediction task by developing an ensemble system that automatically assigns Gene Ontology (GO) terms to a given input protein sequence. The system combines the GO predictions made by random forest (RF) and neural network (NN) classifiers. Both RF and NN models rely on features derived from BLAST sequence alignments, taxonomy and protein signature analysis tools. In addition, we report on experiments with a NN model that directly analyzes the amino acid sequence as its sole input, using a convolutional layer. The Swiss-Prot database is used as the training and evaluation data. In the CAFA3 evaluation, which relies on experimental verification of the functional predictions, our submitted ensemble model demonstrates competitive performance, ranking among the 10 best-performing systems out of over 100 submitted systems. In this paper, we evaluate and further improve the CAFA3-submitted system. Our machine learning models, together with the data pre-processing and feature generation tools, are publicly available as open source software at https://github.com/TurkuNLP/CAFA3.
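The abstract does not state how the RF and NN predictions are combined, so the following Python sketch shows only one simple possibility: a weighted average of per-term confidence scores followed by a cutoff. The function name `combine_go_scores`, the `rf_weight` and `threshold` parameters, and the example scores are all hypothetical and are not taken from the CAFA3 repository.

```python
from typing import Dict


def combine_go_scores(
    rf_scores: Dict[str, float],
    nn_scores: Dict[str, float],
    rf_weight: float = 0.5,
    threshold: float = 0.5,
) -> Dict[str, float]:
    """Combine per-GO-term scores from two classifiers by weighted averaging.

    Assumption for this sketch: a term that one model did not predict
    contributes a score of 0.0 from that model.
    """
    combined: Dict[str, float] = {}
    for term in set(rf_scores) | set(nn_scores):
        score = (
            rf_weight * rf_scores.get(term, 0.0)
            + (1.0 - rf_weight) * nn_scores.get(term, 0.0)
        )
        # Keep only terms whose combined score reaches the cutoff.
        if score >= threshold:
            combined[term] = score
    return combined


# Invented scores for a single protein.
rf_predictions = {"GO:0005515": 1.0, "GO:0003677": 0.5}
nn_predictions = {"GO:0005515": 0.5, "GO:0016301": 0.75}

for term, score in sorted(combine_go_scores(rf_predictions, nn_predictions).items()):
    print(term, score)
```

Averaging is chosen here only because it is easy to follow; other schemes, such as training a second-stage classifier on the two models' outputs, would fit the same description in the abstract.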