P

Paul Horton

National Cheng Kung University

ORCID: 0000-0002-0916-7339

Publishes on Machine Learning in Bioinformatics, Genomics and Phylogenetic Studies, RNA and protein synthesis mechanisms. 200 papers and 12.2k citations.

200Publications
12.2kTotal Citations

Is this you? Claim your profile.

Add your photo, update your bio, and get notified when your ranking changes.

Top publicationsby citations

WoLF PSORT: protein localization predictor
Paul Horton, Kyungjoon Park, Takeshi Obayashi et al.|Nucleic Acids Research|2007
Cited by 4.2kOpen Access

WoLF PSORT is an extension of the PSORT II program for protein subcellular location prediction. WoLF PSORT converts protein amino acid sequences into numerical localization features; based on sorting signals, amino acid composition and functional motifs such as DNA-binding motifs. After conversion, a simple k-nearest neighbor classifier is used for prediction. Using html, the evidence for each prediction is shown in two ways: (i) a list of proteins of known localization with the most similar localization features to the query, and (ii) tables with detailed information about individual localization features. For convenience, sequence alignments of the query to similar proteins and links to UniProt and Gene Ontology are provided. Taken together, this information allows a user to understand the evidence (or lack thereof) behind the predictions made for particular proteins. WoLF PSORT is available at wolfpsort.org.

Adaptive seeds tame genomic sequence comparison
Szymon M. Kiełbasa, Raymond Wan, Kengo Sato et al.|Genome Research|2011
Cited by 1.4kOpen Access

The main way of analyzing biological sequences is by comparing and aligning them to each other. It remains difficult, however, to compare modern multi-billionbase DNA data sets. The difficulty is caused by the nonuniform (oligo)nucleotide composition of these sequences, rather than their size per se. To solve this problem, we modified the standard seed-and-extend approach (e.g., BLAST) to use adaptive seeds. Adaptive seeds are matches that are chosen based on their rareness, instead of using fixed-length matches. This method guarantees that the number of matches, and thus the running time, increases linearly, instead of quadratically, with sequence length. LAST, our open source implementation of adaptive seeds, enables fast and sensitive comparison of large sequences with arbitrarily nonuniform composition.

MitoFates: Improved Prediction of Mitochondrial Targeting Sequences and Their Cleavage Sites*
Yoshinori Fukasawa, Junko Tsuji, Szu-Chin Fu et al.|Molecular & Cellular Proteomics|2015
Cited by 613Open Access

Mitochondria provide numerous essential functions for cells and their dysfunction leads to a variety of diseases. Thus, obtaining a complete mitochondrial proteome should be a crucial step toward understanding the roles of mitochondria. Many mitochondrial proteins have been identified experimentally but a complete list is not yet available. To fill this gap, methods to computationally predict mitochondrial proteins from amino acid sequence have been developed and are widely used, but unfortunately, their accuracy is far from perfect. Here we describe MitoFates, an improved prediction method for cleavable N-terminal mitochondrial targeting signals (presequences) and their cleavage sites. MitoFates introduces novel sequence features including positively charged amphiphilicity, presequence motifs, and position weight matrices modeling the presequence cleavage sites. These features are combined with classical ones such as amino acid composition and physico-chemical properties as input to a standard support vector machine classifier. On independent test data, MitoFates attains better performance than existing predictors in both detection of presequences and in predicting their cleavage sites. We used MitoFates to look for undiscovered mitochondrial proteins from 42,217 human proteins (including isoforms such as alternative splicing or translation initiation variants). MitoFates predicts 1167 genes to have at least one isoform with a presequence. Five-hundred and eighty of these genes were not annotated as mitochondrial in either UniProt or Gene Ontology. Interestingly, these include candidate regulators of parkin translocation to damaged mitochondria, and also many genes with known disease mutations, suggesting that careful investigation of MitoFates predictions may be helpful in elucidating the role of mitochondria in health and disease. MitoFates is open source with a convenient web server publicly available.

Better prediction of protein cellular localization sites with the k nearest neighbors classifier.
Cited by 440

We have compared four classifiers on the problem of predicting the cellular localization sites of proteins in yeast and E. coli. A set of sequence derived features, such as regions of high hydrophobicity, were used for each classifier. The methods compared were a structured probabilistic model specifically designed for the localization problem, the k nearest neighbors classifier, the binary decision tree classifier, and the naïve Bayes classifier. The result of tests using stratified cross validation shows the k nearest neighbors classifier to perform better than the other methods. In the case of yeast this difference was statistically significant using a cross-validated paired t test. The result is an accuracy of approximately 60% for 10 yeast classes and 86% for 8 E. coli classes. The best previously reported accuracies for these datasets were 55% and 81% respectively.