Tae-Hyun Hwang

A hypergraph-based learning algorithm for classifying gene expression and arrayCGH data with prior knowledge

Ze Tian, Tae-Hyun Hwang, Rui Kuang|Bioinformatics|2009

Cited by 115|Open Access

MOTIVATION: Incorporating biological prior knowledge into predictive models is a challenging data integration problem in analyzing high-dimensional genomic data. We introduce a hypergraph-based semi-supervised learning algorithm called HyperPrior to classify gene expression and array-based comparative genomic hybridization (arrayCGH) data using biological knowledge as constraints on graph-based learning. HyperPrior is a robust two-step iterative method that alternately finds the optimal labeling of the samples and the optimal weighting of the features, guided by constraints encoding prior knowledge. The prior knowledge for analyzing gene expression data is that cancer-related genes tend to interact with each other in a protein-protein interaction (PPI) network. Similarly, the prior knowledge for analyzing arrayCGH data is that probes that are spatially nearby in their layout along the chromosomes tend to be involved in the same amplification or deletion event. Based on the prior knowledge, HyperPrior imposes a consistent weighting of the correlated genomic features in graph-based learning. RESULTS: We applied HyperPrior to two arrayCGH datasets and two gene expression datasets for both cancer classification and biomarker identification. On all the datasets, HyperPrior achieved competitive classification performance compared with SVMs and the other baselines utilizing the same prior knowledge. HyperPrior also identified several discriminative regions on chromosomes and discriminative subnetworks in the PPI network, both of which contain cancer-related genomic elements. Our results suggest that HyperPrior is promising in utilizing biological prior knowledge to achieve better classification performance and more biologically interpretable findings in gene expression and arrayCGH data. AVAILABILITY: http://compbio.cs.umn.edu/HyperPrior CONTACT: kuang@cs.umn.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
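The labeling half of such a two-step method can be illustrated with generic label propagation on a weighted hypergraph. The sketch below is not the HyperPrior implementation: it assumes the commonly used normalized hypergraph formulation, treats samples as vertices and features as hyperedges, keeps the hyperedge weights fixed, and omits the weight-update step and the prior-knowledge constraints. The function name and the parameter `alpha` are hypothetical.

```python
import numpy as np

def hypergraph_label_propagation(H, w, y, alpha=0.5):
    """Propagate labels y over a weighted hypergraph (illustrative sketch).

    H     : (n_vertices, n_hyperedges) 0/1 incidence matrix
    w     : (n_hyperedges,) non-negative hyperedge weights
    y     : (n_vertices,) initial labels (+1, -1, or 0 for unlabeled)
    alpha : trade-off in (0, 1) between smoothness and fitting y (assumed)

    Assumes every vertex has positive weighted degree and every
    hyperedge contains at least one vertex.
    """
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    y = np.asarray(y, dtype=float)

    d_v = H @ w            # weighted vertex degrees
    d_e = H.sum(axis=0)    # hyperedge sizes
    dv_inv_sqrt = 1.0 / np.sqrt(d_v)

    # Theta = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}
    theta = (dv_inv_sqrt[:, None] * H) @ np.diag(w / d_e) @ (H.T * dv_inv_sqrt[None, :])

    # Closed-form solution f = (1 - alpha) (I - alpha * Theta)^{-1} y
    n = H.shape[0]
    return (1.0 - alpha) * np.linalg.solve(np.eye(n) - alpha * theta, y)

# Toy example: samples 0 and 1 share one hyperedge, samples 2 and 3 another.
H = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
scores = hypergraph_label_propagation(H, w=[1.0, 1.0], y=[1, 0, 0, -1])
```

In the toy example the unlabeled sample 1 receives a positive score because it shares a hyperedge with the positively labeled sample 0, and sample 2 receives a negative score for the symmetric reason.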

Learning on Weighted Hypergraphs to Integrate Protein Interactions and Gene Expressions for Cancer Outcome Prediction

Tae-Hyun Hwang, Ze Tian, Rui Kuang et al.|Unknown|2008

Cited by 76

Building reliable predictive models from multiple complementary genomic data for cancer study is a crucial step towards successful cancer treatment and a full understanding of the underlying biological principles. To tackle this challenging data integration problem, we propose a hypergraph-based learning algorithm called HyperGene to integrate microarray gene expressions and protein-protein interactions for cancer outcome prediction and biomarker identification. HyperGene is a robust two-step iterative method that alternately finds the optimal outcome prediction and the optimal weighting of the marker genes, guided by a protein-protein interaction network. Under the hypothesis that cancer-related genes tend to interact with each other, the HyperGene algorithm uses a protein-protein interaction network as prior knowledge by imposing a consistent weighting of interacting genes. Our experimental results on two large-scale breast cancer gene expression datasets show that HyperGene, utilizing a curated protein-protein interaction network, achieves significantly improved cancer outcome prediction. Moreover, HyperGene can also retrieve many known cancer genes as highly weighted marker genes.
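One common way to express "consistent weighting of interacting genes" is a graph-Laplacian smoothness penalty, which sums the squared weight differences over all interacting pairs. The sketch below assumes that form purely for illustration; the abstract does not state which penalty HyperGene uses, and the function name is hypothetical.

```python
import numpy as np

def weight_inconsistency(adjacency, weights):
    """Return w^T L w, where L = D - A is the graph Laplacian.

    For a symmetric 0/1 adjacency matrix this equals the sum of
    (w_i - w_j)^2 over all interacting gene pairs (i, j), so it is
    zero exactly when interacting genes carry equal weights.
    """
    A = np.asarray(adjacency, dtype=float)
    w = np.asarray(weights, dtype=float)
    L = np.diag(A.sum(axis=1)) - A
    return float(w @ L @ w)

# Toy interaction network: a path of three genes, 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
smooth = weight_inconsistency(A, [1.0, 1.0, 1.0])  # equal weights, no penalty
rough = weight_inconsistency(A, [1.0, 0.0, 1.0])   # middle gene disagrees with both neighbors
```

Adding a term like this to a learning objective discourages solutions in which one gene is weighted highly while its interaction partners are not.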

Prioritizing Disease Genes by Bi-Random Walk

Maoqiang Xie, Tae-Hyun Hwang, Rui Kuang|Lecture notes in computer science|2012

Cited by 61

Robust and efficient identification of biomarkers by classifying features on graphs

Tae-Hyun Hwang, Hugues Sicotte, Ze Tian et al.|Bioinformatics|2008

Cited by 45|Open Access

MOTIVATION: A central problem in biomarker discovery from large-scale gene expression or single nucleotide polymorphism (SNP) data is the computational challenge of taking into account the dependence among all the features. Methods that ignore the dependence usually identify biomarkers that are not reproducible across independent datasets. We introduce a new graph-based semi-supervised feature classification algorithm to identify discriminative disease markers by learning on bipartite graphs. Our algorithm directly classifies the feature nodes in a bipartite graph as positive, negative or neutral with network propagation to capture the dependence among both samples and features (clinical and genetic variables) by exploring bi-cluster structures in a graph. Two features of our algorithm are: (1) it can find a globally optimal labeling that captures the dependence among all the features and thus generates highly reproducible results across independent microarray or other high-throughput datasets; (2) it is capable of handling hundreds of thousands of features and thus is particularly useful for biomarker identification from high-throughput gene expression and SNP data. In addition, although designed for classifying features, our algorithm can also simultaneously classify test samples for disease prognosis/diagnosis. RESULTS: We applied the network propagation algorithm to study three large-scale breast cancer datasets. Our algorithm achieved competitive classification performance compared with SVMs and other baseline methods, and identified several markers with clinical or biological relevance to the disease. More importantly, our algorithm also identified highly reproducible marker genes and enriched functions from the independent datasets. AVAILABILITY: Supplementary results and source code are available at http://compbio.cs.umn.edu/Feature_Class. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
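Network propagation on a sample-feature bipartite graph can be sketched as an alternating update in which features receive scores from the samples they connect to and vice versa. The code below is a generic illustration, not the authors' released implementation: it assumes a non-negative sample-by-feature matrix, symmetric degree normalization, and a fixed number of iterations. The function name, `alpha`, and `n_iter` are hypothetical.

```python
import numpy as np

def bipartite_propagation(B, y_samples, y_features, alpha=0.5, n_iter=100):
    """Propagate labels between samples and features (illustrative sketch).

    B          : (n_samples, n_features) non-negative association matrix;
                 every row and column is assumed to have a positive sum
    y_samples  : initial sample labels (+1, -1, or 0 for unlabeled)
    y_features : initial feature labels (typically all 0)
    alpha      : propagation strength in (0, 1) (assumed)
    """
    B = np.asarray(B, dtype=float)
    y_s = np.asarray(y_samples, dtype=float)
    y_f = np.asarray(y_features, dtype=float)

    # Symmetric normalization S = Ds^{-1/2} B Df^{-1/2}
    d_s = B.sum(axis=1)
    d_f = B.sum(axis=0)
    S = B / np.sqrt(np.outer(d_s, d_f))

    f_s, f_f = y_s.copy(), y_f.copy()
    for _ in range(n_iter):
        f_f = (1.0 - alpha) * y_f + alpha * (S.T @ f_s)  # samples -> features
        f_s = (1.0 - alpha) * y_s + alpha * (S @ f_f)    # features -> samples
    return f_s, f_f

# Toy data: feature 0 is present in samples 0 and 1, feature 1 in sample 2.
B = np.array([[1, 0], [1, 0], [0, 1]])
sample_scores, feature_scores = bipartite_propagation(
    B, y_samples=[1, 0, -1], y_features=[0, 0]
)
```

Here feature 0 ends up with a positive score and feature 1 with a negative score, and the unlabeled sample 1 is pulled toward the positive class through feature 0. Features whose scores stay near zero would correspond to the "neutral" class described above, given some threshold chosen by the user.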

Inferring disease and gene set associations with rank coherence in networks

Tae-Hyun Hwang, Wei Zhang, Maoqiang Xie et al.|Bioinformatics|2011

Cited by 39|Open Access

MOTIVATION: To validate the candidate disease genes identified from high-throughput genomic studies, a necessary step is to elucidate the associations between the set of candidate genes and disease phenotypes. Conventional gene set enrichment analysis often fails to reveal associations between disease phenotypes and gene sets that contain only a short list of poorly annotated genes, because the existing annotations of disease-causative genes are incomplete. This article introduces a network-based computational approach called rcNet to discover the associations between gene sets and disease phenotypes. A learning framework is proposed to maximize the coherence between the predicted phenotype-gene set relations and the known disease phenotype-gene associations. An efficient algorithm coupling ridge regression with label propagation, together with two variants, is designed to find the optimal solution to the objective functions of the learning framework. RESULTS: We evaluated the rcNet algorithms with leave-one-out cross-validation on Online Mendelian Inheritance in Man (OMIM) data and an independent test set of recently discovered disease-gene associations. In the experiments, the rcNet algorithms achieved the best overall rankings compared with the baselines. To further validate the reproducibility of the performance, we applied the algorithms to identify the target diseases of novel candidate disease genes obtained from recent genome-wide association studies (GWAS), DNA copy number variation analysis and gene expression profiling. The algorithms ranked the target disease of the candidate genes at the top of the rank list in many cases across all three case studies. AVAILABILITY: http://compbio.cs.umn.edu/dgsa_rcNet CONTACT: kuang@cs.umn.edu.


Top publications by citations