
Yitan Zhu

Argonne National Laboratory

Publishes on Gene expression and cancer classification, Bioinformatics and Genomic Networks, Computational Drug Discovery Methods. 76 papers and 4.1k citations.

76 Publications
4.1k Total Citations


Top publications by citations

A Chaperome Subnetwork Safeguards Proteostasis in Aging and Neurodegenerative Disease
Marc Brehme, Cindy Voisine, Thomas Rolland et al.|Cell Reports|2014
Cited by 597|Open Access

Chaperones are central to the proteostasis network (PN) and safeguard the proteome from misfolding, aggregation, and proteotoxicity. We categorized the human chaperome of 332 genes into network communities using function, localization, interactome, and expression data sets. During human brain aging, expression of 32% of the chaperome, corresponding to ATP-dependent chaperone machines, is repressed, whereas 19.5%, corresponding to ATP-independent chaperones and co-chaperones, are induced. These repression and induction clusters are enhanced in the brains of those with Alzheimer's, Huntington's, or Parkinson's disease. Functional properties of the chaperome were assessed by perturbation in C. elegans and human cell models expressing Aβ, polyglutamine, and Huntingtin. Of 219 C. elegans orthologs, knockdown of 16 enhanced both Aβ and polyQ-associated toxicity. These correspond to 28 human orthologs, of which 52% and 41% are repressed, respectively, in brain aging and disease and 37.5% affected Huntingtin aggregation in human cells. These results identify a critical chaperome subnetwork that functions in aging and disease.

MR Imaging Radiomics Signatures for Predicting the Risk of Breast Cancer Recurrence as Given by Research Versions of MammaPrint, Oncotype DX, and PAM50 Gene Assays
Hui Li, Yitan Zhu, Elizabeth S. Burnside et al.|Radiology|2016
Cited by 468|Open Access

Purpose: To investigate relationships between computer-extracted breast magnetic resonance (MR) imaging phenotypes and the multigene assays MammaPrint, Oncotype DX, and PAM50, in order to assess the role of radiomics in evaluating the risk of breast cancer recurrence. Materials and Methods: Analysis was conducted on an institutional review board–approved retrospective data set of 84 deidentified, multi-institutional breast MR examinations from the National Cancer Institute Cancer Imaging Archive, along with clinical, histopathologic, and genomic data from The Cancer Genome Atlas. The data set of biopsy-proven invasive breast cancers included 74 (88%) ductal, eight (10%) lobular, and two (2%) mixed cancers. Of these, 73 (87%) were estrogen receptor positive, 67 (80%) were progesterone receptor positive, and 19 (23%) were human epidermal growth factor receptor 2 positive. For each case, computerized radiomics of the MR images yielded computer-extracted tumor phenotypes of size, shape, margin morphology, enhancement texture, and kinetic assessment. Regression and receiver operating characteristic analysis were conducted to assess the predictive ability of the MR radiomics features relative to the multigene assay classifications. Results: Multiple linear regression analyses demonstrated significant associations (R² = 0.25–0.32, r = 0.5–0.56, P < .0001) between radiomics signatures and multigene assay recurrence scores. Important radiomics features included tumor size and enhancement texture, which indicated tumor heterogeneity. Use of radiomics in the task of distinguishing between good and poor prognosis yielded area under the receiver operating characteristic curve values of 0.88 (standard error, 0.05), 0.76 (standard error, 0.06), 0.68 (standard error, 0.08), and 0.55 (standard error, 0.09) for MammaPrint, Oncotype DX, PAM50 risk of relapse based on subtype, and PAM50 risk of relapse based on subtype and proliferation, respectively, with all but the latter showing statistical difference from chance. Conclusion: Quantitative breast MR imaging radiomics shows promise for image-based phenotyping in assessing the risk of breast cancer recurrence. © RSNA, 2016. Online supplemental material is available for this article.

TCGA-Assembler: open-source software for retrieving and processing TCGA data
Yitan Zhu, Peng Qiu, Yuan Ji|Nature Methods|2014
Cited by 453|Open Access

To the Editor: The Cancer Genome Atlas (TCGA) has been generating multi-modal genomics, epigenomics, and proteomics data for thousands of tumor samples across more than 20 types of cancer. While access to most level-1 and level-2 TCGA data is restricted, all level-3 TCGA data as well as some level-1 clinical data (e.g., survival and drug treatments) are publicly available. Included in the public data are genome-wide measurements of different genetic characterizations, such as DNA copy number, DNA methylation, and mRNA expression for the same genes, providing unprecedented opportunities for systematic investigation of cancer mechanisms at multiple molecular and regulatory layers [1-3]. Few integrative data-mining tools for TCGA exist, partly because tools to acquire and assemble the large-scale TCGA data are lacking. Specifically, the level-3 TCGA data are stored as hundreds of thousands of sample- and platform-specific files, accessible through HTTP directories on the servers of the TCGA Data Coordinating Center (DCC) [4]. Navigating through all of the files manually is impossible. Although Firehose [5] assembles and publishes TCGA data, it does not share the program code for data assembly. Currently the community does not have access to open-source data-retrieval tools for automatic and flexible data acquisition, which severely hinders progress in systemic data integration and reproducible computational analysis using TCGA data. To meet these challenges, we introduce TCGA-Assembler, a software package that automates and streamlines the retrieval, assembly, and processing of public TCGA data. TCGA-Assembler equips users with the ability to produce Firehose-type TCGA data using open-source, freely available program scripts. TCGA-Assembler opens a door for the development of data-mining and data-analysis tools that generate fully reproducible results, including data acquisition.

TCGA-Assembler consists of two modules (Fig. 1a), both written in R (http://www.r-project.org). Module A streamlines data downloading and quality checking, and module B processes the downloaded data for subsequent analyses (Supplementary Methods). In particular, module A takes advantage of the informative naming mechanism of the TCGA data file system (Supplementary Fig. 1) and applies a recursive algorithm to retrieve the URLs of all data files. By string matching on the URLs, module A allows users to download most of the TCGA public data (Supplementary Table 1) across genomic features and cancer types. For each genomics feature (such as gene expression from RNA-Seq), a data matrix combining multiple samples (Fig. 1b) is produced, with rows representing genomics units (such as genes) and columns representing samples. Module B provides convenient and important data preprocessing functions, such as mega-data assembly, data cleaning, and quantification of various measurements. For users interested in integrative analysis [6], a mega data matrix (Fig. 1c) is required that matches different types of genomics measurements for the same genes across samples. Module B provides a function “CombineMultiPlatformData” to fulfill this requirement (Supplementary Methods), which involves intricate data-matching steps to overcome the feature-labeling discrepancies caused by different lab protocols and biotechnologies in the experiments. Other data-processing functions are also provided to facilitate downstream analysis (Supplementary Methods).

Figure 1: TCGA-Assembler as a tool for acquiring, assembling, and processing public TCGA data. (a) Flowchart of TCGA-Assembler. Module A acquires data from TCGA DCC. Module B processes the obtained data using various functions. (b) Illustration of a data matrix ...

Other big data tools for TCGA are available [5, 7, 8]. In particular, level-3 TCGA data can also be obtained from Firehose [5] at the MIT Broad Institute in the same format as in Fig. 1b, one for each cancer type and genomics platform. Module A of TCGA-Assembler not only provides the same type of data matrices, but also distributes the R functions and associated computer programs that produce the data matrices. Equipped with the open-source tool, users can independently control which TCGA data are acquired locally and when. More importantly, quantitatively advanced users may integrate our open-source programs with downstream data analysis tools to realize reproducible and automated data analysis for TCGA. Unique to TCGA-Assembler is module B, which provides critical functions for data cleaning and processing. For example, the mega data table (Fig. 1c) can be obtained with a single function, behind which substantial effort has been directed to ensuring the validity of the process, such as checking and correcting gene symbol discrepancies. Lastly, TCGA-Assembler is fully compatible with Firehose in that the data processing functions in module B can directly process data files downloaded from Firehose. This compatibility is crucial to those who want to take advantage of both software pipelines. TCGA-Assembler will remain freely available and open-source. In the future, more data processing and analysis functions will be continuously added to TCGA-Assembler based on user feedback and new research needs. The authors request acknowledgment of the use of TCGA-Assembler in published works.
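
The mega data matrix described above (different measurement types matched for the same genes across samples) can be illustrated with a minimal sketch. This is not the TCGA-Assembler implementation, which is written in R; the function name `combine_platforms`, the nested-dictionary layout, and the alias table are all assumptions made for illustration. The sketch renames outdated gene symbols, keeps only genes and samples present on every platform, and stacks one row per (gene, platform) pair.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: gene symbol -> {sample ID -> measured value}
Matrix = Dict[str, Dict[str, float]]


def combine_platforms(
    platforms: Dict[str, Matrix],
    aliases: Dict[str, str],
) -> Tuple[List[Tuple[str, str]], List[str], List[List[float]]]:
    """Return (row labels, sample columns, values) for shared genes and samples."""
    # Map outdated gene symbols to current ones (alias table is an assumption).
    renamed = {
        name: {aliases.get(gene, gene): row for gene, row in matrix.items()}
        for name, matrix in platforms.items()
    }
    # Keep only genes measured on every platform.
    shared_genes = sorted(set.intersection(*(set(m) for m in renamed.values())))
    # Keep only samples available for every shared gene on every platform.
    sample_sets = [set(m[g]) for m in renamed.values() for g in shared_genes]
    shared_samples = sorted(set.intersection(*sample_sets)) if sample_sets else []
    rows: List[Tuple[str, str]] = []
    values: List[List[float]] = []
    for gene in shared_genes:
        for name in sorted(renamed):
            rows.append((gene, name))
            values.append([renamed[name][gene][s] for s in shared_samples])
    return rows, shared_samples, values


# Toy input: MLL2 is an older symbol for KMT2D, so the two platforms disagree on labels.
toy = {
    "copy_number": {"TP53": {"S1": -0.4, "S2": 0.1}, "MLL2": {"S1": 0.0, "S2": 0.3}},
    "expression": {"TP53": {"S1": 5.2, "S2": 6.0, "S3": 4.4}, "KMT2D": {"S1": 7.1, "S2": 6.8}},
}
rows, samples, values = combine_platforms(toy, {"MLL2": "KMT2D"})
```

In this toy input, sample S3 is dropped because it was measured on only one platform, and the MLL2 row is matched to KMT2D only because of the alias table. That is the kind of label discrepancy the letter says must be checked and corrected.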

Quantitative MRI radiomics in the prediction of molecular classifications of breast cancer subtypes in the TCGA/TCIA data set
Hui Li, Yitan Zhu, Elizabeth S. Burnside et al.|npj Breast Cancer|2016
Cited by 358|Open Access

Using quantitative radiomics, we demonstrate that computer-extracted magnetic resonance (MR) image-based tumor phenotypes can be predictive of the molecular classification of invasive breast cancers. Radiomics analysis was performed on 91 MRIs of biopsy-proven invasive breast cancers from National Cancer Institute’s multi-institutional TCGA/TCIA. Immunohistochemistry molecular classification was performed including estrogen receptor, progesterone receptor, human epidermal growth factor receptor 2, and for 84 cases, the molecular subtype (normal-like, luminal A, luminal B, HER2-enriched, and basal-like). Computerized quantitative image analysis included: three-dimensional lesion segmentation, phenotype extraction, and leave-one-case-out cross validation involving stepwise feature selection and linear discriminant analysis. The performance of the classifier model for molecular subtyping was evaluated using receiver operating characteristic analysis. The computer-extracted tumor phenotypes were able to distinguish between molecular prognostic indicators; area under the ROC curve values of 0.89, 0.69, 0.65, and 0.67 in the tasks of distinguishing between ER+ versus ER−, PR+ versus PR−, HER2+ versus HER2−, and triple-negative versus others, respectively. Statistically significant associations between tumor phenotypes and receptor status were observed. More aggressive cancers are likely to be larger in size with more heterogeneity in their contrast enhancement. Even after controlling for tumor size, a statistically significant trend was observed within each size group (P = 0.04 for lesions ⩽2 cm; P = 0.02 for lesions >2 to ⩽5 cm) as with the entire data set (P = 0.006) for the relationship between enhancement texture (entropy) and molecular subtypes (normal-like, luminal A, luminal B, HER2-enriched, basal-like). In conclusion, computer-extracted image phenotypes show promise for high-throughput discrimination of breast cancer subtypes and may yield a quantitative predictive signature for advancing precision medicine.
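
The evaluation scheme named in the abstract (leave-one-case-out cross validation with a linear discriminant, scored by area under the ROC curve) can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it uses a single feature, omits stepwise feature selection, assumes equal class priors, and all function names are hypothetical. It also assumes every training fold has nonzero pooled variance.

```python
from statistics import mean
from typing import List


def lda_score(train_x: List[float], train_y: List[int], x: float) -> float:
    """One-feature linear discriminant score (equal priors assumed)."""
    pos = [v for v, y in zip(train_x, train_y) if y == 1]
    neg = [v for v, y in zip(train_x, train_y) if y == 0]
    mu_pos, mu_neg = mean(pos), mean(neg)
    # Pooled within-class variance; assumed to be nonzero.
    ss = sum((v - mu_pos) ** 2 for v in pos) + sum((v - mu_neg) ** 2 for v in neg)
    pooled_var = ss / (len(pos) + len(neg) - 2)
    return (mu_pos - mu_neg) / pooled_var * (x - (mu_pos + mu_neg) / 2)


def leave_one_out_scores(xs: List[float], ys: List[int]) -> List[float]:
    """Score each case with a discriminant fitted on all the other cases."""
    scores = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        scores.append(lda_score(train_x, train_y, xs[i]))
    return scores


def auc(scores: List[float], labels: List[int]) -> float:
    """Area under the ROC curve as the fraction of correctly ordered pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: a hypothetical texture feature and a binary receptor status.
feature = [1.0, 1.2, 0.9, 3.0, 3.1, 2.8]
status = [0, 0, 0, 1, 1, 1]
loo_auc = auc(leave_one_out_scores(feature, status), status)
```

Each case is scored by a model that never saw it, so the resulting AUC estimates performance on unseen cases rather than fit to the training data.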

Nuclear envelope dystrophies show a transcriptional fingerprint suggesting disruption of Rb–MyoD pathways in muscle regeneration
Cited by 327|Open Access

Mutations of lamin A/C (LMNA) cause a wide range of human disorders, including progeria, lipodystrophy, neuropathies and autosomal dominant Emery-Dreifuss muscular dystrophy (EDMD). EDMD is also caused by X-linked recessive loss-of-function mutations of emerin, another component of the inner nuclear lamina that directly interacts with LMNA. One model for disease pathogenesis of LMNA and emerin mutations is cell-specific perturbations of the mRNA transcriptome in terminally differentiated cells. To test this model, we studied 125 human muscle biopsies from 13 diagnostic groups (125 U133A, 125 U133B microarrays), including EDMD patients with LMNA and emerin mutations. A Visual and Statistical Data Analyzer (VISDA) algorithm was used to statistically model cluster hierarchy, resulting in a tree of phenotypic classifications. Validations of the diagnostic tree included permutations of U133A and U133B arrays, and use of two probe set algorithms (MAS5.0 and MBEI). This showed that the two nuclear envelope defects (EDMD LMNA, EDMD emerin) were highly related disorders and were also related to facioscapulohumeral muscular dystrophy (FSHD). FSHD has recently been hypothesized to involve abnormal interactions of chromatin with the nuclear envelope. To identify disease-specific transcripts for EDMD, we applied a leave-one-out (LOO) cross-validation approach using LMNA patient muscle as a test data set, with reverse transcription-polymerase chain reaction (RT-PCR) validations in both LMNA and emerin patient muscle. A high proportion of top-ranked and validated transcripts were components of the same transcriptional regulatory pathway involving Rb1 and MyoD during muscle regeneration (CRI-1, CREBBP, Nap1L1, ECREBBP/p300), where each was specifically upregulated in EDMD. Using a muscle regeneration time series (27 time points), we developed a transcriptional model for downstream consequences of LMNA and emerin mutations. We propose that key interactions between the nuclear envelope and Rb and MyoD fail in EDMD at the point of myoblast exit from the cell cycle, leading to poorly coordinated phosphorylation and acetylation steps. Our data are consistent with mutations of nuclear lamina components leading to destabilization of the transcriptome in differentiated cells.