ChIP-Atlas 2021 update: a data-mining suite for exploring epigenomic landscapes by fully integrating ChIP-seq, ATAC-seq and Bisulfite-seq data
Zhaonan Zou, Tazro Ohta, Fumihito Miura et al.|Nucleic Acids Research|2022
ChIP-Atlas (https://chip-atlas.org) is a web service providing both GUI- and API-based data-mining tools to reveal the architecture of the transcription regulatory landscape. ChIP-Atlas is powered by comprehensively integrating all data sets from high-throughput ChIP-seq and DNase-seq, a method for profiling chromatin regions accessible to DNase. In this update, we further collected all the ATAC-seq and whole-genome bisulfite-seq data for six model organisms (human, mouse, rat, fruit fly, nematode, and budding yeast) with the latest genome assemblies. These together with ChIP-seq data can be visualized with the Peak Browser tool and a genome browser to explore the epigenomic landscape of a query genomic locus, such as its chromatin accessibility, DNA methylation status, and protein-genome interactions. This epigenomic landscape can also be characterized for multiple genes and genomic loci by querying with the Enrichment Analysis tool, which, for example, revealed that inflammatory bowel disease-associated SNPs are the most significantly hypo-methylated in neutrophils. Therefore, ChIP-Atlas provides a panoramic view of the whole epigenomic landscape. All datasets are free to download via either a simple button on the web page or an API.
ChIP-Atlas 3.0: a data-mining suite to explore chromosome architecture together with large-scale regulome data
Zhaonan Zou, Tazro Ohta, Shinya Oki|Nucleic Acids Research|2024
ChIP-Atlas (https://chip-atlas.org/) presents a suite of data-mining tools for analyzing epigenomic landscapes, powered by the comprehensive integration of over 376 000 public ChIP-seq, ATAC-seq, DNase-seq and Bisulfite-seq experiments from six representative model organisms. To unravel the intricacies of chromatin architecture that mediates the regulome-initiated generation of transcriptional and phenotypic diversity within cells, we report ChIP-Atlas 3.0 that enhances clarity by incorporating additional tracks for genomic and epigenomic features within a newly consolidated 'annotation track' section. The tracks include chromosomal conformation (Hi-C and eQTL datasets), transcriptional regulatory elements (ChromHMM and FANTOM5 enhancers), and genomic variants associated with diseases and phenotypes (GWAS SNPs and ClinVar variants). These annotation tracks are easily accessible alongside other experimental tracks, facilitating better elucidation of chromatin architecture underlying the diversification of transcriptional and phenotypic traits. Furthermore, 'Diff Analysis,' a new online tool, compares the query epigenome data to identify differentially bound, accessible, and methylated regions using ChIP-seq, ATAC-seq and DNase-seq, and Bisulfite-seq datasets, respectively. The integration of annotation tracks and the Diff Analysis tool, coupled with continuous data expansion, renders ChIP-Atlas 3.0 a robust resource for mining the landscape of transcriptional regulatory mechanisms, thereby offering valuable perspectives, particularly for genetic disease research and drug discovery.
Cross-ancestry genome-wide analysis of atrial fibrillation unveils disease biology and enables cardioembolic risk prediction
Atrial fibrillation (AF) is a common cardiac arrhythmia resulting in increased risk of stroke. Despite highly heritable etiology, our understanding of the genetic architecture of AF remains incomplete. Here we performed a genome-wide association study in the Japanese population comprising 9,826 cases among 150,272 individuals and identified East Asian-specific rare variants associated with AF. A cross-ancestry meta-analysis of >1 million individuals, including 77,690 cases, identified 35 new susceptibility loci. Transcriptome-wide association analysis identified IL6R as a putative causal gene, suggesting the involvement of immune responses. Integrative analysis with ChIP-seq data and functional assessment using human induced pluripotent stem cell-derived cardiomyocytes demonstrated ERRg as having a key role in the transcriptional regulation of AF-associated genes. A polygenic risk score derived from the cross-ancestry meta-analysis predicted increased risks of cardiovascular and stroke mortalities and segregated individuals with cardioembolic stroke in undiagnosed AF patients. Our results provide new biological and clinical insights into AF genetics and suggest their potential for clinical applications.
Epigenetic landscape of drug responses revealed through large-scale ChIP-seq data analyses
BACKGROUND: Elucidating the modes of action (MoAs) of drugs and drug candidate compounds is critical for guiding translation from drug discovery to clinical application. Despite the development of several data-driven approaches for predicting chemical-disease associations, the molecular cues that organize the epigenetic landscape of drug responses remain poorly understood.
RESULTS: With the use of a computational method, we attempted to elucidate the epigenetic landscape of drug responses, in terms of transcription factors (TFs), through large-scale ChIP-seq data analyses. In the algorithm, we systematically identified TFs that regulate the expression of chemically induced genes by integrating transcriptome data from chemical induction experiments and almost all publicly available ChIP-seq data (consisting of 13,558 experiments). By relating the resultant chemical-TF associations to a repository of associated proteins for a wide range of diseases, we made a comprehensive prediction of chemical-TF-disease associations, which could then be used to account for drug MoAs. Using this approach, we predicted that: (1) cisplatin promotes the anti-tumor activity of TP53 family members but suppresses the cancer-inducing function of MYCs; (2) inhibition of RELA and E2F1 is pivotal for leflunomide to exhibit antiproliferative activity; and (3) CHD8 mediates valproic acid-induced autism.
CONCLUSIONS: Our proposed approach has the potential to elucidate the MoAs for both approved drugs and candidate compounds from an epigenetic perspective, thereby revealing new therapeutic targets, and to guide the discovery of unexpected therapeutic effects, side effects, and novel targets and actions.
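The abstract above does not spell out how TFs are linked to chemically induced genes. As a rough illustration only, the sketch below uses a generic over-representation test (a one-sided hypergeometric test) to rank TFs by how strongly their ChIP-seq target genes overlap a set of induced genes. This is an assumed stand-in, not the authors' algorithm; the function names `hypergeom_upper_tail` and `rank_tfs` and all gene and TF labels are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K are TF targets."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def rank_tfs(induced_genes, tf_targets, background):
    """Rank TFs by enrichment of their targets among the induced genes.

    induced_genes: genes up-regulated by a chemical (hypothetical input)
    tf_targets:    dict mapping TF name -> genes bound in ChIP-seq (hypothetical)
    background:    all genes considered in the analysis
    Returns (tf, overlap, p_value) tuples sorted by ascending p-value.
    """
    universe = set(background)
    induced = set(induced_genes) & universe
    results = []
    for tf, targets in tf_targets.items():
        bound = set(targets) & universe
        overlap = len(induced & bound)
        p = hypergeom_upper_tail(overlap, len(bound), len(induced), len(universe))
        results.append((tf, overlap, p))
    return sorted(results, key=lambda r: r[2])


if __name__ == "__main__":
    # Toy data: 20 background genes, 5 of them induced by the chemical.
    background = [f"g{i}" for i in range(20)]
    induced = ["g0", "g1", "g2", "g3", "g4"]
    tf_targets = {
        "TF_A": ["g0", "g1", "g2", "g3", "g10"],  # strong overlap
        "TF_B": ["g5", "g6", "g7", "g8", "g9"],   # no overlap
    }
    for tf, overlap, p in rank_tfs(induced, tf_targets, background):
        print(tf, overlap, p)
```

In practice the p-values would need multiple-testing correction across the many ChIP-seq experiments, which this sketch omits.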
Extraction of biological terms using large language models enhances the usability of metadata in the BioSample database
BioSample is a repository of experimental sample metadata. It is a comprehensive archive that enables searches of experiments, regardless of type. However, there is substantial variability in the submitted metadata due to the difficulty in defining comprehensive rules for describing them and the limited user awareness of best practices in creating them. This inconsistency poses considerable challenges to the findability and reusability of archived data. Given the scale of BioSample, which hosts over 40 million records, manual curation is impractical. Automatic rule-based ontology mapping methods have been proposed to address this issue, but their effectiveness is limited by the heterogeneity of the metadata. Recently, large language models (LLMs) have gained attention in natural language processing and are promising tools for automating metadata curation. In this study, we evaluated the performance of LLMs in extracting cell line names from BioSample descriptions using a gold-standard dataset derived from ChIP-Atlas, a secondary database of epigenomics experiment data in which samples were manually curated. The LLM-assisted methods outperformed traditional approaches, achieving higher accuracy and coverage. We further extended them to extract information about experimentally manipulated genes from metadata when manual curation had not yet been applied in ChIP-Atlas. This also yielded successful results, including the facilitation of more precise filtering of the data and the prevention of possible misinterpretations caused by the inclusion of unintended data. These findings underscore the potential of LLMs in improving the findability and reusability of experimental data in general, which would considerably reduce the user workload and enable more effective scientific data management.