CZ CELLxGENE Discover: a single-cell data platform for scalable exploration, analysis and modeling of aggregated data
Hundreds of millions of single cells have been analyzed using high-throughput transcriptomic methods. The cumulative knowledge within these datasets provides an exciting opportunity for unlocking insights into health and disease at the level of single cells. Meta-analyses that span diverse datasets, building on recent advances in large language models and other machine-learning approaches, open exciting new directions for modeling and extracting insight from single-cell data. Despite the promise of these and emerging analytical tools for analyzing large amounts of data, the sheer number of datasets and data models, together with limited accessibility, remain a challenge. Here, we present CZ CELLxGENE Discover (cellxgene.cziscience.com), a data platform that provides curated and interoperable single-cell data. Available via a free-to-use online data portal, CZ CELLxGENE hosts a growing corpus of community-contributed data of over 93 million unique cells. Curated, standardized and associated with consistent cell-level metadata, this collection of single-cell transcriptomic data is the largest of its kind and growing rapidly via community contributions. A suite of tools and features enables accessibility and reusability of the data via both computational and visual interfaces, allowing researchers to explore individual datasets, perform cross-corpus analysis, and run meta-analyses of tens of millions of cells across studies and tissues at the resolution of single cells.
CZ CELL×GENE Discover: A single-cell data platform for scalable exploration, analysis and modeling of aggregated data
Hundreds of millions of single cells have been analyzed to date using high-throughput transcriptomic methods, thanks to technological advances driving the increasingly rapid generation of single-cell data. This provides an exciting opportunity for unlocking new insights into health and disease, made possible by meta-analyses that span diverse datasets and build on recent advances in large language models and other machine-learning approaches. Despite the promise of these and emerging analytical tools for analyzing large amounts of data, the sheer number of datasets and their inconsistent formats, data models and accessibility remain a major challenge. Many datasets are available only via unique portals and platforms that often lack interoperability. Here, we present CZ CELLxGENE Discover (cellxgene.cziscience.com), a data platform that provides curated and interoperable data. This single-cell data resource, available via a free-to-use online data portal, hosts a growing corpus of community-contributed data that spans more than 50 million unique cells. Curated, standardized, and associated with consistent cell-level metadata, this collection of interoperable single-cell transcriptomic data is the largest of its kind. A suite of tools and features enables accessibility and reusability of the data via both computational and visual interfaces, allowing researchers to rapidly explore individual datasets and perform cross-corpus analysis. This functionality enables meta-analyses of tens of millions of cells across studies and tissues and provides global views of human cells at the resolution of single cells.
Identification of Candidate Disease Genes by EST Alignments, Synteny, and Expression and Verification of Ensembl Genes on Rat Chromosome 1q43-54
We aligned Incyte ESTs and publicly available sequences to the rat genome and analyzed rat chromosome 1q43-54, a region in which several quantitative trait loci (QTLs) have been identified, including QTLs for renal disease, diabetes, hypertension, body weight, and encephalomyelitis. Within this region, which contains 255 Ensembl gene predictions, the aligned sequences clustered into 568 Incyte genes and gene fragments. Of the Incyte genes, 261 (46%) overlapped 184 (72%) of the Ensembl gene predictions, whereas 307 were unique to Incyte. The rat-to-human syntenic map shows that this region of rat chromosome 1 is rearranged onto human chromosomes 9 and 10. Mapping the corresponding human disease phenotypes to one or the other of these chromosomes has allowed us to focus on genes associated with disease phenotypes. As an example, we used the syntenic information for the rat Rf-1 disease region and the orthologous human ESRD disease region to reduce the size of the original rat QTL to only 11.5 Mb. Using the syntenic information in combination with expression data from ESTs and microarrays, we selected a set of 66 candidate disease genes for Rf-1. The combination of the results from these different analyses represents a powerful approach for narrowing the number of genes that could play a role in the development of complex diseases.
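The overlap figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the counts stated in the abstract (568 Incyte genes, 261 overlapping, 307 unique, 255 Ensembl predictions, 184 overlapped). The variable and function names are illustrative and do not come from the paper.

```python
# Counts as reported in the abstract for rat chromosome 1q43-54.
incyte_total = 568        # Incyte genes and gene fragments in the region
incyte_overlapping = 261  # Incyte genes overlapping an Ensembl prediction
incyte_unique = 307       # Incyte genes with no Ensembl overlap
ensembl_total = 255       # Ensembl gene predictions in the region
ensembl_overlapped = 184  # Ensembl predictions overlapped by Incyte genes


def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


incyte_overlap_pct = percent(incyte_overlapping, incyte_total)
ensembl_overlap_pct = percent(ensembl_overlapped, ensembl_total)

# The overlapping and unique Incyte genes should account for all 568.
counts_consistent = incyte_overlapping + incyte_unique == incyte_total

print(incyte_overlap_pct, ensembl_overlap_pct, counts_consistent)
```

The two percentages are computed against different denominators: the Incyte figure is a share of Incyte genes, while the Ensembl figure is a share of Ensembl predictions.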
ConsHMM Atlas: conservation state annotations for major genomes and human genetic variation
A Arneson, Brooke Felsheim, Jennifer Chien et al.|NAR Genomics and Bioinformatics|2020
ConsHMM is a method recently introduced to annotate genomes into conservation states, which are defined based on the combinatorial and spatial patterns of which species align to and match a reference genome in a multi-species DNA sequence alignment. Previously, ConsHMM was only applied to a single genome for one multi-species sequence alignment. Here, we apply ConsHMM to produce 22 additional genome annotations covering human and seven other organisms for a variety of multi-species alignments. Additionally, we extend ConsHMM to generate allele-specific annotations, which we use to produce conservation state annotations for every possible single-nucleotide mutation in the human genome. Finally, we provide a web interface to interactively visualize parameters and annotation enrichments for ConsHMM models. These annotations and visualizations comprise the ConsHMM Atlas, which we expect will be a valuable resource for analyzing a variety of major genomes and genetic variation.
Predicting the Plant Root-Associated Ecological Niche of 21 Pseudomonas Species Using Machine Learning and Metabolic Modeling
Jennifer Chien, Peter Larsen|arXiv (Cornell University)|2017
Plants rarely occur in isolated systems. Bacteria can inhabit either the endosphere, the region inside the plant root, or the rhizosphere, the soil region just outside the plant root. Our goal is to understand whether genomic data combined with media-dependent metabolic model information is better for training machine-learning models to predict bacterial ecological niche than media-independent models or purely genome-based species trees. We considered three machine-learning techniques: support vector machine, non-negative matrix factorization, and artificial neural networks. In all three machine-learning approaches, the media-based metabolic models and flux balance analyses were more effective at predicting bacterial niche than the genome or PRMT models. A support vector machine trained on a minimal media base with mannose, proline and valine was the most predictive of all models and media types, with an F-score of 0.8 for rhizosphere and 0.97 for endosphere. Thus we conclude that media-based metabolic modeling provides a holistic view of the metabolome, allowing machine-learning algorithms to highlight the differences between endosphere and rhizosphere bacteria and to categorize them. No single media type best highlighted the differences between endosphere and rhizosphere bacterial metabolism, and therefore no single enzyme, reaction, or compound defined whether a bacterium's origin was the endosphere or the rhizosphere.
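The abstract reports a separate F-score for each niche. As a minimal sketch of how a per-class score of this kind can be computed, the example below assumes the F-score is the F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly. The labels are made-up toy values for illustration only, not data from the study, and `f1_for_class` is a hypothetical helper name.

```python
def f1_for_class(y_true, y_pred, positive):
    """Compute the F1 score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical, NOT from the paper): true niche vs. predicted niche.
y_true = ["endo", "endo", "endo", "rhizo", "rhizo", "rhizo"]
y_pred = ["endo", "endo", "rhizo", "rhizo", "rhizo", "endo"]

f1_endo = f1_for_class(y_true, y_pred, "endo")
f1_rhizo = f1_for_class(y_true, y_pred, "rhizo")
print(f1_endo, f1_rhizo)
```

Scoring each class separately, as the abstract does, shows whether a classifier performs unevenly across niches, which a single pooled accuracy figure would hide.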