
Linmeng Liu

Shanghai CASB Biotechnology (China)

Publishes on Genomics and Phylogenetic Studies, Gut microbiota and health, Bioinformatics and Genomic Networks. 5 papers and 1.2k citations.

5 Publications
1.2k Total Citations


Top publications by citations

Majorbio Cloud: A one‐stop, comprehensive bioinformatic platform for multiomics analyses
Yi Ren, Yu Guo, Caiping Shi et al.|iMeta|2022
Cited by 847|Open Access

The platform consists of three modules, which are pre-configured bioinformatic pipelines, cloud toolsets, and online omics' courses. The pre-configured bioinformatic pipelines not only combine analytic tools for metagenomics, genomes, transcriptome, proteomics and metabolomics, but also provide users with powerful and convenient interactive analysis reports, which allow them to analyze and mine data independently. As a useful supplement to the bioinformatics pipelines, a wide range of cloud toolsets can further meet the needs of users for daily biological data processing, statistics, and visualization. The rich online courses of multi-omics also provide a state-of-art platform to researchers in interactive communication and knowledge sharing.

Majorbio Cloud 2024: Update single‐cell and multiomics workflows
Chang Han, Caiping Shi, Linmeng Liu et al.|iMeta|2024
Cited by 249|Open Access

Majorbio Cloud (https://cloud.majorbio.com/) is a one-stop online analytic platform that aims to promote the development of bioinformatics services, narrow the gap between wet and dry experiments, and accelerate discoveries for the life sciences community. In 2024, three single-omics workflows, two multiomics workflows, and extensions were newly released to facilitate omics data mining and interpretation. Advances in high-throughput multiomics technologies have significantly influenced life science and basic medical research. Multiomics data, including genomic/transcriptomic sequencing and proteomic/metabolomic mass spectra, pave the way for the discovery of novel biomarkers that predict treatment response across multiple molecular levels. The state-of-the-art multiomics technologies have enabled researchers to understand biological processes and molecular functions in health and disease. The emerging novel omics strategies and instruments continue to evolve toward higher throughput and lower detection costs. The ever-growing quantity of multiomics data needs to be accessible and analyzed in an easy, fast, and accurate way. There is an urgent need to develop and apply appropriate bioinformatic tools and pipelines to interpret these data. Two key elements of omics are automatic data analysis and data visualization. Bioinformatics analysis platforms, such as Cell Ranger [1], MetaboAnalyst [2], GEPIA2 [3], and iNAP [4], provide web interfaces to access the data and computational results. However, these interaction-friendly web services are each designed for a single type of omics. Majorbio Cloud (https://cloud.majorbio.com/) offers an easy and powerful approach to profiling the bulk transcriptome, single-cell transcriptome, proteome, metabolome, metagenome, and other omics data. 
It helps researchers analyze complex multiomics data and infer the biological meaning of integrated omics data. Since Majorbio Cloud's first publication in iMeta, it has attracted the attention of researchers around the world and has been widely used by researchers who are not specialists in omics or bioinformatics [5]. Furthermore, it is a platform for interactive communication and the dissemination of omics knowledge. Single-cell RNA sequencing is an emerging technology for high-throughput sequencing analysis of genetic material at the level of individual cells [6]. It has been widely applied in immunology, developmental biology, oncology, cardiology, and neurobiology. The single-cell transcriptomics workflow is an easy-to-use and effective pipeline for high-dimensional single-cell transcriptome data mining, including the following six steps: (1) data preprocessing; (2) cell filtration; (3) batch effect removal and sample merging; (4) clustering; (5) marker gene identification; and (6) downstream analysis. The detailed process is as follows: Reads are processed using Cell Ranger (v7.1.0) with default parameters. FASTQ files generated by the Illumina sequencer are aligned to the genome. The Seurat package is used for cell normalization and regression based on the unique molecular identifier counts for each sample and mito % to obtain the scaled data, which are normalized by the function NormalizeData for further analysis. The function FindVariableGenes is used to calculate highly variable genes across the single cells. Unsupervised cell cluster results are generated from the top 30 principal components of the principal component analysis (PCA) by applying the graph-based cluster method (resolution 0.8) in the Seurat package. For subclustering, we apply the same procedure of scaling, dimensionality reduction, and clustering to a specific set of data (usually restricted to one cell type). 
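
The cell filtration and normalization steps described above can be illustrated with a minimal, dependency-free Python sketch. It is not the platform's code (the workflow itself uses Seurat in R). The thresholds `max_mito` and `min_umis`, the function names, and the scale factor of 10,000 are assumptions chosen for illustration; the log-normalization only resembles the kind of per-cell scaling that a function such as NormalizeData performs.

```python
import math

def mito_fraction(counts, mito_genes):
    """Fraction of a cell's UMI counts that come from mitochondrial genes."""
    total = sum(counts.values())
    mito = sum(c for gene, c in counts.items() if gene in mito_genes)
    return mito / total if total else 0.0

def log_normalize(counts, scale_factor=10_000):
    """Scale each gene by the cell's total UMI count, then apply log(1 + x)."""
    total = sum(counts.values())
    return {gene: math.log1p(c / total * scale_factor) for gene, c in counts.items()}

def filter_and_normalize(cells, mito_genes, max_mito=0.2, min_umis=5):
    """Drop low-count or high-mito cells (hypothetical thresholds), normalize the rest."""
    kept = {}
    for cell_id, counts in cells.items():
        if sum(counts.values()) < min_umis:
            continue  # too few UMIs to be a usable cell
        if mito_fraction(counts, mito_genes) > max_mito:
            continue  # likely a damaged or dying cell
        kept[cell_id] = log_normalize(counts)
    return kept

cells = {
    "cell_1": {"MT-CO1": 1, "ACTB": 9},
    "cell_2": {"MT-CO1": 8, "ACTB": 2},
}
normalized = filter_and_normalize(cells, mito_genes={"MT-CO1"})
print(sorted(normalized))
```

In this toy input, `cell_2` is dropped because 80% of its counts are mitochondrial, while `cell_1` is kept and normalized.
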
For each cluster, we used the Wilcoxon Rank-Sum test to find significantly differentially expressed genes relative to the remaining clusters. SingleR [7] and known marker genes were used to identify cell types. Downstream analysis, such as differential expression genes and pathway enrichment of different cell types, pseudo-time analysis, and cell communication analysis, could be used to reveal the functions, states, and interactions of various types of cells in a sample (Figure 1). The proteomics workflow is a user-friendly, comprehensive pipeline for data-independent acquisition mass spectrometry-based, label-free quantitation (LFQ), and Tandem mass tag-based quantitative proteomics data processing, analysis, and interpretation (Figure 2). The standard proteomics workflow consists of seven main modules: data processing, protein expression and functional annotation, statistical analysis, protein set analysis, weighted gene correlation network analysis (WGCNA), gene set enrichment analysis (GSEA), and time-series data analysis. An additional module, biomarker discovery and model development, is provided for medical cohort research. The function of the proteomics data processing module is low-quality data filtering and missing value estimation. The protein expression and functional annotation module includes Venn, PCA, correlation analysis, and functional annotations based on databases or software. Paired/unpaired t test, analysis of variance, Kruskal–Wallis test, and post-hoc test are provided for identifying proteins with statistically significant differences. A protein set is a protein list that is related to the phenotype of the research object according to the protein expression profile, functional annotation, biological pathway enrichment, and research background. Users can generate protein sets of their interest and interpret the data via clustering, protein–protein interaction, pathway analysis, functional enrichment, and so on. 
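
As an illustration of the marker-gene test mentioned above, the following Python sketch compares one gene's expression in a cluster against the remaining cells with a Wilcoxon rank-sum test. It is a simplified sketch, not the Seurat implementation: it uses the normal approximation without tie or continuity correction, and the function names are hypothetical.

```python
import math

def rank_with_ties(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def rank_sum_test(in_cluster, other_cells):
    """Two-sided rank-sum test via the normal approximation (no tie correction)."""
    n1, n2 = len(in_cluster), len(other_cells)
    ranks = rank_with_ties(list(in_cluster) + list(other_cells))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2      # Mann-Whitney U for the cluster
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return u, z, p_value

u, z, p_value = rank_sum_test([5, 6, 7, 8], [1, 2, 3, 4])
print(u, round(z, 3), p_value < 0.05)
```

With real data the test would be repeated for every gene and cluster, and the resulting p-values would need multiple-testing correction, which this sketch omits.
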
LASSO-Logistic/Cox regression, Random Forest [8], and SVM [9] can be used for disease risk prediction, early diagnosis, prognosis monitoring, and response to treatment. Metabolomics research is primarily based on the use of liquid/gas chromatography-mass spectrometry and nuclear magnetic resonance spectroscopy to detect, identify, and quantify small molecule metabolites in organisms [10]. Metabolomics data are large and complex, often requiring specialized data analysis software as well as extensive knowledge of cheminformatics, bioinformatics, and statistics. To enable users to perform metabolomics data analysis easily and quickly, we provide a comprehensive solution for the metabolomics workflow (Figure S1). The standard metabolomics workflow consists of five steps: (1) data preprocessing: the methods mainly include filtering the missing values of the original data, missing value estimation, data normalization, quality control verification, and data transformation; (2) sample comparison analysis: multivariate statistical analysis is performed by PCA and partial least squares discriminant analysis (PLS-DA); (3) metabolite annotation: metabolites are annotated against the Kyoto Encyclopedia of Genes and Genomes (KEGG; http://www.genome.jp/kegg/) and the Human Metabolome Database (HMDB; https://hmdb.ca/); (4) differential metabolite analysis: a combination of multidimensional analysis and single-dimensional analysis is used to screen differential metabolites between groups; and (5) metabolite set analysis: analysis and visualization of the key or differential metabolites, such as metabolite clustering, correlation analysis, and so on. Moreover, we also provide some advanced analyses to help explain biological processes, such as biomarker discovery by random forest, support vector machine (SVM), and so on. 
Multiomics technologies help researchers uncover mechanistic insights into disease pathophysiology and delineate the landscape of clinical phenotypes. Multiomics provides an integrated perspective across multiple levels, while single omics data can only partially explain one aspect of complex biological processes [11]. The transcriptomic and proteomic data combined analysis pipeline supports differential expression analysis, correlation between messenger RNA (mRNA) and protein abundance, functional annotation and enrichment, GSVA [12], and interactive visualization including Venn, quadrant diagram, nine quadrant diagram, bubble plot, box plot, and donut plot. The pipeline enables a combined, complementary insight, which improves the understanding of biological molecular processes from mRNA to protein. The microbiome and metabolome association analysis workflow can be used to analyze the association between species/function and metabolites so as to help establish the logical association between “species/function-metabolite-phenotype/target organ.” The results systematically delineate the regulatory mechanisms of biological processes of different dimensions. To facilitate the intuitive presentation of scientific findings, the workflow provides a broad diversity of analyses. The main analysis contents are as follows: (1) annotation and abundance (species, KEGG orthology genes, and metabolic species) of the single-omics feature set; (2) Procrustes analysis and two-way orthogonal partial least squares (O2PLS) are used to analyze the synergy between microbial communities and metabolites and to screen the species and metabolites that contributed the most to distinguishing different groups of samples. 
(3) The association between key flora and metabolites can be explained by HCLUST correlation analysis, Mantel test network heatmap, expression correlation heatmap and chord map, expression correlation network, linear regression analysis, MaAsLin analysis, and canonical correlation analysis. (4) The microbiome and metabolome data are used to form a combinatorial marker panel, and four integrated machine learning algorithms, including random forest, SVM, least absolute shrinkage and selection operator (LASSO), and logistic regression, are used to efficiently screen predictive biomarkers. In addition, MIMOSA2 (a metabolic network-based tool for inferring mechanism-supported relationships in microbiome-metabolome data) [13], mmvec [14], and WGCNA are available for further analyses to interpret the possible interactions between microorganisms and metabolites. (5) Metabolite detection technology is used to detect intermediate metabolites. In combination with the metabolic pathways predicted by metagenome data, the downstream metabolic pathways can be reconstructed to obtain a complete microbial metabolic pathway. To improve the user experience and expand the depth of analysis, we have developed a completely new interactive analysis mode. Taking the eukaryotic reference transcriptome analysis pipeline as an example, users can select the data table generated in the pipeline and set parameters in extension tools to complete more in-depth data mining. The intermediate data generated by the workflow are extracted into parameters in JavaScript Object Notation (JSON) format, encoded with base64, and transmitted to a specific tool, where the parameters are parsed (Figure S2). 
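
The parameter hand-off between a pipeline and an extension tool can be sketched in Python as a JSON and base64 round trip. This is only an illustration of the idea: the field names (`table`, `tool`, `fold_change`) are hypothetical and do not come from the platform. Note that base64 is an encoding, not encryption, so any confidentiality would have to come from the transport layer.

```python
import base64
import json

def pack_parameters(params):
    """Serialize parameters to JSON and encode them as URL-safe base64 text."""
    raw = json.dumps(params, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")

def unpack_parameters(token):
    """Decode the base64 text and parse the JSON back into a dictionary."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return json.loads(raw.decode("utf-8"))

# Hypothetical parameters selected by a user in the pipeline report.
params = {"table": "diff_expression", "tool": "volcano_chart", "fold_change": 2.0}
token = pack_parameters(params)
restored = unpack_parameters(token)
print(restored == params)
```

The URL-safe alphabet is used here so that the token could be placed in a query string without further escaping.
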
Twenty-eight eukaryotic reference transcriptome analysis pipeline-specific extension tools are available for users, including integrative genomics viewer visualization, multiref-genome blast, differential expression genes radar chart, hyperbolic curve volcano chart, circos chart, single gene GSEA, multipathways GSEA, and so on. Since October 2016, more than 150,000 scientific and clinical research users, involving over 9000 well-known universities and institutions, have completed more than 600,000 omics data mining tasks on the Majorbio Cloud platform. In 2024, 2015 journal articles cited Majorbio Cloud in their methods. 20, 62, and 393 research articles have been published with the facilitation of the single-cell transcriptomics workflow, proteomics workflow, and metabolomics workflow for data mining, respectively. We will constantly update and iterate the platform to make our users delve more deeply into the omics data. Jichen Han, Chang Han, Caiping Shi, and Linmeng Liu conceived the platform and idea. Linmeng Liu, Caiping Shi, and Wenyao Fu implemented the MIST main code. Chang Han, Qianqian Yang, Yan Wang, and Xiaodan Li designed the graphical user interface. Chang Han, Yan Wang, Xiaodan Li, and Qianqian Yang wrote the manuscript. Chang Han was responsible for editing and revising the manuscript. All authors contributed to the development of Majorbio Cloud. All authors have read the final manuscript and approved it for publication. The authors acknowledge Dr. Boya Liao for the advice on this manuscript. This work was supported by a grant from the Shanghai Science and Technology Little Giant Project (220HX001400). The authors declare no conflict of interest. Supporting Information (graphical abstract, Supporting Information Table) may be found in the online DOI or iMeta Science http://www.imeta.science/. Figure S1: Metabolomics workflow. Figure S2: “Pipeline + Extensions” interactive analysis mode. 
Please note: The publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.

Expansion of Thaumarchaeota habitat range is correlated with horizontal transfer of ATPase operons
Baozhan Wang, Wei Qin, Yi Ren et al.|The ISME Journal|2019
Cited by 89|Open Access

Thaumarchaeota are responsible for a significant fraction of ammonia oxidation in the oceans and in soils that range from alkaline to acidic. However, the adaptive mechanisms underpinning their habitat expansion remain poorly understood. Here we show that expansion into acidic soils and the high pressures of the hadopelagic zone of the oceans is tightly linked to the acquisition of a variant of the energy-yielding ATPases via horizontal transfer. Whereas the ATPase genealogy of neutrophilic Thaumarchaeota is congruent with their organismal genealogy inferred from concatenated conserved proteins, a common clade of V-type ATPases unites phylogenetically distinct clades of acidophilic/acid-tolerant and piezophilic/piezotolerant species. A presumptive function of pumping cytoplasmic protons at low pH is consistent with the experimentally observed increased expression of the V-ATPase in an acid-tolerant thaumarchaeote at low pH. Consistently, heterologous expression of the thaumarchaeotal V-ATPase significantly increased the growth rate of E. coli at low pH. Its adaptive significance to growth in ocean trenches may relate to pressure-related changes in membrane structure in which this complex molecular machine must function. Together, our findings reveal that the habitat expansion of Thaumarchaeota is tightly correlated with extensive horizontal transfer of atp operons.

MIST: A microbial identification and source tracking system for next‐generation sequencing data
Minghui Song, Chang Han, Linmeng Liu et al.|iMeta|2023
Cited by 5|Open Access

The Professional Committee of Microbiology of the National Pharmacopoeia Commission organized the drafting of the Technical Guidelines for Microbial Whole Genome Sequencing (WGS), aiming to standardize the method process and technical indicators of microbial WGS and ensure the accuracy of sequencing and identification. On the basis of the Guidelines, we developed an integrated microbial identification and source tracking (MIST) system, which could meet the needs of microbial identification and contamination investigation in food and drug quality control. MIST integrates three analysis pipelines: 16S/18S/internal transcribed spacer amplicon-based microbial identification, WGS-based microbial identification, and single-nucleotide polymorphism-based microbial source tracking. MIST can analyze sequence data in a variety of formats, such as Fasta, base call file, and FASTQ. It can be connected to a high-throughput sequencing instrument to acquire sequencing data directly. We also developed a publicly accessible web server for MIST (http://syj.i-sanger.cn). Microbial identification is of great value for clinical, epidemiological, food, and pharmaceutical research [1]. Traditionally, microbes have been identified based on their morphological, physical, and biochemical properties [2]. However, many prokaryotic microbes are difficult to culture using traditional methods [3] and thus cannot be detected by traditional methods. These unculturable microbes harbor a potential source of novel metabolites and are essential components of natural metabolic networks [4]. Moreover, traditional methods also fail to detect novel culturable microbes and have problems in detecting unusual microbes that have not been comprehensively evaluated [5]. High-throughput sequencing technology (HTS) has enabled sequence-based genomics to become one of the routine and promising methods for microbial identification [6]. 
HTS-based methods can be subdivided into two categories: amplicon sequencing [7], which amplifies conserved sequences in microbes (e.g., 16S ribosomal RNA [rRNA] for bacteria and 18S ribosomal DNA [rDNA]/internal transcribed spacer [ITS] region for fungi), and whole genome sequencing (WGS) [8], which sequences the whole genome of a microbe after isolation. The 16S rDNA-based amplicon sequencing is an efficient method to investigate all bacteria in a sample because this region has been recognized as the conventional marker for prokaryotic identification [9]. The community has accumulated a large amount of well-characterized 16S rDNA sequences in large databases, such as Ribosomal Database Project [10] and SILVA [11]. A major limitation of amplicon sequencing is its lack of discrimination among closely related species [12]. WGS-based bacterial identification provides higher discriminatory power and allows bacterial identification at species or even at strain level. It also provides a powerful way of investigating functional genes, such as antibiotic resistance genes (ARGs) [13, 14] and virulence factor genes (VFGs) [15]. Furthermore, the multilocus sequence type (MLST) [16] and single-nucleotide polymorphism (SNP) [17-19] enable source tracking of genetically closely related bacteria that were isolated from different sources. Such analysis enables WGS-based applications in multiple fields, such as forensic investigations, strain identification, and outbreak tracking [20]. Currently, there are some web services and tools for microbial identification, for example, BacWGSTdb [21], ImageGP [22], Bacterial Analysis Pipeline (CGE) (https://cge.cbs.dtu.dk/services/cge/) [23], Qiime2 [24], EasyAmplicon [25], GCType (GCM Type Strain Sequencing project), and rANOMALY [26]. Each website has its own unique strengths and limitations. For example, BacWGSTdb offers MLST-based and whole-genome-based bacterial genotyping but only accepts assembled genome files as inputs. 
CGE provides various tools for genome-based phenotyping, phylogeny, and annotation of ARGs and VFGs. However, users must upload their FASTQ data to each of these tools separately due to the lack of an integrated backend. Furthermore, all web-based tools require a fast and consistent internet connection to upload raw sequence files, which can have sizes of hundreds to thousands of MBs [8]. With the development of NGS technology, the downstream bioinformatics analysis is challenging, and more software and systems need to be developed [27, 28]. Here, we present a system for the classification and identification of microbes. It implements sophisticated pipelines for both amplicon sequencing data, which enable efficient profiling of unculturable microbes, and WGS data, which enable accurate genotyping of cultured microbes. The system also implements pipelines for MLST, SNP-based source tracking, and ARG or VFG annotation from WGS data. The system consists of three pipelines: (1) amplicon-based microbial identification using markers such as 16S rDNA/18S rDNA/ITS genes, (2) WGS-based microbial identification, and (3) SNP-based source tracking. To initiate the analysis, users only need to upload to the server sequencing files in base call file or FASTQ format generated by an Illumina sequencer, or Fasta-formatted sequence files (such as assembled genomes or 16S sequences). Then, users can create a task by selecting a pipeline and setting the corresponding parameters. Finally, sequencing data and parameters are submitted to the server and trigger the analytic pipelines (Figure 1A). The system provides mainstream reference databases for microbial identification and functional annotation (Figure 1B). We also have a data management system that is responsible for monitoring the processing tasks and managing the database, such as input and output files (Figure 1D). 
Users can view the task results on the online interactive analysis report interface and download the results for further use (Figure 1C). This pipeline can be used to identify microbes, cultured or uncultured, using 16S/18S rDNA and ITS regions. The pipeline contains “Quality Control,” “Primer Removal,” “Denoising,” “Annotation,” and “Evaluation” functional components. In short, Fastp v0.23.4 [29] was used to perform quality control and clean the paired-end (PE) FASTQ reads by trimming and filtering reads based on their quality and length. The reads were truncated at any site receiving an average quality score of <20 over a 50 bp sliding window, and the truncated reads shorter than 50 bp were discarded; reads containing ambiguous characters were also discarded. The resulting reads were subjected to the server for merging the pair-end reads, followed by primer removal by a homemade Python script, duplicate removal by vsearch v2.22.1 [30], and denoising by deblur v1.1.1 [31]. The procedure above generates a set of amplicon sequence variants (ASVs), which were each treated as a taxonomic unit. Each ASV was then aligned to a reference genome database using BLASTn v2.11.0 [32]. The taxonomic classification of ASV was estimated by best-hits matches in the reference database. Phylogenetic tree was constructed by the maximum likelihood (ML) method. The workflow is illustrated in Figure 2A. We selected dozens of bacterial species from two different habitats, the human gut and marine, and generated corresponding simulated sequencing data based on the V3–V4, V4, and V4–V5 regions of 16S ribosomal gene. On the basis of the simulated data, the performance of the amplicon identification program was tested. All the bacteria were identified correctly on the genus level (Table S1). The WGS has been increasingly used in basic research and clinical diagnostics. 
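
The read-cleaning rule described above (truncate where a 50 bp sliding window has a mean quality below 20, then discard reads shorter than 50 bp or containing ambiguous bases) can be sketched in plain Python. This is not the Fastp implementation. In particular, cutting the read at the start of the first failing window is an assumption made for this sketch, and the function names are hypothetical.

```python
def truncate_read(seq, quals, window=50, min_mean_q=20):
    """Cut the read at the start of the first window whose mean quality is too low."""
    if len(seq) < window:
        return seq, quals  # no full window fits; the length filter handles this read
    window_sum = sum(quals[:window])
    for start in range(len(seq) - window + 1):
        if start > 0:
            # Slide the window by one base: add the new base, drop the old one.
            window_sum += quals[start + window - 1] - quals[start - 1]
        if window_sum / window < min_mean_q:
            return seq[:start], quals[:start]
    return seq, quals

def passes_filters(seq, min_len=50):
    """Keep reads that are long enough and contain no ambiguous base (N)."""
    return len(seq) >= min_len and "N" not in seq.upper()

def clean_reads(reads):
    """Yield (name, sequence, qualities) for reads that survive trimming and filtering."""
    for name, seq, quals in reads:
        seq, quals = truncate_read(seq, quals)
        if passes_filters(seq):
            yield name, seq, quals

reads = [
    ("good", "ACGT" * 25, [30] * 100),
    ("bad_tail", "ACGT" * 37 + "AC", [30] * 100 + [2] * 50),
    ("ambiguous", "ACGTN" * 20, [30] * 100),
]
cleaned = list(clean_reads(reads))
print([(name, len(seq)) for name, seq, _ in cleaned])
```

Here the `bad_tail` read is shortened to 68 bases, the position of the first window whose mean quality drops below 20, and the `ambiguous` read is discarded.
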
In our system, we used housekeeping genes and Average Nucleotide Identity (ANI) to identify microbial species and infer their phylogenetic relationships with others. The pipeline contains six modules: Quality control, Assembly, Gene prediction, ANI calculation, Annotation, and MLST. Fastp was used for quality control and cleaning the PE FASTQ reads. In the assembly process, SPAdes v3.11 [33] was used to assemble the genome, whereas metaSPAdes v3.10 [34] was used for contaminated samples. BUSCO v5.1 [35] was used to evaluate the completeness and contamination of the genomes. We used Prodigal to predict the open reading frames and then translated them into protein products. HMMER v3.1b [36] was used to find the 31 single-copy housekeeping genes (for genes list, see genome database curation) in the genome. The databases CARD v3.1.3 and carbohydrate-active enzymes (CAZy) (202001 updated) [37] were used separately to identify the possible ARGs and CAZy, with an e-value cutoff of 1e-5. The virulence factor database (VFDB) 2022 was used to identify potential virulence factors for the identified pathogen strain. Species identification then proceeded in three steps. First, the sequences of the single-copy housekeeping genes were extracted from the predicted genes after an HMM search against the profiles of the 31 single-copy housekeeping genes. Second, each housekeeping gene was searched with BLAST against the 31 single-copy housekeeping genes database, keeping the top 200 BLAST results for each gene at an e-value cutoff of 1e-5 with the same score and identity. Third, for each species in the database, we counted the number of housekeeping genes that included the species in the BLAST results and ranked species based on this number. By default, the pipeline filtered out species supported by fewer than 15 housekeeping genes, but this value can be modified by users. Our strategy can therefore identify not only cultured individual microbes but also contaminated samples. 
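
The counting and ranking step described above can be illustrated with a short Python sketch. It assumes the BLAST results have already been parsed into a mapping from housekeeping gene to the species found among its retained hits; that input layout and the function name are hypothetical, while the default cutoff of 15 genes follows the text.

```python
from collections import defaultdict

def rank_candidate_species(hits_per_gene, min_genes=15):
    """Count how many housekeeping genes support each species and rank them.

    hits_per_gene maps a gene name to the species names seen among that
    gene's retained BLAST hits (a species may appear more than once).
    """
    support = defaultdict(set)
    for gene, species_hits in hits_per_gene.items():
        for species in set(species_hits):
            support[species].add(gene)  # each gene counts once per species
    ranked = sorted(
        ((species, len(genes)) for species, genes in support.items()),
        key=lambda item: (-item[1], item[0]),
    )
    return [(species, n) for species, n in ranked if n >= min_genes]

# Toy input: 31 hypothetical genes with three candidate species.
genes = [f"gene_{i:02d}" for i in range(31)]
hits_per_gene = {gene: [] for gene in genes}
for gene in genes[:20]:
    hits_per_gene[gene].append("Species A")
for gene in genes[:10]:
    hits_per_gene[gene].append("Species B")
for gene in genes[15:31]:
    hits_per_gene[gene].append("Species C")

candidates = rank_candidate_species(hits_per_gene)
print(candidates)
```

In this toy case two species pass the cutoff, which is the kind of result that could point to a mixed or contaminated sample and would then be followed up with the ANI calculation.
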
The ANI value was calculated between the genome of the sample and each genome of the species selected from the above method, and only the maximum ANI value of a species was reported. For some species that contained too many strains, we chose up to 1000 strains for ANI calculation. Barrnap v0.9 (https://github.com/tseemann/barrnap) was used to predict 16S rDNA. The phylogenetic tree of 16S rDNA and housekeeping genes was built using IQ-TREE v1.6.12 [38]. Further, if the species identified were included in the PubMLST database (http://pubmlst.org) [39], the molecular typing of the sample was analyzed automatically. The workflow is illustrated in Figure 2C. This workflow was applied to analyze a sample, downloaded from the National Center for Biotechnology Information Short Read Archive database under accession number: SRR12560292. The sample data contained 1,418,820 reads, which produced 46 scaffolds, and the length of the assembly was 2.76 Mbp. The 31 single-copy housekeeping genes were enriched in Staphylococcus aureus, and the S. aureus S3 was the most related strain in the database. The MLST type was ST22, and a total of 142 genes were identified as having a role in the resistance to various antibiotics in CARD and 462 virulence factors in this sample (Figure 3). The genomes of 560 ATCC standard strains were downloaded to test the accuracy of our identification procedure. There were only five genomes whose identification results were inconsistent with their own names. Through careful analysis, it was found that three of them were caused by the naming error of the reference species in the database (GTDB database has corrected their names based on WGS). The other two had disputes about the nomenclature of the representative strains. However, all of our identifications came from the highest-scoring genomes in the database (Tables S2–S4). 
In practice, in addition to microbial species identification, we also need to analyze the evolutionary relationship between different isolates of a certain species. For example, in a pharmaceutical factory environment, we can determine the source of strain contamination by analyzing the evolutionary distance between isolates. Two modes for microbial traceability by SNP phylogeny are integrated into the system, which are implemented through the software EToki v1.2 [40] and kSNP v3.0 [18], respectively. In the EToki mode, SNPs are called by comparing genomes to a reference genome, and the derived consensus sequence file is used to create an ML phylogeny. The kSNP is a program for SNP identification and phylogenetic analysis without genome alignment or the requirement for reference genomes, which is more useful when the concerned microorganisms are unculturable or have a large intraspecies evolutionary distance. In addition, a phylogenetic tree view is provided in both modes. The workflow is illustrated in Figure 2B. SILVA v138 and UNITE v8.0 [41] are integrated as the source of the amplicon reference database used in the microbial identification by 16S rDNA/18S rDNA/ITS pipeline and the microbial community diversity analysis pipeline. Details of the reference database are described in Table 1. In addition, we built a housekeeping gene database covering 223,491 bacterial RefSeq [42] genomes for fast and accurate profiling of microbial identification in the WGS workflow. Genes with the same name or product of 31 single-copy housekeeping genes (dnaG, frr, infC, nusA, pgk, pyrG, rplA, rplB, rplC, rplD, rplE, rplF, rplK, rplL, rplM, rplN, rplP, rplS, rplT, rpmA, rpoB, rpsB, rpsC, rpsE, rpsI, rpsJ, rpsK, rpsM, rpsS, smpB, and tsf) were extracted from each genome to construct the full database, which contains 6,855,279 amino acid sequences in total. The 31 single-copy housekeeping genes database was used to identify probable species in the WGS pipeline. 
WGS, amplicon sequencing, and metagenomic sequencing are increasingly used in research to produce complicated environmental sequence data sets, which paved the way for a cultivation-independent genetic content assessment and exploitation of the entire communities of organisms [4, 42-44]. Therefore, it is urgent to develop WGS and amplicon-based microbial species identification pipelines in the field of food safety and drug control. Here, we provide a system to analyze the WGS, amplicon sequences for microbial identification, MLST typing, and SNP source tracking. In our system, one important potential use of the WGS microbial identification pipeline is to identify contaminated sequences or metagenome samples. Simultaneously, it has great value in speeding up pathogen detection in clinical laboratories, while the existing identification and taxonomy methods may be unreliable with contaminated samples. Meicheng Yang, Feng Qin, and Yi Ren conceived the system and idea. Linmeng Liu and Hao Gao implemented the MIST main code. Chang Han and Dan Zhang designed the graphical user interface. Minghui Song, Chang Han, and Linmeng Liu wrote the manuscript. Yi Ren, Chang Han, Qiongqiong Li, and Yiling Fan were responsible for editing and revising the manuscript. All authors contributed to the development of MIST. We are grateful to Zhuo Yang for the graphical user interface development. This work was supported by the grants from the Science and Technology Commission of Shanghai Municipality (22142201600 and 20DZ2293600), the Open Fund Project of NMPA Key Laboratory for Testing Technology of Pharmaceutical Microbiology (2021-WSW-01), and the Standard Improvement Project of Chinese Pharmacopoeia Commission (2022Y21 and 2023Y36). The authors declare no conflict of interest. Supplementary materials (tables, scripts, graphical abstracts, slides, videos, Chinese translated version, and update materials) may be found in the online DOI or iMeta Science http://www.imeta.science/. 

Majorbio Cloud 2026 provides comprehensive analysis workflows for microbiome
Jianhua Zhao, Linmeng Liu, Jichen Han et al.|iMeta|2026
Cited by 3Open Access

The integrated microbiome data analysis platform on Majorbio Cloud (https://cloud.majorbio.com/) encompasses 26 analytical workflows, with a core architecture of two modules: single-omics workflows and cross-omics integration and correlation workflows. The platform supports multi-scale microbiome research (strain to community levels) and cross-omics analyses spanning DNA, RNA, protein, and metabolite layers. The platform features four key functions: (1) an application guide, which streamlines analytical workflows for user convenience; (2) default analysis parameters and one-click analysis, which enable one-step data processing; (3) one-click plot enhancement, which optimizes figures to meet academic publication standards; and (4) a plot patchwork feature, which stores optimized images in “my gallery”, facilitates the creation of publication-ready image composites, supports composite downloads in PDF/PNG/SVG formats, and allows patchwork templates to be saved for subsequent use. By late 2025, the platform had facilitated over 5,050 scientific publications, accelerating advances in microbiome research.
To the editor,
Microorganisms are critical to all life on Earth, playing essential roles in key biological processes and diverse interactions with other organisms that shape ecosystems, drive biogeochemical cycles, and influence both human and environmental health [1]. The rapid advancement of high-throughput sequencing technologies for environmental samples has revolutionized our understanding of microbial diversity and functions. Vast genomic datasets spanning Earth's biomes now provide a blueprint of microbial life, enabling a more holistic perspective on the structure and function of microbiomes across various ecosystems.
Over the last decade, a growing number of computational pipelines have been developed to meet the analytical challenges of high-throughput sequencing, such as QIIME 2 [2], EasyAmplicon [3], MG-RAST [4], gcMeta [5], IPGA [6], MicrobiomeAnalyst [7], SAMSA2 [8], metaTP [9], and ViOTUcluster [10]. While existing analytical pipelines have significantly advanced microbiome research in fields such as human health, agriculture, and environmental monitoring, they remain limited in scope and generally lack an integrated cross-omics perspective (Table S1). Furthermore, most pipelines require users to possess specialized bioinformatics skills, such as coding proficiency for data analysis or the preparation of complex input files, thereby restricting accessibility for non-specialists. Additionally, the visualization outputs often require manual refinement using professional software (e.g., Adobe Illustrator) prior to publication. As microbiome research advances toward more sophisticated cross-omics strategies—integrating heterogeneous datasets like microbiome-metabolome and microbiome-transcriptome analyses—existing tools often fail to meet the demands of such integrated analyses. To address the growing need for diversified microbiome data analysis, we have developed an integrated platform on the Majorbio Cloud [11, 12]. This platform facilitates multi-scale research (from strains to communities) and cross-omics investigations across DNA, RNA, protein, and metabolite layers, incorporating both relative and absolute quantification methods. All user-uploaded raw sequencing data and intermediate analysis files are stored securely in our cloud infrastructure, with strict access controls and encryption protocols in place. Users retain full ownership and ultimate management authority over their data. Through the platform's interface, they can manage datasets, control sharing, and assign granular permissions to collaborators. 
The integrated microbiome data analysis platform comprises a comprehensive suite of 26 analytical workflows. Its core architecture is organized into two primary modules: single-omics analytical workflows and cross-omics integration and correlation workflows (Figure 1). This design is engineered to deliver depth and rigor in the analysis of individual omics layers, while acting as a bridge to enable high-dimensional data integration and facilitate biological discovery. This integrated platform features eight core workflows spanning key domains of microbiome research, enabling precise and in-depth analysis of each data type: (1) Bacterial (Archaeal)/Fungal genome; (2) Prokaryotic transcriptomics; (3) Amplicon sequencing; (4) Metagenomics; (5) Metagenome-assembled genome; (6) Metatranscriptomics; (7) Proteomics; and (8) Metabolomics. To overcome the limitations of single-omics strategies, this platform has incorporated advanced cross-omics integration workflows, which are specifically designed to elucidate intrinsic correlations across distinct molecular layers and include two core association analysis workflows: (1) Microbiome–metabolome association analysis and (2) Microbiome–host transcriptome association analysis. A more detailed introduction to the aforementioned workflows is provided in Supporting Information. The software and packages utilized in these workflows are listed in Table S2. A comparative summary of the three binning tools (MetaBAT2, CONCOCT, and MaxBin2) is available in Table S3. In summary, this integrated platform offers a unified analytical framework for microbiome research, linking descriptive community ecology to the exploration of underpinning microbial mechanisms and providing a solid basis for advancing functional microbiome science.
Amplicon sequencing is a highly targeted approach enabling detailed characterization of specific genomic regions, such as 16S/18S rRNA genes or the Internal Transcribed Spacer (ITS) region.
Unlike whole-genome sequencing, this technique employs PCR-based amplification of target gene regions prior to sequencing. The analytical workflow is primarily dictated by the data processing paradigm—either clustering reads into Operational Taxonomic Units (OTUs) or resolving exact Amplicon Sequence Variants (ASVs) (Figure 2). Furthermore, the workflow may also vary depending on the sequencing technology utilized (second- or third-generation sequencing platforms) and the quantification strategy adopted (relative or absolute quantification) (Figure 1). For data processing, the OTU-based workflow employs UPARSE [13] for OTU clustering, whereas the ASV-based workflow utilizes denoising tools such as DADA2 [14], Deblur [15], and UNOISE2 [16]. Taxonomic classification is supported by over 20 accessible taxonomic annotation databases, including SILVA, RDP, Greengenes, NT, UNITE, Protist Ribosomal Reference Database 2 (PR2), MaarjAM, and FunGene. Functional potential prediction is supported via tools such as PICRUSt2, Tax4Fun, BugBase, FAPROTAX, and FUNGuild. The workflow provides 25 alpha diversity indices, covering richness indices (Sobs, Chao1, and Ace), diversity indices (Shannon and Simpson), coverage index (Coverage), evenness index (Pielou's evenness), and phylogenetic diversity index (PD). Beta diversity is explored through hierarchical clustering, Principal Component Analysis (PCA), Principal Coordinate Analysis (PCoA), or Non-Metric Multidimensional Scaling (NMDS), and the significance of separation is tested by Permutational Multivariate Analysis of Variance (PERMANOVA) or Analysis of Similarities (ANOSIM). Given that microbial communities are often shaped by external environmental factors, the workflow incorporates environmental correlation analysis such as Redundancy Analysis (RDA), distance-based RDA (db-RDA), Variance Partitioning Analysis (VPA), and Mantel test. Furthermore, advanced modules are available for specialized research needs. 
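As a concrete illustration of the alpha diversity index families named above, the following minimal Python sketch computes several of them from a single sample's taxon count vector. It is not the platform's implementation: the source does not state which formulas or software the workflow uses for each index, so the function names are hypothetical, the Simpson index is given in its Gini-Simpson form (1 minus the sum of squared proportions, one of several conventions), Chao1 uses the bias-corrected estimator, and Coverage is taken to be Good's coverage.

```python
import math


def _observed(counts):
    """Keep only taxa that were actually observed (count > 0)."""
    obs = [c for c in counts if c > 0]
    if not obs:
        raise ValueError("sample contains no reads")
    return obs


def sobs(counts):
    """Observed richness: number of taxa with at least one read."""
    return len(_observed(counts))


def chao1(counts):
    """Bias-corrected Chao1 richness estimate from singletons and doubletons."""
    obs = _observed(counts)
    f1 = sum(1 for c in obs if c == 1)  # taxa seen exactly once
    f2 = sum(1 for c in obs if c == 2)  # taxa seen exactly twice
    return len(obs) + f1 * (f1 - 1) / (2 * (f2 + 1))


def shannon(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)), using the natural logarithm."""
    obs = _observed(counts)
    total = sum(obs)
    return -sum((c / total) * math.log(c / total) for c in obs)


def simpson(counts):
    """Gini-Simpson index 1 - sum(p_i^2); other Simpson conventions exist."""
    obs = _observed(counts)
    total = sum(obs)
    return 1.0 - sum((c / total) ** 2 for c in obs)


def pielou_evenness(counts):
    """Pielou's evenness J' = H' / ln(S); set to 0.0 here when S == 1."""
    s = sobs(counts)
    if s == 1:
        return 0.0  # ln(1) = 0, so J' is undefined; 0.0 is this sketch's choice
    return shannon(counts) / math.log(s)


def goods_coverage(counts):
    """Good's coverage: 1 - (number of singletons / total reads)."""
    obs = _observed(counts)
    f1 = sum(1 for c in obs if c == 1)
    return 1.0 - f1 / sum(obs)
```

For a sample with counts `[5, 3, 1, 1, 2, 0]`, `sobs` returns 5 and `chao1` returns 5.5 (two singletons, one doubleton). Results from a production pipeline may differ if it uses a different logarithm base or a different Simpson convention.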
For microbial community assembly, methods include the Neutral Community Model (NCM), Normalized Stochasticity Ratio (NST), beta Nearest Taxon Index (betaNTI), and infer Community Assembly Mechanisms via Phylogenetic-bin-based Null Model Analysis (iCAMP). In medical microbiology research, the workflow integrates predictive modeling approaches such as Random Forest, Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), and Least Absolute Shrinkage and Selection Operator (LASSO). Since its launch in late 2016, the amplicon sequencing analytical workflow has facilitated the publication of approximately 3240 scientific papers (based on a Google Scholar search conducted on December 3, 2025, using keywords: “cloud.majorbio.com operational taxonomic unit OR amplicon sequence variants” or “www.i-sanger.com operational taxonomic unit OR amplicon sequence variants”). This widespread adoption is attributed to the workflow's three core strengths: user-centric design, scientific rigor, and publication-ready visualization output.
Since the concept was first introduced in 1998, metagenomic technologies driven by high-throughput sequencing have revolutionized our understanding of microbial communities, providing unprecedented insights into the genetic and functional diversity of microorganisms across Earth's ecosystems [17]. The application of long-read sequencing platforms, such as Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio), enables single-molecule-level analysis of microbial communities and their genomic features, thereby delivering near-complete genomic context. Continuous advancements in bioinformatics data analysis tools are further propelling metagenomic research toward higher resolution and more precise mechanistic interpretations. Data analysis serves as a cornerstone of metagenomic research.
To address diverse analytical demands, this workflow offers three core approaches (Figure 2): read-based analysis (e.g., Kraken2, MetaPhlAn4, HUMAnN3), assembly-based analysis (employing tools such as MEGAHIT, IDBA-UD, SOAPdenovo2), and metagenome-assembled genome analysis. The following section elaborates on the assembly-based analysis approach as a representative example. The assembly-based metagenomic workflow comprises six sequential core steps: data preprocessing, metagenomic assembly, gene prediction, construction of non-redundant gene sets, taxonomic and functional annotation, and data visualization with downstream analysis. To support comprehensive investigations, the workflow integrates 20 curated annotation databases, including core databases (e.g., NR, eggNOG, KEGG, CAZy, CARD, VFDB) and specialized databases (e.g., GO, PHI-base, TCDB, QSDB, Pfam). Furthermore, to cater to diverse research needs, 11 specialized functional gene sets have been curated based on the KEGG database, encompassing key biogeochemical cycles and metabolic pathways: carbon (C) cycling, nitrogen (N) cycling, phosphorus (P) cycling, sulfur (S) cycling, heavy metal cycling (e.g., arsenic, manganese, cadmium, chromium), iron metabolism, environmental stress response pathways, short-chain fatty acid metabolism (e.g., acetic acid, propionic acid, butyric acid), organic carbon degradation, microplastic biodegradation, and organic pollutant degradation. Since its initial release in early 2018, this workflow has facilitated the publication of 1687 articles across diverse research fields (based on a Google Scholar search conducted on December 3, 2025, using the search terms: “cloud.majorbio.com metagenome” or “www.i-sanger.com metagenome”). Moving forward, the workflow will be continuously updated and enhanced, integrating more advanced analytical tools and methodologies to improve its usability and analytical depth. 
It is expected to facilitate efficient and streamlined metagenomic data processing and analysis for an expanding user base.
Integrated microbiome and metabolome studies represent a paradigm shift from static description to dynamic functional interpretation. This approach goes beyond characterizing microbial composition in isolation, directly revealing how microbial communities interact with their host via metabolites that serve as a “molecular language,” and ultimately elucidating their impacts on host health and disease. Furthermore, integrating microbiome with host transcriptome data enables a systematic dissection of host–microbe interactions, facilitating simultaneous characterization of microbial community structure/function and host gene expression. Such an integrated strategy provides multidimensional evidence for identifying key drivers and biomarkers in studies of disease mechanisms, agricultural ecology, and environmental adaptation. Statistical and machine learning methods are widely used to analyze paired datasets, such as microbiome–metabolome or microbiome–eukaryotic transcriptome data, to identify microbe–metabolite or microbe–host transcriptome associations. Notably, human gut microbiome–metabolome studies have garnered increasing attention in recent years, driven by accumulating evidence of the interplay among gut microbes, metabolites, and host health [18]. In such gut-focused investigations, these methods are particularly valuable for identifying specific microbe-associated metabolites that are potentially modifiable through microbiome-based interventions, thereby offering a pathway to promote gut metabolic health. Therefore, this section focuses specifically on the integrated microbiome–metabolome analysis workflow as a representative example. The integrated microbiome–metabolome analysis workflow enables direct correlation analysis between microbiome and metabolome datasets.
Microbiome data typically derive from amplicon sequencing or metagenomic analysis, whereas metabolome data are generated via untargeted or targeted metabolomic profiling (Figure 2). This workflow incorporates 19 analytical approaches, including Procrustes analysis, Two-way Orthogonal Partial Least Squares (O2PLS), Mantel test, Canonical Correspondence Analysis (CCA), Random Forest, Least Absolute Shrinkage and Selection Operator (LASSO), Logistic Regression, Microbial–Metabolic Interactions Model for Omics data Analysis 2 (MIMOSA2), microbe–metabolite vectors (mmvec), and Weighted Gene Co-expression Network Analysis (WGCNA).
The “Application guide” feature (Figure S1) serves as an intelligent, step-by-step module tailored for the integrated microbiome data analysis cloud platform. Its primary objective is to minimize the learning curve and maximize data analysis efficiency for users. The core value of this feature resides in transforming complex bioinformatics workflows into clear, task-specific guides, thereby significantly shortening project setup time. This enables users to focus on scientific inquiry rather than troubleshooting technical details, ultimately delivering a streamlined “one-click” analytical experience. The module facilitates the rapid configuration of optimized analytical workflows customized to specific research objectives (e.g., biomarker screening, mechanistic pathway exploration). For instance, the metagenomic analysis workflow integrates specialized guides for carbon-nitrogen-phosphorus-sulfur cycling, antibiotic resistance gene profiling, virulence factor analysis, environmental pollutant bioremediation, and key metabolite profiling, allowing users to execute configurations via one-click automation.
The default analysis parameters and one-click analysis feature (Figure S1) streamlines data analysis by integrating pre-optimized default parameters, covering key analytical steps such as sequence subsampling, distance matrix algorithms, correlation methods, and clustering.
These parameters are preconfigured based on authoritative literature and expert consensus to support scientific validity and reproducibility. Users can initiate analyses without manually adjusting complex parameters; instead, they simply select the appropriate analysis menu based on their research objectives (e.g., alpha diversity analysis, beta diversity analysis, differential analysis), and the one-click function automates the entire process. This approach offers three core advantages: lowering technical barriers, significantly enhancing analytical efficiency, and ensuring result robustness. Consequently, researchers can focus on addressing core scientific questions rather than navigating technical complexities. Across research, clinical, and industrial settings, this feature accelerates project timelines and facilitates the efficient generation of high-quality analytical reports.
The one-click plot enhancement feature (Figure S1) is designed to optimize the visualization of microbiome analysis results. It provides intelligent, ready-to-deploy, publication-quality templates for mainstream visualization types, including boxplots, stacked bar charts, heatmaps, PCA plots, PCoA plots, NMDS plots, RDA/CCA plots, db-RDA plots, linear regression plots, and VPA plots. Each template incorporates pre-configured color schemes, axis scaling, and label formatting, aligned with the aesthetic standards of top-tier journals while being tailored to data-specific characteristics (e.g., sample size, variable distribution). Users simply select the appropriate template to automatically generate visuals with harmonized colors, a clear layout, and emphasized key elements, eliminating the need for manual adjustment of graphic layers or aesthetic parameters. This reduces the barrier to generating high-quality visualizations, boosts efficiency, and ensures figures are both academically rigorous and visually compelling.
Consequently, the resulting figures engage the audience effectively in manuscripts, reports, or presentations, facilitating the clear dissemination of core findings.
The plot patchwork feature (Figure S1) streamlines the assembly and customized layout of multi-panel figures. Users first save figures generated on analysis pages to the “my gallery” module. The gallery supports comprehensive image management, allowing users to search, organize, and delete images. After selecting a minimum of two figures from the gallery, users enter the composite canvas interface to adjust canvas dimensions, row/column counts, and inter-figure spacing via layout configuration tools. Images on the canvas can be freely dragged, resized (with optional aspect ratio locking), and cropped to eliminate excess white space. Annotations, such as subfigure labels, can be added and customized, with adjustable parameters including font type, size, weight, and color. Once assembled, the composite figure can be previewed and exported in PDF or raster formats (PNG, TIFF, JPG). Layout configurations can be saved as a template for future use. This functionality significantly reduces the technical barrier to creating multi-panel figures, enhances efficiency, and enables users to produce figures that meet rigorous academic standards.
Driven by a profound understanding of global users, continuous tracking of cutting-edge technologies, and extensive experience accrued from hundreds of thousands of projects, our team of microbiology experts has steadily enhanced and expanded our comprehensive microbial multi-omics cloud platform suite. Our core goal remains to empower more researchers to achieve meaningful and impactful scientific outcomes.
While maintaining a leading position in the field, we are committed to integrating emerging technologies into our platform, notably mechanistic investigations of microbial epigenomics, applications of single-cell and spatial omics in microbiology, and meta-analysis of large-scale microbiome datasets. Looking ahead, AI-driven, self-optimizing cloud platforms will offer tremendous potential. Such systems are poised to autonomously curate high-quality microbial databases tailored to biomedical, [...] and environmental research [...]. In the future, we plan to integrate these advancements into the Majorbio microbial multi-omics cloud platform. Since 2016, over [...] researchers from more than [...] have [...] over [...] omics data on the Majorbio Cloud platform. The platform's scientific [...] is [...] by the research articles [...] that have utilized the Majorbio microbiome multi-omics cloud platform. We remain committed to the continuous improvement of the platform to empower users to gain biological insights from large-scale microbiome multi-omics data.
All authors have read the final manuscript and approved it for publication. The authors thank the [...] and technology support team of Majorbio for their technical support.
Supporting Information: Table S1 compares the integrated microbiome data analysis cloud platform on Majorbio Cloud with other pipelines; Table S2 lists the software and packages utilized in the workflows; Table S3 compares the three binning tools (MetaBAT2, CONCOCT, and MaxBin2).