Assessing the Impact of Data Preprocessing on Analyzing Next Generation Sequencing Data
Binsheng He, Rongrong Zhu, Huandong Yang et al. | Frontiers in Bioengineering and Biotechnology | 2020
Data quality control and preprocessing are often the first steps in processing next-generation sequencing (NGS) data of tumors. These steps help evaluate the quality of the sequencing data and yield high-quality data for downstream analysis. However, when we compared analysis results obtained from data preprocessed with Cutadapt, FastP, and Trimmomatic against results from the raw sequencing data, we found that the frequency of mutation detection fluctuated and differed between them, and that human leukocyte antigen (HLA) typing directly produced erroneous results. We believe our research demonstrates the impact of data preprocessing steps on downstream analysis results. We hope it will promote the development or optimization of better data preprocessing methods, so that downstream analysis can be more accurate.
SAELGMDA: Identifying human microbe–disease associations based on sparse autoencoder and LightGBM
Feixiang Wang, Huandong Yang, Yan Wu et al. | Frontiers in Microbiology | 2023
Introduction: Identifying complex associations between diseases and microbes is important for understanding the pathogenesis of diseases and designing therapeutic strategies. Microbe-Disease Association (MDA) detection methods based on biomedical experiments are expensive, time-consuming, and laborious.
Methods: Here, we developed a computational method called SAELGMDA for potential MDA prediction. First, microbe similarity and disease similarity are computed by integrating their functional similarity and Gaussian interaction profile kernel similarity. Second, each microbe-disease pair is represented as a feature vector by combining the microbe and disease similarity matrices. Next, the obtained feature vectors are mapped to a low-dimensional space by a Sparse AutoEncoder. Finally, unknown microbe-disease pairs are classified by a Light Gradient Boosting Machine (LightGBM).
Results: The proposed SAELGMDA method was compared with four state-of-the-art MDA methods (MNNMDA, GATMDA, NTSHMDA, and LRLSHMDA) under five-fold cross validation on diseases, microbes, and microbe-disease pairs on the HMDAD and Disbiome databases. The results show that SAELGMDA achieved the best accuracy, Matthews correlation coefficient, AUC, and AUPR under the majority of conditions, outperforming the other four MDA prediction models. In particular, on the HMDAD and Disbiome databases respectively, SAELGMDA obtained the best AUCs of 0.8358 and 0.9301 under cross validation on diseases, 0.9838 and 0.9293 under cross validation on microbes, and 0.9857 and 0.9358 under cross validation on microbe-disease pairs. Colorectal cancer, inflammatory bowel disease, and lung cancer are diseases that severely threaten human health. We used the proposed SAELGMDA method to find possible microbes for the three diseases. The results suggest potential associations between Clostridium coccoides and colorectal cancer and between Sphingomonadaceae and inflammatory bowel disease. In addition, Veillonella may be associated with autism. The inferred MDAs need further validation.
Conclusion: We anticipate that the proposed SAELGMDA method will contribute to the identification of new MDAs.
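The first two steps of the SAELGMDA pipeline described above can be illustrated with a minimal, dependency-free sketch. It computes a Gaussian interaction profile (GIP) kernel similarity from a toy association matrix and builds a pair feature vector by concatenation. Several things here are assumptions, not details confirmed by the abstract: the bandwidth normalization (dividing by the mean squared profile norm) is a common convention for GIP kernels, concatenating one similarity row per entity is only one plausible reading of "combining the microbe and disease similarity matrices", and the function names `gip_kernel_similarity` and `pair_feature` are hypothetical. The functional similarity, sparse autoencoder, and LightGBM steps are omitted.

```python
import math


def gip_kernel_similarity(profiles, gamma_prime=1.0):
    """Gaussian interaction profile kernel similarity between rows.

    profiles: list of equal-length interaction profiles, one per entity.
    The bandwidth is gamma_prime divided by the mean squared profile norm
    (an assumed, commonly used normalization).
    """
    n = len(profiles)
    sq_norms = [sum(v * v for v in row) for row in profiles]
    mean_sq_norm = sum(sq_norms) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = gamma_prime / mean_sq_norm

    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)
    return sim


def pair_feature(microbe_sim, disease_sim, microbe_idx, disease_idx):
    """Feature vector for one pair: microbe row followed by disease row
    (assumed way of combining the two similarity matrices)."""
    return microbe_sim[microbe_idx] + disease_sim[disease_idx]


# Toy association matrix (made-up data): rows are microbes, columns are diseases.
assoc = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
]

microbe_sim = gip_kernel_similarity(assoc)
# Disease profiles are the columns of the association matrix.
disease_sim = gip_kernel_similarity([list(col) for col in zip(*assoc)])

feature = pair_feature(microbe_sim, disease_sim, microbe_idx=0, disease_idx=2)
print(len(feature))  # 4 microbe similarities + 3 disease similarities
```

In a full pipeline, vectors like `feature` would be the input that is compressed to a low-dimensional representation before classification.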
A computational framework to trace tumor tissue-of-origin of 19 cancer types based on RNA sequencing
Abstract: Carcinoma of unknown primary (CUP) is a type of metastatic cancer whose tissue-of-origin (TOO) cannot be identified by traditional methods. Most CUP patients have a poor prognosis because therapies targeting the TOO cannot be applied. It is therefore critical to develop accurate computational methods to infer the TOO. While qPCR- or microarray-based methods are effective in predicting the TOO for most cancer types, their overall prediction accuracy still needs improvement. Here, we propose a computational framework to trace the TOO of 19 cancer types based on RNA sequencing (RNA-seq). Specifically, we downloaded RNA-seq data of 7000+ tissue samples covering 19 cancer types with known TOO from TCGA. Through feature selection, 90 genes were selected to train a random forest model for TOO inference; the 90 genes are enriched in both tissue-specific and tissue-general functions. The cross-validation accuracy of our framework reaches 97.55% across all cancer types. Furthermore, we collected an independent cohort of samples from GEO as testing samples. The accuracy on the independent data is 74%, despite differences in experimental procedures and pipelines. In conclusion, we developed an accurate and robust computational framework for identifying the TOO, which may be promising in clinical applications.
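The cross-validation accuracy reported above is the fraction of held-out samples classified correctly when each fold is held out in turn. The sketch below illustrates that evaluation loop only. It is not the authors' pipeline: to stay dependency-free it swaps the random forest for a simple nearest-centroid classifier, it uses made-up two-gene expression values with two tissue labels, and the stratified fold assignment and all function names are assumptions chosen for illustration.

```python
def k_fold_indices(labels, k):
    """Stratified folds: deal each class's sample indices round-robin."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def fit_centroids(X, y):
    """Mean expression vector per class (stand-in for the random forest)."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Return the label of the closest class centroid."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def cross_val_accuracy(X, y, k=5):
    """Train on k-1 folds, test on the held-out fold, pool the hits."""
    correct = 0
    for test_idx in k_fold_indices(y, k):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        centroids = fit_centroids(train_X, train_y)
        correct += sum(predict(centroids, X[i]) == y[i] for i in test_idx)
    return correct / len(X)


# Made-up expression values for two hypothetical marker genes.
X = [[5.0 + 0.1 * i, 1.0] for i in range(10)] + [[1.0, 5.0 + 0.1 * i] for i in range(10)]
y = ["lung"] * 10 + ["liver"] * 10

accuracy = cross_val_accuracy(X, y, k=5)
print(accuracy)  # 1.0 on this cleanly separable toy data
```

Accuracy on an independent cohort, such as the GEO samples mentioned above, would instead be measured by fitting once on all training samples and predicting the external samples, which is why it can be much lower than the cross-validation figure.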