scBERT as a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq Data

Fan Yang(Tencent (China)), Wenchuan Wang(Shanghai Jiao Tong University), Fang Wang(Tencent (China)), Yuan Fang(Harvard University), Duyu Tang(Tencent (China)), Junzhou Huang(The University of Texas at Arlington), Hui Lü(Shanghai Jiao Tong University), Jianhua Yao(Tencent (China))
bioRxiv (Cold Spring Harbor Laboratory)
December 7, 2021
Cited by 24 | Open Access

Abstract

Annotating cell types from single-cell RNA-seq (scRNA-seq) data is a prerequisite for research on disease progression and the tumor microenvironment. Here we show that existing annotation methods typically suffer from a lack of curated marker gene lists, improper handling of batch effects, and difficulty in leveraging latent gene-gene interaction information, which impairs their generalization and robustness. To overcome these challenges, we developed scBERT (single-cell Bidirectional Encoder Representations from Transformers), a pre-trained deep neural network-based model. Following BERT’s pre-train and fine-tune approach, scBERT gains a general understanding of gene-gene interactions by pre-training on large amounts of unlabeled scRNA-seq data, and is then transferred to the cell type annotation task on unseen, user-specific scRNA-seq data through supervised fine-tuning. Extensive and rigorous benchmark studies validated the superior performance of scBERT in cell type annotation, novel cell type discovery, robustness to batch effects, and model interpretability.
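
To make the pre-training idea concrete, the following is a minimal sketch of how a BERT-style masking step could be applied to one cell's expression profile. It is an illustration under stated assumptions, not the scBERT implementation: the abstract does not describe the input encoding, so the log-scale binning, the bin count, the 15% default mask rate, and the names `bin_expression`, `mask_tokens`, and `MASK_TOKEN` are all hypothetical choices made for this example.

```python
import math
import random

MASK_TOKEN = -1  # hypothetical id marking a hidden position (assumption)


def bin_expression(counts, n_bins=5):
    """Discretize raw counts into integer bins via log2(1 + x).

    Values are capped at n_bins - 1. This binning scheme is an assumption
    made for the sketch, not something stated in the abstract.
    """
    return [min(int(math.log2(1 + c)), n_bins - 1) for c in counts]


def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """Hide a random subset of positions and return (masked_input, labels).

    labels holds the original token at each hidden position and None
    elsewhere, so a model would be scored only on what it could not see.
    """
    rng = rng or random.Random()
    # Mask at least one position, but never more than the sequence length.
    n_mask = min(len(tokens), max(1, round(mask_rate * len(tokens))))
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    masked_input = [MASK_TOKEN if i in hidden else t for i, t in enumerate(tokens)]
    labels = [t if i in hidden else None for i, t in enumerate(tokens)]
    return masked_input, labels


# Toy expression counts for ten genes in a single cell (made-up values).
counts = [0, 3, 12, 0, 1, 7, 0, 150, 2, 0]
tokens = bin_expression(counts)
masked_input, labels = mask_tokens(tokens, mask_rate=0.3, rng=random.Random(0))
print(tokens)
print(masked_input)
print(labels)
```

In a setup like this, no cell type labels are needed during pre-training, because the hidden values themselves serve as the prediction targets. Fine-tuning would then replace the reconstruction objective with a supervised cell type classifier trained on labeled cells.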

