Lizhong Chen

edgeR v4: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets

Yunshun Chen, Lizhong Chen, Aaron T. L. Lun et al.|Nucleic Acids Research|2025

Cited by 529|Open Access

edgeR is an R/Bioconductor software package for differential analyses of sequencing data in the form of read counts for genes or genomic features. Over the past 15 years, edgeR has been a popular choice for statistical analysis of data from sequencing technologies such as RNA-seq or ChIP-seq. edgeR pioneered the use of the negative binomial distribution to model read count data with replicates and the use of generalized linear models to analyze complex experimental designs. edgeR implements empirical Bayes moderation methods to allow reliable inference when the number of replicates is small. This article announces edgeR version 4, which includes new developments across a range of application areas. Infrastructure improvements include support for fractional counts, implementation of model fitting in C and a new statistical treatment of the quasi-likelihood pipeline that improves accuracy for small counts. The revised package has new functionality for differential methylation analysis, differential transcript expression, differential transcript and exon usage, testing relative to a fold-change threshold and pathway analysis. This article reviews the statistical framework and computational implementation of edgeR, briefly summarizing all the existing features and functionalities but with special attention to new features and those that have not been described previously.

edgeR v4: powerful differential analysis of sequencing data with expanded functionality and improved support for small counts and larger datasets

Yunshun Chen, Lizhong Chen, Aaron T. L. Lun et al.|bioRxiv (Cold Spring Harbor Laboratory)|2024

Cited by 120|Open Access

edgeR is an R/Bioconductor software package for differential analyses of sequencing data in the form of read counts for genes or genomic features. Over the past 15 years, edgeR has been a popular choice for statistical analysis of data from sequencing technologies such as RNA-seq or ChIP-seq. edgeR pioneered the use of the negative binomial distribution to model read count data with replicates and the use of generalized linear models to analyse complex experimental designs. edgeR implements empirical Bayes moderation methods to allow reliable inference when the number of replicates is small. This article announces edgeR version 4, which includes new developments across a range of application areas. Infrastructure improvements include support for fractional counts, implementation of model fitting in C, and a new statistical treatment of the quasi-likelihood pipeline that improves accuracy for small counts. The revised package has new functionality for differential methylation analysis, differential transcript expression, differential transcript and exon usage, testing relative to a fold-change threshold and pathway analysis. This article reviews the statistical framework and computational implementation of edgeR, briefly summarizing all the existing features and functionalities but with special attention to new features and those that have not been described previously.

Faster and more accurate assessment of differential transcript expression with Gibbs sampling and edgeR v4

Pedro L. Baldoni, Lizhong Chen, Gordon K. Smyth|NAR Genomics and Bioinformatics|2024

Cited by 14|Open Access

This article further develops edgeR's divided-count approach for differential transcript expression (DTE) analysis of RNA-seq data to produce a faster and more accurate pipeline. The divided-count approach models the precision of transcript quantifications from the kallisto and Salmon software tools and divides the estimated overdispersions out of the transcript read counts, after which the divided counts can be analysed by statistical tools developed for gene-level counts. This article adds three new refinements to the pipeline that dramatically decrease the computational overhead and storage requirements so that DTE analysis of very large datasets becomes practical. The new pipeline replaces bootstrap with Gibbs resampling and replaces edgeR v3 with v4. Both of these changes improve statistical power and accuracy and provide better resolution for low-count transcripts. The accuracy of overdispersion estimation is shown to depend on the total number of resamples across the whole dataset rather than on individual samples, dramatically reducing the recommended number of technical samples for large datasets. Test data and extensive simulation data show that the new pipeline is more powerful and efficient than previous DTE pipelines while providing correct control of the false discovery rate for any sample size.

Dividing out quantification uncertainty enables assessment of differential transcript usage with limma and edgeR

Pedro L. Baldoni, Lizhong Chen, Mengbo Li et al.|bioRxiv (Cold Spring Harbor Laboratory)|2025

Cited by 5|Open Access

Differential transcript usage (DTU) refers to changes in the relative abundance of transcript isoforms of the same gene between experimental conditions, even when the total expression of the gene doesn’t change. DTU analysis requires the quantification of individual isoforms from RNA-seq data, which has a high level of uncertainty due to transcript overlap and read-to-transcript ambiguity (RTA). Popular DTU analysis methods do not directly account for the RTA overdispersion within their statistical frameworks, leading to reduced statistical power or poor error rate control, particularly in scenarios with small sample sizes. This article presents limma and edgeR analysis pipelines that account for RTA during DTU assessment. Leveraging recent advancements in the limma and edgeR Bioconductor packages, we propose DTU analysis pipelines optimized for small and large datasets with a unified interface via the diffSplice function. The pipelines make use of divided counts to remove RTA-induced dispersion from transcript isoform counts and account for the sparsity in transcript-level counts. Simulations and analysis of real data from mouse mammary epithelial cells demonstrate that the diffSplice pipelines provide greater power, improved efficiency, and improved FDR control compared to existing specialized DTU methods.

Dividing out quantification uncertainty enables assessment of differential transcript usage with limma and edgeR

Pedro L. Baldoni, Lizhong Chen, Mengbo Li et al.|Nucleic Acids Research|2025

Cited by 3|Open Access

Differential transcript usage (DTU) refers to changes in the relative abundance of transcript isoforms of the same gene between experimental conditions, even when the total expression of the gene does not change. DTU analysis requires the quantification of individual isoforms from RNA-seq data, which has a high level of uncertainty due to transcript overlap and read-to-transcript ambiguity (RTA). Popular DTU analysis methods do not directly account for the RTA overdispersion within their statistical frameworks, leading to reduced statistical power or poor error rate control, particularly in scenarios with small sample sizes. This article presents limma and edgeR analysis pipelines that account for RTA during DTU assessment. Leveraging recent advancements in the limma and edgeR Bioconductor packages, we propose DTU analysis pipelines optimized for small and large datasets with a unified interface via the diffSplice function. The pipelines make use of divided counts to remove RTA-induced dispersion from transcript isoform counts and account for the sparsity in transcript-level counts. Simulations and analyses of real data from mouse mammary epithelial cells demonstrate that the diffSplice pipelines provide greater power, improved efficiency, and improved false discovery rate control compared to existing specialized DTU methods.


Top publications by citations