Exploratory analysis and error modeling of a sequencing technology

Michael Inouye (Wellcome Sanger Institute), Kerrin S. Small (Centre for Human Genetics), Yik Ying Teo (Centre for Human Genetics), Heng Li (Wellcome Sanger Institute), Nava Whiteford (Wellcome Sanger Institute), Tom Skelly (Wellcome Sanger Institute), Irina Abnizova (Wellcome Sanger Institute), Daniel J. Turner (Wellcome Sanger Institute), Panos Deloukas (Wellcome Sanger Institute), Dominic Kwiatkowski (Centre for Human Genetics), Clive Brown (Wellcome Sanger Institute), Taane G. Clark (Centre for Human Genetics)
bioRxiv (Cold Spring Harbor Laboratory)
March 11, 2016

Abstract

Next-generation DNA sequencing methods have created an unprecedented leap in sequence data generation; novel computational tools and statistical models are therefore required to optimize and assess the resulting data. In this report, we explore underlying causes of error for the Illumina Genome Analyzer (IGA) sequencing technology and attempt to quantify their effects using a human bacterial artificial chromosome sequenced to 60,000-fold coverage. Seven potential error predictors are considered: Phred score, read entropy, tile coordinates, local tile density, base position within read, nucleotide call, and lane. With these parameters, logistic regression and log-linear models are constructed and used to show that each of the potential predictors contributes to error (P < 1×10⁻⁴). With this additional information, we apply the logistic model and achieve a 3% improvement in both the sensitivity and specificity to detect IGA errors. Further, we demonstrate that these modeling approaches can be used as a feedback loop to inform laboratory methods and identify specific machine or run bias.
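
To make two of the predictors named above concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the standard Phred definition (Q = -10 log10 p) and assumes that "read entropy" means the Shannon entropy of a read's base composition, which the abstract does not specify. The function names and the coefficients in the logistic example are hypothetical and chosen only for illustration; they are not estimates from the paper.

```python
import math
from collections import Counter


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred score to an error probability (Q = -10 * log10(p))."""
    return 10 ** (-q / 10)


def read_entropy(read: str) -> float:
    """Shannon entropy (in bits) of the base composition of a read.

    Assumption: the paper's exact definition of read entropy is not given
    in the abstract; this is one common choice.
    """
    n = len(read)
    if n == 0:
        return 0.0
    counts = Counter(read.upper())
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def logistic_error_prob(features: dict, coefs: dict, intercept: float) -> float:
    """Probability of a base-call error under a logistic model.

    `coefs` maps predictor names to weights; predictors missing from
    `features` are treated as 0.
    """
    z = intercept + sum(w * features.get(name, 0.0) for name, w in coefs.items())
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    # Hypothetical coefficients for illustration only (not from the paper).
    coefs = {"phred": -0.2, "entropy": -0.5, "position": 0.05}
    base = {
        "phred": 30,
        "entropy": read_entropy("ACGTACGTTTGA"),
        "position": 25,
    }
    print(phred_to_error_prob(30))
    print(logistic_error_prob(base, coefs, intercept=1.0))
```

In practice the coefficients would be fitted to bases whose true identity is known, as with the deeply sequenced reference clone described above, rather than set by hand.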
