ANALYSIS OF CONTEXT-DEPENDENT ERRORS FOR ILLUMINA SEQUENCING

Irina Abnizova; Steven Leonard; Tom Skelly; Andy Brown; David K. Jackson; Marina Gourtovaia; Guoying Qi; René te Boekhorst; Nadeem Faruque; Kevin Lewis; Tony Cox

doi:10.1142/s0219720012410053

Affiliation (all authors): Wellcome Sanger Institute

Journal of Bioinformatics and Computational Biology

January 30, 2012

Abstract

The new generation of short-read sequencing technologies requires reliable measures of data quality. Such measures are especially important for variant calling. In SNP calling in particular, a large number of false-positive SNPs (FP-SNPs) may be obtained, so putative SNPs must be distinguished from sequencing and other errors. We found that two quantities help to identify an FP-SNP: the probability of a sequencing error (i.e. the quality value) and the conditional probability of "correcting" that error (the "second-best call" probability, conditional on that of the first call). Surprisingly, around 80% of mismatches can be "corrected" with this second call. Another way to reduce the rate of FP-SNPs is to retrieve DNA motifs that seem to be prone to sequencing errors and to attach a corresponding conditional quality value to these motifs. We have developed several measures to distinguish between sequencing errors and candidate SNPs, based on a base call's nucleotide context and its mismatch type. In addition, we suggest a simple method to correct the majority of mismatches, based on the conditional probability of their second-best intensity call, and we attach to each mismatch a corresponding second-call confidence (quality value) of being corrected.
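
The two quantities named in the abstract can be illustrated with a minimal sketch. The Phred conversion Q = -10 log10(p) is standard. The second-call probability below is only an assumption made for illustration: it normalizes the second-highest channel intensity by the summed intensity of the three non-first channels. The abstract does not give the authors' formula, and the function names and the intensity dictionary are hypothetical.

```python
import math

BASES = "ACGT"


def phred_to_error_prob(q):
    """Convert a Phred quality value Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability p (0 < p <= 1) to a Phred quality value."""
    return -10 * math.log10(p)


def second_call(intensities):
    """Return (first_base, second_base, conditional_prob).

    `intensities` maps each of A, C, G, T to a non-negative channel intensity.
    ASSUMPTION (not from the paper): the probability of the second-best call,
    conditional on the first call being wrong, is taken as the second-highest
    intensity divided by the total intensity of the three non-first channels.
    """
    ranked = sorted(BASES, key=lambda b: intensities[b], reverse=True)
    first, second = ranked[0], ranked[1]
    rest = sum(intensities[b] for b in ranked[1:])
    # With no signal in the remaining channels, fall back to a uniform 1/3.
    cond = intensities[second] / rest if rest > 0 else 1.0 / 3.0
    return first, second, cond


if __name__ == "__main__":
    # Hypothetical intensities for one sequencing cycle.
    cycle = {"A": 100.0, "C": 60.0, "G": 20.0, "T": 20.0}
    first, second, cond = second_call(cycle)
    # Express the chance that the second call is also wrong as a quality value.
    q_second = error_prob_to_phred(1.0 - cond) if cond < 1.0 else float("inf")
    print(first, second, cond, q_second)
```

Under this assumed normalization, a mismatch at a base called A with these intensities would be "corrected" to C, and the attached second-call quality value reflects how strongly C dominates the remaining channels.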
