PEAKS DB: De Novo Sequencing Assisted Database Search for Sensitive and Accurate Peptide Identification

Jing Zhang; Lei Xin; Baozhen Shan; Weiwu Chen; Mingjie Xie; Denis Yuen; Weimin Zhang; Zefeng Zhang; Gilles Lajoie; Bin Ma

doi:10.1074/mcp.m111.010587

Jing Zhang(Bioinformatics Solutions (Canada)), Lei Xin(Bioinformatics Solutions (Canada)), Baozhen Shan(Bioinformatics Solutions (Canada)), Weiwu Chen(Bioinformatics Solutions (Canada)), Mingjie Xie(Bioinformatics Solutions (Canada)), Denis Yuen(University of Waterloo), Weimin Zhang(Bioinformatics Solutions (Canada)), Zefeng Zhang(Bioinformatics Solutions (Canada)), Gilles Lajoie(Western University), Bin Ma(University of Waterloo)

Molecular & Cellular Proteomics

December 21, 2011

Abstract

Many software tools have been developed for the automated identification of peptides from tandem mass spectra. The accuracy and sensitivity of the identification software via database search are critical for successful proteomics experiments. A new database search tool, PEAKS DB, has been developed by incorporating the de novo sequencing results into the database search. PEAKS DB achieves significantly improved accuracy and sensitivity over two other commonly used software packages. Additionally, a new result validation method, decoy fusion, has been introduced to solve the issue of overconfidence that exists in the conventional target decoy method for certain types of peptide identification software.

The abbreviations used are: MS/MS, tandem mass spectrometry; PTM, post-translational modification; ETD, electron transfer dissociation; FDR, false discovery rate; PSM, peptide spectrum match; iPRG, Proteome Informatics Research Group.

Peptide identification from tandem mass spectrometry (MS/MS) data is a central task in proteomics. The accuracy and sensitivity of this task directly impact the performance of protein identification from peptide hits, as well as other downstream analyses. Many software tools have been developed for peptide identification; these tools can be broadly divided into two categories: de novo sequencing and database search. De novo sequencing derives the peptide sequence directly from the MS/MS spectrum, whereas a database search queries a sequence database for the best peptide to explain the peaks in the MS/MS spectrum.

Representative de novo sequencing software packages include PEAKS (1: Ma B. Zhang K. Hendrie C. Liang C. Li M. Doherty-Kirby A. Lajoie G. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 2003; 17: 2337-2342), PepNovo (2: Frank A. Pevzner P. PepNovo: De novo peptide sequencing via probabilistic network modeling. Anal. Chem. 2005; 77: 964-973), NovoHMM (3: Fischer B. Roth V. Roos F. Grossmann J. Baginsky S. Widmayer P. Gruissem W. Buhmann J.M. NovoHMM: A hidden Markov model for de novo peptide sequencing. Anal. Chem. 2005; 77: 7265-7273), and Lutefisk (4: Taylor J.A. Johnson R.S. Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 1997; 11: 1067-1075), and representative database search software packages include Mascot (5: Perkins D.N. Pappin D.J. Creasy D.M. Cottrell J.S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis. 1999; 20: 3551-3567), SEQUEST (6: Eng J. McCormack A.L. Yates 3rd, J.R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994; 5: 976-989), X!Tandem (7: Craig R. Beavis R.C. TANDEM: Matching proteins with tandem mass spectra. Bioinformatics. 2004; 20: 1466-1467), OMSSA (8: Geer L.Y. Markey S.P. Kowalak J.A. Wagner L. Xu M. Maynard D.M. Yang X. Shi W. Bryant S.H. Open mass spectrometry search algorithm. J. Proteome Res. 2004; 3: 958-964), ProteinProspector (9: Chalkley R.J. Baker P.R. Huang L. Hansen K.C. Allen N.P. Rexach M. Burlingame A.L. Comprehensive analysis of a multidimensional liquid chromatography mass spectrometry dataset acquired on a quadrupole selecting quadrupole collision cell, time-of-flight mass spectrometer: II. New developments in protein prospector allow for reliable and comprehensive automatic analysis of large datasets. Mol. Cell. Proteomics. 2005; 4: 1194-1204), MaxQuant (10: Cox J. Mann M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008; 26: 1367-1372; 11: Cox J. Neuhauser N. Michalski A. Scheltema R.A. Olsen J.V. Mann M. Andromeda: A peptide search engine integrated into the MaxQuant environment. J. Proteome Res. 2011; 10: 1794-1805), and MS-GFDB (12: Kim S. Mischerikow N. Bandeira N. Navarro J.D. Wich L. Mohammed S. Heck A.J. Pevzner P.A. The generating function of CID, ETD and CID/ETD pairs of tandem mass spectra: Applications to database search. Mol. Cell. Proteomics. 2010; 9: 2840-2852).

The database search is generally believed to be a simpler approach because the protein sequence database provides a limited space for the software to search. Therefore, when a protein sequence database is available, a database search is the most common method for peptide identification. However, existing database search tools still experience problems of low identification rates (low sensitivity) (13: Bell A.W. Deutsch E.W. Au C.E. Kearney R.E. Beavis R. Sechi S. Nilsson T. Bergeron J.J. HUPO Test Sample Working Group: A HUPO test sample study reveals common problems in mass spectrometry-based proteomics. Nat. Methods. 2009; 6: 423-430; 14: Kapp E.A. Schütz F. Connolly L.M. Chakel J.A. Meza J.E. Miller C.A. Fenyo D. Eng J.K. Adkins J.N. Omenn G.S. Simpson R.J. An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: Sensitivity and specificity analysis. Proteomics. 2005; 5: 3475-3490) and high false discovery rates (low accuracy) (15: Askenazi M. Bandeira N. Chalkley R.J. Clauser K.R. Deutsch E. Lam H.H.N. McDonald W.H. Neubert T. Rudnick P.A. Martens L. iPRG 2011: A Study on the Identification of Electron Transfer Dissociation (ETD) Mass Spectra. J Biomol Tech. 2011; 22: S20). The improvement of database search performance has always been an active research area in this field.

Two competing objectives are sought in the database search approach: accuracy and sensitivity. The accuracy is usually measured by the false discovery rate (FDR), which is defined as the percentage of the false identifications in all identifications above the score threshold.
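The FDR definition above, and the way a higher score threshold trades sensitivity for accuracy, can be illustrated with a small sketch. This is not code from PEAKS DB: the function name `fdr_and_sensitivity`, the toy scores, and the ground-truth labels are all hypothetical, and sensitivity is taken here to mean the fraction of all true identifications that pass the threshold. In a real experiment the true/false labels are unknown, which is why decoy-based estimates are used instead.

```python
from typing import List, Tuple


def fdr_and_sensitivity(
    psms: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Return (FDR, sensitivity) for identifications scoring >= threshold.

    psms holds (score, is_true) pairs; the labels are assumed known here
    purely for illustration.
    """
    accepted = [is_true for score, is_true in psms if score >= threshold]
    total_true = sum(1 for _, is_true in psms if is_true)
    if not accepted:
        return 0.0, 0.0
    false_hits = accepted.count(False)
    true_hits = accepted.count(True)
    # FDR: false identifications among all identifications above the threshold.
    fdr = false_hits / len(accepted)
    # Sensitivity (assumed definition): true identifications retained.
    sensitivity = true_hits / total_true if total_true else 0.0
    return fdr, sensitivity


# Hypothetical toy data: (score, is_true).
example_psms = [
    (9.0, True), (8.0, True), (7.0, False), (6.0, True),
    (5.0, False), (4.0, True), (3.0, False), (2.0, False),
]

for cutoff in (4.0, 8.0):
    fdr, sens = fdr_and_sensitivity(example_psms, cutoff)
    print(f"threshold={cutoff}: FDR={fdr:.3f}, sensitivity={sens:.3f}")
```

With this toy data, moving the threshold from 4.0 to 8.0 removes every false identification but also discards half of the true ones. This is the trade-off that a better scoring function is meant to ease.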
Accuracy can be accomplished by increasing the score threshold. However, this will at the same time reduce the sensitivity. To improve both accuracy and sensitivity, a new scoring function needs to be developed that more accurately separates the true and false identifications (16Brosch M. Yu L. Hubbard T. Choudhary J. Accurate and sensitive peptide identification with Mascot percolator.J. Proteome Res. 2009; 8: 3176-3181Crossref PubMed Scopus (332) Google Scholar, 17Keller A. Nesvizhskii A.I. Kolker E. Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search.Anal. Chem. 2002; 74: 5383-5392Crossref PubMed Scopus (3886) Google Scholar). Meanwhile, to maintain an search database search software a method to a of protein peptide and will with a more usually scoring function for R. Beavis R.C. TANDEM: Matching proteins with tandem mass spectra.Bioinformatics. 2004; 20: 1466-1467Crossref PubMed Scopus (1987) Google Scholar). However, this peptides and sensitivity. A is to sensitivity, and this the PEAKS DB software is for peptide identification using the database search However, as to the database search the PEAKS DB software de novo sequencing results to improve the and the scoring results in significantly improved sensitivity and accuracy in to existing database search software. to the two objectives and the high of proteomics mass spectrometry data the automated validation of database search this validation is by the target decoy method J.E. S.P. search for in protein identifications by mass Methods. 4: PubMed Scopus Google Scholar, L. J.D. to peptides by tandem mass spectrometry using decoy Proteome Res. 2008; PubMed Scopus Google Scholar). method decoy proteins to be by the same search engine and the on the decoy proteins to estimate the of false However, the method has to be used with because a search can the M. D. of mass Proteome Res. 2009; 8: PubMed Scopus Google Scholar, C. 
statistical analysis for search Proteome Res. 2010; 9: PubMed Scopus Google Scholar, M. on statistical analysis for search Proteome Res. 2011; 10: PubMed Scopus Google Scholar). A in C. statistical analysis for search Proteome Res. 2010; 9: PubMed Scopus Google Scholar, and M. on statistical analysis for search Proteome Res. 2011; 10: PubMed Scopus Google Scholar) that the still an by more decoy proteins at the of the search on of the decoy proteins introduced of the search engine at the and is a of the target decoy method is that of a search results the protein is used in the peptide scoring function and sensitivity with and Scholar). this that a to the target decoy method will solve these two of the decoy proteins as of the the target and decoy sequences of the same protein as a of the this this new is and an improved target decoy method, decoy fusion, is The of PEAKS DB is to peptides from a sequence database with MS/MS PEAKS DB to the database search of peptide identification software. However, PEAKS DB de novo sequencing as a and the de novo sequencing results to improve both the and accuracy of the database search. The of the PEAKS DB software as novo The PEAKS (1Ma B. Zhang K. Hendrie C. Liang C. Li M. Doherty-Kirby A. Lajoie G. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry.Rapid Commun. Mass Spectrom. 2003; 17: 2337-2342Crossref PubMed Scopus (966) Google Scholar) is used to de novo sequencing for The de novo sequence are used to in the protein sequence of the proteins in the database are to the sequence The proteins the protein and are used in of the peptides of the protein are used to match the MS/MS with a scoring the scoring peptide with are for MS/MS the in the peptide a scoring function is used to the best peptide for spectrum. 
The the de novo sequence and the database peptide is an in the scoring the score is to can be A target decoy approach is used to the peptide spectrum score to the of the and The high peptides the above are used to the proteins that the same of peptide are for a more The of these are in the The PEAKS is used to de novo sequencing for spectrum. The same and by the for database search are used for de novo spectrum, the de novo sequencing peptide by PEAKS is The PEAKS a for amino acid in the de novo this is a percentage The of PEAKS is to a sequence by the low amino by mass of amino acid with is by a that is to the mass of the as an this the the de novo sequence to a of proteins from the protein in the will on this to reduce the The a de novo sequence and a database peptide is measured by the of common amino the of the score is that in this protein because is modification in the sequence a on the de novo sequence can match an in the sequence However, in the peptide scoring a can match the same with the same modification for the score The proteins are by the score by the peptides of two proteins have the same the is by the and the this the database proteins are as the protein which be a of the proteins in most proteomics experiments. is made on proteins in the Therefore, the of proteins to be the has a of proteins and the search is on a large database as the can be in the of PEAKS of the peptide sequences in from the protein are the to peptide spectrum peptide sequence peptides by all of the peptide sequence the peptide mass is and the MS/MS with the mass is with the A is used to the score of the A data is used to the sequence for spectrum. The is from the same de novo sequencing scoring function used in PEAKS de novo sequencing (1Ma B. Zhang K. Hendrie C. Liang C. Li M. Doherty-Kirby A. Lajoie G. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry.Rapid Commun. Mass Spectrom. 
2003; 17: 2337-2342Crossref PubMed Scopus (966) Google Scholar). a spectrum is to two and the that the peptide has a the with mass and the that the peptide has a the with mass The are with the a collision dissociation spectrum, and are used B. Zhang K. Hendrie C. Liang C. Li M. Doherty-Kirby A. Lajoie G. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry.Rapid Commun. Mass Spectrom. 2003; 17: 2337-2342Crossref PubMed Scopus (966) Google for an ETD spectrum, and are used X. B. L. B. score function for peptide identification with ETD MS/MS 2011; 11: Scopus Google for of the and are the match score of a peptide is as the of the and for all the and score can be by and in this the peptide of a MS/MS spectrum be the scoring sequence is most the scoring sequence in the for this spectrum. A more scoring function is used to the sequence for spectrum. the match score is by the the score of the and the of the of the the peptides is to spectra. A of other are used in to the match have been However, the of a peptide to be most and are in PEAKS the of amino the de novo sequence the protein protein a score by peptide and the protein of a peptide is the score of the proteins this the peptide the sequence in the the sequence in the the mass the the of the and the of the of these used in the (16Brosch M. Yu L. Hubbard T. Choudhary J. Accurate and sensitive peptide identification with Mascot percolator.J. Proteome Res. 2009; 8: 3176-3181Crossref PubMed Scopus (332) Google Scholar) and A. Nesvizhskii A.I. Kolker E. Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search.Anal. Chem. 
2002; 74: 5383-5392Crossref PubMed Scopus (3886) Google Scholar) and used in and used in to and used in and a to used in and used more with the match are with a The are with an search on a large data to the area on the of the as in the are by the for a from to The score is to a for a the is defined as the that a false identification in the search achieves the same The to the false the the of false identifications above the score and the of false that false rate is a from the is the peptide score the by PEAKS DB is is the common with A target decoy decoy fusion, is used to estimate the at score threshold. The more conventional target decoy approach the of a decoy protein sequence for target protein sequence in the database (16Brosch M. Yu L. Hubbard T. Choudhary J. Accurate and sensitive peptide identification with Mascot percolator.J. Proteome Res. 2009; 8: 3176-3181Crossref PubMed Scopus (332) Google Scholar). The target and decoy databases are by the and the is by the the of the decoy and target However, in PEAKS DB, the target and decoy sequences are as in the are for the database the same of protein the of protein is The software searches this the the target and decoy identifications are by are from the the of score the is as the the of decoy and the of target above the score threshold. the amino acid of the target protein is an a decoy sequence to the search engine from the peptide of the target To solve this a is in the target and decoy sequences as the Mascot and PEAKS DB can at both of the for the in that the peptide from the target protein is protein is the of this the is a of the protein in PEAKS are to a score a protein is to protein all of the peptides of with a score are in X. 
the of PEAKS DB, is to to a of is a identification and is to the of proteins is for the proteins are into several that be a proteins other in a the can to all all proteins from the The score of protein is from peptides as peptides are the same peptide is from the scoring peptide is Two peptides are the same are by the the amino acid sequence are all the of the peptides are as the score of the protein is to The score of a protein is to the score of the Two data with and the other used to the performance of PEAKS data with The data from the of and used to study the protein and J.M. C. T. K. M. are more 2010; 10: PubMed Scopus Google Scholar). The data from the data the P. protein from in used for database search. The database protein The ETD data from the of a peptide to The data from used in the study by the Proteome Informatics Research of the of (15Askenazi M. Bandeira N. Chalkley R.J. Clauser K.R. Deutsch E. Lam H.H.N. McDonald W.H. Neubert T. Rudnick P.A. Martens L. iPRG 2011: A Study on the Identification of Electron Transfer Dissociation (ETD) Mass Spectra.J Biomol Tech. 2011; 22: S20Google Scholar). The same data is used the ETD data the same protein sequence database by the of iPRG study used for database search. the for with proteins The database protein all of the decoy the decoy sequences by the amino in peptides of decoy of target a target decoy method used to estimate the the target and decoy databases the performance of the de novo sequencing and database search when the same data will the of the de novo sequencing results in PEAKS the data PEAKS and Mascot for the de novo sequencing and database search spectrum, the de novo sequencing peptide by PEAKS peptide by Mascot the of amino with the de novo sequence is the of the when the P. 
database is can be that the best of the target and decoy is by a of both the database search score and the the of using de novo sequencing results in the peptide Mascot to a the spectrum is when databases of are on the data the to Mascot of and when the P. and databases a the performance of de novo sequencing and database search the P. and databases are used for the Mascot database the de novo sequencing to more amino score on and of the by Mascot with The of the target decoy and the decoy is that the score of the false target and the decoy are the of decoy can be used to estimate the of false target is to this because is to a target is true the to the The data the P. database by and PEAKS The peptides by all as A database by these peptides in the P. all other amino in a search engine is used to search in this the peptides that have more amino with the peptides can be as false by using the database as the the score of the false target and the decoy can be decoy and target decoy and the results are in that for the PEAKS DB the decoy method score The target decoy method decoy the false target hits, which that decoy is more for the PEAKS DB However, the two decoy for and The result of is with to the by the two decoy The two for of and whereas the decoy of PEAKS DB is more the target decoy in all the decoy method used to estimate the of PEAKS DB, and the target decoy method used to estimate the of all other searching the the peptide identification performance of PEAKS DB by with two commonly used software Mascot and SEQUEST Proteome The search with of the used the same of The mass and mass to in and at most of peptide the of and of and of and from and used as the for the and ETD data peptide spectrum match SEQUEST two and this used as SEQUEST score because this the for the has been developed to improve Mascot database search results by with a method (16Brosch M. Yu L. Hubbard T. Choudhary J. Accurate and sensitive peptide identification with Mascot percolator.J. 
Proteome Res. 2009; 8: 3176-3181Crossref PubMed Scopus (332) Google Scholar). is a database search a with the of Mascot and the for the and ETD data a the of target are PEAKS DB SEQUEST Mascot from the data and PEAKS DB Mascot SEQUEST from the ETD data of the software tools on the ETD data The the of peptide spectrum from the target and the the database search MS-GFDB (12Kim S. Mischerikow N. Bandeira N. Navarro J.D. Wich L. Mohammed S. Heck A.J. Pevzner P.A. The generating function of CID, ETD and CID/ETD pairs of tandem mass spectra: Applications to database search.Mol. Cell. Proteomics. 2010; 9: 2840-2852Abstract Full Text Full Text PDF PubMed Scopus (196) Google Scholar), a improvement over the MS-GFDB with at the time of this a by in PEAKS DB a of the performance of PEAKS PEAKS DB MS-GFDB by and in a for and The of this is in the The from is that PEAKS DB significantly more Mascot and in to at a PEAKS DB more for the data and more for the ETD data PEAKS DB more for and for at Mascot for and for at significantly improved the performance of PEAKS DB still by for data and by for ETD data at on these data of the of peptides search Mascot on the ETD data in the iPRG study above (15Askenazi M. Bandeira N. Chalkley R.J. Clauser K.R. Deutsch E. Lam H.H.N. McDonald W.H. Neubert T. Rudnick P.A. Martens L. iPRG 2011: A Study on the Identification of Electron Transfer Dissociation (ETD) Mass Spectra.J Biomol Tech. 2011; 22: S20Google Scholar). the results in the iPRG the most of by the ProteinProspector (9Chalkley R.J. Baker P.R. Huang L. Hansen K.C. Allen N.P. Rexach M. Burlingame A.L. Comprehensive analysis of a multidimensional liquid chromatography mass spectrometry dataset acquired on a quadrupole selecting quadrupole collision cell, time-of-flight mass spectrometer: II. New developments in protein prospector allow for reliable and comprehensive automatic analysis of large datasets.Mol. Cell. Proteomics. 
2005; 4: 1194-1204Abstract Full Text Full Text PDF PubMed Scopus (145) Google Scholar), PEAKS DB, Yang B. L.Y. L. C. peptide identification for analysis on comprehensive of electron transfer dissociation Proteome Res. 2010; 9: PubMed Scopus Google Scholar), and However, these several PEAKS DB and results the accuracy by the iPRG study However, is that the method used by the iPRG study and the experience of in software tools have the above are in the of the iPRG study (15Askenazi M. Bandeira N. Chalkley R.J. Clauser K.R. Deutsch E. Lam H.H.N. McDonald W.H. Neubert T. Rudnick P.A. Martens L. iPRG 2011: A Study on the Identification of Electron Transfer Dissociation (ETD) Mass Spectra.J Biomol Tech. 2011; 22: S20Google Scholar). The of the decoy method is for validation of the PEAKS DB the target decoy approach the of PEAKS DB results and be from two that are to the that the decoy sequences are introduced as of the the protein more target proteins the decoy the false identifications in to in the target proteins with a The decoy method this by the target and decoy sequences in the same protein the is used in the peptide the of the peptide in the target more false will be from the target proteins from the decoy the target and decoy sequences the score is to the target and decoy peptide the score of the false target and decoy the in the the of protein in the peptide scoring the protein the of the target decoy validation method and used in A. Nesvizhskii A.I. Kolker E. Aebersold R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search.Anal. Chem. 2002; 74: 5383-5392Crossref PubMed Scopus (3886) Google Scholar) and is used in the Mascot and sensitivity with and Scholar). the other M. D. of mass Proteome Res. 
2009; 8: PubMed Scopus Google Scholar) significantly improved sensitivity by a search on the proteins for more which can be as an of using the protein in the peptide scoring that the of the protein is the search on a protein a database search engine the that peptide sequence in the sample with to the search. be when peptide from the same protein is with high will the peptide identification sensitivity, the of the protein a more result validation method the target decoy The decoy method in this provides a to solve this PEAKS DB, the for the score for peptide scoring are for is from the approach used in the scoring function is for the search is and the target and decoy peptides by the search the improve the sensitivity, the decoy to the scoring a of the To the the approach is used in the of PEAKS De novo sequencing to be and to with mass has been used when the protein database to the in and improvement of the is an issue for de novo in the PEAKS to de novo sequence on a The high mass accuracy has available because of the of new mass as the de novo sequencing a for mass spectrometry analysis in proteomics. De novo sequencing and database search be as two that are used in to sensitivity and accuracy in proteomics as in this Additionally, the that de novo sequencing database are from novo peptides be more in the database are in an analysis on database search. the PEAKS DB software that of de novo sequencing results and several new The is an in both sensitivity and accuracy and an performance to other commonly used search is true for mass spectral data by ETD which PEAKS DB a for peptides with a more result validation method, decoy fusion, for the of PEAKS DB
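
The decoy fusion idea, as far as the text describes it, places the target and decoy sequences in the same protein entry and estimates the FDR from the ratio of decoy hits to target hits above a score threshold. The sketch below illustrates that idea under several assumptions that are not taken from the paper: decoys are made by reversing the target sequence, a hit is counted as a decoy if any part of it extends past the target portion of the fused entry, and all names (`make_decoy`, `fuse_database`, `estimate_fdr`) and data are hypothetical.

```python
from typing import Dict, List, Tuple


def make_decoy(sequence: str) -> str:
    # Assumption: a reversed sequence serves as the decoy.
    return sequence[::-1]


def fuse_database(
    targets: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, int]]:
    """Append a decoy to each target so both live in one protein entry.

    Returns the fused sequences and, per protein, the index at which the
    decoy portion begins.
    """
    fused: Dict[str, str] = {}
    boundaries: Dict[str, int] = {}
    for protein_id, sequence in targets.items():
        fused[protein_id] = sequence + make_decoy(sequence)
        boundaries[protein_id] = len(sequence)
    return fused, boundaries


# A hit is (protein_id, start, length, score) within the fused sequence.
Hit = Tuple[str, int, int, float]


def estimate_fdr(
    hits: List[Hit], boundaries: Dict[str, int], threshold: float
) -> float:
    """Estimate FDR as (decoy hits) / (target hits) with score >= threshold."""
    decoys = 0
    targets = 0
    for protein_id, start, length, score in hits:
        if score < threshold:
            continue
        # Assumption: a hit touching the decoy portion counts as a decoy.
        if start + length > boundaries[protein_id]:
            decoys += 1
        else:
            targets += 1
    return decoys / targets if targets else 0.0


toy_targets = {"P1": "MKTAYIAKQR", "P2": "GSHMLEDPVK"}
fused_db, decoy_start = fuse_database(toy_targets)
toy_hits = [
    ("P1", 0, 5, 40.0),   # in the target portion
    ("P1", 12, 5, 35.0),  # in the decoy portion
    ("P2", 3, 6, 30.0),   # in the target portion
    ("P2", 11, 6, 12.0),  # decoy portion, below the threshold used below
]
print(fused_db["P1"])
print(estimate_fdr(toy_hits, decoy_start, threshold=20.0))
```

Because each protein entry carries its own decoy, any protein-level information the search engine uses applies to the target and decoy portions alike. The text presents this as the motivation for fusing the sequences instead of searching separate target and decoy proteins.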
