mzML—a Community Standard for Mass Spectrometry Data
Lennart Martens, Matthew Chambers, Marc Sturm et al. | Molecular & Cellular Proteomics | 2010

Mass spectrometry is a fundamental tool for discovery and analysis in the life sciences. With the rapid advances in mass spectrometry technology and methods, it has become imperative to provide a standard output format for mass spectrometry data that will facilitate data sharing and analysis. Initially, the efforts to develop a standard format for mass spectrometry data resulted in multiple formats, each designed with a different underlying philosophy. To resolve the issues associated with having multiple formats, vendors, researchers, and software developers convened under the banner of the HUPO PSI to develop a single standard. The new data format incorporated many of the desirable technical attributes from the previous data formats, while adding a number of improvements, including features such as a controlled vocabulary with validation tools to ensure consistent usage of the format, improved support for selected reaction monitoring data, and immediately available implementations to facilitate rapid adoption by the community. The resulting standard data format, mzML, is a well-tested open-source format for mass spectrometer output files that can be readily utilized by the community and easily adapted for incremental advances in mass spectrometry technology.
Abbreviations used: MS, mass spectrometry; HUPO, Human Proteome Organization; PSI-MS, Proteomics Standards Initiative working group for mass spectrometry standards; LC-MS/MS, liquid chromatography-tandem mass spectrometry; CV, controlled vocabulary.

Author contributions: E.W.D. is the chair, P.A.B. is the co-chair, and L.M. is the secretary of PSI-MS WG. All authors actively contributed to the creation and implementation of the standard format. All authors have agreed to all the content in the manuscript, including the data as presented.

Mass spectrometry (MS) has recently emerged as a major discovery tool in the life sciences (1. Mind the technology gap. Nat. Methods. 2007; 4 (Editors): 765). This analytical technique is used to analyze the molecular composition of a biological sample by ionizing the sample or analyte molecules and then measuring the mass-to-charge ratios of the resulting ions. The data from an MS experiment consist of mass spectra that are used to identify, characterize, and quantify the abundance of the molecules of interest. The resulting MS spectra, along with their associated metadata (e.g. experimental protocol, MS instrumentation, and operational parameters), are then semi-automatically processed by specialized software packages to identify or quantify the sampled ions. The inherent variability introduced by using different instruments, instrument software, and experimental conditions, however, affects the downstream ability to analyze, integrate, and compare data sets originating from different MS experiments.

Indeed, with the ever-increasing use of mass spectrometry, two issues have arisen in terms of handling MS data: (i) the necessity to share data throughout the scientific community in order to facilitate integration and comparison (2. Prince J.T., Carlson M.W., Wang R., Lu P., Marcotte E.M. The need for a public proteomics repository. Nature Biotechnology. 2004; 22: 471-472), and (ii) the importance of utilizing open and readily accessible standard formats that verifiably capture a consistent amount of crucial information. The importance of addressing these issues has been further emphasized in prominent journal editorials (3. Thou shalt share your data. Nat. Methods. 2008; 5 (Editors): 209; 4. Democratizing proteomics data. Nat Biotechnol. 2007; 25 (Editors): 262). Data repositories have since been created to allow data to be shared, including Tranche (5. Falkner J.A., Andrews P.C. Tranche: Secure Decentralized Data Storage for the proteomics community. Journal of Biomolecular Techniques. 2007; 18: 3), GPMDB (6. Craig R., Cortens J.P., Beavis R.C. Open source system for analyzing, validating, and storing protein identification data. J. Proteome Res. 2004; 3: 1234-1242), PRIDE (7. Martens L., Hermjakob H., Jones P., Adamski M., Taylor C., States D., Gevaert K., Vandekerckhove J., Apweiler R. PRIDE: the proteomics identifications database. Proteomics. 2005; 5: 3537-3545), and PeptideAtlas (8. Desiere F., Deutsch E.W., King N.L., Nesvizhskii A.I., Mallick P., Eng J., Chen S., Eddes J., Loevenich S.N., Aebersold R. The PeptideAtlas project. Nucleic Acids Res. 2006; 34: D655-658), among others (9. Mead J.A., Bianco L., Bessant C. Recent developments in public proteomic MS repositories and pipelines. Proteomics. 2009; 9: 861-881), and various proposed standard formats for MS data (10. Taylor C.F., Binz P.A., Aebersold R., Affolter M., Barkovich R., Deutsch E.W., Horn D.M., Huhmer A., Kussmann M., Lilley K., Macht M., Mann M., Muller D., Neubert T.A., Nickson J., Patterson S.D., Raso R., Resing K., Seymour S.L., Tsugita A., Xenarios I., Zeng R., Julian Jr. R.K. Guidelines for reporting the use of mass spectrometry in proteomics. Nat Biotechnol. 2008; 26: 860-861; 11. McDonald W.H., Tabb D.L., Sadygov R.G., MacCoss M.J., Venable J., Graumann J., Johnson J.R., Cociorva D., Yates 3rd J.R. MS1, MS2, and SQT-three unified, compact, and easily parsed file formats for the storage of shotgun proteomic spectra and identifications. Rapid Commun Mass Spectrom. 2004; 18: 2162-2168; 12. Orchard S., Montechi-Palazzi L., Deutsch E.W., Binz P.A., Jones A.R., Paton N., Pizarro A., Creasy D.M., Wojcik J., Hermjakob H. Five years of progress in the Standardization of Proteomics Data 4(th) Annual Spring Workshop of the HUPO-Proteomics Standards Initiative April 23–25, 2007 Ecole Nationale Superieure (ENS), Lyon, France. Proteomics. 2007; 7: 3436-3440; 13. Pedrioli P.G., Eng J.K., Hubley R., Vogelzang M., Deutsch E.W., Raught B., Pratt B., Nilsson E., Angeletti R.H., Apweiler R., Cheung K., Costello C.E., Hermjakob H., Huang S., Julian R.K., Kapp E., McComb M.E., Oliver S.G., Omenn G., Paton N.W., Simpson R., Smith R., Taylor C.F., Zhu W., Aebersold R. A common open representation of mass spectrometry data and its application to proteomics research. Nat Biotechnol. 2004; 22: 1459-1466; 14. mzData, http://psidev.info/index.php?q=node/80#mzdata) were developed.

Other formats such as JCAMP-DX (http://www.acornnmr.com/JCAMP.htm; www.jcamp.org), which was designed for IR spectrometry and adapted to NMR and mass spectrometry, and NetCDF are quite variably implemented, are difficult to validate, and cannot encode extensive metadata in a standard fashion; they have therefore not gained much use for proteomics applications and other complex MS analyses. Analytical Information Markup Language (AnIML; http://animl.sourceforge.net/), which aims to encompass several analytical platforms, including eventually mass spectrometry, is still being designed. For mass spectrometry-based proteomics workflows, mzXML (13) and mzData (14) have been the most widely used open formats for several years. However, each of these initial efforts to develop an open, vendor-neutral XML data format to store MS information was undertaken with a different underlying purpose.

One format, mzData, was developed by HUPO-PSI as a data exchange and archive standard (14; 15. Orchard S., Zhu W., Julian Jr. R.K., Hermjakob H., Apweiler R. Further advances in the development of a data interchange standard for proteomics data. Proteomics. 2003; 3: 2065-2066), and was implemented as such in PRIDE (16. Jones P., Cote R.G., Martens L., Quinn A.F., Taylor C.F., Derache W., Hermjakob H., Apweiler R. PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res. 2006; 34: D659-663). The other format, mzXML, was developed at the Institute for Systems Biology in an effort to streamline their data processing software (17. Keller A., Eng J., Zhang N., Li X.J., Aebersold R. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol. Syst. Biol. 2005; 1 (2005.0017)), and became a popular de facto standard format. These two formats also differed in their underlying philosophies regarding flexibility. mzData utilized a controlled vocabulary that could be frequently updated as the technology advanced. In contrast, mzXML had a strict schema that used enumerated attributes to describe the auxiliary information, such that support for new annotations required revisions to the schema and software updates. Although each of the proposed formats satisfied the requirements of openness and accessibility, the multiplicity of the formats proved to be confusing and distracting to scientists and computer programmers alike.

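The difference between the two philosophies can be illustrated with a minimal Python sketch. It is not taken from either format's schema or tooling: the enumeration `SCHEMA_IONIZATION_VALUES`, the `CvTerm` and `ControlledVocabulary` classes, and the `EX:` accessions are all hypothetical names invented for illustration. The first approach fixes the allowed values in the schema itself, so a new annotation requires a schema and software revision; the second keeps the allowed terms in a separately maintained vocabulary that can grow without touching the schema.

```python
from dataclasses import dataclass

# Approach A (enumerated attributes, in the spirit of mzXML as described above):
# the permitted values are baked into the schema.
# Hypothetical enumeration for illustration; not the real mzXML schema.
SCHEMA_IONIZATION_VALUES = frozenset({"ESI", "MALDI"})


def validate_enumerated(value: str) -> bool:
    """Accept a value only if the fixed schema enumerates it."""
    return value in SCHEMA_IONIZATION_VALUES


# Approach B (controlled vocabulary, in the spirit of mzData as described above):
# the schema only says "a vocabulary term goes here"; the terms themselves
# live in a vocabulary that is maintained and updated independently.
@dataclass(frozen=True)
class CvTerm:
    accession: str  # hypothetical accession, e.g. "EX:0001"
    name: str


class ControlledVocabulary:
    def __init__(self, terms):
        self._terms = {term.accession: term for term in terms}

    def add(self, term: CvTerm) -> None:
        """Register a new term; no schema or parser change is involved."""
        self._terms[term.accession] = term

    def validate(self, accession: str, name: str) -> bool:
        """A reference is valid if the accession exists and the name matches."""
        term = self._terms.get(accession)
        return term is not None and term.name == name


# Supporting a new annotation under approach B is a vocabulary update only.
cv = ControlledVocabulary([CvTerm("EX:0001", "electrospray ionization")])
cv.add(CvTerm("EX:0002", "new ion source"))
```

Under approach A, the value "new ion source" stays invalid until `SCHEMA_IONIZATION_VALUES` (standing in for the schema) is revised and redistributed. Under approach B, the same annotation becomes valid as soon as the vocabulary is updated, at the cost of needing a separate check that the referenced terms actually exist.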
In order to resolve this situation, the teams that developed mzData and mzXML, along with many other researchers and developers from academia, industry, and vendors joined forces in the Human Proteome Organization (HUPO) Proteomics Standards Initiative working group for mass spectrometry standards (PSI-MS), and set out to create a single MS data standard that would build on the strengths of the previous efforts. The challenge in creating the new unified output format, called mzML, was therefore the resolution of the opposing philosophies of mzXML and mzData, while retaining the best technical attributes of these two formats. In 2006, the unification process was initiated at a PSI workshop based on the guiding design principles determined by members representing instrument and software vendors, data repositories, end users, and the teams that built the mzXML and mzData standards. The designers of mzML focused on four key objectives: (i) creation of a simple format, (ii) elimination of alternate to encode the information, support for all the features of mzXML and mzData, and validation implementation to these would to a single unified format that could support the of mzXML and mzData and that could be easily by vendors and software, with further to be in In order to facilitate adoption and uniform implementation of the new standard format, the of PSI-MS also created open source tool sets that developers as well as end to immediately the format having to their on the format was at PSI as well as to In the mzML standard format was E. a data format for mass spectrometer 2008; PubMed Scopus Google Scholar, E.W. Mass spectrometer output file format Biol. PubMed Scopus Google Scholar). However, the process J.A. Martens L. Hermjakob H. Julian R.K. Paton N.W. 
The PSI process and its implementation on the PSI 2007; 7: PubMed Scopus Google Scholar), several became as vendors to the new format, most support for and and a file for of which features that had been from the These along with several other were by the PSI-MS working group in with the that had the a mzML was in with the that this new will for quite In to the best technical attributes of the formats, several key were introduced in in order to support new such as the and mzML can multiple operational for an and spectra to a new is the ability to capture data the introduced are also in mzML, such as the ability to encode data to be and the of multiple a liquid R. B. R. K. S. S. Mallick P. mass peptide identification on Proteome Res. 2008; 7: PubMed Scopus Google Scholar). with mzML a controlled vocabulary that and of In mzML with a set of validation These are in a XML to the PSI L. S. F. B. Jones A.R. Martens L. Hermjakob H. The PSI a to of proteomics data.Proteomics. 2009; 9: PubMed Scopus Google and have been implemented in two mzML applications The technical of the mzML standard are available with of its and various files at will the technical of the mzML standard and All of the information from a single MS including the spectra and associated is the mzML its mzML is in XML schema the format and many tools are readily available to an XML to its XML schema The mzML file is as information the controlled in the of the mzML information on the of spectra in the is an that of of controlled vocabulary terms that can be as a throughout a the can information that are in the information the instrument that the and provide a of data processing that the is an that for the mass such as These are by the spectra and and data are by format data for and with This design not has been agreed by the of In order to to the mzML was designed with a for in the as This to a the file having to the file Although is to a of the of years of with mzXML have that these are and are by the of 
having an To for the of an in the software can easily be to the and the is an To the mzML was designed that the not have an the can still be in a schema that has an an mzML file a or mzML and software is designed to the open and XML it also a that the of the data files by as much as a of for spectra with the However, the files by a of the is of the of XML standard and used tools can be used to this for storage and of mzML for mzML files can by and available on many provide of of the in spectra by the to mzML and therefore to files the A file with in the file with mzML or using and as as the using and file with the applications are to with mzML an in the files are associated with using standard formats are much the associated with to with multiple formats. In an effort to the information in different and to provide support for new with mzML, have designed the format to encode most of the metadata in which provide a to a the PSI MS controlled vocabulary These terms have and including the data and of The controlled vocabulary is and new terms can be of the mzML an a new to describe a new the proposed and can be to the PSI-MS vocabulary the can be and then to the or other can also be used to the for can for be used to the To the use of a was with the data format. validation a simple to the and of the metadata in an mzML such as the of a required the of an source with a or the use of two terms can be at a have the available as a with file or as a tool for the mzML format is designed to support (10.Taylor C.F. Binz P.A. Aebersold R. Affolter M. Barkovich R. Deutsch E.W. Horn D.M. Huhmer A. Kussmann M. Lilley K. Macht M. Mann M. Muller D. Neubert T.A. Nickson J. Patterson S.D. Raso R. Resing K. Seymour S.L. Tsugita A. Xenarios I. Zeng R. Julian Jr., R.K. Guidelines for reporting the use of mass spectrometry in proteomics.Nat Biotechnol. 2008; 26: 860-861Crossref PubMed Scopus (67) Google Scholar, C.F. Paton N.W. Lilley Binz P.A. Julian Jr., Jones A.R. Zhu W. Apweiler R. 
Aebersold R. Deutsch E.W. M.J. A. Macht M. Mann M. Martens L. Neubert T.A. Patterson S.D. P. Seymour S.L. P. Tsugita A. Vandekerckhove J. J.P. Xenarios I. Yates 3rd, J.R. Hermjakob H. The information a proteomics experiment Biotechnol. 2007; PubMed Scopus (562) Google Scholar) can be out a by of an mzML data file to the of the required information. is that the metadata can be for different of data, that different of spectra can be using the with different of the mzML format has the HUPO PSI community standard development which in is based on the open source software development A group of of the efforts of the many community members that their and at different and a of the process is an that is accessible to This development has been to be as with and Indeed, the development of mzML has the this the development of the standard The mzML standard is quite as it has been developed with in The required of the format from its of the data are in the XML format such as the necessity to an instrument the that this is quite open, and not by the XML an the for an The different of are controlled vocabulary parameters, and a new source is a simple to the will mzML files to the use of this new terms are this new source will be immediately to software as a it will have an to the This is in in mzML, the format to XML schema or software need be to the which is a simple file that is available in a system and that can be updated and Indeed, the public of mzML, have been introduced to the controlled vocabulary downstream on the XML schema or the of the community in are several implementations of the mzML format in software data and for a of for a In the of software that mzML to and is of the strengths of The software D. M. R. D. Mallick P. open source software for rapid proteomics tools 2008; PubMed Scopus Google Scholar, .Google Scholar) has the for and implementation of mzML in its of of a set of tools and in for proteomic data analyses. 
The provide a that data file and standard and an to in software that to mzML or is available under a which the to be used in software the terms of that The is used by several software to provide mzML The tool can many different formats to mzML, as well as mzXML files M. A. C. A. R. E. N. A. K. open-source software for mass 2008; 9: PubMed Scopus Google Scholar), an open-source for mass spectrometry, also for and mzML which can be easily in other software it validation and validation of mzML This of was used to an tool for validation of mzML files which is of The Proteomics K. C. E. N. M. proteomics 2007; PubMed Scopus Google Scholar). the and the R.G. F. Martens L. an open-source for mzML, the PSI standard for MS data.Proteomics. PubMed Scopus (44) Google Scholar) provide for and these are available to of mzML several software applications are being with mzML These and software such as C. M. Cheung K. an improved for in on of Proteome Res. 2008; 7: PubMed Scopus Google Scholar), D.L. mass peptide identification by Proteome Res. 2007; PubMed Scopus Google Scholar), the A. Eng J. Zhang N. Li X.J. Aebersold R. A uniform proteomics MS/MS analysis platform utilizing open XML file Biol. 2005; 1 (2005.0017)Crossref PubMed Scopus Google Scholar, P.G. a for proteomic Biol. PubMed Scopus Google Scholar, E.W. L. D. H. N. Nilsson E. Pratt B. B. Eng J.K. Nesvizhskii A.I. Aebersold R. A of the PubMed Scopus Google Scholar), and the J. G. K. F. The software an platform for and analysis of proteomics Proteome Res. 2009; PubMed Scopus Google Scholar). vendors have to provide mzML support in the of their The support for mzML in along with the of several open source software packages and in a of that data in the mzML format is readily accessible to end or software the of open data standards formats are and have two use in the of mass spectrometry-based proteomics to of the of the mzML data standard. many multiple from different vendors for their analyses. 
Although this in the of strengths of the different it also a at the of data The various data formats by each instrument to its data, are to these different from the can output a the development of software that can on data from such as the tools in the quite difficult This in was of the the mzXML format was developed as of the to the various formats in a open data that of data to support various of downstream including identification and of a from mzXML, mzML these data from many to be the available or the common mzML format, which is in and by all downstream data processing software A use of standard data formats the of data to the scientific an that is in the life sciences J.A. M. Hermjakob H. L. a for 2009; 3: PubMed Scopus Google Scholar). data were in formats, would in in L. Nesvizhskii A.I. Hermjakob H. Adamski M. Omenn Vandekerckhove J. Gevaert K. data mass spectrometry data in public proteomics data 2005; 5: PubMed Scopus Google (i) to data would have and the data and to the and software with the format, (ii) of the data, would in and processing the data, and a all data would become as the required software will be or an open, XML therefore format such as mzML, these key issues are of these of on the of software the format, as can be from the previous many actively and open source implementations in a of and for a of are available for mzML and many other implementations are or will be available with their software it be that the two use are in by to mzML as the format for data processing and the to in mzML The data by mzML will most not in a by sample and sample mass spectrometry data is then further processed to identify or quantify the it is to that HUPO PSI has also standards for protein including based and based for identification of molecules from mass spectra and for the of on the integration of data and metadata in the life sciences is being actively undertaken by the for working group of the which has in the format P. M. A. D. J. J. F. N. Jones P. 
A. M. N. N. Taylor C. W. G. S. The a simple format for complex 2008; PubMed Scopus Google Scholar). information in all the formats on the other is the C.F. D. J. Apweiler R. M. Binz P.A. M. A. A. Deutsch E.W. J. P. F. G. N.W. Hermjakob H. Julian Jr., R.K. M. C. C. E. M. N.L. J. P. N. H. R. A. N. S. J. P. H. H. J. R.H. D. Smith B. J. Jr., K. P. A. J. S. reporting for biological and the Biotechnol. 2008; 26: PubMed Scopus Google Scholar). In years its mzML was and has to be a format that can easily incremental advances in mass spectrometry while a for to of data from new set of software that support mzML will adoption of the format. However, the formats are also the to is and the adoption of mzML in will be initial of implementations a of to since the of have not been is therefore that will for quite The of instrument vendors in PSI-MS further that mzML will become available on instrument software by