jmzTab: A Java interface to the mzTab data standard

mzTab is the most recent standard format developed by the Proteomics Standards Initiative. mzTab is a flexible tab-delimited file that can capture identification and quantification results coming from MS-based proteomics and metabolomics approaches. We here present an open-source Java application programming interface for mzTab called jmzTab. The software allows the efficient processing of mzTab files, providing read and write capabilities, and is designed to be embedded in other software packages. The second key feature of the jmzTab model is that it provides a flexible framework to maintain the logical integrity between the metadata and the table-based sections in the mzTab files. In this article, as two example implementations, we also describe two stand-alone tools that can be used to validate mzTab files and to convert PRIDE XML files to mzTab. The library is freely available at http://mztab.googlecode.com.

the information. In addition, no standard format existed for MS-based metabolomics results. To overcome these issues, the mzTab standard (J. Griss et al., paper submitted) was recently developed as a flexible tab-delimited file format, to report proteomics and metabolomics results, including both identification and quantification data. At the time of writing, the first stable version (v1.0) of the standard is about to be released (see the updated specification document at http://mztab.googlecode.com). mzTab has a flexible design which allows the reporting of identification and quantification results at different levels, ranging from a simple summary or subset of the complete information (e.g., the final results) up to fairly comprehensive representation of the results including the experimental design. Many data consumers are only concerned about having access to the final results of a study in an easily accessible format that is compatible with tools like Microsoft Excel R or the R programming language (R Core Team, http://www.R-project.org), among others. For this reason, mzTab is also aimed to make MS proteomics and  metabolomics results available to the wider biological community, beyond the field of MS.
An mzTab file can have up to five different sections: metadata, protein, peptide, psm (peptide spectrum match), and small molecule. In addition, it can reference the corresponding mass spectra in the relevant external files. There are two types of mzTab files: ''Identification'' (including peptide, protein, and/or small molecule identifications) and ''Quantification'' (used for quantification results, and optionally may contain identification results as well). In addition, there are two levels of detail (called ''mode'') of reporting data: ''Summary'' and ''Complete.'' The ''Summary'' mode can be used to report the final results of a study, for example, reporting data averaged from different replicates. The ''Complete'' mode is used if detailed experimental information coming from each individual assay/replicate is provided. The experimental design is modeled in a similar way to mzQuantML, including the elements ''study variable'', ''assay'', ''ms run'', and ''sample'' (see mzTab specification document for more details).
There are already several tools implementing mzTab. For example, mzTab is in use in the OpenMS Proteomics Pipeline [5] and fully supported by the MSnbase R/Bioconductor package [6]. Also, as the first nonproteomics tool, the LipidDataAnalyzer [7] supports the export of mzTab files, including quantitative information extracted from lipidomics MS data. In addition, two prominent data resources, the PRIDE database (for MS-based proteomics data) [8] and MetaboLights (for metabolomics data) [9] make use of the new standard and are planning to use it heavily in the near future.
Here we introduce jmzTab, an open source Java application programming interface (API) for reading, writing, and validating mzTab files. This API greatly simplifies accessing the information included in mzTab files, thereby promoting its use and facilitating its support in third party software. Analogous Java libraries were developed before for other PSI standards, such as jmzML [10], jmzIdentML [11], jmzQuantML [12] or jTraML [13]. The API is released under the permissive Apache License 2.0, and the source code is available at http://mztab.googlecode.com. In addition, as example implementations of the library, we describe two stand-alone tools supporting validation and conversion functionality for the format. Additionally, mzTab example files that can be used to test the API are available at https://code.google.com/p/mztab/wiki/ExampleFiles. jmzTab is structured in a three-layer architecture: (i) the Core Model Layer, a lightweight independent implementation for maintaining the integrity between the different sections in the file; (ii) the Enhancement Utilities Layer, which provides parsing, validation, and conversion functionality; and (iii) the Standalone Application Layer, which constitutes the cen- documentation about how to use the API can be found at https://code.google.com/p/mztab/wiki/jmzTab2 and at http://mztab.googlecode.com/svn/jmztab/trunk/docs/index. html.
The main principle behind jmzTab Core Model's design is to provide an independent, light-weight architecture to simplify the integration of the library into different proteomics/metabolomics software applications. In fact, the model can be integrated into external applications without the need for any other third-party packages. The second key feature of the Core Model is the use of a flexible framework to keep the logical data integrity between the metadata and the tablebased sections (including the protein, peptide, psm, and small molecule sections). In the jmzTab Core Model, the MZTabFile class is the central entry point to manage the internal relationships among the different sections in the file. It contains three key components: (i) Metadata, which is a mandatory meta model that provides the definitions from the dataset included in the file; (ii) MZTabColumnFactory, a factory class that can be used to generate stable MZTabColumn elements, and to add different optional columns dynamically (e.g., protein and peptide abundance related columns). The Metadata and MZTabColumnFactory constitute the framework for the MZTabFile class; and (iii) Consistency constraints among the different sections of the model. For example, the MZTab-File class supports the iterative modification of the elements ''study variable'', ''ms run'', ''assay'', and ''sample'' assigned numbers (1−n) and its concrete location in the file, maintaining the internal consistency between the metadata section and the optional elements in the table-based sections. These methods are particularly useful when information from different experiments (e.g., from different MS runs) is condensed into a single mzTab file.
As mentioned above, in addition to the Core Model, the classes included in the Enhancement Utilities Layer provide mzTab parsing, validation, and conversion related functionality: (i) Parsing and validation: mzTab files can be validated to ensure that they comply with the latest version of the format specification. The process includes two steps. First of all, the basic model architecture is created, including the metadata section and the generation of the table column headers. The second step is the validation of the column rows. The class MZTabFileParser is used to parse and validate the mzTab files. If the validation has completed, an MZTabFile model will be generated. A series of messages are then reported, which can help to diagnose different types of format-related and/or logical (reporting errors related to the logical relationships among the different sections in a file) errors. At the time of writing, there are about sixty types of error messages (https:// code.google.com/p/mztab/wiki/jmzTab2_message). Each validation message has a unique identifier and is classified in three levels: Info, Warn, and Error, according to the requirements from the specification document. (ii) Conversion from PRIDE XML (internal XML format used by PRIDE) files to mzTab: The library supports the one-to-one conversion from PRIDE XML to mzTab. This functionality is provided since the PRIDE team is starting to make all proteomics results available in this format. In addition, by extending the ConvertProvider class, the current model can be used for the conversion of third-party format files. Converters from mzIdentML and mzQuantML into mzTab are being developed at present making use of this capability, for example, as part of the mzidLibrary [14]. The objective is that the researchers will always be able to access the information in PRIDE through the mzTab format, independently of the format used for the data submission.
As example implementations, two stand-alone tools were developed. Both tools are included in a zipped file, available to download at https://code.google.com/p/mztab/ downloads/list.
(i) mzTabGUI, a desktop application which provides mzTab validation functionality and conversion from PRIDE XML files to mzTab, in two different tabs (Fig. 2). After the conversion, the tool additionally performs an automatic validation of the mzTab file. (ii) mzTabCLI, a command line interface (CLI) which provides a more flexible way of processing mzTab files in a batch mode. It also includes validation and conversion functionality.
In addition, it is important to highlight that jmzTab is already integrated and used in other applications such as the LipidDataAnalyzer (http://genome.tugraz.at/lda/), and in an mzQuantML to mzTab converter included in the mzq-lib library (https://mzq-lib.googlecode.com/). The two stand-alone tools described above also play an important role in parsing, converting, and validating mzTab files in these projects. Detailed documentation about how to use the API can be found at https://code.google.com/p/mztab/wiki/ jmzTab2 and at http://mztab.googlecode.com/svn/jmztab/ trunk/docs/index.html.

Conclusions
We have presented the open-source library jmzTab to support the PSI's new mzTab standard format. The API follows the design principles used in other existing analogous APIs for other PSI standards such as jmzML or jmzIdentML. Since jmzIdentML and jmzML are focussed around XML formats, jmzTab had to be developed from scratch. In addition, we present two stand-alone tools, which make use of the API. It is planned that the mzTab format will be heavily used in prominent resources such as PRIDE and MetaboLights. Therefore, it is expected that jmzTab will be one of the essential pieces of software to facilitate data provision and data access to both resources.