GC-EI-MS datasets of trimethylsilyl (TMS) and tert-butyl dimethyl silyl (TBDMS) derivatives for development of machine learning-based compound identification approaches

In the field of environment and health studies, recent trends have focused on the identification of contaminants of emerging concern (CEC). This is a complex and challenging task, as resources concerning these compounds, such as compound databases (DBs) and mass spectral libraries (MSLs), are scarce. This is particularly true for semi-polar organic contaminants that have to be derivatized prior to gas chromatography-mass spectrometry (GC-MS) analysis with electron impact ionization (EI), for which hardly any records can be found. In particular, there is a severe lack of publicly available GC-EI-MS spectral datasets generated for the development, validation and performance evaluation of cheminformatics-assisted compound structure identification (CSI) approaches, including novel machine learning (ML)-based approaches [1]. We set out to fill this gap and to support ML-assisted compound identification, thus aiding the identification of silylated derivatives in GC-MS laboratories working in the field of environment and health. To this end, we have generated 12 datasets of GC-EI-MS spectra, six of which contain spectra of trimethylsilyl (TMS) derivatives and six of which contain spectra of tert-butyldimethylsilyl (TBDMS) derivatives. Four of these datasets, named testing datasets, contain mass spectra acquired by the authors. They are available in full, together with the corresponding metadata. Eight datasets, named training datasets, were derived from mass spectra in the NIST 17 Mass Spectral Library. For these, we have made only the metadata publicly available, for licensing reasons. For each type of derivative, two testing datasets were generated by acquiring and processing GC-EI-MS spectra, such that they include raw and processed GC-EI-MS spectra of TMS and TBDMS derivatives of CEC, along with their corresponding metadata.
The metadata contains the IUPAC name, exact mass, molecular formula, InChI, InChIKey, SMILES and PubChem ID of each CEC and CEC-TMS or CEC-TBDMS derivative, where available. Eight GC-EI-MS training datasets were generated from the National Institute of Standards and Technology (NIST)/U.S. Environmental Protection Agency (EPA)/National Institute of Health (NIH) 17 Mass Spectral Library. For each derivative type (TMS and TBDMS), four datasets are given: an original dataset obtained from NIST/EPA/NIH 17 and three variants thereof, each obtained after one of the filtering steps of the procedure described below. Only the metadata about the training datasets are available, describing the corresponding NIST/EPA/NIH 17 entries: these include the compound name, CAS Registry number, InChIKey, exact mass, Mw, NIST number and ID number. The datasets we present here were used to train and test predictive models for identification of silylated derivatives built with ML approaches [4]. The models were built by using data curated from the NIST Mass Spectral Library 17 [2] and the machine learning approach of CSI:Output Kernel Regression (CSI:OKR) [2]. Data from the NIST Mass Spectral Library 17 are commercially available from the National Institute of Standards and Technology (NIST)/U.S. Environmental Protection Agency (EPA)/National Institute of Health (NIH) and thus cannot be made publicly available. This highlights the need for publicly available GC-EI-MS spectra, which we address by releasing in full the four testing datasets.


© 2023 The Author(s).
The testing GC-EI-MS spectral datasets are given in .txt, .msp and .mgf formats and the accompanying metadata is given in .xlsx format. The metadata regarding the training GC-EI-MS spectral datasets is given in .xlsx format.

Description of data collection
The GC-EI-MS spectral datasets that we used for testing ML-based CSI models were generated in the full scan range of m/z 50-800 amu for the TMS derivatives and m/z 50-1000 amu for the TBDMS derivatives. Raw instrument data was reduced to two-dimensional peak lists (m/z, abundance) using MassHunter Qualitative Analysis v.B.07 (Agilent Technologies, USA), in which background subtraction was also performed. The GC-EI-MS spectral datasets intended for training of ML-based CSI models were generated from the NIST/EPA/NIH Mass Spectral

Value of the Data
• The generated GC-EI-MS datasets provide a comprehensive collection of GC-EI-MS spectra of TMS and TBDMS derivatives of structurally and chemically diverse environmental contaminants, given along with their metadata in universal ready-to-use formats (.txt, .msp, .mgf) for further cheminformatics-based processing.
• The generated GC-EI-MS datasets are of value for environmental and exposomics researchers, as well as for the CSI and ML communities interested in the development of new CSI tools.
• Few datasets of mass spectra are publicly available. This is especially true for GC-EI-MS spectra, which makes the generated data even more valuable.
• Both the testing and the training data can be used on their own or as part of larger datasets for training, testing and validation in the development of novel CSI approaches, for challenging existing approaches, and for performance comparison of novel and existing CSI approaches, especially ML-based ones.
• The data can be used as a stand-alone database (or joined with other in-house databases of GC-EI-MS spectra), serving as a valuable reference during suspect screening and non-targeted environmental analysis.
Standard solutions of selected environmental contaminants (104 compounds), listed in Table 1, were used for generating the TMS and TBDMS test datasets. The presented data consists of .txt, .msp and .mgf data files for each of the four MS datasets (Test dataset_TMS_RAW, Test dataset_TMS_BS, Test dataset_TBDMS_RAW and Test dataset_TBDMS_BS), in which each of the GC-EI-MS spectra is recorded with the compound name, InChIKey, Mw, molecular formula (MF), CAS Registry number and list of peaks represented as m/z and intensities. Metadata of the TMS/TBDMS derivatives and the corresponding parent CEC are given in .xlsx files containing the IUPAC name, exact mass, MF, InChI, InChIKey, SMILES and PubChem ID, when available ("Metadata_test_TMS derivatives.xlsx" and "Metadata_test_TBDMS derivatives.xlsx"). The four datasets were used to test predictive models for identification of silylated derivatives, built with ML approaches.
Note that some compounds with spectra in the test datasets also have spectra in the training datasets. The models generated by ML were evaluated separately on spectra of compounds that do and do not appear in the training datasets, and the results are reported separately by Ljoncheva et al. [2]. In the metadata files for the testing datasets ("Metadata_test_TMS derivatives.xlsx" and "Metadata_test_TBDMS derivatives.xlsx"), an extra column (the last one) indicates whether a spectrum of the compound at hand also appears in the corresponding training dataset and the respective metadata file ("Metadata_training_TMS_3.3", "Metadata_training_TBDMS_3.3").
The predictive models for CSI of silylated derivatives were built by ML approaches from training datasets of GC-EI-MS spectra of TMS and TBDMS derivatives, which are not publicly available, as they were curated from the commercially available NIST/EPA/NIH Mass Spectral Library 17 [3], licensed under the United States Department of Commerce Copyright. NIST's end-user license for the NIST 17 MSL restricts its use to a single computer that is not accessible by more than one person. While the training datasets themselves cannot be made publicly available, we make available the corresponding metadata: with licensed access to NIST MSL 17, they can be used to reconstruct the training datasets by following the workflow summarized below and described by Ljoncheva et al. [2].
The ML approach used to build models for the identification of silylated derivatives from these data [2] was CSI:OKR [3].

Experimental design and generation of training datasets
Initial versions of the TMS and TBDMS datasets (TMS_0.1 and TBDMS_0.1) were generated by extracting all GC-EI-MS spectra of TMS, resp. TBDMS, derivatives of small molecules from the NIST/EPA/NIH 17 Mass Spectral Library [3]. The first constrained search for GC-EI-MS TMS spectra, using the constraints name fragment: trimethylsilyl and elements allowed: Si, resulted in a collection of 9958 entries, while for GC-EI-MS TBDMS spectra, the constraints name fragment: tertbutyldimethylsilyl and elements allowed: Si, resulted in an initial dataset of 2238 entries. Entries were extracted in .msp file format and subsequently converted to .txt format, using the LIB2NIST conversion tool (NIST 2011). Each GC-EI-MS entry included the compound name, InChIKey, MF, Mw, exact mass, CAS number, NIST ID and MS peak list. The GC-EI-MS spectra of TMS/TBDMS derivatives with erroneous metadata (name, molecular formula, InChIKey) that do not correspond to the analyzed compound were excluded from the dataset.
The TMS/TBDMS GC-EI-MS spectral datasets were further filtered using a three-step spectral filtering process:
1) Exclusion of chemical irregularities. The GC-EI-MS spectra of TMS, resp. TBDMS, derivatives of compounds not susceptible to derivatization, defined by the absence of functional group(s) amenable to silylation, were filtered out. The functional groups amenable to silylation are those containing an active hydrogen, i.e., carboxyl, hydroxyl, amine and thiol.
2) Exclusion of high-molecular mass TMS, resp. TBDMS, derivatives. The GC-EI-MS spectra of TMS, resp. TBDMS, derivatives of CEC with molecular mass ≥ m/z 1000 were eliminated, since they lie above the working linear range of the GC-MS instruments.
3) Exclusion of insufficient-quality GC-EI-MS spectra. The following GC-EI-MS spectra were excluded:
- GC-EI-MS spectra not acquired up to an upper m/z of at least Mw of the derivative + 10 amu;
- GC-EI-MS spectra that do not contain both the molecular ion [M]+ peak and at least one of the isotope peaks, such as the 13C isotope peak;
- GC-EI-MS spectra that contain neither peaks of fragment ions specific for TMS groups (m/z 73, 147, 221 and 295, corresponding to one, two, three and four TMS groups, respectively) nor peaks of fragment ions specific for TBDMS groups (m/z 115, 230 and 345, corresponding to one, two and three TBDMS groups, respectively); and
- GC-EI-MS spectra not containing at least five fragment ion peaks.
As a result, the final version of the TMS dataset consists of 4648 TMS GC-EI-MS spectra, while the final version of the TBDMS dataset consists of 1883 GC-EI-MS spectra. For each of the GC-EI-MS spectra in the final TMS and TBDMS datasets, the m/z range extends from m/z 50 to Mw of the derivative ± 10 amu. For data parsing, all ion fragments with intensity 0 were removed from the refined TMS and TBDMS datasets.
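The spectral-quality step and the removal of zero-intensity peaks described above can be illustrated with a short Python sketch. It is not the authors' code: the function names, the nominal-mass matching with a tolerance of 0.5, the use of the M+1 peak as the isotope peak, the counting of fragment ions as peaks below the molecular ion, and the assumption that the upper scan limit of each spectrum is known from its metadata are all choices made here for illustration.

```python
from typing import List, Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)

# Marker ions quoted in the text for TMS and TBDMS groups.
TMS_MARKERS = (73, 147, 221, 295)
TBDMS_MARKERS = (115, 230, 345)


def drop_zero_intensity(peaks: Sequence[Peak]) -> List[Peak]:
    """Remove peaks with intensity 0."""
    return [(mz, inten) for mz, inten in peaks if inten > 0]


def has_peak(peaks: Sequence[Peak], target: float, tol: float = 0.5) -> bool:
    """True if any peak lies within tol of target (tolerance is an assumption)."""
    return any(abs(mz - target) <= tol for mz, _ in peaks)


def passes_quality_filter(
    peaks: Sequence[Peak],
    nominal_mass: int,
    markers: Sequence[int],
    scan_upper_mz: float,
    tol: float = 0.5,
) -> bool:
    """Apply the four exclusion rules of filtering step 3 to one spectrum."""
    peaks = drop_zero_intensity(peaks)
    # Rule 1: spectrum acquired up to at least Mw + 10.
    if scan_upper_mz < nominal_mass + 10:
        return False
    # Rule 2: molecular ion plus one isotope peak (M+1 assumed here).
    if not (has_peak(peaks, nominal_mass, tol) and has_peak(peaks, nominal_mass + 1, tol)):
        return False
    # Rule 3: at least one silyl-specific marker ion.
    if not any(has_peak(peaks, m, tol) for m in markers):
        return False
    # Rule 4: at least five fragment ions (peaks below the molecular ion, an assumption).
    n_fragments = sum(1 for mz, _ in peaks if mz < nominal_mass - tol)
    return n_fragments >= 5
```

A spectrum of a hypothetical TMS derivative with nominal mass 218 would be tested with `passes_quality_filter(peaks, 218, TMS_MARKERS, scan_upper_mz=230)`.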

Chemicals and materials
From the in-house pool of reference standards, 104 CEC were selected as environmentally rel-  Table 1 .
The selected compounds had to satisfy at least three of the following five criteria: 1) Positioning: the compound is present in the US EPA Comptox Chemistry Dashboard (CCD) [5], the most comprehensive repository of EE constituents; 2) Persistence: compound's half-life in fresh or estuarine water > 40 days; 3) Bioaccumulation: BAF and/or BCF > 2000, or in absence of such data, log Kow ≥ 5.0; 4) Mobility: compound's water solubility ≥ 0.15 mg/L and log Koc ≤ 4.0, i.e. between -10.0 and 4.0; and 5) EcoToxicity: long-term no-observed-effect concentration (NOEC) for marine or freshwater organisms < 0.01 mg/L. Further details of the selection procedure are given by Ljoncheva et al. [2].
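The "at least three of five" rule can be written down as a small Python sketch. The `Compound` record and its field names are hypothetical, and treating a missing value as "criterion not met" is an assumption made here; the thresholds are those quoted above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Compound:
    """Hypothetical record of the properties used by the five criteria."""
    in_comptox: bool
    half_life_days: Optional[float] = None
    baf_or_bcf: Optional[float] = None
    log_kow: Optional[float] = None
    water_solubility_mg_l: Optional[float] = None
    log_koc: Optional[float] = None
    noec_mg_l: Optional[float] = None


def criteria_met(c: Compound) -> int:
    """Count how many of the five selection criteria a compound satisfies."""
    positioning = c.in_comptox
    persistence = c.half_life_days is not None and c.half_life_days > 40
    # Bioaccumulation: BAF/BCF if available, otherwise fall back to log Kow.
    if c.baf_or_bcf is not None:
        bioaccumulation = c.baf_or_bcf > 2000
    else:
        bioaccumulation = c.log_kow is not None and c.log_kow >= 5.0
    mobility = (
        c.water_solubility_mg_l is not None
        and c.log_koc is not None
        and c.water_solubility_mg_l >= 0.15
        and -10.0 <= c.log_koc <= 4.0
    )
    ecotoxicity = c.noec_mg_l is not None and c.noec_mg_l < 0.01
    return sum([positioning, persistence, bioaccumulation, mobility, ecotoxicity])


def is_selected(c: Compound) -> bool:
    """A compound qualifies if it meets at least three of the five criteria."""
    return criteria_met(c) >= 3
```

For example, a compound listed in the CCD with a half-life of 60 days and log Kow of 5.2 meets three criteria and would be selected under this sketch.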
The MSD was operated in EI ionization mode (70 eV) by scanning over the mass range of m/z 50-800 amu for TMS derivatives and m/z 50-1000 amu for TBDMS derivatives. In between the acquisitions of the derivatized standards, EtAc was run as the solvent check to assess potential background interferences and was used for background subtraction as a part of the post-acquisition processing of the GC-EI-MS spectra.
The retention times (Rt) of the TMS and TBDMS derivatives are given in Table 2 and Table 3, respectively.

Data processing
GC-EI-MS data acquisition resulted in the generation of multiple (≥15) GC-EI-MS spectra for most of the TMS and TBDMS derivatives. Exceptions are the L-ascorbic acid TMS, L-leucine TMS and L-serine TMS, with three GC-EI-MS spectra each, and the TBDMS derivatives of L-serine, 4-nitroguaiacol, 5-nitroguaiacol, catechol, 3-methylcatechol, 3-methyl-5-nitrocatechol, syringol, 4-nitrosyringol, 4-nitrocatechol, p-coumaric acid, m-coumaric acid, o-coumaric acid, mycophenolic acid, 4,6-dinitroguaiacol, etofylline and urea, with one GC-EI-MS spectrum each in the test TBDMS datasets. All GC-EI-MS spectra were processed using MassHunter Qualitative Analysis v B.07 (Agilent Technologies, USA), which reduced raw instrument data to two-dimensional peak lists (m/z, abundance), exported in .txt format. This software was also used to perform background subtraction, in order to remove constantly present background signals, such as m/z 149 as a typical phthalate interference, m/z 282, m/z 256 and m/z 284 for oleic, palmitic and stearic acid, and m/z 207, m/z 281 and m/z 327 of common polysiloxanes resulting from GC column stationary phase degradation. The presence of these signals was confirmed a priori in the multiple EtAc solvent runs acquired between the acquisitions of CEC silyl derivatives, and these runs were used for background subtraction.
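The idea behind background subtraction can be illustrated with a simplified Python sketch that subtracts a solvent-blank peak list from a sample peak list. This is not the algorithm implemented in the vendor software, which is not described here; matching peaks by rounded (nominal) m/z, the optional `scale` factor and the function name are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, abundance)


def subtract_background(
    sample: Sequence[Peak],
    blank: Sequence[Peak],
    scale: float = 1.0,
) -> List[Peak]:
    """Subtract blank abundances from sample abundances at the same nominal m/z.

    Peaks whose corrected abundance is zero or negative are dropped.
    """
    blank_by_mz: Dict[int, float] = {round(mz): inten for mz, inten in blank}
    corrected: List[Peak] = []
    for mz, inten in sample:
        value = inten - scale * blank_by_mz.get(round(mz), 0.0)
        if value > 0:
            corrected.append((mz, value))
    return corrected
```

With a blank containing the phthalate ion m/z 149 and the polysiloxane ions m/z 207 and 281, those signals are reduced or removed from the sample spectrum, while analyte ions absent from the blank are left unchanged.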
The .txt data were transformed into .mgf format by a Python script which formats the beginning of a new spectrum in the syntax required by .mgf (i.e., "BEGIN IONS"), then lists the exact mass of the compound (e.g., "MASS = 194.076"), taken from the appropriate file with the metadata, followed by a row with the charge ("CHARGE = 1+"). The introductory part of the data record finishes with a line starting with "TITLE = InChIKey:", followed by the InChIKey of the compound, then "Name:" and the name of the compound, where the InChIKey and the name of the compound are taken from the .txt file, followed by an empty line. The peaks are then copied verbatim from the .txt file, line after line, and, afterwards, the data record finishes with the row "END IONS", preceded by an empty line. Each spectrum entry from the .txt files is thus converted into .mgf format and the spectra are listed in the same order as in the .txt file.
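The record layout described above can be sketched in a few lines of Python. This is not the authors' script: the function name is hypothetical, and the exact spacing around "=" and the separator between the InChIKey and the name in the TITLE line are assumptions (the sketch writes "KEY=value" without spaces and separates the two fields with a single space).

```python
from typing import Iterable, List


def mgf_record(exact_mass: str, inchikey: str, name: str, peak_lines: Iterable[str]) -> str:
    """Build one .mgf record from metadata and verbatim peak lines."""
    lines: List[str] = [
        "BEGIN IONS",
        f"MASS={exact_mass}",            # exact mass from the metadata file
        "CHARGE=1+",
        f"TITLE=InChIKey:{inchikey} Name:{name}",
        "",                              # empty line after the header
    ]
    lines.extend(peak_lines)             # peaks copied verbatim from the .txt file
    lines.extend(["", "END IONS"])       # empty line, then end of record
    return "\n".join(lines) + "\n"


def write_mgf(records: Iterable[str], path: str) -> None:
    """Write records to a file in the order given."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(record)
```

Calling `mgf_record("194.076", inchikey, name, ["73 999", "147 120"])` with a placeholder InChIKey and name returns a block that starts with "BEGIN IONS" and ends with "END IONS"; passing the records in file order to `write_mgf` preserves the order of the spectra.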
The .mgf files can be read by the ProteoWizard MSConvert software (version 3.0.22153-da6d3d1) [6] and consequently converted into a number of other formats that are in use by the MS community (such as mzML or mzXML). Unfortunately, these formats do not include the .msp format. The .msp format of the data was generated from the above-described .mgf format by using the Python library available at https://github.com/matchms/matchms.

Ethics Statements
The authors declare that the manuscript meets all the rules and conditions described in the "Ethics in publishing" section of the guide for authors (https://www.elsevier.com/journals/data-in-brief/2352-3409/guide-for-authors). The work did not include any investigations involving animal experiments, human participants or data collected from social media platforms.
The training GC-EI-MS spectral datasets were curated from the commercially available NIST/EPA/NIH 17 Mass Spectral Library. Explicit permission to release the metadata about the training datasets was obtained by the authors from NIST. Because NIST's individual license restricts use to a single computer that is not accessible by more than one person, the training datasets cannot be made available to the public by the authors. However, with licensed access to the NIST 17 MSL, the training data can be reconstructed from the available metadata files by following the detailed description of data preparation given above.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability
GC-EI-MS datasets of trimethylsilyl (TMS) and tert-butyl dimethylsilyl (TBDMS) derivatives for development of machine learning-based compound identification approaches (Original data) (Mendeley Data).