DIA proteomics data from a UPS1-spiked E. coli protein mixture processed with six software tools

In this article, we provide a proteomic reference dataset that was initially generated to benchmark software tools for Data-Independent Acquisition (DIA) analysis. This large dataset includes 96 DIA .raw files acquired from a complex proteomic standard composed of an E. coli protein background spiked with 8 different concentrations of 48 human proteins (UPS1, Sigma). These 8 samples were analyzed in triplicate on an Orbitrap mass spectrometer with 4 different DIA window schemes. We also provide the spectral libraries and FASTA file used for their analysis, and the outputs of the six software tools used in this study: DIA-NN, Spectronaut, ScaffoldDIA, DIA-Umpire, Skyline and OpenSWATH. The dataset also contains post-processed quantification tables in which the peptides and proteins have been validated, their intensities normalized and the missing values imputed with a noise value. All the files are available on ProteomeXchange. Altogether, these files represent the most comprehensive DIA reference dataset acquired on an Orbitrap instrument published to date. It will be a useful resource for proteomic scientists who want to assess the performance of DIA software tools or test their processing pipelines, for software developers who want to improve their tools or develop new ones, and for students training in proteomics data analysis.


© 2022 The Author(s).

Value of the Data
• This dataset is the most comprehensive DIA dataset acquired on an Orbitrap mass spectrometer with a complex proteomic standard.
• In comparison to other proteomic reference datasets, it contains spiked-in proteins at known concentrations, which makes it possible to assess the ability of proteomic pipelines to recover low-abundance proteins.
• Proteomic scientists could use it to better understand the performance of DIA software tools and choose the best pipeline for their study.
• Students could use it for their training in DIA proteomics data analysis.
• Software developers could use it to assess and improve their tools for the detection of low-abundance proteins.
• This dataset can be considered a reference for the development of new DIA analysis tools.

Data Description
The dataset provided in this article was initially generated to benchmark DIA acquisition methods and software tools [1]. As shown in Fig. 1, we used a complex proteomic standard composed of an E. coli protein background spiked with 8 different concentrations of the 48 UPS1 human proteins (Sigma), ranging from 0.1 to 50 fmol per microgram of E. coli proteins. These samples were analyzed on an Orbitrap mass spectrometer operating in Data-Independent Acquisition mode. Four different DIA acquisition schemes were used: narrow windows are expected to provide less complex DIA spectra (fewer precursors are selected for fragmentation), whereas wide windows can better cover the mass range within an appropriate chromatographic cycle time. Two other schemes, using overlapped windows or mixed window sizes, were also tested (Table 1).
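The trade-off between window width and cycle time can be made concrete with a small sketch. The mass range, window widths, overlap and scan durations below are purely illustrative assumptions, not the values of Table 1, and the helper names `fixed_windows` and `cycle_time` are hypothetical.

```python
def fixed_windows(mz_start, mz_end, width, overlap=0.0):
    """Return (low, high) isolation windows of constant width covering
    [mz_start, mz_end]; consecutive windows share `overlap` m/z units."""
    if width <= overlap:
        raise ValueError("width must be larger than overlap")
    step = width - overlap
    windows = []
    low = mz_start
    while low < mz_end:
        # The last window is clipped at the upper end of the mass range.
        windows.append((low, min(low + width, mz_end)))
        low += step
    return windows


def cycle_time(n_windows, ms2_scan_s, ms1_scan_s):
    """Approximate duration of one acquisition cycle: one MS1 scan
    followed by one MS2 scan per isolation window."""
    return ms1_scan_s + n_windows * ms2_scan_s


# Illustrative values only (assumptions, not the acquisition parameters).
narrow = fixed_windows(400, 1000, width=8)
wide = fixed_windows(400, 1000, width=15)
for name, wins in (("narrow", narrow), ("wide", wide)):
    print(name, len(wins), "windows,",
          round(cycle_time(len(wins), ms2_scan_s=0.05, ms1_scan_s=0.1), 2), "s per cycle")
```

Under these assumed values, halving the window width roughly doubles the number of MS2 scans per cycle, which is why narrow windows cost chromatographic sampling frequency.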
For each DIA scheme (Narrow, Wide, Mixed and Overlapped), three injections (analytical replicates) of the 8 samples were performed. We therefore provide 4 datasets of 24 raw files in Thermo .raw file format and in converted .mzML or .mzXML formats.
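The experimental design can be enumerated programmatically, for instance to build a sample annotation table. The sketch below uses the scheme names given above and the 8 UPS1 concentrations listed in the sample preparation section; the dictionary keys are hypothetical and do not reflect the file naming used in the repository.

```python
from itertools import product

SCHEMES = ["Narrow", "Wide", "Mixed", "Overlapped"]
# UPS1 amount in fmol per microgram of E. coli proteins.
UPS1_FMOL_PER_UG = [50, 25, 10, 5, 2.5, 1, 0.25, 0.1]
REPLICATES = [1, 2, 3]


def build_design():
    """Return one record per DIA injection (scheme x concentration x replicate)."""
    return [
        {"scheme": scheme, "ups1_fmol_per_ug": conc, "replicate": rep}
        for scheme, conc, rep in product(SCHEMES, UPS1_FMOL_PER_UG, REPLICATES)
    ]


design = build_design()
print(len(design), "injections in total")
for scheme in SCHEMES:
    n = sum(1 for row in design if row["scheme"] == scheme)
    print(scheme, n, "raw files")
```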
Six public or proprietary software tools were used to process the raw files (Table 2). Some of them required a DDA (Data-Dependent Acquisition) spectral library (Library mode: Skyline and OpenSWATH), one only required a FASTA file (FASTA mode: DIA-Umpire) and the others can work in both modes (Spectronaut, DIA-NN, ScaffoldDIA). For the Library processing mode, we used (and provide here) DDA .raw files acquired on the same instrument from 48 peptide fractions of an E. coli protein extract and 1 unfractionated sample containing a protein digest of the E. coli background spiked with UPS1 proteins. Two spectral libraries were generated from these 49 DDA .raw files and are also provided as .blib, .tsv and .csv files. The .FASTA file (containing the E. coli proteome and the UPS1 protein sequences) used for the FASTA mode is given as well. Finally, the 96 DIA .raw files were processed with the 6 software tools using either these libraries or the FASTA file. The corresponding software outputs are provided as viewer files that can be reopened in the corresponding software tool and as untreated .txt export tables. We also provide .txt precursor quantification tables in which the data are validated for protein and peptide identification and normalized, and in which the missing values are imputed with a noise value, as described in Table 3.
Fig. 2 and Supplementary Table 1 give the number of proteins identified in each pipeline. The Venn diagrams (Fig. 2, left panels) show the number of E. coli protein identifications and their overlap between the different software tools. Depending on the pipeline, between 1292 and 2373 E. coli proteins were identified. The right panels show how many of the 48 UPS1 proteins were identified in each sample. This number decreases as the concentration of UPS1 spiked into the E. coli background decreases.
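The overlap summarized by such Venn diagrams can be recomputed from the identification lists with plain set operations. The sketch below is a minimal illustration; the tool names and accessions in the usage example are placeholders, not values taken from the dataset, and `identification_overlap` is a hypothetical helper.

```python
def identification_overlap(ids_by_tool):
    """Summarize protein identifications across tools.

    ids_by_tool maps a tool name to the set of protein accessions it
    identified. Returns the proteins shared by all tools, the union of
    all identifications, and the proteins unique to each tool.
    """
    all_sets = list(ids_by_tool.values())
    shared = set.intersection(*all_sets) if all_sets else set()
    union = set().union(*all_sets)
    unique = {}
    for tool, ids in ids_by_tool.items():
        others = set().union(
            *(other for name, other in ids_by_tool.items() if name != tool)
        )
        unique[tool] = ids - others
    return {"shared_by_all": shared, "union": union, "unique": unique}


# Placeholder identifiers for illustration only.
example = {
    "ToolA": {"P1", "P2", "P3"},
    "ToolB": {"P2", "P3", "P4"},
    "ToolC": {"P3", "P5"},
}
summary = identification_overlap(example)
print(len(summary["shared_by_all"]), "proteins shared by all tools")
print({tool: len(ids) for tool, ids in summary["unique"].items()})
```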

Preparation of the proteomic standard
The E. coli protein extract was obtained from a broth culture of Escherichia coli (strain #CCRI-12923, CCRI, Québec, Canada) grown in Brain Heart Infusion (BHI) medium to 8 × 10^8 cfu/mL. The culture was centrifuged at 10,000 × g for 15 min and the pellet was stored at -20 °C. Proteins were extracted by resuspending the pellet in extraction buffer (50 mM ammonium bicarbonate, 1% sodium deoxycholate and 20 mM 1,4-dithiothreitol), heating for 10 min at 95 °C and sonicating for 15 min with 30 s/30 s ON/OFF cycles at high intensity (Bioruptor, Diagenode). The lysed cells were then centrifuged at 13,000 × g for 10 min to remove debris, and the protein concentration in the supernatant was determined by Bradford assay. The concentration was then adjusted to 0.1 μg/μL in extraction buffer.
A vial of Universal Proteomic Standard-1 (UPS1, Sigma) containing 48 human proteins (5 pmol each) was serially diluted with the E. coli protein extract to obtain 8 concentrations of UPS1 per microgram of E. coli proteins (50, 25, 10, 5, 2.5, 1, 0.25 and 0.1 fmol/μg). Cysteines were reduced by heating the samples for 30 min at 37 °C and alkylated by adding 50 mM iodoacetamide and incubating for 30 min. The pH was then adjusted to 8.0, trypsin (Promega) was added at a ratio of 1:50 (enzyme:protein) and the samples were incubated at 37 °C. The reaction was stopped by acidification to pH 2.0 with formic acid. The samples were then centrifuged at 16,000 × g for 5 min. The peptides contained in the supernatants were purified on 10 mg Oasis HLB cartridges (Waters) and vacuum dried.
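Because the UPS1 amounts are known and the E. coli background is constant, the expected fold change of any UPS1 protein between two samples follows directly from the spiked concentrations, while E. coli proteins are expected to show a log2 ratio of 0. A minimal sketch of this ground truth, with a hypothetical helper name, could look as follows.

```python
import math
from itertools import combinations

# UPS1 amount in fmol per microgram of E. coli proteins, as listed above.
UPS1_FMOL_PER_UG = [50, 25, 10, 5, 2.5, 1, 0.25, 0.1]


def expected_log2_ratios(concentrations):
    """Return {(high, low): log2(high / low)} for every pair of samples."""
    ordered = sorted(concentrations, reverse=True)
    return {
        (high, low): math.log2(high / low)
        for high, low in combinations(ordered, 2)
    }


ratios = expected_log2_ratios(UPS1_FMOL_PER_UG)
print(len(ratios), "pairwise comparisons")
for (high, low), value in list(ratios.items())[:3]:
    print(f"{high} vs {low} fmol/ug: expected log2 ratio {value:.2f}")
```

Comparing these expected values with the ratios measured by a given pipeline is one way to quantify its accuracy for the spiked proteins.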

Mass spectrometry
The samples were resuspended in 2% acetonitrile, 0.05% TFA and, for each one, an equivalent of 1 μg of peptides was analyzed by LC-MS/MS on a U3000 NanoRSLC liquid chromatography system (ThermoScientific, Dionex Softron GmbH, Germering, Germany) in line with an Orbitrap Fusion Tribrid-ETD mass spectrometer (ThermoScientific, San Jose, CA, USA). Peptides were concentrated at 20 μL/min (loading solvent: 2% acetonitrile/0.05% trifluoroacetic acid) on a 300 μm i.d. × 5 mm, C18 PepMap100, 5 μm, 100 Å precolumn cartridge (Thermo Fisher Scientific) for 5 min. Then, the separation was performed on a PepMap100 RSLC, C18, 3 μm, 100 Å, 75 μm i.d.
Table 3. Post-processing of precursor tables. For each software tool, the information on outlier removal, normalization and missing value imputation is given along with the criteria used to consider a precursor as identified and quantifiable. The steps were performed in R in the same order as shown in the table, from top to bottom.

Generation of spectral libraries
250 μg of E. coli protein extract, prepared and digested as described above, was fractionated at high pH on an Agilent Extend C18 (1.0 mm × 150 mm, 3.5 μm) column using an Agilent 1200 Series HPLC system. Peptides were separated at 1 mL/min with a gradient of 5-35% solvent B over 60 min followed by 35-70% solvent B over 24 min (A: 10 mM ammonium bicarbonate, pH 10; B: 90% acetonitrile/10% ammonium bicarbonate, pH 10). 48 fractions were collected. A sample of 200 fmol UPS1 per μg of E. coli extract was also prepared as described for the proteomic standard in order to complete the spectral library with the UPS1 human proteins.
These 49 samples were analyzed on the same instrument and with the same chromatographic conditions as for the proteomic standard, but the mass spectrometer was operated in Data-Dependent Acquisition (DDA) mode with the following settings:

DIA file processing
The DIA files were then processed with 6 different software tools, either in FASTA mode using a single E. coli and UPS1 FASTA file (UniProt Reference Proteome, Taxonomy 83333, Proteome ID UP000000625, 4312 entries, 2016.03.15, plus the 48 UPS1 sequences) or in Library mode with one of the Skyline or Spectronaut spectral libraries, as described in Table 2. The tools were used as recommended in the user manual or by the software developers, and detailed parameter settings can be found in supplementary table S1 of the Gotti et al. article [1].
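Before running a FASTA-mode search, it can be useful to check that the sequence file contains the expected number of background and spiked-in entries (4312 and 48 according to the description above). The sketch below is a minimal FASTA reader; it assumes that UPS1 entries can be recognized by a marker string in their accession, with `ups` used here as a hypothetical default that should be checked against the headers of the provided file, since substring matching is a crude criterion.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_by_origin(lines, ups_marker="ups"):
    """Count entries whose accession contains `ups_marker` (assumed
    convention for UPS1 proteins) versus all other (background) entries."""
    counts = {"UPS1": 0, "background": 0}
    for header, _sequence in read_fasta(lines):
        parts = header.split()
        accession = parts[0].lower() if parts else ""
        counts["UPS1" if ups_marker in accession else "background"] += 1
    return counts


# Toy records with made-up sequences, for illustration only.
toy = [
    ">sp|P0A7G6|RECA_ECOLI Protein RecA",
    "MAIDENK",
    "QKALAA",
    ">P00915ups|CAH1_HUMAN_UPS Carbonic anhydrase 1",
    "ASPDWGY",
]
print(count_by_origin(toy))
```

For a real file, the same function can be called on an open file handle, for example `count_by_origin(open(path))`, where `path` points to the provided FASTA file.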

Data post-processing
All data post-processing was performed in R [9], starting from the precursor tables exported from each software tool. Table 3 shows the treatment applied to the data from each tool as well as the criteria used to consider a precursor as identified and quantifiable.
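To illustrate the kind of steps involved, the sketch below applies a generic normalization and noise-value imputation to a small precursor table. It is not the authors' R code and does not reproduce the tool-specific rules of Table 3: the median-centering in log2 space, the choice of a low quantile as noise value, and all function names are assumptions made for illustration.

```python
import math
import statistics


def log2_median_normalize(table):
    """table maps sample -> {precursor: intensity or None}.

    Intensities are log2-transformed (missing or non-positive values
    become None) and each sample is shifted so that its median equals
    the median of all sample medians.
    """
    logged = {
        sample: {
            prec: (math.log2(v) if v is not None and v > 0 else None)
            for prec, v in values.items()
        }
        for sample, values in table.items()
    }
    medians = {
        sample: statistics.median(v for v in values.values() if v is not None)
        for sample, values in logged.items()
    }
    target = statistics.median(medians.values())
    return {
        sample: {
            prec: (v - medians[sample] + target if v is not None else None)
            for prec, v in values.items()
        }
        for sample, values in logged.items()
    }


def low_quantile(table, q=0.01):
    """Return a low quantile of all observed values, used here as an
    assumed definition of the noise value."""
    observed = sorted(
        v for values in table.values() for v in values.values() if v is not None
    )
    return observed[int(q * (len(observed) - 1))]


def impute_noise(table, noise_value):
    """Replace missing values (None) with a fixed noise value."""
    return {
        sample: {prec: (noise_value if v is None else v) for prec, v in values.items()}
        for sample, values in table.items()
    }


# Toy intensities for illustration only.
raw = {
    "sample_A": {"prec1": 4.0, "prec2": 16.0, "prec3": None},
    "sample_B": {"prec1": 16.0, "prec2": 64.0, "prec3": 4.0},
}
normalized = log2_median_normalize(raw)
completed = impute_noise(normalized, low_quantile(normalized))
print(completed)
```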

Ethics Statement
This work does not involve human subjects, animal experiments or data collected from social media platforms.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Supplementary Materials
Supplementary material associated with this article can be found in the online version at doi: 10.1016/j.dib.2022.107829.