Optimization of Experimental Parameters in Data-Independent Mass Spectrometry Significantly Increases Depth and Reproducibility of Results*

Comprehensive, reproducible and precise analysis of large sample cohorts is one of the key objectives of quantitative proteomics. Here, we present an implementation of data-independent acquisition (DIA) on a quadrupole ultra-high field Orbitrap mass spectrometer that uses the parallel nature of DIA to surpass the limitation imposed by the serial MS2 acquisition of data-dependent acquisition (DDA). In deep single-shot DIA, we identified and quantified 6,383 proteins in human cell lines using 2 or more peptides per protein, and over 7,100 proteins when including the 717 proteins that were identified on the basis of a single peptide sequence. In mouse tissues, 7,739 proteins were identified using 2 or more peptides per protein, and 8,121 when including the 382 proteins that were identified on the basis of a single peptide sequence. Missing values for proteins ranged from 0.3 to 2.1%, and median coefficients of variation ranged from 4.7 to 6.2% among technical triplicates. In very complex mixtures, we could quantify 10,780 proteins, and 12,192 proteins when including the 1,412 proteins that were identified on the basis of a single peptide sequence. Using this optimized DIA, we investigated large protein networks before and after the critical period for whisker experience-induced synaptic strength in the murine somatosensory cortex 1-barrel field. This work shows that parallel mass spectrometry enables proteome profiling for discovery with high coverage, reproducibility, precision and scalability.


Supplementary information:
1.1 Individual improvements in DIA leading to a leap in performance:
• The MS1 precursor scan resolution was increased from 35,000 (Bruderer et al. MCP 2015) to 120,000 in this study. This increased the identifications by 8% for 2h DIA runs (Figure 1d).
• To increase the dynamic range of the MS1 precursor scans, the m/z space was split into two segments. This increased the identifications by 4% for 4h DIA runs (Figure 3, compared to 4h DIA with only one MS1 segment).
• In this study, high-resolution chromatography was used in addition to an improved gradient shape, resulting in more uniform peak widths at higher acetonitrile concentrations (Suppl. Figure 1). The use of Reprosil Pur 1.9 µm particles instead of ProntoSil 3.5 µm particles increased the identifications by 10% for 2h DIA runs of HeLa on the Q Exactive HF using the HeLa spectral library of this manuscript.
• Increasing the sample loading from 1 µg (as in Bruderer et al. MCP 2015) to 2 µg increased the identifications by 4% in 2h DIA runs of HeLa using the HeLa spectral library of this manuscript.

Precursor FDR (q-value) estimation in Spectronaut for DIA
Spectronaut uses the mProphet algorithm for FDR estimation, which was initially published for MRM data (1). The mProphet algorithm has been shown to be readily applicable to the targeted analysis of DIA/SWATH data because the data structure is identical to that of MRM (2,3). In brief, Spectronaut analyzes any DIA data not only with the peptide precursors specified in the spectral library ("targets") but also with additional decoy peptide precursors generated on the fly ("decoys"). By default, the number of decoys in Spectronaut is the maximum of 1,000 and 10% of the spectral library size (strategy and number can be adjusted in the Spectronaut settings schema). Due to the large size of the spectral libraries used in this study, the number of decoys was in the range of 10,000 to 20,000. Extracted ion currents for the precursor isotopic envelope are generated on MS1 for all targets and decoys. Extracted ion currents for the fragment ions are generated on MS2 for all targets and decoys. Extracted ion currents are generated in the retention time window where peptide elution is expected, as determined by the iRT calibration (described in detail in (4)). For every target and decoy, candidate peak groups are determined and a discriminant score is computed. For every target and decoy, only the candidate peak group with the highest discriminant score is considered further.
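The default decoy count rule stated above can be illustrated with a short sketch. The function name and parameter names are hypothetical and are not part of Spectronaut; only the rule itself (maximum of 1,000 and 10% of the library size) is taken from the text.

```python
def default_decoy_count(library_size, minimum=1000, fraction=0.10):
    """Illustrative helper (hypothetical name): number of decoys as the
    maximum of a fixed minimum and a fraction of the spectral library size."""
    return max(minimum, round(fraction * library_size))
```

Under this rule, libraries of 100,000 to 200,000 precursors would give 10,000 to 20,000 decoys, which matches the range reported for this study.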
The classical assumption for FDR estimation in DDA using the target/decoy approach is: "False peptide-spectrum matches will uniformly distribute with respect to the target and decoy database." This assumption is not valid in the targeted analysis of DIA, i.e. false identifications do not distribute uniformly with respect to targets and decoys. Hence, the FDR is estimated with the q-value approach as described in the mProphet paper (5). In brief, the discriminant scores (Cscores) are converted to p-values based on a kernel density fit to the Cscore distribution of the decoys, and q-values are estimated from the p-values of the targets with the Storey method using a lambda of 0.5.
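The two steps just described (decoy-based p-values, then Storey q-values) can be sketched as follows. This is a minimal illustration and not Spectronaut's implementation: the Gaussian kernel, the Silverman rule-of-thumb bandwidth and the function names are assumptions made here; only the overall procedure and the lambda of 0.5 come from the text.

```python
import math
import statistics


def decoy_pvalues(target_scores, decoy_scores):
    """p-value of each target score: P(decoy score >= s) under a Gaussian
    kernel density fit to the decoy scores (kernel choice is an assumption)."""
    n = len(decoy_scores)
    # Silverman's rule-of-thumb bandwidth (an assumption, not from the text).
    h = 1.06 * statistics.stdev(decoy_scores) * n ** (-0.2)

    def upper_tail(s):
        # Each kernel contributes its Gaussian survival function at s.
        return sum(
            0.5 * math.erfc((s - d) / (h * math.sqrt(2.0))) for d in decoy_scores
        ) / n

    return [upper_tail(s) for s in target_scores]


def storey_qvalues(pvalues, lam=0.5):
    """Storey q-values with a fixed lambda."""
    m = len(pvalues)
    # pi0: estimated fraction of true nulls, from the p-values above lambda.
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        qvalues[i] = running_min
    return qvalues
```

Targets would then be accepted at, for example, `q <= 0.01` for a 1% precursor FDR.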
The decoy Cscore distribution should be representative of false identifications to achieve an accurate FDR estimation (1). We used the "mutated decoy strategy" for decoy generation for all experiments in this manuscript (available as of Spectronaut 11.0.15038.9 in the settings).
The mutated decoy strategy works as follows: Pick N random peptide precursors from the spectral library as templates, where N is the number of decoys to generate. Exchange the C-terminal amino acid from K to R, from R to K, and from any other amino acid to a random amino acid (an exchange of I to L and vice versa is avoided). Recalculate the precursor m/z based on this new sequence. Take the original sequence and randomly exchange a variable number of amino acids in the sequence (the last amino acid stays the same). Recalculate the fragment ions based on this new sequence. The iRT and relative fragment ion intensities are kept identical to the template. For a more detailed description see the pseudo code below. The mutated decoy model was validated using a two-organism spectral library approach (see below). It was also found to be slightly less optimistic than the scrambled decoy model described in (1) (see Suppl. Figure 1).

in DIA with slight adaptations. In brief, a unified scoring scale across all runs is established. Since machine learning is performed on a run-by-run basis, the Cscore, which is the primary score used for precursor FDR calculation, cannot be directly compared between runs within one experiment. In order to normalize the Cscores across runs, a robust normalization is applied. Since the distribution of decoy scores behaves very similarly across different runs, it can be used as a means of normalization, where DecoyCScores is the set of all Cscores associated with decoy precursors of the respective run and IQR is the interquartile range.
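A robust per-run normalization of this kind can be sketched as below. The text names only the decoy score set and its interquartile range; centring on the decoy median is an assumption made here, as are the function name and the quantile method, so this should be read as one plausible form rather than the exact formula used.

```python
import statistics


def normalize_cscores(cscores, decoy_cscores):
    """Hypothetical robust normalization of one run's Cscores:
    subtract the decoy median (assumed) and divide by the decoy IQR."""
    q1, _, q3 = statistics.quantiles(decoy_cscores, n=4, method="inclusive")
    iqr = q3 - q1
    center = statistics.median(decoy_cscores)
    return [(score - center) / iqr for score in cscores]
```

With such a scaling, two runs whose raw Cscores differ only by an offset and a scale factor map onto the same normalized scale, which is what makes the resulting scores comparable across runs.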
In contrast to using the protein grouping of the spectral library, Spectronaut typically performs its own protein inference on the set of identified peptides (typically on all precursors with an FDR of 1%) using the IDPicker algorithm (Zhang et al. 2007). This is a small difference from the approach published by Rosenberger et al. and has the following subtle effect: certain protein groups are stripped of all of their peptides. These protein groups are no longer represented and are removed. If Spectronaut is not asked to perform protein inference on the set of identified peptides, this step is not performed. The number of resulting protein groups is typically smaller if Spectronaut protein inference is performed. Hence, in this manuscript the more conservative approach of the two was used throughout.
For protein group FDR estimation, scores for both target and decoy protein groups are needed. A target protein group score is defined as the best nCscore across all runs associated with this protein group. A decoy protein group is generated for each target protein group. A decoy protein group is a randomly assembled set of decoy precursors. The number of decoy precursors assigned to a decoy protein group corresponds to the total precursor count of its associated target protein group. Similar to a target protein group, the best nCscore of each decoy precursor across all runs represents the decoy protein group score. Given a set of target and decoy protein groups and their scores, the Storey method (Storey et al. 2003) can be used to determine a q-value for each target protein group as described by Rosenberger et al. 2017 or above for the precursor FDR.
We validated protein group FDR as described below.
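The construction of target and decoy protein group scores described above can be sketched as follows. The data layout, the function name and the sampling details (without replacement within a group, with reuse of decoy precursors across groups) are assumptions for illustration only.

```python
import random


def protein_group_scores(target_groups, decoy_precursor_scores, seed=0):
    """target_groups maps a protein group name to the best-across-runs
    nCscores of its precursors; decoy_precursor_scores is the pool of
    best-across-runs nCscores of all decoy precursors (hypothetical layout)."""
    rng = random.Random(seed)
    target_scores, decoy_scores = {}, {}
    for name, precursor_scores in target_groups.items():
        # Target group score: best nCscore among its precursors.
        target_scores[name] = max(precursor_scores)
        # Decoy group: random decoy precursors, same count as the target group.
        sampled = rng.sample(decoy_precursor_scores, len(precursor_scores))
        decoy_scores[name] = max(sampled)
    return target_scores, decoy_scores
```

The two resulting score sets can then be passed to a p-value and q-value estimation in the same way as for precursors.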

Precursor FDR (q-value) validation
In order to validate the precursor FDR, a two-organism spectral library approach was used. Two spectral libraries derived from the human HeLa cell line and from maize were prepared in exactly the same manner and applied to a DIA measurement of HeLa. The number of maize precursor identifications can be used to cross-validate the estimated FDR. A report with the following columns was generated: PG.FastaFiles and EG.Qvalue, without report filters (decoys not exported). The PG.FastaFiles column was used to count the number of maize identifications and hence to determine an "actual" FDR for a given estimated q-value (Suppl. Figure 1).
In general, a repeated analysis using Spectronaut with unchanged settings and raw data will result in identical results. For this specific analysis here, we performed a 250-fold bootstrap analysis in order to have a more accurate cross validation. This means we repeated the analysis 250 times with various random seeds.
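The counting step of this two-organism validation can be sketched as below. The text does not give the exact formula for the "actual" FDR, so the simplest definition is assumed here: the fraction of accepted identifications that map to maize. The function name, the row layout and the convention that maize entries carry "maize" in the FASTA file name are likewise assumptions.

```python
def actual_fdr(rows, q_cutoff):
    """rows: iterable of (fasta_file, q_value) pairs, e.g. taken from the
    PG.FastaFiles and EG.Qvalue report columns (layout assumed here)."""
    accepted = [fasta for fasta, q in rows if q <= q_cutoff]
    if not accepted:
        return 0.0
    # Assumption: maize entries are recognisable by their FASTA file name.
    maize = sum("maize" in fasta.lower() for fasta in accepted)
    return maize / len(accepted)
```

Evaluating this function over a range of cutoffs gives the kind of curve that can be compared against the estimated q-values.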

Protein group FDR (q-value) validation
The base spectral libraries derived from the human HeLa cell line and maize, as described under "Precursor FDR (q-value) validation", were further filtered for protein group FDR validation as follows: using in-house software, the spectral libraries were filtered to remove proteins that share any peptides between the two species. Further, all proteins with fewer than three precursors were removed. Proteins were sorted by average q-value and the top 1,000 protein groups in each library were retained.
Both spectral libraries were applied to a technical triplicate of 2h DIA runs of a HeLa sample acquired on a Q Exactive HF using Spectronaut with standard settings. A report with following columns was generated: to count the number of maize protein identifications for a certain cutoff and hence to determine an "actual" FDR for a given estimated q-value (Suppl. Figure 1).
In general, a repeated analysis using Spectronaut with unchanged settings and raw data will result in identical results. For this specific analysis here, we performed a 250-fold bootstrap analysis in order to have a more accurate cross validation. This means we repeated the analysis 250 times with various random seeds.

Calculation of the identified "explained" TIC of Suppl figure 10:
First, each scan was centroided. Next, all peak intensities were summed to obtain the total ion current (TIC). Then feature detection was performed (minimally 2 peaks: the monoisotopic peak and the +1 isotope). Then, for each scan, all precursors were obtained that elute in that scan (apex RT +/- FWHM * 2.55), fall within the precursor selection window and are identified (precursor and protein FDR). All library fragments for the relevant assays were used (all fragments detected in the DDA runs used for the spectral library) and each fragment was matched to a detected feature (it must fall within the m/z tolerance from mass calibration and match the charge of the detected feature). Next, all features that matched any library fragment were collected and all associated peak intensities were summed, resulting in the explained ion current. For the control shift analysis, the retention times of the identified peaks were shifted to earlier retention time by three times the peak width used for the explained TIC, and the intensities were obtained in the same way.
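The core of this calculation can be sketched in simplified form. The sketch below skips feature detection and charge matching and works directly on centroided peaks; the function names, the ppm tolerance and the reading of the elution window as apex RT +/- (2.55 x FWHM) are assumptions for illustration.

```python
def elutes_in_scan(scan_rt, apex_rt, fwhm, factor=2.55):
    """True if the scan lies within apex RT +/- factor * FWHM
    (this reading of the window is an assumption)."""
    return abs(scan_rt - apex_rt) <= factor * fwhm


def explained_ion_current(peaks, fragment_mzs, tol_ppm=10.0):
    """peaks: list of (m/z, intensity) centroids of one scan.
    fragment_mzs: library fragment m/z values of the precursors that
    elute in this scan. Returns (total, explained) ion current."""
    total = sum(intensity for _, intensity in peaks)
    explained = sum(
        intensity
        for mz, intensity in peaks
        # A peak counts as explained if any library fragment matches it.
        if any(abs(mz - f) <= f * tol_ppm * 1e-6 for f in fragment_mzs)
    )
    return total, explained
```

For the control shift analysis, the same functions could be applied after subtracting three peak widths from each apex retention time.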

Supplementary references:
Optimization acquisition with the sample containing peptides from lysates from four organisms were calculated according to their origin (shared peptides were counted for only one organism, in this order: first E. coli, then S. cerevisiae, followed by C. elegans).
(c) Significantly differentially abundant C. elegans proteins (5% FDR by q-value) were calculated using a one-sample t-test analysis.
(d) Significantly differentially abundant S. cerevisiae proteins (5% FDR by q-value) were calculated using a one-sample t-test analysis.
(e) Significantly differentially abundant E. coli proteins (5% FDR by q-value) were calculated using a one-sample t-test analysis.

Cerebellum DIA when using the project-specific spectral library. The precursor ions were annotated with the respective isotope envelope. The complete spectral library fragment information from the DDA runs (all identified fragments, not only the optimized 6) was used; additionally, the fragment space was trimmed to >250 m/z, since no fragments below 250 m/z were allowed in the spectral library. Note that the minimal peptide length for the spectral library DDA runs was set to 7 and a limited set of modifications was used. The analysis was performed as described above.
(b) A random DIA MS1 and corresponding MS2 spectrum was annotated by identified peptides and fragments of the HEK-293 DIA when using the project-specific spectral library. The precursor ions were annotated with the respective isotope envelope. The analysis was performed as described above.
(c) TIC and identified TIC were calculated for the mouse Cerebellum DIA on MS1 and MS2 level. Additionally, as a control, the identifications were shifted by 3 peak widths to earlier retention time. The analysis was performed as described above.
(d) TIC and identified TIC were calculated for the HEK-293 DIA on MS1 and MS2 level. The control and analysis were performed as described above.