Improved Normalization of Systematic Biases Affecting Ion Current Measurements in Label-free Proteomics Data*

Normalization is an important step in the analysis of quantitative proteomics data. If this step is ignored, systematic biases can lead to incorrect assumptions about regulation. Most statistical procedures for normalizing proteomics data have been borrowed from genomics where their development has focused on the removal of so-called ‘batch effects.’ In general, a typical normalization step in proteomics works under the assumption that most peptides/proteins do not change; scaling is then used to give a median log-ratio of 0. The focus of this work was to identify other factors, derived from knowledge of the variables in proteomics, which might be used to improve normalization. Here we have examined the multi-laboratory data sets from Phase I of the NCI's CPTAC program. Surprisingly, the most important bias variables affecting peptide intensities within labs were retention time and charge state. The magnitude of these observations was exaggerated in samples of unequal concentrations or “spike-in” levels, presumably because the average precursor charge for peptides with higher charge state potentials is lower at higher relative sample concentrations. These effects are consistent with reduced protonation during electrospray and demonstrate that the physical properties of the peptides themselves can serve as good reporters of systematic biases. Between labs, retention time, precursor m/z, and peptide length were most commonly the top-ranked bias variables, over the standardly used average intensity (A). A larger set of variables was then used to develop a stepwise normalization procedure. This statistical model was found to perform as well or better on the CPTAC mock biomarker data than other commonly used methods. Furthermore, the method described here does not require a priori knowledge of the systematic biases in a given data set. These improvements can be attributed to the inclusion of variables other than average intensity during normalization.

Normalization is an important step in the analysis of quantitative proteomics data. If this step is ignored, systematic biases can lead to incorrect assumptions about regulation. Most statistical procedures for normalizing proteomics data have been borrowed from genomics where their development has focused on the removal of socalled 'batch effects.' In general, a typical normalization step in proteomics works under the assumption that most peptides/proteins do not change; scaling is then used to give a median log-ratio of 0. The focus of this work was to identify other factors, derived from knowledge of the variables in proteomics, which might be used to improve normalization. Here we have examined the multi-laboratory data sets from Phase I of the NCI's CPTAC program. Surprisingly, the most important bias variables affecting peptide intensities within labs were retention time and charge state. The magnitude of these observations was exaggerated in samples of unequal concentrations or "spike-in" levels, presumably because the average precursor charge for peptides with higher charge state potentials is lower at higher relative sample concentrations. These effects are consistent with reduced protonation during electrospray and demonstrate that the physical properties of the peptides themselves can serve as good reporters of systematic biases. Between labs, retention time, precursor m/z, and peptide length were most commonly the top-ranked bias variables, over the standardly used average intensity (A). A larger set of variables was then used to develop a stepwise normalization procedure. This statistical model was found to perform as well or better on the CPTAC mock biomarker data than other commonly used methods. Furthermore, the method described here does not require a priori knowledge of the systematic biases in a given data set. These improvements can be attributed to the inclusion of variables other than average intensity during normalization. The number of laboratories using MS as a quantitative tool for protein profiling continues to grow, propelling the field forward past simple qualitative measurements (i.e. cataloging), with the aim of establishing itself as a robust method for detecting proteomic differences. By analogy, semiquantitative proteomic profiling by MS can be compared with measurement of relative gene expression by genomics technologies such as microarrays or, newer, RNAseq measurements. While proteomics is disadvantaged by the lack of a molecular amplification system for proteins, successful reports from discovery experiments are numerous in the literature and are increasing with advances in instrument resolution and sensitivity.
In general, methods for performing relative quantitation can be broadly divided into two categories: those employing labels (e.g. iTRAQ, TMT, and SILAC (1)) and so-called "label-free" techniques. Labeling methods involve adding some form of isobaric or isotopic label(s) to the proteins or peptides prior to liquid chromatography-tandem MS (LC-MS/MS) analysis. Chemical labels are typically applied during sample processing, and isotopic labels are commonly added during cell culture (i.e. metabolic labeling). One advantage of label-based methods is that the two (or more) differently-labeled samples can be mixed and run in single LC-MS analyses. This is in contrast to label-free methods which require the samples to be run independently and the data aligned post-acquisition.
Many labs employ label-free methods because they are applicable to a wider range of samples and require fewer sample processing steps. Moreover, data from qualitative experiments can sometimes be re-analyzed using label-free software tools to provide semiquantitative data. Advances in these software tools have been extensively reviewed (2). While analysis of label-based data primarily uses full MS scan (MS1) 1 or tandem MS scan (MS2) ion current measurements, analysis of label-free data can employ simple counts of confidently identified tandem mass spectra (3). So-called spectral counting makes the assumption that the number of times a peptide is identified is proportional to its concentration. These values are sometimes summed across all peptides for a given protein and scaled by protein length. Relative abundance can then be calculated for any peptide or protein of interest. While this approach may be easy to perform, its usefulness is particularly limited in smaller data sets and/or when counts are low.
This report focuses only on the use of ion current measurements in label-free data sets, specifically those calculated from extracted MS1 ion chromatograms (XICs). In general terms, raw intensity values (i.e. ion counts in arbitrary units) cannot be used for quantitation in the absence of cognate internal standards because individual ion intensities depend on a response factor, related to the chemical properties of the molecule. Intensities are instead almost always reserved for relative determinations. Furthermore, retention times are sometimes used to align the chromatograms between runs to ensure higher confidence prior to calculating relative intensities. This step is crucial for methods without corresponding identity information, particularly for experiments performed on low-resolution instruments. To support a label-free workflow, peptide identifications are commonly made from tandem mass spectra (MS/MS) acquired along with direct electrospray signal (MS1). Or, in alternative workflows seeking deeper coverage, interesting MS1 components can be targeted for identification by MS/MS in follow-up runs (4).
"Rolling up" the peptide ion information to the peptide and protein level is also done in different ways in different labs. In most cases, "peptide intensity" or "peptide abundance" is the summed or averaged value of the identified peptide ions. How the peptide information is transferred to the protein level differs between methods but typically involves summing one or more peptide intensities, following parsimony analysis. One such solution is the "Top 3" method developed by Silva and co-workers (5).
Because peptides in label-free methods lack labeled analogs and require separate runs, they are more susceptible to analytical noise and systematic variations. Sources of these obscuring variations can come from many sources, including sample preparation, operator error, chromatography, electrospray, and even from the data analysis itself. While analytical noise (e.g. chemical interference) is difficult to selectively reject, systematic biases can often be removed by statistical preprocessing. The goal of these procedures is to normalize the data prior to calculations of relative abundance. Failure to resolve these issues is the common origin of batch effects, previously described for genomics data, which can severely limit meaningful interpretation of experimental data (6,7).
These effects have also been recently explored in proteomics data (8). Methods used to normalize proteomics data have been largely borrowed from the microarray community, or are based on a simple mean/median intensity ratio correction. Methods applied on microarray and/or gene chip and used on proteomics data include scaling, linear regression, nonlinear regression, and quantile normalizations (9). Moreover, work has also been done to improve normalization by subselecting a peptide basis (10). Other work suggests that linear regression, followed by run order analysis, works better than other methods tested (11). Key to this last method is the incorporation of a variable other than intensity during normalization. It is also important to note that little work has been done towards identifying the underlying sources of these variations in proteomics data. Although cause-and-effect is often difficult to determine, understanding these relationships will undoubtedly help remove and avoid the major underlying sources of systematic variations.
In this report, we have attempted to combine our efforts focused on understanding variability with the work initiated by others for normalizing ion current-based label-free proteomics data. We have identified several major variables commonly affecting peptide ion intensities both within and between labs. As test data, we used a subset of raw data acquired during Phase I of the National Cancer Institute's (NCI) Clinical Proteomics Technology Assessment for Cancer (CPTAC) program. With these data, we were able to develop a statistical model to rank bias variables and normalize the intensities using stepwise, semiparametric regression. The data analysis methods have been implemented within the National Institute of Standards and Technology (NIST) MS quality control (MSQC) pipeline. Finally, we have developed R code for removing systematic biases and have tested it using a reference standard spiked into a complex biological matrix (i.e. yeast cell lysate).

EXPERIMENTAL PROCEDURES
Data Sets-All of the CPTAC Phase I data sets used in this work are publicly available at https://cptac-data-portal.georgetown.edu/ cptacPublic/. This site provides direct links to download the study data. Hereafter, data are referenced according to designated CPTAC Study number. Descriptions of the CPTAC yeast reference material, and SOPs used to acquire the data, can be found elsewhere (12). Briefly, the data for Study 6 were collected in the following order with a blank in between each study sample: (1) Sample 1B-NCI-20, (2) Sample 6-QC2 yeast only, (3) Sample 6A yeast ϩ UPS1 at 0.25 fmol/l, (4) Sample 6B, yeast ϩ UPS1 at 0.74 fmol/l, (5) Sample 6C, yeast ϩ UPS1 at 2.2 fmol/l, (6) Sample 6D, yeast ϩ UPS1 at 6.7 1 The abbreviations used are: MS1, full MS scan; MS2, tandem MS scan; MS/MS, tandem MS scan; NCI, National Cancer Institute; NIST, National Institute of Standards and Technology; NIST MSQC, NIST Mass Spectrometry Quality Control (Software); SOP, standard operating procedure; CPTAC, Clinical Proteomics Technology Assessment for Cancer; MD, mean of deviance; ROC, receiver operating characteristic; RT, retention time; PSM, peptide-spectrum match; SILAC, stable isotope labeling with amino acids in cell culture; GUI, graphical user interface. fmol/l, (7) Sample 6E, yeast ϩ UPS1 at 20 fmol/l, (8) Sample 6-QC1, UPS1 only at 20 fmol/l. All yeast samples (prior to spike-ins) contained trypsin-digested yeast lysate at 60 ng/l. A standard injection volume of 2 l was used for all samples. All platforms used the same prepacked columns and a nano-ESI source. The Study 8 data were acquired without the use of the CPTAC SOP, according to individual lab protocol. The yeast sample was run in two amounts, 120 ng and 600 ng, which are referred to hereafter as 'low' and 'high', respectively.
Data Processing-All calculations on MS data files presented in this work were made by the NIST MSQC pipeline (v. 1.2.0) (13). All of the intensity values were extracted directly from output .msqc files using a parser written in Perl (available by request). The major changes from the previously published version of the software include the following: (1) the replacement of ReAdW4Mascot2.exe with a more refined program for processing MS1 data (ProMS), (2) the addition of MSPepSearch (v. 0.9.0) as the default search engine, (3) the introduction of a 'full' mode for providing peptide/protein relative quantitation information in the report. Several new metrics were also added that were used in this report. These include a calculation for the median intensity deviation.
The NIST MSQC pipeline is driven by a Perl program, which controls component programs in the following order: (1) ReAdW4Mascot2 for extracting spectra and retention times from RAW files to mzXML and MGF, (2) a search engine (MSPepSearch, SpectraST or OMSSA), (3) ProMS (optional but recommended), (4) nistms_metrics for calculating metrics and statistics within and between series, (5) merge_pep_results for final formatting of the result files. All command-line arguments and options for running the pipeline can be found at http://peptide.nist. gov/metrics or by running the pipeline without arguments. The newest version of the pipeline also comes with a GUI and Windows TM installer. Additionally, MSFileReader from Thermo Fisher (http:// sjsupport.thermofinnigan.com/public/detail.asp?idϭ703) can be used, allowing users without an installation of XCalibur TM to operate the pipeline. As with earlier versions, the pipeline requires an installation of Perl and runs only on the Windows TM operating system. Thermo Fisher LTQ, LTQ-Orbitrap, LTQ-FT and QExactive TM data files can be analyzed. A version utilizing ProteoWizard libraries for use with other vendor data files is under development. And at the time of writing, the pipeline was compatible with Agilent QTOF files.
Normalization and Variable Ranking-Normalization is the process of removing systematic bias in order to make runs more comparable. In this work, the data are 'normalized' based on the assumption that, on average, the peptide ion intensities do not change between runs. 'Runs' in this work denote technical replicates, replicates of the same sample between laboratories, or replicates of similarly engineered samples with known differences (as in the case of CPTAC Study 6.) Although the CPTAC data do not represent true biological experiments, they do represent best-case-scenario and a starting point from which further development of the aforementioned computational methods can be initiated.
Ideally, log2 ratios of peptide ion intensities (M) between any two technical replicates should either equal a constant value of 0, or log2(x) if runs have a known x-fold concentration difference. M values should then distribute randomly around the reference level without any significantly discernible pattern. However, in practice, many factors cause systematic bias in experimental data acquisition. An example of this is pen location during production of oligonucleotide arrays. Often, this extraneous variability is confounded with the biological variability of interest in the sample. It is thus necessary to remove systematic bias before proceeding to any relative quantification.
A collection of normalization methods borrowed from microarray analysis has been tested on proteomics data (9). These methods include mean/median removing, linear regression, locally-weighted regression and quantile techniques. However, in all but one case, only abundance (the average of peptide intensity between the two technical replicates in comparison) is used for normalization. In the current work, it was observed that abundance is not the only source of systematic bias. Therefore a scheme was developed to include a set of identified bias variables for effective data normalization. A detailed description of model development follows.
Let M j ͑1͒ be the initial log ratio of intensities for peptide ion j in a given pair of runs, j ϭ 1, … , N and N is the number of detected peptide ions. The normalization procedure starts with modeling the log ratio of intensities by ͑ ⅐ ͒ is the regression function with variable x p . Both the simple linear and p-spline models are considered for f p ͑1͒ ͑ ⅐ ͒. The parameters ␤ 0,p ͑1͒ , ␤ 1,p ͑1͒ , and u k,p ͑1͒ 's are the regression coefficients in the model f p ͑1͒ ͑ ⅐ ͒. The knots ( k 's ͑k ϭ 1, K, K͒ are selected according to (14).
͑2͒ are then the normalized log ratios after adjusting for variable x p 's impacts.
Each variable is fitted with individual f p ͑1͒ ͑x p ͒ using the appropriate semiparametric (the p-spline) or simple linear regression models according to Equation 1. A deviance measure is designed to select the best normalization variable at each step. The goal is to select the variable that is most significantly related to systematic variation in the log ratio of intensities. The deviance for variable p at step 1 with where Dev p ͑1͒ is the deviance for variable p at step 1 and M j,p ͑2͒ is the residuals from the regression model. The best variable selected at step 1 is the variable p* minimizing Dev p

͑1͒
. The normalized log ratio of intensities is then defined as M j,p* ͑2͒ and the optimal deviance is Dev p*

͑1͒
. The above procedure is repeated P times (the number of variables included), with the selected variable at each step removed from the following steps. Generally, at step s (s ϭ 1, 2, L, P), the remaining variables x p 's are fit into The best variable selected at step s is the one minimizing the deviance The ranking of the variables is based on the order they were selected in the above iterative procedure, and their impacts are measured by the reduction in deviance ⌬Dev (s) ϭ Dev p ͑s͒ Ϫ Dev p* ͑sϪ1͒ , where Dev p* ͑sϪ1͒ is the sample variance of the original M when s ϭ 1 and the deviance obtained at the selected variable p* when 1 Ͻ s Յ P. The final normalized log ratio of peptide ion intensities is M j ͑Pϩ1͒ , j ϭ 1, L, N. R code for performing these calculations is available at http://homepages.uc.edu/ϳwang2x7/.
Median Deviation-Median deviation is defined as the square root of the ratio of the 25th percentile to the median peptide intensity ratio. By using this statistic based on the center of the absolute deviations, the method is not sensitive to one observation deviating from an expected ratio of 1 and therefore provides a robust measure of the distribution of the deviations.

RESULTS
Previous work on the reproducibility of proteomic analyses by LC-MS formed the basis for this study (13). In that work, it was noted that peptide ion intensities are valuable reporters of overall reproducibility. Here, we describe their use as central indicators of systematic bias. As with any data analysis project in proteomics, software must be chosen that is capable of extracting the required data from the instrument-generated binary files. And for ion-current based label-free quantitative studies, routines for performing the necessary calculations on extracted ion chromatograms are also needed. While there are many software programs available, we chose to continue developing the tools within the NIST MSQC pipeline. As such, ProMS replaces ReAdW4Mascot2.exe for calculating intensity (Experimental Procedures). These values can now be written to the NIST MSQC output using the command-line option "-mode full" during processing.
Systematic Bias in the Peptide Ion Intensities-To begin our studies, we chose test data from CPTAC Study 8 (no SOP) and a subset of runs from Study 6 (SOP-controlled). The data were all processed through the NIST MSQC pipeline (v.1.2.0) as described under "Experimental Procedures." In Study 8, three replicates of two amounts, 120 ng (low) and 600 ng (high), of the yeast material (Sample QC2) were analyzed in triplicate. The data from 6 different instruments in five different laboratories were analyzed. Addition-ally, data from Study 6 (yeast ϩ UPS1) from two Orbitraps were also analyzed. To visualize the raw intensities, and the need for normalization, the distributions were plotted in Fig. 1.
Several observations can be made from Fig. 1. First, the low versus high runs from Study 8 can be easily distinguished because of the five-fold difference in loading. The medians of the boxplots for high runs range from 3.2-2.9 in LTQs and 4.4 -5.1 in Orbitraps, whereas the medians of low runs range from 2.6 -3.2 in LTQs and 3.4 -3.9 in Orbitraps. Second, the interquartile range (IQR) of peptide ion intensities measured on an Orbritrap is between 0.8 -1.2, which is wider than IQR of 0.5-0.62 measured on an LTQ. Furthermore, even within the same type of instrument, obvious differences in median intensities and IQR exist between runs and between instruments. For example, the boxplots for Lab 1 in Study 6 have an average median around 4.3 with an IQR of ϳ1. The boxplots for Lab 2 in Study 6 have a lower average median around 3.8 with IQR between 0.8 -0.9. These results suggested the need for a data normalization step prior to any interpretation of differences within an experiment, and for normalization if data are to be compared between labs for both LTQ and Orbitrap instruments. Results on Orbitrap instruments are presented in detail in the main text, whereas results on LTQ instruments are included in Supplemental Data.
Systematic Bias as Identified by Relative Intensities-Next, we used the relative intensities and a group of variables to investigate systematic bias in the above data. Relative intensity is defined as M ϭ log2 (I R1 /I R2 ), where I R1 and I R2 are the MS1 intensities for the run R1 and R2, respectively. The underlying assumption is that the majority of relative intensities do not change between runs, allowing the user to examine dispersion in the data. Fig. 2 shows the systematic biases in M related to a group of selected variables. , 60 ng/l ("low," filled boxplots) and 300 ng/l ("high," unfilled boxplots) of the yeast material (Sample QC2) were included in the analysis of Study 8. For Study 6, a subset of runs on two Orbitraps were also included. The runs were plotted in the order specified as below, each with three replicates: (1) Sample 6A yeast ϩ UPS1 at 0.25 fmol/l, (2) Sample 6B, yeast ϩ UPS1 at 0.74 fmol/l, (3) Sample 6C, yeast ϩ UPS1 at 2.2 fmol/l, (4) Sample 6D, yeast ϩ UPS1 at 6.7 fmol/l, (5) Sample 6E, yeast ϩ UPS1 at 20 fmol/l, (6) Sample 6-QC2 yeast only. Fig. 2A shows commonly used M-A plots for a comparison of runs. In these plots, the y axis represents relative intensities M and the x axis is used to order the peptides by average abundance defined as A ϭ 0.5 [log10 (I R1 ) ϩ log10 (I R2 )]. Specifically, these plots can be used to identify systematic biases affecting absolute abundance (i.e. low signal and saturation effects). The first panel of Fig. 2A shows data from two technical replicates (the second and third runs in 300 ng/l yeast sample, high) on the same instrument (Orbitrap 65 in Study 8). In the other three panels, M-A plots compare data from the second run in high sample to the same run on each of the three Orbitrap instruments used in Study 8.
Several observations can be made from this analysis. First, as expected, replicate runs from the same instrument fit the expected reference value (dashed line) much better than runs compared between different instruments. Second, dispersions were on average, greater at lower A values. Third, several inter-instrument (also interlaboratory) comparisons exhibited nonlinear trends and global and/or local departures from the expected reference value. Local deviations confound simple global corrections (i.e. scaling) during normalization because they are not representative of the entire run. It is also important to note that no two comparisons displayed exactly the same behavior, suggesting the importance of pair-wise versus precursor m/z, z/ͱpeptide length, and retention time (RT). All runs are from Orbitrap 65 in Study 8. The technical replicate pair is the 2nd and the 3rd runs in the 300 ng/l yeast samples (high). The fivefold difference pair is the 2nd run in the 300 ng/l yeast sample (high) and the 2nd run in the 60 ng/l yeast sample (low). The left column shows technical replicates pairs and the right column shows fivefold difference pairs. These plots illustrate that systematic bias is more significant between the high and low samples with fivefold difference. analyses, particularly if data between instruments are compared. supplemental Fig. S1A in the Supplemental Data shows the similar results on an LTQ instrument.
From previous work, it was known that varying the injection amount leads to broad differences in a number of performance metrics: in particular, the number of singly charged relative to doubly charged peptide ion identifications increases with injection amount, as does the average precursor m/z (13). These observations indicate a reduced average peptide ion charge state at higher injection amounts, most likely because of limited proton availability during ESI.
With the intent of further investigating these observations on the individual peptide-level, we compared data from the Study 8 low and high samples to look for deviations in M from the expected reference value, given no or a fivefold difference in concentration between samples. In Fig. 2B, the technical replicates pair includes the second and the third runs in the 300 ng/l yeast samples (high); the fivefold difference pair includes the second run in the 300/; ng yeast sample (high) and the second run in the 60 ng/l yeast sample (low). Peptides are ordered according to peptide properties chosen here because they are known to be affected by sample concentration.
In this analysis, all three variables, precursor m/z, m/z, z/ͱpeptide length, and retention time, showed strong correlation with systematic biases, indicating that larger peptides and those eluting late in the gradient are more intense at higher concentrations. This suggests that sample loading selectively biases the intensities of the peptides with higher charge state potentials. The variable z/ͱpeptide length is the charge density standardized by the square root of the peptide length. The square root was used because z/peptide length exhibited strong correlation with the peptide length, which was also included in the group of variables discussed later.
To further investigate the possibility that loading systematically biases peptides with higher charge state potentials, M values of the above runs in Fig. 2B were separated by charge states. Fig. 2C summarizes the distributions of M values in boxplots. For technical replicates (second and third run from the high sample on Orbitrap 65), the median of boxplots on the relative intensities under three charge states were all close to the reference level 0 with IQR as 0.58, 0.46, and 0.49 for 2ϩ, 3ϩ, and 4ϩ, respectively. For the pair with fivefold difference (second run from the high sample and second run from the low sample on Orbitrap 65), the boxplots showed variations in both the medians and the IQRs across charge states. Charge state 2ϩ had a median (-3.13) which was the closest to the reference line (Ϫlog2(5)) with the shortest IQR ϭ 0.84. For 3ϩ, the median was Ϫ3.48 with IQR ϭ 1.05 and for 4ϩ, the median was Ϫ3.61 with IQR ϭ 1.20.
To determine if the distributions were significantly different, a two-sample Wilcoxon rank test was used because the bias on M values under different charge states is not normally distributed. As expected, the distributions between the charge states in high and low samples (5-fold difference) were statistically different (p value Ͻ0.001) with the exception of ϩ3 compared with ϩ4 (p value ϭ 0.46). The distributions of peptide ion intensities were not significantly different under different charge states for technical replicates (p value Ͼ 0.15). These results indicated that precursor charge state is an important variable to be considered during data normalization, especially between different samples. Supplemental Fig. S1B in Supplemental data shows the similar analysis on an LTQ instrument.
Since retention time introduces a large systematic bias when data from samples at different concentrations were compared, median relative abundance deviations (Experimental Procedures) were calculated for each retention quartile from comparisons between the above runs on Orbitrap 65. The high median deviations in M illustrated the higher variability during the fourth quartile of retention time (Q4) observed for intersample comparisons in five-fold difference pair (dashed lines) relative to technical replicates (solid lines) (See supplemental Fig. S2 in the Supplemental Data). The investigation on the median deviance suggested that, although comparisons between low and high samples are somewhat artificial, significant biases may be unwittingly introduced by small loading differences, giving rise to retention time biases.
Development of a Normalization Model-The variables examined in Fig. 2B as well as abundance (A), peptide length, and the number of mobile protons (precursor charge minus number of H, K and R's ϩ 1) were used to develop a normalization model (Experimental Procedures). The proposed method is an extension of normalization using a linear regression model. But instead of using only abundance, several bias variables are included as predictors. Supplemental Fig. S3 shows a schematic of the algorithm and supplemental Fig. S4 gives an example showing resultant regression curves and deviance values during stepwise normalization using two runs from Study 8.
Examining the Results of the Normalization Model- Fig. 3 shows the densities of M values before and after normalization using various approaches, including removing the mean of log ratios (around Ϫ2.7), regression against the abundance (A) and the proposed semiparametric regression models. The data used were a pair with fivefold difference from Study 8. The comparison in Fig. 3 indicated that models using only mean of M or abundance (A) as normalization factors, which have been widely used in proteomics and on microarray data, may have room for improvement when applied to peptide abundance data derived from ion current measurements.
Examining the Influence of the Variables-To cover many cases, the interlaboratory data from Study 8 were again used to approximate the relative influence of these variables with respect to peptide intensity deviations. The low and high concentration runs within and between labs were all compared. In this analysis, the frequency of each variable appear-ing in the rank 1 (most influential) position was calculated for all pairs of runs (Fig. 4A).
Within labs, RT was consistently ranked as the most influential variable and among the most influential variables when runs were compared between labs. The other two variables that became influential were precursor m/z and the peptide length. What is also notable from this analysis is the fact that within labs, average abundance (A) -the standardly used normalization variable -was never ranked as the most important variable. RT, length and precursor m/z were the most influential between labs. Again, although performing a differential expression analysis on runs between instruments (laboratories) may be somewhat artificial, it nevertheless highlighted the most variable aspects of the data. Similar results on LTQ instruments are included in supplemental Fig. S5A of the Supplemental Data.
The magnitude of the bias accounted for by each of the variables can be shown by the mean of the deviance (See Eq. (2) and (4) under "Experimental Procedures"). Fig. 4B shows the average of the mean of deviance (MD) within lab and Fig.  4C shows the average of the MD for each pair between labs. Several observations were made from this analysis. First, MD was larger for pairs across instruments than for those from the same instrument. Second, the largest reduction of the MD was introduced by the rank 1 variable. The remaining variables improved the MD only marginally. After removing systematic bias from the M values, the residual MD (RSE, white bars) was comparable at the two concentration levels, although pairs from the same labs had a smaller MD to those pairs from different labs. Similar results on LTQ instruments are presented in supplemental Fig. S5B and S5C of Supplemental Data.
The normalization algorithm developed for this work is directly applicable when the two runs are from an identical sample or are from samples with a known overall concentration difference, such as Study 8 data sets. In these samples, all peptide ions can be used in the normalization procedure because they can all be assumed to be randomly distributed around a known and identical reference level. When the two runs from biologically different samples are analyzed, some peptide ions are expected to remain quantitatively similar (common peptide ions), whereas significant changes are expected in others. Those peptide ions, whose intensities are suspected to be different under different biological conditions, should not be included in the normalization and ranking procedure, if possible. Their M values are adjusted by interpolation using the regression parameters obtained from the normalization-ranking procedure with the common peptide ions. The application of the proposed method to this type of data is shown below using Study 6 mock biomarker data.
Normalization Lowers the False Positive Rate in Mock Biomarker Experiments-In order to assess the effectiveness of the newly developed normalization method, we used data from Study 6. In this study, several of the processed samples contained the SigmaUPS1 proteins spiked into the yeast reference material at known amounts, sequentially differing by factors of 3 (Experimental Procedures). To visualize the data prior to normalization, we plotted the M values of the yeast matrix peptide ions and Sigma UPS1 spike-ins separately versus RT (Fig. 5)  Sample 6D on Lab 1 in Study 6. The Sigma UPS1 peptides in this plot are clearly separated from the yeast matrix peptides and are present at close to the expected log2-ratio (approximately Ϫ1.58). However, because the spike-level is of relatively high concentration, a clear RT bias is visible at the end of the gradient.
Study 6 data is a mock case for biologically different samples. That is the common peptide ions were known (yeast peptide ions). In a real data set, this information is usually not available. The selection of the common peptide ions for normalization is another important issue in comparative proteomics. Various methods have been developed in the literature (11, 15, 16 -18). In the following results, we normalized and ranked the Study 6 data as if the identities of the yeast and Sigma UPS1 proteins were not known like in many real data sets. To select the set of common peptide ions, we used the global rank-invariant set selection methods in (18).
To examine the ability of the model to separate known spike-ins from matrix, each pair of runs from Sample 6C and 6D was normalized. In Fig. 6, the densities of the yeast (upper panel) and Sigma UPS1 (lower panel) M values in a pair (Run #8 of Sample 6C and Run #11 of Sample 6D on Orbitrap 65 in Study 6) are shown for the cases before normalization, normalization by abundance (A) only, by rank 1 variable only and by all variables used, respectively. Biases in M values were reduced more by the single rank 1 variable or by the combination of all six variables, than by that of the abundance (A).
Gain in the stage of data preprocessing directly lead to better sensitivity and specificity for detection of the mockbiomarkers. Supplemental Fig. S6A shows ROC curves using fold-changes for spike-in concentration C versus D (Sample 6C with Run #7, 8, 9 and Sample 6D with Run #10, 11, 12 on Orbitrap 65 in Study 6). This figure shows the false positive rate (FPR ϭ 1-specificity) and true positive rate (sensitivity), when using fold-change thresholds ranging from 1.5 to 6 at intervals of 0.5. Each of the runs from sample 6C (lower concentration) was used in the numerator to calculate M in a pair. The results are shown in panel (a) for pairs using run 7 as FIG. 4. Ranking and mean deviances of the normalization variables. The data used were 19 experimental runs on the 3 Orbitrap instruments from Study 8, including 60 ng/l (low) and 300 ng/l (high) yeast samples. A, The frequency of the variables as Rank 1 for runs within the same lab or across different labs. B, The magnitude of the mean deviance adjusted by each variable, as well as the remaining mean deviance (represented by RSE) when experimental runs were from the same labs. C, The magnitude of the mean deviance adjusted by each variable as well as the remaining mean deviance (represented by RSE) when experimental runs were from different labs. the base run; (b) for pairs using run 8 as the base run; (c) for pairs using run 9 as the base run. Normalization with all variables lead to a higher true positive rate, especially for fold-change thresholds below 4, and a consistently lower false positive rate compared with the results before normalization or normalization with the abundance (A) only. Also notably, the ROC curves after normalization with all variables did not vary much using different base runs, whereas base run choice significantly affected the ROC curves for data before normalization. At the true fold change (ϭ 3), the FPRs were improved dramatically after normalization with all variables compared with those before normalization or normalization with the abundance only, independent of the large differences in base run (Table I).
A rigorous discussion and comparison of different common peptide selection methods is beyond the scope of the current study. For a simple comparison in the current report, we also normalized and ranked variables based on known identities of peptide ions. That is, only the yeast peptide ions were used in normalization and ranking whereas the intensities of the spike-in Sigma UPS1 peptide ions were interpolated based on the models estimated using only yeast peptide ions. Results are included in Supplementary Data (supplemental Fig. S6B and supplemental Table S1). Though results still support the normalization by all variables, the selection of common peptide ions did have an impact on sensitivity and specificity in   I R2 )) under different normalization methods for run pairs Sample 6C (yeast ؉ UPS1 at 2. 2 fmol/l) against Sample 6D (yeast ؉ UPS1 at 6.7 fmol/l) (3-fold difference) in Study 6. In normalization and ranking, common peptides were selected based on the global invariant-ranking set (18). The top panel is for the yeast peptides ions, whose relative intensities were expected to be centered on the reference line at 0. The bottom panel is for the Sigma UPS1 spike-in peptides, whose relative intensities were expected to be centered on the threefold difference reference line at [-log2 (3)] (approximately [-1.58]). The dark gray curve (two-dash) shows the original relative intensities M (before normalization). The blue curve (dot-dash) represents the normalized relative intensities M using peptide abundance (A). The purple curve (long-dash) is for the normalized relative intensities M using Rank1 variable only. The black curve (solid) is for the normalized relative intensities using all variables. the biomarker discovery analysis. We suggest that the common set of peptide ions should be selected on the basis of biological consideration (16) or internal standards if possible. Data-driven methods may be used but caution is recommended.

DISCUSSION
The goal of this study was to identify and characterize systematic biases present in proteomics data, and then to attempt to develop a better normalization method. This was accomplished by examining the effects of loading on individual peptide ions as well as by monitoring other unidentified biases between technical replicates. Peptide property variables were used as reporters to identify biases which manifest as functions of retention time or one of several other variables. Callister et al. state that the most appropriate normalization techniques are ideally developed following identification and characterization of systematic biases (9). Here we described peptides with higher charge state potentials to be more sensitive to loading differences. Local deviations were also likely related to electrospray issues.
We also demonstrated that systematic biases are almost always present. Moreover, we observed that these effects are not predictable and local, nonlinear corrections are almost always necessary. A LOWESS regression model can be used to overcome the nonlinearity. However, it involves the choice of a fraction of neighboring samples (bandwidth) in the normalization step. Berger et al. demonstrate that bandwidth parameter choice significantly affects results (15). Semi-parametric regression avoids this problem by selecting the number of knots, resulting in statistical stability as long as the knots are dense enough (17,19). Additionally, the semiparametric regression models are constructed within the framework of linear regression models. Linear model diagnostics and evaluation methods can be directly used; its computation can be easily implemented using the widely available mixed model packages.
We have developed an iterative normalization that does not require the user to be knowledgeable of the largest systematic biases present in their data prior to normalization. This was achieved by ranking variables along with stepwise normalization. Although this model proved effect in the analysis of the CPTAC Study 6 data, it will be necessary in follow-up studies to examine the effectiveness in the face of biological variability.
The products of this work are improvements to the NIST MSQC pipeline, which now performs the necessary calculations to be used for label-free analysis, and the R code to perform normalization. The pipeline has been tested with Thermo LTQ, Orbitrap and FT, and to a lesser extent, with Agilent QTOF and AB TripleTof TM 5600 data. The goal of this work was to unveil systematic biases in these proteomics data sets in order to help validate the appropriateness of these methods for quantitative workflows. Without first understanding the effects of systematic biases, interpretation of label-free data sets are subject to so-called "batch effects" and other incorrect assumptions about the samples underlying the data. These anomalies can lead to costly testing of irrelevant hypotheses, particularly in biomarker discovery efforts. Just as in microarray experiments, these biases can be effectively eliminated by applying appropriate statistical methods.

TABLE I
The sensitivity and false positive rate (FPR ϭ 1-specificity) with the threefold decision criterion for Sample 6C (yeast ϩ UPS1 at 2.2 fmol/l) against Sample 6D (yeast ϩ UPS1 at 6.7 fmol/l) from the first lab in Study 6. Each of the runs from Sample 6C (run #7, 8,9) was used in the numerator (base run) to calculate the relative intensities (M) in a pair with the runs from Sample 6D (run #10, 11,12). The results in Column "before" used data before normalization. The results in Column "With A only" used data normalized by the abundance (A) only. The results in Column "With all variables" used data normalized by all variables. The normalization was based on common peptides selected using the global rank-invariant set in (18)