MSstatsQC: Longitudinal System Suitability Monitoring and Quality Control for Targeted Proteomic Experiments*

Selected Reaction Monitoring (SRM) is a powerful tool for targeted detection and quantification of peptides in complex matrices. An important objective of SRM is to obtain peptide quantifications that are (1) suitable for the investigation, and (2) reproducible across laboratories and runs. The first objective is achieved by system suitability tests (SST), which verify that mass spectrometric instrumentation performs as specified. The second objective is achieved by quality control (QC), which provides in-process quality assurance of the sample profile. A common aspect of SST and QC is the longitudinal nature of the data. Although SST and QC have received a lot of attention in the proteomic community, the currently used statistical methods are limited. This manuscript improves upon the statistical methodology for SST and QC that is currently used in proteomics. It adapts the modern methods of longitudinal statistical process control, such as simultaneous and time weighted control charts and change point analysis, to SST and QC of SRM experiments, discusses their advantages, and provides practical guidelines. Evaluations on simulated data sets, and on data sets from the Clinical Proteomics Technology Assessment for Cancer (CPTAC) consortium, demonstrated that these methods substantially improve our ability of real time monitoring, early detection and prevention of chromatographic and instrumental problems. We implemented the methods in an open-source R-based software package MSstatsQC and its web-based graphical user interface. They are available for use stand-alone, or for integration with automated pipelines. Although the examples focus on targeted proteomics, the statistical methods in this manuscript apply more generally to quantitative proteomics.

Selected Reaction Monitoring (SRM) is a powerful tool for targeted detection and quantification of peptides in complex matrices. An important objective of SRM is to obtain peptide quantifications that are (1) suitable for the investigation, and (2) reproducible across laboratories and runs. The first objective is achieved by system suitability tests (SST), which verify that mass spectrometric instrumentation performs as specified. The second objective is achieved by quality control (QC), which provides in-process quality assurance of the sample profile. A common aspect of SST and QC is the longitudinal nature of the data. Although SST and QC have received a lot of attention in the proteomic community, the currently used statistical methods are limited. This manuscript improves upon the statistical methodology for SST and QC that is currently used in proteomics. It adapts the modern methods of longitudinal statistical process control, such as simultaneous and time weighted control charts and change point analysis, to SST and QC of SRM experiments, discusses their advantages, and provides practical guidelines. Evaluations on simulated data sets, and on data sets from the Clinical Proteomics Technology Assessment for Cancer (CPTAC) consortium, demonstrated that these methods substantially improve our ability of real time monitoring, early detection and prevention of chromatographic and instrumental problems. We implemented the methods in an open-source R-based software package MSstatsQC and its web-based graphical user interface. They are available for use stand-alone, or for integration with automated pipelines. Although the examples focus on targeted proteomics, the statistical methods in this manuscript apply more generally to quantitative proteomics. Mass spectrometry-based Selected Reaction Monitoring (SRM) 1 is a powerful tool for targeted detection and quantification of peptides in complex biological mixtures and matrices (1)(2)(3). SRM assays can identify and quantify targeted analytes with great specificity and accuracy. They are increasingly adopted by the biomedical community, for applications ranging from clinical diagnostic, to research and exploratory studies.
An important objective of SRM assays is to produce peptide quantifications that are (1) suitable for the investigation, and (2) reproducible across laboratories and runs (4). As the complexity of the experiments and of the instrumentation grows, so does the need to characterize the accuracy and the consistency of the results. To help achieve this, the United States Pharmacopeia (USP) chapter ➜ introduced four components of data quality: analytical instrument qualification, analytical method validation, system suitability testing, and quality control checks (5). Analytical instrument qualification presents evidence that the instrument, its setup and calibration are suitable for the intended use. Analytical method validation demonstrates that the assay performed on the instrument can produce reliable results, and reports characteristics such as accuracy, precision, specificity, detection limit, and quantification limit (6,7).
At the same time, the practical utility of SRM assays depends not only on the general properties of the instrument and of the assay, but also on whether their implementation worked as intended in an experimental setting. A same SRM assay of a same biological material can produce variable results, depending on conditions such as laboratories, instruments, operators, or time of data acquisition (8). Substantial reproducibility gains can be achieved by minimizing this undue variation (4,8). To this end, system suitability tests (SST) verify that a laboratory system satisfies the prespecified criteria immediately before sample analysis. SST relies on a series of reference materials and decision rules, designed to separately test aspects of the system such as consistency of the response, carryover, retention time stability, mass accuracy, or signal-to-noise (9,10). If a component of the SST test fails, the instrument is stopped to remedy the problem. In contrast, quality control (QC) uses reference materials and/or calibration standards, and decision rules, to provide an inprocess assurance during sample analysis. When reference materials cannot be part of the biological sample, a separate QC sample, e.g. a spiked protein mixture of known composition, or a cell lysate or other mixture that mimics the biological material of interest, is interleaved between the biological samples (10,11). Because SST and QC can use different reference materials, metrics, and decision rules, their results do not necessarily agree. For example, when the system passes SST but the experiment fails QC, the measurement system does not need to be re-evaluated, and the problem may lay elsewhere (e.g. in sample storage or processing). Alternatively, an instrument may have unsuitable performance, but produce measurements with acceptable QC (12).
A common aspect of SST and QC is the longitudinal nature of the input data, and of data-driven decisions. An accurate, reproducible and objective monitoring and decision-making requires the use of statistical methodology. The same general statistical methodology for summarizing longitudinal profiles can be applied to SST and QC.
A state-of-the art approach for longitudinal profiling is Statistical Process Control (SPC). SPC is a collection of statistical methods and graphical summaries that generate warning flags when undesirable deviations occur, and help identify and eliminate the root causes of these deviations (13)(14)(15). The use of SPC in manufacturing, automotive industry, food, service, and healthcare has demonstrated excellent performance in reducing internal and external failure costs (16). Previous reports cite striking results, such as saving millions of dollars in expenses, reducing cycle times by half or more, and reduction of processing errors (16). SPC has been widely adopted in clinical laboratories (17), and its applications in mass spectrometry-based proteomics increasingly appear. For example, Bramwell (18) demonstrated the potential of SPC in a 2D DIGE experiment. Other applications are presented in Bourmaud et al. (19), Bereman et al. (20), and Bereman et al. (21). However, the statistical methodology used in these applications is limited to control charts that monitor large changes in mean of a metric, and do not match the sensitivity and the accuracy of more modern, state-of-the-art methods.
The contribution of this manuscript is in adapting a broader class of modern SPC methods, beyond what is currently used in mass spectrometry-based proteomics, to the context of SRM experiments. The methods include simultaneous monitoring mean of a metric in addition to its variability, time weighted control charts, and change point analysis. We provide practical guidelines for using these methods, and for making decisions from multivariate measurements, in the context of both SST and QC. We demonstrated the advantages of these methods using simulated longitudinal QC data sets for single peptides, and using experimental longitudinal SST data sets from the Clinical Proteomics Technology Assessment for Cancer (CPTAC) consortium evaluating multimetrics and multipeptide criteria of performance. We implemented the methods in an open-source R-based (22) software package MSstatsQC and its graphical user interface, which can be used stand-alone, or integrated into larger and more automated data analysis pipelines. We argue that this statistical methodology should become part of the daily practice of SRM-based investigations. Although all the examples in this manuscript focus on targeted proteomics, the statistical methods are general, and can be in principle applied to other quantitative mass spectrometry-based workflows.
Related Work-Several noteworthy studies focused on QC monitoring of large-scale label-free proteomic experiments with data dependent acquisition (DDA). They considered two aspects: which metrics to monitor, and how to monitor them. An interlaboratory study by the Human Proteome Organization (HUPO) revealed common reproducibility problems in proteomic experiments, and raised awareness of quality control (23). The National Institute of Standards and Technology (NIST) in collaboration with CPTAC identified 46 metrics, obtained with the software pipeline NISTMSQC, for evaluating the performance of LC-MS/MS DDA (24). This approach was not widely adopted because of several drawbacks, including applicability to only one vendor and search algorithm, complexity of data extraction, and lack of visual representation for less experienced users (20). Another tool for LC-MS/MS DDA, QuaMeter (25), offered a more flexible way to generate metrics, but required manual evaluation.
An outcome of these studies was an extensive list of metrics, used to monitor chromatographic and electrospray stability, and mass spectrometer performance. LC performance can be monitored by evaluating peak retention times and intensities, peak widths, and total peak areas of peptides (20,24,26). Electrospray stability can be monitored by its impact on MS1 and MS/MS ion injection times, and on number of acquired MS1 and MS/MS scans (26). Mass accuracy is commonly used to monitor mass spectrometer performance (20,21).
In addition to choosing informative metrics, an equally important concern is the choice of statistical methodology to monitor their values in time. SIMPATIQCO (26) proposed run charts with green (safe), yellow (warning), and red (out-of-control) regions, which used median absolute deviation of metrics to detect abrupt changes in QC. Likewise, bean plots and run charts were used to monitor QC metrics with Metriculator (27), however, the tool did not implement thresholds for distinguishing undesirable system variation from noise. Bennett et al. (28) analyzed multisite DDA experiments with QuaMeter and NISTMSQC. The authors used principle component analysis and control charts to highlight patterns of between-laboratory variation, however the longitudinal aspect of the study was limited to only nine time points.
More recently, research interest shifted to QC of targeted SRM experiments. Statistical Process Control in Proteomics (SProCoP) (20) and Panorama AutoQC (21) implemented statistical methods from the SPC toolbox. SProCop is an opensource R-based plug-in to the Skyline software (29) and a Skyline external tool (30). It takes as input metrics such as retention time, total peak area, full peak width at half maximum (FWHM) or peak asymmetry. It implements control charts (called Z charts) to monitor standardized outputs for each peptide, and applies constant decision thresholds. Likewise, Panorama AutoQC is an open-source interface between Skyline and Panorama server (31), which uses similar input metrics, and implements control charts for nonstandardized outputs, where decision thresholds can depend on a metric and on a peptide. Variation of each metric is assumed to be stable over time.
Although a lot of method development in proteomics has already been devoted to QC, fewer publications discuss standardizing and monitoring system suitability testing of SRM assays. Although SST guidelines for clinical laboratories are available (9), metrics and acceptance criteria for multiplexed discovery and validation SRM assays (called tier 2 and 3 in (4)) are mostly empirically established (12). A notable example is a comprehensive study conducted by Abbatiello et al. (8,32) as part of CPTAC, to design SST for nanoHPLC-MRM-MS peptide-based assays. The study contributed an adaptable plan and acceptance criteria for metrics such peak area and retention time, derived from benchmarks on a wide range of instruments, vendors, and laboratories. This protocol, setup, and metrics to monitor are available for use by other laboratories.
This manuscript contributes to the proteomic community additional statistical SPC methods for longitudinal monitoring of both SST and QC. These methods are frequently used in other areas of industry and research, but have not yet made their way to targeted proteomics. The methods include simultaneous monitoring of mean and variation of a metric (XmR and ZmR charts), time weighted control charts to detect small changes (CUSUM m and CUSUM v charts), change point analysis for identifying time of a change, and maps for high-dimensional decision making. For SRM experiments, the methods take as input quantitative metrics such as retention time, total peak area, and peak asymmetry, or any other quantitative metric of the experimentalist's choice. They help make deci-sions according to user defined criteria, or according to deviations from an example good quality guide set.
The manuscript illustrates the value of these methods for longitudinal QC monitoring using simulated data sets, where the true source of undesirable deviations is known, to illustrate cases with (1) isolated outlier QC measurements, (2) sustained step shifts of metric mean and variation, and (3) slow linear drifts of metric mean. Furthermore, the manuscript illustrates the performance of the methods for SST monitoring using data sets from a large-scale, multisite CPTAC study 9.1 to illustrate the importance of (1) monitoring multiple peptides, (2) monitoring multiple metrics, and (3) empirically derived limits in detecting suboptimal performance. Finally, we use CPTAC study 9.1 to illustrate a case to illustrate the effect of a poor-quality guide set on decision making. We implemented these methods in an open-source R-based software MSstatsQC, and in a web-based Shiny (30) user interface. The implementation, as well as its support documentation, such as a vignette for step-by-step explanation of the workflows, and a cheat-sheet for a quick reference, are available at www.msstats.org/msstatsqc.

EXPERIMENTAL PROCEDURES
Overview and Notation-Denote i ϭ 1,2, . . ., I the index of a metric (such as retention time or peak area), j ϭ 1,2, . . ., J the index of a peptide, and t ϭ 1,2, . . ., T the time point that an SST or QC run is acquired. Denote X ijt the value of the metric for this peptide and time point. Because most discussion below focuses on one metric and one peptide at a time, we will omit the indices of metrics and peptides for simplicity, and denote the values in a time point as X t .
Because X t is subject to stochastic variation, it is a random variable. Denote and the expected value and the standard deviation of X t .
It is also useful to define the quantity Z t ϭ X t Ϫ , called Z score, which standardizes X t . The Z score is independent of the measurement units of the metric. It fluctuates around the expected value of 0, and has the standard deviation of 1. Note that is a standard deviation and not a standard error, and therefore the Z score is different from a statistic used, e.g. in a t test. The true values of and are typically unknown, and their estimates and are obtained from a guide set of high quality runs.
Example Data Sets for Longitudinal QC Monitoring: Simulated QC-Three simulated data sets were designed to evaluate longitudinal QC monitoring algorithms for SRM proteomic workflows in situations where the true nature of the problem is known. The simulations were based on the time course QC profiles of proteins in blood presented in tutorials for SProCop (20) and Panorama AutoQC (21). Ten measurements (runs 5-15) were used to estimate peak area mean ( ) and standard deviation ( ) of peptide VYVEELKPT-PEGDLEILLQK. This peptide was stable in these runs, and produced good quality data (20). Next, peak areas measurements at 50 time points were simulated from Normal distribution X t ϳ N͑ , 2 ͒. Finally, three types of disturbances were created. The Simulated Data Set 1 introduced three outlying observations with a larger mean level of peak area by simulating X t ϳ N͑ ϩ 3 , 2 ͒ at t ϭ 30, 40, 50. The Simulated Data Set 2 introduced a small systematic increase (i.e. step shift) in peak area measurements by simulating X t ϳ N͑ ϩ , 2 ͒ at 26 Յ t Յ 50. The Simulated Data Set 3 imitated a drift in peak areas, and introduced an incremental increase by simulating X t ϳ N͑ ϩ ␤ ͑t Ϫ 25͒, 2 ͒ at 26 Յ t Յ 50, with ␤ ϭ 0.1. We used these data sets to illustrate the ability of SPC to detect isolated outliers and subtle measurement drifts.
Example Data Sets for Longitudinal SST Monitoring: CPTAC Study 9.1-CPTAC study 9.1 (sites 54, 56A, 65, and 86) (8) were used to evaluate longitudinal SST monitoring algorithms for SRM proteomic workflows. The study designed a reference sample with 15 peptides to characterize instrument performance in terms of peak area, retention time and FWHM metrics. The study acquired guide sets from this sample, as well as longitudinal measurements collected during two months in 11 labs. The number of time points in a lab varied between 36 and 89. The data sets were processed with Skyline 3.6.1.10690 to generate a table of quantitative metrics for input to MSstatsQC. We used these data sets to evaluate the ability of SPC to discover abnormalities such as retention time and peak area drifts (Site 54), retention time fluctuation (Site 86), and chromatographic shape problems (Site 65), and to illustrate effects of guide set selection on the performance of detecting a retention time drift (Site 56A). Guide sets are manually determined from in-control observations for each site.
Previously Used SPC Methods in Proteomics: Control Charts for Monitoring Mean-The fundamental SPC tool for monitoring is a control chart, first proposed by Shewhart (13,14), and later transferred to clinical chemistry by Levey and Jennings (15). A control chart summarizes QC measurements during time (e.g. peak area in Fig. 1A and 1B). Decisions from the control charts are made by comparing the observations to control limits. In some cases, the control limits can be pre-defined by industry standards for the purposes of SST, e.g. "mass accuracy must be less than 2 ppm" or "retention time measurements must be within 2 min of the target retention time." In many other cases, however, the control limits are empirically derived, as we discuss below.
To make conclusions on the original (i.e. not standardized) scale of the metric, the X chart (Fig. 1A) monitors the mean of X t and sets control limits in the form ( Ϯ L ). This expression assumes that the observations are independent and Normally distributed. A typical value of L is 3, which corresponds to Ϯ3 standard deviations from the mean of a metric, and results in the false alarm probability ␣ ϭ 0.0027. This approach is used in SProCop (20). Equivalently, the Z chart (Fig. 1B) monitors the standardized measurements Z t ϭ X t Ϫ , and sets the control limits to Ϯ3. If an observation is within the control limits, it is considered in-control. If an observation exceeds the limits, it is out-of-control. The goal of the next steps is to find the earliest time point when the variation occurred, and investigate and eliminate its root cause.

SPC Method Advocated In This Manuscript: Simultaneous Monitoring of Mean and Variability With XmR and ZmR
Charts-Although X charts can detect shifts the mean value of a metric, they can fail detecting shifts in its variation. Yet shifts in variation can be a major issue in mass spectrometric experiments, e.g. in cases of undesirable between-run fluctuation in retention time of an analyte. Monitoring variation of a metric along with its mean allows us to detect these problems. To this end, the X chart is supplemented with another control chart that displays moving ranges mR t ϭ ͉ X tϩ1 Ϫ X t ͉, t Ն 2 ( Fig. 1C) (16). Values of mR are sensitive to changes between consecutive measurements, and therefore to changes in the variability of a metric. Control limits for the mR chart are set to (0, LmR). The constant L is often set to L ϭ 3.268, matching the false alarm probability ␣ ϭ 0.0027 of the X and Z charts for moving ranges of length 2. The value of mR is estimated as the average of all the moving ranges in the guide set.
A pair (X chart, mR chart) of a metric is called XmR chart, and a pair (Z chart, mR chart) is called ZmR chart. In XmR and ZmR charts the estimate of variation is typically replaced to mR, and the constant L in the control limits of the X chart is set to L ϭ 2.66, to match the false alarm probability ␣ ϭ 0.0027 (16). The XmR and ZmR charts detect observations that are out-of-control in terms of both the mean of the metric, and its variation.

SPC Methods Advocated in This Manuscript: Detection of Subtle Shifts
With CUSUM m and CUSUM v -Although XmR and ZmR charts detect large undesirable deviations, they miss subtler problems, such as small isolated deviations of values of a metric, or slow sustained drifts. As an attempt to remedy this situation, Westgard rules (17) are used. These rules are used as plug-ins to X control chart. For example, Rule 10x raises a red flag when 10 consecutive retention time measurements fall on one side of their respective for a peptide. Unfortunately, these rules do not meet the needs of mass spectrometry-based proteomics. They are somewhat arbitrary, and do not scale beyond a small number of metrics and analytes. The differences in failure times and types of the problems between the analytes complicate the use of these rules even further. We advocate another approach that uses a family of cumulative sum (CUSUM) control charts (33) detects such subtle shifts effectively by monitoring timeweighted measurements.
The CUSUM m chart (16, 34) (Fig. 2C) monitors shifts in the metric's mean. It has two components, CUSUM mϩ and CUSUM mϪ that separately consider positive and negative changes. For example, a slow but steady increase in retention time is reflected by an increased CUSUM mϩ . For at time point (t Ն 1), CUSUM mϩ,t and CUSUM mϪ,t are defined as cumulative sums of sequential measurements, i.e. CUSUM mϩ,t ϭ max (0,X t Ϫ k ϩ CUSUM mϩ,(t Ϫ 1) ), and CUSUM mϪ,t ϭ max(0,ϪX t Ϫ k ϩ CUSUM mϪ,t Ϫ 1 ), with starting values CUSUM mϩ,0 ϭ CUSUM mϪ,0 ϭ 0. Here k is a parameter (also called slack value), which affects the sensitivity of the approach. The control limit h is defined such that If CUSUM mϩ,t or CUSUM mϪ,t exceed the control limit h, the process is considered out-of-control.
Here we advocate a modified version of CUSUM that simplifies the interpretation. Specifically, we replace X t with its standardized version Z t , and set k ϭ 0.5 and h Х 5. In this definition, CUSUM mϩ,t ϭ max (0,Z t Ϫ 0.5 ϩ CUSUM mϩ,(t Ϫ 1) ), and CUSUM mϪ,t ϭ max (0,ϪZ t Ϫ 0.5 ϩ CUSUM mϪ,t Ϫ 1 ). It can be shown that this approach detects 1 deviations from the metric mean, with the false alarm rate of ␣ ϭ 0.0027 that matches the approaches above.
Similarly, CUSUM charts (Fig. 2D) detect slow changes in variation. For example, if retention time becomes increasingly unstable, the value of CUSUM ϩ increases. CUSUM ϩ is constructed from the transformation V t ϭ ͙ ͉Z t ͉ Ϫ 0.822 0.349 (16,34). Hawkins (1981) (35) showed that ͙ ͉Z t ͉ is very close to Normal distribution with mean 0.822 and standard deviation 0.349, and therefore V t follows approximately standard Normal distribution. Similarly to CUSUM m , define CUSUM ϩ,t ϭ max (0, V t Ϫ k ϩ CUSUM ϩ,(tϪ1) ), and CUSUM Ϫ,t ϭ max(0,ϪV t Ϫ k ϩ CUSUM Ϫ,tϪ1 ) We suggest the same parameters k ϭ 0.5 and h Х 5 to match ␣ ϭ 0.0027. If CUSUM ϩ,t or CUSUM Ϫ,t exceeds the control limit h, the variation of the process is out-of-control.
SPC Method Advocated in this Manuscript: Change Point Analysis-The goal of longitudinal profiling is to detect the earliest time point, called change point, at which a problem occurs. However, in some cases, especially in presence of slow sustained drifts, the problem can only be detected with a delay. Change point analysis, performed after an out-of-control observation occurs, helps pinpoint the change point more exactly (e.g. supplemental Fig. S2). Many statistical approaches for change point analysis exist (36).
Here we advocate detecting change point based on maximum likelihood estimation of (1) a step shift change model for the mean (37), and (2) a step shift change model for the variability (38). This approach has been proven effective in conjunction with control charts (37)(38)(39). Assume that a step change in the mean of a metric occurs after the change point ( ϭ 1, ..., T). Then, X t ϳ ͭ N͑,͒,t ϭ 1, 2, . . ., N͑ 1 , ͒, t ϭ ϩ 1, ϩ 2, . . ., T ͮ where N(.) denotes Normal distribution, 1 ϭ ϩ ␦ is the post-change mean and ␦ is the magnitude of a change. The estimate of the change point is obtained by maximizing the likelihood L͑, 1 Similarly, the change point for variation is estimated by maxi- 2 is the reverse cumulative sum of squares of T Ϫ t most recent observations. The change point estimate is ϭ argmax 1ՅtϽT (D t ) The change point estimate for variability helps user identify the time of a change in the variability of a metric.
SPC Method Advocated in this Manuscript: Summary Plots and Decision Maps-A distinctive feature of mass spectrometric experiments is the ability to simultaneously quantify multiple metrics of multiple analytes, for the purposes of biological investigations, but also for the purposes of SST or QC. However, extending the control charts to highly multivariate settings is non-trivial. The metrics and analytes produce multiple charts, which exponentially increase the comparisons to the control limits, and inflate the number of false alarms.
Given the complexity of the situation, we advocate a visual representation of individual longitudinal profiles of metrics and analytes, as well as simple ad hoc summaries (pass, warning, fail) of the overall performance, as described below.
A decision map (e.g. Fig. 4A and 4B) presents the most general overview of the overall performance per method (e.g. X chart and mR chart). It takes as input user-defined pass/warning/fail criteria for a control chart, and aggregates the conclusions from each metric over all the analytes and summaries obtained for the method. For example, the user-defined criteria can generate a warning signal if the proportion p t of analytes with out-of-control values in at least one metric (e.g. retention time) is 0.10 Յ p t Ͻ 0.25, and generate a fail signal if p t Ն 0.25. With this approach, each metric generates its own pass/warning/fail signals at each time point for each control chart. The signals are visually summarized, e.g. with a heatmap. Although this approach does not guarantee optimal decisions, it is a step toward making more objective, reproducible, and documentable decisions in multivariate situations.
A more detailed Look into the results of the control charts is required to efficiently identify the root causes of the problems. A river plot and a radar plot provide such more detailed insight. The river plot (e.g. Fig. 4C) details the performance of each metric in terms of the mean and variability of a metric, aggregated over the analytes. For example, in the case of retention time or total peak area in Fig. 4C, the river plots distinguish the proportions of out-of-control peptides with positive and negative deviations in mean and variation (colored lines), present trends over time (x axis), and indicate the estimated change point time for each deviation type and each analyte (red dots). Sloping curves and large numbers of red dots indicate worsening of the performance.
A radar plot (e.g. Fig. 4D) investigates the extent of the problems at the peptide level. For each metric, each summary and each peptide, it displays the number of time points that exceeded the control limits throughout the duration of the experiment. Different types of deviations (i.e. increases and decreases in the mean or variability) are color-coded. Large colored areas indicate worsening of the performance.
Implementation in Open-source Software-We implemented the methods discussed in this manuscript in an R package MSstatsQC. MSstatsQC is available free and open-source under Artistic 2.0 license (same license as, e.g. the project Bioconductor (40)). The package takes as input SST or QC measurements in a tabular format produced by data processing tools such as Skyline, or any other. The package can be accessed through its online graphical user interface, where input files of at most 5MB are uploaded on a remote server hosted by RStudio. Alternatively, the package can be installed on a local server, and used via its web interface or via a command line. The size of the input files can be increased to 100MB in that case. This is useful for organizations that wish to perform all the analyses locally, or integrate SPC into broader automated pipelines. The implementations, examples, experimental data sets, link to the source code on Github are available at msstats.org/msstatsqc. Fig. 1 illustrates the ability of X charts to detect large isolated deviations in longitudinal profiles in Simulated Data Set 1. The simulation contained three outlying peak area observations at time points t ϭ 30, 40, 50. Detecting such isolated outliers im mass spectrometric experiments is critical, as they can point to problems related to flow rate or run-to-run carry-over. Fig. 1A and 1B illustrate that X or Z charts were effective at detecting these outlying measurements, while avoiding false positive alarms. The X chart may be a better option for users who prefer monitoring profiles on the original scale. A side-effect of the isolated outliers was the increase in metric variation around t ϭ 30, 40, 50. Fig. 1C shows that the mR chart exceeded the upper control limit around the outlying time points. It effectively detected the change in variation, but produced one false alarm.

XmR Control Charts Are Sensitive to Large Aberrations in Longitudinal Measurements-
CUSUM Control Charts Are Sensitive to Small Sustained Drifts in Metric Mean and/or Variation-CUSUM charts are designed to accumulate historical information about the values of the metric. They did not detect the isolated outliers in Simulated Data Set 1 above (supplemental Fig. S1A-S1B). However, they outperform XmR charts in other situations, e.g. in presence of minor (less than 1.5) deviations from the metric mean, or in presence of slow drifts in metric mean or variation. Monitoring such drifts in mass spectrometric experiments is important, as it helps discover problems such as faulty calibration or wear in parts.
Simulated Data Set 2 contained a small step shift, where the mean of the peak area increased systematically at t ϭ 30 ( Fig. 2A). The step shift was small enough that the XmR chart ( Fig. 2A and 2B) did not detect any out-of-control observations. However, the CUSUM m chart (Fig. 2C) indicated a steady increase at t Ͼ 25, and successfully detected the subtle increase in peak area mean. Similarly, the CUSUM chart (Fig. 2D) monitored changes in peak area variation around its mean. The chart did not detect any out-of-control observations in this case, and correctly indicated that the variability of peak area around its mean was due primarily to random causes. The change point analysis (supplemental Fig.  S2) correctly pointed to the shift in peak area mean at t ϭ 26.
Similarly, Simulated Data Set 3 contained a small linear drift in the peak area of the peptide at 26 Յ t Յ 50. This disturbance wasn't easily detected with the XmR chart ( Fig. 3A and Fig. 3B), however, a steady increase in CUSUM mϩ (Fig. 3C) provided a strong evidence of an increased peak area over time. No out-of-control observations were detected with CUSUM chart (Fig. 3D), correctly indicating that the variability of peak area around its mean was primarily random. The change point analysis (supplemental Fig. S3) supports the existence of a mean shift in peak area at t ϭ 24. To summarize, the CUSUM m control chart was particularly effective in detecting subtle shifts and drifts in metric mean, and change point analysis allowed us to pinpoint the time when the change first occurred.
Monitoring Multiple Peptides Helps Detect Different Types of System Problems-Up to this point, we discussed SPC methods for monitoring one peptide. Monitoring one peptide is a common approach in SST or QC. It makes an implicit assumption that all the peptides respond to a problem in a similar way. Unfortunately, this assumption does not always hold. For example, in Site 54 of CPTAC study 9.1, the peak areas of early eluting peptides were out of control, whereas the rest of the peptides remained stable. supplemental Fig. S4 provides an example Skyline output of Site 54, and shows the retention time and peak area problems for peptide FFVAPF-PEVFGK. Because the disturbance was relatively small, the CUSUM m chart was best suited to detect the steady increase in retention time and the peak area decrease. This problem could not have been detected if we only monitored a stable peptide such as LVNELTEFAK (supplemental Fig. S5).
We used the CPTAC study 9.1 to illustrate strategies of simultaneous profiling of multiple metrics and multiple peptides suggested in this manuscript. The decision maps for CUSUM m and CUSUM generated the "pass" signal for a metric at a time when the percentage of out-of-control peptides p t was p t Յ 0.10 for all the statistical summaries. They generated the "warning" signals for each metric when 0.10 Ͻ p t Յ 0.25, and the "fail" signal when p t Ͼ 0.25 for the corresponding statistical summary (e.g. river plot for mean and river plot for variability). Applied to Site 54, and using the first 20 observations as the guide set, this ad hoc decision concluded that the system failed in terms of both the mean of retention time and the mean of peak area after t ϭ 20 (Fig. 4A). Additionally, decision-map for variability showed increases in the variability of retention time and peak area after t ϭ 20 (Fig. 4B).
Next, the river plot (Fig. 4C) and the radar plot (Fig. 4D) analyses of Site 54 were used to gain additional insight. River plots showed an increase in retention time mean and a decrease in peak area mean. These problems were observed in almost half of the peptides. Early eluting peptides were particularly affected. To further refine the analysis, peptide-level CUSUM m (Fig. 4E) and CUSUM m (Fig. 4F) plots for the early eluting peptide TAAYVNAIEK detected an increase in both the mean and the variability of retention time. Change point estimates (red dots in Fig. 4C) indicated that problems in retention time and total peak area begun to occur after t ϭ 15. Overall, the proposed approach allowed us to first outline the general performance readout of SST or QC runs over multiple metrics and peptides, using summaries that can detect diverse types of problems such as subtle shifts. Next, the approach allowed us to zoom into the individual metrics, peptides and deviation types, evaluate the nature of the problems, and the time when it first occurred.
Monitoring Multiple Diverse Metrics Helps Detect a Wider Range of Problems-So far, we have focused on a limited number of metrics, such as total peak area and retention time. However, monitoring a larger and more diverse set of metrics can be extremely useful for identifying causes of undesirable variation. Good chromatographic shape, a hallmark of high quality SRM experiments, can be monitored in terms of FWHM and in terms of peak asymmetry. Gradual increase in peak width can be caused by column problems, such as column aging or build-up of contamination on column, and require diligent attention for better reproducibility. Fig. 5 shows FWHM and peak asymmetry results for the CPTAC study 9.1 Site 65, where the first 5 observations were used as the guide set. River plots (Fig. 5A) summarized the XmR charts, and showed a large increase in the mean of peak width, and a decrease in the mean of peak asymmetry. Radar plots (Fig. 5B) showed that the disturbance affected almost all the peptides, and in a similar way. After examining the control charts for the individual peptides, such as the X chart for FWHM (Fig. 5C) or for peak asymmetry (Fig. 5D) of HGGTIP-IVPTAEFQDR, we concluded that a disturbance in peak width and asymmetry occurred after t ϭ 10. The mR charts (Fig. 5E and 5F) showed an increased variability of these metrics over time. supplemental Fig. S6 shows river and radar plots for retention time and peak area measurements. It illustrates that the problem with peak shape would not have been detected if we only monitored retention time. Overall, we recommend using a diverse range of metrics, such as the ones related to chromatographic shape, for better identification of problems and root causes.
Empirical Control Limits Provide Laboratory-specific Information for SST and QC-Predefined control limits are commonly used in SST and QC. However, a process that satisfies the predefined criteria can still perform poorly. Supplementing the predefined criteria with empirical control limits helps better assess the performance of the measurement system. We examined this point using CPTAC Study 9.1, Site 86. supplemental Fig. S7 shows the Skyline report for the peptide LVNELTEFAK. Suppose that the predefined control limits for the retention time of this peptide were set to 33.5 and 35.5 min. These limits were relatively wide, and produced only a few out-of-control observations ( Fig. 6A and 6B). On the other hand, empirically calculated control limits were narrower, and detected all the measurements as below the expected retention time. In other words, although the prespecified control limits were wider than warranted for this particular process, the empirical control limits were more exact and effectively detected the retention time drift. Therefore, combining predefined and empirical control limits helps assess the perform- ance with respect to both industry benchmark and the properties of the particular laboratory setup.
The Choice of the Guide-set Impacts the Conclusions of SPC-Because empirical derivations of control limits rely on a guide set of good quality runs, analysts should be alert about the importance of appropriately selecting the guide set. Mass spectrometry instruments often require warm-up or calibration time for effective assay development and implementation. If warm-up samples are included in the guide set, they can inflate the estimates of acceptable variation, widen the control limits, and deteriorate our ability to detect undesirable deviations. This problem was observed in the CPTAC study 9.1 Site 56A. Site 56A performed a series of adjustments during the first 15 runs to generate desirable assays, and later experienced an increase in retention time. Fig. 7 shows the XmR control charts of retention times of the peptide LVNELTEFAK, obtained with two different guide sets, namely runs 1-20 on Fig. 7A and 7B, and runs 30 -60 on Fig. 7C and 7D. The adjustments increased the variability in the retention times among the first 15 runs, and therefore the variability of the guide set on Fig. 7A and 7B, and widened the control limits. As the result, the XmR charts did not detect the undesirable systematic increase in retention time. In contrast, when the guide sets only included later runs, the X chart successfully detected the retention time drift (Fig. 7C). The large variation in the warm-up period was also noticeable in  5. CPTAC study 9. 1, Site 65: monitoring multiple diverse metrics to detect a wider range of problems. Detecting chromatographic shape problems using FWHM and peak asymmetry. A, river plot for X and mR control charts. B, radar plot for X and mR control charts. C, X chart of HGGTIPIVPTAEFQDR for FWHM. D, X chart of HGGTIPIVPTAEFQDR for peak asymmetry. E, mR chart of HGGTIPIVPTAEFQDR for FWHM. F, mR chart of HGGTIPIVPTAEFQDR for peak asymmetry. Upper panel of each river plot shows results for the X chart whereas lower the mR control chart (Fig. 7D). Overall, we recommend to exercise caution when selecting the guide set. A careful curation of the guide set runs is also recommended, to manually eliminate isolated outliers or other potential problems. DISCUSSION In mass spectrometry-based proteomics, reference materials are interleaved between the biological samples for system suitability testing, and as a proxy of in-process quality control. A common feature of spectra acquired from these materials is their longitudinal nature, and the consistency of the measurements between good-quality runs. The statistical methods presented in this manuscript make an efficient use of such longitudinal data, and help detect deviations from the consistent patterns. Different statistical summaries presented in the manuscript help detect different types of problems, and therefore it is advantageous to use these summaries jointly. Furthermore, change point time analysis pinpoints the time when a systematic change in patterns likely occurred. The statistical methods rely on control limits. When the control limits are empirically derived, they usefully complement predefined standards of good performance. They help detect undesirable deviations specific to the instrument configura-tion and the particular lab, provided that the guide sets are chosen with care.
Our results demonstrated that in mass spectrometry-based experiments it is advantageous to monitor multiple metrics of multiple peptides, as they help us detect deviations specific to a subset of the peptides. We proposed visual approaches for exploring the large decision space of multiple peptides. Although these approaches are ad-hoc and potentially suboptimal, they are a step toward objective and reproducible decisions in these multivariate settings. It is unfortunately impossible for us to provide specific recommendations for the action that follows the out-of-control diagnosis, but it will depend on the type and purpose of the experiment, the root cause of the deviations, and the existing regulatory, manufacturer, and operating guidelines.
The methods discussed in this manuscript are implemented in open-source R-based software, which can be used via a command line or via graphical user interface, locally or on a remote server. The implementation aims to be inclusive, in that it can be interfaced with any data processing tool that produces quantitative metrics in a tabular format.
Although the methods and the implementation have so far been tested with SRM experiments and relative protein quanpanel summarizes results for mR chart. Red dots in river plot show change point estimates for each peptide. Because of an increase in the mean of FWHM and a decrease in the mean of peak asymmetry, almost all peptides are outlying after t ϭ 10. In control charts, blue circles represent measurements that are within control limits, red lines are control limits and filled red circles are out-of-control observations that exceed lower or upper thresholds in the XmR charts for HGGTIPIVPTAEFQDR. The drift in the mean of both FWHM and peak asymmetry is detected effectively in the X charts. River plots and mR charts indicate that problems with variability of these metrics is less significant.  Fig. 1. All the measurements were below the target retention time. They were within the pre-specified control limits, but were effectively detected with the empirical control limits. tification, they are in principle applicable to other metric types (such as limits of detection, limits of quantification, or linearity of an assay), or to other mass spectrometric workflows such as data-dependent acquisition (DDA) or data-independent acquisition (DIA). We hope that these methods and the implementation will become widely used.   [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20]. C, X chart. Gray area: guide set runs 30 -60. D, mR chart for LVNELTEFAK (guide set: 30 -60). All other colors and symbols as in Fig. 1. The range between control limits for the first guide set was wider than for the second guide. The increase in mean retention time was not detected when the first guide set, but was detected with the second guide set.