prolfqua: A Comprehensive R-Package for Proteomics Differential Expression Analysis

Mass spectrometry is widely used for quantitative proteomics studies, relative protein quantification, and differential expression analysis of proteins. There is a large variety of quantification software and analysis tools. Nevertheless, there is a need for a modular, easy-to-use application programming interface in R that transparently supports a variety of well-principled statistical procedures and makes it easy to apply them to proteomics data and to compare and understand their differences. The prolfqua package integrates essential steps of the mass spectrometry-based differential expression analysis workflow: quality control, data normalization, protein aggregation, statistical modeling, hypothesis testing, and sample size estimation. The package makes integrating new data formats easy. It can be used to model simple experimental designs with a single explanatory variable as well as complex experiments with multiple factors and hypothesis testing. The implemented methods allow sensitive and specific differential expression analysis. Furthermore, the package implements benchmark functionality that can help to compare data acquisition, data preprocessing, or data modeling methods using a gold standard data set. The application programming interface of prolfqua strives to be clear, predictable, discoverable, and consistent to make proteomics data analysis application development easy and exciting. Finally, the prolfqua R-package is available on GitHub at https://github.com/fgcz/prolfqua, distributed under the MIT license. It runs on all platforms supported by the R free software environment for statistical computing and graphics.

The R markdown file, with all the R code used to produce this pdf-document, can be found here: prolfqua_supplement.Rmd. To replicate the analysis, you need to install the prolfquabenchmark R package.

Material S2: Benchmark Vignettes (IonStar/MaxQuant)
Pre-built versions of the vignettes, created to run the DEA benchmark, are available:
• Benchmarking of methods implemented in prolfqua using the IonStar dataset MaxQuant
• Benchmarking of MSstats using the IonStar dataset MaxQuant
• Benchmarking of proDA using the IonStar dataset MaxQuant
• Benchmarking of msqrob2 using the IonStar dataset MaxQuant

Material S3: DEA benchmark IonStar/MaxQuant/peptide.txt - Significance test

Table 1 summarizes the pAUC of the DEA benchmark when using the IonStar/MaxQuant/peptide.txt data. Figure 5 in the Article shows a barplot of the pAUC_10.
We test whether we should reject the null hypothesis that the partial areas under the ROC curve at 0.1 FDR (pAUC_10), computed using the scaled p-values (see scaled.p.value in Table 1), do not differ for msqrob2, proDA, and prolfqua_merged. For this task we use the function roc.test from the R package pROC, which performs a bootstrap test for a difference in the pAUC of two ROC curves. Table 2 shows the p-values of the pairwise bootstrap tests for the partial areas under the ROC curves.
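The comparison just described can be sketched with pROC as follows. The data below are simulated placeholders, not the benchmark scores; only the roc()/roc.test() calls mirror the analysis, and the score names are invented for illustration.

```r
library(pROC)

# Hypothetical example: 'truth' marks the true positives (proteins with a
# known fold change); the two score vectors stand in for the scaled p-value
# scores of two DEA methods (simulated, not the benchmark results).
set.seed(1)
truth <- rbinom(300, 1, 0.3)
score_A <- truth * rnorm(300, mean = 1) + rnorm(300)
score_B <- truth * rnorm(300, mean = 0.9) + rnorm(300)

# partial AUC over specificities from 1 down to 0.9, i.e., pAUC_10
roc_A <- roc(truth, score_A, partial.auc = c(1, 0.9),
             partial.auc.focus = "specificity", quiet = TRUE)
roc_B <- roc(truth, score_B, partial.auc = c(1, 0.9),
             partial.auc.focus = "specificity", quiet = TRUE)

# bootstrap test for a difference between the two partial AUCs
res <- roc.test(roc_A, roc_B, method = "bootstrap", boot.n = 500)
res$p.value
```

Running this pairwise for all three packages yields a table of p-values analogous to Table 2.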
The Rmarkdown file which generated these results can be found here: "DEA benchmark IonStar/FragPipeV14/combined_protein.tsv". Supplementary Figure 1 shows the partial area under the curve for msqrob2, proDA, and prolfqua when using the difference among the groups, the scaled p-values, or the t-statistics to rank the proteins. For this dataset, the differences among groups show a higher performance than the t-statistic and the scaled p-value. The proDA package performs slightly better than prolfqua or msqrob2, although the differences between the pAUC_10 for the scaled p-value score are not statistically significant. Table 4 shows the results of the bootstrap test for two ROC curves. We observe that for this benchmark data, we cannot reject the null hypothesis that there is no difference among the pAUC_10 (ROC based on the scaled p-value) for the three packages.


Material S5: DEA benchmark IonStar/FragPipeV14/combined_protein.tsv
The Rmarkdown file which created these results, with more details about the IonStar FragPipe v14 dataset and the data processing, can be found here: "DEA benchmark IonStar/FragPipeV14/combined_protein.tsv". After filtering for two peptides per protein, the dataset comprises 3836 proteins. Since we are using the 'combined_protein.tsv' as input, a comparison with msqrob2 or MSstats is not possible: to fit the hurdle (msqrob) model, peptide intensities are required, while MSstats requires the 'MSstats.tsv' file containing precursor-level abundances. However, starting from protein-level intensities has the advantage that the effect size estimates obtained from the model agree with protein abundances, a feature frequently requested by biologists.
Supplementary Figure 2 shows the partial area under the curve for proDA and prolfqua, computed by ranking the proteins using the difference among groups (diff), the scaled p-value (scaled.p.value), and the t-statistics (t_statistics). Table 6 shows the results of the bootstrap test for the difference of two ROC curves. We observe that for this benchmark, there are no significant differences (at a significance level of 0.1) among the pAUC_10 of the two packages when the scaled p-value is used to rank the proteins.
Supplementary Table 6: p-values for pairwise comparisons of pAUC_10, for *proDA* and *prolfqua*, CPTAC/MaxQuant data.

Material S6: DEA benchmark IonStar/FragPipeV14/MSstats.tsv

The Rmarkdown file which created these results, with more details about the analysis, can be found here: "DEA benchmark IonStar/FragPipeV14/MSstats.tsv". Since we are using the 'MSstats.tsv' file containing precursor-level abundances as input, we can benchmark the DEA performance of msqrob2, MSstats, prolfqua, and proDA. The dataset has 3899 proteins after filtering for two peptides per protein. Since we make four comparisons between groups, 15596 is the maximum possible number of effect size estimates. Supplementary Figure 3 shows the number of comparisons obtained. Supplementary Figure 4 shows the partial area under the curve, computed by ranking the proteins using the difference among groups (diff), the scaled p-value (scaled.p.value), and the t-statistics (t_statistics).

Material S7: Comparing DEA results for MaxQuant and FragPipe
In the Article, we discussed other factors influencing the performance of the DEA analysis. One of these factors is the quantification software. Here we compare the benchmarking results obtained with MaxQuant and FragPipe. We see (Supplementary Figure 5) that the choice of the quantification software changes the partial area under the curve (pAUC_10) much more strongly (by up to 8%) than the choice of the DEA method (by about 4%). Furthermore, we observe that the DEA benchmark results differ depending on whether 'MSstats.tsv' or 'combined_protein.tsv' is used as input (see Figure 5, left panel, methods proDA and prolfqua).

Material S8: Estimating A_LOD
The main text describes how we estimate the abundance at the detection limit, A_LOD. Supplementary Figure 6 shows the distribution of the average abundance within a group for all proteins. The red density shows the distribution for those proteins without a missing value, and the olive, turquoise, and lilac densities show it for proteins with 1, 2, and 3 missing values, respectively. We see that proteins with three missing values have, on average, lower abundances than those with no missing values. We use the median of the proteins with a single observation per group (the lilac density) as the A_LOD estimate.
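A minimal sketch of this estimate, with invented numbers and column names (assuming four replicates per group, so a single observation corresponds to three missing values):

```r
# Hypothetical per-protein, per-group summary: mean log2 abundance and the
# number of missing values among the replicates (values are made up).
protein_means <- data.frame(
  meanAbundance = c(25.1, 22.3, 18.9, 19.4, 24.0, 18.2),
  nrMissing     = c(0,    1,    3,    3,    0,    3)
)
nr_replicates <- 4

# proteins with a single observed value per group (3 of 4 values missing)
single_obs <- protein_means$nrMissing == nr_replicates - 1

# A_LOD: median of the mean abundances of those proteins
A_LOD <- median(protein_means$meanAbundance[single_obs])
A_LOD
```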

Material S9: The probabilities produced by ROPECA are not p-values
In the main text, we discuss that the properties of the Beta distribution-based probabilities, computed from peptide-level p-values using the ROPECA method, are not well understood. Here we run a simulation experiment which shows that if we start with a dataset where the null hypothesis holds for all peptides, i.e., the p-values are uniformly distributed, the produced Beta distribution-based probabilities are not uniformly distributed.
For all the peptides (nrPep = 10000), H0 is true. We know that if H0 is true, the distribution of the p-values is uniform (Supplementary Figure 7, left panel). Therefore, we can omit the step of generating data under H0 and computing the p-values, and instead simulate the p-values directly by sampling from the uniform distribution (runif(nrPep)).
By sampling 800 protein IDs 10000 times (with replacement), we assign the peptides to proteins ('sample(1:800, size = nrPep, replace = TRUE, ...)'). To make the data more realistic, that is, so that some proteins have many peptides while many proteins have just a few, the density for sampling the protein IDs is exponential (dexp(seq(0, 5, length = 800))). Supplementary Figure 7, center panel, shows the number of proteins as a function of the number of peptides per protein. Afterward, we apply the ROPECA method as discussed in the methods section of the manuscript. Supplementary Figure 7 shows the distribution of the Beta distribution-based probabilities, which is not uniform. Hence, they do not have the same interpretation as p-values, i.e., the probability of falsely rejecting H0 if H0 is true.
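The simulation just described can be sketched as follows. The protein-level probability uses the Beta(j, n - j + 1) distribution of the median order statistic with j = ceiling(n/2), which is our reading of the ROPECA aggregation; it is a sketch, not the reference implementation.

```r
set.seed(42)
nrPep <- 10000
# H0 holds for every peptide, so the p-values are uniform
p_values <- runif(nrPep)

# assign peptides to 800 proteins; the exponentially decaying sampling
# density gives a few proteins many peptides and many proteins only a few
prob <- dexp(seq(0, 5, length = 800))
protein_id <- sample(1:800, size = nrPep, replace = TRUE, prob = prob)

# per protein: median peptide p-value and number of peptides
med_p <- tapply(p_values, protein_id, median)
n_pep <- tapply(p_values, protein_id, length)

# Beta distribution-based probability of the median order statistic
j <- ceiling(n_pep / 2)
beta_prob <- pbeta(med_p, j, n_pep - j + 1)

# hist(beta_prob) visualizes the deviation from the uniform distribution
```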

Material S10: Specifying Contrasts for Models with two Factors and Interaction Term
The following code illustrates how to specify a model with two factors and an interaction term and define the differences we want to estimate. The example dataset was acquired in two batches (p2370, p2691). In each, we measured Yeast samples grown under two different conditions (Glucose and Ethanol).
To estimate treatment differences, we first specify the linear model that explains the observed protein abundances (transformedIntensities) using the explanatory variables condition_ and batch_ and the interaction term condition_:batch_. After fitting the model, we can estimate the treatment difference between the groups Ethanol and Glucose defined by the factor condition_, and similarly between the groups defined by batch_. Furthermore, we can examine differences between Ethanol and Glucose within each batch. We can also assess whether these differences are the same in both batches, i.e., whether there is an interaction between condition_ and batch_. We also show the code to compute the array of weights c used to multiply the model coefficients β.
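The idea behind the weights c can be sketched with base R. The data below are simulated, and the weight vectors correspond to the default treatment coding of lm; prolfqua builds the analogous weights from its contrast specification.

```r
set.seed(7)
# simulated abundances for one protein: 2 conditions x 2 batches x 3 replicates
d <- expand.grid(condition_ = c("Ethanol", "Glucose"),
                 batch_ = c("p2370", "p2691"), rep = 1:3)
d$abundance <- rnorm(nrow(d), mean = ifelse(d$condition_ == "Ethanol", 20, 18))

fit <- lm(abundance ~ condition_ * batch_, data = d)
# coefficients beta: (Intercept), condition_Glucose, batch_p2691,
#                    condition_Glucose:batch_p2691

# weight vectors c (rows); the contrast estimate is c %*% beta
C <- rbind(
  GlucoseVsEthanol_p2370 = c(0, 1, 0, 0),
  GlucoseVsEthanol_p2691 = c(0, 1, 0, 1),
  GlucoseVsEthanol_avg   = c(0, 1, 0, 0.5),
  Interaction            = c(0, 0, 0, 1)
)
estimates <- C %*% coef(fit)
```

For example, the within-batch p2691 difference adds the interaction coefficient to the main condition effect, while the averaged contrast weights the interaction by 0.5.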

Supplementary Table 9: Weights c for each of the contrasts (rows) which will be applied to the model parameters (columns).
Given the weights c, contrasts can also be estimated from model parameters obtained with proDA or msqrob2, which enables us to implement adapters for proDA and msqrob2.
For more details, see the prolfqua vignette: "Modelling dataset with two Factors".

Material S11: Creating a prolfqua configuration
The following code demonstrates how we use prolfqua to analyze protein intensities reported in the FragPipe 'combined_protein.tsv' file. First, we create a tidy table containing the protein abundances by reading the combined_protein.tsv file using tidy_FragPipe_combined_protein. Then, we read the sample annotation from the file annotation.xlsx. Next, we create an AnalysisTableAnnotation R6 object. Bottom-up proteomics data is hierarchical, i.e., a protein has peptides, peptides might be modified, etc. Therefore, the AnalysisTableAnnotation has a hierarchy field storing a list with an entry for each hierarchy level. Since combined_protein.tsv only holds protein-level data, the hierarchy list has one element, and we use it to specify which column contains the protein identifiers. We also need to define which column contains the protein abundances we want to use for the data analysis. Finally, we have to specify which columns contain the explanatory variables of the analysis. The AnalysisTableAnnotation has the field factors, a list with as many entries as explanatory variables. Here we include two explanatory variables: the dilution, specified in the column 'sample', and the run order, stored in the column 'run_ID'.
datadir <- file.path(find.package("prolfquadata"), "quantdata")
inputFragfile <- file.path(datadir, "MSFragger_IonStar2018_PXD003881.zip")
inputAnnotation <- file.path(datadir, "annotation_Ionstar2018_PXD003881.xlsx")
# read input annotation
annotation <- readxl::read_xlsx(inputAnnotation)
protein <- tibble::as_tibble(
  read.csv(unz(inputFragfile, "IonstarWithMSFragger/combined_protein.tsv"),
           header = TRUE, sep = "\t", stringsAsFactors = FALSE))

Supplementary Figure 10: The figure illustrates some of the data analysis procedures that can be carried out with only a few functions from prolfqua. As input, data at different levels from a variety of quantification tools are accepted and, along with an annotation table, configured for prolfqua using either pre-built import functions or a little reshaping. In prolfqua (depending on the input), one can choose roll-up procedures and perform basic steps of mass spectrometry-based data processing, such as quality control, data normalization, and protein aggregation, as well as statistical modeling, hypothesis testing, and sample size estimation. At all levels, one can use functions to show diagnostic plots or inspect individual proteins. Furthermore, prolfqua supports multiple models for differential expression analysis, such as linear (mixed effect) models, and allows specifying multiple contrasts.
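Continuing from the tables read above, the configuration described in this section can be sketched as below. The field and method names follow the prolfqua vignettes at hand and may differ between package versions; the column names ('Protein', 'sample', 'run_ID', 'Intensity') are assumptions that must match the tidied input table.

```r
# create the table annotation (an R6 object)
atable <- prolfqua::AnalysisTableAnnotation$new()
atable$fileName <- "raw.file"                  # column with the file names
# one hierarchy level only, since combined_protein.tsv is protein level
atable$hierarchy[["protein_Id"]] <- "Protein"  # assumed identifier column
# explanatory variables: dilution and measurement order
atable$factors[["dilution."]] <- "sample"
atable$factors[["run"]] <- "run_ID"
# column holding the abundances used for the analysis
atable$set_response("Intensity")

config <- prolfqua::AnalysisConfiguration$new(atable)
```

This configuration fragment, together with the tidy protein table and the annotation, is what the downstream prolfqua analysis steps operate on.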