PBLMM: Peptide‐based linear mixed models for differential expression analysis of shotgun proteomics data

Here, we present PBLMM, a peptide‐based linear mixed models tool provided as a standalone desktop application for differential expression analysis of proteomics data. We also provide a Python package that allows streamlined data analysis workflows implementing the PBLMM algorithm. PBLMM is easy to use without scripting experience and calculates differential expression by peptide‐based linear mixed regression models. We show that peptide‐based models outperform classical methods of statistical inference of differentially expressed proteins. In addition, PBLMM exhibits superior statistical power in situations of low effect size and/or low sample size. Taken together, our tool provides an easy‐to‐use, high‐statistical‐power method to infer differentially expressed proteins from proteomics data.


| INTRODUCTION
Advances in mass spectrometry instrumentation now allow for the quantification of multiple peptides per protein (up to a few hundred) in shotgun proteomics experiments. The quantification accuracy of different peptides during mass spectrometry runs is highly dependent on the physical properties of the individual peptides and might differ from run to run due to technical constraints. Therefore, outlier peptides can bias the quantification of the whole protein and need to be carefully assessed during differential expression analysis. However, in most downstream workflows for differential expression analysis, the peptide-level information is ignored and peptide intensities are summed to a single protein quantification that is used for statistical analysis. 1,2 While these workflows are commonly applied and represent easy and intuitive methods for statistical analysis, they discard the information obtained on the peptide level, which leads to a loss of statistical power. 3 In recent years, the first approaches using linear models for differential expression analysis have been transferred from microarray experiments to proteomics. 4–7 These are capable of carrying out statistical analyses on the peptide level. However, only a few packages use some of the peptide information for statistical analysis 8,9 and these are often only available for certain workflows, such as label-free quantification. 9 Linear mixed models (LMMs) in general have been suggested and used for statistical inference before (notably, the exact model differs from study to study). Although they have shown great statistical power, 3,6,8,9 no easy-to-use package or standalone application that is also applicable to multiplexed proteomics has been published so far.
To solve this issue, we present peptide-based linear mixed models (PBLMM), a standalone desktop application for differential protein expression analysis from shotgun proteomics experiments, especially those applying isobaric labelling, such as tandem mass tags (TMT) or iTRAQ. We compared PBLMM to currently used tools, such as MSstatsTMT 10 or limma, 4 and found that PBLMM provides statistical benefits over these methods, depending on the use case.

| Statistical model of PBLMM
In the implemented statistical model, the expression of each protein is modelled separately by a linear mixed-effects model:

$$y_{i,j} = \beta_0 + \beta X + u_i + \varepsilon_{i,j}$$

where $y_{i,j}$ denotes the $j$th measurement of expression of peptide $i$, $\beta_0$ is the individual protein's global intercept, $\beta X$ is the linear combination of indicator variables encoding categorical experimental conditions, $u_i$ is the additive random intercept of peptide $i$ with $u_i \sim N(0, \sigma^2_{Peptide})$, and $\varepsilon_{i,j}$ are residual errors with $\varepsilon_{i,j} \sim N(0, \sigma^2_{\varepsilon})$. Note that this collapses to ordinary linear regression when a protein is represented by a single peptide only.
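The role of the peptide random intercept can be illustrated with a small sketch. It is not the PBLMM implementation: the function names, the two-level condition coding and the example values are assumptions made for illustration. Data are generated from the model above, and the treatment coefficient is recovered by centring each peptide on its own mean. In a balanced design (every peptide measured in every sample) this simple estimate should coincide with the fixed-effect estimate of a random-intercept model; an actual mixed-model fit would additionally estimate the variance components and provide p values.

```python
from statistics import mean


def simulate_protein(beta0, beta, peptide_offsets, conditions, noise=None):
    """Generate (peptide, condition, log2 intensity) rows from
    y_ij = beta0 + beta * x_j + u_i + eps_ij, with x_j = 1 for 'treated'."""
    rows = []
    for i, u_i in enumerate(peptide_offsets):
        for j, cond in enumerate(conditions):
            eps = 0.0 if noise is None else noise[i][j]
            x_j = 1.0 if cond == "treated" else 0.0
            rows.append((f"pep{i}", cond, beta0 + beta * x_j + u_i + eps))
    return rows


def treatment_effect(rows):
    """Estimate the treatment coefficient after removing each peptide's mean,
    which cancels the peptide-specific intercept u_i."""
    by_peptide = {}
    for pep, cond, y in rows:
        by_peptide.setdefault(pep, []).append((cond, y))
    centred = {"control": [], "treated": []}
    for values in by_peptide.values():
        pep_mean = mean(y for _, y in values)
        for cond, y in values:
            centred[cond].append(y - pep_mean)
    return mean(centred["treated"]) - mean(centred["control"])


# Hypothetical protein with three peptides of very different baseline intensity
conditions = ["control", "control", "treated", "treated"]
rows = simulate_protein(beta0=20.0, beta=1.0,
                        peptide_offsets=[-2.0, 0.5, 3.0],
                        conditions=conditions)
print(round(treatment_effect(rows), 6))
```

Although the three peptides differ by several log2 units in baseline intensity, the estimated treatment effect is unaffected, because the peptide offsets are absorbed by the peptide-specific term.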
In the absence of further experimental conditions, the variance of the response variable $y_{i,j}$ can be described by the sum of the variance components ($\sigma^2$) peptide, technical replicate (TechRep), and multiplex, as well as the unexplained residual variance $\sigma^2_{\varepsilon}$:

$$\mathrm{Var}(y_{i,j}) = \sigma^2_{Peptide} + \sigma^2_{TechRep} + \sigma^2_{Multiplex} + \sigma^2_{\varepsilon}$$

The corresponding variance components drop out when no technical replicates and/or no different multiplexes are present. This setup makes PBLMM aware of most experimental designs that are commonly used in proteomics.
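For the simplest case (no technical replicates, a single multiplex, no treatment effect), the split into peptide and residual variance can be sketched with classical method-of-moments (one-way ANOVA) estimators. This is an illustrative stand-in, not the estimation procedure used by PBLMM, which is not detailed here; mixed-model software typically uses (restricted) maximum likelihood instead. The function name and input layout are assumptions.

```python
from statistics import mean


def variance_components(peptide_values):
    """Method-of-moments estimates for a balanced one-way random-effects layout.

    peptide_values: list of equally long lists of log2 intensities,
    one inner list per peptide of the same protein.
    """
    k = len(peptide_values)
    n = len(peptide_values[0]) if k else 0
    if k < 2 or n < 2 or any(len(v) != n for v in peptide_values):
        raise ValueError("need a balanced layout with >= 2 peptides and >= 2 measurements")
    pep_means = [mean(v) for v in peptide_values]
    grand_mean = mean(pep_means)
    # Mean squares between and within peptides
    ms_between = n * sum((m - grand_mean) ** 2 for m in pep_means) / (k - 1)
    ms_within = sum((y - m) ** 2
                    for v, m in zip(peptide_values, pep_means)
                    for y in v) / (k * (n - 1))
    # E[MS_between] = sigma2_residual + n * sigma2_peptide; truncate at zero
    sigma2_peptide = max(0.0, (ms_between - ms_within) / n)
    return {"peptide": sigma2_peptide,
            "residual": ms_within,
            "total": sigma2_peptide + ms_within}


print(variance_components([[1.0, 3.0], [5.0, 7.0]]))
```

In this toy input most of the total variance is attributed to differences between peptides rather than to residual noise, which is the situation in which modelling the peptide term explicitly is most useful.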
Since the input matrix is log2 transformed, the treatment coefficients from the models can be interpreted as log2 fold changes, and p values for the main treatment effects can be extracted. The null hypothesis tested is that the coefficient of the tested term equals zero, that is, that the factor does not explain the protein expression. Therefore, a low p value indicates that the factor is relevant and that the observed data are unlikely under a fold change of 0 for the tested treatment. To control the false discovery rate (FDR), we applied multiple testing correction by the Benjamini-Hochberg FDR method. 11 The application automatically calculates the differential analysis between all possible condition pairs.
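The Benjamini-Hochberg adjustment mentioned above can be written in a few lines. The following is a generic sketch of the standard step-up procedure, not code taken from PBLMM; the function name is an assumption.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    # Indices of the p values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# One p value per protein (hypothetical values)
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Proteins whose adjusted p value falls below the chosen FDR cut-off (for example 0.05) would then be reported as differentially expressed.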

| RESULTS
We created two implementations of the PBLMM statistical model described above (Figure 1A): (i) for user-friendly analysis, we implemented the PBLMM algorithm in a standalone desktop application with a graphical user interface and (ii) a Python package containing additional parameters, processing steps and pipelining features to facilitate customisable data analysis workflows.
To test how PBLMM performs compared to other commonly used methods, we created a ground truth dataset, consisting of a TMTpro 12-plex containing spiked-in Escherichia coli digests in a human background in different known ratios (Figure 1B). We then performed differential expression analysis using different classical protein-level statistics and PBLMM: (i) sum-based protein rollup (peptides are summed for each protein before statistical inference), followed by Student's t-test, (ii) sum-based protein rollup, followed by limma, 4 and (iii) PBLMM. Receiver operating characteristic (ROC) curve analysis showed that PBLMM outperforms the alternative classic methods on our test dataset (Figure 1B). Notably, we found that the difference between statistical approaches was less pronounced with very high effect sizes (e.g., fourfold). We thus performed our analysis with lower effect sizes throughout this study.
[Figure caption: Receiver operating characteristic curves of p values generated by different statistical tools/tests as predictors of species. Statistics for samples with twofold changes were used for this graph.]
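The ROC comparison uses p values as predictors of species: in the spike-in design, E. coli proteins are the true positives and human proteins the true negatives. A minimal sketch of the corresponding area under the ROC curve is given below, using the pairwise (Mann-Whitney) formulation. The function name and the example values are assumptions; the evaluation code used for the figures is not shown here.

```python
def roc_auc_from_p_values(p_values, is_spiked):
    """Area under the ROC curve when small p values should flag spiked-in proteins.

    Equals the probability that a randomly chosen spiked-in protein has a
    smaller p value than a randomly chosen background protein (ties count 0.5).
    """
    positives = [p for p, spiked in zip(p_values, is_spiked) if spiked]
    negatives = [p for p, spiked in zip(p_values, is_spiked) if not spiked]
    if not positives or not negatives:
        raise ValueError("need at least one spiked-in and one background protein")
    wins = 0.0
    for p_pos in positives:
        for p_neg in negatives:
            if p_pos < p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical p values: two E. coli proteins followed by three human proteins
p_vals = [0.001, 0.02, 0.5, 0.01, 0.8]
spiked = [True, True, False, False, False]
print(roc_auc_from_p_values(p_vals, spiked))
```

An AUC of 1 would mean that every spiked-in protein ranks ahead of every background protein, whereas 0.5 corresponds to random ordering.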
2.1 | PBLMM shows advantages with small effect size and low replicate number
While methods using protein-based statistics exhibit high statistical power in situations where enough replicates are present to sufficiently estimate variances of indicator variables, they inherently lack statistical power in experimental designs that rely on low numbers of replicates or study low fold changes across conditions. In these cases, the use of peptide-level data provides additional statistical power. For each sample and protein, multiple peptides are measured, giving a more complete view of the different variance sources, biological or technical. We tested PBLMM against the current state-of-the-art tool MSstatsTMT 5,10 (Figure 2A). MSstatsTMT uses flexible LMMs to infer differential expression between biological conditions from TMT data. However, the LMMs are fitted on protein levels, previously inferred from an additive linear model on peptide-level data. Here, we used our fractionated ground truth dataset from before and validated the statistical power of both tools. In the ROC analysis, both tools show comparable performance. However, when we applied standard FDR cut-offs, such as 0.05 or 0.01 (either alone or in combination with additional fold change cut-offs), protein-level statistics failed to detect any significantly changed proteins at a small effect size of 1.5-fold (Figure 2B). This effect has already been discussed by others and represents a common problem in proteomics experiments. 12 In contrast, our peptide-based model, owing to its enhanced statistical power, was able to detect several hundred significantly changed proteins. While both methods performed comparably at a higher effect size with three replicates, we also observed more pronounced differences during the analysis of only two replicates (Figure 2C). Here, only the peptide-based model was able to correctly detect differentially expressed proteins, although with a slightly inflated empirical FDR that could be easily controlled by applying additional fold change cut-offs. Since large experiments with tens to hundreds of conditions are becoming more and more popular and naturally suffer from lower replicate numbers, 13,14 statistical models for their analyses become increasingly important.
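With a spike-in ground truth, the empirical FDR at a given cut-off is the share of reported hits that belong to the unchanged background. The sketch below shows how an additional fold change cut-off can lower it. The tuple layout, the species labels and the example values are assumptions for illustration and are not taken from the study data.

```python
def empirical_fdr(results, fdr_cutoff=0.05, min_abs_log2fc=0.0):
    """Count hits and compute the fraction of hits that are background proteins.

    results: iterable of (species, adjusted_p, log2_fold_change) tuples, where
    species is 'ecoli' (spiked in, truly changed) or 'human' (background).
    """
    hits = [r for r in results
            if r[1] <= fdr_cutoff and abs(r[2]) >= min_abs_log2fc]
    if not hits:
        return 0, 0.0
    false_hits = sum(1 for species, _, _ in hits if species == "human")
    return len(hits), false_hits / len(hits)


# Hypothetical results for a 1.5-fold spike-in (log2 fold change of about 0.58)
results = [
    ("ecoli", 0.001, 0.60),
    ("ecoli", 0.020, 0.55),
    ("human", 0.030, 0.10),   # false positive with a small fold change
    ("human", 0.400, 0.02),
    ("ecoli", 0.200, 0.50),   # missed true positive
]
print(empirical_fdr(results, fdr_cutoff=0.05))
print(empirical_fdr(results, fdr_cutoff=0.05, min_abs_log2fc=0.3))
```

In this toy example the background protein passes the p value cut-off but not the fold change cut-off, so combining both filters removes the false positive without losing true hits.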
We next hypothesised that the estimation of technical variation between runs can also benefit from peptide data. Thus, we generated another ground truth dataset and measured the resulting multiplex three times as technical replicates (Figure 3A). While the differentially expressed proteins clearly separated from the background (i.e., human proteins) when analysing differences inside one multiplex (Figure 3B), the data points started to converge when looking at the data across the different replicate injections (Figure 3C). The technical variation of the background proteins between MS runs was higher than the effect size, thus masking the real effects. Linear models are able to separate these effects and estimate the technical and biological variance separately, leading to high statistical power. When we applied our peptide-based model in conjunction with internal reference scaling preprocessing, 15 we observed an improvement in statistical power compared to the protein-level quantification, although nine replicates (three replicates in each of the three MS runs) were present for each condition (Figure 3D).
Taken together, we found that PBLMM performed comparably to other state-of-the-art statistical tools and provides distinct advantages in situations commonly observed in biological or medical experimental designs, such as experiments with low effect sizes or low numbers of replicates. These advantages are beneficial for large-scale experiments, which would not be feasible with the currently required high number of replicates, and for conditions with small effect sizes.
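The idea behind internal reference scaling can be sketched as follows: for each protein (or peptide), every MS run is multiplied by a factor that brings its reference signal to the geometric mean of the reference signals across runs. Which channels served as the reference in this study is not stated here, so the `reference_channels` argument, the function name and the example intensities are assumptions; the sketch operates on raw (not log2 transformed) intensities.

```python
from math import exp, log


def irs_scale(runs, reference_channels):
    """Scale the raw intensities of one protein so that run-to-run offsets cancel.

    runs: list of lists, one inner list of channel intensities per MS run.
    reference_channels: indices of the channels used as internal reference
    (hypothetical choice; depends on the experimental design).
    """
    refs = [sum(run[c] for c in reference_channels) / len(reference_channels)
            for run in runs]
    if any(r <= 0 for r in refs):
        raise ValueError("reference intensities must be positive")
    # Geometric mean of the per-run reference intensities
    target = exp(sum(log(r) for r in refs) / len(refs))
    return [[value * target / ref for value in run]
            for run, ref in zip(runs, refs)]


# Same sample measured twice; the second run is twice as intense overall
runs = [[100.0, 200.0, 50.0],
        [200.0, 400.0, 100.0]]
scaled = irs_scale(runs, reference_channels=[0])
print(scaled)
```

After scaling, the two injections agree channel by channel, so the remaining differences between conditions are no longer dominated by the run-to-run offset.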

| DISCUSSION
Statistical data analysis of proteomics experiments needs adjustments for each experiment since it is heavily influenced by multiple parameters. The effect size is not only determined by the biological effects but also strongly affected by instrumentation. Advances in instrumentation are occurring rapidly and many different instruments are used in proteomics facilities worldwide, each influencing the effect size and technical variance in its own way. In addition, the applied measurement methods, offline fractionation, and sample preparation vary considerably across laboratories. Therefore, each experiment may have its own statistical needs that may not be covered by a single tool. Consequently, we think that the choice of statistical tool should ideally depend on the experimental setup.
LMMs have proven to be able to account for several of these aspects, like technical or subject variance, and therefore provide more advanced hypothesis testing. We showed that PBLMM is able to perform similarly to or better than state-of-the-art methods like MSstatsTMT, depending on the experimental setup. Strikingly, we found that, for several scenarios, using the peptide information directly for differential expression analysis strongly enhanced the statistical power while maintaining a low FDR. In particular, testing biological conditions with low effect sizes (as observed in mild treatments) or a low number of replicates (as observed in high-throughput experiments) benefited from the additional information provided by the individual peptides. We anticipate that the power of statistical tools might not be adequately reflected solely by ground truth datasets; however, together with purely simulated datasets, these datasets represent the only source from which statistical power and accuracy can be quantified. Importantly, PBLMM is easily accessible as an interactive desktop tool and can thus be used broadly in different analysis pipelines and for all different types of input data.