protti: an R package for comprehensive data analysis of peptide- and protein-centric bottom-up proteomics data

Abstract
Summary: We present a flexible, user-friendly R package called protti for comprehensive quality control, analysis and interpretation of quantitative bottom-up proteomics data. protti supports the analysis of protein-centric data, such as those associated with protein expression analyses, as well as peptide-centric data, such as those resulting from limited proteolysis-coupled mass spectrometry analysis. Due to its flexible design, it supports analysis of label-free, data-dependent, data-independent and targeted proteomics datasets. protti can be run on the output of any search engine or software package commonly used for bottom-up proteomics experiments, such as Spectronaut, Skyline, MaxQuant or Proteome Discoverer, provided that the output is exported in table format.
Availability and implementation: protti is implemented as an open-source R package. Release versions are available via CRAN (https://CRAN.R-project.org/package=protti) and work on all major operating systems. The development version is maintained on GitHub (https://github.com/jpquast/protti). Full documentation, including examples, is provided in the form of vignettes on our package website (jpquast.github.io/protti/).


Introduction
With novel and evolving proteomics methods such as limited proteolysis coupled to mass spectrometry (LiP-MS; Feng et al., 2014), N-terminomics, phosphoproteomics and other post-translational modification-centric mass spectrometry-based experiments, all of which require peptide-centric data analysis, there is a growing need for flexible software tools to analyse bottom-up proteomics data. Data structures of such bottom-up proteomics experiments are diverse, and users often require specific workflows. Existing bottom-up proteomics R packages fall short in one of two ways. Some, such as packages of the MSstats family (Choi et al., 2014), offer fixed analysis pipelines and are therefore not easily adapted to specific user needs. Others, namely DEP (Zhang et al., 2018), msmsEDA (Gregori et al., 2021), MSnbase (Gatto et al., 2021), TPP (Childs et al., 2021), HaDeX (Puchala et al., 2020) and NormalyzerDE (Willforss et al., 2019), offer a limited set of functions and are therefore not suited for comprehensive peptide-level analysis, especially of LiP-MS data. Finally, few available tools offer functions for quality control, data analysis and data interpretation in one package.
To fill this gap, we developed protti, a user-friendly R package for label-free bottom-up proteomics that facilitates the analysis of data-dependent, data-independent and targeted mass spectrometry data on the protein, peptide, precursor or fragment level. protti provides flexible functions for quality control, data filtering, differential quantification and dose-response analysis that can be tailored to a user's needs, as well as functions that support biological interpretation, such as functional enrichment and network analysis. protti follows the design principles of the tidyverse package family (Wickham et al., 2019), which makes it accessible and easy to modify even for inexperienced R users; the code is written with the novice user in mind and is carefully documented. Due to its modular design, most functions can be used independently of each other and can be applied to input data from many sources. Unlike proteomics tools such as proteosign (Efstathiou et al., 2017), Perseus (Tyanova and Cox, 2018), Prostar (Wieczorek et al., 2017), DAnTE (Karpievitch et al., 2009), PIQMIe (Kuzniar and Kanaar, 2014), StatQuant (van Breukelen et al., 2009), LFQ-Analyst (Shah et al., 2020), ProtExA (Minadakis et al., 2020) and MSqRob (Goeminne et al., 2018), protti does not provide a graphical user interface; this aids the seamless integration of protti into any R data analysis workflow.

Overview
protti provides a wide range of functions (Fig. 1). Briefly, the user can assess and visualize data quality and perform median normalization to correct for unequal sample amounts. Protein abundances can be inferred from precursor or peptide intensities based on methods adapted from the MaxLFQ algorithm (Cox et al., 2014) and as previously implemented in the R package iq (Pham et al., 2020). Differential abundance calculation in conjunction with statistical testing can be applied to any case-control dataset. Statistical testing can be performed using a standard Welch's t-test, ANOVA, a moderated t-test based on the limma R package (Ritchie et al., 2015) or the proDA R package (Ahlmann-Eltze, 2020). Dose-response experiments can be analysed by fitting four-parameter log-logistic regression models to the data based on functions from the drc R package (Ritz et al., 2015).
Lists of significantly altered proteins obtained with any of the described workflows can be further analysed using functional (Ashburner et al., 2000) or pathway enrichment analyses, as well as a network analysis tool based on the STRINGdb R package (Szklarczyk et al., 2019). Furthermore, to obtain additional information on proteins, including PTMs, subcellular location, PDB structures and interaction partners, data can be annotated using information imported from several databases: UniProt (The UniProt Consortium, 2021), KEGG (Kanehisa and Goto, 2000), MobiDB (Piovesan et al., 2021) and ChEBI (Hastings et al., 2016). With the respective fetch functions, the user can load this information directly into R. Functions for the analysis of peptide-centric data from LiP-MS experiments allow visualization of differentially abundant peptides along the protein sequence.
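Two of the steps named above, median normalization and the four-parameter log-logistic model, can be illustrated with a minimal sketch. It is written in plain Python rather than R and does not use or reproduce the protti or drc APIs; the function names `median_normalise` and `log_logistic_4p` are hypothetical. The normalization assumes log2-transformed intensities and assumes that each sample is shifted so that its median equals the median of all sample medians; the exact definition used by the package should be checked in its documentation. The curve uses the common parameterization with slope b, lower limit c, upper limit d and inflection point e.

```python
from math import exp, log
from statistics import median


def median_normalise(log2_intensities):
    """Shift each sample so its median equals the median of all sample medians.

    log2_intensities: dict mapping sample name -> list of log2 intensities,
    where None marks a missing value (assumed layout, for illustration only).
    """
    sample_medians = {
        sample: median(v for v in values if v is not None)
        for sample, values in log2_intensities.items()
    }
    global_median = median(sample_medians.values())
    return {
        sample: [
            None if v is None else v - sample_medians[sample] + global_median
            for v in values
        ]
        for sample, values in log2_intensities.items()
    }


def log_logistic_4p(x, b, c, d, e):
    """Four-parameter log-logistic curve.

    b: slope, c: lower limit, d: upper limit, e: inflection point (x > 0, e > 0).
    """
    return c + (d - c) / (1 + exp(b * (log(x) - log(e))))


# Toy data: three samples with different overall intensity levels.
data = {
    "sample_1": [10.0, 12.0, 14.0],
    "sample_2": [11.0, 13.0, None, 15.0],
    "sample_3": [9.0, 11.0, 13.0],
}
normalised = median_normalise(data)

# At x = e the modelled response lies midway between the limits c and d.
half_response = log_logistic_4p(10.0, b=1.5, c=0.0, d=100.0, e=10.0)
```

After normalization all three toy samples share the same median, so remaining differences between samples are no longer driven by a constant offset in loaded amount. Fitting the curve parameters to measured data requires a nonlinear optimizer, which is left out of this sketch.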
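The idea behind an enrichment analysis of significantly altered proteins can be sketched as follows. This is an illustration in plain Python, not protti's implementation: it assumes a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test) for a single annotation term, and the function name `enrichment_p_value` and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(n_total, n_annotated, n_significant, n_overlap):
    """Probability of observing at least n_overlap annotated proteins among
    n_significant proteins drawn without replacement from n_total detected
    proteins, of which n_annotated carry the annotation term."""
    upper = min(n_annotated, n_significant)
    favourable = sum(
        comb(n_annotated, k) * comb(n_total - n_annotated, n_significant - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_total, n_significant)


# Hypothetical counts: 1000 detected proteins, 50 annotated with a term,
# 100 significantly altered, 15 of which carry the term (5 expected by chance).
p_enriched = enrichment_p_value(1000, 50, 100, 15)
p_expected = enrichment_p_value(1000, 50, 100, 5)
```

In practice one such test is run per annotation term, so the resulting p-values need to be corrected for multiple testing before interpretation.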

Applications
protti can be applied to any bottom-up label-free proteomics dataset. In principle, the table-formatted output of any quantitative proteomics data analysis software is supported if it contains information on sample, condition, intensity, protein ID (UniProt ID) and, if the intensity is not a protein intensity, the feature that the measured intensity reports on (fragment, precursor or peptide). The quality control, data analysis and data interpretation modules of protti can be used serially or independently.
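The input requirements listed above amount to a long-format table with one row per measured intensity. A minimal sketch of such a table, together with a check for the required columns, is given below in plain Python. The column names (`sample`, `condition`, `intensity`, `protein_id`, `precursor`), the example values and the helper `missing_columns` are hypothetical and chosen only for illustration; real exports from search engines use their own column names.

```python
import csv
import io

# Hypothetical names for the information every input table must carry.
REQUIRED_COLUMNS = {"sample", "condition", "intensity", "protein_id"}


def missing_columns(header, feature_column=None):
    """Return the required columns absent from a table header.

    feature_column names the fragment, precursor or peptide column and is
    only needed when intensities are not reported on the protein level.
    """
    required = set(REQUIRED_COLUMNS)
    if feature_column is not None:
        required.add(feature_column)
    return sorted(required - set(header))


# Toy precursor-level export with one row per sample and precursor.
example = io.StringIO(
    "sample,condition,precursor,protein_id,intensity\n"
    "run_01,control,PEPTIDEK.2,P12345,20.1\n"
    "run_02,treated,PEPTIDEK.2,P12345,21.4\n"
)
reader = csv.DictReader(example)
missing = missing_columns(reader.fieldnames, feature_column="precursor")
rows = list(reader)
```

An empty `missing` list indicates that the table carries all the information named above; a non-empty list names what would have to be added or renamed before analysis.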

Implementation
The modularity of protti enables its use in conjunction with user-defined workflows and functions. protti is written as simply as possible using tidyverse packages (Wickham et al., 2019), making its source code easy to understand and adapt or extend, even for inexperienced R users.

Conclusions
We describe protti, a flexible, user-friendly R package for analysis of various bottom-up proteomics datasets. protti provides functions for quality control, data analysis on protein, peptide, precursor or fragment levels, as well as data interpretation within a single, easy-to-use package. We include examples of data analysis workflows as vignettes to illustrate the utility of the different functions for the analysis of protein- or peptide-centric datasets.