ComplexBrowser: a tool for identification and quantification of protein complexes in large scale proteomics datasets

We have developed ComplexBrowser, an open source, online platform for supervised analysis of quantitative proteomics data that focuses on protein complexes. The software uses information from the CORUM and Complex Portal databases to identify protein complex components. Based on the expression changes of individual complex subunits across the proteomics experiment, it calculates a Complex Fold Change (CFC) factor that characterises the overall protein complex expression trend and the level of subunit co-regulation. Thus, up- and down-regulated complexes can be identified. It provides interactive visualisation of protein complex composition and expression for exploratory analysis. It also incorporates a quality control step that includes normalisation and statistical analysis based on the limma test. ComplexBrowser performance was tested on two previously published proteomics studies identifying changes in protein expression in human adenocarcinoma tissue and during activation of mouse T-cells. The analysis revealed 1519 and 332 protein complexes, of which 233 and 41 were found to be co-ordinately regulated in the respective studies. The adopted approach provided evidence for a shift to glucose-based metabolism and high proliferation in adenocarcinoma tissues and identified chromatin remodelling complexes involved in mouse T-cell activation. The results correlate with the original interpretation of the experiments and also provide novel biological details about the protein complexes affected. ComplexBrowser is, to our knowledge, the first tool to automate quantitative protein complex analysis for high-throughput studies, providing insights into protein complex regulation within minutes of analysis.
A fully functional demo version of ComplexBrowser v1.0 is available online via http://computproteomics.bmb.sdu.dk/Apps/ComplexBrowser/
The source code can be downloaded from: https://bitbucket.org/michalakw/complexbrowser

Highlights
• Automated analysis of protein complexes in proteomics experiments
• Quantitative measure of the coordinated changes in protein complex components
• Interactive visualisations for exploratory analysis of proteomics results

In brief
ComplexBrowser is capable of identifying protein complexes in datasets obtained from large scale quantitative proteomics experiments. It provides, in the form of the CFC factor, a quantitative measure of the coordinated changes in complex components. This facilitates assessing the overall trends in the processes governed by the identified protein complexes, providing a new and complementary way of interpreting proteomics experiments.



Introduction
Proteomics has become a method of choice for large scale analysis of biological systems.
Recent advancements in multidimensional separation methods, improved instrument speed, sensitivity and resolving power allow for the generation of nearly full proteomes in the scope of 35 hours of analysis, providing quantitative information for more than 12 000 gene products and covering 4-6 orders of magnitude [1,2].
Coverage of nearly complete proteomes raises great research opportunities, but also challenges, especially in the domain of biological interpretation of the results. Currently, this process remains time-consuming and requires considerable expertise.
Commonly used analysis pipelines involve annotating proteins according to their molecular function, cellular component and biological processes, based on information gathered in Gene Ontology (GO) databases [3,4]. GO-term enrichment methods then identify over-represented annotations in the user's data, providing a general understanding of the biological processes affected [5].
Pathway analysis is a different approach for determining protein role, concentrated on its specific biochemical activity. Tools such as IPA®, KEGG or Reactome map proteins to molecular pathways [6][7][8] and visualise processes those gene products are known to be involved in. A clear advantage of this approach is that pathway databases are mostly based on experimental, manually curated data, while the majority of GO-annotations come from in silico predictions and text mining [9].
An alternative practice, often employed in studies of less researched organisms, is protein domain and motif analysis [10]. This strategy utilises sequence alignment and secondary structure prediction tools to find similarity between a protein of interest and better-annotated analogues in other species. Identification of a specific sequence motif enables assigning function to previously undescribed proteins.
Analysis of protein-protein interactions (PPI) is a complementary approach, very often used in parallel with the above methods. Platforms such as STRING [11] utilise information from co-expression studies, cross-species predictions, experimental evidence and literature mining to build protein interaction graphs, in which nodes represent gene products and edges correspond to interactions. These maps facilitate identification of genes involved in similar processes or influenced by common regulators. Platforms for PPI investigation provide comprehensive information on protein function and involvement in various biological processes. However, the enormous knowledge base of these platforms and the number of interactions drawn from large gene/protein lists are often very difficult to handle and interpret.
Protein complexes are molecular machines that perform many of the key biochemical activities essential to the cell, e.g. replication, transcription, translation, cell signalling, cell-cycle regulation and oxidative phosphorylation. Their role in maintaining cell homeostasis and their involvement in disease development [12] indicate that detailed characterization of protein complex expression would be very helpful in understanding the often highly intertwined processes in the cell.
Interrogation of the expression of components of known protein complexes in large scale proteomics data has been performed in a number of studies [13,14] (https://www.biorxiv.org/content/early/2018/07/11/367227). It is now clear that the majority of known protein complexes are translationally and post-translationally regulated and therefore exhibit co-expression when compared across cell types and tissues. However, so far no automated and user-friendly approach for the analysis of complex behaviour in new datasets has been developed.
In this manuscript we present ComplexBrowser, which enables automated and quantitative analysis of protein complexes in proteomics experiments. The software interrogates the CORUM [15] and EBI Complex Portal [16] databases to find known protein complexes present in a given protein list and utilises quantitative proteomics data and factor analysis to summarise the overall expression trends for each complex across the studied biological conditions. The re-analysis of two previously published large scale proteomics datasets shows the high potential of the approach to gain in-depth knowledge about regulation of protein complexes in different biological contexts.

Protein complex databases
The presented software relies on information from two established, manually curated protein complex databases: CORUM [15] and EBI Complex Portal [16], with 2693 and 2454 entries respectively, together covering 22 species (as of 24.05.2018).
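The lookup of known complexes in a protein list can be illustrated with a minimal sketch. It is not the ComplexBrowser implementation: the function name, the `min_subunits` threshold, the coverage score and the toy accessions are all assumptions made for illustration, and the real CORUM and Complex Portal files have their own formats that are not parsed here.

```python
from typing import Dict, List, Set


def find_complexes(quantified: Set[str],
                   complexes: Dict[str, Set[str]],
                   min_subunits: int = 2) -> List[dict]:
    """Return complexes with at least `min_subunits` quantified subunits.

    `min_subunits` is a hypothetical threshold; the source does not state
    which criterion ComplexBrowser applies.
    """
    hits = []
    for name, subunits in complexes.items():
        found = subunits & quantified  # subunits present in the dataset
        if len(found) >= min_subunits:
            hits.append({
                "complex": name,
                "found": sorted(found),
                "coverage": len(found) / len(subunits),
            })
    # Report the best-covered complexes first.
    return sorted(hits, key=lambda h: h["coverage"], reverse=True)


# Toy, made-up complex definitions and accessions (placeholders only).
toy_complexes = {
    "complex_1": {"ACC_A", "ACC_B", "ACC_C"},
    "complex_2": {"ACC_C", "ACC_D"},
    "complex_3": {"ACC_E", "ACC_F", "ACC_G", "ACC_H"},
}
toy_quantified = {"ACC_A", "ACC_B", "ACC_C", "ACC_E"}
toy_hits = find_complexes(toy_quantified, toy_complexes)
```

In this toy case only `complex_1` is reported, because the other two complexes have a single quantified subunit each.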

Software design

Data Input
ComplexBrowser accepts data tables in .csv or .txt format as input. The file must contain single, unique UniProt [17] accessions in the first column and quantitative information in subsequent columns (LFQ or summarised reporter ion intensities for each analysed sample). Optionally, columns with confidence scores resulting from statistical tests can be appended. These values are calculated in relation to the first condition appearing in the input file. If they are not included by the user, differential expression analysis using limma [18] is conducted, along with FDR estimation using the qvalue R package [19].
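The input layout described above can be checked before upload with a short script. The following is a minimal sketch, assuming a comma-separated file with a header row; the function name, the column names and the treatment of empty, "NA" and "NaN" cells as missing values are assumptions, not documented behaviour of ComplexBrowser.

```python
import csv
import io


def read_input_table(handle, delimiter=","):
    """Parse an accession + intensities table.

    Enforces unique accessions in the first column and numeric values in
    the remaining columns. Empty, 'NA' and 'NaN' cells become None
    (an assumption about how missing values are encoded).
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        accession = row[0].strip()
        if accession in table:
            raise ValueError(f"duplicate accession: {accession}")
        values = []
        for cell in row[1:]:
            cell = cell.strip()
            values.append(None if cell in ("", "NA", "NaN") else float(cell))
        if len(values) != len(header) - 1:
            raise ValueError(f"wrong number of columns for {accession}")
        table[accession] = values
    return header[1:], table


# Made-up example: two conditions with two replicates each.
example = (
    "Accession,C1_R1,C1_R2,C2_R1,C2_R2\n"
    "ACC_A,10.5,11.0,20.1,NA\n"
    "ACC_B,5.0,,6.0,6.5\n"
)
columns, table = read_input_table(io.StringIO(example))
```

A tab-delimited .txt file could be read the same way by passing `delimiter="\t"`.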

Quality control
Prior to the analysis of protein complexes, the software creates data visualisations for quality control (QC) evaluation purposes. The QC visualisations include:
• Boxplot of log-transformed intensities to control for inconsistency in e.g. injection amounts.
• Missing value bar plots generated by summing up the number of missing values in each quantitative column to compare protein coverage between samples.
• Pairwise Scatter plots of log-transformed intensities of all proteins quantified in two selected conditions, displaying sample to sample correlations (Pearson, Kendall, Spearman), to test similarity between the samples.
• Histograms of coefficient of variation (CV) of protein absolute intensity measurements within each experimental condition to assess replicate variation.
• Q-value charts counting the number of differentially expressed features in relation to a set threshold.
• Volcano plots to depict the relation between fold-changes and confidence for differentially regulated features.
• PCA results for visual comparison of all samples.
The software also implements four common, previously described normalisation methods [20].
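Two of the QC summaries listed above, the per-sample missing value counts and the within-condition coefficient of variation, reduce to simple arithmetic. The sketch below shows one way to compute them; the function names and the toy intensities are illustrative assumptions, and the plotting step is omitted.

```python
import statistics


def missing_per_sample(rows):
    """Count missing (None) values in each quantitative column."""
    n_cols = len(rows[0])
    return [sum(1 for r in rows if r[c] is None) for c in range(n_cols)]


def cv_per_protein(rows, condition_columns):
    """CV (standard deviation / mean) of each protein within one condition.

    Returns None for proteins with fewer than two valid values or a zero
    mean, for which the CV is undefined.
    """
    cvs = []
    for r in rows:
        vals = [r[c] for c in condition_columns if r[c] is not None]
        if len(vals) < 2 or statistics.mean(vals) == 0:
            cvs.append(None)
        else:
            cvs.append(statistics.stdev(vals) / statistics.mean(vals))
    return cvs


# Made-up absolute intensities: 3 proteins x 4 samples (2 conditions x 2 replicates).
toy_rows = [
    [100.0, 110.0, 90.0, None],
    [50.0, 50.0, None, 50.0],
    [10.0, None, None, 12.0],
]
toy_missing = missing_per_sample(toy_rows)
toy_cvs = cv_per_protein(toy_rows, condition_columns=[0, 1])
```

The CV is computed on absolute rather than log-transformed intensities, in line with the histogram description above.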

Complex expression analysis
ComplexBrowser employs the FARMS algorithm to determine complex fold change (CFC).
FARMS [21] is based on Bayesian factor analysis with an assumption of Gaussian measurement noise. It has previously been used to estimate protein abundance, based on peptide concentrations in the protein summarisation process. It has proven useful in detecting outliers in peptide expression profiles and limiting their influence on protein quantitation [22]. In ComplexBrowser, we employed our own implementation of the FARMS algorithm for performing weighted summarisation of log-transformed expression changes of protein complex subunits. The complex expression is calculated in two steps.
First, scaled intensities of all subunits are summed within each replicate. Second, the sums are averaged to obtain one value of complex expression for each analysed condition. The relative change of complex expression between two given conditions is defined as the complex fold change (CFC). ComplexBrowser also provides a summary measure describing the amount of variability in the expression profile of a given complex. This information is presented as a signal to noise ratio or, in short, noise. A noise of 0 indicates perfect co-expression, while a noise value of 1 points to very poor correlation. The default noise threshold set in the software is 0.5.
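The two-step summarisation can be written out as a short worked example. This is a deliberately simplified, unweighted sketch: it does not implement FARMS or its subunit weights, it assumes the intensities are already scaled and complete, and it expresses the CFC as a plain ratio between two conditions, which is an assumption about the scale on which the relative change is reported.

```python
from statistics import mean


def complex_expression(subunit_intensities, condition_columns):
    """Sum subunits within each replicate, then average the sums."""
    sums = [sum(subunit[c] for subunit in subunit_intensities)
            for c in condition_columns]
    return mean(sums)


def complex_fold_change(subunit_intensities, reference_columns, test_columns):
    """Ratio of complex expression: test condition over reference condition."""
    reference = complex_expression(subunit_intensities, reference_columns)
    test = complex_expression(subunit_intensities, test_columns)
    return test / reference


# Made-up scaled intensities: 3 subunits x 4 samples (2 conditions x 2 replicates).
toy_subunits = [
    [1.0, 1.2, 2.0, 2.2],
    [0.8, 1.0, 1.8, 2.0],
    [1.2, 0.8, 2.2, 1.8],
]
toy_cfc = complex_fold_change(toy_subunits,
                              reference_columns=[0, 1],
                              test_columns=[2, 3])
```

Here the replicate sums are 3.0 in the reference condition and 6.0 in the test condition, giving a twofold change for the toy complex.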

Linearity of complex components co-expression
Building on the idea of using linearity of subunit co-expression as a measure of data quality [23], ComplexBrowser draws supplementary visualisations to investigate co-regulation between different conditions. For a selected protein complex, it takes log-transformed abundances of all its subunits in two conditions and displays them on a scatter plot, where each point corresponds to one protein. Orthogonal distance regression (ODR) is employed to determine the quality of co-expression. Unlike ordinary least squares regression, ODR considers variability in both x and y values and therefore fits a model that minimizes errors of both measurements [24]. The procedure returns a single R² value per complex for each pair of conditions as a measure of co-expression.
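For a straight line in two dimensions, orthogonal regression has a closed-form solution, which the following sketch uses to make the idea concrete. It is not the routine used in ComplexBrowser: the function name is hypothetical, equal error variance in x and y is assumed, and the R² reported by the tool is not reproduced because its exact definition is not given here. The sketch returns the sum of squared perpendicular distances instead.

```python
import math


def orthogonal_fit(x, y):
    """Fit y = slope * x + intercept by minimising perpendicular distances.

    Returns (slope, intercept, sum of squared orthogonal residuals).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    if sxy == 0:
        raise ValueError("slope is undefined when x and y are uncorrelated")
    # Closed-form slope of the orthogonal (total least squares) line.
    slope = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    intercept = my - slope * mx
    # Squared perpendicular distance of each point to the fitted line.
    residuals = [(b - slope * a - intercept) ** 2 / (1 + slope ** 2)
                 for a, b in zip(x, y)]
    return slope, intercept, sum(residuals)


# Made-up log abundances of four subunits in two conditions, lying on y = 2x + 1.
toy_slope, toy_intercept, toy_rss = orthogonal_fit([1.0, 2.0, 3.0, 4.0],
                                                   [3.0, 5.0, 7.0, 9.0])
```

Because both axes carry measurement error, swapping the two conditions gives the same fitted line, which is not the case for ordinary least squares.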

Test datasets
ComplexBrowser performance was tested using protein quantification data from two previously published studies [25,26]. Further on in the manuscript we refer to the studies as adenocarcinoma dataset [26] and T-cell dataset [25].
The adenocarcinoma dataset investigates protein expression differences between formalin-fixed, paraffin-embedded tissue samples from patients with colon cancer in comparison to healthy colon mucosa and nodal metastatic tumours, using label-free quantitation based on LFQ intensities. The MS proteomics data of the adenocarcinoma dataset were obtained from supplementary tables of the original publication available from the publisher's site [26]. For the purpose of this analysis we have discarded samples denoted as "CA2" and "NO2" to ensure an equal number of replicates in each condition.
We have filtered the protein intensities table to retain proteins with at least 4 valid quantitative values within each condition. We then removed isoform identifiers from the original accession numbers and discarded rows with non-unique identifiers. This resulted in a dataset containing LFQ values for 6824 proteins from 3 conditions, 7 biological replicates each. The input file used in this study can be found in supplementary Table S1.
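The preprocessing steps described above can be sketched as follows. This is an illustration rather than the script used for the study: the function names and toy accessions are hypothetical, isoform identifiers are assumed to be a trailing hyphen followed by digits, and "rows with non-unique identifiers" is interpreted as dropping every row whose accession occurs more than once after the isoform suffix is removed.

```python
import re
from collections import Counter


def strip_isoform(accession):
    """Drop a trailing '-<digits>' isoform suffix, e.g. 'BBB222-2' -> 'BBB222'."""
    return re.sub(r"-\d+$", "", accession)


def filter_proteins(rows, conditions, min_valid=4):
    """rows: list of (accession, values); conditions: list of column-index lists.

    Keeps proteins with at least `min_valid` non-missing values in every
    condition, strips isoform suffixes, then removes all accessions that
    are no longer unique (an assumed reading of the procedure).
    """
    kept = []
    for accession, values in rows:
        enough = all(
            sum(values[c] is not None for c in cols) >= min_valid
            for cols in conditions
        )
        if enough:
            kept.append((strip_isoform(accession), values))
    counts = Counter(acc for acc, _ in kept)
    return [(acc, vals) for acc, vals in kept if counts[acc] == 1]


# Made-up table: 2 conditions x 3 replicates, threshold lowered to 2 for brevity.
toy_table = [
    ("AAA111", [1.0, 2.0, None, 1.0, 1.5, 2.0]),
    ("BBB222-1", [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]),
    ("BBB222-2", [1.0, 1.0, 1.0, 2.0, 2.0, None]),
    ("CCC333", [1.0, None, None, 2.0, 2.0, 2.0]),
]
toy_filtered = filter_proteins(toy_table,
                               conditions=[[0, 1, 2], [3, 4, 5]],
                               min_valid=2)
```

In the toy table, `CCC333` fails the valid-value filter and the two `BBB222` isoforms collapse to the same accession, so only `AAA111` is retained.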
The T-cell dataset studies activation of quiescent mouse T cells over four time points (0, 2, 8 and 16 h) in two biological replicates. Proteins were quantified using tandem mass tag (TMT) labelling and were analysed on an Orbitrap Elite MS instrument. The data of the original publication were obtained from the PRIDE database (accession numbers PXD004367 and PXD005492) [27]. The dataset contained normalised intensities for 8431 proteins. The input file used in this study can be found in Table S2.

Software implementation
ComplexBrowser was implemented in R [28]. The user interface was developed using Shiny, Plotly, networkD3, heatmaply, DT and data.table libraries, allowing interactive and adjustable data visualization. PreprocessCore, stringr, pracma, dplyr, limma and qvalue packages were used for data manipulation and statistical analysis.

Results
We have developed ComplexBrowser for analysis of protein complex abundance and co-expression in large scale proteomics experiments. The general analysis pipeline implemented in the program is presented in Fig. 1. An extensive description of ComplexBrowser's procedures and results can be found in supplementary File S1. In brief, a table containing the quantitative information of the identified proteins is uploaded using the web browser interface. After the parameters of the analysis (e.g. number of conditions and replicates) have been defined, the quality of the quantitative data provided is analysed and visualised. In the following window, the presence and changes in abundance of protein complexes are analysed. Interactive tables and graphics allow the user to conveniently evaluate the results of the analysis.
Tables containing results and a summary report are available for download.
To test the performance of the developed platform, we have used two published proteomics studies: adenocarcinoma dataset [26] and T-cell dataset [25]. The results of quality control and protein complex analysis steps are presented below.

Quality control of proteomics data in ComplexBrowser
The adenocarcinoma dataset, containing quantitative proteomics values from 3 biological conditions with 7 replicates each, was uploaded to ComplexBrowser in the following order: C1 - control, C2 - metastasis, C3 - cancer. A summary of the quality analysis step can be found in supplementary File S2.
Analysis of Boxplot graphs of log-transformed intensities, Fig. 2A
Among the top 5 regulated complexes, we have identified overexpression of 3 complexes related to mitotic cell cycle progression and activation: RalBP1-CDC2-CCNB1 complex (7.420 CFC), CDC2-CCNA2-CDK2 complex (6.031 CFC) and Cell cycle kinase complex CDC2 (4.820 CFC). For more details see Table 2 and supplementary Table S4. This points to an increased activation of cell division and replication, as well as reduction in respiration, which are known characteristics of cancer [29,30].
In addition to the previously described results, ComplexBrowser detected a 3.58 fold upregulation of the MTA1 complex, involved in metastatic tumour formation, in nodal tissue samples [31]. The same complex was changed 2.33 fold in main tumour tissue (supplementary Table S4).

Activation of mouse T-cells is reflected in coordinated changes of protein complex components
ComplexBrowser was used to analyse protein complexes during activation of murine T-

Discussion
ComplexBrowser is, to our knowledge, the first automated tool that enables quantitative analysis of protein complexes in proteomics experiments. It is available through a web browser and does not require any installation or programming experience. Thus, it has high potential for integration into data analysis workflows commonly used by the scientific community. ComplexBrowser relies on information stored in the CORUM [15] and Complex Portal [16] databases and is therefore dependent on the efforts of their administrators. The composition of these resources introduces a bias in the analysis, since the largest proportion of complexes described in both databases is of human origin (66.36% of CORUM and 25.79% of Complex Portal). Thus, currently, ComplexBrowser is most suitable for analysis of human proteins. This is visible when comparing the number of proteins found to be involved in complexes in the adenocarcinoma (human) and T-cell (mouse) datasets (Table 1). Additionally, the databases contain entries that are not fully annotated. Further developments of the databases will improve the results provided by the software.

ARW was supported by a grant from the Independent Research Fund Denmark - Natural Sciences and by a VILLUM Foundation grant to the VILLUM Center for Bioanalytical Sciences at SDU. VS was supported by ELIXIR DK. WM was supported by a student grant from MC2 Therapeutics ApS. We thank Ole N. Jensen for his critical comments on the project and manuscript.