A benchmark dataset for analyzing and visualizing the dynamic epiproteome

In this paper, we present a benchmark dataset to evaluate the currently available analysis methods and visualizations for epiproteomic data. The benchmark dataset is a subset of a high-throughput time-series study of phosphoevents occurring upon insulin stimulation. Our dataset is provided in multiple formats for use with four currently available tools. We also provide a file containing the kinase assignments for the sites, as well as a simple Kappa model of phosphorylation changes in insulin signalling. The tools, their analysis methods, and the visualizations generated using the input files described here are discussed in detail in the accompanying review titled “Visualization and analysis of epiproteome dynamics” [1].


Value of the data
We present a benchmarking dataset to evaluate tools, analysis methods and visualizations for the dynamic epiproteome. The phosphosites in this dataset are well studied; they have been the subject of multiple prior publications. We provide these data converted into the required input formats for four currently available tools. These files can be used to recreate visualizations discussed in the originating paper.
We provide a small Kappa model on the subject of these phosphosites. Models such as these can be used to validate our understanding of the processes in the underlying cellular system.

Experimental design, materials and methods
Humphrey et al. [2] carried out a comprehensive study measuring phosphorylation changes in response to insulin signalling in 3T3-L1 mouse cells. They published a dataset containing 37,248 distinct phosphorylation events involving 5705 distinct proteins across 9 time points (including basal). The incidence of similar datasets, measuring dynamic epiproteomic changes, is increasing [3-5].
Thus, from Humphrey et al. we chose 103 profiles as a benchmarking dataset for evaluating the analyses and visualizations possible through available tools for epiproteomic data. Such a selection had several advantages: a dataset of 103 profiles is easily managed compared to the total of 37,248. Furthermore, the phosphosites in this dataset are well studied, as they have been the subject of Figure 5 in Humphrey et al. and of subsequent publications [6], where the signalling networks underlying these observed data have been well documented and the kinases responsible for the phosphorylations have been assigned (provided in the file Kinase_assignments.xlsx).
In the originating paper [1], we undertook a review of the various tools and methods for visualizing and analyzing dynamic epiproteomic datasets. We utilized the benchmark dataset for this purpose by converting it into the formats required by the tools. Thus, in this paper, we provide six input files that can be used to create visualizations with five tools. The 103 profiles from Humphrey et al. are provided in the file Original.txt. This dataset was converted to various input formats for use with the tools DiBS, DynaPho, PhosphoPath and PHOXTRACK. Additionally, to demonstrate the use of mathematical modelling to evaluate our understanding of these data, we created a sample Kappa model of phosphorylation changes in insulin signalling by surveying the literature, and provide it for use with Kappa tools. The input files for these five tools are described in the sections below.

DiBS
Input for this tool is provided in the file Dibsvis.csv. It is a simple CSV file containing the columns 'Gene name', 'Phosphorylated amino acid', 'IPI position' (where IPI is the International Protein Identifier) and the abundance ratios at 8 time points.
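As an illustration of the layout described above, the following Python sketch reads a DiBS-style CSV and separates the three annotation columns from the abundance ratios. It is not part of the published dataset or of DiBS itself. The time-point column headers (t1 to t8), the gene name GeneA and all numeric values are hypothetical placeholders, and the sketch assumes only that every column other than the three annotation columns holds an abundance ratio.

```python
import csv
import io

# Annotation columns as named in the text; all other columns are
# assumed to hold abundance ratios (one per time point).
ANNOTATION_COLUMNS = ["Gene name", "Phosphorylated amino acid", "IPI position"]


def read_dibs_rows(handle):
    """Yield (annotation dict, list of float ratios) for each CSV row."""
    reader = csv.DictReader(handle)
    ratio_columns = [c for c in reader.fieldnames if c not in ANNOTATION_COLUMNS]
    for row in reader:
        annotation = {c: row[c] for c in ANNOTATION_COLUMNS}
        ratios = [float(row[c]) for c in ratio_columns]
        yield annotation, ratios


# Hypothetical one-row example: headers t1..t8 and all values are made up.
example = io.StringIO(
    "Gene name,Phosphorylated amino acid,IPI position,t1,t2,t3,t4,t5,t6,t7,t8\n"
    "GeneA,S,10,0.1,0.9,1.8,2.0,2.1,2.0,1.9,1.7\n"
)
rows = list(read_dibs_rows(example))
```

To read the provided file instead, the same function could be called on `open("Dibsvis.csv", newline="")`, assuming its header row uses the annotation column names quoted above.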

DynaPho
Input for this tool is provided in the file Dynapho.txt. It is a tab-separated file containing the human UniProt identifier, site (phosphosite position), residue (amino acid), 13-amino acid peptide sequence, and the abundance ratios at 8 time points. All the columns were directly extracted from the original dataset, except for the UniProt identifier and site. DynaPho only accepts human data; therefore, the mouse proteins and phosphosites were converted to their human equivalents by performing a BLAST search [7] of each mouse protein against all human proteins in UniProt.

PhosphoPath
Two input files are required for creating a visualization using this tool. Thus, two files are provided, namely, PhosphoPath_network.txt and PhosphoPath_timeseries.txt.
The file PhosphoPath_network.txt provides a network structure for representing a protein and its phosphosites. Since PhosphoPath is a plugin for Cytoscape, where a network is represented in the form 'source node' and 'target node', the same structure is used here: the protein node as a source node and the site node as a target node, in columns 1 and 2, respectively. Columns 3 and 4 contain the site node display name and the peptide number, respectively.
The file PhosphoPath_timeseries.txt contains time series values, which are used to build the heatmap in the visualization (see Figure 5 in the originating article). These data are in two columns, where the first column is in the format "UniProt Id-Residue Site-Peptide Id-Time point number", and the second column contains the quantified abundance ratio at that particular time point for the phosphosite. We also provide a script (getTpPhosData.py) and an example input file (phosphopath_input.tab), which can be used to generate this file. Details needed to run this script are found in the README.md file.
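The two-file structure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reimplementation of getTpPhosData.py: the function names are hypothetical, and the sketch assumes that the key fields are joined by hyphens, that residue and site are written together (for example S10), that the site node identifier reuses the same pattern, and that time points are numbered from 1. The README.md and the provided files remain the authoritative description of the format.

```python
def network_row(uniprot_id, residue, site, peptide_id):
    """Build one network row: protein node, site node, display name, peptide number.

    The site node identifier format is an assumption made for illustration.
    """
    site_node = f"{uniprot_id}-{residue}{site}-{peptide_id}"
    display_name = f"{residue}{site}"
    return [uniprot_id, site_node, display_name, str(peptide_id)]


def timeseries_rows(uniprot_id, residue, site, peptide_id, ratios):
    """Build one two-column row per time point.

    Column 1 follows the assumed pattern
    'UniProtId-ResidueSite-PeptideId-TimePointNumber';
    column 2 is the abundance ratio at that time point.
    """
    rows = []
    for time_point, ratio in enumerate(ratios, start=1):
        key = f"{uniprot_id}-{residue}{site}-{peptide_id}-{time_point}"
        rows.append([key, str(ratio)])
    return rows


# Hypothetical placeholder identifier and values, not taken from the dataset.
net = network_row("P00000", "S", 10, 1)
series = timeseries_rows("P00000", "S", 10, 1, [0.5, 1.5])
```

Each returned list corresponds to one line of the respective file once its elements are joined with the delimiter the tool expects.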

Phoxtrack
Data for this tool are provided in the file Phoxtrack.txt. It is a tab-separated file containing the 13-amino acid peptide sequence, followed by the mass spectrometry abundance ratios at 8 time points.
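A line in this layout can be assembled and sanity-checked with a few lines of Python. The sketch below is illustrative only: the function name, the placeholder peptide and the ratio values are hypothetical, and the length checks simply encode the 13-residue peptide and 8 time points stated above.

```python
def phoxtrack_line(peptide, ratios, n_timepoints=8, window=13):
    """Return one tab-separated line: peptide followed by its abundance ratios.

    Raises ValueError if the peptide is not `window` residues long or if the
    number of ratios differs from `n_timepoints`.
    """
    if len(peptide) != window:
        raise ValueError(f"expected a {window}-residue peptide, got {len(peptide)}")
    if len(ratios) != n_timepoints:
        raise ValueError(f"expected {n_timepoints} ratios, got {len(ratios)}")
    return "\t".join([peptide] + [str(r) for r in ratios])


# Hypothetical placeholder peptide and values, not taken from the dataset.
line = phoxtrack_line("AAAAAASAAAAAA", [1.0] * 8)
```

Checking the field counts before writing catches truncated peptides or missing time points early, before the file is submitted to the tool.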

Kappa
We provide a model of phosphorylation changes in response to insulin stimulation in the Kappa language format, in the file InsulinSignallingModel.ka. This model was compiled from the literature. It does not aim to provide an accurate representation of the underlying biology; it was built only to demonstrate the use of Kappa tools for epiproteomic data (see section Kappa in the originating article).
The model contains 13 proteins, 2 additional molecules (GTP and GDP), and a total of 33 reactions. The initial concentration of each molecule was set to 10, except for GTP, which was set to 1000; GDP was set to 10 (bound to the Ras protein). We observed the output of Insr Y1175 and Erk1 T203. A list of the modelled reactions is shown in Table 2.

Benchmark dataset limitations
In this paper, a benchmark dataset is presented, in various formats, consisting of profiles of 103 phosphosites on 58 proteins, and their associated kinases. Although the sites in this dataset are well studied, a number of limitations are present.
The original study by Humphrey et al. [2], from which this benchmark dataset is derived, was considered a cutting-edge landmark study in 2013. However, mass spectrometry technologies improve every year. Five years on, a repeat of the study could produce different results, with higher resolution, sensitivity, and coverage.
Secondly, the benchmark dataset is a very small subset, with only 103 of the more than 37,000 sites quantified. Furthermore, even though these 103 sites have been well studied and depict both phosphorylation and dephosphorylation changes, greater emphasis has been laid on studying the phosphorylation changes, and thus the phosphatases potentially causing the dephosphorylation changes are not included in our benchmark. Instead, dephosphorylation was considered only indirectly, as a result of deactivation of the kinases [6].