Analyzing causal relationships in proteomic profiles using CausalPath

Summary CausalPath (causalpath.org) evaluates proteomic measurements against prior knowledge of biological pathways and infers causality between changes in measured features, such as global protein and phospho-protein levels. It uses pathway resources to determine potential causality between observable omic features, which are called prior relations. The subset of the prior relations that are supported by the proteomic profiles are reported and evaluated for statistical significance. The end result is a network model of signaling that explains the patterns observed in the experimental dataset. For complete details on the use and execution of this protocol, please refer to Babur et al. (2021).

SUMMARY CausalPath (causalpath.org) evaluates proteomic measurements against prior knowledge of biological pathways and infers causality between changes in measured features, such as global protein and phospho-protein levels. It uses pathway resources to determine potential causality between observable omic features, which are called prior relations. The subset of the prior relations that are supported by the proteomic profiles are reported and evaluated for statistical significance. The end result is a network model of signaling that explains the patterns observed in the experimental dataset. For complete details on the use and execution of this protocol, please refer to Babur et al. (2021).

BEFORE YOU BEGIN Overview
CausalPath maps proteomic and other molecular profiles to pathway information from the Pathway Commons database (Rodchenkov et al., 2020;Babur et al., 2021;Barsi and Szalai, 2021) and other resources, focusing on causality among correlated measurements. This mapping generates a model of molecular signal flow that is consistent with both the profiling data and the published literature. Results are presented as a network model with links to information found in the Pathway Commons database. This analysis mimics aspects of the ''manual'' exploration of the scientific literature that researchers perform when they review their experimental data. CausalPath allows the systematic exploration of hundreds of thousands of curated mechanisms, which would be infeasible to do manually.
The protocol below describes the specific steps for the CausalPath analysis of a proteomics study that quantified changes in >3000 phosphopeptide levels using mass spectrometry, related to the initialization and progression of platelet GPVI signaling Aslan, 2021). We have also used similar protocols for the analysis of high-throughput mass spectrometry and reverse-phase protein array (RPPA) screens from National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium (CPTAC) and The Cancer Genome Atlas (TCGA) programs, respectively (Babur et al., 2021).
CRITICAL: A prerequisite to following this protocol is basic knowledge for 1) exploring a computer file system and 2) running commands through a command-line terminal.

Download and install Java, CausalPath, and ChiBE
Timing: <1 h CausalPath can be either run through its web server (causalpath.org), or locally from its JAR file. While the former option does not require local installation of any special software, the latter option provides more capability and is therefore included in this protocol. A local run of CausalPath requires the Java software environment, which is freely available. CausalPath is available as a Java Archive (.jar) for download on GitHub; this tutorial presents instructions for CausalPath 1.2.0. Additionally, users interested in ChiBE (Chisio BioPAX Editor) for visualization can go to the ChiBE website for software download (Babur et al., 2010, Babur et al., 2014 Alternatives: Visualization of results can be done using the CausalPath website (causalpath. org), and the Newt tool (http://web.newteditor.org); no download or installation is required in either case.
Note: This protocol provides instructions for visualization with the CausalPath website and Newt.
CRITICAL: CausalPath can be considered installed and ready to run once 1) the input files and the casualpath.jar are located on the user's computer as detailed in the section "Run CausalPath" and 2) Java has been installed.
Download and install the R environment and readxl/stringr packages (to reformat source data if needed)

Timing: <1 h
This step installs the R environment and readxl/stringr packages to extract and reformat example data from Excel spreadsheets.
4. R is a free software environment for statistical computing and graphics. It runs on Linux, Windows and MacOS. Download from (r-project.org). The current protocol was tested using R version 3.6.2. 5. Install the ''readxl'' R package with the command:

OPEN ACCESS
Note: This step is not necessary if your data is already in the necessary input format for CausalPath.

MATERIALS AND EQUIPMENT Software
CausalPath is a freely available open-source project hosted on GitHub. The software is written in Java. This protocol uses the R (3.6.2) as well as the readxl (1.3.1), and stringr (1.4.0) packages to reformat example data from an Excel spreadsheet into the necessary plain-text, tab-delimited, tabular input format. A list of software used is in the Key Resources Table. Data format The example input data consists of phosphoproteomic abundance measurements collected from a TMT-SPS-MS3 proteomics study that quantified >3000 phosphopeptides related to the initialization and progression of platelet GPVI signaling .
The required columns of the input file are labeled with the following names: ''ID'', ''Symbols'', ''Sites'', ''Effect'', and ''Value''. Column names may differ from their default labels; the configuration file (i.e., parameters.txt) specifies the column names. Below is a description of the contents of each column for a given row.

ID:
A unique text identifier for each row in the dataset. Ideally, the ID contains the gene symbols and modification sites if applicable. These IDs will be used within the visualization software. Recommended format SYMBOL_SITE (e.g., G6PD_Y401).

Symbols:
The HGNC symbol for the related gene. If there is more than one gene symbol related to a given row, then there should be a single space between each symbol.
Sites: If the row contains a phosphoprotein measurement, this column should indicate protein sites that are affected. The format for the site is the one letter capital amino acid abbreviation followed by an integer for the location on the UniProt canonical sequence (e.g., Y401, or S439). In the case of multiple sites, the sites should be separated with a pipe character ''|'' (e.g., S151|T153); termed a install.packages("readxl") install.packages("stringr") site group. If a row is related to more than one gene symbol, then there should exist a site group for each symbol that is separated with a single space. See ''Sites'' column for MAPK11/MAPK12 or the EHBP1 entries in the Example Input below.
Effect: If the row is a phosphoprotein measurement, this column can contain the effect of the related phosphorylation on the protein activity. Use ''a'' for activating and ''i'' for inhibiting phosphorylations. If the effect cannot be classified as activating or inhibiting, then it should be left as blank (preferred) or ''c'' for ''complex effect''. If ''c'' is used then the row will be removed from possible causes, as these ''complex effects'' cannot be currently evaluated. If this column is blank for a row, then CausalPath will use data compiled from various resources such as PhosphoSitePlus, Signor, and limited, manual in-house curation (Hornbeck et al., 2019;Licata et al., 2020).
Value (named ''SignedP'' in protocol example data): A numeric value for the row. There can be multiple value columns (with distinct column names) depending on how the measurements are presented. A numeric value may encode various types of information depending on the analysis. Values may represent normalized protein reads, control and test columns, a change in expression levels (e.g., fold change), or p-values that are generated by statistical tests. The type of information stored in value columns is specified in the configuration file (i.e., parameters.txt). In this protocol example, we have a single value column, named ''SignedP'' that stores signed p-values where the sign indicates the change direction and the value is the multiple-testing-adjusted p-value from a moderated t-test; both are provided in the original dataset.
Alternatives: When a ''fold change'' is used in the value column, the ''value-transformation'' parameter should be set to ''arithmetic-mean'', and the threshold value should be provided using the ''threshold-for-data-significance'' parameter.
Alternatives: The column names can be different from those specified here; the configuration file (i.e., parameters.txt) is used to specify the column names. Files for this protocol use the column name ''SignedP'' as an example of ''Value''.

Available configuration options and format
A configuration file contains the parameters for a CausalPath analysis. Each parameter in this file should be given in a separate line, in the format ''parameter-name = parameter-value''. This protocol will use the following subset of parameters. A complete list of parameters along with additional descriptions is available on the project website (github.com/PathwayAndDataAnalysis/causalpath/ blob/master/wiki/InputFormat.md).
CRITICAL: The file containing these parameters must have the exact name ''parameters.txt''.

OPEN ACCESS
proteomics-values-file: A relative path to the proteomics data file (e.g., ''data_causalpath.txt'') id-column: Name of the column with IDs (e.g., ''ID'') symbols-column: Name of the column with gene symbols (e.g., ''Symbols'') sites-column: Name of the column with sites (e.g., ''Sites'') effect-column: Name of the column with effects (e.g., ''Effect'') value-column: Name of the column with values; use only when a single set of values is being considered (e.g., ''SignedP'') value-transformation: A transformation applied to the values in the value-column. Use of ''signedp-values'' denotes p-value for each row has been pre-calculated and that negative p-values indicate a downregulation and vice versa. (e.g., ''signed-p-values'') threshold-for-data-significance: A threshold value for selecting significant data; format ''thr-val data-type'' (e.g., ''0.1 phosphoprotein'') color-saturation-value: In the network visualization, this parameter determines which value maps to the most intense upregulation and downregulation colors. When p-values are used in the analysis, this parameter applies to Àlog(p-value) (e.g., ''10'') calculate-network-significance: If value is "true", a p-value for activation and inhibition of proteins in the prior network will be calculated through permutation testing. permutations-for-significance: Number of iterations to use in permutation testing. (e.g., ''10,000'') fdr-threshold-for-network-significance: The false discovery rate for network significance calculations for the downstream activity detection. (e.g., ''0.1'') use-network-significance-for-causal-reasoning: After calculation of network significances, setting this value to ''true'' makes CausalPath use the inferred activities in causal reasoning.
show-all-genes-with-proteomic-data: Whether to include disconnected genes in the results. If the value is ''true'', any protein that is represented among causal priors and has significant value change will be represented on the result network, even if they are disconnected. gene-activity: Assign a specific activity or inactivity to a gene; format: gene name and one letter code for activity (''a'') or inactivity (''i''). (e.g., ''BRAF a'')

Data reformatting code
The following R code will extract data for use with CausalPath from the Supplementary File 2 from the recent publication on GPVI signaling . This code should be saved as ''phospho_mass_spec_data_extract.R'' to be used as part of this protocol. The code below will run quickly (< 1 min). Modification of the R code for particular user datasets is likely to be more time-consuming.

The code can be run in R with the command:
CRITICAL: The purpose of this step is to produce input files in the format as described in the ''Data Format'' section. Users are free to choose alternative methods for conducting this step; see ''Alternatives.'' This step is not necessary if your source data is already formatted for CausalPath.
Alternatives: The data reformatting step can be done in many different ways. This protocol uses the R environment as an example. Users are encouraged to format source data in the CausalPath input format in the manner they are most comfortable whether it be through the use of any programming language or an alternative file editing method they prefer.
CRITICAL: The R code and input Excel spreadsheet must be in the same directory. CRITICAL: Reformatted data will be produced in the file ''data_causalpath.txt''. Modifications to the code should result in a data file with a data format as described above.
Alternatives: The provided R code converts one of the sheets (named ''Cond #2 TiO2 Brief") in the original xlsx file. An alternative is to include multiple sheets in the same conversion for more complete analysis.

Adjust parameters
Timing: <15 min Make modifications to the ''parameters.txt'' as necessary 2. Modify, add, or remove parameters in the ''parameters.txt'' file while keeping the format of each line as ''parameter-name = parameter-value'' Note: Users reproducing this protocol do not need to make modifications to the parameters.txt above.
Note: Pre-formatted data and configuration files, as well as result files, are at this link (zenodo. org/record/5311589).

Run the CausalPath algorithm
Note: For a single analysis, both the input files data file (e.g., ''data_causalpath.txt'') and configuration file (i.e., ''parameters.txt'') can be put in a directory by themselves, as shown above. For multiple analyses of the same data, the data can be placed in the parent directory and reused for the different analyses.
Note: Results will be produced in the same directory as the ''parameters.txt'' file.

Run the CausalPath analysis from a command line terminal
Note: Results will be produced in the same directory as the input files. The directory structure of a successful CausalPath run will appear as follows: Description of all possible output files is available on the project site (github.com/ PathwayAndDataAnalysis/causalpath/blob/master/wiki/OutputFiles.md

Protocol
CausalPath results can be visualized and explored using the project website (causalpath.org).
5. On causalpath.org, click ''Visualize results from previous analysis'' 6. Select the folder containing the result files (see the ''Run CausalPath'' section for details) or a parent folder that contains multiple result folders. Figure 1 visualizes the output of CausalPath.
CRITICAL: A folder must be selected and not individual files 7. Accept the browser request to upload files if prompted 8. Once the upload completes and the visualization page loads, the next step is to double-click the folder name from the left-hand sidebar Note: Users can click the ''Show legend'' button for a guide to the visualization.
Note: Users can drag nodes to reposition them and click on phosphorylation sites to learn their positions.
Optional: The complexity of large networks is managed (i.e., simplified) through the ''Topology grouping'' feature. ''Topology grouping'' groups nodes that have the same graph topology. This feature can be turned off by clicking the ''Topology grouping'' checkbox.
(Alternative) visualize results using the Newt pathway editor
9. Click File / Import / SIF then select the ''causative.sif'' to import the network 10. Click File / Import / SIF Style the select the ''causative.format'' to import the network colorization Optional: ''Topology grouping'' can be turned on to simplify large networks. This is done by clicking the ''Map'' tab on the right-hand sidebar then clicking the ''Enable Topology Grouping of SIF Nodes'' checkbox.
Alternatives: Users can also use ChiBE as a visualization tool and instructions are provided on the ChiBE project page (github.com/PathwayCommons/chibe).

Note:
The text here summarized the three options for visualization of results; Figure 1C shows a render comparison between the tools. The comments below may guide user decisions in selecting the best tool for their needs.
a. The CausalPath site i. Provides a legend to help explore the networks ii. Does not provide an image export feature; take a screenshot if needed iii. Visualization is done as a client-side browser operation that does not require sending files to the server b. Newt/ChiBE i. Exports high-resolution vector graphics useful for publication-quality images ii. Only requires the upload of resulting networks. c. ChiBE i. Can be run locally ii. Runs on Linux and Windows.

EXPECTED OUTCOMES
The set of 1) resulting output files and 2) the network graphs both as images and interactive graphs (via the visualization tools) are outputs of the protocol.

LIMITATIONS
Currently, CausalPath cannot analyze phosphorylation sites that have complex effects. This can occur in situations where the current understanding in the literature points to both an activating and inhibiting function for a particular site.

OPEN ACCESS
Separately, for this protocol example, we have provided R code to extract and reformat input data into the CausalPath format. Often data suitable for CausalPath is found in ad hoc formats in spreadsheets, and within a single protocol, it is not possible to account for all possible variations in which this data may be found.
CausalPath uses several information resources to help infer the resulting network and results are limited to the contents of these resources (Babur et al., 2021). Two specific limitations that result from the pathway databases we utilize include: 1) that this project is solely to examine datasets from human cells and 2) there are no interactions for protein isoforms in these resources, currently.
With version 1.2.0 presented in this protocol, users can only analyze phosphorylation post-translational modifications (PTMs).

Problem 1
The following error during Run CausalPath can be caused by a data formatting issue.

Potential solution
Ensure that your input data format matches the description in this protocol. In addition, ensure that the column names used in ''parameters.txt'' match those in the input data file. Missing data can be represented as ''NaN'' entries.

Problem 2
The following error during Run CausalPath can be caused by a misspelling in the ''parameters.txt'':

Potential solution
Ensure that the assigned values in ''parameters.txt'' entries match the available options that are described at the wiki page (github.com/PathwayAndDataAnalysis/causalpath/blob/master/wiki/ InputFormat.md) and spelled correctly.

Other problems
We have thoroughly tested the CausalPath software, but other problems may occur.

RESOURCE AVAILABILITY
Lead contact Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Ö zgun Babur (ozgun.babur@umb.edu).

Materials availability
This study did not generate new unique reagents.
Data and code availability All necessary data and source code for this protocol is freely and publicly available on GitHub. The CausalPath Java Archive (.jar) can be downloaded from: (github.com/PathwayAndDataAnalysis/ causalpath/releases). Pre-formatted data, configuration, and R code files are at the following link (zenodo.org/record/5311589)