Computational protocol to identify shared transcriptional risks and mutually beneficial compounds between diseases

Summary The accumulation of omics and biobank resources allows for a genome-wide understanding of the shared pathologic mechanisms between diseases and for strategies to identify drugs that could be repurposed as novel treatments. Here, we present a computational protocol, implemented as a Snakemake workflow, to identify shared transcriptional processes and screen compounds that could result in mutual benefit. This protocol also includes a description of a pharmacovigilance study designed to validate the effect of compounds using electronic health records. For complete details on the use and execution of this protocol, please refer to Gao et al.1 and Baylis et al.2


SUMMARY
The accumulation of omics and biobank resources allows for a genome-wide understanding of the shared pathologic mechanisms between diseases and for strategies to identify drugs that could be repurposed as novel treatments.Here, we present a computational protocol, implemented as a Snakemake workflow, to identify shared transcriptional processes and screen compounds that could result in mutual benefit.This protocol also includes a description of a pharmacovigilance study designed to validate the effect of compounds using electronic health records.For complete details on the use and execution of this protocol, please refer to Gao et al. 1 and Baylis et al. 2

BEFORE YOU BEGIN
The protocol below describes the specific steps used to identify the shared transcriptional risks between cancer and atherosclerosis using The Cancer Genome Atlas (TCGA) datasets for the various cancer subtypes and the Stockholm-Tartu Atherosclerosis Reverse Network Engineering Task (STARNET) and Biobank of Karolinska Endarterectomy (BiKE) atherosclerosis datasets.However, the protocol is adaptable to any analogous datasets from which summary statistics of transcriptional risks can be derived.This section includes recommendations for the hardware specifications, and instructions for the pipeline localization, the computing environment setup, and some restricted resources downloading.

Institutional permissions
This protocol involves pharmacovigilance study design that leverages de-identified electronic health records.While the code is initially tailored for use on the Stanford STARR platform and the necessary permissions required, it is important to note that any warehouse adhering to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) would be compatible with this protocol.

Hardware
Most of the protocol steps can be executed on a modern personal computer, such as one equipped with an Intel Core i7 processor and 16 GB of memory.However, for the drug screening step, it is highly recommended to utilize a high-performance computing cluster.Doing so will significantly reduce the processing time, condensing it from weeks to just a few days.Additionally, this pipeline is primarily designed for execution on a Linux system.

Downloading the pipeline
Timing: <10 min 1.Within a working directory, retrieve the pipeline code from https://github.com/ghbore/protocol-cancer-cvd-similarity (https://doi.org/10.5281/zenodo.10493942)by downloading the zip archive and then extracting it, or by using the git-clone command: Installing the computing environment Timing: 2 h 2. To facilitate setting up the computing environment, a Conda environment YAML file is included along with the pipeline code.Thus, before the environment initiation, it is essential to have Conda or Mamba installed on the system.Otherwise, refer to the ''Mamba Installation Guidelines'' (https://github.com/conda-forge/miniforge)for detailed instructions.3. Create a conda virtual environment and activate that environment.

Given that the CVD BiKE dataset is array-based expression profiles using Affymetrix Human
Genome U133 Plus 2.0 Array, and the official array annotation file is under restricted access, researchers can download the annotation file following the instruction in this Thermofisher webpage, and then move the downloaded file to ''resources/bike/HG-U133_Plus_2-na36-annot-csv.zip''.Custom dataset registration, allowing researchers to integrate their custom datasets into the analysis.The configuration includes, a. Dataset name.Define unique dataset names for easy identification and reference.b.Gene level risk statistics file.The file can be a TSV file, or Excel file, and must contain the following minimal columns: i. Gene IDs, which can be ensembl (Ensembl gene ID), entrez (NCBI Entrez gene ID), or symbol (gene symbol), ii.A beta value, the effect size, such as survival log hazard ratio and correlation coefficient, iii.A pval value, the statistical P-value associated with the corresponding ''beta'' value.
P-value and effect size cutoffs to remove noise and outlier genes.Dataset clustering configuration, further including, a. Analysis name.Define unique names for different analyses combining various datasets and resolution setting.Each name will serve as a prefix for corresponding output filenames (e.g., ''reports/cancer_only-clustering.html'' for analysis ''cancer_only'').These names will again prefix output filenames, for example, ''reports/cancer_clusters-share-d_risks.html'' for the analysis ''cancer_clusters''. b.Group names and their associated datasets.Assign meaningful names to each cluster of datasets.
1. Download the dependent resources.
Note: All steps should be triggered within the pipeline directory ''protocol-cancer-cdvsimilarity''.Ensure that you remain within the designated computing environment.Be prepared to provide the file ''resources/bike/HG-U133_Plus_2-na36-annot-csv.zip''manually in advance, as detailed in "Before you begin" step 5. Similarly, ensure that the files inside ''resources/starnet/'' have been prepared in advance, as detailed in "Before you begin" step 6.
Note: Specifically, the latest TCGA clinical data can be downloaded through Bioconductor/ TCGAbiolinks package, as showcased in the script ''workflow/scripts/tcga-clinical.R''.Briefly, for each TCGA project such as TCGA-BRCA, the ''GDCquery_clinic'' function retrieves a 158-column table of clinical information.These tables were then merged and saved to ''resources/tcga/all_indexed_clinical.rds'' for further extraction.
2. Compile the resources.Note: Only newly generated files are listed here.

Transcriptional risk identification Timing: 2 h
To understand the clinical significance of various transcriptional pathways, we correlate transcriptional data to parameters that approximate disease severity.For the cancer datasets, we use mortality data whereas for the atherosclerosis datasets we have to rely on disease severity scores like DUKE and SYNTAX (in the case of STARNET) or on prospective clinical events (BiKE).For ease of reference, we term this transcriptional data correlated to disease severity the ''transcriptional risk profile.''These steps are dedicated to identifying the transcriptional risk profile by fitting survival models for cancer TCGA and CVD-events for the BiKE datasets to generate the summary statistics correlating each gene with the disease severity, which includes the effect size and P-value.Next, we aggregate the risk profile from gene-level to pathway level.
3. Fit survival models to define the transcriptional risk profile.

Protocol
Note: The downloading, compiling, and model fitting steps are dataset specific.However, the results in the directory ''results/gene/'' are standardized.Each R Data (RDS) file contains a table that stores the summary statistics (beta and pval), and the Gencode annotations (ensembl, entrez and symbol.The filenames serve as unique identifiers for the corresponding datasets.This standard structure simplifies the process of preparing results for analogous datasets, making it easier for researchers to incorporate them into subsequent analysis steps with minimal modification. 4. Perform a Gene Set Enrichment Analysis (GSEA) on hallmark gene sets to pinpoint the high-risk pathways.

Dataset clustering and identification of shared risk
Timing: 1 h This step enables the detection of subgroups within the list of datasets using clustering analysis, laying the groundwork for identifying the shared risk pathways.

Protocol
Note: The clustering parameters should be fine-tuned in accordance with available knowledge and data.
6. Adjust the dataset grouping configuration in "config.yaml"as needed.
Note: If new subgroups emerge during the clustering, researchers can make adjustments to the configuration file "config.yaml"and include the subgroup definitions for following analyses.For instance, in this example, certain TCGA cancer types will be found to group closely with atherosclerosis and be named as atherosclerosis-similar cancers and the rest as atherosclerosis-dissimilar cancers.When looking exclusively at cancer types, we find that cancers clustered into three main groups, which we name inflammatory, proliferative, and metabolic cancers based on their overarching pathway dysregulation.Once these subgroup definitions are incorporated into the "config.yaml", it is possible to identify shared risk genes and pathways not only between atherosclerosis and cancers, but also between atherosclerosis and atherosclerosis-similar cancers, and so forth.
7. Identify the shared risk genes and pathways.

Timing: several days to weeks
This step leverages the NIH Library of integrated networks-based cellular signatures (LINCS) drug perturbation database to predict drug/compound efficacy in reversing each dataset's risk profile.
For those highly ranked drugs with sufficient statistical power for validation using electronic health records, this step further executes pharmacovigilance studies to test the drugs' effectiveness.
8. Screen compounds that could reverse the transcriptional risk profile.
Note: This step is computationally taxing and therefore it is highly recommended that this step be executed on a high-performance computing cluster.The output ''results/compound/rges.rds'' is a table containing the reversal gene expression score (RGES) and associated P-value for each compound perturbation assay in each dataset.9. Prioritize the highly ranked on-market drugs that possess adequate statistical power for testing using electronic health records and design a pharmacovigilance study.
Note: we use Clopidogrel as an example.Clopidogrel is predicted to have specific benefit for certain cancer types.To validate this prediction, we plan to collect two propensitymatched (matched on demographics, smoking status, comorbid conditions, procedures, and therapeutics in the 6 months leading up to enrollment) cohorts.Clopidogrel is generally prescribed to patients with cardiovascular events like myocardial infarction.Thus, those indications are defined as the entry events.Patients prescribed clopidogrel will be then in the treatment cohort, while those that are not prescribed this drug will be in the control cohort.The endpoint is defined as the incidence of different cancer types within 5 years.
10. Once the study design is finalized, prepare the configuration files (see examples in EHR-OMOP/ conf) by translating the drug, diseases, and indications to OMOP Vocabulary concept IDs, using Athena (https://athena.ohdsi.org/search-terms/terms).11.Setup the credentials for accessing the EHR database and initiate the pharmacovigilance study.
Note: It is highly recommended that this step be executed on a high-performance computing cluster.The output ''EHR-OMOP/fit/clopidogrel.csv'' is a table containing the survival statistics.

EXPECTED OUTCOMES
A key outcome of this pipeline is a quadrant plot summarizing the shared risks between diseases (step 7).For example, in Figure 1, the pathways in quadrant 1 are common pathways for atherosclerosis and cancer enriched with detrimental genes, while the pathways in quadrant 3 are common pathways enriched with beneficial gene.Additionally, clustering (step 5) could reveal interesting insights into disease heterogeneity.For example, when clustering atherosclerosis with cancer datasets, certain cancer types grouped more closely with atherosclerosis (Figure 2).
This pipeline can also validate drug screening results using electronic health records by collecting two propensity-matched cohorts and fitting survival models to test whether the drug can reduce the incidence of expected diseases (step 11).For example, in Figure 3, Clopidogrel is associated with a reduced incidence of inflammatory cluster cancers as predicted in step 8.

LIMITATIONS
This protocol includes resources for downloading, compiling, and transcriptional risk identification steps, which are tailored specifically to the TCGA datasets, STARNET and BiKE datasets.When working with analogous datasets, researchers may need to prepare the gene level risk statistics file and adjust the configuration file accordingly.Additionally, this protocol leverages the NIH Library of integrated networks-based cellular signatures (LINCS) drug perturbation dataset to screen compounds capable of reversing the transcriptional risk.Notably, since most cell lines in LINCS dataset are cancer cell lines, the screen results are particularly robust and reliable for cancer-related diseases, applying this dataset to other diseases may require approximation (for example using the most abundant cell type or cell type of interest for that disease).

TROUBLESHOOTING Problem 1
Some software or packages are unavailable.

Protocol
Potential solution Ensure every step is executed within the designated computing environment, which should be set up correctly as outlined in ''Before you begin'' step 2-4.
Problem 2 Some dependent file is unavailable.

Potential solution
Ensure the restricted access files are manually downloaded and moved to the correct directory, as outlined in ''Before you begin'' steps 5-6.Ensure every step is executed within the pipeline directory ''protocol-cancer-cdv-similarity''.

Problem 3
Upon executing the command.
subsequent to custom datasets registration, an error message ''Unrecognized file format'' is encountered.

Potential solution
Ensure the custom dataset file adheres to one of the supported formats: Excel (.xls and .xlsx),TSV (.tsv), CSV (.csv), and R Data (.rds).Confirm that the file's extension accurately reflects its format.An incompatible extension will trigger the error.

Problem 4
Upon executing the command.

Potential solution
Ensure the custom dataset file fulfills the following mandatory requirements: at least one gene ID column (ensembl, entrez, and/or symbol), beta column, and pval column.Missing any of the required columns will trigger the error.

Problem 5
If you encounter unreasonable pathway enrichment results and consider inspecting the intermediate gene-level risk statistics and pathway-level enrichment metrics for potential insights.

Potential solution
You can dump the R Data file into an Excel file or TSV file for convenient manual inspection.For example, to inspect the gene-level risk statistics for TCGA BRCA dataset in ''results/gene/ BRCA.rds'', execute.

Figure 1 .
Figure 1.Quadrant plot summarizing the shared transcriptional risks between atherosclerosis and cancer This example plot comes from ''reports/athero_vs_cancer-shared_risks.html''.Each boxplot with dots represents the distribution of normalized enrichment score (NES) per pathway for each disease.Blue color denotes atherosclerosis and red denotes cancer.Figure reprinted with permission from Baylis et al., 2023.

Figure 2 .
Figure 2. Clustering results for TCGA cancer and CVD datasets This example plot comes from ''reports/athero_and_cancer-clustering.html''.Each dot represents a dataset, and the clusters are color-coded.UMAP_1 and UMAP_2 refer to the first two principal components captured by the UMAP (Uniform Manifold Approximation and Projection) algorithm, which facilitates the visualization of high dimensional data.Figure reprinted with permission from Baylis et al., 2023.

Figure 3 .
Figure 3. Kaplan-Meier curves of cancer incidence in clopidogrel-treated and control cohorts This example is from ''EHR-OMOP/plot/''.The red color denotes the control cohort and green denotes clopidogrel-treated cohort.The endpoints are inflammatory cancers, proliferative cancers, and metabolic cancers, respectively.P value is resulted from the Cox proportional hazards model on the propensity-matched cohort, adjusted for gender, age, race, and smoking history.The shaded band denotes the 95% confidence interval.Figure reprinted with permission from Gao et al., 2022.
. Associated datasets.Specify the list of datasets included in each analysis.c.Resolution parameter.This parameter controls the granularity of clustering.Higher values (above 1.0) lead to a larger number of communities, while lower values (below 1.0) result in a smaller number of communities.The optimal value depends on the specific goals.Additional details are available here: https://satijalab.org/seurat/reference/findclusters.
# Custom dataset registration.datasets: custom_1: # Dataset name # Gene-level risk profile file.# The file can be a TSV or Excel, at least three columns: # 1. gene ID, such as ensemble, entrez, symbol bDataset grouping configuration.The grouping configuration may depend on dataset clustering results.The configuration further includes: a. Analysis name.Similar to clustering, specify distinct names for various grouping configurations.