Stabilized COre gene and Pathway Election uncovers pan-cancer shared pathways and a cancer-specific driver

Approaches systematically characterizing interactions via transcriptomic data usually follow two systems: (i) coexpression network analyses focusing on correlations between genes and (ii) linear regressions (usually regularized) to select multiple genes jointly. Both suffer from the problem of stability: A slight change of parameterization or dataset could lead to marked alterations of outcomes. Here, we propose Stabilized COre gene and Pathway Election (SCOPE), a tool integrating bootstrapped least absolute shrinkage and selection operator and coexpression analysis, leading to robust outcomes insensitive to variations in data. By applying SCOPE to six cancer expression datasets (BRCA, COAD, KIRC, LUAD, PRAD, and THCA) in The Cancer Genome Atlas, we identified core genes capturing interaction effects in crucial pan-cancer pathways related to genome instability and DNA damage response. Moreover, we highlighted the pivotal role of CD63 as an oncogenic driver and a potential therapeutic target in kidney cancer. SCOPE enables stabilized investigations toward complex interactions using transcriptome data.

, where I(.) is an indicator function

LASSO (Least Absolute Shrinkage and Selection Operator)
Among multiple methods proposed for variable selection and model building, the LASSO (18) and ridge (15) regularized regression methods have gained popularity in recent years for highdimensionality problems. Regularization typically refers to the addition of an additional term in the loss function that is meant to prevent overfitting. LASSO (a form of 1 regularization) adds the sum of absolute values of the regression coefficients to the loss function and in turn allows some of these coefficients to be reduced to 0thus inducing sparsity. It has been shown that L1 regularization promotes sparsity more than higher order regularizations (such as L2, or ridge regression) among convex forms of regularization models (18).
In particular, a typical regularized loss function is: where is a non-negative tuning parameter that controls the trade-off between sparsity and accuracy, is the number of samples and is the number of features/variables.
In this paper, a logistic regression model is used and therefore, the loss function becomes, The tuning parameter is typically chosen via cross-validation and in this paper, a 10-fold cross validation is used to select the optimal .
Differential Co-expression Analysis (DiffCoEx) To compare our method with a standard method based on network analysis, we use differential co-expression networks to identify groups (or "modules") of differentially co-expressed genes and conduct pathway enrichment on these modules. While there are multiple methods of differential co-expression analysis, the most widely used method remains DiffCoEx (70), an extension of the popular WGCNA (46) to differential co-expression. DiffCoEx begins with the construction of two adjacency matrices: : = ( , ) for cancerous samples and similarly for healthy samples.
While different correlation measures can be used in this step, the authors of DiffCoEx used the Spearman rank correlation. A matrix of adjacency difference is then calculated, where ≥ 0 is an integer tuning parameter which can be selected in multiple ways. In this paper we choose ∈ [5,6,7,8,9,10] such that each cancer has the minimum number of modules with the largest module containing the smallest number of genes. Next, a Topological Overlap dissimilarity Matrix (TOM) is calculated where lower values of indicate that a pair of genes and have significant correlation changes (between case and control) with the same group of genes.
Finally, the dissimilarity matrix is used for clustering and "modules" of differentially coexpressed genes are identified. These modules which contain sets of genes are then tested for pathway enrichment and the same measures of overlap as in the SCOPE method are calculated.

Supplementary Figure S1 -Impact of tuning parameters on Pathway Overlap Score
Range of Pathway Overlap Scores (POS) across the six cancers (BRCA, COAD, KIRC, LUAD, PRAD and THCA) of the TCGA database when running SCOPE using multiple sets of values for the parameters indicates similar maximum POS scores for different values of the A) number of iterations/subsamples ( ), B) ratio of data used in each subsample ( ), C) coexpression cut-off ( ℎ ) and D) differential co-expression cut-off ( ℎ ). D) Indicates that POS Scores do change significantly when higher ℎ values are specified (which also inversely relates to the number of core genes identified) leading to the conclusion that ℎ may be tuned based on feasibility of further study of the core genes. Figure S2 -F1 Scores of Core Pathway identification of simulated gene expression data with 670 samples at FDR < 0.05 F1 Score (=TP/(TP+0.5*(FP+FN))) calculated for the accuracy of Adaptive Elastic-Net, Randomized LASSO and SCOPE models in identifying core pathways simulated in gene expression data with 670 samples at an FDR of <0.05 for both A) linearly generated phenotypes and B) non-linearly generated phenotypes with core genes of both high and low correlations individually. Note: In the non-linear model, we assumed that the genes participating interactions are known as a priori; otherwise, the powers of all three methods are close to zero. Please see detailed justifications in Methods.

Supplementary Figure S3 -F1 Scores of Core Gene identification of simulated gene expression data with 670 samples
F1 Score (=TP/(TP+0.5*(FP+FN))) calculated for the accuracy of Adaptive Elastic-Net, Randomized LASSO and SCOPE models in identifying core genes simulated in gene expression data with 670 samples for both A) linearly generated phenotypes and B) non-linearly generated phenotypes with core genes of both high and low correlations individually. Note: In the nonlinear model, we assumed that the genes participating interactions are known as a priori; otherwise, the powers of all three methods are close to zero. Please see detailed justifications in Methods. Figure S4 -Simulations comparing performance of SCOPE, Adaptive Elastic-Net and Randomized LASSO at a sample size of 500 F1 Score (=TP/(TP+0.5*(FP+FN))) calculated for the accuracy of Adaptive Elastic-Net, Randomized LASSO and SCOPE models in identifying causal genes and pathways simulated in gene expression data with 500 samples. A) and B) indicate the ability of the three methods in identifying causal genes in linear and non-linear scenarios respectively under various correlation structures and signal-to-noise ratios. C) and D) indicate the ability of the three methods in identifying core pathways among the top 10 enriched pathways in linear and non-linear scenarios respectively under various correlation structures and signal-to-noise ratios. Note: In the nonlinear model, we assumed that the genes participating interactions are known as a priori; otherwise, the powers of all three methods are close to zero. Please see detailed justifications in Methods. Figure S5 -F1 Scores of Core Pathway identification of simulated gene expression data with 500 samples at FDR < 0.05 F1 Score (=TP/(TP+0.5*(FP+FN))) calculated for the accuracy of Adaptive Elastic-Net, Randomized LASSO and SCOPE models in identifying core pathways simulated in gene expression data with 500 samples at an FDR of <0.05 for both A) linearly generated phenotypes and B) non-linearly generated phenotypes with core genes of both high and low correlations individually. Note: In the non-linear model, we assumed that the genes participating interactions are known as a priori; otherwise, the powers of all three methods are close to zero. Please see detailed justifications in Methods. Figure S6 -F1 Scores of Core Gene identification of simulated gene expression data with 500 samples F1 Score (=TP/(TP+0.5*(FP+FN))) calculated for the accuracy of Adaptive Elastic-Net, Randomized LASSO and SCOPE models in identifying core genes simulated in gene expression data with 500 samples for both A) linearly generated phenotypes and B) non-linearly generated phenotypes with core genes of both high and low correlations individually. Note: In the nonlinear model, we assumed that the genes participating interactions are known as a priori; otherwise, the powers of all three methods are close to zero. Please see detailed justifications in Methods.

Supplementary Figure S7 -Simulations comparing performance of SCOPE, Adaptive Elastic-Net and Randomized LASSO at a sample size of 250
F1 Score (=TP/(TP+0.5*(FP+FN))) calculated for the accuracy of Adaptive Elastic-Net, Randomized LASSO and SCOPE models in identifying causal genes and pathways simulated in gene expression data with 670 samples. A) and B) indicate the ability of the three methods in identifying causal genes in linear and non-linear scenarios respectively under various correlation structures and signal-to-noise ratios. C) and D) indicate the ability of the three methods in identifying core pathways among the top 10 enriched pathways in linear and non-linear scenarios respectively under various correlation structures and signal-to-noise ratios. Note: In the nonlinear model, we assumed that the genes participating interactions are known as a priori; otherwise, the powers of all three methods are close to zero. Please see detailed justifications in Methods. Figure S8 -F1 Scores of Core Pathway identification of simulated gene expression data with 250 samples at FDR < 0.05 F1 Score (=TP/(TP+0.5*(FP+FN))) calculated for the accuracy of Adaptive Elastic-Net, Randomized LASSO and SCOPE models in identifying core pathways simulated in gene expression data with 250 samples at an FDR of <0.05 for both A) linearly generated phenotypes and B) non-linearly generated phenotypes with core genes of both high and low correlations individually. Note: In the non-linear model, we assumed that the genes participating interactions are known as a priori; otherwise, the powers of all three methods are close to zero. Please see detailed justifications in Methods. Figure S9 -F1 Scores of Core Gene identification of simulated gene expression data with 250 samples F1 Score (=TP/(TP+0.5*(FP+FN))) calculated for the accuracy of Adaptive Elastic-Net, Randomized LASSO and SCOPE models in identifying core genes simulated in gene expression data with 250 samples for both A) linearly generated phenotypes and B) non-linearly generated phenotypes with core genes of both high and low correlations individually. Note: In the nonlinear model, we assumed that the genes participating interactions are known as a priori; otherwise, the powers of all three methods are close to zero. Please see detailed justifications in Methods.

Supplementary Figure S10 -DNA replication (hsa03030) and some core gene interactions highlighted by SCOPE.
A) Shows differentially expressed core genes (light blue) and their interactions with other genes (grey) in the pathway while B) shows similar interactions with non-differentially expressed core genes. Correlations are indicated as edges ranging from red (-1) to blue (+1). Boxplots show the correlations indicated in the same pathways to highlight differences in distributions along with the p-value for the Kolmogorov-Smirnov test with the null hypothesis being that the two samples (correlations of Tumour and Normal tissues) were drawn from the same distribution.

Supplementary Figure S11 -Base excision repair (hsa03410) and some core gene interactions highlighted by SCOPE.
A) Shows differentially expressed core genes (light blue) and their interactions with other genes (grey) in the pathway while B) shows similar interactions with non-differentially expressed core genes. Correlations are indicated as edges ranging from red (-1) to blue (+1). Boxplots show the correlations indicated in the same pathways to highlight differences in distributions along with the p-value for the Kolmogorov-Smirnov test with the null hypothesis being that the two samples (correlations of Tumour and Normal tissues) were drawn from the same distribution.

Supplementary Figure S12 -Cell cycle (hsa04110) and some core gene interactions highlighted by SCOPE.
A) Shows differentially expressed core genes (light blue) and their interactions with other genes (grey) in the pathway while B) shows similar interactions with non-differentially expressed core genes. Correlations are indicated as edges ranging from red (-1) to blue (+1). Boxplots show the correlations indicated in the same pathways to highlight differences in distributions along with the p-value for the Kolmogorov-Smirnov test with the null hypothesis being that the two samples (correlations of Tumour and Normal tissues) were drawn from the same distribution.

Supplementary Figure S13 -Homologous recombination (hsa03440) and some core gene interactions highlighted by SCOPE.
A) Shows differentially expressed core genes (light blue) and their interactions with other genes (grey) in the pathway while B) shows similar interactions with non-differentially expressed core genes. Correlations are indicated as edges ranging from red (-1) to blue (+1). Boxplots show the correlations indicated in the same pathways to highlight differences in distributions along with the p-value for the Kolmogorov-Smirnov test with the null hypothesis being that the two samples (correlations of Tumour and Normal tissues) were drawn from the same distribution.

Supplementary Figure S14 -p53 signaling pathway (hsa04115) and some core gene interactions highlighted by SCOPE.
A) Shows differentially expressed core genes (light blue) and their interactions with other genes (grey) in the pathway while B) shows similar interactions with non-differentially expressed core genes. Correlations are indicated as edges ranging from red (-1) to blue (+1). Boxplots show the correlations indicated in the same pathways to highlight differences in distributions along with the p-value for the Kolmogorov-Smirnov test with the null hypothesis being that the two samples (correlations of Tumour and Normal tissues) were drawn from the same distribution.

Supplementary Figure S15 -Network plots of cancer specific pathways.
Core genes are indicated in light blue and other genes in the pathway, in grey. Correlations are indicated as edges ranging from red (-1) to blue (+1).

Supplementary Figure S16 -Survival curves of CD63 over-expressed and under-expressed patients in all cancers.
Log-rank tests indicate a significant difference in survival probabilities only for KIRC with respect to expression levels of CD63. (EXP < 0 indicates samples in which the expression level of the gene (CD63) is lower than the arithmetic mean of the expression levels of the gene across all samples; while EXP > 0 indicates higher than mean expression levels.)

Supplementary Figure S17 -PI3K-AKT-mTOR signaling pathway is highly mutated in BRCA.
Pathway diagram obtained from cBioPortal. and False Negative Rate (FNR) in identifying core pathways (based on the top 10 pathways ranked by FDR) by each of the models tested, in linear and non-linear scenarios under different signal-to-noise ratios (phenotypic variance explained) and correlation structures relative to core genes.