1. Introduction
In the last several years, technological advances in the high-throughput quantitative analysis of omics data have provided ample opportunities to investigate the onset and progression mechanisms of several complex diseases, including cancer. International collaborations in large projects, such as The Cancer Genome Atlas (TCGA), which constitutes the core of the Genomic Data Commons (GDC Data Portal) [1,2], the European Genome-Phenome Archive (EGA) [3], and the Gene Expression Omnibus (GEO) [4,5], among many others, have contributed to profiling large tumor sets for different omics layers (genomic, epigenomic, transcriptomic, metabolomic, and proteomic data).
The availability of such a huge volume of cancer omics data has favored the development of novel computational and statistical methods for personalized therapeutic strategies, improving the ability to diagnose, treat, and predict cancer progression. In particular, researchers have devoted significant efforts to developing computational methods to cope with the curse of dimensionality and the correlation among variables, which constitute the two main challenges faced when working with survival and omics data. Such methods are based on penalized regression, statistical boosting, random forests, and, more recently, deep learning, and they have proven their utility in several cancer studies (we refer the reader to [6] for an overview of survival methods using multi-omics data). Additionally, in the last few decades, model prediction data-driven methods have emerged in the context of stochastic processes and particle diffusion models [7,8,9]. These methods aim to increase the accuracy of machine learning and statistical methods applied to experimental data.
Among the several available approaches, penalization methods constitute a general framework that has the advantage of being easily interpretable, allowing the modeling of several situations, and offering competitive performance. Following [10], we can group penalization methods into two main categories: (i) traditional penalized regression methods and (ii) network-penalized Cox regression methods. Traditional penalized regression methods address the high-dimensionality problem through penalized Cox regression approaches, where the penalty serves to regularize the solution and enforce sparsity. Well-known examples in the context of Cox regression are the LASSO penalty [11,12,13], the Elastic-net penalty [14,15], SCAD [16], the adaptive Lasso [17], and the Dantzig selector [18], among many others. Network-penalized Cox regression methods explicitly incorporate the relationships among the variables in the penalty term; therefore, they improve the prediction capabilities and better address the inherent structure of omics data. The papers [19,20] introduced this idea for linear models, followed by DrCOX [21,22], AdaLNet [23], Net-Cox [24], the penalty proposed in [25], and DegreeCox [26], which generalized it to Cox regression. In [10], the authors described the advantages and limits of penalized methods in the context of linear models, logistic regression, and Cox regression, and [6] provides a large benchmark study of eleven methods (including penalization approaches) for survival analysis. Moreover, [27,28,29,30] offer a perspective for precision medicine and examples of successful applications in cancer studies.
All the approaches presented above can take advantage of and reuse the existing biological information available in databases such as KEGG [31], GO [32], and STRING [33] to guide and improve statistical methods. Such information is often available as networks, with nodes being the genes and edges their connections. In previous papers, we showed how such information can be successfully integrated into Cox regression models; see [34,35]. However, to facilitate the widespread adoption of network-based methods within the biomedical community, it is necessary to make available software that covers all the analysis phases, not only the mere implementation of the statistical core.
To this purpose, several R packages have been implemented, including both traditional and network-based penalization approaches for several regression contexts [10]. However, if we limit the search to network-based approaches in the context of Cox regression, the availability of methods is limited and they are often designed for experts.
glmSparseNet is an R package implementing the methodology proposed in [36]. It includes network-based regularizers in sparse models when the feature space can be represented by a graph structure (in this case, the gene network is a protein–protein interaction network obtained from STRING). The model uses the network's centrality measures as penalty weights in the regularization process to analyze high-dimensional survival data. The function NetPredCode.R [37] provides an implementation of the penalized model described in [38] to obtain accurate prediction models. Specifically, it uses the correlation structure of the feature space (e.g., genes) as a network structure to derive network information included in the penalized model. NetPredCode.R implements a three-step approach based on (i) network construction, (ii) clustering approaches to detect modules or pathways, and (iii) development of the final prediction model using the detected modules. In summary, NetPredCode.R can apply different combinations of network analyses and penalized regression approaches, but its usage requires prior experience in R.
As shown in [39,40], introducing statistical screening in the context of linear regression, generalized linear models, and Cox regression holds great potential for improving the accuracy of the proposed models. Furthermore, variable screening helps to cope with the high-dimensionality challenges by reducing the dimensionality from a high-/ultra-high-dimensional space to a moderate-dimensional space, where penalization approaches can be more effective. We showed the advantages of merging screening techniques with network-based Cox models in [34,35]. In conclusion, although a few tools are already available, there is still a lack of versatile packages that combine screening approaches with penalized regression methods based on network analysis.
Here, we present COSMONET (COx Survival Methods based On NETworks), a novel R package inspired by the methodology proposed in [34,35]. The novelty of the statistical model behind COSMONET is the combination of screening techniques (i.e., the transformation of data from a high-dimensional space into a low-dimensional space) and network-penalized Cox regression approaches for the selection of significant biomarkers. By combining the biological knowledge related to the disease under investigation and the statistical information derived from the data, the package allows the user to identify new potential biomarkers. The package implements a novel pipeline that selects a subset of genes associated with cancer survival and predicts individual patients' risk by evaluating a prognostic index (PI) on the selected gene signature. In addition, COSMONET offers a complete workflow that starts from the pre-processing and normalization of omics data, trains the model to identify potential biomarkers, computes the PIs, and determines the optimal cut-off to separate high- from low-risk patients. Then, it evaluates the PIs on a new set of patients (test set) and predicts the patient-specific survival risk. The package also includes advanced visualization options to investigate the data using survival curves, heat-maps, gene pathway networks, Venn diagrams, and correlation plots. Moreover, COSMONET comes with a complete and straightforward step-by-step guide, with a detailed vignette, making the proposed methodology ready to be used within the biomedical community. Overall, it provides the scientific community with a comprehensive set of easy-to-use functions to take full advantage of the increasingly available shared and heterogeneous cancer data. COSMONET is publicly available and can be accessed at http://bioinfo.na.iac.cnr.it/cosmonet/ (accessed on 13 November 2021).
In the following sections, we describe the workflow, the statistical methodology, and the functions available in COSMONET. We also illustrate the package’s capabilities using two cancer datasets (downloaded from the GEO database and GDC data portal).
2. Materials and Methods
COSMONET input data are of the form $(\mathbf{x}_i, y_i, \delta_i)$, for $i = 1, \dots, n$. Here, $\mathbf{x}_i = (x_{i1}, \dots, x_{ip})$ is the omic profile of the $i$th patient over $p$ genes. $y_i = \min(t_i, c_i)$ is the response variable composed of the survival time $t_i$ (i.e., the time until endpoint or last follow-up) and the censoring time $c_i$, and $\delta_i$ is the censoring indicator (i.e., a 0/1 variable, where 0 indicates that the $i$th patient was censored at time $y_i$ and 1 that the $i$th patient had an event at time $y_i$). Specifically, COSMONET takes as input two numeric matrices, $X_{n_1 \times p}$, the training set $T$, and $X_{n_2 \times p}$, the testing set $D$, and the relative survival information ($(y_i, \delta_i)_{i=1,\dots,n_1}$ and $(y_j, \delta_j)_{j=1,\dots,n_2}$, respectively). Here, $n_1$ and $n_2$ denote the number of samples in the two sets of data and $p$ is the number of covariates. The omic data can represent gene expression profiles as measured by microarrays or RNA-seq technologies, or any other numeric genome-wide feature that can be associated with genes or proteins to form numeric matrices. In our examples, we will use gene expression data since they are the most popular choice.
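For concreteness, the sketch below shows the expected shape of the inputs using simulated values; all object names are illustrative and not part of the package API.

```r
## Illustrative input objects (hypothetical names): omic matrices with samples
## on the rows and the same p genes on the columns, plus survival information.
set.seed(1)
p  <- 2000
x1 <- matrix(rnorm(100 * p), nrow = 100,
             dimnames = list(paste0("patientT", 1:100), paste0("gene", 1:p)))
x2 <- matrix(rnorm(80 * p), nrow = 80,
             dimnames = list(paste0("patientD", 1:80), paste0("gene", 1:p)))

## Observed time and 0/1 censoring indicator (1 = event, 0 = censored).
y1 <- data.frame(time = rexp(100, rate = 1/50), status = rbinom(100, 1, 0.4))
y2 <- data.frame(time = rexp(80,  rate = 1/50), status = rbinom(80, 1, 0.4))
```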
Figure 1 summarizes the steps implemented in COSMONET. More precisely, it consists of two phases: (i) the training phase and (ii) the testing phase.
- (i) The training phase uses $T$ to identify prognostic markers, computes the PIs, and defines a cut-off for the patient risk classes. The core of the training phase consists of the following steps:
a screening approach to provide an essential dimensional reduction step that allows a transition from the high-dimensionality p to a moderate scale using biological information, data-related information, or combining both pieces of knowledge;
a network-penalized Cox regression method to model observed survival times through genome-wide omic profiles (while accounting for coordinated genes functioning in the form of biological pathways or networks) and identify gene signatures;
a procedure to evaluate the PIs and determine an optimal cut-off to separate low- from high-risk patients;
a subnetwork analysis based on gene signatures to visualize new potential genes and biological pathways.
- (ii) The testing phase performs survival prediction to evaluate the prognostic genes on $D$ by using the parameters tuned in the training phase (regression coefficients, gene signatures, and the optimal cut-off for the PIs). This phase uses the log-rank test to compare the Kaplan–Meier curves of the patients in the high- and low-risk groups.
We describe the training and testing phases in detail in the following sections.
Note that, when only a single dataset is available, the function SplitData() can perform a random split into training and testing sets before starting the training phase.
2.1. Training Phase
In the training phase, COSMONET performs (i) an optional pre-processing and data normalization step, (ii) variable screening (offering three different approaches, plus the possibility to supply a user-defined gene list as input), (iii) network construction for the regression analysis (offering two different possibilities), (iv) network-penalized Cox regression analysis, and (v) determination of an optimal cut-off for the PIs. COSMONET includes pathway analysis and heat-map visualizations in the training phase to facilitate the interpretation of the results.
The following sections report the details of each step, and Table 1 lists the options available in the package for the screening procedure and the network construction during the training phase.
2.1.1. Data Pre-Processing and Normalization
COSMONET can perform an optional pre-processing phase (see Figure 1a) consisting of normalization between samples and between different datasets (here denoted between-sample normalization), when necessary. It does not perform the so-called within-sample normalization, since such a procedure strongly depends on the type of high-throughput platform used. Therefore, individual samples must already be normalized according to the specific technology.
The function NormalizeBetweenData() implements the between-sample normalization. This function uses quantile normalization to make the distributions of the training and testing sets the same across samples. The function first normalizes the training set using quantile normalization and obtains the normalized training dataset used for the training phase. Then, it normalizes the testing dataset to the normalized training dataset sample-by-sample: it adds one column (i.e., one sample) of the testing set to the normalized training set, normalizes the resulting dataset of $n_1 + 1$ samples, and repeats the normalization for all samples in the testing set. Finally, it takes all the normalized test columns and builds the normalized testing set. The between-sample normalization is necessary to remove unwanted variability, improve the models' performance and stability, and make the two datasets comparable. The proposed approach has the advantage that the normalization of the training set is entirely independent of the testing set. Moreover, the test set samples are normalized individually to the training set, thus allowing the system to incorporate new samples as a study progresses. The user can omit the pre-processing step if the datasets were already normalized using other procedures.
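The following sketch mimics the logic just described using preprocessCore::normalize.quantiles; it is a conceptual illustration (assuming genes on the rows and samples on the columns), not the exact implementation of NormalizeBetweenData().

```r
## Conceptual sketch of the between-sample normalization (genes x samples).
library(preprocessCore)

normalizeBetweenSketch <- function(x.train, x.test) {
  # 1. Quantile-normalize the training set on its own.
  train.norm <- normalize.quantiles(x.train)
  dimnames(train.norm) <- dimnames(x.train)

  # 2. Normalize each test sample individually against the normalized training
  #    set: append the sample, normalize the combined matrix, keep the test column.
  test.norm <- sapply(seq_len(ncol(x.test)), function(j) {
    combined <- normalize.quantiles(cbind(train.norm, x.test[, j]))
    combined[, ncol(combined)]
  })
  dimnames(test.norm) <- dimnames(x.test)

  list(train = train.norm, test = test.norm)
}
```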
2.1.2. Screening Techniques
The variable screening approaches (Figure 1b) reduce the number of variables $p$ to a moderate dimension $d$. To this purpose, let $\mathcal{I}$ be the subset of the screened variables. Denote $d = |\mathcal{I}|$ its cardinality. COSMONET includes three different screening approaches.
Biomedical-driven (BMD) screening. The subset $\mathcal{I}_{BMD}$ of the screened variables consists of those variables that are known to be associated with the type of cancer under investigation [34]. Typically, this knowledge is derived from the literature or external databases. In [34], we considered the Human Experimental/Functional Mapper (HEFaIMp) database [41]. Recently, the HumanBase database [42] replaced HEFaIMp. HumanBase uses posterior probabilities (PPs) to identify a significant relationship between a set of genes and the disease of interest. Therefore, to select the biologically relevant gene set $\mathcal{I}_{BMD}$, COSMONET uses the information provided by HumanBase to (i) rank the $p$ genes connected to the disease of interest in decreasing order of PP and (ii) select the top-ranked genes. The user can perform the latter step by using as a threshold score either a cut-off on the PPs or the value of the reduced dimension $d$, selecting the top $d$ genes (the genes with the highest PPs).
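A minimal sketch of this selection follows, assuming pp is a named numeric vector of HumanBase posterior probabilities for the disease of interest (a hypothetical object); the cut-off and the value of d are arbitrary examples.

```r
## BMD screening sketch: rank genes by posterior probability, keep the top ones.
pp <- sort(pp, decreasing = TRUE)
bmd.genes <- names(pp)[pp >= 0.5]        # option (a): cut-off on the PPs
## bmd.genes <- names(pp)[seq_len(500)]  # option (b): keep the top-d genes (d = 500)
```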
Data-driven (DAD) screening. The subset $\mathcal{I}_{DAD}$ of the screened variables is obtained by using only information from the data, as in [43]. In particular, it uses component-wise estimators that are computed very efficiently and do not suffer from the numerical instability associated with ultrahigh-dimensional estimation problems, as follows. Let $\mathcal{M}^* = \{\,j : \beta_j^* \ne 0\,\}$ be the true sparse Cox model, where $\boldsymbol\beta^* = (\beta_1^*, \dots, \beta_p^*)^{T}$ denotes the true value of the parameter vector and $h(t \mid \mathbf{x}) = h_0(t)\exp(\mathbf{x}^{T}\boldsymbol\beta^*)$ is the corresponding hazard function. The maximum marginal likelihood estimator (MMLE) $\hat{\beta}_j^{M}$, for $j = 1, \dots, p$, is defined in Cox models as the maximizer of the log-partial likelihood with a single covariate:

$$\hat{\beta}_j^{M} = \arg\max_{\beta_j} \; \sum_{i=1}^{n_1} \delta_i \left\{ x_{ij}\beta_j - \log\!\left[ \sum_{k \in R(y_i)} \exp(x_{kj}\beta_j) \right] \right\},$$

where $R(y_i) = \{\,k : y_k \ge y_i\,\}$ is the risk set at time $y_i$.

This procedure (implemented in the MarginalCoxRanking() function) provides the marginal regression coefficients of each feature and the p-values associated with the univariate models. Then, to select the optimal threshold $d$ and optimize data prediction, COSMONET allows the user to rank the genes according to the three options below.
Magnitude of the marginal regression coefficients. This approach selects the $d$ top-ranked covariates, i.e., the $d$ genes with the largest marginal coefficients in absolute value. Typically, $d$ is equal to $\lfloor n_1/\log(n_1) \rfloor$ [43].
p-value. This option identifies as $d$-genes all the genes that have p-values below a chosen significance threshold, regardless of the magnitude of their marginal regression coefficients.
Magnitude of the marginal regression coefficients and p-values. This approach first orders the genes according to the largest marginal regression coefficients in absolute value; then, it selects only those genes whose p-values fall below the chosen threshold.
Biomedical-driven + data-driven (BMD + DAD) screening. The subset $\mathcal{I}_{BMD+DAD} = \mathcal{I}_{BMD} \cup \mathcal{I}_{DAD}$ of the screened variables is obtained by combining the biomedical information and the data-driven knowledge. To this purpose, COSMONET takes the union of the BMD and DAD sets of genes. By using BMD + DAD screening, COSMONET explores the best model that can sufficiently explain the data in the most parsimonious way to (i) make use of the available information, (ii) identify new markers that the BMD screening ignores, and (iii) improve the ability to make precise prognoses, diagnoses, and treatments.
The function ScreeningMethods() implements the three methodologies discussed above. Moreover, COSMONET also allows the user to import an external list of genes as the screened variables (obtained from the user’s clinical experience or the literature).
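To make the DAD ranking concrete, the sketch below fits one univariate Cox model per gene with the survival package; MarginalCoxRanking() implements this idea more efficiently within the package, and the object names (x, time, status) are illustrative.

```r
library(survival)

## Univariate Cox ranking sketch: one model per gene (x: samples-by-genes matrix).
marginalCoxRankingSketch <- function(x, time, status) {
  stats <- t(apply(x, 2, function(g) {
    fit <- coxph(Surv(time, status) ~ g)
    c(beta    = unname(coef(fit)),
      p.value = summary(fit)$coefficients[1, "Pr(>|z|)"])
  }))
  ranking <- as.data.frame(stats)
  ranking[order(abs(ranking$beta), decreasing = TRUE), ]
}

## Example DAD choices: keep the top d = floor(n/log(n)) genes, or the genes
## whose univariate p-value falls below the chosen threshold.
```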
2.1.3. Gene Network Construction
The set of screened variables $\mathcal{I}$ is used to build the networks that will be included in the penalized Cox regression model (Figure 1c,d). In particular, COSMONET implements the following network construction procedure:
Functional linkage ($S$) network. COSMONET builds $S$ from the HumanBase tool [42,44]. Each element in the $S$ matrix represents the PP that two genes are functionally related. The higher the probability, the stronger the functional relation between the genes in the disease of interest. COSMONET completes the $S$ matrix by setting to zero the weights of all genes that are not identified by the functional linkage map. The function CreateNetworkMatrix() builds the matrix $S$.
Moreover, COSMONET allows the user to import a molecular network that matches the set of screened genes $\mathcal{I}$. Note that the network matrix must be an adjacency matrix with zero diagonal and non-negative off-diagonal entries.
2.1.4. Network-Penalized Cox Regression Algorithm
The next step consists in the application of a penalized algorithm using the set of screened variables $\mathcal{I}$ and the a priori network information (Figure 1b,d).
Given the Cox partial log-likelihood function

$$\ell(\boldsymbol\beta) = \sum_{i=1}^{n_1} \delta_i \left\{ \mathbf{x}_i^{T}\boldsymbol\beta - \log\!\left[ \sum_{k \in R(y_i)} \exp(\mathbf{x}_k^{T}\boldsymbol\beta) \right] \right\}, \quad (1)$$

COSMONET considers the following penalized problem:

$$\hat{\boldsymbol\beta} = \arg\min_{\boldsymbol\beta} \left\{ -\ell(\boldsymbol\beta) + P_{\lambda_1,\lambda_2}(\boldsymbol\beta) \right\}, \quad (2)$$

with

$$P_{\lambda_1,\lambda_2}(\boldsymbol\beta) = \lambda_1 \|\boldsymbol\beta\|_0 + \frac{\lambda_2}{2}\, \boldsymbol\beta^{T} L\, \boldsymbol\beta, \quad (3)$$

where $\lambda_1, \lambda_2 \ge 0$ are two regularization parameters. The penalty function consists of two terms. The first term is an $\ell_0$-norm that enforces sparsity in the solution. The second term is a Laplacian matrix constraint that gives smoothness among connected genomic variables or regression coefficients in the network. More precisely, the graph Laplacian regularization $\boldsymbol\beta^{T} L \boldsymbol\beta$ is equal to $\sum_{j<k} a_{jk}\,(\beta_j - \beta_k)^2$, with $L = D - A$ (graph Laplacian), where $D$ is the degree matrix and $A = (a_{jk})$ is the adjacency matrix.
To solve the optimization problem proposed in Equation (2), COSMONET uses an efficient $\ell_0$-norm feature selection algorithm based on augmented and penalized minimization-$L_0$ (APM-$L_0$) [45]. Specifically, APM-$L_0$ employs a two-stage procedure where the first stage replaces the $\ell_0$-penalty with a penalty that provides computationally manageable optimization, and the second stage performs hard thresholding. Indeed, the objective function in Equation (2) is reformulated by introducing a surrogate parameter $\boldsymbol\gamma$ of $\boldsymbol\beta$ and bounding the difference between them by a smooth convex function, which guarantees the convergence of the proximal operator. The following formula gives the Lagrangian form of APM-$L_0$:

$$\min_{\boldsymbol\beta,\,\boldsymbol\gamma} \; -\ell(\boldsymbol\beta) + \frac{\lambda_2}{2}\, \boldsymbol\beta^{T} L\, \boldsymbol\beta + \frac{\rho}{2}\, g(\boldsymbol\beta - \boldsymbol\gamma) + \lambda_1 \|\boldsymbol\gamma\|_0, \quad (4)$$

where $g(\cdot)$ is a convex function such that $g(\mathbf{0}) = 0$ and $g(\mathbf{u}) > 0$ for $\mathbf{u} \ne \mathbf{0}$, and $\rho > 0$. To minimize Equation (4) for a given $\lambda_1$, $\lambda_2$, and $\rho$, APM-$L_0$ updates all parameters using the following algorithm:

$$\boldsymbol\beta^{(m+1)} = \arg\min_{\boldsymbol\beta} \left\{ -\ell(\boldsymbol\beta) + \frac{\lambda_2}{2}\, \boldsymbol\beta^{T} L\, \boldsymbol\beta + \frac{\rho}{2}\, g\!\left(\boldsymbol\beta - \boldsymbol\gamma^{(m)}\right) \right\}, \quad (5)$$

$$\boldsymbol\gamma^{(m+1)} = \arg\min_{\boldsymbol\gamma} \left\{ \frac{\rho}{2}\, g\!\left(\boldsymbol\beta^{(m+1)} - \boldsymbol\gamma\right) + \lambda_1 \|\boldsymbol\gamma\|_0 \right\}, \quad (6)$$

where $\boldsymbol\beta^{(0)}$ is an initial value. For the quadratic choice $g(\mathbf{u}) = \|\mathbf{u}\|_2^2$, Equation (6) is minimized component-wise, and its solution is given by

$$\gamma_j^{(m+1)} = \beta_j^{(m+1)}\, \mathbb{1}\!\left( \left|\beta_j^{(m+1)}\right| > \sqrt{2\lambda_1/\rho} \right), \quad j = 1, \dots, d, \quad (7)$$

i.e., $\boldsymbol\gamma$ is obtained by hard thresholding the $\boldsymbol\beta$ parameters obtained in the first step.
COSMONET sets a default value for the parameter $\rho$ and chooses $\lambda_1$ and $\lambda_2$ iteratively by $k$-fold cross-validation (with a default number of folds). Our package allows the user to select either the value of the regularization parameter that gives the minimum average cross-validated error or the value that gives the most regularized model such that the cross-validated error is within one standard error of the optimum. The latter choice is inspired by the glmnet package [46], and we modified the APM-$L_0$ package for such a purpose (the modified version is incorporated into COSMONET).
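These two tuning conventions mirror the lambda.min and lambda.1se rules of glmnet. The snippet below shows them in glmnet itself for a standard penalized Cox model; it only illustrates the convention and is not the COSMONET API (x, time, and status are hypothetical objects).

```r
library(glmnet)
library(survival)

## Cross-validated penalized Cox fit on illustrative objects.
cv <- cv.glmnet(x, Surv(time, status), family = "cox", nfolds = 5)
cv$lambda.min   # value with the minimum mean cross-validated error
cv$lambda.1se   # most regularized value within one standard error of the minimum
```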
Overall, this step selects a subset of genes (i.e., the genes with non-zero estimated regression coefficients, the gene signature in Figure 1e), which is used in the next step to determine the PIs and predict the patients' risk groups.
2.1.5. Prognostic Index (PI) and Survival Analysis
By using the regression coefficients $\hat{\boldsymbol\beta}$, for each patient $i$ ($i = 1, \dots, n_1$) in the training set $T$, COSMONET computes the prognostic index $PI_i$, defined as $PI_i = \mathbf{x}_i^{T}\hat{\boldsymbol\beta}$, where $\mathbf{x}_i$ is the vector of screened gene expression values associated with the $i$-th patient (Figure 1g). To determine the optimal cut-off $PI^*$ for dividing the patients into low-risk (LR) and high-risk (HR) groups, COSMONET offers three different procedures (Figure 1h).
Adaptive-based approach: each patient $i$ ($i = 1, \dots, n_1$) in $T$ is placed in the high-risk (or low-risk) group if $PI_i$ is above (or below) the $q$-quantile of the PIs, where $q$ varies over a grid of quantile levels. The procedure is repeated in an adaptive manner for each $q$, and a log-rank test for each $q$-quantile is used to compare the Kaplan–Meier survival curves of the two risk groups. The optimal cut-off $PI^*$ is the value that corresponds to the best separation of the two groups, i.e., the $q$-quantile related to the lowest non-zero p-value resulting from the log-rank test.
Median-based approach: each patient $i$ ($i = 1, \dots, n_1$) in $T$ is assigned to the high-risk (or low-risk) group if $PI_i$ is above (or below) the $q$-quantile with $q$ equal to 0.5. The optimal cut-off $PI^*$ is the value that corresponds to the median of the PIs.
Survival-based approach: each patient in $T$ is allocated to the high-risk (or low-risk) group by using an outcome-oriented method that provides a cut-point corresponding to the most significant relationship with survival. COSMONET uses the surv_cutpoint() function from the survminer package (see the usage sketch below).
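For the survival-based option, a minimal usage sketch of survminer::surv_cutpoint follows; df is a hypothetical data frame containing the survival information and the PIs.

```r
library(survminer)

## Outcome-oriented cut-point on the prognostic index PI (illustrative objects).
cut <- surv_cutpoint(df, time = "time", event = "status", variables = "PI")
summary(cut)                  # optimal cut-point and test statistic
risk <- surv_categorize(cut)  # assigns each patient to the "high" or "low" group
```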
The function SelectOptimalCutoff() computes the optimal cut-off $PI^*$ and the p-value corresponding to the Kaplan–Meier (KM) survival curves on $T$. The KM curves provide a useful visual representation for assessing the effect of cancer progression over time for different groups of patients.
The function CosmonetTraining() performs the complete procedure (variable selection and survival analysis on T).
2.1.6. Network and Pathway Analysis
COSMONET provides an interactive visualization to investigate the gene signature in terms of pathway analysis (Figure 1f). It uses pathway information from the KEGG database [31] to create sub-networks of the gene signature identified by the regression algorithm (Figure 1e). Specifically, COSMONET generates a dashboard with two networks: (i) a sub-network of the not-isolated genes, i.e., genes that share at least one pathway with another selected gene, and (ii) a complete network displaying the full gene signature selected by the network regression method. Each vertex in the network is a gene. An edge between two genes indicates that they belong to the same KEGG pathway. The color of the nodes/genes identifies the functional link between the genes and the disease of interest according to the HumanBase database. Specifically, COSMONET automatically retrieves from HumanBase a list of genes ranked according to the PP of being associated with the disease under investigation (identified through the disease ID according to the Disease Ontology (DO) Project [47]). The signature genes among the top 500 genes in the list are displayed in red (and labeled as Mapped Up); the other genes in the list are shown in blue (and labeled as Mapped Down). Any signature gene not recorded in the HumanBase ranking is shown in white (and labeled as Not Mapped).
The complete networks help in understanding which types of genes (mapped up, mapped down, or not mapped) each approach selects as gene signatures. Specifically, the networks allow the detection of not-mapped isolated genes (white nodes, i.e., genes that HumanBase does not associate with the cancer under study), enabling the identification of new potential biomarkers.
The function GenerateNetwork() implements the pipeline for visualizing the gene pathway network. This function can also be used as a standalone tool for visualizing the network given a list of genes provided by the user. COSMONET includes information from the KEGG database in an internal file (KEGGrepository.RData) that is used by the GenerateNetwork() function to automatically map the list of genes to the corresponding KEGG pathways, and create a gene-by-pathway adjacency matrix. This matrix is then used to create the final gene pathway network.
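A simplified version of this mapping can be written in a few lines; the sketch below assumes pathways is a named list of KEGG pathways (each element a vector of gene symbols) and signature the selected genes, both hypothetical objects.

```r
## Gene-pathway network sketch: connect two signature genes if they share a pathway.
genes     <- intersect(signature, unique(unlist(pathways)))
incidence <- sapply(pathways, function(p) as.integer(genes %in% p))
rownames(incidence) <- genes

adj <- incidence %*% t(incidence)   # counts of shared pathways per gene pair
diag(adj) <- 0
g <- igraph::graph_from_adjacency_matrix((adj > 0) * 1, mode = "undirected")
```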
2.1.7. Heat-Maps
To facilitate the interpretation of the results, COSMONET includes the HeatmapSurvGroup() function. This function first orders the patients according to their PIs (from the highest to the lowest). Then, it divides them into two risk groups (i.e., low-risk and high-risk classes) using the optimal cut-off $PI^*$. Finally, it depicts the expression values of each signature gene as a Z-score. In this way, the function provides a practical way to identify genes whose upregulation leads to a poorer prognosis (i.e., those shown in red in the high-risk group) and those whose downregulation leads to a poorer prognosis (i.e., those shown in green in the high-risk group). Therefore, in clinical studies, the heat-map aids in visualizing the gene signature (and the up-/down-regulation of key genes), assessing its reproducibility between the training and testing sets (and other datasets), and evaluating the similarity between observations or clusters of patients, all in a single figure.
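The heat-map input can be prepared along these lines; this sketch uses the pheatmap package purely for illustration, with hypothetical objects (x.train, signature, PI, cutoff), and the package's own function may arrange the figure differently.

```r
## Order patients by decreasing PI, split into risk groups, show Z-scored expression.
ord   <- order(PI, decreasing = TRUE)
expr  <- x.train[ord, signature, drop = FALSE]   # samples x signature genes
z     <- scale(expr)                             # column-wise Z-scores per gene
group <- ifelse(PI[ord] > cutoff, "high-risk", "low-risk")

pheatmap::pheatmap(t(z), cluster_cols = FALSE,
                   annotation_col = data.frame(risk = group,
                                               row.names = rownames(expr)))
```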
2.2. Testing Phase
The testing phase works on an external dataset, or on the testing set resulting from the initial splitting into training and testing sets, as explained in Section 2.1.1. In the testing phase, survival analysis is performed to assess the prediction accuracy (Figure 1i–k).
Survival Analysis
To perform survival analysis, COSMONET computes the PIs for each patient $j$ ($j = 1, \dots, n_2$) in the test set as $PI_j = \mathbf{x}_j^{T}\hat{\boldsymbol\beta}$, where $\mathbf{x}_j$ is the vector of screened gene expression values associated with the $j$-th patient and $\hat{\boldsymbol\beta}$ contains the regression coefficients of the gene signature derived from the training phase (Figure 1i). The optimal cut-off $PI^*$ selected on the training set $T$ is used on the validation set to split the patients into high-risk and low-risk groups. Each $j$-th patient is designated as at high (or low) risk of death if $PI_j$ is above (or below) the optimal cut-off $PI^*$.
COSMONET uses the log-rank test to calculate the statistical significance level (i.e., to compute p-values; Figure 1j) and the KM curves to further analyze the results (Figure 1k). A good separation between the high- and low-risk survival curves and a significant p-value (i.e., p-value < 0.05) indicate that the prognostic classifier learned in the training phase can distinguish good- from poor-prognosis patients in the validation set. In other words, the low-risk class has a significantly better prognosis than the high-risk class. The function ValidationTest() implements the procedure to perform survival analysis on the testing set. The function CosmonetTesting() performs the complete prediction procedure. The KM curves allow us to visualize the survival probability depending on the assigned risk class and to evaluate the difference between the groups. The larger the difference between the two curves, the better the gene signature performs as a prognostic biomarker.
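The log-rank test and KM curves on the testing set can be reproduced with the survival package; PI.test, cutoff, and the data frame test.surv (with time and status columns) are illustrative objects.

```r
library(survival)

## Assign risk classes on the test set using the training cut-off, then compare them.
test.surv$risk <- ifelse(PI.test > cutoff, "high", "low")
fit <- survfit(Surv(time, status) ~ risk, data = test.surv)
survdiff(Surv(time, status) ~ risk, data = test.surv)   # log-rank test (p-value)

plot(fit, col = c("red", "blue"), xlab = "Time", ylab = "Survival probability")
legend("bottomleft", legend = names(fit$strata), col = c("red", "blue"), lty = 1)
```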
2.3. Correlation Analysis
The function CorrPlot() allows the user to analyze and visualize the correlation coefficient R (and its p-value) between two sets of PIs, such as those generated using different screening techniques on the same dataset. This enables the comparison of risk assignments based on the two gene signatures and allows us to classify the patients at the individual level: low-risk, high-risk, and not consistently identified (usually, borderline cases). From a clinical point of view, we suggest that the group of patients not consistently recognized might need further investigation to establish cancer risk factors. The function allows the user to select different methods for computing correlation coefficients, such as Pearson, Kendall, or Spearman. The benefit of using correlation analysis is to assess the overall concordance between the PIs obtained using different methods, and to separate patients that are consistently assigned to one class from those whose assignment depends on the chosen method. In the latter case, the final assignment might be made using additional information.
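As an example, the correlation between two sets of PIs (e.g., from BMD and DAD screening; PI.bmd and PI.dad are hypothetical vectors) can be checked directly with cor.test, which supports the same three methods.

```r
## Correlation between two prognostic-index vectors computed on the same patients.
ct <- cor.test(PI.bmd, PI.dad, method = "spearman")
ct$estimate   # correlation coefficient R
ct$p.value    # associated p-value
```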
2.4. R Implementation
COSMONET is a novel R package that implements all the steps described in Figure 1. It is available at http://bioinfo.na.iac.cnr.it/cosmonet/ (accessed on 13 November 2021) as source code, with data examples and a detailed vignette. The main function Cosmonet() can be used to run the full pipeline. However, the user can also run the two phases individually by using the CosmonetTraining() and CosmonetTesting() functions.
Table 2 describes all the functions available in the package.
2.5. Real Data Examples
To illustrate the features and performance of COSMONET, we used two gene expression data examples from breast and lung cancer studies. Table 3 shows the details of the datasets.
2.5.1. Example 1: Breast Cancer Datasets from Microarray Case Studies
We used two independent gene expression datasets on breast cancer available from the GEO database [4]. We used GSE2034 as the training set $T$ to build the model, while we used GSE2990 as the testing set $D$ to validate the model. The training set $T$ consists of gene expression profiles from the total frozen RNA of 286 lymph-node-negative breast cancer patients [48]. The testing set $D$ contains gene expression profiles of 189 invasive breast carcinomas [49]. The median survival time (RFS) in $T$ was 86 months, and the censoring proportion was 62.59%, while the median survival time (RFS) in $D$ was 77 months and the censoring proportion was 63.49%.
We used the RMA and preprocessCore Bioconductor packages for the pre-processing steps (see Section 2.1.1). After the within-array normalization and the reduction to the same $p$-dimensional feature space (for a total of 13,229 genes), we applied the function NormalizeBetweenData() to perform the normalization between datasets (see Supplementary Figure S1A).
2.5.2. Example 2: Lung Cancer Dataset from an RNA-Seq Case Study
We considered TCGA-LUAD (lung adenocarcinoma) gene expression data from the GDC Data Portal [1]. The data were obtained from the Illumina HiSeq platform and included 492 lung cancer patients (for a total of 19,988 genes). We used the data already pre-processed and normalized (gene-level, RPKM), which are available in the LinkedOmics portal [50]. The median survival time (OS) in $T$ was 701 days, and the censoring proportion was 63.82%, while the median survival time (OS) in $D$ was 624.5 days, and the censoring proportion was 63.82%.
To analyze the dataset, we first randomly split it into 50% training (246 samples) and 50% testing (246 samples) and then applied the between-sample normalization by using the functions SplitData() and NormalizeBetweenData(), respectively (see Supplementary Figure S1B).
4. Discussion and Conclusions
In this work, we presented COSMONET, a novel R package that merges advanced screening techniques with network-based Cox regression for survival prediction. The package implements a complete pipeline built upon an efficient computational algorithm with two main features: (i) screening techniques to reduce the feature space $p$ to a moderate scale $d$ and (ii) a network-based Cox regression method to select, among the screened covariates, the cancer genes most relevant to patient survival [34,35]. The variable screening aims to find a subset of input variables significantly associated with cancer patient survival. The network-based Cox regression algorithm, based on augmented and penalized minimization-$L_0$ (APM-$L_0$) [45], aims to select the most functional genes related to cancer. Moreover, the package is completed by several additional functions to determine the optimal cut-off for the PIs, perform data pre-processing, and visualize and compare the results in several forms. We illustrated the capabilities of COSMONET on two case studies with gene expression data (microarrays and RNA-seq) from the literature. The vignette provides a step-by-step guide for beginner users, making COSMONET appealing for the biomedical community when planning, developing, and implementing a personalized care plan.
The current version of COSMONET works with a single type of omic data. We have used gene expression, although it is possible to use other types of omics profiles. Together with several others (e.g., [35,57,58]), we have shown that integrating multiple omics data types can improve the performance of a method. The approach used in [35] can be easily extended to COSMONET. In that proposal, we used MANCIE [59] to correct the gene expression data using the copy number aberrations (CNA), which showed better performance. The same approach can be used with COSMONET when two types of omics are available. However, the approach in MANCIE assumes that one of the omics (i.e., the gene expression) acts as the principal matrix and the second one as a support matrix (i.e., the CNA in [35]). Therefore, the approach does not easily extend to multiple omics and does not give equal importance to the experiments. To further improve COSMONET for data integration, multi-task regression learning approaches similar to those in [60,61] could be adopted. Moreover, we have shown that COSMONET runs fast even on large datasets such as those in our case studies. However, due to the increasing complexity of a multi-omics approach, we expect a longer runtime when integrating multiple datasets or when the size of the problems increases. For this reason, future improvements of the package should include the option to run the process in parallel mode. Furthermore, the integration of databases other than KEGG, such as MetaCyc [62], could also improve the user experience, allowing an analysis of the data from different perspectives. In fact, MetaCyc contains significantly more reactions and pathways than KEGG. Hence, we aim to include additional pathway repositories in a future extension of COSMONET.
In conclusion, to the best of our knowledge, COSMONET is the only R package that combines both biologically driven and data-driven screening techniques within a network-penalized Cox regression model. Hence, the proposed package provides the biomedical community with a valuable and straightforward tool for investigating new cancer biomarkers and predicting survival outcomes while using the most recent statistical techniques.