Cross-platform Data Analysis Reveals a Generic Gene Expression Signature for Microsatellite Instability in Colorectal Cancer

The dysfunction of the DNA mismatch repair system results in microsatellite instability (MSI). MSI plays a central role in the development of multiple human cancers. In colon cancer, despite being associated with resistance to 5-fluorouracil treatment, MSI is a favourable prognostic marker. In gastric and endometrial cancers, its prognostic value is not so well established. Nevertheless, recognising the MSI tumours may be important for predicting the therapeutic effect of immune checkpoint inhibitors. Several gene expression signatures were trained on microarray data sets to understand the regulatory mechanisms underlying microsatellite instability in colorectal cancer. A wealth of expression data already exists in the form of microarray data sets. However, the RNA-seq has become a routine for transcriptome analysis. A new MSI gene expression signature presented here is the first to be valid across two different platforms, microarrays and RNA-seq. In the case of colon cancer, its estimated performance was (i) AUC = 0.94, 95% CI = (0.90 – 0.97) on RNA-seq and (ii) AUC = 0.95, 95% CI = (0.92 – 0.97) on microarray. The 25-gene expression signature was also validated in two independent microarray colon cancer data sets. Despite being derived from colorectal cancer, the signature maintained good performance on RNA-seq and microarray gastric cancer data sets (AUC = 0.90, 95% CI = (0.85 – 0.94) and AUC = 0.83, 95% CI = (0.69 – 0.97), respectively). Furthermore, this classifier retained high concordance even when classifying RNA-seq endometrial cancers (AUC = 0.71, 95% CI = (0.62 – 0.81). These results indicate that the new signature was able to remove the platform-specific differences while preserving the underlying biological differences between MSI/MSS phenotypes in colon cancer samples.


Introduction
Microsatellite instability (MSI) refers to a genetic abnormality found in many human cancers. Microsatellites are short tandem repeats of 1-6 base pairs per unit. Spontaneous mismatches or indels in microsatellites may occur during DNA replication. Such abnormalities can be recognised and repaired by the mismatch repair (MMR) genes. Cells with defective MMR gene function exhibit an abnormal length of microsatellite repeats resulting in microsatellite instable phenotype.
Traditional approach to identify patients with MSI is using a recommended panel of five markers also known as the Bethesda Panel [1]. However, a variety of other marker panels were developed to assess MSI [2,3]. Instability detected in ≥ 30% tested markers is designated as microsatellite-high (MSI-H). Instability detected in < 30% tested is termed microsatellite-low (MSI-L), and the absence of instability is termed microsatellite stability (MSS). Although microsatellite instable (MSI) phenotype has been reported in diverse human cancers (e.g., colon, gastric, and endometrial), it is the most frequently associated with colon cancer. Approximately 15% sporadic colon cancers manifest the MSI phenotype [4]. The MSI colon tumours have characteristic molecular biomarkers such as silencing of the MLH1 promoter by hypermethylation [5]. Other well-known contributors to MSI instability in colon cancer are MSH2, MSH6, MLH3, or PMS2 [6,7].  In colon cancer, despite being associated with resistance to 5-fluorouracil treatment [8], MSI is a favourable prognosis marker [9,10]. In gastric and endometrial cancer, its prognostic value is not so well established. Nevertheless, recognising the MSI tumours is of clear clinical importance and may be important for predicting the therapeutic effect of immune checkpoint inhibitors.
Nowadays, RNA-seq represents the technology of choice for gene expression analysis. Despite the benefits of RNAseq, a wealth of expression data already exist in the form of microarray data sets. Moreover, microarray data sets were used in several studies to obtain gene expression signatures to understand the regulatory mechanisms underlying microsatellite instability in colorectal cancer [11][12][13][14][15]. Therefore, having a MSI gene expression signature able to remove the platform-specific differences while preserving the underlying biological differences between MSI/MSS phenotypes would be beneficial. Although MSI testing exists, it is not routinely performed on all cases. Hence a transcriptional signature may complement available clinical features with information on MSI status.
We performed a binary classification between MSI and MSS cases. Since MSS and MSI-L tumours share similar clinicopathologic features [16,17], MSS and MSI-L populations were pooled in a single class. A new MSI gene expression signature presented here is the first to be valid across two different platforms, microarrays and RNA-seq. A simple nearest-centroid classifier was built, and its performance in terms of area under the ROC curve estimated using a 10-fold cross-validation procedure. The final classifier was validated on independent data sets representing colon, gastric, and endometrial cancers. Pathway analysis was performed for identifying enriched pathways from MSigDB.
The GSE39582 data set consists of two independent data sets. The second data set from GSE39582 (n = 87) was used as an independent validation cohort B1. Another independent validation colon cancer cohort B2 (n = 136) is from Affymetrix gene expression (GSE41258 data set from GEO database [19]).
A brief summary of all data sets can be found in Table 1.

RNA-Seq and Microarray Data Analysis.
Gene expression data were processed following standard practices in the field as follows.
In RNA-seq data sets, genes with low counts across all libraries were filtered out prior to further analysis. Read counts were normalised using Trimmed Mean of M-values normalisation procedure [25]. Differential gene expression analysis was performed using edgeR [26] generalised linear model (batch effects included in the generalised linear model). Only genes with the absolute value of log2 fold change >1 were considered as differentially expressed (adjusted p value < 0.05, Benjamini-Hochberg procedure [21]).
Two types of Affymetrix human gene expression arrays were used in this study: Human Genome U133A 2.0 (HG-U133A) and Human Genome U133 Plus 2.0 (HG-U133Plus). HG-U133A and HG-U133Plus differ from the number of probe sets presented in the chip (HG-U133A comprises more than 22,000 probe sets; HG-U133Plus comprises more than 54,000 probe sets).

Construction of the Gene Expression Signature for MSI
Status. For the analysis, MSI-low and MSS (microsatellite stable) populations were pooled in a single class. Using four published gene expression signatures of MSI trained exclusively on microarray data sets [11][12][13][14], we identified a core MSI gene list. First, we filtered genes common to both platforms and then found differentially expressed genes (DEGs) between MSI/MSS in RNA-seq development cohort A1. A new gene expression signature was defined as the intersection of these DEGs and the core MSI gene list. To minimise redundancy of the gene expression signature, genes with the absolute value of Pearson's correlation coefficient > 0.75 either in the cohort A1 or in the cohort A2 were excluded from the final gene expression signature (if expression levels of two genes were highly correlated, only one randomly selected representative from these two genes was included in the signature). The gene expression signature was used to construct a nearest (cosine similarity) centroidbased classifier. For each sample, a score was computed as the difference between cosine distances from the sample and the centroids of the MSI and MSS classes and used for the prediction of MSI status. If the score exceeded an optimised threshold, a sample was classified as MSI. We did not construct more sophisticated classifiers to allow direct comparison with published signatures trained exclusively on microarray data sets.

Performance Evaluation of the Gene Expression
Signature for MSI Status. The performance of the classifier was estimated using 10-fold cross-validation. As the main performance index was used in an area under the receiver operating characteristic curve (AUC) and 95% confidence intervals (CI) were computed using the DeLong's method [20] (implemented in pROC R-package [29]). The gene expression signature was validated on two independent colon cancer data sets: cohort B1 and cohort B2.
Only the independent microarray data sets were used for validation due to the lack of an independent publicly available colon cancer RNA-seq data set (with present MSI status).
Besides the validation on an independent colon cancer samples, we evaluated the performance of the gene expression signature on gastric and endometrial cancer samples (cohorts C1, C2, and D1).

Comparison with Published Signatures Trained Exclusively on Microarray Data
Sets. The gene expression signature performance was also compared with published MSI gene expression signatures trained exclusively on microarray data sets.
Giacomini et al. [11] developed a 7-gene expression signature using a custom microarray. The signature was trained on colon cancer cell lines and included one probe for noncoding RNA.
Kruhøffer et al. [12] constructed a 9-gene expression signature capable of separating the MSI and MSS samples using both sporadic and hereditary nonpolyposis tumours. The Human Genome U133A 2.0 (Affymetrix) was used to measure the level of gene expression.
Lanza et al. [13] identified a signature consisting of 27 differentially expressed genes including eight miRNAs (19gene expression signature used in comparison with a new gene expression signature; miRNAs were excluded from the analysis). Hybridisation was performed to the human 18.5k Expression Bioarray.
Tian et al. [14] developed a 64-gene expression signature for the detection of MSI phenotypes using Agilent 44K oligonucleotide array. The signature included probes without mapping to a known gene or multiple mapping probes.
The classification of samples was carried out in the same way as before (genes of the new signature were replaced by the genes from previously mentioned published gene expression signatures). DeLong's test [20] was used to compare the AUCs of the gene expression signature and published MSI gene signatures trained exclusively on microarray data sets. The correlation analysis was performed in RNA-seq development cohort A1 to detect potential multicollinearity among the genes from signatures trained exclusively on microarray data sets. The correlation was measured as the absolute value of Pearson's correlation coefficient.

Functional Interpretation and Pathway Enrichment Analysis of the Gene Expression Signature.
A functional and biological interpretation of the 25-gene expression signature was obtained from the Database for Annotation, Visualization and Integrated Discovery (DAVID) version 6.8 [30].

Statistical and Survival
Analysis. All statistical analyses and survival analysis were performed in R (version 3.3.1; [33]).
The prognostic value of the gene expression signature was assessed by fitting the Cox regression model in stage II and stage III cohort A1/A2 subpopulation (adjusted p value < 0.05, Benjamini-Hochberg procedure [21]).

Comparison with Published Signatures Trained Exclusively on Microarray Data Sets.
The performance of the 25-gene expression signature was compared with published signatures trained exclusively on microarray data sets (Table 3, Figure  S1). The 25-gene expression signature yields better performance in comparison with Giacomini et al. [11] signature on most cohorts. On RNA-seq cohort C1, the 25-gene expression signature yields better performance in comparison with Giacomini et al. [11], Kruhøffer et al. [12], andLanza et al. [13] signatures. In case of microarray development cohort A2, the AUCs of Kruhøffer et al. [12] and Tian et al. [14] signatures were significantly better in comparison with AUC of the 25gene expression signature.
On the contrary, the AUCs of Giacomini et al. [11] and Lanza et al. [13] signatures were significantly worse in comparison with AUC of the 25-gene expression signature on the same cohort.
In general, the accuracy of Tian et al. [14] signature was high in all cohorts including RNA-seq development cohort A1. Therefore, we performed correlation analysis to detect potential multicollinearity among the genes from the signature in the RNA-seq development cohort A1. A high correlation between expression levels indicates the strong relationship between genes and introduces a great deal of redundancy in the signature. In the RNA-seq development cohort A1, 15 genes from the Tian et al. [14] signature had the absolute value of Pearson's correlation coefficient higher than 0.75 (Figure 2). These results suggest high redundancy of this signature in RNA-seq cohort A1.
The intersection of the 25-gene expression signature and the published signatures is shown in Figure 3.

Functional Interpretation and Pathway Enrichment Analysis of the Gene Expression Signature.
A functional and biological interpretation of the 25-gene expression signature was obtained from the DAVID database. Tumour-suppressor genes (MLH1 and RUBCNL), protooncogene (AGR2), and genes reported to be linked with colon cancer (EPDR1, MLH1, AXIN2) were enriched in the signature. The signature also comprised multiple genes with related oncogenic signaling pathways such as EGFR (VAV3), AKT (TNFSF9 and GNG4), or WNT (AXIN2, NKD1) signaling pathway. Genes GNG4 and VAV3 are involved in the chemokine signaling pathway that activates downstream signaling pathways such as MAPK. The 25-gene expression signature encompasses genes associated with cell differentiation, growth, adhesion, and migration.
We also carried out pathway enrichment analysis against MSigDB gene collections [31]. Three gene sets from MSigDB were significantly enriched in the new 25-gene expression signature ( Table 4). The pathway enrichment analysis results support the 25-gene expression signature association with colon cancer MSI phenotype. VAV3, ACSL6, GNG4, and KRT23 were significantly enriched in gene set defined as "downregulated genes discriminating between MSI and MSS colon cancers" [22]. Results of Koinuma et al. [23] study indicate that epigenetic silencing of AXIN2 is specifically associated with carcinogenesis in MSI colorectal tumours. This is in concordance with our results.

Proposed Gene Expression Signature and Prognosis.
We assessed the prognostic value of each gene from the proposed 25-gene expression signature by fitting the Cox regression model to identify potential drivers of the prognostic effect. Two endpoints were tested in stage II and III cohort A2 subpopulation: relapse-free survival (RFS, n = 301) and overall survival (OS, n = 304). Because of the limitation of TCGA data set, only OS endpoint was tested in stage II and III cohort A1 subpopulation (n = 115).
It is well known that patients with MSI have a more favourable prognosis compared with those with MSS. The prognostic value of the proposed 25-gene expression signature for MSI colon cancers was not statistically significant. This suggests rather than being a prognostic gene set the new 25-gene expression signature captures the underlying biological differences between MSI/MSS phenotypes.

Discussion
Carcinogenesis is a multistep process, during which genetic and epigenetic alterations determine the malignant transformation of the cell. The molecular profile of a tumour is a key determinant of clinical outcome. Therefore, the precise MSI status detection is needed for guiding the treatment strategies.
Having a single MSI gene expression signature that can be used without regard to platform allows researchers to take advantage of all available microarray/RNA-seq data sets.
The main objective of this study was to identify a gene expression signature for MSI predictions in colon cancer that could be applied to both microarrays and RNA-seq data sets. We developed a new 25-gene expression signature that yields high accuracy in MSI phenotype prediction in colon cancer.
Interestingly, the signature yields good performance also in gastric and endometrial cancers. From a biologic perspective, this supports the idea that MSI gene expression pattern is comparable across various cancers pointing towards similar regulatory pathways.
The 25-gene expression signature performance was also compared with published MSI gene expression signatures trained exclusively on microarray data sets. The proposed 25gene expression signature yields better performance in comparison with Giacomini [14] signature was high in all cohorts including RNA-seq development cohort A1. However, the correlation analysis revealed high redundancy of this signature in RNA-seq cohort. Therefore, we propose the new 25-gene expression signature as a core cross-platform pattern that may form the basis for a MSI phenotype classifier across multiple cancers.
Kruhøffer et al. signature support the association with colon cancer MSI phenotype.
Two tumour-suppressor genes and one protooncogene were enriched in the signature. AXIN2 gene is associated with the WNT signaling pathway, and it is a direct repressor of the MYC protooncogene [34]. AXIN2 was silenced in MSI subgroup, possibly as a result of methylation of its promoter region frequently observed in MSI colon cancer patients. Interestingly, AXIN2 was also identified as one of the 36 genes that contribute to the distinction between MSI-L and MSI-H samples [35]. RPL22L1 gene was previously identified as MSI specific in gastric cancer [36] and identified as colon cancer CIMP-H subtype (characterised as enrichment for MSI, right side and mucinous histology) specific gene [37].
It should also be mentioned that MLH1 gene was previously identified as part of a gene list able to differentiate deficient/nondeficient mismatch repair phenotype in colorectal cancer samples [15].
In the microarray development cohort A2, MSI colon cancer samples with downregulated MLH1 gene expression form a compact cluster. On the contrary, MSI colon cancer samples without silencing of the MLH1 gene expression are clustered together with some MSS colon cancer samples (see dendrograms in Figure 1). Most of these MSS samples were misclassified as MSI by the proposed 25-gene expression signature. A similar pattern was observed in the RNA-seq development cohort A1. Even if these samples were predicted to be microsatellite stable, we might hypothesize they have disrupted the DNA mismatch repair system in a similar way to MSI samples without silencing of the MLH1 gene expression.

Conclusion
We present a new 25-gene expression signature able to identify MSI cases in colon cancer with consistently strong performance across microarray and RNA-seq platforms. Therefore, the new MSI gene expression signature is able to remove the platform-specific differences while preserving the underlying biological differences between MSI/MSS phenotypes in colon cancer samples. The performance of the signature on the RNA-seq data set was compared with published MSI gene signatures trained exclusively on microarray data sets. The pathway enrichment analysis results support the 25-gene expression signature association with colon cancer MSI phenotype. Moreover, the new signature is able to capture common gene activation patterns in the colon, gastric, and endometrial cancers, suggesting that the development of a common expression-based cross-platform test is feasible.

Data Availability
The R code is freely available at https://github.com/bioinforecetox/Cross platform MSI signature.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.