Introduction

Colorectal cancer (CRC) is a significant public health issue, being one of the most prevalent human malignancies worldwide and the second leading cause of cancer-related deaths1,2,3. While surgical resection, chemoradiation, and immunotherapy have advanced, they remain inadequate in many cases. Moreover, the incidence of CRC is increasing in younger individuals, particularly in the United States and other countries4,5,6. Genetic predisposition plays a crucial role in the development of CRC, with hereditary and sporadic causes accounting for a significant proportion of cases2,6,7. The etiology of CRC can be broadly classified into two categories: hereditary or sporadic. Hereditary CRC accounts for 10−15% of the overall incidence and is attributable to mutations in APC or DNA mismatch repair genes. Sporadic CRC is more frequent, representing >80% of CRCs, and is characterized by chromosomal instability, microsatellite instability (MSI), or CpG island methylation6,7.

Over the past decades, many transcriptomic studies have been performed which have shed light on the molecular mechanisms underlying CRC development, with a large number of genes being identified as differentially expressed between tumor and nontumor tissues8,9,10,11,12,13,14,15. At the molecular level, CRC are classified into four consensus molecular subtypes (CMS), each of which is characterized by distinct expression profiles of oncogenic/tumor suppressive genes and pathways, mutation states of particular genes, MSI, and clinical outcomes16,17, however their clinical utility remains to be validated.

So far, most transcriptomic studies have used traditional analytical approaches which rely on fold changes of individual genes between tumor and control tissues or pathway enrichment analysis based on current knowledge of genes and biological processes11,18,19,20. As a result, the number of genes/transcripts reported is large and it is uncertain which of them plays a critical role in cancer identification and classification. Furthermore, gene-gene interactions were not well addressed in traditional analytical models. Thus, there is a need to develop novel analytical methods to identify critical DEGs with high sensitivity and specificity. Recent advances in the machine learning community have shown great promise for applying new methods to improve cancer identification/classification and have demonstrated superior performance over traditional methods21,22,23.

In this study, we applied a newly proven and powerful machine-learning method to identify a parsimonious subset of critical differentially expressed genes (DEGs) for CRC. Our method is based on the max-logistic competing structure, which takes into account the competing relationships among genes in predicting the outcome variable, including gene-gene interactions, a feature not captured by traditional analytical models21,22,23. We analyzed ten transcriptome profiling datasets, including nine public datasets and one separate transcriptome dataset collected from a Chinese population. Using the max-logistic competing risk factor models, we identified four critical DEGs, namely, CXCL8 (C-X-C Motif Chemokine Ligand 8), PSMC2 (Proteasome 26S Subunit, ATPase 2), APP (Amyloid Beta Precursor Protein), and SLC20A1 (Solute Carrier Family 20 Member 1), that demonstrated the highest sensitivity, specificity, and robustness for identifying CRC, compared to the existing literature. Furthermore, the results are interpretable and reproducible across different studies of diverse human populations. Our study represents a significant advancement in the identification of critical genes for CRC and demonstrates the potential of interpretable machine learning in cancer identification and classification. We note that many existing machine learning approaches lack interpretability and often exhibit limited robustness across diverse cohorts. Thus, our findings can be considered as valuable contributions to precision oncology, providing insights that may have practical applications in the field of CRC precision medicine.

Results

Identification of critical DEGs

Using the data described in Table 1 and Section Data Availability, we identified four critical DEGs, namely, CXCL8, PSMC2, APP, and SLC20A1. We note that PSMC2 has been linked to CRC cancers, but its interactions with other genes haven’t been reported. The other three genes are less known in the CRC literature though they have been listed in the literature. We will discuss further of these four genes.

Table 1 Basic information of the nine public datasets and one vadilating dataset.

Identification of classifiers based on the four critical DEGs

Each of the CFs (competing factors) that are in competition with each other (\({{CF}}_{{\boldsymbol{i}}}{\boldsymbol{,}}\,{\boldsymbol{i}}{\boldsymbol{=}}{\boldsymbol{1}}{\boldsymbol{,}}\,{\boldsymbol{2}}{\boldsymbol{,}}\,{\boldsymbol{3}}{\boldsymbol{)}}\) can be expressed as a linear combination of gene expressions from critical DEGs. The final classifiers used were the combination of these three competing factors, as presented in Table 2. The risk probability was calculated by applying the logistic function of exp(Data_i_CFmax)/(1+exp(Data_i_CFmax)) for the combined classifiers in each dataset, and of exp(Data_i_CFj)/(1+exp(Data_i_CFj)) for each individual classifier i = 1, 2, 3, j = 1, 2, 3. Data_i_CFj represents one of the gene-CRC relationships and reflects how genes interact with each other. There may be multiple gene expression combinations (e.g., j = 1, 2, 3) for a particular patient, representing the competing risk factors for that patient. \({Data\_}{\boldsymbol{i\_}}{{CF}}_{\max }\), i = 1, 2, 3, represents the combined maximum of linear competing factors of the ith dataset.

Table 2 The four critical DEGs and the classifiers identified in the 10 datasets.

In the first, second, eighth, and ninth datasets, two classifiers, CF1 and CF2, had sufficiently high powers to identify CRC patients, and additional CFs were not required. In the third and seventh datasets, however, three classifiers (CF1, CF2 and CF3) were needed to accurately predict the presence of CRC tumors, due to the low sensitivity of CF2 in those datasets. In the remaining datasets, only one classifier (CF1) was needed to achieve the highest sensitivity and specificity in identifying CRC.

For illustration purposes, Fig. 1 displays the risk probabilities estimated by the final classifiers in all datasets. Figure 2 is four-dimensional plots that visualize the signature patterns formed by each classifier in all datasets. Each plot demonstrates how the genes create signature patterns in a geometry space, with the classifiers separating yellow from blue and green colors. Notably, these signature patterns are unique to the four specific genes utilized and cannot be replicated by arbitrarily selecting three different genes.

Fig. 1: Probabilities of risk estimated by final classifiers for all ten cohorts.
figure 1

CRC samples and normal controls are designated by asters and circles, respectively. A horizontal line at 0.5 (50%) probability threshold is shown in each panel.

Fig. 2: Diagnostic Views of classifiers in each dataset.
figure 2

Gene expression values and their combination effects with different strengths are shown in each plot. The fourth dimension represents the risk probabilities, providing a simple way to identify patients with a high risk of CRC.

Figure 3 depicts a Venn diagram that showcases the patient subgroups, which were classified by the classifiers in the first three cohorts and the validating Chinese cohort. The results demonstrate that in the first and second cohorts, CRC patients were categorized into three distinct subgroups based on the classifiers mentioned above. The first subgroup included patients who were detected only by CF1, while the second subgroup included patients who were only detected by CF2. The third subgroup comprised patients who were identified by CF1 and CF2 simultaneously (as seen in Fig. 3). In the third cohort, patients were classified into seven subgroups based on the classifiers CF1, CF2, and CF3. Similarly, the validating Chinese cohort also had seven distinct subgroups. However, in datasets 4, 5, 6 and 10, patients were not further categorized, as one classifier had sufficient power to identify CRC patients. This figure highlights the intricate nature of disease status as patients were categorized into different subgroups based on their gene-gene interactions at the genomic level. Such categorization would be valuable in determining the effectiveness of CRC diagnosis, prognosis, and management. We note applying other machine-learning and AI approaches cannot identify subtypes of CRC.

Fig. 3: Venn Diagrams for Four Datasets.
figure 3

Venn diagrams displaying the classification of CRC patients into distinct subgroups for selected cohorts (Venn diagrams for Datasets 8 and 9 are similar to Datasets 1 and 2).

Table 2 displays the coefficients associated with the classifiers in all ten datasets. It demonstrates the highest performance of the CFmax classifier in discriminating between tumor and nontumor samples, with an overall sensitivity of over 98%, specificity of over 80%, and accuracy of over 94%. Furthermore, the combined use of CF1, CF2, and CF3 improved the ability to detect cancer. Notably, in dataset 4, which comprised early-stage CRC cases in a Han Chinese population, the CXCL8 classifier alone achieved 100% sensitivity, specificity, and accuracy in identifying CRC. Therefore, we conducted focused analyses on early-stage CRC (stage 1) in other datasets. The results revealed that CXCL8 alone demonstrated high sensitivity (>89%), specificity (>75%), and accuracy (>85%) in identifying stage 1 CRC in datasets 1, 2, 3, and 6. However, focused analysis was not conducted on the remaining datasets, as there were not enough cases of early-stage CRC.

The classifiers in Table 2 were used to determine the risk probability of CRC based on the direction and absolute value of the gene expression coefficient. A positive coefficient indicated that higher gene expression was associated with a higher risk probability of CRC, while a negative coefficient indicated that lower gene expression was associated with a higher risk probability of CRC. For instance, in datasets 1, 2, 3, and 6, a decrease in APP gene expression was associated with a reduced risk probability of CRC, whereas in datasets 5, 7, 8, 9, and 10, APP showed the opposite direction, indicating a different effect of this gene in White and Asians. SLC20A1 was consistently associated with an increased risk probability of CRC across all datasets, indicating that its expression was suppressed in diverse CRC patient populations. Additionally, CXCL8 and PSMC2 were consistently associated with a reduced risk probability of CRC across all datasets, suggesting that their decreased expression was protective against CRC development. The different signs of APP in different classifiers suggest that the relationship of a single human gene with disease status can be nonlinearly correlated, and its interaction with other genes can be either positively or negatively correlated. We note that the existing CRC literature never reported such interpretations.

Notably, positive coefficients were observed for CXCL8 and PSMC2 across all datasets, indicating their central roles in gene-gene interactions and as competing risk factors for CRC development. The varying coefficients of APP in different classifiers suggest that its relationship with CRC is non-linearly correlated and can be modulated by other genes.

For the purpose of illustration, a subset of gene expression values for the four critical DEGs is shown in Table 3. The complete datasets, including original gene expression values and calculated risk probabilities, are available online.

Table 3 Gene expression values, competing factors, and risk probability in a portion of the samples for selected cohorts.

The risk probability of a sample with CRC can be calculated using the logistic function in Data_i_CFj. Table 4 shows the patient subgroups defined by individual classifiers and their combinations in each dataset. The table demonstrates that there were at least seven subgroups of CRC patients, each with different genetic characteristics and determinations. For instance, in Dataset 1, four patients were detected by Data-3-CF1, one patient by Data-3-CF2, 94 patients by Data-3-CF3, and 60 patients were detected simultaneously by all three classifiers. This illustrates the heterogeneity of CRC and the potential utility of genetic subtyping for CRC diagnosis, prognosis, and management.

Table 4 Subgroups of patients by individual classifiers and their combinations.

Analysis in validating Chinese cohort

We evaluated the performance of the four crucial DEGs (CXCL8, PSMC2, APP, and SLC20A1) identified in the aforementioned public datasets in a validating Chinese cohort from Sun Yat-sen University Cancer Center, China. By setting \({\boldsymbol{K}}{\boldsymbol{=}}{\boldsymbol{1}}\) and solving Eq. (5), we obtained the classifiers (Table 2). The classifier achieved an overall accuracy of 88.04%, sensitivity of 91.11%, and specificity of 85.11%. Notably, the coefficient of APP exhibited a different direction (negative) in this Chinese cohort compared to the North American, European, and Israeli cohorts, suggesting diverse effects of this gene in different populations.

Analysis in validating the consensus molecular subtypes

Given well-documented heterogeneity of CRC, we performed additional analysis to validate the CMS proposed by other groups (source: https://www.cell.com/trends/cancer/pdf/S2405-8033(16)30098-X.pdf). Theoretically, when we have access to CMS subtype information, it is possible to integrate CMS with the subtypes generated by our competing risk classifiers (CF), creating more nuanced subtypes. This amalgamation can offer a deeper insight into CRC pathology. To illustrate, consider a CMS subtype, let’s say CMS1, and our model-defined subtypes CFiCFj. The combination of (CMS1, CFiCFj) gives rise to a novel subtype, as it possesses distinct characteristics not found in other subtypes. The dataset GSE15645117 included four distinct CMS with divergent biology and clinical behavior. Therefore, we directly fitted our model using the identified four genes and we found the four genes led to high accuracy (accuracy 90.9722%, sensitivity 90.2778%, specificity 91.6667%). The performance is similar to that in the Chinese cohort that we collected for this study (i.e., the seventh dataset). The following table shows its performance within each CMS subtype. We note that there are 7 patients that were listed as “No group” in the published paper17. In addition, there are 8 patients whose information was not disclosed in the published paper, and we denote those as “Others” in Table 5.

Table 5 Direct check of model performance in CMS subtypes.

We conducted further analysis within each CMS subtype and obtained Table 6.

Table 6 Performance of fitting withing each CMS subtype.

From above tables, it is apparent that the identified four genes show greater universality and specialty. When fitting them directly to each CMS subtype, the results are further refined to achieve higher accuracy, which shows the four critical genes contain not only general genomic-level information for CRC but also CMS subtype information.

Looking at group “Others”, the accuracy is much lower than the other four types and “No group”, which suggests these patients’ CRC types are not typical. Once again, this finding reinforces the connection between the four genes and CRC.

It is evident that the classifiers established by the four critical genes have distinct forms, implying the heterogeneity of CRC, which aligns with the CMS theory24.

Results interpretation through heatmaps

We present heatmaps using classes of normal vs. CRC, the subtypes classified by our max-logistic competing classifiers, and CMS subtypes. All plots are displayed in Fig. 4. Here, we briefly summarize what the heatmaps tell about the four critical genes.

Fig. 4: Heatmaps of heteogeneous populations and classification patterns.
figure 4

First row Left: Heatmaps of heteogeneous populations and classification patterns: A heatmap of the selected genes for all cohorts and samples. Middle: For the first dataset TCGA-COAD-329, the mean value of CXCL8 in the normal cell (0,0) is significantly smaller than the mean values of CXCL8 in CRC cells (1,2), (1,3); the mean value of PSMC2 in (0,0) is smaller than all mean values in all other cells; the mean value of SLC20A1 in (0,0) is larger than the mean values in all other cells. This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information. Right: For the second dataset TCGA-COAD-512, the mean value of CXCL8 in the normal cell (0,0) is significantly smaller than the mean values of CXCL8 in CRC cells (1,1), (1,2), (1,3); the mean value of PSMC2 in (0,0) is smaller than the mean values of PSMC2 in cells (1,2) and (1,3); the mean value of SLC20A1 in (0,0) is larger than the mean values in all other cells. This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information. Second row Left: For the third dataset GSE39582, the mean value of CXCL8 in the normal cell (0,0) is significantly smaller than the mean values of CXCL8 in CRC cells (1,1), (1,3), (1,5), (1,6), (1,7); the mean value of PSMC2 in (0,0) is smaller than the mean values of PSMC2 in cells (1,1) and (1,3-7); the mean value of SLC20A1 in (0,0) is larger than the mean values in all other cells. This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information. Middle: For the fourth dataset GSE9348, the mean value of CXCL8 in the normal cell (0,0) is significantly smaller than the mean values of CXCL8 in CRC cells (1,1); the mean value of PSMC2 in (0,0) is smaller than the mean values of PSMC2 in cell (1,1); the mean value of SLC20A1 in (0,0) is twice larger than the mean value in cell (1,1). This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information. Right: For the fifth dataset GSE18105, the mean value of CXCL8 in the normal cell (0,0) is smaller than the mean values of CXCL8 in CRC cells (1,1); the mean value of PSMC2 in (0,0) is smaller than the mean values of PSMC2 in cell (1,1); the mean value of SLC20A1 in (0,0) is larger than the mean value in cell (1,1). This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information. Third row Left: For the sixth dataset GSE41258, the mean value of CXCL8 in the normal cell (0,0) is significantly smaller than the mean values of CXCL8 in CRC cells (1,1); the mean value of PSMC2 in (0,0) is ten times smaller than the mean values of PSMC2 in cell (1,1); the mean value of SLC20A1 in (0,0) is larger than the mean value in cell (1,1). This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information. Middle: For the self-collected dataset, the mean value of CXCL8 in the normal cell (0,0) is significantly smaller than the mean values of CXCL8 in CRC cells (1,2), but 30-80 times smaller than those in (1,3) and (1,6); the mean value of PSMC2 in (0,0) is smaller than the mean values of PSMC2 in cell (1,2), (1,3), (1,6); the mean value of SLC20A1 in (0,0) is larger than the mean value in all other cells. This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information. Right: For the eighth dataset TCGA-READ, the mean value of CXCL8 in the normal cell (0,0) is significantly smaller than the mean values of CXCL8 in all CRC cells; the mean value of PSMC2 in (0,0) is smaller than the mean values of PSMC2 in all CRC cells; the mean value of SLC20A1 in (0,0) is larger than the mean value in all CRC cells. This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information. Fourth row Left: For the nineth dataset GSE103512, the mean value of CXCL8 in the normal cell (0,0) is significantly smaller than the mean values of CXCL8 in CRC cells (1,2) and (1,3); the mean value of PSMC2 in (0,0) is smaller than the mean values of PSMC2 in all CRC cells; the mean value of SLC20A1 in (0,0) is larger than the mean value in cells (1,1) and (1,3). This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information. Middle: For the tenth dataset GSE156451, the mean value of CXCL8 in the normal cell (0,0) is three times smaller than the mean values of CXCL8 in CRC cells (1,1); the mean value of PSMC2 in (0,0) is twice smaller than the mean values of PSMC2 in cell (1,1); the mean value of SLC20A1 in (0,0) is twice larger than the mean value in cell (1,1). This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information. Right: For the tenth dataset GSE156451and CMS subtypes, the mean values of CXCL8 in the normal column are significantly twice or much smaller than the mean values of CXCL8 in CRC column with respect to each CMS subtype; the mean values of PSMC2 in normal column are twice of nearly twice smaller than the mean values of PSMC2 in CRC column with respect to each CMS subtype; the mean values of SLC20A1 in normal column are twice larger than the mean values in CRC column with respect to each CMS subtype. This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information and CMS subtype information.

The first heatmap is a heatmap of the selected genes for all cohorts and samples. It can be clearly seen that the relative mean changes of CXCL8 and PSMC2 are uniformly positive, while the signs of SLC20A1 are all negative. This observation shows that these three genes contain pathological information of CRC. Also, we can notice that the strengths of changes of CXCL8 are much larger/stronger than the strengths from other genes. This phenomenon suggests CXCL8 responded to CRC symptoms earlier and stronger, which is consistent with our analysis with those datasets of Stage I CRC.

For the first dataset TCGA-COAD-329, the mean value (5.953) of CXCL8 in the normal cell (0,0) is significantly smaller than the mean values (9.914, 9.634) of CXCL8 in CRC cells (1,2), (1,3); the mean value (10.26) of PSMC2 in (0,0) is smaller than all mean values in all other cells; the mean value (11.58) of SLC20A1 in (0,0) is larger than the mean values in all other cells. This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information.

For the second dataset TCGA-COAD-512, the mean value (1.419) of CXCL8 in the normal cell (0,0) is significantly smaller than the mean values (6.027, 2.193, 4.553) of CXCL8 in CRC cells (1,1), (1,2), (1,3); the mean value (4.156) of PSMC2 in (0,0) is smaller than the mean values (4.842, 4.911) of PSMC2 in cells (1,2) and (1,3); the mean value (4.969) of SLC20A1 in (0,0) is larger than the mean values in all other cells. This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information.

For the third dataset GSE39582, the mean value (6.303) of CXCL8 in the normal cell (0,0) is significantly smaller than the mean values of CXCL8 in CRC cells (1,1), (1,3), (1,5), (1,6), (1,7); the mean value (9.507) of PSMC2 in (0,0) is smaller than the mean values of PSMC2 in cells (1,1) and (1,3-7); the mean value (10.19) of SLC20A1 in (0,0) is larger than the mean values in all other cells. This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information.

For the fourth dataset GSE9348, the mean value (64.33) of CXCL8 in the normal cell (0,0) is significantly smaller than the mean value (39.47) of CXCL8 in CRC cells (1,1); the mean value (458) of PSMC2 in (0,0) is smaller than the mean value (509.6) of PSMC2 in cell (1,1); the mean value (4577) of SLC20A1 in (0,0) is twice larger than the mean value (1969) in cell (1,1). This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information.

For the fifth dataset GSE18105, the mean value (7.569) of CXCL8 in the normal cell (0,0) is smaller than the mean value (10.72) of CXCL8 in CRC cells (1,1); the mean value (11.6) of PSMC2 in (0,0) is smaller than the mean value (12.46) of PSMC2 in cell (1,1); the mean value (10.69) of SLC20A1 in (0,0) is larger than the mean value (9.725) in cell (1,1). This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information.

For the sixth dataset GSE41258, the mean value (686.8) of CXCL8 in the normal cell (0,0) is significantly smaller than the mean value (1133) of CXCL8 in CRC cells (1,1); the mean value (49.8) of PSMC2 in (0,0) is ten times smaller than the mean value (534.3) of PSMC2 in cell (1,1); the mean value (545.7) of SLC20A1 in (0,0) is larger than the mean value (459.1) in cell (1,1). This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information.

For the self-collected dataset, the mean value (0.3774) of CXCL8 in the normal cell (0,0) is significantly smaller than the mean value (0.8049) of CXCL8 in CRC cells (1,2), but 30-80 times smaller than those (11.43, 24.78) in (1,3) and (1,6); the mean value (0.9826) of PSMC2 in (0,0) is smaller than the mean values of PSMC2 in cell (1,2), (1,3), (1,6); the mean value (1.956) of SLC20A1 in (0,0) is larger than the mean value in all other cells. This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information.

For the eighth dataset TCGA-READ, the mean value (1.795) of CXCL8 in the normal cell (0,0) is significantly smaller than the mean values of CXCL8 in all CRC cells; the mean value (4.303) of PSMC2 in (0,0) is smaller than the mean values of PSMC2 in all CRC cells; the mean value (3.903) of SLC20A1 in (0,0) is larger than the mean value in all CRC cells. This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information.

For the ninth dataset GSE103512, the mean value (5.334) of CXCL8 in the normal cell (0,0) is significantly smaller than the mean values (7.669, 7.291) of CXCL8 in CRC cells (1,2) and (1,3); the mean value (7.422) of PSMC2 in (0,0) is smaller than the mean values of PSMC2 in all CRC cells; the mean value (5.298) of SLC20A1 in (0,0) is larger than the mean value in cells (1,1) and (1,3). This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information.

For the tenth dataset GSE156451, the mean value (28.17) of CXCL8 in the normal cell (0,0) is three times smaller than the mean value (109.2) of CXCL8 in CRC cells (1,1); the mean value (13.98) of PSMC2 in (0,0) is twice smaller than the mean value (29.04) of PSMC2 in cell (1,1); the mean value (65.64) of SLC20A1 in (0,0) is twice larger than the mean value (25.92) in cell (1,1). This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information.

For the tenth dataset GSE156451and CMS subtypes, the mean values of CXCL8 in the normal column are significantly twice or much smaller than the mean values of CXCL8 in CRC column with respect to each CMS subtype; the mean values of PSMC2 in normal column are twice or nearly twice smaller than the mean values of PSMC2 in CRC column with respect to each CMS subtype; the mean values of SLC20A1 in normal column are twice larger than the mean values in CRC column with respect to each CMS subtype. This phenomenon confirms CXCL8, PSMC2, and SLC20A1 have essentially important CRC information and CMS subtype information.

In summary, all ten cohorts demonstrate that CXCL8, PSMC2, and SLC20A1 contain essentially important CRC information.

Characterization of clinical and pathological features

In order to compare the characteristics of subgroups defined by the classifiers, we analyzed various attributes such as sex, age, histologic grades, and TNM tumor stages (as shown in Table 7). It should be noted that some samples did not have complete clinical or pathological information. Datasets 4, 5, and 6 were excluded from this analysis as the CRC patients in these datasets were identified by only one CF. In the first and second datasets, the majority of CRC patients were identified by CF(1,2), which means either CF1 or CF2 could determine the cancer status. There was no significant difference in gender distribution among subgroups, and patients were found in almost all age groups. In the second dataset, patients identified by individual CF1 or CF2 appeared to have a higher frequency of late-stage CRC, but the sample size in these subgroups was small. In the third dataset, CF3 was the classifier that identified the majority of patients. There was no significant gender preference between groups, but stages 2 and 3 CRC were more common in subgroups associated with CF3.

Table 7 Distribution of clinical and pathological characteristics.

Discussion

Transcriptomic studies are crucial in identifying significant genes and biological pathways associated with CRC. However, current transcriptomic data analyses have limitations. Firstly, the number of genes/transcripts analyzed is significantly larger compared to the number of study samples. Secondly, traditional statistical methods based on gene expression fold change used in most studies lack the power to determine key drivers of cancer development due to a large number of genes analyzed. Lastly, gene-gene interaction effects are not adequately addressed in existing analytical models. As a result, some of the DEGs identified in previous studies may be noise effects or chance findings25. In contrast, the newly introduced interpretable machine learning method employed in our study is unique as it considers complex gene-gene interactions and competition, which more closely resemble biological processes underlying cancer development. The proposed method identifies critical DEGs and signatures in heterogeneous cohorts, even on different platforms with varying gene expression values. This method has been successfully applied in our early efforts to identify critical DEGs in lung cancer26, breast cancer27, and COVID-1923. Our current study further supports the power and validity of the method. Importantly, our method stands apart from other artificial intelligence methods that function as a black box since our new method follows a transparent and interpretable formula that can be easily validated.

Our study identified four critical DEGs (CXCL8, PSMC2, APP, and SLC20A1) that demonstrated high performance in identifying CRC in diverse populations and ethnicities, covering a range of tumor stages and histologic grades. These genes may serve as biomarkers that can capture the underlying characteristics of CRC at the transcriptomic level. As such, they may have significant practical implications. Given the high performance of our classifier, we hypothesize that individuals who have higher risk of developing CRC can be identified by the classifiers based on four critical genes. For example, colon biopsy specimens from screening colonoscopy procedures can be sequenced for transcriptomic profiles to identify the individuals with a risk to develop CRC. The colon biopsy specimens may include those random biopsies or colon polyp biopsies without definitive histologic evidence of CRC.

It is important to emphasize that the gene-gene interaction effects in our model are distinct from those commonly used in other experimental designs, such as row-column or chemical-laboratory interactions in industrial and agricultural studies. Our method goes beyond the interaction term in a typical linear regression model, where interactions are often analyzed by multiplying two predictors to construct an additional covariate. In our model, the combination of genes in each CF determines the interaction effects, which can be seen in the coefficient values of each CF. This was observed across all datasets analyzed, indicating the presence of gene-gene and gene-subtype interactions. Furthermore, each classifier identified in our study can serve as a potential marker for CRC, and the effects of these classifiers may be modulated by critical genes such as PSMC2 and CXCL8, which are associated with CRC of different tumor stages. To illustrate this concept, we take two classifiers from Table 2 (TCGA COAD data) as an example:

CF1 − 90.3645 +2.8598*APP +5.5149*PSMC2 −0.8795* SLC20A1

CF2 −26.5288 +0.5692*CXCL8 +3.1516*PSMC2 −1.0474* SLC20A1

Consider these classifiers like assembling a basketball team, where each critical gene corresponds to a player on the team. The various combinations of players determine the team’s scoring ability. In this analogy, the gene combinations in each CF influence our model’s classification performance. A positive coefficient associated with a gene in a CF suggests that the presence of that gene increases the likelihood of classifying a sample as CRC, while a negative coefficient indicates the opposite. For instance, let’s envision a basketball team with five players: PSMC2 as the Point Guard, SLC20A1 as the Shooting Guard, CXCL8 as the Center, APP as the Power Forward, and the fifth player to be determined. The accuracy of these two classifiers is 97.87%, leaving room for one more player. This team has two main scoring combinations: (APP, PSMC2, SLC20A1) and (CXCL8, PSMC2, SLC20A1). The negative sign associated with SLC20A1 suggests that its role is to guard against the opponent’s ball handling time, increasing the team’s scoring probability. For the CXCL8, increasing the ball handling time will increasing team’s scoring probability, so do PSMC2 and APP. For SLC20A1, different coefficients in both combinations mean its interactions with other players in the team are different, which decides the different scoring abilities. Other genes and their associated coefficients can be interpreted similarly. While this analogy simplifies gene interactions, it’s important to note that gene-gene interactions are much more complex. They involve intricate interactions, analogous to player playing time, ball handling, coordinated defense, and the overall dynamics of the entire game, including the spectators in the stadium.

The functional relevance of the four critical DEGs to carcinogenesis has been reported. For example, CXCL8, a member of the CXC cytokine family, is one of the most significantly upregulated chemokines in CRC, contributing to tumor growth, invasion, and metastasis28,29,30,31,32,33. CXCL8 induces cell migration in colon cancer cells, acting as an autocrine growth factor34,35,36. In line with this evidence, our results suggest CXCL8 could be a marker for early-stage CRC. Similarly, PSMC2, a key member of the 19 S RP, has been implicated in the progression of several types of cancer, including ovarian cancer, pancreatic cancer, and osteosarcoma37,38,39,40. High expression of PSMC2 in CRC is associated with poorer survival, and silencing of PSMC2 is a potential therapeutic strategy for CRC41. Interestingly, in our study, decreased APP expression was associated with reduced CRC risk in the US, European, and Israel populations, but a reverse effect was observed in Asian populations (Chinese and Japanese) and rectal cancer. SLC20A1, a sodium-phosphate symporter involved in tumor formation by HeLa cells in xenografted mice42,43,44,45, has a negative coefficient in all datasets, suggesting its expression is inhibited in CRC, although its significance in CRC is currently unknown. We note that although the literature has sparse reports about these four genes as discussed above, their functions are still less known, and their interactions with each other and other genes are totally unknown due to the limitations of the earlier studying methods. As a result, they cannot be used in precision cancer diagnostics. Our new study is the first to achieve precision cancer diagnostics at the genomic level. We note that (CXCL8, PSMC2, SLC20A1) formed classifiers appeared in all North American CRC Cohorts and a European cohort, but not in Asian cohorts, Israel cohort, and the TCGA READ cohort. Such a phenomenon suggests there is genomic heterogeneity among CRC patients due to their ethnicity and subtypes.

Our method has the advantage of being applicable to gene expression data generated by different platforms, including RNA-seq and microarrays, without requiring batch correction. We note that many existing models cannot handle heterogeneous populations, and data have to be corrected for batch effects, which may make the inference inaccurate. Notably, our classifiers achieved the highest accuracy in cancer detection without adjusting for variables such as tumor stage or ethnicity. However, we acknowledge that some public datasets lack complete clinical and pathological information, which limits our ability to explore the prognostic value of the four DEGs. Nonetheless, our study is unique in its adoption of a novel machine learning approach, which has not been previously used in transcriptomic studies of human malignancies. Additionally, our findings were validated across diverse populations and ethnicities, further highlighting the potential of this approach for cancer research.

This study has a few limitations that should be acknowledged. Firstly, as a retrospective analysis of public datasets, it was not possible to conduct a comprehensive prognosis analysis to determine the prognostic value of the identified critical genes in CRC. However, the four-gene-based classifiers identified in this study exhibit the highest performance, providing a promising starting point for future studies to explore their prognostic potential. Secondly, since clinical diagnosis of CRC currently depends mainly on traditional methods such as colonoscopy and tissue biopsies, it may be challenging to identify clinical significance for the critical DEGs identified in this study. However, examining molecular subtypes based on transcriptomic patterns is crucial to understanding the genetic mechanisms underlying carcinogenesis. Integrating reliable genomic biomarkers like the four-gene-based classifiers into the CRC diagnosis algorithm will enhance the accuracy of disease identification and patient classification, leading to personalized and precision oncology. Finally, our study was based exclusively on gene expression profiles of CRC tissues, and it is unclear whether the four-gene-based classifiers can be used to detect CRC patients in the general population using blood samples. To address this issue, future studies could include cohorts with both CRC tissues and blood samples.

One note pertains to the computational aspect and how to solve Eq. (4) across all cohorts and how to integrate the critical gene criteria. As of the present stage, we have yet to develop a complete, fast, and efficient computer program for this purpose, and it remains an open challenge. In our study, we have adopted a simplified approach discussed in methodology section.

While further experimental studies are needed to validate our findings and confirm the significance of the identified critical DEGs, we employed rigorous criteria to define these genes, and the high accuracy of the classifiers, reaching 100% in some cohorts, suggests that the findings are unlikely to be due to chance (probability less than 10−11). Moreover, each of the four genes has been previously reported to be functionally relevant to carcinogenesis, and our study is the first to demonstrate their interaction effects in CRC. These findings provide a solid foundation for future investigations, including gene network analysis, exploring related genes and their functional interactions, and identifying potential causal relationships.

Methods

Data acquisition and processing

Table 1 presents an overview of the datasets used in this study, including information on data sources, experimental platforms, sample sizes, populations, and tumor stages. The discovery analyses involved nine publicly available whole-transcriptome profiling datasets, including three from the Cancer Genome Atlas (TCGA) obtained through RNA-seq and six from the Gene Expression Omnibus (GEO) database using the keywords “colon cancer”, “CRC”, and “Homo sapiens,” which were based on mRNA expression (microarray-based). The datasets were collected from a diverse population, including North American and European White, Blacks, Asians, and Jewish. In addition, clinical and pathological information, such as age, sex, TNM tumor stages, and histologic grades, were also obtained when available.

Analytical methodology

Our approach to building a competing risk factor classifier for CRC involved utilizing the max-logistic regression model. This model offered an advantage over existing models in its ability to provide nonlinear prediction and classification. Our aim was to identify a concise subset of key genes with the highest predictive accuracy. The theoretical basis of the competing risk factor models can be found in previous literature21,22. To ensure consistent analysis of DEGs across the nine public datasets and our Chinese validation cohort, we utilized the heterogeneous extension of the max-logistic regression. Starting with each competing risk factor in the max-logistic regression models, we randomly selected three genes from each dataset’s available genes/transcripts. To determine the final model with the highest sensitivity and specificity while using the smallest number of genes, we employed a Monte Carlo method that required extensive computation. Our criteria for defining critical DEGs were stringent.

  1. 1.

    The critical DEG subset should contain no more than 15 genes (such a number has been widely reported in the literature). We note that we only need four in CRC study.

  2. 2.

    The critical DEG subset should exhibit an overall accuracy of at least 95% in at least three distinct study cohorts, with a total of at least 1000 patients/subjects.

  3. 3.

    The critical DEG subset should exhibit an overall accuracy of 100% in at least one study cohort, with a minimum of 10 subjects.

  4. 4.

    At least one gene functions and shows the same sign ( + or -) in each study cohort.

  5. 5.

    The critical DEG subset should demonstrate at least 80% accuracy in any given cohort, with either sensitivity or specificity values exceeding 75%.

  6. 6.

    The competing classifiers should include the smallest number of genes possible.

  7. 7.

    The number of competing classifiers should be minimized to avoid redundancy.

We remark that many published work didn’t report cohort-to-cohort validations, i.e., they cannot satisfy (2), (3) and (5). Many existing machine-learning approaches lack interpretability and didn’t satisfy (4). Most published work couldn’t apply to heterogeneous populations, e.g., the ten cohort studies in this paper.

The basic ideas of competing risk classifiers for heterogeneous populations are briefly described below.

Suppose there are \(K\) cohorts with their binary (1 for CRC, 0 for CRC-free) response variables being \({{\boldsymbol{Y}}}_{(1)},\ldots ,{{\boldsymbol{Y}}}_{(K)}\) where

$${{\boldsymbol{Y}}}_{(k)}={({Y}_{1k},{Y}_{2k},\ldots ,{Y}_{{n}_{k},k})}^{T},k=1,\ldots ,K.$$
(1)

Each of the \({Y}_{{ik}}\), (\(i=1,\ldots ,{n}_{k},k=1,\ldots ,K)\) may be related to \(G\) groups of genes

$${\varPhi }_{{ijk}}=({X}_{i,{j}_{1},k},{X}_{i,{j}_{2},k},\ldots ,{X}_{i,\,{j}_{{g}_{j}},k}),j=1,\ldots ,G,{g}_{j}\ge 0$$
(2)

where \(i\) is the ith individual in the sample, \({g}_{j}\) is the number of genes in \(j\) th group. The competing (risk) factor classifier for the k th outcome variable is defined as

$$\log \left(\frac{{p}_{{ik}}}{1-{p}_{{ik}}}\right)=\max ({\beta }_{01k}+{\varPhi }_{i1k}{\beta }_{1k},{\beta }_{02k}+{\varPhi }_{i2k}{\beta }_{2k},\ldots ,{\beta }_{0{Gk}}+{\varPhi }_{{iGk}}{\beta }_{{Gk}})$$
(3)

where \({\beta }_{0{jk}}\)’s‘are intercepts, \({\varPhi }_{{ijk}}\) is a \(1\times {g}_{j}\) observed vector, \({\beta }_{{jk}}\) is a \({g}_{j}\times 1\) coefficient vector which characterizes the contribution of each predictor to the outcome variable \({{\boldsymbol{Y}}}_{(k)}\) in the j th group to the risk, and \({\beta }_{0{jk}}\) + \({\varPhi }_{{ijk}}{\beta }_{{jk}}\) is called the j th competing risk factor, i.e., j th signature.

Remark 1. With \({\beta }_{0{jk}}=-\infty ,j=2,\ldots ,G\), (3) is reduced to the classical logistic regression classifier. It is clear that every component of \({\beta }_{0{jk}}+{\varPhi }_{{ijk}}{\beta }_{{jk}},\,j=1,\ldots ,G\) is a risk factor for a patient to have CRC, and the highest risk is from the largest one, i.e., these risk factors compete against each other to win out to make the final effect, i.e., to determine whether or not a patient has CRC. As such, they are called competing risk factors.

The unknown parameters are estimated from

$$(\hat{{\beta }_{(k)}},\hat{S})={\arg }{\min }_{{\beta }_{(k)},{S}_{j}\subset S,j=1,2,\ldots ,G}\mathop{\sum }\limits_{i=1}^{n}[I({p}_{{ik}}\le 0.5)I({Y}_{{ik}}=1)+I({p}_{{ik}} \,>\, 0.5)I({Y}_{{ik}}=0)]$$
(4)

where 0.5 is a probability threshold value that is commonly used in machine learning classifiers, \(I(.)\) is an indicator function, \({p}_{i}\) is defined in Eq. (3), \(S=\left\{\mathrm{1,2},\ldots ,54675\right\}\) is the index set of all genes, \({S}_{1}=\{{1}_{1},{1}_{2},\ldots ,{1}_{{g}_{1}}\}\), \({S}_{2}=\{{2}_{1},\ldots ,{2}_{{g}_{2}}\}\), \(\ldots\), \({S}_{G}=\{{G}_{1},\ldots ,{G}_{{g}_{G}}\}\) are index sets corresponding to (2), and \(\hat{S}=\{{1}_{1},{1}_{2},\ldots ,{1}_{{g}_{1}};{2}_{1},\ldots ,{2}_{{g}_{2}};\ldots ;{G}_{1},\ldots ,{G}_{{g}_{G}}\}\) is the final gene set selected in the final classifiers.

To introduce sparsity for both the number of variables (genes) and the number of groups (competing factors, signatures) into the model, the following optimization problem with penalties is considered:

$$\begin{array}{l}(\hat{\beta },\hat{S},\hat{G})={\text{argmin}}_{\beta ,{S}_{j}\subset S,j=1,2,\ldots ,G}\left\{\right.{(1+{\lambda }_{1}+{\rm{|}}{S}_{u}{\rm{|}})}^{\mathop{\sum }\limits_{k=1}^{K}\mathop{\sum }\limits_{i=1}^{n}[I({p}_{{ik}}\le 0.5)I({Y}_{{ik}}=1)+I({p}_{{ik}} > 0.5)I({Y}_{{ik}}=0)]}\\\qquad\qquad\;\;\,+\,{\lambda }_{2}({\rm{|}}{S}_{u}{\rm{|}}-\frac{{\rm{|}}{S}_{u}{\rm{|}}+G-1}{({\rm{|}}{S}_{u}{\rm{|}}+1)\times G-1})\left.\right\}\end{array}$$
(5)

where \({S}_{u}\) is the union set of \(\{{S}_{j}{\}}_{j=1}^{G}\), \(|\cdot |\) is the cardinality. Tuning parameters \({\lambda }_{1}\) and \({\lambda }_{2}\) are both non-negative. \(\frac{|{S}_{u}|+G-1}{(|{S}_{u}|+1)\times G-1}\) is monotone decreasing in both \(|{S}_{u}|\) and \(G\). Additional properties of this bivariate function are described elsewhere21.

Remark 2. The S4 property of (5) and its capability to simultaneously classify multiple heterogeneous populations with common variables (genes) make the new competing risk factor classifier different from existing ones.

Remark 3. The details of computational steps were described elsewhere23 and demo \({\mathrm{Matlab}}^{\circ{\mathrm{R}}}\) codes are publicly available online. Note that Eq. (5) is an optimization problem with extremely high computational complexity, as it’s an integration of integer programming, combinatorial optimization, and continuous optimization. Therefore in practice and for this study, we adopted Monte Carlo approach to solve Eq. (5). Here, we restate the computational guidance for this optimization problem elaborated in our earlier work23 with slight modifications according to our setting and references therein.

  1. 1.

    Set G = 1, S1 = 3 (or 4, 5). Pre-define sensitivity level sen = 0.6 (or 0.7, 0.8, 0.9) and specificity level spe = 0.90 (or 0.95). The initial selection of pre-defined sen and spe levels may rely on previous literature or researchers’ target.

  2. 2.

    Perform 50,000 (or larger numbers) random draws of sets of S1 genes.

  3. 3.

    Evaluate each set of S1 genes in Step 2 and calculate the sensitivity and specificity. These genes are recorded if both of their sensitivity and specificity are larger than the pre-specified sen and spe levels. This step helps to filter key genes and reduce the number of genes in scope.

  4. 4.

    If the number of recorded genes in Step 3 is greater than 30 (or 25, 20, depending on the target of gene number shrinking), raise sen and spe values and repeat Step 2 with random draws among recorded genes in Step 3. Repeat Step 3 to further filter the recorded genes.

  5. 5.

    Repeat Step 4 until the number of recorded genes is less than 30 (or 25, 20). These recorded genes are considered as candidate genes.

  6. 6.

    Set G = 3 (or 2,4, 1), Si = 3 (2, 4, 1). Perform 50,000 (or larger numbers) random draws of sets of Si genes among the candidate genes selected in Step 5.

  7. 7.

    Report the best results with the S4 properties.

In this study, we took advantage of a published paper18 which studied 1991 genes, and we further reduced the number from 1991 to 9. We conducted a full search to obtain the current results. The procedure is called merging variable selections and common ground seeking (MVS-CGS) and is as follows.

  • Choose a dataset18.

  • For each gene,

    1. 1.

      Compute its mean within CRCs;

    2. 2.

      Compute its mean within non-tumors;

    3. 3.

      Compute its standard deviation within CRCs;

    4. 4.

      Compute its standard deviation within non-tumors;

    5. 5.

      Compute its Sharpe ratio, i.e., mean/standard deviation, within CRCs;

    6. 6.

      Compute its Sharpe ratio within non-tumors;

    7. 7.

      Compute the absolute relative mean change of means of CRCs over non-tumors;

    8. 8.

      Compute the absolute relative standard deviation change of standard deviations of CRCs over non-tumors;

    9. 9.

      Compute the absolute relative Sharpe ratio change of Sharp ratios of CRCs over non-tumors.

  • Sort the changes within mean, standard deviation, and Sharpe ratio, respectively.

    1. 1.

      Keep genes from the bottom 1% of the sorted changes related to means;

    2. 2.

      Keep genes from the top 15% of the sorted changes related to standard deviations;

    3. 3.

      Keep genes from the top 5% of the sorted changes related to Sharpe ratios.

  • Set a threshold 0.5 for genes kept using Sharpe ratio.

    1. 1.

      For each gene in the kept set using Sharpe ratio,

      1. 1.

        Compute the mean (Tm) of Sharpe ratios within CRCs;

      2. 2.

        Compute the mean (Nm) of Sharpe ratios within non-tumors;

      3. 3.

        If Tm > Nm and the proportion of CRC patients whose gene expression values are greater than the maximum of gene expression values of non-tumors, keep this gene as a final candidate.

      4. 4.

        If Tm < Nm and the proportion of CRC patients whose gene expression values are smaller than the minimum of gene expression values of non-tumors, keep this gene as a final candidate.

  • Set a threshold 0.7 for genes kept using mean, do the same as above.

  • Set a threshold 0.8 for genes kept using standard deviation, do the same as above.

  • Now we have a set of genes with which we can run a full search.

  • If we want to further reduce the size of the final gene set, we can take squared transformation of the data (or other transformations), do the above procedure to get a new set of genes, than take the common genes as the final gene set.