Integrative approach for classifying male tumors based on DNA methylation 450K data

: Malignancies such as bladder urothelial carcinoma, colon adenocarcinoma, liver hepatocellular carcinoma, lung adenocarcinoma and prostate adenocarcinoma significantly impact men’s well-being. Accurate cancer classification is vital in determining treatment strategies and improving patient prognosis. This study introduced an innovative method that utilizes gene selection from high-dimensional datasets to enhance the performance of the male tumor classification algorithm. The method assesses the reliability of DNA methylation data to distinguish the five most prevalent types of male cancers from normal tissues by employing DNA methylation 450K data obtained from The Cancer Genome Atlas (TCGA) database. First, the chi-square test is used for dimensionality reduction and second, L1 penalized logistic regression is used for feature selection. Furthermore, the stacking ensemble learning technique was employed to integrate seven common multiclassification models. Experimental results demonstrated that the ensemble learning model utilizing multiple classification models outperformed any base classification model. The proposed ensemble model achieved an astonishing overall accuracy (ACC) of 99.2% in independent testing data. Moreover, it may present novel ideas and pathways for the early detection and treatment of future diseases.


Introduction
Cancer is a serious disease that profoundly affects human physical and mental health.According to the International Agency for Research on Cancer (IARC) of the World Health Organization, approximately 19.3 million people worldwide were diagnosed with cancer in 2020 [1], with over half being male patients.Moreover, male-specific tumors [2] (such as prostate adenocarcinoma) have garnered significant attention due to their high incidence rates and their impact on men's health.There are notable differences in the global occurrence rates of male cancers [3].Accurately classifying these cancers is one of the fundamental strategies to furnish clinical decision-making information and reduce the mortality rates of male cancers [4].Among these, bladder urothelial carcinoma [5], colon adenocarcinoma [6], liver hepatocellular carcinoma [7], lung adenocarcinoma [8] and prostate adenocarcinoma [9] are prevalent cancers in males.The incidence of these prevalent cancers in men, among which prostate adenocarcinoma is one of the most common, increases with age.Bladder urothelial carcinoma and colon adenocarcinoma are usually associated with diet, lifestyle and genetic factors.On the other hand, liver hepatocellular carcinoma is primarily associated with hepatitis B and C virus infection, while lung adenocarcinoma is associated with smoking and exposure to airborne pollutants.These cancers' high incidence and mortality rates pose considerable threats to human health and life.Hence, cancer classification is crucial in selecting appropriate treatment strategies and improving the patient's prognosis.DNA methylation analysis has emerged as a promising tool for cancer classification [10], providing valuable insights into tumor biology and revealing potential therapeutic targets.
DNA methylation is the process of covalently modifying DNA by adding a methyl group to cytosine residues located in CpG dinucleotide contexts without altering the DNA sequence itself [11].This process is critical in regulating gene expression, maintaining genomic stability and silencing transposable elements [12].Increasing evidence suggests that abnormal DNA methylation patterns are associated with many diseases [13], especially cancer.Specifically, abnormal DNA methylation patterns in CpG island promoter regions [14] can lead to an increased loss of control of gene expression and genomic instability, thus promoting tumor initiation and progression.It is noteworthy that DNA methylation analysis has become an effective tool for cancer classification primarily because this technique can provide comprehensive information on the methylation status of individual CpG sites [15].Consequently, it can accurately identify differential methylation patterns between normal and tumor tissues, making it an essential tool for cancer diagnosis and classification.
In recent years, high-throughput sequencing technology [16] has emerged as one of the most crucial tools in cancer research.DNA methylation data, which are closely associated with cancer development, are one of the types of data analyzed using this technology.With the continuous advancement of sequencing technology and computer processing capabilities, an increasing amount of large-scale DNA methylation data has been amassed.The challenge is now to extract useful information from these data and classify cancer, a critical issue in current cancer research.In addition to integrating multiple high-throughput sequencing data, artificial intelligence technology has also been widely used in cancer research.For instance, deep learning algorithms can be utilized to automate tasks and improve work efficiency in cancer diagnosis and treatment.For example, Mohammed et al. [17] used multiple One-Dimensional Convolutional Neural Network (1D-CNN) models stacked together to classify five types of cancer based on The Cancer Genome Atlas (TCGA) RNA-seq data.Jia et al. [18] proposed a method that combines variance selection with recursive feature elimination, successfully selecting 20 optimal features from over 480,000 dimensions of DNA methylation data.They compared the performance of four different estimators and five classifiers and achieved an accuracy of over 93%.Furthermore, Lin et al. [19] developed a new cancer prediction model, iCancer-Pred, utilizing deep neural networks.This model can classify seven different cancer datasets obtained from the TCGA Hub database on the University of California Santa Cruz (UCSC) XENA platform [19,20].The authors compared this method with machine learning techniques such as support vector machines (SVM), logistic regression (LR) and random forest (RF).By means of 5-fold cross-validation, they achieved the highest accuracy of the model to be up to 97%.Although several existing studies have made significant progress in cancer classification using various models, there is still a need to overcome model limitations and improve the overall performance of cancer classification.
In this study, we propose an ensemble learning-based classification algorithm called Stacking for classifying male tumors.Specifically, we utilize the chi-square test and L1 regularity based logistic regression to select features highly associated with the characteristics of the cancer dataset.Subsequently, we devised an ensemble learning algorithm to distinguish the five most common cancers in males and their corresponding normal tissues.Stacking has been tested on DNA methylation 450K cancers data set, where the results demonstrated a significant advantage in the accurate classification of cancer.In addition, this study explores the relationship between potential genes and the survival rates of these five common cancers through gene ontology analysis, survival analysis, literature review and other related methods.Our findings suggest that the SRC gene is associated with bladder urothelial carcinoma survival, while RPS2, RPL23A, RPL22, RPL27 and SRC genes are related to liver hepatocellular carcinoma survival.Furthermore, KRAS gene is associated with lung adenocarcinoma survival, and SRC gene is associated with prostate adenocarcinoma survival.These discoveries may assist in the early identification and precise categorization of these cancer types, while also pinpointing potential treatment approaches to enhance the survival rates among high-risk males.

Data collection and preprocessing
UCSC XENA is one of the websites derived from the TCGA database.The site stores several large public datasets on cancer, including TCGA, GETX and TARGET, among others with powerful and intuitive functionality.
The DNA methylation 450K data used in this study were downloaded exclusively from the UCSC XENA platform, which included datasets for bladder urothelial carcinoma (BLCA), colon adenocarcinoma (COAD), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD) and prostate adenocarcinoma (PRAD).A total of 2241 samples of both cancer and normal tissue were obtained by combining these five datasets, as shown in Table 1.The dataset was then divided into a training set and a testing set at a ratio of 9:1.

Feature selection method
Given the high-dimensional nature of the data in this study, with more than 480,000 dimensions, the sample size seems somewhat limited.However, it is essential to note that not all features hold equal importance for the classification model.Therefore, it is crucial to identify and select the most informative features to ensure accurate and effective cancer classification.This task is achieved through feature selection and dimensionality reduction, where representative features are selected.In the training dataset, features containing "NaN" were removed, and in the test dataset, they were substituted with 0.
To select features relevant to the five common tumor classifications, the chi-square test was initially employed for feature selection.The chi-square test [21] is a statistical method used to evaluate the independence between categorical variables.It is employed to assess the significance of each feature in predicting the target variable.We can determine the association between features and cancer classifications by utilizing the chi-square test, thereby selecting the crucial features.Specifically, we used the SelectKBest [22] function from the scikit-learn library [23] to filter out features with top chisquare scores.Based on this, through a cross-validation approach, we determined that the performance of the chi-square test was significantly improved at a feature count of 22,120.The features selected by the chi-square test are highly relevant to cancer classification tasks [24] and thus are of significant importance.Consequently, these features were utilized as input features for subsequent classifier training and testing.The formula used is as follows: denotes the observed value of the cross term in row , column ;  represents the expected value of the cross term in row , column ； denotes the number of rows;  denotes the number of columns.
Although approximately 400,000 CpG sites were removed from the cancer dataset using chisquare testing, which significantly reduced the number of sample features, it is still necessary to further reduce the number of features to construct a high-performance predictor.Feature selection can assist in reducing model complexity [25], which minimizes the risk of overfitting and enhances model interpretability and explainability.By selecting features with strong predictive power for the target variable, feature selection can improve the predictive performance of the model [26].Additionally, feature selection can decrease data processing and modeling time and expenses.
To accomplish this aim, we employed a logistic regression model based on the L1 parametric penalty [27], and the SelectFromModel function of the scikit-learn library was used to filter features."L1" refers to L1 regularization, which is a regularization technique used in machine learning models like linear regression and logistic regression [28].This approach helps identify crucial features of the classification task by penalizing the model's complexity, thus preventing overfitting.These features not only enhance the model's performance and predictive power but also its interpretability and practical application value.In practical applications, we can perform more nuanced feature selection and optimization based on the significance and weight of these features to further improve the model's performance and application.The L1 regularization method naturally possesses feature selection properties because of its sparse solution characteristic.
The logistic regression model computes the probability of a data point belonging to a specific class based on a linear combination of input features [29].The fundamental logistic regression formula without regularization is: Here,   | denotes the probability of data point  belonging to class  ,  represents the weights for each feature  and exp is the exponential function.When L1 regularization is applied, the objective function is the sum of the log loss to be minimised and the L1 regularization term, which is the absolute sum of the weights.The L1 regularization term is added with a regularization strength parameter .The objective function is: where  is the true label of the data points, and the L1 regularized logistic regression model is derived by minimizing the objective function   with respect to the weight  .

Overall process
Stacking is an ensemble method for models [30], where the combination of multiple weaker models often yields better performance than a single strong model.This approach involves training several base learners and using their predictions as input to a meta-learner.The stacked ensemble algorithm offers superior performance, generalization capabilities and flexibility compared to individual algorithms by leveraging the advantages of multiple base learners to enhance model accuracy and robustness.In this study, we propose a framework combining a chi-square test, logistic regression with L1 penalty and stacking ensemble learning to construct a multiclass classifier for five types of cancer data.The overall flowchart of this study is presented in Figure 1.
The approach has trained seven base classification models: random forest (RF), support vector machine (SVM), bootstrap aggregated algorithm (Bagging), stochastic gradient descent (SGD), multilayer perceptron (MLP), logistic regression (LR) and LightGBM (LGBM) [31][32][33][34].The reason for selecting these models is that they are based on different algorithms and can capture different data features.The LR, SVM and SGD models are linear models, the RF model captures nonlinear relationships and interactions, Bagging and LighTGBM capture nonlinear relationships by boosting weak learners and MLP can solve linearly inseparable problems.In this study, the integrated algorithm is designed to leverage the strengths of multiple models more effectively than a single algorithm.This approach aims to enhance performance robustness and accuracy.
All base learners were trained on the whole training set and then evaluated with the validation set.These predictions were used to train the meta-learner along with the true labels in the validation set.For the meta-learner, we chose the LGBM model.The specific prediction process is shown in Figure 2.
In summary, stacking is an effective method of ensemble learning that combines multiple models to achieve higher performance than any single model [35].By leveraging different base learners and using a meta-learner to find the best way to combine them, stacking can produce better results compared to any single model.Our experimental performance demonstrated the effectiveness of the stacking approach for this classification task.

Classifier performance evaluation
Scientific evaluation metrics are crucial to the performance metrics of a model.We usually use a variety of evaluation metrics to measure the performance of a model, such as accuracy and recall.These metrics can not only help us understand the predictive ability of the model but also help us optimize the parameters and structure of the model to improve its performance.In this study, the evaluation of model performance contains five metrics: Accuracy (ACC), Matthews Correlation Coefficient (MCC) [36], Precision (PRE), Geometric mean (Gmean), Recall (RECALL) and Kappa Coefficient (KAPPA) [37].

Results
During model training, we performed 10-fold cross-validation on the training set to optimize the parameters.In this, 10-fold cross-validation is performed on the training dataset.The original dataset is first divided into training and test sets.Then, we further divide the training set into ten equal-sized subsets for cross-validation.In each cross-validation, one subset is used as the validation set and the remaining nine subsets are used to train the model, ensuring that each subset has acted as a validation set.
Selecting an effective feature selection method improves the performance of predictive models and obtains better explanatory power.For this purpose, a comprehensive comparison of various feature selection methods was performed and recursive feature elimination (RFE) [38], elastic network (ENET) [39], and a combination of logistic regression with chi-square test based on L1 regularity were considered.
During the comparison process, we observed that the combined methods exhibit promising performance in the feature selection task.The results presented in Table 2 enable us to clearly compare the variations in performance among these methods for the prediction task.In the fourth row of Table 2, "99.22  0.004" indicates that in the 10-fold cross-validation, the value of ACC is 99.22 and the variance is 0.004.
Building upon these findings, the chi-square test proves to be highly valuable in categorization problems as it allows us to identify features that are significantly associated with the target variable, potentially related to cancer in this study.As for the logistic regression method based on L1 regularization, it induces sparsity in the feature coefficients by applying L1 regularization, and this sparsity helps to filter out irrelevant or redundant features, thus improving the generalization ability of the model.Consequently, we opted for the combination method to screen features and have continued utilizing this approach in subsequent studies.Afterwards, we conducted a 10-fold cross-validation of all classification methods.The statistical results are shown in Tables 3 and 4. The performance of the stacking ensemble learning method was found to be superior to those of the base learners while also exhibiting good stability.Due to the instability of independent testing results at each training, we took the average of 10 results in the experiment.The data achieved very good results on the stacking ensemble learning model, with over 99% in all criteria except RECALL, as shown in Table 4 and Figure     The confusion matrix [40] is shown in Figure 4, and it can be seen that the model performs well in distinguishing between the five types of cancer and normal tissue.It can also be observed that out of all the samples in the independent testing data, only two were misclassified, where one normal tissue sample was incorrectly predicted as a PRAD sample, and another COAD sample was incorrectly predicted as a BLCA sample.This result is satisfactory, which indicates that the model has high accuracy and reliability and can be used in clinical practice.At the same time, although the misclassification rate is low, we still need to continue to optimize and improve the model to improve its accuracy and applicability in future research.
To validate the generalizability of the proposed model, we utilized the dataset and neural network employed by Lin et al. [19] to assess the performance of our classifier in this study.By comparing the performance of our model with their dataset and methods, our model consistently outperforms theirs, as demonstrated in Tables 5 and 6.These results indicate that our model excels not only on the original dataset but can also be successfully applied to other datasets with a degree of generality and replicability.This, in turn, enhances the reliability and stability of the model for practical medical applications.

Explaining model predictions using LIME
To achieve interpretable predictions and gain insights into feature contributions, we utilized the local interpretable model-agnostic explanations (LIME) [41] model.In this study, we used 9511 CpG sites as features for predicting cancer types.Figure 5 shows the LIME prediction results of a sample by using the stacking integrated learning model, which screens the top 6 predictive biomarkers most helpful in classifying normal tissue, bladder urothelial carcinoma (BLCA), colon adenocarcinoma (COAD), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD) and prostate adenocarcinoma (PRAD).The prediction probability table in the top-left corner of Figure 5 shows the model's probability of predicting a given sample as one of these types of cancer.In this case, LIME assigns a feature weight 0.20 for cg11055493 feature values less than or equal to 0.39 (cg11055493 ≤ 0.39).
Additionally, we demonstrated the feature weights of other predicted features.We detailed each feature's values and color codes in the "Feature-Value" table, which specifies whether a given feature contributes to the prediction.Specifically, normal tissue is displayed in blue, bladder urothelial carcinoma (BLCA) is color coded in orange, colon adenocarcinoma (COAD) is color-coded in green, liver hepatocellular carcinoma (LIHC) is color coded in purple, lung adenocarcinoma (LUAD) is color coded in red and prostate adenocarcinoma (PRAD) is color coded in brown.

Genes and biological significance
We obtained 9511 CpG loci by screening and annotating them into genes and finally obtained 8087 genes.By comparing these 8087 genes with published CpG biomarkers, we found that Ding et al.'s study [42] included data for the five types of cancer used in our study, as well as 3000 CpG biomarker genes.At the same time, 863 genes overlapped between the two studies, as shown in Figure 6.For the list of overlapping genes, we performed pathway and process enrichment analysis by using multiple ontology sources, including GO Biological Processes, GO Cellular Components, GO Molecular Functions and KEGG Pathway [43].A series of criteria were applied to screen for biologically significant enrichment terms, including p values less than 0.01, minimum counts of 3 and enrichment factors greater than 1.5 (enrichment factor refers to the ratio between observed counts and randomly expected counts).Based on their membership similarity, we grouped the enriched terms into clusters and used Kappa scores as a similarity measure in the hierarchical clustering process, treating subtrees with a similarity greater than 0.3 as a cluster.Finally, in each cluster, the most significant enriched terms in terms of the above metrics (p-values, counts, etc.) were selected to represent its clusters.For example, for GO:0016570 we filtered the following metrics: count = 34, Log10(P) = -8.09,Log10(q) = -4.16."Count" is the number of genes in the user-provided lists with membership in the given ontology term."Log10(P)" is the p-value in log base 10. "Log10(q)" is the multi-test adjusted p-value in log base 10.The results of these enrichment analyses can help us to deeply understand the functions of these overlapping genes in different biological processes and pathways, providing important clues for further studies.As shown in Figure 7, gene ontology analysis shows that overlapping genes are present in biological processes of histone modification (GO:0016570), DNA metabolic process (GO:0006259), neuromuscular process (GO:0050905), embryo development ending in birth or egg hatching (GO:0009792), localization within membrane (GO:0051668), brain development (GO:0007420), modulation of chemical synaptic transmission (GO:0050804), regulation of cell cycle process (GO:0010564), developmental maturation (GO:0021700) and other related genes; molecular functions exist in transcription coregulator activity (GO:0003712), molecular adaptor activity (GO:0060090), protein domain specific binding (GO:0019904) and transcription factor binding (GO:0008134); among cellular components, there is extrinsic component of membrane (GO:0019898), dendrite (GO:0030425), cell projection membrane (GO:0031253), perinuclear region of cytoplasm (GO:0048471) and transporter complex (GO:1990351) were enriched.In addition, growth hormone synthesis (hsa04935) secretion and action (hsa04935) and the MAPK signaling pathway (hsa04010) were identified in the KEGG pathway.
In this study, the STRING database was employed to search for potential interactions among encoded proteins and investigate their potential interactions.Through this step, we obtained a representation of the protein-protein interaction network, as shown in Figure 8.This network describes the relationships between genes and proteins, such as physical contacts and targeted regulation.Our goal was to elucidate the meaningful molecular regulatory networks in living organisms.Subsequently, the "cytoHubba" plugin in Cytoscape [44] software calculated the node scores of genes in the PPI network and identified the top 10 key genes: RPS2, RPL23A, RPL3L, RPL22, RPL27, EEF2, PIK3CA, SRC, KRAS and CREBBP (as shown in Figure 9).To investigate the effect of these genes on the survival of cancer patients, we performed a survival analysis of 10 potential biomarkers screened from the PPI network and used the TIMER database [45] to further draw Kaplan-Meier survival curves (Figure 10).The Kaplan-Meier survival curve was first proposed by Kaplan and Meier in 1958 [46].It is a non-parametric method used for analyzing survival data, capable of estimating survival probabilities at different time points and visualizing the changes in survival curves.In the field of cancer research, the Kaplan-Meier survival curve is widely employed for analyzing patients' survival data [47,48].Recent studies have shown that this method remains highly effective in predicting patient survival rates.For instance, Hamid Bakhtiari et al. [49] utilized this method to predict the survival rates of hypertensive patients with COVID-19.According to statistical significance, a gene was considered to be significantly associated with cancer survival when P < 0.05.Our analysis revealed that the SRC gene is significantly associated with the prognosis of bladder urothelial carcinoma, liver hepatocellular carcinoma and prostate adenocarcinoma.Additionally, the RPS2, RPL23A, RPL22, RPL27 and SRC genes are significantly associated with the prognosis of liver hepatocellular carcinoma.Furthermore, we found that the KRAS gene is significantly associated with the prognosis of lung adenocarcinoma.These results suggest that these genes have potential prognostic value.However, further clinical validation of these potential biomarkers is needed before they can be used.

Literature review
After conducting a comprehensive review of the literature, we found that many of the critical genes associated with cancer survival identified in this study have been previously reported in the literature.For example, Sree Karani Kondapuram et al. [50] identified SRC as a central autophagy gene associated with LIHC survival, making it a potential drug target.Kowalczyk et al. [51] discovered that RPS2 is overexpressed in mouse liver hepatocellular carcinoma samples and may impact the accuracy of mRNA translation related to aminoacyl-tRNA binding ribosomes, thus promoting cell proliferation.Additionally, Pan et al. [52] identified SRC as a potential therapeutic target for docetaxelresistant prostate adenocarcinoma and an effective prognostic indicator that was significantly correlated with the immune score, ferroptosis, methylation and OCLR score.Wang et al. [53] found that high expression of ribosome-related genes RPL23A and RPL27 significantly reduced the survival rate of patients with liver hepatocellular carcinoma.Moreover, Xu et al. [54] identified SRC as a prognostic gene for BLCA through multivariate Cox regression analysis [55].In lung cancer, KRAS gene mutation is most common in patients with lung adenocarcinoma, and approximately 33% of patients will have this mutation [56].
In summary, the comprehensive literature review confirms the importance of the key genes identified in this study in cancer survival.Our feature selection method has proven to be effective in extracting potential biomarkers.Furthermore, these findings provide additional evidence supporting the potential clinical relevance of our model and the importance of integrating machine learning methods into cancer research.Further research is needed to validate these findings and explore the underlying mechanisms of these genes in cancer development and progression.

Conclusions
In this paper, we present a novel model for predicting the types of five different cancers and their corresponding normal tissues in the DNA methylation 450K dataset.Our proposed model includes a stacked ensemble learning approach combined with a feature selection method based on a chi-square test and logistic regression with L1 regularization.This framework effectively addresses the challenges posed by the high-dimensional nature of the data.Specifically, we utilize a chi-square test for feature selection, followed by logistic regression with L1 regularization as the estimator for SelectFromModel to create an optimized feature set.These selected features are then employed in our stacking ensemble learning model for prediction.Additionally, we have taken steps to mitigate the issue of an unbalanced sample distribution between cancer samples and normal tissues by applying SMOTETomek integrated sampling to the training set.
Compared to existing methods, our proposed stacking ensemble learning model consistently performs better in classifying different cancer types.Our study establishes a robust multiclass predictor capable of identifying a patient's cancer type.Furthermore, we have conducted survival analysis on essential genes to identify potential biomarkers associated with cancer survival, and we have performed comprehensive GO and KEGG pathway analyses to underscore the biological relevance of our findings.In conclusion, our model has great potential in the field of cancer diagnosis and treatment, highlighting the value of combining machine learning methods with DNA methylation data analysis.However, despite conducting relevant bioinformatics analyses, our study still has limitations and requires further validation and testing on a broader cancer dataset.In order to enhance the robustness of our approach, we plan to explore and integrate other types of cancer and related multi-omics data in future research efforts.

4
In this context,  (True Positive) represents the true positives, indicating the number of times the model predicted the positive class correctly;  (True Negative) represents the true negatives, indicating the number of times the model predicted the negative class correctly;  (False Positive) represents the false positives, indicating the number of times the model predicted the negative class as positive;  (False Negative) represents the false negatives, indicating the number of times the model predicted the positive class as negative.The kappa coefficient is a measure of agreement between a classifier and human classification.It compares the observed classification accuracy with the chance agreement. is the observed classification accuracy, and  is the expected classification accuracy by chance.They can be expressed as follows: 3(b).

Figure 3 .
Figure 3. 10-fold cross-validation results and average results of 10 independent test line graphs.

Figure 4 .
Figure 4. Confusion matrix for independent testing of multiclass predictors.

Figure 6 .
Figure 6.Venn diagram of overlapping genes between the two studies.

Figure 7 .
Figure 7. Heatmap of enriched terms for overlapping genes.The intensity of the color indicates the level of enrichment, with darker colors indicating higher levels of enrichment.On the right side, there is a wealth of information on terms from the Gene Ontology (GO) and KEGG Pathway that can be used to clarify the meaning and function of each enrichment term.

Figure 9 .
Figure 9. Key genes for scoring the top 10.

Figure 10 .
Figure 10.Kaplan-Meier curve plot showing the genes with p values less than 0.05.

Table 1 .
Number of samples per type of cancer.

Table 2 .
10-fold cross-validation results for different feature methods.

Table 5 .
Comparison of results using our dataset with the iCancer-Pred approach.

Table 6 .
Comparison of results using iCancer-Pred's dataset with our approach.