Research on Predicting the Occurrence of Hepatocellular Carcinoma Based on Notch Signal-Related Genes Using Machine Learning Algorithms

Background/Aims: Hepatocellular carcinoma is a highly malignant tumor that is difficult to diagnose and treat and whose prognosis is difficult to predict. The Notch signaling pathway can affect hepatocellular carcinoma. We aimed to predict the occurrence of hepatocellular carcinoma based on Notch signal-related genes using machine learning algorithms. Materials and Methods: We downloaded hepatocellular carcinoma data from The Cancer Genome Atlas and Gene Expression Omnibus databases and used machine learning methods to screen the hub Notch signal-related genes. Machine learning classification algorithms were used to construct a prediction model for the classification and diagnosis of hepatocellular carcinoma. Bioinformatics methods were applied to explore the expression of these hub genes in the hepatocellular carcinoma tumor immune microenvironment. Results: We identified 4 hub genes, namely, LAMA4, POLA2, RAD51, and TYMS, which were used as the final variables, and found that AdaBoostClassifier was the best algorithm for the classification and diagnosis model of hepatocellular carcinoma. The area under the curve, accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and F1 score of this model in the training set were 0.976, 0.881, 0.877, 0.977, 0.996, 0.500, and 0.932, respectively. The corresponding values in the testing set were 0.934, 0.863, 0.881, 0.886, 0.981, 0.489, and 0.926. The area under the curve in the external validation set was 0.934. Immune cell infiltration was related to the expression of the 4 hub genes. Patients in the low-risk group of hepatocellular carcinoma were more likely to show immune escape. Conclusion: The Notch signaling pathway was closely related to the occurrence and development of hepatocellular carcinoma. The hepatocellular carcinoma classification and diagnosis model established on this basis had a high degree of reliability and stability.


INTRODUCTION
Primary liver cancer is one of the most common malignant tumors that seriously threaten human health worldwide, and hepatocellular carcinoma (HCC) is the most common type of primary liver cancer. The latest report 1 released by the International Anti-Cancer Alliance shows that primary liver cancer is the fifth most common malignant tumor in men and the seventh most common in women. Currently, the global incidence of HCC is increasing year by year, with 1 million new cases each year. 2 The main treatment for HCC is surgical resection. Although surgical treatment may be effective for early HCC, the overall 5-year survival rate of patients is only 50%-70%. 3 Moreover, up to 60%-70% of patients experience tumor recurrence within 5 years after surgery, and the long-term prognosis after hepatectomy is poor. 4 Therefore, early diagnosis and early treatment are particularly important for the prognosis of HCC, and it is necessary to develop a new model to predict the occurrence of HCC.
The Notch signaling pathway is a classic signaling pathway, and its family members are highly conserved in structure. In-depth studies of the Notch signaling pathway have found that it plays an important role in the occurrence and development of tumors. 5 It has been reported that the Notch signaling pathway can promote the development of cervical cancer cells, resulting in the formation of tumors. 5 Similarly, downregulation of Notch1 expression in pancreatic cancer can significantly inhibit the growth of pancreatic cancer cells, promote cell apoptosis, and stop the cell cycle at G0-G1. 6 Gramantieri et al 7 found that the expression of Notch3 and Notch4 in liver cancer was significantly higher than that in adjacent tissues, that Notch3 and Notch4 were also expressed in normal liver tissues and chronic hepatitis tissues, and that the Notch signaling pathway may participate in the invasion and metastasis of liver cancer. However, it is unclear whether Notch signal-related genes (NSRGs) are related to the prognosis of HCC, and the exact relationship between immune infiltrating cells in the HCC microenvironment and NSRGs needs further exploration. Given the rapid development of machine learning algorithms, we applied them to study these 2 questions.

MATERIALS AND METHODS

Data Collection
In October 2021, we downloaded the data of 424 HCC cases (tumor tissues: 374 cases, normal tissues: 50 cases) from The Cancer Genome Atlas (TCGA) database (https://tcga-data.nci.nih.gov/tcga/) as a training set and main research cohort. Then, we downloaded the HCC patient data (normal tissues: 192, tumor tissues: 240) of the GSE36376 and platform GPL10558433 from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) as an external validation data set.
We used the R language to extract the expression matrix of NSRGs. All expression data were standardized by Z-score processing (the mean value of the sample becomes 0, and the variance becomes 1). Notch signal-related genes were set as independent variables (features) and normal samples/tumor samples as dependent variables (labels) for the occurrence of HCC.
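The Z-score step can be illustrated with a short Python sketch. The original analysis was carried out in R; the function name `zscore`, the example values, and the use of the population standard deviation (rather than the sample standard deviation) are assumptions made here for illustration only.

```python
from statistics import fmean, pstdev

def zscore(values):
    """Standardize a vector to mean 0 and variance 1 (population SD assumed)."""
    mu = fmean(values)
    sd = pstdev(values)
    if sd == 0:
        raise ValueError("a constant vector cannot be standardized")
    return [(v - mu) / sd for v in values]

# Hypothetical expression values, not taken from the study data.
expr = [2.0, 4.0, 4.0, 6.0, 9.0]
z = zscore(expr)
```

After this transformation the vector has mean 0 and variance 1, so genes measured on different scales become comparable as model features.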

Differential Expression Analysis of Notch Signal-Related Genes
We used the "limma" package of R language to select the differentially expressed NSRGs (DENSRGs) of HCC, with the criteria of |logFC| > 1 and FDR < 0.05 (FC is fold change and FDR is false discovery rate).
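The selection criteria can be sketched as a simple filter over a differential expression table. This Python sketch does not reproduce the "limma" analysis itself; the function name `select_densrgs`, the placeholder gene names, and the numeric values are hypothetical and only show how the two thresholds are combined.

```python
def select_densrgs(results, lfc_cutoff=1.0, fdr_cutoff=0.05):
    """Return genes with |logFC| > lfc_cutoff and FDR < fdr_cutoff."""
    return [gene for gene, logfc, fdr in results
            if abs(logfc) > lfc_cutoff and fdr < fdr_cutoff]

# Hypothetical rows of (gene, logFC, FDR); not results from the study.
table = [
    ("GENE_A", 2.3, 0.001),    # up-regulated and significant
    ("GENE_B", -1.6, 0.010),   # down-regulated and significant
    ("GENE_C", 0.4, 0.0001),   # significant, but the effect is too small
    ("GENE_D", 1.9, 0.200),    # large effect, but not significant
]
selected = select_densrgs(table)
```

Both conditions must hold, so a gene with a large fold change but a high FDR is excluded, as is a highly significant gene with a small fold change.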

Identification of Hub Notch Signal-Related Genes
In the TCGA-HCC cohort, 2 machine learning algorithms of least absolute shrinkage and selection operator (LASSO) and Support Vector Machine Recursive Feature Elimination (SVM-RFE) were used to screen the important NSRGs of HCC.
LASSO is a regression analysis algorithm that performs variable selection and regularization at the same time. It was implemented with the "glmnet" package of R language with parameter settings of family: binomial, alpha: 1, type.measure: deviance, nfolds: 10.
SVM-RFE can express complex classification boundaries by combining with the kernel function. It was implemented with the "e1071, kernlab, caret" packages of R language with parameter settings of functions: caretFuncs, method: cv, method: svmRadial. To avoid over-fitting of the model, we also performed a univariate regression analysis of each gene during the selection of the feature genes using the "survival, survminer" packages of R language with the filter condition of P < .05.
The intersecting genes of the 3 methods were identified as the final feature genes for further research as the variables of the classification and diagnosis model.
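The intersection step amounts to keeping only the genes selected by all 3 methods. The Python sketch below illustrates this; the helper name `intersect_gene_sets` and the placeholder genes `GENE_X`, `GENE_Y`, and `GENE_Z` are hypothetical, and the input sets are not the actual outputs of LASSO, SVM-RFE, or the univariate regression.

```python
def intersect_gene_sets(*gene_sets):
    """Return the sorted list of genes present in every input set."""
    if not gene_sets:
        return []
    common = set(gene_sets[0])
    for genes in gene_sets[1:]:
        common &= set(genes)
    return sorted(common)

# Hypothetical selections; only the 4 shared genes are named in the text.
lasso = {"LAMA4", "POLA2", "RAD51", "TYMS", "GENE_X"}
svm_rfe = {"LAMA4", "POLA2", "RAD51", "TYMS", "GENE_Y"}
univariate = {"LAMA4", "POLA2", "RAD51", "TYMS", "GENE_Z"}
hub = intersect_gene_sets(lasso, svm_rfe, univariate)
```

Requiring agreement among all methods is a conservative choice: a gene selected by only one or two methods is dropped.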

Establishment and Verification of the Prognostic Model
The model of HCC classification and diagnosis was constructed based on the expression of core genes. In this study, 5 machine learning classification algorithms, namely, XGBClassifier, LGBMClassifier, AdaBoostClassifier, MLPClassifier, and SVM, were used to construct the initial classification and diagnosis model.

Hub Gene Expression Analysis
To further analyze the relationship between the participating variables (hub genes) and HCC, the Mann-Whitney U-test was used to compare the expression levels of hub genes between the tumor group and the normal group. Pearson's correlation was used to calculate the correlation between risk genes.

Immune Cell Infiltration Analysis
The CIBERSORT algorithm was used to evaluate the infiltration of 22 immune cells in the TCGA-HCC cohort. Then, we compared the distribution of the 22 immune cells between the normal group and the tumor group. Spearman's correlation analysis was performed between the 22 immune cells and the hub genes. Finally, the risk of patients was scored according to the gene expression of the selected variables in the model. The risk score was calculated as the sum of the gene expression values of all risk genes, each weighted by its LASSO coefficient. The patients were divided into the high-risk group and the low-risk group by the median value of the risk score and then analyzed for immunotherapy responsiveness. The Tumor Immune Dysfunction and Exclusion (TIDE) tool (http://tide.dfci.harvard.edu/) was used to predict immunotherapy responsiveness.
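The risk scoring and median split described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the coefficients and expression values are invented, the function names `risk_score` and `split_by_median` are hypothetical, and assigning patients exactly at the median to the low-risk group is a choice made here that the text does not specify.

```python
from statistics import median

def risk_score(expression, coefficients):
    """Weighted sum of gene expression values using per-gene coefficients."""
    return sum(coefficients[g] * expression[g] for g in coefficients)

def split_by_median(scores):
    """Label a patient 'high' if the score exceeds the cohort median, else 'low'."""
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}

# Hypothetical coefficients and expression values; not from the study.
coef = {"LAMA4": 0.5, "POLA2": 0.25, "RAD51": 1.0, "TYMS": 0.75}
patients = {
    "P1": {"LAMA4": 1.0, "POLA2": 2.0, "RAD51": 0.0, "TYMS": 0.0},
    "P2": {"LAMA4": 0.0, "POLA2": 0.0, "RAD51": 2.0, "TYMS": 0.0},
    "P3": {"LAMA4": 2.0, "POLA2": 0.0, "RAD51": 1.0, "TYMS": 2.0},
    "P4": {"LAMA4": 0.0, "POLA2": 0.0, "RAD51": 0.0, "TYMS": 0.0},
}
scores = {pid: risk_score(expr, coef) for pid, expr in patients.items()}
groups = split_by_median(scores)
```

With these invented inputs the scores are 1.0, 2.0, 3.5, and 0.0, the median is 1.5, and patients P2 and P3 fall into the high-risk group.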

RESULTS
The flowchart of this study is shown in Figure 1.

Identification of Hub Genes
The LASSO algorithm was applied to the DENSRGs of the TCGA-HCC cohort to select key feature genes, and the optimal value of λ with the smallest mean square error was determined through 10-fold cross-validation (Figure 2A). When λ was 0.006, 17 feature genes were screened out (Figure 2B). The SVM algorithm is mainly used for binary classification problems: it finds a hyperplane that separates the 2 categories with the minimum classification error rate. SVM-RFE analysis showed that 37 genes were closely related to HCC (Figure 2C). Univariate regression analysis found that 18 NSRGs were related to the survival of HCC (Figure 2D). The intersection of the 3 methods finally yielded 4 hub genes: LAMA4, POLA2, RAD51, and TYMS (Figure 2E).
In our present study, the classification and diagnosis model of HCC was constructed based on the expression levels of LAMA4, POLA2, RAD51, and TYMS (Figure 3A). We chose the best among the 5 machine learning classification algorithms of XGBClassifier, LGBMClassifier, AdaBoostClassifier, MLPClassifier, and SVM by the metrics of area under the curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score. The best performer in the training set was AdaBoostClassifier (Figure 3B), and its scores in the training set were AUC: 0.976, accuracy: 0.881, sensitivity: 0.877, specificity: 0.977, PPV: 0.996, NPV: 0.500, and F1 score: 0.932 (Table 1). The best performer in the testing set was also AdaBoostClassifier (Figure 3C), and its scores in the testing set were AUC: 0.934, accuracy: 0.863, sensitivity: 0.881, specificity: 0.886, PPV: 0.981, NPV: 0.489, and F1 score: 0.926 (Table 2). The results in the training set were consistent with those in the testing set, and AdaBoostClassifier was considered the best model.
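For reference, the threshold-based metrics listed above can all be derived from a confusion matrix. The Python sketch below uses hypothetical counts (not the confusion matrix of this study) with tumor as the positive class; the function name `classification_metrics` is an assumption. It also illustrates how, when normal samples are far fewer than tumor samples, the NPV can be low even though sensitivity and specificity are both high.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common diagnostic metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }

# Hypothetical imbalanced cohort: 100 tumor samples and 10 normal samples.
m = classification_metrics(tp=90, fp=1, tn=9, fn=10)
```

In this invented example sensitivity and specificity are both 0.9, yet the NPV is only 9/19 (about 0.47), because the 10 missed tumors outnumber the 9 correctly identified normal samples.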

Validation of the Model
We used GEO-HCC cohort data to verify the classification and diagnosis model constructed by the AdaBoostClassifier method. Across all patients, the model achieved an AUC of 0.940 in classifying HCC tumor versus normal tissue (Figure 3D).
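The AUC can be read as the probability that a randomly chosen tumor sample receives a higher predicted score than a randomly chosen normal sample. The Python sketch below computes it by direct pairwise comparison; the function name `roc_auc`, the labels, and the scores are hypothetical and unrelated to the GEO-HCC results.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = tumor, 0 = normal) and predicted scores.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
auc = roc_auc(labels, scores)
```

Here 5 of the 6 tumor-normal pairs are ranked correctly, giving an AUC of 5/6. The pairwise loop is quadratic in the number of samples, which is acceptable for a sketch; library implementations use a rank-based computation instead.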

Gene Expression Analysis
The 4 risk genes LAMA4, POLA2, RAD51, and TYMS were differentially expressed between normal and tumor tissues, and all 4 genes were highly expressed in tumor tissues (Figure 4A). We found that POLA2 and TYMS were highly positively correlated with RAD51, while LAMA4 was highly negatively correlated with RAD51 (Figure 4B).
In the analysis of the difference in immune cell infiltration between normal and tumor tissues, Tregs (P < .001), monocytes (P = .024), macrophages (P < .001), and neutrophils (P = .002) showed differences (Figure 4C and 4D). The correlation analysis between immune infiltrating cells suggested that macrophages and Tregs were negatively correlated with T cells (Figure 5A). The correlation analysis between risk genes and immune cell infiltration subtypes found that activated NK cells had no correlation with LAMA4, POLA2, RAD51, or TYMS. However, almost all of the other immune cell infiltration subtypes showed varying degrees of correlation with the risk genes (Figure 5B-5E). We used the public website http://tide.dfci.harvard.edu to analyze TIDE and microsatellite instability (MSI) for immunotherapy of HCC and found that HCC patients showed differences in TIDE between the high-risk group and the low-risk group (Figure 6A-6D).

DISCUSSION
The mortality of HCC ranks second among all cancers, and the new cases of HCC in China each year account for more than half of new cases worldwide. 8 The treatment of HCC is affected by liver function, nodule size, metastasis, and age. The Notch signaling pathway, as a classic signaling pathway, can regulate the occurrence of tumor cells. 9 There is evidence that the Notch signaling pathway is extraordinarily active in HCC. 10 At present, surgical resection is still the main option for HCC, but the recurrence risk is very high. With the help of machine learning methods, the prediction accuracy of early HCC can be improved, thereby further improving the treatment outcome of HCC patients.
In our present study, we first used the LASSO and SVM-RFE algorithms combined with univariate survival analysis to study the differential expression matrix of NSRGs in the TCGA-HCC cohort. Then, after comparing the results of 5 machine learning classification algorithms, we decided to use AdaBoostClassifier to establish the HCC classification and diagnosis model. The HCC classification and diagnosis model established in our present study had an AUC of 0.976 ± 0.007 in the training set and 0.934 ± 0.033 in the test set. In the external data test of the GEO-HCC cohort, this model had an AUC of 0.934, an accuracy of 0.863, a sensitivity of 0.881, a specificity of 0.886, a PPV of 0.981, an NPV of 0.489, and an F1 score of 0.926. These results showed that the classification and diagnosis model established in our present study might be highly reliable. Duan   which suggested that our present study has certain advantages and reliability compared with similar models.
Studies have shown that the downregulation of LAMA4 expression can inhibit the proliferation and migration of breast cancer, renal cell carcinoma, gastric cancer, and ovarian cancer cells. 14-16 Our study showed that LAMA4 was highly expressed in HCC tumor tissues. Considering that LAMA4 is closely related to the migration of cancer cells and tumor progression in a series of tumors, the latest research describes LAMA4 as "oncolaminin." 17 The crosstalk between Notch and TGF-β1 has been reported many times. It has been reported that LAMA4 could affect the level of Notch ligand and its receptor by regulating TGF-β1, 18 thereby inducing the expression of some key proteins related to the occurrence and development of HCC. Circ_POLA2 has been reported as an oncogene of lung cancer. 19 It has been found that overexpression of Circ_POLA2 can promote the proliferation of acute myeloid leukemia cells. 20 Circ_POLA2 may upregulate the G protein subunit beta 1 (a Notch pathway-related molecule) by serving as an endogenous competing RNA for miR-326.14. 21 Guanine nucleotide regulatory protein (G protein) is the core of normal liver cell function and is related to the occurrence and progression of liver disease. It was reported that the G protein family was involved in the development of HCC. 22 In our present study, the comparison between liver tumor tissue and paracancerous tissue showed that POLA2 expression was relatively lower in the normal paracancerous tissues. The DNA repair protein RAD51 plays an important role in maintaining genome stability and regulating the cell cycle. It has been reported that the DNA repair system in most HCC cells is extraordinarily active, resulting in a poor therapeutic effect in HCC. 23 RAD51 is a key protein for DNA double-strand repair. Highly expressed RAD51 promotes DNA repair in HCC cells. 24 Chen et al 25 have reported that in ovarian cancer, knocking down the expression of RAD51 can significantly reduce the proliferation rate of ovarian cancer cells. It has been found that the high expression of RAD51 is related to the higher pathological grade and clinical stage of HCC, and it is an independent risk factor affecting the overall survival and prognosis of HCC. 26 TYMS is the key rate-limiting enzyme that controls the synthesis of dTMP. The synthesis of dTMP is closely related to functions such as DNA synthesis, replication, and repair. Therefore, TYMS is currently considered to be the next anti-tumor target that is most likely to be successfully developed. 27 Studies have shown that the activity of thymidylate synthase in many patients with malignant tumors is significantly higher than that in normal tissues. 27 TYMS can regulate the growth of tumor cells by affecting the expression and expression cycle of P53, so TYMS is related to the proliferation status of malignant tumors. 28 Studies have shown that TYMS is overexpressed in most tumor cells with growth advantages, and the higher the expression of TYMS, the worse the prognosis of patients. 28 A study showed that the positive expression rate of TYMS in liver cancer tissues was significantly higher than that in the control group and adjacent tissues, and the high expression of TYMS indicated that the tumor was more aggressive. 29

Figure 5. (A) Graph depicting significant associations among the 22 infiltrating immune cell types; (B) correlation between LAMA4 and immune cell infiltration; (C) correlation between POLA2 and immune cell infiltration; (D) correlation between RAD51 and immune cell infiltration; (E) correlation between TYMS and immune cell infiltration.

Zhou et al. Diagnosis model of hepatocellular carcinoma Turk J Gastroenterol 2023; 34 (7): 760-770

Immune infiltrating cells are an important part of the tumor microenvironment (TME) and are closely related to the progression of HCC. 30 In this study, we used the CIBERSORT algorithm to evaluate patients' immune cell infiltration. Further analysis found that there were differences in regulatory T cells (Tregs), monocytes, macrophages M0, and neutrophils between normal tissues and tumor tissues (P < .05).
The proportion of Tregs and macrophages M0 in tumor tissues is higher than that in normal tissues, and Tregs are strongly positively correlated with macrophages M0. In contrast, the proportion of monocytes and neutrophils in tumor tissues is lower than that in normal tissues, and monocytes are strongly negatively correlated with neutrophils. The increase of macrophages M0 is significantly correlated with overall survival (OS) and tumor stage of HCC. 31 Macrophages M0 can stimulate the production of TAM and Kupffer cells in the presence of carcinogenic factors, thereby inhibiting the progression of HCC caused by immunity. 32 This may be related to the malignant behavior of the highly expressed genes from our research and analysis. The correlation analysis of our present study also confirmed the significant positive correlation between LAMA4, RAD51, and TYMS and macrophages M0. It has been reported that TGF-β1 is strongly positively correlated with macrophages in HCC. 33 Macrophages M0 can secrete a large amount of TGF-β1. 33 In our study, it was found that LAMA4 affected the progress of HCC by regulating TGF-β1. This is consistent with the findings of the present study. Under normal circumstances, Tregs inhibit the anti-autoimmune response and play an important role in balancing immune tolerance and inflammation. It has been reported that the upregulation of Tregs is a predictor of adverse outcomes in HCC patients. 34 It has been reported that Tregs promote the migration and invasion of liver cancer cells through epithelial-to-mesenchymal transition induced by TGF-β1. 35 Neutrophils are the most common white blood cells in circulation. They play an important role in host defense, immune regulation, and tissue damage. Neutrophils are considered to be one of the first immune cells that enter the TME and interact with cancer cells. Thus, they play an important role in the progression of cancer. Our findings are similar to those of other studies.
The content of neutrophils in normal tissues is high, and it is significantly negatively correlated with LAMA4, RAD51, and TYMS. In our present study, we found that RAD51 was strongly positively correlated with TYMS, and this interrelationship further strengthens the immune function of neutrophils.
The liver has a special population of immunosuppressive cells, which can avoid liver damage caused by autoimmunity and chronic inflammation under normal physiological conditions. In patients with liver cancer, however, these cells can cause tumor immune escape and promote disease progression. Our present study found that the TIDE scores in the low-risk population of HCC were all higher than those in the high-risk population of HCC (P < .05). This showed that immune escape was prone to occur in the low-risk population. The previous analysis showed that Tregs are abundant in HCC tumors and are a subset of CD4+ T cells, a type of lymphocyte with high immunosuppressive properties. 36 They suppress the immune response by inhibiting CD8+ T cell effector functions and directly promote tumor escape through a variety of contact-dependent and non-contact mechanisms. 37 In HCC, neutrophils can recruit macrophages and Tregs into HCC by releasing cytokines, thereby promoting tumor progression and the development of resistance to sorafenib. 38 There are some limitations to our study. The biological functions of LAMA4, POLA2, RAD51, and TYMS need to be further explored by experiments. The construction and validation of our model are based only on public databases, and thus more clinical research data are needed to further validate the clinical efficacy of this model.
In conclusion, the classification and diagnosis model of HCC based on NSRGs in our present study showed