A metastasis prediction model in non-small cell lung cancer using GLCM_contrast and epithelial mesenchymal transition related genes

Purpose: The aim of this study was to develop a metastasis prediction model for non-small cell lung cancer (NSCLC) by correlating gene expression levels from next-generation sequencing with image features from fluorine-18-2-fluoro-2-deoxy-D-glucose positron emission tomography (18F-FDG PET) in patients with NSCLC.
Methods: RNA-sequencing data and 18F-FDG PET images of 63 patients with NSCLC (29 metastasis and 34 non-metastasis) from The Cancer Imaging Archive and The Cancer Genome Atlas Program databases were used in a combined analysis. Weighted correlation network analysis (WGCNA) was performed to identify gene groups related to metastasis, and the module with the highest module significance was selected. Genes were selected on the basis of a metastasis-related function and a high AUC (AUC > 0.6). A total of 47 image features were extracted from the PET images as radiomics features. The relationship between gene expression and image features was calculated using a hypergeometric distribution test with the Pearson correlation method. The metastasis prediction model was validated with the random forest algorithm using image texture features related to gene expression.
Results: WGCNA identified 36 modules by gene expression pattern, and the module with the highest module significance was selected. Seven genes from the selected module were identified as involved in the epithelial mesenchymal transition (EMT) pathway, which has an important role in cancer metastasis, and had a high AUC. The expression of these genes was also related to a quantitative image feature (GLCM_contrast, -log10 P-value: 2.45~3.89). The EMT-related gene and GLCM_contrast model yielded accuracy: 0.856 ± 0.06 and AUC: 0.868 ± 0.05, and the GLCM_contrast image texture model yielded accuracy: 0.842 ± 0.06 and AUC: 0.838 ± 0.09.
Conclusion: The GLCM_contrast image texture feature is related to EMT-related gene expression. We developed a model for predicting metastasis of NSCLC using 18F-FDG PET image features and evaluated its accuracy.


Introduction
Non-small cell lung cancer (NSCLC) has a high incidence among cancers and shows large molecular heterogeneity in tissues 1,2. This molecular heterogeneity differs between patients and between intratumor and intertumor regions 3. Intratumor heterogeneity is known to be linked to the development of primary tumors and metastases 4. Cancer can be diagnosed by analyzing intracellular gene expression events, which also helps in finding a suitable treatment method for each cancer 5. Many approaches have been studied to diagnose cancers with different genotypes and to find a treatment for each cancer: image features that analyze phenotypes based on genotype, next generation sequencing (NGS) for large-scale gene analysis, and radiogenomics, which combines fluorine-18-2-fluoro-2-deoxy-D-glucose positron emission tomography (18F-FDG PET) image features with NGS.
NGS is a high-throughput sequencing method that can accurately quantify large amounts of gene information compared to conventional gene analysis methods 6. In the past, gene expression was characterized one gene at a time with electrophoresis after PCR, a time-consuming and expensive procedure that was also limited by sample amounts. Recently, advances in NGS technology have made it possible to analyze total RNA in single cells. Studies of genes involved in NSCLC metastasis have also been conducted using NGS, and genes that play important roles in metastasis, such as EGFR, have been identified 7. However, this method has some disadvantages: sequencing is time-consuming, biopsies are painful and invasive, and genes are identified from the sampled tissue but not necessarily from the entire tissue 8.
Classical imaging techniques use radiation to image the affected area without causing pain to the patient, capture the overall characteristics of the affected area, and allow quick analysis 9, but they show only the cancer phenotype. Radiogenomics combines image feature analysis with NGS-based large-scale gene analysis to reveal the relationship between the expression of specific cancer-related genes and image features. By combining the two analysis methods, diagnosis and prediction of cancer without any invasive method is possible 10. 18F-FDG PET/CT has the advantage of evaluating metabolic processes in cancer. It is an advanced technique compared to CT, a traditional imaging technique: 18F-FDG is taken up during glucose metabolism, and glucose metabolism can be estimated by imaging the FDG remaining in the cell.
Glucose uptake and the FDG concentration remaining in the cells differ depending on the degree of cancer progression. Because the residual FDG concentration is low in early cancer and increases as the cancer progresses, the degree of cancer progression can be evaluated through FDG imaging. This method is also suitable for evaluating cancer metastasis 11. 18F-FDG PET/CT imaging has been used to predict the chemotherapy response after treatment with an anticancer drug in NSCLC 12. In other studies, 18F-FDG PET/CT image features were used for prognosticating survival in NSCLC 13.
Radiogenomics has recently been applied to NSCLC. Research on gene expression specific to NSCLC has already been conducted, and it is well known that the EGFR gene plays an important role in metastasis when mutated 14. A recent study showed that 18F-FDG PET/CT image features are correlated with EGFR mutation status in NSCLC 15. In that study, patient DNA was collected to distinguish patients with EGFR mutations, and image features of CT images were analyzed to determine whether the features (SUVmax, SUVmean, and SUVpeak) were related to the EGFR mutation. A metastasis prediction model was estimated with these results. In another study, mRNA extracted from NSCLC tissues was analyzed by NGS to find metagenes, and correlations between the NGS results and CT image features were examined 16. The relationship of the expressed metagenes and image features with cancer cell proliferation was studied.
Epithelial mesenchymal transition (EMT) plays an important role in cancer metastasis. In NSCLC cells, activation of EMT induces cell migration, proliferation, and invasion 17.
In this study, we estimated the correlation between the expression of genes involved in NSCLC metastasis and quantitative 18F-FDG PET image texture features. The NSCLC metastasis prediction model was developed using image texture features related to gene expression.

Results
In this study, 18F-FDG PET data and RNA-sequencing data from 63 patients with NSCLC were used for analysis. The average age of the patients was 67.5 years, and the ratio of men to women was approximately 8:2 (Table 1). The process used to relate the RNA-sequencing data to the 18F-FDG PET image features is shown schematically in Fig. 1.
Gene modulation and hub gene assay
To search for hub genes with an important role in metastasis, WGCNA was first used to construct gene modules with similar expression patterns, and a network analysis was then performed to search for hub genes. A total of 36 gene modules were obtained (Fig. 2). The module with the highest significance in the metastasis group was selected. To confirm the function of the gene module, GO term analysis was performed. A total of 7 genes were selected as EMT-related genes with high GS scores (GS > 0.8) and high AUC values (AUC > 0.6).
Hub gene and image feature associations
The analysis was performed using 47 radiomics features and 7 EMT-related genes, and the relationships between their expression levels were obtained. Among the relationships between image features and gene expression levels, the top 50 genes were selected in descending order of overall relationship and visualized as a heatmap (Fig. 3). The results show one image feature (GLCM_contrast) that was significantly related (P-value < 0.05) to the expression of seven genes (NME1.NME2, LST1, KAT7, BMX, CLIC1, KANSL2, and UFL1) (Table 2).
Table 2 List of the seven genes that are related to image features in NSCLC metastasis. P-values were calculated using the hypergeometric distribution method. Genes were selected for smaller P-values and for functions related to metastasis and EMT (values were normalized by -log10).
Estimation of the prediction model
Gene expression levels and features extracted from PET/CT images were used to create models for predicting metastasis of NSCLC. The EMT-related gene (7) model precision, recall, AUC, and accuracy score were 0.860 ± 0.16, 0.642 ± 0.2, 0.766 ± 0.09, and 0.799 ± 0.06, respectively. The histogram first-order (15) model precision, recall, AUC, and accuracy were 0.77 ± 0.14, 0.713 ± 0.04, 0.713 ± 0.04, and 0.794 ± 0.07, respectively. The texture (32) model precision, recall, AUC, and accuracy score were 0.80 ± 0.13, 0.642 ± 0.18, 0.766 ± 0.08, and 0.805 ± 0.07, respectively. The EMT gene (7) and radiomics (47) model precision, recall, AUC, and accuracy score were 0.840 ± 0.10, 0.814 ± 0.13, 0.856 ± 0.06, and 0.868 ± 0.05, respectively. Finally, the GLCM_contrast model precision, recall, AUC, and accuracy score were 0.759 ± 0.04, 0.828 ± 0.21, 0.838 ± 0.09, and 0.842 ± 0.06, respectively (Table 3).
Table 3 Precision, recall, AUC, and accuracy of the predictive models built with the random forest algorithm from metastasis-related genes and extracted image features.

Discussion
In this study, RNA-sequencing data and 18F-FDG PET/CT images of patients with non-small cell lung cancer were used to search for gene groups whose expression is related to non-small cell lung cancer metastasis and for imaging features related to the expression of those gene groups. The gene group involved in metastasis has an EMT function, which is known to induce metastasis. One of the imaging features, GLCM_contrast, was observed to be related to the expression of genes with EMT function. This is a clue that the metastasis of non-small cell lung cancer may be predicted through the analysis of imaging features.
Recently, the combination of two analysis methods, NGS and PET/CT imaging, has been studied to overcome the limitations of each. Predicting and diagnosing lung cancer metastasis is a serious problem for patients because lung cancer shows no symptoms or pain until the late stages, so it has often already spread to other organs and is likely to be at a late stage when diagnosed 18.
A combined gene and image diagnosis method has the advantage of being noninvasive 19 and fast compared to existing diagnostic methods, and can also assess the cancer as a whole. In terms of genetic analysis, two methods were used to reduce the number of genes used for analysis. The first was to select genes with significant differences between the two groups using a t-test 20, and the second was to use the hub gene assay to select genes with the desired functions. The t-test made the analysis more efficient by removing genes that did not differ significantly between the groups 20. Genes were divided into modules according to their expression patterns through WGCNA, and each gene was assigned a significance value according to its contribution to the module. The selected module had the highest gene significance. A total of 7 genes were identified as EMT-related genes from the selected module (GS > 0.8 and AUC > 0.6). The hypergeometric distribution method 21 was used to identify which EMT-related genes are associated with image features extracted from the images. The relevance between image features and genes was expressed as a P-value and listed in ascending order; P-values greater than 0.05 were excluded. For each gene, expression levels were compared between patients with and without metastasis to identify differences between the two conditions. A total of seven genes were identified as strongly related to one radiomics feature: GLCM_contrast. The seven identified genes, NME1.NME2, LST1, KAT7, BMX, CLIC1, TAP2, and PSMB9, are known to be involved in EMT. Bone marrow X-linked kinase (BMX) has been reported to be involved in EMT-associated processes such as cell growth, transformation, migration, survival, apoptosis, and tumorigenicity 22-25.
Nucleoside diphosphate kinase A (NME1) and nucleoside diphosphate kinase B (NME2) form the complex unit NM23 (NME1.NME2), which has nucleoside diphosphate kinase activity and catalyzes the phosphorylation of nucleoside diphosphates to the corresponding nucleoside triphosphates. NME1.NME2 is the first metastasis suppressor in lung cancer, and a decrease in NME1.NME2 increases cancer metastasis 26. The function and mechanism of leukocyte-specific transcript 1 protein (LST1) have not been well studied, but high expression of LST1 in metastasized lung cancer has been reported 27. Chloride intracellular channel 1 (CLIC1) is associated with the action of the antiangiogenic peptide CLT1 on proliferating endothelial cells 28. CLIC1 is mainly overexpressed in the tumor vasculature, and overexpression has been observed in breast, lung, and liver cancer patients 29,30. CLIC1 has been shown to promote invasion and proliferation of tumor and endothelial cells, but the underlying mechanism is unclear 31. Transporter associated with antigen processing 1 (TAP1) regulates WISP2, which can affect TGF-β signaling. TGF-β signaling plays one of the most important roles in EMT in breast cancer 32. Proteasome subunit beta type-9 (PSMB9) is coexpressed with RARRES3 and is a well-known metastasis suppressor in breast cancer cells 33.
EMT is an evolutionarily conserved process in which cells convert from epithelial cells to mesenchymal cells. EMT was found in a study on the development of embryonic stem cells. It is a major activity during embryonic stem cell development, gastrulation, neural crest formation, and development of the heart and other tissues and organs 34. Recent studies have shown that EMT is also implicated in cancer progression and metastasis. Studies on breast cancer metastasis suggest that EMT is also involved in the acquisition of characteristics of cancer stem-like cells (CSCs) 35. CSCs are cancer cells that share the embryonic stem cell characteristics of self-renewal, regeneration, and differentiation into diverse types of cancer cells. CSCs are thought to be crucial for the initiation and maintenance of tumors as well as their metastasis 36.
Many studies using NGS for NSCLC have been performed because NGS can determine the molecular characteristics of the cancer state for diagnosis or treatment 37. NGS can analyze gene expression levels quickly and at a large scale compared to conventional gene analysis methods. However, one limitation is that biopsies are needed for sampling, which is not possible in all cancer cases because of the cancer location 38. Another limitation is representativeness 39: cancer tissues are highly heterogeneous, so biopsy samples cannot represent all cancer regions. To overcome this limitation, image features were introduced into the analysis.
PET/CT images have recently become a popular research topic for the diagnosis of NSCLC. Features extracted from the images were used for analysis. Each feature represents a cell status such as cell shape, cell surface texture, and cell density. These features were digitized for cancer analysis using a mathematical method 40.
GLCM_contrast is a feature from image feature analysis. It is classified as a texture feature in the LIFEx image analysis tool. In general, features such as SUVmax, SUVpeak, TLG, and ENTROPY have been used in radiogenomics analysis for cancer prediction or cancer metastasis prediction 43. However, in this study, the correlation (P-value) of SUVmax, SUVpeak, TLG, and ENTROPY was lower than that of GLCM_contrast. This result shows that new factors such as GLCM_contrast can be used to develop a model for predicting metastasis of NSCLC using radiogenomics.
One limitation of our study is that, although we provide evidence that EMT-related genes are related to GLCM_contrast in NSCLC, we do not provide mechanistic studies. While this was not the goal of this study, future investigations could aim to uncover the mechanisms of the genes that play an important role in NSCLC metastasis and to elucidate their correlation with imaging features. Large-scale follow-up studies of the molecular mechanism of metastasis in NSCLC could strengthen the study and further confirm and extend our findings. In addition, because radiomics features related to EMT genes could be identified in this study, it should be possible to search for imaging biomarkers for diagnosis and prognosis by analyzing genetic functions related to other cancers or diseases.

Conclusion
In this study, we confirmed through RNA-sequencing analysis that the gene group involved in NSCLC metastasis was related to EMT function. The expression of these genes was related to an image texture feature, GLCM_contrast. The random forest prediction models built from the EMT-related gene group together with GLCM_contrast, and from GLCM_contrast alone, both showed high accuracy. These results reveal the possibility of a prediction model using image texture features related to gene expression in NSCLC metastasis.

Material And Methods
The classification into the metastasis and non-metastasis groups was performed with reference to clinical data from TCGA. Patients in the N1 and N2 stages were placed in the metastasis group, and those in the N0 stage were placed in the non-metastasis group. Patient information is summarized in Table 1. Acquired data were normalized by FPKM. Genes with zero FPKM values in all samples were trimmed for fast analysis 44. For differentially expressed gene (DEG) analysis, the DESeq2 R package was used 45, with the metastasis and non-metastasis groups as the input groups. DEG analysis results were visualized as volcano plots with ggplot in R 46.
Weighted gene co-expression networks and modules associated with clinical traits
To analyze the correlation between expressed genes and features extracted from images, gene selection was conducted first. A total of 22,125 genes were analyzed by DEG analysis, and only those genes with significant differences were selected 47. To obtain the gene module with the greatest influence on metastasis, WGCNA was performed 48. The genes were separated into several modules using the WGCNA R package. A soft threshold for network construction was selected for gene clustering.
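The soft-thresholding step can be illustrated with a minimal Python sketch (the analysis itself used the WGCNA R package). It assumes an unsigned network in which the adjacency is the absolute Pearson correlation raised to a power beta. The value beta = 6, the gene names, and the expression values are hypothetical and are not taken from this study.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def soft_threshold_adjacency(expr, beta=6):
    """Unsigned adjacency a_ij = |cor(g_i, g_j)| ** beta, always within [0, 1].

    beta is an assumed soft-threshold power, not the one used in the paper.
    """
    adjacency = {}
    for g in expr:
        for h in expr:
            if g == h:
                adjacency[g, h] = 1.0
            else:
                adjacency[g, h] = abs(pearson(expr[g], expr[h])) ** beta
    return adjacency


# Hypothetical expression values for three genes across four samples.
expr = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.1, 5.9, 8.0],
    "gene_c": [3.0, 1.0, 4.0, 2.0],
}
adjacency = soft_threshold_adjacency(expr, beta=6)
for pair, value in sorted(adjacency.items()):
    print(pair, round(value, 4))
```

Raising the correlation to a power keeps strong correlations close to 1 while pushing weak ones toward 0, which is why the adjacency stays continuous instead of being cut at a hard threshold.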
With a soft threshold, the adjacency matrix takes a continuous range of values between 0 and 1. The constructed network conforms to a power-law distribution and is closer to a real biological network state. A scale-free network was constructed using the blockwiseModules function, followed by module partition analysis to identify gene co-expression modules, which group genes with similar expression patterns. The modules were defined by cutting the clustering tree into branches using a dynamic tree cutting algorithm and were assigned different colors for visualization 49. The module eigengene (ME) of each module, which represents the expression level of the module, was calculated, and the correlation between the ME and clinical traits was calculated for each module. Finally, the gene significance (GS), which represents the correlation between genes and samples, was calculated. Genes from selected modules with a GS value of 0.8 or more and a P-value of 0.05 or less were selected 50. The AUC value of each gene was calculated, and genes with high AUC values (AUC > 0.6) were selected for correlation assays.
Functional and pathway enrichment analyses of selected modules
Genes from selected modules were used for functional analysis. DAVID 6.8 51 [14].
Texture features were assessed using four texture matrices: co-occurrence matrix (CM), gray-level run length matrix (GRLM), gray-level zone length matrix (GZLM), and neighborhood gray-level difference matrix (NGLDM). The CM was calculated in 13 directions with a one-voxel distance between neighboring voxels, and each texture feature calculated from this matrix was the average of the feature over the 13 directions in space (X, Y, Z). The GRLM was also calculated for 13 directions via a similar method, whereas the GZLM was computed directly in 3D. The NGLDM was computed from the difference in gray levels between one voxel and its 26 neighbors in 3D, and each texture feature was calculated from this matrix 57.
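The co-occurrence matrix contrast described above can be sketched in plain Python as follows. This is an illustration rather than the LIFEx implementation: it assumes the volume has already been discretized into integer gray levels, uses the common definition contrast = sum of (i - j)^2 p(i, j), and averages over the 13 unique 3D directions at a one-voxel distance. The toy volume is hypothetical.

```python
from collections import Counter
from itertools import product

# 13 unique 3D directions: half of the 26 neighbors, because opposite offsets
# describe the same voxel pairs.
DIRECTIONS = [d for d in product((-1, 0, 1), repeat=3) if d > (0, 0, 0)]


def cooccurrence(volume, offset):
    """Count gray-level pairs (i, j) of voxels separated by offset (dz, dy, dx)."""
    dz, dy, dx = offset
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    counts = Counter()
    for z, y, x in product(range(nz), range(ny), range(nx)):
        z2, y2, x2 = z + dz, y + dy, x + dx
        if 0 <= z2 < nz and 0 <= y2 < ny and 0 <= x2 < nx:
            counts[volume[z][y][x], volume[z2][y2][x2]] += 1
    return counts


def contrast(counts):
    """Contrast = sum over (i, j) of (i - j)^2 * p(i, j), with p = count / total."""
    total = sum(counts.values())
    return sum((i - j) ** 2 * c for (i, j), c in counts.items()) / total


def glcm_contrast(volume):
    """Average the contrast over every direction that has at least one voxel pair."""
    values = []
    for offset in DIRECTIONS:
        counts = cooccurrence(volume, offset)
        if counts:
            values.append(contrast(counts))
    return sum(values) / len(values) if values else 0.0


# Hypothetical single-slice volume (indexed as volume[z][y][x]) with two gray levels.
checkerboard = [[[0, 1], [1, 0]]]
uniform = [[[2, 2], [2, 2]]]
print(glcm_contrast(checkerboard), glcm_contrast(uniform))
```

A uniform region gives a contrast of zero, while regions whose neighboring voxels differ strongly in gray level give larger values, so the feature measures local intensity variation within the tumor.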
A total of 47 features were extracted from the PET image data.
Hub gene and image feature correlation
A total of 47 image features and 7 genes were used to estimate the relationships between all factors, which were calculated using a hypergeometric distribution test with the Pearson correlation method. The hypergeometric P-value was calculated using the equation p = (kCx)((N − k)C(n − x))/(NCn), where N is the number of total genes in the genome, k is the number of expression values identified in gene expression, n is the number of expression values of features identified in the images, x is the number of overlapping genes, and kCx is the number of possible combinations of genes and image features 58. The image features and genes used to build the metastasis prediction model were selected by the P-value of the correlation (P-value < 0.05). The selected image features were compared with image values that are generally used for validation in radiogenomics.
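As a worked illustration of the hypergeometric calculation, the sketch below evaluates the standard hypergeometric probability, C(k, x) C(N − k, n − x) / C(N, n), and converts it to the -log10 scale used for reporting. The upper-tail sum is a common way to turn the point probability into an enrichment P-value and is an assumption here, as are the example counts, which are not taken from this study.

```python
from math import comb, log10


def hypergeom_pmf(N, k, n, x):
    """Probability of exactly x overlaps: C(k, x) * C(N - k, n - x) / C(N, n)."""
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)


def hypergeom_upper_tail(N, k, n, x):
    """Probability of x or more overlaps (assumed form of the enrichment P-value)."""
    return sum(hypergeom_pmf(N, k, n, i) for i in range(x, min(k, n) + 1))


# Hypothetical counts: population of 10, 4 marked items, 3 drawn, 2 overlapping.
N, k, n, x = 10, 4, 3, 2
p_exact = hypergeom_pmf(N, k, n, x)
p_tail = hypergeom_upper_tail(N, k, n, x)
print(p_exact, p_tail, -log10(p_tail))
```

With these counts the exact probability is 6 × 6 / 120 = 0.3, and adding the x = 3 term (4 / 120) gives the upper-tail value of 1/3.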

Evaluation of the metastasis prediction model
To predict the patient's outcome in terms of metastasis, we used a machine learning approach 59 called random forest (RF) 60. The prediction model was evaluated by its accuracy, precision, and recall scores on test data. Prediction was performed 10 times to obtain an average value 61. Six models were evaluated with the random forest algorithm: a radiomics (47) only model, an EMT-related gene (7) model, a histogram first-order (15) model, a texture (32) model, an EMT-related gene (7) and radiomics (47) model, and a GLCM_contrast model.
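The evaluation metrics and the averaging over repeated runs can be sketched as below. The random forest training itself is omitted; the labels, scores, 0.5 decision threshold, and per-run accuracies are hypothetical, and the use of the sample standard deviation for the "±" values is an assumption.

```python
from statistics import mean, stdev


def classification_scores(y_true, y_pred):
    """Precision, recall, and accuracy for binary labels (1 = metastasis)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(y_true),
    }


def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(values):
    """Mean and sample standard deviation across repeated runs."""
    return mean(values), stdev(values)


# Hypothetical test-set labels and predicted metastasis probabilities from one run.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

run_scores = classification_scores(y_true, y_pred)
run_auc = auc(y_true, scores)

# Hypothetical accuracies from repeated runs (the study repeated prediction 10 times).
accuracy_mean, accuracy_sd = summarize([0.80, 0.90, 0.85])
print(run_scores, run_auc, accuracy_mean, accuracy_sd)
```

Reporting the mean together with the spread across runs shows how much a score depends on the particular train and test split, which matters with a cohort of 63 patients.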

Declarations
Authors' contributions
BC designed the experiment, acquired and analyzed the patient data with RNA-sequencing analysis tools, and wrote the article. IH and BH advised on the classification of clinical data into metastasis and non-metastasis groups and on data trimming. JK performed the image feature analysis and machine learning analysis. SK supervised the entire process as the corresponding author. All authors read and approved the final manuscript.