Intratumoral Microbiota-Host Interactions Shape the Variability of Lung Adenocarcinoma and Lung Squamous Cell Carcinoma in Recurrence and Metastasis

ABSTRACT Differences in tissue microbiota-host interaction between lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD) about recurrence and metastasis have not been well studied. In this study, we performed bioinformatics analyses to identify the genes and tissue microbes significantly associated with recurrence or metastasis. All lung cancer patients were divided into the recurrence or metastasis (RM) group and the nonrecurrence and nonmetastasis (non-RM) group according to whether or not they had recurred or metastasized within 3 years after the initial surgery. Results showed that there were significant differences between LUAD and LUSC in gene expression and microbial abundance associated with recurrence and metastasis. Compared with non-RM, the bacterial community of RM had a lower richness in LUSC. In LUSC, host genes significantly correlated with tissue microbe, whereas host-tissue microbe interaction in LUAD was rare. Then, we established a novel multimodal machine learning model based on genes and microbes to predict the recurrence and metastasis risk of a LUSC patient, which achieves an area under the curve (AUC) of 0.81. In addition, the predicted risk score was significantly associated with the patient’s survival. IMPORTANCE Our study elucidates significant differences in RM-associated host-microbe interactions between LUAD and LUSC. Besides, the microbes in tumor tissue could be used to predict the RM risk of LUSC, and the predicted risk score is associated with patients’ survival.

information on mutations and copy number variants in 15 genes in LUSC patients (10). Nevertheless, studies predicting recurrence or metastasis based on tissue microbiota with mRNA data and comparisons between different omics data are still lacking.
To resolve the above-mentioned issues, for LUAD, 123 paired transcriptome and tissue microbiota samples were collected, including 75 recurrence or metastasis (RM) samples and 48 nonrecurrence and nonmetastasis (non-RM) samples. For LUSC, 110 paired transcriptome samples and tissue microbiota samples were collected, of which 41 were RM samples and 69 were non-RM samples (30). The primary objectives of this study were (i) to identify the genes and tissue microbes significantly associated with recurrence or metastasis in LUAD and LUSC samples, respectively, (ii) to compare the differences between LUAD and LUSC in gene expression and tissue microbiota, and (iii) to identify novel biomarkers capable of predicting a lung cancer patient's prognosis. This study helps us to expand our understanding of the mechanisms of recurrence or metastasis in NSCLC and contributes to targeted therapies in the clinic.

RESULTS
LUAD and LUSC revealed significant differences in gene expression and tumor tissue microbiota. The host transcriptome and tissue microbiota of LUAD and LUSC were significantly different for recurrence and metastasis. Principal-coordinate analysis (PCoA) was used in gene expression and tumor tissue microbiome data, respectively; we found that patients with LUSC and LUAD were distinguishable (the P value of PCoA is 0.001 in the gene expression data and 0.01 in microbiome data) ( Fig. 1B and D). We analyzed the differentially expressed genes (DEGs) between RM and non-RM. Four hundred one DEGs were identified in LUAD mRNA expression, and 325 DEGs were identified in LUSC mRNA expression (Fig. 1C). Only 15 genes were significantly overlapping in the two sets of DEGs (Fisher's exact test, P = 6.17e210). Stacked bar charts were used to demonstrate the composition of microbes in cancerous tissue from types of patients at the genus level (Fig. 1E). The top 10 genera in relative abundance were similar in LUAD and LUSC, but microbial community diversity in LUSC was significantly higher than in LUAD (Fig. 1F). These results revealed that significant differences between LUAD and LUSC were in terms of both gene expression profiles and tissue microbiome profiles.
RM and non-RM demonstrated significant differences in the microbiome profiles of LUSC. Alpha diversity was calculated across LUAD and LUSC. In LUSC, we did not find a clear difference between RM and non-RM in the Shannon index and Simpson index. However, the richness index and Chao index of RM and non-RM revealed notable differences. Meanwhile, PCoA (P = 0.31) suggested no significant differences between the bacterial communities of RM and non-RM in LUSC (Fig. 2B). Conversely, there were no clear differences between RM and non-RM both in alpha diversity and in beta diversity in LUAD microbial communities (Fig. 2A). These results demonstrated an association between tumor recurrence or metastasis and the tissue microbiome diversity in patients with LUSC. The microbe diversity of cancer tissue was significantly lower in patients who developed tumor recurrence or metastasis than in those who did not.
Integrating host gene expression and tissue microbiome abundance. Procrustes analysis was performed to find an overall association between host gene expression and tissue microbial composition in LUAD and LUSC. The results of our analysis showed a remarkable connection between host gene expression and tissue microbial composition across subjects in LUSC (P = 0.004) (Fig. 3A). The Mantel test was conducted to verify this result. However, the result of the Procrustes analysis was not significant in LUAD (P = 0.09; see Fig. S1A in the supplemental material). The lack of a significant overall connection between host gene expression and tissue microbiota in LUAD might indicate that, instead of an overall correspondence between the two, it is probable that only a subset of tissue microbes is connected with a subset of host genes in LUAD.
Bacteria were significantly associated with DEGs in LUSC. In LUSC, Shigella, Staphylococcaceae, Staphylococcus, Pseudogulbenkiania, and Chromobacteriaceae were enriched in RM, while Leuconostocaceae, Acidovorax, Shewanellaceae, Shewanella, and Comamonadaceae were enriched in the non-RM group (Fig. 3B). Among them, the . DEGs in LUSC patients potentially interacted with differential bacteria in tissues (Fig. 3C). Using Spearman correlations, we found 38 significant host gene-tissue microbiota correlations in LUSC (false-discovery rate [FDR] [q value] of ,0.1). We visualized all the correlations between taxon abundance and host gene expression in Fig. 3D. There were 27 genes and seven microbes in the network (here referring to novel recurrence or metastasis-related signature [34RMSig]). In particular, we found that in the gene-microbe network, microbial nodes have more edges on average than do genes, where Comamonadaceae formed distinct hubs in the network. The family Comamonadaceae now includes the genera Comamonas, Delftia, and Acidovorax. Previous studies have shown that the relative abundance of Acidovorax was higher in smokers, and squamous cell carcinoma (SCC) tumors from smokers have an even greater relative abundance of these bacteria (26). To understand the role of important genes and species in the host-microbe correlation network in LUSC, we used the network shift (NetShift) method to identify important species and genes in   the network. Impressively, the Comamonadaceae family and Shewanella genus displayed high neighbor shift (NESH) scores as the "driver" microbes in our network. Equally, we found HSD3BP4, LINC00675, MKRN9P, and MLIP as the "driver" genes ( Fig. 3E and F). Tissue microbes were associated with pathways. We applied enrichment analysis to characterize the biological functions represented by the host genes which associate with specific tissue microbes. Within the DEGs in LUSC, we found 175 genes significantly correlated with genes in 34RMSig (Spearman correlation, q value of ,0.05), and then we identified eight host pathways that associated with tissue microbes utilizing these selected genes ( Fig. 4A; Fisher's exact test, Benjamini-Hochberg FDR of ,0.05). The host pathways we identified were known to participate in metabolic processes, such as cytochrome P450 (CYP), which was implicated in the development and progression of tumors and the activation of anticancer prodrugs and their metabolic clearance (31)(32)(33).
34RMSig data were precise predictors of recurrence and metastasis in LUSC. We evaluated the performance of 34RMSig data in predicting cancer recurrence or metastasis in patients with LUSC. We selected microbes and mRNA from 34RMSig for modeling separately. The performance of each data set was measured by the average scoring indicator of 5-fold cross-validation (5-fold-cv) ( Fig. 5A and B). Results showed that mRNA performed better than microbiome. The Random Forest (RF) classifier performed better than other machine learning models on mRNA data, with an area under the curve (AUC) of 0.78 on a 5-fold-cv (Fig. 5A). The best AUC of the microbiome was 0.7 (Fig. 5B). Additionally, when combining the mRNA and microbiome, the RF model achieved an AUC value of 0.81, which was higher than the value obtained with mRNA or microbiome alone (Fig. 5C). Results indicated that both gene expression and microbiome abundance information can be used to predict recurrence or metastasis in patients with squamous lung cancer, and mRNA should be prioritized to do such work. Combining the two omics data sets can yield better classification results.
Furthermore, we calculated the feature importance of each gene or microbiome in the machine learning classification model, as shown in Fig. S2A and C. The importance of each feature is represented by the feature information. As a statistical indicator describing the ability of an attribute to distinguish data samples, the information gain can describe the ability of the feature to distinguish sample data. We can see that among the feature importance values of the gene set used to construct the classification model, the feature importance of the gene obtained by the NetShift method ranks in the top 10. Similarly, among the feature importance values of the gene set used to construct the classification model, the feature importance of Shewanella and Comamonadaceae accounts for half of the total microbial feature importance. In addition, we removed the features obtained by NetShift from the classification model and used the same model to build the classification model. We found that the AUC of the prediction model significantly decreased, as shown in Fig  was used for host expression data (diamonds), and Bray-Curtis distance was used for tissue microbiome data (circles). (B) LDA scores for the bacterial taxa differentially abundant between RM and non-RM in LUSC (LDA score of .2.5). Blue bars indicate microbes significantly enriched in the RM group, and red bars indicate microbes significantly enriched in the non-RM group. (C) Correlation plot portraying gene-microbe correlations. Color and size of squares indicate the scale of the correlation, and the asterisks indicate significance of correlation (** denotes q value of ,0.05, and * denotes q value of ,0.1). (D) Visualization of significant gene-microbe correlations using the network. In the network, blue edges denote negative correlations and red edges denote positive correlations. The thickness of the line represents the strength of the correlation. (E and F) Using the network shift (NetShift) method to capture changes between the two cooccurrence networks that correspond to the RM and non-RM patients. Black nodes indicate nodes that are present in both the Case and Control subnets. The size of a node is proportional to its corresponding neighbor shift (NESH) score, and we define a node with a NESH score greater than 1.6 as a drive node. The color of the edges represents the interworking status of the nodes, with red indicating interworking in the RM subnetwork only, green indicating interworking in the non-RM subnetwork only, and blue indicating a tight relationship in both subnetworks. (E) NetShift analysis of host genes in association networks. (F) NetShift analysis of tissue microbes in association networks.

Differences in LUAD and LUSC
Microbiology Spectrum Therefore, we believe that the features obtained by the NetShift method have a good performance in predicting the recurrence and metastasis of LUSC patients.
34RMSig data were strongly significantly associated with patient survival time. To investigate the potential relationship between 34RMSig and patients' survival time, 34RMSig was transformed into a risk-scoring model. The optimal risk cutoff value stratified patients into the high-risk group and low-risk group with different survival times (Fig. 6). Among the four sets of survival data, patients in the high-risk group had conspicuously poorer survival than those in the low-risk group (log rank test P value of ,0.001).
The time-dependent receiver operating characteristic (ROC) analysis showed the performance of 34RMSig for predicting 3-, 4-, and 5-year overall survival, progressionfree interval, disease-specific survival, and disease-free interval. Compared to 3-and 5-year values, 34RMSig achieved higher AUC scores in predicting 4-year survival

DISCUSSION
Currently, the treatment methods for LUAD and LUSC mainly include chemical therapy, radiation treatment, and surgical treatment. However, a high proportion (30% to 60%) of patients will develop a local and/or distant recurrence after surgery, resulting in a long-term survival rate of less than 50% for patients after surgery (34)(35)(36)(37). Though there are many studies of NSCLC recurrence or metastasis, differences between LUAD and LUSC in recurrence or metastases are rarely explored. Our study fills this vacancy and finds biomarkers at the tissue microbiota level. Based on transcriptome sequencing (RNA-seq) and microbiome abundance data of LUAD and LUSC patients, we analyzed the differences in host gene expression and tissue microbes between the two groups of patients. DEGs associated with tumor recurrence or metastasis were significantly different in the two patient groups. In LUSC patients, the richness of bacterial communities of non-RM was higher than that of RM, and no significant difference was found between non-RM and RM in patients with LUAD. Again, we found there were no significant differences between the two groups in beta diversity. Then, we creatively identified a host-microbe association in LUSC patients by integrating data from these two omics data sets, and there was no significant correlation in LUAD. Further, we identified the driver genes and microbes in the host-microbe interaction network and observed significant differences in several driver nodes such as LINC00675 and Comamonadaceae. We established classification models to predict patients' recurrence or metastasis in LUSC and compared the predictive performance of mRNA and microbiome data in 34RMSig. Results showed that mRNA has a better predictive performance than tissue microbiome, and the prediction of the model was further enhanced when combining two omics data sets, with an AUC value of 0.81 on 5-fold-cv, which suggested that host gene expression and tissue microbiome abundance information were complementary in predicting recurrence or metastasis in LUSC. To further investigate the clinical application of these correlated genes and microbes, we used the 34RMSig to stratify LUSC patients into two risk groups, and the 34RMSig revealed a significant association with survival and could distinguish between LUSC patients with good and poor outcomes.
We also identified a set of pathways that were associated with specific tissue microbiome composition in LUSC. These pathways were mostly related to various metabolic and metabolism-related enzymes, and some of them have been previously implicated in lung cancer: for example, drug metabolism-cytochrome P450, metabolism of xenobiotics by cytochrome P450, and steroid hormone biosynthesis. Previous studies found that CYP (cytochrome P450) metabolizes carcinogens to their inactive derivatives but sometimes converts chemicals to carcinogens. Additionally, CYP is also involved in the activation and/or inactivation of anticancer agents, which indicates that the expression of CYP in lung cancer and paraneoplastic tissues may be an important factor in the efficacy of anticancer drugs. CYP24, which is frequently expressed in lung cancer tissues, can convert 25-dihydroxyvitamin D3 and 1alpha into inactive 24-hydroxylated derivatives, and 1alpha and 25-dihydroxyvitamin D3 inhibit the proliferation of cancerous tissues (31,32,38). Besides, steroid hormones include glucocorticoids, salt corticoids, progestins, androgens, and estrogens (39). There is growing evidence that the lung contains receptors for estrogen and progesterone and that these hormones have a role in lung development, lung inflammation, and lung cancer (40). It was shown that cytoplasmic estrogen receptor (ER)-a, cytoplasmic ER-b, and total progesterone receptor (PR) expression were higher in squamous cell carcinoma than in adenocarcinoma (AD), while HER2 expression was higher in adenocarcinoma than in squamous cell carcinoma (41). Furthermore, steroid hormones are associated with poor prognosis in lung cancer patients, and one study demonstrated that ERb-1 was associated with a poorer prognosis in women with stage I lung cancer. Nuclear ERb-1 was also associated with poor survival in patients with metastatic lung cancer but not in patients with early-stage lung cancer (40).
Over the past few decades, the study of microbes has provided many insights into research on lung cancer. The importance of the interaction between the host and microbiome in the pathogenesis of lung cancer has become increasingly apparent (42). Studying the interaction between host genes and tissue microbes in the recurrence or metastasis of lung cancer can fill out the understanding of cancer occurrence, development, and diagnosis.
The genus Comamonadaceae is of interest in our study. Previous studies found that in lung cancer, the relative abundance of the Comamonadaceae in paracancerous tissue was significantly higher than that in cancerous tissue areas, and Comamonadaceae had the highest linear discriminant analysis (LDA) score among several taxa associated with tumor status (43). The family Comamonadaceae included the genera Comamonas and Acidovorax, which were differentially abundant in squamous cell carcinoma (SCC) versus adenocarcinoma (AD) in non-small cell lung cancer. Of these, Acidovorax is well differentiated between smoker and nonsmoker populations, with higher levels of Acidovorax in the lungs of smokers than of nonsmokers. In addition, Acidovorax and Comamonas were also more abundant in tumors carrying TP53 mutations, the most common somatic mutation in lung squamous cell carcinoma and strongly associated with prognostic survival of lung squamous carcinoma patients, suggesting that Acidovorax and Comamonas are specific for lung SCC with TP53 mutations (26,44,45). Although there is no direct evidence that these microorganisms promote recurrence or metastasis of lung SCC or induce TP53 mutations, they are significantly associated with factors that contribute to recurrence or metastasis and poor prognosis in patients with squamous lung cancer, such as smoking and TP53 mutations. Therefore, we believe that these tissue microbes can be used as markers to identify recurrence or metastasis in LUSC.
In host-microbe associations, maternally expressed gene 3 (MEG3) was significantly negatively correlated with Comamonadaceae. MEG3, located in the imprinted region of chromosome 14 DLK1-MEG3, is a long noncoding RNA (lncRNA) that is expressed in many normal tissues. Recently, several studies have shown that MEG3 expression level is reduced in lung cancer, hepatocellular carcinoma, and gastric cancer (46)(47)(48). In NSCLC, MEG3 regulates cell proliferation and apoptosis through activation of TP53, and patients presenting low levels of MEG3 expression had a comparatively poor prognosis. Besides, compared with normal tissue, the expression of MEG3 is reduced in non-small cell lung cancer (NSCLC) tumor tissues (49).
We speculate that the association between MEG3 and Comamonadaceae in lung squamous carcinoma patients might be able to be described by TP53: in NSCLC patients with deletion of MEG3 expression and relative increase in Comamonadaceae due to mutations in the TP53 locus, there is a negative correlation between MEG3 expression and Comamonadaceae abundance, in agreement with our findings in host gene-tissue microbial association analysis in Fig. 3D.
LINC00675 was significantly negative with the genus Shigella. LINC00675, also known as TMEM238L, is a long intergenic noncoding RNA found in the 17p13.1-p12 region of the human chromosome. Previous studies have shown an association between abnormal LINC00675 expression and poor prognosis in a variety of cancers. LINC00675 is involved in the proliferation and migration of cancer cells by promoting the activity of the Wnt/b-catenin signaling pathway to activate downstream, targeting c-myc and cyclin D1 (50,51). In gastric cancer, LINC00675 enhanced the phosphorylation of vimentin on Ser83, while in colorectal cancer, LINC00675 inhibited the development of colorectal cancer by sponging miR-942 and regulating Wnt/b-catenin signaling (46,52). miR9-942-5p is a microRNA (miRNA) that has now been shown to regulate cell proliferation and apoptosis in esophageal squamous cell carcinoma, human liver fibrosis, and non-small cell carcinoma (53,54). In recent years, it has been shown that miR-942 is highly expressed in NSCLC, and serum miR-942 expression levels are significantly upregulated in NSCLC and correlated with unfavorable clinical variables and low survival in NSCLC (55). Furthermore, in a study of the NSCLC tumor microenvironment, the M2 macrophage-derived exosome miR-942 was found to promote the migration and invasion of LUAD cells and promote angiogenesis (56). Overexpression of miR-513b significantly inhibits proliferation, invasion, and migration and promotes apoptosis in non-small cell lung cancer (NSCLC) cells (57). LINC00675 has been previously found to be aberrantly expressed with miR-513b-5p in several types of cancer and regulates the malignant phenotype of cancer cells (58). Therefore, we hypothesized that LINC00675 might be able to influence tumor recurrence or metastasis in NSCLC by regulating miR-942 and miR-513b.
THSD7A promotes angiogenesis, which is closely associated with the tumor microenvironment, growth, and metastasis in NSCLC, and tumor-associated angiogenesis promotes tumor progression and metastasis (59)(60)(61). Shewanella bacteria are Gram negative and commonly found in the marine environment, ice-cold fish, and protein foods, and their role in the development of cancer is poorly studied. Recently, it has been shown that Shewanella and Halomonas are symbiotic in lung tissue and that microbes belonging to the genera Halomonas and Shewanella predominate in the lung tissue of patients with lung cancer (62).
The relative abundance of Shewanella and Halomonas was significantly higher in rectal and distal patients; however, in lung cancer patients, the relative abundance of Halomonas was significantly lower in lung tumor tissue than in normal tissue near the tumor, contradicting the results we obtained previously (63,64). We hypothesize that in addition to the abundance of Halomonas, Shewanella abundance in lung tumor tissue of lung cancer patients is influenced by other factors, such as dysregulated immune responses (65). Therefore, the relationship between THSD7A and Shewanella needs to be further explored.
Our study has some limitations. First, our data are limited, limiting the generalizability and applicability of our model. Some factors that may influence microbial abundance or host gene expressions such as patient medication history and dietary habits were not available to us. Second, the tissue microbiome data in our study had only genus-level taxonomic resolution. The absence of species-level bacterial data reduces the depth to investigate differences in host-microbial interactions between LUAD and LUSC for recurrence and metastasis. In the future, the addition of species-level bacterial abundance data can further validate our conclusion, that is, to test the reliability of the genus-level bacteria excavated in this work. Finally, given that studying causality in humans is challenging, future studies using in vivo or in vitro models may be used to investigate the causal relationship and direction of action between specific host genes and microbes.
In conclusion, we performed a comprehensive analysis and found significant differences in gene expression and tissue microbiome between LUAD and LUSC for recurrence or metastasis. In LUSC, the analysis was capable of finding an interaction between host genes and tissue microbes, whereas in LUAD there was no such relationship. Besides, our results suggest that microbiome data can be used as a complementary explanatory variable in the prediction of tumor recurrence and metastasis in LUSC patients, improving the predictive power of the model. In addition, the genes and microbes in the association network can provide a good picture of the survival status of LUSC patients. This study can expand our understanding of the differences between LUAD and LUSC and provide new ideas for the study of tumor recurrence or metastasis in patients with LUSC.

MATERIALS AND METHODS
Sample collecting and preprocessing. Research data were collected from the TCGA database. Paired transcriptome profiles and tissue microbiome data of LUAD and LUSC patients were obtained from previous studies (LUAD samples, 123 cases; LUSC samples, 110 cases) (30). Then, we used clinical information to determine tumor recurrence or metastasis, and using 3 years as the boundary, 75 of the LUAD patients had cancer recurrence or metastasis, while the corresponding number of LUSC patients was 41. We noted patients who suffered tumor recurrence or metastasis as RM and marked the annotations with patients who had no tumor recurrence and metastasis as non-RM.
Differences in transcriptome and microbiome between LUAD and LUSC. The R package 'vegan' (66) was used to identify the differences in transcriptome and tissue microbiome data between LUAD and LUSC. Differences in gene expressions and microbial communities were analyzed using principalcoordinate analysis (PCoA).
Selection of DEGS. We used the R package "DESeq2" (67) to identify the differentially expressed genes (DEGs) of mRNA. We used a P value of ,0.05 and a log 2 fold change of .1 as thresholds for upregulated genes and a P value of ,0.05 and a log 2 fold change of ,1 for downregulated genes. Finally, expression levels with significant differences related to recurrence or metastasis were obtained according to a jlog 2 (fold change)j of $1 and an adjusted P value of ,0.05.
Procrustes analysis and Mantel test. To assess the overall correspondence between host gene and tissue microbe in LUAD and LUSC, we performed Procrustes analysis in R using the 'vegan' package. For each disease cohort, we used Euclidean distance for host gene expression data and Bray-Curtis dissimilarity for tissue microbiome data. The significance of the rotation agreement was obtained using the 'protest ()' function with 999 permutations. We also applied a Mantel test to verify the overall correlation between dissimilarity matrices of host gene expression and tumor tissue microbiome abundance in each cohort.
Microbial diversity analysis. Alpha diversity was used to analyze the diversity of the microbial community based on the operational taxonomic unit (OTU) counts in our samples. Additionally, we used PCoA to analyze differences in microbial communities. The LDA effect size (LEfSe) uses the Wilcoxon rank sum test and the nonparametric Kruskal-Wallis rank sum test method to analyze the differences between groups at different levels (68). An LDA score of .2.5 and a P value of ,0.05 were considered to be statistically significant.
Integrated analysis of interactions between significant host gene and microbiome in lung cancer recurrence or metastasis. Correlation analyses were performed between differential tissue microbiomes obtained by LEfSe and DEGs. We computed the Spearman rank correlation coefficients and the corresponding P values using the 'cor.test ()' function with a two-sided alternative hypothesis. P values were corrected for multiple comparisons using the 'qvalue' package in R. Remarkable gene-taxon correlations were demonstrated using 'corrplots' in R, where the direction of the correlation is represented by the color and thickness of the line indicating the strength of the correlation. What is more, the asterisk was used to mark a significant gene-taxon correlation. Finally, using Cytoscape v3.9.1 (69), we obtained the network for demonstrating significant gene-microbe correlations (q value of ,0.1).
Machine learning classification model and evaluation metrics. We chose Random Forest (RF), Gaussian naive Bayes (NB), and Adaboost (Ada) as classification models for predicting RM status in LUSC patients.
NB classifier is based on classical mathematical theory and has a high practicality. The principle of the plain Bayesian approach to determine the class to which a sample belongs is by calculating the probability that the sample belongs to a certain class. Given data x = (x 1 , x 2 , . . ., x d ), whose corresponding set of class labels is C = {C 1 , C 2 , . . ., C k }, and where P(C k ) denotes the prior probability of C k (K = 1, 2, . . ., K), the conditional probability of x conditional on C k can be expressed as P(C k jx), and according to Bayes' theorem, the following equation can be obtained: If the individual feature attributes satisfy mutual conditional independence, the above equation can be expressed in the following form: Pðx i j cÞ Based on the above calculation, the plain Bayesian finds the maximum P(C k jx) for each item to be classified, which will be used as the classification prediction result. The NB model has the advantages of simplicity, speed, and robustness in training and testing large amounts of data and thus is widely used in the analysis of bioinformatics.
The Ada model algorithm starts with the same settings for the training data set and base classifiers and then changes the weight distribution while training to obtain different classifiers, resulting in a combined classifier with improved classification performance. The expression of the Ada model G(x) is as follows: Similar to the principle of the Ada model, the RF model trains multiple decision tree models and eventually determines the class to which the sample belongs by considering the classification results of all decision tree models together. Ada and RF models have excellent properties such as good classification, robustness, and low influence by the imbalance of sample categories. The depth of the Ada and RF models and the number of base classifiers (or decision trees) are important parameters that affect how well the models classify.
We choose to use the 5-fold cross-validation method to evaluate the average prediction performance of the model. The 5-fold cross-validation method involves randomly dividing the data set into five parts with equal sample sizes, using four parts at a time as the training set to train the model, and using the data from the remaining part as the validation set, cycling five times so that each part is used once as the validation set of the model. Fivefold cross-validation maximizes the use of the data set and reduces the overfitting of the training model. One hundred ten LUSC patients were used as the data set for cross-validation, and 27 DEGs and the Shewanella genus, Pseudogulbenkiania genus, Acidovorax genus, Shigella genus, Comamonadaceae family, Shewanellaceae family, and Chromobacteriaceae family were used as predictors. The results of the five cross-validations were averaged as the final results of the model. The specific tuning parameters are shown below.
As for the RF and Ada algorithms, we adjusted the tree number and the maximum depth of the tree. The evaluation metrics of the model include the area under the curve (AUC), accuracy (ACC), the area under the accuracy-recall curve (AUPR), and recall, which are used to evaluate the performance of the machine learning method in binary classification problems. AUC is the area covered under the ROC curve, which is used to measure the comprehensive classification performance of the model, ACC is the proportion of samples correctly predicted by the model to the total sample, and recall is the proportion of RM patients correctly predicted by the model to all RM patients. AUPR is an evaluation metric used to measure the comprehensive classification performance of the sample in the case of unbalanced sample categories.
Survival analysis. We selected genetic and microbial biomarkers that could distinguish RM from non-RM to explore whether genes and microbes significantly associated with tumor recurrence and metastasis in LUSC patients were significantly associated with patient survival. Overall survival (OS), progression-free interval (PFI), disease-specific survival (DSS), and disease-free interval (DFI) were chosen as survival data, respectively. The survival analysis regression model was constructed by the 'coxph' function of the R package, with biomarker as the model-independent variable and survival time as the model-dependent variable. The risk scores for each LUSC patient were calculated from the survival analysis regression model as follows: risk score i = b 1 biomarker i1 1 b 2 biomarker i2 1 . . . 1 b n biomarker in , where i denotes the i-th LUSC patient, n is the number of biomarkers, and b n is the regression coefficient corresponding to the n-th biomarker in the survival analysis regression model. We plotted the time-dependent ROC curves for OS, PFI, DSS, and DPI separately by using the R package 'survivalROC'.
We selected the point on the time-dependent ROC curve with the largest difference between true positives and false positives as the cutoff value of the risk score, and patients above this cutoff value were defined as the high-risk group and those below this value as the low-risk group. The survival curve was performed by using the Kaplan-Meier method and the log rank test was used to compare the difference in survival probability with the R package 'survival'. A P value of ,0.05 was considered statistically significant.
Code availability. All programming scripts to access, manage, and run data on LUAD and LUSC as well as development of the interaction analysis, ML pipelines, survival analysis, and so forth can be found at our GitHub repository link: https://github.com/xfchow/gene_microbiota_interaction. Our pipeline is also publicly shareable and available upon reasonable request of the corresponding author.
Data availability. The gene expression, tissue microbiome, and clinical data that support the findings of this study are openly available at https://portal.gdc.cancer.gov and ftp://ftp.microbio.me/pub/ cancer_microbiome_analysis/TCGA/.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. SUPPLEMENTAL FILE 1, PDF file, 0.2 MB.