Leveraging existing 16S rRNA microbial data to identify diagnostic biomarker in Chinese patients with gastric cancer: a systematic meta-analysis

ABSTRACT Gastric cancer is the second most prevalent and deadly cancer in China. Microbiota play an important role in gastric tumorigenesis. However, the available microbial marker studies for gastric cancer do not have consistent results. We searched PubMed for 16S rRNA sequencing in relevant literature on Chinese patients from 1 January 2005 to 18 July 2022, and 16 original articles were finally obtained. Alpha diversity, beta diversity, and bacterial taxa were used to explore the differences in gastric microbiota. Linear discriminant analysis of effect size and a random forest model were used to find the combination of genera with the best diagnostic efficacy. Streptococcus, Pseudomonas, Fusobacterium, Selenomonas, Peptostreptococcus, and Prevotella showed significant differences between gastric cancer and non-gastric cancer, but a single genus performed poorly in identifying patients with gastric cancer. However, a combination of genera Streptococcus, Peptostreptococcus, Selenomonas, Pseudomonas, and Prevotella had excellent performance in screening gastric cancer with the median area under the curve values of 0.7525 (range: 0.5859–0.9350), 0.8818 (range: 0.7397–0.9533), and 0.7435 (range: 0.7131–0.8483) in the Matched, Unmatched, and Other groups, respectively. Therefore, the results indicated that this combination of genera has good diagnostic efficacy and wide applicability for patients with gastric cancer, which may provide new clues for the non-invasive diagnosis of gastric cancer. Importance Gastric cancer is a significant and growing health problem in China. Studies have revealed significant differences in gastric microbiota between patients with gastric cancer and non-cancerous patients, suggesting that microbiota may play a role in tumorigenesis. 
In this meta-analysis, existing 16S rRNA microbial data were analyzed to find combinations consisting of five genera, which had good efficacy in distinguishing gastric cancer from non-cancerous patients in multiple types of samples. These results lend support to the use of microbial markers in detecting gastric cancer. Moreover, these biomarkers are plausible candidates for further mechanistic research into the role of the microbiota in tumorigenesis.

have been recognized as risk factors for GC and have received more attention recently. Studies have shown that microorganisms in the upper gastrointestinal tract can promote cancer by promoting inflammation (4) or by interacting with pathogens (5). Therefore, the exploration of microbiota-based GC diagnostic markers in the Chinese population is of great value for the identification, prevention, and treatment of GC.
16S rRNA is a component of the ribosomal 30S subunit that is highly conserved and specific in prokaryotes. Sequencing the variable regions of 16S rRNA is currently the most common method for studying microbial diversity. 16S rRNA amplicon sequencing has shown differences in the phyla Proteobacteria and Firmicutes, and in the genera Streptococcus, Prevotella, and Fusobacterium, between cancerous and adjacent normal tissues of GC (6,7). However, some studies have reported regional differences in microbiota profiles, and the microorganisms responsible for GC may not be the same in different regions. A recent study including 110 Mongolian and 83 Han Chinese stool samples from healthy individuals showed that Mongolians have a more unique and diverse gut microbial community than Han Chinese, suggesting that the environment influences the microbial community (8). Nevertheless, there were significant differences in gastric mucosal microbial communities between patients with GC and healthy people in different studies from the Chinese population. For example, a study conducted in Xi'an and Inner Mongolia identified a combination of several operational taxonomic units (OTUs), namely Peptostreptococcus_OTU16, Parvinomonas_OTU35, Streptococcus_OTU68, Dialister_OTU151, and Slackia_OTU174, that performed well in diagnosing GC in gastric mucosal samples (9). In addition, a multicenter study in China, pooling samples from Beijing, Tianjin, Nanjing, Shanghai, Xiamen, and Guangzhou, found that the combination of the species Streptococcus anginosus and Streptococcus constellatus had high sensitivity and diagnostic efficacy in identifying patients with GC and could be used for GC screening (10). The differences in the results of the above studies based on Chinese populations may be explained by differences in sample sizes, controls, sample types, and the selection of sequencing platforms and variable regions among studies. This highlights the need for a systematic analysis of microbial diagnostic markers for GC.
Based on this, we performed a meta-analysis using the existing 16S rRNA sequencing data published in previous articles to find microbial markers with diagnostic value for GC in the Chinese population. This study is of great significance for exploring microbial markers with a clinical diagnostic role for GC.
OR (China)). Data were included according to the following criteria: the study population was a Chinese population containing GC and NGC [non-GC, including distal normal control (NC), healthy control (HC), superficial gastritis (SG), and chronic gastritis (CG)], and the study provided publicly available sequence and subgroup information. A total of 693 articles were obtained using the above search criteria, and after reviewing the titles and abstracts, 643 irrelevant articles were excluded. Of the remaining 50 articles, 37 were excluded as irrelevant after reading the full text. In addition, three relevant studies were added manually through literature reading, so 16 original articles were finally included (Fig. 1).

Data preprocessing
Raw sequence data and metadata were obtained from the Sequence Read Archive (SRA; https://www.ncbi.nlm.nih.gov/sra) and the European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena/browser/home). Quality checks were performed using FastQC, and each data set was processed separately using QIIME2 (version 2021.2) (11). Single-end or paired-end FASTQ files were imported to generate QIIME2 artifacts. Paired-end reads were merged using the q2-vsearch plugin (12), and the quality-filter plugin was then applied to the merged or single-end reads with a minimum quality score of 20. Next, the merged and quality-filtered sequences were denoised using Deblur (13) to obtain single sequence variants. Taxonomy was then assigned in QIIME2 with default parameters, using a pre-trained naive Bayes classifier (99% similarity) (https://github.com/QIIME2/q2-feature-classifier) (14) built on Greengenes full-length reference sequences. Any sequences identified as members of the eukaryotic, archaeal, mitochondrial, chloroplast, and cyanobacterial lineages were removed.
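The final lineage-removal step can be illustrated with a minimal sketch. It assumes Greengenes-style taxonomy strings (for example `k__Bacteria; p__Firmicutes; ...`); the function names, the substring matching rule, and the example features are hypothetical and are not the QIIME2 commands used in the study.

```python
# Hypothetical illustration of lineage-based filtering; not the study's pipeline.
# Lowercase substrings that mark the lineages removed in the text above.
EXCLUDED_TERMS = ("eukaryot", "archaea", "mitochondria", "chloroplast", "cyanobacteria")


def keep_feature(taxonomy: str) -> bool:
    """Return True if the taxonomy string mentions none of the excluded lineages."""
    lowered = taxonomy.lower()
    return not any(term in lowered for term in EXCLUDED_TERMS)


def filter_features(taxonomy_by_feature: dict) -> dict:
    """Keep only the features whose taxonomy passes keep_feature."""
    return {fid: tax for fid, tax in taxonomy_by_feature.items() if keep_feature(tax)}


# Made-up feature identifiers and taxonomy strings, for illustration only.
example = {
    "asv1": "k__Bacteria; p__Firmicutes; c__Bacilli; o__Lactobacillales; "
            "f__Streptococcaceae; g__Streptococcus",
    "asv2": "k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Streptophyta",
    "asv3": "k__Archaea; p__Euryarchaeota",
    "asv4": "k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; "
            "o__Rickettsiales; f__mitochondria",
}
kept = filter_features(example)
print(sorted(kept))
```

Substring matching is a deliberately simple choice here; a production filter would more likely match on specific taxonomic ranks.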

Community analysis
To compare species diversity between the NGC and GC groups, we performed alpha-diversity analysis using the QIIME2 platform. After excluding samples with <1,000 reads, community richness, diversity, and evenness were expressed by observed features, the Shannon index, and the evenness index, respectively. In each data set, we applied a rank transformation so that the data approximated a normal distribution. For the transformed data, we tested the hypothesis that alpha diversity differed significantly between the NGC and GC groups using a linear mixed-effects model. We plotted rarefaction curves at the median sequencing depth for each data set to determine the stability of the alpha-diversity results. We measured differences between individuals by calculating beta diversity (Bray-Curtis distance, weighted UniFrac distance, and unweighted UniFrac distance) and used permutational multivariate ANOVA (PERMANOVA) to determine whether there were significant differences between NGC and GC patients.
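As a rough sketch of the quantities named above, the following shows how observed features, the Shannon index, Pielou evenness, and Bray-Curtis dissimilarity can be computed from raw count vectors after dropping low-depth samples. The count vectors and sample names are hypothetical, the base-2 logarithm for the Shannon index is an assumption, and UniFrac distances are omitted because they require a phylogenetic tree.

```python
import math

MIN_READS = 1000  # samples with fewer reads were excluded in the text above


def observed_features(counts):
    """Richness: number of features with a nonzero count."""
    return sum(1 for c in counts if c > 0)


def shannon(counts, base=2.0):
    """Shannon index H = -sum(p_i * log(p_i)); the log base is an assumption."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total, base) for c in counts if c > 0)


def pielou_evenness(counts):
    """Pielou evenness J = H / log(S); the ratio does not depend on the log base."""
    s = observed_features(counts)
    if s < 2:
        return 0.0
    return shannon(counts, base=math.e) / math.log(s)


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    denominator = sum(a + b for a, b in zip(u, v))
    if denominator == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / denominator


# Hypothetical count vectors (features in the same order for every sample).
samples = {
    "gc_01": [400, 300, 200, 100],
    "ngc_01": [250, 250, 250, 250],
    "low_depth": [5, 3, 0, 0],
}
kept = {name: c for name, c in samples.items() if sum(c) >= MIN_READS}
for name, counts in kept.items():
    print(name, observed_features(counts),
          round(shannon(counts), 3), round(pielou_evenness(counts), 3))
print(round(bray_curtis(kept["gc_01"], kept["ngc_01"]), 3))
```

In this toy example the perfectly even sample reaches the maximum evenness of 1, which is the behavior the evenness index is meant to capture.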

Linear discriminant analysis of effect size
We used linear discriminant analysis of effect size (LEfSe) (15) to assess how the microbiome varied with disease state. LEfSe determines the features most likely to explain differences between classes by coupling standard tests for statistical significance with additional tests encoding biological consistency and effect relevance. We used the Galaxy implementation of LEfSe (accessed on 20 March 2022; available at http://huttenhower.sph.harvard.edu/galaxy) with default options. Features were considered differential when their logarithmic linear discriminant analysis (LDA) score exceeded the threshold of 2.0.

Random forest classification analysis
Using the common differential genera obtained from LEfSe, we constructed random forest models using the R package randomForest (16). Each data set was randomly divided 7:3 into training and test sets. For all models, ntree was set to 500, and mtry was set to the square root of the number of taxa within the model. The mean decrease in accuracy (MDA) measures the importance of each taxon to the overall model, and we obtained the overall importance ranking by normalizing the MDA values of the taxa in each model (Z-transformation).
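The normalization and ranking step can be sketched as follows. The MDA values are invented for illustration, the helper names are hypothetical, and two details are assumptions rather than statements from the text: the square root for mtry is rounded down, and the Z-transformation uses the population standard deviation within each model.

```python
import math
from statistics import mean, pstdev


def mtry_for(n_taxa):
    """Square root of the number of taxa; rounding down is an assumption."""
    return max(1, math.floor(math.sqrt(n_taxa)))


def z_transform(mda_by_genus):
    """Z-score the MDA values within one model (population SD assumed)."""
    values = list(mda_by_genus.values())
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return {genus: 0.0 for genus in mda_by_genus}
    return {genus: (value - mu) / sd for genus, value in mda_by_genus.items()}


def overall_ranking(models):
    """Sum the Z-scored MDA across models and rank genera, most important first."""
    totals = {}
    for mda_by_genus in models:
        for genus, z in z_transform(mda_by_genus).items():
            totals[genus] = totals.get(genus, 0.0) + z
    return sorted(totals, key=totals.get, reverse=True)


# Hypothetical MDA values from two per-data-set models (not the study's results).
models = [
    {"Streptococcus": 10.0, "Prevotella": 5.0, "Fusobacterium": 0.0},
    {"Streptococcus": 8.0, "Prevotella": 9.0, "Fusobacterium": 1.0},
]
print(mtry_for(6))
print(overall_ranking(models))
```

Z-scoring within each model before summing keeps a data set with large raw MDA values from dominating the combined ranking.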

Statistical analysis
SPSS version 22.0 (SPSS Inc., Chicago, IL, USA) was used for statistical analysis. GraphPad Prism 8.0 (GraphPad Software Inc., San Diego, CA, USA) was used for graphing. The Mann-Whitney U test was used to compare alpha-diversity parameters. Differences in community structure between groups were calculated using Bray-Curtis, weighted UniFrac, and unweighted UniFrac distances, and PERMANOVA was performed using the vegan package (v2.5.7) in R (version 4.1.2) to detect differences in beta diversity between groups. The LEfSe algorithm was used to identify specific microbial taxa and functions that differed significantly between groups. Differences with absolute LDA scores >2.0 were considered significant. Random forest models were constructed using the randomForest package (v4.6.14). ROC curves were plotted using the pROC (1.18.0) and ggplot2 (3.3.6) packages.

Basic characteristics of the included data set
The 16 original articles differed in DNA extraction method, PCR primers, sequenced region, and sequencing platform. The main methods used for DNA extraction were the QIAamp DNA Mini Kit and the QIAGEN DNeasy Kit. The most commonly used PCR primers were 515F/806R, followed by 338F/806R, 319F/806R, etc. Regions V3-V4 and V4 were the most commonly sequenced, and sequencing was performed on Illumina platforms, with Illumina MiSeq being the most common (Tables 1 to 3).
We divided samples into three groups, Matched, Unmatched, and Other, based on sample type and whether the samples originated from the same individual. A total of 10 gastric mucosal microbiome data sets were eligible for the Matched group. Of these, the data set of Tseng et al. (17) was excluded because of its small sample size, leaving nine data sets from eight studies (S1-S9): Yu et al. ( 18 (22). A total of 1,189 tissue samples were included in this group, of which 595 were GC samples and 594 were NGC samples (Table 1). A total of eight gastric mucosal microbiome studies were eligible for the Unmatched group. Of these, the study of Wang et al. (30) was excluded because of its small sample size, leaving seven data sets from six studies (S10-S16): Coker et al. ( 9 26), and He et al. (22). A total of 557 tissue samples were included in this group: 215 were GC samples and 342 were NGC samples (Table 2). The Other group included six data sets from five studies (S17-S22) and comprised microbiome data from gastric mucosal swab, oral swab, tongue coating, stool, and gastric fluid samples: Liu et al. ( 29 27), and He et al. (22). A total of 898 samples were included in this group: 501 were GC samples and 397 were NGC samples (Table 3).

Community analysis showed significant differences in microbiota between GC and NGC
To explore the differences in gastric microbiota between GC and NGC, we compared the alpha diversity, beta diversity, and bacterial taxa of each data set in the Matched, Unmatched, and Other groups after controlling for differences in study and variable regions. In the Matched group (S1-S9), the evenness index, observed features, and Shannon index were significantly higher in the GC group (P-values 4.96E−03, 1.12E−04, and 3.27E−07, respectively) (Fig. 2A and E; Table S1). By contrast, in the Unmatched group (S10-S16), the evenness and Shannon indexes were significantly lower in the GC group (P-values of 0.0074 and 0.0099, respectively), while observed features did not differ significantly between GC and NGC (P = 0.9337) (Fig. 2C and E; Table S2). The beta-diversity analysis showed significant differences between GC and NGC in most of the data sets (Table S4). In terms of bacterial taxa, Proteobacteria, Firmicutes, and Bacteroidetes were the dominant phyla constituting the gastric mucosal microbial community. In addition, the phyla Cyanobacteria, Actinobacteria, and Fusobacteria showed high relative abundance in some data sets. Compared to the NGC group, Proteobacteria were significantly lower in the GC group (P-value of 4.90E−12 in the Matched and 9.41E−08 in the Unmatched groups), Firmicutes were significantly higher in the GC group (P-value of 5.44E−09 in the Matched and 2.00E−06 in the Unmatched groups), while Bacteroidetes did not differ significantly between the two groups (P-value of 9.53E−02 in the Matched and 5.04E−02 in the Unmatched groups) (Fig. 2B, D and F; Table S2). The Other group contained several types of samples, with the bacterial taxa of patients with GC differing significantly between sample types (Fig. S1; Table S3).

Six genera may be potential diagnostic biomarkers for distinguishing GC from NGC
Bacterial genera with significant differences between GC and NGC should have better efficacy in identifying patients with GC. Therefore, we performed LEfSe on the data from the Matched group to find microbial markers for GC. We summarized the LEfSe results for data sets S1 to S9 and selected genera that appeared in two or more data sets. The following eight genera were obtained: Streptococcus, Pseudomonas, Fusobacterium, Selenomonas, Novosphingobium, Halomonas, Peptostreptococcus, and Prevotella (Fig. 3). However, Novosphingobium and Halomonas could not be annotated in some of the Matched data sets and so were excluded. Finally, six genera, Streptococcus, Pseudomonas, Fusobacterium, Selenomonas, Peptostreptococcus, and Prevotella, were considered as potential diagnostic biomarkers for distinguishing GC from NGC.
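The selection rule described above (genera passing the LDA threshold in two or more data sets) can be sketched as below. The LDA scores and the helper name `recurrent_genera` are hypothetical and serve only to show the counting logic; they are not the study's LEfSe output.

```python
from collections import Counter

LDA_THRESHOLD = 2.0  # absolute LDA score cut-off stated in the Methods


def recurrent_genera(lda_by_dataset, min_datasets=2):
    """Return genera whose |LDA score| exceeds the threshold in >= min_datasets data sets."""
    counts = Counter()
    for scores in lda_by_dataset.values():
        significant = {g for g, lda in scores.items() if abs(lda) > LDA_THRESHOLD}
        counts.update(significant)  # each data set contributes at most once per genus
    return sorted(g for g, n in counts.items() if n >= min_datasets)


# Hypothetical LDA scores per data set (positive or negative by direction of enrichment).
lda_by_dataset = {
    "S1": {"Streptococcus": 3.1, "Prevotella": 2.4, "Halomonas": 1.5},
    "S2": {"Streptococcus": -2.8, "Fusobacterium": 2.2},
    "S3": {"Prevotella": 2.9, "Halomonas": 2.6},
}
print(recurrent_genera(lda_by_dataset))
```

A further filter for genera that cannot be annotated in every data set, as applied to Novosphingobium and Halomonas in the text, would be a separate step on top of this count.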

A single genus performs poorly as a diagnostic biomarker for distinguishing GC from NGC
We explored whether each of these six genera alone could serve as a diagnostic marker for GC. Given that Streptococcus was identified as a potential marker in many data sets in the preceding results, we first analyzed its diagnostic ability. In gastric mucosal tissue, the relative abundance of Streptococcus was significantly higher in most GC groups than in controls. The median area under the curve (AUC) value was 0.6575 (range: 0.4150-0.8669) in the Matched group (S1-S9) and 0.6706 (range: 0.4831-0.822) in the Unmatched group (S10-S16) (Fig. 4A and B). The results were slightly better in the Unmatched than in the Matched group. In the Other group (S17-S22), the AUCs were 0.6394, 0.6781, and 0.7687 in one group of oral swab samples and two groups of stool samples, respectively, and less than 0.6 in the remaining types of samples (Fig. 4C). However, Streptococcus did not show significant differences in the Other group. These results suggested that Streptococcus performs poorly as a diagnostic marker for GC in the majority of data sets. We also report the relative abundance of the six genera in the NGC and GC groups of all data sets and analyzed the diagnostic efficacy of the remaining five genera, with results generally similar to those of Streptococcus. Details are shown in Fig. S2 and S3.
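For a single genus, the AUC has a simple interpretation: it is the probability that a randomly chosen GC sample has a higher relative abundance than a randomly chosen NGC sample, with ties counted as one half. The sketch below computes it directly from that definition; the abundance values are hypothetical, and the study itself used the pROC package rather than this code.

```python
def auc(case_values, control_values):
    """AUC as the fraction of (case, control) pairs where the case ranks higher.

    Ties contribute 0.5. This equals the Mann-Whitney U statistic divided by
    the number of pairs.
    """
    wins = 0.0
    for x in case_values:
        for y in control_values:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(case_values) * len(control_values))


# Hypothetical relative abundances of one genus in GC and NGC samples.
gc_abundance = [0.30, 0.40, 0.50]
ngc_abundance = [0.10, 0.20, 0.35]
print(round(auc(gc_abundance, ngc_abundance), 4))
```

Under this definition an AUC near 0.5 means the genus carries little discriminating information, and a value below 0.5 means the genus tends to be less abundant in GC than in NGC.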

The best-performing combination of microbial diagnostic biomarkers obtained by random forest analysis
We constructed random forest models in each data set of the Matched group (S1-S9) and Z-transformed the resulting MDA values. The sum of the Z-score MDA results was ranked, and the total importance ranking of each genus in the model was obtained as follows (in order of decreasing importance): Streptococcus, Peptostreptococcus, Selenomonas, Pseudomonas, Prevotella, and Fusobacterium (Fig. 5A). In this model, Streptococcus was the most important factor, consistent with our LEfSe results. We obtained six combinations by progressively including genera based on importance ranking and validated the diagnostic efficacy of these combinations in the data sets of the Matched group. The results showed that COM5 was the best-performing combination, with a median AUC value of 0.7525 (range: 0.5859-0.9350) (Fig. 5B through D).
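The progressive construction of combinations from the importance ranking can be sketched as follows. The ranking is the one reported above; the naming pattern COM1 to COM6 is inferred from the label COM5 used in the text, and the per-data-set AUC values passed to the hypothetical helper `best_by_median` are invented purely to show the selection rule.

```python
from statistics import median

# Importance ranking reported in the text (most to least important).
RANKING = [
    "Streptococcus", "Peptostreptococcus", "Selenomonas",
    "Pseudomonas", "Prevotella", "Fusobacterium",
]


def progressive_combinations(ranked_genera):
    """COMk contains the k top-ranked genera (naming inferred from 'COM5')."""
    return {f"COM{k}": ranked_genera[:k] for k in range(1, len(ranked_genera) + 1)}


def best_by_median(auc_by_combination):
    """Pick the combination with the highest median AUC across data sets."""
    return max(auc_by_combination, key=lambda name: median(auc_by_combination[name]))


combinations = progressive_combinations(RANKING)
print(combinations["COM5"])

# Hypothetical AUC values per combination across three data sets.
hypothetical_aucs = {
    "COM1": [0.60, 0.70, 0.65],
    "COM5": [0.70, 0.80, 0.75],
    "COM6": [0.68, 0.79, 0.74],
}
print(best_by_median(hypothetical_aucs))
```

Ranking combinations by the median rather than the mean limits the influence of a single data set with an unusually high or low AUC.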

The best-performing combination of microbial diagnostic biomarkers verified in the Unmatched and Other groups
To further explore the ability of the combination to screen patients with GC, we validated the combination COM5 in the Unmatched group (S13 excluded due to lack of Selenomonas) and the Other group (S17 excluded due to lack of Pseudomonas and S21 excluded due to lack of Pseudomonas and Selenomonas). Unexpectedly, the combination had an extremely high AUC in the Unmatched group, with a median value of 0.8818 (range: 0.7397-0.9533) (Fig. 5E). The AUC was also above 0.7 in all four types of sample data in the Other group, with values of 0.8483, 0.7131, 0.7650, and 0.7219, respectively (Fig. 5F). These results suggested that the bacterial genera that differ significantly between GC and NGC mucosal tissues might also be used in the diagnosis of GC in oral swab, tongue coating, feces, or gastric juice specimens.
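As a quick arithmetic check, the four Other-group AUC values listed above reproduce the summary given in the abstract (median of about 0.7435, range 0.7131-0.8483). The variable names in this short sketch are illustrative only.

```python
from statistics import median

# COM5 AUC values in the Other group, as listed in the text.
other_group_aucs = [0.8483, 0.7131, 0.7650, 0.7219]

other_median = median(other_group_aucs)  # mean of the two middle values
other_range = (min(other_group_aucs), max(other_group_aucs))
print(f"median = {other_median:.5f}, range = {other_range[0]}-{other_range[1]}")
```

With an even number of values, the median is the mean of the two middle values, here 0.7219 and 0.7650.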

DISCUSSION
An increasing number of studies have found that microorganisms in the stomach other than Hp are closely associated with the development of GC. In this study, we performed a systematic meta-analysis using 16S rRNA sequencing data published in previous articles and identified the best-performing combination of microbial diagnostic biomarkers for distinguishing GC from NGC in Chinese patients. This biomarker combination showed good diagnostic efficiency across a variety of sample types, which is of potential clinical value.
We first explored the basic characteristics of the flora in GC versus non-cancerous samples. After controlling for confounding factors, the alpha-diversity indicators (evenness index, Shannon index, and observed features) were elevated in GC in the Matched group, but the evenness and Shannon indexes were decreased in GC in the Unmatched group. Current studies have not reached a consistent conclusion on the relationship between microbial diversity and gastric disease status. Dai et al. showed that the diversity and abundance of gastric microbiota were higher in tumor tissues than in non-tumor tissues (7). However, the diversity and richness of peritumoral and tumor tissues were decreased in 276 patients with GC compared with non-tumor tissues (28). Our study may partly explain the inconsistent alpha-diversity results in previous studies, which we speculate are related to whether the controls came from the same individual or from different individuals.
LEfSe analysis indicated that six genera, Streptococcus, Peptostreptococcus, Selenomonas, Pseudomonas, Prevotella, and Fusobacterium, may be potential diagnostic biomarkers for distinguishing GC from NGC. Interestingly, all of these genera are members of the oral microbiota, which suggests that oral flora play important roles in the development of GC. Streptococcus is a common pyogenic Gram-positive coccus widely present in the human gastrointestinal tract and nasopharynx. Its elevation has been seen in multiple types of samples from various gastrointestinal diseases, and it is widely regarded as a pathogenic microorganism (33). Selenomonas, a Gram-negative genus, can be found in the oral cavity, stomach, and feces and is associated with colon cancer (34). Pseudomonas, a Gram-negative aerobic bacterium, is a common opportunistic pathogen. It has been reported that the abundance of Pseudomonas in serum is high in both GC and normal groups (35). Prevotella, a genus of Gram-negative anaerobic bacteria, is found in multiple sites in the human body, including the oral cavity and gastrointestinal tract (36). Prevotella was enriched in GC tissue (7,9,25) and decreased in saliva (37), suggesting that translocation of Prevotella from the oral cavity to the gastrointestinal tract is a cause of gastric disease development. It has been suggested that Fusobacterium is present in the infected gastrointestinal tract (38), and this may stem from the selective translocation of Fusobacterium from the oral cavity (39). Fusobacterium has also been demonstrated to be associated with IBD and colorectal cancer (CRC) (23,40,41). Peptostreptococcus belongs to a group of Gram-positive anaerobic bacteria present in the gastrointestinal tract and vagina. Coker et al. identified Peptostreptococcus as an important genus in the progression from gastritis to GC (9). The above results indicated that the six differential genera are closely associated with gastric diseases and have potential as microbial markers for GC diagnosis.
Next, we explored whether each of these six genera alone could serve as a diagnostic marker for GC. Streptococcus performed well as a diagnostic marker in some samples but not in others, as did the other five genera. Studies on Streptococcus as a diagnostic marker have been reported. For example, Zhou et al. diagnosed GC by detecting Streptococcus anginosus and Streptococcus constellatus in stool samples, with an AUC of 0.91 (10). Yu et al. (33) explored Streptococcus as a marker of liver metastasis in GC, with an AUC of 0.651. Streptococcus has also been reported to be significantly elevated in CRC (42,43), suggesting its potential as a diagnostic marker for gastrointestinal cancer. However, our meta-analysis did not verify the above results. We speculate that the reason may be that the cases in our study were from gastric mucosal tissue rather than fecal samples. This suggests that Streptococcus may have poor tissue specificity and diagnostic efficacy as a marker, so its feasibility as a single diagnostic marker for GC remains to be explored. The other five genera have been combined as diagnostic markers in previous studies (32,37,44), but no studies have investigated them as markers alone. Our study showed that these genera had poor diagnostic efficacy when used alone as diagnostic markers for GC.
To find the marker combinations that performed best in the 16S rRNA data of gastric mucosa, we constructed a random forest model in each of the Matched group data sets and progressively added genera in order of importance to obtain six marker combinations, of which the combination consisting of five genera (Streptococcus, Peptostreptococcus, Selenomonas, Pseudomonas, and Prevotella) had the best diagnostic power, with a median AUC value of 0.7525 (range: 0.5859-0.9350). In exploring the best combination of biomarkers, we found that the performance of each combination was excellent in the S3 data set, with AUC values of all the COM signatures equal to 0.9. After analyzing the abundance and diagnostic ability of single microbial genera, we believe that this may be due to two reasons. First, in S3, the abundances of Streptococcus, Peptostreptococcus, and Selenomonas in GC were significantly higher than those in NGC, and the differences were large. Second, Streptococcus, Peptostreptococcus, and Selenomonas all had excellent diagnostic efficiency as single microbial genera in S3. Furthermore, we found that COM5 had good diagnostic ability for the population in Hohhot, Inner Mongolia (S3 AUC = 0.935, S10 AUC = 0.934), which suggests that developing region-specific diagnostic biomarkers to improve diagnostic ability may be feasible. The diagnostic value of this combination was then verified with mucosal samples from the Unmatched group and with tongue coating, oral swab, stool, and gastric fluid samples in the Other group, and the median AUC value was up to 0.8818 (range: 0.7397-0.9533). However, there have been no reports using the combination of these five genera for GC diagnosis. These results indicated that this combination of genera has good diagnostic efficacy and wide applicability for patients with GC, which may open a door for the non-invasive diagnosis of GC.
In addition, there are some shortcomings in this study. First, the effect of Hp on gastric microbial community composition has been demonstrated (45), but in this study, it was not possible to stratify by Hp status for a more precise analysis because of incomplete Hp data. Second, the number of studies included in this meta-analysis was still small, and some did not have a large sample size, which may affect some of the results. Third, there were insufficient data for other types of samples, such as oral and fecal samples with only one or two data sets each, giving insufficient validation strength. In the future, more informative and comprehensive studies are needed to verify the microbial diagnostic markers of GC obtained in this study.
In conclusion, leveraging existing 16S rRNA microbial data, we demonstrated significant differences in gastric microbiota between GC and non-cancerous patients and identified a combination of genera (Streptococcus, Peptostreptococcus, Selenomonas, Pseudomonas, and Prevotella) that showed excellent performance in screening GC, with broad applicability and good diagnostic efficacy for the Chinese population. Our results lend support to the use of microbial markers in detecting GC. Moreover, these biomarkers might also be plausible candidates for further mechanistic research into the role of the microbiota in tumorigenesis.

FIG 2 Community analysis between GC and NGC in the Matched and Unmatched groups. (A) Alpha-diversity indicators (evenness index, observed features, and Shannon index) for each data set in the Matched group. (B) The microbiota composition of each data set in the Matched group. (C) Evenness index, observed features, and Shannon index for each data set in the Unmatched group. (D) The microbiota composition of each data set in the Unmatched group. (E) Evenness index, observed features, and Shannon index in the Matched and Unmatched groups after controlling for variables with a linear mixed-effects model. (F) The six most abundant phyla in the Matched and Unmatched groups after controlling for variables with a linear mixed-effects model.

FIG 4 Relative abundance, fold-change, and P-value of Streptococcus between the GC and NGC groups in each data set, and AUC of Streptococcus as a marker for diagnosis of GC. (A) Matched group. (B) Unmatched group. (C) Other group.

FIG 5 Identifying the best-performing combination of microbial diagnostic biomarkers and validating their diagnostic capabilities. (A) Heatmap of MDA values obtained by random forest model after Z-transformation. (B) AUC values for each combination in the Matched group for each data set. (C) Genus composition in each combination. (D) ROC curves of COM5 in the Matched group for each data set. (E) ROC curves of COM5 in the Unmatched group for each data set. (F) ROC curves of COM5 in the Other group for each data set.
We searched PubMed for 16S rRNA sequencing relevant literature of Chinese patients from 1 January 2005 to 18 July 2022 with the following search strategy: ((((microbiome OR microbial OR microbiota [MeSH Terms]) OR microflora OR bacterial OR dysbiosis) AND (gastric [MeSH Terms] OR stomach OR upper digestive tract OR upper gastrointestinal tract)) AND ((lesion OR cancer [MeSH Terms] OR neoplasia OR neoplasms OR malignancy OR tumor OR carcinoma OR adenocarcinoma OR premalignancy OR premalignant OR tumorigenesis OR carcinogenesis) OR intestinal metaplasia OR gastritis)) AND ((Chinese)

TABLE 1
Characteristics of the studies included in the Matched group