Interaction analysis of Mycobacterium tuberculosis between the host environment and highly mutated genes from population genetic structure comparison

Abstract We aimed to investigate the genetic and demographic differences and interactions between areas where observed genomic variations in Mycobacterium tuberculosis (M. tb) were distributed uniformly in cold and hot spots. The cold and hot spot areas were identified using the reported incidence of TB over the previous 5 years. Whole genome sequencing was performed on 291 M. tb isolates between January and June 2018. Analysis of molecular variance and a multifactor dimensionality reduction (MDR) model was applied to test gene-gene-environment interactions. Adjusted odds ratios (OR) and 95% confidence intervals (CI) were computed to test the extent to which genetic mutation affects the TB epidemic using a multivariate logistic regression model. The percentage of the Beijing family strain in hot spots was significantly higher than that in cold spots (64.63% vs 50.69%, P = .022), among the elderly, people with a low BMI, and those having a history of contact with a TB patient (all P < .05). Individuals from cold spot areas had a higher frequency of out-of-town traveling (P < .05). The mutation of Rv1186c, Rv3900c, Rv1508c, Rv0210, and an Intergenic Region (SNP site: 3847237) showed a significant difference between cold and hot spots. (P < .001). The MDR model displayed a clear negative interaction effect of age groups with BMI (interaction entropy: −3.55%) and mutation of Rv0210 (interaction entropy: −2.39%). Through the mutations of Rv0210 and BMI had a low independent effect (interaction entropy: −1.46%). Our data suggests a statistically significant role of age, BMI and the polymorphisms of Rv0210 genes in the transmission and development of M. tb. The results provide clues for the study of susceptibility genes of M. tb in different populations. The characteristic strains showed a local epidemic. Strengthening genotype monitoring of strains in various regions can be used as an early warning signal of epidemic spillover.


Background
Tuberculosis (TB) is still considered to be the major respiratory infectious disease affecting human health and seriously hindering economic development in developing countries. The global TB report shows that the estimated number of cases in 2018 was 10 million [1] and China accounted for 9% of all new cases.
Guangxi is a province with a relatively high incidence of TB compared to other provinces, but the reported incidence is spatially heterogeneous. Differences in reported morbidity rates across the province exceed 100/100,000 population. Regions where observed genomic variations in Mycobacterium tuberculosis (M. tb) are distributed uniformly (cold spots) or localized to some region (hot spots)have been previously reported in Guangxi [2]. Despite the reported correlation between TB incidence and environmental factors such as sunshine duration, GDP per capita and health insurance coverage, the correlation was relatively weak. The contribution of pathogen transmissibility and pathogenicity to the outbreak and its interaction with the environment are not well understood.
Host-pathogen interaction and co-evolution are the result of species adaptation [8]. Host immune pressure and pathogen immune evasion are key points in this process [9]. Consistent with this concept, RD1, which is deleted in M. bovis BCG, encoded a type-VII secretion system and host immune response [10,11,7]. Deletion of the RD2 gene also resulted in a decrease in the virulence of M. tb. However, this attenuation of virulence can in part be complemented by the introduction of the genes Rv1979c to 1982, but not other genes of RD2 [12]. Similarly, another study suggests that RD4 plays an important role in mycobacterium virulence, and RD4 knock-in BCG strains can provide a better protection [13].
The biological and social behaviour characteristics of the host may have a certain impact on the immune system of the body and affect the pathogenicity of M. tb. However, the effect of host-pathogen synergy on TB transmission is not clear. To further investigate the genetic and demographic differences between cold and hot areas of the TB epidemic and their interactions , and to propose hypotheses of TB transmission and virulence, we used a SNP-based method of whole-genome structural differences comparative analysis for highly mutated gene loci detection. We also examined whether these genes interacted with host environmental factors.

Subjects
Moran's I local spatial autocorrelation statistics, and space-time scan statistics were recruited to detect temporal and spatial clusters of tuberculosis reported incidence in Guangxi from 2010 to 2016 [2]. The spatiotemporal analysis identi ed 3 counties located in the central Guangxi with a signi cant high TBreported incidence cluster (hot spots), and 3 counties located in the eastern Guangxi with a signi cant low TB-reported incidence cluster (cold spots). TB patients were con rmed with chest-X ray, sputum smear, culture and drug sensitivity test (DST) from the TB designated hospitals of these six counties and were enrolled in this study with informed consent during January to June 2018. To be eligible for this study, participants must have been a resident in the study sites for at least two years and their isolates available for analysis. Children under the age of ve and people with mental illness were excluded. According to the requirements of the national TB control program, all TB patients should be referred to designated hospitals at county level for diagnosis and treatment. Through investigation, it was found that the rate of TB centralized treatment was more than 90% [14], so the data in this study were representative to a certain extent.
The sample size of this study was calculated based on the formula for comparing two independent proportions [15]. According to a previous study, the It was estimated that the mutation rate of the major SNP locus in the hot spots' strains was 60% in hot spots and 40% in cold spots. On account of type I error of 0.05 and a power of 90%, at least 260 TB patients (130 cases in each area) and their sputum cultures were required.

DNA extraction and whole genome sequencing
The sputum samples collected from the participants were cultured in the hospital of each study site. All isolates from the culture were transported to the Guangxi CDC for M. tb genomic DNA extraction. The process followed the standard laboratory protocol (HiPure Bacterial DNA Kit, Magen Biotech Co. Ltd) and We performed all of the whole genome sequencing at least twice to validate the reproducibility.

Variables
Host variables obtained from participants included sex, age, ethnicity, monthly income, body mass index (BMI), history of household contact with another TB patient, history of smoking and drinking, history of BCG vaccination, travel history (For more than 3 months within 2 years in an area other than the place of permanent residence) and status of multi-drug resistance (MDR). Structured questionnaires were used to collect information about host data. Pathogen variables included mutation information of all SNP positions among included isolates.

Statistical analyses
Comparison of demographic characteristics and area of residence (cold vs hot area) was done using Pearson's chi-square test for categorical variables and the ranksum test for continuous variables. Molecular typing and statistical inference were conducted based on the genotype assignment of the isolates based on the SNP classi cation [16]. Analysis of Molecular Variance (AMOVA) was used to test the genomic differentiation (F-statistic index) between the groups [17]. Multifactor dimensionality reduction (MDR) models were applied to test gene-gene-environment interactions [18]. The best model was selected based on the prediction error using 10-fold cross-validations. Hierarchical interaction graphs and interaction dendrograms of MDR were employed to visualize the interactions of M.tb gene-host environment in the best model [19].

General situation
Totally, 291 TB patients and their isolates were included in the study. There were 147 isolates in the 3 hot spot areas and 144 in the 3 cold spot areas. Through whole genome sequencing and molecular typing, the dominant strains in both areas belonged to the Beijing family (SPOLIGO typing, SITVIT2 Database).
However, the proportion of Beijing family strains in hot spots was signi cantly higher than that in cold spots (64.6% vs 50.7%, p = 0.022). Other genotypes included T1, T2, T3, H, H3 and LAM which belong to the Euro-American lineage.

Comparison of demographic characteristics
As shown in Table 1, elderly, ethnic minorities (91.5% were Zhuang), those with low income, low BMI and a history of contact with a former TB patient were the predominant characteristics of TB patients in hot spot areas (p < 0.05). Individuals from cold spot areas had a higher frequency of out-of-town travelling (p < 0.05).
However, due to the obvious imbalance of the ethnicity and economic levels between cold and hot spots, the variables of ethnic and income are not comparable [20]. 3.3 Population genetic structure differences  was signi cant high in hot spots.

MDR analysis of gene-host environment interaction
To test the gene-host environment interaction, the 5 SNP locus mentioned above and factors with signi cant differences between the two spots were included in the model. But ethnicity and income are excluded because of their incomparability. Table 4. summarizes the cross-validation consistency and prediction error through multifactor dimensionality reduction for each mutation of high-difference SNP sites and host factors. "Rv0210-BMI" model had a maximum testing accuracy of 61.2% and a maximum cross-validation consistency (8/10, P < 0.0001). Figure 2. exhibits three combinations associated with high risk and low risk for the "Rv0210-Age groups-BMI" model. From the distribution of high and low risk factors/SNP sites mutations, we can see that a strain with a mutation of Rv0210, and its host belonging to the normal BMI and low age group was more likely to be in cold spot area.
A hierarchical interaction graph based on all alternative models is shown in Figure 3A. It displays a clear negative interaction effect of age groups with BMI (interaction entropy: -3.55%) and mutation of Rv0210 (interaction entropy: -2.39%). Through the mutations of Rv0210 and BMI had a low independent effect (interaction entropy: -1.46%).
An interaction dendrogram is shown in Figure 3B and shows that age groups and BMI were located on the same branch. These two factors were estimated to have the strongest redundancy interaction, as indicated visually by the blue line. The mutation of Rv0210 was on a different branch, demonstrating a weak redundancy interaction with other factors.

Discussion
During the period of this study, we collected most strains from areas with high and low reported incidence of tuberculosis. The data could therefore be representative of genetic and demographic diversity, at least to some extent.
In this study, Beijing family strains were the dominant genotype in both cold and hot spot areas as con rmed by whole genome sequencing. Beijing family strains, which were rst reported in 1995, have spread worldwide [21]. These strains originated in Beijing and Mongolia, have highly conserved spoligotyping patterns, and characteristic IS6110 RFLP patterns of Mtb isolates [22,23]. In addition to the biological characteristics of high multidrug-resistance rate [24][25][26], the Beijing family strains may also have higher virulence compared to the other genotype strains. A study conducted in The Gambia suggested that patients infected with the Beijing family strain were more likely to progress to disease than those infected with Mycobacterium africanum [27]. However, the sample size of that study was small and other M. tb complex genotypes among the control group were less clear. Interestingly, from animal models, there is clear evidence that the expression of proteins, glycolipids and triglycerides in the Beijing strain is altered, which may contribute to increased pathogenicity. In our study, the proportion of Beijing family strains in hot spots was signi cantly higher than that in cold spots (P < 0.05). This suggests that the virulence of strains in hot spots is higher than those in cold spots.
In addition to the ratio of genetic makeup, there were also statistically signi cant differences in the distribution of age, BMI, history of TB patient contact and migration history between participants from cold and hot spots. A study performed in The Netherlands found that disease transmission was higher in younger aged people [28], a nding that contrasted our study. In our study, the proportion of elderly TB cases was higher in areas with a high TB incidence. This phenomenon may be related to the socio-economic status of different regions. As an independent predictor of TB incidence, it remains to be seen whether socioeconomic factors or immune factors in uence the spread and development of TB. We hypothesize that the high polymorphism of strains in cold spots might be related to the high frequency of travelling among this population.
Population genetic structure comparison is the main method used to determine whether two populations evolve independently or have gene exchange [29,30]. Because M. tb is a highly conserved and differentiated species, the molecular structure differences of its mutant subgroups are relatively small. Through comparison of population genetic structure differences, this study found that the average population structure difference coe cient of strains collected in hot spots and cold spots was only 0.019. In particular, the dominant strain (Beijing family) in this study is more evolutionary conserved than other Mtb lineages, and thus less likely to undergo recent mutations [31]. However, some speci c SNP locations showed signi cant population differences. The gene locus of these positions included Rv1186c, Rv3900c, Rv1508c and Rv0210, where Rv1508c belongs to the fragment of DR4. it has been mentioned that the knock-in of RD4 can improve the protective e cacy of the BCG vaccine. However, the effect of this region deletion on the pathogenicity of M. tb and its clinical phenotype still needs further research. However, this study found that the mutation rate of Rv1508c in hot spots was higher than that in cold spots (58.5% vs 34.0%) which con icts with other studies. However, after adjusting for other factors, the difference was not statistically signi cant. The proportion of Rv1186c mutation in hot spots was signi cantly lower than that in cold spots.
The product of this gene is a conserved protein called PruC. M.tb is an obligate aerobic bacterium. However, it has shown remarkable metabolic exibility, being able to survive and metabolize for a long time without oxygen [32]. A study showed that M.tb can grow on carbon -and energy-derived proline under hypoxia conditions and is regulated by a unique transcription factor (PruC) [33]. Thus, mutations in this gene indicates the immune escape and changes in pathogenicity that affect the transmissibility of M.tb.
The global persistence of M.tb infection over a long period of time suggests that there is a strong evolutionary pressure for the interaction between host and pathogen genomes [34][35][36]. susceptibility gene polymorphisms interacted by M.tb play a role in the development of TB [37,38]. The GG genotype of IL-17 rs2275913 in the Spanish population is associated with a high risk of tuberculosis [39], while the CC genotype of rs763780 in the Chinese population increases the risk of tuberculosis [40]. In this study, the proportion of Zhuang population was signi cantly higher in TB hot spots. However, because the two research sites are not comparable, ethnicity was not included in the MDR model for interactive analysis.
Nevertheless, further biochemical and immunological evidence is needed to con rm the hypothesis of high susceptibility among Zhuang population. In addition, the results of this study suggest that there is a strong negative interaction between age group and mutation of Rv0210. A similar result showed that SNP Rs9272785 for HLA-DQA1 showed a suggestive association in the young onset Tuberculosis subgroup (onset age 20-40 years, N = 396), although no signi cant association was found in the entire sample [41].
This result supports the hypothesis that the pathogenesis of TB strains in different age groups and that genetics may play an important role only in the younger onset of TB. In areas with high rates of TB, older patients showed signi cantly lower BMI and had a strong negative interaction. These results suggest the importance of immune level and nutrition to TB [42]. Therefore, in addition to socioeconomic factors, we also need to consider the impact of nutritional de ciencies on the TB development.
In conclusion, we found signi cant evidence for an association between the SNP difference of Mtb, host environment and TB epidemic.

Competing interests
The authors declare that they have no competing interests.
Funding: National Natural Science Foundation of China (81760603) has supported the DNA extraction and whole genome sequencing, Guangxi Natural Science Foundation (2018GXNSFAA281018) and Guangxi health promotion project (S2019067) has supported the data collection and downstream genetic analysis.
Author Contributions: ZC and DL contributed to the study design, preparation of materials, data collection, data analysis, and manuscript writing. YC contributed to the study design and data analysis. JO and LH contributed to the data collection and data analysis. All authors have read and approved the manuscript

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. questionnaire.pdf