Diversification of gene content in the Mycobacterium tuberculosis complex is determined by phylogenetic and ecological signatures

ABSTRACT We analyzed the pan-genome and gene content modulation of the most diverse genome data set of the Mycobacterium tuberculosis complex (MTBC) gathered to date. The closed pan-genome of the MTBC was characterized by reduced accessory and strain-specific genomes, compatible with its clonal nature. However, significantly fewer gene families were shared between MTBC genomes as their phylogenetic distance increased. This effect was only observed in inter-species comparisons, not within species, which suggests that species-specific ecological characteristics are associated with changes in gene content. Gene loss, resulting from genomic deletions and pseudogenization, was found to drive the variation in gene content. This gene erosion differed among MTBC species and lineages, even within M. tuberculosis, where L2 showed more gene loss than L4. We also show that phylogenetic proximity is not always a good proxy for gene content relatedness in the MTBC, as the gene repertoire of Mycobacterium africanum L6 deviated from its expected phylogenetic niche conservatism. Gene disruptions of virulence factors, represented by pseudogene annotations, are mostly not conserved, which makes them poor predictors of MTBC ecotypes. Each MTBC ecotype carries its own accessory genome, likely influenced by distinct selective pressures such as host and geography. It is important to investigate how gene loss confers new adaptive traits to MTBC strains; the detected heterogeneous gene loss poses a significant challenge in elucidating genetic factors responsible for the diverse phenotypes observed in the MTBC. By detailing specific gene losses, our study serves as a resource for researchers studying the MTBC phenotypes and their immune evasion strategies.

IMPORTANCE In this study, we analyzed the gene content of different ecotypes of the Mycobacterium tuberculosis complex (MTBC), the causative agents of tuberculosis. We found that changes in their gene content are associated with their ecological features, such as host preference. Gene loss was identified as the primary driver of these changes, which can vary even among different strains of the same ecotype. Our study also revealed that the gene content relatedness of these bacteria does not always mirror their evolutionary relationships. In addition, some virulence genes can be variably lost among strains of the same MTBC ecotype, likely helping them to evade the immune system. Overall, our study highlights the importance of understanding how gene loss can lead to new adaptations in these bacteria and how different selective pressures may influence their genetic makeup.


Figure S3. Orthologous protein clusters of the pan-genome detected by the sequential addition of genomes of the Mycobacterium tuberculosis complex (MTBC). Each boxplot represents the distribution of the number of orthologous protein clusters from 100 randomly generated strain orders. The refined power-law pan-genome model [1] was applied using the panmatrix and heaps functions of the micropan R package [2]. The calculated alpha parameter was 1.5. An exponent alpha > 1 indicates that the bacterial group has a closed pan-genome.
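The closed/open criterion can be illustrated with a minimal Python sketch. It is not the micropan implementation: it assumes a presence/absence matrix with genomes as rows and clusters as columns, and it estimates alpha by a simple log-log least-squares fit of the number of new clusters against the number of genomes added, which is a simplification of the model fitting cited above. All function names are hypothetical.

```python
import math
import random

def new_clusters_per_step(pan_matrix, order):
    """Count clusters first seen when each genome after the first is added."""
    seen = set()
    counts = []
    for step, genome in enumerate(order):
        present = {c for c, v in enumerate(pan_matrix[genome]) if v > 0}
        if step > 0:
            counts.append(len(present - seen))
        seen |= present
    return counts

def mean_new_clusters(pan_matrix, n_perm=100, seed=0):
    """Average the new-cluster counts over random genome orders."""
    rng = random.Random(seed)
    n = len(pan_matrix)
    totals = [0] * (n - 1)
    for _ in range(n_perm):
        order = list(range(n))
        rng.shuffle(order)
        for i, c in enumerate(new_clusters_per_step(pan_matrix, order)):
            totals[i] += c
    return [t / n_perm for t in totals]

def fit_heaps_alpha(xs, ys):
    """Fit log(y) = log(k) - alpha * log(x); returns (k, alpha).

    Points with y == 0 are skipped because their logarithm is undefined
    (an assumption of this sketch, not of the cited model).
    """
    pts = [(math.log(x), math.log(y)) for x, y in zip(xs, ys) if y > 0]
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts)
    slope = sxy / sxx
    return math.exp(my - slope * mx), -slope

# Usage: x is the number of genomes included (2, 3, ...), y the mean new clusters.
# ys = mean_new_clusters(pan_matrix); xs = range(2, len(pan_matrix) + 1)
# k, alpha = fit_heaps_alpha(xs, ys)  # alpha > 1 suggests a closed pan-genome
```

Under this sketch, a fitted alpha above 1 means that the number of new clusters decays fast enough for the pan-genome size to level off as genomes are added.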

Figure S2. Pie charts representing the percentage of the number of orthologous protein clusters (A) and proteins (B) predicted from genes and pseudogenes of the Mycobacterium tuberculosis complex (MTBC) genomes.

Figure S6. Cluster of Orthologous Groups (COG) of the Mycobacterium tuberculosis complex (MTBC) and enrichment analysis. (A) Functional classification of the COG of the proteome of 233 genomes of the MTBC. (B) Functional classification of the COG of the MTBC according to the pan-genome. x-axis: percentage of proteins of the MTBC. y-axis: COG functional classification. COG categories are: [D] Cell cycle control, cell division, chromosome partitioning; [M] Cell wall/membrane/envelope biogenesis; [N] Cell motility; [O] Post-translational modification, protein turnover, and chaperones; [T] Signal transduction mechanisms; [U] Intracellular trafficking, secretion, and vesicular transport; [V] Defense mechanisms; [W] Extracellular structures; [Y] Nuclear structure; [Z] Cytoskeleton; [A] RNA processing and modification; [B] Chromatin structure and dynamics; [J] Translation, ribosomal structure and biogenesis; [K] Transcription; [L] Replication, recombination and repair; [C] Energy production and conversion; [E] Amino acid transport and metabolism; [F] Nucleotide transport and metabolism; [G] Carbohydrate transport and metabolism; [H] Coenzyme transport and metabolism; [I] Lipid transport and metabolism; [P] Inorganic ion transport and metabolism; [Q] Secondary metabolites biosynthesis, transport, and catabolism; [R] General function prediction only; [S] Function unknown; [X] Mobilome components. More than 89% of the CDS (coding DNA sequences) of each group were successfully annotated with EggNOG. Categories highlighted with bold edges were significantly enriched (p < 0.05) compared to the whole pan-genome.
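The caption does not name the statistical test behind the enrichment calls. As an illustration only, the sketch below assumes a one-sided hypergeometric test, a common choice for category enrichment, applied to one COG category at a time. The function name and argument names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """Return P(X >= k) for a hypergeometric draw.

    N: proteins in the background (e.g. the whole pan-genome)
    K: background proteins assigned to the COG category
    n: proteins in the subset being tested (e.g. the accessory genome)
    k: subset proteins assigned to the COG category
    """
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)

# A category would be called enriched in this sketch when the returned
# p-value falls below the chosen threshold (0.05 in the caption).
```

Testing many categories at once would normally call for a multiple-testing correction; whether one was applied here is not stated in the caption.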

Figure S7. Venn diagram of shared proteins of the accessory genome of each bacterial group. The groups of orthologous proteins of the accessory genomes (except the hypothetical clusters) of each bacterial group were evaluated using STRING [3] with Mycobacterium tuberculosis H37Rv as reference. The list of gene IDs identified by STRING for each bacterial group was matched to generate this Venn diagram. Thus, only proteins with a gene ID as detected by STRING are included.
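The matching of gene ID lists behind such a Venn diagram amounts to set operations. A minimal sketch follows, assuming each bacterial group is represented by a set of gene IDs; the group names and IDs in the usage comment are placeholders, not data from this study.

```python
def venn_regions(groups):
    """Map each combination of group names to the IDs found in exactly those groups.

    groups: dict of group name -> set of gene IDs.
    Returns a dict whose keys are sorted tuples of group names.
    """
    names = sorted(groups)
    regions = {}
    for gene_id in set().union(*groups.values()):
        members = tuple(name for name in names if gene_id in groups[name])
        regions.setdefault(members, set()).add(gene_id)
    return regions

# Usage with placeholder IDs:
# venn_regions({"group1": {"geneA", "geneB"}, "group2": {"geneB", "geneC"}})
# yields one region per combination: ("group1",), ("group1", "group2"), ("group2",)
```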

Figure S8. Protein networks of the accessory genomes of Mycobacterium tuberculosis and Mycobacterium africanum. The groups of orthologous proteins of the accessory genomes (except the hypothetical clusters) of each bacterial group were evaluated using STRING [3] with Mycobacterium tuberculosis H37Rv as reference. Edges: aqua green, from curated databases; pink, experimentally determined; green, gene neighborhood; yellow, text mining; black, co-expression.

Figure S9. Protein networks of the accessory genomes of Mycobacterium bovis and 'other animal strains'. The groups of orthologous proteins of the accessory genomes (except the hypothetical clusters) of each bacterial group were evaluated using STRING [3] with Mycobacterium tuberculosis H37Rv as reference. Edges: aqua green, from curated databases; pink, experimentally determined; green, gene neighborhood; yellow, text mining; black, co-expression.

Figure S10. Dotplot comparisons of good-quality and low-quality genomes of Mycobacterium tuberculosis or Mycobacterium bovis against the reference genome sequences of M. tuberculosis H37Rv or M. bovis SP38. The reference genomic sequence is on the x-axis; sequences of the other strains are on the y-axis. A-C are examples of quality-approved genomes. D-I are examples of genomes sequenced on the PacBio or Ion Torrent platforms that did not pass quality control. Dotplots were generated with the R package seqinr.
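The idea behind a dotplot can be sketched in a few lines of Python. This is an illustration of the general technique, not the seqinr routine used for the figure: it assumes exact matching of fixed-length words, and the function name and the wsize parameter are hypothetical.

```python
def dotplot_points(ref, query, wsize=3):
    """Return (i, j) pairs where the word of length wsize starting at
    position i of ref equals the word starting at position j of query."""
    # Index every word of the reference by its start positions.
    index = {}
    for i in range(len(ref) - wsize + 1):
        index.setdefault(ref[i:i + wsize], []).append(i)
    # Look up every word of the query in that index.
    points = []
    for j in range(len(query) - wsize + 1):
        for i in index.get(query[j:j + wsize], []):
            points.append((i, j))
    return points

# Identical sequences give an unbroken main diagonal; rearrangements or
# misassemblies show up as shifted or interrupted diagonals.
```

Plotting i on the x-axis and j on the y-axis reproduces the orientation described in the caption, with the reference on the x-axis.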

Figure S11. Correlation graphs of 233 strains of the Mycobacterium tuberculosis complex. (A) Correlation between the number of strain-specific proteins and the number of contigs in each draft genome. (B) Correlation between the number of strain-specific proteins and the N50 of each draft genome. (C) Correlation between the number of strain-specific proteins and the number of genomes of M. tuberculosis (n=114), M. africanum (n=33), M. bovis (n=66) and animal strains (n=20) (each dot corresponds to one group). (D) Correlation between the pan-genome size and the number of genomes of M. tuberculosis (n=114), M. africanum (n=33), M. bovis (n=66) and animal strains (n=20) (each dot corresponds to one group).
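For readers less familiar with the assembly metrics in panels A and B, the sketch below shows how N50 is computed from contig lengths and how a correlation coefficient can be obtained. The caption does not state which coefficient was used; Pearson's r is assumed here purely for illustration, and the function names are hypothetical.

```python
import math

def n50(contig_lengths):
    """Length of the contig at which the running sum of contigs, sorted
    from longest to shortest, first reaches half of the assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Usage: one value per draft genome, e.g.
# r = pearson_r(strain_specific_counts, [n50(c) for c in contigs_per_genome])
```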