Single nucleotide variation catalog from clinical isolates mapped on tertiary and quaternary structures of ESX-1-related proteins reveals critical regions as putative Mtb therapeutic targets

ABSTRACT Proteins encoded by the ESX-1 genes of interest are essential for full virulence in all Mycobacterium tuberculosis complex (Mtbc) lineages, the pathogens causing the highest mortality worldwide. Identifying critical regions in these ESX-1-related proteins could provide preventive or therapeutic targets for Mtb infection, the game changer needed for tuberculosis control. We analyzed a compendium of whole genome sequences of clinical Mtb isolates from all lineages from >32,000 patients and identified single nucleotide polymorphisms. When mutations corresponding to all non-synonymous single nucleotide polymorphisms were mapped on structural models of the ESX-1 proteins, fully conserved regions emerged. Some could be assigned to known quaternary structures, whereas others could be predicted to be involved in yet-to-be-discovered interactions. Some mutants had clonally expanded (found in >1% of the isolates); these mutants were mostly located at the surface of globular domains, remote from known intra- and inter-molecular protein–protein interactions. Fully conserved intrinsically disordered regions of proteins were found, suggesting that these regions are crucial for the pathogenicity of the Mtbc. Altogether, our findings highlight fully conserved regions of proteins as attractive vaccine antigens and drug targets to control Mtb virulence. Extending this approach to the whole Mtb genome as well as other microorganisms will enhance vaccine development for various pathogens. IMPORTANCE We mapped all non-synonymous single nucleotide polymorphisms onto each of the experimental and predicted ESX-1 proteins’ structural models and inspected their placement. Varying sizes of conserved regions were found. Next, we analyzed predicted intrinsically disordered regions within our set of proteins, finding two putative long stretches that are fully conserved, and discussed their potential essential role in immunological recognition. Combined, our findings highlight new targets for interfering with Mycobacterium tuberculosis complex virulence.

Raimond Ravelli passed away on 30 June 2023 during the preparation of this manuscript.
The authors declare no conflict of interest.
See the funding table on p. 17.

In Memoriam of Raimond Ravelli: A Trailblazing Scientist in the World of Nanobiology and Proteins
It is with heavy hearts that we bid farewell to our esteemed colleague, Raimond Ravelli, who passed away on 30 June 2023.Raimond was an extraordinary scientist and a devoted collaborator in our research endeavors.His brilliance, dedication, and passion for advancing our understanding of mycobacterial virulence proteins have left an indelible mark on the scientific community and all of us who had the privilege of working alongside him.
Beyond his remarkable scientific achievements, Raimond was known for his kindness, generosity, and willingness to help others.He was a mentor and friend to many, offering guidance and support to aspiring scientists, nurturing their potential, and fostering a sense of community within the scientific realm.
On behalf of all his colleagues and collaborators, we extend our deepest sympathies to Raimond's family and friends.Let us celebrate his life by carrying forward his dedication to scientific exploration, his passion for nature, and his unwavering commitment to the pursuit of knowledge.We wish to send warm regards and appreciation to Raimond' s wife Maaike and children Seppe and Noé.
(Mycobacterium bovis strain Calmette Guérin) was achieved by the spontaneous loss of the chromosomal region RD1 (region of difference 1) that carries the gene esxA (11).Likewise, BCG strains have acquired further attenuation mutations in addition to the loss of RD1, as indicated by the finding that complementation of BCG with the extended RD1 region can increase the virulence of the recombinant strain to some extent but not to the full level of a typical M. bovis cattle isolate.Similarly, deletion of RD1 in Mtb caused decreased virulence similar to that of BCG in vitro (12).Despite the spontaneous loss of an overlapping section of the RD1 region, Mycobacterium microti can show different degrees of virulence in animal models.Some M. microti strains, which were previously used as vaccines, were found to be highly attenuated, whereas some other strains show higher virulence (13).
The RD1 region contains genes of the ESAT-6 secretion system 1, ESX-1 (11,14).ESX-1 is a member of the type VII secretion systems (T7SS) family and is essential for full virulence in all Mtbc lineages (L1-L8) as well as in the closely related pathogenic Mycobacterium marinum, Mycobacterium kansasii (15), and Mycobacterium leprae (16).The ESX-1 genes of interest (GOI) are mainly located within the extended RD1 locus but also include multiple genes across different loci and are required for the build ing, functioning, and regulation of ESX-1, the transport of virulence factors, and their membrane lysis activity.The ESX-1 GOI encodes 31 proteins (ESX-1-related proteins, see Table S14) that can be divided into four functional categories: 7 substrates (products that are secreted during virulence), 6 inner-membrane core components (proteins part of the secretion machinery), 8 regulators (transcription factors), and 10 peripherals (exact function yet to be determined); the mechanism by which the substrates translocate through the mycobacterial outer membrane has not been solved yet, and the proteins potentially involved in that step have not been identified.
The structure of the ESX-1 inner core complex has not been solved yet; however, those of ESX-3 (17) and ESX-5 (18) have been deciphered, and atomic models for a few ESX proteins (in isolation or as part of a complex) have also been experimentally solved.Accurate structure prediction of single proteins has become available through artificial intelligence (AI) techniques (19,20), and the obtained models generally compare well to the experimental ones, thus allowing for large structural bioinformatics screens in silico.Using AI-generated protein models also allows for the building of higher-order protein complexes together with homologous templates such as the ESX-3 and ESX-5 core machineries in the case of ESX-1, thus allowing for the exploration of the inter-molecular interfaces within.
Despite their high genomic similarity, Mtbc lineages (L) differ significantly in the host immune response they elicit, host tropism, phenotypes, drug resistance, and transmissi bility (21)(22)(23)(24).So far, most research on the ESX-1 machinery has focused on the Mtbc L2 and L4 because of their widespread geographic range and the availability of laboratoryadapted reference strains such as H37Rv (L4) and HN878 (L2).Thus, our knowledge on the genomic differences and convergence in the ESX-1 genes across the Mtbc is limited.
To obtain a deeper understanding of virulence across Mtbc, we generated single nucleotide polymorphism (SNP) catalog found in the ESX-1 GOI for Mtbc lineages (L1-L6) of human importance.We used whole-genome sequencing (WGS) data from >32,000 publicly available clinical isolates to shed light on the intricate landscape of viru lence proteins.We investigated-with advanced bioinformatics methods-the spatial distribution of these genetic polymorphisms at the protein level from known essential amino-acid positions to protein-protein interfaces.In the process, we identified new hyperconserved protein domains in the peripherals and substrates of ESX-1: those coincide with intrinsically disordered regions, thus highlighting their essential nature in the intracellular survival of Mtb.

RESULTS
First, we provide evidence illustrating the variation tolerance of the ESX-1 GOI, confirm ing that it is the most non-synonymous single nucleotide polymorphism (nSNP)-dense group of genes within the Mtb genome (25).Next, we identified several genomic regions, including variants that arose independently under positive selection as done by Vargas et al. (25).We then examined the amino-acid locations that bear abundant SNP counts.In 31 ESX-1 genes, only 21 nSNPs were found in more than 1% of the isolates; given their extensive human-to-human passage, we considered them fully functional.None of these nSNPs resulted from convergence but were due to either opportunistic sampling of the data set or occurring in ancestral lineages.The data also revealed which parts of the 31 proteins are fully conserved.We mapped all nSNPs onto each of the experimental and predicted ESX-1 proteins' structural models and inspected their placement.Varying sizes of conserved regions were found, and some proteins showed clear polarity in their nSNP distribution.We then scrutinized some essential motifs [such as Walker motifs (26), secretion signals, SS-bonds, and post-translational modification sites] and identified less than 0.01% of isolates bearing nSNPs in those locations.Next, we analyzed predicted intrinsically disordered regions (IDRs) within our set of proteins, finding two putative long stretches that are fully conserved, and discussed their potential essential role in immunological recognition.Finally, we compared known and predicted quaternary structures, correlated interaction interfaces with nSNP distribution maps, and experimen tally validated one unexpected mutation on the interaction interface of EsxA and EsxB, which still permits complex formation.Combined, our findings highlight new targets for interfering with Mtbc virulence.

Consolidating SNPs for ESX-1 GOI
A collection of 32,399 unique Mtbc isolates, including clinical Mtbc isolates (25) L1-L6 (NCBI), L7 (27), as well as M. bovis (B.C. de Jong, unpublished data), was collated (Fig. 1; Fig. S1).The presence of an ESX-1 machinery and its substrates, essential to Mtb virulence, in virulent clinical isolates from human TB patients may be considered as positively selected strains for virulence in a genetic selection screen.For the ESX-1 GOI, we examined 31 genes encoding 31 proteins with a total of 11,167 amino acids, of which 8,616 displayed an SNP: 2,742 synonymous mutation sites (sSNPs) and 5,874 nSNPs.More than half (52%) of the encoded amino acids had at least one nSNP.Figures S2-S4 visualize the nSNPs mapped on the AlphaFold2-predicted structures for each of the 31 ESX-1 GOI.

Converging evolution
From convergence analysis, two silent sSNPs in espI in four independent lineages (L1, L4, L3, and L8) in a total of 10 unrelated clinical isolates were identified and confirmed by Sanger sequencing (Fig. S5; Table S2); these sSNPs featured a six cytosine repeat at codons 134-135.In cancer, this type of motif has been linked to changes in methylation (28).Long PacBio reads could provide information about whether these mutations affect the methylation state of ESX-1.

Conservation on 3D level
We mapped all nSNPs onto each of the known or predicted 3D structural models and inspected their spatial distribution.Varying sizes of conserved regions were found, as shown by the dominant blue color of the predicted protein structures ( Fig. S2-S4).Some proteins showed clear polarity; for example, nSNPs of the regulators WhiB6, PhoR, PhoP, MprA, and MprB accumulated on one side of their surface, while the opposite side showed no nSNP (Fig. S3).

Hotspots in ESX-1 GOI
After excluding the two polymorphisms in espI due to convergent evolution, we identified 78% of the nSNPs in so-called hotspots [in which the same mutation was found in over 1% of the clinical isolates (n > 300)], reflecting a phylogenetically more "basal" polymorphism with clonal expansion.We identified 21 hotspots (Fig. 2) in 13 of the 31 ESX-1 GOI, involving all lineages (Table S3).Only four proteins showed more than one hotspot (PPE68, EccD1, EccE1, and EspK), while nine showed each a single hotspot (PhoR, MprA, MrpB, EspA, EspE, EccB1, EsxB, EspH, and EspB).The two most abundant nSNPs were excluded as hotspots (T192I in EspA and E99* in PE35) since they included almost the entire data set, suggesting those polymorphisms appeared in the H37Rv reference strain instead; the L339H MprB variant appeared in half the data set.Hotspots occurred in all protein categories: we identified them in three regulators, three peripherals, four substrates, and three core components.Upon mapping the hotspots onto the structural models, we found them predominantly located on the surface of the proteins, including both polar and apolar residues (Fig. 2).Visual inspection of the experimental as well as predicted protein structures of these 13 proteins and 21 hotspots showed very few polar and H-bond interactions of the hotspot side chains within the protein itself (data not shown).
Since the effectors EsxA and EsxB are known to assemble into a heterodimer prior to its secretion by the ESX-1 system, we investigated the most prevalent EsxB variant E68K as it involves a residue buried in the interface with EsxA (29,30) and determined how it affected the assembly.We expressed and purified the mutant protein and showed an unharmed interaction with EsxA (Fig. 3), confirming earlier predictions by molecular modeling that the E68K mutation would have minimal impact on the interface between EsxB and EsxA.that harbour them: the EspB heptamer was solved experimentally (N-terminal moiety, PDB: 7P13), EccB1 and EccD1 dimers were built by a combination of AlphaFold2 prediction and molecular threading using their homologs from ESX-3 and ESX-5 as templates, the PhoR and MprB dimers were predicted using AlphaFold2-multimer, the other models were gathered from the AlphaFold Protein Structure Database (https://alphafold.ebi.ac.uk/).Regions of proteins predicted not to be folded and not containing mutation hotspots are omitted from the models.A hotspot is classified as such when the same nSNPs is found in more than 300 clinical isolates (constituting over 1% of the isolates).Homo-oligomers are depicted with a one subunit in colour tan.

Essential motifs and binding interface regions
Next, we analyzed the presence of nSNPs in the two motifs known to be essential for secretion in the substrates and peripheral proteins: WxG and YxxxD/E.Overall, these secretion motifs were found to be highly conserved.Only nine nSNPs were found at the tryptophan and four at the glycine sites of the WxG motif, while for the YxxxD/E motif 13 and 2 nSNPs were identified at the tyrosine and aspartic acid sites, respectively.Some of the ESX-1 core components contain ATPase domains, the hydrolysis of which powers the secretion of the ESX-1 substrates through the mycobacterial membranes.These domains require the presence of Walker motifs for the proper binding and hydrolysis of nucleotides.Thus, we analyzed whether nSNPs occurred in these motifs.The Walker A motif, also called the "P-loop" for its phosphate-binding loop, exhibits the classical pattern (G/A)xxxxGK(T/S) (31).Our data show a total number of six nSNPs within this motif for the total of three ATPase domains found in EccCa1 and EccCb1; no nSNP was found in the P-loop motif of EccA1 (Table S4).The Walker B motif (hhhhD, where h is any hydrophobic residue) is fully conserved in all three proteins.The peripheral protein EspI harbors another P-loop NTPase domain; its role in downregulating ESX-1-mediated secretion when Mtb cellular ATP levels are low has been established (32), and a binding site of moderate affinity to ATP has been identified within its Walker A motif (33).In EspI, this motif is well conserved with only six nSNPs, and the essential ATP-binding residue Lys425 is unaltered.
The core membrane component (e.g., the secretion machine) of ESX-1 is critical for the virulence of the bacteria and represents an important target for the development of new antitubercular agents.It is built from multiple copies of four ESX conserved components (Ecc), named EccB, EccC, EccD, and EccE, and is stabilized by the protease MycP1, resulting in a large ~1,500 kDa particle.Therefore, we looked for SNP trends in the ESX-1 inner membrane core complex.Because no experimental structure of this complex has been elucidated to date, we modeled it as a trimer of protomer dimers and mapped the identified nSNPs onto it.Our data set revealed a large overall number of mutations within each of its components: 5,923 in EccB1, 2,679 in EccCa1, 3,614 in EccCb1, 6,760 in EccD1, 7,722 in EccE1, and 1,148 in MycP1 (total of 27,846 in the five proteins combined).No obvious trend was found in terms of nSNP clustering or conservation (Fig. 2 and Fig. 4; Fig. S9) Finally, we analyzed post-translational modification sites, allowing for glycosylation and phosphorylation, which were shown to play a significant role in Mtb adaptive processes.A recent study mapped glycoproteomic patterns of clinical isolates of Mtb (34).Consequently, we honed in on the 31 ESX-1 GOI, which contains 27 glycosylation sites in total.The majority of sites (18/27) did not harbor nSNPs; the other nine sites counted only 31 nSNPs in >32,000 isolates (0.01%) (Table S6).Moreover, complete conservation (0 nSNP) was observed in the phosphorylation sites for EspJ (Ser70) (35) and the transcription regulators PhoR (His259) (36) and MprB (His249) (37).Together, these data demonstrate that functional regions essential to ESX-1 function are largely conserved within the set of 31 ESX-1 GOI.

Intrinsically disordered regions
The predicted AlphaFold structural models contain regions of low and very low confidence scores (respectively, <70 and <50).The percentage of lowconfidence regions varies for different species and is relatively small for bacterial proteomes such as M. tuberculosis (13.29%) (38).We analyzed whether these regions of low confidence in the proteins encoded by the ESX-1 GOI reflect intrinsically disordered regions (39), which tend to be hydrophilic unstructured protein structures thought to be disproportionately involved in interactions.Of the 31 ESX-1 GOI proteins, 11 contained at least one IDR, defined in this study by both a low predicted local distance difference test (pLDDT) confidence score and a high IDR prediction score (40) (Fig. S6-S9; Table S5).To filter out flexible loops connecting folded domains, stretches containing a minimum of 16 amino acids were kept.Out of those 11 IDR-containing proteins, 7 substrates (PPE68 and EspB) and peripheral ones (EspA, EspE, EspI, EspJ, and EspK) contain large IDR stretches (between ~70 and ~290 residues).By quantifying the relative nSNP content of those regions, we further split them into sub-regions of high and low polymorphism.As such, we identified a single conserved (~11% SNV) IDR for EspK, and polymorphic-only IDRs for EspA, EspJ, and PPE68; EspE, EspI, and EspB displayed both kinds (Fig. 5).IDRs harboring no nSNPs (in EspI and EspB) suggest a conserved role for these regions, as exemplified by the entirely nSNP-free IDR identified in EspB (residues 262-335) between the folded N-terminal domain (1-269) and the MycP1 cleavage site [Ala358/Ser359 (41)] that promotes the oligomerization of the protein (42,43).

Quaternary structures
To gain insight into the fidelity of interfaces for protein complexes utilized by the T7SS in pathogenic mycobacteria, we conducted a thorough examination of nSNPs within the context of quaternary structures.We employed the PISA interface tool (44) to scrutinize known complex interfaces, using a BSA score > 0 as an indicator of predicted interaction sites.For every complex interface in Mycobacterium tuberculosis (Mtb), we quantified the number of nSNPs associated with each interacting residue.
To begin, we identified nSNPs within experimentally verified regions where multi ple ESX-1 substrates have been documented to interact (29,42,(45)(46)(47).Focusing on the primary substrate complex transported by the ESX-1 system, we emphasize the conservation within the interaction sites between EsxA and EsxB.Within these sites, 19 amino acids participate in their interaction.Remarkably, only two interacting residues from each protein exhibited counts of two to three detected nSNPs at each position.To put this into perspective, these SNP occurrences were observed in just nine isolates out of a pool exceeding 32,000 (as detailed in Table S7), with a notable hotspot nSNP located at EsxB (E68K, 2,478 nSNPs).
When we focused our analysis on the self-interactions of EspB and PhoR oligom ers, we discovered fewer than 50 nSNPs in less than 25% of the interacting residues (as detailed in Tables S8 and S9).Moreover, within the 15 residues involved in the biologically validated interaction between EspK and EspB, only five residues exhibited nSNPs (with counts ranging from 1 to 15), while the remaining 10 residues showed no nSNPs across the pool of analyzed genomes.This high level of conservation in the residues mediating this interaction emphasizes the stability and fidelity of the EspK-EspB complex (as shown in Table S10).
Next, we turned our attention to regions of DNA-protein interaction, specifically focusing on regulators EspR and PhoP (as depicted in Fig. 6).Remarkably, we identified complete conservation in all 11 interaction sites, comprising seven residues in the PhoP-DNA interaction and four residues in the EspR-DNA interaction (as detailed in Tables S11 and S12).This conservation strongly signifies the functional importance of these interaction sites for precise gene regulation within the context of Mtb's virulence mechanisms.
Finally, we analyzed the conservation of the cysteines, present in 22 of the 31 ESX-1 proteins (Table S13): these rare, highly reactive residues play crucial roles in modulating intra-and intermolecular protein stability.We examined the cysteines involved in the formation of disulfide bonds in the periplasmic domain of the core components EccB1 (Cys150-Cys345), MycP1 (Cys49-Cys118 and Cys204-Cys242), and the secreted peripheral protein EspE (Cys114-Cys170).A single nSNP was identified within the disulfide bridge of EccB1.We also inspected the cysteines involved in the probable homo-or heteromultimerization of the secreted peripheral components EspA, C, and D (Cys138, Cys48, and Cys174, respectively).EspA only harbored nSNPs at that position in the form of a mutation hotspot (640 nSNPs).Mutation of this position has been linked to the lack of dimerization of EspA and a general decrease in virulence but is not associated with a loss of ESX-1-mediated secretion (48), highlighting the probable role of EspA in the correct delivery of the virulence factors to the host.Regarding cytoplasmic components, the reducing conditions imposed intracellularly do not favor the formation of such disulfide bridges.In general, the cysteines belonging to the cytosolic components are generally well conserved (11 or less nSNPs), except for Cys8 of EspM (25 nSNPs), Cys729 of EspK (20,242 nSNPs), and Cys53 of WhiB6 (76 nSNPs).The latter may form a [4Fe-4S] cluster (Pfam: PF02467)-together with Cys34, Cys56, and Cys62-seemingly involved in sensing the redox conditions within the mycobacterial cytosol and essential in downregulating the secretion of EsxA and EsxB during the innate immune response to mycobacterial infection (49).The general lack of nSNPs detected at the cysteine positions underscores the critical significance of their conservation for the proper activity of ESX-1 at both structural and functional levels.

DISCUSSION
The Mtbc genome is very conserved between lineages and species.Over a ~4.4 Mb genome, we identified a grand total of ~800,000 SNPs (positions, ~18% of the genome) representing nSNPs over the 32,399 isolates, of which half were singletons (a single nSNP identified per SNP position).In comparison, the ESX-1 GOI displays an average of 13% ± 3.7% SNPs (range 4%-21%), indicating those 31 genes undergo different selection pressures.Within the 31 GOI, the five most mutated genes encoding EccB1, EccD1, EspJ, PhoR, and WhiB6 (all with >50% of mutated amino-acid positions) may be under a stronger positive selection pressure than the other 28 genes.Co-evolution analyses may reveal new interaction partners for these proteins.Despite clonal expansion being the major force behind the spreading of new Mtb variants (50), the maximum nSNP counts for ESX-1 proteins EspC and EsxA are still fairly low (13 and 17, respectively), and the SNPs identified in these genes are mostly singletons; this might be related to their essential role in Mtb's parasitic lifestyle, with their respective genes undergoing a strong negative selection.
The core complex proteins exhibited the highest degree of variation as a group within the ESX-1 proteins.As these mutations were not detrimental to the virulence of Mtbc in vivo, we can conclude they did not alter the structural integrity of the machi nery nor the interactions between the different components sustaining its function.While the structural integrity of the machinery may not be easily compromised by single point mutations, the interactions between the different partners of this intricate complex might be disrupted.However, it must be noted that the disulfidebridgeform ing cysteines showed a very high degree of conservation (i.e., a single nSNP identified over 10 positions) that underscores the importance of maintaining structural integrity for the components involved.The other cysteines in ESX-1 were more prone to mutations (55% SNPs over 47 positions) as exemplified by the ones found in the global regulator WhiB6.We identified mutants at each position, four of which were involved in the formation of a putative iron-sulfur cluster 4Fe:4S.This redox-reactive cluster is heavily involved in the regulation of ESX-1-mediated virulence.M. marinum mutant exhibited reduced ESX-1-mediated, contact-dependent hemolysis in vitro, concomitant to reduced secretion levels of EsxA and EsxB (49).A detailed analysis of the 89 mutants identified in this study may help unveil new regulation networks involved in maintaining virulence in those isolates.As well, co-evolution network analyses may help in identifying muta tion pairs responsible for intra-and inter-molecular epistasis involved in stabilizing the various variants at other positions.An experimentally determined structure of the ESX-1 inner membrane complex would also provide a suitable framework for a detailed analysis of these mutations.
Our analyses revealed ample sequence conservation in the known interaction and active sites of ESX-1 proteins, such as the ATPase domains of the core complex.Most proteins have signature sequences or motifs that are characteristic of protein families and represent an important feature in the protein structure or function.The Walker B motifs were indeed conserved in the ESX-1 ATPases.It is noteworthy that even the ATPase 2 and 3 domains of EccCb1 are relatively conserved even though these domains may not be catalytically active (51), as suggested by the orientation of the involved residues in the experimental Thermomonospora curvata EccC structural model, and the discovery of ATP molecules bound to these domains (51).However, those observations may not be relevant to EccCb proteins of the Mtbc strains as we have evidence of ATPase activity in EccCb1 from M. tuberculosis H37Rv (B.C. de Jong, unpublished data).Moreover, several putative sites for post-translational glycosylation and phosphorylation identified in the ESX-1-associated proteins displayed considerate conservation, thus highlighting their relevance as diagnostics or drug and vaccine targets.
Additionally, some IDRs also proved to be highly conserved, harboring only a few nSNPs.While several IDP-specialized prediction tools have been developed in the past two decades, AlphaFold has become a surprisingly efficient tool at identifying IDRs (long amino-acid stretches exhibiting low pLDDT confidence score) (52).We located a large proportion of such putative IDRs in 11 ESX-1 proteins, in which four substrates/peripheral proteins displayed highly conserved, long amino-acid IDR stretches.The functions of IDRs are not tied to their proper spatial folding but rather to their lack thereof.These natively unfolded proteins may undergo partial conformational arrangement upon binding of specific ligands (nucleic acids, co-factors, membranes, and proteins) and often play a scaffolding role in the establishment of protein complexes.This high conservation level implies an essential nature of these regions for the intracellular lifestyle of parasitic Mtb and-together with their secreted nature-their possible involvement with host factors.In addition, a strong immuno-modulator role can specifically be attributed to the IDRs encoded by the proteins of the PE/PPE/PE_PGRS families.Those low-complexity regions have been shown to be prone to accumulate polymorphisms and may act as decoy antigens, thus helping Mtb to evade immune surveillance by the host (53).The different levels of conservation observed for the different IDRs identified within the peripheral and secreted substrates of the ESX-1 machinery imply those regions may fulfill different roles, with the polymorphic IDRs acting on immune evasion, while the highly conserved ones assume functions essential to the survival and pathogenesis of Mtb, probably involving interactions with some host factors that may not allow for molecular epistasis between the host and its parasite.This is perfectly exemplified by EspB's first IDR region (amino acids ~260-340) that is thought to interact with phospha tidic acid (PA) and, to a lesser extent, with phosphatidylserine (PS) (54).Those phospho lipids are abundant in the phagosome membrane and are involved in its maturation into bacteria-killing phagolysosomes.A major role of EspB in Mtb virulence resides in delaying the phagosome maturation, and it has been proposed that the unstructured residues following the folded N-terminal domain partially fold to bind on the surface of biological membranes containing negatively charged phospholipids such as PA and PS, thus interfering with their trafficking (42,55).The lack of nSNPs within that first EspB IDR supports this hypothesis, showing its essential role in Mtb virulence.The identification and characterization of other putative IDR-interacting host factors may require the use of tailored experimental strategies, such as microscopic co-localization of the interacting IDRs with the host ultrastructures, the direct pulldown of those factors from host cellular extracts using the IDRs of interest as baits, or the use of microarray screens.Unexpect edly, when analyzing the amino acid sequence of EspB (42), EspK (56), and EspI IDRs (data not shown) in other species, we noticed that these are the least conserved regions in the protein, only the nature of the amino acids is conserved, i.e., high content level of prolines and other nonpolar residues.The meaning of this observation will have to be further investigated, but it is possible that it is only important to maintain the physical chemistry properties of these IDRs between species, while in Mtb there are signatures with specific functions that cannot be modified.
In contrast to conserved regions, mutation hotspots seem to occur mostly at places on the protein surface for which no interactions have been reported so far.The observed clustering of nSNPs to a single amino acid was not surprising (Fig. 2), as surface areas (where no interactions occur) are known to be prone to polymorphisms.The hotspots suggest that these regions can vary without functional consequence and thus have no evolutionary pressure to remain stable.
Overall, our analysis of nSNP counts in specific interactions and post-translational modification sites within the ESX-1 genes of Mtb provides valuable insights into the conservation and genetic diversity within these critical protein-protein and protein-DNA interfaces.These findings contribute to our understanding of the functional significance of these interactions and their potential implications for Mtb's pathogenicity.Moreover, we showed here the potential of this approach to identify clinically relevant molecular targets in the presence of both highly conserved IDRs and motifs essential to the virulence, which should be further taken up by the Mtb community as potential targets for new vaccines.In this study, the use of a limited number of sequenced genomes from clinical isolates, relative to the size of the Mtb genome, did not allow for a true identification of essential protein regions at amino-acid resolution; however, it can be expected that the wider adoption of WGS as a diagnostics tool will soon allow the gathering of sufficient data to reach this goal and better understand the dynamics of Mtb virulence outside the laboratory.
Conserved areas in a pathogen genome are good targets for a vaccine, as vaccine efficacy requires both adequate immune coverage (adaptive immune mechanisms in people of diverse genetic backgrounds must be able to generate immunity against the vaccine) and vaccine epitopes against which immunity is induced must be con served among variants of the pathogen of interest.Nemes et al. (57) compared nSNPs' distribution in the Mtb genome divided into three gene sets, including "essential genes, " "non-essential genes, " and "antigens." Essential genes were defined as transposon insertion mutants that were defective for the ability to grow on a solid medium.Antigens were defined based on the presence of experimentally confirmed human T-cell epitopes.The observation that the known human T-cell epitopes are hyperconserved suggests that the bacteria actually benefit from T-cell recognition.Given the lengthy co-evolution of MTB and humans, it is yet unclear if the identification of hyperconserved epitopes yields ideal vaccine targets or favors MTBs' immune recognition leading to increased cavitation and ongoing transmission.

Whole-genome sequencing and phylogenetic analysis of 967 clinical isolates
A total of 967 genomes of clinical isolates from the Mtbc (n = 961; L1-L8) and from M. bovis (n = 6) were included, from Bangladesh and Gambia (58-60) (Supplemantary Table ).We used this data set as the initial backbone phylogeny for the rest of the analysis.The semi-automated Mtbseq pipeline was used for read mapping and variant detection (61).The output of Mtbseq was used to generate an SNP alignment using an in-house Python script (https://github.com/alxndravc/ESX-1-MS).Based on this SNP alignment, a maximum likelihood tree was built using RaxML-NG (62) with a GTR + CAT model of evolution and 100 bootstraps; M. canettii was added as an outgroup.The different phylogenetic lineages were visualized using the online interactive Tree Of Life tool (63).

Phylogenetic analysis of 31,428 publicly available Mtbc isolates
We used an SNP barcode (64) to type a collection of 31,428 WGS Mtbc isolates down loaded from NCBI into Mtbc lineage and sub-lineage.We excluded isolates that were missing 10% of SNP sites, were not typed as belonging to Mtbc L1-6, or were typed as L4 but were not typed further with an L4 sub-lineage.We split the 31,428 isolates into eight groups based on genetic similarity, five groups corresponding to global L1, L2, L3, L5, and L6, and three groups for lineage 4 (i.e., L4.1.12).To generate phylogenies for each of these groups, we first merged VCF files of the isolates in each group with bcftools (65).We then removed repetitive, antibiotic resistance, and low-coverage regions (64).We generated a multi-sequence FASTA alignment from the merged VCF file with vcf2phylip (version 1.5).Finally, we constructed the phylogenetic trees for each group with IQ-TREE 1.6.12(66).We used the mset option to restrict model selection to GTR + CAT models and selected the GTR + F + I + R model for the six isolate groups corresponding to L1-L4 and implemented the automatic model selection with ModelFinder Plus (67) for the isolate groups corresponding to L5 and L6.

SNP catalog
An in-house Python script was used to count the unique SNPs present within each isolate group (stratified by global lineage) within the ESX-1 GOI (https://github.com/alxndravc/ESX-1-MS).To calculate SNP frequency per 1 kb, the number of unique SNP locations per gene was multiplied by 1,000 bp and divided by their respective gene length.The same methodology was used on the 31,000 WGS isolates from NCBI.The genes were divided into four groups: machinery, substrates, regulatory, and peripheral (i.e., genes not belonging to any of the other three categories) (Table S1).

SNPPar analysis homoplasy
The resulting phylogenetic tree from the 967 clinical isolates was used in SNPPar along with the SNP data set to obtain the mutation events across all Mtbc lineages.To screen for convergent SNP sites in the alignment, SNPPar was used.Based on the provided phylogenetic tree, SNPPar searches for SNPs that are the same mutation (e.g., C, G) at the same position in two or more unrelated isolates or different mutations that result in the same base (e.g., C, G, A, G) in the same position.It also detects revertant mutations back to the ancestral state (e.g., C G C) (68).We used the default settings of SNPPar, which is a TreeTime for ancestral state reconstruction.As input, the phylogenies mentioned above were used together with the H37Rv reference genome in GenBank format (NC_000962.3) and an SNP position file.On the large data set of 31,428 publicly available samples, SNPPar was run eight times on eight independent sets of isolates corresponding to eight genetic backgrounds (L1-L6).Lineage sample counts are as follows: L1: 2,815; L2: 8,090; L3: 3,398; L4A: 5,839; L4B: 6,958; L4C: 4,134; L5: 98; and L6: 96.

Mapping of the nSNPs on the ESX-1 predicted models
AlphaFold2-generated protein 3D models were collected from the EBI/AlphaFold collection of models built upon the UniProt database (version 4, last accessed on 20 July 2022, available at https://alphafold.ebi.ac.uk), using their corresponding UniProt reference.For each model, pLDDT scores per amino-acid position were extracted from the coordinates files (b-factor column), and nSNP counts per amino acid were assigned into corresponding attribute files prior to their rendering with UCSF Chimera 1.15 (69) and/or ChimeraX 1.6.1 (70) with the following coloring scheme: blue for 0 or yellow >0 SNPs, red for the 1% outliers (>300 SNPs).

Modeling of the ESX-1 core complex
To this date, no experimental structural model of the ESX-1 core complex (EccB1/Ca1/Cb1/D1/E1) has been deciphered; however, the structures of the homolo gous complexes ESX-3 of Mycobacterium smegmatis (17) and ESX-5 of M. tuberculosis (18) were recently solved by cryo-SPA (single particle averaging) electron microscopy.The overall high homology levels observed between the components of ESX-1 and those of ESX-3 and ESX-5 allowed us, with the help of AlphaFold2-generated models, to propose a composite predicted model of the whole ESX-1 inner-membrane core machinery.The general procedure consisted first of retrieving (i) the AF2-generated models of the ESX-1 core components through their respective Uniprot references and (ii) the atomic models of the ESX-3 and ESX-5 core complex components deter mined from the experimental cryo-SPA experiments (PDB: 6SGX, 6SGZ, and EMD-10187, and PDB: 7NPR and EMD-12517, respectively, obtained from https://www.rcsb.organd https://www.ebi.ac.uk/emdb; see Table S14).Then, using UCSF Chimera 1.15, domain orientation in the core ESX-1 AF2-generated models was corrected by manually isolating the individual domains, structurally aligning them to their ESX-3/5 counterparts, and reassembling them with the program Modeller 9.20 (71) (available at https://sali lab.org/modeller).Finally, the tweaked ESX-1 models were assembled into a composite model using the whole core ESX-3 and ESX-5 complexes as references: a protomer dimer containing two copies of EccB1, EccCa1, EccCb1, and EccE1, four copies of EccD1, and one copy of MycP1 linking the protomers was first assembled; the larger complex was deduced by trimerization of this protomer dimer (C3 symmetry).
Specific modifications were applied to the ESX-1 initial models as follows: for EccB1, two different models were derived from the initial AF2 model to account for the difference in the orientation of the N-terminal helices (located in the cytoplasm and the inner membrane, respectively) found in the protomers of ESX-3 and ESX-5.Regard ing EccCa1 and EccCb1, the ESX-3 and ESX-5 cryo-SPA density maps do not offer a sufficient resolution for the modeling of the EccC3 and EccC5 cytoplasmic regions; the relative orientation of the EccCb1 sub-unit toward EccCa1 was deduced by a structural alignment using the experimentally deduced model of the EccC cytoplasmic domain from T. curvata (PDB: 4NH0), in which the different ATPase domains are physically linked through covalent bonding (51); remarkably, this solution allowed for the EccCb1 monomers to not overlap with each other but assemble in a ring-like structure of the N-terminal ATPase domains of EcCb1 in the final ESX-1 composite model.For the EccD1 dimer, the assembly of the C-terminal transmembrane domains used EccD3 (from PDB: 6SGZ) and EccD5 (from PDB: 7NPR) dimers as references; regarding the assembly of the N-terminal ubiquitin-like domain, the cyclin C2 dimer assembly identified by X-ray diffraction (72) (9PDB: 4KV2) was disregarded, instead the dimerization of that domain was modeled with EccD3 and EccD5 as references as well.Finally, the N-terminal hydrophobic alpha-helix of MycP1 was truncated to mimic the maturation of the signal peptide during trafficking, prior to adjusting the orientation of the C-terminal linker and membrane alpha-helix anchor using MycP5 as a reference (from PDB: 7NPR).The PhoR and MprB dimers were predicted using AlphaFold2-multimer (73).

nSNPs vs pLDDT vs IDR plots
Putative IDRs of the 31 ESX-1 proteins of interest (POIs) were predicted in batch using the OdiNPred server (Prediction of Order and Disorder by evaluation of NMR data, https://st-protein.chem.au.dk/odinpred) without evolution; the disorder probability (DP) scores were retained for further processing.nSNP counts, pLDDT scores, and DP scores were assigned to each amino acid position, per POI, and DP scores were normalized to percentage values for plotting purposes.Figures were generated with SigmaPlot 12.5 (Systat Software): nSNPs' counts per amino acid were displayed on a logarithmic scale (0.1-40,000, left axis), while pLDDT and DP scores per amino acid were displayed on a percentage scale (0-100, right axis).

FIG 1
FIG 1 nSNP catalog for 31 ESX-1 GOI and per MTBc lineage.Circos plot visualization of distinctive nSNPs counted in the data was made from merged variant files (MTBseq pipeline) and then filtered by uniqueness per lineage (not shared with other lineages).The outer lane depicts gene names.The first lane is color coded by functionality: green for substrates, gray for regulators, salmon for core complex, and blue for peripheral.The next six inner lanes designate lineage stratification (L1-L6).

FIG 2
FIG 2 Spatial distribution of nSNPs hotspots in ESX-1 components.Hotspots (depicted in red) were mapped onto the structural models of the ESX-1 proteins

FIG 3
FIG 3 Hotspot E68K in EsxB does not affect the interaction with EsxA.(A) Structure depicts interaction between wild-type protein EsxA (magenta) and EsxB (blue = 0 SNPs, yellow > 0 SNPs).Zoom-in region of residue E68 showing intra-(with Q65 of EsxB) and intermolecular contacts (with the main chain of W58 of EsxA).Lysine replacing E68 is displayed in green with its respective interaction.(B) Size-exclusion chromatogram shows conserved interaction between EsxA and EsxB regardless of hotspot E68K (black curve); wild-type EsxB (red curve) is shown for comparison.Insets correspond to the respective fractions resolved in a SDS-PAGE, showing both proteins (EsxA 9.9 kDa, EspB 10.8 kDa).

FIG 4
FIG4 nSNPs mapped onto the AlphaFold2-predicted models of an ESX-1-assembled core complex modeled by molecular threading onto ESX-3 and ESX-5.The complex was modeled as assembled in the Mtbc cytoplasmic membrane, where one subunit of EccB1, EccCa1, EccCb1, and EccE1 and two of EccD1 form a protomer, two protomers are stabilized by a copy of MycP1, which then trimerizes to form the hexamer.Color code: blue, 0 nSNPs; yellow>0 nSNPs in>32,000 clinical isolates.

FIG 5
FIG 5 ESX-1 components with the largest predicted IDR domains.The amino-acid sequences of the ESX-1 GOI were analysed with the AlphaFold2 and ODiNPred algorithms.The proteins with IDRs were identified as the longest low-pLDDT, high-disorder scores.A highly conserved IDR was identified in EspK (A).EspE (B), EspI (C) and EspB (D) displayed both conserved and polymorphic IDR sections.Long polymorphic IDR regions were identified in EspA (E), EspJ (F) and PPE68 (G).Per amino-acid position (X axis): nSNPs counts (blue bars, left Y axis in logarithmic scale), AlphaFold2 confidence (red dashes as pLDDT score on left Y axis as percentage) and ODiNPred disorder scores (green dots as IDR score also on the right Y axis as percentage).

FIG 6
FIG 6 High level of conservation in the interfaces of EspR (A) and PhoP (B) that interact with DNA.nSNPs were mapped and colour-coded onto the experimentally derived structural models for the ESX-1 transcription regulators EspR (PDB: 3QYX) and PhoP (PDB: 5ED4).Colour code: blue = 0 nSNP, yellow ≥ 1 nSNP in 32,000 clinical isolates.