Conserved patterns of sequence diversification provide insight into the evolution of two-component systems in Enterobacteriaceae

Abstract Two-component regulatory systems (TCSs) are a major mechanism used by bacteria to sense and respond to their environments. Many of the same TCSs are used by biologically diverse organisms with different regulatory needs, suggesting that the functions of TCS must evolve. To explore this topic, we analysed the amino acid sequence divergence patterns of a large set of broadly conserved TCS across different branches of Enterobacteriaceae, a family of Gram-negative bacteria that includes biomedically important genera such as Salmonella, Escherichia, Klebsiella and others. Our analysis revealed trends in how TCS sequences change across different proteins or functional domains of the TCS, and across different lineages. Based on these trends, we identified individual TCS that exhibit atypical evolutionary patterns. We observed that the relative extent to which the sequence of a given TCS varies across different lineages is generally well conserved, unveiling a hierarchy of TCS sequence conservation with EnvZ/OmpR as the most conserved TCS. We provide evidence that, for the most divergent of the TCS analysed, PmrA/PmrB, different alleles were horizontally acquired by different branches of this family, and that different PmrA/PmrB sequence variants have highly divergent signal-sensing domains. Collectively, this study sheds light on how TCS evolve, and serves as a compendium for how the sequences of the TCS in this family have diverged over the course of evolution.


INTRODUCTION
Two-component regulatory systems (TCSs), versatile regulatory systems that couple environmental signal detection to physiological responses, are an important mechanism used by many organisms to ensure the timely expression of their genetic repertoires in response to changing conditions [2,3].They are most prevalent in bacteria, where a single genome can often encode dozens or even hundreds of different TCSs [4].The two conserved elements that define a TCS are a histidine kinase (HK), which is responsible for environmental sensing, and a response regulator (RR), which mediates a biological response.In a prototypical TCS, the HK dimerizes upon sensing stimuli and autophosphorylates by transferring the γ-phosphoryl group of ATP to a conserved His residue, which is subsequently transfered to a conserved Asp residue of its cognate RR [2,5,6].Phosphorylation of the RR leads to a stabilization of its active conformation, triggering physiological responses that depend on the nature of the RR's output domain [5,7,8].Many HKs are bifunctional and also act as phosphatases that inactivate their cognate RR in the absence of a stimulating signal [2,6,9].In general, both the HK and the RR components of a TCS have a modular architecture [7,9,10].A typical HK is composed of discrete domains involved in sensing, signal transmission and kinase activity.The kinase domain is a conserved feature of HKs and can be subdivided into a dimerization and histidine phosphotransfer domain (DHp) and a catalytic domain (CA) [9].By contrast, the sensing and signal transduction modules are not conserved across different HKs and exhibit tremendous diversity both in terms of structure and the stimuli recognized [3,9,10].Most HKs possess extracytoplasmic sensory domains (class I), while others utilize multiple transmembrane (class II) or even cytoplasmic (class III) sensory domains [10,11].RRs are composed of a conserved receiver (REC) domain, which contains the phosphorylatable Asp side chain, and a variable output domain that elicits biological changes in response to the phosphorylation of the REC domain by the HK.The output domain differs amongst TCS, with DNA-binding transcription factors representing the most common output modules [12,13].Hybrid HKs elaborate on the classic TCS scheme by encoding a REC domain, and may also include histidine phosphotransfer (HPt) domain(s) or associated HPt proteins, resulting in a multistep phosphorelay [2,6,14].In addition to this twist on the classic TCS paradigm, the regulatory cascades of many TCSs often involve additional components beyond the core HK/RR pair.The nature of these auxiliary components is variable, but includes a long list of proteins that directly interact with the TCS (often the HK) to influence its activity level [15][16][17].In many instances, these proteins serve as 'connectors' that allow for the integration of multiple signals into complex regulatory networks [18,19].The diversity observed across TCS in terms of both function and architecture is the culmination of a long, complex evolutionary history and is a testament to the efficacy of the TCS as a regulatory mechanism.
As bacteria evolve to adopt different lifestyles or inhabit new niches, their regulatory networks must also adapt.Although TCS with a narrow phylogenetic distribution are also common, many TCSs are widely distributed across evolutionarily distant taxa, indicating that the same TCSs are often employed by organisms with disparate regulatory requirements.For example, the PhoP/ PhoQ (PhoPQ) TCS is conserved amongst diverse gammaproteobacterial lineages, where it senses Mg 2+ concentrations and serves a conserved role in maintaining magnesium homeostasis [20].However, in addition to this conserved function, PhoPQ senses a variety of other signals and controls highly variable repertoires of genes in a manner that differs amongst different lineages [20][21][22].For example, in Salmonella enterica, PhoPQ is a master regulator that senses the intracellular environment this species inhabits upon invading an animal cell, and responds by activating the expression of genes encoding proteins that range from stress-response systems to toxins that disrupt host-cell biology [20,22].This intracellular niche bears little resemblance to any environment encountered by soil-dwelling or aquatic organisms that encode orthologous PhoPQ TCS.The PhoPQ regulatory network has adapted in several ways to accommodate the variable roles it plays in different organisms, including changes to its signal sensing, signal transduction, and DNA-binding properties, and the acquisition of different connector proteins that integrate PhoPQ signalling with other TCS [15,16,20,[22][23][24].This example is not unique, and different TCSs face variable evolutionary pressures depending upon their unique biology and the unique challenges faced by the organisms that encode them [25][26][27][28].With few exceptions, very little is known about how the functional properties of TCSs differ in different organisms.
In this study, we explore the evolution of TCSs in the Enterobacteriaceae family through an analysis of how the sequences of broadly distributed TCSs have changed across this lineage.Enterobacteriaceae, a large family of Gram-negative gammaproteobacteria, is composed of a cosmopolitan assortment of organisms and includes environmental organisms, phytopathogens, and important human pathogens such as S. enterica, Klebsiella pneumoniae, and assorted Escherichia coli pathovars [29].Recently, the Enterobacterales order has been reorganized to yield monophyletic families that better represent the evolutionary relationships amongst many genera originally assigned to the Enterobacteriaceae family [30,31].For example, phylogenetically distant species such as Yersinia pestis and Proteus mirabilis that were historically described as Enterobacteriaceae, now fall under the families Yersiniaceae and Morganellaceae, respectively [30,31].We chose the Enterobacteriaceae family to investigate TCS evolution for several reasons including the following: (i) its importance with regard to human health and biotechnology, (ii) TCS are relatively well studied in this family, with species such as E. coli and S. enterica serving as model systems for the study of TCS biology, and (iii) different genera of this family are sufficiently diverse to capture tremendous phenotypic and ecological diversity, but

Impact Statement
TCS play fundamentally important roles in bacteria, ranging from antibiotic resistance, to virulence, to metabolism, to interbacterial interactions, to fundamental cell biology.The question of how these systems change over evolution -resulting in functional differences in orthologous systems -is an important and poorly understood aspect of TCS biology.This study sheds light on this topic, highlighting both broadly conserved evolutionary trends, as well as specific evolutionary patterns for the many conserved, highly studied TCS found in the Enterobacteriaceae family.Additionally, this study identifies numerous instances of TCS whose evolutionary patterns differ significantly from broadly conserved trends, such as individual functional domains whose sequences have diverged more than expected in a particular TCS, or individual TCS that are unusually divergent in one particular lineage.These observations provide insight into the biology and evolution of these systems, as well as exciting avenues for future research.
sufficiently similar such that orthologous TCS can be readily identified and are likely to serve a related regulatory function.Previous in silico studies of TCS have made important contributions such as analysing how TCS repertoires vary across different taxa, surveying the architectural diversity of TCS, identifying conserved functional elements of HKs and RRs, providing insights into how new TCSs emerge, and shedding light on how HK-RR specificity is maintained over evolution [4,[32][33][34][35][36][37][38][39][40][41][42][43].In this study, we focus on the patterns of sequence diversification amongst TCSs in genera that are separated by millions of years of evolution.Our results reveal highly variable rates of evolutionary divergence for different TCSs.However, we observe strong correlations in the patterns of sequence divergence for individual TCS across different genera, or when comparing the HK and RR constituents or the functional domains that comprise these proteins.Importantly, TCSs with atypical evolutionary patterns were identified, providing leads for future investigations into areas such as species-specific adaptations of individual TCS.Our analysis identified PmrA/PmrB, an important mediator of resistance to certain antimicrobial compounds, as the broadly distributed TCS with the most divergent sequence across Enterobacteriaceae, and we provide evidence that multiple different alleles of this TCS with divergent signal sensing domains were independently acquired by different branches of this family.Collectively, this study reveals how the sequences of Enterobacteriaceae's TCSs have diverged over evolutionary time and sheds new light on how TCSs evolve.

Reference strains and their phylogenetic relationships
Reference strains selected for this study were based on previous phylogenetic analyses of the Enterobacteriaceae lineage.Based on these analyses, we selected nine strains, which were chosen in order to provide a cross section with sufficient breadth and depth to fulfil the aims of our analyses.For each of the genera selected, one strain was arbitrarily chosen to serve as a representative.The selected strains were Salmonella enterica, serovar Typhimurium (strain LT2), Escherichia coli (strain K12 MG1655), Citrobacter rodentium (strain ICC168), Klebsiella variicola (strain 342), Cronobacter turicensis (strain z3032), Enterobacter cloacae (strain ATCC 13047), Phytobacter diazotrophicus (strain TA9730), Kosakonia sacchari (strain BO-1), and Huaxiibacter chinensis (strain ZB04).A phylogenetic tree to analyse the evolutionary relationships between these strains was generated from whole genome sequences of the nine reference strains obtained from NCBI which were then re-annotated to extract concatenated protein sequences of single-copy core genes.Alignments of the amino acid sequences of the orthologs of all identified core genes were generated and used to infer the phylogenetic tree following the method outlined in Zheng et al. [44].

Identification and selection of TCS and CG proteins and compiling and aligning of their amino acid sequences
The 50 genes for the CG data set were selected arbitrarily on the basis that they, (i) were conserved in all reference strains, (ii) have a well-defined and conserved function, (iii) fall into diverse functional categories.AA sequences for CGs were compiled using a combination of the www.microbesonline.comsequence database and tblastn searches of the NCBI nr DNA sequence database using the S. Typhimurium protein sequences as the query [45].The complete set of TCS encoded by both the E. coli and S. enterica reference strains was identified by generating (redundant) lists of TCS by various approaches including (i) searches of domain and protein annotations from www.microbesonline.comusing these two strains (ii) analysis of the prokaryotic two-component system database (www.p2cs.org)for these two organisms (iii) analysis of previously published, comprehensive TCS lists for these organisms [46].The various approaches were cross-referenced to confirm that no TCS were missed, and any inconsistencies were individually investigated.TCS were identified in the remaining reference strains via tblastn searches using the S. enterica protein sequence as the query.To identify RRs and HKs throughout the reference strain set we set as thresholds a minimum query coverage of 90 % and minimal AA percent identities of ≥60 % (RRs) and ≥40 % (HKs).These thresholds were set empirically on the basis that orthologous proteins (same genome location, characterized to have a similar function) were reliably captured using these cutoffs, and that TCS proteins identified by blast that fell below these thresholds were consistently found to be different TCS from the query sequence and were found at disparate genome locations.TCS that were found in fewer than seven of the nine reference strains were omitted from further consideration since many of the analyses conducted would be skewed by (or not interpretable for) a TCS with a narrow representation amongst our reference strains.Once all sequences were compiled, we performed MSAs for all CGs, RRs, and HKs using Clustal Omega (default parameters) and extracted percent identity matrices for each pairwise comparison of reference strains [47].During this process, we noted that the curated start codons often differed for certain proteins in one or more of the reference strain(s).To account for this, we manually refined our AA sequences by analysing the DNA sequences of proteins with atypical start positions, seeking ATG, GTG, or TTG start codons that resulted in a similar sequence length to the orthologous proteins.We then repeated the MSAs for any proteins where alternative start codon selection was required.

Generating and analyzing AA divergence rate matrices
AA sequence divergence rates (AA differences/100 AA) were generated based on the percent sequence identity matrices generated as described above.For each protein (CG, RR or HK) average AA sequence divergence rates were calculated for all pairwise combinations of reference strains.A matrix of the average of all CG values for each pairwise combination of reference strains ('cumulative CG matrix') was generated and used as a standard for the relative divergence rates expected of the various combinations of reference strains.To identify pairs of organisms whose RR or HK divergence rates (population level) differed from the expected values, the cumulative CG matrix was multiplied by a correction factor (HK or RR global average divergence rate/CG global average divergence rate) to obtain an expected matrix to which the observed matrix was compared.To identify individual TCS proteins whose AA sequence diverged more than expected in an individual species, the cumulative matrix was multiplied by (the average divergence rate for that protein/the CG global average) to generate an expected matrix.For any TCS protein that was absent from one or more reference strains, CG matrices and global averages used were calculated based on the same set of strains [i.e.lacking the same reference strain(s) as that TCS protein].Expected matrices were compared to observed matrices to obtain an observed/expected matrix.To determine which TCS proteins were more divergent than expected, unpaired two-tailed t-tests were used comparing all observed/expected values from combinations that involve a given species to the complete set of values from combinations that do not involve that species.

Analyzing correlations of sequence divergence rates of TCSs across different lineages, of HKs and their cognate RRs, and TCS across functional domains
To analyse the extent to which TCS sequence diversity in one branch of Enterobacteriaceae correlate with the level of diversity in other branches, we selected pairs of reference stains (comparison pair) predicted to have diverged from a common ancestor after the branching point for the remaining seven reference strains, such that any sequence changes between the comparison pair are independent from any sequence changes in the remaining reference strains: S. enterica/E.coli were selected as one comparison pair and E. cloacae/H.chinensis as a second pair.Scatterplots were generated where for each TCS protein its AA sequence divergence rate for the comparison pair was plotted against its average AA sequence divergence rate for all pairwise comparisons of reference strains that do not include either strain from the comparison pair.Correlations were analysed using the linear regression analysis tools of the OriginLab (2023b) graphing and data analysis software [48].Outliers for this analysis were identified using a residuals plot, where proteins with a residual value greater than three were considered statistically significant outliers.
Analysis of the relative divergence rates of HK/RR pairs was conducted using the average values calculated across all pairwise combinations of reference strains.The values for each HK were plotted against the value for its cognate RR using a scatterplot and the linear regression analysis tools of the OriginLab (2023b) graphing and data analysis software [48].To perform a similar analysis based on the functional domains that comprise HKs and RRs, we first parsed the amino acid sequence of each RR and HK into two functional domains.For RR, the boundaries of the receiver domain were first identified using the InterPro Protein Family Classification tool (European Bioinformatics Institute) using the AA sequence of the S. enterica protein as a query [49].To ensure consistent boundaries were used, the REC domain for other reference strains were defined based on the S. enterica boundaries and the results of the MSA for that RR.DBD were then defined as the complete sequence of the RR that is downstream of the C-terminal boundary of the REC domain.A similar approach was used for HKs, where the AA sequence of S. enterica HKs were analysed using InterPro to identify the boundaries of the kinase domain, spanning the dimerization/histidine phosphotransfer (DHp) domain combined with the ATP-binding catalytic (CA) domains.The sensor domain was then defined as the complete sequence that is upstream of the N-terminal boundary of the kinase domain, while any sequences downstream of the C-terminal boundary of the kinase domain were not included in this analysis.Domain sequences and the relationship between their divergence rates were then analysed as described for HK/RR pairs.

Analyzing selective forces acting on TCS using dN/dS ratios
As with the analyses of amino acid sequences above, similar multiple sequence alignments of the DNA sequences of response regulators and histidine kinases were performed using mega (v.11.0.13).Sequences were aligned in mega using ClustalW (Codons) and subsequently used to compute pairwise distances between their coding sequences [50].Distance estimation was performed using the Kumar method (Kimura 2-para, 1000 Bootstrap replications, Syn-Nonsynonymous substitutions model at default settings).Matrices for the number of synonymous substitutions per synonymous site (dS) and nonsynonymous substitutions per nonsynonymous site (dN) were exported into Excel to generate a matrix of dN/dS ratios for each RR/HK and their pairwise comparisons between each species.The value of this matrix was averaged and used in Scatterplots against each RR or HK's respective AA divergence rate as well as against the dN/dS ratio of their cognate partner.

PmrAB evolutionary analysis
The genome locations of pmrA and pmrB (found to be adjacent in all genomes analysed) were identified by independently analysing the neighbouring genes in each of the reference strains.Analysis utilized the NCBI genome browser tool coupled with iterative tblastn searches, as well as DNA and protein-level sequence comparisons and analyses.The same approach was used to analyse pmrAB genome locations outside of the Enterobacteriaceae family by investigating individual strains from various families within the Enterobacterales order.Phylogenetic trees based on the sequences of PmrA, PmrB, PhoP and PhoQ were generated using the mega (Molecular Evolutionary Genetics Analysis) V11 software using the maximum-likelihood method and a WAG +G+I substitution model based on Clustal Omega-generated MSAs.A bootstrap method with 500 total replicates was used.

Analysis of the evolutionary relationships between the selected Enterobacteriaceae reference strains and their TCS repertoires
To analyse the evolution of TCS in Enterobacteriaceae, we first generated a set of reference genomes.We selected one strain from each of nine different genera within this family: Escherichia coli K-12 (strain MG1655), Salmonella enterica (serovar Typhimurium, strain LT2), Citrobacter rodentium (strain ICC168), Klebsiella variicola (strain 342; originally assigned to the K. pneumoniae species but later reassigned to the K. variicola species, which is a member of the K. pneumoniae complex), Cronobacter turicensis (strain z3032), Enterobacter cloacae (strain ATCC 13047), Phytobacter diazotrophicus (strain TA9730), Kosakonia sacchari (strain BO-1), and Huaxiibacter chinensis (strain ZB04; originally referred to as Lelliottia amnigena, but later assigned to H. chinensis) [51,52].The goal of this reference strain set was not to comprehensively capture the genetic diversity within this lineage, but rather to serve as a cross-section of Enterobacteriaceae TCS genetic diversity, such that individual strains represent genera that have diverged in ecology and lifestyle over millions of years.To provide a framework for exploring the TCS evolutionary relationships, we generated a phylogenetic tree based on the conserved proteomes of these strains, using Y. pestis strain CO92 as an outgroup (Fig. 1).Congruent with previous analyses of Enterobacteriaceae phylogenetics, this analysis confirms that the reference strains selected are well-dispersed across this family [30,53].Although well-dispersed, certain lineages are predicted to have diverged from one another more recently, such as E. cloacae/H.chinensis or the E. coli/S.enterica/C.rodentium clade.By contrast, C. turicensis is predicted to have branched off from the other lineages relatively early in the evolutionary history of this family.
For certain analyses, we aimed to contextualize the evolution of TCS proteins by comparing their divergence to that of other proteins that are conserved within Enterobacteriaceae.To this end, we compiled a set of 50 genes present in all reference strains with highly conserved functions spread across assorted functional categories (Table S1, available in the online version of this article).Using the amino acid (AA) sequences for these 50 conserved genes (CGs), we conducted multiple sequence alignments (MSAs), and generated percent AA sequence identity matrices for each CG across the reference strains, and a cumulative matrix of the average values across the 50 proteins (Fig. 1b, Datasets S1-S2).The phylogenetic distances inferred by the cumulative matrix are consistent with the phylogenetic tree generated above, and with previously published results (Fig. 1a, b) [30,54].The average AA sequence identity of the 50 CGs between species ranges from ~89 % (C.turicensis compared all other species) to ~94 % (E.coli/S.enterica/C.rodentium, and E. cloacae/H.chinensis); these values are somewhat higher than previously reported genome-wide averages, which was expected since our gene set was limited to highly conserved genes with integral functions [53,55,56].The spread in average AA sequence identities across the reference strains indicates they are well-dispersed across Enterobacteriaceae (Fig. 1b).When these analyses are contextualized using previous estimates of Enterobacteriaceae evolutionary timelines, it is likely that all of our reference strains are separated by ~100 million of years of evolution or more [57,58] We next compiled the set of TCS that are broadly conserved amongst our reference strains.Since E. coli and S. enterica have served as model organisms for studying TCS, we first generated a complete list of HKs and RRs from the genomes of these two reference strains.Because we sought to conduct global analyses of how HK and RR pairs have diverged over evolution, we limited our list to cognate HK/RR pairs, and disregarded other TCS proteins such as orphan HKs/RRs without overt interacting partners, or auxiliary proteins that expand the TCS phosphorelay.We found that 25 TCSs are conserved in both species, which is in agreement with previous studies (Fig. 1c) [4,59,60].Extracytoplasmic class I HKs dominated the set of conserved TCS, followed by cytoplasmic class III HKs and transmembrane class II HKs (Fig. 1c).With the exception of the chemotaxis RR CheY (which lacks an output domain), all other RRs have DNA-binding output domains, which is notable since DNA-binding output domains comprise only ~66 % of RR subfamilies in Pfam [13].At the subfamily level, OmpR-type were the most common, followed by NarL, NtrC, CitB, and LytTR (Fig. 1c).We then analysed the distribution of the E. coli/S.enterica TCS list across the other reference strains (Fig. 1c).We found that, in general, TCS were broadly conserved across the family, and that the majority of TCS were found in most or all of the reference strains.Although we aimed to select a broad assortment of TCS for further analysis, the inclusion of TCS with a narrow phylogenetic distribution would potentially skew downstream analyses that examine TCS evolutionary trends.To balance these considerations, we omitted any TCS that was absent from more than two reference strains (TCS names shaded grey in Fig. 1c); this cutoff eliminated 6 of the originally identified 25 TCS, and the remaining 19 broadly conserved TCS were selected for further analysis.

Population-level analysis of TCS sequence diversification in Enterobacteriaceae
To explore the evolutionary diversification of TCS, we first compiled the AA sequences of the 19 selected TCSs in each reference strain and conducted MSAs for each HK and RR (Table S2, Datasets S3-S4).We generated a matrix of the AA divergence rates for each HK and RR (which we define as the number of AA sequence differences per 100 AAs) between each pairwise combination of the reference strains (Dataset S2).Using these data, we then determined the average rates at which RRs and HKs accumulated amino differences across our entire dataset, and how these values compared to the rates for the CG dataset (Fig. 2a).We found that, when averaged across all proteins and species comparisons, RRs exhibited 9.9 AA differences/100 AAs, a very similar value to the 8.7 AA differences/100 AA observed for our CG set.This suggests that, at a population level, the AA sequences of broadly conserved RRs have diverged at a similar rate as a typical broadly conserved protein within Enterobacteriaceae.By contrast, we found that HKs had a significantly greater divergence rate (P<0.01 when compared to CGs) of 17.9 AA differences/100 AAs, indicating that, on average, HKs accumulate sequence changes at about double the rate of RRs and of conserved proteins at large.Since RRs and HKs generally work together as a single functional unit, this suggests that evolutionary adaptations might be more prevalent in the sensing and signal transduction components than in the output proteins of TCS.
We next compared the patterns of sequence change for TCS and CGs across the different combinations of reference strains.Because TCSs are a central mechanism of sensing and responding to the environment, we reasoned that, at the population level, the TCSs of species that have adapted to very different niches might be disproportionately divergent compared to the CGs, which have diverse functions that do not necessarily relate to niche adaptation.To explore this, we used the CG data to generate matrices Fig. 1.Overview and analysis of the reference strains and TCS analysed in this study.(a) Phylogenetic tree of the reference strains selected for analysis in this study.The evolutionary relationships between the nine chosen reference strains, each from a different major genus of the Enterobacteriaceae family, were analysed by generating a phylogenetic tree based on AA sequence alignments of the proteins encoded by the core genome of these strains.Y. pestis strain CO92, a member of the Enterobacterales order but not the Enterobacteriaceae family, was included as an outgroup.Numbers represent bootstrap support values calculated from 500 replicates.Lineages are listed as species; strain names are provided in the main text.(b) Matrix of the average AA sequence identities of the conserved gene set between all pairwise combinations of reference strains.Numbers indicate the average percent AA sequence identity across the 50 selected conserved genes; matrix is shown as a heat map for visual clarity.(c) The 25 TCS conserved in E. coli and S. enterica were analysed for their distribution amongst the reference strains set, where green indicates the presence of both the RR and HK, yellow indicates the absence of one component, and red denotes the absence of both the HK and the RR.TCS that were absent in more than two reference strains (and not selected for further analysis) are shaded in grey.TCS are classified based on their architecture and the nature of their output domain.HK architectures (HK) are designated as either: Class I (yellow), the classic HK archetype, which feature a periplasmic-sensing domain linked by a transmembrane domain to its cytoplasmic kinase domain, Class II (blue), which use multiple transmembrane regions for sensing but have a cytoplasmic kinase domain, or Class III (purple), which sense cytoplasmic signals and can be either membrane-associated or soluble [10].
The "H" modifier for HK class denotes a hybrid HK that features an additional phosphotransfer domain or an unusual phosphotransfer scheme.TCSs were further classified according to the output domain of their RR.Other than CheY, which lacks a discrete output domain, all other RRs possess a DNA-binding output domain belonging to one of five families: OmpR, NarL, NtrC, CitB and LytTR [12,13].Shorthand used for strains in all panels (e.g.'Ec') is described in (a).
of the expected divergence rates for both RRs and HKs across the reference strains, and compared these expected values to the observed average divergence values to generate matrices of the AA divergence compared to expected (Fig. 2b, c).We found that the HK and RR matrices showed strikingly similar trends across the various species comparisons.Although various combinations of organisms exhibited more or less TCS sequence variation than expected, certain trends were apparent in these data.For example, more variation than expected was observed for 13 of the 16 comparisons involving K. variicola HKs/RRs, suggesting that TCS on the whole are unusually divergent in this lineage.This is consistent with the remarkable ecological diversity of Klebsiella species, which inhabit a very wide range of environments, including assorted free-living and host-associated niches [61,62].We also found that TCS in the E. coli/S.enterica/C.rodentium clade are somewhat more similar to one another than expected, but they are almost universally more divergent than expected when compared to other lineages.This is in agreement with the adaptation of this clade to animal intestinal tracts, which is not thought to be the predominant environmental reservoir for the other lineages.Interestingly, many of the comparisons yielding the lowest rates of TCS divergence compared to expected involved the Cronobacter/Enterobacter/Phytobacter/Kosakonia lineages (Fig. 2b, c).Little is known about the natural ecology of these lineages, however they can be isolated from various environmental niches such as soil or water, and strains amongst these genera have been identified as plant-associated or as plant pathogens [29,[63][64][65][66][67][68][69].The overall trend, therefore, was that TCS were more divergent than expected when comparing species adapted to animal hosts to those adapted to environmental niches/plant hosts, but less divergent than expected when comparing within those groups; the TCS of the ecologically flexible Klebsiella lineage was divergent from both groups.Collectively, these data indicate that TCS sequence divergence is disproportionately dictated by the nature of the niches they inhabit when compared to the proteome at large.

Sequence divergence patterns of individual TCS across Enterobacteriaceae
We next examined the evolutionary patterns of individual TCS.First, we compared the mean rates at which individual TCS proteins have diverged across Enterobacteriaceae by averaging the divergence rates from all pairwise combinations of reference strains (Table 1).Consistent with previous observations that have noted coevolution for TCS constituents, we observed a strong positive correlation in the divergence rates of HK and RR partners (Fig. 3a) [32].However, despite the clear co-evolutionary trend across all TCS, there is variability in this relationship from TCS to TCS.Above, we noted that the divergence rates of HK are, on average, ~twofold higher for HK than for RR.This ratio is in this range for many HK-RR pairs, however several TCS exhibit ratios as low as ~1 (similar HK/RR divergence rates), while for other HK-RR pairs this ratio is as high as ~4 (Fig. 3b).As an example to illustrate this variability, the average divergence rate of the HK component of the RcsBC TCS is ~40 % higher than that of the BtsRS TCS, however BtsRS RR is more variable than an average RR, while the RcsBC RR is amongst the most highly conserved of all RR.The factors underlying the relative variability of the HK and RR components of a given TCS are presumably multifaceted.In the case of the RcsBC TCS, the complex architecture of this signalling cascade might be a relevant factor in its atypical HK:RR divergence ratio [70].Overall, these data indicate that HK-RR pairs co-evolve, but that the relative rates of HK and RR divergence varies amongst different TCS.
The most overt observation from our analysis of individual TCS was that the average divergence rates varied drastically from TCS to TCS (Table 1).The most conserved TCS protein, the RR OmpR, was found to be more highly conserved than any of the The expected average divergence rate for a given pair of strains was calculated by multiplying the average divergence rate for those strains across the CG set by a correction factor that accounts for differences in the average divergence rate of CG compared to HKs (c) or RRs (d).The divergence rate compared to expected is expressed as a percentage, such that values exceeding 100 % represent pairs of reference strains whose TCS proteins have diverged more than expected.Matrix is presented as a heat map for visual clarity.
50 proteins in the CG set, with a remarkably low divergence rate of <0.2 AA differences/100 AA.By contrast, the most divergent protein was the HK PmrB with a divergence rate of 41.1 AA differences/100 AA; this rate is more than 200-fold higher than that of OmpR and nearly fivefold higher than the CG average.As expected, and consistent with the variable divergence rates being a product of variable selective pressures on the different TCS, the divergence rates of both HK and RR exhibit strong linear correlations with their average ratio of non-synonymous to synonymous substitutions (dN/dS, see Fig. S1).Notably, the extent to which a given TCS's sequences were conserved tended to be consistent across different species comparisons (Dataset S2).To explore this, we plotted the divergence rate for each HK (Fig. 3c) and RR (Fig. 3d) for the E. coli / S. enterica comparison against its average divergence rate amongst reference strain combinations that do not involve these two species.Because S. enterica and E. coli diverged after the other seven reference strains branched off from one another (Fig. 1a), sequence differences between these two strains should be independent of any differences between the other reference strains.Despite this, for both HKs and RRs, we observed a strong positive correlation (R 2 =0.644/0.694for HK/RR, P<0.0001 for both) between their divergence rates in E. coli and S. enterica and the average divergence rates amongst the other reference strains.The spread of these data indicates that a surprisingly high proportion of TCS proteins fit very well to the linear regression, but that a few TCS deviate from this relationship.This indicates that, in most cases, the extent to which a given TCS has diverged between E. coli and S. enterica is predictive of how much it has diverged across independent branches of the Enterobacteriaceae family.A notable TCS for which this relationship differs is PmrAB, which was a statistical outlier for the both the HK and the RR analyses (Figs 3c, d and S2).In fact, removing PmrAB from this analysis increases the correlation coefficient markedly (R 2 >0.8) for both HKs and RRs (Figs 3c,  d and 2).PmrAB's unusual pattern of evolution is explored further in subsequent analyses presented below.The predictive nature of TCS divergence between different Enterobacteriaceae lineages is not restricted to E. coli/S.enterica, since the same analysis using E. cloacae/H.chinensis as the reference comparison yielded a similarly strong correlation (Fig. S2).This suggests that the *Denotes the number of species in the reference strain set (out of a total of nine) in which the TCS is present.†Divergence rate indicates the average number of AA differences per 100 AAs in all pairwise comparisons of HKs/RRs among all the species in which the HK/RR is present.
dominant factor driving the divergence rates of TCS in our dataset is the nature/function of the TCS itself, rather than unique biological features or evolutionary pressures of the various organisms.

Species-specific evolutionary adaptations of conserved TCS
The analyses above indicate that TCS divergence rates are correlated amongst different lineages, but that certain TCS proteins deviate from this relationship.For example, the extent to which the PmrAB, DcuRS and RstAB TCS have diverged between E. coli and S. enterica differs from what is expected based on their divergence rates in other Enterobacteriaceae (Fig. 3c, d).To identify TCS whose sequences have changed disproportionately in a single reference strain, we analysed how much each TCS protein has diverged in each reference strain compared to what would be expected based on its divergence rates in the other reference strains (Table 2, Dataset S5).These results show that, for the vast majority of TCS proteins in all reference strains, their sequence has diverged at a rate that is similar to the expected value.However, certain TCS proteins have diverged significantly more than expected in specific organisms, which might be indicative of TCS functional adaptations unique to that species (Table 2).One example of this is the RstAB TCS, which was generally more variable than expected throughout the E. coli/S.enterica/C.rodentium clade (Dataset S5).The function of RstAB is not well understood, but in these lineages this TCS has various regulatory roles The extent to which TCS diverge in one branch of Enterobacteriaceae strongly correlates with its divergence in independent branches of this family.For each HK (c) and RR (d) the divergence rate when comparing its E. coli AA sequence to its S. enterica sequence was compared to its average divergence rate across all pairwise combinations of the other seven reference strains using scatterplots.Select individual HKs or RRs that deviate from the strong correlation observed between these two variables are identified.PmrA and PmrB (red text) were both determined to be statistical outliers in these analyses, which is shown in Fig. S2.A similar analysis using E. cloacae/H.chinensis as the reference comparison is also shown in Fig. S2.For all scatterplots, P values (shown within plots) were derived from the linear regression analysis and indicate highly significant positive correlations (i.e. that the correlation coefficients are greater than zero).
related to virulence, and its expression is regulated by the virulence regulator PhoPQ [71][72][73].This suggests that RstAB might have adopted novel roles related to pathogenesis in these lineages, triggering the need for functional adaptions.In S. enterica, RstA, the RR, is amongst the most divergent proteins compared to expected, but its cognate HK, RstB, is not (Table 2, Dataset S5).This is noteworthy because there are significant differences in RstA's regulon in S. enterica and E. coli, and, intriguingly, RstA induces the expression of iron uptake machinery in S. enterica in a PhoPQ-dependent, but PmrB-independent manner [74][75][76].Coupled with the data presented here, this suggests that PmrA might have adapted to fill a novel virulence role that is independent of its cognate HK in S. enterica.

Functional domain-level analysis of TCS evolution
We next explored the evolutionary diversification of Enterobacteriaceae TCSs at the level of their functional domains.In all TCSs, the HK contains a highly conserved kinase domain, and the RR contains a highly conserved receiver domain, but other segments of these proteins are variable.To analyse the functionally and architecturally diverse TCSs in our dataset in a consistent manner, we partitioned their proteins as follows.For RRs, the receiver (REC) domain boundaries were identified (see Methods), and the downstream DNA-binding domain (DBD) was defined as the complete sequence that was C-terminal of this sequence.The DBD, as defined here, therefore also includes the short linker that typically connects the REC and DBD of a RR.HKs were also broken into two domains: (i) the conserved kinase domain spanning from the start of the DHp domain through the end of the CA domain, and (ii) the N-terminal segment of the HK, consisting of everything prior to the first AA of the kinase domain.The architecture and function of this N-terminal segment is variable in different TCSs.In most HKs, however, this region consists of sensing, signal transduction and transmembrane motifs.For simplicity, we refer to this region as the sensing domain in accordance with its predominant function.Because of substantial differences in the architecture of the chemotaxis TCS CheY/CheA relative to the others, it was omitted from this analysis [77].We generated AA divergence rate matrices for each domain of each protein (Dataset S6) and calculated the average divergence rates for each domain of each TCS protein (Fig. 4a).We further analysed the relationship between the divergence rates of each of the six possible combinations for the four functional domains across the different TCSs (Fig. 4b-g).
Overall, the kinase domain was the most variable on average (18.6 AA differences/100 AA), followed by the sensor domain (17.1 AA differences/100 AA) and DBD (11.9 AA differences/100 AA), and the REC domain was the most highly conserved (9.7 AA differences/100 AA).The divergence rates of all pairwise combinations of domains exhibited a statistically significant positive correlation, but the strength of this correlation varied substantially for different combinations, with R 2 values ranging from slightly above 0.5 to nearly 0.9 (Fig. 4b-g).Despite the kinase domain being the most variable overall, the sensor domain exhibited the weakest correlations with the others, which is likely driven (at least in part) by their architectural variability.Given that the sensor and kinase domain are both part of the same polypeptide and are functionally interconnected, we found it interesting that the divergence rate of the sensor domain correlated more strongly with the REC domain of its cognate RR, than it did to its associated kinase domain.Despite the population-level correlations for all domains, inspection of the divergence rates for individual TCS reveals substantial variability from TCS to TCS (Fig. 4a).For example, for RstAB, the divergence rate of its DBD is nearly twice as high as its sensor domain, which is highly atypical and suggests that the output functions of this TCS might have changed more than its sensory functions.By contrast, for NarLX, the reverse situation was observed, where the divergence rate of its sensor domain is nearly sevenfold higher than that of its DBD suggesting that its signal sensing functions might be more evolutionarily variable than its regulatory functions.Another interesting example is the EnvZ/OmpR TCS, where the REC, DBD and sensor domains are all easily the most highly conserved of any TCS with divergence rates between 0 and 2.3 AA differences/100 AA, but its kinase domain has a markedly higher divergence rate of 11.4 AA differences/100 AA.This suggests that any species-specific functional adaptations in this important TCS are likely to map to the cytoplasmic portion of its HK.Although the predominant function of this domain is generally signal transduction, for EnvZ it has been shown that the cytoplasmic region ('kinase domain') is also directly involved in signal sensing, suggesting that the variability we observe here could be reflected in altered signal transduction, signal sensing, or both [78].

Multiple sequence variants of the PmrAB TCS appear to have been independently acquired by different branches of the Enterobacteriaceae family
PmrAB is a canonical TCS composed of the HK PmrB and the RR PmrA, an OmpR-family transcription factor [25].PmrB's periplasmic sensory domain detects signals that include high concentrations of Fe 3+ and a mildly acidic pH, leading it to phosphorylate (activate) PmrA [79,80].PmrA regulates the expression of numerous genes, many of which are involved in the chemical modification of lipopolysaccharide (LPS) [25].LPS modifications can have a profound impact on the cell envelope that, in turn, can have a range of effects, including enabling pathogens to evade the host immune system and heightening resistance to antimicrobial compounds [81].PmrAB has functional connections with TCS such as PhoPQ and QseBC, and it is broadly distributed throughout the Enterobacterales order, including all nine of the reference strains used in this study (Fig. 1c) [82,83].Despite its broad distribution, regulatory integration, and important function, PmrAB stood out in the analyses above as the TCS with the most variable sequence (Table 1), and as having a different divergence pattern than other TCS (Figs 3c, d and 2).When compiling the TCS sequences, we noted that the genes neighbouring pmrAB varied in different lineages.However, it was not immediately clear if the genomic location of pmrAB differed, or if there had been genetic flux at a conserved pmrAB genomic location.To clarify this issue, we mapped the genomic locations of the pmrAB locus in each of our reference strains onto the E. coli genome by identifying conserved genes that flank this locus on either side.This analysis revealed that pmrAB is encoded at three different loci as small genetic islets inserted between highly conserved genes, with each locus represented by three of the reference strains (Fig. 5a).Interestingly, the genetic content of each islet differs, with locus 1 also including pmrC and the arginine decarboxylase system (adi genes), locus 2 including pmrCAB, and locus 3 encoding only pmrAB [84].pmrAB can be found at locus 3 outside of the Enterobacteriaceae family, such as in Y. pestis (Fig. 5a), suggesting this might be the ancestral locus.The mobility of this TCS expands beyond Enterobacteriaceae, as we identified a fourth pmrAB genomic islet location for members of the Pectobacteriaceae and Erwiniaceae families (Fig. S3).
We reasoned that if the different pmrAB genome locations were a result of horizontal acquisition of different alleles, that there would be a connection between the sequences of PmrA/PmrB and their genomic locations that would not be expected if they were vertically inherited from a common ancestor.To examine this, we compared the sequence divergence of PmrAB with that of PhoPQ, which was selected because it is a broadly conserved TCS that is encoded at a consistent genomic location, and which has a sequence divergence pattern that is consistent with that of the CG set.We found that, for both the HKs and the RRs, PmrAB and PhoPQ had similar divergence rates when comparing reference strains that have the same pmrAB genomic location (Fig. 5b, PhoP/PmrA and PhoQ/PmrB comparisons highlighted with green boxes).By contrast, PmrAB exhibited markedly higher divergence rates than PhoPQ when comparing species where pmrAB is encoded at different genomic locations (Fig. 5b, PhoP/PmrA and PhoQ/PmrB comparisons not in green boxes).Furthermore, we found that phylogenetic trees based on PhoP or PhoQ sequences were very similar to the tree generated using the complete core proteomes of the reference strains, consistent with PhoPQ vertical inheritance and genetic drift (Fig 1a and 3).By contrast, phylogenetic trees generated using PmrA and PmrB sequences clustered the reference strains into three clades based on their pmrAB genomic location, and the branching and branch lengths differed from the core proteome tree (Figs 1a and 3).Collectively, these data strongly suggest that the different sequence variants of the PmrAB TCS were independently acquired by different branches of the Enterobacteriaceae family.
To explore potential functional consequences of the sequence differences amongst the pmrAB alleles, we compared their AA sequences focusing on their HK signal sensing domain and the RR DNA-binding domain, regions directly responsible for detecting input stimuli and eliciting a biological response.To analyse PmrA's DNA-binding motif, we took advantage of a previous study that captured the DNA-bound structure of PmrA from K. pneumonia [85].We found that, despite the high levels of PmrA sequence divergence, all of the AA that directly interact with the promoter in this structure are conserved across all nine species (Fig. 5c).This is congruent with observations that the promoter sequences of PmrA-regulated genes are widely conserved across different taxa [25].This suggests that any functional adaptations in PmrA might relate more to its regulatory functions rather than to its DNA recognition.Analysis of PmrB's periplasmic sensor domain revealed that its ExxE motif, required for sensing Fe 3+ , is conserved in all reference strains (Fig. 5d).This is in agreement with findings that Fe 3+ sensing is widely conserved amongst divergent PmrAB systems [79,86,87].However, the 31 AA sensing domains vary considerably amongst the reference strains, particularly between those that encode pmrAB at different genomic locations.Highlighting the allele-specific differences in PmrAB, numerous AA are conserved amongst all reference strains with a given pmrAB locus, but differ in all reference strains whose pmrAB genomic location differs (Fig. 5d, green residues).Importantly, we find that there is an abundance of both positively and negatively charged AA residues in this region, and that these residues vary greatly between the three different pmrAB loci (Fig. 5d).For example, over the 31 AA, there are ten AA differences involve a charged residue between E. coli (locus 1) and P. diazotrophicus (locus 2).This remarkable variation in charged residues is significant given the essential role that ionizable AA play in sensing both Fe 3+ and acidic pH for PmrB [79,80].This suggests that PmrB's sensing properties with respect to these cues (or perhaps others) are likely to differ across the Enterobacteriaceae family.Importantly, PmrAB plays a clinically significant role in cationic antimicrobial peptide antibiotic resistance for organisms that encode pmrAB at each of the three genomic locations [88][89][90][91][92][93][94].Most of what is known about this TCS, however, is based on a few model species, and predominantly from studies in S. enterica.The data presented here suggest that caution should be taken when applying what is known about PmrAB from one Enterobacteriaceae species to another.

DISCUSSION
In this study, we analysed patterns of sequence change in order to investigate the evolution of TCS across the Enterobacteriaceae lineage.The factors that influence the rate of sequence change for a TCS include both positive evolutionary pressures (where mutations that alter the function of a TCS in a manner that confers a competitive advantage are more likely to become fixed in the population) and negative evolutionary pressures (where mutations that reduce the fitness of the bacterium are selected against).With respect to positive selection, higher rates of sequence change would be expected for TCS with specialized functions that vary from species to species, or for TCS that respond to signal(s) whose nature or abundance varies in a niche-specific manner.By contrast, the sequences of TCS with a stable 'housekeeping' function that is conserved throughout the lineage would be expected to be less variable.With respect to negative selection, the extent to which a given TCS influences the overall fitness of the organism is likely an important factor.Even if the actions of a particular TCS are beneficial, many mutations that arise will have a very minor impact on its function, and thus their effects could be insufficient to impart a sufficient fitness disadvantage to prevent those mutations from becoming fixed.If a TCS's sequences are highly conserved over large evolutionary distances, this therefore suggests that even minor functional changes to that TCS impart significant fitness costs.In this context, another factor that could influence how much a TCS protein diverges is the nature of the protein itself (i.e. the proportion of mutations that will influence its function).It is likely that this is a factor underlying the higher divergence rates of HKs compared to RRs, and in the weaker correlations we observe for the (architecturally diverse) sensor domain compared to other domains.However, several factors suggest this is not the predominant driving force underlying the different divergence rates of different TCS, including the strong correlations we observe between the divergence rates of the RRs and their cognate HKs, and the fact that HKs or RRs that have similar sizes, domain architectures and structures, often exhibit markedly different divergence rates.A range of other factors that relate to interactions or communication of the HK/RR pair with other regulatory factors could also influence TCS divergence rates; this would include the complexity of the TCS regulatory cascade (e.g.connector proteins, additional phosphotransfer steps, etc.), the manner in which the TCS is integrated into global regulatory networks, and the need to avoid crosstalk with other TCS.Indeed, recent work has elegantly shown that the potential for crosstalk between TCS appears to have influenced the evolution of the related EnvZ/ OmpR, CpxRA and RstAB TCS [43].However, while this is undoubtedly a bona fide evolutionary pressure for these TCS and others, our data suggest that this is not a major driving force in the overall rate of TCS sequence change, since we found that closely related TCS that presumably face a similar pressure to avoid crosstalk can have very different divergence rates.For example, EnvZ/OmpR was the least divergent of the TCS analysed, whereas RstAB was amongst the most divergent.
The patterns of sequence change observed in this study for different TCS or their constituent proteins or functional domains offer a window into their biology.For example, the sequence of the most conserved TCS protein, the RR OmpR, was 100 % identical across all nine reference strains other than E. cloacae and C. turicensis, which have a single Ser to Ala change near the C-terminus.This extreme level of sequence conservation indicates that EnvZ/OmpR, which regulates cellular responses to osmolarity and acidic pH, plays a vital role throughout this lineage [95].It further suggests that OmpR has a very finely tuned regulatory mechanism that is perturbed by even subtle changes to its sequence.OmpR's regulatory mechanism is noteworthy in several ways including, (i) its recognition of target promoters exhibits limited sequence specificity and DNA structure appears to play an important role, (ii) it has been proposed that OmpR adopts different structures when bound to different target promoters in a manner that leads to differential regulation of different target genes under different conditions, (iii) its activation by EnvZ is complex and has been proposed to also involve non-canonical, phosphorylation-independent activation [95][96][97].Based on the data presented here, these features (and perhaps others) have imposed very stringent sequence requirements on OmpR, such that deviations from the ancestral sequence are strongly selected against and purified from the population.Interestingly, the sequence of the cytoplasmic portion of EnvZ, which is involved in both signal sensing and signal transduction, was observed to be much more variable than the rest of this TCS [78,95].This suggests that this region might be more flexible to sequence changes without incurring perturbations to function, or that evolutionary adaptations to this TCS have been concentrated on the functions carried out by this domain.
On the opposite end of the spectrum, we find that PmrA and PmrB were the most divergent RR and HK of the proteins analysed, and we provide evidence that this was influenced by the horizontal acquisition of different pmrAB alleles by different branches of Enterobacteriaceae.PmrAB has a well-established role in regulating LPS modifications, and the broad distribution of this TCS suggests that this is a function that is important throughout this family [25].However, it is likely that the nature of the environmental conditions would trigger the need for LPS modifications, as well as the specific response required to react to these conditions, would vary for organisms with different lifestyles and that occupy different niches.This notion is supported by studies that have identified important differences in how PmrAB is integrated into the regulatory networks of different species in this family [86,98].Our data indicate that PmrB's 31 AA periplasmic domain, which is well established to be essential for signal sensing in a manner that hinges on ionic interactions, is highly variable amongst different members of this family and contains a disproportionate number of charged amino acids that differ from species to species [79,80].There is limited data available concerning how PmrB signal sensing differs in different species, but Zn 2+ has been shown to serve as an activating signal for E. coli PmrB but not for S. enterica PmrB, indicating that its signal sensing properties are indeed evolutionarily variable [79,99].The highly divergent PmrB sensing domains of different Enterobacteriaceae suggest that different lineages are likely to have different signal-sensing properties, such as responding to different cues or responding to a given cue over a different concentration range.The sequence diversity of PmrAB in Enterobacteriaceae has direct biomedical relevance since point mutations to PmrA or PmrB are commonly observed to be the source of clinical resistance to polymyxin-family antibiotics within this family [100].
In addition to TCS-specific trends that apply across the various lineages, the data above also highlight organism-level evolutionary trends, and we identify individual TCS proteins that are more divergent than expected in a given species.In the case of S. enterica RstA that we highlighted above, we hypothesized that its high divergence rate might be reflective of expanded or altered functionality in this species in light of its regulation by PhoPQ (which is known to play an expanded role in Salmonella) and evidence that its function differs in Salmonella compared to other species [74].However, the biological context underlying why specific TCS proteins have diverged more than expected in a given species could be variable.For example, in K. variicola, both constituents of the QseF/QseE TCS were found to be more divergent than expected.This TCS is known to be involved in regulating metabolic processes that feed into cell-envelope biosynthesis, however we currently lack a complete understanding of QseEF and its cellular function [101].While it is possible that this TCS has taken on an expanded or altered biological role in K. variicola that has required functional adaptations, it is also possible that this TCS is less important for fitness in this lineage and thus mutations that impact its function are less likely to be purified by natural selection.As with other findings presented here, experimental investigation will be required in order to provide functional and mechanistic context to the evolutionary trends observed.
In summary, this study has identified patterns of TCS sequence divergence across the Enterobacteriaceae family as well as individual TCS that deviate from these trends.These findings have important implications for the broader subject of how TCSs evolve, and provide insight into the biology of individual TCSs and their evolutionary trajectories within the Enterobacteriaceae family.

Fig. 2 .
Fig. 2. Population-level trends in TCS divergence across Enterobacteriaceae.(a) The average divergence rate (AA differences/100 amino acids) across all proteins analysed and all pairwise combinations of reference strains for HKs, RRs, and the conserved gene (CG) set.Error bars represent the standard deviation.The statistical significance of the differences in divergence rates of HK and RR compared to the CG was analysed using a oneway ANOVA test; ** indicates P<0.001, n.s.s.indicates not statistically significant.(b and c) Matrices showing the divergence rates averaged across all HKs (b) or RRs (c) compared to the expected value for all combinations of the reference strains.The expected average divergence rate for a given pair of strains was calculated by multiplying the average divergence rate for those strains across the CG set by a correction factor that accounts for differences in the average divergence rate of CG compared to HKs (c) or RRs (d).The divergence rate compared to expected is expressed as a percentage, such that values exceeding 100 % represent pairs of reference strains whose TCS proteins have diverged more than expected.Matrix is presented as a heat map for visual clarity.

Fig. 3 .
Fig.3.Trends in the sequence divergence of individual TCS across Enterobacteriaceae.(a) There is a strong positive correlation in the AA sequence divergence rates of the HK and RR constituents of a TCS.For each TCS, the average divergence rates across all pairwise combinations of reference strains were calculated for its HK and RR.Scatterplot shows the average divergence rate (AA differences/100 AA) for the HK compared to its cognate RR.The identities of select TCS with atypically low or high divergence rates are shown.(b) Bar graph showing the ratio of the average divergence rates of each HK compared to its cognate RR. (c and d) The extent to which TCS diverge in one branch of Enterobacteriaceae strongly correlates with its divergence in independent branches of this family.For each HK (c) and RR (d) the divergence rate when comparing its E. coli AA sequence to its S. enterica sequence was compared to its average divergence rate across all pairwise combinations of the other seven reference strains using scatterplots.Select individual HKs or RRs that deviate from the strong correlation observed between these two variables are identified.PmrA and PmrB (red text) were both determined to be statistical outliers in these analyses, which is shown in Fig.S2.A similar analysis using E. cloacae/H.chinensis as the reference comparison is also shown in Fig.S2.For all scatterplots, P values (shown within plots) were derived from the linear regression analysis and indicate highly significant positive correlations (i.e. that the correlation coefficients are greater than zero).

Fig. 4 .
Fig. 4. Domain-level analysis of TCS sequence divergence in Enterobacteriaceae.(a) Average divergence rates of individual functional domains for each TCS across the reference strains.For both the HK and the RR component of each TCS, two major functional domains were identified as outlined in the main text; sensor (Sen) and kinase (Kin) domains for the HK, and receiver (REC) and DNA binding domains (DBD) for the RR.Heat map chart shows the average divergence rate (AA differences/100 AA) across all pairwise combinations of reference strains.'Ave' indicates the average value for the four domains of that TCS.(b-g) Scatterplots of the average divergence rate for each TCS comparing one functional domain [as described in (a)] to another in a pairwise fashion.Each plot represents a different combination of domains, as indicated.For all scatterplots, linear regression analysis yielded a P<0.001 that the correlation coefficient was greater than zero, indicating a statistically significant positive correlation.

Fig. 5 .
Fig. 5. Evidence that different sequence variants of the PmrAB TCS were independently acquired by different Enterobacteriaceae lineages.(a)The pmrA/pmrB locus is found in three distinct genome locations amongst the strains analysed in this study, referred to as locus 1-3.Diagram depicts these genome locations, showing their sites of insertion relative to the E. coli genome.Green arrows represent the pmrC/pmrB/pmrA genes (pmrC is absent from locus 3 strains), black arrows represent broadly conserved genes that flank the pmrAB locus that were used to map its location, and white arrows represent additional genes that are part of pmrAB locus 1. PmrAB is found in at locus 3 in certain species outside the Enterobacteriaceae family, including Y. pestis.(b) The unusually high level of sequence variation observed for PmrA and PmrB is only observed between species where pmrAB is found at different genomic locations.Diagram shows pairwise amino acid sequence comparisons (represented as heat maps) for PmrAB, as well as for a control TCS, PhoPQ.The extent of sequence variation between the RRs PhoP and PmrA and between the HKs PhoQ and PmrB is similar for comparisons where both strains carry pmrAB at the same genome location (green boxes).However, PmrA and PmrB show markedly more sequence variation than PhoP and PhoQ amongst strains where the pmrAB genome location differs (comparisons not in green boxes).(c) The DNA-binding amino acid residues of PmrA are highly conserved despite substantial PmrA sequence variation.Multiple amino acid sequence alignment of the C-terminal segment of PmrA that has been observed to directly contact target promoter DNA for the nine reference strains used in this study[85].Amino acid residues previously observed to make direct contact with the DNA of a PmrA target promoter are shown in red.(d) Multiple AA sequence alignment of the periplasmic sensing domain of PmrB across the reference strain set.Red resides represent the conserved 'ExxE' motif that is directly involved in iron binding/sensing[79,86,87]. Green residues represent amino acid residues that are conserved amongst the PmrB encoded at the same locus, but different in all PmrB proteins found at different genomic loci.The large number of green residues suggests that the sensing properties of different PmrAB TCS might vary in a manner that correlates to genome location.Shorthand used for strains in all panels (e.g.'Ec') is described in panel (a).Additional data relevant for this figure can be found in Fig.S3.

Table 1 .
List of two-component systems (TCS) analysed in this study, from least divergent to most

Table 2 .
List of response regulators (RRs) and histidine kinases (HKs) whose sequences have diverged significantly more in a particular reference strain than expected Percent of expected indicates how much the sequence of that TCS protein in that reference strain deviates from its sequence in the other reference strains compared to what would be expected based on sequence divergence patterns in the CG set and how variable the sequence of that TCS protein is on average.Only TCS proteins with a value of at least 110 % of expected are listed.See methods for more information.†Only TCS proteins with a P<0.05 are listed.See methods for more information. *