Diverse domain architectures of CheA histidine kinase, a central component of bacterial and archaeal chemosensory systems

ABSTRACT Chemosensory systems in bacteria and archaea are complex, multi-protein pathways that enable rapid cellular responses to environmental changes. The CheA histidine kinase is a central component of chemosensory systems. In contrast to other histidine kinases, it lacks a sensor (input) domain and utilizes dedicated chemoreceptors for sensing. CheA is a multi-domain protein; in model organisms as diverse as Escherichia coli and Bacillus subtilis, it contains five single-copy domains. Deviations from this canonical domain architecture have been reported; however, a broad genome-wide analysis of CheA diversity is lacking. Here, we present the results of a genomic survey of CheA domain composition carried out using an unbiased set of thousands of CheA sequences from bacteria and archaea. We found that four out of five canonical CheA domains comprise a minimal functional unit (core domains), as they are present in all surveyed CheA homologs. The most common deviations from a classical five-domain CheA architecture are the lack of a P2/CheY-binding domain, which is missing from more than half of CheA homologs, and the acquisition of a response regulator receiver (CheY-like) domain, which is present in ~35% of CheA homologs. We also document other deviations from classical CheA architecture, including bipartite CheA proteins, domain duplications, and fusions, and reveal that phylogenetically defined CheA classes have predominant domain architectures. This study lays a foundation for a better classification of CheA homologs and identifies targets for experimental investigations.

IMPORTANCE We found that in contrast to the best-studied model organisms, such as Escherichia coli and Bacillus subtilis, most bacterial and archaeal species have a CheA protein with a different domain composition. We report variations in CheA architecture, such as domain duplication and acquisition as well as class-specific domain composition. Our results will be of interest to those working on signal transduction in bacteria and archaea and lay the foundation for experimental studies.

flagella and cell tumbling. When CheA activity is reduced by attractants, a decrease in the concentration of phosphorylated CheY promotes counterclockwise rotation of the flagella and smooth swimming. The CheR methyltransferase and the CheB methylesterase that covalently modify chemoreceptors comprise an adaptation pathway (9). The E. coli system also includes the CheZ phosphatase, which dephosphorylates CheY, resulting in signal termination (10).
Chemosensory systems homologous to the E. coli pathway have been identified and experimentally studied in other bacterial and archaeal species, where they were shown to control not only flagellar motility but also type IV pili (Tfp)-based motility, biofilm formation, cell-cell interaction, biosynthesis, development, and other cellular functions (11-14). By regulating these vital processes, chemosensory systems ultimately have a strong impact on bacterial behavior, lifestyle, and interactions with hosts and between species (15,16). Comparative genomic analysis suggests that approximately half of bacterial species contain chemosensory pathways (17). In terms of component design, chemosensory systems are the most complex mode of signal transduction in bacteria (18). Although the number of components varies from organism to organism and system to system, four core proteins are present in every system: chemoreceptors, CheA, CheW, and CheY (17). CheA is an ideal marker for studying the diversity of chemosensory systems because (i) it is a large, multi-domain protein (5), (ii) it is present in every system (17), and (iii) there is always only one copy of CheA per system, which is not the case for the other three core proteins (17).
The current knowledge of the structure and function of CheA primarily comes from work on E. coli and Thermotoga maritima (19), which are model organisms for functional and structural studies, respectively. CheA proteins from both organisms consist of five domains that were initially labeled P1 through P5 (20) and were later recognized as members of conserved domain families (21) (Fig. 1; Table S1). The leading protein domain database Pfam (22), which is now a part of the InterPro resource (23), contains profile models for the five canonical CheA domains that show their relationship with other protein families and allow for their identification in genomic data sets (Table S1).
The N-terminal P1 is the histidine-containing phosphotransfer domain (Hpt); it is responsible for the movement of the phosphate group from its substrate histidine residue to the CheY and CheB response regulators (25,26). P2 is a docking domain that binds CheY and CheB (20,27). It is not required for phosphotransfer, but it greatly accelerates its rate (20). The P3 dimerization domain (H-kinase_dim) mediates CheA dimer formation (5). The P4 histidine kinase domain (HATPase_c: histidine kinase-, DNA gyrase B-, and HSP90-like ATPase) binds ATP (28). The C-terminal P5 domain (CheW) is homologous to the CheW protein (5,21); it couples CheA to CheW and chemoreceptors, enabling the formation of a chemosensory signaling complex (29).
A phylogenomic study classified CheA proteins into more than a dozen classes based on sequence similarity and genomic context (17). Most of these classes were predicted to govern flagellar motility and were termed F1 through F17, whereas two classes were predicted to control type IV pili-based motility (termed the Tfp class) and other (alternative) cellular functions (termed the ACF class) (17). In that study, a comparative CheA analysis was performed using a unit of three domains, dimerization (H-kinase_dim), histidine kinase (HATPase_c), and CheW, because they were detected in all CheA sequences and were always in the same configuration, which was not the case for the other two domains, Hpt and P2/CheY-binding (17). Interestingly, structural studies also resulted in defining the last three domains as a core unit (5), whereas structures of the Hpt domain and P2/CheY-binding domain were solved separately because these domains are separated by long linker regions in both E. coli and T. maritima (27).
This five-domain arrangement is considered the canonical domain architecture of CheA because it was reported in key model organisms including not only E. coli and T. maritima but also Salmonella enterica (30), Bacillus subtilis (31), and the archaeon Halobacterium salinarum (13). However, subsequent studies with other bacterial species revealed some variation in CheA domain composition. For example, the CheA proteins of Helicobacter pylori and Campylobacter jejuni do not contain a P2/CheY-binding domain but have a response regulator receiver domain at the C-terminus (32). The Azospirillum brasilense and Azorhizobium caulinodans CheA proteins contain two CheW domains and a response regulator receiver domain (33,34). CheA of the Tfp class in Pseudomonas aeruginosa has a response regulator receiver domain at its C-terminus, lacks a P2/CheY-binding domain, and has multiple Hpt domains at the N-terminus (35). Although these findings revealed substantial and intriguing variations in the domain composition and arrangement of CheA, the extent of this diversity is unknown. In this study, we aimed to fill this gap by projecting our current understanding of CheA structure and function onto the current genomic landscape.

Constructing a representative set of CheA protein sequences
The number of sequenced microbial genomes is increasing; there are currently almost 1.4 million bacterial genomes in the NCBI database (36). However, this impressive number of sequenced genomes does not reflect phylogenetic diversity, as 90% of these genomes belong to very few (out of more than a hundred) phyla, thus making this data set heavily biased and leaving most bacterial phyla highly underrepresented. To address this bias, we used a much more balanced set of representative genomes from the Genome Taxonomy Database (GTDB) (37) for our CheA search. The current version, GTDB v.95, contains a representative set of 16,859 RefSeq (38) genomes spanning 50 bacterial and 18 archaeal phyla. For rapid and convenient CheA identification and classification, we restricted the data set to the genomes that were available in the MiST 3.0 database (39), resulting in 14,796 genomes. Among those genomes, 7,900 did not have any identifiable CheA protein. Available archaeal genomes are not as numerous; thus, we collected CheA sequences from all archaeal genomes

The histidine kinase domain (HATPase_c)
As expected, this domain is present in every CheA homolog (Fig. 3B), as it defines its key physiological role. It is the most conserved of all CheA domains, and no substantial size variation or any duplications of this domain were found in our data set (Fig. 3B).

The dimerization domain (H-kinase_dim)
H-kinase_dim is a CheA-specific version of the histidine kinase dimerization domain. We found that it always occurs in combination with the histidine kinase and CheW domains, as was reported previously for a much smaller data set (17). The dimerization domain is poorly conserved: in many cases, it was not detected by HMMER (41), an automated sequence-to-profile search tool implemented in the MiST database, but upon further investigation we were able to identify it using HHpred, a more sensitive profile-to-profile search tool (42), as a match to a Pfam H-kinase_dim profile or to its structural model (PDB ID 6Y1Y) with probabilities above 90%.
The dimerization domain in CheA of the model organisms E. coli and B. subtilis is small (~60 a.a.). It is formed by two helices (~26-28 a.a. each) connected by a short loop. A recent study revealed an extended dimerization domain in Spirochetes (43). In CheA of Treponema denticola, an additional ~50 residues extend the length of the dimerization helices by approximately twofold (PDB ID 6Y1Y). It was suggested that the extended H-kinase_dim domain might enable stronger dimerization of CheA molecules (43).
While most of the analyzed CheAs in our data set had the classical short (~60-70 a.a.) H-kinase_dim domain, many sequences contained extended versions (>70 a.a.). We found a large range of length variation in the extended dimerization domain, with the longest being up to three times the length of the classical one (170 a.a.). The longest dimerization domains were found in CheA from Chloroflexota and Cyanobacteria belonging to the Tfp system. Thus, it appears that this region of CheA is subject to various duplications that substantially elongate the dimerization helices.

CheW domain
This domain is found exclusively in chemosensory systems (17). CheW is one of the core domains in CheA, as it was detected in 100% of all CheA sequences (Fig. 3B). Similarly to the histidine kinase domain, the CheW domain is highly conserved, and it was detected by HMMER in every CheA sequence from our data set. However, in contrast to the kinase domain, we detected instances of its duplication: approximately 8% of CheA sequences in our data set contained two CheW domains (Fig. 3B). Notably, almost all CheAs with two CheW domains belong to the F5 class, which is widely distributed in bacteria (Fig. 2). Two CheW domains were also found in some CheA proteins from class F1 and one CheA from class F10. This distribution suggests that the appearance of two CheW domains is a result of independent duplication events.
In several CheA sequences (for example, in the Desulfobacteraceae family), three CheW domains were identified by HMMER. Upon closer inspection, we concluded that only two CheW domains are present in these proteins. A 50-a.a. insertion in the middle of the N-terminal CheW domain splits it into two parts, thus "misleading" HMMER into identifying two CheW domains instead of one. On the other hand, in two organisms from Chloroflexota (e.g., Thermomicrobium roseum), CheA proteins of the F1 class contain three true CheW domains, with one of them separated from the other two CheW domains by a CheY-like domain (Fig. 3A).
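The split-domain artifact described above can be screened for automatically. The Python sketch below is a minimal illustration and not the procedure used in this study. It assumes that each hit carries both sequence and profile (HMM) coordinates, as HMMER domain tables report, and it merges consecutive hits to the same model when the second hit continues the profile roughly where the first one stopped. The `Hit` class, the `merge_split_hits` function, both thresholds, and all coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    model: str       # profile name, e.g. "CheW"
    seq_start: int   # 1-based start on the protein sequence
    seq_end: int     # 1-based end on the protein sequence
    hmm_start: int   # first matched position in the profile
    hmm_end: int     # last matched position in the profile


def merge_split_hits(hits, max_insert=80, max_hmm_overlap=10):
    """Merge consecutive hits that look like two halves of one domain.

    Two neighbouring hits are merged when they match the same model, are
    separated on the sequence by at most `max_insert` residues, and the
    second hit starts in the profile near or after the end of the first.
    A genuine duplication restarts near profile position 1 and is kept.
    """
    merged = []
    for hit in sorted(hits, key=lambda h: h.seq_start):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.model == hit.model
            and 0 <= hit.seq_start - prev.seq_end - 1 <= max_insert
            and hit.hmm_start >= prev.hmm_end - max_hmm_overlap
        ):
            merged[-1] = Hit(prev.model, prev.seq_start, hit.seq_end,
                             prev.hmm_start, hit.hmm_end)
        else:
            merged.append(hit)
    return merged


# Illustrative coordinates only: one CheW split by a 50-residue insertion,
# followed by a second, complete CheW domain.
example = [
    Hit("CheW", 500, 560, 1, 62),
    Hit("CheW", 611, 690, 60, 136),
    Hit("CheW", 720, 850, 2, 135),
]
domains = merge_split_hits(example)
```

With these made-up coordinates, the first two hits are joined into a single CheW domain and the third is kept as a separate copy. Any real use would need thresholds tuned to the profile length, followed by manual inspection.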
CheA proteins with two CheW domains have been experimentally studied in the alphaproteobacteria Rhodospirillum centenum (44), A. brasilense (33), and Caulobacter crescentus (14) (all belong to the F5 class); however, none of those studies specifically explored the function of the two CheW domains. A more recent study of the CheA with two CheW domains from A. caulinodans (also belonging to the F5 class) showed that the strain in which CheW2 was deleted together with the C-terminal response regulator receiver domain was just as defective in chemotaxis as a ΔcheA mutant, suggesting that both CheW domains are involved in controlling chemotaxis (34). The presence of more than one CheW domain should significantly change the classical arrangement of bacterial signaling arrays (45); structural studies illuminating the contribution of additional CheW domains would be productive.

The phosphotransfer domain (Hpt)
The histidine phosphotransfer domain is responsible for the phosphorylation of the aspartate residues in response regulator proteins CheY and CheB. This domain was found in only 98.3% of the analyzed CheA sequences, thus initially raising a question of whether it is truly a core domain of CheA. Upon further investigation, we found that CheA sequences in which no Hpt domain was identified fall into two categories. First, in some instances the Hpt domain was missing because of sequencing and/or assembly errors, as these genes were located at the end of the contig. Second, in some cases in which Hpt was missing, we found the Hpt domain encoded as a separate gene in the same operon. Experimental studies showing that the Hpt domain liberated from the rest of the kinase is fully functional (46), and the location of the gene encoding Hpt in the same operon as the rest of CheA, strongly suggest a bipartite system, an arrangement which is not uncommon in bacterial signal transduction (47).
A unique case of partitioning CheA functions between two proteins is seen in Cereibacter (Rhodobacter) sphaeroides, which has two of its four cheA genes, namely cheA3 and cheA4, located in the same operon (Fig. 3) (48). CheA4 consists of only the three core domains: dimerization, kinase, and CheW, whereas CheA3 contains only the Hpt and CheW domains separated by a 794-a.a. region with no identifiable domains (49). Both proteins localize with the cytoplasmic chemoreceptor array and act together as a single functional CheA to control the flagellar motor (48,49). We were intrigued by the lack of domains in this very long, functionally important region of CheA3 and performed sensitive profile-profile searches using HHpred that confidently identified the following domains in this region: two more Hpt domains (95.94% and 97.40% probability), a P2 domain (96.98%), an H-kinase_dim domain (98.14%), and a HATPase_c domain (97.94%). Sequence alignment showed that the newly identified histidine kinase domain in CheA3 is missing one full alpha helix and parts of the two helices on either side and does not contain all the residues necessary for Mg2+ and ATP binding. The AlphaFold2 (24) model also predicts the histidine kinase domain with missing helices. Thus, we conclude that it is not functional, which agrees with the published data showing a lack of autophosphorylation by CheA3 (48). The AlphaFold2 model also shows three Hpt domains. However, in contrast to the first Hpt domain, which is structurally intact, contains a conserved histidine (in a position corresponding to His-48 in E. coli CheA), and was experimentally shown to be phosphorylated in vitro (50), the second and third Hpt domains have structural deviations and lack histidine in a conserved position. Structural deviations are also seen in the P2 domain of CheA3. Taken together, these observations suggest that CheA3 contains structurally modified and largely nonfunctional domains between its N-terminal Hpt and C-terminal CheW domain. It is likely that CheA3 was initially fully functional, but upon duplication of its core region, which gave birth to CheA4, the corresponding domains of CheA3 lost their function due to redundancy while maintaining their basic structure. These findings also provide an additional explanation for why CheA3 and CheA4 work together as a single unit. Twenty other genomes within the Rhodobacteraceae have CheA3 and CheA4 orthologs in similar gene neighborhoods, indicating that this unique CheA system emerged early during the evolution of this family.
CheA homologs with multiple Hpt domains were originally described in chemosensory systems controlling twitching motility in Synechocystis PCC6803 (51) and P. aeruginosa (52). Several homologous CheA proteins with multiple Hpt domains were subsequently identified in genomic studies (17,21). In our data set, 18% of CheA sequences contain two or more Hpt domains. Interestingly, more than 90% of those also contain a response regulator receiver domain at the C-terminus. Approximately 93% of the CheA proteins with multiple Hpt domains belong to the ACF and Tfp classes. More than half of the CheAs with multiple Hpt domains contain five or more copies, with the maximum number (fourteen) detected in the Tfp CheA of the gammaproteobacterium Oleiphilus messinensis (Fig. 3A).
The defining feature of the Hpt domain is a conserved histidine residue, which serves as a phosphorylation site. However, some of the Hpt domains in multi-Hpt CheA proteins lack this site. For example, ChpA (CheA of the Tfp class) from P. aeruginosa was reported to have a total of eight Hpt domains, with the conserved histidine only present in six of them (35). The flow analysis of the phosphoryl group showed that in vitro there is no phosphorylation of Hpt domains lacking the conserved histidine, and the function of these domains is yet to be established (35). Our analysis revealed that ChpA has one additional Hpt domain without a conserved histidine (Fig. 4). Examination of the ChpA structure predicted by AlphaFold (UniProt ID Q9I696) revealed that the newly identified Hpt domain consists of four helices (instead of the classical five), similarly to the C-terminal Hpt domain, which does contain a conserved histidine (Fig. 4). Additionally, between Hpt1 (N-terminal) and the newly identified Hpt2, AlphaFold predicted another domain, which structurally resembles Hpt but is not recognized as such by HHpred. The role of Hpt domains lacking a conserved phosphorylation site is yet to be determined.

Auxiliary domains of CheA
While the four core domains are the minimum requirement for CheA function, many CheA proteins contain additional domains. For example, the classical five-domain CheA protein in E. coli has a P2 domain (Fig. 1), which promotes CheY binding.

P2/CheY-binding
The Pfam database has two domain models, P2 and CheY-binding, that belong to the same clan (Table S1). The existence of two models for this domain is likely due to substantial sequence variability: this domain is much less conserved than any of the CheA core domains (21). The P2 model was developed using CheA from T. maritima, the model organism for structural biology, and related organisms (e.g., B. subtilis), whereas the CheY-binding model was developed using CheA from E. coli, where its role was studied experimentally. However, the structures of these versions of the P2 domain match closely (27); thus, we refer to this domain as P2/CheY-binding. Initially, this domain was identified in less than 20% of all CheA sequences in our data set (detected by HMMER as P2 in 1,960 sequences and as CheY-binding in 873 sequences). We then used sensitive profile-profile searches with HHpred and identified a total of 6,519 sequences with P2/CheY-binding domains. Even so, more than 50% of CheA sequences in our data set lack this domain (Fig. 3B). The lack of a P2/CheY-binding domain is further supported by AlphaFold models that show an extended unstructured region between the Hpt and dimerization domains in sequences where no P2/CheY-binding domain was identified by HHpred. On the one hand, this was a surprising finding, because the best-studied CheA proteins from the model organisms E. coli, S. enterica, B. subtilis, and T. maritima all contained this domain. On the other hand, it was known that the CheA proteins in C. jejuni and H. pylori do not have this domain, and even in E. coli it is not essential for the key CheA function, phosphotransfer to CheY (53). Deletion of P2 results in much slower phosphotransfer rates and therefore impaired chemotaxis ability (20,53); however, overexpression of the Hpt domain could correct for the lack of a P2 domain (20). Thus, the fact that the P2/CheY-binding domain is dispensable may not be surprising after all.
The vast majority of CheA proteins contain only one copy of the P2/CheY-binding domain, although we detected some sequences with duplications of this domain (Fig. 3). Notably, duplication of this domain occurred in the common ancestor of Borreliales (Spirochaetota phylum), as all members of this order have CheA proteins with a duplicated P2 domain. Phyletic distribution of other CheA proteins with a duplicated P2/CheY-binding domain suggests several independent duplication events.

Response regulator receiver domain (CheY-like)
Together with the histidine kinase domain, the response regulator receiver domain comprises the essential core of bacterial two-component signal transduction systems (54). Current domain databases define it as a superfamily (Pfam accession CL0304 termed CheY-like) containing several families, one of which has the more general name of response regulator receiver domain (Pfam accession PF00072 termed Response_reg). In the model chemosensory systems of E. coli and B. subtilis, this domain is present in two response regulators, CheY and CheB. However, in other homologous systems, for example in C. jejuni and H. pylori, it was also found as a component of CheA. The function of the CheY-like domain in CheA homologs has been studied experimentally in several organisms, and its common role appears to be that of a phosphate sink (55), as shown in H. pylori (56). In P. aeruginosa ChpA, the CheY-like domain was shown to potentially function as a phosphate sink and/or a source of phosphoryl groups for two of the Hpt domains that do not have a conserved histidine residue (35). In A. caulinodans CheA, the CheY-like domain does not seem to function as a phosphate sink, but it is necessary for the dephosphorylation of the Hpt domain (34). In other systems, the role of this domain has not been studied in detail, but it is known to be important for the proper function of the CheA kinase (33,34,57,58). For example, the disruption of the CheY-like domain of R. centenum CheA (CheA1) eliminated chemotaxis and phototaxis (34,58).
We have identified the CheY-like domain in approximately 34% of analyzed CheA sequences (Fig. 3). The majority of those belong to four CheA classes: F3, F5, ACF, and Tfp. Additionally, several sequences belong to F1 and F7, whereas the CheAs from all other flagellar classes do not have this domain. In most cases, the CheY-like domain was at the C-terminus of CheA; however, we found several cases in which this domain was located at the N-terminus (Fig. 3A). While the majority of the CheA sequences contained only one CheY-like domain (Fig. 3B), some had two, and two sequences, including the Tfp class CheA from the cyanobacterium Spirulina major, contained three CheY-like domains (Fig. 3B).
To test the hypothesis that the main function of the CheY-like domain is as a phosphate sink, as shown for H. pylori CheA (56), we analyzed the presence of the dedicated phosphatases CheZ (10), CheC, and CheX (59) in genomes that had a single CheA protein containing the CheY-like domain. Most bacterial chemosensory systems employ either CheZ or CheC/CheX type phosphatases (17). Thus, we argue that the absence of phosphatase genes in genomes containing a single CheA with a CheY-like domain would support the phosphate sink role for this domain. Indeed, we found that 93% of such genomes (811 out of 871) lacked chemosensory phosphatases.
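The logic of this test can be written down compactly. The following Python sketch is an illustration under stated assumptions rather than the pipeline used in this study: the genome records, the `phosphatase_survey` function, and the toy data are hypothetical. Domain and family names follow those used in the text, with the receiver domain labeled by its Pfam name, Response_reg.

```python
PHOSPHATASES = {"CheZ", "CheC", "CheX"}


def phosphatase_survey(genomes):
    """Count genomes relevant to the phosphate sink hypothesis.

    `genomes` is a list of dicts with two keys (an assumed layout):
      "chea"     - list of CheA architectures, each a list of domain names
      "families" - set of protein families detected in the genome
    Returns (n_selected, n_lacking), where selected genomes have exactly one
    CheA that carries a receiver domain and lacking genomes among them
    encode none of the dedicated phosphatases.
    """
    selected = [
        g for g in genomes
        if len(g["chea"]) == 1 and "Response_reg" in g["chea"][0]
    ]
    lacking = [g for g in selected if not (PHOSPHATASES & g["families"])]
    return len(selected), len(lacking)


# Toy input, not real data.
core = ["Hpt", "H-kinase_dim", "HATPase_c", "CheW"]
toy_genomes = [
    {"chea": [core + ["Response_reg"]], "families": {"CheW", "CheY"}},
    {"chea": [core + ["Response_reg"]], "families": {"CheY", "CheZ"}},
    {"chea": [core + ["Response_reg"], core], "families": {"CheY"}},
    {"chea": [core], "families": {"CheY", "CheC"}},
]
n_selected, n_lacking = phosphatase_survey(toy_genomes)

# The proportion reported in the text, recomputed from its own counts.
reported_percent = round(100 * 811 / 871)
```

In the toy input, only the first two genomes are selected and only the first lacks all three phosphatases. The last line simply confirms that 811 of 871 rounds to 93%.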

CheC/CheX
CheC and CheX are protein phosphatases that dephosphorylate CheY (59). Both phosphatases are part of the Pfam CheC-like clan (CL0355) and have similar structural topologies (59). In our data set, we found CheC/CheX fusions with F1 CheAs in the Leptospirae class of spirochetes. Given that these domains dephosphorylate CheY, it is possible that such CheA-CheC/CheX fusions have a dual kinase/phosphatase function and are able to phosphorylate and dephosphorylate CheY.

Domain co-occurrence and class-specific domain composition
In addition to the invariable co-occurrence of the four core domains, we noticed the following trends. First, more than 90% of CheA protein sequences containing multiple Hpt domains also had a CheY-like domain. This trend is predominant in the Tfp class (94%), but it is also found in the ACF class. Second, the majority of CheA homologs that contain the CheY-like domain do not have the P2/CheY-binding domain. Finally, domain architectures are generally class-specific (Fig. 5). For example, all CheA sequences from class F4 contain three Hpt domains, and 99% of CheA sequences from the F5 class have a duplicated CheW domain.
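Co-occurrence trends of this kind are conditional frequencies over domain architectures. The Python sketch below shows one minimal way to compute them, assuming each architecture is available as an ordered list of domain names. The function names and the toy architectures are hypothetical and do not reproduce the data set analyzed here.

```python
def fraction_with(architectures, has_condition, has_outcome):
    """Fraction of architectures meeting `has_condition` that also meet
    `has_outcome`. Returns None when no architecture meets the condition."""
    selected = [a for a in architectures if has_condition(a)]
    if not selected:
        return None
    return sum(1 for a in selected if has_outcome(a)) / len(selected)


def multi_hpt(arch):
    # Two or more Hpt copies in one protein.
    return arch.count("Hpt") >= 2


def has_receiver(arch):
    # Receiver (CheY-like) domain, labeled by its Pfam name.
    return "Response_reg" in arch


def lacks_p2(arch):
    return "P2" not in arch


# Toy architectures, not real data.
toy = [
    ["Hpt", "Hpt", "Hpt", "H-kinase_dim", "HATPase_c", "CheW", "Response_reg"],
    ["Hpt", "Hpt", "H-kinase_dim", "HATPase_c", "CheW"],
    ["Hpt", "P2", "H-kinase_dim", "HATPase_c", "CheW"],
    ["Hpt", "H-kinase_dim", "HATPase_c", "CheW", "Response_reg"],
]
receiver_given_multi_hpt = fraction_with(toy, multi_hpt, has_receiver)
no_p2_given_receiver = fraction_with(toy, has_receiver, lacks_p2)
```

Restricting `architectures` to the members of one class before calling `fraction_with` would give class-specific values of the same kind.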

Conclusions
CheA homologs have four core domains: (i) a phosphotransfer domain, which can be present in multiple copies and, occasionally, resides as a separate gene in the same gene neighborhood; (ii) a dimerization domain of a variable length; (iii) a histidine kinase domain, which is always present in a single copy; and (iv) a CheW domain, which is duplicated in some homologs. The P2/CheY-binding domain, which enhances phosphotransfer, and the CheY-like domain, which likely serves as a phosphate sink, are the most common auxiliary domains found in CheA homologs. CheA homologs from each class typically have the same domain composition.
In spite of their high specificity, current models (profile HMMs) for the Hpt, P2/CheY-binding, and dimerization domains have low sensitivity and perform poorly in automated sequence similarity searches, often resulting in missing domains. However, models for the histidine kinase (HATPase_c) and CheW domains are both highly sensitive and specific. Thus, because the presence of these two core domains uniquely distinguishes CheA homologs from other proteins, CheA sequences can easily be identified by automated searches and then further explored for the presence of other domains using more sensitive approaches, such as HHpred (42) and AlphaFold (24).
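As a minimal sketch of this two-step strategy, the Python snippet below first keeps proteins that carry both marker domains and then lists, for each of them, the core domains that the automated search did not report. The `triage` function, the input layout, and the protein identifiers are hypothetical, and the follow-up searches themselves are not shown.

```python
MARKERS = {"HATPase_c", "CheW"}            # sensitive and specific models
LOW_SENSITIVITY_CORE = ("Hpt", "H-kinase_dim")  # core domains often missed


def triage(domains_by_protein):
    """Map each CheA candidate to the core domains still to be looked for.

    `domains_by_protein` maps a protein identifier to the list of domain
    names reported by an automated search. Proteins lacking either marker
    domain are not treated as CheA candidates and are skipped.
    """
    follow_up = {}
    for protein_id, domains in domains_by_protein.items():
        found = set(domains)
        if not MARKERS <= found:
            continue
        follow_up[protein_id] = [
            d for d in LOW_SENSITIVITY_CORE if d not in found
        ]
    return follow_up


# Toy input, not real data.
toy_hits = {
    "protA": ["Hpt", "H-kinase_dim", "HATPase_c", "CheW"],
    "protB": ["HATPase_c", "CheW"],
    "protC": ["HisKA", "HATPase_c"],
    "protD": ["CheW"],
}
to_check = triage(toy_hits)
```

Here `protA` needs no follow-up, `protB` would be passed to a more sensitive search for both missing core domains, and `protC` and `protD` are not CheA candidates.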
The domain architectures for all collected CheA sequences were identified using two approaches. First, domains were predicted using HMMER (41). Then, in cases where no domain was identified by HMMER in sequence regions longer than 100 amino acid residues (the average domain size), we ran HHpred (42), a more sensitive domain identification tool, using the region of interest as a query. All sequence similarity searches were performed with default parameters. Archaeal CheA sequences were gathered from AnnoTree hits using the CheW domain as a query and subtracting CheW protein sequences. Their domain composition and class were predicted using TREND (60,61).
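The selection of regions for the second, more sensitive search can be sketched as follows. This is an assumption-laden illustration, not the code used in this study: the `uncovered_regions` function and the example coordinates are hypothetical, hits are taken as 1-based inclusive intervals, and only the 100-residue threshold comes from the text.

```python
def uncovered_regions(seq_len, hits, min_len=100):
    """Return regions with no domain hit that are longer than `min_len`.

    `hits` is a list of (start, end) tuples in 1-based inclusive sequence
    coordinates; overlapping hits are allowed. Each returned region is a
    1-based inclusive (start, end) tuple that could serve as a query for a
    more sensitive search.
    """
    regions = []
    cursor = 1  # first residue not yet covered by any hit seen so far
    for start, end in sorted(hits):
        if start - cursor > min_len:
            regions.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if seq_len - cursor + 1 > min_len:
        regions.append((cursor, seq_len))
    return regions


# Illustrative coordinates only: a 700-residue protein with three hits.
queries = uncovered_regions(700, [(5, 100), (300, 450), (460, 690)])
```

For these made-up coordinates, the only region passed on is residues 101 to 299; the short gaps at the termini and between the last two hits fall below the threshold.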
Multiple sequence alignments were built using MAFFT (62) with automatically selected parameters; structural models were built by AlphaFold2 (24).

FIG 3 CheA domain composition. (A) Representative domain architectures; species names and accession numbers for CheA proteins are shown. Domains detected by structural but not sequence similarity are shown as empty rectangles. (B) Core and auxiliary domains and their duplications. The percentage of CheA sequences with a given domain is shown. Single domain occurrence is labeled as 1 on the x axis; >1 depicts sequences with more than one copy of a given domain.

FIG 4 Hpt domains in the ChpA (Tfp CheA) from P. aeruginosa PAO1. The structure was predicted by AlphaFold (24). Domains are colored as in previous figures. Known and predicted Hpt domains are numbered. Domains 2 and 3 were identified in this study. The presence of a conserved histidine residue, corresponding to His-48 in E. coli CheA, is marked by "H".

FIG 5 Predominant domain composition of CheA proteins from different chemosensory classes. Chemosensory classes are defined in reference (17). Domains are colored as in the previous figures.