Virulence effector SidJ evolution in Legionella pneumophila is driven by positive selection and intragenic recombination

Effector proteins translocated by the Dot/Icm type IV secretion system determine the virulence of Legionella pneumophila (L. pneumophila). Among these effectors, members of the SidE family (SidEs) regulate several cellular processes through a unique phosphoribosyl ubiquitination mechanism mediated by another effector, SidJ. Host-cell calmodulin (CaM) activates SidJ to glutamylate the SidEs of ubiquitin (Ub) ligases and to make a balanced Ub ligase activity. Given the central role of SidJ in this regulatory process, studying the nature of evolution of sidJ is important to understand the virulence of L. pneumophila and the interaction between the bacteria and its hosts. By studying sidJ from a large number of L. pneumophila strains and using various molecular evolution algorithms, we demonstrated that intragenic recombination drove the evolution of sidJ and contributed to sidJ diversification. Additionally, we showed that four codons of sidJ which are located in the N-terminal (NTD) (codons 58 and 200) and C-terminal (CTD) (codons 868 and 869) domains, but not in the kinase domain (KD) had been subjected to strong positive selection pressure, and variable mutation profiles of these codons were identified. Protein structural modeling of SidJ provided possible explanations for these mutations. Codons 868 and 869 mutations might engage in regulating the interactions of SidJ with CaM through hydrogen bonds and affect the CaM docking to SidJ. Mutation in codon 58 of SidJ might affect the distribution of main-chain atoms that are associated with the interaction with CaM. In contrast, mutations in codon 200 might influence the α-helix stability in the NTD. These mutations might be important to balance Ub ligase activity for different L. pneumophila hosts. This study first reported that intragenic recombination and positive Darwinian selection both shaped the genetic plasticity of sidJ, contributing to a deeper understanding of the adaptive mechanisms of this intracellular bacterium to different hosts.


INTRODUCTION
Legionella pneumophila (L. pneumophila) is the most common causative agent of legionellosis, which manifests as atypical pneumonia or non-pneumonia type illnesses, ubiquitin (Ub) ligase activity; which is shown to be harmful to the host by poisoning the cellular Ub pool and possibly blocking the action of other L. pneumophila effectors that manipulate the host Ub machinery (Bhogaraju et al., 2016). Three functional domains of SidJ were identified in a previous report, including a kinase domain (KD) in the center of protein spanning residues 336 to 593, which forms a catalytic structure and an N-terminal (NTD) and C-terminal domains (CTD) (Black et al., 2019). Given the crucial role of SidJ in this regulation network, studying the nature of evolution of sidJ is of great importance to understand the virulence in L. pneumophila and the interaction between the bacteria and its hosts. A study on 32 unrelated strains of L. pneumophila revealed that recombination was an important strategy in the evolutionary adaptive process and played an active role in sidJ genetic plasticity (Costa et al., 2014). Recombination of sidJ might also provide a broad-host-range for L. pneumophila by preventing host specialization and contributing to the resilience of the species (Costa et al., 2014). Besides recombination, selection was another fundamental evolutionary force that shaped DNA sequence variation. Interaction between recombination and natural selection within a gene can either increase or decrease sequence diversity. Moreover, recombination can generate genetic variation, which is tested by natural selection, and as such, it plays an important role in fueling adaptive evolution (Jouet, McMullan & Van Oosterhout, 2015).
The ultimate goal of this work is to understand the underlying patterns in the evolution of the sidJ gene of L. pneumophila during its lifecycle (e.g., infect the amoebae via amoebal attack and present a sympatric lifestyle) through identifying individual codons under positive selection. We utilized various algorithms to identify the molecular evolution patterns of sidJ in a relatively large number of L. pneumophila strains. It is shown here both intragenic recombination and positive selection drove the adaptive evolution of sidJ and shaped its genetic plasticity. Codons of sidJ that were identified to experience positive selection might play key roles in regulating the binding affinity of SidJ to CaM; and thus, change the glutamylases activation of SidJ, which might, in turn, manipulate the host Ub machinery balance. This reticular regulation network might be an important strategy for the survival and adaptability of L. pneumophila to variable host cells.

L.pneumophila strains
One hundred and sixteen L. penumophila strains were enrolled in this study. These strains were isolated from 1947 to 2016, from different environmental and clinical sources. Fulllength sequences of sidJ were captured from the whole genome of the strains. The detailed information of these strains including the sources and place of isolation, the collection dates, the NCBI biosample, and the sequence accession numbers were summarized in Table S1. Some of these strains are defined as subspecies (subsp. pneumophila, subsp. fraseri, subsp. pascullei, and subsp. raphaeli) of L. pneumophila, and belong to different serogroups (sgs) including sg1, sg4, sg5, sg8, sg11, etc. based on literature report (Kozak-Muiznieks et al., 2018).

Sequence and phylogenetic analysis
The sidJ gene sequences from the 116 L. pneumophila strains were manually checked for integrity and were aligned by MEGA X software using Muscle (codons) algorithms (Kumar et al., 2018). Allele type and DNA sequence polymorphism analyses were performed by DnaSP 6.12.03 (Rozas et al., 2017). The most appropriate model for sidJ nucleotide or SidJ amino acid substitution was determined by the model finder module of MEGA X and using the Akaike Information Criterion (AIC) (Posada & Buckley, 2004). An unrooted phylogenetic tree of the sidJ alleles was constructed using MEGA X, inferring the evolutionary history using the Maximum Likelihood (ML) method and Hasegawa-Kishino-Yano model with gamma distribution (HKY+G) (Hasegawa, Kishino & Yano, 1985). Initial trees were obtained automatically by applying the Neighbor-Joining and BioNJ algorithms to a matrix of pairwise distances estimated using the Maximum Composite Likelihood (MCL) approach. The evolutionary history of SidJ protein was inferred by using the Maximum Likelihood method and the JTT matrix-based model (Jones, Taylor & Thornton, 1992). Initial tree(s) for the heuristic search were obtained automatically by applying Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated using the JTT model, and then selecting the topology with a superior log-likelihood value. A discrete Gamma distribution was used to model evolutionary rate differences among sites. Bootstrap values were estimated using 1000 replications.

Molecular evolution analysis
The neighbor-net analysis was performed and split networks were constructed with algorithms implemented in SplitsTree4 software (version 4.14.4) (Huson & Bryant, 2006). A reticulate network tree was prepared to show the relationships among sidJ alleles and to visualize possible recombination events. Pairwise homoplasy index (Phi) tests were used to calculate a measure of statistical significance for recombination and a cutoff value was set at 0.05 (Bruen, Philippe & Bryant, 2006). The sidJ allele sequences were screened by RDP4 to detect intragenic recombination (Martin et al., 2015). Six methods (RDP Martin & Rybicki, 2000), GENECONV, BootScan (Martin et al., 2005), MaxChi (Smith, 1992), Chimaera (Posada, 2002), and SiScan (Gibbs, Armstrong & Gibbs, 2000) implemented in the RDP4 were utilized. Potential recombination events (PREs) were defined as those identified by at least four methods. Common settings for all the methods were as following: sequences were considered as linear, statistical significance was set at P < 0.05, Bonferroni correction was used to correct P-values for multiple comparisons, phylogenetic evidence was required, and breakpoints were polished. Genetic diversity of the sidJ alleles was investigated by using DnaSP 6.12.03 (Rozas et al., 2017).

Population genetics analysis
DnaSP 6.12.03 was used to perform genetic diversity analyses for the sidJ alleles Rozas et al., 2009). Tajima's D, Fu, and Li's D* and F* tests were employed to verify the neutrality hypothesis of sidJ as previously described by our research group (Zhan & Zhu, 2017). These analyses were carried out using DnaSP 6.12.03 (Rozas et al., 2017). A statistical significance cutoff was set at 0.05 for all the tests. Nonsynonymous and synonymous mutations of sidJ were calculated using the MEGA X software package (Kumar et al., 2018). A parsimony network of sidJ alleles was created using PopART software (http://popart.otago.ac.nz) (Clement, Posada & Crandall, 2000). The demographic history of sidJ was inferred by analyzing the mismatch distribution of pairwise nucleotide differences in the sidJ alleles, which was carried out by an algorithm implemented in Arlequin3.5 (Excoffier & Lischer, 2010). Expected values for a model of constant sidJ allele population size were calculated and plotted against the observed values. Harpending's raggedness index and the sum of squared deviations (SSD), as implemented in Arlequin3.5, were used to evaluate Rogers' sudden expansion model, which fits a unimodal mismatch distribution (Rogers & Harpending, 1992).

Analysis of positive selection at the codon level
The positive selection pressure operating on the L. pneumophila sidJ gene was investigated using the Maximum Likelihood (ML) method by a visual tool of codeml software program (Bielawski, Baker & Mingrone, 2016), EasyCodeML (Gao et al., 2019). First, the topologies of the ML trees of sidJ alleles were generated by MEGA X as mentioned above, for the subsequent selection analysis. Then, three nested models (M3 vs. M0, M2a vs. M1a, and M8 vs. M7) were compared, and the likelihood ratio tests (LRTs) were applied to assess the best fit of codons. Model fitting was performed using multiple seed values for dN/dS. SidJ codon frequencies were assumed using the F3x4 model. When the LRT was significant (P < 0.05), Bayes empirical Bayes (BEB) (M8 model) and Naive Empirical Bayes (NEB) methods (M3 and M2a models) were used to identify codons that evolved under positive selection based on a posterior probability of more than 0.95. Positive selection was inferred when the individual site or codon had a ratio of nonsynonymous to synonymous mutations greater than one (ω > 1). To omit the influence of intragenic recombination on the selection analysis, a modified topology of ML trees was applied to the selection analysis by identifying non-recombinant regions and allowing each to have its phylogenetic tree. The modified and fitted trees were obtained by using the GARD (http://www.datamonkey.org/gard) (Kosakovsky Pond et al., 2006;Pond & Frost, 2005), which can screen an alignment for recombination breakpoints, infer a unique phylogenetic history for each detected recombination block, and generate a modified tree topology. The HyPhy software package was also employed to validate the results obtained using the ML method (Kosakovsky Pond et al., 2019). Fixed Effects Likelihood (FEL), Fast, Unconstrained Bayesian AppRoximation (FUBAR), Evolutionary Fingerprinting, and Mixed Effects Model of Evolution (MEME) algorithms were used (Kosakovsky Pond & Frost, 2005;Murrell et al., 2013). These methods can take recombination into account by screening recombination breakpoints of the sequences, identifying non-recombinant regions and allowing each to have its own phylogenetic tree by using an updated partitioned dataset provided by GARD (Pond et al., 2006).

Mapping of positively selected sites to structure models of proteins
The three-dimensional structure of SidJ and CaM was modeled using the Phyre server (Kelley et al., 2015), and the SWISS-MODEL (http://swissmodel.expasy.org)    Table S1). The sidJ-1 subgroup is shown in a purple cycle, sidJ-2 blue, sidJ-3 green, and sidJ-4 red. The internal nodes represent hypothetical ancestral alleles and edges correspond to reticulate events such as recombination in the evolution of sidJ. The red arrow points to a representative reticulate event. (B) The evolutionary history of sidJ was inferred by using the Maximum Likelihood method. The ML tree was constructed from the alignment of nucleotide sequences of 39 alleles. Allele names were marked as their representative strain names. The earliest possible year for the allele to arose is shown in blue. Numbers on the interior branches represent bootstrap values and are indicated when the values are >0.5. The tree is drawn to scale, with branch lengths measuring in the number of substitutions per site. Branches in the same color were clustered into a group. Allele names are marked in red to indicate that they are recombinants. Unique recombination events detected by six recombination detection methods implemented under the RDP4 based on sidJ alleles are mapped onto the corresponding breaking point positions in the alignment (in the right of the figure). Recombination events that were identified by four or more methods were selected and numbered according to the RDP4 analysis, and the minor parent names of the recombinant alleles are shown nearby the breaking point positions (see Table 1). Full-size DOI: 10.7717/peerj.12000/fig-1 (Waterhouse et al., 2018). The positive selection sites were mapped onto the structure and visualized by PyMol (http://www.pymol.org/) (Lilkova et al., 2015).

Characteristics of L.pneumophila strains and sidJ sequences
The sidJ sequences of the 116 L. pneumophila strains in this study could be clustered into 39 unique alleles (alleles were marked with representative strain name, shown in Table S1), which corresponded to 36 different SidJ amino acid sequences. We propose that these sequences might represent most of the sidJ alleles. A total of 544 polymorphic (segregating) sites in the 39 sidJ alleles (the full-length gene is 2,622 bp) generate high amino acid sequence polymorphism in the SidJ protein, about one fifth (19.68%, 172/874) amino acid sites were polymorphic. Significantly higher amino acid sequence polymorphisms were found in the NTD than the CTD (24.78% Vs. 15.35%, 83/335 Vs. 43/280, Chi-Square test, P = 0.019).

Intragenic recombination drives the evolution of sidJ
A reticulate network tree was obtained by the Neighbor-net algorithm of SplitsTree4, using the alignment of the 39 sidJ alleles. As shown in Fig. 1A, we could observe many reticulate events which indicated possible recombination events among sidJ alleles. Moreover, the implemented Phi test in SplitsTree4 did find significant evidence for recombination within the alleles (P < 0.001). Thus, we tested the intragenic recombination by using RDP 4. Eight potential recombination events (PREs) and 14 recombinant alleles were identified, which were supported by at least four of the six analysis methods according to Costa et al. (2014) report (Table 1). Figure 1B showed the phylogenetic relationship of these alleles. Allele names were shown with their representative strain names. Four main clades were found and all recombinants sidJ were distributed in clades three and four. Two non-recombinant sidJ alleles, D7119 and NCTC12273 formed a sub-clade and had a relatively far relationship with those non-recombinants. We observed that the stains harboring such alleles (D7119 and NCTC12273) all belonged to L. pneumophila subsp. pascullei and most were from environmental sources. In contrast, three recombinant sidJ alleles, D5265, D4954, and D7705 also formed a sub-clade but stains harbored such alleles all belonged to L. pneumophila subsp. raphaeli and most were from clinical sources. Previously reports by Costa et al. (2014) indicated that intragenic recombination was an important strategy in the evolutionary adaptive process of sidJ. We here showed similar results, but due to the fact that in this study a greater number of L. pneumophila strains and alleles of sidJ were used, we can consider our results more robust.

Intragenic recombination drives the diversification of the sidJ
To explore whether intragenic recombination drives the diversification of the sidJ, we categorized the 116 strains into two groups: the recombinant group and the nonrecombinant group. Then, we utilized DnaSP to study the difference in genetic diversity between the two groups. It showed that most of the parameters that represent genetic diversity were higher in the recombinant group. These parameters included nucleotide diversity, polymorphic sites, and the average number of nucleotide differences ( Table 2), suggesting that recombination added a high density of polymorphisms in sidJ. The phylogeny of sidJ alleles also showed that recombinant alleles roughly formed an outside subclade (sidJ-3 and sidJ-4, Fig. 1B) compared with those non-recombinants. Recombination also introduced an excess of non-synonymous and synonymous diversity for sidJ (0.02454 Vs. 0.01371 and 0.1660 Vs.0.1063, Table 2). These results were partly consistent with the research with another intracellular bacteria, Mycobacterium tuberculosis, of which the genetic diversification was partly driven by recombination and led to a high genetic diversity and genomic plasticity (Namouchi et al., 2012). Furthermore, we did not find evidence for recombinant sidJ alleles experiencing demographic expansion based on the current pools of strains. The mismatch distributions for the total sidJ set, recombinant sidJ, and non-recombinant sidJ genes were roughly multimodal with P > 0.05 for the SSD. Also, P-values for Harpending's Raggedness index was lower than 0.05 in each group, indicating that no demographic expansion exists ( Figs. 2A-2C). However, a potential reduction of recombinant sidJ alleles in the L. pneumophila community was found. Fu and     (Rowbotham, 1980). Parsimony (TCS) network of sidJ alleles showed no central allele (node). Many mutations were found among the alleles, and these alleles did not comprise a scattered star structure (Fig. 3), suggesting that the expansion of the L. pneumophila population with a specific mutation in the sidJ gene has not taken place (Leigh & Bryant, 2015).

Evidence of positive selection in sidJ
The M0 model of the EasyCodeML package showed an average ω of 0.1636 which was less than 1, suggesting that at the whole gene level, purifying selection conducting the  -3 evolution of sidJ. Moreover, the frequency distribution of codon classes based on the M3 model showed proportion of sidJ codons that was subject to purifying selection was 0.8176 (Fig. 4A), further proved that purifying selection was a major force during sidJ evolution.
Although the proportion of codon 3 (under positive selection) was relatively smaller (0.01473, Fig. 4A), the three likelihood ratio tests (LRTs) showed that model M3 and M8 were significantly better fit (P < 0.05) than the relevant null model M0 and M7, respectively. These results together suggested that a small number of codons of sidJ were subjected to positive selection pressure (ω =1.1155-2.5605). Here, we took results from models M7 Vs. M8 as a standard as Yang et al. recommended (Yang, 2007;Yang & Bielawski, 2000). Thus, four positive selection sites including 58G, 200N, 868T, and 869S were identified with posterior probabilities (Pr) of at least 0.95 (Table 3). Still, only eight PREs were identified among the 39 alleles, the PREs and allele sequences' ratio was about 20%, indicating that the LRT was robustness to such low levels of recombination (<30%) (Anisimova, Nielsen & Yang, 2003). These nested models were also more realistic and showed more robust to recombination (Anisimova, Nielsen & Yang, 2003). Similar results were obtained when recombination was not taken into account by using the unmodified tree topology of the 39 sidJ alleles. 868T, a definitive positive selection site with Pr =0.973, when using the modified tree topology, was identified as a critical positive selection site (Pr =0.942). (Table S2). To further confirm our results, three additional algorithms implemented in the HyPhy software package were used to identify positive selection sites of SidJ. We could identify all of the same positive selection codons as we obtained from the codeML package (Table S3). Thus, combined with the results from different algorithms, finally, four sites including 58G, 200N, 868T, and 869S were identified as definitive positive selection sites of SidJ. These sites were distributed in either the NTD or CTD of SidJ (Fig. 4B). The distributions of single amino acid polymorphic loci in the KD versus the NTD and CTD were not uneven (17.83% Vs. 20.45%, 46/258 Vs. 126/616, P =0.464, Chi-Square test), and average numbers of amino acid site profiles in each domain of SidJ were roughly the same (Fig. 4C). This result further verified that positive selection pressure selectively operated on the NTD and CTD domains, but not on the KD. Based on this result, we propose that the KD domain of SidJ is essential, but conserved to maintain its glutamylase activity to catalyze the glutamylation of L. pneumophila SidEs, while NTD and CTD domains modulate the interaction between SidJ and CaMs from different hosts. Positive selection in the CTD and NTD of SidJ may be an evolutionary strategy for L. pneumophila surviving in different hosts. Few studies had focused on studying positive selection signals in the

Not Allowed
Notes. P is the number of parameters in the ω distribution; lnL is the log likelihood; ω is ratio of dN /dS, LRT P-value indicates the value of chi-square test; Parameters indicating positive selection are presented in bold; positive selection sites were identified by the Bayes empirical Bayes (BEB) methods under M8 model or by Naive Empirical Bayes (NEB) methods under M3 and M2a models. The posterior probabilities (p) ≥0.90, (p) ≥0.95 and p ≥ 0.99 are indicated by *, ** and ***, respectively. (Yang, 2007;Yang & Bielawski, 2000) recommended that results from M8 model were preferred to find sites under positive selection pressure, and it is more robust to recombination which was proved by Anisimova, Nielsen & Yang (2003) individual L. pneumophia genes. Kenzaka et al. (2018) previously reported preferential positive selection in F-Box Domain gene (lpp0233), and they found a higher ω in this gene compared with those in other protein encoding genes and housekeeping genes. However, the ω in lpp0233 was less than 1 (0.40), indicating that at the whole gene level, the positive selection signal was still weak and codon level analysis was required. Costa et al. (2014) also attempted to discover positive selection sites of sidJ by using the M8 model but failed because the likelihood ratio tests between M8 and M7 models showed no significant difference. This might be due to the limited sidJ alleles (23 alleles) they obtained, as the codeML algorithm was phylogeny-based, and more allelic profiles of a gene could infer an authentic phylogenetic history much better. This result highlighted the importance of using enough alleles of a gene for analyzing positive selection at the codon level. We previously verified 14 positive selection sites of a key protein associated with an antibiotic-resistance characteristic of methicillin-resistant Staphylococcus aureus, named penicillin-binding protein (PBP) 2a (Zhan & Zhu, 2018). We found that all these sites in PBP2a have only one mutation profile. However, SidJ positive selection sites showed there were more mutation profiles in a codon. Three mutation profiles including G58R, G58M, and G58E were identified at codon 58. Four mutation profiles including N200T, N200I, N200A, and N200V were identified at codon 200. Two mutation profiles including T868N, T868P, and S869T, S869P were identified at codons 868 and 869, respectively (Table S4).
Most of these mutation profiles included changes in chemical properties of the amino acids (e.g., amino acid polarity) (Table S4), which might affect the three-dimensional structure of SidJ. SidJ modulates host cellular pathways through the membrane remodeling of L. pneumophila (Luo, 2012). Given that SidJ interacts with both SdeA and eukaryotic CaM, diversified mutation profiles of these positive selection sites might imply that sidJ was a target for host specialization or selection and these mutations might increase the fitness of L. pneumophila in certain environments, and in turn promote their survival in different hosts (Costa et al., 2014;Park, Ghosh & O'Connor, 2020). However, the exact function of these mutations requires further study.

Recombination and positive selection shape the population structure of SidJ in L. pneumophila
The evolutionary history of the SidJ proteins corresponding to 39 alleles was studied by using MEGA X. Similar topology of the trees was found when compared to that of the sidJ alleles (Figs. 1B and 4). Three paired alleles (Lorraine and ATCC35096, Ice27 and Aco13, and D5265 and D4954) encoded SidJ with the same protein sequences. The properties of these alleles and the information of their representative strains were studied. Of the 39 sidJ alleles, some were distributed both in clinical and environmental strains, while some were only distributed in environmental strains (Table S1). We defined those alleles that could be found in clinical strains as clinically associated alleles. Among the 14 recombinant alleles, 12 (85.71%) were clinically associated. In contrast, among the 25 non-recombinant alleles, only 15 (60%) were clinically associated. This result suggested that recombinant sidJ alleles were more likely to be clinically associated alleles, although not significant (P =0.093, Chi-Square test). (Fig. 5A). A larger pool of L. pneumophila strains is required to sufficiently explain the association of recombinant sidJ with clinical strain. Based on this result, we propose that recombination is an important strategy for L. pneumophila to survive in different environments, and for infecting human cells. We did find specific mutation profiles of positive selection sites in clinically associated alleles (Fig. 5B) or recombinant alleles (Fig. 5C), for example G58M mutation happened less frequently in clinically associated alleles, while T868P, T869P happen more frequently in recombinant alleles of sidJ (Table S5 and Figs. 5B-5C). Considering that the positive selection is usually beneficial to the survival of the individual bacteria carrying the mutation, these results indicated that mutations in positive selection sites increased SidJ variability, and made more extensive the adaptability in environmental hosts for L. pneumophila. This might lead to broad coevolution of L. pneumophila genes (e.g., sidJ ) with viable environmental hosts before it could infect human cells and thus be of crucial importance in the virulence of this bacteria. The fact that the finding of alleles harboring particular mutations in positive  T  T  T  T  T  T  A  A  A   T  T  T  T  T  T  N  S   T  T  T  S  T  S  T  S  T  S  T  S  T  S  T  T  T  T  T  T  T  S  T  S  T  S  T  selection sites were more likely recombinants further demonstrated that recombination in sidJ enhanced the environmental adaptability of some L. pneumophila strains (Fig. 5C).

Mutation of positive selection sites of SidJ might influence the binding of CaM to SidJ
Here, the four definitive positive selection sites are located in the NTD and CTD, but not in the KD. CaM binding is required by SidJ glutamylase activity. An IQ (I841 and Q842) motif located in the C-terminal domain of SidJ is involved in CaM binding by burying in a hydrophobic cleft of the CaM C lobe. The CTD of CaM semi-encircles the C-terminal helix of SidJ and the NTD domain of CaM makes extensive contacts with the N-terminus of SidJ (Bhogaraju et al., 2019). Residues in these domains also play roles in mediating the formation of the SidJ-CaM complex, and in turn, stabilize the position of the N-lobe of the KD, and thereby leading to the formation of a stable catalytic pocket for SidES (Bhogaraju et al., 2019). A previous site-directed mutagenesis study verified the importance of I841 and Q842 for optimal binding. In addition, some other sites including Q830, S808, E812, R796, R660, R804 that engaged in hydrogen-bonding interactions with corresponding CaM residues also showed the importance of the binding affinity of SidJ with CaM (Gan et al., 2019). Therefore, the mutation on the positive selection sites might change the level of interactions of SidJ with CaM through hydrogen bonds and salt  bridges. Structural studies on a truncated SidJ-CaM complex indicated that some of the residues in the C-terminus of SidJ were crucial to the complex which is adjacent to the IQ motif (Fig. 6A). Our protein structural modeling showed a similar three-dimensional image as the truncated one (Fig. 6B). The definitive positive selection site on codon 58 of SidJ indicated that the mutation of this site was functional, although a previous study  suggested that a truncated SidJ lacking the first 99 residues (SidJ( N99)) showed activity indistinguishable from that of the full-length protein (Gan et al., 2019). This implied that a full-length protein of SidJ was required for a more detailed description of the function of some important amino acid sites of the whole protein. As shown in Fig. 6C, the G58E or G58M mutations significantly influence the three-dimensional structure of SidJ. The G58E mutation might be associated with the interaction with CaM because it was spatially closer to the I841 and Q842 than the original 58G (Fig. 6C). A similar explanation of the influence of codons 868 and 869 was determined as they were closer to those that could interact with CaM residues (Fig. 6B) and where the CaM docked and could mediate most of the interactions with CaM (Bhogaraju et al., 2019). Although codon 200 of SidJ had four mutation profiles, we found two of which might be functional. As shown in Fig. 6D, the N200T and N200V mutations significantly affected the hydrogen bonds of SidJ. Two hydrogen bonds of N200 could be formed with K196 and S204, while T200 could form three ones with K196, P197, and S204. In contrast, V200 could only form one with K196 (Fig. 6D). The more hydrogen bonds indicated a more stable α-helix, in turn, stabilizes the whole protein structure. Thus, the mutations in codon 200 may also mediate the interaction between SidJ and CaM and all these mutations in the positive selection sites could only adjust, but not abolish the affinity of CaM binding to SidJ. Based on protein structure modeling, we propose a potential explanation of the influence of the mutation on the four positive selection sites. Experimental data was still required to understand the exact function of these mutations. Mutation in positive selection sites might facilitate the survival of the lifeform containing mutated alleles. As an intracellular bacterium, entering the host and establishing infection were crucial for L. pneumophila lifecycle, and in which SidJ was dedicated to balance the host Ub ligase activity and was important for successful infection. L. pneumophila was shown to survive as an intracellular parasite of free-living protozoa in aquatic and moist soil environments (Bhogaraju et al., 2019;Fields, Benson & Besser, 2002;Gan et al., 2019). Protozoa provide a specific environment for gene exchange between L. pneumophila and other microorganisms invading them as pathogens or symbionts, also protozoa might be act as donors and transfer their own DNA to L. pneumophila (Mondino, Schmidt & Buchrieser, 2020). Many potential environmental hosts of L. pneumophila have been identified, including Dictyostelium discoideum (soil amoeba), Acanthamoebae castellanii, Entamoeba histolytica, Naegleria, and Tetrahymena, etc. (Hagele et al., 2000;Steinert et al., 1994;Rowbotham, 1980;Swart et al., 2018). Figure 7A showed the phylogenetic relationship of CaM among humans and potential hosts of L. pneumophila. Despite that CaM was relatively conserved, the protein sequences of CaM in these eukaryotes are variable (Fig. 7B, Table S6). And this might lead to a slightly structural difference among human CaM and those of protozoa hosts of L. pneumophila (Fig. 7C). Given that most of the time L. pneumophila is inhabiting in environmental hosts, we also propose that the variability in SidJ, especially that in positive selection sites might be important towards L. pneumophila survival in different environmental hosts and has an adaptive function to a broad selection of environments. The exact functions of these mutations to L. pneumophila living in the environmental host are worthy of further study. These results also suggested that more natural variants of a protein from a broad-host bacterium were required to discover the mechanism of how this protein and its variants are involved in the infection.

CONCLUSIONS
We presented molecular evolution analyses on a large and comparative set of sidJ alleles, derived from a collection of L. pneumophila strains. We found that among the 39 recognized sidJ alleles, about one-third were recombinants generated by eight PREs. Intragenic recombination also drove the sidJ diversification manifested by a higher genetic diversity in the recombinants as compared with that in non-recombinants. In addition, we found definitive positive Darwinian selection of SidJ at the codon level. Four codons in the NTD and CTD domains of SidJ were identified, and their mutation profiles were also determined. Protein structural modeling of SidJ provided possible functional explanations for the mutations in positive selection sites. It might influence the binding affinity of CaM to SidJ, thus regulate SidJ glutamylase activity to SidEs, and in turn balance the Ub ligase activity in different hosts. This study gave us a deeper understanding of the adaptive mechanisms of this intracellular bacterium to different hosts and highlighted the importance of the NTD and CTD domains in SidJ kinase activity that is activated by the binding of CaM. Further studies should focus on experimental evidence to illustrate the function and mechanism of natural sidJ mutants (with a mutation in positive selection sites) in regulating balanced Ub ligase activity in different L. pneumophila hosts.

ADDITIONAL INFORMATION AND DECLARATIONS
Funding