REC protein family expansion by the emergence of a new signaling pathway

ABSTRACT This report presents multi-genome evidence that REC protein family expansion occurs when the emergence of new pathways gives rise to functional discordance. Specificity between residues in REC-domain-containing response regulators and their paired histidine kinases is under negative (purifying) selection, constrained by the presence of other bacterial two-component system signaling cascades that share sequence and structural identity. Presuming that two-component systems can evolve by neutral amino acid changes (neutral drift) when purifying evolutionary constraints are relaxed, how might the REC protein family expand by amino acid changes when these constraints remain intact? Using an unsupervised machine learning approach to observe the sequence landscape of REC domains across long phylogenetic distances, we find that within-gene recombination, a subcategory of gene conversion, switched the effector domain and, consequently, the regulatory context of a duplicated response regulator from transcriptional regulation by σ54 to that by σ70. We determined that the recombined response regulator diverged from its parent by episodic diversifying selection and neutral drift. Functional experiments on the parent and recombined response regulators in the model system Pseudomonas putida KT2440 revealed that the parent and recombined response regulators sense and respond to different carboxylic acids. Finally, a residue-switching experiment using structural predictions and functional characterization suggests that the new residues in the recombined regulator could form a new interaction interface and mediate condition-specific phosphotransfer. Overall, our study finds that genetic perturbations can create conditions of functional discordance, whereby the REC protein family can evolve by episodic diversifying selection.

IMPORTANCE We explore when and why large classes of proteins expand into new sequence space. We used an unsupervised machine learning approach to observe the sequence landscape of REC domains of bacterial response regulator proteins. We find that within-gene recombination can switch effector domains and, consequently, change the regulatory context of the duplicated protein.


Figure S1: Dimensionally reduced REC domain sequence alignments with and without gap replacement show similar data structures.
Similarity between members of the REC protein family, captured by explaining 95% of the variation between the sequences, shows the relationship between REC domain sequence alignments for (A) the same data as Figure 1D but without gap replacement and (B, C) independently sampled data with gap replacement. As in Figure 1, REC domains were sampled from species found within the taxonomic rank (kingdom: Bacteria, phylum: Proteobacteria, or genus: Pseudomonas) labeled at the top of each plot and sampled as described in the methods section. After species sampling, we identified all proteins with REC domains in each of the sampled species' genomes and aligned their sequences. Points represent the sequences of unique REC domains from the sampled genomes; points that are close together share sequence identity with each other. Each REC domain is annotated by the identity of its effector domain: Trans_reg_C (yellow), GerE (blue), AAA+ (red), other (gray), and no domains (black). Note that the location of REC domains in the plot is not fixed between independent runs of the algorithm; however, the relative distribution of the points is visually consistent between runs. The black arrow indicates the REC domains linked to GerE effector domains as a result of within-gene recombination of parent REC domains fused to AAA+ effector domains.
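The first step behind these maps, reducing an alignment to the components that explain 95% of the variation, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the one-hot encoding, the treatment of gaps as a 21st symbol, the function names, and the toy sequences are all assumptions. The subsequent neighbor-embedding step that uses the perplexity parameter is not shown.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap character

def one_hot(alignment):
    """Encode equal-length aligned sequences as an (n_seqs, n_cols * 21) matrix."""
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    n, length = len(alignment), len(alignment[0])
    X = np.zeros((n, length * len(ALPHABET)))
    for row, seq in enumerate(alignment):
        for col, aa in enumerate(seq):
            # Unknown symbols are treated like gaps (an assumption of this sketch).
            X[row, col * len(ALPHABET) + index.get(aa, index["-"])] = 1.0
    return X

def pca_95(X, target=0.95):
    """Project X onto the fewest principal components explaining >= target variance."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    k = int(np.searchsorted(np.cumsum(explained), target) + 1)
    return U[:, :k] * S[:k], explained[:k]

# Hypothetical toy alignment, for illustration only.
toy = ["MKILVVDD", "MKILIVDD", "MRVLLVED", "M-VLLVED", "MKILVVDE"]
coords, explained = pca_95(one_hot(toy))
print(coords.shape, round(float(explained.sum()), 3))
```

Each row of `coords` is one REC domain in the reduced space; sequences that share identity end up close together, which is the property the plots rely on.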

Figure S5:
(A) The REC domains within the black box in Figure 3A were used to generate a REC domain codon sequence alignment (generated by reverse translating the peptide alignment) to test for episodic diversification on the branches separating the parent REC domains with AAA+ effectors (red) and recombined REC domains with GerE (or no domain) effectors (blue). Episodic diversification occurred after the within-gene recombination (thick gray branch). (B) The REC domains within the black box (Figure 3A), excluding outgroup domains, were used to construct a domain tree from the REC domain codon sequence alignment (generated by reverse translating the peptide alignment). Branches are colored red if the domain's effector identity was AAA+ or blue if the domain's effector identity was GerE (or no domain). Bootstrap supports (labeled at the nodes and shown as branch thickness) are 100 for the nodes separating the parent REC domains with AAA+ effectors and recombined REC domains with GerE (or no domain) effectors.
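One common way to obtain a codon alignment from a peptide alignment is to thread each gene's own coding sequence onto its aligned peptide. The sketch below assumes that approach; the function name and the toy input are hypothetical, and the source does not state which tool was used.

```python
def codon_alignment(aligned_peptide, cds):
    """Thread an unaligned coding sequence onto its aligned peptide.

    Each residue takes the next codon from `cds`; each gap becomes '---'.
    """
    n_residues = sum(1 for aa in aligned_peptide if aa != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is shorter than the peptide")
    out, pos = [], 0
    for aa in aligned_peptide:
        if aa == "-":
            out.append("---")
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)

# Toy example: M, K, gap, L
print(codon_alignment("MK-L", "ATGAAACTG"))  # ATGAAA---CTG
```

Keeping gaps as whole codons preserves the reading frame, which codon-based tests such as non-synonymous/synonymous rate ratio tests require.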

Figure S6. Functional genomics show regulation by carboxylic acids.
(A) A pool of randomly barcoded transposon (RB-Tn) insertion mutants was grown in defined media with glutamic, succinic, alpha-ketoglutaric, or butyric acid as the sole carbon source. We show the fitness score (average fold change of RB-Tn insertions within a given gene before and after competition in a pooled experiment) for genes coding for the response regulators in P. putida KT2440 that have the parent REC domains with AAA+ effectors (PP_1066 (red), PP_0263 (orange), PP_1401 (yellow)) and for the response regulator with the recombined REC domain (PP_3551 (blue)) when the pool was grown on each carbon source. Although not discussed in the main text, we also show fitness scores for genes related to the response regulators by synteny, co-fitness, or the presence of a DNA binding site. Gene functions are annotated below the gene ID: R = response regulator, K = histidine kinase, P = promoter region tested by a GFP reporter assay, P N.T. = promoter region not tested by GFP reporter experiment. (B-E) DAP-seq results shown as fold enrichment by genome location (Genome Index) for response regulators (RR) treated with or without acetyl phosphate; error bars show the 95% confidence interval of all replicates (n = 2 per treatment with or without acetyl phosphate). Results for orthologous RRs from Pseudomonas species are displayed from left to right: (B) P. putida KT2440 (PP), (C) P. stutzeri RCH2 (Psest), (D) P. fluorescens N2E2 (Pf6N2E2), and (E) P. fluorescens N2C3 (AO356). Orthologous RRs are matched and displayed in the same row. Red triangles below each individual plot indicate the genomic location where a predicted binding motif was identified; genes upstream/downstream of this site are summarized in Table S3. Samples that failed to enrich DNA above a fold-enrichment threshold of 2 are shown in gray.
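The fitness score in (A) is described as an average fold change of the insertions within a gene. A minimal sketch of that idea follows; the log2 scale, the pseudocount, the absence of any normalization, and the gene names and counts are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def gene_fitness(counts, pseudocount=1.0):
    """Mean log2 fold change per gene.

    counts: iterable of (gene, reads_before, reads_after), one entry per barcode.
    """
    per_gene = defaultdict(list)
    for gene, before, after in counts:
        ratio = (after + pseudocount) / (before + pseudocount)
        per_gene[gene].append(math.log2(ratio))
    return {gene: sum(v) / len(v) for gene, v in per_gene.items()}

# Hypothetical barcode counts before and after competition.
toy = [("geneA", 99, 24), ("geneA", 199, 49), ("geneB", 79, 79)]
scores = gene_fitness(toy)
print(scores)
```

Under these assumptions, a negative score means insertions in that gene were depleted during growth on the carbon source, and a score near zero means the gene was dispensable in that condition.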

Figure S9:
To show the structural impact of specificity switching, we modeled the structures of PP_3551, PP_1066, and their respective mutants with AlphaFold. Covariant (blue), selected (yellow), covariant and selected (red), and active aspartate (purple) residues are highlighted and shown in a space-filling model.

Table S1. Summary of REC domain random sampling.
Databases were created by randomly sampling REC domains from curated genomes found in the microbial signal transduction database. To control for overrepresentation of subclades that have more species with full genome sequences, species were sampled evenly from the taxonomic rank below the one represented in the plot title (e.g., the taxonomic rank below kingdom is phylum; up to the maximum number of species was randomly sampled from each bacterial phylum to generate the bacteria dataset). This table summarizes the number of taxa that are represented in the REC sequence landscapes in each indicated figure using each indicated database, and the maximum number of species per rank below used to generate the data. The perplexity, or expected size of the clusters in the dimensional reduction of the REC sequence alignments, is defined by the number of "unique taxa below rank" for a given database.

reservoir for 1 minute.
f. Suspend and dispense 100 µL Recharge Solution stored in a 50 mL reservoir for 1 minute.
g. Suspend and dispense 100 µL Recharge Equilibration Buffer stored in a 50 mL reservoir for 1 minute.
h. The tips are then washed in a reservoir with 150 mL Storage Buffer.
i. Before completing the wash, the tips suspend 150 µL Storage Buffer without dispensing and return to their original box.
j. The tips are then wrapped in parafilm and stored at 4˚C. Tips can be recharged and reused up to 10 times.
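The even sampling summarized for Table S1 can be sketched as a capped random draw within each lower-rank group. The function name, the seed, and the species and phylum labels below are hypothetical, and the real databases and caps are those listed in the table.

```python
import random
from collections import defaultdict

def sample_evenly(species_to_group, max_per_group, seed=0):
    """Randomly keep at most `max_per_group` species from each lower-rank group."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for species, group in species_to_group.items():
        by_group[group].append(species)
    sampled = {}
    for group, members in by_group.items():
        members = sorted(members)  # fixed order so the seed gives a repeatable draw
        k = min(max_per_group, len(members))
        sampled[group] = rng.sample(members, k)
    return sampled

# Hypothetical species and their phyla.
species_to_group = {"sp1": "phylumA", "sp2": "phylumA", "sp3": "phylumA", "sp4": "phylumB"}
sampled = sample_evenly(species_to_group, max_per_group=2)
perplexity = len(sampled)  # number of unique taxa below rank, as defined for Table S1
print(sampled, perplexity)
```

Capping the draw per group keeps a heavily sequenced clade from dominating the alignment, which is the overrepresentation the table legend describes.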

Figure S2: Dimensional reduction of scrambled REC domain sequence alignments with and without gap replacement reveals bias from the alignment strategy.
Similarity between members of the REC protein family, captured by explaining 95% of the variation between the sequences, shows the relationship between REC domain sequence alignments of the same data as Figure 1D, but with the sequences randomly scrambled (A) without gap replacement or (B) with gap replacement. REC domains are annotated by the identity of their effector domains, depicted with the same colors as in Fig S1. Please also see the Fig S1 notes on the location of REC domains and its interpretation.
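A scrambled control of this kind can be generated by shuffling the residues within each sequence, which keeps length and amino acid composition but destroys positional information. The sketch below assumes per-sequence shuffling; the source does not state exactly how the scrambling was done, and the toy sequences are hypothetical.

```python
import random

def scramble(seq, rng):
    """Shuffle the residues of one sequence, preserving composition and length."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

rng = random.Random(0)  # fixed seed so the control is reproducible
seqs = ["MKILVVDDEPLIRE", "MRVLLVEDNALNRK"]  # hypothetical REC fragments
scrambled = [scramble(s, rng) for s in seqs]
print(scrambled)
```

Any structure that survives in a map built from such sequences cannot come from shared ancestry, so it points to bias introduced by the alignment strategy.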

Figure S3: Within-gene recombination that changed a parent REC domain fused to an AAA+ domain into a REC domain fused to a GerE domain occurred in the Proteobacteria lineage, specifically the Alphaproteobacteria lineage.
Similarity between members of the REC protein family, captured by explaining 95% of the variation between the sequences, shows the relationship between REC domain sequence alignments with gap replacement of (A, B) two independently sampled datasets (shown respectively in A and B) in (from left to right) Chloroflexi, Firmicutes, Bacteroidetes, and Actinobacteria species, and (C) independently sampled data, compared to the data in Figure 2, in species from Alphaproteobacteria, Betaproteobacteria, Gammaproteobacteria, and Deltaproteobacteria. The REC domains were sampled from species found within the taxonomic rank (phylum) labeled at the top of each plot and sampled as described in the methods section. REC domains are annotated by the identity of their effector domains, depicted with the same colors as in Fig S1. Please also see the Fig S1 notes on the location of REC domains and its interpretation.

Figure S4: Within-gene recombination that changed a parent REC domain fused to an AAA+ domain into a REC domain fused to a GerE domain occurred in the Proteobacteria lineage, specifically the Alphaproteobacteria lineage, as shown by fitting the data with REC domains from P. putida KT2440.
Data used to generate Figure 2 (top panels) and Figure S3 (bottom panels) were re-aligned with REC domains from P. putida KT2440 and passed through the dimensional reduction algorithm. REC domains from P. putida are displayed as x markers; all other REC domains are displayed as circles. By annotating points of two-component systems from P. putida that are known to be among the recombined cluster, PP_1066 and PP_3551, the location of the recombined cluster on the dimensionally reduced map becomes apparent.

Figure S7. GFP reporters demonstrate regulation by carboxylic acids.
(A) Fold-change of the median fluorescence intensity (MFI) of WT strains (or relevant deletion strains) bearing GFP reporter plasmids of the indicated upstream promoter region (empty vector (EV), p2435, p1400, p3553) driving expression of GFP. Strains grown with glutamic acid (top, red), butyric acid (middle, blue), or alpha-ketoglutaric acid (bottom, yellow) were compared to strains grown without an inducer. Center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points with black diamonds, outliers; n = 3. (B) Baseline fluorescence without an inducer for each reporter plasmid, using either FL1 (left) or FL3 (right) for GFP detection (depending on configuration at the time of experiment). Center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points with black diamonds, outliers; n = 3. (C) Histograms showing single cell fluorescence data from all independent replicates, using either FL1 or FL3 for GFP detection (indicated below each plot), for each reporter strain in either the WT (top) or relevant deletion background (bottom) grown in defined glucose media without an inducer (gray) or with glutamic acid (red), alpha-ketoglutaric acid (yellow), or butyric acid (blue).
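The quantities plotted in (A) and (B) can be sketched as below: a ratio of median fluorescence intensities, plus the box plot convention stated in the legend. The function names and the fluorescence values are hypothetical, and whether fold-change was computed per replicate or on pooled cells is not stated in the legend.

```python
import numpy as np

def mfi_fold_change(induced, uninduced):
    """Ratio of median fluorescence intensities (induced / uninduced)."""
    return float(np.median(induced) / np.median(uninduced))

def box_summary(values):
    """Median, quartiles, whisker ends within 1.5x IQR, and outliers."""
    v = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = v[(v >= low) & (v <= high)]
    return {
        "median": float(med),
        "q1": float(q1),
        "q3": float(q3),
        "whisker_low": float(inside.min()),
        "whisker_high": float(inside.max()),
        "outliers": v[(v < low) | (v > high)].tolist(),
    }

# Hypothetical single-cell fluorescence values.
fold = mfi_fold_change([200, 400, 600], [100, 200, 300])
summary = box_summary([1, 2, 3, 4, 100])
print(fold, summary)
```

Using the median rather than the mean keeps a few very bright cells from dominating the fold-change.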

Figure S8. Identifying positions of covariation in REC and DHp domains.
Covariant residues were identified by mutual information scoring of residues between REC domains and DHp-CA domains in pairs of two-component systems (left panels). Pairs of REC and DHp-CA domains were identified by their syntenic relationship to each other in their respective genomes. Paired REC and DHp-CA domains show mutual information scores >1 (bottom panels) at 7 positions in the REC domain; these positions are covariant with DHp-CA domains and are interpreted as biochemically relevant for interaction and specificity of the REC domain. Randomly shuffled pairs of REC and DHp-CA domains do not show mutual information scores above the cutoff (right panels).
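Mutual information between one REC alignment column and one DHp-CA alignment column, together with the shuffled-pair control, can be sketched as follows. This is a plain, uncorrected estimate in bits; the units and any corrections used in the study are not stated in the legend, and the toy columns are hypothetical.

```python
import math
import random
from collections import Counter

def mutual_information(col_a, col_b):
    """Mutual information (bits) between two equally long alignment columns.

    Position i in each column must come from the same two-component system pair.
    """
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * math.log2(c * n / (pa[a] * pb[b]))
        for (a, b), c in pab.items()
    )

# Hypothetical columns: the REC residue fully determines the DHp-CA residue.
rec_col = "DDEEKKRR"
dhp_col = "KKRRDDEE"
paired = mutual_information(rec_col, dhp_col)

# Control: break the pairing by shuffling one column.
rng = random.Random(0)
shuffled_col = list(dhp_col)
rng.shuffle(shuffled_col)
shuffled = mutual_information(rec_col, "".join(shuffled_col))
print(paired, shuffled)
```

Shuffling preserves each column's residue frequencies, so any drop in the score reflects the loss of pairing rather than a change in composition.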

Figure S10: Changing conserved and/or selected residues in PP_1066 breaks condition-specific responses.
(A) Histograms showing single cell fluorescence data from all independent replicates using FL1 for GFP detection for each P. putida ∆PP1066∆PP3551 double deletion strain bearing a PP_1066 complementation plasmid grown in defined glucose media as indicated in Fig S7. (B) Fold-change of the median fluorescence intensity (MFI) of complementation strains grown with glutamic acid (left) or butyric acid (right) compared to strains grown without an inducer. Center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points with black diamonds, outliers; n = 3. (C) Baseline fluorescence using FL1 for GFP detection without an inducer for each complementation plasmid. Dashed line shows the fluorescence baseline for the empty vector (EV) as shown in Figure S7. Center line, median; box limits, upper and lower quartiles; whiskers, 1.5x interquartile range; points with black diamonds, outliers; n = 3.

Table S2. Results summary from non-synonymous/synonymous rate ratio tests.
The 43rd and the 75th codons of the reverse-translated REC domain from Alphaproteobacteria alignments were found to be under episodic diversifying selection by non-synonymous/synonymous rate ratio tests.

Table S3. Summarized results of DAP-seq experiments showing predicted targets of RRs referenced in this study.

Table S4. Minimal media recipes used in this study.

Table S5. (separate .xlsx file)
List of primers used in this study.

2. Grow strains overnight (ON) in LB in a 96 deep-well plate. Back-dilute 5 µL of ON culture into wells of 995 µL autoinduction media. Grow strains at 37˚C shaking at 250 RPM for 5-6 hours, then transfer plates to 17˚C, 250 RPM for overnight growth.
3. Pellet cultures in a centrifuge at 3214 x g, pour off the supernatant and store at -20˚C for no more than 1 week.

Storage Buffer is expelled from IMAC resin tips by performing a high pressure blow-out over the Storage Buffer reservoir.
b. IMAC resin tips are washed of remaining ethanol by suspending and dispensing 100 µL water for 1 minute.
c. IMAC resin tips are equilibrated by suspending and dispensing 100 µL Wash Buffer for 1 minute.

3. Protein Binding:
a. IMAC resin tips bind metal affinity tagged protein by suspending and dispensing 100 µL clarified lysates (stored in 96-well plate from step 1a-c) for 10 minutes.
b. The tips are then washed by suspending and dispensing 100 µL Wash Buffer for 1 minute in the wells of a 96-well plate containing 300 µL Wash Buffer per well.

To ensure no residual buffers from the previous steps remain on the IMAC resin tips before elution, tips are dried with a custom protocol, in which liquid remaining in the tip undergoes a high pressure blow-out.
b. The tips are then touched to a kim-wipe stabilized with a tip-box lid. The kim-wipe collects any residual liquid and the tips are dried for the next steps.

6. Elution:
a. Protein-DNA complex bound to IMAC resin is eluted by suspending and dispensing 25 µL Elution Buffer in the wells of a 96-well plate containing 25 µL Elution Buffer per well.
b. Eluted Protein-DNA can be stored at -20˚C for 1 week before library preparation.
c. Optional QC: Sample a few wells to check for protein expression by western blot.
d. Note: Downstream removal of salts and imidazole is not necessary at this stage, as this protocol leverages a reagent-tolerant PCR master-mix for library preparation.

7. Final washing and storage:
a. IMAC resin tips are then washed in a reservoir containing 150 mL Elution Buffer, followed by a reservoir containing 150 mL water.
b. The tips are then washed in a reservoir with 150 mL Storage Buffer.
c. Before completing the wash, the tips suspend 150 µL Storage Buffer without dispensing and return to their original box.
d. The tips are then wrapped in parafilm and stored at 4˚C. Tips can be recharged and reused up to 10 times (see step 8).

8. Recharging IMAC resin tips:
a. Equilibrate tips by suspending and dispensing 100 µL Recharge Equilibration Buffer stored in a 50 mL reservoir for 1 minute.
b. Suspend and dispense 100 µL Recharge Stripping Buffer stored in a 50 mL reservoir for 1 minute.
c. Optional: Suspend and dispense 100 µL 1 N NaOH stored in a 50 mL reservoir for 1 minute. Equilibrate tips again by suspending and dispensing 100 µL Recharge Equilibration Buffer stored in a 50 mL reservoir for 1 minute.
d. Optional: Suspend and dispense 100 µL Recharge Salt Buffer stored in a 50 mL reservoir for 1 minute. Equilibrate tips again by suspending and dispensing 100 µL Recharge Equilibration Buffer stored in a 50 mL reservoir for 1 minute.
e. Suspend and dispense 100 µL Recharge Equilibration Buffer stored in a 50 mL