Origins and Molecular Evolution of the NusG Paralog RfaH

In all domains of life, NusG-like proteins make similar contacts with RNA polymerase and promote pause-free transcription, yet they may play different roles in the same cell, defined by their divergent interactions with nucleic acids and accessory proteins. This duality is illustrated by Escherichia coli NusG and RfaH, which silence and activate xenogenes, respectively. We combined sequence analysis and recent functional and structural insights to envision the evolutionary transformation of NusG, a core regulator that we show is present in all cells using bacterial RNA polymerase, into a virulence factor, RfaH. Our results suggest a stepwise conversion of a NusG duplicate copy into a sequence-specific regulator which excludes NusG from its targets but does not compromise the regulation of housekeeping genes. We find that gene duplication and lateral transfer give rise to a surprising diversity within the only ubiquitous family of transcription factors.

tion, NGNs of NusG homologs from archaea, bacteria, and eukaryotes bind to the same sites on the elongating RNAP (4-6), composed of the clamp helix (CH) domain in the largest RNAP subunit (β′ in Bacteria) and the gate loop in the second-largest subunit (β in Bacteria). Once bound, NusG proteins (or their NGNs alone) promote processive, pause-free RNA synthesis (7), a function thought to be particularly important for the synthesis of very long RNAs. Recent structural studies revealed a common molecular basis for antipausing activity among all NusG-like proteins (4,5,8).
NusG homologs comprise two distinct families, which are correlated with the architecture of their respective target RNAPs (Fig. S1). In bacteria, NusG binds to a "minimal" RNAP typically composed of five subunits and promotes uninterrupted RNA synthesis (9). Although NusG can interact with other proteins as part of specialized antitermination complexes (10), it does not require any accessory factors for binding to RNAP. In contrast, in eukaryotes and archaea, which have more complex RNAPs with 12+ subunits, Spt5 has an obligatory partner, a small zinc finger protein, Spt4 (called RpoE in archaea). Spt4 and Spt5 form an extensive interface with several conserved residues (11,12); among them, a universally conserved Glu residue is essential for Spt4/5 binding, and its replacement with Gln (the corresponding residue in NusG) abolishes their interactions (13). Together, Spt4/5 (DSIF in metazoans) promote transcription elongation similarly to NusG (8,14). Spt4 was long thought to simply buttress Spt5 stability (11,14), but recent structural data suggest that it also contributes to maintaining RNAP processivity, for example, during transcription through nucleosomes (15). Spt4 binds to Spt5-NGN opposite the RNAP interaction surface, and several conserved basic residues in Spt4 form a part of the upstream DNA channel (4). In NusG, a positively charged β-hairpin loop is positioned similarly to Spt4 (5,16) and may interact with the upstream DNA duplex (17); large modulatory domains present in place of the β-hairpin in some NusG proteins may contribute to DNA interactions (2,13). The presence of the β-hairpin is incompatible with an auxiliary protein binding to NusG in a manner similar to the way Spt4 binds to Spt5 (11); accordingly, Spt5 proteins do not have insertions at this position (Fig. S1). Within a given cell, NusG and its paralogs can be viewed as alternative transcription elongation factors which compete for binding to RNAP, similarly to σ initiation factors (7).
This analogy is strengthened by the fact that NusG and σ (or Spt5 and TFE in Archaea) share the binding site on RNAP (18,19). However, in stark contrast to σ factors, which perform the same function at their cognate promoters, NusG-like proteins play surprisingly multifaceted roles, as can be illustrated in Escherichia coli, which encodes two of the best-characterized members of this family: an abundant and essential housekeeping NusG protein and its scarce nonessential specialized paralog RfaH (20). NusG promotes productive RNA synthesis as part of antitermination complexes (10) or by coupling transcription to translation via direct contacts with the ribosome (21,22). Yet if RNA is useless or potentially harmful, as is the case with many xenogenes, the NusG-KOW domain interacts with the termination factor Rho to induce early release of the RNA from RNAP (23); in fact, silencing of xenogenes constitutes an essential function of E. coli NusG (24). RfaH plays an opposite role; it activates expression of xenogenes (7), many of which encode virulence factors, and is required for virulence in enteric pathogens (7).
While NusG associates with RNAP transcribing all operons (20), RfaH is recruited to its targets only at operon polarity suppressor (ops) elements in the nontemplate DNA strand in the transcription bubble (20,25). The ops signal halts RNAP to provide more time for RfaH recruitment and forms a short DNA hairpin that interacts with the RfaH-NGN to induce RfaH transformation from an autoinhibited state to an activated state (26) (Fig. 1). Once bound, RfaH excludes NusG from the transcribing RNAP, thereby insulating it from Rho, and activates translation by recruiting the ribosome (20,27). Extensive genetic, biochemical, and structural data available for RfaH and NusG provide a detailed molecular context for understanding their effects on gene expression. While both proteins interact with similar regions on RNAP, RfaH binds much more tightly (5), giving RfaH an advantage in competing with the 100-fold more abundant NusG (28), and only NusG interacts with Rho (23). These proteins make similar contacts with the ribosomal protein S10 (21,27), but in the case of RfaH, a dramatic metamorphosis (in which the entire RfaH-KOW motif refolds from an α-helical hairpin observed in free, autoinhibited RfaH [29] to a β-barrel) is required to expose the residues that interact with S10 (27). This switch is triggered when RfaH binds to the ops-paused RNAP (30).
In contrast, relatively little is known about NusG homologs present in diverse bacteria (31). An emerging view is that specialized NusG paralogs (NusG^SP proteins) function as dedicated antiterminators of long, difficult-to-express gene clusters required for adaptation to diverse environments, including human hosts. Bacterial genes shown to be dependent on NusG^SP proteins for expression encode adhesins, capsular polysaccharides, conjugation machinery, polyketide antibiotics, and toxins (7). While RfaH is recruited to ops sites in the leader regions of several unlinked chromosomal targets (20), some NusG^SP proteins are encoded within the operons that they regulate (32,33) and their modes of recruitment are unknown.
In this work, we set out to reconstruct the origins and evolutionary history of RfaH and its relationship to NusG, expanding previous phylogenetic analysis (31) to incorporate the growing number of sequences in public databases and recent experimental insights into the functions of these proteins. Using sensitive profile searches, including those with a newly constructed profile model for RfaH, we revealed the phyletic distribution of NusG and RfaH across the tree of life. Our results show that ancient and recent gene duplication, horizontal gene transfer, and rapid functional divergence of paralogs underlie the evolution of the NusG family. One of these NusG duplications, which occurred in Proteobacteria, led to the emergence of RfaH. Changes within the key functional regions of NusG paralogs suggest that nascent NusG duplicates have gradually morphed into fully specialized RfaH-like regulators by losing contacts with Rho first and acquiring sequence-specific DNA contacts last. We found that NusG homologs are encoded in most plants and photosynthetic protists and in all except severely reduced bacterial genomes. These results support a notion that NusG modulates transcription in nearly every cell that utilizes RNAP of the bacterial type.

RESULTS AND DISCUSSION
In addition to housekeeping NusG/Spt5 proteins, their specialized paralogs are known in bacteria and eukaryotes (31,34). These paralogs are assumed to have arisen by gene duplication, followed by adaptation to unique regulatory demands, e.g., upregulation of virulence genes during bacterial pathogenesis, a key function of several NusG paralogs in Gram-negative bacteria. Among many bacterial NusG paralogs (31), only a handful have been characterized, but even cursory analyses revealed a surprising diversity in their primary sequence, function, and even structure. NusG-like proteins modulate gene expression through a network of contacts with RNAP, nucleic acid signals, and the ribosome (7). In-depth studies of E. coli NusG and RfaH provided atomic-level details of these interactions and identified dramatic conformational changes that underlie their differential recruitment mechanisms (Fig. 1).
New RfaH model. NusG homologs are widely distributed across all three domains of life ( Fig. 2A), but they are very diverse, likely reflecting adaptation to very different niches. This diversity necessitates the use of robust models to investigate the evolution of the NusG family. We needed a model that can reliably distinguish RfaH proteins from the rest of the NusG family. Pfam (35), the leading protein domain database, does not have a specific RfaH model, and its NusG model (PF02357) cannot distinguish NusG from its paralogs. An RfaH-specific model is available in TIGRfam, but this model (TIGR01955) was constructed using only five sequences and was last modified in 2011.  (37) and will be available in its next release.
Distribution of housekeeping NusG. Although presumed to be ubiquitous, NusG was absent in a few (7 out of 711) representatives of COG0250 (38; https://www.ncbi.nlm.nih.gov/COG/). We extended this analysis to a data set of nearly 20,000 representative bacterial and archaeal genomes from the Genome Taxonomy Database (39), to which we refer here as GTDB_reps (see Materials and Methods). In Archaea, Spt5 is widespread (Fig. 2A) but not ubiquitous: using Spt5-NGN as a model, we identified Spt5 in only 789 out of 847 archaeal genomes. A similar trend was observed in bacteria, where 6% of bacterial GTDB_reps genomes had no identifiable NusG proteins (Data Set S1A and -B). The lack of NusG/Spt5 may be due to (i) incomplete genome assemblies or sequencing errors, (ii) gene loss, or (iii) the low sensitivity of the search model. To evaluate these scenarios, we analyzed NusG homolog distribution in ~130,000 bacterial genomes from the NCBI nonredundant database. Among them, 1,879 appeared to lack NusG homologs (Data Set S1C), but no clear pattern has emerged. Moreover, approximately the same fraction of genomes lacked SecE, RecA, and essential ribosomal proteins L5, L6, S2, and S7 (Data Set S1C). The absence of essential core genes in a significant fraction of genomes is most likely due to technical issues arising during genome sequencing/assembly and exposes limitations of this broad-stroke approach, necessitating in-depth analysis. By analyzing 13,140 NusGs (Data Set S1A) using TREND (40; http://trend.zhulinlab.org), we found that nusG is invariably present within a highly conserved operon that encodes the protein translocase SecE and 50S ribosomal proteins. Thus, we further investigated secE-nusG?-rplK-rplA genomic loci in 183 genomes that appear to lack NusG but contain SecE and ribosomal protein L1 (rplA), as well as RecA and L5, L6, S2, and S7 (Data Set S1D).
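The absence rates implied by the counts above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and because the NCBI total is given only approximately (about 130,000 genomes), the second percentage is approximate as well.

```python
def missing_fraction(n_missing: int, n_total: int) -> float:
    """Return the percentage of genomes in which a marker was not found."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_missing / n_total

# Counts quoted in the text.
archaea_total, archaea_with_spt5 = 847, 789
archaea_missing = archaea_total - archaea_with_spt5
archaea_pct = missing_fraction(archaea_missing, archaea_total)

# The NCBI total is approximate, so this value is approximate too.
bacteria_pct = missing_fraction(1879, 130_000)

print(f"Archaea without Spt5: {archaea_missing} genomes ({archaea_pct:.1f}%)")
print(f"NCBI bacterial genomes without NusG: about {bacteria_pct:.1f}%")
```

Under these counts, 58 archaeal genomes (about 6.8%) lack an identifiable Spt5 and roughly 1.4% of the NCBI bacterial genomes lack an identifiable NusG, which is the scale at which assembly artifacts become a plausible explanation.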
To ensure genome completeness, we selected only those NusG-less representatives that have a "complete genome" assembly level (12 total). Analysis of the secE-nusG?-rplK-rplA operons identified 1-nt frameshifts in the nusG open reading frames (ORFs) in 11 genomes. Among these, 9 have sequences of the same species in which nusG is intact, whereas two genomes are present in single copies, albeit with sequences of their NusG-encoding close relatives available (Data Set S1D). The nusG gene was deleted from "Candidatus Evansia muelleri," an endosymbiont with a severely reduced 0.36-Mbp genome. Consistently, six out of seven NusG-less COG0250 representatives have genomes smaller than 0.28 Mbp, whereas the remaining genome is incomplete.
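A single-nucleotide frameshift of the kind described above can be flagged with a simple reading-frame check. The following is a minimal sketch, not the procedure used in this study: the function name and the toy sequence are hypothetical, and the check assumes an ATG start codon and the standard stop codons, whereas real bacterial genes may also start with GTG or TTG.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def has_intact_orf(seq: str) -> bool:
    """Return True if seq is an uninterrupted ATG...stop reading frame."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0 or not seq.startswith("ATG"):
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[-1] not in STOP_CODONS:
        return False
    # An in-frame stop before the last codon would truncate the protein.
    return not any(codon in STOP_CODONS for codon in codons[:-1])

# Hypothetical toy ORF, not a real nusG sequence.
intact = "ATGGCTGAAACCTGGTAA"
# Deleting one nucleotide (index 5) shifts the downstream reading frame.
frameshifted = intact[:5] + intact[6:]

print(has_intact_orf(intact))        # True
print(has_intact_orf(frameshifted))  # False
```

A check of this kind only reports that the frame is broken; deciding whether the break is a sequencing error or a genuine pseudogene still requires comparison with intact copies from the same species, as done above.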

FIG 2
The distribution of NusG-like factors. (A) NusG/Spt5 factors were identified using NusG and Spt5-NGN Pfam models, respectively, in Aquerium (93; http://aquerium.zhulinlab.org/). The outer ring shows the number of hits; the darker the color, the more hits it represents. The inner rings represent the major taxonomic ranks and supergroups for eukaryotes (93). E, Eukaryota; A, Archaea; B, Bacteria. Plantae are green. (B) RfaH distribution in bacteria on the phylum level. The genome tree was downloaded from AnnoTree (77; http://annotree.uwaterloo.ca/). Phyla with representatives that contain RfaH (based on hits with our new model) are highlighted in purple. Numbers appended after taxa indicate the number of genome hits divided by the total number of genomes. (C) RfaH distribution in Proteobacteria. The percentages of genome hits were calculated for RfaH-containing families with ≥10 genomes. Families with >50% hits are shown in red, and those with <50% hits are shown in blue. A genome tree of representative Gammaproteobacteria is shown. This and other genome trees are maximum-likelihood trees inferred from the alignment of 120 ubiquitous single-copy proteins (53).
These findings suggest that reduced genome endosymbionts may function with reduced transcription machinery. In E. coli, a transcribing five-subunit core RNAP (α2ββ′ω) associates with NusA and NusG across the entire genome (20); both Nus factors are essential in wild-type E. coli. We wondered if NusA and ω, which acts as a chaperone and is not essential (41), could also be absent in endosymbionts. We analyzed complete genomes ranging from 0.11 to 5+ Mbp (Data Set S1E). We found that all genomes smaller than 0.2 Mbp did not encode NusG or NusA, whereas genomes larger than 0.36 Mbp encoded both proteins. In genomes bridging these groups, all possible NusA/NusG distribution patterns were observed, sometimes varying between genomes of the same species. Interestingly, ω is absent from many endosymbionts (Data Set S1E), as well as from some free-living bacteria (COG1758). We conclude that all bacterial genomes with the exception of severely reduced genomes encode NusA and at least one NusG family protein. While this conclusion may appear trivial in the case of the "ubiquitous" regulator, nusG has been shown to be dispensable in some model organisms grown under laboratory conditions, such as Bacillus subtilis (42), and can even be deleted in E. coli lacking toxic prophages (43), albeit at a marked fitness cost. Clearly, bacterial survival and adaptation to complex environmental conditions impose requirements different from those of growth in rich medium at an optimal temperature.
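The genome-size pattern reported above amounts to a three-way rule, which can be written down explicitly. This is only a sketch of the observation, not a predictive model: the function name is hypothetical, and the treatment of the exact boundary values (0.2 and 0.36 Mbp) as part of the variable range is an assumption.

```python
def expected_nus_pattern(genome_size_mbp: float) -> str:
    """Summarize the NusA/NusG pattern observed for a given genome size."""
    if genome_size_mbp <= 0:
        raise ValueError("genome size must be positive")
    if genome_size_mbp < 0.2:
        return "neither NusA nor NusG"
    if genome_size_mbp > 0.36:
        return "both NusA and NusG"
    # Between the two thresholds, every combination was observed.
    return "variable"

for size in (0.11, 0.36, 5.0):
    print(f"{size} Mbp: {expected_nus_pattern(size)}")
```

With this boundary choice, the 0.36-Mbp genome of "Candidatus Evansia muelleri," which lacks nusG, falls in the variable range rather than in the group that encodes both factors.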
Expansion of NusG taxonomic presence. Realizing that NusG is not restricted to prokaryotes (Fig. 2A), we investigated its distribution further. Using phylogenetic profiling with the most recent Archaeplastida taxonomy (44), we established that, in addition to Spt5, NusG homologs are encoded in the genomes of all major land plant and algal lineages except for some green algal species (Data Set S1F). In addition to identifying NusG homologs in Archaeplastida, we identified them in the genomes of various phyla of photosynthetic chromists (Fig. 3A and Data Set S1F). All genomes in which we could not identify NusG were of poor quality and only partial. All identified NusG homologs in Plantae and Chromista are encoded in the nuclear genomes, except in the Paulinella genus. We hypothesize that these "bacterial" regulators have been retained to assist RNA synthesis by plastid-encoded RNA polymerase (PEP) of the bacterial type. Several lines of evidence support this hypothesis. First, a NusG homolog of a model organism, Arabidopsis thaliana, annotated as "plastid transcriptionally active 13" protein (pTAC13), has been identified as a component of the active transcriptional machinery in chloroplasts (45). Second, a Rho ortholog has been shown to terminate transcription by Arabidopsis PEP (46). Finally, ChloroP 1.1 (47) predicted the presence of a chloroplast transit signal in several newly identified NusG-like proteins (Data Set S1F). Pervasive plastid transcription has been documented in protists (48,49).
In rhizarian amoebas of the Paulinella genus, nusG is carried in the remnants of a bacterial genome: a photosynthetic organelle called chromatophore. Paulinella representatives formed an evolutionarily recent symbiotic relationship with a photosynthetic cyanobacterium independently from the primary endosymbiosis that gave rise to plastids in Archaeplastida (50,51). Our phylogenetic analyses revealed that Paulinella NusG is nested within the bacterial NusG cluster in the branch with Synechococcus (Fig. 3A), which is considered to be the ancestor of chromatophores (52).
Phylogenetic analysis showed that eukaryotic NusG sequences from Plantae and Chromista formed clusters separate from bacterial and archaeal NusGs (Fig. 3A). Comparative genome analysis using plant and Chromista NusG proteins did not identify any single bacterial group to which all eukaryotic NusG proteins would be most similar (Data Set S1G). These data strongly suggest the presence of a progenitor NusG-like protein in the last universal common ancestor (LUCA).
RfaH evolution events. A total of 1,922 RfaH proteins were found in 23 out of 117 phyla of Bacteria (Fig. 2B; Data Set S1H and -I), with ~95% of RfaHs being found in Proteobacteria. Seventy percent and 18% of rfaH genes are found in Gammaproteobacteria and Alphaproteobacteria, respectively (Fig. 2C; Fig. S3). Further analysis revealed that families with a high percentage of hits for RfaH are clustered around the Enterobacteriaceae (Fig. 2C; Fig. S4). Although in the majority of lineages, the rfaH gene is likely a result of vertical evolution, the presence of rfaH-like genes on plasmids and prophages suggests that some RfaHs were acquired via horizontal gene transfer (HGT). To evaluate this possibility, we compared the topologies of phylogenetic trees (Fig. 3B to D; Data Set S1J). The three classes of Proteobacteria on the NusG tree were well separated, and the clades inside each class showed a topology nearly identical to that of the genome tree built using 120 ubiquitous marker genes for microbial classification, bac120 (53). In contrast, the RfaH tree topology was different from that of the genome tree, suggesting that while the evolution of NusG was vertical, HGT events contributed substantially to the evolution of RfaH.
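One common way to quantify how far a gene tree departs from a genome tree is a Robinson-Foulds-style distance, which counts the clades found in one tree but not in the other. The text does not state which comparison method was used, so the sketch below is a swapped-in illustration rather than the procedure of this study. Trees are written as nested tuples of leaf names, and the taxon labels A to D are hypothetical.

```python
def clades(tree):
    """Return (leaf_set, clade_set) for a rooted tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves = leaves | child_leaves
        found |= child_clades
    found.add(leaves)  # the clade defined by this internal node
    return leaves, found

def rooted_rf(tree_a, tree_b):
    """Count the clades present in exactly one of two rooted trees."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf set")
    return len(clades_a ^ clades_b)

# Hypothetical four-taxon trees.
genome_tree = ((("A", "B"), "C"), "D")
nusg_tree = ((("A", "B"), "C"), "D")   # congruent with the genome tree
rfah_tree = ((("A", "C"), "B"), "D")   # one regrouping, as HGT might produce

print(rooted_rf(genome_tree, nusg_tree))  # 0
print(rooted_rf(genome_tree, rfah_tree))  # 2
```

A distance of zero corresponds to the vertical pattern described for NusG, whereas nonzero values indicate incongruence of the kind that HGT can cause; incongruence alone does not prove HGT, because poorly resolved trees can also differ.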
To study RfaH evolution in more detail, we analyzed RfaH distribution in two well-studied families of Gammaproteobacteria: Enterobacteriaceae and Pseudomonadaceae. Among 486 genomes of Enterobacteriaceae, ~84% have RfaH. A previously defined representative genome data set of Enterobacteriaceae (54) was used for closer examination of RfaH distribution (Fig. 4). Among these genomes, three contained rfaH genes on plasmids, but the best BLAST hits of these plasmid-borne rfaH genes were to chromosomal genes from different strains, suggesting that RfaH can spread via plasmids (Fig. 4). The plasmid RfaH formed a separate branch on a phylogenetic tree (Fig. S5). On the other hand, we observed similar topologies of the RfaH protein and ribosomal trees within Enterobacteriaceae (Fig. 4; Fig. S5). Thus, we conclude that both vertical inheritance and HGT events shape RfaH evolution.
Unlike in Enterobacteriaceae, in which RfaH thrives, ~60% of Pseudomonadaceae lack RfaH (Fig. 2C). To reveal the origins of this different distribution, we expanded our analysis to include 617 representatives of Pseudomonadaceae. Most species containing RfaH are found around the root, suggesting that RfaH was present in the common ancestor and was subsequently lost in some lineages (Fig. S6A); observations that strains within the same species occasionally lose rfaH genes suggest that this process is ongoing (Data Set S1K). Conversely, we also observed rfaH duplications on the chromosome, which occurred mainly in three clades (Fig. S6B). The species of these three clades were isolated from very different environments, including sputum of a cystic fibrosis patient, cocoon mucus of an earthworm, hyperthermic compost, permafrost, plant roots, marine sediment, etc. These findings indicate that RfaH is actively evolving in Pseudomonadaceae through gene loss and duplication, perhaps to enable adaptation to unique ecological niches.
While RfaH is widespread in Proteobacteria, we identified only one genome that encodes RfaH among 1,908 available genomes of Bacteroidota (Bacteroidetes) (Fig. 2B; Data Set S1H and -I). Instead, a divergent NusG^SP is present in approximately half of Bacteroidota. In Bacteroides fragilis NCTC 9343, eight UpxY proteins are encoded within different capsular polysaccharide operons (32). Each UpxY protein activates the expression of its resident operon, while the product of an adjacent upxZ gene interferes with the expression of heterologous upx operons. However, two uncharacterized UpxYs in the NCTC 9343 genome are not accompanied by UpxZ (Data Set S1L) and may perhaps act similarly to RfaH. Both the upxY and rfaH genes are present in bacteria isolated from different niches, including marine and terrestrial environments and animal hosts (Data Set S1L), and may be under pressure to rapidly adapt to changing environments. Phylogenetic comparison of NusG, RfaH, and UpxY reveals that, as judged by the average branch length, UpxY and RfaH evolve faster than NusG (Fig. S7), and both genes show extensive duplication. Thus, we conclude that NusG paralogs rapidly evolve by gene duplication and subfunctionalization.
Steps in the molecular evolution of RfaH. In E. coli, NusG and RfaH bind to the same site on RNAP yet have opposite effects on gene expression. NusG is abundant, essential, and acts genome-wide to aid Rho silencing of xenogenes, whereas RfaH inhibits Rho in just a few horizontally acquired operons that are dispensable for survival but necessary for virulence. Transformation of a NusG duplicate into a fully specialized RfaH protein requires several key events: (i) loss of binding to Rho, which is an essential function of NusG (43); (ii) an increased affinity for RNAP (5), which enables RfaH to compete with 100-fold more abundant NusG (28); and (iii) target-specific recruitment, which limits RfaH action to a subset of operons, thereby preventing dysregulation of NusG-controlled genes (20). Recent structural and functional analyses of E. coli NusG and RfaH identified individual residues responsible for their differences, allowing us to investigate the molecular evolution of this family (Fig. 5; Data Set S1M).
Our analysis allowed for the identification of a group of uncharacterized proteins homologous to RfaH. Phylogenetic reconstruction using Spt5 as an outgroup showed that this group of proteins and RfaH sequences are in two separate branches and that they both have NusG from Desulfurobacterium sp. strain TC5-1 as their common ancestor (Fig. 5A). Desulfurobacterium sp. TC5-1 belongs to Aquificae, which are thought to be among the most deeply diverging bacterial lineages, along with Thermotogae and Thermodesulfobacteria (55).
We previously proposed that the NusG paralog first lost its ability to bind Rho (Fig. 5B), most likely by altering the Rho contact residues in the NusG-KOW motif (20). Our current data support this scenario. We recently found that a conserved 5-residue loop of NusG, including residues I164, F165, and G166, makes key contacts with Rho (23); furthermore, this loop enables RfaH binding to Rho upon replacement of a loop in RfaH, which contains residues L145-I146-N147 at the corresponding positions (23). Our analysis reveals that the Rho-binding residues were lost by RfaH early on (Fig. 5A), which might be expected given that the opposite effects on Rho termination underlie cellular functions of NusG and RfaH.
Next, we envisioned that increased hydrophobicity of the NGN led to a protein with a high affinity for RNAP, which was able to compete with NusG. The RNAP β′ CH domain interacts with a hydrophobic patch on the NGNs of NusG and RfaH (5). RfaH NGN is more hydrophobic, and RfaH outcompetes NusG in vitro and in vivo (5,20), even though NusG outnumbers RfaH 100:1 (28). RfaH residue F56 is required for binding to RNAP, and its replacement with Leu, the corresponding residue in NusG, confers binding defects (56). F56 is present in RfaH, unknown proteins, and NusG of Desulfurobacterium sp. TC5-1 (Fig. 5A), suggesting that stable interactions with RNAP are important for keeping RfaH in the game of evolution by preventing its displacement by a more abundant NusG. In contrast, F81 in RfaH or the corresponding G95 in NusG makes contact with RNAP in both proteins and is not highly conserved.
Finally, NusG^SP had to become soluble and to evolve a sequence-specific recruitment mechanism to control several targets in trans. In autoinhibited RfaH, the KOW domain, which is folded as an α-helical hairpin, unlike KOW domains of all other NusGs, shields a hydrophobic surface on the NGN that serves as an RNAP-binding site (29). An opposite side of the NGN contains a patch of residues that recognize the ops DNA (Fig. 1), which folds into a small hairpin on the RNAP surface (26). In addition to making direct contacts with the NGN, ops halts RNAP to facilitate RfaH recruitment (26); ops-like sequences induce pausing of phylogenetically diverse RNAPs (57).
Nearly all ops bases are required for RfaH function, and several RfaH residues directly contact the ops DNA hairpin (5,26). We reason that such a complex mechanism must have evolved incrementally, perhaps with NusG^SP initially binding to a paused RNAP and then learning to recognize DNA. Mapping of the RfaH DNA-binding determinants on the phylogenetic tree (Fig. 5A) is consistent with a sequential acquisition of residues that bind DNA: K10 (F in NusG) acquisition preceded the emergence of RfaH, whereas R73 arose later.
We believe that autoinhibition controls RfaH recruitment indirectly, by making RfaH binding to RNAP dependent on the presence of the ops signal. RfaH residues E48, I93, and F130 are required for autoinhibition; their replacement allows sequence-independent, NusG-like recruitment of RfaH (27,58). RfaH contacts with the ops-paused complex relieve autoinhibition, exposing the RNAP-binding site on the NGN (30). The acquisition of residues that mediate interdomain interactions coincides with that of the DNA-binding residues (Fig. 5A), consistent with autoinhibition and ops contacts acting in concert. In summary, our analysis supports a sequential transformation of NusG into RfaH in which the exclusion of Rho binding and increased binding to RNAP precede sequence-specific recruitment to the elongation complex (Fig. 5B).
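The residue-level reasoning above can be organized as a small lookup that scores how many RfaH-type determinants a homolog carries in each functional group. This is an illustrative sketch only: the function name is hypothetical, the input is assumed to come from an alignment already mapped to E. coli RfaH numbering, and the example homolog is invented to resemble an intermediate that has gained RNAP affinity but not full ops recognition or autoinhibition.

```python
# RfaH residues named in the text, grouped by the function they support.
RFAH_DETERMINANTS = {
    "RNAP binding": {56: "F"},
    "ops DNA binding": {10: "K", 73: "R"},
    "autoinhibition": {48: "E", 93: "I", 130: "F"},
}

def score_determinants(aligned_residues):
    """Return {group: (matches, total)} for one homolog.

    aligned_residues maps E. coli RfaH positions to the residue found
    at the aligned position in the homolog (an assumed input format).
    """
    report = {}
    for group, expected in RFAH_DETERMINANTS.items():
        matches = sum(
            1 for position, residue in expected.items()
            if aligned_residues.get(position) == residue
        )
        report[group] = (matches, len(expected))
    return report

# Hypothetical intermediate, not a real sequence.
intermediate = {10: "K", 48: "Q", 56: "F", 73: "A", 93: "V", 130: "F"}

for group, (hit, total) in score_determinants(intermediate).items():
    print(f"{group}: {hit}/{total}")
```

Applied across a phylogeny, a tally of this kind would make the proposed order of events visible as groups that fill in at different depths of the tree, although residue identity alone cannot confirm function.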
RfaH targets and gene neighbors. While E. coli RfaH is monocistronic and acts in trans, other NusG^SP proteins, such as Myxococcus xanthus TaA (33) and UpxY (32), are encoded within their target operons. We wondered whether RfaH-like proteins, which display significant variations in their functional regions (Fig. 5A), could fall into different groups, perhaps associated with particular regulatory contexts. Markov clustering of all RfaH sequences identified in this study revealed eight distinct clusters, CL1 to CL8 (Fig. 6A; Fig. S8; Data Set S1N). Using TREND (40), we found that, unlike with the invariant gene neighborhood of nusG (see above), the gene neighbors of rfaH were highly diverse; they encoded polysaccharide biosynthesis enzymes, nucleoid-associated protein H-NS, toxin-antitoxin systems, secondary metabolites, Tat protein secretion system, etc.
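For readers unfamiliar with Markov clustering, the core of the algorithm alternates two matrix operations, expansion and inflation, on a column-normalized similarity matrix until groups of sequences separate. The sketch below is a bare-bones pure-Python illustration under stated assumptions: the inflation value, the number of iterations, the membership cutoff, and the toy similarity matrix are all hypothetical, and the parameters actually used for the RfaH clusters are not given in this section. Production analyses would use dedicated MCL software on real pairwise similarity scores.

```python
def normalize_columns(matrix):
    """Scale each column so that it sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]

def expand(matrix):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def inflate(matrix, power):
    """Inflation step: raise entries to a power, then renormalize columns."""
    return normalize_columns([[x ** power for x in row] for row in matrix])

def markov_cluster(similarity, inflation=2.0, iterations=50):
    """Return clusters as sorted lists of node indices."""
    n = len(similarity)
    # Self-loops keep every column sum positive.
    matrix = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
              for i in range(n)]
    matrix = normalize_columns(matrix)
    for _ in range(iterations):
        matrix = inflate(expand(matrix), inflation)
    found = set()
    for row in matrix:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            found.add(members)
    return sorted(sorted(group) for group in found)

# Hypothetical similarities: sequences 0-2 resemble each other, as do 3-4.
toy_similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
clusters = markov_cluster(toy_similarity)
print(clusters)
```

In this toy case the two groups share no similarity, so they separate trivially; with real data, the inflation parameter controls how finely weakly connected groups are split, which is why the number of clusters obtained depends on that choice.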
To assess whether each cluster could be associated with a subset of genes, we assigned their gene neighbors to clusters of orthologous groups (COG) categories (Fig. 6B) (38). Similarly to E. coli RfaH, which is included in CL1, RfaHs of CL1 were not strongly associated with a particular COG category, although H (coenzyme metabolism) and U (secretion) genes were frequent. These diffuse-pattern proteins act in trans on distant targets. In contrast, genes involved in cell envelope biogenesis (M), which are known targets of NusG^SP regulators, were overrepresented among neighbors of CL2 to CL8; glycosyltransferases, nucleoside-diphosphate-sugar epimerases, and exopolysaccharide biosynthesis functions were most common (Fig. 6B; Fig. S9A). Notable differences exist among these clusters (Fig. 6B; Fig. S9A). CL1 is frequently adjacent to Sec-independent protein secretion pathway functions (U). CL4 is associated with a helix-turn-helix (HTH) transcriptional regulator (K). CL6 neighbors encode undecaprenyl pyrophosphate synthase, involved in terpenoid biosynthesis (I), and nucleoid-associated protein H-NS (R), whereas CL7 comprises a group of diverse RfaHs from Shewanella that are encoded within putative exopolysaccharide operons (Fig. S9B), an arrangement resembling B. fragilis operons controlled by diverse UpxY proteins (32). Many CL7 genes are adjacent to signal transduction (CheY) and envelope biogenesis (ABC transporter) genes, but their relative orientations differ among CL7 members.
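The association between clusters and COG categories described above reduces to tallying the category letters of neighboring genes for each cluster and comparing their relative frequencies. A minimal sketch follows; the function name and the neighbor list are hypothetical, and the toy counts are invented and do not reproduce the values in Fig. 6B.

```python
from collections import Counter, defaultdict

def cog_profile(neighbors):
    """Return {cluster: {COG letter: fraction of that cluster's neighbors}}.

    neighbors is an iterable of (cluster_name, cog_letter) pairs.
    """
    counts = defaultdict(Counter)
    for cluster, letter in neighbors:
        counts[cluster][letter] += 1
    profile = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        profile[cluster] = {letter: n / total for letter, n in counter.items()}
    return profile

# Invented example: a diffuse cluster and an envelope-biased cluster.
toy_neighbors = [
    ("CL1", "H"), ("CL1", "U"), ("CL1", "M"), ("CL1", "K"),
    ("CL2", "M"), ("CL2", "M"), ("CL2", "M"), ("CL2", "I"),
]
profile = cog_profile(toy_neighbors)
for cluster in sorted(profile):
    print(cluster, profile[cluster])
```

A real analysis would additionally compare each fraction with a genome-wide background to decide whether a category is overrepresented rather than merely common.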
In addition to activating several chromosomal targets, RfaH activates an F plasmid tra operon, which encodes a type IV secretion system (Fig. 6C) and is required for conjugation (59). Other plasmids encode resident NusG^SP proteins in their tra operons. As we await experimental assessment of their functions, this genetic syntax suggests that plasmid NusG^SP acts as an antiterminator of tra operons, which are among the longest bacterial operons and are thus expected to be prone to premature termination. Carrying a resident antiterminator confers a significant advantage to plasmids that, unlike F, are transferred between different species. Conjugative plasmids are major contributors toward the clinical dissemination of antibiotic resistance, and some of these plasmids encode NusG^SP proteins (60,61).
RfaH and other NusG^SP proteins are required for the expression of very diverse macromolecules, including adhesins, antibiotics, capsular polysaccharides, toxins, etc. The most obvious common feature of NusG^SP targets is their length (Fig. 6C). A shared ability of all NusG-like proteins to make RNA synthesis more efficient suggests a mechanism in which NusG^SP-bound RNAP ignores intragenic termination signals; consistently, NusG^SP is annotated as an antiterminator. However, while RfaH increases gene expression several hundredfold, its antitermination activity makes only a minor contribution to its effects in vivo (62). Instead, RfaH excludes NusG from RNAP and promotes ribosome recruitment, thereby inhibiting premature RNA release by Rho (27). Furthermore, by coupling RNAP to the ribosome (27), RfaH may enable the complete synthesis of long polypeptides, such as a giant 5,559-amino-acid-long nonfimbrial adhesin encoded by Salmonella pathogenicity island IV (63) (Fig. 6C). Similarly, LoaP-like regulators (31) may promote translation of 4,200- and 5,200-amino-acid-long polyketide synthases in the Bacillus amyloliquefaciens dfn operon.
The marked diversity of their gene neighborhoods supports a view that RfaH-like regulators act on any operon, once recruited; indeed, E. coli and Klebsiella pneumoniae RfaH activate expression of the Photorhabdus luminescens lux operon, as long as the ops element is present in the leader region (64). However, in this work, we show that different types of RfaH-like proteins are associated with different classes of neighbors (Fig. 6B), a correlation that may reflect their evolutionary history or distinct mechanisms of recruitment. E. coli RfaH is the only representative for which a detailed mode of recruitment is known, and future studies are required to address this question.
Concluding remarks. The only ubiquitous family of transcription factors comprises two very different classes of regulators. One class includes essential general elongation factors that coevolved with RNAP since the LUCA (1). These NusG-like core regulators are recruited to RNAP once it escapes from a promoter, replacing transcription initiation factors that bind to the same site (18,19), and remain associated with RNAP transcribing all genes (20,65). Here, we show that the bacterial NusG protein is present in genomes of all cells that utilize bacterial RNAPs, except a few endosymbionts and some algae. What makes NusG indispensable?
Although their sequences have diverged considerably, bacterial, archaeal, and eukaryal factors make remarkably similar interactions with RNAP that are thought to increase the enzyme's processivity, acting akin to replicative clamps (66); the NGNs are necessary and sufficient for RNAP modifications (14,29,67). This antitermination function of NusG, reflected in genome annotations, has long been thought to be its signature activity. However, NusG alone has only modest effects on RNA synthesis (9). Instead, antitermination is achieved through the assembly of large nucleoprotein complexes, e.g., on bacteriophage RNA, in which the NusG-KOW domain makes contact with diverse protein partners (10). In fact, it is through alternative contacts with Rho (23) or ribosome (21) that the NusG-KOW domain determines the fate of the nascent RNA. Multiple Spt5 KOW domains play analogous functions in eukaryotes, coupling RNA synthesis to splicing, polyadenylation, and other cotranscriptional processes (3). Transcription of chloroplast genomes by PEP depends on its binding to several accessory proteins (68), including NusG (45). We speculate that the NusG-KOW domain acts as a hub for PEP complex assembly.
Despite its ubiquity, NusG is a dissociable factor rather than an RNAP subunit, a property exploited by the second class of NusG proteins exemplified by RfaH. These regulators outcompete NusG for binding to RNAP and exert much stronger antitermination effects (5) but must be selectively recruited to only a few targets to avoid misregulation of housekeeping genes (20). In the case of RfaH, targeted recruitment is achieved through a complex DNA-dependent mechanism (26). Here, we show that RfaH-like proteins are rapidly evolving through a combination of HGT and vertical inheritance. We identified eight distinct groups of RfaH that we propose control different sets of genes, sometimes coevolving with their targets. While the RfaH-NGN mediates recruitment to RNAP and DNA, we hypothesize that the RfaH-KOW domain plays key regulatory roles. The KOW domain controls RfaH recruitment indirectly, through autoinhibition (58), is thought to load the ribosome onto mRNA lacking ribosome-binding sites (27), and may interact with some membrane components during secretion of proteins whose expression it activates (69). While RfaH is not strictly essential for growth in the lab, it is critical for expression of the cell wall, capsules, adhesins, siderophores, and conjugative pili, whereas other NusG SP s are essential for the synthesis of capsules and antibiotics (7), molecules that determine bacterial success in natural environments.
Eukaryotes also encode multiple copies of Spt5 (Fig. 2A), and specialized paralogs have been implicated in the regulation of RNA silencing and meiosis (34,70). Thus, all life depends on the NusG-like regulators to balance the expression of housekeeping genes with niche-specific demands. The mechanisms by which this balance is maintained remain to be elucidated.
Construction of a new RfaH model. RfaH (NCBI accession no. NP_418284.1) from Escherichia coli strain K-12 substrain MG1655 was used as a query in BLAST searches against genomes of selected representatives to find potential RfaH homologs. One species from each family of Proteobacteria was selected as a representative. All potential RfaH sequences were verified using a reciprocal best BLAST hit approach (74) (see Fig. S2 in the supplemental material for an example). The final set of 103 RfaH sequences was used to construct an initial multiple-sequence alignment (MSA). Based on the MSA, an initial HMM profile was generated and used to query the UniProt Reference Proteomes database (v. 2019-09). The hits were filtered based on known conserved positions in RfaH and structural information to collect an extended set of RfaH protein sequences. The redundancy of the set was reduced to the 80% identity level by CD-HIT, and a new MSA was generated based on the reduced sequence set. This set was used to generate a final HMM profile. The final profile was used to query the UniProt reference proteome database and to set the trusted and noise cutoffs of the profile.
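The reciprocal best BLAST hit check described above can be sketched as follows. This is a minimal illustration, not the script used in the study: it assumes that hits have already been parsed into (query, subject, bit score) tuples, and all function names and sequence identifiers are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Hit = Tuple[str, str, float]  # (query_id, subject_id, bit_score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Return the highest-scoring subject for each query."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward: Iterable[Hit],
                         reverse: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Keep (query, subject) pairs that are each other's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return {(q, s) for q, s in fwd.items() if rev.get(s) == q}


# Hypothetical example: the E. coli RfaH query searched against a target
# genome (forward), and the candidates searched back against E. coli (reverse).
forward_hits = [
    ("rfaH_Ecoli", "candA", 210.0),
    ("rfaH_Ecoli", "candB", 95.0),
]
reverse_hits = [
    ("candA", "rfaH_Ecoli", 205.0),
    ("candA", "nusG_Ecoli", 60.0),
    ("candB", "nusG_Ecoli", 180.0),
    ("candB", "rfaH_Ecoli", 90.0),
]
verified = reciprocal_best_hits(forward_hits, reverse_hits)
print(verified)
```

In this toy input, candA is retained because its best hit in E. coli is RfaH, whereas candB is discarded because its best reverse hit is NusG.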
Database of species representatives (GTDB_reps). The list of species representatives of bacteria and archaea (release 89.0) was downloaded from the GTDB (39). The genome files (file type: protein FASTA) were retrieved from NCBI using Batch Entrez (https://www.ncbi.nlm.nih.gov/sites/batchentrez). A total of 18,436 bacterial genome files and 847 archaeal genome files were downloaded and used as a database of species representatives in this study, which was named GTDB_reps.
Distribution of NusG and RfaH. The NusG TIGRfam model and the newly built RfaH HMM were used to search against GTDB_reps with HMMER (75). Taxonomy assignment of the collected protein sequences and calculation of the percentage of genome hits were done using custom Python scripts. The results were visualized on phylogenetic trees by FigTree (76). The maximum-likelihood genome trees were downloaded from AnnoTree (77; http://annotree.uwaterloo.ca/).
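The custom scripts themselves are not reproduced here. As a rough sketch of the calculation, the percentage of genomes with at least one hit can be tallied per taxon as shown below; the input layout (a genome-to-taxon mapping and a set of genomes with hits) and all identifiers are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Set


def percent_genomes_with_hit(genome_taxon: Dict[str, str],
                             genomes_with_hit: Set[str]) -> Dict[str, float]:
    """Percentage of genomes per taxon that contain at least one hit."""
    totals: Dict[str, int] = defaultdict(int)
    positives: Dict[str, int] = defaultdict(int)
    for genome, taxon in genome_taxon.items():
        totals[taxon] += 1
        if genome in genomes_with_hit:
            positives[taxon] += 1
    return {taxon: 100.0 * positives[taxon] / totals[taxon] for taxon in totals}


# Hypothetical genome identifiers and taxonomy assignments.
genome_taxon = {
    "genome1": "Proteobacteria",
    "genome2": "Proteobacteria",
    "genome3": "Proteobacteria",
    "genome4": "Proteobacteria",
    "genome5": "Firmicutes",
}
rfah_positive = {"genome1", "genome2", "genome4"}
percentages = percent_genomes_with_hit(genome_taxon, rfah_positive)
print(percentages)
```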

Identification of NusG in Eukaryota. We used the NusG protein sequence (NCBI accession no. WP_012415655.1) from Elusimicrobium minutum to search eukaryotic protein databases. We used BLASTP and PSI-BLAST against the nonredundant database at the NCBI and a BLASTP search against the oneKP database (78), with default parameters (May 2020). Domain identification was carried out using TREND (40) and HHpred (79). Multiple-sequence alignments were built with MAFFT (80) and edited in Jalview (81). A maximum-likelihood phylogenetic tree was constructed using the MEGA X package (82) and edited in the Interactive Tree of Life (iTOL) v4 tool (83).

RfaH evolution events. To study the topology of NusG and RfaH phylogenetic trees, representatives were selected from GTDB_reps (Data Set S1J). One representative genome containing both NusG and RfaH was selected from each family. A total of 82 family representatives of Proteobacteria were selected. A maximum-likelihood bacterial genome tree of family representatives was inferred from a concatenated alignment of 120 ubiquitous single-copy proteins, also known as the bac120 data set (53), using RAxML (84). Maximum-likelihood phylogenetic trees of NusG and RfaH were constructed using FastTree (85) and RAxML (84). The trees constructed by the two methods showed similar topologies. To show examples of evolution events, two families, Enterobacteriaceae and Pseudomonadaceae, were investigated. The maximum-likelihood phylogenetic tree of 16S rRNA sequences of Enterobacteriaceae was from a previous study (54), whereas a maximum-likelihood genome tree of Pseudomonadaceae was inferred from the bac120 data set. The presence of RfaH was determined using the new RfaH model. The maximum-likelihood RfaH tree of Enterobacteriaceae was inferred using FastTree (85).
Phylogenetic tree for molecular evolution study. To study the molecular evolution of RfaH, a data set was compiled with three parts (Data Set S1M). The first part was representative genomes containing both RfaH and NusG. To select these representatives, a maximum-likelihood phylogenetic tree was inferred from 1,922 RfaH sequences (Data Set S1H) by FastTree (85). Then representatives were selected from this phylogenetic tree according to tree depth. The second part was representative genomes containing proteins which have bit scores between trusted and noise cutoffs of the new RfaH model (referred to as unknown NusG SP s). The third part was representative archaeal genomes containing Spt5, which served as an outgroup. The structural alignment was performed with MAFFT-DASH (86). The maximum-likelihood phylogenetic tree was inferred using FastTree with the JTT model (85) and RAxML with the LG4X model (84). The two programs produced nearly identical phylogenetic trees.
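The partition of hits by the trusted and noise cutoffs of the RfaH model can be illustrated with a short sketch. The cutoff values and protein names below are placeholders rather than the values set for the actual profile, and the treatment of scores that fall exactly on a cutoff is an assumption.

```python
def classify_hit(bit_score: float, trusted_cutoff: float, noise_cutoff: float) -> str:
    """Assign a hit to a category based on its bit score.

    Scores at or above the trusted cutoff are called RfaH; scores between the
    noise and trusted cutoffs are called unknown NusG SP; the rest are dropped.
    """
    if noise_cutoff > trusted_cutoff:
        raise ValueError("noise cutoff must not exceed trusted cutoff")
    if bit_score >= trusted_cutoff:
        return "RfaH"
    if bit_score >= noise_cutoff:
        return "unknown NusG SP"
    return "below noise cutoff"


# Placeholder cutoffs and scores, chosen only to exercise each branch.
TRUSTED, NOISE = 150.0, 80.0
example_scores = {"protA": 212.4, "protB": 101.7, "protC": 35.0}
labels = {name: classify_hit(score, TRUSTED, NOISE)
          for name, score in example_scores.items()}
print(labels)
```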
Clustering of RfaH protein sequences. RfaH protein sequences collected by running the new RfaH HMM profile against GTDB_reps were clustered in a stepwise fashion. In step 1, the redundancy of the sequences was reduced at a 95% identity level, giving a final set of 1,481 sequences.
In step 2, reciprocal BLASTP all-vs-all was run using the final set. With the result, an undirected graph was built. The following cutoffs were used to construct the graph edges: an E value less than or equal to 5e-30 and a coverage of ≥80%. The edge weights were initialized using an average of two E values of each reciprocal BLASTP. Using this graph, Markov clustering was performed. An inflation value of 5 was used, as it gave the most efficient clustering. The majority of the sequences ended up in eight coherent clusters.
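The graph construction and Markov clustering steps can be sketched in pure Python as below. This is an illustrative reimplementation of the general expansion and inflation scheme, not the software used in the study. Two details are assumptions: the averaged E value is converted to a similarity weight as -log10(E), and each node receives a self-loop equal to its strongest edge. The example hits are invented.

```python
import math


def build_weights(n, reciprocal_hits, max_evalue=5e-30, min_coverage=0.80):
    """Symmetric weight matrix from (i, j, evalue_ij, evalue_ji, coverage) tuples."""
    weights = [[0.0] * n for _ in range(n)]
    for i, j, e_ij, e_ji, coverage in reciprocal_hits:
        if max(e_ij, e_ji) > max_evalue or coverage < min_coverage:
            continue  # edge fails the E value or coverage cutoff
        mean_e = (e_ij + e_ji) / 2.0
        # Assumption: turn the averaged E value into a similarity weight.
        weights[i][j] = weights[j][i] = -math.log10(max(mean_e, 1e-300))
    return weights


def normalize_columns(m):
    """Scale each column to sum to 1 (in place) and return the matrix."""
    n = len(m)
    for c in range(n):
        total = sum(m[r][c] for r in range(n))
        for r in range(n):
            m[r][c] /= total
    return m


def markov_cluster(weights, inflation=5.0, max_iter=100, tol=1e-6):
    """Alternate expansion and inflation until the matrix stops changing."""
    n = len(weights)
    m = [row[:] for row in weights]
    for c in range(n):  # self-loops (assumption: strongest edge, or 1.0)
        col_max = max(m[r][c] for r in range(n))
        m[c][c] = col_max if col_max > 0 else 1.0
    normalize_columns(m)
    for _ in range(max_iter):
        expanded = [[sum(m[r][k] * m[k][c] for k in range(n)) for c in range(n)]
                    for r in range(n)]
        inflated = normalize_columns([[v ** inflation for v in row]
                                      for row in expanded])
        change = max(abs(inflated[r][c] - m[r][c])
                     for r in range(n) for c in range(n))
        m = inflated
        if change < tol:
            break
    # Each nonzero row is an attractor; its nonzero columns form one cluster.
    clusters = {tuple(c for c in range(n) if row[c] > 1e-8) for row in m}
    clusters.discard(())
    return sorted(clusters)


# Invented hits among five sequences (indices 0 to 4).
example_hits = [
    (0, 1, 1e-60, 1e-58, 0.95),
    (0, 2, 1e-55, 1e-57, 0.90),
    (1, 2, 1e-50, 1e-52, 0.92),
    (3, 4, 1e-70, 1e-70, 0.99),
    (2, 3, 1e-10, 1e-12, 0.85),  # rejected: E value above 5e-30
    (1, 4, 1e-40, 1e-41, 0.50),  # rejected: coverage below 80%
]
example_weights = build_weights(5, example_hits)
example_clusters = markov_cluster(example_weights, inflation=5.0)
print(example_clusters)
```

Higher inflation values sharpen the contrast between strong and weak edges and tend to yield more, smaller clusters, which is why the inflation parameter is usually tuned empirically.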
Neighbor genes of RfaH. Gene neighborhoods of 1,122 reference rfaH genes (Fig. 6A) were determined using TREND (40); each neighbor gene was assigned to clusters of orthologous groups (COGs) (38,87). The distribution of COGs in the eight RfaH clusters was presented as a heat map generated in R (http://www.R-project.org/).
UpxY search. BLASTP with an E value threshold of <1e-10 was used to query GTDB_reps with eight UpxY protein sequences from B. fragilis NCTC 9343 (32). Representatives were selected to build a maximum-likelihood phylogenetic tree with RfaH and NusG (Data Set S1L). The structural alignment computed by MAFFT-DASH (86) was used to build the phylogenetic tree. The phylogenetic tree was inferred using FastTree with the JTT model (85).
NusG family detection. An entire list of GTDB genome identifiers (release 89.0) was downloaded. Based on the list, 129,663 genomes were fetched from the NCBI and compiled into a complete database. The database was searched using profile HMMs of eight ubiquitous vertically inherited proteins: NusG, SecE, RecA, L1, L5, L6, S2, and S7.
Software. We used the following software: AnnoTree v1.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.