Evolutionary and Comparative Analysis of Bacterial Nonhomologous End Joining Repair

Abstract DNA double-strand breaks (DSBs) are a threat to genome stability. In all domains of life, DSBs are faithfully fixed via homologous recombination. Recombination requires the presence of an uncut copy of duplex DNA which is used as a template for repair. Alternatively, in the absence of a template, cells utilize error-prone nonhomologous end joining (NHEJ). Although ubiquitously found in eukaryotes, NHEJ is not universally present in bacteria. It is unclear as to why many prokaryotes lack this pathway. Toward understanding what could have led to the current distribution of bacterial NHEJ, we carried out comparative genomics and phylogenetic analysis across ∼6,000 genomes. Our results show that this pathway is sporadically distributed across the phylogeny. Ancestral reconstruction further suggests that NHEJ was absent in the eubacterial ancestor and can be acquired via specific routes. Integrating NHEJ occurrence data for archaea, we also find evidence for extensive horizontal exchange of NHEJ genes between the two kingdoms as well as across bacterial clades. The pattern of occurrence in bacteria is consistent with correlated evolution of NHEJ with key genome characteristics of genome size and growth rate; NHEJ presence is associated with large genome sizes and/or slow growth rates, with the former being the dominant correlate. Given the central role these traits play in determining the ability to carry out recombination, it is possible that the evolutionary history of bacterial NHEJ may have been shaped by requirement for efficient DSB repair.


Introduction
Accurate transmission of genetic material from parent to progeny is essential for the continuity of life. However, low rates of error during replication and DNA break-inducing mutagenic agents (such as ionizing radiation and reactive oxygen), while generating diversity for natural selection to act on (Tenaillon et al. 2004), also adversely affect viability and could lead to diseases including cancer (Tomasetti et al. 2017; Significance Unlike recombination-based repair, nonhomologous end joining (NHEJ) repair is ubiquitously present in eukaryotes but sporadically in prokaryotes. Outstanding fundamental questions that remain to be answered in the field include 1) what evolutionary trajectory shaped the bacterial NHEJ distribution and 2) what evolutionary forces are responsible for it. Our study shows multiple independent gain and loss events of NHEJ, resulting in a sporadic distribution across the bacterial domain, with evidence of multiple horizontal gene transfers. We also highlight the association of NHEJ with three central genome characteristics: G-C content, genome size, and growth rate. Approaches used in our study could be extended to other repair pathways in order to understand how they might have contributed to bacterial evolution or vice versa. Srinivas et al. 2019). Therefore, most cellular life forms invest in mechanisms that repair damaged DNA including doublestrand breaks (DSBs).
Two major mechanisms of repair of DNA DSBs are homologous recombination and nonhomologous end joining (NHEJ). Recombination-based repair requires a homologous copy of the DNA around the damage site for repair to occur. In contrast, NHEJ, the subject of the present work, directly ligates the DSB after detecting and binding the break ends (Aniukwu et al. 2008;Bhattarai et al. 2014). Where direct ligation is not possible-at breaks that generate complex ends-a processing step involving the removal of damaged bases and resynthesis of lost DNA is required. Such processing can be error-prone (B etermier et al. 2014). Thus, NHEJ can be a double-edged sword: required for essential DNA repair when a homologous DNA copy is not available, but also prone to causing errors at complex DNA breaks.
NHEJ is a major mechanism of DNA repair in eukaryotes. In bacteria however, homologous recombination-based repair is the most common mechanism of DNA repair; NHEJ, on the other hand, was described only recently (Aravind and Koonin 2001;Doherty et al. 2001), and its prevalence still remains to be systematically elucidated. Unlike eukaryotes, where NHEJ activity is regulated in a cell-cycle-dependent manner, it is unclear as to when NHEJ may be a preferred mode of bacterial DSB repair. Recent reports have shown that NHEJ can contribute to mutagenesis during a specific stage of bacterial growth, such as in stationary phase (Bhattarai et al. 2014;Paris et al. 2015), raising the possibility that availability of a second copy of the genome for repair and/or growth phase may dictate whether recombination or NHEJ is employed for repair.
Experiments in Mycobacterium and Pseudomonas have shown that bacterial NHEJ repair machinery consists primarily of two proteins. The homodimeric Ku binds DNA break ends and recruits the three-domain LigD harboring phosphoesterase (PE), polymerase (POL), and ligase (LIG) activity. These three domains are respectively required to process the ends, add bases if necessary, and subsequently ligate the break. Additionally, the POL domain mediates interaction between Ku and LigD (Della et al. 2004;Stephanou et al. 2007;Aniukwu et al. 2008;Zhu and Shuman 2010). As an exception, studies in Bacillus subtilis found that these bacteria encode Ku along with a two-domain (LIG and POL) LigD (Moeller et al. 2007). In the absence of LigD, it is also possible for LigC, which contains only the LIG domain, to carry out repair (Aniukwu et al. 2008;Bhattarai et al. 2014;Matthews and Simmons 2014). Though Escherichia coli does not encode NHEJ, expression of Mycobacterium tuberculosis Ku and LigD renders E. coli NHEJ proficient (Malyarchuk et al. 2007).
Studies in the early 2000s, with a small number of genomes, suggested that the distribution of NHEJ in bacteria could be patchy (Bowater and Doherty 2006). Collectively, what does it mean for bacteria like E. coli to not code for NHEJ and others like Mycobacterium tuberculosis to harbor it? Given the large population sizes and relatively short generation times that make selection particularly strong, the question of the pressures that determine the deployment of the potentially risky NHEJ assumes importance. In this study, using bioinformatic sequence searches of Ku and LigD domains in the genomic sequences of $6,000 bacteria, we have tried to 1) understand how pervasive their pattern of occurrence is, 2) trace their evolutionary history, and 3) understand what selection pressures could explain it.

Materials and Methods
Data All "complete" and "latest" (assembly_summary.txt; as of January 2017) genome information files for $6,000 bacteria were downloaded from the NCBI ftp website using in-house scripts-whole genome sequences (.fna), protein coding nucleotide sequences (.fna), RNA sequences (.fna), and protein sequences (.faa). All the organisms were assigned respective phylum and subphylum based on the KEGG classification (https://www.genome.jp/kegg/genome.html; as of May 2018).

Identification of NHEJ Repair Proteins
Initial BlastP (Altschul et al. 1990) was run using each of the four protein domain sequences-Ku and LigD (LigD-LIG, LigD-POL, and LigD-PE) from Pseudomonas aeruginosa PAO1 as the query sequence against the UniProtKB database (UniProt Consortium 2019) with an E-value cutoff of 0.0001. The top 250 full-length sequence hits were downloaded from UniProt for both Ku and LigD domains, respectively. A domain multiple sequence alignment (MSA) was made using phmmer -A option with the top 250 hits as the sequence database and P. aeruginosa domain sequences as the query. An hmm profile was built using the hmmbuild command for the MSA obtained in the previous step. To find domain homologs, hmmsearch command with an E-value cutoff of 0.0001 was used with the hmm profile as the query against a database of 5,973 bacterial genome sequences (supplementary table 1, Supplementary Material online). These homolog searches were done using HMMER package v3.3 (Finn et al. 2011). An in-house python script was written to extract the results and assign organisms with Ku and LigD domains. The same methods were used to identify NHEJ components in 243 archaeal genomes (supplementary table 2

Ku and LigD Neighborhood Analysis
In-house python scripts were written to determine the proximity of ku and ligD on the genome using the annotation files. Taking one gene as the reference, the presence of the other gene was checked within a distance of ten genes upstream or downstream inclusive of both strands. The organization of NHEJ genes fell into three categories: 1) both genes on the same strand and within the distance range; this we call as operonic NHEJ, 2) both genes on different strands and within the distance range, and 3) genes outside the distance range. This analysis was done only for NHEJþ organisms which code for LigD containing all the three domains as part of a single protein.

Identification of Ribosomal Ribonucleic Acid Sequences
A 16S ribosomal ribonucleic acid (rRNA) sequence database was downloaded from the Genomic-based 16S ribosomal RNA database (GRD) website (https://metasystems.riken.jp/ grd/; last accessed November 2, 2020). A MSA and a profile of the database was made using muscle v3.8.31 (Edgar 2004) and hmmbuild (Finn et al. 2011), respectively. To detect 16S rRNA homologs in our database of 5,973 bacteria, nhmmer (Wheeler and Eddy 2013) was used with an E-value cutoff of 0.0001 where the GRD-based 16S rRNA sequence profile was used as the query. In-house python script was written to parse the output files and select the best hits for further analysis.

Calculation of Genome Sizes, Growth Rates, and G-C Content
In-house python scripts were written to calculate the genome sizes (GSs), growth rates (GRs), and G-C content of all bacteria in our data set (supplementary table 1, Supplementary Material online). These calculations were made after excluding the plasmid sequence information from the whole genome sequence assembly files. Previous studies have shown a significant positive correlation between a bacteria's GR and the number of rRNA operons harbored in its genome (Klappenbach et al. 2000; Vieira-Silva and Rocha 2010;  Gyorfy et al. 2015;Roller et al. 2016). Therefore, rRNA copy number was taken as a proxy for GR.

GS Randomization Analysis
For this analysis, 920 organisms harboring conventional NHEJ were considered. The total number of coding DNA sequences was used as the proxy for GS. Because the chance of finding a gene in a larger genome is more than that in a smaller genome, 920 genes were drawn at random from a pool of all the genes coming from 5,973 organisms in our data set and the genome coding DNA sequences size (GS) from which each gene was picked was noted. For each such iteration, the median GS was calculated. This was repeated for 100 iterations and a distribution of median GS was obtained. The nonparametric Wilcoxon rank-sum test was used to compare this random distribution with the median GS of NHEJharboring bacteria.

HGT Analysis
Alien Hunter v1.7 (Vernikos and Parkhill 2006) with default options was used to predict horizontally acquired regions (HARs)-based on their oligonucleotide composition-across bacterial genomes present in our data set. In-house python scripts were used to detect Ku and LigD in the predicted HARs. NHEJ was said to be acquired through horizontal gene transfer (HGT) if both Ku and LigD were present in the predicted HARs.
Phylogenetic incongruence between the archaeal-bacterial 16S rRNA-based species tree and the corresponding gene trees (for Ku, LigD-LIG domain, and RecA) was checked using the Approximately Unbiased (AU) statistical test (Shimodaira 2002). AU test was carried out with 10,000 simulations with a constrained versus unconstrained approach (explained next). The MSA for each gene was run in an unconstrained mode for the given model parameters. Then, the alignments were run in a constrained mode with respect to their respective species tree topologies. p-AU gives the probability of identifying the gene family as having evolved according to the species, that is, the evolutionary history of the gene is the same as that of the species. This test was carried out using the -au options and -zb option implemented in IQTREE.
Further, to test the strength with which the aforementioned genes could be vertically transmitted or moved among closely or distantly related bacteria, nonparametric Mantel tests were carried out. Patristic distances were calculated between bacteria in the species tree and a given gene tree, respectively. A Pearson product-moment correlation coefficient r was calculated between these two distance matrices consisting of bacteria shared between the two phylogenies under test. The significance of correlation was assessed via randomization test by conducting 10,000 permutations of distance matrices. The correlation coefficient r was recalculated for each permutation to produce the null distribution and P value was obtained using the one-tailed test using the R package.
RANGER-DTL 2.0 (Bansal et al. 2018) was used to predict the donors and recipients of NHEJ HGT events among and between bacterial and archaeal species. A nonpolytomous rooted 16S rRNA-based tree with archaea as an outgroup was used as a species tree. Optimal rootings for Ku and LigD-LIG domain-based bacterial-archaeal gene trees were determined using the OptRoot program with default options such that the duplication-transfer-loss (DTL) reconciliation cost was minimized. Ranger-DTL program was used to compute the optimal DTL reconciliation of a given rooted species tree-rooted gene tree pair. In case of multiple optimal reconciliations, the program inherently reports an optimal reconciliation sampled uniformly at random. Therefore, the analysis was run for 100 simulations each with a transfer cost T ¼ 1, 2, or 3 (default), that is, a total of 300 simulations. The lower the transfer cost, the more the HGT events allowed during the reconciliations. AggregateRanger was used to compute support values for the most frequent mappings, that is, the donor species, by accounting for the variance due to multiple optimal reconciliations and alternative event cost assignments. An in-house python script was written to back trace the most frequent recipient for a given most frequent mapping for both Ku (supplementary table 3, Supplementary Material online) and LIG domain (supplementary table 4, Supplementary Material online). The results were overlaid on the respective species phylogeny using the features provided in the iTOL server (https://itol.embl.de/login.cgi; last accessed November 2, 2020).
Sequence alignments were viewed and pruned using Jalview (Waterhouse et al. 2009). Principal component analysis of Ku domain sequences was carried out based on the method by Konishi et al. (2019).

Phylogenetic Tree Construction
For the construction of a species tree, one 16S rRNA sequence per genome was extracted into a multi-fasta file using an inhouse python script. For bacteria with multiple 16S rRNA sequences, one 16S rRNA sequence was chosen such that it minimizes the number of Ns in that sequence and has the maximum sequence length. In order to build a pruned phylogenetic tree, 970 bacteria were randomly selected such that a genus was represented exactly once for every NHEJ state. Please note that in the case of nonconventional NHEJ, all its four subcategories, as described in the previous sections, were treated separately, when including bacteria at the genus level. A MSA was built using muscle v3.8.31 (Edgar 2004) with default options. The conserved regions relevant for phylogenetic inference were extracted from the MSA using BMGE v1.12 (Criscuolo and Gribaldo 2010). After the manual detection of the alignment, one spurious sequence was removed and the MSA was built again for 969 bacteria (supplementary table 5, Supplementary Material online). Using IQTREE v1.6.5 (Nguyen et al. 2015), a maximum-likelihood (ML)-based phylogenetic tree was built with the best model chosen as SYM þ R10 (LogL ¼ À118,152.5876, BIC ¼ 249,946.0604). ModelFinder (-m MF option) (Kalyaanamoorthy et al. 2017) was used to choose the best model for the tree construction compared against 285 other models. Branch supports were assessed using both 1,000 ultrafast bootstrap approximations (-bb 1000 -bnni option) (Hoang et al. 2018) and SH-like approximate likelihood ratio (LR) test (-alrt 1000 option) (Guindon et al. 2010).
A similar approach was used to build a phylogeny of 1,403 (supplementary table 6, Supplementary Material online) organisms, nonredundant at the species level, comprising just two NHEJ states: 1) NHEJÀ and 2) conventional NHEJþ.
For the construction of bacterial-archaeal gene trees (Ku, LigD-LIG, or RecA), only bacteria harboring one Ku and one conventional LigD in their genomes were included. Because the motivation behind the construction of gene trees was to study the origin of NHEJ in bacteria, all archaea harboring either Ku or LigD-LIG domain were included in the respective phylogeny. In species with multiple copies of RecA, RecA with the highest alignment length and identity was chosen for a given genome. The species in the bacterial-archaeal 16S rRNA-based phylogeny depended on the species included in the corresponding gene tree. The approach used to build these phylogenies was similar to the one described in the previous two paragraphs.

NHEJ Ancestral State Reconstruction Analysis
To trace the evolutionary history of NHEJ, four discrete character states were defined as follows: Ku only, LigD only, NHEJÀ, and NHEJþ. States were estimated at each node using stochastic character mapping (Bollback 2006) with 1,000 simulations provided as make.simmap() method in the R package phytools v0.6-44 (Revell 2012). The phylogenetic tree was rooted using the midpoint method and polytomies were removed by assigning very small branch lengths (10 À6 ) to all the branches with zero length. The prior distribution of the states was estimated at the root of the tree. Further, by default the method assumes that the transitions between different character states occur at equal rates. This might not always be true, especially with complex traits where it is supposedly easier to lose than gain such characters. Therefore, for the estimation of transition matrix Q, three discrete character evolution model fits were compared: Equal Rates (ER), Symmetric (SYM), and All Rates Different (ARD). This allowed for models that incorporate asymmetries in transition rates. Based on Akaike Information Criterion (AIC) weights, the ARD model (w-AIC ARD ¼1, w-AIC SYM ¼ 0, and w-AIC ER ¼ 0) was chosen as the best fit with unequal forward and backward rates for each character state transition. Finally, Q was sampled 1,000 times from the posterior probability distribution of Q using Markov chain Monte Carlo and 1,000 stochastic maps were simulated conditioned on each sampled value of Q. This strategy was used to reconstruct ancestral states for both phylogenies comprising 969 and 1,403 organisms as described in the previous sections.
Ancestral states for GS, a continuous trait, were reconstructed using fastAnc() method employed in phytools v0.6-44 based on the Brownian motion model. This model was found to be the better fit model as compared with multiple rate model (Bayesian Predictive Information Criterion [BPIC] Brownian < BPIC stable and Proportional Scale Reduction Factor approaching <1.1 well within 1,000,000 iterations), assessed using StableTraits (Elliot and Mooers 2014). The multiple rate model allows for the incorporation of neutrality and gradualism associated with Brownian motion and also includes occasional bursts of rapid evolutionary change. Ancestral states for GR were reconstructed by converting it into a binary trait. An organism was said to be slow growing if it encoded rRNA copy numbers less than or equal to the median rRNA copy number (median ¼ 3) and fast growing otherwise. The reconstruction was carried out using the same approach that was used for estimating NHEJ ancestral states, as described in the previous paragraph.

NHEJ and Genome Characteristics Phylogenetic Comparative Analysis
The two genome characteristics-GS and GR-were compared across bacteria with different NHEJ states. The distributions across bacteria with different NHEJ states were first compared assuming statistical independence of bacteria, using Wilcoxon rank-sum test, wilcox.test() in R. Next, two measures of phylogenetic signal-Pagel's k (Freckleton et al. 2002) and Blomberg's K (Blomberg et al. 2003)-were used for detecting the impact of shared ancestry, for GS and GR across bacteria, using the phylosig() routine in phytools R package (Revell 2012). A phylogenetic analysis of variance (ANOVA), employed in phytools R package v0.6-44 (Revell 2012), was carried out with 1,000 simulations and Holm-Bonferonni correction to control for familywise error rate, based on a method by Garland et al. (1993), to compare the genome characteristics in a phylogenetically controlled manner. Please note that GSs and rRNA copy numbers were log 10 transformed for all the analysis.

Correlated Evolution Analysis
Two relationships-(NHEJ repair and GS) and (NHEJ repair and GR)-were quantified using a statistical framework. To test if changes in genome characteristics occur independently of NHEJ or whether these changes more (or less) likely to occur in lineages with (or without) NHEJ, two models of evolution were considered-independent and dependent. In the independent model, both the traits were allowed to evolve separately on a phylogenetic tree, that is, noncorrelated evolution. In the dependent model, the two traits were evolved in a nonagnostic manner, that is, correlated evolution. The NHEJ repair trait had two repair character states-NHEJÀ (0) and conventional NHEJþ (1). GSs, a continuous trait, were converted into binary state as well. For this, the mean GS of organisms with 0 or 1 NHEJ repair state was computed. A "lower" state (0) was assigned if a value was less than the mean and a "higher" state (1) if the value was more. The same approach was used to convert GRs (rRNA copy number) into a binary state.
A continuous-time Markov model approach was used to investigate correlated evolution between NHEJ repair and the genome characteristics. First, the ML approach (Pagel 1994) was used to calculate log-likelihoods for the two models of evolution per trait pair: 1) NHEJ repair and GS; 2) NHEJ repair and GR. A LR statistic was calculated for both comparisons, followed by a chi-square test to assess if the dependent model was a better fit. The degrees of freedom are given by df chisquare test ¼ (n rate-dependent model -n rate-independent model ). There are eight transition rates in the dependent model across four states (00,01,10,11) and four transition rates in the independent model across two states (0,1; 0,1). Therefore, the test was run with four degrees of freedom.
The ML approach implicitly assumes that the models used for hypothesis testing are free of errors. Therefore, to make the analysis robust, the Bayesian reverse jump Markov chain Monte Carlo (RJMCMC) approach was used to calculate the marginal log-likelihoods of the independent and dependent models of evolution (Pagel and Meade 2006). This approach takes into consideration the uncertainty and minimizes the error associated in calculating the parameters used in each of our model(s), ensuring reliable interpretations. Log Bayes factor was used to assess the better fit out of the two models.
BayesTraits v3 (Pagel and Meade 2006) was used to carry out both ML-and Bayesian RJMCMC-based correlated evolution analysis as described above for both the models. ML was run using the default parameters. Bayesian RJMCMC was run for 5,050,000 iterations, sampling every 1,000th iteration with a burn-in of 50,000. For the estimation of marginal likelihood, a stepping stone sampler algorithm was used where the number of stones was set to 100 and each stone was allowed to run for 10,000 iterations.

Phylogenetic Logistic Regression Analysis
A method developed by Ives and Garland (2010) was used to carry out the phylogenetic logistic regression analysis provided in the R package phylolm v2.6 (Tung Ho and An e 2014) as the subroutine phyloglm() with method ¼ logistic_IG10. NHEJ repair was the binary-dependent variable with two states: NHEJÀ (0) and conventional NHEJþ (1). The two independent continuous variables were GS and GR. Before proceeding with the analysis, these independent variables were checked for multicollinearity by calculating variance inflation factor (VIF), Three models were tested: 1) NHEJ $ GS, 2) NHEJ $ GR, and 3) NHEJ $ GS þ GR. The best model was chosen according to AIC scores.
All the scripts used for analysis were written in python, perl, or R. Statistical tests and data visualizations were carried out in R and iTOL server.
All the files and scripts used and generated in this study can be found at the github repository link-https://github.com/ Mohak91/bacterial_nhej_repair_evolution.

NHEJ Is Sporadically Distributed across Bacteria
To identify NHEJ machinery across bacteria, we used the reference sequences of the Ku domain, and the LIG, POL, and PE domains of LigD from P. aeruginosa to search $6,000 complete bacterial genomes for homologs (see Materials and Methods). We defined bacteria encoding Ku and the complete, three-domain version of LigD as those harboring a conventional NHEJ system. Organisms lacking the POL and/or the PE domains of LigD, and those encoding these domains in separate proteins, were defined as those carrying nonconventional NHEJ ( fig. 1).
We found NHEJ in only $1,300 (22%) genomes studied here. There were various combinations of Ku and LigD domains across these organisms, but a large majority (920) carried conventional NHEJ. Seventy-five percent bacteria harboring conventional NHEJ coded for Ku and LigD in a 10-kb vicinity of each other, with 60% organisms carrying Ku and LigD on the same strand of the 10-kb vicinity. Most bacteria (84%) harboring NHEJ coded for a single copy of Ku, whereas the remaining coded for 2-8 Ku copies in their genomes. For example, as reported by McGovern et al. (2016) and Kobayashi et al. (2008), we identified four Ku-encoding genes in Sinorhizobium meliloti. About two-thirds of NHEJ positive bacteria carried multiple copies of the LIG domain, 37% carried multiple copies of the POL domain, and 8% bacteria had multiple copies of the PE domain (supplementary table 1, Supplementary Material online). We also noticed that 138 (2.3%) organisms encoded Ku and not LigD, and 619 (10.3%) only LigD and not Ku ( fig. 1 and supplementary table  1, Supplementary Material online).

NHEJ Was Gained and Lost Multiple Times through Evolution
We traced the number of NHEJ gains and losses starting from the eubacterial ancestor to the species at the tips of the 16Sbased bacterial phylogenetic tree. To trace the evolutionary history of NHEJ, we defined four discrete character states: Ku only, LigD only (conventional and nonconventional), NHEJÀ, and NHEJþ. Note that an NHEJþ state is defined only when We performed this analysis for a set of 969 genomes in which each genus was represented once for each state (see Materials and Methods).
We first asked if NHEJ was present in the common eubacterial ancestor and, given the sporadicity of NHEJ, subsequently lost in several lineages (supplementary table 8 Supplementary Material online). We assigned a major primary gain to an internal ancestral node if 1) all nodes leading to it from the root had NHEJÀ state; 2) the pp of either NHEJþ, LigD only, or Ku only at that ancestral node was !0.7; 3) a gain of LigD only or Ku only was followed by a transition to NHEJþ; and 4) if it had at least three descendent species. We observed multiple major independent primary gains at ancestral nodes within Bacteroidetes, Actinobacteria, Firmicutes, Acidobacteria, and multiple subclades of Proteobacteria (supplementary table 8, Supplementary Material online). It follows that the common eubacterial ancestor likely did not have NHEJ ( fig. 4).
The gain of NHEJ can be sequential, gaining either Ku only or LigD only followed by the gain of the other component; or it can be a one-step acquisition of both components ( fig. 3B). The most common transition from an NHEJÀ state was to a LigD only state. Also frequent was the direct acquisition of both components to transition from an NHEJÀ to an NHEJþ state. Transition from NHEJÀ to Ku only was negligible. In the reverse direction, a one-step loss of both Ku and LigD is the most likely. Again, the Ku only state is rare.
A one-step transition from NHEJÀ to NHEJþ is likely through HGT. Sixty bacterial genomes belonging to the phyla Alpha-proteobacteria (in particular the Rhizobiales)-and Beta-proteobacteria, and Streptomycetales carried their NHEJ components on plasmids (supplementary table 1, Supplementary Material online). However, based on abnormal word usage statistics (see Materials and Methods), we could not find NHEJ to be a part of the horizontally acquired component of the chromosomes of any bacterial genome. At least two NHEJÀ to NHEJþ transitions occurred close to the root, and it is possible that the predictions of horizontally acquired NHEJ systems made so far may be an underestimate ( fig. 4). We investigate this in greater detail below.
In summary, 1) the common eubacterial ancestor was devoid of NHEJ; 2) NHEJ was gained and lost multiple times; and 3) transitions to a Ku only state are rare.

NHEJ and HGT
Experimental studies in the archaea Methanocella paludicola (Bartlett et al. 2013(Bartlett et al. , 2016Brissett et al. 2013) have confirmed the presence of a functional NHEJ repair, with crystal structures revealing close relationship with the bacterial proteins (Bartlett et al. 2016;White and Allers 2018).
To assess the possibility of horizontal transfer of NHEJ machinery across prokaryotes, we first performed a domain wise search in 243 archaea to complement the data we had assembled for bacteria. These searches revealed the presence of bacteria. These species were included such that each genus was chosen once for each NHEJ state (see Phylogenetic Tree Reconstruction section for further details). The phylum names arranged on either sides of the vertical bars are for representational purposes only. The tip and node labels are colored according to the NHEJ states-red: NHEJÀ; yellow: Ku only; green: LigD only; blue: NHEJþ (conventional and nonconventional). NHEJ state for nodes is shown only when the posterior probability support is >70%; interpreted as change in NHEJ state at that node as compared with shallower phylogenetic depths.
both Ku and LigD-LIG domains in ten archaeal species (see Materials and Methods; supplementary table 2, Supplementary Material online). However, 230 archaeal genomes encoded LigD but no Ku. This lends support to previous reports suggesting that NHEJ is rare in archaea (Bartlett et al. 2013(Bartlett et al. , 2016White and Allers 2018).
In order to check for horizontal transfer events between bacteria and archaea, we used phylogenetic methods based on detecting conflicts between an organismal phylogeny and a phylogeny inferred for Ku and LigD-LIG domains, respectively. This method allowed us to test for any ancient transfers across bacteria as well. We found that NHEJ proteins undergo HGT events at significantly high rate (p-AU ¼ 0; fig. 5B and C; see Materials and Methods) and these are not limited to closely related species or cospeciation events ( fig. 5D; Mantel test, P < 10 À4 , r Ku ¼ 0.25; P < 10 À4 , r LIG ¼ 0.4). We also noted incongruence with respect to the RecA phylogeny (p-AU ¼ 0; fig. 5A) as has been reported before (Eisen 1995;Lang et al. 2013). However, these were limited at best to transfers among closely related species ( fig. 5D; Mantel test, P < 10 À4 , r RecA ¼ 0.94).
An incongruence between a species and gene tree could result due to processes other than HGT, like duplications and losses. Therefore, to predict the most frequent transfer events, we used a reconciliation approach based on the DTL model. DTL employs a parsimonious framework where each evolutionary event is assigned a cost and the goal is to find a reconciliation (possible evolutionary history of gene tree inside a species tree) with minimum total cost. We observed a high rate of HGT between bacterial clades-Firmicutes, Actinobacteria, and Proteobacteria and Archaea, where each of these played the role of a donor and a recipient in Ku ( fig. 6A) and LigD-LIG transfer events ( fig. 6B). We observed that all proteobacterial species-with the exception of delta-proteobacteria, which was a donor of Ku to archaeal recipients-were recipients of both Ku and LigD from archaea or other distantly related bacteria. On one hand, we found evidence of Ku transfers from Archaea to Firmicutes and Actinobacteria and on the other hand, LigD-LIG transfers most likely occurred from Firmicutes and Actinobacteria to Archaea. Together, this raises the possibility of NHEJ transfers between bacteria and archaea. An example of the former is depicted as a MSA of Ku domain sequences belonging to Archaea, Firmicutes, and Actinobacteria in figure 6C (upper panel). This transfer event corresponds to the asterisk marked in figure 6A-corresponding to that between the ancestor of the genus Archaeoglobus and that of Firmicutes and Actinobacteria. We further carried out a principal component analysis of these domain sequences that had evolved from the aforementioned ancestors ( fig. 6D). Along the first principal component, all but two Actinobacteria-Eggerthella lenta and Microlunatus phosphovorus-form a distinct cluster from Archaea and Firmicutes. Along the second principal component, we see two distinct clusters. The cluster on the bottom left consists of anaerobic prokaryotes-Archaea (genus Archaeoglobus), Firmicutes (Clostridium cellulosi and Desulfitobacterium dichloroeliminans), and Actinobacteria (Eggerthella lenta), highlighting the possibility of HGT among these prokaryotes. Another instance of a LigD-LIG transfer is depicted in figure 6C, lower panel. This transfer event corresponds to the asterisk in figure 6B, involving the Actinobacteria-Brevibacterium linens, and Archaea-Archaeoglobus veneficus (coding for both Ku and LigD-LIG domains).
In addition to the evidence supporting HGT of NHEJ components between Archaea and Bacteria, we observed transfers among different bacterial clades as well ( fig. 6A and B; blue arrows). We found Ku transfers between donor-recipient pairs: 1) Alphaproteobacterium (Asticcacaulis excentricus) and common ancestor of Acidobacteria (genus Acidobacterium, Granulicella, and Terriglobus) and 2) common ancestor of Delta-proteobacteria genus Geobacter and Chlamydiae (Parachlamydia acanthamoeba). For LigD-LIG transfers, we observed the following donor-recipient pairs-1) Proteobacteria (Phenylbacterium zucineum) and Acidobacteria (Terriglobus roseus) and 2) common ancestor of Actinobacteria (genus Eggerthella) and Firmicutes (Desulfitobacterium dicholoroeliminans). A full list of donorrecipient events can be found in supplementary files 3 and 4, Supplementary Material online, for Ku and LigD-LIG domains, respectively.
Kanhere and Vingron (2009) carried out HGT detection of bacterial core genes among prokaryotes. They found that a majority of these transfers occurred from bacteria to archaea and that these genes were mostly metabolic genes. Overall, our study is consistent with their observation, with additional evidence showing a possibility of NHEJ transfers from archaea to bacteria as well. We also show evidence of NHEJ transfers between closely and distantly related bacteria. Using the approach used in our study, it remains to be tested how HGT events have shaped noncore genes like other repair pathways throughout evolution in prokaryotes.

NHEJ Occurrence Is Associated with GS, GR, and G-C Content
Recently, Ku-encoding organisms were shown to have higher genomic G þ C content (Weissman et al. 2019). Given its central role in DNA repair, we asked whether any other genome characteristics could also be associated with the presence or absence of NHEJ. First, we verified that the findings of Weissman et al. on the correlation between the presence of Ku and G þ C content held true for NHEJþ states as defined in our study ( fig. 7A and supplementary figs. S4, S5, and S7C, Supplementary Material online) (Weissman et al. 2019). Along with this, we tested two additional characteristics: GS and GR (as measured by the copy number of rRNA operons), both of which could determine the availability or the lack of a homologous template for high fidelity recombination-based repair. We restricted these analyses to conventional NHEJ-harboring bacteria as a proxy for repair proficiency and compared them with NHEJÀ genomes. Data including nonconventional NHEJ are shown in supplementary figures S2 and S3, Supplementary Material online.
Bacteria with NHEJ were found to have larger genomes (median ¼ 5.4 Mb) than those without NHEJ (median¼ 2.9 Mb; Wilcoxon rank-sum test, P < 10 À15 ; fig. 7B and supplementary fig. S7A, Supplementary Material online) and significantly larger than that expected by a random distribution in which the probability of having NHEJ is linearly proportional to GS (Wilcoxon rank-sum test, P < 10 À15 , across 100 simulations; fig. 7D). This relationship was found to be true within the phylum Proteobacteria, Actinobacteria, Bacteroidetes, and Firmicutes as well (supplementary fig. S6, Supplementary Material online). In addition, bacteria harboring NHEJ were found to have significantly fewer rRNA copies (median ¼ 3), and by inference slower GRs, than bacteria without NHEJ (median ¼ 4; Wilcoxon rank-sum test; P < 10 À15 ; fig. 7C). Although the distribution of rRNA copy numbers for genomes without NHEJ was broad, those with conventional NHEJ fell within a narrow range, representing relatively slower growth (supplementary fig. S7B, Supplementary Material online). At the phyla level, this relationship was found to hold true for Proteobacteria and Actinobacteria, whereas there was no significant difference for Bacteroidetes and Firmicutes, respectively (supplementary fig. S8, Supplementary Material online).
In order to confirm the result in a phylogenetically controlled manner, Pagel's k and Blomberg's K were used to first measure whether closely related bacteria tended to have similar GSs and GRs in the data set (see Materials and Methods). These measures suggest that phylogenetic coherence is significantly greater than random expectations for both the genome characteristics (supplementary table 9, Supplementary Material online). Therefore, the distributions of GS between bacteria with different NHEJ status were compared while accounting for the statistical nonindependence of closely related taxa (see Materials and Methods). We found a significant difference in log 10 (GSs) between bacteria with conventional NHEJ and without the repair (phyloANOVA; P ¼ 6 Â 10 À3 ); the characters being mapped on the phylogenetic tree of 969 bacteria with five discrete groups: NHEJÀ, Ku only, LigD only, conventional NHEJþ, and nonconventional NHEJþ. However, we did not observe a significant difference in log 10 (rRNA copy number) between the two groups of bacteria (phyloANOVA; P ¼ 1; see Discussion).
We used ML as well as Bayesian approaches to test whether the observed associations of conventional NHEJ individually with large genomes and slow GRs are indicative of dependent or independent evolution of these traits on the phylogenetic tree (see Materials and Methods). Both suggested that the phylogenetic data fit models of evolution in which conventional NHEJ presence or absence and GS or GR are evolving in a correlated manner (table 1). This strengthens the association between the tested variables in a phylogenetically controlled way. Phylogenetic logistic regression of the conventional NHEJ occurrence with both the continuous independent variables showed, however, that GS is the stronger correlate (see Materials and Methods; table 2).
As a case study where the association of both GS and GR with NHEJ evolution was prominent, we found a gain of conventional NHEJ in the ancestor of two genera belonging to Corynebacteriales-Mycobacterium and Corynebacteriumwhere the former retained and the latter had a secondary NOTE.-A chi-squared-based LRT with four degrees of freedom was used to test the better model-1) independent model where the character pair evolved independent of each other and 2) dependent model where the characters were allowed to evolve assuming a correlated evolution-based on maximum likelihood. Bayes factor was used to test the better model, based on Bayesian RJMCMC analysis. NOTE.-Alpha represents the phylogenetic signal of the dependent variable, that is, NHEJ state in our case. The higher the alpha, the lesser the phylogenetic signal. AIC or the Akaike Information Criterion is used to select the best model for the NHEJ state, out of the three tested against two independent variables-GS and GR. The two independent variables were tested against multicollinearity, with variance inflation factor or VIF ¼ 0.93715 (VIF < 10 is preferred), making the analysis reliable.
loss of the machinery. Phylogenetic ancestral reconstruction analysis revealed an increase in GS in the ancestor of Corynebacteriales, followed by an NHEJ gain. Although Mycobacterium retained NHEJ, Corynebacterium lost the machinery along with a decrease in GS. Using a similar analysis, GR mapped to this subclade revealed an increase in rRNA copy number in Corynebacterium (supplementary fig. S9, Supplementary Material online; see Materials and Methods).

Discussion
Taking advantage of the availability of a large number of genomes, we confirm the nonubiquitous nature of NHEJ across the bacterial domain (Aravind and Koonin 2001;Matthews and Simmons 2014) with 22% bacteria coding for it. At the taxa level, although some phyla retain this sporadicity, others are devoid of NHEJ repair machinery, consistent with previous studies on distribution of bacterial NHEJ proteins (Matthews and Simmons 2014). In line with previous reports on the multiplicity of NHEJ in certain bacteria (B etermier et al. 2014;Hoff et al. 2016Hoff et al. , 2018McGovern et al. 2016;Dupuy et al. 2017Dupuy et al. , 2019, we further discovered a sporadic presence of multiple copies of NHEJ proteins extending to different subclades of Proteobacteria, Actinobacteria, Firmicutes, Bacteroidetes, etc. (supplementary table 1, Supplementary Material online). A trace of the evolutionary history suggested that NHEJ was most likely absent in the eubacterial ancestor. Furthermore, 96% archaeal genomes coded for LIG but no Ku. Therefore, by inference, NHEJ was absent in the Most Recent Common Ancestor of all prokaryotes as well. It was instead gained independently multiple times in different bacterial lineages. However, these primary gains were not sufficient to stabilize the repair states in subclades, because a large number of secondary losses and gains were observed across the phylogeny (elaborated further below).
Our analysis suggests that there are two common methods to arrive at an NHEJþ state: 1) the most common way to gain NHEJ was by the acquisition of LigD followed by Ku and 2) a direct transition from an NHEJÀ to NHEJþ state. We found that the LigD only state is more prevalent and is a prominent intermediate in the evolution of the NHEJþ state in Proteobacteria and Bacteroidetes (pp Proteobacterial-subclade ¼ 0.875 and pp Bacteroidetes ¼ 0.7; supplementary fig. S10C and D, Supplementary Material online). However, 90% of LigD only genomes did not encode POL and PE domains, raising the possibility that LigD only to NHEJþ transitions could actually be NHEJÀ to NHEJþ transitions.
Direct NHEJ gain can be explained by their acquisition via HGT. In this direction, we observed 60 NHEJþ organisms coding this repair on their plasmids. However, our analysis based on the detection of base compositional differences revealed no horizontally acquired chromosomal NHEJ. These numbers may be an underestimate because one is unlikely to pick HGT events if 1) the regions getting horizontally transferred have the same compositional biases in the donor and the recipient cells and/or 2) such HGT events occur early in bacterial evolution. Two instances of the latter case that we observe in our analysis are direct primary gains of NHEJ at the ancestral nodes of Firmicutes and Actinobacteria, with high posterior probability support (pp Firmicutes ¼ 0.987 and pp Actinobacteria ¼ 0.967; supplementary fig. S10A and B, Supplementary Material online). Aravind and Koonin (2001) proposed a possibility of HGT between bacteria and archaea. In this direction, using a gene tree-species tree reconciliation approach, we observed evidence for HGT between bacteria and archaea. Furthermore, this method allowed us to find evidence of HGT between distantly related bacteria as well (supplementary tables 3 and 4, Supplementary Material online). It should be noted that for the lack of methods to accurately date bacterial species trees and because transfers from unsampled species or extinct lineages could violate time constraints, we used an undated phylogeny for this analysis. Therefore, for each most frequent donor predicted, we noted the most frequent recipient as the potential donor-recipient pair. A similar search for NHEJ components in $3,000 actinobacteriophages, yielded no hits, although, Pitcher et al. (2006) have shown the employment of Mycobacterium smegmatis LigD by the Ku expressing mycobacteriophages Omega and Corndog for a successful infection. Thus, although our results suggest no transfer events between bacteria and bacteriophages, whether this is an artifact of sequence amelioration processes (Lawrence and Ochman 1997) remains to be tested.
A third route to an NHEJþ state could have been via a Ku only state. However, we observed that the time spent in a Ku only state is the least and NHEJ gains via this route are rare. Approximately 90% Ku only states are present in Firmicutes alone, specifically belonging to two genera, Bacillus and Fictibacillus. This suggests that Ku only state is largely avoided across most bacteria. This could be because Ku alone is nonfunctional in repair or in some cases could even block the access of break ends for recombination-based repair (Sinha et al. 2009;Gupta et al. 2011;Zhang et al. 2012). In 138 organisms, where Ku is retained in the absence of LigD, its function could be mediated by cross-talk with other ligases such as LigA. Consistent with this, it has been previously shown that if damage produces 3 0 overhangs specifically, LigA could repair the lesion even in the absence of LigB/C/D (Aniukwu et al. 2008). In vitro studies have also shown that Mycobacterium smegmatis Ku can stimulate T4 DNA ligase (Kushwaha and Grove 2013). We observed the same trend in Archaea; Ku was always present with a LIG domain in ten archaeal species, whereas 230 archaea coded for LIG domain and no Ku. These observations are consistent with experimental studies reporting a fully functional complement of NHEJ being present only in a single archaea, and with the likelihood of microhomology-mediated end joining (via ligase alone) being more prevalent in these organisms (Bartlett et al. 2013;White and Allers 2018).
Furthermore, we observed that a primary gain does not stabilize the NHEJ trait in the subsequent lineages; with the most common type of loss being NHEJþ to NHEJÀ. We discuss this, with examples, at two levels-1) across genera belonging to closely related phyla and 2) within the same genera. As discussed earlier, there was a direct primary gain of NHEJ at the common ancestor of the phyla Firmicutes and Tenericutes ( fig. 4). First, we found that the monophyletic class Mollicutes (belonging to Tenericutes), which includes genera like the obligate plant parasites Phytoplasma and human parasites Mycoplasma, have had a one-step secondary loss of NHEJ during their evolution. This is consistent with the well supported hypothesis that these Mollicutes have evolved from Gram-positive bacteria including Firmicutes by reductive genome evolution (Woese et al. 1980;Wolf et al. 2004;Oshima et al. 2013;Ipoutcha et al. 2019). This suggests that NHEJ is either dispensable or costly to maintain a parasitic lifestyle in this class of bacteria. Second, although many species of the para-and poly-phyletic genus Clostridium (belonging to Firmicutes) including Clostridium cellulosi and Clostridium stercorarium retained NHEJ, others like Clostridium clariflavum and Clostridium propionicum had direct secondary losses. Furthermore, species like Clostridium phytofermantans and Clostridium saccharolyticum had direct secondary gains. A full list of NHEJ primary/secondary gains and losses at different taxa level-direct and sequential-can be found in the supplementary table 8, Supplementary Material online.
It is likely that several factors have contributed to the current distribution of NHEJ in bacterial systems. For example: 1) studies in both prokaryotes and eukaryotes have suggested a cross-talk between NHEJ and other repair mechanisms like recombination-based repair (Hoff et al. 2016(Hoff et al. , 2018Biehs et al. 2017;Bertrand et al. 2019), base excision repair (Bertrand et al. 2019), and mismatch repair (Shen et al. 2018). Shen et al. (2018) have shown that a bacteriophage infection is prevented by generating DSBs produced by MutL, a conventional mismatch repair endonuclease, which is subsequently repaired by NHEJ via a large deletion. However, this might not be a conserved pathway of repair as our analysis suggests an absence of NHEJ in a majority of actinomycetes belonging to the genus Corynebacterium including Corynebacterium glutamicum and 96% of archaea. 2) NHEJ could co-occur with other pathways that may or may not be directly involved in DSB repair. For example, NHEJ has been found to be active during sporulation in Bacillus subtilis (Wang et al. 2006;Bertrand et al. 2019), however, nonspore-forming bacteria also encode for NHEJ (Aravind and Koonin 2001;McGovern et al. 2016). Additional cross-talk at the level of regulation has also been reported, such as in Mycobacterium, where the deacetylase Sir2 has been implicated in regulating NHEJ (Li et al. 2011, p. 2). It is equally plausible that signatures of displacement of certain domains from genomes encoding NHEJ could inform us on how acquisition of NHEJ could have shaped genome architecture. In line with this, we did note 20 additional domains coded in Ku and LigD proteins (supplementary table 10, Supplementary Material online) across different phyla. These would be useful to experimentally test in the future to understand whether they show functional interaction with NHEJ repair.
To understand the selection pressures that might play a role in shaping the evolutionary pattern of NHEJ, we focused our efforts on studying the genome characteristics associated with this DNA repair. We reasoned that under the event of a DSB, the unavailability of a template would prevent the highly accurate homologous recombination repair to act. Thus, factors that affect rates of genome duplication or probability of multiple DSBs occurring could be central to determining the need for NHEJ-based repair. Therefore, one might expect a higher selection pressure of maintaining NHEJ in bacteria with slower GRs and larger GSs. In line with our hypothesis, we found that the organisms possessing conventional NHEJ tend to have significantly larger GS and slower GR as compared with those that are devoid of it. A phylogenetically controlled test suggested that this association held true for GS and not GR. We note that phyloANOVA used here for hypothesis testing assumes normality. However, our data for GS and rRNA copy number are not normally distributed, even after converting them into a logarithmic scale (Shapiro-Wilk normality test; P genome size ¼ 1.7 Â 10 À10 , P rRNA copy number < 10 À15 ). Therefore, the result should be interpreted keeping this caveat in mind. Furthermore, ML and Bayesian inferences showed that there is a correlated evolution of NHEJ with GS and GR along the phylogenetic tree. Considering the two variables together, we found that GS is the better correlate of the NHEJþ state. Therefore, we conclude that the NHEJþ state is strongly associated with GS and to a much smaller extent with GR throughout its evolution.
In sum, our study highlights the evolutionary trajectory of NHEJ and central characteristics that may have determined its sporadic distribution. DSB repair, including NHEJ, has been implicated in shaping bacterial genomes through mutagenesis (Paris et al. 2015), HGT (Popa et al. 2011), and their effect on genomic G-C content (Weissman et al. 2019). Given this relationship between repair and genome evolution, it is important to ask how one factor may have influenced the other during bacterial evolution. It is also possible that similar forces have played a role in the evolution of other repair pathways and the genomes encoding them.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.