Are Protein Domains Modules of Lateral Genetic Transfer?

Background
In prokaryotes and some eukaryotes, genetic material can be transferred laterally among unrelated lineages and recombined into new host genomes, providing metabolic and physiological novelty. Although the process is usually framed in terms of gene sharing (e.g. lateral gene transfer, LGT), there is little reason to imagine that the units of transfer and recombination correspond to entire, intact genes. Proteins often consist of one or more spatially compact structural regions (domains) which may fold autonomously and which, singly or in combination, confer the protein's specific functions. As LGT is frequent in strongly selective environments and natural selection is based on function, we hypothesized that domains might also serve as modules of genetic transfer, i.e. that regions of DNA that are transferred and recombined between lineages might encode intact structural domains of proteins.

Methodology/Principal Findings
We selected 1,462 orthologous gene sets representing 144 prokaryotic genomes, and applied a rigorous two-stage approach to identify recombination breakpoints within these sequences. Recombination breakpoints are very significantly over-represented in gene sets within which protein domain-encoding regions have been annotated. Within these gene sets, breakpoints significantly avoid the domain-encoding regions (domons), except where these regions constitute most of the sequence length. Recombination breakpoints that fall within longer domons are distributed uniformly at random, but those that fall within shorter domons may show a slight tendency to avoid the domon midpoint. As we find no evidence for differential selection against nucleotide substitutions following the recombination event, any bias against disruption of domains must be a consequence of the recombination event per se.

Conclusions/Significance
This is the first systematic study relating the units of LGT to structural features at the protein level. Many genes have been interrupted by recombination following inter-lineage genetic transfer, during which the regions within these genes that encode protein domains have not been preferentially preserved intact. Protein domains are units of function, but domons are not modules of transfer and recombination. Our results demonstrate that LGT can remodel even the most functionally conservative modules within genomes.

of the extent and the significance of such clustering. Intuitively, when pairwise comparisons of all informative sites are laid out in a compatibility matrix and compatibility and incompatibility of sites are color-coded, the NSS is simply the fraction of adjacent squares of the same color in the matrix.
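The matrix description of the NSS can be illustrated with a minimal Python sketch. It assumes the compatibility matrix is already available as a square grid of 0/1 values (1 for a compatible pair of sites, 0 for an incompatible pair) and counts every horizontally or vertically adjacent pair of cells, diagonal included. The function name and the treatment of the diagonal are assumptions made for illustration, not details taken from the published NSS implementation.

```python
def neighbour_similarity_score(matrix):
    """Fraction of adjacent cell pairs (right and below) that hold the same value.

    `matrix` is assumed to be a square list of lists of 0/1 flags,
    where entry [i][j] records whether informative sites i and j are compatible.
    """
    size = len(matrix)
    if any(len(row) != size for row in matrix):
        raise ValueError("compatibility matrix must be square")

    same = 0   # adjacent pairs with equal values
    total = 0  # all adjacent pairs examined
    for i in range(size):
        for j in range(size):
            if j + 1 < size:  # neighbour to the right
                total += 1
                same += int(matrix[i][j] == matrix[i][j + 1])
            if i + 1 < size:  # neighbour below
                total += 1
                same += int(matrix[i][j] == matrix[i + 1][j])
    if total == 0:
        raise ValueError("matrix needs at least two informative sites")
    return same / total


if __name__ == "__main__":
    # Toy matrix for four informative sites; values are illustrative only.
    toy = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 1],
        [0, 0, 1, 1],
    ]
    print(neighbour_similarity_score(toy))
```

In this toy matrix the incompatible pairs form two contiguous blocks, so most neighbouring cells agree and the score is high; scattering the same number of incompatible cells at random would lower it.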
(c) Pairwise homoplasy index (PHI) [4]
In sequence comparison, homoplasy is the occurrence of the same or similar character states that do not share a common ancestral origin. A recombination event would create a homoplasious region in the sequence. Similar to NSS, PHI is a compatibility-based statistic that compares topology-informative sites in a pairwise matrix. The main difference between the two statistics is that NSS measures the clustering of incompatible sites, whereas PHI measures the compatibility between closely linked sites using a refined incompatibility matrix and a refined incompatibility score.
The significance of each of these three statistics was assessed by randomly permuting the columns of the alignment (for MaxChi), or the parsimoniously informative sites in the alignment (for NSS and PHI), 1,000 times. We earlier found that biases exist in each of these three tests when detecting reciprocal and non-reciprocal recombination events, and therefore one should not rely on a single test when, as is usually the case, one does not know a priori whether an event is reciprocal or non-reciprocal [1]. Here, we accepted that there exists evidence of a recombination event when two of the three statistics had p-values ≤ 0.10. We find that our approach has 96% specificity in detecting recombinants at the p-value threshold of 0.10 (unpublished). Given the computational efficiency of implementations of these tests, there is no great burden associated with computing p-values for all three, even on thousands of sequence sets.
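The permutation test and the acceptance rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in the study: the helper names are hypothetical, the add-one correction in the empirical p-value is a common convention that the text does not specify, the choice of tail depends on the statistic, and "two of the three" is read here as "at least two".

```python
def empirical_p_value(observed, permuted_values, upper_tail=True):
    """Empirical p-value of `observed` against statistics from permuted alignments.

    Assumption: add-one correction, so the p-value is never exactly zero.
    Set upper_tail=False when small values of the statistic indicate recombination.
    """
    if upper_tail:
        hits = sum(1 for value in permuted_values if value >= observed)
    else:
        hits = sum(1 for value in permuted_values if value <= observed)
    return (hits + 1) / (len(permuted_values) + 1)


def recombination_detected(p_values, alpha=0.10, min_tests=2):
    """Return True when at least `min_tests` statistics have p-value <= alpha.

    `p_values` maps a test name (e.g. "MaxChi", "NSS", "PHI") to its p-value.
    """
    significant = [name for name, p in p_values.items() if p <= alpha]
    return len(significant) >= min_tests


if __name__ == "__main__":
    # Illustrative p-values only; not results from the study.
    example = {"MaxChi": 0.03, "NSS": 0.25, "PHI": 0.10}
    print(recombination_detected(example))
```

Requiring agreement between tests with different biases trades some sensitivity for robustness when the type of recombination event is not known in advance.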

Phase II: Identification of recombination breakpoints
Once recombination was detected in a gene sequence set, we used a rigorous Bayesian phylogenetic approach, DualBrothers [7], to identify the corresponding recombination breakpoints. We applied a modified version of the program to infer the change-points in phylogenetic signal across the alignment. The Bayesian approach was found to show high accuracy in identifying recombination breakpoints, although the approach itself is computationally expensive in time and memory [8].
DualBrothers is an implementation of reversible-jump Markov chain Monte Carlo (MCMC) to perform inference under a dual multiple change-point model, in which change-points in tree topologies and change-points in evolutionary rates across sites within a sequence set are modeled independently [7]. A prior distribution on the number of change-points creates a strong preference for fewer change-points, which corresponds to the a priori assumption that recombination and changes in substitution rate are rare occurrences. For a given aligned gene sequence set, MCMC was implemented in DualBrothers to sample the joint posterior distribution of phylogenetic trees relating the sequences, the change-points in phylogeny along the alignment, and the change-points in the substitution rate along the alignment. Trees are assumed to be unrooted with exponentially distributed branch lengths. The mean value for the exponential is taken to be the substitution rate, which itself is Poisson distributed. The ratio of nucleotide transition to transversion (parameter kappa) was also sampled in DualBrothers, emulating an HKY model of nucleotide substitution.
A sequence set of size n has (2n-5)!! possible unrooted tree topologies relating the taxa. For many of the sequence sets in our dataset the value of n is large, creating an intractable search space of possible topologies for the DualBrothers MCMC sampler. At large n, a relatively small number of topologies typically dominate the posterior distribution, and many topologies are extremely unlikely given the data. DualBrothers proposes new updates of topology uniformly at random from the set of all possible tree topologies, so unlikely topologies can be proposed, with the consequence that convergence and mixing of the MCMC chain become inordinately slow. To concentrate the sampling efforts of DualBrothers on tree topologies likely to be well-represented in the joint posterior distribution, we implemented a preprocessing method suggested by the authors [7] to identify candidate topologies and restrict the search space. We applied MRBAYES [9]
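To give a sense of how quickly the topology search space grows, the short Python sketch below evaluates the double factorial (2n-5)!! for a few values of n. The function name is hypothetical and the code only illustrates the counting formula quoted above; it is not part of DualBrothers or MRBAYES.

```python
def count_unrooted_topologies(n_taxa):
    """Number of unrooted, fully resolved tree topologies for n_taxa sequences.

    Evaluates (2n - 5)!! = (2n - 5) * (2n - 7) * ... * 3 * 1, defined for n >= 3.
    """
    if n_taxa < 3:
        raise ValueError("the formula applies to three or more taxa")
    count = 1
    # Multiply the odd numbers from 2n - 5 down to 3; the trailing factor 1 is implicit.
    for factor in range(2 * n_taxa - 5, 1, -2):
        count *= factor
    return count


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, count_unrooted_topologies(n))
```

Ten sequences already admit about two million topologies, which is why uniform proposals over all topologies mix poorly and restricting the sampler to a candidate set is attractive.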

Annotation of protein domains
The numbers of gene sets, annotated protein domains and inferred recombination breakpoints within the gene sets are shown in Table S1. Figure S4. For sequences with long domains, the chance of a breakpoint being found outside a domain (ρ ≈ 1) is low because the inter-domain region in these sequences is short. Therefore our inference of breakpoint locations with respect to domain structure might be biased by the lengths of the annotated domains in the sequences.

Relating breakpoint locations to corresponding domains
We developed a normalized breakpoint-to-midpoint distance statistic (ρ, as shown in Figure 1B in the main text). A breakpoint with ρ ≈ 0 is located closer to the midpoint (center) of the corresponding domain than a breakpoint with ρ ≈ 1. A breakpoint with ρ = 1 is located either at the boundary or outside the corresponding domain. Table S2 shows