The infinitely many genes model with horizontal gene transfer

The genome of bacterial species is much more flexible than that of eukaryotes. Moreover, the distributed genome hypothesis for bacteria states that the total number of genes present in a bacterial population is greater than the genome of every single individual. The pangenome, i.e. the set of all genes of a bacterial species (or a sample), comprises the core genes which are present in all living individuals, and accessory genes, which are carried only by some individuals. In order to use accessory genes for adaptation to environmental forces, genes can be transferred horizontally between individuals. Here, we extend the infinitely many genes model from Baumdicker, Hess and Pfaffelhuber (2010) for horizontal gene transfer. We take a genealogical view and give a construction -- called the Ancestral Gene Transfer Graph -- of the joint genealogy of all genes in the pangenome. As application, we compute moments of several statistics (e.g. the number of differences between two individuals and the gene frequency spectrum) under the infinitely many genes model with horizontal gene transfer.


Introduction
Many prokaryotic species (i.e. bacteria and archea) are now known to have highly flexible genomes (e.g. Tettelin et al., 2005;Ehrlich et al., 2005;Tettelin et al., 2008;Koonin and Wolf, 2012). Unlike in eukaryotes, genes can be transferred horizontally (i.e. without a direct relationship between donor and recipient) between prokaryotic individuals of either different or the same population. As a result, gene content can differ substantially between strains from the same population. For example, the pathogenic strain E. coli O157:H7 carries 1387 genes which are absent in the commensal strain E. coli K-12 (Perna et al., 2001). This huge variation in gene content led to the concepts of the distributed genome of bacteria and their pangenome Ehrlich et al., 2005).
In order to understand the growing amount of genomic data from bacterial species, classical population genetic theory -using mutation, selection, recombination and genetic drift as main evolutionary forces -must be extended in order to include realistic mechanisms of horizontal gene transfer (HGT). Since genomic data from prokaryotic species has become abundant only recently, HGT can in particular be seen as a newly discovered evolutionary factor (Doolittle, 1999;Koonin et al., 1997). However, theoretical work on the population genomics of HGT is still in its infancy. In order to include HGT in population genetic models, some scenarios have to be distinguished according to the following basal mechanisms: Transformation, which is the uptake of genetic material from the environment. Transduction, which describes the infection of a prokaryote by a lysogenic virus (phage) which provides additional genetic material that can be integrated into the bacterial genome. Conjugation, which is also termed bacterial sex, which requires a direct link (pilus) between two bacterial cells and leads to exchange of genetic material. In addition, small virus-like elements called Gene Transfer Agents (GTAs) have been found which may become even more important for the amount of horizontal genetic exchange in some species (McDaniel et al., 2010). Another mechanism of horizontal gene transfer are due to mobile genetic elements like plasmids, gene cassettes and transposons, which transfer genes even within a single individual (de la Cruz and Davies, 2000). Although all these mechanisms transfer only parts of the gene sequences, it is a valid approach to model only complete gene transfers events, since bacteria are efficient in getting rid of non-functional genetic material. Most importantly, when considering horizontal gene transfer by transformation, transduction and conjugation, transformation (and to a lesser extend also transduction) transfers genes mainly between distantly related species, while conjugation only works for bacteria from closely related species.
In Baumdicker et al. (2010) (see also Baumdicker et al., 2012), we presented the infinitely many genes model, a population genomic model which includes HGT from different species, e.g. by transformation, but no HGT within species. It accounts for gene gain (or gene uptake) from the environment (at rate θ/2) along the genealogical tree which describes the relationships between the individuals of the population. The term gene gain covers HGT from other species as well as gene genesis (the formation of new genes), because from the perspective of a species under consideration these two mechanisms are indistinguishable. Pseudogenization may lead to deletion of genes and is incorporated by gene loss (at rate ρ/2). The model uses the coalescent (Kingman, 1982;Hudson, 1983) as underlying genealogy instead of a fixed (phylogenetic) tree. On the latter, the same two mechanisms were studied already by Huson and Steel (2004).
In the present paper, we extend the infinitely many genes model in order to incorporate events of intraspecies horizontal gene transfer. We stress that HGT in bacteria differs from crossover recombination in eukaryotes, since only single, non-homologous, genes are horizontally transferred in bacteria, while only homologous genomic regions are transferred by recombination in eukaryotes. Accordingly, since we aim at a genealogical picture of HGT in bacteria, the ancestral recombination graph (Hudson, 1983;Griffiths and Marjoram, 1997) as an extension of the coalescent, cannot be used. Rather, we model HGT such that each gene present in the population comes with its own events of HGT, resulting in the Ancestral Gene Transfer Graph (AGTG). In the limit of large population sizes, we compute moments of several quantities of interest. The gene frequency spectrum -see Theorem 1 -describes the amount of genes present in k out of n individuals. In Baumdicker et al. (2012) the gene frequency spectrum has been used to test whether a bacterial population shows unusual patterns for neutral evolution. In Theorem 2, we give our results for the expectations of the average number of genes per individual, and the average number of symmetric pairwise differences and the total number of genes, respectively. Calculations which give the variances of some of these quantities, can be carried out using the AGTG and are given in Theorem 3.
The paper is organized as follows: In Section 2, we introduce the infinitely many genes model with horizontal gene transfer. After stating our results in Section 3, we discuss our results with a view towards biological applications in Section 4. In Section 5 we introduce our main tool, the AGTG. The proofs of the main results, Theorems 1-3, are given in Section 6.

The model
We introduce two different views on the same model. In this section, we describe a Moran model forwards in time, including events of gene gain, gene loss and horizontal transfer of genes; see also Figure 1 for a graphical representation. Later, in Section 5 we describe how to obtain the distribution of genes in equilibrium using a genealogy-based approach.
We consider the following model for bacterial evolution: Each bacterial cell carries a set of genes and every gene belongs either to the core genome or to the accessory genome. The uncountable set I := [0, 1] is the space of conceivable accessory genes. In addition there is a set of persistent genes, the core genomesee also Baumdicker et al. (2012). As by definition these core gens can never be lost or gained and are just present in all individuals we will ignore these genes in the following analysis. A population of constant size consists of N individuals, where each individual represents a (genome of a) bacterial cell which consists of several accessory genes. We model this accessory genome of individual i at time t by a finite counting measure G N i (t) on I. We will identify finite counting measures with the set of atoms, i.e. we write u ∈ G N i (t) if G N i (t), 1 u ≥ 1. The dynamics of the model is such that G N i (t), 1 u ≤ 1 for all i and u ∈ [0, 1], almost surely. In other words, there is at most one copy of each gene in any individual.
The population evolves according to Moran dynamics; see also Figure 1. That is, time is continuous and every (unordered) pair of individuals {i, j} undergoes a resampling event at rate 1. Here, in each resampling event between individuals i and j, one bacterium is chosen at random (i, say), produces one offspring which replaces the other individual (j in this case) such that the population size stays constant. The offspring carries the same genes as the parent, i.e. if an offspring of i replaces j at time t, we have G N j (t) = G N i (t−). In addition to such resampling events, the following (independent) events occur:  Figure 1: The graphical representation of the Moran model of size N = 5 from Definition 2.1. At thick arrows, the individual at the tip of the arrow is replaced by a copy of the individual at the tail. Three mechanisms are illustrated as follows: 1. Gene loss of gene u is given at events • u ; rate ρ/2 per gene per line.
2. Gene gain of gene u is given at events u ; rate θ/2 per line. 3. Horizontal gene transfer of gene u from individual i to j is given through a thin arrow i u − − →j; rate γ/(2N ) per gene for every ordered pair (i, j), indicating a potential HGT event.
Here we show only the events for genes u 1 , u 2 and u 3 . Presence and absence of these genes at the top gives rise through resampling, gene gain (only gene u 2 ), gene loss and HGT to presence and absence of the genes at the bottom. surely) new gene at rate θ/2.
3. Horizontal gene transfer : For every (ordered) pair of individuals (i, j) and u ∈ G N i (t), a horizontal gene transfer event occurs at rate γ/(2N ). For such an event, set G N individual i is the donor of gene u and transfers a copy of the gene u to the recipient j.
Horizontal gene transfer events can as well be written in the measure-valued notation as G N j (t) = (G N j (t−) + δ u ) ∧ 1. The '∧1'-term indicates that we do not model paralogous genes, i.e. horizontal gene transfer events have no effect if the recipient individual j already carries the transferred gene.
Definition 2.1 (Moran model with horizontal gene transfer). We refer to (G N 1 (t), . . . , G N N (t)) t≥0 undergoing the above dynamics as the Moran model for bacterial genomes with horizontal gene flow.
Lemma 2.2 (Equilibrium). The Moran model of size N for bacterial genomes with horizontal gene flow has a unique mixing, ergodic equilibrium. We denote random finite measures distributed according to this equilibrium by G First, existence of a stationary measure follows from tightness of the family (G N 1 (t), ..., G N N (t)) t≥0 . In order to see this, note that a single gene in frequency k rises to k + 1 at rate ( 1 2 + γ 2N )k(N − k) and decreases to k − 1 at rate 1 2 k(N − k) − ρ 2 k. Denoting the hitting time of 0 of this birth-death process by T , we have that E k [T ] < ∞ by positive recurrence of the birth-death process for all k = 1, ..., N . As a consequence, we can bound E N n=1 G N n (t), 1 by contributions from the time-0 population and newly gained genes, i.e. by which is enough for tightness; see Kallenberg (2002), Lemma 14.15. Any weak limit must be a stationary measure by standard arguments. Now, we show that (G N 1 (t), ..., G N N (t)) t≥0 , started in any stationary measure at time −∞, is mixing. Indeed, there is a random, finite time S when all genes present at time t = 0 have become lost. Now, the distribution of (G N 1 (t), ..., G N N (t)) t≤0 is independent of (G N 1 (t), ..., G N N (t)) t≥S , since the latter only depends on events in the graphical construction after t ≥ 0. Remark 2.3 (Diffusion limit). In mathematical population genetics, one frequently studies models of finite populations forward in time, constructs their diffusion limit -most often a Fleming-Viot measure-valued diffusion -and only afterwards uses genealogical relationships in order to have a dual process to the Fleming-Viot measure-valued diffusion and to compute specific properties of the underlying forwards model. Since our interest in the present paper lies in seeing the effects of horizontal gene transfer on summary statistics (see Theorems 1-3), we take another route here and leave the construction of the infinite model forwards in time for future research. Here, one would have to consider the set of counting measures on [0, 1] as a type space (which is locally compact), and define the current state of a finite population as the empirical measure of types on this state space. Constructing the diffusion limit then gives the measure-valued diffusion. We foresee two challenges in such a construction: (i) The corresponding recombination operator modeling HGT events for the limiting Fleming-Viot process is unbounded, since the number of genes in a genome is unbounded; (ii) Although we will give a genealogical construction of HGT in Section 5, it is not straight-forward to interpret the resulting graph as a dual process.
Remark 2.4 (Gene transfer of more than one gene). Note that we model only the transfer of DNA segments too small for more than one gene, although it is known that the transfer of multiple genes at once can occur (Price et al., 2008). However, we postulate that the results of Theorem 1 and Theorem 2 are as well valid for a model with multiple gene transfers and multiple gene losses, where at rate γ (ρ ) each gene of an individual is transfered (lost) with some probability p γ (p ρ ). In this case, first moments of the statistics we compute in Theorems 1 and 2, as well as Lemma 2.5, are not affected if we replace γ and ρ by γ p γ and ρ p ρ . However, our results for second moments in Theorem 3 differ in the case of multiple gene transfer/loss events.
We are mainly interested in large populations. The corresponding limit is usually referred to as large population limit in the population genetic literature. The following result of the Moran model with HGT will already be useful in various applications.
Lemma 2.5 (The frequency path of a single gene). Let X N (t) be the frequency of gene u at time t in the Moran model for bacterial genomes with horizontal gene flow of size N with X N (0) such that X N (0) N →∞ − −−− → x. Then, in the large population limit, N → ∞, the process (X N (t)) t≥0 converges weakly to the solution of the SDE with X(0) = x for some Brownian motion W .
Remark 2.6 (The diffusion (2.1) in population genetics). The diffusion (2.1) also appears in population genetics models including selection (see e.g. Kimura (1964), (Ewens, 2004, chapt. 5.3), (Durrett, 2008, chapt. 7.2)). In the present setting, the term proportional to X(1 − X)dt appears because horizontal gene flow increases the frequency of the gene by a rate which is proportional to the number of possible donor/recipient-pairs of individuals; see also Tazzyman and Bonhoeffer (2013). Due to the close connection of horizontal gene transfer with selective models, a comparison to recent work is appropriate. In particular, the theory for the frequency spectrum in selective models with irreversible mutations is carried out in Fisher (1930); Wright (1938); Kimura (1964Kimura ( , 1969. We re-derive these results in our proof of Theorem 1 below, but we stress that the genealogical interpretation we give is derived with a special focus on horizontal gene flow, but not to the selective case. Proof of Lemma 2.5. As in the proof of Lemma 2.2, note that gene loss reduces X N with rate ρ 2 N X N . Second, horizontal gene transfer increases X N with rate γ 2N N 2 X N (1 − X N ). By construction, the evolution of frequencies of gene u is a Markov process with generator for f ∈ C 2 ([0, 1]). Using e.g. standard results from (Ewens, 2004, chapt. 4) it is now easy to show weak convergence to the diffusion (2.1).

Results on Summary Statistics
Consider a sample G N 1 , . . . , G N n of size n taken from the Moran model of size N in equilibrium. We introduce several statistics under the above dynamics: • The average number of genes (in the accessory genome) of the sampled n individuals is given by where |G N i | := G N i , 1 is the total number of accessory genes in individual i.
• The average number of symmetric pairwise differences is given by where G N i \ G N j := (G N i − G N j ) + are the genes present in i but not in j.
• The size of the accessory genome of the sample is given by ∧ 1 is the set of genes present in any individual from the sample, counting each gene only once no matter in how many individuals it is present.
• The gene frequency spectrum (of the accessory genome) is given by G (n) Remark 3.1 (Notation). In the following results, we will suppress the superscript N of the population size. Instead, we query that our results hold in the large population limit. E.g. if we say that (3.6) holds in the large population limit, we really mean that The proofs of all results presented here are given in Section 6. For first moments, we provide proofs using diffusion theory and Lemma 2.5. For second moments, we rely on the Ancestral Gene Transfer Graph (AGTG) of Section 5. Since the proofs of the results are either using Lemma 2.5 or the AGTG or both, we formulate the following three Theorems.
Theorem 1 (Gene frequency spectrum). Consider a sample of size n taken from the Moran model for bacterial genomes with horizontal gene flow with ρ > 0, θ > 0, γ ≥ 0 in equilibrium. Then, in the large population limit, it holds that Theorem 2 (More sample statistics). Under the same assumptions as in Theorem 1, in the large population limit.
Remark 3.2 (Behavior of (3.5)-(3.8)). The infinite sums in (3.5) -(3.8) are all finite as can be seen by a comparison with the exponential series. Note further that (3.6) and (3.7) do not depend on the sample size n, while (3.8) shows a nontrivial dependence on the sample size: The size of the accessory genome grows logarithmically with n if n is large enough.
Theorem 3 (Second moment of the number of genes). Under the same assumptions and in the large population limit as in Theorem 1, we have, in the limit γ → 0, Remark 3.3 (Higher order terms in (3.9) and (3.10)). Although computationally intensive, it should be straightforward to improve the approximations in Theorem 3 for small γ to higher orders O(γ n ). In our proof, we use an ancestral perspective which includes up to two HGT events, leading to the order γ 2 . Including more than two HGT events will result in higher order terms, but will lead to an increasing amount of genealogies which must be considered. For the second order result in (3.10), we had to consider more than 5000 genealogies.
Before we prove our Theorems, let us give some biological implications and relations to previous work in the biology literature.

Discussion: Biological Implications
Unraveling the amount of HGT shaping bacterial diversity can today be tackled using a growing amount of genomic data. In particular, several datasets from closely related strains, which are of the same bacterial species are available today Tettelin et al., 2005Tettelin et al., , 2008. In such datasets, genes present in all genomes of a taxon are called core genes while genes present in only some but not all individuals comprise the accessory genome. The latter set of genes is further split into the medium-frequency shell of genes and the cloud of genes of low frequency (Koonin and Wolf, 2008).
HGT comes in two flavors, either between or within populations. As for HGT between populations, a variant of the infinitely many genes model from Baumdicker et al. (2010) was introduced by Haegeman and Weitz (2012), who couple gene gain and loss events in order to obtain a genome of constant size. However, this is in contrast to available data, since flexible genomes of bacteria usually come with different genome sizes. An interesting extension of the infinitely many genes model was studied in Collins and Higgs (2012); see also Lobovsky et al. (2013). Here, different random trees, including the coalescent tree, were used as underlying genealogies as well as different classes of genes, each class with its own rate of gene gain and loss. It was found that the coalescent produces a good fit with data, and it is likely that the rate of gene gain and loss depends on the gene.
In contrast to the vast amount of available data on HGT in bacteria, mathematical models for HGT within a population are hardly available. A first approach of the population genomics of bacteria was made by Novozhilov et al. (2005), extending a model from Berg and Kurland (2002). Here, a birth and death process is used in order to describe the evolution of the frequency of a single gene under selection under within-population horizontal transmission ("infection"), mutation (leading to loss of the gene) and population size changes. However, this study is limited since only a single gene is considered, but bacterial genomes are comprised of several hundreds of genes, each of which may be under selection and horizontal gene transfer. In Mozhayskiy and Tagkopoulos (2012), a simulation study was carried out, taking selective forces into account which arise from gene regulatory networks, i.e. epistasis of presence and absence of genes. Finally, Vogan and Higgs (2011) present a macro-evolutionary model in a constant environment and conclude that HGT was probably favorable in early evolution since loss of genes is frequent, but later, when genomes are rather adapted to the environment, HGT is not favorable and gene losses are rarer.
Conceptually, HGT within and between populations are different. Above all, the tree of life has become a classical way of thinking about inheritance since Darwin's Origin of Species. However, the abundance of HGT within population counteracts the tree-like structures evolutionary biologists like to think about. Results are phylogenetic networks, which display at the same time the joint evolutionary fate of many genes (Huson and Scornavacca, 2011;Dagan, 2011), in addition to other reticulate events such as hybridization and incomplete lineage sorting. It is becoming clear that any genealogical tree of bacteria which have a flexible genome is at most a tree of 1% of all genetic material (Dagan and Martin, 2006), which may eventually lead to a paradigm shift in evolutionary biology of prokaryotes (Koonin and Wolf, 2012).
Using the incongruence of genes with the species tree, several approaches have led to a number of methods to estimate HGT rates and identify the corresponding genes (Lawrence and Ochman, 2002;Kunin and Ouzounis, 2003;Nakhleh et al., 2005;Linz et al., 2007;Didelot et al., 2010). Current estimates show that at least 32% of the genes in prokaryotic populations have been horizontally transferred at some point (Koonin et al., 2001;Dagan and Martin, 2007). It may even be argued that this number is still a lower bound because only a fraction of all events of HGT can be seen in data, either because the transferred gene is subsequently lost or the pattern is in accordance to vertical gene transfer (Gogarten et al., 2002).
Note that the distinction of HGT within and between populations points to the long-standing question of a clear definition of a bacterial species (Fraser et al., 2009). In our approach, we at least assume that the entity of a bacterial population exists.
Recently, the concepts of open and closed pangenomes were introduced (Medini et al., 2005). If, after sequencing a finite number of genomes, all genes present in the population are found, one speaks of a closed pangenome. If new genes are found even after sequencing many genomes, the pangenome is called open.
In the infinitely many genes model with HGT, one can not identify a sharp transition between open and closed pangenomes in the limit of large sample sizes (n → ∞). Theorem 2 (see (3.8)) shows that an infinite population possesses an infinite number of genes, regardless of the parameters γ, ρ and θ. However, a closer look reveals that almost all of these genes are in extreme low frequency. Nonetheless it is possible to give a quantitative impression how typical rare genes are in a sample in the presence of HGT. It is not hard to see that abundant HGT (i.e. a high value of γ) implies that most genes are in high-frequency. In other words, sequencing a new individual hardly leads to new genes which were not seen before. This impact of openness and closeness of the pangenome can as well be seen from Figure 2.
Although we have made an attempt to include the important evolutionary force of HGT within a population, the model presented here can still be extended. We think that the following approaches are conceivable: • Gain, loss and transfer of multiple genes: The exact mechanisms of gene gain, loss and HGT are still under study. However, it seems clear that Figure 2: The expected gene frequency spectrum from Theorem 1 is highly dependent of γ, the rate of horizontal gene flow. For high values of γ, most genes are in high frequency, leading to a closed pangenome. We use ρ = 2 and sample size n = 10 in the figure.
several genes can be gained or lost at once.
• Gene families: Frequently, a single gene is present not only once but several times in a bacterial genome. The reason can either be a copying event along its ancestral line, or the gene is introduced by HGT although it was already present.
• Gene synteny: The order of genes in the genome is called gene synteny. In our model, the synteny of genes is not modeled, but can be observed in genomic data. Above all, gene synteny can give hints of events of horizontal gene transfer, since the order of genes can be different in donor and recipient.
• Mobile genetic elements: There are parts of the genome like mobile elements which are more likely to be transferred horizontally. Examples are transposons, plasmids and gene cassettes, i.e. horizontal gene transfer even within a single cell can be considered.

The Ancestral Gene Transfer Graph
Since the seminal work of Kingman (1982) and Hudson (1983), the genealogical view is a powerful tool in the analysis of population genetic models. Here, we will give a genealogical construction in order to obtain the distribution of G N 1 , ...G N n for a sample of size n ∈ N in the large population limit of the Moran model with horizontal gene transfer in equilibrium. The resulting genealogy is denoted the Ancestral Gene Transfer Graph (AGTG). In this random graph, every ancestral line splits at constant rate γ/2 per gene due to a potential gene transfer event. We note that such events leading to potential ancestors are wellknown for the ancestral selection graph (ASG) of Neuhauser and Krone (1997) and Krone and Neuhauser (1997). However, potential ancestors in the ASG arise by fitness differences within the population while the potential ancestors within the AGTG may take effect by events of horizontal gene transfer.
We start with the construction of the genealogy for a single gene and come to the full picture including all genes afterwards.
Definition 5.1 (The AGTG for a single gene). Consider a random graph A n which arises as follows: Starting with n lines, denoted i = 1, ..., n, • each (unordered) pair of lines coalesces at rate 1, • each line disappears at rate ρ/2 (meaning that the gene was lost), • each line splits in two lines at rate γ/2 (meaning that the gene was horizontally transferred from another individual which was so far no ancestor of the original n cells, such that the gene can now have two different origins).  Every pair of lines coalesces with rate 1, and every line splits at rate γ/2 and disappears (marked by •) at rate ρ/2. The sampled point E (marked by for a gene gain event) determines G 1 = 0 (indicated by the ) and G 2 = G 3 = G 4 = 1 (indicated by the 's).
For later use, we show that all moments of the length of the AGTG are finite.
In particular, the length is almost surely finite, and the uniform distribution according to the length measure, from which E is picked, is well-defined.
Lemma 5.2 (Length of AGTG for a single gene has finite moments). Let A n be the AGTG for a single gene from Definition 5.1 and let L(A n ) be its length. Then, E[L(A n ) k ] < ∞ for all k = 1, 2, ...
Proof. The number of lines in A n is a birth-death process with birth rate λ i = γi/2 and death rate µ i = i 2 + ρi/2 (when there are i lines) and 0 as absorbing state. Since the length during times with i lines increases at rate i, L(A n )/2 is distributed as the hitting time T of 0 of a birth-death process (Z t ) t≥0 with rates λ i = γ and µ i = i − 1 + ρ, i = 1, 2, ... and absorbing state 0. In order to show finite moments of T , note that the process (Ž t ) t≥0 withŽ t = Z t − 1 is bounded from above by a birth-death process with birth ratesλ k = γ and death-ratě µ k = k. In other words, (Ž t ) t≥0 is the number of customers in an M/M/∞queue. Let S denote the partial busy period of this queue (i.e. the first time when the queue is empty). Moreover, whenŽ t = 0 we have that Z t = 1 and there is a chance ρ/(γ +ρ) that T is reached after an exp(γ +ρ)-distributed time.
From this construction, we see that T ≤ S 1 + · · · + S N where S k d = S + S , and S ∼ exp(γ + ρ) independent from S, and N ∼ geom(ρ/(γ + ρ)), all S k 's being independent. Hence, the assertion follows from finite moments of S (Artalejo and Lopez-Herrero, 2001).
We now come to the desired connection between the Moran model with horizontal gene transfer and the AGTG. Clearly, by (a) we can determine the coordinates (i, −t) with the property, that . This subgraph is a random graph, which can be constructed from time 0 backwards as follows: Starting with n lines, any pair of lines coalesces at rate 1, every line is killed at rate ρ/2 and every line splits in two lines (due to a horizontal gene transfer event) at rate γ(N − k)/(2N ) if there are currently k lines within the graph. (For the latter rate, observe that the donor of gene u might already be part of the graph.) Hence, as N → ∞, this random graph converges (weakly) to the AGTG for a single gene as in Definition 5.1. In addition, for small ε, genes in (u − ε, u + ε) are gained at most once on this graph. Hence, when conditioning on the event of a gene gain of gene u on the random graph (i.e. the Poisson point process has a point (x, u)), by well-known properties of Poisson processes, this event is uniformly distributed on the graph. In other words, the distribution is the same as that of E in Definition 5.   While the construction of the genealogy of a single gene was straight-forward, considering all genealogies of all genes seems to be harder. The reason is that there can be infinitely many genes, and each of these genes comes with its own events of gene gain, loss and horizontal gene flow. Even worse, we can decide on the presence or absence of a gene only if we know if there was a gene gain event somewhere along the genealogy, which means that we have to follow all (uncountably many) potential genes back in time.
We will resolve such difficulties by constructing (countably) many potential genealogies, which all rely on the same clonal genealogy of n bacterial cells, and model gene gain events along them. The result is the ancestral gene transfer graph (AGTG) for infinitely many genes. An illustration is found in Figure 4.

Definition 5.4 (The AGTG for infinitely many genes). Consider a sequence
n , ... of coupled random graphs which arise as follows: • A (0) n is distributed according to Kingman's coalescent, i.e. starting with n lines, each (unordered) pair of lines coalesces at rate 1 (and the graph is stopped as soon as the number of blocks (lines) is one).
Given A • each line splits in a continuing and an incoming line at rate γ/2 (meaning that the gene was horizontally transferred from the incoming line). If the line was part of A n , the continuing line runs along A n as well. The resulting splitting event is marked with "1" • each line is terminated by a loss event, also marked with "1", at rate ρ/2 (indicating that gene 1 was lost).
• in addition to coalescences of lines within A n as well. The resulting splitting event is marked with "k + 1".
• each line is terminated by a loss event, also marked with "k + 1", at rate ρ/2 (indicating that gene k + 1 was lost).
• in addition to coalescences of lines within After the construction of all graphs we consider for each k only the relevant parts of A (k) n , i.e. those parts that can be reached from at least one of the leaves by running through coalescence events or splitting events marked with k. In Figure 4 these parts are shown as solid lines, while the additional lines, necessary for the construction but unreachable from the leaves, are shown as dashed lines.
In order to model gene gain events, consider the events (T m , U m ) m=1,2,... of a Poisson point process on [0, ∞) × [0, 1] with intensity measure 1 2 θ dt du (ordered by their first coordinate). For all k, • let L(A Finally, for every i = 1, ..., n, let U k ∈ G i if there is a direct (i.e. increasing in time) path from i to E k in A (k) n . Then, (G 1 , ..., G n ) is denoted the Gene distribution read off from the AGTG in the infinitely many genes model. n along all paths at constant speed. Such a path might well go back and forth and jump in time. Then, the gene gain event is placed after running length T k .
The following Lemma is the key element in the proofs given in Section 6.
1. Note that the space of finite (counting) measures on [0, 1] is equipped with the topology of weak (or vague) convergence. (In our proofs we will use Skorohod's Theorem which states that weak convergence is equivalent to almost sure convergence on an appropriate probability space; cf. Kallenberg, 2002, Theorem 4.30.) In addition, we will interpret a vector (ξ 1 , ..., ξ n ) of counting measures on [0, 1] as a counting measure on {1, ..., n} × [0, 1].
2. In our proof, we use the following convergence criterion from Kallenberg (2002), Proposition 16.17, here adapted for random measures on a compact space: Let ξ, ξ 1 , ξ 2 , ... be random counting measures on a compact metric space I, where ξ is simple. Then, ξ n N →∞ Proof of Lemma 5.6. We proceed in five steps. In Step 1, we define another set of models for a population of size N with horizontal gene transfer, indexed by K, in which I = [0, 1] is separated into K classes of genes, ∆ K i := [(i − 1)/K; i/K), i = 1, ..., K. For the resulting genomes, denoted (G N,K 1 , ..., G N,K n ), we show in Step 2 that the genealogies of (G N,K 1 , ..., G N,K n ) are given by an AGTG with K + 1 coupled random graphs. The construction of these random graphs can be re-ordered such that the limit K → ∞ can be taken easily; see Step 3. In Step 4, we let N → ∞ and show the convergence of the coupled random graphs to (A (0) n , A (1) n , ...), implying the convergence to (G 1 , ..., G n ). In the last step we show convergence of second moments.
Step 1: Definition of G N,K i : Fix K ∈ N and set ∆ K i := [(i − 1)/K; i/K), i = 1, ..., K. We define another Moran model (called Moran ∆ -model) with horizontal gene transfer. Briefly, in this model, all genes u ∈ ∆ K i follow the same gene loss and gene transfer events. Precisely, in addition to resampling events (at rate 1 for every ordered pair of individuals), the following events occur: 1. Gene loss: For all k = 1, ..., K, a gene loss event occurs at rate ρ/2 per individual. Upon such an event in individual i, we have 2. Gene gain: For every individual i, at rate θ/2, choose U uniformly in 3. Horizontal gene transfer : For every (ordered) pair of individuals (i, j) and k = 1, ..., K, a horizontal gene transfer event occurs at rate γ/(2N ). For such an event, set G N,K Step 2: (G N,K 1 , ..., G N,K n ) can be constructed using K + 1 random graphs: Recall the construction of the AGTG for a single gene from Definition 5.1. We extend this construction in order to obtain the distribution of (G N,K 1 , ..., G N,K n ). Since K is finite, we can proceed by a two-step procedure similar to the proof of Lemma 5.3 in the Moran ∆ model. Here, we first generate resampling, gene loss and transfer events and subsequently introduce gene gain events. So, first consider a Moran model with (i) resampling events, (ii) potential gene loss events for genes in ∆ K k with rate ρ/2 along all lines, where a transition from G N,K i (t) to G N,K i (t) \ ∆ K k occurs, k = 1, ..., K and (iii) potential gene transfer events of genes in ∆ K k with rate γ/(2N ) per pair of individuals, as in Moran ∆ , k = 1, ..., K. Next, introduce gene gain events for all lines and all ∆ K k , k = 1, ..., K at rate θ/(2K), where each new gene is assigned a uniformly distributed random variable on ∆ K k . Equivalently, as for the AGTG for a single gene, we can start from time 0 backwards and construct K + 1 random graphs such that graph k describes the possible ancestry of genes in ∆ K k , k = 1, ..., K. Precisely, start with graph 0, which is a coalescent started with n individuals (without gene loss and horizontal gene transfer events). In graph 1, add gene loss events, valid for all u ∈ ∆ K 1 and gene transfer events, which lead to splits of lines in the graph at rate γ(N − m)/(2N ), if it currently has m lines. In addition, at rate γm/(2N ), the split of a line leads to ancestry to a line which is already within the graph. Iteratively, in graph k, additional loss and split events, valid for genes u ∈ ∆ K k , occur. Again, a split might generate a line which was already present in graph 0, ..., k − 1, and otherwise gives a new line. These graphs are denoted A (0) n,N,K , ..., A ) n,N,K . After having constructed all K + 1 random graphs, graphs 1, ..., K are hit by gene gain events, each with rate θ/(2K). As above, each new gene in graph k is assigned a uniformly distributed random variable on ∆ K k . In each ∆ K k , keep only the gene which is closest to time 0, since in Moran ∆ , a new gene in ∆ K k overwrites present ones. By this procedure, we can read off (G N,K 1 , ..., G N,K n ) from the random graphs, which are marked by gene gain events. Note that there is at most one gene in each ∆ K k for any individual i (i.e. G N,K i (∆ K k ) ≤ 1) and we claim that graphs 1, ..., K are exchangeable by construction. Indeed, the gene losses and splits of A (j) n,N,K are only valid for genes in ∆ K j . Hence the crucial part to understand exchangeability is the time a line produced by a split within ∆ K j needs to merge back to the graph A n,N,K produced at the same time by a split in ∆ K i+1 .
Step 3: .., G N n ) and the limit can be constructed using countably many random graphs: In the construction of the last step, we reverse the order of generating gene gain events and the random graphs. First, let (T m , U m ) m=1,2,... be the points in a Poisson point process T on [0, ∞) × [0, 1] with intensity θ 2 dtdu, ordered by their first coordinates. Instead of constructing the random graphs 0, ..., K + 1 in the order of the intervals ∆ N 1 , ..., ∆ N K in [0, 1], we can as well construct the random graphs in the order of appearance of gene gain events. Formally, let K m := k if U m ∈ ∆ K k , i.e. K m gives the number of the interval ∆ K k in which the mth gene gain event (T m , U m ) m=1,2,... appears. Then, let i 1 := 1 and i r+1 := inf{m > i r : K m / ∈ {K 1 , ..., K m−1 } for r < K. This means that K i1 , ..., K i K is the number of intervals in the order of the appearance of the first gene gain within each interval. Most importantly, (A n,N,K . By construction, the first gene gain event at time T 1 falls into ∆ K K1 . Hence, this graph is hit after an exponentially distributed time with rate θ/2. (Note the difference to the rate θ/(2K) from the last step.) In order to model this, take T 1 (the time of the first gene gain in T ), determine a set of paths how to move through A (K1) n,N,K and place a gene gain event after time T 1 . In case the length of A (K1) n,N,K is smaller than T 1 , do nothing. Continuing, by construction, A (Ki 2 ) n,N,K (recall K i2 = K 2 if K 2 = K 1 ) is hit by a gene gain event, this event occurs by time T 2 of T . Again, determine a set of paths how to move through A (K2) n,N,K and place a gene gain event after time T 2 , if possible. Continue until T i K and A (K K ) n,N,K . In this construction, we can now let K → ∞, which means that we construct infinitely many random graphs, A n,N , ..., we can now use all points (T m , U m ) in order to construct the resulting genomes, which we denote by G N .
We now show that G N,K K→∞ = === ⇒ G N as well as G N,K K→∞ = === ⇒ G N (the latter being the set of genomes from the Moran model), implying that G N d = G N , i.e. the genomes G N can be constructed from infinitely many random graphs by using all points in T . We use the criterion from Remark 5.7. For the convergence to G N , note that both G N,K and G N can be constructed on a joint probability space, using the same (infinitely many) random graphs. The difference in construction is that for G N , all points in T are used, while in G N,K only the first points within each ∆ K i are used. Moreover, as long as at most one gene gain event hits A (i) n,N,K the random measures G N,K and G N agree on ∆ K i . Hence, we write, for any Borel set A ⊆ {1, ..., n} × [0, 1] and k = 0, 1, 2, ... and using Lemma 5.2 in the last step  (1) 1,N )] < ∞; see Lemma 5.2.) Next, we come to the convergence G N,K K→∞ = === ⇒ G N . Again, we observe that both random measures can be constructed on one probability space. Here, use the K + 1 random graphs in order to construct G N,K first and draw them as a part of a graphical construction of the Moran ∆ -model, starting at time 0. Note that in the Moran ∆ -model, gene gain events for a gene u ∈ ∆ K k can lead to loss of another gene v ∈ ∆ K k , if the line of the gene gain event carries gene v. For such genes, which are lost in the Moran ∆ -model, put additional gene loss and transfer events in the (regular) Moran model. Again, we claim that G N,K = G N if every random graph A Step 4: .., G n ), constructed from infinitely many random graphs: By now, we have shown that (G N 1 , ..., G N n ) can be constructed from infinitely many random graphs A   N )) N =1,2,... is uniformly integrable, by standard arguments (see e.g. Billingsley (1999), Theorem 3.5), (5.8) Step 5: Convergence of moments: The calculations are similar to (5.7) and (5.8).
We only have to deal with finiteness of moments in order to show G N 1 ⊗ · · · ⊗ G N n N →∞ = === ⇒ G 1 ⊗· · ·⊗G j . Here, (i) of Remark 5.7.2 is implied by (5.7). For (ii), we know that (L(A (1) N ) n and the latter is uniformly integrable by Lemma 5.2) and L(A (1) (5.9) 6 Proofs of Theorems 1-3 6.1 Proof of Theorem 1 Using diffusion theory and Lemma 2.5, we obtain first moments of all of the statistics G We consider the diffusion (2.1) with infinitesimal mean and variance The Green function for the diffusion, measuring the time the diffusion, i.e. a gene, spends in frequency x until eventual loss, if the current frequency is δ ≤ x, is given by Following (Durrett, 2008, chapt. 7.11), we introduce new genes in frequency δ 1 at rate θ 2 1 φ(δ) in a consistent way. That is, the gene rises in frequency to ε > δ with probability φ(δ) φ( ) . Hence the number of genes in frequency x is Poisson with mean The gene frequency spectrum is now given by

Proof of Theorem 2
We give two proofs, one using diffusion theory and Theorem 1, one using the AGTG from Section 5.
Proof of Theorem 2 using Theorem 1. Given the expected gene frequency spectrum from Theorem 1, it is now easy to compute first moments of A, D and G by using, in the infinite population limit, 1 , Proof of Theorem 2 using the AGTG. First we note that A (1) = G (1) and and it suffices to compute E[G (n) ] in the proof. We will abuse notation and write dx and dy for infinitely small portions of the genome. In order to compute E[G (n) ], the idea is to write G (n) := ( 2) such that we have to compute the expected length of A n , the AGTG for a single gene, which we denote by L(A n ). Therefore consider the birth and death process (Z t ) t≥0 with birth rate λ i = γ and death rate µ i = i + ρ − 1. Recall that the hitting time T , when this birth and death process hits zero has the same distribution as L(A n )/2, see proof of Lemma 5.2. Now, it is well known (see e.g. Karlin and Taylor, 1975, chapt. 4 Combining (6.2) and (6.3) yields (6.5) According to (6.1), E[A (n) ] is readily obtained and the expected number of differences is given using (6.5) by

Proof of Theorem 3
Since A (1) = G 1 ([0, 1]) = 1 0 G 1 (dx), we can use the first and second moment measures of G 1 in order to compute the moments of A (1) ; see e.g. Daley and Vere-Jones (2003), Section 5.4, which are given by A → E[G 1 (A)] for the first and (A, B) → E[G 1 (A)G 1 (B)] for the second moment. (A similar statement holds for the random measure D 1,2 := |G N 1 − G N 2 | and 2D (2) = D 1,2 ([0, 1]).) For the integral with respect to these measures, we will -as in (6.2) -abuse notation (see e.g. the term V[G 1 (dx)] below) such that (6.6) First, given A (1) 1 (which is distributed like the AGTG for a single gene A 1 ), the probability of a gene gain event on A (1) Second, for x = y, 1 ], E[|G 1 (dy)| |A (1) 1 , A 1 )] up to second order in γ. For this computation, we make use of the fact that the AGTG for two genes can be defined in analogy to the AGTG for a single gene from Definition 5.1, but with two different kind of loss and transfer events. Precisely, we consider the following random graph: starting with x lines of state only gene 1, y lines of state both genes and z lines only gene 2, pairs of lines coalesce at rate 1. (Note that coalescence of a line of state only gene 1 and a line of state only gene 2 gives a single line of state both genes.) Lines where gene 1 (gene 2) is considered are lost at rate ρ/2. (If a line of state only gene 1 (only gene 2) is lost, it is lost completely, while if a line of state both genes is lost, it turns into a line of state only gene 2 (only gene 1 ).) Finally, every line of state only gene 1 (only gene 2 ) is split at rate γ/2 and the new line is again of state only gene 1 (only gene 2 ). In addition, a line of state both genes splits at rate γ and the new line is of state only gene 1 or only gene 2, both with probability 1/2. The length of the graph of lines at states with gene 1 (gene 2), i.e. either at state only gene 1 (only gene 2 ) or both genes is denoted L 1 (t) (L 2 (t)) if a sample from time t of the population is considered. We write E xyz [.] for the expected value if the process is started as above.
To compute the variance for the number of differences D (2) we will use a similar approach. Recall D 1,2 = |G N 1 − G N 2 | and 2D (2) = As S γ 2 has more than 5000 elements we used Mathematica to compute P(s) -see the accompanying file available at the arXiv (http://arxiv.org/abs/ 1301.6547) -the variables D i k (s), resp. D j k (s), and the parameters of the exponentially distributed times T k (s) for 1 ≤ k ≤ 8 and all s ∈ S γ 2 . Combining (6.13) and (6.14) gives the result as shown in (3.10).