Accumulation and maintenance of information in evolution

Significance

Through variation in fitness, selection accumulates and maintains information in the genomes of organisms. This process takes place over many generations, in populations that evolve stochastically due to finite size and random mutation. The information, which we quantify in bits, corresponds to the degree to which selection shapes the population composition, the DNA sequence, and the phenotype. We prove a general bound on the rate at which information can accumulate per generation. We find that both accumulation and maintenance of information are most efficient (require the least fitness variation per bit) when individual loci experience weak selection. This is relevant for selection on traits influenced by many small-effect loci—a common genetic architecture according to genome-wide association studies.


How Well Can Selection Specify the Genotype and the Phenotype?
The degree to which within- and between-species genetic variation is shaped by selection has been the subject of the neutralist-selectionist debate (8)(9)(10)(11). Today, we know that much of the human genome is involved in various biochemical processes (12, 13), but this does not mean that it is strongly shaped by selection (14)(15)(16). Here we ask a related question in information-theoretic terms: How much information can selection accumulate and maintain in the genome? Much of the sequence is to some degree random, and, given its size, l ≈ 3 × 10^9 base pairs, it likely contains far less information than the maximum conceivable 6 × 10^9 bits. A similar question has been raised in the context of the origin of life: Given high mutation rates, how much information could be maintained in the genome of early organisms (2)?
Analogous questions can be asked about the phenotype. How many traits can selection optimize? It is easy to list a large number of potentially relevant traits: Take the expression of all genes in all cell types and conditions, or regulatory interactions between pairs of genes. For a fit organism, these traits need to be specified with some precision, and this precision is likely limited (even if it is, to some degree, facilitated by correlations among traits). For example, a study of selective constraint on human gene expression (17) gave evidence of constraint, but, overall, this seems weak. Given the large number of possibly important phenotypes, how precisely can selection specify them?

Quantifying Genetic Information.
An established method in bioinformatics quantifies the information content of a short genomic motif, such as a binding site, by comparing an alignment of its instances across the genome to the genomic background (18, 19). Our definition of genetic information is mathematically similar, but aims to apply more generally (to large regions without multiple instances available). It is therefore based in theoretical population genetics rather than sequence data analysis. A key related concept is the repeatability of evolution (20, 21). Evolution is stochastic due to genetic drift and mutation, but selection can reduce the space of possible outcomes. For example, suppose that, in a sequence of length l, n sites are under strong selection for specific nucleotides. By fixing those nucleotides, selection will accumulate 2n bits of information.

Meanwhile, the remaining l − n sites will be occupied by random nucleotides, and, if a replicate population evolves under identical conditions, the l − n nucleotides will likely be different. Therefore, our concept of information in a sequence is inversely related to how differently it could have evolved under identical conditions.
In general, however, the information content of the genome cannot be quantified by simply counting the sites that are under selection. A single bit of information can be spread across many loci under weak selection-a phenomenon particularly relevant when selection acts on polygenic traits, long recognized in quantitative genetics and described by the infinitesimal model (22,23). Polygenicity and weak selection also resolve the apparent contradiction between the variety of phenotypes, or biochemical processes involving the DNA, and the lack of strong selective constraint on all of them. Selection might act on a small number of high-level traits, which are influenced by large numbers of loci spread across the genome [described by the omnigenic model (24)], which experience only weak selection individually.
In Section 2, we define information on three levels-the population state (genotype frequencies), the genotype, and the phenotype. There are simple inequalities between the three levels. This means that the upper bound on the information accumulation rate, which we prove at the population level, also implies a bound at the genotype and phenotype levels. We use the Kullback-Leibler (KL) divergence, a central quantity in information theory (25), to quantify the difference between the actual distribution of each of these variables and its corresponding neutral distribution.
Notably, the neutral phenotype distribution corresponds approximately to the phenotype distribution among random DNA sequences. Recent work with random mutant libraries suggests that, for some phenotypes, this distribution is accessible experimentally [gene expression driven by random promoters (26)(27)(28) or enhancers (29)]. Any departure from this neutral distribution amounts to accumulation of information.

Cost of Information.
After defining what genetic information means, we ask how quickly it can accumulate and how much of it can be maintained. We look for answers in terms of the cost of selection-the amount of relative fitness variation in a population. This cost, traditionally measured as the relative fitness variance or the genetic load, is itself limited. In a population with constant size, relative fitness is proportional to the expected number of offspring, and the number of offspring can only vary between zero and the reproductive capacity of the organism.
We rely on an information-theoretic measure of cost of selection, which is itself upper bounded by the relative fitness variance and genetic load but has favorable mathematical properties. It relates the cost of selection to the KL cost of control (30)(31)(32), or the thermodynamic power (33).
The relationship between information accumulation rate and the cost of selection has been studied by Kimura (1) and, later, Worden (3), MacKay (4), and Barton (7). In Section 3, we discuss these works in more detail and derive a more general bound. The problem of maintenance has been studied by Eigen (2), Watkins (5), and Peck and Waxman (6). We discuss these in Section 4 and present example calculations that suggest general trends in the amount of information that can be maintained per unit cost.

Quantifying Genetic Information
The measures of information studied in this paper are based on comparisons between the distributions of various variables under selection versus neutrality. The focus on probability distributions accounts for the stochasticity of evolution, and the difference between the distributions with and without selection corresponds to the control that selection exerts on evolution. We quantify this difference in bits, using the KL divergence (25)

D(U) = Σ_u ψ_U(u) log₂ [ψ_U(u)/ϕ_U(u)],   [1]

where U is a variable that takes values u with probabilities ψ_U(u) with selection and ϕ_U(u) under neutrality. Below, we focus on three variables: genotype frequencies (which describe population states), genotypes, and phenotypes. For a pair of variables U, V, statistical dependencies are reflected in their joint and conditional KL divergence, D(U, V) and D(U|V) (see SI Appendix, section S1 for the definitions). Both are nonnegative quantities, and they follow the chain rule

D(U, V) = D(V) + D(U|V).   [2]

The chain rule allows a comparison of the effects of selection on different variables, as well as on the same variable at different times.
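The KL divergence in bits and its chain rule can be illustrated numerically. The joint distributions below are arbitrary toy values, not drawn from any evolutionary model:

```python
import numpy as np

def kl_bits(p, q):
    """KL divergence D(p || q) in bits; q must be positive wherever p is."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

# Toy joint distributions over (U, V): rows index u, columns index v.
psi = np.array([[0.4, 0.1],
                [0.1, 0.4]])    # "with selection"
phi = np.array([[0.25, 0.25],
                [0.25, 0.25]])  # "neutral"

D_UV = kl_bits(psi.ravel(), phi.ravel())          # joint divergence D(U, V)
psi_V, phi_V = psi.sum(axis=0), phi.sum(axis=0)
D_V = kl_bits(psi_V, phi_V)                       # marginal divergence D(V)
# Conditional divergence D(U | V), averaged over psi_V:
D_U_given_V = sum(psi_V[v] * kl_bits(psi[:, v] / psi_V[v], phi[:, v] / phi_V[v])
                  for v in range(2))

# Chain rule: D(U, V) = D(V) + D(U | V)
assert abs(D_UV - (D_V + D_U_given_V)) < 1e-12
```

Here V carries no information on its own (its marginal is uniform under both distributions), so the entire joint divergence sits in the conditional term.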

Population-Level Information.
Evolution is a stochastic process happening to populations, and genotype frequencies form the state space. We use X to denote the genotype frequencies as a random variable, with each value x being a vector with an element x_g for each genotype g, normalized as Σ_g x_g = 1. As an example, Fig. 1A shows a common evolutionary scenario where a single-locus, two-allele system starts from a single copy of a beneficial allele A, and, later, the frequency evolves stochastically.
X takes values x with probabilities ψ X (x ) under selection and ϕ X (x ) under neutrality. Fig. 1B shows examples of these distributions for the single-locus system at three different times. In general, these distributions are shaped by various evolutionary forces-mutation, drift, recombination, selection (ψ X only), and others. We refer to D(X ), the KL divergence between ψ X and ϕ X , as the population-level information.
The example in Fig. 1 illustrates two important phenomena we discuss in the rest of the paper. The first phenomenon is the accumulation of information. A population evolves from an initial distribution (in the simplest case, ψ X = ϕ X and D(X ) = 0, but this is not necessary). For example, the initial state x may be completely specified as in Fig. 1A, or both ψ X and ϕ X may start at the neutral stationary distribution. Over time, selection causes ψ X to diverge from ϕ X , and the information D(X ) accumulates (Fig. 1B). We study this in detail in Section 3. The second phenomenon is the maintenance of information, and it takes place when both ψ X (x ) and ϕ X (x ) are stationary, and the information D(X ) is constant. In Section 4, we study how much information can be maintained at a given cost of selection.
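The divergence of ψ_X from ϕ_X in such a scenario can be estimated by simulating replicate populations. The following is a minimal haploid Wright-Fisher sketch with illustrative parameter values (not those of Fig. 1); the KL estimate is a crude plug-in from binned replicate counts:

```python
import numpy as np

rng = np.random.default_rng(0)

def wf_counts(N, s, T, reps):
    """Allele-A counts in `reps` replicate Wright-Fisher populations after T
    generations, each starting from a single copy of A; fitness of A is 1 + s."""
    counts = np.ones(reps, dtype=int)
    for _ in range(T):
        p = counts / N
        p_sel = p * (1 + s) / (1 + s * p)   # expected frequency after selection
        counts = rng.binomial(N, p_sel)
    return counts

N, T, reps = 100, 50, 20000
sel = wf_counts(N, s=0.05, T=T, reps=reps)
neu = wf_counts(N, s=0.0,  T=T, reps=reps)

# Plug-in estimate of the population-level information D(X_T): KL divergence
# between the empirical distributions over allele counts.
psi = np.bincount(sel, minlength=N + 1) / reps
phi = np.bincount(neu, minlength=N + 1) / reps
phi = np.where(phi > 0, phi, 1 / (10 * reps))   # regularize unseen neutral bins
m = psi > 0
D_X = float(np.sum(psi[m] * np.log2(psi[m] / phi[m])))
```

The plug-in estimator is biased for a finite number of replicates and a rough regularization like the one above; it serves only to show the qualitative divergence of the two distributions.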
The population-level information D(X) has been studied under different names and in different roles (7)(34)(35)(36). It captures any departure of the genotype frequency distribution ψ_X from its neutral counterpart ϕ_X; notably, selection can favor not only high frequencies of fit genotypes but also higher or (more typically) lower amounts of genetic variation within populations. Note that D(X) refers to the effects of selection on the genotype frequencies, rather than allele frequencies. It therefore includes effects of selection on correlations between loci (linkage disequilibrium), which are generated by physical linkage, by chance in finite populations, or due to functional interactions (epistasis); see also SI Appendix, section S2.
Notably, D(X ) (or D(G) introduced below) appears as a term in free fitness-a quantity analogous to free energy which, under some assumptions, increases over time (35,37,38). This implies that evolution maximizes the expected log-fitness while constraining D(X )-see SI Appendix, section S8.

Genotype-Level Information.
If we sample a random genotype from a population in a given state x, we find the genotype g with a probability given simply by its frequency ψ_{G|X}(g|x) = ϕ_{G|X}(g|x) = x_g. Taking into account evolutionary stochasticity, we average over all population states x with their probabilities ϕ_X(x) or ψ_X(x),

ψ_G(g) = Σ_x ψ_X(x) x_g,   ϕ_G(g) = Σ_x ϕ_X(x) x_g.   [3]

Under symmetric point mutations, the neutral distribution ϕ_G converges to a uniform distribution over all genotypes, while selection typically concentrates ψ_G among a smaller number of fit genotypes. This is also the case for the single-locus system in Fig. 1C. The divergence between ψ_G and ϕ_G is the genotype-level information D(G).
If selection precisely specifies n out of l nucleotides in the genome-that is, ψ_G(g) is uniform over a fraction 1/4^n of the 4^l possible genotypes-this implies D(G) = 2n bits. This corresponds to the intuition of 2n bits of information encoded in the genome. More typically, selection will specify many sites only weakly (biasing the probability toward some alleles; see also Fig. 1C), and may contribute to D(G) through linkage disequilibrium-correlations between linked or epistatically interacting sites. Without linkage or epistasis, D(G) is approximately additive across loci (SI Appendix, Fig. S1).
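Both the 2-bits-per-fixed-site correspondence and the additivity of D(G) in the absence of linkage disequilibrium can be checked directly on small hand-picked site distributions (the biased probabilities below are purely illustrative):

```python
import numpy as np
from itertools import product

def kl_bits(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

ALPHABET = "ACGT"
phi_site = np.full(4, 0.25)                  # neutral: uniform nucleotides
strong = np.array([1.0, 0.0, 0.0, 0.0])      # site fixed to 'A': exactly 2 bits
weak = np.array([0.4, 0.2, 0.2, 0.2])        # weakly biased site: far less

assert abs(kl_bits(strong, phi_site) - 2.0) < 1e-12
assert 0.0 < kl_bits(weak, phi_site) < 0.1

# Without linkage disequilibrium, psi_G factorizes over sites, and D(G) is the
# sum of per-site divergences (here l = 3, with one neutral site).
sites = [strong, weak, phi_site]
p = np.array([np.prod([d[ALPHABET.index(c)] for d, c in zip(sites, g)])
              for g in map("".join, product(ALPHABET, repeat=3))])
q = np.full(64, 0.25 ** 3)                   # uniform over 4**3 genotypes
assert abs(kl_bits(p, q) - sum(kl_bits(d, phi_site) for d in sites)) < 1e-9
```

The factorized case is exact; with linkage disequilibrium the joint D(G) would exceed the sum of the marginal per-site terms.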
D(G) generalizes some previous definitions of genetic information (1, 3, 6) which focused on strong selection or uniform distributions, and coincides with others in important special cases (4, 5).

Phenotype-Level Information.
Finally, selection controls evolution on the level of the phenotype Z. Z could be a categorical trait such as the presence/absence of a disease or the correct/incorrect protein fold, a quantitative trait, a comprehensive characterization of an individual, or its fitness. Given a genotype g, the probability of the phenotype z will be given by the possibly noisy genotype-phenotype relationship ψ_{Z|G}(z|g) = ϕ_{Z|G}(z|g) = ζ(z|g). When there are no environmental effects or intrinsic noise, ζ(z|g) will be concentrated at a single value z for each genotype g. Taking into account the variation within populations, as well as the evolutionary stochasticity, the marginal probability of z is

ψ_Z(z) = Σ_g ψ_G(g) ζ(z|g),   ϕ_Z(z) = Σ_g ϕ_G(g) ζ(z|g).   [4]
We show the distributions ψ Z , ϕ Z for the single-locus system in Fig. 1D, where the trait has a genotype-dependent mean and Gaussian noise. While, under neutrality, ϕ Z tends to spread out over time, selection causes ψ Z to be more concentrated. The divergence between ψ Z and ϕ Z is the phenotype-level information D(Z ).
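The effect of phenotypic noise can be made concrete with a two-genotype toy model: a Gaussian trait with genotype-dependent mean, and hypothetical genotype distributions under selection and neutrality. The KL integral is evaluated on a grid; all parameter values are assumptions of this sketch, not those of Fig. 1D:

```python
import numpy as np

def gauss(z, mu, sigma):
    return np.exp(-(z - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

z = np.linspace(-4.0, 5.0, 9001)
dz = z[1] - z[0]
# zeta(z | g): noisy trait with genotype-dependent mean (0 for a, 1 for A).
zeta = np.stack([gauss(z, 0.0, 0.5), gauss(z, 1.0, 0.5)])

phi_G = np.array([0.5, 0.5])   # neutral genotype distribution
psi_G = np.array([0.1, 0.9])   # selection favors genotype A

phi_Z = phi_G @ zeta           # mixture densities, marginalizing over genotypes
psi_Z = psi_G @ zeta
D_Z = float(np.sum(psi_Z * np.log2(psi_Z / phi_Z)) * dz)
D_G = float(np.sum(psi_G * np.log2(psi_G / phi_G)))

# Phenotypic overlap between genotypes makes D(Z) strictly smaller than D(G).
assert 0.0 < D_Z < D_G
```

Shrinking the noise sigma toward zero makes the two mixture components separate and D(Z) approach D(G) from below.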
If we can take the genotype distribution ϕ G to be uniform over all possible DNA sequences of some length, then ϕ Z is the phenotype distribution among such random sequences. Examples of this distribution have recently been measured experimentally for gene expression generated by random promoter sequences in Saccharomyces cerevisiae and Escherichia coli (26,28). If a healthy cell requires the gene expression to be in some narrow range, this translates to a requirement on the phenotype-level information D(Z ), and this requirement will increase if the expression needs to be specified across cell states.

The Relationship between the Three Levels.
The definitions above, combined with the chain rule (Eq. 2), lead to a hierarchy among the three levels,

D(Z) ≤ D(G) ≤ D(X).   [5]
This inequality can be observed across the columns of Fig. 1B-D. Intuitively, the phenotype-level information D(Z ) is bounded by the genotype-level information D(G), since the information about the phenotype has to be encoded in the genome. A special case of this relationship has been noted by Worden (3), who, however, worked in a deterministic setting (SI Appendix, section S3). The difference between the two, D(G) − D(Z ) = D(G|Z ), can have two sources. First, the phenotype distribution ζ(z |g) may overlap between genotypes, causing the phenotype to be specified less precisely than the genotype (as in Fig. 1D). Second, selection may favor genotypes based on criteria other than the phenotype Z , such as other phenotypes or robustness.
Similarly, D(G) can only be as large as the population-level information D(X ). To increase the probability of a genotype g, selection must increase the probability of population states with a high frequency of g. However, selection can also shape the patterns of genetic diversity in populations, without impacting the average genotype frequencies, therefore contributing to the difference D(X ) − D(G) = D(X |G). In populations with weak mutation, which tend to have little diversity, this difference is small-see Fig. 2.
We rely on the inequalities in Eq. 5 in two ways. First, an upper bound on the population-level information D(X ) which we prove in Section 3 also implies an upper bound on the genotype and phenotype-level information D(G) and D(Z ). In other words, selection can only fine-tune the phenotype to the degree to which it can control the population state.
Second, D(X) and D(G) can be difficult to estimate directly for systems with multiple loci, due to the high dimensionality (SI Appendix, Fig. S1). In such situations, D(Z) for fitness or a low-dimensional phenotype Z can serve as a lower bound on D(G) and D(X). If Z is the trait under selection, or fitness itself, this lower bound can be tight. This approach is applicable even for essentially black-box genotype-phenotype models, such as models of gene regulation or protein folding.

Fig. 2. Illustration of D(X) (cyan) and D(G) (orange) for a single-locus, two-allele system at stationary distributions ψ_X, ϕ_X as a function of selection strength Ns, for two different mutation strengths Nμ. The genotype-level information D(G) grows with Ns, from zero up to one bit, when one out of the two alleles dominates, with the steepest increase around Ns = 1. The population-level information D(X) can be much greater than D(G) when mutation is strong and generates diversity within the population that selection can shape (or suppress). When mutation is weak, D(X) and D(G) are similar, since the population state can be specified by the allele that is currently fixed, and D(X|G) = 0. Computed using a Wright-Fisher model as in Fig. 1, with population size N = 100.

Accumulation of Information
In this section, we show how the rate at which D(X ), the population-level information, increases over time is limited by the population size and the variation in fitness. We start by pointing out a connection between population genetics and control theory.

Accumulation of Information and the Cost of Control.
We consider a population evolving over time, with a trajectory X_0, X_1, ..., X_T forming a Markov chain between generations 0 and T (such as in Fig. 1A). The divergence of the trajectories' distribution from neutrality, D(X_0, X_1, ..., X_T), has been proposed as a measure of predictability of evolution (21). Using the chain rule (Eq. 2), we can decompose it in two ways,

D(X_0, X_1, ..., X_T) = D(X_0) + Σ_{t=0}^{T−1} D(X_{t+1}|X_t),   [6]
D(X_0, X_1, ..., X_T) = D(X_T) + D(X_0, ..., X_{T−1}|X_T),   [7]

where the last term in Eq. 7 is the effect of selection on the trajectories reaching a given endpoint X_T. In Eq. 6, we distinguish between the divergence of the initial states X_0 and the additional conditional divergence in each generation, D(X_{t+1}|X_t). The latter can be recognized as the KL cost of control, averaged over the initial states x_t (30, 31). In the context of population genetics, selection takes the role of control. Eq. 7 makes the distinction between the distribution of endpoints X_T and the conditional distribution of the states that precede those endpoints. Selection can shape the full trajectories, but only the effects on X_T constitute the final population-level information.
Together, Eqs. 6 and 7 imply a bound on the information accumulated between times 0 and T in terms of the KL cost of control,

D(X_T) − D(X_0) ≤ Σ_{t=0}^{T−1} D(X_{t+1}|X_t).   [8]

Specifically, the information accumulated over a single generation satisfies

ΔD(X_t) = D(X_{t+1}) − D(X_t) ≤ D(X_{t+1}|X_t).   [9]

Analogous bounds for continuous time Markov chains and the diffusion approximation are provided in SI Appendix, sections S6 and S7. Note that control theory is concerned with computing optimal control policies, which maximize an imposed objective while minimizing the cost. This is analogous to computing the optimal artificial selection; in fact, the KL divergence control theory framework has recently been used to study artificial selection on quantitative traits (32).
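The per-generation bound (the KL cost of control upper-bounding the information gained in each generation) can be verified on a toy two-state Markov chain; the transition matrices below are arbitrary stand-ins for neutral and selected dynamics, not a population-genetic model:

```python
import numpy as np

def kl_bits(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

P = np.array([[0.9, 0.1],
              [0.1, 0.9]])     # phi: neutral transition matrix
Q = np.array([[0.7, 0.3],
              [0.05, 0.95]])   # psi: "selection" pushes toward state 1

psi = phi = np.array([1.0, 0.0])   # identical initial states, so D(X_0) = 0
for _ in range(20):
    # KL cost of control this generation: sum_x psi(x) * KL(Q[x] || P[x])
    cost = sum(psi[x] * kl_bits(Q[x], P[x]) for x in range(2))
    D_now = kl_bits(psi, phi)
    psi, phi = psi @ Q, phi @ P
    delta_D = kl_bits(psi, phi) - D_now
    # Per-generation accumulation bound: Delta D <= cost of control
    assert delta_D <= cost + 1e-12
```

Near stationarity the left-hand side shrinks toward zero while the cost stays positive, previewing the maintenance regime discussed later.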
In contrast, natural selection is typically given by the biological or ecological circumstances, and not necessarily optimized in this sense. Still, the KL cost of control provides bounds on the rate at which selection accumulates information ( Eqs. 8 and 9), and it has a meaning in population genetics, which we discuss in the next section.
We also note that Eq. 9 is related to the proof that free fitness increases over time (37,38); see SI Appendix, section S8.

Variation in Fitness as Cost of Control.
To compute D(X_{t+1}|X_t) in population genetics, we need to specify a model. We analyze multiple general model classes in SI Appendix: Wright-Fisher and discrete Moran models in SI Appendix, section S5, the continuous time Moran model in SI Appendix, section S6, and the diffusion approximation in SI Appendix, section S7. In summary, the bound in Eq. 9 always takes the form

ΔD(X_t) ≤ kN C̄_t,   [10]

where N is the population size and kN is the number of individuals that are sampled with selection in each generation (k = 1 under asexual reproduction and k = 2 under sexual reproduction, when two parents are sampled with selection for each individual).
C(x_t) is the cost of selection at the population state x_t (see below), and C̄_t is the expected cost at time t. To upper bound the information accumulated over multiple generations, we sum over them,

D(X_T) − D(X_0) ≤ kN Σ_{t=0}^{T−1} C̄_t = kN C̄_{0,T}.   [11]

The cost C(x) is a measure of fitness variation in a population in the state x,

C(x) = Σ_g x_g ŵ_g(x) log₂ ŵ_g(x),   [12]

where ŵ_g(x) is the (frequency-dependent) relative fitness of genotype g. When sampling genotypes as parents for the next generation, g is picked with probability x_g under neutrality and x_g ŵ_g(x) under selection; C(x) is the KL divergence between these two distributions.
C(x) is related to two more established measures of cost in population genetics, the relative fitness variance V(x) and the genetic load L(x), which have been studied under a number of circumstances-for example, mutation-selection balance (39), genetic drift (40, 41), certain types of epistasis and the evolution of sex (42, 43), ongoing substitutions (44)(45)(46), or stabilizing selection on quantitative traits (47). They are defined as

V(x) = Σ_g x_g [ŵ_g(x) − 1]²,   [13]
L(x) = 1 − 1/ŵ_max(x),   [14]

where ŵ_max(x) is the maximum relative fitness present in the population x, ŵ_max(x) = max_{g: x_g>0} ŵ_g(x). We derive the relationships between C(x), V(x), and L(x) in SI Appendix, section S9 (see also ref. 48), and both provide an upper bound on C(x),

C(x) ≤ V(x)/ln 2,   C(x) ≤ −log₂[1 − L(x)].   [15]

In addition, under weak selection and in the diffusion approximation, C(x) = V(x)/(2 ln 2). The bounds in Eqs. 10 and 11 can therefore also be rewritten in terms of V(x) or L(x) using Eq. 15.
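The three cost measures and their relationships can be sanity-checked numerically. The population state and fitness values below are arbitrary examples, and the specific bound forms asserted (C ≤ V/ln 2 and C ≤ −log₂(1 − L), plus the weak-selection correspondence C ≈ V/(2 ln 2)) are standard KL inequalities assumed by this sketch rather than quoted from SI Appendix, section S9:

```python
import numpy as np

def cost_measures(x, w):
    """Cost of selection C (bits), relative fitness variance V, and load L
    for genotype frequencies x and absolute fitnesses w."""
    x, w = np.asarray(x, float), np.asarray(w, float)
    w_rel = w / np.sum(x * w)                        # relative fitness, mean 1
    C = float(np.sum(x * w_rel * np.log2(w_rel)))    # KL(x * w_rel || x)
    V = float(np.sum(x * (w_rel - 1.0) ** 2))
    L = 1.0 - 1.0 / float(np.max(w_rel[x > 0]))
    return C, V, L

x = np.array([0.2, 0.5, 0.3])          # hypothetical population state
w = np.array([1.00, 1.05, 0.90])       # weak selection: small fitness spread
C, V, L = cost_measures(x, w)

assert C <= V / np.log(2) + 1e-12      # variance bound
assert C <= -np.log2(1.0 - L) + 1e-12  # load bound
# Under weak selection, C is close to V / (2 ln 2):
assert abs(C - V / (2 * np.log(2))) < 0.1 * C
```

Increasing the fitness spread makes the weak-selection approximation fail while the two upper bounds continue to hold.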
Assuming constant population size, relative fitness is proportional to the expected number of offspring, and therefore limited by the species' reproductive capacity. The quantities ŵ_max(x), L(x), V(x), and C(x), and, as a consequence, ΔD(X), are therefore all limited in realistic settings (SI Appendix, section S9).
In the context of artificial selection or genetic algorithms, an alternative measure of cost is the population size N , which is the number of cultivated plants or animals, or fitness function evaluations (49,50). We note that, according to the bounds in Eqs. 10 and 11, the maximal accumulation rate is also proportional to N . Furthermore, increasing the strength of selection (and therefore C (x )) beyond an optimal value may increase the immediate response to selection, but it reduces the long-term response, due to loss of genetic diversity (49,50). Therefore, in practice, C (x ) will be limited even in this context.

Example 1: The Fates of a Beneficial Allele.
The bounds in Eqs. 10 and 11 hold in genetically diverse populations with clonal interference or recombination. Still, it is interesting to consider the case of sequential fixation/loss of mutations, as was done previously (1, 7, 44).
Suppose that a beneficial allele A appears in one copy at time t = 0, and is guaranteed to be fixed or lost before another mutation appears that could interfere with it. The population- and genotype-level information, D(X_t) and D(G_t), start at zero and accumulate over time, as selection tends to increase the frequency of A (Fig. 3A). The cumulative cost of selection N C̄_{0,t} serves as an upper bound on both D(X_t) and D(G_t).
Note that, under relatively strong selection (Ns = 3; Fig. 3A, Right), A increases in frequency considerably faster than under neutrality, leading to high D(X t ). But some of these gains are later lost as A is fixed or lost. This is an example of how only the probabilities of endpoints, and not the shape of the trajectories, matters for the information that is ultimately accumulated (the two terms in Eq. 7).
The increments in D(X_t) and D(G_t) in each generation are plotted in Fig. 3B, along with the bound N C̄_t of Eq. 10. The bound on ΔD(X_t) is relatively tight. ΔD(G_t) can temporarily exceed N C̄_t, since the accumulation bound in Eq. 10 does not directly apply to the genotype level, but this is only a transient phenomenon, due to the inequality between the cumulative genotype- and population-level information, D(G_t) ≤ D(X_t).
Both D(X_t) and D(G_t) saturate at the same value D(X_∞) = D(G_∞), since the ultimate fate of the population is given simply by whether the allele A is fixed or lost. The fixation probability is 1/N under neutrality and ψ_fix = ψ_{X_∞}(1) = ψ_{G_∞}(A) under selection, and the accumulated information is a function of this probability,

D(X_∞) = ψ_fix log₂(N ψ_fix) + (1 − ψ_fix) log₂[(1 − ψ_fix)/(1 − 1/N)].   [16]

This function is plotted in cyan in Fig. 3C. According to Eq. 11, it provides a lower bound on the total cost, N C̄_{0,∞} ≥ D(X_∞), given a fixation probability. This holds when the allele A has a constant, frequency-independent selective advantage, as in the three examples in Fig. 3A and B (full black line and black points in Fig. 3C). By computing a suitable frequency-dependent selection, which optimizes the fixation probability while constraining the total cost N C̄_{0,∞}, we can reduce the cost considerably (dash-dotted black line in Fig. 3C; see SI Appendix, section S11 and Fig. S4 for details). This is achieved by making selection weaker at high frequencies, where the risk of losing A is low. Still, the cost stays above D(X_∞), as it has to under arbitrary frequency- and time-dependent selection. Under both forms of selection, the bound is only tight when selection is weak. To emphasize this, we plot the information accumulated per unit cost, D(X_∞)/C̄_{0,∞}, as a function of the fixation probability ψ_fix in Fig. 3D. At weak selection, ψ_fix is only perturbed a little from its neutral value 1/N, but up to N bits can be accumulated per unit cost. A special case of this was shown by Barton (7). Similar scaling with N was also found in a different setting by Kimura (45).
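Because only fixation or loss remains at t = ∞, the endpoint information reduces to a binary KL divergence between (ψ_fix, 1 − ψ_fix) and the neutral (1/N, 1 − 1/N); a small sketch:

```python
import numpy as np

def info_from_fixation(psi_fix, N):
    """Accumulated information D(X_inf) in bits when the only outcomes are
    fixation (prob psi_fix under selection, 1/N under neutrality) or loss."""
    phi_fix = 1.0 / N
    return (psi_fix * np.log2(psi_fix / phi_fix)
            + (1 - psi_fix) * np.log2((1 - psi_fix) / (1 - phi_fix)))

N = 100
# Near-guaranteed fixation pins down the outcome: about log2(N) bits.
assert abs(info_from_fixation(1.0 - 1e-15, N) - np.log2(N)) < 1e-3
# A small perturbation of psi_fix away from 1/N yields little information,
# but (per the text) it is obtained at a proportionally even smaller cost.
assert info_from_fixation(0.02, N) < 0.01
```

Scanning psi_fix between 1/N and 1 reproduces the cyan curve of Fig. 3C up to the choice of N.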
Stronger selection accumulates more information, but at a disproportionately higher cost, since a large part of it is spent on shaping trajectories rather than outcomes. In the extreme case, to achieve ψ_fix = 1, only individuals carrying the A allele can be allowed to reproduce, and A gets fixed in only one generation, a path that is highly unlikely under neutrality. In this case, selection has the same effect on each genotype sampled as a parent in the first generation as it does on the allele that is ultimately fixed (both are A with probability 1/N under neutrality and probability one under selection). As a result, the cost is equal to the accumulated information, C̄_{0,∞} = D(G_∞) = D(X_∞), and only one bit per unit cost is accumulated (Fig. 3D). This is why previous results derived in deterministic settings (1, 3) claimed much more stringent limits on the accumulation of information.

Example 2: Accumulation of Information under Mutation.
Unlike the example above, real systems experience ongoing mutation. On the one hand, mutation is necessary to supply beneficial alleles for adaptation, but, on the other hand, mutation can disrupt existing adaptation. In this section, we assume that the single-locus, two-allele system starts at the neutral stationary distribution with D(X_0) = D(G_0) = 0, and then selection is turned on. Adaptation exploits copies of the allele A that either segregate in the population by chance at time 0, or arise later by mutation. Fig. 4A shows the information D(X_t) and D(G_t) over time. Accumulation takes place on the time scale of 1/μ. Note that the bound in Eq. 11 is not very tight. This is even more apparent in Fig. 4B, where the average cost per generation N C̄_t remains positive even after the system has reached the new stationary state, while the increments in D(X_t) and D(G_t) are zero. This corresponds to the cost of maintaining information, which we discuss in Section 4.
In summary, the accumulation of information is upper bounded by the KL cost of control, which, in turn, corresponds to the population size times the variation in fitness. However, if selection changes not only the probabilities of the final states but also the paths that lead there (because it is strong, because adaptation is maintained for a long time, or because adaptation is reversed by time-dependent selection), then the information accumulated is less than the total cost.

Comparison with the Fitness Flux Bound.
The fitness flux theorem (35) implies another upper bound on the information accumulation rate, ΔD(X_t) ≤ 2N φ̄_t, where φ̄_t is the expected fitness flux per generation. It is plotted in gray in Fig. 4. It differs from the cost of selection bound both quantitatively and in terms of interpretation.
Quantitatively, neither bound is tighter in general. In Fig. 4B, the cost of selection bound is tighter in early stages of adaptation, and the fitness flux bound is tighter in the late stages. This is consistent with the interpretation of fitness flux as the rate of ongoing adaptation, or the rate of ascent in the mean fitness landscape/seascape (35). This rate is high in the early stages of adaptation, when the population is far from the fitness peak and tends to climb up quickly. Later, when the population approaches a stationary distribution, there is no more adaptation, on average, and 2N φ t as well as ΔD(X t ) vanish. Meanwhile, the cost of selection bound kN C t is tighter in the earlier stages when most of the cost is spent on new adaptation, but it remains positive under stationarity, due to maintenance costs.
Technically, the fitness flux theorem was originally derived in ref. 35 under the diffusion approximation, and requires an additional assumption that the neutral process is at a stationary distribution with detailed balance. We derive and discuss the technical aspects of the fitness flux bound in SI Appendix, section S10 and Figs. S2 and S3.

Maintenance of Information
In this section, we ask how much information can be maintained in the genome for a given cost of selection. A general bound analogous to Eq. 10 seems to be out of reach for now, but we can study how the information maintained depends on key evolutionary parameters. We start by analyzing the single-locus, two-allele system, and then proceed to systems with large numbers of loci.
Stronger selection maintains more information, up to one bit at the genotype level, and more on the population level. However, it comes with a higher cost of selection C̄ (Fig. 5B). Notably, the cost increases faster than the maintained information. As a result, the amount of information maintained per unit cost decreases with selection strength (Fig. 5C).
There are two important asymptotic regimes. When selection is very strong, Ns ≫ 1, deleterious mutations are purged as soon as they arise, and D(G) ≈ 1 bit. Mutations arise with a probability Nμ per generation, and purging each costs C̄ ≈ 1/(N ln 2) (assuming truncation selection with α = 1 − 1/N; see SI Appendix, section S9). In this regime,

Strong selection: D(G)/C̄ ≈ ln 2/μ   [18]

bits can be maintained per unit cost (Fig. 5C). Similar arguments apply when Nμ > 1. The inverse scaling with μ is expected based on the deterministic mutation load (39) or Eigen's error catastrophe (2), which occurs when selection cannot maintain sequences without error, and it was also derived by Watkins (5).
Selection is much more efficient when it is weak, Ns ≪ 1. Both the cost and the maintained information can be calculated under the diffusion approximation (see SI Appendix, section S4B for details). If mutation is also weak, Nμ ≪ 1, the amount of genetic variation (pairwise diversity) scales with 2Nμ, and the cost (variation in fitness) is approximately C̄ ≈ Nμs²/(2 ln 2). Meanwhile, selection shifts the mean frequency of A away from 1/2 by about Ns/2, and this is associated with genotype-level information D(G) ≈ N²s²/(2 ln 2) bits. In this regime, up to N/μ bits per unit cost are maintained. When mutation Nμ is not negligible, a more accurate result is

Weak selection: D(G)/C̄ ≈ (N/μ)/(1 + 4Nμ);   [19]

see SI Appendix, section S4. This limit is also highlighted in Fig. 5C. The special case when Nμ ≫ 1, D(G)/C̄ ≈ 1/(4μ²), was previously derived by Watkins (5).
By itself, a single locus under weak selection cannot contribute much to biological function. However, selection can act on a polygenic trait influenced by many loci. If they are unlinked, we expect both the maintained information and the cost of selection to be approximately additive, and the ratio D(G)/C to scale according to Eq. 19. To confirm this, we next study a polygenic system.

Information Stored among Many Loci.
We use an individual-based model to study a population of N haploids with l = 1,000 biallelic loci, mutation, and free recombination. Offspring are produced by sampling pairs of parents with selection, shuffling their genomes (at each locus, the allele from either parent is inherited with probability 1/2), and flipping each allele with probability μ. Selection acts on a fully heritable, additive trait with equal effects, z_g = (the number of A alleles in g), with fitness w_g = (1 + s)^z_g.
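This model is simple enough to sketch in full. The following is a minimal illustration (our implementation, not the authors' code), with parameter values taken from the figures discussed below (N = 40, Nμ = 0.02, s = 0.01):

```python
import numpy as np

# Minimal sketch of the individual-based model described above (our implementation).
rng = np.random.default_rng(0)
N, l, s = 40, 1000, 0.01
mu = 0.02 / N                          # per-locus, per-generation mutation rate

# Initial genomes: beneficial allele A (coded 1) at each locus with probability 1/2
pop = rng.integers(0, 2, size=(N, l), dtype=np.int8)

def next_generation(pop):
    z = pop.sum(axis=1)                # additive phenotype: number of A alleles
    w = (1.0 + s) ** z                 # fitness w_g = (1 + s)^z_g
    p = w / w.sum()
    moms = rng.choice(N, size=N, p=p)  # sample parent pairs with selection
    dads = rng.choice(N, size=N, p=p)
    # free recombination: each locus inherited from either parent with prob 1/2
    coin = rng.integers(0, 2, size=(N, l), dtype=bool)
    off = np.where(coin, pop[moms], pop[dads])
    # mutation: flip each allele with probability mu
    return np.where(rng.random((N, l)) < mu, 1 - off, off)

for _ in range(500):
    pop = next_generation(pop)

# Selection drives the mean phenotype above the neutral expectation l/2 = 500
print(pop.sum(axis=1).mean())
```

Setting zero recombination instead (offspring copy a single parent, up to mutation) reproduces the asexual variant used later in Fig. 6D.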
The results are shown in Fig. 6. Fig. 6A shows an example of a stochastic population trajectory, indicating the phenotypes present in the population over time. The system is initialized with random genomes that contain the beneficial allele at each locus with probability 1/2, with z taking values around l/2 = 500 with binomial noise. Selection with s = 0.01 makes the beneficial alleles more frequent over time. The stationary distribution over phenotypes is shown in Fig. 6B. Under neutrality, ϕ_Z = Binom(l, 1/2) by symmetry. The distribution ψ_Z under selection is shifted relatively far from ϕ_Z, leading to D(Z) = 88.0 bits of information on the phenotype level.
The population state distribution and the genotype distribution are inaccessible due to their dimensionality (SI Appendix, Fig. S1). However, we know that they are lower bounded by D(Z), which is easy to compute, and D(Z) ≈ D(G), since Z is the only trait under selection. Since the loci are unlinked and have equal effects, the information D(Z) can be divided evenly among them. The marginal distribution over allele frequencies is only slightly different from neutrality (Fig. 6C), by about D(X_single) = 0.095 bits in terms of the allele frequency distribution and D(G_single) = 0.088 bits in terms of allele probabilities. The 1,000 loci, however, combine to produce a large shift in the phenotype distribution. This information is maintained at a very low cost of selection, C = 0.0012 bits per generation, or relative fitness variance V = 0.0017. This amounts to D(Z)/C = 7.1 × 10⁴ bits per unit cost, only a little below the single-locus limit N/[μ(1 + 4Nμ)] = 7.4 × 10⁴ under weak selection.
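As a sketch of how a quantity like D(Z) can be estimated from samples, the following uses a plug-in histogram estimator of the KL divergence from the neutral ϕ_Z = Binom(l, 1/2), in bits (our illustration; a shifted binomial stands in for the selected distribution ψ_Z):

```python
import math
import random
from collections import Counter

# Plug-in KL estimator, in bits, against the neutral phi_Z = Binom(l, 1/2).
def log2_binom_half(z, l):
    # log2 of the Binom(l, 1/2) pmf at z, via log-gamma for numerical stability
    return (math.lgamma(l + 1) - math.lgamma(z + 1)
            - math.lgamma(l - z + 1)) / math.log(2) - l

def d_z(samples, l):
    n = len(samples)
    return sum((c / n) * (math.log2(c / n) - log2_binom_half(z, l))
               for z, c in Counter(samples).items())

random.seed(1)
l, n = 100, 20000
# Neutral phenotypes: the D(Z) estimate should be near 0 (small plug-in bias remains)
neutral = [sum(random.random() < 0.5 for _ in range(l)) for _ in range(n)]
# A small per-locus shift in allele probability (0.5 -> 0.6) adds up across loci
shifted = [sum(random.random() < 0.6 for _ in range(l)) for _ in range(n)]
print(d_z(neutral, l), d_z(shifted, l))
```

The shifted case illustrates the point made above: each locus is perturbed only slightly, yet the loci combine to move the phenotype distribution several bits away from neutrality (here about l times the per-locus divergence, ≈ 2.9 bits for l = 100).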

Interference between Loci.
In practice, selection on different loci might interfere, and this can hinder the maintenance of information. The interaction may be due to Hill-Robertson interference, linkage, or epistasis. In Fig. 6D, we vary the selection coefficient s on individual alleles in an l = 10⁴ locus system and plot the maintained D(Z) against the cost C. We use the individual-based model to compute these with free recombination (as in Fig. 6A-C) and with zero recombination (offspring genotypes are identical to those of single parents, up to mutation). We compare the results with the weak-selection scaling according to Eq. 19, and with results for 10⁴ loci that evolve independently (cost and information are summed over 10⁴ single-locus systems).
With free recombination, weak selection maintains about as much information as if the loci were independent (brown points and gray line in Fig. 6D, Inset), approximately according to Eq. 19 (gray dotted line). However, when selection is strong, individual alleles are perturbed in frequency, due to random associations with alleles at other loci in a finite population (51,52), reducing the efficiency of selection. As a result, the freely recombining loci maintain less information than if they were independent. This is in addition to the fact that, under strong selection, maintenance is more costly even for independent loci (full gray line departs from dotted gray line, Fig. 6D). Extremely strong selection, which removes potentially adaptive variation at other loci, maintains even less information than more moderate selection, and it makes recombination ineffective (brown points at high C in Fig. 6D).

[Displaced fragment of the Fig. 6 caption: (C) Allele frequency distributions under neutrality (ϕ_X_single, blue; computed using a transition matrix for the single-locus system) and under selection (ψ_X_single, red; computed as a histogram over all loci and 2 × 10⁵ generations at stationarity). The associated D(X_single) and D(G_single) correspond to information maintained at one locus; because the loci are approximately independent, the total information is about l = 1,000 times more. Population size N = 40, mutation strength Nμ = 0.02, selection strength Ns = 0.4. (D) The maintained information D(Z) vs. the cost of selection C, with recombination (brown points) and without recombination (olive points), compared with predictions under the assumption of independent loci (gray line; single-locus diffusion approximation, with information and cost multiplied by the number of loci) and the linear scaling with C based on Eq. 19 (dotted gray line). Computed for l = 10⁴ loci, N = 40, Nμ = 0.02, and variable Ns. Distributions estimated from a stochastic trajectory over 5 × 10⁴ generations, after 5 × 10³ generations of burn-in. Inset shows identical data with a log vertical scale.]
Without recombination, less information is maintained at any given cost (olive points in Fig. 6D). In fact, Watkins (5) has shown that, due to clonal interference, organisms with no recombination cannot maintain more than the order of ln(N)/μ bits of information even if the cost is unlimited, making Haldane's (39) and Eigen's (2) results pertinent to asexual populations.
The advantage of recombination has also been recognized in a similar context by MacKay (4) and Peck and Waxman (6), and relates to the evolution of sex and epistasis. Recombination is advantageous when facing unconditionally deleterious or beneficial alleles (43), but can be disadvantageous when adaptation depends on beneficial combinations of alleles (53). However, it is not clear whether any form of selection can maintain more information at a given cost than the N/[μ(1 + 4Nμ)] achieved by weak directional selection with recombination.

Discussion
Selection exerts control on evolving populations, but its capacity is limited. The limits to selection have been approached from various angles. Here we build upon previous work that had developed the idea that selection accumulates and maintains information in the genome (1,2), and that this is associated with a cost in terms of variation in fitness, such as genetic load or fitness variance (39,44). The early work has suggested remarkably simple limits to selection: that the maximal rate of accumulation is bounded by the cost itself (1,3), and that maintenance is limited to about 1/μ functional sites in the genome (2,39).
Later work has pointed out that both accumulation (4, 7) and maintenance (5,6) can exceed these limits, notably when recombination is involved. However, the general bounds remained unclear, possibly, in part, due to the difficulty of defining genetic information in general.
The measures of information that we have introduced in Section 2 coincide with or generalize previous definitions, and offer two advantages. First, they facilitate connections between different levels: for example, between the abstract population-level information that has been studied theoretically in different contexts (34)(35)(36) and the effect that selection has on the distribution of phenotypes.
Second, the generality of our definition allows proving a general bound on information accumulation rate. This turns out to be a factor N faster than the early bounds, but depends on selection on individual loci being weak. The bound relies on a measure of cost of selection that connects the genetic load and fitness variance (48) with the KL cost in control theory (30,31), recently used in the context of artificial selection (32).
How much information can be maintained in the genome at a given cost remains an open problem, but we have discussed how this might scale with the population size and the mutation rate. The scaling in Eq. 19 generalizes a result by Watkins (5) for realistic populations with Nμ < 1. Still, more work is needed to make claims about the information content of any real organism's genome. Typical populations have N_e/μ much greater than the genome size, suggesting that the genome size or other factors are more limiting than Eq. 19. Maintenance can be made more difficult by linkage or epistasis, and parts of the genome are likely under strong selection, which is more costly. Still, Eq. 19 suggests that, in theory, the genome could contain a substantial amount of information among weakly selected loci, for example, coding for polygenic traits. This is consistent with recent work (54) pointing out that mutation load does not pose severe limitations to the functional fraction of the human genome.
Similarly, the bound on accumulation rate in Eq. 10 hypothetically allows accumulation of information amounting to 10% of the human genome in about 10⁶ generations (6 × 10⁸ bits, assuming effective population size N_e ≈ 10⁴, k = 2, and meager cost C ≈ 0.03 or relative fitness variance V ≈ 0.018 devoted to accumulation). But this is unlikely to have happened. Some selection was likely strong and more costly, and selection could have fluctuated, reversing previous adaptation. However, under the right conditions, information can accumulate very fast.
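The arithmetic behind this estimate can be reproduced directly (our sketch; it assumes the Eq. 10 bound takes the form of at most k · N_e · C bits gained per generation, which is consistent with the figures quoted above):

```python
# Back-of-envelope check of the accumulation estimate (our sketch).
# Assumes the Eq. 10 bound: bits gained per generation <= k * N_e * C.
k, Ne, C, generations = 2, 1e4, 0.03, 1e6

max_bits = k * Ne * C * generations
print(max_bits)   # 6e8 bits, i.e., ~10% of the ~6e9-bit genome upper bound
```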
Our findings are complementary to the point raised by Kondrashov (41), that the survival of populations could be threatened by large numbers of weakly deleterious mutations (Ns < 1). While selection cannot purge them, it can perturb the allele frequency distribution of each by a small amount, and thus shift the distribution of higher-level traits very far from neutrality. This is similar to the resolution by Charlesworth (55). In fact, information accumulation and maintenance are most cost efficient in this regime. This does not mean that a genomic architecture, where most mutations operate at Ns < 1 and information is encoded among many weakly specified sites, would evolve as an adaptation to maximize information gain. Nevertheless, such an architecture might arise in multicellular organisms as a side effect of their small effective population sizes and long genomes (56,57).
Focus on the information content of genomes, rather than their fraction under selection, could help better frame the controversy sparked by some publications from the Encyclopedia of DNA Elements (ENCODE) project (12-16, 54, 58). On the one hand, genomic regions under detectable selection [less than 15% in humans (59)] likely contain less than two bits per base pair, because their current function could be achieved by a number of alternative sequences (e.g., due to synonymous mutations in coding regions, or flexibility of transcription factor binding site sequence and location). On the other hand, regions without detectable selection could contain a considerable amount of information in the aggregate, at a low cost, encoding polygenic traits.
In bioinformatics, there already is a measure of information content applicable to short regulatory motifs (18,19). Future work could examine the precise relationship between this measure and our theoretical definitions. The generality of our framework also opens directions for future research. One is to predict the maximal amount of information that can be maintained in genomes and populations with realistic parameters. Another is to study the information content of genomic elements with well-described genotype-phenotype maps [e.g., promoters (26,27)], under different hypotheses about selection on the phenotype.

Data, Materials, and Software Availability.
There are no data underlying this work.