Are There Laws of Genome Evolution?

Research in quantitative evolutionary genomics and systems biology led to the discovery of several universal regularities connecting genomic and molecular phenomic variables. These universals include the log-normal distribution of the evolutionary rates of orthologous genes; the power law–like distributions of paralogous family size and node degree in various biological networks; the negative correlation between a gene's sequence evolution rate and expression level; and differential scaling of functional classes of genes with genome size. The universals of genome evolution can be accounted for by simple mathematical models similar to those used in statistical physics, such as the birth-death-innovation model. These models do not explicitly incorporate selection; therefore, the observed universal regularities do not appear to be shaped by selection but rather are emergent properties of gene ensembles. Although a complete physical theory of evolutionary biology is inconceivable, the universals of genome evolution might qualify as “laws of evolutionary genomics” in the same sense “law” is understood in modern physics.


Introduction
Darwin's concept of evolution, all its generality and plausibility notwithstanding, was purely qualitative. In the 1920s and 1930s, seminal work of Fisher, Wright, and Haldane laid the foundation for quantitative analysis of elementary processes in evolving populations, and in the 1950s, this population genetic theory was incorporated in the framework of the Modern Synthesis of evolutionary biology. However, the formalism of population applies only to microevolution in idealized populations and falls far short of a general quantitative theory of evolution. Rapid progress of genomics and systems biology at the end of the 20th century and in the beginning of the 21st century brought about enormous amounts of new data amenable to quantitative analysis. The new data types include numerous complete genome sequences, transcriptomes (genome-wide gene expression information), proteomes (organism-wide protein abundance information), interactomes (organism-wide data on physical and genetic interactions between proteins or gene), regulomes (comprehensive data on gene expression regulation), and more. This deluge of new information spawned a research direction that occupies itself with quantification of the relationships between various genomic and molecular phenomic variables and may be called quantitative evolutionary genomics [1,2].

Universals of Genome and Molecular Phenome Evolution
Quantitative comparative genomic analysis revealed several universals of genome evolution that come in the form of distinct distributions of certain quantities or specific dependencies between them. The most conspicuous universals include (Figures 1 and 2): N log-normal distribution of the evolutionary rates between orthologous genes [3][4][5]; N power law-like distributions of membership in paralogous gene families and node degree in biological ''scale-free'' networks [6][7][8][9]; N negative correlation between a gene's sequence evolution rate and expression level (or protein abundance) [10][11][12][13]; N distinct scaling of functional classes of genes with genome size [14,15].
The universality of these dependencies appears genuinely surprising. For example, the distributions of sequence evolution rate of orthologous genes are virtually indistinguishable in all evolutionary lineages for which genomic data are available, including diverse groups of bacteria, archaea, and eukaryotes [3][4][5]. The shape of the distribution did not perceptibly change through about 3.5 billion years of the evolution of life even though the number of genes in the compared organisms differs by more than an order of magnitude, and the repertoires of gene functions are dramatically different as well [5]. The same conundrum pertains to the other universals: despite major biological differences between organisms, these quantitative regularities hold, often to a high precision. What is the nature of the genomic universals? Do they reflect fundamental ''laws'' of genome evolution or are they ''just'' pervasive statistical patterns that do not really help us understand biology? A related major question is, are these universals affected or maintained by selection?

Mathematical Models to Account for the Evolutionary Universals
Clearly, should there be laws of genome evolution; in the sense this term is used in physics, identification of recurrent patterns and universal regularities is only the first step in deciphering these laws. The obvious next steps involve developing physical (mathematical) models of the evolutionary processes that generate the universals and test the compatibility of the predictions of these models with the observations of comparative genomics and systems biology. Indeed, such models have been proposed to account for each of the universals listed above ( Figure 2). Notably, these models can be extremely simple, based on a small number of biologically plausible elementary processes, but they are also highly constrained. A case in point is the birth-death-and-innovation model (BDIM) that explains the power law-like distribution of gene family sizes in all genomes [7][8][9]. This model includes only three elementary processes, the biological relevance of which is indisputable: i) gene birth (duplication), ii) gene death (elimina-tion), and iii) innovation (that is, acquisition of a new family, e.g., via horizontal gene transfer). A model with precise balance between the rates of these elementary processes and a particular dependency of birth and death rates on paralogous family size yields family membership distributions that are statistically indistinguishable from the empirically observed distributions [7].
Straightforward models of evolution have been developed that apparently account for more than one universal ( Figure 2). A case in point is a recent amended BDIM of evolution that connects two genomic universals that are not obviously related, namely, the distribution of gene family size and differential scaling of  functional classes of genes with the genome size [16]. In this model, gain and loss rates of genes in different functional classes (e.g., metabolic enzymes and expression regulators) are linked in a biologically motivated proportion. The model jointly reproduces the power law distribution of gene family sizes and the non-linear scaling of the number of genes in functional classes with genome size. Moreover, the model predicted that functional classes of genes that grow faster-than-linearly with genome size would show flatter-than-average family size distributions. The existence of such a link between these a priori unrelated exponents is indeed confirmed by analysis of prokaryotic genomes.
The ubiquitous negative correlation between sequence evolution rate and expression level triggered the hypothesis of misfolding-driven protein evolution that explains the universal dependency between evolution and expression under the assumption that protein misfolding is the principal source of cost incurred by mutations and errors of translation [4,17]. This assumption was used to incorporate evolutionary dynamics into an off-lattice model of protein folding [18]. The resulting model of protein evolution reproduced, with considerable accuracy, the universal distribution of protein evolutionary rates, as well as the dependency between evolutionary rate and expression. These findings suggest that both universals of evolutionary genomics could be direct consequences of the fundamental physics of protein folding.

Universals of Evolution Are Emergent Properties of Gene Ensembles, Not Selectable Features
The models of evolution that generate the observed universal patterns of genome evolution do not explicitly incorporate selection. The question of selective versus neutral emergence of global quantitative regularities has been explored in some detail for the case of network architectures. Networks have become ubiquitous images and tools of systems biology [6]. Indeed, any class of interacting objects can be naturally represented by nodes, and the interactions between these objects, regardless of their specific nature, can be represented by edges. Commonly explored biological networks represent gene coexpression; genetic interactions between genes; physical interactions between proteins; regulatory interactions between genes; metabolic pathways where metabolites are nodes and enzymes are associated with edges; and more, considering that the network formalism is general and flexible enough to capture all kinds of relationships. In a sharp contrast to random networks that are characterized by a Poisson distribution of the node degree, biological networks typically show a power law-like node degree distribution, P(k),k 2c , where k is the node degree, i.e., the number of nodes to which the given node is connected, and c is a positive coefficient. These networks are said to be scale-free because the shape of their node degree distribution remains the same regardless of the chosen scale, that is, any subnetwork is topologically similar to the complete network (in other words, scale-free networks display fractal properties). The negative power law node degree distribution is characteristic not only of biological networks but also of certain purely ''artificial'' networks such as the Internet. Barabási and colleagues came up with the provocative idea that this is an intrinsic feature of evolved networks and proposed a simple and plausible mechanism of network evolution known as preferential attachment [19]. In addition to the scale-free architecture, most of the biological networks possess additional interesting features such as small world properties, modularity, and hierarchical structure that are also widespread but tend to differ among networks representing different classes of biological phenomena [6].
Scale-free networks are ''robust to error but vulnerable to attack'': elimination of a randomly chosen node most of the time has little effect on the overall topology and stability of the network, whereas elimination of highly connected nodes (hubs) disrupts the network. This property might be conceived as implying that the architecture of such networks represents ''design'' that evolved under selection for increased robustness. However, this idea is no more justified than the view that the Internet was deliberately designed with the same purpose in mind. The preferential attachment mechanism in itself is a non-adaptive route of network evolution. Simulation of the growth of a network by random duplication of its nodes with all their connections followed by subfunctionalization, i.e., differential loss of edges by the daughter nodes, not only yields the typical power law distribution of the node degree but also reproduces the modular structure of biological (specifically, protein-protein interaction) networks [20]. Duplication followed by subfunctionalization is the most common route of gene evolution that does not intrinsically involve selection. Rather, subfunctionalization is naturally interpreted as a type of ''constructive neutral evolution'' whereby complexity, and complex networks in particular, evolve not as adaptations but through irreversible emergence of dependencies between parts of the evolving system [21,22].
Compelling evidence of the non-adaptive origin of global architectural features of networks was obtained through the analysis of gene coexpression networks in mutation accumulation (MA) lines of the nematode Caenorhabditis elegans [23]. The MA lines are virtually free of selective constraints, so comparison between these lines and natural isolates provides for evaluation of the contribution of selection to the evolution of various characters, in particular network architecture. The global architectures of evolutionary coexpression networks (i.e., networks in which edges connected genes with similar patterns of expression across multiple lines) were indistinguishable between MA lines and natural isolates, demonstrating that these features are not subject to selection. Furthermore, there was no significant correlation between the properties of any given node, such as the degree and the clustering coefficient, in the networks from mutation accumulation lines and natural isolates. These results strongly suggest that not only general architectural properties of networks but even the position of individual nodes in networks are not subject to substantial selection.
Collectively, the ability of simple models to generate the universals of genome evolution and additional results indicating that the global architecture of biological networks is not a selected feature suggest that all evolutionary universals are not results of adaptive evolution. Such a conclusion does not imply that these universals are biologically irrelevant: beneficial properties such as network robustness may emerge ''for free'' from the most general principles of evolution.
The universal dependencies and distributions seem to be emergent properties of biological systems that appear because these systems consist of numerous (sufficiently numerous for the manifestation of robust statistical regularities) elements (genes or proteins, depending on the context) that weakly interact with each other, compared to the strong interactions that maintain the integrity of each element. Clearly, this representation of biological systems as ensembles of weakly interacting ''particles'' resembles rough but enormously useful approximations, such as ideal gas, that are routinely used in statistical physics. This approach is obviously over-simplified because higher level interactions such as epistasis are common and critically important in biology [24,25]. Nevertheless, the ability of simple models akin to those used in statistical physics to quantitatively reproduce universals of genome and molecular phenome evolution attest to the fruitfulness of the ''statistical ensemble'' approximation.

''Laws'' of Evolutionary Genomics
The analogies between the evolutionary process and statistical physics are not limited to the existence of universal dependencies and distributions, some of which can be derived from simple models. It is actually possible to draw a detailed correspondence between the key variables in the two areas [26,27]. The state variables (degrees of freedom) in statistical physics such as positions and velocities of particles in a gas are analogous either to the states of sites in a nucleotide or protein sequence, or to the gene states in a genome, depending on the level of evolutionary modeling. The characteristic evolutionary rate of a site or a gene naturally corresponds to a particle velocity. Furthermore, effective population size plays a role in evolution that is clearly analogous to the role of temperature in statistical physics, and fitness is a natural counterpart to free energy.
The process and course of evolution critically depend on historical contingency and involve extensive adaptive ''tinkering'' [28,29]. Therefore a complete physical theory of evolution (or any other process with a substantial historical component) is inconceivable. Nevertheless, the universality of several simple patterns of genome and molecular phenome evolution, and the ability of simple mathematical models to explain these universals, suggest that ''laws of evolutionary biology'' comparable in status to laws of physics might be attainable.

Peer Review and Author's Response
At the editor's suggestion the peer review comments we received follow, along with Eugene V. Koonin's response.

Peer Review by Ruben Valas, J. Craig Venter Institute (Counterpoint)
In many ways this article is an attempt to show that we could be deriving universal laws in biology. I think much of the work cited strongly argues for some universal laws in biology, but I think the article could be strengthened by taking a wider perspective. It seems a large focus of this post-modern synthesis is to reduce the role selection plays in the study of evolution. Here are several examples of why I think the universal laws do not make this true, and why it is essential to catalog the exceptions to these laws.
The BDIM and associated models all describe the distribution of gene families as a function of genome size. But what determines genome size? Is it not subject to selection? So the power law in general may not be result of selection, but the specific instance in a specific genome must be directly dictated by selective constraints on total genome size at the very least.
Let's consider the BDIM in its original form [30]: ''An implication of these observations is that, in general, large families are older than small ones. Exceptions to this generalization probably point to selection for a specific family size; for example, it seems likely that selection acts against proliferation of certain essential proteins, e.g. ribosomal proteins, which typically form single-member families.'' The ribosomal proteins are an interesting example, but I've also considered the immunoglobulin as an important exception to this rule. According to the Superfamily Database [31], this family has 6,325 domains in humans, making it the second most popular domain. However, its distribution is nearly metazoan specific, so it's a fairly young protein superfamily. Clearly, the largest and smallest families in most genomes are under some selective pressure that cannot by captured by the initial BDIM.
It seems the modified BDIM takes this into account better by either considering evolutionary potentials or correlated functional categories. It seems one could not define the evolutionary potentials without taking selection into account. In the model cited in [32] it seems straightforward to link functional categories such as regulation and metabolism, but where would the immunoglobulins discussed above come into this model? I argue that they are a truly novel functional class: ''immune response of multicellular organisms''. It would be very complicated to incorporate the formation of novel functional classes into this model.
Where do truly novel functions come from? The example of subfunctionalization is certainly a case where selection plays at best a supporting role. But what about the generation of ORFans? What about the many molecular innovations of the eukaryotes? It seems most of protein evolution is pretty neutral, but in my opinion that makes the rare events that involve selection more important and interesting.
The observation that distribution of evolutionary rates is conserved is another apparent law. However, the means of these distributions vary by several orders of magnitude. Understanding the universal distribution is useful, but to understand the history of any one particular genome it seems one needs to include selection on some level to explain that difference.
I think the misfolding mistranslation hypothesis should motivate us to look for examples in evolution where the laws of protein folding change. If most of selection is to ensure proteins fold properly then surely innovations in protein folding and degradation would have dramatic consequences on these landscapes. It seems a history of major changes in protein folding would complement this universal observation, and possibly explain some of the differences in the means of evolutionary rate distributions.
I think the author overstates the result in [33]. As the authors of that work conclude: ''Our own comparison of the MA versus NI evolutionary gene coexpression networks has revealed that similar properties at a high level of abstraction can obscure substantial and biologically relevant differences at lower levels. With respect to the evolution of biological systems, the details remain important.'' This point seems totally lost when one looks at the universality of scale-free networks.
All that said, I think the author is gaining ground in developing this perspective. I think this article could certainly use a more futuristic perspective. I am curious as what the author's vision of biology would be if everything could be reduced to some universal laws, as seems to be the case in physics. What cannot be reduced to laws and how will the law complement that?
In conclusion, I think the work cited here is convincing that there are laws in biology, but I think it is more interesting to try to find and understand the exceptions. The laws appear to be real much of the time, and it is certainly worthwhile to try to understand the universals with well-defined theory. But when the theory and law tells you all organisms are the same regardless of their place in the natural world, it seems counterproductive to studying biology. It seems a pursuit of universal laws could lead one to reduce biological systems too much to the point where they behave nicely instead of behaving in a way that represents the biology. Put another way: this paper justifies that biology could be defined in laws in a manner similar to physics, but it needs much more on why it should be.
Author's Response to Ruben Valas I appreciate this constructive and insightful review. I believe that a few comments on the important general points made by the reviewer will be useful to clarify my position on the ''postmodern synthesis'', the role of laws in biology, and the place of selection in our evolving understanding of evolution.
First of all, I have never advocated the view that biology could be ''fully defined in laws in manner similar to physics''-in fact, the conclusion of this essay states exactly the opposite. For that matter, neither can physics, at least in some of its most exciting areas, in particular, modern physical cosmology, the study of the evolution of the universe(s). As a most general principle, I would submit that any sufficiently complex domain of study, in which there is an intrinsic arrow of time-and that, as far as we know, applies to the entire universe or multiverse [34]-cannot be reduced in this manner. In these fields, be it cosmology or biology, deterministic chaos is a major component of evolution whereby miniscule causes can trigger major effects, so that the existence of statistical laws does not imply predictability of histories. The interplay between stable, predictable patterns (laws) and unpredictability of specific outcomes in large part defines biological evolution [29]. This is the old opposition of chance and necessity from the eponymous book of Monod [35]-only now we know much more about both parts of the dyad. The theory and law certainly do not tell us that all organisms are the same. On the contrary, they differ dramatically, in particular, in terms of genomic and phenotypic complexity, depending on the pressure of purifying selection, which itself critically depends on the effective population size [36,37] and hence on the happenstance of evolution. The theory does suggest, though, that all the contributions of chance notwithstanding, certain simple evolutionary models apply to all lines of evolution, albeit with different parameter values.
Second, it is not the case that ''a large focus of this post-modern synthesis is to reduce the role selection plays in the study of evolution''. The better, more nuanced understanding of the balance between selection and neutral, stochastic processes in evolution is not the goal but a major outcome of research in evolutionary genomics and systems biology. The ''post-modern synthesis'' does posit a change of the fundamental null hypothesis of evolutionary biology: the new null hypothesis is that any observed pattern is first assumed to be the result of non-selective, stochastic processes, and only once this assumption is falsified, should one start to explore adaptive scenarios [29,38].
So should evolutionary biologists strive to turn their science into physics or should they collect the colorful stamps of unique adaptations? Certainly both! Adaptations are incredibly interesting and beautiful but to understand their nature and origins, as opposed to concocting ''just so stories'' [39], the description of the underlying evolutionary background with models akin to those used in statistical physics is crucial.