Introduction

Evolutionary novelty appears to arise more frequently through changes in existing patterns of gene regulation, the function of existing proteins, or both, rather than by invention of completely new genes (Betrán and Long 2002; True and Carrol 2002). The tailoring of existing genetic systems to new uses is called genetic co-option (True and Carrol 2002). Although new genes have been created by assembling normally unrelated genomic segments (e.g., Long 2001; Long and Langley 1993), the observation that the total gene number in complex organisms does not differ greatly from simpler organisms suggests the importance of co-option of pre-existing genetic systems (e.g., Claverie 2001; Betrán and Long 2002). Furthermore, genetic co-option events have been associated with major changes in organism ecology and life history (e.g., Chen et al. 1997; Harris et al. 2002). Although the molecular basis of genetic co-option has been studied only in a few cases, the process is thought to have played a role in the major episodes of adaptive divergence of multicellular organisms (Lynch and Conery 2000; Lynch and Force 2000; Taylor et al. 2001).

Gene duplication is an important mechanism for genetic co-option. It provides a mechanism for evolution of divergent protein functions (Piatigorsky and Wistow 1991; Ohta 1993; Hughes 1994), novel gene expression patterns (Force et al. 1999; Lynch and Force 2000), or both (Gibert 2002; Hughes 2002). For example, a single gene that is expressed in different tissues might experience conflicting selective pressures, the result being a compromise between optimal adaptations for any one tissue. Duplication of such a locus can lead to specialized patterns of expression among gene copies (Force et al. 1999), providing natural selection the freedom to promote tissue-specific functional divergence (e.g., Gibert 2002). The tremendous diversity of extant gene families, taken together with the observation that there is often an acceleration of amino acid substitution rates following gene duplication (Li 1985; Lynch and Conery 2000), suggests that gene duplication has been an important mechanism for functional divergence of genetic systems. However, it is generally difficult to identify the functionally important amino acid changes associated with these events.

In this paper we implement a new method for detecting functional divergence of proteins following a gene duplication event and for identifying the specific amino acid sites involved. The approach is an extension of the model of codon evolution developed by Goldman and Yang (1994; see also Muse and Gaut 1994), Modeling evolution among codons allows maximum likelihood (ML) estimation of the relative rates of nonsynonymous (d N) and synonymous (d S) changes. The ratio of these rates (ω = d N/d S) is a measure of selective pressure on the protein product of a gene (Yang and Bielawski 2000). For example, if nonsynonymous mutations are deleterious, purifying selection will reduce their fixation rate and d N/d S will be less than 1, whereas if nonsynonymous mutations are advantageous they will be fixed at a higher rate than synonymous mutations, and d N/d S will be greater than 1. A d N/d S ratio equal to one is consistent with neutral evolution. The original model (Goldman and Yang 1994) averaged d N/d S over all sites of a gene and lineages of a phylogenetic tree. It was subsequently extended to allow variation in d N/d S among sites (Nielsen and Yang 1998; Yang et al. 2000) and among branches (Yang 1998). A recent approach allows variation in d N/d S among sites, with additional variation at some sites in a prespecified branch (Yang and Nielsen 2002). Here we describe a model that allows variation in d N/d S among sites, with a fraction of sites evolving under divergent selective pressures between two clades following a co-option event. We implement the model in the maximum likelihood framework and apply it to two cases of evolution by gene duplication: (i) the ε and γ globin gene family (Meireles et al. 1995; Johnson et al. 1996; Fitch et al. 1991) and (ii) the eosinophil cationic protein (ECP) and eosinophil-derived neurotoxin (EDN) gene family (Zhang et al. 1998; Zhang and Rosenberg 2002).

Theory

We assume that the phylogeny is given or independently estimated and that there has been some change in selective constraints following some point in evolutionary time that can be specified a priori. In this paper we are specifically interested in testing if functional constraints differ significantly between two paralogous clades of genes following a gene duplication event. A duplication event is a point in evolutionary time that can be easily identified a priori on a phylogeny.

The model of codon substitution of Goldman and Yang (1994) describes the substitution rate from one sense codon, i, to another, j, as

$$q_{ij}\,=\,{\kern 1pt} \left\{ {\matrix{{0,} \hfill & {{\hbox{if }}i{\hbox{ and }}j{\hbox{ differ at two or three codon positions}}} \hfill \cr{\mu \pi _j ,} \hfill & {{\hbox{if }}i{\hbox{ and }}j{\hbox{ differ by a synonymous transversion}}} \hfill \cr{\mu \kappa \pi _j ,} \hfill & {{\hbox{if }}i{\hbox{ and }}j{\hbox{ differ by a synonymous transition}}} \hfill \cr{\mu \omega \pi _j ,} \hfill & {{\hbox{if }}i{\hbox{ and }}j{\hbox{ differ by a nonsynonymous transversion}}} \hfill \cr{\mu \omega \kappa \pi _j ,} \hfill & {{\hbox{if }}i{\hbox{ and }}j{\hbox{ differ by a nonsynonymous transition}}} \hfill \cr } } \right.$$

Parameter κ the transition–transversion rate ratio, π j is the equilibrium frequency of codon j, and ω (= d N/d S) is a measure of the selective pressure acting on the protein product of the gene.

We assume that selective pressure varies among the amino acids encoded by a gene. Moreover, we assume that a subset of sites experience a change in selective pressure at a point in evolutionary history such as a duplication event. We do not know the history of selective pressure at each site, and we wish to identify which sites have experienced a change following the duplication event and estimate the level of selective pressure in each paralogous clade at such sites. To achieve this we considered two “branch-site” models, which we refer to as Models C and D, as they follow two earlier branch-site models (called A and B) implemented by Yang and Nielsen (2002).

Model C is an extension of the site-specific “neutral” model (M1) of Nielsen and Yang (1998). M1 assumes two classes of sites; one class is completely conserved, with ω0 = 0, and the other class is completely neutral, with ω1 = 1. Only the proportion of sites under ω0 = 0 (f 0) is estimated via ML, as f 1=1−f 0. Although M1 allows different selective pressure among site classes, it assumes that the same selective pressure (for each site class) acts over all branches of a phylogeny. Model C extends M1 by adding a third class of sites where selective pressure differs in different parts of a phylogeny. The ω parameters of this third class of sites are estimated from the data via ML. Because assumptions under M1 and Model C are too simplistic for many datasets (Yang 2001), we used Model D only to analyze data sets considered in this paper. Model D is an extension of the site-specific “discrete” model (M3) of Yang et al. (2000). For example, M3 (k = 2 categories), assumes two classes of sites with proportions f 0, f 1 and ratios ω0, ω1 estimated from the data (Fig. 1A). Model D extends M3 by allowing selective pressure at one class of sites to differ in different parts of a phylogeny (Fig. 1B). For gene families, this means a class of sites having two independent ω parameters, one for each clade of paralogous genes that arose by duplication (Fig. 1B: ω1A, ω1B).

Figure 1
figure 1

A The discrete model (M3) of Yang et al. (2000) with k = 2 site classes assumes that two classes of sites are evolving under different levels of selective pressure (ω0 and ω1), with proportions f 0 and f 1. B Model D extends M3 by allowing selective pressure at one class of sites to differ in different parts of the phylogeny. A gene duplication is indicated on the topology in B by an open circle. Divergent selection pressure following the duplication event is accommodated by one site class with two independent ω parameters (ω1A and ω1B) for each clade of paralogous genes.

Note that Models A and B of Yang and Nielsen (2002) assume four classes of sites in a sequence: the first two classes have different but uniform selective pressure over all branches of the phylogeny (ω0, ω1); the third and fourth classes have ω0 and ω1 in all but a few “foreground branches,” where selective pressure is assumed to have changed (i.e., ω0→ω2 and ω1→ω2). Model A fixes ω0 = 0 and ω1 = 1, whereas they are free parameters in Model B. Models A and B were designed for cases where a certain event caused some sites to evolve under positive selection along prespecified branches (i.e., ω2 > 1). In Model D we are interested in those sites that have evolved under divergent selective pressures (with ω1A ≠ ω1B in Fig. 1B), and not necessarily in sites under positive selection.

Let h be a codon site and n the number of such sites in a dataset. The observed data x h at a site is a vector of codons across the alignment. Let y h be the site class to which site h belongs, and assume that there are k = 2 classes of sites. In the first site class (y h  = 0) all branches have ω0, while in the second class (y h  = 1) the two clades have two independent ω parameters for paralogous clades (ω1A and ω1B, respectively). The probability of the data at site h, conditioned on the site class p(x h |y h ), can be calculated according to Goldman and Yang (1994) if y h  = 0, or Yang (1998) if y h  = 1. The unconditional probability is an average over the site classes:

$$p({{\bf x}}_h ){\hbox{\,=\,}}\sum\limits_{k\,=\,0}^{\hbox{1}} {f_k p({{\bf x}}_h |y_h}\,=\,k)$$

We assume that the substitution process at individual codon sites is independent, so that the log likelihood is a sum over all sites in the sequence:

$$l{\hbox{\,=\,}}\sum\limits_{h\,=\,1}^n {{\kern 1pt} \log \{ p({{\bf x}}_h } )\}$$

The model was also implemented with k\,=\,3 site classes. In this case, if y h  = 0 or 1, all branches have the same ω ratio, ω0 or ω1, respectively. If y h  = 2, the two paralogous clades have ω2A or ω2B, respectively.

Parameters of the model include κ, the ω’s, the f‘s, and the (2N – 3) branch lengths of the phylogeny. These are estimated by numerical maximization of the log likelihood. The branch length measures the expected number of nucleotide substitutions per codon and is defined as an average across site classes (Nielsen and Yang 1998). Codon frequencies (π i ‘s) are estimated by using observed base or codon frequencies. Since an analytical solution is not possible, an iterative, hill-climbing, algorithm is used to maximize the likelihood function. At each iterative step the algorithm computes a search direction and does a one-dimensional search along that direction. The process is repeated at the best point along each search direction. The iteration continues until there is no improvement in the log-likelihood value, and changes to the parameter values are very small. All parameters, with the exception of codon frequencies, are updated simultaneously.

The likelihood ratio test (LRT) is used to compare the null model (M3) with Model D, which differs only by the assumption of divergent selective pressures at one class of sites following a duplication event. If the assumed topology is unrooted, twice the difference in log likelihood (2δ) under Models M3 and D with the same number of site classes is compared with a χ2 distribution having one degree of freedom. Note that the example in Fig. 1 shows rooted topologies. In this case there are two degrees of freedom because there is an extra branch length at the root under Model D that is not an identifiable parameter under M3. Significance of the LRT indicates the presence of sites evolving under significantly different selective pressures between the two clades. An empirical Bayes approach, based on ML estimates of model parameters, is used to infer to which class an individual codon site is most likely to belong (Nielsen and Yang 1998). The empirical Bayes approach uses ML parameter estimates in the prior distribution without accounting for their sampling errors. As a result, the accuracy of prediction may be influenced. An alternative is to use the more computationally costly hierarchical Bayesian approach, integrating over the uncertainty in the prior distribution. This is not pursued in this paper.

Data Analysis

We compiled data for two presumed examples of genetic co-option: (i) the divergence of ε and γ globins and (ii) the divergence of the eosinophil cationic protein (ECP) and the eosinophil-derived neurotoxin (EDN). Both these gene families have been well studied. Evidence has been found for the action of positive Darwinian selection since the divergence of ECP and EDN (Zhang et al. 1998; Bielawski and Yang 2003) but not since the divergence of the ε and γ globins (G. Aguileta, pers. commun.). In neither case has there been a specific test of the hypothesis that a long-term shift in selective constraints has occurred at a fraction of sites following gene co-option. We tested this hypothesis in these two cases.

We implemented three types of codon models. The first was the site- and time-homogeneous model, M0 (one ratio), of Goldman and Yang (1994), which averages selective pressure over codon sites and branches. The second was the site heterogeneous model, M3 (discrete), of Yang et al. (2000), with k = 2 and k = 3 site categories. The third was the new branch-site model, Model D, also with k = 2 and k = 3 site categories. All models were implemented under two different tree topologies: (i) the expected species tree, derived from the literature (Goodman et al. 1998; Goodman 1999; Meireles et al. 1999; Page et al. 1999), and (ii) the estimated gene tree. Gene trees were estimated using ML under the HKY85 substitution model (Hasegawa et al. 1985) combined with a discrete gamma model of rate variation among sites (Yang 1994). Tree searches were conducted by using the PAUP* computer program (Swofford 2000). All ML analyses of codon models were performed using the codeml program of the PAML package (Yang 1997; http://abacus.gene.ucl.ac.uk/software/paml.html).

ε and γ Globins

Transport of oxygen from lungs to tissues in vertebrates is accomplished via reversible binding with hemoglobin. In all vertebrates but cylostomes, hemoglobin is a tetramer comprised of two pairs of subunits, with the adult subunits designated α and β. In placental mammals, two paralogs (ε and γ) are expressed during early development instead of β. ε and γ arose about 80–100 MYA via a tandem duplication of an embryonic ε-type globin (Koop and Goodman 1988). Expression of ε is embryonic in all placental mammals, while γ expression is embryonic only in nonprimate placental mammals and prosimian primates, being delayed to fetal expression in simian primates (Johnson et al. 1996).

Persistence of both ε and γ over 80 to 100 million years of evolution implies strong selective pressure for both gene products, presumably due to some form of genetic co-option and divergence (Fitch et al. 1991). If functions of ε and γ had not diverged, it is likely that one copy would have become nonfunctional via the accumulation of deleterious mutations. Fitch et al. (1991) suggested that the initial gene duplication event was followed by divergence of ε and γ for different “embryonic niches.” Later (35 to 55 MYA), a second case of genetic co-option occurred when embryonically expressed γ was recruited for fetal expression in the early simian lineage (Koop and Goodman 1988; Tagle et al. 1988; Meireles et al. 1995; Johnson et al. 1996). The objective of our analysis was, first, to test for divergence in selective pressure between ε and γ and, second, to identify sites consistent with this type of selective pressure if they existed.

The ε globin gene sequences were from Alouatta seniculus (GenBank accession number = L25367), Aotus azarai (L25371), Ateles geoffroyi (L25368), Brachyteles arachnoids (L25366), Callithrix jacchus (L25363), Cebus olivaceus (U18610), Cheirogaleus medius (U11712), Eulemer macaco (M15735), Galago crassicaudatus (M36304), Homo sapiens (U01317), Hylobates syndactylus (U64616), Lagothrix lagothrica (L25358), Macaca mulatta (M81364), Otolemur crassicaudatus (U60902), Pan paniscus (M81362), Pongo pygmaeus (X05035), Saimiri sciureus (L25354), and Tarsius syrichta (M81411). γ globin gene sequences were from Alouatta seniculus (AF030097), Aotus azarai (U57044), Ateles paniscus (AF030093), Brachyteles arachnoides (AF030089), Callithrix jacchus (AF321384), Cebus apella (U57043), Cheirogaleua medius (M15758), Eulemur macaco (M15757), Galago crassicaudatus (M36305), Homo sapiens (U01317), Hylobates lar (J05174), Lagothrix lagotricha (AF030094), Macaca mulatta (M19434), Otolemur crassicaudatus (U60902), Pongo pygmaeus (M16208), Pan troglodytes (X03109), Saimiri ustus (AF016984), and Tarsius bancanus (AF0726810).

The estimated phylogeny for the ε and γ sequences is shown in Fig. 2. That gene tree and a “species” tree assuming the expected species relationships within the ε and γ clades were used in all analyses. Results obtained from the two trees were very similar and only those obtained under the gene tree (Fig. 2) are presented.

Figure 2
figure 2

Gene tree for the 36 sequences from the ε and γ gene family. The topology was obtained by using maximum likelihood analysis under the HKY85 substitution matrix combined with a correction for among sites rate variation (discrete gamma model). The scale bar indicates the mean number of substitutions per nucleotide site. The open circle indicates the duplication event that gave rise to the ε and γ genes. Under Model D, a fraction of sites was allowed to evolve under divergent selection pressure, with ω1A and ω1B for the two paralogous clades, respectively. Macaca f. and Macaca n. indicate M. fascicularis and M. nemestrina, respectively.

The one-ratio model (M0) yielded an estimated ω= 0.19 (Table 1), indicating that purifying selection dominated the evolution of these globins. However, this finding is based on an average over all sites and branches. Next we tested for heterogeneous selective pressure among sites. The discrete model (M3), which assumes variation among sites but no variation among branches, was applied to these sequences (Table 1). Likelihood ratio tests (LRTs) of M0 against M3 indicated significant variation in selective pressure among sites (Table 2). An LRT of M3 with k = 2 site classes against M3 with k = 3 site classes was not significant (Table 2). M3 with k = 2 site classes suggested a large fraction of sites (70%) evolving under very strong purifying selection (ω = 0.05), and a small fraction of sites (30%) evolving more quickly, under much weaker purifying selection (ω = 0.55). We note that estimates under M3 with k = 2 are quite different from those under M3 with k = 3. However, both models provide strong evidence that the ω ratio and selective pressure are highly variable among sites.

Table 1 Parameter estimates and log likelihood scores for the ε and γ globin gene familya
Table 2 Likelihood ratio test statistics (2δ) for the ε and γ globin family

In order to test for divergence in selective pressure between ε and γ, we applied the new model (Model D), which accommodates both the heterogeneity among sites and divergent selective pressures. Here, we allowed one class of sites to evolve under divergent selective pressures following the duplication of the ancestral ε-type globin (Fig. 2). Significance of the LRT for divergent selective pressures at a fraction of sites was borderline when we assumed k = 2 site classes (2δ = 5.88, df = 2, P = 0.05), and was unmistakable when we assumed k = 3 site classes (2δ = 13.88, df = 2, P = 0.001). Parameter estimates under Model D with k = 3 site classes suggested a large set of sites (∼65%) evolving under strong purifying selection (ω = 0.04), a small set of sites (∼19%) evolving under much weaker selective pressure (ω = 0.61), and a small set of sites (∼16%) evolving under divergent selective pressures, with very strong purifying selection in the ε-clade (ω = 0.008), and weak purifying selection in the γ clade (ω = 0.79). Model D with k = 2 also suggested sites under divergent selection with ω = 0.38 for the ε clade and ω = 0.75 for the γ clade.

We examined the sensitivity of the analysis to tree topology and to assumptions about codon usage. Results shown above were obtained under model F3×4, which uses the nucleotide frequencies at the three positions of the codon to compute the expected codon frequencies (Goldman and Yang 1994). We also examined parameter estimates under two other models of codon frequencies: the Fequal model, which assumes all codons are used equally, and the F61 model, which uses the 61 empirical codon frequencies as parameters. Parameter estimates under Model D with k = 3 site classes are similar under all three models of codon frequencies and under both tree topologies, and the qualitative conclusions about selective pressure acting at the three site classes are the same (Table 3).

Table 3 Parameter estimates for the ε and γ globin family under Model D with k = 3 and gene tree (and species tree)a

We identified 12 codon sites with posterior probabilities ≥75% of evolving under divergent selective pressures in ε and γ (2V, 6A, 66K, 70T, 87A, 88K, 117T, 118H, 119F, 127V, 144H; Callithrix ε used as amino acid reference). We mapped those sites to the three-dimensional structure of hemoglobin. Most sites had a nonrandom distribution on the tetramer; four were located at or within one residue of the α1 β1 interface (116, 117, 118, and 126), two were located at binding sites for 2,3-diphosphoglycerate (DPG) (1 and 143), and four were located in the region of the heme pocket (65, 69, 86, 87).

Divergent selective pressures might be related to either co-option associated with the duplication of embryonic ε-type globin or recruitment of γ for fetal expression. Recruitment of γ is thought to be associated with a gene duplication event that resulted in Aγ and Gγ globins, but this event cannot be resolved on a gene tree because frequent gene conversion among A and G copies has made them virtually identical in DNA sequence. However, considerable information is available concerning biochemical consequences of fetal expression of γ. Higher oxygen affinity of fetal blood is required for efficient oxygen transfer from mother to fetus (Poyart et al. 1992). In primates with fetal expression of γ globin, this is accomplished through partial loss of the regulatory effect of 2,3-DPG, which normally binds sites in the two adult β chains of hemoglobin and shifts the equilibrium to low oxygen affinity (Poyart et al. 1992). 2,3-DPG interacts with the hemoglobin tetramer at just seven residues: residues 1, 2, and 82 on both chains and residue 143 on one chain (Perutz and Imai 1980). Posterior probabilities for sites 1 and 143 were highest (>75%) for divergent selective pressures. Thus three of the seven residues relevant to the affinity of fetal hemoglobin to 2,3-DPG were identified to be under divergent selective pressures in our analysis.

We note that our high estimate of ω = 0.79 for sites evolving under divergent selection pressures in the γ clade is an average over all lineages in that clade. There are at least two scenarios for the high ratio. The first is that the high ω ratio in γ represents relaxed functional importance at those sites relative to ε. Relaxation of selective pressure, especially at sites that interact with 2,3-DPG, could have been an indirect result of amino acid changes at other sites that increased the oxygen affinity of γ-chain hemoglobin relative to maternal hemoglobin. A second scenario is a short burst of positive Darwinian selection somewhere in the γ clade, yielding an elevated ω ratio when averaged over all the γ clade lineages. It is tempting to speculate that such a burst of adaptive evolution was associated with duplication of embryonic ε-type globin or with recruitment of γ for fetal expression, with positive selection acting directly on those amino acid changes that increased oxygen affinity of γ-chain hemoglobin. However, altered oxygen affinity in γ could have evolved via just a few amino acid substitutions (Poyart et al. 1992), which might not be detected by a long-term average of ω. In either case the results support functional divergence between ε and γ.

ECP and EDN

ECP and EDN are RNase genes that arose about 31 million years ago through a gene duplication event in the ancestor to Old World monkeys and hominoids (Hamann et al. 1990; Zhang et al. 1998). Both ECP and EDN have host-defense roles, but their specific functions differ. ECP is a nonspecific toxin to bacteria and parasites, probably through cell membrane disruption, whereas EDN acts as a potent antiviral agent through degradation of viral RNA (Rosenberg and Domachowske 1999). ECP also has antiviral activity, but it is substantially less effective than Old World monkey EDN (Domachowske et al. 1998).

Evolution of this gene family has been well studied. Zhang et al. (1998) found an excess of nonsynonymous substitutions over synonymous substitutions in the branch leading to the ECP gene, a pattern consistent with positive Darwinian selection. Those authors suggested that strong selection for the antiparasitic function of ECP probably acted shortly after gene duplication. Zhang and Rosenberg (2002) later showed that the increased antiviral activity of EDN was predominately due to amino acid substitutions at two interacting sites. Using maximum likelihood methods, Bielawski and Yang (2003) confirmed an elevated rate of nonsynonymous substitution following the ECP–EDN duplication event and also, reported that ECP might have continued to evolve under positive Darwinian selection long after the initial period of functional divergence, whereas EDN evolution had been dominated by purifying selection. Bielawski and Yang (2003) used methods that did not account for variable selective pressure among sites. Here, we use Model D to specifically test for a subset of sites evolving under divergent selective pressures in ECP and EDN.

ECP gene sequences were from Gorilla gorilla (U24097), Homo sapiens (AF294019), Macaca fascicularis (U24098), Macaca nemestrina (AF479627), Pan troglodytes (AF294028), and Pongo pygmaeus (U24101). EDN gene sequences were from Cercopithecus aethiops (AF479630), Gorilla gorilla (U24100), Homo sapiens (AF294007), Hylobates leucogenys (AF479628), Macaca fascicularis (U24096), Macaca nemestrina (AF479631), Pan troglodytes (AF294081), Papio hamadryas (AF479629), and Pongo pygmaeus (U24104).

The estimated phylogeny for these ECP and EDN sequences is shown in Fig. 3. Results shown below were obtained by assuming the gene tree; results obtained by assuming the species tree were very similar and are not shown. The one ratio model (M0) yielded an estimated ω = 0.85 (Table 4), indicating a high relative rate of amino acid evolution. Likelihood ratio tests (LRTs) of M0 against M3 indicated significant variation in selective pressure among sites (Table 5). Similar to the globin dataset above, an LRT of M3 with k = 2 site classes against M3 with k = 3 site classes was not significant (Table 5). M3 with k = 2 site classes (Table 4) suggested a large fraction of sites (72%) evolving under purifying selection (ω = 0.34) and a small fraction of sites (28%) evolving under positive Darwinian selection with ω = 2.72. Next, we tested for divergence in selective pressure between ECP and EDN by using Model D; we allowed one class of sites to evolve under divergent selective pressures following the gene duplication event (Fig. 3). LRTs for divergent selective pressures at a fraction of sites were significant whether we assumed k = 2 or k = 3 site classes (Table 5).

Figure 3
figure 3

Gene tree for 15 sequences from the ECP–EDN gene family. The topology was obtained by using maximum likelihood analysis under the HKY85 substitution matrix combined with a correction for among-site rate variation (discrete gamma model). The scale bar indicates the mean number of substitutions per nucleotide site. The open circle indicates the duplication event that gave rise to the ECP and EDN genes. Under Model D, a fraction of sites was allowed to evolve under divergent selection pressure, with ω1A and ω1B for the two paralogous clades, respectively.

Table 4 Parameter estimates and log likelihood scores for the ECP–EDN gene familya
Table 5 Likelihood ratio test statistics (2δ) for the ECP–EDN family

Parameter estimates under Model D vary depending on whether k = 2 or k = 3. This arises from (i) differences in averaging ω’s over sites when assuming two versus three classes of sites; and (ii) sampling errors in ML parameter estimation. Because of sampling errors, particularly when the number of sampled lineages is low, individual parameter estimates must be interpreted cautiously. However, both models suggest the presence of sites evolving under positive selection with ω > 1 in both ECP and EDN and sites evolving under very divergent selective pressures in these two paralogs. Estimates when k = 3 suggest a set of sites (∼42%) evolving under strong purifying selection (ω = 0.07), a small set of sites (∼13%) evolving under positive Darwinian selection in both clades (ω = 3.76), and a set of sites (∼45%) evolving under purifying selection in the EDN clade (ω = 0.28) and positive Darwinian selection in the ECP clade (ω = 3.21). The sites evolving under positive selection just in ECP could reflect long-term selective pressure to maintain antiviral activity against respiratory viral pathogens (Bielawski and Yang 2003). We conducted sensitivity analyses and found that the LRTs and qualitative results of ML parameter estimation were robust to model of codon frequencies and tree topology (data not shown). Zhang and Rosenberg (2002) recently reported that additional gene duplication and conversion events occurred in the orangutan Pongo pygmaeus. We repeated our analyses on a dataset that excluded the orangutan and obtained very similar findings (data not shown). Because these findings are based on a relatively small sample of lineages, they need to be confirmed in a larger dataset. The empirical Bayes approach uses ML estimates of parameters to identify sites under divergent selection pressure but does not account for their sampling errors. Sampling errors of the ML estimates will be high in small datasets, such as ECP-EDN; as a result, the reliability of Bayesian site identification will be affected and may be sensitive to tree topology and model of codon frequencies.

Discussion

Mechanisms of genetic co-option have been difficult to study, in part, because functionally important changes have been difficult to identify. Recently, this problem has received a lot of attention from the standpoint of amino acid evolution (reviewed by Massingham et al. 2001; and Gaucher et al. 2002). Several approaches have been developed based on the premise that site-specific shifts in rates of amino acid evolution are related to changes in selective pressure (e.g., Gu 2001; Knudsen and Miyamoto 2001; Susko et al. 2002). Such a framework provides an important tool for studying genetic co-option (Massingham et al. 2001; Gaucher et al. 2002), especially for ancient divergences, where saturation of synonymous substitutions excludes a reliable codon-based analysis. However, for more recent divergences, an amino acid-based analysis does not fully utilize the information content of nucleotide datasets (Massingham et al. 2001; Bielawski and Yang 2003). Amino acid rates are limited by an inability to differentiate between different types of selective pressure that give rise to an amino acid rate shift; i.e., positive selection, neutral evolution, and purifying selection. The branch-site model of codon evolution (Model D) presented in this paper should provide a valuable tool for studying genetic co-option when sequences are not too divergent.

In this paper we applied Model D to address two problems of gene family evolution. First, we asked if there was a fraction of sites evolving under divergent selective pressures following a gene duplication event. We expect that the LRT will be a powerful basis for answering such a question, as similar LRTs have been shown to be a powerful and reliable means of testing for site specific heterogeneity in selective pressure (Anisimova et al. 2001). Indeed, we found significant evidence for divergent evolution in both the ε and γ and the ECP and EDN gene families, two well-studied families thought to have undergone functional divergence following gene duplication. The second problem is identification of specific sites involved in functional divergence. We expect this to be a more difficult problem, as information about rates of synonymous and nonsynonymous changes must be divided among paralogous clades, thus increasing sampling errors. If such partitioning results in too few changes along the branches of a specific clade, parameter estimates will be less reliable, as will the posterior probabilities.

Application of Model D to ε and γ divergence demonstrated that important clues to the mode of adaptive molecular evolution can be obtained from sequences which do not exhibit the characteristic marker of positive Darwinian selection (ω > 1). In particular, our results support the notion that a weakened relationship between γ-chain hemoglobin and 2,3-DPG is connected with molecular adaptation for increased oxygen affinity (Poyart et al. 1992). Furthermore, we note that the majority of sites identified as evolving under divergent selective pressures in ε and γ were associated with the major structural and functional features of the hemoglobin tetramer. We believe that Bayesian site identification in well-sampled datasets, combined with structural and functional information, can provide a valuable framework for identifying and studying mechanisms of genetic co-option.

Model D is very simple in the sense that it allows for only one set of sites evolving under divergent selective pressures; however, two or more such classes of sites might exist. For example, two classes of sites might represent two different domains, one being released from purifying selection in one paralog and the other being released from purifying selection in the other paralog. Model D would be forced to average over both classes of sites, resulting in reduced power of the LRT and lower posterior probabilities. One solution to this problem is to use a bivariate distribution for selective pressure, similar to what has been implemented for amino acid rates (e.g., Gu 2001; Susko et al. 2002). For example, the gamma model of among sites variation in ω (Yang et al. 2000) could be extended by allowing two gamma distributions, one for each paralogous clade. This approach would have the advantage of greater flexibility in modeling functional divergence.

Both site models and branch-site models are computationally complex, with estimation of parameters for finite mixture distributions (such as those in M3 and Model D) being particularly difficult. For example, a small fraction of sites under strong selective pressure might fit a dataset nearly as well as a higher fraction of sites under lower selective pressure. This impacts Bayesian site identification, as ML parameter estimates are used to compute the posterior probabilities. Simulation studies showed that low accuracy in Bayesian site identification occurs when sequence divergence is very low or too few sequences are sampled because under such conditions the sampling errors in ML parameter estimates are too high (Anisimova et al. 2002). Similarly, suboptimal parameter estimates, based on local optimum, also could lead to low accuracy in Bayesian site identification. We found local optima in both datasets under Model D but not under Model M0 or M3. With these points in mind, we make the following recommendations: first, to avoid being trapped at a local optimum, users should run Model D multiple times using different initial values; second, we advise caution on Bayes site prediction when sequence divergence is very low or when few sequences are sampled; and third, different tree topologies and models of codon frequencies may be used to evaluate the robustness of parameter estimates.

Addendum

After submission of the manuscript for this paper, Forsberg and Christiansen (2003) published a similar codon model, which also allows position-specific changes in selection pressure in two different parts of a phylogeny. They applied Gu’s (2001) model of functional divergence to the codon model of Goldman and Yang (1994). Their model assumes that the ω ratio varies among sites according to a discrete distribution with three classes. However, a proportion p d of sites is under different selective pressures and has independent ω’s, drawn from the same distribution, for two subclades of a phylogeny. The rest of the sites have the same ω, drawn from the same discrete distribution, for the whole tree. Our Models C and D are different in assuming different overall selective pressures (ω2A versus ω2B) for the two clades. Forsberg and Christiansen (2003) applied their codon model to a set of influenza A virus nucleoprotein sequences to study the changes in selection pressure after a shift from avian to human hosts. They used an LRT to test the hypothesis that p d > 0 and an empirical Bayes procedure to predict which sites had experienced a shift in selection pressures. Both the analysis by Forsberg and Christiansen (2003) and that in this paper demonstrate the importance of codon-based approaches in studying genetic co-option events, whether they be functional divergence following gene duplication or adaptive alteration of existing genes to the functional requirements of parasitizing a new host.