Preferential retention of the slowly evolving gene in pairs of duplicates in angiosperm genomes

Gene duplication provides raw material for functional innovation, but gene duplicability varies considerably. Previous studies have found widespread asymmetrical sequence evolution between paralogs. However, it remains unknown whether the rate of evolution among paralogs affects their propensity of being retained after another round of whole‐genome duplication (WGD). In this study, we investigated gene groups that have experienced two successive WGDs to determine which of two older duplicates with different evolutionary rates was more likely to retain both younger duplicates. To uncouple the measurement of evolutionary rates from any assignment of duplicate or singleton status, we measured the evolutionary rates of singleton genes in out‐lineages but classified these singleton genes according to whether they are retained or not in a crown group of species. We found that genes that retained younger duplicates in the crown group of genomes were more constrained prior to the younger duplication event than those that failed to leave duplicates. In addition, we also found that the retained clades have more genes in out‐lineages. Subsequent analyses showed that genes in the retained clades were expressed more broadly and highly than genes in the singleton clades. We concluded that the set of repeatedly retained genes after two WGDs is biased toward slowly evolving genes in angiosperms, suggesting that the potential of genes for both functional conservation and divergence likely affects their propensity of being retained after WGD in angiosperms.


Introduction
Gene duplication is the primary source of new genes during evolution and is consequently a vital source of genetic novelty. Whole-genome duplications (WGDs) generate a copy of all genes and are associated with organismal diversification and the acquisition of novel features (Rensing, 2014). Angiosperms have experienced many rounds of WGDs, in the most recent common ancestor (MRCA) of angiosperms and at the levels of the order or family (Fig. 1A) (Jiao et al., 2011;Vanneste et al., 2014;Huang et al., 2016b;Xiang et al., 2017;Ren et al., 2018). Recent WGD events also occurred independently within many plant groups, such as Brassicaceae (Barker et al., 2009), Fabaceae (Cannon et al., 2015), Poales (McKain et al., 2016), and Poaceae (Ren et al., 2018). In addition, more recent polyploidizations have been detected in Brassica rapa var. pekinensis (Lour.) Kita, (Wang et al., 2011), Glycine max (L.) Merr (Schmutz et al., 2010), and Zea mays L. (Ren et al., 2018). However, most duplicates obtained from WGDs were silenced and then lost within a few million years (Li et al., 2016;Ren et al., 2018).
Those that are retained may meet one of several fates: (i) functional conservation, which is beneficial due to the increased amount of ancestral gene product (Ohno, 1970); (ii) subfunctionalization, which divides the ancestral function among the two copies (Lynch & Conery, 2000); (iii) neofunctionalization, which entails the acquisition of a new function by one copy while the other retains the original function (Assis & Bachtrog, 2013;Turetzek et al., 2016); or (iv) specialization, a combination of (ii) and (iii) (Jiang & Assis, 2017). Both (iii) and (iv) suggest that two duplicates could evolve at different rates.
Indeed, widespread asymmetrical evolution has been supported by substantial empirical evidence (Conant & et al., 2010). In a previous study that focused on recent duplicates in rodents, it was found that acceleration in the rate of evolution after gene duplication is widespread across genes but is generally limited to only one copy (Pegueroles et al., 2013). In land plants, most duplicates have asymmetrical rates of evolution or significant differences in the substitution rates (Carretero-Paulet & Fares, 2012). Asymmetrical expression evolution, where one WGD paralog has low or no expression in all tissues, is also prevalent between duplicates (Assis & Bachtrog, 2013). Altogether, many studies have reported that asymmetrical evolution is common in gene duplication.
Another question regarding gene duplication is whether genes with different rates of sequence evolution show differential propensities for the retention of subsequent duplicates (i.e., gene duplicability). Although several studies have investigated the possible relationships between rates of coding sequence evolution and gene duplicability (Davis & Petrov, 2004;Brunet et al., 2006;He & Zhang, 2006;Liang & Li, 2007;Semon & Wolfe, 2008;O'Toole et al., 2018), they have reached different conclusions. For example, in yeast genomes, less-important genes have higher rates of successful duplication (He & Zhang, 2006), suggesting that more rapidly evolving genes have a higher chance of retaining their duplicates. In primate genomes, genes with family-size conservation evolve more slowly than those without (Chen et al., 2010), and rapidly evolving genes tend to be without family-size conservation (O'Toole et al., 2018). However, others have argued that genes that have more important biological functions and whose coding sequences are under stronger functional constraints are retained more often (Liang & Li, 2007). Studies of eukaryotic genomes found that duplicates encoding more conserved proteins are retained more often (Davis & Petrov, 2004), and duplicates also evolve more slowly than genes in singleton clades (Jordan et al., 2004). However, in plants, the relationship between the rates of coding sequence evolution and gene duplicability remain elusive.
Angiosperms have experienced many rounds of WGDs (Fig. 1A), and these large-scale duplications have generated numerous pairs of duplicates, which provide an excellent opportunity to analyze the differential sequence evolution of A B C Fig. 1. Species phylogeny and the approach used to estimate the rate of evolution for retained and singleton clades. A, The species phylogeny in this study was inferred from the phylogeny of conserved nuclear genes (Zeng et al., 2014(Zeng et al., , 2017Huang et al., 2016a ). Genome duplication and triplication events are marked according to previous reports (Jiao et al., 2011;Vanneste et al., 2014;Huang et al., 2016b;Ren et al., 2018) and are depicted by stars. Species names given in red are used as family-level out-lineage species. B, Genes that experienced two successive duplications and the orthologs of these genes in the outlineage. We were interested in identifying genes that are singletons in species S1, S2, and S3 and in sorting those genes into Group A (retained clade) and Group B (singleton clade), according to their status in the S4 genome. For all genes in Group A (light green) and Group B (blue) in S4, their unduplicated orthologs (relative to younger duplication) in species S1, S2, and S3 were identified. C, Calculating the proxy for the ancestral evolutionary rate. The rate of evolution in the orthologous pairs in out-lineage species (S1, S2, and S3) was used as a proxy for the ancestral rate of evolution in the S4 lineage, independent of the younger duplication event in molecular evolution.
WGD-derived duplicates and to examine the possible relationships between rates of sequence evolution and gene duplicability. Both evolutionary rates and duplicability vary among genes of different functional classes in a given genome (Koonin & Wolf, 2010); thus, it is important to consider the possible effects of functional constraints on the evolutionary rate. These influences on the evolutionary rate comparison can be minimized if the comparison is made between two copies of an older duplication, which are similar in sequence and have the same type of function. In plant groups that have experienced two or more rounds of WGD, gene families that have both copies from an older WGD and one set of duplicates from a younger one (Fig. 1B) can be examined for a relationship between the evolutionary rate of the sequence and its duplicability. To compare the rates of sequence evolution in duplicates generated by the first round of WGD, we measured the evolutionary rates of orthologous genes (a pair of orthologs of this type is called an orthologous pair) in two species in relation to the crown species that experience a second round of WGD (Fig. 1C).
Because the rates of evolution for a particular gene are highly correlated in related lineages (Bromham & Penny, 2003), we reasoned that the non-synonymous (K a ) rates of substitution between the members of each representative orthologous pair would be a good proxy for the rate of evolution of the study genes in a way that is unaffected by the process of younger WGD. These analyses were undertaken in a similar way to those of previous studies (Davis & Petrov, 2004;He & Zhang, 2006;O'Toole et al., 2018).
In this study, we partitioned older duplicates into two groups based on their status in the crown lineages (Brassicaceae, Fabaceae, or Poaceae) or species (B. rapa, G. max, or Z. mays) and compared their rates of evolution, thus uncoupling the estimation of the rate of sequence evolution from the process of the duplication itself. We found that genes that are duplicated in crown lineages or species have lower non-synonymous (K a ) rates of substitution in closely related out-lineage species, thus supporting the model of the preferential duplication of slowly evolving genes. This study design has the additional advantage of focusing on both ancient and recent periods, so it should reveal retention biases both shortly post-duplication and over the much longer term.

Data collection
Sequence data collection of 30 species and phylogenetic analyses were carried out exactly as described in a previous article . The sequence of Vitis vinifera was downloaded from the public database Phytozome version 12.0 (https://phytozome.jgi.doe.gov/pz/portal.html) (Goodstein et al., 2012). The transcript and protein expression data for Arabidopsis thaliana (L.) Heynh. were obtained from the multi-omics atlas given in a previously published paper (Mergner et al., 2020). mRNA quantities are displayed as transcripts per million. Protein intensities are intensity-based absolute quantification values (Mergner et al., 2020). The level of expression of each gene was estimated using the median value of all 30 tissues. The tissue specificity was measured with the index τ (Yanai et al., 2005): where N is the number of tissues, and x i is a normalized expression level for each tissue, divided by the highest level of expression across the N tissues. The index τ ranges from 0 to 1, and higher values indicate higher specificity. If a gene was expressed in only one tissue, it was assigned τ = 1, and if it was expressed equally in all tissues, it was assigned τ = 0.

Gene tree construction and reconciliation
We collected the protein-coding sequences for 30 sequenced angiosperm genomes that can stand in for major angiosperm lineages (Fig. 1A), and we classified them into orthogroups using OrthoFinder (Emms & Kelly, 2015). Of the 177 988 clusters that were obtained using OrthoFinder, 12 046 contained genes from at least five different species, which were included in subsequent analyses. Of these, 74 clusters with more than 500 genes were removed because of the enormous computational resources required by the large number of sequences.
We implemented a gene tree-species tree reconciliation pipeline to automatically construct phylogenetic gene trees for all clusters to obtain estimates of duplication and speciation events along the gene tree and to test whether these trees could be traced back to a single ancestral angiosperm gene. The pipeline incorporated a procedure for gene tree improvement using TreeFix (Wu et al., 2013), a gene tree-species tree reconciliation algorithm that carries out topological changes to gene trees and searches for alternative topologies that minimize the duplication/ loss cost, with a likelihood that is not statistically worse than that of the maximum likelihood topology. The tree reconciliation algorithm used a predefined species tree ( Fig. 1A) drawn from recent studies of angiosperm phylogeny to infer all of the necessary gene gains and losses to explain a given gene tree topology (Zeng et al., 2014(Zeng et al., , 2017Huang et al., 2016a). The phylogenetic analyses were undertaken as presented previously . Briefly, sequence alignments based on multiple protein sequences were produced using Multiple Alignment using Fast Fourier Transform (MAFFT) version 7.273 (Katoh & Standley, 2013) and were trimmed with trimAl (Capella-Gutierrez et al., 2009), using the options "-gt 0.1resoverlap 0.75 -seqoverlap 80." The processed multiple sequence alignments were fed into the IQ-TREE (Nguyen et al., 2015) to construct phylogenetic gene trees, using the amino acid substitution (LG + F + G) model (Le & Gascuel, 2008). Then codon-based coding sequence alignments generated from the sequence alignments of amino acids were back translated using TreeBeST version 1.9.2 (Vilella et al., 2009) and used for the analyses with TreeFix (Wu et al., 2013) with the default reconciliation model (duplication/loss cost), but increasing the number of iterations to 1000. If a reconstructed gene tree generated by TreeFix indicated a duplication event at a node before the MRCA of angiosperms, we split the phylogenetic tree into subtrees using ETE3 (Huerta-Cepas et al., 2016), a Python framework for phylogenetic analyses.
J. Syst. Evol. 60(4): 848-858, 2022 www.jse.ac.cn 2.3 Identification of duplicate and singleton genes and their orthologs Duplication and speciation nodes were detected by ETE3 (Huerta-Cepas et al., 2016), using a reconstructed gene tree generated by TreeFix, in which duplication and separation nodes strictly followed the species tree topology (Zeng et al., 2014(Zeng et al., , 2017Huang et al., 2016a). In this study, only duplication nodes that exceeded a duplication consistency score of 0.2 were considered in subsequent analyses, following the procedure of a previous study (Vilella et al., 2009). The consistency score is the intersection of the number of species post-duplication divided by the union (Vilella et al., 2009), which is a measure of the plausibility of the duplication events inferred. By detecting evolutionary events, orthologous and paralogous relationships among genes can also be inferred.

Measurement of proxy-ancestral evolutionary rate
For each of the orthologous pairs, the protein sequences were aligned using MAFFT version 7.273 (Katoh & Standley, 2013) and then converted into nucleotide alignments using TreeBeST version 1.9.2 (Vilella et al., 2009). K a , K s , and ω (K a /K s ) were estimated using CodeML software from the PAML package (version 4.8) (Yang, 2007), based on codon sequence alignments using the GY model, with stationary codon frequencies empirically estimated by the F3x4 model. Using the alignment of out-lineage orthologous pairs of genes, we prevented gene duplication from affecting inferences of evolutionary rates. Orthologous pairs with K a > 5 or K s < 0.01 were excluded from further analyses, which could denote either misalignment or potential sequence saturation.

Results
3.1 Evolutionary rates for orthologous singleton genes are highly correlated in related lineages In all, 30 species with genome sequences (22 eudicots, seven monocots, and one basal angiosperm) (Fig. 1A) were selected to represent the major angiosperm lineages. Duplicated and singleton genes were inferred from gene trees (see Section 2 for details). Then we undertook a systematic search for genes that were singleton at the base of the eudicots tree in Aquilegia coerulea E.James, Carica papaya L., Theobroma cacao L., Fragaria vesca L., and Vitis vinifera L., which lacked evidence of WGDs after divergence from other eudicots (hereafter, family-level out-lineage species (FLOS)). Familylevel out-lineage species are not similar to Brassicaceae and Fabaceae species, whose own MRCAs experienced at least one ancestral WGD. We identified genes for which we could be confident were singletons at the base of the eudicot tree.
To assess whether the rates of evolution for orthologous singleton genes were correlated in angiosperms, K a values were calculated for these orthologous pairs in a pairwise mode. Overall, the K a values were highly positively correlated across orthologous pairs (Spearman's rank correlation, R > 0.5, P < 0.001) (Fig. S1), suggesting that they generally had similar rates of evolution. We also repeated the entire analyses in Brassicaceae-, Fabaceae-, and Poaceae-level orthologous gene pairs, and the results remained unchanged and strongly statistically significant (Spearman's rank correlation, R > 0.5, P < 1.0e−16) (Figs. S2-S4). From these results, we reasoned that the K a values between the members of each representative pair would be a good proxy for the rate of evolution of the duplicates in a way that would be unaffected by the process of duplication.
3.2 Genes in retained clades evolve at slower rates than those in singleton clades From these species (Fig. 1A), we identified a large number of gene duplications in the core eudicots, Brassicaceae, Fabaceae, Poaceae, Brassica rapa, Glycine max, and Zea mays. Because we focused on genes that experienced two successive duplications (old and young WGD) (Fig. 1B), we chose the following five gene datasets: core eudicots and Brassicaceae duplicates (dataset 1) ( Fig. 2A), core eudicots and Fabaceae duplicates (dataset 2) (Fig. 2C), Brassicaceae and B. rapa duplicates (dataset 3) (Fig. 3A), Fabaceae and G. max duplicates (dataset 4) (Fig. 3C), and Poaceae and Z. mays duplicates (dataset 5) (Fig. 3E). Then we classified the genes of these five datasets into either Group A or Group B based on their duplication status after young WGD. Three copies generated by the γ triplication (core eudicot-wide WGD) event were also classified into two groups, with one group (Group A or Group B) containing two old clades. For three or four copies generated by the α/β duplication event, only copies generated by the second-round old WGD were included in this study, which requires that both Group A and Group B contained one old clade. For each of two successive duplications, Group A always contained more genes than Group B in crown lineages or species after the younger WGD (Fig. 1B). For clarity, Group A is called the retained clade and Group B is the singleton clade. First, we analyzed dataset 1 genes (core eudicots and Brassicaceae duplicates) ( Fig. 2A). Genes from dataset 1 were partitioned into the retained clade and the singleton clade, based on their status in Brassicaceae. Because we were interested in investigating the evolutionary rate of the retained and singleton clades, independent of the potential effects of the Brassicaceae-level duplication event, we measured the amount of evolution (K a and ω) in two FLOS, T. cacao and C. papaya, and we used the results as a proxy for the ancestral evolutionary rate of the Brassicaceae genes. We observed significant differences between the retained clade and the singleton clade in K a and the ω values (Mann-Whitney U-test, P < 0.05) (Fig. 2B). The retained clade showed lower K a and ω values, although these measurements were based on the comparison between orthologs from the T. cacao and C. papaya sequences, where the genes had not experienced Brassicaceae-level WGD. We also carried out another set of comparisons, using both V. vinifera and T. cacao, and A. coerulea and T. cacao (Fig. S5). The retained clades have significantly lower ω values (between orthologs from V. vinifera and T. cacao (Fig. S5A)) and K a values (between orthologs from V. vinifera and T. cacao (Fig. S5B)) than those between similar orthologs in singleton clades (Mann-Whitney U-test, P < 0.05). The decrease in statistical power in these comparisons could be due to the relatively long divergence times of the two pairs of species and limited gene number. We also repeated the whole set of analyses for the genes in dataset 2 (Fig. 2C) and measured the amount of evolution (K a and ω) between orthologs from F. vesca and T. cacao. The results remained statistically significant J. Syst. Evol. 60(4): 848-858, 2022 www.jse.ac.cn (Mann-Whitney U-test, P < 0.05) (Fig. 2D). In addition, significantly lower K a values between orthologs from A. coerulea and F. vesca of the retained clades were also observed in the comparisons to singleton clades (Fig. S6).
As the γ WGDs and the family-level WGD (i.e., the Brassicaceae WGD (α/β), the Poaceae WGD (ρ), and the Fabaceae WGD) are relatively old, to reveal the patterns of relatively younger post-duplication retention biases in datasets 3-5, we compared the evolutionary rate of the retained clade and the singleton clade, partitioned by species-level WGD events (i.e., the B. rapa WGD, the G. max WGD, and the Z. mays WGD) (Fig. 3). We also found that the evolutionary rate of retained clades was statistically significantly lower than the singleton clades for all pairs of comparisons mentioned above (Mann-Whitney U-test, P < 0.05) (Figs. 3, S7-S9). A paired-sample Mann-Whitney U-test was used in the comparisons between retained clades and singleton clades in datasets 3-5. Together with results described above, we concluded that the retained clade experienced a generalized higher constraint than the singleton clade, observed as slower sequence evolution. This is consistent with the results of a previous study that found preferential duplication of conserved proteins in eukaryotic genomes (Davis & Petrov, 2004).

Genes in retained clades are expressed more widely and highly than those in singleton clades
The level of gene expression is related to the rate of sequence evolution (Alvarez-Ponce & Fares, 2012; Zhang & Yang, 2015). To test whether genes in retained clades were expressed more broadly or highly than genes in singleton clades, we compared the gene expression pattern of these two categories of Arabidopsis thaliana genes. These genes were partitioned into a retained clade and singleton clade, based on their status in B. rapa. We found that A. thaliana genes in the retained clade were expressed more broadly (lower tissue specificity) and highly (median value) than genes in the singleton clade at the transcript and protein level (Mann-Whitney U-test, P < 0.05) (Fig. 4), supporting the model of the preferential retention of genes with broader or higher expression levels.
3.4 Retained clades contain more genes than singleton clades in out-lineage species If a retained clade has greater duplicability than a singleton clade, it can also be expected to be more likely to be retained in related out-lineage species. To test this hypothesis, we compared the number of genes in retained clades and singleton clades in FLOS. First, we   (Ren et al., 2018). The results showed that retained clades consistently had more genes than the singleton clades (Mann-Whitney U-test, P < 0.05) (Fig. 5A). We repeated this entire set of analyses for datasets 2-5, and the retained clades consistently had more genes than singleton clades (Mann-Whitney U-test, P < 0.05) (Figs. 5B, S10-S13), suggesting that retained clades were more likely to be retained after WGD. Because some of these out-lineage species lack evidence of WGD events after their divergence from other angiosperms, these results also suggest that genes with lower rates of evolution are inherently more likely to be retained after small-scale duplication, or genes with higher rates of evolution are inherently more likely to be lost.

Discussion
Previous studies have shown that some genes, the so-called duplicable genes, are more likely to be retained after WGDs Boxplots showing K a , K s , and ω (K a /K s ) values for retained clades (containing more duplicates than singleton clades) and singleton clades (containing less duplicates than retained clades) in the Brassicaceae, Fabaceae, and Poaceae gene sets. All genes have singleton status (relative to species-level whole-genome duplication) in relative species, and the evolutionary rates were calculated by comparison of orthologous pairs. Orthologous pairs in retained and singleton clades are given pairwise. Values were compared using the Mann-Whitney U-test (paired), and P-values are shown above each pair of boxplots. A, B, Genes in Arabidopsis thaliana (Arth) and Eutrema salsugineum (Eusa) are classified into retained clade and singleton clade according to their status in Brassica rapa (Bara) genomes (n = 1754). C, D, Genes in Phaseolus vulgaris (Phvu) and Vigna unguiculata (Viun) were classified into retained and singleton clades, according to their status in Glycine max (Glma) genomes (n = 822). E, F, Genes in Sorghum bicolor (Sobi) and Setaria italica (Seit) were classified into retained and singleton clades, according to their status in the Zea mays (Zema) genome (n = 399). NS, not significant.
J. Syst. Evol. 60(4): 848-858, 2022 www.jse.ac.cn study, we asked a different question: among WGD-derived paralogs, which gene was more likely to be retained after a second round of WGD? The approach that was used to identify duplicates in one organism or family and to obtain evolutionary rate measurements from two out-lineage species (Fig. 1) allowed the comparison of the evolutionary rate of genes between those retained to that those that were not. Importantly, this allowed us to make this comparisons while avoiding the confounding effects that the duplication itself has on the rates of sequence evolution (Davis & Petrov, 2004;He & Zhang, 2006;Li et al., 2006;O'Toole et al., 2018). Our data showed that previous duplicates that have evolved slowly, presumably reflecting high functional constraints, were preferentially retained after second-round WGD across angiosperms. Because different functional classes of genes have different evolutionary rates in a genome (Koonin & Wolf, 2010), we compared the evolutionary rate of retained clades and singleton clades in a pairwise mode free from functional classes bias. Due to the long periods of evolutionary time we examined (i.e., γ shared by core eudicots, various family-level WGD events, and various species-level WGD events), the retention biases were more important in this study than in others that have examined shorter lengths of time (O'Toole et al., 2018). In addition, ascertaining the magnitude of any effect would require forward simulations.
These results provide a consistent framework for considering several previous findings, as follows: highly connected proteins tend to have higher gene duplicability in humans (Liang & Li, 2007); gene duplicates maintained after WGDs exhibit the strongest purifying selection in teleost fishes (Brunet et al., 2006); the bias of the set of duplicate genes has important implications for genome-scale studies (Davis & Petrov, 2004); gene duplicability in core genes is highly consistent across all angiosperms (Li et al., 2016); and slow-evolving genes after allopolyploidization in Xenopus laevis Daudin show preferential subfunctionalization (Semon & Wolfe, 2008). In addition, we previously found that duplicates of widely expressed and slowly evolving AP2/ERF genes were retained repeatedly and relatively often and continue to experience functional constraints . All of these results support a model where a subset of the genome is strongly constrained in sequence evolution but is easily retained after WGD.
One possible explanation for these results is that a slowly evolving (and functionally important) gene might be expressed at high level, such that a duplicated slowly evolving gene might have a better chance of survival than a duplicated rapidly evolving one (Li et al., 2016). Another possible explanation is that a slowly evolving gene, which usually has multiple functions, might have a greater potential of functional diversification (e.g., subfunctionalization) for the duplicates. Theoretical studies of population genetics also predict that subfunctionalization is a much more common mechanism of the preservation of duplicate genes than neofunctionalization (degenerative mutations are much more frequent than beneficial changes) (Force et al., 1999). These observations led us to propose that among WGDderived paralogs, slowly evolving ones were more easily subfunctionalized (or more likely to experience specialization) and therefore were more likely to be retained long after the WGD, as illustrated in a model of the evolution of duplicable genes after WGD in angiosperms shown in Fig. 6.
In this model, two duplicates derived from one gene accumulate sequence differences, but more so in the rapidly evolving one than in the slowly evolving one. After secondround WGD, copies from the slowly evolving gene were more easily retained. One explanation is that it increased the amount of the slowly evolving gene product, which might need to be expressed at a high level (Li et al., 2016). An alternative explanation is that the slowly evolving gene, which tends to have multiple functions (being expressed in numerous tissues or encoding multidomain proteins), were more easily subfunctionalized, resulting in functional diversification (Liang & Li, 2007;Alvarez-Ponce & Fares, 2012;Zhang & Yang, 2015). Then, newly subfunctionalized duplicates, which could acquire new functions, facilitate a more sophisticated regulation of signaling networks or adaptation to new environmental conditions. By contrast, the rapidly evolving gene, which is more likely already A B D C Fig. 4. Boxplots showing differences in expression between retained and singleton clades. Expression data were obtained from a quantitative atlas of the transcriptomes and proteomes of 30 tissues of Arabidopsis thaliana (Mergner et al., 2020). P-values were compared using the Mann-Whitney U-test (unpaired) and are shown above each pair of boxplots. A, B, Comparison of median level of gene expression between genes in retained and singleton clades at the transcript (n = 2244) (A) and protein (n = 1864) (B) levels. C, D, Comparisons of tissue specificity (τ) between genes in retained and singleton clades at the transcript (C) and protein (D) levels.
J. Syst. Evol. 60(4): 848-858, 2022 www.jse.ac.cn specialized and is not expressed at high level, tends to have low potential to be further subfunctionalized after duplication. For these reasons, the additional copy of the rapidly evolving gene after a second round WGD is less likely to be retained. Thus, among WGD-derived paralogs, the persistent conservation or subfunctionalization of the slowly evolving duplicate might be a mechanism for repeated gene retention after WGDs. The evolutionary scenario presented here might also be true for WGD-derived duplicates in other lineages.   6. Model to explain why slowly evolving genes are preferentially retained after whole-genome duplications (WGDs). Here we assume a gene with two independently mutable subfunctions (depicted as regulatory regions or domains by the two small boxes), which are spatially non-overlapping. The states presented in the second column indicate that the gene was retained in both copies after a WGD. Then these duplicates began to accumulate sequence differences and became subfunctionalized into gene a and gene b. Here, gene a was preserved with its ancestral function, and gene b lost its single, non-overlapping subfunctions. The states in the third column indicate that after another WGD, gene a was retained again by conservation or subfunctionalization (depicted as dashed boxes), and gene b2 was lost by non-functionalization; this occurs when either the coding region is knocked out or both regulatory regions are lost.

Supplementary Material
The following supplementary material is available online for this article at http://onlinelibrary.wiley.com/doi/10.1111/jse. 12712/suppinfo: Fig. S1. Evolutionary rates of orthologous genes are highly correlated in diverse lineages in eudicots. All genes had singleton status (relative to family-level whole-genome duplications (WGDs)) in relative species, and the evolutionary rates were calculated by a comparison of the orthologous pairs in family-level out-lineage species. We reasoned that the rate of evolution (K a ) between the members of each representative pairs would be a good proxy for the rate of evolution of the study genes in a way that is unaffected by the process of duplication (Fig. 1). Fig. S2. Evolutionary rates of orthologous genes are highly correlated in different lineages in Brassicaceae. All genes had singleton status (relative to the Brassica rapa whole-genome duplication (WGD)) in relative species, and the evolutionary rates were calculated with a comparison of orthologous pairs. Fig. S3. Evolutionary rates of orthologous genes are highly correlated in different lineages in Fabaceae. All genes have singleton status (relative to the Glycine max whole-genome duplication (WGD)) in relative species, and the evolutionary rates were calculated with a comparison of orthologous pairs. Fig. S4. Evolutionary rates of orthologous genes are highly correlated in different lineages in Poaceae. All genes had J. Syst. Evol. 60(4): 848-858, 2022 www.jse.ac.cn singleton status (relative to the Zea mays whole-genome duplication (WGD)) in relative species, and the evolutionary rates were calculated with comparison of orthologous pairs. Fig. S5. Boxplots showing K a , K s , and ω (K a /K s ) values for retained clades and singleton clades in family-level outlineage species. Genes were separated into retained and singleton clade according to their status in the Brassicaceae genome. Values were compared using the Mann-Whitney Utest (paired), and P-values are shown above each pair of boxplots. A, Boxplots showing K a , K s , and ω (K a /K s ) values for retained-clade and singleton-clade genes in the Vitis vinifera and Theobroma cacao gene set (retained clade, n = 381; singleton clade, n = 403). B, Boxplots showing K a , K s , and ω values for retained clades and singleton clade genes in the Aquilegia coerulea and T. cacao gene set (retained clade, n = 376; singleton clade, n = 391). Fig. S6. Boxplots showing K a , K s , and ω (K a /K s ) values for retained and singleton clades in family level out-lineage species. Gene were separated into retained and singleton clades according to their status in the Fabaceae genome. A, Boxplots showing K a , K s , and ω (K a /K s ) values for retainedclade and singleton-clade genes in the Vitis vinifera and Fragaria vesca gene set (retained clade, n = 256; singleton clade, n = 254). B, Boxplots showing K a , K s , and ω values for retained-clade and singleton-clade genes in the F. vesca and Theobroma cacao gene set (retained clade, n = 346; singleton clade, n = 347). Fig. S7. Boxplots showing K a , K s , and ω (K a /K s ) values for the retained clade (containing more duplicates than the singleton clade) and the singleton clade (containing fewer duplicates than the retained clade, or none) in the Brassicaceae gene set. All genes had singleton status (relative to Brassica rapa whole-genome duplication (WGD)) in relative species, and the evolutionary rates were calculated with a comparison of orthologous pairs. The data were separated into A and B genes, according to their status in the B. rapa genome. Values were compared using the Mann-Whitney U-test (paired), and P-values are shown above each pair of boxplots. Fig. S8. Boxplots showing K a , K s , and ω (K a /K s ) values for retained-clade and singleton-clade genes in the Fabaceae gene set. All genes had singleton status (relative to Glycine max whole-genome duplications (WGDs)) in relative species, and the evolutionary rates were calculated with a comparison of orthologous pairs. The data were separated into A and B genes according to their status in the G. max genome. Values were compared using the Mann-Whitney U-test (paired), and Pvalues are shown above each pair of boxplots. Fig. S9. Boxplots showing K a , K s , and ω (K a /K s ) values for retained-clade and singleton-clade genes in the Poaceae gene set. All genes had singleton status (relative to Zea mays wholegenome duplications (WGDs)) in relative species, and the evolutionary rates were calculated by a comparison of orthologous pairs. The data were separated into A and B genes according to their status in the Z. mays genome. Values were compared using the Mann-Whitney U-test (paired), and Pvalues are shown above each pair of boxplots. Fig. S10. Boxplots showing the number of genes in the retained clade and the singleton clade in family-level out-lineage species. Numbers of genes greater than 10 are not shown. Values were compared using the Mann-Whitney U-test (unpaired), and Pvalues are shown above each pair of boxplots. A, Genes were separated into retained clade and singleton clade according to their status in the Brassicaceae genome (retained clade, n = 912; singleton clade, n = 1144). B, Genes were separated into retained clade and singleton clade according to their status in the Fabaceae genome (retained clade, n = 809; singleton clade, n = 980). Fig. S11. Boxplots showing the number of genes in the retained clade and the singleton clade in species-level out-lineage species in Brassicaceae. Genes were separated into the retained clade and the singleton clade according to their status in Brassica rapa. Values were compared using the Mann-Whitney U-test (n = 2848), and P-values are shown above each pair of boxplots. Fig. S12. Boxplots showing the number of genes in the retained and singleton clades in species-level out-lineage species in Fabaceae. Genes were separated into the retained clade and the singleton clade according to their status in Glycine max. Values were compared using the Mann-Whitney U-test (n = 1550), and the P-values are shown above each pair of boxplots. Fig. S13. Boxplots showing the number of genes in the retained and singleton clades in species-level out-lineage species in Poaceae. Genes were separated into the retained clade and the singleton clade according to their status in Zea mays. Values were compared using the Mann-Whitney U-test (n = 1281), and the P-values are shown above each pair of boxplots.