Lineage-specific sequence evolution and exon edge conservation partially explain the relationship between evolutionary rate and expression level in A. thaliana

Rapidly evolving proteins can aid the identification of genes underlying phenotypic adaptation across taxa, but functional and structural elements of genes can also affect evolutionary rates. In plants, the ‘edges’ of exons, flanking intron junctions, are known to contain splice enhancers and to have a higher degree of conservation compared to the remainder of the coding region. However, the extent to which these regions may be masking indicators of positive selection or account for the relationship between dN/dS and other genomic parameters is unclear. We investigate the effects of exon edge conservation on the relationship of dN/dS to various sequence characteristics and gene expression parameters in the model plant Arabidopsis thaliana. We also obtain lineage-specific dN/dS estimates, making use of the recently sequenced genome of Thellungiella parvula, the second closest sequenced relative after the sister species Arabidopsis lyrata. Overall, we find that the effect of exon edge conservation, as well as the use of lineage-specific substitution estimates, upon dN/dS ratios partly explains the relationship between the rates of protein evolution and expression level. Furthermore, the removal of exon edges shifts dN/dS estimates upwards, increasing the proportion of genes potentially under adaptive selection. We conclude that lineage-specific substitutions and exon edge conservation have an important effect on dN/dS ratios and should be considered when assessing their relationship with other genomic parameters.

Variations in dN/dS are thought to stem primarily from gene-specific selective pressures related to the functionality of their protein products (Tennessen 2008). As such, dN/dS is often used to identify those genes likely to be involved in adaptation (Yang & Bielawski 2000;Hurst 2002;Nielsen 2005). Determining which genes are under selection is important for understanding how genetic diversity is maintained and the relative importance of opposing selective forces in shaping a species' genetic diversity.
As with many species, plant genes predominantly evolve under purifying selection (Gossmann et al. 2010), with low estimates for the number of positively selected genes in sorghum (Hamblin et al. 2006), maize (Ross-Ibarra et al. 2009), A. thaliana (Schmid et al. 2005;Slotte et al. 2011) and A. lyrata (Foxe et al. 2008). In these studies, dN/dS was calculated from pairwise alignments resulting in ratios which are a composite of substitutions in both lineages compared since their divergence from their last common ancestor. Using an outgroup species allows the calculation of lineage-specific dN/dS (Arbiza et al. 2006;Bakewell et al. 2007;Kawahara & Imanishi 2007;Weedall et al. 2008;Parmakelis et al. 2010;Toll-Riera et al. 2011) which could unmask further genes with species-specific signatures of positive selection and/or potentially stronger associations between certain genomic characteristics and the rate of sequence evolution. The model plant Arabidopsis thaliana is an ideal organism for investigating genomewide signatures of selection in the plant taxa as a sister species, Arabidopsis lyrata (with an estimated 13 mya divergence time from A. thaliana), has been sequenced (Hu et al. 2011). Thellungiella parvula [43 mya divergence from A. thaliana (Dassanayake et al. 2011)] provides a suitable outgroup for assessing lineage-specific sequence evolution. The availability of multiple A. thaliana genomes (Cao et al. 2011;Gan et al. 2011) enables the assessment of intraspecific diversity, which can be used to estimate deviations from a neutral expectation based on both sequence divergence and intraspecific variation, such as the neutrality index (NI) (Haldane 1956).
The interpretation of dN/dS and NI estimates assumes that synonymous substitutions are mostly evolving under neutral or nearly neutral conditions and are a proxy of the underlying mutation rate. However, exon sequences can contain exonic splicing enhancers (ESEs), sequence motifs involved in both constitutive and regulated splicing by facilitating the assembly of splicing complexes (Tacke & Manley 1999;Blencowe 2000;Zheng 2004). ESEs are enriched in the vicinity of splice sites, particularly downstream of a splice acceptor, with their peak abundance increasing closer to an exon-intron boundary (Wu et al. 2005). As higher conservation in this region, including at synonymous sites, can reflect differential patterns of codon usage (Comeron & Guthrie 2005;Parmley & Hurst 2007;Warnecke & Hurst 2007;Caceres & Hurst 2013) and affect the overall dN/dS estimate per gene (Carlini & Genut 2006;Parmley et al. 2006), this may influence relationships between dN/dS and various genomic parameters, particularly in compact, intron-rich genomes. In A. thaliana's genome, 75% of the genes are multi-exonic, 29% of the exons are below 100 bp, the median exon length is 53 codons, and ESE hexamers have been identified (Pertea et al. 2007). Thus, ESE conservation could have a strong impact on estimates of dN/dS, and consequently on estimates of the relative contribution of positive and purifying selection to A. thaliana genome evolution.
It is not yet known, however, how dN/dS estimates in plants are influenced either by increased conservation at exon edges or by the introduction of an outgroup species to obtain lineage-specific estimates nor how this may affect the covariance between the rate of sequence evolution and any genomic parameter previously shown to be a significant predictor of NI and/or dN/dS.
Here, we address this issue by examining coding sequence evolution in A. thaliana, with A. lyrata and T. parvula as comparison species. We investigate whether the calculation of lineage-specific sequence evolutionary rate and/or the removal of exon edges (i) may unmask a larger proportion of genes with signatures of selection, (ii) alter the relationship between expression level and evolutionary rate and (iii) alter the association between dN/dS and other structural and functional parameters previously identified as dN/dS correlates in one or more other species.

Other data sources
Codon usage bias per gene was expressed both as the effective number of codons (ENC) (Wright 1990) and as the frequency of optimal codons (F op ) (Ikemura 1981). The number of protein-protein interactions (PPIs) per gene was obtained from BioGRID, version 3.1.75 (Stark et al. 2006(Stark et al. , 2011. Recombination data were obtained from Marais et al. (2004); this variable is used as a control as an insignificant relationship between recombination and dN/dS is expected in an effectively obligate selfer. A gene's degree of multi-functionality was measured as the number of GOslim terms assigned to it for biological processes. 'GOslim' is a condensed set of gene ontology (GO) categories, obtained from TAIR (ftp://ftp.arabidopsis.org/home/tair/Ontologies/ Gene_Ontology/ATH_GO_GOSLIM.txt, downloaded 8 October 2013) (Berardini et al. 2004). The majority of GOslim terms (~87%) are derived from curated experimental or computational evidence, rather than being inferred from sequence similarity, which can result in higher false prediction rates (Jones et al. 2007). All raw data used in this study are available in Table S1 (Supporting information).

Tests of sequence evolution and selection
Two measures of the degree and direction to which A. thaliana sequences diverge from a neutral expectation were calculateda neutrality index (NI) and dN/dS. Calculations require data on the number of polymorphic and diverged residues in each sequence. To obtain the former, we used single nucleotide polymorphism (SNP) data obtained after aligning 17 fully sequenced and independently assembled accessions against the Col-0 reference genome (Gan et al. 2011) (data from Po-0 were not used as it has both unusually high heterozygosity and similarity to Oy-0). Diverged positions were identified from pairwise alignments of A. thaliana against both A. lyrata and T. parvula. Alignments were made for 21 198 genes against A. lyrata and 10 289 genes against T. parvula, of which 7086 genes could be aligned against both. Alignments were first obtained for exons in the longest available transcript per A. thaliana gene, using blastn (Altschul et al. 1990) with default parameters and a significance threshold of 1e À10 . These were refined by applying the Smith-Waterman algorithm to the best blastn hit [fasta36.3.5d with parameters -A -a (require alignments to use entire sequence)] (Pearson 2000). The resulting alignments were then concatenated to create a single sequence alignment per gene. To ensure the alignment was in-frame, the translated A. thaliana sequence was aligned against either the A. lyrata or T. parvula sequence using tblastn (default parameters and significance threshold 1e À10 ).
For genes with at least 150 aligned bases, dN/dS was estimated from the concatenated sequences using the Yang and Nielson model, as implemented in the yn00 package of PAML (Yang 2007). These estimates are referred to as 'pairwise' dN/dS. We also calculated a lineage-specific estimate of dN/dS using the extremophile crucifer Thellungiella parvula (Dassanayake et al. 2011) as an outgroup, according to the method of Toll-Riera et al. (2011). First, we identified those T. parvula genes with detectable homology to an A. thaliana gene for >50% of the CDS length of the longest Col-0 transcript (blastn with default parameters). Multiple sequence alignments between the CDS of an A. thaliana gene, its A. lyrata orthologue (if extant) and the homologous sequence in T. parvula were made using PRANK (L€ oytynoja & Goldman 2008). dN/dS was calculated using the codeml package of PAML (Yang 2007), with the equilibrium codon frequencies of the model used as free parameters (CodonFreq = 3). These data were filtered to remove sequences less than 150 bp in length or with branches showing either dS < 0.02, dS > 2 or dN > 2 as these are either unreliable for estimates of the dN/dS ratio, nonbona fide orthologues or otherwise saturated with substitutions (L€ oytynoja & Goldman 2008). We assumed an unrooted tree topology of [(A. thaliana, A. lyrata), T. parvula].
The neutrality index for each sequence, NI, was calculated as log((2D s + 1) (2P n + 1)/(2D n + 1) (2P s + 1)), where D s and D n are the numbers of synonymous and non-synonymous substitutions, and P s and P n are the numbers of synonymous and non-synonymous polymorphisms (Haldane 1956). NI values can be tested with the null hypothesis of neutrality, that the ratios of intra-and interspecies non-synonymous to synonymous variation are equal. Positive selection is inferred when interspecies exceeds intraspecies variationadaptive mutations spread throughout a population rapidly and so affect the number of observed substitutions (i.e. divergence), but not the number of polymorphisms (Egea et al. 2008). NI can thus be interpreted in the same manner as a McDonald-Kreitman test for comparing the ratio of fixed to within-species differences: its symmetrical distribution allows the inference of purifying selection when NI > 0 and positive selection when

Exon edge trimming
To assess the effect of exon edge conservation on rates of sequence evolution, we removed up to 30 codons from the edges of each A. thaliana exon that could be fully aligned against the A. lyrata or T. parvula genome with an alignment both in-frame and a multiple of three in length. Exons were then concatenated, and genes with sequences of at least 150 bp after trimming constituted 'trimmed' subsets of, at minimum, 1443 genes (i.e. those for which all 30 codons can be removed) and 174 genes, for alignments against A. lyrata and T. parvula, respectively. All analyses comparing 'trimmed' and 'untrimmed' sequences use the same set of exons per gene. A supplemental file containing both the raw alignments and evolutionary rate estimates for all data sets is available at the DRYAD repository (http:// dx.doi.org/10.5061/dryad.905sq).

Randomization test
Estimates of dN, dS, dN/dS and NI vary when codons are removed from the edges of exons, suggesting that the strength of selection differs in these regions. To assess whether the difference is indeed due to the nature of exon edges or due to codon removal, we created a parallel set of estimates of dN, dS, dN/dS and NI after random codon removal (s = 1000 randomizations per gene) for comparison. A numerical P-value was calculated as follows: letting q be the number of times the 'sequential removal' estimate of dN, dS or dN/dS was higher than the 'random removal' estimate (or lower, in the case of NI), then p = ((s-q) + 1)/s + 1. As variable estimates of dN, dS, dN/dS and NI can in turn alter the correlation strength with predictors of evolutionary rate (such as, e.g. expression level), the above test was also repeated using estimates of Spearman's rho for both the 'sequential removal' and 'random removal' conditions.

Expression data
Three independent sources of A. thaliana transcript abundance data were used: (i) the Arabidopsis Development Atlas (ADA), representing 79 tissues, generated by the AtGenExpress Consortium (Schmid et al. 2005) [NASCARRAYS reference numbers 150-154, (http:// affymetrix.arabidopsis.info/, downloaded 7 November 2011)]. Expression level was quantified both as the maximum absolute gcRMA [robust multi-array analysis corrected for the GC content of the oligo (Wu et al. 2004)] across all tissue types (after clustering the data into seven typesroot, stem, seed, leaf, flower, pollen and apex) (Slotte et al. 2011) and as the average across all 79 tissues (Yang & Gaut 2011). Expression breadth was calculated from this database as both the number of tissues in which a gene is expressed and the tissue specificity index (tau), a scalar measure bounded between 0 (for housekeeping genes) and 1 (for genes expressed in a single tissue) (Yanai et al. 2005). (ii) Massive parallel signature sequencing (MPSS) data (Brenner et al. 2000;Meyers et al. 2004;Nakano et al. 2006)which quantifies gene expression by counting short (17-20 bp) mRNA-derived tagsrepresenting five tissues (http://mpss.udel.edu/at/mpss_index.php, downloaded 28 March 2011). Expression level was quantified as either the average (Yang 2009) or the maximum number of tags across all tissues (Foxe et al. 2008). (iii) RNA-seq transcript abundance data, where expression levels were taken as absolute read values corrected by sequence length (Gan et al. 2011). On top of the indices of expression obtained from each data set, all three estimates of transcript abundance (MPSS, ADA and RNA-seq) were transformed into Z-scores (Cheadle et al. 2003) to allow direct comparisons between them. In addition, the weighted average of two sets of A. thaliana protein abundance data was obtained for a total of 19 761 genes (pax-db.org, downloaded 15 February 2013) (Baerenfaller et al. 2008;Castellana et al. 2008). These data employ tandem mass spectrometry to quantify protein abundance by spectral counting.

Alternative splicing
Alternative splicing indices were calculated as described in Chen et al. (2014). In brief, alternative splicing events were identified by comparing mapping coordinates from EST data [obtained from dbEST (Boguski et al. 1993

Correlations between evolutionary rates and functional and structural gene characteristics
To determine whether, and to what extent, any functional and structural variables affect a gene's dN/dS and NI estimates, various correlation analyses were performed. All analyses were conducted in R (R Development Core Team 2012). Initially, all correlations were assessed using Spearman's rho. However, as many of the variables found to be significantly associated with dN/dS are themselves covariates of expression level (the strongest correlate of dN/dS), it is possible that some parameters co-vary with dN/dS as a by-product of their relationship with expression level. As such, to better understand the relative contribution of genomic features to dN/dS, we assessed the relationship between individual parameters and dN/dS after controlling for the effect of expression level, using partial Spearman's correlation coefficients [R package 'ppcor' (Kim & Yi 2006]. In addition, to test whether correlation strengths for dN/dS with any given genomic feature differ between pairwise and lineage-specific dN/dS estimates, we assessed statistical significance using a t-test on the Z-transformed values of rho, as implemented by the paired.r method of the R package 'psych' (Revelle 2014).

Correlates of dN/dS and NI in A. thaliana
Expression level and breadth [calculated using RNAseq data (Gan et al. 2011)] were significant predictors of dN/dS and NI (calculated from pairwise alignments between A. thaliana and A. lyrata) and were in fact their strongest predictors compared to other variables (Table 1 and Table S2 in Supporting information). Similar results were obtained when using independent expression-level estimates from two alternative platforms, microarrays and MPSS as well as when applying four normalization procedures previously used for each set of estimates (Table S2 in Supporting information). When dN/dS and NI estimates were obtained from alignments of A. thaliana and T. parvula, a more distant relative of A. thaliana than A. lyrata, expression level and breadth remain the strongest predictors of both dN/dS and NI, albeit with comparatively weaker correlation strengths in the case of dN/dS variance (but equivalent correlation strengths for NI variance) ( Table 1 and Table  S2 in Supporting information). In general, other variables such as gene length and codon usage bias explain progressively smaller proportions of dN/dS and NI variance in an equivalent order to that using A. thaliana-A.lyrata estimates. The association between dN/dS and NI with any non-expression-related genomic parameter is not fully accounted for by its association with gene expression as after the effect of expression level is removed using partial Spearman's correlation coefficients, all significant correlates of dN/dS and NI remained so (Table 2 and Table S3 in Supporting information).
Accounting for exon edge conservation influences dN/ dS and its relationship with various genomic parameters, and unmasks higher levels of positive selection Using pairwise alignments of A. thaliana against either A. lyrata or T. parvula, we find that codon removal at the edges of exons results in increased dN, dS and dN/dS estimates when compared to estimates made after random codon removal from any position in the sequence ( Fig. 1 and Table S4 in Supporting information). This is observed irrespective of whether 10, 20 or 30 codons are removed (Table S4 in Supporting information). Estimates of NI were found to decrease after codon removal from the exon edges compared to random codon removal, also suggesting a weakening in the departure of sequence evolution from a neutral expectation ( Fig. 1 and Table S4 in Supporting information). These patterns are consistent with exon edges being under selective constraint, having fewer non-synonymous substitutions than sequence elsewhere in the gene. In general, exon edge removal shifts dN/dS values towards a range indicative of either stronger positive or relaxed purifying selection, with an overall increase in the proportion of genes potentially under adaptive selection (Table 3 and Table S5 in Supporting information).
To understand the effect of higher conservation at the exon edges on the relationships between dN/dS and other genomic parameters, we then re-analysed the correlations. We found that the correlation strength of dN/dS with several genomic featuresin particular, expression level and expression breadthdecreased after the removal of exon edges. In contrast, we observed only marginal changes to these correlation coefficients after removing an equivalent number of codons from random positions ( Fig. 2 and Table S6 in Supporting information). This suggests that, after the removal of exon edges, the decreased correlation strength between dN/dS and genomic parameters is not explained by increased noisiness resulting from the use of shorter sequences to estimate dN/dS. It also suggests that a dN/dS-based test of selection is most acute for more highly expressed genes and that stronger correlations of dN/dS with  and NI (d) for a sample of 1443 genes with at least one fully alignable exon between A. thaliana and A. lyrata, after removing one codon at a time from exon edges (black), to a maximum of 30. The effects of random codon removal are shown in red. Distributions significantly differ when 30 codons are removed sequentially, but not randomly, compared to when no codons are removed. For sequential removal vs. no removal, Kruskal-Wallis P = 0.02 (dN/dS) and < 2.2 9 10 À16 (NI). For random removal vs. no removal, Kruskal-Wallis P = 0.08 (dN/dS) and 0.49 (NI).
their various characteristics reflect the stronger constraints upon them. Furthermore, when considering NI, several variables including expression level, expression breadth, the total number of introns and various measures of gene length become marginally, but significantly, stronger predictors of NI (Table S6 in Supporting information). Nevertheless, the relative order of these parameters as predictors of dN/dS remains largely unchanged with expression level still the dominant predictor.

Reduced prominence of gene expression as a predictor of A. thaliana's lineage-specific dN/dS
Lineage-specific dN/dS estimates derived from multiple alignments of A. thaliana genes with A. lyrata and T. parvula resulted in a marked decrease in the correlation between dN/dS and various genomic parameters including expression level and expression breadth, with total exon length becoming the strongest correlate of dN/dS (Table 1 and Table S2 in Supporting information).
To rule out the possibility that reductions in both the absolute and relative strength of the correlation between dN/dS and gene expression when examining lineagespecific changes may be explained by differences in the gene/codon set tested, we recalculated pairwise dN/dS for A. thaliana against A. lyrata and T. parvula using only those codons common to the multiple alignments of A. thaliana, A. lyrata and T. parvula (i.e. those used to estimate lineage-specific dN/dS; Table S7 in Supporting information). This analysis confirmed that when the same codons are analysed, lineage-specific dN/dS estimates have markedly weaker correlations with numerous genomic features compared to either pairwise estimate (Table 4 and Table S8 in Supporting information).
We further found that using lineage-specific substitution patterns markedly reduces the number of genes with dN/dS >1 (21 genes have dN/dS >1, 0.3% of the sample analysed) when compared to pairwise alignments of A. thaliana with A. lyrata (423 genes have dN/ dS >1, 2% of the sample analysed; chi-square P < 2.2 9 10 À16 ), although not when compared to alignments with T. parvula (41 genes have dN/dS > 1, 0.4% of the sample analysed, chi-square P = 0.327). In summary, when examining lineage-specific dN/dS estimates, the prominence of gene expression is diminished, and protein length becomes the dominant predictor. This pattern is not explained by variations in the sample of genes/codons used for the analyses. Importantly, we observed no evidence that the use of lineage-specific dN/dS estimates unmasks any additional signatures of positive selection compared to pairwise alignments.

Selective constraint upon exon edges affects the relationship between dN/dS and expression
Previous studies have shown that in mammalian species, exonic splicing enhancer sequences result in higher conservation of synonymous sites at exon edges, suggestive of selective constraint to maintain correct splicing (Carlini & Genut 2006;Parmley et al. 2006). Here, we show that the removal of codons at the exon edges has a strong effect on the rate of substitutions at synonymous sites in A. thaliana, suggesting similar constraint, and associated functional importance, for ESE-containing regions in plants. A moderate increase was also observed in the rate of non-synonymous substitutions reflecting the fact that purifying selection at these sites is higher than the average observed at non-synonymous sites elsewhere in the gene.
This study is, to the best of our knowledge, the first to explore the relationship between rates of sequence evolution and genomic parameters (including gene expression) in the context of exon edge conservation due to the presence of splice enhancers. Generally, the removal of exon edges resulted in a weaker association between dN/dS and NI with measures of expression level and breadth. The relationship between dN/dS and other genomic parameterssuch as various measures of gene/protein lengthshowed a moderate decrease, whereas the association between several length parameters and NI was strengthened (Table S6 in Supporting information). The observed decrease in the relationship between dN/dS and NI to gene expression after the removal of exon edges suggests that a stronger degree of purifying selection acting upon splice enhancer regions partly explains the association between dN/dS, NI and expression. From this, we can infer stronger splice-mediated selection in more highly expressed genes.
It is possible that more highly expressed genes are under increased constraint for accurate splice site definition, with this relationship partly masked by the stronger association of higher expression with lower dN/dS, which largely reflects constraint on the gene's function. In this respect, selection may also be masked on other properties expected to be under stronger constraint in more highly expressed genes, such as codon usage affecting translational error rate (Drummond et al. 2005), translation efficiency (Akashi & Eyre-Walker 1998) and mRNA stability (Tuller et al. 2010), although such analyses are beyond the scope of this study.
It is reasonable to ask whether anything can explain the higher selective constraint upon exon edges in such a way as to also relate both to a gene's structure and to  Fig. 2 Variables that have a significantly different correlation with dN/dS after the sequential removal of 30 codons from exon edges, compared to random codon removal. The four variables shownexpression breadth, expression level, tau and GC contentare those which have significantly different estimates of rho for their correlation with dN/dS before and after codon removal. Two criteria are met for each variable: that rho is significantly different after sequential, compared to random codon removal, and that rho is significantly different after sequential, compared to no codon removal. Estimates of dN/dS are made using alignments of A. thaliana against A. lyrata. Data for this figure, including P-values and sample sizes, are shown in Table S6 (Supporting information).
its expression. One possible explanation may be the extent to which a gene is alternatively spliced. Alternative splicing has been shown to positively correlate with both the ratio of total intron length to overall gene length (Koralewski & Krutovsky 2011) and gene expression level (Chen et al. 2014). As longer genes are more likely to have more complex exon-intron architectures (Zhu et al. 2009), they are expected to have a higher number of possible alternative splicing events. If we assume that the exon edges are under increased selection for accurate alternative splicing compared to nonalternatively spliced exons, then those genes with higher levels of alternative splicing are expected to show a greater discrepancy in evolutionary rate estimates before and after codon removal. Using estimates of the number of alternative splicing events per gene, we find that dN/dS ratios (calculated from pairwise alignments of A. thaliana and A. lyrata to maximize sample size) are more strongly affected by codon removal from the exon edges in genes with higher levels of alternative splicingfor instance, the increase in dN after 10 codons are removed is significantly higher for genes with more splicing events (rho = 0.13, P = 2.7 9 10 À4 ; Table S9 in Supporting information). This pattern is also observed when removing 20 or 30 codons from exon edges (Table S9 in Supporting information). Although based upon a limited sample size, this finding merits further scrutiny as it shows that genes with alternative splicing events, compared to nonspliced genes, have a higher degree of conservation at exon edges relative to conservation of the remaining coding sequence.
Lineage-specific dN/dS estimates have a stronger relationship with gene length than with expression level The use of pairwise alignments for estimating dN/dS could influence any relationship between dN/dS and a gene's characteristics as biases are introduced due to branch-specific changes in the strength and direction of selection. For example, if a gene in A. lyrata was under a greater degree of purifying selection than its A. thaliana orthologue, this would result in a decreased dN/dS estimate in A. thaliana (Toll-Riera et al. 2011). This would introduce noise into the correlation of dN/dS and any genic feature in A. thaliana. Estimating a lineage-specific dN/dS using T. parvula as an outgroup, we found the correlation strength of dN/dS with many genic features, both structural and functional, is reduced (Table S2 in Supporting information). In particular, the estimate of rho for the expression level-dN/dS relationship is reduced by more than 50% when using a lineage-specific compared to a pairwise dN/dS estimate (Table 1 and Table   S2 in Supporting information). However, the use of lineage-specific dN/dS estimates increased the correlation between dN/dS and gene length. This is of interest given the relationship between the three variablesas expression and length are both negative correlates of dN/dS, it follows that genes under stronger purifying selection are more likely to be both highly expressed and shorter. As selection for higher expression can reasonably predict a gene's length, with shorter genes minimizing costly transcription and translation (Castillo-Davis et al. 2002;Eisenberg & Levanon 2003;Urrutia & Hurst 2003), this suggests that gene length itself, rather than expression, could be a stronger predictor of dN/dS. This finding also supports a previously observed negative relationship between dN/dS and gene length identified using A. thaliana-A. lyrata orthologous pairs (Yang & Gaut 2011). It is possible that the comparatively reduced prominence of expression level as a predictor of evolutionary rate is explained in this case by mating system: A. thaliana, unlike A. lyrata or T. parvula, is a near obligate selfer, having a patchy distribution of inbred populations with relatively rare outcrossed mating between different ecotypes (Tian et al. 2002). Selfing increases genomewide homozygosity, and thus decreases the number of gametes which may be independently sampled in a given population, in effect reducing effective population size (Sz€ ov enyi et al. 2014). As a consequence, the efficacy of selectionparticularly purifying selectionat purging weakly deleterious mutations is reduced (Wright et al. 2013;Glemin & Muyle 2014). In this respect, the degree of constraint acting upon highly expressed genes may be partially masked when using lineage-specific dN/dS estimates. Nevertheless, that A. thaliana experiences a general trend of relaxed selection compared to A. lyrata is only weakly supported (Gl emin 2007) and in any case, the relationship of expression level to lineage-specific dN/dS for A. lyrata is equally reduced, assuming expression to be equivalent in both species (rho = À0.15, P < 2.2 9 10 À16 ; Table S2 in Supporting information). In addition, it is important to note that the differences between pairwise and lineage-specific dN/dS are not explained by the differences in gene/codon samples used to estimate dN/dS resulting from the fact that a smaller proportion of the A. thaliana genome can be simultaneously aligned with both the A. lyrata and the T. parvula genomes as similar results are obtained when restricting the analyses to a common set of codons. Finally, we believe that T. parvula is a justifiable outgroup species as it has an estimated divergence time from A. thaliana of approx. 40 mya; this falls within the range of distances for species used to calculate lineage-specific dN/dS (e.g. approx. 90 mya for the divergence of humans and dogs, as in Toll-Riera et al. (2011)) without confounding the estimate by saturation.
Exon edge removal, but not lineage-specific substitution patterns, unmasks higher levels of positive selection One key objective of this study was to assess whether exon edge conservation and the use of pairwise alignments could be masking higher levels of molecular adaptation than what has previously been observed. In general, we find that the proportion of genes under potential positive selection (dN/dS >1) is increased by the removal of exon edges. Of particular interest are four genes (AT1G08680, AT1G60930, AT2G17305 and AT4G27370) where dN/dS ratios are higher than 1 only after codons are removed from the exon edges, but not when codons are removed from random positions. This could suggest, in these cases, that an adaptive signature has been partially masked by disproportionate synonymous substitutions at the edges of exons. Of note is that AT1G08680 (ARF GAP-like zinc finger-containing protein ZIGA4) has been linked to adaptive germination phenotypes (Morrison & Linder 2014) and that AT1G60930 (RECQ helicase L4B) appears to be a duplicate gene that has undergone a degree of functional divergence (Singh et al. 2010). As duplicated genes undergo asymmetric sequence divergence relative to each other (Conant & Wagner 2003), an adaptive interpretation is in this case plausible.
When considering lineage-specific dN/dS, however, the proportion of genes with dN/dS >1 is significantly lower than when dN/dS is estimated using pairwise alignments of A. thaliana with A. lyrata. This could indicate that dN/dS values higher than 1 are, for several genes, being driven by increased dN/dS values in the A. lyrata lineage which, notably, does not show prevalent self-fertilization.
Having found a significant effect of exon edge conservation and lineage-specific substitution upon dN/dS estimates when each was considered separately, we wished to test whether the relationship between dN/dS and the set of genomic parameters changed when both factors are taken into account together. However, there were only a limited number of genes for which full exons could be aligned across all three species, as required for the analysis of codon removal at the exon edges and the estimates of lineage-specific dN/dS. Using a limited sample (n = 73) in which 10 codons could be removed from the exon edges, we found no significant differences in the relationship of dN/dS to any genomic parameter after codons were removed from the exon edges compared to removal at random sites (Table S6 in Supporting information). Better annotation of A. lyrata and T. parvula, or the genomes of related species, would improve the testing of the effects of exon edge conservation upon dN/dS estimates using lineage-specific substitutions.
The variation in sequence evolution among genes and its association with genic characteristics, including expression, could also be partly explained by genomic context. Most notably, chromosomal location has been associated with gene expression in A. thaliana (Yamada et al. 2003;Schmid et al. 2005). Several studies have also shown that across the genome, there are nonrandom clusters of genes with similar expression profiles in a variety of taxa (Lercher et al. 2002;Versteeg et al. 2003). Clusters of genes with similar evolutionary rate have also been identified (Williams & Hurst 2000;Lercher et al. 2001). A common mechanism may explain both clusters (i.e. Williams & Hurst (2002) but see Lercher et al. (2004)), although further assessment of such hypotheses fall outside of the scope of this study.
In summary, we show that higher conservation at the edges of exons in A. thaliana plays an important part in determining dN/dS ratios by increasing the proportion of conserved synonymous sites. The effect of these conserved regions upon overall dN/dS values partly explains the relationship between rates of protein evolution and expression level. By accounting for lineagespecific substitution patterns and the effect of conservation at the exon edges, the ability of expression level to explain variation in evolutionary rate is diminished, with gene length becoming the strongest correlate. In addition, we found evidence of masked positive selection from the conservation of exon edges, irrespective of the noise introduced to dN/dS estimates by the use of pairwise alignments.

Table S3
Partial correlations between dN/dS and 20 genomic characteristics in A. thaliana, controlling for expression level.

Table S4
Average estimates of four evolutionary rate variables after sequential codon removal from the exon edges vs. random codon removal.

Table S5
Characteristics of the dN/dS and NI distributions for dataset A (pairwise alignment of A. thaliana with A. lyrata) and dataset B (pairwise alignment of A. thaliana with T. parvula), before and after codon removal at exon-intron junctions.

Table S6
Correlations between 4 selection strength/direction variables and 25 genomic characteristics, after the removal of 10, 20 and 30 codons from the exon edges vs. random removal of an equal number of codons. Table S7 dN/dS estimates using codons common to the alignment of A. thaliana, A. lyrata and T. parvula.

Table S8
Relationship between dN/dS and evolutionary rate predictors using estimates derived from codons common to the alignment of A. thaliana, A. lyrata and T. parvula.

Table S9
Relationship between the average number of alternative splicing events per gene and the difference in evolutionary rate estimates before and after codon removal from the exon edges.