Evolutionary trade-off and mutational bias could favor transcriptional over translational divergence within paralog pairs

How changes in the different steps of protein synthesis—transcription, translation and degradation—contribute to differences of protein abundance among genes is not fully understood. There is however accumulating evidence that transcriptional divergence might have a prominent role. Here, we show that yeast paralogous genes are more divergent in transcription than in translation. We explore two causal mechanisms for this predominance of transcriptional divergence: an evolutionary trade-off between the precision and economy of gene expression and a larger mutational target size for transcription. Performing simulations within a minimal model of post-duplication evolution, we find that both mechanisms are consistent with the observed divergence patterns. We also investigate how additional properties of the effects of mutations on gene expression, such as their asymmetry and correlation across levels of regulation, can shape the evolution of paralogs. Our results highlight the importance of fully characterizing the distributions of mutational effects on transcription and translation. They also show how general trade-offs in cellular processes and mutation bias can have far-reaching evolutionary impacts.


Defining gene-specific fitness functions
We defined the relationship between fitness and the expression level (protein abundance) of any gene using parabolic functions. Each such function $W(p)$ is described by its vertex $(p_{opt}, \mu)$, where the highest fitness is obtained for an optimal protein abundance, and a noise sensitivity $Q$ [1] related to the curvature $W''(p_{opt})$ of the parabola (Eq 1).
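As a minimal sketch of the parabola described above, the following Python snippet derives standard-form coefficients from the vertex $(p_{opt}, \mu)$ and a quadratic coefficient `a`. The function names and example values are hypothetical. The mapping from $Q$ to `a` is the one given by Eq 1 and is not reproduced here, so `a` is passed in directly.

```python
from typing import Tuple


def standard_form(p_opt: float, mu: float, a: float) -> Tuple[float, float, float]:
    """Return (a, b, c) such that W(p) = a*p**2 + b*p + c has its vertex at (p_opt, mu).

    Expanding the vertex form W(p) = a*(p - p_opt)**2 + mu gives
    b = -2*a*p_opt and c = a*p_opt**2 + mu.
    """
    if a >= 0:
        # A fitness maximum at the vertex requires a downward-opening parabola.
        raise ValueError("a must be negative for the vertex to be a fitness maximum")
    b = -2.0 * a * p_opt
    c = a * p_opt ** 2 + mu
    return a, b, c


def fitness(p: float, coeffs: Tuple[float, float, float]) -> float:
    """Evaluate the standard-form parabola at protein abundance p."""
    a, b, c = coeffs
    return a * p ** 2 + b * p + c
```

For instance, `standard_form(100.0, 0.42, -1e-5)` describes a hypothetical gene whose fitness peaks at 100 proteins per cell and falls off symmetrically on either side of that optimum.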
Since the curvature of the function can be isolated from Eq 1, $Q$ can be used to compute the $a$ parameter of the parabola and, from there, to obtain the full equation in the standard form $ax^2 + bx + c$.
Following a duplication event, the new fitness function of cumulative protein abundance $W(p_1 + p_2)$ is defined from the $W(p)$ function of the ancestral gene. Only one parameter is modified: $p_{opt}$, which is multiplied by 1.87 (see below). This gives the post-duplication fitness function (Eq 3).

Selecting the post-duplication change in optimal protein abundance
After a gene duplication event, the total transcription of the resulting gene pair is the ancestral transcription rate of the singleton times a factor $\Delta_m$. Similarly, the optimal cumulative protein abundance for the two paralogs is $\Delta_{opt}$ times the original optimal expression $p_{opt}$ of the ancestral singleton. Eq 3 thus yields a condition on $\Delta_{opt}$ under which post-duplication fitness is strictly positive. Multiplying this condition by $\Delta_{opt}$ gives an expression of the form $a\Delta_{opt}^2 + b\Delta_{opt} + c$.
To solve for $\Delta_{opt}$, we consider the most extreme case: an ancestral gene with the highest possible noise sensitivity and protein abundance. Accordingly, $Q$ is set to the highest possible value within the framework of [1] ($\sim 6.8588 \times 10^{-6}$) and $p_{opt}$ is set to the highest expression level observed in the dataset ($\sim 6.0649 \times 10^{6}$ proteins per cell).

The constant $\Delta_m$ is set to 2, meaning a doubling of total transcription, and a maximum growth rate $\mu$ of $0.42\,\mathrm{h}^{-1}$ is considered. Solving with these values gives bounds on $\Delta_{opt}$. In accordance with this result, we used $\Delta_{opt} = 1.87$ throughout the current work. For all the random seeds used in the simulations, this value, obtained for the minimal model, was also valid for the precision-economy model.
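The bound calculation can be sketched numerically as follows. This is an illustration under two assumptions that are not stated explicitly in the text above: the quadratic coefficient of the ancestral parabola is taken to be $a = -Q/p_{opt}$ (so that it becomes $-Q/(\Delta_{opt}\,p_{opt})$ once $p_{opt}$ is rescaled), and cumulative protein abundance right after duplication is taken to be $\Delta_m p_{opt}$. The function name `delta_opt_bounds` is hypothetical. Under these assumptions, the positivity condition multiplied by $\Delta_{opt}$ is a quadratic in $\Delta_{opt}$ whose two roots delimit the valid interval.

```python
import math
from typing import Tuple


def delta_opt_bounds(q: float, p_opt: float, mu: float, delta_m: float = 2.0) -> Tuple[float, float]:
    """Interval of delta_opt giving strictly positive post-duplication fitness.

    Assumed (not taken from the source) fitness function after duplication:
        W(p) = mu - q / (delta_opt * p_opt) * (p - delta_opt * p_opt)**2
    evaluated at p = delta_m * p_opt. Requiring W > 0 and multiplying by
    delta_opt (> 0) gives, with k = q * p_opt:
        k*d**2 - (2*k*delta_m + mu)*d + k*delta_m**2 < 0
    which holds strictly between the two roots returned here.
    """
    k = q * p_opt
    b = -(2.0 * k * delta_m + mu)
    c = k * delta_m ** 2
    disc = b * b - 4.0 * k * c  # equals mu * (4*k*delta_m + mu), positive when mu > 0
    root = math.sqrt(disc)
    return (-b - root) / (2.0 * k), (-b + root) / (2.0 * k)


# Extreme-case values quoted in the text.
lower, upper = delta_opt_bounds(q=6.8588e-6, p_opt=6.0649e6, mu=0.42, delta_m=2.0)
```

With the values quoted in the text, the lower root under these assumptions falls between 1.86 and 1.87, which is consistent with the choice of $\Delta_{opt} = 1.87$.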

Estimating expression noise for a protein expressed from a pair of paralogous genes
As mentioned in the main text, the variance of protein abundance can be estimated for a single-copy gene. To obtain a similar equation for a pair of identical paralogs expressing the same protein, the extrinsic and intrinsic components of noise must be treated separately.
Because two duplicate genes are by definition present in the same cell, extrinsic fluctuations will be equal for both of them (as we assume they are identical and thereby share all regulators), while intrinsic fluctuations will independently affect the expression level of each copy. As it does not depend on any gene-specific property, the noise floor $c_{v0}$ is chosen as the extrinsic component (Eq 7). Although recent modeling work indicates that this noise floor is extrinsic in nature [2], we note that it might still not fully represent extrinsic noise.
The variance of the cumulative protein abundance of $P_1$ and $P_2$ can be obtained from the variances of their individual protein abundances. To perform this calculation, the fluctuations around mean protein abundance across a population of cells can be seen as a random variable with mean 0 and variance $\sigma^2$. As shown above (Eq 7), this random variable is itself the sum of two other random variables representing the intrinsic and extrinsic components of these fluctuations. In turn, these two components are each a sum of the respective contributions of both paralogs. By definition, intrinsic fluctuations are uncorrelated between duplicates, so the intrinsic components are two independent variables and the cumulative intrinsic variance is the sum of the intrinsic variances calculated for each duplicate gene. In contrast, extrinsic fluctuations are the same for two identical paralogs within the same cell, so the extrinsic components of protein abundance variance for $P_1$ and $P_2$ are two perfectly positively correlated variables. Their cumulative variance is thus the square of the sum of their standard deviations. Accordingly, the variance of cumulative protein abundance for a duplicate pair is the sum of the two intrinsic variances and the squared sum of the two extrinsic standard deviations.
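The combination rule described in this paragraph can be sketched in a few lines of Python. The function and argument names are hypothetical. The helper `extrinsic_sd` additionally assumes that the noise floor $c_{v0}$ is a coefficient of variation, so that the extrinsic standard deviation of a copy is $c_{v0}$ times its mean abundance.

```python
def extrinsic_sd(cv0: float, mean_abundance: float) -> float:
    """Extrinsic standard deviation of one copy.

    Assumption: the noise floor cv0 is a coefficient of variation (sd / mean).
    """
    return cv0 * mean_abundance


def cumulative_variance(var_int_1: float, var_int_2: float,
                        sd_ext_1: float, sd_ext_2: float) -> float:
    """Variance of the summed protein abundance of two paralogs.

    Intrinsic components are independent, so their variances add.
    Extrinsic components are perfectly positively correlated, so their
    standard deviations add before squaring.
    """
    intrinsic = var_int_1 + var_int_2
    extrinsic = (sd_ext_1 + sd_ext_2) ** 2
    return intrinsic + extrinsic
```

Because the extrinsic standard deviations are summed before squaring, the result is larger than it would be if all four components were treated as independent. The difference is the cross term `2 * sd_ext_1 * sd_ext_2`.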

Selection of valid ancestral genes
During the generation of ancestral singletons, a minimal threshold of fitness function curvature is enforced. This ensures that all selected genes are sensitive enough to changes in protein abundance for the immediate post-duplication loss of a paralog to be deleterious. A duplicate pair for which this would not be the case would rapidly revert to the singleton state.
Classical population genetics theory indicates that a mutation needs to cause a loss of fitness greater than the inverse of the effective population size $N$ to be efficiently selected against. Accordingly, we want to identify conditions under which the loss of a paralog immediately after duplication would reduce fitness by more than $1/N$, that is, conditions under which fitness at the cumulative abundance $p_{tot}$ exceeds fitness at $p_{tot}/2$ by more than $1/N$. For the filtering of singleton genes, it is more convenient to express $p_{tot}$ as twice the ancestral protein abundance optimum $p_{opt}$. Using the parabola of form $ap_{tot}^2 + bp_{tot} + c$ that is the fitness function $W(p_{tot})$, and adding the constants $\Delta_m$ and $\Delta_{opt}$ (describing the post-duplication change of total transcription and of optimal cumulative protein abundance, respectively) to generalize to any duplication, this requirement can be expanded, summed and simplified. Accordingly, all ancestral singletons included in the current simulations combine a noise sensitivity $Q$ and a protein abundance optimum $p_{opt}$ which satisfy the following condition:
$$Q\,p_{opt} > \frac{1}{c_n N} \qquad (12)$$
where $c_n = \Delta_m \left(1 - \frac{3\Delta_m}{4\Delta_{opt}}\right)$. This condition is only valid when $c_n > 0$, which implies that $\frac{3\Delta_m}{4\Delta_{opt}} < 1$. In accordance with this, all simulations presented in the current work are done under $\frac{3}{4}\Delta_m < \Delta_{opt}$.
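The filter on ancestral singletons can be sketched as a small predicate. This is an illustration that reads condition (12) as $Q\,p_{opt} > 1/(c_n N)$ with $c_n = \Delta_m\,(1 - 3\Delta_m / (4\Delta_{opt}))$. The function names are hypothetical, and the effective population size used in the usage note is an arbitrary placeholder, not a value from this work.

```python
def c_n(delta_m: float = 2.0, delta_opt: float = 1.87) -> float:
    """Constant c_n of condition (12), as read from the text."""
    return delta_m * (1.0 - 3.0 * delta_m / (4.0 * delta_opt))


def passes_filter(q: float, p_opt: float, n_eff: float,
                  delta_m: float = 2.0, delta_opt: float = 1.87) -> bool:
    """Return True if losing one paralog right after duplication is
    deleterious enough to be selected against, i.e. Q * p_opt > 1 / (c_n * N).
    """
    factor = c_n(delta_m, delta_opt)
    if factor <= 0:
        # The condition is only meaningful when (3/4) * delta_m < delta_opt.
        raise ValueError("condition (12) requires 3 * delta_m / (4 * delta_opt) < 1")
    return q * p_opt > 1.0 / (factor * n_eff)
```

For example, `passes_filter(6.8588e-6, 6.0649e6, n_eff=1e6)` evaluates the extreme-case gene discussed earlier against a placeholder population size of one million. Raising an error when $c_n \le 0$ mirrors the restriction to $\frac{3}{4}\Delta_m < \Delta_{opt}$.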