Noise-precision tradeoff in predicting combinations of mutations and drugs

Many biological problems involve the response to multiple perturbations. Examples include response to combinations of many drugs, and the effects of combinations of many mutations. Such problems have an exponentially large space of combinations, which makes it infeasible to cover the entire space experimentally. To overcome this problem, several formulae that predict the effect of drug combinations or fitness landscape values have been proposed. These formulae use the effects of single perturbations and pairs of perturbations to predict triplets and higher order combinations. Interestingly, different formulae perform best on different datasets. Here we use Pareto optimality theory to quantitatively explain why no formula is optimal for all datasets, due to an inherent bias-variance (noise-precision) tradeoff. We calculate the Pareto front of log-linear formulae and find that the optimal formula depends on properties of the dataset: the typical interaction strength and the experimental noise. This study provides an approach to choose a suitable prediction formula for a given dataset, in order to best overcome the combinatorial explosion problem.


Approximating to zeroth order
We now find the condition under which this equality holds to zeroth order, i.e., for the log-linear formula $\log M_{123} = c\,\log s_\emptyset + a\sum_i \log s_i + b\sum_{i<j}\log s_{ij}$. Writing $f=\log s$ as a function of the perturbation strengths $(x_1,x_2,x_3)$ and taking the Taylor expansion of both sides to zeroth order gives:

$$f(0,0,0) = c\,f(0,0,0) + 3a\,f(0,0,0) + 3b\,f(0,0,0)$$

and the condition for a formula to be precise to zeroth order is:

$$c + 3a + 3b = 1$$

This is how we added the $s_\emptyset$ terms to the standard formulae: we require that each formula satisfy the above equation, so that it is precise to zeroth order.
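As a consistency check, the following short script (ours, not part of the original analysis; it uses sympy with a generic smooth function f standing for log s) reproduces the zeroth-order condition:

import sympy as sp

a, b, c, x1, x2, x3 = sp.symbols('a b c x1 x2 x3')
f = sp.Function('f')

lhs = f(x1, x2, x3)                                         # log s_123
rhs = (c*f(0, 0, 0)                                         # c * log s_empty
       + a*(f(x1, 0, 0) + f(0, x2, 0) + f(0, 0, x3))        # a * sum of singles
       + b*(f(x1, x2, 0) + f(x1, 0, x3) + f(0, x2, x3)))    # b * sum of pairs

# Zeroth order: evaluate both sides at x1 = x2 = x3 = 0.
zero = {x1: 0, x2: 0, x3: 0}
print(sp.Eq(lhs.subs(zero), rhs.subs(zero)))    # f(0,0,0) = (3a + 3b + c) f(0,0,0)
# Matching coefficients gives the zeroth-order precision condition:
print(sp.solve(sp.Eq(3*a + 3*b + c, 1), c))     # [1 - 3*a - 3*b]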

Approximating to second order
We can continue and approximate the function to second order. The left side has the form (second-order terms only; all derivatives are taken at $(0,0,0)$):

$$\frac{1}{2}\left(\frac{\partial^2 f}{\partial x_1^2}x_1^2+\frac{\partial^2 f}{\partial x_2^2}x_2^2+\frac{\partial^2 f}{\partial x_3^2}x_3^2\right)+\frac{\partial^2 f}{\partial x_1\partial x_2}x_1x_2+\frac{\partial^2 f}{\partial x_1\partial x_3}x_1x_3+\frac{\partial^2 f}{\partial x_2\partial x_3}x_2x_3$$

For a log-linear formula, the right side gives:

$$\frac{a}{2}\left(\sum_i\frac{\partial^2 f}{\partial x_i^2}x_i^2\right)+b\left(\sum_i\frac{\partial^2 f}{\partial x_i^2}x_i^2\right)+b\left(\sum_{i<j}\frac{\partial^2 f}{\partial x_i\partial x_j}x_ix_j\right)$$

so each pure second derivative carries the coefficient $\frac{a}{2}+b$ instead of $\frac{1}{2}$, and each mixed derivative carries the coefficient $b$ instead of $1$. If we treat pure second derivatives and mixed derivatives as being roughly of the same size, the deviation per term is proportional to $\frac{1}{2}(1-a-2b)$ and $(1-b)$, respectively. Since the terms add incoherently, it makes sense to take a performance function of the form:

$$P_{2nd}(a,b)=\frac{1}{4}(1-a-2b)^2+(1-b)^2$$

where the first term comes from comparing the second-derivative terms and the second term comes from comparing the mixed-derivative terms.
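The coefficient comparison above can be checked mechanically. The sketch below (our code; it represents f by a generic second-order Taylor polynomial, which suffices for this comparison) extracts the two deviation terms:

import sympy as sp

a, b = sp.symbols('a b')
x1, x2, x3 = sp.symbols('x1 x2 x3')
# Generic second derivatives at the origin (f11 = d^2 f / dx1^2, etc.);
# zeroth- and first-order terms are omitted since they cancel by construction.
f11, f22, f33, f12, f13, f23 = sp.symbols('f11 f22 f33 f12 f13 f23')

def f2(u, v, w):
    """Second-order part of the Taylor expansion of f at (u, v, w)."""
    return (sp.Rational(1, 2)*(f11*u**2 + f22*v**2 + f33*w**2)
            + f12*u*v + f13*u*w + f23*v*w)

lhs = f2(x1, x2, x3)
rhs = (a*(f2(x1, 0, 0) + f2(0, x2, 0) + f2(0, 0, x3))
       + b*(f2(x1, x2, 0) + f2(x1, 0, x3) + f2(0, x2, x3)))

dev = sp.expand(lhs - rhs)
# Pure second-derivative terms deviate by (1/2)(1 - a - 2b) each:
print(dev.coeff(f11*x1**2))   # -> 1/2 - a/2 - b
# Mixed-derivative terms deviate by (1 - b) each:
print(dev.coeff(f12*x1*x2))   # -> 1 - b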

S3. The Pareto front of noise and first order precision
We compute the Pareto front as the set of points at which the gradients of the two performance functions are proportional but point in opposite directions. For the first-order precision performance

$$P_{1st}(a,b)=|1-a-2b|$$

the gradient is

$$\nabla P_{1st}(a,b)=\pm(1,2)$$

The noise function for a noiseless wild-type is defined as:

$$N(a,b)=3a^2+3b^2$$

The noise gradient in the case of drug combinations is therefore:

$$\nabla N(a,b)=(6a,6b)$$

The Pareto front is then defined by taking the gradients to be proportional, $(6a,6b)\propto(1,2)$, which gives the segment $b=2a$ with $0\le a\le 1/5$, running from the most noise-robust formula $(a,b)=(0,0)$ to the first-order-precise formula $(a,b)=(1/5,2/5)$.
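A minimal numerical check (our code, assuming the performance functions defined above) confirms that along b = 2a the two gradients are indeed anti-parallel:

import numpy as np

def grad_noise(a, b):
    return np.array([6*a, 6*b])          # gradient of N = 3a^2 + 3b^2

def grad_p1(a, b):
    s = np.sign(1 - a - 2*b)             # gradient of |1 - a - 2b|
    return np.array([-s, -2*s])

for a in np.linspace(0.01, 0.19, 5):     # front runs from (0,0) to (1/5, 2/5)
    b = 2*a
    gn, gp = grad_noise(a, b), grad_p1(a, b)
    cross = gn[0]*gp[1] - gn[1]*gp[0]    # zero when the gradients are parallel
    opposite = np.dot(gn, gp) < 0        # True when they point in opposite directions
    print(f"a={a:.2f}: cross={cross:.2e}, opposite={opposite}")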

S4. The Pareto front of noise and second order precision
The Pareto front is defined as the set of points at which it is impossible to improve all performance functions at once. With two performance functions, it is simply the curve along which the gradients of the two functions are proportional to one another and point in opposite directions. This is easily computed for the noise and second-order performance functions. For

$$N(a,b)=3a^2+3b^2,\qquad P_{2nd}(a,b)=\frac{1}{4}(1-a-2b)^2+(1-b)^2$$

we compute the gradients with respect to the parameters $a,b$ to get:

$$\nabla N=(6a,6b),\qquad \nabla P_{2nd}=\left(-\frac{1}{2}(1-a-2b),\;-2(1-b)-(1-a-2b)\right)$$

Requiring the gradients to be proportional gives the following implicit representation of the Pareto front:

$$6a\left[-2(1-b)-(1-a-2b)\right]=6b\left[-\frac{1}{2}(1-a-2b)\right]$$

or:

$$2a^2+7ab-2b^2-6a+b=0$$

If we take instead the noise function for a noisy wild-type:

$$N(a,b)=3a^2+3b^2+(1-3a-3b)^2$$

the equation for the Pareto front becomes:

$$(24a+18b-6)\left[-2(1-b)-(1-a-2b)\right]=(18a+24b-6)\left[-\frac{1}{2}(1-a-2b)\right]$$

or:

$$5a^2+28ab+16b^2-22a-20b+5=0$$
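Both implicit equations can be reproduced symbolically. The following sketch (ours; sympy) sets the cross product of the two gradients to zero, which vanishes exactly on the curves above, up to an overall factor of 3:

import sympy as sp

a, b = sp.symbols('a b')
P2 = sp.Rational(1, 4)*(1 - a - 2*b)**2 + (1 - b)**2

def front(N):
    """Implicit Pareto-front polynomial: gradients of N and P2 parallel."""
    cross = sp.diff(N, a)*sp.diff(P2, b) - sp.diff(N, b)*sp.diff(P2, a)
    return sp.factor(sp.expand(cross))

print(front(3*a**2 + 3*b**2))                       # noiseless wild-type
print(front(3*a**2 + 3*b**2 + (1 - 3*a - 3*b)**2))  # noisy wild-type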

S5. Performance function for non-independent measurement noise
In the main text we analyzed the noise-robustness performance function in the case in which noise in the different measurements is independent. Here we consider the case of a general noise covariance matrix.
We find that the performance function for this case is also quadratic in $a$, $b$, $c$, and its equiperformance contours are ellipsoids around the origin in the three-dimensional space of $a$, $b$, $c$.
The analysis is therefore analogous to that described in the main text.
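As an illustration, here is a minimal sketch (our code; the covariance matrix is an arbitrary example rather than one estimated from data) showing that Var(log M) is the quadratic form $w^T\Sigma w$ in the vector $w$ of the formula's exponents, hence quadratic in $a$, $b$, $c$:

import numpy as np

rng = np.random.default_rng(0)
a, b, c = 0.2, 0.4, -0.8                       # an arbitrary log-linear formula
w = np.array([a]*3 + [b]*3 + [c])              # weights for s_1..s_3, s_12..s_23, s_empty

A = rng.normal(size=(7, 7))
Sigma = A @ A.T / 7                            # a generic noise covariance matrix

analytic = w @ Sigma @ w                       # quadratic form in (a, b, c)

# Monte Carlo check: log M = w . (noisy log-measurements)
eps = rng.multivariate_normal(np.zeros(7), Sigma, size=200_000)
empirical = (eps @ w).var()
print(analytic, empirical)                     # should agree closely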

S6. Generalization to combinations of higher order
In the main text we claim that our approach helps to overcome the combinatorial explosion in the number of combinations, and the main text demonstrates the case $n=3$. Here we briefly discuss the generalization to higher orders of interaction. We treated this problem from a mathematical perspective elsewhere (Tendler & Alon, 2018).
The aim is to estimate the effect of $n$ perturbations using only the single and pair effects. One can define a family of log-linear formulae as:

$$\log M_{1\ldots n}=c\,\log s_\emptyset+a\sum_i \log s_i+b\sum_{i<j}\log s_{ij}$$

The noise performance function in this case is analogous to the triplet case:

$$N(a,b)=n\,a^2+\binom{n}{2}b^2$$

(with an additional $c^2$ term when the wild-type measurement is noisy). The zeroth-order precision condition is also defined analogously by the zeroth-order Taylor approximation:

$$c+n\,a+\binom{n}{2}b=1$$

It is a plane in the three-dimensional space of formulae $(a,b,c)$, and its exact form depends on the order of combination we try to estimate.
The first-order performance function is defined as the deviation of the first-order Taylor approximation, yielding (assuming zeroth-order precision):

$$P_{1st}(a,b)=\left(1-a-(n-1)b\right)^2$$

As in the triplet case the contours are lines, whose slope depends on $n$.
Finally, for the second-order precision we compute the deviation from the second-order Taylor series. The generalized performance function is:

$$P_{2nd}(a,b)=\frac{n}{4}\left(1-a-(n-1)b\right)^2+\binom{n}{2}(1-b)^2$$

As in the triplet case, there is a single formula which is precise to second order, and the performance contours are ellipsoids around it. The second-order-precise formula has $b=1$, $a=-(n-2)$ and, from zeroth-order precision, $c=\binom{n-1}{2}$:

$$M_{1\ldots n}=s_\emptyset^{\binom{n-1}{2}}\,\frac{\prod_{i<j}s_{ij}}{\left(\prod_i s_i\right)^{n-2}}$$
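For concreteness, the following sketch (ours; sympy) solves the three precision conditions for general n and recovers the pairs formula at n = 3:

import sympy as sp

def second_order_precise(n):
    """Exponents (a, b, c) that are precise to second order for a given n."""
    a, b, c = sp.symbols('a b c')
    return sp.solve([sp.Eq(1 - a - (n - 1)*b, 0),             # first/second-order terms
                     sp.Eq(1 - b, 0),                          # mixed-derivative terms
                     sp.Eq(c + n*a + sp.Rational(n*(n - 1), 2)*b, 1)],  # zeroth order
                    [a, b, c])

print(second_order_precise(3))  # {a: -1, b: 1, c: 1} -> the pairs formula
print(second_order_precise(4))  # {a: -2, b: 1, c: 3}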

S7. Synthetically generated datasets
To generate data synthetically we used randomly generated third-order functions of the following form:

$$f(x_1,x_2,x_3)=\eta_1+\eta_2(x_1+x_2+x_3)+\eta_3(x_1^2+x_2^2+x_3^2)+\eta_4(x_1^3+x_2^3+x_3^3)+\eta_5\,x_1x_2x_3+\eta_6(x_1x_2+x_1x_3+x_2x_3)+\eta_7(x_1^2x_2+x_1^2x_3+x_2^2x_1+x_2^2x_3+x_3^2x_1+x_3^2x_2)$$

where the coefficients $\eta_i$ are chosen randomly from the uniform distribution on $[0,1]$. The function was evaluated at $x_1,x_2,x_3\in\{0,d\}$, so $d$ is a measure of the distance of evaluation of the function, which we called "interaction strength" in our graphs. To all evaluations of the function we added Gaussian noise of standard deviation $\sigma$, which is the measurement noise in the related plots.
For each pair of values $(d,\sigma)$ we took 10 different random functions, and from each random function we generated 300 data points by choosing random values for $x_1,x_2,x_3$ uniformly from $[0,1]$. For each dataset we computed the optimal formula; the optimal formula for the values $(d,\sigma)$ was obtained by averaging the optimal formulae predicting the 10 functions.
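A minimal sketch of this recipe (our code; the function names are ours) reads:

import numpy as np

rng = np.random.default_rng(1)

def random_cubic():
    """Random symmetric third-order polynomial with coefficients eta_i ~ U[0,1]."""
    e = rng.uniform(0, 1, size=7)
    def f(x1, x2, x3):
        return (e[0]
                + e[1]*(x1 + x2 + x3)
                + e[2]*(x1**2 + x2**2 + x3**2)
                + e[3]*(x1**3 + x2**3 + x3**3)
                + e[4]*x1*x2*x3
                + e[5]*(x1*x2 + x1*x3 + x2*x3)
                + e[6]*(x1**2*x2 + x1**2*x3 + x2**2*x1
                        + x2**2*x3 + x3**2*x1 + x3**2*x2))
    return f

def noisy_measurements(f, d, sigma):
    """Evaluate f at the corners x_i in {0, d}; add Gaussian noise of STD sigma."""
    corners = [(i*d, j*d, k*d) for i in (0, 1) for j in (0, 1) for k in (0, 1)]
    return {c: f(*c) + rng.normal(0, sigma) for c in corners}

data = noisy_measurements(random_cubic(), d=0.5, sigma=0.1)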
We note that the distance of evaluation $d$ deserves the name "interaction strength". It is indeed related to a natural definition of interaction strength, since to leading order

$$\log\frac{s_{12}\,s_\emptyset}{s_1 s_2}\approx\frac{\partial^2 f}{\partial x_1\partial x_2}\,d^2$$

so the deviation from Bliss independence grows with the distance of evaluation $d$.

S8. The bias-variance tradeoff in the generalized mean class of formulae
In the main text we analyzed the log-linear class of formulae. Here we analyze another class, the class of generalized means, defined as:

$$M_{123}=GM_p=\left[\frac{1}{3}(s_1 s_{23})^p+\frac{1}{3}(s_2 s_{13})^p+\frac{1}{3}(s_3 s_{12})^p\right]^{1/p}$$

where $p$ is a parameter. For $p=-1$ this is the harmonic mean of the terms $s_i s_{jk}$, and for $p=1$ it is their arithmetic mean. Each of these terms is similar to a Bliss approximation taking a single $s_i$ together with the complementary pair $s_{jk}$, so the generalized mean is a way to account for all three Bliss-like possibilities. At certain limits of $p$ the generalized mean has the following forms:

$$GM_{-\infty}=\min\left(s_1 s_{23},\,s_2 s_{13},\,s_3 s_{12}\right)$$
$$GM_{0}=\left(s_1 s_{23}\,s_2 s_{13}\,s_3 s_{12}\right)^{1/3}$$
$$GM_{\infty}=\max\left(s_1 s_{23},\,s_2 s_{13},\,s_3 s_{12}\right)$$

All GM formulae are precise to zeroth order: if we plug in $s_1=s_2=s_3=s_{12}=s_{13}=s_{23}=1$ we get $M_{123}=1$.
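A direct implementation of GM_p (our sketch; handling p = 0 through the geometric-mean limit is our choice) is:

import numpy as np

def gm(p, s1, s2, s3, s12, s13, s23):
    """Generalized-mean prediction for the triplet effect."""
    terms = np.array([s1*s23, s2*s13, s3*s12], dtype=float)
    if p == 0:                          # limit p -> 0: geometric mean
        return terms.prod() ** (1/3)
    return np.mean(terms ** p) ** (1/p)

# Zeroth-order precision: all effects equal to 1 give a prediction of 1.
print(gm(2.0, 1, 1, 1, 1, 1, 1))        # -> 1.0
# Large |p| approaches the max / min of the three Bliss-like terms.
print(gm(-50, 0.2, 0.5, 0.9, 0.3, 0.4, 0.6))  # close to min(0.12, 0.20, 0.27)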
The generalized mean formulae differ in their noise robustness. Noise grows with $p$, which makes the min model ($GM_{-\infty}$) the most noise-robust (see the figure below). As suggested by the noise-robustness analysis, the min model provides the best predictions in this class for some datasets (extra data on A549 and H1299 (Zimmer, Tendler, Katzir, Mayo, & Alon, 2017) and yeast three-gene deletion interactions (Kuzmin et al., 2018); not shown).
Generalized means also differ in the type of interactions they predict, because of the power-mean inequality: if $p<q$ then $GM_p\le GM_q$. Therefore, the min model will be favored for synergistic interactions, whereas the max model ($GM_{\infty}$) will be preferred if most interactions in the dataset are antagonistic. In the present study, all non-expanded datasets (Table 1) showed a majority of antagonistic interactions ($s_i s_j<s_{ij}$). The only exception was (Beppler et al., 2016), which was constructed to be synergistic.

Figure: (A) Noise sensitivity of the generalized mean model. The horizontal axis is the parameter $p$ of the generalized mean model and the vertical axis is the standard deviation of the predictions of the model. Each point is based on 100,000 log-normally distributed $s_i$, $s_{ij}$ ($\log s_i$ and $\log s_{ij}$ have mean zero and standard deviation 1). The model becomes more sensitive to noise as $p$ grows larger. (B) Comparison of the performance of the log-linear and generalized-mean classes. For each dataset we computed the optimal formula in the log-linear and generalized-mean classes and compared the performance of the optimal formulae in terms of root-mean-square error. Error bars result from bootstrapping each dataset 100 times.
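The noise-sensitivity experiment described in panel (A) can be sketched as follows (our code; the sample size matches the caption, but the values of p shown are illustrative):

import numpy as np

rng = np.random.default_rng(2)

def gm_p(p, terms):
    """Generalized mean of the Bliss-like terms, row-wise."""
    if p == 0:
        return np.exp(np.log(terms).mean(axis=1))
    return (terms ** p).mean(axis=1) ** (1/p)

n = 100_000
logs = rng.normal(0.0, 1.0, size=(n, 6))       # log s_i, log s_ij ~ N(0, 1)
s1, s2, s3, s12, s13, s23 = np.exp(logs).T
terms = np.stack([s1*s23, s2*s13, s3*s12], axis=1)

for p in (-8, -2, 0, 2, 8):
    print(p, gm_p(p, terms).std())             # the spread grows with p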