An extension of the Walsh-Hadamard transform to calculate and model epistasis in genetic landscapes of arbitrary shape and complexity

Accurate models describing the relationship between genotype and phenotype are necessary in order to understand and predict how mutations to biological sequences affect the fitness and evolution of living organisms. The apparent abundance of epistasis (genetic interactions), both between and within genes, complicates this task and how to build mechanistic models that incorporate epistatic coefficients (genetic interaction terms) is an open question. The Walsh-Hadamard transform represents a rigorous computational framework for calculating and modeling epistatic interactions at the level of individual genotypic values (known as genetical, biological or physiological epistasis), and can therefore be used to address fundamental questions related to sequence-to-function encodings. However, one of its main limitations is that it can only accommodate two alleles (amino acid or nucleotide states) per sequence position. In this paper we provide an extension of the Walsh-Hadamard transform that allows the calculation and modeling of background-averaged epistasis (also known as ensemble epistasis) in genetic landscapes with an arbitrary number of states per position (20 for amino acids, 4 for nucleotides, etc.). We also provide a recursive formula for the inverse matrix and then derive formulae to directly extract any element of either matrix without having to rely on the computationally intensive task of constructing or inverting large matrices. Finally, we demonstrate the utility of our theory by using it to model epistasis within both simulated and empirical multiallelic fitness landscapes, revealing that both pairwise and higher-order genetic interactions are enriched between physically interacting positions.


Introduction
A fundamental challenge in biology is to understand and predict how changes (or mutations) to biological sequences (DNA, RNA, proteins) affect their molecular function and ultimately the phenotype of living organisms.The phenomenon of 'epistasis' (genetic interactions)broadly defined as the dependence of mutational effects on the genetic context in which they occur [1][2][3]-is widespread in biological systems, yet knowledge of the underlying mechanisms remains limited.Defining the extent of epistasis and better understanding of its origins has relevance in fields ranging from genetic prediction, molecular evolution, infectious disease and cancer drug development, to biomolecular structure determination and protein engineering [3].
Evolutionarily related sequences, natural genetic variation within populations, and more recently results of techniques such as deep mutational scanning (DMS) [4]-also known as massively parallel reporter assays (MPRAs) and multiplex assays of variant effect (MAVEs)represent valuable sources of data to study epistasis [1,5].In particular, DMS enables the systematic measurement of mutational effects across entire combinatorially complete genetic landscapes [5][6][7][8][9][10][11][12][13].Importantly, the typical use of engineered genotypes, haploid individuals and near-identical environmental (laboratory) conditions in these experiments allows population genetic considerations-such as dominance, variable allele frequencies and linkage disequilibrium-to be ignored [14].In other words, measurements obtained from deep mutational scanning and related methods permit the modeling of epistasis in the mechanistic sense (sequence-to-function encoding) rather than in the evolutionary sense, i.e. based on the dynamics of population genotype frequencies.Nevertheless, precisely how to extract the most biologically relevant pairwise and higher-order epistatic coefficients (genetic interaction terms) from this type of data is an unresolved problem.
Quantitative definitions of epistasis vary among fields, but it has been argued that one particular formulation termed 'background-averaged' epistasis, also known as 'ensemble' epistasis [1,12], may provide the most useful information on the epistatic structure of biological systems [2].The underlying rationale is that by averaging the effects of mutations across many different genetic backgrounds (contexts), the method is robust to local idiosyncrasies in the relationship between genotype and phenotype.It has been previously pointed out that the definition of background-averaged epistasis is conceptually similar to that of 'statistical epistasis' attributed to Fisher, but instead of measuring the average effect of allele substitutions against the population average genetic background i.e. averaging over all genotypes present in a given population (taking into account their individual frequencies), the approach instead averages over all possible genotypes (assuming equal genotype weights) [1,2].
The current mathematical formalism of background-averaged epistasis is based on the Walsh-Hadamard transform [2].Interestingly, although widely used in physics and engineering, the Walsh-Hadamard transform was first applied to non-biological fitness landscapes in the field of genetic algorithms (GA) [15], subsequently being proposed as the basis of a framework for the computation of higher-order epistasis in empirical settings [16].However, the Walsh-Hadamard transform can only accommodate two alleles (amino acid or nucleotide states) per sequence position, with no extension to multialleleic landscapes (cardinality greater than two) yet made, as confirmed by multiple recent reports [2,[17][18][19].Alternative implementations for multiallelic landscapes either rely on 'one-hot encoding' elements of larger alphabets as biallelic sequences-requiring the manipulation of prohibitively large Walsh-Hadamard matrices-or constructing graph Fourier bases [18,20], which is mathematically complex and provides no straightforward way to interpret epistatic coefficients.The result is that the application of background-averaged epistasis has been severely limited and its properties remain largely unexplored in more biologically realistic scenarios.
In this work we provide an extension of the Walsh-Hadamard transform that allows the calculation and modeling of background-averaged epistasis in genetic landscapes with an arbitrary number of states (20 for amino acids, 4 for nucleotides, etc.).We also provide a recursive formula for the inverse matrix, which is required to infer epistatic coefficients using regression.Furthermore, we derive convenient formulae to directly extract any element of either matrix without having to rely on the computationally intensive task of constructing or inverting large matrices.Lastly, we apply these formulae to the analysis of both simulated and empirical multiallelic DMS datasets, demonstrating that sparse models inferred from the background-averaged representation (embedding) of the underlying genetic landscape more regularly include epistatic terms corresponding to direct physical interactions.

Extension of the Walsh-Hadamard transform to multiallelic landscapes
In this work, a genotype sequence is represented as a one-dimensional ordering of monomers, each of which can take on s possible states (or alleles), for example s = 4 for nucleotide sequences or s = 20 for amino acid sequences.Without loss of generality, the s states can be labelled 0, 1, 2, . .., s − 1, where 0 denotes the wild-type allele.We are going to consider genotype sequences of length n 2 N, i.e. sequences taking values in S n , where S ≔ f0; 1; . . .; s À 1g.
Each genotype ĩ 2 S n is associated with its phenotype y ĩ .Note that here we use the term 'phenotype' as shorthand for 'molecular phenotype score' from a quantitative laboratory assay (DMS) reporting on a molecular function for each genotype of interest.In quantitative genetics terminology this might be referred to as 'genotypic value' because environmental deviation is negligible due to the controlled nature of the experiments, but our subject here is the macromolecule not an individual from a population [14].In the context of empirical genotype-phenotype landscapes, the phenotypic effect of a genotype ĩ is typically measured with respect to the wild-type, i.e. it is given by y ĩ À y ð0;...;0Þ .
It is important to emphasize that in what follows we implicitly restrict ourselves to the haploid reference base, because our primary goal is the modeling of sequence-to-function encodings for individual genotype sequences-for the ultimate purpose of understanding and engineering macromolecules-not the modeling of sequence evolution or quantification of sources of phenotypic variance in populations.
If the phenotypic effects of individual mutations were independent, they would be additive, meaning that the phenotypic effect of ĩ ¼ ði 1 ; . . .; i n Þ would be the sum of the phenotypic effects of the single mutants (i 1 , 0, . .., 0), . .., (0, . .., 0, i n ).The epistatic coefficient quantifies how much the observed phenotypic effect of ĩ deviates from this assumption.In the case of background-averaged epistasis, we quantify the interactions between a set of mutations by averaging over all possible genotypes for the remaining positions in the sequence.For example, if n = 3 and s = 2, the pairwise epistatic coefficient involving the mutations at positions 2 and 3 is calculated by averaging over all states (backgrounds) for the remaining positions, in this case given by the two states of the first position (* denotes the positions at which the averaging is performed), i.e.
More generally, in [2] it is shown that for s = 2 and any sequence length n, phenotypic effects can be decomposed into background-averaged epistatic coefficients with where � y n is the vector ðy ĩ ; ĩ 2 ½0; 1� n Þ, � ε n is the vector ðε j ; j 2 ½ * ; 1� n Þ and Ĥn and V n are 2 n × 2 n matrices defined recursively as follows: The matrix Ĥ is known as the Walsh-Hadamard transform [21,22] and V is a diagonal weighting (or normalisation) matrix to correct the sign and account for averaging over different numbers of backgrounds as a function of epistatic order [2].
In this work, we provide an extension of this theory to describe background-averaged epistasis for sequences with an arbitrary number of states s.Before writing a general formula, we consider the simplest possible multi-state (multiallelic) landscape i.e. a sequence of length n = 1 with s = 3, Consistent with the definition of background-averaged epistasis for biallelic landscapes [2], the zeroth-order epistatic coefficient ε (�) is the mean phenotypic value across all genotypes and the first-order epistatic coefficients ε (1) and ε (2) are simply the respective individual phenotypic effects of genotypes y (1) and y (2) with respect to the wild-type.However, the key feature of H 1 for multiallelic landscapes-and where it departs from the canonical Walsh-Hadamard transform-is the introduction of zero elements to exclude phenotypes that are irrelevant for the calculation of a given epistatic coefficient.In other words, these phenotypes are excluded because they correspond neither to relevant intermediate genotypes nor alternative genetic backgrounds.
If we now consider a sequence of length n = 2 with s = 3, then the H 2 and V 2 matrices become 9 × 9 (s n × s n ) and can be constructed from recurring to the case n = 1 above, giving " (1;2) " (2;¤) " (2;1) " (2;2) y (1;2) y (2;0) y (2;1) y (2;2) where the colors highlight the block structure of the matrices.In V 2 , the red square corresponds to 1 s V 1 and the light red squares to −V 1 .In H 2 , the gray squares correspond to H 1 and the blue squares to −H 1 .In Table 1 we show the results of background-averaged epistatic coefficients calculated by applying the above formula to an empirical multiallelic landscape with n = 2 and s = 3 [6].
More generally, for any value of s, when n = 1, . . .
where ε (�) corresponds to averaging phenotypes over all possible genotypes and the remaining coefficients simply correspond to the phenotypic effects of each mutation.
For n = 2, we have to consider different combinations of mutations in both positions.In this case, the phenotypes can be written as y ð0;0Þ ; y ð0;1Þ ; . . .; y ð0;ðsÀ 1ÞÞ ; y ð1;0Þ ; . . .; y ð1;ðsÀ 1ÞÞ ; . . .; y ððsÀ 1Þ;0Þ ; . . .; y ððsÀ 1Þ;ðsÀ 1ÞÞ : A natural ordering of the phenotypes is given by interpreting genotype ĩ as the base s representation of an integer (see Table 1).From this, we can see how the first s genotypes correspond to combining the wild-type allele at the first position with a state from the case n = 1, i.e. to genotypes that can be written 0 _ ĩ ≔ ð0; ĩÞ, with ĩ 2 S 1 .The next s genotypes correspond to the first mutated allele at the first position combined with all the genotypes of n = 1, i.e. 1 _ ĩ; ĩ 2 S 1 , and so on.Therefore, we can write the matrices H and V following a block structure.In the case n = 2 and any given s, we would then have where the number of H 1 blocks corresponds to the number of states of the first position, so s.Moreover, each of these blocks must be normalized to yield the corresponding background-Table 1. Interaction terms based on background-averaged epistasis (� ε) for an empirical multiallelic genotype-phenotype landscape consisting of all combinations of two mutations each at positions 6 and 66 in the tRNA-Arg(CCU) [6], i.e. n = 2 and s = 3.The first two columns indicate nucleic acid sequences and their base 3 representations.Here the 'GC' reference (wild-type) genotype corresponds to that of S. cerevisiae, denoted by (0, 0).The second two columns show the measured phenotypic effects and corresponding background-averaged epistatic coefficients together with their empirical errors and their propagated errors.See Results for a regression analysis of the entire dataset and Proposition 8 in S1 Text for a derivation of the error propagation.averaged epistatic terms.Therefore V 2 can also be expressed as a function of V 1 as follows:

Nucleic acid sequence Base
Given these two matrices, we can write the background-averaged epistatic coefficients for the case of n = 2 and s different states per position as � More generally, the decomposition of phenotypic effects into background-averaged epistatic coefficients is given by where H n and V n can be defined recursively as

Recursive inverse matrix
Eq (9) defines the vector of epistatic coefficients, � ε n , as a function of the vector of phenotypes, � y n , which in general is the quantity that is measured experimentally.However, usually phenotypic measurements are only available for a subset of genotypes.An alternative is therefore to estimate the epistatic coefficients � ε n by regression, where the product H À 1 n � V À 1 n represents a matrix of sequence features.This is analogous to the more widely used one-hot encoding strategy, which implicitly relies on a 'background-relative' (or 'biochemical') view of epistasis when regressing to full order [2].We discuss other advantages of estimating background-averaged epistatic coefficients using regression at the end of this manuscript.Since V n is a diagonal matrix, its inverse is also a diagonal matrix whose elements are the inverse of the elements of V n .
The inverse of H n is the matrix A n which can be defined recursively as See Proposition 1 in S1 Text for a proof of this result.This is the most efficient method to determine the full matrix A n (see Results).

Formulae to obtain elements of the matrices
When regressing phenotypes on genotypes, a common goal is to determine whether epistatic coefficients up to the r th order (where r < n) are sufficient to describe the complexity of the biological system.Furthermore, as mentioned above, some fraction of phenotype values within combinatorially complete genetic landscapes are typically unavailable, representing missing data.Restricting the epistatic order and missing phenotypes respectively correspond to omitting rows and columns from H n (and vice versa from A n ).Formulae to directly obtain elements of the matrices in Eqs ( 9) and ( 12) would therefore be convenient.
In order to write the matrix element (H n ) ij , we need to compare the genotype sequences ĩ; j 2 S n , where ĩ denotes the i th element in S n , S ¼ f0; 1; . . .; s À 1g, and the elements of S n are ordered by the base s representation of integers.For instance, for any value of n, we will denote the wild-type state with index i = 1 and write ĩ ¼ 1 ¼ ð0; . . .; 0Þ.The element denoted with index i = 2 would be ĩ ¼ 2 ¼ ð0; . . .; 0; 1Þ and so on.
The elements of H n can be written as where M and E are s n × s n matrices whose elements are where δ ij denotes the Kronecker delta of i, j, which is equal to 1 when i = j and 0 if i 6 ¼ j.In words, ðE n Þ ij counts the number of positions at which the genotype sequences ĩ and j carry the same mutated allele and ðM n Þ ij is equal to ðE n Þ ij plus the number of positions where ĩ or j carry the wild-type allele.See Proposition 2 in S1 Text for a proof of this result.Furthermore, the elements of A n can be written as where E n is defined as in Eq (15).See Proposition 3 in S1 Text for a proof of this result.Finally, the matrices V n and V À 1 n are diagonal matrices whose diagonal elements can be written as where and ĩ again denotes the i th element in S n when ordered by the base s representation of integers.
In words, w k = 1 if the genotype sequence ĩ carries the wild-type allele at position k and W n ð ĩÞ counts the number of positions in ĩ carrying the wild-type allele.We prove this result in Proposition 4 in S1 Text.

Generalization to different numbers of states per position
We can generalize the formulae described in the previous subsection further by considering that each position can have different numbers of states.In this case, we can denote s k the number of possible states at position k.For n = 1, this corresponds to exactly the same matrix as in the previous case but with s = s 1 , which is the number of possible states in this position.For n = 2, the matrix changes because now the new position can have a different number of possible states, s 2 .Following the recursive definition of H n , we can construct H 2 by repeating H 1 s 2 times, with the structure stated in Eq (10).Therefore, we have So the structure is exactly the same but the size of the matrix for each n varies according to the number of possible states of the new position.The definition of H n is the same as in Eq (10) but the dimensions of the matrix are Similarly, the inverse matrix A n+1 can be written recursively as The matrix A n defined in Eq (21) is the inverse of the matrix H n in the general case where each position can have a different number of states.In this general case, the elements of H n and A n can be written as where E n and M n are defined as in Eqs ( 15) and ( 16) and e k ¼ The matrices V n and V À 1 n are diagonal matrices whose diagonal elements can be written as and where We prove the results in this subsection in Propositions 5, 6 and 7 in S1 Text.
The above formulae permit the calculation and modeling of background-averaged epistasis in arbitrarily-shaped genetic landscapes, i.e. with any number of alleles (states) per position, as well as the direct construction of sub-matrices for regression to any desired epistatic order and/or in the presence of missing data.In the following subsections we report benchmarking results comparing the performance of alternative methods to obtain H n and A n , as well as results from the application of our theory extension to an empirical multiallelic genotype-phenotype landscape.In this paper, we provide different methods to construct A n ¼ H À 1 n .First, H n can be numerically inverted using standard matrix inversion algorithms (here we use the linalg.invfunction from the SciPy library in Python), referred to as "Recursive H n inverse" in Fig 1E and  1F.Alternatively, the recursive definition of the inverse given by Eq (13) can be used, which we refer to as "Recursive A n ".As can be seen in Fig 1E, this method is faster than numerically inverting H n .

Benchmarking
Finally, we also provide a convenient formula for extracting specific individual elements of A n Eq (17), referred to as "All elements A n " in Fig 1E and 1F.This method is more computationally intensive than the previously described methods, due to the formula relying on the computation of (E n ) ij , which equates to counting the number of sequence positions that are identically mutated in vectors ĩ and j, each of size n.However, in situations where subsets of elements (or sub-matrices)-rather than full matrices-are desired, Eq (17) provides a method that can be faster and more memory efficient (see "10 elements A n " in Fig 1E and  1F).
For example, in the case of a 10-mer DNA sequence, constructing the full inverse transform A 10 with s = 4 would require > 10 23 bytes (100 million petabytes) of memory in the best-case scenario ("Recursive H n inverse" in Fig 1F, log-linear extrapolation).Similarly, the full inverse transform for a 4-mer amino acid sequence (A 4 with s = 20) would impose a memory footprint > 10 20 bytes.On the other hand, calculating the subset of elements from these matrices required for the prediction of a single phenotype using epistatic coefficients up to third order (three-way genetic interaction terms) is feasible in both situations using Eq (17) (3,675 and 29,678 elements; 2.5 GB and 192 GB of memory; 1.8 and 99 seconds, respectively).This memory footprint can easily be diminished further using data chunking, which is a unique benefit of this method.

Application to a simulated multiallelic genotype-phenotype landscape
In order to demonstrate the utility of our extension of the Walsh-Hadamard transform, we used it to model epistasis within a simulated multiallelic genetic landscape.Knowledge of the 'ground truth' epistatic coefficients allowed us to (i) determine whether background-averaged coefficients can be accurately inferred using regression and (ii) investigate the impact of data sparseness on the results i.e. in the context of missing phenotypic values.In most practical applications-particularly in the case of large empirical genotype-phenotype landscapesmeasuring the phenotypic effects of all mutation combinations is infeasible.The simulated combinatorially complete landscape consisted of all possible combinations of four alleles at six different positions i.e. a total of 4 6 = 4, 098 genotypes (Fig 2A , left).First-order and a subset of second-order epistatic coefficients were randomly drawn from Gaussian distributions, while all higher-order epistatic coefficients were set to zero (see Methods).We arbitrarily selected three pairs of positions at which mutations genetically interact i.e. ground truth non-zero pairwise epistatic terms (Fig 2A , left).Fitness values for all variants were reconstructed using Eq (12) with random noise added to simulate measurement error (see Methods).
We then trained Lasso regression models of the form in Eq (12) to predict fitness values from sequence, where the inferred model parameters correspond to background-averaged epistatic coefficients up to third order (S1(A) Fig; see Methods).To assess the impact of data sparsity on the results, we sub-sampled the simulated phenotype (fitness) values to obtain training ).Comparisons are shown between numerically inverting the recursively constructed H n (using scipy.linalg.inv),i.e. "Recursive H n inverse", using the recursive formula for A n , using the formula to extract all elements of A n and extracting 10 random elements of A n (see legend).The mean across 10 replicates is depicted.Linear regression lines were fit to data from matrices with at least 100 elements.F. Similar to E but indicating memory usage.Linear regression lines were fit to data from matrices with at least 10 elements.https://doi.org/10.1371/journal.pcbi.1012132.g001Left: simulated multiallelic landscape consisting of all combinations of 4 nucleotides in a DNA 6-mer where ground truth interactions ( [1,2], [3,5] and [5,6]) are depicted with red arcs (see Methods).Middle: secondary structure of S. cerevisiae tRNA-Arg(CCU) indicating variable positions (closed circles) combinatorially mutated in a DMS experiment as described in [6].Three Watson-Crick base pairing (WCBP) interactions involving pairs of these positions ( [1,71], [2,70] and [6,66]) are indicated.Right: crystal structure of a blue fluorescent variant (TagBFP) of the Entacmaea quadricolor protein eqFP611 (PDB: 3M24) and 13 positions (12 shown) that differ in the red fluorescent variant (mKate2) that were mutated in a DMS experiment as described in [7].Physical interactions between the fluorophore (L63) and three proximal residues ([63, 143], [63,174] and [63,197] For comparison, we used the same procedure to fit Lasso regression models of the form ε, where G −1 represents a matrix of one-hot encoded sequence features i.e. the presence or absence of a given mutation-or mutation combination (interaction)-with respect to the reference (wild-type) genotype is denoted by a '1' or '0' respectively (S1(A) Fig, Fig 2B, left; 'One-hot models', blue).The definition of G and its relationship to the biochemical (or background-relative) view of epistasis is explained in [2].Although one-hot models show similar performance to background-averaged models on simulated test data, they tend to be more complex with larger numbers of coefficients (Fig 2B ), including spurious third-order epistatic terms although their ground truth values are zero (S1(C) Fig) .In summary, these results demonstrate that our theory extensions allow the accurate inference of background-averaged epistatic coefficients in multiallelic genetic landscapes, even in situations when large amounts of phenotypic data are missing.

Application to empirical combinatorial fitness landscapes
We next applied our theory to model epistasis within two different empirical combinatorial fitness landscapes (Fig 2A , middle, right).In the first DMS assay, a budding yeast strain was used in which the single-copy arginine-CCU tRNA (tRNA-Arg(CCU)) gene is conditionally required for growth [6].A library of variants of this gene was designed to cover all 5,184 (2 6 × 3 4 ) combinations of the 14 nucleotide substitutions observed in ten positions in post-wholegenome duplication yeast species.The library was transformed into S. cerevisiae, expressed under restrictive conditions and the enrichment of each genotype in the culture was quantified by deep sequencing before and after selection.After reprocessing of the raw data, we retained high quality fitness estimates for 3,847 variants (74.2%, see Methods).
Although the findings in [6] were based on the application of background-averaged epistasis theory, the prior limitation of the Walsh-Hadamard transform to only two alleles per sequence position required the authors to adopt an ad hoc strategy that involved performing separate analyses on combinatorially complete biallelic sub-landscapes.After imputing fitness values for missing variants we decomposed phenotypes into background-averaged epistatic coefficients using Eq (9), which yielded very similar values for significant coefficients of all orders (S2(B) Fig, see Methods).However-as shown above for simulated data-our theory extensions permit the modeling of background-averaged epistasis in the context of multiallelic landscapes with missing data.We followed the same strategy as described in the previous section, using the tRNA DMS data to train Lasso regression models to predict variant fitness from sequence incorporating epistatic coefficients up to eighth order ( To evaluate whether the inferred models report on biologically relevant features of the underlying genetic landscapes, we tested whether sparse model coefficients were more likely to comprise genetic interactions (or modulators thereof) involving known physically interacting positions in the wild-type tRNA secondary structure (Fig 2A,middle).Regardless of data sparsity, background-averaged model coefficients tend to be significantly enriched for physical interactions (Fig 2D , middle).On the other hand, in the case of even moderate sub-sampling of training data (16%), one-hot model coefficients show no such enrichment (Fig 2D , middle).The results of this enrichment analysis closely mirror those obtained using simulated data (Fig 2D,left), suggesting that they are likely to generalise to other empirical genotype-phenotype landscapes.
Finally, we repeated similar analyses using a second DMS dataset comprising fluorescence measurements for all combinations of single substitution mutations at 13 positions (2 13

Discussion
We have provided an extension to the most rigorous computational framework available for describing and modeling empirical genotype-phenotype mappings.Our approach derives from the Walsh-Hadamard transform, yet we note that the new set of columns is no longer orthogonal.However, the columns remain independent, as proven by the existence of the inverse (Eq (13)).Beyond the study of background-averaged epistasis with respect to mutations in the primary sequence, this also permits the inclusion of 'epimutations' (changes in the epigenetic state of DNA), amino acid post-translational modifications or even particular environmental/experimental conditions.
In the simplest application, background-averaged epistatic coefficients (genetic interaction terms) can be directly computed from phenotypic measurements via the decomposition in Eq (9).However, estimating epistatic coefficients by regression-as in Eq (12)-is a more natural choice in the presence of missing data, when data for multiple related phenotypes is available [23] and/or in the presence of global epistasis [24,25].Our mathematical results provide three alternative methods to compute the multi-state (multiallelic) extension of the inverse Walsh-Hadamard transform A n , one of which allows the direct extraction of specific elements or submatrices.In which situations might this capability be desirable?
First, constructing full A n matrices-particularly by numerical inversion-is impractical for large genetic landscapes.Second, the result of the product H À 1 n � V À 1 n represents a matrix of sequence features when setting up the inference of epistatic (model) coefficients � ε n from phenotypic measurements � y n as a regression task [23,24,26,27].The ability to construct this feature matrix in batches (of rows) allows computational resource-efficient iteration over large datasets when using frameworks such as TensorFlow or PyTorch.
Third, there are currently no methods to comprehensively map empirical genotype-phenotype landscapes with size greater than the low millions of genotypes.Therefore, assaying landscapes of this size or larger will typically involve experimental measurement of a (random) sub-sample of genotypes, corresponding to distinct rows in A n .In other words, it is usually unnecessary to construct full A n matrices when modeling real experimental data.Finally, there is evidence of extreme sparsity in the epistatic architecture of biomolecules where only a small fraction of theoretically possible genetic interactions are non-zero [7].The feasibility of sampling very large background-averaged epistatic coefficient spaces may improve methods to infer accurate genotype-phenotype models.
Using results from the analysis of both simulated and empirical multiallelic fitness landscapes, we have shown that sparse regression models relying on a background-averaged definition of epistasis can efficiently capture salient features of the underlying biological systemnamely direct physical interactions or ground truth genetic interactions-even in situations of sparse sampling of phenotypes.This behaviour, which we speculate is due to a richer representation of the sequence feature space compared to one-hot models (i.e. higher level of constraint during model fitting; S1(A) Fig), is particularly desirable in the case of very large genetic landscapes where comprehensive phenotyping is infeasible.However, more work is needed to determine whether this result holds more generally.One difficulty in such comparisons between approaches is the requirement for a set of interactions or landscape features that are known to be critical for biomolecular function.Here we rely on Watson-Crick base pairing interactions and direct physical interactions involved in fluorophore orientation whose respective importance for RNA secondary structure and fluorescence activity are well-established.
More broadly, this work opens the door to investigations of the biological properties of background-averaged epistasis in empirical genetic landscapes of arbitrary shape and complexity.Beyond applications within the field of DMS, we believe our theory extensions have the potential to influence research in evolutionary and synthetic biology including protein engineering.In future it will be important to compare the performance and properties of models relying on this definition of epistasis to those of other recently proposed models that incorporate higher-order genetic interactions for phenotypic prediction [28,29].

Simulated multiallelic fitness landscape
The simulated multiallelic genetic landscape (Fig 2A , left) consists of all possible combinations of 4 nucleotides per position in a 6-mer i.e. 4 6 = 4, 096 possible genotypes.Consistent with observations of empirical fitness landscapes where single mutant phenotypes tend to be biased towards negative (detrimental) effects and larger in size than pairwise and higher-order epistatic terms, ground truth first-order epistatic coefficients (single nucleotide substitution effects) were drawn randomly from a normal distribution with a mean of -1 and standard deviation of 2 i.e. � = N(μ, σ 2 ) = N(−1, 4).Ground truth non-zero second-order terms (pairwise genetic interactions) were arbitrarily selected between the following pairs of positions: [1, 2], [3,5] and [5,6], where their coefficient values were randomly sampled from � = N(0, 1).All other second-and higher-order coefficients were set to zero.Fitness scores for all variants were calculated using Eq (12).We simulated measurement error by adding Gaussian noise to all fitness scores of similar magnitude to the variance of first-order epistatic coefficients i.e. s 2 � ¼ Nð0; 4Þ.Finally, we also simulated versions of the same multiallelic fitness landscape with both relatively lower and higher measurement noise where s 2 � ¼ Nð0; 1Þ and s 2 � ¼ Nð0; 9Þ, respectively (S1(B) Fig).

Empirical combinatorial fitness landscapes
Raw sequencing (FASTQ) files obtained from the tRNA-Arg(CCU) DMS experiment in [6] were re-processed with DiMSum v1.3 [30] using default parameters with minor adjustments.We obtained fitness estimates for 5,059 out of a total of 5,184 possible variants (97.6%) in the combinatorially complete genetic landscape.We restricted the data to a high quality subset by requiring fitness estimates in all six biological replicates as well as at least 10 input read counts in all input samples.This resulted in a total of 3,847 retained variants (74.2%) for downstream analysis.For comparisons to previously-reported background-averaged coefficients (Fig 2C, S2(B) Fig), we used the author-processed data and analysis scripts [6].Likewise, we used the author-processed data for imputation of missing phenotype values (replaced with the mean fitness at every mutation order) and calculation of background-averaged epistatic coefficients using the decomposition in Eq (9) (S2(B) Fig) .For the eqFP611 fluorescent protein DMS experiment, we used the author-processed fitness estimates (brightness scores) [7].

Lasso regression models
We trained Lasso regression models to predict variant fitness estimates from nucleotide or amino acid sequences using the 'scikit-learn' Python package.Training data comprised random subsets of 1, 2, 4, 8, 16, 32 and 64% of retained variants of all mutation orders.All remaining held out variants comprised the 'test' data which was unseen during model training in each case.
To train models inferring background-averaged epistatic coefficients we used feature matrices of the form H À 1 n � V À 1 n (see Eq (12)).For comparison, one-hot encoded matrices of sequence features were used.Linear regression was performed using 10-fold cross validation to determine the optimal value of the L1 regularization parameter λ in the range [0.005, 0.25] ('LassoCV' and 'RepeatedKFold' functions).[30]) from the total fitness variance.For the simulated and eqFP611 DMS datasets, the maximum explainable variance was assumed to be 100%.In comparisons between sparse model-derived background-averaged epistatic coefficients and ground truth or previously-reported coefficients (scatter plots in To test enrichment of physical interactions in Lasso model coefficients we used the following approach: for each model, all position pairs represented in non-zero epistatic coefficients of at least second order were determined.The number of position pairs corresponding to direct physical interactions was counted and an associated enrichment score (odds ratio) and P-value calculated using Fisher's Exact Test.The background set consisted of all position pairs in all possible epistatic coefficients.To test the appropriateness of the null hypothesis we also repeated enrichment analyses using random models i.e. randomly chosen sets of epistatic coefficients matching the numbers of non-zero coefficients in Lasso models and their distribution

Fig
Fig 1A-1D provides a visualization of the matrices H n and A n for different values of n and s, clearly showing a self-similar pattern in all cases due to their recursive nature.In this paper, we provide different methods to construct A n ¼ H À 1 n .First, H n can be numerically inverted using standard matrix inversion algorithms (here we use the linalg.invfunction from the SciPy library in Python), referred to as "Recursive H n inverse" in Fig1E and 1F.Alternatively, the recursive definition of the inverse given by Eq (13) can be used, which we refer to as "Recursive A n ".As can be seen in Fig1E, this method is faster than numerically inverting H n .Finally, we also provide a convenient formula for extracting specific individual elements of A n Eq(17), referred to as "All elements A n " in Fig 1Eand 1F.This method is more computationally intensive than the previously described methods, due to the formula relying on the computation of (E n ) ij , which equates to counting the number of sequence positions that are identically mutated in vectors ĩ and j, each of size n.However, in situations where subsets of elements (or sub-matrices)-rather than full matrices-are desired, Eq(17) provides a method that can be faster and more memory efficient (see "10 elements A n " in Fig1E and 1F).For example, in the case of a 10-mer DNA sequence, constructing the full inverse transform A 10 with s = 4 would require > 10 23 bytes (100 million petabytes) of memory in the best-case scenario ("Recursive H n inverse" in Fig1F, log-linear extrapolation).Similarly, the full inverse transform for a 4-mer amino acid sequence (A 4 with s = 20) would impose a memory footprint > 10 20 bytes.On the other hand, calculating the subset of elements from these matrices required for the prediction of a single phenotype using epistatic coefficients up to third order (three-way genetic interaction terms) is feasible in both situations using Eq (17) (3,675 and 29,678 elements; 2.5 GB and 192 GB of memory; 1.8 and 99 seconds, respectively).This memory footprint can easily be diminished further using data chunking, which is a unique benefit of this method.

Fig 1 .
Fig 1. Benchmarking results and heat map representations of matrices corresponding to the binary (biallelic) and multi-state (multiallelic) extension of the Walsh-Hadamard transform, and their corresponding inverses.A. H 6 Walsh-Hadamard transform (s = 2).B. H 4 multi-state extension of the Walsh-Hadamard transform for s = 3. C. A 6 Inverse Walsh-Hadamard transform.D. A 4 multi-state extension of the inverse Walsh-Hadamard transform for s = 3. E. Computational time on a MacBook Pro (13-inch, 2017, 2.3GHz dual-core Intel Core i5) for extracting elements of A n matrices of various dimensions and numbers of states (alleles) per position (s 2 [2, 10]).Comparisons are shown between numerically inverting the recursively constructed H n (using scipy.linalg.inv),i.e. "Recursive H n inverse", using the recursive formula for A n , using the formula to extract all elements of A n and extracting 10 random elements of A n (see legend).The mean across 10 replicates is depicted.Linear regression lines were fit to data from matrices with at least 100 elements.F. Similar to E but indicating memory usage.Linear regression lines were fit to data from matrices with at least 10 elements.

Fig 2 .
Fig 2. Learning sparse models from simulated and empirical combinatorial fitness landscapes. A. Schematic description of combinatorial DMS datasets.Left: simulated multiallelic landscape consisting of all combinations of 4 nucleotides in a DNA 6-mer where ground truth interactions ([1,2],[3,5] and[5,6]) are depicted with red arcs (see Methods).Middle: secondary structure of S. cerevisiae tRNA-Arg(CCU) indicating variable positions (closed circles) combinatorially mutated in a DMS experiment as described in[6].Three Watson-Crick base pairing (WCBP) interactions involving pairs of these positions ([1, 71],[2, 70]  and[6, 66]) are indicated.Right: crystal structure of a blue fluorescent variant (TagBFP) of the Entacmaea quadricolor protein eqFP611 (PDB: 3M24) and 13 positions (12 shown) that differ in the red fluorescent variant (mKate2) that were mutated in a DMS experiment as described in[7].Physical interactions between the fluorophore (L63) and three proximal residues([63, 143],[63, 174]  and[63, 197]) are indicated.B. Performance of sparse models fit using different proportions of the DMS datasets indicated in panel A. The median number of model coefficients is indicated.Colour scale as in panel D. C. Scatter plots comparing all non-zero model-inferred background-averaged epistatic coefficients (training set size = 64%) to ground truth values (left panel) or previously-reported coefficients (remaining panels)[6,7].Pearson's r is shown.Four outlier 4th order coefficients in the tRNA dataset were omitted.D.
Fig 2. Learning sparse models from simulated and empirical combinatorial fitness landscapes. A. Schematic description of combinatorial DMS datasets.Left: simulated multiallelic landscape consisting of all combinations of 4 nucleotides in a DNA 6-mer where ground truth interactions ([1,2],[3,5] and[5,6]) are depicted with red arcs (see Methods).Middle: secondary structure of S. cerevisiae tRNA-Arg(CCU) indicating variable positions (closed circles) combinatorially mutated in a DMS experiment as described in[6].Three Watson-Crick base pairing (WCBP) interactions involving pairs of these positions ([1, 71],[2, 70]  and[6, 66]) are indicated.Right: crystal structure of a blue fluorescent variant (TagBFP) of the Entacmaea quadricolor protein eqFP611 (PDB: 3M24) and 13 positions (12 shown) that differ in the red fluorescent variant (mKate2) that were mutated in a DMS experiment as described in[7].Physical interactions between the fluorophore (L63) and three proximal residues([63, 143],[63, 174]  and[63, 197]) are indicated.B. Performance of sparse models fit using different proportions of the DMS datasets indicated in panel A. The median number of model coefficients is indicated.Colour scale as in panel D. C. Scatter plots comparing all non-zero model-inferred background-averaged epistatic coefficients (training set size = 64%) to ground truth values (left panel) or previously-reported coefficients (remaining panels)[6,7].Pearson's r is shown.Four outlier 4th order coefficients in the tRNA dataset were omitted.D.
= 8, 192 genotypes) separating two variants of the Entacmaea quadricolor protein eqFP611 (Fig 2, right column; see Methods) [7].Lasso models capture the majority of fitness variance even with high levels of data sparsity (Fig 2B, right) and model-inferred background-averaged coefficients correlate well with those obtained using the decomposition in Eq (9) (Pearson's r = 0.83, Fig 2C, right).We also find that direct pairwise physical interactions involving the fluorophore (L63, Fig 2A, right) are enriched in model-derived background-averaged epistatic terms (Fig 2D, right), a result that is robust to missing data and recapitulates our observations obtained using ground truth genetic interactions in simulated DMS data.
Final models were fit to all training data.In order to estimate model-related statistics and performance results we fit 100 models to different random subsets of the training data for each model type and training data fraction.In Fig 2, S1 (B), S1(C) and S2(A) Figs we plot the median of the indicated measures over all models, where error bars indicate the interquartile range.For model performance estimates in Fig 2B, the maximum explainable variance was calculated by substracting the total estimated technical variance (as reported by DiMSum fitness errors Fig 2C and S2(A) Fig) we required model-derived terms to be non-zero i.e. those whose interquartile range over all models did not include zero.Four outlier 4th order coefficients in the tRNA dataset were omitted from Fig 2C.
over different epistatic orders.The results of this analysis performed using the simulated dataset are shown in S1(B) Fig, left.Finally, the middle and right panels in S1(B) Fig assess the impact of measurement noise on the results of this enrichment analysis, revealing that noise level is anti-correlated with enrichment score and significance thereof, as expected.