Revealing evolutionary constraints on proteins through sequence analysis

Statistical analysis of alignments of large numbers of protein sequences has revealed “sectors” of collectively coevolving amino acids in several protein families. Here, we show that selection acting on any functional property of a protein, represented by an additive trait, can give rise to such a sector. As an illustration of a selected trait, we consider the elastic energy of an important conformational change within an elastic network model, and we show that selection acting on this energy leads to correlations among residues. For this concrete example and more generally, we demonstrate that the main signature of functional sectors lies in the small-eigenvalue modes of the covariance matrix of the selected sequences. However, secondary signatures of these functional sectors also exist in the extensively-studied large-eigenvalue modes. Our simple, general model leads us to propose a principled method to identify functional sectors, along with the magnitudes of mutational effects, from sequence data. We further demonstrate the robustness of these functional sectors to various forms of selection, and the robustness of our approach to the identification of multiple selected traits.

Introduction.-Proteins play crucial roles in all cellular processes, acting as enzymes, motors, receptors, regulators, and more. The function of a protein is encoded in its amino-acid sequence, and the recent explosion of available sequences has inspired data-driven approaches to discover the principles of protein operation. At the root of these new approaches is the observation that amino-acid residues which possess related functional roles often evolve in a correlated way. In particular, analyses of large alignments of protein sequences have identified "sectors" of collectively correlated amino acids [1-3, 18, 20], which has enabled successful design of new functional sequences [3]. Sectors are spatially contiguous in the protein structure, and in the case of multiple sectors, each one may be associated with a distinct role [13,18]. What is the physical origin of these sectors, and can we identify them from sequence data in a principled way?
To address these questions, we developed a general physical model that gives rise to sectors. Specifically, we start from an elastic-network model that quantifies the energetic cost of protein deformations [1], and show that selection acting on such elastic properties yields collective correlation modes in sequence data. Next, we generalize our model to any selected protein property to which mutations contribute additively. We show that the main signature of this selection process lies in the small-eigenvalue modes of the covariance matrix of the selected sequences. Finally, we demonstrate a principled method to identify sectors, and to quantify mutational effects, from sequence data alone.
Elastic-network models have elucidated various protein properties [1,2,13,14], including the emergence of allostery [15][16][17]. Thus motivated, we begin by building an elastic-network model [1,2] for a PDZ protein domain [ Fig. 1(a,b)] [18,19], and computing the relationship between its "sequence" and the energetic cost of a functionally-relevant conformational change. Within our elastic model [20], the energetic cost of a small deformation from the equilibrium structure is where r i is the position of the ith carbon atom, r 0 i is its equilibrium position, and the Hessian matrix M contains the second derivatives of the elastic energy with respect to atomic coordinates. Here, we take δr to be the conformational change from a ligand-free state (1BFE) to a ligand-bound state (1BE9) of the same PDZ domain [ Fig. 1(a)]. This conformational change is central to PDZ function, so its energetic cost has presumably been under selection during evolution. Any other coherent conformational change would also be suitable for our analysis. To mimic simply the effect of mutation and selection, we introduce "mutations" of residues that weaken the spring constants involving their beta carbons by a small fraction . We represent mutations using a sequence S with S l ∈ {0, 1}, where l is the residue index: S l = 0 denotes the reference state, while S l = 1 implies a mutation [ Fig. 1(c)]. The sequence S and the spring network fully determine the Hessian matrix M , and thus the energy cost E of a conformational change [Eq. (1)].
Additivity of mutational effects.-Focusing on mutations that weakly perturb the elastic properties of a protein, we perform first-order perturbation analysis: M = M (0) + M (1) + o( ). Using Eq. (1) yields E = E (0) + E (1) +o( ), with E (1) = δr T M (1) δr/2. Both M (1) and E (1) can be expressed as sums of contributions from individual mutations. We define ∆ l as the firstorder energy cost E (1) of a single mutation at site l of the sequence. To leading order, the effect of mutations  [21], (gray: ligand free, 1BFE; blue: ligand bound, 1BE9 (ligand not shown)). (b) Elastic network model for 1BFE, where each amino-acid residue is represented by its alpha carbon (Cα, black node) and beta carbon (Cβ, purple node). Nearby nodes interact through a harmonic spring [2,20]. (c) Relation between protein sequence S and elastic network: 0 denotes the reference state, while 1 denotes a mutated residue, which weakens interactions of the corresponding Cβ with all its neighbors by . (d) Histogram of the energy δE required to deform the domain from its ligand-free to its ligand-bound conformation, for randomly sampled sequences where 0 and 1 are equally likely at each site. Sequences are selectively weighted using a narrow Gaussian window (orange) around δE * . (e) Eigenvalues of the covariance matrix C for the selectively weighted protein sequences. (f) Upper panel: last principal component ν (L) l of C (red) and average mutant fraction S l * (green) at site l after selection; lower panel: effect ∆ l of a single mutation at site l on δE. (g) Schematic representation of the selected ensemble in sequence space, where each dot is a highly-weighted sequence; thus dots are restricted to a narrow region around a plane perpendicular to ∆. (h) Recovery of ∆ for all principal components ν (j) , with maximum Recovery=1 [Eq. (5)]. Gray dashed line: random expectation of Recovery [20]. on the energy cost of a deformation reads This equation, which relates sequence S to function δE, is quite general: 1) it constitutes the leading-order contribution of mutational effects, not limited to our elasticnetwork model; 2) system-specific details are encoded by the single-site mutational effects ∆ l , which in principle can be measured experimentally; 3) "energy" can refer to any physical property of the protein, say the binding affinity, catalytic activity or thermal stability [22], not just the energy cost of a conformational change.
The general sequence-function relation in Eq. (2) suggests a physical definition of a protein sector as the set of sites with dominant mutational effects on some selected property of the protein. For example, the vector ∆ of mutational effects on the energy cost of the change from the ligand-unbound to the ligand-bound conformation in our elastic-network model of PDZ is shown in Fig. 1(f). The magnitudes of mutational effects are strongly heterogeneous (see also Fig. S1 [20]). Here, the amino acids with largest effects are those that move significantly upon ligand binding.
How is such a functionally-defined sector reflected in the statistical properties of the sequences that survive evolution? To answer this question, we introduce a selection procedure and analyze the selected sequences.
Selection model and PCA of selected sequences.-For our elastic model of the PDZ domain, the distribution of δEs for random sequences is shown in Fig. 1(d).
As an example of a physical selection process, we assume that sequences with a narrower distribution of δEs are selected by evolution, corresponding e.g. to a preferred ligand-binding affinity. Specifically, we weight each sequence S according to the distribution P ( S) = exp(w( S))/ S exp(w( S)), where w( S) is the fitness of sequence S: Here, the selection strength κ sets the width of the selection window, and δE * is its center. For all selections, we take κ = 10/( l ∆ 2 l ), so that the width of the selection window scales with that of the unselected distribution. Importantly, although mutations have additive effects on δE, selection gives rise to correlations among sites. For instance, if δE * = 0 and if ∆ l < 0 for all l, as in Fig. 1, a mutation at a site with large |∆ l | will decrease the likelihood of additional mutations at all other sites with large |∆ l |.
Previous approaches to identifying sectors from real protein sequences have relied on modified forms of Principal Component Analysis (PCA). So we begin by asking: can PCA identify sectors in our physical model? PCA corresponds to diagonalizing the covariance matrix C of sequences: it identifies the principal components (eigenvectors) ν (j) associated with progressively smaller variances (eigenvalues) λ (j) . We introduce · * to denote ensemble averages over the selectively weighted sequences, reserving · for averages over the unselected ensemble. The mutant fraction at site l in the selected ensemble is S l * = S S l P ( S), and the covariance matrix C reads To test the ability of PCA to identify a physical sector, we employed the selection window shown in orange in Fig. 1(d). The resulting eigenvalues are shown in Fig. 1(e). One sees outliers. In particular, why is the last eigenvalue so low? Due to the narrow selection window, the highly-weighted sequences satisfy l S l ∆ l = S · ∆ ≈ δE * . This means that in the L-dimensional sequence space, the data points for the highly-weighted sequences lie in a narrow region around a plane perpendicular to ∆ [ Fig. 1(g)]. Hence, the data has exceptionally small variance in this direction, leading to a particularly small eigenvalue of C. Moreover, the corresponding last principal component ν (L) points in the direction with the smallest variance and is consequently parallel to ∆ [ Fig. 1(f)]. Formally, in Eq. (3), ∆ plays the part of a repulsive pattern in a generalized Hopfield model [6,7], penalizing sequences aligned with ∆, or in our case those that fail to have the selected projection onto ∆.
In this example, the last principal component accurately recovers the physical sector corresponding to the largest elements of the mutational-effect vector ∆. More generally, to quantify the recovery of ∆ by a given vector ν, we introduce which is nonnegative, has a random expectation of ( 2/πL) l |∆ l |/ l ∆ 2 l for L 1 [20], and saturates at 1 (including the case of parallel vectors). For our test case, Fig. 1(h) shows Recovery for all principal components. The last one features the highest Recovery, almost 1, confirming that it carries substantial information about ∆. The second-to-last principal component and the first two ones also provide a Recovery substantially above the random expectation.
In our model, ∆ is fundamentally a direction of small variance. So why do the first principal components also carry information about ∆? Qualitatively, when variance is decreased in one direction due to a repulsive pattern ∆, variance tends to increase in orthogonal directions involving the same sites. To illustrate this effect, let L = 3 and ∆ = (−1, 1, 0), and consider the sequences S satisfying ∆ · S = 0 [namely (0, 0, 0); (1, 1, 0); (0, 0, 1); (1, 1, 1)]. The last principal component is ∆, with zero variance, and the first principal component is (1, 1, 0): Recovery is 1 for both of them. This selection conserves the trace of the covariance matrix (i.e. the total variance), decreasing variance along ∆, and therefore increasing it along (1, 1, 0). This toy model provides an intuitive understanding of why the large-eigenvalue modes of the covariance matrix also carry information about ∆.
ICOD method.-An important concern is whether the last principal component is robust to small and/or noisy datasets. Indeed, other directions of small variance can appear in the data. As a second example, we applied a different selection window, centered in the tail of the distribution of δEs from our elastic model of PDZ [ Fig. 2(a), inset]. This biased selection generates strong conservation, S l * ≈ 1, for some sites with significant mutational effects. Extreme conservation at one site now dictates the last principal component, and disrupts PCA-based recovery of ∆ [ Fig. 2(a,b)].  To overcome this difficulty, we developed a more robust approach that relies on inverting the covariance matrix. Previously, the inverse covariance matrix was successfully employed in Direct Coupling Analysis (DCA) to identify strongly coupled residues that are in contact within a folded protein [3,4,15]. The fitness in our model [Eq. (3)] involves one and two-body interaction terms, and constitutes a particular case of the DCA Hamiltonian [20]. A small-coupling approximation [3-5, 17, 20] gives where P l denotes the probability that site l is mutated.
Since we are interested in extracting ∆, we can simply set to zero the diagonal elements of C −1 , which are dominated by conservation effects, to obtain a new matrix The first eigenvector ofC −1 (associated with its largest eigenvalue) should accurately report ∆ since, except for its zero diagonal,C −1 is proportional to the outer product ∆ ⊗ ∆. We call this approach the Inverse Covariance Off-Diagonal (ICOD) method. As shown in Fig. 2(d-e), ICOD overcomes the difficulty experienced by PCA for biased selection, while performing equally well as PCA for unbiased selection (Fig. S2 [20]). Removing the diagonal elements of C −1 before diagonalizing is crucial: otherwise, the first eigenvector of C −1 is the same as the last eigenvector of C and suffers from the same shortcomings for strong conservation. Here too, both ends of the spectrum contain information about ∆ [Figs. 2(b,d)].
Multiple selection.-An important challenge in sector analysis is distinguishing multiple, independently evolving sectors [13,18,19]. We can readily generalize our fitness function [Eq. (3)] to allow for multiple selections: where κ i is the strength of the i-th selection, ∆ i is the corresponding vector of mutational effects, and δE * i is the selection bias. For example, ∆ 1 might measure how mutations change a protein's binding affinity, while ∆ 2 might be related to its thermal stability, etc.
Performance.-We further tested the performance of ICOD by systematically varying the selection bias (Fig. 3). ICOD achieves high Recovery of mutationaleffect vectors for both single and double selection over a broad range of selection biases δE * , albeit performance Average recovery of mutational-effect vectors ∆ as a function of relative selection bias γ ≡ (δE * − δE )/ (δE − δE ) 2 . (a) Single selection. Different ∆s are used to generate sequence ensembles: the elastic-network ∆ from Fig. 1 (red); synthetic ∆s [20] with number of sites of large mutational effect (sector sites) ranging from 1 to 100, for sequences of length L = 100 (blue). Recovery is shown for ICOD (solid curves) and for SCA [13,18] (dashed curves). (b) Double selection. Different pairs of synthetic ∆s [20] are used to generate sequence ensembles (with L = 100): "0%" indicates two non-overlapping sectors, each with 20 sites; "100%" indicates two fully overlapping sectors, each with 100 sites; "Aver." indicates average Recovery over 100 cases of double selection, where the single-sector size increases from 1 to 100, and the overlap correspondingly increases from 0 to 100. ICA was applied to improve Recovery [20]. falls off in the limit of extreme bias. This confirms that our physical sectors are encoded in the bottom eigenvectors of the covariance matrix.
How does ICOD compare with other approaches to identifying sectors? We compared the performance of ICOD with Statistical Coupling Analysis (SCA), the standard PCA-based method [13,18]. In SCA, the covariance matrix C is reweighted by a site-specific conservation factor φ l , the absolute value is taken,C (SCA) ll = |φ l C ll φ l |, and sectors are identified from the leading eigenvectors ofC (SCA) . We therefore tested the ability of the first eigenvector ofC (SCA) to recover ∆ for a single selection. We found that the square root of the elements of the first SCA eigenvector can provide high Recovery of ∆ (Figs. 3, S10, S11) [20]. However, the performance of SCA relies on conservation through φ l [19]; consequently, for unbiased selection, SCA breaks down [ Fig. 3(a), dashed curves] and cannot identify sector sites (Fig. S14 [20]). ICOD does not suffer from such shortcomings, and performs well over a large range of selection. Note that in SCA, only the top eigenvectors ofC (SCA) convey information about sectors (Figs. S10, S12).
We also compared ICOD with another PCA-based approach [6], which employs an inference method specific to the generalized Hopfield model, and should thus be well adapted to identifying sectors within our physical model [Eq. (3)]. Overall, this specialized approach performs similarly to ICOD, being slightly better for very localized sectors, but less robust than ICOD for strong selective biases and small datasets [20]. Exactly as for PCA and ICOD, within this method, the top Recovery is obtained for the bottom eigenvector of the (modified) covariance matrix, consistent with ∆ in our model being a repulsive pattern [6], but large Recoveries are also obtained for the top eigenvectors (Fig. S15).
So far we considered binary sequences, with only one type of mutation with respect to the reference state. However, we have also shown that ICOD extends to mutations among the 20 different amino-acid types, plus gap state, relevant to real proteins [20]. On a multiple sequence alignment of PDZ domains, ICOD correctly predicts the majority of experimentallydetermined residues [20].
Conclusion.-We have demonstrated how sectors of collectively correlated amino acids can arise from evolutionary constraints on the physical properties of proteins. Our physical sectors are directly associated with the small-eigenvalue modes of the covariance matrix. These modes also contain information about structural contacts [7], which helps explain previously observed correlations between sectors and contacts [33]. Previ-ous works [7,33] found that sectors could be recovered from the large-eigenvalue modes of the covariance matrix, and some signatures of our physical sectors are indeed obtained in these modes. However, the fundamental link we propose between physical sectors and smalleigenvalue modes is important, since large-eigenvalue modes of the covariance matrix also reflect information about subfamily-specific residues [34] and phylogeny [35]. To build the elastic-network model of the PDZ domain, we replace each of the L = 76 amino-acid residues by its corresponding alpha carbon Cα and beta carbon Cβ, as shown in Fig. 1(b). Every pair of carbons within a cutoff distance d c are then connected with a harmonic spring [1]. Following an analysis of the same PDZ domain by De Los Rios et al. [2], we set d c = 7.5Å and assign spring constants as follows: a) 2 for Cα-Cα pairs if adjacent along the backbone, 1 otherwise; b) 1 for Cα-Cβ pairs; c) 0.5 for Cβ-Cβ pairs. For the following sequence analysis, we consider that mutations decrease the stiffnesses of the springs involving mutated Cβ atoms by a fraction = 0.2.

II. RECOVERY BY A RANDOM VECTOR
Here, we calculate the random expectation of the Recovery of the mutational-effect vector ∆ by a generic other vector ν, in order to establish a null model to which to compare. For a binary sequence, Recovery, as defined in Eq. (5), can be expressed as with ∆ l = |∆ l |/ l ∆ 2 l and ν l = |ν l |/ l ν 2 l . As before, L denotes the length of the sequence and hence the number of components of ∆ and ν. As ν is a normalized L-dimensional vector, its components can be expressed in L-dimensional spherical coordinates using L − 1 angles θ i : where θ i ∈ [0, π/2] for all i ∈ {1, · · · , L}, because all components of ν are nonnegative. Note that we employ the usual convention that empty products are equal to one: Eq. (S2) yields ν 1 = cos θ 1 . The average Recovery for a random vector ν with an orientation uniformly distributed in the L-dimensional sphere reads: where the angular element is dΩ = and similarly, Eq. (S3) yields Using the following results valid for n > −1: where Γ denotes the Euler Gamma function, which satisfies Γ(x + 1) = x Γ(x) for all x, we obtain for 1 ≤ l ≤ L: which is independent of l. Besides, Combining Eq. (S4) with Eqs. (S8) and (S9) finally yields (S10) In particular, in the relevant regime L 1, an asymptotic expansion of Γ yields: The maximum expectation of Recovery is obtained when all components of ∆, i.e. all mutational effects, are identical: Conversely, the average Recovery becomes minimal when only one component of ∆ is nonzero, which constitutes the limit of the case where the mutational effect at one site is dominant: which approaches zero in the limit L → ∞.

III. INVERSE COVARIANCE MATRIX OF OUR SEQUENCE ENSEMBLES
Here, we present a derivation of the small-coupling approximation of the inverse covariance matrix for our artificiallygenerated sequence ensembles. In this small-coupling limit, the inverse covariance matrix provides an estimate of the energetic couplings used to generate the data. More generally, deducing energetic parameters from observed statistics is a well-known inference problem, also known as an inverse problem. Two-body energetic couplings can be inferred from the one and two-body frequencies observed in the data, using a standard maximum entropy approach. However, the exact calculation of the energetic terms is difficult, and various approximations have been developed. Following Refs [3,4], we use the mean-field or small-coupling approximation, which was introduced in Ref. [5] for the Ising spin-glass model. For the sake of completeness, we now review the main steps of the calculation, which follow Ref. [4]. Note that we do not use inference methods specific to low-rank coupling matrices [6,7] because we wish to retain generality, with the application to real sequence data in mind.
We begin with the case of binary sequences, which is discussed in the main text. Following that, we generalize to cases where more than two states are allowed at each site, such as the 21 possible states for real protein sequence (20 amino acids plus gap).

A. Binary sequences
We begin by deriving Eq. (6) from the main text, which provides an approximation for the inverse covariance matrix of the ensembles of our binary artificial sequences. Each sequence S is such that S l ∈ {0, 1} for each site l with 1 ≤ l ≤ L, where L is the length of the sequence. (S14) We introduce s l = 2S l − 1: it is an "Ising spin" variable (S l = 0 ⇔ s l = −1 and S l = 1 ⇔ s l = 1). The fitness in Eq. (S14) can be rewritten as with D l = ∆ l /2 and α = δE * − l D l . Expanding yields where we have used the fact that s 2 l = 1. The second term and the last term in Eq. (S16) are both constants, and therefore our fitness is equivalent to This fitness has the form of a standard Ising Hamiltonian with inter-spin couplings and local fields, albeit with the convention difference in overall sign between fitness and energy.

First-order small-coupling expansion
We next consider the general Ising Hamiltonian with inter-spin couplings and local fields where is a constant to be employed in a small-coupling expansion. With this Hamiltonian, taking thermal energy k B T = 1, the equilibrium probability of finding a particular sequence s is where Z = s e −H( s) .
where, following the Ising terminology, m i denotes the average magnetization at site i, while C denotes the covariance matrix in the Ising convention. Note that, using the identity m i = 2P i − 1, where P i denotes the probability that s i = 1, we obtain where P ij is the probability that s i = s j = 1, and C denotes the covariance matrix in the Potts convention, which is used in the main text because it allows straightforward generalization to the case where more than two states are possible at each site. Performing a Legendre transform, we introduce We now perform a small-coupling expansion and express G to first order in (see Eq. (S18)): G( ) ≈ G(0) + G (0). Since sites are independent for = 0, it is straightforward to express G(0) and G (0) as a function of the one-body expectations, represented by m i , and of the couplings. We obtain and Using these expressions, and taking = 1 in the expansion, we obtain the following approximation for G: Using Eqs. (S22) and (S23), we obtain the elements of the inverse covariance matrix from Eq. (S26): where P l denotes the probability that s l = 1. Note that Eq. (S26) is a first-order small-coupling (or mean-field) approximation. The expansion can be extended to higher order, and the second-order expansion is known as the Thouless, Anderson, and Palmer (TAP) free energy [5,8].

Application to our sector model
Comparing Eqs. (S17) and (S18) (with = 1) allows us to identify the couplings in our sector model as Note that this expression is in the Ising gauge (also known as the zero-sum gauge). Recall also that the link to the Potts convention is made through C = 4 C [Eq. (S21)], which implies C −1 = C −1 /4. Finally, recall that fitness and energy have opposite signs.
Hence, in the Potts convention, Eq. (S27) yields for our sector model: This corresponds to Eq. (6) in the main text.
B. Sequences with q possible states at each site

From a sector model to a Potts model for sequences
Motivated by the fact that a real protein sequence has 21 possible states at each site (20 different amino acids plus gap), we now generalize the above result to the case where q states are possible at each site. We denote these states by α with α ∈ {1, .., q}. Our sector model can then be mapped to a q-state Potts model. The length-L vector ∆ of single-site mutational effects introduced in the two-state case in the main text is replaced by a (q − 1) × L matrix of mutational effects, each being denoted by ∆ l (α l ). These mutational effects can be measured with respect to a reference sequence α 0 satisfying ∆ l (α 0 l ) = 0, ∀l ∈ {1, . . . , L}: at each site l, the state present in the reference sequence α 0 serves as the reference with respect to which the mutational effects at that site are measured. For the sake of simplicity, we will take state q as reference state at all sites. This does not lead to any loss of generality, since it is possible to reorder the states for each l.
The generalization of the fitness function Eq. (3) [Eq. (S14)] to our q-state model can be written as Expanding this expression, discarding a constant term, and using the fact that there can only be one state at each site, we find that the fitness of sequences can be expressed as This is a particular case of the more general Potts Hamiltonian which is the one usually considered in Direct Coupling Analysis (DCA) [3,4]. In order to identify Eq. (S31) and Eq. (S32), one must deal with the degeneracies present in Eq. (S32), where the number of independent parameters is L(q − 1) + L(L − 1)(q − 1) 2 /2 [9]. To lift this degeneracy, we choose the gauge usually taken in mean-field DCA [4]: e lk (α l , q) = e lk (q, α k ) = h l (q) = 0 for all l, k, α l , α k . This choice is consistent with taking state q as the reference state for mutational effects (see above), and we will refer to it as the reference-sequence gauge. This gauge choice enables us to identify the couplings between Eq. (S31) and Eq. (S32): for all l = k, and all α l , α k , with ∆ l (q) = 0 for all l (recalling that fitness and energy have opposite signs).

First-order small-coupling expansion
The derivation of the first-order mean-field or small-coupling approximation for q-state models is very similar to the Ising case presented above. Hence, we will simply review the main results (see Ref. [4]).
We start with the Hamiltonian where has been introduced to perform the small-coupling expansion. Eq. (S34) coincides with Eq. (S32) for = 1.
is the Potts Hamiltonian in Eq. (S34), we have for all k and all α k < q: where P k (α k ) is the one-body probability. Similarly, we have for all k, l and all α k < q and α l < q: where we have introduced the covariance C kl (α k , α l ) = P kl (α k , α l ) − P k (α k )P l (α l ).
We next perform a first-order expansion of G in , and take = 1, yielding: Applying Eqs. (S37), (S38) to Eq. (S39), and using P l (q) = 1 − α l <q P l (α l ) gives This result is the standard one found in DCA [4].

Multiple selections
So far, we have mainly discussed the case where there is only one selection (one sector). However, real proteins face various selection pressures. The generalization of the fitness in Eq. (S30) to N simultaneous different selections reads which corresponds to Eq. (8) in the main text. We choose the reference-state gauge, assuming again for simplicity that the reference state is q at each site. The identification to the general Potts Hamiltonian Eq. (S32) (recalling that fitnesses and energies have opposite signs) then yields which generalizes Eq. (S33) to the multiple selection case. Using the small-coupling expansion result in Eq. (S40), we obtain the following approximation for the inverse covariance matrix: This generalizes Eq. (S41) to the multiple selection case.

IV. ROBUSTNESS OF ICOD
In the main text, we introduced the Inverse Covariance Off-Diagonal (ICOD) method to identify protein sectors from sequence data. The ICOD method exploits the approximate expression derived above for the inverse covariance matrix [Eq. (S41)]; in particular, ICOD makes use of the fact that the off-diagonal elements of C −1 are simply related to the elements of the mutational effect vector ∆. In this section, we first describe our comparison of ICOD to SCA for single selection, and detail our test of ICOD for double selection, using synthetic binary sequences. Next, we show how the ICOD method can be extended to sequences with more than two states per site, and demonstrate its robustness to gauge choice and pseudocounts.

A. Robustness of ICOD to selection bias and multiple selections
To quantify the performance of ICOD and to compare to SCA over a range of selection biases we focused on binary sequences. To obtain the average curve for single selections in Fig. 3(a), we first generated 100 distinct synthetic ∆s, one for each sector size from n = 1 to 100, where sector sites are defined as those with large mutational effects. To this end, the mutational effects of the sector sites and the non-sector sites were sampled, respectively, from zero-mean Gaussian distributions with standard deviations 20 and 1. For each sector size and each selection bias we generated a sequence ensemble of 50,000 random sequences and weighted each sequence according to the distribution where w( S) is the fitness of sequence S, given by the single selection formula Eq. (3). In general, we wish to employ a selection window whose width in energy (or any other selected variable) scales with the overall width of the unselected distribution. Hence, as mentioned in the main text, we perform all selections with a strength κ = 10 Then, for each method (ICOD or SCA), performance as measured by Recovery of ∆ by the first eigenvector was averaged over the 100 different sector sizes. Similarly, to obtain the average curve for double selection in Fig. 3(b), we generated 100 distinct pairs of ∆ 1 s and ∆ 2 s, one pair for each sector size from n = 1 to 100. Specifically, the sector for ∆ 1 consisted of the first n sites, while the sector for ∆ 2 corresponded to the last n sites, so that the two sectors overlap for n > 50. As for the single selections, the mutational effects of the sector sites and the non-sector sites were sampled, respectively, from Gaussian distributions with standard deviations 20 and 1. As an example, two synthetic ∆s for n = 20 are shown in Fig. S3. Again, for each sector size and each selection bias, we generated an ensemble of 50,000 random sequences and weighted them according to Eq. (S45) along with the double selection formula Eq. (S42) (i.e. Eq. (8) in the main text). The performance of ICOD as measured by Recovery of ∆ 1 and ∆ 2 by the first two eigenvectors was averaged over the 100 different sector sizes. In Fig. 3(b) we also reported the performance of ICOD for two non-overlapping sectors, each with 20 sites, and for two fully overlapping sectors, each with 100 sites. We followed a protocol similar to that described above, but in each of these cases, we averaged Recovery over 100 realizations using distinct pairs of ∆ 1 and ∆ 2 . Unless otherwise stated, data for other plots were generated in the same way, i.e. using 50,000 random sequences, sequence length L = 100, selection strength κ in Eq. (S46), and standard deviation 20/1 of ∆ l in the sector/non-sector sites.
Note that to improve Recovery in the case of double selection, we applied Independent Component Analysis (ICA) [11][12][13] to the first two eigenvectors in order to disentangle the contributions coming from the two constraints. In general, we expect that the first N eigenvectors of the ICOD matrixC −1 will report N constraints. However, each of these N eigenvectors is likely to include a mixture of contributions from different constraints. Applying ICA to the first N eigenvectors to recover the individual constraints amounts to assuming that all the constraints are statistically independent. As an example, in Fig. S4, we consider the case of two selections targeting a different set of sites and with different selection windows (one biased, one non-biased). In this case, ICOD plus ICA yields excellent Recovery (Fig. S4). Without ICA, the results are noticeably worse (Fig. S5). Moreover, Fig. 3(b) demonstrates that ICOD plus ICA can achieve a high Recovery for a broad range of overlaps between two sectors. In Fig. 3(b), one observes a slight decrease of performance of ICOD plus ICA for double selection with overlapping sectors. Does this arise from increasing sector size or from increasing overlap? As expected from Eqs. (5) and (7), Fig. S6(a) shows that Recovery does not fall off with increased sector size. Thus, we tested whether larger sector overlaps could reduce Recovery. Fig. S6(b) shows that this is indeed the case for sequence ensembles subject to two selections each with a fixed sector size of 20, but with different numbers of overlapping sites. However, the reduction of Recovery is quite modest, as even for 100% overlap, Recovery remains above 0.9. It is interesting to note that, independent of sector size and overlap, Recovery decreases faster for double selection than for single selection at large relative biases (see Figs. 3 and S6). Recovery is shown as a function of relative selection bias γ ≡ (δE * − δE )/ (δE − δE ) 2 for sectors of size 1, 10, 20, 40, 60, 80, and 100 out of 100 sequence sites (cf. Fig. 3(a)). Recovery is almost perfect for sectors of size larger than 10, but is substantially lower for sector size 1, which violates the criteria ∆ l l ∆ 2 l . (b) Double selection with different degrees of sector overlap. For each selection, the sector size is 20 out of 100 sequence sites, and the overlap varies from 0 to 20 sites. The average Recovery for ∆1 and ∆2 is shown as a function of relative selection bias. The data in (b) is averaged over 20 realizations of ∆s.

B. Multiple states per site and alternative gauge choice
In Section III B above, we described how to generalize from binary sequences to sequences with q possible states at each site. Correspondingly, we now generalize the ICOD method to higher values of q. Since we are interested in extracting the single-site mutational effects ∆ l (α l ) with respect to a reference state at each site, we can simply set to zero the diagonal blocks of C −1 in Eq. (S44), yielding the modified inverse covariance matrix for the case of multiple selections, or more simply for a single selectioñ This equation generalizes Eq. (7) obtained for q = 2 in the main text. As in that case, the first eigenvector ofC −1 (associated with the largest eigenvalue) should accurately report the single-site mutational effects ∆ k (α k ). Indeed, Fig. S7 shows that this generalized version of ICOD performs very well on synthetic data generated for the case q = 21 relevant to real protein sequences. Note that in the reference-sequence gauge, Recovery generalizes naturally to the q-state model as where the sums over states α l do not include the reference state at each site. While the reference-sequence gauge is convenient and allows a clear interpretation of the mutational effects, other gauge choices are possible. For instance, in the DCA literature, the zero-sum (or Ising) gauge is often employed [10,14]. In this gauge, the couplings satisfy Qualitatively, the gauge degree of freedom means that contributions to the Hamiltonian in Eq. (S32) can be shifted between the fields and the couplings [15]. In DCA, the focus is on identifying the dominant two-body interactions, so one does not want the couplings to include contributions that can be accounted for by the one-body fields [9]. The zero-sum gauge satisfies this condition because it minimizes the Frobenius norms of the couplings Hence, the zero-sum gauge attributes the smallest possible fraction of the energy in Eq. (S32) to the couplings, and the largest possible fraction to the fields [10,15]. In order to transform to the zero-sum gauge defined in Eq. (S50), each coupling e ij (α, β) is replaced bỹ where . ζ denotes an average over ζ ∈ {1, ..., q} [10]. Shifting from the reference-sequence gauge where one state (in our derivations, state q) is taken as a reference at each site to the zero-sum gauge requires the replacement The new reference-state-free mutational effects satisfy q β=1∆ l (β) = 0, and the associated couplingsẽ lk (α l , α k ) = −κ∆ l (α l )∆ k (α k ) (see Eq. (S33)) are related to the initial ones e lk (α l , α k ) through Eq. (S52).
Importantly, these reference-state-free mutational effects can be used to assess the overall importance of mutations at each particular site in the sequence. To this end, let us introduce the Frobenius norm of the reference-state-free mutational effects: (S54) This quantity, which we refer to as the "site significance", measures the overall importance of mutational effects at site l. In order to assess site significances from an ensemble of sequences, without investigating the impact of each particular mutation at each site, one can apply the zero-sum gauge to the ICOD-modified inverse covariance matrix (see Eq. (S48)), and compute the Frobenius norm of each 20 × 20 block associated to each pair of sites (i, j) according to Eq. (S51). The first eigenvector of this compressed L × L matrix accurately reports the mutational significance of each site, as illustrated in Fig

C. Pseudocounts
As pseudocounts are often necessary to regularize real sequence data, and as a high fraction of pseudocounts is generally used in DCA, we consider here whether the ICOD method is robust to the addition of pseudocounts.
Until now, we used only raw empirical frequencies obtained from sequence data. For instance, one-body frequencies were obtained by counting the number of sequences where a given state occured at a given site and dividing by the total number M of sequences in the ensemble. Covariances were computed from the empirical single-site frequencies of occurrence of each state α at each site i, denoted by f e i (α), and the empirical two-site frequencies of occurrence of each ordered pair of states (α, β) at each ordered pair of sites (i, j), denoted by f e ij (α, β). Specifically, we obtained the covariance matrix as C ij (α, β) = f e ij (α, β) − f e i (α)f e j (β) [15]. To avoid issues arising from limited sample size, such as states that never appear at some sites (which present mathematical difficulties, e.g. a non-invertible covariance matrix [4]), one can introduce pseudocounts via a parameter Λ [3,4,15,16]. The one-site frequencies f i then become where q is the number of states per site. Similarly, the two-site frequencies f ij become These pseudocount corrections are uniform (i.e. they have the same weight 1/q for all states), and their influence relative to the raw empirical frequencies can be tuned through the parameter Λ. In DCA, a high value of f Λ has been found to improve contact prediction: typically Λ ≈ 0.5 [3,4,17]. Note that the correspondence of Λ with the parameter λ in Refs. [3,4,16] is obtained by setting Λ = λ/(λ + M ). From these quantities, we define the pseudocount-corrected covariances We show in Fig. S9 that adding pseudocounts as high as Λ = 0.3 still allows for accurate extraction of mutational effects (Recovery 0.96) and provides a reliable prediction of sector sites. Protein sectors were first discovered from sequence data using a PCA-based method called Statistical Coupling Analysis (SCA) [13,18]. Interestingly, in SCA, sectors are found from the eigenvectors associated to the largest eigenvalues, while in ICOD they are found from the (modified) eigenvectors associated to the smallest eigenvalues. This difference stems from the fact that SCA and ICOD do not start from the same matrix. For binary sequences, SCA uses the absolute value of a conservation-weighted covariance matrix,C (SCA) ll = |φ l C ll φ l | (see main text and Ref. [18]). When all amino-acid states are accounted for, SCA compresses each block of the conservation-weighted matrix corresponding to two sites to obtain one positive value, e.g. the Frobenius norm of the block [13]. Conversely, ICOD employs the regular covariance matrix, suppressing the diagonal blocks of its inverse at the last step before diagonalization. To better understand the performance of SCA in recovering the site-dependent mutational effects associated with a selective constraint, it is helpful to have analytical estimates for the average mutant fraction S l * at each site l and the covariance matrix C ll for an ensemble of binary sequences obtained from a single selection using vector of mutational effects ∆. To this end, we provide the following two ansatzes: where σ 2 l = S 2 l * − S l 2 * = S l * (1 − S l * ) represents the variance of S l . Recall that S l ∈ {0, 1}, where 0 is the reference state and 1 the mutant state, and that · * denotes ensemble averages over the selectively weighted subset of sequences, while · denotes averages over the unselected (unweighted) ensemble.
Although we have not proven these two ansatzes, numerical tests (Fig. S10) have verified these two relations for ensembles generated from a ∆ with multiple sites of comparably large mutational effects so as not to be dominated by a single site, i.e., ∆ l / l ∆ FIG. S12. Recovery of ∆ from the first SCA eigenvector using ν (1) or   √ ν (1) . The sequence data are the same as used for the blue curves in Fig. 3(a). As suggested by Eq. (S67), use of the square root of ν (1) significantly improves Recovery.

C. Comparison between ICOD and SCA
In the main text, we compared the performance of ICOD and SCA with respect to Recovery of mutational-effect vectors ∆ in synthetic data (see Fig. 3). We found that ICOD performs well over a broader range of relative biases γ than SCA. The failure of SCA at biases close to zero can be explained by the fact that the conservation weights φ l then vanish (see Eq. (S65)). A further example of the failure of SCA for non-biased selections is given by the case studied in Fig. S4, where we considered two selections, a biased one associated to ∆ 1 and a non-biased one associated to ∆ 2 . Fig. S13 shows that SCA recovers ∆ 1 well, but performs badly for ∆ 2 , while ICOD recovers both of them very well (see Fig. S4). While the comparison of Recovery favors ICOD, SCA was originally used to identify sectors (in our model, sites with important mutational effects under a given selection) rather than to recover complete mutational effect vectors ∆. Hence, in Fig. S14, we compare the ability of ICOD and SCA to predict the n sites with the largest mutational effects. Note that this comparison is independent of whether we use ν (1) or √ ν (1) as the predictor in SCA, since the square-root function is increasing and preserves order. Using this criterion, we again find that ICOD performs well over a broad range of relative biases γ, while SCA only works well for sequences selected under moderate biases.
FIG. S14. Comparison of sector-site identification by ICOD and SCA (see also Fig. 3). We use the synthetic ∆ in (a) to selectively weight 5,000 random sequences at four relative bias values γ ≡ (δE * − δE )/ (δE − δE ) 2 = 0, 1, 2, 3 and test the ability of ICOD or SCA to correctly predict the sites with the n largest mutational effects. (b) Magnitudes of mutational effects of ∆ by rank. (c-d). True Positive (TP) rates obtained by taking the first eigenvector ν (1) from either ICOD or SCA, generating a ranked list of sites of descending |ν (1) l | at each site l, and computing the fraction of the top n sites in this predicted ordering that are also among the top n sites of the actual ordering of mutational effect magnitudes |∆ l |. The effect of relative bias γ on Recovery is shown in Fig. 3. (c) As expected, the prediction of ICOD is very good under all relative biases. (d) On the other hand, SCA does not perform well at the smallest or largest relative biases.

VI. PERFORMANCE OF A METHOD BASED ON THE GENERALIZED HOPFIELD MODEL
As mentioned in the main text, we also compared ICOD with another PCA-based approach developed in Ref. [6], which employs an inference method specific to the generalized Hopfield model. For L Ising spins (s l ∈ {−1, 1} for 1 ≤ l ≤ L), the Hamiltonian of the generalized Hopfield model reads (see Eq. (3) in Ref. [6]) where h l is the local field at site l, while ξ i = (ξ i,1 , . . . , ξ i,L ) is an attractive pattern and ξ i = (ξ i,1 , . . . , ξ i,L ) is a repulsive pattern. Here there are N attractive patterns and N repulsive ones. In our model, in the single-selection case, the fitness of a sequence s in the Ising representation reads [see above, Sec. III A 1, Eq. (S15)] with D l = ∆ l /2 and α = δE * − l D l . Recalling that fitnesses and Hamiltonians have opposite signs, a comparison of Eqs. (S68) and (S69) shows that ∆ plays the part of a repulsive pattern, with the exact correspondence given by ξ = ∆ √ κL/2. Note that in our model the local fields are proportional to the components of ∆. In Fig. S16, we systematically compare all methods discussed in our work to recover ∆ from sequence data under various selection biases, using different sector sizes, for selectively weighted ensembles of 50,000 random sequences. We focus on the case of a single selection and compare Recovery of ∆ according to: • ICOD, using the first eigenvector of the modified inverse covariance matrixC −1 [see main text, Eq. Overall, ICOD and GHI perform best. For small selection biases, all methods perform accurately, except SCA, which fails when selection bias vanishes, as explained above. When the sector size is small compared to the sequence length L [ Fig. S16 (a-d)], GHI performs a little bit better than ICOD for relatively small selection biases (however Recovery remains > ∼ 95% with ICOD). Conversely, GHI is significantly outperformed by ICOD for relatively large selection bias, and the performance of PCA and SCA falls off quite rapidly in this regime. The performances of ICOD, PCA, and GHI become similar when the sector size becomes comparable to the sequence length [ Fig. S16 (e, f)].
We further find that GHI is more sensitive to the size of the sequence ensemble than ICOD, although it becomes the most accurate for very large dataset sizes (see Fig. S17). The performance of ICOD is quite robust to dataset size. Note that PCA outperforms other methods when the data size becomes very small (Fig. S17, number of sequences = 500). FIG. S17. Effect of dataset size on Recovery of ∆. Selectively reweighted ensembles of 5 × 10 2 , 5 × 10 3 , 5 × 10 4 , and 5 × 10 5 random sequences are generated for the elastic-network ∆ and synthetic ∆s with sector sizes 1, 10, and 50. All results are averaged over 100 realizations, except those using 5 × 10 5 sequences, where only 5 realizations were used. For synthetic ∆s, each realization employs a different ∆ with the same sector size. For the case of 500 sequences, some Recoveries were not computed at high biases due to numerical instabilities.
Overall, we find that GHI is very well suited to infer ∆ from very large synthetic datasets. However, ICOD is more robust to variation of dataset size and to selection bias, which should be an advantage in the application to real protein data.

VII. APPLICATION OF ICOD TO A MULTIPLE SEQUENCE ALIGNMENT OF PDZ DOMAINS
Our general physical model for sectors provides insights into the statistical signatures of sectors in sequence data. In particular, we have found that the primary signature of physical sectors lies in the modes associated with the smallest eigenvalues of the covariance matrix, even though there is often additional signal from these sectors in the large eigenvalue modes, as studied more conventionally, e.g. in SCA. The success of ICOD on synthetic data demonstrates that information about sectors can indeed be extracted from the small eigenvalue modes of the covariance matrix.
How well does ICOD perform on real sequence data? Here, we apply ICOD to an actual alignment of sequences of PDZ domains from the Pfam database (https://pfam.xfam.org/) containing 24,934 sequences of length L = 79. In Ref. [20], sites important for the specific binding of PDZ to peptide ligands were identified experimentally via complete single-site mutagenesis. In particular, 20 sites showing particularly high mutational effects were deemed functionally significant [20]. It was further shown that 15 among the 20 sector amino acids found by SCA (i.e. 75%) were also functionally significant sites.
In order to compute the empirical covariance matrix of the data, we first removed sites with more than 30% gaps (6 sites out of 79). To eliminate the confounding effects of very rare residues at particular sites, we used a pseudocount weight Λ = 0.005.
Next, we performed both SCA and ICOD using this empirical covariance matrix: • For SCA, we computed the conservation reweighting factors as in Refs. [13,18], using the background frequency values from Ref. [18]. We compressed the conservation-reweighted covariance matrix using the Frobenius norm, and we focused on the first eigenvector of this reweighted and compressed covariance matrix in order to predict sector sites (see above, Section V B, and Ref. [13]). • For ICOD, we used as reference the most abundant state (residue or gap) at each site. We inverted the covariance matrix, set its diagonal to zero, changed the gauge to the zero-sum gauge, and used the Frobenius norm to compress the matrix. Then we focused on the first eigenvector of this reduced ICOD-modified inverse covariance matrix in order to predict site significances and sector sites (see above, Section IV B, especially Fig. S9). Fig. S18 shows the results obtained by applying SCA and ICOD to our multiple sequence alignment of PDZ domains. In both cases, we find one strong outlying eigenvalue [see Fig. S18(a) and (c)], thus confirming that PDZ has only one sector [20]. Comparing the top n sites identified from the top eigenvector to the 20 experimentally-identified sites with strong mutational effect [20] shows that SCA identifies experimentally important sites somewhat better than ICOD [see Fig. S18(b) and (d)]. For instance, over the n = 20 top sites identified by SCA (resp. ICOD), we find that 75% (resp. 55%) of them are also among the 20 experimentally important sites. Note that for SCA, we recover the result from Ref. [20]. Importantly, both methods perform much better than the random expectation, which is 20/73 = 27%. Hence, both ICOD and SCA should be useful to identify functionally important sites. The slightly better performance of SCA on this particular dataset might come from the fact that many of the experimentally-identified functionally important sites in PDZ are strongly conserved [19], which makes the SCA conservation reweighting advantageous. Moreover, the real sequences are related by phylogeny, which interacts with functional constraints in ways which require more study. We emphasize that the main goal of this paper is to provide insight into the possible physical origins of sectors, and into the statistical signatures of these physical sectors in sequence data. A more extensive application of ICOD and related methods to real sequence data will be an interesting subject for future work.
FIG. S18. Performance of SCA and ICOD in predicting the 20 sites with largest experimentally-determined mutational effects [20] in PDZ sequence data. (a) Eigenvalues of the SCA matrix. (b) True Positive (TP) rates obtained by taking the first eigenvector ν (1) from the SCA matrix, generating a ranked list of sites of descending |ν (1) l | at each site l, and computing the fraction of the top n sites in this predicted ordering that are also among the top 20 experimentally-important sites [20]. (c) Eigenvalues of the reduced ICOD-modified inverse covariance matrix. (d) TP rates from ICOD, computed as in panel (b) for SCA. In panels (b) and (d), the TP rate values obtained for the top 20 predicted sites are indicated by arrows. A pseudocount ratio Λ = 0.005 was used throughout this analysis.