Inverse Ising inference with correlated samples

Correlations between two variables of a high-dimensional system can be indicative of an underlying interaction, but can also result from indirect effects. Inverse Ising inference is a method to distinguish one from the other. Essentially, the parameters of the least constrained statistical model are learned from the observed correlations such that direct interactions can be separated from indirect correlations. Among many other applications, this approach has been helpful for protein structure prediction, because residues which interact in the 3D structure often show correlated substitutions in a multiple sequence alignment. In this context, samples used for inference are not independent but share an evolutionary history on a phylogenetic tree. Here, we discuss the effects of correlations between samples on global inference. Such correlations could arise due to phylogeny but also via other slow dynamical processes. We present a simple analytical model to address the resulting inference biases, and develop an exact method accounting for background correlations in alignment data by combining phylogenetic modeling with an adaptive cluster expansion algorithm. We find that popular reweighting schemes are only marginally effective at removing phylogenetic bias, suggest a rescaling strategy that yields better results, and provide evidence that our conclusions carry over to the frequently used mean-field approach to the inverse Ising problem.

E-mail: benedikt.obermayer@mdc-berlin.de, elevine@fas.harvard.edu

An exciting confluence of techniques from statistical physics, computer science and information theory over the last decade has yielded new methods for the study of high-dimensional interacting systems, including neuronal networks [1], bird flocks [2], justices on the US supreme court [3], gene expression networks [4], protein-protein interactions [5], transcription factor binding motifs [6], HIV vaccine design [7], and protein folding [8][9][10][11][12].
Briefly, a maximum-entropy formalism [13,14] is used to infer the parameters of a Boltzmann-like probability distribution such that its first two moments coincide with those observed in the data. These parameters in turn can be used to distinguish direct interactions from indirect correlations. In comparative genomics, a field propelled by the rapid growth in the number of sequenced genomes, such methods are used to study evolutionary correlations in protein sequences, fueled by the observation that sequence changes at one locus are frequently accompanied by compensatory changes at another. Assuming that this type of evolutionary constraint results from a physical interaction of the involved residues, inference of such direct correlations in multiple alignments of homologous protein sequences allows one to identify pairs of protein residues in close spatial proximity within the tertiary structure, as opposed to indirect correlations due to intermediaries [15]. This can be used to aid and greatly simplify computational protein structure prediction [8][9][10][11].
Consider an alignment X of binary sequences from M samples (e.g., species, numbered by Greek indices) for N sites (e.g., genomic loci, numbered by Roman indices), see Fig. 1. In a comparative genomics application, the two states $X_{\alpha i} = \pm 1$ could signify whether or not the sequence agrees with a consensus sequence, usage of a preferred or a rare codon, the presence or absence of a binding site, or any other binary observation. To obtain a description of these data with minimal prior assumptions means to infer the parameters h and J of the maximum-entropy probability distribution $P(\mathbf{x}) = Z^{-1} \exp\bigl(\sum_i h_i x_i + \sum_{i<j} J_{ij} x_i x_j\bigr)$ that reproduces the observed moments $m_i = \sum_\alpha X_{\alpha i}/M$ and $m_{ij} = \sum_\alpha X_{\alpha i} X_{\alpha j}/M$. This is known as "inverse Ising" inference, and it is a complex global problem, since in general all inferred parameters are interdependent.
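For concreteness, the observed moments that the maximum-entropy model must reproduce can be computed from the alignment matrix in a few lines; the toy alignment below is an invented example, not data from the text:

```python
import numpy as np

# Hypothetical toy alignment: M=4 samples (rows), N=3 sites (columns),
# entries +/-1 as in the text.
X = np.array([[ 1,  1, -1],
              [ 1, -1, -1],
              [-1,  1,  1],
              [ 1,  1, -1]])
M, N = X.shape

# First and second moments:
# m_i = sum_alpha X_ai / M,  m_ij = sum_alpha X_ai X_aj / M
m_i = X.mean(axis=0)
m_ij = (X.T @ X) / M  # diagonal entries are trivially 1 for spin variables

print(m_i)         # per-site magnetizations
print(m_ij[0, 1])  # pair moment for sites (0, 1)
```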
Algorithms proposed so far include small-correlation expansions [2,16], mean-field methods [17][18][19], belief propagation [5,20], a cluster expansion method [21,22] and logistic regression [23,24]. A common assumption is that the samples $X_\alpha$ are independent of each other. This, however, is often not the case: for instance, aligned homologous sequences share a common evolutionary history, represented by a phylogenetic tree. The resulting correlations are always positive and give rise to biases that do not average out within the sample but lead to coherent fluctuations. Since the underlying evolutionary experiment normally cannot be repeated, there is no way to obtain a less biased estimate from independent replicates. Moreover, available sequences are usually not a fair sample of the evolutionary history, because some clades have received more attention or were more thoroughly sequenced than others (for instance, primates within mammals, or mammals within vertebrates). Positive correlations between samples could also arise when sampling too densely from a time series or Markov chain. Disregarding such correlations between samples can therefore lead to over-estimation of true correlations between sites, and significantly bias the inferred parameters of the corresponding model.
Previously it has been suggested that one could account for the redundancy in the data, e.g., due to oversampling of closely related species, by weighting the samples when calculating moments, $\tilde m_i = \sum_\alpha w_\alpha X_{\alpha i}$ [5,8,9,25]. The weights $w_\alpha$ are chosen by heuristic methods, among them specialized weighting schemes for data from a phylogenetic tree [26,27]. However, this approach may lead to loss of information, and cannot correct for global biases. Alternatively, it was proposed that the coherent nature of phylogenetic correlations leads to a pronounced signal primarily in the first eigenvector of the observed correlation matrix [28,29] and can thus be efficiently removed. Other studies (reviewed in Ref. [30]) compared observed evolutionary correlations against a background expected from the phylogeny, or obtained estimates within an explicit phylogenetic model, but have not addressed the full inverse problem.
Here, we analyze inference biases due to correlated samples and propose an inverse Ising inference method to account for such correlations. Our approach is motivated by the special case of phylogenetic correlations, but our methods and conclusions also apply to between-sample correlations arising from slow dynamical processes in other contexts unrelated to biology. The paper is organized as follows: Sec. 1 contains a definition of the problem and a detailed description of analytical and numerical methods used. The latter are not essential for a first reading of Sec. 2, which contains a discussion of our main results. Sec. 3 discusses potential applications of our findings in the context of protein structure prediction.

Definition of the problem
Although evolutionary dynamics does not generally occur in equilibrium, observable correlations between samples can often be well approximated by an equilibrium process. We thus assume that the entire dataset is one representative sample generated by such a known process, and estimate the remaining parameters causing deviations from expectation by maximum likelihood. Specifically, our unified framework minimizes the cross-entropy of the entire alignment with respect to the unknown parameter sets h and J, where the fields h cause deviations of single loci from the background and the couplings J connect pairs of loci. This minimization is equivalent to maximizing the log-likelihood of the model given (all) the data. The M species represent the leaves of an (unrooted) phylogenetic tree, with an additional M − 2 hidden (or ancestral) nodes in the interior of the tree for the unknown states of common ancestors (Fig. 1). Including these nodes into our calculation gives a larger data matrix X. Marginalizing over unobserved ancestral states, the probability of the data under the model reads
$$P(X|h,J) = Z^{-1}\,{\rm Tr}'\, e^{-H}. \qquad (2)$$
Here, we set the energy unit $k_B T = 1$, ${\rm Tr}'$ denotes a partial trace over the ancestral nodes only (i.e., the grey nodes in Fig. 1), $Z = {\rm Tr}\, e^{-H}$ is the partition function with the trace performed over all nodes, and the Hamiltonian H for a configuration $x = (x_{\alpha i})$ is given by
$$H = H_0 - \sum_\alpha \Bigl(\sum_i h_i x_{\alpha i} + \sum_{i<j} J_{ij} x_{\alpha i} x_{\alpha j}\Bigr), \qquad (3)$$
with the "phylogenetic" Hamiltonian
$$H_0 = -\sum_\alpha g_\alpha \sum_i x_{\alpha i} - \sum_{\alpha<\beta} K_{\alpha\beta} \sum_i x_{\alpha i} x_{\beta i}. \qquad (4)$$
Different from a standard phylogenetic approach, we model the dependencies induced by shared evolutionary history using this Hamiltonian with fields and couplings g and K, respectively, where $K_{\alpha\beta}$ is nonzero only for neighbors on the phylogenetic tree, decreasing roughly with the logarithm of inverse branch length [31]. The fields $g_\alpha$ serve to prescribe a prior distribution on the states (e.g., beliefs about missing data or other biases for some species).
By means of a reference "background" data set $X^{(0)}$, the parameters of this model (M fields at the leaves of the tree and 2M − 3 couplings) can be inferred by matching the first two moments $\mu_\alpha = \langle X^{(0)}_\alpha \rangle$ and $\mu_{\alpha\beta} = \langle X^{(0)}_\alpha X^{(0)}_\beta \rangle$ between observed values and those calculated from $H_0$ (see Sec. 1.3.3 below). We note that this choice of phylogenetic model is based on comparable assumptions as more standard phylogenetic Markov models, and the differences lie mostly in how their parameters are interpreted (see Discussion for details).

A simple linear problem and local inference
We consider first a simplified version of the problem, where the correlation structure between the M species follows a linear chain rather than a tree. This model does not have hidden ancestral nodes and amounts to N coupled Ising chains with fields h, between-loci couplings J and between-sample coupling $K_{\alpha\beta} = K_0\,\delta_{\alpha,\beta-1}$ (i.e., between neighboring rows in Fig. 1). In this case, the partition function can be calculated using textbook transfer matrix methods. We will further restrict ourselves to the simplest case N = 2.
Specifically, we use a system of N = 2 Ising chains with fields $h_1$ and $h_2$, intra-chain coupling K and inter-chain coupling $J_{12}$. For large M, the partition function is dominated by the largest eigenvalue $\Lambda$ of the transfer matrix, $Z_M \simeq \Lambda^M$. The transfer matrix can be written using $R = \tanh K$, $T = \tanh J_{12}$, and $U_{1/2} = \tanh h_{1/2}$, respectively, and the eigenvalue is computed by solving its characteristic equation.

Inferring a field. To first order we ignore the coupling $J_{ij}$, and deal with one single chain of length M, with intra-chain coupling $K = \operatorname{atanh} R$ and field $h_i = \operatorname{atanh} U_i$. From the partition function we compute the average magnetization, so that the inferred fields $\hat h_i$ can be calculated from the observed magnetizations $m_i$ as in Eq. (24a) below with $U_i = \tanh \hat h_i$. For small fields (and hence small magnetization), this corresponds to the expression $\hat h_i \approx m_i\, e^{-2\hat K}$: ignoring the correlations between samples ($\hat K = 0$) gives a higher $\hat h_i$ than when using a finite value.

Inferring a coupling. Here, we consider two coupled chains with intra-chain coupling $K = \operatorname{atanh} R$ and inter-chain coupling $J_{ij} = \operatorname{atanh} T_{ij}$, but no field. The partition function for this case produces an average pair magnetization, meaning that the coupling $J_{ij}$ is derived from the observed moments $m_{ij}$ as in Eq. (24b) below. For small $J_{ij}$, we can approximate $\hat J_{ij} \approx m_{ij}/\cosh 2\hat K$. Again, ignoring the correlations between samples ($\hat K = 0$) gives a higher $\hat J_{ij}$ than when using a finite value.
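The single-chain transfer-matrix calculation can be checked numerically; the sketch below (with illustrative values of h and K) obtains the magnetization from the largest eigenvalue and confirms the small-field relation $m \approx h\, e^{2K}$, which makes the field estimate without correlations come out too high:

```python
import numpy as np

def chain_magnetization(h, K):
    """Magnetization per spin of a long Ising chain with uniform field h
    and nearest-neighbor coupling K, from the largest transfer-matrix
    eigenvalue via m = d ln(Lambda) / dh (numerical derivative)."""
    def log_lambda(hh):
        T = np.array([[np.exp(K + hh), np.exp(-K)],
                      [np.exp(-K),     np.exp(K - hh)]])
        return np.log(np.max(np.linalg.eigvalsh(T)))
    eps = 1e-6
    return (log_lambda(h + eps) - log_lambda(h - eps)) / (2 * eps)

# For small h the chain susceptibility gives m ≈ h * exp(2K), so an
# inference that ignores the intra-chain coupling (K=0) overestimates h.
m = chain_magnetization(0.01, 1.0)
print(m, 0.01 * np.exp(2.0))  # close for small h
```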
Inference errors. Since the intra-chain correlations introduce coherent fluctuations, the sample averages $m_i$ and $m_{ij}$ can be quite different from the thermodynamic averages $\langle m_i \rangle = \partial \ln Z_M/\partial h_i$ and $\langle m_{ij} \rangle = \partial \ln Z_M/\partial J_{ij}$, respectively. The leading-order contribution to the expected inference error $\langle \Delta\hat h^2 \rangle$ incurred when using the observed value $m_i$ for inference is estimated by expanding in the difference between the error for an average observation $\langle m_i \rangle$ and the average error, with the fluctuation corrections following from Eq. (9). Note that the inferred field $\hat h_i$ of Eq. (11) uses the assumed intra-chain coupling $\hat K$, while the average magnetization $\langle m_i \rangle$ from Eq. (10) and the fluctuation corrections (16) are calculated with the actual intra-chain coupling $K_0$ (via $R = \tanh K_0$). Combining these results in Eq. (15) gives Eq. (25a) below.

Numerical approach for global inference and correlations with a tree structure
In general, one is interested in inferring all fields and couplings simultaneously. At the same time, the correlation structure between samples is often heterogeneous. In particular, in comparative genomics applications some samples are often more similar to each other than others, reflecting the degree of shared ancestry summarized in a phylogenetic tree. Below, we detail a numerical procedure to perform global inference in the presence of between-sample correlations with a tree structure. The basic idea is to break up the system into small clusters of n sites [21] and then to condense all n values from one sample for each cluster into a single $2^n$-dimensional Potts spin. The interaction graph between these variables has a tree topology, and the partition function can be calculated in linear time [20]. Note that although the linear chain discussed before can be seen as a special case of the tree topology (and indeed the transfer-matrix recursions are related to the belief propagation approach used below), it is much harder to derive analytical results for a tree, even when between-sample couplings are all identical: fixed points of transfer-matrix or belief-propagation recursions apply to bulk spins, while observations with phylogenetic correlations come from the leaves of the tree, and thus represent surface states of the system.
In the following, we show how to evaluate Eq. (2) using belief propagation, where it is implicitly assumed that N is a small number. The next subsection recapitulates the cluster expansion algorithm from Refs. [21,22] that can be used to systematically break down a large system into a collection of small interacting clusters.

Evaluation of Eq. (2)

We write $\ln P(X|h,J) = \ln Z' - \ln Z$, where the restricted partition function $Z'$ is computed by performing the trace only over hidden ancestral nodes, with leaf nodes fixed to observed values. We compute both partition functions from the Bethe free energy using the same procedure [20]. In general, the Bethe free energy, which equals $-\ln Z$ exactly on a tree, reads
$$F = \sum_{(\alpha,\beta)} \sum_{x_\alpha, x_\beta} P_2(x_\alpha, x_\beta)\bigl[H^J_{\alpha\beta} + \ln P_2(x_\alpha, x_\beta)\bigr] + \sum_\alpha \sum_{x_\alpha} P_1(x_\alpha)\bigl[H^h_\alpha + (1 - |\partial\alpha|)\ln P_1(x_\alpha)\bigr]. \qquad (19)$$
Here, we introduced marginal distributions $P_2(x_\alpha, x_\beta)$ and $P_1(x_\alpha)$, and re-organized terms of the Hamiltonian as follows: $H^h_\alpha = -g_\alpha \sum_i x_{\alpha i} - \sum_i h_i x_{\alpha i} - \sum_{i<j} J_{ij} x_{\alpha i} x_{\alpha j}$ is the effective Potts field for node $\alpha$, and $H^J_{\alpha\beta} = -K_{\alpha\beta} \sum_i x_{\alpha i} x_{\beta i}$ stems from the Potts coupling between two nodes. The first term in Eq. (19) sums over values of neighboring nodes $(\alpha, \beta)$ and the second term runs over single nodes $\alpha$ weighted by the number $|\partial\alpha|$ of neighbors. The marginal distribution $P_1(x_\alpha)$ of a single Potts variable $x_\alpha$ at node $\alpha$ is given by
$$P_1(x_\alpha) \propto e^{-H^h_\alpha} \prod_{\beta \in \partial\alpha} \sum_{x_\beta} e^{-H^J_{\alpha\beta}}\, P_{\beta\to\alpha}(x_\beta), \qquad (20)$$
where $\propto$ means equality up to normalization ($\sum_{x_\alpha} P_1(x_\alpha) = 1$) and the product includes all neighbors $\partial\alpha$ of node $\alpha$. The distribution $P_2(x_\alpha, x_\beta)$ for a pair of neighboring nodes reads
$$P_2(x_\alpha, x_\beta) \propto e^{-H^J_{\alpha\beta}}\, P_{\alpha\to\beta}(x_\alpha)\, P_{\beta\to\alpha}(x_\beta). \qquad (21)$$
Finally, both distributions use messages or beliefs $P_{\alpha\to\beta}(x_\alpha)$, which are computed from the recursion
$$P_{\alpha\to\beta}(x_\alpha) \propto e^{-H^h_\alpha} \prod_{\gamma \in \partial\alpha \setminus \beta} \sum_{x_\gamma} e^{-H^J_{\alpha\gamma}}\, P_{\gamma\to\alpha}(x_\gamma). \qquad (22)$$
These equations are evaluated along the tree, in one pass from the ancestral nodes outwards to the leaves, and in a second pass from the leaves inwards. For the restricted partition function $Z'$, we use the same method, where messages for a leaf $\alpha$ are simply fixed at the observed value $X_\alpha$ by setting $P_{\alpha\to\beta}(x_\alpha) = \delta(x_\alpha, X_\alpha)$. The cross-entropy is then minimized with respect to the $N(N+1)/2$ parameters h and J by numerical optimization [32], adding $L_2$-regularization terms as a prior on the coefficients. Our approach can readily be adapted to the case where only (a small number of) nonzero entries in $J_{ij}$ are to be inferred: adding $L_1$-regularization terms $\gamma_1 \|J\|_1$ and using appropriate optimization methods instead enforces sparsity of the inferred matrix $\hat J$ [24,33,34].
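As a minimal illustration of why message passing is exact on trees, the sketch below compares brute-force enumeration of the partition function with a single inward message pass on a small star-shaped spin tree; the couplings and fields are illustrative values, not parameters from the text:

```python
import itertools
import numpy as np

# Minimal sum-product sketch on a small spin tree (4 nodes, star topology):
# each leaf sends a message to the interior node, and the partition
# function follows from summing the unnormalized interior-node belief.
edges = [(0, 1), (0, 2), (0, 3)]          # node 0 is the interior node
K = {e: 0.5 for e in edges}               # homogeneous couplings (toy values)
g = np.array([0.1, -0.2, 0.3, 0.0])       # per-node fields (toy values)

def exact_Z():
    """Brute-force sum over all 2^4 spin configurations."""
    Z = 0.0
    for s in itertools.product([-1, 1], repeat=4):
        E = -sum(g[a] * s[a] for a in range(4))
        E -= sum(K[(a, b)] * s[a] * s[b] for (a, b) in edges)
        Z += np.exp(-E)
    return Z

def bp_Z():
    """On a star, one inward pass suffices: each leaf b sends
    m_b(x_0) = sum_{x_b} exp(K x_0 x_b + g_b x_b); then
    Z = sum_{x_0} exp(g_0 x_0) * prod_b m_b(x_0)."""
    Z = 0.0
    for x0 in (-1, 1):
        term = np.exp(g[0] * x0)
        for (a, b) in edges:
            term *= sum(np.exp(K[(a, b)] * x0 * xb + g[b] * xb)
                        for xb in (-1, 1))
        Z += term
    return Z

print(exact_Z(), bp_Z())  # agree: message passing is exact on trees
```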

Cluster expansion for larger systems
We expand the entropy S in contributions from successively larger clusters [21,22],
$$S \approx S_0 + \sum_C \Delta S_C.$$
A cluster $C = (i_1 \ldots i_n)$ of n spins is only included if its contribution $\Delta S_C = S_C - S_{0,C} - \sum_{i\in C} \Delta S_i - \sum_{i,j\in C} \Delta S_{ij} - \ldots$ exceeds in absolute value a predefined threshold $\Theta$ after the contributions of all subclusters have been removed. Here, the entropy is computed from the difference in Bethe free energy, Eq. (19). Larger clusters are recursively tested by merging smaller overlapping clusters. Each cluster's entropy is separately minimized with respect to its $n(n+1)/2$ associated parameters h and J, and optimal parameters from overlapping clusters are summed up as described in Refs. [21,22]. The procedure terminates when no larger cluster with a significant contribution to the entropy can be found. Finally, while a mean-field approximation as in Ref. [21] could be used as well (possibly including the rescaling method proposed below), for now we choose as reference the entropy of the background model, $S_0 = -\sum_i \ln P_0(X_i)$, where $P_0(X_i) = Z_0^{-1}\,{\rm Tr}'\, e^{-H_0}$ is the probability of a single column under the phylogenetic model Eq. (4) with $Z_0 = {\rm Tr}\, e^{-H_0}$. Integrating a common preprocessing step, the entropy difference of single columns $\Delta S_i$ or pairs of columns $\Delta S_{ij}$ can then conveniently be used to decide which loci exhibit significant deviations from the background model and should be included in the inference.

Background estimation

The $3M - 3$ coefficients $\hat g$ and $\hat K$ of the phylogenetic Hamiltonian $H_0$ need to be estimated from background data which plausibly evolved undisturbed by any fields h or couplings J. For instance, in a protein sequence alignment one could take less conserved columns that are usually not used to infer evolutionary correlations. Given $N_0$ uncorrelated columns of such background data $X^{(0)}$, one would then match the marginal distributions $\pi_\alpha = \frac{1}{2N_0}\sum_i (X^{(0)}_{\alpha i} + 1)$ and $\pi_{\alpha\beta} = \frac{1}{4N_0}\sum_i (X^{(0)}_{\alpha i} + 1)(X^{(0)}_{\beta i} + 1)$ to the theoretical marginals $P_1(x_\alpha = 1)$ and $P_2(x_\alpha = 1, x_\beta = 1) = P_1(x_\alpha = 1|x_\beta = 1)\, P_1(x_\beta = 1)$ by nonlinear least squares, using Eqs. (20) and (22) to compute the marginals. Appropriate pseudo-counts or regularization terms should be added when estimating background parameters directly from data to avoid overfitting and reduce noise. Alternatively, if a phylogenetic Markov model for the relevant genomic regions of the species of interest is known, one could use it to calculate the marginals and then fit the parameters g and K. For the phylogenies of Fig. 3, we used our explicit Ising model on a perfect binary tree from which leaves were sampled, and then fitted the coefficients of $H_0$ on the induced topology by exactly calculating marginals for corresponding leaves via Eq. (20).

Analytical results for a simplified correlation structure
To gain insight into the effect of between-sample correlations on inference, we first consider a simplified version of the problem. Instead of a branching process giving rise to a tree structure of between-sample correlations, we assume these correlations have the structure of a linear chain, as would happen if samples were taken from a time series or a Markov chain. We do not attempt to explicitly model the process that gives rise to these correlations, but assume that a linear Ising chain with intra-chain coupling $K_0$ is a sufficiently accurate description. This coupling could be estimated from background data $X^{(0)}$ for $N_0$ uncorrelated loci via $\tanh \hat K = \frac{1}{N_0 (M-1)} \sum_i \sum_\alpha X^{(0)}_{\alpha i} X^{(0)}_{\alpha+1,\, i}$. In Sec. 1.2.2, we detailed how optimal values $\hat h_i$ and $\hat J_{ij}$ can be inferred from the observed moments $m_i$ and $m_{ij}$ when treating different sites or site pairs independently (Eq. (24)). As expected, these estimates depend on the assumed intra-chain coupling $\hat R = \tanh \hat K$. Ignoring the phylogenetic correlations between samples (by taking $\hat K = 0$ and therefore $\hat R = 0$) would yield higher $\hat h_i$ and $\hat J_{ij}$ than when using a finite value. Due to the coherent fluctuations induced by the between-sample correlations, the sample averages $m_i$ and $m_{ij}$ can be poor estimators for the ensemble averages required for accurate inference. Above, the leading-order contribution to the expected inference errors $\langle \Delta\hat h_i^2 \rangle = \langle (\hat h_i - h_i)^2 \rangle$ and $\langle \Delta\hat J_{ij}^2 \rangle = \langle (\hat J_{ij} - J_{ij})^2 \rangle$ was calculated (Eq. (25)). The first term stems from the error made for the "average" configuration when neglecting or misestimating $K_0$. It vanishes for perfect knowledge about the intra-chain correlations ($\hat K = K_0$), in which case the estimates for the fields h and couplings J correctly account for background correlations. The second term is a finite-size error, coming from coherent fluctuations of single finite configurations about the average, and it therefore scales as 1/M. Indeed, finite-size errors exist even in the uncoupled case h = J = 0.
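A minimal sketch of the background estimate of $\hat K$: spins are propagated along a zero-field chain, where the probability that neighboring spins agree is $(1+\tanh K_0)/2$, and the nearest-neighbor product averaged over $N_0$ background columns recovers $\tanh K_0$. The chain sampler and parameter values here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(M, K0):
    """Sample one configuration of a zero-field Ising chain of length M
    by propagating spins with agreement probability (1 + tanh K0)/2."""
    x = np.empty(M)
    x[0] = rng.choice([-1.0, 1.0])
    p_same = (1.0 + np.tanh(K0)) / 2.0  # P(x_{a+1} = x_a) for the chain
    for a in range(1, M):
        x[a] = x[a - 1] if rng.random() < p_same else -x[a - 1]
    return x

# Background: N0 independent columns evolving on the same chain; the
# average nearest-neighbor product estimates tanh(K0), as in the text.
M, N0, K0 = 200, 500, 0.8
X0 = np.column_stack([sample_chain(M, K0) for _ in range(N0)])
tanh_K_hat = np.mean(X0[:-1, :] * X0[1:, :])
print(np.arctanh(tanh_K_hat), K0)  # estimate is close to the true coupling
```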
While the effect of finite-size fluctuations can be reduced by overestimating $\hat K$, the first term dominates for any sample of reasonable size, and the total error is minimized at (or very near) $\hat K = K_0$. To validate these results, we consider a system with N = 2 loci and M = 200 samples, varying the impact of correlations between the samples by increasing $K_0$. At the same time, we use an adjusted coupling $J_{12} = 0.25/\cosh 2K_0$ and fields $h_{1/2} = 0.125\, e^{-2K_0}$ to guarantee that $m_1$, $m_2$ and $m_{12}$ are roughly independent of $K_0$, while the amplitude of coherent fluctuations increases with $K_0$. To obtain representative configurations of this system we simulated the model using a cluster Monte Carlo algorithm [35,36]. We then inferred $\hat h_{1/2}$ and $\hat J_{12}$ from each configuration (separately) via numerical minimization of the entropy S as in Sec. 1.2.1. Fig. 2 shows the mean squared error of our inference across the sampled configurations as a function of $K_0$ or $\hat K$, respectively, and confirms our expectations from Eq. (25). Discrepancies between theory and numerical results for higher $K_0$ are mainly due to frozen configurations, which are affected by regularization. Note that the relative inference errors $\Delta\hat h/h$ and $\Delta\hat J/J$ are dominated by a global trend from our choice of adjusting fields and couplings with $K_0$, and are therefore less suitable for comparing results across $K_0$-values and between methods.
Intriguingly, to leading order in $m_i$ or $m_{ij}$, the estimates in Eq. (24) become independent of $\hat K$ if the rescaled values $\tilde m_i \equiv m_i\, e^{-2\hat K}$ and $\tilde m_{ij} \equiv m_{ij}/\cosh 2\hat K$ are used. This suggests that we can simply infer $\hat h$ and $\hat J$ from these rescaled moments and otherwise ignore correlations between samples. The triangles in Fig. 2 validate this procedure for our simulated data. Indeed, it works even slightly better than numerical minimization, mainly because singularities due to frozen configurations are entirely avoided, and it is also useful when $\hat K$ is not precisely estimated.
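The rescaling itself is a one-line correction of the empirical moments; a sketch with made-up numbers:

```python
import numpy as np

def rescale_moments(m_i, m_ij, K_hat):
    """Leading-order rescaling from the text: with these rescaled moments,
    fields and couplings can be inferred as if samples were independent."""
    m_i_resc = m_i * np.exp(-2.0 * K_hat)
    m_ij_resc = m_ij / np.cosh(2.0 * K_hat)
    return m_i_resc, m_ij_resc

# Small-moment limit: h_hat ≈ rescaled m_i, J_hat ≈ rescaled m_ij.
m_i = np.array([0.05, -0.02])   # toy observed magnetizations
m_ij = 0.03                     # toy observed pair moment
out_i, out_ij = rescale_moments(m_i, m_ij, 0.8)
print(out_i, out_ij)
```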

Numerical approach for correlations with a tree structure.
We now turn to the biologically motivated problem where the interactions between species follow a tree structure. In contrast to the above first-order analysis, we aim to infer all fields h and couplings J simultaneously. Our approach is based on maximizing the likelihood of the model parameters h and J using Eq. (2), with the Hamiltonian Eq. (3). The parameters $\hat g_\alpha$ and $\hat K_{\alpha\beta}$ of the phylogenetic background model $H_0$ (Eq. (4)) are assumed to be known; they can be separately inferred from appropriate background data (see Sec. 1.3.3). As detailed in Sec. 1.3, we can in principle evaluate Eq. (2) by condensing all N values $X_\alpha$ from one species (i.e., one row in Fig. 1) into a single $2^N$-state Potts variable. The interaction graph between these variables has a tree topology, and the partition function can be evaluated in a time linear in M using belief propagation [20]. However, the computational cost of this procedure grows as $M\,2^{2N}$, which is obviously infeasible for systems with more than a handful of loci (i.e., larger N). As a solution, we combine this approach with an adaptive cluster expansion method [21,22], in order to decompose the system into a collection of clusters of manageably small size, comprising only strongly interacting members. Fields and couplings are then inferred for each cluster separately. Briefly, the procedure starts from pairs of loci and tests for correlations by comparing the entropy (or log-likelihood) of models with and without an interaction term. The interaction is included, and the procedure iterated to possibly expand the cluster, only if it brings a significant improvement in likelihood beyond a predefined threshold.
Tree generation. For the case of phylogenetic correlations we aim to emulate a biological problem. We create a plausible tree topology by sampling M leaves from an initial perfect binary tree with homogeneous neighbor couplings $K_{\alpha\beta} \equiv K_0$ (Fig. 3(A,B)). The phylogenetic correlations between the chosen leaves are used to numerically infer the non-homogeneous parameters of the Hamiltonian $H_0$ on the induced phylogeny, just as would be done with real data. In terms of observables relevant in a biological context, the phylogenetic correlations are indicative of the sequence identity between two samples (the fraction of identical spins). For a priori equiprobable binary states (i.e., $g_\alpha \equiv 0$), this fraction is given by $2\pi_{\alpha\beta} = \frac{1}{2}(\mu_{\alpha\beta} + 1)$, which ranges from 0.5 for perfectly uncorrelated samples to 1 for perfectly correlated ones. Note that for all values of $K_0$ some of the samples are actually entirely uncorrelated (see Fig. 3(E,F)). Mimicking a frequently observed sampling bias that leads to a more heterogeneous correlation structure, we also create "skewed" topologies, where we preferentially sample leaves from one side of the tree (second row in Fig. 3).
Simulation results. Choosing h and J as described in the legends of Figs. 4 and 5, we generate configurations X for N loci on the induced phylogeny by Monte Carlo sampling [35,36]. These simulated configurations are then used to reconstruct values $\hat h$ and $\hat J$, respectively. For a relatively simple inference problem with a sparse matrix J, Fig. 4 shows the resulting mean squared errors $\langle \Delta\hat h^2 \rangle$ as a function of the phylogenetic coupling strength $K_0$ on the initial tree from which the leaves were sampled. We compare our method ("full inference") with a "naive" averaging using $H_0 = 0$, where the cross-entropy is given by the familiar expression
$$S = \ln Z_{\rm uc}(h, J) - \sum_i h_i m_i - \sum_{i<j} J_{ij} m_{ij}.$$
Here, $Z_{\rm uc}$ is the partition sum for the Hamiltonian Eq. (3) for uncorrelated samples ($H_0 = 0$), and the averages $m_i$ and $m_{ij}$ are obtained as before by averaging over the columns of X. Fig. 4 demonstrates that the reconstruction errors are systematically and significantly smaller using the full inference. For better comparison between methods, we always select clusters based on the differential cluster entropies of the full inference. We used a cluster threshold $\Theta = e^{-K_0}/M$, chosen after inspection of the pair entropies $\Delta S_{ij}$. Otherwise, a method that is unaware of the phylogeny will always yield more clusters due to larger log-likelihood differences $\Delta S$, because deviations that are actually due to phylogenetic correlations are "explained" by larger values for the fit parameters h and J. Similar results are obtained for a different inference problem (the Sherrington-Kirkpatrick spin glass, Fig. 5). Since the trends of Fig. 2 carry over to the specific correlation structure associated with phylogenies, our analytical results are useful for understanding the global inference problem.
Rescaling vs. reweighting. Previous work often used a simple reweighting approach to account for phylogenetic correlations, based on a differential weighting of samples when calculating moments from empirical observations, such that $\tilde m_i = \sum_\alpha w_\alpha X_{\alpha i}$ and $\tilde m_{ij} = \sum_\alpha w_\alpha X_{\alpha i} X_{\alpha j}$. For comparison with reweighting schemes, we focused on one method suitable in our context. Here, the weights $w_\alpha \propto \sum_\beta (\chi^{-1})_{\alpha\beta}$ are calculated from the inverse of the phylogenetic correlation matrix $\chi_{\alpha\beta} = \frac{1}{4}(\mu_{\alpha\beta} - \mu_\alpha \mu_\beta)$ and normalized such that $\sum_\alpha w_\alpha = 1$. This gives the maximum-likelihood estimate for the mean of a sample drawn from a multivariate Gaussian distribution [26]. The loss of information associated with reweighting can easily be quantified by calculating the information content $I(w) = -\sum_\alpha w_\alpha \ln w_\alpha$ of the weight distribution, and a resulting effective number of independent sequences $M_{\rm eff} = e^{I(w)}$. This reweighting scheme captures the heterogeneous structure of the phylogenetic correlations and accounts for the redundancy in the data (Fig. 3(G,H)).
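A sketch of this reweighting scheme, assuming a toy background correlation matrix (the numbers are invented for illustration): weights follow from the row sums of $\chi^{-1}$, and the weight entropy yields the effective sample size:

```python
import numpy as np

def phylo_weights(mu, mu2):
    """Sample weights from the inverse phylogenetic correlation matrix
    chi_ab = (mu_ab - mu_a mu_b)/4, normalized to sum to one: the ML
    estimator of the mean of a correlated Gaussian sample."""
    chi = (mu2 - np.outer(mu, mu)) / 4.0
    w = np.linalg.inv(chi).sum(axis=1)
    w = w / w.sum()
    # effective number of independent sequences from the weight entropy
    M_eff = np.exp(-np.sum(w * np.log(w)))
    return w, M_eff

# Toy background moments for M=3 samples, two of them strongly correlated:
mu = np.zeros(3)
mu2 = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
w, M_eff = phylo_weights(mu, mu2)
print(w, M_eff)  # the two redundant samples share weight; M_eff < 3
```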
Results for "naive" inference with reweighting are presented as stars in Figs. 4 and 5, and indicate significant but comparatively minor improvements, especially for the inferred couplings $\hat J$ (see also Ref. [9]). This implies that the specific structure of the phylogenetic tree is much less important than the overall sequence similarity in the sample. The correspondence between Figs. 4 and 2 therefore suggests augmenting the reweighting method with a heuristic rescaling scheme, $\tilde m_i \to \tilde m_i\, e^{-2K_{\rm eff}}$ and $\tilde m_{ij} \to \tilde m_{ij}/\cosh 2K_{\rm eff}$ for $i \neq j$. The effective coupling $K_{\rm eff}$ serves to connect correlations on the phylogenetic tree to correlations of a linear chain. We use a well-known result for the spin-spin correlation function on a tree [37] to calculate the estimate $\tanh^2 K_{\rm eff} = \frac{2}{M} \sum_\alpha \max_{\beta\neq\alpha} 2\pi_{\alpha\beta} - 1$ from the average sequence similarity between most similar sequence pairs (cf. Fig. 3(E,F)). As shown by the triangles in Figs. 4 and 5, this simple method of globally removing phylogenetic bias significantly decreases the inference error down to the level of the full inference, even for correlations with an underlying tree structure.
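The $K_{\rm eff}$ estimate can be sketched directly from a matrix of pairwise similarities; the similarity values below are invented for illustration:

```python
import numpy as np

def K_eff_from_similarity(pi2):
    """Effective chain coupling from average similarity to the closest
    relative: tanh^2(K_eff) = (2/M) * sum_a max_{b!=a} 2*pi_ab - 1,
    where 2*pi_ab is the fraction of identical spins between samples."""
    M = pi2.shape[0]
    sim = 2.0 * pi2                      # fractions of identical spins
    np.fill_diagonal(sim, -np.inf)       # exclude the sample itself
    t2 = (2.0 / M) * sim.max(axis=1).sum() - 1.0
    return np.arctanh(np.sqrt(t2))

# Toy matrix pi_ab for M=4 samples: each sample's closest relative is
# 90% identical, all other pairs 55% identical (illustrative values).
pi2 = np.full((4, 4), 0.55 / 2.0)
for a in range(0, 4, 2):
    pi2[a, a + 1] = pi2[a + 1, a] = 0.90 / 2.0
np.fill_diagonal(pi2, 0.5)
print(K_eff_from_similarity(pi2))
```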

The mean-field solution.
Recent biological applications relied on a simple mean-field approach [8][9][10][11], where the couplings are inferred by inverting the matrix $C_{ij} = m_{ij} - m_i m_j$ of connected correlations: $\hat J_{ij} = -(C^{-1})_{ij}$. To test the performance of this method in the presence of phylogenetic correlations, and to compare to reweighting and rescaling schemes, we simulated data with the same phylogenies as before, but for a larger system of N = 50 loci. Fig. 6 shows results of the inference using the cluster expansion or the mean-field method (with a pseudocount to handle insufficient variability), respectively. We did not include the full inference, since for larger systems with clusters of size n the time complexity scales as $M\,2^{2n}$ and additionally suffers from roundoff errors in the message-passing recursions, leading to slow convergence of the minimization routines. Without the full inference as a standard, we decided to include clusters based on the naive method, with the cluster threshold $\Theta = 0.1/M$ held fixed. Generally, the mean-field solution is less accurate than the cluster expansion, but both methods follow similar trends: rescaling is very effective in removing phylogenetic biases, while reweighting is only marginally beneficial (due to a different quantitative effect of the pseudocount it actually performs worse than the naive method for larger $K_0$).
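A minimal sketch of the mean-field inversion with a pseudocount, on synthetic data with one truly coupled pair of columns (the data generation and pseudocount value are illustrative assumptions, not the setup of Fig. 6):

```python
import numpy as np

def mean_field_couplings(X, pseudocount=0.1):
    """Mean-field inverse Ising: J_hat = -(C^{-1}) off-diagonal, where C is
    the connected correlation matrix, regularized with a pseudocount that
    shrinks the empirical moments toward the uniform distribution."""
    M, N = X.shape
    m_i = (1 - pseudocount) * X.mean(axis=0)      # shrunk first moments
    m_ij = (1 - pseudocount) * (X.T @ X) / M
    np.fill_diagonal(m_ij, 1.0)                   # <x_i^2> = 1 exactly
    C = m_ij - np.outer(m_i, m_i)
    J_hat = -np.linalg.inv(C)
    np.fill_diagonal(J_hat, 0.0)                  # self-couplings undefined
    return J_hat

# Hypothetical data: independent columns except a correlated pair (0, 1).
rng = np.random.default_rng(1)
X = rng.choice([-1.0, 1.0], size=(2000, 4))
X[:, 1] = np.where(rng.random(2000) < 0.9, X[:, 0], -X[:, 0])
J_hat = mean_field_couplings(X)
print(J_hat[0, 1])  # the largest inferred coupling is for the true pair
```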

Connection to standard phylogenetic models
Phylogenetic inference is usually performed using Markov models. For binary data, such models require $2M - 3$ parameters (the length of each branch on the associated tree) plus one value setting the equilibrium frequency (the relative proportion of the two values). These values are fit to data using recursive algorithms largely equivalent to the ones used here [38]. The commonly used substitution matrices also imply reversibility of the underlying stochastic process, and the assumption that the equilibrium frequency does not change along the tree. However, a simple interpretation of their parameters (e.g., branch lengths as expected substitutions per time) warrants some caution: substitutions are not the only cause of sequence change, their rates and the relevant time unit are not necessarily constant along the tree, and further assumptions about homology and evolutionary processes enter the preparation of the alignment in the first place. More cautiously, these models can be seen as optimal descriptions of the available data within the considered space of models. Hence we argue that our choice of describing phylogenetic correlations by means of a phylogenetic Hamiltonian is not a limiting factor, because it is merely a generalization of a Markov model to a Markov random field, allowing for different equilibrium frequencies on different leaves [39]. Alternatively, our entire approach could easily be reformulated in the language of phylogenetic models [40], leading to similar recursions [38]. In any case, beyond conceptual clarity and straightforward techniques for generating simulation data, we believe that our non-standard formalism is advantageous under circumstances where the data is poorly fit by an explicit phylogenetic model. This could be the case due to non-uniform sequencing quality or alignability between samples, leading to an uneven distribution of gaps in the alignment.
Gaps are often included as additional states, but standard Markov models prescribe constant gap frequencies along the tree [41], whereas we can use different priors for each species. Further, our approach could be favorable if the data represent states of larger genomic regions, such as cis-regulatory elements, whose evolution is best described on a more coarse-grained scale. We note that exploiting the correspondence between evolutionary dynamics and Ising models has a long tradition [31]. A similar phylogenetic Ising model has recently been used to model HIV sequence statistics [42].
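For two states, the recursive likelihood algorithms referenced above [38] reduce to the standard pruning recursion. The following is a sketch under simplifying assumptions: a nested-tuple tree representation (our own convention, chosen for brevity) and the reversible two-state transition matrix $P(t) = \Pi + e^{-t}(I - \Pi)$, where each row of $\Pi$ equals the equilibrium frequencies:

```python
import numpy as np

def transition(t, pi):
    """Two-state transition matrix P(t) = Pi + exp(-t) * (I - Pi), with
    equilibrium frequencies pi = (pi0, pi1) and branch length t."""
    Pi = np.tile(pi, (2, 1))
    return Pi + np.exp(-t) * (np.eye(2) - Pi)

def prune(tree, pi):
    """Conditional likelihoods L(x) at a node, by Felsenstein's pruning.
    A node is either ('leaf', state) or ('node', (child, t), (child, t))."""
    if tree[0] == 'leaf':
        L = np.zeros(2)
        L[tree[1]] = 1.0                 # observed state fixes the leaf
        return L
    _, *children = tree
    L = np.ones(2)
    for child, t in children:            # product over children of
        L *= transition(t, pi) @ prune(child, pi)  # sum_y P(x->y) L_child(y)
    return L

def likelihood(tree, pi):
    """Total likelihood: root conditionals weighted by equilibrium freqs."""
    p = np.asarray(pi, dtype=float)
    return float(p @ prune(tree, p))
```

For zero branch lengths two identical leaves contribute a single equilibrium factor, while for very long branches the leaves decouple and the likelihood factorizes, matching the limits one expects from the model.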

Inference on protein alignments
Inverse Ising inference has found a powerful application in the prediction of residue contacts from large protein alignments (where it is often called direct coupling analysis [8][9][10][11]25]). These analyses use sequences from large protein families spanning considerable evolutionary distances, such that neutral positions in the alignment can generally be considered as independent. Still, there are typically subsets of sequences from more closely related species where this assumption is violated. In principle, our method can be readily adapted to non-binary data, corresponding to formulating the Hamiltonian Eq. (3) in terms of Potts variables with $\Lambda = 21$ states (for 20 amino acids and a gap). In this case, we anticipate that it might be difficult to reliably estimate all associated parameters. Also, the complexity of the cluster expansion method combined with the message passing grows like $M\Lambda^{2n}$ for a cluster of $n$ columns, which would quickly become prohibitive. Further, published methods for genomics-aided protein structure prediction [8][9][10][11] only require the identification of a small number of putative residue contacts from the top interacting pairs, and the pair ranking has been observed to be quite robust with regard to phylogenetic reweighting [5,9]. However, for more quantitative applications (see, e.g., Refs. [7,12]), we propose the mean-field approach combined with our rescaling method as a simple yet effective strategy. This mainly involves shifting measured sample averages closer to the background distribution because deviations are partially attributed to coherent fluctuations. It ameliorates problems with the proper choice of regularizers, and only requires knowledge of this background distribution and of the average sequence identity in the sample. Both can usually be reliably estimated in current sequence data sets.
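The phylogenetic reweighting referred to here is commonly implemented by down-weighting each sequence by its number of near-duplicates in the alignment. A sketch for Potts-valued data follows; the 80% identity threshold (i.e. $\theta = 0.2$), the function names, and the pseudocount convention are illustrative choices, not prescribed by any particular published pipeline:

```python
import numpy as np

def sequence_weights(seqs, theta=0.2):
    """Each sequence gets weight 1/n_a, where n_a counts the sequences
    (including itself) within fractional Hamming distance theta."""
    M = seqs.shape[0]
    diff = np.array([[np.mean(seqs[a] != seqs[b]) for b in range(M)]
                     for a in range(M)])     # pairwise fraction of mismatches
    n = (diff < theta).sum(axis=1)           # near-duplicates per sequence
    return 1.0 / n

def weighted_frequencies(seqs, w, Lambda=21, pc=0.01):
    """Reweighted single-column frequencies with a small pseudocount, for
    Potts variables with Lambda states (e.g. 20 amino acids + gap)."""
    N = seqs.shape[1]
    Meff = w.sum()                           # effective number of sequences
    f = np.zeros((N, Lambda))
    for a, wa in enumerate(w):
        f[np.arange(N), seqs[a]] += wa       # accumulate weighted counts
    return (1.0 - pc) * f / Meff + pc / Lambda
```

With this scheme a cluster of identical sequences contributes the same total weight as a single distinct sequence, which is precisely the behavior the reweighting heuristic is designed to achieve.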

Conclusions
We presented a systematic study of inverse Ising inference for phylogenetically correlated samples, based on combining belief propagation recursions with an adaptive cluster expansion method proposed previously [21]. Here, we employed an Ising-like background model that generates the observed phylogenetic correlations. We then maximize the likelihood of interaction coefficients between different loci in adaptively chosen small clusters, given the corresponding data and the background model. Our method significantly reduces the inference error due to phylogenetic bias. Our focus here was on phylogenetic correlations between samples, but we note that such correlations may also arise from slow dynamical processes in other contexts unrelated to biology. Finally, we emphasize that there might also be circumstances where biases due to phylogeny or other processes can safely be neglected, for instance if only the interaction topology (i.e., non-zero entries of $\hat{J}$ regardless of their exact value) is of interest [24,33,34], but this question warrants further research. Popular approaches for mitigating the effect of phylogenetic bias are based on down-weighting highly similar samples, but we show here that this has only marginal benefits. In contrast, we propose a simple rescaling of observed averages by the expected contribution attributed to excess sequence similarity, and show that it can be highly effective. Importantly, this undemanding approach is very useful even when the inference is based on simple (and computationally inexpensive) mean-field inference, which is now frequently used in the field of protein folding.