Minimizing couplings in renormalization by preserving short-range mutual information

The connections between renormalization in statistical mechanics and information theory are intuitively evident, but a satisfactory theoretical treatment remains elusive. Recently, Koch-Janusz and Ringel proposed selecting a real-space renormalization map for classical lattice systems by minimizing the loss of long-range mutual information [Nat. Phys. 14, 578 (2018)]. The success of this technique has been related in part to the minimization of long-range couplings in the renormalized Hamiltonian [Lenggenhager et al., Phys. Rev. X 10, 011037 (2020)]. We show that to minimize these couplings the renormalization map should, somewhat counterintuitively, instead be chosen to minimize the loss of short-range mutual information between a block and its boundary. Moreover, the previous minimization is a relaxation of this approach, which indicates that the aims of preserving long-range physics and eliminating short-range couplings are related in a nontrivial way.

Although we can neither probe experimentally nor describe precisely in theory the microscopic details of the physical systems that surround us, renormalization still allows us to make predictions and verify them to a remarkable degree of accuracy. A renormalization process progressively removes degrees of freedom from a physical system, mapping it to an effective system having the same physics at large scales [1,2]. One may regard the renormalization map as removing unimportant short-range information while leaving long-range information intact, and therefore possible connections to information theory have been explored in several different approaches [3-9]. One difficulty in the renormalization enterprise is finding an appropriate renormalization map. In real-space renormalization [10], for example, there is no unique way to remove degrees of freedom, and several maps can plausibly be used. Some work noticeably better than others [11], but there is no clear criterion for choosing the best map.
Recently, Koch-Janusz and Ringel [12] proposed choosing real-space renormalization maps based on an information-theoretic criterion, as follows. Consider a spin model on a lattice $\Lambda$, and divide the lattice into nonoverlapping blocks $A_j$. Let $R$ be a renormalization map on a single block, specifically a stochastic transformation on the random variables describing the spins in the block, and call its output on the $j$th block $A'_j$. In the renormalization procedure $R$ is applied to each $A_j$, but here we need only focus on a single block $A$ with output $A' = R(A)$. In particular, dividing the lattice into the block in question $A$, its neighbors within some distance $B$, and the remainder of the spins $C$, as illustrated in Figure 1a, Koch-Janusz and Ringel propose choosing
$$R_{\mathrm{KJR}} = \operatorname*{arg\,max}_{R} \, I(A' : C)_{R(P)}, \tag{1}$$
where $P = \frac{1}{Z} e^{-\beta H}$ is the Gibbs distribution of the spin system and $I(A : C)_P$ is the mutual information of random variables $A$ and $C$ under the distribution $P$.
Due to the data processing inequality, it follows that $I(A : C)_P \geq I(A' : C)_{R(P)}$, and hence $R_{\mathrm{KJR}}$ retains the most mutual information between the block and the long-range parts of the lattice. Koch-Janusz and Ringel argue that it therefore extracts the relevant degrees of freedom and that it results in a renormalized Hamiltonian with short-range couplings. They also propose a machine-learning algorithm to determine $R_{\mathrm{KJR}}$ on a parametrized subset of all possible maps. The resulting Real Space Mutual Information (RSMI) algorithm produces good results when benchmarked on various physical models. Lenggenhager et al. [13] further showed that $R_{\mathrm{KJR}}$ does not create any long-range couplings within $C$ when $I(A : C)_P = I(A' : C)_{R(P)}$. Their theoretical work was expanded to field theory [14] and their algorithm improved by using deep learning techniques [15].
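The data processing inequality invoked here can be checked numerically on a toy example. The following is a minimal sketch and not the RSMI algorithm: it assumes a hypothetical three-spin distribution in which a two-spin block $A = (s_1, s_2)$ is tied to a single distant spin $c$ only through $s_2$, and it compares two deterministic block maps. The helper names (`mutual_information`, `apply_block_map`) and the coupling values are illustrative assumptions.

```python
import math
from collections import defaultdict
from itertools import product

def entropy(dist):
    """Shannon entropy in nats of a distribution {outcome: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """I(X : Y) in nats for a joint distribution {(x, y): prob}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return entropy(px) + entropy(py) - entropy(joint)

def apply_block_map(joint, channel):
    """Apply a stochastic map channel[a] = {a_new: prob} to the first variable."""
    out = defaultdict(float)
    for (a, c), p in joint.items():
        for a_new, q in channel[a].items():
            out[(a_new, c)] += p * q
    return dict(out)

# Toy Gibbs-like weights (assumed values): s1 couples to s2, s2 couples to c.
weights = {((s1, s2), c): math.exp(0.8 * s1 * s2 + 0.5 * s2 * c)
           for s1, s2, c in product((-1, 1), repeat=3)}
z = sum(weights.values())
p_ac = {k: w / z for k, w in weights.items()}

# Two deterministic maps from the block to one spin: keep s1, or keep s2.
block_states = list(product((-1, 1), repeat=2))
keep_s1 = {a: {a[0]: 1.0} for a in block_states}
keep_s2 = {a: {a[1]: 1.0} for a in block_states}

mi_before = mutual_information(p_ac)
mi_keep_s1 = mutual_information(apply_block_map(p_ac, keep_s1))
mi_keep_s2 = mutual_information(apply_block_map(p_ac, keep_s2))
print(mi_before, mi_keep_s1, mi_keep_s2)
```

In this toy model neither map can increase the mutual information with $c$. Keeping $s_2$ loses none of it, because $c$ depends on the block only through $s_2$, while keeping $s_1$ loses some.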
In this Letter we argue that, contrary to the above intuition, to minimize long-range couplings one should instead choose the renormalization map to retain short-range mutual information:
$$R^\star = \operatorname*{arg\,max}_{R} \, I(A' : B)_{R(P)}. \tag{2}$$
As we show in detail below, in fact no map $R$ can result in long-range couplings within $C$ or from $A'$ to $C$, and $R^\star$ additionally minimizes coupling within the boundary $B$. This approach has several other advantages. For one, the optimization is considerably simpler, as it only involves the block in question and its boundary. Moreover, $I(A' : B)_{R(P)} \geq I(A' : C)_{R(P)}$ for every map $R$, and hence the optimization in (1) is a relaxation of the optimization in (2). We emphasize that these two optimizations are born of two different motivations: (1) identifies the degrees of freedom that are most relevant to the long-range physics, while (2) aims to control the proliferation of couplings. These two motivations are not expected to yield the same optimization problem, and the relaxation described above relates the two. Finally, the optimizer of (2) (as well as of (1)) is a deterministic map, which makes brute-force optimization feasible for small blocks by searching the entire map space directly on the probability distribution, rather than by using sampling techniques. We illustrate how the optimization can be performed for $2 \times 2$ maps using tensor network representations for the 2D Ising model.

FIG. 1. (b) The random variables in the black region are conditionally independent of those in the white region given the gray region, as the gray region shields the former from the latter in the Markov network. The regions need not be connected.

Gibbs states as Markov networks.-To prove our claims we make use of the Hammersley-Clifford theorem of probability theory, which states that every Gibbs state of a local Hamiltonian is a Markov network.
A Markov network is a (probability distribution on a) collection of random variables with conditional independence relations that are captured by an undirected graph. Consider a collection of random variables $V = (V_1, \dots, V_n)$ associated to vertices of a graph $G$ and having a joint probability distribution $P(V)$. Vertices $V_j$ and $V_k$ connected by an edge in $G$ correspond to dependent random variables, for which $I(V_j : V_k) \neq 0$. Given three regions of the graph $A$, $B$, and $C$, corresponding to disjoint collections of the random variables, $B$ is said to shield $A$ from $C$ if all paths connecting $A$ to $C$ pass through $B$. The regions themselves need not be connected, as depicted in Figure 1b.
Then $(G, P)$ is a Markov network if every two regions shielded by a third are conditionally independent, i.e. $A$ and $C$ are independent given the value of $B$. Put differently, the correlations between $A$ and $C$ are mediated entirely by $B$. Conditional independence can be succinctly expressed using the conditional mutual information (CMI) as $I(A : C|B)_P = 0$, where
$$I(A : C|B)_P = I(A : BC)_P - I(A : B)_P.$$
The Hammersley-Clifford theorem [16,17] then states that $(G, P)$ is a Markov network if and only if $P(V) = e^{h(V)}$ for some local function $h$, meaning $h = \sum_{c \in \mathcal{C}} h_c$, where $\mathcal{C}$ is the set of cliques of the graph (the fully connected subgraphs) and each $h_c$ is a function only of the variables involved in the clique $c$.
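The shielding condition can be verified directly on a small Gibbs distribution. The sketch below is an illustration under stated assumptions: an open five-spin Ising chain with a hypothetical nearest-neighbor coupling `K = 0.7`, with the CMI evaluated from entropies as $H(AB) + H(BC) - H(B) - H(ABC)$, which equals $I(A : BC) - I(A : B)$. The function names are not from the source.

```python
import math
from collections import defaultdict
from itertools import product

def entropy_of(joint, positions):
    """Entropy (nats) of the marginal of `joint` on the given spin positions."""
    marg = defaultdict(float)
    for spins, p in joint.items():
        marg[tuple(spins[i] for i in positions)] += p
    return -sum(p * math.log(p) for p in marg.values() if p > 0)

def cmi(joint, a, c, b):
    """I(A : C | B) = H(AB) + H(BC) - H(B) - H(ABC), in nats."""
    return (entropy_of(joint, a + b) + entropy_of(joint, b + c)
            - entropy_of(joint, b) - entropy_of(joint, a + b + c))

# Gibbs distribution of an open five-spin chain, sites 0-1-2-3-4 (K is assumed).
K = 0.7
weights = {s: math.exp(K * sum(s[i] * s[i + 1] for i in range(4)))
           for s in product((-1, 1), repeat=5)}
z = sum(weights.values())
gibbs = {s: w / z for s, w in weights.items()}

# B = {1, 3} shields the block A = {2} from the rest C = {0, 4}.
shielded = cmi(gibbs, [2], [0, 4], [1, 3])
# Sites 1 and 3 are NOT shielded by {0, 4}: the path 1-2-3 avoids that region.
unshielded = cmi(gibbs, [1], [3], [0, 4])
print(shielded, unshielded)
```

The first value vanishes up to floating-point error, as the Markov property requires, while the second is strictly positive.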
The renormalization procedure begins with the Gibbs state of a local Hamiltonian, $P \propto e^{-\beta H}$. Renormalizing a block $A$ with map $R$ results in a new probability distribution $P' = R(P) = e^{h'}$, where we define $h' = \log P'$. Renormalizing all blocks results in some distribution $P''$, and the corresponding $h''$ is just the renormalized Hamiltonian, up to the inverse temperature $\beta$ and normalization constant factors. By the Hammersley-Clifford theorem, $h''$ will not contain any couplings between random variables which are conditionally independent, and this property can be established by showing that the CMI vanishes. By data processing, it is sufficient to consider just $h'$ to determine where new couplings may arise.
Ruling out couplings.-The presence of the boundary $B$ around the block $A$ ensures that $R$ creates no couplings within $C$ nor from $A'$ to $C$. Consider two parts $C_1$ and $C_2$ of $C$ which are not already coupled. They are thus conditionally independent given the remainder $\mathcal{R}$ of the random variables comprising the system. Region $A$ is a part of $\mathcal{R}$, and the rest we can call $D$, so that $\mathcal{R} = AD$. Since $B$ bounds $A$, it must be the case that $D$ shields $C_1$ from $C_2$ and therefore $I(C_1 : C_2|D)_P = 0$. This does not change under application of any map $R$, i.e. $I(C_1 : C_2|D)_{R(P)} = 0$, and therefore $C_1$ and $C_2$ are not coupled in $h'$. To show the same thing, the authors of [13] prove instead that $I(C_1 : C_2|A') = 0$ by assuming that long-range mutual information is preserved, i.e. $I(A : C)_P = I(A' : C)_{R(P)}$.

That $A'$ will not become coupled to anything in $C$ follows because all the correlations are mediated by $B$. Using the positivity of CMI and data processing, we have
$$0 \leq I(A' : C|B)_{R(P)} \leq I(A : C|B)_P = 0.$$

Hence, the main concern is couplings between parts of $B$ which may be induced by $R$. In one-dimensional systems, as depicted in Figure 2, it turns out that coupling between $B_L$ and $B_R$ is related to the change in mutual information between the block $A$ and the boundary $B = B_L B_R$. If the mutual information is unchanged after $R$, then $B_L$ and $B_R$ are uncoupled in $h'$. This is a consequence of the following more general statement.

Theorem 1. [...]

Typically, no nontrivial map $R$ will precisely preserve the mutual information, for reasons we shall explain in a moment. Nevertheless, minimizing the change in mutual information, by maximizing $I(A' : B)_{R(P)}$ as in (2), minimizes the coupling between $B_L$ and $B_R$. This is because the smaller the CMI, the closer the distribution $R(P)$ is to some $P'$ in which $B_L$ and $B_R$ are conditionally independent, as measured by the total variational distance between distributions (see [18, Lemma 1]). Hence smaller CMI leads to an associated $h'$ with weaker couplings.
Somewhat counterintuitively, then, to minimize couplings it is more important to preserve mutual information between a block and its boundary rather than between a block and distant spins.
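The link between boundary coupling and the loss of block-boundary mutual information can be made explicit in a simplified setting. The following is a sketch under stated assumptions, and not necessarily the form of Theorem 1: we consider only the three regions $B_L$, $A$, $B_R$ of a one-dimensional system, assume that $A$ shields $B_L$ from $B_R$ so that $I(B_L : B_R|A)_P = 0$, and let $R$ act on $A$ alone. Then
$$
\begin{aligned}
I(B_L : B_R|A')_{R(P)} &= I(B_L : A' B_R)_{R(P)} - I(B_L : A')_{R(P)} \\
&\leq I(B_L : A B_R)_{P} - I(B_L : A')_{R(P)} \\
&= I(B_L : A)_{P} - I(B_L : A')_{R(P)} \\
&\leq I(A : B)_{P} - I(A' : B)_{R(P)}.
\end{aligned}
$$
The first line is the chain rule, the second is data processing for the map $R$, and the third uses $I(B_L : A B_R)_P = I(B_L : A)_P + I(B_L : B_R|A)_P$ with vanishing CMI. The last line follows from the chain rule $I(A : B_L B_R) = I(A : B_L) + I(A : B_R|B_L)$, applied before and after the map, together with $I(A' : B_R|B_L)_{R(P)} \leq I(A : B_R|B_L)_P$. In this setting, if the map loses no mutual information with the boundary, the CMI between $B_L$ and $B_R$ given $A'$ vanishes.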
For isotropic systems, we can translate the 1D argument to multiple dimensions by treating a $D$-dimensional isotropic lattice as a 1D system in every direction, as proposed by Lenggenhager et al. [13]. The lattice can be separated into disconnected regions by hyperplanes, creating effectively a 1D system (Figure 3), and the argument of Theorem 1 carries over, so that no couplings will appear between the spins in the boundary strips $B_L$ and $B_R$. Couplings might still appear inside the central strip, but if the system is isotropic we can repeat the same argument with hyperplanes separating the renormalized block from the rest in a different direction, and expect that if a map maximizes $I(A' : B)$ in one direction, it will also do so in the other. This argument breaks down for anisotropic systems, as the different directions may have different optimal maps.

Before proceeding to examine the two optimizations in more detail, let us remark that a renormalization map which precisely preserves the mutual information can actually be undone by a suitable stochastic map. This accords with the idea that no information is lost along the renormalization flow in this case by assumption, but one does not typically expect renormalization to be reversible. Starting from $I(A' : B)_{R(P)} = I(A : B)_P$ and using the fact that $I(A : C|B)_P = I(A' : C|B)_{R(P)} = 0$, it follows that the total mutual information is preserved, $I(A : BC)_P = I(A' : BC)_{R(P)}$. Then we can appeal to Lemma 1 of [18], which ensures that the so-called "transpose" map or Petz recovery map $\hat{R}$ is such that $\hat{R} \circ R(P) = P$ [19]. The transpose map depends on $R$ and the marginal distribution of $A$ under $P$, but we shall not go into further details here.
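As a rough illustration of this reversibility: for classical distributions the transpose map is the Bayesian inverse of $R$ with respect to the marginal of $A$, i.e. $\hat{R}(a|a') = R(a'|a) P_A(a) / P'_{A'}(a')$. The sketch below assumes a hypothetical toy block $A = (s, n)$, where $s$ is correlated with a boundary spin $b$ and $n$ is independent noise. The probabilities and function names are illustrative assumptions.

```python
from collections import defaultdict
from itertools import product

def marginal_a(joint):
    """Marginal of the first variable of a joint distribution {(a, b): prob}."""
    out = defaultdict(float)
    for (a, b), p in joint.items():
        out[a] += p
    return dict(out)

def apply_map(joint, channel):
    """Apply channel[a] = {a_new: prob} to the first variable of {(a, b): prob}."""
    out = defaultdict(float)
    for (a, b), p in joint.items():
        for a_new, q in channel[a].items():
            out[(a_new, b)] += p * q
    return dict(out)

def transpose_map(channel, p_a):
    """Bayesian inverse: Rhat(a | a_new) = R(a_new | a) P(a) / P'(a_new)."""
    p_new = defaultdict(float)
    for a, p in p_a.items():
        for a_new, q in channel[a].items():
            p_new[a_new] += p * q
    rhat = defaultdict(dict)
    for a, p in p_a.items():
        for a_new, q in channel[a].items():
            if p * q > 0:
                rhat[a_new][a] = p * q / p_new[a_new]
    return dict(rhat)

# Toy joint of block A = (s, n) and boundary b: s tracks b, n is pure noise.
p_ab = {}
for s, n, b in product((-1, 1), repeat=3):
    p_s_b = 0.4 if s == b else 0.1   # joint of (s, b); the four entries sum to 1
    p_n = 0.7 if n == 1 else 0.3
    p_ab[((s, n), b)] = p_s_b * p_n

block_states = list(product((-1, 1), repeat=2))
discard_noise = {a: {a[0]: 1.0} for a in block_states}   # preserves I(A : B)
discard_signal = {a: {a[1]: 1.0} for a in block_states}  # destroys I(A : B)

def recovery_error(channel):
    """L1 distance between P and Rhat(R(P))."""
    rhat = transpose_map(channel, marginal_a(p_ab))
    recovered = apply_map(apply_map(p_ab, channel), rhat)
    return sum(abs(recovered.get(k, 0.0) - p) for k, p in p_ab.items())

print(recovery_error(discard_noise), recovery_error(discard_signal))
```

Discarding the noise spin preserves the mutual information with the boundary, and the transpose map restores the original distribution up to rounding. Discarding the signal spin does not, and the recovered distribution differs from the original.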
Optimization.-Computing $I(A : B)$ does not require handling the whole probability distribution, but only the marginal distribution on the $AB$ subsystem. This simplifies the optimization relative to Koch-Janusz and Ringel's proposal, where the distribution on the entire spin system must be treated somehow. As mentioned above, (1) is a relaxation of (2) in that $I(A' : B)_{R(P)} \geq I(A' : C)_{R(P)}$ for every map $R$. In both (1) and (2) the optimal map $R^\star$ is necessarily deterministic, i.e. all its transition probabilities are either zero or one. This follows because the objective function, the mutual information, is a convex function of the optimization variable, the map $R$, and the extreme points of the set of stochastic maps are the deterministic maps.

FIG. 3. The dark and light gray strips indicate the blocks that are used when treating the system as one dimensional in each direction, while the square indicates a block to be renormalized. If the renormalization map is optimal, the light gray strips are uncoupled. If the system is isotropic, the optimal maps for the two directions are the same.
Proposition 2. Let $\mathcal{C}$ be the space of channels from $A$ to $A'$. For a fixed probability distribution $P_{AB}$, the function $\mathcal{C} \to \mathbb{R}_+$, $W \mapsto I(A' : B)_{W(P)}$ is convex.
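Proof. Consider a collection of channels $\{W_z\}_{z \in \mathcal{Z}}$ indexed by the values of a finite random variable $Z$ with distribution $Q$. The average channel $W_Z$ is just $W_Z(P_{AB}) = \sum_{z \in \mathcal{Z}} Q(z) W_z(P_{AB})$ for any $P_{AB}$, leading to mutual information $I(A' : B)_{W_Z(P)}$. For simplicity, denote $W_Z(P)$ just by $P'$. Meanwhile, the average mutual information is given by the CMI $I(A' : B|Z)_{P'}$ since
$$I(A' : B|Z)_{P'} = \sum_{z \in \mathcal{Z}} Q(z) \, I(A' : B)_{W_z(P)}.$$
Because $Z$ is independent of $B$, we have $I(A' : B)_{P'} \leq I(A'Z : B)_{P'} = I(A' : B|Z)_{P'}$, and therefore the mapping is convex.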
When maximizing a convex function over a convex set, the optimum will occur at one of the extreme points [20,Theorem 32.2], which in this case are the deterministic maps [21,Theorem 1]. This simplifies the optimization by making the search space finite. While brute force might still be out of reach for interesting systems, more sophisticated methods such as machine learning techniques can be informed by this fact.
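The finite search over deterministic maps can be sketched on a very small system. The example below is a toy illustration under stated assumptions, not the computation reported in this Letter: it takes a standalone open chain of four spins $b_L$-$s_1$-$s_2$-$b_R$ with a hypothetical coupling of 0.5, treats $A = (s_1, s_2)$ as the block and $B = (b_L, b_R)$ as the boundary, and enumerates all $2^4$ deterministic maps from the block to a single spin.

```python
import math
from collections import defaultdict
from itertools import product

def entropy(dist):
    """Shannon entropy in nats of a distribution {outcome: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """I(X : Y) in nats for a joint distribution {(x, y): prob}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return entropy(px) + entropy(py) - entropy(joint)

def gibbs_block_boundary(coupling):
    """Joint of block A = (s1, s2) and boundary B = (bl, br) on the chain bl-s1-s2-br."""
    weights = {}
    for bl, s1, s2, br in product((-1, 1), repeat=4):
        energy_term = coupling * (bl * s1 + s1 * s2 + s2 * br)
        weights[((s1, s2), (bl, br))] = math.exp(energy_term)
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}

def push_forward(joint_ab, table):
    """Apply the deterministic map a -> table[a] to the block variable."""
    out = defaultdict(float)
    for (a, b), p in joint_ab.items():
        out[(table[a], b)] += p
    return dict(out)

block_states = list(product((-1, 1), repeat=2))
p_ab = gibbs_block_boundary(coupling=0.5)  # assumed coupling value

# Enumerate every deterministic map from the 4 block states to one spin.
results = []
for outputs in product((-1, 1), repeat=len(block_states)):
    table = dict(zip(block_states, outputs))
    results.append((mutual_information(push_forward(p_ab, table)), table))

best_value, best_table = max(results, key=lambda r: r[0])
worst_value = min(r[0] for r in results)
print(len(results), best_value, best_table)
```

By data processing the best value cannot exceed $I(A : B)_P$, and the constant maps give zero. For a $2 \times 2$ block the same loop would run over $2^{16}$ lookup tables, which remains feasible once the marginal on $AB$ is available.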
The Ising model.-Consider renormalization maps on $2 \times 2$ blocks in the 2D square-lattice Ising model. To investigate which maps are optimal according to (2), we use the Corner Transfer Matrix algorithm [22] to extract the marginal distribution of a $4 \times 4$ block, and we measure the change in mutual information between the central $2 \times 2$ block and its boundary after each of the $2^{16}$ possible deterministic maps taking this block to a single spin. We then compute the change in mutual information for each map over the range of inverse temperatures $\beta \in [0.1\beta_c, 1.9\beta_c]$ and find the optimal map at each temperature. In Figure 4 we show the change in mutual information compared with the minimum value for some common maps:
1. Decimation: the value of the renormalized spin is simply the value of one of the 4 spins in the block.
2. Majority vote: the renormalized spin is assigned the value $+1$ if the majority of the spins in the block are $+1$, and $-1$ if the majority are $-1$. With a $2 \times 2$ block ties can occur and must be broken. We do this in 4 possible ways: using a predetermined fixed value (i.e. the ties are always resolved with $+1$ or always with $-1$), using one of the spins in the block (hence the map becomes decimation in case of ties), or choosing a value at random.
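The maps listed above can be written as lookup tables over the 16 configurations of a $2 \times 2$ block. The following is a minimal sketch with hypothetical names. The ordering of spins within a block and the choice of site 0 for decimation are assumptions, and the random tie breaker is represented as a stochastic map since it is not deterministic.

```python
from itertools import product

# All 16 configurations of a 2x2 block, flattened to 4 spins (ordering assumed).
BLOCK_STATES = list(product((-1, 1), repeat=4))

def decimation(block, site=0):
    """Renormalized spin is the value of one chosen spin of the block."""
    return block[site]

def majority_vote(block, tie_breaker):
    """Renormalized spin follows the majority; ties go to tie_breaker(block)."""
    total = sum(block)
    if total != 0:
        return 1 if total > 0 else -1
    return tie_breaker(block)

rules = {
    "decimation": lambda blk: decimation(blk),
    "MV, ties -> +1": lambda blk: majority_vote(blk, lambda _: 1),
    "MV, ties -> -1": lambda blk: majority_vote(blk, lambda _: -1),
    "MV, ties -> decimation": lambda blk: majority_vote(blk, decimation),
}
# Each deterministic map is fully specified by its table of 16 outputs.
tables = {name: {blk: rule(blk) for blk in BLOCK_STATES}
          for name, rule in rules.items()}

def random_tie_majority(block):
    """Stochastic map {output: prob}: majority vote with a fair coin on ties."""
    total = sum(block)
    if total != 0:
        return {1 if total > 0 else -1: 1.0}
    return {1: 0.5, -1: 0.5}

n_deterministic_maps = 2 ** len(BLOCK_STATES)
up_counts = {name: sum(1 for v in t.values() if v == 1)
             for name, t in tables.items()}
n_ties = sum(1 for blk in BLOCK_STATES if sum(blk) == 0)
print(n_deterministic_maps, up_counts, n_ties)
```

Counting the configurations sent to $+1$ shows which tables are symmetric under a global spin flip. The fixed-value tie breakers send 11 or 5 of the 16 configurations to $+1$, whereas decimation and the decimation tie breaker send exactly 8.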
Some of these maps are not symmetric under spin flips, namely the majority vote with fixed-value tie breaker and the biased maps. Which version is optimal depends on the symmetry-breaking low-temperature state that has been selected during the simulation. We call the tie breaker or the biased map "aligned" (denoted ⇈ in the figure) if the relevant fixed value for the renormalized spin is aligned with the magnetization in the symmetry-breaking state, and "antialigned" (⇆) otherwise. At high temperature ($\beta/\beta_c \lesssim 0.3554$), the optimal map is decimation. Afterwards, for $0.3554 \lesssim \beta/\beta_c \lesssim 0.6109$, it is majority vote with ties broken by decimation. From that point up to the critical temperature, both versions of the fixed tie-breaker majority vote are optimal. The aligned version remains so up to $\beta/\beta_c \approx 1.0509$, after which the low-temperature symmetry breaking prevails and the best map is the aligned biased map.
Interestingly, majority vote with random tie breaker is rather far from optimal (it cannot be optimal as it is not deterministic) and fares worse than all other tie breakers except the antialigned one at low temperature. It can also be seen that decimation performs poorly, especially around the critical point. This is consistent with the observations of [11].

FIG. 4. Difference of the mutual information change for each map above the optimal change, as a function of inverse temperature. Each shaded region indicates which map is optimal in the corresponding interval. Note that while both majority vote maps which break ties aligned (MV-⇈) and antialigned (MV-⇆) with the overall magnetization are optimal in the interval (0.6109, 1), the random tie-breaker map (MV-rnd) is far from optimal.
Conclusions.-In this Letter, we argued that maximizing the short-range mutual information between a block and its boundary yields a renormalized system with reduced long-range couplings. In particular, couplings are never introduced beyond the boundary region of the renormalization map, and are suppressed when more of the short-range mutual information is preserved. This gives an information-theoretic account of some aspects of renormalization. The optimization suggested by this approach leads to a simple brute-force algorithm for finding the optimal renormalization map which requires only the probability distribution of the input region of the map and its boundary. It is efficient enough for small systems, as demonstrated in the 2D Ising model. Further work is required to explore the robustness of this result when information is only approximately preserved, perhaps by using an approximate generalization of the Hammersley-Clifford theorem.
Our approach contrasts with the focus of [12] and [13], which maximizes the long-range mutual information with the dual goals of capturing the relevant degrees of freedom and reducing long-range couplings. The fact that their long-range mutual information optimization is a relaxation of our short-range optimization implies some connection between these goals: If we view extracting the relevant information as the primary justification for the long-range optimization (an intuitively very plausible statement), then it will necessarily do this by minimizing long-range couplings in the renormalized Hamiltonian to some extent. The open question is how much. It would therefore be interesting to investigate under what conditions or in which models the optimal renormalization maps of the two approaches actually coincide. To this end it would also be interesting to modify the RSMI algorithm to focus on short-range mutual information, as exact optimization is computationally difficult for more complicated models. In either scenario one may also be able to take into account the fact that the optimal renormalization map is necessarily deterministic.
Finally, it should be noted that the focus on short-range versus long-range information here is reminiscent of the relation between the Tensor Renormalization Group (TRG) [23] and the Tensor Network Renormalization (TNR) [24] algorithms. The latter is a refinement of the former in which the additional steps are meant to remove short-range correlations, improving the algorithm near the critical point. Here the setting is block-spin renormalization, i.e. maps on the physical degrees of freedom and not the tensors in the tensor-network description, but again the focus is on the short-range couplings. It would be interesting to investigate if information-theoretic methods can be used to give tensor network algorithms.