A variational method based on weighted graph states

In a recent article [Phys. Rev. Lett. 97 (2006), 107206], we have presented a class of states which is suitable as a variational set to find ground states in spin systems of arbitrary spatial dimension and with long-range entanglement. Here, we continue the exposition of our technique, extend from spin 1/2 to higher spins and use the boson Hubbard model as a non-trivial example to demonstrate our scheme.


Introduction
Spins or harmonic oscillators on a lattice form a class of models which have been studied intensively in statistical physics. Understanding them is the key to many problems in condensed matter systems, especially regarding magnetic phenomena but also electrical and heat conduction and many other aspects. As the importance of quantum phase transitions [Sac99,Voj03] has been more and more realized, interest in the ground states of quantum spin models grew. While the relevance of entanglement for quantum phase transitions was initially not fully appreciated, it is now a vivid area of research (e. g., [OAFF02,VLRK03]), and many researchers feel that paying explicit attention to entanglement features is vital for further progress in numerical methods for the treatment of spin models [ON02,VPC04,Lat07]. Although quantum phase transitions nominally only occur at zero temperature, their presence has great influence on the system properties at finite temperature, namely leading to the break-down of quasiparticle descriptions. Hence, studying the ground state of spin models holds promises to understand experimentally observed features of such systems, not the least of which is high-temperature superconductivity. Finally, spin models (including bosons on a lattice) are an ideal way to model optical lattices, which are currently researched with exciting successes in theory and experiment (reviewed in [LSA + 07]).
While there are some exactly solvable spin models in one spatial dimension [Tak99], for nearly all models in higher dimensions approximative techniques have to be used. A variety of quite different techniques have been developed: Most prominently, these are quantum Monte Carlo techniques, where recent progress has been acieved especially in the context of the so-called world-line Monte Carlo methods ( [ELM93,PST98], reviewed in [KH04]). For one-dimensional systems, extraordinary accuracy has become possible with the density matrix renormalisation group (DMRG) algorithm ( [Whi92,Whi93], review [Sch05]). Recently, this algorithm was extended to allow the calculation of not only ground state properties but also of thermal states [ZV04,VGC04] and time evolutions [LXW03,MMN05,Vid04,DKSV04]. Also, an extension to higher spatial dimensions has been proposed [VC04a]. Its usability in practice has been demonstrated only very recently [MVC07].
All these variations of DMRG are based on the same class of variational † states, namely matrix product states [RÖ97]. We have recently found [APD + 06] that another class of states, namely the so-called weighted graph states (WGS), first studied in different context in [DHH + 05, HCDB05], is also quite promising as ansatz for variational approximation of ground states of spin systems. Its particular advantage is the unlimited amount of entanglement that can be present. Hence, we consider our technique as especially promising for systems with long-range entanglement such as critical systems. A further key difference of our states to matrix product states is that their mathematical structure does not reflect any spatial geometry (while the product of matrices in a matrix product state reflects a chain or ring geometry, as studied especially in [RÖ97,VPC04]) and hence may be expected to be equally suitable for higher dimensions (2D or 3D) as for 1D. Hence, even though we probably cannot compete with the astounding accuracy of DMRG in 1D, we aim to provide a complementary alternative to the higher-dimension generalisations of DMRG [VC04a]. In [APD + 06], we presented this technique and demonstrated its use for simple spin-1/2 systems in one and two dimensions. In the present article we explain our method in †Strictly speaking, only the fixed-length phase of DMRG can be called a variational method. much more detail, show new results we have obtained since then (especially regarding the treatment of spins higher than spin-1/2, and concerning heuristics to perform the minimizations) and tests its usefuleness on practical examples. The article is selfcontained and does not assume the reader's familiarity with weighted graph states or the content of [APD + 06].
This article is organised as follows: We start in Sec. 2 by reviewing some general observations about variational methods. In Sec. 3 we describe our class of variational states as a generalisation of weighted graph states and discuss their parametrisation. Section 4 explains how reduced density matrices of these states are calculated in an efficient manner in order to be able to evaluate expectation values of observables, including energy. To test our method, we show results for calculations on two different models (namely the XY model and the Bose-Hubbard model) in Sec. 5. In a variational method, a crucial part is finding a state within the given class that minimises the energy as well as possible. Our techniques for doing so are the topic of Sec. 6. We add some further notes on the details of our numerical implementation and its performance (Sec. Appendix A), and finish with a conclusion and an outlook on further work (Sec. 7)

General considerations on variation
For a Hamiltonian H that is too large to diagonalise one can approximate the ground state using the Rayleigh-Ritz variational method. One uses a family of states |Ψ(x) which depend on some parameter x. It may be better to see this as a map from a parameter space Ê K to a Hilbert space H: One then solves the minimisation problem in order to obtain an upper bound E min to the ground state energy and an approximation |Ψ(x min ) for the ground state. For this to give good results, the map Ψ has to fulfil the following conditions: (i) There must be an efficient algorithm to calculate the expectation value of observables for any state Ψ(x). In principle, it is sufficient to be able to calculate Ψ(x) | H | Ψ(x) and Ψ(x) | Ψ(x) , but if one wants not only to bound the ground state energy, but also analyse the ground state approximant |Ψ(x min ) , it is desirable to be able to calculate expectation values for other observables, too.
"Efficient" means here a computation time at most polynomial in the number of parameters K. As the dimension of the Hilbert space H typically scales exponentially in the number N of constituents of the system, we want K to not do the same. Thus, the map Ψ, considered as a a family Ψ N of maps for different system sizes N , should be such that the dimension K of its domain scales only polynomially with N , and thus logarithmically with dim H.
(ii) There should be reason to expect that there are states |Ψ(x) within the range of the map Ψ that have large overlap with the true ground state or at least an energy near to the true ground state energy. As the range of Ψ is a sub-manifold of H of dimension at most K ≪ dim H, this requires it either to be folded and twisted in a quite peculiar way to reach many different regions of H, or to happen to occupy the same small part of H as the ground state. Typically, it is not possible to prove such a statement, and one hence has to do with heuristic arguments or numerical evidence.
(iii) There should be reason to expect that the minimisation programme (1) succeeds in finding a good minimum and does not get stuck in a bad local minimum. It is often not justified to hope to find the global minimum, but a local minimum of an energy only slightly higher than that of the global minimum is hardly worse.
Whether the minimisation can succeed depends on the "energy landscape", i. e. the graph of E(x). If this landscape has many local minima, a naïve multistart optimisation cannot succeed. Often, the number of local minima increases exponentially with N or K, which may render a method that is efficient for small systems useless for larger ones. Hence, one usually has to succeed in tailoring a heuristics that helps to find good minima for the specific kind of energy landscape one has to deal with.
One of the best studied variational methods is finite-length DMRG, and we shall illustrate the conditions given above by briefly discussing how DMRG (in the formulation of Ref. [VPC04]) fulfils them. For DMRG, the class of variational states are the matrix product states [ÖR95,RÖ97]. For an N -site matrix product state, an efficient algorithm exists to evaluate the expectation value of any observable that can be written as a sum of tensor products of local operators in time linear in N . This meets condition (i). The expectation that a matrix product state is a good approximant for the ground state of a generic 1D system (condition (ii)) is the very rationale that led White and Noack to their idea of keeping the lowestlying eigenstates not of the short-range Hamiltonian but of the corresponding density matrix as explained e. g. in [Whi98]. The fact that condition (iii) is fulfilled, i. e. that the "sweeping procedure" of finite-length DMRG does not get stuck in local minima is somewhat mysterious, especially in the light of the possibility of construction of Hamiltonian for which this cannot be avoided [Eis06]. Nevertheless, the construction principle, as exposed in [VPC04], shows that the matrix for each site has direct influence only on this site and its neighbours, i. e. matrix product states allow for an essentially local description of states despite the existence of significant amount of entanglement. Hence, is seems natural that -barring "pathological" cases such as those discussed in [Eis06]-the local variation of matrices during sweeps allows for a good minimisation, provided the initial N -site state was chosen well (which is the task of the so-called "warm-up", which uses infinite-length DMRG). Furthermore, Wolf et al. have recently shown a close connection between approximability and Rényi entropy for matrix product states [SWVC07].
We shall come back to some of these points when comparing our variational states with matrix product states at the end of Sec. 3.

Basic idea
Our class of quantum states derives from the so-called weighted graph states, which were introduced in [DHH + 05, RBB03] and also used in [HCDB05]. They are a generalisation of graph states (introduced in [BR01], see [HDE + 06] for a review).
For a Hilbert space of N qubits, they are defined as † where a product of phase gates W ϕ is applied onto a tensor product of |+ = (|0 + |1 )/ √ 2 states. These phase gates are two-qubit operation, diagonal in the computational basis, and of the form ‡ It may help to see the effect of W on small states. This is, e g., a three-qubit weighted graph state (where the qubits are numbered 1, 2, 3 from left to right): For every pair a, b of spins, there is a phase gate with a phase ϕ ab = ϕ ba . The key observation and starting point of the work in [DHH + 05] is that even for very large N , we can efficiently calculate any reduced density matrix for a subset A ⊂ {1, 2, . . . , N } of the qubits as long as the number of qubits in A (i. e. the number of qubits not traced over) is low. This calculation is efficient in the sense that the time requirement scales only polynomially in N (though exponentially in |A|). This is remarkable because in the generic case, the time to calculate a reduced density matrix is exponential in N , and most classes of states which allow for calculation of reduced density matrices in polynomial times are bounded in the amount of entanglement that they can contain. Especially in the case of matrix product states, this fact is the dominant reason why DMRG cannot be applied successfully for certain settings [VPC04]. Weighted graph states, on the other hand, are not bounded in the amount of their entanglement, as shall be explained in Sec. 3.3.
There is no guarantee that these states spread through those parts of the Hilbert space which are of interest for us, and hence, we add as many further degrees of freedom to the form (2) as possible without losing the ability to efficiently calculate reduced density matrices. As will be demonstrated in Sec. 4, the following additions do not hinder the efficiency of the reduced density matrix evaluation: (i) Let the phase gates act not simply on |+ ⊗N , but on any N -qubit product state. (ii) Even weighted superpositions of m product states can be treated, provided m is small. (iii) After the phase gates, arbitrary local unitaries may be applied.

Parametrisation
Deviating from the treatment in [APD + 06], we develop the formulae not just for spin-1/2 particles but, more generally, for n-level systems, i. e. our states live in a Hilbert space H = ( n ) ⊗N . †In this article, superscripts in parentheses always indicate the spins an operator acts on. Hence Wϕ is an operator defined on a 2-spin space, while W (ab) ϕ is defined on the full N -spin Hilbert space, but has support only on spins a and b.
‡In [DHH + 05, HCDB05], the notation Uϕ ab is used instead of W (ab) ϕ ab . Here, we use the W to emphasise that it is a specific, and not some general unitary. Note also the absence of a minus sign in the exponential e iϕ , which differs from the convention used in [DHH + 05].

Superposition of product states
We start with a superposition of m product states, which we write The operator nrm denotes normalisation: nrm |ψ := |ψ / |ψ . To facilitate notation, we also introduced As we normalise afterwards, we can fix the coefficient in front of |0 to 1: d j a,0 ≡ 1 for all a, j. It will also be useful later to introduce the deformation operators If one is given the ζ st and can choose the β s and γ t at will, one does not need the freedom to set all entries in W Φ . It suffices to have 4 phases: W Φ = diag (1, 1, 1, 1, e iΦ 11 , e iΦ 12 , 1, e iΦ 21 , e iΦ 22 ).
Defining Φ s0 ≡ 0 and Φ 0t ≡ 0 for all s, t ∈ Ë, we can simply write Our variational states now take the following form: The vector x is a concatenation of all the parameters that are present in the righthand side, i. e. the (real) parameters of x contain the real and imaginary parts of the complex scalars d j as and α j , the (real) entries Φ st ab of the phase matrices, and the parameters describing the N local unitaries U a ∈ SU (n), a = 1, . . . , N .

Parametrisation of the unitaries
Next, we need to choose a parametrisation of SU (n) in order to describe the unitary matrices U . For this, we use an isomorphism between the set SU (n) of unitary n × n matrices U and the set of Hermitian n × n matrices A because Hermitian matrices are easy to parametrise. We could use (a) the matrix exponentiation U = exp iA or (b) the Cayley transform (introduced 1846 by Cayley, see e. g. [Puz05]) To calculate these expressions numerically, we need, for (a), a matrix diagonalisation and, for the matrix invertion in (b), an LU factorisation [TB97]. We choose the Cayley transform, not only because it is slightly faster, but especially because we will later have to evaluate the derivatives of U with respect to its parameters, and while this is very involved for (a) [NH95], it is rather trivial for (b) [PlM]. (A disadvantage seems to be on the first glance that the Cayley transform is undefined if A has -1 as eigenvalue, because then, (i½ − A) cannot be inverted. The algorithm will not, however, converge to this case, and if it happened to hit on it, the program would abort.)

Parameter count
Let us now count the number K of real parameters needed to describe a state |Ψ(x) : • For each phase gate, we need (n − 1) 2 real numbers. In case of one phase matrix for each pair of spins, there are N (N − 1)/2 gates. • For the deformations, i. e., the specification of the initial product states, we need 2mN (n − 1) real numbers. • For the superposition coefficients, 2m reals.
• An n× n Hermitian matrix is specified by n(n− 1)/2 complex entries in one of the triangles above or below the diagonal and n real entries in the diagonal. Hence, we need for the N unitaries a total of N n(n + 1)/2 real parameters.

Entanglement properties
As already mentioned an important motivation for this work was the goal to find a class of states which exhibit strong entanglement over arbitrary distances that is somewhat "generic". After all, the limited ability to describe such entanglement is a common shortcoming of many approximation methods for many-body quantum mechanics. For the case of DMRG, this has been studied in detail in Ref. [VPC04]. There, it was shown that the matrix product states that arise during DMRG can be understood as "projections" from an auxilliary linear quantum system of the valence bond solid type [VC04b]. Hence, whenever one cuts the matrix product states "chain" into two parts, the blockwise entanglement (i. e., the entropy of the reduced density matrix of one of either part) is bounded by 2 log 2 D, where D is the dimension of the auxilliary spins, which is equal to the number of "kept states" in DMRG parlance or the matrix size in the matrix product state picture. This explains why DMRG performs not too well when applied to long 1D systems with long-range entanglement or, more precisely, to systems where the blockwise entanglement grows with the block size. A scaling of the entanglement is hardly avoidable when treating systems with more than one dimension. According to the various "area law" theorems and conjectures, for most systems the entanglement of a block versus the rest of the system scales linearly with the area of the interface between this block and the rest [AEPW02, PEDC05, Wol06, CEPD06, WVHC07]. Hence, for, say, a 2D system, the entanglement scales linearily with the surface area of the block and matrix product states are unable to render this feature without their matrix size growing quite fast. There are ways of replacing the matrices with higher-rank tensors to keep up with the area law, yielding so-called projected entangled pair states (PEPSs) [VC04a] but the formalism of these is rather tedious and grows more complicated with increasing spatial dimension. Also, PEPSs cannot go beyond the area law and are hence still unable to treat systems that do not follow the area law, i.e., show entanglement that scales superlinearly with the block surface, which typically is the case in critical and certain disordered systems [Kor04, KM05, CEP07, VWPC06, EO06, BCS06].
We hope that our variational method turns out to be a viable complementary method especially to this "PEPS" generalization of DMRG. To see how this claim may be substantiated, note that in the description of our states, the geometry of the system has not entered yet. Every spin is connected to every other spin by a phase gate, and we can thus modell any geometry, i. e., any scheme of neighboring relations. The entanglement of a block of M spins w. r. t. the rest of the system (with N − M spins) can scale with the number of spins M , i.e. with the volume and not with the surface area of the block [CHDB05]. Thus, the blockwise entanglement can reach the maximum value that is possible in the given Hilbert space. Other entanglement measures such as localizable entanglement between pairs of spins and also two-point correlation functions can reach their maximum value (independent of the distance), but can also show exponential or polynomial decay [DHH + 05]. This is already evident from the fact that 2D cluster states are within our variational class, and they reach The circles denote sites of a 6 × 6 lattice. In constructing a variational state (8) on this lattice, phase gates W Φν are performed on any pair of sites, where the phase gate acting on the purple shaded site and a site marked with the number ν uses the phase matrix Φν . Translation of these markings show the phase indices for other site pairs. Note, how due to the rotation and reflection symmetries of the square lattice, the pattern of phase indices is repeated eight times. maximum entanglement in several senses [NMDB06], e. g. the localizable entanglement between all pairs of spins is one.

Making use of symmetries
3.4.1. Symmetrising the phases The quadratic scaling of K with the number N of spins (lattice sites) in Eq. (10) can be reduced to a linear scaling in case of a system Hamiltonian with translational symmetry. This is because in this case it is reasonable to assume that we do not lose precision if we let the phase matrices depend not on the absolute positions of the spins a and b but only on the position of b relative to a. More precisely, we introduce a mapping ν : V × V → {1, . . . , R}, that gives the phase index for the spin pair a, b: the phase gate that is applied on the pair (a, b) shall be the phase matrix with number ν(a, b), and R is the total number of phase matrices. The 4th-order tensor Φ r1r2 ab now becomes a 3rd-order tensor Φ r1r2 ν(a,b) . The mapping ν has to be constructed such that two pairs of spins, (a, b) and (c, e), get the same index, ν(a, b) = ν(c, e), if and only if the pair (a, b) can be mapped onto (c, e) by a symmetry transformation that leaves the system Hamiltonian invariant. For the common case of a Hamiltonian that is a sum of identical terms which each act on one bond (i. e., connection of lattice sites), this is the symmetry group of the lattice. In the case of a square lattice with N = L × L sites on periodic boundary conditions (PBC), only † phase matrices are needed as can be seen from Fig. 1, and thus, we need only K = O(N ) parameters. Note also, that ν is naturally symmetric, ν(a, b) = ν(b, a), and that this has to be reflected by a like symmetry of Φ w. r. t. its upper indices: Φ r1r2 ν = Φ r2r1 ν , which must be imposed explicitely.

Full symmetrisation
For a symmetric Hamiltonian, it seems natural to reflect this symmetry not only in the phases Φ, but also in the local, site-dependent properties, i. e., in the local unitaries U a and the deformation parameters d j a . In case of full translation symmetry, one may want to completely drop the dependence of these on the site index a. This does indeed reduce the number K of parameters significantly, but not as dramatically as in the case of phase symmetrisation. The latter reduced the scaling of K from O(N 2 ) to O(N ), while further symmetrisation of the other parameters cannot change K = O(N ). On the other hand, the time required to calculate the energy of a given state is reduced by a factor O(N ) in the fully symmetric case, as one needs to evaluate it for only one elementary cell of the lattice.
A good reason not to impose full symmetrisation nevertheless is the observation that for many systems, the ground state does not necessarily obey the full symmetry of the Hamiltonian due to spontaneous symmetry breaking. Even though in such a case, the ground state must be degenerate, and at least one state within the ground subspace must obey the full symmetry, this state is unlikely to be the state that is easiest to approximate within the chosen class of variational states. To give an example: The ground state of the antiferromagnetic Ising chain without transverse field is α |0101 . . . + β |1010 . . . , for any α, β with |α| 2 + |β| 2 = 1. Only for α = β the state is invariant under a translation of one site. However, the state most easily approximated is α = 1, β = 0 (or vice versa), as it is a product state, while any other state contains long-range entanglement. If we imposed full translational symmetry onto the states, our algorithm would likely fail to find a good state. However, the example suggests a compromise between flexibility and low number of parameters: We make the local properties U a and d j a periodic in a way that matches the expected periodicity of the spontaneously-broken ground state, e. g., in the case of the Ising chain, we may use one common unitary and one common deformation vector for all odd sites, and another unitary and another deformation vector for all even sites. However, our numerical experiments showed that this does not work particularly well: the enforcement of such symmetries introduces very many additional local minima which trap the minimzation routine much too soon. The intuitive reason for this is that enforcing the symmetrie amount to a cut through the energy landscape of the parameter space which seems to divide meandering troughs into seperated basins.
Let us nevertheless mention two more possibilities to even further reduce the parameter scaling. (i) We can make the phase index mapping ν such that it does not depend on the geometric relation as in Fig. 1 but just on the scalar number of lattice steps that separates the spins, the number of phase indices scales linearly only with the length L, not with the number of sites N = O(L D ) (where D is the dimension of the system). Together with a full or periodic symmetrization of the local properties, we reach a scaling of the number of parameters K = O(L), which allows for a quick treatment even of 3D systems of moderate size. The accuracy achieved this way is, however, very modest.
(ii) Often, one may expect long-range entanglement to be supressed exponentially. Then one can choose a distance threshold and fix to zero all phases betweens spins with a distance above this threshold. The threshold will typically be chosen of the order of the entanglement length, and as the latter usually does not increase strongly with the system size (except at criticality) one can save considerably on the number of parameters.

Evaluating observables
In order to evaluate an observable O with support on A ⊂ V , we need to evaluate As we shall see now, ρ A can be calculated in time polynomial in the number N − |A| of spins, the number n of levels per spin and the number m of superpositions, but exponential in the number |A| of spins not traced over. Hence, the expectation value O x of observables can be calculated efficiently as long as O is a sum of terms with small support.
In particular, we need this algorithm to evaluate the energy Ψ(x)| H |Ψ(x) , as this is the quantity we wish to minimise. Thus, due to the scaling properties just mentioned, we require that the system Hamiltonian can be written as sum of terms with small support (as it is the case nearly always).

A pair of spins
To keep notation simple, we only derive the procedure to obtain the two-spin density matrix (A = {a, b}) This is a generalisation of the work done in [DHH + 05] for spin-1/2. A further generalisation to more than two spins is easy and its result will be given at the end of this section. The spins that we do not trace over are denoted a and b. We start by inserting Eq. (8) into Eq. (11) and pull as much as possible out of the partial trace: Here, the operator nrm again means normalisation, now defined as nrm ρ := ρ/ tr ρ, and the inner term ρ jk ab contains anything that cannot be pulled out of the partial trace: which is, due to Eqs. (5) and (7), Note that in the trace (13) all the phase gates W Φce with c, e / ∈ {a, b} cancel with their Hermitian conjugate, as do all the local unitaries U c , c / ∈ {a, b}. Hence, ρ ab depends only on a subset of the parameters.
In order to take the trace in Eq. (13), we have to sum over all states q , q ∈ Ë N −2 , where the underline denotes that the components of q are not indexed (s 1 , s 2 , . . . , s N −2 ) but rather using the elements of V \{a, b} as indices. We get In the last line, we can exchange sum and product in the following manner without changing the expression:

This gives
The sum over Ë has n terms, and N − 2 such sums are multiplied. Hence, in order to calculate one matrix element of ρ jk ab we have to evaluate the underbraced term (N − 2)n times. This is the origin of the promised polynomial scaling for the calculation of expectation values.
Recall that ρ jk ab is an n 2 × n 2 matrix. We will make this more explicit by writing the product as Hadamard product. The Hadamard product, denoted ⊙, is defined as the component-wise multiplication of matrices, (B ⊙ C) ij := B ij C ij . Its identity, denoted ½ ⊙ , is the matrix having 1 as all of its elements. Using this, we can rewrite the previous equation in a very compact form: where the matrix elements of ρ jk ab,c are given by the underbraced term in Eq. (14). Each factor of the Hadamard product can be understood as resulting from the interaction of the spins a and b with one spin c from V \{a, b}. These factors can be calculated seperately because the interaction between two different spins in V \{a, b} may be and is ignored due to the cancellation of all phase gates within V \{a, b}. (Cf. the remark after Eq. (13).) To make this more concrete, let us look at the simple case of n = 2. Then (Recall that d j a,0 ≡ 1, and Φ s,t = 0 for s = 0 or t = 0.) which is the formula given in [DHH + 05].

Several spins
For reference, we give the result for density matrices for not simply two spins a, b, but arbitrary numbers of spins, given in a set A: with The mapping τ : A → {1, . . . , |A|} gives here the index that spin a ∈ A gets within the density matrix ρ A (i. e., in the 2-spin case of ρ ab , τ (a) = 1 and τ (b) = 2.). It is also useful to observe that This formula comes from the observation that a matrix product of a diagonal matrix, an arbitrary matrix, and another diagonal matrix can be written as Hadamard product: Further, for the numerics, one may use a,b∈A a<b

Demonstration for two models
To approximate a ground state, we have to vary the parameters in order to minimize the energy. Before we explain our techniques to achieve this we show the results of such minimizations for two different models to demonstrate the performance of our technique.

The XY model with transverse field
The XY model with transverse field for a system of spin-1/2 particles on a lattice is given by the Hamiltonian where σ x,y,z are the Pauli matrices, B is the set of nearest neighbours, B the transverse field and γ is called the asymmetry. For γ = 0, we get as special case the XX model, and for γ = 1, we get the Ising model.

One dimension (spin rings)
For 1D, the XY model with transverse field can be diagonalised using a Jordan-Wigner and then a Bogoliubov transformation (the latter is trivial for γ = 1). Correlations have been studied in early work in [Pfe69] (Ising) and [Kat62,BM71] (XY). The latter article also gives the phase diagram of the 1D XY system (reproduced in Fig. 2a)  for 2D (according to [Hen84]).
As our technique tends to spontaneously break symmetry where the true ground state does not, it makes sense to plot the two-points correlations † for many different values of the parameters of the Hamiltonian (here: B and γ) in order to spot phase transitions. We find that it works better to plot correlations for a specific distance than to estimate correlation lengths from the data because the system is still so small that the exponential decay of correlations is masked by boundary effects . Fig. 4 shows such a plot for the 1D XY model. As is to be expected, one sees that near critical regions correlations are much stronger. (For the infinite chain, the critical regions are: XX criticality at γ = 0 for 0 < B < 1 and XY criticality ‡ for B = 1 [BM71].) The spread of the areas of high correlation around the critical regions of the infinite chain looks similar areas of high entropy identified in [LLRV05] -compare with Fig. 3 in that article (and note that there, entropy is small around γ = 0, B = 1 despite the critical nature of this point -a feature also seen in our plot of correlations.) Had we not known the critical regions, it is not merited to conclude that the system is critical where the correlations are strong, as the system size and the correlation distance is surely to small for this. We rather suggest to use a plot of this kind for a first look at a yet unstudied Hamiltonian. Regions of high correlations may suggest points in parameter space for which numerical calculations for different system sizes may give interesting results.
Another interesting feature of the 1D XY model is the Baruch-McCoy circle, which is the defined by B 2 + γ 2 = 1. On this circle, the ground state has product form [BM71]. Our approach accurately reproduces the vanishing of all correlations as one approaches a point on the Barouch-McCoy line (Fig. 5).

Two dimensions
The 2D XY model with transverse field has been studied in [Hen84]. The main result of the latter treatment is illustrated by Fig. 2b. †We either plot the XX correlations or the maximum singular value of the correlation matrix " σ (a) i,j=x,y,z ‡Strictly speaking the model is XY critical only for B = 1, δ = 1, and Ising critical for B = δ = 1.  As a guide to the eye, symbols for the same distance are conmnected by lines. It is evident that one seems to need two lines to connect the set of data points for each distance, as a single line would "jump" in zig-zag near the critical point. In other words: There seem to be two basins of attraction for the minimzer, corresponding to the B < 1 and B > 1 phase, and near the critical point, the minimizer may either fall into one basin or the other. The use of the sweeping technique, to be discussed in Sec. 6.3, allows to get around this undesirable behavior and direct the minimizer to minima with smaller basins of attraction which better resemble the true ground state in the area of influence of the quantum phase transition. Hence, this plot is not meant to show a good result, but rather illustrate the kind of failure that motivates and nessecitates the sweeping technique.
In order to demonstrate our scheme in a 2D setting, we have done calculations for a torus (i. e., a square with periodic boundary conditions) of 6 × 6 spins. We fixed the asymmetry at γ = 0.65 and varied the field strength B from 0 to 4.5 in order to cross both of the phase transitions indicated in Fig. 2b. The results, shown in Fig.  6, show prominent kinks at the expected positions of the phase transitions, and the correlations fall off in a roughly exponential manner with distance as expected. We still see additional jumps due to convergence into wrong basins, and this prompted us to seek a means to avoid this, namely the sweeping technique. The plots in the following section have been obtained this way and hence do not show such strong jumps.   The red arrows show the positions of the two phase boundaries for the infinite case (according to [Hen84], cf. Fig. 2b.) As this plot has been produced without use of the sweeping technique, some instances of convergence to the wrong minimum are evident from the jumps at B > 3. (Compare with the discussion at Fig. 3). The different curves show the correlation for spin pairs with distance (dx, dy) in x and y direction.

Bose-Hubbard model
The Bose-Hubbard model is defined for a system of harmonic oscillators, arranged in a lattice, and is described by the Hamiltonian As before, V is the set of all lattice sites, and B the set of all unordered pairs of nearest neighbour. The operatorsb † a andb a denote the ladder operators to create and annihilate a bosonic excitation of the oscillator at site a, andn a =b † aba is the number operator. The first term, called the hopping term describes the "hopping" of an excitation from a site a to a neighbouring site b, a process which occurs with the hopping strength J. The second term describes the repulsion between several bosons on the same site. To fix our energy scale, we set the repulsion U to 1 in the following, i. e., all dimensionless energies are to be understood in units of U . † The last term is relevant if the particle number is not fixed, which it is in fact not in our case. Then, assigning a value to the chemical potential µ allows to choose the mean density of the ground state.
The Bose-Hubbard Hamiltonian is of interest due to its rich phase diagram, first exposed in [FWGF89]. While its original motivation was the description of certain structured solid state systems such as arrays of Josephson junctions, interest in the †When comparing with other literature, care has to be taken that many authors use the alternative convention to set J ≡ 1. Also, J is often denoted t.  = 1, 2, 3. and the surrounding superfluid phase. The crosses in cyan mark the parameter points for which a calculation was performed. To depict the values, an surface was interpolated between these points and is used to colour the plot. Unfortunatly, we could not fully get rid of artefacts due to the interpolation (see Appendix A.5), and at those regions where the density of data points varies the contour lines become distorted. The resulting "wobbling" at regions with sparse data is hence not genuine and should vanish if one adds more data points. As cuts through the plane do not suffer from these presentation problems, we have calculated much more points for three fixed values of J and show these cuts in Fig. 8. system increased significantly with the discovery that it can be realized with cold atoms in optical lattices [JBC + 98] and with the spectacular experimental demonstration of this fact [GME + 02], where a transition from the Mott insulator phase to the superfluid phase and back was observed. (For a review, see [LSA + 07]).
In order to simulate a bosonic system with our ansatz, we restrict the number of occupations at each site. For all the following calculations, we set the dimension of each site to n = 5, i. e., the maximum occupation per site it n − 1 = 4. The creation operatorb † is defined such thatb † |n − 1 = 0 in order to truncate the Hilbert space.
A good way to distinguish the Mott insulator from the superfluid phase it to look at the mean compressibility  Fig. 7, we have many data point for J = 0.02, 0.04, 0.062, which allow us to plot vertical cuts through the plane of Fig. 7. Here, we show the values of the average occupation number per site ρ (a) and the compressibility κ (b) for the mentioned three values of the hopping strengths J. The curve for J = 0.02 looks smoothest because these points have been calculated to higher accuracy (cf. Fig. 10).  Figure 9. For values of the density ρ corresponding to very low (and integer) total particle numbers, we can obtain exact values for energy E and compressibility κ from diagonalisation of the full Hamiltonian. This shows that our method has good accuracy. (Note that the plot corresponds to a very small section of the steep left-most flank in Fig. 8b.) For the J = 0.02 curve, which has been calculated to especially high accuracy, the exact values for ρN = 4, 5 (i. e., as N = 4 2 : ρ = 0.25, 0.325) coincide with the approximated and interpolated values with an absolute deviation of only 10 −4 . Even for the values for J = 0.04, which have been obtained with much fewer sweeping steps, the accuracy is below 2 · 10 −3 .
which is strongly suppressed in the Mott insulator phase. Our results shows the form of the phase diagram in impressive clarity (Fig. 7). Although each data point only required a rather quick and rough calculation, one gets a good overview of the ground state properties in dependence of the Hamiltonian parameters J and µ. To better show the quantitative features, we have also plotted vertical cuts through the (J, µ) plane (Fig. 8). Especially for the points at J = 0.02, the zigzag sweeping technique (described later in Sec. 6.3) was used to improve accuracy by more than an order of magnitude. This can also be seen from Fig. 9: In this plot, we compare the observable κ, calculated for our approximand states, with exact values. To allow for . Non-local properties of the ground state are much harder to obtain than local ones. In order to see whether our technique is also capable of yielding good results here, we calculated (for J = 0.02 and varying values of ρ for the 4x4 sites Bose-Hubbard system) the density-density correlation γ := 1 N P a∈V nan a ′ − ρ 2 where a ′ denotes the site that is one or two lattice step(s) to the right (denoted with (1,0) and (2,0). (Correlations of this kind have, incidentally, been studied recently in [PRB06].) The connected points show the results obtained with our approximation, the isolated points have been calculated using the worm code (quantum Monte Carlo technique) [TAH98] of the ALPS project [ADG + 05] (at finite, but low temperature). The agreement is qualitatively ok but quantitatively not too precise. (Actually, the precision of two-point correlation is unfortunatly insufficient to obtain a good picture of the momentum distribution from their Fourier transformation.) this comparison, we have included values from exact Lanczos diagonalization. We are grateful to G. Pupillo, who supplied these numbers to us. He used a program, written for another project and using Arpack [LMSY96], that allows to diagonalize a small Bose-Hubbard system exactly if the number of particles is small as well. For a 4 × 4 system, up to approx. 6 particles in the 16 sites can be treated. This corresponds to the very beginning of the plots of Fig. 8, which we have magnified in Fig. 9. The accuracy of 10 −4 for the compressibility is competing well with the precision attainable with quantum Monte Carlo techniques.
While the compressibility is a local observable, the more challenging task is to study non-local observables such as density-density correlations of the form where a ′ is the site which has a fixed position relative to a, i. e. ν(a, a ′ ) = ν(a). In Fig. 10, we attempt this task for a 4 × 4 lattice. Fig. 11 shows calculations for larger systems, up to 32 × 32 sites. While the noise present in the latter plot is small on an absolut scale (note that the plot zooms in to a quite small parameter region) is is unfortunately still too large to prevent us from doing finite-size scaling. . Ground state approximation for Bose-Hubbard systems with up to 32 × 32 sites. The figure shows the compressibility κ as function of the chemical potential µ at the phase transition between the n = 1 Mott lobe and the superfluid phase above of it (for a coupling of J = 0.02U ). As one can see, the numerics cope with the large amount of sites but fails to converge with a precision sufficient to make out clear finite-size scaling trends. The reason seems to be that the number of local minima increases very strongly with system size: While the use of sweeping technique allowed to get a smooth curve for the 4 × 4 case, our attempts on the larger systems failed to get smoother than shown here. (We put considerably more effort in the data for µ < .8345, but even there, the curves are still very noisy.)

Performing the minimisation
Usually, the Hamiltonian of a spin system is given in the form of a sum of terms each of which has support on only a small number of spins -one or two in most physical cases. When the terms acting on single spins are absorbed into those acting on two spins, such a Hamiltonian can be written as where B is the set of all pairs of spins, on which a term acts jointly. These pairs are called bonds in the following, and they typically (but not necessarily) form a regular lattice. The bond Hamiltonians H ab may all be equal or not, and only in the former case, the simplifications of Sec. 3.4.1 can be used.
The minimisation problem Eq. (1) that we have to solve then takes the form Finding a minimum of a general function of many parameters is a thoroughly researched but intrinsically hard problem. Our approach is described in the following. As we do not assume the reader's familiarity with numerical optimisation, we will explain some textbook knowledge.

Local search
Given a starting point x 0 ∈ Ê K in parameter space, the problem of local search or local minimisation is the task of finding a local minimum in the vicinity of x 0 . An exhaustive treatment of this topic can be found in the standard textbook [NW99] which covers all of the algorithms mentioned in the following in detail. In our case, we have to deal with unconstrained (i. e., all values of the unbounded space Ê K are admitted) nonlinear (i. e., the energy function E(x) does not have any simple structure that would allow the use of a more powerful, specialised algorithm) local minimisation. Algorithms for this case come in two classes: So-called direct methods only require a means to evaluate the function at any given point, while gradient-based methods also require a means to obtain the gradient ∇ x E(x) at any given point.
Direct methods are convenient, but comparably slow. For very small systems (chains of up to 6 spin-1/2 sites, corresponding to less than 100 parameters), we could achieve convergence with direct methods, using the two most common ones, Nelder-Mead [NM64] and Powell [Pow64] minimisation, with Powell minimization converging faster.
For any meaningful system size, however, direct methods are much too slow. Hence, we coded routines to obtain the derivatives of E w. r. t. all kinds of parameters. † This required rather tedious calculation and coding, and the formulae and their derivation are summarised in Appendix B.
Using the gradient functions, we tried the standard minimisation methods the literature offers, namely the Fletcher-Reeves conjugate-gradient method, the Polar-Ribière conjugate-gradient method and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. We started using the implementations provided by the GNU Scientific Library [GTJ + 03], which, however, turned out to be not robust enough. Nevertheless, it could be established that convergence speed for our problem is as usually expected, i. e. Polar-Ribiére (the oldest of the algorithms, from 1964) performs worst and BFGS (the newest, from the 1970s) performs best.
BFGS is a so-called quasi-Newton or Davidon algorithm. This means, it uses the gradients obtained at the points visited so far to build up an estimate to the Hessian of the function (assuming that the Hessian varies only slowly). The approximation to the Hessian (or more precisely, to the inverse of the Hessian) is then used to make a good guess for the next step. As the approximant is, like the Hessian, a K × K matrix, updating it at each step requires O(K 2 ) steps, which scales worse than the calculation of E(x), and hence, maintaining the BFGS data becomes more expensive than evaluating the function.
The textbook solution to this problem is to use the "limited memory" variant ‡ of BFGS, which is known as L-BFGS and stores only a list of the last, say 25, gradients, and uses this data to produce a Hessian approximant "on the fly" [BLN95]. We have used the L-BFGS-B Fortran code [ZBLN97], which is very robust, not the least due to the excellent line search routine [MT94] that it uses.
A problem is the stop condition, which decides when convergence is assumed. We have tried several approaches: watching the norm of the gradient, the size of the steps, and the difference of the function value per step; these either taken point-wise, †Note that it is not helpful to obtain the gradient by numerical differentiation, as this is hardly faster than using a direct method.
‡The term "limited memory" shows that the problem of keeping the full matrix was then, in the 1980s, not so much seen in the time it takes to update the matrix but rather simple in the fact that a large matrix might not fit into the memory of the computer. or averaged over the last, say 30, or 100, steps, or taken the maximum from the last 30-or-so steps. All this could not clearly predict convergence, as there seem to be long shallow slides, which tempt one to stop minimisation prematurely. In the end, we found that waiting until progress gets below machine precision is most viable. However, in the sweeping technique, described later, the minimzation can be stopped once it seems advantageous to first continue with a neighbor.
The described technique only allows to find local minima. How do we find a good local minimum, or even the global one? Although the literature discusses many different heuristics and algorithms, it is a far from trivial task to find a good scheme. We have developed and tested two different heuristics, which shall be explained in the following two subsections. Both heuristics are two-phase methods [Sch02], i. e., they combine a global driving scheme, that chooses points to start a local search from, with the local-search algorithm just discussed.
In both cases the minimization for a specific tuple of Hamiltonian parameters is typically performed several times. Whenever a new energy value is found which is lower than all values that have been found so far for this parameter tuple, the previous energy value and corresponding state is replaced by the new one. Hence, the longer one performs these heuristics the nearer one gets to the true ground state (or more precisely: to the lowest lying state within the variational class). We emphasize that we never discard a data point unless it has been "underbid" by a new calculation. This makes the result objective even if subjective judgement has been used in carrying out the heuristics.

Multi-start and step-wise adding of superpositions
For the calculation of the results for the XY model (presented in Sec. 5.1), we tried several different heuristics in order to "move around" local minima. We finally settled for the "multi-start" scheme described now, which turned out to work best, at least for the examples we studied: We started by choosing the parameters x 0 for a state |Ψ(x 0 ) with m = 1 (i. e., without superpositions) uniformly at random from [−5; 5] K and the used the L-BFGS algorithm (as explained in Sec. 6.1) to go downhill from there towards a minimum. We allowed this minimisation only to run for a limited number of function evaluations (typically a few hundred, or up to 1000), and then restarted with another randomly chosen initial point. Having done a number (say 15) of such "trial runs", the one that reached the lowest energy within the limited number of steps is kept, the other data is discarded. The best run is now allowed to continue for a significantly longer time, until the maximum number of "main run" steps (typically, several thousand function evaluation) is exhausted or the energy change falls below machine precision. Then, we increased the number m of superpositions by one. This makes the parameter vector x longer, i. e. 2N + 2 real numbers have to be added (for N complex deformation parameters and one complex superposition coefficient, cf. Eq. (10) for n = 2). These are again chosen at random, but without changing the parameter values that have already been found. (It also helps to choose the new value for α m with small modulus, such that the new parameters do not let the state stray to far from the already established good state.) Again, a number of trial runs is started, with different random numbers to extend the parameter vector, and the best one is allowed to continue for many more steps in the main run phase. This iterative extending of parameter values was looped until m reached a certain value. This values does not have to be very large: for the results presented in Sec. 5.1, m = 3 was sufficient.
A disadvantage of this heuristics is evident in Fig. 3: Some points are much worse than their neighbours. For example, the point B = 1.09 shows a sharp peak towards worse accuracy (orange line), while its neighbours to both sides are better. What happened is that near the phase transition, the two phases compete to govern the ground state, and once the minimiser gets trapped by the catch-basin of one of the two phases it cannot switch to the other one. In most cases, the multi-start scheme will allow us to enter the main run within the catch-basin of the correct phase. If, however, the minimum energy of the two competing phases are very close, they cannot be distinguished during the short and rough trial runs, and it depends on mere chance to which phase we converge. The obvious solution is to use the value of neighbours which seem to have converged to lower energies as starting points in order to see if this allows to get to lower energies. This is the strategy that we tried next.

Zigzag sweeping
All the results for the Bose-Hubbard model (as presented in Sec. 5.2) have been obtained without the use superpositions to the right of the phase gates, i. e., with m = 1. Accuracy was instead improved in an iterative way using the following heuristics: Start minimisations from parameter vectors chosen at random for a variety of different values of Hamiltonian parameters (i. e., chemical potential µ and on-site repulsion J, for the Bose-Hubbard Hamiltonian) within the area of interest. Once the minimisations have converged more or less, compare each point with its neighbours. If one point looks better than a neighbouring, use this point's parameter vector to start a minimisation for the neighbouring point's Hamiltonian parameters.
In order to see how to do this in an objective way, look at the example of the red curve of Fig. 8. There, J = 0.02 is kept fixed and µ varies from -0.08 to 2.7. The data points are spaced rather closely (µ varies in steps of 0.003 up to 0.15). Hence, if we plot the energy E versus the chemical potential µ and zoom in to look only at a few adjacent data points, we may expect to see simply a straight line. If the points lie close enough, any deviation from linearity is less likely for physical reasons but rather due to different quality (i. e., proximity to the global minimum) of the approximation at the points. Hence, we can interpret slight deviations towards higher [lower] energy as an indication that the point is a worse [better] approximant than its neighbours. As the slope varies too little to clearly see these differences it is helpful to take the second numerical derivative to enhance the differences. Following the sketch in Fig.  12, a simple heuristic emerges: A pronounced peak in the second derivative means that the corresponding point is a better approximant than its neighbours. Hence, use its parameter vector as initial values to redo the minimisation at the neighbouring Hamiltonian parameters. If one of the two neighbours has a much lower second derivative than the other, redo only this one. Conversely, a point with a pronounced dip in the second derivatives should be re-done, starting with the parameter vector of one of its neighbours, normally the one with the higher second derivative.
Close to a phase transition, the procedure may get stuck because the step from one neighbour to the next changes the state too much. In this case, one should insert a new data points between the point that failed to get better and the neighbouring point used for the initial value.

Outlook to other minimization techniques
The literature on unconstrained nonlinear minimization is vast, and finding a good global minimization scheme requires a lot of trial and error. Apart from the twophase heuristics described above, we have also tried genuine global minimization techniques, namely simulated annealing [KGV83] and differential evolution [SP97]. Both are genuinly global in the sense that they do not employ a local search stage. However, they thus cannot take advantage of the possibility to calculate the gradient. Hence, it is not surprising that simulated annealing converged much too slowly to be of use. (Simulated annealing is used in many different fields with much success but usually for functions with a convoluted potential surface but only few variables. We have several hundreds or even thousands of variables.) Differential evolution is a genetic algorithm and shows the -on first sight surprising-feature of converging to the mean field solution. (This seems explicable from the fact that crossing two genotypes in different basins has to end up in a "compromise", which is mean field.) One further possibility might be basin hopping, which is a family of techniques (reviewed in [WS99]) that combine simulated annealing with a local search phase in order to overcome the problem states in the previous paragraph. These ideas are quite recent and research is still ongoing. So far, however, it seems that the basin hopping requires to perform very many local searches which hence have to converge fast. This is unfortunately not so in our case. It seems conceivable that variants can be developped that only use rough and hence fast local searches, and this might be a way to proceed with our method.
Another ansatz is using a clustering stage in the global phase of a two-phase method [RT87]. This allows to make multi-start much more efficient but has two difficult requirements: (i) One needs to factor any degenerecies in the minima out of the parameter space. We have not yet studied whether this is possible. (ii) The number of local minima must be small enough that one has a decent chance to encounter all of them during the local searches. Unfortunately, especially the calculations for Fig.  11 have brought us to the observation that the number of minima seems to grow very fast with the system size.
A further technique that we have tried is imaginary time evolution, which works as follows. Given an initial state |Ψ(x 0 ) chosen at random, we can find an estimate |Ψ(x i+1 ) ≈ nrm e −∆tH |Ψ(x i+1 ) for the discretized evolution of the state under the system Hamiltonian in the imaginary time direction. As for most initial states |Ψ 0 , Ψ G = lim t→∞ nrm e −tH is the ground state, this iterative evolution should converge to a good approximation of the ground state. We have decomposed e −∆tH into a product of bond terms e −∆tH ab using standard Trotter decomposition and then tried to find the ∆x ∈ Ê K that maximizes the overlap Unfortunatly, the maximization failed to give good result even for arbitrarily small time steps ∆t and we thus abandoned this approach. We should also mention that for the some of the results of or previous paper [APD + 06], we (actually, M. Plenio, who programmed this part) have used a Rayleigh minimization technique: One restricts the energy function E(x) in the sense that one keeps all but a few parameters fixed. For certain such subsets of only a few parameters, namely for the set of parameters corresponding to a single local unitary or to the phases and deformations for one pair of qubits, one can write the restricted energy function as quotient of two quadratic forms. This is also known as a generalized Rayleigh quotient and the global minimum can be find via a generalized eigenvalue problem. Such a "global minimum" typically is, however, not even a local minimum of the full energy function. The reason that we got good result for the Ising model in [APD + 06] seems now, in retrospect, have been due to the extraordinarily benign form of the corresponding energy landscape. Hence and because the scheme cannot easily generalized to spins higher than 1/2, we did not persue this any further.

Conclusion and outlook
To conclude, we have presented a class of variational states that holds promise to approximate the ground states of spin systems and bosonic systems. The advantageous properties of this class is that it includes states with an arbitrarily high entanglement and the possibility to adapt to arbitrary geometries and number of spatial dimensions. We have shown how to calculate expectation values of observables for these states and demonstrated the approximation of the ground state for two model systems, namely the XY spin-1/2 model and the Bose-Hubbard model, in one and two dimensions. Furthermore, we have explained heuristics suitable to drive the minimization.
The method works for small systems and maps out the rough structure of phase diagrams. (The system sizes, though small, were sufficient to see the phase boundaries even though phase boundaries are defined, strictly speaking, only in the thermodynamical limit.) We can calculate observables for states in systems of considerable size but have problems in approximating the ground state in larger system to precision sufficient to see actual differences between different system sizes and hence to do finite-size scaling studies.
It seems likely that this is not because there were no states in our variational class which were close enough to approximate such ground states well. Rather, we simply cannot find them because our minimzation gets trapped in local minima. Can the avoid this? This is the crucial question for the future development of the scheme, and at this moment, we may only offer some thoughts on that: It seems unlikely that the choice of another generic global minimzation algorithm is able to steer around these local minima better that those algorithms that we have tried. For further progress, it seems hence desirable to have a better understanding of the shape and structure of the manifold Ψ(Ê K ) ⊂ H, i. e., our variational set of states as described by the mapping from the parameter space. Is, for example, this manifold "folded" more and curved stronger than the equi-energy surfaces of typical system Hamiltonians? This might explain, why there are so many local minima -and getting a better grasp on topology and metric of the mapping Ψ and its image could be most helpful in finding a better way to steer towards good minima. Figure A1. Performance of our "hwgs" implementation: For a Bose-Hubbard system on a square lattice of varying size, the calculation time for a single reduced density matrix of a pair of sites (red diamonds) and for the full gradient of the energy (derivatives w. r. t. all parameters) (blue squares) is shown. These calculations have been done for states with n = 5 levels per site and no superpositions (m = 1). The program was run on AMD Opteron machines clocked at 2.2 GHz.
spots", i. e., the proverbial 10% of the code in which the processor spends 90% of the time, in an optimizing compiled language such as C++. This approach, though it may sound unusual to a traditionally oriented computer physicist, has been used in several places with much success (see e. g. the advocacy in [BCG05]), and from our experiences, we clearly recommend its use. Hence, for our second implementation, "hwgs", we followed this paradigm consequently and wrote only a small part in C++. This part was bound to the main Python code using SWIG [B + ]. For the local minimizer we used in both implementations the L-BFGS-B Fortran code [ZBLN97], linked to Python with the help of the tool f2py [Pet].

Appendix A.3. Performance
The performance of the "hwgs" implementation can be seen in Fig. A1. The blue curves shows the time required to calculate energy and full gradient for one parameter vector at various system sizes. In order to see the time required to find a good approximand, this has to be multiplied with the number of function evaluations needed by the minimiser.
Usually, one wants to find approximands for several different values of the Hamiltonian parameters. Then, one can save much time by running these minimisation in parallel if one has access to a computer cluster.

Appendix A.4. Availability
We would welcome to see our code been used in further projects. Hence, researchers who are interested in applying our code in their own projects are encouraged to contact the authors. derivative is expressed by a simple formula: For any invertible square matrix A = A(t) that depends differentiably on a real parameter t, we have (for a proof, see e. g. [PlM]). We need n 2 real parameters to parametrise a Hermitian n × n matrix, which we arrange to form a n × n upper triangular matrixÃ with real entries in the diagonal, complex entries in the upper triangle and zeroes in the lower triangle. A =Ã +Ã † is now Hermitian and is unitary. Using Eq. (B.1), we get We use this to calculate , whereρ bc is the reduced density matrix without application of the local unitaries, i. e. ρ bc = (U b ⊗ U c ) † ρ bc (U b ⊗ U c ).

Appendix B.2. Derivatives w. r. t. the deformation parameters
For the derivatives w. r. t. the parameters Re d l c,s and Im d l c,s , we have to take care of the normalisation of ρ ab as it depends on those parameters. We abbreviate the middle line of Eq. (12) withρ ab and start with using Eq. (19) in order to see that  |r e iΦ r 1 r 2 , so that we can write -due to Eq. (21)-Eq. (12) as We have to take care that in case of a symmetrisation according to Sec. 3.4.1 a phase can occur more than once in the expression for ρ ab , and hence, we make use of the phase index mapping ν(a, b) ∈ {1, . . . , R} (Sec. 3.4.1), that associates with every pair of spins a, b a phase matrix Φ ν(a,b) . We write (with r ≡ (r 1 r 2 )) and proceed to discuss the two derivatives in this expression. The first one is evidently non-zero only if ν = ν(a, b) and then evaluates to The set notation in the Kronecker deltas accounts for the fact that Φ is symmetrised, Φ r1r2 = Φ r2r1 , (cf. again Sec. 3.4.1) and hence, the order of the components of the vectors r, q, and q ′ must be disregarded. For the second term, we pull the derivative inwards The sum over e formally runs over N −2 terms. Most of these vanish, however, namely all those for which neither ν = ν(a, e) nor ν = ν(b, e). For translation-invariant phases, the number of remaining terms is of the order of the coordination number of the lattice.