$m^\ast$ of two-dimensional electron gas: a neural canonical transformation study

The quasiparticle effective mass $m^\ast$ of interacting electrons is a fundamental quantity in Fermi liquid theory. However, the precise value of the effective mass of the uniform electron gas has remained elusive after decades of research. The newly developed neural canonical transformation approach [Xie et al., J. Mach. Learn. 1, (2022)] offers a principled way to extract the effective mass of the electron gas by directly calculating the thermal entropy at low temperature. The approach models a variational many-electron density matrix using two generative neural networks: an autoregressive model for the momentum occupation and a normalizing flow for the electron coordinates. Our calculation reveals a suppression of the effective mass in the two-dimensional spin-polarized electron gas, which is more pronounced than in previous reports in the low-density strong-coupling region. This prediction calls for verification in two-dimensional electron gas experiments.


Introduction
Landau's Fermi liquid theory [1] is one of the cornerstones of condensed matter physics [2]. It explains why the non-interacting picture largely applies to real metals despite the strong Coulomb repulsion between electrons. The essence is that a Fermi liquid consists of quasiparticles that are adiabatically connected to bare electrons. Such a renormalization procedure can be encapsulated in only a handful of parameters, from which one can predict a broad range of physical properties of the system. One such parameter is the quasiparticle effective mass $m^\ast$, which is the central focus of this work.
Depending on the spatial dimension and spin polarization of the uniform electron gas, previous results may differ quantitatively or even qualitatively on whether the quasiparticle effective mass is enhanced ($m^\ast/m > 1$) or suppressed ($m^\ast/m < 1$) compared to the bare electron mass $m$. Resolving these discrepancies within a single approach can be challenging. For example, there is no systematic way to improve the various approximate analytic calculations toward a consensus [3][4][5][6][7][8][9]. One would hope that numerical calculations offer more reliable predictions of the effective mass. However, two recent quantum Monte Carlo (QMC) studies [16,17] report drastically different effective masses for the three-dimensional electron gas. The reason for this discrepancy is unclear; it may be related to different (but equivalent) ways of defining the effective mass, as well as to the different approximations employed in the methods. The situation is also unclear in the two-dimensional case, even when one employs the same kind of QMC method [12][13][14][15]. There, the predicted effective masses differ qualitatively depending on how the QMC data are processed. These discrepancies are related to ambiguities in relating excited-state energies to the effective mass [19], which in turn entangle with approximations and finite size errors in the calculations.
Resolving the discrepancy in the effective mass of the uniform electron gas is not only a theoretical question of pure academic interest, but also of direct experimental relevance. One can measure the effective mass via quantum oscillations [20][21][22][23][24] or thermodynamic measurements [25] in a semiconductor quantum well, which is a high-quality realization of the two-dimensional electron gas (2DEG) with tunable density. Unless otherwise specified, we focus on the spin-polarized case in this paper, which can be conveniently realized in experiments by applying an in-plane magnetic field.
At sufficiently low temperature, the entropy per particle of the 2DEG, $s/k_B = \frac{\pi^2}{3} \frac{m^\ast}{m} \frac{T}{T_F}$, depends linearly on the temperature $T$, where $k_B$ is the Boltzmann constant and $T_F$ is the Fermi temperature. Therefore, one can directly estimate the effective mass from the entropy ratio of the interacting ($s$) and non-interacting ($s_0$) electron gases [26]:
$\frac{m^\ast}{m} = \frac{s}{s_0}$. (1)
By direct access to thermodynamic observables, one avoids the subtleties of relating excitation energies of a finite-size system to the quasiparticle effective mass [19]. However, previous finite-temperature calculations of the uniform electron gas do not resolve the effective-mass issue [26] because they focus on the melting of the Wigner crystal [27] or the equation of state in the warm-dense-matter regime [28][29][30], both of which lie outside the scope of Fermi-liquid-like behavior. This is partially because the adopted QMC methods typically suffer less from the sign problem at low density and high temperature, where the fermionic nature of the system is less pronounced.
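Eq. (1) reduces the effective-mass estimate to a ratio of two entropies evaluated at the same $T/T_F$. A minimal numerical sketch (the interacting entropy used here is a hypothetical number, not data from this work):

```python
import math

def ideal_entropy_per_particle(t_over_tf):
    """Low-temperature ideal-gas entropy per particle, s0/kB = (pi^2/3) T/T_F."""
    return math.pi ** 2 / 3.0 * t_over_tf

def effective_mass_ratio(s_interacting, s_ideal):
    """m*/m from the entropy ratio, Eq. (1); both entropies at the same T/T_F."""
    return s_interacting / s_ideal

s0 = ideal_entropy_per_particle(0.15)   # non-interacting reference at T/T_F = 0.15
s = 0.8 * s0                            # hypothetical interacting entropy
print(effective_mass_ratio(s, s0))      # ~0.8, i.e. a suppressed effective mass
```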
In this paper, we employ the recently developed neural canonical transformation approach [31] to study the 2DEG at low temperature and estimate the effective mass via the entropy ratio Eq. (1). Neural canonical transformation leverages recent advances in deep generative models [32] for variational free energy calculations of interacting fermions at finite temperature. This approach is particularly suitable for the present task for two reasons: firstly, the employed variational density matrix ansatz fits nicely with the philosophy of Fermi liquid theory; secondly, the thermal entropy can be directly accessed, unlike in conventional QMC methods.
Consider $N$ electrons in a two-dimensional periodic box of length $L$. We set the energy unit to be the Rydberg, $\mathrm{Ry} = \hbar^2/2ma_0^2$, where $a_0 = \hbar^2/me^2$ is the Bohr radius. The dimensionless Wigner-Seitz parameter $r_s = L/(\sqrt{\pi N}\, a_0)$ measures the average distance between electrons in units of the Bohr radius. The Hamiltonian reads [33]
$H = -\sum_i \nabla_i^2 + \sum_{i<j} \frac{2}{|r_i - r_j|} + \mathrm{const}$, (2)
where $r_i = (x_i, y_i)$ is the coordinate of the $i$-th electron and lengths are measured in units of $a_0$. The constant term in Eq. (2) accounts for the energy due to the neutralizing background.

Method
To investigate the finite-temperature properties of Eq. (2), we minimize the variational free energy
$F = \mathrm{Tr}\left[\rho \left(H + \frac{1}{\beta} \ln \rho\right)\right]$ (3)
with respect to a many-electron density matrix $\rho$, where $\beta = 1/k_B T$ is the inverse temperature. In practice, $T$ is measured relative to the Fermi temperature $k_B T_F = 4\,\mathrm{Ry}/r_s^2$. The variational free energy Eq. (3) is lower bounded by the true free energy of the system, i.e., $F \geq -\frac{1}{\beta} \ln Z$, where $Z = \mathrm{Tr}\, e^{-\beta H}$ is the partition function. The equality holds only when the variational density matrix coincides with the exact one, i.e., $\rho = e^{-\beta H}/Z$.
The variational density matrix is expressed as a weighted sum over a family of orthonormal many-body basis states:
$\rho = \sum_K p(K)\, |\Psi_K\rangle \langle\Psi_K|$. (4)
Here $K \equiv \{k_1, k_2, \ldots, k_N\}$ represents a set of occupied momenta, each (under the periodic boundary conditions) taking one of the discrete values $k = \frac{2\pi n}{L}$ ($n \in \mathbb{Z}^2$) without duplication, as required by the Pauli exclusion principle. This setting is closely in line with an essential point of Fermi liquid theory [1]: one can label the low-energy excited states using the same quantum numbers as the ideal Fermi gas. In practice, one has to truncate the momenta within an energy cutoff, which is set sufficiently large to avoid bias in the considered temperature range. Therefore, for $M$ possible momenta within the energy cutoff, the summation in Eq. (4) involves $\binom{M}{N}$ terms. See Fig. 1(a) for a schematic illustration. Substituting the density matrix ansatz Eq. (4) into Eq. (3), one finds an unbiased estimator for the variational free energy:
$F = \mathbb{E}_{K \sim p(K)}\, \mathbb{E}_{R \sim |\Psi_K(R)|^2} \left[\frac{1}{\beta} \ln p(K) + E^K_{\mathrm{loc}}(R)\right]$. (5)
Here $R \equiv \{r_1, r_2, \ldots, r_N\}$ is the set of electron coordinates and $\Psi_K(R) = \langle R|\Psi_K\rangle$ is the corresponding basis wavefunction. The local energy is defined as $E^K_{\mathrm{loc}}(R) = \mathrm{Re}\left[\frac{H \Psi_K(R)}{\Psi_K(R)}\right]$, which contains the Coulomb term $\sum_{i<j} \frac{2}{|r_i - r_j|} + \mathrm{const}$. We use Ewald summation to evaluate the Coulomb interaction term, while the gradient and Laplacian operators appearing in the kinetic term are computed using automatic differentiation. We model the Boltzmann distribution $p(K)$ and the wavefunction $\Psi_K(R)$ using two generative networks.
We use a variational autoregressive network [34,35] to model the normalized Boltzmann distribution $p(K)$ over the set of discrete momenta:
$p(K) = \prod_{i=1}^{N} p(k_i | k_1, \ldots, k_{i-1})$, (6)
where each factor in the product is a parametrized conditional probability. To facilitate the sampling of these conditional probabilities, we assign a unique index $\mathrm{idx}(k) \in \{1, 2, \ldots, M\}$ to each of the $M$ available momenta, e.g., according to their single-particle energies. We then model $p(K)$ using a neural network that maps the set $K = \{k_1, k_2, \ldots, k_N\}$ to $N$ logit vectors $\hat{k}_1, \ldots, \hat{k}_N \in \mathbb{R}^M$. To ensure the autoregressive property, i.e., that $\hat{k}_i$ depends only on $k_j$ with $j < i$, we implement the network as a transformer with causal self-attention layers [36]. In addition, to accommodate the Pauli principle, we require Eq. (6) to assign nonzero probabilities only to those momentum configurations satisfying $\mathrm{idx}(k_1) < \mathrm{idx}(k_2) < \ldots < \mathrm{idx}(k_N)$. This can be achieved by carefully masking out disallowed configurations in the output logits $\hat{k}_i$ [37]. We note that Refs. [38,39] devised an alternative autoregressive model in the bit-string representation with a fixed number of nonzero elements. Using the autoregressive model Eq. (6) rather than enumerating all possible excitations [31] in the summation Eq. (4) allows us to incorporate a combinatorially large number of many-body states and access a broader temperature range. One can estimate the thermal entropy per particle unbiasedly via the estimator
$\frac{s}{k_B} = -\frac{1}{N}\, \mathbb{E}_{K \sim p(K)}\left[\ln p(K)\right]$. (7)
Note that such a simple and tractable expression for the entropy is a direct consequence of the orthonormality of the many-body basis $|\Psi_K\rangle$. Next, to parametrize a family of orthonormal many-body states $|\Psi_K\rangle$, we perform a unitary transformation on the basis of plane-wave Slater determinants. In practice, we construct the unitary transformation as a learnable bijection from the electron coordinates $R$ to a new set of quasiparticle coordinates $\zeta$, as illustrated in Fig. 1(b).
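The masking logic that enforces $\mathrm{idx}(k_1) < \ldots < \mathrm{idx}(k_N)$ can be prototyped independently of the transformer. The sketch below replaces the network output $\hat{k}_i$ with fixed dummy logits; it illustrates only the masked ancestral sampling, not our actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ordered_indices(logits, N):
    """Ancestrally sample N strictly increasing indices out of M possibilities.

    logits[i] plays the role of the network output for the i-th particle.
    The mask allows only indices above the previously drawn one, and leaves
    enough room for the particles still to be placed (the Pauli constraint).
    """
    M = logits.shape[1]
    sample, prev = [], -1
    for i in range(N):
        mask = np.full(M, -np.inf)
        lo, hi = prev + 1, M - (N - 1 - i)   # allowed window for the i-th index
        mask[lo:hi] = 0.0
        p = np.exp(logits[i] + mask)
        p /= p.sum()
        prev = int(rng.choice(M, p=p))
        sample.append(prev)
    return sample

K = sample_ordered_indices(np.zeros((3, 8)), N=3)
assert all(a < b for a, b in zip(K, K[1:]))   # a Pauli-allowed configuration
```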
The wavefunction reads [31]
$\Psi_K(R) = \frac{1}{\sqrt{N!}}\, \det\left[\frac{e^{i k_i \cdot \zeta_j}}{L}\right] \left|\det \frac{\partial \zeta}{\partial R}\right|^{1/2}$, (8)
where $\zeta = \{\zeta_1, \ldots, \zeta_N\}$ are the quasiparticle coordinates. The originally non-interacting plane waves carried by individual electrons interfere with each other due to correlation effects introduced by the coordinate transformation. This picture closely mimics the renormalization process depicted in Fermi liquid theory. Technically, Eq. (8) differs from the standard Slater-backflow trial wavefunction by the presence of an additional Jacobian determinant. This factor turns out to play a crucial role in preserving the orthonormality of the basis, $\int dR\, \Psi_K^*(R)\, \Psi_{K'}(R) = \delta_{KK'}$. Note also that the state Eq. (8) involves a coordinate transformation in a many-body context [42], where one needs to deal with the extra issue of permutation equivariance compared to the single-particle setting [43][44][45].
We use a normalizing flow [46] to implement the bijective map between the electron and quasiparticle coordinates. This can be regarded as a generalization of the backflow transformation with invertible neural networks [47]. We compose FermiNet [48] blocks into a residual network to carry out a permutation- and translation-equivariant transformation of the electron coordinates. We also modify the electron distance features to comply with the periodic nature of the simulation box [37].
It is instructive to examine the present approach in two limiting cases. Firstly, in the non-interacting limit the problem reduces to a classical statistical mechanics problem: one needs to distribute $N$ particles among $M$ possible momenta according to the probability distribution $p(K)$ so as to minimize the free energy
$F = \sum_K p(K) \left[\frac{1}{\beta} \ln p(K) + \sum_{i=1}^{N} \varepsilon_{k_i}\right]$,
where the second term is the non-interacting energy with single-particle energies $\varepsilon_k$. In this case, one can trivially set the normalizing flow to an identity map and optimize only the autoregressive network. Figure 2(a) shows a typical training process, where the entropy Eq. (7) steadily approaches the exact value [40]. Note that calculating the exact non-interacting entropy in the canonical ensemble is not a completely trivial task [37]. Secondly, in the zero-temperature limit, $p(K)$ is nonzero only for the particular momentum configuration $K_0$ corresponding to the closed-shell non-interacting ground state. The present approach then reduces to the usual ground-state variational Monte Carlo method. As an example, Fig. 2(b) shows the optimized ground-state energy per particle for a particular set of system parameters, which is lower than a previous report using the Slater-Jastrow ansatz [41]. In general, one has to jointly optimize the autoregressive model and the normalizing flow. We show additional benchmark results for the three-dimensional spin-polarized electron gas in the Appendix [37].
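The non-interacting limit can be checked by brute force for a toy system: the free energy $\sum_K p(K)\,[\ln p(K)/\beta + E_K]$ is minimized by the Boltzmann distribution, at which point it equals $-\ln Z/\beta$. A sketch with illustrative single-particle energies ($N = 2$ particles in $M = 4$ levels):

```python
import itertools, math

beta = 2.0
eps = [0.0, 1.0, 2.0, 3.0]   # toy single-particle energies, M = 4

# Enumerate all Pauli-allowed momentum configurations of N = 2 particles.
configs = list(itertools.combinations(range(4), 2))
energies = [eps[i] + eps[j] for i, j in configs]

def free_energy(p):
    """Variational free energy F[p] = sum_K p(K) (ln p(K)/beta + E_K)."""
    return sum(pk * (math.log(pk) / beta + ek) for pk, ek in zip(p, energies))

Z = sum(math.exp(-beta * e) for e in energies)
p_boltz = [math.exp(-beta * e) / Z for e in energies]

assert abs(free_energy(p_boltz) - (-math.log(Z) / beta)) < 1e-12
uniform = [1.0 / len(configs)] * len(configs)
assert free_energy(uniform) > free_energy(p_boltz)   # any other p does worse
```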
To understand how the variational free energy calculation reveals the quasiparticle effective mass, note that the effective mass affects the low-temperature thermodynamics of the system via the density of states of low-lying excitations. In practice, we pretrain the state occupation $p(K)$ using the non-interacting energies as in Fig. 2(a). Thus, $p(K)$ initially gives the same entropy as the ideal Fermi gas. We initialize the normalizing flow network to be close to an identity map. Training the normalizing flow modifies the many-body basis Eq. (8) and thus changes the quasiparticle energy spacing and density of states. The autoregressive model, in turn, adjusts the Boltzmann distribution accordingly, causing the entropy to depart from its initial non-interacting value. Putting it all together, the entropy ratio Eq. (1) thus provides a principled way to extract the effective mass from the quasiparticle energy spectrum.

Results
To access the quasiparticle effective mass of the electron gas via the entropy ratio Eq. (1), one should consider temperatures $T$ well below the Fermi temperature $T_F$. Figure 3(a) shows the entropy per particle of the ideal Fermi gas in the thermodynamic limit $N = \infty$, which exhibits linear temperature dependence at sufficiently low $T$. On the other hand, the temperature should not be too low in practical calculations; otherwise the finite size effect would cause the entropy to deviate from the ideal linear behavior due to a small energy scale $\hbar^2/mL^2 \sim N^{-1}$, as shown in Fig. 3(a) for $N = 29$ and $57$ non-interacting electrons. We choose to set $T/T_F = 0.15$ to balance these two considerations.
To obtain conclusive predictions of the effective mass, we adopt twist-averaged boundary conditions [49] to alleviate the finite size effect [37]. Moreover, one can reasonably expect that, by taking the entropy ratio Eq. (1), the remaining finite size errors of the interacting and non-interacting systems further cancel out. Fig. 3(b) shows the interacting entropy as a function of training epochs for $N = 29$ and various densities $r_s$. One clearly sees that the entropies are reduced from their initial non-interacting values, indicating a suppression of the effective mass upon increasing $r_s$. The entropy fluctuates more strongly than the free energy since it is more sensitive to the variation of model parameters during training.
Early analytical calculations [7,50] find an enhanced or non-monotonic $r_s$-dependence of the effective mass in the spin-polarized 2DEG. On the other hand, several QMC calculations [13][14][15] consistently find a monotonically suppressed effective mass as $r_s$ increases, but the quantitative predictions still differ, especially for large $r_s$, as shown in Fig. 4. The discrepancy is due to different ways of extracting the effective mass from the excitation energies. Refs. [13,14] obtain the effective mass by differentiating the fitted energy band, while Ref. [15] relies on its relation to other Landau Fermi liquid parameters via Galilean invariance. Both approaches possess a number of uncertainties, such as the fitting range in momentum space and the integration error in estimating the Fermi liquid parameters. That said, the authors of Ref. [15] appeared to be more confident in the larger effective mass values reported in Refs. [13,14]. Figure 4 also shows our predictions based on the entropy ratio Eq. (1). We extract the interacting entropy by performing an exponentially-weighted moving average over the training epochs. The error bars take into account both the statistical uncertainties due to Monte Carlo sampling and the fluctuation of variational parameters due to noisy gradients [51]. The computed effective mass decreases monotonically with increasing interaction strength. In the small-$r_s$ region, the predicted effective mass converges well to the analytical result of Ref. [50], shown as the green dashed line in Fig. 4, which is reliable in the weak-coupling limit $r_s \to 0$. However, our predictions are lower than previous QMC results [14,15] when $r_s$ is large. Such differences cannot be attributed to the remaining finite size errors [37]. Since the present approach based on the entropy ratio Eq. (1) is less subject to ad hoc assumptions and data processing schemes, we believe it offers a cleaner and more reliable prediction of the effective mass of the spin-polarized 2DEG.
On the experimental side, both an enhanced [52] and, subsequently, a suppressed [53,54] effective mass of the spin-polarized 2DEG have been reported in different systems. The discrepancy was attributed to the valley degeneracy [55,56] of the sample used in Ref. [52]. The experimental data [53,54] spread widely between our predictions and those of Refs. [13][14][15]. Confirmation of the present results calls for a new generation of experimental efforts, where besides data uncertainty one also has to account for various real-world complications for a fair comparison, such as the thickness of the electron layer, disorder, and temperature effects [7, 57-59].

Discussions
The variational approximation of the present approach may be improved by adopting alternative network ansätze [60][61][62] and optimization schemes [63,64]. We have documented the original data and trained models in the code repository [51] to facilitate further developments. In the present implementation, the finite size errors have been largely reduced by adopting twist-averaged boundary conditions. To scale up the calculation to larger systems, one can employ machine learning techniques such as gradient checkpointing [65] and distributed training [66]. In particular, techniques for efficiently training flow models and invertible neural networks [67][68][69][70][71][72] can be useful. It would also be profitable to integrate the present standalone implementation into an existing software framework [73]. On the other hand, a rigorous finite-size scaling theory for the entropy of the uniform electron gas would also be valuable for a direct extrapolation to the thermodynamic limit.
With suitable extensions of the model architecture, the technique developed in this paper applies equally well to the spin-unpolarized case. This may shed new light on the conflicting results reported in the literature for the three-dimensional [16,17] and two-dimensional [10][11][12][13][14][15] electron gases. While we have focused on the quasiparticle effective mass, the outcomes of the present approach are also directly relevant to the exchange-correlation free energy, which is useful for thermal density functional theory [26,30] and thermodynamic measurements in the 2DEG [25]. Along this line, it is also possible to extend the present study to the grand canonical ensemble and compute the compressibility and susceptibility of the electron gas, which are measurable in experiments [22,23,74,75]. Finally, having direct access to the energies and wavefunctions of low-lying excitations may also allow for the calculation of spectral functions at real frequencies.
Neural canonical transformation [31] not only serves as a variational free energy approach powered by deep learning techniques, but also nicely incorporates basic notions of Fermi liquid theory. For example, the probabilistic model $p(K)$ in Eq. (4) actually encapsulates Landau's energy functional for quasiparticles. Moreover, the unitary transformation implemented as a normalizing flow between the electron and quasiparticle coordinates vividly illustrates the notion of adiabatic continuity upon switching on interactions [2]. Because of these technical and physical considerations, we are optimistic about the outcome of applying neural canonical transformation to a broader class of interacting fermion problems.

A Benchmark for three-dimensional spin-polarized uniform electron gas
We carry out an additional benchmark calculation for the three-dimensional spin-polarized uniform electron gas in a periodic cubic box of length $L$. We use the same network architectures and training procedure as described in the main text, except that $r_s = 10$. The two panels of Fig. S1 display a typical training process of the kinetic energy $k$ and potential energy $v$ per particle, respectively. Note that $r_s = 10$ lies within the low-density parameter range where the restricted path integral Monte Carlo method [28] produces accurate results. The converged values of $k$ and $v = -0.14360(7)$ agree with the red horizontal lines, $k = 0.0426(1)$ and $v = -0.14358(1)$, obtained from the restricted path integral Monte Carlo calculation [28].

B Entropy of non-interacting fermions in the canonical ensemble
To compute the entropy of $N$ non-interacting fermions at inverse temperature $\beta$, we first compute the canonical partition function $Z_N$ of the system via the recursion formula [40]
$Z_N = \frac{1}{N} \sum_{\ell=1}^{N} (-1)^{\ell+1} z_\ell\, Z_{N-\ell}$, (S1)
where $Z_0 = 1$ and $z_\ell = \sum_k \exp\left(-\ell\beta \frac{\hbar^2 k^2}{2m}\right)$ is the single-particle partition function at inverse temperature $\ell\beta$. Note that Eq. (S1) involves adding exponentially small numbers with alternating signs; one thus needs high-precision arithmetic to obtain reliable results for large particle numbers $N$.
After obtaining $Z_N$, we evaluate the entropy per particle of the ideal Fermi gas using the standard formula
$\frac{s_0}{k_B} = \frac{1}{N}\left(\ln Z_N - \beta\, \frac{\partial \ln Z_N}{\partial \beta}\right)$. (S2)
The derivative in Eq. (S2) can be conveniently computed by automatic differentiation through the high-precision arithmetic, which is natively supported in, e.g., Julia [76]. Alternatively, one can manually differentiate both sides of Eq. (S1) with respect to $\beta$ to derive a similar recursion relation for the energy $E_N = -\frac{\partial}{\partial \beta} \ln Z_N$:
$E_N = \frac{1}{N Z_N} \sum_{\ell=1}^{N} (-1)^{\ell+1} z_\ell\, Z_{N-\ell} \left(\ell\, e_\ell + E_{N-\ell}\right)$, (S3)
where $e_\ell = -\frac{\partial}{\partial(\ell\beta)} \ln z_\ell$ is the expected single-particle energy at inverse temperature $\ell\beta$. The starting point of this recursion is, unsurprisingly, $E_0 = 0$.
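The recursions for $Z_N$ and $E_N$ above are easy to implement with high-precision arithmetic. The sketch below uses Python's standard `decimal` module as a stand-in for the Julia implementation mentioned in the text, with a small arbitrary single-particle spectrum, and checks the result against brute-force enumeration:

```python
import itertools, math
from decimal import Decimal, getcontext

getcontext().prec = 50   # high precision: the recursion adds terms of alternating sign

def canonical_Z_and_E(eps, N, beta):
    """Z_N and E_N of N free fermions via the recursion (S1) and its derivative.

    eps is any finite list of single-particle energies (e.g. k^2 within a
    momentum cutoff); beta is the inverse temperature.
    """
    beta = Decimal(beta)
    # z_l and mean single-particle energies e_l at inverse temperature l*beta
    z = [None] + [sum((-l * beta * Decimal(e)).exp() for e in eps)
                  for l in range(1, N + 1)]
    e1 = [None] + [sum(Decimal(e) * (-l * beta * Decimal(e)).exp() for e in eps) / z[l]
                   for l in range(1, N + 1)]
    Z, E = [Decimal(1)], [Decimal(0)]
    for n in range(1, N + 1):
        Zn = sum((-1) ** (l + 1) * z[l] * Z[n - l] for l in range(1, n + 1)) / n
        En = sum((-1) ** (l + 1) * z[l] * Z[n - l] * (l * e1[l] + E[n - l])
                 for l in range(1, n + 1)) / (n * Zn)
        Z.append(Zn)
        E.append(En)
    return Z[N], E[N]

eps, N, beta = [0.0, 0.5, 1.0, 1.5], 2, 1.3
Z2, E2 = canonical_Z_and_E(eps, N, beta)
# Brute-force reference: sum over all occupation patterns of N fermions.
states = list(itertools.combinations(eps, N))
Zref = sum(math.exp(-beta * sum(s)) for s in states)
Eref = sum(sum(s) * math.exp(-beta * sum(s)) for s in states) / Zref
assert abs(float(Z2) - Zref) < 1e-10 and abs(float(E2) - Eref) < 1e-10
```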

C Twist-averaged boundary conditions
In this work, we aim to simulate an interacting Fermi liquid consisting of a finite number of electrons in a finite box. Under the conventional periodic boundary conditions (PBC), the single-particle momenta reside on a discrete lattice $k = \frac{2\pi n}{L}$ ($n \in \mathbb{Z}^2$), as illustrated in Fig. 1(a) of the main text, and the identification of a sharp spherical Fermi surface characteristic of the system is ambiguous. This is a major contribution to the finite size errors of various physical quantities.
A useful technique to alleviate the finite size effect is to use twist-averaged boundary conditions (TABC) [49]. This amounts to averaging physical observables over a twist vector $\theta_t \in [-\pi, \pi]^2$, which corresponds to the extra phase picked up when the electrons wrap around the periodic boundaries of the simulation box. Consequently, the single-particle momenta $k = \frac{1}{L}(2\pi n + \theta_t)$ are shifted away from the integer lattice points. When the twist average is performed, the effective state occupation "smears out" continuously as in the thermodynamic limit, and thus yields better scaling behavior for various physical quantities.
To illustrate the impact of TABC on the simulation, we compute the entropies per particle s 0 of two-dimensional ideal Fermi gas at the temperature T /T F = 0.15 for various particle numbers N , as shown in Fig. S2. The results labeled as "PBC" are computed at the Γ point θ t = 0, whereas the "TABC" results are obtained by averaging the entropy over 10000 uniformly sampled twists. It is clear that TABC results in more regular scaling behavior for small N and converges more smoothly to the thermodynamic limit N = ∞. We also plot the same data versus inverse particle number N −1 in the inset to better visualize how they extrapolate to the thermodynamic limit.
In practice, we choose to implement the twist average over a $2 \times 2$ Monkhorst-Pack grid [77], which corresponds to a single twist vector $\theta_t = (\frac{\pi}{2}, \frac{\pi}{2})$ [78, 79] after taking into account the point group symmetry of the simulation box. Such a scheme is more convenient than randomly sampling the twist vectors, and introduces essentially no extra computational cost or code development effort. We also plot the non-interacting entropies per particle under this scheme in Fig. S2, labeled as "2 × 2", which are in excellent agreement with the results obtained by random sampling. For the largest system size ($N = 57$ electrons) employed in our calculation of the quasiparticle effective mass, the non-interacting entropy deviates from the thermodynamic limit value by about 8%. One would then reasonably expect the uncertainties in the final estimate of the effective mass to be within the same level, assuming a similar $N$-scaling behavior of the interacting entropy.
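The symmetry reduction above can be verified numerically: the four twists of the 2×2 Monkhorst-Pack grid are related by the point group of the square box and hence produce identical single-particle spectra. A small sketch (the range of the integer grid is an arbitrary toy cutoff):

```python
import numpy as np

L = 1.0
ns = np.array([(nx, ny) for nx in range(-6, 7) for ny in range(-6, 7)])

def spectrum(theta):
    """Sorted single-particle energies |k|^2 with k = (2*pi*n + theta)/L."""
    k = (2 * np.pi * ns + np.array(theta)) / L
    return np.sort((k ** 2).sum(axis=1))

half = np.pi / 2
ref = spectrum((half, half))
# The other three twists of the 2x2 Monkhorst-Pack grid give the same spectrum:
for theta in [(half, -half), (-half, half), (-half, -half)]:
    assert np.allclose(spectrum(theta), ref)
```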

D Model architectures
This section summarizes the network architectures used for the autoregressive model and normalizing flow. They are adapted from the transformer [36] and FermiNet [48], respectively. Please refer to the original publications for more background on these models.

D.1 Autoregressive model for p(K )
In Algorithm 1, given the momenta $k_1, k_2, \ldots, k_N$ occupied by the $N$ electrons, CausalTransformer [36] outputs $N$ $M$-dimensional logit arrays $\hat{k}_1, \ldots, \hat{k}_N$. Table S1 summarizes the hyperparameter values of the CausalTransformer network adopted throughout our calculations. For more implementation details, please refer to our source code [51]. Other architectures for the autoregressive model are also possible, such as the masked autoencoder [80], but our choice based on the transformer turns out to scale more favorably to large systems in terms of the number of trainable parameters involved. Table S1: The hyperparameters of the causal transformer adopted in this work. Note that we choose the non-linear activation of the fully connected neural network involved in the architecture to be tanh, which is smooth enough to avoid potential issues upon automatic differentiation.

D.2 Normalizing flow for Ψ K (R)
Recall that our goal is not to model a single wavefunction as in ground-state calculations, but an exponentially large family of orthonormal many-electron basis states $\Psi_K(R)$. As shown in Eq. (8) of the main text, we achieve this goal by bijectively mapping the original electron coordinates $R$ to the quasiparticle coordinates $\zeta$. The transformation is detailed in Alg. 2, where $h_1, h_2$ are the one- and two-particle feature sizes, respectively, and $\ell$ ranges from 1 to $d$. Initially, $f_2$ contains pairwise distance features of the electrons in a periodic box [81,82], while $f_1$ is set to zero to guarantee the translation equivariance property. Each fully connected layer involved in the algorithm has its own independent parameters, including those at the final stage. Throughout our calculations, the network depth is set to $d = 2$ and $h_1 = h_2 = 16$. We note that one can repeat Alg. 2 several times to compose an iterative-backflow-like transformation [31,83] in terms of neural networks.
In principle, one may wish to ensure invertibility of the coordinate transformation [68] to obtain a genuine normalizing flow model. To the best of our knowledge, no such rigorous guarantee exists for Alg. 2 as described above. Nevertheless, the practical training process appears stable, probably because the network we adopt is not very deep.

E Details of the training procedure
We sample the electron momenta $K$ directly from the autoregressive model in an ancestral manner; see Eq. (6) in the main text. Given the momenta, we then sample the electron coordinates $R$ using the Metropolis algorithm according to the Born probability of the wavefunction Eq. (8). The batch size is set to 8192. We compute the Jacobian $\frac{\partial \zeta}{\partial R}$ of the coordinate transformation involved in the wavefunction ansatz using forward-mode automatic differentiation in Jax [84].
Building on these self-generated samples, we train the autoregressive model Eq. (6) and the normalizing flow Eq. (8) jointly to minimize the variational free energy Eq. (5). The gradient estimators with respect to the parameters $\phi$ and $\theta$ of the autoregressive model and the normalizing flow, respectively, can be derived as
$\nabla_\phi F = \mathbb{E}_{K \sim p(K)}\, \mathbb{E}_{R \sim |\Psi_K(R)|^2} \left[\left(\frac{1}{\beta} \ln p(K) + E^K_{\mathrm{loc}}(R)\right) \nabla_\phi \ln p(K)\right]$,
$\nabla_\theta F = 2\, \mathbb{E}_{K \sim p(K)}\, \mathbb{E}_{R \sim |\Psi_K(R)|^2} \left[\mathrm{Re}\left(E^K_{\mathrm{loc}}(R)\, \nabla_\theta \ln \Psi^*_K(R)\right)\right]$, (S4)
where the wavefunction $\Psi_K(R)$ is assumed to be complex-valued, as suggested by Eq. (8). One can employ the control variate method [31,34,85] to further reduce the variance of these estimators. Below we present some more techniques employed to make the training as efficient as possible.
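The score-function form of the $\phi$-gradient can be sanity-checked on a toy categorical distribution, comparing the Monte Carlo estimator $\mathbb{E}[f(K)\, \nabla_\phi \ln p(K)]$ against the exact gradient of $\mathbb{E}_p[f]$ (all quantities below are illustrative stand-ins, not our actual model):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(phi):
    e = np.exp(phi - phi.max())
    return e / e.sum()

f = np.array([0.3, -1.0, 2.0, 0.5])   # toy "local energy" assigned to each state

def exact_grad(phi):
    """d/dphi of E_p[f] for p = softmax(phi), computed analytically."""
    p = softmax(phi)
    return p * (f - (p * f).sum())

def score_function_grad(phi, n_samples=200_000):
    """Monte Carlo estimator E_p[f(K) * grad_phi log p(K)]."""
    p = softmax(phi)
    K = rng.choice(len(p), size=n_samples, p=p)
    fk = f[K]
    # grad_phi log p(k) = onehot(k) - p for the softmax parametrization
    onehot_term = np.bincount(K, weights=fk, minlength=len(p))
    return (onehot_term - p * fk.sum()) / n_samples

phi = np.array([0.1, 0.4, -0.2, 0.0])
assert np.allclose(score_function_grad(phi), exact_grad(phi), atol=2e-2)
```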

E.1 The Hutchinson's trick
The computational bottleneck of the variational free energy Eq. (5) and the gradient estimators Eq. (S4) lies in the Laplacian $\nabla^2 \ln \Psi_K(R)$ appearing in the local energy, which amounts to computing the trace of the $2N \times 2N$ Hessian matrix $H(\ln \Psi_K(R))$. In the standard automatic differentiation approach, one needs to iterate over the rows or columns of the Hessian, which can be inefficient for large systems.
To reduce the computational complexity, we employ Hutchinson's stochastic trace estimator [86],
$\nabla^2 \ln \Psi_K(R) = \mathbb{E}_{\varepsilon \sim f(\varepsilon)} \left[\varepsilon^\top H(\ln \Psi_K(R))\, \varepsilon\right]$, (S5)
where the expectation is taken over a $2N$-dimensional random vector $\varepsilon$ with zero mean and identity covariance matrix. The probability density $f(\varepsilon)$ can, for example, be chosen as a standard Gaussian. The Hessian-vector product involved in Eq. (S5) can be efficiently computed by combining forward- and reverse-mode automatic differentiation in Jax [84]. The price we pay, however, is an additional source of randomness on top of the original estimators Eqs. (5) and (S4), which may require more samples to achieve a given statistical accuracy.
In practice, we choose to apply Hutchinson's trick only to the Hessian of the Jacobian-determinant term in $\ln \Psi_K(R)$; see Eq. (8). In this way, one enjoys an overall speedup of roughly one order of magnitude without sacrificing accuracy to the enlarged variance of the estimators.
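Hutchinson's estimator is easy to prototype outside of Jax. The plain-NumPy sketch below uses Rademacher noise (zero mean, identity covariance, like the Gaussian choice mentioned above) and only ever touches Hessian-vector products:

```python
import numpy as np

rng = np.random.default_rng(2)

def hutchinson_trace(hvp, dim, n_samples=1):
    """Estimate tr(H) as the average of eps^T (H eps) over random vectors eps.

    hvp is any Hessian-vector-product routine; the full Hessian is never built.
    """
    est = 0.0
    for _ in range(n_samples):
        eps = rng.choice([-1.0, 1.0], size=dim)   # Rademacher: zero mean, unit covariance
        est += eps @ hvp(eps)
    return est / n_samples

# For a diagonal Hessian the Rademacher estimator is exact even with a single
# sample, since eps_i^2 = 1; in general it is unbiased, with variance shrinking
# as more samples are drawn.
H = np.diag([1.0, -2.0, 3.5, 0.5])
assert abs(hutchinson_trace(lambda v: H @ v, 4) - np.trace(H)) < 1e-12
```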

E.2 Stochastic reconfiguration for density matrices
The quantity of central interest for the purpose of this work is the thermal entropy, which turns out to be fairly sensitive to the training process. We thus employ the stochastic reconfiguration method [81], which is much more efficient than conventional first-order optimizers like Adam.
When training a classical generative model or performing ground-state variational Monte Carlo for quantum systems, the conventional metric on the parameter space is the Fisher information. To find an appropriate metric for the present quantum statistical mechanics setting, the arguably most natural candidate is the Bures distance, defined for two density matrices $\rho$ and $\sigma$ as [87]
$D_B^2(\rho, \sigma) = 2\left(1 - \mathrm{Tr}\sqrt{\sqrt{\rho}\, \sigma \sqrt{\rho}}\right)$.
The second term on the right-hand side is well known as the quantum fidelity [88].
Recall from Eq. (4) in the main text that $\rho(\phi, \theta) = \sum_K p(K; \phi)\, |\Psi_K(\theta)\rangle \langle\Psi_K(\theta)|$, where $\phi$ and $\theta$ are the variational parameters of the classical Boltzmann distribution and the quantum many-body basis, respectively. By expanding the Bures distance in the neighborhood of a given point $(\phi, \theta)$, one obtains [87]
$D_B^2 \approx \frac{1}{4}\left(\delta\phi^\top I\, \delta\phi + \delta\theta^\top J\, \delta\theta\right)$
for some positive-definite matrices $I$ and $J$. The most important observation is that the desired metric is block-diagonal with respect to the two generative models in this work. This fact is favorable in practice, since the metric needs to be inverted to determine the parameter update direction in each training step. Note that the sizes of $I$ and $J$ equal the numbers of parameters in the autoregressive model and the normalizing flow, respectively, both on the order of several thousand throughout our calculations. $I$ coincides exactly with the classical Fisher information matrix, $I_{ij} = \mathbb{E}_{K \sim p(K)}\left[\partial_{\phi_i} \ln p(K)\; \partial_{\phi_j} \ln p(K)\right]$, whereas the quantum component $J$, Eq. (S9), contains in addition a second term with a double summation over momenta, which is inconvenient to estimate. In practice, we therefore approximate $J_{ij}$ by the covariance matrix
$J_{ij} \approx \mathrm{Re}\left(\mathbb{E}\left[\partial_{\theta_i} \ln \Psi_K^*(R)\; \partial_{\theta_j} \ln \Psi_K(R)\right] - \mathbb{E}\left[\partial_{\theta_i} \ln \Psi_K^*(R)\right] \mathbb{E}\left[\partial_{\theta_j} \ln \Psi_K(R)\right]\right)$,
with expectations over $K \sim p(K)$ and $R \sim |\Psi_K(R)|^2$. Notice that the first term is the same as that of Eq. (S9), and is clearly a natural generalization of the usual quantum Fisher information matrix for pure states. The training turns out to still behave quite well under this approximation.
The update rules for the parameters $\phi, \theta$ read
$\delta\phi \propto -\left(I + \eta\, \mathbb{1}\right)^{-1} \nabla_\phi F$, $\qquad \delta\theta \propto -\left(J + \eta\, \mathbb{1}\right)^{-1} \nabla_\theta F$,
where we have added a small shift $\eta = 10^{-3}$ to the diagonal of the (modified) Fisher information matrices for numerical stability. The norms of the updates are constrained within a threshold of $10^{-3}$ [48], which plays a role similar to the learning rate in other conventional optimizers. See the source code [51] for more details.
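The update rule above can be sketched in a few lines. The diagonal shift $\eta = 10^{-3}$ and the norm threshold $10^{-3}$ follow the values quoted in the text, while the gradient and metric below are toy stand-ins for the actual Fisher blocks $I$ and $J$:

```python
import numpy as np

def sr_update(grad, metric, eta=1e-3, max_norm=1e-3):
    """One stochastic-reconfiguration step: delta = -(M + eta*1)^{-1} grad,
    rescaled so that ||delta|| never exceeds max_norm."""
    damped = metric + eta * np.eye(len(grad))
    delta = -np.linalg.solve(damped, grad)
    norm = np.linalg.norm(delta)
    if norm > max_norm:
        delta *= max_norm / norm
    return delta

g = np.array([0.3, -0.1, 0.05])                     # toy gradient
M = np.array([[2.0, 0.1, 0.0],                      # toy positive-definite metric
              [0.1, 1.5, 0.0],
              [0.0, 0.0, 0.8]])
step = sr_update(g, M)
assert np.linalg.norm(step) <= 1e-3 + 1e-12         # norm constraint holds
assert np.dot(step, g) < 0                          # still a descent direction
```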