Empowering deep neural quantum states through efficient optimization

Computing the ground state of interacting quantum matter is a long-standing challenge, especially for complex two-dimensional systems. Recent developments have highlighted the potential of neural quantum states to solve the quantum many-body problem by encoding the many-body wavefunction into artificial neural networks. However, this method has faced the critical limitation that existing optimization algorithms are not suitable for training modern large-scale deep network architectures. Here, we introduce a minimum-step stochastic-reconfiguration optimization algorithm, which allows us to train deep neural quantum states with up to 10^6 parameters. We demonstrate our method for paradigmatic frustrated spin-1/2 models on square and triangular lattices, for which our trained deep networks approach machine precision and yield improved variational energies compared to existing results. Equipped with our optimization algorithm, we find numerical evidence for gapless quantum-spin-liquid phases in the considered models, an open question to date. We present a method that captures the emergent complexity in quantum many-body problems through the expressive power of large-scale artificial neural networks.

Neural quantum states (NQSs) have emerged as a promising new numerical method to solve the quantum many-body problem. However, it has remained a central challenge to train modern large-scale deep network architectures to the desired quantum state accuracy, which would be vital for utilizing the full power of NQSs and making them competitive with or superior to conventional numerical approaches. Here, we propose a minimum-step stochastic reconfiguration (MinSR) method that reduces the optimization cost by orders of magnitude while keeping similar accuracy compared to conventional stochastic reconfiguration. MinSR allows for accurate training of unprecedentedly deep NQSs with up to 64 layers and more than 10^5 parameters in the spin-1/2 Heisenberg J1-J2 models on the square lattice. We find that this approach yields better variational energies compared to existing numerical results, and we further observe that the accuracy of our ground-state calculations approaches different levels of machine precision on modern GPU and TPU hardware. The MinSR method opens up the potential to make NQSs superior to conventional computational methods, with the capability to address yet inaccessible regimes of two-dimensional quantum matter in the future.
As a fundamental concept in quantum physics, the ground-state wave function plays a central role in understanding the behavior of many-body quantum systems. The accurate numerical solution for ground states, however, becomes an extraordinary challenge for existing numerical methods, especially in complex and large two-dimensional systems. The respective challenges depend on the individual method utilized, such as the "curse of dimensionality" in exact diagonalization (ED) [1], the notorious sign problem [2] in quantum Monte Carlo (QMC) [3], or the entanglement growth and matrix-contraction complexity in tensor-network (TN) methods [4].
Recently, the neural quantum state (NQS) has been introduced as a promising alternative for the calculation of ground states of quantum matter by means of artificial neural networks [5], and has already seen tremendous progress [6][7][8][9]. However, this method also faces an outstanding challenge that critically limits its capabilities and its potential to date: existing optimization algorithms are not capable of efficiently training modern large-scale deep networks to the desired quantum state accuracy in practice, which would be key to exploiting the full power of this machine-learning approach. In this work, we introduce an alternative training algorithm for NQS, which we term minimum-step stochastic reconfiguration (MinSR). We show that the optimization cost in MinSR is reduced to a leading factor of N_p for a deep network with N_p parameters, a great acceleration compared to conventional stochastic reconfiguration (SR) with O(N_p^3) complexity or non-parallel iterative solvers. This in turn allows us to train deep neural networks of 64 layers and 10^5 parameters with limited numerical effort, significantly larger than most NQS practice with no more than 10 layers and around 10^3 parameters [5,8,10]. We apply the resulting algorithm to paradigmatic two-dimensional quantum spin systems such as the spin-1/2 Heisenberg J1-J2 model. As a consequence of the now accessible large-scale deep neural networks, we find significantly lower variational energies, outperforming conventional numerical approaches up to lattice sizes of 16 × 16 spins. Most importantly, we observe that our ground-state results reach different levels of machine precision. Thus, with MinSR we are able to reach the frontier where the sole limitation of applying the NQS approach to complex two-dimensional quantum matter is no longer the expressive power of the neural network but rather the inherent numerical precision of the computing device. This is of key importance for exploiting the full power of NQS for the calculation of ground states in the future, opening up the potential to address yet inaccessible regimes of quantum many-body systems, also in higher spatial dimensions.

Neural quantum states
Within the NQS approach, artificial neural networks are utilized to encode the full quantum many-body wave function. In a quantum many-body system with N spin-1/2 degrees of freedom, the Hilbert space can be spanned by the S^z spin-configuration basis |σ⟩ = |σ_1, . . ., σ_N⟩ with σ_i = ↑ or ↓. The neural network is constructed such that it maps every σ at the input to a wave-function amplitude ψ_σ(θ) at the output; for an illustration see Fig. 1. Here, θ denotes the parameters of the network, i.e., its weights and biases. Hence, the Hilbert space, which is exponentially large in the number of degrees of freedom, is compressed into a much smaller variational space of the network parameters by constructing many-body quantum wave functions via |Ψ(θ)⟩ = Σ_σ ψ_σ(θ) |σ⟩. In order to obtain the variational parameters θ that best represent the ground state of a Hamiltonian H, an NQS is optimized by means of a variational Monte Carlo (VMC) approach [11]. Similar to general machine-learning tasks minimizing loss functions, in VMC the ground states of quantum systems are obtained by minimizing the variational energy for the target Hamiltonian H. Such a minimization procedure for the variational energy is a generic approach in quantum physics and has also been utilized extensively for other choices of variational wave functions with VMC, such as the famous Gutzwiller-projected fermionic wave function (GWF) [11,12]. However, it is important to emphasize the key difference that traditional wave functions are inspired by specific physical backgrounds and are therefore biased, for instance by underlying mean-field pictures. As a consequence, they are not guaranteed to asymptotically represent exact wave functions upon increasing the ansatz complexity. The fundamental advantage of NQS, on the other hand, is that it represents, in principle, a numerically exact algorithm. As guaranteed by the universal approximation theorem [13] and observed empirically in numerous applications of deep learning [14], an NQS is expected to converge to the desired solution to arbitrary accuracy upon increasing the size and depth of the underlying neural network [15].
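As a minimal illustration of this mapping, the sketch below implements a toy network σ → log ψ_σ(θ) and assembles the corresponding (tiny) variational state. The two-layer architecture, sizes, and parameter names are hypothetical placeholders, not the deep CNNs used in this work; note also that a real-valued positive output like this cannot represent nontrivial sign structures, for which practical NQS use complex parameters or a separate phase output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy NQS: a two-layer real-valued network that maps a spin
# configuration sigma (entries +1/-1) to a log-amplitude log psi_sigma(theta).
N_SPINS, N_HIDDEN = 6, 8
theta = {
    "W1": 0.1 * rng.standard_normal((N_HIDDEN, N_SPINS)),
    "b1": np.zeros(N_HIDDEN),
    "W2": 0.1 * rng.standard_normal(N_HIDDEN),
}

def log_psi(sigma, theta):
    """Log wave-function amplitude for one basis configuration."""
    hidden = np.tanh(theta["W1"] @ sigma + theta["b1"])
    return theta["W2"] @ hidden

# The variational state |Psi(theta)> = sum_sigma psi_sigma(theta) |sigma>,
# assembled over the full 2^N basis (feasible only for tiny N).
basis = np.array([[1 if (i >> k) & 1 else -1 for k in range(N_SPINS)]
                  for i in range(2 ** N_SPINS)])
psi = np.exp([log_psi(s, theta) for s in basis])
psi /= np.linalg.norm(psi)   # normalized amplitudes, one per configuration
print(psi.shape)
```

Optimizing θ then amounts to minimizing the variational energy of this state, which is where the VMC machinery described above enters.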

Current dilemma
Compared to ordinary deep-learning tasks in computer science, a major difficulty in NQS is that the rugged quantum landscape [16] with many saddle points imposes a great challenge for the conventional stochastic gradient descent (SGD) method [17]. To improve the optimization dynamics, it is typically necessary to utilize a quantum generalization of natural gradient descent [18] named stochastic reconfiguration (SR) [19]. This increase in precision comes, however, at a significant cost: SR faces an O(N_p^3) complexity for a neural network with N_p parameters, which is much larger than the O(N_p) complexity of SGD. Although iterative linear solvers can be employed [5] to reduce the complexity, the time cost is still huge due to the large number of non-parallel iterations. Consequently, current applications of NQS mainly focus on shallow networks such as the restricted Boltzmann machine (RBM) [5,20] or small-scale convolutional neural networks (CNNs) [8,10]. Although these shallow networks work well in some specific models without a sign problem, they also face natural limitations against TN or GWF in more complex frustrated models, which are of strong current interest, for instance, due to the possibility of realizing celebrated quantum spin liquid phases [21,22].
Many efforts have been made to overcome the optimization difficulty of deep NQS. For example, simple optimizers including SGD and Adam [23] have been used instead of SR, at the cost of losing accuracy [24][25][26][27]. While retaining SR for high accuracy, large-scale supercomputers have been employed to enlarge the allowed number of parameters by a few orders of magnitude [28,29], thus compensating the training complexity with computing power. Furthermore, a sequential local optimization approach has been proposed in which SR only optimizes a portion of all parameters to reduce the time cost [30]. In all these studies, however, the O(N_p^3) complexity or the non-parallelizable iterative solvers of SR remain the key limitation for further increasing the network size.

Minimum-step stochastic reconfiguration
The central idea of SR is to approximate imaginary-time evolution on the variational state |Ψ(θ)⟩ in every training step, so that the new state |Ψ'⟩ = e^{−Hδτ} |Ψ(θ)⟩ has reduced contributions from eigenstates with higher energies after an imaginary-time interval δτ, thereby pushing the state towards the ground state step by step. However, as the variational manifold usually only takes up a tiny portion of the whole Hilbert space, one has to project |Ψ'⟩ onto the variational manifold to obtain |Ψ(θ')⟩ with new parameters θ'. A common choice of projection is to minimize the difference of the two states as given by the Fubini-Study (FS) distance d(Ψ(θ'), Ψ') [31][32][33]. Expanded to the lowest order in the optimization step δθ_k = θ'_k − θ_k and the imaginary-time interval δτ, we find, as proved in Methods, that

d^2(Ψ(θ'), Ψ') = ||O δθ − ε||^2,    (1)

where N_s is the number of Monte Carlo samples to be generated, and

O_σk = (1/√N_s) (∂ ln ψ_σ/∂θ_k − ⟨∂ ln ψ/∂θ_k⟩),    ε_σ = −(δτ/√N_s) (E_loc,σ − ⟨E_loc⟩).    (2)

We also adopt the notation E_loc,σ = Σ_σ' H_σσ' ψ_σ'/ψ_σ for the local energy, where sums over spin configurations σ are performed over the Monte Carlo samples drawn from the distribution |ψ_σ|^2. Minimizing the FS distance is therefore equivalent to the linear least-squares problem

O δθ = ε,    (3)

whose least-squares solution minimizes the FS distance and leads to the SR equation. Conceptually, one can understand the left-hand side of this equation as the change of the variational state induced by an optimization step of the parameters, and the right-hand side as the change of the exactly imaginary-time-evolved state. The SR solution minimizing their difference is given by

δθ = S^{-1} O† ε,    with S = O† O.    (4)

In most literature [5,19], Eq. (4) is directly shown as a solution minimizing the FS distance, but here we show that it can also be derived as the least-squares solution of Eq. (3), which will lead to natural and key improvements, as we discuss in detail later.
As illustrated in Fig. 1(a), the matrix S in Eq. (4) plays an important role as the quantum metric in VMC, i.e., the FS distance is d^2(Ψ(θ), Ψ(θ')) = δθ† S δθ for states before and after a variation of the parameters δθ [18,33]. A major difficulty of SR is the computation of S^{-1}, which maps a variation in the Hilbert space back to the parameter space. For an optimization step with N_s samples and N_p parameters, the shapes of O and S are N_s × N_p and N_p × N_p, respectively, and the complexity of solving Eq. (4) is hence O(N_p^2 N_s + N_p^3) for direct linear solvers, which becomes unacceptable for deep networks. Many iterative methods, including conjugate gradient (cg) and minres [34], can be employed to reduce the complexity of SR to O(N_p N_s N_iter) [35], where N_iter is the number of iterations in the iterative solver. However, for large N_s and N_p the iterative approach typically faces natural limitations due to the huge required N_iter and the non-parallelizable iterations. In Refs. [28,29] with N_p ∼ 10^5 and N_s ∼ 10^6, for instance, the direct linear solver is still adopted instead of iterative solvers.
To reduce the cost of SR, we focus on the specific optimization setting of a deep network with a large number of parameters N_p but a relatively small number of batch samples N_s, similar to most deep-learning research. We assume, and justify below, that NQS can still converge to the minimum energy even with the reduced N_s and the correspondingly lower optimization accuracy, and that the expressive power controlled by N_p is the dominant factor affecting the final variational accuracy. In this case, as shown in Fig. 1(b), the rank of the quantum metric S is at most N_s ≪ N_p, meaning that S contains much less information than its capacity allows due to an insufficient number of batch samples.
As a more efficient way to express the information of the quantum metric, we introduce the neural tangent kernel T = O O† [36], which contains the same non-zero eigenvalues as S but reduces the matrix size from N_p × N_p to N_s × N_s. Here, we propose a new method based on using T as the compressed matrix. To formulate this method, we notice that Eq. (3) is underdetermined, with an infinite number of least-squares solutions, when N_s < N_p. To obtain a unique solution δθ, we employ the least-squares minimum-norm condition widely used for underdetermined linear equations. To be specific, we choose, among all solutions with minimum residual error ||O δθ − ε||, the one minimizing the norm of the variational step ||δθ|| = √(Σ_k |δθ_k|^2), which helps to reduce higher-order effects, prevent overfitting, and improve stability. We name this method minimum-step SR (MinSR) due to this additional minimum-step condition. In Methods we provide two approaches to obtain the MinSR solution

δθ = O† T^{-1} ε,    (5)

which only requires the inverse of an N_s × N_s matrix, with an O(N_p N_s^2 + N_s^3) complexity. For large N_p, it provides a tremendous acceleration with a time cost proportional to N_p instead of N_p^3, and the employment of non-parallel iterative solvers is also avoided. Therefore, it can be viewed as a natural reformulation of traditional SR that is particularly useful in the limit N_p ≫ N_s relevant in deep-learning situations.
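The algebra behind the MinSR solution can be checked numerically: for N_s < N_p, the minimum-norm least-squares solution of O δθ = ε obtained from the N_p × N_p metric S = O†O coincides with the expression δθ = O† T^{-1} ε built from the N_s × N_s kernel T = O O†. A sketch with random real matrices standing in for the actual log-derivatives (sizes are illustrative, not those of the experiments):

```python
import numpy as np

rng = np.random.default_rng(1)

# Deep-learning-like regime: many parameters, fewer samples (N_p >> N_s).
N_s, N_p = 50, 400
O = rng.standard_normal((N_s, N_p)) / np.sqrt(N_s)   # stand-in for centered log-derivatives
eps = rng.standard_normal(N_s) / np.sqrt(N_s)        # stand-in for -dtau * centered E_loc

# Conventional SR: invert the N_p x N_p metric S = O^T O.
S = O.T @ O
dtheta_sr = np.linalg.pinv(S, rcond=1e-12) @ (O.T @ eps)

# MinSR: invert the N_s x N_s neural tangent kernel T = O O^T instead.
T = O @ O.T
dtheta_minsr = O.T @ np.linalg.pinv(T, rcond=1e-12) @ eps

# Both yield the minimum-norm least-squares solution of O dtheta = eps.
print(np.allclose(dtheta_sr, dtheta_minsr, atol=1e-8))
```

The equivalence follows from the pseudo-inverse identity (O†O)^+ O† = O† (O O†)^+; only the size of the matrix to invert changes, which is the source of the speedup.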
It is important to emphasize that, although the reduced number of samples N_s in MinSR leads to an increased uncertainty of the energy when evaluating E_loc,σ, it does not have a significant impact on the optimization accuracy. The reason is that the local energies of the samples roughly fall into the range ⟨E_loc⟩ ± std(E_loc), whereas the uncertainty of the mean is only u(⟨E_loc⟩) = std(E_loc)/√N_s ≪ std(E_loc) as long as N_s is not too small, so u(⟨E_loc⟩) only causes a small error when computing ε_σ = −δτ (E_loc,σ − ⟨E_loc⟩)/√N_s for the MinSR optimization. A similar argument applies to O_σk. Furthermore, as std(E_loc) → 0 when the wave function is close to convergence in VMC [11], one can easily obtain a small u(E) without too many samples when evaluating the expectation value of the energy, as we will see for the specific numerical benchmark examples studied below. For other observables with finite standard deviations, the large number of required samples can be generated after training convergence, which does not slow down the optimization.
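The √N_s suppression of the statistical error invoked above can be illustrated with synthetic Gaussian data standing in for local-energy samples (toy data only, not VMC output):

```python
import numpy as np

rng = np.random.default_rng(5)

# The uncertainty of the sample mean of E_loc shrinks as std(E_loc)/sqrt(N_s),
# so a moderate N_s already pins down the mean far below the sample spread.
n_s = 10_000
sample_means = np.array([rng.normal(0.0, 1.0, n_s).mean() for _ in range(2000)])

spread_of_means = sample_means.std()     # empirical uncertainty of the mean
predicted = 1.0 / np.sqrt(n_s)           # std(E_loc)/sqrt(N_s) = 0.01
print(spread_of_means, predicted)        # both close to 0.01
```

The spread of the 2000 sample means matches the predicted std(E_loc)/√N_s to within a few percent, while the individual samples scatter with unit standard deviation.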

Benchmark models
To showcase the performance of MinSR, we apply it to the paradigmatic spin-1/2 Heisenberg J1-J2 model on the square lattice, which serves as a benchmark system in various NQS studies and provides a convenient comparison to other state-of-the-art methods, as will be shown in the following. The Hamiltonian of the system is given by

H = J1 Σ_⟨i,j⟩ S_i · S_j + J2 Σ_⟨⟨i,j⟩⟩ S_i · S_j,

where S_i = (S_i^x, S_i^y, S_i^z) with S_i^x, S_i^y, S_i^z spin-1/2 operators at site i, ⟨i, j⟩ and ⟨⟨i, j⟩⟩ indicate pairs of nearest-neighbor and next-nearest-neighbor sites, respectively, and J1 is chosen equal to 1 for simplicity in this work.
We will specifically focus on two points in parameter space: J2/J1 = 0 and J2/J1 = 1/2. At J2/J1 = 0, the Hamiltonian reduces to the non-frustrated Heisenberg model. At J2/J1 = 1/2, the J1-J2 model becomes strongly frustrated close to the maximally frustrated point, where the system resides in a quantum spin liquid phase [7], which imposes a great challenge on existing numerical methods including NQS [10,37].
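For orientation, the Hamiltonian can be written down explicitly for a tiny system. The sketch below builds the spin-1/2 J1-J2 Hamiltonian on a single 2 × 2 open plaquette (a toy geometry of our choosing, far smaller than the lattices studied here) and diagonalizes it exactly:

```python
import numpy as np

# Spin-1/2 operators.
sx = np.array([[0, 1], [1, 0]]) / 2
sy = np.array([[0, -1j], [1j, 0]]) / 2
sz = np.array([[1, 0], [0, -1]]) / 2

def two_site(op_a, op_b, i, j, n=4):
    """Embed op_a (site i) and op_b (site j) into the full 2^n Hilbert space."""
    ops = [np.eye(2)] * n
    ops[i], ops[j] = op_a, op_b
    full = ops[0]
    for o in ops[1:]:
        full = np.kron(full, o)
    return full

def heisenberg_bond(i, j):
    """S_i . S_j as a 2^4 x 2^4 matrix."""
    return sum(two_site(s, s, i, j) for s in (sx, sy, sz))

def j1j2_hamiltonian(j1=1.0, j2=0.5):
    """Toy 2x2 open plaquette, sites 0 1 / 2 3."""
    nn = [(0, 1), (2, 3), (0, 2), (1, 3)]   # nearest-neighbor bonds
    nnn = [(0, 3), (1, 2)]                  # diagonal next-nearest-neighbor bonds
    return j1 * sum(heisenberg_bond(i, j) for i, j in nn) \
         + j2 * sum(heisenberg_bond(i, j) for i, j in nnn)

e0 = np.linalg.eigvalsh(j1j2_hamiltonian(1.0, 0.0))[0]
print(e0)   # -2.0: the four NN bonds form a 4-site Heisenberg ring
```

At J2 = 0 the four nearest-neighbor bonds form a 4-site ring, whose known ground-state energy of −2J1 serves as a sanity check for the construction.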

Performance comparison
We now start with a direct performance comparison among various optimization methods, including SGD, MinSR, and SR with a direct pseudo-inverse (pinv) solver or an iterative conjugate-gradient (cg) solver. Each of these methods involves, in each optimization step, the inversion of S or T as shown in Eq. (4) and Eq. (5), which is usually ill-conditioned. Thus, it is important to identify a suitable regularization. In our numerical experiments, a pseudo-inverse with relative tolerance r_tol = 10^-12 is used for MinSR and SR (pinv). For SR (cg), the diagonal shift S^reg_ii = S_ii + ε1 S_ii + ε2 with ε1 = 10^-4 and ε2 = 0 is applied. An alternative choice is a finite ε2 [9], but in our tests it does not help to improve the performance of SR. In order to demonstrate the overall capability and performance of each of these algorithms, their efficiency and accuracy are tested on the 10 × 10 Heisenberg model. In each test, an NQS close to convergence is used to generate O and ε by drawing Monte Carlo samples, and the different methods are employed to produce approximate solutions of Eq. (3). The whole experiment is performed on a single NVIDIA A100 80GB GPU.
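The two regularizations can be contrasted on a synthetic rank-deficient metric (random stand-ins for O, not data from the actual experiments): the pseudo-inverse with a relative cutoff solves the singular system essentially exactly, while the diagonal shift used for the cg solver introduces a small bias of order ε1:

```python
import numpy as np

rng = np.random.default_rng(2)

# Rank-deficient quantum metric: fewer samples than parameters (toy sizes).
N_s, N_p = 30, 60
O = rng.standard_normal((N_s, N_p)) / np.sqrt(N_s)
S = O.T @ O                                  # rank <= N_s < N_p
g = O.T @ rng.standard_normal(N_s)           # right-hand side O^T eps

# Pseudo-inverse with relative eigenvalue cutoff, as used for MinSR / SR (pinv).
dtheta_pinv = np.linalg.pinv(S, rcond=1e-12) @ g

# Diagonal-shift regularization, as used for SR (cg): S_ii -> S_ii + eps1*S_ii + eps2.
eps1, eps2 = 1e-4, 0.0
S_reg = S + np.diag(eps1 * np.diag(S) + eps2)
dtheta_shift = np.linalg.solve(S_reg, g)

# The shifted solve is biased: its residual in the original system is of order
# eps1, while the pseudo-inverse residual sits at numerical precision.
print(np.linalg.norm(S @ dtheta_pinv - g), np.linalg.norm(S @ dtheta_shift - g))
```

This is the distinction drawn in Fig. 2(b): the shift regularizes the inversion at the cost of a systematic bias, whereas the cutoff pseudo-inverse strictly minimizes the residual.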
The time cost of each algorithm in solving Eq. (3) is shown in Fig. 2(a) for a comparison of their efficiency. The test is performed with two batch sizes, N_s = 10^3 and 10^4, and multiple networks with different numbers of parameters N_p. SR (cg) searches for solutions with non-parallel iterations, causing much larger time costs compared to the other methods. For the test with the largest available N_s and N_p, the time cost of SR (cg) is more than 500 times larger than MinSR and becomes the bottleneck of the optimization. On the other hand, the time cost of SR (pinv) is larger than MinSR when N_p > N_s and grows rapidly with N_p, which matches our complexity analysis. Furthermore, SR (pinv) is unable to produce any result for N_p ≳ 10^4 due to the memory cost of inverting S, even on an A100 80GB GPU, which offers the largest memory currently available for GPUs. In summary, MinSR has a great efficiency advantage in the large-N_p regime compared to traditional SR. Moreover, MinSR makes it possible to obtain direct pseudo-inverse solutions within the limited memory space provided by modern GPUs or TPUs, which helps to avoid the usage of massive CPU resources as has been done recently [20,29,38] and further reduces the time cost.
To compare the accuracy of the different methods, in Fig. 2(b) we evaluate the relative residual error r = ||O δθ − ε||/||ε||, similar to the one employed in Refs. [5,39]. The SGD error is larger than 1, meaning that the direct energy-gradient direction does not coincide with the imaginary-time-evolution direction given by SR and is unlikely to further optimize the variational wave function when the NQS is close to convergence. The SR (cg) method produces biased solutions due to the diagonal shift of S, while the pseudo-inverse method provides a strict cut-off for small eigenvalues and minimizes the residual error under the given regularization. When N_p > N_s = 10^4, especially, the matrix S in SR contains many vanishing eigenvalues, in which case the SR (cg) method exhibits a much worse accuracy compared to MinSR or SR (pinv). Moreover, the MinSR and SR (pinv) methods, as explained in Methods, give the same solution up to numerical precision, but MinSR makes it possible to apply a very accurate pseudo-inverse solution to larger networks with more parameters.
In Fig. 2(c) we plot the change of the residual error as deeper networks with larger time costs are employed. The same data as in Fig. 2(a) and Fig. 2(b) is used. The MinSR method is able to produce accurate results by using deeper networks without significantly longer time costs, providing a reliable approach for accurate solutions in VMC that helps the optimization to converge faster. On the contrary, one has to spend much more time obtaining accurate solutions if traditional SR is adopted. In summary, MinSR drastically outperforms the traditional SR method in that it requires orders of magnitude less time without loss of accuracy. It brings accurate SR optimization to the next level for deep NQS, making it possible to train variational wave functions even toward machine precision for exact ground states of quantum matter.

Optimization results
Having demonstrated the superior performance of MinSR in terms of runtime and accuracy compared to other optimization algorithms, the goal in the following is to apply MinSR to paradigmatic physical problems. In particular, the purpose is to benchmark our MinSR results against existing results in the literature, with the goal of demonstrating that our approach of utilizing NQS in combination with MinSR allows us to outperform existing methods. The VMC optimization is performed on deep networks with up to 64 layers and 146320 parameters, larger than all previous networks trained by SR to convergence.
In a first step, we aim to show the accuracies obtainable with our introduced method on the level of the full quantum many-body wave function. For that purpose, we display in Fig. 3 full NQS wave functions obtained for the 6 × 6 Heisenberg J1-J2 model. The ED results obtained by the lattice-symmetries package [43] are shown as dotted lines for reference. The full wave functions are given in the ground-state sector with a Hilbert-space dimension of 15804956, which is much larger than the number of network parameters and therefore still presents a challenge to the expressive power of the NQS. The NQS wave function exhibits a nearly exact match with the ED result in the non-frustrated case, without any obvious outliers among more than 10^7 components, showing the outstanding performance of deep NQS in expressing non-frustrated wave functions once MinSR makes the training of such huge networks possible. In the frustrated case, one can observe a slightly different behavior. The deep network is still capable of accurately capturing the dominant wave-function amplitudes ψ_σ, while deviations start to appear on a scale of the order of 10^-5. Notice that in terms of probabilities |ψ_σ|^2, this implies errors appearing only on the level of 10^-10 or even smaller. Further, the dominant challenge for the NQS wave function is still to learn the proper sign structure, as the wave-function amplitudes of small magnitude often do not have correct signs. Nevertheless, the full wave function is still very accurate, because those inaccurate components are all very close to zero, with marginal effects on the overall quantum state. In order to quantify how well the deep NQS has learned the sign structure, we calculate a sign-structure error, defined as the exact probability weight of the configurations with an incorrect variational sign,

e_s = Σ_σ |φ_σ|^2 (restricted to σ with sign(ψ_σ) ≠ sign(φ_σ)),

where ψ_σ and φ_σ are the variational and exact wave-function amplitudes, respectively. This error is expected to vanish for a completely accurate sign structure and to equal 1/2 for a random one. If the NQS does not learn any frustrated behavior and only gives the so-called Marshall sign rule (MSR) [44], the error is e_s = 0.020. The well-trained deep NQS in our work provides a further correction to the MSR and gives e_s = 1.3 × 10^-6, showing extraordinary performance in learning the frustrated sign structure.
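The sign-structure error admits a compact implementation; the helper below is a sketch of the definition above (the function name and test vectors are ours), reproducing the limits e_s = 0 for a perfect sign structure and e_s ≈ 1/2 for random signs:

```python
import numpy as np

rng = np.random.default_rng(3)

def sign_error(psi, phi):
    """Exact probability weight |phi_sigma|^2 of configurations whose
    variational sign disagrees with the exact one."""
    wrong = np.sign(psi) != np.sign(phi)
    return np.sum(np.abs(phi[wrong]) ** 2) / np.sum(np.abs(phi) ** 2)

phi = rng.standard_normal(1000)
phi /= np.linalg.norm(phi)

e_perfect = sign_error(phi, phi)                                 # exactly 0
e_random = sign_error(phi * rng.choice([-1, 1], phi.size), phi)  # close to 1/2
print(e_perfect, e_random)
```

Weighting by |φ_σ|^2 ensures that wrong signs on negligible components, which barely affect the state, also barely affect the error measure.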
Having outlined the excellent accuracies achieved for the wave-function amplitudes, we now aim to go one step further by demonstrating that our approach is actually capable of reaching a fundamental and long-sought goal of NQS: upon increasing the size of the neural network, the NQS solution is supposed to converge to the exact one, so that the sole limitation of NQS becomes the numerical precision of the computing device. For that purpose, we also compute the wave-function amplitudes when different precisions, including 32-bit floating point (FP32), TensorFloat-32 (TF32), and bfloat16 (BF16), are employed for evaluating the NQS wave function.

[Caption of Fig. 4, partially recovered: a, Ground-state energies of the Heisenberg model, compared to RBMs [5], shallow CNNs [10], and RBMs with an additional exact Lanczos step [38]. As no tensor-network (TN) data is available for periodic boundary conditions (PBC), the best result with open boundary conditions (OBC) is included as a dashed line [40]. b, Ground-state energies of the frustrated J1-J2 model at J2/J1 = 0.5. The results obtained by means of MinSR in this work for a CNN are compared to previous results in the literature for a shallow CNN [10], an RBM with an additional exact Lanczos step [38], and a CNN with transfer learning [29]. Further results from methods other than NQS are included as dashed lines, such as TN [41], the Gutzwiller wave function (GWF) with two additional exact Lanczos steps (Lanczos2) [42], and the combination of pair-product state (PP) and RBM [7]. The exact GS energy in the frustrated case is estimated by extrapolation of the variational energy and energy variance in our NQS results. As a further reference, the so-called Marshall-sign-rule (MSR) limit is included, which is obtained from an NQS training for a wave function where the sign structure is not learned but rather fixed by the MSR.]

In the inset of Fig. 3 we show the infidelity, defined as

I = 1 − ⟨Ψ|Φ⟩⟨Φ|Ψ⟩ / (⟨Ψ|Ψ⟩⟨Φ|Φ⟩),

where Ψ and Φ are the variational and exact ground states, respectively. In the non-frustrated Heisenberg model, the nearly vanishing infidelity implies an almost perfect match between the variational state and the exact ground state. The infidelity increases slightly for TF32 and dramatically for BF16, indicating that the accuracy of the obtained wave function approaches the limit of TF32. We also compute the infidelity limit, 4.5 × 10^-8, obtained when ψ_σ and φ_σ are given by the ground-state wave function in 16-bit floating point (FP16) and FP32, respectively. As FP16 and TF32 give similar accuracy, we note that our variational infidelity of 1.2 × 10^-7 is only 2.6 times larger than the limit of TF32. Considering the error accumulation in the forward pass of the deep network, this result closely approaches the TF32 machine precision. In the frustrated J1-J2 model, the infidelity is roughly unchanged from FP32 to TF32 but increases visibly from TF32 to BF16, which means the deep NQS trained by MinSR approaches BF16 precision. Another piece of evidence is that the BF16 result in the Heisenberg model roughly indicates the precision limit of BF16, and the result in the J1-J2 model falls below this level. Although many small wave-function components exhibit deviations, as shown by the full J1-J2 wave function, the infidelity under BF16 precision is still mainly caused by the reduced precision of the large components. Furthermore, we also present the infidelity obtained by a CNN with 13750 parameters, which, as shown in Fig. 2, is roughly the maximum network size allowed for SR (pinv) on a GPU. As shown in the inset of Fig. 3, the conventional SR method is unable to approach machine precision because its infidelity is limited to a level far above MinSR.
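The infidelity is likewise a one-liner; in the sketch below, numpy's float16 stands in for the reduced-precision formats (numpy ships no TF32/BF16 dtype), so the numbers are only qualitative:

```python
import numpy as np

def infidelity(psi, phi):
    """1 - <Psi|Phi><Phi|Psi> / (<Psi|Psi><Phi|Phi>) for unnormalized vectors."""
    overlap = np.vdot(psi, phi)
    return 1.0 - abs(overlap) ** 2 / (np.vdot(psi, psi).real * np.vdot(phi, phi).real)

rng = np.random.default_rng(4)
phi = rng.standard_normal(4096)                   # stand-in for an exact ground state
low = phi.astype(np.float16).astype(np.float64)   # round-trip through low precision

print(infidelity(phi, phi))   # zero up to double-precision rounding
print(infidelity(low, phi))   # small but finite, set by float16 rounding
```

Because both overlaps are normalized by the vector norms, the measure is insensitive to overall scale, so it isolates genuine shape differences between the two states.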
In conclusion, the deep network architecture with the help of MinSR plays a critical role in pushing the NQS accuracy toward machine precision.
After quantifying the accuracy of our approach for small systems by comparing to ED, the key goal in the following is to show that the deep networks now trainable by MinSR lead to a numerical algorithm that outperforms existing methods for large system sizes beyond the reach of ED, in particular for the considered Heisenberg J1-J2 model. First, we start by focusing on systems with a 10 × 10 lattice. In this case, the huge Hilbert space becomes a greater challenge for the expressive power and the generalization ability of NQS compared with the 6 × 6 lattice. As the benchmarks of NQS are mostly given on the 10 × 10 lattice, it also provides the ideal starting point to compare our introduced scheme with the performance of other methods. In each optimization attempt, 20000 training steps are performed without symmetry, followed by 10000 steps with imposed symmetry as shown by Eq. (38) in Methods, and the number of samples N_s is fixed at 10000.
In the non-frustrated Heisenberg model, the deep NQS trained by MinSR provides an unprecedentedly precise result, better than all existing variational methods, as shown in Fig. 4(a). The adopted reference ground-state energy per site is E_GS/N = −0.67155267(5), as given by a stochastic series expansion (SSE) [45] simulation performed by ourselves, instead of the commonly used reference E/N = −0.671549(4) from Ref. [46], because our best NQS variational energy E/N = −0.67155260(3) is even more accurate than this common reference energy. Thanks to the deep network architecture and the efficient MinSR method, the relative error of the variational energy ε_rel = (E − E_GS)/|E_GS| drops much faster with increasing N_p than for the 1-layer RBM and finally reaches a level of 10^-7. Compared with other variational results, our best variational energy is about 10^4 times more accurate than what has been obtained by means of shallow NQSs, including the RBM [5] and shallow CNNs [10], and still around 10^3 times more accurate than the best NQS result before this work, given by an additional exact Lanczos step on an RBM [38]. Since autoregressive networks and TN are both known to be more suitable for open boundary conditions (OBCs) instead of the periodic boundary conditions (PBCs) used in this work, here for comparison we also give their relative errors of variational energy in OBCs, which are ε_rel = 3.5 × 10^-5 [24] and ε_rel = 3.0 × 10^-5 [40], respectively, while the most accurate result in our work gives ε_rel ≈ 10^-7 in PBC. Although OBC and PBC results are not directly comparable, the huge difference in accuracy still hints at a possibly much more accurate variational state. As the autoregressive network in Ref. [24] is also a deep network with around one million parameters but trained by SGD, this comparison also shows that MinSR is indispensable even in the simple non-frustrated case.
In order to go to the next level of complexity, we now focus on the frustrated J1-J2 model, whose accurate ground-state solution has remained a key challenge for all available computational approaches. As we show in Fig. 4(b) for a 10 × 10 lattice, our introduced method based on MinSR allows us to reach ground-state energies below what is possible with other numerical schemes. In this context, the MSR limit shows the energy one can obtain without considering any frustration. As shown in the figure, the usage of deep NQS becomes absolutely crucial, as the shallow CNN is not guaranteed to beat the MSR limit. Most importantly, our obtained variational energy is reduced upon increasing the network size, finally outperforming all existing NQS results and arriving at the same level of energy as the best existing result, given by the combination of a special kind of GWF named the pair-product state (PP) and an RBM [7]. Although PP+RBM produces a similar energy with a smaller number of parameters N_p than our deep CNN, due to the additional physical input encoded in the GWF, their computational complexities are comparable, because additional computations are required to enforce the translation symmetry in PP+RBM, which is unnecessary for the CNN. This result shows that the deep NQS trained by MinSR is superior even in the frustrated case, which had been argued to be challenging for NQS on a general level [47]. Finally, we aim to provide evidence that our approach exhibits an even more advantageous performance compared to other computational methods upon further increasing the system size. In Table I, the obtained variational energy on a 16 × 16 square lattice is presented and compared to existing results in the literature. In our training, the variational parameters from the 10 × 10 lattice are used as the initial parameters through transfer learning [28] to reduce the time cost. One can clearly see that our approach yields the best variational energy for the frustrated J1-J2 model on such a large lattice. In this system, our variational energy is clearly below the PP+RBM [7] and TN [48] energies. Compared with the best existing variational result, given in Ref. [29], the energy in this work is still 2 × 10^-4 lower. In summary, the deep NQS trained by MinSR obtains state-of-the-art or even superior results in large frustrated models, which fills the gap in efficient and reliable numerical methods for large two-dimensional quantum matter and might become an indispensable approach for discovering new intriguing quantum many-body phenomena in such complex systems.

Discussion
In this Article, we have introduced an alternative method, termed MinSR, to train NQS wave functions. We have shown that MinSR reduces the time cost of the optimization by orders of magnitude for deep NQS as compared to traditional SR methods while maintaining the same level of accuracy. We have demonstrated that MinSR allows us to successfully train deep NQSs with up to 64 layers and more than 10^5 parameters. It is a key outcome of this work that the achieved variational ground-state energies outperform all existing variational results on the considered paradigmatic quantum spin models. In the non-frustrated Heisenberg model on the square lattice, we show that the accuracy of the variational energy is improved by orders of magnitude toward TF32 precision. Further, for the frustrated J1-J2 model, the MinSR approach outperforms the best existing variational results and approaches BF16 precision. Based on our findings, we expect that MinSR is the starting point to put the NQS approach for solving the quantum many-body problem on a next level, with the potential to address complex quantum matter in previously inaccessible regimes.
We identify many directions along which our current results might be further improved. For instance, MinSR provides an additional condition, in this work chosen as minimizing ||δθ|| in Eq. (3), to control the variational optimization under the constraint of minimum residual error ||Oδθ − ε||. One is free to adopt other conditions, potentially controlling the training stability better or including higher-order effects in the NQS optimization, which may lead to even better variational accuracy. A further improvement of the variational energies of the considered models is likely achievable by choosing more specific deep neural network architectures with additional physical input.
The CNN architecture is scalable to larger system sizes through transfer learning [28], as we have also shown here, and it is also sufficiently flexible to target various types of lattices [8,47]. Thus, one can expect that deep CNNs in combination with the MinSR method provide a powerful approach to target quantum matter that is challenging to solve by other means, with the potential to discover new quantum phenomena, especially concerning complex high-dimensional systems such as the pyrochlore Heisenberg model [8,49], where TN and GWF do not provide sufficient accuracy.
In this work, we have applied MinSR to the variational search for ground-state wave functions. It is important to emphasize, however, that the fundamental Eq. (3), solved by MinSR, can be directly translated into a time-dependent variational principle for solving the dynamics of quantum matter by means of NQS [5,39]. It is therefore a natural question to which extent the MinSR method might also be applicable to such time-evolution problems, in order to yield improved solutions of the fundamental equation of quantum mechanics, the Schrödinger equation.
Moreover, it is key to point out that the MinSR method is not at all restricted to NQS. As a general optimization method in VMC, it can also be applied to other, more traditional wave functions like TN or GWF, so that more complex ansatz architectures can be introduced in these conventional methods to include more physical insight. We can further envision the application of MinSR beyond the scope of physics for general machine-learning tasks, provided a suitable space for optimization, similar to the Hilbert space in physics, can be defined in order to construct an equation similar to Eq. (3).

Derivation of MinSR equation
In this section, we adopt two different approaches, namely the Lagrange multiplier method and the pseudo-inverse method, to derive the MinSR formula in Eq. (5).

Lagrange multiplier. The MinSR solution can be derived by minimizing the variational step Σ_k |δθ_k|² under the constraint of minimum residual error. To begin with, we assume that the minimum residual error is 0, which can always be achieved by letting N_s < N_p and assuming the typical situation in VMC that the O_σk values obtained for different samples are linearly independent. This leads to the constraints Σ_k O_σk δθ_k − ε_σ = 0 for each σ. The Lagrange function is then given by

L = Σ_k |δθ_k|² + Σ_σ [α_σ* (Σ_k O_σk δθ_k − ε_σ) + c.c.],

where α_σ is the Lagrange multiplier. Written in matrix form, the Lagrange function becomes

L = δθ†δθ + α†(Oδθ − ε) + (Oδθ − ε)†α.

From ∂L/∂(δθ†) = 0, one obtains

δθ = −O†α.   (19)

Putting Eq. (19) back into Oδθ = ε, one can solve for α as

α = −(OO†)⁻¹ε = −T⁻¹ε.   (20)

Combining Eq. (20) with Eq. (19), one obtains the final solution

δθ = O†T⁻¹ε,   (21)

which is the MinSR formula in Eq. (5). A similar derivation applies to the case in which O, δθ and ε are all real.
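The solution in Eq. (21) can be checked numerically against the generic minimum-norm least-squares solution. Below is a small numpy sketch; the random matrix is only a stand-in for the actual log-derivative matrix O of a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
N_s, N_p = 30, 100                       # N_s < N_p, as assumed in the derivation
O = rng.standard_normal((N_s, N_p))      # stand-in for the log-derivative matrix
eps = rng.standard_normal(N_s)

# MinSR update of Eq. (21): delta_theta = O^dagger T^{-1} eps with T = O O^dagger
T = O @ O.conj().T
dtheta = O.conj().T @ np.linalg.solve(T, eps)

# It satisfies the constraint exactly and matches the minimum-norm solution
assert np.allclose(O @ dtheta, eps)
assert np.allclose(dtheta, np.linalg.pinv(O) @ eps)

# Any other solution differs by a null-space vector and has a larger norm
e0 = np.zeros(N_p)
e0[0] = 1.0
null_vec = e0 - np.linalg.pinv(O) @ (O @ e0)   # null-space component of e0
other = dtheta + null_vec
assert np.allclose(O @ other, eps)
assert np.linalg.norm(other) > np.linalg.norm(dtheta)
```

The last two assertions illustrate the "minimum-step" property: all solutions of the constraint differ by null-space vectors of O, and Eq. (21) picks the one of smallest norm.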
In the main text, the residual error is non-zero, which differs from our previous assumption. This is because the inverse in Eq. (21) is replaced by a pseudo-inverse with finite truncation to stabilize the solution in numerical experiments.

Pseudo-inverse. To simplify the notation, we write A = O, x = δθ and b = ε in this subsection. We will prove that, for a linear equation Ax = b,

x = A⁻¹b

is the least-squares minimum-norm solution, where the matrix inverse is understood as the pseudo-inverse. First, we prove that x = A⁻¹b is the solution we need. The singular value decomposition of A gives

A = UΣV†,   (23)

where U and V are unitary matrices and Σ is a diagonal matrix with σ_i = Σ_ii = 0 if and only if i > r, with r the rank of A. The least-squares solution is given by minimizing

||Ax − b||² = ||UΣV†x − b||² = ||Σx′ − b′||² = Σ_{i≤r} |σ_i x′_i − b′_i|² + Σ_{r<i≤N_s} |b′_i|²,

where x′ = V†x, b′ = U†b, N_s is the dimension of b, and the second step holds because applying a unitary matrix does not change the norm of a vector. Therefore, all least-squares solutions take the form

x′_i = b′_i/σ_i for i ≤ r,  x′_i arbitrary for i > r.

Among all these solutions, the one that minimizes ||x|| = ||x′|| is

x′_i = b′_i/σ_i for i ≤ r,  x′_i = 0 for i > r.

With the following definition of the pseudo-inverse,

Σ⁺_ii = 1/σ_i for i ≤ r,  Σ⁺_ii = 0 for i > r,

we have x′ = Σ⁺b′, so the final solution is

x = VΣ⁺U†b = A⁻¹b.

Furthermore, we show the following equality:

(A†A)⁻¹A† = A†(AA†)⁻¹ = A⁻¹.   (29)

With the singular value decomposition of A in Eq. (23), Eq. (29) can be directly proved by

(A†A)⁻¹A† = (VΣ†ΣV†)⁻¹VΣ†U† = V(Σ†Σ)⁺Σ†U† = VΣ⁺U† = A⁻¹

and

A†(AA†)⁻¹ = VΣ†U†(UΣΣ†U†)⁻¹ = VΣ†(ΣΣ†)⁺U† = VΣ⁺U† = A⁻¹.

In this derivation, the shapes of the diagonal matrices Σ and Σ⁺ are not fixed but assumed to match the neighboring matrices so that the matrix multiplications are valid. Eq. (29) shows that the SR solution in Eq. (4) and the MinSR solution in Eq. (5) are both equivalent to the pseudo-inverse solution δθ = O⁻¹ε, which justifies MinSR as a natural alternative to SR when N_s < N_p.

MinSR solution
To numerically solve the MinSR equation, a suitable pseudo-inverse should be applied to obtain a stable solution. In practice, the Hermitian matrix T is first diagonalized as T = UDU†, and the pseudo-inverse is given by

T⁻¹ = UD⁺U†,

where D⁺ is the pseudo-inverse of the diagonal matrix D, numerically given by a cutoff below which the eigenvalues are regarded as 0, i.e.

λ⁺_i = 1/λ_i if λ_i ≥ max(r_pinv λ_max, a_pinv),  λ⁺_i = 0 otherwise,

where λ_i and λ⁺_i are the diagonal elements of D and D⁺, λ_max is the largest of the λ_i, and r_pinv and a_pinv are the relative and absolute pseudo-inverse cutoffs. In most cases, we choose r_pinv = 10⁻¹² and a_pinv = 0. Furthermore, we modify this hard cutoff into a soft one [50] to avoid abrupt changes when eigenvalues cross the cutoff during optimization.
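The eigenvalue-truncated solver described above can be sketched in a few lines of numpy. This is an illustrative reimplementation, not the authors' code, and the soft cutoff of Ref. [50] is omitted for brevity:

```python
import numpy as np

def minsr_update(O, eps, r_pinv=1e-12, a_pinv=0.0):
    """One MinSR update delta_theta = O^dagger T^{-1} eps with a truncated inverse.

    O   : (N_s, N_p) matrix of centered log-derivatives
    eps : (N_s,) residual vector
    The inverse of T = O O^dagger uses a relative (r_pinv) and an absolute
    (a_pinv) eigenvalue cutoff, below which eigenvalues are treated as zero.
    """
    T = O @ O.conj().T                        # neural tangent kernel, (N_s, N_s)
    lam, U = np.linalg.eigh(T)                # T = U D U^dagger (Hermitian)
    keep = (np.abs(lam) >= r_pinv * np.abs(lam).max()) & (np.abs(lam) >= a_pinv)
    lam_plus = np.where(keep, 1.0 / np.where(keep, lam, 1.0), 0.0)
    T_inv = (U * lam_plus) @ U.conj().T       # pseudo-inverse U D^+ U^dagger
    return O.conj().T @ (T_inv @ eps)         # delta_theta, (N_p,)
```

For N_s ≪ N_p, this costs O(N_s² N_p + N_s³) per step instead of the O(N_p³) required to invert the full quantum metric S, which is the source of the speedup reported in the main text.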

Neural quantum state
Neural network architecture. In this Article, we adopt a single real-valued residual convolutional neural network [51] as the NQS. As suggested in Ref. [52], every residual block contains two convolutional blocks, each given by a layer normalization [53], a ReLU activation function and a convolutional layer, applied sequentially. The detailed network architecture is listed in Table II.
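To make the block structure concrete, here is a minimal plain-numpy sketch of one such residual block with periodic boundary conditions. This is an illustration only, not the authors' implementation; channel counts and kernel sizes in the actual networks follow Table II, and a real implementation would use a deep-learning framework:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each site's feature vector over the channel axis (last axis).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conv2d_periodic(x, w):
    # x: (H, W, C_in) feature map, w: (k, k, C_in, C_out) kernel.
    # Periodic boundaries match the lattices with PBCs used in the text.
    k = w.shape[0]
    out = np.zeros(x.shape[:2] + (w.shape[-1],))
    for dx in range(k):
        for dy in range(k):
            shifted = np.roll(x, (-(dx - k // 2), -(dy - k // 2)), axis=(0, 1))
            out += shifted @ w[dx, dy]   # (H, W, C_in) @ (C_in, C_out)
    return out

def residual_block(x, weights):
    # Two convolutional blocks, each: layer norm -> ReLU -> convolution,
    # followed by the residual (skip) connection.
    h = x
    for w in weights:   # weights = [w1, w2]
        h = conv2d_periodic(np.maximum(layer_norm(h), 0.0), w)
    return x + h
```

With the second kernel set to zero, the block reduces to the identity map, the usual safe starting point for very deep residual networks.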
After the forward pass through all residual blocks, a final activation function is applied, which resembles the cosh(x) activation of the RBM but can also produce negative outputs, so that the whole network is able to express sign structures while remaining real-valued. In the non-frustrated case, |f(x)| is used as the final activation function to make all outputs positive. After the final activation function, the outputs v_i are used to compute the wave-function amplitude as a product over all outputs, where r is a rescaling factor, updated in every training step, that prevents data overflow in the product.

Sign structure. On top of the raw output ψ^net_σ of the neural network, the Marshall sign rule (MSR) [44] is imposed, which is the exact sign structure for the non-frustrated Heisenberg model and still an approximate sign structure in the frustrated region around J2/J1 ≈ 0.5. In practice, the MSR is applied to the Hamiltonian through a basis rotation instead of to the NQS. Although the MSR may seem to be an additional physical input that cannot be generalized to other models, the generality of this work is not reduced, because it has been shown that simple sign structures such as the MSR can be represented exactly by NQS [54,55].

Hilbert-space reduction. Furthermore, we take several measures to reduce the Hilbert-space dimension for better performance. For instance, a reduced Hilbert space with Σ_i S_i,z = 0 is adopted, owing to the conserved total magnetization of the Heisenberg interactions, which is achieved by generating Monte-Carlo samples under this constraint.
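For illustration, the Marshall sign of a configuration can be computed in a few lines. This sketch uses the common convention that the sign is (−1) raised to the number of up spins on one checkerboard sublattice; the choice of sublattice is a convention:

```python
import numpy as np

def marshall_sign(sigma):
    """Marshall sign (-1)^(N_A_up) for a spin-1/2 configuration on a square lattice.

    sigma: (L, L) array with entries +1 (up) or -1 (down).
    Sublattice A is taken as the sites with even x + y (checkerboard convention).
    """
    L = sigma.shape[0]
    xs, ys = np.meshgrid(np.arange(L), np.arange(L), indexing="ij")
    n_up_A = int(np.sum((sigma == 1) & ((xs + ys) % 2 == 0)))
    return (-1) ** n_up_A
```

As stated above, in practice this sign is absorbed into the Hamiltonian via a basis rotation rather than multiplied onto the NQS output.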
Since symmetry also helps to reduce the Hilbert-space dimension and plays an essential role in the practice of NQS [20,56], we apply symmetries on top of the well-trained ψ^net_σ to project the variational state onto suitable sectors and further improve the accuracy. Assuming the system permits a symmetry group of order ν, represented by operators T_i with characters ω_i, the symmetrized wave function is defined as [20,57]

ψ_σ = (1/ν) Σ_{i=1}^{ν} ω_i ψ^net_{T_i σ},   (38)

so that the projected state transforms according to the characters of the chosen symmetry sector. With the translation symmetry already enforced by the CNN architecture, the remaining symmetries applied through Eq. (38) are the C4v point-group symmetry and the spin-inversion symmetry σ → −σ. In total, there are 16 elements in the symmetry group.
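The projection of Eq. (38) amounts to averaging the network amplitude over the transformed configurations, weighted by the characters. A minimal sketch for the order-2 spin-inversion group alone follows; the toy `psi` is a hypothetical stand-in for the trained network:

```python
import numpy as np

def symmetrize(psi_net, sigma, ops, chars):
    """Symmetrized amplitude psi_sym(sigma) = (1/nu) * sum_i w_i * psi_net(T_i sigma).

    psi_net : callable mapping a configuration to an amplitude
    ops     : symmetry operations T_i acting on configurations
    chars   : characters w_i selecting the symmetry sector
    """
    nu = len(ops)
    return sum(w * psi_net(T(sigma)) for T, w in zip(ops, chars)) / nu

# Spin-inversion group {identity, sigma -> -sigma}, even sector (characters +1, +1)
psi = lambda s: np.exp(0.3 * s.sum()) + 0.1 * s[0]   # toy, non-symmetric "network"
ops = [lambda s: s, lambda s: -s]
chars = [1.0, 1.0]

sigma = np.array([1, -1, 1, 1])
a = symmetrize(psi, sigma, ops, chars)
b = symmetrize(psi, -sigma, ops, chars)
assert np.isclose(a, b)   # the projected amplitude is invariant under inversion
```

With characters (+1, −1) the same routine projects onto the odd sector instead, where the amplitude changes sign under σ → −σ.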

Reweighting variational Monte Carlo
The usual sampling in VMC, with probability proportional to |ψ_σ|², leads to two problems in practice. First, the Markov chain is easily trapped in a small portion of the whole search space, especially for frustrated models with rugged probability distributions [16]. This problem can be solved by employing autoregressive architectures [24,58], fixed-node Hamiltonian constructions [59], or kinetic samplers [60]. The second problem is that a few configurations with large amplitudes are much more likely to be visited by the Markov chain than the rest. In this case, a large portion of the Hilbert space is not visited by the Markov-chain Monte Carlo, which reduces the global accuracy.
To address these two problems, we utilize reweighting Monte Carlo, which samples with probability proportional to |ψ|^n instead of |ψ|², with n ranging from 0 to 2 and usually chosen as 1. This is similar to kinetic samplers, which accept Monte-Carlo proposals with modified probabilities to overcome rugged probability distributions. As the samples are generated with a different probability, additional steps are required to recover the expectation values of the original probability distribution. In general, for random variables x_i with two unnormalized probability distributions p_i and q_i, one has

⟨x⟩_p = Σ_i p_i x_i / Σ_i p_i = (Σ_i q_i x_i p_i/q_i) / (Σ_i q_i p_i/q_i) = ⟨x p/q⟩_q / ⟨p/q⟩_q,   (40)

where ⟨...⟩_p and ⟨...⟩_q denote the expectation values under the two probability distributions. With p = |ψ|² and q = |ψ|^n, one obtains the corrected formula for the reweighted expectation value ⟨x⟩_p = ⟨x |ψ|^(2−n)⟩_q / ⟨|ψ|^(2−n)⟩_q. The reweighting method leads to an increased variance that negatively impacts the training procedure, and it is therefore not applied to the non-frustrated Heisenberg model. Some measures, explained in the supplementary information, are taken to reduce the influence of the increased variance. Furthermore, all measurements of expectation values in this work are performed without reweighting to ensure a fair comparison with the previous literature.
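Concretely, the reweighted estimator of Eq. (40) with p/q = |ψ|^(2−n) can be written in a few lines (a sketch; `x` and `psi` are assumed to be arrays of observable values and amplitudes evaluated on the sampled configurations):

```python
import numpy as np

def reweighted_mean(x, psi, n=1.0):
    """Estimate <x> under p ~ |psi|^2 from configurations sampled with q ~ |psi|^n.

    Implements <x>_p = <x p/q>_q / <p/q>_q from Eq. (40), with p/q = |psi|^(2-n).
    """
    w = np.abs(psi) ** (2.0 - n)   # unnormalized weights p/q
    return np.sum(w * x) / np.sum(w)
```

For n = 2 the weights are constant and the plain sample mean is recovered; for n = 0, applied to a full enumeration of the basis, the estimator reduces to the exact Born-rule average Σ_σ |ψ_σ|² x_σ / Σ_σ |ψ_σ|².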

FIG. 1 .
FIG. 1. Illustration of neural quantum states (NQS) and minimum-step stochastic reconfiguration (MinSR). a Within the NQS approach, an artificial neural network is used to represent a quantum many-body state. A change of the network parameters of an NQS leads to a new quantum state, whose distance to the previous NQS is given by the quantum metric S ∈ C^(Np×Np), where Np is the number of variational parameters. b The quantum metric S = O†O can be decomposed into a smaller matrix O ∈ C^(Ns×Np), with Ns ≪ Np the number of Monte-Carlo samples. The optimization of an NQS involves the inversion of the quantum metric S, which is equivalent to determining its nonzero eigenvalues λi with i = 1, . . ., Ns. Within MinSR, a neural tangent kernel T = OO† ∈ C^(Ns×Ns) is introduced, which has identical eigenvalues λi and therefore carries the essential information of S.

4 FIG. 2 .
FIG. 2. Performance evaluation of various optimization methods in solving the linear equation Oδθ = ε for a single training step. The tested methods include stochastic gradient descent (SGD), stochastic reconfiguration (SR) with both the pseudo-inverse (pinv) and conjugate-gradient (cg) solvers, as well as minimum-step SR (MinSR). The SR (pinv) result is unavailable for Np ≳ 10^4 due to memory constraints on the utilized A100 80G GPU. a Time cost in seconds for the different optimization methods as a function of the number of variational parameters Np, for different numbers of Monte-Carlo samples Ns. b Relative residual error ||Oδθ − ε||/||ε|| for Ns = 10^4. c Relation between the residual error and the time cost for Ns = 10^4.

FIG. 3 .
FIG. 3. Neural quantum state (NQS) wave-function amplitudes for the Heisenberg and J1-J2 models on a 6 × 6 square lattice, obtained by deep convolutional neural networks (CNNs) with 64 layers and 146320 parameters by means of minimum-step stochastic reconfiguration (MinSR). The spin configurations are sorted according to the exact diagonalization (ED) wave-function amplitudes in descending order, as shown by the black dotted lines. All wave-function amplitudes are shown for the Heisenberg model, while for the J1-J2 model only one point is plotted among 10000 successive points. In the inset, we show the infidelity obtained by the deep CNN with different numerical precisions. The infidelity of a medium-scale CNN with 13750 parameters, which approaches the size limit of SR (pinv), is also shown for comparison.

FIG. 4 .
FIG. 4. Variational ground-state energies E for the 10 × 10 square lattice. a Relative error of the variational ground-state (GS) energy εrel = (E − EGS)/|EGS|, with EGS the ground-state energy, for the non-frustrated Heisenberg model as a function of the number of variational parameters Np. The variational energies obtained in this work by means of a deep convolutional neural network (CNN) trained with minimum-step stochastic reconfiguration (MinSR) are compared to previous results in the literature, including restricted Boltzmann machines (RBMs) [5], shallow CNNs [10], and RBMs with an additional exact Lanczos step [38]. As no tensor-network (TN) data are available for periodic boundary conditions (PBCs), the best result with open boundary conditions (OBCs) is included as a dashed line [40]. b Ground-state energies of the frustrated J1-J2 model at J2/J1 = 0.5. The results obtained in this work by means of MinSR for a CNN are compared to previous results in the literature for a shallow CNN [10], an RBM with an additional exact Lanczos step [38], and a CNN with transfer learning [29]. Further results from methods other than NQS are included as dashed lines, such as TN [41], the Gutzwiller wave function (GWF) with 2 additional exact Lanczos steps (Lanczos2) [42], and the combination of pair product state (PP) and RBM [7]. The exact GS energy in the frustrated case is estimated by an extrapolation of the variational energy and the energy variance in our NQS results. As a further reference, the so-called Marshall-sign-rule (MSR) limit is included, which is obtained from an NQS training for a wave function whose sign structure is not learned but fixed by the MSR.

TABLE I .
Variational ground-state energy per site E/N for the frustrated J1-J2 model at J2/J1 = 0.5 on the 16 × 16 square lattice.