On the Sample Complexity of Quantum Boltzmann Machine Learning

Quantum Boltzmann machines (QBMs) are machine-learning models for both classical and quantum data. We give an operational definition of QBM learning in terms of the difference in expectation values between the model and target, taking into account the polynomial size of the data set. By using the relative entropy as a loss function this problem can be solved without encountering barren plateaus. We prove that a solution can be obtained with stochastic gradient descent using at most a polynomial number of Gibbs states. We also prove that pre-training on a subset of the QBM parameters can only lower the sample complexity bounds. In particular, we give pre-training strategies based on mean-field, Gaussian Fermionic, and geometrically local Hamiltonians. We verify these models and our theoretical findings numerically on a quantum and a classical data set. Our results establish that QBMs are promising machine learning models.


INTRODUCTION
Machine learning (ML) research has developed into a mature discipline with applications that impact many different aspects of society.Neural network and deep learning architectures have been deployed for tasks such as facial recognition, recommendation systems, time series modeling, and for analyzing highly complex data in science.In addition, unsupervised learning and generative modeling techniques are widely used for text, image, and speech generation tasks, which many people encounter regularly via interaction with chatbots and virtual assistants.Thus, the development of new machine learning models and algorithms can have significant consequences for a wide range of industries, and more generally, society as a whole [1].
Recently, researchers in quantum information science have asked the question of whether quantum algorithms can offer advantages over conventional machine learning algorithms.This has led to the development of quantum algorithms for gradient descent, classification, generative modeling, reinforcement learning, as well as many other tasks [2][3][4][5][6].However, one cannot straightforwardly generalize results from the conventional ML realm to the quantum ML realm.One must carefully reconsider the data encoding, training complexity, and sampling in the quantum machine learning (QML) setting.For example, it is yet unclear how to efficiently embed large data sets into quantum states so that a genuine quantum speedup is achieved [7,8].Furthermore, as quantum states prepared on quantum devices can only be accessed via sampling, one cannot estimate properties with arbitrary precision.This gives rise to new problems, such as barren plateaus [9][10][11][12][13][14], that make the training of certain QML models challenging or even practically impossible.
In this work, we show that a particular quantum generative model, the quantum Boltzmann machine [15][16][17][18] (QBM) without hidden units, does not suffer from these issues, and can be trained with polynomial sample complexity on future fault-tolerant quantum computers.QBMs are physics-inspired ML models that generalize the classical Boltzmann machines to quantum Hamiltonian ansätze.The Hamiltonian ansatz is defined on a graph where each vertex represents a qubit and each edge represents an interaction.The task is to learn the strengths of the interactions, such that samples from the quantum model mimic samples from the target data set.Quantum generative models of this kind could find use in ML for science problems, by learning approximate descriptions of the experimental data.QBMs could also play an important role as components of larger QML models [19][20][21][22][23].This is similar to how classical Boltzmann machines can provide good weight initializations for the training of deep neural networks [24].One advantage of using a QBM over a classical Boltzmann machine is that it is more expressive since the Hamiltonian ansatz can contain more general non-commuting terms.This means that in some settings the QBM outperforms its classical counterpart, even for classical target data [18].Here we focus on QBMs without hidden units since their inclusion leads to additional challenges [12,25,26], and there exists no evidence that they are beneficial.
In order to obtain practically relevant results, we begin by providing an operational definition of QBM learning.Instead of focusing on an information-theoretic measure, we assess the QBM learning performance by the difference in expectation values between the target and the model.We require that the data set and model parameters can be efficiently stored in classical memory.This means that the number of training data points is polynomial in the number of features if the target is classical, or in the number of particles if the target is quantum.Thus, statistics computed from such data set have polynomial precision.We then employ stochastic gradient descent methods [27,28] in combination with shadow tomography [29][30][31] to prove that QBM learning can be solved with polynomially many evaluations of the QBM model.Each evaluation of the model requires the preparation of one Gibbs state and therefore in this paper we refer to the sample complexity as the required number of Gibbs state preparations.We also prove that the required number of Gibbs samples is reduced by pre-training on a subset of the parameters.This means that classically pre-training a simpler QBM model can potentially reduce the (quantum) training complexity.For instance, we show that one can analytically pre-train a mean-field QBM and a Gaussian Fermionic QBM in closed form.In addition, we show that one can pre-train geometrically local QBMs with gradient descent, which has some improved performance guarantees.To the best of our knowledge, this is the first time these exactly solvable models are used for either training or pre-training QBMs.
It is important to note that the time complexity of preparing and estimating properties of Gibbs states at finite temperature is largely unknown, but can be exponential in the worst case [32].For cases where the Gibbs state preparation is intractable, the polynomial sample complexity of QBM learning becomes irrelevant as obtaining one sample already takes exponential time.On the other hand, the polynomial sample complexity is a prerequisite for efficient learning which many other QML models do not share.In this work we are only concerned with the sample complexity of the QBM learning problem, and we do not discuss any specific Gibbs sampling implementation.We shall mention in passing that Gibbs states satisfying certain locality constraints can be efficiently simulated by classical means, e.g., using tensor networks [33,34].Furthermore, generic Gibbs states can be prepared and sampled on a quantum computer by a variety of methods (e.g., see [35][36][37][38][39][40][41][42][43]), potentially with a quadratic speedup over classical algorithms.There also exist hybrid quantum-classical variational algorithms for Gibbs state preparation [44][45][46][47][48][49], but there are currently many open issues regarding their trainability [9,11,14,50].In practice QBM learning allows for great flexibility in model design and can be combined with almost any Gibbs sampler.
Throughout the paper we compare QBM learning and a related, but crucially different, problem called Hamiltonian learning [31,[51][52][53][54][55].Learning the Hamiltonian of a physical system in thermal equilibrium can be thought of as performing quantum state tomography.This has useful applications, for example in the characterization of quantum devices, but requires prior knowledge about the target system and suffers an unfavorable scaling with respect to the temperature [53].In contrast, the aim of QBM learning is to produce a generative model of the data at hand, with the Hamiltonian playing the role of an ansatz.
Our setup and theoretical results are summarized in Fig. 1.We provide classical numerical simulation results in Figs. 2 and 3 which confirm our analytical findings.We then conclude the paper with a discussion of the generalization capability of QBMs and avenues for future work.

METHODS
We start by formally setting up the quantum Boltzmann machine (QBM) learning problem.We give the definitions of the target and model and describe how to assess the performance by the precision of the expectation values.These definitions and assumptions are key to our results, and we introduce and motivate them here carefully.To further motivate our problem definition we compare to other related problems in the literature, such as quantum Hamiltonian learning.
We consider an n-qubit density matrix η as the target of the machine learning problem.If the target is classical, n could represent the number of features, e.g., the pixels in black-and-white pictures, or more complex features that have been extracted and embedded in the space of n qubits.If the target is quantum, n could represent the number of spin-1 2 particles, but again more complex many-body systems can be embedded in the space of n qubits.In the literature, it is often assumed that algorithms have direct and simultaneous access to copies of η.We do not take that path here.Instead, we consider a setup where one only has access to classical information about the target.We have a data set D = {s µ } of data samples s µ and, crucially, we assume the data set can be efficiently stored in a classical memory: the amount of memory required to store each data sample is polynomial in n, and there are polynomially many samples.For example, s µ can be bitstrings of length n; this includes data sets like binary images and time series, categorical and count data, or binarized continuous data.As another example, the data may originate from measurements on a quantum system.In this case, s µ identifies an element of the positive operator-valued measure describing the measurement.
Next, we define the machine learning model which we use for data fitting.The fully-visible QBM is an n-qubit mixed quantum state of the form where Z = Tr e H θ is the partition function.The parameterized Hamiltonian is defined as where θ ∈ R m is the parameter vector, and {H i } is a set of m Hermitian and orthogonal operators acting on the 2 n -dimensional Hilbert space.For example, these could be n-qubit Pauli operators.As the true form of the target density matrix is unknown, the choice of operators {H i } in the Hamiltonian is an educated guess.It is possible that, once the Hamiltonian ansatz is chosen, the space of QBM models does not contain the target, i.e., ρ θ ̸ = η, ∀θ.This is called a model mismatch, and it is often unavoidable in machine learning.In particular, since we require m to be polynomial in n, ρ θ cannot encode an arbitrary In Definition 1 we provide an operational definition of the QBM learning problem where the model and target expectations must be close to within polynomial precision ϵ.A solution θ opt is guaranteed to exist by Jaynes' principle.With Theorems 1 and 2 we establish that QBM learning can be solved by minimizing the quantum relative entropy S(η∥ρ θ ) with respect to θ using stochastic gradient descent (SGD).This requires a polynomial number of iterations T , each using a polynomial number of Gibbs state preparations, i.e., the sample complexity is polynomial.With Theorem 3 we prove that pre-training strategies that optimize a subset θ pre of the QBM parameters are guaranteed to lower the initial quantum relative entropy.After training the QBM can be used to generate new synthetic data.Icons by SVG Repo, used under CC0 1.0 and adapted for our purpose.
density matrix.Note that for fixed parameters, ρ θ is also known as Gibbs state in the literature.Thus, we refer to an evaluation of the model as the preparation of a Gibbs state.
A natural measure to quantify how well the QBM ρ θ approximates the target η is the quantum relative entropy This measure generalizes the classical Kullback-Leibler divergence to density matrices.The relative entropy is exactly zero when the two densities are equal, η = ρ θ , and S > 0 otherwise.In addition, when S(η∥ρ θ ) ≤ ϵ, by Pinsker's inequality, all Pauli expectation values are within O( √ ϵ).However, achieving this is generally not possible due to the model mismatch.In theory, one can minimize the relative entropy in order to find the optimal model parameters θ opt = argmin θ S(η∥ρ θ ).The form of the partial derivatives can be obtained analytically and reads This is the difference between the target and model expectation values of the operators that we choose in the ansatz.A stationary point of the relative entropy is obtained when ⟨H i ⟩ ρ θ = ⟨H i ⟩ η for i ∈ {1, . . ., m}.Since S is strictly convex (see Supplementary Note 2) this stationary point is the unique global minimum.It is difficult in practice to quantify how well the QBM is trained by means of relative entropy.In order to accurately estimate S(η∥ρ θ ), one requires access to the entropy of the target and the partition function of the model.Moreover, due to the model mismatch, the optimal QBM may have S(η∥ρ θ opt ) > 0, which is unknown beforehand.In this work, we provide an operational definition of QBM learning based on the size of the gradient ∇S(η∥ρ θ ).
Definition 1 (QBM learning problem).Given a polynomial-space data set {s µ } obtained from an n-qubit target density matrix η, a target precision ϵ > 0, and a fully-visible QBM with Hamiltonian H θ = m i=1 θ i H i , find a parameter vector θ such that with high probability The expectation values of the target can be obtained from the data set in various ways.For example, for the generative modeling of a classical binary data set one can define a pure quantum state and obtain its expectation values (see Kappen [18] and Supplementary Note 5).For the modeling of a target quantum state (density matrix) one can estimate expectation values from the outcomes of measurements performed in different bases.Due to the polynomial size of the data set, we can only compute properties of the target (and model) to finite precision.For example, suppose that s µ are data samples from some unknown probability distribution P (s) and that we are interested in the sample mean.An unbiased estimator for the mean is ŝ = 1 M M µ=1 s µ .The variance of this estimator is σ 2 /M , where σ 2 is the variance of P (s).By Chebyshev's inequality, with high probability the estimation error is of order σ/ √ M .The polynomial size of the data set implies that the error decreases polynomially in general.In light of this, we say the QBM learning problem is solved for any precision ϵ > 0 in Eq. ( 5).The idea is that the expectation values of the QBM and target should be close enough that one cannot distinguish them without enlarging the data set.
An exact solution to the QBM learning problem always exists by Jaynes' principle [56]: given a set of target expectations {⟨H i ⟩ η } there exists a Gibbs state We refer to the model corresponding to Jaynes' solution θ opt as the optimal model.In QBM learning we try to get as close as possible to this optimal model by minimizing the difference in expectation values.As shown in Supplementary Note 3, any solution to the QBM learning problem implies a bound on the optimal relative entropy, namely A similar result can be derived from Proposition A2 in Rouzé and Stilck França [31].This means that if one can solve the QBM learning problem to precision ϵ ≤ ϵ ′ 2∥θ−θ opt ∥1 , one can also solve a stronger learning problem based on the relative entropy to precision ϵ ′ .However, this requires bounding ∥θ − θ opt ∥ 1 .
We further motivate our problem definition by highlighting important differences with a related problem called quantum Hamiltonian learning [31,[51][52][53][54][55].This problem can be understood as trying to find a parameter vector ∥θ − θ * ∥ 1 ≤ ϵ, where θ * is the true parameter vector defining a target Gibbs state η.This is, in fact, a stronger version of the QBM learning problem because finding the true parameters, θ * , is more demanding than getting ϵ-close to the expectation values of the optimal model from Jaynes' solution.In realistic machine learning settings, we do not know the exact form of η a priori and, Hamiltonian learning is in effect equivalent to quantum state tomography.This task has an exponential sample-complexity lower bound [29].Moreover, even if we do know the exact form, minimizing the distance to the true parameters, ∥θ − θ * ∥ 1 , can make the problem significantly more complicated than QBM learning, and potentially impractical to solve.Consider a single-qubit target η = e H * / Tr e H * , with Hamiltonian H * = θ * σ z .We can easily see that near ⟨σ z ⟩ η = 1 − ϵ the parameter θ * diverges.For example, if θ * = 2.64 then ⟨σ z ⟩ η = 0.99 and if θ * = 3.8 then ⟨σ z ⟩ η = 0.999, from which we see that the optimal parameter is very sensitive, O(1), to small changes, O(10 −2 ), in expectation values.In these cases, Hamiltonian learning is much harder than simply finding a QBM whose expectation value is within ϵ.A similar argument can be made for high-temperature targets near the maximally mixed state.Here the QBM problem is trivial: since all target expectations are ⟨H i ⟩ η ≈ 2 −n Tr[H i ], any model sufficiently close to the maximally-mixed state solves the QBM learning problem.In contrast, Hamiltonian learning insists on estimating the unique true parameters.
Let us now discuss a method for solving QBM learning.We approach the problem by iteratively minimizing the quantum relative entropy, Eq. (3), by stochastic gradient descent (SGD) [28].This method requires access to a stochastic gradient ĝθ t computed from a set of samples at time t.Recall that our gradient has the form given in Eq. ( 4).The expectations with respect to the target are estimated from a random subset of the data, often called a mini-batch.The mini-batch size is a hyper-parameter and determines the precision ξ of each target expectation.Similarly, the QBM model expectations are estimated from measurements, for example using classical shadows [29,30] of the Gibbs state ρ θ t approximately prepared on a quantum device [43].The number of measurements is also a hyper-parameter and determines the precision κ of each QBM expectation.We assume that the stochastic gradient is unbiased, i.e., E[ĝ θ t ] = ∇S(η∥ρ θ )| θ=θ t , and that each entry of the vector has bounded variance.At iteration t, SGD updates the parameters as where γ t is called the learning rate.

Sample complexity
Stochastic gradient descent solves the QBM learning problem with polynomial sample complexity.We state this in the following theorem, which is the main result of our work.
Theorem 1 (QBM training).Given a QBM defined by a set of n-qubit Pauli operators {H i } m i=1 , a precision κ for the QBM expectations, a precision ξ for the data expectations, and a target precision ϵ such that κ 2 + ξ 2 ≥ ϵ 2m .After iterations of stochastic gradient descent on the relative entropy S(η∥ρ θ ) with constant learning rate where E[•] denotes the expectation with respect to the random variable θ t .Each iteration t ∈ {0, . . ., T } requires preparations of the Gibbs state ρ θ t , and the success probability of the full algorithm is λ.
A detailed proof of this theorem is given in Supplementary Note 3. It consists of carefully combining three important observations and results.First, we show that the quantum relative entropy for any QBM ρ θ is Lsmooth with L = 2m max j ∥H j ∥ 2 2 , which follows from upper bounding the relative entropy and quantum belief propagation [57].A similar result can be found in Rouzé and Stilck França [31], Proposition A4.This is then combined with SGD convergence results from the machine learning literature [27,28] to obtain the number of iterations T .Finally, we use sampling bounds from quantum shadow tomography to obtain the number of preparations N .In this last step, we focus on the shadow tomography protocol in Huang et al. [30], which restricts our result to Pauli observables H i ≡ P i , thus ∥H i ∥ 2 = 1.It is possible to extend this to generic two-outcome observables [29,58] with a polylogarithmic overhead compared to Eq. (10).Furthermore, for k-local Pauli observables, we can improve the result to with classical shadows constructed from randomized measurement [59] or by using pure thermal shadows [43].
By combining Eqs. ( 8) and ( 10), we see that the final number of Gibbs state preparations N tot ≥ T × N scales polynomially with m, the number of terms in the QBM Hamiltonian.By our classical memory assumption, we can only have m ∈ O(poly(n)).This means that the number of required measurements to solve QBM learning scales polynomially with the number of qubits.Consequently, there are no barren plateaus in the optimization landscape for this problem, resolving an open question from the literature [26,50] for QBMs without hidden units.Furthermore, also the stronger problem in Eq. ( 6) can be solved without encountering an exponentially vanishing gradient as we show in Supplementary Note 3.
We remark, however, that our result does not imply a solution to the stricter Hamiltonian learning problem with polynomially many samples.In order to achieve this one needs some stronger conditions.For example, using α-strong convexity for the specific case of geometrically local models, as shown in Anshu et al. [53].In Supplementary Note 2, we investigate generalizing this by looking at generic Hamiltonians made of low-weight Pauli operators.We perform numerical simulations using differentiable programming [60] and find evidence indicating that even in this case the quantum relative entropy is αstrong convex.This means we could potentially achieve an improved sample complexity for QBM learning.We prove the following theorem in Supplementary Note 3.
Theorem 2 (α-strongly convex QBM training).Given a QBM defined by a Hamiltonian ansatz H θ such that S(η∥ρ θ ) is α-strongly convex, a precision κ for the QBM expectations, a precision ξ for the data expectations, and a target precision ϵ such that κ 2 + ξ 2 ≥ ϵ 2m .After iterations of stochastic gradient descent on the relative entropy S(η∥ρ θ ) with learning rate γ t ≤ 1 4m 2 , we have Each iteration requires the number of samples given in Eq. (10).
Finally, we observe that the sample bound in Theorem 1 depends on δ 0 , the relative entropy difference of the initial and optimal QBMs.This means that if we can lower the initial relative entropy with an educated guess, we also tighten the bound on the QBM learning sample complexity.In this respect, we prove that we can reduce δ 0 by pre-training a subset of the parameters in the Hamiltonian ansatz.Thus, pre-training reduces the number of iterations required to reach the global minimum.
Theorem 3 (QBM pre-training).Assume a target η and a QBM model ρ θ = e m i=1 θiHi /Z for which we like to minimize the relative entropy S(η∥ρ θ ).Initializing θ 0 = 0 and pre-training S(η∥ρ θ ) with respect to any subset of m ≤ m parameters guarantees that where θ pre = [χ pre , 0 m− m] and the vector χ pre of length m contains the parameters for the terms {H i } m i=1 at the end of pre-training.More precisely, starting from ρ χ = e m i=1 χiHi /Z and minimizing S(η∥ρ χ ) with respect to χ ensures Eq. ( 14) for any S(η∥ρ χ pre ) ≤ S(η∥ρ χ 0 ).We refer to Supplementary Note 4 for the proof of this theorem.It applies to any method that is able to minimize the relative entropy with respect to a subset of the parameters.Crucially, all the other parameters are fixed to zero, and the pre-training needs to start from the maximally mixed state I/2 n .Note that the maximally mixed state is not the most 'distant' state from the target state, and there exist states for which S(η∥ρ θ ) > S(η∥I/2 n ).As an example of pre-training, one could use SGD as described above, and apply updates only to the chosen subset of parameters.With a suitable learning rate, this ensures that pre-training lowers the relative entropy compared to the maximally mixed state S(η∥I/2 n ).As a consequence, one can always add additional linearly independent terms to the QBM ansatz without having to retrain the model from scratch.The performance is guaranteed to improve, specifically towards the global optimum due to the strict convexity of the relative entropy.By Eq. ( 14) and Pinsker's inequality, the trace-distance to all the other observables reduces as well.This is in contrast to other QML models which do not have a convex loss function.It is particularly useful if one can pre-train a certain subset classically before training the full model on a quantum device.For instance, in Supplementary Note 4 we present mean-field and Gaussian Fermionic pre-training strategies with closed-form expressions for the optimal subset of parameters.

Numerical experiments
In order to verify our theoretical findings we perform numerical experiments of QBM learning on data sets constructed from a quantum and a classical source.In particular, we use expectation values of the Gibbs state of a 1D quantum XXZ model in an external field [61,62], and expectation values of a classical salamander retina data set [63].How to obtain the target state expectation values for both cases is explained in Supplementary Note 5.
First we focus on reducing the initial relative entropy S(η∥ρ θ 0 ) by QBM pre-training, following Theorem 3. We consider mean-field (MF), Gaussian Fermionic (GF), and geometrically local (GL) models as the pretraining strategies.The Hamiltonian ansatz of the MF model consists of all possible one-qubit Pauli terms 2), hence it has 3n parameters.The Hamiltonian of the GF model is a quadratic form In contrast, the GL models are defined with a Hamiltonian ansatz for which, in general, the parameter vector ⃗ θ ≡ {λ, σ} cannot be found analytically.Here the sum ⟨i, j⟩ imposes some constraints on the (geometric) locality of the model, i.e., summing over all possible nearest neighbors in some d-dimensional lattice.In particular, we choose one-and two-dimensional locality constraints suitable with the assumptions given in previous work [53,54].In these specific cases the relative entropy is strongly convex, thus pre-training has the improved performance guarantees from Theorem 2.
In Fig. 2 (a) we show the relative entropy S(η∥ρ θ pre ) obtained after pre-training with these models for 8-qubit problems.For MF and GF we use the closed form solutions, and for GL we use exact gradient descent with learning rate of γ = 1/ m.As a comparison, we also show the result obtained for the QBM without pre-training, i.e., for the maximally mixed state S(η∥ρ θ=0 ).We observe that for both targets, all pre-training strategies are successful in obtaining a lower S(η∥ρ θ pre ) compared to the maximally mixed state.For the GL 1D ansatz, the target quantum state is contained within the QBM model space, which means that the relative entropy is zero after pre-training.This shows that having knowledge about the target (e.g., the fact that it is one dimensional) could help inform QBM ansatz design and significantly reduce the complexity of QBM learning.The GF model, which has completely different terms in the ansatz, manages to reduce S(η∥ρ θ pre ) by a factor of ≈ 5 for the quantum target and ≈ 4 for the classical target.By the Jordan-Wigner transformation, we can express the target XXZ model in a Fermionic basis.In this representation, the target only has a small perturbation compared to the model space of the GF model.This could explain the good performance of pre-training with the GF model.
Next, we investigate the effect of using pre-trained models as starting point for QBM learning with exact gradient descent.We extend all pre-trained models with additional linearly independent terms in the ansatz where now, compared to Eq. ( 15), any qubit can be connected to any other qubit, and we do not have a constraint on the geometric locality.This is the fullyconnected QBM Hamiltonian used in previous work [18,43].We now consider data from the quantum target η = e HXXZ /Z for 8 qubits.In Fig. 2 (b) we show the quantum relative entropy during training when starting from different pre-trained models ρ θ 0 := ρ θ pre .The initial parameter vector θ 0 is the one obtained at the end of pre-training on quantum data in panel (a).We do not include gradient noise in this simulation, i.e., κ = ξ = 0, and we use a learning rate of γ = 1/(2m).The performance of the MF pre-trained model (orange line) is better than that of the vanilla model (blue line) for all t.However, the improvement is modest.Starting from a GL 2D model (red line) has a much more significant effect, with S(η∥ρ θ t ) being an order of magnitude smaller than that of the vanilla model at all iterations t.This pre-training strategy requires very few gradient descent iterations (dark gray area).This could be stemming from the strong convexity of this particular pre-training model.In general, the benefits of pre-training needs to be assessed on a case-by-case basis as the size of the improvement depends on the particular target and the particular pre-training model used.Importantly, we only proved Theorem 1 for a learning rate of γ = min{ 1 L , ϵ 4m 2 (κ 2 +ξ 2 ) }. Therefore, choosing a larger learning rate might reduce the benefits of pre-training.
Lastly, we numerically confirm our bound on the number of SGD iterations (Eq.( 8) in Theorem 1).We consider data from the classical salamander retina target with 8 features and a fully-connected QBM model on 8 qubits.In Fig. 3 we compare training with two different noise strengths κ 2 + ξ 2 .We implemented these settings by adding Gaussian noise effectively simulating the noise strength determined by the number of data points in the data set and the number of measurements of the Gibbs state on a quantum device.Using the standard Monte Carlo estimate, at each update iteration we require a mini-batch of data samples of size 1 ξ 2 , and a number of measurements 1  κ 2 (assuming these measurements can be performed without additional hardware noise).We could even use mini batches of size 1 and a single measurement, as long as the expectation values are unbiased.For both noise strengths, we obtain the desired target precision of ϵ = 0.1 within 40000 iterations.This is well within the bound of order 10 9 on the number of iterations obtained from Theorem 1, which is the worst-case scenario.For the interested reader, in Supplementary Note 3 we provide additional numerical evidence confirming the scaling behavior of our theorems.

Generalization
Before concluding our work, in this section we briefly comment on the generalization capability of the fullyconnected QBM to which our convergence results apply.In particular, we connect with a result proved by Aaronson [64], which concerns the learnability of quantum states.
Theorem 4 (Theorem 1.1 in [64]).Let η be a 2 ndimensional mixed state, and let D be any probability measure over two-outcome measurements.Then given samples E 1 , . . ., E m drawn independently from D, with probability at least 1 − ν, the samples have the following generalization property: any hypothesis state ρ such that | ⟨E i ⟩ ρ − ⟨E i ⟩ η | ≤ ζε for all i ∈ {1, . . ., m} will also satisfy provided we took for some large enough constant C.
In other words, when a hypothesis state ρ matches the expectation values of m randomly sampled measurement operators from distribution D, then, with high probability, most expectation values of the other measurement operators from D match up to error ε.Comparing to Definition 1, we can identify the learned QBM ρ θ as a specific hypothesis state that matches the expectation values of operators {H i } m i=1 of the target η.Our convergence results show how many Gibbs state samples are required to obtain such a hypothesis state.Thus, our QBM learning result in Theorem 1, in a sense, complements Theorem 4 and provides a concrete algorithm for finding the hypothesis ρ with O(poly(n)) samples.One distinction is that in QBM learning we already assume access to a fixed set of measurement operators which defines the Hamiltonian ansatz, instead of randomly sampling operators from a distribution.
In order to say something about generalization, we now consider a scenario in which the user is interested in solving the QBM learning problem for some fixed (potentially exponentially large) number of two-outcome measurement operators {H i } K i=1 .Then by defining D as the The target η is made of 8 features from the classical salamander retina data set, the model ρ θ t is a quantum Boltzmann machine (QBM).When computing expectation values for the gradient, we allow precision ξ for the target and precision κ for the QBM.We compare the combined noise strength of 0.01 (blue line) to 0.05 (orange line).We aim for a maximum error of ϵ = 0.1 (red dashed line) and use a learning rate of γ = ϵ 4m 2 (κ 2 +ξ 2 ) .SGD converges within a number of iterations consistent with Theorem 1.
uniform distribution over the set {H i } K i=1 , and constructing the QBM ansatz by sampling m < K operators from D, we can directly apply Theorem 4 for generalization.In particular, defining the ansatz in this way ensures that our trained QBM will, with high probability, generalize to any other observable sampled from D, including ones that are not in the ansatz.

CONCLUSIONS
In this paper, we give an operational definition of quantum Boltzmann machine (QBM) learning.We show that this problem can be solved with polynomially many preparations of quantum Gibbs states.To prove our bounds we use the properties of the quantum relative entropy in combination with the performance guarantees of stochastic gradient descent (SGD).We do not make any assumption on the form of the Hamiltonian ansatz, other than that it consists of polynomially many terms.This is in contrast with earlier works that looked at the related Hamiltonian learning problem for geometrically local models [53,54].There, strong convexity is required in order to relate the optimal Hamiltonian parameters to the Gibbs state expectation values.In our machine learning setting, we do not know the form of the target Hamiltonian a priori.Therefore, we argue that learning the exact parameters is irrelevant, and one should focus directly on the expectation values.Our bounds only require L-smoothness of the relative entropy and apply to all types of QBMs without hidden units.
We also show that our theoretical sampling bounds can be tightened by lowering the initial relative entropy of the learning process.We prove that pre-training on any subset of the parameters is guaranteed to perform better than (or equal to) the maximally mixed state.This is beneficial if one can efficiently perform the pretraining, which we show is possible for mean-field, Gaussian Fermionic, and geometrically local QBMs.We verify the performance of these models and our theoretical bounds with classical numerical simulations.From this, we learn that knowledge about the target (e.g., its dimension, degrees of freedom, etc.) can significantly improve the training process.Furthermore, we find that our generic bounds are quite loose, and in practice, one could get away with a much smaller number of samples.This brings us to interesting avenues for future work.One possibility consists of tightening the sample bound by going beyond the plain SGD method that we have used here.This could be done by adding momentum, by using other advanced update schemes [28,65], or by exploiting the convexity of the relative entropy.While we believe this can improve the O(poly(m, 1 ϵ )) scaling in our bounds, we note that it does not change the main conclusion of our paper: the QBM learning problem can be solved with polynomially many preparations of Gibbs states.Another important direction is the investigation of different ansätze and their performance.Generative models are often assessed in terms of training quality [66], but generalization capabilities have been recently investigated by both classical [47,67] and quantum [68,69] machine learning researchers.For the case of QBMs, we employ an alternative definition of generalization [64] and show how it can be used to construct ansätze.The numerical investigation of this method is an interesting venue for future work.
Our pre-training result could be useful for implementing QBM learning on near-term and early fault-tolerant quantum devices.To this end, one would use a quantum computer as a Gibbs sampler.There exists a plethora of quantum algorithms (e.g., see [36,38,[40][41][42]) that prepare Gibbs states with a quadratic improvement in time complexity over the best existing classical algorithms.Moreover, the use of a quantum device gives an exponential reduction in space complexity in general.One can also sidestep the Gibbs state preparation and use algorithms that directly estimate Gibbs-state expectation values, e.g., by constructing classical shadows of pure thermal quantum states [43].This reduces the number of qubits and, potentially, the circuit depth.
Finally, our results open the door to novel methods for the incremental learning of QBMs driven by the availability of both training data and quantum hardware.For example, one could select a Hamiltonian ansatz that is very well suited for a particular quantum device.After exhausting all available classical resources during the pre-training, one enlarges the model and continues the training on the quantum device, which is guaranteed to improve the performance.As the quantum hardware matures, it allows us to execute deeper circuits and to further increase the model size.Incremental QBM training strategies could be designed to follow the quantum hardware roadmap, towards training ever larger and more expressive quantum machine learning models.

DATA AVAILABILITY
The data that support the findings of this study are available at Zenodo [70].

SUPPLEMENTARY NOTE 1: SOME USEFUL MATHEMATICAL FACTS AND RELATIONS
Here we state and derive some useful mathematical results that are used in the proofs in the other appendices.The definitions and lemmas about convexity are taken from [28].The derivative of the matrix exponential expressed as quantum belief propagation is a result of [57] and we rederive it in a pedagogical way.

Convexity
Definition 2 (Convexity).A multivariate function f : R m → R is said to be convex when If additionally the gradient ∇f (x * ) is zero only for one unique vector x * ∈ R m , then f is said to be strictly convex.
The following Lemma can be deduced from the standard definition of convexity [28].
A stronger version of convexity is used in some of our discussions.
An even stronger convexity condition is the following.
Definition 4 (α-strong convexity).Let f : R m → R, and α > 0. We say that f is α-strongly convex if The latter implies the former.
The strong convexity of a function can be tested as follows.
Lemma 3. Let f be twice continuously differentiable.Then f is α-strongly convex if Besides convexity, we also need to characterize the smoothness of a function.
Definition 5 (L-smoothness).Let f : R m → R and L > 0. We say that f is L-smooth if it is differentiable and if the gradient ∇f is L-Lipschitz: For L-smooth functions we have the following useful property [28].
Lemma 4 (Descent lemma).Let f : R m → R be a twice differentiable, L-smooth function, then Derivative of a matrix exponential Lemma 5 (Quantum belief propagation [57]).Let H = W + θV be a Hamiltonian ansatz with parameter θ.Let f (t) be the function for which , and define the operator Fourier transform as where {A, B} = AB + BA is the anti-commutator.
Proof.The derivative of the matrix exponential e H with respect to a parameter is given by Duhamel's formula [71] ∂ Taking H = W + θV , with simple manipulations we find a useful alternative expression Here we use the basis diagonalizing the Hamiltonian, H = j λ j |j⟩⟨j|, and we introduce the notation V jk = ⟨j|V |k⟩.
The above expression is valid also for the diagonal entries, k = j, since lim x→0 With the notation f (ω) = tanh(ω/2) ω/2 we can write Let us interpret f (ω) as the Fourier transform of another function: Plugging this in the previous expression we obtain We have recovered a result from [57] by different means.

SUPPLEMENTARY NOTE 2: PROPERTIES OF THE QUANTUM RELATIVE ENTROPY FOR QBMS
In this appendix, we prove some properties of the quantum relative entropy S(η∥ρ θ ) of a generic Quantum Boltzmann Machine (QBM) ρ θ with respect to some arbitrary target η.These properties are used for the proof of the theorems in the main text.We start by showing the convexity and afterward, we show the L-smoothness.
Strict convexity Lemma 6.Let H θ = m i=1 θ i H i be a Hamiltonian ansatz where H i are non-commuting operators, and ρ θ = e H θ Z the corresponding QBM.For all states η, the quantum relative entropy S(η∥ρ θ ) is a strictly convex function of θ.
Proof.In order to show (strict) convexity of S we use Lemma 1.We first show that the Hessian of the quantum relative entropy with respect to the QBM parameters, ∇ 2 S, is positive semidefinite.Afterwards, we show that S has a unique global optimizer θ * for which ∇S(η∥ρ θ * ) = 0, and apply the Lemma.
Using the derivative of the matrix exponential in Eq. ( 31) we have In the last step, we use the cyclic property of the trace.This is Eq. ( 4) in the main text.We now take the second derivative starting from the third line of Eq. ( 32), In the last step we used Tr[A{B, C}] = Tr[C{A, B}] to rearrange the terms.As Φ(V ) is a Hermitian operator for any Hermitian V we see that the Hessian has the form of a covariance matrix.It is then readily shown to be positive semidefinite and to satisfy Eq. (20).For any vector Here we define the Hermitian operator W = n v n H n .The last line is the expectation value of the square of a Hermitian operator, and as such it must be non-negative.This means that the quantum relative entropy is convex.We now show strict convexity by a contradiction argument, following Proposition 17 in [53].Assume we have found one set of parameters θ * with ∇S(η∥ρ θ * ) = 0. Then from Eq. ( 32) we have θ for all H i .Note that we can always find at least one such θ * by Jaynes' principle [56].Next, assume there exists a different set of parameters, χ ̸ = θ * , with Similarly, by swapping ρ χ and ρ θ * , we find It follows that S(ρ * θ ∥ρ χ ) = 0, implying ρ * θ = ρ χ .Now because the operators H i are orthogonal we have θ * = χ.This contradicts the assumption in the beginning (θ * ̸ = χ), and we can have only one unique θ * with ∇S(η∥ρ θ * ) = 0. Hence S is strictly convex by Definition 2.

Strong convexity
To show α-strong convexity of S one can use Lemma 3. To the best of our knowledge, there is no proof in the literature showing that the quantum relative entropy of Gibbs states is strongly convex in general.On the other hand, this property has been proven for particular classes of Hamiltonians.Anshu et al. [53] proves strong convexity for k-local Hamiltonians defined on a finite dimensional lattice.They show that in this case α ∈ Ω(1/n), a polynomial decrease with respect to the system size.Haah et al. [54] proves strong convexity for the more general class of low-intersection Hamiltonians.Low-intersection Hamiltonians have terms that act non-trivially only on a constant number of qubits, and each term intersects non-trivially with a constant number of other terms.
In this section, we use differentiable programming [60] to numerically analyze the smallest eigenvalue of the Hessian, λ min (∇ 2 S), seeking evidence for strong convexity.
We consider a 1D nearestneighbor Hamiltonian, , and a fully-connected one We randomly sample coefficients uniformly in [−µ, µ] where µ is a scale parameter and determines the max-norm of the vector of the coefficients.Supplementary Figure 1 shows mean and standard deviation over 25 instances.The smallest eigenvalue decreases with the number of qubits until convergence to a fixed value.The fully-connected Hamiltonian has m ∈ O(n 2 ) parameters and yields smaller eigenvalues than the 1D Hamiltonian which has m ∈ O(n) parameters instead.These results provide evidence of strong convexity with α decreasing polynomially with the system size.We leave more extensive numerical studies for future work.

L-Smoothness
Lemma 7. Let H θ = m i=1 θ i H i be a Hamiltonian ansatz where H i are non-commuting operators, and ρ θ = e H θ Z the corresponding QBM.For all states η, the quantum relative entropy S(η∥ρ θ ) is a L-smooth function of θ with L = 2m max j ∥H j ∥ 2 2 .Proof.We look for an upper bound on the largest eigenvalue of the Hessian in Eq. (33).We begin with the following property where we use that f > 0 and fmax = 1.In what follows we use the above result with ∥ • ∥ 2 , the operator norm induced Supplementary Figure 1.Minimum eigenvalue of the Hessian, as a function of the number of qubits.We show the median of 25 random instances for the 1D nearest-neighbor Hamiltonian (a), and fully-connected Hamiltonian (b).The scale parameter µ determines the maximum size of random parameters.We observe that in all cases the smallest eigenvalues shrink with the number of qubits, but appear to plateau to a positive value.
by the Euclidean vector norm (p = 2).Let us bound the entries of the Hessian Here we use that expectations are bounded by the largest eigenvalue, or alternatively, by the p = 2 operator norm.We also use the sub-multiplicative property of the operator norm and Eq. ( 37) in the last line.We are now able to upper-bound the largest eigenvalue of the Hessian matrix The first equality uses the fact that the Hessian is a symmetric matrix, and the first inequality is a consequence of the Gershgorin circle theorem.
We can use this result to prove the L-smoothness.Let us define a function h(t) = ∇S(η∥ρ y+t(x−y) ).Then we have

∥∇S(η∥ρ
where in the last step we used Eq. ( 39).Thus the quantum relative entropy is L-smooth with L = 2m max j ∥H j ∥ 2 2 .

SUPPLEMENTARY NOTE 3: CONVERGENCE RESULTS OF SGD FOR TRAINING QBMS
In this appendix we first review useful results from the machine learning literature, then prove Theorems 1 and 2 in the main text.We also discuss a few upper bounds for the relative entropy in the context of QBM learning.Finally, we provide additional numerical results for the scaling behavior of our theorems.

Review of SGD convergence results
We begin by stating three convergence results from the Stochastic Gradient Descent (SGD) literature.Consider a loss function f : R m → R that is L-smooth (Def.5) and bounded from below by f inf ∈ R. The stochastic gradient is unbiased, i.e., E[ĝ] = ∇f , and satisfies for some A, B, C ≥ 0 and all x ∈ R m .SGD iteratively minimizes f according to the update rule x t = x t−1 − γ t ĝx t−1 at time step t.Khaled and Richtárik [27] proved the following SGD convergence result.
Lemma 8 (restatement of Corollary 1 in [27]).Choose precision ϵ > 0 and step size γ = min{ we have that SGD converges with Here E[•] denotes the expectation with respect to x t , which is a random variable due to the stochasticity in the gradient.Let us now consider a loss function which, in addition to the previous conditions, is also α-Polyak-Lojasiewicz (Def.3).We consider the following iterative learning rate scheme for γ t .
Lemma 9 (restatement of Lemma 3 in [27]).Consider a sequence (r t ) t satisfying r t+1 ≤ (1 − aγ t )r t + cγ 2 t , where γ t ≤ 1 b for all t ≥ 0 and a, c ≥ 0 with a ≤ b.Fix T > 0 and let k 0 = ⌈ T 2 ⌉.Then choosing the stepsize as For this learning rate scheme, Khaled and Richtárik [27] proved the following SGD convergence result.
Lemma 10 (restatement of Corollary 2 in [27]).Choose precision ϵ > 0 and step size γ t following Lemma 9 with we have that SGD converges with Proofs of Theorems 1 and 2 in the main text We prove Theorem 1, which is repeated here for completeness.
Theorem 1 (QBM training).Given a QBM defined by a set of n-qubit Pauli operators {H i } m i=1 , a precision κ for the QBM expectations, a precision ξ for the data expectations, and a target precision ϵ such that κ 2 + ξ 2 ≥ ϵ 2m .After iterations of stochastic gradient descent on the relative entropy S(η∥ρ θ ) with constant learning rate where E[•] denotes the expectation with respect to the random variable θ t .Each iteration t ∈ {0, . . ., T } requires preparations of the Gibbs state ρ θ t , and the success probability of the full algorithm is λ.
Proof.Recall from Supplementary Note 2 that the quantum relative entropy is L-smooth with L = 2m max i ∥H i ∥ 2  2 and that for Pauli operators ∥H i ∥ 2 = 1.Then, we can minimize the relative entropy by SGD and apply the convergence result in Lemma 8.
For the SGD algorithm, we need an unbiased gradient estimator with bounded variance.We recall that the gradient of the relative entropy is given by ∂ θi S(η∥ρ θ ) = ⟨H i ⟩ ρ θ − ⟨H i ⟩ η .The target expectation values ⟨H i ⟩ η are estimated as ĥi,η from the data set, as described in Supplementary Note 5. Note that |⟨H i ⟩ η − ĥi,η | ≤ ξ, where ξ > 0 is limited by the size of the data set.One can improve on ξ by collecting more data, as long as the number of samples is polynomial in n.
For estimating the QBM expectation values ⟨H i ⟩ ρ θ , we can use a number of techniques.Here we focus on classical shadow tomography.From Theorem 4 in [30], there exists a procedure that returns the expectation values of m different Pauli operators {H i } to precision κ with O( log m/ λ κ 4 ) preparations of ρ θ , and succeeds with probability 1 − λ.To make use of this result we now restrict the H i in the QBM Hamiltonian to be Pauli operators.As discussed in the main text, this is not a limitation, and it is possible to use non-Pauli operators in combination with other shadow tomography protocols.We can therefore obtain estimators ĥi,ρ θ such that We then use ĝθi = ĥi,ρ θ − ĥi,η as estimators for the partial derivatives of the quantum relative entropy.The variance of the norm of the gradient estimator is bounded as Since the variance can also be written as E∥ĝ θ ∥ 2 − ∥∇S(η∥ρ θ )∥ 2 we find that our setup is compatible with Eq. ( 45) for A = 0, B = 1, C = m(κ 2 + ξ 2 ).We choose ϵ < 1 and κ 2 + ξ 2 ≥ ϵ 2m in Lemma 8.This yields a learning rate of γ = ϵ 4m 2 (κ 2 +ξ 2 ) .We conclude that after iterations of SGD we have Here δ 0 = S(η∥ρ θ 0 )−S(η∥ρ θ opt ) is the relative entropy at the initialization minus the relative entropy at the optimum.Importantly we note the QBM expectation values are computed with a success probability 1 − λ at each iteration.Consequently, the total success probability of the whole training is equal to (1 − λ) T for T update steps.Then to have a total success probability of λ we need to set λ = (1 − λ 1/T ) in the shadow tomography protocol.This result, together with the sampling bound on the number of measurements of the shadow tomography, O( log m/ λ κ 4 ), completes the proof of Theorem 1.
We now provide a proof for Theorem 2, which we restate here.
Theorem 2 (α-strongly convex QBM training).Given a QBM defined by a Hamiltonian ansatz H θ such that S(η∥ρ θ ) is α-strongly convex, a precision κ for the QBM expectations, a precision ξ for the data expectations, and a target precision ϵ such that κ 2 + ξ 2 ≥ ϵ 2m .After iterations of stochastic gradient descent on the relative entropy S(η∥ρ θ ) with learning rate γ t ≤ 1 4m 2 , we have Each iteration requires the number of samples given in Eq. (49).
Proof.In order to prove this theorem, we first show that η, ρ opt , and ρ θ are 'collinear' with respect to the relative entropy.This is helpful because it allows us to directly bound the difference in relative entropy S(η∥ρ θ ) − S(η∥ρ θ opt ) instead of bounding the individual relative entropies.

S(η∥ρ
Here, in the going from the second to the third line, we used the fact that Tr[ηH i ] = Tr[ρ θ opt H i ], which follows from setting Eq. ( 32) to zero.Rearranging the terms we get the collinearity S(η∥ρ θ ) = S(η∥ρ θ opt ) + S(ρ θ opt ∥ρ θ ).This is a non-trivial result because the relative entropy is not a distance: it is not symmetric and does not satisfy the triangle inequality in general.With this relation, we are now able to prove Theorem 2.
The relative entropy S(ρ θ opt ∥ρ θ ) satisfies all the relevant assumptions for SGD convergence: it is an L-smooth function with L = 2m max i ∥H i ∥ 2 2 , it is bounded below by 0, and the stochastic gradient has bounded variance [Eqs.( 51) and ( 49) apply].In addition, the α-strong convexity assumed by the theorem implies that S(ρ θ opt ∥ρ θ ) is α-Polyak-Lojasiewicz by Lemma 2. This means we can invoke Lemma 10.As before, we set A = 0, B = 1, C = m(κ 2 + ξ 2 ) and choose ϵ ′ < 1 in the Lemma, thus obtaining a maximum learning rate γ t ≤ 1 4m 2 .Looking at the case where we find that after T ≥ 9m 2 (κ 2 +ξ 2 ) iterations the expected relative entropy is ES(ρ θ opt ∥ρ θ T ) ≤ ϵ ′ .Note that, depending on the problem-specific parameter δ 0 , and the free parameters κ and ξ, one could be in the other case of Lemma 10.Following the same steps shown here one arrives at a different number of iterations T , which is still polynomial in n.
Since the QBM learning problem is phrased in terms of the distance in the expectation values and we only have obtained a bound on the relative entropy we now relate the two.Using Pinsker's and Jensen's inequalities it follows that where in the last line we used the variational definition of trace distance.The maximization is over unitary matrices.
Let us restrict the maximization to unitary matrices defined as These have the property that This implies, To solve the QBM learning problem to precision ϵ we choose ϵ ′ = ϵ 2 2 and, since ∥H i ∥ = 1 for Pauli operators, we conclude that Achieving a desired precision on the quantum relative entropy In this section, we study the scenario where the reader is interested in obtaining a certain precision on the quantum relative entropy, rather than on the difference in the expectation values.Again, due to a potential model mismatch, we discuss the relative entropy S(ρ θ opt ∥ρ θ ) w.r.t. the optimal QBM ρ θ opt .
We begin by training the QBM ρ θ with SGD.Using Theorem 1, we can achieve |⟨H i ⟩ η − ⟨H i ⟩ ρ θ | ≤ ϵ for all i with polynomial sample complexity.This implies a similar relation w.r.t. the optimal model: While conclusive, the above proof does not provide us with a method to find such a χ pre , i.e., it is agnostic to the specific pre-training method.As a constructive example, let us consider minimizing χ pre with noiseless gradient descent on a subset of m parameters.This means we update the subset parameters as χ t = χ t−1 − γ ∇S(η∥ρ χ t−1 ), where ∇S(η∥ρ χ t−1 ) is the gradient of the subset of parameters, and γ the learning rate.Since S is L-smooth, we can use the descent Lemma 4 to bound the difference in relative entropy of the subset S(η∥ρ χ t ) − S(η∥ρ χ t−1 ) ≤ ∇S(η∥ρ χ t−1 ) T −γ∇S(η∥ρ Setting γ ≤ 2 L we obtain S(η∥ρ χ t ) ≤ S(η∥ρ χ t−1 ).By recursively applying this inequality we obtain a χ pre with S(η∥ρ χ pre ) ≤ S(η∥ρ χ 0 ), which by our theorem above ensures Eq. ( 65).Note that the smoothness L here is the smoothness on the subset of parameters, which can be bounded by L ≤ 2 m max i ∥H i ∥ 2 2 .

Pre-training models
Here we discuss possible pre-training models and strategies to optimize them.We focus on the models discussed in the main text: 1) a mean-field model, 2) a Gaussian Fermionic model, and 3) nearest-neighbor quantum spin models.The advantage of the first two models is that they can be trained analytically.While for the nearest-neighbor models, this is not possible, they satisfy the locality assumptions in [53,54], and hence have a strongly convex relative entropy.

Mean-Field QBM
We define the mean-field QBM by the parameterized Hamiltonian Since this Hamiltonian has a simple structure, in which many terms commute, we can find the optimal parameters analytically.First, recall that the QBM expectation values are given by For the mean-field Hamiltonian, we find where we have defined Here we use the commutativity of single qubit operators in the first equality and expand the exponent for the second equality.We therefore get From which the derivative follows as In order to find the optimal QBM parameters for each qubit, i, we then solve the three coupled equations, which corresponds to setting the QBM derivative in Eq. ( 4) in the main text to zero.From the strict convexity of the relative entropy, we know this has one unique solution provided the target expectation values, ⟨σ x,y,z i ⟩ η form a consistent set, i.e., it comes from a density matrix.We can find the solution by squaring the three equations, and adding them together, giving Here we used that the argument of the tanh is always positive.Substituting this into Eq.( 73) we then find the closed-form solution of the QBM parameters In practice, the optimal parameters for an arbitrary mean-field QBM can be obtained by numerically evaluating this expression for the given target expectation values.
Gaussian Fermionic QBM The Gaussian Fermionic QBM has a parameterized, quadratic, Fermionic Hamiltonian Here, ⃗ C † = [c † 1 , . . ., c † n , c 1 , . . .c n ] is a vector containing n Fermionic mode creation and annihilation operators, which satisfy the Fermionic commutation relations {c i , c † j } = δ i,j and {c i , c j } = 0.These Fermionic operators can be expressed as strings of Pauli operators by the Jordan-Wigner transformation.Θ is the 2n × 2n dimensional matrix containing the QBM model parameters θ, which can be identified as a Fermionic single-particle Hamiltonian.Note that this matrix needs to be Hermitian, and since terms like c † 1 c † 1 are zero, it has in total n 2 free parameters.In order to find the optimal parameters, we use the fact that the single-particle correlation matrix with entries contains sufficient information to compute all possible properties of the Gaussian quantum system.This includes all possible observables (via Wick's theorem), entanglement measures, and also sampling from ρ θ [72].In particular, the Gaussian Fermionic QBM gradient reduces to the difference in the correlation matrices of the target and the model We can solve this by first determining the target expectation values Then we use the fact that the Hamiltonian of a Gaussian Fermionic system can be written in the eigenbasis of the correlation matrix as where W η and Λ η is given by the eigendecomposition Γ η = W η Λ η W † η , and σ −1 (X) the inverse sigmoid function.Thus, we (numerically) diagonalize Γ η and set the optimal Gaussian Fermionic QBM Hamiltonian equal to Since the eigen decomposition of a Hermitian matrix is unique, we find one unique solution.This is in agreement with the strict convexity of the quantum relative entropy.

Geometrically-Local QBM
The last type of restricted QBM model we discuss is the geometrically-local QBM.We focus on nearest-neighbor models on some d-dimensional lattice, for example a one-dimensional chain where each Pauli operator only acts on two neighboring qubits.In full generality, the geometrically-local QBM Hamiltonian is given by where the sum is over nearest-neighbor sites ⟨i, j⟩ of the lattice with periodic boundary conditions.In the main text, we consider for example a d = 1 lattice (a ring), and a d = 2 square lattice.
In order to use these models for pre-training, we train them with exact gradient descent on the relative entropy until a fixed precision is reached.Importantly, as these Hamiltonians only have m ∈ O(n) terms and a finite interaction range, Refs.[53,54] show that the quantum relative entropy is strongly convex.Therefore, the optimization is guaranteed to converge quickly to the global optimum, recall Theorem 2. However, this requires obtaining Gibbs state expectation values of geometrically local Hamiltonians.This can be done with a quantum computer or, potentially, with classical tensor networks [33,34].

SUPPLEMENTARY NOTE 5: CONSTRUCTION OF THE TARGET STATE EXPECTATION VALUES
In this appendix, we review how to embed classical data into a target density matrix η.We will follow the approach in [18] for quantum spin models.We also show how to extend this formalism to the Fermionic quantum models needed for the pre-training of our Gaussian Fermionic QBM.Lastly, we describe the two different targets used in our numerical simulations.

Classical data encoding
Following [18], one way to encode a classical data-set consisting of N bit strings { ⃗ s µ ∈ {0, 1} n } M µ=1 into a quantum state is by defining the pure state with Here q(⃗ s) = 1 N M µ=1 δ ⃗ s,⃗ s µ is the classical empirical probability for bitstring ⃗ s, and |⃗ s⟩ is a computational basis state indexed by ⃗ s.The q(⃗ s) can be found by counting the bitstrings in the data set { ⃗ s µ }.From |ψ⟩ one can compute expectation values such as ⟨ψ| σ z i |ψ⟩ = ⃗ s∈{0,1} n q(⃗ s)⃗ s i (82) for the Pauli spin operator σ z i .This can be efficiently computed classically for a polynomially sized data set, i.e., for polynomially many ⃗ s µ .Computing such expectation values from η is possible for all 1− and 2−local Pauli operators as shown in Ref. [18].
We now show that we can generalize this encoding to Fermionic QBMs, i.e., the terms in the Hamiltonian ansatz consist of Fermionic creation c † i and annihilation operators c i .We define |⃗ s⟩ to be equal to the Fermionic Fock basis.This is analogous to the computational basis in the spin-picture (by the Jordan-Wigner transformation), but the bit-strings { ⃗ s µ ∈ {0, 1} n } M µ=1 in the data set should now be interpreted as occupation-number vectors of Fermions.Note that the occupation-number basis is defined by the eigenstates of the Fermionic number operator i c † i c i .The creation and annihilation operators act on the Fock-basis states as follows where ⃗ δ i is the unit bit-string with a 1 at position i and zeros everywhere else.With these relations, we can derive the required expectation values for the target η to train the (Gaussian) Fermionic QBM ⟨ψ| c † i c i |ψ⟩ = ⃗ s∈{0,1} n q(⃗ s)⃗ s i , ⟨ψ| c † i c j |ψ⟩ = ⃗ s∈{0,1} n q(⃗ s)q(F i F j ⃗ s)(1 − ⃗ s i )⃗ s j , ⟨ψ| c † i c † j |ψ⟩ = ⃗ s∈{0,1} n q(⃗ s)q(F i F j ⃗ s)(1 − ⃗ s i )(1 − ⃗ s j ), i ̸ = j ⟨ψ| c i c j |ψ⟩ = ⃗ s∈{0,1} n q(⃗ s)q(F i F j ⃗ s)⃗ s i ⃗ s j , i ̸ = j where F i flips the Fermion occupation number (from occupied to unoccupied and vice versa) of index i in the vector ⃗ s.

Data used for the numerical simulations
For the numerical simulations, we use two different targets η: a target constructed from a quantum source, and a classical data set embedded into η using the encoding above.For the quantum source we use the XXZ model Hamiltonian Here J and ∆ are parameters describing the Heisenberg interactions between the quantum spins on a one-dimensional lattice, and h z is the strength of an external magnetic field.We set η = e H XXZ Z with J = −0.5, ∆ = −0.7,and h z = −0.8, and compute the expectation values ⟨H i ⟩ η classically.This is intractable in general, but our aim is to replicate the scenario in which the expectation values are measured experimentally.For example, from a state prepared on a quantum device.
For the classical source, we use the classical salamander retina data set given in Ref. [63].This data set consists of bit-string data recordings of different features of the response of cells in the salamander retina.We select the first 8 features and trim the data to the first 10 data recordings.We then construct the expectation values ⟨H i ⟩ η from the procedure outlined above.

Figure 1 .
Figure 1.Summary of the results.The quantum Boltzmann machine (QBM) learning algorithm takes as input a data set of size polynomial in the number of features/qubits, and an ansatz with parameters θ and a set of m Hermitian operators {Hi}.In Definition 1 we provide an operational definition of the QBM learning problem where the model and target expectations must be close to within polynomial precision ϵ.A solution θ opt is guaranteed to exist by Jaynes' principle.With Theorems 1 and 2 we establish that QBM learning can be solved by minimizing the quantum relative entropy S(η∥ρ θ ) with respect to θ using stochastic gradient descent (SGD).This requires a polynomial number of iterations T , each using a polynomial number of Gibbs state preparations, i.e., the sample complexity is polynomial.With Theorem 3 we prove that pre-training strategies that optimize a subset θ pre of the QBM parameters are guaranteed to lower the initial quantum relative entropy.After training the QBM can be used to generate new synthetic data.Icons by SVG Repo, used under CC0 1.0 and adapted for our purpose.

Figure 2 .
Figure 2. Pre-training and training quantum Boltzmann machines.(a) Quantum relative entropy S(η|ρ θ pre ) obtained after various pre-training strategies.We compare a mean-field (MF) model, a one-dimensional and two-dimensional geometrically local (GL) model, and a Gaussian Fermionic (GF) model to no pre-training/maximally mixed state.For the GL models, we stop the pre-training after the pre-training gradient is smaller than 0.01.We consider an 8-qubit target η as the Gibbs state e H XXZ /Z of a one-dimensional XXZ model (Quantum Data), and a target η which coherently encodes the binary salamander retina data set (Classical Data).(b) Quantum relative entropy versus number of iterations for Quantum Data.The t < 0 iterations (gray area) show the reduction in relative entropy for GL 2D pre-training (red line).The t = 0 iteration corresponds to the pre-training results in panel (a).The t > 0 iterations show the training results in the absence of gradient noise, i.e., κ = ξ = 0.
Fermionic creation and annihilation operators, where θij is the 2n × 2n Hermitian parameter matrix, which has n 2 free parameters.Here ⃗C † = [c † 1 , . . ., c † n , c 1 , . . ., c n ]with the operators satisfying {c i , c j } = 0 and {c † i , c j } = δ ij , where {A, B} = AB + BA is the anti-commutator.The advantage of the MF and GF pre-training is that there exists a closed-form solution given target expectation values ⟨H i ⟩ η (see Supplementary Note 4).

Figure 3 .
Figure 3.The maximum error in the expectation values versus the number of iterations of stochastic gradient descent (SGD).The target η is made of 8 features from the classical salamander retina data set, the model ρ θ t is a quantum Boltzmann machine (QBM).When computing expectation values for the gradient, we allow precision ξ for the target and precision κ for the QBM.We compare the combined noise strength of 0.01 (blue line) to 0.05 (orange line).We aim for a maximum error of ϵ = 0.1 (red dashed line) and use a learning rate of γ =