A Path Towards Quantum Advantage in Training Deep Generative Models with Quantum Annealers

The development of quantum-classical hybrid (QCH) algorithms is critical to achieve state-of-the-art computational models. A QCH variational autoencoder (QVAE) was introduced in Ref. [1] by some of the authors of this paper. QVAE consists of a classical auto-encoding structure realized by traditional deep neural networks to perform inference to, and generation from, a discrete latent space. The latent generative process is formalized as thermal sampling from either a quantum or classical Boltzmann machine (QBM or BM). This setup allows quantum-assisted training of deep generative models by physically simulating the generative process with quantum annealers. In this paper, we have successfully employed D-Wave quantum annealers as Boltzmann samplers to perform quantum-assisted, end-to-end training of QVAE. The hybrid structure of QVAE allows us to deploy current-generation quantum annealers in QCH generative models to achieve competitive performance on datasets such as MNIST. The results presented in this paper suggest that commercially available quantum annealers can be deployed, in conjunction with well-crafted classical deep neutral networks, to achieve competitive results in unsupervised and semisupervised tasks on large-scale datasets. We also provide evidence that our setup is able to exploit large latent-space (Q)BMs, which develop slowly mixing modes. This expressive latent space results in slow and inefficient classical sampling, and paves the way to achieve quantum advantage with quantum annealing in realistic sampling applications.

Deep learning [2][3][4][5][6], in which a labeled dataset is used to train a statistical model to solve tasks such as clustering and classification, is currently revolutionizing the field of supervised learning. Deep neural networks are now commonly used in many scientific and industrial applications. Unlike supervised learning [7], unsupervised learning is a much harder, and still largely unsolved, problem. And yet, it has the appealing potential to learn the hidden statistical correlations of large unlabeled datasets [5,8,9], which constitute the vast majority of data available today.
Training and deployment of large-scale machine learning models, especially for unsupervised learning, faces computational challenges [10] that are only partially met by the development of special purpose classical computing units such as GPUs. This has led to an interest in applying quantum computing to machine learning tasks [11][12][13][14][15] and to the development of several quantum algorithms [16][17][18][19] with the potential to accelerate training. Most quantum machine learning algorithms need fault-tolerant quantum computation [20][21][22], which requires the large-scale integration of millions of qubits and is still not available today. It is however believed that quantum machine learning (QML) will provide the first breakthrough algorithms to be implemented on commercially available quantum annealers [23,24] and gatemodel devices [25,26]. For example, small gate-model de-vices and quantum annealers have been used to perform quantum heuristic optimization [26][27][28][29][30][31] to solve clustering [32] and classification problems [33][34][35][36].
Current-generation quantum annealers physically simulate a transverse-field Ising model and operate in interaction with in a thermal environment. Perhaps the most natural use of quantum annealers is thus as samplers from the Boltzmann distribution of an Ising system [37][38][39][40]. This observation inspired the use of quantum annealers for the training of Boltzmann machines (BMs) [41][42][43][44][45][46][47], powerful and versatile probabilistic models that can be trained in either a supervised or unsupervised fashion. Training BMs requires computationallyexpensive techniques, such as persistent contrastive divergence (PCD) [10] and population annealing (PA) [48], to accurately sample from the correct thermal distribution. Quantum annealing has the potential to overcome this computational bottleneck and to allow training of larger and more powerful BMs. Additionally, quantum annealers with advanced annealing control schedules allow sampling from quantum BMs (QBMs) [49][50][51], which might be able to capture the more complex correlations realized by quantum states. Despite being able to represent complex probability distributions, early numerical studies showed that restricted BMs (RBMs) with the quasi two-dimensional connectivities typical of currentgeneration quantum annealers perform poorly [52], even on simple standard datasets such as MNIST [53].
A possible approach to circumvent the limitation above is the following quantum-classical hybrid (QCH) paradigm: 1) employ classical feed-forward neural networks (NN) to encode the data into a compressed representation that is more easily processed by a quantum device and 2) jointly train the NN and the quantum device. This approach was pioneered in Ref. [54], where the authors introduced a quantum-assisted Helmholtz machine (QAHM), a generative model with latent variables. A QAHM has an inference (encoder), a generation (decoder) network and a generative process that is represented by a QBM or classical BM, and can thus be simulated by a quantum annealer. QAHMs do not have a well-defined loss function [9], which means that training does not scale well, even to standard machinelearning datasets such as MNIST. This approach was taken a step further in Ref. [1] by using variational autoencoders (VAE), a class of generative models with latent variables that provide an efficient inference mechanism [55,56]. VAEs can be trained by minimizing the evidence lower bound (ELBO), a variational lower bound to the exact log-likelihood. The ELBO is a well-defined loss function with fully propagating gradients that can be efficiently optimized via backpropagation. This allows us to achieve competitive and scalable performance on large-scale datasets. Upon discretization of the latent space [57,58], the generative process can be then realized by either classical or quantum BMs. In the latter case, training can be also performed efficiently by optimizing a quantum lower bound to the exact log-likelihood [1].
In this paper we use quantum annealers to sample from a classical BM. In other words, we assume that the quantum bias due to the presence of a transverse field is small enough that we can use a quantum annealer to approximate thermal expectations of a classical BM. The thermal expectation of the BM energy (the "negative phase") is indeed required to compute the gradients of the BM parameters [59]. We demonstrate that the evaluation of these thermal expectations with a quantum annealer is accurate enough to allow the training of a VAE with latent-space BM.
Our hybrid approach exploits a large amount of classical computational resources (backpropagation through deep NN performed on GPUs). This is advantageous since it allows us to achieve competitive results on relatively large-scale datasets such as MNIST. However, validating the training with quantum annealers is a major challenge: VAE can train well even with unstructured priors (e.g., a product of Gaussian distributions). One thus needs to carefully validate and demonstrate any performance improvement due to the use of structured samples coming from well-trained BMs. To this end, existence of a well-defined loss function, which allows for a quantitative comparison between different models, is critical. In order to single out the performance improvement due to computations that can be off-loaded to a quantum annealer (sampling from the BM), we always focus on comparing the performance of a model trained with a nontrivially connected BM (for example with full or Chimera connectivity D) to a fully classical baseline in which we employ a BM with no connectivity; that is, a set of independent Bernoulli variables. After a certain number of gradient updates, training of such models will have used exactly the same amount of classical resources that cannot be off-loaded to the quantum annealer, with the only difference being the computational effort in sampling from the latent-space BM.
We successfully train a convolutional VAE with 288 latent units and Chimera-structured RBM (a 6-by-6 Chimera graph; see Appendix D) using only samples obtained by D-Wave 2000Q quantum annealers to estimate the negative phases required to evaluate the gradients for the RBM parameters. We then use several techniques to validate the trained models. In particular, we use an auxiliary RBM trained on the annealer samples to quantitatively estimate the test log-likelihood and compute several model parameters to show how they track the same quantities for models trained fully classically. With these approaches, we show that models trained on Chimera connectivity outperform their trivial ("Bernoulli") baseline and can achieve competitive performance on the MNIST dataset.
The next question is whether our approach offers a path towards obtaining quantum advantage with quantum annealers in machine learning applications. To achieve such a goal, we need to exploit large latent-space RBMs that develop complex multimodal probability distributions (i.e., a complex energy landscape) from which sampling is classically inefficient. We give evidence that this is indeed possible. For example, we demonstrate that the BMs placed in the latent space develop nontrivial modes that are likely to cause classical Monte Carlo algorithms to have long mixing times. Moreover, we show that training on more complex datasets likely takes advantage of larger BMs to improve performance. In addition, we discuss the role of connectivity, emphasizing its importance even in this hybrid approach, and the necessity to develop device-specific classical NN to better exploit physical connectivities such as the Chimera graph.
The structure of the paper is as follows. In Sec. I we review VAE and the implementations of discrete latent variables, a necessary step to implement BMs and QBMs in their latent space. Section II motivates the use of quantum annealers as samplers to train quantum and classical BMs. In Sec. III we report our experiments in training VAEs with D-Wave 2000Q systems. In Sec. IV we discuss a possible path toward quantum advantage in our setup. Finally, we present our conclusions in Sec. V.

I. VARIATIONAL AUTOENCODERS
In this section, we will briefly introduce VAEs and describe their extension to discrete latent variables; a necessary step to hybridize with quantum priors and to per-form quantum-assisted training.
In generative modeling, the goal is to train a probabilistic model such that the model distribution p θ (X) (where θ are the parameters of the model) is as close as possible to the data distribution, p data (X), which is unknown but assumed to exist. The ensemble X = {x d } N d=1 represents the training set; that is, N independent and identically distributed samples coming from p data (X). The preferred method to training probabilistic models is arguably maximum likelihood estimation (MLE), which means the optimal model parameters are obtained by maximizing the log-likelihood L(X, θ) of the dataset with respect to the model: where E x∼p data [. . . ] is the expectation over p data (x). Similarly to generative adversarial networks (GANs) [60], VAEs [55] are "directed" probabilistic models with latent variables (see Fig. 1): the model distribution, defined as the joint between the visible units x and latent units ζ, is explicitly parameterized as the product of the "prior" p θ (ζ) and "marginal" p θ (x|ζ) distributions, p θ (x, ζ) = p θ (x|ζ)p θ (ζ). The model prediction for the data is then obtained by marginalizing over the latent units: Generative models with latent variables can potentially learn and encode useful representations of the data in the latent space. This is an important property that can be exploited in many practical applications [61][62][63][64] to improve other tasks such as supervised and semi-supervised learning [65]. The drawback is "intractable inference" due to the appearance of integrals such as the one in Eq. 2. Essentially, VAEs remove the necessity to evaluate such integrals by introducing a variational approximation q φ (ζ|x) to the true posterior p θ (ζ|x). A so-called "reparameterization trick" is also introduced to obtain an efficient and low-variance estimate of the gradients needed for training. We will briefly review these two important elements in the next two sections.

Variational inference
Training generative models with latent variables via MLE requires the evaluation of the intractable integral of Eq. 2 to calculate the posterior distribution p θ (ζ|x). VAEs circumvent this problem by introducing a tractable variational approximation q φ (ζ|x) to the true posterior [66], with variational parameters φ (see Fig. 1). VAEs are then trained by maximizing a variational lower ζ x prior: p θ (ζ) joint: p θ (x, ζ) marginal: p θ (x|ζ) approx. posterior: q φ (ζ|x) FIG. 1: Generative models with latent variables can be represented as probabilistic graphical models that describe conditional relationships among variables. In a directed generative model, the joint probability distribution p θ (x, ζ), is decomposed as p θ (x, ζ) = p θ (x|ζ)p θ (ζ). The prior distribution over the latent variables p θ (ζ) and the marginal (decoder) distribution p θ (x|ζ) are hard-coded to explicitly define the model. The computation of the true posterior, p θ (ζ|x) is intractable. In VAEs, an approximating posterior q φ (ζ|x) (decoder) is introduced to replace the true posterior.
bound L(x, θ, φ) to the log-probabilities log p θ (x): where D KL (q φ (ζ|x)||p θ (ζ|x)) is the Kullback-Leibler divergence (KL divergence) between the true and approximating posteriors. Since KL divergences are always nonnegative, we have which immediately gives: where L(X, θ, φ) is called the evidence lower bound (ELBO). The ELBO can be written in terms of tractable quantities: The marginal p θ (x|ζ) and approximating posterior q φ (ζ|x), also called "decoder" and "encoder" respectively, are commonly parameterized using deep neural networks.

The reparameterization trick
To train VAEs, we need to calculate the derivatives of the objective function (Eq. 6) with respect to the generative (θ) and inference (φ) parameters. The naive evaluation of ∂ φ of terms of the type E ζ∼q φ [f (ζ)] is called REINFORCE [67]. With the use of the identity ∂ φ q φ = q φ ∂ φ log q φ , one has: However, the term above has high variance and requires intricate variance-reduction mechanisms to be of practical use [68]. A better approach is to write the random variable ζ as a deterministic function of the distribution parameters φ and of an additional auxiliary random variable ρ. The latter is given by a probability distribution p(ρ) that does not depend on φ. This reparameterization ζ(φ, ρ) is appropriately chosen so that one can write . Therefore, we can move the derivative inside the expectation with no difficulties: This is called the reparameterization trick [55] and its efficient implementation is responsible for the recent success and proliferation of VAE.

A. VAE with discrete latent variables
The application of the reparameterization trick as in Eq. 8 requires that f (ζ(φ, ρ)) be differentiable, so the latent variables ζ are continuous. However, discrete latent units can be indispensable to represent the right distributions, such as in attention models, language modeling, and reinforcement learning [65,69,70]. For example, a latent space composed of discrete variables can learn to disentangle content and style information of images in an unsupervised fashion [71]. Several methods have thus been developed to circumvent the non-differentiability of discrete latent units [68,[72][73][74]. In the context of VAE, the reparameterization trick has been extended to discrete variables by either relaxation of discrete variables into continuous variables [58,69,75] or by introducing smoothing functions [57]. In Ref. [1], QVAE was introduced based on the implementation of Ref. [57]. In this work, we follow the implementation of Ref. [58], which gives biased estimates but provides a much simpler and flexible implementation.
To set up a notation that we keep throughout the paper, we now assume the prior distribution is defined on a set of discrete variables z ∼ p θ (z), with z ∈ {0, 1} L . Given a discrete variable z with mean q and logit l = is the sigmoid function), a non-differentiable implementation of Eq. 8 for discrete variables can be obtained as where Θ is the Heaviside function and the random variable ρ ∈ [0, 1] is distributed according to a uniform distribution U. In the second equality, we have used the fact that the inverse sigmoid function is monotonic. A continuous smoothing (also known as the Gumbel trick [75]), is performed by replacing the Heaviside function with the sigmoid function: where τ is a temperature parameter introduced to control the smoothing. Typically, τ is annealed from large to low values during training. For large values of τ , the bias introduced by substituting z with ζ everywhere in the loss function is large, but the gradients propagating through ζ are also large, facilitating training. Conversely, for low values of τ the bias is reduced but gradients vanish and training stops. Evaluation of trained models is done in the limit τ → 0, where ζ → z. Throughout this paper, we will use BMs to provide powerful and expressive prior distributions defined on discrete variables: To train a VAE with BM prior, following the prescription of the previous section, we formally replace p θ (z) p θ (ζ). As usual, the gradients of the log-probability is given by the difference between a positive and negative phase: In the equation above, we have highlighted the fact that the smoothed latent samples ζ φ depend on the variational parameters φ. The model samplesz, however, remain discrete variables sampled from the BM, and are thus not smoothed during training [58].

B. Hybridization with quantum prior
Once we have a framework to train VAEs with discrete latent variables, we can consider quantum-classical hybrid VAEs in which the generative process z ∼ p θ (z) is realized by measuring the computational basis on a given quantum state ρ θ . Such quantum states can be realized by a quantum circuit or via a quantum annealing process controlled by a set of parameters θ we wish to adjust during training of the model.
As introduced in Ref. [1], a QVAE can be obtained by assuming the quantum state ρ θ is a thermal state of a transverse field Ising model; i.e., a QBM [49]. The prior p θ (z) distribution is then given by: where Γ, h, W ∈ {θ}, Λ z ≡ |z z| is the projector on the classical state z, and σ x,z l are Pauli operators. Unlike for a classical BM, the direct evaluation of the gradients of the term above is intractable. As discussed in Ref. [49], a possible workaround is to perform the following substitution: (14) in the ELBO L to obtain the so-called quantum ELBO (Q-ELBO)L.
As a consequence of the Golden-Thompson inequality Tr[e A e B ] ≥ Tr[e A+B ], one has: The Q-ELBOL is thus a lower bound to the ELBO with tractable gradients that can be used during training. The derivatives of the log-probabilities logp θ (z) can be estimated via sampling from the QBM [49]: wherez are the model samples distributed according to the quantum Boltzmann distribution. The use of the Q-ELBO and its gradients precludes the training of the transverse field Γ [49], which is treated as a constant (hyper-parameter) throughout the training. Training via the Q-ELBO is performed as in the BM case, by smoothing z ζ.

II. SAMPLING WITH QUANTUM ANNEALERS
Currently manufactured quantum annealers physically implement a transverse-field Ising model: where s ∈ [0, 1] is a control parameter, and A(s) and B(s) are respectively decreasing and increasing monotonic functions of the parameter s with A(0) B(0) and A(1) B(1). Quantum annealers operate immersed in a thermal environment. There is theoretical and numerical evidence [37,42,76,77] that when the anneal is performed sufficiently slowly the system above is in thermal equilibrium with the environment. This property can be exploited to turn quantum annealers into programmable Boltzmann samplers. Thermal relaxation rates are controlled by the intensity of the transverse field A(s). At the beginning of the anneal, relaxation times are small, and the system proceeds through a sequence of thermal states. As the anneal proceeds, relaxation times grow larger and eventually the state of the system freezes at the point s * where relaxation times roughly become larger than the annealing time t a .
With the above picture in mind, we can use quantum annealers to sample from the QBM defined in Eq. 13 with: Advanced control techniques for the anneal schedule (such as pauses and fast ramps present in the latest generation of D-Wave quantum annealers) allow in principle to control the freezing point s * . Note that knowledge of the effective transverse field Γ * is unnecessary: QBMs are trained via the Q-ELBO, in which the transverse field does not appear explicitly, but only implicitly in sampling from the model. In the following we assume that for the models and datasets under consideration freezing happens late in the anneal. This means we effectively sample from a QBM that is very close to a classical BM. More specifically, we use quantum annealers to quantum-assist training of VAEs with classical latent-space BMs.

III. TRAINING VAE WITH QUANTUM ANNEALERS
We have implemented a convolutional VAE whose prior is implemented by a BM. To improve the performance of the model, we use several techniques such as learning-rate and KL-term annealing, importanceweight annealing, convolution gating, and batch normalization. We give a detailed description of the model in Appendix A. In this section, we restrict ourselves to a Chimera structured restricted BM (RBM) with 288 latent units (a six-by-six patch of Chimera cells) and present our results with models trained end-toend by using samples drawn with D-Wave 2000Q quantum annealers on the common handwritten digit dataset MNIST [53]. Samples used to estimate the negative phase (second term of Eq. 16) are obtained following the prescription given in Appendix B.
The effective temperature β * ef f must be chosen appro-priately to correctly train the parameters of the inference network q φ (see appendix C for a more detailed discussion). The parameter β * ef f can be considered as a multiplicative correction for the learning rate of the prior parameters b, W ∈ θ. However, this observation is not true for the inference parameters φ, whose gradients also propagate through the first term in Eq. 16 via ζ(φ, ρ). Due to our simple forward-anneal schedule, we expect the value of β * ef f to change during training. To account for this effect, in this work we employ a real-time β * ef f estimation as explained in Appendix C. In future works, we expect to be able to train at a fixed-temperature with appropriate pause-and-ramp annealing schedules such as those available with D-Wave quantum annealers [50,51].
Training is performed jointly on the parameters of the classical networks and on the parameters of the quantum device. The gradients of the latter parameters require estimation of the negative phase (a thermal expectation of the energy) in Eq. 16. At each gradient update, such expectations are computed using samples from the quantum annealer only, and do not involve any classical Gibbs sampling such as persistent contrastive divergence, or any classical post-processing of the samples obtained by the annealer. We typically trained our models for 2000 epochs and a batch size of 1000. Figure 2 shows a set of images generated by a VAE trained end-to-end using a D-Wave 2000Q system. The set of images, obtained by generating latent samples z with the quantum annealer and subsequently decoded as x ∼ p θ (x|z), shows a good amount of global consistency and consistent statistical variety.

A. Validation of training
In this section we give more evidence that we have successfully exploited the Chimera-structured RBM prior in the latent space of our convolutional VAE. Validating the training of quantum-classical hybrid generative models can be nontrivial and must be assessed carefully, especially when training uses a large amount of classical processing. We trained the deep networks of our model using GTX 1080 Ti GPUs, and the quantum annealer is called only to estimate the negative phase. A principled validation strategy is thus to compare a model trained with quantum assistance to a fully classical baseline for which the quantum hardware is not required. In our case, a convenient baseline is a model that has the same classical networks but whose prior is trivial. As a trivial prior, we choose a set of independent Bernoulli variables, which is equivalent to an RBM with vanishing weights between latent units. For simplicity, in the following we refer to such prior as RBM with Bernoulli prior. For image processing, comparison between different generative models could be done qualitatively by visually inspecting the generated samples. In our research however, visual comparison is inconclusive [78]. In Fig. 3, for example, we compare samples generated by a trained model with Chimera ( Fig. 3(a)) and Bernoulli ( Fig. 3(b)) priors. Both models have been trained by evaluating the negative phase with a D-Wave 2000Q system. The images are also generated using samples coming from the annealers. Given our specific implementation, it is difficult to discern an improvement of visual quality when using a Chimera-structured RBM rather than a Bernoulli prior. To overcome this difficulty, we evaluate quantitatively [84] the generative performance of VAEs by computing the ELBO defined in Eq. 3 or by estimating the log-likelihood via an importance sampling technique as described in Ref. [56].
These quantities are however not accessible when training with analog devices as samplers. In fact, while we assume that samples generated by quantum annealers are distributed according to the required Boltzmann distribution, we must treat quantum annealers as black-box samplers during testing and validation. In other words, the log-probabilities for the quantum generative process log p DW θ (z) must be assumed to be unknown. Therefore, we validate results by replacing the unknown hardware log-probabilities with those of an auxiliary RBM whose weights are given by the relations in Eq. 17 [85]: This approach can be more rigorously interpreted as validating the fully classical model in which we replace the quantum annealer by the auxiliary RBM defined by the relations above. In Tab. I we report the LL of the auxiliary VAE with 288 latent unit on a Chimera connectivity trained with a D-Wave 2000Q quantum annealer. We compare it to the same model trained end-to-end with population annealing [1]. We also compare each model with its respective Bernoulli baseline. We have reported the mean and the standard error over 5 independent training runs. Training with a Chimera-structured RBM improves significantly the log-likelihood over the Bernoulli baseline. Moreover, the models trained with quantum annealers achieved the same log-likelihood as the models trained with PA. Notice that each model and its baseline em-  ployed exactly the same amount of classical computational resources. Models trained with structured RBMs achieve better performance by requiring thermal sampling, a computational task that can be offloaded to a quantum annealer.
In general we would like to use quantum annealers to sample from the trained generative model. As explained above, treating quantum annealers as black-boxes means we cannot quantitatively evaluate such a model. However, we argue that the log-likelihood of such a VAE is likely very close to that of the auxiliary VAE. After all, the training assumes the hardware samples are distributed according to a Boltzmann distribution of the auxiliary RBM, and in the previous section we have shown that this assumption is accurate enough to correctly train the auxiliary RBM. We can confirm this visually in Fig. 4. On the left panel we show a set of digits generated by sampling from a D-Wave 2000Q. On the right panel we use the same trained model but sample from a D-Wave 2000Q after setting its weights to zero. We see that while the annealer still generates plausible digits, it does not generate a number of digits with the correct statistics (in the right panel of Fig. 4, digits 9 and 4 seem to dominate the scene). This gives evidence that D-Wave 2000Q quantum annealers sampled consistently, such that the classical networks were able to correctly learn the correlations between latent units existing due to the RBM with non-vanishing weights.

IV. A PATH TOWARDS QUANTUM ADVANTAGE WITH VAE
In the previous section we have shown that it is possible to use quantum annealers as Boltzmann samplers to train RBM-structured priors placed in the latent space of deep convolutional VAE. In the experiments presented, we have settled on relatively small, Chimera-structured RBMs with 288 latent units. We found that larger RBMs did not appreciably improve performance of the overall VAE when training on the MNIST dataset. We will explain why this is the case in this section. Sampling from RBMs with a few hundred units can still be done classically with relative ease, and the natural question that arises is whether we can obtain a quantum advantage in this hybrid setup.
A necessary condition to obtaining quantum advantage with our QCH setup is to engineer models that can exploit very large RBMs with complex, multimodal distributions. Sampling from RBMs with complex landscapes is slow and inefficient for classical approaches such as Gibbs sampling, PCD, and PA. Another important requirement is to be able to reliably sample from thermal states using quantum annealers with sufficiently low control errors.
In the next sections, we give evidence of the existence of a natural path towards obtaining quantum advantage by applying quantum annealing to generative modeling within the proposed VAE framework.

A. Exploit large latent-space RBMs
Exploiting a larger number of latent units to improve the generative performance of a VAE is a popular and active research area. One known obstacle in achieving this is the loss function used for training. We rewrite the ELBO here for convenience by highlighting its two terms: The first term is sometimes called the "autoencoding" term and can be thought of as a reconstruction error: an encoded latent configuration ζ is first sampled from the encoder q φ (ζ|x) and is subsequently decoded by p θ (x|ζ). When both approximating posterior and marginal distributions are factorized distributions, this term can be also interpreted as a reconstruction error. Maximizing the ELBO results in maximizing this term, which tends to maximize the number of latent units used to prevent information loss during the encoding-decoding steps. The second term, the KL divergence between the approximating posterior and the prior, has the effect of a regularization term and it is sometimes also called KL regularization. Maximizing the ELBO results in minimizing the KL term, which pushes the approximating posterior close to the prior. This also means the approximating posterior depends less sharply on the inputs x. In the case of factorized distributions, this usually means some latent units are conditionally independent from the input ("inactive units"): The balance between the autoencoding and KL terms means using VAE can be efficient at lossy data compression by using the right number of latent units. However, the tension between KL and autoencoding terms results in an optimization challenge: during training the model is usually stuck in a local minimum with a suboptimal number of latent units. The main takeaway, for our purposes, is that the number of latent units effectively used is highly dependent on the model, the optimization technique, and the training set. To exploit a larger RBM, one thus has to work on all these elements. In Fig. 5 we show the number of active units during a training run of 2000 epochs when the models are trained with either PA or D-Wave 2000Q, with either a Chimera-structured RBM or a Bernoulli prior. Lines (bold or dashed) are the means over 5 independent runs while light-color areas delimit the smallest and largest values among the 5 runs. To identify whether a latent unit is active or not, we compute the variance σ of the value of each unit z over the test set and we set a threshold of σ > 0.01 as definition of an active unit. Figure 5 shows several key points that we will expand upon in the next sections. First, it shows the use of a KL annealing technique: the KL term is turned off at the beginning of the training and it is slowly (linearly) turned on within 200 epochs. The figure clearly shows the effect of the KL term in shutting down a large number of active units. Since the number of active units plateaus around a number much lower than 288, using a larger RBM usually does not improve performance of our implementation on MNIST. It also shows that connectivity of the RBM plays a major role in determining the number of active units, which is much higher with a Chimera-structured RBM. Notice also that in the Bernoulli case, both samplers (PA and D-Wave 2000Q) train a model that uses a very similar number of latent units. However, when sampling is nontrivial (as in a Chimera-structured RBM) the model trained with the quantum annealer uses a number of units larger than the Bernoulli case, but smaller than the model trained with PA. This is a manifestation (which we will discuss later) of biased sampling with the quantum annealer: sampling quality is good enough to train the model (indeed we obtained a log-likelihood as good as that of the model trained with PA) but exploits a smaller number of latent units.
In the next three sections we discuss three important elements that would allow to build quantum-classical hybrid VAEs that can effectively exploit large latent-space RBMs. This is a necessary condition to search for quantum advantage in these models: speed-up and scaleup sampling from large RBMs using quantum annealers rather than inefficient classical sampling techniques.

Denser connectivities
Connectivity of the RBM in the latent space plays an important role in determining both performance of the generative model and number of active latent units. While these two elements are not directly related, we typically observe a correlation between them. In this section, we investigate in more detail the effects of implementing denser connectivities by performing numerical experiments in four different cases: Bernoulli prior and Chimera, Pegasus, and fully connected RBM. Together with Bernoulli and RBM, we pick the connectivities of currently available D-Wave 2000Q quantum annealers and D-Wave's next-generation Pegasus architecture [79] (see also Appendix D).
In Fig. 6 we compare the log-likelihood and the number of active units along training runs obtained using the same classical networks but using a different connectivity for the latent-space RBM prior. At each step during the training run, we employed roughly the same amount of classical resources for back-propagation. Unsurprisingly, using a more capable (dense) RBM results in better generative performance (log-likelihood) of the model (left panel)). It also results in using a larger number of latent units (right panel). Notice how Bernoulli and Chimera priors effectively use a number of latent units well below 150, while Pegasus and fully connected use well above 150. Because of this, we could not improve generative performance of Bernoulli and Chimera models by just using larger graphs, at least when training on MNIST. On the other hand, using larger Pegasus and fully connected RBMs would likely have improved the log-likelihood of the model. In Fig. 6 we show results on the same number of latent units (288)

comparison.
Working with VAEs allows us to easily take advantage of new connectivities without having to implement a new convolutional VAE. In fact, our model has been fairly optimized to improve performance on Chimera graphs (see the next section). Despite this architecture-specific optimization, just using the denser Pegasus graphs improve performance of the model. Moreover, the flexibility of the VAE hybrid approach allows us to easily adapt the implementation to the slightly different working graphs of different processors with different active/inactive qubits. By using the same implementation, during training the model naturally learns to deactivate latent units corresponding to uncalibrated qubits. We have indeed seamlessly used the same model to train on different D-Wave 2000Q processors, as well as using different groups of qubits within the same processor. In no case was a hardcoded connectivity (which would change for each processor) necessary.
The results of Fig. 6 show that developing quantum annealers with denser connectivities (such as Pegasus) naturally leads to exploiting larger latent-space RBMs, possibly getting us closer to a regime where quantum advantage is possible.

Hardware-specific optimization of classical networks
In our implementation, we have used fairly conventional convolutional neural networks. As we will discuss in this section, we have implemented only one specific architecture-dependent element in the encoder network that turned out to be very effective at improving performance on the Chimera graph. In general, we believe more elaborate, hardware-specific implementations of both the approximating posterior q φ (z|x) and the marginal distribution p θ (x|z) will significantly improve performance of VAE trained with analog devices with quasi twodimensional connectivities. This is not just a problem of building a model with more capacity. As we discussed in the case of latent-unit use, it is also a problem of optimizing the model hyperparameters. When working with sparsely connected RBMs, it is easier for the KL term to push both approximating posterior and RBM prior to a local minimum of the loss function in which they are both trivial. Developing hardware-specific hybrid models will thus also aim at reaching local minima during training in which the RBM prior is as expressive as possible, exploiting the largest amount of correlations between latent units.
In the context of hybrid VAE models trained with quantum annealers, the considerations above could replace more standard mapping techniques used in the quantum annealing community such as minor embedding [80,81] and majority voting. The latter techniques typically require a hard-coded specification of the hardware connectivity of each processor, which makes adapting the code to different processors cumbersome. Our perspective in the contest of hybrid generative modeling is to work by adapting classical networks using a high-level specification of the connectivity and letting the model, through stochastic gradient descent, learn the details of the connectivity of each processor (such as the locations of uncalibrated qubits).
We now discuss an example of how the classical networks can be optimized for a given architecture (in our case the Chimera graph). A common technique to implement a more expressive approximating posterior q φ (z|x) (with a tighter variational bound) is to introduce conditional relationships among latent units, also called hier- archies. In the case of two hierarchies, we first define an approximating posterior for a subset of latent units z 1 and sample from it, then define a second approximating posterior for the remaining latent units z 2 that depends conditionally on both the input data and the sampled values of the first group of latent variables: The schematic of our implementation is shown in Fig. 7(a). We notice that models with a large number of hierarchies (possibly as large as the number of latent units) are possible. Such models, also referred to as autoregressive models [82], are very powerful but have inefficient inference, since sampling must be performed sequentially and cannot be parallelized on modern GPU hardware. The two hierarchical groups (z 1 , z 2 ), as well as the physical qubits on the Chimera connectivity, are not equivalent. We can thus build different models by simply choosing the mapping between the two hierarchical groups and an arbitrary bipartition of the physical qubits. Notice that training and deploying each of these models will involve exactly the same amount of classical and quantum computational resources. In our experiments we consider two possible mappings of the two hierarchical groups onto the physical qubits of the Chimera graph, which we call "Bipartite" and "Chains". The Bipartite mapping corresponds to the bipartite structure of the Chimera graph (see Fig. 7(b)). The Chains mapping corresponds to the vertical and horizontal physical layout of qubits of the Chimera architecture (see Fig. 7(c)). The identification of vertical and horizontal chains of qubits is commonly used to perform a minor embedding of a fully connected RBM on the Chimera graph. We stress again, however, that we never perform any minor embedding, and we always sample from the native Chimera graph in all cases.
Comparative results of the two mappings, obtained with classical sampling, are shown in Fig. 8. In the left panel we see that there is a sizeable difference in generative performance between the two mappings, with the Chains mapping performing remarkably better. This better performance is also reflected in the much higher number of latent units exploited by the Chains mapping (see right panel of Fig. 8). An intuitive explanation of the results above can be given as follows. Let us first write the KL term as: In the Bipartite mapping, the conditional p θ (z 2 |z 1 ) has a simple form that can be computed analytically due to the bipartite structure of the Chimera RBM. During training, it is very easy for the model to use the capacity of q 2,φ (z 2 |z 1 , x) to match the simple prior marginal p θ (z 2 |z 1 ) and be independent from x. As a consequence, a large portion of the representational capacity of the approximating posterior is wasted in representing the simple marginal p θ (z 2 |z 1 ). In the Chains mapping, in contrast, the marginal p θ (z 2 |z 1 ) is nontrivial, and the approximating posterior q 2,φ (z 2 |z 1 , x) has more difficulties in matching it and decoupling x. As a consequence, the model ends up using more efficiently all the variational parameters φ. This is another manifestation of the optimization challenge present with VAE models mentioned before, which in this case it is exploited to find better local minima of the loss function. In this section we have shown how a simple architecture-aware modification of the encoder network allows us to train better models and to exploit a given architecture more efficiently. We expect that architectureaware model-engineering will be crucial to fully exploit large physical connectivities in the latent space of VAE.

Training on larger datasets
Implementing more highly connected RBMs and developing classical encoders and decoders tailored to a given connectivity can only go so far in helping to exploit larger latent spaces. Together with other techniques such as KL-term anneal, the ideas mentioned in the previous two sections help reduce the pressure of the KL term to reach suboptimal local minima. In essence, VAE are also efficient lossy encoders. An alternative direction to increase latent space utilization is thus to train on more complex datasets. By doing so, a larger number of latent units is necessary to store enough information such that the reconstruction term (first term in Eq. 19) is large enough. We give numerical evidence of the intuition above by training the same VAE models used in the previous sections on the Fashion MNIST (FMNIST) dataset. A set of images from the FMINST dataset is shown in the left panel of Fig. 9. FMNIST is the same size as MNIST (50000, 28×28 images) and has the same number of classes. However, its images are more complex with more fine details, including grey-scale features that are important for correct image classification (whereas MNIST digits are substantially black and white). In the right panel of Fig. 9, we train the same models on MNIST (shaded) and FMNIST (solid) and compare the number of active units during training. We see that, apart from the case with fully connected RBMs, all other models use a substantially larger number of latent units.

B. Multi-modality of latent-space RBMs
Exploiting large latent-space RBMs is a necessary condition to eventually achieve quantum advantage when sampling with quantum annealers. This condition is however not sufficient. The typical computational bottleneck in training an RBM is due to the appearance of well-defined modes. These modes make classical sampling techniques inefficient and slow-mixing, resulting in highly correlated samples used both during training and generation. While making sampling harder, the development of multi-modal distributions is actually an appealing property of RBMs, since it allows such models to represent complex and powerful probability distributions. The idea behind searching for quantum advantage in training RBM with quantum annealers is, indeed, to exploit quantum resources (such as tunneling) to more efficiently mix between different modes in the landscape defined by the RBM.
When trained on visible data, an RBM naturally develops complex landscapes to match the complexity of the statistical relationship present in the training data. However, while RBMs trained on latent representations can potentially develop well-defined modes, they do not necessarily do so. In fact, one of the capabilities of generative models with latent variables is finding a set of statistically independent latent features [83]. This is typically enforced during model building by using trivial priors such as the product of independent Gaussian (for continuous latent units) or the product of independent Bernoulli (for discrete latent spaces). Even when the prior is potentially complex and trainable, as an RBM, the presence of the KL term can push the model during training into local minima in which the trained RBM develops a trivial landscape.
In this section we give evidence that RBMs trained in the latent space of a VAE model do indeed develop a nontrivial landscape with well-defined modes. As we have shown in Sec. IV A 1 (see Fig. 6), RBMs with denser connectivities naturally lead to better performing VAEs. This is an indirect indication that we are indeed exploiting the additional capacity and expressivity of more connected RBMs. In this section we give more explicit evidence of this. In Fig. 10, we generate a sequence of images via block Gibbs sampling. The top left image is generated by picking a latent configuration z out of a uniform distribution over all configurations. This latent sample is then sent through the decoder. Going from left-to-right, top-to-bottom, each subsequent image is obtained after updating all latent units with a sequence of block Gibbs updates (one for Bernoulli, two for the bipartite connectivities Chimera and RBM, four for the quadripartite Pegasus connectivity). As expected, in the Bernoulli case ( Fig. 10(a)), each update results in uncorrelated samples. The Chimera connectivity ( Fig. 10(b)) is able to develop weakly correlated samples, as shown by short sequences of similar images. Correlated samples with well-defined modes are more clearly visible with the Pegasus connectivity (( Fig. 10(c))). Finally, we confirm that increasing the connectivity up to a fully connected RBM ( Fig. 10(d)) results in long sequences of correlated samples and related to the deep valleys of the RBM energy landscape.
The results shown in Fig. 10 show that RBMs trained as priors of generative models with latent variables naturally learn multi-modal, nontrivial probability distributions. These distributions are expressive, making the whole VAE more expressive, while at the same time developing the same types of computational bottlenecks that make classical sampling algorithms inefficient. This paves the way to effectively use quantum annealers as means to more scalable quantum-assisted sampling, enabling us to sample from RBMs of sizes and complexity that are infeasible with classical methods.

C. Robustness to noise and control errors
Using quantum annealers to train large RBMs directly on complex data remains challenging. Apart from the unsatisfying performance of using RBMs with quasi twodimensional connectivities on visible data, a major difficulty is biased sampling (and thus inaccurate gradients) obtained with quantum annealers. There are two main sources of bias: control errors and imperfect or incomplete thermalization at the freezing point. While the latter can be improved with appropriate pause-and-ramp annealing schedules, the former can only be improved with technological advancements. Despite these known difficulties, we have shown in the previous sections we have successfully trained large RBM (hundreds of units) in the latent space of a VAE solely using samples coming from a D-Wave 2000Q, without using any hard-coded pre or post-processing to the raw data obtained from the annealer.
We interpret our positive results as an indication that training RBMs with our setup is relatively robust to noise and control errors. In fact, we can interpret both the encoder and decoder as powerful tools to pre-and postprocess data to be sent to the quantum annealer. Using stochastic gradient descent, we train the encoder and decoder to generate a set of latent features that are more easily modeled by the latent-space RBM. For example, real images might have strongly correlated, sharp features (such as regions with black or white pixels), which require large weights to be modeled correctly. A precise implementation of such large weights might be challenging for analog devices with finite range such as quantum annealers. Additionally, both encoders and decoders might be able to learn and correct, or at least reduce the effects of, systematically biased sampling.
To investigate the role of noise and control errors in determining sampling quality and performance of the trained models, we perform a set of comparative experiments in which we train the same model on MNIST dataset, using samples coming from three D-Wave 2000Q with different noise profiles. The Baseline and Lower-Noise D-Wave 2000Q are both publicly available on D-Wave's Leap TM cloud service. We have also included an Interim Lower-Noise processor with an intermediate noise profile that is internally available at D-Wave.
Results are shown in Fig. 11. In Fig. 11 (a) we report the log-likelihood during training. Models trained on the Lower and Interim Lower-Noise D-Wave 2000Q achieved performance comparable to training with population annealing. The evaluation of the log-likelihood for the Baseline D-Wave 2000Q diverged due to diverging weights, as can be seen in Fig. 11(b). The weights of the Baseline processor start diverging after about 100 epochs. The Interim Lower-Noise processor shows an opposite behavior, with weights getting small and plateauing after about 500 epochs. For the Lower-Noise processor, the L1 of the weights plateaus at a value that is closer to that obtained with "noiseless" population annealing. Notice that a consistent comparison in Fig. 11(b) we have reported the weights W rescaled by the effective temperature (see Eq. 17, and not the "bare" annealing values J. Figure 11(b) highlights the remarkably different response of three different quantum annealers to our model. Despite such differences, the performance of our hybrid implementation is robust (as shown in Fig. 11(a)) and does not require any hardware-specific adaptations or fine-tuning. Only while training with the Baseline D-Wave 2000Q, we needed a more aggressive clipping (that is restricting the weights and biases to have narrower range than the maximum allowed) to achieve similar performance and converged log-likelihood estimation.
Noise and control errors also manifest in a less efficient use of the latent space, as seen in Fig. 11 (c). All models trained with D-Wave 2000Q use fewer active units than population annealing. Since the estimate of the loglikelihood of the models trained with the Baseline processor did not converge, we focus on the comparison between the Interim and Lower Noise processor: the latter can exploit a larger number of latent units. We finally show in Fig. 11(d) the profile for the extracted effective temperature during training, which is remarkably different for the three D-Wave 2000Q. The results shown in Fig. 11(d) underlines the importance of developing advanced annealing techniques to stabilize temperature during training.
In this section we have demonstrated the robustness to noise of our implementation, as highlighted in Fig. 11 (a). At the same time, Fig. 11 (c) shows an important effect of noise, which is to make it harder for our hybrid model to exploit the optimal number of latent units. We thus anticipate that exploiting large RBMs (with thousands of units, eventually) following the directions indicated in the previous sections must be accompanied by continued efforts in reducing sampling bias due to noise and control errors of future-generation quantum annealing devices.

V. CONCLUSIONS
In this work, we have demonstrated the use of quantum annealers as Boltzmann samplers to estimate the negative phase of classical RBMs placed in the latent space of deep convolutional variational autoencoders. This setup allows for the construction of quantum-classical hybrid generative models that can be scaled to large, realistic datasets. We have mostly experimented with MNIST, a common testbed dataset which includes 50, 000, 28 × 28 binarized handwritten digits to achieve a log-likelihood of about −82.2 ± 0.2 nats, which compares favorably to state-of-the-art achieved with autoregressive models (−78.5 nats (natural unit of information)). In addition to demonstrating scalability and performance, we have discussed several other features of our hybrid approach. First, we are able to use quantum annealers as "native samplers", that is, samplers from their native graph: we do not use any hard-coded encoding-decoding scheme such as minor embedding or majority vote. Arguably this is one of the most effective ways to exploit the computational capabilities of quantum annealers. The encoding and decoding process is indeed efficiently performed by deep convolutional networks, which are trained to extract relevant feature via stochastic gradient descent. As we have shown, this approach is particularly flexible, since it naturally adapts to different connectivities and arbitrary working graphs.
Second, by successfully training the same model on three quantum annealers with different noise profiles, we have shown that our implementation is fairly robust to noise and control errors. Indeed, the deep convolutional networks can be seen as learned pre-and post-processing steps that regularize both the visible data and the effects of noise. A key reason of the success of our implementation is indeed the fact that the weights and biases as implemented on the quantum annealers rarely grow (during training) beyond their allowed range, even with minimal or no regularization. This result is to be contrasted to training RBMs directly on visible data, for which weights are typically much larger and regularization is critical to avoid overfitting. The latter case is much more challenging for analog devices with limited allowed range.
The quantum-classical hybrid models we have considered in this work employ a large amount of classical computing power performed on modern GPUs. The computational task that we offloaded to the quantum annealer (sampling from the latent-space RBMs) can still be performed classically at a fraction of the overall computational cost. To achieve any form of quantum advantage in this framework, we need to offload generative capacity to the prior, by exploiting large RBMs capable of representing complex probability distributions from which classical sampling becomes too expensive. We have provided evidence that this path to quantum advantage is possible by deploying annealers with denser connectivities and lower noise, engineering classical neural nets that better exploit physical connectivities and by working with more complex datasets. All these improvements seem achievable in the near future, and represent possible interesting lines of research that we leave for future work. encoder and decoder. In the encoder, down-sampling is achieved by employing strided convolutions, while in the decoder up-sampling is similarly obtained with strided deconvolutions. The last (first) layer of the encoder (decoder) network is a dense network with two (one) layers (see Fig. 12 (c)). In the case of the encoder, a hierarchical (conditional) relationship among variables is implemented as described in Fig. 7(a). The convolutional networks are implemented as a simple sequence of five gated convolutions, whose detailed implementation is given in Fig. 12 (c). Notice the use of batch normalization and dropout. The latter was only used in the decoder, to prevent over-fitting, with a drop-rate of 0.2. We trained our models using batches of size 100, and the Adam optimizer with an initial learning rate of 3e −3 , exponentially decaying to a minimum learning rate of 1e −4 after 1800 epochs. The temperature parameter τ defined in Eq. 10 for the Gumbel trick is typically annealed from large to small values. We however did not find a real advantage in doing so, and we fixed the parameter to the low value τ = 1/7 throughout the training. To improve training and avoid collapse of the approximating posterior to trivial local minima, we have linearly annealed the KL term from zero to its full value within 200 epochs.
In general we have trained our models using an importance-weighted estimate of the likelihood. As first described in Ref. [56], a K-sample weighting estimate of the log-likelihood can be written as: which is equivalent to the ELBO defined in Eq. 3 for K = 1 and converges to the exact log-likelihood for K → ∞. We also found useful, to reduce the variance of the gradients of L K , to use a multi sample evaluation of the gradients per data point x. In other words, we can use the following for training: with k = 1, . . . , K and d = 1, . . . , D. Notice that Eq. A2 requires sampling KD latent configurations per datapoint x. This can be parallelized on GPU by effectively working, in our case, with batches of size KD × 100. In our experiments we found it effective to have KD = 8 and to change the relative values of K and D while keeping their product constant. Every 200 epochs we changed their value as follows: (K, D) = (1, 8) → (2, 4) → (2, 4) → (4, 2) → (4, 2) → (8, 1) and kept it constant afterwards. While a larger K results in a tighter variational lower bound, it also makes harder training the approximating posterior, the reason being that in the limit of large K the bound L K does not depend on q φ (ζ|x). We found this K ↔ D anneal to be more efficient at both training the approximating posterior and training on a tighter bound to the log-likelihood. We used the same technique, with K = 1000, D = 1 as the estimate of the log-likelihood.

Appendix B: Sample collection with D-Wave 2000Q
To estimate the negative phase with D-Wave annealers, we used 1000 samples obtained with independent annealing runs. For each gradient evaluation, we performed 5 random spin-reversal transformations and collected 200 samples each time. We used a forward annealing schedule with a 1 µs forward anneal up to s = 0.5, where we paused for 10 µs. After the pause we performed a 10 ns quench to finish the anneal. After a bit of experimentation, we found this particular annealing schedule to slightly improve training, although a simple forward annealing without pause-and-quench also worked well. We did not perform any post-processing of the samples, which we used as-is to compute the negative phase.
An important question for future works is whether a more careful choice of the annealing schedule, possibly with longer pauses, can stabilize the effective temperature at which samples are drawn from the hardware. We discuss the importance of this aspect in the next section.