A generative modeling approach for benchmarking and training shallow quantum circuits

Hybrid quantum-classical algorithms provide ways to use noisy intermediate-scale quantum computers for practical applications. Expanding the portfolio of such techniques, we propose a quantum circuit learning algorithm that can be used to assist the characterization of quantum devices and to train shallow circuits for generative tasks. The procedure leverages quantum hardware capabilities to their fullest extent by using native gates and their qubit connectivity. We demonstrate that our approach can learn an optimal preparation of the Greenberger-Horne-Zeilinger states, also known as "cat states". We further demonstrate that our approach can efficiently prepare approximate representations of coherent thermal states, wave functions that encode Boltzmann probabilities in their amplitudes. Finally, complementing proposals to characterize the power or usefulness of near-term quantum devices, such as IBM's quantum volume, we provide a new hardware-independent metric called the qBAS score. It is based on the performance yield in a specific sampling task on one of the canonical machine learning data sets, known as Bars and Stripes. We show how entanglement is a key ingredient in encoding the patterns of this data set, making it an ideal benchmark for testing hardware starting at four qubits and up. We provide experimental results and evaluation of this metric to probe the trade-off between several architectural circuit designs and circuit depths on an ion-trap quantum computer.


I. INTRODUCTION
What is a good metric for the computational power of noisy intermediate-scale quantum [1] (NISQ) devices? Can machine learning (ML) provide ways to benchmark the power and usefulness of NISQ devices? How can we capture the performance scaling of these devices as a function of circuit depth, gate fidelity, and qubit connectivity? In this work, we design a hybrid quantum-classical framework called data-driven quantum circuit learning (DDQCL) and address these questions through simulations and experiments.
Hybrid quantum-classical algorithms, such as the variational quantum eigensolver [2,3] (VQE) and the quantum approximate optimization algorithm [4,5] (QAOA), provide ways to use NISQ computers for practical applications. For example, VQE is often used in quantum chemistry when searching for the ground state of the electronic Hamiltonian of a molecule [6-8]. QAOA is used in combinatorial optimization to find approximate solutions of a classical Ising model [9], and it has been demonstrated on a 19-qubit device for the task of clustering synthetic data [10]. Other successful hybrid approaches based on genetic algorithms were proposed for approximating quantum adders and training quantum autoencoders [11-13]. In all these examples, there is a clear-cut objective function describing the cost associated with each candidate solution. The task is then to optimize it exploiting both quantum and classical resources.

* Correspondence: aperdomo@post.harvard.edu
Although optimization tasks offer a great niche of applications, probabilistic tasks involving sampling hold the most potential for demonstrating quantum advantage in the near term [14-16]. For example, learning probabilistic generative models is in many cases an intractable task, and quantum-assisted algorithms have been proposed for both gate-based [17,18] and quantum annealing devices [19-23]. Differently from optimization, the learning of a generative model is not always described by a clear-cut objective function. All we are given as input is a data set, and there are several cost functions that could be used as a guide, each one with its own advantages, disadvantages, and assumptions.
Here we present a hybrid quantum-classical approach for generative modeling on gate-based NISQ devices which heavily relies on sampling. We use the 2^N amplitudes of the wave function obtained from an N-qubit quantum circuit to capture the correlations observed in a data set. The circuit layout is carefully designed to exploit the full power of the underlying NISQ hardware without the need for a compilation step.
arXiv:1801.07686v2 [quant-ph] 24 Nov 2018

Because Born's rule determines the output probabilities of the model, our work is also an early implementation of a Born machine [24]. Our work differs from other implementations of Born machines [25-28] as we need not rely on the construction of tensor networks and their efficient manipulation through graphical processing units. By employing the quantum circuit itself as a model for the data set, we also differentiate from quantum algorithms that target specific probability distributions. For example, in Ref. [29] the authors developed a hybrid algorithm to approximately sample from a Boltzmann distribution; the distribution is then used to update a classical generative model, but that requires running the full hybrid algorithm for each training iteration. Our work is also in contrast with other quantum-assisted ML work (see e.g. Refs. [30-33]) requiring fault-tolerant quantum computers, which are not expected to be readily available in the near term [34].
On the benchmarking of NISQ devices, quantum volume [9,35] has been proposed as an architecture-neutral metric. It is based on the task of approximating a specific class of random circuits and estimating the associated effective error rate. This is very general and indeed useful for estimating the computational power of a quantum computer. In this paper, we propose the qBAS score, a complementary metric designed for benchmarking hybrid quantum-classical systems. The score is based on the generative modeling performance on a canonical synthetic data set which is easy to generate, visualize, and validate for sizes up to hundreds of qubits. Yet, implementing a shallow circuit that can uniformly sample such data is hard; we will show that some candidate solutions require large amounts of entanglement. Hence, any miscalibration or environmental noise will affect this single performance number, enabling comparison between different devices or across different generations of the same device. Moreover, the score depends on the classical resources, hyper-parameters, and various design choices, making it a good choice for the assessment of the hybrid system as a whole.
Our design choices are based on the setup of existing ion trap architectures. We choose the ion trap because of its full connectivity among qubits [36], which allows us to study several circuit layouts on the same device. Our experiments are carried out on an ion trap quantum computer hosted at the University of Maryland [37].

The learning pipeline
In this Section, we present a hybrid quantum-classical algorithm for the unsupervised machine learning task of approximating an unknown probability distribution from data. This task is also known as generative modeling. First, we describe the data and the model; then, we present a training method.
The data set D = (x^{(1)}, ..., x^{(D)}) is a collection of D independent and identically distributed random vectors. The underlying probabilities are unknown and the target is to create a model for such distribution. For simplicity, we restrict our attention to N-dimensional binary vectors x^{(d)} ∈ {−1, +1}^N, e.g. black and white images. This gives us an intuitive one-to-one mapping between observation vectors and the computational basis of an N-qubit quantum system, that is, x^{(d)} ↔ |x^{(d)}⟩. Note that standard binary encodings can be used to implement integer, categorical, and approximate continuous variables. Provided with the data set D, our goal is now to obtain a good approximation to the target probability distribution P_D. A quantum circuit model with fixed depth and gate layout, parametrized by a vector θ, prepares a wave function |ψ(θ)⟩ from which probabilities are obtained according to Born's rule, P_θ(x) = |⟨x|ψ(θ)⟩|². Following a standard approach from generative machine learning [38], we can minimize the Kullback-Leibler (KL) divergence [39] D_KL[P_D | P_θ] from the circuit probability distribution in the computational basis P_θ to the target probability distribution P_D. Minimization of this quantity is directly related to the minimization of a well-known cost function: the negative log-likelihood C(θ) = −(1/D) Σ_{d=1}^{D} log P_θ(x^{(d)}). However, there is a caveat; output probabilities must be estimated from a finite number of measurements, and the exponentially large number of computational basis states could prevent us from measuring some of the relevant configurations. This means that P_θ(x^{(d)}) = 0 for some x^{(d)} in the data set would lead to infinite cost. To avoid singularities in the cost function, we use a simple variant, C(θ) = −(1/D) Σ_{d=1}^{D} log max(ε, P_θ(x^{(d)})), where ε > 0 is a small number to be chosen. Note that the number of measurements needed to obtain an unbiased estimate of the relevant probabilities may not scale favorably with the number of qubits N. In the Supplementary Material we suggest alternative cost functions such as the moment matching error and the earth mover's distance.
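The clamped negative log-likelihood can be estimated directly from measurement counts. The sketch below is our own illustration, not the paper's code; clamping P_θ from below by ε is one plausible reading of the "simple variant" described in the text.

```python
import numpy as np

def nll_cost(counts, data, eps=1e-8):
    """Clamped negative log-likelihood estimated from measurement counts.

    counts: dict mapping measured bitstrings to their number of occurrences.
    data:   list of bitstrings sampled from the target distribution.
    """
    n_reads = sum(counts.values())
    cost = 0.0
    for x in data:
        p = counts.get(x, 0) / n_reads   # empirical estimate of P_theta(x)
        cost -= np.log(max(eps, p))      # clamp avoids log(0) for unseen states
    return cost / len(data)
```

For a perfect two-state model, e.g. `nll_cost({'00': 50, '11': 50}, ['00', '11'])`, the cost reduces to log 2, the entropy of the target distribution.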
After estimating the cost, we update the parameter vector θ to further minimize it. This can in principle be done by any suitable classical optimizer. We used a gradient-free algorithm called particle swarm optimization (PSO) [40,41], as previously done in the context of quantum chemistry [7]. The algorithm iterates for a fixed number of steps, or until a local minimum is reached and the cost no longer decreases.
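To illustrate the classical side of the loop, here is a minimal particle swarm optimizer in the spirit of Refs. [40,41]. It is a sketch: the hyper-parameters (inertia and pull coefficients) are common textbook defaults rather than the paper's settings, and in DDQCL the cost callback would query the quantum hardware instead of a toy function.

```python
import numpy as np

def pso_minimize(cost, dim, n_particles=20, iters=100, seed=0):
    """Minimal gradient-free particle swarm optimization sketch."""
    rng = np.random.default_rng(seed)
    w, c1, c2 = 0.7, 1.5, 1.5                      # inertia, personal/global pull
    x = rng.uniform(-np.pi, np.pi, (n_particles, dim))  # angles start at random
    v = np.zeros_like(x)
    pbest = x.copy()
    pbest_cost = np.array([cost(p) for p in x])
    g = pbest[pbest_cost.argmin()].copy()          # global best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        c = np.array([cost(p) for p in x])
        improved = c < pbest_cost
        pbest[improved], pbest_cost[improved] = x[improved], c[improved]
        g = pbest[pbest_cost.argmin()].copy()
    return g, pbest_cost.min()
```

On a smooth toy cost such as the sphere function, the swarm converges quickly; on hardware, each cost evaluation would involve N_reads circuit executions.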
We chose the layout of the model circuit to be of the following form. Let us consider a general circuit parametrized by single-qubit rotations {θ_i^{(l,k)}} and two-qubit entangling rotations {θ_ij^{(l)}}. The subscripts denote the qubits involved in the operation, l denotes the layer number, and k ∈ {1, 2, 3} denotes the rotation identifier. The latter is needed as we decompose an arbitrary single-qubit rotation into three simpler rotations (see Section IV for further details about some exceptions and potential simplifications depending on the native gates available in each specific hardware). Inspired by the gates readily available in ion trap quantum computers, we use alternating layers of arbitrary single-qubit gates (odd layers) and Mølmer-Sørensen XX entangling gates [42-45] (even layers) as our model. All parameters are initialized at random.
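The alternating-layer parametrization can be sketched as simple bookkeeping. The tuple conventions (l, i, k) and (l, i, j) and the topology names below are ours, and the hardware-specific simplifications mentioned in Section IV (e.g. fewer rotations in the first layer) are ignored.

```python
from itertools import combinations

def circuit_parameters(n_qubits, n_layers, topology="all"):
    """Lay out the trainable angles of the alternating-layer ansatz:
    odd layers carry 3 angles per qubit (arbitrary single-qubit rotation),
    even layers carry one angle per XX gate in the chosen topology."""
    edges = {
        "chain": [(i, i + 1) for i in range(n_qubits - 1)],
        "star":  [(0, i) for i in range(1, n_qubits)],
        "all":   list(combinations(range(n_qubits), 2)),
    }[topology]
    layout = []
    for l in range(1, n_layers + 1):
        if l % 2 == 1:   # single-qubit rotation layer: (layer, qubit, rotation id)
            layout.append([(l, i, k) for i in range(n_qubits) for k in (1, 2, 3)])
        else:            # entangling layer: (layer, qubit i, qubit j)
            layout.append([(l, i, j) for i, j in edges])
    return layout

# 4 qubits, L = 2, all-to-all: 12 single-qubit angles + 6 XX angles
n_params = sum(len(layer) for layer in circuit_parameters(4, 2, "all"))
```

This also makes concrete why going from L = 2 to L = 4 roughly doubles the number of parameters, as discussed later for the BAS(2, 2) experiments.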
It is important to note that in our model the number of parameters is fixed and is independent of the size D of the data set. This means we can hope to obtain a good approximation to the target distribution only if the model is flexible enough to capture its complexity. Increasing the number of layers or changing the topology of the entangling layer alters this flexibility, potentially improving the quality of the approximation. However, we anticipate that such flexible models are more challenging to optimize because of their larger number of parameters. Principled ways to choose the circuit layout and to regularize its parameters could significantly help in such a case.
As summarized in Figure 1, the learning algorithm iteratively adjusts all the parameters to minimize the value of the cost function. At any iteration, the user-defined cost is approximated using both samples from the data set and measurements from the quantum hardware, hence the name data-driven quantum circuit learning (DDQCL).

The qBAS score
Bars and stripes (BAS) [46] is a synthetic data set of images that has been widely used to study generative models for unsupervised machine learning. For n × m pixels, there are N_BAS(n,m) = 2^n + 2^m − 2 images belonging to BAS, and they can be efficiently produced and visualized. The probability distribution is 1/N_BAS(n,m) for each pattern belonging to BAS(n, m), and zero for any other pattern. Figure 2 on the top left panel shows patterns belonging to BAS(2, 2), while the top central panel shows the remaining patterns.
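For concreteness, the BAS(n, m) patterns can be enumerated in a few lines. This sketch returns flattened binary images and recovers the count 2^n + 2^m − 2, since the all-0 and all-1 images are both a bar and a stripe pattern.

```python
import itertools

def bas_patterns(n, m):
    """Enumerate bars-and-stripes patterns of an n x m binary image:
    images whose rows are constant (stripes) or whose columns are
    constant (bars), flattened row by row into tuples of 0/1."""
    patterns = set()
    for rows in itertools.product([0, 1], repeat=n):   # horizontal stripes
        patterns.add(tuple(r for r in rows for _ in range(m)))
    for cols in itertools.product([0, 1], repeat=m):   # vertical bars
        patterns.add(tuple(cols) * n)                  # same row repeated n times
    return sorted(patterns)
```

For example, `bas_patterns(2, 2)` yields the six patterns that the quantum state must encode with equal probability 1/6.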
We use DDQCL to learn a circuit that encodes all the BAS patterns in the wave function of a quantum state. This also allows us to design the qBAS(n, m) score: a task-specific figure of merit to assess the performance of shallow quantum circuits. In a single number, it captures the model capacity of the circuit layout and intrinsic hardware strengths and limitations in solving a complex sampling task that requires a fair amount of entanglement. It takes into account the model circuit depth, gate fidelities, and any other architectural design aspects, such as the quantum hardware's qubit-to-qubit connectivity and native set of single- and two-qubit gates. It also takes into account classical resources such as the choice of cost function, optimizer, and hyper-parameters. Therefore, it can be used to benchmark the performance of the components of the hybrid quantum-classical system.
FIG. 1. General framework for data-driven quantum circuit learning (DDQCL). Data vectors are interpreted as representative samples from an unknown probability distribution, and the task is to model such distribution. The 2^N amplitudes of the wave function resulting from an N-qubit quantum circuit are used to capture the correlations observed in the data. Training of the quantum circuit is achieved by successive updates of the parameters θ, corresponding to specifications of single-qubit operations and entangling gates. In this work, we use arbitrary single-qubit rotations for the odd layers, and Mølmer-Sørensen XX gates for the even layers. At each iteration, measurements from the quantum circuit are collected and contrasted with the data through evaluation of a cost function which tracks the learning progress.
The qBAS(n, m) score is an instantiation of the F1 score widely used in the context of information retrieval. The F1 score is defined as the harmonic mean of the precision p and the recall r, i.e. F1 = 2pr/(p + r). The precision p indicates the ability to retrieve states which belong to the data set of interest [47]. In our context, this is the number of measurements that belong to the BAS(n, m) data set, divided by the total number of measurements N_reads performed. The recall r is the capacity of the model to retrieve the whole spectrum of patterns belonging to the desired data set. In our case, if we denote the number of unique BAS patterns that were measured as d(N_reads), then r = d(N_reads)/N_BAS(n,m). To score high (F1 ≈ 1.0), both high precision (p ≈ 1.0) and high recall (r ≈ 1.0) are required.
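The definitions above translate directly into code. The sketch below is our own illustration of the score (the function name and interface are ours, not the paper's); `valid` is the set of all N_BAS(n,m) patterns, e.g. produced by enumerating bars and stripes.

```python
def qbas_score(samples, valid):
    """qBAS as an F1 score.

    samples: list of measured bitstrings (one per readout).
    valid:   set of all N_BAS(n,m) bars-and-stripes patterns.
    """
    hits = [s for s in samples if s in valid]
    p = len(hits) / len(samples)          # precision: fraction of valid reads
    r = len(set(hits)) / len(valid)       # recall: fraction of patterns seen
    return 2 * p * r / (p + r) if p + r else 0.0

# BAS(2, 2) patterns written as flattened 4-bit strings
valid22 = {'0000', '1111', '0101', '1010', '0011', '1100'}
```

A sampler that always outputs the same valid pattern has p = 1 but r = 1/6, giving a score of 2/7, which illustrates why mode collapse is penalized.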
FIG. 2. Top left panel: patterns belonging to BAS(2, 2); top central panel: the remaining non-BAS patterns. The top right panel corresponds to N = 4 qubits, showing some of the qubit-to-qubit connectivity topologies that can be set up in the entangling layer and natively implemented by the ion trap quantum computer (e.g. chain, star, and all-to-all). The bottom left panel shows the results of DDQCL simulations of shallow circuits with different topologies. We show the bootstrapped median and 90% confidence interval over the distribution of medians of the KL divergence as learning progresses for 100 iterations. The mean-field-like circuit L = 1 (green crosses) severely underperforms. A significant improvement is obtained with L = 2, where most of the parameters for XX gates have been learned to their maximum entangling value. These observations indicate that entanglement is a key resource for learning the BAS data set. Note that for L = 2 the choice of topology becomes a key factor for improving the performance. The chain topology (purple squares) performs slightly better than the star topology (red stars) even though they have the same number of parameters. The all-to-all topology (orange circles) significantly outperforms all the others as it has more expressive power. The bottom central panel extends the previous analysis to deeper circuits with L = 4 and approximately twice the number of parameters. All the topologies achieve a lower median KL divergence, and the confidence intervals shrink. The bottom right panel shows the bootstrapped mean qBAS(2, 2) score and 95% confidence interval for simulations (green bars) and experiments on the ion trap quantum computer hosted at the University of Maryland (pink bars).

The F1 score is a useful measure for the quality of information retrieval and classification algorithms, but for our purposes it has a caveat: the dependence of r on the total number of measurements. As an example, consider a model that generates only BAS patterns, i.e. its precision is 1.0, but with a highly heterogeneous distribution. If some of the BAS patterns have infinitesimally small probability, we can still push the recall to 1.0 by taking a large number of measurements N_reads → ∞. This is not desirable, since our purpose is to evaluate circuits on the task of uniformly sampling all the patterns from BAS(n, m). Therefore, to define a unique score it is important to fix N_reads to a reasonable value such that r ≈ 1.0 under the assumption of the model distribution being equal to the target distribution P_BAS(n,m), but not so large as to make the score insensitive to deviations from the target distribution. Assuming a perfect model distribution P_BAS(n,m) = 1/N_BAS(n,m), the expected number of measurements to obtain a value of r = 1.0 can be estimated using the famous coupon collector's problem. For our purposes here, we set N_reads to be equal to the expected number of samples that need to be drawn to collect all the N_BAS(n,m) patterns ("coupons"). That is, N_reads = ⌈N_BAS(n,m) H_{N_BAS(n,m)}⌉, where H_k is the k-th harmonic number. Computed values of N_reads are provided in Table I for different values of n and m up to 100 qubits. As shown in the table, the numbers of readouts required to determine qBAS(n, m) are within the experimental capabilities of current NISQ devices [48].
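The coupon-collector estimate for N_reads is straightforward to compute. A sketch, rounding up the expectation to an integer number of readouts; for BAS(2, 2) it reproduces the value N_reads = 15 used in the experiments.

```python
import math

def n_reads(n, m):
    """Expected number of uniform draws needed to see every BAS(n, m)
    pattern at least once (coupon collector's problem), rounded up."""
    n_pat = 2 ** n + 2 ** m - 2                       # number of "coupons"
    harmonic = sum(1.0 / k for k in range(1, n_pat + 1))  # H_{n_pat}
    return math.ceil(n_pat * harmonic)
```

For BAS(2, 2), n_pat = 6 and 6·H_6 = 14.7, so 15 readouts per batch suffice on average.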
For statistical robustness, we recommend as good practice performing R repetitions of the N_reads measurements, leading to R independent estimates of the recall (each denoted as r_i). For estimating the precision p, all the samples collected should be used. Using this value of p, one can compute R independent values of the qBAS(n, m) score from each of the r_i. These are subsequently bootstrapped to obtain a more robust average for the final reported value of qBAS(n, m) (see details in Section IV).
We note that a more general performance indicator than the qBAS(n, m) score is the KL divergence D_KL[P_BAS(n,m) | P_θ]. However, this would not be robust in terms of scalability; as n × m becomes large, it is expected that the KL divergence is frequently undefined [49]. This is the case whenever measurements yield distributions such that P_θ(x^{(d)}) = 0 for any of the x^{(d)} in BAS(n, m). In all these cases, the qBAS score can still be computed, and the number of measurements N_reads necessary for obtaining a robust estimate remains small enough to be practical for intermediate sizes n × m.

Experiments
We investigated three different examples, namely, GHZ state preparation, coherent thermal state preparation, and BAS(2, 2). We implemented each example of DDQCL using both numerical simulations and experiments. We explored the parameter space of DDQCL by varying the qubit-to-qubit connectivity topology (see top right panel of Figure 2) and the number of layers. We evaluated the performance of each instance by using the KL divergence from the circuit probability distribution in the computational basis to the target probability distribution. While explicitly computing the KL divergence becomes intractable as the instance size increases, we were able to compute this quantity explicitly for all cases considered in this paper.

GHZ state preparation
To test the capabilities of DDQCL, we started with the preparation of GHZ states, also known as "cat states" [50]. Besides their importance in quantum information, the choice is motivated by their simple description and by the availability of many studies about their preparation (see e.g. Refs. [51-53]). From the DDQCL perspective, we explored whether it is possible to learn any of the known recipes for GHZ state preparation starting only from classical data. More specifically, the input data consists of samples from a distribution corresponding to the two desired computational basis states; P_0 = 0.5 for the state |0...0⟩ and P_1 = 0.5 for the state |1...1⟩. When using a single entangling layer with the all-to-all topology, DDQCL yielded many degenerate but different preparations of GHZ-like states, differing only by a relative phase between the two main states.
In particular, we first ran particle swarm optimization on 25 random initializations for 3-, 4-, 5- and 6-qubit instances. Then, a human expert inspected the best set of parameters learned for each size and, after removing some noise, e.g. rounding the parameters to some precision, spotted a clear pattern. Instances of 3 and 5 qubits yielded one recipe, while instances of 4 and 6 qubits yielded another. The recipes obtained are summarized in Figure 3 and were verified for larger numbers of qubits, both odd and even. Indeed, DDQCL successfully reproduced the recipes previously used on ion trap quantum computers [54], and, to the best of our knowledge, they correspond to the most compact and efficient protocols for GHZ state preparation using XX gates (see Figure 3). Another commonly used approach consists of cascading entangling gates with alternations of single-qubit rotations [53]. DDQCL produced approximate recipes of this kind in some of the test cases for 3 and 4 qubits when using a single entangling layer with chain topology.
It is interesting to note that in DDQCL all the parameters are learned independently and are not constrained to be the same. As shown in Figure 3, the learning process unveiled that they converge to the same value. This is not necessarily the case for the other data sets considered below. Finally, it is reassuring that DDQCL obtained circuits for cat state preparation starting from samples of a classical distribution. While this target distribution could have been modeled by a classical zero-temperature ferromagnet, there apparently was no other way for our circuit to reproduce such a distribution than by preparing a GHZ-like state.
FIG. 3. GHZ-like state preparation assisted by DDQCL. The left (right) panel shows the recipe obtained by DDQCL for the even (odd) cat state preparation, up to a relative phase between the output integer states |0...0⟩ and |1...1⟩. Gate Rx(θ) stands for the conventional single-qubit rotation about the x axis. GMS stands for a global Mølmer-Sørensen gate [45] acting on all the N = 2n (N = 2n + 1) qubits, and it is equivalent to the application of local XX gates to all N(N − 1)/2 pairs of qubits.
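The role of the XX gate is easiest to see in the smallest instance: a single maximally entangling XX rotation already turns |00⟩ into a two-qubit cat state, up to a relative phase. The numerical sketch below illustrates this building block only; it is not the N-qubit recipe of Figure 3.

```python
import numpy as np

# A single maximally entangling XX rotation, exp(-i (pi/4) X⊗X),
# maps |00> to the GHZ-like state (|00> - i|11>)/sqrt(2).
X = np.array([[0, 1], [1, 0]], dtype=complex)
XX = np.kron(X, X)
theta = np.pi / 4
# Since (X⊗X)^2 = I, the matrix exponential has the closed form below.
U = np.cos(theta) * np.eye(4) - 1j * np.sin(theta) * XX
psi = U @ np.array([1, 0, 0, 0], dtype=complex)   # start from |00>
probs = np.abs(psi) ** 2                          # -> [0.5, 0, 0, 0.5]
```

Measuring this state in the computational basis reproduces exactly the two-outcome target distribution P_0 = P_1 = 0.5 used as training data.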

Coherent thermal states
Thermal states play an important role in statistical physics, quantum information, and machine learning. Using DDQCL, we trained quantum circuits with different numbers of layers L ∈ {1, 2, 3} and all-to-all topology to approximate a target Boltzmann distribution. In particular, we considered data sets sampled from the Boltzmann distribution of 25 synthetic instances with N = 5 qubits. By decreasing the temperature T of the target distribution, we can increase the difficulty of the learning task. Figure 4 shows the bootstrapped median and 90% confidence interval over the distribution of medians of the KL divergence during learning. Deeper circuits such as L = 3 (purple pentagons) consistently outperformed shallower circuits such as L = 2 (red circles) and L = 1 (yellow triangles). This became more evident as we went from easy learning tasks (Figure 4 (a)) to hard learning tasks (Figure 4 (c)). Results for instances of N = 6 qubits are shown in Figure 6 in the Supplementary Material.
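For the small N considered here, the target Boltzmann distribution can be computed exactly by brute force. The sketch below is a generic illustration with an Ising energy E(s) = −Σ_{i<j} J_ij s_i s_j − Σ_i h_i s_i; the couplings of the paper's 25 synthetic instances are not reproduced here.

```python
import numpy as np
from itertools import product

def boltzmann_target(J, h, T):
    """Exact Boltzmann distribution of a small classical Ising model.

    J: symmetric coupling matrix with zero diagonal; h: local fields.
    Tractable only for a few spins; used as a training target.
    """
    N = len(h)
    states = np.array(list(product([-1, 1], repeat=N)))
    # 0.5 factor: the einsum sums every ordered pair (i, j) twice.
    E = -0.5 * np.einsum('ki,ij,kj->k', states, J, states) - states @ h
    w = np.exp(-E / T)
    return states, w / w.sum()

# Two ferromagnetically coupled spins: aligned states dominate at low T.
states, probs = boltzmann_target(np.array([[0.0, 1.0], [1.0, 0.0]]),
                                 np.zeros(2), T=1.0)
```

Lowering T sharpens the distribution around the ground states, which is exactly the mechanism used in the paper to tune the difficulty of the learning task.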
To assess how well DDQCL performs on the generative task, we compared it to the inverse Bethe approximation [55] (see also Eqs. (3.21) and (3.22) in Ref. [56]), a classical closed-form approach widely used in statistical physics to solve the inverse Ising problem. As shown in Figure 4, the inverse Bethe approximation (green bar) performed extremely well in the easy task (a), matched the L = 3 quantum circuit in the intermediate task (b), and underperformed on the difficult task (c). The latter observation comes from the fact that the median performance of the inverse Bethe approximation has very large confidence intervals. We emphasize that this is not a form of quantum supremacy, as the two methods are fundamentally different. DDQCL prepares a quantum state without the assumption of an underlying Boltzmann distribution, while the inverse Bethe approximation infers the parameters under that assumption. Furthermore, the error in the inverse Bethe approximation is expected to go to zero with system size, and only above the reference temperature T_c (see Section IV for details). It is thus not surprising that we obtained poor performance in Figure 4 (c) with the inverse Bethe approximation. We also note that there exist several other classical methods that could have performed better in inferring the parameters, e.g. Boltzmann machines [57].
For the purposes of benchmarking and measuring the power of NISQ devices with DDQCL, it is not sufficient to have an easy-to-generate target data set; we also require the data set to represent a quantum state that is useful in quantum computing while being sufficiently challenging for the quantum computer to generate. As a metric for the quality of a hybrid system, the qBAS score aims to measure the difficulty of generating the target data set on the quantum computer in a practical setting.
Let S be the entanglement entropy averaged over all two-qubit subsets in a pure state [58]. We start by noting that the four-qubit cat state has S = 1.0. Given that cat states, ideal for studying decoherence and the decay of quantum information [51,53], have been widely used in quantum computing for their rich entangled nature, and given the importance of entanglement in quantum information processing, we consider the entanglement entropy as a proxy measure of a quantum state's usefulness in quantum computing. Since the BAS(2, 2) target state with 1/√6 amplitude on each of the relevant computational basis states has entanglement entropy S = 1.25163, the qBAS score effectively measures how good a quantum computer is at preparing a useful quantum state. We also note that, interestingly, one of the quantum representations of BAS(2, 2) found by DDQCL (see Figure 7 in the Supplementary Material) reached a value S = 1.69989. This is close to the maximal possible entanglement entropy S = 1.79248 attainable with four qubits [58]. This shows the power of our framework, in that DDQCL is capable of handling useful quantum states that are rich in entanglement. This is an important observation since we know, based on our empirical results, that (i) single-layer circuits with no entangling gates severely underperform in producing an output state probability distribution close to the target data set, and (ii) when inspecting the parameters learned for circuits with all-to-all topology and L = 2 layers, we found that most of the XX gates reached their maximum entangling setting.
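The quoted value S = 1.25163 for the uniform BAS(2, 2) superposition can be checked numerically by averaging the entanglement entropy over all two-qubit subsets. A sketch using Schmidt coefficients (squared singular values of the bipartitioned state):

```python
import numpy as np
from itertools import combinations

def avg_two_qubit_entropy(psi):
    """Entanglement entropy (base 2) averaged over all two-qubit subsets
    of a 4-qubit pure state, via the Schmidt decomposition."""
    n = 4
    psi = psi.reshape([2] * n)
    entropies = []
    for pair in combinations(range(n), 2):
        rest = [q for q in range(n) if q not in pair]
        # Group the pair's axes as rows and the rest as columns.
        m = np.transpose(psi, list(pair) + rest).reshape(4, 4)
        lam = np.linalg.svd(m, compute_uv=False) ** 2   # Schmidt coefficients
        lam = lam[lam > 1e-12]
        entropies.append(-np.sum(lam * np.log2(lam)))
    return float(np.mean(entropies))

# Uniform superposition over the six BAS(2, 2) patterns.
psi = np.zeros(16)
for bits in ['0000', '1111', '0101', '1010', '0011', '1100']:
    psi[int(bits, 2)] = 1 / np.sqrt(6)
# avg_two_qubit_entropy(psi) is approximately 1.25163
```

The same routine applied to the four-qubit cat state gives S = 1.0, consistent with the comparison made in the text.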
We now discuss results for the qBAS(2, 2) score, which we computed experimentally and theoretically in order to compare the entangling topologies sketched in the top right panel of Figure 2 and different numbers of layers. The process consists of two steps: first, DDQCL is used to encode BAS(2, 2) in the wave function of the quantum state; second, the best circuits, i.e. those with lowest cost, are compared using the qBAS(2, 2) score.
The bottom left and bottom central panels in Figure 2 show the bootstrapped median of the KL divergence and 90% confidence interval over 25 random initializations of DDQCL in silico. The all-to-all topology (orange circles) always outperforms sparse topologies (red stars and purple squares). However, deeper circuits do not always provide significant improvements, as is the case for all-to-all L = 4 (dark green circles) versus all-to-all L = 2 (orange circles). A possible explanation is that, when going from two to four layers, we approximately double the number of parameters, and particle swarm optimization struggles to find enhanced local optima. Another plausible explanation is that, for this small data set, the all-to-all circuits with L = 2 are already close to optimal performance (we show supporting evidence in Figure 7 in the Supplementary Material).
As the best performing circuits to compare using the qBAS score, we chose all-to-all L = 2 and star L ∈ {2, 4} circuits. While they represent very different approximate solutions to the same problem, they may be compared with the help of the qBAS(2, 2) score. For each setting, we computed 25 scores from batches of size N_reads = 15 samples, as described in Section II. The bottom right panel in Figure 2 shows the bootstrapped mean qBAS(2, 2) score and 95% confidence interval for simulations (green bars) and experiments on the ion trap quantum computer hosted at the University of Maryland (pink bars). The score is sensitive to the depth of the circuit, as shown by the performance improvement of L = 4 compared to L = 2 in the star topology. Note that the theoretical improvement for using L = 4 is larger than that observed experimentally in the ion trap. This is because the quantum computer accumulated errors while executing the deeper circuit. The score is also sensitive to the choice of topology, as shown by the drop in performance of star compared to all-to-all when the same number of layers L = 2 is used.

FIG. 4. Learning of coherent thermal states at different temperatures T, where T_c is the reference temperature (see Section IV for details). The model is a quantum circuit with five qubits and all-to-all qubit connectivity for the entangling layer. We show the bootstrapped median and 90% confidence interval over the distribution of medians of the KL divergence of DDQCL as learning progresses for 50 iterations. (a) When T > T_c, the learning task is easy and shallow quantum circuits such as L = 1 (yellow triangles) and L = 2 (red circles) perform very well. (b) When T ≈ T_c, a gap in performance between circuits of different depths becomes evident. (c) When T < T_c, the learning task becomes hard and deeper circuits perform much better than shallow ones. We also report results for the inverse Bethe approximation, which does not actually prepare a state, but produces a classical model in closed form. The classical model so obtained (green band) is excellent for the easy task in (a), matches the best quantum model in (b), and underperforms for the hard task in (c).
Although we compared circuits implemented on the same ion trap hardware, the score may be used to compare different device generations or even completely different architectures (e.g. superconductor-based versus atomic-based). Similarly, one may use the score to compare classical resources of the hybrid system (e.g. different optimizers).

III. DISCUSSION
Data is an essential ingredient of any machine learning task. In this work, we presented the data-driven quantum circuit learning algorithm (DDQCL), a framework that can assist in the characterization of NISQ devices and in the implementation of simple generative models. The success of this approach is evidenced by the results on three different data sets.
To summarize, first, we learned a GHZ state preparation recipe for an ion trap quantum computer. Minimal intervention by a human expert allowed us to generalize the recipe to any number of qubits. This is not an example of compilation, but rather an illustration of how simple classical probability distributions can guide the synthesis of interesting non-trivial quantum states. Depending on the level or type of noise in the system, the same algorithm could lead to a different circuit fulfilling the same probability distribution as that of the data. The message here is that machine learning can teach us that "there is more than one way to skin a cat (state)".
Second, we trained circuits to prepare approximations of thermal states. This illustrates the power of Born machines [24] to approximate Boltzmann machines [57] when the data require thermal-like features.
Finally, tapping into the real power of near-term quantum devices and the approximate algorithms implementable on them, we designed a task-specific performance estimator based on a canonical data set. The bars and stripes data set is easy to generate, visualize, and verify classically, while modeling it still requires significant quantum resources in the form of entanglement. Errors in the device will affect this single performance measure, the qBAS score, which can be used to compare different device generations or completely different architectures. The qBAS score can also be used to benchmark the typical performance of optimizers used in hybrid quantum-classical systems. Selecting the method and optimizing the hyperparameters can be a daunting task and is a key challenge towards a successful implementation as the number of qubits increases. Therefore, having this single metric for benchmarking could help reduce the complexity of this fine-tuning stage. The score can be computed on any of the NISQ architectures available to date.
Subsequent to this work, several teams have developed new methods for quantum circuit learning. In Ref. [59] the authors show how to train Born machines using differentiable cost functions, exploiting the gradient calculations proposed in Ref. [60]. In Ref. [61] the authors show that the expressive power of shallow circuits outweighs that of some classes of artificial neural networks.
It is left to future work to demonstrate more realistic machine learning by allowing more flexible models and employing regularization. At a finite and fixed low circuit depth, the power of the generative model can be enhanced by including ancilla qubits, in analogy to the role of hidden units in probabilistic graphical models. Regularization can be implemented at the level of the cost function. In this paper, we used the negative log-likelihood, which quickly becomes expensive to estimate as the size of the system increases. We reported preliminary results on alternative cost functions that overcome this caveat and still produce satisfactory results. Finally, layer-wise pre-training of the quantum circuit, inspired by deep learning [38], could initialize parameters to near-optimal locations in the cost landscape. We think these are among the most promising directions to be explored in future work.
Ongoing work includes the experimental realization of the full DDQCL training on the quantum computer and the subsequent evaluation of the qBAS score from the resulting quantum circuit. To the best of our knowledge, our approach represents the first proposal with the bidirectional capability of using NISQ devices for machine learning, and machine learning for the characterization of NISQ devices. We hope the ideas presented here contribute to developing further concrete metrics that help guide architectural hardware design, while tapping into the computational power of NISQ devices.

Simulation of quantum circuits in silico
We simulated quantum circuits using the QuTiP2 [62] Python library and implemented the constraints dictated by the ion trap experimental setting. In the current experimental setup, we can perform arbitrary single qubit rotations and Mølmer-Sørensen XX entangling gates involving any two qubits. We used only these gates, thus avoiding the need for further compilation. For the simulations in silico, we also assume perfect gate fidelities and error-free measurements.
In the ion trap setting, the implementation of single qubit rotations R_z is very convenient. Therefore, we perform arbitrary single qubit operations relying on the decomposition U_l^(i) = R_z(θ_l^(i,1)) R_x(θ_l^(i,2)) R_z(θ_l^(i,3)), where l is the layer number, i is the qubit index, and the θ's are the tunable rotation angles. Because we always execute circuits starting from the |0⋯0⟩ state, the first set of R_z rotations has no effect and, therefore, is not needed. When an odd number of layers is used, a similar exception occurs in the last layer. There, the last set of R_z rotations would only add a phase that becomes irrelevant when taking the amplitude squared required for the Born machine. In other words, we can slightly reduce the number of parameters without changing the expressive power of the circuit. Every other layer of arbitrary single qubit operations would in general require 3N parameters, where N is the number of qubits. By using an alternative decomposition, namely U = R_x R_z R_x, we could apply commutation rules with XX gates and obtain a reduction to 2N parameters in all odd layers. We decided not to do this for two reasons. First, there is no effective reduction of the number of parameters for the experiments up to L = 5 layers considered here. Second, in the ion trap quantum computer used here [37] it is experimentally convenient to use a larger number of R_z rather than R_x rotations.
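The native gate set above can be written down explicitly. The sketch below builds the R_z, R_x, and Mølmer-Sørensen XX matrices in NumPy under one common convention (exp(−iθZ/2) for single-qubit rotations, exp(−iθ X⊗X) for the entangler); sign and factor-of-two conventions vary between references, so treat this as illustrative.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def Rz(theta):
    # single-qubit z rotation, exp(-i theta Z / 2)
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * Z

def Rx(theta):
    # single-qubit x rotation, exp(-i theta X / 2)
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * X

def XX(theta):
    # Molmer-Sorensen entangler on two qubits, exp(-i theta X (x) X)
    return np.cos(theta) * np.eye(4, dtype=complex) - 1j * np.sin(theta) * np.kron(X, X)

# one arbitrary single-qubit operation in the Rz-Rx-Rz decomposition
U = Rz(0.3) @ Rx(1.1) @ Rz(-0.7)
```

Applying XX(π/4) to |00⟩ yields an equal-magnitude superposition of |00⟩ and |11⟩, the maximally entangling setting referred to in the BAS discussion.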
For the case of the entangling gates, we use the notation XX(θ_ij) = exp(−i θ_ij σ_x^(i) σ_x^(j)). Recalling that states that differ by a global phase are indistinguishable, a direct computation shows that the tunable parameters can be taken as θ_ij ∈ (−π/2, π/2]. Also, there is no need to set up an order for these gates within an entangling layer, as they commute with one another.
The total number of parameters per entangling layer depends on the chosen topology: all is a fully-connected graph and has N(N − 1)/2 parameters, chain is a one-dimensional nearest-neighbor graph with N − 1 parameters, and star is a star-shaped graph with N − 1 parameters. The top right panel of Figure 2 shows a graphical representation of these topologies for the case of N = 4 qubits.
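A minimal helper capturing the parameter counts just stated (the function name is ours):

```python
def entangling_params(n_qubits, topology):
    """Number of XX-gate parameters per entangling layer
    for the three topologies described above."""
    if topology == 'all':
        return n_qubits * (n_qubits - 1) // 2   # one parameter per qubit pair
    if topology in ('chain', 'star'):
        return n_qubits - 1                     # one parameter per graph edge
    raise ValueError(f"unknown topology: {topology}")
```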
When executing DDQCL, we always estimated the required quantities from 1000 measurements in the computational basis.

Gradient-free optimization
Once the number of layers and the topology of entangling gates are fixed, the quantum circuits described above provide a template; by adjusting the parameters we can implement a small subset of the unitaries that are in principle allowed in the Hilbert space. The variational approach aims at finding the best parameters by minimizing a cost function. For all our tests, we chose to minimize a clipped version of the negative log-likelihood.
We use a global-best particle swarm optimization algorithm [41] implemented in the PySwarms [63] Python library. A 'particle' corresponds to a candidate solution circuit; the position of a particle is a point θ in parameter space, and the velocity is a vector determining how to update the position in parameter space. The positions and velocities of all particles are initialized at random and updated at each iteration following the scheme shown in Figure 1. There are three hyper-parameters controlling the swarm dynamics: a cognition coefficient c_1, a social coefficient c_2, and an inertia coefficient w. After testing a grid of values, we chose a constant value of 0.5 for all three hyper-parameters, which we found to work well for our purpose. To avoid large jumps in parameter space, we further restrict position updates in each dimension to a maximum magnitude of π.
Finally, we set the number of particles to twice the number of parameters of the circuit. This is a conservative value compared to previous work [8], chosen because of the large number of parameters in the circuits explored here.
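For illustration, here is a self-contained global-best PSO with the stated hyper-parameters (w = c_1 = c_2 = 0.5, per-dimension step magnitude capped at π, swarm size twice the dimension). The paper itself uses PySwarms, so this is a sketch of the update rule rather than the exact implementation.

```python
import numpy as np

def pso_minimize(cost, dim, n_particles=None, iters=100,
                 c1=0.5, c2=0.5, w=0.5, vmax=np.pi, seed=0):
    """Minimal global-best particle swarm optimizer (sketch)."""
    rng = np.random.default_rng(seed)
    n = n_particles or 2 * dim                 # twice the number of parameters
    pos = rng.uniform(-np.pi, np.pi, (n, dim))
    vel = rng.uniform(-vmax, vmax, (n, dim))
    pbest = pos.copy()                         # personal bests
    pbest_cost = np.array([cost(p) for p in pos])
    gbest = pbest[pbest_cost.argmin()].copy()  # global best
    for _ in range(iters):
        r1 = rng.random((n, dim))
        r2 = rng.random((n, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        vel = np.clip(vel, -vmax, vmax)        # avoid large jumps (max step pi)
        pos = pos + vel
        costs = np.array([cost(p) for p in pos])
        improved = costs < pbest_cost
        pbest[improved] = pos[improved]
        pbest_cost[improved] = costs[improved]
        gbest = pbest[pbest_cost.argmin()].copy()
    return gbest, pbest_cost.min()

# toy usage on a quadratic bowl instead of a circuit cost
best, val = pso_minimize(lambda p: float(np.sum((p - 1.0) ** 2)), dim=3, iters=200)
```

In DDQCL the cost callback would run the parameterized circuit and estimate the clipped negative log-likelihood from measurement samples; here a quadratic stands in for it.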

Data sets details
We worked with three synthetic data sets: zero-temperature ferromagnet, random thermal, and bars and stripes (BAS). In all our numerical experiments we use 1000 data points sampled exactly from these distributions.

GHZ state preparation
The zero-temperature ferromagnet distribution assigns probability 1/2 to each of the |0⋯0⟩ and |1⋯1⟩ states of the computational basis. This distribution could easily be prepared as a mixed state, but our study uses pure states prepared by the circuit. The only way to reproduce the zero-temperature ferromagnet distribution in our setting is to implement a unitary transformation that prepares a GHZ-like state.

Thermal states
A thermal data set in N dimensions is generated by exact sampling of realizations x ∈ {−1, +1}^N from the distribution P(x) = Z^{−1} exp((Σ_{ij} J_ij x_i x_j + Σ_i h_i x_i)/T), where Z is the normalization constant, J_ij and h_i are random coefficients sampled from a normal distribution with zero mean and √N standard deviation, and T is the temperature. In the large system-size limit, a phase transition is expected at T_c ≈ 1. Although this is not true for the small-sized systems considered here, we take this value as a reference temperature. In our study, we vary T ∈ {2T_c, T_c, T_c/1.5} in order to generate increasingly complex instances.
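For small N, exact sampling from this Boltzmann distribution is feasible by brute-force enumeration of all 2^N configurations, as sketched below (the function name and seed handling are ours; the √N coupling scale is taken from the text above).

```python
import itertools
import numpy as np

def thermal_dataset(N, T, n_samples=1000, seed=0):
    """Exact sampling from P(x) ∝ exp((Σ J_ij x_i x_j + Σ h_i x_i)/T)
    by enumerating all 2^N spin configurations (small N only)."""
    rng = np.random.default_rng(seed)
    J = np.triu(rng.normal(0.0, np.sqrt(N), (N, N)), 1)  # couplings, i < j
    h = rng.normal(0.0, np.sqrt(N), N)                   # local fields
    states = np.array(list(itertools.product([-1, 1], repeat=N)))
    E = np.einsum('si,ij,sj->s', states, J, states) + states @ h
    p = np.exp((E - E.max()) / T)    # shift by max for numerical stability
    p /= p.sum()                     # this is the Z^{-1} normalization
    idx = rng.choice(len(states), size=n_samples, p=p)
    return states[idx], p
```

Lowering T concentrates probability mass on low-energy configurations, which is what makes the T < T_c instances harder to learn.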

Bars and Stripes
BAS [46] is a canonical machine learning data set for testing generative models. It consists of n × m pixel pictures generated by setting each row (or column) to either black (−1) or white (+1), at random. In generative modeling, a handful of patterns is input to the algorithm and the target is to train a model that captures the correlations in the data. Assuming successful training, the model can reconstruct and generate previously unseen patterns from partial or corrupted data. On the other hand, if we provide the algorithm with all the patterns we are interested in and aim for a model that generates only those, this amounts to an associative memory. Although both tasks can be carried out with our DDQCL pipeline, for the qBAS(n, m) score we focus on the latter.
We now determine the number of patterns and provide an easy identification of the bitstrings belonging to the BAS(n, m) class. For the total count of the number of patterns, we first count the number of single stripes, double stripes, etc., that can fit into the n rows. This number is the sum of binomial coefficients Σ_{k=0}^{n} (n choose k) = 2^n. The same expression holds for the number of patterns with bars that can be placed in the m columns, that is, 2^m. Note that the empty (all-white) and full (all-black) patterns are counted in both the bars and the stripes. Therefore, we obtain the total count of BAS patterns by subtracting the two extra patterns from this double count: N_BAS(n,m) = 2^n + 2^m − 2. In the main text, we use the BAS data set to design a task-specific performance indicator for hybrid quantum-classical systems.
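The counting argument can be checked directly by enumerating the patterns (the function name and the row-major pixel encoding are ours):

```python
import itertools

def bas_patterns(n, m):
    """Enumerate BAS(n, m) patterns as row-major tuples of n*m pixels in {-1, +1}."""
    pats = set()
    for rows in itertools.product([-1, 1], repeat=n):   # stripes: constant rows
        pats.add(tuple(v for v in rows for _ in range(m)))
    for cols in itertools.product([-1, 1], repeat=m):   # bars: constant columns
        pats.add(tuple(cols) * n)
    return pats
```

The all-white and all-black grids appear in both loops but are stored once in the set, reproducing the 2^n + 2^m − 2 count.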

Bootstrapping analysis
To obtain error bars for the KL divergence, we used the following procedure. DDQCL was always executed 25 times with random initialization of the parameters. From the 25 repetitions, we sampled 10,000 data sets of size 25 with replacement and computed the median KL divergence for each. From the distribution of 10,000 medians, we computed the median and obtained error bars from the 5-th and 95-th percentiles as the lower and upper limits, respectively, accounting for a 90% confidence interval.
For the qBAS score, we performed the following bootstrap analysis. The qBAS score was always computed 25 times from batches of N_reads samples. From the 25 repetitions, we sampled 10,000 data sets of size 25 with replacement and computed the mean for each. From the distribution of 10,000 means, we computed the mean and obtained error bars from two standard deviations, accounting for a 95% confidence interval.
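Both bootstrap procedures reduce to the same resampling template, sketched here with NumPy (the function name is ours; pass `stat=np.mean` with suitable percentiles for the qBAS variant):

```python
import numpy as np

def bootstrap_ci(values, stat=np.median, n_boot=10_000,
                 lo=5.0, hi=95.0, seed=0):
    """Resample with replacement and return (center, lower, upper)
    from percentiles of the bootstrap distribution of `stat`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values)
    boots = np.array([
        stat(rng.choice(values, size=values.size, replace=True))
        for _ in range(n_boot)
    ])
    return stat(boots), np.percentile(boots, lo), np.percentile(boots, hi)
```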

Comparison of cost functions
In realistic machine learning scenarios, we typically do not have access to the complete target probability distribution, nor to that obtained from the output state of a quantum circuit. Hence, we need to compare the two distributions at the level of histograms, using a finite number of samples and measurements. Here we compare three cost functions via simulations in silico.

a. Clipped negative log-likelihood: C_nll(θ) = −(1/D) Σ_{d=1}^{D} ln max(P_θ(x^(d)), ε), where the probabilities are estimated from samples and ε > 0 is a small number that avoids an infinite cost when P_θ(x^(d)) = 0. This may happen if, for example, entanglement in the circuit prevents us from measuring configuration x^(d). Moreover, when the number of variables N is large, all except a few configurations will ever be measured due to the finite number of samples. Note that by re-normalizing all the probabilities after the clipping, we could interpret this variant as Laplace additive smoothing. All the experiments in the main text are carried out using the clipped negative log-likelihood with ε = 10^{−8}.

b. Earth mover's distance [64]: C_emd(θ) = min_F Σ_{x,y} F(x, y) d(x, y), where F(x, y) is a joint probability distribution such that Σ_y F(x, y) = P_D(x) and Σ_x F(x, y) = P_θ(y), that is, its marginals correspond to the data and circuit distributions, respectively. Intuitively, this is the minimum cost of turning one histogram into the other, where the ground metric d(x, y) specifies the cost of transporting a single unit from x to y. We chose d(x, y) to be the Hamming distance between strings x ∈ {−1, +1}^N and y ∈ {−1, +1}^N. Since we normalize histograms to sum up to one, the earth mover's distance is equivalent to the 1-st Wasserstein distance [65]. In our simulations, we use the PyEMD Python library for fast computation of the earth mover's distance [66].
c. Moment matching via mean squared error: C_mm(θ) = Σ_i (⟨x_i⟩_D − ⟨x_i⟩_θ)² + Σ_{i<j} (⟨x_i x_j⟩_D − ⟨x_i x_j⟩_θ)², where the expectation values ⟨·⟩_D and ⟨·⟩_θ are taken with respect to the data and circuit distributions, respectively. This cost function can be generalized to include moments beyond the second, as well as different positive exponents for the error.
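The clipped negative log-likelihood and the moment-matching cost can both be estimated from samples, as in the sketch below (function names and data layout are ours; the earth mover's distance is omitted as it relies on the external PyEMD solver):

```python
import numpy as np

def clipped_nll(data, probs, eps=1e-8):
    """Average negative log of the (clipped) circuit probability of
    each training sample; `probs` maps configurations to estimates."""
    return -np.mean([np.log(max(probs.get(tuple(x), 0.0), eps))
                     for x in data])

def moment_matching(data, samples):
    """Squared error between first and second moments estimated from
    the data and from circuit samples (O(N^2) terms)."""
    d = np.asarray(data, dtype=float)
    s = np.asarray(samples, dtype=float)
    m1 = d.mean(axis=0) - s.mean(axis=0)                 # <x_i> mismatch
    m2 = d.T @ d / len(d) - s.T @ s / len(s)             # <x_i x_j> mismatch
    return float(np.sum(m1 ** 2) + np.sum(np.triu(m2, 1) ** 2))
```

When the model distribution matches the data exactly, C_nll reduces to the entropy of the data histogram and C_mm vanishes, consistent with the behavior discussed next.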
We compared the cost functions on the task of learning thermal states of size N = 5, with L = 3 layers and the all topology. Figure 5 shows the bootstrapped median KL divergence over 25 realizations and the 90% confidence interval as learning progresses for 100 iterations. The fact that C_nll (red diamonds) outperforms the other cost functions does not come as a surprise; minimization of the negative log-likelihood is directly related to minimization of the KL divergence. However, we expect the performance of DDQCL based on this cost function to degrade quickly as the size of the problem increases. In realistic applications, the relevant probabilities in Eq. (3), i.e., those associated with the data, are a vanishing fraction of the 2^N probabilities. Moreover, they need to be estimated from a finite number of measurements. The earth mover's distance C_emd (green pentagons) performs well, but it suffers from similar scalability issues. Fast algorithms for the computation of this distance may struggle when the number of bins in the histogram increases exponentially, as in our case. However, it is reassuring to see that alternative cost functions with no relation to the KL divergence can still produce satisfactory results. Surprisingly, the moment matching C_mm (purple crosses) closely tracks the other cost functions while retaining computational efficiency. In fact, even though a large number of samples may be needed to obtain low-variance estimates of the moments, only O(N²) terms are computed at each iteration. We expect this cost function to be a good heuristic for DDQCL on large systems.

FIG. 6. DDQCL preparation of approximate coherent thermal states for the case of six qubits, different circuit depths L ∈ {1, 2, 3}, and different temperatures T ∈ {2T_c, T_c, T_c/1.5}. In these simulations, an all-to-all topology was used for the entangling gate layer (l = 2). We report the bootstrapped median and 90% confidence interval. For the low-temperature case shown in panel (c), the inverse Bethe approximation converged only in 7 out of 25 instances. Hence, no median value was extracted. We plotted a KL divergence of 2.0 as a reference, but it is clear that for all those instances DDQCL outperformed the inverse Bethe approximation.

FIG. 2. DDQCL on the BAS data set. The top left panel shows the BAS(2, 2) patterns that our quantum circuit is to generate. The top central panel shows undesired patterns. On the top right panel, we show a possible mapping of the 4 pixels to N = 4 qubits, together with some of the qubit-to-qubit connectivity topologies that can be set up in the entangling layer and natively implemented by the ion trap quantum computer (e.g., chain, star, and all). The bottom left panel shows the results of DDQCL simulations of shallow circuits with different topologies. We show the bootstrapped median and 90% confidence interval over the distribution of medians of the KL divergence as learning progresses for 100 iterations. The mean-field-like circuit with L = 1 (green crosses) severely underperforms. A significant improvement is obtained with L = 2, where most of the parameters of the XX gates have been learned to their maximally entangling value. These observations indicate that entanglement is a key resource for learning the BAS data set. Note that for L = 2 the choice of topology becomes a key factor for improving the performance. The chain topology (purple squares) performs slightly better than the star topology (red stars) even though they have the same number of parameters. The all-to-all topology (orange circles) significantly outperforms all the others, as it has more expressive power. The bottom central panel extends the previous analysis to deeper circuits with L = 4 and approximately twice the number of parameters. All the topologies achieve a lower median KL divergence and the confidence intervals shrink. The bottom right panel shows the bootstrapped mean qBAS(2, 2) score and 95% confidence interval for simulations (green bars) and experiments on the ion trap quantum computer hosted at the University of Maryland (pink bars).

FIG. 4. DDQCL preparation of coherent thermal states. We generated 25 random instances of size N = 5 and varied the difficulty of the learning task by decreasing the temperature in T ∈ {2T_c, T_c, T_c/1.5}, where T_c is the reference temperature (see Section IV for details). The model is a quantum circuit with five qubits and an all-to-all qubit connectivity for the entangling layer. We show the bootstrapped median and 90% confidence interval over the distribution of medians of the KL divergence of DDQCL as learning progresses for 50 iterations. (a) When T > T_c, the learning task is easy and shallow quantum circuits such as L = 1 (yellow triangles) and L = 2 (red circles) perform very well. (b) When T ≈ T_c, a gap in performance between circuits of different depth becomes evident. (c) When T < T_c, the learning task becomes hard and deeper circuits perform much better than shallow ones. We also report results for the inverse Bethe approximation, which does not actually prepare a state, but produces a classical model in closed form. The classical model so obtained (green band) is excellent for the easy task in (a), matches the best quantum model in (b), and underperforms for the hard task in (c).

FIG. 5. Bootstrapped median KL divergence and 90% confidence interval of circuits learned by minimizing different cost functions. Both the moment matching (C_mm) and the earth mover's distance (C_emd) closely track the clipped negative log-likelihood (C_nll) used in the main text.

FIG. 7. Detailed comparison at the level of output states between the simulated circuits and their experimental implementations on the ion trap quantum computer. These are the best circuits in terms of KL divergence obtained by DDQCL for three different settings: (a) all-to-all with L = 2 layers, (b) star with L = 4, and (c) star with L = 2. The theoretical state obtained in (a) is close to optimal and attains an entanglement entropy, averaged over all two-qubit subsets, of S = 1.69989.

TABLE I. Example of experimental requirements for near-term quantum computers with up to 100 qubits, shown for some values of n and m. As described in the main text, N_reads is the number of readouts required for every estimation of the qBAS score.