Monte Carlo sampling from the quantum state space. II

High-quality random samples of quantum states are needed for a variety of tasks in quantum information and quantum computation. Searching the high-dimensional quantum state space for a global maximum of an objective function with many local maxima or evaluating an integral over a region in the quantum state space are but two of the many applications. These tasks can only be performed reliably and efficiently with Monte Carlo methods, which involve good samplings of the parameter space in accordance with the relevant target distribution. We show how the Markov-chain Monte Carlo method known as Hamiltonian Monte Carlo, or hybrid Monte Carlo, can be adapted to this context. It is applicable when an efficient parameterization of the state space is available. The resulting random walk is entirely inside the physical parameter space, and the Hamiltonian dynamics enable us to take big steps, thereby avoiding strong correlations between successive sample points while enjoying a high acceptance rate. We use examples of one-qubit and two-qubit measurements for illustration.


Introduction
Our companion paper [1] states the motivation for this work and introduces the terminology and notational conventions we are using; when referring to an equation or figure in that paper, the respective number is preceded by "I-". While the sampling methods presented there (rejection sampling, importance sampling, and Markov-chain sampling) are easy to implement, they require a costly (in CPU time) check of whether the candidate probabilities obey all constraints imposed by the positivity of the statistical operator. By contrast, in the Hamiltonian Monte Carlo (HMC) method [2,3] that we discuss here, the constraints are always obeyed by construction.
Like the Markov-chain Monte Carlo (MCMC) method discussed in I, HMC involves a "walk" around the parameter space. While one often needs a walk comprising many small steps for MCMC to attain a good sampling yield, HMC can be done with large steps. As a consequence, the problem of a slow exploration of the probability space, in conjunction with strong correlations between successive sample points, which is a typical feature of other random-walk strategies, is not present in HMC.
Further, HMC enables us to sample in accordance with any (reasonable) prior density and any posterior density. Some other sampling methods are very efficient for particular priors (see, e.g., [4,5]), but lack this much-needed flexibility. The downside of HMC, however, is the requirement for a suitable parameterization for its implementation.

The idea
The HMC method makes use of pseudo-Hamiltonian dynamics in a mock phase space and can be applied to most problems with continuous state spaces by introducing fictitious momentum variables. One way of conveying the basic idea of HMC is the following.
We begin with a trial density f(θ), supplement the position variables θ = (θ_1, θ_2, ..., θ_S) with momentum variables ϑ = (ϑ_1, ϑ_2, ..., ϑ_S), and dress up the position density f(θ) with a Gaussian momentum density to compose the initial phase-space density

    F_0(θ, ϑ) = f(θ) e^{-(1/2) Σ_s ϑ_s²}     (1)

at time t = 0. We obtain the final phase-space density at time t = T by propagating F_0 to F_T with the aid of Liouville's equation

    ∂F_t/∂t = -Σ_s [ϑ_s ∂/∂θ_s + u_s(θ) ∂/∂ϑ_s] F_t ,     (2)

where the differential operator involves the velocity components

    ∂H/∂ϑ_s = ϑ_s     (3)

and the force components u_s(θ) = -∂H/∂θ_s = w(θ)^{-1} ∂w(θ)/∂θ_s associated with the Hamiltonian

    H(θ, ϑ) = (1/2) Σ_s ϑ_s² - log w(θ) .     (4)

At the final time T, we thus have

    F_T(θ, ϑ) = F_0(θ(-T), ϑ(-T)) ,     (5)

where (θ(-T), ϑ(-T)) is the initial phase-space point that the Hamiltonian motion carries to (θ, ϑ) during the time T. The updated position density f̃(θ) now results from integrating over the momenta, so that

    f(θ) → f̃(θ) = ∫ (dϑ) F_T(θ, ϑ)     (6)

is the net map for f(θ). Exceptional situations aside (usually they arise from an unfortunate choice for the duration T), f(θ) is a fixed point of this map only if F_T = F_0 is a function of the Hamiltonian, which is the case if

    f(θ) ≐ w(θ) ,     (7)

where the dotted equal sign stands for "equal up to a constant numerical factor." Taking for granted without proof that the map (6) is a contraction, repeated applications of the map will thus turn f(θ) into w(θ), the desired prior or posterior density.
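Why a phase-space density that depends on θ and ϑ only through the Hamiltonian is stationary can be checked in one line. With F = G(H) and the velocity and force components from above, Liouville's equation gives

```latex
\frac{\partial}{\partial t}\,G\bigl(H(\theta,\vartheta)\bigr)
  = -\sum_{s}\left[\vartheta_{s}\,\frac{\partial}{\partial\theta_{s}}
      + u_{s}(\theta)\,\frac{\partial}{\partial\vartheta_{s}}\right] G(H)
  = -G'(H)\sum_{s}\left[\vartheta_{s}\,\frac{\partial H}{\partial\theta_{s}}
      + u_{s}(\theta)\,\vartheta_{s}\right] = 0\,,
```

since ∂H/∂θ_s = -u_s(θ), so the two terms in each summand cancel. In particular, a phase-space density ∝ e^{-H}, which corresponds to f(θ) ≐ w(θ), is stationary.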

The implementation
While this conveys the idea of HMC, its actual implementation is not in terms of a trial sample that is iteratively updated by the map (6), but in terms of a Markovian random walk. At each step of the walk, the state follows a trajectory (θ(t), ϑ(t)) governed by the Hamiltonian equations of motion

    (d/dt) θ_s = ϑ_s ,   (d/dt) ϑ_s = u_s(θ) ,     (8)

from the current position θ = θ(t = 0) to the proposal θ* = θ(t = T), where the components of the initial momentum ϑ(t = 0) are chosen at random from the Gaussian distribution ∝ e^{-(1/2) Σ_s ϑ_s²}, as required by the initial phase-space density in (1). As long as the map (6) is ergodic, the time-averaged distribution of θ will converge towards the stationary distribution with density w(θ), as demonstrated in section 2.1.
These are the steps of the algorithm:
HMC1 Choose an arbitrary starting point θ^(1) inside the parameter space and set j = 1.
HMC2 Generate ϑ^(j) from a multivariate normal distribution with unit variance.
HMC3 Compute the trajectory (θ(t), ϑ(t)) of (8) for 0 ≤ t ≤ T with the initial values θ(t = 0) = θ^(j) and ϑ(t = 0) = ϑ^(j); the proposal is θ* = θ(t = T) together with ϑ* = -ϑ(t = T).
HMC4 Accept the proposal with probability

    a = min{1, e^{H(θ^(j), ϑ^(j)) - H(θ*, ϑ*)}} .     (9)

HMC5 If the proposal is accepted, set θ^(j+1) = θ*; otherwise, set θ^(j+1) = θ^(j).
HMC6 Update j → j + 1. Escape the loop when j = M, the target number of sample points; otherwise, return to HMC2.
A few remarks are in order. First, the value of the Hamiltonian is constant along the exact trajectory (θ(t), ϑ(t)) of HMC3, which would give a = 1 in HMC4. In practice, however, we rely on an approximate trajectory; accepting Neal's advice [3], we calculate it with the leapfrog method described below, and the difference between the initial and final values of H(θ, ϑ) in (9) is then nonzero.

Second, HMC is an implementation of the Metropolis-Hastings Monte Carlo (MHMC) algorithm [6] that, regardless of the step size used, achieves a high acceptance rate with much weaker correlations between successive points. In MHMC, the proposal θ* is accepted with probability

    a = min{1, [w(θ*) J(θ|θ*)] / [w(θ) J(θ*|θ)]}     (10)

[see (I-18)], where w(θ) is the target probability density, and J(θ*|θ) is the probability of proposing point θ* given the current point θ. The comparison with (9) establishes that in HMC

    J(θ*|θ) ≐ e^{-(1/2) Σ_s ϑ_s²} ,   J(θ|θ*) ≐ e^{-(1/2) Σ_s (ϑ*_s)²} ,     (11)

where ϑ is the randomly chosen initial momentum and ϑ* is the negative of the resulting final momentum; the otherwise irrelevant minus sign in ϑ* = -ϑ(t = T) in HMC3 ensures the symmetry of the process, thereby exploiting the time-reversal invariance of the equations of motion (8). In effect, then, HMC achieves an acceptance rate close to 1 by a proposal scheme where J(θ|θ*)/J(θ*|θ) is close to w(θ)/w(θ*).

Third, rather than the "kinetic energy" (1/2) Σ_s ϑ_s² of (4), we could use a general quadratic form (1/2) Σ_{s,s′} ϑ_s µ_{s,s′} ϑ_{s′}, where the coefficients µ_{s,s′} are the entries of a symmetric positive-definite S × S matrix; HMC2 and the velocities in HMC3 are then changed accordingly. Whether or not this freedom in choosing the matrix µ offers an advantage depends on the structure of the "potential energy" -log w(θ). One would usually carry over any symmetries of the potential energy to the kinetic energy; for instance, if w(θ) is invariant under the interchange θ_1 ↔ θ_2, the kinetic energy should treat θ_1 and θ_2 on an equal footing. The kinetic energy of (4) is used for all the examples in section 4.
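For concreteness, here is a minimal sketch of the complete walk HMC1-HMC6 in Python. The one-dimensional Gaussian target w(θ) ∝ e^{-θ²/2} is an assumption made for this sketch only, chosen so that the output can be checked against the known mean and variance; the trajectory of HMC3 is computed with the leapfrog method of the next section.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy target (an assumption for this sketch): w(theta) ∝ exp(-theta²/2),
# so the force is u(theta) = d log w / d theta = -theta.
def u(theta):
    return -theta

def hamiltonian(theta, vartheta):
    # H = kinetic energy + potential energy -log w(theta)
    return 0.5 * np.sum(vartheta**2) + 0.5 * np.sum(theta**2)

def leapfrog(theta, vartheta, tau, L):
    # Approximate trajectory for HMC3; the final momentum is negated.
    theta = theta + 0.5 * tau * vartheta
    for i in range(L):
        vartheta = vartheta + tau * u(theta)
        if i < L - 1:
            theta = theta + tau * vartheta
    theta = theta + 0.5 * tau * vartheta
    return theta, -vartheta

def hmc_sample(M, tau, L, theta1=0.0):
    theta = np.atleast_1d(np.asarray(theta1, dtype=float))   # HMC1
    sample, accepted = [], 0
    for _ in range(M):
        vartheta = rng.standard_normal(theta.shape)          # HMC2
        theta_star, vartheta_star = leapfrog(theta, vartheta, tau, L)  # HMC3
        a = min(1.0, np.exp(hamiltonian(theta, vartheta)     # HMC4
                            - hamiltonian(theta_star, vartheta_star)))
        if rng.random() < a:                                 # HMC5
            theta, accepted = theta_star, accepted + 1
        sample.append(theta.copy())                          # HMC6
    return np.array(sample), accepted / M

sample, rate = hmc_sample(M=4000, tau=0.2, L=10)
```

With τ = 0.2 and L = 10, the acceptance rate stays close to 1 while successive sample points are only weakly correlated; the sample mean and variance come out close to the target values 0 and 1.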

The leapfrog method
The leapfrog method is a split-operator method much like those based on the Trotter-Suzuki formula (see, e.g., [7]). For a small time increment τ, we advance (θ(t), ϑ(t)) to (θ(t + τ), ϑ(t + τ)) in three steps: only the kinetic energy (1/2) Σ_s ϑ_s² governs the evolution for the duration (1/2)τ in the first and third steps, whereas only the potential energy -log w(θ) is relevant in the intermediate second step of duration τ. In total, this amounts to

    θ_s(t + (1/2)τ) = θ_s(t) + (1/2)τ ϑ_s(t) ,
    ϑ_s(t + τ) = ϑ_s(t) + τ u_s(θ(t + (1/2)τ)) ,     (12)
    θ_s(t + τ) = θ_s(t + (1/2)τ) + (1/2)τ ϑ_s(t + τ) ,

where the right-hand sides are correct to order τ² and have an error ∝ τ³. We break up the total duration T into L time intervals τ = T/L, so that L leapfrog jumps accomplish HMC3 with a discretization error ∝ Lτ³ = T³/L². The two adjacent kinetic-energy (1/2)τ periods of subsequent jumps are conveniently combined into a single period of a full τ. Accordingly, we find the final (θ*, ϑ*) pair from the initial (θ, ϑ) pair by this procedure:

LF1 Advance the position by half a step: θ_s → θ_s + (1/2)τ ϑ_s.
LF2 Repeat L - 1 times: update the momentum, ϑ_s → ϑ_s + τ u_s(θ), and then the position, θ_s → θ_s + τ ϑ_s.
LF3 Update the momentum once more, ϑ_s → ϑ_s + τ u_s(θ), advance the position by the final half step, θ_s → θ_s + (1/2)τ ϑ_s, and set θ* = θ, ϑ* = -ϑ.

We note an important property of the leapfrog method: the individual steps in LF1-3 are shearing transformations that preserve phase-space volumes, so that the approximate solution of (8) is volume-preserving just like the exact one (Liouville's theorem).
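The two key properties of the leapfrog trajectory, near-conservation of H and exact time reversibility of the map (θ, ϑ) → (θ*, ϑ*), can be checked with a minimal sketch; the one-dimensional harmonic potential energy used here is an assumption made for the sake of the example.

```python
import numpy as np

# One leapfrog trajectory (LF1-3) for a harmonic toy potential
# -log w(theta) = theta²/2 (an assumption), with force u(theta) = -theta.
def u(theta):
    return -theta

def H(theta, vartheta):
    return 0.5 * vartheta**2 + 0.5 * theta**2

def leapfrog(theta, vartheta, tau, L):
    theta = theta + 0.5 * tau * vartheta       # LF1: initial half drift
    for i in range(L):
        vartheta = vartheta + tau * u(theta)   # kick for a full tau
        if i < L - 1:
            theta = theta + tau * vartheta     # two merged half drifts
    theta = theta + 0.5 * tau * vartheta       # LF3: final half drift
    return theta, -vartheta                    # negated final momentum

tau, L = 0.05, 40
theta0, vartheta0 = 1.3, -0.4
theta1, vartheta1 = leapfrog(theta0, vartheta0, tau, L)   # forward
theta2, vartheta2 = leapfrog(theta1, vartheta1, tau, L)   # and back again
dH = abs(H(theta1, vartheta1) - H(theta0, vartheta0))
```

Applying the map twice returns the starting point up to roundoff: the negation of the final momentum is what makes the proposal scheme symmetric. The energy error dH, which controls the acceptance probability, scales as τ² for fixed T.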
According to Neal [3], there is a trade-off between accuracy in the propagation and CPU time consumed. A good choice of L is such that the acceptance rate (10) is about 65% for high-dimensional problems. Further, if one observes slow convergence, the likely cause is nonergodicity, which can be cured by randomly choosing τ and L from fairly small intervals [8].

State parameterization
In [1], all sampling is done in the probability space. Unless the circumstances are so simple that we can state explicitly all constraints obeyed by the probabilities and do not need to execute a physicality check of the kind discussed in Appendix A of [1], it is not feasible to implement the HMC random walk for variables θ that are (a non-redundant subset of the) probabilities. Rather, we parameterize the statistical operator ρ, and the θ dependence of the probabilities then follows from the Born rule.
For a d-level quantum system, the statistical operator ρ is represented by a hermitian unit-trace d × d matrix that has S = d² - 1 real parameters. The matrix of the arbitrary operator A in ρ = A†A / tr(A†A), however, has 2d² real parameters, of which d² + 1 are superfluous. We get rid of them by restricting the A matrix to upper-triangular (or lower-triangular, as in [9]) form with real diagonal elements; this reduces the count of real parameters to d². One more parameter is removed by enforcing that

    tr(A†A) = 1     (14)

holds, which requires that the moduli of the elements of A are the coordinates of a point on a ((1/2)d(d+1) - 1)-sphere. We parameterize this sphere with spherical coordinates, that is, with angle parameters θ_1, θ_2, ... that are recursively defined by (15) and (16). The cases d = 2, d = 3, and d = 4 illustrate the matter; for d = 2,

    A = [ cos θ_1   sin θ_1 cos θ_2 e^{-iθ_3} ]
        [    0      sin θ_1 sin θ_2           ]     (17)

and the d = 3 and d = 4 matrices are built in the same manner. Other ways of assigning the Cartesian coordinates to the upper-triangle entries and the phase factors to the off-diagonal entries are, of course, possible and give equally valid parameterizations; (18) states the assignment chosen. Upon expressing the probabilities p = (p_1, ..., p_K) in terms of the parameters θ = (θ_1, ..., θ_S) with S = d² - 1, the step-function constraints in w_cstr(p) of (I-7) are taken care of. Then, integrating out the delta-function constraints gets rid of redundant probability variables. Finally, we need the Jacobian of (19) between the remaining p_k's and the θ_s's to convert the prior or posterior density in p into the corresponding expression in θ. The HMC algorithm can now be executed for u_s(θ) = ∂ log w(θ)/∂θ_s.
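For d = 2, the construction can be made fully explicit. The following sketch assumes one particular assignment of the three angles to the upper-triangular A (the sign of the phase θ_3 in the off-diagonal entry is our convention, chosen so that the Pauli expectation values of section 4.1 are reproduced); ρ = A†A then has unit trace by construction and is automatically positive.

```python
import numpy as np

def rho_qubit(t1, t2, t3):
    """Qubit state rho = A†A from the three angle parameters.

    The moduli (cos t1, sin t1 cos t2, sin t1 sin t2) lie on the unit
    2-sphere, so tr(A†A) = 1 holds automatically; t3 is the phase of
    the off-diagonal entry (sign convention chosen to reproduce
    <sigma_y> = sin(2 t1) cos t2 sin t3).
    """
    A = np.array([[np.cos(t1), np.sin(t1) * np.cos(t2) * np.exp(-1j * t3)],
                  [0.0,        np.sin(t1) * np.sin(t2)]])
    return A.conj().T @ A

# Evaluate at an arbitrarily chosen point and read off the Pauli
# expectation values x, y, z from the Born rule.
t1, t2, t3 = 0.7, 0.4, 1.1
rho = rho_qubit(t1, t2, t3)
sx = np.array([[0, 1], [1, 0]])
sy = np.array([[0, -1j], [1j, 0]])
sz = np.diag([1.0, -1.0])
x = np.trace(rho @ sx).real
y = np.trace(rho @ sy).real
z = np.trace(rho @ sz).real
```

Every angle triple yields a physical state, which is precisely why the HMC walk in these variables never leaves the physical region.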
As stated, the parameterization of (14) and (18) with (15), (16), and (19) is applicable to POMs that are informationally complete, for which the whole state space is the reconstruction space. If the POM is not informationally complete, it may still be possible to use the A of (18) with fixed values for some of the θ_s's. An example is the trine measurement for a qubit, for which the 2 × 2 matrix in (17) with θ_1 = π/4 does the job; see section 4.2. If no such restricted version of (18) serves the purpose, it may be possible to introduce additional parameters during the HMC sampling, thus producing a sample in a space of higher dimension; one then marginalizes over the auxiliary parameters and so arrives at a proper sample; see section 4.3 for an example.

Examples
The parameterization of first the statistical operator ρ and then the probabilities p in terms of the angles θ, while systematic, tends to be involved and does not lend itself to simplification when explicit expressions are needed. Therefore, we illustrate the method with examples of low-dimensional cases for which the expressions remain manageable.

Qubit: Informationally complete POMs
For the d = 2 case of a qubit, with the 2 × 2 matrix of (17) referring to the basis in which the Pauli operators σ_x, σ_y, and σ_z have their standard form, we have

    x = ⟨σ_x⟩ = sin(2θ_1) cos θ_2 cos θ_3 ,
    y = ⟨σ_y⟩ = sin(2θ_1) cos θ_2 sin θ_3 ,     (21)
    z = ⟨σ_z⟩ = cos(2θ_1)

for the expectation values of the Pauli operators. As it should, the sum of their squares,

    x² + y² + z² = 1 - sin²(2θ_1) sin²θ_2 ,     (22)

takes on all values between 0 and 1. For later use, we note that

    dx dy dz = dθ_1 dθ_2 dθ_3 ∂(x, y, z)/∂(θ_1, θ_2, θ_3)     (23)

with

    ∂(x, y, z)/∂(θ_1, θ_2, θ_3) = sin(2θ_1)³ sin(2θ_2) ,     (24)

which exhibits the Jacobian factor that relates the x, y, z parameterization of ρ in (21) to the θ parameterization of (14) with (18). We consider two POMs, the four-outcome tetrahedron POM of minimal qubit tomography [10] and the six-outcome Pauli POM that measures in three mutually unbiased bases. For the tetrahedron POM, we have the probabilities

    p_k = (1/4)(1 + a_k · (x, y, z))  for k = 1, 2, 3, 4 ,     (25)

where the four unit vectors a_k point to the corners of a tetrahedron. The corresponding constraint factor w_cstr(p) of (I-6) contains w_basic(p) of (I-5) for K = 4 and, in addition, the step function that expresses the positivity constraint x² + y² + z² ≤ 1 in terms of the probabilities. By integrating out p_4 with the help of the delta function in w_basic(p), we reduce the volume element in the probability space of (I-8) to

    (dp) → dp_1 dp_2 dp_3 η(p_4)     (27)

with p_4 = 1 - p_1 - p_2 - p_3; the p_k values selected by the first step function are nonnegative, so that we can omit the η(p_k) factors.
Since the Jacobian between p_1, p_2, p_3 and x, y, z does not depend on the coordinates, we have

    dp_1 dp_2 dp_3 ≐ dx dy dz ,     (28)

and

    (dp) ≑ dx dy dz     (29)

follows, where the doubly dotted equal sign says "essentially equal in the sense of ignoring constant factors and reducing the number of variables such that the remaining ones are independent." The final step uses (23) and (24) to arrive at

    (dp) ≑ dθ_1 dθ_2 dθ_3 |sin(2θ_1)³ sin(2θ_2)|     (30)

for the tetrahedron POM.

For the Pauli POM, we have the six probabilities

    p_1,2 = (1/6)(1 ± x) ,  p_3,4 = (1/6)(1 ± y) ,  p_5,6 = (1/6)(1 ± z) ,     (31)

for which w_basic(p) of (I-5) for K = 6, the delta functions that enforce p_1 + p_2 = p_3 + p_4 = p_5 + p_6 = 1/3, and the step function for x² + y² + z² ≤ 1 are the factors in w_cstr(p). The analog of (27) is (dp) → dp_1 dp_2 dp_3 η(1 − …), and (28)-(30) hold here as well. That is, whether we use the tetrahedron POM or the Pauli POM, the volume element for physical probabilities is that of (30). Therefore, if we are sampling in accordance with the primitive prior w_0(p) = 1, there is no difference between these two POMs. We just have

    w(θ) = |sin(2θ_1)³ sin(2θ_2)| ,  u_1(θ) = 6 cot(2θ_1) ,  u_2(θ) = 2 cot(2θ_2)

in (4), (8), and (13) of the HMC algorithm.

The sample of 50,000 points thus produced is reported in figure 1. This sample is a collection of (x, y, z) values inside the Bloch sphere; it is converted into a p sample by (25) or (31), respectively. Figure 2 shows trajectories of 50 sample points, with consecutive points connected by blue lines, generated with the HMC algorithm as well as with the xMHMC algorithm from I. For the HMC sample, we see that even this short trajectory samples the unit circle rather efficiently, with the points far apart from their predecessors and successors. Such a trajectory is rather different from that of the xMHMC algorithm, where consecutive points are often close together. Indeed, HMC overcomes the problem of the strong autocorrelation that figure I-2 shows.
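The forces quoted above can be cross-checked numerically: the following sketch verifies, by a finite-difference check at an arbitrarily chosen point, that u_1 = 6 cot(2θ_1) and u_2 = 2 cot(2θ_2) are indeed the logarithmic derivatives of w(θ) = |sin(2θ_1)³ sin(2θ_2)|.

```python
import numpy as np

# Target density in the angles (up to normalization); theta_3 does not appear.
def log_w(t1, t2):
    return np.log(abs(np.sin(2*t1)**3 * np.sin(2*t2)))

def u1(t1):
    return 6.0 / np.tan(2*t1)   # quoted force u_1 = 6 cot(2 theta_1)

def u2(t2):
    return 2.0 / np.tan(2*t2)   # quoted force u_2 = 2 cot(2 theta_2)

t1, t2, eps = 0.6, 0.5, 1e-6    # an arbitrarily chosen test point
num_u1 = (log_w(t1 + eps, t2) - log_w(t1 - eps, t2)) / (2*eps)
num_u2 = (log_w(t1, t2 + eps) - log_w(t1, t2 - eps)) / (2*eps)
```

Since θ_3 does not appear in w(θ), the corresponding force vanishes and the θ_3 motion is free.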
The high acceptance rate of about 95% in the run that produced figures 1 and 2(a) is another advantage of HMC over xMHMC, which had an acceptance rate of about 60% in the run that produced figure 2(b). In higher-dimensional cases, the optimal acceptance rates are about 65% and 25% for the HMC and xMHMC algorithms, respectively. And, to repeat, since all generated points are physical by construction, there is no need for the CPU-time-consuming check of physicality that is necessary in the xMHMC algorithm.
For the primitive prior density, these give

    (dp) ≑ dθ_2 dθ_3 |sin(2θ_2)|  for (41) ,
    (dp) ≑ dθ_1 dθ_3 |sin(4θ_1)|  for (42) ,

for both the trine POM and the crosshair POM. Figure 3 shows the histogram of a sample of 50,000 points in accordance with a posterior density. This example is for the trine POM with the primitive prior and the parameterization of (41), and the data are {n_1, n_2, n_3} = {8, 5, 11}. Specifically, we have w(θ_2, θ_3) ≐ |sin(2θ_2)| p_1^{n_1} p_2^{n_2} p_3^{n_3} and u_2(θ_2, θ_3) = 2 cot(2θ_2) + Σ_k n_k ∂ log p_k/∂θ_2, where the second term stems from the likelihood factor.

With the auxiliary parameters q, the density of interest is the marginal

    w(p) = ∫ dq w(p, q) ,

and we perform HMC sampling for the nine-dimensional target density w(p, q). Upon ignoring the q values of the sample points (p^(j), q^(j)), we get the wanted sample of probabilities p^(j).

In passing, we note another use of (47). It can provide a physicality check for candidate probabilities p_jk that is quite different from, and much simpler than, the algorithm discussed in Appendix A of [1]. Given p_jk's that obey all the basic constraints, they are physical if there is a q value for which ρ ≥ 0; otherwise, they are not physical.
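The marginalization trick (sample the joint density, then simply ignore the auxiliary coordinates) can be illustrated with a toy stand-in: a two-variable Gaussian in place of the actual nine-dimensional w(p, q), and a plain Metropolis walk in place of HMC. Both substitutions are assumptions made to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(11)

# Toy joint density (an assumption): w(p, q) ∝ exp(-(p² + p q + q²)),
# whose exact p-marginal is Gaussian with variance 2/3.
def log_w(x):
    p, q = x
    return -(p*p + p*q + q*q)

x = np.zeros(2)
lw = log_w(x)
p_sample = []
for _ in range(40000):
    proposal = x + rng.normal(scale=1.0, size=2)
    lw_prop = log_w(proposal)
    if np.log(rng.random()) < lw_prop - lw:    # Metropolis accept/reject
        x, lw = proposal, lw_prop
    p_sample.append(x[0])   # keep p, ignore q: this is the marginalization
p_sample = np.array(p_sample)
```

Discarding the q coordinates of a faithful joint sample yields a faithful sample of the marginal; no reweighting is needed.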

Conclusion
We have adapted the Hamiltonian Monte Carlo sampling algorithm to the problem of sampling quantum state spaces, where the constraints that result from the positivity of the statistical operator must be obeyed throughout. To ensure this, we use a systematic parameterization of the statistical operator without any superfluous parameters. This can always be done for informationally complete measurements but may not be possible for others, in which case one must resort to other sampling algorithms, such as the ones we discuss in [1].