Monte Carlo sampling from the quantum state space. I

High-quality random samples of quantum states are needed for a variety of tasks in quantum information and quantum computation. Searching the high-dimensional quantum state space for a global maximum of an objective function with many local maxima, or evaluating an integral over a region in the quantum state space, are but two of many such applications. These tasks can only be performed reliably and efficiently with Monte Carlo methods, which require good samplings of the parameter space in accordance with the relevant target distribution. We show how the standard strategies of rejection sampling, importance sampling, and Markov-chain sampling can be adapted to this context, where the samples must obey the constraints imposed by the positivity of the statistical operator. For illustration, we generate sample points in the probability space of qubits, qutrits, and qubit pairs, both for tomographically complete and incomplete measurements. We use these samples for various purposes: to establish the marginal distribution of the purity, to compute the fractional volume of separable two-qubit states, and to calculate the size of regions with bounded likelihood.


Introduction
Many situations in quantum information and quantum computation call for a random sample from the space of quantum states: numerically testing the typicality of entanglement among states of a bipartite quantum system, understanding the efficacy of a gate implementation or a noise protection scheme by examining its performance on randomly selected states, computing the average of some quantity of interest over a subset of quantum states, integrating over a region of states, optimizing a function on the state space with a complicated landscape, etc. In every case, a quantitative conclusion can be drawn only if one first specifies what the word 'random' means, i.e., according to which distribution the states are drawn, and then has an efficient way of sampling from the state space in accordance with that distribution.
In many cases, in the absence of additional information or when desiring caution against pre-biasing the results, what one means by drawing a random sample translates into sampling from a 'uniform' distribution of states that treats every state 'fairly.' This is certainly an appropriate attitude when dealing with a discrete (sub)set of quantum states {ρ_n}_{n=1}^N, so that the uniform distribution is simply one that assigns probability 1/N to each state ρ_n. For a continuous set of states, the notion of a uniform distribution, or more generally, an 'uninformative' distribution [1,2], is ill-defined: it depends strongly on the choice of parameterization of the state space and on the criterion for uniformity. One hence needs to specify the desired distribution and, depending on the choice, sampling according to that distribution may not be easy or obvious.
One often-used choice of random distribution is defined by writing the state as ρ = UΛU†, with U a Haar-random unitary matrix, and Λ a diagonal nonnegative matrix with entries chosen according to the Lebesgue measure on real space. References [3] and [4] describe how one samples from this distribution in a simple and computationally efficient way, even in high-dimensional problems, and employ such samples to estimate the proportion of entangled to separable states in bipartite state spaces. Another popular sampling approach is

Priors and constraints
A general measurement in quantum mechanics is a probability-operator measurement (POM).7 The POM has outcomes Π_1, Π_2, …, Π_K, and the Born rule

p_k = tr{ρ Π_k}, (1)

gives the probability p_k of obtaining the kth outcome when measuring the state ρ. The probabilities p = (p_1, p_2, …, p_K) for the chosen POM constitute the probability space. Given p, it is customary to report a ρ for which (1) holds. If there is a choice among several ρs for the same p (which can happen when the POM is not informationally complete), we pick one representative [13], and so have a one-to-one mapping ρ ↔ p. These ρs constitute the reconstruction space R_0. Because of the one-to-one mapping between states and probabilities, we will identify p with ρ, and regions in the probability space with corresponding regions in the reconstruction space. Note that, while the probability space is always convex, it may not be possible to choose a convex reconstruction space.8

The measurement data D consist of the observed sequence of detector clicks, with n_k clicks of the kth detector after measuring a total number of N = n_1 + n_2 + ⋯ + n_K copies of the state. The probability of obtaining D, if p is the true state, is the point likelihood

L(D|p) = p_1^{n_1} p_2^{n_2} ⋯ p_K^{n_K}, (2)

which takes on its largest value for the maximum-likelihood estimator [14]. The positivity of ρ and its normalization to unit trace ensure that p satisfies the basic constraints p_k ⩾ 0 and Σ_k p_k = 1. Since the probabilities result from the POM via the Born rule, the positivity of ρ usually implies further constraints on p. For example, consider a qubit measured by the three-outcome trine POM

Π_k = (1/3)(1 + a_k · σ), k = 1, 2, 3, (4)

where the unit vectors a_k lie 120° apart in the x-y plane.

6 Ironically, the statisticians had earlier learned the methods from physicists.
7 In the literature, POM is often written as POVM, standing for 'positive operator-valued measure'; this reference to the mathematical discipline of measure theory arose for historical reasons. We prefer the physical term POM as the more descriptive and relevant one here.
The positivity of ρ then requires, in addition to the basic constraints,

p_1^2 + p_2^2 + p_3^2 ⩽ 1/2.

Quite generically, there are such additional constraints on the probabilities, and it may not be easy or feasible to state them explicitly for high-dimensional systems measured by many-outcome POMs. A p is called physical or permissible if it satisfies all these constraints.

8 For example, if the probabilities determine all off-diagonal elements of the 3 × 3 matrix of a qutrit state, it is not possible to assign diagonal matrix elements such that the reconstruction space is convex.
We summarize all constraints in w_cstr(p), which is a product of step functions and delta functions, and vanishes if p is not permissible. For example, for the basic constraints, we have

w_basic(p) = δ(p_1 + p_2 + ⋯ + p_K − 1) ∏_k η(p_k), (5)

where η( ) denotes the unit step function,9 and the full constraint factor is w_cstr(p) = w_basic(p) w_qu(p), where w_qu(p) is the product of step functions that specify the constraints imposed by quantum mechanics. Then, the volume element (dp) of the probability space is

(dp) = dp_1 dp_2 ⋯ dp_K w_cstr(p) w_0(p),

where w_0(p) is the (unnormalized) prior density of our choice. Specifically, we will discuss two choices for w_0(p) in the examples later. The first is the primitive prior,

w_0(p) = 1 (primitive),

so that the density is uniform in p over the (physical) probability space. The second is the Jeffreys prior,

w_0(p) = ∏_k p_k^{−1/2} (Jeffreys),

a common choice when no external prior information is available [1,2]. Upon multiplying the prior density with the point likelihood for the observed data, we get the (unnormalized) posterior density

w_D(p) = L(D|p) w_0(p).

While sampling from the probability space in accordance with the prior w_0(p) is often relatively easy, sampling according to the posterior w_D(p) is usually difficult. With the prior and posterior densities at hand, we can now define the size and credibility of a region R in the reconstruction space. The computation of these region-specific values is an important application of random samples in the context of quantum state estimation [16]; an example is given in section 6.
The size of a region R and its credibility are

S_R = ∫_R (dp) / ∫_{R_0} (dp) (12)

and

C_R = ∫_R (dp) L(D|p) / ∫_{R_0} (dp) L(D|p). (13)

S_R is the prior content of region R, i.e., the probability that the true state is in R before any data are acquired; C_R is its posterior content, i.e., the probability that the true state is in R conditioned on the data.10

If we wish to sample the reconstruction space in accordance with the prior w_0(p), the fraction of sample points in region R should be equal to the size S_R of the region; and equal to the credibility C_R when sampling according to the posterior w_D(p). The situation can be reversed: we may have a procedure for generating a random sample, and then this procedure defines the underlying prior; an example is 'prior I' in section 4.

9 There are POMs for which partial sums of the p_k s have fixed values, in which case there is more than one delta function in w_basic(p); see, e.g., (23) in [13] or (30) and (36) in [12]. One identifies these situations easily, and we need not elaborate on them.
10 The conventions used here differ somewhat from those in [13]. In particular, we include here the factor w_cstr(p) in (dp), and we prefer unnormalized prior and posterior densities in the current context, whence the normalizing denominators in (12) and (13).
In the context of quantum state estimation, one needs to compute S_R and C_R for given region R and data D. Since the quantum constraints are generally highly nontrivial, leading to a probability space with complicated boundaries, and the dimension of the probability space grows rapidly (as the square of the Hilbert-space dimension for tomographically complete measurements), the integrals in S_R and C_R are difficult to compute directly. The structure of the integrals naturally suggests Monte Carlo evaluation: generate points with a density distributed according to a target w(p) (in our case, w_0(p) or w_D(p)); the size or credibility is then the ratio of the number of points contained in R to the total number of points in the full reconstruction space R_0.
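As a minimal sketch of this counting procedure for the qubit trine example (the constraint p_1^2 + p_2^2 + p_3^2 ⩽ 1/2 is an assumption carried over from the discussion above), one can estimate the size of a region under the primitive prior and compare it with the exact value obtained from the ratio of disk areas:

```python
import numpy as np

rng = np.random.default_rng(3)

# Reference sample: uniform on the basic simplex (normalized exponential
# variates), then keep only the physical points of the trine probability
# space, assumed to be the disk p1^2 + p2^2 + p3^2 <= 1/2.
e = rng.exponential(size=(400_000, 3))
P = e / e.sum(axis=1, keepdims=True)
P = P[(P**2).sum(axis=1) <= 0.5]

# Size of the region R = {p : p1^2 + p2^2 + p3^2 <= 0.4}, a disk concentric
# with the physical region.  Under the primitive prior the exact size is the
# area ratio (0.4 - 1/3)/(1/2 - 1/3) = 0.4.
S = ((P**2).sum(axis=1) <= 0.4).mean()
```

The fraction S of sample points that land in R should agree with the exact size 0.4 up to the Monte Carlo error.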
One also needs a systematic and efficient way of numerically checking whether a given p satisfies all quantum constraints, i.e., whether the given p is physical; this is required as part of the procedure for evaluating the constraint factor w_cstr(p). For a high-dimensional system measured by a many-outcome POM, the constraints cannot easily be expressed explicitly in terms of inequalities, and can thus only be checked numerically. If the POM is informationally complete, the ρ ↔ p mapping is linear and usually known quite explicitly, and one can simply check whether it gives a nonnegative ρ. For other POMs, this approach is often not available because the ρ ↔ p mapping is involved and only defined for physical ps, and then other methods must be used.
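For an informationally complete qubit POM, this linear-inversion check is straightforward. A sketch for the four-outcome tetrahedron POM (the specific outcome directions a_k below are an assumed standard choice, not taken from the text): inverting the Born rule gives the Bloch vector s = 3 Σ_k p_k a_k, and p is physical iff |s| ⩽ 1.

```python
import numpy as np

# Tetrahedron POM on a qubit: Pi_k = (1/4)(1 + a_k . sigma), with the four
# unit vectors a_k pointing to the corners of a regular tetrahedron
# (assumed choice of orientation).
A = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]) / np.sqrt(3)

def probs_from_bloch(s):
    # Born rule: p_k = (1/4)(1 + a_k . s) for a state with Bloch vector s
    return 0.25 * (1.0 + A @ s)

def is_physical(p, tol=1e-12):
    # basic constraints first
    if abs(p.sum() - 1.0) > 1e-9 or (p < -tol).any():
        return False
    # linear inversion of the (informationally complete) Born rule:
    # s = 3 sum_k p_k a_k, using sum_k a_k = 0 and sum_k a_k a_k^T = (4/3) 1
    s = 3.0 * (A.T @ p)
    # p is physical iff the Bloch vector fits inside the Bloch ball
    return np.dot(s, s) <= 1.0 + tol

p_pure = probs_from_bloch(np.array([0.0, 0.0, 1.0]))   # pure state, |s| = 1
p_bad  = probs_from_bloch(np.array([0.0, 0.0, 1.2]))   # |s| > 1, unphysical
```

Note that p_bad satisfies the basic constraints (all four entries are positive and sum to one) yet fails the quantum constraint, which is exactly the situation described above.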
In appendix A, we provide an algorithm for checking the physicality of a given p. We make use of a figure-of-merit functional Q(p, p̂), where p is the probability (assumed to satisfy the basic constraints) to be checked for physicality and p̂ is a variable in the reconstruction space, i.e., p̂ is physical. Q is chosen such that it attains its optimal value when p̂ is as close to p as possible. Given p, one optimizes Q over p̂ using gradient methods; if the optimal p̂ is equal to p, then p is physical; otherwise, it is not.
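For the qubit trine example the logic of this check can be illustrated in closed form: with Q(p, p̂) = −|p − p̂|^2, the optimization over physical p̂ is just a projection onto the physical disk (a sketch assuming the constraint p_1^2 + p_2^2 + p_3^2 ⩽ 1/2; the explicit projection here replaces the gradient search of appendix A, which is not reproduced in this chunk):

```python
import numpy as np

def project_physical(p):
    # closest physical point to p (p assumed on the simplex): the physical
    # region is assumed to be the disk {q : sum q = 1, sum q^2 <= 1/2},
    # centered at (1/3, 1/3, 1/3) with in-plane radius sqrt(1/6)
    c = np.full(3, 1.0 / 3.0)
    v = p - c                      # in-plane displacement (sums to zero)
    r = np.sqrt(1.0 / 6.0)
    n = np.linalg.norm(v)
    return p if n <= r else c + v * (r / n)

def is_physical_trine(p, tol=1e-9):
    # maximize Q(p, q) = -|p - q|^2 over physical q; if the optimum is p
    # itself, p is physical, otherwise it is not
    q = project_physical(p)
    return np.linalg.norm(q - p) < tol
```

The projection returns the optimizing p̂; comparing it with p decides physicality, exactly as in the gradient-based scheme described above.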

Independence sampling: rejection sampling and importance sampling
In independence sampling, as the name suggests, sample points are generated independently of one another. In general, it is not straightforward to sample directly from the target distribution w_t(p), as is the case when sampling in accordance with the prior or the posterior. However, as long as we can sample over the probability space (perhaps using a convenient parameterization) with a known reference distribution w_r(p), we can approach the target distribution by means of rejection sampling or importance sampling [8]. The factor r(p) that relates the target distribution to the reference distribution can be regarded as the ratio 'r(p) = w_t(p)/w_r(p),' but this should be understood with care since w_t(p) and w_r(p) usually share singularities of the delta-function kind. The easiest way of sampling the probability space is often to first sample uniformly in p from the space of probabilities that satisfy only the basic constraints of positivity and unit sum, as specified by the factor w_basic(p) of (5). We refer to this space as the basic probability simplex; the (unnormalized) sampling distribution w_r(p) is then equal to w_basic(p) from (5), i.e., it takes a constant value over the entire simplex and vanishes for ps violating the basic constraints. In appendix B, we provide two algorithms for sampling from this w_r(p). The physical probability space is a subregion of this simplex, with the additional quantum constraints imposed by the rejection or importance sampling procedures.
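Two textbook algorithms for uniform sampling on the basic simplex (sketches of standard methods; not necessarily the two algorithms of appendix B, which is not reproduced here) can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 9, 50_000   # e.g. the nine-outcome TAT measurement discussed later

def simplex_exponential(K, n):
    # method 1: normalize K independent exponential variates; the resulting
    # p is uniform on the simplex (flat Dirichlet distribution)
    e = rng.exponential(size=(n, K))
    return e / e.sum(axis=1, keepdims=True)

def simplex_spacings(K, n):
    # method 2: spacings between K-1 sorted uniform variates on [0, 1]
    # (together with the endpoints 0 and 1) are also uniform on the simplex
    u = np.sort(rng.uniform(size=(n, K - 1)), axis=1)
    u = np.hstack([np.zeros((n, 1)), u, np.ones((n, 1))])
    return np.diff(u, axis=1)

A = simplex_exponential(K, N)
B = simplex_spacings(K, N)
```

Both methods produce points with p_k ⩾ 0 and Σ_k p_k = 1, with component means 1/K.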
In rejection sampling, we draw many sample points according to the chosen reference distribution w_r(p), and then reject (i.e., discard) or accept points in such a way that the remaining sample points are distributed according to the target distribution w_t(p). More specifically, we accept a sample point p^(j) with probability

a_j = r(p^(j)) / R, (15)

where R is a constant chosen such that R ⩾ r(p) for all permissible p. One calls a_j the acceptance ratio. Rejection sampling requires one to discard points in accordance with the acceptance ratio, and one ends up with fewer sample points than the initial set drawn from w_r(p). In importance sampling, instead of discarding points, one attaches a weight to each point to compensate for the difference between the sampling and the target distributions. For sample point p^(j), the weight is

W_j = r(p^(j)). (16)

This weight can be thought of as a multiplicity for each sample point in accordance with the target w_t(p), so that each point p^(j) counts W_j times when computing the ratio of the number of points in R to the total number of points in R_0 for the value of the integral S_R or C_R. Moreover, the weights should have finite variance for good practical performance, and ideally be bounded [17]; but this can be hard to check.
For the reference distribution that is uniform on the simplex, the ratio in (15) and (16) is proportional to w_qu(p) w_0(p) for prior sampling, and acquires an additional factor of the point likelihood L(D|p) for posterior sampling. Both in rejection sampling and in importance sampling, unphysical points, i.e., those that satisfy the basic but not the quantum constraints, do not contribute to the integral, since they are either rejected with unit probability in rejection sampling, or carry zero weight in importance sampling. This means that, if R_0 is a small subregion of the basic probability simplex, only a small fraction of the sample points end up contributing to the integral. For example, for the three-outcome trine measurement of (4) on a single qubit, only π/√27 ≈ 60.5% of the points sampled from w_r(p) = w_basic(p) are physical (see figure 1). The yield decreases as the dimensionality of the system increases: for the nine-outcome trine-antitrine (TAT) measurement on a qubit pair (see section 6), only about 10% of the points are physical [13]; and for the sixteen-outcome POM used in section 4, only one in 50 000 candidate points is accepted.
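Both procedures can be sketched together for the trine example (assuming, as above, the quantum constraint p_1^2 + p_2^2 + p_3^2 ⩽ 1/2, and the Jeffreys ratio r(p) ∝ ∏_k p_k^{−1/2}):

```python
import numpy as np

rng = np.random.default_rng(42)
N = 200_000

# reference sample: uniform on the basic simplex (normalized exponentials)
e = rng.exponential(size=(N, 3))
P = e / e.sum(axis=1, keepdims=True)

# rejection step for the primitive prior: keep only physical points,
# assuming the trine constraint p1^2 + p2^2 + p3^2 <= 1/2; the yield
# should be near pi/sqrt(27), about 60.5%
physical = (P**2).sum(axis=1) <= 0.5
yield_frac = physical.mean()

# importance sampling for the Jeffreys prior: weight each physical point
# by r(p) proportional to prod_k p_k^(-1/2); points near the boundary of
# the simplex carry large weights
w = np.where(physical, np.prod(P, axis=1) ** -0.5, 0.0)
ess = w.sum()**2 / (w**2).sum()   # effective sample size, below the yield
```

The effective sample size ess quantifies the loss of efficiency caused by unequal weights: it is always smaller than the number of physical points, and markedly so when a few near-singular points dominate.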
One also runs into problems when the ratio r(p) is sharply peaked. For example, the Jeffreys prior formally becomes infinite when (at least) one of the p_k s is zero. In practice, one never gets a sample point p^(j) with a p_k that is exactly zero, so that r(p^(j)) is never infinite. Still, any sample point in the vicinity of the singular points will have a very large value of r(p^(j)). The normalization constant R for the Jeffreys prior is also formally infinite, calling into question the applicability of the rejection sampling procedure. In practice, one can take R as a large constant by approximating the target Jeffreys prior by one with a 'cutoff' value when one or more of the p_k s vanish. This, however, still makes the acceptance rate tiny for all ps away from the singular points. Correspondingly, in importance sampling, large weights are attached to the points in the vicinity of these singular points, and the main contribution to the integral then comes from just those few points.
Both the problems of small physical subregion and sharply peaked priors stem from the fact that the target distribution can be very different from the reference distribution. Whenever possible, one should start with samples from a w p ( ) r that is close to w p ( ) t . Nevertheless, independence sampling according to a uniform w p ( ) r on the basic probability simplex is straightforward to set up, and can provide an easy first estimate of the desired integral, or more generally, a rough first sample.

Example: volume of separable two-qubit states
For a first application, we sample the two-qubit state space and ask how large the volume of the set of separable states (or, conversely, of entangled states) is. In [3], a natural prior on the set of states is used that is induced by the Haar measure on the group of unitary matrices and the Lebesgue measure on real space (labeled 'prior I' in figures 2 and 3). For this prior, numerical results establish that 63.2% ± 0.2% of the mixed two-qubit states are separable.
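A sketch of this prior-I construction, combined with the partial-transpose test discussed below (the QR-based recipe for a Haar-random unitary is a standard method, not taken from [3] or [4]):

```python
import numpy as np

rng = np.random.default_rng(11)

def haar_unitary(d):
    # QR decomposition of a complex Ginibre matrix; fixing the phases of
    # the diagonal of R yields a Haar-distributed unitary (standard method)
    z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    q, r = np.linalg.qr(z)
    return q * (np.diagonal(r) / np.abs(np.diagonal(r)))

def random_state_prior1(d):
    # rho = U Lambda U† with U Haar-random and the eigenvalues Lambda
    # uniform (Lebesgue measure) on the probability simplex
    lam = rng.exponential(size=d)
    lam /= lam.sum()
    U = haar_unitary(d)
    return (U * lam) @ U.conj().T

def min_pt_eigenvalue(rho):
    # smallest eigenvalue of the partial transpose (second qubit) of a
    # 4x4 two-qubit density matrix
    pt = rho.reshape(2, 2, 2, 2).transpose(0, 3, 2, 1).reshape(4, 4)
    return np.linalg.eigvalsh(pt).min()

# Peres-Horodecki: for two qubits, separable iff the partial transpose
# is nonnegative; compare the fraction with the 63.2% quoted above
frac = np.mean([min_pt_eigenvalue(random_state_prior1(4)) >= -1e-12
                for _ in range(20_000)])
```

With a sample of this size the separable fraction should land within a few tenths of a percent of the quoted 63.2%, up to Monte Carlo error.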
Here, we consider the scenario where each of the two qubits is measured separately by the four-outcome tetrahedron POM of [18]. The resulting two-qubit POM (which is informationally complete) has sixteen outcomes with the single constraint of unit sum, so the probability space is fifteen-dimensional. We employ the simple rejection sampling method to generate probabilities in accordance with the primitive prior (labeled 'prior II' in figures 2 and 3). Altogether 53 332 physical probabilities are generated, with an acceptance rate of 0.00215%.11 This small acceptance rate is due to the tiny fraction of the physical region compared with the whole fifteen-dimensional probability simplex. We then construct the corresponding density matrices as well as their partial transposes. The well-known Peres-Horodecki criterion states that if a state ρ is separable, then its partial transpose has nonnegative eigenvalues; otherwise ρ is entangled, and for a two-qubit system the converse also holds, so that a nonnegative partial transpose implies separability. According to our numerical results, the probability that a randomly generated two-qubit state is separable equals 24.2% ± 0.2%, which is much smaller than the value reported in [3]. Clearly, the two priors are quite different.

(Figure 1: for the trine POM of (4), π/√27 ≈ 60.5% of the generated points, i.e., the fractional area of the circle, are physical (green dots); the rest (red dots) are not.)
To better understand how these priors differ, and how this difference affects the computation of the volume of separable states, it is worth considering the physical connection between the purity ξ(ρ) = tr{ρ²} of the states and their separability. In figure 2(a), we show the probability of finding a separable two-qubit state as a function of the purity for both priors. Although the two curves are very similar, there appears to be a systematic difference between the two priors: for given purity, states are a bit more likely to be separable for prior I than for prior II. The strong similarity, however, tells us that the prior densities conditioned on the purity are close. But the marginal densities for the purity are rather different; see figure 2(b). For our prior II, we find the marginal density peaking at a higher purity value, indicating that this prior puts more weight on the states of higher purity and less on those of lower purity. This, together with the fact that higher-purity states are less likely to be separable, results in a smaller overall probability for our prior to produce a separable state.

11 The same random sample according to the primitive prior can also be obtained as follows [5]: generate a square random matrix A with all entries being independent complex Gaussian numbers, and compute ρ = AA†/tr{AA†}, which is automatically physical. This procedure is much faster, but only applies to informationally complete POMs with the primitive prior.
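The shortcut of footnote 11 makes the prior-II estimate very cheap; a sketch combining it with the Peres-Horodecki test (the reshape/transpose implementation of the partial transpose is an assumed standard construction):

```python
import numpy as np

rng = np.random.default_rng(7)

def random_state(d):
    # footnote 11's construction: rho = A A† / tr{A A†} with A a square
    # matrix of independent complex Gaussian entries; rho is automatically
    # a physical state, distributed according to the primitive prior
    A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
    rho = A @ A.conj().T
    return rho / np.trace(rho).real

def partial_transpose(rho):
    # transpose the second qubit of a 4x4 two-qubit density matrix
    r = rho.reshape(2, 2, 2, 2)          # indices (a, b; a', b')
    return r.transpose(0, 3, 2, 1).reshape(4, 4)

samples = [random_state(4) for _ in range(20_000)]
# Peres-Horodecki: for two qubits, separable iff the partial transpose >= 0;
# compare the fraction with the 24.2% quoted above
frac = np.mean([np.linalg.eigvalsh(partial_transpose(r)).min() >= -1e-12
                for r in samples])
```

The estimated separable fraction should reproduce the quoted 24.2% within the statistical error of the sample, illustrating how strongly it differs from the prior-I value.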
To see the difference between the two approaches further, we also plot the probability density of the quantum states for the qubit as well as the qutrit state space in figure 3. Analytical forms of the marginal densities are indicated in the plots by red curves; see appendix C for details.

MCMC sampling
The problems of independence sampling, namely the low yield in high-dimensional spaces and the difficulty of handling sharply peaked distributions reliably, can be resolved by using the MCMC strategy (see, for instance, [19]). In MCMC, sample points that obey the basic constraints are generated sequentially, with the position of the next point depending on the position of the current point; hence the term 'Markov chain.' One makes use of a random walk such that the next sample point is likely to be in the vicinity of the current point; this gives a high chance of staying within the permissible region if the current point is physical, or within the same peak if the current point is within a sharply peaked region of the target distribution. The Markov chain's stationary distribution is to be the target distribution. To achieve this, the Metropolis-Hastings Monte Carlo (MHMC) procedure [9,10] is adopted when performing the random walk; see appendix D for a review. For our purposes, it is expedient to reparameterize the probability space: let x = (x_1, x_2, …, x_K) with x_k = √p_k, so that the space of x is the surface of the unit (K − 1)-hypersphere centered about the origin. From point x, we then propose a new point by drawing a K-dimensional multivariate Gaussian random variable with mean x and variance σ², and normalizing it back to unit length. The symmetry of such a proposal distribution is guaranteed by the spherical symmetry of the Gaussian distribution.
The reparameterization in terms of x requires a corresponding transformation of the target distribution: the Jacobian of p_k = x_k² multiplies the target by a factor ∏_k |x_k| (up to constants), so that, for instance, the Jeffreys prior becomes flat in the x space. Also, the starting point of the random walk should be a physical probability, which can be ensured by picking a state ρ (the maximally mixed state, for instance), computing the corresponding probabilities p^(1), and setting the initial x_k^(1) = √(p_k^(1)). Putting it all together, the x-parameterized MHMC scheme is as follows: xMHMC1 Pick an arbitrary state. Obtain p^(1) from the Born rule, then set x_k^(1) = √(p_k^(1)) and j = 1.
xMHMC2 Choose a step size σ for the proposal distribution (see below). xMHMC3 Draw a K-dimensional Gaussian random variable with mean x^(j) and variance σ², and normalize it to unit length to obtain the proposal x′. xMHMC4 Compute the acceptance probability a = min{1, w_t(x′)/w_t(x^(j))}, where w_t(x) is the x-parameterized target distribution, which vanishes if p = x² is not physical. xMHMC5 Draw a random number b uniformly from the range 0 < b < 1. If b < a, set x^(j+1) = x′; otherwise, set x^(j+1) = x^(j). Increment j. For target number of sample points M, escape the loop if j = M; otherwise, return to xMHMC3.
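A self-contained sketch of such an x-parameterized random walk for the trine example (the constraint p_1^2 + p_2^2 + p_3^2 ⩽ 1/2, the primitive prior, and the Jacobian factor ∏_k |x_k| are assumptions carried over from the discussion above):

```python
import numpy as np

rng = np.random.default_rng(5)
K, sigma, M = 3, 0.1, 20_000   # trine POM; step size; chain length

def physical(p):
    # assumed trine constraint: p is permissible iff sum p_k^2 <= 1/2
    return np.dot(p, p) <= 0.5

# start from the maximally mixed state, p = (1/3, 1/3, 1/3)
x = np.sqrt(np.full(K, 1.0 / K))
chain, accepted = [], 0
for _ in range(M):
    # Gaussian proposal around x, renormalized back to the unit sphere
    y = x + sigma * rng.normal(size=K)
    y /= np.linalg.norm(y)
    # Metropolis ratio for the primitive prior; the x-space target carries
    # the Jacobian factor prod |x_k| from p_k = x_k^2
    num = np.prod(np.abs(y)) if physical(y**2) else 0.0
    den = np.prod(np.abs(x))
    if rng.uniform() < min(1.0, num / den):
        x, accepted = y, accepted + 1
    chain.append(x**2)            # record the current point (even if repeated)

chain = np.array(chain)
rate = accepted / M
```

Every recorded point is automatically physical, in contrast to independence sampling, where roughly 40% of the candidates would be discarded; the price is the correlation between successive points.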
Some attention should be paid to the choice of variance σ² in xMHMC2. The corresponding standard deviation σ can be viewed as the typical 'step size' of the random walk. Generally, if the step size is too large, the acceptance rate tends to be low, since a single step may take one too far from the permissible or important region; if the step size is too small, the random walk takes a long time to explore the entire space. Therefore, σ has to be chosen carefully. In the statistics literature, using general arguments that invoke the central limit theorem in the many-parameter situation, one is told that the optimal σ should be chosen such that the acceptance rate is around 23% (see, for instance, [20] and [21]). This gives a good rule of thumb for choosing the step size σ, and it turns out to be quite reasonable even for low-dimensional problems; see figure 4.
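In practice one can tune σ by monitoring the acceptance rate of short pilot chains; a sketch for the same assumed trine target as above:

```python
import numpy as np

rng = np.random.default_rng(9)

def acceptance_rate(sigma, M=5_000, K=3):
    # short pilot chain for the trine target (assumed constraint
    # p1^2 + p2^2 + p3^2 <= 1/2, primitive prior, Jacobian prod |x_k|)
    x = np.sqrt(np.full(K, 1.0 / K))
    acc = 0
    for _ in range(M):
        y = x + sigma * rng.normal(size=K)
        y /= np.linalg.norm(y)
        p = y**2
        num = np.prod(np.abs(y)) if np.dot(p, p) <= 0.5 else 0.0
        if rng.uniform() < min(1.0, num / np.prod(np.abs(x))):
            x, acc = y, acc + 1
    return acc / M

# larger steps leave the physical region more often, lowering the
# acceptance rate; one adjusts sigma toward the ~23% rule of thumb
rates = {s: acceptance_rate(s) for s in (0.02, 0.1, 0.5)}
```

Scanning a few values of σ in this way, and picking the one whose acceptance rate is near the rule-of-thumb target, is usually sufficient.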

Example: incomplete two-qubit tomography
For a comparison of the various sampling methods, we consider the TAT scheme of [22] (see also section 6 in [13]) for quantum key distribution. This TAT scheme can be implemented as follows: begin with a source of entangled qubit pairs and distribute one qubit each to the two communicating parties; one qubit is measured by the trine POM of (4), the other by the antitrine POM, which is (4) with the signs of x and y reversed. The two-qubit POM has nine outcomes subject to the single constraint of unit sum, resulting in an eight-dimensional probability space. For the simulated experiments, we measure 60 copies of qubit pairs; the data in one experiment are D = {11, 4, 5, 2, 10, 5, 4, 6, 13}, which are used in the specification of the optimal error regions below.
We generated various sets of samples in accordance with the primitive prior as well as with the Jeffreys prior. The platform used for generating these samples is a standard desktop (Intel i7-3770 CPU, with quad core and 8 GB RAM). The CPU time taken to generate a sample of 100 000 (physical) probabilities with the different sampling strategies is summarized in table 1. We also show the time taken by MCMC with a physicality check that exploits the structure of the TAT (labeled 'MCMC with parameter searching' in table 1). In the TAT version of the matrix in (47) of [12], eight of the nine real parameters of a reconstruction-space state are determined by the probabilities, with the last one lying in the range [−1, 1]. If such a ninth parameter can be found that makes the corresponding state positive semidefinite, then the generated p is physical; otherwise it is not. The CPU time of this procedure is almost the same as that of MCMC with CG in the physicality check. Moreover, there is barely any difference in time between generating the sample according to the primitive prior or the Jeffreys prior, as the time-consuming part is the checking of physicality of the generated probabilities.
In figure 5, we show the size s_λ as a function of λ for the bounded-likelihood regions for data D, using samples of various sizes. Here, λ is the likelihood threshold for the region, with 0 ⩽ λ ⩽ 1: the region specified by a given λ is the set of points p whose point likelihoods satisfy L(D|p) ⩾ λ L(D|p̂_ML), where p̂_ML is the maximum-likelihood estimator. In figure 5(a), which uses the samples generated by independence sampling, there is not much difference between the curves obtained with samples of sizes 10 000 and 100 000, both for the primitive prior and for the Jeffreys prior. However, in figure 5(b), which uses the samples generated by MCMC, the curve obtained with 10 000 sample points deviates quite far from the curves obtained with larger sample sizes for the primitive prior. A possible reason is that the chain may have been 'trapped' at the mode of the prior in the transformed x space for a significant portion of the time, and a sample of 10 000 points is not large enough for the random walk to reach the whole space. This does not happen for the Jeffreys prior, which is flat in the x space. In terms of the 'quality' of the sample points, we notice that a sample of size 10 000 generated in accordance with the primitive prior by independence sampling is good enough to produce figure 5, while a sample of size 100 000 is required if we use MCMC. In this sense, the comparison of the times taken to generate the same number of physical probabilities in table 1 may not always be fair.

(Figure 4: the average correlation (over t) between points x^(t) and x^(t+j), j = 0, 1, …, 60, for Metropolis-Hastings random walks with various step sizes σ, for the nine-outcome trine-antitrine measurement on a qubit pair. The autocorrelation decays most quickly for step size 0.08, suggesting this as the optimal choice of σ. The acceptance rate for σ = 0.08 turns out to be 25%, close to the rule of thumb that the target acceptance rate should be about 23%.)
At least for our particular purpose of calculating the sizes, independence sampling and MCMC are roughly equivalent: the lower acceptance rate of independence sampling is compensated by a faster convergence rate, while MCMC converges more slowly but benefits from a higher acceptance rate.

Conclusion
We have shown how one can perform rejection sampling, importance sampling, and MCMC sampling in the probability space (and thus also in the reconstruction space) with due attention paid to all the constraints obeyed by physical probabilities. Rejection sampling and importance sampling are rather simple to implement, but they have a low yield and are costly (in CPU time) unless one manages to check the physicality of candidate probabilities in an efficient way. MCMC sampling tends to be less costly because the yield is higher (fewer candidate probabilities are rejected), but this comes at the price of correlations in the sample, which in turn requires larger samples to achieve the accuracy that rejection sampling and importance sampling reach with smaller samples. For comparison, we have generated samples of various sizes in the probability space for two-qubit states measured by an incomplete POM, and used them to calculate the sizes of different regions.
Once the samples are at hand, one can now efficiently compute the optimal error regions for quantum state estimation introduced in [13], where integrals over high-dimensional regions in the quantum state space must be evaluated. While this application motivated these investigations, the random samples themselves can be used for the many purposes mentioned in the introduction. The algorithms explored here-independence sampling and MCMC sampling-can be applied to problems of any dimension, any choice of measurement, and any target distribution, even if one does not have an explicit parameterization of the state space, as is often the case. This flexibility is of great practical value.
If an explicit parameterization of the state space is available, one can alternatively sample by the so-called HMC algorithm [11]. HMC is very different from both independence sampling and MCMC in many aspects; we deal with HMC in our companion paper [12].

Appendix A. Checking the physicality of p

By comparing the optimal value of Q(p, p̂) with its value at p̂ = p, we conclude whether p is physical or not. This search for the maximum of Q can be carried out by simply following the direct gradient (DG) G of (A.2) and (A.3). Or, we can use the more efficient conjugate-gradient (CG) method [24], in which the search direction is updated as H_{j+1} = G_{j+1} + γ_{j+1} H_j, where G_{j+1} is the direct gradient and H_1 = G_1. The parameter γ_{j+1} is determined from G by the Polak-Ribière criterion; see (A.10). Here is an outline of the iterative algorithm that employs the CG method for the physicality check (PC) of p, for a d-dimensional quantum system: PCCG1 Start with j = 1, two fixed constants ϵ and ξ, and the d × d identity matrix.