Data-Centric Machine Learning in Quantum Information Science

We propose a series of data-centric heuristics for improving the performance of machine learning systems when applied to problems in quantum information science. In particular, we consider how systematic engineering of training sets can significantly enhance the accuracy of pre-trained neural networks used for quantum state reconstruction without altering the underlying architecture. We find that it is not always optimal to engineer training sets to exactly match the expected distribution of a target scenario; instead, performance can be further improved by biasing the training set to be slightly more mixed than the target. This is due to the heterogeneity in the number of free variables required to describe states of different purity, and, as a result, the overall accuracy of the network improves when training sets of a fixed size focus on the states with the largest number of free variables. For further clarity, we also include a "toy model" demonstration of how spurious correlations can inadvertently enter synthetic data sets used for training, how the performance of systems trained with these correlations can degrade dramatically, and how the inclusion of even relatively few counterexamples can effectively remedy such problems.

Efforts to improve the performance of ML systems are generally classified as either model-centric or data-centric. Model-centric techniques focus on altering the underlying architecture of an ML system. Examples include increasing the number of hidden layers in a deep neural network, tailoring the structure of a model, modifying the loss function [26], or tweaking the reward function in reinforcement learning. Alternatively, data-centric methods [27-36], which leave the system's architecture unchanged, endeavor to improve system performance by using enhanced data sets, e.g., by removing spurious correlations, increasing the accuracy of labels, increasing the variety of sampled situations covered by the data, or distilling the data sets to improve efficiency [37].
Given the relative maturity and availability of ML models and systems, and how similarly many state-of-the-art models perform [38], it has been suggested that data-centric techniques represent an undervalued opportunity to boost system performance [39]. This recommendation is especially relevant to domain scientists deploying ML in their particular field of research, where model-centric methods may be outside of their expertise.
In other words, applying domain-specific knowledge to improve the quality and accuracy of a data set is likely the most efficient and direct route to performance improvements for those not specializing in ML. Data-centric ML techniques have found wide applicability in a variety of domains, including legal [32], natural language processing [40], image classification [35], and medical prognosis [36]. In the context of QIS, a data-centric approach to improving state reconstruction accuracy was implemented in which expected statistical and experimental noise was incorporated into the training sets of a convolutional neural network (CNN), resulting in overall performance improvements [41-43]. Beyond the inclusion of artificial errors, it was also recently shown that even very general pieces of prior information used in the construction of training sets, such as the expected mean purity, can improve the performance of ML-based state reconstruction systems [44].
This paper develops data-centric heuristics specifically targeting classical ML applications in QIS. For concreteness, we demonstrate the effectiveness of the heuristics using a pre-trained CNN-based quantum state reconstruction system. However, the heuristics themselves are based on the general properties of distributions of quantum states and are use-case agnostic. After reviewing the architecture of our CNN in Sec. II, we begin in Sec. III by introducing a representative example where spurious correlations between purity and entanglement in a training set cause our network to misclassify separable pure states as entangled. While this "toy model" intuitively demonstrates the impact of training set construction on application performance, it also leads to our first heuristic: engineering data sets to include even comparatively few counterexamples is sufficient to remedy errors due to spurious correlations.

FIG. 1. A schematic of a data-centric ML approach in QIS. Inputs are coarse properties of the distributions expected in our test system, e.g., the limits (maximum and minimum) of the purity and concurrence. To obtain the training set, we sample from a suitably chosen distribution of random quantum states and then apply a bandpass filter to eliminate any samples lying outside of the specified extrema. These engineered training sets are used in optimizing the trainable parameters of an ML model, as indicated by a dotted brown arrow labelled "Training." Finally, the resulting neural network is enlisted to infer unknown quantum states probed by measurements on NISQ-era hardware ("Inference" arrow). An example reconstructed three-qubit quantum state is shown at the right, where shading denotes the phase.
In Sec. IV, we present three more data-centric approaches that improve the reconstruction fidelity of our system. We begin by reviewing various distributions of random quantum states used to generate synthetic training sets for ML systems and describe a method for further engineering them to incorporate prior information using simple "bandpass filters." To illustrate the utility of our training set engineering approach, we perform state reconstruction on randomly sampled states from a cloud-accessed 7-qubit quantum processor and demonstrate fidelity improvements. To stress the data-centric nature of these fidelity improvements, we compare our results to those achievable using the same data sets but with a model-centric approach that increases the CNN's trainable parameters. Then, in Sec. V A, we build on prior results in [41-43] and demonstrate that synthetically incorporating statistical noise comparable to that found in a test set into the training set can result in significant performance improvements. Finally, in Sec. V B, we consider situations where states covering a wide range of purity values are expected in the target distribution of an ML system. We find that, in this case, it is not always optimal to train on the exact distribution expected; instead, it is more efficient to bias training sets toward the more mixed end of the distribution. We argue that this surprising result is due to the heterogeneity of the free variables in states of differing purity.

II. NEURAL NETWORK
We implement a custom-designed CNN that takes tomographic measurement values as inputs and reconstructs an estimate of the density matrix as the output, similar to the systems described in [41,43,44]. Our system begins with a convolutional layer with 25 filters, a kernel size of (2, 2), stride lengths of 1, and ReLU as the activation function. We then add a max-pooling layer with a pool size of (2, 2), stride lengths of 2, and "valid" padding, followed by a flattening layer. Next, we attach a fully connected dense layer (dense 1) using the ReLU activation function. We then apply a dropout layer with a 50% dropout rate, followed by another fully connected dense layer (dense 2), again using the ReLU activation function, followed by a dropout layer with the same rate. Finally, we attach another fully connected dense layer (dense 3) with a linear activation. Note that the number of trainable parameters depends upon the number of neurons in the dense 1, dense 2, and dense 3 layers and on the number of qubits.
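For concreteness, a minimal sketch of this architecture is given below, assuming TensorFlow/Keras, the Adam optimizer, and a 6 × 6 single-channel arrangement of the 36 two-qubit measurement values; the dense-layer widths and the learning rate are taken from the values quoted later in the text.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(6, 6, 1), n_dense1=3050, n_dense2=1650, n_outputs=16):
    # Conv + max-pool feature extractor followed by the dense 1 / dense 2 / dense 3 stack
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(25, kernel_size=(2, 2), strides=1, activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2), strides=2, padding="valid"),
        layers.Flatten(),
        layers.Dense(n_dense1, activation="relu"),    # dense 1
        layers.Dropout(0.5),
        layers.Dense(n_dense2, activation="relu"),    # dense 2
        layers.Dropout(0.5),
        layers.Dense(n_outputs, activation="linear"), # dense 3: tau vector (16 for two qubits)
    ])
    # Adam is an assumed choice; only the learning rate (0.008) and MSE loss are specified in the text
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.008), loss="mse")
    return model

model = build_cnn()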
The output of dense 3 is a vector τ_nn that defines a corresponding density matrix through the Cholesky decomposition. In general, any density matrix can be written as ρ = ζ(τ)ζ(τ)†, where ζ(τ) is a lower triangular matrix. The nonzero elements of the matrix ζ(τ) can be rearranged into a vector

τ = (ζ_{1,1}, …, ζ_{2^d,2^d}, Re ζ_{2,1}, Im ζ_{2,1}, …, Re ζ_{2^d,2^d−1}, Im ζ_{2^d,2^d−1}),    (1)

where d is the number of qubits. The first 2^d elements represent the diagonal entries, and the remaining components populate the real and imaginary parts of the off-diagonal entries. As an example, in the two-qubit case ζ(τ) is given by

ζ(τ) = [[τ_1, 0, 0, 0], [τ_5 + iτ_6, τ_2, 0, 0], [τ_7 + iτ_8, τ_9 + iτ_10, τ_3, 0], [τ_11 + iτ_12, τ_13 + iτ_14, τ_15 + iτ_16, τ_4]].    (2)

During training, the ground truth target vector τ_g is provided to the network, and the trainable parameters are optimized to minimize the mean squared error (MSE) ||τ_nn − τ_g||², where the average is taken over the full training set. Once trained, the network takes any collection of measurement values as an input and outputs a prediction τ_nn. For validation of the trained network, we utilize measurement data generated from a test set of density matrices ρ_g, which may not match the training set, and compute the density matrix ρ_nn corresponding to the network output τ_nn. The fidelity F(ρ_g, ρ_nn) = (Tr[√(√ρ_g ρ_nn √ρ_g)])² is then used to quantify accuracy.
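As an illustration, the following NumPy/SciPy sketch maps a network output τ to a physical density matrix and evaluates the fidelity; the row-by-row ordering of the off-diagonal components and the explicit trace normalization are conventions chosen for this sketch.

import numpy as np
from scipy.linalg import sqrtm

def tau_to_rho(tau, dim):
    # Build the lower-triangular zeta(tau): diagonal entries first, then real/imaginary
    # pairs filling the off-diagonal entries row by row (the ordering assumed above).
    zeta = np.zeros((dim, dim), dtype=complex)
    np.fill_diagonal(zeta, tau[:dim])
    k = dim
    for i in range(1, dim):
        for j in range(i):
            zeta[i, j] = tau[k] + 1j * tau[k + 1]
            k += 2
    rho = zeta @ zeta.conj().T          # positive semidefinite by construction
    return rho / np.trace(rho)          # normalize to unit trace

def fidelity(rho_g, rho_nn):
    s = sqrtm(rho_g)
    return np.real(np.trace(sqrtm(s @ rho_nn @ s))) ** 2

rho = tau_to_rho(np.random.randn(16), dim=4)   # two-qubit example: 4 + 12 = 16 components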
A schematic summarizing our approach for engineering distributions is shown in Fig. 1. The inputs to our system are the ranges of the desired purity and concurrence distributions, but generally, any quantifiable property of a distribution can be substituted. In our particular case, we estimate the approximate range of the distribution of states generated from the IBMQ as in [43], but many other techniques can be used without requiring full state reconstruction [45,46]. The input purity and concurrence ranges inform the selection of an initial distribution of random quantum states, which is passed through a simultaneous bandpass filter in both purity and concurrence, resulting in our final engineered training set. A detailed description of the engineering procedure is presented in Sec. IV.
We then use these engineered training sets to optimize the trainable parameters of an ML model as previously mentioned, which is shown by the dotted brown "Training" arrow in Fig. 1. Once trained, the resulting neural network is finally enlisted to infer unknown quantum states probed by measurements on noisy intermediate-scale quantum (NISQ) hardware, in our particular case the IBMQ, as indicated by the "Inference" arrow.

III. SPURIOUS CORRELATIONS AND LACK OF VARIATION
As an illustrative example to introduce our data-centric approach to quantum state tomography (QST), we first examine an ML system trained on a set of density matrices containing perfect correlations between purity and entanglement. We then study the performance of this system when used as a separability-entanglement classifier on generic states that are counterexamples to the learned correlation. In this sense, how spurious correlations impact reconstruction fidelity and separability-entanglement classification is related to the ML concept of generalizability, which considers how well a system will perform on data not included in its training set [47,48]. We will show that our network indeed learns the correlations between purity and entanglement present in the training set, which limits generalizability and results in a high error rate when classifying pure states as separable or entangled. We will then demonstrate that including only a modest number of counterexamples in the training set can significantly mitigate this issue, a paradigmatic "data-centric" improvement.
To define the restricted subspace within the overall Hilbert space that enforces a strong correlation between purity and entanglement, we generate our training states from local rotations of two-qubit maximally entangled mixed states (MEMS). The MEMS define a particular class of states which, for a given linear entropy, have the maximum possible concurrence [49,50]. In general, MEMS can be expressed (up to local rotations) as

ρ_MEMS(γ) = g(γ)(|00⟩⟨00| + |11⟩⟨11|) + (1 − 2g(γ))|01⟩⟨01| + (γ/2)(|00⟩⟨11| + |11⟩⟨00|),    (3)

where

g(γ) = γ/2 for γ ≥ 2/3 and g(γ) = 1/3 for γ < 2/3,    (4)

and the parameter γ ∈ [0, 1] is equal to the concurrence. For consistency of presentation with results later in this manuscript, we consider the purity of the MEMS instead of the linear entropy. The purity of the state in Eq. (3) is given by

P(γ) = 1/3 + γ²/2 for γ < 2/3 and P(γ) = γ² + (1 − γ)² for γ ≥ 2/3,    (5)

which ranges from 1/3 ≤ P(γ) ≤ 1. We generate an element of our training set ρ_MEMS according to

ρ_MEMS = [U_1(2) ⊗ U_2(2)] ρ_MEMS(γ) [U_1(2) ⊗ U_2(2)]†,    (6)

where γ is drawn uniformly, γ ∼ Uniform(0, 1), and U_i(2) is a two-dimensional Haar-random unitary matrix applied to qubit i. When plotting the concurrence of these states as a function of their purity, they form the curve shown in Fig. 2(c,d). Note that the MEMS span a wide range of possible values of purity and concurrence. We stress this fact about the MEMS to demonstrate how, even for two qubits, the simplest of all entangled systems, it can be challenging to detect spurious correlations when only considering general properties of states independently of each other. In this case, the relationship between purity and entanglement [49,50] is well known, but such relationships may be significantly more difficult to detect in more complex systems.
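A minimal sketch of this sampling procedure, assuming the MEMS form of Eqs. (3) and (4) and using SciPy for the Haar-random local unitaries, is given below.

import numpy as np
from scipy.stats import unitary_group

def mems(gamma):
    # MEMS of Eqs. (3) and (4): g = gamma/2 for gamma >= 2/3 and g = 1/3 otherwise
    g = gamma / 2 if gamma >= 2 / 3 else 1 / 3
    rho = np.zeros((4, 4), dtype=complex)
    rho[0, 0] = rho[3, 3] = g
    rho[1, 1] = 1 - 2 * g
    rho[0, 3] = rho[3, 0] = gamma / 2
    return rho

def sample_rotated_mems(rng=np.random.default_rng()):
    gamma = rng.uniform(0.0, 1.0)                             # concurrence of the MEMS
    u = np.kron(unitary_group.rvs(2), unitary_group.rvs(2))   # local Haar-random rotations, Eq. (6)
    return u @ mems(gamma) @ u.conj().T

training_state = sample_rotated_mems()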
Due to the correlation between purity and entanglement in the training set, our network has never been exposed to separable states of high purity. Therefore, to understand the effects of these correlations, we use our network to reconstruct randomly generated separable states with P > 1/3 and classify them as separable or entangled based on the Peres-Horodecki positive partial transpose (PPT) criterion [51,52]. We generate these states according to

ρ_s = ρ_a ⊗ ρ_b,    (7)

where ρ_a and ρ_b are random full-rank density matrices sampled from the Hilbert-Schmidt (HS) distribution. The use of the HS distribution here does not meaningfully impact the results, and details related to the distribution are described in Sec. IV A. We find that our network has significant error in this scenario and fails to correctly classify approximately 50% of the states from ρ_s, as shown by the first data point in Fig. 2(a), suggesting that the impact of spurious correlations on the performance of ML-based QST can be dramatic. We next investigate how these errors can be mitigated through the inclusion of a modest number of counterexamples in the training set. In particular, we include randomly sampled states from ρ_s in the initial training set, such that the total number of states in the training set (states from ρ_MEMS and ρ_s) is

N_train = N + N_s,    (8)

where N_train represents the total number of training states, N is the number of states drawn using Eq. (6), and N_s is the total number sampled from Eq. (7). We fix N_train = 30,000 and modify the fraction N_s/N_train. For the test set, we sample 5000 random states using Eq. (7). Note that the training and test sets are drawn randomly and independently. We vary N_s from 0 to 1750 with a step size of 250 and, for each value, train a separate neural network with dense 1 = 3050 and dense 2 = 1650 neurons for up to 400 epochs at a learning rate of 0.008, as previously described. After this, the pre-trained networks are again used to classify states from ρ_s as separable or entangled. The state classification accuracy versus the percentage of separable states added to a training set is shown in Fig. 2(a). Note that we train the same network architecture 10 times for each case and take the average of all the predictions (from 10 trials) for the given state to minimize the effects of random initialization during training. The N_s = 0 point on the curve represents the result described above, where the entire training set is drawn from states in ρ_MEMS. We see that with an increasing percentage of separable states in the training set, the network's state classification accuracy increases rapidly.
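For two qubits the PPT criterion is necessary and sufficient, so the separable/entangled labels used above can be computed directly from the partial transpose; a minimal sketch follows.

import numpy as np

def is_entangled_ppt(rho):
    # Partial transpose over the second qubit of a 4x4 two-qubit density matrix;
    # a negative eigenvalue signals entanglement (Peres-Horodecki criterion).
    r = rho.reshape(2, 2, 2, 2)        # indices (a, b, a', b') of rho[(a,b),(a',b')]
    r_pt = r.transpose(0, 3, 2, 1)     # swap b <-> b'
    return np.linalg.eigvalsh(r_pt.reshape(4, 4)).min() < 0

# quick check on the maximally entangled state (|00> + |11>)/sqrt(2)
bell = np.zeros((4, 4), dtype=complex)
bell[0, 0] = bell[0, 3] = bell[3, 0] = bell[3, 3] = 0.5
assert is_entangled_ppt(bell)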
Similarly, we also measure the reconstruction fidelities for states from ρ_s, with the results shown as the dotted red line in Fig. 2(b). As expected, the additional separable-state examples in the training set significantly enhance the network's reconstruction fidelity for separable states. Furthermore, to cross-check whether the network performance for entangled states has been affected by the modified training sets, we again independently sample 5000 random states according to Eq. (6). Note that the purity of each ρ_MEMS generated according to Eq. (6) will always be greater than 1/3 and that, except for the vanishingly unlikely case γ = 0, the sampled states are entangled. The reconstruction fidelities for the sampled ρ_MEMS test states are shown as the magenta lines in Fig. 2(a,b). The error bars show one standard deviation from the mean. The fidelity for this test set remains high and constant for all examined training sets. Hence, we find that the inclusion of a small number of counterexamples in the training set significantly improves classification performance on out-of-correlation states without reducing overall network performance.
To further illustrate the overall impact of spurious correlation and the effect of the mitigation strategy, we also reconstruct states from across the purity-concurrence plane with a network trained only on the MEMS class (N_s/N_train = 0) and a network trained with counterexamples included (N_s/N_train = 0.058), the results of which are plotted in Figs. 2(c,d). Each point represents a randomly sampled density matrix colored by the reconstruction fidelity. Figure 2(c) corresponds to the same network as indicated by the leftmost boxes in Figs. 2(a) and 2(b). As expected, in Fig. 2(c) separable states of high purity (residing toward the lower right-hand corner) are reconstructed with low fidelity, since these states possess the most extreme deviation from the correlation found in MEMS. In contrast, we find a significant enhancement for the network trained with MEMS and a few examples of separable states, as shown in Fig. 2(d). Interestingly, although the improvement is most pronounced for separable pure states, the modified training set increases reconstruction fidelity across the entire purity-concurrence plane.
To conclude this section, we note that with respect to the use of MEMS training sets, QIS researchers would be unlikely to expect generalizability, and it is perhaps unsurprising that neural networks trained on them would perform so poorly on other states. Nevertheless, their strong correlations acutely highlight the broad issue of spurious correlations-which in many situations may prove much more difficult to detect-as well as indicate a simple mitigation strategy based on tailored training data. In the following sections, we apply these general ideas to situations of more practical interest in QIS, exploring a variety of density matrix distributions that all offer full Hilbert space support, and compare their performance in ML-based QST. In these more nuanced cases, we again will find noticeable improvements with engineered training sets, for multiple experimentally relevant contexts.

A. Distributions of random quantum states
In this subsection, we briefly review the salient features of the most common methods for defining distributions of random quantum states. Beyond fundamental motivations [53-55], many efforts to perform state reconstruction and classification using ML-based methods have relied on these distributions to generate training sets [1, 41-44]. These distributions will serve as baselines against which our data-centric heuristics will be compared. In other words, we aim to improve the performance of ML-based techniques in QIS beyond what is possible from training on these standard distributions.
Hilbert-Schmidt (HS) distribution: Random quantum states distributed according to the HS measure can be induced through the partial trace of Haar-random pure states in higher dimensions [55]. Operationally, ensembles of HS-distributed random quantum states are typically generated by sampling the complex Ginibre ensemble [56], which comprises D × D complex matrices whose elements are independently drawn from the complex standard normal distribution [55]. Specifically, random quantum states distributed according to the HS measure can be obtained using

ρ_HS = G G† / Tr[G G†],    (9)

where G is a random matrix from the Ginibre ensemble.
Bures distribution: Similar to the case of the HS distribution, a random quantum state ρ from the Bures ensemble can be sampled according to

ρ_B = (𝟙 + U) G G† (𝟙 + U)† / Tr[(𝟙 + U) G G† (𝟙 + U)†],    (10)

where G is, again, a random matrix from the Ginibre ensemble, U is a Haar-distributed random unitary from U(D), and 𝟙 is the identity [57].
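Both ensembles are straightforward to sample numerically; a minimal sketch assuming NumPy and SciPy follows (the overall scale of the Ginibre entries cancels in the trace normalization).

import numpy as np
from scipy.stats import unitary_group

def ginibre(dim, rng):
    # D x D matrix with independent complex Gaussian entries
    return rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))

def sample_hs(dim, rng=np.random.default_rng()):
    g = ginibre(dim, rng)
    rho = g @ g.conj().T
    return rho / np.trace(rho)                        # Eq. (9)

def sample_bures(dim, rng=np.random.default_rng()):
    g = ginibre(dim, rng)
    a = (np.eye(dim) + unitary_group.rvs(dim)) @ g    # (1 + U) G
    rho = a @ a.conj().T
    return rho / np.trace(rho)                        # Eq. (10)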
Hilbert-Schmidt-Haar (HS-Haar) distribution: Previous studies have noted that ensembles of random quantum states distributed according to the HS and Bures measures have limited applicability for many NISQ devices due to their low average purities [44]. Therefore, we define here a simple technique for biasing an arbitrary input distribution toward a higher average purity. In particular, we consider a convex combination of HS-distributed quantum states (ρ_HS) and Haar-distributed random pure states (ρ_H) given by

ρ_HS-Haar = δ ρ_HS + (1 − δ) ρ_H,    (11)

where δ is chosen uniformly at random from the interval [0, 1]. We will find later in this section that this approach performs surprisingly well in our tests despite its simplicity.
Mai-Alquier (MA) distribution: This distribution was originally studied as a prior for Bayesian QST [58-62] and was recently utilized in [44] to generate training sets for ML-based state reconstruction methods. The MA distribution is defined as a mixture of Haar-random pure states with coefficients drawn from the Dirichlet distribution. The probability density function of the Dirichlet distribution for vectors x = (x_1, ..., x_K), whose elements belong to the open (K − 1)-simplex (x_i ≥ 0 and Σ_{i=1}^K x_i = 1), is

Dir(x|α) = [Γ(Σ_{i=1}^K α_i) / Π_{i=1}^K Γ(α_i)] Π_{i=1}^K x_i^(α_i − 1),    (12)

where α = (α_1, ..., α_K) with all α_i > 0 defines the concentration parameters and Γ(·) is the standard gamma function. The concentration parameters provide flexibility to alter the overall features of the distribution. Therefore, an ensemble of D-dimensional mixed states formed from a convex sum of K Haar-random pure quantum states |ψ_i⟩ is written as

ρ_MA = Σ_{i=1}^K x_i |ψ_i⟩⟨ψ_i|,    (13)

where the vector x is a random variable distributed according to Dir(x|α); for simplicity, we specialize to the symmetric case α = (α, ..., α) only. The expectation value of the purity is

E[P] = (α + 1)/(Kα + 1) + (K − 1)α / [D(Kα + 1)].    (14)

Finally, we note that in [44] strong evidence was presented that the MA distribution reduces to the HS distribution for a specific set of input parameters.

Życzkowski (Z) distribution: Dirichlet-distributed vectors x are again employed to generate random density matrices as described in [53], which we refer to as the Z distribution for convenience. This approach relies on the unitary invariance of the eigenvalues of a density matrix and utilizes Dirichlet vectors x of length D as the eigenvalues of D-dimensional states. Once the eigenvalues are generated, they are placed along the diagonal of a D × D matrix and a Haar-random unitary from U(D) is applied. Note that the resulting construction is of the same form as Eq. (13) for K = D, but with all states in the convex sum orthogonal. The expectation value of the purity of Z-distributed states is given by

E[P] = (β + 1)/(Dβ + 1),    (15)

where we have used β as the concentration parameter of the Dirichlet distribution so as not to be confused with the MA expressions (where α was used). Like the MA distribution, the Z distribution can be biased in various ways through manipulation of the concentration parameters. A discussion of how the MA and Z distributions compare against experimentally measured distributions can be found in [44]. Finally, we note that the Z distribution is a widely employed method for generating random density matrices, including for machine-learning applications such as the state classifier described in [1].
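A minimal sketch of sampling from both Dirichlet-based constructions is given below (a normalized vector of independent complex Gaussians is Haar-distributed on pure states, and SciPy supplies the Haar-random unitary).

import numpy as np
from scipy.stats import unitary_group

def haar_pure_state(dim, rng):
    psi = rng.standard_normal(dim) + 1j * rng.standard_normal(dim)
    return psi / np.linalg.norm(psi)

def sample_ma(dim, k, alpha, rng=np.random.default_rng()):
    # Eq. (13): mix K Haar-random pure states with symmetric Dirichlet weights
    x = rng.dirichlet(alpha * np.ones(k))
    rho = np.zeros((dim, dim), dtype=complex)
    for w in x:
        psi = haar_pure_state(dim, rng)
        rho += w * np.outer(psi, psi.conj())
    return rho

def sample_z(dim, beta, rng=np.random.default_rng()):
    # Z distribution: Dirichlet eigenvalues rotated by a Haar-random unitary
    x = rng.dirichlet(beta * np.ones(dim))
    u = unitary_group.rvs(dim)
    return u @ np.diag(x).astype(complex) @ u.conj().T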

B. Generating random test sets on NISQ hardware
In order to illustrate several of our heuristics on data sets representative of real-world experimental scenarios, we perform quantum state reconstruction and state classification using data obtained from NISQ hardware. In particular, we utilize data sets consisting of tomographic measurements performed on random quantum states obtained with ibmq_jakarta, one of the IBM Quantum Falcon processors. We first numerically generate 500 Haar-random two-qubit pure states and initialize these on ibmq_jakarta. Then, the states are automatically transpiled by the backend into the required quantum circuits for generation. The depths of the transpiled quantum circuits, i.e., the longest path from input to output, range between 12 and 16 gates. For each state, we perform full state tomography with a total of 36 measurement projections, corresponding to the four outcomes for each of the 9 two-qubit combinations of the Pauli operators {X, Y, Z}_1 ⊗ {X, Y, Z}_2.
We then reconstruct the measured quantum states using maximum likelihood estimation (MLE) according to the algorithm in [63]. Unfortunately, due to random noise on the backend hardware, the estimated states are mixed despite having been programmed as pure [43], leaving uncertainty about the ground truth state that the data represent. Therefore, to retain the general properties of the distribution generated by ibmq jakarta while permitting the construction of test sets with known ground truth states, we perform additional rounds of tomographic simulations on the MLE-obtained results; these synthetic measurement results comprise the test sets below. We simulate measurement results using the methods described in [43] which further allow us to select the amount of statistical noise (shots) on demand.
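To illustrate the on-demand shot selection, the sketch below simulates finite-shot two-qubit Pauli tomography by drawing multinomial counts for the four outcomes of each of the 9 settings; the ordering of settings and outcomes is an illustrative choice rather than the exact pipeline of [43].

import numpy as np
from itertools import product

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.diag([1.0, -1.0]).astype(complex)

def eig_projectors(pauli):
    # projectors onto the two eigenvectors of a single-qubit Pauli operator
    _, vecs = np.linalg.eigh(pauli)
    return [np.outer(vecs[:, i], vecs[:, i].conj()) for i in range(2)]

def simulate_tomography(rho, shots=1024, rng=np.random.default_rng()):
    freqs = []
    for p1, p2 in product([X, Y, Z], repeat=2):                    # 9 Pauli settings
        projs = [np.kron(a, b) for a, b in
                 product(eig_projectors(p1), eig_projectors(p2))]  # 4 outcomes each
        probs = np.array([np.trace(pr @ rho).real for pr in projs])
        probs = np.clip(probs, 0.0, None)
        probs /= probs.sum()
        if shots is None:                                          # ideal (infinite-shot) limit
            freqs.extend(probs)
        else:
            freqs.extend(rng.multinomial(shots, probs) / shots)
    return np.array(freqs)                                         # 36 values per state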

C. Engineering training sets
Motivated by the restricted nature of existing techniques, we describe a general method for engineering an arbitrary input distribution to conform to certain general characteristics desired in a training set. Several of the previously introduced distributions allow for biasing based on a single input. This is enough to control, for example, the mean purity [44]. However, a single parameter may not always be enough to meaningfully constrain a distribution for a given use case. For example, even in the two-qubit case considered here, the purity and the entanglement only bound rather than determine each other [49]. The situation becomes even more complex for higher-dimensional systems where several inequivalent classes of entanglement exist [64].
TABLE I. Purity and concurrence distributions of the explored training sets. P_mean, P, P-distributions, C, and C-distributions denote the mean purity, range of purity, purity distributions, range of concurrence, and concurrence distributions, respectively, for training sets generated from the density matrix distributions in the left column. Of particular interest are the "P-distributions" and "C-distributions" columns, which show the NISQ sample distribution in solid green with the training set distributions overlaid. The average reconstruction fidelities for test states from the NISQ distribution are shown in the right column. The error (±) indicates one standard deviation.

The method described in Fig. 1 consists of repeatedly sampling from a suitably chosen input distribution followed by the application of a simultaneous bandpass filter for both purity and concurrence. In general, the bandpass filter approach can be applied to any measurable property or properties of the sampled states. However, we have chosen purity and concurrence for this demonstration, as they are both well-understood properties of two-qubit density matrices and, in many experiments of interest, their approximate maximum and minimum values are easily inferred. In short, the bandpass filter approach can be summarized as first randomly sampling states ρ_π from an arbitrary input distribution Π and then passing them through the simultaneous filter

ρ_eng = {ρ_π ∼ Π : P_min ≤ P(ρ_π) ≤ P_max and C_min ≤ C(ρ_π) ≤ C_max},    (16)

where ρ_eng are the filtered (engineered) states, and C and P represent the concurrence and purity, respectively.
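A minimal two-qubit sketch of the filter in Eq. (16), using the Wootters concurrence, is given below; it can be applied to states drawn from any of the input distributions above.

import numpy as np

SY = np.array([[0, -1j], [1j, 0]], dtype=complex)

def purity(rho):
    return np.real(np.trace(rho @ rho))

def concurrence(rho):
    # Wootters concurrence of a two-qubit state
    yy = np.kron(SY, SY)
    r = rho @ yy @ rho.conj() @ yy
    lam = np.sort(np.sqrt(np.abs(np.linalg.eigvals(r).real)))[::-1]
    return max(0.0, lam[0] - lam[1] - lam[2] - lam[3])

def passes_filter(rho, p_range, c_range):
    # simultaneous purity/concurrence bandpass of Eq. (16)
    return (p_range[0] <= purity(rho) <= p_range[1]
            and c_range[0] <= concurrence(rho) <= c_range[1])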
In principle, we can use any input distribution Π in the above approach. Hence, our method for engineering distributions of random states only requires upper and lower bounds on the purity and concurrence, or on some other property of the states, as input. However, the MA distribution is particularly convenient as an initial distribution because it allows extra freedom in biasing the resultant distribution of states. Furthermore, we will find in Sec. V B that when a test set covers a wide range of purity values, as is often the case in an experimental situation, the optimal training set for maximizing mean reconstruction fidelity is more mixed than the test set. Therefore, we recommend that the input distribution Π be chosen according to the following recipe. First, set K = D (the minimum K value capable of producing full-rank matrices) and tune α such that the mean purity of this distribution is equal to the chosen lower bound. Second, based on results in Appendix D, we recommend that once α is fixed, the distribution be sampled with K > D. We explore the benefit of this final recommendation further in Sec. V B and Appendix D. Finally, we summarize the full algorithm for engineering distributions in Appendix E.

D. Data-centric state reconstruction with NISQ hardware
We now compare the performance improvements obtainable with both data-centric and model-centric techniques using our ML-based QST system. Our data-centric methods consist of training our ML-QST system on sets drawn from the various distributions described in Sec. IV A and on those engineered using the method of Sec. IV C. Our model-centric method consists of increasing the number of trainable parameters in the network. Ultimately, we will find that data- and model-centric approaches work in a complementary fashion and that our greatest data-centric improvement is found using the engineered distribution approach of Sec. IV C.
The test set used to compare the various trained networks in this section is based on the NISQ-sampled density matrices of Sec. IV B. The tomographic data are generated assuming 1024 shots, meaning every Pauli measurement circuit was executed 1024 times. The resultant distribution has purities and concurrences in the ranges [0.68, 0.96] and [0, 0.86], respectively, which are used to inform the engineered distribution of Sec. IV C. All 500 generated states are used as a test set. Note that the test states are completely unknown and hidden from the network during training, and the only information used to inform the construction of training sets is the maximum and minimum of the purity and concurrence.
A summary of the distributions used to generate training sets for our ML-QST system is included in Table I. Of particular interest are the "P-distributions" and "C-distributions" columns, which show the NISQ sample distributions in solid green with the training set distributions overlaid. The x-axis in each of these plots is held constant, but the scaling along the y-axis is arbitrary. The HS, Bures, and HS-Haar distributions have no input parameters, and hence the P_mean, P, and C values listed in Table I for these distributions are application-agnostic. Alternatively, the MA, Z, and engineered distributions all include degrees of freedom that allow for the incorporation of prior information. As discussed in Sec. IV C, for the engineered distribution we first select the α value that makes the mean purity of the K = D distribution equal to P_min and then use a larger K > D to actually generate samples. In this case we find that, for K = 4, the mean of the MA distribution matches P_min at α = 0.3394, which we then sample with K = 6. These selections are made heuristically with knowledge of P_min only, but arguments for their selection and considerations of their optimality are given in Sec. V B and Appendices A, D, and C. Finally, for a fair comparison of the engineered distribution against the MA and Z distributions, we also show the MA distribution with α = 0.3394 and K = 6 but without the bandpass filter, as well as the Z distribution tuned such that its mean purity is equal to P_min. Although these parameter selections are well motivated by the results in Sec. V B, we also demonstrate explicitly the impact of this selection in Appendix E, where we see that this choice is near optimal for both the MA and engineered distributions.
For each training set, we also consider the impact of an additional model-centric approach, which consists of increasing the number of trainable parameters in the model. As the number of qubits is fixed, the number of neurons in the dense 3 layer is also fixed at 16. Therefore, the numbers of neurons in the dense 1 and dense 2 layers determine the total number of trainable parameters, as shown in Table II. We use the same network architecture with all combinations of sampling distributions. Additionally, all the networks are trained for up to 400 epochs with a learning rate of 0.008. We manually optimize the learning rate, as discussed briefly in Appendix B.
The results of the data-centric and model-centric approaches are pictured simultaneously in Fig. 3. Each curve corresponds to a different data-centric method, meaning the neural network was trained with states drawn from a different distribution. For each curve, increasing along the x-axis corresponds to a model-centric performance improvement, where the training set is fixed but the number of trainable parameters in the network is increased. Each point on each curve corresponds to the average reconstruction fidelity of our ML-QST system when reconstructing the 500 NISQ-sampled states, using a network trained on 30,000 randomly sampled states from the corresponding distribution. Note that all measurements used to train the networks described in Fig. 3 are simulated in the ideal scenario (i.e., the limit of infinite shots, which, as we will see in Sec. V A, is a reasonable choice given that the test set is sampled at 1024 shots).
From Fig. 3 we see that the average performance of our system for any training set is improved, at least at first, using model-centric techniques. However, the performance improvements from this model-centric approach appear to impact each iteration of our network in approximately the same way (with a few minor crossovers occurring). Similarly, for a given network size (x-axis position) we find the average reconstruction fidelity can be improved with data-centric methods. In other words, we find that data-centric and model-centric approaches are complementary paths to performance improvement.
Ultimately, Fig. 3 indicates that the engineered training set attains the highest average reconstruction fidelity. Importantly, only the minimum and maximum values of the NISQ purity and concurrence were used in producing the training sets-not any detailed features of the distribution's shape. The HS-Haar distribution, which is simply a convex sum of HS-and Haar-distributed states, performs close to that of the engineered distribution in the limit of the maximum model-centric improvements. This is especially surprising when considering the inset of Fig. 3 which shows how dramatically the engineered and HS-Haar distributions differ in both purity and concurrence. The impact of these differences evidently shrinks as the number of trainable parameters grows, removing the initially wide separation (at small x-values) in the performance of the neural networks trained on the engineered distribution and HS-Haar.
After the engineered and HS-Haar training sets, the next highest performing sets are those that can be biased based on mean purity, the MA and Z distributions. As discussed above, our MA parameters were chosen heuristically based only on P_min. To better understand the optimality of this selection, we also include the reconstruction fidelities for various MA concentration parameters in Appendix C. It is unsurprising that all of these distributions ultimately outperform the Bures and HS training sets, since the latter have no parameters with which to incorporate prior information and skew significantly more mixed than the average state generated by ibmq_jakarta.
FIG. 3. The use of various distributions of random quantum states to train the network constitutes a data-centric approach and is shown by an arrow pointing upwards, whereas varying the number of trainable parameters is an example of a model-centric approach, as indicated by the arrow pointing to the right. The domain from 10^6 to 10^7 parameters is magnified in the left inset. The rectangular box in the inset indicates the network architecture used for the results described in Fig. 5. Additionally, the concurrence and purity of random quantum states from the Hilbert-Schmidt-Haar (HS-Haar), Życzkowski (Z), engineered, and IBM Q distributions are, respectively, shown by blue, orange, brown, and green histograms in the right insets.

To better understand exactly how the engineered distribution, which again only takes into account minimum and maximum information, compares to the distribution actually generated by the NISQ device, in Fig. 4 we plot the concurrence as a function of purity for the states sampled from ibmq_jakarta (labeled as IBM Q in green) and for those from the engineered distribution (in brown). We see that the engineered set covers the NISQ-sampled set convincingly, albeit weighted more heavily toward mixed states. While this might initially appear to be an inefficiency, we will find in Sec. V B that this bias in the training set toward lower purity than the target distribution actually helps explain the high performance of the engineered set.

FIG. 4. Engineered states on the concurrence-purity plane. The engineered and IBM Q sets are shown by brown and green dots, respectively.

V. APPLICATIONS OF DATA-CENTRIC ENGINEERING
In this final section, we present two additional datacentric techniques for improving the reconstruction fidelity of our system. The first subsection considers situations where statistical noise is present in measurement results and demonstrates that synthetic statistical noise in training sets can significantly improve average reconstruction fidelity. The second subsection describes a surprising result applicable to scenarios where the states composing a test set vary widely in purity. In this case, even given complete access to the distribution of the test set and using that exact distribution to generate the training set, the optimal average reconstruction fidelity is not found by constructing a training set from the same distribution but rather from one slightly more mixed. The two methods in this section can be used independently or in concert with each other and the other heuristics described throughout this paper.

A. Low-shot state reconstruction
Experimental data used for state reconstruction will necessarily include statistical noise since measurements can only be repeated a finite number of times. Many practical considerations may further restrict the plausibility of repeating an experiment, such as low count rates, experimental complexity, or the dimension of the underlying quantum system, which results in inefficient scaling of required measurements. The presence of statistical noise in measurement results causes estimated expectation values of measurement operators to differ from the ideal, lowering the reconstruction fidelity. In the context of ML-based QST systems, some previous work has demonstrated that incorporating statistical noise comparable to that present in a test set into the training set can improve average reconstruction fidelity of pure states [43]. Here, we extend this fundamentally data-centric technique by applying it both to mixed states generally and to the engineered distribution of Sec. IV D specifically. In other words, we show that multiple data-centric techniques can be used in a complementary fashion.
FIG. 5. Reconstructing the NISQ-sampled distribution of Sec. IV B with simulated measurements performed with shots ranging from 128 to 8192. The red line is the reconstruction fidelity when performed using a network trained on ideal measurements, which themselves have no statistical error. The blue line is the reconstruction fidelity when a separate network has been trained for each shot level such that the training set was simulated at the same shot level as the test set. The whiskers represent the interquartile range, while small circular dots are the outliers.

For our demonstration we use the same states obtained in Sec. IV B as our test set, but with their measurements simulated at shots ranging from 128 to 8192. Here the term "shot" represents the number of times each Pauli measurement circuit runs. In Fig. 5, we show the reconstruction fidelity of the NISQ-sampled (from IBM Quantum) test set as a function of the number of shots used to generate the measurement results in the test set for two different training sets. We use the same network architecture, with dense 1 = 3050 and dense 2 = 1650, with each training set. The red line indicates the reconstruction fidelity of the test set using a network trained on ideal measurement data, meaning we generate measurement probabilities directly from expectation values. Alternatively, the blue line is the reconstruction fidelity when the network has been trained on measurement results that have been simulated at the same shot number as the NISQ-sampled test set (x-axis).
For each data point in Fig. 5, box plots represent the range of reconstruction fidelities of the test states for the respective case. The notch indicates the median, each box encloses the interquartile range [Q_1, Q_3], and the whiskers extend from Q_1 − 1.5(Q_3 − Q_1) to Q_3 + 1.5(Q_3 − Q_1), where Q_1 and Q_3 are the first and third quartiles. As can be seen from the divergence of the red and blue lines at low shot numbers, when significant statistical noise is present in a test set it is advantageous to include equivalent statistical noise in the training set. We note that the convergence of the blue and red lines is expected, as 8192 shots is large enough to significantly reduce statistical noise.

B. Accounting for heterogeneity in state complexity
An intuitive assumption when generating a training set is that one should aim to match the distribution of the test set as closely as possible. Surprisingly, we find here that it is not always optimal to exactly match the test distribution when the states cover a wide range of purity. In particular, we find that a higher reconstruction fidelity is obtained for a given test distribution when we train our network on a slightly more mixed distribution. We demonstrate this effect for two-, three-, and four-qubit systems.
To control the relative mean purity of the test and training distributions, we use the MA distribution as described in Sec. IV A. We begin by fixing the test distribution concentration parameter and K value. We then train a network on the same distribution as well as several others with the same concentration parameters but progressively larger K values. Recall that the MA distribution is constructed by summing K Haar-random pure states, and hence increasing K causes the overall distribution to become more mixed [44].
For our test sets, we draw 5000 random quantum states from the MA distribution with the parameters (α, K) = (0.1, 4) for two qubits, (α, K) = (0.03, 8) for three qubits, and (α, K) = (0.015, 16) for four qubits, simulating ideal Pauli measurements on each. Note that for informationally complete tomography, the number of measurements and the number of neurons in dense 3 scale as 6^d and 2^{2d}, respectively, where d is the number of qubits. The purity distributions of the samples for each case are shown at the top of Fig. 6(a-c). At each qubit number, we generate four training sets, each with 30,000 random quantum states sampled from MA distributions with the same α as the corresponding test set but with varying K: K ∈ {4, 5, 6, 7} for two qubits, K ∈ {8, 9, 10, 11} for three qubits, and K ∈ {16, 19, 22, 25} for four qubits.
We train a separate network with dense 1 = 3050 and dense 2 = 1650 for each K and reconstruct the test states. To collect statistics, we run each network 10 times and take the average of all 10 predictions for each test state as the reconstruction fidelity for the given state. The reconstruction fidelities, grouped by the ground truth purity of the test states, are plotted in Fig. 6(a) for two qubits, (b) for three qubits, and (c) for four qubits. The purity range from 0.3 to 1.0 is divided into 10 bins, and the statistics are evaluated separately in each bin. The vertical and horizontal error bars represent one standard deviation from the binned mean of the reconstruction fidelity and ground truth purity, respectively.

FIG. 6. In the lower plots, the red lines indicate reconstruction when the network has been trained using the same distribution (sampled separately) as the test set. Each other curve (blue, green, purple) represents the reconstruction fidelity when the network is instead trained on a more mixed version of the test distribution. The vertical and horizontal error bars represent one standard deviation from the mean. Moreover, the reconstruction fidelities averaged over all test states are shown in the insets, for each K.
In general, we find that increasing K > D in the training set, which decreases the purity, significantly enhances reconstruction fidelities for mixed states while slightly reducing performance for pure states (as shown in the insets of Fig. 6). Therefore, caution should be taken when choosing the value of K used in the generation of the training set. Nevertheless, on the whole, the improvement for mixed states tends to outweigh any reduction in performance for pure states. We conjecture that this effect can be explained by the difference in the number of terms required to fully describe states of differing purity. For example, a pure state has fewer free variables than a mixed state of the same dimension, making it more difficult for the network to learn to reconstruct a mixed state than a pure state. Hence, biasing a training set to be slightly more mixed than the target distribution improves the performance of the network on average.

VI. DISCUSSION
Data-centric techniques represent a broad set of valuable and often underutilized strategies for improving the performance of classical ML-based systems used throughout QIS. Unlike model-centric approaches, data-centric methods have the distinct advantage of requiring no alteration to the underlying ML model. Generally speaking, data-centric techniques focus on identifying inadequacies in the construction of data sets, such as false correlations, insufficient variety of examples, and improper scoping. Remedying these deficiencies can significantly improve the performance of ML-based systems, but identifying these errors can require significant domain-specific knowledge. This paper has developed various data-centric heuristics for training set generation that consider prior or domain-specific knowledge to improve system performance, demonstrating the effectiveness of these heuristics with an ML-based quantum state reconstruction system.
Many data-centric heuristics are highly specialized to a particular situation under investigation and broadly include any technique for incorporating prior knowledge, such as the expected average state a system will generate, into the structure of the generated data set. Previous work has considered how to create data sets for ML-based quantum state reconstruction that take into account statistical counting noise, systematic experimental errors, and the expected distribution of states generated by a system [43,44]. Here, we have added to this list a method for engineering training sets to match distributions of expected experimental scenarios. We compare the effectiveness of our distribution-engineering approach to other standard methods for generating data sets, including those capable of incorporating some amount of prior knowledge such as mean purity.
We describe how spurious correlations can reduce system performance, how it can be challenging to identify these correlations in quantum states given their complexity, and how the inclusion of only a few counterexamples can remedy problems related to these correlations. We show that even for systems as small as two qubits it can be tempting to believe a data set is broadly illustrative of the overall set of possible states. In particular, we generate a data set that includes nearly the full range of possible purity and concurrence values and yet contains a false correlation between the two. We show that, in this example, such a correlation causes our state reconstruction system to misclassify pure separable states as entangled, having only ever seen pure states that are entangled. We then demonstrate that surprisingly few counterexamples need to be added to the training set to remedy this issue. Hence, it is prudent to include several states of every possible classification in any given data set.
More generally, we have also described data-centric heuristics that leverage only broad features of QIS rather than specific prior knowledge about an experiment or scenario. In particular, we find that, given the heterogeneity in the number of free variables between pure and mixed states, it is not always optimal to generate training sets that exactly match the distribution of an experimental scenario in the first place. Instead, when an ML system is to be applied to states covering a wide range of purities, training sets should be biased to be more mixed on average than the expected experimental distribution.
The data-centric heuristics described in this paper focus on situations where training data are synthetically generated, as is often the case in applications of classical ML to QIS-specific problems. The motivation for simulated data sets can be convenience, as experimental data may be impractical to obtain, or the fact that the problem itself is theoretical and measured data would only open the possibility of introducing experimental errors. However, the heuristics developed here still apply to experimentally obtained training sets, in which case they can be considered more prescriptive, as they suggest the structure of data sets likely to result in the highest performing ML systems. Due to our focus on synthetic data, we have not included any data-centric methods concerned with data labeling. However, we note that significant work in the ML community has focused on the effects of data labeling and has developed a set of data-centric approaches for systematically relabeling or removing mislabeled data to improve overall system performance [27, 65-67]. An interesting problem for future studies would be to consider the application of these label-focused approaches to experimental QIS systems.

The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. Additionally, we acknowledge the use of IBM Quantum for this work. The views expressed are those of the authors and do not reflect the official policy or position of IBM Quantum. This material is based upon work supported by, or in part by, the Army Research Laboratory and the Army Research Office under contract/grant numbers W911NF-19-2-0087 and W911NF-20-2-0168.
Appendix A: Varying K

Due to asymmetry in state complexity, the performance of a neural network may depend on the density of mixed and pure states in a training set. First, in order to generate a more mixed training set, we vary the value of K consecutively from 4 to 8 using the MA distribution at α = 0.3394 and D = 4; the corresponding distributions of the states with respect to the concurrence and purity are shown by unfilled blue histograms in Fig. 7(a) (left). Note that this α value was chosen because it aligns the mean of the MA distribution with the minimum purity value in the NISQ-sampled data set, which we found in Sec. IV D worked well for engineered states. The solid-line histogram represents the case of K = 6, whereas the filled green histogram represents the test distributions obtained from the cloud-accessed hardware (IBM Q). As shown, increasing K (directed upward in the figure) increases the mixedness of the training samples. Without any filter, we find that the increase in reconstruction fidelity with K quickly saturates and then gradually decreases, as shown by the blue line in Fig. 7(b). The error bars show one standard deviation from the mean.
In order to address this issue, we apply a filter in concurrence and purity to remove unwanted mixed states from the training set. The concurrence (top) and purity (bottom) histograms for the filtered (engineered) distributions of sampled states are shown by unfilled red histograms in Fig. 7(a) (right). As previously mentioned, a solid-line histogram indicates K = 6. With a network pre-trained on these engineered states, we find that the reconstruction fidelity gradually increases, as shown by the red line in Fig. 7(b), saturating by around K = 6.

In Sec. IV C we suggested setting the mean of the initial distribution, which is then sent through the bandpass filter, to P_min. Here we consider the performance of this parameter selection. To this end, we generate engineered training sets where the initial concentration parameter α is chosen such that an MA distribution with K = 4 has the mean purity given by the x-axis in Fig. 10. (The means of the engineered and K = 6 training sets will thus differ from this value.) We then use these engineered data sets to reconstruct the same NISQ-sampled data as in Fig. 3. For comparison, we also perform the same exercise for MA distributions that have not been passed through the bandpass filter. We see that the engineered distribution outperforms the base MA distribution in all cases. Further, our heuristic choice of setting the mean of the MA distribution to P_min is near-optimal over the presented range.

Appendix E: Engineered distributions
Here we present the complete algorithm written in pseudocode for generating engineered distributions of quantum states as explained in Section IV C.
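A Python rendering of the procedure is sketched below; the Monte-Carlo bisection used to tune α, the rejection-sampling loop, and the function names are illustrative implementation choices for this sketch rather than a unique specification of the algorithm.

import numpy as np

SY = np.array([[0, -1j], [1j, 0]], dtype=complex)

def sample_ma(dim, k, alpha, rng):
    # MA distribution: K Haar-random pure states mixed with symmetric Dirichlet weights
    rho = np.zeros((dim, dim), dtype=complex)
    for w in rng.dirichlet(alpha * np.ones(k)):
        psi = rng.standard_normal(dim) + 1j * rng.standard_normal(dim)
        psi /= np.linalg.norm(psi)
        rho += w * np.outer(psi, psi.conj())
    return rho

def purity(rho):
    return np.real(np.trace(rho @ rho))

def concurrence(rho):
    # Wootters concurrence of a two-qubit state
    yy = np.kron(SY, SY)
    lam = np.sort(np.sqrt(np.abs(np.linalg.eigvals(rho @ yy @ rho.conj() @ yy).real)))[::-1]
    return max(0.0, lam[0] - lam[1] - lam[2] - lam[3])

def tune_alpha(dim, p_min, n_mc=2000, lo=1e-3, hi=20.0, rng=np.random.default_rng(0)):
    # The mean purity of the K = D MA distribution decreases monotonically with alpha,
    # so bisect on a Monte-Carlo estimate until it matches the lower purity bound P_min.
    for _ in range(20):
        mid = 0.5 * (lo + hi)
        mean_p = np.mean([purity(sample_ma(dim, dim, mid, rng)) for _ in range(n_mc)])
        lo, hi = (mid, hi) if mean_p > p_min else (lo, mid)
    return 0.5 * (lo + hi)

def engineered_set(n_states, p_range, c_range, dim=4, k=6, rng=np.random.default_rng()):
    alpha = tune_alpha(dim, p_range[0], rng=rng)   # step 1: match mean purity to P_min at K = D
    states = []
    while len(states) < n_states:                  # steps 2-3: sample with K > D, then filter
        rho = sample_ma(dim, k, alpha, rng)
        if (p_range[0] <= purity(rho) <= p_range[1]
                and c_range[0] <= concurrence(rho) <= c_range[1]):
            states.append(rho)
    return states

# example call using the purity/concurrence extrema quoted in Sec. IV D
training_set = engineered_set(100, p_range=(0.68, 0.96), c_range=(0.0, 0.86))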