Efficient flexible characterization of quantum processors with nested error models

We present a simple and powerful technique for finding a good error model for a quantum processor. The technique iteratively tests a nested sequence of models against data obtained from the processor, and keeps track of the best-fit model and its wildcard error (a quantification of the unmodeled error) at each step. Each best-fit model, along with a quantification of its unmodeled error, constitute a characterization of the processor. We explain how quantum processor models can be compared with experimental data and to each other. We demonstrate the technique by using it to characterize a simulated noisy 2-qubit processor.

We present a simple and powerful technique for finding a good error model for a quantum processor. The technique iteratively tests a nested sequence of models against data obtained from the processor, and keeps track of the best-fit model and its wildcard error (a quantification of the unmodeled error) at each step. Each best-fit model, along with a quantification of its unmodeled error, constitute a characterization of the processor. We explain how quantum processor models can be compared with experimental data and to each other. We demonstrate the technique by using it to characterize a simulated noisy 2-qubit processor.

I. INTRODUCTION
A quantum processor consists of a collection of effective 2-level physical systems called qubits, and a system that regulates and controls these qubits and their environment [1,2]. In order for the processor to work properly, the control system must maintain the coherence of the qubits' collective quantum state while performing very specific manipulations of that state. In real quantum processors these quantum logic operations act imperfectly [3]. This limits the processor's computing power and utility. Understanding these imperfections is critical to improving future hardware and advancing the state of the art [4].
There exist a variety of QCVV (quantum characterization, verification, and validation) protocols that aspire to identify and quantify various deviations from ideal processor behavior. Most (arguably all) of these techniques rely on some model, implicit or explicit, for the those deviations. Some use comprehensive models that describe the fine-grained behavior of every logic operation, and are intended to predict the outcome probabilities of arbitrary quantum circuits [5][6][7][8][9][10][11]. Other QCVV techniques use simpler models for coarse-grained observable properties -e.g., binary success/failure probabilities, or specific circuits only, or averages over circuit ensembles [12][13][14][15][16][17][18][19][20]. "Good" models of either type -i.e., ones that accurately fit the data -can be used to identify noise processes and error mechanisms in the quantum hardware, to extrapolate the behavior of existing devices, and to predict the behavior of future hardware.
Model complexity presents a fundamental trade-off. Complex models with many adjustable parameters, such those used in gate set tomography (GST) [7,21], often provide greater predictive power, more robustness to unanticipated phenomena, and more insight into error sources. But these virtues come at a cost. Complex models are computationally harder to evaluate and trickier to interpret, and fitting their many parameters demands more experimental data. Simpler models, such as the 3-parameter model used by randomized benchmarking (RB) [14][15][16][17], are much easier to construct and interpret, and can be more easily scaled to larger number of qubits -but they are often less predictive, and provide less insight into underlying physical mechanisms. So choosing a QCVV protocol (and its associated model of errors) at the beginning of an experiment can be a momentous choice -a "too simple" model won't capture all the errors, while a "too complicated" model will demand excessive resources.
In this work, we introduce a new paradigm. Instead of choosing a protocol and a model in advance, we dynamically explore a range of models and experimental designs to find one that explains the processor's behavior parsimoniously. This approach, and the specific technique we deploy, are motivated by two key take-aways from our experience characterizing experimental processors: (1) There's rarely such a thing as "the right" noise model for an experimental processor; and (2) it's critical to balance the model richness needed to describe observed data against the simplicity needed to facilitate useful interpretation.
Our basic methodology is to construct a set of nested candidate models, arrange them in a sequence, then iteratively fit them against data and use statistical tests to determine whether to proceed further (adding more data), or try a bigger model. Testing a statistical model is a well-researched task. There are powerful statistical methods for quantifying when a model is consistent with a set of data, and when a particular model should be preferred over another. We apply these methodsand some novel ones that we developed recently -to the task of finding empirical models for errors in quantum processors.
Section II provides additional motivation for our proposed method, and summarizes it at a high level. Section III introduces important technical background and definitions, and describes how statistical model testing can be applied within the context of quantum characterization. Then, in Section IV, we present our method completely and discuss its properties. Finally, in Section V we demonstrate our protocol by applying it to the simple case of a simulated 2-qubit quantum processor.
We use the open source pyGSTi software package [11,22] to perform the numerical analysis.

II. CHARACTERIZATION USING MULTIPLE MODELS
Characterizing a quantum processor typically involves choosing a method (e.g., RB or GST), based on a single model, and running it. Such methods use models with varying strengths and sizes, but in the end only a single model is ever utilized and we must accept the strengths and weaknesses inherent in it.
Standard gate set tomography [7,21], for example, uses a large model where each gate is an arbitrary CPTP map. The best-fit of this model is compared with the data, with the hope arXiv:2103.02188v1 [quant-ph] 3 Mar 2021 that it will describe most if not all of the data. If it does, the large best-fit model must still be analyzed to extract meaningful simple metrics that describe the errors in an intuitive way. This approach can be inefficient because the GST model has more parameters than are usually needed. It allows for the possibility of many errors that either 1) don't occur in the device or 2) aren't intuitive or aren't related to hardware adjustments that could improve the device. A large model can also make the analysis (e.g. fitting the model) time consuming and the interpretation of the result opaque. Finally, large models require a proportionately large amount of data to unambiguously estimate all of their parameters, which demands more experimental resources. Standard 2-qubit GST requires thousands or tens of thousands of circuits. In many cases, most of these circuits are unnecessary because they probe errors that aren't present in the device.
Randomized benchmarks [14][15][16][17][18][19] suffer from complementary ailments. For example, standard RB's model contains a single gate error rate for an "average Clifford gate" [14]. This model is not intended to predict the outcomes of arbitrary circuits, and it can be difficult to generalize this error rate into meaningful statements about processor performance.
Solving these problems demands adaptive methods that integrate multiple models during the characterization process. In this article we present one such multi-model approach, and show how it offers distinct advantages over single-model approaches. Our procedure takes as input a sequence of nested models, ordered from smallest to largest, and a sequence of experiment designs -lists of circuits that define an experiment that could be performed on a quantum processor. Beginning with the smallest model and simplest experiment design, we compare our current model to the data from the current experiment design and decide whether the model sufficiently captures the data. If it does, we move to the next experiment design, to perform a more strenuous test of the model. If it does not, we move to the next larger model, in hopes that it will capture the data. This process is depicted in Fig. 1. Instead of fitting a large model to a large amount of data at the outset, and potentially finding that many of its error-rates are zero, we begin with a small model and dataset and test whether anything more is needed to describe the data. We only move to a larger model if the smaller model is deemed insufficient.
In the next two sections, we make this idea more precise and concrete. Section IV restates the procedure of Fig. 1 in greater detail and as pseudocode. In the intervening Section III, we introduce necessary background such as defining what nested models are and identifying metrics that we can use to decide whether a model "sufficiently describes" a set of data.

III. MODEL TESTING
Statistical models of logic operations appear naturally when characterizing a quantum processor. In this section we explain how statistical models relate to models of quantum processors, and how tools from statistics can be used to test and select among them. After some preliminary definitions we describe how a processor model can be compared with data from an  actual quantum processor, and how different models can be compared with each other.

A. Datasets
A quantum circuit describes a sequence of quantum gates (quantum logic operations) on a fixed number of qubits. All the quantum circuits we consider in this work begin with a state preparation on, and end with the measurement of, all the qubits in the system. An ordered list of distinct quantum circuits, which we represent using script C, along with an integer sample count, N, specifies an experiment to be performed on a quantum processor. We also call C an experiment design. The outcomes obtained from executing each circuit N times [23] form a dataset, which we denote using D = D(C, N). In this work, we only use the histogram of outcomes for each circuit -the time-ordering of the outcomes is discarded. Extensions to time dependent models [9,24] that require time series data, are left for future work.

B. Models
A model for a quantum processor is a mathematical object that can predict the outcome of any quantum circuit that is run on the processor. Since quantum mechanics is probabilistic, this prediction takes the form of a probability distribution over the possible circuit outcomes. A quantum processor model M is therefore a parameterized set of functions M θ that each map circuits to outcome probability distributions. We also refer to M θ as a model, since it is simply a quantum processor model without parameters. Θ is M's parameter space. It's dimension, k, is the model's number of parameters. When a processor model is combined with an experiment design C, a statistical model -a parameterized probability distribution [25] -results. The value of the statistical model M(C) at θ is simply the product of the circuit probability distributions c∈C M θ (c). Thus, M(C) has k parameters and, at every θ, predicts an outcome probability distribution for each circuit in C. When the circuits are clear from the context, we will omit them and simply use M to denote the statistical model M(C).
One way of specifying a model is by associating process matrices with each of a processor's available operations. By multiplying and contracting process matrices, such a model can be used to predict the probabilities of any circuit and thus for the circuits in C. A depolarizing noise model, where the same n-qubit depolarizing channel is applied after each gate, is a specific example of a 1-parameter quantum processor model.

C. Comparing a model to a dataset
A well-established way to quantify how well a model fits a set of data is the log-likelihood statistic. It is defined between a model M and dataset D as the probability of M given D. If c indexes each circuit for which D contains data, and β c the allowed outcomes of c, then the log-likelihood is given by where N c is the number of times circuit c is repeated, f c,β c is the frequency (fraction of total counts) with which outcome β c is observed after running c, and p c,β c is the corresponding probability predicted by M. The log-likelihood is welljustified on a number of counts: the inverse of its Hessian is the Fischer information [26] and by definition it quantifies the probability that the model produced the set of data. Intuitively, it quantifies how surprising it would be for the model to have generated the data. We define a parameterized model's loglikelihood as the maximum log L over its parameter space, i.e., If a model's predictions exactly match the observed frequencies, then the maximum possible likelihood is reached. The value for which this occurs is not a universal quantity, and it is a well known fact that the value of log L is only meaningful in a relative sense. When log L is given by Eq. 2, then this maximum, is clearly dependent on the observed data. Typically, a model does not predict the observed frequencies exactly, and we need to determine the quality of "goodness" of the fit based on the obtained value of log L.
Model M is said to be valid relative to D if it contains the (non-parameterized) modelM that generated D -or a map indistinguishable fromM. The maximal model for D, constructed to have one parameter for every independent observable probability in D, fits D perfectly, achieves log L max (D), and is valid, as there is no data that can falsify it. In general, determining a model's validity -a proxy for its "goodness"requires comparing it to a valid model. This makes maximal models particularly important points of reference.

D. Measuring goodness of fit
Two fundamental quantities enter into the perceived "goodness" of a model's fit to a dataset: the log L between model and data, and the number of parameters, k, of the model. When a model is given more parameters, it is able to fit any given set of data at least as well or better, and thus k and log L will trade-off with each other. Because, in this trade-off, we consider adding or subtracting parameters from a model, the concept of nested models is relevant. We say that M A is nested within M B , and write M A ⊂ M B , when M A constitutes a subset (within parameter space) of M B .
As we look at ways of measuring a model's goodness of fit, two underlying truths are helpful to keep in mind. First, a model's log L and k are independently important for assessing its fit. No single number can capture all the information contained in both. Secondly, we can only assess a model's fit relative to that of another model. This somewhat annoying fact is a consequence of the relative nature of log L discussed above. There is a sense in which a model's fit cannot be declared "good" with complete objectivity. Thankfully, maximal models provide nearly objective reference points to assess the goodness of other models' fits.
There are, more or less, two different ways to go about quantifying the goodness of a fit. The first way quantifies the amount of evidence of unmodeled effects. The more conclusive the evidence is against a model's being able to describe all the data, the worse the fit is deemed. The second way quantifies the size of the unmodeled effects. Here, a model's fit becomes worse as larger and larger corrections are needed to explain the data after starting at the model's predictions. We look at each approach in turn. . First, the log-likelihood between each model and a common dataset D is computed. Taking differences of log L values and model parameter counts produces ∆k and 2∆ log L, which in turn are used to compute N σ and γ (Eqs. 5 and 6). Between M and M re f these constitute measurements of the evidence that M is invalid. The smallest W is found for which the log L of the wildcard-augmented model M W meets a pre-specified goodness-of-fit threshold (see text), written as N σ ≈ 0, where the "size" of W is its L 1 -norm. The size of W quantifies the size of the errors M fails to model.

Quantifying the evidence of unmodeled effects
Suppose we want to know how well model M A fits a set of data D. Following our remarks above, this goodness-of-fit will need to be relative to some other model, M B . Let the models' parameter counts be k A and k B , respectively. In the language of hypothesis testing, M A is our null hypothesis, M B is our alternative hypothesis, and we want to know whether, or with what certainty, we should reject the null hypothesis. When M B is chosen such that M A ⊂ M B and to be valid, then Wilks' theorem [27] can help answer this question. Wilks' theorem states that twice the difference in the log-likelihoods, when both models are valid. The mean (∆k) and standard deviation ( √ 2∆k) of the χ 2 ∆k distribution effectively bestow an origin and unit, respectively, on the relative log L between nested models. Since we have assumed M B to be valid, the quantity measures, in units of standard deviations, how certain we are that M A should be rejected, i.e. how certain we are that M A is invalid. In other words, N σ quantifies the amount of evidence that M A is invalid. Informally, it quantifies how surprising it would be to learn that model M A (at any θ) generated D.
A second metric for comparing M A and M B can be derived from the expected trade-off between log L and k. Removing one free parameter will almost surely decrease 2 log L. Wilks' theorem tell us that removing a useless parameter (one that takes its true value in the smaller model) causes an expected decrease of 1 unit. So if we remove ∆k useless parameters, we expect 2∆ log L ≈ ∆k. Removing a useful parameter (one not constrained to its true value by the smaller model) will decrease 2 log L more. The Akaike information criterion (AIC) [28] states that under certain idealized assumptions, removing a set of k parameters will yield an estimate with greater predictive accuracy iff 2∆ log L < 2∆k. Both Wilks' theorem and the AIC suggest that we can evaluate the "usefulness" of a set of ∆k parameters with a quantity we call the evidence ratio, The evidence ratio tells us how much more data-fitting power M B has, per parameter, than M A . We can use it to apply a variety of rules for choosing between those models -i.e. model selection -simply by choosing a threshold ξ. We declare M A superior to M B whenever γ < ξ. Wilks' theorem implies that if all the "extra" parameters in M B are useless, then we'll observe γ ≈ 1. So the threshold ξ = 1 would only select M A when M B provides absolutely no additional model-fitting power (and even then, only 50% of the time, at random). Larger values of ξ favor smaller models, selecting M A even when its validity becomes less certain. The threshold ξ = 2 implements the AIC rule [28]. Often, even higher thresholds are desirable for quantum gate characterization, because model simplicity is prized. Like N σ , the evidence ratio quantifies the amount of evidence for unmodeled errors, not their size. Both N σ and γ give a relative comparison of two models' fit to data. While this is expected, given the relative nature of the likelihood function, we would like to assess a model's fit in an absolute sense. We are able to do this effectively by fixing, for a given data set, M B to be a reference model, M re f . The reference model must be suitably large so as to include (as nested models) all the models we wish to test, and it must be valid (a condition for interpreting N σ ). The maximal model introduced above has all of these properties, and for the remainder of this work, we take as the reference model the corresponding maximal model. After this, N σ and γ become functions of only M A . We write them as functions of a single model M, and omit the argument entirely when the model being compared is clear from the context.

Quantifying the size of unmodeled effects
When characterizing a quantum processor we're often interested in how far M is from a valid model, i.e., "how much error in the processor does M not capture?". It may be extremely costly (or entirely prohibitive) to construct a valid model, and a simpler model that fits most of the data may be preferable. As George Box famously said, "All models are wrong but some are useful" [29]. To be useful a model doesn't need to capture all of a quantum processor's behavior.
To know when we've captured enough of the behavior, we need to quantify the distance, in some meaningful metric, between the model and the nearest valid model. If both our model and a known-to-be-valid model are both specified using quantum process matrices, then the diamond norm distance between them would be a good metric. But usually this isn't an option. And neither N σ nor γ is the right sort of metricthey quantify the amount of evidence, not the amount of error. This is clear from the fact that just increasing the amount of data taken (e.g. increasing N) can increase both N σ and γ, even though nothing about the underlying processes or models is changing (cf. Eqs. 2, 5 and 6).
We recently introduced a metric of unmodeled effects called wildcard error [30], and we deploy it here. Here's a concise summary of how wildcard error quantifies unmodeled error. Any M θ can be augmented by combining it with a wildcard error model, which assigns a certain amount of wildcard error budget w to each quantum circuit. It does so by allocating w g to each gate g, and then computing each circuit's wildcard budget by summing w g over the gates in it. Wildcard budget relaxes the base model's predictions in a precisely metered way: if the base model predicts probabilities p, and the wildcard model assigns w, then any p whose total variation distance (TVD) to p is ≤ w is consistent with the relaxed prediction. Wildcard models are parameterized by wildcard error rate vectors W = {w g }, and augmenting base model M θ with wildcard error rates W yields a wildcard-augmented model M θ, W . To quantify unmodeled error, we find the smallest amount of wildcard error budget (i.e., a minimal W) that is sufficient to make the wildcard-augmented model consistent with the data. (If the base model is already consistent, then W = 0 suffices, indicating there is no unmodeled error). To determine whether a given W is sufficient, we use a standard loglikelihood test, but compute M θ, W 's likelihood by summing up each each circuit's likelihood maximized over the TVD-ball of outcome distributions that are consistent with M θ, W 's relaxed predictions [31]. We define a minimal wildcard error model as having the smallest L 1 -norm W 1 -this is a somewhat arbitrary choice, but makes W 1 a reasonable measure of total unmodeled error. Because each w g represents an amount of TVD per gate, it can be compared directly with standard error metrics with the same "units", e.g., diamond norm distance.
We combine wildcard error to a (parameterized) quantum processor model M by augmenting the non-parameterized M θ that achieves the maximum likelihood. The wildcardaugmented model, M θ, W , effectively divides the processor's error processes into two categories: (1) modeled effects, which are captured and predicted by the best-fit model M θ ; and (2) unmodeled effects, which are not modeled or predicted at all, but whose impact is upper-bounded by W. This division provides as much insight and predictability as possible for effects that the base model can explain, while acknowledging and quantifying the total impact of the unmodeled effects.
If the best-fit model is expressed using process matrices, then the elements of W can be compared directly to any TVDbased error metrics (e.g. diamond norm) derived from the base model. In many situations, this comparison can provide explicit justification for Box's aphorism. If the magnitude of a gate's unmodeled error (w g ) is much smaller than the magnitude of its modeled error, then the base model is "useful" because it captures the majority of the error behavior, even if N σ indicates that it is surely "wrong".
3. Discussion N σ , γ, and W provide complementary ways of quantifying how well a model fits a dataset. Their computation is broken down diagrammatically in Fig. 2. N σ and γ quantify the evidence of unmodeled effects, whereas W measures their size. Either (or both in concert) can be used to guide the algorithm we present here, and decide whether a given model is "good enough" for a given use. In many scenarios, we expect this choice will depend on whether the experiment's goal is (a) to identify and understand the processor's behavior for scientific reasons, or (b) to determine whether the processor will be able to satisfy engineering requirements. Statistical weight of evidence (N σ , γ) can identify effects that exist whether or not they are important. Wildcard error can quantify whether effects are likely to be important for information-processing tasks.
We adopt the more pragmatic approach, using wildcard error to set our "good enough" threshold in Section V.

IV. TESTING A SEQUENCE OF NESTED MODELS
We now have all the tools and concepts required to give a concrete, precise description of the multi-model characterization technique introduced in Section II. Let {M i } m i=1 be a sequence of nested models where M 1 ⊂ M 2 · · · ⊂ M m . We consider here just a 1D chain of nested models, but this could be straightforwardly generalized into exploration of a tree or lattice of models by considering multiple "adjacent" models at each step [32]. Let k i be the number of parameters of M i . Similarly, let {C j } l j=1 be a series of experiment designs (circuit lists). These can be chosen independently from the models, as models don't technically require any specific or minimal amount of data to be tested. However, models with more parameters require proportionately more data to estimate all of the parameters accurately, and sometimes a particular experiment design facilitates fitting a model's parameters [21]. The experiment designs can also be chosen independently of each other, though we envision the number of circuits in each design and the circuits' size increasing with j. We denote by D j the data from repeating each circuit in C j .
The method is iterative. At each stage it keeps track of a current model and experiment design, which we index us-ing i and j respectively. At the beginning of each stage, we find log L(M i , D j ) by maximizing the log-likelihood over the parameter space. We assume that log L(M re f , D j ) is also available, and compute the 2∆ log L and ∆k values comparing M i to M re f . From these, N σ and γ (Eqs. 5 and 6) are derived, and W is computed as described above. We then decide, based on application-specific criteria involving N σ , γ, and/or W, whether M i "sufficiently describes" D j . This sufficiency condition is an intentionally subjective and flexible criterion, as different applications require different levels of characterization precision. If the model is sufficient to describe the data, then we move on to the next dataset; if not, we move to the next model. Algorithm 1 Iterative model testing for quantum processor characterization.
The entire approach is outlined in Algorithm 1, giving a more concrete summary to the graphical depiction in Fig. 1. The algorithm consists of an outer loop over experiment designs and an inner loop over models. The inner loop computes the fit and model selection metrics discussed in Section III. We use s to hold values of the 2 log L statistic, and explicitly indicate how the metrics N σ and γ only depend on the difference between the compared models' 2 log L and k values. Finding a minimal wildcard model independently requires the observed frequency and M i 's predicted probability of each of circuit outcome in D and so the W min function takes M i and D as arguments. The inner loop stops based on the ModelIsSufficient function, which contains customized logic that can be, in general, based on any of the metrics. When this functions returns True the model is declared to "sufficiently describe" the data and the inner loop exits, causing advancement to the next experiment design. Note that considering a larger dataset does not reset i, the model index -we continue using the final model of the last outer iteration. The outer loop iterates through successively larger experiment designs and ultimately we either find a model that sufficiently describes D l or we exhaust all our models before this happens.
Algorithm 1 is a flexible procedure for quantum processor characterization that can be applied to almost any situation where one or more models of a quantum processor are available. Its flexibility originates from the freedom to choose the models, experiment designs, and sufficiency criterion it utilizes. Let us briefly discuss several factors that help inform these choices. First, it should be noted that the models must be simulated in order to compute fit metrics. This sets a practical limit on the size of both the models and experiment designs that can be considered. Another factor that may limit the complexity of the experiment designs is the intended application of the processor. For instance, if a processor is only expected to ever run one type of circuit, then it may be acceptable to only test circuits of this type.
There are also many reasonable choices of a sufficiency condition (the ModelIsSufficient function in Algorithm 1). In the worked example of Section V we define sufficiency to mean that unmodeled errors are small in size, and base it entirely on the wildcard error rate vector, W. Sufficiency could also be implemented as a specific certainty that the model is valid, in which case ModelIsSufficient would impose a threshold on N σ .

V. APPLICATION TO A 2-QUBIT PROCESSOR
In this section we demonstrate the characterization technique described in the previous section by applying it to simulated data. We consider a 2-qubit processor with x-and y-axis π/2 rotation gates, G (i) x and G (i) y , on each qubit i = 0, 1, and two CNOT gates between the qubits, G (0→1) cnot and G (1→0) cnot .

A. The noisy processor
We add errors to each gate by following the ideal "target" gate with an exponentiated error generator composed as a linear combination of "Hamiltonian" and "stochastic" elementary error generators H P and S P , respectively [33]. These elementary generators are indexed by a Pauli P and act on density matrices ρ by H P generates coherent (generalized over-rotation) errors about the P-axis, and S P generates incoherent (generalized dephasing) errors that diminish qubit coherence in the Pauli directions that do not commute with P. If G 0 is the Pauli transfer matrix of an ideal k-qubit gate, then is the Pauli transfer matrix for the corresponding noisy gate of the processor. Here P ranges over all k-qubit Pauli matrices and the h P and s P coefficients determine the strength of each type of error. These error types preserve completely-positive  Table I. The errors chosen for our artificial 2-qubit quantum processor. The text explains precisely how these coefficients determine the errors on the gates and SPAM operations of the processors. The values are chosen to reflect a processor with predominant ZZ-type overrotation errors on its entangling gates, and smaller amounts of overrotation and dephasing on its single-qubit gates. A ZZ term is also present, which occurs after every gate (but not SPAM) operation.
trace-preserving (CPTP) maps, and so we are guaranteed that G is a CPTP map. The errors present in our artificial 2-qubit processor are given in Table I. The state preparation and measurement (SPAM) errors are similarly constructed by following or preceding the ideal operation by an exponentiated error generator. The coefficients of these generators are specified in the ρ and M rows of Table I. The row marked "all gates" indicates an additional error that is applied after every gate operation. The precise magnitudes for these errors are chosen arbitrarily, but their structure is intended to reflect the physically plausible situation where a processor's single qubit gates have over-rotation and dephasing errors about their axis, and there exists a dominant ZZ coupling Hamiltonian that causes errors in the CNOT gates as well as via an always-on background effect. We generate data from our artificial processor by simulating the outcome counts of each circuit as needed. We use the noisy gate and SPAM models described by Table I to compute a circuit's outcome probabilities and sample the resulting multinomial probability distribution N = 10000 times.

B. An initial benchmark
Let us suppose that we are handed the 2-qubit processor just described, and asked to characterize it. Knowing nothing about the processor, an intuitive first step would be to run a holistic benchmark on the processor. Randomized benchmarking is a good choice, so let us run a set of RB circuits. Since we have just 2 qubits we can use standard Clifford RB here; for more qubits, we would use a scalable variant of RB such as direct RB [16] or a different benchmark [12,13,20]. We choose 30 random Clifford circuits at Clifford-counts of 2, 12, 22, and 32, for a total of 120 circuits. The RB data we obtained is plotted as a function of the Clifford depth in the inset of Fig. 3. The computed RB number is ≈ 0.029.
The RB number is often interpreted as an error rate, and used to define a depolarizing noise model. RB does not guar- antee that this procedure will result in a useful model, and so we will not do this, but will instead ask whether any depolarizing noise model is able to fit the RB data. Separate from the RB analysis we construct a depolarizing noise model and fit it to the RB data. The best-fit model's predicted success probabilities are show in Fig. 3 by the blue line and points. The data are plotted as a function of native gate depth to spread the points horizontally and to show how the y-direction scatter at a Clifford depth is partially explained by the varied gate depth. The 2σ standard error bars plotted on the data (black) points reveal that the model does not explain the data to the expected statistical precision.

C. Applying the multi-model approach
This motivates the consideration of richer models, and so we now apply the characterization method of Section IV. We would like to know, in the end, how our processor behaves on general circuits. Since we've already taken RB data, it is natural to set C 1 as our set of 120 RB circuits. RB circuits are designed to be insensitive to coherent noise and so we also include a set of periodic circuits designed to be sensitive to coherent errors. We choose a set of GST-like circuits, each composed of a repeated germ sub-circuit sandwiched between two fiducial sub-circuits (cf. 21). We find a set of 436 circuits that are sensitive to all (Markovian) coherent errors and have up to 16 germ repetitions. These are added to the 120 RB circuits to form C 2 . We note in passing that these 556 circuits are far fewer than the more than 9, 000 circuits required to perform standard GST. GST requires a much larger number because it uses an over-complete set of circuits and its standard set of circuits amplifies every Markovian gate error.
We next decide on our sequence of models, {M i } m i=1 . For this example, we consider the following nested models: 1. Depolarizing model M 1 : each gate has the same depolarization rate. The state preparation and measurement are also depolarized, each at an independent rate. This model, then, has a total of 3 parameters.

2.
Gate-dependent depolarizing model M 2 : each gate has an independent depolarization rate, and the state preparation and measurement have independent rates for each qubit, giving the model 10 parameters.
3. Pauli-stochastic model M 3 : each gate now has independent stochastic error rates along each Pauli direction. Each single qubit gate is given by 3 error rates and each two-qubit gate by 15 error rates. State preparation and measurement (SPAM) operations are allowed only local errors, and so have 3 degrees of freedom per qubit (6 total). This brings the total number of parameters to 4 × 3 + 2 × 15 + 6 + 6 = 54.

4.
Hamiltonian + Pauli-stochastic model M 4 : the same as M 3 except each operation also is allowed Hamiltonian (i.e. over-rotation) errors along each Pauli axis, doubling the number of parameters to 108.
5. Full CPTP model M 5 : each gate is allowed to be an arbitrary CPTP map, and SPAM operations are allowed to be followed or preceded (respectively) by such a map. The total number of parameters for this model is 2160. This is the model that standard GST uses from the outset.
The freedom to chose any set of nested models gives the method great flexibility. In our case, we presumed to know nothing about the types of noise that might appear and chose a series of models that capture generic types of noise and that are not specifically tailored to our processor. If, for example, we had reason to expect that ZZ-type over-rotation errors would be dominant, we could have included a model with only these types of entangling errors.
The last necessary ingredient is a model acceptance criterion, i.e., the ModelIsSufficient function in Algorithm 1. We only demand that a model describe most of the behavior of the processor, and not that the model be valid from a statistical standpoint. Specifically, we define our criterion as a simple = 10 −3 threshold on the maximum element of W. That is, This criterion implies that a model is satisfactory when the worst gate's unmodeled TVD allocation is less than 0.1%. The choice of here is somewhat arbitrarily, and different applications may be more or less willing to tolerate unmodeled error.  Table II. The primary outputs from running Algorithm 1 on a simulated 2-qubit processor. Each row is an iteration of the algorithm, where we compare the model (third column) to the data generated by a set of circuits (second column). N σ , γ, and wildcard error rate W values are computed, all with respect to a maximal model M re f as described in the text. The decision of whether a model sufficiently describes the data, given by Eq. 12, only depends on the maximum element of W, so that is all that is included here ( W in its entirety is given in Fig. 4). The diagram to the right shows the numbered iterations in the format of Fig. 1 Execution of Algorithm 1, given our inputs, produces the results of Table II. The table shows N σ and W at each iteration of the characterization process. The algorithm begins by rejecting the depolarizing and then the gate-dependent depolarizing models based on the RB data (D 1 ). Model M 3 , which allows independent Pauli stochastic errors, is tested next, and is able to describe the RB data well enough to be accepted, and causes the algorithm to advance to D 2 . The periodic data of D 2 causes M 3 to be rejected, leading to a test of the 108-parameter M 4 , which allows coherent and Pauli-oriented stochastic errors. M 4 is unsurprisingly capable of modeling the data very well, as the model we used to generate the data only included these types of errors. M 4 is accepted. Since C 2 is our final experiment design the algorithm exits without considering M 5 .
To visualize how well each model is able to reproduce the datasets to which it was fit to, Fig. 4 plots the difference between the observed frequency (a fraction of the total counts) and predicted probability for every circuit outcome in the dataset. Differences are plotted as a function of the predicted probability, and shaded regions demarcate a 2σ standard error bar (i.e., where |p − f | < 2 p(1−p) N when f and p are the frequency and probability respectively). For a statistically valid model, we expect ≈ 95% of the points to lie within the shaded region. When they do not, a wildcard model must account for the remaining discrepancy and | W| > 0. Vertical lines emanating from the points indicate how far that point is allowed to move toward the x-axis given the minimal wildcard error rates W, that were needed.
As a point of reference, the distances between blue and black points in Fig. 3 are the successful-outcome subset of the all the points in the first frame of Fig. 4. The error bars on Fig. 3's points correspond to the height of shaded region in Fig. 4. We see from Fig. 4 how more complex models are able to better predict the data, and how smaller wildcard error models are needed to augment such models.
This simple example illustrates several noteworthy points.
1. It is clear that N σ provides different information from  . Differences between observed frequencies and predicted probabilities at each algorithm step. Plots are arranged in a grid pattern similar to that of Fig. 1 and show, as a function of the predicted probabilities, p, the difference between this probability and the observed frequency f . In each plot, there is one point per circuit outcome. Lines drawn from each point toward the x-axis show how the given wildcard budget W is able to adjust p toward f ( f − p toward 0). Shaded regions indicate 2σ standard error bars.
W. The first two iterations have N σ > 500, expressing certainty that these models do not describe all the data. But the corresponding wildcard models have maximum elements less than 3 × 10 −3 , indicating that if just this small amount of additional ("wildcard") TVD is allowed per gate, the data can be explained by the model. Indeed, if we had set = 10 −2 , then we would have accepted the depolarizing model (M 1 ) and immediately moved to D 2 for the second iteration. In the third iteration, we treat an ostensibly invalid model (N σ = 13) as sufficiently describing D 1 based on the wildcard error W being small. In the final iteration we find that M 4 is a valid model and so necessarily doesn't require any wildcard error.
2. Different circuit lists are sensitive to different types of errors. These results show, unsurprisingly, that RB circuits are insensitive to, and effectively mask, coherent errors. Even though the underlying processor possesses predominantly coherent errors (cf. Table I) the purely stochastic model M 3 is able to fit the RB data very well (and would be considered a valid model if only 1,000 samples were taken for each circuit). Indeed, if we only considered C 1 it would be extremely difficult or impossible to determine whether the errors were coherent or incoherent using only RB data.
3. Finally, this example illustrates how repeated model testing makes efficient use of experimental data. We note that through the third iteration only the initial set of RB data was utilized. By this point in the algorithm we have constructed a 54-parameter model of the stochastic errors that clearly is able to explain more of the data than a simple depolarizing model (cf. Fig. 4). This stands in stark contrast to the single average-errorper-Clifford number obtained by a standard analysis of the same data. It is also the case that all the models up to this point (M 1 to M 3 ) can be simulated efficiently, and so are easily scalable to 10s or even 100s of qubits.

VI. CONCLUSIONS
Iterative model testing is a simple, powerful technique for learning about the behavior of quantum processors. We have shown how the techniques from statistics, along with the novel concept of wildcard error, can be applied to a series of nested quantum processor models to determine a good model. We have outlined a general procedure for performing such testing, and have demonstrated its utility by characterizing an simulated 2-qubit processor. By being flexible with regard to the models that are tested and the data that is utilized, our method can be applied over a wide range of scenarios and adapted to define what constitutes a "good" model based on a processor's intended application. The presented algorithm considers increasingly rich datasets, and tests sequentially more complex models. This allows it to avoid expending resources until they are absolutely needed, improving upon existing techniques such as gate set tomography. Throughout the characterization process, we quantify how much of the processor's error is not being captured by the current model. In the end, either an acceptable model is found or we have attempted the most complex model available (or feasible) to us. In either case, the amount of unmodeled error gives us a concrete sense of how the model can be used, and ensures that it never fails to provide useful information. We find, overall, that this modeltesting approach yields more detailed characterization infor-mation using the same experimental resources (data) when compared with existing techniques. This work was supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, and the Laboratory Directed Research and Development program at Sandia National Laboratories. Sandia National Laboratories is a multi-program laboratory man-aged and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA-0003525. All statements of fact, opinion or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of the U.S. Department of Energy, or the U.S. Government.