Randomized Benchmarking in the Analogue Setting

Current development in programmable analogue quantum simulators (AQS), whose physical implementation can be realised in the near-term compared to those of large-scale digital quantum computers, highlights the need for robust testing techniques in analogue platforms. Methods to properly certify or benchmark AQS should be efficiently scalable, and also provide a way to deal with errors from state preparation and measurement (SPAM). Up to now, attempts to address this combination of requirements have generally relied on model-specific properties. We put forward a new approach, applying a well-known digital noise characterisation technique called randomized benchmarking (RB) to the analogue setting. RB is a scalable experimental technique that provides a measure of the average error-rate of a gate-set on a quantum hardware, incorporating SPAM errors. We present the original form of digital RB, the necessary alterations to translate it to the analogue setting and introduce the analogue randomized benchmarking protocol (ARB). In ARB we measure the average error-rate per time evolution of a family of Hamiltonians and we illustrate this protocol with two case-studies of simple analogue models; classically simulating the system by incorporating physically motivated noise. We find that for both case-studies the data fit with the theoretical model and we gain values for the average error rate for differing unitary sets. We compare our protocol with digital RB, where both advantages (more realistic assumptions on noise) and disadvantages (difficulty in reversing the time-evolution) are discussed.


I. INTRODUCTION
In the quest for real-world applications in the present and near-future, the focus of quantum computing has fallen on noisy intermediate-scale quantum systems. While there is a lot of interest in digital quantum computing/simulation (DQC), significant progress has been made in analogue quantum simulation (AQS), which offers existing medium-scale systems, and where relevant physical problems are mimicked by a highlytunable quantum system. AQS are engineered to evolve continuously in time according to a specific Hamiltonian or family of Hamiltonians which can be flexibly controlled and microscopically understood from first principles. Instead of the application of quantum logic gates, as in the digital setting, a calculation is performed through the unitary or dissipative evolution under the system Hamiltonian. As a result, AQS currently lacks many of the existing testing or error-correcting methods of DQC. Being able to trust that AQS are simulating the designated quantum system, or running the same class of Hamiltonians in a reproducable way becomes essential when the AQS cannot be classically simulated.
Most AQS experiments aim to estimate the ground state of a Hamiltonian or to learn about the dynamical properties of the system and the spread of information, i.e. on a given quantum hardware or when evolving under a given Hamiltonian. Typically, AQS is tested against accurate numerical methods in classically solvable regimes (such as one-dimensional dynamics) or cases where the final outcome is known, i.e. classi-cal optimization. More recently, a number of ideas have been developed for tackling classically intractable regimes. These include a self-verifying technique for simulations of the Lattice-Schwinger model [1], a method based on non-demolition measurements of the Hamiltonian [2] and the broader concept of crossplatform verification, i.e. comparing results for the same problem from different quantum hardware or the measurement of compatible correlation functions that can verify specific Hamiltonians even when the quantum dynamics cannot be classically simulated [3,4]. It is also possible to estimate the final state of the AQS through quantum process/state tomography (QPT/QST) [5,6], direct fidelity estimation (DFE) [7], or in recent years, matrix product state (MPS) tomography [8] and neural-network approaches [9].
QST is a technique used to reconstruct an unknown quantum state when given multiple copies of it to measure, and to estimate the final state of a quantum simulation, whilst QPT is a similar technique used to reconstruct an unknown quantum process (quantum channel) and to test the process of a quantum simulation. Both techniques require resources that grow exponentially with the size of the system. DFE is a technique to determine whether a system arrives at some target state or runs a target gate. This is more efficient than the previous techniques since knowing the target state/operation means that there are fewer resources (measurement settings) required, however it does not directly account for the errors from state preparation and measurement (SPAM) and is still not scalable in the sense that the actual number of measurements still scales exponentially. MPS tomography provides an estimate of the final state of the system, and also requires less measurement settings than standard tomography due to the fact that it exploits tensor network techniques (and specifically matrix product states) to approximate the final state; again, it does not include SPAM errors in its analysis and in the suitable non-classical regime, scales exponentially in terms of resources. To summarise, (i) none of the techniques mentioned previously succeed in giving an account of the errors from state preparation and measurement, meaning that the accuracy to which they characterise the noise of the system is always bounded by an unknown contribution from the SPAM errors and ii) these methods are not efficiently scalable, and can only provide characterisation for systems of up to ∼ 20 qubits failing to capture correlations further than a few sites [8] or relying on state symmetries [10].
Despite the considerable advances in recent years [1] most of the current methods rely on model-specific properties to improve on previous proposals. The aim of our research is not to verify the ground state estimate of complex Hamiltonians, but instead find a general way to capture the performance of an analogue quantum simulator in running a family of Hamiltonians. Randomized benchmarking (RB) is a digital method to find a measure for the holistic performance of a quantum hardware that takes into account SPAM errors and is theoretically efficiently scalable, i.e. in the length of the gate sequences applied (poly(L)) and in the size of the system (poly(N )). This scalability is dependent on the efficiency with which the gateset tested can be composed/compiled with the native operations of the chosen hardware.
The motivation for this work grew from the need to overcome the limitations of previous methodologies and develop scalable non-model-dependent methods, building on ideas from the digital setting and applying them to AQS, i.e. with the intention to reduce the gap in testing techniques that exists between the digital and analogue fields of quantum computing. RB in the analogue setting would offer a way to determine whether your chosen quantum architecture will reliably perform a set of quantum evolutions or unitaries, giving an average error rate for this set of unitaries; particularly useful when considering programmable analogue quantum simulation. RB has the potential to efficiently provide a measure of performance of AQS in regimes that are currently classically intractable without the SPAM error noise floor on the accuracy of this partial noise characterisation. Moreover, it will become apparent that RB is more natural and physically motivated for analogue systems than for their digital counterparts.
With these motivations in mind, we propose extending randomized benchmarking to the analogue setting, which we call analogue randomized benchmarking (ARB). In Section II we present the original form of randomized benchmarking and the technical details of it, in order to keep this work self-contained and to better explain the alterations made for our contribution. Following in Section III, the modifications necessary to extend randomized benchmarking to analogue quantum simulators; including the current form of the analogue randomized benchmarking (ARB) protocol (Sec. III-C). In Section IV we present two case studies of simulating ARB on concrete analogue models. Note that we do not perform experiments on physical hardware, but instead emulate these devices by classically modelling physically motivated noise scenarios. We conclude in Section V with considering the barriers for physical implementation of this protocol as well as some ideas to overcome them, and introducing the possible future directions that have transpired through this process.

II. RANDOMIZED BENCHMARKING
Randomized benchmarking [11][12][13] is a technique for evaluating the performance of a quantum hardware. This is done by simplifying the error channel of a quantum process in a way that quantifies the average error per gate of a given gate-set, when it is run as part of a long random computation on this hardware. We first give the intuition behind this method, and formally present the technical details from Sec. II-A onwards, with simulation results in Sec. IV. The central idea is based on two observations. The first observation is that any error channel, irrespective of the correlations it initially exhibits, when twirled (essentially averaged over random unitaries; see details below) "behaves" as a much simpler error channel that is easier to characterise and quantify. The second key observation is that one can mimic the averaging over the infinite set of unitaries by sampling from a small (finite) set. Explicitly, unitary t-designs are finite sets of unitaries that have precisely this property: randomly sampling from the unitary t-design is equivalent to randomly sampling from the set of unitaries, provided that the averaged quantity computed involves polynomials of order, at most, t. For our purposes (twirling) the relevant polynomial is of order two, therefore a 2-design is sufficient. Crucially (and conveniently), the Clifford group constitutes a well known and simple to construct unitary 2-design. While this 2-design is sufficient for digital quantum simulations, for all cases, analogue or digital, where the "native" gates of a quantum hardware do not include the Clifford group, it is important to explore the possibility of using other 2-designs as an optimal (or as the only) way to test that quantum hardware. Principally, RB uses the observations stated above to estimate the average strength of errors, under certain assumptions about the noise, thereby providing an estimate of the average error rate of a given gateset on a given hardware.

A. Twirling and Depolarizing Channels
The foundation of randomized benchmarking lies in a concept known as twirling, which transforms a quantum channel Λ(ρ) into another channel Λ t (ρ) by conjugating over unitaries U (ρ) ∈ U (D), where D is the dimension of the Hilbert space, in the following way: If the unitaries U (ρ) are distributed according to the Haar measure dµ [14] which is a measure of uniformity, then the twirled channel becomes a depolarising channel.
A depolarising channel, intuitively, is a channel that with some probability leaves the state intact, while with the remaining probability depolarises that state, returning the maximally mixed state. It is of the following simple form: where I is the Identity operator (thus I D is the maximally mixed state), ρ is the quantum state that the channel acts upon, E d represents the depolarised channel, p E is the probability that the state ρ stays intact whilst (1 − p E ) is the probability of error occurring on that state, i.e., the probability that the state is in the maximally mixed state. We use p E to characterise the depolarising channel [15], i.e. if p E → 1 then the state remains unchanged.
As stated above, twirling a general error channel Λ, leads to a depolarising channel Λ t , where we will denote the corresponding probability p Λ . This gives a simpler form that enables us to characterise the strength of the error channel with a single parameter, p Λ .

B. Unitary t-design
A unitary t-design is defined as a set of unitaries {U k } where {k = 1, ..., K} such that: for every polynomial P t,t (U ) of order t, where dµ(U ) is the Haar distribution. In other words, for any polynomial of unitaries of degree t or less, calculating the average over the set {U k } is the same as calculating the average over all the unitaries (Haar integral). Anapproximate unitary t-design is a set that has the same property, where it holds only up to some error . As an aside, any unitary t-design is also a (t−1)-design.
Randomized benchmarking aims to test and quantify how well a set {U k } performs on average on a given quantum hardware, by utilising the twirling property of the Haar distribution. The twirling property is then used to simplify the effective error channel and make it quantifiable with a single parameter (depolarisation). In other words, if one has a set that is a unitary tdesign, then by performing the method of randomized benchmarking one obtains an average error rate of the said (gate)-set on that quantum device.
Explicitly constructing an exact unitary t-design is generally challenging and, in most practical cases, also inefficient [16]. To practically use the twirling property, one needs to reduce the infinite average to a finite one that can be experimentally performed, therefore the set that you are testing must be at least a unitary 2-design (for more intuition, see App. A). A benefit of the Clifford group of operators {C i } is that they naturally form a 2-design [17]. Benchmarking the full unitary group is not scalable, but it can be generated by adding one additional non-Clifford single-qubit gate (such as T = diag(1, e iπ/4 )) to {C i }. We would not expect the error-rate for a single-qubit rotation on a specific quantum hardware to differ significantly from that of the single rotations in the Clifford group, hence RB can provide some confidence in your hardware for the full unitary group by benchmarking {C i }. Moreover, scalable benchmarking methods for {C i } are important for error correction codes in universal quantum computing, since most fault tolerant schemes are based on stabilizer codes [18]. Another important advantage of using the Clifford group for RB is that the Clifford operator that inverts a sequence of Cliffords {C i } can be found efficiently [19], and this efficient inversion is key to performing standard RB.

C. Standard form of Randomized Benchmarking
The basic steps of the standard RB protocol consist of: 1) preparing a (simple to prepare) initial state ρ ψ = |ψ ψ|, then 2) running random sequences of gates of various lengths, where the final gate is one that inverts the preceding string of gates, and finally 3) measuring the probability that the output state remained unchanged, i.e. that it is still the same ρ ψ that we started with; something that would happen in an ideal noiseless case. Under certain simplifying assumptions about the noise-channel (imperfections of the device) this survival probability can be directly related with the average fidelity of the gates from the tested gate-set. This, therefore, quantifies how well the gate-set performs on average, i.e. the average error-rate of the gate-set. The fact that we look at the survival probability over sequences of different lengths enables us to extract the errors from state-preparation and measurement (SPAM) since they do not scale with the length of the sequence.

Notations:
We define the set {U k } as a set of unitaries that form an exact unitary 2-design, where {k = 1, ..., N }. In the standard form of RB, these unitaries are taken to be the Clifford group. We consider a quantum system prepared in an initial pure state ρ i = |ψ ψ|. We denote ρ ψ as the initial state taking into account preparation errors. E ψ is the POVM element taking into account measurement errors, and in the ideal case Running a unitary gate U on a physical device corresponds to a quantum channel denoted as Λ U . The action of this quantum channel can be decomposed in two parts: where Λ U,e = Λ U • U † , capturing the errors that differentiate U from Λ U . We use this convention to describe the imperfect channel as one that firstly applies the correct gate (U ) followed by Λ U,e , the error superoperator. In our protocol, we define Λ Uk i as the imperfect implementation of a chosen unitary U ki . To introduce a term commonly used in the literature, the probability that an initial state survives a quantum process is known as the survival probability. The survival probability of a channel Λ UC , where U C is any unitary circuit, given a fixed initial state ρ ψ is: where E ψ is the projection on the state |ψ . Specifically in RB, we apply a sequence of imperfect unitaries followed by their (imperfect) inverse so that we have: It is clear to see that this probability is equal to unity if both the noise of the forward Λ U,e and the backward channels Λ U † ,e are the identity (noiseless), since in that case we just evolve the state ρ ψ by U † • U = I. The average fidelity of the quantum channel (over all pure states) is defined as: and the average fidelity of a gate-set is given by The relevant quantity that we are interested in extracting is the average error-rate of a gate-set (on a specific hardware) which is simply one minus the average fidelity of the gate-set: It is necessary to choose suitable values for the following parameters for statistically relevant results: the number n l of sequence lengths S l to sample from {U k } (where S l should be chosen such that they uniformly cover as much of the unitary space {U k } as possible); the number n seq of sequences S η per length l, to sample; the number of repetitions, R, per sequence, η. The optimal values for each of these parameters is discussed with respect to our protocol in Sec. IV and App. B.

Assumptions:
In order for the above protocol to really quantify the average error-rate of a gate-set, as given in step 5, we need to make a number of simplifying assump-Protocol 1 Digital Randomized Benchmarking 1: Sample uniformly from {U k } a number of sequence lengths S l and run a sequence S η at length l ∈ S l where: S η = Λ Uk η +1 Λ Uk 1 , ....Λ Uk η , and Λ Uk η +1 is a single operator deterministically chosen to invert the preceding sequence of unitaries (i.e. Λ Uk η +1 = [Λ Uk l , ..., Λ Uk η ] † ). This sequence should return the system to its initial state ρ ψ . 2: Repeat this sequence R times and record T r[E ψ S η (ρ ψ )] to see if initial state ρ ψ survived the sequence S η and call this the survival probability P η for sequence η. 3: Repeat this for varying sequences of the same length l and find the average probability that the initial state survived for this sequence length, T r[E ψ S l (ρ ψ )] where S l represents the average over all sequences of this length. Call this the average survival probability for length l: P l . 4: Repeat the above steps for sequences of different lengths, and plot average survival probability against sequence length, i.e. P l vs l. Fit results to a pre-determined decay curve: P l = A + B f l where l is the sequence length and f is the fidelity decay parameter, with A and B absorbing SPAM errors. 5: The average error rate can be characterised by r where r = (d−1)(1−f )/d, and d is the dimension of the Hilbert space for a system of qubit size n (2 n ).
tions on the type of noise (error-channels) the device has. Relaxing some of these assumptions is possible with small modifications, e.g. in the derivation of the decay curve used for fitting in step 4, but the standard assumptions are: the errors should be (i) gateindependent Λ U,e = Λ e and (ii) time-independent, i.e. independent of the time it takes to run the gate and of when it is applied in any part of any sequence, (iii) the error-channel should be trace-preserving and memoryless (iv) the SPAM errors should be lengthindependent and (v) the error in the inversion step can be viewed as a single step; therefore, it does not scale with the length of the sequence which allows it to be absorbed into the SPAM errors. Finally, given that the gate-set tested is not universal, it is also crucial that (vi) the inversion step (in the ideal case) can also be implemented using gates from the tested gate-set. Realistically, the physical errors on a quantum process are likely to be gate-dependent and timedependent, and extensions to the initial RB protocol exist that factor these types of errors into their analysis [20][21][22][23][24]. The assumption that the noise is gate-independent is particularly unrealistic for certain digital quantum hardware. For instance, it is not uncommon to have two different Clifford gates produced from the exact same interaction term run for different times, whereby the amount of time needed for one gate is much longer than the time needed for the other. One would expect that the errors on a gate would depend on the length of time that a Hamiltonian is evolved for, therefore the gate-independence of the errors in such a gate-set on a hardware does not match the underlying physics. Interestingly, this type of problem is not present in our analogue generalisation of the protocol.

Why it works:
Experimentally, we obtain the average survival probability P l for each length l (step 3), summing over all the sequences of the same length l. By the 2-design property of our unitary set {U k } we have: where |ψ 0 is the initial state of the system. Note that we have expressed the imperfect inversion operator as a single gate Λ U † tot . Decomposing the errors and assuming that they are gate and time-independent Λ U,e = Λ e , leads to: where Λ e,t is the depolarised (twirled) channel corresponding to Λ e and the probability that characterises this channel is p e (see Eq. 12 and Eq. 14). One can then integrate one-by-one the U k 's, where each of the integrals result in one error term being twirled and hence depolarised. Noting that the twirled (depolarised) channels commute with all other channels in general and specifically with the unitaries appearing in the above expression, we obtain: Here, Λ s represents the error channel corresponding to the SPAM errors. Since the imperfect inverse Λ U † tot is one single operator (or at the very least it will be composed of far less gates than the forward sequence) the error associated with it can be absorbed into the SPAM errors. These errors can also be treated as a depolarising channel, because the state is measured in the basis {|ψ ψ| , I − |ψ ψ|} and the corresponding "off-diagonal" terms do not affect the probabilities that we measure (and need for the subsequent estimations). This SPAM error depolarising channel (Λ s ) is characterised by the parameter p s , leading to: which is in the exact form P l = A + Bf l mentioned in step 4, where f = p e , A = 1/d, and B = ( d−1 d )p s . By plotting P l for different values of l we recover the value of p e (f ). Having obtained the depolarising probability of the error-channel, we can now look at the average fidelity of the gate-set: and due to the left-invariance of the Haar measure, we have that F (Λ e , I) = F (Λ e,t , I), i.e. the fidelity of any superoperator (Λ e ) with the Identity (I) is equal to the fidelity of its exact Haar twirl (Λ e,t ) with the Identity (I) [25]. Therefore, with the simplifying assumptions made, it is clear to see how the average fidelity is Recalling that r := 1 − F ave (see Eq. 8) we get the expression of step 5 for the average error-rate of the gate-set:

III. THE ANALOGUE SETTING
We first give an overview of extending RB to the analogue setting, with technical details from Sec. III-A onwards, and our ARB protocol (see Protocol 2) in Sec. III-C. The quantum logic gates tested by digital RB (most commonly, the Clifford group) are replaced with a set of time evolution operators/unitaries: {U k = e −iHkdt }. The unitaries tested are generated by disordering a Hamiltonian, H s , native to the quantum device of interest resulting in a new set of Hamiltonians {H k } such that the time-evolved operators {U k } approximate a 2-design. We use the word approximate because finding an exact unitary 2-design is highly non-trivial, especially in the analogue setting; we write the protocol for an -approximate 2-design and bound our results (see Sec. III-B). In digital (Clifford) RB, the inversion operator can be found efficiently since Clifford computations can be classically simulated [19]. The unitaries generated in the analogue setting do not have this property and we therefore systematically invert one by one each of the unitaries applied (in the forwards evolution) in our protocol to return back to the initial state (in the ideal case). This systematic inversion means that the errors during the inversion operation also scale with the length of the sequence, something that could technically result in the errors in the process not being depolarised, we discuss this in Sec. III-D.
One of the caveats of digital RB is that it requires compilation of physical gates, that are produced naturally in the device, into gates of the Clifford group (even for modified versions that test non-Clifford gates [21]). This limits the size of the system that one can feasibly test with this technique; with a more recent advancement on standard RB [24] making 5 qubits the largest system size tested as of the time of writing. Moreover and perhaps, even crucially, since each Clifford gate is compiled from a different combination of native gates that run for different times and that apply different types of Hamiltonians, the assumption that each of these gates will have the same probability of error (essential for the standard RB derivation) is not physically justifiable. In contrast, with ARB the set {U k } is generated with the native abilities of the hardware in mind, centred on the Hamiltonian of the quantum system and with each unitary run for the same time interval. Therefore, the analogue setting presents certain advantages as it bypasses the compiling aspect of the protocol and would allow for efficient testing of larger system sizes; in conjunction, the assumption of gate-independence of the noise has a greater motivation. By utilising the fact that any unitary 2-design (and even approximate 2-design [23]) is sufficient, ARB could potentially improve upon the applications of the original technique, avoiding the compilation, scalability and complexity limitations of testing only the Clifford group.

A. Unitary Set for ARB
In the analogue setting, the gateset to be tested can be substituted by a group of unitaries that govern the time evolution and have the following form: {U k = e −iHkdt } where H k is the Hamiltonian indexed by k representing its position in the set {U k } and dt is the time-step for the evolution. The physical constraints of the device play a role in the choice of dt; it should not be less than that which does not physically allow for the Hamiltonian to be changed per time-step, and we use this to motivate our choice of dt (= 0.005). The value of dt also affects the convergence rate of the unitary set to an -approximate 2-design, as can be seen in Fig. 4 in Sec. IV, where the lower threshold for a physically justified time-step is dt = 0.001. We define H k to be: where H s represents the original form of the Hamiltonian (the Hamiltonian native to the quantum system) and ζ (g,l) k is an added disorder term which we define 8 to be one of the following: where the indexes (g, l) denote global (g) and local (l) disorder terms, and ∆ (i) k is a disorder potential that varies for every {k = 1, ..., K} and according to which site it is acting on. Here, the disorder potential is conditioned on one-site in the interaction, i.e. for each two-site coupling the disorder potential will be changed according to the first site. With u = x, y, z indicating which product of Pauli operators is to be applied on nearest-neighbours. By varying the original Hamiltonian H s with these disorder terms we generate a family of Hamiltonians (spanning the set H k ). From this family {H k } we time-evolve the Hamiltonians to get our unitary set {U k } for a fixed time-step (dt).
Generating an exact unitary 2-design in the analogue setting is inefficient. In order for ARB to produce meaningful results, the set {U k } will need to form at least an -approximate unitary 2-design and we therefore define and analyse -approximate unitary 2designs and discuss their role in the context of ARB in the following section.

B. -approximate Unitary 2-design and Bounds
In ARB the main operation that we want to achieve is twirling the error channel into a depolarising channel by averaging this channel over the unitary set that we have. The steps involved in RB perform this necessary twirl because the unitaries that are randomly sampled are uniformly distributed (a property of the unitary 2design). If we instead have an -approximate unitary 2-design, there is no guarantee that the errors will be depolarised since the unitaries sampled will not come from an exact uniform distribution, although there is evidence that approximating a t-design is adequate for the RB method in some contexts [23]. From Eq. 27 we set as an unknown error bound on our results in Sec. IV, and here we define an -approximate 2-design and present the reasoning behind this bound.
Definition III.1. An -approximate unitary 2-design defined in terms of the diamond-norm [26] is a measure on a finite subset U (D), where D is the dimension of the Hilbert space, that satisfies the following property [27]: where E α is the twirled channel of Λ over a set of unitaries {U α } spread according to a probability distribution α. E µ is the Haar-twirl of that channel, with µ the Haar measure.
A unitary 2-design is the case when E α = E µ , i.e. this twirl should behave the same as the Haar-twirl, and = 0. There are a few ways to determine how close a particular set of unitaries is to an exact unitary 2design, that involve comparison with the way that Haar random unitaries behave, such as the frame potential [28] and comparing the second moment operators [29] (see App. D) . Following the reasoning of [30], we define the trace-norm of a quantum channel.
Definition III.2. The trace-norm of a quantum channel E in terms of the input state density matrix ρ, that minimises the error probability on distinguishing between two quantum channels E 1 and E 2 , is defined as: where · 1 is the trace-norm, i.e.
Definition III.3. The diamond-norm distance written in terms of the trace-norm of a quantum channel E is as follows: Using these definitions, we obtain the following: is the twirled channel of Λ over a set of unitaries {U α } spread according to a probability distribution α and E µ is the Haar-twirl of that channel, then for an -approximate 2-design, it holds that: with Definition III.4. The average survival probability for each sequence length found from RB is: with a pure input state ρ, and using an exact unitary 2design, where the average error rate of the unitaries surrounding, and including, Eq. 14). Similarly, the average survival probability for each sequence length, measured for an unknown αdistribution of unitaries via RB, is: where the unitaries are assumed to be an -approximate 2-design and P τ l = T r[E ψ E τ (Λ) l (ρ ψ )], τ ∈ {α, µ} represents the survival probabilities of input state ρ ψ when the twirled quantum channels E τ are applied to it l times.
Lemma III.2. If the unitaries {U α } form anapproximate 2-design, it holds that: We refer to Sec. II-C for completeness, and particularly Eq. 5, 10 and 11 where l sums (integrals) over the unitaries appear in the survival probability's expression.
Proof Sketch. To go from the measured survival probabilities to the exact case, one needs to replace the l sums over the -approximate 2-design with the corresponding integrals over the Haar measure. For each one of these replacements, the maximum difference between the two probabilities increases by , resulting in the l · at the end of Lem. III.2. The full proof is given in App. D.
Assumption III.1. We assume that statistical error in estimating the value of P α l with the RB method, is much smaller * than the error induced by running RB with an -approximate 2-design rather than an exact 2-design: determining f l from the RB method of an -approximate 2-design, is given by: where f is dependent on l, f l = f ± δf l is the value for the fidelity decay parameter found for RB with an -approximate 2-design and f is that obtained with an exact design.
Note that f l and δf l can be computed separately for each different length l. In practice, B will also depend on the SPAM errors which are (generally) unknown, and therefore it is essential to consider several different lengths to obtain a value for f l .
Lemma III.4. The error in determining the average error-rate r l of a gate-set that is an -approximate 2design, from the RB method, is given by: as compared to that when using RB with an exact 2design. Where p s is the parameter that characterises the depolarising channel of the SPAM errors (i.e. if no SPAM errors are present, p s = 1, while if SPAM errors completely depolarise the channel p s = 0).
It is clear that the larger the value of l that we use to estimate r l , the larger the error due to the -approximate 2-design. In our simulated results, we assume no SPAM errors (p s = 1), and we can therefore take the more optimistic view and consider the errors for l = 1. Under these assumptions, it is easily seen that: Lemma III.5. If l, p s = 1 the value of r determined from the RB method when testing an -approximate 2design, is bounded as follows: where we use r to denote the value for the average error-rate measured from using an -approximate 2design, and r is the value gained when using an exact 2-design, with the RB method.
The proofs for the above Lemmas can be found in Appendix D. From this point on, when we refer to the bounds on r , we will use the above expression. In any case, the above (exact) expressions can be used to test further that our experiments agree with the error bars in f and r which would (a) give further evidence that the set of unitaries we considered form an -approximate 2-design and (b) could help us determine the actual value of .
The biggest hurdle to overcome in this setting is generating the -approximate unitary 2-design from the time-evolution operators {e −iHkdt }. In order for a set of Hamiltonians of the form of Eq. 15 to efficiently converge to a unitary 2-design, i.e. form an -approximate unitary 2-design, we have to break the symmetries in the original Hamiltonian H s , between the variables and how they commute with an added disorder term. This is why we give a few options for the disorder term ζ (g,l) k and test our protocol with different disorders. The disorder term ζ (g,l) k should be sufficient enough to break the symmetries in H s and therefore H k , however since we do not formally prove that our set of generated unitaries are an -approximate 2-design we use the above Lemma's to confirm our assumptions and results.
In their study of Rényi entropies, the authors of [31] (using a similar set-up to us but looking at sectors of a given Hamiltonian (H| A )) found that by applying a local disorder potential and a single-site Pauli-operator to all sites at each time-step, their random unitaries converged to a unitary 2-design. Their local disorder potentials were drawn from a normal distribution with standard deviation δ = J. The disorder potentials we apply are also drawn from a normal distribution, where δ = J, but we apply these potentials differently according to which model we test and choose two-site Pauli-interaction terms in order to break the symmetry of H s . In Sec. V we discuss the prospect of optimising the generation of a set of unitaries {U k = e −iHkdt } for a given . ] on your system such that if each unitary was perfectly implemented your system would be returned to initial state ρ ψ . Here we systematically invert each preceding unitary. 2: Repeat the sequence R times; record the survival probability P η = T r[E S η (ρ)] for this sequence. 3: Repeat the above steps for sequences of different time-lengths and plot the average survival probability against time-length P T vs T . 4: Fit the results to a predetermined decay curve of the following or similar form: Where again T is the sequence time-length and f represents the fidelity decay parameter of the process, with A and B fit parameters that absorb SPAM errors. And again, the average error rate is characterised by r where r = (d-1)(1-f )/d and d is the dimension of the Hilbert space for a system of qubit size n (d = 2 n ).
Ω over unitaries {U k } depolarises that channel. The time-step dt is kept the same for each unitary operator to mitigate time-dependent errors. We again prepare the quantum system in initial pure state ρ ψ where, if ideally prepared, ρ ψ = |ψ ψ| and assume that the error channel is a trace-preserving and memoryless CPTP map, with errors gate and time-independent. The parameters for the standard RB protocol (see Protocol 1 in Sec. II-C) still apply, but we redefine the form of S l and introduce another measure for the sequence length : S T where S T ≡ S l and T represents the total time to run each sequence of length l, i.e. T = dt · l. Furthermore, we highlight that the average error-rate gained from this protocol is the average error per time-step dt and therefore we define that our f in this case is the fidelity decay parameter as a function of the time-step and not per-second/unit-time, i.e. f := f dt , and when we write f T in step 4 of Protocol 2, we mean f T := f l dt where our length is now in terms of the time it takes to run a sequence (as previously described). Now, our sequence lengths are called S T and we refer to length as time-length.

D. Systematically Inverting Unitaries
The original form of RB (Sec. II-C) involves a single deterministically chosen inversion operator Λ Uk+1 that inverts the preceding unitary sequence. The errors in the process before this inversion step get depolarised through the twirling of the error channel, but the single inversion operator will also have errors attached to it. Due to this inversion operator being much shorter in length than the sequence preceding it, the error associated with this operator is a constant additive error. While this (SPAM & inversion) error is not (in general) depolarising, as far as computing the survival probabilities, there exists some depolarising error channel that would have the exact same effect † , and therefore we can model it as such.
As mentioned, the inversion operator of any string of Clifford operations can be found efficiently. Unfortunately, there is no equivalent result in the analogue setting, and therefore the initial development of the protocol involves systematically inverting each preceding unitary. This systematic inversion of the preceding sequence means that the errors, now a combination of forward evolution and inversion errors do not, in general, depolarise in the RB process; therefore, for the purposes of analysis in Sec. IV, we model the systematically inverted unitaries as perfect. It is worth mentioning that, while the errors do not (directly) get depolarised, with a relatively simple extension one can still obtain some characterisation of the average errors of the gate-set if similar assumptions, e.g. gateindependence of errors, hold. In particular, one can get the average error-rate for sequences of pairs of gates, which is still a quantity of practical importance. However, the details of the above construction go beyond the scope of this paper and will be discussed in a future publication.
A necessary step to physically implementing analogue randomized benchmarking will be to mitigate this time-reversal technique. Firstly, because with systematic inversion the errors will not necessarily be depolarised, and secondly and, most importantly, because performing this type of time-reversal is not trivial on the analogue quantum simulator. While certain terms such as field or on-site terms can be inverted efficiently in the current experiments, the inversion of off-site couplings or tunneling terms still remain challenging [32,33]. To highlight this non-triviality, we note that only recently was such inversion realised even in a simplified model [34]. In the next section we focus on describing how ARB would operate on an analogue device, where our analysis involves classically simulating the process, including modelling and implementing the error channels.

IV. CASE STUDIES
To analyse ARB, we consider a spin system native to many quantum simulators, which is particularly relevant to the case of trapped ion experiments [35][36][37]. Specifically, we modelled an Ising Hamiltonian with a transverse field: where J ij ∝ J |i−j| α is the interaction term and 0 < α < 3 dictates the strength of the interaction, σ +,−,z are the corresponding Pauli operators, N is the length of the spin chain, i.e. the number of qubits in the system, and B is the transverse magnetic field strength. We focus on the case of α ∼ 3 where we assume nearest-neighbour interactions only [38] and on α ∼ 0 which corresponds to an all-to-all coupling between sites. Current experimental development [8,[39][40][41] has led to regimes where the theoretical prediction for the above XY Hamiltonian becomes classically intractable, it is therefore essential to provide a characterisation of such devices in the absence of simulation capabilities.
In the following we consider the case of a system consisting of N = 6 spins studying how the coupling strength, the disorder terms and the field strength affect the protocol. We simulate the evolution of the system from an initial product state |ψ = |↑↓↑↓ . . . . We generate our set of unitaries {U k = e −iHkdt } by adding a disorder term chosen from Eq. 16 with k = {1, . . . , 1000} and run the ARB protocol with perfect inversion operators. In this initial exploration we modelled gate and time-independent noise, adhering to the specifications of the standard RB protocol. We modelled this noise in the form of uniformly distributed fluctuations to both J and B terms that form the static Hamiltonian H s (see Eq. 28). These terms mimic fluctuations in the trapping or stray fields in the system and can be realised experimentally. Futhermore, we run the protocol with no errors in state preparation or measurement. Assuming that we have an -approximate 2-design, it follows that the decay curve that we expect our data to fit is of the form: where d = 2 N = 2 6 . In the following results, we fit the ARB decay curve using a non-linear least squares regression fitting tool [42]. Assuming no SPAM errors and perfect inverses simplifies the comparison of our results to those of an exact unitary 2-design (see Lem. III.4 and Eq. 27). We use this to estimate the average error rate by fitting P T from the fidelity decay parameter f as r = (d − 1)(1 − f )/d.

A. XY Hamiltonian with transverse field (Nearest-Neighbours)
In this section, we present the ARB fits for the model described in Eq. 28 for the case of α ∼ 3 that we consider to be well-described by: We generate a set of Hamiltonians {H k } for both local (ζ l k ) and global (ζ g k ) disorder terms: where we chose from Eq. 16 that u = x, since this type of coupling should break the symmetry in our . In both cases the resulting fidelity follows the decay curve that we use to estimate the average error rate r, which is displayed in each figure. We obtained rg = 0.003350 (0.003321, 0.003380) for global disorder and r l = 0.003264 (0.003227, 0.003301) for the local case. For all the displayed results we chose J = 1 as our frequency reference and scale all timescales with respect to it so that our time axis reads T J.
which to sample for each set {H (g,l) k } of the following form: {U In Fig. 1, we present the ARB protocol results for both the constant global disorder H g k (Left) and the local site-dependent disorder H l k (Right) as indicated in Eq. 31. On the time evolution forward the system is subject to noise proportional to H s , which we chose to be normally distributed with mean ∆J = ∆B = 0 and standard deviations σ J = 0.2 and σ B = 0.5 ‡ . In these results we conducted n seq = 100 sequence iterations for every sequence time-length S T and repeated each individual sequence R = 10 times to find the average of a given sequence. We discuss our choices for these parameters in App. B. ‡ Note that these values are taken as an example and we do not require the experimental device to exhibit the exact same values.
In both cases, the data fits the ARB decay curve, which in itself is an indication that the errors modelled were depolarised by the process of ARB. This in turn, indicates that the set of unitaries sampled {U (g,l) k } were a good approximation of a 2-design. We observe that for well-sampled sets {U k } the disorder term added, whether global or local, does not largely impact the fit result. This could indicate that both types of disorder are sufficient to generate an -approximate 2-design. Due to our noise model and perfect inverses, it is possible that the averaging during the ARB process, rather than twirling (caused by an approximate 2design), is what causes the errors to behave like a depolarising channel; this is why we write our results in terms of Eq. 27 (from Lem. III.5). If the errors behave like a depolarising channel due only to averaging with this kind of simple noise model then we can not trust the ARB protocol in a regime where we have more complex noise (though still adhering to noise assumptions). For the average error-rate we obtained (with 95% confidence bounds): r l = 0.003264 (0.003321, 0.003380) (33) r g = 0.003350 (0.003227, 0.003301) , (34) for locally (r l ) and globally (r g ) disordered unitaries, respectively. Where, from Lem. III.5 we have the average error-rates bounded as follows: where α ∈ {l, g}. Assuming an -approximate 2-design with → 0, our values of r are low. In this case, r can be thought of as an average infidelity (due to no SPAM errors and no errors in the inversion operators [43]) of each unitary (run as part of a long sequence), and therefore a value of r 1 indicates a high average fidelity for each unitary per time-step. It would be unreasonable in this instance to expect a large value of r since it is found that r < 1 even in digital systems where RB has been applied with unrealistic noise assumptions. Since the noise that we modelled was physically motivated, i.e. feasible and also uniform, we would not expect to have a large average errorrate. We use r to characterise how this type of noisy hardware (modelled as the fluctuations to J and B) would behave under this set of unitary operators, and with our nearest-neighbour model we obtained low values of r indicating that this set of operations would perform well overall on such a hardware.

B. All-to-all spin model Hamiltonian with transverse field
We present the ARB fits for a spin system governed by Eq. 28 for the case of α ∼ 0, i.e. an all-to-all coupled spin system with the same system parameters as in Sec. IV-A: Again, we generated a set of Hamiltonians H k for both local and global disorder terms: and thereby a set of 1000 unitaries {U (g,l) k } for each {H (g,l) k }. In Fig. 2, we present the results of fitting our data for the all-to-all H s to the ARB curve as in Eq. 29 with the constant global disorder term (Left) and the local site-dependent disorder term (Right). We observe better agreement to the ARB curve than in the case of nearest-neighbours (Fig. 1), particularly at shorter times when the error bars are smaller. The better fit suggests that an all-to-all H k generates a set {U k } that is a closer approximation of a 2-design than its nearest-neighbour alternates (Sec. IV-A). Due to the fast scrambling conjectures in [44] we would expect that the more random disorder we create in our {H k } the more likely we will get a convergence to a unitary 2-design with a large set of unitaries {U k } and Fig. 2 supports this. The average error rates are as follows: In both cases the resulting fidelity follows the decay curve that we use to estimate the average error rate r, which is displayed for each case, in the figure. We obtained rg = 0.001840 (0.001826, 0.001855) for global disorder and r l = 0.001826 (0.001811, 0.001842) for the local case.
the better fit to the decay curve (see Eq. 29), and → 0 the average error-rate for our sets {U (g,l) k } is low, as expected with this type of noise model. We observe that the values of r l found for both the nearest-neighbour model (Fig. 1) and the all-to-all model (Fig. 2) when compared to the corresponding model values of r g were actually very close. Although we would not expect them to be the same since we are effectively testing two different sets of unitaries for each model, the fact that the same noise model (simulated hardware) is being tested for each set could indicate that we would see a similar average error rate for these two sets of unitaries since they are built around the same starting Hamiltonian (see Eq. 28).

C. Impact of Field-term and Time-step
Obtaining the average error-rates (r) from the ARB protocol provides a characterisation of the hardware, and how it copes with a specific set of operations (gateset). We therefore analyse how some of the physical system parameters, of our specific tested system, may impact the values of r obtained.
In Fig. 3, we present the fidelity decay curves for the case of the nearest-neighbour H XY and added global disorder (see Eq. 30 and Eq. 31) as a function of the transverse magnetic field, B. We observe that the ARB result does not depend on the off-set value (B) of the field but rather only depends on the magnitude of the noise specified by σ B . This reveals that, according to our simulations, a quantity that would govern the ground state properties of the device does not affect our protocol. Therefore, the characterisation of the device depends only on the form of the noise and not on the choice of static parameters.
We now consider how the numerical time-step (dt) can impact the ARB curve. In Fig. 4, we present the survival probability P T for different values of the timestep (dt) used to create the unitaries for the same system as in Sec. IV-A; again the nearest-neighbour H XY with global disorder. The values of dt were chosen in the regime where the numerical simulations have no impact on the differences between them, if the system were noiseless, to avoid any numerical error contribution to the analysis. These results highlight  We observe that the survival probability decreases as dt increases, since the noise can deviate the system from the initial state for a longer time period, leaving it harder to correct with the noiseless backward evolution. the fact that when a given noise term is applied for a longer period of time in the system it can cause a stronger deviation from the initial state, more difficult to correct with the perfect backwards evolution. In this analysis dt is, therefore, related to the ratio of change of the noise in the actual quantum device, which can affect the result since we model perfect inverses, i.e. noiseless backward evolution. For the purposes of the protocol, that the error on each unitary (gate) should not be dependent on the time it takes to run that unitary, we chose one value of dt(= 0.005) fixed for all timeevolutions simulated.

D. Main Observations
• The average error rate per time-step decays as expected in both cases of disorder (global and local) in the nearest-neighbour and all-to-all models of the XY Hamiltonian, with r bounded in terms of , which we assume to be small. • Our simulations reveal that the value of the magnetic field B (that governs the ground state properties of a Hamiltonian) does not affect the results of our protocol, but rather only the form of the noise (fluctuations) added to B create an effect. This indicates that our protocol is in fact providing a measure of the noise in our (simulated) system and it is robust to changes in the system parameters. • The fit to the all-to-all coupling (see Fig. 2) of our ARB fidelity curve (see Eq. 29) was better than that of the nearest-neighbour curve (see Fig. 1) as would be expected with a more randomized set {U k }. • The disorder and noise parameters used in the simulations above are compatible with current analogue quantum simulators meaning that it is not unrealistic that our protocol could be (after some modifications concerning time-reversal) experimentally implemented.

V. CONCLUSIONS AND FUTURE WORK
With the aim of developing a scalable generic method for testing analogue quantum simulators, we extended randomized benchmarking to the analogue setting. By replacing the quantum logic gates in the protocol with unitary time-evolution operators (native to the quantum system), fixing the time-step to be the same for each unitary, and introducing systematically inverted unitaries rather than one single inversion operator we presented the analogue randomized benchmarking protocol. In the context of continuous time evolution, the challenges we met were in creating a set of unitaries {U k } that generated an -approximate 2-design and efficient time-reversal of the unitaries on an analogue system. We numerically simulated our protocol on two models of the XY Hamiltonian (nearest-neighbour and all-to-all), which is native to trapped ions. We added global disorder to generate one set of unitaries and local disorder to generate another, for each model. We modelled uniformly distributed fluctuations (noise) in the coupling J and B field terms of the static Hamiltonian. The data for both cases fits the derived (for this noise) randomized benchmarking fidelity decay curve, with the average error rates for the unitary sets {U k } as expected for this type of gate and time-independent noise model. We observed a better fit with the data from the all-to-all H XY model. This fit could be explained by the fact that all-to-all coupling would induce more randomness in our Hamiltonians, therefore possibly converging more quickly to an -approximate 2-design. We note that the average error-rate r for the all-to-all model is (in both cases) approximately half the size of r for the nearestneighbour model. This implies that the all-to-all unitary sets perform better on this type of noisy hardware, on average.
Analogue randomized benchmarking creates opportunities for improving confidence in analogue quantum simulators by providing alternatives to the current benchmarking techniques. Assuming one has anapproximate 2-design and the sequences run can be efficiently inverted, we could compare the average error-rate r across two quantum devices with the same starting Hamiltonian (H s ) that the set is built around; since ARB is primarily a test of a specific quantum hardware, r could provide information about what kind of noise were present in each device depending on the results of the protocol on both. Another area that ARB could be useful in is random circuit sampling, where the ARB parameter r could potentially be used to prove that random sampling from a random circuit is hard; with future works looking at this direction. At the root, ARB provides a measure for the performance of a set of unitaries on a specific hardware, and in the analogue setting this could be useful in testing programmable analogue quantum simulators. Particularly, the value of r would give a characterisation of how ones device will run a family of Hamiltonians, providing an extra security in the results you would gain from a programmable AQS experiment.
Extending RB to the analogue setting highlighted many interesting research questions, particularly in regards to approximate unitary t-designs with unitary time-evolution operators. In our work, we assume an -approximate 2-design is formed from our disordered set of unitaries {U k } (formed from disordered Hamiltonians {H k }) because the disorder added was such that it should be sufficient to break the symmetry of the system Hamiltonian. However, we have not formally proven that our unitary sets {U k } are -approximate 2-designs and we therefore introduced a bound on the results garnered from the ARB protocol. This at least allows us to assess our results for the average error rate within a relative context, and with the standard error on our result we bound the unknown parameter . Perhaps the RB parameter r could provide an indication of the value of for a set of unitaries that categorically are an -approximate 2-design. An extension to this work is to formally define generating an -approximate unitary 2-design from a set of unitaries formed around a Hamiltonian native to an AQS. A recent result from [45] linked the frame potential (see Eq. 43) and the Haar moment operators (see Eq. 45) in order to more accurately characterise an -approximate 2-design. This idea could provide a way to optimise the generation of approximate designs in the analogue setting. Moreover, exploring the types of disorder that one can add to the starting Hamiltonian, i.e. more localised disorder, could reveal the optimal type of disorder that generates anapproximate 2-design with a given Hamiltonian.
Another area to investigate is the limitation of the time-reversal (mentioned in Sec. III-D) where we have acknowledged that systematic inversion could still provide a measure of the average error (though per twogates rather than one) and that the main obstacle, in our point of view, to implementing our protocol is the fact that time-reversal in analogue devices is currently not feasible, although it can be implemented for a restricted set of operators, e.g. field terms. For small scale systems, one can compute the ideal output of running sequences on that system and estimate the fidelity of the output state with the ideal state, i.e. using DFE techniques or efficient tomography. This would mitigate the need for the inversion step in our protocol, and the benefit with this type of hybrid technique would be removing the SPAM errors from the characterisation; though, unfortunately, losing the scalability advantage of ARB. On trapped-ion simulators in particular, digital and analogue computations may be performed, and therefore it would be prudent to look at the difference in errors found with both techniques: a possibility for ARB would be to implement the inversion in a Trotterised (digital) way and combine the analogue and digital techniques in order to better characterise the types of errors on this kind of device.
The advantages that DRB (Sec. II-C) and ARB (Sec. III) have in common are that they evaluate, in a scalable way, the performance of a device whilst also removing the fixed imperfection of the SPAM errors.
Comparatively, the use of native gates of the systems means that it is likely that ARB will have smaller errors, e.g. in compilation of gates/more noise that does not adhere to RB assumptions, than digital RB. This could be especially prevalent when dealing with the same physical system used for both analogue and digital quantum simulations. The assumption of gateindependence, and even of nearly gate-independence, of the noise model is far better motivated (and closer to reality) in the analogue case which means it is more likely that when experimentally implemented, ARB would give a better fit to the fidelity curve than in the digital case. Moreover, ARB could provide a way to test the performance of digital quantum simulators where researchers could focus on the average errorrate per length of computation time, rather than pergate. This type of characterisation is not only more physically motivated, but could also bring this analysis closer to the adiabatic model of quantum computation, where complexity is considered in regards to the time taken for the adiabatic evolution.

VI. ACKNOWLEDGEMENTS
We thank Ulysse Chabaud, Andreas Elben, Martin Kliesch, Rawad Mehzer and Hendrik Waldner for helpful discussions and clarifications. Work at the University of Strathclyde was supported by the EPSRC Programme Grant DesOEQ (EP/P009565/1) and the European Union's Horizon 2020 research and innovation program under grant agreement No. 817482 PASQuanS. Work at the University of Edinburgh was supported by the following grants from the EPSRC: Verification of Quantum Technology (EP/N003829/1) and UK Quantum Technology Hub: NQIT (EP/M013243/1).

APPENDIX A UNITARY 2-DESIGNS
Consider a superoperator Λ acting on a space M D of D-dimensional quantum states, when t = 2, Λ has a D 2 × D 2 dimensional matrix representation. We now define a set U (D) of unitary matrices on this space M D . If the set is a unitary 2-design then the space M D is reducible to two irreducible invariant subspaces. Now, we define Λ acting on a quantum operator X as: Λ(X) = AXB. Schur's Lemma [46] implies the following (reducible representation) for U (D)-invariant trace-preserving operators: where p = T r(Λ)−1 D 2 −1 . Considering the fact that a unitary 2-design means that sampling uniformly from the set {U 1 , ..., U K } is operationally equivalent to sampling from the Haar measure, we can say that [27]: for all A, X, B ∈ L(C D ). This essentially means that if we have a set {U k } that is a unitary 2-design or above, then conjugating a quantum channel over this set and averaging will result in a depolarisation of that channel.

APPENDIX B PARAMETERS CONVERGENCE
Changes to some of the method parameters, such as the repeated runs of each random sequence (R) and the number of sequences tested for each sequence length (n seq ) would improve the accuracy of our results: iterations over as many sequences as possible of the same length are desired in RB to sample as uniformly from the unitary space as possible and repeating each sequence sufficiently gives a more accurate measure We observe how the average results overlap and remain within the errorbar interval for all R values. We display the error for R = 10 only to help visualizing the curves; (Bottom) Logarithmic differences between R = 5, 10 and R = 10, 20 curves, we compare this with the error bar of R = 10 (dotted line). We observe that differences with R are smaller than the statistical uncertainly of the curves.
of the average survival probability for that sequence.
In this section we discuss the choice of numerical parameters in the results presented in the main text.
Here we describe how the RB curves depend on some of the averaging parameters such as R and n seq . In Fig. 5 and Fig. 6, we justify the choice of the numerical parameters n seq = 100 and R = 10 in the main text by comparing the ARB decay curves for varying values of the mentioned variables. We observe that in both cases the statistical uncertainty derived from the sequence averaging is larger than the discrepancy as we vary these parameters, therefore we are confident that the presented results do not depend on the chosen values for n seq and R.

APPENDIX C COMPARATIVE TECHNIQUES FOR 2-DESIGNS
In addition to our discussion in Sec. III-B, here we highlight some of the comparative techniques used to determine whether one has an exact unitary 2-design beginning with the introduction of the Spherical tdesign. Consider a real function f , and imagine we are interested in the average value of this function on an n-dimensional real sphere S n ; this is hard to compute, so one can think of averaging over a finite set of unit vectors D = {|φ 1 , . . . , |φ K } instead. Briefly, a spherical t-design is a finite subset D of S n such that the average of every t-th order polynomial p over S n is equal to the average of p over D. For spherical t-designs, the frame potential [28] is a well-known metric for determining whether one has an exact spherical design or not, and is defined as follows: Definition C.1 (Spherical t-design). A set of vectors {|φ 1 , . . . , |φ K } is a spherical 2-design in C d if and only if: The definition of spherical t-designs was modified for the unitary setting, changing the real sphere S n to a set of unitaries U (D) and comparing with the Haar distribution, with the term unitary t-design coined by Dankert et al [47].
Adapting the frame potential from the spherical setting to the unitary setting, D. Gross et al [48] developed the frame potential technique for determining whether ones set of unitaries is an exact unitary 2-design or not, see Definition C.2. Another technique for determining whether the (suspected) unitary 2-design that one has in an exact unitary 2-design, is to compare the moment operators up to order t (in this case 2) of the particular unitary set and the Haar measure, we refer to [29] for Definition C.3.
Definition C.2 (Frame Potential). Let the set M = {U k } with {k = 1, ..., K} be a set of unitaries. The frame potential of M is defined as: M is an exact unitary 2-design ⇐⇒ P (M) = 2.
Definition C.3 (Second order moment operators). A degree (t, t)-monomial in C ∈ U ((C d ) ⊗n ) is degree t in the entries of C and degree t in the entries of C * . Collecting all these monomials into a single matrix of dimension d 2nt by defining C ⊗t,t := C ⊗t ⊗ C * ⊗t , we state that α is an exact unitary t-design if expectations of all t, t moments of α match those of the Haar measure: Therefore, µ is an exact unitary t-design if and only if: Where µ is the Haar distribution.
Although the above metrics are useful in determining whether one has an exact unitary 2-design, they do not give a measure of how close ones set is to being an exact design.

APPENDIX D PROOFS FOR SECTION III-B
Proof for Lemma III.1. From Definitions III.1 and III.3 we obtain: which implies: Using Definition III.2 we state: And it holds by definition that: Therefore: as required.
We set ρ 0 as the average state of the RB protocol before the final measurement (i.e. averaging the above expression over different sequences according to the distribution dα): ρ 0 = dα(U 1 ) · · · dα(U l )ρ(S l ) .
Here, in each of the above quantum states we replace (one-by-one) the average over the distribution α with that of the Haar measure µ. Any two consecutive states ρ j , ρ j+1 differ by a single integration, and by the -approximate 2-design property (see Eq. 20) we, therefore, have that: where the implication follows from the triangle inequality.
The definition of survival probability stated in Eq. 5 results in the the equivalence of the above equation. Intuitively, the difference in the probabilities, ψ| ρ l |ψ and ψ| ρ 0 |ψ , of obtaining the states (after measurement) is no larger than l · , the bound specified by the trace-norm between the the two initial states, as in Eq. 54.
Note that in the following proofs we denote the error in obtaining a quantity C as δC, where a subscript is added if the error depends on some measured quantity that is not obvious.