Witnessing Entanglement in Experiments with Correlated Noise

The purpose of an entanglement witness experiment is to certify the creation of an entangled state from a finite number of trials. The statistical confidence of such an experiment is typically expressed as the number of observed standard deviations of witness violations. This method implicitly assumes that the noise is well-behaved so that the central limit theorem applies. In this work, we propose two methods to analyze witness experiments where the states can be subject to arbitrarily correlated noise. Our first method is a rejection experiment, in which we certify the creation of entanglement by rejecting the hypothesis that the experiment can only produce separable states. We quantify the statistical confidence by a p-value, which can be interpreted as the likelihood that the observed data is consistent with the hypothesis that only separable states can be produced. Hence a small p-value implies large confidence in the witnessed entanglement. The method applies to general witness experiments and can also be used to witness genuine multipartite entanglement. Our second method is an estimation experiment, in which we estimate and construct confidence intervals for the average witness value. This confidence interval is statistically rigorous in the presence of correlated noise. The method applies to general estimation problems, including fidelity estimation. To account for systematic measurement and random setting generation errors, our model takes into account device imperfections and we show how this affects both methods of statistical analysis. Finally, we illustrate the use of our methods with detailed examples based on a simulation of NV centers.


I. INTRODUCTION
Entanglement is a fundamental property of quantum mechanical systems [1] and an important resource for many quantum information processing tasks.In quantum computing, coherently creating entanglement between several qubits is necessary for computational speedups [1][2][3].In quantum networks, remote entanglement is an essential resource for quantum cryptography [4][5][6] and distributed computing applications [7][8][9].Entanglement also plays a crucial role in quantum sensing and metrology [10][11][12], enabling more precise measurement of physical quantities.With the rapid experimental advances in the manipulation and control of quantum systems, much progress had been made towards the generation of entangled states in various physical platforms [13][14][15][16][17][18][19].Yet, the creation of high-quality manybody entanglement is still a significant challenge.It is therefore important to have good tools to certify the creation of entanglement.The main tools used for this purpose are entanglement witnesses.
An entanglement witness is an observable W on a quantum system that can certify entanglement of a state ρ * under investigation [20].By definition, the witness W satisfies for all separable states ρ ∈ S. As a consequence, it can be used to certify entanglement: If a state ρ * has negative witness expectation value, Tr[W ρ * ] < 0, then it is necessarily entangled.If the expectation value is non-negative, 1. Geometric interpretation of a witness W as a hyperplane that separates the state ρ * from the convex set S. If S is the set of separable states, then W certifies that ρ * is entangled.
the test is inconclusive (the state can either be separable or not).The witness method applies more generally than just for separating entangled from separable states.If S is an arbitrary convex set of states and ρ * ∈ S, then there always exists a witness W such that Eq. ( 1) holds, while Tr[W ρ * ] < 0. For example, a witness can be used to certify that states are genuinely multipartite entangled.Geometrically, the witness W can be interpreted as a hyperplane that separates the convex set S from the state ρ * / ∈ S.This is illustrated in Fig. 1.In general, finding an appropriate witness W for a state ρ * is a difficult problem that has been studied extensively in literature [21][22][23].For the remainder of this article, we will assume that W is chosen and fixed.
Often, a witness W is a non-local observable of the system for which entanglement is to be certified.Such measurements are typically hard to perform, particularly in a network setting.Therefore, in experiments, W is usually decomposed into a sum of locally measurable observables which are then measured individually on the constituent subsystems.The witness expectation value Tr[W ρ * ] is then the sum of the expectation values of the locally measurable observables.Each of these expectation values can then only be estimated to some finite precision, since in any experiment only a finite number n of data points can be collected.As a consequence, the witness estimate ŵn obtained from n measurement outcomes can differ from the true value Tr[W ρ * ].Therefore, it is an important question how to quantify the confidence in the experimentally determined estimate ŵn .

A. Prior work and motivation
In many experiments, the confidence in the estimate ŵn of the true witness expectation value Tr[W ρ * ] is expressed by the standard error σ (the empirical standard deviation) [13][14][15][16][17][18][19]24].These experiments typically claim the certification of entanglement if the estimate ŵn is a number of σ's below zero.This approach is simple and pragmatic, but may suffer from statistical and practical challenges (see [25] for similar objections to using this method for quantifying Bell violations).We give a concrete example in Section IV D (see Fig. 6) where this approach could potentially be problematic.
Certification of entanglement by the number of sigma's of witness violation is most easily justified under the assumption that in each round i the state ρ i is independent and identically distributed (iid assumption).This is equivalent to assuming each round a fixed state ρ * is produced.Under this assumption, the estimate ŵn is considered a realization of a Gaussian random variable Ŵn with mean E[ Ŵn ] = Tr[W ρ * ] and standard deviation σ ∼ 1 √ n (for sufficiently large n).This is justified by the central limit theorem.The empirically obtained ŵn and σ are appropriate estimates of the mean Tr[W ρ * ] and standard deviation σ.Thus, the reported ŵn ± σ is a complete and accurate characterization of the distribution (and hence leads to meaningful confidence intervals etc.)However, if the states ρ 1 , . . ., ρ n produced in an n round experiment are not iid, several difficulties may arise.This starts with what is even to be estimated.Most natural is to estimate the average witness value which can also be interpreted as the witness expectation value of the average state ρ * = 1 n n i=1 ρ i .There are versions of the central limit theorem that relax the iid assumption.They can be used to argue that the estimate ŵn is still an observation of a Gaussian random variable Ŵn with mean E[ Ŵn ] = W n for sufficiently large n (Gaussian assumption).But in practical experiments it is not always clear when these theorems can be applied, so that the convergence of Ŵn to a Gaussian with mean W n is not guaranteed for any n.Moreover, under non-iid states ρ i it is unclear whether the observed standard error σ is an appropriate estimator of the true standard deviation σ.Finally, even if the central limit theorem applies, the convergence of Ŵn to a normal distribution can be extremely slow in n (especially when the witness violation is large [25]), so that too small n will still cause Ŵn not to be Gaussian.Hence, in practice it can be difficult to justify the Gaussian assumption.
When the iid assumption (or more generally the Gaussian assumption) fails, several different problems can arise.First, it can lead to overestimation of the confidence in the witness violation based on the observed data.This happens when Pr[ Ŵn < 0] is smaller under the true distribution of Ŵn than under the estimated distribution, based on the observed ŵn and standard error σ (i.e., based on the Gaussian assumption).Second, the reported numbers ŵn ± σ lack rigorous interpretation.The empirical standard deviation σ may no longer be an appropriate estimate of the true standard deviation σ.Moreover, the mean and standard deviation do not necessarily describe the distribution of W n completely (if Ŵn is not Gaussian, the true distribution may depend on more than 2 free parameters).Because of these two effects, the number of σ's of witness violation in relation to the estimate ŵn will depend on the way Ŵn fails to be Gaussian or on how σ fails to estimate σ.This also makes the results between different experiments and physical platforms become incomparable, because the actual distribution of W n may be influenced by experimental parameters, such as the distribution of ρ i , measurement settings, hardware imperfections or the choice of witness.Hence the number of σ's of violation may also be influenced by these experimental details.
Finally, we note that measurement noise (systematic errors) can also lead to overestimation of the confidence in the witness violation.This is because any measurement noise leads to the imperfect implementation W of the witness W .In case that W n < W n , this again leads to an overconfidence in the witness violation.In fact, it can even happen that W n < 0 while W n ≥ 0, leading to falsely concluding entanglement [26].This overconfidence persists independent of the number of samples n taken, since the error is systematic.

B. Our contribution
We propose a new method of carrying out and analyzing witness experiments that addresses all of the aforementioned issues.This method applies without any assumption on the states produced by the experiment (i.e., they may be arbitrarily correlated).Moreover, it has a simple and clear interpretation, and yields figures of merit that are easily comparable between different experiments.Finally, our method takes into account imperfections of the measurement device and random setting generation, avoiding systematic overestimation of confidence.
In our method, we view the source of the states as a black box that produces a quantum state ρ i on demand.
The source can produce multiple states sequentially.All of these states are modeled by random variables that can be arbitrarily distributed and which may depend arbitrarily on the history of the experiment.That is, we allow the source to have memory.We now model the experiment as a sequential process of i = 1, . . ., n rounds.
In each round, a state ρ i and a random measurement setting (determined by the decomposition of W into locally measurable observables) are requested.The appropriate measurement is performed on the state and the outcomes are recorded for data processing.In this model of the experiment, we additionally allow for arbitrary bounded-strength noise in the measurement devices and random measurement setting generator.That is, we assume that the noise in these devices is smaller than a quantified maximum amount.Witnessing entanglement without any assumptions on the devices, an area known as self-testing [27,28], is based on Bell-type inequalities, which are typically tighter than witness inequalities (and therefore harder to violate experimentally).Thus, our model is very general and has minimal assumption to be as widely applicable as possible for analyzing witness experiments.
The main contribution of this work is two different types of data analysis for witness experiments.Both methods are valid under these extremely general assumptions (in particular the state assumptions).In the first method, we quantify the confidence that the source has the capability to produce an entangled state (i.e., a state outside S).This means that we rigorously determine the confidence that Tr[ρ i W ] < 0 for at least one i.We do this by applying the framework of hypothesis testing, in which a null hypothesis is to be rejected based on the observed evidence in experiment.In witness experiments, the null hypothesis is that the source only produces separable states (i.e.∀i : ρ i ∈ S).To reject this null hypothesis means that at least one entangled state must have been produced by the source (i.e.∃i : ρ i / ∈ S).The figure of merit to quantify the confidence in rejecting the null hypothesis is the p-value.Intuitively, the p-value is the probability of obtaining data at least as extreme as the observed data in an experiment if the experiment were governed by the null hypothesis, i.e., if the source was only able to produce separable states.A small p-value is then considered strong evidence against the null hypothesis: the observed data is very unlikely to be explained by a model that includes the null hypothesis.If p is smaller than some significance level α, the null hypothesis is rejected and we conclude that entanglement must have been produced at least once with confidence 1 − α.We shall refer to this entire procedure as a witness rejection experiment and the data analysis as the rejection analysis.This method is different from the standard methods, in the sense that here we can make a statement about at least one state, whereas typically one makes a statement about the average produced states (e.g. when estimating the average witness value W n as defined in Eq. ( 2)).We emphasize that our method and analysis applies in complete generality to arbitrary witness experiments (with an arbitrary convex set of states S and witness W ). For concreteness we will focus in this work on entanglement witnessing, but see Section V A for other examples.
In the second method we aim to estimate the average witness value W n and we provide a confidence interval around this estimate.The main contribution of our confidence interval method is that it is valid without any assumptions on the produced states and therefore always applies.We will refer to this method as the witness estimation method and the data analysis as the estimation analysis, since the objective here is to estimate W n .This method is generally applicable to estimate any Hermitian observable, not just witness operators (i.e., it is not necessary that there is a set S such that Eq. ( 1) holds).Thus our estimation method even applies to fidelity estimation and other estimation experiments.
The contributions of our work are presented in the following way.First, we formulate the round-by-round witness experiment as an entanglement witness game, expand on this description and present a formal model that governs the experiment in Section II.In the model description we incorporate imperfections in the measurement device and random setting generation in a quantitative way.Based on this model, we give a step-by-step description how to set the parameters and carry out the witness experiment in Section III A. Then, we show how to calculate a witness correction from the quantification of the imperfect experimental devices (Section III B, Theorem 1).It is used in both the rejection and estimation experiments to account for systematic device errors and prevent possible overconfidence in the experiment outcomes (rejection and estimation).Next, we provide the main result to perform the rejection analysis.This is an easy-to-compute bound on the p-value (Section III C, Theorem 2).The bound is simply evaluated from the measurement outcomes.By comparing this bound to a predetermined significance level α, we can determine whether the experiment rejects the null hypothesis with statistical significance.This allows us to rigorously conclude that the source has the capability to produce entangled states with confidence 1 − α.Finally, we provide the main result to perform the estimation analysis.This is a direct method to compute and estimate and confidence interval for the average witness value W n (Section III D, Theorem 3).The estimate and this confidence interval are also directly and easily computable from the observed measurement data.We illustrate these contributions with several detailed numerical examples in Section IV.Two of our examples are based on the simulation of Nitrogen Vacancy centers.The focus of these examples is to detect genuine multipartite entanglement between three qubits (i.e., not a convex combination of biseparable states, states separable over some bipartition of the three subsystems).Our third example (Fig. 6) shows an explicit case where the Gaussian assumption fails to be applicable and where our methods are still applicable.
The technical ingredients of this work are summarized as follows.Both results are obtained by viewing the witness experiment as a game [29], similar to Bell tests being viewed as nonlocal games.This allows us to construct (super)martingale sequences and use a concentration inequality to upper bound the tail probabilities (we use Bentkus' inequality [30,31], which is slightly tighter than the more commonly used Hoeffding-Azuma inequality [32]).This method is inspired by the analysis of Bell inequalities of Ref. [32].By choosing the appropriate (super)martingale sequences, we obtain the p-value bound for the rejection analysis and the confidence interval for the estimation analysis.

C. Relation to other work
In this work, we model the measurement noise as general as possible via the POVM formalism and determine a witness correction from this model using analytical methods to guarantee we never overestimate the confidence.Our measurement model can be viewed as a generalization of the model studied in Ref. [26], where imperfect qubit measurements are modeled by Bloch vectors that are misaligned by at most some fixed angle.In Ref. [26] a witness correction factor is computed under this more restricted noise model.However, they compute the witness correction via numerical optimization (see Section V C 2 for why we opt for an analytical bound and how the witness correction factor can alternatively be calculated using numerical optimization for our noise model).
The witness rejection experiment and analysis is new for entanglement witness experiments, but is inspired by the use of this technique for testing local realism through nonlocal games [32].We emphasize that this rejection method aims to rigorously certify that a machine has the capability of producing entanglement.This is different than typical witness experiments in literature where the objective is to estimate the average witness value [13][14][15][16][17][18][19]24].The estimation method we study here also aims to estimate the average witness value.The main difference is that most works implicitly assume that the states are iid (or at least that the estimator is Gaussian by some type of central limit theorem argument), whereas our work applies in the most general case with arbitrary noise on the state.This makes our method more generally applicable.
Closely related to the confidence interval we construct here, Ref. [33] provides a method to construct a Bayesian credible interval for an estimate of the average fidelity of experimentally prepared states to a fixed entangled state (note that is equivalent to a particular choice of witness).The model is similar in the sense that the states can be arbitrarily correlated, but the estimation objective is different: the goal of Ref. [33] is to estimate the average fidelity of the unmeasured states from the measurement of a subset of all available states.Similarly, Ref. [34] derives an efficient method to verify the production of graph states by measuring all but one copy of the state.In contrast, we measure all available states and only aim to make a statement about all the created states (after the fact).The work in Ref. [35] is related to this by giving general lower bounds on the size of a credible regions for quantum parameter estimation.
An alternative method to estimate a property of a quantum system, is by using quantum state tomography to collect measurement data, estimate a figure of merit (fidelity or witness value) and determine a confidence interval [36].However, this typically requires more measurement data than partial state characterization since the complete state is reconstructed.
Finally, we mention that there is also a way of witnessing entanglement without the need to trust the measurement devices at all (measurement-device-independent entanglement witnessing, MDI-EW) [37,38].This, however, requires each party to hold auxiliary local quantum states in each round and perform a joint measurement between the auxiliary quantum state and the quantum state under investigation.This method has been implemented in an experiment under the iid assumption [38].

II. FORMULATION AND MODEL OF WITNESS EXPERIMENTS
In this section, we will discuss the formulation and modeling of witness experiments.We will start with a brief review of entanglement witness games as known in the literature in Section II A. Next, we will generalize the game formulation to handle two additional things: (1) multiple terms in the decomposition of the witness operator may be inferred from a single measurement; and (2) measurements are allowed to be implemented by arbitrary POVMs.We explain how to do this and introduce notation in Section II B. Finally, in Section II C we give a complete description of the experimental model that underpins our experiment.This includes the characterization of noisy measurement and random setting generation devices.

A. Entanglement witness games
In this section, we will recap entanglement witness games from the literature.We will start from the assumption that a choice of witness W has been made.The quantum system under investigation is decomposed into m subsystems on which local measurements can be performed (e.g., m = 2 for bipartite entanglement witnessing).The witness operator then admits a decomposition into locally measurable observables of the form where each M (j) x is a locally measurable observable on subsystem j and where x runs over the terms in the decomposition.Note that such a decomposition is always possible.A decomposition is minimal if the number of terms over which x runs is minimal.In practice, the locally measurable observables M (j) x will often be Pauli observables.The decomposition in Eq. ( 3) is chosen such that each locally measurable observable can be easily measured in the experiment.Measurement of M (j) x yields one of the possible outcomes labeled by a j (in the case of Pauli observables, the outcomes are simply ±1).We shall denote the vector of all outcomes of the m subsystems as a = (a 1 , . . ., a m ). ( With this decomposition, an entanglement witness experiment can be formulated as a game [29].This is similar to how Bell experiments are often formulated as nonlocal games.See Fig. 2 for an illustration of an entanglement witness game.The game consists of n rounds.There are m players, one for each subsystem.At the start of each round, each player receives a subsystem of a quantum state ρ i , as well as a random measurement setting X i (we will use the conventional notation of writing random variables as capital letters and their realizations as lowercase letters).This random setting X i dictates which measurements the players should perform on their local subsystems [according to the decomposition Eq. ( 10)].Hence, upon receiving X i = x in round i, each player j perform the local measurement labeled by x and j.They then report their respective outcomes a to a referee, who assigns a score to the round according to where p x is the desired probability of realizing measurement setting X i = x.A priori, any choice of p x defines a valid game.However, the choice of p x has a significant influence on finite statistics in an experiment.We suggest a reasonable choice in Eq. ( 17) and discuss the issue further in Section V B 3. The negative sign in Eq. ( 5) is added conform the common convention that games aim to maximize score.The score can be interpreted as the contribution of round i to the witness value.Note that the score itself is a random variable S i := s(X i , A i ), since it is a function of the random measurement setting X i and the random measurement outcomes A i .The score is constructed in such a way that the expected value of the score (in the ideal scenario with perfect measurements and randomness) satisfies for all possible states ρ i .Thus, the witness expectation value is affinely related to the expected score of each round.
ρ i

Random setting
Black box source of states Score s i = s(x i , a i ) Round i of the entanglement witness game.By pressing the red button, a source produces a quantum state ρi and sends its subsystems to the players.We model the source as a black box, meaning it can produce a state by an arbitrary process.The state in round i can arbitrarily depend on everything that happened earlier in the experiment, i.e. the source is allowed to show memory effects.Then, by pressing the gray button, the random setting generator produces a measurement setting xi (almost independently from ρi).The players each perform a measurement according to setting xi and send their outcomes ai to a referee, who computes the score of that round.Afterwards, round i + 1 starts.

B. Generalization of the game formulation
In this section, we will expand on the game formulation as discussed in the previous section.In particular, we will make two generalizations.First, we will explain and introduce notation to easily infer the expectation value of multiple terms in the witness decomposition Eq. (3) from a single measurement.Doing this typically requires fewer measurements for fixed confidence so it is advantageous to do so when possible.Second, to be as general as possible in our measurement model, we shall allow every local measurement on a individual subsystem to be described by a POVM.Let us make these things more precise.
Sometimes it is not needed to measure all terms in Eq. ( 3) separately [13,22].For example, with m = 3 single-qubit subsystems, Pauli-Z measurements on each subsystem would allow to one infer the expectation values of all operators (omitting the tensor symbol) ZZZ, ZZI, ZIZ, IZZ, ZII, IZI, IIZ.
This holds in general.Measurement of m non-identity operators on all of the subsystems, would allow one to infer 2 m − 1 expectation values.We shall refer to the non-identity operators that are measured (ZZZ in this example) as the measurement setting [13,22] and refer to one or more of the possible 2 m − 1 operators whose expectation value can be computed (operators from Eq. ( 7) in this example) as observables.Through-out the rest of this work, we will denote measurement settings as M (1) x , where every M (j) x = I is not the identity, and index them with a subscript x.We will denote observables as and index them with the subscript ξ.Note that ξ may run over different (more) terms that x.Using this new notation, the witness decomposition [Eq.(3)] is thus written as To keep track of which observables (labeled by ξ) are related to which measurement setting (labeled by x), we define can be measured by the measurement setting M (1) x .Furthermore, we define b(ξ) ∈ {0, 1} m as the bitstring of length m such that In this way, each term in Eq. ( 8) is related to a measurement setting from which it can be obtained.For example, the observables IZZ, ZIZ, ZZI can all be measured by the setting ZZZ, and the corresponding bitstrings b are 011, 101, 110, respectively.Note that if all observables require a different setting, then x , thus reducing to the simple case discussed in Section II A. Using this notation, we can write Eq. ( 8) alternatively as To allow for the most general model of measurements, we will allow each M (j) x in a measurement setting to be measured by a POVM {Π (j),x a } a∈Ω (j) x with outcomes labeled by a (which take values in the finite set Ω (j) x ).That is, we will write For a standard measurement of the observable M (j) x , this decomposition is simply given by the spectral decomposition, so that the POVM elements are the eigenprojections and the outcomes are simply the eigenvalues of M (j) x .However, this is not the only option: the decomposition is not unique.In particular, the POVM need not be a projective measurement.This allows for the modeling of known non-unitary measurement noise.Suppose for example that we wish to implement the measurement setting M (j) x = Z, the Pauli Z-operator.Its standard implementation would be by a projective measurement in the |0 , |1 -basis.This corresponds to the decomposition Z = |0 0| − |1 1| in Eq. (11).However, suppose Setting M Our results take into account the additional statistical uncertainty introduced by non-projective measurements in estimating the expectation value from finite single-shot measurements on the level of single-shot outcomes.Requiring that Eq. ( 11) holds, ensures the expectation value of this non-projective measurement equals the expectation value of observable to be implemented.
With the generalizations just discussed, the score function in Eq. ( 5) needs to be generalized to The score now sums over all observables ξ obtained from the same setting x.The fact that the outcomes are raised to the power b j (ξ) reflects the fact that O (j) ξ = I if b j (ξ) = 0 (in which case all outcomes are 1).Note that the weights w ξ are labeled by ξ as they appear in Eq. ( 8), whereas the probabilities p x are labeled by the measurement setting x.This generalized score function still satisfies Eq. ( 6) and is related to the witness decomposition Eq. ( 10) via (see Appendix A for details) We give an overview of the introduced notation in Table I.

C. Model of the experiment
Above we explained that an entanglement witness experiment can naturally be interpreted as a game carried out by m players in n rounds (cf.Fig. 2).We now summarize the key properties of our model in more detailsee Table II.A mathematically rigorous formulation is given in Appendix B. Assumption (I) states that the experiment is performed sequentially.Importantly, we do not assume that the states ρ i are independently and identically distributed (iid).In fact, we allow that the i-th round depends arbitrarily on the previous rounds.Thus, the state ρ i as well as the measurement setting X i and the noisy POVMs elements of round i can be arbitrarily correlated and depend arbitrarily on the state, settings, POVMs, and outcomes of the previous rounds, as long as Assumptions (II) and (III) are satisfied.This takes into account any possible systematic error in the experiment and in particular closes the detection loophole for entanglement witness experiments (the possibility of violating the witness due to classical correlations in POVM measurements) [39].
Assumptions (II) and (III) model the devices used to perform the measurements in the experiment.We assume that the random setting generator is characterized up to some precision τ and that the measurement devices are characterized up to some δ j , as defined in Eqs. ( 15) and ( 16) respectively.In principle, τ and the δ j may all depend on the round number i, and the δ j may also depend on the measurement setting x.However, in practice, these dependencies will be small and we may safely take a maximum.Moreover, it would be extremely impractical to characterize the devices for each round specifically.The parameters τ and δ j are later used to calculate the witness correction.With this we ensure that the confidence in the witness violation is never overestimated, even in the presence of systematic device errors.
Finally, in the rejection experiment, we also need to formalize the Null Hypothesis which we wish to reject.In the case of entanglement witnessing, the Null Hypothesis is that all states produced ρ i are separable in every round i.We formulate this more generally, by letting S a convex subset of states (e.g. the separable states) such that W is a witness operator for S.This means that Tr[W ρ] ≥ 0 for all ρ ∈ S. Then we can finally state the Null Hypothesis for S mathematically as the following assumption: (H 0 ) Null Hypothesis.In every round i, the source produces a state ρ i ∈ S.
This assumption is to be rejected with statistical confidence by the experiment, as we will describe in the next section.Model Assumptions (I) Sequentiality.Rounds of the witness game are played sequentially.At the start of each round i, each player j receives one part of a joint state ρ i generated by the black box source, as well as a random measurement setting x i .Each player j performs a POVM measurement that depends on the setting x i , and reports the outcome a j to a referee, who computes the score s i of that round using Eq. ( 13).
The next round i + 1 only starts after the referee received all players' measurement outcomes for round i.The experiment is allowed to depend arbitrarily on the past.
(II) Trusted randomness.The random setting generator produces in each round i a random setting X i , whose distribution pi,x (conditioned on the history of the experiment and the state produced) is close to the desired distribution p x : We assume τ < p x for all x, so that each setting has nonzero probability of being realized.
(III) Trusted measurements.In each round i, each player j performs a noisy POVM measurement { Π(j) i,a } a∈Ω (j) that is close to the ideal POVM from Eq. ( 11): The noisy measurements have the same outcomes a ∈ Ω (j) Xi as the ideal measurements.Measurement outcomes follow Born's rule.

III. RESULTS
In this section we present the main results of our work.In Section III A we start by giving a step-by-step description of our method and give an outline on how to apply the rejection analysis and estimation analysis.Then, in Section III B, we compute the witness correction as a function of the model parameters that quantify the maximum device noise (τ and the δ j 's).Next, in Section III C we provide an easy-to-compute upper bound to the p-value, which is used to perform the witness rejection experiment.Finally, in Section III D we state how to compute a confidence interval for the average witness value Eq. ( 2).These results apply in the presence of arbitrary, possibly correlated noise on the states.

A. The design and analysis of an experiment
In this section, we detail all steps necessary to apply our framework -from the design of the experiment to the analysis of the data.In Table III we summarize our method.We now explain each step in detail.
In this work, we assume that the specific observable W has been chosen.Of course, this choice is part of defining the entire experiment.For the rejection analysis, W should be a witness for some set S [as defined by property Eq. ( 1)].With entanglement witnessing in mind, S is the convex set of separable states and W is some entanglement witness.See Section V B 1 for a discussion on how to choose a suitable W for witness experiments.
Step 1. Define the experiment.With a choice of W fixed, we now choose a decomposition of the witness W as in Eq. ( 10) (such a decomposition is not unique).A good decomposition minimizes the number of terms, while keeping each term simple to measure.
Then we choose an ideal model for the implemented measurements by describing each measurement as a POVM {Π (j),x a } a∈Ω (j) x that satisfies Eq. (11).These POVMs should model the real implementation of the local measurements as accurately as possible (we will quantify the deviation of the real measurement devices in the second step).Note that these POVMs can simply be projective measurements.
Next, we choose the desired probability distribution p x of the random measurement settings in each round.In principle, this can be chosen arbitrarily and the method will still work, but it has significant influence on the finite statistics of the experiment.We propose to choose p x as Here w ξ are the weights appearing in the decomposition Eq. ( 10).This equation can be interpreted as choosing p x proportional to the sum of absolute values of the weights |w ξ | of all observables ξ that correspond to setting x.Hence, heavy-weight terms are measured more frequently to increase the precision of estimating that term.See Section V B 3 for a more detailed discussion on choice of p x .The choices made so far define the score function Eq. ( 13) that assigns a score to each round i of the game.Finally, we fix the number of rounds n to play in the entanglement witness game, as well as a significance level α TABLE III. Outline of our method for designing, performing and analyzing witness experiments.We assume that the experiments are guided by the Model Assumptions of Table II and that an appropriate operator W is fixed.
Outline of experiment design and analysis 1. Define the experiment.Choose a a. decomposition of W of the form Eq. ( 10); b. measurement model of the form Eq. ( 11); c. probability p x of measurement settings [e.g., using Eq. ( 17)]; d. number of rounds n; e. significance level α.
For a rejection experiment, W should be a witness for some set S. Choose the Null Hypothesis to be Hypothesis (H 0 ).
3. Carry out the experiment.In each round i, record the obtained score s i using Eq. ( 13).
4a. Perform the rejection analysis.From the recorded scores, compute the total normalized score t n using Eq.(18).Evaluate the upper bound p bound (t n , n, γ) using Theorem 2, Eq.(30).If p bound ≤ α, reject (H 0 ) and conclude that the source is capable of producing states ρ / ∈ S with confidence at least 1 − α.Otherwise, the test was inconclusive.4b.Perform the estimation analysis.From the recorded scores, compute the witness estimate ŵn using Eq. ( 21).From the significance level α, compute the confidence interval radius ε using Theorem 3, Eq. ( 34).Then, one-sided confidence interval for the unknown quantity W n as defined in Eq. ( 2).
(typical values are α = 0.05, 0.01, 0.001).In the rejection experiment, the significance level determines how small the observed p bound on the p-value must be in order for us to reject Hypothesis (H 0 ).In the estimation experiment, the significance level α determines the confidence of the constructed confidence intervals around the estimate.For the entanglement rejection experiment, it is important that all these parameters, especially α and n are set before the experiment is carried out (see Section V D).
Step 2. Characterize devices.In this step, we need to characterize the measurement devices that aim to implement the POVM elements of Eq. ( 11), and the random setting generator that aims to implement p x .This characterization is done by determining suitable τ and δ j 's such that Eqs. ( 15) and ( 16) hold (ensuring that Assumptions (II) and (III) plausibly hold).In practice, this process requires calibration and characterization of the real experimental devices.From the numbers τ and δ j 's obtained in this characterization, we can compute the socalled witness correction γ = γ 1 + γ 2 using Theorem 1, Eqs. ( 25) and ( 26), in Section III B. When appropriate, one can use a first-order approximation for γ 2 given in Eq. (29).The witness correction γ is defined such that it bounds W − Wi ∞ , where Wi is the effectively implemented operator in round i and W is the ideal target operator.It is a function of the parameters τ and δ j , which quantify the device imperfections.The witness correction γ is used to protect against the largest possible systematic error in the experiment under the Model Assumptions of Table II.
Step 3. Carry out the experiment.Play n rounds of the witness game.Each round i, receive a state ρ i from the source and measurement setting X i = x from the random setting generator.Then each subsystem j performs the POVM measurement { Π(j) i,a } a∈Ω (j) x corresponding to setting x and obtains one of the possible outcomes labeled by a j .Collect all the obtained outcomes A i = a, compute and the score s i = s(x, a) using the score function in Eq. ( 13) and record s i .After the data collection has completed, one can do the analysis.We differentiate between the rejection analysis and estimation analysis.Both can be done using the same recorded data.
Step 4a.The rejection analysis.After the data collection has completed, we can determine if the experiment successfully rejected the Null Hypothesis (H 0 ) with confidence 1 − α.To do so, we compute the total normalized score t n , defined by where ∆s := s max − s min and Step 4b.The estimation analysis.From the collected data, we can also estimate the average witness value W n as defined in Eq. ( 2) and construct a rigorous confidence interval for around the estimate in the presence of arbitrary noise.The average witness value is estimated by the estimator This estimator can, in the absence of noise on the measurements and random number generation, be seen as an unbiased estimator of W n by Eq. ( 6).In the presence of unknown noise, the bias of the estimate can only be bounded by the witness correction γ (see the discussion in Section III B).Using γ, we can compute the radius ε of the confidence interval using Eq.(34).By Theorem 3, the interval I( ŵn ) = [ ŵn − ε, ŵn + ε] is a (1 − 2α) two-sided confidence interval and J ( ŵn ) = (−∞, ŵn +ε] is a (1−α) one-sided confidence interval for W n .If W is a witness for the set S in the sense of Eq. ( 1) and if ŵn + ε < 0, then one can conclude that W n < 0, meaning that on average states outside S must have been produced, i.e.
with confidence at least 1 − α.We emphasize that the intervals I, J are corrected for systematic (measurement and random setting generation) errors within the Model Assumptions via the witness correction γ of Theorem 1 (since ε depends on γ) and that it is statistically rigorous for arbitrary state noise.

B. Computing the witness correction
In this section we present Theorem 1 to compute the witness correction γ as a function of the randomness and measurement imperfection parameters τ, δ j determined in Step 2 of Table III.The imperfect implementation of the measurements and random number generator will lead to an effectively implemented operator Here Πx i,a is the expected implemented joint POVM in round i, conditioned on the history of the experiment, the state produced, and the event that X i = x (which happens with probability pi,x ).See Appendix C for a precise definition.Note that this effectively implemented operator Wi is closely related by the ideal witness operator W by comparing to Eq. ( 14).Indeed, the ideal random setting distribution p x is replaced with the implemented distribution pi,x (which differ little by Assumption (II)) and the ideal POVM elements Π are replaced with the conditional expected implemented joint POVM elements Πx i,a (which differ little by Assumption (III)).The witness correction γ we derive in Theorem 1 precisely captures how much Wi can deviate from W within the Model Assumptions of Table II.
Theorem 1.Let W be a Hermitian operator [not necessarily a witness in the sense of Eq. ( 1)] with decomposition and ideal implementation given by Eqs.(10) and (11).Suppose the experiment is modeled by the Model Assumptions in Table II.Define the effectively implemented operator Wi by Eq. (23).Then, in every round i, where the witness correction γ := γ 1 + γ 2 is the sum of the random number generation correction γ 1 and the measurement correction γ 2 defined by respectively, in terms of (j) |a| and λ (j) The proof is given in Appendix C. Let us first explain why we call the quantity γ the witness correction.An important consequence of Equation ( 24) is that for all ρ i .This means the witness inequality Tr[W ρ i ] ≥ 0 implies that Tr[ Wi ρ i ] ≥ −γ.Thus, if W is a witness, then the operator Wi + γI is also a witness.Hence, γ is the witness correction in the sense that any effectively implemented witness Wi corrected by γ is still a witness.We emphasize that this result does not say anything about the effects of finite statistics, but is solely about the required correction of expectation values due to imperfect devices.That is, the factor γ protects against potential systematic errors in an experiment.The witness correction γ has two terms, γ 1 and γ 2 .The term γ 1 quantifies the correction due to imperfect random number generation.The constant γ 2 quantifies the correction due to measurement errors.Thus, γ can be interpreted as the total correction required if the witness W is implemented with noisy measurements and with an imperfect number generator.Note that the choice of p x influences the correction γ 1 , as the score function Eq. ( 13) depends on p x .
The measurement correction γ 2 has a simple first-order approximation under the assumption that λ (j) ξ = 1, making it easier to compute.This assumption means that all measurement operators have eigenvalues in the interval [−1, 1] and is satisfied for example by all Pauli operators.Then a first-order approximation for γ 2 is where is a constant such that ≤ for all ξ, j.Hence, this is a good approximation if 1.This is typically the case when δ j 1, which means that the measurement devices have been well-characterized.In Section V C 2, we discuss a possible alternative method for deriving γ.

C. Bound on the p-value for witness rejection experiments
In this section, we give the main result to perform the rejection analysis in Theorem 2. The theorem provides an easy-to-compute upper bound on the p-value under the Null Hypothesis (H 0 ).Recall that the p-value is the probability of observing a total normalized score T n under the Null Hypothesis (H 0 ) that is at least as large as the observed total normalized score t n in the experiment, p = Pr[T n ≥ t n |H 0 ].If the p-value is smaller than a previously chosen significance level α, then we may consider the Null Hypothesis (H 0 ) to be statistically unlikely to explain the observed t n , and we may reject the model at significance level α.To determine if p ≤ α, we put an upper bound p bound on p in Theorem 2, which can be compared to the significance level α.
Theorem 2. Let W be a witness operator [satisfying Eq. (1)] for the set S with decomposition and ideal implementation given by Eqs.(10) and (11).Suppose that the experiment is governed by the Model Assumptions of Table II and consider the Null Hypothesis (H 0 ) with respect to S. Let t n denote the observed total normalized score after n rounds in the experiment.Then, the p-value as defined in Eq. ( 20) is upper-bounded by where is the log-linear interpolation of the survival function of a binomial distribution with parameters n and β, and where Finally, x is the largest integer less than or equal to x.
We give a detailed proof of Theorem 2 in Appendix D. We construct a supermartingale sequence from the total normalized scores up to each round i.We then apply Bentkus' inequality [30,31] (a concentration inequality for bounded difference supermartingale sequences, similar to, but tighter than, the Hoeffding-Azuma inequality) to obtain an upper bound for the p-value.Our proof is inspired by the approach of [32] to certify Bell violations.

D. Confidence intervals for average witness estimation experiments
In this section, we give the main result for the witness estimation analysis in Theorem 3. The theorem provides confidences interval for the average witness expectation value W n as defined in Eq. ( 2).The point estimate for W n is given in Eq. ( 21), and is a function of the scores recorded in the experiment.We construct a (1 − 2α) two-sided confidence interval I and a (1 − α) onesided confidence interval J .Theorem 3. Let W be a Hermitian operator [not necessarily a witness in the sense of Eq. (1)] with decomposition and ideal implementation given by Eqs.(10) and (11).Suppose that the experiment is governed by the Model Assumptions in Table II.Let Ŵn denote the average witness estimate as defined in Eq. (21).
Here γ is defined in Eqs.(25) and (26), and F • n,β is defined in Eq. (31) (with β = 1  2 and e ≈ 2.72).Then, the intervals are a (1−2α) two-sided and a (1−α) one-sided confidence interval respectively for the average witness value W n as defined in Eq. (2).That is The proof of this theorem is given in Appendix E. The confidence interval is also based on the construction of a martingale sequence and the application of Bentkus' inequality.The techniques are very similar to the proof of Theorem 2. We chose to use Bentkus' inequality because it is tighter than the more standard Hoeffding-Azuma inequality [32].The radius of the interval ε is however slightly more difficult because it involves (numerically) solving Eq. (34).See Section V C 3 for a brief discussion on this.

IV. EXAMPLES AND ILLUSTRATION
In this section, we will illustrate our results with two examples based on simulations of a proposed entanglement witness experiment in Nitrogen Vacancy (NV) centers.Moreover, we will give a concrete example in which the iid and Gaussian assumptions fail.Finally, we shall illustrate how the function eF • n,β of Eq. ( 31) scales in its arguments and parameters.This function determines the p-value bound and the confidence interval size in our results.Before we present these examples, we briefly describe the physical system that we aim to simulate in Section IV A. This section serves as a motivation for our simulation, but the examples can also be understood without knowledge of the physical system we simulate.Then we present the two examples.In the first example (Section IV B), we describe how to apply our method in detail, outlining all the steps in Section III A in a concrete example.For this example, we simulate a single experiment with identically distributed states ρ.In the second example (Section IV C), we illustrate our method for noniid states.To do so, we will perform a large Monte Carlo simulation of many independent experiments.In each experiment, we use a sequence of three-qubit states ρ i that are neither independent nor identically distributed.Then, in Section IV D, we give an artificial example of non-iid states in which the Gaussian assumption fails considerably.This example shows that a Gaussian assumption, on which the central limit theorem relies in prior work, need not always be justified (cf. the discussion in Section I A).Our method applies regardless of the validity of a Gaussian assumption.Finally, in the Section IV E we illustrate how the function eF • n,β (x) defined in Eq. (31) [which directly determines the p-value bound Eq. ( 30), and the confidence interval Eq. ( 34

A. Simulation details of Nitrogen Vacancy systems
Both examples in Sections IV B and IV C are based on a scheme for generating tripartite GHZ states in three physically separated Nitrogen Vacancy (NV) centers in diamond (see Ref. [40] for a review of this system).In these NV centers, the electronic spin associated with the defect can be used as qubit.This qubit is optically accessible and can be entangled with the presence or absence of a photon, which can be used as a flying qubit.Surrounding the NV center there are several Carbon-13 atoms (1.1% natural abundance).Their nuclear spins can be used as additional qubits, which can be controlled via the hyperfine interaction between the nuclear and electronic spins.
Two NV centers are entangled in the following way [19]: First, each NV center produces a spin-photon entangled pair, where the qubit state is encoded in the absence/presence of a photon.The joint state of the spin-photon pair is then given by where z is a tunable parameter.By coupling the photons into single mode fibers and interfering them using a beam splitter, the two electronic spins can become entangled.
In essence, this amounts to detecting the presence of a photon but erasing the information about which arm the photon came from.This single click entanglement (SCE) scheme is illustrated in Fig. 3a.The joint state between the two electronic spins is now (ideally) where Here, θ is a relative phase that needs to be characterized and controlled experimentally to create useful entanglement.
To generate a tripartite GHZ state, two EPR pairs are combined into a single GHZ state in the following way: First create one EPR pair between two NV centers using the electronic qubits.Then, one node swaps the state of the electronic spin with a nuclear spin qubit, so that the electronic spin becomes free again for entanglement production.At this point a second EPR pair is produced between the now-free electronic spin of this node and a third node.The GHZ state is then created by coupling the nuclear spin and electronic spin in the middle node and measuring the electronic qubit.This results into a state that is equivalent to a GHZ state under local Pauli operations (the Pauli operations can depend on the observed measurement outcome).This procedure is sketched in Fig. 3b.
When simulating this procedure, we account for several noise processes.First, we include noise in the generation of the EPR pairs.Our model for EPR generation follows the noise model for SCE generation developed in [19].This model incorporates several independent noise parameters: the single photon detection efficiency p det (the probability of detecting a photon in the heralding station, conditioned on it being emitted from the NV center), the distinguishability V of the emitted photons, the double excitation probability p 2ph (when more than one photon is emitted by an NV center in single entanglement attempt), the probability of dark counts p dc , as well as an uncertainty in the relative phase θ that is modeled by applying a Pauli-Z to one of the qubits with probability p θ .We assume that the detection efficiency is the same for all three setups, and furthermore assume that in each SCE scheme symmetric values for the free parameter z [see Eq. ( 39)] are used.However, this parameter may be different for the first and second EPR pair.We refer the reader to Ref. [19] for full details on the SCE generation model.
On top of the detailed SCE noise, our model assumes dephasing noise on the first EPR pair while it is kept in memory, waiting for the successful generation of the second EPR pair.The off-diagonal terms in the density matrix of the first EPR pair are multiplied with dephasing parameter q = 1 − exp(−N max /ν), where N max is a free parameter determining the maximal number of attempts before we discard everything and start over (because the first EPR pair has decohered too much) and ν is a parameter quantifying the strength of the dephasing noise.Finally, we assume that all single-and two-qubit gates are performed with unit fidelity.In Section IV B we instantiate this model with representative numerical values for all model parameters [see Table IV] to produce the simulated experimental data.

B. Step-by-step application of our method
In our first example, we illustrate our method on data produced by a simulation of iid states on three qubits (m = 3).The aim is to witness a genuine tripartite entanglement by producing a GHZ state: Therefore, we let S be the set of biseparable states, i.e., the set of all states ρ that are a mixture of separable states on any bipartition of the three subsystems.To witness a state not in S (and reasonable close to a GHZ state), we use is the projection witness, given by This factor 1 2 is known to be optimal for the GHZ projection witness [20].Note that c = 1 2 in Eq. ( 8).In fact, it is easily observed that c = 1 2 − 1 d = 3 8 here (since d = 2 m = 8 in this example).We will now describe all steps of Section III A to illustrate how to fully define the experiment, obtain the (simulated) data, and calculate the resulting p-value and confidence interval of the experiment.
Step 1a.The first step is choosing a decomposition of our choice of W [Eq. ( 42)] of the form Eq. ( 10) This witness has a four-setting and five-setting decomposition.We will use the five-setting decomposition into local Pauli observables: where {I, X, Y, Z} are the Pauli operators (including the identity operator) and the tensor symbol is omitted for clarity.Thus, Eq. ( 43) is a decomposition of the form Eq. ( 8) with c = 3 8 .We shall label the seven non-identity, traceless observables by ξ = 1, . . ., 7 in the order of their appearance in Eq. (43).
There are only five measurement settings needed for the decomposition in Eq. (43).
These are {ZZZ, XXX, XY Y, Y XY, Y Y X}, since the first three observables (ξ = 1, 2, 3) can all be computed from the first measurement setting ZZZ (x = 1), as discussed in Section II B. Therefore we have x = 1, . . ., 5, indexing the settings.The formal mapping between observables and measurement settings is given by Step 1b.Next, we specify a model for measuring each of the Pauli observables that occur in the chosen measurement settings.In our simulation, we shall model each measurement as systematically imperfect, meaning that X, Y and Z is not implemented by the usual projective measurements.Instead, we model all Pauli measurement by POVM elements, parameterized by two parameters u, v ∈ [0, 1], which characterize the efficiency of detecting the +1 and −1 eigenstate of the Pauli operator, respectively.In experiments, these numbers are referred to as the readout fidelity [19,41].Concretely, for the Pauli-Z measurement on each subsystem, we model (dropping the subsystem index j for notational compactness) the measurement by the POVM elements The X and Y POVM elements are defined by where the H and K gate are the gates that rotate the Z to the X and Y basis respectively.The two outcomes of all Pauli measurements are These values are chosen in such a way that for all Pauli's P = X, Y, Z, so that the measurement operators correspond to the desired observables [according to Eq. ( 11)].In our model, we will set u = 0.95 and v = 0.99.This results in a + ≈ 1.1064 and a − ≈ −1.0213.Step 1c.Next, we choose p x according to Eq. ( 17).That is, we choose This now fully defines the score function [per Eq. ( 13)]: Note how the observables IZZ, ZIZ and ZZI are combined into one measurement setting ZZZ which directly contributes to the score.
Step 1d-e.Finally, we fix the total number of rounds to be n = 600 and set α = 0.05.
Step 2. We characterize our (simulated) devices to have a RNG bias τ = 10 −6 and measurement imperfection as compared to the model described in the previous step of δ j = δ = 2 • 10 −3 for all parties j = 1, 2, 3.The value of δ is determined from the uncertainty in the measurement characterization of the NV system.The value of τ is chosen sufficiently large for any practical implementation of randomness.
With these values of τ and δ, and the score function Eq. (51), the witness correction γ can be computed from Theorem 1.The random number generation correction γ 1 is computed using Eq. ( 25) to be The measurement correction γ 2 is computed using Eqs.( 26) and (27).First, we compute (j) ξ from Eq. ( 27).We find that (53) Furthermore, from Eq. ( 43), it is clear that λ  1, we use the approximation Eq. ( 29) instead of the full Eq.( 26).We compute that Hence we find that Step 3. In this step, the experiment is simulated.We play n rounds of the entanglement witness game.In our simulation, we take the same state ρ i = ρ in each round, corresponding to an iid situation.This state is computed using the model of tripartite GHZ generation in remote NV centers as discussed in Section IV A. The valued of the parameters we used in the simulation are given in Table IV.The resulting state ρ is given in Table V.In addition to the state ρ i = ρ, we also generate a random setting x i ∈ {1, 2, 3, 4, 5} with probability given by Eq. (50).The randomness is generated by a standard pseudo-random number generator.For efficiency reasons, we use the ideal POVM elements to simulate the measurement outcomes.We nevertheless use a nonzero value of δ to illustrate the effect of measurement noise on the witness correction.The set of measurement outcomes a i is obtained according to Born's rule.From this, we compute and record the score of this round s i = s(x i , a i ) using the score function Eq. (51).
Step 4a.In this step, we calculate the upper bound p bound to the the p-value using Theorem 2. This is straightforwardly done using γ, n, c, s min , s max and the total normalized score t n by evaluating Eqs.(30) to (33) of Theorem 2. The quantities s min and s max as defined in Eq. (19), can be computed from the score function Eq. (51).In our example we find s max = −s min ≈ 1.185.The total normalized score t n is computed from the recorded scores s i for i = 1, . . ., n, according to Eq. ( 18).To compute p bound from concrete numbers, suppose the a single simulation of the experiment yields a total normalized score of t n = 440.97.To compute the p-value bound, we first calculate β using Eq.(33).We find (using γ ≤ 0.01) that β ≈ 0.662.Hence, we can evaluate our p-value bound, Eqs. ( 30) and (31), by Since p ≤ p bound ≤ α, we have rejected the Null Hypothesis (H 0 ) at significance level α = 5%.As we trust that the Model Assumptions (I) to (III) hold, we can reject Hypothesis (H 0 ) and conclude that our (simulated) source of quantum states is capable of producing genuinely multipartite entangled states ρ * / ∈ S.This is indeed the case, since ρ defined by Table V is not biseparable (i.e., not in S, since Tr[W ρ] < 0).
To illustrate how the p-value evolves as more rounds are played (more data is taken), we plot in Fig. 4 the running value of p bound computed with the total normalized score up to round i, as a function of i.Up until round i ≈ 520, the bound remains constant at its maximal value e ≈ 2.72, due to the prefactor of e in Eq. (30).In this regime there is insufficient data to draw any conclusions.Then our p-value bound starts decreasing roughly exponentially in i.The jitter can be explained by the randomness due to measurement settings and outcomes.
Step 4b.Finally, we illustrate how to estimate W n as defined in Eq. ( 2) and compute the confidence intervals around this estimate using Theorem 3. The witness estimate is directly computed from the observed scores and c using Eq.(21).Suppose that a run of the experiment yielded scores such that ŵn = −0.182[this is consistent with t n = 440.97by comparing Eqs. ( 18) and ( 21)].Now the radius of the confidence interval can be computed from α = 0.05, n = 600, γ = 0.01 and ∆s = 2.370 by solving Eq. ( 34) numerically.We find that ε ≈ 0.216 in this example.Hence, we find the 90% two-sided confidence interval and the 95% one-sided confidence interval I 0.9 = [−0.398,0.034], J 0.95 = (−∞, 0.034], (57) respectively.This is just insufficient to conclude that on average ρ was not in S (i.e., genuinely multipartite entangled).However, we can compare these intervals to the true value W n = −0.172(see Table V and recall that ρ i = ρ in this example).Clearly W n ∈ I 0.9 and W n ∈ J 0.95 .Moreover, the point estimate ŵn is not far off the true value.

C. Illustration of method for correlated noise
In the previous example we gave a step-by-step illustration of how our method is applied to data gathered in a single (simulated) experiment with identical states each round.Since our method is applicable in general, we now illustrate its use for states ρ i that are neither independent nor identically distributed.As before, we will consider states ρ i that are created from two EPR pairs, each of which has a relative phase θ (see Eq. ( 40) in Section IV A), but we now assume that the relative phase θ drifts between subsequent rounds following a random walk.That is, the relative phase in round i depends on its value in round i − 1, which creates correlation between the rounds.This modeled phase drift is motivated by the observation in experiments that the relative phase θ changes over time due to fluctuation in the optical path length [19].The simulation uses the same parameters as in Section IV B, Table IV, except that now p θ = 0. Instead, in the first round, we model (both EPR pairs) with known initial phase θ 0 = 0 • .Then, each round, the phase drifts with step size ∆θ = ±0.98 • (for each EPR pair the drift is an independent random walk).This creates dependence and correlation between the rounds.Furthermore, we use the same witness [Eq.( 42)] and employ the same measurement model with witness correction γ ≤ 0.01 [Eq.( 55)] as in Section IV B.
Instead of performing a single simulation with this noise model, we perform a Monte Carlo simulation with N = 2 • 10 4 repetitions of the experiment, each with n = 600 rounds.For each repetition, we calculate the witness estimate from the observed score using Eq. ( 21).We emphasize that in this simulation, each repetition also a different W n is realized, because the collection of ρ 1 , . . ., ρ n can be different every time.We also compute a bound on the p-value for each repetition of the experiment by using Theorem 2. We thus obtain N = 2 • 10 4 witness estimates and bounds on the pvalue from our Monte-Carlo simulation.
In Fig. 5 we plot histograms of the witness estimates ŵn and p-value bounds p bound , both of which are directly computed from the observed scores s 1 , . . ., s n (via Eq. ( 21) and Theorem 2, respectively).We emphasize that all s i are realizations of a random variable S i , determined by the random measurement settings X i , the random measurement outcomes A i (following Born's rule), and the correlated random states ρ i produced by the source.The produced states were not necessarily biseparable (i.e., not necessarily in S).
In Fig. 5a we plot a histogram of the witness estimates ŵn and compare to a Gaussian reference curve.We clearly see that this noise process can lead to a non-Gaussian witness distribution.The skew in the plot is mainly due to the fact that different true average witness values W n = 1 n Tr[W ρ i ] are realized in each run of the experiment.The figure illustrates that under non-iid noise, the standard Gaussian prediction ŵn ± σ ŵn does not accurately predict future repetitions of the experiment.We emphasize that the plot in Fig. 5a can not be interpreted as the probability distribution of ŵn under a particular, fixed sequence of states ρ 1 , . . ., ρ n (because each Monte Carlo iterate used a different sequence of states), and hence no inference could be made from this plot about the uncertainty of ŵn .
In Fig. 5b we plot a histogram of the p-value bounds.Note that the p-value is plotted logarithmically on the xaxis, making the bins of unequal size.We show the significance level α = 0.05 as a red line, which in this example turns out to be the 95 th percentile.Thus, in 95% of the simulated experiments, the Null Hypothesis (H 0 ) was rejected with statistical confidence.Note that it is merely a coincidence that α = 0.05 marks the 95% percentile of the distribution of p bound over the N simulated experiments.We emphasize that the p-value itself [as defined in Eq. ( 20)] can not be inferred from the simulation results.This is because the p-value is a statement about the distribution of the total normalized score T n assum-ing that the Null Hypothesis (H 0 ) holds.But here the states produced are not necessarily in S, so Hypothesis (H 0 ) is violated.See Section V D for more detailed discussion.

D. Example where iid assumption fails
In the previous section we saw an example where the states within each experiment were generated by a noniid noise process, and we discussed the corresponding distribution of the witness estimate and p-value bound across different runs of the non-iid experiment.While the witness estimate ŵn was non-Gaussian, this was largely explained by skew in true average witness value W n .In this section we give an explicit example where the iid assumption (more generally the Gaussian assumption) is inappropriate even when the average witness value W n is fixed.This example is not based on simulation of NV centers, but is based on a noise model that is designed to clearly exhibit non-iid behavior.
In this example, we generate the states according to a noise process in which the source of states produces a perfect GHZ state in a fraction F of the n rounds and produces an orthogonal state in the remaining (1 − F )n rounds.Importantly, the fraction of good states F is held constant in this model.This is the only source of correlations between the states in a single run of the experiment.This model is an extreme case of the more realistic scenario in which the source intermittently produces good and bad quality states.These bad quality states can for example be produced when a heralding signal (a signal witness experiments (each with n = 600 rounds).The histogram shows the distribution of the witness estimate ŵn computed from the observed scores using Eq. ( 21) in blue.This approximates the true distribution of ŵn.The 95% interval is shaded in blue.The true average witness value is W n = −0.172 in each run of the experiment, which coincides with the mean (and median) ŵn.The red curve is shows the inferred distribution from the data (the scores) observed in the median run when the data is assumed to be iid.That is, the red curve is a Gaussian with mean ŵn and standard deviation σ ŵn as computed from the observed scores under the iid assumption in this particular run.The central 95% interval of this Gaussian is shaded with red lines.Clearly, the distribution inferred from the iid assumption does not match the true distribution of ŵn.
that indicates entanglement was created) is falsely triggered.Note that the fraction F of good states equals the true average fidelity to the GHZ state, and with our choice of witness operator W [according to Eq. ( 42)], the true witness value is In this example, we fix F = 0.672, which yields a true average witness value is W n = −0.172.In Fig. 6 we plot a histogram of the witness estimate ŵn , computed using Eq. ( 21), under the described noise model.The histogram represents N = 2 • 10 4 independent simulated experiments (each with n = 600 rounds).We also plot the Gaussian distribution that would be obtained under the iid assumption on the median run.For this run, the mean and standard deviation ŵn ± σ ŵn are computed from the observed scores s 1 , . . ., s n in that run.In particular, for the red curve we observed ŵn ± σ ŵn = −0.171± 0.026.
The 95% central interval is shown as the shaded region under the red curve.This region is [ ŵn − 2σ, ŵn + 2σ] = [−0.223,−0.120].However, the 95% central interval found from the Monte Carlo simulation is only [−0.205, −0.138].Hence, the 95% central interval estimated under iid assumption (red shaded region) is roughly 1.5 times larger than the 95% central interval found from the Monte Carlo simulation of ŵn (the medium blue region).This example clearly shows that in this scenario the iid assumption fails.

E. Scaling of the p-value bound
Finally, we analyze how our bound p bound = eF • n,β (t n ) [see Theorem 2, Eq. ( 30)] scales with its three free parameters: the total normalized score t n , the number β (which depends on the witness correction γ, see Eq. ( 33)) and the number of rounds n.Note that this also directly illustrates how the size of the confidence interval scales with the significance level α according to ].We discuss the scaling here from the perspective of the rejection analysis.Then it is qualitatively clear that a larger observed score t n and smaller γ (and hence smaller β) should lead to statistically more significant results (meaning a smaller p bound ).Similarly, a larger number of rounds n should lead to a smaller uncertainty and hence make it less likely to observe extreme data under the Null Hypothesis (H 0 ).We show that this is indeed the case in Fig.In the left plot of Fig. 7, we see that it appears that p bound ∝ e −t 2 n for fixed n in the regime t n ≥ nβ.Even if this is not exact, p bound at least appears to decrease (super)exponentially in t n .The p-value bound is trivial in the regime t n ≤ nβ.This is understood to mean that nβ is a total normalized score that is likely to be achieved under the null hypothesis.In the middle plot, we see that p bound seems to decrease exponentially with n as well and that the asymptotic decay rate depends on the mean normalized score tn n .This is also expected behavior, which for example also holds in the iid case.In the right plot, we focus on the effects of increasing β.The p-value bound increases (sub)exponentially with β.It also seems that an increment of β (which is due to increased γ from noisy devices) can be directly compensated by observing an increased t n , since the three curves seem to be identical but displaced in the x-axis.n,β (tn), which determines the bound on the p-value Theorem 2 and the radius of the confidence interval Theorem 3, with the total normalized score tn [Eq.( 18)], the number of rounds n and β [which depends on γ, see Eq. ( 33)].The black dots correspond to the simulation discussed in Section IV B, where n = 600, β = 0.662 (using γ = 0.01), and tn = 440.97,yielding p bound = 2.1 • 10 −4 .

A. Applicability of our work
The methods we have discussed throughout this work were frequently motivated by and illustrated with entanglement witness experiments.Here we elaborate how generally applicable all of our results are.First, the estimation experiment can in principle be done for any Hermitian observable W .In Theorems 1 and 3 we do not require that the operator is a witness; we only need to write W in the form of Eqs.(10) and (11).Note that this is in principle always possible (although for general W , the number of terms labeled by ξ might grow exponentially with the total system dimension).This allows one to perform the experiment and do the estimation analysis of Step 4b. in Table III.This yields a point estimate and confidence interval for estimating the average value W n as defined in Eq. ( 2), which is valid without any assumption about the sequence of states ρ 1 , . . .ρ n produced sequentially in the experiment.
Second, the rejection analysis of the experiment is valid for any witness W that satisfies Eq. ( 1) for some convex subset of states S. The Null Hypothesis we aim to reject is always the hypothesis that the source can only produce states ρ ∈ S, conform Hypothesis (H 0 ).Therefore, the experiment can witness the creation of any state ρ * ∈ S, given suitable choices of W and S. Examples include witnessing different types of entanglement, defined by nonmembership of a particular separability class S. A frequently encountered separability class is the set S k of kseparable states.These states are a convex combination of pure states that are separable over some k-partition of the subsystems [42].Our examples in Section IV focused on k = 2, the class of biseparable states.In certain applications it might also be important to certify entanglement across a particular partition of subsystems, which leads to a different definition of S. In this case, the set S describes separability between any specific partitioning of the system.Corresponding witness operators for graph states have been identified in [43].Finally, S can also be chosen to include specific classes of entangled states, so that non-membership signals that the entanglement is not of that particular class.As an example of this, S can be chosen as the convex hull of all biseparable states and W -class states [42].All of this can be tested with our method by accordingly defining S in Hypothesis (H 0 ).Of course, this also requires one to find a suitable witness operator W that separates S from some particular state ρ * that one aims to witness [according to Eq. ( 1)].

The witness operator W
The choice of witness operator is another integral part of the design of the experiment.It affects the number of trials n needed in an experiment to attain a certain pvalue given some target state ρ * .It affects the (expected) p-value bound or interval size ε one would observe in an experiment given finite n.There are two different mechanisms behind this.First, the choice of W influences the decomposition Eq. ( 8), and in particular the number of terms that appear in this decomposition.A smaller number of terms directly implies more statistically significant results.This is intuitively understood since the budget of n data points can now be used to measure less observables in the decomposition, meaning each can be estimated more accurately.Second, for witness rejection experiments where some (average) ρ * is assumed to be produced, one would ideally like to minimize the p bound over all witness operators W for fixed ρ * and n.This task is extremely difficult and impractical.Instead, one often employs the heuristic of minimizing the witness violation.That is, one seeks to find the minimizer W of the following optimization problem min This witness will then give large contributions to the witness violation, hence making it statistically less likely to observe such a violation with finite number of states from S. The downside of this is that it only optimizes for the magnitude of the violation in case of infinite statistics.In particular, it does not take into account the number of terms in the decomposition.Many different methods for entanglement witness design have been discussed extensively in literature [20,22,23].Most methods rely on some form of numerical optimization to find a witness with particular attractive properties such as efficient decomposition, maximal violation with a given state or large noise tolerance.For pure states, the most widely used witness operator for genuine multipartite entanglement of states close to |ψ is the projection witness W = λ 2 I − |ψ ψ|, where λ is the maximum of the Schmidt coefficients of |ψ when all bipartitions are considered [20,22].This is also the choice that we made in Section IV for |ψ = |GHZ [see Eqs. ( 42) and ( 43)].

The number of rounds n
An important design parameter of a witness experiment is the number n of rounds to be played.This is often delicate, as too small n renders the entire experiment yielding no conclusion (see Fig. 4) but too large n can be overly costly in experiment resources.To get a good estimate of n that produces small enough p bound without being unnecessarily expensive, the experimenter must have a good understanding of the quantum states being produced and the measurements being performed (e.g., from a theoretical model of the experiment).With this, the experiment can be simulated, so that one can get an estimate of the distribution of the total normalized score T n .Then pick n such that you are very likely (e.g., in 90% of the experiments) to realize a t n that evaluates with known or estimated γ to a p bound less than α (for the rejection experiment) or that evaluates given α to a confidence interval that is sufficiently small (e.g.such that the entire interval is smaller than zero).

The target distribution of measurement settings px
In principle, the choice of probability distribution over the measurement settings p x is left free in this work.Each choice leads to a valid witness experiment from which the conclusion can be drawn rigorously.However, bad choices of p x will lead to suboptimal bounds on the p-value or suboptimal confidence intervals for equal experimental cost.An interesting question is what choice of p x yields the smallest p-value bound (resp.confidence interval) given fixed n and γ.This question is hard to address, since p x influences both the distribution of T n and the value of ∆s (which enters in Theorem 2 via β and Theorem 3 via Eq.( 34)).Based on the underlying concentration inequality (Bentkus' inequality, see Lemma 1 in Appendix D for details), we conjecture that p x should optimally be chosen to minimize ∆s.This conjecture holds in the simulation of our experiments.Intuitively, this requirement ensures that S i − S i−1 is often large compared to ∆s.However, finding a distribution p x that minimizes ∆s is not a simple In small instances, it can be solved by brute force search.If all possible outcome sets are Ω (j) x = {1, −1}, then the p x that minimizes ∆s is given by our recommended choice in Eq. (17).
C. Possible modifications of our method

Model modifications
It is not hard to incorporate different ways of characterization the hardware devices [modifying Assumptions (II) and (III)].For example, the bias in random measurement setting generation can be quantified by different q -norms than the ∞ -norm considered here.Similarly, there are many other ways of characterizing noisy measurement devices.For example, one may model and characterize noisy measurements by a small misalignment angle, defined as the angle between the Bloch vectors of the ideal and the noisy measurements [26].Any modification of the way hardware devices are characterized, requires that Theorem 1 is modified accordingly to calculate an appropriate correction γ.We emphasize that this is the only part that needs modification.Theorems 2 and 3 can directly be applied with the new bound γ.

Alternative method for computing γ
In the proof of Theorem 1, we applied an analytical method to find the witness correction γ.This bound is not necessarily tight in all cases.If the correction γ is too large for a practical experiment, one can see if tigher bounds can be found numerically.The optimization problem for gamma under Hypothesis (H 0 ) and Assumptions (II) and (III) of the model takes the form where the minimization over px is constrained by Assumption (II) and the minimization over W is constraint by Assumption (III).This optimization is not always easy to carry out.For example, optimizing over S is hard in general (e.g., when S is the set of separable states), al-though in low dimensions one can often resort to using PPT relaxations [44,45].Furthermore, the set of feasible W is not convex, and the objective function Tr[ W ρ] is bilinear in the variables.For non-convex problems, general nonlinear optimization methods usually only find local optima which may result in witness corrections γ that are smaller than justified.

Using Hoeffding-Azuma instead of Bentkus
In this work, we have chosen to use Bentkus' inequality to bound the relevant tail probabilities in Theorems 2 and 3, because this is often slightly tighter than the more common Hoeffding-Azuma inequality.We have stated Bentkus' inequality in Lemma 1 in Appendix D. Whenever this lemma applies, the Hoeffding-Azuma inequality also applies.So if R 0 , . . ., R n is a supermartingale with n .By replacing this in the right places in the proofs of Appendices D and E, we can modify the statements of Theorems 2 and 3.In particular, Eq. ( 30) can be modified to with β still defined as in Eq. ( 33), and Eq. ( 34) can be modified to Especially this latter modification can be useful, since this makes it easier to compute the confidence interval in Theorem 3.This is because the above equation can be explicitly solved for ε to find One may choose to use these equation instead of the theorems as stated, but generally this could come at a slight cost in the tightness of the bounds.

D. General remarks regarding hypothesis testing and the p-value
Our rejection analysis of entanglement witness experiments fits into the general framework for hypothesis testing in statistics.As always, it is important that all relevant parameters are fixed prior to the experiment being carried out.In our case, this concerns the quantities defined in steps 1. and 2. of Section III A. Most notably, this includes the number of trials n and the significance level α, but also the witness operator, its decomposition, and all other parameters of the Model Assumptions.This also implies that one may not simply combine data from multiple experiments (e.g., extending experiments by further trials until a desirable outcome is achieved).Instead, p-values of different experiments can be combined using known statistical methods [32,46], or one can carry out a larger, independent experiment instead.
In the framework of hypothesis testing, the p-value is defined as the probability of observing a test statistic T (in our case, the total normalized score T n ) under the null hypothesis (in our case Hypothesis (H 0 )) that is at least as large as the observed value t: If the p-value is smaller than a previously chosen significance level α, then the null hypothesis is considered to be statistically unlikely to explain the observed t, and we reject the null hypothesis at significance level α.This means that the p-value is a statement about the observed data in relation to a hypothesis about the distribution of T .In particular, it does not give any information about the distribution of T if the null hypothesis does not hold.Neither should the p-value be interpreted as 'the probability that the null hypothesis was valid' or as 'when the experiment is repeated many times, the null hypothesis is rejected in a 1 − p fraction of experiments'.The p-value is only quantifies the likelihood of a particular assumption about the distribution of T , namely that it is governed by the null hypothesis.Therefore, it is this a hypothesis that we may plausibly reject upon observing a small p-value.Finally, we emphasize that the p-value itself can be seen as a random variable, as it is a function of the realization of a randomly observed test statistic t (whether or not governed by the null hypothesis).Each experiment is simply a realization of a particular p-value.

VI. CONCLUSIONS
In this work, we proposed two new methods to analyze witness experiments.Both methods are applicable to any source of quantum states that produces a sequence of states on demand.These states can be arbitrary distributed and correlated -unlike previous works, which often implicitly assume that the noise is well-behaved so that the central limit theorem applies.With our rejection analysis, the experimenter can rigorously certify that a device has the capability of producing entangled quantum states.The statistical confidence is quantified by a bound on the p-value, the probability that an experiment yields data as extreme as the observed data under the assumption that the source produces only separable states.Hence a small p-value is directly interpreted as strong evidence for the production of entangled states.The rejection method applies more generally to any witness experiment that is defined by a set of states and a corresponding witness operator.In particular, it can be used to witness genuine multipartite entanglement.With our estimation analysis, the experimenter can estimate the average witness value and construct a statistically rigorous confidence interval around this estimate.This method allows the experimenter to conclude whether entanglement was produced on average.This confidence is valid regardless of the type of noise present.The estimation method applies to any general estimation problem.This means that it is not necessary that W is a witness in the sense of Eq. ( 1).Fidelity estimation is therefore also covered by this method.
Both methods we derived are simple to use.We provide a step-by-step recipe to choose all relevant parameters, collect the necessary experimental data and compute both a bound on the p-value and a average witness estimate with confidence intervals from this collected data.Our method requires no modification compared to prior methods of performing witness experiments, except possibly the requirement of measuring different settings in random order instead of a fixed, predetermined order.This requirement is however inevitable if one wishes to deal with arbitrarily correlated noise.Both of our methods yields a figure of merit -a bound on the p-value or a point estimate with a confidence interval -that has a concrete interpretation and is comparable between experiments.We illustrated our methods with simulations of an experiment in Nitrogen Vacancy centers.
In this work we chose to treat the witness operator as a fixed object that has been chosen with a state to be witnessed in mind.However, other methods exist in which the witness is modified adaptively based on the measurement data collected so far [24].The objective of adaptive witnessing is to tune the witness operator (perhaps within a parameterized family) to have maximal detection efficiency, based on the partial knowledge of the noise obtained from prior measurement data.In particular, one can attempt to identify and counteract coherent components of the noise.Indeed, coherent and local noise preserve entanglement, and adaptive protocols have been shown to enhance entanglement detection [47].An interesting open question is whether the statistical methods we develop in this work can be combined with adaptive witness methods.This would open ways to increase the detection efficiency of the witness method, while maintaining the robustness against correlated noise.
To allow for the most general model of measurements, we allow each M (j) x in a measurement setting to be measured by a POVM {Π (j),x a } a∈Ω (j) x with outcomes labeled by elements a in some finite set Ω (j) x .That is, we write The tensor product POVM for setting x has outcomes a = (a 1 , ..., a m ) in Ω x := Ω (1) x .We denote it by Sometimes we also need Ω (j) := x Ω (j) x and Ω := Ω (1) × • • • × Ω (m) .We formally define Π (j),x a := 0 for a ∈ Ω (j) \ Ω (j) x and Π x a := 0 for a ∈ Ω \ Ω x .This is useful in case we want to sum over a ∈ Ω (j) or a ∈ Ω without knowing x.Using Eq. (A4), we can write each local observable O where the last equality follows from the facts that b j (ξ) ∈ {0, 1} and POVM elements sum to the identity I, and we use the convention that a 0 = 1, Q 0 = I for any number a and operator Q. Hence by plugging Eq. (A6) into Eq.(A2), the witness operator can be further decomposed as The last equality follows from rearrangement and Eq.(A5).Table I in the main text summarizes all relevant objects.As discussed in Sections II A and II B, the witness experiment now proceeds in each round by selecting a measurement setting x at random, measuring the corresponding POVMs on each subsystem, and, upon obtaining outcomes a ∈ Ω x , assigning a score according to Eq. ( 13): It is useful to define s(x, a) := 0 if a ∈ Ω x .Now it becomes apparent why we had written W in the form Eq. (A7): it allows us to plug in the definition of the score function Eq. (A8).As a consequence, we find that showing that Eq. ( 14) holds.Finally, we define the algebraic minimum and maximum score, and their difference, as s min := min By Assumption (I)a, this implies that X 1 , A 1 , ρ 1 , ..., X i−1 , A i−1 , ρ i−1 are all contained in the history H i .
(II) Trusted randomness.The measurement setting for the i-th round of the experiment is modeled by a random variable X i .We assume that the distribution of X i given the history H i and the state ρ i (we denoted this informally as pi,x in the main text) is close to a known distribution p x , i.e., there is τ > 0 such that } a∈Ω (j) x , where we recall that x labels the measurement setting, j = 1, . . ., m the subsystem, and Ω (j) x is the set of possible outcomes (we formally set Π (j),x a := 0 for a ∈ Ω (j) x ).We model this by assuming that for each of these POVMs { Π(j) i,a } a there exists a constant δ j > 0 such that for each j, almost surely, We further assume that which means the noisy measurements have the same sets of possible outcome as the ideal measurements.Finally, we assume that the measurement outcomes A i follow Born's rule, so we demand that In terms of the above data, the score of the i-th round is given by the following random variable: We will show that, almost surely, Wi − Wi ∞ ≤ γ 1 and W − Wi ∞ ≤ γ 2 , which together imply the theorem by the triangle inequality.For the former, we find that by comparing the definitions of Wi and Wi , Eqs. (C2) and (C8) respectively, that using Assumption (II) in the first inequality and using in the second inequality that the operators { Πx i,a } form a POVM for each i and x.
To show that W − Wi ∞ ≤ γ 2 , we start by comparing their respective definitions Eqs.(A9) and (C8), which gives where we inserted the definition of the score function Eq. (A8) in the second step.Thus, We will show Eq.(C19) in the remainder of the proof.We expand Πi,a − Π x a using a telescoping sum of the form Proof.We first compute where in Eq. (C40) we pull can pull ρ i out of the conditional expectation value since it is conditioned on ρ i and in Eq. (C41) we use the definition of Wi (given in Eqs.(C2) and (C3)).From this, it follows that where the last inequality follows from the Theorem, Eq. (C4).
where the last inequality holds by Corollary 1.2, which asserts that E[S i |H i ] ≤ c+γ almost surely.Thus, R 0 , R 1 , . . ., R n is a supermarginale as claimed.
Next we bound the differences.Since s min ≤ S i ≤ s max , we find using Eq.(D11) that

FIG. 3 .
FIG. 3. Schematic illustration of tripartite entanglement generation using diamond Nitrogen Vacancy (NV) center systems.(a) The single click entanglement (SCE) scheme generates a single EPR pair between two NV centers.(b) Two EPR pairs are generated using the SCE scheme and combined into a GHZ state by interfering and measurement.The classically conditioned operations Ua, U b and Uc are Pauli operations.

FIG. 4 .
FIG.4.Running p-value bound for our simulated GHZ witness experiment with parameters as defined in the main text.Up until round i = 520, the upper bound is the maximal value of e ≈ 2.72, due to the prefactor e in Eq.(30).The final p-value bound after n = 600 rounds is 2.1 • 10 −4 .

FIG. 5 .
FIG.5.Histograms for a Monte-Carlo simulation of N = 2 • 10 4 independent witness experiments (each with n = 600 rounds).Here, the noise model is not iid but rather follows a stochastic process, as described in Section IV C. (a) Histogram of witness estimates.The blue histogram shows the distribution of the witness estimate ŵn (cf.Eq. (21)).The distribution is non-Gaussian, as is obvious from comparison with the red Gaussian reference curve.Thus, the standard Gaussian prediction ŵn±σ ŵn does not accurately predict future repetitions of the experiment.The skew in the distribution is primarily due to skew in the true average witness value W n across different runs of the non-iid experiment (due to the changing sequence of n states).(b) Histogram of p-value bounds.The blue histogram shows the distribution of p-value upper bounds (cf.Theorem 2).The red line indicates the significance level α = 0.05, the number below which we reject the Null Hypothesis (H0).

FIG. 6 .
FIG. 6. Histogram of a Monte-Carlo simulation of N = 2•10 4witness experiments (each with n = 600 rounds).The histogram shows the distribution of the witness estimate ŵn computed from the observed scores using Eq.(21) in blue.This approximates the true distribution of ŵn.The 95% interval is shaded in blue.The true average witness value is W n = −0.172 in each run of the experiment, which coincides with the mean (and median) ŵn.The red curve is shows the inferred distribution from the data (the scores) observed in the median run when the data is assumed to be iid.That is, the red curve is a Gaussian with mean ŵn and standard deviation σ ŵn as computed from the observed scores under the iid assumption in this particular run.The central 95% interval of this Gaussian is shaded with red lines.Clearly, the distribution inferred from the iid assumption does not match the true distribution of ŵn.

7 . 3 8
To compute p bound numerically, we used c = and s max = −s min = 1.185, just as in the example of Section IV B as the fixed values.The black dots in the figure correspond to n = 600, β = 0.662 and t n = 440.97as used for illustration in Section IV B. The corresponding bound is p bound = 2.1 • 10 −4 .

TABLE I .
Summary of notation to allow multiple observables per measurement setting.The objects in the top half relate only to the choice of witness and its decomposition into observables (labeled by ξ).The objects in the bottom half relate to the implementation of the witness using measurement settings (labeled by x), which are implemented by POVMs.

TABLE II .
List of Model Assumptions on the experimental devices and the nature of the experiment.These assumptions should plausibly hold in the real experiment.The validity of our results depends on these assumption holding.We give a mathematically rigorous definition of the model in Appendix B.
total normalized score T n under the Null Hypothesis (H 0 ) that is at least as large as the observed total normalized score t n .To determine if p ≤ α, we compute an upper bound p ≤ p bound (t n , n, γ) to the pvalue in Theorem 2, Eq.(30), and compare p bound to α.If p bound ≤ α then we can reject the Null Hypothesis (H 0 ) with confidence at least 1−α.We can therefore conclude that at least one state ρ ∈ S must have been produced and therefore the source has the capability of producing such states.In the context where S is the set of separable states, this is interpreted as concluding that the source is capable of producing entangled states.This logical reasoning is only valid if all the Model Assumptions (I) to (III) in TableIIhold.If these fail then one may incorrectly reject H 0 .
(20)the algebraic minimum and maximum value of the score function, respectively.Note that t n ∈ [0, n].This total normalized score is the our test statistic for the hypothesis test.We can reject the Null Hypothesis if the p-value is at most α.The p-value is defined as the probability p := Pr T n ≥ t n H 0(20)of obtaining a

TABLE IV .
Values of the parameters used in the simulated creation of a tripartite GHZ state in NV centers as discussed in Section IV A. The resulting state is described in TableV.

TABLE V .
Nonzero components of the state ρ used in each round of the simulation of the entanglement witness experiment described in Section IV B. The expected witness value of ρ is Tr[W ρ] = −0.172.

TABLE VI .
Summary of the random variables present in the model and analysis the history of the preceding round, i.e, the σ-algebra of H i−1 is contained in the σ-algebra of H i : σ(H i−1 ) ⊆ σ(H i ) ∀i = 2, ..., n.(B1)(b) the state, measurement setting and outcome of round i − 1, i.e., σ(