Bayesian Quantum Noise Spectroscopy

As commonly understood, the noise spectroscopy problem---characterizing the statistical properties of a noise process affecting a quantum system by measuring its response---is ill-posed. Ad-hoc solutions assume implicit structure which is often never determined. Thus it is unclear when the method will succeed or whether one should trust the solution obtained. Here we propose to treat the problem from the point of view of statistical estimation theory. We develop a Bayesian solution to the problem which allows one to easily incorporate assumptions which render the problem solvable. We compare several numerical techniques for noise spectroscopy and find the Bayesian approach to be superior in many respects.


Introduction
Quantum technologies will very likely ultimately rely on active error correction. However, at every stagecrucially in current experiments-open-loop control techniques to suppress errors need to be employed [1]. They can be thought of as a layer-0 level of protection designed to make the errors in any operation as small as possible, before the machinery of quantum error correction and fault-tolerant quantum computing takes over.
These techniques can be roughly classified in terms of their robustness to uncertainty in the knowledge of the noise they attempt to suppress. Exemplifying this, dynamical decoupling (DD) [2] sits on one end of the spectrum as a highly robust technique-DD sequences only require the noise to be 'slow' (in some appropriate metric we will discuss in more detail later), but beyond this the details of the noise are not important. On the other hand, optimal control techniques can be used to design pulse sequences capable of efficiently suppressing a wide range of noises, both 'fast' and 'slow', but only if detailed knowledge of the noise is available [3,4]. The ideal strategy is thus a function of the knowledge available about the noise. Unfortunately this detailed information is often absent, since in an open quantum system scenario the bath or environment generating such noise cannot be directly measured or controlled. Phenomenological models and intimate control over the fabrication process of a given quantum system can alleviate this, but can at most give partial information about the noise sources affecting a qubit.
In order to bridge this gap in knowledge, quantum noise spectroscopy protocols of varying generality have been developed and implemented in recent years [5][6][7][8][9][10][11][12][13]. Their objective is to characterize the actual noise affecting a quantum system of interest, regardless of its source, in terms of its correlations, or more specifically the set of power poly-spectra [14]. The key point is that the information these protocols output, together with optimal control techniques, should enable one to design control routines tailored to suppress the actual noise affecting the quantum system of interest [15]. Very recently, for example, a 10-minute record-breaking coherence time was achieved in trapped-ions [16] using this principle. Operationally, spectroscopy protocols measure the response of a quantum system, in terms of expectation values of observables, in a known initial state, to the noise affecting it in tandem with user-determined control routines. The main difficulty is that noise correlations influence the dynamics of the quantum system in a highly non-linear way. Thus, inferring these correlations in detail from the response of the quantum system is generally an ill-posed problem, unless a priori information on the noise is assumed. Even when standard assumptions, such as Gaussian noise or a dephasing coupling are satisfied, the problem remains non-linear and inverting it carries along a set of non-trivial complications that in turn constrain the type of noise that can be characterized. For example, in Refs. [8][9][10][11][12] a controlinduced frequency comb approach is used in order to overcome the non-linear character of the problem but it comes at the cost of being only effective when the noise correlations are smooth functions in frequency space.
We propose that many of these problems can be ameliorated, or at least properly quantified, using a statistically principled approach. Within the statistical phrasing of the problem we provide a Bayesian solution [17], complete with a numerical implementation. We show the problem can be solved analytically in the limit of large of amounts of experimental data. At the other extreme-the small-data limit-a numerically stable Monte Carlo algorithm [18] approximates the full Bayesian solution. Our two approaches provide a robust solution to the software side of the noise spectroscopy problem. These two regimes are schematically depicted in Figure 1. Though the physical model we consider is simplified to illustrate the novel aspects of this work, we note that our approach is very versatile. We discuss this later.
We summarize the performance of our approach for the large-and small-data limits in Section 6, with complete details including all source code in the supplementary material. In the large-data limit, our approach can give almost an order of magnitude improvement in performance over state-of-the art estimation strategies. By contrast, in the small-data regime, we can even achieve two orders of magnitude improvement with just 2,500 bits of data. Thus, our approaches yield immediate and dramatic benefits in terms of experimental costs.
An expert reader may have realized that our work seems related to recent work of Zwick, Alvarez and Kur-

Number of parameters Amount of data
Here be dragons... eating your free lunch Linearization Sec. 5 Sec. 4 Figure 1: Estimation demands data. At some point (the exact location of which depends are far too many factors to quantify), the distribution of data becomes well-approximated by a Gaussian, allowing an effective linearization of the problem. This greatly simplifies the calculations required to solve the estimation problem. When this is not the case, the problem demands more resources and more clever numerical algorithms to approximate the solution. In any case, the more parameters one has in their model, the more data is required to learn anything.
izki (ZAK) [19]. It is useful to highlight the main differences between that work and ours. There the authors discuss the problem of designing experiments to maximize the expected information of select spectral properties ("bath parameters"). The main difference is that ZAK treat the single parameter case, and are thus able to obtain clever analytic solution to the experiment design problem. In contrast, our approach is inherently multidimensional and thus more general, but this means we only provide numerical algorithms. These works can thus be seen as complementary 1 .
Our paper is organized as follows. In Section 2 we motivate the problem from a physical perspective. In Section 3 we extract the core mathematics, simplifying as much as possible the physical equations in order to phrase the problem as one of statistical estimation theory. We provide two solutions (as discussed in Figure 1) at varying degrees of complexity and economy. In Section 4 we provide the Gaussian process model which is valid in the large data limit. In Section 5 the full Bayesian solution is outlined and applied to the ubiquitous 1/f noise model. A summary of the findings from our numerical experiments is given in Section 6. We wrap up in Section 7 with a brief discussion. Full implementation details and code to reproduce the results are listed in the ancillary files for this paper. In particular, this paper can be seen as a living document as follows: when the source is downloaded and run, new random variables will be drawn and the data will change. Therefore, the figures will also change, while-hopefully!-the conclusions drawn from them will not.

Physical motivation
In order to present our results, it is useful to introduce a concrete physical model that is simple enough that the Open Quantum Systems and Control language does not distract from the main features of our statistical approach, but that is sufficiently non-trivial to be a relevant model from the physical point of view.
Our toy physical model is the dynamics of a two-level system (TLS) in the presence of dephasing noise generated by a zero-mean, Gaussian, stationary process. The TLS and its environment are assumed to be initialized in a factorizable state of the form ρ SB = ρ S ⊗ ρ B , with ρ B being the initial state of the bath. Furthermore, their dynamics is ruled by a Hamiltonian that, in the interaction frame with respect to the natural dynamics of the bath, takes the form where B(t) represents the bath noise, and H ctrl (t) is a control Hamiltonian acting solely on the TLS. We will describe the bath as if it is quantum but a classical bath is of course just a special case where all the bath operator that appear in our equations commute at different times, i.e. [B(t 1 ), B(t 2 )] = 0. For simplicity, we shall assume H ctrl (t) to enact instantaneous σ x pulses at times {t i }, i.e., H ctrl (t) = π 2 δ(t − t i )σ x , such that, in the so-called toggling frame with respect to the control, the Hamiltonian takes the even simpler form to highlight that when the bath B(t) has a component β(t) that is a classical stochastic process, one can only access the average expectation value of σ α over many realizations of β(t). That is one can measure σ α (T ) c with · c denoting average over realizations of β(t). To ease the notation we will denote simply by ⟪·⟫ when both averages are taken. Under these conditions, the expectation value of an operator σ α is given by [12] ⟪σ α (T )⟫ = Tr S Tr The assumption that B(t) is a Gaussian, zero-mean, stationary process, implies that only the second cumulant of the process is non-vanishing, i.e., C (2) In turn this leads to the exact solution, in both time and frequency domains, being given by [12,14] ⟪σ α (T )⟫ = e Here (2) (B(t)B(0))e −iωt is the power spectrum of the noise process. Additionally, F (1) (ω, T ) = T 0 dt y(t) e iωt is the so-called order-one fundamental filter function which depends purely on the control [20,21]. These equations capture the dephasing behaviour of a qubit in the presence of a classical-as in the various semiclassical approximations often made in NV centers [22] or NMR [10]-noise, or quantum noise-as in the case of a bosonic bath in a thermal state [14,23].
In this language, suppressing the decoherence is akin to minimizing the value of the exponent on the right hand side of Eq. (3). This can be achieved via the use of control: one can choose a routine whose filter has a small overlap with the power spectrum. Dynamical decoupling (DD) sequences, for example, achieve this by generating filters which vanish at ω = 0 and are flat around it. They are thus very effective against decoherence generated by "slow" noise, where S(ω) is mostly supported around ω = 0. For noise with considerable high frequency contributions, the DD sequence can be ineffective or can, in analogy to the anti-Zeno effect [24], even generate the opposite effect, i.e., decoherence enhancement. On the other hand, if information about the power spectrum is available then optimal control techniques can be used to find a control routine that minimizes the overlap and thus the decoherence [20,21]. This is clearly the ideal situation, but it begs the question if the necessary information can be obtained in an Open Quantum system scenario.
Complete knowledge of the Hamiltonian describing the decoherence process and of the initial state of the environment would grant us perfect knowledge of the power spectrum and, in such situations, optimal control methods could be used to minimize the decoherence of the TLS. However, even when Gaussianity is imposed a priori [14], such knowledge is rarely available. For example, the temperature of the thermal state of a bosonic environment or the dispersion relation for the bosonic modes is usually unknown. Fortunately, as seen from the equations above, the decoherence process only depends on S(ω), i.e., not on the actual form of B(t) or even ρ B but on the bath correlations they induce. This quantity, while not directly measurable can be inferred from the measurable (average) response of the TLS to different control Hamiltonians. This is the working principle behind recently proposed noise spectroscopy protocols. Schematically, one such protocol would work as follows. Imagine preparing a +1 eigenstate of σ x at time t = 0, in such way that the expectation value of the observable σ x at the final time In principle, it should be possible to choose a sufficiently large set of different control sequences in such way that the integral in the exponent can be deconvolved, and information about S(ω) can be inferred. Different approaches to this problem, under different simplifying assumptions, have been proposed and even experimentally implemented [10,11,13,16,25]. More exotic protocols have been proposed to characterize more general noise processes, such as non-Gaussian noise [11] or noise affecting multiple qubits [12,26]. While we will not consider them here in detail, we note that the statistically motivated methods can in principle be also used there via an appropriate generalization.
In the remainder, we will abstract as much physical detail as possible for brevity and generality. This allows us to easily apply techniques from statistical decision and estimation theory.

Bayesian spectral estimation
In the physical description above, we make reference to observations as being the average values of observables. By contrast, in real experiments, observations are made by acquiring single bits of data at a time through projective measurements of single quantum systems. These two views of experimental observations agree only in the limit that very large numbers of projective measurements are made on identical copies of the system. Reasoning about noise spectroscopy in the presence of experimental constraints is thus, at its core, a statistical problem not suited to the "data-fitting" paradigm we are more used to. To make this precise we first extract the core mathematical elements of the problem. Mathematically, we are interested in the exponent ap-pearing in (3), where Ω is a high frequency cut-off which is often imposed experimentally and required for the numerical integration we use. So as to not introduce too many complications, we assume that Ω is known. Recall that we do not have direct access to χ as it is only exposed experimentally through the statistical model in (3). Moreover, expectation values of observables also cannot be measured directly and will always come with fluctuations due to finite sample sizes. Thus, we prefer to work from the bottom up, considering the precise distribution of each bit of data. To this end, let r be a the binary random variable with distribution such that the expectation value in (3) obeys This is the most fundamental statistical model and we should process data at this level whenever possible. But wait, what does it mean to process data? This is where the Reverend Bayes comes in. The notation Pr(A|B) is read "the probability of A being true given B is known to be true". So, Pr(r = 1|S; F j ) is the probability of observing r = 1 given the filter F j is used and the spectrum is S. Ah, but that seems a bit awkward, doesn't it? Isn't the spectrum the thing we don't know? To rectify this, we invert the probability using Bayes' rule: Some terminology [27]: Pr(r|S; F j ) is called the likelihood function and in physics it is always given by the physical model; Pr(r|F j ) is called the evidence and it is usually ignored as it can be determined by normalization; Pr(S|F j ) is called the prior and encodes the information we have about the spectrum before the data is take; and finally, Pr(S|r; F j ) is called the posterior, which is the information we have about the spectrum after the experiment-exactly what we want to know! In general, performing this inversion is both analytically and computationally intractable. There are two general approaches to solving this problem. Either we make analytical approximations or we employ clever numerical integration techniques. Here we demonstrate both. But, the problem and solutions are also not decoupled from how much can be assumed known about the spectrum-the dimension of model-and the amount of data available, such that the domain of applicability of each solution is restricted in subtle ways. This is shown pictorially in Figure 1.

Big data: analytical approximations with weak assumptions
In the large data 2 limit, we can appeal to the central limit theorem. In the Gaussian limit of the likelihood function we effectively linearize the model. For brevity, we will denote χ(S; F j ) =: χ j . Suppose the the number of binary samples taken per filter function used is N . Denote each binary sample r ji and Thenŷ j is a binomial random variable with mean and variance given by Another way to specify data when this approximation is valid is to treatχ j ∼ N (χ j , σ 2 j ), for each filter F j , where Though this approximation will be valid when strong assumptions are made to reduce the model dimension on S, it's real utility is in allowing a tractable solution for weak assumptions on S. In Section 4, we will specify precisely how we model S when this approximation holds.

Small data: sequential Monte Carlo
For smaller data sets, the normality assumptions made in the previous section are difficult to justify, and may fail altogether. Thus, we must take an alternative approach to calculating the posterior distribution given by (6). To do so, we note that the main advantage of the Gaussian approach of the previous section was 2 The definition of "large" is intentionally left ambiguous as it depends on far too many things to give a precise number to. that it will allow us to represent the prior and the posterior distributions as being different members of the same family of distributions. A promising alternative approach, then, is to consider more general families of distributions, perhaps at cost of greater computational effort. In particular, we will use the sequential Monte Carlo (SMC) approximation, which represents the distributions Pr(S|F j ) and Pr(S|r; F j ) appearing in (6) by weighted sums of δ-distributions. This approximation is very general, and will allow us to be much more general in our treatment of Pr(r|S; F j ). In particular, using SMC will allow us to easily express models for spectral density functions that can be described using a small number of parameters, such as 1/ω α for an unknown power α.
On the other hand, using sequential Monte Carlo forgoes the benefits of the analytic approximations described in the previous section, such that there is a natural tradeoff between the two approaches with the amount of data being taken, and with the form of the models under consideration. We detail the sequential Monte Carlo-based approach and compare it to the Gaussian process approach in Section 5.

Filter functions and "naive" estimator
For our numerical experiments we consider filters that arise from compressing so-called CPMG sequences [28] of increasing number of pulses in a fixed total time T , as was done in Ref. [29] for example. For a CPMG sequence of p pulses, y(t) is a function which switches between −1 and 1 at every pulse time t i = (2i + 1)T /(2n). More involved sequence choices can be made, if one is interested in exploring higher frequency regimes, for example [12,30], but this simple choice is enough for our purposes. A useful feature of this choice is that it provides an intuitive way of producing filters whose main support is in a given frequency range. More specifically, the larger p is the higher in frequency the main peak of F (ω) is. This is important becuase if no filter had support in a given frequency range it would be impossible to accurately estimate the power spectra in that regime. We plot the filter functions used in this paper in Figure 2.
For the large data limit, we can use a simple data fitting estimator for a point of reference. In forming our naive spectral density estimate, we will require the normalizations f j := Ω 0 dωF j (ω) and the maxima ω j := arg max ω F j (ω). For the filters we consider (see Figure 2), the normalizations and maxima are plotted in Figure 3. Next, we approximate each filter function as This leads to Suppose we identify the experimentally observed random variablesχ j with the theoretical values χ j . Then, inverting (12), we define the naive estimatê This will be compared to some more sophisticated, but more computationally expensive estimators. An example output of the naive estimator is shown in Figure 4.

Big data: Gaussian process regression
Suppose we are in the large data limit. That is, N is big enough that all distributions are roughly Gaussian. To deal with the notion of a prior, or measure, on functions we treat the unknown spectrum as a random function S(ω). Denote the distribution of S as Pr(S). To specify this concretely we discretize the support of the distribution to the set W = {ω 1 , . . . , ω M }. This means that, whenever a numerical calculation is performed, we really only consider random variables S k := S(ω k ), which can be represented collectively as the vector S. The simplest non-trivial distribution is Gaussian (or normal): S ∼ N (µ, k), where k jk := k(ω j , ω k ) and µ j := µ(ω j ) are the covariance and mean. In the function space picture we write this S(ω) ∼ GP(µ(ω), k(ω, ω )), where GP stands for Gaussian process [31], µ is the mean function and k is the covariance function, or kernel. In standard notation, In principle we can choose any functions µ and k as our mean and kernel functions. However, there are natural choices and ones that have been found to perform well in a broad range of problems. The most common kernel is the so-called squared exponential 3 , where δ is a hyper-parameter which controls the correlation in S for nearby ω and κ controls the overall prior uncertainty. In a purely Bayesian context, we should have a priori values for µ, κ, and δ. In other words, we believe the "true" spectrum is drawn according to a 0.0 0.5

S(ω)
True Naive (Noiseless) Naive (Noisy) Figure 4: A randomly chosen true spectrum, compared to naive estimates from noiseless and noisy data. The data for each estimate is simulated using the 25 control sequences discussed in the main text, with the noisy data being simulated for N = 1000 repetitions per control sequence, and with the noiseless data being simulated for the limit of infinite repetitions per sequence. The discrepancy between the noiseless data curve and the true curve is due to the approximation of the filters as delta functions in the theory.

S(ω)
Prior mean True 95% credible Figure 5: A visualization of a Gaussian Process. Here the mean function µ is taken to be a Gaussian function and we use the squared exponential kernel in (16) with parameters κ = 0.02 and δ = 100. In red, the mean and 95% credible band is plotted. The other curves are samples from this GP. One of them, in solid black, we take to be the true spectrum.
GP with these parameters. If this is the case, then no more needs to be done. If not, we would need to perform model selection [32]. The topic of model selection is beyond the scope of this work and so we will chose specific values for µ, κ, and δ. To get some intuition for how this relates to qubit noise spectra, we have plotted a visualization of the GP prior we will use in Figure 5.
If we begin with a GP prior and the distribution of data is also Gaussian, then the posterior is Gaussian and we can derive an analytic expression for its mean function and kernel. As discussed above, in the large data limit we can do just this. Moreover, since each experiment is uncorrelated, we can process the data at once treatingχ ∼ N (χ, Σ), where Σ is a diagonal covariance matrix with entries given by σ 2 j . Since the prior is Gaussian and the likelihood function is Gaussian, the posterior is also Gaussian. Determining its mean and covariance is a simple exercise in multivariate completing the square. Denote G as the matrix with entries G kj = F j (ω k )(ω k − ω k−1 )/4π, such that the trapezoidal rule applied to χ j is written Then, Bayesian updating amounts to updating the covariance and mean as follows [31]: Let's take a look at an example simulation and use this GP estimator to find the spectrum.  Figure 5, we plot the observed data superposed over the theoretical (infinite precision) χ's. ure 6 we plot the simulated data for N = 1000 repetitions per experiment.
Using the data plotted in Figure 6, we apply Equations (19) and (20) to get the posterior Gaussian Process. We plot the mean function and 95% credible band in Figure 7. We see that the naive and GP estimator agree reasonably well when plenty of data is available. We also plot the same procedures for much less data (N = 100), where it is evident that naive estimator fails completely. Whereas, the GP estimator correctly hedges its bets by not suggesting any extreme features deviating the prior GP-it "knows" it doesn't have enough data to do so. A more extensive analysis of this difference is presented in Section 6.
The GP estimator is not a silver bullet, however. First, as already noted, it is only valid when the data are drawn according to a Gaussian distribution-for example, when the Central Limit Theorem applies. Second, unless augmented with more sophisticated machine learning algorithms, the GP estimator can only reliably estimate features with sizes the order of δ. To illustrate this, consider now the one-on-f model (or, more generally the 1/f α model, with f and ω being interchangeable here). This corresponds to a spectrum, where c gives an effective low frequency cut-off. In this case, the spectrum has a high amount of structure. Indeed, the entire functional form is dictated by only a few parameters. We don't expect, then, that with the freedom allowed by the GP, it will be able to find this structure without a large amount of data. In Figure 8, we see that the GP is blind to the structure in the 1/f α model and estimates additional non-existent features. While the naive estimator also suffers from this problem, in the next Section we will show how to incorporate information about this structure by the method of hy- perparameters. This provides an alternative approach for cases in which a global property of the spectrum is of more experimental interest than the spectrum itself.
After we describe the final estimator, we will provide an extensive numerical comparison of all estimators in Section 6.

Small data: Hyperparameterized nonlinear regression
In the 1/f α example discussed in the previous Section, the prior uncertainty was concentrated on a small number of parameters, such that the distribution of the spectra at each point is very highly correlated. That is, if we perfectly knew the parameters A, α, and c as they appear in (21), we would be able to precisely predict the value of S(ω) at arbitrary ω.
Though we can describe these correlations using the methods of the previous Section, such that the covariance kernel is effectively a low-rank linear operator, it can also be very useful to describe our learning problem more directly. For example, in (21), we can interpret (A, α, c) as a vector of parameters in its own right, as this vector describes the distribution over the alternative parameterization implied by our discretization of S(ω).
In making this interpretation, we will use the method of hyperparameters to incorporate our knowledge of an appropriate functional form for spectra into our estimation model directly. This is the standard approach in Hamiltonian parameter estimation, for example, where prior knowledge of the physics allows for a drastic reduction in model dimension [33]. Bayesian inference can be then applied directly to the such hyperparameters instead of at the level of the bare physical model.
Generally, if the likelihood function Pr(r|η) depends on a model vector η that itself is distributed as Pr(η|θ) for some other vector θ, then we can consider the marginalized distribution as a likelihood function in its own right.
Returning to the problem of spectral density estimation, we note that we can readily define the model vector η by the inner product S, F k = χ(S; F k ) = 1 2π S(ω)F k (ω)dω, as predicting each inner product χ k is sufficient to reproduce the entire likelihood function. From this perspective, if we can reproduce each χ k from a lower-dimensional model θ (that is, a model with less parameters than the number of filter functions used to gather data), then θ represents a more efficient hyperparameterization than taking η directly. For example, consider hyperparameterizing (4) following a 1/f α model, for a given ultraviolet cutoff Ω and with θ = (A, α, c). This model has been studied experimentally, in particular as a diagnostic for superconducting and spin qubits [6,7], such that improvements even in this simple example immediately yield experimental benefits.
By expressing the estimation problem in terms of the hyperparameters θ, we introduce a subtle distinction in how we report our final estimatesŜ once we have obtained a datum r. We can report the spectrum evaluated at the estimated hyperparametersθ = E[θ|r] for our recorded data. This estimate achives the best possible mean-squared error (MSE) for reporting the hyperparameters themselves, but does not necessarily provide the best estimate of S. As an alternative, we can instead report the Bayesian mean estimate of the spectrum directly,Ŝ (ω) = E θ [S(ω; θ)|d]. (24) That is, by taking the spectra and then the mean, we obtain the Bayesian estimate of the spectrum, using our knowledge of the hyperparameters. Though these two methods coincide for spectrum models that are linear functions of their hyperparameters, for models such as 1/f α , the difference can be quite significant, as demonstrated in Figure 9. Critically, both methods for estimating S from a posterior over hyperparameters can be generated easily from the same data without additional analysis. Thus, we are free to report the optimal estimate for questions of experimental interest, rather than assuming a priori that only one question will be asked of our data.
In any case, we treat the spectrum more generally as being drawn from a parameterized distribution of spectra, such that Pr(S(ω)|θ) = δ(S(ω) − S θ (ω)) and θ is a real-valued vector for a functional form S θ (ω) such as the 1/f α model discussed above. Therefore, once we have specified θ, we have specified the unknown spec-trum. Following Bayes' rule (6) as usual gives us a posterior distribution over the parameters θ, conditioned on a data record r, Although the denominator looks like an innocuous normalization constant, producing accurate estimates of θ requires its calculation. Since we do not assume that S θ is a linear function of θ, and since we are concerned with efficiently utilizing small amounts of data, the analytic solution in terms of Gaussian process regression used for the process-model case cannot be directly applied here. In lieu of that, our preferred method is sequential Monte Carlo [18], also known as particle filtering. This algorithm computes posterior distributions of the form given as (25) by evaluating the likelihood at each of many different particles, each of which represents a particular hypothesis about the true model vector θ and an associated weight. Expectation values over the posterior can then be replaced by finite sums over the particles in the sequential Monte Carlo approximation.
We will use the implementation provided by the QInfer package for Python [34]. In the following Section, we detail and present results obtained from a QInfer model for hyperparameterized spectral density estimation, and compare these results to those obtained from the Gaussian process regression method of Section 4. Our QInfer model will depend on the specification of a set of test frequencies W = {ω 1 , . . . , ω M } and a spectral model function S (ω; θ), where M specifies the resolution of test frequencies. The inner products S(ω; θ), F k (ω) will then be approximated by numerically evaluating the integral (4) in terms of the trapezoidal rule (18) applied to the integrand {S(ω i ; θ)F k (ω i ) : ω i ∈ W }. This design allows for our model to be very general with respect to the particular choice of S(ω; θ).
In this paper, we will work with one such description by writing the 1/f α model as a hierarchal model in which r is a random variable that is defined by its distribution conditioned on our new hyperparameters θ. In particular, the conditional distribution of r is given by 4 4 An exponentially distributed random variable is denoted x ∼ Exponential(x l , λ) and has a probability density function f (x) = λ exp(−λ(x − x l )) for x > x l . For more details, please see the complete source code provided in the supplemental material.

Numerical experiments
We have already demonstrated some comparisons between the different approaches in the previous sections. The purpose of this section is to consolidate and expand on those illustrative comparisons. Namely, we will compare the performance of the naive, GP and hyperparameter estimator over many randomly chosen spectra. The comparison is facilitated by the mean squared error metric. LetŜ be an estimate of the true spectrum S. Then, the error-or loss-is defined as To compare different estimators, we select a true spectrum randomly from the prior, simulate experiments, calculate the estimators, record the loss of each, and repeat. Then we plot a histogram of the achieved accuracy.
We start with the true spectrum randomly selected by sampling a Gaussian Process, as exemplified in Figure 5. In this case, the hyperparameterized estimator is not useful 5 , so we only compare the GP estimator with the naive estimator. As a point of reference, we can also treat the prior mean function as an estimator and calculate its loss. This is equivalent to the earlier comparison in Figure 7, but we average the results of many trials. The result of 400 trials is shown in Figure 10. As expected, the posterior loss is lower than the prior loss, indicating that the algorithm is learning. The naive loss does a respectable job as well, but is convincingly beaten by the GP estimator-especially given the fact that GP estimator comes with all the added benefits of the Bayesian methodology discussed above.
In the case of the 1/f α model we compare all estimators. The results are presented in Figure 11 using another 400 trials. Again, we see that each estimator demonstrate genuine learning by reducing the loss over The prior for the Gaussian process estimator is taken to be that shown in Figure 5, and the true spectra are sampled from the prior.
the prior. The GP still outperforms the naive estimator, but here the hyperparameterized estimator shines, demonstrating that knowledge of the additional structure is immensely beneficial for learning. We also plot the bias of each estimator in Figure 12. The hyperparameterized estimator is extremely robust, reducing both the variance and bias over its competitors. By contrast, the GP estimator and naive estimator are reliably biased, and in opposite directions. We do not yet have a strong theoretical explanation for this behaviour.
When performing parametric estimation, as is implied by our hyperparameterized estimator, it is more common to define loss functions on the parameters themselves. The standard loss function is the squared error: As this loss function is not defined for the naive or GP estimator, we only report the results for the hyperparameterized estimator. This appears in Figure 13. Since we are only considering a single parametric estimator, in Figure 13 we plot the ratio of the posterior loss to the prior loss of the hyperparameterized estimator. This demonstrates that the parameter of interest, α, can be learned to two orders of magnitude better accuracy than the prior with no more than 2,500 single-shot measurements.

Discussion
In this work, we formulated the noise spectroscopy problem in the language of statistical estimation theory. This allows us to provide a robust and principled solution to the problem using Bayesian analysis. Considering Figure 1 again, we have demonstrated two separate numerical solutions suited to two different regimes in the the continuum of possibilities: the big data limit and the low-data/low-dimension limit.
In the large-data limit, we can effectively linearize the model and use a non-parametric approach to capture a broad class of spectra. The resultant spectra are not by eye different from a naive data fitting procedure in many cases. However, for a minimal amount of added computation, our approach gives a statistically rigorous accounting of the error bars in the reported spectra.
In the low-data limit we use the exact statistical model. In this case, estimation-or, learning-is aided by use of prior knowledge on the model which reduces its dimension. The sequential Monte Carlo method is then applied to the resultant parameter estimation problem, which allows accurate inference even for highly nonlinear models in the low data setting.
In sum, we have treated what we consider the core inference problem in noise spectroscopy in order to provide the cleanest demonstration of our algorithms. Within our approach, however, there is no limit to the model complexity that can be treated-requiring only more computational resources. We comment briefly on some generalization that straightforwardly build on these methods. For instance, if one expects the power spectrum to have delta-like peaks-as would be the case if the probe was coupled to a finite number of harmonic oscillators, for example-it is possible to use the position and height of the peaks as hyperparameters in our routines. In an analogous way, our methods can be used  (26). For the Gaussian process estimator, we take the prior to have a mean function given by the µ(ω) = Eα,A,c[A/(ω α + c)], where the expectation is over the prior for the hyperparameterized model. The covariance of the Gaussian process mean is taken to be the same kernel as that in Figure 5. Finally, the true spectra are drawn from the hyperparameterized prior.  for model selection-that is, to discriminate between various proposed models for our environment. For instance, making contact with our previous example, one can determine how many oscillators are coupled to our probe. More importantly, the statistical methods developed here are not constrained to the spectroscopy scenario we considered: Gaussian, zero mean noise. More general spectroscopy protocols are based on inverting multi-dimensional integrals of the form dω 1 · · · dω m F (m) (ω 1 , · · · , ω m , T )S (m) (ω 1 , · · · , ω m ), where S (m) (ω 1 , · · · , ω m ) is the mth order polyspectra and F (m) (ω 1 , · · · , ω m , T ) an mth generalized filter function [12,30], essentially an m-th dimensional time ordered Fourier transform of the product y(t 1 ) · · · y(t m ). Estimating S (m) (ω 1 , · · · , ω m ) given our ability to manipulate F (m) (ω 1 , · · · , ω m , T ) is then a generalization of our current methods to higher dimensional integrals.
Finally, we note that our inferential algorithm can easily be embedded into control software for online (real-time) estimation and, more interestingly, closedloop adaptive control. These are exciting possibilities not currently offered without significant modification by traditional approaches.