Efficient Estimators for Sequential and Resolution-Limited Inverse Problems

A common problem in the sciences is that a signal of interest is observed only indirectly, through smooth functionals of the signal whose values are then obscured by noise. In such inverse problems, the functionals dampen or entirely eliminate some of the signal's interesting features. This makes it difficult or even impossible to fully reconstruct the signal, even without noise. In this paper, we develop methods for handling sequences of related inverse problems, with the problems varying either systematically or randomly over time. Such sequences often arise with automated data collection systems, like the data pipelines of large astronomical instruments such as the Large Synoptic Survey Telescope (LSST). The LSST will observe each patch of the sky many times over its lifetime under varying conditions. A possible additional complication in these problems is that the observational resolution is limited by the instrument, so that even with many repeated observations, only an approximation of the underlying signal can be reconstructed. We propose an efficient estimator for reconstructing a signal of interest given a sequence of related, resolution-limited inverse problems. We demonstrate our method's effectiveness in some representative examples and provide theoretical support for its adoption.


Introduction
In many applications, data about a signal of interest can only be indirectly gathered. For instance, astronomical images from ground-based telescopes are observed through the blurring caused by atmospheric turbulence; Positron Emission Tomography (PET) scanners measure photon intensities averaged over lines; and seismologists record the surface effects of earthquakes whose waves have been filtered by the Earth. In these examples and other such inverse problems, the basic measurements are smooth functionals of the signal that dampen or entirely eliminate some of the signal's interesting features. This makes it difficult, or sometimes impossible, to fully reconstruct the signal from noisy data.
Over the years, a number of methods have been developed for the recovery of a signal under inverse problems. We cannot hope to provide a comprehensive list, but see O'Sullivan (1986); Wahba (1990); Donoho (1995); Tenorio (2001); Candés and Donoho (2002);  and the references contained therein for an introduction and Cavalier (2008) for a modern review of the state of the field. Also, many disciplines have developed specific techniques for addressing particular issues, such as Astronomy (Starck, Pantin and Murtagh, 2002; van Dyk et al., 2006) and Tomography (Ólafsson and Quinto, 2005).
However, the above cited work provides techniques and theory for situations in which an estimate of a signal is formed after only one observation. In many fields, recent technological advances have made it possible to automate datacollection, enabling repeated observations of the signal over time. While repeated observations can improve accuracy, it often raises new challenges as the inverse problems faced at different times can vary significantly. For example, the Large Synoptic Survey Telescope (LSST), a multi-year, Earth-based survey of the entire sky, will image space to an unprecedented depth and will catalog billions of astronomical objects. The LSST will take long sequences of images at each patch of sky, about 3 degrees on a side. In each sequence, the images will be separated in time by approximately 3-4 days. Each image in each sequence is taken with different blurring and distortion conditions. Thus, the viewing process represents sequences of related but distinct inverse problems. One scientific goal is to use these images to reconstruct the signal, which in this case is comprised of the underlying celestial structures, as accurately as possible.
Notationally, we consider the following problem. We want to recover information about an unknown signal θ ∈ R p from measurements of the form Y i = K i θ + εW i , for i = 1, 2, . . . . (1) Here, each Y i is a measured signal, such as an audio recording or a (vectorized) image represented as a p × 1 vector. Each forward operator K i describes the measurement process and the W i 's are independent, mean zero Gaussian pvectors with variance-covariance matrix I p , the order p identity. The K i represent both the damping of the signal present in an inverse problem and the necessary discretization due to the resolution-limited nature of most observational devices, most commonly through pixelization. The K i 's are a priori unknown and hence must be measured and estimated. As any information about the K i 's comes from the observational device itself, any estimate of the K i 's are resolution-limited as well. Therefore, we represent the measurement process K i as a p × p matrix. This captures the idea that, in many problems, the resolution is fixed by the instrument and does not change as more data is collected (that is, as n → ∞).
An early formal consideration of the sequential inverse problem is found in the literature on developing loss-less analogue-to-digital conversion techniques. The recovery of the orginal, analogue signal is an inverse problem as there is not a unique analogue signal corresponding to each digital signal. This result is formalized in the quantity referred to as the Nyquist rate, or frequency (Mallet, 2009, Chapter 3). If the signal is instead sampled multiple times at different, carefully chosen sampling rates, Berenstein and Patrick (1990) and Casey and Walnut (1994) find conditions under which the original signal can be reconstructed in a loss-less way. Note that, as opposed to our paper, these approaches deal with only the case where ǫ = 0 and the K i , which correspond to the sampling rate, can be chosen by the experimenter.
Subsequently, the sequential inverse problem is considered in a series of articles, beginning with Piana and Bertero (1996), in which two methods are introduced. The first corresponds to Tikonov-Phillips (TP) regularization (known in statistics as ridge regression) adapted to the sequential problem. The second is an iterative method based on Landwieber iterations (LI). See Bertero and Boccacci (1998) for an overview of the Landwieber iterations method in inverse problems. Though the above methods have been successfully implemented in the past, most notably in the software package AIRY (Correia et al., 2002), it has two shortcomings: the methods correspond to restrictive choices among all possible estimators and they offer no automated method for choosing the introduced tuning parameters.
Remark 1.1. The Tikonov-Phillips and Landwieber Iteration methods can readily be derived by the formalism developed in this paper (see equations (19) and (17)). Therefore, we get for free a principled method for setting the tuning parameters, as well as a suite of new estimators.
The goal of this paper is to develop and investigate a statistically efficient estimator of θ from the sequence of resolution-limited inverse problems introduced in equation (1). We require that any estimator must satisfy the following: (i) it leaves no user-defined tuning parameters and (ii) the estimatorθ n based on an n-sequence can be efficiently updated to produce the estimatorθ n+1 after observing Y n+1 . Both requirements are particularly important in applications like the LSST, where it is inconvenient (or impossible) to access the entire past data stream with each new observation and hence the data must be processed in near real time.
In Section 3.3 we discuss two reasonable approaches to this problem based on collapsing the sequence of operators (K i ) into one summary operator, in one case averaging the operators and in the other concatenating them. We show that both approaches do not satisfy conditions (i) and (ii).
Satellite Imaging: To fix ideas, we introduce a typical instance where the observations form a a sequence of resolution limited inverse problems. During satellite imaging operations, a location on Earth is imaged many times over the life span of the satellite. The quality of the recorded observations can be low and variable due to changing atmospheric and/or weather conditions. See the left column of Figure 1 for a representative panel of four such images taken of the White House and surrounding buildings. Note that the amount of blurring in each image i, corresponding to the forward operators K i , can be very different.
However, the pixelization induced by the observational device is fixed over the sequence of images.
Our proposed estimatorθ n takes these images and sequentially creates a new estimate of the unknown signal θ after each observation Y i (right column of Figure 1). Each row of Figure 1 is a new observation andθ n after being updated with that observation. Notice that the recovery is quite good, even after only a few images as input. We emphasize that there are no choices to be made by the data analyst: all tuning parameters are chosen in an automatic, data-dependent way.
This paper is organized as follows. In Section 2 we give a careful overview of our method and provide justification for the assumptions made, with greater exposition occurring in Appendix A. In Theorem 2 and Theorem 3, we give supporting theory for our estimator that both shows uniform consistency over the parameter space and an asymptotic oracle inequality. These results show both that our estimator will get the correct answer eventually, no matter the signal in our parameter space, and that our estimator makes essentially as efficient use of the data as if we knew the signal θ. In Section 4, we provide an example of our framework in action on simulated data.
Notation. For A ∈ C p×p and a ∈ C, define A * to be the Hermitian adjoint of A. Correspondingly, define |a| 2 = a * a and |A| 2 = A * A to be the squared complex modulus of a scalar and matrix, respectively. Likewise, for any vector x ∈ C p , ||x|| 2 = x * x. If AA * = I p = A * A, then we say that A is unitary. We utilize bold faced font for vectors: b n ∈ C p with its j th entry notated b nj and the subscript n indicates dependence on the sample size. Similarly, A nj is the j th element of the main diagonal of the matrix A n . We abuse notation slightly by using λ as both a vector in C p and as a function from C p to C p given by component-wise multiplication.

Methodology and main results
We begin this section by making the following assumptions, building on the notation introduced in equation (1): are known smoothing matrices. (A3) There exists a unitary matrix Ψ ∈ C p×p and diagonal matrices D i such that K i = ΨD i Ψ * for all i = 1, . . . , n, . . .. (A4) There exists an N < ∞ such that for all j there exists an Assumptions (A1) and (A2) are very standard in the statistical inverse problem literature. We discuss a strategy for estimating ε in Section 3.2. Assumption

Observations Estimates
Fig 1: Example of images of the White House from a satellite and associated recovery of the unknown signal θ using our proposed estimator. In the left column (Observations) the different amounts of blurring are due to varying atmospheric conditions and correspond to the forward operators K i in equation (1). In the right column (Estimates) we report the output of our estimator using the data in the left column. Each row corresponds to making another observation Y i and updating our estimator with this new data. We emphasize that there are no choices to be made by the data analyst; all tuning parameters are chosen in an automatic, data-dependent way.
(A4) is also commonly made and it ensures that, at some point, the entire signal θ is identified and loosely corresponds to the intersection of the null spaces of the (K i ) n i=1 eventually only containing the zero vector. Assumption (A5) merely prevents a pathological case where the K i are becoming more ill-conditioned without bound as n → ∞. Assumption (A3) is crucial to our method and while the reason for it will become clear, the following theorem provides a general family of matrices that satisfy it: all correspond to the convolution operation, then there exists a unitary matrix Ψ and a sequence of diagonal matrices (D i ) n i=1 , all of which could have complex entries, such that (A3) holds. If θ is a one (two)dimensional signal, then the K i are (block) circulant and the entries of the matrix Ψ are the discrete one (two)-dimensional Fourier basis and the entries of D i are the corresponding discrete one (two)-dimensional Fourier coefficients.
Hence, we see that (A3) is more general than the convolutional assumption made in Piana and Bertero (1996) and many other works concerning statistical inverse problems. See Appendix A for a proof of Theorem 1 and an investigation into more general families of matrices that satisfy assumption (A3).

Overview and main results
An overview of our procedure is as follows. The parameter θ and each observation Y i us rotated by Ψ * . The rotated Y i 's are combined together to form a sufficient statistic B n . The estimators we consider are of the form θ = Ψλ(B n ) := Ψ(λ j B nj ) p j=1 . Define this set of estimators to be We choose from the estimators in E using a combination of minimizing an empirical estimator of the risk and some additional regularization parameters. Define our estimator to beθ n = Ψλ(B n ), wherê ( The form of this estimator is derived in the text containing and preceding equation (20). We set Ω 2 n := (p − 2) 1 + maxj ∆nj minj ∆nj . Note this choice of Ω 2 n is motivated by Brown, Nie and Xie (2011) in which it is shown that ensemble minimaxality in the heteroscedastic case holds for the soft thresholded James-Stein type estimator.
We define our loss function to be the l 2 norm with associated risk and set Θ := {θ : ||θ|| 2 2 ≤ T 2 } for any 0 < T 2 < ∞. Then If D ij ≡ D j for some D j ∈ C, then γ n ≍ 1/n; that is the parametric rate. However, the forward operators (K i ) in effect ensure that each observation doesn't decrease the risk equally. The quantity ∆ nj relates to how much information is present in the first n observations about the j th component of Ψ * θ.
Additionally, we can compare our estimator to the E-oracle θ * Theorem 3. Suppose assumptions (A1) -(A5) and let where the term O(1) does not depend on θ.
An interesting extension of this model is to the random operator setting. That is, what is the impact of having K i being drawn from some distribution? We answer this question in an interesting case.

Random Eigenvalues
Suppose that the (K i ) are random operators such that K i = ΨD i Ψ * for all i = 1, 2, . . . and diag(D i ) where D is any p-variate complex distribution that doesn't have too much mass near zero. Specifically, This is a stochastic extension of assumption (A4) as it allows the random eigenvalues to be arbitrarily close to zero in magnitude but with the probability of them being small going to zero at an appropriate rate. Lastly, let (W i ) and (D i ) be mutually independent.
Theorem 4. Suppose assumption (B4) holds with some ρ > 1. Then where E (Di),(Yij ) corresponds to integration with respect to the joint distribution of (D i ) and (Y ij ).

Rotations, estimators, and tuning parameter selection
Returning to equation (1), for i = 1, 2, . . . we define X i := Ψ * Y i , β := Ψ * θ, and Z i := Ψ * W i . Then it follows that Note that in this case Z i It is also convenient to look at equation (8) component-wise, for j = 1, . . . , p. Note that for these multiplications to be defined, we have to think about R p being embedded in C p by having imaginary part equal to zero. We follow this convention without comment in what follows.
Remark 2.1. Note that the (Z i ) are degenerate complex Gaussian vectors in the following sense: if we think of a p dimensional complex Gaussian as a 2p dimensional real Gaussian with some covariance matrix, then the Gaussian actually has values in a p dimensional subspace of R 2p . Thus the random variables don't have a density with respect to Lebesgue measure on the full space C p , among other complications.
Remark 2.2. Commonly, the sequence space formulations found in equation (8) and equation (9) are accomplished by a real, orthogonal matrix instead of a complex, unitary one. Allowing for the sequence (K i ) n i=1 to share the same eigenvectors necessitates permitting Ψ to be complex. This makes equation (9) more complicated than the conventional normal means problem in at least two ways. First, as stated above, the random variables are complex. Second, and more importantly, the model is heteroscedastic. This leads to a much more involved theory than in the homoscedastic case, such as in Brown (1975), and is still the topic of contemporary research (Brown, Nie and Xie, 2011).
Lastly, define This quantity is particularly important, as evidenced by the following theorem Theorem 5. Under the model introduced in equation (1) and (A1) -(A4), the random vector B n := (B nj ) p j=1 is sufficient for β in equation (8). This claim can be seen by noting that the map Φ * is measure preserving.
As Ψ is also unitary, we can define an equivalent risk to the one defined in equation (4) Any risk computations made under the data, which is (X i ) n i=1 in our notation, are equivalent to those made under a sufficient statistic (Bahadur, 1954, Theorem 7.1). By Theorem 5, B n is sufficient for β and hence for all measurable functions of the data that are not functions of B n , there exists an estimator with equal risk that is a function of B n . In fact, the expectations in equation (11) are equivalent under (X i ) n i=1 and B n . Therefore, for each n, we can treat B n as the data and formulate estimators based upon it.
To develop an automatic procedure for signal estimation in sequential inverse problems we begin by regularizing an unbiased estimator of β through the use of a smoothing parameter vector. We choose this smoothing parameter by minimizing an estimate of the risk. This type of procedure, known generally as unbiased risk estimation, has been revisited regularly in many fields for solving various problems related to denoising (Stein, 1981;Donoho and Johnstone, 1995). However, as inverse problems generally result in unstable estimators of both the parameter β and the risk R, we compensate by including additional regularization.
The specifics of our approach are related to the procedure found in Beran (2000). However, the goal in Beran (2000), unlike our paper, is the estimation of the regression function in an assumed linear model instead of the coefficients themselves. That is, referring to the notation in equation (1), the estimation of K i θ instead of the estimation of θ. This is an important distinction as both estimating θ is intrinsically harder than estimating K i θ and θ is the object of actual interest. The practical implications of these differences is that only minimizing an unbiased estimate of risk, as is the procedure in Beran (2000), provides insufficient regularization. As well, the theoretical justification that appears in Beran (2000), is essentially entirely asymptotic in p. This is a regime we do not consider relevant for the problem at hand.
The first part of the proposition provides an unbiased estimate of the risk while the second part implies that we gain no improvement in risk by allowing λ to have values outside of L.
UsingR n from (12), define for any G ⊆ L which produces an estimator of β viâ Lastly, we recover an estimate of θ by formingθ G := Ψβ G . As any choice of G results in an estimatorβ G via the above machinery, there are in principle many possible choices. We focus on G = L, which by inspection of equation (12), results inλ where as usual (·) + = max(·, 0) is the soft thresholding function. Other choices can and should be explored in further research into estimation in sequential inverse problems such as M := {λ ∈ L : λ 1 ≥ λ 2 ≥ . . . ≥ λ p }, which induces a monotonicity constraint on the estimated coefficients, or block methods of piecewise constant weights .
Additionally, the aforementioned Tikonov-Phillips regularization and Landwieber iterations methods correspond to specific subsets of L. The Tikonov-Phillips estimator takes the form which can be rewritten as an element of E by defining with associated estimatorβ γ = λ γ (B n ). The Landwieber iterations estimator is by nature iterative. However, it has an equivalent formation in the form of the following linear smoother where γ corresponds to the number of iterations and τ is a relaxation parameter. The associated estimator isβ (γ,τ ) = λ (γ,τ ) (B n ). Hence, this procedure generalizes the results in Piana and Bertero (1996) by providing a principled tuning parameter selection method.
A problem arises if we choose smoothing parameters in this fashion in inverse problems: insufficient regularization. This is due toR n being an unstable estimate of R for the same reason as B n is an unstable estimator of β.
Instead of regularizing the risk estimator, we modify the weights directly to provide additional regularization. However, we record our belief that regularizinĝ R n by limiting how ill-conditioned the risk estimator can become and then minimizing this biased estimator of the risk should provide a suite of interesting estimators via the above machinery. Definê where the parameter Ω 2 n is specified before Theorem 2. Lastly, define our estimator of θ to beθ n := Ψλ(B n ).
3. Computational concerns, variance estimation, and alternate methods

Computations
The specifics of the computation of an estimatortheta G depend on the subset G. However,R n is a convex objective function. Hence, if G is a convex subset of R p , then the solution can be found both efficiently and uniquely. Of the estimators mentioned above, all except M have a closed form solution and therefore trivial computation. The minimization ofR n over M can be accomplished by a well known algorithm called Pooled Adjacent Violators (PAV) (Robertson, Wright and Dykstra, 1988) that transforms the least squares solutionψ into the monotone solution by taking weighted averages of adjacent elements ofψ that violate the monotonicity constraint. Additionally, in the case of convolution, the vector ∆ n and the random variables (X i ) can be computed via the Fast Fourier Transform, which implies O(p log p) computations and is of course the archetypal instance of an efficient algorithm. However, for more general matrices K i , the eigenvectors must be computed via a conventional eigenvector solver, which necessarily has computational complexity O(p 3 ). This could become prohibitive for large scale problems. There do exist modern approximation methods for eigenvalues and eigenvectors that could be used instead, such as in Halko, Martinsson and Tropp (2009). However, we do not explore this idea further in this paper.
An additional feature is that for the computation at step n, it is not necessary to keep the entire history (Y i ) n i=1 and (K i ) n i=1 . Both B n and ∆ n can be computed from aggregate information. Hence, we can produce an estimate of θ given only access to a few summary statistics which get updated after each new observation.

Estimating the variance parameter
Estimating the variance parameter can be accomplished in a consistent way by setting aside a subsequence N of N and computing the estimator Here, we have computed the estimator after the first n ′ entries in N . Alternatively, we can take advantage of the observational process to acquire a good estimate of ǫ. As we make observations, occasionally some will be of exceptionally poor quality. This observation will be less helpful for recovery in general and provide almost no information about the higher order components of the vector β.
Suppose now that N is the set of all indices i such that Y i is a low quality observation; that is there exists a p ′ such that for j = p ′ , . . . , p, the |D ij | 2 are small. In general, p ′ could depend on i, but we do not consider this complication here. Form the following statistiĉ Then Eε 2 i = ε 2 + 1 p−p ′ p q=p ′ |D iq | 2 |β j | 2 and we report 1/n ′ i∈Nε 2 i as our estimator of ǫ 2 . This is in general a biased estimator of the variance. Nevertheless, it is still useful. First, it is conservative owing to its positive bias. Perhaps more importantly, this estimator provides an interesting situation where the lowest quality parts of the lowest quality observations provide the best performance.

Averaging is not enough
In equation (1), conventional statistical practice would suggest averaging the observations (Y i ) directly. However, we show here that this leads to suboptimal results. Specifically, averaging gives thes following model where, under assumption (A3), . This can also be equivalently expressed as Here, X n = Ψ * Y n . We define the corresponding set of linear estimators to be E := {θ = Ψλ(B n ) : λ ∈ C p }.
Note that we can write equation (24) without any assumptions about the eigenvectors of the forward operators. However, under assumption (A3), the following theorem supports forming estimators based on equation (10) instead of equation (24).
Theorem 7. Suppose for a fixed θ, where the expectations in R 1 and R 2 are under B n and B n , respectively. Then where '< * ' means 'strictly less than except when D i ≡ D for all i and some D.' That is, the oracle linear risk based on equation (10) is strictly less than the oracle linear risk based on equation (24).
Remark 3.1. Note that the classic Tikonov-Phillips estimator based on the Y n is of the formθ and hence the Tikonov-Phillips estimator is in E, among many others.
However, estimators based on this approach, such as spline type estimators, rely on accessing the entire history of observations (Y i ) and forward operators (K i ). This is computationally infeasible as this means both keeping and repeatedly accessing the entire sequence of observations. Hence, this approach doesn't satisfy our requirement of an estimate at time n being efficiently updatable to a new estimate after recording Y n+1 .

Description
In this section we present visual results of using our estimatorθ n to reconstruct various signals, given access only to smoothed and noisy, but repeated, observations of that signal. In both cases, we compare our estimator,θ n toθ ridge from equation (26), with the smoothing parameter τ chosen by minimizing generalized The left column corresponds to θ smooth and the right column corresponds to θ peaked . The top row is a plot of the signal itself, along with the signal after the minimum (dashed line) and maximum (dashed and dotted line) amount of smoothing. The bottom row is an example of the recorded data after corruption by smoothing and noise. Notice that in θ peaked , the smaller peak is completely obscured.
cross validation (GCV). For a quantitative comparison, we use the normalized relative risk (RR) given by We estimate RR by averaging 100 runs of our simulations. For each of the signals introduced below, we set p = 256 and fix the noise parameter ǫ to be such that the signal-to-noise := ||θ|| 1 /(pǫ) = 1. For these examples, we admit K i that are an equally weighted mixture of three Gaussians, normalized to have l 1 mass equal to 1, with means µ 1 = −0.75, µ 2 = 0.00, and µ 3 = 0.50, along with standard deviations σ iq = 0.5 + E qi , where ∼ exponential(1) and q = 1, 2, 3. Note that this implies that the K i are not symmetric. Also, note that Gaussian-like smoothing represents one of the worst cases as it exponentially down-weights the β j for large j.
We consider two signals for estimation, which we refer to as θ smooth and θ peaked (Figure 2). The first signal, θ smooth , is the sum of two Gaussians functions that are filtered by a Gaussian-tapered filter. This filter is additionally enforced to be zero above the p/2 frequency. Hence, θ smooth is very smooth and compactly supported in the frequency domain. This example is instructive as a smooth function should be well represented by the eigenvectors Ψ of the smoothing operators K i . Also, a compact representation in frequency domain will reveal the effectiveness of the soft-thresholding in zeroing out the appropriate B nj , ie: those that correspond to the β j that are zero. See the left column of Figure 2 for a plot of θ smooth (top) along with a typical example of a noisy, smoothed version that comprises the recorded data (bottom). Additionally, we consider the opposite situation by defining a signal θ peaked that is the sum of three sharp, non-smooth, peaks. This signal is difficult to represent with the eigenvectors of smoothing matrices but is common in signal processing as it corresponds to both spectra from biochemical analysis and nuclear magnetic resonance imaging (nMRI). Note that the smallest peak is completely obscured by the smoothing and noise. See the right column of Fig

Results
In estimating either signal, θ smooth or θ peaked , the estimatorθ n converges rapidly to the truth. See Table 1 for the RR ofθ n andθ ridge used on both signals. In each case, for n = 50, the RR are approximately the same, withθ ridge having a slight edge. Every sample size thereafter shows substantial advantage ofθ n overθ ridge , culminating with a factor of two improvement in RR after n = 300 observations.
For estimating θ smooth , both estimators have substantial oscillations for low sample sizes. However, due toθ n having a soft-thresholding effect, some of the entries in our estimator of β are zeroed out. In contrast,θ ridge only shrinks the coefficients and hence still has substantial fluctuations after n = 300 observations. See Figure 3 for graphical results.
For the signal θ peaked ,θ n estimates the true height of the peaks accurately and quickly. In particular, the secondary small peak is definitively identified with the correct shape and height for n = 50 observations, while forθ ridge , the secondary peak is much less clear. There are still some remaining oscillations at n = 300, resulting from unavoidable consequence of using the eigenvector basis. This is a well-known phenomenon in Fourier analysis known as the 'Gibbs effect.' Even with this obstacle,θ n converges quickly to θ peaked . See Figure 4 for graphical results.

Discussion
In this paper, we provide a general method for recovering an unknown signal given a sequence of noisy observations that are only indirectly of that signal of The sample sizes range from top to bottom, n = 50, 100, 200, 300. Our estimator, θ n , quickly converges to θ smooth . However,θ ridge , which doesn't zero out any coefficients, still has substantial fluctuations after n = 300 observations. See Table 1 for RR results for this simulation. The sample sizes range from top to bottom, n = 50, 100, 200, 300. Our estimator, θ n , estimates the true height of the peaks accurately and quickly. In particular, the secondary small peak is definitively identified with the correct shape and height. There are still some remaining oscillations at n = 300, resulting from an unavoidable Gibbs effect from using the eigenvectors as a basis. See Table 1 for RR results for this simulation.  Table 1 The RR for the two considered simulations. These are estimated by averaging 100 runs of our simulations.
interest. Our estimator,θ n , has many favorable properties. It has computational efficiency in the sense that it can be updated with a new observation without need to reference the entire sequence of observations. Instead, it relies on only a few summary statistics that need to be maintained and updated. Though its computation is predicated on finding the eigenvectors and eigenvalues of potentially large matrices, the implementation is straightforward and generalizable to higher dimensional signals such as images. Additionally, there exist accurate methods for the approximate computation of the eigenvectors of matrices that could in principle be used to speed up the computation of Ψ. Also,θ n is statistically efficient as well. The uniform consistency and oracle inequality results show that it is making about as good a use of the data as possible. Likewise,θ n has worked very well in our experiments so far, as evidenced by the results in Figures 1, 3, and 4. Our estimator can recover the unknown signal θ efficiently, requiring very few observations. Also, referring to the second row of Figure 1, the collection of a very poor observation merely doesn't improve the estimate instead of decreasing its quality. This is in opposition to many currently implemented techniques such as straight averaging, where low quality observations decrease the quality of the recovery.

Appendix A
This section gives warrant for assumption (A3) in Section 2. Although a slightly weaker version of assumption (A3) is all that is actually required (that only the right eigenvectors need be the same instead of both left and right eigenvectors) we leave it in its current form for simplicity of exposition and conditions. Two real matrices A, B share the same eigenvectors if they are simultaneously unitarily diagonalizable; that is, there exists two diagonal matrices Σ 1 , Σ 2 and an unitary matrix Ψ such that A = ΨΣ 1 Ψ * and B = ΨΣ 2 Ψ * . Note A and B must of course be unitarily diagonalizable, which implies by the spectral theorem that A and B are normal; that is A ⊤ A = AA ⊤ and B ⊤ B = BB ⊤ . The following theorem characterizes simultaneous diagonalizability.
Lemma 8. Let K be a commuting family of normal matrices. Then K is also simultaneously unitarily diagonalizable.
Proof of Lemma 8. By the Schur unitary triangularization theorem (Horn and Johnson, 1985, Theorem 2.3.1) if K is a commuting family of matrices, then there is a unitary Ψ such that ΨKΨ * is upper triangular for every K ∈ K. Hence, as normality is preserved under unitary congruence and a triangular normal matrix must be diagonal, the result follows.
Though all Toeplitz matrices commute asymptotically as the number of rows and columns increases, not all Toeplitz matrices commute for a fixed size. Many subsets of the family of Toeplitz matrices satisfy Lemma 8, however. In particular, all circulant matrices commute (Gray, 2001, Chapter 3.1). This shows Theorem 1.

Appendix B
We utilize the following notation in several of the below proofs. We use to indicate 'less than or equal to up to a constant independent of n.' Also, it is convenient to think of a complex number a = a 1 + a 2 i as an element (a 1 , a 2 ) ∈ R 2 . In this case, we use |||a||| 2 = a 2 1 +a 2 2 as a norm on R 2 , as the complex modulus is not technically defined on elements of R 2 . Additionally, Z ∼ N (0, I 2 ) is the two dimensional standard normal. Lastly, we define s nj := Ω 2 n ε 2 /∆ nj . We begin with a lemma that will be used in the proofs of Theorem 2 and Theorem 3: Lemma 9. Let µ ∈ R 2 be a vector, Σ = diag(σ 2 1 , σ 2 2 ) be a diagonal matrix with positive entries, and c 2 be a real, positive constant. Then if |||µ||| > c and σ max = max{σ 1 , σ 2 }.
Here, we don't give a formal proof but provide intuition. The probability in equation (29) corresponds to the amount of the mass of an elliptical normal, aligned with the cononical axis, that resides in a ball of radius c at the origin. Hence, if |||µ||| > c (that is, the mean is outside the ball) a more spread out the normal results in more mass inside the ball.
Proof of Theorem 2. For simplicity, writeβ n :=λ(B n ). Then where the subscript n on R has been included to emphasize the dependence on n.
We begin by defining the following set where ω ranges over the measure space on which the random variable B nj is defined. The utility of defining A j iŝ Additionally, write B nj = β j + Z nj as a mean term plus stochastic term, where Z nj is the j th entry in the complex normal ε∆ −1 n i (D * i Ψ * W i ). Then the following bound on the j th term in the loss holds: To show that the expected value of the first term goes to zero in expectation, observe: As Ω 2 n < C < ∞ for n large enough for some C by assumption (A5), First, we compute the eigenvalue matrix Λ nj of the covariance matrix of Z nj as a vector in R 2 . By the properties of complex normals 2 Hence, the entries in Λ nj are λ 2 nj,1 = ε 2 /∆ nj +ℑC jj and λ 2 nj,2 = ε 2 /∆ nj −ℑC jj , which are both strictly positive. Also, define U to be the associated eigenvector matrix.
Though it is clear that P(A c j )|β j | 2 goes to zero pointwise, the worst β j is arbitarily close to zero. Hence, to show uniform convergence, we define a parameter τ 2 nj . For each j, define B j := {β j : |||β j ||| 2 ≤ T 2 } and split this set into Also, as |||·||| is invariant under orthogonal operations, we can rotate everything by the eigenvectors U . Denote rotation by U by a tilde; that is,β j := U β j . Then, Then continuing on with the second term in the max, and using Lemma 9 with |||β j ||| > s nj , which happens if τ 2 nj > s 2 nj , The last inquality follows by Figure 5 and by noting that (1 − Φ(t))(t + 1) 2 is continuous, unimodal, and bounded by 1. Therefore sup β∈B P(A c j )|β j | 2 = max{τ 2 n , λ 2 max } Hence, it is sufficient to choose τ 2 nj = 2s 2 nj and to note that

This implies sup
As we are summing over j in the risk, we conclude that lim sup Proof of Theorem 3. We use the same notations and conventions as in the proof of Theorem 2. Note that if we define σ 2 nj = ε 2 /∆ nj , then the linear oracle risk is we bound the j th term in the loss: By the previous proof, we see that the expected value of the first and third term go to zero uniformly over β ∈ B at rate O(1/∆ nj ); the same rate as the oracle. For the second term, notice that 1 Aj s 2 nj + Ω 2 n |β j | 2 |β j + Z nj | 2 ≤ 1 Aj 1 + Ω 2 n |β j | 2 |β j + Z nj | 2 1 Aj |β j | 2 |β j + Z nj | 2 =: G nj for n large enough, by assumption (A5). Then our goal is to show that lim sup n→∞ sup β∈B EG nj < ∞.
First, due to G nj being rotationally symmetric (once we use ||| · ||| instead of |·|), we renormalize to transform Z nj into a vectorZ with independent standard normal components We define Λ nj and U in the previous proof andÃ j := {||β j + Λ 1/2 njZ || 2 > s 2 nj . We break bounding EG nj into cases.
Therefore, if we can exchange the limit and integral, then by the previous two proofs we can conclude that the limit is zero. We appeal to the following. We say a set of random variables {X t : t ∈ T } is uniformly integrable if lim x→∞ sup t∈T E|X t |1 |Xt|>x = 0.
It holds that if X t → X with probability one and {X t : t ∈ T } is uniformly integrable, then EX t → EX. Hence, we wish to show that f n is uniformly integrable. It holds that if each term in the sum over j is uniformly integrable, then f n is uniformly integrable as well.
Note that For large x, x > T 2 and for large n, Ω n ≍ 1. Therefore, we only need deal with the term ǫ 2 /∆ nj .
Using assumption (B4), continuing the above with relevant terms, and noticing that sup n f n occurs at n = 1, it follows that for x large enough This allows for the exchange of integration end hence shows the desired result.