Function values are enough for $L_2$-approximation: Part II

In the first part we have shown that, for $L_2$-approximation of functions from a separable Hilbert space in the worst-case setting, linear algorithms based on function values are almost as powerful as arbitrary linear algorithms if the approximation numbers are square-summable. That is, they achieve the same polynomial rate of convergence. In this sequel, we prove a similar result for separable Banach spaces and other classes of functions.

Let $F$ be a set of complex-valued functions on a set $D$ such that, for all $x \in D$, point evaluation is continuous with respect to some metric $d_F$ on $F$. We consider numerical approximation of functions from such classes, using only function values, and measure the error in the space $L_2 = L_2(D, \mathcal{A}, \mu)$ of square-integrable functions with respect to an arbitrary measure $\mu$ such that $F$ is embedded into $L_2$. We are interested in the $n$-th minimal worst-case error

$$ (1) \qquad e_n(F, L_2) \,:=\, \inf_{\substack{x_1, \dots, x_n \in D \\ \varphi_1, \dots, \varphi_n \in L_2}} \; \sup_{f \in F} \, \Big\| f - \sum_{i=1}^{n} f(x_i)\, \varphi_i \Big\|_{L_2}, $$

which is the worst-case error of an optimal linear algorithm that uses at most $n$ function values. These numbers are sometimes called sampling numbers. We want to compare $e_n$ with the $n$-th approximation number

$$ (2) \qquad a_n(F, L_2) \,:=\, \inf_{\substack{L_1, \dots, L_n \text{ linear} \\ \varphi_1, \dots, \varphi_n \in L_2}} \; \sup_{f \in F} \, \Big\| f - \sum_{i=1}^{n} L_i(f)\, \varphi_i \Big\|_{L_2}, $$

which is the worst-case error of an optimal linear algorithm that uses at most $n$ arbitrary linear functionals as information.
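To make the two quantities concrete, here is a small numerical sketch (ours, with a hypothetical toy class and a fixed choice of nodes and functions $\varphi_i$; the infimum in (1) would additionally optimize over these). It computes the worst-case error of one particular linear sampling algorithm on a grid discretization of $L_2(0,1)$:

```python
import numpy as np

# Toy setup: functions on [0,1] discretized on a fine grid; L2 norms via
# a simple quadrature rule. The class F consists of a few sample functions.
grid = np.linspace(0, 1, 1001)
w = np.full(grid.size, 1.0 / grid.size)                 # quadrature weights for L2(0,1)
F = [np.sin(np.pi * k * grid) / k for k in range(1, 6)] # hypothetical class

def l2_norm(f):
    return np.sqrt(np.sum(w * f**2))

# One fixed linear algorithm: sample at n equispaced nodes and
# reconstruct by piecewise-linear interpolation (hat functions phi_i).
def algo(f, nodes):
    return np.interp(grid, grid[nodes], f[nodes])

nodes = np.linspace(0, grid.size - 1, 8, dtype=int)
worst_case = max(l2_norm(f - algo(f, nodes)) for f in F)
print(worst_case)  # an upper bound on e_8(F, L2) for this toy class
```

Any such computation only yields an upper bound on $e_n$, since the definition takes the infimum over all nodes and reconstructions.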
The numbers a n and e n are well studied for many particular classes of functions.
For an exposition of known results we refer to the books [15, 16, 17], especially [17, Chapters 26 & 29], as well as [4, 19, 20] and the references therein. However, it is still one of the major open problems in the field to relate these numbers for general classes $F$. Clearly, the approximation numbers are at most as large as the sampling numbers. On the other hand, an example of Hinrichs/Novak/Vybiral from [7] shows that it is not possible to give any general upper bound for the sampling numbers in terms of the approximation numbers if the latter are not square-summable. We therefore ask for such a relation in the case that the approximation numbers are square-summable. Let us formulate one particularly interesting open question.
Open Problem. Is it true for any class $F$ and any measure $\mu$ as above that there exists a constant $c \in \mathbb{N}$ (possibly depending on $F$ or $\mu$) such that

$$ (3) \qquad e_{cn}(F, L_2) \,\le\, c\, \Big( \frac{1}{n} \sum_{k \ge n} a_k(F, L_2)^2 \Big)^{1/2} \qquad \text{for all } n \in \mathbb{N}\,? $$

Clearly, a similar question for $L_p$-approximation with $p \neq 2$ is also of interest. Note that (3) is of no use if the approximation numbers are not square-summable. But if they are square-summable, then it would imply that the sampling numbers and the approximation numbers always have the same polynomial rate of convergence. The relation (3) is true for all examples of sufficiently studied function classes $F$ that are known to the authors. On the other hand, general results are only known in the case that $F$ is the unit ball of a reproducing kernel Hilbert space. Even in this case, the problem is open, but the gap is already quite small. Namely, it was shown in [9] that (3) is true with $cn$ replaced by $cn\log(n)$, verifying that a polynomial decay rate of the approximation numbers with exponent greater than one half carries over to the sampling numbers. This was then improved by Nagel/Schäfer/T. Ullrich in [13], who showed (3) with an additional factor of $\sqrt{\log n}$ on the right hand side. The aim of this paper is to prove a similar result for general function classes $F$.
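To see why (3) would transfer the polynomial rate, note the following computation in the model case $a_k(F, L_2) \le C k^{-\alpha}$ with $\alpha > 1/2$ (a sketch under this assumption):

$$ \frac{1}{n} \sum_{k \ge n} a_k(F, L_2)^2 \,\le\, \frac{C^2}{n} \Big( n^{-2\alpha} + \int_n^\infty x^{-2\alpha}\, dx \Big) \,=\, \frac{C^2}{n} \Big( n^{-2\alpha} + \frac{n^{1-2\alpha}}{2\alpha - 1} \Big) \,\le\, \frac{2\alpha\, C^2}{2\alpha - 1}\; n^{-2\alpha}, $$

so that the right hand side of (3) is of order $n^{-\alpha}$, the rate of the approximation numbers themselves.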
Before we come to our result, let us mention that, recently, another beautiful upper bound for the sampling numbers of general classes $F$ has been shown by Temlyakov in [21]. Namely, for a finite measure $\mu$, the sampling numbers may be bounded above by the Kolmogorov $n$-widths of $F$ in the uniform norm. While this relation has many interesting consequences, it does not lead to a general comparison of the numbers $e_n$ and $a_n$, see Example 1.
Our main result reads as follows.
Theorem 1. Let $(D, \mathcal{A}, \mu)$ be a measure space and let $F$ be a separable metric space of complex-valued functions on $D$ that is continuously embedded into $L_2(D, \mathcal{A}, \mu)$ such that function evaluation is continuous on $F$. Assume that

$$ (4) \qquad a_n(F, L_2) \,\le\, c\, n^{-\alpha} \log^\beta(n+1) \qquad \text{for all } n \in \mathbb{N} $$

with some $\alpha > 1/2$, $\beta \in \mathbb{R}$ and $c > 0$. Then there is a constant $C > 0$, depending only on $\alpha$, $\beta$ and $c$, such that

$$ e_n(F, L_2) \,\le\, C\, n^{-\alpha} \log^\beta(n+1) \qquad \text{for all } n \in \mathbb{N}. $$

By homogeneity, we can clearly choose $C = c \cdot c'$, where $c' > 0$ only depends on $\alpha$ and $\beta$. We will actually show that the upper bound is achieved by a specific weighted least squares estimator. The algorithm is semi-constructive: it requires the knowledge of (almost) optimal subspaces related to (2) and uses a random selection of points that work with high probability.
Let us shortly discuss our condition on the set $F$. The natural condition appearing in our proof is that $F$ is a countable subset of $L_2$, see Theorem 2. We then need some kind of continuity in order to extend our result to uncountable sets $F$. Here, we employ that $F$ is a separable metric space with continuous function evaluation and continuous embedding into $L_2$. These assumptions are satisfied, for example, if

• $F$ is the unit ball of a separable normed space on which function evaluation at each point is a continuous functional, or
• the measure $\mu$ is finite and $F$ is a compact subset of $L_\infty(D, \mathcal{A}, \mu)$.
In particular, we cover the settings considered in [13] and [21]. Clearly, the theorem might also be applied if $F$ is the unit ball of a non-separable normed space, since we might have separability with respect to a weaker norm. For instance, the unit ball $F$ of the non-separable Sobolev space $W^s_\infty(0,1)$ with smoothness $s \in \mathbb{N}$ clearly satisfies the conditions of Theorem 1 when equipped with the $L_\infty$-metric.
We finish the introduction with a short comparison of our result and the recent result of V. Temlyakov [21]. First of all, both results have obvious advantages over one another. Namely, [21] only works for classes of bounded functions and finite measures, while our result requires the square-summability of the approximation numbers instead. These conditions are independent of one another. The following example shows that it is not possible to say in general that one result yields better estimates than the other. It was kindly provided to us by Erich Novak.

Example 1. Let $\mu$ be the Lebesgue measure on $[0,1]$ and let $(\ell_i)_{i \in \mathbb{N}}$ and $(h_i)_{i \in \mathbb{N}}$ be decreasing zero-sequences with $\sum_{i=1}^{\infty} \ell_i = 1$. Let $b_i$ be the hat function with height one, supported on the interval $[\ell_1 + \dots + \ell_{i-1},\, \ell_1 + \dots + \ell_i]$, and let $F$ be the set of all functions $f = \sum_{i \in \mathbb{N}} c_i\, h_i\, b_i$ with coefficients $|c_i| \le 1$. Note that $F$ is a compact subset of $C([0,1])$, which follows from the theorem of Arzelà and Ascoli. For this example, one can compute that

$$ e_n(F, L_2) \,\asymp\, \Big( \sum_{i > n} h_i^2\, \ell_i \Big)^{1/2} $$

and that the Kolmogorov widths in the uniform norm are given by

$$ d_n(F, L_\infty) \,\asymp\, h_{n+1}. $$

By choosing $\ell_i = i^{-\alpha} / \sum_{k \in \mathbb{N}} k^{-\alpha}$ with some $\alpha > 1$ and $h_i = i^{-\beta}$ with some $\beta > 0$, we obtain that the sampling numbers are of order $n^{-(\beta + (\alpha-1)/2)}$ while the Kolmogorov widths are of order $n^{-\beta}$. If $\alpha$ is close to 1 and $\beta < 1/2$, then [21] yields an almost optimal bound while our result yields nothing. If $\alpha > 2$ and $\beta$ is close to zero, then our result yields an almost optimal bound, while [21] yields almost nothing.

1. The result behind Theorem 1
Our main result is based on the following apparently more general theorem.
Theorem 2. Let $(D, \mathcal{A}, \mu)$ be a measure space and let $F_0$ be a countable set of functions in $L_2(D, \mathcal{A}, \mu)$. Assume that

$$ a_n(F_0, L_2) \,\le\, c\, n^{-\alpha} \log^\beta(n+1) \qquad \text{for all } n \in \mathbb{N} $$

for some $\alpha > 1/2$, $\beta \in \mathbb{R}$ and $c > 0$. Then there is a constant $C > 0$, depending only on $\alpha$, $\beta$ and $c$, such that

$$ e_n(F_0, L_2) \,\le\, C\, n^{-\alpha} \log^\beta(n+1) \qquad \text{for all } n \in \mathbb{N}. $$

Before we prove this theorem, let us show how it implies Theorem 1.
Proof of Theorem 1. Since $F$ is a separable metric space, it contains a countable dense subset $F_0$. Now, let $x_1, \dots, x_n \in D$ and $\varphi_1, \dots, \varphi_n \in L_2$ be arbitrary. We obtain for every $f \in F$ and $g \in F_0$ that

$$ \Big\| f - \sum_{i=1}^{n} f(x_i)\, \varphi_i \Big\|_{L_2} \,\le\, \| f - g \|_{L_2} \,+\, \Big\| g - \sum_{i=1}^{n} g(x_i)\, \varphi_i \Big\|_{L_2} \,+\, \sum_{i=1}^{n} |g(x_i) - f(x_i)|\, \| \varphi_i \|_{L_2}. $$

To bound the first and the last term, first note that $U_\delta(f) \cap F_0 \neq \emptyset$ for every $\delta > 0$, where

$$ U_\delta(f) \,:=\, \{ g \in F : d_F(f, g) < \delta \}. $$

The continuity of the embedding into $L_2$ and of function evaluation now implies that for any $\varepsilon > 0$ there is some $\delta > 0$ such that $\| f - g \|_{L_2} < \varepsilon$ and $|f(x_i) - g(x_i)| < \varepsilon$ for all $i = 1, \dots, n$ and all $g \in U_\delta(f)$. Therefore, for every $\varepsilon > 0$ and every $f \in F$, we have

$$ \Big\| f - \sum_{i=1}^{n} f(x_i)\, \varphi_i \Big\|_{L_2} \,\le\, \sup_{g \in F_0} \Big\| g - \sum_{i=1}^{n} g(x_i)\, \varphi_i \Big\|_{L_2} \,+\, \varepsilon \Big( 1 + \sum_{i=1}^{n} \| \varphi_i \|_{L_2} \Big). $$

We obtain that

$$ \sup_{f \in F} \Big\| f - \sum_{i=1}^{n} f(x_i)\, \varphi_i \Big\|_{L_2} \,\le\, \sup_{g \in F_0} \Big\| g - \sum_{i=1}^{n} g(x_i)\, \varphi_i \Big\|_{L_2} $$

for all $x_1, \dots, x_n \in D$ and $\varphi_1, \dots, \varphi_n \in L_2$. Therefore, an error bound of an algorithm on $F_0$ carries over to $F$, and so Theorem 2 implies Theorem 1.
We will now prove Theorem 2 by proving an error bound for a specific algorithm on $F_0$. Recall that we have just proven that the same algorithm works for the class $F$ from Theorem 1 if we choose $F_0$ as a countable dense subset.

2. Algorithm and Proof of Theorem 2
Throughout this section, we work under the assumptions of Theorem 2. We start with the following simple observation.

Lemma 3.
There is an orthonormal system $\{b_k : k \in \mathbb{N}\}$ in $L_2$ such that the orthogonal projection $P_n$ onto the span of $b_1, \dots, b_n$ satisfies

$$ (5) \qquad \sup_{f \in F_0} \| f - P_n f \|_{L_2} \,\le\, 2\, a_{\lceil n/4 \rceil}(F_0, L_2) \qquad \text{for all } n \in \mathbb{N}. $$

Proof. Clearly, it is enough to find an increasing sequence of subspaces $U_n$ of $L_2$ with $\dim(U_n) \le n$ such that the orthogonal projection $P_n$ onto $U_n$ satisfies (5). By the definition of $a_m$, $m \in \mathbb{N}$, there is a subspace $W_m \subset L_2$ of dimension $m$ and a linear operator $T_m$ mapping into $W_m$ with

$$ \sup_{f \in F_0} \| f - T_m f \|_{L_2} \,\le\, 2\, a_m(F_0, L_2). $$

We let $U_n$ be the space that is spanned by the union of the spaces $W_{2^k}$ over all $k \in \mathbb{N}_0$ with $2^k \le n/2$. Note that $\dim(U_n) \le n$ and that $U_n$ contains a subspace $W_m$ with $m \ge n/4$. Therefore, $P_n f$ is at least as close to $f$ as $T_m f$ for some $m \ge n/4$, which implies (5).
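In a discretized setting, the construction in this proof is easy to carry out. The following numpy sketch (our illustration, with hypothetical random matrices standing in for the near-optimal subspaces $W_{2^k}$) stacks the subspaces and orthonormalizes their union, so that projecting onto the resulting first $n$ basis vectors is at least as good as projecting onto any included $W_{2^k}$:

```python
import numpy as np

d = 512                      # functions discretized as vectors in R^d
rng = np.random.default_rng(0)

# Hypothetical near-optimal subspaces W_{2^k}, given by basis matrices
# of shape (d, 2^k); in the paper they come from the definition of a_m.
W = {2**k: rng.standard_normal((d, 2**k)) for k in range(8)}

def projection_basis(n):
    # Union of the W_{2^k} over all k with 2^k <= n/2, as in the proof;
    # the geometric sum guarantees dim(U_n) <= n.
    cols = [W[2**k] for k in range(8) if 2**k <= n / 2]
    U = np.concatenate(cols, axis=1)
    Q, _ = np.linalg.qr(U)   # orthonormalize: columns play the role of b_1, b_2, ...
    return Q

def project(f, n):
    Q = projection_basis(n)
    return Q @ (Q.T @ f)     # orthogonal projection P_n f

f = rng.standard_normal(d)
print(np.linalg.norm(f - project(f, 64)))
```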
In what follows, $\{b_k : k \in \mathbb{N}\}$ will always be the orthonormal system from Lemma 3. Note that we will consider $b_k$ as a function, where we take an arbitrary representer from the equivalence class in $L_2$. We now establish almost sure convergence of the (abstract) Fourier series on $F_0$.

Lemma 4.
There is a measurable subset $D_0$ of $D$ with $\mu(D \setminus D_0) = 0$ such that for all $x \in D_0$ and $f \in F_0$ we have

$$ f(x) \,=\, \sum_{k=1}^{\infty} \langle f, b_k \rangle\, b_k(x), $$

where the series converges absolutely.

Proof. Let us fix some parameter $\delta$ with $1/2 < \delta < \alpha$. We first observe

$$ \sum_{k \in \mathbb{N}} k^{2\delta}\, |\langle f, b_k \rangle|^2 \,\le\, \sum_{\ell=0}^{\infty} 2^{2\delta(\ell+1)}\, \| f - P_{2^\ell - 1} f \|_{L_2}^2 $$

for all $f \in F_0$, and thus that this sum is bounded by a constant $C$ (only depending on $\delta$, $\alpha$, $\beta$, $c$). Now Hölder's inequality yields for all $x \in D$ and $f \in F_0$ that

$$ \sum_{k \in \mathbb{N}} |\langle f, b_k \rangle|\, |b_k(x)| \,\le\, \Big( \sum_{k \in \mathbb{N}} k^{2\delta}\, |\langle f, b_k \rangle|^2 \Big)^{1/2} \Big( \sum_{k \in \mathbb{N}} k^{-2\delta}\, |b_k(x)|^2 \Big)^{1/2}. $$

Since $\int_D \sum_{k \in \mathbb{N}} k^{-2\delta}\, |b_k(x)|^2 \, d\mu(x) = \sum_{k \in \mathbb{N}} k^{-2\delta} < \infty$, the right hand side is finite for $\mu$-almost every $x \in D$. In particular, the Fourier series converges (absolutely) almost everywhere.
On the other hand, we get from (5) that the series also converges in $L_2$ and that the $L_2$-limit is $f$. Together, this implies that the Fourier series converges almost everywhere to $f$. Since $F_0$ is countable, the almost everywhere convergence holds simultaneously for all $f \in F_0$.
The proof of our bound on the sampling numbers is based on an error bound for a specific algorithm. This is a weighted least squares estimator of the form

$$ (6) \qquad A_{m,n}(f) \,:=\, \underset{g \in V_n}{\operatorname{argmin}} \; \sum_{i=1}^{m} \frac{|g(x_i) - f(x_i)|^2}{\varrho(x_i)}, \qquad V_n := \operatorname{span}\{b_1, \dots, b_n\}, $$

for some $x_1, \dots, x_m \in D$, $m \ge n$, and positive weights $\varrho(x_i) > 0$. We give an explicit formula for the weight function $\varrho \colon D \to \mathbb{R}_+$ later; in any case, the weights and the algorithm are not completely constructive, as we obtain $V_n$ and $x_1, \dots, x_m$ through an existence argument. Note that the algorithm $A_{m,n}$ may be written as

$$ A_{m,n}(f) \,=\, \sum_{k=1}^{n} \big( G^+ N(f) \big)_k\, b_k, \qquad \text{where} \qquad N(f) \,:=\, \Big( \frac{f(x_i)}{\sqrt{m\, \varrho(x_i)}} \Big)_{i=1}^{m} $$

is the weighted information mapping and $G^+ \in \mathbb{C}^{n \times m}$ is the Moore-Penrose inverse of the matrix

$$ (7) \qquad G \,:=\, \Big( \frac{b_j(x_i)}{\sqrt{m\, \varrho(x_i)}} \Big)_{i \le m,\, j \le n}. $$

This description of $A_{m,n}$ is actually more precise since it also specifies $A_{m,n}(f)$ in the case that the argmin in (6) is not unique (which is equivalent to $G$ not having full rank). For the state of the art on (weighted) least squares methods for the approximation of individual functions, or in a randomized setting, we refer to [2, 3] and references therein. Here, we consider such methods in the worst-case setting, i.e., we measure the error via

$$ e(A_{m,n}, F_0, L_2) \,:=\, \sup_{f \in F_0} \| f - A_{m,n}(f) \|_{L_2}. $$
Clearly, we have $e_m(F, L_2) \le e(A_{m,n}, F, L_2)$ for every choice of $x_1, \dots, x_m$ and $\varrho$.
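As an illustration of (6) and its Moore-Penrose form, here is a small numpy sketch (ours, not the authors' code; the basis, points and weights are placeholder choices, and the paper uses the density $\varrho$ from Section 2.1 rather than a constant weight):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 10, 200

# Placeholder orthonormal system on [0,1]: b_1 = 1, then a trigonometric basis.
def b(j, x):
    if j == 0:
        return np.ones_like(x)
    k = (j + 1) // 2
    return np.sqrt(2) * (np.cos if j % 2 else np.sin)(2 * np.pi * k * x)

x = rng.random(m)                 # sampling points (i.i.d. uniform here)
rho = np.ones(m)                  # constant density as a stand-in for rho(x_i)

# The matrix G from (7), with rows b_j(x_i) / sqrt(m * rho(x_i)).
G = np.stack([b(j, x) for j in range(n)], axis=1) / np.sqrt(m * rho)[:, None]

def A(f_vals):
    # A_{m,n}(f): coefficients G^+ N(f) with respect to b_1, ..., b_n
    N_f = f_vals / np.sqrt(m * rho)
    return np.linalg.pinv(G) @ N_f

f = lambda t: np.exp(np.sin(2 * np.pi * t))
coeffs = A(f(x))                  # weighted least squares fit in span{b_1,...,b_n}
print(coeffs[:4])
```

Using the pseudoinverse rather than a raw `argmin` also reproduces the uniqueness convention discussed above: when $G$ is rank-deficient, `pinv` returns the minimum-norm solution.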
The proof of our upper bound uses the following simple lemma, see [9], which we prove for the reader's convenience. Note that our systematic study of the "power of random information" was initiated in [6], see also [5], and therefore some of the basic ideas behind it, like a version of the following lemma, already appeared there.

Lemma 5.
Assume that $G$ has full rank. Then

$$ \sup_{f \in F_0} \| f - A_{m,n}(f) \|_{L_2} \,\le\, \sup_{f \in F_0} \Big( \| f - P_n f \|_{L_2} \,+\, \frac{\| N(f - P_n f) \|_2}{s_{\min}(G)} \Big), $$

where $s_{\min}(G)$ is the smallest singular value of the matrix $G$.

Proof. Since $G$ has full rank, we obtain that $A_{m,n}$ satisfies $A_{m,n}(f) = f$ for all $f \in V_n$. Using Lemma 3 and the linearity of $A_{m,n}$, we obtain for any $f \in F_0$ that

$$ \| f - A_{m,n}(f) \|_{L_2} \,\le\, \| f - P_n f \|_{L_2} + \| A_{m,n}(P_n f - f) \|_{L_2} \,=\, \| f - P_n f \|_{L_2} + \| G^+ N(f - P_n f) \|_2. $$

It only remains to note that the norm of $G^+$ is the inverse of the smallest singular value of the matrix $G$.
The main result of this paper therefore follows once we prove

$$ s_{\min}(G) \,\ge\, c_0 \qquad \text{and} \qquad \sup_{f \in F_0} \| N(f - P_n f) \|_2 \,\le\, C_0\, \varepsilon_n $$

with some constants $c_0, C_0 > 0$ for an instance of $A_{m,n}$ with $n \le m \le Cn$, where $\varepsilon_n := n^{-\alpha} \log^\beta(n+1)$ is the upper bound on the approximation numbers from (4). Note that the first bound implies that $G$ has full rank. We divide the proof of this into two parts: (1) we show these bounds with high probability for $m \asymp n \log(n+1)$ i.i.d. random points, and then (2), based on the famous solution to the Kadison-Singer problem, we extract $m \asymp n$ points that fulfill the same bounds.
2.1. Random points. We first show that the result of this paper holds with high probability if we allow a logarithmic oversampling. For this, we now introduce the sampling density $\varrho \colon D \to \mathbb{R}$, which will also specify the weights in the algorithm. For $1/2 < \delta < \alpha$ let

$$ (8) \qquad \varrho(x) \,:=\, \frac{1}{2} \left( \frac{1}{n} \sum_{k \le n} |b_k(x)|^2 \,+\, \frac{1}{c_n} \sum_{k > n} k^{-2\delta}\, |b_k(x)|^2 \right), \qquad \text{where} \qquad c_n \,:=\, \sum_{k > n} k^{-2\delta}. $$

Remark 6. The form of the density $\varrho$ is very much inspired by the density invented in [9], which was already applied in [8, 13, 22]. The density in these papers was needed to prove the result for Hilbert spaces in greatest generality, i.e., for all sequences $(a_n) \in \ell_2$. The density used here is different. It is not clear to us whether one can use the density from [9] for arbitrary classes $F$ and prove a result like Theorem 1 for all $(a_n) \in \ell_2$. Presently, we do not know what happens, e.g., in the case that $a_n \asymp n^{-1/2} \log^{-3/2}(n+1)$, see also Remark 9.
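For intuition, the density (8) can be evaluated and sampled in a truncated, discretized model. The following sketch (ours; the trigonometric system and the truncation level K are placeholder assumptions) draws points by inverse-transform sampling on a grid:

```python
import numpy as np

n, K, delta = 8, 400, 0.75        # truncate the tail sum at K basis functions
grid = np.linspace(0, 1, 4096, endpoint=False)

def b(k, x):                      # placeholder ONB of L2(0,1): trigonometric system
    if k == 1:
        return np.ones_like(x)
    j = k // 2
    return np.sqrt(2) * (np.sin if k % 2 == 0 else np.cos)(2 * np.pi * j * x)

B = np.stack([b(k, grid) for k in range(1, K + 1)])      # shape (K, grid)
c_n = sum(k ** (-2 * delta) for k in range(n + 1, K + 1))

head = (B[:n] ** 2).mean(axis=0)                          # (1/n) sum_{k<=n} |b_k|^2
ks = np.arange(n + 1, K + 1)
tail = (ks[:, None] ** (-2 * delta) * B[n:] ** 2).sum(axis=0) / c_n
rho = 0.5 * (head + tail)         # integrates to ~1 against the Lebesgue measure

# inverse-transform sampling of x_1, ..., x_m from the density rho
cdf = np.cumsum(rho) / rho.sum()
x = grid[np.searchsorted(cdf, np.random.default_rng(2).random(300))]
print(rho.mean(), x[:5])          # rho.mean() ~ 1 (discretized integral)
```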
As the result of this part of the proof might be of independent interest, we formulate it as a theorem.
Theorem 7. Let $(D, \mathcal{A}, \mu)$ be a measure space and let $F_0$ be a countable subset of $L_2(D, \mathcal{A}, \mu)$. Assume that

$$ a_n(F_0, L_2) \,\le\, c\, n^{-\alpha} \log^\beta(n+1) \qquad \text{for all } n \in \mathbb{N} $$

for some $\alpha > 1/2$, $\beta \in \mathbb{R}$ and $c > 0$. Then, there are constants $C_1, C_2 > 0$, depending only on $\alpha$, $\beta$ and $c$, such that $A_{m,n}$ from (6) with $m = \lceil C_1 n \log(n+1) \rceil$ and i.i.d. random variables $x_1, \dots, x_m$ with $\mu$-density $\varrho$ satisfies, with high probability,

$$ \sup_{f \in F_0} \| f - A_{m,n}(f) \|_{L_2} \,\le\, C_2\, n^{-\alpha} \log^\beta(n+1). $$

The proof of this result follows a similar reasoning as the original proof in [9], with the improvements from [22] that show the result from [9] with high probability. The crucial difference is that we show an upper bound on the "norm" of the information mapping $N$ on the set related to full approximation spaces, instead of the smaller set related to Hilbert spaces that is considered in [9, 22]. We do this with the help of a dyadic decomposition of the index set $\{k \in \mathbb{N} : k > n\}$, together with suitable bounds on the norms of the corresponding random matrices. For this, we use again the matrix concentration result from [18], see also [12].
Proposition 8 ([18, Lemma 1]). Let $X$ be a random vector in $\mathbb{C}^k$ with $\|X\|_2 \le R$ with probability 1, and let $X_1, X_2, \dots$ be independent copies of $X$. Additionally, let $E := \mathbb{E}\, X X^*$, where $\|E\|$ denotes the spectral norm of $E$. Then, for all $t > 0$ and $m \in \mathbb{N}$,

$$ \mathbb{P}\left( \Big\| \frac{1}{m} \sum_{i=1}^{m} X_i X_i^* - E \Big\| \,>\, t\, \|E\| \right) \,\le\, 2k^2 \exp\left( - \frac{s_t\, m\, \|E\|}{16\, R^2} \right), $$

where $s_t = t^2$ for $t \le 2$, and $s_t = 4(t-1)$ for $t > 2$.
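A quick numeric illustration of the phenomenon behind Proposition 8 (our sketch, with an arbitrary bounded distribution; none of this is from the paper): the empirical second-moment matrix of i.i.d. bounded random vectors concentrates around $E$ in spectral norm as $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
k = 50

def sample(m):
    # bounded random vectors: rows uniform on the sphere of radius R = sqrt(k)
    X = rng.standard_normal((m, k))
    return np.sqrt(k) * X / np.linalg.norm(X, axis=1, keepdims=True)

E = np.eye(k)                     # for this distribution, E[X X^*] = I_k
for m in [100, 1000, 10000]:
    X = sample(m)
    emp = X.T @ X / m             # (1/m) sum_i X_i X_i^*
    print(m, np.linalg.norm(emp - E, ord=2))   # spectral-norm deviation decays
```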
Note that [18, Lemma 1] wrongly states $s_t = \min\{t^2, 4t-4\}$, which can easily be corrected by looking into the proof. Let us now prove the norm bounds that we need. Namely, we prove the existence of constants $C_1$ and $C_2$ such that the following holds for $m = \lceil C_1 n \log(n+1) \rceil$ and i.i.d. points $x_1, \dots, x_m$ with density $\varrho$.
Fact 1: With high probability,

$$ (9) \qquad \| G^* G - I_n \| \,\le\, \frac{1}{2}, \qquad \text{and in particular} \qquad s_{\min}(G) \,\ge\, \frac{1}{\sqrt{2}}. $$

Fact 2: With high probability,

$$ (10) \qquad \sup_{f \in F_0} \| N(f - P_n f) \|_2 \,\le\, c_4\, \varepsilon_n $$

for some constant $c_4 > 0$ depending only on $\alpha$, $\beta$ and $c$. Together with Lemma 5, these bounds clearly imply Theorem 7.
Proof of Fact 1. The rows of $G$ are i.i.d. copies of the vector $\frac{1}{\sqrt{m}} X$, where $X := \varrho(x)^{-1/2} (b_j(x))_{j \le n} \in \mathbb{C}^n$ and $x$ is distributed according to $\varrho \, d\mu$, so that $G^* G = \frac{1}{m} \sum_{i=1}^{m} X_i X_i^*$. Since the $b_j$ are orthonormal, we have $E = \mathbb{E}\, X X^* = I_n$ and thus $\|E\| = 1$. Moreover, the definition (8) of $\varrho$ yields

$$ \| X \|_2^2 \,=\, \frac{1}{\varrho(x)} \sum_{j \le n} |b_j(x)|^2 \,\le\, 2n, $$

i.e., we may take $R^2 = 2n$. Proposition 8 with $t = 1/2$ therefore gives

$$ \mathbb{P}\Big( \| G^* G - I_n \| > \frac{1}{2} \Big) \,\le\, 2n^2 \exp\Big( - \frac{m}{128\, n} \Big), $$

which tends to zero if $m = \lceil C_1 n \log(n+1) \rceil$ with $C_1$ large enough. This proves Fact 1.
Proof of Fact 2. Note that we almost surely have for all $i = 1, \dots, m$ that $\varrho(x_i)$ is positive and finite and that $x_i$ is contained in the set $D_0$ from Lemma 4, such that we have $f(x_i) = \sum_{k} \langle f, b_k \rangle\, b_k(x_i)$ for all $f \in F_0$. In this event, each entry of $N(f - P_n f) \in \mathbb{C}^m$ can be written as

$$ N(f - P_n f)_i \,=\, \frac{1}{\sqrt{m\, \varrho(x_i)}} \sum_{k > n} \langle f, b_k \rangle\, b_k(x_i). $$

If we now define $I_\ell := \{ n 2^\ell + 1, \dots, n 2^{\ell+1} \}$ and the random matrices

$$ \Gamma_\ell \,:=\, \Big( \frac{b_k(x_i)}{\sqrt{m\, \varrho(x_i)}} \Big)_{i \le m,\, k \in I_\ell}, $$

then Lemma 3 and (4) yield

$$ \Big( \sum_{k \in I_\ell} |\langle f, b_k \rangle|^2 \Big)^{1/2} \,\le\, \| f - P_{n 2^\ell} f \|_{L_2} \,\le\, c'\, 2^{-\alpha \ell}\, (1+\ell)^{\beta_+}\, \varepsilon_n $$

with $\beta_+ = \max\{\beta, 0\}$ and some $c' > 0$, depending only on $\alpha$, $\beta$ and $c$, and thus

$$ \| N(f - P_n f) \|_2 \,\le\, \sum_{\ell=0}^{\infty} \| \Gamma_\ell \| \, \Big( \sum_{k \in I_\ell} |\langle f, b_k \rangle|^2 \Big)^{1/2} \,\le\, c'\, \varepsilon_n \sum_{\ell=0}^{\infty} 2^{-\alpha \ell}\, (1+\ell)^{\beta_+}\, \| \Gamma_\ell \|. $$

It remains to bound the norms of $\Gamma_\ell$ with high probability, simultaneously for all $\ell$.
Let us start with an individual $\ell$, and consider the i.i.d. random vectors

$$ X_i \,:=\, \varrho(x_i)^{-1/2} \big( b_k(x_i) \big)_{k \in I_\ell} \,\in\, \mathbb{C}^{n 2^\ell}, \qquad i = 1, \dots, m, $$

for which $\Gamma_\ell^* \Gamma_\ell = \frac{1}{m} \sum_{i=1}^{m} X_i X_i^*$ and

$$ \| X_i \|_2^2 \,\le\, 2\, c_n\, (n 2^{\ell+1})^{2\delta} \,=:\, R_\ell^2, $$

where we use the definition of $\varrho$ and that $k^{-2\delta} \ge (n 2^{\ell+1})^{-2\delta}$ for all $k \in I_\ell$. Moreover, $E = \mathbb{E}\, X_i X_i^* = \operatorname{diag}(1, \dots, 1)$ and so $\|E\| = 1$. Therefore, Proposition 8 with $m = \lceil C_1 n \log(n+1) \rceil$ as above and $t = t_\ell := (1+\ell)\, 2^{2\delta \ell}$ yields

$$ \mathbb{P}\Big( \| \Gamma_\ell \|^2 \,>\, 1 + t_\ell \Big) \,\le\, C_3\, (n 2^\ell)^2 \exp\big( - C_4\, (1+\ell) \log(n+1) \big) $$

for some constants $C_3, C_4 > 0$ (depending only on $C_1$ and $\delta$). Note that these probabilities are summable, and so a union bound shows that, with high probability, $\| \Gamma_\ell \|^2 \le 1 + t_\ell$ holds simultaneously for all $\ell \in \mathbb{N}_0$. We therefore obtain, with high probability,

$$ \sup_{f \in F_0} \| N(f - P_n f) \|_2 \,\le\, c'\, \varepsilon_n \sum_{\ell=0}^{\infty} 2^{-\alpha \ell}\, (1+\ell)^{\beta_+}\, \sqrt{1 + t_\ell} \,\lesssim\, \varepsilon_n \sum_{\ell=0}^{\infty} 2^{(\delta - \alpha) \ell}\, (1+\ell)^{\beta_+ + 1/2}. $$

Since $\alpha > \delta > \frac{1}{2}$, the latter series is finite.
2.2. Kadison-Singer to reduce the number of points. We now employ the powerful solution to the Kadison-Singer problem due to Marcus, Spielman and Srivastava [11], see also [1], to show that we can reduce the number of points in our algorithm to $m \asymp n$, without losing the error bound. In detail, we need an equivalent version of the KS problem due to [23], which was already brought into a form that is very useful for us in [13]. Note that the authors of [13] seem to be the first to use this approach in a setting similar to ours. By this, they improved upon [9] and proved the result of Theorem 1 for $F$ being the unit ball of a separable reproducing kernel Hilbert space. For applications of KS to the discretization of the $L_2$-norm, which was used to prove the result in [21], see [10]. Here, we use a special case of [13, Theorem 2.3]: it provides constants $c_1 \in \mathbb{N}$ and $c_2, c_3 > 0$ such that any vectors $u_1, \dots, u_m$ with $\| u_i \|_2^2 \le c_2\, n/m$ that form an approximate Parseval frame admit a subset $J \subset \{1, \dots, m\}$ of size $\#J \le c_1 n$ which, after rescaling by $m/\#J$, retains frame bounds depending only on $c_1$, $c_2$ and $c_3$.

We now let

$$ u_i \,:=\, \frac{1}{\sqrt{m\, \varrho(x_i)}} \big( b_j(x_i) \big)_{j \le n} \,\in\, \mathbb{C}^n $$

with random $x_1, \dots, x_m$ as in the previous section. From (8), (9) and (10) we see that $u_1, \dots, u_m$ satisfy with high probability the conditions of the proposition and

$$ \sup_{f \in F_0} \| N(f - P_n f) \|_2 \,\le\, c_4\, \varepsilon_n $$

for some constant $c_4 > 0$. The proposition yields $J \subset \{1, \dots, m\}$ with $\#J \le c_1 n$ such that the matrix

$$ G_J \,:=\, \Big( \frac{b_j(x_i)}{\sqrt{\#J\, \varrho(x_i)}} \Big)_{i \in J,\, j \le n} $$

has its smallest singular value bounded below by a positive constant, while

$$ \| N_J(f) \|_2 \,\le\, \sqrt{m}\; \| N(f) \|_2 $$

for every $J \subset \{1, \dots, m\}$, where $N_J(f) := \big( \varrho(x_i)^{-1/2} f(x_i) \big)_{i \in J}$. Using Lemma 5, we see that the algorithm

$$ A_J(f) \,:=\, \underset{g \in V_n}{\operatorname{argmin}} \; \sum_{i \in J} \frac{|g(x_i) - f(x_i)|^2}{\varrho(x_i)} $$

satisfies

$$ \sup_{f \in F_0} \| f - A_J(f) \|_{L_2} \,\le\, C\, \varepsilon_n $$

for some constant $C > 0$ depending only on $\alpha$, $\beta$ and $\delta$. This proves Theorem 2, and thereby Theorem 1.
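The Kadison-Singer step is a pure existence statement, so the subset $J$ is not constructed explicitly. What one can do in practice is verify the key property that makes a candidate subset usable, namely a lower bound on the smallest singular value of the subsampled matrix $G_J$. A sketch of such a check (ours, with placeholder random data standing in for $b_j(x_i)$ and $\varrho(x_i)$; a randomly chosen $J$ is only a stand-in for the subset whose existence the proposition guarantees):

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 20, 400

# Placeholder data: B stands in for the matrix (b_j(x_i))_{i,j},
# rho for the density values rho(x_i) at the sampled points.
B = rng.standard_normal((m, n))
rho = np.full(m, 1.0)

def smin_of_subset(J):
    # matrix with rows b_j(x_i) / sqrt(#J * rho(x_i)), i in J
    G_J = B[J] / np.sqrt(len(J) * rho[J])[:, None]
    return np.linalg.svd(G_J, compute_uv=False).min()

J = rng.choice(m, size=3 * n, replace=False)      # a candidate subset, #J <= c_1 n
print("s_min(G_J) =", smin_of_subset(J))          # want: bounded away from 0
print("s_min(G)   =", smin_of_subset(np.arange(m)))
```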