Nyström type subsampling analyzed as a regularized projection

In statistical learning theory, Nyström type subsampling methods are considered as tools for dealing with big data. In this paper we consider Nyström subsampling as a special form of the projected Lavrentiev regularization and study it using approaches developed in regularization theory. As a result, we prove that the same capacity independent learning rates that are guaranteed for standard algorithms running with quadratic computational complexity can be obtained with subquadratic complexity by the Nyström subsampling approach, provided that the subsampling size is chosen properly. We propose an a priori rule for choosing the subsampling size and an a posteriori strategy for dealing with uncertainty in this choice. The theoretical results are illustrated by numerical experiments.


Introduction
Regularization based kernel methods, such as kernel ridge regression (KRR), provide an effective framework for supervised learning [14,15]. However, a standard implementation of these methods is infeasible when dealing with so-called 'Big Data'.
The Big Data concept can be considered from different points of view. Here, by 'Big Data', we mean data sets exceeding the computational capacity of conventional learning systems. For example, in KRR one receives a training data set z = {(x_i, y_i)}_{i=1}^N ⊂ X × Y of N samples generated by an unknown target function f*: X → Y, and the goal is to approximate this function by the minimizer f_z^α of the regularized empirical risk functional

T_α(f | z) = (1/|z|) Σ_{(x_i, y_i) ∈ z} (f(x_i) − y_i)² + α ‖f‖²_K → min over f ∈ H_K.    (1)

Here H_K denotes the reproducing kernel Hilbert space (RKHS) generated by a kernel K: X × X → ℝ, X ⊂ ℝ^d is equipped with the metric inherited from ℝ^d, |z| = N, and α is a regularization parameter.
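To make (1) concrete, here is a minimal computational sketch; the Gaussian kernel and the helper names (gaussian_kernel, krr_fit, krr_predict) are our illustrative choices, not part of the original presentation. As recalled in the next paragraph, the minimizer of (1) reduces to an N × N linear system.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def krr_fit(X, y, alpha, sigma=1.0):
    """Minimize (1/N) sum_i (f(x_i) - y_i)^2 + alpha ||f||_K^2 over f in H_K.
    The minimizer is f(x) = sum_j c_j K(x, x_j) with (K + alpha N I) c = y."""
    N = X.shape[0]
    K = gaussian_kernel(X, X, sigma)                    # O(N^2) cost already here
    return np.linalg.solve(K + alpha * N * np.eye(N), y)

def krr_predict(c, X_train, X_new, sigma=1.0):
    return gaussian_kernel(X_new, X_train, sigma) @ c
```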
By the representer theorem for RKHS [7], the minimizer of (1) has the form f_z^α = Σ_{j=1}^N c_j K(·, x_j). It is now clear that KRR suffers from at least quadratic computational complexity O(N²) in the number of observations N, as this is already the complexity of computing the kernel matrix K = {K(x_i, x_j)}_{i,j=1}^N. In the Big Data setting, where N is large, this is not acceptable. Therefore, learning schemes have been designed to avoid the computation of the exact minimizers f_z^α. One family of such schemes, which we broadly refer to as Nyström type subsampling, consists of methods replacing the kernel matrix K with a smaller matrix obtained by column subsampling [18,19]. This can also be interpreted as a restriction of the minimization of T_α(f | z) to a smaller subspace:

f_{z_n}^α = arg min { T_α(f | z) : f ∈ span{K(·, x_i) : (x_i, y_i) ∈ z_n} },    (2)

where z_n ⊂ z is a subsample of size |z_n| ≪ N, and P_{z_n} denotes the orthogonal projector of H_K onto this span. Therefore, the main question about Nyström type subsampling is the following: how big should |z_n| be to incur no loss of performance compared to the full kernel matrix K; or, in other words, is it possible to implement the Nyström approach with a complexity that is subquadratic in the number of observations N without losing performance?
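A corresponding sketch of the plain Nyström scheme (2), again with illustrative names and a Gaussian kernel assumed for concreteness: restricting the representer expansion to a uniform subsample of size m replaces the N × N system by an m × m one, at a cost of order O(N m² + m³).

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def nystrom_krr_fit(X, y, alpha, m, sigma=1.0, seed=None):
    """Plain Nystrom subsampling: minimize the functional (1) over
    span{K(., x_j) : j in S}, with S a uniform subsample of size m drawn
    without replacement.  Writing K_Nm = K(X, X[S]) and K_mm = K(X[S], X[S]),
    the coefficients solve (K_Nm^T K_Nm + alpha N K_mm) c = K_Nm^T y."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    S = rng.choice(N, size=m, replace=False)
    K_Nm = gaussian_kernel(X, X[S], sigma)              # O(N m)
    K_mm = gaussian_kernel(X[S], X[S], sigma)           # O(m^2)
    A = K_Nm.T @ K_Nm + alpha * N * K_mm                # O(N m^2)
    b = K_Nm.T @ y
    c = np.linalg.solve(A + 1e-12 * np.eye(m), b)       # jitter for stability
    return S, c

def nystrom_predict(X_train, S, c, X_new, sigma=1.0):
    return gaussian_kernel(X_new, X_train[S], sigma) @ c
```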
Note that the answer to this question can be given in terms of the decay rate of the singular values of the kernel matrix K. For example, if K is a low-rank matrix, then the number of Nyström samples |z_n| should be chosen comparable, or even equal, to the number of non-vanishing singular values of K, which is assumed to be much smaller than N. For this reason, the idea of low-rank approximation has been widely used to obtain approximations of kernel matrices K.
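As a rough, small-scale illustration of this heuristic (forming and diagonalizing K is itself an O(N²)–O(N³) operation, so this is only feasible when the data are small), one can count the singular values of K above a relative tolerance and use this count as a guide for the subsampling size; the function name and tolerance below are arbitrary illustrative choices.

```python
import numpy as np

def suggested_subsample_size(K, rel_tol=1e-6):
    """Number of singular values of the symmetric PSD kernel matrix K that are
    non-negligible relative to the largest one; for a (nearly) low-rank K this
    is a natural lower bound for the number of Nystrom samples."""
    eigvals = np.linalg.eigvalsh(K)
    return int(np.sum(eigvals > rel_tol * eigvals.max()))
```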
Moreover, low-rank approximation is a building block in other kernel approximation strategies employed in the context of Nyström type subsampling, such as the 'ensemble Nyström method' [9], 'pseudo landmark points' [6] and 'memory efficient kernel approximation' [16]. These strategies have been proven to be efficient for kernels whose matrices K have a special structure, such as shift-invariant kernels. At the same time, the above-mentioned strategies are based entirely on the behaviour of the eigenspectrum of the approximated kernel matrices and do not take into account the smoothness of the target function f*, although the latter has an essential impact on the performance of KRR.
Studies of the Nyström approach that account for the smoothness of the target function have recently been given in [1,13]. However, in [1] the error analysis is derived in a fixed design regression setting, where the inputs x_i are assumed to be fixed (for example, sampled uniformly). The study [13] extends the results of [1] to the general statistical learning setting. At the same time, the analysis of [13] is fairly technical and lengthy. In particular, it is based on assumptions describing the capacity of the hypothesis space H_K with respect to the unknown distribution ρ_X from which the inputs are assumed to be sampled. Moreover, only Hölder type smoothness of the target functions is covered by the analysis of [13].
In the present study, we analyze the so-called plain Nyström approach as a particular case of the regularized projection scheme. Therefore, we use arguments developed in regularization theory for analyzing regularized projection approximations [11,12]. Instead of an assumption on the capacity of the solution space, these arguments rely on an assumption on the approximation power of the projection method induced by the projector, such as P_{z_n} in (2). For the purpose of our study, the arguments developed in [11,12] need to be accompanied by ones that take into account that, in the context of learning, regularized projection schemes such as (2) operate only with noisy versions of the operators describing the learning task.
An analysis incorporating the above-mentioned arguments is presented in the next section. Unlike [13], it gives capacity independent learning rates for Nyström type subsampling. Moreover, it indicates a rather general a priori choice of the subsampling size |z_n| that allows a subquadratic complexity without loss of performance. Such an a priori choice of |z_n| requires knowledge of the regularity of the unknown target function with respect to H_K and ρ_X, and it covers much more than just Hölder type smoothness. In section 3, we consider the situation when such a priori knowledge is not accurate and may lead to an uncertain choice of |z_n|. In section 4, we discuss some simulations illustrating our theoretical results.
The sampling of the training data is governed by an unknown probability measure ρ on X × Y that can be factorized as ρ(x, y) = ρ(y|x) ρ_X(x), where ρ(y|x) is the conditional distribution on Y ⊂ ℝ given x ∈ X, and ρ_X is the so-called marginal distribution, from which the set of inputs is supposed to be sampled. A common assumption to simplify the analysis is that |y| ≤ D almost surely for some D > 0. A weaker condition can be found in [3].
Given a training set z ⊂ Z = X × Y, the goal is to find an estimate f = f_z with a small expected risk

ℰ(f) = ∫_{X×Y} (f(x) − y)² dρ(x, y).

Once we choose H_K as the so-called hypothesis space, the best possible risk value is clearly inf{ℰ(f): f ∈ H_K}. As in [13], we assume that there exists a minimizer f^† of the expected risk in H_K. To formulate our further assumptions we need some operators that are traditionally used in the context of regression learning. At first, we consider the space L_2(X, ρ_X) of functions that are square-integrable with respect to ρ_X and the covariance operator C: H_K → H_K,

C = ∫_X ⟨·, K_x⟩_K K_x dρ_X(x),

where K_x = K(·, x) and ⟨·, ·⟩_K is the inner product in H_K. The operator C can be proved to be a positive trace class operator. Therefore, the operator C^{1/2} is well defined, and it relates the norms of H_K and L_2(X, ρ_X) through ‖f‖_{L_2(X, ρ_X)} = ‖C^{1/2} f‖_K. We will measure the approximation power of the projection method induced by the projector P_{z_n} in terms of the quantity ‖(I − P_{z_n}) C^{1/2}‖, which has also been studied in [13] (see lemma 6 of [13]). At the same time, this kind of measure is usual in studying regularized projection methods [11,12], and in the spirit of those studies we assume that there is β > 0 such that, with probability at least 1 − δ,

‖(I − P_{z_n}) C^{1/2}‖ ≤ b_1 |z_n|^{−β},    (7)

where b_1 is a positive number depending only on β. Note that the probabilistic character of the assumption (7) is natural, because in the plain Nyström approach the points forming z_n are sampled uniformly at random without replacement from the training set z.
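To get a feeling for the quantity in assumption (7), the following sketch computes an empirical proxy in which C is replaced by the empirical covariance operator built from the kernel matrix; the reduction of ‖(I − P_{z_n}) C_z^{1/2}‖ to the spectral norm of the Nyström residual of K is a standard identity, and the function name is ours.

```python
import numpy as np

def empirical_projection_error(K, S):
    """Empirical proxy for ||(I - P_{z_n}) C^{1/2}||: with the empirical
    covariance operator C_z = (1/N) S_x^* S_x in place of C, one has
        ||(I - P_{z_n}) C_z^{1/2}||^2 = ||K - K_Nm K_mm^+ K_mN||_2 / N,
    the spectral norm of the Nystrom residual of the kernel matrix K."""
    N = K.shape[0]
    K_Nm = K[:, S]
    K_mm = K[np.ix_(S, S)]
    residual = K - K_Nm @ np.linalg.pinv(K_mm) @ K_Nm.T
    return np.sqrt(np.linalg.norm(residual, 2) / N)
```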

As we have already mentioned, the Nyström subsampling approach was studied in [13] under assumptions on the capacity of H_K. These assumptions are formulated in [13] with the use of a quantity measuring the effective dimension of the covariance operator C with respect to ρ_X; if this quantity admits a polynomial bound with exponent γ, 0 < γ ≤ 1, then from lemma 6 of [13] it follows that our assumption (7) is satisfied for a whole range of exponents β determined by γ. Our last assumption describes the regularity of f^† in terms of the source condition concept that is fairly standard in regularization theory [10]; in the context of learning theory this concept has been introduced in [2]. Within this concept, we assume that f^† admits the representation

f^† = φ(C) v_†,  ‖v_†‖_K ≤ R,    (8)

where the function φ is operator monotone on [0, ‖C‖], φ(0) = 0 and φ² is concave. As has been shown in [11], an important implication of operator monotonicity is that there is a number d_φ depending only on φ such that for any non-negative self-adjoint operators C_1, C_2

‖φ(C_1) − φ(C_2)‖ ≤ d_φ φ(‖C_1 − C_2‖).    (9)

Moreover, the concavity of φ² yields a companion estimate (10) (see proposition 2 of [11]). Note that our assumption (8) generalizes assumption 4 of [13], where only Hölder type functions φ have been considered. In the sequel we extensively use the following bounds (see, e.g., [2]) that hold under the above assumptions with probability at least 1 − δ and quantify the perturbation effect of random sampling:

‖C − C_z‖ ≤ c_1 |z|^{−1/2},    (11)

‖g_z − C f^†‖_K ≤ c_2 |z|^{−1/2},    (12)

where C_z = (1/|z|) Σ_{(x_i, y_i) ∈ z} ⟨·, K_{x_i}⟩_K K_{x_i} is the empirical covariance operator, g_z = (1/|z|) Σ_{(x_i, y_i) ∈ z} y_i K_{x_i}, and c_1, c_2 depend only on δ, D and the kernel. Note that for φ(t) = t^s the learning rate (13) established in theorem 1 matches the result obtained in the seminal paper by Smale and Zhou [17]. Moreover, for φ(t) = t^s the rate (13) can be thought of as the limit case of the capacity dependent learning rates obtained in [3] under the assumption that the eigenvalues λ_i of the covariance operator C decay polynomially, λ_i ≍ i^{−μ} with μ > 1. Now we are going to prove that the same learning rate (13) can be achieved by the Nyström type subsampling (2) if the approximation power of P_{z_n} is high enough.
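For orientation, the calculation behind the Hölder case mentioned above is the following; it assumes the usual link function θ(t) = t^{1/2} φ(t) of this framework (our notation, since theorem 1 is not reproduced here).

```latex
% Hölder case \varphi(t) = t^s: the capacity independent rate
% \varphi\bigl(\theta^{-1}(|z|^{-1/2})\bigr) with \theta(t) = t^{1/2}\varphi(t).
\theta(t) = t^{s+\frac12}, \qquad
\theta^{-1}\bigl(|z|^{-\frac12}\bigr) = |z|^{-\frac{1}{2s+1}}, \qquad
\varphi\bigl(\theta^{-1}(|z|^{-\frac12})\bigr) = |z|^{-\frac{s}{2s+1}}.
```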
Theorem 2. Assume the conditions of theorem 1 and let (7) be satisfied. If the size m = |z_n| of the subsample z_n is chosen such that b_1 m^{−β} does not exceed the square root of the regularization parameter α = α(|z|) used in theorem 1, then with probability at least 1 − δ we have

‖f_{z_n}^α − f^†‖ = O(φ(θ^{−1}(|z|^{−1/2}))),    (14)

i.e. a bound of the same order as (13), where θ(t) = t^{1/2} φ(t) and b_1 is the same as in (7).
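Continuing the Hölder example φ(t) = t^s, and assuming the standard capacity independent parameter choice α = θ^{-1}(|z|^{-1/2}) = |z|^{-1/(2s+1)} (our assumption, since theorem 1 is not restated here), the subsampling rule of theorem 2 becomes an explicit power of |z|:

```latex
b_1\, m^{-\beta} \le \alpha^{1/2} = |z|^{-\frac{1}{2(2s+1)}}
\quad\Longleftrightarrow\quad
m \ge b_1^{1/\beta}\, |z|^{\frac{1}{2\beta(2s+1)}}.
```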
Before proving this statement, we first comment on the computational complexity of the Nyström approximation (2) with a subsampling size |z_n| chosen according to theorem 2. In view of the assumption (7), it is clear that the condition of the theorem can be satisfied with |z_n| of order α^{−1/(2β)}. Let the assumption (8) be satisfied with φ(t) = t^s; then the required subsampling size grows only as a power of |z| that becomes smaller as β increases. On the other hand, the computational complexity of (2) is of order O(|z_n|² |z|) (see, e.g., [13]). Thus, under the conditions of theorem 2, Nyström subsampling has the same learning rate as the one guaranteed by theorem 1 for KRR based on the whole sample z. Moreover, theorem 2 allows an estimation of a regularity range, such as (15), for which the above-mentioned learning rate can be achieved with subquadratic complexity. Note that the condition (15) is automatically satisfied with β ≥ 1, for example.
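A quick back-of-the-envelope check of the subquadratic regime with illustrative numbers: since the cost of (2) is of order |z_n|² |z|, it stays below the O(|z|²) cost of full KRR exactly when |z_n| grows slower than |z|^{1/2}.

```latex
|z| = 10^6:\qquad
|z_n| = 10^2 \;\Rightarrow\; |z_n|^2 |z| = 10^{10} \ll |z|^2 = 10^{12};
\qquad
|z_n| = 10^3 = |z|^{1/2} \;\Rightarrow\; |z_n|^2 |z| = 10^{12} = |z|^2.
```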
Proof of theorem 2. The argument follows the standard pattern of the analysis of regularized projection methods [11, 12], adjusted to the fact that in the learning context only noisy versions of the operators describing the problem are available. Without loss of generality we can assume that |z| is so large that the right-hand sides of the perturbation bounds (11) and (12) are sufficiently small; this is not a real restriction, because they tend to zero as |z| → ∞, and it guarantees that, with probability at least 1 − δ, the empirical covariance operator C_z inherits the properties of C that are needed below. The error f_{z_n}^α − f^† is then decomposed into a projection error term and several sampling error terms. The projection error is controlled by the assumption (7) together with the choice of |z_n| made in the theorem. The sampling error terms are controlled by the perturbation bounds (11) and (12), the operator monotonicity of φ, in particular the estimate (9) and the known estimate of proposition 3 of [11], and the concavity of φ². Collecting the estimates yields (14). □

In this way, several Nyström approximants corresponding to different choices of the subsampling size can be computed, each with a subquadratic complexity. Nevertheless, the question appears of how to select a good approximant among the calculated ones, or how to use all of them. This question is similar to one arising in regularization theory, where a strategy for aggregating all calculated regularized approximants has been discussed recently [4]. In [8] the strategy of [4] has been adjusted to the context of learning and presented in several versions.
According to the simplest version, the intention is to approximate the best linear combination of the computed approximants f_1, f_2, ..., f_p (in our context, the Nyström approximants corresponding to different subsampling sizes), that is, the coefficient vector c = (c_1, ..., c_p) solving the minimization problem

‖f^† − Σ_{i=1}^p c_i f_i‖_{L_2(X, ρ_X)} → min over c ∈ ℝ^p.    (23)

Therefore, (23) is equivalent to the matrix problem

G c = g^†,    (24)

where G and g^† are, respectively, the Gram matrix and the vector of inner products

G = (⟨f_i, f_j⟩_{L_2(X, ρ_X)})_{i,j=1}^p,   g^† = (⟨f^†, f_i⟩_{L_2(X, ρ_X)})_{i=1}^p.

These inner products are not accessible directly and have to be estimated from the data; the resulting perturbation can be controlled by bounds such as (11).

As in [21], all KRR estimators appearing in this experiment are constructed in the RKHS H_K generated by the kernel used in [21]. At the end we briefly discuss the performance of the aggregated Nyström approximants in the context of binary classification, that is, when y_i takes only two values, usually designated by zero and one. If a function, say f, is determined by means of a (regularized) least-squares regression from data {(x_i, y_i)} with such binary labels, then a classifier can be obtained by assigning an input x to the class 1 whenever f(x) ≥ 1/2 and to the class 0 otherwise. To illustrate the performance of the Nyström approach in classification we use the same dataset breastcancer as in [13]. This set consists of 569 instances with 30 features each, which describe characteristics of a digitized image of a breast mass. The outputs y ∈ {0, 1} encode the diagnosis (benign or malignant).
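Returning to the aggregation scheme (23)–(24), the following sketch shows its simplest empirical version: the inner products defining G and g^† are estimated by sample means on held-out data, with the observed outputs playing the role of f^†. The use of a hold-out sample and all names are our illustrative choices and need not coincide with the exact estimators of [8].

```python
import numpy as np

def aggregate(F_val, y_val):
    """Linear aggregation of p candidate approximants (e.g. Nystrom solutions
    for several subsampling sizes).  F_val[:, i] holds the values of the i-th
    approximant f_i on a validation sample, y_val the corresponding outputs.
    The coefficients approximate the solution of G c = g^dagger, where
    G_ij = <f_i, f_j> and g^dagger_i = <f^dagger, f_i> are estimated by
    empirical means, with y serving as a surrogate for f^dagger."""
    n = F_val.shape[0]
    G = F_val.T @ F_val / n
    g = F_val.T @ y_val / n
    return np.linalg.lstsq(G, g, rcond=None)[0]

# The aggregated estimator is then f_agg(x) = sum_i c[i] * f_i(x).
```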
As in [13], the dataset breastcancer has been split into test sets consisting of 169 items and training sets z with |z| = 400. Moreover, the values of the regularization parameter α and of the width σ of the Gaussian kernel have been chosen as α = 10^{−7} and σ = 0.9. Subsampling sets z_{n_i}, i = 1, 2, 3, of different sizes have been drawn uniformly at random from the training sets, and the corresponding Nyström approximants have been aggregated in the way described in section 3.

To summarize our experiments, we have the following conclusion. Theorem 2 shows that a wide range of target functions can be learned with the best known capacity independent learning rates by means of a Nyström subsampling algorithm of subquadratic complexity, provided that the subsampling size is chosen properly. At the same time, a proper choice of the subsampling size requires knowledge that may not be available. Then several subsampling sizes can be tried, and in section 3 we have proposed a way to utilize/aggregate all of the corresponding Nyström approximants. The experiments reported above demonstrate that the aggregation approach described in section 3 automatically uses the best of the available options and can be recommended as a reliable strategy to be implemented together with Nyström subsampling when dealing with uncertainty in the subsampling size.