Compressive Statistical Learning with Random Feature Moments

We describe a general framework -- compressive statistical learning -- for resource-efficient large-scale learning: the training collection is compressed in one pass into a low-dimensional sketch (a vector of random empirical generalized moments) that captures the information relevant to the considered learning task. A near-minimizer of the risk is computed from the sketch through the solution of a nonlinear least squares problem. We investigate sufficient sketch sizes to control the generalization error of this procedure. The framework is illustrated on compressive PCA, compressive clustering, and compressive Gaussian Mixture Modeling with fixed known variance. The latter two are further developed in a companion paper.


Introduction
Large-scale machine learning faces a number of fundamental computational challenges, triggered both by the high dimensionality of modern data and by the increasing availability of very large training collections. Besides the need to cope with high-dimensional features extracted from images, volumetric data, etc., a key challenge is to develop techniques able to fully leverage the information content and learning opportunities opened by large training collections of millions to billions or more items, with controlled computational resources.
Such training volumes can severely challenge traditional statistical learning paradigms based on batch empirical risk minimization. Statistical learning offers a standardized setting where learning problems are expressed as the optimization of an expected loss, or risk, R(π, h) := E_{X∼π} ℓ(X, h), over a parameterized family of hypotheses H (where π is the probability distribution of the training collection). This risk is empirically estimated on a training collection, and parameters that empirically minimize it are sought, possibly with some regularization. Empirical minimization typically requires access to the whole training collection, either in batch mode or iteratively with one or more passes of stochastic gradient descent. This can become prohibitively costly when the collection is large and each iteration has non-negligible cost. An alternative is to sub-sample the collection, but this may come at the price of neglecting some important items from the collection. Besides online learning [e.g. Mairal et al., 2010], sampling techniques such as coresets [Feldman and Langberg, 2011] or Nyström's method [e.g. Rudi et al., 2015] have emerged to circumvent computational bottlenecks while preserving the ability to exploit latent information from large collections.
Can we design an alternative learning framework, with the ability to compress the training collection before even starting to learn? We advocate a possible route, compressive statistical learning, which is inspired by the notion of sketching and is endowed with favorable computational features, especially in the streaming and distributed data model [Cormode et al., 2011] (see Section 1.3). Rooted both in the generalized method of moments [Hall, 2005] and in compressive sensing [Foucart and Rauhut, 2012], it leverages techniques from kernel methods such as kernel mean embeddings [Gretton et al., 2007, Sriperumbudur et al., 2010] and random Fourier features [Rahimi and Recht, 2007] to obtain innovative statistical guarantees.
As a trivial example, assume x, h belong to R^d, and consider the squared loss ℓ(x, h) = ‖x − h‖₂², whose risk minimizer is E[X]. In this specific example, keeping only the d empirical averages of the coordinates of X is obviously sufficient. The vision developed in this paper is that, for certain learning problems, all the necessary information can be captured in a sketch: a vector of empirical (generalized) moments of the collection that captures the information relevant to the considered learning task. Computing the sketch is then feasible in one pass, and a near-minimizer of the risk can be computed from the sketch with controlled generalization error.
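For this mean-estimation example, the sketch is just a running sum that is updated in one pass, after which each item can be discarded. A minimal sketch of the idea in Python (our own illustration, not code from the paper; `stream_mean_sketch` is a hypothetical helper name):

```python
import numpy as np

def stream_mean_sketch(stream, d):
    """One-pass sketch for mean estimation, i.e. Phi(x) = x.

    Only a d-dimensional running sum and a counter are kept, so each
    item is seen once, used to update the sketch, then discarded.
    """
    s = np.zeros(d)
    n = 0
    for x in stream:
        s += x            # linear update, O(d) per item
        n += 1
    return s / n          # empirical average: the estimated risk minimizer

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=(10_000, 2))
h_hat = stream_mean_sketch(iter(data), d=2)
```

Since the update is additive, sketches computed on different machines can be merged by summing them, a point developed in Section 1.3.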
This paper is dedicated to showing how this phenomenon can be generalized: roughly speaking, can the sketch size be taken to be proportional to the number of "intrinsic parameters" of the learning task? Another fundamental requirement for the sketching operation is to be online: when recording the training collection, it should be possible to update the sketch at almost no additional cost. The original training collection can then be discarded, and learning can be performed from the sketch only, potentially leading to privacy preservation. As shown in the companion paper [Gribonval et al., 2020], a sketching procedure based on random generalized moments meets these requirements for clustering and Gaussian mixture estimation.

Inspiration from compressive sensing
Another classical example of a learning task is (centered) Principal Component Analysis (PCA). In this setting, x ∈ R^d, h is an arbitrary linear subspace of dimension k, and the loss is ℓ(x, h) = ‖x − P_h x‖₂², with P_h the orthogonal projector onto h. The matrix of second moments Σ_π := E_{X∼π} XX^T is known to summarize all the information needed to select the best subspace for a training collection. It thus constitutes a natural sketch (of finite dimension d²) of the training set.
A much smaller sketch can in fact be computed. Results from compressive sensing and low-rank matrix completion [Foucart and Rauhut, 2012] make it possible to compress the matrix of second moments into a sketch of dimension of the order of kd (much smaller than d² when k ≪ d) from which the best rank-k approximation to Σ_π can be accurately estimated (this rank-k approximation allows us to compute the PCA with appropriate learning guarantees, as we will see in Section 4). This compression is performed using random linear projections of Σ_π, which can be seen as random second-order moments of the training collection.
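To make the order-kd sketch concrete, the following toy snippet (our own illustration, with arbitrary dimensions) draws m random Gaussian measurement matrices A_j and verifies that the per-sample random second-order moments, averaged over the collection, coincide with the linear measurements ⟨A_j, Σ̂⟩ of the empirical second-moment matrix; actually recovering the rank-k approximation from y would additionally require, e.g., a nuclear-norm solver, as discussed in Section 4.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 20, 2, 1000
m = 5 * k * d                      # sketch size of order k*d, well below d**2

# Random Gaussian measurement matrices A_j defining y_j = <A_j, Sigma>
A = rng.normal(size=(m, d, d)) / np.sqrt(m)

# Data concentrated near a k-dimensional subspace, plus small noise
U = np.linalg.qr(rng.normal(size=(d, k)))[0]
X = rng.normal(size=(n, k)) @ U.T + 0.01 * rng.normal(size=(n, d))

# Sketch = average of per-sample random second-order moments x^T A_j x;
# each term depends on one sample only, so this is one-pass and mergeable
Phi = np.einsum('mij,ni,nj->nm', A, X, X)   # Phi(x_i)_j = x_i^T A_j x_i
y = Phi.mean(axis=0)

# Linearity: y equals the measurements of the empirical second-moment matrix
Sigma_hat = X.T @ X / n
y_direct = np.einsum('mij,ij->m', A, Sigma_hat)
```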
We propose to generalize such a sketching procedure to arbitrary random generalized moments. Given a learning task and a training collection, we study the following questions:
• How can we perform learning from a sketch of the training collection?
• What statistical learning guarantees can we obtain with such a procedure?

Contributions
In this paper, we present a general compressive learning framework.
• We describe a generic sketching mechanism with random generalized moments and provide a theoretical learning procedure from the sketched data.
• We derive general learning guarantees for sketching with random generalized moments.
In the companion paper [Gribonval et al., 2020], we exploit this framework to establish statistical learning guarantees for compressive clustering and compressive Gaussian mixture estimation. We conclude this paper by briefly discussing the potential impact of the proposed framework and its extensions in terms of privacy-aware learning and of the insight it may bring on the information-theoretic properties of certain convolutional neural networks.

Related work
Sketching and streaming methods. Sketches are closely linked with the development of streaming methods [Cormode et al., 2011], in which data items are seen once by the user and then discarded. A sketch is a small summary of the data seen at a given time, which can be queried for a particular piece of information about the data. As required by the streaming context, when the database is modified, e.g. by inserting or deleting an element, the subsequent update of the sketch must be very fast. In practice, sketches are often applied in contexts where the data are stored in multiple places. In this heavily distributed setting, a popular class of sketches is that of linear sketches, i.e. structures such that the sketch of the union of two databases is the sum of their sketches; the sketch of a database distributed over several parts is then simply the sum of the sketches of the parts. The sketch presented in this work is indeed a linear sketch (when considered without the normalization constant 1/n) and, as such, update operations are extremely simple and fast. Sketches have been used for a large variety of operations [Cormode et al., 2011], such as the popular detection of heavy hitters [Cormode and Muthukrishnan, 2005, Cormode and Hadjieleftheriou, 2009]. Closer to our framework, sketches have been used to approximately maintain histograms [Thaper et al., 2002] or quantiles [Gilbert et al., 2002]; however, these methods are subject to the well-known curse of dimensionality and are infeasible even in moderate dimension.
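The linearity property is easy to visualize: with any feature map Φ, the un-normalized sketch of a union of datasets is the sum of the individual sketches, and inserting an element is a single additive update. A toy Python illustration (our own, with an arbitrary choice of Φ):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(64, 5))                  # random matrix defining Phi

def sketch(X):
    """Un-normalized linear sketch: sum of Phi(x) over the collection."""
    return np.cos(X @ W.T).sum(axis=0)        # here Phi(x) = cos(Wx), entrywise

part_a = rng.normal(size=(300, 5))            # data held on machine A
part_b = rng.normal(size=(200, 5))            # data held on machine B

merged = sketch(part_a) + sketch(part_b)      # distributed merge: plain addition
whole = sketch(np.vstack([part_a, part_b]))   # sketch of the pooled database

# Inserting one new element is a single cheap additive update
x_new = rng.normal(size=(1, 5))
updated = whole + sketch(x_new)
```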
Learning in a streaming context. Various learning algorithms have also been directly adapted to a streaming context. Examples include the Expectation-Maximization algorithm [Andrieu and Doucet, 2003, Cappé and Moulines, 2009], the k-means algorithm [Guha and Mishra, 2016, Ailon et al., 2009], and Principal Component Analysis [Ghashami et al., 2016]. In each case, the result of the algorithm is updated as new data becomes available. However, these algorithms do not fully benefit from the many advantages of sketches: sketches are simpler to merge in a distributed context, update operations are more immediate, and the learning step can be delocalized and performed on a dedicated machine.
Coresets. Another popular class of structures that summarize a database for learning is called coresets. Coresets were initially developed for k-means [Har-Peled and Mazumdar, 2004] or, more generally, subspace approximation [Feldman et al., 2010, Feldman and Langberg, 2011], and were also applied to learning Gaussian Mixture Models [Feldman et al., 2011, Lucic et al., 2017]. In a sense, the philosophy behind coresets is situated halfway between sketches and streaming learning algorithms. Like the sketching approaches, coreset methods construct a compressed representation of the database (or "coreset"), but they are somewhat closer to already approximately performing the learning task. For instance, the coreset described in [Frahling and Sohler, 2005] already incorporates steps of Lloyd's k-means algorithm in its construction. Similar to the k-means++ algorithm [Arthur and Vassilvitskii, 2007], many coresets have been developed as (weighted) adaptive subsamplings of the data [Feldman et al., 2011, Lucic et al., 2017].
Linear sketches vs. coresets. It is in general difficult to compare sketching and coreset methods (including the sketching method presented in this paper) in terms of pure performance or theoretical guarantees, since they are very different approaches that can be more or less adapted to certain contexts. We can however outline some differences. Unlike sketches, coresets are not specifically built for the streaming context, and they may require several passes over the data. Nevertheless, they can still be adapted to streams of data [as described e.g. in Har-Peled and Mazumdar, 2004, Feldman and Langberg, 2011, Lucic et al., 2017] by using a merge-and-reduce hierarchical strategy: for each batch of data that arrives sequentially, the user builds a coreset, then groups these coresets and builds a coreset of coresets, and so on. This update method is clearly less direct than updating a linear sketch; more importantly, the user must choose between keeping many coresets, letting the size of the overall summary grow with the number of points in the database, or keeping only the highest-level coresets, at the cost of losing precision in the theoretical guarantees each time the height of the hierarchical structure increases. As a comparison, the sketch presented in the companion paper [Gribonval et al., 2020] for k-means does not have these limitations: as with any linear sketch, updates are totally independent of previous events, and for a fixed sketch size the ability to perform the learning task strictly increases with the number of points.
Generalized Method of Moments and Compressive Sensing. The methodology that we employ to develop the proposed sketching framework is similar to a Generalized Method of Moments (GeMM) [Landau, 1987, Hall, 2005]: the parameters θ of a model are learned by matching a collection of theoretical generalized moments of the distribution π_θ with empirical ones from the data. GeMM is often seen as an alternative to Maximum Likelihood estimation, used to obtain different identifiability guarantees [Belkin and Sinha, 2015, Hsu and Kakade, 2013, Anderson et al., 2014] or when the likelihood is not available. Traditionally, a finite number of moments is considered, but modern developments give guarantees when an infinite (integral) number of generalized moments is available [Carrasco and Florens, 2000, 2014], in particular generalized moments associated to the (empirical) characteristic function [Carrasco and Florens, 2002, Feuerverger and Mureika, 1977]. Our point of view is slightly different: we consider the collection of moments as a compressed representation of the data and as a means to achieve a learning task. Compared to the guarantees usually obtained in GeMM, such as consistency and efficiency of the estimator, the results that we obtain are more akin to those of Compressive Sensing and Statistical Learning. For instance, when learning Gaussian Mixture Models, we prove in the companion paper [Gribonval et al., 2020] that learning is robust to modeling error (the true distribution of the data is not exactly a GMM but close to one), which is generally overlooked in GeMM. In the proof technique, this is done by replacing the so-called "global identifiability condition" (i.e. injectivity of the moment operator), which is a classical condition in GeMM but is already difficult to prove and sometimes simply assumed by practitioners [see Newey and McFadden, 1994, p. 2127], by the strictly stronger Lower Restricted Isometry Property (LRIP) from the Compressive Sensing literature [Donoho, 2006, Candès et al., 2006, Baraniuk, 2007, Foucart and Rauhut, 2012]. This is achieved by considering random feature moments (related to random features [Rahimi and Recht, 2007, 2009, Bach, 2017] and kernel mean embeddings [Sriperumbudur et al., 2010]), so in a sense the resulting Compressive Statistical Learning framework could be considered as a Method of Random Feature Moments. While the LRIP is reminiscent of certain kernel approximation guarantees with random features [see e.g. Sriperumbudur and Szabó, 2015, Bach, 2017], it is in fact of a different nature, and neither seems to be a direct consequence of the other.

Outline
Section 2 describes our general framework for compressive statistical learning. We define statistical learning guarantees, introduce the required notions, and state our general theorem on statistical learning guarantees for compressive learning. An important concept is the Lower Restricted Isometry Property (LRIP), which relies on the notion of a model set (a set of "simple" probability distributions) and is further discussed in Section 3. To illustrate the proposed framework, we detail in Section 4 a procedure for Compressive PCA; here we do not intend to match the latest developments in the domain of PCA, such as stochastic and incremental PCA [Arora et al., 2012, Balsubramani et al., 2013], kernel PCA using Nyström sampling [Sterge et al., 2020] or random features [Ullah et al., 2018, Sriperumbudur and Sterge, 2020], but rather to give a first illustration. Generic techniques to establish the LRIP for sketches of controlled size are described in Section 5. In the companion paper [Gribonval et al., 2020], we specify a sketching procedure and state the associated learning guarantees for compressive clustering and compressive Gaussian mixture estimation. We discuss in Section 6 possible extensions of the proposed framework, as well as the insight it may bring on the information flow across one layer of a convolutional neural network with average pooling. Finally, all proofs are stated in the Appendix.

A general compression framework for statistical learning
This section is dedicated to the introduction of our compressive learning framework.

Statistical learning
Statistical learning offers a standardized setting where many learning problems (supervised or unsupervised) can be expressed as the optimization of an expected risk over a parameterized family of functions. Formally, we consider a training collection X = {x_i}_{i=1}^n ∈ Z^n drawn i.i.d. from a probability distribution π on the measurable space (Z, 𝒵). In our examples, Z = R^d is endowed with the Borel σ-algebra 𝒵. One wishes to select a hypothesis h from a hypothesis class H to perform the task at hand. How well the task can be accomplished with the hypothesis h is typically measured through a loss function ℓ : (x, h) ↦ ℓ(x, h) ∈ R and the expected risk associated to h,

R(π, h) := E_{X∼π} ℓ(X, h),

where (here and in the sequel) we always restrict our attention to probability distributions π such that x ↦ ℓ(x, h) is measurable and π-integrable for all h ∈ H. In the idealized learning problem, one selects a hypothesis h⋆_π that minimizes the expected risk (we assume existence of this minimum to simplify the presentation, although most statements to come can be adapted if needed using a sequence of approximate minimizers),

h⋆_π ∈ arg min_{h∈H} R(π, h).

We will use the shorthand h⋆ for h⋆_π whenever there is no ambiguity from the context. In practice one has no access to the true risk R(π, h), since the expectation with respect to the underlying probability distribution, E_{X∼π}[·], is unavailable. Instead, methods such as empirical risk minimization (ERM) produce an estimated hypothesis ĥ from the training dataset X by minimizing the risk R(π̂_n, ·) (or a regularized version thereof) associated to the empirical probability distribution π̂_n := (1/n) Σ_{i=1}^n δ_{x_i} of the training samples. One expects that, with high probability at least 1 − ζ on the draw of the training set, the excess risk is bounded as

R(π, ĥ) − R(π, h⋆) ≤ η_n, (2)

where η_n typically decays as 1/√n or better. We will use the following running examples.
Examples:
• PCA: as stated in the introduction, the loss function is ℓ(x, h) = ‖x − P_h x‖₂², where P_h is the orthogonal projection onto the subspace hypothesis h of prescribed dimension k.
• k-means clustering: each hypothesis corresponds to a set of k candidate cluster centers, h = {c_1, …, c_k}, and the loss is defined by the k-means cost ℓ(x, h) = min_{1≤l≤k} ‖x − c_l‖₂². The hypothesis class H may be further reduced by defining constraints on the considered centers (e.g., lying in some domain, or, as detailed in the companion paper [Gribonval et al., 2020], with some separation between centers).
• Gaussian Mixture Modeling: each hypothesis h corresponds to the collection of weights, means and variances of a mixture of k Gaussians, whose probability density function is denoted π_h(x). The loss function is the negative log-likelihood ℓ(x, h) = − log π_h(x).

Compressive learning
Our aim, and one of the major achievements of this paper, is to control the excess risk (2) using an estimate ĥ obtained from the sole knowledge of a sketch of the training collection. As we will see, the resulting philosophy for large-scale learning is, instead of addressing an ERM optimization problem of size proportional to the number of training samples, to first compute a sketch vector of size driven by the complexity of the task, and then to address a nonlinear least-squares optimization problem, associated to the Generalized Method of Moments (GeMM), on this sketch. Taking its roots in compressive sensing [Donoho, 2006, Candès et al., 2006, Foucart and Rauhut, 2012] and the generalized method of moments [Landau, 1987, Hall, 2005], but also in kernel mean embeddings [Smola et al., 2007, Sriperumbudur et al., 2010], random features [Rahimi and Recht, 2007, 2009, Bach, 2017], and streaming algorithms [Gilbert et al., 2002, Cormode and Muthukrishnan, 2005, Cormode et al., 2011], compressive learning relies on the choice of a measurable (nonlinear) feature function Φ : Z → R^m or C^m and has two main steps:
1. Compute generalized empirical moments using the feature function on the training collection, summarizing it into a single sketch vector

y := (1/n) Σ_{i=1}^n Φ(x_i). (3)

2. Produce a hypothesis from the sketch using an appropriate learning procedure: ĥ = Learn(y).
Overall, the goal is to design the feature function Φ(·) and the learning procedure Learn(·), given a learning task (i.e., a loss function), such that the resulting hypothesis ĥ has controlled excess risk (2) (if Φ is drawn at random according to some specification, we want (2) to hold with high probability also with respect to the draw of Φ). To anticipate what will be developed in Section 3.1 (notably Eq. (27)), let us mention that learning from a sketch will take the form of a minimization problem

ĥ ∈ arg min_{h∈H} R(y, h), (4)

where, in a sense, R(y, ·) will play the role of a proxy for the empirical risk R(π̂_n, ·).
• Estimation of the mean: assume x, h belong to R^d, and consider the squared loss ℓ(x, h) = ‖x − h‖₂², whose risk minimizer is E[X]. In this specific example, it is obviously sufficient to keep only the d empirical averages of the coordinates of X, i.e., to use Φ(x) := x.
• PCA: as the principal components are computed from the eigenvalue decomposition of the matrix of second moments of the samples, we can simply use Φ(x) := xx^T.
A less trivial example is Compressive PCA. Instead of estimating the full matrix Σ_π = E_{X∼π} XX^T, of size d × d, it is known that computing m random Gaussian linear measurements of this matrix makes it possible to manipulate a vector y of dimension m of the order of kd, from which one can accurately estimate the best rank-k approximation to Σ_π, which yields the k first principal components. Nuclear norm minimization is typically used to produce this low-rank approximation given the vector y. We will describe this procedure in detail in Section 4 as a first illustration of our framework.
In the companion paper [Gribonval et al., 2020], for the more challenging examples of Compressive k-means and Compressive Gaussian Mixture Modeling, we provide a feature function Φ and a method Learn (based on a specific proxy (4), corresponding to a non-convex least-squares minimization) that lead to a control of the excess risk.
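As a toy instance of the two-step pipeline, one can sketch with random Fourier features (Φ(x) = e^{iωx}) and then fit a single one-dimensional Gaussian of known variance by moment matching. The snippet below is only our own illustration (a grid search stands in for the nonlinear least-squares solver, and all sizes are arbitrary), not the procedure of the companion paper:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, sigma = 20_000, 40, 1.0
omega = rng.normal(size=m)                  # random frequencies (dimension d = 1)

x = rng.normal(loc=1.7, scale=sigma, size=n)   # data: Gaussian with known variance

# Step 1: one-pass empirical sketch with Phi(x) = exp(i * omega * x)
y = np.exp(1j * np.outer(x, omega)).mean(axis=0)

# Step 2: moment matching over the model {N(theta, sigma^2)}, whose sketch
# is closed-form: A(pi_theta) = exp(i*omega*theta - omega^2 * sigma^2 / 2)
def model_sketch(theta):
    return np.exp(1j * omega * theta - 0.5 * omega**2 * sigma**2)

thetas = np.linspace(-5.0, 5.0, 2001)
errors = [np.linalg.norm(model_sketch(t) - y) for t in thetas]
theta_hat = thetas[int(np.argmin(errors))]  # near-minimizer of ||A(pi_theta) - y||
```

The sketch here is a subsampled empirical characteristic function, which connects this toy example to the GeMM literature discussed in Section 1.3.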
As described below, these results are achieved by establishing links with the formalism of linear inverse problems and low complexity recovery (i.e., sparse/structured vector recovery, low-rank matrix recovery) and extending theoretical tools to the setting of compressive statistical learning.

Compressive learning as a linear inverse problem
The most immediate link with linear inverse problems is the following. The sketch vector y can be seen as a linear function of the empirical probability distribution π̂_n := (1/n) Σ_{i=1}^n δ_{x_i}:

y = A(π̂_n),

where A is a linear operator, defined from the set of distributions π such that Φ is π-integrable to R^m (or C^m), by

A(π) := E_{X∼π} Φ(X). (5)

This operator is linear in the sense that A(θπ₁ + (1 − θ)π₂) = θA(π₁) + (1 − θ)A(π₂) for any probability distributions π₁, π₂ and any 0 ≤ θ ≤ 1. Since for large n we should have A(π̂_n) ≈ A(π), the sketch y can be viewed as a noisy linear observation of the underlying probability distribution π. This viewpoint allows us to formally leverage the general methodology of linear inverse problems to produce a hypothesis from the sketch y.
Conceptually, we construct the learning-from-sketch procedure ĥ = Learn(y) in two steps:
• Define a so-called decoder ∆ that produces a probability distribution estimate from the sketch: π̃ := ∆[y];
• Find a best hypothesis for this estimate:

ĥ ∈ arg min_{h∈H} R(π̃, h). (6)

As a first coarse analysis of this scheme, notice that if the decoder step is such that a uniform approximation between the risk of π and that of π̃ holds,

sup_{h∈H} |R(π̃, h) − R(π, h)| ≤ η_n/2, (7)

then we will be able to control the excess risk (2), which is our goal. Indeed, using (6) and the triangle inequality, it is easy to show that (7) directly implies (2). (We will see later that (7) can be too coarse, and we will introduce a more refined analysis based on excess risks in Section 2.5.) In a way, this is very similar to ERM, except that instead of using the empirical risk R(π̂_n, ·) we use an estimate R(π̃, ·) of the risk, where π̃ is deduced directly from the sketch y.
Remark 2.1. At first sight, the above conceptual view may wrongly suggest that compressive learning replaces statistical learning with the much more difficult problem of non-parametric density estimation.
Fortunately, as we will see, this is not the case, thanks to the fact that our objective is never to accurately estimate π in the standard sense of density estimation as, e.g., in [Bertin et al., 2011], but only to accurately estimate the risk R(π, ·). In practical examples, a natural decoder will be based on best moment matching over a parametric family of probability distributions, which will be expressed more directly as the minimization of a proxy (4) for the risk, cf. Section 3.1.

Statistical learning guarantees: a first control of the excess risk
In this section, for simplicity, we first focus on how to establish a uniform control of the risks of the form (7), using general results from linear inverse problems (we shall introduce in the next section a sharper but slightly more notation-heavy analysis). To leverage the links between compressive learning and general inverse problems, we further notice that sup_{h∈H} |R(π, h) − R(π′, h)| can be viewed as a metric on probability distributions. Given a class G of measurable functions f : Z → R or C, we use the following notation throughout this work:

‖π − π′‖_G := sup_{f∈G} |E_{X∼π} f(X) − E_{X∼π′} f(X)|,

which defines a semi-norm on the space of finite signed measures (see Appendix A.2) on (Z, 𝒵) such that all f ∈ G are integrable. In order to be explicit about the integrability assumptions in the results to come, we will call this space the set of G-integrable finite signed measures (resp. probability distributions, when appropriate). With this notation, we have

sup_{h∈H} |R(π, h) − R(π′, h)| = ‖π − π′‖_{L(H)}, where L(H) := {ℓ(·, h) : h ∈ H}

is the loss class. We will usually abbreviate the latter notation as ‖·‖_L, dropping the dependence on H, considered fixed. The desired guarantee (7) thus amounts to a control of ‖π̃ − π‖_L. In the usual context of linear inverse problems, producing an accurate estimate from noisy underdetermined linear observations requires some "regularity" assumption. Such an assumption often takes the form of a "low-dimensional" model set that the quantity to estimate is close to.

Example 2.2. In the case of sparse vector recovery (respectively low-rank matrix recovery), one wishes to estimate x ∈ R^n (resp. X ∈ R^{n×n}) from y ≈ Ax (resp. y ≈ A vec(X)). Guarantees are achieved when x is close to the set of k-sparse vectors (resp. when X is close to the set of rank-r matrices).
Similarly here, estimating π from y ≈ A(π) may require considering some model set S, whose choice and definition will be discussed in Section 3.
Remark 2.3. While in classical compressive sensing the model set plays the role of prior knowledge on the data distribution that completes the observations, in the examples considered here we will often obtain distribution-free excess risk guarantees using models derived from the loss function.
Given a model set S that plays the role of a regularizer, and a sketching operator A, an "ideal" decoder ∆ should be robust to two different sources of error: the distribution π of the data generally does not belong to S but is only "close" to it, introducing some modeling error; and the empirical sketch is used instead of the true generalized moments, which adds some noise. Generalizing early formulations for sparsity-regularized inverse problems, a decoder robust to both noise and modeling error is usually referred to as instance optimal [Cohen et al., 2009, Bourrier et al., 2014]. Mathematically, this can be expressed as: for any distribution π and any draw of the training samples from π (embodied by the empirical distribution π̂_n),

‖π − ∆[A(π̂_n)]‖_L ≲ d(π, S) + ‖A(π) − A(π̂_n)‖₂, (10)

where ≲ hides multiplicative constants, and d(·, S) is some measure of distance to the model set S. In the rest of the paper, we refer to the first term as the "bias". A significant part of later sections will be devoted to the control of the bias and the choice of a good model set. Proving that a decoder satisfies (10) ultimately serves to establish bounds such as (7) to control the excess risk.
It turns out that general results on abstract linear inverse problems [Bourrier et al., 2014] can be adapted to characterize the existence of a decoder satisfying property (10). By [Bourrier et al., 2014, Section IV-A], if a decoder with the above property exists, then a so-called Lower Restricted Isometry Property (LRIP) must hold: there is a finite constant C_A > 0 such that

‖τ − τ′‖_L ≤ C_A ‖A(τ) − A(τ′)‖₂ for all τ, τ′ ∈ S. (11)

Conversely, the LRIP (11) implies [Bourrier et al., 2014, Theorem 7] that the following decoder (also known as the ideal decoder),

∆[y] ∈ arg min_{τ∈S} ‖A(τ) − y‖₂, (12)

which corresponds to best moment matching, is instance optimal, i.e., (10) holds for any π and π̂_n, with the particular distance

d(π, S) := inf_{τ∈S} ( ‖π − τ‖_L + 2 C_A ‖A(π) − A(τ)‖₂ ). (13)

As a consequence, the LRIP (11) implies a control of the excess risk achieved with the hypothesis ĥ selected with (6), where π̃ = ∆[y]:

R(π, ĥ) − R(π, h⋆) ≤ 2 d(π, S) + 4 C_A ‖A(π) − A(π̂_n)‖₂, (14)

where we used explicit constants from [Bourrier et al., 2014, Theorem 7]. Note that in the above argument, it was never used that the data is distributed i.i.d. from π. Estimate (14) therefore holds under this form for any fixed data sample (in fact, for any empirical distribution π̂_n, it being understood that π̂_n determines ĥ) and any distribution π. Of course, the distributional assumption on the data is useful to control the second term of the bound.

Improved excess risk analysis
The analysis of the previous section has the merits of simplicity, generality, and reliance on existing results from linear inverse problems. However, it has some limitations, in particular when the bias term d(π, S) is not close to zero. To emphasize this point, we consider a simple example and compare the excess risk control (for the same sketched learning procedure) obtained through the general bound (14) to a direct computation specific to this example.
Consider the problem of estimating the median of a distribution on R: we assume Z = H = R, and consider the absolute value loss ℓ(x, h) = |x − h|, whose risk minimizer under the distribution π is the median h⋆ = Med(π). As a sketching operator we simply take Φ(x) = x, resulting in the sketch given by the empirical mean y = Mean(π̂_n) := (1/n) Σ_{i=1}^n x_i. Finally, as a model we consider the family of 1-point Dirac measures, S = {δ_x, x ∈ R} (this is the model consisting of all distributions with vanishing optimal risk; see Section 3.2 for a more general discussion). With these choices it obviously holds that the ideal decoder given by (12) is π̃ = ∆[y] = δ_y, and further ĥ = y. On the one hand, since h ↦ R(π, h) is 1-Lipschitz, the excess risk for this sketching/decoding scheme is bounded as follows, by a simple direct calculation, putting B(π) := |Mean(π) − Med(π)|:

R(π, ĥ) − R(π, h⋆) ≤ |ĥ − h⋆| = |Mean(π̂_n) − Med(π)| ≤ |Mean(π̂_n) − Mean(π)| + B(π).

On the other hand, it is easy to check that the LRIP (11) holds, with equality, for 1-point Dirac measures, with constant C_A = 1. If we consider the general bound (14), while we recover (up to a factor 4) the second term above, |Mean(π̂_n) − Mean(π)|, the first term in (14) is a bias term which is driven by

d(π, S) = inf_{x∈R} ( ‖π − δ_x‖_L + 2 |Mean(π) − x| )
≥ inf_{x∈R} ( R(π, Mean(π)) − |x − Mean(π)| + 2 |Mean(π) − x| )
≥ R(π, Mean(π)) ≥ R(π, h⋆).

The inequality in the second line above is obtained by taking h = Mean(π) in the supremum defining ‖π − δ_x‖_L = sup_{h∈R} |E_{X∼π}|X − h| − |x − h||. Hence, using the general bound (14) instead of the specific direct calculation, we get an additional, unwanted term corresponding to the optimal risk R(π, h⋆), which is nonzero as soon as π ∉ S and can become arbitrarily large (even if B(π) = 0, e.g. if π is symmetric around its mean). One reason for this lack of sharpness is that the analysis of the previous section concentrated first on a (uniform) control of the risk difference ‖π − π̃‖_L, deducing only as a second step a control on the excess risk. It is known from the statistical learning literature that it is generally sharper to directly analyze the excess risk, and correspondingly to consider the excess loss class (see, for instance, Bartlett et al., 2005, Section 5 and Koltchinskii, 2006, Section 7):

{ℓ(·, h) − ℓ(·, h⋆) : h ∈ H}.

However, the risk minimizer h⋆ depends on the distribution π; for this reason we will consider a family of excess losses and risks with respect to some reference hypothesis h₀.
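Before moving on, the direct bound in the median example above is easy to check numerically. Here is a small sanity check of our own, using the exponential distribution, for which Mean = 1 and Med = ln 2, so that B(π) > 0:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=1.0, size=100_000)   # Exp(1): Mean = 1, Med = ln 2

mean_pi, med_pi = 1.0, np.log(2.0)             # population mean and median
B = abs(mean_pi - med_pi)                      # bias term B(pi) = |Mean - Med|

h_hat = x.mean()                               # sketch-based hypothesis (the mean)
risk = lambda h: np.abs(x - h).mean()          # empirical risk, proxy for R(pi, .)

excess = risk(h_hat) - risk(med_pi)            # excess risk of the mean vs the median
bound = abs(h_hat - mean_pi) + B               # the direct bound from the text
```

Since the empirical risk is 1-Lipschitz in h, the bound holds here deterministically, whereas the first term of the general bound (14) would add the strictly positive optimal risk on top of it.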
Definition 2.4. The excess risk relative to a reference hypothesis h₀ is defined as

∆R_{h₀}(π, h) := R(π, h) − R(π, h₀),

and the associated excess risk divergence with respect to h₀ is

D_{h₀}(π ∥ π′) := sup_{h∈H} ( ∆R_{h₀}(π, h) − ∆R_{h₀}(π′, h) ).

Observe that ∆R_{h₀}(π, h₀) = 0 for each π, hence the latter quantity is nonnegative (although no absolute value is involved in its definition), but not symmetric in general. Yet, it satisfies an (oriented) triangle inequality: for any π, π′, π″,

D_{h₀}(π ∥ π″) ≤ D_{h₀}(π ∥ π′) + D_{h₀}(π′ ∥ π″).

It is therefore a hemimetric (see Definition B.1 in the Appendix).
The excess risk divergence will play a role similar to that of ‖π − π′‖_L, and satisfies in particular D_{h_0}(π ‖ π′) ≤ 2‖π − π′‖_L. With this setting we have the following result, which can be seen as a refinement of (14).
Theorem 2.5. Consider a loss class L(H), a feature function Φ, and a model set S such that every probability distribution τ ∈ S is both L-integrable and {Φ}-integrable. Assume that the sketching operator A associated to Φ satisfies the relaxed LRIP inequality (20) for some finite constants C_A > 0 and η ≥ 0.
Consider any training collection X = {x_i}_{i=1}^n ∈ Z^n, and denote π̂_n its empirical distribution. Consider π̃ an approximate decoding of the sketch satisfying (22) for some constants ε, ν ≥ 0, and ĥ satisfying R(π̃, ĥ) ≤ inf_{h∈H} R(π̃, h) + ε′ as in (23). Then, for any probability distribution π that is both L-integrable and {Φ}-integrable, the excess risk bound (24) holds.
Similarly to (14), the above estimate holds regardless of any distributional assumption on the training collection X. Nevertheless, estimate (24) is primarily of interest when X is drawn i.i.d. according to π and h_0 = h⋆_π, in which case the left-hand side is the excess risk with respect to the optimum risk, which is what one generally aims at controlling. However, in some situations it may also be helpful to consider the excess risk with respect to other reference hypotheses h_0; this includes situations where h⋆_π itself is not well defined because the infimum of the risk is not attained. Compared to (14), this result is more general, as it allows for a (ν, ε)-approximate decoder (22), an ε′-approximate ERM (23), and an η-approximate LRIP condition (20); more importantly, the main bound (24) involves the sharper excess risk divergence rather than the loss norm ‖·‖_L. It may also be useful to consider π = π̂_n, to predict the quality compared to the empirical risk minimizer.
Moreover, by inequality (19), the lower LRIP condition (11) considered in the previous section implies the relaxed LRIP condition (20) (with η = 0, and up to a factor 2 in the constant), so establishing (11) suffices to obtain the improved inequality (24).
The proof of Theorem 2.5 follows the structure outlined in the previous section, but requires formally extending the result of [Bourrier et al., 2014, Section IV-A] (leading from the LRIP (11) to instance optimality (10)) to the case of a hemimetric, an η-approximate LRIP, and an ε-approximate decoder. These technical aspects are relegated to Appendix B.

Discussion:
• Computing the sketch (21) is highly parallelizable and distributable. Multiple sketches can easily be aggregated and updated as new data become available.
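Because the sketch is an average of features, sketches computed on disjoint chunks can be merged exactly by weighted averaging. A minimal illustration (with a stand-in random Fourier feature map of our own choosing, not a prescription from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

d, m = 5, 64
W = rng.standard_normal((m, d))  # random frequencies (illustrative choice)

def sketch(X):
    # A(pi_n) = (1/n) sum_i Phi(x_i), here with Phi(x) = exp(i W x)
    return np.exp(1j * X @ W.T).mean(axis=0)

X1 = rng.standard_normal((300, d))   # chunk held by one worker
X2 = rng.standard_normal((700, d))   # chunk held by another worker

# Merging per-chunk sketches with weights proportional to chunk sizes
# reproduces the sketch of the full collection exactly.
merged = (300 * sketch(X1) + 700 * sketch(X2)) / 1000
full = sketch(np.vstack([X1, X2]))
assert np.allclose(merged, full)
```

The same weighted-average update serves for streaming: a running sketch of n points is updated with a new point by a convex combination with weights n/(n+1) and 1/(n+1).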
• As discussed in Remark 2.1, while (22) may appear to be a general nonparametric density estimation problem, in all the examples considered in this paper and in the companion paper [Gribonval et al., 2020] it is in fact a nonlinear parametric least-squares fitting problem, and the existence of the minimizer follows in practice from compactness arguments.
-For Compressive PCA (Section 4) it is a low-rank matrix reconstruction problem. Provably good algorithms to estimate its solution have been widely studied.
-For Compressive k-means and Compressive Gaussian Mixture Modeling (cf. the companion paper [Gribonval et al., 2020]), the resulting optimization problem has been empirically addressed through the CL-OMPR algorithm [Keriven et al., 2015, 2018]. Algorithmic success guarantees are an interesting challenge, which is however beyond the scope of this paper. We note that the classic (non-compressed) k-means problem, solved by minimization of the empirical risk, is known to be NP-hard [Garey et al., 1982, Aloise et al., 2009], and that guarantees for approaches such as k-means++ [Arthur and Vassilvitskii, 2007] hold only in expectation and with a logarithmic sub-optimality factor.
• In Section 3 we discuss choices of the model set S that are driven by the learning task only and make the minimization problem (23) trivial to solve. With these choices, the combined solution of (22)-(23) is explicitly turned into the minimization of a proxy for the risk, as in (4).
• The second term in the bound (24) of the excess risk, η_n, is driven by the empirical estimation error ‖A(π) − A(π̂_n)‖_2. It is easy to show that it decays as 1/√n when the data are drawn i.i.d. according to π; this will be done explicitly for the considered examples.
For a large collection of size n drawn i.i.d. according to π, the term ‖A(π) − A(π̂_n)‖_2 becomes small, and (24) shows that compressive learning will benefit from accurate excess risk guarantees provided the model S and the feature function Φ (or equivalently the sketching operator A) are chosen so that:
1. the LRIP (20) holds, ideally for a "small" value of m, as we also seek to design compact sketches and, eventually, tractable algorithms to learn from them;
2. the distance d_{h⋆_π}(π, S) is "small"; this vague notion will be exploited in Section 3 below to guide our choice of the model set S, and will be made more concrete on examples.
We illustrate the improvement obtained for the bias term, with respect to the coarser analysis, on the toy example of median estimation considered previously. In that setting we have h⋆_π = Med(π), and the bias term can be computed for τ = δ_x ∈ S; the resulting inequality is obtained by choosing x = Mean(π). Note that the presence of the last term is unavoidable (since |x − Med(π_0)| + |x − Mean(π)| ≥ |Med(π_0) − Mean(π)|), and that it is still larger than B(π), which we recall is the only bias term appearing in the direct calculation (15). One can construct situations where it is much larger. In this sense, even with the improved excess risk analysis, the general bound (24) can lack some tightness. It is nevertheless much sharper than the bound (16): in particular, d_{h⋆_π}(π, S) = 0 as soon as Mean(π) = Med(π), while the bias term of the coarser bound (16) remains positive in general in this case.
3 Task-driven model sets

An important ingredient of the proposed framework is the model set S, and we now discuss its choice. If prior knowledge on the data distribution is available, it is of course possible to choose S to incorporate such knowledge into compressive statistical learning. However, it is more common in statistical learning to seek "distribution-free" statistical guarantees. It may therefore be more desirable to derive the model set entirely, or mostly, from the learning task itself, to the extent possible.
To begin with, we show that for any model set S, the abstract two-step learning mechanism (cf. steps (22)-(23) in Theorem 2.5) can be written as the minimization of a more explicit proxy (4) for the empirical risk. This rewriting exploits a partition of S into submodels S_h driven by the learning task, i.e., by the loss family {ℓ(·, h)}_{h∈H}. For certain learning tasks, we further show that the submodels S_h and the corresponding proxy (4) have a simple expression, provided we choose a "natural", task-driven model set S. Finally, for so-called "compression-type" learning tasks, choosing such a task-driven model set allows one to control the bias term in the excess risk (24) by a function of the optimal risk R(π, h⋆).

3.1 Learning from a sketch without explicit density estimation
Consider a family S of L-integrable distributions, and assume that for each π ∈ S the risk admits a minimizer, i.e., there is h ∈ H such that R(π, h) = inf_{h′∈H} R(π, h′). When this holds, the model set S can be decomposed as S = ∪_{h∈H} S_h, where for each hypothesis h ∈ H the submodel S_h defined in (26) gathers the distributions of S whose risk is minimized at h. The hypothesis ĥ selected using steps (22)-(23) in Theorem 2.5 (with ε′ = 0) is then equivalently obtained as a near-minimizer of the proxy for the risk given in (27). With this expression in hand, it is possible to directly cast the estimation of ĥ as (4). To turn this into a concrete proxy for the risk, it is helpful to consider a model set S for which S_h has a simple characterization.
3.2 Choosing a model set: with or without prior knowledge?
Learning tasks such as maximum likelihood estimation directly involve a natural model set for which S_h as in (26) is easily characterized. Consider the loss ℓ(x, h) = − log π_h(x), with {π_h, h ∈ H} a parameterized family of distributions such that π_{h′} ≠ π_h for h′ ≠ h. In this setting, it is natural to consider the model set S := {π_h, h ∈ H}, which is nothing more than a statistical model in the usual sense. Moreover, up to a constant additive term, it then holds that R(π, h) = KL(π ‖ π_h), where KL(· ‖ ·) is the Kullback-Leibler divergence, so that for any h ∈ H the minimum-risk hypothesis under π_h is h itself; see [Cover and Thomas, 1991, Chapter 9]. As a conclusion, the proxy (27) then simply measures how well the sketch of π_h matches the empirical sketch y.
For many other learning tasks, the choice of the model set S results from a tradeoff between several needs. On the one hand, results from compressed sensing suggest that, given a model set S with proper "low-dimensional" properties, it is possible to choose a small sketch size m and design the sketching operator A such that the LRIP (20) holds, and the ideal decoder ∆ in (12) (or its relaxed version in (22)) is guaranteed to stably recover probability distributions in S from their compressed version obtained with A. This calls for the choice of a "small" model set. On the other hand, and perhaps more importantly, the model set should not be "too small", in order to ensure that the obtained control of the excess risk is nontrivial.
Ideally, in the common case of compression-type tasks, as defined below, the bias term in the excess risk (24) should be small when the true optimum risk is small, and even vanish when the true optimum risk vanishes, i.e., when inf_{h∈H} R(π, h) = 0.
Definition 3.1. We call the learning task a compression-type task if the loss can be written as ℓ(x, h) = d^p(x, P_h x), where d is a metric on Z, p > 0, and P_h : Z → Z is a "projection function", i.e., a map satisfying d(x, P_h x) = inf_{z∈P_h Z} d(x, z) for every x ∈ Z.
Typical examples of compression-type tasks are PCA, k-means, and k-medians. For PCA, P_h is the orthogonal projector onto the subspace h. For k-means and k-medians, P_h maps x ∈ Z = R^d to the closest center c_i from h = (c_1, ..., c_k), with ties broken arbitrarily. In other words, given an arbitrary Voronoi partition corresponding to k disjoint sets W_j such that ∪_j W_j = R^d and d(x, c_j) = min_l d(x, c_l) for each x ∈ W_j, we have P_h x = c_j if and only if x ∈ W_j. Manifold learning tasks where P_h is a projection onto a manifold parameterized by h (with ties broken arbitrarily) also fit into this framework.
For a compression-type task, a natural model set is the family S^CT(H) of L-integrable probability distributions with vanishing optimal risk, defined in (31). We consider a few examples:
• Compressive PCA: the model set S^CT(H) consists of all distributions that admit a matrix of second moments of rank at most k. Given any π ∈ S^CT(H), a minimum-risk hypothesis according to (6) is any subspace ĥ spanned by eigenvectors associated to the k largest eigenvalues of Σ_π. More details are given shortly in Section 4.
• Compressive k-means or k-medians: the model set S^CT(H) consists of mixtures of k Diracs. Given h = {c_1, ..., c_k} and any π = Σ_{ℓ=1}^k α_ℓ δ_{c_ℓ} ∈ S_h^CT, a minimum-risk hypothesis according to (6) is ĥ = h. Since A(δ_c) = Φ(c), the proxy (27) reads as the minimization of ‖y − Σ_{ℓ=1}^k α_ℓ Φ(c_ℓ)‖_2 over the centers c_1, ..., c_k and the nonnegative weights α_ℓ summing to one.
For compressive PCA we exhibit in Section 4 a feature function Φ such that A satisfies the LRIP (20) with respect to the model set S = S^CT(H). The same is done in the companion paper [Gribonval et al., 2020] for compressive k-means, compressive k-medians, and compressive Gaussian mixture modeling.
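For intuition, the compressive k-means proxy can be evaluated numerically. The snippet below is our own illustrative setup (random Fourier features, no actual solver such as CL-OMPR); it only checks that the sketch-matching objective is smaller at well-placed centers than at misplaced ones:

```python
import numpy as np

rng = np.random.default_rng(2)

d, m, k = 2, 100, 3
W = rng.standard_normal((m, d))  # random frequencies (illustrative choice)

def Phi(c):
    # feature of a single point c in R^d
    return np.exp(1j * W @ c)

def A(X):
    # sketch of the empirical distribution of the rows of X
    return np.exp(1j * X @ W.T).mean(axis=0)

# Data drawn from 3 well-separated clusters
true_centers = np.array([[-5.0, 0.0], [0.0, 5.0], [5.0, 0.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((200, d)) for c in true_centers])
y = A(X)  # a single m-dimensional summary of all 600 points

def proxy(centers, alphas):
    # || y - sum_l alpha_l Phi(c_l) ||_2, the sketch-matching objective
    model = sum(a * Phi(c) for a, c in zip(alphas, centers))
    return np.linalg.norm(y - model)

alphas = np.full(k, 1 / 3)
good = proxy(true_centers, alphas)        # centers at the truth
bad = proxy(true_centers + 3.0, alphas)   # all centers shifted
assert good < bad
```

Minimizing this objective over centers and weights is exactly the nonlinear least-squares problem that CL-OMPR addresses heuristically.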

3.3 Controlling the bias term for compression-type tasks
The bias term, defined in (25), is a measure of the distance to the model set S. For compression-type tasks, the particular model set S = S^CT(H) in (31) was designed so that this bias term vanishes when π ∈ S, and we can further bound the bias term d_{h⋆_π}(π, S^CT(H)) by an increasing function of the true minimum risk R(π, h⋆). This leads to recovery guarantees providing distribution-free excess risk guarantees. Whether this holds for other learning tasks, or even generically, is a challenging question left to future work.
The following lemmas allow us to upper bound the bias term as a function of the minimum risk in a number of relevant cases. For a probability distribution π on Z and P_h as in Definition 3.1, we denote by P_h π the push-forward of π through P_h, i.e., the probability distribution of a random variable Y = P_h X where X ∼ π. Given a loss class L(H), we recall that h⋆_π = arg min_{h∈H} R(π, h).
Lemma 3.2. Consider a compression-type task on the input space Z. Then:

• S_h^CT is the set of probability distributions of X ∈ Z such that X ∈ E_h := P_h Z almost surely. With the model set S^CT(H) and the loss class L(H), the bias term (25) satisfies the bound (33) for any h_0 ∈ H.
• If d^p is a metric (in particular if p ≤ 1), then D_h(π ‖ P_h π) = 0 for any h ∈ H and L-integrable distribution π.
Remark 3.3. When d^p is not a metric, there are u, v, w ∈ Z such that d^p(u, v) > d^p(u, w) + d^p(w, v). The loss ℓ(x, h) := d^p(x, h), with H := {v, w}, defines a compression-type task with P_h x = h for all x ∈ Z. Set π = δ_u. Since d(u, w) < d(u, v) we have h⋆_π = w. We also have, for h = h⋆_π = w, D_h(π ‖ P_h π) = d^p(u, v) − d^p(u, w) − d^p(w, v) > 0. Hence, one cannot generically obtain D_h(π ‖ P_h π) = 0, not even with the restriction h = h⋆_π. For p = 2, d the Euclidean distance on R^d, and h_0 = h⋆_π (which, we recall, is generally the primary case of interest since our main bound (24) then gives a control of the excess risk with respect to the optimum), we still have D_{h⋆_π}(π ‖ P_{h⋆_π} π) = 0 for certain tasks. In light of the above remark, this is a nontrivial property; it is established for PCA in Lemma E.1, and for k-means in the companion paper [Gribonval et al., 2020]. Beyond these specific situations, it is possible (under somewhat generic additional assumptions) to bound the two terms appearing in (33) by (a power of) the risk itself, as established next.
Lemma 3.4. Consider a compression-type task where (Z, d) is a separable metric space. Then:
• If Z has d-diameter bounded by B, then the bound (34) holds for any p > 1, h ∈ H, and L-integrable distribution π, and the bound (35) holds for h ∈ H′ and p ≥ 1.
• For h ∈ H and p ≤ 1, if the space Z has d-diameter bounded by B, then (36) holds.
The proofs of the above lemmas are given in Appendix D. For Lemma 3.4, optimal transport is exploited through connections between the considered norms and the norm ‖π − π′‖_{Lip(L,d)}, where Lip(L, d) denotes the class of functions f : (Z, d) → R that are L-Lipschitz. The two lemmas can be combined to express an "explicit" bound on d_{h⋆_π}; this is postponed to concrete examples.

4 Illustration with Compressive PCA
As a first simple illustration, this general compressive statistical learning framework can be applied to the example of PCA, where most of the required tools already exist. Our aim is essentially illustrative and focuses on controlling the excess risk, rather than on comparing the results with state-of-the-art PCA techniques.
Definition of the learning task. The risk associated to the PCA learning problem is defined as R_{k-PCA}(π, h) = E_{X∼π} ‖X − P_h X‖²_2, with P_h the orthogonal projector onto the subspace h. It is minimized by any subspace h⋆_π associated with the k largest eigenvalues of the matrix Σ_π = E_{X∼π} XX^T. It is well established [Foucart and Rauhut, 2012] that matrices that are approximately low-rank can be estimated from partial linear observations under a certain Restricted Isometry Property (RIP). This leads to the following natural way to perform Compressive PCA.
Choice of feature function. Choose (at random) a linear operator M : R^{d×d} → R^m satisfying (with high probability) the following RIP on low-rank matrices:
(1 − δ) ‖M‖_F² ≤ ‖M(M)‖_2² ≤ (1 + δ) ‖M‖_F², (37)
for any M ∈ R^{d×d} of rank at most 2r, with ‖·‖_F the Frobenius norm and δ < 1. This is feasible with m of the order of rd, by taking the Frobenius inner product of M with m independent random Gaussian matrices [see e.g. Foucart and Rauhut, 2012]. Given these facts, one can define the feature function as Φ : x ↦ M(xx^T).
Sketch computation. Given sample points x_1, ..., x_n in R^d, compute the sketch y as in (3), i.e., compute empirical estimates of random second moments of the distribution π of X.
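The sketch computation step can be sketched in code as follows (illustrative; the Gaussian sketching matrices and their normalization are an assumption of ours, consistent in spirit with the construction referenced above). The check at the end verifies that, by linearity, sketching the empirical second-moment matrix is the same as averaging per-point feature vectors:

```python
import numpy as np

rng = np.random.default_rng(3)

d, m, n = 10, 200, 5000
M = rng.standard_normal((m, d, d)) / np.sqrt(m)  # random sketching matrices

# Data with a decaying spectrum
X = rng.standard_normal((n, d)) @ np.diag(np.linspace(2.0, 0.1, d))

# y_j = (1/n) sum_i <M_j, x_i x_i^T>_F: random second moments of pi_n.
# By linearity this equals <M_j, Sigma_n>_F with Sigma_n the empirical
# second-moment matrix, so the n x d data never need to be stored.
Sigma_n = (X.T @ X) / n
y = np.einsum('jab,ab->j', M, Sigma_n)

# Same sketch, computed as the average of per-point features M_j(x x^T)
y_per_point = np.einsum('jab,ia,ib->ji', M, X, X).mean(axis=1)
assert np.allclose(y, y_per_point)
```

The m-dimensional vector y (with m of order rd) then replaces both the data and the d × d moment matrix in the learning step.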
Learning from a sketch. Given a sketch vector y, estimate a solution Σ̂ of the optimization problem (38), a least squares fit over positive semi-definite (p.s.d.) symmetric matrices Σ ⪰ 0 of rank at most r. This step estimates the rank-r p.s.d. matrix whose sketch best matches that of the empirical matrix of second moments, in the least squares sense. Compute the eigen-decomposition Σ̂ = UDU^T and output ĥ := span(U(:, 1:k)). (39)
In Appendix E we control the excess risk of PCA by relating the excess risk divergence D_{h_0}(π ‖ π′), with h_0 ∈ H an arbitrary hypothesis, to the Frobenius norm ‖Σ_π − Σ_{π′}‖_F, and by upper bounding the "bias" term (25) appearing in the generic bound of Theorem 2.5, to obtain the following result.
Theorem 4.1. Consider any probability distribution π with finite second moments and any draw of x_i, 1 ≤ i ≤ n (represented by the empirical distribution π̂_n). Applying the above approach yields, for any s with 1 ≤ s ≤ r, the bound (40), where λ_i(Σ_π) are the eigenvalues of Σ_π ranked in decreasing order (with multiplicity). In particular, (41) holds.
Discussion:
• Bias term. The first term on the right-hand side of (41) is a bias term that vanishes when the true risk is low. Since it is proportional to the true risk, it leads to a (non-sharp) oracle inequality. We show in [Gribonval et al., 2020] (using Lemma 3.4 and (36)) that this type of property also holds for Compressive k-medians; for Compressive k-means we prove similar properties where the bias term is bounded by the square root of the true risk (using (34), (35)). It is notable that the bias multiplier C_δ(k, r) is of order √k if we use the natural model set (r = k), but drops to a constant independent of k as soon as we choose, e.g., the larger model set S_r with r = 2k. Thus, there appears to be a clear advantage, in the sense of the obtained bound, in choosing a reconstruction model that is larger than the natural model set S^CT(H), while not significantly changing the order of magnitude of the required sketch size. At this point it is unclear to us whether the inflation of the bias factor for the natural model is unavoidable or just a technical artefact.
• Sample complexity. Regarding the second term, if we further assume that the support of π is contained in a Euclidean ball of radius R, then by the RIP (37) we have almost surely ‖M(xx^T)‖_2 ≤ √(1 + δ) · R²; hence, by the vectorial Hoeffding inequality [see e.g. Pinelis, 1992], we obtain, with high probability with respect to the data sampling, that ‖A(π) − A(π̂_n)‖_2 is of order R²/√n.
• Root-n consistency in a high-dimensional scenario. As noticed in the previous point, the statistical estimation error term in the sketched learning bound is of order √(k/n). Consider a high-dimensional situation where d is large and grows with n, and assume a polynomial spectral decay λ_j(Σ_π) ≤ j^{−α} with α > 1. Then, choosing s = r/2, the bias term in (40) is of order √k · r^{−(2α−1)}. As a consequence, it is sufficient to take r of order min(d, n^{1/(2α−1)}) to ensure that the bias term is at most of the same order as the statistical error term. This gives an advantage, compared to the standard approach of storing all d² second-order moments, as soon as d > n^{1/(2α−1)}. For comparison, a standard statistical analysis of PCA based on a uniform bound on the deviations of the empirical risk from its expectation (see e.g. Shawe-Taylor et al., 2005) leads to a control of order √(k/n). Hence, using the sketched approach we can significantly reduce the storage/memory footprint while keeping statistical guarantees of the same order as in the standard setting.
• Relation to efficient kernel-PCA methods. Methods have been proposed in the recent literature to make kernel PCA more efficient, relying on Nyström subsampling [Sterge et al., 2020] or on approximation of the kernel using random features [Ullah et al., 2018, Sriperumbudur and Sterge, 2020]. Even when restricting attention to a linear kernel, a direct comparison with our approach proves delicate. The Nyström subsampling method is very much tailored to the dual point of view, which is canonical in kernel methods (i.e., approximation of the kernel Gram matrix): it implicitly requires storing all training points in order to represent the output of the method as a kernel expansion (the computational gain concerns the storage and manipulation of the n × n Gram matrix). On the other hand, the random feature approach could be construed as closer in spirit to ours; however, rather than storing generalized moments as we do, it is in essence a (random) dimension reduction of the individual input points from the kernel space to a finite-dimensional feature space where regular PCA is performed. Also, the theoretical works of Ullah et al. [2018] and Sriperumbudur and Sterge [2020] concern reconstruction in the L²(π) space and not in the original kernel space norm (lifting the PCA projection found in the approximate feature space back into the original kernel space proves a delicate question). In contrast, we consider sketching the data into generalized empirical moments but propose a reconstruction method directly in the relevant space, with theoretical guarantees. It is to be noted, however, that the results of Ullah et al. [2018] posit a black-box PCA method applied in the finite- (but high-)dimensional approximate feature space, and one could apply a sketching approach at this stage. This suggests that efficient kernel PCA methods could be combined with, rather than be in competition with, the sketching approach, though we did not push the idea further.
Practical algorithms for learning, and comparison to prior PCA-specific results. One can consider several relaxations of the nonconvex optimization problem (38) in order to perform compressive PCA. Besides convex relaxations based on minimizing the nuclear norm [Foucart and Rauhut, 2012, Section 4.6], Kabanava et al. [2016] showed (in a complex-valued setting) that the rank constraint in (38) can be relaxed when M is made of random rank-one projections, i.e., when Φ(x) = (|⟨a_j, x⟩|²)_{j=1}^m, where a_j ∈ C^d are independent standard complex Gaussian vectors. In this setting, let Σ̃ be the minimizer of the sketch-matching objective (42) over all p.s.d. matrices, without any rank constraint, and ĥ the corresponding hypothesis obtained through (39). Combining [Kabanava et al., 2016, Theorem 4 with p = 2] with Equation (72) in Section E and Equation (63) in Section B, we obtain the following result: if m ≥ Ckd, where C is a universal constant, then with high probability on the draw of the a_j, for any x_1, ..., x_n, we have an excess risk control analogous to (41), in which D_1, D_2 are positive universal constants that do not depend on k.
Hence, provided we use a model set of dimension 2r as discussed above, the error control (41) obtained via our general approach matches what can be obtained directly from the PCA-specific analysis of Kabanava et al. [2016]. Two practical advantages of the latter are (a) that (42) is a convex program, and (b) that the sketches are made using rank-one matrices, which are cheaper to store. Still, the guarantees obtained by our general approach match prior results for setting-specific methods, and the approach further permits the study of the less trivial settings of compressive clustering and compressive Gaussian mixture estimation, as shown in the companion paper [Gribonval et al., 2020].

5 Establishing the LRIP for random sketching operators
In this section, we investigate how to establish the LRIP (20) (with η = 0) when the sketching operator A is associated to random features.The approach uses connections with the notion of kernel mean embedding of probability distributions.

Random features and kernel mean embeddings
Definition 5.1 (Random feature map). Consider F := {φ_ω}_{ω∈Ω} a parameterized family of (real- or complex-valued) measurable functions, Λ a probability distribution over the parameter set Ω (often Ω = R^d), and a sketch size m. A random feature map is defined by drawing m i.i.d. parameters (ω_j)_{j=1}^m according to Λ and setting Φ(x) := (1/√m) (φ_{ω_1}(x), ..., φ_{ω_m}(x)). (43)
Any draw of the feature function Φ defines a positive semi-definite kernel between samples, κ_Φ(x, x′) := ⟨Φ(x), Φ(x′)⟩_{R^m} (or ⟨Φ(x), Φ(x′)⟩_{C^m}). Compressive learning is deeply connected to kernel mean embeddings of probability distributions, as the related sketching operator A defines a so-called kernel mean embedding between probability distributions that are F-integrable.
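As a concrete instance (our own illustrative check, with a scaling convention we choose), random Fourier features φ_ω(x) = exp(i ω^T x) with ω drawn from a Gaussian distribution yield a Gaussian average kernel, and the empirical kernel κ_Φ concentrates around it as m grows:

```python
import numpy as np

rng = np.random.default_rng(4)

d, m, sigma = 3, 20_000, 1.0
Omega = rng.standard_normal((m, d)) / sigma  # omega_j drawn i.i.d. from Lambda

def Phi(x):
    # normalized so that <Phi(x), Phi(x')> averages phi_w(x) conj(phi_w(x'))
    return np.exp(1j * Omega @ x) / np.sqrt(m)

x, xp = rng.standard_normal(d), rng.standard_normal(d)

kappa_phi = np.vdot(Phi(xp), Phi(x)).real  # empirical kernel kappa_Phi
# Average kernel: E_w exp(i w^T (x - x')) = Gaussian kernel for Gaussian w
kappa_bar = np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

# Monte Carlo fluctuation is O(1/sqrt(m)), so the two agree closely here
assert abs(kappa_phi - kappa_bar) < 0.05
```

The same concentration, uniformized over the normalized secant set, is exactly what the covering arguments of this section are designed to establish.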
Definition 5.2 (Kernel mean embedding, Maximum Mean Discrepancy [Gretton et al., 2007, Sriperumbudur et al., 2010]). Any positive semi-definite kernel κ(·, ·) on the sample space is associated with a mean map embedding (a kernel between distributions). By abuse of notation, we keep the notation κ both for the kernel on the sample space and for the corresponding kernel on probability distributions with appropriate integrability, κ(π, π′) := E_{X∼π, X′∼π′} κ(X, X′). The associated Maximum Mean Discrepancy (MMD) metric is ‖π − π′‖_κ := √(κ(π, π) − 2κ(π, π′) + κ(π′, π′)).
The average kernel κ associated to (F, Λ), defined in (46) as κ(x, x′) := E_{ω∼Λ} φ_ω(x) conj(φ_ω(x′)), will play a key role in establishing the LRIP. Given any draw of Φ, the empirical kernel κ_Φ(x, x′) has κ(x, x′) as its expectation; similarly, given π, π′, the squared MMD ‖π − π′‖²_κ with this kernel is the expectation of ‖A(π) − A(π′)‖²_2. A characterization of the MMD that we will leverage throughout this section is that, for any π, π′, ‖π − π′‖²_κ = E_{ω∼Λ} |E_{X∼π} φ_ω(X) − E_{X′∼π′} φ_ω(X′)|².
We observe that A satisfies the LRIP (20) (with η = 0) for a given model set S if, and only if, the metric ‖π − π′‖_{∆L} is dominated by ‖π − π′‖_{κ_Φ} for π, π′ ∈ S. Our overall strategy to check that a random feature function Φ defined by F and Λ satisfies the LRIP (20) (with η = 0) with controlled sketch dimension m will be to:
1. prove that the average kernel κ defined by (46) satisfies the kernel LRIP (47);
2. in the spirit of compressive sensing theory, use concentration of measure and covering arguments to show that for any 0 < δ < 1, for large enough m, with high probability on the draw of the ω_j, the uniform approximation (48) holds, so that the kernel LRIP (47) actually holds with κ_Φ instead of κ, with a constant degraded only by a factor depending on δ.
Remark 5.3. Expression (48) expresses a control of the relative error of approximation of the MMD, restricted to certain distributions. This contrasts with state-of-the-art results on random features that either control the approximation of the kernel uniformly [Sriperumbudur and Szabó, 2015] or control the approximation of functions in the RKHS induced by κ by those in the RKHS induced by κ_Φ [Bach, 2017]. These types of control are indeed of a different nature compared to ours, and none seems to be a direct consequence of the others.

Ingredients to verify the Lower Restricted Isometry Property
In view of the inequalities (47)-(48) that we need to prove, the analysis will focus on the so-called normalized secant set of the model S with respect to the average kernel κ, defined as follows [see, e.g., Dirksen, 2016, Puy et al., 2017].
Definition 5.4 (Normalized secant set). The normalized secant set of a model set S with respect to a kernel κ is the following subset of the set of finite signed measures (see Appendix A.2) with appropriate integrability: S_κ := { (π − π′)/‖π − π′‖_κ : π, π′ ∈ S, ‖π − π′‖_κ > 0 }.
Using the secant set, the LRIP (48) is equivalent to a uniform control of ‖A(μ)‖_2 over μ ∈ S_κ. The radius of S_κ with respect to certain function norms will play an important role. Since this notion will come up repeatedly in the analysis, we introduce the following notation, which will be heavily used in the sequel: given a norm ‖·‖ on measures, the radius of a subset E of finite signed measures is denoted ‖E‖ := sup_{μ∈E} ‖μ‖. With these definitions we can observe that sup_{π≠π′∈S} ‖π − π′‖_{∆L}/‖π − π′‖_κ = ‖S_κ‖_{∆L}, where we recall that the metric ‖·‖_G is defined in (8). In particular, the constant from (47) can equivalently be rewritten as C_κ := ‖S_κ‖_{∆L}.
Concerning (48), the strategy to establish it will rely on the following two quantities: first, a concentration function c_κ(t) characterizing the pointwise (i.e., for fixed τ, τ′ ∈ S) concentration of ‖A(τ) − A(τ′)‖²_2 around its expectation; second, certain covering numbers of S_κ needed to pass from pointwise to uniform concentration.
Classical arguments from compressive sensing [Baraniuk et al., 2008, Eftekhari and Wakin, 2015, Puy et al., 2017, Dirksen, 2016, Foucart and Rauhut, 2012] prove that certain random linear operators satisfy the RIP by relying on pointwise concentration inequalities. Similarly, a first step to establish that the inequalities (48) hold with high probability consists in first establishing a pointwise version of the same, i.e., for any choice of m in (43) and any μ ∈ S_κ, a concentration inequality (52) governed by some concentration function t ↦ c_κ(t) that should ideally be as small as possible. The following result shows that the radius ‖S_κ‖_F can be used to control such a concentration function.
Lemma 5.5. Consider a family of functions F := {φ_ω}_{ω∈Ω}, m parameters (ω_j)_{j=1}^m drawn i.i.d. according to some distribution Λ on Ω, and A the (random) operator induced (see (5)) by the resulting feature function Φ. The proof is in Appendix C. Observe that the above estimate depends only on the choice of the feature family F and holds for any feature sampling distribution Λ. More refined estimates for mixture models, exploiting moments of Λ rather than a uniform bound, are provided in the companion paper [Gribonval et al., 2020] and used to obtain concrete estimates for Compressive Clustering and Compressive GMM.
Finally, we extrapolate the pointwise concentration (52) to all pairs τ, τ′ ∈ S using covering numbers of the normalized secant set with respect to an appropriate metric. As the normalized secant set is a subset of the infinite-dimensional space of finite signed measures, it is not obvious whether its covering numbers are even finite. Controlling them can be nontrivial, yet it is feasible on a case-by-case basis, as will be illustrated in the companion paper [Gribonval et al., 2020].
Covering numbers and pointwise concentration can then be combined to give the following result (whose proof is in Appendix C), where the logarithm of the covering numbers captures an intrinsic dimension of the considered learning task.
Theorem 5.7. Consider F := {φ_ω}_{ω∈Ω} a family of functions, Λ a probability distribution on Ω, Φ the associated random feature function, and κ the corresponding average kernel. Consider the pseudometric d_F on F-integrable probability distributions defined in (54). Consider a model set S and S_κ = S_κ(S) its normalized secant set, and assume that S_κ has finite covering numbers with respect to the pseudometric d_F. Then, with probability at least 1 − ζ on the draw of (ω_j)_{j=1}^m i.i.d. according to Λ, the uniform control (48) holds for m large enough with respect to these covering numbers.

Summary and applications
To briefly summarize the results of this section: in order to establish the LRIP with respect to a given model set S, for a sketching operator Φ associated with a family of random features F and a feature sampling distribution Λ, we proceed as follows. After identifying the associated average kernel (44), the key quantities to estimate relative to the normalized secant set S_κ(S) are its radius ‖S_κ‖_{∆L} (which serves as a measure of compatibility between the kernel, the learning task, and the model set S), the pointwise concentration function c_κ(·) from (52), and the covering numbers of S_κ with respect to the distance d_F from (54). Even though the above ingredients and results may look quite abstract at this stage, we can turn them into concrete estimates on several examples. The resulting guarantees are summarized in Table 1 for the examples developed in detail in the companion paper [Gribonval et al., 2020] (compressive k-means, k-medians, and GMM).
The random sketching results developed in the present section can also be used to revisit the illustrative PCA example of Section 4. Namely, while we directly lifted from the existing literature the RIP property (37) for random Gaussian sketching matrices applied to low-rank covariance matrices, the arguments used there to establish this property follow in essence the canvas of this section (pointwise concentration of the random operator around its average, then uniform concentration via appropriate covering-number arguments). In this context, the squared MMD with respect to the average kernel is precisely the squared Frobenius norm between covariance matrices. The additional ingredient needed to complete the analysis is to relate the PCA excess risk to the Frobenius norm of differences of low-rank matrices (see (72) in the technical Appendix E, which can be reinterpreted as a bound on ‖S_κ‖_{∆L} in the PCA setting).

Conclusion and perspectives
The principle of compressive statistical learning is to learn from large-scale collections by first summarizing the collection into a sketch vector made of empirical (random) moments, before solving a nonlinear least squares problem. The main contribution of this paper is to set up a general mathematical framework for compressive statistical learning and to demonstrate on an example (compressive PCA) that the excess risk of this procedure can be controlled, as well as the sketch size. The companion paper [Gribonval et al., 2020] completes the illustration of the framework by considering two more examples: compressive clustering and compressive Gaussian mixture estimation, with fixed known covariance.

Task: PCA | k-med./means (p = 1 / p = 2) | Gaussian Mixture Modeling
Hypothesis h: subspace | k cluster centers | mixture π_h of k Gaussians
Feature function Φ(x): quadratic polynomials | weighted Fourier features | Fourier features

Table 1: Summary of the application of the framework to our three main examples (detailed in Section 4 and the companion paper [Gribonval et al., 2020]) in Z = R^d. S^{k−1} denotes the (k − 1)-dimensional simplex (i.e., the sphere with respect to the ℓ1-norm in the non-negative orthant of R^k), and ‖x‖_Σ = x^T Σ^{−1} x the Mahalanobis norm associated to the positive definite covariance matrix Σ. The order of the sketch size is indicated up to a universal numerical multiplicative factor and logarithmic dependencies on the parameters δ and ζ from Theorem 5.7. The displayed average kernels are given up to a multiplicative constant.
Sharpened estimates? Our demonstration of the validity of the compressive statistical learning framework for certain tasks is, in a sense, qualitative, and we expect that many bounds and constants are sub-optimal. A number of non-sharp oracle inequalities have been established in the course of our endeavor. A particular question is to obtain a more explicit and/or tighter control of the bias term d_{h⋆_π}(π, S_{CT(H)}), and to understand whether Lemma 3.2, which relates this bias term to the optimal risk, can be tightened and/or extended to other loss functions. In the same vein, as fast convergence rates for the excess risk can be established for certain classical statistical learning tasks under appropriate conditions (see, e.g., [Levrard, 2013] for the case of k-means), it is natural to wonder whether the same holds for compressive statistical learning.
Links with neural networks. From an algorithmic perspective, the sketching techniques we have explicitly characterized in this paper have a particular structure which is reminiscent of a one-layer (random) neural network, with a subsequent averaging over multiple data points. Indeed, when the sketching function Φ corresponds to random Fourier features, its computation for a given vector x involves first a multiplication by the matrix W ∈ R^{m×d} whose rows are the selected frequencies ω_j ∈ R^d, then the pointwise application of the complex exponential nonlinearity e^{i(·)}. Here we consider random Fourier moments, hence a subsequent averaging operation is performed. As we have seen, this draws a link with the MMD, as exploited e.g. in the so-called MMD-GANs [Li et al., 2015, Binkowski et al., 2018], where the discriminator is a neural net trained to compute MMDs over batches of samples.
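The "one-layer random network" view of the random Fourier sketch can be made concrete in a few lines (a minimal sketch with toy data; the frequency scale and dimensions are arbitrary choices of ours). For a standard Gaussian input distribution the sketch estimates the characteristic function E exp(i⟨ω_j, X⟩) = exp(−∥ω_j∥²/2), which gives a simple sanity check.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 10, 64, 10_000           # input dim., sketch size, collection size
X = rng.normal(size=(n, d))        # toy training collection (standard Gaussian)

# One-layer random network: multiply by the random frequency matrix W,
# apply the pointwise complex exponential, then average over the collection
W = rng.normal(size=(m, d))        # rows = sampled frequencies omega_j
sketch = np.exp(1j * X @ W.T).mean(axis=0)   # empirical random Fourier moments

# The sketch estimates the characteristic function of the data distribution;
# for a standard Gaussian this is exp(-||omega_j||^2 / 2)
target = np.exp(-0.5 * np.sum(W**2, axis=1))
print(np.max(np.abs(sketch - target)))       # small for large n
```

Each entry of the sketch concentrates around its expectation at rate O(1/√n), which is the pointwise concentration exploited throughout Section 5.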
This suggests that our analysis could help analyze the tradeoffs between the reduction of the information flow (dimension reduction) across multiple layers of such networks and the preservation of statistical information [Shwartz-Ziv and Tishby, 2017]. For example, this could explain why the pooled output of a layer is rich enough to cluster the input patches. Given the focus on drastic dimension reduction, this seems very complementary to the work on the invertibility of deep networks and pooling representations with random Gaussian weights [Bruna Estrach et al., 2014, Giryes et al., 2016, Gilbert et al., 2017]. Finally, we mention the recent popularity of networks with random weights in statistical physics [Gabrié et al., 2018] and in the analysis of the initialization point of optimization algorithms through a kernel characterization [Jacot et al., 2018, Bietti and Mairal, 2019], for which information preservation (non-degeneracy during training) is also an essential feature.
Privacy-aware learning via sketching? The reader may have noticed that, while we defined sketching in (3) as the empirical average of (random) features Φ(x_i) over the training collection (or in fact the training stream), the essential feature of the sketching procedure is to provide a good empirical estimator of the sketch vector A(π) = E_{X∼π} Φ(X) of the underlying probability distribution. A consequence is that one can envision other sketching mechanisms, in particular ones more compatible with privacy-preservation constraints [Duchi et al., 2014]. For example, one could average Φ(x_i + ξ_i), or Φ(x_i) + ξ_i, or D_i Φ(x_i), etc., where ξ_i is a heavy-tailed random vector drawn independently from x_i, and D_i is a diagonal "masking" matrix with random Bernoulli {0, 1} entries. An interesting perspective is to characterize such schemes in terms of tradeoffs between differential privacy and the ability to learn from the resulting sketch. Preliminary results in this direction have recently been achieved by Schellekens et al. [2019] and Chatalic et al. [2021].
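A toy numerical illustration of why such variants remain usable (with synthetic data standing in for the features Φ(x_i), and Laplace noise as a simple stand-in for ξ_i): averaging Φ(x_i) + ξ_i with centered noise, or 2·D_iΦ(x_i) with Bernoulli(1/2) masks, still yields unbiased estimators of the same sketch vector A(π), only with increased variance.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 32, 50_000

# Synthetic stand-in for the feature vectors Phi(x_i) of a training stream
phi = rng.normal(loc=1.0, size=(n, m))

# Plain sketch vs. two privacy-oriented variants estimating the same A(pi)
plain = phi.mean(axis=0)
noisy = (phi + rng.laplace(scale=5.0, size=(n, m))).mean(axis=0)  # Phi(x_i) + xi_i
D = rng.integers(0, 2, size=(n, m))                # Bernoulli {0,1} masks D_i
masked = 2.0 * (D * phi).mean(axis=0)              # factor 2 compensates E[D_i] = I/2

# Both variants stay close to the plain sketch for large n (unbiasedness)
print(np.abs(noisy - plain).max(), np.abs(masked - plain).max())
```

The privacy/utility tradeoff then amounts to quantifying how the extra variance of these estimators inflates the sketch size needed for a given excess-risk guarantee.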
Recipes to design sketches for other learning tasks through kernel design? Given the apparent genericity of the proposed compressive statistical learning framework, a particular challenge is to extend it beyond the learning tasks considered in this paper and its companion [Gribonval et al., 2020]. Kernel versions of these tasks (kernel PCA, kernel k-means, or spectral clustering) appear as the most likely immediate extensions. They are expected to lead to sketching architectures reminiscent of two-layer convolutional neural networks with additive pooling. Compressive supervised classification and compressive regression are also natural candidate tasks in a learning setting, but seem more challenging. Given a learning task, the main bottleneck is to find an adequate sketching function Φ(·). As illustrated on Figure 1, this primarily relies on the quest for a task-compatible kernel, i.e., one satisfying the Kernel LRIP (47). Subsequent technical steps would rely on the identification of an integral representation of this kernel using random features with the right concentration properties, and on establishing that the associated secant set has finite covering dimension with respect to the feature-based metric (54). On a case-by-case basis, one may have to identify the analog of the separation conditions apparently needed for compressive k-means, see [Gribonval et al., 2020].
Vice versa, one could wonder which family of learning tasks is compatible with a given kernel. In other words, how "universal" is a kernel, and how much can be learned from a single sketched representation of a database? We expect that tasks such as compressive ranking, which involve pairs, triples, etc. of training samples, may require further extensions of the compressive statistical learning framework, so as to design sketches based on U-statistics rather than plain moments. These would lead to sketches linear in the product probability π ⊗ π instead of π. The investigation of such extended scenarios is expected to benefit from analogies with the lifting techniques used in phaseless reconstruction, see e.g. [Candès et al., 2013].
Proof. The proof follows very closely [Bourrier et al., 2014] and is adapted to the fact that d_X(·∥·) is a hemimetric. Consider x⋆ ∈ X, y ∈ Y and x = ∆(y). Consider any z ∈ Σ and write the corresponding chain of bounds. As this holds for any z ∈ Σ, taking the infimum yields the result.
Lemma C.1 (Bernstein's inequality). Let X_i ∈ R, i = 1, …, N be i.i.d. bounded random variables such that E X_i = 0, |X_i| ≤ M and Var(X_i) ≤ σ² for all i. Then for all t > 0 we have

P( |(1/N) Σ_{i=1}^N X_i| ≥ t ) ≤ 2 exp( −N t² / (2(σ² + Mt/3)) ).

Proof of Lemma 5.5. First, observe that for any F-integrable probability distributions π, π′, applying Lemma C.1 with the independent random variables Z(ω_j) we obtain a deviation bound for each t > 0; by definition of the concentration function, for any t > 0 and m ≥ 1 this establishes a pointwise concentration result when μ is on the normalized secant set S_κ. We now use a standard argument to extend this to a uniform result on S. Let μ_i, 1 ≤ i ≤ N, be the centers of a δ/2-covering (with respect to the metric d_F) of S. Using (67) with t = δ/2, the probability that there is an index i for which A(μ_i) deviates by more than δ/2 is at most 2N exp(−m/c_κ(δ/2)). Hence, with probability at least 1 − ζ, we have for any μ ∈ S, with i an index chosen so that d_F(μ, μ_i) ≤ δ/2, the desired uniform bound. This implies (48). Since ∥S_κ∥_ΔL < ∞, the LRIP (20) holds with respect to D.

Proof of Lemma 3.2 and Lemma 3.4. For the second claim of Lemma 3.2, by (30), since d^p is a metric we have for any x ∈ Z

d^p(x, P_h x) ≤ d^p(x, P_h P_{h₀} x) ≤ d^p(x, P_{h₀} x) + d^p(P_{h₀} x, P_h P_{h₀} x).
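Bernstein's inequality, in the standard form P(|N⁻¹ Σ X_i| ≥ t) ≤ 2 exp(−Nt²/(2(σ² + Mt/3))), can be sanity-checked by Monte Carlo on a toy case (the particular distribution and parameters below are our arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials, t = 200, 20_000, 0.15

# Centered bounded variables: uniform on [-1, 1], so M = 1 and sigma^2 = 1/3
M, sigma2 = 1.0, 1.0 / 3.0
means = rng.uniform(-1, 1, size=(trials, N)).mean(axis=1)

empirical = np.mean(np.abs(means) >= t)                   # observed tail mass
bernstein = 2 * np.exp(-N * t**2 / (2 * (sigma2 + M * t / 3)))  # Bernstein bound
assert empirical <= bernstein
print(empirical, bernstein)
```

The observed tail probability sits well below the bound, as expected for a non-asymptotic concentration inequality.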
As this holds for any h, and as equality is reached for h = h₀, we get D_{h₀}(π ∥ P_{h₀} π) = 0. In particular, when p ∈ (0, 1], using (19) this implies a corresponding bound for any π, π′ in the above considered setting. It is well known that the 1-Wasserstein distance between two distributions can be equivalently defined in terms of optimal transport (the so-called "earth mover's distance"), but also in the dual form

∥π − π′∥_{Wasserstein1(d)} = sup_{f : Lip(f) ≤ 1} |E_{X∼π} f(X) − E_{X∼π′} f(X)|

as soon as (Z, d) is a separable metric space, see, e.g., [Dudley, 2002, Theorem 11.8.2]. By the transport characterization of the Wasserstein distance, considering the transport plan that sends x to P_h x, where h ∈ H′, we conclude

∥π − P_h π∥_{Wasserstein1(d)} ≤ E_{X∼π} d(X, P_h(X)) ≤ [E_{X∼π} d^p(X, P_h(X))]^{1/p}

by Jensen's inequality (since p ≥ 1 here), yielding the claim (34).
For the final claim, we have the desired expression with f_u(x) := ⟨Φ(x), u⟩. Moreover, for ∥u∥₂ ≤ 1 and any x, x′, since Φ is assumed L-Lipschitz, we obtain (69). When p ≤ 1 and the space Z has d-diameter bounded by B, as d(X, P_h X) = d^{1−p}(X, P_h X) · d^p(X, P_h X) ≤ B^{1−p} d^p(X, P_h X), we obtain (36) as follows:

∥π − P_h π∥_{Wasserstein1(d)} ≤ E_{X∼π} d(X, P_h(X)) ≤ B^{1−p} E_{X∼π} d^p(X, P_h(X)) = B^{1−p} R(π, h).
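The chain of inequalities above can be observed numerically on a toy one-dimensional clustering example (the data, the centers and p = 1/2 are our arbitrary choices; in 1-D the Wasserstein-1 distance between two equal-size empirical measures is the mean absolute difference of the sorted samples):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B, p = 5000, 4.0, 0.5                  # diameter B of Z = [0, B], exponent p <= 1
x = rng.uniform(0, B, size=n)             # toy 1-D collection
centers = np.array([1.0, 3.0])            # a hypothesis h: two cluster centers
Phx = centers[np.abs(x[:, None] - centers).argmin(axis=1)]   # P_h x

# 1-D Wasserstein-1 between the empirical measures pi and P_h pi
w1 = np.mean(np.abs(np.sort(x) - np.sort(Phx)))
transport = np.mean(np.abs(x - Phx))      # cost of the plan sending x to P_h x
risk = np.mean(np.abs(x - Phx) ** p)      # R(pi, h) for the d^p loss

# W1 <= transport cost <= B^{1-p} * R(pi, h), as in (36)
assert w1 <= transport <= B ** (1 - p) * risk + 1e-12
```

The transport plan x ↦ P_h x is generally suboptimal, hence the first inequality is typically strict.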

E Proof of Theorem 4.1 on Compressive PCA
For compressive PCA, we recall that k is the number of PCA components we want to estimate. The hypothesis class H is the set of linear subspaces of dimension k of the input space R^d, which is in one-to-one correspondence with the set P_k of orthoprojectors P of rank k. In the remainder of this section we therefore directly use P_k as the hypothesis class, for notational convenience. We recall that for r ≥ k, we consider the model set S_r consisting of probability distributions whose second moment matrix has rank at most r.
Observe that for any P ∈ P_k the risk can be expressed in terms of the second moment matrix Σ_π; here P_ℓ (1 ≤ ℓ ≤ d) denotes an orthoprojector onto the first ℓ eigenvectors of Σ_π, and λ_i(M) denotes the eigenvalues, with multiplicity and ordered in a nonincreasing sequence, of a matrix M.
We follow the improved risk analysis of Section 2.5. In the above setting, the excess risk divergence (18) with respect to an arbitrary reference hypothesis P₀ ∈ P_k is given in closed form.

Lower RIP. We start with the following bound on the excess risk divergence, holding without restriction for any distributions π, π′ with existing second moments and any P₀ ∈ P_k, stated as (72).

Proof of (72). By the so-called Ky Fan theorem [Fan, 1949], for a symmetric matrix M ∈ R^{d×d} the sum of its k largest eigenvalues equals the maximum of tr(PM) over rank-k orthoprojectors P ∈ P_k. As a result, we obtain (72). Since the right-hand side of (72) does not depend on P₀, we get from (19) in particular a bound valid for any π, π′ with finite second moments. Hence, since M is a linear operator having a RIP (37) on matrices of rank at most 2r, M induces (in the way described in Section 4) a sketching operator A : π ↦ A(π) := M(Σ_π) that has the lower RIP described in (20). Formally, ∆[y] can then be taken as any distribution having second moment matrix Σ. This last step can naturally be skipped since, whatever the choice of such a representative distribution, the associated estimated hypothesis ĥ is directly given by (39).
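The Ky Fan identity invoked above, Σ_{i=1}^k λ_i(M) = max_{P ∈ P_k} tr(PM) with the maximum attained by the projector onto the top-k eigenvectors, is easily verified numerically (toy dimensions of our choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 8, 3

# Random symmetric matrix M
A = rng.normal(size=(d, d))
M = (A + A.T) / 2
eigvals, eigvecs = np.linalg.eigh(M)     # eigenvalues in ascending order

# Ky Fan: sum of the k largest eigenvalues = max of tr(P M) over rank-k
# orthoprojectors, attained by projecting onto the top-k eigenvectors
top_k_sum = eigvals[-k:].sum()
U = eigvecs[:, -k:]                      # top-k eigenvectors
P = U @ U.T                              # optimal orthoprojector
assert np.isclose(np.trace(P @ M), top_k_sum)

# Any other rank-k orthoprojector does no better
V = np.linalg.qr(rng.normal(size=(d, k)))[0]
assert np.trace(V @ V.T @ M) <= top_k_sum + 1e-10
```

This is also the mechanism behind (39): the hypothesis estimated from the recovered second-moment matrix is simply its top-k eigenspace.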
By Theorem 2.5, this decoder is instance optimal, yielding (24) with ν = η = ε = ε′ = 0, i.e., the excess risk of the hypothesis ĥ output by the procedure of Section 4, when the true data distribution is π, is controlled by the corresponding right-hand side, with the bias term defined in (25).
Control of the bias term. The next lemma improves over Lemma 3.2 in the special case of PCA: since Σ_π is a nonnegative matrix and r ≥ k, (71) yields the announced control.

Definition 5.6 (Covering number). The covering number N(d(·, ·), S, δ) of a set S with respect to a (pseudo)metric d(·, ·) is the minimum number of closed balls of radius δ with respect to d(·, ·), with centers in S, needed to cover S.
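For a small finite set, the covering number of Definition 5.6 can be computed exactly by brute force over candidate center subsets (a toy illustration with four points on the line, our own choice of data):

```python
import numpy as np
from itertools import combinations

# Toy finite set S in (R, |.|) and its pairwise distance matrix
S = np.array([[0.0], [0.3], [0.6], [1.0]])
D = np.abs(S - S.T)

def covering_number(D, delta):
    """Exact covering number: smallest k such that some k points of S,
    used as centers of closed balls of radius delta, cover all of S."""
    n = len(D)
    for k in range(1, n + 1):
        for centers in combinations(range(n), k):
            # every point must lie within delta of its nearest chosen center
            if np.all(D[list(centers)].min(axis=0) <= delta):
                return k
    return n

print(covering_number(D, 0.3))   # two balls suffice, e.g. centered at 0.3 and 1.0
```

For the continuous secant sets of Section 5, such exact computation is of course infeasible; there, covering numbers are bounded analytically via the dimension of the secant set.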