Exponential inequalities for dependent V-statistics via random Fourier features

We establish exponential inequalities for a class of V-statistics under strong mixing conditions. Our theory is developed via a novel kernel expansion based on random Fourier features together with a probabilistic method. This type of expansion is new and useful for handling many notoriously difficult classes of kernels.


Introduction
Consider the following V-statistic of order m generated by the symmetric kernel f,
V_n(f) := Σ_{i_1,...,i_m=1}^{n} f(X_{i_1}, ..., X_{i_m}), (1.1)
where {X_i}_{i=1}^n is a stationary sequence with marginal measure P on the d-dimensional real space R^d. The purpose of this paper is to establish exponential-type tail bounds for (1.1) when {X_i}_{i=1}^n are weakly dependent.
In (1.1), if the summation is instead taken over m-tuples (i_1, ..., i_m) of distinct indices, the resulting statistic is a U-statistic. In many applications, the techniques for analyzing U- and V-statistics are the same. Non-asymptotic tail bounds and limit theorems for U- and V-statistics in the i.i.d. case have also been extensively studied [17,2,15,1].
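To make the index conventions concrete, the following self-contained Python sketch (our own illustration, not from the paper; the kernel, data, and function names are chosen for this example) computes the brute-force V-statistic (1.1) of order m and contrasts it with the U-statistic over distinct indices:

```python
import numpy as np

def v_statistic(data, f, m):
    """Brute-force V-statistic of order m: sum of f over ALL m-tuples
    of indices (repetitions allowed), matching the un-normalized sum in (1.1)."""
    n = len(data)
    idx = np.stack(np.meshgrid(*[np.arange(n)] * m, indexing="ij"), axis=-1)
    total = 0.0
    for tup in idx.reshape(-1, m):
        total += f(*data[tup])
    return total

def u_statistic(data, f, m):
    """U-statistic: the same sum restricted to tuples of distinct indices."""
    n = len(data)
    idx = np.stack(np.meshgrid(*[np.arange(n)] * m, indexing="ij"), axis=-1)
    total = 0.0
    for tup in idx.reshape(-1, m):
        if len(set(tup.tolist())) == m:
            total += f(*data[tup])
    return total

rng = np.random.default_rng(0)
x = rng.normal(size=6)
gauss = lambda a, b: np.exp(-(a - b) ** 2)  # a toy shift-invariant symmetric kernel
v2 = v_statistic(x, gauss, 2)
u2 = u_statistic(x, gauss, 2)
# For m = 2, the V-statistic adds the n "diagonal" terms f(X_i, X_i) = 1 here.
print(abs(v2 - (u2 + len(x))) < 1e-10)
```

This also illustrates why V- and U-statistics are typically analyzed with the same techniques: they differ only by lower-order diagonal terms.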
Exceptions include [5] and [16], which proved Hoeffding-type inequalities for U- and V-statistics under φ-mixing conditions. There, the results either rely on assumptions that are difficult to verify, or are limited to nondegenerate kernels.
In this paper, we show that for a strongly mixing stationary sequence, exponential inequalities hold for a large class of V- and U-statistics. The main theorem is presented in Section 2. We then illustrate the usefulness of our theory with examples and some further extensions in Section 3. A detailed proof of the main theorem is given in Section 4, with the remaining proofs given in Section 5.
Notation used in the rest of the paper is as follows. L_1(R^d) denotes the class of integrable functions on R^d, and for each p ≥ 1, ‖f‖_{L_p} := (∫ |f(x)|^p dx)^{1/p}. For a real vector u ∈ R^d, ‖u‖ denotes its Euclidean norm. For two real numbers a, b, a ∨ b := max{a, b}.
For each 1 ≤ p ≤ m, let f_p denote the p-th degenerate component of f in its Hoeffding decomposition (2.1). For each 1 ≤ p ≤ m and 1 ≤ k ≤ n, we denote by V_k(f_p) := Σ_{1≤i_1,...,i_p≤k} f_p(X_{i_1}, ..., X_{i_p}) the V-statistic generated by f_p and the data {X_i}_{i=1}^k.
For a real function g ∈ L_1(R^d), its Fourier transform is defined as ĝ(u) := ∫_{R^d} g(x) exp{−2πi u^⊤x} dx, where dx := dx_1 ... dx_d.
Theorem 2.1. Suppose that {X_1, ..., X_n} is part of a stationary sequence {X_i}_{i∈Z} that is geometrically α-mixing with coefficient
α(i) ≤ γ_1 exp(−γ_2 i) for all i ≥ 1, (2.2)
where γ_1, γ_2 are two positive absolute constants. Suppose f ∈ L_1(R^{md}) is continuous, and its Fourier transform f̂ satisfies
∫_{R^{md}} |f̂(u)| (1 ∨ ‖u‖)^q du < ∞ (2.3)
for some q ≥ 1. Then, there exists a positive constant C = C(m, γ_1, γ_2) such that, for each 1 ≤ p ≤ m and any x > 0, the tail bound (2.4) holds with constants A_p and M_p defined in (2.5).

We remark that a maximal-type tail estimate for V_n(f) in (1.1) can be obtained in a straightforward manner by assembling the tail estimate in (2.4) for each r ≤ p ≤ m, where r is the degeneracy level of f. Indeed, for any 1 ≤ k ≤ n and 1 ≤ p ≤ m, we have by the symmetry of f,
Σ_{1≤i_1,...,i_m≤k} Σ_{1≤j_1<...<j_p≤m} f_p(X_{i_{j_1}}, ..., X_{i_{j_p}}) = (m choose p) k^{m−p} V_k(f_p).
This entails the corresponding maximal-type bound for max_{1≤k≤n} |V_k(f)|. We now provide a proof sketch of Theorem 2.1 with a focus on the technical novelties.
One key step in our proof is to find a uniform approximation f̃ = f̃(·; t, M) of the original kernel f at any prescribed accuracy t such that (i) ‖f̃ − f‖ ≤ t uniformly over a large enough compact set [−M, M]^{md}, and (ii) f̃ admits the following tensor expansion:
f̃(x_1, ..., x_m) = Σ_{j_1,...,j_m=1}^{K} f̃_{j_1,...,j_m} e_{j_1}(x_1) ⋯ e_{j_m}(x_m). (2.6)
Here, K is a positive integer that depends on both the approximation error t and the range of approximation M, {f̃_{j_1,...,j_m}}_{j_1,...,j_m=1}^K is a real sequence, and {e_j(·)}_{j=1}^K is a set of uniformly bounded real bases. Once such an f̃ is found, a truncation argument yields the proximity between {f_p}_{p=1}^m and {f̃_p(·; t, M)}_{p=1}^m, the latter being the degenerate components of f̃(·; t, M) in its Hoeffding decomposition. Then, using each f̃_p(·; t, M) as a proxy, standard moment estimates with the aid of exponential inequalities for partial-sum processes (cf. Corollary 24 in [21]) render a tail bound for each max_{1≤k≤n} |V_k(f_p)|.

The problem then boils down to finding such an f̃ with the tensor structure (2.6). One main difficulty in this step is to construct expansion bases {e_j(·)}_{j=1}^K that are uniformly bounded. Many classical approaches in multivariate function approximation are unable to provide a satisfactory answer to this problem. For example, uniform polynomial approximation via the Stone-Weierstrass theorem performs very poorly, since high polynomial orders lead to a large upper bound on the bases. The use of Lipschitz-continuous scaling and wavelet functions, as exploited in [19], is also inappropriate for the same reason.
Our solution is based on a probabilistic method and, in particular, on realizing that the tensor decomposition (2.6) is intrinsically connected to the idea of randomized feature mapping [22] in the kernel learning literature. More specifically, when f ∈ L_1(R^{md}) is continuous and f̂ ∈ L_1(R^{md}), the Fourier inversion formula implies that
f(x_1, ..., x_m) = ∫_{R^{md}} f̂(u) exp{2πi (u_1^⊤x_1 + ... + u_m^⊤x_m)} du,
where the right-hand side can be seen as the expectation of a Fourier basis with random frequency, drawn according to the (normalized) signed measure of f̂. Due to the boundedness of the Fourier bases, Hoeffding's inequality guarantees an exponentially fast rate at which a sample-mean statistic s_K of Fourier bases approximates f at each fixed point x ∈ R^{md}. The elements exp{2πi(u_{j,1}^⊤x_1 + ... + u_{j,m}^⊤x_m)} in s_K(x_1, ..., x_m) naturally decompose into bounded basis functions of the inputs x_j. An entropy-type argument is then used to prove the existence of a satisfactory set of bases such that the approximation holds uniformly over any compact set. The detailed proof is given in Section 4.
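The randomized feature mapping idea can be illustrated with a short, self-contained Python sketch (our own illustration, not the paper's construction; it uses the standard random Fourier features of [22] for the Gaussian kernel, whose Fourier transform is already a probability density, so no sign splitting is needed, and the feature map name `phi` is ours):

```python
import numpy as np

rng = np.random.default_rng(1)
d, D = 2, 20000  # input dimension, number of random features

# For the Gaussian kernel k(x - y) = exp(-||x - y||^2 / 2), Bochner's theorem
# gives k(delta) = E[cos(w^T delta)] with w ~ N(0, I_d): the random frequency
# follows the (probability-density) Fourier transform of the kernel.
w = rng.normal(size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

def phi(x):
    """Bounded random cosine features: phi(x)^T phi(y) approximates k(x - y)."""
    return np.sqrt(2.0 / D) * np.cos(w @ x + b)

x = np.array([0.3, -0.5])
y = np.array([-0.2, 0.8])
exact = np.exp(-np.sum((x - y) ** 2) / 2)
approx = phi(x) @ phi(y)
print(abs(exact - approx))  # Monte Carlo error, O(D^{-1/2}) by Hoeffding's inequality
```

Note that phi(x)^⊤ phi(y) is exactly a tensor expansion of the kernel in uniformly bounded bases, mirroring (2.6) for m = 2.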

Examples and extensions
Motivated by their wide applications in statistics and machine learning, we put special focus on shift-invariant symmetric kernels in the case m = 2, with f(x, y) = f_0(x − y) for some f_0: R^d → R. We start with a corollary of Theorem 2.1 for such kernels. Moreover, the same bound holds with the above A_p and M_p if f_0 only satisfies (3.1) for some 0 < q < 1 but is both positive definite (PD) and Lipschitz continuous.
We now list several commonly-used kernels covered by Theorem 2.1 and the previous two corollaries.
This kernel has fractional moments and thus satisfies (3.1) for any 0 < q < 1. Since f_0 is both PD and Lipschitz, it satisfies the conditions in Corollary 3.2.
We then discuss extensions of Theorem 2.1. The smoothness assumption (2.3) in Theorem 2.1 can be further relaxed by employing the standard smoothing technique through mollifiers. More precisely, we resort to an intermediate kernel f_h between f and f̃, constructed by convolving f with a Gaussian mollifier with scale parameter h. The parameter h controls the trade-off between approximation error and smoothness: a small h leads to a finer approximation of f by f_h, but makes f_h less smooth and thus renders a larger constant in the tail bound. Theorem 2.1 is then applied to this intermediate kernel f_h to obtain the tail bound.
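As a purely illustrative sketch of this trade-off (our own one-dimensional example, not from the paper), the following Python snippet convolves the non-smooth Laplace kernel f_0(δ) = exp(−|δ|) with a Gaussian mollifier and checks that a smaller scale h gives a finer sup-norm approximation of f_0:

```python
import numpy as np

def mollify(f0_vals, grid, h):
    """Convolve sampled f0 with a Gaussian mollifier of scale h (Riemann sum)."""
    dx = grid[1] - grid[0]
    out = np.empty_like(f0_vals)
    for i, t in enumerate(grid):
        kernel = np.exp(-(t - grid) ** 2 / (2 * h ** 2)) / (h * np.sqrt(2 * np.pi))
        out[i] = np.sum(kernel * f0_vals) * dx
    return out

grid = np.linspace(-8.0, 8.0, 4001)
f0 = np.exp(-np.abs(grid))  # Laplace kernel: Lipschitz, but not smooth at 0
err = {h: np.max(np.abs(mollify(f0, grid, h) - f0)) for h in (0.5, 0.1)}
print(err[0.1] < err[0.5])  # smaller h gives a finer sup-norm approximation
```

The other side of the trade-off (worse constants in the tail bound as h shrinks) comes from the slower Gaussian decay of the Fourier transform of f_h, which the snippet does not display.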

Proof of Theorem 2.1
We will use the following additional notation. For any real-valued function f on R^d, ∇_x f denotes the gradient of f. For a set A, |A| denotes its cardinality. For a subset M of R^d, diam(M) denotes its diameter, i.e., diam(M) := sup_{x,y∈M} ‖x − y‖. For a function f, we write f(·; θ) to emphasize its dependence on a parameter θ. For a measurable set A, 1{A} denotes the indicator of the set A. For a positive integer N, [N] denotes the set {1, ..., N}.
As described in the proof sketch after Theorem 2.1, we split the main part of the proof into the following lemmas. The first lemma finds a symmetric kernel f̃ with the tensor decomposition (2.6) that approximates f uniformly over a prescribed set [−M, M]^{md} at accuracy t.
for some constants F, B that do not depend on M and t; in particular, one can take F = 2^m ‖f̂‖_{L_1} and B = 1.
Proof. This proof adapts that of Claim 1 in [22]. Throughout the proof, x_1, ..., x_m and u_1, ..., u_m are real vectors in R^d, dx = dx_1 ... dx_d, and x, u are real vectors in R^{md}. Clearly, Condition (2.3) with some q ≥ 1 implies that f̂ ∈ L_1(R^{md}). Since f is continuous, the Fourier inversion formula applies (see, for example, Chapter 6 of [24]); note that without the continuity of f, the inversion only holds almost everywhere with respect to the Lebesgue measure.

Let f̂ = ĝ + i ĥ for real-valued functions ĝ, ĥ. Since f is real-valued, we have f = I − II, where I collects the cosine part (driven by ĝ) and II the sine part (driven by ĥ) of the inversion formula. We now approximate I and II separately. I can be further written as I = I^+ − I^−, splitting according to the sign of ĝ, and note that A_g^+ and A_g^− are both nonnegative and satisfy A_g^+ + A_g^− = ‖ĝ‖_{L_1} < ∞ and A_g^+ − A_g^− = f(0), where we use the fact that ĝ ∈ L_1(R^{md}) since f̂ ∈ L_1(R^{md}). Then, I^+ and I^− can be represented as expectations of cosine bases, where (u_1, ..., u_m) follows the distribution ĝ 1{ĝ > 0}/A_g^+ and (v_1, ..., v_m) follows the distribution −ĝ 1{ĝ < 0}/A_g^−. Assume without loss of generality that A_g^+ > 0 and A_g^− > 0.

We now focus on I^+. For any compact subset M ⊂ R^{md}, there exist T Euclidean balls with radius r that cover M, where T ≤ {c diam(M)/r}^{md} with c = 3md/π.
Denote by {d_1, ..., d_T} the centers of these balls in R^{md}. Now choose an i.i.d. sample {(u_{i1}, ..., u_{im})}_{i=1}^{D_1} from the distribution ĝ 1{ĝ > 0}/A_g^+, with the sample size D_1 to be specified later. Then, for each center d = (d_1, ..., d_m) and any t > 0, a pointwise concentration bound holds by Hoeffding's inequality, where in the first line we use the finiteness of ∫_{R^{md}} |f̂(u)| ‖u‖ du (guaranteed by Condition (2.3)) and the dominated convergence theorem to exchange the derivative with the expectation. Moreover, the expected gradient is controlled, where we use the finiteness of E(‖u_1‖^q), since ∫_{R^{md}} |f̂(u)| ‖u‖^q du < ∞, and the convexity of the function x^q for q ≥ 1. Therefore, by Markov's inequality, the Lipschitz constant of the approximation error is controlled with high probability.

By the triangle inequality, the event sup_{x∈M} |s_{D_1}(x) − k_g^+(x)| ≤ t/4 has probability no smaller than that of the intersection of the net event and the Lipschitz event. Letting the right-hand side of the resulting bound take the form κ_1 r^{−md} + κ_2 r^q, and choosing r = (κ_1/κ_2)^{1/(q+md)}, we conclude that there exists {u_i}_{i=1}^{D_1} ⊂ R^{md} such that the approximation holds uniformly over M, when D_1 is chosen large enough.
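The net-plus-Lipschitz argument above can be sketched numerically (a one-dimensional toy illustration of our own, with assumed constants rather than the proof's exact quantities): the Monte Carlo cosine average is controlled on an r-net and then extended to the whole interval via a Lipschitz bound.

```python
import numpy as np

rng = np.random.default_rng(2)
D, M, r = 5000, 2.0, 0.05

# 1-D target: k(x) = E[cos(2*pi*U*x)] with U ~ N(0,1), i.e. k(x) = exp(-2*pi^2*x^2).
u = rng.normal(size=D)
s = lambda x: np.mean(np.cos(2 * np.pi * u * x))   # Monte Carlo cosine average
k = lambda x: np.exp(-2 * np.pi ** 2 * x ** 2)

# Control the error on an r-net of [-M, M], then extend to the whole interval:
# |s'(x)| <= 2*pi*mean|u_i| and |k'(x)| <= 2*pi*sqrt(2/pi), so s - k is
# Lipschitz with constant at most `lip` below.
net = np.arange(-M, M + r, r)
net_err = max(abs(s(t) - k(t)) for t in net)
lip = 2 * np.pi * (np.mean(np.abs(u)) + np.sqrt(2 / np.pi))
fine = np.linspace(-M, M, 2001)
sup_err = max(abs(s(t) - k(t)) for t in fine)
print(sup_err <= net_err + lip * r)  # uniform error <= net error + Lipschitz slack
```

In the proof, the same two ingredients (Hoeffding's inequality on the net centers and a Markov bound on the Lipschitz constant) are balanced by the choice r = (κ_1/κ_2)^{1/(q+md)}.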

Similarly, it can be shown that there exists {v_i}_{i=1}^{D_2} ⊂ R^{md} achieving the analogous uniform approximation of I^−, when D_2 is chosen to exceed a sufficiently large constant C_2 = C_2(t, f, M). Repeating this procedure for the approximation of II, we can find s_{D_3} and s_{D_4}, which are sample means of sine functions, such that the corresponding approximations hold uniformly over all x ∈ M, when the sample sizes D_3 and D_4 are respectively chosen to exceed sufficiently large constants C_3, C_4 that depend on t, f, M. Putting together the pieces, we obtain that the overall approximation error is smaller than t when D_1-D_4 are chosen as above.

The second lemma uses a truncation argument to establish the proximity between {f_p}_{p=1}^m and the degenerate components of f̃; note in particular that f ∈ L_1(R^{md}) under the product measure P^m.

The third lemma derives a maximal-type tail bound for each max_{1≤k≤n} |V_k(f_p)| when f admits the tensor decomposition (2.6): there exists a positive constant C = C(m, γ_1, γ_2) such that the stated bound holds for any 1 ≤ p ≤ m and any x ≥ 0.

Proof. Throughout the proof, let the C_i's be positive constants that depend only on m, γ_1, γ_2, and we use the shorthand f_{j_{a:b}} for f_{j_a,...,j_b} for positive integers a < b. We drop the dependence of A_{p,n} and M_{p,n} on n for notational simplicity.
Fix 1 ≤ p ≤ m; we now derive the tail bound for max_{1≤k≤n} |V_k(f_p)|. For the set of bases {e_j(·)}_{j=1}^K in the expansion of f, define ē_j := e_j − E{e_j(X_1)} for j ∈ [K]. Since f is symmetric, for any (x_1, ..., x_m) ∈ R^{md}, f(x_1, ..., x_m) = f(π(x_1), ..., π(x_m)) for any permutation π of {x_1, ..., x_m}. By the definition of {f_p}_{p=1}^m in (2.1), one can readily check the corresponding expansion of each f_p in the centered bases. Define, for each j ∈ [K] and k ∈ [n], the partial sums of {ē_j(X_i)}_{i=1}^k; since α-mixing is preserved by measurable transformations, the sequence {ē_j(X_i)}_{i∈Z} is also geometrically α-mixing.

We now control each even-order moment of max_{1≤k≤n} |V_k(f_p)|. Integrating the tail estimate in Corollary 24 of [21] and using Theorem 2.3 in [6] yields, for any positive integer N, a moment bound, by choosing C_4 in c to be sufficiently large. Then, we employ a similar argument as in [5] (cf. Equation (12) therein), where in the second inequality we use the generalized Hölder inequality. By Stirling's approximation formula, √(2π) n^{n+1/2} e^{−n} ≤ n! ≤ e n^{n+1/2} e^{−n}, it holds that (4.4), where in the first inequality we retain only the even moments at the cost of an absolute constant 3. For the first summand in (4.4), we use the relation N!/(2N)! ≤ 2^{−N}/N!; the second summand is bounded directly. Moreover, taking δ = 1 in Theorem 3 of [13], we obtain the remaining estimate. Putting together the pieces completes the proof.
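The Stirling bounds and the factorial relation invoked above are easy to sanity-check numerically (a quick illustration of our own, not part of the proof):

```python
import math

# Stirling bounds used in (4.4): sqrt(2*pi)*n^(n+1/2)*e^(-n) <= n! <= e*n^(n+1/2)*e^(-n),
# together with the relation N!/(2N)! <= 2^(-N)/N!, i.e. (N!)^2 * 2^N <= (2N)!.
for n in range(1, 15):
    lower = math.sqrt(2 * math.pi) * n ** (n + 0.5) * math.exp(-n)
    upper = math.e * n ** (n + 0.5) * math.exp(-n)
    assert lower <= math.factorial(n) <= upper
    assert math.factorial(n) ** 2 * 2 ** n <= math.factorial(2 * n)
print("bounds hold for n = 1, ..., 14")
```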
We now use Lemmas 4.1-4.3 to complete the proof of Theorem 2.1.
Proof of Theorem 2.1. Fix 1 ≤ p ≤ m. Fix some t > 0 and M > 0, and define the event that all the data fall in [−M, M]^d. Again applying the lemmas of Section 4, we obtain a bound
where A_p and M_p are defined in (2.5). Now, note that the first summand on the right-hand side does not depend on M or t. Accordingly, by first choosing a large enough M that depends only on x, n, F (possible since the measure P considered in this paper is always tight), we can make the second term smaller than an arbitrarily small proportion of the first term. Lastly, choosing t = x and adjusting the constant finishes the proof.

Proofs of other results
We will only prove Corollaries 3.1-3.3. The proof of Corollary 3.4 is similar to that of Corollary 3.1.
Here s_D(x) := Σ_{i=1}^D cos(2π u_i^⊤x)/D (we use the notation s_D instead of s_{D_1}, since in the PD case we only need to approximate the term I^+, as argued in the first part of the corollary). Note that the original (4.2) in the proof of Lemma 4.1 no longer holds, as a mere fractional moment condition does not guarantee the exchange of derivative and expectation in its first step. The first term in the resulting inequality is bounded directly; therefore, Markov's inequality now controls the sup-norm error with high probability. Proceeding as in the proof of Lemma 4.1, writing the right-hand side of the resulting inequality in the form κ_1 r^{−md} + κ_2 r^q and letting r = (κ_1/κ_2)^{1/(q+md)}, we obtain a bound that, for any t > 0, can be made arbitrarily small by choosing a large enough D = D(t). The proof is complete.

Putting together the pieces, it holds that for any t > 0 there exists some h = h(M_f, m, d, t) such that ‖f_h − f‖_∞ ≤ t/2. Since both f and K_h belong to L_1(R^{md}), their Fourier transforms exist. It can be readily checked that K̂_h(u) = exp{−2π²h²‖u‖²}, and thus f̂_h(u) = f̂(u) · K̂_h(u) = f̂(u) exp{−2π²h²‖u‖²}.
Using Young's inequality ‖f * g‖_{L_q} ≤ ‖f‖_{L_q} ‖g‖_{L_1} for any q ≥ 1 and f ∈ L_q(R^{md}), g ∈ L_1(R^{md}), together with the fact that K_h ∈ L_1(R^{md}), it holds that f_h ∈ L_1(R^{md}). Moreover, the remaining conditions can be readily checked. This completes the proof.
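The Fourier transform identity for the Gaussian mollifier used above can be verified numerically in one dimension (an illustrative check of our own, under the convention ĝ(u) = ∫ g(x) exp{−2πi u x} dx used in this paper):

```python
import numpy as np

# 1-D check of K_h^(u) = exp(-2*pi^2*h^2*u^2) for the Gaussian mollifier
# K_h(x) = exp(-x^2/(2*h^2)) / (h*sqrt(2*pi)), with the convention
# g^(u) = integral of g(x)*exp(-2*pi*i*u*x) dx.
h = 0.7
x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
K = np.exp(-x ** 2 / (2 * h ** 2)) / (h * np.sqrt(2 * np.pi))
for u in (0.0, 0.3, 1.0):
    ft = np.sum(K * np.exp(-2j * np.pi * u * x)) * dx  # Riemann-sum Fourier transform
    assert abs(ft - np.exp(-2 * np.pi ** 2 * h ** 2 * u ** 2)) < 1e-5
print("mollifier Fourier transform verified")
```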