APPROXIMATING HIGH-DIMENSIONAL INFINITE-ORDER U-STATISTICS: STATISTICAL AND COMPUTATIONAL GUARANTEES

We study the problem of distributional approximation for high-dimensional non-degenerate U-statistics with random kernels of diverging orders. Infinite-order U-statistics (IOUS) are a useful tool for constructing simultaneous prediction intervals that quantify the uncertainty of ensemble methods such as subbagging and random forests. A major obstacle to using IOUS is their computational intractability when the sample size and/or the order is large. In this article, we derive non-asymptotic Gaussian approximation error bounds for an incomplete version of the IOUS with a random kernel. We also study data-driven inferential methods for the incomplete IOUS via bootstraps and develop their statistical and computational guarantees.


Introduction
Let X_1, …, X_n be independent and identically distributed (i.i.d.) random variables taking values in a measurable space (S, 𝒮) with common distribution P, and let h : S^r → R^d be a symmetric measurable function with respect to the product space S^r equipped with the product σ-field 𝒮^r = 𝒮 ⊗ ⋯ ⊗ 𝒮 (r times). Assume E[|h_j(X_1, …, X_r)|] < ∞ for 1 ≤ j ≤ d, and consider statistical inference on the mean vector θ = (θ_1, …, θ_d)^T = E[h(X_1, …, X_r)]. A natural estimator for θ is the U-statistic with kernel h:

    U_n := |I_{n,r}|^{-1} Σ_{ι ∈ I_{n,r}} h(X_{i_1}, …, X_{i_r}) =: |I_{n,r}|^{-1} Σ_{ι ∈ I_{n,r}} h(X_ι),    (1)

where I_{n,r} := {ι = (i_1, …, i_r) : 1 ≤ i_1 < ⋯ < i_r ≤ n} is the set of all ordered r-tuples of 1, …, n and |·| denotes set cardinality. The positive integer r is called the order (or degree) of the kernel h, or of the U-statistic U_n. We refer to [21] as an excellent monograph on U-statistics.
In the present paper, we are interested in the situation where the order r may be non-negligible relative to the sample size n, i.e., r = r_n → ∞ as n → ∞. U-statistics with diverging orders are called infinite-order U-statistics (IOUS) [14]. IOUS have attracted renewed interest in the recent statistics and machine learning literature in relation to uncertainty quantification for Breiman's bagging [3] and random forests [4]. In such applications, the tree-based prediction rules can be thought of as U-statistics with deterministic and random kernels, respectively, and their order corresponds to the sub-sample size of the training data [23]. Statistically, the sub-sample size r used to build each tree needs to increase with the total sample size n to produce reliable predictions. As a leading example, we consider the construction of simultaneous prediction intervals for a version of random forests discussed in [23].

Example 1.1 (Simultaneous prediction intervals for random forests). Consider a training dataset of size n, {(Y_1, Z_1), …, (Y_n, Z_n)} = {X_1, …, X_n} = X_1^n, where Y_i ∈ 𝒴 is a vector of features and Z_i ∈ R is a response. Let h be a deterministic prediction rule that takes as input a sub-sample {X_{i_1}, …, X_{i_r}} and outputs predictions at d test points (y*_1, …, y*_d) in the feature space 𝒴. Then U_n in (1) is the overall prediction obtained by averaging over all possible sub-samples of size r.
For random forests [4,23], the tree-based prediction rule may be constructed with additional randomness: in building a tree (or multiple trees) based on a sub-sample, the split at each node may only occur on a randomly selected subset of features. Thus, let {W_ι : ι ∈ I_{n,r}} be a collection of i.i.d. random variables taking values in a measurable space (S′, 𝒮′) that are independent of the data X_1^n and that determine the potential splits for each sub-sample. Here, each W_ι captures the random mechanism in building a prediction function based on X_ι = (X_{i_1}, …, X_{i_r}), and the W_ι are assumed to be independent across sub-samples. Further, let H : S^r × S′ → R^d be an 𝒮^r ⊗ 𝒮′-measurable function, representing the random forest algorithm, such that E[H(x_1, …, x_r, W)] = h(x_1, …, x_r). Then the predictions of random forests are given by a d-dimensional U-statistic with random kernel H:

    U_n := |I_{n,r}|^{-1} Σ_{ι ∈ I_{n,r}} H(X_{i_1}, …, X_{i_r}, W_ι) = |I_{n,r}|^{-1} Σ_{ι ∈ I_{n,r}} H(X_ι, W_ι),    (2)

where the random kernel H varies with r.
Compared to U-statistics with fixed orders (i.e., r fixed), the analysis of IOUS brings non-trivial computational and statistical challenges due to the increasing order. First, even for a moderately large value of r, exact computation over all (n choose r) sub-samples is intractable; for diverging r, it is not possible to compute U_n in time polynomial in n. Second, the variance of the Hájek projection (i.e., the first-order term in the Hoeffding decomposition [19]) of U_n − θ tends to zero as r → ∞. To wit, define g : S → R^d by g(x_1) = E[h(x_1, X_2, …, X_r)], and set

    σ²_{g,j} := E[(g_j(X_1) − θ_j)²] for 1 ≤ j ≤ d,    σ²_g := min_{1 ≤ j ≤ d} σ²_{g,j}.

Then the Hájek projection of U_n − θ is given by (r/n) Σ_{i=1}^n (g(X_i) − θ). By the orthogonality of the projections in the Hoeffding decomposition, we have

    r σ²_{g,j} ≤ Var(h_j(X_1, …, X_r)) for 1 ≤ j ≤ d.

Thus the variances of the kernel h and of its associated Hájek projection g have different magnitudes. In particular, if the variance of h_j(X_1, …, X_r) is bounded by a constant C > 0 (which is often assumed for random forests, cf. [23]), then σ²_{g,j} ≤ C/r, which vanishes as r diverges. Thus standard Gaussian approximation results in the literature are no longer applicable in our setting, since they require the componentwise variances to be bounded away from zero to avoid degeneracy, i.e., that there is an absolute constant σ² > 0 such that σ²_g ≥ σ² (cf. [6,10,11]).
In this work, our focus is to derive computationally tractable and statistically valid sub-sampling procedures for inference on θ with a class of high-dimensional random kernels (i.e., large d) of diverging orders (i.e., increasing r). To break the computational bottleneck, we consider an incomplete version of U_n obtained by sampling (possibly much) fewer terms than |I_{n,r}|. In particular, we consider the Bernoulli sampling scheme introduced in [8]. Given a positive integer N, which represents our computational budget, define the sparsity design parameter p_n := N/|I_{n,r}|, and let {Z_ι : ι ∈ I_{n,r}} be a collection of i.i.d. Bernoulli random variables with success probability p_n that are independent of the data X_1^n and of {W_ι : ι ∈ I_{n,r}}. Consider the following incomplete U-statistic (on the data X_1^n) with random kernel and weights:

    U′_{n,N} := \hat N^{-1} Σ_{ι ∈ I_{n,r}} Z_ι H(X_ι, W_ι),    where \hat N := Σ_{ι ∈ I_{n,r}} Z_ι.    (3)

Obviously, U′_{n,N} is an unbiased estimator of θ, and it only involves computing \hat N terms, which on average is N and is much smaller than |I_{n,r}| if p_n ≪ 1.
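To make the Bernoulli sampling scheme concrete, here is a minimal sketch (our own illustration, not the paper's code): each r-tuple in I_{n,r} is kept independently with probability p_n = N/|I_{n,r}|, and the estimator averages the kernel over the kept tuples. The helper name `incomplete_u_statistic` and the linear kernel are ours.

```python
import numpy as np
from itertools import combinations

def incomplete_u_statistic(X, kernel, r, N, rng):
    """Bernoulli-sampled incomplete U-statistic (illustrative sketch)."""
    n = len(X)
    tuples = list(combinations(range(n), r))  # I_{n,r}; enumerable only for small n
    p_n = min(N / len(tuples), 1.0)           # sparsity design parameter
    Z = rng.random(len(tuples)) < p_n         # i.i.d. Bernoulli(p_n) indicators
    kept = [kernel(X[list(t)]) for t, z in zip(tuples, Z) if z]
    # normalize by the realized number of kept tuples, \hat N = sum(Z)
    return np.mean(kept, axis=0) if kept else None

rng = np.random.default_rng(0)
X = rng.normal(size=20)
linear_kernel = lambda x: x.mean()            # h(x_1,...,x_r) = (x_1+...+x_r)/r
est = incomplete_u_statistic(X, linear_kernel, r=3, N=200, rng=rng)
```

For the linear kernel the complete U-statistic equals the sample mean, so the incomplete version should land close to `X.mean()` while evaluating only about N of the |I_{n,r}| = 1140 tuples.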
When the kernel h is both deterministic and of fixed order, finite-sample bounds for the Gaussian and bootstrap approximations of U′_{n,N} − θ (after suitable normalization) are established in [8]. Roughly speaking, the error-bound analysis in [8] has two major steps: (i) establish the Gaussian approximation to the Hájek projection, and (ii) bound the maximum norm of all higher-order degenerate terms. As discussed above, the first-order Hájek projection in the Hoeffding decomposition is asymptotically vanishing for IOUS, and we must control the moments of an increasing number of degenerate terms, which makes the analysis of the incomplete IOUS with random kernels substantially more subtle.
In Section 2, we derive non-asymptotic Gaussian approximation error bounds for the distribution of the incomplete IOUS U′_{n,N} with random kernels subject to sub-exponential moment conditions. Specifically, our rates of convergence for the Gaussian approximation of U′_{n,N} depend explicitly on all parameters (n, N, d, r, σ²_g, D_n), where D_n is an upper bound on the ψ_1-norms of the random kernels (for precise statements, see conditions (C3), (C4), and (C3') ahead). In particular, asymptotic validity of the Gaussian approximation can be achieved if σ_g^{-2} r² D_n² log⁷(dn) = o(n ∧ N). The order of σ_g^{-2} is application specific. As we shall verify in Section 4, under certain regularity conditions,

    σ_g^{-2} = O(r²).    (4)

It is worth noting that (4) is sharp in the sense that for the linear kernel h(x_1, …, x_r) = (x_1 + ⋯ + x_r)/r, we have σ_g^{-2} ≍ r² if c ≤ Var(X_{1j}) ≤ C. If further D_n = O(1), log(d) = O(log(n)) and n = O(N) (i.e., the computational complexity is at least linear in the sample size), then the order of U′_{n,N} is allowed to increase at the rate r = o(n^{1/4−ε}) for any ε ∈ (0, 1/4). On the other hand, the dimension may grow exponentially fast in the sample size (i.e., d = O(e^{n^c}) for some constant c ∈ (0, 1/7)) while maintaining the asymptotic validity of the Gaussian approximation, with r still allowed to increase at a polynomial rate in n.
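A quick Monte Carlo check of this sharpness claim for the linear kernel (our illustration): with mean-zero X, the Hájek projection is g(x) = x/r, so σ²_g = Var(X_1)/r², i.e., σ_g^{-2} grows exactly like r².

```python
import numpy as np

rng = np.random.default_rng(1)
for r in (4, 16):
    X1 = rng.normal(size=200_000)   # Var(X_1) = 1
    g = X1 / r                      # g(x) = E[h(x, X_2, ..., X_r)] = x/r for mean-zero X
    sigma2_g = g.var()              # empirical projection variance
    # sigma2_g ~ 1/r^2, so sigma_g^{-2} scales like r^2
    assert abs(sigma2_g * r**2 - 1.0) < 0.05
```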
The proof of our Gaussian approximation results for IOUS builds upon a number of recently developed technical tools, such as Gaussian approximation results for sums of independent random vectors and for U-statistics of fixed order [6,7,10,11], the anti-concentration inequality for Gaussian maxima [9], and the iterative conditioning argument for high-dimensional incomplete U-statistics (with fixed kernel and order) [8]. However, there are three technical innovations in our proof to accommodate the issues of diverging orders and randomness of the kernel. First, we use an iterative renormalization of each coordinate of g, and also of H, by its variance. This simple trick turns out to be the crux for avoiding the variance lower-bound assumption required for Gaussian approximation in the literature [8,10]. Second, we derive an order-explicit maximal inequality for the expected supremum of the remainder of the Hájek projection of the IOUS (cf. Section 5). This maximal inequality is new in the literature, and our main tools are a symmetrization inequality of [27] and the Bonami inequality [13, Theorem 3.2.2] for Rademacher chaos, both with explicit dependence on r. Third, we develop new tail probability inequalities for U-statistics with random kernels by leveraging the independence between {W_ι : ι ∈ I_{n,r}} and the data X_1^n.

In Section 3, we derive computationally tractable and fully data-driven inferential methods for θ based on the incomplete IOUS when the sample size n, the dimension d, and the order r are all large. We consider a multiplier bootstrap procedure consisting of two partial bootstraps that are conditionally independent given X_1^n and {W_ι, Z_ι : ι ∈ I_{n,r}}: one estimates the covariance matrix of the randomized kernel, and the other estimates that of the Hájek projection.
The latter is usually computationally demanding, and we develop a divide-and-conquer algorithm that keeps the overall computational cost of our multiplier bootstrap procedure at most O(n²d + B(N + n)d), where B denotes the number of bootstrap iterations. Thus the computational cost of the bootstrap approximation to the sampling distribution of the incomplete IOUS can be made independent of the order r, even though r diverges.
In Section 4, we discuss the key non-degeneracy condition (4) underlying the validity of the Gaussian and bootstrap approximations. We provide a general embedding scheme under which a Cramér–Rao type lower bound can be established for the minimum projection variance σ²_g. Specifically, the lower bound on r²σ²_g involves only the sensitivity of E[h(X_1, …, X_r)] under perturbation and the Fisher information of the embedded family, which in some cases remain constant as r diverges. In non-parametric regression, there is a natural embedding of the response variable into a location family such that the sensitivity and the Fisher information can be explicitly computed.

Connections to the literature
For univariate U-statistics (d = 1), the asymptotic distributions in the non-degenerate case are derived in the seminal paper [19]. [14] introduced the notion of "infinite-order U-statistics" (IOUS) with diverging orders and established the central limit theorem for U_n when d = 1. For univariate IOUS, asymptotic normality results can be found in [2, Chapter 4.6], and Berry–Esseen type bounds were established by [16,30,31]. Further, [23] applied IOUS to construct a prediction interval at a single test point. However, (i) [23] does not address the issue that the variance of the Hájek projection is vanishing: the two conditions in Theorem 1 therein, E[h²_{k_n}(Z_1, …, Z_{k_n})] ≤ C < ∞ and lim ζ_{1,k_n} ≠ 0, are not compatible by our previous discussion; (ii) in practice, the size d of a test set may be comparable to, or even much larger than, the size n of a training set, and the current work is motivated by such considerations. Limit theorems for the related infinite-order V-statistics and infinite-order U-processes were studied in [18,28]. High-dimensional Gaussian approximation results and bootstrap methods were established in [10,11] for sums of independent random vectors, and in [6,8] for U-statistics. We refer readers to these references for extensive literature reviews.
Incomplete U-statistics, which can be viewed as a special case of weighted U-statistics, were first introduced in [1]. There is a large literature on limit theorems for weighted U-statistics; see [22,24,25,26]. The asymptotic distributions of incomplete U-statistics (for fixed d) were derived in [5] and [20]; see also Section 4.3 in [21] for a review of incomplete U-statistics. Recently, incomplete U-statistics have gained renewed interest in the statistics and machine learning literature [12,23]. To the best of our knowledge, the current paper is the first work that establishes distributional approximation theorems for incomplete IOUS with random kernels and increasing orders in high dimensions.
The remainder of the paper is organized as follows. We develop Gaussian approximation results for the above U-statistics in Section 2, and bootstrap methods for the covariance of the approximating Gaussian distribution in Section 3. We apply the theoretical results to several examples in Section 4. We highlight a maximal inequality in Section 5, and present all other proofs in Appendix A.

Gaussian approximations for IOUS
In this section, we shall derive non-asymptotic Gaussian approximation error bounds for (i) the IOUS with random kernel U_n in (2), which includes the IOUS with deterministic kernel U_n in (1) as a special case, and (ii) the incomplete IOUS U′_{n,N} in (3) under the Bernoulli sampling scheme.
Let Y^A and Y^B be two independent d-dimensional mean-zero Gaussian random vectors with covariance matrices Γ_g and Γ_H, respectively; we may take Y^A and Y^B to be independent of any other random variables. Further, for any two mean-zero d-dimensional random vectors U and Y, define

    ρ(U, Y) := sup_{R ∈ ℛ} |P(U ∈ R) − P(Y ∈ R)|,

where ℛ denotes the class of all hyperrectangles in R^d. Finally, in view of the discussion in the Introduction (Section 1) and to simplify the presentation, we assume σ²_g ≤ 1; otherwise, the conclusions in this paper hold with σ_g replaced by min{σ_g, 1}.

IOUS with random kernel
We start with U_n. Define, for 1 ≤ j ≤ d, q > 0, and (x_1, …, x_r) ∈ S^r, the envelope B_{n,j}(x_1, …, x_r) for the randomized kernel H_j. We make the following assumptions: there exist D_n ≥ 1 and an absolute constant q > 0 such that ‖B_{n,j}(X_1^r)‖_{ψ_q} ≤ D_n for all j = 1, …, d.
Proof. See Section A.3.

We highlight that a key step in establishing Theorem 2.1 is to control the expected supremum of the remainder of the Hájek projection of the complete IOUS with deterministic kernel (see Theorem 5.1). The Gaussian approximation result for IOUS then follows from Gaussian approximation results for sums of independent random vectors [10] and the anti-concentration inequality [9], by an argument similar to that in [8] with proper normalization.
Clearly, in the special case of a non-random kernel, i.e., H(x_1, …, x_r, W) = h(x_1, …, x_r), condition (C4) holds trivially. Thus we have the following immediate result for the IOUS with deterministic kernel U_n in (1).
where q* := (6/q + 1) ∨ 7, and ≲ means up to a multiplicative constant that depends only on q.
Remark 2.3 (Comparisons with existing results for d = 1). For the univariate IOUS with non-random kernels, asymptotic normality and its rate of convergence are well understood in the literature; see [2] for a survey of results in this direction. In [31], a Berry–Esseen bound is derived for symmetric statistics, which include IOUS (with non-random kernels) as a special case. In particular, applying Corollary 4.1 in [31] to IOUS, the rate of convergence to normality is of order O(r² n^{-1/2} σ²_H/σ²_g) for a bounded kernel, which implies that asymptotic normality requires (at least) r = o(n^{1/6}). A related Berry–Esseen bound is given in [16]. In both papers, the rates of convergence are suboptimal. For elementary symmetric polynomials (which are U-statistics corresponding to the product kernel h(x_1, …, x_r) = x_1 ⋯ x_r), the sharp rate of convergence to normality is derived in [30]; that result implies that asymptotic normality for the IOUS with the product kernel is achieved when r = O(log^{-2}(n) n^{1/2}). If σ_g^{-2} = O(r²), which holds under the regularity conditions of Lemma 4.1, our Corollary 2.2 with q = 1 implies that the rate of convergence for high-dimensional IOUS is O((r⁴ log⁷(dn) n^{-1})^{1/6}) (with suitably bounded moments). In particular, the Gaussian approximation is asymptotically valid if log d = O(log n) and r = o(n^{1/4−ε}) for any ε ∈ (0, 1/4). Even though our result is valid for a smaller range of r and the rate is slower than the optimal rate in the case d = 1, Corollary 2.2 does allow the dimension to grow sub-exponentially fast in the sample size, which is a useful feature for high-dimensional statistical inference. In addition, to the best of our knowledge, the validity of the bootstrap procedures proposed in Section 3 for approximating the sampling distribution of IOUS (over hyperrectangles in R^d) is new in the literature.

Incomplete IOUS with random kernel
Now we consider U′_{n,N}, where we recall that N is a given computational budget. We will assume the following conditions (C3') and (C5), stated for q > 0. Clearly, (C4) and (C3') imply (C3) up to a multiplicative constant. Further, (C3') and (C5) hold if |H_j(X_1^r, W)| ≤ D_n a.s. for 1 ≤ j ≤ d.
Remark 2.5. If q ≥ 1, then q₁ = 2 and q* = 7. Since ‖ξ‖_{ψ_1} ≤ ‖ξ‖_{ψ_q} for any random variable ξ and q ≥ 1, we may assume without loss of generality that q ≤ 1 in the proof. When r is fixed, q = 1, the kernel is deterministic, and there exists some absolute constant σ² > 0 such that σ²_g ≥ σ², the above theorem recovers Theorem 3.1 of [8].
Further, by first conditioning on X_1^n, we have Γ_H ⪰ Γ_h, where Γ_h denotes the covariance matrix associated with the deterministic kernel h, and, for two square matrices, A ⪰ B means that A − B is positive semi-definite. Thus the random kernel H(·) increases the variance of the approximating Gaussian distribution compared to the associated deterministic kernel h(·).

Bootstrap approximations
In Section 2.2, we saw that the incomplete U-statistic with random kernel is approximated by the Gaussian distribution N(0, r²Γ_g + α_nΓ_H). However, the covariance is typically unknown in practice. In this section, we estimate Γ_g and Γ_H by bootstrap methods.

Bootstrap for Γ H
Let 𝒟_n := {X_1, …, X_n} ∪ {W_ι, Z_ι : ι ∈ I_{n,r}} denote the data involved in the definition of U′_{n,N}, and take a collection of independent N(0, 1) random variables {ξ′_ι : ι ∈ I_{n,r}} that is independent of the data 𝒟_n. Define the following bootstrap distribution:

    U^#_{n,B} := \hat N^{-1/2} Σ_{ι ∈ I_{n,r}} Z_ι ξ′_ι (H(X_ι, W_ι) − U′_{n,N}).    (7)

The next theorem establishes the validity of U^#_{n,B}.
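A generic Gaussian-multiplier bootstrap of this type can be sketched as follows (our illustration with hypothetical names; the draw uses the standard multiplier form over the kept kernel evaluations, and may differ in normalization details from the paper's exact definition):

```python
import numpy as np

def bootstrap_gamma_H(H_vals, rng, B=500):
    """Multiplier bootstrap for the Gamma_H part (illustrative sketch).

    H_vals: (Nhat, d) array of the sampled kernel evaluations H(X_iota, W_iota).
    Each bootstrap draw is Nhat^{-1/2} * sum_iota xi_iota * (H_iota - U'),
    with i.i.d. N(0,1) multipliers xi_iota.
    """
    Nhat, d = H_vals.shape
    U_prime = H_vals.mean(axis=0)                      # incomplete U-statistic
    xi = rng.standard_normal((B, Nhat))                # Gaussian multipliers
    return (xi @ (H_vals - U_prime)) / np.sqrt(Nhat)   # (B, d) bootstrap draws

rng = np.random.default_rng(2)
H_vals = rng.normal(size=(400, 3)) * np.array([1.0, 2.0, 0.5])
draws = bootstrap_gamma_H(H_vals, rng, B=2000)
```

Conditionally on the data, each draw is exactly Gaussian with covariance equal to the empirical covariance of the kept kernel evaluations, so the componentwise bootstrap variances track the diagonal of Γ_H.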

Bootstrap for the approximating Gaussian distribution
For each i_1 ∈ S_1, let G_{i_1} be an estimator of g(X_{i_1}), whose form is specified later. We use the following quantity to measure the quality of G_{i_1} as an estimator of g(X_{i_1}). Define Ḡ := n_1^{-1} Σ_{i_1 ∈ S_1} G_{i_1} and consider the following bootstrap distribution for N(0, Γ_g):

    U^#_{n,A} := n_1^{-1/2} Σ_{i_1 ∈ S_1} ξ_{i_1} (G_{i_1} − Ḡ),    (10)

where {ξ_{i_1} : i_1 ∈ S_1} is a collection of independent N(0, 1) random variables that is independent of 𝒟_n and {ξ′_ι : ι ∈ I_{n,r}}.

Lemma 3.2. Assume that conditions (C1-ND), (C2) and (C3') hold, and that, with q₂ := (4/q + 1) ∨ 5, the stated rate condition holds for some constants C_1 and ζ_1, ζ_2 ∈ (0, 1). Then there exists a constant C depending only on q, C_1 and ζ_1 such that the stated bound holds with the stated probability.

Proof. See Subsection A.5.2.

Simultaneous confidence intervals
We first combine the Gaussian approximation result with the bootstrap result.
In simultaneous confidence interval construction, it is sometimes desirable to normalize the variance of each coordinate, so that when maximum-type statistics are used, the critical value is not dominated by coordinates with large variances. Define, for 1 ≤ j ≤ d, the quantities σ̂²_{g,j} and σ̂²_{H,j} as the diagonal elements of the conditional covariance matrices of U^#_{n,A} in (10) and U^#_{n,B} in (7), respectively. Further, define a d × d diagonal matrix Λ with Λ_{j,j} = r²σ̂²_{g,j} + α_n σ̂²_{H,j} for each 1 ≤ j ≤ d.

Corollary 3.5. Assume the conditions in Corollary 3.4. Then there exists a constant C depending only on q, C_1 and ζ such that the stated bound holds with the stated probability.

Remark 3.6. From Corollary 3.5, we can immediately construct confidence intervals for θ in a data-dependent way. Specifically, let q_{1−α} be a (1 − α)-quantile of the conditional distribution of ‖Λ^{-1/2} U^#_{n,n_1}‖_∞ given 𝒟_n. Then one way to construct simultaneous confidence intervals with confidence level (1 − α) is as follows:
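The max-statistic calibration behind such intervals can be sketched as follows (illustrative only; the function `simultaneous_intervals` and its plug-in variances are our stand-ins for Λ and the bootstrap draws, not the paper's exact construction):

```python
import numpy as np

def simultaneous_intervals(theta_hat, boot_draws, alpha=0.05):
    """Simultaneous intervals from normalized max-statistic quantiles (sketch)."""
    Lambda_diag = boot_draws.var(axis=0)              # plug-in coordinate variances
    t_stats = np.abs(boot_draws) / np.sqrt(Lambda_diag)
    q = np.quantile(t_stats.max(axis=1), 1 - alpha)   # critical value of the max statistic
    half = q * np.sqrt(Lambda_diag)                   # common multiplier, per-coordinate scale
    return theta_hat - half, theta_hat + half

rng = np.random.default_rng(3)
theta_hat = np.array([0.1, -0.2, 0.0])
boot = rng.normal(size=(5000, 3)) * np.array([0.05, 0.1, 0.02])
lo, hi = simultaneous_intervals(theta_hat, boot)
```

Normalizing each coordinate by its own variance keeps the critical value from being dominated by the coordinates with the largest variances, which is exactly the motivation for Λ above.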

Applications
In many applications, g(x) = E[h(x, X_2, …, X_r)] does not admit an explicit form, and thus it is usually hard to compute σ_g in conditions (C1-ND) and (12) directly. When the kernel h has special structure, we can establish a lower bound on σ_g with explicit dependence on r, which can be applied to Example 1.1. We give additional examples in Sections 4.3 and 4.4 to illustrate the usefulness of U-statistics as a tool for estimating and making inference on certain statistical functionals of X_1, …, X_r. In Section 4.3, on the expected maximum and log-mean functionals, we also establish a lower bound on σ_g with explicit dependence on r. In Section 4.4, on the kernel density estimation problem, r is assumed to be fixed, but we allow the diameter of the design points to diverge. For simplicity of presentation, we assume throughout this section that all involved derivatives and integrals exist and are finite, and that the order of integrals, and of integration and differentiation, can be exchanged; these assumptions can be justified under standard smoothness and moment conditions. For illustration, we use q = 1 in (C4) and (C3').

Lower bound for σ g
Suppose that the distribution P of X_1 has a density f_0 with respect to some σ-finite (reference) measure µ, i.e., dP = f_0 dµ. We first embed f_0 into a family of densities {f_β : β ∈ B ⊂ R^ℓ}, where B is an open neighborhood of 0 ∈ R^ℓ. Such embeddings always exist, and below are some examples for S = R^ℓ.
1. Location and scale families. If µ is the Lebesgue measure on R^ℓ, we may consider, for x ∈ R^ℓ, the location family f_β(x) = f_0(x − β) or the corresponding scale family.
2. Exponential family. If f_0 has a finite moment generating function in a neighborhood of 0, we may consider the exponential tilting family f_β(x) = f_0(x) exp(β^T x − A(β)), where A(β) := log ∫ e^{β^T x} f_0(x) µ(dx).
3. Additive noise model. Let Υ be an R^ℓ-valued random vector, independent of X_1, whose distribution is absolutely continuous w.r.t. µ; then X_1 + βΥ has a density f_β given by the convolution of the densities of X_1 and βΥ.
For β ∈ B, define the perturbed expectation θ(β) := E_β[h(X_1, …, X_r)], where E_β denotes the expectation when X_1, …, X_r are i.i.d. with density f_β. Further, define the score Ψ(β) := ∇ log f_β(X_1) and J(β) := Var_β(Ψ(β)), where ∇ denotes the gradient (or the derivative when β is a scalar) with respect to β, and Var_β denotes the covariance matrix when X_1, …, X_r have density f_β. Thus Ψ(β) is the score function and J(β) is the Fisher information for a single observation.
In particular, the lower bound applies if there exists an absolute positive constant c such that the stated condition holds.

Proof. See Subsection A.6.

Simultaneous prediction intervals for random forests
Consider Example 1.1 and assume that (Y_1, Z_1) has joint density q(y)p(z; y). That is, the feature Y_1 has density q(y) w.r.t. some σ-finite measure ν on 𝒴, and thus is allowed to have both continuous and discrete components, while the response Z_1 given Y_1 = y has conditional density p(z; y) w.r.t. the Lebesgue measure. For many regression algorithms, such as tree-based methods, if we fix the features and increase the responses of the training samples by β ∈ R, the prediction at any test point also increases by β; i.e., for 1 ≤ j ≤ d,

    h_j((y_1, z_1 + β), …, (y_r, z_r + β)) = h_j((y_1, z_1), …, (y_r, z_r)) + β.

Now we consider the embedding into the "location" family {q(y)p(z − β; y) : β ∈ R}. Observe that, under this embedding, θ_j(β) = θ_j(0) + β, so ∇θ_j(0) = 1. Thus, if we assume that there exists c such that the Fisher-information condition of Lemma 4.1 holds, then (13) reduces to σ²_g ≥ c r^{-2}. If we further assume that |H_j(X_1^r, W)| ≤ C a.s. for some constant C and each 1 ≤ j ≤ d (which holds, for example, when the response is bounded a.s.), then conditions (C2), (C3), (C4) and (C5) hold with D_n = ln^{-2}(2) C. With these assumptions, condition (12) in Corollary 3.5 simplifies accordingly. Thus, if r = O(n^{1/4−ε}) for some ε > 0, log(d) = O(log(n)), and n = O(n_1 ∧ N), then Corollary 3.5 can be used to construct asymptotically valid simultaneous prediction intervals whose approximation error decays polynomially fast in n.
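The translation-equivariance property assumed above is easy to verify for simple regression rules; here is a check with a 1-nearest-neighbor predictor as an illustrative stand-in for a tree (all names are ours):

```python
import numpy as np

def predict_1nn(train_y, train_z, test_y):
    """1-nearest-neighbor regression: predict the response of the closest feature."""
    idx = np.array([np.argmin(np.abs(train_y - t)) for t in test_y])
    return train_z[idx]

rng = np.random.default_rng(4)
y, z = rng.random(30), rng.normal(size=30)
tests = np.array([0.2, 0.5, 0.8])
beta = 1.7
base = predict_1nn(y, z, tests)
shifted = predict_1nn(y, z + beta, tests)       # shift all training responses by beta
assert np.allclose(shifted, base + beta)        # predictions shift by exactly beta
```

Averaging an equivariant rule over sub-samples (and over the randomization W) preserves equivariance, which is what makes the location-family embedding natural here.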
Remark 4.2 (Fisher information in nonparametric regression). Let us take a closer look at condition (14). Consider the nonparametric regression model Z_i = κ(Y_i) + ε_i, where κ : 𝒴 → R is a deterministic measurable function and ε_1, …, ε_n are i.i.d. with some density f with respect to the Lebesgue measure. Then p(z; y) = f(z − κ(y)), and thus

    J(0) = ∫∫ (f′(z − κ(y)))²/f(z − κ(y)) dz q(y) ν(dy) = ∫ (f′(u))²/f(u) du,

where, for the last equality, we first perform the integration w.r.t. dz and apply a change of variables. Thus J(0) depends only on the density of the noise.
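For concreteness, the noise-only Fisher information ∫ f′(u)²/f(u) du can be checked numerically; for Gaussian noise N(0, s²) the closed form is 1/s² (our sketch):

```python
import numpy as np

s = 1.5
z = np.linspace(-12.0, 12.0, 200_001)                        # fine grid; tails negligible
f = np.exp(-z**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))    # noise density f(z)
fp = -z / s**2 * f                                           # derivative f'(z)
J = np.sum(fp**2 / f) * (z[1] - z[0])                        # Riemann sum for J(0)
assert abs(J - 1 / s**2) < 1e-3                              # closed form 1/s^2
```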

Expected maximum and log-mean functionals
Next we compute the lower bounds on σ 2 g for two additional statistical functionals.
Example 4.3. Let S = R^d and consider the following two kernels: for 1 ≤ j ≤ d,

    h_j(x_1, …, x_r) = max_{1 ≤ i ≤ r} x_{ij}    and    h_j(x_1, …, x_r) = log(r^{-1}(x_{1j} + ⋯ + x_{rj})).

In the former case, we are interested in estimating the expectations of the coordinate-wise maxima of r independent random vectors, {E[max_{1 ≤ i ≤ r} X_{ij}] : 1 ≤ j ≤ d}. In the latter, we assume X_{1j} > 0 for 1 ≤ j ≤ d and are interested in the corresponding log-mean functionals. In both cases, the coordinates of X_1 can have arbitrary dependence, and we allow r → ∞.
Consider the first kernel in Example 4.3, where S = R^d and h_j(x_1, …, x_r) = max_{1 ≤ i ≤ r} x_{ij} for 1 ≤ j ≤ d. Assume that X_{1j} has a density f_j w.r.t. the Lebesgue measure on R for 1 ≤ j ≤ d, and consider the embedding {f_j(· − β) : β ∈ R}. As in the previous example, the location shift adds β to the kernel, so ∇θ_j(0) = 1 for β ∈ R. Thus, by Lemma 4.1, if the corresponding Fisher-information condition holds for some absolute positive constant c, then σ²_g ≥ c r^{-2}. Further, if we assume that there exists a positive constant C such that ‖X_{1j}‖_{ψ_1} ≤ C for 1 ≤ j ≤ d, then by the maximal inequality (e.g., [29, Lemma 2.2.2]), ‖max_{1 ≤ i ≤ r} X_{ij}‖_{ψ_1} ≲ log(r). Then, if we select D_n = C′σ_g^{-1} log²(r), conditions (C2), (C3) and (C5) hold; (C4) holds trivially for non-random kernels. With the above assumptions and this choice of D_n, condition (12) in Corollary 3.5 simplifies accordingly.

For the second kernel in Example 4.3, assume again that X_{1j} has a density f_j w.r.t. the Lebesgue measure on R for 1 ≤ j ≤ d, and consider an analogous embedding. As before, it is easy to see that the sensitivity ∇θ_j(0) can be computed for 1 ≤ j ≤ d. Thus, if there exists a constant c such that the corresponding condition holds, then conditions (C2), (C3), (C4) and (C5) hold with D_n = ln^{-1}(2) log(C). With these assumptions, condition (12) in Corollary 3.5 simplifies accordingly.

Kernel density estimation
Example 4.4 (Kernel density estimation). Let τ : S^r → R^ℓ be a measurable function symmetric in its r arguments, and let {t_j : 1 ≤ j ≤ d} ⊂ R^ℓ be d design points. [15,17] used U_n as a kernel density estimator (KDE) for the density of τ(X_1, …, X_r) at the given design points, with the U-statistic kernel built by smoothing τ(x_1, …, x_r) around each t_j, where b_n > 0 is a bandwidth parameter and κ(·) is the density estimation kernel with ∫κ(z)dz = 1, which should not be confused with the U-statistic kernel h.
For this example, we assume that r is fixed and the bandwidth b_n → 0, but we allow the diameter of the design points, max_{1 ≤ j ≤ d} ‖t_j‖, to grow, where ‖·‖ denotes the usual Euclidean norm.
Assume that τ(X_1, …, X_r) has a density V(·) w.r.t. the Lebesgue measure on R^ℓ. Then, by definition, for 1 ≤ j ≤ d, g and σ²_{g,j} can be computed explicitly; as in [15], V_n(t) → V(t) for any fixed t. If there exists some R > 0 such that max_{1 ≤ j ≤ d} ‖t_j‖ ≤ R for all d ∈ N and inf_{t ∈ R^ℓ : ‖t‖ ≤ R} V(t) > 0, then under mild continuity assumptions (e.g., the equicontinuity of V_n(t)) there exists an absolute constant c > 0 such that σ²_g ≥ c for large n. In that case, one can apply the results of [8], which do not allow σ²_g to vanish. In this work, we allow σ²_g to vanish, and thus allow the diameter of the design points to grow as n becomes large. Specifically, if we assume that κ(·) is bounded by some constant C, we can select D_n = ln^{-1}(2) C b_n^{-1} in conditions (C2), (C3), (C4) and (C5). Then condition (12) in Corollary 3.5 simplifies accordingly. Thus, if log(d) = O(log(n)) and n = O(n_1 ∧ N), then to apply Corollary 3.5 we require σ_g^{-2} = O(b_n² n^{1−ε}) for any ε > 0.

Remark 4.5. [15] considers the case d = 1 and shows the √n rate of convergence of the KDE; the same discussion applies here. [17] constructs confidence bands (without computational considerations or bootstrap results) for the density of τ(X_1^r), under additional assumptions required to establish the convergence of empirical processes.
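A toy instance of this U-statistic KDE with r = 2 and τ(x_1, x_2) = x_1 + x_2 (our illustration; for standard normal inputs the target density of τ(X_1, X_2) is N(0, 2)):

```python
import numpy as np
from itertools import combinations

def u_kde(X, t_points, b):
    """U-statistic KDE for the density of tau(X1, X2) = X1 + X2 (sketch)."""
    pairs = np.array([X[i] + X[j] for i, j in combinations(range(len(X)), 2)])
    u = (t_points[:, None] - pairs[None, :]) / b       # scaled distances to design points
    kappa = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)     # Gaussian smoothing kernel
    return kappa.mean(axis=1) / b                      # average over all pairs in I_{n,2}

rng = np.random.default_rng(5)
X = rng.normal(size=300)
t = np.array([0.0, 1.0])
est = u_kde(X, t, b=0.3)
true = np.exp(-t**2 / 4) / np.sqrt(4 * np.pi)          # N(0, 2) density at t
```

With n = 300 and bandwidth 0.3, the estimate should track the N(0, 2) density at both design points up to the usual bias and sampling error.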

Maximal inequality
In this section, we derive an upper bound on the expected supremum of the remainder of the Hájek projection of the complete IOUS with deterministic kernel. This maximal inequality (with the explicit dependence on r) serves as a key step to establish the Gaussian approximation result for the incomplete IOUS with random kernel.
The proof of Theorem 5.1 is quite involved: we need to develop a number of technical tools such as the symmetrization inequality and Bonami inequality (i.e., exponential moment bound) for the Rademacher chaos, all with the explicit dependence on r.
For a given probability space (X, 𝒜, Q), a measurable function f on X, and x ∈ X, we use the notation Qf = ∫ f dQ whenever the latter integral is well-defined, and denote by δ_x the Dirac measure at x, i.e., δ_x(A) = 1{x ∈ A} for any A ∈ 𝒜. For a measurable symmetric function f on S^r and k = 0, 1, …, r, let P^{r−k}f denote the function on S^k defined by

    (P^{r−k}f)(x_1, …, x_k) := ∫ f(x_1, …, x_k, x_{k+1}, …, x_r) dP(x_{k+1}) ⋯ dP(x_r),

whenever it is well-defined. To prove Theorem 5.1, we may assume without loss of generality that θ = P^r h = 0, since we can always consider h(·) − θ instead. For 0 ≤ k ≤ r, define the Hoeffding projections

    (π_k h)(x_1, …, x_k) := ((δ_{x_1} − P) ⊗ ⋯ ⊗ (δ_{x_k} − P) ⊗ P^{r−k}) h.    (15)

Clearly π_k h is degenerate of order k with respect to the distribution P in the sense of (16) below. For any ι = (i_1, …, i_k) ∈ I_{n,k} and J = (j_1, …, j_ℓ) ∈ I_{k,ℓ} with 0 ≤ ℓ ≤ k, define ι_J := (i_{j_1}, …, i_{j_ℓ}) ∈ I_{n,ℓ}.

Symmetrization inequality
For each integer k, consider a symmetric kernel f : S^k → R^d. We say that f is degenerate of order k with respect to the distribution P if

    E_{X_1}[f_j(X_1, X_2, …, X_k)] = 0 a.s., for any 1 ≤ j ≤ d.    (16)
The following result is essentially due to [27, Section 3, Symmetrization inequality] in the U -process setting. We provide a self-contained (and perhaps more transparent) proof for completeness.
The improvement of the constant to exponential growth in k turns out to be crucial for obtaining the maximal inequality for the IOUS in Theorem 5.1.
Further, for each ι = (i_1, …, i_k) ∈ I_{n,k}, define the corresponding symmetrized term. Due to degeneracy, we have the claimed identity, where the first and third equalities follow from the definitions and Fubini's theorem, and the second follows from the degeneracy: to wit, on the event that {ε_{i_ℓ} = −1} for some 1 ≤ ℓ ≤ k, the corresponding term vanishes. The rest of the argument is standard and uses Jensen's inequality. Since (X_1, …, X_n, ε_1, …, ε_n) and (Z_1, …, Z_n, ε_1, …, ε_n) have the same distribution, taking expectations on both sides completes the proof.

Maximal inequality
We start with a lemma, whose proof is elementary and thus omitted. Recall the definition of ψ β in Subsection 1.2.

Exponential moment of Rademacher chaos
The goal is to establish an exponential moment bound (i.e., a Bonami inequality) for Rademacher chaos of order k. Based on the well-known hypercontractivity of Rademacher chaos variables (cf. [13, Corollary 3.2.6]), our Lemma 5.6 below provides an exponential moment bound with explicit dependence on the order.
Lemma 5.6 (Exponential moment of Rademacher chaos). Fix k ≥ 2, let β = 2/k, and let {x_ι : ι ∈ I_{n,k}} be a collection of real numbers. Consider the following homogeneous chaos of order k:

    Z := Σ_{ι ∈ I_{n,k}} x_ι ε_{i_1} ⋯ ε_{i_k},

where ε_1, …, ε_n are i.i.d. Rademacher random variables. Then E[ψ_β(|Z|/∆_n)] ≤ 1.

Proof. Denote κ := (E[Z²])^{1/2} and c := √7, so that ∆_n = c^k κ. Observe that β ≤ 1 and βk = 2. From [13, Theorem 3.2.2], we have for any q ≥ 2 that ‖Z‖_q ≤ (q − 1)^{k/2} κ; the inequality clearly holds for q ≥ 2. Expanding the exponential and using the fact that ℓ^ℓ ≤ e^ℓ ℓ!, the resulting series is dominated term by term; since c² = 7 > e, it is bounded by a convergent geometric series, which completes the proof.
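A brute-force check of the lemma for a small chaos of order k = 2 (our illustration): enumerate all sign patterns and verify E[ψ_β(|Z|/∆_n)] ≤ 1 with ψ_β(u) = exp(u^β) − 1 and ∆_n = 7^{k/2}(E[Z²])^{1/2}.

```python
import numpy as np
from itertools import combinations, product

rng = np.random.default_rng(6)
n, k = 8, 2
idx = list(combinations(range(n), k))        # I_{n,k}
x = rng.normal(size=len(idx))
beta = 2.0 / k
EZ2 = np.sum(x**2)                           # E[Z^2] = sum x_iota^2 by orthogonality
Delta = 7**(k / 2) * np.sqrt(EZ2)            # Delta_n = 7^{k/2} (E[Z^2])^{1/2}
vals = []
for eps in product([-1, 1], repeat=n):       # exact expectation over 2^n sign patterns
    Z = sum(xi * eps[i] * eps[j] for xi, (i, j) in zip(x, idx))
    vals.append(np.exp((abs(Z) / Delta)**beta) - 1)   # psi_beta(u) = e^{u^beta} - 1
E_psi = np.mean(vals)
```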

Proof of Theorem 5.1
Now we are in a position to prove Theorem 5.1. Recall that we assume θ = 0. First, for each 2 ≤ k ≤ r and 1 ≤ j ≤ d, define

where π_k h is defined in (15), and ǫ_1, . . . , ǫ_n are i.i.d. Rademacher random variables. Define

By Jensen's inequality and the fact that (∑_{i=1}^n z_i)^2 ≤ n ∑_{i=1}^n z_i^2, we have for any 1 ≤ j ≤ d,

Then by Lemma 5.6,

Further, by Lemma 5.5 with β = 2/k, we have

Then by Lemma 5.2 and Jensen's inequality, we have

Now we bound E[F_k^2(X_1, . . . , X_k)]. By the definition of π_k h_j, condition (C3), Lemma 5.4 and Jensen's inequality, we have

E[ψ_q(|h_j(X_1, . . . , X_k, X_{k+1}, . . . , X_r)|/D_n)] + 1 ≤ 2.
Since ψ_q(0) = 0, by Jensen's inequality, we have ‖π_k h_j(X_1, . . . , X_k)‖_{ψ_q} ≤ 2D_n. Then by the standard maximal inequality (e.g., see [29, Lemma 2.2.2]), there exists a constant C, depending only on q, such that for 1 ≤ k ≤ r,

Thus we obtain that

Observe that if r^2 ≤ n, we have for any 1 ≤ i ≤ r

Further, for any x, y ≥ 2, log^{k/2}(x + y) ≤ 2^{k/2}(log^{k/2}(x) + log^{k/2}(y)). Now, take c = 1/500, and in particular r^2 ≤ n. Then

For the first term, by the geometric series formula,

For the second term, since ℓ^ℓ ≤ e^ℓ ℓ! for any ℓ ≥ 1, we have

which completes the proof of Theorem 5.1.
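The two elementary inequalities invoked above can be verified numerically. The snippet below (a quick sanity check, not part of the proof) confirms ℓ^ℓ ≤ e^ℓ ℓ! and the splitting inequality for log^{k/2}(x + y) on a grid; the latter follows from x + y ≤ xy for x, y ≥ 2:

```python
import math

# (1) l^l <= e^l * l!  for all integers l >= 1 (used via Stirling-type
#     bounds in the proof of Theorem 5.1).
for l in range(1, 60):
    assert l ** l <= math.e ** l * math.factorial(l)

# (2) log^{k/2}(x + y) <= 2^{k/2} (log^{k/2}(x) + log^{k/2}(y)) for
#     x, y >= 2; this follows from x + y <= x * y when x, y >= 2.
for k in range(2, 12):
    for x in [2.0, 3.5, 10.0, 1e3]:
        for y in [2.0, 7.0, 1e2, 1e6]:
            lhs = math.log(x + y) ** (k / 2)
            rhs = 2 ** (k / 2) * (math.log(x) ** (k / 2)
                                  + math.log(y) ** (k / 2))
            assert lhs <= rhs

print("both elementary inequalities verified on the grid")
```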

A.1. Tail probabilities
In this section, we collect and prove some results on tail probabilities for sums of independent random vectors, U-statistics, and U-statistics with random kernels. For each type of statistic, we present two versions, one for non-negative random variables and the other for the general case. These inequalities are used in bounding the effects due to sampling (Subsection A.4.3), and also in controlling the ‖·‖_∞ distance between the bootstrap covariance matrices and their targets (Section A.5).

A.1.1. Tail probabilities for sums of independent random vectors
In this subsection, m, n, d ≥ 2 are all integers.
Then there exists some constant C that only depends on β such that

Proof. See Subsection A.7.1.

Then there exists some constant C that only depends on β such that

P max

A.1.2. Tail probabilities for U-statistics

Define U_n := |I_{n,r}|^{-1} ∑_{ι∈I_{n,r}} f(X_ι). Then there exists a constant C that only depends on β such that

P max

Clearly, we can replace v_n by u_n.
Clearly, we can replace σ by u_n.

A.1.3. Tail probabilities for U -statistics with random kernel
Let X_1, . . . , X_n be i.i.d. random variables taking values in (S, S), and let W, {W_ι : ι ∈ I_{n,r}} be i.i.d. random variables taking values in (S′, S′) that are independent of X_1^n. In this subsection, we consider a measurable function F : S^r × S′ → R^d that is symmetric in the first r variables, and fix some β ∈ (0, 1]. Further, define

We first consider non-negative random kernels.

Lemma A.6. Consider Z := max_{1≤j≤d} |I_{n,r}|^{-1} ∑_{ι∈I_{n,r}} F_j(X_ι, W_ι). Assume that F_j(·) ≥ 0 for all j = 1, . . . , d, and that there exists u_n ≥ 1 such that ‖b_j(X_1^r)‖_{ψ_β} ≤ u_n and ‖f_j(X_1^r)‖_{ψ_β} ≤ u_n for all j = 1, . . . , d.
Then there exists some constant C that only depends on β such that with probability at least 1 − 8/n,

Proof. See Subsection A.7.6.
Next, we consider centered random kernels.

A.2. Additional lemmas
The following lemma concerns a Gaussian approximation for sums of independent random vectors. It replaces the ‖·‖_{ψ_1} condition in Proposition 2.1 of [10] by a ‖·‖_{ψ_q} condition.
Then there exists some constant C that only depends on q and σ^2 such that

The following lemmas are elementary, but are used repeatedly.
Lemma A.9. Let β > 0. There exists a constant C, depending only on β, such that for any positive integers r, n with 2 ≤ r ≤ √n, n^2 r^β ≤ C |I_{n,r}|.
Proof. Fix β. As r → ∞, n^2 r^β / |I_{n,r}| → 0. Thus there exists M such that if r ≥ M, then n^2 r^β ≤ |I_{n,r}|. For r < M, the inequality holds with C = M^β.
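Lemma A.9 only asserts the existence of C; as an illustrative check (our choices of β and C, not the paper's), C = 10 already works for β = 1 on a grid of (n, r) with 2 ≤ r ≤ √n:

```python
from math import comb, isqrt

# Illustrative check of Lemma A.9 with beta = 1: the lemma asserts
# existence of C with n^2 * r^beta <= C * |I_{n,r}| = C * comb(n, r)
# whenever 2 <= r <= sqrt(n); C = 10 suffices on this grid.
beta, C = 1, 10
for n in range(4, 200):
    for r in range(2, isqrt(n) + 1):
        assert n ** 2 * r ** beta <= C * comb(n, r)

print("Lemma A.9 verified for beta = 1 on the grid")
```

The worst case on the grid is r = 2, where n^2 r / comb(n, 2) = 4n/(n − 1) ≤ 8 for n ≥ 2, consistent with the asymptotic argument in the proof.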
Lemma A.10. Let β, k > 0. For any random variable X,

Proof. Observe that

which implies that ‖|X|^k‖_{ψ_β} ≤ ‖X‖_{ψ_{kβ}}^k. The reverse direction is similar.

For β < 1, ‖·‖_{ψ_β} is not a norm, but the usual triangle inequality and maximal inequality hold up to a multiplicative constant.
(i) For any random variables X and Y,

(ii) Let ξ_1, . . . , ξ_n be a sequence of random variables such that ‖ξ_i‖_{ψ_β} ≤ D for 1 ≤ i ≤ n, and let n ≥ 2. Then there exists a constant C depending only on β such that ‖max_{1≤i≤n} |ξ_i|‖_{ψ_β} ≤ C D log^{1/β}(n).

Proof. See Subsection A.8.
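The scaling identity of Lemma A.10 can be checked exactly in the degenerate case of a constant random variable (our example, not the paper's): for X = c the Orlicz norm under ψ_q(x) = exp(x^q) − 1 has the closed form c/(log 2)^{1/q}, and the identity holds with equality:

```python
import math

# Sanity check of Lemma A.10 for a constant random variable X = c:
# ||c||_{psi_q} = c / (log 2)^{1/q}, since E[psi_q(c/t)] <= 1 iff
# exp((c/t)^q) <= 2. The identity |||X|^k||_{psi_beta} =
# ||X||_{psi_{k*beta}}^k then holds exactly.
def psi_norm_const(c, q):
    # smallest t with E[psi_q(|X|/t)] <= 1 when X = c a.s.
    return c / math.log(2) ** (1 / q)

c, k, beta = 2.7, 3, 0.5
lhs = psi_norm_const(c ** k, beta)          # |||X|^k||_{psi_beta}
rhs = psi_norm_const(c, k * beta) ** k      # ||X||_{psi_{k beta}}^k
assert abs(lhs - rhs) < 1e-9 * lhs
print("psi-norm scaling identity holds for the constant example")
```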

A.3. Proofs in Section 2.1
We first prove Corollary 2.2 and then prove Theorem 2.1.
Proof of Corollary 2.2. Let c be the constant in Theorem 5.1. Without loss of generality, we assume

since ρ(·, ·) ≤ 1 and we can always consider h(·) − θ instead. Recall that q* = (6/q + 1) ∨ 7. Fix any rectangle R = [a, b] ∈ R, where a, b ∈ R^d and a ≤ b. Define

Then by Theorem 5.1,

For any t > 0, by Markov's inequality and the definition,

Due to assumptions (C2), (C3) and the Cauchy–Schwarz inequality,

Then due to Lemma A.8, we have

Further, by the anti-concentration inequality [10, Lemma A.1],

Finally, taking t = {σ_g^{-2} n^{-1} r^2 log^{1+2/q}(d) D_n^2}^{1/4} and due to convention (17), we have

Likewise, we can show the lower inequality,

which completes the proof.
Proof of Theorem 2.1. As before, without loss of generality, we assume θ = 0, and

for some sufficiently small c_1 ∈ (0, 1). Define for each ι = (i_1, . . . , i_r) ∈ I_{n,r},

Then by definition,

Step 1. We first show that

Note that conditional on X_1^n, R_n is an average of independent random vectors. Thus by [9, Lemma 8],

By definition (6),

Define

Under assumption (C4) and again by the maximal inequality ([29, Lemma 2.2.2] and Lemma A.11), we have

Then, we have

Then due to Lemma A.9 and (18), we have E[max_{1≤j≤d} |R_{n,j}|] ≲ D_n log^{1/2+1/q}(dn)/√n.
Step 2. We finish the proof by an argument similar to the proof of Corollary 2.2.

Fix any rectangle R = [a, b] ∈ R, where a, b ∈ R^d and a ≤ b. Define

where we recall that Λ_g is defined in (5). Recall the decomposition of U_n into its leading term and the remainder R_n from Step 1. For any t > 0, by Markov's inequality, the result from Step 1, and Corollary 2.2,

Observe that E[Ỹ_{A,j}^2] = 1 for 1 ≤ j ≤ d. By the anti-concentration inequality [10, Lemma A.1],

Finally, taking t = {σ_g^{-2} n^{-1} r^2 log^{2/q}(dn) D_n^2}^{1/4} and due to convention (18), we have

By a similar argument, we can show

which completes the proof.

A.4.1. Bounding N /N
The following lemma follows from an application of Bernstein's inequality and is proved in Step 5 of the proof of [8, Theorem 3.1]; it is included here for easy reference.
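The concentration behind this lemma can be illustrated with a small simulation (assumed notation: we write M for |I_{n,r}|, take Bernoulli sampling with p_n = N/M, and write N̂ for the realized number of sampled tuples; the constants are ours):

```python
import random

# Illustration of the Bernstein-type concentration behind the lemma:
# when each tuple is kept independently with probability p = N / M,
# the realized count n_hat is Binomial(M, p), so n_hat / N
# concentrates around 1 with fluctuations of order 1/sqrt(N).
random.seed(0)
M, N = 10 ** 5, 1000            # |I_{n,r}| and the target budget N
p = N / M
n_hat = sum(1 for _ in range(M) if random.random() < p)
assert abs(n_hat / N - 1) < 0.2  # well within a few std deviations
print(n_hat / N)
```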
We will bound these two terms separately.

A.4.3. Bounding the effect of sampling
The following quantity will appear in the proof of Theorem 2.4:

The next lemma establishes a conditional Gaussian approximation for √N ζ_n.
Lemma A.14. Suppose the assumptions in Theorem 2.4 hold. There exists a constant C, depending on q, such that with probability at least 1 − C/n,

where we recall that Y_B ∼ N(0, Γ_H), and we abbreviate P_{|X,W} for P_{|X_1^n, {W_ι : ι ∈ I_{n,r}}}.

Proof. Consider conditionally independent (conditioned on X, W) R^d-valued random vectors {Ỹ_ι : ι ∈ I_{n,r}} such that

By the triangle inequality, it then suffices to show that each of the following events happens with probability at least 1 − C/n, on which we now focus. Without loss of generality, since σ_g ≥ 1, we assume

for some sufficiently small constant c_1 ∈ (0, 1) that is to be determined. Recall that q_1 = 2 ∨ (2/q) and q* = (6/q + 1) ∨ 7.
Step 1. The goal is to show that the first event in (25), ρ^R_{|X,W}(√N ζ_n, Y) ≤ C̟_n, holds with probability at least 1 − C/n.
Step 1.1. Define

By Theorem 2.1 in [10], there exist absolute constants K_1 and K_2 such that for any real numbers L_n and M_n, we have

In Step 0, we have shown that P(min_{1≤j≤d} Γ̃_{H,jj} ≥ 1/2) ≥ 1 − 13/n. In Steps 1.2–1.4, we select proper L_n and M_n such that the first two events happen with probability at least 1 − C/n. In Step 1.5, we plug in these values.
Step 1.3: bounding M_{n,X}(φ_n). Since Z_ι is a Bernoulli random variable, it is clear that M_{n,X}(φ_n) = 0 on the event

where we use the value (30) for L_n. By assumption (C3') and Lemma A.11,

Due to (26),

Thus if we take c_1 in (26) to be sufficiently small such that

then P(M_{n,X}(φ_n) = 0) ≥ 1 − 2/n and φ_n ≥ 1.
Step 1.4: selecting M_n. From Step 1.3, we have shown that

Then by the same argument as in Step 1.4 of the proof of [8, Theorem 3.1], and due to (26) and φ_n ≥ 1, on the event E′_n, for any ι ∈ I_{n,r},

n log^{2/q+5/6}(dn) C n^{3r/2} exp(−|I_{n,r}|^{11/84}/C).
Thus there exists an absolute constant C_2 such that if we set

then P(M_{n,Y}(φ_n) ≤ M_n) ≥ 1 − 2/n.
Step 1.5: plugging in L_n and M_n. Recall the definitions of L_n and M_n in (30) and (31). With these selections, we have shown that P(E_n) ≥ 1 − C/n, where we recall that

E_n : n σ_H D_n r^{1/q} n^{3r/2} exp(−|I_{n,r}|^{11/84}/C_2) C̟_n,

which completes the proof of Step 1.
Step 2. The goal is to show that the second event in (25) holds on the event {‖Γ̂_H − Γ̃_H‖_∞ ≤ ∆}. From (27) in Step 0,

Without loss of generality, we assume that

Observe that

where we recall that U_n and ζ_n are defined in Section 2.1 and in (24), respectively.
Step 1: the goal is to show that

For any rectangle R ∈ R, observe that

By Lemma A.14, since n^{-1} ≲ ̟_n, we have

where we recall that Y_B is independent of all other random variables. Further, by Theorem 2.1,

Observe that E[(σ_{g,j}^{-1} r Y_{A,j})^2] = r^2 ≥ 1 for any 1 ≤ j ≤ d, that ‖Γ_H‖_∞ ≤ D_n^2 due to (C3'), and that α_n p_n = n/|I_{n,r}| ≤ n^{-1}. Then by the Gaussian comparison inequality [8, Lemma C.5] and due to (32),

Similarly, we can show that P(√n Φ_n ∈ R) ≥ P(r Y_A + √α_n Y_B ∈ R) − C̟_n. Thus the proof of Step 1 is complete.
Step 3: final step. Recall that

Then by the result in Step 1, we have

where the last inequality is due to (32). Similarly, we can show that P(√N U′_{n,N} ∈ R) ≥ P(α_n^{-1/2} Y ∈ R) − C̟_n, and thus

which completes the proof.

A.5. Proofs in Section 3
In this subsection, without loss of generality, we assume q ≤ 1 (see Remark 2.5).
A.5.1. Proof of Theorem 3.1

Proof. Without loss of generality, we can assume θ = E[H(X_1^r, W)] = 0, since otherwise we can center H first. Recall the definition of Λ_H in (5), H̃(·) = Λ_H^{-1/2} H(·), and Γ̃_H, Γ̂_H in (20). Observe that for any integer k, there exists some constant C that depends only on k and ζ such that

Step 0. Define Ũ′_{n,N} := Λ_H^{-1/2} U′_{n,N} and

Since Γ̃_{H,jj} = 1 for 1 ≤ j ≤ d, by the Gaussian comparison inequality [8, Lemma C.5],

Thus it suffices to show that with probability at least 1 − C/n, ∆_B log^2(d) ≤ C n^{-ζ/2}. Without loss of generality, we can assume C_1 n^{-ζ} ≤ 1/16, since we can always take C to be large enough. Then by Lemma A.12, P(N/N̂ ≤ C) ≥ 1 − 2n^{-1}, and thus it suffices to show that P(∆_{B,i} log^2(d) ≤ C n^{-ζ/2}) ≥ 1 − C/n for all i = 1, . . . , 4, on which we now focus.
Step 2. The goal is to show (34).

Thus combining the results from Steps 2.1 and 2.2, we have

where the second inequality is due to (12) and the facts that ν ≥ 7/6 and K = ⌊(n − 1)/(r − 1)⌋.
A.5.4. Proof of Corollary 3.5

Proof. We have shown in Step 0 of the proof of Theorem 3.1 (Subsection A.5.1) that

Further, if we take ν = 7/ζ in Theorem 3.3, then in the proofs of Theorems 3.2 and 3.3 we have shown that P max

The rest of the proof is the same as that of [8, Corollary A.1], and is thus omitted.
A.6. Proof of Lemma 4.1

Proof. Clearly, the claimed inequality holds coordinate-wise, and thus without loss of generality we assume d = 1 and omit the dependence on j.
Combining two parts finishes the proof.
Further, by [ Putting two parts together completes the proof.
Then the proof is complete by combining above results.
Then the proof is complete by combining above results.

A.8. Proofs of additional lemmas
The following lemma is similar to [10, Lemma C.1], and is needed in proving Lemma A.8.
Lemma A.15. Let q ∈ (0, 3], and let ξ be a non-negative random variable such that ‖ξ‖_{ψ_q} ≤ D. Then there exists a constant C, depending only on q, such that E[ξ^3; ξ > t] ≤ C(t^3 + D^3)e^{−(t/D)^q} for all t > 0.
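As a concrete instance of Lemma A.15 (our example, with constants chosen for illustration): for ξ ~ Exp(1) one has ‖ξ‖_{ψ_1} = 2, since E[exp(ξ/t)] = 1/(1 − 1/t) ≤ 2 iff t ≥ 2, and integration by parts gives E[ξ^3; ξ > t] = (t^3 + 3t^2 + 6t + 6)e^{−t}. The bound then holds with C = 16 on a grid of t:

```python
import math

# Lemma A.15 with q = 1, xi ~ Exp(1), D = ||xi||_{psi_1} = 2:
# E[xi^3; xi > t] = (t^3 + 3 t^2 + 6 t + 6) e^{-t}  (integration by parts)
# and we check it against C (t^3 + D^3) e^{-(t/D)^q} with C = 16.
q, D, C = 1, 2.0, 16.0

def truncated_third_moment(t):
    return (t ** 3 + 3 * t ** 2 + 6 * t + 6) * math.exp(-t)

for t in [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 25.0]:
    bound = C * (t ** 3 + D ** 3) * math.exp(-(t / D) ** q)
    assert truncated_third_moment(t) <= bound

print("Lemma A.15 bound verified for the Exp(1) example")
```

Since e^{−t} ≤ e^{−t/2} and t^3 + 3t^2 + 6t + 6 ≤ 16(t^3 + 8) for all t ≥ 0, the check in fact holds for every t > 0, not just the grid.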