On inference validity of weighted U-statistics under data heterogeneity

Motivated by challenges in studying a new correlation measure popularized for evaluating the performance of online ranking algorithms, this manuscript explores the validity of uncertainty assessment for weighted U-statistics. Without the commonly adopted assumptions, we verify the inference validity of Efron's bootstrap and of a new resampling procedure. In full generality, our theory allows both the kernels and the weights to be asymmetric and the data points to be non-identically distributed, issues that historically have not been addressed. To achieve strict generality, for example, we have to carefully control the order of the "degenerate" term in U-statistics, which is no longer degenerate under the empirical measure for non-i.i.d. data. Our results apply to the motivating task, delimiting the regime in which solid statistical inference can be made.


Introduction
This manuscript studies the following general weighted U-statistic of degree m: U_n = ((n − m)!/n!) Σ_{1 ≤ i_1, i_2, …, i_m ≤ n : i_j ≠ i_k if j ≠ k} a_n(i_1, …, i_m) h_n(X_{i_1}, …, X_{i_m}). (1.1) Here we assume X_1, …, X_n are independent but not necessarily identically distributed random variables, taking values in a measurable space (X, B_X) [16]. The weight function a_n(·) and the kernel function h_n(·) are both possibly asymmetric, and both are allowed to depend on the sample size.
Our study of weighted U-statistics is motivated by the following new correlation measure popularized in the information retrieval area [40]. It is formulated as a weighted U-statistic with an asymmetric kernel and asymmetric weights: Here 1(·) represents the indicator function and X_1, …, X_n are taken to be real-valued. For this specific example, X_1, …, X_n correspond to the scores the ranking machine gives to each online page, aligned by the rankings of the human labels. The data points X_1, …, X_n are usually modeled by a location-scale model and are usually non-i.i.d.. The statistic in (1.2), named the average-precision (AP) correlation, aims to evaluate the performance of any given online ranking algorithm by computing a reweighted rank correlation between the algorithm's rankings and the human labels, "giving more weights to the errors at high rankings". For the AP correlation, it is desirable to derive confidence intervals for solid inference. Clearly, τ_AP is an extension of the Kendall's tau statistic: Compared to τ_Ken, the analysis of τ_AP is much more involved, but it falls naturally within the scope of our theory.
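To fix ideas, the two statistics can be sketched in code. The snippet below follows the standard definition of the AP correlation from the information-retrieval literature [40]; the orientation convention (items indexed by human rank with index 0 the best, higher score meaning a better machine rank) is our assumption, and ties are ignored.

```python
import numpy as np

def tau_ap(scores):
    """AP (average-precision) correlation in the style of [40].

    scores[i] is the machine's score for the item humans ranked at
    position i (position 0 = top human rank).  Orientation conventions
    differ across papers; this sketch assumes a perfect machine gives
    strictly decreasing scores.  Ties are not handled.
    """
    x = np.asarray(scores, dtype=float)
    n = len(x)
    machine_order = np.argsort(-x)     # human indices, best machine rank first
    total = 0.0
    for rank in range(1, n):           # machine positions 2..n (0-indexed 1..n-1)
        i = machine_order[rank]        # human index of the item at this position
        above = machine_order[:rank]   # items the machine ranks above it
        # C(i): how many of those the humans also rank above it
        total += np.sum(above < i) / rank
    return 2.0 * total / (n - 1) - 1.0

def tau_ken(scores):
    """Kendall's tau against the human ranking 0..n-1 (no tie handling)."""
    x = np.asarray(scores, dtype=float)
    n = len(x)
    concordant = sum(np.sign(x[i] - x[j])
                     for i in range(n) for j in range(i + 1, n))
    return 2.0 * concordant / (n * (n - 1))
```

For a perfectly concordant score sequence both statistics equal 1, and for a reversed sequence both equal −1; the extra weight τ_AP places on errors at the top of the ranking shows up for intermediate configurations.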
The analysis of unweighted U-statistics (i.e., a_n(·) ≡ 1) has a long history, and there has been a vast literature on their asymptotic behavior since the seminal paper of [12]. For the simple independent and identically distributed (i.i.d.) setting, inference results are summarized in [20], [36], and [16]. As for extensions, [20] proved asymptotic normality under a Lyapunov-type non-i.i.d. condition, [41] and [6] derived central limit theorems and (block) bootstrap inference validity for stationary weakly dependent time series, and [4] proved the m-out-of-n bootstrap inference validity.
Weighted U-statistics are comparatively less studied in the literature. Under the i.i.d. setting, [38] and [27] conducted asymptotic analyses for weighted U-statistics of degree two. [24] and [34] made extensions to weighted U-statistics of degree m ≥ 2, with a focus on the degenerate cases. [13] relaxed the independence assumption, proving asymptotic normality for a wide range of stationary stochastic processes. Recently, [42] generalized the results in [13], proving central and noncentral limit theorems for a class of nonstationary time series.
Despite these substantial advances, an i.i.d. or stationarity assumption is commonly imposed, especially for proving the inference validity of Efron's bootstrap. A notable exception is [42], which established central limit theorems for nonstationary time series. However, bootstrap inference is not discussed there, and the regularity conditions therein are too strong to cover statistics like τ_AP. In addition, the kernels and weights are required to be symmetric.
Motivated by our study of the AP correlation, this manuscript aims to fill the above gaps. In particular, we build a unified theory for analyzing nondegenerate weighted U-statistics, namely, we establish sufficient conditions for their asymptotic normality and for bootstrap inference validity. Both Efron's bootstrap and a new resampling procedure stemming from [31] and [2] are considered. To this end, we drop the i.i.d. assumption, allowing researchers to analyze statistics like τ_AP in practical settings. In addition, our analysis allows both the kernels and the weights to be asymmetric.

Other related work
Our results are closely related to bootstrap inference under data heterogeneity. In [22], Regina Liu pioneered the study of Efron's bootstrap inference validity for non-i.i.d. models. Her results showed that the bootstrap is robust to specific non-i.i.d. settings with common locations (means). However, the bootstrap is very sensitive to mean differences. The inference validity is captured by a function of {μ_i := EX_i}_{i=1}^n, which she called "heterogeneity factors" [22, 23]. For example, for the sample mean, in the worst case the distance between the largest and smallest means needs to shrink to zero as n → ∞ for bootstrap consistency. [25] summarized the existing results, providing necessary and sufficient conditions for bootstrap validity of sample-mean-type statistics under non-i.i.d. settings.
Politis and Romano's subsampling [32] and many other resampling schemes [2] are appealing alternatives to Efron's bootstrap. They are designed to correct the bootstrap inconsistency problem in many different settings, where the data could be, for example, dependent or heavy-tailed. In this manuscript, we examine a new resampling procedure's inference validity for weighted U-statistics.

Notation
Let R be the set of real numbers, and Z the set of integers. For a positive integer n, we write [n] = {a ∈ Z : 1 ≤ a ≤ n}. For any set A, let card(A) represent the cardinality of A. Let →d denote "convergence in distribution", and →P denote "convergence in probability". Let "a.s." abbreviate "almost surely". Let Φ(t) be the cumulative distribution function of the standard Gaussian. For two positive integers m < n, define the binomial coefficient (n choose m) = n!/{m!(n − m)!}, where n! represents the factorial of n. Let C be a generic absolute positive constant, whose actual value may vary from place to place. For any two real sequences {a_n} and {b_n}, we write a_n ≲ b_n, or equivalently b_n ≳ a_n, if there exists an absolute constant C such that |a_n| ≤ C|b_n| for all sufficiently large n. We write a_n ≍ b_n if both a_n ≲ b_n and a_n ≳ b_n hold. We write a_n ≺ b_n, or equivalently b_n ≻ a_n, if a_n ≲ b_n holds but b_n ≲ a_n does not. We write a_n = O(b_n) if a_n ≲ b_n, and a_n = o(b_n) if a_n = O(b_n) but b_n = O(a_n) does not hold. We write a_n = O_P(b_n) if a_n/b_n is stochastically bounded, that is, for any ε > 0 there exist a finite M > 0 and a finite N > 0 such that P(|a_n/b_n| > M) < ε for all n > N. We write a_n = o_P(b_n) if for any ε > 0, lim_{n→∞} P(|a_n/b_n| ≥ ε) = 0.

Structure of the manuscript
The rest of the manuscript is organized as follows. In Section 2 we provide the unified theory for asymmetric weighted U-statistics, deriving a central limit theorem and the inference validity of the bootstrap and of a new resampling procedure under non-i.i.d. settings. In Section 3, we apply the developed theory to explore the inference validity of Kendall's tau in (1.3) and the AP correlation in (1.2). All proofs are relegated to the Appendix.

Main results
Throughout the manuscript, we focus on the following triangular array setting: assume we have n independent random variables {X_{n,i}}, n ≥ 1, 1 ≤ i ≤ n, with each X_{n,i} following the distribution P_{n,i}. The elements of {P_{n,i}, i ∈ [n]} are not necessarily equal to each other, and P_{n,i} may change as n increases. For notational simplicity, in the sequel we drop n from the subscripts of X_{n,i} and P_{n,i} when no confusion arises.
We are focused on the following weighted U-statistic of degree m, with weight function a(·) : Z^m → R and kernel h(·) : X^m → R: U_n = U_n(X_1, …, X_n) = ((n − m)!/n!) Σ_{I_n^m} a_n(i_1, …, i_m) h_n(X_{i_1}, …, X_{i_m}). (2.1) Here the summation is over the set I_n^m of all index vectors of m distinct elements of [n]. Such a U_n is usually referred to as a weighted U-statistic in the literature [36]. We do not assume symmetry of a_n(·) or h_n(·) in their arguments. For notational simplicity, in the sequel we omit the subscript n in a_n(·) and h_n(·). Let us define θ(i_1, …, i_m) := E{h(X_{i_1}, …, X_{i_m})} = ∫ h(y_1, …, y_m) dP_{i_1}(y_1) ⋯ dP_{i_m}(y_m) (2.2) to be the population mean of h(X_{i_1}, …, X_{i_m}). For any l ∈ [m], define π_l(·; ·) to be the function that takes two arguments (a scalar and a vector of length m − 1) and returns a vector of length m by inserting the first argument into the l-th position of the second argument. Formally, we define π_l(y; y_1, y_2, …, y_{m−1}) := (y_1, …, y_{l−1}, y, y_l, …, y_{m−1}).
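A brute-force evaluation of (2.1) may help parse the notation. The sketch below sums over all ordered tuples of distinct indices (the set I_n^m), since neither the weight a nor the kernel h is assumed symmetric; it runs in O(n^m) time and is for illustration only, with a and h supplied by the user.

```python
import itertools
import math

def weighted_u_stat(x, a, h, m=2):
    """Evaluate U_n = ((n-m)!/n!) * sum over distinct ordered index
    tuples of a(i_1,...,i_m) * h(x_{i_1},...,x_{i_m}), as in (2.1).

    a and h need not be symmetric, so the sum ranges over ordered
    tuples (permutations) of distinct indices.
    """
    n = len(x)
    norm = math.factorial(n - m) / math.factorial(n)
    total = 0.0
    for idx in itertools.permutations(range(n), m):
        total += a(*idx) * h(*(x[i] for i in idx))
    return norm * total
```

With a ≡ 1 and the asymmetric kernel h(x, y) = 1(y > x), this reduces to the kernel underlying the rank statistics of Section 3.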
We further define the following quantities. Define the first-order expansion of h(·) for each X_i, with regard to the specific sequence X_{i_1}, …, X_{i_{m−1}}, to be: Define the first-order expansion of h(·) for X_i to be where the summation is over For l ∈ [m], we write (i_1, …, i_m)\i_l := (i_1, …, i_{l−1}, i_{l+1}, …, i_m), and define Before presenting the main theorem, we need to introduce more notation on the weight function a(·). For K, q ∈ Z with K ≥ 2 and 0 ≤ q ≤ m, let (I_n^m)^{⊗K}_{≥q} be the collection of all K-tuples of index vectors from I_n^m that share at least q common indices: and (I_n^m)^{⊗K}_{=q} be the collection of all K-tuples of index vectors from I_n^m that share exactly q common indices: With K, q, m fixed, it is easy to observe that card{(I_n^m)^{⊗K}_{≥q}} ≍ card{(I_n^m)^{⊗K}_{=q}} as n → ∞, and In particular, we have card{(I_n^m)^{⊗2}_{≥2}} ≍ n^{2m−2}, card{(I_n^m)^{⊗2}_{≥1}} ≍ n^{2m−1}, and card{(I_n^m)^{⊗3}_{≥1}} ≍ n^{3m−2}. Define the average weight, A_{K,q}(n), as The following theorem gives sufficient conditions on the weights and the distributions of {X_i} guaranteeing that U_n is asymptotically normal, with h_{1,i}(·) as defined in (2.4). Theorem 2.1. Assume the following conditions hold: Then we have and The first step of the proof, which establishes a von Mises expansion type result, is simple yet instructive. Of note, under i.i.d. settings, an analogous theorem has been (implicitly) stated in [38]. Lemma 2.2 (Hoeffding's decomposition). With h_{1,i}(·) and h_{2;i_1,…,i_m}(·) defined in (2.4) and (2.5), we have and, for any i, k ∈ [n] and (i_1, …, i_m) ∈ I_n^m, E{h_{1,i}(X_i)} = 0, (2.14)

F. Han and T. Qian
To place Theorem 2.1 appropriately in the literature, let us first give a brief review of the most relevant existing results. The first proof of asymptotic normality for (unweighted) nondegenerate U-statistics was given by Hoeffding [12]. Grams and Serfling [10] studied general unweighted U-statistics of degree m ≥ 2 and bounded their central moments; the techniques therein also play a central role in our analysis. [38] and O'Neil and Redner [27] analyzed the asymptotic behavior of weighted U-statistics of degree 2, assuming the weight function a(·) to be symmetric. The above results all assume i.i.d. data. For unweighted U-statistics, [20] outlined an extension to non-i.i.d. data.
Theorem 2.1 is stronger than the results in the literature, allowing a(·) and h(·) to be asymmetric and the X_i's to be non-i.i.d.. By examining the proof, one can also easily check that, when the corresponding symmetry, boundedness, or i.i.d. assumptions are imposed, our results reduce to those of Hoeffding [12], [38], O'Neil and Redner [27], and [20]; cf. (2.12). Condition (2.9) evolves from the Lyapunov condition with δ = 1, and it can readily be weakened to the condition with a smaller 0 < δ < 1 or to the Lindeberg-Feller condition. Condition (2.7) is imposed, and could be weakened, based on the same argument. For clarity of presentation, we choose the current conditions.
Inferring the distribution of U_n, or approximating Var(U_n), is usually challenging in practice, so resampling procedures are recommended. The rest of this section gives asymptotic results for Efron's bootstrap [7] and for a new resampling procedure for approximating Var(U_n).
Due to the heterogeneity of the P_i, it is well known that the bootstrap may no longer be consistent [22]. However, it is still possible to recover bootstrap consistency by restricting the degree of heterogeneity. Before that, let us first provide a theoretically interesting theorem. It states that, under very mild conditions, the bootstrapped mean from the set {h_{1,i}(X_i) : 1 ≤ i ≤ n} consistently approximates the distribution of n^{−1} Σ_{i=1}^n h_{1,i}(X_i). This is consistent with the discovery in [22]. Theorem 2.4 (Sufficient condition for bootstrapping the main term to work). Assume (2.9) holds. In addition, assume that for every ε > 0 we have where P* denotes the conditional probability given X_1, …, X_n. If further (2.8) holds, then Remark 2.5. Equations (2.17) and (2.18) are rather mild constraints. As we will show in Corollary 3.1, they can usually be deduced directly from the asymptotic normality of U_n. However, unless we know much about X_i, the form of h_{1,i}(·) is unknown.
We now focus on bootstrapping the original U-statistic to estimate Var(U_n). The following theorem shows that Efron's bootstrap still gives a consistent variance estimate for U_n under some additional conditions on the data heterogeneity. Although the bootstrap inference validity for U-statistics under i.i.d. assumptions has been established (see, for example, [16]), the corresponding result for non-i.i.d. settings, even for the simplest unweighted U-statistics, is still absent from the literature. Our manuscript fills this gap. Theorem 2.6 (Sufficient condition for consistent bootstrap variance estimation). Given X_1, …, X_n, let X*_1, …, X*_n denote the bootstrapped sample, i.e., i.i.d. draws from the empirical distribution of X_1, …, X_n. Define the bootstrapped U-statistic Assume all conditions in Theorem 2.1 are satisfied. Also assume the following conditions hold: (i) bounded second moment of the von Mises type kernel: and Here we define r := (r_1, …, r_m), and similarly for s and k.
Then we have where the operator Var * (·) denotes the conditional variance given X 1 , . . . , X n .
The detailed proof of Theorem 2.6 is quite involved and highly combinatorial. Of note, condition (2.21) comes from [1] and ensures that the bootstrapped U-statistic does not explode. Equations (2.22) and (2.23) ensure that the conditional variance of n^{−1} Σ_{i=1}^n h_{1,i}(X*_i) approximates Var(U_n). Equation (2.24) ensures that U*_n(a, h_2) is negligible compared to n^{−1} Σ_{i=1}^n h_{1,i}(X*_i). Remark 2.7. Although U_n(a, h_2) in the decomposition (2.12) is degenerate, and hence negligible under the conditions of Theorem 2.1, its bootstrapped version U*_n(a, h_2) is not necessarily degenerate, because the empirical measure does not equal the true measure. This makes U*_n(a, h_2) not necessarily negligible compared to the bootstrapped version of the main term, n^{−1} Σ_{i=1}^n h_{1,i}(X*_i). Therefore, the bootstrap may fail without careful control of both the main term and the remainder U*_n(a, h_2). We develop a delicate analysis to bound U*_n(a, h_2) and show that it is negligible under the constraint (2.24). Remark 2.8. Condition (2.24) imposes homogeneity conditions mainly on the means. This is consistent with Theorem 2.4 and with the discoveries in [22], which showed that the bootstrap is most sensitive to mean differences. To illustrate, assume a(·) ≡ 1 and the kernel h(·) to be a bounded function, and assume the conditions of Theorem 2.1 hold, so that U_n is asymptotically normal. Equation (2.9) requires σ_n² ≳ n^{−4/3}. Therefore, for (2.24) to hold, it is necessary that M_1(n)² ≲ n^{−1/3} and M_2(n) ≲ n^{−1/3}. The room to improve our requirements, if any, is relatively small: even for the simplest sample-mean-type statistics, in most cases [22] required the mean differences to shrink to zero as n → ∞ for bootstrap consistency.
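In practice the conditional variance Var*(U*_n) of Theorem 2.6 is approximated by Monte Carlo: resample with replacement, recompute the statistic, and take the empirical variance of the replicates. A minimal sketch follows; the statistic `stat` and the replicate count are placeholders to be chosen by the user.

```python
import numpy as np

def bootstrap_var(x, stat, n_boot=2000, rng=None):
    """Monte Carlo approximation of the Efron bootstrap variance:
    draw n_boot resamples of x with replacement, apply `stat` to each,
    and return the empirical variance of the replicates.

    `stat` is any function of a 1-D sample (e.g. a U-statistic)."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        reps[b] = stat(rng.choice(x, size=len(x), replace=True))
    return reps.var(ddof=1)
```

Theorem 2.6 concerns the exact conditional variance; the Monte Carlo error above is a separate, standard approximation layer controlled by n_boot.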
and for each i ∈ [n − b + 1], X*_{i_1,b,i}, …, X*_{i_m,b,i} are drawn independently, with replacement, from the empirical distribution of {X_i, …, X_{i+b−1}}. The tuning parameter h_n regulates the scale.
The following theorem verifies the new resampling procedure's inference consistency for V*_n, showing that the procedure tends to give a conservative variance estimate under non-i.i.d. settings. It also shows that the inference is more tractable than Efron's bootstrap when we have more prior information on the degree of heterogeneity, reflected in the consistency rate of U_n and the choice of h_n. We also refer the reader to Remark 3.4 and the discussion therein for the order of σ_n in a specific example. Theorem 2.10. Assume that all conditions in Theorem 2.6 hold for each "moving block"
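The definition of V*_n is only partially reproduced above, so the sketch below should be read as one plausible reading of a moving-block variance estimator, not the paper's exact procedure: it bootstraps within each length-b moving block, averages the per-block variances, and applies a b/n rescaling (valid for root-n-consistent statistics) that stands in for the tuning/scaling constant h_n; both the aggregation rule and the rescaling are assumptions of this sketch.

```python
import numpy as np

def moving_block_var(x, stat, b, n_boot=200, rng=None):
    """Moving-block resampling variance estimate (illustrative sketch).

    For every block {x_i, ..., x_{i+b-1}}, draw n_boot i.i.d. resamples
    of size b within the block, take the empirical variance of the
    recomputed statistic, and average over blocks.  The final b/n
    factor maps a size-b variance to a size-n one and is an assumed
    stand-in for the paper's scaling."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    n = len(x)
    block_vars = []
    for i in range(n - b + 1):
        block = x[i:i + b]
        reps = np.array([stat(rng.choice(block, size=b, replace=True))
                         for _ in range(n_boot)])
        block_vars.append(reps.var(ddof=1))
    return (b / n) * float(np.mean(block_vars))
```

Because each block only sees b consecutive observations, mean heterogeneity across distant indices inflates Efron's bootstrap but affects this estimator less, which is the intuition behind its conservative behavior under non-i.i.d. settings.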

Application
This section explores two specific statistics, Kendall's tau (denoted τ_Ken) [14] and the average-precision (AP) correlation (denoted τ_AP) [40]: Without loss of generality, we focus on the transformed versions of these two statistics: We assume {P_i, i ∈ [n]} to be absolutely continuous with respect to the Lebesgue measure. Obviously, both U_n^Ken and U_n^AP enjoy the distribution-free property [15] when the data are i.i.d.. Of note, these two statistics could also be treated as (weighted) U-statistics with symmetric kernels and weights based on the non-i.i.d. data (X_1, 1), …, (X_n, n). However, we found the following analysis based on X_1, …, X_n much neater and, as will be seen in the proofs, non-i.i.d.-ness is the major obstacle in the analysis.

Asymptotic theory
Note that the statistics U_n^Ken and U_n^AP share the same kernel h(x, y) = 1(y > x). Using the definition in (2.2), we have θ(i, j) = E{h(X_i, X_j)} = P(X_j > X_i). The forms of h_{1,i}(·) and h_{2;i,j}(·) for U_n^Ken and U_n^AP are summarized in the following two lemmas.
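As a concrete instance of θ(i, j) = P(X_j > X_i): if X_i ~ N(μ_i, σ_i²) independently (the Gaussian case is our choice here, matching the location-scale model used below), then X_j − X_i ~ N(μ_j − μ_i, σ_i² + σ_j²), so θ(i, j) = Φ((μ_j − μ_i)/√(σ_i² + σ_j²)). A quick numerical check:

```python
import math
import random

def theta_gaussian(mu_i, mu_j, sd_i, sd_j):
    """theta(i, j) = P(X_j > X_i) for independent Gaussians, closed form:
    Phi((mu_j - mu_i) / sqrt(sd_i^2 + sd_j^2))."""
    z = (mu_j - mu_i) / math.hypot(sd_i, sd_j)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def theta_mc(mu_i, mu_j, sd_i, sd_j, n=100_000, seed=0):
    """Monte Carlo check of the same probability."""
    rnd = random.Random(seed)
    hits = sum(rnd.gauss(mu_j, sd_j) > rnd.gauss(mu_i, sd_i)
               for _ in range(n))
    return hits / n
```

For μ_i = 0, μ_j = 1, σ_i = σ_j = 1, the closed form gives Φ(1/√2) ≈ 0.76, and the Monte Carlo estimate agrees to within sampling error.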
Lemma 3.2 (Hoeffding's decomposition of U_n^AP). We have such that for any sufficiently large n and for each i ∈ [n], one of the following two conditions holds: In addition, if If we have then U_n^AP is asymptotically normal, The proof of Theorem 3.3 exploits Theorem 2.1. A key step in the proof is to bound V(n) := n^{−2} Σ_i Var{h_{1,i}(X_i)} from below. The magnitude of Var{h_{1,i}(X_i)} varies greatly with i, making it challenging to bound the entire summation. To tackle this, we break V(n) into summations over multiple subsets of [n]; within each of these summations, the magnitude of Var{h_{1,i}(X_i)} is stable. We then develop bounds on the summations over indices i with large Var{h_{1,i}(X_i)}.
The sequences {δ_n} and {p_n} in Conditions (i) and (ii) of Theorem 3.3 characterize the degree of heterogeneity among the P_i's. If all the P_i's are identical, it is easy to check that there exist absolute constants δ_n and p_n not depending on n such that Condition (i) or (ii) holds. Equations (3.3) and (3.4) allow δ_n and p_n to decay to zero as n → ∞. The permissible decay rate of δ_n³ p_n depends on the average weight of each of the two statistics. The conditions for asymptotic normality of U_n^AP in (3.4) are slightly stronger than those for U_n^Ken in (3.3), because for U_n^AP the weights are much more skewed. Motivated by the studies in [40], in the sequel we consider the following specific location-scale model. In particular, given real values μ_1 ≥ μ_2 ≥ … ≥ μ_n and σ_1², …, σ_n² > 0, consider absolutely continuous (with respect to the Lebesgue measure) probability distributions P_i with mean μ_i and variance σ_i² for i ∈ [n], and assume X_1, …, X_n are independent draws from P_1, …, P_n. The following theorem characterizes explicit sufficient conditions on {(μ_i, σ_i), i ∈ [n]} for Kendall's tau and the AP correlation to be asymptotically normal. For n, i, j such that Then the following results hold.
(i) Assume there exist absolute constants c_1, c_2 > 0, b_1 > b_2 > 0, and t_0 > 0 such that for any n, i, j with 1 ≤ i ≠ j ≤ n and any t ≥ t_0, Then the sufficient condition for asymptotic normality of U_n^Ken is (3.8), and the sufficient condition for asymptotic normality of U_n^AP is (3.9). (ii) Assume there exist absolute constants c_1, c_2 > 0, b_1 > b_2 > 0, and t_0 > 0 such that for any n, i, j with 1 ≤ i ≠ j ≤ n and any t ≥ t_0, Then the sufficient condition for asymptotic normality of U_n^Ken is (3.12), and the sufficient condition for asymptotic normality of U_n^AP is where and ξ(p) := 1(p ≤ 1) + 2^{p−1} 1(p > 1).
We now compare Condition (3.8) in (i) with Condition (3.12) in (ii) for U_n^Ken. Assume σ_i = 1 for all i ∈ [n]. In this case we have ρ_n = 1, and R_n = max_{1≤i≠j≤n} |μ_i − μ_j| becomes the spread of the means. Equation (3.8) becomes . Rearranging terms, we obtain a sufficient condition for (3.16) to hold: For heavy-tailed distributions as in (i), (3.15) implies that the spread of the means should not grow faster than a polynomial in n. For light-tailed distributions as in (ii), (3.17) implies that the spread of the means should not grow faster than the logarithm of n (up to a constant factor). Of note, under both tail conditions, R_n is allowed to increase to infinity at a proper rate.

Example 3.1.
A special distribution satisfying the conditions of Theorem 3.5(ii) is the Gaussian. Again, consider U_n^Ken and assume σ_i = 1 for all i ∈ [n]. Note that in this case F_j^c(·) is the survival function of a Gaussian with variance 1, whereas F_{ji}^c(·) is that of a Gaussian with variance 2. Let λ = 2, b_1 = 1/2 + ε, b_2 = 1/4 − ε for arbitrarily small ε > 0, and let c_1, c_2, t_0 be properly chosen constants (whose values do not affect the rate in (3.17)). Equations (3.6) and (3.7) are satisfied owing to Lemma C.11. It then follows from (3.17) that is sufficient for U_n^Ken to be asymptotically normal.
Remark 3.7. We comment on a modified version of Theorem 3.5(i), with a condition alternative to (3.6) (A similar modification applies to Theorem 3.5(ii)).
In detail, define F_j(t) = P{(X_j − μ_j)/σ_j ≤ t} to be the standardized cumulative distribution function complementary to the survival function F_j^c(t). The conclusion of Theorem 3.5(i) still holds if we replace condition (3.6) by For comparison, (3.6) regulates the upper-tail behavior of X_j, whereas (3.18) regulates the lower tail of X_j. Technically speaking, the proof of Theorem 3.5(i) verifies Condition (ii) of Theorem 3.3, whereas the alternative version verifies Condition (i) of Theorem 3.3. Note that (3.7) is required in both versions and regulates both the upper- and lower-tail behaviors of X_j − X_i.
The following three corollaries give asymptotic results for bootstrapping U_n^Ken and U_n^AP. The first of them states that bootstrapping the main term is very insensitive to non-i.i.d.-ness of the data, as expected from the results in [22]. As shown in Section 2, bootstrapping the whole U-statistic requires much stronger assumptions to guarantee consistency. The following two corollaries provide sufficient conditions for bootstrap inference validity of the two considered U-statistics.
In addition, assume there exist η² > 0 and an absolute constant C > 0 such that for all i ∈ [n] and all 1 ≤ j, k ≤ n with j ≠ i and k ≠ i, Assume η² ≠ θ². Then we have In addition, assume there exist η² > 0 and an absolute constant C > 0 such that for all 1 ≤ i ≤ n and all 1 ≤ j, k ≤ n with j ≠ i and k ≠ i, In the proofs of Corollaries 3.2 and 3.3, to verify (2.22) we exploit the weak law of large numbers for independent but not identically distributed variables. To verify (2.23), we break the left-hand side into the sum of an unweighted U-statistic and a negligible term, and apply the law of large numbers for unweighted U-statistics.
Remark 3.8. The condition η² ≠ θ² in Corollaries 3.2 and 3.3 is mild. In the i.i.d. case, it essentially requires that the X_i's are not degenerate random variables. To see this, let θ := P(X_1 > X_2) and define η² accordingly: Jensen's inequality implies that η² ≥ θ², with equality only if X_i is a degenerate random variable.
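To make the Jensen step in Remark 3.8 concrete under i.i.d. data, one natural candidate for η² (assumed here, since its exact definition appears in the corollaries) is the second moment of the conditional probability g(x) := P(X_2 < x) = F(x):

```latex
% Under i.i.d. data, write g(x) := P(X_2 < x) = F(x), so that
%   \theta = P(X_1 > X_2) = E\{g(X_1)\}.
% With the assumed reading \eta^2 := E\{g(X_1)^2\}, Jensen's
% inequality for the convex map t \mapsto t^2 gives
\[
  \eta^2 \;=\; E\{g(X_1)^2\} \;\ge\; \bigl[E\{g(X_1)\}\bigr]^2 \;=\; \theta^2,
\]
% with equality if and only if g(X_1) = F(X_1) is almost surely
% constant, i.e., if and only if X_1 is degenerate.
```

Thus η² = θ² forces F(X_1) to be almost surely constant, which for an absolutely continuous F can only happen for a degenerate X_1.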

Numerical experiments
In this section, we evaluate the developed theory and examine the finite-sample behavior of Kendall's tau and the AP correlation via synthetic data analysis. Both the central limit theorem and bootstrap inference validity are examined under different degrees of data heterogeneity. The numerical results show that the central limit theorem holds under relatively weak data homogeneity requirements, whereas bootstrap variance estimation is much more sensitive to data heterogeneity. These observations agree with the theory developed in this manuscript.
In the first simulation study, we examine the validity of the central limit theorem for Kendall's tau and the AP correlation. We generate the data from a Gaussian distribution and from a t-distribution. For the Gaussian distribution, each time we generate the data sequence X_1, …, X_n with X_i ∼ N(θ_i, 1). The means {θ_i, i ∈ [n]} are equally spaced between R_n and 0, with R_n = max |θ_i − θ_j| representing the degree of heterogeneity, taking values 0, 10, 30, and 50. For the t-distribution, we generate X_1, …, X_n with each X_i following a noncentral t-distribution with noncentrality parameter θ_i and 5 degrees of freedom. The noncentrality parameters {θ_i, i ∈ [n]} are equally spaced between R_n and 0, and R_n takes values 0, 8, 25, and 42. We choose these values of R_n so that the difference between the means of X_1 and X_n is similar under the Gaussian and t-distributions. We consider sample sizes n of 50, 100, 200, and 500.
Under each setting, we repeat the simulation 50,000 times. We use two goodness-of-fit tests to examine the normality of the considered statistics: the Cramer-von Mises test (CvM) and the Lilliefors test (L). Both tests are implemented in the R package "nortest", and we refer the reader to Thode [39] for detailed descriptions of these tests. We also calculate the coverage probabilities of confidence intervals of nominal levels 80% and 95% based on the Gaussian approximation. Table 1 presents the p-values of the two tests for normality, along with the coverage probabilities, when the data are generated from the Gaussian distribution. For both statistics, with a large sample size (n = 500), normality is plausible for R_n up to 50, as both tests fail to reject at significance level 0.05. Test rejections occur as the sample size decreases. In terms of confidence intervals, for a sample size as small as n = 50, the coverage probabilities are all close to the nominal level, even for large R_n. Note that with R_n = 50 the 95% confidence interval for U_n^AP becomes slightly conservative, especially at smaller sample sizes. Table 2 presents the p-values and the coverage probabilities when the data follow the noncentral t-distribution. The trend is similar to that of Table 1; by comparison, we observe that the statistics are more robust to location shifts for the heavy-tailed t-distribution than for the Gaussian distribution, supporting our theoretical discoveries.
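The first simulation study can be sketched as follows. This is a scaled-down illustration, not the paper's exact protocol: it uses fewer replicates, Kendall's tau only, and checks normality through the coverage of a normal-approximation interval calibrated on the Monte Carlo replicates themselves.

```python
import numpy as np

def simulate_coverage(n=100, R_n=10.0, n_rep=2000, rng=0):
    """Sketch of the first simulation: X_i ~ N(theta_i, 1) with means
    equally spaced between R_n and 0, Kendall's tau recomputed n_rep
    times, and the fraction of replicates inside the 95% normal band
    (mean +/- 1.96 sd of the replicates) reported.  Near 0.95 when
    the replicates are approximately Gaussian."""
    rng = np.random.default_rng(rng)
    theta = np.linspace(R_n, 0.0, n)

    def tau_ken(x):
        i, j = np.triu_indices(n, k=1)
        return 2.0 * np.sum(np.sign(x[i] - x[j])) / (n * (n - 1))

    taus = np.array([tau_ken(theta + rng.standard_normal(n))
                     for _ in range(n_rep)])
    mu, sd = taus.mean(), taus.std(ddof=1)
    return float(np.mean(np.abs(taus - mu) <= 1.96 * sd))
```

The paper's study instead uses 50,000 replicates, formal goodness-of-fit tests (CvM and Lilliefors), and both statistics; this sketch only reproduces the qualitative behavior.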
In the second simulation study, we examine the consistency of bootstrap variance estimation and present the results in Tables 3-6. We consider the following three approaches: (i) bootstrapping the original U-statistic as in Theorem 2.6, termed "Efron" in the tables; (ii) bootstrapping the main term of the U-statistic as in Theorem 2.4, termed "Efron-main term" in the tables; (iii) the new resampling strategy as in Theorem 2.10, termed "moving-block" in the tables. Among them, the "Efron-main term" bootstrap is not of practical use because it requires knowledge of h_{1,i}(X_i), which depends on the probability distribution of X_i; we include it for theoretical purposes, to validate Theorem 2.4. As in the first simulation study, we generate the data from a Gaussian distribution and a t-distribution. For the Gaussian distribution, we simulate X_i ∼ N(θ_i, 1). For the t-distribution, we simulate X_i following a noncentral t-distribution with noncentrality parameter θ_i and 5 degrees of freedom. For both distributions, the parameters {θ_i, i ∈ [n]} are equally spaced between R_n and 0, and the degree of heterogeneity R_n is set to 0, 1, 2, and 3. We consider sample sizes n of 50, 100, 200, and 500. We set the number of bootstrap replicates within each simulation to 2,000 in approaches (i) and (ii), and to 200 for each block in approach (iii), with block size n/5 in (iii). Under each setting, we repeat the simulation 50,000 times. In the "bias" column of each table, we present the relative bias of the bootstrap variance estimators, defined as {Var(U_n) − Var̂(U_n)}/Var(U_n), where Var̂(U_n) denotes the bootstrap variance estimate. A positive/negative relative bias means that the bootstrap method tends to underestimate/overestimate the variance. We also compute the coverage probabilities of confidence intervals of nominal levels 80% and 95% based on the Gaussian approximation and the estimated variance.
Table 3 shows the performance of the three bootstrap variance estimators for U_n^Ken when X_i follows the Gaussian distribution. When there is no heterogeneity in the data (R_n = 0), all three bootstrap methods estimate the variance consistently, with close-to-zero bias and close-to-nominal confidence interval coverage. As the distributions of the X_i become more heterogeneous (larger R_n), bootstrapping the main term still estimates the variance consistently, whereas Efron's bootstrap and the moving-block bootstrap tend to overestimate the variance, resulting in negative relative bias and larger-than-nominal confidence interval coverage. This is as expected from Corollary 3.1 and Corollary 3.2. Table 4 gives the bootstrap performance for U_n^AP when X_i follows the Gaussian distribution, and the trend is similar to that for U_n^Ken. A comparison between Table 3 and Table 4 shows that the finite-sample performance of all three bootstrap methods is better for U_n^AP than for U_n^Ken. This is consistent with our theoretical findings in Corollary 3.2 and Corollary 3.3. Tables 5 and 6 present the results for both statistics when X_i follows the t-distribution. The trends are similar to the Gaussian case; by comparison, the statistics are more robust to location shifts under the t-distribution, supporting our theorems.
Comparing the first and second simulation studies, we see that the central limit theorem for the considered statistics holds under much weaker homogeneity conditions than the resampling procedures require. This is as expected from the theory developed in Section 3.1. We also see that the central limit theorem holds approximately for sample sizes as small as n = 50, whereas bootstrap variance estimation requires a much larger sample size for decent performance.

Discussion
One of the main foci of this manuscript is the consistency of bootstrap variance estimators for U-statistics under data heterogeneity. The proof is based on brute-force combinatorial calculation, which cannot readily be extended to analyzing bootstrap distributional consistency. We believe that, using techniques developed in [25] and [11], it is promising to devise the corresponding bootstrap distributional consistency theory. However, some challenges and open problems remain to be resolved before a rigorous distributional consistency theory can be established; details will be worked out in future work.
We have considered U-statistics with data that are independent but not identically distributed. In the literature, there have been many developments of bootstrap methods for stationary time series since the seminal work on block bootstrap methods by [18]; see, for example, [30], [19], and [31]. Among the few developments for nonstationary time series, [8] showed that the block bootstrap is robust for linear regression estimation, and [9] established consistency of the block bootstrap variance estimator of sample means. To the best of our knowledge, there is no work on bootstrapping U-statistics in the nonstationary time-series setting. It would be interesting to extend the techniques of this manuscript to allow for dependent data. We believe our techniques and those used in the time-series bootstrap literature (e.g., [3], [28], and [37]) can potentially be combined to analyze bootstrapped U-statistics for nonstationary time-series data. However, the analysis will become even more challenging technically, and is left for future research.

Appendix A: Proofs of main results
In this section, we prove the theoretical results presented in the manuscript, in the order they appear. For succinctness, the supporting lemmas that appear in the proofs are proven in Section B. In those proofs, we sometimes also refer to certain auxiliary results, which are numbered with the prefix C.

A.1. Proof of Theorem 2.1
Proof. By Lemma 2.2 and Slutsky's theorem, for proving Theorem 2.1 it suffices to establish (A.2) and (A.3). First we show (A.2) using Lyapunov's central limit theorem (Lemma C.4). The following lemma gives a bound on the relevant moment, where C is some absolute constant.
By Lemma A.1 and the fact that E{h 1,i (X i )} = 0, we deduce Equation (A.6) and Lemma C.4 with δ = 1 yield (A.2). Next we show (A.3). To simplify notation, let i denote the index vector (i 1 , . . . , i m ) and X i denote (X i1 , . . . , X im ). Consider two index vectors i, j from I m n . If i ∩ j = ∅, by independence of the X i 's we have Cov{h 2;i (X i ), h 2;j (X j )} = 0. If i ∩ j = {i p } = {j q } for some p, q ∈ [m] (i.e., the two vectors share exactly one common index), Lemma C.2 and (2.15) imply that (A.7) By Lemma C.1(i) and the Cauchy–Schwarz inequality, the right-hand side of (A. Equations (A.8) and (A.9) imply that This completes the proof.

A.3. Proof of Theorem 2.4
Proof. In Lemma C.6, let Y n,i = σ −1 n h 1,i (X i ), g n be the identity function, t n = 0, and σ 2 n = 1. By the definition of T n , we have T n = n −1 n i=1 σ −1 n h 1,i (X i ).

A.4. Proof of Theorem 2.6
Proof. By the definition of σ 2 n , we have Var(σ −1 n U n ) = 1. For proving Theorem 2.6 it suffices to show that Multiplying by σ −1 n and then taking Var * on both sides of (A.16) yields where Cov * (·) denotes the covariance operator under the empirical measure. By (A.18) and Slutsky's theorem, for proving (A.15) it suffices to show the following: First we prove (A.19). Since, conditional on X 1 , . . . , X n , the X * i 's are i.i.d. draws from the empirical distribution of X 1 , . . . , X n , we have

Equations (A.22) and (A.23) yield
By the weak law of large numbers for i.i.d. random variables, we have Next we prove (2.23). By algebra we have Equation (2.22) implies that the first term on the right-hand side of (A.27) converges to 0 in probability. The second term on the right-hand side of (A.27) equals (n − 1)/n times a U-statistic with symmetric kernel g(x, y) = n −2 σ −2 n × n i=1 h 1,i (x)h 1,i (y). By the triangle inequality, Jensen's inequality, and the i.i.d.-ness of the X i 's, we deduce (A.28) The i.i.d.-ness of the X i 's and the fact that E{h 1,i (X i )} = 0 yield Equations (A.30) and (A.31) imply that M 2 (n) = 0. Therefore, (2.24) follows from M 1 (n) = M 2 (n) = 0 and the assumption that n −2 σ −2 n A 2,1 (n) → 0.
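The key structural fact used above — that, conditional on the data, the X * i 's are i.i.d. uniform draws from {X 1 , . . . , X n } — can be checked numerically by exhaustive enumeration (a small illustration of ours, not part of the proof): for a tiny sample, the bootstrap mean has conditional expectation equal to the sample mean, and conditional variance equal to the plug-in variance divided by n.

```python
import itertools
import statistics

x = [0.3, 1.2, -0.7]   # a tiny sample, so all resamples can be enumerated
n = len(x)

# Conditional on the data, each X*_i is an independent uniform draw from
# {x_1, ..., x_n}; the n^n possible resamples are equally likely.
means = [statistics.mean(r) for r in itertools.product(x, repeat=n)]

# E*[bar(X*)] equals the sample mean ...
assert abs(statistics.mean(means) - statistics.mean(x)) < 1e-12
# ... and Var*[bar(X*)] equals the plug-in variance divided by n.
assert abs(statistics.pvariance(means) - statistics.pvariance(x) / n) < 1e-12
```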

A.6. Proof of Theorem 2.10
Proof. By the definition of V * n , we have where the second equality follows from the assumed variance condition and Theorem 2.6. Combining this with the same assumption gives the desired result.

A.7. Proof of Lemma 3.1
Proof. For U Ken n we have a(i, j) = 1(j < i) and h(X i , X j ) = 1(X j > X i ). Using the definitions in (2.2) and (2.3), we obtain the forms of f (1) i and f (2) i . This completes the proof.
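For concreteness, the statistic with this choice of a(·) and h(·) can be evaluated directly from the definition U n = {n(n − 1)} −1 Σ i≠j a(i, j)h(X i , X j ). The sketch below (function name ours) makes explicit that U Ken n is the fraction of discordant ordered pairs:

```python
def u_ken(x):
    """Weighted U-statistic with a(i, j) = 1(j < i) and
    h(x_i, x_j) = 1(x_j > x_i), i.e.
    U_n = 1/(n(n-1)) * #{(i, j) : j < i, x_j > x_i}."""
    n = len(x)
    discordant = sum(1 for i in range(n) for j in range(i) if x[j] > x[i])
    return discordant / (n * (n - 1))
```

An increasing sequence gives the value 0 and a decreasing one gives 1/2, the two extremes of this statistic for tie-free data.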

A.8. Proof of Lemma 3.2
Proof. For U AP n , we have a(i, j) = n(i − 1) −1 1(j < i) and h(X i , X j ) = 1(X j > X i ). The forms of f (1) i (x) and f (2) i (x) are the same as in the proof of Lemma 3.1.

By Lemma 2.2 we obtain
This completes the proof.
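Analogously, with the AP weights the statistic upweights early positions. A minimal sketch of ours (function name hypothetical; 0-based Python indexing is mapped to the paper's 1-based index i):

```python
def u_ap(x):
    """Weighted U-statistic with a(i, j) = n/(i-1) * 1(j < i) (1-based i)
    and kernel h(x_i, x_j) = 1(x_j > x_i):
    U_n = 1/(n(n-1)) * sum_{i=2}^{n} n/(i-1) * #{j < i : x_j > x_i}."""
    n = len(x)
    total = 0.0
    for i in range(1, n):  # 0-based position i is 1-based position i + 1
        count = sum(1 for j in range(i) if x[j] > x[i])
        total += (n / i) * count  # weight n/((i+1) - 1) = n/i
    return total / (n * (n - 1))
```

The value ranges from 0 (increasing sequence) to 1 (decreasing sequence), since the count #{j < i : x_j > x_i} at position i is at most i − 1, which exactly cancels the weight.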

A.9. Proof of Theorem 3.3
Proof. We divide the proof into two parts. In Part I, we prove the theorem for U AP n . In Part II, we prove the theorem for U Ken n .
Part I (for U AP n ). By Theorem 2.1, for proving asymptotic normality of U AP n it suffices to show that (2.7), (2.8), and (2.9) hold under the assumptions of Theorem 3.3. Equation (2.7) holds trivially with M (n) = 1 due to the boundedness of the kernel function h(·). In the following, we establish (2.8) and (2.9) by calculating the orders of A 2,2 (n), A 3,1 (n), and V (n).
First we derive upper bounds on A 2,2 (n) and A 3,1 (n). We will repeatedly use Lemma C.8 to bound partial sums of the harmonic series. By the definition of A 2,2 (n) in (2.6), we have (A.32). Since the weights satisfy a(i, j)a(j, i) = 0 and a(i, i) = 0, it then follows from (A.32) that By the definition of A 3,1 (n) in (2.6), we have The term |a(i, j 1 )a(i, j 2 )a(j 3 , i)| is nonzero only if j 1 , j 2 < i < j 3 , so the corresponding summation in (A.34) equals The term |a(i, j 1 )a(j 2 , i)a(j 3 , i)| is nonzero only if j 1 < i < j 2 , j 3 , so the corresponding summation in (A.34) equals (A.37) The term |a(j 1 , i)a(j 2 , i)a(j 3 , i)| is nonzero only if j 1 , j 2 , j 3 > i, so the corresponding summation in (A.34) equals
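Lemma C.8 is not restated here, but the standard sandwich bounds log n < Σ_{k=1}^n 1/k ≤ log n + 1 are the kind of harmonic-series estimate it provides, and they are easy to verify numerically:

```python
import math

def harmonic(n):
    """Partial sum H_n = sum_{k=1}^{n} 1/k of the harmonic series."""
    return sum(1.0 / k for k in range(1, n + 1))

# Standard bounds: log(n) < H_n <= log(n) + 1 for every n >= 1.
for n in (1, 10, 100, 10000):
    assert math.log(n) < harmonic(n) <= math.log(n) + 1
```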

A.10. Proof of Theorem 3.5
Proof. Define and Using the definitions of F c j and F c ji in (3.5), we have Lemma A.6. Define .

Consider a fixed i ∈ [n]. If x satisfies
then for all j ∈ [n]\{i} we have Since ρ n ≥ 1, by the definition of K 1 we have Combining (A.63), (A.64), and (3.6) yields Thus by dropping constants we obtain In the following we show that (A.65) and (3.8) imply If lim sup n→∞ R n = ∞, the fact that ρ n ≥ 1 and b 1 > b 2 > 0 yields where K 3 , K 4 are defined in (3.14). Then for all j ∈ [n]\{i} we have Since ρ n ≥ 1, by the definition of K 3 , we have Combining (A.74), (A.75), and (3.10) yields Thus by dropping constants we obtain This completes the proof of Part II for U Ken n . For U AP n the proof is almost the same, except that (3.12) is replaced by (3.13), and the right-hand side of (A.77) is replaced by n −1/3 (log n) 2 .

Appendix B: Proofs of the supporting lemmas
In this section, we prove the supporting lemmas that appear in Section A.

B.2. Proof of Lemma A.2
Proof. We prove Lemma A.2 by showing that for each (i * 1 , . . . , i * m ) ∈ I m n , the coefficients of a(i * 1 , . . . , i * m ) on both sides of (A.14) are equal. In the following we fix (i * 1 , . . . , i * m ) ∈ I m n . For the left-hand side of (A.14), we enumerate the combinations in So the coefficient of a(i * 1 , . . . , i * m ) on the left-hand side of (A.14) is This equals the coefficient of a(i * 1 , . . . , i * m ) on the right-hand side of (A.14). This completes the proof.

B.3. Proof of Lemma A.3
Lemmas B.1 and B.2 that appear in this proof are proven immediately after this proof.
Proof. Define i := (i 1 , . . . , i m ) and X i := (X i1 , . . . , X im ). By (A.17) we have and (B.9) It follows from (B.7) and (B.8) that The following proof consists of two steps. In the first step, we establish In the second step, we show that Lemma A.3 then follows from (B.11), (B.12), and Slutsky's theorem.
Step I.

F. Han and T. Qian
For any (i, j) ∈ (I m n ) ⊗2 ≥2 , by the law of iterated expectation, the Cauchy–Schwarz inequality, and the triangle inequality we have Similarly, by Jensen's inequality and the triangle inequality we have Using the law of iterated expectation, we deduce It then follows from Markov's inequality that Step II. Consider a fixed (i, j) ∈ (I m n ) ⊗2 =1 . Without loss of generality assume i ∩ j = {i p } = {j q } for some 1 ≤ p, q ≤ m. By the i.i.d.-ness of the X * i 's given X 1 , . . . , X n , we have and The following lemma gives a bound on the right-hand side of (B.24).
Using an argument similar to (B.24), we have r,s∈{1,...,n} 2m The following lemma gives a bound on the right-hand side of (B.27).
Therefore, by the definition of A 2,1 (n) in (2.6), we have with r p = s q and i ∩ r = ∅ = j ∩ s. By the law of iterated expectation and the independence of X i 's we have Using the definition of h 2;i (·) in (2.5) we have By the independence of the X i 's we have Using (B.32) and (B.33) we obtain We introduce some notation: Using the new notation, (B.34) becomes (B.35) Similarly, we have (B.36) By algebra and the law of iterated expectation, we derive from (B.35) and (B.36) that By the definitions of M 1 (n) and M 2 (n) in (2.25) and (2.26), we have |T 1 | ≤ 2M 2 (n), |T 2 | ≤ CM 1 (n), |T 3 | ≤ CM 1 (n) 2 , |T 4 | ≤ CM 1 (n), and |T 5 | ≤ CM 1 (n). Therefore, it follows from (B.37) that This yields (B.25). The proof is thus finished.

=0 such that i ∩ r = ∅ = j ∩ s. By independence of the X i 's we have By the definition of h 2;i (·) in (2.5), we have It then follows from the definition of M 1 (n) in (2.25) that This implies (B.28). The proof is thus finished.

B.7. Proof of Lemma A.7
Proof. For any j ∈ [n]\{i}, we have ρ −1 ij ≤ ρ n and −r ij ≤ R n . This combined with (A.72) implies that z i ≥ ρ −1 ij t 0 − r ij , or equivalently This implies that δ n ∈ (0, 1) and that In the following we show f ij (x) ≤ −δ n for all j ∈ [n]\{i} under these two mutually exclusive and collectively exhaustive cases.
Proof of Lemma C.10. By algebra we have For T 1 we have (C.17) For T 2 we have For T 4 we have Note that