Quantifying the cost of simultaneous non-parametric approximation of several samples

We consider the standard non-parametric regression model with Gaussian errors but where the data consist of different samples. The question to be answered is whether the samples can be adequately represented by the same regression function. To do this we define for each sample a universal, honest and non-asymptotic confidence region for the regression function. Any subset of the samples can be represented by the same function if and only if the intersection of the corresponding confidence regions is non-empty. If the empirical supports of the samples are disjoint then the intersection of the confidence regions is always non-empty and a negative answer can only be obtained by placing shape or quantitative smoothness conditions on the joint approximation. Alternatively a simplest joint approximating function can be calculated which gives a measure of the cost of the joint approximation, for example, the number of extra peaks required.


Introduction
We consider the following problem in non-parametric regression. Given $k$ samples $y_{in_i} = \{(t_{ij}, y_{ij}) : j = 1, \dots, n_i\}$, $i = 1, \dots, k$, with supports $S_{in_i} = \{t_{i1}, \dots, t_{in_i}\} \subset [0,1]$, the question to be answered is whether they can be simultaneously represented by a common function $f$. The standard approach is to assume that the data sets were generated according to the model
$$Y_i(t_{ij}) = f_i(t_{ij}) + \sigma_i Z_i(t_{ij}), \quad j = 1, \dots, n_i,\; i = 1, \dots, k, \qquad (3)$$
and then to consider the null and alternative hypotheses
$$H_0: f_1 = \dots = f_k, \qquad H_1: f_i \ne f_j \text{ for some } i \ne j. \qquad (4)$$
We assume that the noise processes $Z_i(t)$, $i = 1, \dots, k$, are independent and standard Gaussian white noise. Individual samples generated under (3) will be denoted by $Y_{in_i} = \{(t_{ij}, Y_{ij}) : j = 1, \dots, n_i\}$, $i = 1, \dots, k$.
Here and in the following we use minuscule letters to denote general data sets and majuscule letters for data generated under (3). We shall mostly restrict attention to the case $k = 2$; the extension to more samples poses no problems. Within this setup it is possible to construct tests which are asymptotically consistent if $\lim n_i = \infty$, $i = 1, 2$, and which can detect alternatives converging to the null hypothesis at certain rates. This may be formalized by putting $f_1(t) - f_2(t) = f_{1,n_1}(t) - f_{2,n_2}(t) = \Delta_n(t)$, $n = \min(n_1, n_2)$, where $\Delta_n$ is a difference function whose size measures the rate of convergence to the null hypothesis. The best result seems to be that of Neumeyer and Dette (2003) who construct a test which can detect alternatives which converge to the null hypothesis at the optimal rate $\Delta_n = O(n^{-1/2})$. If the supports are equal, $S_{1n_1} = S_{2n_2} = \{t_1, \dots, t_n\}$, then it is not difficult to construct such a test as the differences $Y_{1n_1}(t_j) - Y_{2n_2}(t_j)$ do not depend on $f$ (see for example Delgado (1992) and Fan and Lin (1998)). The result of Neumeyer and Dette (2003) continues to hold even if the supports are disjoint, $S_{1n_1} \cap S_{2n_2} = \emptyset$. In this case, however, there are difficulties which can be most clearly seen in the case of exact data $y_{ij} = f_i(t_{ij})$, $t_{ij} \in S_{in_i}$, $i = 1, 2$.
If we denote the supremum norm on $[0, 1]$ by $\|\cdot\|_\infty$ then the null and alternative hypotheses of (4) may be rewritten as
$$H_0: \|f_1 - f_2\|_\infty = 0, \qquad H_1: \|f_1 - f_2\|_\infty > 0.$$
If the values of $f_1$ and $f_2$ are known only on disjoint sets $S_{1n_1}$ and $S_{2n_2}$ respectively, then it is not possible to decide between $H_0$ and $H_1$. This continues to hold even if $f_1$ and $f_2$ are subject to qualitative smoothness conditions such as infinite differentiability: all one need do is interpolate the data points using such a function. The addition of noise and the use of asymptotics do not solve the problem, as indicated by Figure 1. The top panel shows two data sets of sizes $n_1 = n_2 = 500$ generated according to $Y_1(t) = \exp(1.5t) + 0.25Z_1(t)$ and $Y_2(t) = \exp(1.5t) + 3 + 0.25Z_2(t)$ with disjoint supports taken to be i.i.d. uniform random variables on $[0, 1]$. The centre panel shows a joint piecewise constant approximating function with 514 local extreme values. It can be made infinitely differentiable by convolving it with a Gaussian kernel with a small bandwidth. The bottom panel shows a sample of size $n = 1000$ generated using the function of the centre panel. It looks very much like the two original data sets. In order to distinguish between $H_0$ and $H_1$ it is necessary to place either quantitative conditions on $f_1$ and $f_2$, such as $\|f_2^{(1)}\|_\infty \le 1$, or shape restrictions, such as $f_1$ and $f_2$ being monotone. In spite of this all conditions imposed in the literature are of a qualitative form: Hall and Hart (1990), a bounded first derivative; Härdle and Marron (1990), Hölder continuity; King et al. (1990), at least uniform continuity; Kulasekera (1995) and Kulasekera and Wang (1997), a continuous second derivative; Munk and Dette (1998), Hölder continuity of order $\beta > 1/2$; Dette and Neumeyer (2001), a continuous $r$th derivative; Lavergne (2001), a second derivative which is uniformly Lipschitz of order $\beta$, $0 \le \beta < 1$; Neumeyer and Dette (2003), continuous derivatives of order $d \ge 2$.
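The interpolation argument can be made concrete with a small numerical sketch (illustrative data only; the functions mimic Figure 1 but the sample sizes and seed are assumptions): when the supports are disjoint, a single continuous function, here simply the piecewise linear interpolant of the pooled data, fits both samples with zero residuals, so without further conditions the pooled data are always consistent with $H_0$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two samples with (almost surely) disjoint random supports on [0, 1].
t1 = np.sort(rng.uniform(0, 1, 50))
y1 = np.exp(1.5 * t1) + 0.25 * rng.standard_normal(50)
t2 = np.sort(rng.uniform(0, 1, 50))
y2 = np.exp(1.5 * t2) + 3.0 + 0.25 * rng.standard_normal(50)

# Pool the samples and sort by design point.
t = np.concatenate([t1, t2])
y = np.concatenate([y1, y2])
order = np.argsort(t)
t, y = t[order], y[order]

# A single continuous function interpolating every observation of
# both samples: one common regression function fits all the data.
def f_joint(s):
    return np.interp(s, t, y)

assert np.allclose(f_joint(t1), y1)  # zero residuals on sample 1
assert np.allclose(f_joint(t2), y2)  # zero residuals on sample 2
```

Convolving such an interpolant with a narrow Gaussian kernel gives an infinitely differentiable joint fit, as noted above.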
The problem is one of uniform convergence, which is required to make the results applicable for finite $n$ and which does not follow from qualitative conditions alone. What can be said is that if the functions differ, then any joint approximation will become more complicated as the sample sizes increase. It is this increase in complexity which we call the cost of the simultaneous approximation. This is shown in Figure 1 where the individual approximations are monotone (top panel) but the simplest joint approximation has 514 local extreme values (centre panel). In the remainder of the paper we show how the quantification can be carried out. Our approach can be split into two parts: (1) Firstly, for each sample $y_{in_i}$ we define a so-called approximation region $A_{in_i}$ which specifies those functions $f_i$ for which the model (3) is an adequate approximation for the sample. The intersection $A_{1n_1} \cap A_{2n_2}$ of the approximation regions contains all those functions which simultaneously approximate both samples. It is also the approximation region for the simultaneous approximation. A similar idea in the context of the one-way table in the analysis of variance is expounded in Davies (2004).
(2) Secondly, using some measure of complexity we regularize within each approximation region by choosing the simplest function which is consistent with the data. This is in the spirit of Donoho (1988) who pointed out that in non-parametric regression and density estimation it is possible only to give lower bounds on certain quantities of interest such as the number of modal values.
In Section 2 we define the approximation or confidence regions $A_{in_i}$ and in Section 3 we apply the ideas and concepts to the problem of comparing regression functions.

Single samples
The following is based on Davies et al. (2008c). We consider a single sample of data $Y_n = (t_i, Y(t_i))_1^n$ generated under the model
$$Y(t_i) = f(t_i) + \sigma Z(t_i), \quad i = 1, \dots, n,$$
where we take the $t_i$ to be ordered. Based on this model we consider two different approximation or confidence regions $A_n$ and $A_n^*$ defined as follows. For any function $g$ and any interval $I \subset [0,1]$ we put
$$w(Y_n, g, I) = \frac{1}{\sqrt{|I|}}\sum_{t_i \in I}\big(Y(t_i) - g(t_i)\big),$$
where $|I|$ denotes the number of points $t_i \in I$. The confidence region $A_n$ is defined by
$$A_n = A_n(Y_n, \mathcal I_n, \sigma, \tau_n) = \Big\{g : \max_{I \in \mathcal I_n} |w(Y_n, g, I)| \le \sigma\sqrt{\tau_n \log n}\Big\}, \qquad (9)$$
where $\mathcal I_n$ is a collection of intervals of $[0,1]$. We restrict attention to the cases where $\mathcal I_n$ is either the set of all intervals or a multiresolution family $\mathcal I_n(\lambda)$ of intervals containing about $\lambda^m$, $m = 0, 1, \dots$, support points for some $\lambda > 1$ (10). Our default choice is the (wavelet) dyadic scheme $\mathcal I_n(2)$. For any given $\alpha$ and collection of intervals $\mathcal I_n$ we define $\tau_n(\alpha)$ by
$$P\Big(\max_{I \in \mathcal I_n}\frac{1}{\sqrt{|I|}}\Big|\sum_{t_i \in I} Z(t_i)\Big| \le \sqrt{\tau_n(\alpha)\log n}\Big) = \alpha. \qquad (11)$$
The value of $\tau_n(\alpha)$ may be determined by simulations. These show that for $\mathcal I_n = \mathcal I_n(2)$ we have $\tau_n(0.95) \le 3$ for all $n \ge 500$. If $\mathcal I_n$ contains all singletons $\{t_i\}$, as will always be the case, it follows from Dümbgen and Spokoiny (2001) and Kabluchko (2007) that $\lim_{n\to\infty} \tau_n(\alpha) = 2$ for any $\alpha$. One immediate consequence of (11) is
$$P\big(f \in A_n(Y_n, \mathcal I_n, \sigma, \tau_n)\big) = \alpha,$$
so that $A_n$ is a universal, exact and non-asymptotic confidence region for $f$ of size $\alpha$. The confidence region (9) treats all intervals equally. The second confidence region $A_n^*$ downweights the importance of small intervals and is defined as follows. Dümbgen and Spokoiny (2001) extended Lévy's uniform modulus of continuity of the Brownian motion by quantifying the lower-order correction terms. If we embed the partial sums $\sum_{t_i \in I} Z(t_i)/\sqrt{|I|}$, $I \in \mathcal I_n$, in a standard Brownian motion it follows that
$$\sup_n \sup_{I \in \mathcal I_n} \frac{\big(\sum_{t_j \in I} Z(t_j)\big)^2/|I| - 2\log(n/|I|)}{\log\log(e^e n/|I|)} = \Gamma < \infty \quad \text{a.s.}$$
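How $\tau_n(\alpha)$ can be approximated by simulation may be sketched as follows. The interval family below uses only singletons plus aligned dyadic blocks, a deliberate simplification of $\mathcal I_n(2)$; the sample size, replication count and seed are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dyadic_stat(z):
    """Max over singletons and aligned dyadic blocks I of
    |sum of z over I| / sqrt(|I|)  (a simplified I_n(2))."""
    n = len(z)
    best = np.abs(z).max()                     # singleton intervals
    m = 1
    while 2 * m <= n:
        m *= 2
        k = n // m                             # disjoint blocks of length m
        s = z[: k * m].reshape(k, m).sum(axis=1)
        best = max(best, (np.abs(s) / np.sqrt(m)).max())
    return best

def tau_n(n=256, alpha=0.95, reps=2000, seed=1):
    """Monte Carlo approximation of tau_n(alpha) from (11):
    the threshold has the form sigma * sqrt(tau_n * log n)."""
    rng = np.random.default_rng(seed)
    stats = [dyadic_stat(rng.standard_normal(n)) for _ in range(reps)]
    q = np.quantile(stats, alpha)
    return q ** 2 / np.log(n)

tau = tau_n()
print(tau)
```

Because the family above omits the shifted intervals of the full dyadic scheme, the resulting value is a slight underestimate; it nevertheless reproduces the order of magnitude quoted in the text.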
It follows that the region
$$A_n^* = A_n^*(Y_n, \mathcal I_n, \sigma, \gamma_n) = \Big\{g : w(Y_n, g, I)^2 \le \sigma^2\big(2\log(n/|I|) + \gamma_n\log\log(e^e n/|I|)\big) \text{ for all } I \in \mathcal I_n\Big\} \qquad (15)$$
is a universal, exact and non-asymptotic $\alpha$-confidence region for $f$ for an appropriate choice of $\gamma_n = \gamma_n(\alpha)$. The values of $\gamma_n$ may be determined by simulation; for $\alpha = 0.95$ and with $\mathcal I_n = \mathcal I_n(2)$ a good approximation to $\gamma_n(\alpha)$ is available for all $n \ge 100$. The confidence regions $A_n(Y_n, \mathcal I_n, \sigma, \tau_n)$ and $A_n^*(Y_n, \mathcal I_n, \sigma, \gamma_n)$ both require the true value of $\sigma$. We indicate how this may be obtained from the data in such a manner that the confidence region becomes honest (Li (1989)) rather than exact. The following argument makes the somewhat casual remarks on the problem made in Davies et al. (2008c) more precise. Using the normal approximation to the binomial $(n, 1/2)$ distribution it follows that for an i.i.d. sample $(W_1, \dots, W_n)$ with common continuous distribution $P$ with median $\operatorname{med}(P)$
$$P\big(\operatorname{med}(P) \le W_{(\lceil n/2 + z_\beta\sqrt{n}/2\rceil)}\big) \ge \beta,$$
where $W_{(i)}$ denotes the $i$th order statistic of the sample and $z_\beta$ the $\beta$-quantile of the standard normal distribution. On putting $\beta = 0.995$ we obtain $z_\beta = 2.576$. We now apply this to the $\lfloor n/2\rfloor$ random variables
$$|Y(t_{2i}) - Y(t_{2i-1})|\big/\sqrt{2}, \quad i = 1, \dots, \lfloor n/2\rfloor.$$
It follows from Anderson (1955) that, whatever the value of the function $f$, these random variables are stochastically larger than the corresponding variables with $f = 0$. On using the corresponding result for the $\lfloor (n-1)/2\rfloor$ random variables $|Y(t_{2i+1}) - Y(t_{2i})|/\sqrt{2}$ it follows that if we define $\hat\sigma_n$ to be the $\lceil n/2 + 1.814\sqrt{n}\rceil$th order statistic of the random variables
$$|Y(t_{i+1}) - Y(t_i)|\big/\big(\sqrt{2}\,\Phi^{-1}(0.75)\big), \quad i = 1, \dots, n-1,$$
then $P(\hat\sigma_n \ge \sigma) \ge 0.99$ for all $n \ge 100$ say, whatever the value of $f$. It follows that $A_n(Y_n, \mathcal I_n, \hat\sigma_n, \tau_n)$ and $A_n^*(Y_n, \mathcal I_n, \hat\sigma_n, \gamma_n)$ are now universal and non-asymptotic honest confidence regions whatever the value of $f$, but with $\alpha$ replaced by $\alpha - 0.01$ for $A_n$ and with the corresponding inequality for $A_n^*$. In spite of this the default value for $\hat\sigma_n$ we shall use in this paper is
$$\hat\sigma_n = \operatorname{median}\big(|Y(t_2) - Y(t_1)|, \dots, |Y(t_n) - Y(t_{n-1})|\big)\big/\big(\sqrt{2}\,\Phi^{-1}(0.75)\big). \qquad (18)$$
It is simpler, the difference is in general small, it was used in Davies and Kovac (2001), Davies et al. (2006) and Davies et al. (2008a), and it also corresponds to using the first order Haar wavelets to estimate $\sigma$ (Donoho et al. (1995)).
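A minimal sketch of the default scale estimator (18), assuming the usual median-of-differences form with the Gaussian calibration constant $\Phi^{-1}(0.75) \approx 0.6745$; the test signal is hypothetical:

```python
import numpy as np

def sigma_hat(y):
    """Scale estimate from first-order differences: sqrt(2) undoes the
    variance doubling in Y[i+1] - Y[i], and 0.6745 = Phi^{-1}(0.75)
    calibrates the median of |N(0,1)| noise."""
    d = np.abs(np.diff(y))
    return np.median(d) / (np.sqrt(2) * 0.6745)

rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 2000)
y = np.exp(1.5 * t) + 0.25 * rng.standard_normal(2000)
print(sigma_hat(y))   # close to the true noise level 0.25
```

Differencing removes the smooth trend almost entirely, which is why the estimator is nearly unbiased for slowly varying $f$.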
In Davies (1995) implicit use is made of a confidence region based on the lengths of runs of the signs of the residuals. Explicit universal, honest and non-asymptotic confidence regions based on the signs of the residuals are to be found in Dümbgen (1998, 2003, 2007) and Dümbgen and Johns (2004).

A one-way table for regression functions
This section extends the approach given in Davies (2004) for the one-way table to the case of regression functions. We consider $k$ samples $Y_{in_i} = (t_{ij}, Y_i(t_{ij}))_{j=1}^{n_i}$ generated under (3). As a first step we replace the $\alpha$ in (11) and (15) by $\alpha_k = \alpha^{1/k}$ where $k$ is the number of samples. This adjusts the size of each confidence region to take into account the number of samples. The confidence region $A_{in_i}$ for the $i$th sample is given by (9) (or (15)) with $\alpha$ replaced by $\alpha_k$. We denote by $P_{\mathbf f}$ with $\mathbf f = (f_1, \dots, f_k)$ the probability model where all the samples $Y_{in_i}$, $i = 1, \dots, k$, are independently distributed and $Y_{in_i}$ was generated under (3) with regression function $f_i$. All questions concerning the relationships between the functions $f_i$ can now be answered by using the confidence regions $A_{in_i}$. For example, the question as to whether the $f_i$ are all equal translates into the question as to whether
$$A_{n_k} = \bigcap_{i=1}^k A_{in_i}$$
is empty or not. If the supports $S_{in_i}$ of the samples are not disjoint then it is possible that the linear inequalities which define the confidence regions are inconsistent. In this case $A_{n_k} = \emptyset$ and there is no joint approximating function. If the supports $S_{in_i}$ of the samples are pairwise disjoint then $A_{n_k}$ is non-empty and so there always is a joint approximating function. Without further restrictions on the joint approximating function nothing more can be said. If however the joint approximating function is required to satisfy, for example, a shape constraint such as monotonicity, then it may be the case that there is no joint approximating function. Figure 1 shows just such a case where there are monotonic approximations for each sample individually but no monotonic joint approximation. To answer questions of this nature we must regularize within $A_{n_k}$ and this is the topic of the next section.
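For equal supports the emptiness check can be sketched directly: a candidate $g$ lies in the joint region exactly when it satisfies the multiresolution inequalities for every sample. The sketch below uses a simplified dyadic interval family, an assumed threshold value and hypothetical data.

```python
import numpy as np

def in_region(y, g, sigma, tau):
    """Multiresolution check: g adequately approximates y if
    |sum of residuals over I| / sqrt(|I|) <= sigma*sqrt(tau*log n)
    for every interval I (singletons plus aligned dyadic blocks)."""
    r = y - g
    n = len(r)
    thr = sigma * np.sqrt(tau * np.log(n))
    if np.abs(r).max() > thr:                 # singleton intervals
        return False
    m = 1
    while 2 * m <= n:
        m *= 2
        k = n // m                            # disjoint dyadic blocks
        s = r[: k * m].reshape(k, m).sum(axis=1)
        if (np.abs(s) / np.sqrt(m)).max() > thr:
            return False
    return True

rng = np.random.default_rng(3)
n, sigma, tau = 512, 0.25, 3.0
t = np.arange(1, n + 1) / n
f = np.exp(1.5 * t)
y1 = f + sigma * rng.standard_normal(n)         # sample around f
y2 = f + 3.0 + sigma * rng.standard_normal(n)   # vertically shifted sample

# f is a plausible approximation for sample 1 but not for sample 2,
# so f does not lie in the intersection of the two regions.
print(in_region(y1, f, sigma, tau), in_region(y2, f, sigma, tau))
```

A function lies in $A_{n_k}$ precisely when every such per-sample check returns `True`.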

Disjoint supports
We consider firstly the case when the supports $S_{in_i}$, $i = 1, \dots, k$, are pairwise disjoint. In this case the joint approximation region $A_{n_k}$ is non-empty and will in general include many functions which would not be regarded as acceptable. Indeed, it may be that $A_{n_k}$ does not contain any acceptable function. The definition of 'acceptable' will usually be formulated in terms of shape or quantitative smoothness constraints.
Alternatively, rather than impose prior restrictions, one can determine a simplest function in the joint approximation region. One possibility is to minimize the number of local extreme points of a function $g$ subject to $g \in A_{n_k}$. Figure 1 shows an example of this approach where the joint approximating function has 514 internal local extreme points compared with monotone approximating functions for both data sets separately. The additional local extreme points can be regarded as the cost of the joint approximation. The same idea can be used if simplicity is defined in terms of smoothness, for example by minimizing the total variation $TV(g^{(2)})$ of the second derivative subject to $g$ lying in the approximation region. The upper panel of Figure 2 shows the data and curves of Figure 1 but with the values of the second sample reduced by 2.3. There is now a joint monotone approximation, shown in the lower panel of Figure 2, so there is no cost in terms of the number of local extreme values. If we minimize the total variation of the second derivative subject to the function being an adequate monotone approximation then there is a cost. The upper panel of Figure 3 shows the approximations of the two individual samples which minimize the total variation of the second derivatives subject to the functions being monotone. The lower panel of Figure 3 shows the joint approximation for the combined sample. The second derivatives are shown in Figure 4. The values of the total variation of the second derivative are 9.317 and 6.305 for the individual samples and 59.496 for the joint sample.
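On a grid the smoothness measure $TV(g^{(2)})$ can be approximated from divided differences, which is one way to make the smoothness cost concrete. This is a simplified discrete surrogate for illustration only, not the constrained optimization behind Figures 3 and 4.

```python
import numpy as np

def tv_second_derivative(g, t):
    """Discrete surrogate for TV(g''): total absolute change of the
    second divided differences on an equispaced grid t."""
    h = t[1] - t[0]
    d2 = np.diff(g, 2) / h ** 2          # second differences ~ g''
    return np.abs(np.diff(d2)).sum()     # variation of g'' across the grid

t = np.linspace(0.0, 1.0, 1001)
# quadratic: g'' is constant, so the variation is ~0 (rounding only)
print(tv_second_derivative(t ** 2, t))
# cubic: g'' = 6t on [0,1], so TV(g'') is ~6 up to boundary effects
print(tv_second_derivative(t ** 3, t))
```

Minimizing such a quantity over the joint region rewards functions whose curvature changes as little as possible.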

Intersecting supports
As mentioned in Section 1 the Neumeyer and Dette (2003) procedure can detect differences of the order of $n^{-1/2}$. We now consider the size of detectable differences for our procedure in the case of equal supports. For simplicity we consider only the case $k = 2$ and assume that the supports $S_{1n_1}$ and $S_{2n_2}$ are given by $t_{1i} = t_{2i} = i/n$. We take $\mathcal I_n$ to be the set of all intervals but indicate below the adjustments required if $\mathcal I_n = \mathcal I_n(\lambda)$ as in (10). We state the results using $\sigma_1$ and $\sigma_2$ rather than the estimates (18) and write $\tau_n = \tau_n(\alpha^{1/k})$. If a joint approximating function $\tilde f_n$ exists then for any interval $I$ of $[0,1]$ we have
$$\Big|\frac{1}{\sqrt{|I|}}\sum_{t_i \in I}\big(Y_\ell(t_i) - \tilde f_n(t_i)\big)\Big| \le \sigma_\ell\sqrt{\tau_n\log n}, \quad \ell = 1, 2.$$
For the noise we have with probability $\alpha$
$$\Big|\frac{1}{\sqrt{|I|}}\sum_{t_i \in I}\big(\sigma_1 Z_1(t_i) - \sigma_2 Z_2(t_i)\big)\Big| \le (\sigma_1 + \sigma_2)\sqrt{\tau_n\log n}$$
and hence with probability $\alpha$
$$\Big|\frac{1}{\sqrt{|I|}}\sum_{t_i \in I}\big(f_1(t_i) - f_2(t_i)\big)\Big| \le 2(\sigma_1 + \sigma_2)\sqrt{\tau_n\log n}.$$
Suppose now that $f_1$ and $f_2$ differ by an amount $\eta_n$ on an interval $I_n \subset [0,1]$, that is $f_1(t) - f_2(t) > \eta_n$, $t \in I_n$, and that the length of $I_n$ is $\delta_n$. As $I_n$ contains about $n\delta_n$ support points we see that
$$\frac{1}{\sqrt{n\delta_n}}\,n\delta_n\,\eta_n \le 2(\sigma_1 + \sigma_2)\sqrt{\tau_n\log n},$$
which implies that no joint approximation will exist if
$$\sqrt{\delta_n}\,\eta_n > 2(\sigma_1 + \sigma_2)\sqrt{\tau_n\log(n)/n}. \qquad (22)$$
It follows that with probability at least $\alpha$, deviations satisfying (22) will be detected. If $\mathcal I_n = \mathcal I_n(\lambda)$ as in (10) it follows that there exists an interval $I_n' \subset I_n$ in $\mathcal I_n(\lambda)$ of length $|I_n'| \ge |I_n|/\lambda = \delta_n/\lambda$ for which $f_1(t) - f_2(t) > \eta_n$, $t \in I_n'$. This requires replacing (22) by
$$\sqrt{\delta_n}\,\eta_n > 2\sqrt{\lambda}\,(\sigma_1 + \sigma_2)\sqrt{\tau_n\log(n)/n}. \qquad (23)$$
We consider a situation similar to that of Figure 1 as shown in Figure 5. The sample sizes are $n = 500$ with common supports $t_j = j/n$ and we take $\alpha$ to be 0.95 so that $\alpha_k = 0.95^{1/2} = 0.9747$. For this choice of $\alpha$ and with $\mathcal I_n = \mathcal I_n(2)$ simulations give $\tau_n = 2.973$. We set $f_1(t) = \exp(1.5t)$ and put $f_2(t) = f_1(t)$ except for $t \in [0.402, 0.44]$ where $f_2(t) = f_1(t) + \eta_n$. For this interval $\delta_n = 20/500$ and so we expect to be able to detect deviations $\eta_n$ of the order
$$\eta_n = 2\sqrt{2}\,(0.25 + 0.25)\sqrt{2.973\log 500}\,\big/\sqrt{20} = 1.359 \qquad (24)$$
with probability at least 0.95. For the data shown in Figure 5 the difference is detected with $\eta_n = 0.575$ but not with $\eta_n = 0.574$. If we put $\delta_n = 1$ in (23) so that the two functions deviate over the whole interval then deviations satisfying
$$\eta_n > 2\sqrt{2}\,(\sigma_1 + \sigma_2)\sqrt{\tau_n\log(n)/n} \qquad (25)$$
will be detected,
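The arithmetic of (24) is easily checked numerically (all constants as quoted in the text):

```python
import numpy as np

# Detectable deviation from (24): n = 500, dyadic scheme (lambda = 2),
# sigma_1 = sigma_2 = 0.25, tau_n = 2.973, interval with n*delta_n = 20 points.
n, tau, s1, s2, npts = 500, 2.973, 0.25, 0.25, 20
eta = 2 * np.sqrt(2) * (s1 + s2) * np.sqrt(tau * np.log(n)) / np.sqrt(npts)
print(round(eta, 3))  # 1.359, the bound quoted in (24)
```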
which implies that deviations of order $\sqrt{\log(n)/n}$ can be detected. The same analysis can be carried through using the approximation region $A_n^*$. Corresponding to (22) we obtain
$$\delta_n\,\eta_n > 2(\sigma_1 + \sigma_2)\sqrt{h(\delta_n)/n}, \qquad (26)$$
where
$$h(\delta) = \delta\big(2\log(1/\delta) + \gamma_n(\alpha)\log\log(e^e/\delta)\big), \quad 0 < \delta \le 1,$$
is monotonically increasing and $\gamma_n(\alpha)$ is bounded in $n$. In particular, if $\delta_n = 1$, then deviations of order $1/\sqrt{n}$ can be detected.

Adapting the taut string algorithm
The taut string algorithm of Davies and Kovac (2001) has proved to be very effective in determining the number of local extremes of a function contaminated by noise (see Davies et al. (2008b)). We show how it may be adapted to the case of $k$ samples. Let $n \le \sum_{i=1}^k n_i$ denote the number of different $t_{ij}$ values, which we order as $0 \le t_1 < \dots < t_n \le 1$. For each sample we calculate the value of $\hat\sigma_{in_i}$ given by (18). We put
$$y_m = \frac{\sum_{t_{ij} = t_m} y_{ij}/\hat\sigma_{in_i}^2}{\sum_{t_{ij} = t_m} 1/\hat\sigma_{in_i}^2}, \quad m = 1, \dots, n.$$
As a first step we check whether a joint approximating function exists. We do this by putting $\tilde f_n(t_m) = y_m$ and then determining whether $\tilde f_n \in A_{n_k}$. If this is not the case we conclude that no joint approximation exists. If a joint approximation exists we put $\Sigma_m = \sum_{t_{ij} = t_m} 1/\hat\sigma_{in_i}^2$ and calculate the partial sums $y^\bullet_m = \sum_{j=1}^m \Sigma_j y_j$ and $A_m = \sum_{j=1}^m \Sigma_j$ for $m = 1, \dots, n$ with $y^\bullet_0 = A_0 = 0$. The initial lower and upper bounds $L_m$ and $U_m$ are set to be
$$L_m = y^\bullet_m - D, \quad U_m = y^\bullet_m + D,$$
where $D$ is chosen so large that the straight line joining $(0, 0)$ and $(y^\bullet_n, A_n)$ lies in the tube. For a given tube, the taut string through the tube and constrained to pass through $(0, 0)$ and $(y^\bullet_n, A_n)$ is calculated. The value of the estimate $\tilde f_n(t_m)$ at the point $t_m$ is taken to be the left-hand derivative of the taut string, except at the first point where the right-hand derivative is taken. For each data set individually it is now checked whether $\tilde f_n \in A_{in_i}$, $i = 1, \dots, k$. If this is the case the procedure ends. Otherwise those intervals for which the inequalities defining the $A_{in_i}$ do not hold are noted and the tube is squeezed at all points $t_{j-1}$ and $t_j$ for which $t_j$ lies in such an interval. This is continued until a function $\tilde f_n \in A_{n_k}$ is found.
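The pooling step and the partial sums that feed the taut-string tube can be sketched as follows; the two tiny samples are hypothetical, and the taut-string computation itself is omitted.

```python
import numpy as np

def pool_samples(samples):
    """Merge k samples into inverse-variance weighted means y_m at the
    distinct design points, and return the partial sums y_bullet_m and
    A_m used to build the taut-string tube.
    samples: list of (t, y, sigma) triples."""
    acc = {}
    for t, y, sigma in samples:
        w = 1.0 / sigma ** 2
        for tm, ym in zip(t, y):
            sw, swy = acc.get(tm, (0.0, 0.0))
            acc[tm] = (sw + w, swy + w * ym)
    tm = np.array(sorted(acc))
    Sig = np.array([acc[x][0] for x in tm])              # Sigma_m
    ym = np.array([acc[x][1] / acc[x][0] for x in tm])   # weighted means y_m
    # partial sums with y_bullet_0 = A_0 = 0
    y_bullet = np.concatenate([[0.0], np.cumsum(Sig * ym)])
    A = np.concatenate([[0.0], np.cumsum(Sig)])
    return tm, ym, y_bullet, A

# two tiny samples sharing the design point 0.5
s1 = (np.array([0.25, 0.5]), np.array([1.0, 2.0]), 1.0)
s2 = (np.array([0.5, 0.75]), np.array([4.0, 3.0]), 1.0)
tm, ym, y_bullet, A = pool_samples([s1, s2])
print(ym)  # [1. 3. 3.]  (the shared point 0.5 averages 2.0 and 4.0)
```

With equal noise levels the pooled value at a shared design point is simply the mean of the observations there; unequal levels tilt it towards the less noisy sample.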

Analysis and simulations
As the approach developed in this paper is not comparable with others when the supports are disjoint, we restrict attention to the case of equal supports. For simplicity we take $k = 2$. For such data Delgado (1992) proposed the test statistic
$$T_n = \frac{1}{\hat\sigma_n\sqrt{n}}\max_{1\le m\le n}\Big|\sum_{j=1}^m\big(Y_1(t_j) - Y_2(t_j)\big)\Big|, \qquad (29)$$
where $\hat\sigma_n$ is some quantifier of the noise. Under the null hypothesis $f_1 = f_2 = f$ the distribution of $T_n$ does not depend on $f$. In this special case the test statistic of Neumeyer and Dette also reduces to (29). If the data were generated under (3) then under $H_0$ the distribution of $T_n$ converges to that of
$$\max_{0\le t\le 1}|B(t)|, \qquad (30)$$
where $B$ is a standard Brownian motion. The 0.95-quantile is approximately 2.24 which leads to rejection of $H_0$ if $T_n \ge 2.24$.
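The partial-sum statistic and its null calibration may be sketched as follows; the explicit form of $T_n$ here is a reconstruction consistent with the limiting $\max|B(t)|$ distribution, and the sample size, replication count and seed are illustrative assumptions.

```python
import numpy as np

def T_n(y1, y2, sigma_d):
    """Maximal absolute normalized partial sum of the differences;
    under H0 this converges in distribution to max |B(t)|, t in [0,1]."""
    d = y1 - y2
    n = len(d)
    return np.abs(np.cumsum(d)).max() / (sigma_d * np.sqrt(n))

# Null rejection rate at the asymptotic 0.95-quantile 2.24.
rng = np.random.default_rng(4)
n, reps, crit = 200, 2000, 2.24
rej = 0
for _ in range(reps):
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)
    # differences have standard deviation sqrt(2) under H0
    rej += T_n(z1, z2, np.sqrt(2.0)) >= crit
print(rej / reps)  # close to the nominal level 0.05
```

The finite-$n$ rejection rate sits slightly below 0.05 because the discrete maximum undershoots the supremum of the limiting Brownian motion.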
Suppose now that the data are generated as in (3) with $f_1(t) = f_2(t)$ apart from $t$ in an interval $I$ of length $\delta_n$ where $f_1(t) - f_2(t) \ge \eta_n$. It follows from (30) that $H_0$ will be rejected with high probability if
$$\sqrt{n}\,\delta_n\,\eta_n/\sigma \text{ is sufficiently large}, \qquad (31)$$
where $\sigma^2 = \sigma_1^2 + \sigma_2^2$. If $\delta_n = 1$ deviations of the order of $\sigma/\sqrt{n}$ can be picked up, which contrasts with the $O\big(\sigma\sqrt{\log(n)/n}\big)$ of (25). If however $\delta_n = 1/\sqrt{n}$ it follows from (31) that the test statistic $T_n$ will pick up deviations of the order of $\sigma$. It follows from (22) and (26) that the methods based on $A_n$ and $A_n^*$ will both pick up deviations of the order of $\sigma\sqrt{\log(n)/\sqrt{n}}$.
Another test which is applicable in this situation is due to Fan and Lin (1998). If we denote the Fourier transforms of the data sets by $\tilde Y_1(i)$ and $\tilde Y_2(i)$, $i = 1, \dots, n$, and order them as described in Fan and Lin (1998), their test statistic reduces to
$$T_n^* = \max_{1\le m\le n}\frac{1}{\sqrt{2m}}\sum_{i=1}^m\Big(\frac{\big(\tilde Y_2(i) - \tilde Y_1(i)\big)^2}{\hat\sigma_n^2} - 1\Big), \qquad (32)$$
where $\hat\sigma_n$ is some estimate of the standard deviation of the $\tilde Y_2(i) - \tilde Y_1(i)$. For data generated under the model (3) the critical value of $T_n^*$ can be obtained by simulations. It is not as simple to determine the size of the deviations which can be detected by the test (32), as the test statistic is a function of the Fourier transforms and the differences in the functions must be translated into differences in the Fourier transforms. The first member of the sum in (32) is the difference of the means and this is given the largest weight. We do not pursue this further but give the results of a small simulation study.
We put $n = 500$. Two samples differing by a function $g$ were generated, where the $Z_j(i/n)$ are i.i.d. $N(0,1)$ random variables and $g$ is given by one of the four alternatives of (35), in which $U$ is uniformly distributed on $[0, 3/4]$ and independent of the $Z_i$, $i = 1, 2$. The four procedures, Delgado-Neumeyer-Dette, Fan-Lin and those based on $A_n$ and $A_n^*$, were all calibrated to give tests of size 0.05 for testing $g \equiv 0$. The critical values for the Delgado-Neumeyer-Dette and Fan-Lin tests are 2.22 and 6.97 respectively. The value of $\tau_n$ for the test based on $A_n$ is 1.46 and the corresponding value of $\gamma_n$ for that based on $A_n^*$ is 0.66. Figure 6 shows the power of the tests for different values of $\eta$. The upper panels give the results for $g$ given by (35)(1) and (35)(2) and the lower panels those for $g$ given by (35)(3) and (35)(4). The colour scheme is as follows: Delgado-Neumeyer-Dette blue, Fan-Lin black, $A_n$ green and $A_n^*$ red. The results confirm the analysis given above. The Delgado-Neumeyer-Dette and Fan-Lin tests are better with $g$ given by (35)(1), but if the mean difference is zero (35)(2), or the interval is small (35)(3), or both (35)(4), then they are outperformed by the procedure based on $A_n^*$ and, in case (35)(4), also by that based on $A_n$.

An application
We give an example with some real data from the area of thin-film physics. They give the number of photons of refracted X-rays as a function of the angle of refraction and were kindly supplied by Professor Dieter Mergel of the University of Duisburg-Essen. Two such data sets are shown in the top panel of Figure 7; the differences y 1 (t i ) − y 2 (t i ) are shown in the bottom panel. Each data set is composed of 4806 measurements and the design points are the same. The samples differ in the manner in which the thin film was prepared. One of the questions to be answered is whether the results of the two methods are substantially different.
The noise levels for the two data sets are the same, namely 8.317, which is explained by the fact that the data are integer valued. The differences between the two data sets are concentrated on intervals each containing about 40 observations. The estimate (31) suggests that the differences will have to be of the order of 92 to be detected with a degree of certainty by the Delgado-Neumeyer-Dette test. The actual differences are of about this order and in fact the test fails to reject the null hypothesis at the 0.1 level: the realized value of the test statistic is 1.734 as against the critical value of 1.90 given by (30). The Fan-Lin test (32) rejects the null hypothesis at the 0.01 level: the realized value of the test statistic is 111.66 as against the critical value of 12.44 for a test of size $\alpha = 0.01$. Finally the tests based on $A_n$ and $A_n^*$ both reject the null hypothesis at the 0.01 level. The realized value of $\tau_n$ is 43.15 as against the critical value of 1.50; the realized value of $\gamma_n$ is 53.27 as against the critical value of 0.733.

Figure 5: The upper panel shows $f_1(t) = \exp(1.5t)$ and $f_2(t) = f_1(t)$ except for $t \in [0.402, 0.440]$ where $f_2(t) = f_1(t) + 0.575$. The lower panel shows the two data sets $Y_1(t_j) = f_1(t_j) + 0.25Z_1(t_j)$ and $Y_2(t_j) = f_2(t_j) + 0.25Z_2(t_j)$ for $j = 1, \dots, 500$ and with $t_j = j/500$.

Figure 6: The top row shows the power functions of the four tests with $g$ given by (35)(1) and (35)(2); the bottom row shows those with $g$ given by (35)(3) and (35)(4). The Delgado-Neumeyer-Dette test is shown in blue, the Fan-Lin test in black, the test based on $A_n$ in green and that based on $A_n^*$ in red.