Global adaptive smoothing regression

We propose an adaptive smoothing method for nonparametric regression. The central idea of the proposed method is to "calibrate" the estimated function through an adaptive bandwidth function, which is a kind of intermediate solution between the global bandwidth (constant on the support) and the local bandwidth (variable with x). This also allows us to correct the bias of the local polynomial estimator, with some benefits for the inference based on such estimators. Our method, which uses the Neural Network technique in a preliminary (pilot) stage, is based on a rolling, plug-in, bandwidth selection procedure. It automatically reaches a trade-off between the efficiency of global smoothing and the adaptability of local smoothing. The consistency and the optimal convergence rate of the resulting bandwidth estimators are shown theoretically. A simulation study shows the performance of our method for finite sample size.


Aims and motivations
Consider the real bivariate process {Y, X}. A general regression setup is m_φ(x) = E{φ(Y) | X = x}, which includes several special cases through appropriate definition of the function φ. Given a realization of the process {Y_i, X_i; i = 1, . . . , n}, the unknown function m_φ(·) and its derivatives m_φ^(v)(·) can be estimated using the local polynomial estimator (LPE). Denote with m̂_φ(x; h) such estimator, where h denotes the smoothing parameter (bandwidth). The good theoretical properties of the LPE and its conceptual simplicity determine the success of such estimators. But a serious drawback of the LPE is its strong dependence on the bandwidth parameter, which has a remarkable effect on the bias and the variance of the estimator. By studentizing the estimator, we can highlight its bias term. Now, if the true asymptotic optimal local bandwidth h_L^opt(x) is used, where p is the order of the local polynomial estimator, we can show that a non-vanishing bias remains in the normal limit of the estimator, given that p and v are fixed (see also the example in the left panel of fig. 1). So, even when the bandwidth is the best one, the limit distribution is not centred. This has implications for the inference. For example, if we do not consider the bias term explicitly, the confidence intervals are not centred around the true value of the function. As a consequence, the coverage, the size and the power of the confidence intervals and tests are all affected. So, for optimal smoothing regression, both optimal bandwidth selection and bias correction are necessary. In any case, the bias and the bandwidth must be estimated, and the quality of such estimations has strong implications on the final estimate of the regression function.
The aim of this paper is to propose an adaptive smoothing method based on both data-driven bandwidth estimation and bias correction. Our method works through a rolling plug-in bandwidth selection procedure which also provides all the estimations required for bias correction, as will be explained in section 7.3. Figure 1 shows an example. It reports the boxplots of the local polynomial estimations for the conditional mean function of model 6, used in the simulation study of section 7.1 (for samples of size n = 200). The solid lines represent the true regression function. Plot (a) is obtained using our estimated optimal bandwidth, but without bias correction. Plot (b) includes bias correction.
The central idea of our proposed method is to "calibrate" the local polynomial estimated function through an adaptive bandwidth function, which is a kind of intermediate solution between the global bandwidth (constant on the support) and the local bandwidth (variable with x). The motivation of our proposal is based on the consideration made in Wang & Gasser (1996) that the rate of convergence of a global bandwidth estimator is generally much faster than the rate of convergence of any local bandwidth estimator. So the advantage (adaptability) of the local bandwidth function may not transfer to the estimated local bandwidth function, at least for small samples. With our rolling procedure, we estimate global bandwidths which are adaptive on the support, still maintaining the advantage of global smoothing (efficiency), as will be shown in the simulation study. Moreover, we also obtain better results compared with local bandwidth estimators, such as the one proposed in Prewitt & Lohr (2006).
The most important theoretical result of this paper concerns the rates of convergence of the proposed bandwidth and bias estimators, which also influence the final rate of the local polynomial estimator m̂_φ(x; h). We show in Theorem 1 that our global bandwidth estimator reaches the optimal rate of convergence, under conditions that are less stringent than those used by other global bandwidth estimators (as in Fan & Huang (1999)). Furthermore, the result for our local bandwidth estimator is even more interesting. In fact, we show in Theorem 2 that the proposed rolling procedure is a useful "trick" which makes it possible to estimate the true local bandwidth asymptotically, with a relative rate of convergence which confirms the optimal relative rate of convergence for any local bandwidth estimator shown in Wang & Gasser (1996).
The method proposed in this paper for bandwidth selection differs from the usual plug-in procedures for local bandwidth selection that have been proposed in the context of regression estimation. In particular, it differs from the approach used in the paper of Prewitt & Lohr (2006), which directly estimates the local bandwidth function h_L^opt(x). This and similar approaches usually produce highly variable bandwidth functions. On the other hand, it differs from the approaches which partition the X space and apply a global methodology to each subset, such as Fan & Gijbels (1995). Both of these methods require a smoothing of the estimated bandwidth function, to avoid roughness or changes in the bandwidths between partitions. As a consequence, they introduce additional smoothing selection problems in the final stage of the procedure (say post-smoothing). Our approach avoids this problem by introducing a rolling scheme with an automatic rule for the selection of the rolling interval, here denoted with a, which determines the right "degree of locality". Our method borrows the idea of Hall & Schucany (1989) of considering an interval around the point of estimation x, although they use a cross-validation procedure for density estimation, while we propose a plug-in method for regression estimation, with better rates of convergence.
The smoothing method proposed in this paper uses neural networks in a preliminary (pilot) stage, in order to estimate the optimal bandwidth. A parameter d, denoting the number of nodes of the neural network function, is therefore introduced and must be set externally to our smoothing procedure. The presence of such a pilot parameter is not new in the context of kernel regression. In fact, it has the same role as the well known pilot bandwidths used in the classic plug-in bandwidth selection procedures. However, even if the role is the same, there are important differences in their behaviour. First of all, setting the parameter d is much simpler than setting the pilot bandwidths, as will be explained in section 3. Moreover, we will show in the simulation study that variations in (misspecification of) the parameter d have little effect on the final results of the smoothing regression, contrary to what happens with the pilot bandwidths.
This paper is organised as follows. Section 2 introduces the adaptive smoothing method. Sections 3 and 5 describe in detail the estimation procedure based on the neural network technique. In section 4 we propose a methodology to select the optimal value of the parameter a. In sections 6 and 8 we derive the rates of convergence for our bandwidth estimators: section 6 concerns the general bandwidth estimator, while section 8 concerns the local bandwidth estimator, obtained asymptotically when a = a_n → 0 for n → ∞. Then we show the performance of the proposed method: section 7.1 shows the results of a simulation study; section 7.2 gives an example of application of our procedure to a real dataset, while section 7.3 explains how to correct the bias of the final local polynomial estimator. All the technical proofs are contained in the appendix.

Our proposal: From global to local smoothing
The local polynomial estimator is obtained through a weighted least squares regression, where v = 0, 1, . . . denotes the degree of the derivative (v = 0 for the function itself) and the weights W_{K,v,p}(·; h) are derived from a kernel function K(·) and from a local approximation of the function m_φ(x) through a polynomial of order p. The kernel function is a symmetric and bounded density function defined on [−1, 1]. The bandwidth h, which regulates the smoothness of the estimated function, must be such that h → 0 and nh^{2v+1} → ∞ when n → ∞.
The degree of the polynomial can be p = 0, 1, 2, . . ., giving respectively the Nadaraya-Watson estimator, the local linear estimator, the local quadratic estimator, and so on. For theoretical reasons, the parameter p is generally fixed to p = v + 1, or in such a way that p − v is odd. To slim the equations, we omit the symbols p, v and K from the notation of the estimator m̂_φ(x; h). See Fan & Gijbels (1996) for a detailed description of the LPE and its properties. The asymptotic mean squared error of the estimator m̂_φ(x; h) can be decomposed, as usual, into the sum of the squared asymptotic bias and the asymptotic variance, where, denoting with f_X(·) the density of the random variable X and with σ²_φ the conditional variance Var{φ(Y)|X = x} (for simplicity the model is supposed to be homoscedastic), the two terms are given in (4) and (5). The coefficients B_{p,K}^{(v)} are known, and depend on the parameters K, v and p. Table 1 reports the values for the most used kernel functions and for different values of p and v (see Fan & Gijbels (1996) for the formulas). Note that only the cases where p − v is odd are reported. Taking account of the bias-variance trade-off, the asymptotic plug-in optimal local bandwidth is derived by minimizing the AMSE at the point x, giving (6). Note that (6) is a local bandwidth function, which performs a smoothing of the regression function that varies over the support of the estimation. We use the notation h_G^opt to denote a constant bandwidth that minimises some global measure of the estimation error, such as the asymptotic mean integrated squared error (AMISE). We expect local bandwidths to perform better than global bandwidths. In any case, the estimator of h_L^opt(x) is expected to be less efficient than the estimator of h_G^opt (note that h_L^opt(x) is a function while h_G^opt is a single value). So, the question is how to evaluate whether there is an effective gain from using the estimated local bandwidth instead of the estimated global bandwidth.
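As a concrete illustration of the estimator described above, the following sketch computes a local polynomial fit at a single point by weighted least squares. It is our illustrative implementation, not the authors' code; the function names and the choice of the Epanechnikov kernel (used later in the simulation study) are ours.

```python
import math
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, a symmetric bounded density on [-1, 1]."""
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def lpe(x, X, Y, h, p=1, v=0):
    """Local polynomial estimate of the v-th derivative of m at x.

    Fits a polynomial of order p by weighted least squares, with kernel
    weights K((X_i - x) / h); the estimate is v! times the v-th coefficient.
    """
    w = epanechnikov((X - x) / h)
    Z = np.vander(X - x, N=p + 1, increasing=True)   # columns (X_i - x)^j
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Z * sw[:, None], Y * sw, rcond=None)
    return math.factorial(v) * beta[v]
```

On data with an exactly linear mean function, the local linear fit (p = 1) recovers both the level and the slope exactly inside the window, which is a quick sanity check for any implementation.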
In this paper we propose a hybrid method that aims to combine the advantages of local (adaptability) and global (efficiency) smoothing. We call the method Global Adaptive Smoothing (GAS). Our idea is based on the use of a rolling window procedure. Given a point x ∈ χ, where χ is a compact subset of R (see the appendix for more details), the optimal bandwidth for that point is derived by estimating a global bandwidth on the interval centred on x, I_x = [x − a/2, x + a/2], for a given a > 0. The interval I_x must be of positive length and it must contain at least one observed point. By moving the interval I_x, we derive an estimation of the optimal local bandwidth for the other points x on the support of the estimation. Define the AMISE on I_x as in (7). The optimal bandwidth on I_x can be derived by minimizing (7); denote such bandwidth with h_Ix, given in (8). We need to estimate the functionals in (9) and to plug them into (8) in order to obtain the estimated bandwidth ĥ_Ix. Then the final regression estimator of m_φ(x) is m̂_φ(x; ĥ_Ix). In section 3, we propose to estimate the functionals σ²_φ and m_φ^(p+1) using the neural network technique, generalizing a procedure in Giordano & Parrella (2008). Actually, other nonparametric estimators could be used for the estimation of these functionals, such as splines or local polynomials (as traditionally done in the other plug-in methods). Each of these alternatives implies, of course, the necessity of setting some pilot parameters (such as the number of knots or the pilot bandwidths). However, the classic plug-in methods for bandwidth selection are known to depend crucially on the correct identification of such pilot parameters. Instead, here we choose the neural network because it is a global approximator and it allows our smoothing estimator to be stable against the misspecification of the pilot parameter (which in our case is given by the number of nodes d). This is shown in the simulation study.
Note that we do not need to estimate the density f X (·), since it is implicitly estimated by integration (as shown in section 3).
A tuning parameter, a, which determines the width of the window around x, is introduced to tune between the global and local bandwidths: as a → 0, h_Ix tends to the optimal local bandwidth h_L^opt(x), while as I_x grows to cover the whole support it tends to the optimal global bandwidth h_G^opt. For a fixed point x, a finite sample size n and 0 < a < ∞, h_Ix assumes some intermediate value which is suboptimal from a theoretical point of view, but which can be the best choice when h_Ix has to be estimated. Note that the rolling window procedure automatically performs a "smoothing" of the estimated h_Ix, given that a > 0. Therefore, it is not necessary to post-smooth the estimated bandwidth function in a second stage, as with other plug-in local bandwidth selectors (see, for example, the papers of Fan & Gijbels (1995); Prewitt & Lohr (2006); Gluhovsky & Gluhovsky (2007)). In section 4 we suggest a possible strategy for the automatic selection of the parameter a.
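The limiting behaviour of h_Ix in the window width a can be illustrated numerically. The sketch below is ours and uses hypothetical squared-bias and variance functionals B2 and V (not the paper's estimated ones), combined in the plug-in form h ∝ (V̄/B̄²)^{1/5} n^{−1/5} for v = 0, p = 1: as a shrinks, the windowed bandwidth approaches the local one at x, and once I_x covers the whole support it no longer depends on x.

```python
import numpy as np

def h_window(x, a, B2, V, n, grid):
    """Windowed plug-in bandwidth built from averages of the squared-bias
    and variance functionals over I_x = [x - a/2, x + a/2] (v = 0, p = 1)."""
    mask = (grid >= x - a / 2.0) & (grid <= x + a / 2.0)
    return (np.mean(V(grid[mask])) / np.mean(B2(grid[mask]))) ** 0.2 * n ** -0.2

# hypothetical functionals on chi = [0, 1], chosen only for illustration
B2 = lambda u: 1.0 + np.sin(2.0 * np.pi * u) ** 2    # squared-bias term
V = lambda u: np.full_like(u, 2.0)                   # variance term

grid = np.linspace(0.0, 1.0, 2001)
n = 500
h_local = h_window(0.3, 0.002, B2, V, n, grid)   # a -> 0: local bandwidth at x = 0.3
h_global = h_window(0.5, 2.0, B2, V, n, grid)    # I_x covers chi: global bandwidth
```

With a tiny window the result matches the pointwise plug-in value at x; with a window covering χ the same value is returned for every x.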

The neural network GAS algorithm
In this section we present a procedure for estimating the unknown functionals in (9). For simplicity, we consider the problem of estimating the conditional mean function m(·). So, we start from the model (11), where the ε_i are i.i.d. with mean zero and variance σ²_ε. We consider one point of estimation x ∈ χ, around which we define the interval I_x. The bandwidth function is estimated by implementing the rolling procedure described in section 2. We consider a nonparametric estimator q(x; η̂), where η̂ is given by (12) and q_d(u; η) is a Feedforward Neural Network (FNN), with one input layer and one hidden layer. It is defined as in (13), where η = (c_0, c_1, . . . , c_d, a_1, . . . , a_d, b_1, . . . , b_d) is the vector of the parameters of the FNN to be estimated, d is the number of nodes in the hidden layer and Γ(·) is the activation function. We consider the class of feedforward Neural Networks with a sigmoidal activation function, i.e. a measurable function Γ(·) on R such that Γ(x) → 1 when x → ∞ and Γ(x) → 0 when x → −∞. The parameter d acts as a tuning parameter for the neural network function, and must be identified by means of some optimality criterion. So, it has a role similar to the pilot bandwidths of the classic plug-in bandwidth selection procedures. However, the effects of this parameter on the bandwidth estimator are rather different, as will be evidenced in the simulation study. Moreover, the selection of such a tuning parameter is simpler than the selection of a pilot bandwidth, given that the former is a positive integer (generally a few units), whereas the latter is a positive real number. Another important difference from pilot bandwidths is that the parameter d used for the estimation of the derivative function is the same as that used for the estimation of the function itself (while the pilot bandwidths used for the estimation of the derivative function must be of a different order from the bandwidths used for the estimation of the function).
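To fix ideas, here is a minimal sketch of the one-hidden-layer FNN q_d(u; η) with a logistic (sigmoidal) activation, fitted by least squares via plain full-batch gradient descent. This is our illustrative stand-in for the nonlinear minimization, not the authors' implementation; the learning rate, step count and initialization are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fnn(u, eta, d):
    """q_d(u; eta) = c0 + sum_j c_j * sigmoid(a_j * u + b_j)."""
    c0, c, a, b = eta[0], eta[1:d + 1], eta[d + 1:2 * d + 1], eta[2 * d + 1:]
    return c0 + sigmoid(np.outer(u, a) + b) @ c

def fit_fnn(X, Y, d, lr=0.5, steps=5000, seed=0):
    """Least-squares fit of the 3d + 1 parameters by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    eta = 0.1 * rng.standard_normal(3 * d + 1)
    n = len(X)
    for _ in range(steps):
        c0, c, a, b = eta[0], eta[1:d + 1], eta[d + 1:2 * d + 1], eta[2 * d + 1:]
        H = sigmoid(np.outer(X, a) + b)       # n x d hidden activations
        r = c0 + H @ c - Y                    # residuals of the current fit
        G = r[:, None] * H * (1.0 - H) * c    # common factor of the a, b gradients
        grad = np.concatenate((
            [r.sum()],                        # gradient w.r.t. c0
            r @ H,                            # gradient w.r.t. c_j
            (G * X[:, None]).sum(axis=0),     # gradient w.r.t. a_j
            G.sum(axis=0),                    # gradient w.r.t. b_j
        )) / n
        eta -= lr * grad
    return eta
```

In practice any nonlinear least-squares routine can replace the hand-rolled gradient descent; the point is only the parameter layout η = (c_0, c, a, b) and the least-squares objective.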
We define the residuals as ε̂_i = Y_i − q(X_i; η̂) and the centred residuals ε̃_i = ε̂_i − ε̄, where ε̄ = (1/n) Σ_{i=1}^n ε̂_i. Thus we propose the estimator of σ²_ε given in (14). Next, we need to estimate the derivative function m^(p+1)(x). We use the previous NN estimate, taking the derivative of order p + 1 of the estimated NN function, as in (15). Note that h_Ix can be written as in (16), where, given that dω_Ix(u) = du/µ_X(I_x), the functionals are those in (17). In (17), µ_X(I_x) is the measure of I_x with respect to the distribution function of X. We prefer to express the functionals in (17) with respect to the conditional distribution function of X because this provides some advantages from a computational point of view. We can then propose the two estimators in (18) for the functionals in (17), with respect to a set I_x of length a.
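For the logistic activation, the derivatives of the fitted network are available in closed form, so both the residual variance and the derivative functional can be computed directly from the pilot fit. The sketch below is ours (parameters passed explicitly, names hypothetical) and covers the common case p = 1, where the required derivative is the second one; it uses Γ' = Γ(1 − Γ) and Γ'' = Γ(1 − Γ)(1 − 2Γ).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_predict(u, c0, c, a, b):
    """One-hidden-layer network c0 + sum_j c_j * sigmoid(a_j * u + b_j)."""
    return c0 + sigmoid(np.outer(np.atleast_1d(u), a) + b) @ c

def sigma2_hat(X, Y, c0, c, a, b):
    """Residual variance estimator: centre the NN residuals, then average squares."""
    res = Y - nn_predict(X, c0, c, a, b)
    res = res - res.mean()
    return float(np.mean(res ** 2))

def nn_deriv2(u, c0, c, a, b):
    """Second derivative of the fitted network (the m^(p+1) estimate for p = 1)."""
    s = sigmoid(np.outer(np.atleast_1d(u), a) + b)
    # chain rule: d^2/du^2 sigmoid(a u + b) = a^2 * s (1 - s)(1 - 2 s)
    return (s * (1.0 - s) * (1.0 - 2.0 * s)) @ (c * a ** 2)
```

Because the same pilot fit yields both the function and its derivatives, no separate pilot bandwidth of a different order is needed, which is the point made above.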
The final bandwidth estimator, for a given value of the parameter a, is obtained by plugging (18) into (16).

Setting the parameter a
Here we suggest a strategy for the selection of the parameter a. First, suppose that the functionals in (4) are known. We expect asymptotically an increase in the value of the local AMSE when using a bandwidth different from h_L^opt(x). Using (4), (5) and (6), the relative increment is given by (21). So, for fixed v and p, it is directly connected with the ratio h_Ix/h_L^opt(x). The minimum is reached, as expected, at h_Ix/h_L^opt(x) = 1, which leads to condition (22). There are two opposite cases for condition (22). The first one is when the ratio B²(x)/V(x) is constant on χ. In such a case, we have h_Ix = h_G^opt, ∀x ∈ χ and ∀a > 0. The second is when the ratio B²(x)/V(x) is not constant. In this case, given (10) and (20), the only solution for the minimization of (21) is a = 0, i.e. the local bandwidth. However, we may argue that in many cases the ratio B²(x)/V(x) is not constant but rather "smooth", such that it can be seen as substantially constant over subintervals. In this case, there can exist a solution of (22) for some a > 0. Moreover, note that (22) is actually a pointwise condition, suggesting a pointwise optimal value, a*_x. But, if we are interested in the estimation of the function m_φ(x) on the whole support χ, we may prefer to identify a constant value of a which can be considered globally optimal, though locally suboptimal. In this way, we keep the procedure computationally simple. This is the reason why we introduce the global condition (23), which is not equivalent to (22) but which can be used to estimate a global parameter a. This is only one possible criterion to choose the parameter a. Our idea is to balance, in mean over the support χ, B²(x)/V(x) (related to the local smoothing) and B_Ix/V_Ix (related to the global adaptive smoothing). In this way, we find a global parameter a which is able to capture the trade-off between local and global smoothing.
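For the special case v = 0 and p = 1 the relative increment has a simple closed form, which we work out here as an illustration (this is our special case, not a general formula from the paper): writing h = r · h_L^opt(x), the squared bias scales as h⁴ and the variance as 1/(nh), and at the optimum the squared bias equals one quarter of the variance, so AMSE(r h*)/AMSE(h*) = (r⁴ + 4/r)/5, minimised at r = 1.

```python
import numpy as np

def rel_amse_increment(r):
    """Relative increase of the local AMSE when h = r * h_opt (v = 0, p = 1):
    AMSE(r h*) / AMSE(h*) - 1 = (r**4 + 4 / r) / 5 - 1, minimised at r = 1."""
    r = np.asarray(r, dtype=float)
    return (r ** 4 + 4.0 / r) / 5.0 - 1.0
```

The curve is quite flat near the minimum: a 20% error in the bandwidth inflates the AMSE by only about 8%, which is why a mildly suboptimal intermediate bandwidth can still pay off once estimation error is taken into account.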
Under model (11), using (17), condition (23) can be written as in (24), where we compare the functional R(x) with its windowed counterpart R_a(x), defined as an integral over the interval [x − a/2, x + a/2]. We then plug consistent estimators for R(x) and R_a(x) into equation (24), for some given values of a and x. Let n_x be the number of estimation points in χ.
The estimate â_n is obtained by solving equation (25). It can be easily shown that (24) always admits a solution a > 0 and that the estimator â in (25) is consistent in probability.
Overall, both the tuning parameters d and a are essential, but they have different roles in our smoothing procedure. The parameter a represents the trade-off between local and global smoothing, while the parameter d works in the so-called pre-smoothing stage, so it has a role similar to the well known pilot bandwidths typically used in the classic plug-in procedures.

Computational considerations on the NN-GAS algorithm
The estimation of the parameter a does not appreciably increase the computational burden of the bandwidth selection procedure. To explain why this is so, we present schematically the steps of the GAS algorithm.
• Step 1: using (12), we estimate the neural network function q(X_i; η̂), for i = 1, . . . , n. To this end, a BIC procedure can be used to derive the optimal value of d.
• Step 2: by (14) and (15), using q(X_i; η̂), we derive the NN estimator of the variance, σ̂²_ε, and the NN estimator of the derivative, m̂^(p+1)(X_i), i = 1, . . . , n.
• Step 3: we estimate the parameter a by solving (25).
• Step 4: for each point of estimation x ∈ χ, we compute the local estimators in (18) and plug them into (16) to obtain ĥ_Ix.
Note that only step 1 is computationally intensive, since it implies a nonlinear minimization in order to estimate the neural networks function. The other steps are very fast, given that they are based on simple calculations of such estimated values, or simple averages of the estimated values falling in each interval I x .
Note that the NN estimations in steps 1 and 2 are global estimations, i.e. they do not depend on the points of estimation x. This means that they need not be repeated for different values of x. On the other hand, in steps 3-4 the NN global estimations are transformed, through (18), into "local" estimations. Finally, only step 4 must be repeated for each desired point of estimation x ∈ χ. However, this adds only a little computing time, as will be shown now.
The algorithm for the estimation of a in step 3 is based on successive splittings of the interval in order to derive the solution of (25).
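A generic version of such a splitting scheme is ordinary bisection. The sketch below is our illustration: D stands for the estimated difference between the two sides of the balancing condition, assumed continuous with a sign change on the initial interval.

```python
def solve_by_bisection(D, a_lo, a_hi, tol=1e-6, max_iter=200):
    """Root of D(a) = 0 on [a_lo, a_hi] by successive interval splittings.

    Assumes D is continuous and changes sign between a_lo and a_hi.
    """
    f_lo = D(a_lo)
    for _ in range(max_iter):
        mid = 0.5 * (a_lo + a_hi)
        if a_hi - a_lo < tol:
            return mid
        f_mid = D(mid)
        if f_lo * f_mid <= 0.0:      # sign change in the left half
            a_hi = mid
        else:                        # sign change in the right half
            a_lo, f_lo = mid, f_mid
    return 0.5 * (a_lo + a_hi)
```

Each splitting halves the interval, so the cost of this step is logarithmic in the required tolerance, which is consistent with the small computing-time overhead reported below.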
To give an idea of the computing time, consider, for example, a sample of size n = 500 and n_x = 50 points of estimation x_j ∈ χ. Let us consider three cases: in the first, we aim to estimate the pure global bandwidth (in such a case, step 3 is skipped and step 4 must be adapted for the global bandwidth estimator); in the second case, we aim to estimate the variable bandwidth h_Ix for a fixed parameter a (in this case, we only skip step 3); in the last case, we perform the full algorithm. We run the algorithm 100 times on a dual core with Intel Pentium 2.00 GHz. In the first case (pure global bandwidth), the average computing time is 8.03 seconds. In the second case, the time is 8.16 seconds; so, step 4 for the 50 points takes only 0.14 seconds. Finally, in the last case (local bandwidth), the average time to run the whole algorithm is 10.50 seconds. These computing times are satisfactory for a local bandwidth selection procedure. For example, the average time for one iteration of the Prewitt-Lohr method is 12.67 seconds. Moreover, as pointed out in subsection 7.1, the BIC criterion is consistent for selecting the parameter d, but it is not very sensitive in case of oversmoothing (i.e., the bandwidth estimates are substantially stable for d ≥ 3). So, if we fix d = 4 we avoid the BIC selection step, and our procedure takes only 4.61 seconds.

Theoretical results for the rate of convergence
In this section we investigate the rate of convergence for our GAS procedure, and compare it with the rates of convergence of the most popular global bandwidth selectors proposed in the literature.
When analysing the rate of convergence of a global bandwidth selector, two aspects need to be considered: the first involves the closeness of the AMISE approximation to the MISE (the AMISE being the leading term of the MISE expansion); the second concerns the quality of the functional estimation. Concerning the first aspect, using the same arguments as in section 2 of Ruppert et al. (1995), we can write the expansion in (26), so the bandwidth h_G^opt = arg min_h AMISE{m̂_φ(x; h)} represents the first order approximation of the true optimal bandwidth minimizing the MISE, denoted with h_MISE.
Suppose for simplicity that v = 0 and p = 1, so that p − v is odd. Fan & Huang (1999) derive the bound in (27), showing an upper bound on the rate of convergence of any plug-in bandwidth estimator based on the estimation of h_G^opt. They also show that it is useless to consider further terms in the asymptotic expansion of the MISE in order to improve the rate of convergence. The only way to change this rate is to modify the kernel estimator in order to decrease the order of its bias and/or variance, based, for example, on the proposal in He & Huang (2009). Although our procedure can be generalised to those setups, for simplicity we do not consider this point further.
The second aspect, concerning the quality of the functional estimation, is of more interest to us, because it highlights the real differences among the bandwidth estimators. Ruppert et al. (1995) show that the best rate of convergence achieved by their bandwidth selectors is O_p(n^{−2/7}), when using a local cubic polynomial fit to estimate the derivative function m_φ^(2). Fan & Huang (1999) obtain a rate of the order O_p(n^{−2/5}) for the pre-asymptotic substitution method proposed by Fan & Gijbels (1995), which reaches the upper bound in (27); but they consider a local polynomial fit of order p = 5 to estimate the derivative function m_φ^(2), which has several implications: first of all, an increase in the variability of the estimator; second, the need for more data to avoid invertibility problems; finally, the need for more stringent assumptions on the model, in terms of the existence of higher order derivatives. Moreover, although the results in Ruppert et al. (1995) and Fan & Huang (1999) are a big improvement on the cross-validation procedures (which are O_p(n^{−1/10}), as shown in Härdle et al. (1988)), they are handicapped by the need to assume an "optimal" order for the pilot bandwidths used in the polynomial fit of the derivative m_φ^(2), which overlooks the fact that these pilot bandwidths also need to be estimated. When the pilot bandwidths are estimated, the final rate of convergence of the bandwidth selector becomes worse.
Our global smoothing method achieves the best rate of convergence without requiring any recursive bandwidth estimations. Let h_G^opt be defined as in (28), where |χ| is the length of the closed and bounded interval χ. Moreover, suppose that m^(2)(·) is not identically zero on each I_x. If we consider I_x, we can write MISE_Ix as in (26). Note that AMISE_Ix is defined in (7). In this case, the bandwidth which minimizes AMISE_Ix is h_Ix (see (8)). Let ĥ_G^opt be the estimator of (28). We can state the following two results.
Theorem 1. Under assumptions (a1)-(a6) in the appendix, the bandwidth estimator ĥ_Ix has the following uniform rate of convergence to the true bandwidth h_Ix.

Corollary 1. Under assumptions (a1)-(a6) in the appendix, the global bandwidth estimator ĥ_G^opt has the following rate of convergence to the true bandwidth h_G^opt.

Remark 1. In Theorem 1 and Corollary 1, we consider a sequence for d given in Assumption (a6). However, a selection criterion such as BIC can be used for d.
The penalty term in the BIC criterion is n_d log(n)/n, where n_d indicates the number of parameters to estimate in a feedforward Neural Network with d hidden layer neurons. Under assumptions (a1)-(a6) in the appendix, the same rates continue to hold when d is selected by BIC. In order to prove this result, it is sufficient to combine Theorem 4 in Barron (1994), Lemma 1 and the same arguments as in the proof of Lemma 2.
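With the parameter vector η = (c_0, c_1, …, c_d, a_1, …, a_d, b_1, …, b_d), the network has n_d = 3d + 1 free parameters, so the criterion can simply be scanned over a small range of d. A sketch (ours; fit_mse stands for whatever routine returns the residual mean squared error of the fitted FNN with d nodes):

```python
import math

def bic_select_d(fit_mse, n, d_max=6):
    """Select the number of hidden nodes d minimising
    BIC(d) = log(fit_mse(d)) + n_d * log(n) / n, with n_d = 3 * d + 1."""
    best_d, best_bic = None, math.inf
    for d in range(1, d_max + 1):
        bic = math.log(fit_mse(d)) + (3 * d + 1) * math.log(n) / n
        if bic < best_bic:
            best_d, best_bic = d, bic
    return best_d
```

Once the fit stops improving, the penalty dominates and larger d is rejected; this matches the observation in the simulation study that the bandwidth estimates are substantially stable under moderate oversmoothing of the pilot.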
7. The NN-GAS procedure at work

Results from a simulation study
In this section we report the results of a Monte Carlo experiment aimed at assessing the numerical performance of the NN-GAS procedure. Since our procedure can be used to perform both global and local smoothing, we compare it with the most widely used procedures for global and local bandwidth selection. In relation to global smoothing, we compare our procedure with the plug-in bandwidth selection algorithm of Ruppert et al. (1995) (using their package implemented in the R environment) and with the Cross-Validation method. Here, we denote such bandwidth selection algorithms as RW and CV, respectively. Note that the CV method uses a prediction error measure, contrary to plug-in methods, which use MISE based measures. In this way, we give an idea of the performance based on different criteria. In relation to local smoothing, our benchmark is the method proposed by Prewitt & Lohr (2006), who suggest a local bandwidth selector based on the eigenvalue approach. We denote this bandwidth selection algorithm by PL. We replicate here the same simulation study of Prewitt & Lohr (2006). In this way, we also compare our NN-GAS procedure indirectly with their competitors Fan & Gijbels (1995); Ruppert et al. (1995); Choi et al. (2000). We consider six different models, and we want to estimate the conditional mean function m(x). Table 2 reports the details of the models used in the simulation study. The first four models were considered by Fan & Gijbels (1995) and Prewitt & Lohr (2006). Models 5 and 6 were used by Ruppert et al. (1995) and Prewitt & Lohr (2006). We consider different values of n and σ_ε and we generate 500 replications for each setting configuration. For models 5 and 6, we fix the variance of the error σ²_ε as in Ruppert et al. (1995). The number of points in which the regression estimation is performed is fixed to n_x = 50 (chosen uniformly on the support χ).
We use the Epanechnikov kernel function for the local polynomial estimations and the logistic activation function for the neural network estimations. The number of nodes in the hidden layer, d, is selected following a BIC optimization procedure. Finally, all the estimations are derived without implementing bias correction, in order to make the comparison with the competitors fair.

Table 3 assesses the performance of our method using global smoothing, which means that we consider the pure global bandwidth estimator, constant on the support of estimation. We want to stress here that the procedure computes once and for all the estimated global bandwidth ĥ_G^opt, which is then used to smooth the function m(x) at each point of estimation x. For the RW method, we use the package implemented by Ruppert et al. (1995) in the R environment, with all the pilot estimations set automatically. We report the median, mean and standard deviation of the integrated squared error (denoted with MedISE, MISE and SDISE, respectively), calculated over the 500 replications of models 1-6 (note that here the integrated squared error is taken as the sum over the LP estimates obtained at the n_x = 50 estimation points). On the right of the table, we also report some summary statistics giving evidence of the influence of the parameter d (the number of nodes in the hidden layer of the NN function) on the bandwidth estimations. In particular, we report the mean value and the standard deviation of d̂, denoted by m_d and s_d, respectively. The results show that the variance of d̂ is low, while its mean value is not greater than 5. For completeness, we also report the mean and the standard deviation of the estimated global bandwidth (m_ĥ and s_ĥ, respectively).
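The error summaries used in the tables can be reproduced as follows (our sketch; the ISE is taken, as stated above, as a sum of squared errors over the n_x estimation points):

```python
import numpy as np

def ise(m_hat, m_true):
    """Integrated squared error over the estimation grid, taken as a plain sum."""
    return float(np.sum((np.asarray(m_hat) - np.asarray(m_true)) ** 2))

def summarize_ise(ises):
    """MedISE, MISE and SDISE over the Monte Carlo replications."""
    ises = np.asarray(ises, dtype=float)
    return {"MedISE": float(np.median(ises)),
            "MISE": float(np.mean(ises)),
            "SDISE": float(np.std(ises, ddof=1))}
```

The sample standard deviation (ddof = 1) is used for SDISE, which is the natural choice over a finite number of replications.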
In order to give more evidence of the sensitivity of the estimations to the pilot parameter d, figure 2 shows the boxplots of the NN-GAS estimations of the global bandwidth h_G^opt, obtained for model 1 and for fixed values of d (ranging from 2 to 5). The solid lines represent the true global bandwidth. Plot (a) considers n = 200, while plot (b) refers to n = 500. In both cases, the first boxplot on the left summarizes the results obtained when estimating d through the BIC algorithm. Remember from table 3 that the mean value of d, for such a model, is approximately 3, as desired. Note also from the boxplots of figure 2 that undersmoothing the NN function (setting d < 3) may cause some problems in the bandwidth estimations, while oversmoothing (setting d > 3) is not so crucial. The situation is similar for the other models.

Table 4 compares the NN-GAS method with other smoothing methods: 1) the oracle smoothing estimator based on the true asymptotic global bandwidth h_G^opt; 2) the smoothing estimator based on the RW estimated global bandwidth; 3) the smoothing estimator based on the CV estimated global bandwidth. The table reports the relative increments in the median, mean and standard deviation of the ISEs, with respect to the results of the NN-GAS method reported in table 3. These values are calculated using the formula in (30) for the MedISE, and similarly for the other two indicators MISE and SDISE; they must be multiplied by 100 to give the percentage variations. So, a positive value of (30) shows a better performance of the NN-GAS method compared with the competitor, and the value of MedISE_∆ multiplied by 100 gives the percentage increment in the MedISE. The asterisked cases were omitted because the RW method produces few results, due to invertibility problems.
For example, looking at the first line of the table, we note that the oracle smoothing estimator has a MedISE which is 3% lower than the MedISE of the NN-GAS method presented in table 3. This represents the loss due to the estimation of the asymptotic optimal global bandwidth. The positive values which appear in the column of the oracle smoothing estimator are more interesting to comment on. They show that, in some cases, the NN-GAS procedure performs even better than the oracle smoothing procedure: this is due to the use of the asymptotic expression of the optimal bandwidth, which for small samples can be distant from the true optimal bandwidth (in fact, note that all positive and negative values converge to zero for increasing n).

Table 5: Relative increments in the mean and standard deviation of the Prediction Error, with respect to the results of the NN-GAS method, observed when estimating the function m(x) through global smoothing with the CV's estimated global bandwidth. The values must be multiplied by 100 to give the percentage variations.

It is interesting to note that, in general, the relative variations in the column of the oracle smoothing estimator are not very high, contrary to what happens in the other two columns, relative to the RW and CV methods. The NN-GAS method clearly outperforms the RW and the CV methods, especially in terms of variability. The asterisks hide some values that, though definitely favouring the NN-GAS method, are not reliable because they are based on only a few results for the RW.
The reason for this is that the pilot estimations in the RW method are based on the estimation of higher order derivatives through the LPE, using p = 3, and this implies some invertibility problems for small samples (as said in the introduction, this is one of the drawbacks of the classic plug-in methods which is solved by our method, since the NN-GAS procedure uses the same pilot estimation of the function m(x) in order to derive the estimate of the derivative m^{(p+1)}(x)). Finally, we evaluate the performance of our estimator using a different measure, the prediction error. We compare our method with the CV method, reporting the relative increment in the same style as in table 4 (of course, we have corrected this indicator by the variance of the error, given that the prediction error converges to that value asymptotically). Therefore, positive values in table 5 show an advantage of our method compared with the CV. Given the computational burden of the new measure, we considered only some of the models (choosing among the most interesting ones). The performance of our bandwidth estimator, based on the new measure, can be considered satisfactory. Table 6 shows the results for local smoothing, which means that we allow the bandwidth to vary over the support of the regression function. Thus, for each model we fix the value of a as the value that produces the true minimum MISE. Table 6 is organised as follows. The results for the neural network method are reported in absolute values, as in table 3. On the right of the table, we present some statistics on the estimated values of the tuning parameters d and a, in particular the mean and standard deviation. For the parameter d, one can note that the results are nearly identical to those in table 3. This confirms that the NN-GAS local bandwidth is estimated on the basis of the same NN estimations used for the global smoothing procedure.
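The variance-corrected prediction error used above can be sketched as follows (hypothetical code; the correction subtracts the error variance, the value the raw prediction error converges to asymptotically, as described in the text):

```python
import numpy as np

def corrected_prediction_error(y_new, m_hat_new, sigma2_eps):
    """Mean squared prediction error on held-out observations, corrected
    by the error variance sigma2_eps so that the indicator converges to
    zero (rather than sigma2_eps) for a perfect estimate of m(x)."""
    pe = float(np.mean((y_new - m_hat_new) ** 2))
    return pe - sigma2_eps
```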
Finally, the last column in table 6 tracks the estimated values of the tuning parameter a, obtained using (25) and divided by the length of the interval χ to give the relative measure â_r = â/range(χ).

Table 6: Median, mean and standard deviation of ISEs, observed when estimating the function m(x) for the 500 replications of models 1-6, using local smoothing. On the right, we report the mean (m) and the standard deviation (s) of the estimated tuning parameters: the number of nodes d and the length of the interval a (the latter divided by the length of the interval χ, to give a relative measure of a).

In this case, we use the first approach described in section 4, for two reasons. First, from a theoretical point of view, we have a faster rate of convergence, as shown in Remark 1. Second, from a computational point of view, we need not worry about empty intervals I_x. One can note that, from model 1 to model 4, there are some benefits in considering our approach, since the estimated relative measure â_r is rather small with respect to one. Instead, for models 5 and 6, the estimated â suggests a substantially global smoothing. In table 7, we compare the previous results with some alternatives. First, we want to evaluate what happens when we estimate the parameter a, as reported in the last column of table 6. When we estimate the parameter a, we compare the corresponding ISE results with those reported in table 6. To show the relative increments, we report on the left of table 7 the ∆MedISE, ∆MISE and ∆SDISE statistics. Note that, in general, the relative increment is definitely lower than the increment obtained when comparing the NN-GAS with the PL method (right side of table 7). Indeed, in some cases there is even a decrease (negative values), showing that there is still room for improvement of the NN-GAS results. Looking at the right of table 7, our procedure works better than the PL procedure.
Note that we implement the PL procedure considering the best possible configuration of settings (in particular, the results are sensitive to the choice of the bandwidth grids in the two minimization steps, so we try different grids).

Table 7: Relative increments in the median, mean and standard deviation of ISEs, with respect to the results of the NN-GAS method reported in table 6, observed when estimating the function m(x) through local smoothing with different bandwidths: the NN-GAS local bandwidth with a estimated using (25); the true optimal local bandwidth h_L^opt(x); and the PL's estimated local bandwidth. Note that the values reported here are calculated using the formula in (30).

Sometimes, the PL shows lower variability for short datasets. This is not surprising, since the variability of the PL is limited by the use of the grid of bandwidths, while the NN-GAS can generate unbounded values of the bandwidth. As the sample size increases, the estimations become more stable, and the NN-GAS also improves over the PL in terms of variability. Finally, the central columns of table 7 report the ∆MedISE, ∆MISE and ∆SDISE statistics for the oracle local smoothing estimator, which is based on the use of the true asymptotic local bandwidth h_L^opt(x). The results seem quite satisfactory.
In order to compare with the results in Prewitt & Lohr (2006), we also report in table 8 the results obtained when estimating the first derivative of the conditional mean function, denoted here by m′(x). The results are comparable with those in tables 6 and 7, although, as expected, they reflect the greater difficulty of estimating the derivative function (which is a latent function).

Application to a real dataset
In this section we analyse a real dataset downloaded from the data archive of NASA, at the following address: https://mynasadata.larc.nasa.gov/

Table 8: Median, mean and standard deviation of ISEs, observed when estimating the derivative function m′(x) for the 500 replications of models 1-6, using local smoothing. In the central column, the results for the neural network method are reported in absolute value. On the right, the results for the PL method are reported as relative increments with respect to the NN-GAS method.

The x values are the longitudes from 154.9W to 19.9E, for a total of 700 observations, while the y values give the monthly precipitation, measured in mm/hr, observed at zero latitude in January 1999. For our analysis we scaled the x values (longitudes) to the range (0, 1). Figure 3 reports the results of the analysis. Plot (a) gives the observed values. Plot (b) shows the estimated bandwidth functions for different values of the parameter a. We consider n_x = 20. It is evident how the bandwidth locally adapts to the particular structure of the dataset as the parameter a changes. The bold line denotes the optimal one, that is, the bandwidth function derived by estimating the optimal value of the parameter a using the criterion suggested in equation (25). The estimate of a is 0.26 against an interval of length 1. Looking at figure 3, we can argue that there is an advantage in using a local bandwidth selection procedure, because the behaviour of the data structure seems to vary considerably over the x values. Figure 4 reports the local bandwidth estimates of our method (NN-GAS) (solid line), the local bandwidth estimates of the PL method (dashed line) and the global bandwidth derived using the neural network approach.
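The rescaling of the covariate to (0, 1) mentioned above is a standard min-max transformation; a sketch follows (the `lon` grid below is hypothetical, standing in for the 700 observed longitudes, with 154.9W written as −154.9 and 19.9E as 19.9):

```python
import numpy as np

def scale_to_unit(x):
    """Min-max rescaling of the covariate to the range (0, 1),
    as done for the longitudes of the NASA precipitation dataset."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical evenly spaced grid of 700 longitudes.
lon = np.linspace(-154.9, 19.9, 700)
x = scale_to_unit(lon)
```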
If we compare the mean squared residuals computed on the observed values, the global approach and the local one (NN-GAS) give nearly the same value. Therefore, contrary to what one could expect by observing the scatterplot of the data, the optimal bandwidth function is almost equivalent to the global bandwidth function, so a substantially global smoothing should be preferred to a local smoothing for the analysed dataset. This is because the estimated global bandwidth is very low, as we can note in figure 4. Instead, the same measure for the PL method gives a greater value. Since we do not know the true underlying function, it is more interesting to analyse the two local bandwidth curves from the NN-GAS and PL methods. In this respect, it is important to point out that the local bandwidth estimated by NN-GAS is a very smooth curve compared with the PL estimates (figure 4). Besides, the behaviour of the NN-GAS bandwidth curve respects the data points: when the data are nearly constant, the estimated bandwidth is high; otherwise, the estimated bandwidth is lower. This aspect is important because a higher bandwidth implies a greater value of nh, so we can reduce the variability of the nonparametric estimator, which matters for inference, as in the case of confidence bands. Based on these considerations, we can suggest the use of a local bandwidth approach. Moreover, our method gives an automatic trade-off between adaptability (local bandwidth) and efficiency (global bandwidth).

Bias correction of the local polynomial estimator
Having estimated V_{ωI_x} and h_{I_x}, we have all the information needed to correct the bias of m̂_φ(x; h_{I_x}). In fact, given (2), we can write the bias as in (31), where the operator sign_{I_x} denotes the mean of the sign operator on I_x (it can be estimated, for example, through the mean of the signs of the estimated function (15) over the interval I_x). Finally, the bias corrected local polynomial estimator is obtained by subtracting the estimated bias from m̂_φ(x; h_{I_x}), where the estimate of Bias[m̂_φ] is obtained from (31) by substituting the unknown functionals with their estimators. This correction has been implemented in plot (b) of figure 1.
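The bias correction can be sketched generically as follows: a standard local linear estimator (p = 1) with the Epanechnikov kernel used in the paper, minus a plug-in bias estimate. The functions below are illustrative, not the authors' implementation; in particular, the bias estimate `bias_hat`, which in the paper is built from the estimated functionals in (31), is passed in here as a precomputed value:

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, the kernel used in the simulations."""
    return np.where(np.abs(u) <= 1, 0.75 * (1.0 - u ** 2), 0.0)

def local_linear(x0, X, Y, h):
    """Standard local linear estimator (p = 1) of m(x0):
    weighted least squares of Y on (1, X - x0)."""
    w = epanechnikov((X - x0) / h)
    Z = np.column_stack([np.ones_like(X), X - x0])
    W = np.diag(w)
    beta = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ Y)
    return float(beta[0])  # intercept = estimate of m(x0)

def bias_corrected(x0, X, Y, h, bias_hat):
    """Bias-corrected local polynomial estimate: raw estimate minus a
    plug-in bias estimate supplied by the caller."""
    return local_linear(x0, X, Y, h) - bias_hat
```

Note that the local linear estimator reproduces linear functions exactly, so on linear data the raw estimate is already unbiased and the correction term should be estimated as (approximately) zero.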

Further extensions: Rate of convergence for the GAS local bandwidth estimator
Note that our method can also be used to estimate the optimal local bandwidth, by considering a sequence of values a_n such that a_n → 0 as n → ∞. For the sake of completeness, in this section we derive the rate of convergence of our local bandwidth estimator. We show that it almost reaches the upper bound for a local bandwidth estimator established in Wang & Gasser (1996). First of all, we consider the results in Fan et al. (1996). Their Theorem 2 shows that, for x_0 ∈ χ, v = 0 and p = 1, the relation in (32) holds, where h_MSE(x_0) is the bandwidth that minimises the true MSE at the point x_0 and h_L^opt(x_0) is defined in (6). Hence, the ideal bandwidth h_MSE(x_0) behaves to first order like the asymptotically optimal bandwidth h_L^opt(x_0), and (32) gives the upper bound for the rate of convergence of a plug-in bandwidth estimator based on the approximation of h_L^opt(x_0), uniformly with respect to h. For the non-uniform case, the bound becomes the one in (33). Let ĥ_{I_{x_0}}, defined in (19), be the estimator of h_{I_{x_0}} at the point x_0, with p = 1 and v = 0. Moreover, suppose that m^{(2)}(·) ≢ 0 in I_{x_0}. Now, given the bound in (33), the following result holds.
Theorem 2. Under assumptions (a1)-(a4) in the appendix, with p = 1, if a_n → 0 and na_n^5 → ∞ when n → ∞, and if the second derivative of f_X(x) and the fourth derivative of m(x) are bounded ∀x ∈ I^n_{x_0} and m^{(2)}(x_0) ≠ 0, then the bandwidth estimator ĥ_{I^n_{x_0}} has the following rate of convergence to the true ideal local bandwidth h_MSE(x_0), where I^n_{x_0} = [x_0 − a_n/2; x_0 + a_n/2].
Remark 2. Theorem 2 states that ĥ_{I^n_{x_0}} converges in probability to h_MSE(x_0) when a_n → 0 and na_n^5 → ∞ as n → ∞. The rate of convergence depends on a_n; when we consider such a sequence we estimate, asymptotically, the local bandwidth at a point x_0. If we consider the sequence a_n = O(n^{−1/9}), we obtain the best rate of convergence for our local bandwidth estimator, which is O_p((log n)^{1/2} n^{−2/9}) by Theorem 2. This result almost reaches the upper bound stated in Wang & Gasser (1996), that is, O_p(n^{−2/9}).
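The conditions of Theorem 2 are easily verified for this choice of sequence:

```latex
a_n = n^{-1/9} \;\Longrightarrow\; a_n \to 0 ,
\qquad
n a_n^5 \;=\; n \cdot n^{-5/9} \;=\; n^{4/9} \to \infty ,
```

so both conditions hold, and the resulting rate $O_p((\log n)^{1/2}\, n^{-2/9})$ differs from the bound $O_p(n^{-2/9})$ of Wang & Gasser (1996) only by the slowly varying factor $(\log n)^{1/2}$.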
Remark 3. The proposed procedure can be generalised to the case of multivariate predictors. First, Barron (1993) and Chen & Shen (1998) already provide the theory needed to deal with multivariate neural network estimators. Second, using the properties of the product kernel, we can refer to well-known results in the literature (see, for example, Ruppert & Wand (1994)) in order to derive the multivariate functionals for the optimal bandwidth.
Appendix: The consistency of the NN-GAS procedure

Let χ ⊂ R be a compact set such that χ ⊆ S_X, where S_X is the support of the random variable X.

Assumptions
(a2) The density function of X, f_X(·), is positive and bounded ∀x ∈ χ.
(a3) The X_i are i.i.d., the ε_i are i.i.d., and the X_i's are independent of the ε_i's, ∀i.
(a4) The sigmoidal activation function Γ(·) has a continuous (p + 1)-th derivative.
Now, the compact set χ can be taken as the support of the random variable X, if the latter is compact. Instead, if the support of X is not compact, we can set χ such that its probability measure is some positive value γ, for example γ = 0.95. Now, we state the consistency result for the neural estimators. Consider the process {Y*_i, X*_i}, where the X*_i are the random variables X_i ∈ χ, while Y*_i and ε*_i are the corresponding random variables Y_i and ε_i from model (11), i = 1, 2, . . . , n*, with n* = nμ_X(χ). Thus n* = O(n). We denote by q(x; η̂*) the corresponding neural network estimator, as in (12).

Lemma 1. Under assumptions (a1)-(a6), the neural network estimator of m(·) is consistent in the sense that:

Proof. We have to consider two cases. First, if the support of X is compact, we can fix χ = S_X. Second, if the support of X is not compact, then we build the random variable X*, which has a compact support. Note that, in this case, n* = O(n). Moreover, using assumption (a3), the function m^{(p+1)}(·) is square integrable on χ. Thus we can apply the results in Barron (1993). Moreover, using Case 1.1 in Chen & Shen (1998), the result follows. The next two lemmas show some preliminary results which are used in Propositions 1 and 2.
Finally, the result follows.
Let I_x = [x − a, x + a], with a > 0 and ∀x ∈ χ. By assumption (a2), it follows that μ_X(I_x) > 0. Moreover, the number of observed values in I_x from (11) tends to infinity as n → ∞, with probability one. We can write the estimator B̂_{ωI_x} from (18) as follows.
Proof. The estimator in (34) can be written as follows.
The quantity B²_{p,K} is known. Note that n* = O(n). Then, given the neural network estimator based on {Y*_i, X*_i}, i = 1, 2, . . . , n*, and using Lemma 3, we have that

F. Giordano and M. L. Parrella
Now, using a Taylor expansion and the same arguments as at the end of the proof of Lemma 2, the proof is complete.
Now we have to consider the estimator V̂_{ωI_x} in (18).
Proposition 2. Under the same assumptions as in Proposition 1, the estimator V̂_{ωI_x}, defined in (18), with I_x ⊆ χ, is consistent in the sense that:

Proof. It follows immediately by applying Lemma 2 and the same arguments as in Proposition 1. Now, we consider the optimal bandwidth and its plug-in estimator for the unknown function m(·) in model (11), or its derivatives, using the local polynomial estimator.
Let I^n_x = [x − a_n/2; x + a_n/2], where a_n → 0 as n → ∞. Let ĥ_{I_x} be the estimator of h_{I_x}, defined in (19).
Proposition 3. Under assumptions (a1)-(a4), we have: i) if a > 0, assumptions (a5) and (a6) hold, and m^{(p+1)}(·) ≢ 0 in I_x, then ĥ_{I_x} is consistent in the sense that: ii) if na_n^5 → ∞ when n → ∞, m^{(p+3)}(x) is bounded on χ, m^{(p+1)}(·) ≢ 0 in I^n_x for some n ≥ 1 and m^{(p+1)}(x) ≠ 0, then ĥ_{I^n_x} is consistent in the sense that:

Proof. First, consider a value a > 0, for part i). Then we have, by the assumptions, that B_{ωI_x} > 0 in probability when n → ∞. Therefore, by Propositions 1 and 2, and using the same approach as in Ruppert et al. (1995) (page 1265), we have that (ĥ_{I_x} − h_{I_x})/h_{I_x} = O_p((log n / n)^{1/2}). When we have a sequence {a_n}, for part ii), it is necessary to ensure that the number of observations in I^n_x goes to infinity in some sense, in order to apply Propositions 1 and 2.
By assumption (a2), and using the mean value theorem, we have that nμ_X(I^n_x) = O(na_n). It can be shown that μ_{X_n}(I^n_x) − μ_X(I^n_x) = O_p((n/a_n)^{−1/2}). Therefore, we have that nμ_{X_n}(I^n_x) = nμ_X(I^n_x)[1 + O_p((n/a_n)^{−1/2})], so that nμ_{X_n}(I^n_x) = O_p(na_n). Since m^{(p+3)}(x) is bounded on χ, using Hornik et al. (1994) we obtain a bound on I^n_x = [x − a_n/2, x + a_n/2], where the relevant quantity is the derivative of order p + 1 of the neural network function with the true weights, and d_n is the number of neurons in the hidden layer.
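The mean value theorem step can be made explicit: since f_X is positive and bounded on χ by assumption (a2),

```latex
\mu_X(I_x^n) \;=\; \int_{x - a_n/2}^{\,x + a_n/2} f_X(u)\,du
\;=\; f_X(\xi)\, a_n
\quad \text{for some } \xi \in I_x^n ,
```

so that $n\mu_X(I_x^n) = O(n a_n)$.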
Following the proof of Case 1.1 in Chen & Shen (1998), and given that the number of observations in [x − a_n/2, x + a_n/2] is O(a_n n), we have to find d_n such that relation (35) is satisfied. It can be shown that d_n = (a_n^5 n / log(a_n^5 n))^{1/2} satisfies (35), since a_n^5 n → ∞ when n → ∞. Using the same arguments as in the proof of Lemma 2, we obtain the stated result for ĥ_{I^n_x}.

Proof of Theorem 1. Since we have the compact set χ, there exists a finite a* such that I^a_x = χ, ∀a ≥ a* and ∀x ∈ χ (note that I^a_x ≡ I_x; we write I^a_x instead of I_x only to give more evidence to the parameter a). So we can build the set I* = [b_1, b_2], with b_1 > 0 and b_2 = a* > b_1. By part i) of Proposition 3, with p = 1 and v = 0, we can write the supremum as the sum of three terms, I, II and III. By Theorem 2 of Fan et al. (1996), it follows that III = O_p(n^{−2/5} log n), because we have a continuous mapping between a ∈ I* and the bandwidth h_{I^a_x} ∈ [n^{−b_3}, n^{−b_4}], for suitable b_3 and b_4. So, II = 1 + O_p(n^{−2/5} log n).
Let X_n(a) = ∫_{I^a_x} (q(u; η) − m(u))² f_X(u) du and X̂_n(a) = ∫_{I^a_x} (q(u; η̂*) − m(u))² f_X(u) du, as in Lemma 1, where q(·; η) and q(·; η̂*) denote the neural network function with the true parameter vector and with its estimated parameter vector, respectively.
To show the rate of convergence for I, it is sufficient to derive the rate of convergence for sup_{a∈I*} X̂_n(a); the result then follows using Lemmas 2 and 3 and Propositions 1, 2 and 3, part i).
There exists a finite sequence in I*, say {a_i}, 1 ≤ i ≤ M, with M a finite integer, such that sup_{a∈I*} X̂_n(a) ≤ max_{1≤i≤M} X̂_n(a_i) + max_{1≤i≤M} sup_{|a_i−a|≤M^{−1}} |X̂_n(a_i) − X̂_n(a)| = I′ + II′.
By Lemma 1 we have that I′ ≤ M O_p(d_n^{−1}) = O_p(d_n^{−1}), where d_n is defined in assumption (a6). Using Hornik et al. (1994), it follows that |X̂_n(a_1) − X̂_n(a_2)| ≤ C|a_1 − a_2|/d_n, with 0 < C < ∞. Therefore, using again Lemma 1, we have that II′ ≤ M O_p(d_n^{−1}). Since M is a finite constant, we can conclude that sup_{a∈I*} X̂_n(a) = O_p(d_n^{−1}). The proof is complete.

Proof of Corollary 1. By (10), we can find an arbitrarily large value of a such that h_{I_x} ≡ h_G^opt. Therefore, by part i) of Proposition 3, with p = 1 and v = 0, and using (27), we obtain the stated expansion for ĥ_G^opt, so the rate of convergence is dominated by the O_p(n^{−2/5}) term.
Proof of Theorem 2. We start from the decomposition in (36). Now we can write the first term on the right hand side of (36) as in (37). By part ii) of Proposition 3, with p = 1 and v = 0, it follows that ĥ_{I^n_{x_0}} satisfies the corresponding expansion, where ã is a value between 0 and a. Using the conditions of the theorem, it follows that B(a; x_0) = (m″(x_0))² f_X(x_0) a + c_1 a³, with c_1 < ∞. After some algebra, we can write h_{I^n_{x_0}} explicitly. When we consider a_n instead of a, then c_1 also depends on n; but in this case c_1 < ∞, ∀n and for n → ∞, so for simplicity we retain the constant c_1.
Therefore, the last term in (37) is bounded accordingly. Using (33), the result follows.