Series estimation under cross-sectional dependence

An asymptotic theory is developed for series estimation of nonparametric and semiparametric regression models for cross-sectional data under conditions on disturbances that allow for forms of cross-sectional dependence and heterogeneity, including conditional and unconditional heteroscedasticity, along with conditions on regressors that allow dependence and do not require existence of a density. The conditions aim to accommodate various settings plausible in economic applications, and can apply also to panel, spatial and time series data. A mean square rate of convergence of nonparametric regression estimates is establishedfollowedbyasymptoticnormalityofaquitegeneralstatistic.Data-drivenstudentizationsthat rely on single or double indices to order the data are justified. In a partially linear model setting, Monte Carlo investigation of finite sample properties and two empirical applications are carried out. © 2015 The Authors. Published by Elsevier B.V. This access article the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Economic agents are typically interdependent, due for example to externalities, spill-overs or the presence of common shocks. Such dependence is often overlooked or ignored in cross-sectional or panel data analysis. In order to account for possible crosssectional dependence, one needs first to establish a framework under which its structure can be suitably formalized, and which permits an asymptotic statistical theory that is useful in statistical inference, in particular a central limit theorem for estimates of functions or parameters of interest. Several approaches to modelling cross-sectional dependence prominent in recent literature can accomplish this. One class of models postulates unobserved common factors that affect some or all individual units, see e.g Andrews (2005), Pesaran (2006) and Bai (2009), and entail persistent cross-sectional dependence. Two other classes involve a concept of ''economic location'' or ''economic distance''. In economic data, cross-sectional units correspond to economic agents such as individuals or firms, envisaged as positioned in some socio-economic (even geographical) space, whereby their relative locations underpin the strength of dependence, see e.g. Conley (1999) and Pinkse et al. (2002). The spatial autoregressive (SAR) model of Ord (1968, 1981), see e.g. Arbia (2006), Lee (2002Lee ( , 2004 and Prucha (1998, 1999), employs spatial weight matrices whose elements consist of inverse pairwise economic distances between agents, whence the dependent variable or disturbance for a given unit is affected by a weighted average of the other sampled units' variables. The weights are presumed known and reflect the proximity between agents, leaving a small number of parameters to be estimated. Alternatively, mixing conditions extending ones familiar from the time series literature, have been employed. Conley (1999) and Jenish and Prucha (2012), for example, develop spatial mixing and functions-of-mixing conditions in terms of economic distance between agents, under a suitable stationarity assumption, while an alternative type of condition was proposed by Pinkse et al. (2007). Another approach, of Robinson (2011), employs a possibly non-stationary, linear process for disturbances, with dependence in regressors expressed in terms of the departure of joint densities from the product of marginals; a degree of heterogeneity across units is permitted, as well as strong dependence analogous to long memory in time series, which is ruled out by mixing conditions, as well as weak dependence, and the model can also accommodate economic distances, as well as lattice or irregularly-spaced data.
On the other hand, nonparametric and semiparametric estimation has become well established in econometric analysis, enabling assumptions of a known parametric functional form, that are frequently not warranted by economic theory, to be dropped or relaxed. There are many theoretical results on nonparametric kernel estimation under temporal dependence, see e.g. Robinson (1983). Jenish (2012), Robinson (2011) and Robinson and Thawornkaiwong (2012) have considered kernel estimation in nonparametric regression and partially linear regression, under forms of cross-sectional dependence. The asymptotic behaviour of series estimation under independence has been studied in Andrews (1991) and Newey (1997). For weakly dependent time series data, Chen and Shen (1998) and Chen et al. (2012) offer a rather complete treatment of asymptotic theory and robust inference of general sieve M-estimation, while Chen and Christensen (2015) shows that spline and wavelet series regression estimation obtains the optimal uniform convergence rate of Stone (1982).
This paper presents an asymptotic theory for series estimation of nonparametric and semiparametric regression models that covers fairly general cross-sectional heterogeneity and dependence, mainly of a weak form. The conditions of the paper may be relevant to cross-sectional, spatial, time series and panel data, and follow the framework of Robinson (2011), with modifications necessitated by the nature of series estimates relative to kernel ones. Our asymptotic results can easily be modified to cover linear and nonlinear parametric regression. Our other main contribution is establishing a theoretical background for a studentization method that offers an alternative to the existing variance estimation literature. In the spatial context, an extension of heteroscedasticity and autocorrelation consistent (HAC) estimation of the covariance matrix in the limiting normal distribution familiar from the time series literature, see e.g. Hannan (1957), is possible if additional information is available, such as the locations or geographical or economic distances between units. Conley (1999) considered HAC estimation under a stationary random field with measurement error in distances, as did Kelejian and Prucha (2007) for SAR-type models, and Robinson and Thawornkaiwong (2012) in a semiparametric regression set-up. However, the small sample performance of HAC estimation can be poor and an alternative studentization that can produce more accurately sized tests was suggested by Kiefer et al. (2000) in a time series setting. The present paper provides theoretical justification for employing a version of such a studentization in spatial or spatio-temporal data, though it critically relies on an assumption that the practitioner can suitably order the data across one or two dimensions, as may be relevant when geographical locations are known, or there are one or two characteristics believed to be strongly associated with dependence between individuals.
The paper is structured as follows. In Section 2, the model setting is outlined. In Section 3, series estimation is introduced and a mean square rate of convergence for the nonparametric component is established. Section 4 contains asymptotic normality results, covering slower-than-√ n, as well as √ n, rate of convergence. In the latter setting, Section 5 presents data-driven studentizations in one and two dimensions. Using the semiparametric partially linear regression model, Section 6 presents a Monte Carlo study of finite sample performance and two empirical examples. Section 7 concludes. Two appendices contain proofs.

Model setting
The paper commences from the nonparametric regression model relating observable random variables (X i , Y i ) ∈ X×R, for some set X ⊂ R q , where m : X → R and U i satisfies where σ : X → R and e i ∈ R are unobservable random variables with zero mean and finite variance, independent of {X i } ∞ i=1 and σ : X → R. We regard m and σ as nonparametric functions. The factor σ (X i ) allows for conditional heteroscedasticity in Y i , and the factor e i for dependence and unconditional heteroscedasticity. We observe (X i , Y i ) for i = 1, . . . , n, while the e i (and thence the U i and Y i ) can form triangular arrays, so e i = e in , etc., but this feature will be suppressed in our notation; triangular arrays enable coverage of a wide range of models for spatial dependence, including SAR models with normalized weight matrices, and stationary models for panel data or multi-dimensional lattice or irregularly-spaced data where the single index i in (1) and (2) requires a re-labelling of multiple indices which is liable to change as n increases, as discussed by Robinson (2011), who considered kernel estimation of m. In some of our work e i can have semi-strong dependence analogous to that found in long memory time series models. In Sections 3 and 4 we qualify (1) and (2) by detailed regularity conditions, including also restrictions on the dependence of X i .
Under the preceding conditions m(x) = E(Y i |X i = x) for x ∈ X. We will estimate m by a series nonparametric regression estimatem, constructed as a linear combination of pre-specified approximating functions. More generally, we are interested in estimating a d × 1 vector functional a(m) of m, as in Andrews (1991) and Newey (1997), where a(m) can be estimated by a(m).
Simple nonparametric examples of a(m) include the value of m, a(m) = (m(x 1 ), . . . , m(x d )) ′ , and the value of the partial deriva- Of semiparametric examples of a(·), the partially linear regression model, will be discussed in detail in Section 5. Other a(·), including nonlinear functionals, can be found in Newey (1997). Andrews (1991) established asymptotic normality for series estimates of a vector-valued linear a(m), with X i and U i independent and non-identically distributed, and indicated that his proof can be extended to cover strong mixing time series regressors. Newey (1997) established uniform and integrated mean square rates of consistency form(x) and asymptotic normality of a(m) when X i and U i are independent and identically distributed (iid) and a(m) is a possibly nonlinear scalar functional, also describing conditions under which a(m) converges to a(m) at parametric rate. Chen and Shen (1998) and Chen et al. (2012) considered these issues for sieve extreme estimates with weakly dependent time series, with rules of inference that are robust to weak dependence. Chen et al. (2012) also indicated that for certain cases of slower-than-√ n rate of convergence such as when a(m) = m, the asymptotic variance of the estimate a(m) coincides with that obtained under independence, as found for kernel estimation by Robinson (1983), for example.
In this connection, our first assumption restricts smoothness of the unknown function m. We first introduce a weighted sup norm for functions in order to allow for unbounded support X = R q of X i : |g| ∞,w := sup x∈X |g(x)|(1 + ∥x∥ 2 ) −w/2 , for some w ≥ 0, see e.g. Chen et al. (2005) and Chen (2007). The weighted sup norm becomes the more familiar sup norm |g| ∞ = sup x∈X |g(x)| when w = 0, and this suffices in place of |g| ∞,w in the following assumption when the X i are uniformly bounded.
Assumption A1. Let w ≥ 0 be the largest value such that max i≥1 E ∥X i ∥ 2w < ∞. There exists a sequence of vectors β = β κ and a number α > 0 satisfying, Conditions similar to Assumption A1, with | · | ∞ in place of | · | ∞,w , were used in earlier series estimation literature, see e.g. Andrews (1991) and Newey (1997). Further insights into those conditions for various p s (·), including polynomials, trigonometric polynomials, splines and orthogonal wavelets, can be found in Chen (2007, p. 5573). Chen et al. (2005) noted that the weight function (1 + ∥x∥ 2 ) −w/2 used here could be regarded as an alternative to the trimming used in kernel estimation when support is unbounded, X = R q . Conditions imposing an upper bound on κ may necessitate stronger smoothness of m.
For any non-negative definite matrix B denote by λ (B) and λ (B) the smallest and largest eigenvalues respectively, and for any real matrix B denote the spectral norm ∥B∥ =λ 1/2 (B ′ B). As in Andrews (1991) and Newey (1997), for example, introduce the sequence If m is believed bounded, one might choose bounded p s (·), whence ξ ≤ C √ κ, where C throughout denotes a generic arbitrarily large positive constant. It may be possible to obtain the rate of ξ exactly: Newey (1997) showed that under suitable conditions, ξ ∼ κ when the p s (·) are orthogonal polynomials, and ξ ∼ κ 1/2 when they are B-splines, a ∼ b meaning that a/b tends to a positive finite limit. Define Denote by c an arbitrarily small generic positive constant. The following assumption presupposes elimination of any redundant p s (·): Assumption A3. The model (1), (2) holds, where the sequences are mutually independent, max i≥1 Eσ 4 (X i ) < ∞, and the e i have zero means and finite variances.
To describe dependence in {X i } introduce The quantity ζ ij is bounded above by the maximal correlation coefficient of Rozanov (1963), for example. For Gaussian X i , it follows from Lemma 10.2 of Rozanov (1963) Further, for a stationary Gaussian process X i , ζ ij is upper-bounded by the α-mixing coefficient with index |i − j|. More generally, the X i need not have absolutely continuous distribution functions. The quantity △ is an overall measure of dependence; the property △ ∼ n −1 is analogous to weak dependence of time series, while a rate between that and the upper bound △ ≤ 1 is analogous to strong dependence. Assumption A4. As n → ∞, κ −1 + κξ 2 (n −1 + △) → 0. The △ component is vacuous under independence of X i and otherwise indicates that the stronger the dependence the smaller κ needs be. In light of A1, this necessitates greater smoothness of m. The κξ 2 /n → 0 component reduces under suitable conditions to κ = o(n 1/2 ) for B-splines and κ = o(n 1/3 ) for orthonormal polynomials.
To describe dependence in e i define The property can be said to cover weak dependence in e i when ς = 1, and ''semistrong dependence'' when ς ∈ (0, 1), the latter case corresponding to long memory in stationary time series. Define Both versions of χ are identical when ξ ∼ κ 1/2 . The following theorem obtains an integrated mean square rate of convergence.
Theorem 1. Under Assumptions A1-A4, as n → ∞, in view of (11), leading to the optimal rate κ ∼ n ς /(1/2+ρ+2α) . The rate in (13) coincides with Newey's (1997) for iid X i and e i , and bounded σ , when χ ∼ κ/n, and weak dependence in e i (ς = 1 in (11)) leaves the rate unchanged. On the other hand, more generally under only (11), we have χ ∼ ξ κ 1/2 n −ς , entailing stronger restrictions on κ and/or ξ to achieve consistency. Newey (1997) also covered a general concept of uniform convergence (where his uniform rate was improved by de Jong (2002), for compact X). A uniform rate of convergence in our setting may be shown to be O p under a modification of A1 stated in terms of sup norm, using the proof of Theorem 1. More recently, Chen and Christensen (2015) verified that spline and wavelet series regression estimators for weakly dependent regressors attain the optimal uniform convergence rate of Stone (1982).

Asymptotic normality
Our interest now lies in inference on the d × 1 deterministic functional a(m), for which a central limit theorem is the first step. For this purpose, we denote θ 0 = a(m),θ =θ n = a(m), and introduce the following assumptions from Newey (1997), modified here with the weighted sup norm: Assumption B1. Either: (i) a(g) is a linear operator in g, or: (ii) For any ϵ > 0, there exists a linear operator D(g) and constant B2 implies continuity of D(·), the Frechet-differential of a(·) at m, while B1(ii) imposes a stronger smoothness condition on a(·) at m and is not restrictive, see e.g. Newey (1997, p. 153). When a(·) is a linear operator, its Frechet-derivative is itself, Define the κ × d matrix and thence the conditional and unconditional variance matrices Assumption B3. For a positive non-increasing scalar sequence τ = τ n , τ λ (V ) ≥ c for all sufficiently large n.
B3 requires A to have rank d for all κ ≥ d. The case τ → 0 as n → ∞ is relevant when a is a nonparametric functional, converging at rate √ nτ , while the case τ ≡ 1 is relevant when a is a semiparametric functional, converging at rate √ n, the latter situation being discussed in detail in the following section. B3 does not satisfactorily allow both nonparametric and semiparametric components in a, which will have different rates; theory suggests they should use different κ, but our statistic is insufficiently general to permit this. We strengthen the rate condition A4 to: Assumption B4. As n → ∞, To prove asymptotic normality we introduce a more detailed specification of e i : Assumption B5.
(2) holds with where the ε j , j ≥ 1, are independent random variables with zero mean and unit variance, the ε 2 j are uniformly integrable, and the b ij are real constants such that for some η > 0 and all sufficiently large n, The linear process (15) was employed by Robinson (2011) and the dependence on both i and j of b ij (like γ ij ), rather than just their difference i − j, distinguishes (2) from representations for stationary time series, as does our allowance for them to be triangular arrays, thereby covering models described after (1) and (2) in Section 2. In SAR models, where the b ij tend to reflect economic distances between agents, we have, for all i, b ij = 0 for j > n, and  n j=1   b ij   ≤ C is commonly assumed, see e.g. Kelejian and Prucha (1998) and Lee (2004). In models featured in the spatial statistics literature it is frequently natural to allow b ij to be non-zero for all j, analogously to autoregressive time series models. Condition (16) implies weak dependence in e i : noting that it is easily seen that  n k=1 |γ ik | ≤ C and thus ω = O  n −1  in (10). Finally, some falling off of fourth cumulants is required in the assumption below.
Under independence of the X i , or more generally finite dependence (similar to m-dependence in time series), this condition entails κ 2 ξ 2 /n → 0 as n → ∞.
Denote by B 1/2 the unique positive definite square root of a positive definite matrix B.
Theorem 2. Under Assumptions A1-A4 and B1-B6, as n → ∞, Replacing V by a data dependent quantity without affecting the limit distribution in (20) typically requires additional information, as discussed in Section 1. In the weakly dependent time series context, Chen et al. (2012) found that under conditions on a(·) that preclude √ n-convergence of a(m), V reduces asymptotically to the same matrix as under the assumption of independence (cf Robinson (1983) for kernel estimates), which in our setting is lim This prompts interest in developing, for spatial settings, inference under √ n-convergence, when such a simplification does not result. As indicated in Section 1, one example of a(m) is the vector (m(x 1 ), . . . , m(x d )) ′ whence, using Theorem 2 and after a relatively simple studentization a test for a specific functional form of m(x) analogous to that of Robinson (1983) can be developed, though such a test lacks satisfactory consistency properties and much more work would be required to justify a consistent test. Of course in case of a nonparametric a(m) the rate of convergence in Theorem 2 is slower than √ n, whereas in the sequel we focus on the, semiparametric, √ n case.

Studentization
We present a studentization for a √ n-convergent semiparametric estimateθ . Attainment of the parametric √ n rate by semiparametric estimates has attracted wide interest in the econometric literature, following Robinson (1988) and Powell et al. (1989), who used kernel estimation. Newey (1997) developed √ n-convergence of a series semiparametric estimate, while Chen and Shen (1998) establish extensions for weakly dependent time series. In our spatial setting statistical inference requires a studentization that is robust to dependence and heterogeneity across i. Section 1 discussed difficulties with HAC estimation in dealing with cross-sectional dependence. Given a suitable ordering of data, however, a(m) can be studentized without assuming any particular dependence structure, following the time series approach of Kiefer et al. (2000). We consider two different versions of such studentization, depending on whether data are to be ordered according to one or two indices. Below, we first present some common assumptions that are required for both versions of the studentization, then each version is considered in detail in the following subsections.
Define, for a fixed r ∈ [0, 1]: Throughout this section, we impose the following unprimitive assumption.
Sufficient conditions for C1 involve technical restrictions on the functional derivative D(·) of a(·) that can be found in Assumption 7 of Newey (1997), where they are verified to hold for the estimands in partially linear and single index models and also when the quantity of interest is the approximate consumer surplus. In a non-iid setting, we also require weak dependence in e i for C1 to hold, which is implied by B5. Existence of Ω is a condition on the collective strength of dependence in U i and X i , comparable to Assumption A4 of Robinson and Thawornkaiwong (2012). C1 also requires a degree of homogeneity across units, although somewhat less stringent than identity of distribution.
Assumption C2(i) implies weak dependence of X i , C2(ii) restricts the dependence across i of p κ (X i )U i , C2(iii) strengthens the smoothness condition of B3(iii) and C2(iv) further restricts the growth of κ and ξ , but is satisfied by, for example, the rate mentioned after Assumption B6.
The following assumption from Newey (1997) requires D i (·; g) to exhibit continuity over g.

Studentization for one dimensional ordering
In this subsection, we consider the case when one has information to suitably order the data with a single index. We first introduce the quantities used in the studentization and discuss in detail the requisite assumption on the ordering.
We estimate A bŷ and withM = m where W d (·) denotes a d-dimensional vector of independent Brownian motions, so Ψ d is the integral of the outer product of the d-dimensional multivariate Brownian bridge, and we note that Assumption C5(ii) further rules out the presence of any ''dominant'' unit whose error covariances with new units added to the sample are persistently significant. In our other assumptions, the ordering of the n observations is arbitrary, but Assumption C5(i), required to justify our studentization, is unfortunately highly sensitive to ordering, requiring some falling-off of dependence as |i−k| . This is often reasonable with time series, but for spatial data there is generally no natural ordering and the bulk of the n! possible orderings will not satisfy C5(i), so considerable care would be required in ordering the data.
In some economic applications data might be ordered with respect to some relevant (explanatory) variable. For example, with firm data, firms using similar inputs or producing similar outputs might be expected to exhibit high correlation in disturbances. It may be that some economic distances are available, as in SAR models, in which case they can facilitate ordering by indicating neighbouring units although there would still remain considerable arbitrariness in the ordering. Such considerations are pursued in the Monte Carlo study in the following section. Other approaches to the modelling of cross sectional data that would justify alternative studentizations can also be challenged. Mixing assumptions, for example, are, like ours, nonparametric, but they generally presuppose that observations are recorded on a Euclidean space and distances between units are known to the practitioner for inference robust to dependence, using HAC variance estimation, to be carried out. Parametric models, such as SAR and factor models, with the former requiring the practitioner to specify one or more n × n spatial weight matrices, run obvious risks of misspecification. If the practitioner has no information that suggests a plausible ordering of data, e.g. in the case one has a random permutation of observations, one cannot use our studentization to carry out inference that is robust to cross-sectional dependence, just as one cannot with any other alternative method available.
Theorem 3. Under the assumptions of Theorem 2, and Assump- Theorem 3 may be employed in approximate hypothesis tests and interval estimation, with critical values given in Table 2 of Kiefer et al. (2000).

Studentization for two dimensional ordering
We can partially relax the ordering required in the previous subsection. When one has geographical locations of individuals or spatio-temporal data, it is natural to consider two dimensional ordering of the data. For a given number n of observations, suppose we can choose s and t such that n = st. In the case of geographical locations, this setting does not necessarily entail a grid or regular spacing. Suppose each location has two (real valued) coordinates, north and east. One may divide the south-north axis into s intervals, possibly of different length so that each has t observations. In each such interval one then orders the t observations along the west-east axis, and thence finally gives each observation two subscripts, j = 1, . . . , s and k = 1, . . . , t. The choice of s and t introduces arbitrariness. Moreover, having gone through such an indexing procedure, so e.g. X jk corresponds to say, so while the procedure is appropriate for (possibly irregular) lattice data, for example, it entails an element of arbitrariness in relation to general geographical data sets on two dimensions. There is also an underlying difference in our attitudes to the one-and two-dimensional orderings of this and the previous section. In the former case, our attitude is that the ordering may be possible (e.g. it will be in the time series case) but typically more or less problematic, whereas in the latter we take for granted that we know locations, at least approximately. It should also be added that if in fact we know actual locations in either case, our procedures do not in general use that information, only orderings, so will not be ideal. Though we adopt the above simplified method of ordering in establishing our results, they will also hold if the s intervals contain differing numbers t j of observations, where  s j=1 t j = n. It would seem also to be possible to extend our approach and results to three or more dimensions. Some of the previously defined quantities need to be rewritten to incorporate two dimensional ordering of data. Note that quantities that do not depend on the ordering of data such aŝ θ do not need to be restated. The model can be rewritten as Assumption C5 ′ (ii) is identical to C5(ii), just rewritten with the two subscripts. Assumption C5 ′ (i) requires a suitability of ordering in both directions.

Numerical results for the partially linear model
The present section contains Monte Carlo investigation of finite-sample performance of our estimates and two empirical applications, in all cases for the partially linear regression model, which is discussed in the following sub-section.

Partially linear regression model
We partition (1) to where h(·) is a nonparametric function and δ 0 is an unknown . This model is particularly suitable when Z i contains categorical variables, and is often used when the overall number of regressors is large, when a fully nonparametric specification suffers the curse of dimensionality. Kernel estimation of this model has received much attention see e.g. Robinson (1988) and Fan and Li (1999), where δ 0 was shown to be estimable at √ n rate despite first stage nonparametric estimation having a slower-than-√ n rate. Series estimation of (22) was considered in Chamberlain (1986) where the choice of series functions takes into account the partially linear form, the first d functions being Z i , and the remaining κ − d are functions of W i only. The series estimate of δ 0 is then the first d elements ofβ, andĥ(x) =m(w, z) − z ′δ . There is more than one a(·) that yields a(m) = δ 0 . Andrews (1991) considered a(m) = ∂m(w, z)/∂z = δ 0 for any w, z, earlier mentioned by Robinson (1988), in a kernel context. Here we use the functional of Newey (1997), which leads to √ n-consistency. Denote Z * = Z − E(Z|W), where Z and W are random variables independent of the data used to constructδ. Suppose E(Z * Z * ′ ) is non-singular, an identification condition, and consider

Monte Carlo study of finite sample performance
Our simulations take both W i and Z i in (22) to be onedimensional, and throughout set δ 0 = 0.3 and h(w) = log(1+w 2 ).
Our main focus is to investigate performance of studentizations of Section 5 when the ordering requirements of C5 and C5 ′ may be problematic due to noise in our information about the ordering. For example, even if one knows which characteristic(s) of individual units underpins the structure of spatial dependence, it may be observed with error. For both settings of Sections 5.1 and 5.2, we introduce varying degrees of perturbations to the correct Table 1 Asymptotic critical values of ordering of data and report their effects on Monte Carlo coverage probabilities and power of test.
In the first set of simulations which is relevant to the studentization of Section 5.1, we generate random locations for individual units along a line, which determines the underlying dependence structure. We then compare the performance of our studentization under the correct ordering of data with that under a perturbed ordering, where locations are observed subject to error, but used to order the data. To be specific, the locations of the observations, denoted s = (s 1 , . . . , s n ) ′ , were generated by a random draw from the uniform distribution over [0, n]. Keeping them fixed across replications, U i and Z i were generated independently as scalar normal variables with mean zero and covariances Cov(U i , U j ) = Cov(Z i , Z j ) = ρ |s i −s j | , using various ρ ∈ (0, 1). To construct W i , we generate another scalar normal random variable V i in the same way as U i and Z i and let For the studentization, we add noise to the locations, to generate four sets of ''perturbed'' locations: defining we take We base the studentization on 5 different orderings of the data, according to the five sets of locations s, s ′ , s ′′ , s ′′′ , s ′′′′ . We consider two sample sizes, n = 100, 400 and 4 levels of dependence, ρ = 0, 0.2, 0.4, 0.6. For each of the 8 combinations, three bandwidth values, κ = 4, 6, 9 were tried, and for the series functions of W i , the first κ − 1 orthonormal Legendre polynomials were used. The results are based on 1000 replications.
We first analyse performance of the estimates of both the nonparametric function m and semiparametric quantity a(m). We report in Table 2   so for n = 400. For the MISE, κ = 4 was best when n = 100, and κ = 6 was best for n = 400. The Monte Carlo MSE ofδ was relatively insensitive to κ across all 8 settings, which is important as optimal choice of κ for semiparametric estimation is often more ambiguous than for nonparametric estimation. Our next objective is to investigate performance of the studentization of Section 5.1. Theorem 3 implies in this setting,     Table 4 reports empirical coverage probabilities for the 99%, 95% and 90% confidence intervals under the five different orderings of data, based on locations s, s ′ , s ′′ , s ′′′ , and s ′′′′ . When ρ = 0, studentizations with all orderings produce a rather precise coverage probabilities for both samples sizes. For ρ = 0.2, 0.4, 0.6 and correct ordering based on s, the coverage probabilities suffer slightly in the smallest sample n = 100, while being fairly good for n = 400, at least for ρ = 0.2 and 0.4. The more we perturb the ordering, a gradual deterioration is reported. Nevertheless, even with the greatest perturbations, caused by substantial noises Var  ϵ ′′′ i  = 100 and Var  ϵ ′′′′ i  = 400, the results are encouraging. Table 5 reports empirical power of testing H 0 : δ 0 = δ against H 1 : δ 0 ̸ = δ, for δ = 0.3, 0.4, 0.5, 0.7; of course columns corresponding to δ = 0.3 report empirical size. Not surprisingly, for ρ = 0, powers across different orderings are similar, while for ρ = 0.2, 0.4 and 0.6, power tends to improve with increasing perturbations.
The second set of simulations considers the studentization of Section 5.2. To generate the data, we follow the random location setting of Robinson and Thawornkaiwong (2012), where the locations of the observations, denoted s 1 , . . . , s n , were generated by a random draw from the uniform distribution over [0, 2n 1/2 ] × [0, 2n 1/2 ]. Keeping these locations fixed across replications, U i and Z i were generated independently as scalar normal random variables with mean zero and covariances Cov( We generated W i and Y i , and used the same κ, series functions and number of replications, as before, but took ρ = 0, 0.2, 0.4, 0.52 for n = 100 and ρ = 0, 0.2, 0.35, 0.5 for n = 400. The random location setting implies the degree of dependence is determined not only by ρ, but also by the distances between locations. The fact that we are considering locations on a plane rather than along a line implies that ρ produces differing strengths of dependence compared to the familiar time series AR(1) model, making it difficult to get a sense of the degree of dependence in the data generated. One measure of dependence that might be used in comparisons is   Table 5 Empirical power of 5% test on δ 0 , κ = 6, design 1.   Table 7 Monte Carlo average 95% CI length for δ 0 , design 2. We used the two dimensional location coordinate s i = (s 1i , s 2i ) to order the data with two indices. The first coordinate s 1i was used to divide the sample into n 1/2 groups numbered by the first index and the second coordinate s 2i was then used to order within each group, generating the second index. For perturbations to the ordering, we added two dimensional error terms to s i , leading to are independent across i. It is worth noting that due to the difference in the settings, above errors represent more significant perturbations than in the previous setting, since the variance of s 1i and s 2i in the current set-up is smaller than the variance of s i used in the previous one.
We report in Table 6 the Monte Carlo MSE, bias and variance of  m(0.5, 0.5), MISE of  m, and MSE ofδ 0 . Again, patterns of bias and variance of  m with changing κ are in line with predictions, and κ = 6, 9 tended to generate the lowest MSE for settings with n = 100, 400, respectively. Table 7 reports Monte Carlo average length of 95% confidence intervals. As before, it decreases with n, increases with ρ and shows little variation across κ. Table 8 reports the empirical coverage probabilities for the 99%, 95% and 90% confidence intervals based on critical values of Table 1. As expected, adverse impacts of perturbed ordering increase with ρ and the magnitude of perturbations, although the results are again encouraging even for the greatest perturbations and ρ = 0.52, 0.5. Table 9 reports empirical power of testing H 0 : δ 0 = δ against H 1 : δ 0 ̸ = δ, for δ = 0.3, 0.4, 0.5, 0.7 with 5% significance level.
Again, results are similar to those of the previous setting, with power improving with increasing perturbations for ρ ̸ = 0, and powers across different orderings are similar when ρ = 0.
The parametrically involved regressors are wage (log wage rate), pcap (log price of capital), PUC (a dummy for public utility commissions that deliver additional services, and therefore may benefit from economies of scale), life (log of the remaining life of distribution assets), lf (log of the load factor, measuring capacity utilization relative to peak usage), and kmwire (log of kilometres of distribution wire per customer). The nonparametrically involved regressor is cust (log of the number of customers). The disturbance is now denoted u. Yatchew (2003) was interested in estimating the conditional expectation of tc given cust, holding the other regressors fixed, as the shape of this curve reveals whether there are increasing/decreasing returns to scale in electricity distribution. We are interested in estimating δ and testing their significance, H 0 : δ l = 0, versus H 1 : δ l ̸ = 0 for l = 1, . . . , d, when allowing for dependence in the disturbance u. The data consists of 81 municipal distributors in Ontario, Canada in 1993. The first set of columns of Table 10 repeats the kernel estimates of δ and their standard errors, assuming uncorrelatedness of error terms, from Yatchew (2003) based on methods and theory of Robinson (1988). The second set of columns reports the  δ i , using the first three Legendre polynomials in series estimation. Test statistics labelled * are 5% significant, while those labelled △ are 10% significant. To apply the studentization of Section 5.1, two different orderings were tried. First, the data were ordered in the ascending wage rate faced by the firm, with the rationale that firms may be subject to input shocks, and those with similar wage rate may use similar inputs, leading to dependence in disturbances. Test statistics based on this studentization are denoted t * n,w . Second, the data were ordered according to the number of employees of the firm, which is a measure of size, noting that firms of similar size may be subject to similar shocks, or alternatively, may be dependent due to competition. Test statistics based on this studentization are denoted t * n,e . Inference based on the assumption of uncorrelated disturbances found PUC , life, lf and kmwire to be 5% significant using kernel estimation, while PUC , life and kmwire are 5% significant with series estimation. When allowing for dependence in disturbances, and with both orderings, life and kmwire were still found to be 5% significant, while lf , pcap and wage·pcap were 10% significant and PUC , which was 5% significant under uncorrelatedness, was 10% significant based on ordering according to number of employees. The second example involves hedonic pricing of housing attributes, the data concerning 92 detached homes in Ottawa sold during 1987. The dependent variable is the sale price of a given house (price), while the parametrically involved regressors consist of attributes of the house, including lot size (lotarea), square footage of housing (usespc), number of bedrooms (nrbed), average neighbourhood income (acginc), distance to highway (dhwy), presence of garage (grge), fireplace (frplc), and luxury bathroom (lux). The nonparametric function h(·) has two arguments, being location coordinates s = south and w = west: price = h(s, w) + δ 1 frplc + δ 2 grge + δ 3 lux + δ 4 acginc + δ 5 dhwy + δ 6 lotarea + δ 7 nrbed + δ 8 usespc + u. In series estimation of h(s, w), we used approximating functions (1, s, w, sw). The first set of columns of Table 11 recalls the kernel estimates reported in Yatchew (2003), and the second set reports the corresponding series estimation results. The estimates of coefficients, their standard errors and the t-statistics are broadly similar, revealing significance of many of the regressors at the 5% level.
In applying the studentization of Section 5.2, we divided the north axis to 9 intervals, the first two containing 11 units each and the rest 10 units, anticipating possible spatial dependence in disturbances of neighbouring houses. In Table 11, SE refers to standard error computed under the assumption of independence, and the test statistic t ′ n defined above has critical values 40.3, 26.45 at sizes 5% and 10%, respectively. Again, test statistics labelled * are significant at 5% level and those with △ at 10% level. Our tests, which attempt to account for dependence, find that the presence of fireplace, garage, luxury bathroom and square footage are 5% significant, while average neighbourhood income and lot size are 10% significant. The contrasting conclusions on the significance of δ estimates between the test under independent errors (denoted t-stat) and the test t ′ n allowing for dependence may be due to cross-sectional dependence in the data, as seems natural, given that prices of houses of the same type, sold in the same year and city, would have been subject to an overlapping set of demand and supply side factors, driven by the same macroeconomic fundamentals.

Conclusion
This paper has established a theoretical background for series estimation of a vector-valued functional of the nonparametric regression function under cross-sectional dependence and heterogeneity. Theoretical results include a mean square rate of convergence and asymptotic normality. Robust data-driven studentizations that require a suitable ordering of observations across either one or two dimensions offer an alternative to existing methods of inference. The framework of cross-sectional dependence and heterogeneity of this paper and its technical arguments may be used to establish asymptotic theory in some other settings.
which can be improved to when σ is bounded. Thus (23) follows from (24). To prove (24), for any ε > 0, the Euclidean norm of a matrix A. We show that and thus, applying A4, by A4, which completes the proof of (27), and thus of the theorem.
Proof of (28). Let z (ℓ) i denote the ℓth element of z i . The (ℓ, p)th from Isserlis' formula, has variance 1 n 2 . For each ℓ, p = 1, . . . , d, the left side of (42) is bounded by by A4 and B4, since By symmetry the second term in (43) is also of the order (44).
By the Cramer-Wold device, it suffices to show that for any d × 1 vectors c ′ 1 , . . . , c ′ k , Q n → d Q , where l Ω 1/2 W d (r l , u l ).
By (41), which holds for all c ′ l t ιh (n; r l , u l ), l = 1, . . . , k, we have  S ι=1  T h=1 t * ι,h (n; r, u) 4 = o p (1) for S and T similar to N mentioned in the proof of (31), and Q n → d Q follows by the same argument as in the proof of asymptotic normality (31).