Spatial bootstrapped microeconometrics: Forecasting for out-of-sample geo-locations in big data

Spatial econometric models estimated on big geo-located point data face at least two problems: limited computational capability and inefficient forecasting for new out-of-sample geo-points. This is because the spatial weights matrix W is defined for in-sample observations only, and because of computational complexity. Machine learning models suffer from the same limitations when using kriging for predictions; thus this problem remains unsolved. The paper presents a novel methodology for estimating spatial models on big data and predicting in new locations. The approach uses bootstrap and tessellation to calibrate both the model and the space. The best bootstrapped model is selected with the PAM (Partitioning Around Medoids) algorithm by classifying the regression coefficients jointly in a nonindependent manner. Voronoi polygons for the geo-points used in the best model allow for a representative division of space. New out-of-sample points are assigned to tessellation tiles and linked to the spatial weights matrix as replacements for the original points, which makes the calibrated spatial models feasible as a forecasting tool for new locations. There is no trade-off between forecast quality and computational efficiency in this approach. An empirical example illustrates a model for business locations and firms' profitability.

KEYWORDS
bootstrapping, predictions out-of-sample, spatial big data, spatial point-pattern, spatial weights matrix, tessellation, Voronoi polygons

INTRODUCTION

The key point in geo-located data is the similarity in the neighborhood, expressed as spatial autocorrelation. Data for soil, climate, or environment often reveal visible spatial patterns that can be mapped and modeled with geostatistical models like kriging or the Stochastic Partial Differential Equation with Integrated Nested Laplace Approximation (Krainski et al., 2018; Lindgren et al., 2022) or machine learning (Meyer & Pebesma, 2021). This approach is, in fact, about "filling in the gaps between sampling sites," while controlling for local and temporal conditions. Data in economics, like business locations (sectoral belongingness, profitability, high-tech involvement, employment etc.), are much more complex. Units are more internally heterogeneous and do not show clear spatial patterns (target variables are not spatially continuous), even if spatial autocorrelation and similarity of the multidimensional neighborhood exist. This problem is about "predicting the value for a specific unit located in space," while controlling for the unit itself and its neighborhood. Therefore, those relations are modeled with spatial econometrics and captured by the spatial weights matrix W, which expresses the neighborhood structure (more in Appendix A).
Classical spatial econometrics uses a W matrix with a contiguity criterion in modeling regional areal data and a k nearest neighbors W matrix for point data. This W matrix, being the core of spatial econometrics, is simultaneously its main problem: it is intractable in big data and defined only for the observations used in model fitting. This generates two severe problems for spatial econometrics on point data: (1) usage in the case of big data and (2) predictions for new locations. As of now, alternatives for W in big data are not satisfactory: parallel computing increases the speed but does not solve the critical issue of memory; spatial machine learning is still under discussion (Kopczewska, 2022; Meyer & Pebesma, 2021), as W is not easy to substitute; models on spatially aggregated point data and polygon-based W erase information from the local neighborhood and cut local variability; kriging, which may deal with out-of-sample smoothing of the target variable, needs to invert distance-dependent, large and dense covariance matrices, which limits its applications in big data (Perdikaris et al., 2015); Matrix Exponential Spatial Specification can work only with a single neighbor in classical spatial models (Arbia, 2014; LeSage & Pace, 2007). Neglecting spatial autocorrelation in geospatial data is also dangerous, as it mostly results in bias, inconsistency, and overfitting of the nonspatial model and its false predictions (e.g., Ibrahim & Bennett, 2014). Recent advances cover, for big spatial data, primarily storage solutions, and for spatial econometric out-of-sample predictions, mostly a discussion of challenges (Jiang, 2018). Geostatistical models are far better developed in terms of available solutions, out-of-sample predictions, and big data capability (Heaton et al., 2019). This depicts the need for equivalent developments in spatial econometrics.
Spatial big data starts at the current edge of computational capability. In the case of geostatistical models, this is around 150,000 observations: Heaton et al. (2019) show that some methods require as much as 48 h on 30 cores to deal with this mass of data. In the case of W and spatial econometrics, the limit lay a few years ago at around 70,000 observations (Arbia et al., 2019), while now it can be placed similarly at around 150,000 observations. For bigger datasets, one needs new methods and solutions. The real challenge is to estimate spatial econometric models using W and information from the neighborhood in the case of big data: datasets of hundreds of thousands or millions of geo-points, for example, for real estate valuation or business location. W has the problem of size (n²) and, due to the low density of connections, is sparse, which requires a special approach when inverting it. Currently, only super-computers can do this task, mostly due to larger memory. And even if current computational progress enables estimation on larger datasets than before, one can easily imagine increasingly bigger tasks that may already appear and will again be too complicated.
The feasibility of geo-data micro-econometric spatial models on standard machines can be achieved with a bootstrapping technique. Until now, bootstrap was considered mainly a tool supporting small-sample data (e.g., Hall, 2013; Hesterberg, 2015), and the bootstrapping replications were expected to discover the hidden statistical properties of the population. In general, bootstrap is used mainly for smoothing (e.g., Davison et al., 2003), testing (e.g., Manly, 2006), confidence intervals (e.g., DiCiccio & Efron, 1996), and internal validation of models (Tran & Tran, 2016). In spatial analyses, bootstrap is used for uncertainty bands in functional kriging (Franco-Villoria & Ignaccolo, 2017), testing spatial nonstationarity and heterogeneity in Geographically Weighted Regression coefficients (Harris et al., 2017; Hong et al., 2022), correlation functions (Loh, 2008), sampling under a known joint distribution (García-Soidán et al., 2014), testing Moran's I (Jin & Lee, 2015), and discovering the uncertainty of parameters (Castillo-Páez et al., 2020; Dalposso et al., 2019; Uboldi et al., 2014). However, bootstrapping can be used the other way round, shrinking big data size while preserving its statistical properties. Predominantly, data are reduced by sampling in univariate statistical analyses, and for econometric purposes in multivariate analyses by Lasso (Tibshirani, 1996), bootstrap of candidate parameter matrices (Ye & Weiss, 2003), bolasso (Bach, 2008) or Principal Component Analysis (e.g., Rosipal et al., 2001), which cut redundant variables. However, bootstrapping can also be an efficient tool for limiting the number of observations, as an alternative to sampling. Within this developing stream, one can find a proposal by Barbian and Assunção (2017) of spatial subsemble for partial estimations in spatially structured subsets and aggregation of the results in the spatial analysis.
Bootstrapped regression, based on sampling smaller subsamples and replicating the model estimations, has been present for over 30 years (since Freedman, 1981 and Wu, 1986, and more recently Hesterberg, 2015; Harris et al., 2017). It enables operating on a much narrower scale while obtaining consistent, efficient and nonbiased estimates (e.g., Davison et al., 2003; Efron & Tibshirani, 1997). However, there are still two challenging issues because of the spatiality of the data. Firstly, when sampling the observations, one also samples the locations. Thus, there is no unique spatial weights matrix W, as each estimated model is based on its individual W because of the different spatial composition of points. Secondly, the forecasting possibilities for a new geo-point are limited, as the new point is not represented in W.
This paper proposes a solution to all three issues discussed above: (i) computational problems in big data, (ii) lack of a unique W when sampling, and (iii) difficult forecasting for out-of-sample data. The subsequent sections present the method and its justification. The logic is as follows: the paper starts with true estimates of coefficients and their errors on the population (full dataset), which becomes the reference for other solutions and enables quality comparisons (Section 2). A straightforward reduction of the dataset (sampling) as an alternative would reduce computation time and lower the model quality, but it does not solve the problem of out-of-sample prediction and a unique W (Section 3). Bootstrapped regression improves computational aspects and keeps the characteristics of full-sample estimates but requires deciding how many replications and observations to use (Section 4). Having bootstrapped distributions of beta coefficients in multivariate regression, one cannot simply choose an average of each beta coefficient individually, due to their interrelated values; a solution is to choose the most central set of coefficients in this multi-dimensional setting. This is equivalent to finding the most representative model by searching for the medoid model using Partitioning Around Medoids (PAM) clustering (Section 5). Locations of observations used in the most central model are treated as the best approximation of the point pattern and used to derive a representative spatial weights matrix W. Flexible prediction for any new point within the analyzed territory requires linking in pairs the existing in-sample points with new out-of-sample points. This is possible by partitioning the area into disjoint subregions using tessellation and pairing points that belong to the same tessellation tile (Section 6). Forecasting, which inputs into the model one new point and all-but-one old points, gives well-fitted predictions (Section 7). The proposed method is discussed (Section 8) and summarized (Section 9). All stages are presented in Figure 1 (Design of study). The paper's novelty is in: (a) using bootstrap to shrink the dataset size, which is different from typical applications of bootstrap to small samples, (b) selecting the best bootstrap multivariate model with the PAM algorithm, (c) substituting train data with test data in W for predictions based on a pairing design, (d) pairing in-sample and out-of-sample locations by overlaying new points on the surface tessellated with old points and assigning them to the underlying tiles. All those innovations build a complex approach to estimate spatial models on big data and run predictions for out-of-sample data. This methodology is novel in spatial econometrics, and it gives very stable and efficient results at a low level of computational effort.

REFERENCE FULL-SAMPLE SPATIAL ECONOMETRIC MODEL
Let us assume a dataset with approximately n = 37,000 geo-located point observations, which are static real business locations within a region (Figure 2a). The geo-referenced point locations (x,y) are supplemented with real business characteristics (z): an industry branch (agriculture agri, production prod, construction constr, service serv), employment size (empl), and Euclidean distance to the Lublin core city (dist). The analyzed dataset is a representative 10% sample of the real REGON register for the NTS2 Lubelskie region in Poland in 2014. The profitability (roa) (understood as Return on Assets) was generated assuming normal distributions within the four sectors with given parameters: N(2%, 0.045%²) in agriculture, N(3.5%, 0.045%²) in production, N(5%, 0.045%²) in construction and N(8%, 0.045%²) in service; additionally, roa includes a premium of 1.5% in the core city (Figure 2b).
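For reproducibility, the data-generating step for roa can be sketched in R as below; this is a minimal sketch under the assumption that the firm register is held in a data frame `firms` with a character column `sector` and a 0/1 indicator `core_city` (both names are illustrative), and that the stated standard deviation of 0.045 is expressed in percentage points.

```r
# Simulated profitability: sector-specific normal draws plus a core-city premium
set.seed(2014)
sector_mean <- c(agri = 2, prod = 3.5, constr = 5, serv = 8)   # assumed ROA means, in %
firms$roa <- rnorm(nrow(firms),
                   mean = sector_mean[firms$sector],           # sector of each firm
                   sd   = 0.045) +                             # assumed SD, as given above
             1.5 * firms$core_city                             # 1.5% premium in the core city
```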
The research goal is to estimate and calibrate the model explaining the profitability roa of a given firm (j) with its location and characteristics, including all possible spatial information, and to forecast the profitability (z*) of a new entry firm in a given new location (x*,y*) for given industry, employment, and distance.
The paper considers four specifications of the same model. The first (1) is a standard linear model specification (estimated with Ordinary Least Squares, OLS):

$$roa_j = \beta_0 + \beta_1 empl_j + \beta_2 prod_j + \beta_3 constr_j + \beta_4 serv_j + \beta_5 dist_j + \varepsilon_j, \quad (1)$$

where j = 1,2,…,n are observations in the dataset, roa_j and dist_j are continuous variables on the profitability and the distance to the core city, empl_j is a numeric variable specifying the middle of the employment size class, prod_j, constr_j, and serv_j are dummy variables differentiating the business sector from agriculture (base level), while ε_j is the error term. The second, third, and fourth specifications are spatial models: the spatial error model (SEM) (2), the spatial autoregressive model (SAR) (3) and the spatial Durbin model (SDM) (4):

$$roa_j = \beta_0 + \beta_1 empl_j + \beta_2 prod_j + \beta_3 constr_j + \beta_4 serv_j + \beta_5 dist_j + \varepsilon_j, \quad \varepsilon_j = \lambda W\varepsilon_j + u_j, \quad (2)$$

$$roa_j = \beta_0 + \rho Wroa_j + \beta_1 empl_j + \beta_2 prod_j + \beta_3 constr_j + \beta_4 serv_j + \beta_5 dist_j + u_j, \quad (3)$$

$$roa_j = \beta_0 + \rho Wroa_j + \beta_1 empl_j + \beta_2 prod_j + \beta_3 constr_j + \beta_4 serv_j + \beta_5 dist_j + \theta WX_j + u_j, \quad (4)$$

where ε_j is a spatially auto-correlated error term, decomposed into its spatial lag Wε_j component and the iid random term u_j, ρWroa_j is the autoregressive component, θWX_j are the spatial lags of the explanatory variables X, and W is the 37,000 × 37,000 spatial weights matrix using k = 5 nearest neighbors (knn = 5), row-standardized to one (more on the selection of knn for W in Kubara & Kopczewska, 2023). A spatial lag is the average of values in the neighborhood; in the case of dummy variables, it is the percentage of occurrence of a given feature. Estimation of the models was run in R software (more in Appendix C).
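Assuming the data sit in a data frame `firms` with coordinate columns `x` and `y` (names are illustrative), the W matrix and the four specifications above can be obtained in R with the spdep and spatialreg packages, for example:

```r
library(spdep)        # spatial weights
library(spatialreg)   # SEM, SAR, SDM estimation

# k = 5 nearest-neighbours, row-standardised spatial weights matrix W
coords <- as.matrix(firms[, c("x", "y")])
W      <- nb2listw(knn2nb(knearneigh(coords, k = 5)), style = "W")

f <- roa ~ empl + prod + constr + serv + dist

ols <- lm(f, data = firms)                        # Equation (1)
lm.morantest(ols, W)                              # Moran's I of the OLS residuals

sem <- errorsarlm(f, data = firms, listw = W)     # Equation (2), SEM
sar <- lagsarlm(f,   data = firms, listw = W)     # Equation (3), SAR
sdm <- lagsarlm(f,   data = firms, listw = W,
                Durbin = TRUE)                    # Equation (4), SDM

AIC(ols); AIC(sem); AIC(sar); AIC(sdm)            # goodness-of-fit comparison
impacts(sdm, listw = W)                           # direct/indirect impacts for SDM
```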
The estimation results show that in the OLS model (Equation 1) all variables are significant (*** for p-value < 0.001; Table 1) and the goodness-of-fit is acceptable (R² = 0.98, Akaike information criterion [AIC] = 24.620). At the same time, spatial autocorrelation of the error term exists (Moran's I standard deviate = 316.91 using knn = 5 in W, more in Appendix B). The spatial models, SEM (Equation 2), SAR (Equation 3), and SDM (Equation 4) with the knn = 5 W matrix, were much better fitted (AIC_SEM = −27.917, AIC_SAR = 18.757 and AIC_SDM = −28.447, respectively) than OLS (AIC_OLS = 24.620) and obtained significant spatial coefficients (rho ρ, lambda λ, and theta θ). There is an upward bias in the OLS coefficients of approximately 5% compared with the spatial models, while the SEs of OLS are almost double those in SEM, SAR, and SDM. Coefficients for prod, constr, and serv are the premium of profitability over the agriculture sector, taken as the base. The average ROA for the base sector, agriculture (ROA_agri = 2%), added to the sectoral coefficients (β_SEM,prod ≈ 1.5, β_SEM,constr ≈ 3.0, β_SEM,serv ≈ 6.0), sums up to the assumed average ROA (ROA_prod = 3.5%, ROA_constr = 5%, ROA_serv = 8%). The empl variable revealed instability, which has no importance for the methods presented in this paper. In the case of SAR and SDM, for a structural interpretation of a given factor's impact on the dependent variable, one should use direct and indirect impacts, estimated for the best-selected model.
There are a few remarks on the sample size. Firstly, big data is not unequivocal in terms of dataset size, and it can be defined as "data which exceed(s) the capacity or capability of current or conventional methods and systems" (Ward & Barker, 2013). Gandomi and Haider (2015) write about big data volume as "multiple terabytes and petabytes." However, in spatial estimation applied on standard computers, many routines require long waiting at approximately n = 70,000 spatial units and stop computations for more than 0.5 million obs. Secondly, when testing statistical solutions to proxy the full sample, one needs an operationalizable dataset to compare the subsample and full-sample results. In the light of recent works by Arbia et al. (2019), approximately 37,000 observations is a safe and substantial spatial dataset. Thirdly, dealing with big data requires progress in the computational power of computers and smart statistical solutions. With the flood of mass spatial data (such as mobile data, business locations, housing transactions, selling points etc.), one can easily imagine an even more extensive dataset, bigger than the biggest operationalizable one, even for super-computers.

TABLE 1 Estimation results of ordinary least squares (OLS), spatial error model (SEM), spatial autoregressive model (SAR) and spatial Durbin model (SDM) for a full sample (37,000 obs).

SIMPLE SAMPLING AND COMPUTATION TIME
Estimating the model on a full sample can be costly with regard to computational time, required computer memory and the effort needed for dataset collection. Sampling solutions, understood as a single estimation of a model on a subset, can give acceptable approximations of full-sample results at a much lower cost. Subsample regression coefficients are expected to hold the values of the full-sample estimates, while the main difference appears in their SEs. The large-sample property of estimators, saying that when expanding the size of the sample the accuracy of the estimation rises, and in particular that doubling the sample decreases its variance √2 times, is well-proven in the literature (e.g., Lenth, 2001). This rule can be expressed as equivalent to 10·n^(−0.5) for n observations. The efficiency of the variance estimator for the expanding window estimation (or the inverted jack-knife), Eff_exp.window, with step s, for the observations from 1 to h·s taken from the total sample of n observations is:

$$\mathrm{Eff}_{exp.window} = \sqrt{\frac{var_{1:(h\cdot s)}}{var_{1:n}}},$$

where h is the number of steps, s is the length of the step, h·s = n, and var_{1:(h·s)} and var_{1:n} are the variances of the estimator in a subsample and in the full sample, respectively. This allows for predicting the SE in the full sample using a subsample estimate of the SE. It shows how much one can cut the full sample and still obtain reasonable subsample results. The expanding window procedure, with a step of length s = 100 observations and h = 1,2,3,…, 370 steps, was applied to randomly resorted data in the full dataset. In this analysis, the model specification was estimated 370 times on an increasing dataset (100, 200, 300, …, 1000, …, 10,000, …, 37,000 obs.). Figure 3 illustrates the beta coefficient and its SE for the prod variable in OLS, SEM, SAR and SDM (beta coefficients are used instead of impacts, since impacts do not exist for all models and report simulated SE only). Beta coefficients for subsamples hold their central tendency toward the full-sample estimate in all models, even if an upward bias of OLS compared with the spatial models is persistent. Nonbiased betas yield nonbiased impacts. SEs of coefficients were drawn twice: as empirical SEs, the values of beta SE in the specified models, and as theoretical SEs, the beta SE recalculated using the "√2 rule" based on the beta SE for 100 observations. Empirical and theoretical values follow the same pattern, which confirms the existence of the "√2 rule." The efficiency of SE estimates Eff_exp.window at 10,000, 20,000, and 30,000 observations equals ca. 1.98, 1.36, and 1.12, respectively, in all models (which means that the SE in a sample of 10,000 is ca. two times bigger than in the full sample). Thus, increasing the sample size may bring the subsample SE closer to the full-sample SE, and only a subsample over 30,000 observations gives a relatively slight increase in this efficiency.
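A minimal R sketch of this expanding-window check, assuming the same illustrative `firms` data frame, re-estimates the OLS model on growing windows and compares the empirical SEs with the theoretical ones rescaled by the "√2 rule":

```r
# Expanding-window check of the "sqrt(2) rule" on the OLS specification
set.seed(1)
shuffled <- firms[sample(nrow(firms)), ]
sizes    <- seq(100, nrow(firms), by = 100)

se_prod <- sapply(sizes, function(n_sub) {
  m <- lm(roa ~ empl + prod + constr + serv + dist, data = shuffled[1:n_sub, ])
  summary(m)$coefficients["prod", "Std. Error"]      # empirical SE of beta_prod
})

# theoretical SE rescaled from the first window: SE(n) ~ SE(100) * sqrt(100 / n)
se_theoretical <- se_prod[1] * sqrt(sizes[1] / sizes)

# efficiency relative to the full sample, e.g. ~2 at 10,000 of 37,000 observations
eff_exp_window <- se_prod / se_prod[length(se_prod)]
```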
Comparison of the time cost of the calculations (Figure 4) assumes time inspection for doubled data (n = 1000, 2000, 4000, 8000, 16,000, 32,000, 64,000, 128,000, 256,000, 512,000). OLS estimation takes from 0.009 s for 1000 obs., through 0.05 s for 32,000 obs., to 0.74 s for 512,000 obs., with a growth rate of approximately 1.6 between doubled samples. Spatial estimations last much longer: in general from 2 s for 1000 obs., through 20 s for 32,000 obs., to 2700 s for 512,000 obs. For big data (512,000 obs.), the majority of this time is taken by the construction of W (ca. 2510 s, 93% of the time), while the estimation itself goes quicker (190 s, 7% of the time). Estimation time is the quickest for SAR, while SEM and SDM take 1.5-2 times longer. The major problem is the time growth rate when doubling the data size: for OLS and spatial regression models it grows approximately 1.6 times, while for W it grows approximately 2.86 times. That seemingly small difference matters a lot in big data: doubling the data size nine times (from 1000 to 512,000) means that regression time increases approximately 70-80 times, while W computation time increases approximately 12,000-13,000 times! (see Appendix D). Some time savings are possible due to the selection of the matrix decomposition method and developing software implementations. The above analysis shows that spatial estimation on thousands of geo-located observations is still very time consuming, if feasible at all, even when using optimized routines. Computer scientists offer technical solutions (super-computers, cloud services, parallel computations etc.) to solve the technical problem of estimation. However, the massive inflow of individual spatial data causes datasets bigger than those currently served to become available; thus, technical problems may remain unsolved. The statistical approach in this paper is to overcome, in general, the obstacles of dealing with big spatial data. It should be underlined that simple sampling (in fact, cutting the dataset) may not give satisfactory results because of representativeness issues, and it offers only a limited possibility of using the calibrated spatial model in forecasting for new geo-points, as this would result in a new spatial weights matrix W, which could destroy the fine-tuned calibration. There is a need for a smart solution to deal with computational problems in big data, the lack of a unique W when sampling, and the difficulty of forecasting for out-of-sample data, in order to benefit from the information available. Thus the paper develops a new methodological solution based on resampling, which is tested step-by-step throughout the paper.
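The timing comparison can be reproduced in outline with `system.time()`, separating the cost of building W from the cost of estimation; the sketch below reuses the illustrative objects introduced earlier and smaller sample sizes:

```r
# Separate timing of W construction and SDM estimation for doubling sample sizes
sizes  <- c(1000, 2000, 4000, 8000, 16000, 32000)
timing <- t(sapply(sizes, function(n_sub) {
  d  <- firms[sample(nrow(firms), n_sub), ]
  tW <- system.time(
    Wsub <- nb2listw(knn2nb(knearneigh(as.matrix(d[, c("x", "y")]), k = 5)),
                     style = "W")
  )[["elapsed"]]
  tE <- system.time(
    lagsarlm(roa ~ empl + prod + constr + serv + dist, data = d,
             listw = Wsub, Durbin = TRUE)
  )[["elapsed"]]
  c(n = n_sub, W_seconds = tW, estimation_seconds = tE)
}))
timing   # compare the growth rate of W time vs. estimation time when n doubles
```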

CHALLENGES OF BOOTSTRAPPED SPATIAL REGRESSION
The bootstrap's original purpose was to support estimation on a small sample, when the limited availability of data was the main obstacle to getting reasonable results. In big data, the availability of mass data redesigns the bootstrap motivations: it may become a way of limiting, not multiplying, the data. Before running the resampling procedure, which is to run many regressions on different subsamples, one should understand how this design works and answer fundamental questions: how many iterations to use, what the size of the subsample should be, whether the location of the selected observations matters, and whether the estimates are of good quality. This section clarifies those issues.
Following Fox (2015), the coefficient's standard bootstrap error is the SD of the bootstrap coefficient replicates. Thus one can define the efficiency Eff_boot of the bootstrapped estimation as:

$$\mathrm{Eff}_{boot} = \frac{var_i(\hat{\beta}_s)}{var_{full\ sample}},$$

where s = 1,2,…,S is the size of the subsample in a bootstrap (number of observations drawn from the full sample), i = 1,2,…,R is the number of iterations (number of replications of drawings), var_i(β̂_s) is the variance of the estimates in the i-iterated set, and var_full sample is the known variance of the estimator derived for the full sample or population. There are at least three issues concerning spatial sampling: (i) the effective size of the spatial sample; (ii) the sampling design; (iii) the number of replications and the subsample's size in the bootstrap. Even though the literature seems to be rich, there is no clear answer to those issues. They are discussed below.
The sample's effective size was studied mostly from a nonbootstrap perspective (e.g., Griffith & Zhang, 1999). Griffith (2008) builds a rule of thumb that approximately n = 400 observations may be sufficient in a spatial regression with a single covariate. Chernick and LaBudde (2014), following Hall (1985), indicate that bootstrap can work for n = 20 observations if there are i = 2000 replications.
The sampling design meets many recommendations. The literature usually applies parametric bootstrapping, residuals bootstrapping or observations bootstrapping (e.g., Tran & Tran, 2016), as they differ in assumptions, advantages, and disadvantages (Davison & Hinkley, 1997; Moulton & Zeger, 1991). For spatial data, the spatial dimension is of great importance, as one has to deal properly with spatial autocorrelation. Franco-Villoria and Ignaccolo (2017) follow Lahiri (2003) in that "a bootstrap procedure needs to mimic the data generating mechanism to reproduce the spatial dependence structure in the bootstrap samples," thus they recommend nonuniform sampling schemes. This is contrary to Davison et al. (2003), who claim that "The ordinary nonparametric bootstrap uses uniform resampling from a data sample to mimic the mechanism that originally produced that sample." Alternatively, Griffith (2005) proves that for single-draw sampling designs in the case of spatial autocorrelation, the hexagonal tessellation stratified sampling design performs best. Since Hall (1985) suggested the blocked bootstrap for spatially dependent data, there has been a discussion on the size of blocks (e.g., Hall et al., 1995; Nordman et al., 2007), the shape of the blocks (e.g., Lahiri, 2003; Roberts et al., 2017), and whether they should overlap (Carlstein, 1986; Kunsch, 1989), and comparisons are presented (Radovanov & Marcikić, 2014). For bias or variance estimation, Hall et al. (1995) recommend n^(1/3) blocks. Interestingly, Chernick and LaBudde (2014, p. 148), following Lahiri (2003), indicate that "bootstrap estimates under irregularly space grids are consistent." One popular method of selecting irregular nonoverlapping partitions is k-means clustering of spatial coordinates (e.g., Russ & Brenning, 2010; Schratz et al., 2019), which divides spatial points into k spatially homogeneous clusters and runs spatially stratified sampling from those clusters. Irregular windows in bootstrap are also used by Kraamwinkel et al. (2018). Therefore, this paper applies k-means clusters for stratified sampling of the subsample (Figure 5), keeping the number of clusters between 50 and 100. The decision on the sampling design is strongly linked to the type of estimated model. Methods which do not account for spatial autocorrelation, like machine learning, address this issue through spatial cross-validation (CV). The idea is to divide train and test data spatially, to train the model on one territory and test it on another area. Then, the test model cannot use the information gained by the train model, which happens in nonspatial CV when information is transferred via the spatial autocorrelation channel from neighboring observations. There are many good solutions for spatial CV, such as the blockCV, sperrorest, or mlr3spatiotempcv R packages. However, in the case of spatial econometrics, the sample must include well-preserved spatial autocorrelation, as it will be addressed with W and represented by tessellation, so spatial CV is not applied.
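A minimal sketch of this spatially stratified sampling, assuming the illustrative `firms` data frame from before, first builds k-means strata on the coordinates and then draws each bootstrap subsample proportionally from every stratum (the helper name `sample_stratified` is hypothetical). Sampling without replacement is used here to avoid duplicated coordinates in W.

```r
# Spatially stratified sampling: k-means on coordinates defines irregular,
# non-overlapping strata; each bootstrap subsample draws from every stratum
set.seed(123)
k_strata <- 100
firms$stratum <- kmeans(firms[, c("x", "y")], centers = k_strata)$cluster

sample_stratified <- function(data, s) {
  # draw roughly s observations, proportionally to stratum sizes
  idx <- unlist(lapply(split(seq_len(nrow(data)), data$stratum), function(ix) {
    sample(ix, size = round(s * length(ix) / nrow(data)), replace = FALSE)
  }))
  data[idx, , drop = FALSE]
}
```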
Bootstrap replications and the subsample size are usually recommended arbitrarily. Harris et al. (2017) note that increasing the number of replications is usually easier than raising the sample size. They indicate that usually there are i = 999 replications, while in their study, because of the computational burden, they use i = 99, as the sampling design is simple. Escanciano and Jacho-Chávez (2012), in bootstrapped regression, assume 300 replications, Hall et al. (1995) take 200 replications, Efron and Tibshirani (1997) only 50 replications based on internal variance calculations, while Fox (2015) and Tran and Tran (2016) apply 2000 replications. Hesterberg (2015, p. 380) insists on "1000 bootstrap samples for rough approximations, or 10⁴ or more for better accuracy." In this paper, we test the efficiency of bootstrapped OLS regression, depending on the subsample size s and the number of replications i, in a simulation on 37,000 observations. OLS models for a continuous dependent variable with m = 5 explanatory variables were estimated on a number of observations expanding by 1000, starting with 1000 (s = 1000, 2000, 3000, …, 37,000). For each s, i = 2000 replications were performed, sampled with replacement from the 37,000 full sample. The distributions of the coefficients and the standard errors were derived by sampling, within the 2000 models, a subset of size i, which is equivalent to the number of replications (i = 100, 200, 300, …, 2000; see Figure 6).

FIGURE 5 K-means nonoverlapping clustering of points for stratified sampling.
In bootstrapping the regression coefficients, the β_m converge well toward the population parameters β_m*, with an efficiency ≈ 1 for most of the scenarios (Figure 5d). The variance of the estimators is almost independent of the number of iterations (i on the x-axis) and strongly dependent on the size of the subsample (the consecutive lines and s on the y-axis) (Figure 5b). The regression coefficients' bootstrap variance is lower than the expected (theoretical) one from a single estimation as in Section 2 (Figure 5a) and reaches an efficiency of approximately 1 at circa 22,000 obs. For a big sample (ca. 30,000), it is lower by 90% than the theoretical one. The accuracy of the estimated coefficients depends on the sample size and converges symmetrically toward the full-sample parameter (Figure 6d).
An efficient resampling design allows for avoiding bias in the bootstrapped OLS coefficients. The skewness of the β_j distributions was Skew = 0 ± 0.15, which confirms no asymmetry in the β_j distributions, and consequently no bias. Also, the symmetry of the upper and lower parts of the boxplot (Figure 6c), as well as the symmetric and centered distributions of betas for different sample sizes (Figure 6d), confirm no bias in the estimation process. The above general analysis for OLS confirms that the bootstrap is an efficient and convenient way to obtain full-sample-like estimates at a low estimation cost. The bootstrap models on a moderate-size subsample are achievable with standard software and hardware and provide stable and high-quality results. This analysis also confirms that bootstrapped spatial models behave like nonspatial ones (OLS). The comparison of the spatial and nonspatial models was run on subsamples of s = 1000, 2000, 4000, and 8000 observations bootstrapped i = 500 times (Figure 7). The beta coefficient (Figure 7a) and the spatial rho from SDM (Figure 6c) are more precisely estimated with an increase of the sample size, and the SE decreases by √2 when doubling the sample size (Figure 7b). AIC, because of its dependence on sample size, cannot be used for setting the proper sample size, but its variance diminishes with the increase of the sample size. This confirms that the behavior of the bootstrapped coefficients and variances is similar in spatial and a-spatial models. One can extend the above OLS conclusions to spatial models: bootstrapping appears to be an attractive way of approximating the estimation on a full sample. At the same time, data characteristics are decisive for spatial and a-spatial specifications. The above shows that one can consciously choose the bootstrap parameters: the final model will use 500 iterations (resamplings) of 8000 observations drawn randomly from the full sample.
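Putting the pieces together, the bootstrap loop for the spatial model can be sketched as below; each replication draws its own spatially stratified subsample, builds its own W, and stores the full coefficient vector for the later PAM step (object names and the assumed coefficient ordering of the Durbin fit are illustrative):

```r
# Bootstrapped SDM: i = 500 replications of s = 8000 spatially stratified draws
R_iter <- 500
s_size <- 8000
boot_coefs   <- vector("list", R_iter)
boot_samples <- vector("list", R_iter)   # kept to recover locations of the medoid model

for (i in seq_len(R_iter)) {
  d  <- sample_stratified(firms, s_size)
  Wi <- nb2listw(knn2nb(knearneigh(as.matrix(d[, c("x", "y")]), k = 5)),
                 style = "W")
  mi <- lagsarlm(roa ~ empl + prod + constr + serv + dist, data = d,
                 listw = Wi, Durbin = TRUE)
  boot_coefs[[i]]   <- c(rho = mi$rho, mi$coefficients)   # spatial and regression terms
  boot_samples[[i]] <- d
}
B <- do.call(rbind, boot_coefs)          # [500 x 12] matrix of coefficient sets
```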

SELECTION OF THE BEST MODEL AND BEST DATA REPRESENTATION
The bootstrapped regression estimation procedure with the parameters B(i, s) generates i sets of the regression coefficients. The presented example runs i = 500 bootstrapped models on s = 8000 observations. A model with a constant term, estimated with m = 5 explanatory variables in i = 500 replications, gives a matrix of size [i × (m + 1 + 1)] = [500 × 7], while after inclusion of the spatial parameters in SDM it is [i × (m + m + 2)] = [500 × 12]. All i scenarios (iterations, resamplings) are estimated on the s-item subsamples of randomly selected (but stored) observations. Selection of the best representation requires a multi-dimensional methodology, as the distributions of the individual m coefficients are not independent of each other. Also, information on the location of the observations that gave the best model is necessary, as they will help to build W. This section discusses how to choose the best model and check its quality.
This paper proposes a novel approach to choosing the best model from many candidate bootstrap models: it uses the PAM algorithm known from unsupervised learning. In general, clustering methods such as PAM are designed to split the dataset into the most homogeneous clusters with regard to the many variables analyzed. The partitioning procedure assumes finding the best combination(s) among the available ones to maximize the homogeneity within the groups and the heterogeneity between the groups. Analyzed points in PAM are multi-dimensional, and the applied distance metric (e.g., Euclidean or Manhattan) defines the joint relations between them. For a set of bootstrap regression coefficients, a point is the set of coefficients from a single model, while the calculated distances between points compare pairs of models. To prevent a single variable from dominating, the input data, and consequently the coefficients, should be standardized. The quality of partitioning (for c > 1) is measured with silhouette or dissimilarity measures, computed for a given distance metric. The result of partitioning is twofold: firstly, one gets the id of the iteration set which is the medoid of the coefficients, that is, the most typical set of coefficients; and secondly, all iterations are assigned to a given cluster. More details on PAM and silhouette are in Appendix E.
There are a few possible choices for the number of clusters c. With c = 1 medoid, all observations are within the same cluster, and the best point representation is a single one. With c = 2 or more medoids, a double, triple or bigger set of "best coefficients" is obtained. This can be interpreted as modeling in groups, and then the differentiating criteria are to be found. However, it needs a solid analysis of the significant differences among the groups. This method can also sort out the outlier scenarios, especially those located on the edges of clusters. An efficient estimation will support the c = 1 partitioning, which simplifies the procedure of finding the best (most representative) sampling combination f*_y. In fact, for c = 1, the medoid model is the one with the least sum of (Euclidean) distances to the other models.
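A sketch of the medoid selection on the stored coefficient matrix `B` (from the bootstrap loop above) could look as follows; the explicit least-sum-of-distances computation and `pam()` with k = 1 are assumed to point to the same model:

```r
library(cluster)

# Standardise the coefficient sets and find the medoid model: the single set of
# coefficients with the least total (Euclidean) distance to all other sets
B_std  <- scale(B)
d_mat  <- as.matrix(dist(B_std))
medoid <- which.min(rowSums(d_mat))            # least-sum-of-distances model

# equivalently, PAM with c = 1 cluster returns the medoid; with c > 1 it
# returns one medoid model per cluster
pam_fit       <- pam(B_std, k = 1, metric = "euclidean")
best_model_id <- pam_fit$id.med

# the observations behind the medoid model: the calibrated data, W and model
best_subsample <- boot_samples[[best_model_id]]
W_best   <- nb2listw(knn2nb(knearneigh(as.matrix(best_subsample[, c("x", "y")]),
                                       k = 5)), style = "W")
sdm_best <- lagsarlm(roa ~ empl + prod + constr + serv + dist,
                     data = best_subsample, listw = W_best, Durbin = TRUE)
```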
When assuming a single cluster only, there is a need to check the data's clustering tendency into more partitions. The Hopkins statistic tests whether the data are clusterable (H1) when h ∼ 0 (h < 0.5). It compares the total distances between the closest-neighbor pairs of real points (w_{d_j}) and between real points and their closest uniformly randomly distributed points (u_{d_j}):

$$H = \frac{\sum_{j=1}^{n} w_{d_j}}{\sum_{j=1}^{n} u_{d_j} + \sum_{j=1}^{n} w_{d_j}},$$

where ∑_{j=1}^{n} u_{d_j} is the summed average distance to the nearest neighbor between a real point and a uniformly generated random point (with the same variance as the real data) and ∑_{j=1}^{n} w_{d_j} is the summed average distance to the nearest neighbor within the real data. As the statistic is based on randomly generated data, its values may differ between iterations. In the analyzed example, partitioning into one single cluster is the optimal division, which is confirmed with the Hopkins H statistic: H_OLS = 0.21, H_SEM = 0.18, H_SAR = 0.19, and H_SDM = 0.17.
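A minimal base-R sketch of a Hopkins-type check on the standardised coefficient sets is given below; note that it follows the convention described above (values near 0 indicate clusterable data), which differs from some textbook formulations where the ratio is inverted:

```r
# Hopkins-type clusterability check on the standardised coefficient sets;
# w = nearest-neighbour distances within the real data,
# u = nearest-neighbour distances from uniform random points to the real data
hopkins_stat <- function(X, m = 50) {
  X   <- as.matrix(X)
  idx <- sample(nrow(X), m)
  w <- sapply(idx, function(j)
    min(sqrt(colSums((t(X[-j, , drop = FALSE]) - X[j, ])^2))))
  U <- apply(X, 2, function(col) runif(m, min(col), max(col)))  # uniform over observed range
  u <- apply(U, 1, function(q) min(sqrt(colSums((t(X) - q)^2))))
  sum(w) / (sum(u) + sum(w))            # ~0: clusterable; ~0.5: no clustering tendency
}
hopkins_stat(B_std)
```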

FIGURE 8 Two-dimensional (2D) visualization of multidimensional scaling of Euclidean distances between beta coefficients in the bootstrapped spatial Durbin model. Each point of the scatterplot represents one (out of 500) model; the medoid point (red) is the most representative model. The 2D representation proxies the distances between the full sets of regression coefficients in the bootstrapped models.
For visualization purposes, to see the medoid model's selection, one needs a two-dimensional output, even though the models are compared in i-dimensional space, where i is the number of iterations and the result of comparing i models is an [i × i] matrix. Multi-dimensional scaling allows for dimension reduction from [i × i] to [i × 2]. It keeps the scale of the original values, here the multi-dimensional Euclidean distance between pairs of models (in fact, model coefficients) (see Figure 8). The medoid scenario, located centrally in the cluster in Figure 8, yields the centered model coefficients.
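This visualization step can be sketched with classical multidimensional scaling in base R, reusing the standardised coefficient matrix and the medoid index from the previous sketches:

```r
# 2D multidimensional scaling of the inter-model Euclidean distances (as in Figure 8)
mds <- cmdscale(dist(B_std), k = 2)
plot(mds, xlab = "First dimension", ylab = "Second dimension",
     main = "Bootstrapped SDM models")
points(mds[best_model_id, 1], mds[best_model_id, 2], col = "red", pch = 19)  # medoid model
```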
One should note that the model with the lowest possible AIC or Bayesian information criterion (BIC) is not a representative model but an attractive outlier model. As the AIC and BIC are sensitive to outliers and increase with the bootstrap sample size, they cannot be used for deciding which sample size to use.
The quality of the bootstrapped model is measured in a typical way. Following Escanciano and Jacho-Chávez (2012), the performance of the bootstrapping estimators can be assessed with RAMSE (Root Average Mean Squared Error), denoted as:

$$\mathrm{RAMSE} = \sqrt{\frac{1}{R}\sum_{i=1}^{R}\frac{1}{n}\sum_{j=1}^{n}\left(\hat{y}_{ij} - y_j\right)^2}, \quad (6)$$

where i = 1,2,…,R indexes the replications, ŷ_{ij} is the prediction of the model for observation j in replication i, and y_j is the observed value. In the analyzed example, RAMSE of the medoid model was significantly lower in the spatial SDM and SEM models (RAMSE_SEM = RAMSE_SDM = 0.12) than in the a-spatial model (RAMSE_OLS = 0.31) and in the spatial SAR model (RAMSE_SAR = 0.27) (see Table 3). This indicates the substantially higher quality and better fit of the spatial models. Tables 2 and 3 and Figure 9 show the details of the estimation and the comparison of the medoid models with the average and full-sample coefficients. Comparing the estimated beta coefficients (Figure 9a) clearly shows the upward bias of 5%-10% in OLS and relatively slight differences in betas between the full and medoid SEM, SDM, and SAR models. The bootstrapped (average and medoid) coefficients replicate the full-sample estimates well. The coefficients and the standard errors of betas differ mainly for the constant term (variable 1), but not for the rest of the variables (Figure 9a). SEs of the estimates are as expected: the lowest in the case of medoid SDM and SEM, then all full models, and later medoid OLS and SAR (Figure 9b). All SEs behave predictably and enable approximating the full-sample error with the "double size, √2 decrease in error" rule, while the coefficients of the most representative middle models are as good as those from the full models. The medoid models' efficiency is higher than that of the averaged models (columns 8-9 in Table 2). Moran tests for the residuals' spatial autocorrelation proved that the residuals from the spatial models are random (Moran's I = −0.085, p-value = 1).
The comparison of the medoid PAM-selected model with the full-sample and the average results evidences that the PAM algorithm selects the most typical medoid representation, very similar to the average and full-sample results. The z-test for the equality of coefficients from the full-sample and the PAM-selected medoid bootstrap regressions does not reject H₀ of equality, which means no bias (columns 1-3 in Table 2). The value-added of the PAM-selected medoid model over the averaged coefficients is that (a) with PAM, one can replicate the subsample observations that support this typical pattern and enable further analysis (tessellation), and (b) the coefficients are considered jointly, not independently, as in the case of the average.
Those comparisons suggest that the bootstrapped spatial models, especially SDM, replicate well the expected full-sample estimation using the limited subsample only.

FIGURE 9 The comparison of the beta estimates in the ordinary least squares (OLS), spatial error model (SEM), spatial autoregressive model and spatial Durbin models: (a) beta coefficients, (b) SEs of beta coefficients. Values on the x-axis represent consecutive variables, with 1 being the constant term.
This indicates that bootstrapping together with the PAM procedure is an efficient tool to support big data spatial econometrics. The bootstrap estimates behave as if they were obtained from the full sample; their SE can be rescaled by a constant factor resulting from the "√2 rule," while the computational effort is incomparably lower and the spatial models become feasible. The obtained medoid coefficients can be used to calibrate the econometric model and the spatial weights matrix.

TESSELLATION AS A METHOD OF SPACE CALIBRATION
The medoid combination of the regression coefficients selected with the PAM algorithm above needs to be extended to the spatial dimension. The selected observations, which were used in the medoid bootstrap model, are treated as the best representation of the full sample, both regarding the values (z) and the locations (x,y). The issue here is the representation of the spatial point distribution. By analogy to the econometric model's calibration by finding the best point estimates of the regression coefficients, a calibration of space is needed to obtain the one, universal W necessary for spatial estimation.
In the approach where point data are aggregated within polygons along administrative borders, a continuous polygonal representation of the spatial pattern exists, but it suffers from aggregation and MAUP (Modifiable Areal Unit Problem) issues. It also lowers the accuracy of the spatial information. The sample representation of continuous spatial data was developed well in geostatistics as a point pattern.
There are a few available methods, such as kriging or thin-plate splines (e.g., Chun & Griffith, 2013). However, they are primarily single-dimensional and far from mimicking the spatial weights matrix W.
This paper proposes using Voronoi polygons (also called Dirichlet tessellation or Thiessen polygons) as a method of discrete polygonal representation of continuous spatial data based on sample point data. In general, the tessellation constructs polygons around the points by delimiting boundaries at half of the distance between the points and making those points centroids. The irregularly shaped tiles cover the whole area of a region in a continuous and nonoverlapping manner. Each tile contains one point, and each point is assigned to one tile. For real geo-locations, the list of tessellation tiles may be shorter than the number of observations in the case of overlapping locations. If the locations of two points are the same, they are assigned to a single, unique tile. The spatial weights matrix W, in defining the neighborhood, can accept the same neighborhood information doubled in rows for the overlapping points. One possible solution is jittering the locations, shifting each by a small epsilon.
The geo-locations of the observations that were used in the estimation of the best medoid model serve as the best representation of the spatial point pattern. The tessellation transforms the point pattern into a continuous set of polygons (Figure 10). This natural tessellation replicates well the point pattern of the underlying location data. This approach to space delimitation and point data aggregation does not suffer from the MAUP. Many studies from the last 40 years (e.g., Sibson, 1980) confirm that tessellation is an attractive method for data analysis because of its flexibility. Ahuja (1982) indicates the great potential of Voronoi polygons, which "possess intuitively appealing characteristics, as would be expected from the neighborhood of a point," while Hall et al. (2001) confirm that tessellation can be used efficiently "in determining polygonal neighbourhood relationships between point locations". There is exhaustive literature on the properties of the Dirichlet tessellation (e.g., Du et al., 1999; Hinde & Miles, 1980; Kopczewska, 2021), which generally emphasizes the favorable features of this method. Recent developments have started to link tessellation to the bootstrap. Secchi et al. (2013) use the bootstrap and tessellation to reduce the original dataset and find the neighborhood's local representatives. Their approach is statistically and computationally efficient, even if they use it in the context of clustering functional data, not spatial regression. More on tessellation in bootstrapping is in Appendix F.
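A tessellation sketch with the sf package is shown below; `region` is assumed to be an sf polygon of the analysed territory and `best_subsample` the medoid-model observations from the earlier sketches (a one-point-per-tile match is assumed, i.e., duplicated coordinates have been jittered):

```r
library(sf)

# Voronoi (Dirichlet) tessellation of the medoid-model point pattern
pts   <- st_as_sf(best_subsample, coords = c("x", "y"), crs = st_crs(region))
tiles <- st_collection_extract(st_voronoi(st_union(pts)), "POLYGON")
tiles <- st_intersection(st_set_crs(st_sf(geometry = tiles), st_crs(region)),
                         region)                     # clip the tiles to the region
# st_voronoi does not preserve input order, so match each tile back to its point
tiles$obs_id <- unlist(st_intersects(tiles, pts))    # one point (observation) per tile
```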

FORECASTING FOR OUT-OF-SAMPLE GEO-LOCATED POINTS
Spatial forecasting seems to have more challenges than available solutions can cover. For in-sample predictions (the same training and testing dataset), as well as for previsions (the same spatial units, different values of variables), one can use the well-known trend-signal-noise predictor (Cressie, 1993), which is suitable for regional data only. For out-of-sample data (new locations, new values of variables), Goulard et al. (2017) overview the methods and propose algorithms for SAR models. Jiang (2018) similarly reviews existing spatial prediction methods for regional data and indicates challenges for other data types. Popular kriging, used for predictions in point patterns, is not suitable for big data, and its forecast, even in regression models, is based on smoothing the surface of the target variable. Zhu et al. (2018) propose using the Third Law of Geography, which assumes that the similarity of characteristics of two locations is reflected in the similarity of the target variable at the analyzed points. The similarity is used as a weight in the prediction of the target variable at a given point. When using the spatial econometrics approach, the problem in forecasting with the spatial microeconometric model for the new location of an out-of-sample point (x*,y*) is that it is not a part of the estimated W in the final model, while it must be linked to this W. A simple recalculation of W when a new point is added would decalibrate the model and potentially disturb the result, as the medoid combination would lose its properties of the best representation. Kriging, which is often used, serves as a good approximating method, but it also has profound limitations with regard to the sample size (van Stein et al., 2015). Thus, the alternative solution proposed here is to use the tessellation (Voronoi polygons) to calibrate the space, assign new points to tiles via the overlaying procedure and, consequently, link them to the already calibrated W (see Figure 10b).
The idea lies in controlled imputation. In the analyzed example, the final W is 8000 × 8000, and the tessellation tiles t were enumerated as observations. Let us assume that a new geo-point (x_new, y_new) = (x*,y*) was assigned to tile no. 17 (t_17 out of t = 8000 tiles). As the new observation also contains the values of the explanatory variables (z*), it can be imputed into the dataset in place of observation no. 17, which automatically defines both a neighborhood and a place in W. With this replacement, the calibrated model can easily be used to forecast the value of the dependent variable for the new point. The most important feature of the spatial model, the characteristics of the neighborhood, stays unchanged. Thus the information necessary for computing the spatial lags is available.
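The controlled-imputation forecast for a single new point can be sketched as below. It assumes the objects from the previous sketches (`tiles`, `best_subsample`, `W_best`, `sdm_best`) and a one-row data frame `new_firm` with coordinates and covariates; the prediction uses the SDM reduced form with the calibrated coefficients and the unchanged W (a sparse solve is used, and the coefficient ordering of the Durbin fit is assumed to be own terms followed by lagged terms):

```r
library(Matrix)

# 1. Which tessellation tile does the new location fall into?
new_pt  <- st_as_sf(new_firm, coords = c("x", "y"), crs = st_crs(region))
tile_id <- unlist(st_intersects(new_pt, tiles))          # e.g. tile no. 17
row_id  <- tiles$obs_id[tile_id]                         # in-sample observation behind that tile

# 2. Controlled imputation: swap that observation's covariates for the new firm's
data_swapped <- best_subsample
data_swapped[row_id, c("empl", "prod", "constr", "serv", "dist")] <-
  new_firm[, c("empl", "prod", "constr", "serv", "dist")]

# 3. Reduced-form SDM prediction with the calibrated coefficients and unchanged W
Wm  <- Matrix(listw2mat(W_best), sparse = TRUE)          # W of the medoid subsample
X   <- model.matrix(~ empl + prod + constr + serv + dist, data = data_swapped)
XWX <- cbind(X, as.matrix(Wm %*% X[, -1]))               # [X, WX]: own and lagged covariates
signal <- XWX %*% sdm_best$coefficients                  # beta and theta jointly
yhat   <- solve(Diagonal(nrow(Wm)) - sdm_best$rho * Wm, signal)
yhat[row_id]                                             # forecast roa for the new firm
```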
There are a few methods of internal validation of models: simple validation by splitting the population into learning and testing datasets; CV (such as K-fold validation), where the dataset is divided into K subsets and in K trials the kth subset is treated as a testing sample, while the remaining ones act as learning samples; and bootstrap resampling, where the resampled parameters of the model are used to build their confidence intervals. Tran and Tran (2016), following Kuhn and Johnson (2013), as well as Molinaro et al. (2005) and Steyerberg et al. (2001, 2003), indicate that bootstrap resampling is more efficient than simple validation and CV. This paper applies a complex approach, validated bootstrap resampling: the regression coefficients selected with PAM from a series of the bootstrapped models are, in fact, the central values of the confidence intervals of the parameters. The observations which were used in the final medoid model are treated as a learning sample, while the others constitute the "out-of-sample" pool, from which the test locations are drawn. The forecast quality can be measured with RAMSE, introduced in (6) as a measure of model quality. For out-of-sample observations j from the test data (x*,y*,z*), the modified RAMSE compares the forecasted ŷ_j with the observed y*_j as follows:

$$\mathrm{RAMSE}_{oos} = \sqrt{\frac{1}{J}\sum_{j=1}^{J}\left(\hat{y}_{j} - y^{*}_{j}\right)^2}, \quad \hat{y}_{j} = f^{*}_{y}(x^{*}_{j}, y^{*}_{j}, z^{*}_{j}),$$

where f*_y is the calibrated model, the best model selected with PAM, and J is the number of out-of-sample points. In this approach, RAMSE includes only the out-of-sample observations and their forecasts, while the observations used in calibration are omitted in calculating RAMSE.
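The modified out-of-sample RAMSE itself is straightforward; a minimal helper, applied to the vector of forecasts produced point-by-point as in the sketch above, might be:

```r
# Out-of-sample RAMSE over the J withheld test points (forecast vs. observed roa)
ramse <- function(y_hat, y_obs) sqrt(mean((y_hat - y_obs)^2))
# e.g., with forecasts collected one-by-one as sketched above:
# ramse(yhat_oos, test_points$roa)
```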
For the analyzed dataset, the quality check of the four models calibrated on s = 8000 obs. was performed on j = 100 out-of-sample obs. The new observations were introduced to the calibrated model step-wise, one observation per check, to keep as much original data in the model as possible. For each j, the respective tessellation tile of the point location was indicated, and the original data for this tile were removed. Thus, the theoretical values ŷ_j were calculated with n − 1 original observations from the in-sample dataset (x,y,z) and a single out-of-sample observation (x*,y*,z*).
The goodness-of-fit of the models themselves on training data is significantly better in the SDM and SEM models (RAMSE_SEM = RAMSE_SDM = 0.12), a few times lower than in OLS (RAMSE_OLS = 0.31) and in SAR (RAMSE_SAR = 0.27). In the forecasts for out-of-sample data, RAMSE_SAR = 0.508 and RAMSE_SDM = 0.506 outperform RAMSE_OLS = 0.527 and RAMSE_SEM = 0.542. For the dependent variable roa ranging over [1, 10], a RAMSE_SAR of ca. 0.5 obtained in the bootstrap models is an attractive result: on average a 5% forecast error.
This suggests that the bootstrapped SDM model performed best in terms of RAMSE for the model and the forecasts, with all variables significant and the lowest standard errors of estimates. Bootstrapped SDM is also far better than bootstrapped OLS and slightly better than the other spatial bootstrapped models. The full-sample SDM gives a similar RAMSE to the bootstrapped SDM, both for the model (RAMSE_SDM full = 0.137) and for the j = 100 forecasts (RAMSE_SDM full = 0.555).
This indicates two critical facts. Firstly, spatial estimation and the spatial information reflected in the spatial weights matrix matter for the quality of fit. The value-added of using spatial models lies in the additional information on spillovers that one can obtain, controlling for spatial autocorrelation and using the specific neighborhood information. Secondly, the proposed methodology of calibrated tessellation and of using a medoid model from bootstrapped sampled alternatives is efficient and reliable for forecasting out-of-sample data at a relatively low technical cost. Tessellation can be efficiently calculated for thousands of points (Figure 10b).

DISCUSSION OF RESULTS
This novel multistep estimation procedure, designed to deal with forecasting for out-of-sample data with spatial big-data models, is a mixture of traditional econometrics, bootstrapping and machine learning. All its elements are well-positioned in the literature; however, this combination has never been proposed. Treating 37,000 observations as big data has only an illustrative purpose: to compare the procedure and full-sample results and reliably assess the proposed solution. All tests presented above confirm the power of the bootstrap approach in improving estimates. The presented OLS, SAR, SEM and SDM models were selected arbitrarily as a case study, and the procedure works for any model. The paper outlines the idea of the approach and shows its high quality. When applying the solution to other datasets, the researcher has to choose the bootstrapping parameters: the number of observations taken as a subsample, the number of bootstrap iterations, and the number of k-means clusters for sampling data from irregular shapes. These decisions are a function of the available computing capacity and time. The analysis above showed that doubling the dataset doubles the computation time (Figure 4b). Using a standard PC with Windows and R, simulation of a single spatial model on 5000 observations, with five variables and a knn = 5 W matrix, iterated 500 times, takes approximately 500 (times) × 2.5 (s) ≈ 20 min. The recommended solution in this paper is to choose at least approximately 5000 observations and 500 replications. The bigger the subset, the more precise the results, but they are obtained in a longer time. Clustering procedures with k-means, PAM, or CLARA may seem efficient for big data; however, they are sensitive to a high number of clusters. The recommendation is to apply not more than 50-100 clusters.
Thus, the user of this multistep procedure is to: (i) prepare the data, (ii) decide about the bootstrapping parameters (min. s = 5000 observations and i = 500 iterations), (iii) divide the sample into training (e.g., 90%) and test (e.g., 10%) data, (iv) cluster the geo-coordinates with k-means (e.g., k = 100) to run sampling of locations from irregular shapes, (v) estimate the desired model (for a given variable specification and type of model) i times in a loop, (vi) select with PAM the best medoid model out of the set of i models, (vii) tessellate the space with the observations from the best PAM-selected model, (viii) overlay the new out-of-sample points on the tessellated surface to assign points to tiles, (ix) run the controlled imputation in the test dataset by replacing a single observation's data with the new point's data, (x) forecast the value with n − 1 old records and 1 new record. The estimates obtained in this way are nonbiased and efficient, and they guarantee a low RAMSE.
This study is within the framework of spatial autocorrelation solutions, which in the taxonomy by Jiang (2018) sits next to spatial heterogeneity, limited ground truth, and multiple spatial scales and resolutions. As Jiang (2018) has shown, there is a customary attitude of applying spatial econometrics to areal data and geostatistics (such as kriging) or GWR to point data. The trend of using spatial econometrics for point data is increasing (Abbruzzo et al., 2021; Arbia et al., 2021; Piacentino et al., 2021; Santi et al., 2021) but, due to the limitations discussed in this paper, not fully exploited. The presented solution may push those studies forward. An important aspect is the alternative solutions to the proposed one. For the problems of spatial heterogeneity, limited ground truth, and multiple spatial scales and resolutions, innovations slowly appear, such as modifications of kriging and GWR, decomposition-based ensembles, multitask learning, semi-supervised learning, active learning etc. However, as shown in Jiang (2018), there are more challenges than ready-to-use solutions, especially for spatiotemporal models, for anisotropic spatial dependency, and for big data. Spatial predictions need deepened interest and new concepts which enable bypassing the obstacles identified until now.

CONCLUSIONS
The paper offers several methodological novelties and interesting solutions to econometric problems. Firstly, it develops a complex approach to spatial microeconometric modeling, solving the computational efficiency problems in the case of big spatial data, the lack of a unique spatial weights matrix W when sampling, and the difficulty of forecasting for out-of-sample data. Secondly, it introduces a new method for forecasting in spatial models for out-of-sample geo-located points by linking train and test locations in pairs for substitution in W. The pairing uses a tessellated surface to represent the generalized point pattern with Voronoi polygons and assigns new points by overlaying them on the tessellation tiles. Thirdly, it uses the bootstrap technique to shrink the dataset's size and find the most representative combination of sub-sample observations. Fourthly, it introduces the double-calibration concept in spatial models, calibrating both the regression coefficients in SAR, SEM, and SDM, and the spatial weights matrix W. Fifthly, it applies the PAM algorithm in a joint analysis of the bootstrap regression coefficients to select the jointly most central scenario and its underlying observations, treated as the best representation of coefficients and spatial sampling.
The proposed solution provides a stable, efficient and feasible approach to estimating and calibrating typical (SAR, SEM, and SDM) spatial econometric models on geo-referenced big data and to forecasting for new out-of-sample locations. It significantly increases the computational efficiency of modeling. Bootstrap estimations yield accurate estimates with highly efficient errors and precise forecasts. This method can find applications in all microeconometric problems, especially in models for individual business locations and in real estate valuation models for datasets of hundreds of thousands of observations or more. In the case of large datasets or big data, estimation on the whole dataset, even when feasible, is usually technically very demanding. Thus the bootstrap technique facilitates and simplifies the calculations without losing accuracy.

FUNDING INFORMATION
This is a part of a project on "Spatial econometric models with fixed and changing structure of neighbourhood. Applications to real estate valuation and business location" financed by National Science Center Poland (Krakow, Poland; OPUS 12 call, grant number UMO-2016/23/B/HS4/02363).