Using geographically weighted regression to predict the spatial distribution of frozen ground temperature: a case in the Qinghai–Tibet Plateau

This paper combines the use of principal component analysis (PCA) and the geographically weighted regression (GWR) model to predict the spatial distribution of frozen ground temperature. PCA is used to reduce the multicollinearity among covariates, while the GWR model is used to address the spatially non-stationary relationship between frozen ground temperature and its predictors, such as air temperature, latitude, longitude, and vegetation cover. Our approach is applied in a typical permafrost area in the Qinghai–Tibet Plateau, Western China. The result demonstrates the applicability of our approach in the spatial distribution of frozen ground temperature and shows that the approach can be used for analysis and prediction. This study offers insight into temperature monitoring of frozen ground, which helps prevent regional geological disasters.


Introduction
With pronounced global warming, frozen ground, which is a major part of the cryosphere and includes permafrost and seasonally frozen ground, is facing degradation (James et al 2013, Vasiliev et al 2020. For example, the Qinghai-Tibet Plateau (alternatively known as the Tibetan Plateau) has recently undergone a severe permafrost degradation problem. From 1975 to 2017, the area of the permafrost in the plateau has considerably decreased by 440 000 km 2 (Zou et al 2017). Moreover, the change in the active layer thickness in the plateau has accelerated. Evidently, the degradation of frozen ground may cause numerous geological disasters, including but not limited to collapse, landslides, mudslides, and mercury release , Mu et al 2020. As a large number of studies have indicated, frozen ground is highly sensitive to changes in ground temperature and thus becomes an indicator of climate change Yao 2009, Cheng et al 2013). It is widely acknowledged that understanding the temperature distributions of frozen ground is of paramount importance in many facets . For example, it provides a valuable basis for the regional prevention of geological disasters.
The most common method used to explore temperature distributions of frozen ground is numerical modeling (or computational simulation). Numerous researchers have made the effort along this line. Yu et al (2014) coupled the finite element method with a heat transfer and seepage model to analyze the effect of underground oil heating pipes on the temperature of frozen ground. Mu et al (2016) conducted a heat transfer-based finite element analysis to simulate temperature distributions of thermosyphons and foundation soils around the frozen ground along the Qinghai-Tibet transmission line. Garayshin et al (2019) established a 2D nonlinear heat equation with the phase change to explore temperature across the ice wedge. Liu et al (2020) developed a twodimensional non-steady-state heat conduction finite element model to predict the distribution of permafrost temperature by taking into account factors such as wind speed, wind direction, evaporation, and solar radiation. To facilitate the preparation of input parameters and the visualization of simulated results, Marchenko et al (2008) developed the GIPL (Geophysical Institute Permafrost Laboratory) 2.0 model by combining the GIPL 1.0 model (which numerically solves 1D nonlinear heat equation with phase change) and ArcGIS. The GIPL 2.0 model is determined to well simulate the dynamic evolution of frozen ground temperature in Alaska, the United States. Jafarov et al (2012) adopted the GIPL2-MPI numerical transient model with various factors (e.g. altitude, slope, aspect, and precipitation) to simulate the temperature distribution of the Alaska permafrost region. Qin et al (2017) and Yin et al (2018) used the GIPL2 model to predict the active layer variation of the permafrost in the Qinghai-Tibet Plateau and the annual average temperature distribution in the permafrost region of the Qinghai-Tibet Plateau Engineering Corridor, respectively.
However, existing studies mainly focused on predicting frozen ground temperature at the vertical level under the analytical framework of numerical modeling. By contrast, they seldom investigated the temperature at the horizontal level. More importantly, they have rarely considered the spatially non-stationary relationship 3 between frozen ground temperature and its influencing factors. Put it in another way, a spatially invariant (or fixed) relationship between frozen ground temperature and its correlates was often assumed. However, as Fotheringham et al (2002) indicated, it is often not the case in real life. In light of this, based on in-situ borehole observation data, this study applies geostatistical models (more specifically, geographically weighted regression (GWR) models) to analyze the variation in frozen ground temperature and to predict the spatial-temporal distribution of the temperature in June-September, 2001, in a typical permafrost area of the Qinghai-Tibet Plateau. In addition, in stark contrast to global regression models, the GWR model is a local regression technique that captures the spatial non-stationarity of associations between the dependent and independent variables (Fang et al 2015). On the other hand, in the application of GWR (and other regression models), it is easily trapped in multicollinearity among independent variables, thus giving rise to uncertainty in prediction results (Nguyen and Ng 2020). This is highly applicable to our temperature modeling study. In such a case, this study uses principal component analysis (PCA) to eliminate multicollinearity among contributory factors to frozen ground temperature.
The contributions of this study include (a) combining PCA and the GWR model to explore the spatially varying relationship between frozen ground temperature and its correlates; (b) predicting the spatial-temporal distribution of the temperature of frozen ground in the study area; and (c) revealing the potential of using multi-source data to predict the spatial-temporal distribution of frozen ground temperature and providing a new angle to this issue.
The remainder of the paper is organized as follows. Section 2 succinctly presents the data. Section 3 introduces the methodologies of PCA and the GWR model. Section 4 shows the results. Section 5 concludes the paper, discusses research limitations, and points out avenues for future research.

Study area
Our study area is a typical permafrost area of the Qinghai-Tibet Plateau (figure 1). North to the Kunlun Mountains and south to the Chiqu Valley, this study area takes Wudaoliang as the center and includes the Chumar River High Plain, the Hoh Xil Mountain area, and the Beilu River Basin. It is the most densely monitored area of frozen ground during the construction and operation of the Qinghai-Tibet Railway (figure 1). The average altitude of this area is 4900 m. This area has a plateau continental climate. It is characterized by an arid climate on the glacial margin. In the study area, the evaporation is much larger than the rainfall, and the vegetation is sparse. The average annual air temperature of the areas is from −2 • C to −6.9 • C. The temperature is highest in July (with an average of 6.5 • C-8.1 • C) and lowest in January (with an average of −14.5 • C to −17.4 • C).

Frozen ground temperature data
The borehole temperature data were obtained by the First Survey and Design Institute of China Railway Group. The data utilized in this study were collected from a maximum of 231 boreholes from May to December, 2001, along the Qinghai-Tibet Railway. The depth for these boreholes ranges from 0.2 m to 15 m. Abnormal drilling data were eliminated before subsequent statistical analysis.
Through the statistical analysis of the temperature data measured at the boreholes (figure 2), we find that the average temperature of the study area is not much different from the median, indicating that the distribution of the ground temperature may be relatively concentrated with low dispersion. Moreover, the ground temperature in July and August was highest, indicating that permafrost's active layer thickness may change significantly.
This study only focuses on frozen ground temperature distribution at many depths (ranging from 0.2 m to 15 m) from June to September because of limited observations in other months.

Selection of independent variables
The temperature of frozen ground is expected to be affected by elevation, longitude, latitude, and various local factors, including but not limited to rock lithology, topography, vegetation, snow, and water content (Osterkamp 2007. Obviously, these factors have temperature effects that vary across space and time (or season). For example, vegetation and snow cover have varying temperature effects in cold and warm seasons. In addition, related studies have identified that meteorological factors have significant impacts on frozen ground temperature (Zhang and Stamnes 1998, Throop 2012, Liu et al 2013. To this end, a total of 13 independent variables are selected for statistical modeling, including elevation, longitude, latitude, air temperature, air temperature amplitude, precipitation, wind speed, net radiation budget (difference between short-wave radiation and long-wave radiation), rock lithology, slope, aspect, fractional vegetation cover (FVC), and snow.
Different lithological rocks may have various physical, surface heat transfer, water-holding, and permeability properties, resulting in different effects on frozen ground temperature (Smith and Riseborough 1996). Various categories of rocks should be considered, such as sandstone, conglomerate rock, tuff, siltstone, marlstone, and grit.
Terrain and geomorphology influences ground temperature mainly via slope and aspect (Kenner and Magnusson 2017). Areas with different slopes and aspects receive different levels of solar radiation, so they vary in evaporation and vegetation and thus frozen ground temperature (Cheng 2004).
FVC has cooling effects in summer but warming effects in winter on frozen ground temperature. Therefore, this study selects FVC as one of the independent variables (   Snow cover impacts the temperature of frozen ground in many ways. 'Snow has high albedo and emissivity that cool the snow surface, high absorptivity that tends to warm the snow surface, low thermal conductivity so that a snow layer acts as an insulator, and high latent heat due to snowmelt that is a heat sink' (Zhang 2005, p 1). The overall effect of snow cover on frozen ground temperature is dictated by various factors such as timing, duration, accumulation, density, structure, thickness of seasonal snow cover, and its interactions with other variables (e.g. micrometeorological conditions, local microrelief, and vegetation) (Zhang 2005, Park et al 2014, Zhang et al 2018. Hence, there should be numerous snow cover variables (e.g. snow depth and snow cover in the same period and the snow cover in the early winter and spring) for the prediction of frozen ground temperature.
Meteorological conditions, such as ground precipitation rate, wind speed, downward short-wave radiation, downward long-wave radiation, and air temperature amplitude, significantly influence frozen ground temperature (Zhang and Stamnes 1998, Throop et al 2012, Liu et al 2013, Qin et al 2013. It is known from climatology that the underlying surface is warmed by short-wave radiation and the ambient air is warmed by the underlying surface. Notably, this study use net radiation budget, namely the difference between short-wave radiation and long-wave radiation as a predictor (Chou et al 2012, Xie et al 2015.

Measurement of independent variables 2.4.1. Latitude and longitude
It is measured by the authors.

Elevation, slope, and aspect
The study uses ArcGIS 10.2 to calculate the slope and aspect by the digital elevation model (DEM) (accuracy of 30 m) downloaded from the Geospatial Data Cloud website (www.gscloud.cn/).

Rock lithology
Its data are cut from the 1:1 million geological maps in 2002 and vectorized to obtain 13 basic rock units. By categorizing the borehole data distributed in different rock strata, the lithology character variable is quantified according to the inflection points of the borehole temperature in different months.

Vegetation cover
FVC is calculated by using the pixel dichotomy model with the input of the normalized difference vegetation index (NDVI). The NDVI data are obtained from the Terra satellite MOD13 Q1 global index (resolution = 250 m). Pearson's correlation analyses are conducted for the dependent and independent variables. Table 1 offers the results for August (N = 128). Since there was no snow cover in the borehole points in August, the number of independent variables becomes 12. It indicates that the selected 12 independent variables are more or less related to frozen ground temperature at different depths, justifying the selection of predictors. Figures A1 and A2 show the distribution of time-constant and time-varying variables, respectively. Figure 3 shows the workflow of this study. Due to the uneven distribution of borehole monitoring stations in the study area, a 3 km × 3 km fishnet was established for the subsequent analysis in ArcGIS. To     eliminate multicollinearity among some independent variables, the study converts them into independent principal components by PCA and extracts the components with characteristic roots above one as the key variables for GWR modeling. The leave-oneout cross-validation (LOOCV) method is employed to cross-validate the modeling results.

PCA
PCA, an eigenvector-based exploratory data analysis technique, uses an orthogonal transformation to convert a large number of potentially correlated variables into several components that are linearly uncorrelated and contain most, though not all, of the information involved in the original variables.  The transformation reduces the collinearity in the data. Before performing the PCA, the original data are usually normalized. The relationship between the original variables and components is expressed as follows (UI-Ssaufie et al 2013): where PC i is the ith component, X j is the jth variable, and l ij is the coefficient (or weight) of X j .

GWR
The traditional ordinary linear squares (OLS) regression model assumes that the relationship between a dependent variable and its covariates is fixed across pace and provides a global modeling approach. By contrast, the GWR model embeds spatial location information in the data and extends the traditional OLS regression framework (Brundon et al 1996, Fotheringham et al 1998Fotheringham et al , 2002. It assumes that the association between the dependent variable and covariates can be spatially variant and offers a set of equations to estimate coefficients at any given location. It has been extensively employed in existing literature (Xu and Huang 2015, 2020a, 2020b, 2020c, Xu and Yang 2019, Zhao et al 2020. The introduction of the GWR model is heavily borrowed from Yang et al (2020aYang et al ( , 2020b.
The GWR model is expressed as follows: where y i is the dependent variable in the ith regression point, (u i , v i ) is the coordinate of the ith point, x ik is the kth explanatory variable of the ith regres- , the regression coefficientβ(u i , v i ) can be estimated by the weighted least squares method:   where The estimated valueŷ i of the observation value y i is: For any given observation, GWR adopts a kernel function to estimate the weights of geographically close observations and sets a rule that closer observations have a higher effect on the estimation of the coefficients than observations that are more geographically distant. The element of the weight matrix W(u i , v i ) is measured by mean of a weight kernel function. The choice of the weight kernel function has significant influences on the estimation of the parameters of the GWR model. Typically, four types of spatial kernel functions are used, including the  fixed Gaussian, fixed bi-square, adaptive Gaussian, and adaptive bi-square kernel functions: Fixed Gaussian kernel: Fixed bi-square kernel: Adaptive bi-square kernel: where w ij is the weight of point j in the estimation of the local regression equation of point i, d ij is the Euclidean (straight-line, crow-fly) distance between points i and j, θ is a fixed bandwidth, and θ i(k) is an adaptive bandwidth defined by the kth nearest neighbor distance for estimating the local equation of point i. GWR is sensitive to the bandwidth. If the bandwidth is too large, parameter estimates will be similar to global estimates. If the bandwidth is too small, parameter estimates will be highly localized, and their variance may be large (Windle et al 2010).

Selection of covariates
Diagnostics are needed to assess the presence of collinearity (or non-independence of covariates) in the data. Common collinearity diagnostics include the  absolute value of correlation coefficient, condition number, variance-decomposition proportions, and variance inflation factor (VIF) (Dormann et al 2013). In this study, VIFs are used to evaluate the collinearity in the data for all the months, including June-September. Table 2 provides the VIF values of all the 12 independent variables in August and reveals the non-independence of many variables. (VIFs for other months are available from the authors upon reasonable request.) To retain as many independent variables as possible, this study does not conduct PCA for all the 12 variables. Instead, it keeps five linearly uncorrelated variables (i.e. Elevation, Lithology, Slope, Aspect, and FVC) and transforms seven variables that are subject to collinearity. Two out of seven principal components (PC1 and PC2) have an eigenvalue greater than 1, and they collectively capture 90.06% of the variation in the data. Therefore, the two principal components are included in the GWR model. Table 3 shows the VIF values of the five linearly uncorrelated variables and the two PCs. Observe that VIF values of all the seven variables are far less than 10. This observation illustrates the absence of collinearity between the variables (Dormann et al 2013) and suggests that the seven variables can be used to predict frozen ground temperature in the GWR modeling framework.

Prediction of the spatial distribution of frozen ground temperature
Temperature data from hundreds of boreholes along the Qinghai-Tibet Railway and its covariates are used to predict the region-wide frozen ground temperature at many depths. Figure 4 provides predicted temperature fields in all the 4 months.
We initially interpret the temperature fields at the horizontal level, and frozen ground temperature at a specific depth (0.5 m) is used for illustration. Figure 5 shows the spatial distribution of predicted temperature at 0.5 m below the frozen ground from June and September. Such distributions seem highly reasonable. The ground temperature is relatively low in high-altitude areas, including the Kunlun Mountains, the Hoh Xil Mountain, and the Fenghuo Mountain. Comparatively, the temperature is high in high plains and basins. In other words, the temperature is predicted to decrease with advancing altitude. In addition, the ground temperature is higher in areas adjacent to lakes and rivers than in other areas. Moreover, the change in ground temperature is bigger in July than in August. Furthermore, 0.5 m frozen ground temperature ranges from 0 • C to 10 • C in July-September, but it can reach −4 • C in June. Besides, from June to September, frozen ground temperature first increases and then decreases, and it peaks in July. Figure 6 shows the spatial distribution of frozen ground temperature at different depths for each month. Observe that the ground temperature decreases with increasing depth. The Chumar, Beilu, and Tongtian River Basins demonstrate a similar characteristic: the temperature is higher in the downstream area than in the upstream. In the depth range of 0-5 m, the ground temperature varies greatly. By contrast, when the depth is above 5 m, the ground temperature changed a little.
By comparing the spatial distributions in different months, we find that except for the 0.5 m frozen ground, the temperature of the frozen ground at all the other depths increases from June to September, and it reaches the peak in September. This finding is due to the time lag caused by the downward temperature diffusion. Figure 7 shows the variation in ground temperature based on the observed borehole data. It indicates that the ground temperature fluctuates between −4 • C and 16 • C. Notably, the ground temperature varies significantly within a depth range of 0-5 m, and its maximum difference can reach 16 • C. Within 5-15 m, the ground temperature remains relatively stable. Such results provide insightful evidence to support the spatial distribution of ground temperature regarding the study area. Table 4 shows the cross-validation results. The GWR model has a high degree of goodness of fit, and its R 2 reaches 85.81%. In addition, the mean absolute prediction error (MAE) and root mean square error (RMSE) values are small, implying that the GWR model is robust.

Results of predictions for nearby locations
To further validate the GWR model and gain confidence in the proposed methodology, we decide to compare the ground temperature data from some randomly selected borehole points to the prediction results of nearby points ( figure 8(a)).
On the one hand, points with similar topography and geomorphology are compared. The borehole points in the Kunlun Mountains and the Hoh Xil Mountain area are selected to compare with those located within 10 km (figures 8(b) and (c)). On the other hand, points with similar rock lithology are compared. The study randomly selects the borehole points and the corresponding prediction points in the Fenghuo Mountain area (figure 8(d)) to make the comparison. Figure 9 shows the comparison of observed temperature at the borehole points and predicted temperature at nearby points. Ground temperatures at the borehole points and their nearby points are relatively consistent. This further illustrates that the GWR model can be used to simulate the ground temperature in the study area effectively.

Results of predictions for terrain-specific locations
To gain further confidence in model performance, we decide to investigate the relationship between terrain types and frozen ground temperature, thus categorizing the study area into three groups: mountainous area, high plain, and basin. Ten geographic locations (points) are randomly selected in each area for temperature prediction. Figure 10 displays the distribution of the prediction points. Figure 11 offers frozen ground temperature curves at the prediction points. It demonstrates that temperature is lower in high-altitude mountainous areas than in low-altitude plains and basins (see figure 11(a)); the Kunlun Mountains area has the lowest frozen ground temperature, which is stable between −3 • C and −4 • C; frozen ground temperature in the Hoh Xil Mountain and the Fenghuo Mountain areas is −1 • C to −3 • C; and the temperature in other areas is relatively high (see figure 11(b)). Figure 12 shows frozen ground temperature curves at all the six prediction areas separately (see Figure 12 for geographic distribution). The temperature changes by depth in the Chumar River High Plain ( figure 12(b)), the Hoh Xil Mountain (figure 12(c)), the Beilu River Basin (figure 12(d)), and the Chiqu Valley (figure 12(f)) are relatively small, which may be due to the homogeneous soil lithology in these areas. However, the ground temperature varies by depth significantly in the Kunlun Mountains ( figure 12(a)) and the Fenghuo Mountain ( figure 12(e)).
To recap, the above findings are reasonable and agree with reality, thereby supporting the plausibility of the proposed methodology (PCA + GWR).

Conclusions
In this study, we combine PCA and the GWR model to predict the spatial distribution of frozen ground temperature. The GWR model is used to address spatially non-stationary relationships between the dependent variable (frozen ground temperature) and covariates (e.g. latitude, elevation, and air temperature), while PCA is used to relieve the collinearity (or correlation between covariates) problem.
The model provides reasonable results of frozen ground temperature. Moreover, through LOOCV and the comparison between the measured temperature data from several borehole points with temperature predictions of nearby points, this study shows that the GWR model performs reasonably well in predicting the spatial distribution of frozen ground temperature. This study provides the following findings: (a) frozen ground temperature in the study area is predicted to have a trend of decreasing with increasing altitude and increasing underground depth; (b) frozen ground temperature of mountainous areas is lower than that of high plains and basins; (c) in the same altitude, lakes and rivers' temperature is relatively high; (d) the temperature fluctuates greatly when the underground depth of the study area is within 5 m, and the maximum difference is 18 • C; and (e) when the depth is over 5 m, the ground temperature remains nearly unchanged.
The frozen ground temperature analysis provides insights into the determination of 0 • C isotherms and the monitoring of the change in ground thickness. In a departure from the existing literature that heavily relies on numerical modeling (or computational simulation) to analyze frozen ground temperature at the vertical level, this study mainly focuses on frozen ground temperature at the horizontal level and thus tries to use a statistical model that addresses spatial non-stationarity (the GWR model) to explore the spatial distribution of the temperature. Admittedly, this study is not so ambitious to provide a definite conclusion on the issue. Instead, it intends to offer a new angle or perspective and arouse attention to the studied problem from, inter alia, researchers and decision makers. The methodology itself is, of course, open to debate. Moreover, this study has other limitations that highly deserve further investigation. First, mainly restricted by data availability, this study only chooses June-September, 2001, for the frozen ground temperature prediction. The generalizability of our approach can be challenged. We, therefore, hope that more studies can be conducted to test the applicability of our approach and reach a firmer conclusion. Second, this study selects 12 variables for frozen ground modeling. Therefore, more temperature-influencing variables can be adopted in future research to optimize the GWR modeling and improve the accuracy of prediction. Third, our borehole points are highly concentrated on a line (i.e. the Qinghai-Tibet Railway), which may modestly contaminate temperature predictions. Last, this study uses the traditional GWR framework. Some enhanced versions of GWR models, such as semi-parametric GWR (differentiating local and global variables) and multiscale GWR (allowing the covariate-specific bandwidth) (Fotheringham et al 2017, Yu et al 2020, have been proposed in recent years. They may enhance the prediction accuracy and therefore can be tested in future research.

Data availability statement
The data generated and/or analyzed during the current study are not publicly available for legal/ethical reasons.