Comparison of Machine-Learning Algorithms for Near-Surface Air-Temperature Estimation from FY-4A AGRI Data

Six machine-learning approaches, namely multivariate linear regression (MLR), gradient boosting decision tree, k-nearest neighbors, random forest, extreme gradient boosting (XGB), and deep neural network (DNN), were compared for near-surface air-temperature (Tair) estimation from observations of the new-generation Chinese geostationary meteorological satellite Fengyun-4A (FY-4A). The brightness temperatures in the split-window channels of the Advanced Geostationary Radiation Imager (AGRI) aboard FY-4A and numerical weather prediction data from the Global Forecast System were used as the predictor variables for Tair estimation. The performance of each model and the temporal and spatial distribution of the estimated Tair errors were analyzed. The results showed that the XGB model had the best overall performance, with an R² of 0.902, a bias of −0.087°C, and a root-mean-square error of 1.946°C. The spatial variation of the Tair error was less pronounced for the XGB method than for the other methods. The XGB model can provide stable, high-precision Tair for large-scale Tair estimation over China and can serve as a reference for Tair estimation based on machine-learning models.


Introduction
Air temperature (Tair) is one of the basic meteorological observation parameters [1][2][3] and is of great concern in scientific disciplines such as hydrology, meteorology, and environmental science. It also influences most land-surface processes, such as photosynthesis and land-surface evapotranspiration [4]. High-resolution Tair data can help reduce human health risks and support urban heat island research, so high-resolution Tair information is crucial [5,6]. Summer Tair in China is generally above 20°C, except in high-altitude regions (e.g., the Qinghai-Tibet Plateau). Summer heat waves have a major impact on agricultural food production and on water and electricity use [7]. This study focuses on summer Tair estimation in China using Advanced Geostationary Radiation Imager (AGRI) data.
Large-scale Tair data are mainly obtained by interpolating data collected at surface meteorological stations.
However, meteorological stations are usually unevenly distributed because of geographical factors, and some sparsely populated areas have no meteorological observations at all [8].
Therefore, the accuracy of interpolated Tair data is limited, and high-spatial-resolution Tair information cannot be obtained this way [9].
Meteorological satellites in low Earth orbit (LEO) and geostationary Earth orbit (GEO) can provide continuous surface (e.g., land-surface temperature (LST)) and atmospheric observations with wide spatial coverage at global and regional scales [10][11][12]. Over the last several decades, with the development of meteorological satellite technology, LEO and GEO observations have gradually been applied to Tair estimation. LEO satellites can acquire data only once or twice a day for a given location, and cloud contamination further reduces the data available for Tair estimation [13][14][15]. In contrast, GEO meteorological satellites can continuously provide data every 15 or 30 min over one-third of the Earth's surface [16][17][18][19][20]. GEO satellites therefore offer an effective means of obtaining high-spatial- and high-temporal-resolution Tair data over a fixed area and can facilitate the study of diurnal Tair variation [20,21].
At present, methods for Tair estimation from satellite brightness temperatures (BTs) and land-surface temperature (LST) products can be divided into simple linear, multivariate linear, and nonlinear approaches [21,22]. Previous studies [7,23,24] have shown that machine-learning algorithms can estimate Tair more accurately than other methods. For example, a neural network (NN) model was more accurate than linear models, reducing the root-mean-square error (RMSE) by 1.29°C [7]. The AGRI aboard Fengyun-4A (FY-4A) has 14 spectral bands [18,20,25,26]: six visible/near-infrared (VIS/NIR), six infrared (IR), and two water vapor bands. It has a temporal resolution of 15 min for the full disk and a spatial resolution of 4 km in the IR bands, which provides an unprecedented opportunity to obtain high-precision Tair data over China and surrounding areas.
Several studies have used machine-learning methods to estimate Tair from moderate-resolution imaging spectroradiometer (MODIS) data [27][28][29]. However, there is currently a lack of studies on Tair estimation based on FY-4A. Using FY-4A data to estimate high-resolution Tair is of great significance for research on human health and on high-temporal- and high-spatial-resolution Tair in East Asia. In addition, timely and high-resolution Tair data are needed for the sustainable planning and management of climate-resilient cities [3].
This study develops machine-learning approaches for Tair estimation using FY-4A data and compares the performance of different machine-learning models [i.e., multivariate linear regression (MLR), gradient boosting decision tree (GBTD), k-nearest neighbors (KNN), random forest (RF), extreme gradient boosting (XGB), and deep neural network (DNN)] in Tair estimation, which, to the best of our knowledge, has not been done before. By comparing these algorithms, an algorithm well suited to Tair estimation is selected. The algorithm is widely applicable to meteorological satellites that do not provide land-surface temperature products. The remainder of this paper is organized as follows. Section 2 introduces the study area and the data used for model development and describes the construction of the six machine-learning models listed above for Tair estimation. Section 3 presents the variable importance analysis, validation results, and discussion. Section 4 presents the conclusions.

Study Area.
The study area is China, and Figure 1 shows the spatial distribution of the 1,812 meteorological stations used in this study. Elevation is generally higher in western China than in the east, and the Qinghai-Tibet Plateau has an average elevation of over 4,000 m [30]. Stations are denser in the east than in the west because of the uneven distribution of population and economic development in China (Figure 1).

Data.
The data used in this study mainly include FY-4A/AGRI brightness temperature (BT) and L2 cloud mask data, Global Forecast System (GFS) 3 h forecast data, meteorological data from 1,812 stations in China, and other auxiliary data (longitude, latitude, and Julian day).

Satellite Data.
FY-4A, a new-generation Chinese geostationary meteorological satellite, was launched on December 11, 2016, and was positioned at 99.5°E above the equator. Bands 12 and 13 of AGRI (BT12 and BT13, respectively) are thermal infrared split-window channels that are mainly used for studies of cloud, aerosol, and Tair estimation. Their central wavelengths are 10.8 and 12.0 μm [31]. BT12, BT13, and the L2 cloud mask products for summer 2018 (i.e., June, July, and August) were used. The AGRI data were selected at 3 h intervals (i.e., 00, 03, 06, 09, 12, 15, 18, and 21 UTC) each day. The data were downloaded from the China National Satellite Meteorological Center (http://satellite.nsmc.org.cn/PortalSite/Data/Satellite.aspx).

Meteorological Data.
This study used meteorological data at 3 h intervals from 1,812 observation stations in China during summer 2018. The station variables used in this study include Tair and station elevation from the digital elevation model (DEM). Tair in summer 2018 ranged from −5°C to 40°C, and station elevations ranged from 0 to 5,000 m. These data were obtained from the China Meteorological Data Service Center (CMDC) (http://data.cma.cn/).

Numerical Weather Prediction Data and Auxiliary Data.
Previous studies have shown that the relationship between BTs (or LST) and Tair is easily affected by surface characteristics and atmospheric conditions [7,31]. Therefore, the accuracy of Tair estimation has been effectively improved by adding auxiliary parameters [32]. In this study, GFS 3 h forecast fields of precipitable water vapor (GFS PWV) and relative humidity (GFS RH) were used. The GFS data (GFS PWV and GFS RH) had a forecast length of 3 h, with eight periods of data per day (i.e., 00, 03, 06, 09, 12, 15, 18, and 21 UTC). The GFS data were interpolated to the location and time of the AGRI pixels. GFS data were obtained from the U.S. National Oceanic and Atmospheric Administration (NOAA) National Centers for Environmental Prediction (http://www.nco.ncep.noaa.gov/pmb/products/gfs). Table 1 presents the temporal and spatial resolution of the data used in this study.
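The text states only that the GFS fields were interpolated to the location and time of the AGRI pixels; it does not name the scheme. The sketch below assumes bilinear interpolation in space on a regular latitude-longitude grid. The function name `bilinear_to_points` and the synthetic 0.25° grid are hypothetical and are not taken from the study.

```python
import numpy as np

def bilinear_to_points(field, lats, lons, pt_lat, pt_lon):
    """Bilinearly interpolate a gridded field of shape (n_lat, n_lon) to points.

    Assumes `lats` and `lons` are strictly increasing 1-D grid axes. Points
    outside the grid are clamped to the edge cells and extrapolated linearly.
    """
    pt_lat = np.asarray(pt_lat, dtype=float)
    pt_lon = np.asarray(pt_lon, dtype=float)
    # Row/column index of the lower-left corner of the cell containing each point
    i = np.clip(np.searchsorted(lats, pt_lat) - 1, 0, len(lats) - 2)
    j = np.clip(np.searchsorted(lons, pt_lon) - 1, 0, len(lons) - 2)
    # Fractional position of each point inside its cell (0 to 1)
    wy = (pt_lat - lats[i]) / (lats[i + 1] - lats[i])
    wx = (pt_lon - lons[j]) / (lons[j + 1] - lons[j])
    return ((1 - wy) * (1 - wx) * field[i, j] + (1 - wy) * wx * field[i, j + 1]
            + wy * (1 - wx) * field[i + 1, j] + wy * wx * field[i + 1, j + 1])

# Usage with a synthetic 0.25-degree PWV-like field (values are made up)
lats = np.arange(20.0, 25.01, 0.25)
lons = np.arange(100.0, 105.01, 0.25)
pwv_grid = 2.0 * lats[:, None] + 3.0 * lons[None, :]
station_pwv = bilinear_to_points(pwv_grid, lats, lons, [21.1, 23.6], [101.3, 104.9])
print(station_pwv)
```

Because GFS and AGRI share the same 3 h time steps here, only the spatial step is shown. If the time stamps differed, a linear weighting between the two bracketing forecast times would be a natural extension.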

Preparation of Training Dataset.
BT12, BT13, GFS PWV, GFS RH, and the auxiliary data were used as the input variables, and Tair was used as the response variable of the machine-learning models (Table 1). All data points (across space and time) were included in a single model (e.g., the XGB model) [33]. Constructing a representative training dataset is crucial for developing successful retrieval models with machine learning.
Thus, data from June to August, except for the 1st, 10th, 20th, and 30th of each month, were collected as the original dataset. This dataset was randomly divided into a training dataset (80%, 971,773 samples) and a test dataset (20%, 242,944 samples), keeping the same proportion of samples in each 1.0°C temperature bin, as shown in Figure 2. For validation, the data from the 1st, 10th, 20th, and 30th of June to August, which were not used for training, were selected.
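The splitting scheme can be sketched as follows. This sketch reads the binning rule as keeping the same 80/20 proportion within every 1.0°C bin, which is an interpretation of the text; the function name `split_by_temperature_bins` and the synthetic data are hypothetical.

```python
import numpy as np

def split_by_temperature_bins(tair, day_of_month, train_frac=0.8, bin_width=1.0,
                              holdout_days=(1, 10, 20, 30), seed=0):
    """Return (train_idx, test_idx, valid_idx) index arrays.

    Samples on the hold-out days form the independent validation set. The
    remaining samples are split train/test separately inside each Tair bin,
    so every bin keeps roughly the same train/test proportion.
    """
    rng = np.random.default_rng(seed)
    tair = np.asarray(tair, dtype=float)
    day_of_month = np.asarray(day_of_month)
    is_holdout = np.isin(day_of_month, holdout_days)
    valid_idx = np.flatnonzero(is_holdout)
    pool = np.flatnonzero(~is_holdout)
    bins = np.floor(tair[pool] / bin_width).astype(int)
    train_parts, test_parts = [], []
    for b in np.unique(bins):
        members = rng.permutation(pool[bins == b])  # shuffle within the bin
        n_train = int(round(train_frac * len(members)))
        train_parts.append(members[:n_train])
        test_parts.append(members[n_train:])
    return np.concatenate(train_parts), np.concatenate(test_parts), valid_idx

# Synthetic example: 5,000 samples with random Tair and day of month
rng = np.random.default_rng(1)
tair = rng.uniform(-5.0, 40.0, 5000)
day = rng.integers(1, 32, 5000)
train_idx, test_idx, valid_idx = split_by_temperature_bins(tair, day)
```

Splitting inside each bin keeps rare very cold or very hot samples from landing entirely in one subset.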

Machine-Learning Algorithm.
Machine-learning methods have been widely used for classification and regression in remote sensing [34][35][36][37][38][39][40][41]. In this study, six machine-learning approaches (MLR, GBTD, KNN, RF, XGB, and DNN) were used to construct Tair estimation models. The flowchart of Tair estimation based on machine-learning approaches is shown in Figure 3. The L2 cloud mask products were used to detect clouds. For cloud-free pixels, the FY-4A data were matched with the GFS data and the meteorological station data in space and time, and Tair was then estimated with the machine-learning models.
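The screening-and-matching step in the flowchart can be sketched in plain Python. The record layout is hypothetical: dictionaries keyed by station identifier and UTC time, with GFS values already interpolated to the pixel. The paper does not describe its data structures, and all names and values below are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    station: str
    time: str        # UTC time stamp, e.g. "2018-06-02T03"
    bt12: float      # AGRI band 12 brightness temperature (K)
    bt13: float      # AGRI band 13 brightness temperature (K)
    pwv: float       # GFS precipitable water vapor interpolated to the pixel
    rh: float        # GFS relative humidity interpolated to the pixel
    tair_obs: float  # observed station Tair (degrees C)

def build_matched_samples(agri, cloud_mask, gfs, stations):
    """Keep only clear-sky AGRI records that have GFS and station data at the
    same (station_id, time) key.

    agri:       {(station_id, time): (BT12, BT13)}
    cloud_mask: {(station_id, time): True if the pixel is cloudy}
    gfs:        {(station_id, time): (PWV, RH)}
    stations:   {(station_id, time): observed Tair}
    """
    samples = []
    for key, (bt12, bt13) in agri.items():
        if cloud_mask.get(key, True):
            continue  # cloudy, or no cloud-mask value: treat as unusable
        if key not in gfs or key not in stations:
            continue  # no match in space and time
        pwv, rh = gfs[key]
        station_id, time = key
        samples.append(Sample(station_id, time, bt12, bt13, pwv, rh, stations[key]))
    return samples

# Usage with made-up records: one clear match, one cloudy, one without GFS data
ok = ("S001", "2018-06-02T03")
cloudy = ("S001", "2018-06-02T06")
no_gfs = ("S002", "2018-06-02T03")
samples = build_matched_samples(
    agri={ok: (295.1, 293.4), cloudy: (260.0, 258.7), no_gfs: (300.2, 298.0)},
    cloud_mask={ok: False, cloudy: True, no_gfs: False},
    gfs={ok: (35.2, 61.0), cloudy: (40.1, 90.0)},
    stations={ok: 24.3, cloudy: 19.8, no_gfs: 27.5},
)
```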
As a simple machine-learning algorithm, MLR has long been a basic tool for estimating meteorological parameters [42,43]. KNN, by contrast, is a local nonlinear algorithm whose prediction process generally has two steps. First, to predict a point, the KNN algorithm searches the training dataset for the k samples closest to that point. Second, it computes the mean of the target variable over these k nearest neighbors [44,45]. In this study, the hyperparameters of MLR and KNN were set to their default values. Unlike MLR and KNN, RF is an ensemble of decision trees that improves prediction accuracy; each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [34,43,46-50].
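The two-step KNN prediction and the MLR fit can each be written in a few lines of NumPy. This is a brute-force sketch for small arrays, not the scikit-learn implementation used in the study. Choosing k = 5 with uniform averaging mirrors the default of scikit-learn's `KNeighborsRegressor`. In practice, the predictors would need to be standardized so that no single variable (for example, DEM in meters) dominates the Euclidean distance. Function names are hypothetical.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """KNN regression: step 1 finds the k nearest training samples (Euclidean
    distance); step 2 averages their target values."""
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def mlr_fit(X, y):
    """Ordinary least squares with an intercept column; returns [b0, b1, ...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def mlr_predict(coef, X):
    return coef[0] + X @ coef[1:]

# Usage on tiny synthetic data
X_train = np.array([[0.0], [1.0], [2.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 10.0])
knn_pred = knn_predict(X_train, y_train, np.array([[1.1]]), k=3)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
coef = mlr_fit(X, y)
```

The distance matrix grows as (queries × training samples), so a tree-based neighbor search would be needed for datasets of the size used here.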
GridSearchCV from the Python Scikit-learn library was used for hyperparameter tuning, screening hyperparameters including the number of […]. The principle of GBTD is to sequentially apply a classification algorithm to weighted versions of the training data [51,52], descending along the gradient direction of the loss function of the previously established model, and then to perform a weighted majority vote over the resulting sequence of classifiers. As an improved version of GBTD, XGB uses all data in each iteration, similar to RF [53,54]. XGB also reduces model complexity and makes the learned model simpler [35,54-58]. In this study, four hyperparameters of the GBTD and XGB models were tuned empirically based on RMSE: n_estimators, max_depth, learning_rate (lr), and the minimum loss reduction required to make a further partition on a leaf node of the tree (gamma).
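The gradient-descent idea behind GBTD can be illustrated with a from-scratch, squared-error boosting loop. This is a simplified sketch, not the scikit-learn or XGBoost implementation. It uses depth-1 trees (stumps) and omits regularization and the `gamma` split penalty. Only `n_estimators` and `learning_rate` correspond to hyperparameters named in the text, and all function names are hypothetical.

```python
import numpy as np

def fit_stump(X, r):
    """Find the single-feature threshold split that minimizes the squared error
    of the residuals r; each side predicts the mean residual on that side."""
    best = None
    n = len(r)
    for f in range(X.shape[1]):
        order = np.argsort(X[:, f])
        xs, rs = X[order, f], r[order]
        csum, csq = np.cumsum(rs), np.cumsum(rs ** 2)
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue  # no threshold separates identical values
            left_sum, right_sum = csum[i - 1], csum[-1] - csum[i - 1]
            # SSE(left) + SSE(right) = total sum of squares - n_L*mean_L^2 - n_R*mean_R^2
            sse = csq[-1] - left_sum ** 2 / i - right_sum ** 2 / (n - i)
            if best is None or sse < best[0]:
                threshold = 0.5 * (xs[i - 1] + xs[i])
                best = (sse, f, threshold, left_sum / i, right_sum / (n - i))
    if best is None:
        raise ValueError("all features are constant; no split possible")
    return best[1:]  # (feature, threshold, left_value, right_value)

def stump_predict(stump, X):
    f, threshold, left_value, right_value = stump
    return np.where(X[:, f] <= threshold, left_value, right_value)

def gradient_boost(X, y, n_estimators=100, learning_rate=0.1):
    """For the loss 0.5 * (y - F)**2, the negative gradient with respect to F is
    the residual y - F, so each new stump is fit to the current residuals."""
    base = float(np.mean(y))
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_estimators):
        stump = fit_stump(X, y - pred)
        pred = pred + learning_rate * stump_predict(stump, X)
        stumps.append(stump)
    return base, learning_rate, stumps

def boost_predict(model, X):
    base, learning_rate, stumps = model
    pred = np.full(len(X), base)
    for stump in stumps:
        pred = pred + learning_rate * stump_predict(stump, X)
    return pred

# Usage: fit a smooth nonlinear curve
X = np.linspace(0.0, 1.0, 200)[:, None]
y = np.sin(2.0 * np.pi * X[:, 0])
model = gradient_boost(X, y)
train_mse = float(np.mean((boost_predict(model, X) - y) ** 2))
initial_mse = float(np.mean((y - y.mean()) ** 2))
```

Each stump moves the prediction a step of size `learning_rate` along the negative gradient. A smaller learning rate generally needs more estimators, which is why the two are usually tuned together.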
An artificial neural network (ANN) is a biologically inspired machine-learning method [59]. The DNN, a type of ANN with multiple hidden layers, uses a fully connected structure and can learn temporal and spatial relationships [60,61]. It adjusts the connection weights through back-propagation and iteratively minimizes the prediction error [62][63][64]. DNN configurations with one to five hidden layers and 5-200 neurons per layer (in intervals of five) were tested. In addition, several widely used optimizers (stochastic gradient descent, RMSProp, and Adam) were compared. The final DNN hyperparameters were as follows: batch_size, 128; dropout_rate, 0.1; stop_steps, 20 (training was terminated if the validation loss did not improve within 20 iterations); and learning rate, 0.001. Adam was chosen as the optimizer, with three hidden layers of 256 neurons.
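The `stop_steps` rule is a patience-based early-stopping criterion. Below is a minimal, framework-independent sketch. It assumes the patience is counted in epochs, and the function name is hypothetical.

```python
def early_stopping_epoch(val_losses, patience=20):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation
    losses. Training stops at the first epoch at which `patience` consecutive
    epochs have passed without improving on the best loss so far."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    # The loss sequence ended before the patience ran out
    return len(val_losses) - 1, best_epoch

# Loss improves for 11 epochs, then plateaus
losses = [10.0 - i for i in range(11)] + [5.0] * 30
stop_epoch, best_epoch = early_stopping_epoch(losses)
```

In Keras, a comparable rule is available as `tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=20)`. The source does not state which framework was used.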

Error Analyses.
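Four statistical measures, namely the coefficient of determination (R²), RMSE, mean squared error (MSE), and mean bias (bias), were used to evaluate the accuracy of the Tair estimation models as follows:

$$\mathrm{bias}=\frac{1}{N}\sum_{i=1}^{N}\left(T_{ea,i}-T_{oa,i}\right)$$

$$\mathrm{MSE}=\frac{1}{N}\sum_{i=1}^{N}\left(T_{ea,i}-T_{oa,i}\right)^{2}$$

$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(T_{ea,i}-T_{oa,i}\right)^{2}}$$

$$R^{2}=1-\frac{\sum_{i=1}^{N}\left(T_{ea,i}-T_{oa,i}\right)^{2}}{\sum_{i=1}^{N}\left(T_{oa,i}-\overline{T}_{oa}\right)^{2}}$$

where $T_{ea}$ is the estimated Tair, $T_{oa}$ is the observed Tair at the meteorological stations, $\overline{T}_{oa}$ is the mean observed Tair, and N is the sample size.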
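The four statistics can be computed directly from paired estimates and observations. The sketch below is a minimal NumPy version. It assumes the error is defined as estimated minus observed and that R² is one minus the ratio of the residual sum of squares to the total sum of squares. The function name is hypothetical.

```python
import numpy as np

def evaluation_metrics(t_est, t_obs):
    """Return bias, MSE, RMSE, and R^2 of estimated vs. observed Tair."""
    t_est = np.asarray(t_est, dtype=float)
    t_obs = np.asarray(t_obs, dtype=float)
    err = t_est - t_obs                      # estimated minus observed
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((t_obs - t_obs.mean()) ** 2)
    return {
        "bias": float(np.mean(err)),
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),
        "R2": float(1.0 - ss_res / ss_tot),
    }

# Worked example: errors are [1, 0, -1, 1]
metrics = evaluation_metrics([21.0, 22.0, 23.0, 27.0], [20.0, 22.0, 24.0, 26.0])
```

Here the bias is 0.25°C, the MSE is 0.75, and R² is 1 − 3/20 = 0.85.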

Results and Discussion
This section presents the variable importance results and evaluates the performance of the six machine-learning models. The spatial distribution of the Tair errors of each model is also analyzed.

Variable Importance Results.
Correlation analysis was performed to examine the linear relationship between Tair and BT12, BT13, GFS PWV, GFS RH, DEM, longitude (LONG), latitude (LAT), and Julian day (JD). Table 2 shows the correlation coefficient matrix of these variables. As shown in Figure 4(a), GFS PWV, DEM, BT12, and BT13 were more strongly correlated with Tair than the other variables, with R values of 0.635, −0.596, 0.459, and 0.413, respectively.
This indicated that these variables played more important roles in linear Tair estimation models. However, the Pearson correlation coefficient describes only the linear correlation between two variables and cannot identify nonlinear relationships. Therefore, the variable importance of the RF algorithm, which models nonlinear relationships well, was also analyzed (Figure 4(b)). GFS PWV was identified as the most important variable for Tair estimation in the RF model, and GFS RH and BT12 were also more important than the other predictors. PWV and RH were therefore used as inputs to improve the accuracy of Tair estimation, which is consistent with a previous study [65].
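The limitation of the Pearson coefficient is easy to demonstrate. A variable can completely determine the target yet have a correlation near zero if the dependence is nonlinear. The short NumPy illustration below uses synthetic data, not the study's variables.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)
y_linear = 2.0 * x + 1.0      # exact linear dependence
y_quadratic = x ** 2          # exact but purely nonlinear dependence

# Correlation matrix of the three columns, analogous in form to Table 2
corr = np.corrcoef(np.column_stack([x, y_linear, y_quadratic]), rowvar=False)
r_linear = corr[0, 1]
r_quadratic = corr[0, 2]
```

`r_linear` is 1, while `r_quadratic` is essentially 0 even though `y_quadratic` is a deterministic function of `x`. Tree-based importance measures, such as the RF importance in Figure 4(b), do not rely on linearity and can capture this kind of dependence.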

Model Performance Results.
To evaluate the overall performance of each model, 10-fold cross-validation was used; K-fold cross-validation was also used to select the model configuration. With K = 10, the dataset was randomly divided into K groups (folds) of equal size. In each round, one fold was held out for testing, and the remaining K − 1 folds were used for training. Over the K rounds, each fold served once as the test fold, and model performance was computed on it [35]. The average of the K validation results was then used to evaluate the overall performance of each model. Figure 5 compares the six models in terms of RMSE, bias, MSE, and R².
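The K-fold procedure can be sketched with NumPy index arithmetic. scikit-learn's `KFold(n_splits=10, shuffle=True)` produces the same kind of splits. The function below is a hypothetical stand-in that makes the fold logic explicit.

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) for each of the k rounds after one random
    shuffle; each fold is used exactly once as the test fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Usage: 103 samples split into 10 folds (sizes 11 or 10)
folds = list(kfold_indices(103, k=10))
```

The model would be refit on each `train_idx`, scored on each `test_idx`, and the ten scores averaged.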
The MLR model performed worst of the six models. Its RMSE, bias, MSE, and R² varied widely; the RMSE alone ranged from 1.602°C to 4.487°C. In contrast, the DNN model had better overall performance and higher efficiency than the other five models and showed the highest accuracy, with an average RMSE of 1.736°C. The RMSE of the DNN model ranged from 0.852°C to 2.584°C, showing good concentration and stability, as presented in Figure 5(a). Of the remaining models, XGB and GBTD performed comparably, and both outperformed the MLR, KNN, and RF models.
Advances in Meteorology

Validation Results.
Model performance on the test data served as an internal validation of each model, but model accuracy must also be evaluated with a dataset that was not used for training or testing. To validate the developed MLR, RF, KNN, GBTD, XGB, and DNN models, observations not used in either training or testing were used (the validation dataset in Section 2.3.1). Figure 6 shows the quantitative validation results for the estimated Tair during the validation period (the 1st, 10th, 20th, and 30th of June-August 2018). Compared with the test dataset results, all six models were less accurate on the validation dataset. For example, the RMSE of the DNN model was 1.736°C on the test dataset but 2.006°C on the validation dataset. This difference may be caused by overfitting, because the best model was not selected based on the final validation results [35]. The biases of the MLR, RF, DNN, GBTD, and XGB models were within ±0.2°C, indicating no obvious overestimation or underestimation. In contrast, the KNN model showed a larger negative bias of −0.492°C, possibly because of its poor robustness. The robustness of KNN depends strongly on the dataset, and poor robustness makes a model difficult to apply directly to other cases; as a result, the KNN model had a low bias on the test dataset but a high bias on the validation dataset.
The XGB model had the best modeling performance, with an R² of 0.902. The R² values of the GBTD and DNN models were 0.898 and 0.890, respectively, and those of the remaining three models were below 0.89. Moreover, unlike the other models, the XGB and GBTD models repeatedly learn weak learners and combine them as a weighted average. As a result, the XGB and GBTD models performed relatively better on the validation dataset at most sites. Overall, the XGB model performed better than the other five models on the validation dataset.

Tair estimation models based on satellite and numerical forecast data are susceptible to factors such as altitude and surface roughness. To further evaluate the applicability of these models, the errors at each meteorological station were mapped (Figures 7-9). The Tair estimation errors of all models showed clear spatial patterns (Figure 7). In general, the RMSE of each model was relatively low in the eastern regions (e.g., Guangdong Province) and high in the northwestern regions (e.g., the Xinjiang Uygur Autonomous Region). For example, the RMSE of the XGB model was approximately 1.2°C-1.8°C in Guangdong Province and about 2.0°C-3.2°C in Xinjiang. Because northwestern China has large day-night Tair variations, high altitude, and few meteorological observations, the accuracy difference between northwestern and eastern China is pronounced. The RMSE of the KNN, DNN, GBTD, and XGB models was relatively low in the eastern and southern regions, whereas the MLR, RF, KNN, and DNN models had higher RMSE in northwestern China. In contrast, the GBTD and XGB models had relatively small RMSE in northwestern China, because their repeated learning from large amounts of data, combined through weighted averaging, allows them to adapt to different regions.
Furthermore, Gong (2015) [66] reported that the RMSE of GFS Tair reaches 1.5°C-3.0°C in most eastern regions and exceeds 3.5°C in the northwestern regions. By comparison, the RMSE of Tair estimated by the DNN, XGB, and GBTD models in this study was clearly lower than that of the GFS data. The RMSE of the XGB model was 1.0°C-2.0°C in most eastern regions and below 3.5°C in the northwestern regions. In addition, RMSE values below 2.0°C accounted for 48.2%, and those below 2.5°C for 87.6%, in the XGB model. All six models showed the same spatial pattern of R² (Figure 8): R² was higher in the eastern regions and gradually decreased toward the southwestern regions. The viewing zenith angle (VZA) of AGRI is larger over western China than over the central regions (e.g., Henan Province). The larger the VZA, the more strongly the radiation reaching the sensor is affected by the atmosphere, which may cause the difference in R² of the estimated Tair between the southwestern and central regions.
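Per-station RMSE maps and the quoted shares can be derived by grouping the errors by station. The minimal NumPy sketch below assumes the percentages refer to the fraction of stations whose RMSE falls below each threshold. Function names and data are hypothetical.

```python
import numpy as np

def station_rmse(station_ids, t_est, t_obs):
    """Return {station_id: RMSE} computed from all samples of each station."""
    ids = np.asarray(station_ids)
    err = np.asarray(t_est, dtype=float) - np.asarray(t_obs, dtype=float)
    uniq, inverse = np.unique(ids, return_inverse=True)
    sq_sum = np.bincount(inverse, weights=err ** 2)   # sum of squared errors per station
    counts = np.bincount(inverse)                     # samples per station
    return dict(zip(uniq.tolist(), np.sqrt(sq_sum / counts).tolist()))

def share_below(rmse_by_station, threshold):
    """Fraction of stations whose RMSE is below `threshold`."""
    values = np.array(list(rmse_by_station.values()))
    return float(np.mean(values < threshold))

# Usage with two made-up stations: errors [1, 3] and [0, -2]
rmse_map = station_rmse(["A", "A", "B", "B"], [1.0, 3.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0])
```

Station A has an RMSE of √5 ≈ 2.24°C and station B has √2 ≈ 1.41°C, so half of the stations fall below 2.0°C and all fall below 2.5°C.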
For the MLR model, the bias was large across China. For the RF and KNN models, a relatively large negative bias existed in southwestern China (e.g., the Yunnan-Guizhou Plateau), as shown in Figure 9. This may be because these three models have relatively simple structures that cannot simulate the complex Tair variations in China well, resulting in underfitting. In addition, the DNN model overestimated Tair in northwestern China, which explains its high RMSE in these regions. In contrast, the GBTD and XGB models had relatively low bias in northwestern China, where the absolute bias ranges from 2.0°C to 3.0°C. In summary, the bias is lower in coastal areas and higher in northwestern areas, which is mainly related to the characteristics of summer Tair variation. Figure 10 shows the time series of RMSE for the six models during the validation period.
The RMSE of the MLR model was significantly higher than that of the other models, ranging from 2.5°C to 4.3°C. In contrast, the GBTD and XGB models had a lower RMSE (1.8°C-2.2°C) than the RF, KNN, and DNN models.
Based on the above analysis, the XGB model is expected to provide more reliable and accurate Tair estimates than the other models. To evaluate the contribution of the predictors in the XGB model to Tair estimation, the BT data (BT12 and BT13) and GFS data (GFS PWV and RH) were introduced successively (Table 3). As shown in Table 3, when DEM, longitude, latitude, and Julian day were used as input variables, the RMSE of the XGB model was 3.003°C. The accuracy of Tair estimation improved markedly when BT12 and BT13 were included in the model. When GFS PWV and RH were added to the input variables, the RMSE of the XGB model decreased to 2.164°C, indicating the important influence of GFS PWV and RH on Tair estimation. These results are understandable because PWV and RH are the main parameters needed for atmospheric correction and LST retrieval. When both AGRI BTs and GFS data were introduced, the RMSE of the XGB model was 0.228°C lower than when only GFS data were introduced. This indicates that both GFS data and satellite observations play an important role in improving the Tair estimation model. The RMSE of the Tair estimation model was below 2.0°C when both satellite BTs and GFS data were introduced, which is considered the "accurate" precision level [67].
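The stepwise experiment behind Table 3 can be expressed as a small loop that adds one predictor group at a time and records the test RMSE. In this sketch the model is a plain least-squares fit standing in for XGB. This is an assumption made only to keep the example dependency-free; any fit/predict pair could be passed instead. Names and data are hypothetical.

```python
import numpy as np

def incremental_rmse(columns, y, groups, fit, predict, train_idx, test_idx):
    """Add predictor groups one after another; return (predictors, test RMSE)
    for each step.

    columns: dict mapping predictor name -> 1-D array
    groups:  list of name lists, e.g. [["DEM", "LON", "LAT", "JD"],
             ["BT12", "BT13"], ["PWV", "RH"]]
    """
    used, results = [], []
    for group in groups:
        used = used + list(group)
        X = np.column_stack([columns[name] for name in used])
        model = fit(X[train_idx], y[train_idx])
        err = predict(model, X[test_idx]) - y[test_idx]
        results.append((tuple(used), float(np.sqrt(np.mean(err ** 2)))))
    return results

def lstsq_fit(X, y):
    """Stand-in model: ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def lstsq_predict(coef, X):
    return coef[0] + X @ coef[1:]

# Synthetic data in which the target depends on two predictors
rng = np.random.default_rng(0)
cols = {"c1": rng.normal(size=300), "c2": rng.normal(size=300)}
target = 3.0 * cols["c1"] + 2.0 * cols["c2"]
idx = rng.permutation(300)
steps = incremental_rmse(cols, target, [["c1"], ["c2"]], lstsq_fit, lstsq_predict,
                         idx[:240], idx[240:])
```

The RMSE drops to near zero once `c2` is added, which mirrors how the addition of BTs and GFS fields is read off Table 3.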
The relationship of the XGB model errors with altitude, observed Tair, and VZA was also analyzed. Figure 11 shows scatter plots of the estimated Tair error against DEM, Tair, and VZA. The Tair error mainly ranges from −3°C to 3°C. The model showed a positive bias in high-altitude areas, which produced a larger RMSE there than in low-altitude areas. The model also showed a positive bias when Tair was low and a negative bias when Tair was high. Therefore, the RMSE was larger under low- and high-temperature conditions, owing to overestimation and underestimation, respectively. This is similar to the results of previous studies [38]. Furthermore, the uneven distribution of stations limits the applicability of the model in high-altitude areas. Notably, the effect of VZA on model performance is negligible, as shown in Figure 11(c).
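Error-versus-elevation relationships like those in Figure 11 are often summarized by binning. The minimal sketch below reports the sample count, mean bias, and RMSE of the error in each elevation (or temperature, or VZA) bin. The bin edges and data are hypothetical.

```python
import numpy as np

def binned_error_stats(values, err, edges):
    """Return (lower, upper, count, mean bias, RMSE) of `err` for each bin
    [edges[i], edges[i+1]) of `values`; values outside the edges are ignored."""
    values = np.asarray(values, dtype=float)
    err = np.asarray(err, dtype=float)
    idx = np.digitize(values, edges) - 1   # bin index i covers [edges[i], edges[i+1])
    stats = []
    for i in range(len(edges) - 1):
        e = err[idx == i]
        if e.size == 0:
            stats.append((edges[i], edges[i + 1], 0, np.nan, np.nan))
            continue
        stats.append((edges[i], edges[i + 1], e.size,
                      float(e.mean()), float(np.sqrt(np.mean(e ** 2)))))
    return stats

# Usage: two low-elevation and two high-elevation stations (made-up errors)
stats = binned_error_stats([100.0, 200.0, 1500.0, 1600.0],
                           [1.0, -1.0, 2.0, 2.0], [0.0, 1000.0, 2000.0])
```

In the example, the low bin has zero mean bias but an RMSE of 1°C, while the high bin has a 2°C positive bias. Reporting both quantities separates systematic bias from scatter.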

Conclusions
In this study, six machine-learning approaches (MLR, RF, KNN, DNN, GBTD, and XGB) for Tair estimation from FY-4A AGRI data in China were compared, and the spatial and temporal characteristics of their performance were analyzed.
The validation results highlighted the high potential of machine-learning approaches for Tair estimation and showed that the XGB model was more accurate than the MLR, RF, KNN, GBTD, and DNN models at most sites over China. Because the validation was performed using spatially and temporally independent data, the model performance is considered reliable.
This study improves on previous studies in the following key areas. First, Tair estimation models were constructed based on FY-4A AGRI data and other auxiliary data. The results showed that Tair with high temporal and spatial resolution (RMSE < 2.0°C) can be obtained from FY-4A data. According to Vazquez [67], the precision generally accepted as "accurate" for remote-sensing-based Tair estimation is between 1°C and 2°C. Second, the accuracy and performance of six machine-learning models (MLR, RF, KNN, XGB, GBTD, and DNN) were compared and analyzed. The results showed that the XGB model can provide more stable, high-precision Tair estimates, which offers a reference for Tair estimation based on machine-learning models. Finally, the accuracy of satellite-based Tair estimation can be effectively improved by adding numerical model data. When only satellite data were used for large-scale Tair estimation in China, the RMSE of the XGB model was 2.376°C, whereas combining satellite data with numerical model data reduced the RMSE to 1.946°C. Despite these advances, the dataset used in this study is restricted to clear-sky conditions. In addition, machine-learning algorithms cannot extrapolate beyond the range of observed Tair values; if Tair exceeds the range observed in the current training period, the model must be retrained. Future research may explore whether adding other predictors, such as distance to the coast and vegetation information (e.g., the normalized difference vegetation index), can improve the accuracy of Tair estimation models.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.