Assessment of Four Near‐Surface Soil Freeze/Thaw Detection Algorithms Based on Calibrated Passive Microwave Remote Sensing Data Over China

The near‐surface soil freeze/thaw (F/T) cycle affects the surface energy balance, hydrological processes, and soil greenhouse gas release. Passive microwave remote sensing data are the most widely used method for determining the near‐surface soil F/T status. While many algorithms have been developed for this purpose, their performance has never been compared using same and large in situ data set. Here, we evaluate and inter‐compare the classification results of the four most widely used algorithms using a large ground truth data set covering China. Based on the ground observations, our evaluation shows a wide range of near‐surface soil F/T detection performance, with Cohen's kappa coefficient ranging from 0.42 to 0.72 and an overall accuracy between 73.8% and 86.2%. We suggest performing parameter calibration for the decision tree algorithm and the discriminant function algorithm before applying them to areas outside of the training sites. All these four algorithms exhibited remarkable uncertainty in the detection of the onset and offset of near‐surface soil freezing with root mean squared errors of more than 27 days. These results suggest that careful cautions should be taken when outputs of these algorithms are used for investigations of long‐term changes.


Introduction
The phase transition between water and ice occurs in about 60% of near-surface soils globally (Kim et al., 2011). The near-surface soil freeze/thaw (F/T) cycle is characterized as a toggle switch for land surface processes because of its strong influence on surface energy balance , hydrological processes (Guo et al., 2011), vegetation dynamics (Michelsen et al., 2002), carbon dioxide emissions (Schimel & Clein, 1996), and ecosystem function (Freppaz et al., 2007). The seasonally frozen ground is spatially and temporally variable (Kim et al., 2011(Kim et al., , 2012. During the past few decades, many studies have shown that rising near-surface air temperature (T a ) has contributed to earlier onset of near-surface soil thaw and has decreased the duration of the frozen period (Kim et al., 2012;McDonald et al., 2004;Peng et al., 2016;Wang et al., 2015).
There are a number of methods used to assess whether near-surface soil is frozen or thawed. These can be classified as (i) traditional methods, which include interpolation of geological maps, in situ measurements (e.g., Peng et al., 2016), and extrapolation based on terrain features known to be closely associated with permafrost or seasonally frozen ground, such as rock glaciers (e.g., Boeckli et al., 2012); (ii) geophysics, most notably ground penetrating radar (e.g., Ma et al., 2015); (iii) numerical modeling (e.g., Freppaz et al., 2007;Guo et al., 2018); and (iv) remote sensing (e.g., Bartsch et al., 2007;Kim et al., 2017;Leverington & Duguay, 1996). While the traditional and geophysical methods can describe the near-surface soil F/T cycle well at the site scale, they are inadequate for larger-scale studies and trend detection, especially in mountains and in remote areas where observations are sparse. Numerical modeling is challenging due to the limited availability of reliable forcing data, such as snow distribution and soil conditions. Over the past several decades, satellite remote sensing techniques, validated by ground truth data, have been used successfully to investigate the near-surface soil F/T cycle from local to global scales and have shown numerous advantages over other methods. Passive microwave remote sensing has emerged as one of the most widely used systems for near-surface soil F/T cycle investigations due, in part, to its temporal continuity (from 1978 to present), global coverage, and short revisit period (typically twice daily). Passive microwave remote sensing detects the electromagnetic radiation spontaneously emitted by matter based on its temperature, composition, and physical structure (Zhang et al., 2004). The discrimination between frozen and thawed soil 10.1029/2019EA000807 is made possible by the strong contrast in electromagnetic properties between liquid and solid (ice) phase of water which is typically present in near-surface soil. Algorithms that use passive microwave remote sensing data for near-surface soil F/T detection can be broadly grouped into several categories: (i) dual-index algorithms and their variants, which exploit both the spectral gradient and the relationship between brightness temperature (T b ) and air temperature (T a ) or ground surface temperature (T g ) (Jin et al., 2015;Kim et al., 2011Kim et al., , 2017; (ii) temporal change detection algorithms, which use the differences in T b between different frequencies (Chen et al., 2019;Smith et al., 2004); (iii) decision tree algorithms, which use the characteristics of the volume scattering effect and the lower T b of frozen ground (Jin et al., 2009); and (iv) model-based discriminant algorithms, which train discriminant models based on observations or use combined models that consider different surface conditions, such as soil moisture and snow cover Zhao et al., 2011Zhao et al., , 2017. These methods are mainly based on the signatures associated with frozen soil, including lower physical (or thermal) temperature, higher surface emissivity, larger optical thickness, and lower T b (Jin et al., 2015;Zhang et al., 2004).
Generally, these algorithms have been evaluated to some extent. However, some of the evaluations are limited to small data sets that included only a few sites or short time series. This may hinder the application of some of these algorithms to novel study areas outside of the test sites. Despite increasing efforts to investigate near-surface soil F/T changes, the detection algorithms based on passive microwave remote sensing have not been inter-compared using unique and large validation data sets. This is an essential step in describing the state-of-the-art for near-surface soil F/T detection as well as for identifying any possible biases. It is also critical for identifying priorities when updating these methods or the satellite and sensor designs in the future. Additionally, even though sensor architectures may be consistent as one instrument replaces its predecessor, the calibration of individual instruments is not identical (Sapiano et al., 2013). Therefore, inter-calibration among historic data sets is important for deriving the near-surface soil F/T cycle as the differences between the instruments could introduce uncertainty or inconsistency in the raw data and derived results. However, these differences are often either ignored (e.g., Jin et al., 2015) or calibrated using simple techniques such as linear regression based on periods of overlapping data (e.g., Kim et al., 2012Kim et al., , 2017. The objectives of this study are to (1) evaluate and inter-compare four widely used near-surface soil F/T detection algorithms and (2) discuss their bias and potential uncertainties.

Calibrated Enhanced Resolution Brightness Temperature
In this study, we used the Calibrated Enhanced Resolution Brightness Temperature (CETB) data product obtained from the National Snow & Ice Data Center (NSIDC). The CETB was recently developed by the National Aeronautics and Space Administration (NASA) using the most mature available Level-2 records from three passive microwave sensors on multiple platforms between 1978 and 2017 ( Figure 1) (Brodzik & Long, 2018). More specifically, the CETB comprises historic data from the Special Sensor Microwave Imager/Sounder (SSMIS), Special Sensor Microwave/Imager (SSM/I), and Scanning Multichannel Microwave Radiometer (SMMR) instruments which have been reprocessed to a higher standard. Compared to the original data sets, the CETB includes improved cross-sensor calibration and quality checking, enhanced resolution (up to 3.125 km), better quality control, improved projection grids, and local time-of-day processing. The newest version of the CETB data (v1.3) used here is physically consistent based on multiple independent inter-calibration techniques (Brodzik & Long, 2018;Sapiano et al., 2013). The CETB can therefore provide suitable T b long-term data records for studying the near-surface soil F/T cycles.
Although spatial resolutions of up to 3.125 km are available for the CETB, only data with 25 km resolution were used here in order to make this contribution more comparable with previous studies. The CETB provides continuous global T b measurements twice per day, corresponding to the ascending and descending tracks separated by the overpass time of the radiometers.
The effects of atmospheric conditions (e.g., aerosols and water vapor) and clouds on microwave remote sensing data quality were assumed to be minimal, and hence, no correction for atmospheric effects was applied. To obtain a continuous spatiotemporal record of T b , any missing values caused by orbital gaps between satellite overpasses were filled using a two-step procedure. First, a temporal linear interpolation of adjacent T b was conducted with a window size of 11 days to fill small gaps (Jin et al., 2015). This step significantly improved the data completeness (Table 1). The large gaps were then filled by exploiting the linear relationship between T b and T a using downscaled meteorological data from ERA-Interim (for details see section 2.4). This second step was conducted annually ( Figure A1). Figure A3 gives an overview of the interpolation approach used to fill the T b gaps. To further improve the reliability and continuity of CETB, whenever overlaps in data occurred between multiple platforms, we chose the platform with the greater completeness and longer observation period. Furthermore, newer satellite platforms were favored over older ones. The resulting time series used data from the Nimbus-7 of SMMR from October 1978 to July 1987, the F08 of SSM/I from July 1987 to December 1991, the F11 of SSM/I from December 1991 to February Figure 1a). However, because the T b (85V) was only available for the SSM/I sensor, the temporal coverage for this frequency only starts in July 1987 and between February 2000 and 2016; T b records from F15 were used instead.

Land Use Map and Topography
To derive land cover masks, we used the Multi-source Integrated Chinese Land Cover (MICLCover) map because it has better performance (overall accuracy >70%) compared to other popularly used maps (Ran et al., 2012). The original International Geosphere-Biosphere Programme (IGBP) land cover classification used in MICLCover was aggregated into five major types: forest (including evergreen needle leaf forest, evergreen broad leaf forest, deciduous needle leaf forest, deciduous broad leaf forest, and mixed forest), shrub (including closed shrub lands and open shrub lands), grass (including woody savannas, savannas, grasslands, permanent wetlands, and croplands), bare land (including urban and built-up, cropland/natural vegetation mosaic, snow and ice, and barren or sparsely vegetated), and water by following Xiao et al. (2018). This was done to account for the biased distribution of CMA (China Meteorological Administration) stations that resulted from using the original categories. The MICLCover map was resampled from its original 1 km spatial resolution to match the NSIDC EASE-Grid format using majority resampling. The Global Digital Elevation Model version 2 (GDEM2) with a spatial resolution of 1 arc second (∼30 m) was first aggregated to 3 arc seconds to avoid the noise in the original data set and then resampled using bi-linear interpolation Note. The interpolated completeness is derived using adjacent T b ( Figure A3), and only the selected data are present here. Please note that different temporal coverage was used for T B (85V) (underlined) because it is not available for the sensor of SMMR and SSMIS.
to match the 25 km NSIDC EASE-Grid format (version 2, Brodzik et al., 2012). The 25 km DEM is then used for T a downscaling. Finally, before any soil F/T classification took place, desert regions were masked from the original CEBT data based on "The map of desert distribution in 1:2,000,000 in China" (Chao, 1984).

Ground Observations
We obtained T g records at 0 cm depth from the China Meteorological Administration (CMA) to (i) derive the T b threshold for the dual-index algorithm and (ii) evaluate the classification results for all near-surface soil F/T detection algorithms. T g was monitored by trained professional technicians four times per day (02:00, 08:00, 14:00, and 20:00 Beijing Time) (Wang et al., 2015). T g was measured with a white mercury ball thermometer with a diameter of about 3 mm and was partially buried in order to reduce the influences radiation and atmosphere (Wang et al., 2015). The satellite data from the a.m. and p.m. passes were found to be more closely related to daily minimum and maximum T g , respectively, than to the daily mean (Xie et al., 2013). The daily maximum was not used here because no frozen ground was found for any of the sites in the summer. Therefore, only data from the a.m. satellite pass was used for evaluation. It is important to note that when the ground was covered by snow, the snow surface temperature was measured instead of the soil temperature (Wang et al., 2015). We make the assumption that the near-surface soil is frozen when it is covered by snow, as snow cover can only be maintained when the near-surface soil temperature is below 0 • C (Zhang et al., 2004). The assumption is reasonable as the classification accuracy of snow-covered period is found to be better than that of snow-free period based on the measurements; however, these results are not presented here.
Regions with little F/T activity were excluded from the study. To do this, the freezing probability was calculated for each site by dividing the number of days with T g < 0 by the total number of days. Sites were removed if their overall freezing probability during 1979-2016 was less than 0.1. Sites in grid cells containing more than 50% water or in desert were also removed. This resulted in a total of 526 stations with ∼ 15.0 × 10 6 observations ( Figure 2). These stations extend over a large area of China and cover all land use types with a distribution of about 14.1% in forest, 4.7% in shrub areas, 31.0% in grassland, and 51.7% in bare land (see section 2.2 for the description of the land use map). In addition, the CMA data set represents a wide range of elevations from about 0 to 5,600 m ( Figure 3a). The distribution of aspects for the stations was approximately uniform (26.8% on the east slope, 24.1% on the south slope, 24.5% on the west slope, and 24.5% on the north slope); however, the majority (63.5%) of stations were located in relatively flat areas with slopes of less than 10 • (Figure 3b). The ground observations used here are therefore considered to be representative enough to conduct the evaluation and inter-comparisons among different near-surface soil F/T detection algorithms over most areas of China but may have increased uncertainty in steep terrain. Note that no sites are located within the same grid.

Reanalysis and Downscaling
As discussed below, T a measurements were required to derive T b thresholds for the modified seasonal threshold algorithm (MSTA). For this, ERA-Interim, produced by the European Centre for Medium-Range Weather Forecasts (ECMWF), was used. ERA-Interim uses a fully coupled atmosphere-ocean-land model, making it suitable to describe large-scale climate, and it has a similar temporal coverage as the CEBT (January 1979 to current) (Dee et al., 2011). In order to match with the CETB data, the original ERA-Interim grids Distribution of elevation (radius), slope (color), and aspect (0 • represents North) for aggregated field measurements. Slope and aspect were derived from a DEM with a spatial resolution of 3 arc seconds. Note. SDIA = seasonal dual-index algorithm; MSTA = modified seasonal threshold algorithm; DTA = decision tree algorithm; DFA = discriminant function algorithm; T a = near-surface air temperature; T b = brightness temperature; T g = ground surface temperature; SG = spectral gradient.
with a spatial resolution of ∼79 km were downscaled to 25 km based on the elevation correction method (Gao et al., 2012). This downscaling method has been shown to perform well for fine-scale T a , Cao, Zhang, et al., 2019Cao, Quan, et al., 2019).

Near-Surface Soil F/T Classification Algorithms
This study is focused on assessing only the most widely used near-surface soil F/T algorithms rather than presenting a review and synthesis of all such algorithms. Table 2 summarizes the four that were selected for this study: the seasonal dual-index algorithm (SDIA), MSTA, decision tree algorithm (DTA), and the discriminant function algorithm (DFA). Each of these algorithms is discussed below. The temporal change detection algorithm was not used here because it is not able to detect repeated F/T cycles in autumn and spring.

Seasonal Dual-Index Algorithm
The vertically polarized T b at 37 GHz, written as T b (37V), is strongly correlated with T a and T g at depths less than 5 cm (Jin et al., 2015;Zhang et al., 2003). However, the presence of snow and/or thick vegetation cover reduces the strength of this correlation. Additionally, when soil freezes, the volume-scattering effect becomes obscured in the lower frequency channel, which contributes to a negative spectral gradient (SG, Zhao et al., 2011). The combination of T b (37V) and the 19 to 37 GHz SG can therefore be used to assess whether near-surface soils are frozen or thawed (Zhang et al., 2004). The near-surface soil freeze status thaw status of SDIA (FT SDIA ) can be expressed as where a value of 1 means the near-surface soil is classified as frozen. The variables T cutoff [K] and SG cutoff [K GHz −1 ] are empirically defined threshold or cutoff values for T b and the SG, respectively. The spectral gradient between 19 and 37 GHz is calculated as The value of T b is equal to the product of the physical temperature and the emissivity of the measured surface. Emissivity is affected by both the sensor (e.g., frequency, polarization, and incidence angle of the satellite observation) and the surface conditions (e.g., roughness, vegetation cover, and soil moisture) (Smith et al., 2004). For this reason, the T b cutoff is derived annually or seasonally for each land cover type rather than being treated as a constant (detailed T cutoff derivation and evaluation is described in section 3.2). We used weighted nonlinear regression to handle the strong sensitivity of the algorithm to T cutoff (Kim et al., 2017). Values of T g nearer to 0 • C were assigned a greater weight because of the significance of the freeze-thaw boundary. Weights were derived from a cosine function with a T g range from −30 to 70 • C, which covered 99.6% of all T g observations ( Figure A1). Figure 4 shows the derived T cutoff used for SDIA. The SG cutoff was assigned a constant value of 0 K GHz −1 .

Modified Seasonal Threshold Algorithm
MSTA is a variant of the seasonal threshold algorithm which treats T b thresholds as both spatially and temporally variable. The algorithm may reduce the potential influence of spatial and temporal variations in climate (e.g., precipitation) and land surface conditions (e.g., soil moisture and vegetation cover) on F/T classification and promote greater consistency in large area performance over the long-term record (Kim et al., 2017). As above, the T b threshold was derived empirically using the weighted, nonlinear relationship between T a and T b (37V) cell-by-cell and year-by-year ( Figure 5). To take into consideration the strong variability of T a caused by large-to medium-scale climate conditions and local-scale factors (especially elevation), we used the downscaled T a (see section 2.4) rather than simply interpolating the 2 m surface air temperature from ERA-Interim. The improved representation of T a is thought to better capture the relationship between T b and T a and contribute to higher classification accuracy. Additionally, the SG was introduced here to further improve classification. In other words, MSTA was effectively a variant of the dual-index algorithm with grid-cell-wise T b thresholds and determined as equation (1), where a value of 1 corresponds to frozen status as above.

Decision Tree Algorithm
Jin et al. (2009) developed a DTA to distinguish between frozen near-surface soil, thawed near-surface soil, precipitation, and desert. There are three critical indices used in the decision tree: the scattering index, T b (37V), and the polarization difference at 19 GHz. When the ground is frozen, the effect of volume scattering darkening is stronger at higher frequencies. Hence, the scattering index is approximated as the difference between the observed T b (85V) and the simulated value assuming ideal conditions and no scattering. This can be expressed as where T b (85V) sim is the simulated T b (85V), while T b (85V) ps is obtained from the passive microwave sensor. T b (85V) sim is estimated based on an empirical relationship using T b (19V) and T b (22V) (Grody, 1991): T b (37V) was used as an indicator for soil physical temperature as it had the best agreement with T g compared to other channels. The 19 GHz polarization difference is used to determine surface roughness and to improve identification of desert. For simplicity, the desert areas were masked using the available desert inventory, and the precipitation class was reclassified as thawed ground. The simplified DTA (FT DTA ) is shown in Figure 6 and can be written as where a value of 1 corresponds to frozen status, as above.

Discriminant Function Algorithm
DFA is a more recently developed method (Zhao et al., 2011). The criteria and discriminant functions were obtained from both radiometer observations and model simulations. In order to simulate near-surface soil T b for different surface covers, a combined microwave emission model was first developed by considering various factors, especially the vegetation and snow cover. Then a F/T discriminant algorithm was estimated using simulated T b (37V) and T b (19H) by assuming that (i) the T g is strongly related to T b (37V) and (ii) the emissivity, which increases considerably when water becomes ice, can be estimated using the ratio of T b (19H) to T b (37V). The DFA uses Fisher linear discriminant analysis to classify near-surface soil F/T status. Parameters are validated based on the simulated T b and the measured ground F/T. This can be written as where D F and D T are discriminant scores for frozen and thawed near-surface soil, respectively. Here, Qe refers to quasi-emissivity, which is an approximation to the true emissivity given by The near-surface soil is classified as frozen if D F is greater than D T . Otherwise, it is classified as thawed: To make the evaluation possible, the near-surface soil F/T status based on DFA was conducted using the T b from SMMR, SSM/I, and SSMIS, even though it was developed using the Advanced Microwave Scanning Radiometer for Earth Observing System (AMSR-E). The T b from the platforms of SMMR, SSM/I, and SSMIS show good agreement with those obtained from AMSR-E, and values were calibrated to AMSR-E based on their linear relationship in order to remove any bias ( Figure A2).

Evaluation
To evaluate the performance of the algorithms, the classification results were compared to observations from the CMA. Observed soil F/T status was based on the measured T g . The near-surface soil was considered frozen when T g ⩽0 • C. Because the T cutoff value for SDIA must be derived using observational data, we conducted a 10-fold cross-validation to assess its performance. The CMA stations were first divided into four subsets based on land use types. Each time, ∼90% of the stations for each land use type were randomly selected from the subsets for deriving T b thresholds, and the remaining 10% were used for evaluation. Following Zhang et al. (2003), evaluations of the maps with categorical types were conducted using classification accuracies (acc).
The subscripts on the right-hand side of the equations identify the classified value. For instance, F F is the number of frozen observations which were correctly classified as frozen, and F T is the number of frozen observations which were misclassified as thawed. To avoid the impact of uneven distribution of observation numbers for near-surface soil F/T status, the Cohen's kappa coefficient ( ), which measures inter-rater agreement for categorical items, was introduced here for classification evaluation (Landis & Koch, 1977). The value of ( ) is defined as where p e and p o are the probability of random agreement and disagreement, respectively, and can be calculated as In addition to the evaluation metrics discussed above, a detailed evaluation was also conducted using the root mean squared error (RMSE) and mean bias (BIAS) of four indices: the number of freezing days (NFD), the onset and the offset of freezing, and the total duration of near-surface soil freezing. The onset (offset) of the near-surface soil freeze was defined as the earliest (latest) date after 1 July when the near-surface soil remained frozen for five successive days.

Broad Evaluation
Our results showed a wide range of performance for the four algorithms with overall between 0.42 and 0.72 and an O acc between 73.8% and 86.2% (Table 3). DFA achieved the best performance overall with the  highest and O acc of 0.72 and 86.2%. While the SDIA, MSTA, and DFA show similar performance ( ranges from 0.68 to 0.73) for the SSM/I and SSMIS sensors, significant improvement is achieved for SDIA and MSTA compared to the SMMR sensor (Table 4 and Figure 7). The DTA showed the worst performance and had significantly lower and O acc values than the other algorithms.
All four algorithms were better at detecting the thawed (from May to September) and frozen seasons (from November to February) than the transition season (April and October). As a consequence, the algorithms show a clear seasonal variation of overall performance with a mean O acc of 93.1 ± 4.7% in the thawed season and 74.4 ± 11.6% in the frozen season ( Figure 8).

Detailed Evaluation
Not surprisingly, the sites with lower probabilities of being frozen normally have better performance due to the better detection of the thawed season compared to the transition season. Consequently, the performance generally decreases for more northerly sites. Most (92.6%) of the sites calculated using DFA have a ⩾ 0.4, and it decreases to between 38.8% and 82.1% for the other algorithms. The DTA only shows suitability for a few sites, most of which are northerly (latitude > 30 • N) and have a relatively cold climate.
Because all algorithms underestimate the freeze status during the transition season, the NFD was consistently underestimated, with a BIAS of between −8.1 and −95.3 days.
SDIA, MSTA, and DFA showed remarkable error in the duration of frozen status, and the overall RMSE was between 51.8 and 57.8 days ( Table 5). The pronounced uncertainty on the frozen duration is attributed to both the error in predicting onset, which has a BIAS of −0.57-48.5, and the error in predicting offset, which has a BIAS between −42.7 and 0. The detection of NFD, onset, offset, and hence frozen duration for SDIA  and MSTA show significant differences between sensors with the SMMR having the worst performance. On the other hand, the DFA algorithm shows similar yet large bias for all the sensors ( Figure 10).

Inter-comparison of Algorithms
With the exception of DTA, all the algorithms were suitable for detecting near-surface soil F/T status at a large scale and had O acc around 85% and between 0.67 and 0.72.
In general, DFA was the best-performing algorithm and had the highest overall agreement with observations. One of the reasons for this is the better performance of the SMMR sensor compared to the SDIA and MSTA sensors (Table 4 and Figure 7). There is, however, a temporary decline in accuracy from 2006 to 2009 associated with a peak in T b . The T b (37V) and T b (19H) during this period were between 2.4-3.6 K and 1.9-3.1 K higher than 2010 values, respectively (Figure 11a). Based on the sensitivity test, DFA performance depends on T b (Figure 11b). This is because the parameters of DFA in equations (8) and (9) are derived from statistically based discriminant functions. In other words, better performance could only be expected in areas where the mean T b s values are similar to the training data sets used by Zhao et al. (2011). Parameter calibration is therefore recommended before applying DFA to other regions.
SDIA and MSTA show similar performance with a remarkable improvement from the SMMR sensor to SSM/I and SSMIS. There are several explanations for this; the first is the difference in T b accuracy between sensors. Njoku (1980) reported that the T b error of the SMMR sensor was around 1-2.5 K, whereas the error for SSM/I and SSMIS was only 0.5 K (Brodzik & Long, 2018). Second, the narrow sensor swath contributes to the decreased performance of SMMR (Table 1) because interpolated grid cells are expected to perform more poorly than measured ones even though a rigorous procedure was used to fill the T b gap ( Figure A3). For example, the for SMMR derived from MSTA using the original (no interpolation) T b was 0.61, but it decreased to 0.55 for the no-gap T b . Since the data completeness for SMMR is 20-42% lower than the other platforms, worse performance is expected. Although SDIA and MSTA achieve similar performance in the four indices based on detailed evaluation, MSTA can be more easily applied in areas with limited observations because its model parameters are derived from globally available grid data rather than from observations.
Our results showed that DTA is less suitable for near-surface soil F/T status detection than the three other algorithms. In particular, the DTA significantly underestimated the freeze status with about 60% of observations incorrectly classified as thawed. This is very likely caused by an underestimated threshold for T b (37V), which was fit using a small data set comprising only eight stations (Jin et al., 2009). In contrast, the T cutoff for SDIA was about 259.7 ± 1.8 K for bare land, 261.4 ± 1.9 K for grass, 261.8 ± 1.6 K for shrub, and 261.6 ± 1.8 K for forests. These values are about 10 K higher than that used in DTA (252 K). This is also demonstrated by a sensitivity test (Figure 12). Figure 12 shows that the performance of DTA is highly sensitive to the values of selected T b (37V) cutoff. Based on the sensitivity test, DTA achieves its best performance in the T b (37V) range of 263-265 K. For this reason, more validation data are recommended before applying DTA to areas outside of the test sites in order to estimate more reliable parameters.

Uncertainties of Near-Surface Soil F/T Detection Algorithms
In order to take full advantage of our observations, we assumed that the near-surface soil was frozen when snow cover was present. However, when the T g is only slightly below 0 • C, the presence of liquid water under the snow will yield dielectric properties that resemble thawed conditions. This would decrease the accuracy of detection, especially during the snow-melt season.
The passive microwave data were found to be less sensitive to near-surface soil F/T status in the F/T transition season with worse performance with respect to freezing onset and offset detection. This is because near-surface soil freezing may occur sporadically for some length of time during the F/T transition season, and detection mismatching would result in greater uncertainty. The mismatch between observation depth and passive microwave detection depth will also contribute to poor performance in assessing freezing onset, offset, and NDF. The reference data set used as ground truth was derived from measurements at a depth of 0 cm whereas the detection depth of passive microwave sensors is expected to be on the order of several centimeters. The greater detection depth will lead to a delayed or lowered sensitivity of F/T detection in the transition seasons due to the time lag caused by heat diffusion. Although soil temperatures from multiple depths, such as 0 cm (Jin et al., 2015), 4 cm (Jin et al., 2009;Zhao et al., 2011), and 5 cm (Zhang et al., 2003(Zhang et al., , 2004, were used, the detection depth for passive microwave sensors is site-specific due to the strong spatial variation of fine-scale soil properties, such as moisture content, as well as surface characteristics, such as roughness and vegetation (Zheng et al., 2018). Recently, Zheng et al. (2019) reported that vertical T b has better agreement with the soil moisture measured at 2.5 cm than that measured at 5 cm based on in situ measurements. A truly systematic evaluation of these effects would require a comprehensively designed set of observations. We hope that, over time, more investigations of passive microwave detection depth in various soil conditions will help to establish correlations of trusted values with soil characteristics.
Although the calibrated T b data sets are used here to reduce the potential uncertainties caused by the changes of sensors, the mismatch of overpass time remains unsolved in CETB. While the overpass time of the SSM/I and SSMIS sensors (5-10 a.m.) is highly aligned with the minimum T g measured at 08:00, the overpass of the SMMR sensor is about midnight. This is thought to reduce the classification accuracy of SDIA for the SMMR sensor by weakening the relationship between T g and T b .

Reliability of Change Detection
Passive microwave remote sensing data sets have been widely used for deriving long-term near-surface soil F/T changes. For example, Jin et al. (2015) reported that the frozen period has shortened by 20 days during the period 1978-2008 based on estimates of onset delay and offset advance. Similarly, a delay of ∼14 days to the first day of soil freezing and an advance of ∼10 days to the last day was found over the period  10.1029/2019EA000807 on the Qinghai-Tibet Plateau . However, our results suggest that the RMSE of onset and offset dates could be up to 41.6 and 35.8 days, respectively. This is the case even for the algorithm-sensor pair with the best performance. The uncertainty based on the calibrated T b data set is about 1.8-6.2 times the magnitude of the estimated change over the last several decades.
We also evaluate the global FT data records, daily freeze/thaw Earth System Data Record (FT ESDR, https:// nsidc.org/data/nsidc-0477), proposed by Kim et al. (2017) in order to avoid the possible bias induced by different T b gap filling and aggregation (or resampling) methods. Similarly, FT ESDR is found to have a RMSE of about 32 and 26 days for the onset and offset, respectively, over China.
For this reason, we argue that remarkable uncertainty still exists in the detection of long-term near-surface soil F/T status, and the results of previous studies may not be reliable. Recently, a noisy-edge detection method has been successfully applied to investigate the near-surface soil F/T changes over the Alaska, and the results show a RMSE of 1 ± 7 (Chen et al., 2019). This provides a promising method for long-term near-surface soil F/T changes in the future as soon as it is calibrated more widely.

Conclusions
We conducted a detailed evaluation of near-surface soil F/T status detection based on passive microwave remote sensing data. Specifically, we evaluated and inter-compared the four most widely used algorithms (SDIA, MSTA, DTA, and DFA) using a large number of ground truth observations across China. Our results support the following conclusions: • In general, the SDIA, MSTA, and DFA algorithms are suitable for near-surface soil F/T status investigations and have overall accuracies between 84.1% and 86.2% and of 0.67-0.72. However, the current parameters for DFA may limit its performance in areas with different T b from the training data set. DTA is not transferable to the areas outside the test stations used for its development, and more validation is recommended before applying it to a large area. • All these four algorithms show a reduced ability to capture the near-surface soil freeze status in the F/T transition seasons and exhibit significant uncertainties in onset, offset, and NFD. • Investigations of long-term changes to near-surface soil F/T using coarse-grid passive microwave remote sensing data may be subject to remarkable uncertainty.
Appendix A Figure A1. The weighting scheme used in the linear regression model for T cutoff estimation. T g is used for SDIA, while T g is used for MSTA.  Figure A3. Examples of the two-step T b gap filling scheme. The blue numbers in the bottom right of the panels describe data completeness.