Quasi-global machine learning-based soil moisture estimates at high spatio-temporal scales using CYGNSS and SMAP observations

Global soil moisture mapping at high spatial and temporal resolution is important for various meteorological, hydrological, and agricultural applications. Recent research shows that the land surface reflection in the forward direction of Global Navigation Satellite System (GNSS) signals at L-band can convey high-resolution land surface information, including surface soil moisture. However, these signals are often affected by complex land surface characteristics and the bistatic nature of the GNSS-Reflectometry (GNSS-R) technique, resulting in a nonlinear relationship between the signals and surface soil moisture. In this work, a machine learning (ML) approach is used to map quasi-global soil moisture using bistatic reflectance observations acquired from the recently launched Cyclone GNSS (CYGNSS) mission. Specifically, several land surface parameters are obtained from remote sensing products and integrated with Soil Moisture Active Passive (SMAP) enhanced soil moisture re-trievals to facilitate daily quasi-global CYGNSS soil moisture mapping at 9 km. Based on cross-validation against SMAP data, the ML algorithm is shown to be suitable for retrieving soil moisture from CYGNSS. Median values of unbiased root-mean-square-difference for the quasi-global coverage or regions with vegetation water content less than 5 kg/m 2 are 0.0395 cm 3 /cm 3 and 0.0320 cm 3 /cm 3 , respectively. Likewise, via independent evaluation against more than 100 in-situ sites, the algorithm is shown to have an unbiased root-mean-square-error of 0.0543 cm 3 /cm 3 . CYGNSS-based retrievals contain similar spatial variability as SMAP across different seasons. Moreover, through a robust triple collocation technique, the accuracy of CYGNSS soil moisture is relatively high over moderately vegetated regions with correlations ranging from 0.4 to 0.8. Based on these validation results, we argue that derived CYGNSS soil moisture estimates can supplement current global soil moisture databases and provide more frequent retrievals at 9 km.


Introduction
Soil moisture (SM) is a key component in the global water cycle and plays a vital role in partitioning precipitation into surface runoff and infiltration and available energy between sensible and latent heat flux (McColl et al., 2017).The characterization of SM conditions has been crucial for understanding land-atmosphere interactions (Koster et al., 2004;Seneviratne et al., 2010;Lei et al., 2018), enhancing hydrometeorological forecasts (Scipal et al., 2008;Liu et al., 2012;Lin and Pu, 2019), and improving agricultural water resource monitoring (Martínez-Fernández et al., 2016;Mladenova et al., 2019).Following several decades of development, microwave remote sensing has been proven to be an effective way for obtaining SM at extensive spatial coverage with relatively high temporal frequency (Njoku and Entekhabi, 1996).In particular, satellite remote sensing has been deeply explored for deriving global SM estimations.
Within the microwave frequency range of the electromagnetic spectrum, the L-band frequency signals are relatively more sensitive to the surface SM (~5 cm depth) conditions, with less impact from vegetation canopy as compared to the higher C-and X-band frequencies (Schmugge, 1998;Jackson et al., 2010;Wigneron et al., 2017).Currently, the European Space Agency Soil Moisture and Ocean Salinity (SMOS) (Kerr et al., 2001) and National Aeronautics and Space Administration (NASA) Soil Moisture Active Passive (SMAP) (Entekhabi et al., 2010) are the two dedicated satellite missions for mapping global SM with L-band radiometers onboard.SMOS, launched in 2009, provides surface SM estimates at a spatial resolution of about 40 km and a revisit time of 2-3 days.Meanwhile, SMAP launched in 2015 was initially designed to map SM at 3 km by integrating both radar and radiometer microwave signals.However, due to the failure of the SMAP radar instrument in July 2015, SMAP has fallen back to producing radiometer-based SM retrievals at comparable spatial and temporal scales with SMOS.Until now, both SMOS and SMAP have generated more than five consecutive years of global SM data at about 25-km to 36km spatial resolution for various researches and applicationsincluding, but not limited to, climate change (Dorigo et al., 2017;Forgotson et al., 2020), drought monitoring (Mishra et al., 2017;Ford and Quiring, 2019), and food security (Sadri et al., 2020;Krishnamurthy R et al., 2020).
Given the increasing needs and requirements for SM applications at regional and local scales, global SM data at finer spatial resolution (1 km to 10 km) and temporal resolution (sub-daily to daily) have become more critical in recent years.Correspondingly, the NASA SMAP science team released an enhanced SMAP radiometer-based SM product (Chan et al., 2018) and SMAP/Sentinel-1 integrated active (C-band) and passive (L-band) SM product (Das et al., 2019).C-band active observations, such as those produced by Sentinel-1, can support SM retrieval at a 1-km spatial resolution (Paloscia et al., 2013;Bauer-Marschallinger et al., 2018).However, due to its limited swath width, the revisit time of Sentinel-1 varies from 3 to 12 days.This tradeoff between spatial and temporal resolution has spurred interest in deploying a constellation of small satellites for the ocean and subsequent land surface applications.Particularly, the Cyclone Global Navigation Satellite System (CYGNSS) mission, encompassing a constellation of small satellites carrying GNSS-Reflectometry (GNSS-R) receivers, can facilitate global SM mapping with sub-daily to daily sampling frequencies (Ruf et al., 2018;Chew and Small, 2020).In contrast with the abovementioned active and passive microwave platforms, the GNSS-R technique used by CYGNSS collects GNSS signals at L-band frequency that represents a viable approach for obtaining global SM with relatively low cost and high resolution.Specifically, GNSS-R is a bistatic technique in which a GNSS receiver measures the forward-scattered signals that are originally emitted by multiple constellations of GNSS navigation satellites and reflected off the Earth's surface.The feasibility of GNSS-R for SM estimation has been explored through several field campaigns (Alonso-Arroyo et al., 2016;Egido et al., 2014;Rodriguez-Alvarez et al., 2009) and a demonstration satellite (Camps et al., 2016;Chew et al., 2016), i.e., TechDemoSat-1.
Currently, several efforts are being conducted to retrieve surface SM from CYGNSS GNSS-R data (Chew and Small, 2018;Kim and Lakshmi, 2018;Al-Khaldi et al., 2019;Clarizia et al., 2019;Calabia et al., 2020;Yan et al., 2020;Yueh et al., 2020).In these approaches, empirical relationships have been constructed between CYGNSS reflectivity or signal-noise-ratio (SNR) and the SM estimates obtained from SMAP or in-situ networks.Considering that the GNSS-R signals are significantly affected by the topography, vegetation, and surface roughness, a machine learning (ML)-based approach can be optimal for characterizing the (potentially) complex, nonlinear relationship between CYGNSS observables and the SM conditions through the inclusion of land surface characteristics (Eroglu et al., 2019).Previous work has regressed CYGNSS reflectivity observations against SM measurements using Artificial Neural Network, Random Forest, and XGBoost ML approaches (Eroglu et al., 2019;Senyurek et al., 2020a;Senyurek et al., 2020b;Yang et al., 2020;Jia et al., 2021).However, the optimal approach for constructing an ML model for CYGNSS retrieval that has both high performance and generalization capability is still uncertain.
Herein, a daily quasi-global CYGNSS SM product at 9 km is generated via the Random Forest ML technique.Particularly, distinct from previous research efforts that have primarily focused on the use of a single ML algorithm applied to all available data for generating a SM product, this work evaluates the usage of multiple, separately trained models with different data stratification strategies to determine SM and its associated accuracy.Different from our previous work using highly localized information from in-situ sites to represent a spatial grid, SMAP SM data is used as the base reference here.Multiple remote sensing land surface products characterizing the land surface conditions are incorporated for describing the nonlinear relationship between CYGNSS signals and SM.The optimally constructed ML-based model is then generalized for future SM retrieval.Next, the performance of the retrieval model is evaluated through both year-based cross-validation and an independent validation against in-situ measurements from sparse networks.Finally, a robust and independent evaluation for the quasi-global CYGNSS product is conducted using the Triple Collocation (TC) technique (Gruber et al., 2016;Chen et al., 2018).
The remainder of this paper is structured as follows.Section 2 describes CYGNSS and SMAP data, and ancillary data used for the ML retrieval process.The ML retrieval process and the evaluation techniques are detailed in Section 2. Results of the CYGNSS SM retrievals and validation analysis are shown in Section 3.Both the optimization of the ML method and the spatial/temporal resolution issues are discussed in Section 4 with conclusions summarized in Section 5.

Dataset and methodology
Datasets leveraged for constructing the ML-based retrieval model are described below, including the SMAP SM data and ancillary data for characterizing land surface conditions.In addition, in-situ SM measurements and global SM data from other sources are included for the evaluation of CYGNSS SM retrievals.The general data preprocessing, ML procedures, model construction approaches, and evaluation metrics are explained afterward.

Cyclone Global Navigation Satellite System
CYGNSS is a NASA mission designed with a constellation of small satellites for Earth observation (Ruf et al., 2018).It consists of eight micro-satellite observatories, each of them carrying a four-channel GNSS-R bistatic radar receiver to collect Global Positioning System (GPS) signals reflected off the Earth's surface.Although these microsatellites orbit above the tropics, limiting their spatial coverage between 38 • N and 38 • S, the overall revisit time of CYGNSS is within several hours (Ruf et al., 2016).Compared with the sun-synchronous satellites, such as SMAP and SMOS with 1 to 3 days revisit time, CYGNSS can provide more frequent sampling.The spatial resolution of CYGNSS signals varies depending on the surface roughness at and near the reflection point (Comite et al., 2019).The minimum spatial footprint can be 7 km × 0.5 km for coherent scattering from smooth surfaces.On the other hand, the reflected signal is incoherent over rough suface and its spatial resolution is on the order of several to dozens of kilometers.As of mid-2019, the incoherent integration time for signals was reduced from 1 s to 0.5 s, making the minimum footprint as small as 3.5 km × 0.5 km for coherent reflection.The CYGNSS Level-1 version 2.1 data (available from https://podaac.jpl.nasa.gov/CYGNSS)from March 18, 2017 to December 31, 2019 are used here.The reflected GPS signals are recorded via a delay Doppler mapping instrument (DDMI) onboard CYGNSS satellites.Raw Delay Doppler Map (DDM) measurements are processed via the receivers by cross-correlating the reflected signal with a replica of the transmitted signal over a range of time delays and Doppler frequencies.In Level-1 data, DDMs are processed to account for non-surface related terms by inverting the CYGNSS forward-scattering model and obtaining the surface's bistatic radar cross section (BRCS) and the effective scattering area for each specular point.The CYGNSS BRCS is a 17 delay × 11 Doppler array that generally reflects a horseshoe shape with peak values changing in response to the scattered field coherency/incoherency, surface roughness, vegetation, and SM F. Lei et al. conditions.In addition to BRCS, CYGNSS Level-1 metadata also include geometric and instrumental information, including incidence angle and the distances between the transmitter, receiver, and the specular point.Using these Level-1 observables, the surface reflectivity can be calculated using the peak value of the calibrated DDM with the assumption of coherent scattering (Rodriguez-Alvarez et al., 2019;Chew and Small, 2020;Senyurek et al., 2020a).Following Rodriguez-Alvarez et al. (2019), the BRCS (variable name is brcs in CYGNSS Level-1 data) denoted as P brcs is used to obtain surface reflectivity (Γ) as shown below where r st is the distance between the specular reflection point and the GNSS transmitter (tx_to_sp_range in Level-1) and r sr is the distance between the receiver and the specular reflection point (rx_to_sp_range in Level-1).Moreover, trailing edge slope (TES), indicating the coherent or incoherent condition, is calculated from the reflectivity delay waveform by integrating Eq. ( 1) within the Doppler domain (Rodriguez-Alvarez et al., 2019).Particularly, TES is calculated from the reflectivity delay waveform values at the delay bins m (peak delay bin) and m + 3. Therefore, the reflectivity, TES, and specular point incidence angle (sp_inc_angle in Level-1) from Level-1 data were used in our ML framework.
The CYGNSS data were gridded to EASE-Grid 2.0 3 km and averaged over one day.Following the typical convention for SMAP's penetration depth assumption of SM retrievals, CYGNSS reflectivity is assumed to represent SM in the top 5 cm depth of the soil profile.Before utilizing the CYGNSS observables for SM estimation, a quality check is critical to remove data records with large uncertainties resulting from either instrumental processing or complex underlying land surface conditions.Quality flags were utilized following Rodriguez-Alvarez et al. (2019) and Chew and Small (2018), including "S-band transmitter powered up", "large spacecraft attitude error", "black-body DDM", "DDM test pattern", and "low confidence in GPS EIRP estimate".Moreover, observations with: 1) incidence angle greater than 65 degree, 2) negative receiver antenna gain, or 3) DDM peak value falling outside a range of delay bins between 5 and 11 were filtered.Typically, incidence angles below 30 degrees have a limited effect on reflectivity (Stilla et al., 2020).However, the number of available observations will be significantly reduced if a threshold of 30 degree is used for incidence angle.Here, observations with incidence angles below 65 degree were included following Chew and Small (2020) and Senyurek et al. (2020aSenyurek et al. ( , 2020b)).Note that CYGNSS measurements over areas with surface elevation greater than 600 m were masked by CYGNSS on-board data compression algorithm before December 2017.Considering the significant impact of topography, open water bodies, and vegetation cover, data records over regions with surface elevations greater than 1500 m (600 m before December 2017) and open water fractions larger than 2% of the 3-km grid were removed (Eroglu et al., 2019;Senyurek et al., 2020aSenyurek et al., , 2020b)).Ancillary surface elevation and open water body data used for masking are described in Section 2.3.

Soil Moisture Active Passive
To acquire global SM conditions at high accuracy, the NASA SMAP mission has been designed to collect L-band signals that are sensitive to surface SM (~ 5 cm of depth).Following the malfunction of its radar instrument, SMAP has been continuously providing global L-band brightness temperature at a spatial resolution of about 40 km with a revisit frequency of 2-3 days.Using the single channel algorithm, SMAP SM retrievals from the passive radiometer are posted on the EASE-Grid 2.0 36-km grid for descending (6 a.m.local solar time) and ascending (6 p.m. local solar time) overpasses in the SMAP Level 2 Passive Soil Moisture Product (L2_SM_P) (Chan et al., 2016).To increase the resolution of SMAP products, an optimal interpolation technique is applied to the antenna temperature from overlapped radiometer footprints on orbit for producing the SMAP Level 2 Enhanced Passive Soil Moisture Product (L2_SM_P_E) with a 9-km grid spacing from the enhanced brightness temperature data (Chan et al., 2018).The overall accuracy of this L2_SM_P_E product is comparable to the standard L2_SM_P based on their validation over core sites and sparse networks (Chan et al., 2018;Zhang et al., 2019).
The SMAP L2_SM_P_E version 4 data between March 1, 2017 and December 31, 2019 were accessed from the National Snow and Ice Data Center (NSIDC:https://nsidc.org/data/smap/smap-data.html).As the thermal equilibrium conditions at the soil-vegetation interface are dependent on the local acquisition time and thus can impact the errors in surface SM retrieval process (Lei et al., 2015), SMAP descending SM retrievals are considered to have higher accuracy.However, in this work, both ascending and descending overpasses were exploited to increase the frequency of SMAP data samples.Nevertheless, quality checks were conducted (Zhang et al., 2019) beforehand to exclude grid points: (1) under frozen conditions (with land surface temperature < 273.15 K); and ( 2) with open water fractions >10%.

Ancillary data for retrieval process
The reflected GNSS-R signals, from which SM is derived in this work, respond also to the surface roughness.Consequently, the accurate representation of the land surface reflectivity requires a prior topography information.Surface reflectivity increases rapidly in the case of surface open water bodies, which can hamper the retrieval of soil water conditions in mixed land/water landscapes.In addition, vegetation density, as well as roughness at both small-scale and topography scale, can attenuate the reflected signals and impact GNSS-R reflectivity.Therefore, land surface conditions, especially the topography, vegetation, and open water bodies, should be known to facilitate the retrieval of SM from land-reflected signals.
Specifically, surface elevation information was derived from the 1km global Digital Elevation Model GTOPO30 product (doi: https://do i.org/10.5066/F7DF6PQS).Elevation data were spatially aggregated from 1 km to 3 km using the drop-in-bucket averaging approach (Fang and Lakshmi, 2014).Global soil texture data from the Global Gridded Soil Information (SoilGrids; Hengl et al., 2017) were used to characterize the spatial variations of soil type.In SoilGrids, the vertical soil profile is discretized into 7 layers, and the soil of each layer is classified into 12 standard soil texture types depending on the corresponding sand, clay, and silt proportions.To be consistent with the penetration depth of L-band microwave signals, the soil clay and silt proportions from the top layer of 5 cm depth were used and spatially averaged from their native spatial resolution of 250 m to 3 km.Note that since the presence of surface inland open water bodies significantly influences the scattering of GNSS signals, it is necessary to remove observations over regions with relatively high open water percentages.The Global Surface Water Dataset from the Joint Research Centre (GSW-JRC; Pekel et al., 2016) is chosen as it provides satisfactory accuracy in characterizing detailed surface water spatial distribution at 30 m. Particularly, the seasonality map of GSW-JRC indicates the number of months (0− 12) where permanent or seasonal surface water is present.The water percent for each 3-km grid was calculated as the percentage of 30-m grids within the 3km box with the presence of either permanent or seasonal water.This water percent was used to filter CYGNSS data samples as described in Section 2.1.About 11.80% of data samples were excluded after applying the threshold of 2% for removing 3-km grid cells with inland water bodies and an additional 11.78% of data were filtered after applying the elevation mask.
Several Moderate Resolution Imaging Spectrometer (MODIS) land surface products were used to characterize the underlying vegetation conditions.The 500-m, 16-day composite MODIS normalized difference vegetation index (NDVI) from both Terra (MOD13A1) and Aqua (MYD13A1) platforms were combined to increase the temporal F. Lei et al. frequency.These data are accessible from the USGS Land Processes Distributed Active Archive Center (https://lpdaac.usgs.gov/).Similarly, NDVI data products were spatially averaged from their original 500-m gridding to 3-km for each EASE grid box.The 2018 MODIS yearly Land Cover Type (MCD12Q1) product was used.It contains 17 land cover classes provided by the International Geosphere Biosphere Programme (IGBP) land cover scheme.The dominant land cover can be identified based on the percentages of each land cover type at 500 m in a 3-km grid cell.Vegetation water content (VWC) was calculated using the same baseline approach in the SMAP SM retrieval process (O'Neill et al., 2015) that NDVI was used to estimate the foliage water content, and NDVI extremes along with the stem factor were used for estimating the stem water content.Long-term annual maximum and minimum NDVI values were obtained from the NDVI data record from 2015 to 2019.The stem factor was determined for each MODIS IGBP land cover type based on the look-up table provided in O' Neill et al. (2015).

Soil moisture data for independent validation
As ML algorithms tend to reflect the presence of bias in their training data, the use of independent datasets is essential for a robust evaluation of an ML-based data product.For an independent evaluation of CYGNSS SM data, the ground-based measurements made available through the International Soil Moisture Network (ISMN, Dorigo et al., 2011) were used.The ISMN collects data from individual ground-based SM networks.The majority of networks are located within the U.S. and SM measurements are sampled at varying depths in different individual networks.With the preprocessing and quality check conducted via ISMN, the top layer SM measurements were used to be generally consistent with the penetration depth of L-band microwave signals.
Hourly SM data with quality flags identified as 'Good' (Dorigo et al., 2013) were retained and averaged to daily values.
In addition to the validation at sparse networks, a global evaluation of SM retrieval accuracy across different land surface conditions can provide complementary insights by utilizing microwave remote sensing data from other sources and land surface modeling products.Therefore, the C-band Advanced SCATterometer (ASCAT) SM retrievals were included using the Vienna University of Technology (TU-Wien) change detection algorithm (Naeimi et al., 2009).This product has a spatial resolution of 25 km with a grid spacing of 12.5 km and was resampled onto the EASE-Grid 2.0 36-km resolution using the nearest-neighboring approach.Retrievals from both a.m. and p.m. overpasses were averaged into daily data from January 2017 to June 2019.ASCAT retrievals identified with frozen soil conditions were filtered.In addition, the Noah model-based SM estimations from the Global Land Data Assimilation System (GLDAS; Rodell et al., 2004) were used to complement the global evaluation via TC analysis (detailed in Section 2.6).Particularly, the 3hourly GLDAS-2 Noah v3.3 product with a spatial resolution of 0.25 • was temporally averaged to daily values and spatially resampled onto the 36-km grid using the nearest-neighboring method.SM estimates were extracted from the top layer (0-0.1 m) based on the Noah soil discretization scheme.Constrained by the relatively coarse spatial resolution of ASCAT and Noah SM data, the TC analysis was conducted at 36 km using data records obtained between March 2017 and June 2019.

Machine learning framework
Using the abovementioned ancillary data, several features were derived and utilized in combination with CYGNSS observables for the input layer of the ML technique.In total, there were eight input features: (1) reflectivity, (2) TES, (3) incidence angle, (4) DEM elevation, (5) NDVI, (6) VWC, and (7-8) soil silt and clay ratios.The overall framework of the training and testing procedures is depicted in Fig. 1.
To construct the ML model for the quasi-global CYGNSS data, different clustering strategies were deployed to aggregate 3-km grid cells.These clustering strategies are summarized in Table 1.A straightforward way was to construct one single model for the globe, where data from all 3-km grid cells were combined for training and testing.Here, we refer to it as ONE-cluster.Furthermore, the full dataset can be stratified by clustering the CYGNSS global coverage into regular and contiguous geographical boxes.The length of the box was set to 9 km, 72 km, and 288 km in this work.Taking the length of 72 km as an example, in each 72 km by 72 km grid box, all 3-km grid cells falling within this box were gathered and one single ML model was built from these data samples for this box accordingly (referred to as 72KM-cluster hereinafter).Through this approach, the CYGNSS coverage was Fig. 1.A schematic framework of the machine learning (ML)-based soil moisture (SM) retrieval model and validation process.
F. Lei et al. delineated into 15,618 grid boxes with a 72 km by 72 km domain for each box.The number of models for the regularly spaced box clustering strategy is determined by the number of boxes with data over land.Similarly, 9KM-cluster and 288KM-cluster were used and the statistics for each clustering approach are shown in Table 1.
Another clustering technique to aggregate the spatial grids was based on the land cover types and climate zones.Specifically, the 17 IGBP land cover types were reclassified into nine classes, i.e., Forests, Shrublands, Savannas, Grasslands, Wetlands, Croplands, Urban, Barren, and Snow/ Ice/Water.Note that grids classified as Snow/Ice/Water were not considered for SM retrieval, the number of land cover types was reduced to eight.The Köppen-Geiger climate zone (Kottek et al., 2006) information was introduced as an additional criterion to subdivide the global extent.Four major climate zone categories were used, i.e., Equatorial, Arid, Warm Temperate, and Continental, excluding Polar where CYGNSS data were not available or with low quality.Therefore, 32 (8 by 4) combinations were used for classifying the global grids into different land cover/climate zone (LCCZ) classes (referred to as LCCZ-cluster hereinafter).As shown in Table 1, the number of available data samples varies between classes.
The selection of the ML algorithm can have a significant impact on the performance of SM retrievals.Our previous work (Senyurek et al., 2020a) has shown that a Random Forest algorithm is more suitable for CYGNSS SM estimation in comparison with other ML algorithms, such as an Artificial Neural Network or Support Vector Machine.In particular, Random Forest is an ensemble-based decision tree algorithm that provides more flexibility through randomization and an ensemble approach (Abbaszadeh et al., 2019).
Here, the hyperparameters of the Random Forest were determined separately for different clustering approaches.Particularly, one unique hyperparameter set was searched and applied to all available boxes/ classes for each clustering approach.The Random Forest hyperparameters, i.e., number of trees and maximum split size, were optimized using a combinatorial search strategy.The number of trees varied from 10 to 1250 with an interval of 5 and the maximum split size changed from 2 to 1024 with an interval of 2. To speed up the hyperparameter searching process, only 10% of all data were used.For example, there were 15,618 72-km boxes in the 72KM-cluster approach and a total of 1561 boxes (randomly selected) were used for the search process.Each selected 72-km box was trained with one hyperparameter set in the search range, and the overall testing root-mean-square-error was calculated correspondingly.This process was repeated for all hyperparameter sets and the optimal hyperparameter set was determined given the minimum overall testing error.In the 72KM-cluster approach, the optimal numbers of decision trees and the maximum split size were found to be 25 and 16 and were then applied for all 72-km boxes.Table 1 shows the optimal hyperparameter set for each clustering approach.Although the hyperparameters can be further optimized separately for each box of all clustering strategies, this approach was not used given the significantly increased computational cost and reduced feasibility for future applications.For the training and test processes, each model was trained using the data from March 2017 to December 2018.Data from 2019 were used for testing the trained model.Using this year-based cross-validation method, the trained model was evaluated with data from a different time range.Therefore, a model with satisfactory performance can be easily transferred to predict future data without refitting a new modeldemonstrating its generalizability and feasibility for forecasting applications.

Data processing and evaluation metrics
CYGNSS observables were first gridded onto 3 km and all other ancillary data were spatially averaged from their native spatial resolutions to 3 km as described in Section 2.3.Due to the 9-km resolution of SMAP data, SM data for all 3-km grids within one 9-km grid cell were equal.Concurrent CYGNSS observables, ancillary data (i.e., NDVI and VWC), and SMAP SM data were linked in the training period from 2017 to 2018 for ML modeling.The trained model was then applied for all CYGNSS observations from 2017 to 2019 and the performance of the ML model was evaluated at 3 km via the year-based cross-validation approach (see Section 3.1).Predicted CYGNSS SM estimates at 3 km were then spatially aggregated to 9 km and daily values.The CYGNSS daily 9-km product was compared, alongside SMAP, to daily in-situ SM measurements from ISMN as described in Section 2.4.Results from these comparisons are shown in Section 3.2.
Data from ISMN sites within the same 9-km grid were averaged to represent the corresponding 9-km grid cell.Note that ISMN data indicates highly localized SM information.However, since spatial representativeness errors are the same for both CYGNSS and SMAP, it is expected that the evaluation against (localized) ISMN observations stills represents a relatively robust approach for assessing CYGNSS against SMAP (Dong et al., 2020).The spatial variability and intra-month deviation of CYGNSS SM are compared to SMAP at 9 km in Section 3.3.For the analysis of CYGNSS global accuracy, 9-km CYGNSS SM data were spatially averaged to 36 km and evaluated with ASCAT and GLDAS Noah via the TC analysis with results shown in Section 3.4.A schematic structure of the three-step evaluation of CYGNSS SM at three different spatial resolutions is shown in Appendix A.
As described in Section 2.5, the performance of the ML-based retrieval model was first evaluated through a year-long cross-validation exercise.The independent evaluations for CYGNSS SM estimates were conducted in two ways in addition to this cross-validation.First, both CYGNSS and SMAP SM data were compared against the in-situ measurements at all valid ISMN sites.The ISMN sites were selected with at least 100 concurrent data samples for both CYGNSS and SMAP.A minimum correlation coefficient between ISMN and SMAP SM was set to 0.15 [− ] for eliminating sites with marginal spatial representativeness for a 9-km grid cell.The metrics used include the Pearson's correlation coefficient (R), root-mean-square-error (RMSE, or root-mean-squaredifference RMSD referred to in the cross-validation), unbiased RMSE/ RMSD, and mean absolute error (MAE).
Moreover, the error patterns of CYGNSS SM were explored through TC analysis (Stoffelen, 1998).TC is a technique that has been utilized for analyzing random errors in large-scale remote sensing or model products.For geophysical variables with three estimates that have independent errors, the random errors between each estimate and the theoretical truth can be derived assuming each product is linearly related with the truth and errors are both orthogonal (i.e., independent of true conditions) and mutually independent.Specifically, the random error variance of one product can be derived as where X, Y, and K denote different products of the variable, ε X is the random error of X, and the operator Cov [•] and Var[•] denote the covariance and temporal variance, respectively.More details of the TC analysis can be found in (Gruber et al., 2016).Here, an extended TCbased correlation coefficient (R TC ) was used (McColl et al., 2014) to characterize the correlation between product X and the theoretical truth T as Note that TC analysis will be biased by the presence of crosscorrelated errors.Therefore, it should only be applied to products acquired from independent sources.Here, the active microwave-based ASCAT and numerical model-based GLDAS Noah SM products (described in Section 2.4) were used in conjunction with CYGNSS to analyze the accuracy of CYGNSS SM retrievals at the global scale.

Results
In this section, the performance of CYGNSS SM predictions using the ML model is evaluated from several perspectives as described in Section 2.6.

Model performance via year-based cross-validation
As described in Section 2.5, global grids are clustered through multiple approaches, i.e., 9KM-, 72KM-, 288KM-, LCCZ-and ONE-cluster settings.The ML models are trained and tested independently for these different clustering strategies.For both the training period of March 2017 to December 2018 and the testing period of 2019, the ubRMSD and R metrics between CYGNSS ML predictions and SMAP data are calculated for each 3-km grid cell.Global median values are shown in Fig. 2 with vertical bars indicating the range between 25% and 75% percentiles.From Fig. 2a, it is clear that the overall training accuracy of CYGNSS predictions is generally improved with an increased number of models with smaller ubRMSD and higher R values when moving from ONE-to 72KM-cluster approaches, except for the 9KM-cluster approach.Meanwhile, the performance of ML models tends to be more stable with a slightly narrower distribution of ubRMSD across different grid cells for 72KM-cluster as compared to other settings.However, both ubRMSD and R medians are highest for the 9KM-cluster approach.Note that the sample size is largely reduced for each stratum in the 9KM-cluster approach as shown in Table 1.The cross-validation results in Fig. 2b generally show a similar pattern with the training results, i.e., ubRMSD is reduced and R is improved with enhanced stratification strategies.Particularly, 72KM-cluster shows the lowest ubRMSD and highest R values.Also, note that R value decreases in the testing process for the 9KM-cluster approach, suggesting that a reasonable sample size is important for ML models.Accordingly, based on both training and test results, the 72KM-cluster approach appears to be the optimal strategy in our application of the ML model for CYGNSS SM retrieval.
To further demonstrate spatial variations in model performance, two typical clustering approaches are examined in detail with the statistics shown in Table 2, i.e., the 72KM-cluster and LCCZ-cluster.Fig. 3 shows the spatial maps of ubRMSD and R between CYGNSS ML predictions and SMAP data for each 3-km grid cell during the training period.The global median values of ubRMSD and R derived via the 72KM-cluster approach are 0.0380 cm 3 /cm 3 and 0.5238 [− ], respectively.Generally, high correlations and low ubRMSDs are found over regions with moderate vegetation density and semi-arid/arid areas, such as the western U.S., Sahel, south Africa, and central Australia.For highly vegetated regions, including the eastern U.S., Amazon basin, Congo basin, and southeastern Asia, the correlations are mostly smaller than 0.4 [− ] and ubRMSD can be even greater than 0.08 cm 3 /cm 3 .Note that, these regions with high uncertainties are where SMAP retrievals tend to be less reliable due to the increased VWC impeding the SM retrieval process over densely vegetated areas.When only considering grids with VWC less than 5 kg/m 2 -a threshold used for determining the mission accuracy requirement for SMAP SM retrievals (Entekhabi et al., 2010), the median of ubRMSD decreases to 0.0312 cm 3 /cm 3 and R increases to 0.5815 [− ] as shown in Table 2.
Comparing the performances of the LCCZ-cluster (Fig. 3c, d) and 72KM-cluster (Fig. 3a, b) approaches, smaller ubRMSD and higher R values are found when the 72KM-cluster approach is utilized.Obvious differences in ubRMSD are primarily seen for the Amazon and Congo basins where the LCCZ-cluster approach leads to higher ubRMSD and smaller R. It is expected that the accuracy of an ML model can be enhanced through the increased sampling of similar land surface characteristics and reduced signal noise.For all available grids, ubRMSD increases from 0.0380 cm 3 /cm 3 to 0.0420 cm 3 /cm 3 and R decreases F. Lei et al. from 0.5238 [− ] to 0.3579 [− ] when switching from the 72KM-cluster to LCCZ-cluster approach.Similar trends are also found for the RMSD and MAE metrics.Therefore, the 72KM-cluster approach appears to be more suitable for aggregating the global 3-km grid cells in this work.
As described in Table 1, the sample size varies across different combinations for the LCCZ-cluster and each 72-km box of the 72KM-cluster approach.Nevertheless, under most circumstances, ubRMSD and R values obtained from the training time range (Fig. 3) are comparable to those of the testing period (Fig. 4).This suggests that the performance ML is relatively stable from the training period of 2017-2018 to the testing period of 2019.Slight degradations are found for regions with dense vegetation (such as the Amazon basin, Congo

Table 2
Statistics of the ML-based model in training and testing cases between CYGNSS predictions and SMAP reference data for the 72KM-and LCCZ-cluster approaches.Medians are calculated for metrics including root-mean-square-difference (RMSD), unbiased RMSD (ubRMSD), correlation coefficient (R), mean absolute error (MAE), and normalized unbiased RMSD (NubRMSD) with respect to SMAP SM standard deviation.F. Lei et al. basin, and southeastern Asia) or arid conditions (such as Sahara and central Australia) by comparing Fig. 4 against Fig. 3.The ubRMSD slightly increases from 0.038 cm 3 /cm 3 to 0.0395 cm 3 /cm 3 for training and testing periods as for the 72KM-cluster approach.Despite the slight degeneration seen in the year-based test, the performance of the 72KMcluster approach is still superior to the LCCZ-cluster approach with consistently smaller RMSD, ubRMSD, MAE, and higher R values as shown in Table 2.Note that the SM temporal variability differs significantly from low vegetated and dry areas to highly vegetated and wet areas.Normalized ubRMSD (NubRMSD) with respect to the standard deviation of the reference data (i.e., SMAP SM) can indicate the degree of temporal variability that is explained by the retrievals (i.e., CYGNSS estimates).A small NubRMSD value generally suggests a better model predictability.Fig. 5 shows NubRMSD values derived from both the 72KM-cluster and LCCZ-cluster for the training and testing periods.In consistent with Figs. 3 and 4, the 72KM-cluster approach is generally superior to the LCCZ-cluster approach with higher predictability, i.e., smaller NubRMSD.The global median of NubRMSD of all grids is 0.0987 [− ] and 0.1066 [− ] for the 72KM-cluster and LCCZ-cluster approaches, respectively.As shown in Fig. 5, the model has better predictability in regions with low (such as central Australia and Sahara) to moderate (such as Eastern U.S.) vegetation density.Overall, these results further support the utilization of the 72KM-cluster approach.As a result, all CYGNSS SM results presented in the following sections are based on 72KM clustering.

Validation at in-situ sites
The CYGNSS SM estimates are then compared against the in-situ measurements acquired at ISMN sites.Fig. 6 shows three representative sites from the Soil Climate Analysis Network (SCAN), U.S. Climate Reference Network (USCRN), and COsmic-ray Soil Moisture Observing System (COSMOS) networks where the CYGNSS SM retrievals have smaller or comparable ubRMSE values with respect to SMAP data.As shown in Fig. 6a and b, ubRMSE between CYGNSS SM and in-situ data is 0.0376 cm 3 /cm 3 where CYGNSS estimates generally follow closely with the temporal dynamics of the ground-based measurements.Comparatively, SMAP overestimates/underestimates SM for wet/dry conditions with greater temporal variations at this site.For the Edinburg site (Fig. 6c and d) in the USCRN network, both CYGNSS and SMAP have a consistent temporal range and dry-down patterns when compared to insitu data.On the other hand, CYGNSS tends to have smaller temporal variations with regard to both in-situ and SMAP data at the SMAP-OK site (Fig. 6e and f).
Overall, CYGNSS SM estimates show a good agreement with groundbased wetting and drying trends in the SM dynamics.The performances of CYGNSS and SMAP are comparable with the median ubRMSEs of 0.0543 cm 3 /cm 3 and 0.0522 cm 3 /cm 3 respectively for all examined ISMN sites, suggesting that comparable performance is achieved by CYGNSS and SMAP.However, correlations are generally smaller between CYGNSS and in-situ data as compared to SMAP across most networks as shown in Table 3.These small correlations of CYGNSS data is mainly due to a low bias in temporal variance.In Fig. 7, all CYGNSS and SMAP SM estimates are compared with the ISMN in-situ measurements.Consistent levels of overall ubRMSE and R values are found for both CYGNSS and SMAP retrievals.Both CYGNSS and SMAP slightly overestimate SM for data less than 0.10 cm 3 /cm 3 .In addition, CYGNSS underestimates SM for wet conditions as observed in Fig. 6e.

Spatial and temporal variations
The global spatial variability of CYGNSS is compared with SMAP in Fig. 8 for July 2019.Both CYGNSS and SMAP have quite similar spatial patterns with regard to relative levels of aridity and wetness.Particularly for some arid and semi-arid regions, such as northern Africa, southern Africa, Arabian Peninsula, and central Australia, the differences of July mean SM data between CYGNSS (Fig. 8a) and SMAP (Fig. 8c) are marginal.For some regions with dense vegetation, CYGNSS SM is generally lower than SMAP, including Amazon, Congo, India, and southeastern Asia, as shown in Fig. 8b and d.Nevertheless, the overall spatial variations between CYGNSS and SMAP are similar, indicating that the CYGNSS SM has a reasonable mean spatial pattern.
Across all four seasons, differences between CYGNSS and SMAP mean SM estimates are relatively small for regions with light vegetation canopies as shown in Fig. 9, such as central Australia and the western U. S.However, CYGNSS generally underestimates SM levels over the eastern U.S. and the Amazon all year-round.In India, wetter CYGNSS conditions are found for the winter (Fig. 9a) and spring (Fig. 9b) while dryer conditions are seen in summer (Fig. 9c) and fall (Fig. 9d).Similarly, either underestimations or overestimations are observed for southeastern Asia, the Congo Basin, and the Cerrado region in South America in different seasons.As the ML model is trained for optimizing the long-term mean, it is expected that seasonal biases with regard to the long-term mean can vary from season to season.Nevertheless, the differences between CYGNSS and SMAP are mostly small and within the F. Lei et al. range of ±0.05 cm 3 /cm 3 .
In addition to examining the spatial pattern of the monthly mean, it is critical to diagnose the intra-month range of the SM data.Here, Fig. 10 demonstrates the standard deviations of CYGNSS and SMAP SM estimates for July 2019.Despite consistent mean maps in Fig. 8, the standard deviations between CYGNSS and SMAP are found to have relatively large discrepancies.Overall, CYGNSS SM retrievals have a much smaller temporal standard deviation than comparable SMAP estimates (Fig. 10b  and d).This tendency is also observed in the comparison against in-situ measurements shown above in Fig. 6e.

Evaluation at quasi-global scale
Moreover, using the TC approach described in Section 2.6, the  correlations between CYGNSS SM data and the theoretical truth can be calculated at the global scale to further quantify the accuracy of CYGNSS SM data.Fig. 11 demonstrates the spatial distribution of TC-based correlation R TC .It is worth emphasizing that although the TC approach can serve as an independent technique for characterizing the accuracy of SM products, it requires at least three independent data sources with sufficient collocated samples for robust sampling.Therefore, the availability of R TC estimates are limited for grids with more than 100 collocated data samples.In addition, the TC analysis is conducted for SM anomalies by removing the seasonal climatology in all three SM time series, i.e., CYGNSS, ASCAT, and NOAH.Note that since R TC indicates the accuracy of SM anomaly variability superimposed on a SM climatology, it does not capture the impact of climatological errors.Based on Fig. 11, relatively high R TC are seen for arid regions with light vegetation, including

Optimizing random forest
In the application of regression-based ML for fitting input features with respect to target variables, a low bias in temporal variability is generally expected between fitted and target quantities as a consequence of the considered cost function to be minimized (Mehta et al., 2019).Such a tendency is evident as shown in Fig. 10 that the (fitted) CYGNSS  F. Lei et al.SM retrievals have smaller temporal variations as compared to SMAP.To improve the performance of ML for predicting SM as presented in Section 3.1, it is important to consider adding regularization when optimizing the cost function.Typically, the use of a regularization term in the cost function can lead to the selection of fewer weak learners in regression ensembles and possibly ameliorate over-fitting issues.However, a small temporal standard deviation does not guarantee the existence of over-fitting.To examine the potential of regularization for improving the performance of ML model, we apply the L 1 regularization penalty which is also called Least Absolute Shrinkage and Selection Operator (LASSO) to the cost function of the ensemble regression model as where λ is the LASSO unitless parameter with value greater than zero, α t learner weight, and f t a weak learner in the ensemble trained on N sets of input features β n (as described in Section 2.5), reference SM θ n , and weights w n .The numbers of observations N and ensemble trees T are shown in Table 1.The cost function without regularization is the first term of Eq. ( 4).By adding this regularization term, both the training error and variance in the time domain can be examined.Previous results presented in Section 3.1 (i.e., the 72KM-cluster results) are based on ML models without a regularization term (indicated as 'N/A' in Fig. 13).By adding the regularization term with λ varying from 1e-2 to 1e-5, the global median values of RMSD and MAE are shown in Fig. 13 for both the training and testing periods.These results demonstrate that a suboptimal λ can lead to increased training and testing errors, e.g., λ equals 1e-2 or 1e-3.In contrast, λ smaller than 1e-4 leads to smaller errors than that without a regularization term in our applications here.However, even a very small λ value (< 1e-5) does not guarantee further reductions in errors (results not shown here).Overall, adding this regularization term only results in a slight decrease in the training and testing errors.More importantly, it is critical to examine the changes introduced in the temporal standard deviation of CYGNSS SM estimates by adding the regularization term.Fig. 14 shows the latitudinal and longitudinal averages of the intra-month mean (Fig. 14a, c) and standard deviation (Fig. 14b, d) of CYGNSS and SMAP SM data in July 2019.The black line is the SMAP data and the blue line is the CYGNSS spatial trend without applying a regularization term.By including a regularization term, both intra-month mean and standard deviation have been increased to be  closer to the SMAP baseline when λ is equal to, or smaller than, 1e-4.Note that a λ of 1e-2 leads to largely increased temporal variation at the cost of reduced accuracy in the mean.However, even when λ equals 1e-5, improvements in both mean and standard deviation are relatively marginal compared to those obtained without regularization.Consistent results are found for different regression parameter λ values based on the TC analysis (provided in the Supplementary Materials).Therefore, we opt for generating the final CYGNSS product without applying regularization in this work.

Spatial and temporal frequency
Taking into account its spatial and temporal resolution, as well as the number of available satellites (one satellite of SMAP versus the CYGNSS constellation of eight micro-satellites), it is interesting to examine the temporal data availability of CYGNSS SM at different spatial scales.Fig. 15 shows the ratio of available CYGNSS data against SMAP at either 9-km or 36-km resolution.Note that the availability of CYGNSS has been remarkably reduced by many pre-processing filters including the DEM elevation limit (maximum elevation is 1500 m).Nevertheless, the ratios between CYGNSS and SMAP daily sample availability (i.e., days with available CYGNSS retrievals divided by days with available SMAP retrievals) between ±38 • latitudes are mostly greater than 0.6 [− ] at 9 km.More importantly, the ratios increase significantly at 36 km with the majority larger than 1.2 [− ].By integrating CYGNSS and SMAP, the resulting enhanced frequency and spatial resolution can greatly benefit the global SM databases for various hydrologic, ecologic, and meteorological applications.However, it is critical to consider relative retrieval errors in both CYGNSS and SMAP SM products when fusing them.Note that, in this work, the retrieval process of CYGNSS involves  the integration of various remotely sensed land surface products with SMAP SM data.Future investigation about the impact of these data sources on retrieval accuracy will allow for the further improvement of the CYGNSS SM retrieval process.

Error sources and product accuracy
CYGNSS receives forward scattered signals that are transmitted by the GNSS and reflected off the Earth's surface.Given the intrinsic bistatic feature of GNSS-R technique and the changing pattern of the transmitting and receiving satellites, CYGNSS collected observables are pseudo-randomly positioned with irregular paced spatial and temporal resolution.In contrast to typical remote sensing techniques that have repeatable swaths and the same local acquisition time, the regular spatial mapping of CYGNSS data is challenged by the need to assign an appropriate spatial grid size.The spatial resolution of the CYGNSS signal over land is particularly affected by surface roughness at and near the reflection point, as well as the surface dielectric constant.The spatial resolution of CYGNSS observations can range from the first Fresnel zone (0.5 km) with coherent reflection to the glistening zone (i.e., dozens of kilometers) with incoherent reflection.Furthermore, CYGNSS observables that fall within an area defined by a regular spatial grid are acquired at different times of the day.Therefore, traditional mapping with a regular spatial grid and an integrated time step fails to account for the complexity of CYGNSS signal spatial and temporal resolution.Nonetheless, CYGNSS observables need to be transformed in consistent with other remote sensing and modeling dataa process that can introduce errors into the reflectivity used in this study and all other related works (Chew and Small, 2018;Kim and Lakshmi, 2018;Al-Khaldi et al., 2019;Clarizia et al., 2019;Calabia et al., 2020;Yan et al., 2020;Yueh et al., 2020;Jia et al., 2021).As described in Sections 2.1 and 2.6, CYGNSS reflectivity is mapped onto 3 km and the retrieved SM estimates are then averaged to 9 km and daily values.Further investigation of the optimal spatial grids for observations with different coherent and incoherent reflections is needed and information from sub-daily variations can be utilized for diurnal analysis.In addition, given reliable high-resolution ancillary data, the impact of land surface heterogeneity on the accuracy of CYGNSS signals and SM retrievals needs to be examined further.
A reference data set is required for relating the CYGNSS observables with SM conditions, as is the case with all current available CYGNSS SM retrieval approaches.The SMAP enhanced SM product is used here, and the nonlinear relationship between CYGNSS reflectivity and SMAP SM is determined via a ML approach.Because SMAP is also a SM product with its own instrumental and retrieval errors, SMAP estimation errors are expected to be propagated onto CYGNSS estimates.Furthermore, the original dynamics of the reference data theoretically limit the fittingbased estimates.When the training data sets are weakly related, the range of estimates can be significantly reduced, as shown in Fig. 6e.In addition, due to the inter-annual variability of SM and land surface conditions, a model trained with a limited time range may have lower predictability for data obtained from years with obvious differences versus the training years.For example, as demonstrated in the Supplementary Material, a model trained with data from 2017 to 2018 shows a slight degradation in model performance when applied to predict SM in 2020 versus 2019.On the other hand, the retrieval approach presented in this work would benefit from more accurate high-resolution ancillary data, particularly enhanced data for surface water masking.
As shown in Table B1, previous CYGNSS SM inversion studies have used either SMAP or in-situ observations as the fitting reference.However, the direct comparison of our product to these earlier CYGNSS SM estimates is hampered by differences in: (1) CYGNSS derived observables, (2) spatial and temporal gridding/averaging assumptions, (3) ancillary data requirements, (4) reference and validation data sets, and (5) spatial and temporal coverages.Nevertheless, a thorough comparison of the performance of various retrieval algorithms needs to be conducted consistently and systematically.When compared to SMAP data, our previous work (Senyurek et al., 2020b) using ISMN data as the reference data for quasi-global SM mapping achieved an ubRMSD of 0.049 cm 3 /cm 3 and R of 0.63 for 3-day averages.It is important to note even a model trained with (ground-based) ISMN data can have its error sources, including scale representativeness error and limited land surface conditions.Here, in this study, the trained model is directly F. Lei et al. evaluated via independent year-long, site-based, and TC-based validations.The daily 9-km CYGNSS SM has an ubRMSD of 0.0395 cm 3 /cm 3 and R of 0.4006 when compared to SMAP data in 2019 and ubRMSE of 0.0543 cm 3 /cm 3 and R of 0.4623 when compared to ISMN data from 2017 to 2019.

Conclusions
Here, daily 9-km quasi-global CYGNSS SM estimates are derived by incorporating various land surface characteristics and CYGNSS observables as inputs and SMAP data as target values within a Random Forest algorithm.To facilitate the global retrieval process, different clustering approaches are implemented to aggregate 3-km grid cells, i.e., a regularly spaced delineation approach that combines all 3-km grid cells within a box based on the geographic proximity rule with lengths varying from 9 km, 72 km to 288 km, a cluster approach for grids with similar land cover/climate zone conditions, and one single model for all global data.Results show that these approaches all facilitate a robust trained model for predicting SM at the test period (independent from the training time range).Generally, both training and testing errors are reduced and correlations are improved with the increase in the number of ML models.Among the examined clustering options, the 72KM-cluster approach generates the best SM estimates with the smallest RMSD and highest R. Sequentially, this 72KM-cluster is applied for deriving the CYGNSS quasi-global SM estimates.Through independent validation against in-situ measurements, the CYGNSS SM data demonstrates a moderate performance with the ubRMSE of 0.0543 cm 3 /cm 3 over more than 100 sparse sites.The comparison of spatial variability between CYGNSS and SMAP suggests that the averaged CYGNSS SM has similar spatial patterns with respect to SMAP for all four seasons.Some moderate over-and under-estimations are found for regions with dense vegetation canopy.However, these differences are not consistent across different seasons.Moreover, the performance of CYGNSS SM is further examined through the TC product evaluation technique.Based on TC results, relatively high R TC are found for arid regions with light vegetation and relatively low R TC are mostly over heavily vegetated areas.Particularly, median values of R TC are over 0.6 [− ] for shrublands and grasslands.Therefore, the CYGNSS SM can be a reliable supplement to the current global SM databases in facilitating hydrologic and meteorological applications.Note that once the ML-based retrieval model is trained and validated, it can be applied directly to future CYGNSS observables without the requisite of other L-band passive microwave SM products, such as SMAP or SMOS.To foster the application and usage of this CYGNSS SM product, the team at Mississippi State University has launched a website for distributing this 3-year CYGNSS SM product.The website URL is https://www.gri.msstate.edu/research/ssm.Users can obtain information and download the data from this website.The product will be regularly updated with newly available data.Future work will be mainly focused on eliminating the impact of surface inland water bodies on the SM retrieval accuracy, enhancing the spatial/temporal resolution by optimally fusing CYGNSS and SMAP data, and further reducing existing dependency on ancillary data for the ML framework.

Appendix B Table B1
The retrieval approach and accuracy of currently published CYGNSS soil moisture products (adapted from Senyurek et al., 2020b).

Fig. 3 .
Fig. 3. Training period evaluation from March 2017 to December 2018 for ubRMSD [cm 3 /cm 3 ] (subplots a and c) and R [− ] (subplots b and d) between CYGNSS and SMAP SM data.Results from the 72KM-cluster and LCCZ-cluster approaches are shown on the top and bottom rows, respectively.

Fig. 4 .
Fig. 4. Testing period evaluation from January 2019 to December 2019 for ubRMSD [cm 3 /cm 3 ] (subplots a and c) and R [− ] (subplots b and d) between CYGNSS and SMAP SM data.Results from the 72KM-cluster and LCCZ-cluster approaches are shown on the top and bottom rows, respectively.

Fig. 5 .
Fig. 5. Normalized unbiased RMSD (NubRMSD) with respect to SMAP SM standard deviation from the 72KM-cluster (top row) and LCCZ-cluster (bottom row) approaches for the training (left panel) and testing (right panel) periods.

Fig. 6 .
Fig. 6.SM time series from the International Soil Moisture Network (ISMN), SMAP, and CYGNSS at three sites from (a) Soil Climate Analysis Network (SCAN), (c) U. S. Climate Reference Network (USCRN), and (e) COsmic-ray Soil Moisture Observing System (COSMOS) networks.Zoom-in subplots are shown on the right panel.Precipitation data for the SMAP-OK site is from the nearest ISMN site with rainfall data.

Fig. 7 .
Fig. 7. Scatter plot between ISMN SM and (a) CYGNSS or (b) SMAP for all sparse networks.Data points are colored according to the number of samples at the scatter plot position.R and ubRMSE are computed with all data samples.

Fig. 10 .
Fig. 10.Spatial maps of the intra-month standard deviation of (a) CYGNSS and (c) SMAP SM retrievals for July 2019.Latitudinal and longitudinal averages are plotted in (b) and (d), respectively.

Fig. 12 .
Fig. 12. Boxplot of R TC aggregated for different land cover types with medians shown as red dots and 15% to 85% percentiles as dotted bars.The number of grids for each type is shown in gray bar.(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 13 .
Fig. 13.Changes in the (a) RMSD and (b) MAE as a function of the regularization parameter λ.

Fig. 14 .
Fig. 14.Latitudinal and longitudinal averages of intra-month mean and (temporal) standard deviation (std.) of CYGNSS and SMAP data in July 2019.

Fig. 15 .
Fig. 15.The ratio of CYGNSS data samples with regard to SMAP data at (a) 9 km and (b) 36 km.

Fangni Lei :
Fig. A1.Structure of the CYGNSS SM evaluation at different spatial resolutions.

Table 1
Statistics for the different clustering strategies used to delineate global coverage.9KM-, 72KM-, 288KM-clusters refer to regularly spaced box clustering approaches with different lengths.LCCZ-cluster indicates clustering based on the same land cover/climate zone condition.ONE-cluster refers to a single model for all global grid cells.The number of samples includes both training and testing samples.

Table 3
Validation statistics between CYGNSS/SMAP retrievals and ground-based SM measurements for different networks and all examined ISMN sites.Metrics are calculated separately for each site with median values shown for each network or across all sites.
F.Lei et al.