Comparison of Data‐Driven Techniques to Reconstruct (1992–2002) and Predict (2017–2018) GRACE‐Like Gridded Total Water Storage Changes Using Climate Inputs

The Gravity Recovery and Climate Experiment (GRACE) mission ended its operation in October 2017, and the GRACE Follow‐On mission was launched only in May 2018, leading to approximately 1 year of data gap. Given that GRACE‐type observations are exclusively providing direct estimates of total water storage change (TWSC), it would be very important to bridge the gap between these two missions. Furthermore, for many climate‐related applications, it is also desirable to reconstruct TWSC prior to the GRACE period. In this study, we aim at comparing different data‐driven methods and identifying the more robust alternatives for predicting GRACE‐like gridded TWSC during the gap and reconstructing them to 1992 using climate inputs. To this end, we first develop a methodological framework to compare different methods such as the multiple linear regression (MLR), artificial neural network (ANN), and autoregressive exogenous (ARX) approaches. Second, metrics are developed to measure the robustness of the predictions. Finally, gridded TWSC within 26 regions are predicted and reconstructed using the identified methods. Test computations suggest that the correlation of predicted TWSC maps with observed ones is more than 0.3 higher than TWSC simulated by hydrological models, at the grid scale of 1° resolution. Furthermore, the reconstructed TWSC correctly reproduce the El Nino‐Southern Oscillation (ENSO) signals. In general, while MLR does not perform best in the training process, it is more robust and could thus be a viable approach both for filling the GRACE gap and for reconstructing long‐period TWSC fields globally when combined with statistical decomposition techniques.

However, after more than 15 years in orbit, the GRACE mission ended its operation in October 2017, and its successor-the GRACE Follow-On (GRACE-FO) mission-was only launched in May 2018, leading to approximately 1 year of missing data. Although several alternative sensors and data processing techniques have been proposed to derive surface mass change maps prior to the GRACE period and during the gap and our tests indicate that the extrapolated (up to 6 years in our case) gridded total water storage change maps, in all study regions, have much higher correlation with the observed GRACE data as compared to simulated water storage changes from hydrological models.
Following this introduction, in section 2, the unified framework and the details of the applied data-driven methods are described. In section 3, we introduce the GRACE, Swarm, climate, and hydrological data, and in section 4, experiment results and discussions are provided for 26 study regions. We provide conclusions in section 5.

Unified Methodological Framework
For an unbiased comparison, we deem it necessary to place methods in a unified framework first, to assess their prediction skills with identical input and validation data (see Figure 1), for the same region and using the same metrics.
In general terms, all methods utilize decomposition methods (usually ICA or PCA, Forootan et al., 2014) to partition the GRACE TWSC maps over a specific region and the suspected climate drivers (we employ precipitation, land surface temperature, climate indices, and SST) into spatial patterns and temporal modes. Then, time series analysis methods such as the STL procedure (Cleveland et al., 1990) or a simple least squares (LS) fitting method are used to further separate the individual modes of GRACE and climate data into typically a linear trend, seasonal, interannual, and the residual part. Third, the seasonal and Figure 1. Illustration of the data flow in our unified framework for comparing different data-driven methods. TWSC means total water storage change; P for precipitation; T for land surface temperature; and SST for sea surface temperature. ICA/PCA are independent and principal component analysis techniques; LS/STL are least squares and seasonal-trend decomposition based on loess procedure; and ANN/ARX/MLR are artificial neural network, autoregressive exogenous, and multiple linear regression models.
de-seasoned (i.e., interannual and residual) components of the GRACE temporal modes are then reconstructed or predicted from empirical relationships between the temporal modes of GRACE and climate data as identified for a training period from either ANN, ARX, or MLR method. Eventually, the GRACE-derived linear trend is commonly added to the reconstructed seasonal and de-seasoned components to extrapolate the full GRACE temporal modes. Here we would like to mention that linear and other long-term (e.g., accelerated) trends in GRACE data are often caused by ice and glacier melt, dam management, and human water abstractions, and these factors could vary over time, so it may lead to misinterpretation when one simply extrapolates GRACE trends. Furthermore, the long-term trend (estimated over 10 years of GRACE data) could be affected by interannual and decadal variability, which might also bias the trends estimation. A focus on the reconstruction of total water storage trends would require a specific regional treatment of all factors mentioned above, which is out of the scope of this study. Finally, GRACE-like gridded TWSC maps are reconstructed and predicted by combining the GRACE-derived spatial patterns with the reconstructed temporal modes. It is, however, unclear how the choice of the particular spatial and temporal decomposition methods affects the skills of the forecasted maps. Moreover, no systematic study exists which compares the sensitivity of the predictions or reconstructions to the type of empirical relationship which is trained from GRACE and climate data residuals. Therefore, in what follows we derive different combinations of methods and assess them within our unified framework. Here, we will focus on isolating the sensitivity of the prediction with respect to one group of techniques while keeping the other two groups consistent, that is, we will firstly employ the method combinations PCA-LS-ANN, PCA-LS-ARX, and PCA-LS-MLR to compare the performances of three predictive models (i.e., ANN, ARX, and MLR) and identify the most robust method for the TWSC prediction, and then this strategy will be applied to compare the (a) two spatiotemporal decomposition techniques ICA and PCA or (b) two time series decomposition techniques LS and STL. Figure 1 visualizes the flow of computation via these various combinations; here we will select several (e.g., m) climate predictors for each decomposed component (i.e., seasonal, interannual, or residual) of detrended GRACE total water storage change temporal modes. For instance, if we identify six dominant (as to reconstruct a given percentage of the signal energy, e.g., 95%) GRACE temporal modes for a specific region, then we will reconstruct 18 (six modes multiplied by three components) GRACE components using m × 18 relevant predictors.

Signal Separation Methods 2.2.1. Spatiotemporal Decomposition
The GRACE maps have a resolution of approximately 300 km, but it would be computationally expensive and not justifiable to try to predict or reconstruct unobserved TWSC at each (e.g., 100 km) grid point globally. Thus, for dimensionality reduction, we assume that one will use a statistical decomposition method to identify the main patterns and modes of GRACE-observed TWSC and continue with predicting only their temporal evolution. In the end, gridded GRACE signals will be reconstructed assuming that the dominant spatial patterns do not change over time. This is a caveat of all methods; however it is a common assumption in the reconstruction of unobserved climate data which, for example, has been used for GRACE in Becker et al. (2011) andForootan et al. (2014) or for sea level reconstruction from tide gauges in Church et al. (2004).
The PCA method seeks to separate the original data (i.e., GRACE and climate signals) into orthogonal spatial patterns (EOFs) and their associated temporal modes (EOF modes) as follows (Wold, 1987): The data matrix X n × t , with n rows for each spatial grid cell and t columns for each epoch, represents the mean-centered original data. Columns of E n × n contain the separated spatial patterns and rows of T n × t the related temporal evolution. The first r dominant modes will contribute to the majority of the original signal (Wold, 1987). In this case, we choose r as to retain 95% of the total signal variance. The original matrix can be approximately restored by

10.1029/2019WR026551
Water Resources Research where b X n × t represents the restored signal, columns of b E n × r are the r dominant EOFs, and rows of b T r × t are the associated EOF modes. Forootan and Kusche (2012) suggested that to replace the PCA method by independent component analysis (ICA) in GRACE signal decomposition, motivated by the assumption that physically independent real-word processes will more likely exhibit statistical independence than orthogonality (Forootan et al., 2014). In the ICA method one additionally rotates the dominant EOFs and the temporal modes with a rotation matrix R to maximize their statistical independence: where the rotated b E n × r b R r × r are then denoted as independent components (ICs) in the context and the b R T r × r b T r × t represent the temporal modes (IC modes) identified by the ICA method. Several methods have been proposed to determine the rotation matrix R, based on different procedures to define statistical independences (e.g., minimizing third-order or fourth-order statistical cumulants). Here we employ the method introduced by Cardoso (1999) and implemented in Forootan and Kusche (2012).

Time Series Decomposition
In order to retain and extrapolate the observed trends, and to apply statistical prediction techniques on the anomalous signals only, one commonly partitions the temporal evolution of observed EOF/IC modes into (a) a linear trend, (b) interannual, (c) seasonal, and (d) residual signals. These components are then considered individually. When decomposing time series into seasonal, interannual, and so forth components, observation errors are typically not taken into account, and they are not considered in this study as well. Drawing on Figure 1, we consider the least squares (LS) and seasonal-trend decomposition based on loess (STL) methods to estimate these deterministic signals in the observed modes. In the LS method, linear trend, interannual, and seasonal components are estimated, for example, by linear regression, segmented cubic polynomial function, and annual sine-waves, and the residual or anomalous signal is obtained by removing these. In the segmented cubic polynomial, the first step is to partition the total time series into several shorter time series (or subseries), and then the second step is to fit each subseries using a cubic polynomial function: where y(t), t = 1, 2, 3, … , n represent the segmented subseries, t is the time, and n represents the length of the subseries. In this case, we set n = 19 months corresponding to the smoothing parameter that was used in the STL procedure for decomposing the interannual component as described in Cleveland et al. (1990).
STL is a filtering procedure, which was introduced by Cleveland et al. (1990) and applied by, for example, Baur (2012), Frappart et al. (2013), Hassan and Jin (2014), and Humphrey et al. (2016) to GRACE data; it allows decomposing a time series into three components trend (i.e., linear trend and interannual in this study), seasonal, and residual: where Y is the original time series; T c , S c , and R c represent the trend, seasonal, and residual components separated from the original time series, respectively; and c is the cycle-index in the inner loop of the STL procedure (Cleveland et al., 1990). The trend decomposed by STL comprised of linear trend and interannual components, so we then separate the linear trend using linear regression, and the interannual is obtained by removing the linear trend from the STL-derived trend. Thus, equation 5 can be expressed as where L c and I c are the linear trend and interannual components of the original time series. In this study, we implemented STL as in Cleveland et al. (1990).
STL, which consists of applying a sequence of smoothing operations, is computationally more intensive and generally retains more detailed features of the acquired time series when fitting seasonal components as 10.1029/2019WR026551 Water Resources Research compared to the LS method. Such difference between STL and LS may also lead to different prediction results when applying them in the prediction or reconstruction of GRACE total water storage changes as described in section 2.1. In this study, we employ and compare both LS and STL methods and assess their performances for the TWSC predictions.

Three Predictive Models
Predictive models seek to learn a relationship between a group of predictors (here, precipitation, temperature, sea surface temperature fields, and climate indices) and the target variable (gridded GRACE TWSC) (Bishop, 2006). Previous studies have successfully employed artificial neural network (ANN), autoregressive exogenous (ARX), and multiple linear regression (MLR) models to predict and reconstruct GRACE TWSC time series. In this study, we place these three models in a unified framework and compare them with the focus on prediction of GRACE total water storage changes.

Artificial Neural Network (ANN) Model
The multi-layer perceptron (MLP) ANN model has been proposed in the past for predicting GRACE time series (Sun & Alexander, 2013). We implement this ANN model therefore for learning relations between decomposed components (i.e., seasonal, interannual, and residual) of detrended temporal modes in GRACE data and the supposed climate predictors. In the most simple MLP model there are three layers, that is, the input, hidden, and output layers (Bishop, 2006;Long et al., 2014). In this study, the output layer of the ANN model is separately chosen to represent each decomposed component of detrended GRACE temporal modes, and the inputs (predictors) comprise selected m sensitive components of climate temporal modes. The selected climate components (predictors) are determined based on their correlations as related to each target GRACE component-that is, we select the predictors by retaining the most correlated climate components. The number of input "channels" is set to m = 3 because we find no obvious improvement when using a larger number of predictors. The hidden layer consists of u artificial neurons, and each neuron represents a sum of weighted predictors. We set the number of artificial neuron to u = 7 in the hidden layer based on the criterion as described in Forman et al. (2014). It is difficult to reasonably initialize the weights of the artificial neurons in the ANN training process, such that we randomly choose start weights hundreds of times and repeat the training process, in order to finally generate the final prediction as a mean over several ANNs with different start weights. In this case, we write the ANN codes based on the Matlab_R2014b neural network function to develop all MLP networks for this study.

Autoregressive Exogenous (ARX) Model
The ARX model, which formulates another type of relationships between a group of inputs and the output, is governed by a system of linear equations (Ljung, 1987): where y(t), t = 1, 2, … , n represent the target variables, t is the time epoch, and n is the length of the time series; x q (t), q = 1, 2, … , m represent m channels of inputs (m = 3 in this case); n a and n b are the orders of the autoregressive exogenous model with respect to the output and input, respectively. ε(t) allows for a random Gaussian-noise input. Here, a i and b q,l are the coefficients that need to be estimated in the training step using both inputs and output data; thus they play a role similar to the weights as in the artificial neural network approach. In this case, we set both n a and n b to 3 as discussed in Forootan et al. (2014). After obtaining b a i , b b q;l , b n a , and b n b one can predict the target variable beyond the training period based on these coefficients and parameters: whereb y t n ð Þrepresents the predictand of the ARX model at the epoch t n . More details about the application of the ARX model to predict GRACE temporal modes can be found in Forootan et al. (2014). In this case, we write the ARX codes by referring to the equations as described in Forootan et al. (2014), and we use identical inputs and output data as employed in the ANN process (see section 2.3.1) to train the ARX model and to extrapolate the GRACE TWSC time series-that is, we choose three sensitive components of climate temporal modes to predict and reconstruct each decomposed component of detrended temporal modes of GRACE TWSC using the autoregressive exogenous model (equation 7 for training and equation 8 for testing/predicting).

Multiple Linear Regression (MLR) Model
The MLR model prescribes linear relationships between multiple input and one output variables (Sousa et al., 2007). In this case, we use the multiple linear regression function from Matlab_R2014b. The inputs will be three selected components of climate temporal modes while the output will be the decomposed component of detrended GRACE temporal modes. Again, for a fair comparison, we will employ identical input and output data as used in the ANN and ARX models to train the MLR representation. Here, we choose the least squares method for the estimation of the MLR coefficients, and then we predict the target variables based on the estimated coefficients.

Comparison
The artificial neural network model can fully derive nonlinear relationship between the input and output data, but it is difficult to optimally determine the number of artificial neurons, and it may lead to overfitting the problem. Furthermore, it is computationally intensive to repeat the ANN training process for improving the predictand. The ARX and MLR models both employ linear relations, so they cannot be expected to predict nonlinear relationships too well. Within the ARX model, each predicted value depends on the nearest former predictand-that is,b y t n ð Þin equation 8 is predicted by usingb y t n − i ð Þ, i = 1, 2, … , n a -so the predicting error of the autoregressive exogenous model is easily accumulated over time. Therefore, the multiple linear regression method will be likely a better choice on the condition that there are no nonlinear relationships between the input and output data.

Error Perturbations
In order to study the error propagation characteristics in the three predictive models, we generate a series of Gaussian-like uncertainties to perturb each target variable (i.e., GRACE seasonal, interannual, or residual component) based on the Monte Carlo approach (Challa & Hetherington, 1988). The perturbed target variables could be expressed by the equation as follows: where P i (t) (t = 1, 2, 3, … , n) is the ith perturbed GRACE TWSC; i is the count of iteration times (or random sample number); t is the time epoch and n is the length of the GRACE time series; G(t) represents unperturbed GRACE TWSC; and ξ i (t) is the Gaussian-like uncertainties generated by the Monte Carlo approach.
Here, we first predict the GRACE-like gridded total water storage changes using the unperturbed GRACE TWSC as target variable, and then we predict another group of GRACE-like gridded TWSC using the perturbed GRACE TWSC as target variables, and the error propagation bars are estimated by the mismatch between the unperturbed and perturbed GRACE-like gridded TWSC: where B(t) is the error propagations in the predictive model, t is the time epoch, PT i (t) is the GRACE-like gridded TWSC predicted using the ith perturbed GRACE TWSC as target variable, and UT(t) is the GRACE-like gridded TWSC predicted by the unperturbed GRACE TWSC; m represents the sum of iteration times of error perturbations.

Total Water Storage Change Data
We use the RL06 GRACE monthly mascons, developed with a 1°resolution using Tikhonov regularization in a geodesic grid domain (Save, 2019;Save et al., 2016) by the Center for Space Research (CSR), between April 2002 and June 2017 as the target variables. The storage anomalies, which capture all the signals observed by GRACE within the measurement noise level, are given in equivalent water thickness units (cm), and the correlated error has been intrinsically removed; thus, these products do not need to be additionally destriped or smoothed. In this study, we unify the spatial resolution of all input data to 1°× 1°(corresponding to the CSR mascons) to eliminate inconsistent resolutions between the input and output data.
The GRACE Follow-On (GRACE-FO) mission has been operated since May 2018, and we use the GRACE-FO TWSC as a criterion to evaluate the accuracy of the predicted TWSC. The GRACE-FO temporal gravity field models, from June 2018 to December 2018, derived by the CSR are employed to estimate the GRACE-FO TWSC over all study regions based on the methods and strategies as described in Li et al. (2018).
During the gap of the two GRACE missions, the Swarm satellites may serve as an alternative to derive Earth's gravity field models albeit at much lower resolution. As a verification to our predicted TWSC, the RL06 Swarm time variable gravity field models from December 2013 to December 2018 calculated by Lück et al. (2018), with a max degree of 40, are also employed to estimate the Swarm total water storage changes. Here, we only use the Swarm gravity field models complete to degree and order ten, to suppress excessive noise at higher orders.

Climate Data 3.2.1. Global Precipitation
The Climate Prediction Center (CPC) global daily unified gauge-based analysis of precipitation , with a spatial resolution of 0.5°, is employed in this study. The monthly precipitation is obtained by averaging the CPC global daily precipitation corresponding to the GRACE time interval, and both daily and monthly precipitation from October 1991 to March 2019 are used to reconstruct and predict the GRACE-like TWSC out of the GRACE period. For improving the correlation between precipitation and GRACE temporal mode de-seasoned terms (i.e., interannual and residual), we reconstruct the de-seasoned components of monthly precipitation temporal mode using the daily precipitation temporal mode based on a time delay parameter as introduced by Humphrey et al. (2016). The spatial resolutions of both daily and monthly precipitation are made consistent to 1°.

Global Land Surface Temperature
Global Historical Climatology Network version 2 and the Climate Anomaly Monitoring System (GHCN CAMS), developed at CPC, National Centers for Environmental Prediction (NCEP), is a monthly global land surface temperature data set (0.5°× 0.5°) from 1948 to near present (Fan & Dool, 2004). We use this data set between October 1991 and March 2019 as one of the input climate data to reconstruct and predict GRACE-like TWSC time series. As discussed before, the spatial resolution of this data set is sampled onto 1°cells.

Sea Surface Temperature (SST)
Sea surface temperature drives ocean evaporation, which in turn affects atmospheric moisture transport and, eventually, rainfall over land areas and water storage. For example, the temporal evolution of GRACE TWSC in west Africa is highly correlated with the SST anomalies in the Pacific, Atlantic, and Indian oceans (Forootan et al., 2014). To take advantage of this kind of climate data, we use the SST in several oceans and seas that are located near the study regions ( Figure 2) as one kind of input data. In our case, the monthly Optimum Interpolation (OI) sea surface temperature, with a resolution of 1°, provided by National Oceanic and Atmospheric Administration (NOAA) is employed (Reynolds et al., 2002).

Climate Indices
Climate indices are by definition constructed to capture large-scale variability in fields such as SST or surface pressure, which are related to land precipitation and temperature through atmospheric teleconnections. Therefore, several publications (e.g., Anyah et al., 2018) have shown that they play a key role in representing interannual GRACE water storage variations. In this study, 17 climate indices-that is, Multivariate ENSO Index (

Hydrological Models
Hydrological models simulate soil moisture, near-surface air temperature, accumulated snow, water/energy flux, and other hydrological components on land. Several studies have shown that model outputs relate well with GRACE TWSC although model schemes do not include, for example, all water reservoirs, because the temporal evolution of the different water reservoirs is often highly connected (Humphrey et al., 2017). We use model outputs, including the NASA Global Land Data Assimilation System (GLDAS) NOAH 10M series model (Rodell et al., 2004) and CPC soil moisture (Fan & Dool, 2004) from January 1992 to December 2018, to evaluate the reliability of the reconstructed and predicted total water storage change. Moreover, we also employ the newest WaterGAP Global Hydrology Model (WGHM) version 2.2d results, over January 1992 to December 2016, in this study. Compared to the 2.2a version ) the water balance is closed, the calibration routine is changed, and the human water use values and the groundwater recharge algorithm are improved.

Results
We choose 26 major river basins of the world, as delineated online (https://www.grida.no/resources/5782), as the study regions ( Figure 2). For representing the amplitude of GRACE TWSC, we calculate the root-mean-square (RMS) of each gridded CSR mascon over the study regions ( Figure A1). To make full use of the SST data in the TWSC reconstruction and prediction, we divide the global sea surface temperature data into 14 patches as shown in Figure 2 and use each of them as one of input data.

Signals Separated from the GRACE Data 4.1.1. Dominant Temporal Modes of GRACE Total Water Storage Change
As discussed in section 2.2.1, the dominant modes identified by the PCA or ICA methods contribute to the majority of the GRACE signal. We predict each selected GRACE EOF/IC temporal mode individually based on the methods as described in section 2 and find that forecasting modes with less energy (i.e., less variance explained of the GRACE signal) tend to have higher standard errors as estimated by the CSR mascons; thus, while choosing more dominant EOF/IC modes in the TWSC prediction will reduce the signal leakage (i.e., signals from discarded modes), this will also increase the prediction uncertainties. Consequently, it is difficult to choose the optimal number of modes to be selected. An optional approach could be that one first predicts the TWSC based on a different number (e.g., from 3 to 10) of retained modes and then uses a GRACE solution to test the uncertainties of TWSC predicted by these different numbers of modes to identify the best number for each study region. It is important to highlight that although the GRACE solution cannot be viewed as an unambiguous reference, it could be used to estimate the accuracy of the prediction because the reference (i.e., GRACE solution in this study) should be accompanied by a "conservative error estimate." It is clear that further tuning the methods would improve the reconstruction/prediction skills for a specific basin. But for applying our approach globally, that is, to a large number of basins, tuning for each basin Figure 2. The study regions (i.e., 26 river basins) and divided oceans and seas. The river basins are numbered from 1 to 26, and the divided oceans or seas are numbered from S1 to S14 as shown in this figure.
would not be in the interest of repeatability, and it is not clear how robust such over-tuned approaches would be. Thus, in what follows we rather define unified criteria for all study regions based on extensive testing on only a few representative basins. Here, we set a unified criterion-that is, the number of modes that jointly explain at least 95% of the total variance-to identify the number of dominant modes automatically by the algorithm. We note that we set the number of selected climate modes equal to the number of selected GRACE modes just to minimize the inconsistency between the input and output data, but this is not strictly required for the algorithm.  Figure A3. These results indicate that there is no large difference between the LS and STL methods for decomposing the linear trend and interannual components, and the two methods perform almost the same in separating the seasonal component from the temporal mode which has a strong periodicity (e.g., temporal modes of GRACE EOF1 and EOF2 in Figure A3). When decomposing a high-frequency time series, the LS and STL methods exhibit some differences in separating seasonal signals-that is, seasonal signals separated by STL method show more detailed features and larger oscillations (e.g., the temporal mode of GRACE EOF6 in Figure A3).

Prediction and Reconstruction of Total Water Storage Change for 26 Study Regions 4.2.1. Selection of Climate Inputs for the Total Water Storage Change Prediction
In this study, we predict and reconstruct the interannual, seasonal, and residual components of significant GRACE TWSC temporal modes for each study region based on the predictive models that was introduced in section 2.3. These predictive models seek to derive the relationship between the input and output data. Typically, there are a few months of lag time between the climate variations and GRACE TWSC; thus, our algorithm is designed to automatically move each climate driver (i.e., the interannual, seasonal, or residual of climate temporal modes) time series for a few months (i.e., 0 to 3 months) to search for the strongest correlation, and for this we also interpolate the GRACE TWSC time series to fill the missing data. In addition, we reconstruct the de-seasoned components of precipitation temporal modes based on a time decay parameter that was developed by Humphrey et al. (2016) to further improve the correlation between the precipitation and GRACE components.
For selecting the "best" input data, the correlation coefficients between each target variable (i.e., the interannual, seasonal, or residual of GRACE EOF and IC temporal modes) and related climate drivers are automatically computed and sorted by our algorithm, for example, for a specific basin we calculate the correlation coefficients between the interannual component of GRACE EOF1 mode and interannuals from precipitation, land temperature, SST (in 14 different oceans and seas) EOF modes, and 17 climate indices and sort them by size, and then this process is successively applied to GRACE EOF1 mode seasonal, GRACE EOF1 mode residual, GRACE EOF2 mode interannual, and so on. Here, the sensitive input data are sorted only by correlation coefficients, and for the selection this method may reject a predictor that is not very highly correlated but brings new information compared to other highly correlated predictors. So, before the selection of inputs, we use the stepwise regression method (Summers, 1985) to remove the highly correlated predictors which do not bring sufficient new information. In addition, we would like to make a cautionary note that correlation between climate input and the GRACE data does not necessarily represent causation, and in this case our (like any other similar) techniques may derive a "right answer due to the wrong reasons" but of course may fail in extrapolating well.
As discussed in section 2.3, we choose three sensitive climate drivers as predictors to extrapolate each target variable. Furthermore, we identify one most sensitive climate driver for each target variable as listed in Table 1. We find that the temporal evolution of GRACE total water storage change seasonal component is highly related to the seasonal component of SST (see the third column in Table 1), and the time series of GRACE TWSC interannual and residual components are strongly correlated with the interannual and residual of both sea surface temperature and precipitation variations (see the fourth and fifth columns in Table 1).

Metrics for Methods Comparison and Estimation of Prediction Uncertainties 4.2.2.1. Criteria for Identifying the Most Robust Method
There are more than 15 years of GRACE data altogether (April 2002 to June 2017). In the data processing, we set the training section to April 2002 to June 2011 and set the testing period to July 2011 to June 2017, that is, we use the GRACE data from April 2002 to June 2011 to train the predictive models and to test the uncertainty of the next six years (i.e., July 2011 to June 2017) of prediction. Then, we use the CSR mascons to calculate the standard error of both training and testing TWSC. We note that one can determine "absolute" errors, including the systematic and random errors, only in the training phase. In the computations, we set the testing section within the GRACE period just for assessing the uncertainty of predicted TWSC. As discussed in section 3.1, the CSR mascon contains the GRACE measurement noises. Several studies (e.g., Landerer & Swenson, 2012) have assessed the measurement errors that are contained in the GRACE-derived total water storage changes. These measurement errors may falsify the validation between the predicted/reconstructed TWSC and the CSR mascons, but it is difficult to exactly determine the exact influences of GRACE measurement errors on the TWSC prediction. Through we do not assess this influence, one can estimate to what extent the GRACE measurement error may affect the predicted TWSC by computing the root-sum-square of the GRACE measurement errors and the prediction errors.
As discussed in section 3, three groups of data-driven techniques-that is, (a) spatiotemporal decomposition techniques ICA and PCA; (b) time series decomposition techniques LS and STL; and (c) machine learning methods ANN, ARX, and MLR-are employed in this study, Here, we firstly fix the spatiotemporal decomposition and time series decomposition techniques to PCA and LS and compare the robustness of three machine learning techniques ARX, ANN, and MLR in 26 river basins.
In this case, we used three criteria-that is, (a) standard error of TWSC, (b) correlation coefficients of TWSC, and (c) correlation coefficients of de-seasoned TWSC as used in Reichle et al. (2004)-to identify the most (or more) robust method. Table 2 lists the standard errors of training and testing TWSC by using the three machine learning methods as evaluated by the CSR mascons at both grid and basin scales. The correlation coefficients of training/testing TWSC and TWSC anomaly (i.e., de-seasoned TWSC) as compared to the CSR mascons are listed in Tables 3 and 4, respectively.
We here summarize the optimal methods-that is, methods with minimal standard error or maximal correlation coefficients at the grid scale-for each river basin (see Tables 2-4). Within the training section, the ANN model simulates the TWSC best in 18 river basins as estimated by the criterion of standard error (see fourth column in Table 2) and simulates the TWSC best in 19 regions and 20 regions assessed by the other two criteria as shown in Tables 3 and 4. ARX performs best in 12 basins, 16 basins, and 10 basins within the training phase as assessed by the criteria standard error of TWSC, correlation coefficients of TWSC, and correlation coefficients of de-seasoned TWSC, respectively, and MLR simulates the TWSC worse in all river basins than ARX or ANN (see column 2 to column 7 in Tables 2-4). These results indicate that MLR performs worst and ANN performs best within the training period.
The MLR method, in the testing period, shows the best skill in 19 regions, 18 regions, and 19 regions as evaluated based on the three criteria, respectively. Obviously, the ANN and ARX models perform worse than MLR in the testing period as listed through column 8 to column 13 in Tables 2-4. Here, we also calculate

10.1029/2019WR026551
Water Resources Research the average standard error or correlation coefficient over 26 river basins for each predictive method and in both training and testing period as listed in the last rows of Tables 2-4. All these results indicate that though MLR performs worst within the training phase, it is the most robust method for the prediction.
For ANN and ARX, we use a group of unified empirical parameters for predicting the TWSC grids. With these empirical parameters (e.g., number of input predictors), ANN and ARX perform better than MLR in some regions while they perform worse in the other regions, indicating that these parameters are not optimal in all study regions and might lead to some overfitting problems. One alternative is to try all different combinations of these parameters for ANN and ARX and find the best combination for each study region; this may suppress the overfitting problem but will also dramatically expand the testing works, and it is computationally expensive to apply them to all river basins globally. Therefore, we suggest one to try ANN and ARX if he or she aims at predicting only a few basins, and the MLR method is a more robust alternative if one wants to predict the TWSC over a large number of river basins. After our tests with three predictive methods, we then choose the spatiotemporal decomposition and machine learning techniques to PCA and MLR to compare the performances of STL and LS in all study regions. We use the CSR mascons and the three criteria to evaluate the precision of testing/training TWSC from STL and LS methods (see Tables , , B1-B3), and we find that the LS method performs better in most of the cases compared to STL in those study regions. Finally, we use the same metric to compare the PCA and ICA methods. As listed in Tables , , B4-B6, the standard errors of testing TWSC from PCA are smaller than those from ICA, and the correlation coefficients of both TWSC and de-seasoned TWSC from PCA are higher in most river basins, indicating that PCA is more robust than ICA for such application.

. Error Propagation in Three Predictive Models
In this case, for studying the error propagation characteristics in three predictive models, we first repeat the error perturbations and TWSC prediction and calculate the error bars for the testing TWSC at grid cell scale as described in section 2.3.5. Then we divide the study period into different sections, that is, training period, the first year past the training period, the second year past the training period, and so on. Figure A4 shows the propagated errors in each divided time section for the three predictive models at the grid scale. As expected, we find that the error bars, from both ANN and ARX methods, in the testing sections (first year to sixth year) are larger than those in the training section (the second and third columns in Figure A4), and the error bars from the MLR method are almost constant in both training and testing sections (the first column in Figure A4), indicating that the multiple linear regression method is more stable than the ANN and ARX methods in coping with random input uncertainties. This enhances our confidence to choose the MLR method for the prediction and reconstruction. In addition, the results also show that the distribution of propagated uncertainties in predictive methods depend on the amplitude of gridded GRACE TWSC, such as the GRACE total water storage changes with larger amplitudes (e.g., TWSC in the middle of the Amazon basin) that also show larger uncertainties ( Figure A4).

Prediction Uncertainties of the Identified Methods
Based on the testing results in Tables 2-4, Tables , , , , , B1-B6, and Figure A4, the combination of PCA, LS, and MLR methods is identified to be the most robust combination for extrapolating the GRACE TWSC map. Figure 3 shows the standard errors of both training and testing GRACE-like gridded total water storage changes as evaluated by CSR mascons at the grid scale, and we find that the regions with larger amplitude of TWSC also show larger standard errors. Figure 4 shows the correlation coefficients of training TWSC,

10.1029/2019WR026551
Water Resources Research testing TWSC, and TWSC simulated by the hydrological models as related to the GRACE mascons over 26 river basins. The averaged correlation coefficients (i.e., averaging the correlation coefficients of all grids in 26 river basins) of training TWSC, testing TWSC, GLDAS TWSC, CPC TWSC at the grid scale are 0.91, 0.86, 0.51, and 0.45 as compared to the CSR mascons. We also remove the seasonal cycles of the related TWSC at the grid scale as described in Reichle et al. (2004) and obtain the correlation coefficients of de-seasoned/anomaly TWSC signals (see Figures 4e-4h); the correlation coefficients of anomaly signals are 0.77, 0.68, 0.44, and 0.46, respectively. Results indicate that both training and testing GRACE-like gridded total water storage changes have much stronger correlations with the GRACE mascons than the model-simulated TWSC. Furthermore, both training and testing TWSC time series at the basin scale fit the CSR mascons well in almost all study regions as shown in Figure 5.

Identification of the Optimal Region Size
In this study, we apply our methods to all river basins respectively. For understanding at which basin size the predicting methods work best, we firstly divide the Europe-Asia continent into different sizes of subcontinent (see Figure 6), and then the identified methods combination is applied for the TWSC prediction in each subcontinent. Again, standard errors are derived by comparing to the CSR mascon solution as shown in Figure 6. The averaged standard errors in Figures 6a-6d are 2.86, 2.88, 2.74, and 2.84 cm respectively, indicating that dividing the Europe-Asia continent into four parts (~10-15 million km 2 per subcontinent) will be an optimal option for the prediction of TWSC based on our identified methods. Thus, we suggest an optimal region size 10-15 million km 2 for readers who want to predict the TWSC using our method.

Extrapolating the GRACE Total Water Storage Change Outside the GRACE Period
As discussed in section 4.2.2, the PCA, LS, and MLR methods are identified to predict and reconstruct the GRACE-like gridded total water storage change over 26 river basins, that is, in each river basin (a) we use the PCA method to identify significant modes of the GRACE and climate signal; (b) we use the LS method to separate interannual, seasonal, and residual components of GRACE and climate temporal modes; and (c) we use the separated components from GRACE and climate data between April 2002 and June 2017 to train the multiple linear regression model and then predict the GRACE-like gridded TWSC over July 2017 to December 2018 and reconstruct TWSC from January 1992 to March 2002. El Nino-Southern Oscillation (ENSO) represents natural variability in the climate system, which may cause some climate extremes especially in the tropical regions (Juan et al., 2016). Isolating the de-seasoned signal in total water storage change enables one to detect climate extreme events such as hydrological drought (Thomas et al., 2014). In an

10.1029/2019WR026551
Water Resources Research in Figure 7. Within our analysis period, there were two significant El Niño events, that is, the years 1997/1998 and 2015/2016, which can be derived from ENSO indicators. We find that, after removing the seasonal cycle, the reconstructed total water storage change shows strong abnormal signals in four tropical regions during the first significant El Niño periods (i.e., 1997/1998), which is consistent with the de-seasoned GRACE TWSC during another significant El Niño period (i.e., 2015/2016) (see the dark area of Figure 7). As shown in Figure 7, the GRACE-FO TWSC fits well with the predicted TWSC in four tropical river basins. The hydrological model does not explicitly simulate groundwater redistribution, and its skills in representing anthropogenic water withdrawals are limited; this may explain most differences between TWSC from the hydrological model and those from GRACE, GRACE-FO, or Swarm mission seen in Figure 7. The Swarm TWSC fits well with the GRACE total water storage change in the Amazon basin but shows larger deviations and uncertainties in the other regions; this shows some potential of the Swarm mission to detect large amplitude water storage changes. We also find that the Swarm TWSC has larger uncertainty in 2013 and 2014 than later; this is mainly caused by the more active ionosphere during these 2 years as discussed in Schreiter et al. (2019). In addition, the predicted and reconstructed TWSC in the other river basins could be found in Figures A5 and A6. The results clearly suggest that the predicted TWSC over June 2018 to December 2018 fits well with the GRACE-FO TWSC in almost all study regions. (2019), we do not know other studies that aim at the reconstruction of GRACE-like gridded total water storage change for the global land surface. Thus, we restrict the comparison of our method to the method as used in Humphrey and Gudmundsson (2019).

Comparison to Previous Studies Except Humphrey and Gudmundsson
In their TWSC reconstruction, Humphrey and Gudmundsson (2019) focused on a grid cell representation using precipitation and temperature as inputs. However, it appears difficult with their approach to make use of additional input data sets originating from outside the study regions (e.g., the SST data or climate indices), and they did not reconstruct the seasonal signal of the GRACE TWSC as described in Humphrey and Gudmundsson (2019).
In this study, the decay filter, MLR, and STL techniques employed in Humphrey et al. (2017) are included in our unified framework, but we had to combine these in a somewhat different way. Different from Humphrey's original method, we focused our attention on the reconstruction of dominant GRACE modes, and we feel that it is beneficial to involve additional data which have been shown before to be highly related to the evolution of dominant GRACE temporal modes as inputs (e.g., the SST and climate indices). Therefore, our implementation is able to assimilate more information to support the TWSC reconstruction. Furthermore, besides aiming at the de-seasoned anomalous TWSC signals, we also reconstruct the seasonal signals based on their high correlations with the seasonal signals of SST (see Table 1). Thus, our implementation seems able to reconstruct a more complete picture of the GRACE TWSC record as compared to the method originally developed by Humphrey and Gudmundsson (2019).
In addition, we would like to mention that it is the first time that three groups of data-driven techniquesthat is, spatiotemporal decomposition, time series decomposition, and machine learning-are formulated in a unified way for the reconstruction or prediction of leading modes identified in GRACE-derived total water storage grids.

Conclusions
In this study, a unified methodology framework is developed to compare different data-driven techniques for predicting and reconstructing gridded GRACE-like total water storage variations outside the GRACE period. We find that both ARX and ANN methods simulate the target variable better than the MLR method, but they are not robust enough for the prediction on account of some overfitting problems. The PCA-LS-MLR methods combination is identified as the most robust alternative through our framework for predicting and reconstructing the gridded TWSC over all river basins. One encouraging result is that our testing TWSC (6 years lead time with regard to the GRACE period) from the identified methods show much stronger correlation with the CSR mascons compared to the model-simulated TWSC at almost each grid of the study regions; thus, our results could be an alternative for the mass balance constraint for hydrological models beyond the GRACE period (e.g., Eicker et al., 2014). We also find that the temporal evolution of the seasonal component of the GRACE total water storage change in the study regions closely related to the seasonal variation of SST and the de-seasoned (i.e., interannual and residual) components of GRACE TWSC has strong correlation with the de-seasoned changes of both sea surface temperature and precipitation. These results may improve our understanding of rough relationships between the TWSC and the related climate drivers.
We study the error propagation in the adopted predictive models, and the results indicate that the MLR method is more stable and robust than the ANN and ARX methods in coping with error perturbations. Finally, we predict 1.5 years (i.e., July 2017.7 to December 2018) of gridded TWSC past the GRACE period and reconstruct more than 10 years (i.e., January 1992 to March 2002) of gridded TWSC. At the basin scale, the de-seasoned signal of reconstructed TWSC exhibits a strong abnormal signal in the tropical basin during significant El Niño periods. The total water storage change derived from the GRACE Follow-On mission fits well with the predicted TWSC in almost all study regions, and the Swarm TWSC shows potential to detect extreme climate events, but it contains large uncertainties.
The approach identified from our framework presents a viable alternative for bridging the data gap of the GRACE missions and can also be used for extrapolating the global GRACE gridded TWSC time series outside the GRACE period for a longer time.
Appendix: This section provides some figures which surpport the discussion of this article         Appendix: This section provides some tables which surpport the discussion of this article

Appendix: Robustness Tests of the Identified Methods
In the test computations, we had to choose some options, such as setting the training section to April 2002 to June 2011, setting the input number of climate drivers to three, and so on, and identified an optimal methods combination, that is, PCA-LS-MLR. But the prediction skill may vary with different options. Therefore, in this section we will tell how robust this optimal combination is with respect to different meta-parameters, predictors, and reference data periods. Since it is impossible to test all possible applications, we will consider a few typical choices here and show for a few basins how robust our results are. We choose five river basins located within different continents and climate zones-that is, the Amazon, Congo, Yangtze, Mississippi, and Danube basins-for the robustness tests.
We chose three sensitive climate drivers as the input for the prediction previously. For understanding how robust the number of selected sensitive climate drivers affects the prediction results, we now set this number from 1 to 15 in the prediction and use the CSR mascons to assess the standard error of testing total water storage change, respectively. The standard error based on different numbers of inputs are shown in Figure C1. We find that the standard error of testing TWSC based on less than 8 (i.e., from 1 to 8) of inputs do not show large differences, but the standard error could be increasing when we use a larger number of inputs, for example, more than 12 inputs in Congo basin. Thus, we do not suggest the users to set the number of inputs larger than eight for the prediction when using our method. In the previous experiments we had chosen the number of selected modes by using a unified criterion, such as to retain 95% of the total energy; in what follows we will vary the number of modes between 3 and 10 to find how robust this criterion is. Figure C2 shows the standard errors of testing TWSC as compared to the CSR mascons, and the results indicate that the uncertainty of predictions will be stable when choosing more than five modes for the prediction. As discussed in Humphrey et al. (2016), there is a few months of time delay between the climate changes and (affect) water storage changes. Our algorithm has been designed to maximize lag correlation within a window of 3 months (0 … 3), in what follows we will extend this window successively to 6 months; for example, if the window is set to 4 months, then we will move the climate driver from 0 to 4 months to search for its highest correlation as related to the GRACE data. We predict the TWSC in five river basins based on different values of window (i.e., from 0-0 to 0-6), and then we estimate the standard errors of prediction using the CSR mascons as shown in Figure C3. The results indicate that the prediction uncertainties are reduced when we turn the time window from 0 to 3, and there is no obvious improvement when the window is increased from 3 to 6 months. For testing the robustness of the reference data period, we firstly fix the length of training section to 8 years, and we fix the length of the testing section to 6 years; that is, we firstly use the GRACE data from April 2002 to March 2010 (totaling 8 years) to train the predictive models and to predict the next 6 years (i.e., April 2010 to March 2016) of TWSC. Then we move the training and testing sections over time (e.g., move the training section from April 2002 to March 2010 to May 2002 to April 2010) and predict the TWSC using different periods of reference data, respectively. Finally, we use the CSR mascons to evaluate the standard errors of the testing total water storage change in all moving time sections, and we Figure C1. Standard errors of testing TWSC at the grid scale by turning the number of input climate drivers from 1 to 15 in five river basins as evaluated from the CSR mascons.
show them in Figure C4. The results indicate that the standard error of predictions from different time sections do not show large differences in five study regions, which demonstrate the robustness of our method for this variation. Figure C2. Standard errors of testing TWSC at the grid scales by turning the number of selected modes from 3 to 10 in five river basins as evaluated from the CSR mascons. Figure C3. Standard errors of testing TWSC at the grid scales by turning the time window from (0 … 0) to (0 … 6) in five river basins as evaluated by the CSR mascons. Figure C4. Standard errors of testing TWSC at the grid scales by moving the training and testing periods from 1 to 15 times in five river basins as evaluated by the CSR mascons.