Groundwater Withdrawal Prediction Using Integrated Multitemporal Remote Sensing Data Sets and Machine Learning

Effective monitoring of groundwater withdrawals is necessary to help mitigate the negative impacts of aquifer depletion. In this study, we develop a holistic approach that combines water balance components with a machine learning model to estimate groundwater withdrawals. We use both multitemporal satellite and modeled data from sensors that measure different components of the water balance and land use at varying spatial and temporal resolutions. These remote sensing products include evapotranspiration, precipitation, and land cover. Due to the inherent complexity of integrating these data sets and subsequently relating them to groundwater withdrawals using physical models, we apply random forests—a state of the art machine learning algorithm—to overcome such limitations. Here, we predict groundwater withdrawals per unit area over a highly monitored portion of the High Plains aquifer in the central United States at 5 km resolution for the Years 2002–2019. Our modeled withdrawals had high accuracy on both training and testing data sets (R2 ≈ 0.99 and R2 ≈ 0.93, respectively) during leave‐one‐out (year) cross validation with low mean absolute error (MAE) ≈ 4.31 mm and root‐mean‐square error (RMSE) ≈ 13.50 mm for the year 2014. Moreover, we found that even for the extreme drought year of 2012, we have a satisfactory test score (R2 ≈ 0.84) with MAE ≈ 9.72 mm and RMSE ≈ 24.17 mm. Therefore, the proposed machine learning approach should be applicable to similar regions for proactive water management practices.


Introduction
Groundwater constitutes nearly 30% of total global freshwater reserves (Schneider et al., 2011) and is a key component of the water-food-energy nexus (Smajgl et al., 2016). Globally, about half of the drinking water is supplied by groundwater (GW) (Gleeson et al., 2019). Moreover, GW is extensively used in agricultural activities, and the demand for it is increasing due to the rise in global population, dietary shifts, and climate change (Margat & van der Gun, 2013). Monitoring of GW withdrawals (pumping) is needed for quantifying Due to the availability of voluminous amounts of satellite data, it is possible to monitor large regions using remote sensing techniques (Frappart & Bourrel, 2018;Leidner & Buchanan, 2018). In the hydrologic remote sensing domain, there is a multitude of spaceborne missions that provide satellite products for assessing different quantities related to the global water cycle. The most prominent of these products quantify total water storage (TWS)-GRACE (Gravity Recovery and Climate Experiment) and GRACE-FO (GRACE -Follow On), terrestrial ET-MODIS (Moderate Resolution Imaging Spectroradiometer), soil moisture-SMAP (Soil Moisture Active Passive), precipitation-GPM (Global Precipitation Measurement), TRMM (Tropical Rainfall Measuring Mission), and land use-USDA-NASS (United States Department of Agriculture-National Agricultural Statistics Service) (Boryan et al., 2011;Chan et al., 2016;MardanDoost et al., 2019;Nie et al., 2018;Yi et al., 2018).
Each of these data sets is related in some way to hydrologic fluxes that impact the GW system. A great deal of research has been done on using these data sets to estimate GW fluxes. A number of studies have used GRACE to estimate changes in TWS then subtract the components of soil moisture, surface water, and snow water to estimate fluxes in GW storage (Famiglietti et al., 2011;Rodell et al., 2007Rodell et al., , 2009. While useful for basin-or continental-scale studies, the resolution of GRACE is too coarse (~400 km) for local estimates of GW flux. Surface water availability and land use (which is a proxy for water demand) have also been tied to GW withdrawals in previous studies (Faunt, 2009). Many land use data sets are now derived from remote sensing, and a growing number of studies are using remote sensing estimates of land use to estimate irrigated area (Deines et al., 2017;Ozdogan & Gutman, 2008), although this has not been tied directly to GW extraction. Satellite methods have been used to estimate GW storage change at high resolution (~100 m) utilizing land subsidence estimates from interferometric synthetic aperture radar (InSAR, e.g., Hoffmann et al., 2001;Smith et al., 2017), but this approach is typically limited to regions with confined or semiconfined aquifers (where pressure changes are highest) and highly compressible sediments (Smith & Majumdar, 2020). In spite of the mature research in many of these individual fields, very rarely are the various remotely sensed data sets combined to estimate GW fluxes, which limits their ability to estimate these fluxes to very specific cases. In this study, we seek a generalizable approach that utilizes satellite data to estimate GW withdrawals at the local scale (5 km resolution) and accomplish this by integrating diverse satellite data sets that are related to different components of the water balance.
Integrating satellite data for estimating the different water balance components can be immensely challenging due to the varying spatial and temporal resolutions (Tamayo-Mas et al., 2018). Moreover, we lack methods to estimate several key parameters, such as recharge, GW inflow/outflow, and surface water withdrawals, to "close the loop" of the water balance and directly estimate GW withdrawals. Furthermore, existing water balance estimates have been shown, in many cases, to exhibit spatial bias, limiting their accuracy to some extent (Hashemi et al., 2017). These roadblocks have limited our ability to estimate GW withdrawals except in the special cases noted previously. In addition, traditional approaches to calibrate physical models that incorporate these diverse data sets are challenging because the models quickly become 10.1029/2020WR028059
In this research, rather than using traditional physical models, we leverage the correlations between various water balance measurements and GW withdrawals in a machine learning framework that learns the relationship between the various data sets and uses them in a predictive fashion. Here, we apply random forests (Belgiu & Drăguţ, 2016;Breiman, 2001), a state of the art machine learning algorithm, to develop local scale (5 km) estimates of GW withdrawals over the entire state of Kansas. Regarding the satellite and modeled data sets, we include the water balance products from MODIS, PRISM (Parameter-elevation Regressions on Elevation Slopes Model), and land use from USDA-NASS acquired over this region between 2002 and 2019. Here, we use PRISM as it is specific to the United States, and the precipitation estimates are interpolated from a dense array of rain gauges thereby providing higher accuracies than TRMM and GPM (Cannon et al., 2017;Daly et al., 2008;Hashemi et al., 2017). We also use crop coefficients (CCs) (Allen et al., 1998) as a model predictor as an additional proxy for ET. Finally, this novel machine learning framework is validated against the annual in situ GW pumping data available over this area and the results are also compared to GRACE and GRACE-FO TWS (Landerer & Swenson, 2012;Swenson, 2012).
The remainder of this paper is divided into five main sections. We first discuss the characteristics of the study area (section 2) followed by the details of our methods (section 3). We then provide an in-depth analysis of the results (section 4) and end with a discussion of the broader implications and applicability of this work to other regions (sections 5 and 6).

Study Area
The High Plains Aquifer (HPA) in the central United States is one of the world's most extensive and productive aquifers helping to support agricultural demand in the United States and across the globe. It spans over eight states, and its water is predominantly used for irrigation of crops, including wheat, corn, cotton, and soybean. However, the HPA is heavily stressed (mostly in the central and southern parts) primarily due to GW overpumping driven by the large-scale demand for freshwater and increasing variations in seasonal ET and precipitation likely linked to climate change (Smidt et al., 2016). More specifically, the situation in the state of Kansas, particularly in the western area, has reached a critical point as 90% of its irrigation water is supplied by the HPA . Recent trends show that the water table is declining rapidly and some regions are at risk of losing their ability to pump GW. This would significantly impact the agricultural productivity of the state in the near future thereby also hindering its economic vitality Deines et al., 2020).
Currently, the only feasible way to mitigate these negative impacts of extreme GW withdrawals is to reduce GW extraction (Butler et al., 2020). This is because surface water is scarce in western Kansas and the water table decline is driven by GW pumping. Over 95% of the nondomestic pumping wells in the Kansas High Plains aquifer are metered, and pumping volumes must be reported annually . The pumping data are available through the Water Information Management and Analysis System (WIMAS) database that can be accessed at the Kansas Geological Survey (Wilson, 2019). Since "data" are the crux of any machine learning model (Hastie et al., 2013), we chose the entire state of Kansas as our study area, which is highlighted in Figure 1. Considering the severe aquifer stress in the region and the need for proactive management solutions, we believe that this is an ideal test area for our proposed machine learning framework.

Methodology
The overall workflow is shown in Figure 2; the main steps are data acquisition (Step 1) using the Google Earth Engine (GEE) platform (Gorelick et al., 2017), preprocessing (Step 2), and the implementation of the machine learning model (Step 3) using Python 3. After that, we perform model evaluation and analyze the GW withdrawal predictions using the available in situ GW pumping geodatabase (GDB). Here, the input time series data consist of ET, precipitation (P), and density of the following land use types: agriculture (AGRI), urban (URBAN), and surface water (SW). We provide a detailed description of our workflow involving the different processing blocks and the time series data specifications in sections 3.1 and 3.2.

Data Acquisition and Preprocessing
GEE is a cloud-based platform for large-scale geospatial data processing and analysis that is accessible to researchers and operational users (Gorelick et al., 2017). We used GEE for seamless and efficient downloading of our data sets which include MODIS (ET) and PRISM (P). The land use data (used to produce AGRI, URBAN, and SW) were downloaded directly from USDA-NASS. The specific product names along with the corresponding description are given in Table 1. Moreover, for the MOD16 and PRISM products, we restricted the time span between 1 April and 30 September (inclusive) of each year (the typical growing season; Butler et al., 2019). Our assumption was that the land cover does not change significantly between years over our study period, so we selected the USDA-NASS cropland data layer (CDL) for the year 2015 only. While many farmers rotate crops, we assume that the rotations did not significantly change the percentage of each 5 km × 5 km pixel that was cultivated over our study period. This data set does not distinguish between dryland and irrigated crops. To compensate for this, we included precipitation and actual ET as input data sets. By combining these data sets, our algorithm is able to identify pixels with low precipitation and high ET and thus high GW demand.
The Kansas in situ GW pumping data were obtained from the WIMAS database (Wilson, 2019). We discarded pixels with average extraction rates of more than 1,000 mm per year. These rates only occurred in two pixels, which had high-capacity wells that were withdrawing large volumes of water directly beneath the Missouri River and are thus more representative of surface water withdrawals than GW withdrawals. The in situ GW pumping data have not been released for 2019, and hence, this year is kept solely for forecasting.
We took the temporal sum of both the ET and P bands between April and September of each year thereby obtaining the total ET and P for that period for each year. In addition, the original USDA-NASS CDL data set (covering the conterminous United States) for the year 2015, which classifies land use based on the specific crop type, or water, urban, and so forth, was reclassified according to supporting information Table S1. Thereafter, this reclassified data set was split into separate binary rasters (AGRI, SW, and URBAN) with pixel values of 0 and 1 corresponding to the absence and presence of the respective class (OTHER class is ignored). Moreover, the original CDL data set was used to generate the CC raster by applying midseasonal stage CC values (Allen et al., 1998). The Kansas GW pumping database (GDB) was automatically converted into yearly shapefiles and subsequently rasters (5 km spatial resolution and UTM 14N projection) using The ET data are available as cumulative 8-day composite images at 500 m spatial resolution (Running & Mu, 2015). The algorithm developed by Mu et al. (2011), originally based on the Penman-Monteith equation (Monteith, 1965), is used for developing these ET products. PRISM monthly spatial climate data set AN81m 2002-2019 (Apr-Sep) Temporal sum over the "ppt" band The PRISM group provides daily and monthly gridded climate data sets for the conterminous United States wherein digital elevation models and weather station data are spatially regressed using weights derived from several physiographic factors including location and topography (Daly et al., 2008).
Here, we use the monthly data sets available at 4 km spatial resolution. USDA-NASS cropland data layers 2015 (annual) Reclassification of the "cropland" band and Gaussian filtering CDL is an annual crop-specific land cover data layer (30 m spatial resolution) created for the continental United States developed using MODIS and ground-truth data (USDA-NASS, 2015).

Crop coefficient 2015 (annual) Crop coefficient values applied over CDL data
The crop coefficient includes crop properties and soil evaporation effects and is used to estimate crop evapotranspiration. We use the midseasonal stage crop coefficient values (Allen et al., 1998) to develop a static crop coefficient raster using the USDA-NASS CDL data. Kansas in situ groundwater pumping data

2002-2018 (annual)
Groundwater pumping data to rasters The Kansas groundwater pumping database (GDB) is publicly available (Wilson, 2019). The GDB contains point data across all counties in Kansas.

Water Resources Research
GDAL/OGR Python APIs, GeoPandas, and Rasterio (GDAL/OGR contributors, 2019; GeoPandas developers, 2019; Gillies, 2013; Harris et al., 2020). All of these rasters (ET, P, AGRI, URBAN, SW, and CC) were reprojected to UTM 14N, resampled to 5 km spatial resolution, and clipped using the Kansas shapefile shown in Figure 1. Each 5 km × 5 km pixel contains the total annual GW withdrawal within that pixel for the years 2002-2018. The value was computed as the total volume of GW pumping in that region divided by the area, 25 km 2 .
Furthermore, we applied a Gaussian filter available from the Scipy library  over each of the AGRI, URBAN, and SW rasters to create smoothed rasters representing the density of each land use type within a given radius, the size of which is a function of the standard deviation (σ) of the Gaussian kernel . Here, we set σ = 3 pixels for AGRI and URBAN and σ = 5 pixels for SW. The filtered data were then normalized over the interval [0,1] where the values represent the density of these classes; that is, pixel values close to 0 have very few of the respective class within the specified window, and vice versa. Finally, the preprocessed data sets were organized into a Pandas DataFrame object (McKinney, 2010). It is noteworthy that feature ordering (order in which the predictors are provided to the model) affects model reproducibility in classification and regression trees (CART) (Breiman, 2001). This is because the splitting operation involves random feature selection to find the best split (Breiman, 1984;Pedregosa et al., 2011). In order to have consistent feature ordering, the CSV file was sorted based on the attribute names (Olson, 2015). The feature or attribute names in alphabetical order are AGRI, CC, ET, P, SW, and URBAN where the ET and P are in mm and AGRI, URBAN, and SW are measures of land type density and are unitless like CC. The GW pumping values (mm) are marked as the label set (GW pumping is the target variable for prediction). It should be noted that our proposed framework provides the flexibility to alter any of these parameters (σ, spatial resolution, and reclassification values) as per the requirement of the specific application. Moreover, in our initial setup, we chose these specific values for model workability purposes. However, observing the effects of modifying these preprocessing parameters could provide further insights into the spatial and temporal mechanisms driving GW demand and usage.

Our Approach-Random Forests
Random forests is a widely used machine learning algorithm that has been extensively applied in the remote sensing domain, primarily for classification purposes with nonlinear or multimodal data sets (Belgiu & Drăguţ, 2016). Although in this work we use it for multivariate regression, the general idea remains the same where the algorithm employs a set of CART for prediction. Essentially, the random forests algorithm behaves as an ensemble or meta estimator where it uses averaging across all the CART to improve the predictive accuracy and control overfitting (Breiman, 2001). To test the performance of our model, we split the original data into training and testing data because the random forests follow a supervised machine learning approach. Here, two main hyperparameters are involved-the number of estimators or trees (n_estimators) and the number of features to consider during the best split operation (max_ features). If max_features = n_features (number of actual features), then the subsample size is equal to the original input sample size where the samples are drawn with replacement. We have iterated our model using different values for both of these hyperparameters. Also, the random_state parameter controls the randomness of the sample bootstrapping and therefore remained fixed throughout the workflow for model reproducibility (Pedregosa et al., 2011).
In our final model setup, we chose n_estimators = 500 which is a commonly used value for remote sensing studies (Belgiu & Drăguţ, 2016;Smith et al., 2018), random_state = 0, max_features = 6, and the other hyperparameters (maximum leaf size and minimum number of splits) as scikit-learn defaults (Pedregosa et al., 2011;scikit-learn developers, 2019). The parameters n_estimators and max_ features were optimized to minimize the root-mean-square error (RMSE) on our testing (validation) data sets (described in section 4). We split the training and testing data considering a leave-one-out (year) cross-validation method where the user is able to choose a particular year as pure test and remaining years as training data respectively.
Regarding model evaluation, we considered the feature importances, partial dependence plots, and different error metrics for the predictions such as the coefficient of determination (R 2 ), RMSE, and mean absolute error (MAE). The feature or variable importances are values (sum up to 1) signifying the impact of each variable (the higher the value, the more important the feature). Although these can be calculated in several ways, in our approach, the feature importance, also called Gini importance or mean decrease in importance (MDI), is defined as the total decrease in node impurity (weighted by the probability of reaching a particular node) which is averaged over all the trees (Breiman, 1984). Currently, this is the most computationally efficient method and is widely used in the scientific community (Breiman, 1984;Pedregosa et al., 2011). Partial dependence plots in conjunction with the feature importances can be useful to interpret the effect of specific variables on the predicted outcome. Partial dependence plots determine the effect on the prediction outcomes when one variable is varied while accounting for all possible values from the rest (Hastie et al., 2013). The equation for partial dependence plots is as follows: where e f x ð Þ is the partial dependence function, n is the number of rows in the data set, f (x, x ic ) is the random forest model, x is the predictor variable of interest, and x ic denotes the values of all other variables.
This allows us to quantify how each input variable impacts GW pumping. For example, we expect regions or years with low precipitation and high ET to have high GW pumping values. If this is verified in our partial dependence plots, we can have confidence that our model is capturing the dynamics of the hydrologic system. In our proposed workflow, we use the scikit-learn library (Pedregosa et al., 2011) to obtain the different partial dependence plots which are discussed in section 4. Lastly, we also perform residual diagnostics to observe the error distribution and perform normality checks (Hastie et al., 2013).

Results and Analysis
In order to better assess our proposed model, we perform two main test cases-one with 2014 (average to slightly dry year) as a pure validation data set and the other with 2012 (extreme drought). Additionally, we also perform model evaluation based on multiple validation data sets. Here, we describe the results and the corresponding analysis for each of these scenarios.

Test Case I-Year 2014
In this case, we select the preprocessed data sets for the Year 2014 as pure validation data for performing leave-one-out cross validation. The number of rows, n, is 142,600 for the training data set and 8,912 for the testing or validation data set. Accordingly, we obtain an overall training score, R 2 ≈ 0.99. Moreover, the error metrics shown in Table 2 suggest that we are achieving good prediction results for the validation data set (or testing data set), R 2 ≈ 0.93, RMSE ≈ 13.57 mm, MAE ≈ 4.31 mm, and normalized MAE ≈ 0.22 (the MAE is normalized by the average annual pumping).
The actual and predicted GW pumping rasters for 2014 are given in Figures 3a and 3b, respectively. Next, we check the goodness of fit based on different plots (R Core Team, 2019) involving predicted values and residual diagnostics. The 1:1 relationship between the actual and predicted GW pumping values is given in Figure 3c which shows that the predictions closely follow the actual values for 2014 (R 2 ≈ 0.93 and RMSE ≈ 13.50 mm).
The model residual diagnostics are shown in Figures 4a-4d. Figure 4a shows some regions in the western portion of the study area that have pockets of high residual error. Figures 4b and 4c show the standardized residuals, obtained by dividing the residuals by their standard deviation. These should lie in the [−2, 2] interval for normally distributed residuals (Hastie et al., 2013). We can see from these plots that there is a slight bias, with the residuals centered around 0 (average residual of 0.40 mm) and no significant trends in the residuals. While about 96% of the data lie within the [−2, 2] interval and Figure 4b seems to follow a nearly normal distribution from the first inspection, the data do not follow a perfect normal distribution. Figure 4d shows a Q-Q plot, which can be used to determine normality. When the curve follows the horizontal line

Water Resources Research
(marked in red), the distribution is normal. As observed, the curve follows a straight line between quantiles −2 and 2, where most of the data lie, but the outliers have some skew.
Some deviation from normality is expected as we are possibly overfitting the random forests model which also agrees with the lower test score (R 2 ≈ 0.93) than training score (R 2 ≈ 0.99). However, given that there is no significant bias or trend in the residuals, we consider the model predictions to be robust.

Test Case II-Year 2012
Similar to the 2014 data set, we selected 2012 as pure validation data set and the rest as training. Here, the model training score, R 2 ≈ 0.99 is also similar. The error metrics for this test case are shown in Table 3.
The actual versus predicted GW pumping plots in Figures 5a and 5b  The actual versus predicted GW pumping values for 2012 are shown in Figure 5c. As expected, there is a significantly higher scatter in Test Case II relative to Test Case I. The lower R 2 ≈ 0.84 and higher RMSE ≈ 24.17 mm for Test Case II also suggest that the predicted GW pumping values deviate more from the actual ones. The model residual diagnostics are depicted in Figures 6a-6d. As shown in Figure 6a, there are several areas in the western portion of the study area with high residuals. Figures 6b and 6c show some bias (with an average residual value of −2.10 mm), and Figure 6d shows that the residuals are not normally distributed though about 96% of the standardized residuals lie in the [−2, 2] interval. These observations are in concordance with the lower testing score of the random forest model, R 2 ≈ 0.84 when compared to the high training score, R 2 ≈ 0.99 signifying bias in the final estimates. Still, considering the extreme drought scenario, the Year 2012 is an outlier with significantly higher GW pumping than observed in the training data sets. Given these limitations, we consider the model performance for this year to be reasonable.

Test Case III-Using Multiple Validation Data Sets
In order to further assess our model, we held the Years 2011 to 2018 out of our model for testing (n = 71,296) and used the data sets from 2002-2010 for training (n = 80,216). Here, we only check the error metrics as given in  Figure S1 which shows that the residuals for the mean error raster are almost normally distributed with a mean and standard deviation of −0.07 and 17.17 mm, respectively. The R 2 in the training data set is significantly higher than the R 2 of the testing data set. This can be a sign of model overfit, when models fit noise rather than causal mechanisms to improve the match of training data, at the expense of true model predictive accuracy (Lever et al., 2016). However, in this case the disparity between training and testing scores is caused primarily by the difference in climate during the training and testing periods. The testing period was intentionally selected to include two major drought years (2011 and 2012), which both had ET and P values outside the range of the training data (supporting information Figures S2a and S2b). In spite of the discrepancy between training and testing scores, the R 2 value of 0.80 indicates a reasonably good fit even in years where the testing data falls outside of the range the training data experienced, indicating a model that is robust in predicting pumping during extreme events.

Discussion
We chose the random forest model developed during Test Case I as the final one for the GW pumping prediction, as it used the most training data and was able to learn from both wet and dry years and is most likely to be effective for future predictions. After the model evaluation phase, we observed that the feature importances in all three test cases were similar, with the most important features being the land use classes, that is, AGRI, SW, and URBAN. This finding is in agreement with GW withdrawals predominantly occurring in agricultural areas with limited surface water availability . More specifically, the feature importances (rounded to two decimal places) of AGRI, SW, URBAN, ET, P, and CC are 0.33, 0.27, 0.19, 0.09, 0.08, and 0.03 in descending order, respectively. This metric gives higher importance to spatially variable but temporally static predictors (i.e., land use) than spatiotemporal variable predictors (i.e., P and ET). The main reason for this is that GW withdrawals vary more spatially than they do temporally, so spatial predictors explain most of the variance in GW pumping values. However, this metric does not provide a complete picture of variable importance, as P and ET are critical

Water Resources Research
variables for quantifying any temporal variations in GW, including the effect of drought or wet years, which is one of the key goals of this study. To improve our understanding of the importance of spatial and spatiotemporal variables, we compared the temporal variations in GW withdrawals from 2002-2018 considering first spatial predictors with no temporal variance (AGRI, SW, URBAN, and CC) and then predictors with spatiotemporal variance (ET and P) separately. Supporting information Figures S3a-S3e indicate that using only ET and P (feature importances of 0.49 and 0.51, respectively) results in an overall underprediction of the mean annual GW withdrawals in each groundwater management district (GMD). However, there is a substantial overprediction for the year 2012 when the entire state of Kansas (supporting information Figure S3f) is considered. This could be attributed to the high GW withdrawals in 2012 (drought year) which lead to significant model overfitting in the absence of land use predictors. Furthermore, when only the spatial predictors are used for model training, we do not observe any temporal variations which are expected. In this case, the feature importances of AGRI, SW, URBAN, and CC are 0.40, 0.31, 0.25, and 0.04 in descending order, respectively. Therefore, the best predictions are obtained when both the spatial and spatiotemporal predictors are synergistically introduced during the model training.
To further explore the importance of each variable, we employ partial dependence plots which indicate a plausible agreement between the model behavior and the expected heuristic relationships (Figures 7a-7e). Groundwater pumping increases with higher AGRI and ET values and decreases with increased P and SW, as would be expected. Also, the GW pumping varies significantly with the URBAN predictor. This could be

Water Resources Research
attributed to the variability in the GW demand of urban areas in Kansas. CC (which has the lowest feature importance) shows a prominent increase in partial dependence (GW) between 1.15 and 1.20 which is the midseasonal CC range of wheat, corn, soy, sorghum, and cotton-the most common agricultural crops in Kansas. It is noteworthy that the land use predictors and the CC are not heavily correlated with each other. However, SW has moderate correlation with ET and P, and ET and P are strongly correlated. Although the partial dependence function assumes that the features are uncorrelated (Hastie et al., 2013), Altmann et al. (2020) show that random forest partial dependence is somewhat robust toward correlated features. The correlation matrix diagram shown in supporting information Figure S2e can be used for obtaining further details regarding feature correlation.
While our model was reasonably accurate, it did display bias in some regions (Figures 4a, 6a, and supporting information Figures S1a-S1i), particularly in west central Kansas, where we are overpredicting pumping, most notably during the major drought year of 2012 ( Figure 6a). This could be related to the transmissivity of the aquifer-the aquifer thickness in many of the areas with overpredicted pumping has reached a point that it can no longer support large-scale pumping for irrigated agriculture . Thus, these areas have high GW demand due to the drought but do not have sufficient resources to meet the demand. We would expect those areas to have lower ET, but the dependence of pumping on ET resulting from the random forests may not be great enough to identify those areas.
The model we developed provides reasonable estimates of GW pumping in Kansas. Since the model was trained on climate (i.e., precipitation and ET) data and land use patterns specific to that region, the relationships it learned between those predictor variables and GW pumping estimates will likely hold in other regions with similar climates and land use patterns. However, it is less transferrable to regions with significantly different climates or land use patterns.

Temporal Trends in Groundwater Pumping and TWS
As a final step of our model evaluation, we used Test Case III (2011-2018 as pure test data) to compare ( Figure 8) the GW pumping values (both actual and predicted) for the entire study area and three specific GMDs (GMD 3, 4, and 5) to the GRACE and GRACE-FO TWS measurements (Landerer & Swenson, 2012;Swenson, 2012). GMDs 3, 4, and 5 cover southwest, northwest, and south central regions of Kansas (Butler et al., 2016) and are under significant stress due to heavy GW pumping. GMDs 3 and 4 are in a semiarid climate, while GMD5 is in a subhumid climate. Due to their varying climate and high water demand,

Water Resources Research
these GMDs have conditions similar to a range of other aquifers that are under heavy water stress, so the performance of our model over these regions is indicative of model performance over regions with similar conditions globally.
From Figure 8a, we observe that the GW pumping predictions for GMD 3 are lower than the actual values. However, the predictions for GMD 4 and 5 match quite well with the actual pumping measurements (Figures 8b and 8c). GMD 3 is likely underpredicted because, in general, the actual values of pumping in GMD 3 are the highest of the study area. Random forests tend to produce conservative estimates that lean toward the mean and thus underpredict and overpredict the highest and lowest estimates, respectively. However, the predictions are still reasonably close, and the other GMD in a semiarid climate (GMD 4) matches closely with observed withdrawals. Thus, we consider the model to perform well in both semiarid and subhumid climates.
Since the TWS measurements in Figure 8e are too coarse to compare at the GMD scale, we use the GW pumping estimates from the entire state of Kansas (Figure 8d) for the comparison. The pattern for the mean annual GW pumping values for Kansas is in general agreement with that for the TWS values. Note that GW pumping most closely matches the change in TWS during each growing season (spring to fall) for any given year. Essentially, this highlights the effectiveness and robustness of our model even when we leave out several years of data from training.
Many studies have found data from the GRACE to be a strong predictor of changes in GW storage at regional scales, that is, hundreds of km (Landerer & Swenson, 2012;Rodell, 2004;Rodell et al., 2007Rodell et al., , 2009Tiwari et al., 2009). For this reason, we initially included GRACE TWS as a predictor in our model. However, we found that our model performed better without GRACE data, so it was ultimately removed. We believe that the value of GRACE in GW storage changes is diminished in our study both due to the coarse GRACE resolution (400 km) (Miro & Famiglietti, 2018) relative to the resolution of our model (5 km), and also due to its correlation with precipitation, which is another strong indicator of drought in Kansas. We consider it likely that GRACE would be a more useful predictor for extending assessments to broader regions, where GRACE is a more consistent indicator of drought than precipitation, which has different normal levels for different regions.

Method Transferability to Similar Regions
In order to verify the transferability of our approach to other regions having similar spatiotemporal characteristics, we built a new test case where we explicitly held out a large region as test data and retrained the model using the remaining areas. Essentially, this is a different train-test split approach (compared to Test Cases I-III) where we consider all the annual data (2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016)(2017)(2018) for model training except those belonging to the test region. More specifically, we selected the test area shown in supporting information Figure S4a because the water availability varies from east to west (less water going west) and north to south (less water going south). This region also had land use and climate patterns that were within the bounds of the other training data covering all the years (2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016)(2017)(2018). The mean annual GW withdrawal predictions corresponding to the test area (supporting information Figures S4b-S4e) indicate that our method was able to predict GW withdrawals reasonably well, having R 2 ≈ 0.23, MAE ≈ 16.01 mm, RMSE ≈ 31.51 mm, and normalized MAE ≈ 0.84. Although the R 2 is low relative to Test Cases I-III, the broader spatial trends of GW pumping, as well as the temporal trends, are captured. The lower fit for this data set is expected, since the entire region being tested has not been "seen" by our model. Thus, while our model is most accurate in predicting GW pumping within the study area and could be used for predicting future GW pumping or filling in temporal gaps, it may also be used to make first-order estimates of GW pumping in other regions with similar climate and land use patterns.

Uncertainty Associated With MOD16, PRISM, and CDL
We selected MOD16 because it is the highest resolution (500 m) globally available ET product. Moreover, various studies have shown that the ET estimates from MOD16 perform reasonably well when validated against eddy covariance flux tower data (Cleugh et al., 2007;Lakshmi et al., 2018;Running et al., 2019;Tang et al., 2011;Velpuri et al., 2013). More specifically, the validation test against individual AmeriFlux towers shows about 10-30% uncertainty associated with the MOD16 ET products (Mu et al., 2011;Velpuri et al., 2013). Furthermore, all remote sensing-based ET products will have certain assumptions leading to 10.1029/2020WR028059 biases (McShane et al., 2017). Our model also has the advantage of learning how to use each input data set given its relationship to the training data, and thus can implicitly account for biases in each product. Thus, we consider MOD16 products to be useful in this framework for regional applications such as ours (Velpuri et al., 2013).
PRISM is a widely used gridded weather database available over the conterminous U.S. region. Evaluations of PRISM precipitation data carried out in the U.S. corn belt (Kansas belongs to this region along with several other states) between 1 April and 30 September show that the precipitation estimates lie within ±19% (RMSE) of the weather-station measured values (Mourtzinis et al., 2017). Nevertheless, with such an extensive database and strict quality control measures (Daly et al., 2008), use of the PRISM data is justified.
The USDA-NASS CDL metadata over Kansas for the Year 2015 (USDA-NASS, 2016) shows a statewide overall accuracy of 89.20% and Kappa coefficient of~0.87 wherein 5,82,034 pixels have been accurately classified. Considering this satisfactory classification accuracy at high spatial resolution (30 m), our choice of using the USDA-NASS CDL data is justified.
Errors in the above data sets could have an impact on our model results. Random variation, rather than systematic bias, is more likely to affect the model results. If there is a consistent pattern (i.e., bias) in any predictor, the random forest algorithm will learn how to use the data to best predict GW withdrawals.

Conclusions
In this research, we developed a new machine learning framework for predicting GW withdrawals at a local scale (5 km spatial resolution). This holistic approach integrates various openly available satellite products that are related in different ways to GW withdrawals. These include ET (MODIS), precipitation (PRISM), and land use data (USDA-NASS). We deployed a machine learning framework using random forests to build a self-learnable method that incorporates these data sets along with in situ GW extraction data (Wilson, 2019). Our workflow is fully automated and has been implemented using open-sourced or freely available software tools and libraries.
A thorough model assessment showed that our proposed approach works satisfactorily with predictions having high R 2 values and low RMSE and MAE. In this work, we considered three test scenarios to validate our method, two of which included the extreme drought period of 2012. Moreover, the random forest feature importances agree with the expected dependencies of GW withdrawals.
As climate change, dietary shifts, and population growth increase the global stress on GW resources, it is critical to develop sustainable GW management practices, yet the volume of GW withdrawals, a key quantity for such efforts, is poorly constrained in many regions. Although one of us has repeatedly called for greater monitoring of GW withdrawals (Butler et al., 2016(Butler et al., , 2020, that has proven challenging in many areas. This method enables water managers to predict GW withdrawals from anthropogenic and climate driverssuch knowledge could lead to proactive implementation of sustainable solutions related to GW withdrawal practices.