Spatio‐Temporal Machine Learning for Regional to Continental Scale Terrestrial Hydrology

Integrated hydrologic models can simulate coupled surface and subsurface processes but are computationally expensive to run at high resolutions over large domains. Here we develop a novel deep learning model to emulate subsurface flows simulated by the integrated ParFlow‐CLM model across the contiguous US. We compare convolutional neural networks like ResNet and UNet run autoregressively against our novel architecture called the Forced SpatioTemporal RNN (FSTR). The FSTR model incorporates separate encoding of initial conditions, static parameters, and meteorological forcings, which are fused in a recurrent loop to produce spatiotemporal predictions of groundwater. We evaluate the model architectures on their ability to reproduce 4D pressure heads, water table depths, and surface soil moisture over the contiguous US at 1 km resolution and daily time steps over the course of a full water year. The FSTR model shows superior performance to the baseline models, producing stable simulations that capture both seasonal and event‐scale dynamics across a wide array of hydroclimatic regimes. The emulators provide over 1,000× speedup compared to the original physical model, which will enable new capabilities like uncertainty quantification and data assimilation for integrated hydrologic modeling that were not previously possible. Our results demonstrate the promise of using specialized deep learning architectures like FSTR for emulating complex process‐based models without sacrificing fidelity.


Introduction
Computational models have been hugely successful in predicting and understanding Earth and environmental systems.Since their early use in the mid twentieth century these models have increased in complexity to account for higher spatiotemporal resolution and a wider range of physical processes.To achieve the highest possible degree of physical realism scientists and engineers often must run these models on high-performance computing clusters and supercomputers, which require large institutional investment to build and maintain.In addition, the computational complexity of using these models on such platforms requires considerable expertise to use effectively.As a result, these large-scale and highly complex models are typically only used by a small subset of researchers.
Emulation (also referred to as reduced order models or surrogate models) has long been a popular method to reduce the computational complexity of running simulations in a number of domains (Astrid et al., 2008;Razavi et al., 2012;C. Wang et al., 2014).Reducing the computational complexity of process based models makes it possible to build more complex workflows such as building model chains, performing parameter calibration or sensitivity experiments (Cheng et al., 2023), and running more scenarios to better understand uncertainties (Kasim et al., 2021).One avenue that is becoming increasingly popular is the use of deep learning (DL) based methods for building emulators of process based models (Doury et al., 2023;Leonarduzzi et al., 2022;Reichstein et al., 2019;Tran et al., 2021).
Deep learning has recently and quickly become a standard piece of the computational modeling toolkit and offers almost universal applicability to modeling tasks (Jordan & Mitchell, 2015).It has also proven very powerful in allowing researchers to take models from disparate applications and apply them to new problems (Khan et al., 2022).This is true in the Earth system sciences and related fields, where deep learning models have been used to simulate atmospheric phenomena (Brenowitz et al., 2020), predict flow in rivers (Kratzert et al., 2018), and monitor land use (Xu et al., 2017) among many other applications.Specifically in hydrology the majority of applications are based around streamflow modeling or other forms of "point-scale" applications where the models are trained on individual sites (Bennett & Nijssen, 2021;De la Fuente et al., 2023;Gauch et al., 2021).However, we know that in subsurface hydrology lateral flow occurs, and can have large impacts on both human and natural systems (Condon & Maxwell, 2019;Fan, 2015).
Hydrologic models that treat both the coupled surface and groundwater systems as well as account for lateral flow in the subsurface are commonly referred to as integrated hydrologic models.These models are among the most complex and comprehensive representations of the terrestrial hydrologic cycle that have been developed.However, they are often difficult to run because they are data-hungry and computationally heavy.This is particularly true in the subsurface, where observations are sparse and parameters are hard to measure (Blöschl et al., 2019).Even when data is available, it is often difficult to calibrate these models because of the computational complexity and large-dimensional search space over parameter configurations (O'Neill et al., 2021).It should also be noted that the lack of spatiotemporally complete (i.e., gridded data sets) observations/reanalysis data for groundwater is a large reason that purely data-driven approaches have not emerged as they have in the weather forecasting domain (Chen et al., 2023;Keisler, 2022;Lam et al., 2022).Because of these challenges the use of emulator or surrogate models is an appealing approach to improving the usability of integrated hydrologic models.
In this study we demonstrate the use of modern deep learning based emulators of a continental scale integrated hydrologic model, without sacrificing spatiotemporal or process fidelity.Specifically, we emulate subsurface flow of the ParFlow-CLM model, developed over a large portion of the contiguous US at a high spatiotemporal resolution (Maxwell & Condon, 2016;Maxwell et al., 2015;O'Neill et al., 2021).Previous work has shown that deep learning is an effective approach to this problem, showing good agreement in predictions as well as computational efficiency on synthetic benchmarks (Maxwell et al., 2021) and on smaller domains (Leonarduzzi et al., 2022;Tran et al., 2021).
In this work, we compare the ability of three different deep learning model architectures to emulate 3d subsurface pressure fields simulated by ParFlow.We compare ResNet, UNet, and Forced-SpatioTemporal-RNN (FSTR) architectures; the first two are off-the-shelf convolutional neural networks that we apply in an autoregressive fashion to build up spatiotemporal predictions.That is, we feed the previous output of the model as an input to the next step of the prediction process, in addition to other features.
The FSTR model is a novel adaptation of the PredRNN model with action-conditioning, which is a videoprediction model that has proven capable for atmospheric & hydrologic modeling as well as robotics (Tran et al., 2021;Y. Wang et al., 2017).We make use of the robotics terminology of "action-conditioning" to explicitly account for the meteorological forcings acting on the hydrology in a way that is separate from the subsurface parameters/geology and the initial state of the domain that we are simulating.
We compare the performance of the three model architectures for emulating the evolution of both pressure heads at multiple depths as well as the derived soil moisture states and water table depths.We find that all three model architectures show good overall capabilities at matching spatial and temporal patterns.Our FSTR architecture consistently shows the best performance results, and is capable of simulating a year of the entire domain in less than an hour on a single 40 GB Nvidia A100 GPU, showing a >1,000 times speedup as compared to the original simulations run on >3,000 CPU cores.Based on these findings we believe that our FSTR architecture could form the basis for a new set of modeling capabilities to fine tune model parameters and enable real-time ensemblebased forecasting.

Modeling Domain and Data
Our modeling domain is based on the ParFlow CONUS1.0 model (O'Neill et al., 2021).The domain covers the majority of the contiguous United States, plus some small portions of Canada and Mexico (Figure 1).We are primarily interested in representing the subsurface hydrology in this domain at a 1 km gridded spatial resolution, with a daily timestep.The gridded domain amounts to a regular grid of 3,342 by 1,888 km.The depth layers of our simulations increase in the downward direction, starting with shallow surface layers and a large groundwater layer.The depths of each layer are 0.1, 0.3, 0.6, 1, and 100 m from the surface to the bottom for a total of 5 layers.
The simulations that we use for training/testing have previously been validated against many observational data sets across multiple variables (e.g., streamflow, evapotranspiration, snow) with favorable results given the default parameter sets (O'Neill et al., 2021).These simulations cover water years (spanning from October 1 to September 30) 2003-2006 at an hourly timescale.We aggregate all the data to a daily timescale by taking the daily means, totals, minimums, and maximums where appropriate.
In this study we are primarily concerned with modeling subsurface and surface water flow, which is represented in ParFlow with Richard's equation (Richards, 1931) parameterized by the van Genuchten closure relations between pressure heads and saturation content (van Genuchten, 1980).Specifically, our emulation target is the full fourdimensional (time + space) pressure head field.From this pressure head field we can use the closure equations and additional calculations to calculate soil moisture and water table depth.The governing equations that describe the dynamics of the subsurface flows are given as: where S s is the specific storage [L 1 ], S is the relative saturation [ ], ψ is the pressure head [L], K s is the saturated hydraulic conductivity tensor [LT 1 ], k is the relative permeability [ ], ϕ is the porosity [ ], z is the elevation [L], and q is a source-sink term [L 3 T 1 ].
The ParFlow simulations that we use to train/evaluate our emulators is forced by a modified version of the NLDAS-2 product.The modifications include a blending of higher-resolution data sets such as Stage IV Radar and GOES Surface and Insolation Products as well as remapping from the native 12 km grid to our 1 km grid using bilinear interpolation.See O'Neill et al. (2021) for more information about the development of the forcing data.The spatially-distributed parameter fields were developed from a variety of data sets including the Soil Survey Geographic Database (SSURGO) and subsurface conductivity maps from Gleeson et al. (2011).More details on the development of the parameter fields can be found in Maxwell et al. (2015).We show the longterm average water table depth to highlight the spatial variability of the domain.

Model Architectures
We explore three deep learning model architectures to emulate the ParFlow simulations described previously.
Here we will describe the overall structure of each of these models, but will leave the detailed input/output setup of them for the following section.Throughout the rest of the paper we will interchangeably use the terms model and emulator.All references to the original ParFlow simulations will be directly specified as ParFlow.We employ three model architectures in this study: a Residual Network (ResNet; He et al., 2015), a UNet (Ronneberger et al., 2015), and a newly developed architecture that we refer to as the ForcedSpatioTemporalRNN (FSTR) model.The first two of these are relatively "standard" models in the deep learning literature at this point and have been used for a large array of tasks (Alzubaidi et al., 2021;Azad et al., 2022).The architectures of the ResNet and UNet are shown in schematic form in Figure 2, while the FSTR architecture is shown in Figure 3.
The development of the ResNet architecture is considered a milestone in the development of machine learning to the task of image classification and paved the way for the modern deep learning revolution (He et al., 2015).We consider the ResNet a strong baseline architecture to compare against as has been adopted by other studies with similar approaches (Haber & Ruthotto, 2018;Kochkov et al., 2021;Ott et al., 2020).This model consists of stacks of "residual blocks," usually convolutional layers followed by a nonlinear activation function.Following these stacks is a residual connection, which adds the input back to the output of the residual block.Architectures with residual connections have been found to train more effectively, especially in very deep networks (Zhang et al., 2023).We stack two residual blocks, each consisting of 1 depthwise-separable convolutional layer with a layer norm and activation function, and a final convolutional layer.We use two residual blocks with each having two convolutional layers with hidden dimension of 256 channels for this study.
The second architecture that we consider is the UNet, which is named as such because of its use of successive downsampling and upsampling layers (along with skip connections) which allow the model to capture spatial relationships at varying resolutions.This type of architecture is considered state of the art in image segmentation which has applications in both remote sensing (Yuan et al., 2021) and medical imaging (Ronneberger et al., 2015).Like the ResNet, the UNet is mainly composed of stacks of convolutional layers.However, the UNet makes use of downsampling and upsampling to get a "multi-resolution" view of the data.In our model, we use stacks of these downsampling and upsampling layers which are composed of convolutional layers that are either preceded by MaxPool layers or followed by bilinear interpolation layers for down and upsampling, respectively.Each downsampling and upsampling layer consists of two convolutions followed by an activation.At parallel levels in the downsampling/upsampling process skip connections are used to transfer information at multiple resolutions to the upsampling, which helps to preserve spatial structure in the data.Here we set the base dimension for the convolutional layer to be eight at the input and output and double/halve the hidden dimension for each down/up sampling layer respectively.
We also developed a novel architecture based on the PredRNN model, which is itself based on the Convolutional LSTM model (ConvLSTM; Shi et al., 2015).It attempts to take best practices from image modeling via CNNs and sequence modeling from recurrent neural networks, particularly with the design of the Long Short Term Memory network (LSTM; Hochreiter & Schmidhuber, 1997).The key insights that the developers of PredRNN and associated models had over the ConvLSTM architectures was that an additional memory/hidden state would better reflect the spatiotemporal state of the system, and could be shared amongst model layers.This, along with several other procedural training techniques led the PredRNN model to be one of the most performant video prediction models available (Y. Wang et al., 2017).In the original PredRNN paper, the authors also introduced an "Action-Conditioned" variant, which allows the input sequence of a robotic arm to be used as a part of the prediction algorithm (Y.Wang et al., 2017).We make use of this modification because the input to the robotic system "acts" on the video stream in a similar way that meteorological forcings "act" on the evolution of hydrologic states.
The main modification that we make to the PredRNN structure is the use of encoders to initialize the memory and cell states for the model.These states are updated throughout the recurrent loop, and by default, are initialized as zeros in the PredRNN structure.However, our insight is that the initial conditions and subsurface parameters can be considered as an approximate byproduct of the true memory of the natural environment, and thus can be used to initialize these hidden states.We use the initial conditions (i.e., the 3-dimensional pressure heads for the domain being simulated) to initialize the memory state and the parameter values (e.g., porosity and permeability for each gridcell) to initialize the cell states.Both encoders consist of convolutional layers that project the inputs into a higher dimensional space that matches the hidden states of the Action-Conditioned ST-LSTM which forms the backbone of the FSTR model.The initial conditions are used as input to the memory encoder as well as used as the starting input pressure field to the model.We use two layers of the AC-ST-LSTM layer in our model, each having a hidden dimension of 16.The core of the FSTR model is the Action-Conditioned SpatioTemporal LSTM layer (AC-ST-LSTM), which was introduced in Y. Wang et al. (2017).This layer modifies the ConvLSTM model in ways that allow it to take in external inputs on the system in the form of an action that is fused to the hidden state by an elementwise multiplication.The equations for the AC-ST-LSTM layer are given by where X t is the input, W▪▪ are the weight matrices, H l t is the hidden state, C l t is the cell state, M l t is the spatiotemporal memory state, A is the action tensor, i,f,g are the gating functions (along with their primed counterparts), V l t is the action fusion which allows for external inputs to modify the system, and o t is the output for timestep t, and l is the depth layer of the network.Finally, the update step for the AC-ST-LSTM layer is Our model structure considers the meteorological forcings to be "action" tensors rather than model inputs directly.This delineation of information partitioning between the initial conditions, parameter values, and meteorological forcings in FSTR is very similar to how ParFlow and many other hydrologic models operate.
All models take in daily minimum temperature, daily maximum temperature, total precipitation, snowmelt, and bulk evapotranspiration as the boundary condition forcing.Static parameters that the models take as input are porosity, permeability, van Genuchten n, van Genuchten alpha, topographic index, elevation, and the distance to the nearest stream based on the digital elevation map.The static parameters were chosen to include the major parameters required to solve Richard's equation as well as represent major topographic features at both point and aggregated scales.The models output 4-dimensional spatiotemporal predictions of the subsurface pressure heads for all grid cells in the input domain.All models are run in an auto-regressive fashion, meaning they start with the initial conditions as a starting point and then evolve their output in response to the forcings and parameter value inputs, at which point they use their own prediction as the initial state for the next time step.All of the models have roughly the same number of parameters: the ResNet has 307 thousand, the UNet has 272 thousand, and FSTR has 194 thousand.We examined both larger and smaller model sizes and found that it was not the dominant control on the model performance.That said, we believe with a larger training data set were compiled a larger model may still provide better performance.
At inference time (i.e., after the models have been trained) we use the models to produce the full 4D pressure head field, which is the quantity that is treated most fundamentally by the form of Richard's equation that is used in the ParFlow simulations.However, when using ParFlow model outputs it is often more useful to calculate other quantities such as surface soil moisture and water table depth from the subsurface pressure head field.We make a similar conversion to the model outputs by processing them into water table depth and surface soil moisture.Water table depth is calculated by finding the depth of the unsaturated zone, and then adding the pressure of the deepest layer in the unsaturated zone.The soil moisture is calculated by the van Genuchten closure relationship where S sat is the saturation water content [ ], S res is the residual water content [ ], ψ is the pressure [L], and α [L 1 ] and n [ ] are the van Genuchten parameters referenced above.These quantities are calculated in the same way that they are in standalone ParFlow simulations, but rather than using the functionality to compute this translation from ParFlow we re-implemented the routines in a PyTorch compatible layer, which in theory could provide these translations at training time.We do not take this approach here, however, due to several technical challenges which will be discussed later but could form the basis for training a fully differentiable model directly to data in the future.

Experimental Setup
In this study we explore the ability of the three neural network architectures to reproduce the 3d ParFlow pressure heads across the CONUS domain.To quantify this, we train each model on water years 2003-2005.The testing data set that we evaluate against is from the water year 2006, and all results shown here are from this set.As mentioned previously, we aggregate variables to daily timesteps to reduce the overall complexity of the data.While we are interested in building emulators that maintain high spatiotemporal resolution we are primarily focused on seasonal to annual modeling, reflecting timescales that groundwater interactions tend to take place at.
We use the Adam optimizer with weight decay (Loshchilov & Hutter, 2019) and a one-cycle learning rate scheduler with a maximum learning rate of 1e 3. We augment the standard L2 (also known as means squared error) loss with an additional term that penalizes gradients in the spatial directions of the predictions, which are calculated via a simple finite difference (Serifi et al., 2021).This modified loss function, where y and ŷ are the target and predicted values (in our case, the 4d pressure field) is given as: We train all models in three phases.For each phase we sample from the training data, using a different "patch" sizes and rollout time horizons.A patch is simply an N by N spatial subsample from the overall domain, and a rollout horizon is the number of timesteps that are taken.For the first phase we train for three epochs using a patch size of 256 and a rollout horizon of 3 days.In the second phase we train for another epoch using 96 pixel chips and a rollout horizon of 35 days.The final phase uses a patch size of 64 and rollout horizon of 120 days.This helps stabilize the autoregressive loop of the model, as well as learn seasonal trends.We found that training across all three phases improved the model prediction in terms of both stability of the predictions and the overall accuracy for all model types.We include a representative loss curve for the three models in the supplementary materials, which shows the effect of this multi-stage training process.All models were trained on a single Nvidia A100 40 GB GPU.Training times varied by model, but generally took roughly 24 hr each, with the ResNet being cheapest and FSTR being the most expensive, though all were quite similar.It is difficult to benchmark the computational performance of the model training because IO and data preprocessing steps often introduced overhead that was affected by use of shared resources on the training server.Nevertheless, training a single instance of an emulator was much less costly than running the ParFlow simulations.We will briefly touch on this in the discussion (Section 4) as well.

Results
To analyze the performance of each of the emulator architectures we calculated the overall root mean square error (RMSE) and Pearson correlation (hereafter, just correlation) for each grid cell in the domain with respect to the original simulations.The resulting spatial maps for these measures are shown in Figure 4.The UNet performs worst in terms of the RMSE, followed by the ResNet, with FSTR showing the best predictions.Overall, all three models have higher error in the western portion of the domain, particularly in the southwestern region showing

Journal of Advances in Modeling Earth Systems
10.1029/2023MS004095 BENNETT ET AL.
negative correlations.This is the most arid portion of the domain, and also has very large spatial gradients in the water table depth.We found that these areas of very high spatial variability in water table depth were difficult for the models to perfectly track over the full year rollout.
Similarly, in Figure 5 we examine the models' ability to emulate the surface soil moisture levels and find all three models perform similarly.The areas of highest error for surface soil moisture tend to be the same for all the models, unlike the results for the water table depth.Of note are three small regions where RMSE is higher: one in north-eastern Arizona (southwestern portion of the domain), and two small clusters in the central portion of the domain.These are quite different regions from a hydrologic perspective, with the section in the southwest being an arid desert and the central regions being grasslands.The fact that they are quite different suggests deficiencies in the input data or model training procedure are the underlying cause, perhaps due to unique configurations in the static parameter fields, though we were not able to diagnose the exact reasons.We also include plots of the distributions of RMSE and correlation for both water table depth and soil moisture in the supplementary materials (Figures S2 and S3 in Supporting Information S1), which make it easy to compare general performance characteristics of each of the model architectures.To better understand the temporal nature of error growth of the emulators, we also show the forecast error in water table depth at increasing prediction times in Figure 6.We find that all three models show increasing error growth, with the UNet showing the highest median and interquartile range at the end of the test period.FSTR maintains the lowest median and interquartile ranges, though all three models have roughly equivalent 90th percentile error ranges.All models show the majority of errors are below 1 m RMSE, which is promising for use in subseasonalto-seasonal modeling contexts.The continual error growth, particularly the 90th percentile, does highlight that using these models for longer term (annual to decadal) applications may require further research.
So far, we have looked at the overall error characteristics in the forecasts across the full domain, but it is worth zooming in on some particular regions to see local performance spatially and temporally.First, in Figure 7 we show a 256 by 256 km sub-domain in the midwestern US.Comparing the time series for a particular location in this region we once again see that the FSTR and ResNet models are able to much more accurately capture the dynamics of the system, as compared to the UNet.Most notably, the FSTR and ResNet models are able to capture the seasonal dynamics in a much more robust way than the UNet, which shows good overall correlation, but drifts in magnitude from the ParFlow results.Similarly, the spatial plots show that FSTR is much more able to represent the spatial heterogeneity across seasons, particularly in areas with shallow water tables.This connects back to the timeseries picture, where we see the UNet consistently predicts a deeper water table and the ResNet consistently predicts a shallower water table depth.None of the models can perfectly capture the structure of the river network, although some general features that correlate well with the river network from ParFlow are seen in all results.
Figure 8 shows another example in the southwestern portion of the domain.This region is much more arid and has deeper water table depths on average than the region shown in Figure 7.Here we see that all models can capture the long-term dynamics of the region and maintain spatial coherence over the simulation period as well.However, we do see that the FSTR model is better able to capture the fine-scale structure, particularly in the sub-basins that form shallow water table depths and rivers.The UNet cannot reproduce very shallow water tables and lacks the river network structure while the ResNet produces a river network that is too large.We also include two more subregions in the supporting information to show details in the central and southeastern portions of the domain, which have different hydrologic conditions but showcase similar conclusions for differences between the model architectures (Figures S4 and S5 in Supporting Information S1, respectively).
In Figure 7 we saw that one of the main deficiencies in the UNet output is drift in the longer-term dynamics, which we found in other regions as well (Figure S5 in Supporting Information S1).To better understand what drives the accumulation of errors over time we looked at how the models respond to precipitation events.We accomplished this by selecting grid cells at varying levels of precipitation and then comparing the response of the surface level pressure heads before and after the storm events (Figure 9).Overall, we found that the model architectures show similar sensitivities to precipitation events, even in the extreme events, though none matched perfectly.Overall, the ResNet was more sensitive to extreme precipitation than the original ParFlow simulations, while the UNet and FSTR showed consistent under-response.The over-sensitivity of the ResNet may be the reason for the overdevelopment of river networks in Figures 7 and 8.The ability of the emulators to respond accurately across a wide range of events is a promising result, although there is clearly room for further improvements with more training data and/or model architecture tweaks.

Discussion and Future Work
In this study we chose to focus entirely on the subsurface and treat the snow and evapotranspiration as inputs to the model.We chose this because ParFlow is substantially more computationally expensive to run than the land surface model CLM, which provides the simulation of the snow, evaporation, and vegetation processes.We plan to incorporate sub-modules that simulate these processes and can be connected to the subsurface component in another study.These processes could also be added to the FSTR model, but this would require the ability to estimate their gridded initial conditions of snowpack and ET to be used in this framework.There are also many additional experiments that could be performed using the framework that we have developed here to further improve the capabilities of the emulators.
One opportunity for further exploration is to modify the training routines that we use to produce more suitable emulators for specific temporal or spatial scales.One of the main drivers in performance of our emulators was the ability to flexibly train them across a variety of spatio-temporal scales.The ResNet was especially sensitive to the choice of training phases as discussed in Section 2.3.Generally, we found that the initial time horizon for the first training phase was the largest factor for controlling model stability, with small horizons being favored over long ones.However, this is at odds with representing the slower timescale processes like water table depth.Training at a single timescale was always met with subpar performance; exclusively using small temporal rollouts led to bad long term predictions and exclusively using large temporal rollouts led to bad predictions of the temporal dynamics.We believe that finding optimal training strategies for models for different spatio-temporal scales will be an important question to answer.
Additionally, our aim of enabling subseasonal to seasonal groundwater prediction was the reason for our initial choice of using a daily timestep, but we showed they can be rolled out to annual scales with little degradation in performance with our FSTR model.It may even be possible to combine training methodologies into a multi-stage model (where sub-models are used for different timescale predictions), similar to the approach taken by the FuXi weather forecasting architecture (Chen et al., 2023).Another possible concern is whether our trained emulators depend highly on the forcing data used.For example, if we were to use our trained emulator in a true forecast context it is unclear how we should integrate bias-correction/post-processing techniques to ensure that the emulator responses are robust.We are currently working on follow-up studies that make use of perturbed parameter and forcing variables to explore this.
Another aspect that we have not yet explored is in the translation from the pressure heads to water table depth and soil moisture during training time.The equations we used are implemented in the same way as they are in ParFlow, but are written as PyTorch layers which, in theory, could be appended to our model structures and trained on directly.It is possible that this could yield better predictions for these quantities but may sacrifice the fidelity of the more fundamental pressure heads.In future work we may explore this along with multi-objective training to capture a wider range of conditions.There are some technical challenges that remain before such experiments can be performed, though we believe this will be a necessary step in order to successfully emulate the overland flow component of ParFlow.
One other potential limitation of our current emulation method is that the emulators are dependent on the soil and vegetation classifications that were used in the original simulations.Work is ongoing to develop strategies for building ParFlow ensembles with varied parameters which would provide the basis for model training to be varied in more robust ways.We hope that this will provide an even stronger emulator which can be used to optimize the subsurface parameter values via simulation-based inference, in turn giving better ParFlow simulation results as compared to observations.Going beyond improving the emulation of ParFlow simulations, this work enables new applications for largescale simulations.For example, having a fast and stable emulator at the seasonal timescale allows for ensemble predictions, uncertainty analysis, and coupling to land surface and atmospheric models.If future work The change in the surface pressure head is defined as the 1-day difference in surface pressure, following a precipitation event with probability associated with the quantile on the x-axis.Boxplots show the horizontal center line as the median, the colored boxes the interquartile range, and the whiskers span to 1.5 times above/below the interquartile range.
shows that such emulation approaches can be stable over the annual-to-decadal periods this will also enable more robust studies on the effects of climate change on groundwater.Additionally, our FSTR architecture should be suitable for modeling in other domains where external forcings act on the state of the system, particularly other areas of hydrologic and land surface modeling.Additionally, we believe that the action-conditioning portion of the architecture provides a natural coupling point to other model types.

Conclusions
In this study we developed a deep-learning model architecture that is a viable approach to full-system emulation of a complex spatiotemporal groundwater model at high resolution.Our results show that off-the-shelf neural network architectures like the ResNet and UNet perform reasonably well in emulating ParFlow, but our FSTR architecture is more accurate at reconstructing groundwater dynamics over long rollout times.Our emulators exhibit much lower computational cost compared to the original Parflow simulations.This opens up many new opportunities to use emulated results for ensemble forecasting and model calibration through simulation-based inference.
The FSTR model architecture developed in this study not only shows good overall performance at simulating pressure heads, water table depth, and soil moisture but is also fast and scalable to run.As a side-benefit the model architecture is easy to conceptually understand because the model inputs directly correspond to the input data types used in most distributed hydrologic models.Initial conditions are used to set the model up for simulation, static physical parameters are used to modulate the time evolution of the system, and meteorologic forcings act on the internal state of the system.Currently each of these subsystems are processed via convolutional layers before being fed into the core AC-ST-LSTM model, but fundamentally could be any neural network component.
The use of deep learning for geophysical modeling is still a rapidly developing field where most applications for land and subsurface modeling are either point-scale timeseries models or spatially distributed models which are not dynamic.In this work we have demonstrated a modeling approach that is not only fully spatiotemporal, but also can be used to represent multiple processes simultaneously.We consider this work to be a step in moving toward more advanced machine learning representations of the subsurface, which is currently lacking compared to capabilities for weather and atmospheric predictions.

Figure 1 .
Figure 1.The extent of our modeling domain, covering most of the Contiguous United States (CONUS).We show the longterm average water table depth to highlight the spatial variability of the domain.

Figure 2 .
Figure 2. Diagrams of the two baseline model architectures used in our experimental setup.Left shows a ResNet architecture, which stacks convolutional blocks with residual connections that propagate the input signal into deeper layers.The N and M hyperparameters in the dashed boxes show units which can be repeated.Here we use N = 2 and M = 2.The right panel shows the UNet architecture, which also consists of convolutional blocks, but differs from the ResNet by using them in successive downsampling and upsampling configurations, which consider multiple resolutions of the input data.

Figure 3 .
Figure 3.An architecture diagram of the Forced Spatio Temporal recurrent neural network (FSTR).Tensor variables are represented in yellow colors, while neural-network layers with trainable weights are drawn in blue.The red "initialization" phase is only run a single time per training example, while the update arrow is run in a recurrent loop.Quantities in the orange outlined boxes represent the hidden states and model inputs.

Figure 4 .
Figure 4. Spatial metrics of performance foreach model architecture's emulated water table depth over the entire testing period of water year 2006.The left column (a-c) shows the root mean square error (RMSE), while the right (d-f) shows the Pearson correlation.

Figure 5 .
Figure 5. Spatial metrics of performance for each model architecture's emulated surface soil moisture over the entire testing period of water year 2006.The left column (a-c) shows the root mean square error (RMSE), while the right (d-f) shows the Pearson correlation.

Figure 6 .
Figure 6.Growth of the error distribution of water table depth with increasing forecast length.The central lines show the median RMSE, while the shaded lines show the interquartile and 10%-90% range of errors.

Figure 7 .
Figure 7.A zoom-in of a region of the domain in the midwest highlighted in red on the upper right.The spatial plots correspond to time slices designated by dashed vertical lines in the timeseries.The timeseries is shown for the location marked in the lower left with a black "x."

Figure 9 .
Figure 9.Comparison of the response of the surface layer pressure head to different precipitation inputs.The change in the surface pressure head is defined as the 1-day difference in surface pressure, following a precipitation event with probability associated with the quantile on the x-axis.Boxplots show the horizontal center line as the median, the colored boxes the interquartile range, and the whiskers span to 1.5 times above/below the interquartile range.