A Novel Deep Learning Approach for Data Assimilation of Complex Hydrological Systems

In hydrological research, data assimilation (DA) is widely used to fuse the information contained in process‐based models and observational data to reduce simulation uncertainty. However, many popular DA methods are limited by low computational efficiency or their reliance on the Gaussian assumption. To address these limitations, we propose a novel DA method called DA(DL), which leverages the capabilities of DL to model non‐linear relationships and recognize complex patterns. DA(DL) first generates a large volume of training data from the prior ensemble, and then trains a DL model to update the system knowledge (e.g., model parameters in this study) from multiple predictors. For highly non‐linear models, an iterative form of DA(DL) can be implemented. Additionally, strategies of data augmentation and local updating are proposed to enhance DA(DL) for problems involving small ensemble size and the equifinality issue, respectively. In two hydrological DA cases involving Gaussian and non‐Gaussian distributions, DA(DL) shows promising performance compared to two ensemble smoother methods, that is, ES(K) and ES(DL), which respectively apply the Kalman‐ and DL‐based updates. Potential improvements to DA(DL) can be made by designing better DL model architectures, imposing physical constraints on the training of the DL model, and further updating other important variables like model states, forcings and error terms.


Introduction
Effective prediction of hydrological systems generally requires two sources of information. The first source is the mathematical model (e.g., analytical, numerical or data-driven) that embodies our understanding of the hydrological process (Clark et al., 2015; Peel & Blöschl, 2011; Spieler et al., 2020). The second source is the observational data obtained at different spatial/temporal scales using various techniques (Etter et al., 2020; Slater & Binley, 2021; F. Zheng et al., 2018). Due to insufficient knowledge of the system dynamics and related parameters, the model may not adequately represent the hydrological reality, although it can provide spatially and temporally continuous predictions (Liu & Gupta, 2007; Vrugt et al., 2008). On the other hand, the observational data are more consistent with the system behavior (despite some discrepancies due to measurement errors), but are usually sparse in time and/or space and may lack the ability to explain and extrapolate. In general, either source of information is incomplete and should not be used alone to support hypothesis testing and decision-making.
To synergize scientific knowledge (i.e., model) with observations (i.e., data), two different strategies can be implemented. The first strategy uses the data to constrain the model in order to obtain more reliable spatial/temporal simulations and predictions of the hydrological system. This can be achieved by applying data assimilation (DA) to extract information from the observations and update the model state (Evensen, 2009), structural errors (Evensen, 2019; Smith et al., 2008), parameters (Chen & Zhang, 2006), and initial/boundary conditions (Dechant & Moradkhani, 2011), among others. The second strategy, on the other hand, utilizes scientific knowledge to improve data-driven prediction approaches (e.g., machine learning) to make them more explainable and extrapolative. This strategy has gained popularity in recent years across various research fields (Karpatne et al., 2022). Some successful applications include process-guided deep learning (DL) for lake temperature prediction (Read et al., 2019) and physics-informed neural networks for both forward and inverse problems (Q. He et al., 2020; Karniadakis et al., 2021). Our focus here is on the first strategy of hydrological DA.
The second strategy is beyond the scope of the present work; interested readers can refer to a recent book on knowledge-guided machine learning edited by Karpatne et al. (2022).
In hydrological predictions, simulation uncertainties are inevitable and typically treated as random variables. DA can be viewed from the Bayesian perspective to quantify these uncertainties. The background knowledge is first represented by a prior distribution and then updated to a posterior probability by assimilating the information contained in the observational data (Carrassi et al., 2018; Law et al., 2015; Reich & Cotter, 2015). Compared to the prior distribution, the posterior usually better reflects the underlying data-generating process. The quality of a DA approach is affected by two factors: (a) how informative are the data? and (b) how effective is the DA algorithm at extracting the information contained in the data? Under a limited budget, it is essential to make rational decisions about when, where, and what kind of data to collect. This can be achieved through Bayesian experimental design (Tarakanov & Elsheikh, 2020; Thibaut et al., 2022; J. Zhang et al., 2015) or data-worth analysis (Dausman et al., 2010; Wang et al., 2018; Xue et al., 2014). To handle non-linear and non-Gaussian observation/system models, Markov chain Monte Carlo (MCMC) or particle filter (PF) methods can be used to approximate the posterior, even when its exact form is unknown (Moradkhani et al., 2005; Shi et al., 2023; Vrugt, 2016). However, MCMC and PF can become computationally expensive when dealing with complex problems due to the curse of dimensionality, despite recent advances in improving their simulation efficiency (Pan et al., 2022; Pulido & van Leeuwen, 2019; Reuschen et al., 2021; J. Zhang, Vrugt, et al., 2020). Nevertheless, high-dimensional DA problems can be handled efficiently if the models are approximately Gaussian. The ensemble Kalman filter (EnKF) is one such DA algorithm that has been extensively used in hydrological science (Evensen, 2009). EnKF is a Monte Carlo implementation of the Kalman filter for sequential DA: it represents the system state distribution using an ensemble of state vectors and computes the mean and covariance matrix from this ensemble. When the ensemble size of EnKF is small, its robustness can be improved by artificially increasing the spread of the forecast ensemble through inflation (Bauser et al., 2018) or by considering the spatial decay of correlations through localization (Anderson, 2012). For efficiency and simplicity, all available data can be assimilated in a single global update using the ensemble smoother (ES; van Leeuwen & Evensen, 1996), a popular variant of EnKF. To deal with strongly non-linear models, iteration can be introduced to EnKF and ES, resulting in the iterative EnKF (Gu & Oliver, 2007; Lorentzen & Naevdal, 2011) and the iterative ES (Chen & Oliver, 2012; Emerick & Reynolds, 2013). In hydrological DA, 4D-Var is commonly employed as a competitor to EnKF, and the two methods may each demonstrate specific advantages in different situations (Lorenc, 2003; Skachko et al., 2014). Despite their popularity in various research fields, the performance of these DA methods may still deteriorate when dealing with non-Gaussian problems. Currently, developing DA methods that are both general (i.e., suitable for non-linear, high-dimensional and non-Gaussian problems) and efficient (i.e., without requiring a massive number of model runs) is an important need in hydrological science.
Essentially, EnKF and its variants work by updating the state from the innovation vector (i.e., the difference between the perturbed observation and the model prediction). The Kalman gain matrix, based on the first two statistical moments, defines a linear mapping from the innovation vector to the update vector (i.e., the difference between the updated state and the prior state). Thus, these Kalman-based DA methods are restricted by the Gaussian assumption. To relieve this constraint and formulate a more general DA method, we previously proposed to use DL to construct a non-linear mapping to replace the Kalman gain matrix (J. Zhang, Zheng, et al., 2020). A high volume of training data is generated from the prior ensemble to train the DL model, and possible non-Gaussian patterns in the data can be automatically recognized. This method, called ES(DL), has shown promising performance in subsurface characterization problems involving high-dimensional and non-Gaussian variables. Later, Man et al. (2022) applied ES(DL) to characterize vapor intrusion sites. Xiao et al. (2023) used the DL-based DA method to update future states of geological CO₂ plumes from historical data. Godoy et al. (2022) adopted random forests instead of DL to construct a non-linear mapping to improve EnKF in the estimation of heterogeneous conductivity fields. Wang and Yan (2022) introduced multi-fidelity simulation to further improve the efficiency of ES(DL) for fast DA of subsurface flow problems.
Despite the improved performance of ES(DL) over its Kalman-based counterpart, there is still room for further enhancement. Our primary motivation here lies in crafting a DA method capable of adeptly addressing a range of challenges, including high-dimensionality, non-Gaussianity, and equifinality, without requiring an extensive number of model evaluations. In this paper, we introduce a novel DL-based DA method called DA(DL). In contrast to ES(DL), which exclusively uses the innovation vector as the predictor, DA(DL) simultaneously incorporates the prior parameters, model outputs, and the innovation vector as inputs to the DL model. This strategic integration empowers DA(DL) to harness a more extensive array of information, effectively transcending the conventional updating framework that centers predominantly on the innovation vector, as commonly seen in methods like EnKF, ES, and ES(DL). When running the system model takes a long time, only a small ensemble size is usually affordable. To ensure the robustness of DA(DL) in this situation, we introduce a simple but effective data augmentation method. In addition, to address the issue of equifinality commonly encountered in hydrological research, we introduce the local updating approach developed in our previous work (J. Zhang et al., 2018) to DA(DL).
The rest of this paper is organized as follows. Section 2 presents the theory and implementation details of the new DA(DL) method. Then, two hydrological cases are used to demonstrate the performance of DA(DL). Finally, conclusions and discussions are provided in the last section.

Methods
In this work, the hydrological system of concern is described as ỹ = H(m*, U) + ϵ, where ỹ signifies a vector of observational data obtained at different times and locations that summarizes the responses of the hydrological system H to external forcings U, m* denotes the unknown parameters, and ϵ represents the measurement errors. When a simulator of the hydrological process is available, the data can be modeled as ỹ = F(m, U, ψ_0) + E, where F(⋅) denotes the simulator, ψ_0 signifies the initial states, and E represents additive errors originating from the observational and modeling processes. In hydrological DA, the observational data can either be assimilated sequentially in a filtering problem or used in a batch update with a smoother. Moreover, DA can target not only model parameters but also model states, external forcings, and model errors. In this work, we focus on the parameter estimation problem. Adopting a Bayesian formalism, the posterior distribution of the model parameters can be derived as p(m|ỹ) = p(ỹ|m)p(m)/p(ỹ), where p(m) and p(m|ỹ) are the prior and posterior distributions of the model parameters, respectively, p(ỹ|m) denotes the likelihood function, and p(ỹ) = ∫ p(ỹ|m)p(m)dm is the evidence. For a complex hydrological system that involves non-linear processes, the analytical form of p(m|ỹ) is not available, and Monte Carlo methods can be used to provide an approximate estimate.
EnKF and its variants (e.g., ES) use an ensemble of parameters or states to represent uncertainties. From p(m), N_e random samples can be drawn to form the prior parameter ensemble, M^0 = {m_1^0, …, m_{N_e}^0}. Through running F(⋅), the ensemble of prior model outputs can be obtained, that is, Y^0 = {y_1^0, …, y_{N_e}^0}. The Kalman formula can then be used to update each sample in M^0 as m_i^1 = m_i^0 + C_MY^0 (C_YY^0 + R)^{-1} (ỹ + ϵ_i − y_i^0), where i = 1, …, N_e, M^1 = {m_1^1, …, m_{N_e}^1} is the updated parameter ensemble, C_MY^0 is the cross-covariance between M^0 and Y^0, C_YY^0 is the auto-covariance of Y^0, R is the covariance of the measurement errors, and ϵ_i is a random realization of the measurement errors. This formula describes an update from the innovation vector, Δy_i = ỹ + ϵ_i − y_i^0, to the update vector, Δm_i = m_i^1 − m_i^0, that is, Δm_i = M_K(Δy_i),
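For illustration, the ensemble update above can be sketched in a few lines of numpy (hypothetical variable names; a minimal sketch, not the implementation used in this study), where the covariances are estimated from the prior ensemble and every member is updated with its own perturbed observation:

```python
import numpy as np

def es_k_update(M0, Y0, y_obs, R, rng):
    """One ES(K) update of a prior parameter ensemble.

    M0 : (N_m, N_e) prior parameter ensemble
    Y0 : (N_d, N_e) corresponding model outputs
    y_obs : (N_d,) observation vector
    R : (N_d, N_d) measurement-error covariance
    rng : numpy random Generator, e.g., np.random.default_rng(0)
    """
    N_e = M0.shape[1]
    # Ensemble anomalies (deviations from the ensemble mean)
    A_m = M0 - M0.mean(axis=1, keepdims=True)
    A_y = Y0 - Y0.mean(axis=1, keepdims=True)
    C_MY = A_m @ A_y.T / (N_e - 1)            # cross-covariance between M0 and Y0
    C_YY = A_y @ A_y.T / (N_e - 1)            # auto-covariance of Y0
    K = C_MY @ np.linalg.inv(C_YY + R)        # Kalman gain matrix
    # Perturb the observation independently for each ensemble member
    eps = rng.multivariate_normal(np.zeros(len(y_obs)), R, size=N_e).T
    innovation = y_obs[:, None] + eps - Y0    # Δy_i = ỹ + ϵ_i − y_i^0
    return M0 + K @ innovation                # m_i^1 = m_i^0 + K Δy_i
```

In practice, a pseudo-inverse or a truncated singular value decomposition is often preferred over the direct inverse when C_YY + R is ill-conditioned.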
where M_K(⋅) is the mapping defined by the Kalman gain matrix, K = C_MY^0 (C_YY^0 + R)^{-1}. It is clear that this mapping is linear and depends on the Gaussian assumption. As shown in our previous work (J. Zhang, Zheng, et al., 2020), this Kalman-based DA method, called ES(K) for convenience, cannot obtain reliable results in hydrological DA problems that involve non-Gaussian distributions.
To address this issue, we formulated a non-linear mapping with DL. From M^0 and Y^0, we can randomly select two samples without repetition, say the jth and kth members, and calculate the differences Δm = m_j^0 − m_k^0 and Δy = y_j^0 − y_k^0 to form one training pair. The number of such combinations is N_e(N_e − 1)/2. By feeding these training data to a properly designed DL model, we can obtain a non-linear mapping and recognize complex patterns, such as non-Gaussian distributions, contained in the data. This DL-based method, known as ES(DL), performs similarly to ES(K) under the Gaussian condition. However, in non-Gaussian cases, ES(DL) can produce more reliable results (J. Zhang, Zheng, et al., 2020).
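The pairwise-difference construction can be sketched as follows (numpy only, hypothetical names); each unordered pair of prior members contributes one (Δy, Δm) sample, giving N_e(N_e − 1)/2 samples in total:

```python
from itertools import combinations
import numpy as np

def build_pairs_es_dl(M0, Y0):
    """Create (Δy, Δm) training samples for ES(DL) from the prior ensemble.

    M0 : (N_m, N_e) prior parameters, Y0 : (N_d, N_e) prior outputs.
    Returns arrays of shape (N_pairs, N_d) and (N_pairs, N_m).
    """
    N_e = M0.shape[1]
    dY, dM = [], []
    for j, k in combinations(range(N_e), 2):   # all N_e(N_e - 1)/2 unordered pairs
        dY.append(Y0[:, j] - Y0[:, k])         # predictor: difference of outputs
        dM.append(M0[:, j] - M0[:, k])         # target: difference of parameters
    return np.asarray(dY), np.asarray(dM)
```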
Nevertheless, in ES(DL), the starting points of Δm and Δy, that is, m^0 and y^0, are not utilized in either the training or the inference of the DL model, leaving room for potential improvement. For a DL model, if more relevant predictors can be treated as inputs, the target should be better identified. Based on this idea, we propose in this work a more powerful DA method than ES(DL), which is called DA(DL). This new method constructs a non-linear mapping with three predictors, that is, Δm = M_DL(Δy, m^0, y^0), and each sample in the updated ensemble can be obtained as m_i^1 = m_i^0 + M_DL(ỹ + ϵ_i − y_i^0, m_i^0, y_i^0). This allows DA(DL) to transcend the limitations of ES and become more versatile and adaptable. In addition to the choice of predictors, the structure of the DL model also significantly impacts the assimilation result. As the design space of a DL model is infinite, it is impossible to identify the optimal architecture. After comparing popular DL models such as DenseNet (Huang et al., 2017), ResNet (K. He et al., 2016) and U-Net (Ronneberger et al., 2015) in multiple problems involving both Gaussian and non-Gaussian distributions, we found that models with a specific encoder-decoder architecture can generally produce satisfying results for both ES(DL) and DA(DL). The encoder-decoder architecture consists of two sub-networks: an encoder that compresses the input into a smaller spatial representation with more channels, and a decoder that expands the spatial dimensions while reducing the number of channels.
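Following the same logic, a DA(DL) training sample augments each pair with its starting point (m^0, y^0). A plausible minimal sketch is given below; the exact pairing convention (which member of a pair acts as the starting point) is an illustrative assumption rather than the authors' specification:

```python
from itertools import combinations
import numpy as np

def build_pairs_da_dl(M0, Y0):
    """Create ({Δy, m0, y0}, Δm) training samples for DA(DL).

    Assumption: for the pair (j, k), member k is treated as the starting
    point, so the predictors are (y_j - y_k, m_k, y_k) and the target is
    m_j - m_k.
    """
    X_dy, X_m0, X_y0, T_dm = [], [], [], []
    for j, k in combinations(range(M0.shape[1]), 2):
        X_dy.append(Y0[:, j] - Y0[:, k])   # innovation-like predictor
        X_m0.append(M0[:, k])              # starting-point parameters
        X_y0.append(Y0[:, k])              # starting-point outputs
        T_dm.append(M0[:, j] - M0[:, k])   # target update vector
    return tuple(np.asarray(a) for a in (X_dy, X_m0, X_y0, T_dm))
```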
In highly non-linear DA problems, a single update using ES(K), ES(DL), or DA(DL) may not be sufficient. In this situation, it is suggested to assimilate the observational data multiple times (Emerick & Reynolds, 2013). To guarantee that the updating results are reasonable, the random realizations of measurement errors generated in iteration t (t = 1, …, N_iter, where N_iter is the number of iterations) should be inflated by a factor of β_t, with the inflation coefficients chosen such that the repeated assimilations remain statistically consistent with a single assimilation of the data (Emerick & Reynolds, 2013). The updating schemes of ES(K), ES(DL) and DA(DL) then become m_i^t = m_i^{t−1} + M_K(Δy_i^t), m_i^t = m_i^{t−1} + M_DL(Δy_i^t), and m_i^t = m_i^{t−1} + M_DL(Δy_i^t, m_i^{t−1}, y_i^{t−1}), respectively, where Δy_i^t = ỹ + ϵ_i^t − y_i^{t−1} and ϵ_i^t is the inflated error realization. Finally, we use the ensemble obtained after the last iteration, M^{N_iter}, to approximate the posterior distribution of model parameters.
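The multiple-assimilation loop can be sketched generically as follows (a minimal numpy sketch with hypothetical names; `update_fn` stands in for the ES(K)-, ES(DL)- or DA(DL)-type update, and the constant inflation coefficient β_t = N_iter is assumed here as the common ES-MDA choice, not necessarily the setting used in this study):

```python
import numpy as np

def iterative_da(M0, forward_model, y_obs, R, update_fn, n_iter=4, seed=0):
    """Assimilate the same data n_iter times with inflated error realizations.

    update_fn(M, Y, y_pert, R_infl) must return the updated parameter ensemble.
    Assumption: beta_t = n_iter for every iteration (ES-MDA-style inflation).
    """
    rng = np.random.default_rng(seed)
    M = M0.copy()
    for t in range(n_iter):
        beta_t = float(n_iter)                   # inflation coefficient for iteration t
        Y = forward_model(M)                     # run the simulator on every member
        eps = rng.multivariate_normal(np.zeros(len(y_obs)),
                                      beta_t * R, size=M.shape[1]).T
        y_pert = y_obs[:, None] + eps            # inflated perturbed observations
        M = update_fn(M, Y, y_pert, beta_t * R)  # Kalman- or DL-based update
    return M                                     # approximates posterior samples
```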
In many situations, evaluating the system model F(⋅) can be computationally intensive. As a result, a small ensemble size N_e is often used. Even though the number of training data fed to the DL model, that is, N_e(N_e − 1)/2, is much larger than N_e, it may still not be enough to train a data-hungry DL model. To address this issue, a simple yet effective data augmentation method is proposed: in the tth iteration (t = 1, …, N_iter), the set of training pairs is expanded by a factor of M, so that a training data set with M·N_e(N_e − 1)/2 samples is obtained. This way of expanding the training data set can reduce over-fitting and make the DL model more generalizable. The performance of this data augmentation method will be demonstrated in Section 3.1.
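The text only specifies the resulting sample count; one plausible way such an M-fold expansion could be realized is sketched below (an assumption for illustration, not the authors' exact augmentation scheme), where each pair is replicated M times with small random perturbations of the predictor:

```python
import numpy as np

def augment_pairs(dY, dM, M=20, noise_std=0.01, seed=0):
    """Expand a (Δy, Δm) training set by a factor of M.

    Assumption: each pair is replicated M times and small random noise is
    added to the predictor; this is only one plausible way to obtain the
    M * N_e * (N_e - 1) / 2 samples mentioned in the text.
    """
    rng = np.random.default_rng(seed)
    dY_aug = np.repeat(dY, M, axis=0).astype(float)
    dM_aug = np.repeat(dM, M, axis=0).astype(float)
    dY_aug += rng.normal(0.0, noise_std, size=dY_aug.shape)  # perturb predictors only
    return dY_aug, dM_aug
```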
In hydrological simulations, one major challenge for many DA methods is the equifinality issue, where multiple parameter sets with significantly different values can produce equally good performance. From the Bayesian perspective, this means that the posterior distribution of model parameters is multi-modal or flat (say, a uniform distribution). To improve the performance of DA(DL) in multi-modal DA problems, the local updating approach proposed in our previous work (J. Zhang et al., 2018) can be introduced. The idea behind the local updating approach is straightforward: although globally the distribution is non-Gaussian (e.g., multi-modal), locally it is still approximately Gaussian. Based on this idea, the local ensemble of m_i^{t−1} (i = 1, …, N_e, t = 1, …, N_iter) can be obtained using a similarity measure between ensemble members. Using DA(DL), we can then obtain the updated local ensemble, M_{i,local}^t, from which a random sample, m_i^t, can be drawn as the updated sample of m_i^{t−1}. For more details about the local updating approach and its application to ES(K), one can refer to J. Zhang et al. (2018). When implementing DA(DL) with the local updating approach, the DL model should be trained N_e times in each iteration. If each DL model is trained from scratch, this requires considerable time and computational resources. To speed up this process, a transfer learning approach can be applied: from the entire prior ensemble, we first train a basic DL model, and this trained model is then fine-tuned with the data obtained from each local ensemble. This fine-tuning process is usually easy and fast.
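For orientation, the selection of local ensembles might be sketched as follows; the actual measure is defined in J. Zhang et al. (2018), so the Euclidean distance in parameter space and the interpretation of α as a fraction of the ensemble used here are assumptions made purely for illustration:

```python
import numpy as np

def local_ensembles(M, alpha=0.1):
    """Select a local ensemble for every member of M ((N_m, N_e) array).

    Assumption: the local ensemble of member i consists of the alpha * N_e
    members closest to it in Euclidean distance; the measure actually used by
    the local updating approach is given in J. Zhang et al. (2018).
    """
    N_e = M.shape[1]
    n_local = max(2, int(np.ceil(alpha * N_e)))
    dists = np.linalg.norm(M[:, :, None] - M[:, None, :], axis=0)  # pairwise distances
    return [np.argsort(dists[i])[:n_local] for i in range(N_e)]    # member indices
```

Each local ensemble would then be updated with a DL model fine-tuned from the one trained on the full prior ensemble, which is the transfer-learning shortcut described above.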
To demonstrate the effectiveness of the local updating approach, we conduct a simple test on a problem with a multi-modal posterior. The system model considered in this problem is y = m_1² + m_2², with both m_1 and m_2 following a uniform prior distribution of U(−2, 2). To infer the posterior distribution of m = {m_1, m_2}, we utilize a single observation of ỹ = 1, with an error that follows a Gaussian distribution, ϵ ∼ N(0, 0.01²). It is noteworthy that an infinite set of parameter combinations can accurately fit the observation due to the problem's nature, resulting in a posterior distribution that takes on a circular shape. Here, four DA methods with N_e = 300 and N_iter = 2 are implemented, that is, ES(K), ES(DL), DA(DL), and DA(DL) with local update (α = 0.1). Both ES(DL) and DA(DL) employ a feedforward neural network (FNN) with one hidden layer that has 10 nodes. In DA(DL), we use {Δy, y^{t−1}} as inputs to the FNN model instead of {Δy, m^{t−1}, y^{t−1}} due to differences in the dimensions of m and y. DL models are capable of merging predictors with different dimensions, which will be demonstrated in the succeeding section. As shown in Figure 1, introducing the local updating approach to DA(DL) enables the method to successfully identify the circle-like posterior. The average root-mean-square error (RMSE) between model simulations and ỹ is also evaluated, with values of 2.335, 2.146, 1.861 and 0.0730 for ES(K), ES(DL), DA(DL), and DA(DL) with local update, respectively.
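As a reference check of the posterior shape (not one of the DA methods compared above), a simple importance weighting of the prior ensemble under the stated toy settings makes the circular posterior visible; the snippet below is a hypothetical sketch using only the problem description given in the text:

```python
import numpy as np

rng = np.random.default_rng(42)
N_e = 300
m = rng.uniform(-2.0, 2.0, size=(2, N_e))   # prior ensemble of {m1, m2}
y = m[0] ** 2 + m[1] ** 2                    # forward model y = m1^2 + m2^2
y_obs, sigma = 1.0, 0.01

# Importance weights under the Gaussian likelihood; only members close to the
# circle m1^2 + m2^2 = 1 receive non-negligible weight.
log_w = -0.5 * ((y - y_obs) / sigma) ** 2
w = np.exp(log_w - log_w.max())              # stabilized exponentiation
w /= w.sum()
print("weighted mean radius:", np.average(np.sqrt(y), weights=w))
print("effective sample size:", 1.0 / np.sum(w ** 2))
```

The very small effective sample size illustrates why plain Kalman-type updates struggle here and why the local updating strategy is helpful.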

The Non-Gaussian Condition
In this section, we compare the performance of DA(DL) with ES(K) and ES(DL) in a non-Gaussian setting. For this purpose, we simulate transient groundwater flow in a confined, channelized aquifer. The domain size is 800 (L) × 800 (L), with impervious upper and lower boundaries, as well as two constant-head boundaries at the left (202 L) and right (198 L) sides. The initial hydraulic head is set to 198 (L) everywhere except at the left boundary, where it is prescribed. The system includes an injection well with a rate of 150 (L³T⁻¹) and a pumping well with a rate of 150 (L³T⁻¹), which enhance water flow within the domain. The channelized field comprises two materials with distinct hydraulic conductivities: K_1 = 0.5 (LT⁻¹) and K_2 = 2.3 (LT⁻¹). With the above settings, we can obtain transient hydraulic heads h(x, t) at different locations and times by solving the flow equation S_s ∂h(x, t)/∂t = −∇·q(x, t) + g(x, t) with MODFLOW (Harbaugh et al., 2000), where S_s (L⁻¹) represents the specific storage, x (L) denotes location, t (T) is time, q(x, t) = −K(x)∇h(x, t) signifies the water flux, ∇ is the nabla operator, and g(x, t) (T⁻¹) denotes the source or sink term. For this model, we uniformly divide the domain into 41 × 41 grids, set the simulation time to 18 (T), and use S_s = 0.0001 (L⁻¹).
In this case, the spatial distribution of K is unknown and should be inferred from indirect observations. As depicted in Figure 2b, the reference field of K is non-Gaussian, making it challenging to estimate accurately.
To accomplish this, hydraulic head measurements are collected from 7 × 7 wells at t = {0.6, 1.2, …, 6.0} (T), with errors that follow ϵ ∼ N(0, 0.01²). For the three DA methods, the same prior parameter ensemble with N_e = 499 samples is generated with the direct sampling method proposed by Mariethoz et al. (2010). The training image featured in Figure 2a serves as the basis for generating these samples. As the level of non-linearity in this problem is relatively low, we only set up one iteration for each of the three DA methods.

Figure 3 depicts the DL model used by DA(DL), with three predictors, namely Δy, m^0, and y^0, and one target, namely Δm. The dimensions of the input variables Δy and y^0 are both 7 × 7 × 10, corresponding to the 7 × 7 measurement wells and 10 sampling times. The dimensions of the input variable m^0 and the target variable Δm are both 41 × 41 × 1, representing the 41 × 41 model grids. The DL model consists of the U-Net part (blue arrows) that inputs Δy and the extra parts (brown arrows) that input m^0 and y^0. The U-Net part is composed of two pathways, an encoder path on the left and a decoder path on the right. For the input variable Δy, we first utilize transposed 2-D convolution (ConvT) and 2-D convolution (Conv) to extend the spatial dimensions from 7 × 7 to 16 × 16 and simultaneously increase the channel number from 10 to 64. The non-linear activation function of rectified linear unit (ReLU) is used after ConvT and Conv. As feature maps move through the encoder path, the spatial dimensions are progressively reduced while the channel numbers increase, utilizing layer types such as Conv, ReLU, Max Pooling and Dropout (with 30% probability). This continues until the feature size reaches 2 × 2 × 512. The decoder path, on the other hand, expands the spatial dimensions and reduces the number of channels until the feature reaches a size of 16 × 16 × 64. For the purpose of producing finer-grained predictions, skip connections are employed in the U-Net structure to facilitate direct forwarding of feature maps from the encoder to the decoder pathway. To incorporate the information contained in m^0 and y^0, the extra parts with brown arrows transform each of these inputs into two output features with sizes of 8 × 8 × 64 and 2 × 2 × 256, which are concatenated with features in the encoder and decoder paths that have the same spatial dimensions, further improving the model's performance. Here, ES(DL) only uses the U-Net part, that is, the input Δy "flows" through the blue arrows to the target Δm.
For both ES(DL) and DA(DL), we train the DL models using the Adam optimizer with a constant learning rate of 0.001. During the training process, we use mini-batches containing 512 samples over the course of 50 epochs. Furthermore, we incorporate a gradient threshold method that clips any gradient values exceeding a threshold of 10, preventing potential instability. To mitigate the risk of over-fitting, we add an L2 regularization term on the weights to the loss function with a factor of 0.0002, which is double the default setting (0.0001) typically found in the DL Toolbox within MATLAB. The choice of loss function is another critical consideration in DL, and it can play a substantial role in model performance, particularly when dealing with non-Gaussian scenarios. While we utilize the default mean squared error (MSE) loss for our regression task, it is important to acknowledge that alternative loss functions tailored to specific data characteristics or assumptions may yield different outcomes. Researchers may explore the appropriateness of alternative loss functions in cases where Gaussian assumptions do not hold.
As shown in Figures 4a-4c, all three DA methods can capture the non-Gaussian feature in the spatial distribution of K to varying degrees. However, ES(K) fails to reproduce the connectivity feature of the reference K field shown in Figure 2b. As the subsurface media consist of only two distinct materials with values of K_1 = 0.5 and K_2 = 2.3, the histogram of an estimated K field should display bi-modality. Nonetheless, ES(K) is unable to identify this bi-modality, as illustrated in Figure 4d. These results confirm that the Kalman-based update method ES(K) is inadequate for solving non-Gaussian DA problems.
By introducing a non-linear updating scheme with DL, ES(DL) can better estimate the spatial distribution of K (Figure 4b) than its Kalman counterpart, and the bi-modality feature can be identified (Figure 4e). Additionally, the standard deviation (std) field shown in Figure 4h also reveals the connectivity feature of K, with larger std values at the interface of the two materials. When we use the three predictors {Δy, m^0, y^0} in DA(DL) to infer the update vector Δm, a significant improvement in overall performance is achieved compared to ES(DL): the estimated mean field shows a clearer connectivity feature (Figure 4c), the bi-modal distribution is better identified (Figure 4f), and the std field has smaller values (Figure 4i). To conduct more comprehensive comparisons, we perform eight repetitive runs for each of the three DA methods. As shown in Figure 5, DA(DL) generally provides a more accurate estimation of the K field and a better data-match. The average RMSE (R²) values between model simulations and measurements calculated from the updated ensembles are 1.437 (0.991), 1.023 (0.996), and 0.893 (0.997) for ES(K), ES(DL), and DA(DL), respectively. The average RMSE (R²) values between the estimated and reference K fields for the three methods are 0.690 (0.367), 0.660 (0.414), and 0.513 (0.625), respectively. These results suggest that DA(DL) is superior to both ES(K) and ES(DL) in solving this non-Gaussian DA problem.
When a simulator requires significant computational resources, only a limited number of model simulations can be afforded. In this situation, we propose using the data augmentation method described in Section 2 within DA(DL). Specifically, we set N_e = 50 and M = 20 and evaluate the performance of ES(K), ES(DL), DA(DL) without data augmentation, and DA(DL) with data augmentation. DA(DL) with data augmentation is superior to ES(DL) (RMSE: 3.022) and DA(DL) without data augmentation (RMSE: 2.961), which are based on training data sets each with only 1,225 samples. Introducing the data augmentation method to DA(DL) increases the number of training data points to 24,500, producing the best performance among the four DA methods. In practical applications, it is crucial to strike a balance between the choice of M and the computational resources needed for the simulation of the system model and the training of the DL model.

The Gaussian Condition
In the previous section, we demonstrated that DA(DL) outperforms the two ES methods that rely on the Kalman- and DL-based updates in estimating the high-dimensional and non-Gaussian distributed parameters of a groundwater model. To further examine the effectiveness of DA(DL), we now focus on its performance under the Gaussian condition, where ES(K) is typically expected to excel. If DA(DL) can produce outcomes similar to (or even better than) those of ES(K), it would confirm the versatility and practicality of DA(DL) as a reliable DA method. Specifically, we explore the joint estimation of contaminant source parameters and the heterogeneous conductivity (K) field in the subsurface media, for which the posterior distribution of the parameters is roughly multi-Gaussian.
In this case, we simulate steady-state groundwater flow and contaminant transport in a 2-D confined aquifer. The size of the domain is 20 (L) × 10 (L), with impervious upper and lower boundaries and constant-head boundaries at the left (12 L) and right (11 L) sides. The domain is discretized into 81 × 41 grids in the numerical model. The K field is heterogeneous and its logarithmic transformation (Y = log K) is Gaussian distributed. The spatial correlation of Y at any two locations {x_1, y_1} and {x_2, y_2} is characterized by a covariance function with variance σ²_Y and correlation lengths λ_x and λ_y in the x and y directions, respectively. The steady-state hydraulic heads (h) and water velocities (v_i) within the domain can be determined by utilizing MODFLOW (Harbaugh et al., 2000), which involves solving ∂/∂x_i (K ∂h/∂x_i) = 0 and v_i = −(K/θ) ∂h/∂x_i, where θ (−) represents the porosity of the subsurface media, and the subscript i denotes the coordinate axis (1 for the x direction and 2 for the y direction).
The subsurface flow field contains a contaminant source with an unknown location and release strengths that change over time. To determine the concentrations (C) at different times in the domain, we use MT3DMS (C. Zheng & Wang, 1999) to solve the mass-balance equation ∂(θC)/∂t = ∂/∂x_i (θ D_ij ∂C/∂x_j) − ∂/∂x_i (θ v_i C) + q_a C_s,
where t (T) represents time, q_a (T⁻¹) is the volumetric flow rate per unit volume of the aquifer, C_s (ML⁻³) represents the concentration of the source, and D_ij denotes the hydrodynamic dispersion coefficient tensor, whose components are determined by the longitudinal and transverse dispersivities, α_L and α_T (L), respectively. In this problem, we aim to identify the unknown conductivity field and contaminant source using measurements of hydraulic head and solute concentrations.
To reduce the dimensionality of the problem, we utilize the Karhunen-Loève (KL) expansion (D. Zhang & Lu, 2004) to approximate the random field of Y as Ỹ(x) = μ_Y + Σ_{n=1}^{N_KL} √τ_n s_n(x) ξ_n. Here, Ỹ(x) represents the reconstructed value of Y at location x = {x, y}, and μ_Y is the mean. The eigenvalues and eigenfunctions of the covariance function are represented by τ_n and s_n(x), respectively. The uncertainty in the field is represented by independent Gaussian random coefficients ξ_n ∼ N(0, 1), where n = 1, …, N_KL. To retain 95% of the variance of the original field, 100 terms (N_KL = 100) should be included. The contaminant source is described by eight parameters: the location {x_s, y_s} and six mass-loading rates S_k (MT⁻¹) during the time intervals [k, k + 1] (T), where k = 1, …, 6. The prior distributions of the eight parameters are all uniform, with ranges listed in Table 1. In summary, there are 108 unknown parameters to be estimated, that is, {ξ_1, …, ξ_100, x_s, y_s, S_1, …, S_6}. Additional parameters, including the porosity θ = 0.25 (−), are determined through geological surveys or experiments.

To infer the 108 unknown parameters, measurements of the steady-state hydraulic head and contaminant concentrations at t = {4, 5, …, 12} (T) are collected from 15 wells in the domain (represented by the circles in Figure 6a). Both types of measurement errors follow Gaussian distributions, with ϵ_h ∼ N(0, 0.005²) and ϵ_c ∼ N(0, 0.005²). It is important to acknowledge that in practical applications, determining the standard deviation of the measurement error (σ) can be a challenge, as it is often unknown. The selection of an appropriate σ requires careful consideration, taking into account factors such as measurement methods and instruments. Moreover, σ can also be treated as an unknown variable and estimated concurrently with the model parameters, as exemplified in J. Zhang, Vrugt, et al. (2020). This approach allows for a more comprehensive and data-driven determination of σ within the context of the problem at hand.

In this case, we utilize a DL model for DA(DL) resembling the one used in the non-Gaussian problem, as shown in Figure 7. For the ES(DL) method, we again use the U-Net part indicated by the blue arrows. We incorporate layer-normalization (LN) layers that normalize the data across all channels for each sample independently, enabling efficient training and improved performance. For both ES(DL) and DA(DL), the DL model is trained over a total of 600 epochs using the Adam optimizer with an initial learning rate of 0.006, which decays to 80% of its value every 15 epochs. The MSE loss is adopted for the regression task. We set the mini-batch size to 3072, and the L2 regularization factor for the weights is 0.0001.
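Returning to the KL parameterization introduced above, a realization of Y can be drawn numerically once the covariance eigenpairs are available. The sketch below is a hypothetical illustration only: the covariance form, correlation lengths, variance, and the coarser grid are assumptions, since the actual covariance function (Equation 13 of the original formulation) and parameter values are not reproduced here.

```python
import numpy as np

def kl_field(nx=41, ny=21, lx=20.0, ly=10.0, var_y=1.0,
             lam_x=4.0, lam_y=2.0, n_kl=100, mu_y=0.0, seed=0):
    """Draw one realization of Y via a truncated Karhunen-Loeve expansion.

    Assumptions: a separable exponential covariance, illustrative variance and
    correlation lengths, and a coarser grid than the 81 x 41 model grid;
    eigenpairs are obtained numerically rather than analytically.
    """
    rng = np.random.default_rng(seed)
    xs, ys = np.meshgrid(np.linspace(0, lx, nx), np.linspace(0, ly, ny))
    pts = np.column_stack([xs.ravel(), ys.ravel()])
    # Assumed covariance: C(x1, x2) = var_y * exp(-|dx|/lam_x - |dy|/lam_y)
    dx = np.abs(pts[:, None, 0] - pts[None, :, 0])
    dy = np.abs(pts[:, None, 1] - pts[None, :, 1])
    C = var_y * np.exp(-dx / lam_x - dy / lam_y)
    tau, s = np.linalg.eigh(C)                       # ascending eigenvalues
    tau, s = tau[::-1][:n_kl], s[:, ::-1][:, :n_kl]  # keep n_kl leading modes
    xi = rng.standard_normal(n_kl)                   # independent N(0, 1) coefficients
    y_field = mu_y + s @ (np.sqrt(np.maximum(tau, 0.0)) * xi)
    return y_field.reshape(ny, nx)
```

In a DA run, the ξ_n coefficients (rather than the full field) are the quantities updated by the assimilation.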
Although this case has significantly fewer parameters (N_m = 108) than the non-Gaussian case (N_m = 1,681), it remains a formidable challenge for MCMC techniques. To provide context, our previous research required approximately 1 million model evaluations to attain reliable results, even when leveraging an advanced MCMC algorithm, namely DREAM(ZS) (J. Zhang, Vrugt, et al., 2020). Consequently, we refrain from presenting MCMC results here. We once again compare ES(K), ES(DL), and DA(DL), with N_iter = 5 and N_e = 500. For each DA method, we conduct five repetition runs and average the outcomes to obtain the estimated mean Y fields shown in Figure 6. All three methods are able to identify the major high- and low-value regions of the reference field (Figure 6a). However, since the measurements are taken from only 15 wells, which are primarily located in the central region of the flow domain, these DA methods tend to underestimate the high-value regions and overestimate the low-value regions (e.g., the high-value region in the bottom corner is not captured). Increasing the number and improving the distribution of measurement wells can improve the precision of the Y field estimation. The mean RMSE (R²) values, calculated between the reference Y field and the updated ensembles (consisting of 2,500 samples from the five repetitions), are 0.4830 (0.4120) for ES(K), 0.5074 (0.3550) for ES(DL), and 0.5072 (0.3811) for DA(DL), with standard deviations of 0.0866 (0.0968), 0.1045 (0.1230) and 0.1167 (0.1527), respectively. In this case, ES(K) obtains a slightly more accurate estimation of the Y field. Nonetheless, with better designed DL models and training options, ES(DL) and DA(DL) have the potential to deliver enhanced performance.
Figure 8 presents a comparison of the posterior density curves for the eight contaminant source parameters, derived from the 2,500 updated samples of the five repetition runs, for the three DA methods. Although in a single run each of the three DA methods may produce slightly biased estimates of the contaminant source parameters (results not shown), merging the updated ensembles of the five repetition runs yields consistent results across the three methods. The RMSE (R²) values between the measurement data and the updated ensembles for ES(K), ES(DL), and DA(DL) are 0.0574 (0.9997), 0.0574 (0.9998), and 0.0435 (0.9998), respectively. The corresponding standard deviations are 0.0335 (0.000489), 0.0719 (0.000487), and 0.0440 (0.00160), respectively. Although ES(K) and ES(DL) yield almost identical mean RMSE values, the results from ES(DL) exhibit greater variability. Overall, the DA(DL) method produces the best data-match.
In the aforementioned tests, all three DA methods were employed with five iteration steps. In theory, however, ES(DL) and DA(DL) should ideally not necessitate a multi-step implementation: the underlying expectation is that a well-designed DL model should inherently possess the capability to approximate intricate non-linear relationships in a single step. Nevertheless, when we set the iteration number to one, we observe average RMSE values between measurements and updated ensembles of 2.354, 2.418, and 2.391 for ES(K), ES(DL), and DA(DL), respectively. These RMSE values are notably larger than those obtained with five iterations. Consequently, for highly non-linear problems, iterations are still required for the two DL-based DA methods. This phenomenon could be attributed to several contributing factors. First, selecting the appropriate DL model architecture is pivotal: a network that is too shallow or lacks complexity may struggle to capture non-linear relationships, while an excessively complex network can lead to over-fitting. Second, deep neural networks are susceptible to gradient vanishing or exploding during training, particularly in very deep networks; these issues can further impede the learning of non-linear relationships, and techniques such as batch normalization and residual connections can alleviate them. Third, selecting appropriate hyperparameters, such as the learning rate, batch size, number of layers, and activation functions, is essential. Fourth, DL optimization often entails finding a global minimum of a highly non-convex loss function, and in some cases the optimization process may become trapped in local minima, preventing the model from learning the intended non-linear mappings. In practical applications, achieving optimal performance frequently entails a thorough process of experimentation aimed at identifying the ideal model and training configuration tailored to a specific problem. Unfortunately, in the current study, we have not yet reached this objective. The pursuit of enabling single-step DA(DL) for highly non-linear problems remains an ongoing challenge that demands further dedicated effort and exploration.

Conclusions and Discussion
In this study, a novel DA method, DA(DL), is proposed to improve the simulation accuracy of complex hydrological systems involving non-linearity, high-dimensionality, and non-Gaussianity. Traditional DA methods such as MCMC may suffer from low computational efficiency, while others like EnKF and its variants are limited by the Gaussian assumption. DA(DL) takes advantage of DL to recognize complex patterns (including non-Gaussianity) and approximate non-linear relationships automatically from data. By employing an ensemble representation of model parameters, states or other related variables, DA(DL) quantifies and reduces the uncertainties inherent in the simulation process using prior knowledge and measurement data. DA(DL) builds non-linear mappings from multiple predictors to the target variable, which is the difference between the updated and prior vectors of the concerned variables (model parameters in the present study). To train the DL model, a large volume of training data is generated from the prior ensemble. When the system model is CPU-demanding, it is preferable to use a small ensemble size, which may not be sufficient for the data-hungry DL model. In this condition, we propose a data augmentation method to enhance the performance of DA(DL). In addition, we address the equifinality issue, which arises when different parameters or forcings can lead to the same outcomes in a complex dynamical system, by introducing the local updating approach proposed in our previous work (J. Zhang et al., 2018) to DA(DL). To evaluate the performance of DA(DL), we conduct numerical experiments involving Gaussian and non-Gaussian distributions and compare DA(DL) with two ES methods, one using the Kalman formula (Emerick & Reynolds, 2013) and the other using a DL-based update (J. Zhang, Zheng, et al., 2020). Our results demonstrate that DA(DL) outperforms its counterparts, especially in the non-Gaussian condition.
Despite the promising results achieved by DA(DL) in this study, several issues are not yet well addressed. Firstly, it is difficult or even impossible to design the optimal DL model structure and related training options for a specific problem. During the development of DA(DL), dozens of DL model structures were tested, and only a few of them yielded satisfactory outcomes. Although the DL models used in this study enable DA(DL) to produce good results, especially in the non-Gaussian case, the current settings are suboptimal. For example, in the Gaussian case, DA(DL) still requires the same number of iterations as ES(K), despite the fact that the mapping defined by DA(DL) is non-linear. It is therefore important to further improve and standardize the implementation of DA(DL). Secondly, as the updates made by DA(DL) and ES(K) rely on statistical relationships derived from data, there is no assurance of physical consistency in the DA outcomes. To address this limitation in the context of DA(DL), incorporating physical constraints into the training of the DL model via knowledge-guided machine learning (Karpatne et al., 2022) would be a promising approach. This can help reduce the need for huge amounts of training data and enhance the reliability and stability of the inference results. Thirdly, this study only addresses the updating of model parameters from measurement data using DA(DL). However, there is potential to expand the scope of DA(DL) in future research by incorporating other crucial variables such as model states, forcings and error terms. Doing so would provide a more comprehensive understanding of simulation uncertainties, ultimately leading to more informed model enhancements and predictions.

Figure 2. (a) The training image used to generate random realizations of the K field using the direct sampling method; (b) The reference K field, injection well (the down triangle), pumping well (the up triangle), and measurement locations (the circles).

Figure 3. Architecture of the deep learning (DL) model used by DA(DL) in the non-Gaussian case. Here, the part with blue arrows is the U-Net model used by ES(DL). The output size of each layer is indicated by height × width × channels. Conv and ConvT denote the 2-D convolution layer and transposed 2-D convolution layer, respectively.

Figure 5. Root-mean-square errors (RMSEs) (a) between model simulations and measurements and (b) between estimated and reference K fields, calculated from the updated ensembles obtained by ES(K), ES(DL), and DA(DL), respectively. The RMSE values are sorted in ascending order. There are eight repetition runs for each method.

Figure 6. (a) The reference Y field and measurement well locations (white circles); (b-d) The estimated mean fields of Y (averaged over five repetition runs) obtained by ES(K), ES(DL) and DA(DL), respectively.

Figure 7. Architecture of the deep learning (DL) model used by DA(DL) in the Gaussian case. Here, the part with blue arrows represents the U-Net model used by ES(DL). The size of each layer is indicated by height × width × channels. Conv, ConvT and LN denote the 2-D convolution layer, transposed 2-D convolution layer, and layer-normalization layer, respectively.

Figure 8. Posterior density curves of the eight contaminant source parameters estimated by ES(K) (blue lines), ES(DL) (red lines) and DA(DL) (green lines), respectively. The reference values are indicated by the black vertical lines. We conduct five repetition runs for each method to obtain the above results.

Table 1
Prior Distributions and Reference Values of the Eight Contaminant Source Parameters