Towards new soil water flow equations using physics-constrained machine learning

The Richardson–Richards equation (RRE) is a widely used partial differential equation (PDE) for modeling moisture dynamics in unsaturated soil. However, field soil moisture observations do not always satisfy RRE. In this paper, we introduce a new physically constrained machine learning (PCML) approach to derive governing soil water flow PDE directly from moisture observations. This paper is viewed as a feasibility study and reports results of our first attempt in developing the PCML approach. Here, we rely on noisy synthetic soil moisture data obtained from the linear RRE subject to real flux boundary conditions. The linear RRE was used as a reference PDE to check the PCML-derived PDEs, where the PCML performance was


INTRODUCTION
Moisture dynamics in unsaturated soil are commonly modeled through a partial differential equation (PDE), referred to as the Richardson-Richards equation (RRE) (Richards, 1931;Richardson, 1922;Šimůnek et al., 2008;van Dam et al., 1997). Application of RRE is, however, limited by the need for costly and time-consuming determination of soil hydraulic properties, which generally exhibit strong spatiotemporal variability in nature. This limitation is often addressed using inverse solutions of RRE based on easier-to-acquire data such as from soil moisture measurements (Hopmans et al., 2002). In addition, RRE is derived based on several simplifying assumptions that are often violated in a natural ecosystem. Examples are local hydraulic nonequilibrium (Schlüter et al., 2012) or preferential flow through soil macropores (Nimmo, 2012).
These limitations in conjunction with the ever-expanding soil sensor technology (Ren et al., 2003;Sayde et al., 2010;Sheng et al., 2017) as well as remote sensing techniques (Entekahbi et al., 2010;Kerr et al., 2010;Sadeghi et al., 2017) has motivated modern data-driven models for predicting soil moisture dynamics (Carranza et al., 2021;Coopersmith et al., 2016;Karandish & Šimůnek, 2016). Although these models benefit from more readily available inputs compared to the classical RRE-based models, their performance outside of training conditions is often questionable (Carranza et al., 2021).
To address such limitations of machine-learning techniques, a third class of models referred to as "physicsconstrained machine learning" (PCML) has evolved (Karpatne et al., 2017;Raissi et al., 2019). A rare example in the vadose zone literature is the recent work of Bandai and Ghezzehei (2020), who used PCML to estimate soil hydraulic properties and water flux solely from moisture observations. They showed that, unlike the conventional inverse modeling of RRE, PCML does not require the boundary conditions, and it is not limited to a specific functional form for the soil hydraulic properties.
In this paper, we introduce a new application of the PCML approach in soil hydrological modeling. We indicate that PCML is able not only to estimate soil hydraulic properties, but also to retrieve the governing soil water flow PDE(s) directly from moisture observations. Earlier attempts on datadriven discovery of hidden PDEs in other disciplines of science and engineering include Bongard and Lipson (2007), Schmidt andLipson (2009), Brunton et al. (2016), Rudy et al. (2017), Raissi et al. (2017), , and Xu et al. (2020). Similar research has been done on saturated soil water flow and transport problems (Chang & Zhang, 2019a, 2019b. To the best of our knowledge, this is the first attempt for finding the hidden unsaturated soil water flow PDE(s) from observations. Hence, we view this paper as a feasibility study, where we rely on noisy synthetic soil moisture data coming from a known PDE (i.e., the linear RRE) as a reference to check the PCML-derived PDEs.

Theory
Let us assume we have soil moisture observations from a set of sensors installed at multiple depths and these observations satisfy a flow equation-for example, the linear RRE of the form: is soil water diffusivity, and [L T −1 ] is an unsaturated soil hydraulic conductivity parameter. In practice, the observations are our "known" information and the flow equation(s) is the "unknown," and we seek to retrieve a flow equation from the observations in order to 'model' beyond the spatiotemporal domain of our sensors. To do so, a "general PDE" (Equation 2) that is expected to contain

Core Ideas
• A physically constrained machine learning approach is introduced in soil hydrology. • It enables retrieving soil water flow equation(s) from moisture observations. • It enables adding new data-driven terms to the Richardson-Richards equation.
the "true PDE" (Equation 1) must be assumed. For example, we consider the following forth-order linear PDE: where μ ( = 0, 1, … , 5) represents parameters in the system. It is important to note that Equation 2 is only an example of a general PDE, whereas any other forms of the general PDE can be assumed as long as it contains the true PDE. Since the true PDE is not known, the formation of the general PDE requires primary knowledge of the main physical phenomena affecting observations. In this equation, some of the terms are already physically justified, and the other terms are included merely from a mathematical standpoint. For example, besides the gravity-induced flow and moisture diffusion, which are associated with the first-and second-order spatial derivatives, respectively, it is also proved that the fourthorder differential operator produces a wetting front instability that is a pre-condition for reproducing gravitational fingering (Cueto-Felgueroso & Juanes, 2009;Miller et al., 2013). Hence, the formation of the general PDE is an important step in a feasible approach.
Given soil moisture observations θ( , ) at soil depths and time steps, all the explanatory variables in the assumed general equation (i.e., derivatives of soil moisture) can be directly calculated from the observations. The respective forward and backward derivatives in time with fixed step size Δ at each of the th soil depths and th time steps may be basically calculated using the first-order numerical approximation as follows: where O denotes the truncation error. The respective th forward and backward derivatives of the space with fixed step size Δ at the th soil depth and th time step may also be calculated using the first-order numerical approximation as follows: where = 1, 2, 3, 4. After calculating the derivatives from the data, finding the PDE from moisture observations is reduced to a least-squares problem to find the best set of parameters μ ( = 0, 1, … , 5), fitting the general equation to the moisture observations. This problem can be solved using a variety of methods ranging from simple regression analysis to advanced machine learning techniques. Regardless of the underlaying optimization method, because a data-driven model is forced to satisfy the general equation, this approach is referred to as "physics-constrained machine learning." In this paper, we employ a "sparse regression" technique to solve the problem following Rudy et al. (2017). Using this technique, a conventional least-squares analysis is first carried out to solve where α is a regularization parameter and and Θμ denote the matrix of modeled and observed time-derivatives of soil moisture, respectively, defined as After this analysis, nonsignificant terms are detected and forced to be zero. Then, the least-squares analysis is redone for the nonzero terms. This method ensures that terms in the derived PDE are kept only if their effect on the error ‖ Θμ − ‖ outweighs their addition toμ.

Validation
We tested the proposed PCML algorithm based on synthetic soil moisture data. To account for realistic spatiotemporal variability of the moisture observations, we acquired ∼14 yr of measured data of 5-cm soil moisture, precipitation ( ) and evapotranspiration (ET) data from the Tonzi Ranch site in California (Baldocchi, 2016). Warrick's (1975) analytical approach was used to solve the linear RRE (Equation 1). A 200-cm homogeneous soil profile was assumed. The land surface water flux from the measured data ( = − ET) was used as the surface boundary condition for solving Equation 1, as the assumed true PDE. The initial soil moisture and soil moisture at infinity (a parameter appearing in Warrick's solution) were set to a constant soil moisture equal to the average of the 5-cm soil moisture observations over the entire study period (Sadeghi et al., 2019). This assumption is based on the fact that soil moisture temporal fluctuations dampen with soil depth and are practically zero at large depths (i.e., constant soil moisture at infinity). To estimate the soil hydraulic parameters, D and k, we used "simulated annealing," a global optimization algorithm contained in the MATLAB Optimization ToolboxTM Release 2018a (The MathWorks), to minimize the RMSE between the measured and calculated soil moisture at 5 cm.
The measured and simulated data using the optimized set of parameters D and k are presented in Figure 1, which shows a reasonable performance of the linear RRE in capturing the seasonal soil moisture dynamics. Certainly, there are discrepancies between the actual and modeled soil moisture due to the overly simplified nature of the linear RRE, as discussed in Sadeghi et al. (2019). An important feature of the modeled soil moisture data is to have even larger temporal variations than the actual observations due to the noisy nature of the surface flux measurements transferred to the modeled soil moisture data via the linear RRE. The large variations of the simulated data provide a good opportunity to test the proposed PCML algorithm. It is important to note that we used the fitted soil moisture data not the true measured data, because the true PDE is known for the former and not for the latter. This true PDE provides a reference to check the accuracy of the PCML-retrieved PDEs with the synthetic soil moisture data.
Without any prior information about the true PDE, the obtained synthetic soil moisture data were fed into the proposed PCML algorithm where Equations 3 and 4 were used to calculate the derivatives. The forward derivatives were used F I G U R E 1 Top: measured precipitation (P) and evapotranspiration (ET) used as the upper boundary condition in the linear Richardson-Richards equation (RRE, Equation 1). Bottom: The best fit of the linear RRE equation-based simulated soil moisture compared with measured soil moisture at a 5-cm depth. [L 2 T −1 ] is soil water diffusivity, and [L T −1 ] is an unsaturated soil hydraulic conductivity parameter for all space and time steps except for the last step, in which the backward derivatives were used. Note that these first-order numerical approximations are the simplest ways to calculate the derivatives and have been used purposely to test the proposed approach at a basic level, where effects of numerical truncation and round-off errors could be studied. We avoided higher-order numerical approximations as they need more spatial points (i.e., sensors) for calculating the derivatives and could further limit the applicability of this method when a limited number of sensors is available.
Effects of the sample size on the algorithm performance were investigated through various scenarios of sensor installation depth. These scenarios were defined considering two possible sensor types in practice: (a) the classic time domain reflectometry (TDR) sensor measuring a single soil moisture value per sensor, and (b) a recently developed TDR array (Sheng et al., 2017) each measuring soil moisture at eight depths simultaneously with a 1-cm increment. Accordingly, seven scenarios were considered: (a) three classic TDRs installed at 5-, 10-, and 20-cm depths, (b) four classic TDRs installed at 5-, 10-, 20-, and 50-cm depths, (c) five classic TDRs installed at 5-, 10-, 20-, 50-, and 100-cm depths, (d) one TDR array covering 3-10 cm with 1-cm increments, (e) one TDR array covering 43-50 cm with 1-cm increments, (f) two TDR arrays covering 3-10 and 43-50 cm, and (g) three TDR arrays covering 3-10, 43-50, and 93-100 cm. Note that from a mathematical standpoint at least five points are required for calculating the fourth-order spatial derivative in Equation 4. Hence, Scenarios 1 and 2 do not provide adequate data for the PCML algorithm. We applied a spline interpolation for all scenarios using 1-cm increments to first of all avoid this issue, and second to maintain a constant depth increment to ease the computations. For each scenario, the PCML algorithm retrieved a new PDE by finding the parameters of Equa-tion 2, which was compared with the true PDE defined by Equation 1.

RESULTS AND DISCUSSION
Results of the PCML algorithm for the seven considered scenarios are shown in Figure 2. Only the optimized values of μ 2 = − and μ 3 = are shown, because at most times the sparse regression correctly yielded nonzero values for these two parameters and zero for the other parameters in Equation 2, pointing to the overall success of the PCML algorithm in determining the true PDE. Sample size in both space and time has played a key role in the success of the PCML algorithm performance.
For all scenarios tested, the parameters stabilize when temporal sample size exceeded 3-6 yr. This indicates that a relatively long dataset is required to find a meaningful PDE from soil moisture observations using this approach. This large temporal sample size is a result of the large variations of the soil moisture data, which adds significant errors when approximating the derivatives numerically via Equations 3 and 4. To better understand the noise effects on the algorithm performance, we replaced the measured surface boundary conditions (Figure 1, top) with a sinusoidal flux function and produced a smooth set of synthetic moisture data. Results of this analysis (not shown) indicated that the PCML algorithm was able to find the true PDE from only a few weeks of the smooth data. Therefore, the temporal sample size is an important factor in this approach and needs to be selected carefully.
The variation among different scenarios also highlights the importance of the spatial sample size. To better compare the scenarios, we averaged the parameter values for temporal sample sizes larger than 2,000 d and listed them in Table 1. Then, we used these values to solve the PCML-retrieved PDEs analytically using Warrick's (1975) solution. The solutions for the seven scenarios are compared with those of the true PDE in Figure 3, where the most accurate PDE was retrieved out of Scenarios 1, 2, and 7. The success of these scenarios is primarily attributed to their larger and wider sample size, derived from the spline interpolation for Scenarios 1 and 2. Interestingly, the spline interpolation provided value-added information in these scenarios, promising the application of this approach for sites equipped with only three moisture sensors. The algorithm, however, showed a lower performance in Scenario 3, despite its larger sample size than Scenarios 1 and 2. We primarily attribute this result to the large gap between the fourth and fifth sensors at 50 and 100 cm. The interpolated data between these two sensors could deviate from the true solution and cause a lower performance of the algorithm.
The least accurate PDE was retrieved out of Scenario 4. This is the only scenario out of the seven scenarios where the sparse regression incorrectly yielded zero for the parameter . That is the main reason for the soil moisture simulations out of this PDE to significantly deviate from those of the true PDE (see Figure 3). This poor performance in Scenario 4 is attributed to the small extent of the data near the surface (3-10 cm), where data vary largely in time, but negligibly in space. Interestingly, the same extent of data in scenario 5 led to a reasonable PDE. In this scenario, the sampling depth was lowered 40 cm down in the soil profile, where less temporal variability exists than the soil surface (compare solutions at depth of 0 and 40 cm in Figure 3). The comparison of Scenarios 4 and 5 suggests that the data variability in time (i.e., dependent variable in Equation 2, ∂θ∕∂ ) and space (i.e., explanatory variables in Equation 2, ∂ θ∕∂ ) should F I G U R E 3 Soil moisture simulations from the physically constrained machine learning (PCML)-based retrieved partial differential equations (PDEs, Scenarios 1-7, listed in Table 1), compared with simulations from the true PDE, ∂θ∕∂ = 152.7∂ 2 θ∕∂ 2 − 0.148∂θ∕∂ , where θ [L 3 L −3 ] is volumetric soil moisture content, [T] is time, and [L] is soil depth (positive downward) be proportional for PCML to work well. In other words, the large temporal variability occurring near the surface should be accompanied with a relatively large spatial extent of data for a well-posed problem.
Comparing the results for the classic TDR (Scenarios 1-3) and the TDR array (Scenarios 4-7) shows no superiority of a single sensor compared with a sensor array in terms of the PCML performance, although the TDR array yields data at a higher resolution. This is because the low-resolution data provided by the classic TDR could be efficiently extended using the spline interpolation. This, however, could be the case for the linear RRE and needs more testing using nonlinear RRE or actual observations, where the true moisture dynamics captured by the TDR array, especially near the surface, may not be efficiently reproduced by any interpolation method.

CONCLUSIONS AND FUTURE PERSPECTIVE
The feasibility of deriving a governing PDE describing unsaturated soil water dynamics directly from soil moisture observations was demonstrated using a physics-constrained machine-learning (PCML) approach. To that end, the sample size in time and space is highly critical. For smooth theoretical data, the temporal sample size can be as short as a few weeks for the PCML to succeed. However, several years of noisy field moisture data may be required to get meaningful PDEs from this initial approach. In addition, both the number and vertical spatial extent of the moisture sensors are important here. Where only a few sensors are available, the data sample size can be efficiently enhanced using a proper interpolation technique. Hence, the extent of the sensing domain seems to play a more important role than their numbers.
This study, based on synthetic data of the linear RRE, has been viewed as a feasibility phase of this research. This study is limited to an overly simplified scenario of a linear, bare, and homogenous soil. Real-world complexities such as spatial heterogeneity, nonlinear hydraulic properties, vapor and non-isothermal flows, and root water uptake may significantly affect the proposed approach. Hence, more comprehensive tests of the PCML approach using noisy data from the nonlinear RRE or real soil moisture sensors is necessary. In addition, we derived the spatial and temporal derivative terms via first-order numerical approximations that often introduce large truncation and round-off errors, especially when having noisy data. In this first step of our research, we purposely kept the derivative calculations at a basic level and plan to explore the best differentiation methods in the next steps. In particular, studying advanced differentiation techniques such as the "complex-step derivative approximation" (Squire & Trapp, 1998) and "automatic differentiation," a widely used approach in machine learning (Baydin et al., 2018), is part of our ongoing research.
If forcing the general PDE to the nonlinear RRE, this approach is reduced to an inverse solution technique to estimate the soil hydraulic properties directly from soil moisture data (Bandai & Ghezzehei, 2020). However, the general equation can be extended well beyond the RRE to explore a variety of non-RRE phenomena such as macropore and finger flows. Another potential application of the new approach is to reduce the nonlinearity of the governing soil water flow models via establishing an expanded, but physically consistent, linear or quasilinear PDE that can accurately describe the soil moisture spatiotemporal variability under different boundary conditions. Such models that may be solved using analytical or semianalytical methods would potentially reduce the associated numerical errors, which are of concern in existing land surface models (Decker & Zeng, 2009).

A C K N O W L E D G M E N T S
Support was provided by the Utah Agricultural Experiment Station, Utah State University, Logan, approved as UAES Journal Paper no. 9447. This article has been supported by the Polish National Agency for Academic Exchange under Grant no. PPI/APM/2018/1/00048/U/001.

C O N F L I C T O F I N T E R E S T
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.