Predicting the infiltration characteristics for semi-arid regions using regression trees

The study of the infiltration process is essential to hydrological studies. Accurate predictions of infiltration characteristics are therefore required to understand how water moves through the soil surface into the subsurface. The aim of the current study is to simulate, and improve the prediction accuracy of, the infiltration rate and cumulative infiltration of soil using regression tree methods. Experimental data recorded with a double ring infiltrometer at 17 different sites are used in this study. Three regression tree methods, random tree, random forest (RF) and M5 tree, are employed to model the infiltration characteristics using basic soil characteristics. The performance of the modelling approaches is compared in predicting the infiltration rate as well as cumulative infiltration. The results suggest that the RF model performs better than the other applied models, with coefficient of determination (R²) = 0.97 and 0.97, root mean square error (RMSE) = 8.10 and 6.96, and mean absolute error (MAE) = 5.74 and 4.44 for infiltration rate and cumulative infiltration, respectively. The RF model is used to represent the infiltration characteristics of the study area. Moreover, parametric sensitivity analysis is adopted to study the significance of each input parameter in estimating the infiltration process. The results suggest that time (t) is the most influential parameter in predicting the infiltration process using this data set.


INTRODUCTION
The infiltration process influences the hydrological cycle, stream flow, groundwater recharge, irrigation, soil erosion and drainage systems (Singh et al. ). Infiltration can be defined as the movement of irrigation or precipitation water into the soil (Sihag et al. a, b). The actual rate at which water moves into the soil is the infiltration rate (Haghighi et al. ). The infiltration process plays a significant role in the selection of crop type and agricultural practices. Prediction of infiltration is also necessary for the efficient design of irrigation systems (Hillel ). The sustainability of the groundwater system also depends on this process (Chen & Young ).
The quantity and quality of recharged water are the most important parameters for water resources management, especially in regions where water is scarce. Infiltration divides water into the two most important hydrological segments, groundwater flow and surface flow. It is the main factor in watershed modelling for the prediction of surface runoff.
Accurate prediction of infiltration rate is essential for reliable prediction of surface runoff (Diamond & Shanley ). At catchment level, infiltration characteristics are one of the main factors in determining flooding conditions (Bhave & Sreeja ). The significance of the infiltration process has led soil and water researchers to develop several models (e.g. Green Sihag et al. a). However, only a few of these models have been used effectively with field data. Parameter calculation is the main criterion for selecting one model over another. These infiltration models can be divided into three groups: physical models, semi-empirical models and empirical models.
The main problem in modelling the infiltration process is the variation of soil type and texture (Machiwal et al. ). Variation in the infiltration process within a watershed makes it difficult to manage water for agricultural practices. The process of infiltration is influenced by several factors, including soil type, texture, soil hydraulic properties, soil moisture profile, precipitation and climate. At present, the spatial and temporal pattern of the infiltration process in the real environment cannot be deduced by direct or indirect methods. Physical models are based on the law of mass conservation and Darcy's law, empirical models rely on field and laboratory measurements, and semi-empirical models are based on simple hypotheses about the relation between infiltration rate and cumulative infiltration. These infiltration models are developed for vertically homogeneous soils with constant initial moisture content, constant density and horizontal surfaces. Very few researchers have attempted to use soft computing in the prediction of soil infiltration (Sy ; Sihag et al. a, b). To the best of our knowledge, no study has described the spatial variability of the soil water infiltration process using soft computing techniques in semi-arid regions of India. Thus, the main objective of this research is to develop infiltration models using regression-tree-based computing approaches and to determine the spatial variability at the same scale in some locations of Haryana (India).
Both cumulative infiltration and infiltration rate are predicted using random tree, random forest and M5 tree-based models. In the last few decades, conventional infiltration models such as Kostiakov, Philip, Novel, Horton and Holtan have been widely used to estimate infiltration rate and cumulative infiltration. These models are point- and location-specific. The aim of this investigation is to develop a general model for the complete study area using tree-based models. Table 1.

DATA SET
The data set used for modelling in this study is a collection of infiltration observations obtained from field experiments at 17 sites. In this study, infiltration rate (I) and cumulative infiltration (Cu) are considered as the

Random tree (RT)
Random trees are decision/regression-tree-based models used for classification/regression problems. Random model trees are essentially the combination of two existing algorithms: single model trees are combined with random forest ideas. Every leaf in model trees holds a linear model which is optimized for the local subspace described by that leaf (Pfahringer ). The collection of tree predictors in random trees is called a forest. With random features at each node, a random tree is a tree drawn randomly from a set of possible trees. The distribution of trees is uniform and each tree in the set of trees has an equal chance of being sampled. Random trees can be generated efficiently and the ensemble of random trees generally leads to accurate models (Zhao & Zhang ).

Random forest (RF)
The random forest proposed by Breiman ( To generate a tree, instead of computing the best split among attributes for each node, only a random subset of all attributes is chosen at every node, and the best split for that subset is computed.
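As a minimal sketch of this idea (bootstrap samples plus a random subset of attributes at each node), the following pure-Python example grows a small forest of regression trees and averages their predictions. It is not the implementation used in this study: the function names, the toy data and the parameter values (`k` trees, `m` attributes per node, `min_leaf`) are illustrative assumptions.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values)


def build_tree(rows, targets, m, rng, min_leaf=2):
    """Grow a regression tree; each node searches only m randomly chosen attributes."""
    if len(rows) < 2 * min_leaf or len(set(targets)) == 1:
        return mean(targets)  # leaf: predict the mean target
    n_attrs = len(rows[0])
    best = None  # (squared error, attribute index, threshold)
    for a in rng.sample(range(n_attrs), min(m, n_attrs)):
        for threshold in sorted({r[a] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, targets) if r[a] <= threshold]
            right = [y for r, y in zip(rows, targets) if r[a] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, a, threshold)
    if best is None:
        return mean(targets)  # no admissible split among the sampled attributes
    _, a, threshold = best
    lo = [i for i, r in enumerate(rows) if r[a] <= threshold]
    hi = [i for i, r in enumerate(rows) if r[a] > threshold]
    left_tree = build_tree([rows[i] for i in lo], [targets[i] for i in lo], m, rng, min_leaf)
    right_tree = build_tree([rows[i] for i in hi], [targets[i] for i in hi], m, rng, min_leaf)
    return (a, threshold, left_tree, right_tree)


def predict_tree(node, row):
    """Follow the splits until a leaf (a plain number) is reached."""
    while isinstance(node, tuple):
        a, threshold, left, right = node
        node = left if row[a] <= threshold else right
    return node


def random_forest(rows, targets, k, m, seed=0):
    """Grow k trees, each on its own bootstrap sample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(build_tree([rows[i] for i in idx], [targets[i] for i in idx], m, rng))
    return forest


def predict_forest(forest, row):
    """Regression forests average the predictions of the individual trees."""
    return mean(predict_tree(tree, row) for tree in forest)


# Hypothetical toy data (NOT from the study): columns are time and sand fraction.
rows = [(t, s) for t in (5, 10, 20, 40, 60, 90) for s in (0.3, 0.5, 0.7)]
targets = [100.0 * s / (1 + 0.1 * t) for t, s in rows]
forest = random_forest(rows, targets, k=25, m=1)
estimate = predict_forest(forest, (20, 0.5))
```

With `m` equal to the total number of attributes, the sketch reduces to bagged regression trees; smaller `m` makes the individual trees less correlated with one another.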
The procedure for the random forest regression model is as follows:
1. Draw k bootstrap samples from the data set.
2. For each sample, generate a regression tree with the following modification: at each node, choose the best split from m randomly sampled predictors.
3. The models are tuned for the user-defined parameters: number of trees (k) grown and number of variables used at each node (m) to construct a tree.

M5 tree
The first step is to build a tree, which is similar to inducing a regular decision tree. The difference is that the splitting criterion is based on maximizing the expected reduction in error instead of maximizing the information gain at each interior node. The standard deviation reduction (SDR) can be calculated by:

$$\mathrm{SDR} = sd(K) - \sum_{i} \frac{|K_i|}{|K|}\, sd(K_i)$$

where K represents the set of instances that reach the node and |K| their number; K_i represents the subset of instances that have the i-th outcome of the potential test and |K_i| their number; and sd represents the standard deviation of the target values of the instances in the given set.
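To make the splitting criterion concrete, the sketch below evaluates SDR for every candidate threshold on a single attribute, assuming the standard M5 form SDR = sd(K) − Σ (|K_i|/|K|) sd(K_i). The function names and the sample values are hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text does not say which estimator is used.

```python
from statistics import pstdev


def sdr(parent, subsets):
    """Standard deviation reduction obtained by splitting `parent` into `subsets`."""
    n = len(parent)
    weighted = sum(len(s) / n * pstdev(s) for s in subsets)
    return pstdev(parent) - weighted


def best_threshold(xs, ys):
    """Return the (threshold, SDR) pair with the largest SDR for one attribute."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest value cannot split the data
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        gain = sdr(ys, [left, right])
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical values (NOT from the study): time in minutes and infiltration rate.
times = [5, 10, 15, 30, 45, 60]
rates = [10.0, 10.0, 10.0, 2.0, 2.0, 2.0]
threshold, gain = best_threshold(times, rates)
# For these values the best split is time <= 15, which removes all spread (SDR = 4.0).
```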
The second step is to construct a multivariate linear regression function using the instances at each non-terminal node and then prune the built tree back from the leaves.
Instead of using all attributes, this multivariate linear regression function only uses the attributes that are referenced by tests or linear functions somewhere in the subtree at this node.
The last step is to improve the prediction accuracy by applying some smoothing techniques. The smoothed predicted values of M5 are backed up from the leaf node to the root node (Li & Li ).
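The text does not give the smoothing rule itself. As a hedged sketch, the example below assumes the rule commonly cited for M5-type model trees, p' = (n p + k q) / (n + k), where p is the prediction passed up from the node below, q is the prediction of the linear model at the current node, n is the number of training instances reaching the node below, and k is a smoothing constant (15 is a frequently quoted default). The function name, argument layout and numbers are illustrative only.

```python
def smoothed_prediction(leaf_value, leaf_count, ancestors, k=15.0):
    """Back a leaf prediction up to the root, blending in each ancestor's model.

    ancestors: (model prediction q, instance count) pairs ordered from the
    leaf's parent up to the root. The rule used here is an assumption, not
    taken from the source text.
    """
    p, n = leaf_value, leaf_count
    for q, count in ancestors:
        p = (n * p + k * q) / (n + k)  # blend the value from below with this node's model
        n = count  # instances at this node weight the next step up
    return p


# Hypothetical numbers: a leaf predicting 12.0 from 5 instances, then two ancestors.
value = smoothed_prediction(12.0, 5, [(10.0, 15), (8.0, 45)])
# Step 1: (5*12 + 15*10) / 20 = 10.5; step 2: (15*10.5 + 15*8) / 30 = 9.25
```

Leaves that hold few instances are pulled more strongly towards the models of their ancestors, which is the purpose of the smoothing step.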

Performance assessment parameters
The performance of the modelling techniques has been assessed using several statistical performance assessment parameters, which are coefficient of determination (R²),
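The three measures reported in this study (R², RMSE and MAE) can be sketched as follows. The source does not state which definition of R² is used; the sketch assumes R² = 1 − SS_res/SS_tot, and the squared Pearson correlation between observed and predicted values is a common alternative. The function names and sample values are hypothetical.

```python
from math import sqrt


def rmse(observed, predicted):
    """Root mean square error."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and predicted values (NOT from the study).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
scores = (r_squared(observed, predicted), rmse(observed, predicted), mae(observed, predicted))
# For these values: R^2 = 0.964, RMSE = sqrt(4.5) ≈ 2.121, MAE = 2.0
```

RMSE and MAE carry the units of the predicted quantity, so values for infiltration rate and cumulative infiltration are not directly comparable with each other.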

EVALUATION OF MODELS
The performance measures adopted for model assessment  Table 3.

Modelling of infiltration rate
The models are formed and checked for accuracy after setting the tuning parameters to near-optimal values (Table 3) Table 4. As indicated in Figure 2, the scatter of the training data predicted by the M5 tree, for both the pruned and unpruned models, is higher than that of the random tree and random forest. The higher values of infiltration rate

Modelling of cumulative infiltration
In this section, the cumulative infiltration of soil is simulated using random tree (RT), random forest (RF) and M5 tree (M5) models with the basic soil properties. The scatter plots of the regression-tree-based models for the training and testing data sets are shown in Figures 5 and 6, respectively. Table 5 gives the statistical measures obtained by comparing the observed training and testing cumulative infiltration data with the corresponding values predicted by the models. The scatter of the training data predicted by the M5 pruned and unpruned models is higher, particularly for the larger values, than that of the RT and RF models (Figure 5). The predictions of the RT and RF models lie close to the perfect agreement line (Table 5). The statistical performance measures obtained from the testing data set therefore suggest that the RF model performs better than the RT and M5 tree models in approximating the cumulative infiltration of soil. Figure 7 shows the residual errors observed with the current data set using the different regression models. The largest residual errors occur in the training stage, mostly for the M5 tree model, while the RT model has smaller residual values. In the testing stage, the residuals of the RF model are smaller than those of the RT model, which indicates its higher predictive ability.

SIMULATION OF INFILTRATION CHARACTERISTICS OF SOIL BY RANDOM FOREST MODEL
Based on the modelling results discussed above, it is clear that the RF model has higher generalization performance in simulating the infiltration rate as well as cumulative

DATA AVAILABILITY STATEMENT
Data cannot be made publicly available; readers should contact the corresponding author for details.