Modelling Soil Water Retention Using Support Vector Machines with Genetic Algorithm Optimisation

This work presents point pedotransfer function (PTF) models of the soil water retention curve. The developed models estimate the soil water content at specified soil water potentials (–0.98, –3.10, –9.81, –31.02, –491.66, and –1554.78 kPa) from the following soil characteristics: granulometric composition, total porosity, and bulk density. Support Vector Machine (SVM) methodology was used for model development, and a new methodology for building retention function models is proposed. In contrast to previous attempts known from the literature, the ν-SVM method was used for model development, and the results were compared with the formerly used C-SVM method. Genetic algorithms were used as the optimisation framework for the model parameter search. A new form of the aim function used in this search is proposed, which allowed for the development of models with better prediction capabilities. The new aim function avoids the overfitting of models that is typically encountered when root mean squared error is used as the aim function. The developed models showed good agreement with measured soil water retention data; the coefficients of determination were in the range 0.67–0.92. The study demonstrated the usability of the ν-SVM methodology together with genetic algorithm optimisation for retention modelling, which gave better performing models than the other tested approaches.


Introduction
Soil hydrologic parameters have great impact on soil water transport processes. The soil water retention curve and soil water hydraulic conductivity are required for an appropriate description of soil water phenomena, such as drainage, infiltration, or soil pollutant movement. The retention curve describes the relationship between soil water content and soil water potential and is especially important for hydrological modelling and agronomical practice as it determines soil water availability for plants.
Measurements give strict evaluation of hydraulic properties of soils. Unfortunately, measurement of the soil water retention curve is time consuming and requires specialised equipment. The classical pressure plate extractor technique [1] may be used to determine the soil water retention curve, or an alternative technique based on the dynamic simultaneous time-domain reflectometry soil water content and pressure head measurements [2].
Fortunately, in many applications the hydraulic parameters can be estimated rather than measured. Pedotransfer functions (PTF) are commonly used in such circumstances [3] and allow for estimation of the retention curve or hydraulic conductivity based on easily measured soil characteristics. The most widely used soil characteristics for PTF development are granulometric composition, bulk density, and organic matter content.
The Scientific World Journal
The soil water retention curve may be modelled in two basic ways: indirectly (parametric PTFs) and directly (point PTFs). The first method evaluates the parameters of some model of the soil water retention curve, for instance, the Mualem-van Genuchten function parameters θr, θs, α, and n [4]; the latter evaluates water content for a specified number of water potential values. Recently a new method of retention modelling has been developed, which directly estimates soil water content but is not limited to a fixed set of soil water potentials [5]. This so-called pseudocontinuous approach allows for estimation of the soil water content for any potential value in the range from 0 kPa to a model-dependent minimum value.
There have been numerous attempts to develop PTFs for retention curve, utilising a wide set of mathematical methods. Regression modelling is a widely used tool for PTF model development. Some methods rely simply on granulometric distribution as input parameters [6], and others also use soil bulk density [7][8][9]. The mean weight diameter of soil particles, a granulometric composition dependent parameter, is used by some authors [7] too. Soil content of organic carbon is another widely used parameter [9][10][11] for development of PTFs. Other models [12] additionally use soil water content for specified potentials, for example, −33 kPa and −1500 kPa.
Artificial neural networks (ANNs) are another technique often used for developing PTFs. Feed-forward or radial basis neural networks allow for estimation of a set of output parameters based on knowledge of input parameters. ANNs have been used extensively as a tool for PTF development [13][14][15][16][17]. Unfortunately, some problems are specific to artificial neural network modelling, such as a tendency to get stuck in local minima of the mean square error surface during the ANN training process [18], or difficulties with the appropriate choice of ANN architecture, which can cause overfitting of the ANN to the training data. A partial solution to this problem is bootstrap averaging, which averages the predictions of many ANNs trained on randomly modified input data [19].
Recently another mathematical tool, the Support Vector Machine (SVM), has been used for PTF modelling. This technique resolves typical problems for ANN-based PTFs development. There have been attempts of SVM usage for PTF development in a direct manner [20] and in parametric form [21] where Mualem-van Genuchten parameters were estimated. Both of the SVM models were compared with ANN-based counterparts and showed better performance in water retention modelling.
In this work we focused on the development of point PTFs for estimation of the soil water retention curve using SVM and on some of its methodological aspects. We investigated whether the newer ν-SVM method, not used before in PTF studies, was applicable and possibly better for this purpose than the typically used C-SVM method. Genetic algorithms were used as the optimisation framework for the automated search for model parameters. A new form of the aim function used for this search is proposed, which allows for better selection of model parameters. As a result, more accurate PTF models may be developed.

Soil Datasets.
The soil dataset used in this study was an extract from the Soil Profiles Bank of the Polish Mineral Soils database [22] and contained 639 soil samples taken from 290 different soil profiles. Soil samples were collected from three horizons for most soil profiles. Undisturbed samples were collected into metal cylinders of volume 100 cm³ and diameter 5 cm, and then basic soil parameters were analysed. The following soil parameters were extracted from the soil database for the purposes of this work: soil water content for six soil water potential values (−0.98 kPa, −3.10 kPa, −9.81 kPa, −31.02 kPa, −491.66 kPa, and −1554.78 kPa); particle size distribution; total porosity; and bulk density. Particle size distribution was determined for the following fractions: clay < 0.002 mm, silt 0.002–0.05 mm, and sand 0.05–1 mm. Basic statistics of the soil dataset used in this study are presented in Table 1.

SVM Methodology.
SVM is one of the classes of soft computing techniques [23]. SVM was originally developed for solving classification problems; its usage was later extended to regression-type applications for function estimation [24]. SVMs used for regression modelling estimate one output variable based on a set of input variables. As a supervised learning method, SVM uses a training dataset for model development.

Table note: The presented values are averages of the ten k-fold submodels. For the number of support vectors, RMSE, and R² for the training dataset, values in brackets are standard deviations. Columns labelled "aim 1" present data for models developed using RMSE as the aim function. The label "aim 2" refers to models developed using the proposed new form of the aim function.

When the model's parameters are being optimised, steps (d) and (e) may be repeated until the best parameters are found.
The training dataset $D = \{[\mathbf{x}(i), y(i)] \in \mathbb{R}^n \times \mathbb{R},\ i = 1, \ldots, l\}$ consists of pairs in which a vector of input values is mapped to one output response value. The training data used for model development are specific to the application, and the training dataset has to be representative of the modelled problem. The quality of these data affects the model's generalisation capability and its ability to make accurate predictions.
There are two different types of SVM algorithms used in regression modelling: C-SVM and ν-SVM. The SVM methodology originally introduced by Vapnik is currently known as C-SVM or SVM Type 1 models. These models have two adjustable parameters which influence their behaviour: C, the so-called penalty parameter, and ε, the insensitivity zone.
C determines the mutual relationship between the training error and the model complexity. An increase in C penalises larger errors more heavily, which decreases the approximation error. The insensitivity zone ε describes the tolerance for training error in the SVM model: a decrease in ε leads to stricter fitting to the training data, which may cause overfitting and degrade the model's generalisation properties. SVM models with lower ε values use a larger number of support vectors. Support vectors are data records selected from the training dataset; the SVM algorithm decides which records become support vectors during the model training phase. Support vectors are a vital part of the developed model, as the SVM algorithm uses them for making model estimations.
One of the most important properties of SVM regression models is the ability to generalise, which allows for appropriate predictions from previously unseen input data. The technical criterion behind this requirement is a limit on the number of support vectors, set to about half of all vectors in the training dataset [25]. If an SVM model with an arbitrarily chosen percentage of support vectors is required, it is convenient to use another SVM model formulation known as ν-SVM or SVM Type 2 models [26]. In this type of SVM model, the parameter ε is replaced by another model parameter, ν ∈ [0, 1], which is used for the internal trade-off between the model accuracy for the training data and the model complexity (number of support vectors), which in turn influences the model's generalisation properties. ν-SVM regression models use two parameters: C and ν. Formally, the ν parameter expresses the desired number of SVs and is a lower bound on the fraction of support vectors [26]. In C-SVM models, the number of support vectors participating in the model formulation is indirectly related to the model parameters C and ε, while in ν-SVM models it is connected with the parameter ν itself.
The kernel function is an important part of an SVM model. It is used internally in the SVM algorithm to map the input parameters to the high-dimensional feature space used in the algorithm's internal computations. Thanks to nonlinear kernel functions, the SVM algorithm allows for estimations where the dependence between the estimated output variable and the input variables is highly nonlinear. There are some commonly used

Genetic Algorithm Parameters Optimisation.
SVM models depend on parameters which must be adjusted. The different types of models tested use different sets of parameters: SVM cost (C), insensitivity zone width (ε), and the parameter (γ) of the radial basis kernel function. Table 2 summarises the types of SVM models, the kernel functions used, the model parameters, and the names of the models used for reference in this paper.
The values of these parameters may be selected arbitrarily or determined using some kind of universal, developer-independent procedure. In fact, the determination of SVM model parameters is an optimisation problem, and typical methods for such tasks may be used. The leave-one-out method [27] was used in previous work by Lamorski et al. [20], but it is extremely computationally demanding. A simple search for an optimal solution on a grid of possible parameter values is another commonly used method [21]. Unfortunately, the grid search method does not test the whole space of possible parameter values, as it is limited to arbitrarily chosen fixed combinations of parameters. As a result, nonoptimal model parameters may be chosen. Some optimisation algorithm could be used instead.
Genetic algorithms (GA) with elitism were used in the present study for searching the optimal values of model parameters. GA is a technique used among others for optimisation purposes [28] and is especially well suited for applications where the aim function is not differentiable and may have local minima. None of the classical optimisation methods such as the Nelder-Mead downhill simplex method or gradient-based methods may be used successfully in such circumstances.
Genetic algorithm operation is controlled by two main parameters: the population size and the number of GA iterations. The other vital GA parameters are the elitism percentage and the mutation chance. The elitism percentage used in our study was 0.2, and the mutation chance was 0.02.
Population size determines the number of points in the space of input parameters at which the model performance is evaluated. A population size of 100 was used for the parameter search in this study.
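The role of these settings can be illustrated with a minimal sketch of an elitist genetic algorithm. Only the population size (100), the elitism fraction (0.2), and the mutation chance (0.02) come from the text; the function name `genetic_minimise`, the uniform crossover, the choice of parents from the elite, and the default number of iterations are assumptions made for illustration and are not the implementation used in the study.

```python
import random

def genetic_minimise(aim, bounds, pop_size=100, n_iter=50,
                     elitism=0.2, mutation_chance=0.02, seed=0):
    """Minimise aim(params) over box bounds with a simple elitist GA (illustrative)."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    population = [random_individual() for _ in range(pop_size)]
    # At least two elite individuals are needed to pick two parents.
    n_elite = max(2, int(round(elitism * pop_size)))

    for _ in range(n_iter):
        # Rank the population: a lower aim value is better.
        population.sort(key=aim)
        elite = population[:n_elite]  # elitism: the best individuals survive unchanged
        children = []
        while len(children) < pop_size - n_elite:
            mother, father = rng.sample(elite, 2)
            # Uniform crossover (assumed): each gene comes from one of the parents.
            child = [m if rng.random() < 0.5 else f for m, f in zip(mother, father)]
            # Mutation: with a small chance, redraw a gene inside its bounds.
            child = [rng.uniform(lo, hi) if rng.random() < mutation_chance else g
                     for g, (lo, hi) in zip(child, bounds)]
            children.append(child)
        population = elite + children

    return min(population, key=aim)
```

In a PTF setting, `aim` would train an SVM model for a candidate parameter vector such as (C, γ) and return the value of the aim function; here it can be any callable that maps a parameter list to a number.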

Model Formulation.
SVM was used for model development. The soil database was randomly split into two subsets: training dataset (414 samples) and testing dataset (225 samples). SVM models were built using the training dataset and tested against the test dataset. Test datasets were not used for model development at any stage, except for final model testing and validation.
Correlation analysis was performed on the soil dataset and used for model elaboration. The analysis of this data led to selection of the input parameters for developed models: sand fraction, clay fraction, total porosity, and bulk density.
Due to the fact that SVM regression models allow for estimation of only one output parameter for a given set of input parameters, six different SVM models were developed, one for each value of the soil water potential in which water content was evaluated.
The k-fold approach [19] was used for model development, which allowed for cross validation during the training phase. For the purpose of the k-fold method, the training dataset was randomly divided into ten distinct subsets of equal size.
The value returned by each developed model was the average output of the 10 submodels produced by the k-fold method. Each submodel was trained on nine of the subsets joined together (373 soil samples). The remaining tenth subset was used for cross validation purposes and was rotated for each of the ten submodels. The k-fold method allowed for estimation of the variance of the evaluated properties.
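The splitting and averaging scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`k_fold_indices`, `k_fold_splits`, `ensemble_predict`) are hypothetical, and submodels are represented as plain callables rather than trained SVMs.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (training indices, validation indices) for each of the k submodels."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, validation in enumerate(folds):
        # Join the other k - 1 folds to form the submodel's training set.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

def ensemble_predict(submodels, x):
    """Average the outputs of the submodels (callables) for one input x."""
    outputs = [model(x) for model in submodels]
    return sum(outputs) / len(outputs)
```

With 414 training samples and ten folds, the folds hold 41 or 42 samples, so each submodel in this sketch is trained on 372 or 373 samples.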
One of the major decisions made during SVM model development is the choice of an appropriate kernel function. Previous attempts at using SVM for retention curve modelling applied the radial basis kernel function [20,21]. In the present study we wished to investigate the influence of the kernel function on PTF model performance. We tested two kernel functions: the linear kernel and the radial basis kernel. The advantage of the linear kernel function over the radial basis kernel is the reduced number of model parameters. The radial basis kernel function introduces an additional parameter, γ, to the model, while the linear kernel function does not depend on any additional parameters.
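The difference in the number of parameters is visible when the two kernels are written out. The sketch below assumes the common parameterisation K(u, v) = exp(−γ‖u − v‖²) for the radial basis kernel; the source does not state which form was used, so this form is an assumption.

```python
import math

def linear_kernel(u, v):
    """Linear kernel: the dot product of the two input vectors (no parameters)."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma):
    """Radial basis kernel exp(-gamma * ||u - v||^2); gamma is the extra parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)
```

For large γ the radial basis kernel of two distinct points tends to zero, so each training record resembles only itself.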
One of the expected PTF model features is the ability to generalise, that is, to predict correctly the results for previously unseen input data. The technical criterion in SVM model formulation that stands behind this feature is the number of support vectors used in the model [25]. In the commonly used C-SVM models there is no direct influence on the number of support vectors, which depends implicitly on other model parameters. The objective of the present study was to compare two classes of PTFs: ν-SVM based (with a fixed value of ν) and C-SVM based. The parameter ν, in ν-SVM based models, explicitly determines the expected percentage of SVs used in the model formulation. The model performance was checked with a theoretically optimal 50% of SVs, so the value of ν was fixed at 0.5.

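Model Performance Criteria.
Some kind of model performance criterion is needed in SVM model development for validation purposes. Typically the root mean square error (RMSE) and the coefficient of determination (R²) are used:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}, \tag{1}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}, \tag{2}$$

where $N$ is the number of data analysed, $\hat{y}_i$ is a value approximated by the model, $y_i$ is the "true" measured value, and $\bar{y}$ is the mean of the measured values. In ideal conditions, when the values approximated by the model equal the measured ones, R² = 1 and RMSE = 0.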
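Both criteria are straightforward to compute. The sketch below assumes the usual definitions: RMSE as the square root of the mean squared difference between model and measured values, and R² as one minus the ratio of the residual sum of squares to the total sum of squares about the mean of the measurements. The function names are illustrative.

```python
import math

def rmse(measured, predicted):
    """Root mean square error between measured and model-approximated values."""
    n = len(measured)
    return math.sqrt(sum((p - m) ** 2 for m, p in zip(measured, predicted)) / n)

def r_squared(measured, predicted):
    """Coefficient of determination: 1 - residual SS / total SS (assumed definition)."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot
```

When the predictions equal the measurements, `rmse` returns 0 and `r_squared` returns 1, matching the ideal conditions described in the text.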

Overfitting and the Radial Basis Kernel Function.
RMSE (1) and R² (2) are widely used model performance criteria for PTF development. Model parameters are adjusted to minimise the RMSE for the training dataset. Usually a model which minimises RMSE for the training dataset also has a small RMSE for the test dataset; if so, the model has good generalisation properties.
One of the main machine learning paradigms states that only the training dataset is used for model development. The testing dataset, on the other hand, is used only at the very last step, to check how the developed model performs on previously unseen data. A model minimising RMSE between estimations and measured values from the testing dataset is considered to be the best model.
At the initial stage of the model development RMSE was used as the aim function and the GAs were used for seeking the optimal model's parameters. For both models utilising the linear kernel function, the C-linear and nu-linear, usage of RMSE as the aim function was a successful strategy. In that case the GA was able to find optimal model parameters.
However, when the radial basis kernel function was used, GA optimisation led to overfitting regardless of the type of SVM method used. Very low RMSE values (< 0.001) were achieved for the training dataset; for the test dataset, however, the RMSE was high. The generalisation capability of these models was very poor, and the number of support vectors was close to the number of all training vectors, which is an indicator of overfitting. This phenomenon was caused by a too high value of the radial basis kernel function's parameter γ determined by the GA: the GA was choosing the highest available value of γ, that is, the upper limit of the range of possible parameter values.
The SVM models seem to be especially sensitive to the value of the parameter γ of the radial basis kernel function. An increasing value of γ leads to an increased number of support vectors in the model, which degrades its generalisation capabilities. When RMSE was used as the aim function, the GA chose parameter values which minimised RMSE for the training dataset. For these parameters, however, RMSE was relatively high for the testing dataset, so the parameters were not acceptable from the point of view of model generalisation. In fact, using RMSE as the aim function together with genetic algorithm optimisation led to the development of models which were not optimal when the radial basis kernel function was used.
An example of the overfitting phenomenon for the ν-SVM model which evaluates the water content for the soil water potential of −0.98 kPa is given in Figure 1. Figure 1(a) shows the dependence between R² and γ, calculated for the training dataset (R² train) and the testing dataset (R² test). The other ν-SVM model parameters are as follows: C = 100 and ν = 0.5. For values of γ > 0.1, the value of R² for the training dataset increased until an unrealistic value of R² = 1 (overfitting) was reached, while R² for the testing dataset decreased greatly (i.e., lack of generalisation capabilities). Figure 1(a) also gives the dependence between the number of support vectors (number of SV) in the model and the value of the parameter γ. It can be observed that for high values of γ the number of SV in the model reaches the maximum of 373 (the whole k-fold training dataset), which is also evidence of overtraining. Similarly (Figure 1(b)), RMSE for the training dataset decreases for growing values of γ while RMSE for the testing dataset increases. Figure 2 shows a similar dependence between the parameter γ and the model's statistical characteristics, RMSE and R², for the C-SVM based model estimating water content for the soil water potential −0.98 kPa.
To give better insight into the models' structure, a sensitivity analysis was performed. For this purpose, the Morris [29] global sensitivity analysis method, modified according to Campolongo et al. [30], was used in its classical formulation: the one-step-at-a-time (OAT) approach. The outcome of the sensitivity analysis was the index μ*, which represents the impact of the tested model parameter on the model outcome, relative to the other parameters and averaged over the whole parameter space.

The model outcome used for the sensitivity analysis was the sum of model estimates made for all the records from the training dataset. Figure 3 presents normalised values of the sensitivity indices μ* determined for all models estimating water content for the soil water potential −0.98 kPa, as an example. Results for the nu-linear model are missing as this model depends on only one parameter, C. What can be seen is a relatively low dependence of the models on the SVM cost parameter C. The C-SVM model utilising the linear kernel function (C-linear), which depends on the C and ε parameters, is almost insensitive to parameter C. The same holds for the ν-SVM linear kernel model, where the parameter C has a low influence on the model's outcome. The radial basis kernel function ν-SVM model (nu-radial) depends on two parameters, C and γ, and the latter has a much higher impact on the model's estimations.
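The idea behind such an index can be sketched with a simplified one-at-a-time screening: each parameter is perturbed in turn from a random base point, and the absolute elementary effects are averaged. This is a reduced illustration, not the full Morris trajectory design or the procedure used in the study; the function name `morris_mu_star`, the unit-hypercube scaling of parameters, the step size, and the number of base points are all assumptions.

```python
import random

def morris_mu_star(model, n_params, n_points=20, delta=0.25, seed=0):
    """Mean absolute elementary effect per parameter on the unit hypercube."""
    rng = random.Random(seed)
    sums = [0.0] * n_params
    for _ in range(n_points):
        # The base point is drawn so that the perturbed point stays inside [0, 1].
        base = [rng.uniform(0.0, 1.0 - delta) for _ in range(n_params)]
        f_base = model(base)
        for i in range(n_params):
            stepped = list(base)
            stepped[i] += delta  # change one parameter at a time
            sums[i] += abs((model(stepped) - f_base) / delta)
    return [s / n_points for s in sums]
```

Here `model` is any callable that takes the scaled parameter vector and returns a single number, for example the sum of model estimates over the training records. Dividing each index by the sum of all indices gives normalised values that can be compared between parameters.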

Alternative Form of the Aim Function.
Models for the other soil water potentials, both C-SVM and ν-SVM based, demonstrated the same behaviour (overfitting) when RMSE was used as the aim function together with the radial basis kernel function. If some kind of automated model parameter search method such as GA is to be used, then the use of RMSE as the aim function is discouraged because of the dominant influence of γ on its value.
An aim function is needed that explicitly takes into account both factors: the model performance criterion (e.g., RMSE) and the model generalisation capabilities. An alternative form of the aim function, aim(RMSE, nSV), is proposed in (3) instead of RMSE. Equation (3) depends on two arguments: RMSE and the number of support vectors in the developed model (nSV). The formula mimics the bivariate normal distribution, with the parameters $nSV_{exp}$, $\sigma^2_{rmse}$, and $\sigma^2_{nSV}$, which are constants. The proposed aim function has a minimum at the point where RMSE = 0 and nSV = $nSV_{exp}$; both criteria have to be met at the minimum of this function. The parameter $nSV_{exp}$ is the number of support vectors expected in the developed model. Its value should be equal to half of the total number of records in the training dataset [25,26] to achieve an optimal model that is not overtrained. In the present study, $nSV_{exp}$ = 187, as there are 373 records in each k-fold training dataset and we assumed the theoretically ideal proportion (1/2) between the number of support vectors and the number of records in the training dataset.
The constants $\sigma_{rmse}$ and $\sigma_{nSV}$ affect the shape of the aim function and are connected with its slopes. Their values were selected empirically, on the basis of numerical tests and an analysis of their influence on the shape of the aim function (Figure 4). The parameter $\sigma_{rmse} = a_{rmse}\,RMSE_{max}$ was related to $RMSE_{max}$, the maximum value of RMSE which could occur during the SVM model parameter search. In this study the value of $RMSE_{max}$ was estimated together with the sensitivity analysis of the models, but the values obtained were typical for point PTF development and could be chosen arbitrarily.
Similarly, the parameter $\sigma_{nSV} = a_{nSV}\,nSV_{max}$ was related to $nSV_{max}$, the total number of records in the training dataset. An appropriate value for the coefficient $a_{rmse}$ was 1, while a good value for $a_{nSV}$ was 0.1. Figure 4 presents the shape of the aim function in relation to the values of the $a_{rmse}$ and $a_{nSV}$ parameters.
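A function with the properties described above can be sketched as follows. The exact form of (3) is not reproduced here; the code assumes an inverted Gaussian, 1 − exp(−½[(RMSE/σ_rmse)² + ((nSV − nSV_exp)/σ_nSV)²]), which is one possible function that has its minimum at RMSE = 0 and nSV = nSV_exp. The scaling coefficients are written `a_rmse` and `a_nsv`, and the value `rmse_max=0.1` in the usage line is a hypothetical placeholder, not a value from the study.

```python
import math

def make_aim_function(rmse_max, n_sv_max, n_sv_exp, a_rmse=1.0, a_nsv=0.1):
    """Build an aim function of (RMSE, nSV) shaped like an inverted 2-D Gaussian."""
    sigma_rmse = a_rmse * rmse_max   # width along the RMSE axis
    sigma_nsv = a_nsv * n_sv_max     # width along the support-vector axis

    def aim(rmse, n_sv):
        z = (rmse / sigma_rmse) ** 2 + ((n_sv - n_sv_exp) / sigma_nsv) ** 2
        # Assumed form: zero at the optimum, approaching 1 far away from it.
        return 1.0 - math.exp(-0.5 * z)

    return aim

# 373 records per k-fold training set and 187 expected support vectors (from the text);
# rmse_max = 0.1 is an assumed value used only for this example.
aim = make_aim_function(rmse_max=0.1, n_sv_max=373, n_sv_exp=187)
```

Under these assumptions an overtrained candidate with a near-zero RMSE but 373 support vectors scores worse than a candidate with a moderate RMSE and 187 support vectors, which is the behaviour the new aim function is meant to enforce.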
Instead of the simple RMSE formula, the new aim function (3) was used in conjunction with GA optimisation to search for model parameters. For ν-SVM based models (nu-radial) an optimal solution was found, and both criteria were met: RMSE was low and an optimal number of support vectors was selected. For C-SVM based models (C-radial) the results were much better than those found previously, when RMSE was used as the aim function, but were still inadequate.

PTF Model Performance.
Finally, eight types of PTF models were developed, each with one model for every value of the soil water potential for which water content was estimated. These models utilised C-SVM and ν-SVM based modelling and both kernel functions: radial and linear. Two types of aim function were used together with GA for model development. Table 3 presents a comparison of the results for these eight model types. It summarises the performance indices of the developed models: number of support vectors, RMSE, and R², all calculated for the training and testing datasets. Values of the statistical indices RMSE and R² were calculated by comparing model estimates against measurements of soil water content for the specified values of soil water potential.
The ability of the model to predict correctly previously unseen data is very important, so model's results for testing dataset are most interesting.
In the case of ν-SVM models based on the radial basis kernel function, the use of the aim function in the new form proposed in this paper allowed for the selection of much better model parameters. As a result, the developed models are not overtrained and perform much better than models developed using RMSE as the aim function.
We may state that the best combination of SVM algorithm, kernel function, and model parameter search method was the radial basis kernel function based ν-SVM model trained using the aim function in the form of (3).
The newly proposed aim function also substantially improved the results achieved for C-SVM radial basis kernel function based models, but in this case the results are still not good enough for these models to be used. Other models perform better, even those based on the linear kernel.
We can see (Table 3) that when the linear kernel is used, there is really no difference between models developed using different methodologies. Regardless of the SVM algorithm type used (ν-SVM or C-SVM) or the aim function used for the GA parameter search, the results achieved by the models are the same. The small differences in RMSE (0.0495–0.0517) occurring for the soil water potential −491.66 kPa in the case of the C-linear model are not important from the practical point of view of PTF usage.
This information, together with the observation that the models are almost insensitive to the value of the parameter C (which is the only parameter in the case of the ν-SVM model), leads to the conclusion that PTF models based on the ν-SVM algorithm with the linear kernel function may be constructed without any optimisation of the value of the C parameter. The general rules for selecting the value of parameter C [31] may be used instead.
When results for the testing dataset were considered, the conclusion was that the nu-radial model, based on the ν-SVM method and the radial basis kernel function, was the model of first choice for further PTF evaluations. However, the nu-linear model was also very appealing, as its accuracy was not much lower than that of its nu-radial counterpart while it had only one model parameter, which substantially simplifies model development. One note has to be made regarding this statement. It is known from the literature on SVM kernel algorithms that the linear kernel function may be considered as an alternative to the radial basis kernel only for large training datasets; for smaller training datasets the radial basis kernel function should be used. The results obtained here suggest that in our case the amount of soil data used for model development (373 records) was almost large enough to use the linear kernel function.

Conclusions
SVM methodology was successfully applied to water retention modelling. Elaborated point PTF models used sand fraction, clay fraction, total porosity, and bulk density as the input parameters.
The newly proposed ν-SVM based retention models, with a fixed value of ν = 0.5, showed better performance than C-SVM based models.
The proposed new form of the aim function (3) used for the model parameter search allows for the development of better models in the case of radial basis kernel based models.
The results of this study showed that the ν-SVM method is suitable for the development of PTF models for retention curve approximation. The advantage of using this method is a limited number of model parameters in comparison with the C-SVM methodology.
The investigated linear kernel function may be used successfully instead of the radial basis function, for point PTF developments. This kind of kernel function allows for reduction of the model parameters by one compared with the radial basis kernel function and also for simplified model development.