Abstract

Information about the spatial distribution of soil properties is necessary for natural resources modeling; however, the cost of soil surveys limits the development of high-resolution soil maps. The objective of this study was to provide an approach for predicting soil attributes. Topographic attributes and the normalized difference vegetation index (NDVI) were used to provide information about the spatial distribution of soil properties using clustering and statistical techniques for the 56 km2 Gumara-Maksegnit watershed in Ethiopia. Multiple linear regression models implemented within classified subwatersheds explained 6–85% of the variations in soil depth, texture, organic matter, bulk density, pH, total nitrogen, available phosphorous, and stone content. The prediction model was favorably comparable with the interpolation using the inverse distance weighted algorithm. The use of satellite images improved the prediction. The soil depth prediction accuracy dropped gradually from 98% when 180 field observations were used to 65% using only 25 field observations. Soil attributes were predicted with acceptable accuracy even with a low density of observations (1-2 observations/2 km2). This is because the model utilizes topographic and satellite data to support the statistical prediction of soil properties between two observations. Hence, the use of DEM and remote sensing with minimum field data provides an alternative source of spatially continuous soil attributes.

1. Introduction

Quantitative information and spatial distribution of soil properties are among the main prerequisites for achieving sustainable land management. The accuracy of soil information determines, to a large extent, the reliability of land resources management decisions [13]. Conventional soil surveys are usually used to derive information about soils and their distribution [4]; however, limited areas are covered by detailed soil information mainly due to the high costs of surveys [5]. Furthermore, the spatial distribution of soil characteristics as represented by a conventional soil map does not reflect the distribution in the field because of the polygon-based mapping employed [6]. In polygon-based mapping, soils are assumed to be homogeneous within the polygon, and abrupt changes take place at the boundaries between polygons [7, 8]. Soils usually show a diffuse spatial distribution that is hard to address in polygon-based soil maps [9]. Many researchers have suggested continuous raster maps as a better alternative to mapping soils and their properties [6, 7, 9].

In soil science, the implementation of geomatics—GIS, GPS, remote sensing, and digital elevation models (DEMs)—is suggesting new alternatives [3, 10]. Global coverage of high-resolution imagery and terrain data is increasing, and this enhances the popularity of digital soil mapping to generate accurate soil maps that capture many properties with reasonable effort [11]. The benefit of using GIS is the production of digital soil attributes and digital soil maps to optimize the need for costly field work and laboratory analysis [10, 12].

Various statistical models are used to investigate the relationships between the spatial distribution of soil attributes and that of landscape attributes. Predictive mapping techniques, such as geostatistics (i.e., kriging and cokriging), fuzzy logic, linear and multiple regression, regression trees, and neural networks, have been used to develop soil maps [10, 13]. Ongoing research in digital soil mapping indicates that quantitative prediction models are promising tools to produce soil maps with acceptable accuracy [14, 15]. Research has provided optimistic results, and some researchers obtained better results than traditional soil surveys [1618]. The use of satellite data to complement topographic information improves the mapping of natural resources [10, 13, 19]. These investigations are based on the relationship between landscape pattern and the differentiation of soil types and attributes [20]. The spatial distribution of specific soil properties has been predicted using soil-landscape modeling. This included sand, silt, and organic matter contents; A-horizon thickness; solum depth; extractable phosphorus; and pH [21, 22]. Using different terrain attributes, the empirical models developed in these studies explained 41–68% of the variation in soil properties [21, 22].

GIS is used to quantify the relationships between soils and topographic and remote sensing variables and to aid the digital soil mapping efforts [2, 3, 10, 23, 24]. Topographic parameters are derived from DEMs and used to predict the distribution of soil characteristics [5, 22, 25, 26]. The increasing availability of high-resolution remote sensing data provides a new opportunity for predicting soil characteristics with acceptable accuracy [11]. Many researchers have reported on the contribution of remote sensing data in providing acceptable prediction of soil characteristics [2729]. Nevertheless, there are many issues that warrant further research: the conjunctive use of satellite data and topographic data to improve the prediction of soil characteristics; and the best time of year to acquire satellite data. The use of various numbers (densities) of observations, and how that affects accuracy, will help researchers decide on a reasonable number of observations to maintain acceptable accuracy with an appropriate amount of field work. The use of soil-landscape models to predict soil attributes is theoretically better than interpolation techniques, because the former uses topography and satellite data as a background to guide prediction, while interpolation relies solely on the spatial relationships between adjacent observations without considering the variations in factors that drive soil differentiation. However, the merits of using soil-landscape modeling compared with interpolation techniques need verification and quantification, especially when a limited number of observations are available. The objective of this study is to investigate the utility of a soil-landscape model using DEM and remote sensing to predict the spatial distribution of soil characteristics with an optimum number of field observations.

2. Materials and Methods

The study area is located in the Lake Tana basin in the Gumara-Maksegnit watershed of the Amhara region in Ethiopia, within 12°24′–12°31′N and 37°34′–37°37′E. It is centered 45 km southwest of Gondar town and covers an area of 56 km2. Altitude within the watershed ranges from 1923 to 2851 m above sea level, with topography ranging from gentle to sharp steep slopes. The mean annual rainfall is 1052 mm. The mean minimum and maximum temperatures are 13.3 and 28.5°C, respectively. The study area is characterized by an ustic moisture regime [30]. The natural resource base has been depleted to a great extent due to soil erosion and improper land use, which resulted in reduced tree cover, biodiversity, agriculture, range, and pasture productivity.

2.1. Soil Survey and Field Observations

The area was divided into square grids (500 m × 500 m). The total area of the watershed and the heterogeneity of topography and vegetation cover, and consequently the soils, were used to guide the selection of this grid size. Larger grid size (1 km) would result in a limited number of observations and would limit the representation of soil variability. Soil samples were taken within each grid in the watershed (Figure 1). This was a compromise between grid and free sampling techniques. While the surveyor was free to take the sample within the defined grid to represent the area, the distribution of sampling sites still followed an unbiased grid. In cases where many grids shared common features, the surveyor reduced the number of samplings to represent these grids with one sampling site. Each site was excavated to a depth of 100 cm, or an impeding layer, using an auger. FAO terminology [30] was used to describe the sampling sites. The soil attributes and site characteristics recorded were GPS coordinates (easting, northing, and elevation), soil depth using an auger, slope steepness (percent) estimated using clinometers, surface stone cover (percent), and stone content of the top soil layer (0–25 cm). A soil sample for each soil observation was taken for laboratory analysis. The following soil attributes were analyzed: clay, silt, sand, organic matter, total nitrogen and available phosphorus contents, pH, and bulk density.

The hydrometer method outlined by the simplified procedure of Day was used to determine soil particle size distribution [31]. Hydrogen peroxide (H2O2) was used to destroy the organic matter, and sodium hexametaphosphate (NaPO3) was used as a dispersing agent. The bulk density (Bd) of the soil was estimated from undisturbed (core) soil samples collected using a core sampler, weighed at field moisture content, and then determined following the procedures of Blake [32].

Soil pH was measured using a digital pH meter in the supernatant suspension of 1 : 2.5 soil : liquid ratio; where the liquids were water and 1 M KCl solution, and pH was calculated by subtracting soil pH (KCl) from soil pH (H2O). The organic carbon content of the soil was analyzed following the wet digestion method described by Walkley and Black, which involves digestion of organic carbon in the soil samples with potassium dichromate (K2Cr2O7) in sulfuric acid solution [33]. The Kjeldahl procedure was followed for the determination of total nitrogen as described by Sahlemeden and Taye by oxidizing the organic matter with concentrated sulfuric acid and converting the nitrogen in the organic compounds into ammonium sulfate during the oxidation [34].

Available phosphorus was determined by Olsen et al. method [35]. The soil samples were shaken with 0.5 M sodium bicarbonate at a nearly constant pH 8.5 in 1 : 2 soil : solvent ratio for 30 min, and the extract was obtained by filtering the suspension as indicated by Olsen et al.. The available phosphorus extracted was measured by spectrophotometer following the procedure outlined by Murphy and Riley [36].

2.2. DEM and Terrain Analysis

Shuttle Radar Topography Mission (SRTM) 90 m resolution DEM was used for the study. The free availability and accuracy of this product, checked against field observations, were the main reasons for its selection. Previous research recommended some terrain attributes as best predictors of soil characteristics [37, 38]. The following terrain parameters were derived using standard commands in ArcGIS: aspect; profile, plan, and mean curvatures; flow accumulation area and slope. The average upslope contributing area for each pixel (upslope flow accumulation) was calculated by multiplying the average flow accumulation grid by the area of the pixel (flow accumulation × 8100 m2). The compound topographic index (CTI) for each pixel was calculated using the formula [21] CTI = ln (As/tanD), where As is the average upslope contributing area and D is the average slope degree. ARCSWAT was used to derive the stream network, and the watershed was also automatically divided into a number of subwatersheds (Figure 1).

The watershed subdivisions were considered as a base unit in the prediction model analyses. This is because the correlation between terrain attributes and soil characteristics for the whole watershed was very low and therefore unsuitable to predict the latter from the former. The subdivision of the whole watershed into smaller subwatersheds grouped the soil observations into homogeneous units and so enabled the establishment of better statistical relationships [37]. Each small subwatershed was divided, by the passing streamline, into two facets (subdivisions) (Figure 1). The generated facets were grouped into classes (Figure 2), based on their characteristics of area and slope. If individual subwatersheds were considered directly, the number of observations within each subwatershed was not enough to establish a rigorous statistical relationship. The grouping of subwatersheds into classes, based on characteristics such as area and slope, increased the number of observations within each class and enabled good prediction. The selection of an appropriate number of classes to cluster the subwatershed facets is important in generating good results [37]. The classification (grouping) of subwatershed facets was repeated to produce a classification that led to an approximately equal number of observations in each class (at least 15 observations). This allowed a reasonable regression model to be established for each class. The clusters were used in the prediction of soil attributes. Seven classes were found to be optimum for predicting soil attributes in the study area.

2.3. Generating the Normalized Difference Vegetation Index (NDVI) Map

ASTER satellite images taken on January 30 and March 19, 2007 and SPOT satellite images taken on October 8 and November 23, 2007 were analyzed in order to select the best image(s) for the study. The images were corrected radiometrically and geometrically using ENVI software. Radiometric correction was done separately for ASTER and SPOT imagery on an individual band basis. Geometric correction for the images within the study area was done by taking the October SPOT image as a base image. The corrected images were resampled to the same spatial resolution (15 m × 15 m). The NDVI values for each pixel were calculated using the following formula [39]: where NIR is reflection in near infrared and is reflection in red wavelength. NDVI values of the four images were compared based on their vegetation biomass distribution. Two images with maximum and minimum vegetation cover were selected: an ASTER image taken on March 19 when most of the watershed area was bare and a SPOT image taken on October 8 when the most was vegetated. This enabled assessment of the best time for satellite images to be taken for improving soil prediction. The other two images provided less extra information and were not considered in further analysis. Theoretically, the image in March reflected variations in soil appearance or color because the soil was bare at that time in Ethiopia (direct effect), while the image in October reflected the indirect effect of soil characteristics on vegetation cover and greenness because the soil was covered with vegetation at that time.

Terrain attributes and satellite data for each soil observation were extracted. Statistical analyses were implemented, within each class, between the derived terrain attributes, satellite data, and collected soil attributes of field observations using SPSS. Multiple linear regression models are usually employed to predict dependent (soil attributes) from independent variables (satellite data and terrain attributes). From a total of 220 observations, 180 were randomly selected for analysis, and the remaining 40 observations were used to assess the accuracy of the prediction model. A regression model was established for each class to predict soil attributes from terrain attributes and satellite data. Map algebra of ArcGIS was used to obtain predicted soil attribute grids using the regression models and the raster grids for each class (slope percent, contributing area, CTI, aspect classes, curvature classes, and NDVI values of ASTER and SPOT images). An example of these equations follows:

Predictions were first executed for individual classes and were then merged together to generate predicted values for the whole watershed. The prediction accuracy was verified in two ways: first, by comparing the predicted and observed values using 40 randomly selected field observations (Figure 1); and second, by comparing the predicted soil attributes with those derived using the inverse distance weighted (IDW) technique.

The role of satellite data in improving the prediction accuracy of soil attributes using multiple linear regression models was assessed by repeating the previous analysis without using the satellite images as an independent variable. This was done for one of the soil attributes (clay content). Regression coefficients () of the models without including remote sensing data were compared with those that included satellite data. The quality of the predictions with and without satellite data was also tested by examining the relationship between observed and predicted values when some differences between them were allowed. This comparison was implemented by allowing a difference between predicted and observed clay content of 3, 5, or 7%—because it was not expected that predicted values would exactly equal the observed values (where the difference allowed was zero). In fact, within a distance of 1 m in the field, there are some differences in some soil attributes within these ranges.

2.4. Selection of the Optimum Number of Observations for Soil Prediction

The effect of the number of field observations used to build the regression models on the accuracy of the prediction model (sensitivity analysis) was investigated using soil depth as an example. This was to determine the minimum number of observations to maintain acceptable prediction accuracy of soil attributes. Various numbers of observations were selected, starting from 180 and gradually decreasing: 150, 120, 90, 60, 40, 30, 25, and finally 20 observations. The prediction model was implemented until a threshold was reached where the prediction accuracy declined sharply to a low level. The classes used in the earlier modeling were amalgamated to get an optimum number of observations per class. The accuracies of predicting soil depth as a result of using different numbers of soil observations (density) were estimated using different validation indices [27, 40], such as the root mean square error (RMSE) and the mean absolute estimation error (MAEE). As the difference between predicted values and observed values increases, so do both RMSE and MAEE. Consider the following: where n is the number of verification points, is predicted soil depth, and is observed soil depth. The quality of the prediction using different densities of observations was also tested by examining the relationship between observed and predicted values by allowing some differences between them. This enabled easy observation of the gradual change in percentage of observations predicted correctly when different observation densities were used.

3. Results and Discussion

Descriptive statistics for soil attributes and site characteristics were recorded for each site (Table 1). The correlation between soil attributes and terrain attributes and satellite data, when considering the whole set of soil observations, was very low (0.0–0.50); thus, building a regression model to predict soil attributes would not produce acceptable results. However, the range of between soil attributes and terrain attributes and satellite data when the whole watershed was divided into smaller subwatersheds was 0.06–0.85 (Table 2). depended on the type of soil attribute and the subwatershed facet class for which the relationship was being established. In general, the values were acceptable compared with previous studies [4042] and were used to generate predictions of the various soil attributes within the seven classes. However, the final judgment of the prediction accuracy is made through the comparison between predicted and measured values of soil characteristics. Significant correlations between terrain attributes and satellite images with soil attributes varied for the same soil attribute for different classes (Table 3). For example, in class 2, clay content was significantly correlated with slope percent, upslope contributing area, and the ASTER image taken in March, whereas, in classes 3 and 7, clay content was highly correlated with slope percent, aspect, curvature, the ASTER image taken in March, and the SPOT image taken in October. These relationships indicate that classifying the watershed into classes improved the prediction of soil attributes. The results were supported by findings of other researchers who reported low (0.0–0.19) between various soil attributes and terrain attributes for the whole watershed, which was also improved significantly for the classified subwatersheds and consequently improved the prediction accuracy of soil attributes [37].

The regression coefficients (Table 3) indicated that digital terrain attributes had a stronger influence on soil properties. This was supported by previous studies [43, 44]. In nature, soil properties are highly spatially variable [45, 46], and, for accurate estimation of soil properties, this continuous variability should be considered. In addition, in nature, land is not flat (two dimensional) as it is represented on the map. Hence, prediction will be more reliable if a three-dimensional model is used in which terrain attributes provide a quantification of landform shape and connectivity that define geomorphometry and water flow patterns [37, 4749].

The RMSE between predicted soil characteristics and those observed in the field (Table 4) indicated good accuracy of prediction using the regression models for most soil characteristics when compared with previous studies [38, 40, 50]. Bayer et al. found that two regression techniques had similar capabilities to provide significant prediction models for soil organic carbon and iron oxides [51]. In the present study, comparing this RMSE with that derived from spatial interpolation using the IDW method indicated favorably comparable accuracy of the prediction model; RMSE of the prediction model was lower than that of the IDW in all cases. Similar results were also observed when the number of observations used was reduced to 60 (Table 4). In addition, comparing the RMSE differences (between the prediction model and the IDW) using 180 with that of 60 observations indicated higher increments in error term differences as the number of observations used decreased (Table 4). As a result, the prediction model was highly preferred due to higher accuracy than IDW, especially when a limited number of observations (60) were used. This is because the soil-landscape model provides estimates based on the characteristics of the terrain between two observations and therefore improves accuracy. This is not the case when interpolation is done with the assumption that closest observations are also closer in their soil characteristics, which is not always true. The soil-landscape model uses the background topography and satellite data to characterize each observation and then predict their soil attributes with consideration of the characteristics of these observations. This is consistent with soil genesis theories that are not considered in interpolation techniques. Previous research indicated that the use of different regression techniques enables the prediction of different topsoil parameters in a rapid and nondestructive manner, while avoiding the spatial accuracy problems associated with spatial interpolation [52].

The results indicated that soil attributes were predicted with acceptable accuracy using SRTM 90 m DEM and also provide a visual representation of the spatial distribution of soil attributes (Figure 3). This indicated that soil attributes can be predicted with low resolution DEMs. These conclusions were similar to those of Thompson et al. [53] and Wechsler [54], who concluded that high-resolution DEMs are not always necessary for soil-landscape modeling. Chabrillat et al. (2002) reported that although higher spatial resolution provided a purer image in more heterogeneous sites, it did not identify new features that a lower spatial resolution data set would miss in homogeneous terrain [55].

When only terrain attributes without satellite images were used to build multiple regression models (Table 5), the declined for all classes compared with those obtained using both terrain and satellite data (Table 3). For example, there were large decreases in in class 5 from 0.38 (Table 3) to 0.18 (Table 5) and in class 2 from 0.8 (Table 3) to 0.68 (Table 5). This indicated that satellite images could improve the prediction of soil attributes when used together with terrain attributes in building regression models. Gerighausen et al. found that the use of multiannual image data enhanced the prediction of soil parameters [56]. The percent of agreement between the predicted and observed clay content using satellite images and terrain attributes was compared with those derived using terrain attributes only. The prediction accuracy of clay content (percent of correctly predicted observations) increased as a result of using satellite images with terrain attributes compared to using terrain attributes only. For example, 50% of the observations showed agreement in predicting clay contents within a difference of ±7% between observed and predicted values when only terrain attributes were used to build the model (Table 6). The above agreement was improved to 62.5% when satellite images were used with the terrain attributes. The results showed that satellite images improved the prediction of soil attributes when used with terrain attributes to build multiple regression models. This is because satellite images provide more information about the factors controlling soil variability in addition to the topographic information provided by DEM-derived terrain attributes. Chabrillat et al. (2002) indicated that imaging spectrometry aids in the detection and mapping of expansive clays [55]. The images taken at two different dates provided direct and indirect information. The direct information was through the variation in the soil surface reflectance, which is related to soil variability (when the soil surface is bare during March). The indirect information was through the reflectance of the different vegetation cover, which reflects the variation in soil type through the effect on vegetation cover characteristics during October. Many researchers have suggested that remote sensing helps in providing accurate prediction of soil properties [2729]. Bartholomeus et al. (2012) showed that although variation of soil properties within vegetation classes is large, the vegetation composition is useful to estimate soil properties [57].

Since the spectral reflectance of both satellite images was significantly correlated with clay content (Table 3), we can conclude that two images taken during dry and rainy seasons could be sufficient, with terrain attributes, to predict soil attributes using multiple linear regression models. During the dry period, there was a correlation between soil properties (clay content) and spectral reflectance in the visible range [58]. During the rainy season, most of the area was covered with vegetation that reflects variation in soil properties such as clay content. As a result, soil properties could be indirectly predicted by determining the spectral reflectance of the vegetation cover. However, the satellite image taken when the field was bare showed higher correlation with all soil attributes compared with the satellite image taken when the field was covered by vegetation. Hence, the best time to acquire satellite images to improve the prediction of soil attributes would be during the dry period when most of the area is bare. Mulder et al. [29] also reported similar findings and successfully used airborne, space borne, and in situ measurements using various instruments.

Selecting Optimum Number of Field Observations to Predict Soil Attributes. The seven classes that were used in the earlier modeling (and using 180 observations) were amalgamated first into five classes, and several trials were done to predict soil depth using 150, 120, and 90 observations to build regression models. The analysis was repeated based on two classes to build regression models using only 60, 40, 30, 25, and 20 observations. This was because, when a limited number of observations were used (≤60 observations), there were insufficient observations if five or seven classes were used (the number of observations for each class was too low to build good regression models). The regression models (and ) to predict soil depth at different densities of field observations using terrain attributes and satellite images were used to get predicted soil-depth grids (Figure 4). The proportion of predicted soil-depth of zero (no soil) increased dramatically as the number of observations used reduced from 25 to 20 (Figure 4).

The optimum number of observations to produce an acceptable prediction of soil attributes was assessed using percent of observations where the soil depth was correctly predicted within an acceptable range of agreement (using RMSE and MAEE). The accuracy of predicting soil depth decreased from 97.5% when 180 field observations were used to 95% for 150 observations, reached 65.0% using 25 observations, and then dropped drastically to 35% for 20 observations (Table 7). There were gradual increases in RMSE and MAEE from the predictions using 180 observations (26.4 and 21.7 cm, resp.) to the predictions using 25 observations (57.5 and 46.4 cm, resp.); then, both increased sharply for 20 observations (126 and 88.2 cm, resp.; Table 7). Since there was a high increment in RMSE and MAEE values at 20 compared with 25 observations, it can be considered that 25 observations (or one observation/2 km2) were optimum to build multiple regression models in soil-landscape modeling for this particular watershed. The accuracy assessment methods used for selecting the optimum number of observations to predict soil depth indicated that 25 observations were the minimum required for multiple linear regression models to produce acceptable soil prediction. The results also indicated that the prediction accuracy was the same when 60 or 90 observations were used (87.5%). However, a close look at Figure 4 revealed that the spatial distribution of the predicted soil depth classes drastically changed when the number of observations used was <60. Therefore, it can be suggested that, for this study area, the optimum number of observations to generate an acceptable prediction is somewhere between 60 and 25 observations (1-2 observations/2 km2). The users can select between these observation densities to produce an optimum results. The approach indicated that soil-landscape modeling could be used with a low number of observations to produce accurate predictions of soil attributes with an acceptable representation of the spatial distribution of soil attributes over the landscape.

4. Conclusions

The use of satellite image data together with terrain attributes improved the prediction of soil attributes. Two satellite images taken at different dates, one in the dry season and the other in the rainy season, were sufficient to capture the variation in soil reflectance and improve the prediction accuracy. If one image is available, the best acquisition time to facilitate soil-landscape modeling is during the dry season when most of the soil surface is bare. Generally, the results indicated that soil attributes were predicted with acceptable accuracy using multiple linear regression models from freely available DEM (90 m resolution) and satellite images with a minimum of field work. In addition, the prediction model was highly preferred due to comparable accuracy with the spatial interpolation using IDW, especially when a limited number of observations were used, which is usually the case in data-scarce areas. For this study, the optimum number of observations to predict soil attributes using multiple regression models and using terrain attributes and remote sensing data was between 60 and 25 observations (approximately 1-2 observations/2 km2). The produced predictions will be very useful to provide information for detailed modeling activities, especially for a country like Ethiopia in which information on the detailed spatial distribution of soil attributes is very scarce. Further investigations should be made for applying the model over larger areas with diverse soils and landscape features to validate the results and to out-scale their application.

Conflict of Interests

The authors declare no conflict of interests with any trademark or software mentioned in this paper.

Acknowledgments

The authors would like to acknowledge the financial support by the Austrian Development Agency and the CGIAR Research Program on Water, Land and Ecosystems (CRP-5).