Assessment of reclaimed soils by unsupervised clustering of proximal sensor data

The application of soil proximal sensors on reclaimed sites presents a novel method for assessing the quality of reclaimed landscapes. This method improves assessment reliability, information management, and environmental assurance. One proximal sensing system that could be used to provide high spatial resolution measurements of soil parameters is an on-the-go optical sensor that collects data at two wavelengths: 660 and 940 nm. Proximal soil sensing data were collected at 27 sites, where organic matter, cation exchange capacity (CEC), and soil water content were measured on 221 soil samples collected from a depth of 0 to 15 cm. The proximal soil sensor data were then automatically clustered using a combination of self-organizing maps and random uniform forests. Overall, the proximal sensor data combined with this data analysis approach created maps with either three or four soil zones. On average, soil zones had statistically significant differences in organic matter, CEC, and water content. This system could be used to map out zones with significant soil variation as part of reclamation monitoring and then to guide laboratory analytical sampling. Future work should focus on the development of on-the-go reflectance spectroscopy systems to provide quantitative soil data with high spatial resolution.

A potential solution to the need for more timely and spatially comprehensive information is the use of soil proximal sensors. The potential of reflectance spectroscopy to measure key soil parameters such as organic carbon and total nitrogen has been extensively investigated within the broader context of soil analysis (e.g., Ben-Dor and Banin 1995; Chang et al. 2001; Chang and Laird 2002; Rossel and Behrens 2010; Sorenson et al. 2018) and more specifically for reclamation monitoring (Sorenson et al. 2017). Additionally, simpler, lower-cost two-band reflectance sensors have been explored for potential use in precision agriculture for mapping changes in soil organic matter contents at high spatial resolutions (Kweon et al. 2013; Piikki et al. 2016).
Determining soil organic carbon is part of wellsite reclamation assessment procedures in Alberta (Alberta Environment 2010). Previous work in Alberta has identified that wellsite construction can be associated with a decrease in soil organic carbon and overall changes in soil properties (Hammermeister et al. 2003). Soil bulk density has been found to be higher on a number of reclaimed wellsites, which in turn affects infiltration and soil water content (Hammermeister et al. 2003). Soil reclamation strategies on oil and gas wellsites often focus on increasing soil porosity and water holding capacity (McConkey et al. 2012). Cation exchange capacity (CEC) is closely related to organic matter content in soil, and both parameters are closely tied to soil fertility (Parfitt et al. 1995). Therefore, organic matter, water content, and CEC can all be affected by construction activities, and the ability to detect changes in these parameters is important for monitoring whether a site is moving toward reclamation success.
Soil proximal sensing generates large volumes of data, which necessitates the development of techniques to process this information into a format that facilitates decision making. While building site-specific calibration models with two-band optical reflectance sensors has met some degree of success, universal calibration equations have not been successfully developed so far (Kweon et al. 2013). Rather than quantitative analysis, two-band reflectance data could potentially be used for qualitative measurements using unsupervised machine learning tools. One such technique for unsupervised data analysis and dimensionality reduction is self-organizing maps (Wehrens and Buydens 2007). Self-organizing maps are similar to multi-dimensional scaling, but rather than trying to preserve relative distance between points, self-organizing maps focus on preserving the topology of the data structure, and concentrate more on mapping similarity rather than dissimilarity (Wehrens and Buydens 2007). Self-organizing maps have been used successfully to solve other challenges associated with complex soil data analysis, such as the assessment of soil biological quality (Mele and Crowley 2008).
The main objective of this research was to determine whether proximal sensing data collected with a two-band reflectance sensor, with bands in the red and near infrared, could be used to map relative differences in key soil attributes. While the spectral features for organic matter and water are strongest in the short-wave infrared region, some features are present in the near infrared region, and increases in both properties decrease overall reflectance (Rossel and Behrens 2010).
Specifically, this study focused on combining two unsupervised machine learning methods, self-organizing maps and random uniform forests, to classify soil proximal sensing data. The overall goal of this research is to validate automated spatial zoning of soil data to support reclamation assessments.
The 27 study sites (Table 1) were at locations ranging from southeast of Calgary, Alberta to south of Lethbridge, Alberta (Figure 1). All sites contained disturbed and undisturbed soils, as they consisted of reclaimed oil and gas wellsites, with the wellsite at the center of each assessed area. An area surrounding the wellsite, which included undisturbed agricultural soils, was also included in the assessment area. At each site, 5 to 10 samples were collected for laboratory analyses. In total, 221 samples were collected from the 27 sites. Each site had samples collected from both disturbed and undisturbed soils.

Proximal sensing data
Proximal soil sensing data were acquired using the Veris Technologies OpticMapper®, which consists of a two-band reflectance sensor along with a GPS unit to collect location and elevation data. Reflectance data were collected by the sensor in two bands, within the red portion of the electromagnetic spectrum from 650 nm to 670 nm and within the near infrared region from 930 nm to 950 nm. A fluted coulter on the OpticMapper cuts through crop residues and opens a slot in the soil where measurements take place. All measurements take place on the surface of the exposed soil in the slot. Data were collected in northeast to southwest passes spaced 10 m apart across the site. Calibration checks for the instrument were performed as per manufacturer specifications. Light and dark reference panels were used to check sensor performance prior to data collection, and sensor response was in accordance with manufacturer specifications at each site.

Laboratory Analyses
Each of the 221 soil samples collected was analyzed for organic matter content and for CEC; 179 of the samples were additionally analyzed for gravimetric water content. Gravimetric water content was analyzed rather than volumetric water content, as previous reflectance spectroscopy research has demonstrated that gravimetric water content can be more accurately measured than volumetric water content with reflectance-based measurement methods (Ji et al. 2016). Samples were collected the same day the proximal sensor measurements were taken. All soil samples were collected from a depth of 0 to 15 cm to correspond to the range of the optical measurements that were collected, based on the depth of the furrow created by the fluted coulter. Organic matter content was determined by the loss on ignition method described in Nelson and Sommers (1996). CEC was quantified using the methods described in Hendershot et al. (2008), and gravimetric water content was determined using the method described in Topp et al. (2008).
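For readers unfamiliar with the two mass-based quantities above, the following minimal Python sketch shows their standard textbook definitions. It is illustrative only: the function and argument names are hypothetical, and the cited methods (Nelson and Sommers 1996; Topp et al. 2008) specify drying and ignition conditions that are not reproduced here.

```python
def gravimetric_water_content(wet_mass_g, oven_dry_mass_g):
    """Mass of water per unit mass of oven-dry soil, as a percentage."""
    if oven_dry_mass_g <= 0:
        raise ValueError("oven-dry mass must be positive")
    return 100.0 * (wet_mass_g - oven_dry_mass_g) / oven_dry_mass_g


def loss_on_ignition(oven_dry_mass_g, ignited_mass_g):
    """Mass lost on ignition per unit mass of oven-dry soil, as a percentage."""
    if oven_dry_mass_g <= 0:
        raise ValueError("oven-dry mass must be positive")
    return 100.0 * (oven_dry_mass_g - ignited_mass_g) / oven_dry_mass_g
```

Both quantities are expressed relative to oven-dry mass, so the water content of a sample does not affect its loss-on-ignition value.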

Data Processing
All data processing was performed using R (R Core Team 2018). The raw reflectance data and elevation point data were converted to rasters using the inverse distance weighting function in the gstat package of R (Pebesma 2004). The reflectance and elevation data were then smoothed using a 3x3 focal median window to reduce noise using the raster package in R (Hijmans 2016). Following smoothing, the ratio of near infrared to red reflectance was

Model Development
For each site, a raster stack was created with a raster for each of the following parameters: red reflectance, near infrared reflectance, ratio of red to near infrared reflectance, elevation, slope, and topographic position index. The raster stack was first processed using a self-organizing map with the kohonen package in R (Wehrens and Buydens 2007). A 10 by 10 hexagonal grid topology was specified for the self-organizing map. Following initial clustering of the data into the self-organizing map grid, final clustering into a smaller number of clusters was performed using an unsupervised random uniform forest in R (Ciss 2015). The advantage of this analysis is that it can be used for unsupervised clustering of data. Additionally, random uniform forests will identify the optimal number of clusters for the data without user specification of cluster numbers.
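To illustrate how a self-organizing map compresses many raster cells into a small grid of code vectors, a minimal Python sketch follows. It is not the kohonen implementation used in the study: it assumes a rectangular rather than hexagonal grid, a "bubble" neighbourhood, and linearly decaying learning rate and radius, and all function names and default values are hypothetical.

```python
import math
import random


def best_matching_unit(codes, x):
    """Return the grid coordinate whose code vector is closest to x."""
    return min(codes, key=lambda unit: sum((w - v) ** 2
                                           for w, v in zip(codes[unit], x)))


def train_som(samples, rows=10, cols=10, epochs=20, seed=0):
    """Train a rectangular-grid SOM and return {(row, col): code vector}."""
    rng = random.Random(seed)
    # Initialise every grid unit with a randomly chosen sample.
    codes = {(r, c): list(rng.choice(samples))
             for r in range(rows) for c in range(cols)}
    total_steps = epochs * len(samples)
    start_radius = max(rows, cols) / 2
    step = 0
    for _ in range(epochs):
        order = list(samples)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            alpha = 0.05 + (0.01 - 0.05) * frac   # learning rate decays linearly
            radius = start_radius * (1.0 - frac)  # neighbourhood shrinks over time
            bmu = best_matching_unit(codes, x)
            for (r, c), w in codes.items():
                # Bubble neighbourhood: update every unit within the radius.
                if math.hypot(r - bmu[0], c - bmu[1]) <= radius:
                    for i, v in enumerate(x):
                        w[i] += alpha * (v - w[i])
            step += 1
    return codes
```

In a workflow like the one described, each sample would be the vector of (typically scaled) layer values for one raster cell, and `best_matching_unit` would assign every raster cell to one of the grid units whose code vectors are clustered in the next step.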
Random uniform forests use a multiple-step approach to perform the unsupervised clustering. The random uniform forest first grows a forest of decision trees using random subsampling and random cut points according to a continuous uniform distribution, followed by multidimensional scaling and clustering with either k-means or hierarchical clustering (Ciss 2015). In this study the hierarchical clustering step was used, and the optimal number of clusters was automatically selected based on where the maximum lagged difference in cluster heights occurred. The unsupervised clustering using random uniform forests was performed on the self-organizing map grid cell codes, leading to a final cluster number for each grid cell and for each raster cell associated with that grid cell. An example of the clustering results from one site is illustrated in Figure 2.
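The rule for selecting the number of clusters can be sketched in Python as follows. This is one plausible reading of "maximum lagged difference in cluster heights", not the randomUniformForest implementation: it assumes average linkage and plain Euclidean distances between points, whereas the actual method clusters coordinates obtained by multidimensional scaling of the forest output. All function names are hypothetical.

```python
import math


def agglomerate(points):
    """Average-linkage clustering; returns [(merge height, clusters after merge)]."""
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between members of the two clusters.
                d = sum(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        history.append((d, [list(c) for c in clusters]))
    return history


def choose_partition(history):
    """Cut the tree just below the largest jump between successive merge heights."""
    if len(history) < 2:
        raise ValueError("need at least three points to compare merge heights")
    heights = [h for h, _ in history]
    gaps = [heights[i + 1] - heights[i] for i in range(len(heights) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    return history[cut][1]
```

Calling `choose_partition(agglomerate(points))` returns the partition that exists just before the most "expensive" merge, so the number of clusters follows from the data rather than from a user-supplied value.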

Model Evaluation
Model performance was evaluated on the unsupervised classification results using a linear mixed effects model with the nlme package in R (Pinheiro et al. 2016). For each site, the cluster that each laboratory analytical sample was collected from was determined. Because the cluster numbers are assigned arbitrarily during the classification process, the clusters were then relabelled in ascending order to correspond to the lowest to highest organic matter concentration, CEC, or gravimetric water content. This step was necessary to allow for the comparison of relative effect differences among clusters across all sites. Organic matter, CEC, and gravimetric water content data were then scaled by subtracting the mean and dividing by the standard deviation. This transformation permitted comparison between sites regardless of the differences in magnitude across sites. The standardized data were then compiled and analyzed using a linear mixed effects model with site as a random factor. The data were normally distributed and showed homogeneity of variance. Linear mixed effects models were run for organic matter, CEC, and gravimetric water content to test whether significant differences were present among clusters.
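The relabelling and scaling steps can be sketched in Python for the samples of a single site. Two assumptions are made here and are not confirmed by the text: the sample (n - 1) standard deviation is used for scaling, and label 1 is given to the cluster with the lowest mean, which is the direction implied by the standardized cluster means reported in the results. The function names are hypothetical, and the mixed effects model itself is not reproduced.

```python
from statistics import mean, stdev


def standardize(values):
    """Scale one site's values to zero mean and unit (sample) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def relabel_by_mean(cluster_ids, values):
    """Replace arbitrary cluster IDs with 1..k in order of ascending cluster mean."""
    groups = {}
    for cid, v in zip(cluster_ids, values):
        groups.setdefault(cid, []).append(v)
    ranked = sorted(groups, key=lambda cid: mean(groups[cid]))
    mapping = {cid: rank for rank, cid in enumerate(ranked, start=1)}
    return [mapping[cid] for cid in cluster_ids]
```

Applying both functions site by site gives every sample a comparable cluster label and a unitless response value, which is what allows samples from all sites to be pooled in one model with site as a random factor.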

Results and Discussion
The clustering exercise yielded three clusters for most sites (Table 2): 26 of 27 sites were divided into three clusters and only one site into four. The average cluster size was 2.40 ha compared to an average site size of 14.40 ha (Table 1). Soil organic matter values ranged from 2.28 to 8.27 percent with an average value of 4.54 percent. Overall, there was a significant difference among clusters in terms of their organic matter content (overall F-value = 22.33, p-value < 0.01). Additionally, each cluster had significant differences in organic matter content relative to all other clusters (Table 3). The greatest difference in organic matter content was between clusters one and two, with cluster three having only slightly more organic matter than cluster two (Figure 3a). On average, organic matter contents in cluster one were 0.52 standard deviations below the mean, while those in cluster two were 0.15 standard deviations above the mean and those in cluster three were 0.46 standard deviations above the mean. For the one site with a fourth cluster, organic matter contents in cluster four were 1.13 standard deviations above the mean.
The CEC ranged from 1.24 to 71.76 meq/100g with an average value of 27.90 meq/100g across all sites (Table 1). The average gravimetric water content was 18 percent with values ranging from 8 to 29 percent. There was a significant difference among clusters in terms of CEC (F-value = 7.29, p-value < 0.01) and gravimetric water content (F-value = 28.60, p-value < 0.01).
There were significant differences amongst all clusters for both parameters, with the exception of cluster four for CEC, which was not significantly different from cluster three (Table 3).
Previous work investigating the same sensor as the one used in this study had been successful in building site-specific calibration models (Kweon et al. 2013; Piikki et al. 2016). However, calibration models that generalized to new areas could not be built in either of these two studies. In the present study, site-specific calibration models could not be successfully built for any of the analyzed soil parameters. Specifically, Piikki et al. (2016) used the OpticMapper sensor with portable X-ray fluorescence and electromagnetic induction to build site-specific calibrations for total carbon. However, predictions for new sites were poor for total carbon, and soil texture could not be modelled with any accuracy. Kweon et al. (2013) concluded that field-specific calibrations are possible and sufficiently accurate based on leave-one-out cross validation results. However, they could not develop a universal calibration model. While a quantitative model could not be developed in our study either, our results indicated that an unsupervised classification approach can be used to successfully classify a reclaimed area into zones with different organic matter, CEC, and moisture contents based on soil reflectance data from two distinct bands along with elevation data. A possible explanation for why a quantitative model could not be built in this study is sample size: 30 to 40 samples were collected in the Piikki et al. (2016) study, compared with 5 to 10 per site here, and with more data collected from each site a site-specific calibration model may have been possible. Additionally, compared to the Kweon et al. (2013) study, the average standard deviation in organic matter was higher in this study, likely due to the presence of disturbed soils. More calibration samples are needed to build an accurate calibration model for these sites.
It is important to note that while results from the current study demonstrate that proximal sensor data can be used to automatically zone soil with different parameters, the methodological approach used does not provide any information on the direction or magnitude of difference among classes. Alternative sensor arrangements have been used by other researchers to quantitatively measure soil organic carbon and other soil parameters under field conditions. Viscarra Rossel et al. (2017) successfully used reflectance spectroscopy and gamma ray attenuation to measure soil organic carbon stocks from soil cores collected in the field. Reflectance spectroscopy has been combined with X-ray fluorescence to successfully take multi-parameter measurements of soil (Duda et al. 2017). Additionally, field reflectance spectroscopy has been used under similar site conditions to successfully measure soil organic carbon and nitrogen, but not soil pH (Sorenson et al. 2017).
Construction activities have been documented to lower soil organic matter content and increase bulk density (Hammermeister et al. 2003). These changes in turn influence both the porosity of the soil and the CEC, which is closely linked to organic matter content (Parfitt et al. 1995). Overall, these results show that two-band reflectance measurements, along with elevation data, have a role to play in monitoring soil reclamation when combined with the appropriate data analysis techniques. While quantitative models could not be successfully built using the sensors in this study, automated clustering of the data allowed the identification of zones with different organic matter, CEC, and water contents (Table 3). In practice, follow-up investigations would be required for quantitative determination of soil organic matter, CEC, and water content for each soil zone. This analysis could consist of either collecting samples for conventional laboratory analysis or collecting further point data using a higher spectral resolution reflectance spectroscopy system (Sorenson et al. 2017).
Ultimately, while there is value in qualitatively delineating reclaimed soils into zones to identify variance in key soil characteristics, quantitative measurements would be an improvement. Rather than using two reflectance bands, high-resolution visible and near infrared reflectance spectroscopy could be utilized to provide quantitative measurements. Soil organic carbon, CEC, and water content have all been shown to be measurable using reflectance spectroscopy (Soriano-Disla et al. 2014). Point spectroscopy has also been used on some of these sites to measure soil carbon quantitatively in the field (Sorenson et al. 2017).

Conclusion
The results of this study indicate that broad-band reflectance data combined with unsupervised classification approaches can be used to qualitatively and automatically map a field into soil zones with differences in organic matter, CEC, and water contents. However, more site-specific data than were collected in this study or alternative types of sensors are needed to

Tables

Table 1. Soil and Site Characteristics. The first number in each column indicates the mean for a given parameter. The second numbers are the range of values observed across all sites.