Soil organic matter determination based on artificial olfactory system and PLSR-BPNN

Soil organic matter (SOM) is a key indicator of soil fertility. For accurate measurement of SOM, a novel method based on an artificial olfactory system (AOS) was proposed. The response curves of soil volatile organic compounds (VOCs) were measured using a metal-oxide semiconductor sensor array, and four features (maximum value, mean differential coefficient, response area, and the transient value at the 20th second) were extracted from the curves and used to build an olfactory feature space. Prediction models were then established using pattern recognition algorithms. To further enhance the accuracy of AOS measurement, we used Monte Carlo cross-validation (MCCV) to identify and eliminate abnormal samples from the soil olfactory feature space. The genetic algorithm-back-propagation (GA-BP) dimension reduction method was then used to find appropriate feature vectors, and two types of hybrid model were presented. One was a combination of the support vector machine (SVM) and the group method of data handling (GMDH), SVM-GMDH; the other was a combination of partial least squares regression (PLSR) and a back-propagation neural network (BPNN), PLSR-BPNN. The forecasting performances of three single models (BPNN, PLSR, and support vector regression, SVR) and the two hybrid models (PLSR-BPNN, SVM-GMDH) were comparatively evaluated. The evaluation indices included the coefficient of determination (R²), root mean square error (RMSE), ratio of performance to deviation (RPD) and relative prediction error (RPE). The predictive capabilities of all five tested models improved after elimination of abnormal samples and feature reduction. Moreover, PLSR-BPNN performed best in predicting SOM concentrations, with R² = 0.952, RMSE = 1.771, RPD = 4.291 and only slight variation of RPE within 0-0.185, and thus offers a reference for predicting SOM via AOS.


Introduction
Soil organic matter (SOM) is a critical property of soils [1] and contributes to soil physical property improvement, plant growth, and crop production [2]. SOM is a key evaluator of soil fertility, and loss of SOM reflects a decline in soil quality [1,3]. Accurate determination of SOM variation is critical for guidance in crop fertilization and soil quality improvement.
SOM concentrations are usually measured by chemical detection and analysis of soil samples collected in the field, but this method is limited by high time/labor consumption, high costs and destructiveness [4]. Thus, new fast, economical and nondestructive methods for precise prediction of SOM concentrations are increasingly in demand [5]. In recent years, owing to the universal application of proximal soil remote sensing, visible and near-infrared diffuse reflectance spectroscopy (Vis-NIR DRS) has attracted increasing interest among soil scientists and has been considered feasible for soil analysis [6][7][8]. For instance, Conforti et al predicted the spatial variation of SOM using laboratory-based Vis-NIR spectroscopy [9]. Nawar et al used different Vis-NIR DRS spectra to detect clay and SOM concentrations [10]. Despite its high accuracy, Vis-NIR spectroscopy is limited by its susceptibility to changes in soil humidity [11], soil particle size [12] and iron oxide [13].
The generation and consumption of soil gases are mainly related to microbial activities in soils [14,15]. SOM is the major substrate providing the nutrients and energy needed for the vital activities of soil microbes [16]. This substrate can generate abundant volatile organic compounds (VOCs) and gases during microbial degradation [15], which means that the VOCs and gases in soils inevitably correlate with SOM [17]. Such correlation facilitates the fast and low-cost detection of SOM. Gas detection can be achieved at very low cost, especially with methods based on solid-state chemical sensors [18]. However, soil gases are compositionally complex [19] and difficult to identify with a single gas sensor. An artificial olfactory system (AOS, also called an electronic nose or e-nose), consisting of a non-selective sensor array and a pattern recognition model, is considered an efficient means of detecting complex gases [18]. Although an AOS does not provide any concrete information about the individual volatile gaseous compounds [20], its combination with appropriate pattern recognition algorithms, such as artificial neural networks (ANNs) or statistical methods, can identify the gas pattern of specific samples and separate them from other samples [21,22]. So far, AOSs have been extensively applied to foods [23][24][25], medicine [26,27], diseases [28], the environment [29], beverages [30] and other fields [31][32][33]. AOSs have also reportedly been used for soil characterization. For instance, Andrzej et al used an e-nose to assess soil humidity across ten moisture levels in ten types of soils, and indicated that it is a very promising tool [34]. Pobkrut et al integrated an e-nose with robots to detect soil surface VOCs and thereby measured soil fertility [35]. However, there is little research on the use of an AOS for SOM measurement.
Although an AOS has the advantages of low cost, fast detection and non-destructiveness [36], it also has some inevitable weaknesses, mainly in the construction of the sensor array and the selection of the pattern recognition algorithm. For different applications, sensor arrays are constructed differently, and selectivity, sensitivity and operating temperature must be considered comprehensively. When the components of the gas to be measured are known, the most popular approach is to select specific sensors to construct a hybrid sensor array. However, soil gas forms through diverse mechanisms and its composition is very complex, so it is difficult for specific sensors to detect such an uncertain gas mixture efficiently. Using non-specific sensors of the same type, differentiated by temperature control, appears to be an effective way to collect the signals of such uncertain gas mixtures.
The pattern recognition algorithm is a key component of an AOS [37]. Commonly used pattern recognition algorithms include the back-propagation neural network (BPNN), support vector regression (SVR) and partial least squares regression (PLSR). The BPNN is the most widely used form of neural network, and has strong nonlinear mapping ability. SVR is proposed based on support vector machine (SVM) theory, which can effectively simplify the complexity of high-dimensional space. PLSR is particularly useful for predicting a group of dependent variables from a large number of independent variables, especially when there is a linear correlation between variables. These algorithms are widely used in soil property modeling. For example, based on Vis-NIR spectra, Ji et al compared the performances of PLSR and SVR in predicting soil information in situ in rice fields [38]. Qi et al preprocessed the hyperspectral Vis-NIR data of 153 soil samples using different methods and assessed the ability of PLSR, SVR and BPNN in predicting soil available nutrients, including nitrogen (N), phosphorus (P) and potassium (K) [39]. Santana et al used the NIR spectral technique and PLSR to minimize the effect of moisture on SOM detection [40]. Generally, all single pattern recognition algorithms are faced with inherent limitations [41]. For instance, BPNNs rely on abundant training data; the parameter selection of SVR is very difficult; and PLSR is not effective for nonlinear data. Guo et al also confirmed that no single prediction algorithm can be considered as superior [42]. To solve these problems, researchers have paid much attention to hybrid models in recent years. Compared with a single model, an appropriate combined model can yield more accurate results [43].
The hybrid model refers to an appropriate combination of different pattern recognition algorithms that comprehensively utilizes the useful information of the single algorithms and thereby improves the prediction precision to the largest extent. Wen et al combined a gray model (GM(1,1)) and a BPNN for grain yield prediction, in which the fluctuation of the data series was weakened by gray theory and the nonlinear handling ability of the BPNN was fully utilized [44]. Yin et al combined secondary hybrid decomposition, crisscross optimization and an extreme learning machine to predict wind power [45]. Wang et al combined a BPNN and a genetic algorithm (GA) for wind speed forecasting [46]. These studies suggest that such combinations can yield better prediction results.
The complex components of soil gases may correlate linearly or nonlinearly, or both, with SOM. PLSR is especially well suited when the characteristic matrix (or matrix of predictors) has more variables than observations, and when multicollinearity exists among the variables [47]. BPNNs, because of their strong nonlinear approximation ability, are widely used in nonlinear modeling [48]. Thus, a hybrid model combining PLSR and a BPNN can effectively utilize the advantages of both and improve the predictive precision. In this study, PLSR and a BPNN were combined for the first time for SOM detection. This combination is called PLSR-BPNN.
In our previous study, we built an AOS for SOM detection, used a metal-oxide semiconductor (MOS) gas sensor array to record the response curves of soil gases, extracted four features (V_max: maximum value; MDCV: mean differential coefficient value; RAV: response area value; V_t: transient value at the 20th second), and thereby constructed a soil olfactory feature space (SOFS) [49]. Single prediction models, including a BPNN, SVR and PLSR, were also assessed. In this work, we optimized the SOFS prior to prediction modeling to further improve the SOM forecasting performance. The optimization comprised the elimination of abnormal samples and the dimension reduction of features. Abnormal samples mainly originate from misoperation, errors of the AOS, or temperature, humidity and other external factors, and they greatly reduce the precision of the prediction models; they should therefore be identified and eliminated. Monte Carlo cross-validation (MCCV) has been confirmed to be a very useful method for removing abnormal samples [50]. The dimension reduction of features is another key influence on model performance [51], since the original feature space contains much redundant information unrelated to modeling; using an unoptimized feature space in modeling enlarges the amount of calculation and decreases the precision of prediction. The frequently used dimension reduction methods are based either on statistics (e.g. principal component analysis (PCA)) or on optimization (e.g. the firefly algorithm or GA) [52]. GA-BP is a combination of a GA and a BPNN, used here as an optimization algorithm for dimension reduction to overcome the tendency of the BPNN to converge to a local optimum. In this paper, on the basis of the artificial olfactory detection method for SOM, an optimization study of the SOFS was carried out, and two new hybrid prediction models were proposed to improve the accuracy of soil olfactory detection.
The aims of this paper are: (a) to discuss the modeling effect of SOFS optimized by MCCV and GA-BP; and (b) to evaluate the performance of three single models (BPNN, SVR, PLSR) and two hybrid models (SVM-GMDH, PLSR-BPNN) for SOM prediction.

Structure and working principle of AOS
A laboratory-based AOS was used to detect volatile soil gases. The AOS mainly consisted of a sensor array installed in a closed reaction chamber, a signal processing circuit, and a laptop PC (figure 1). The sensor array was a monoclass sensor array, which was composed of multiple sensors of the same type.
The signal processing circuit included multiple temperature modulation circuits and multiple basic measurement circuits. Each sensor in the sensor array corresponded to one temperature modulation circuit and one basic measurement circuit. The temperature modulation circuit was used to set the sensor's working temperature. The basic measuring circuit, as shown in figure 2, was used to collect the gas response signal.
The VOCs generated during SOM degradation include gaseous hydrocarbons (CH₄, C₂H₄, C₂H₆, C₃H₈), H₂S, ammonia, aldehydes, etc [19]. Therefore, the sensors were selected on the basis of their sensitivity to the VOCs in soil. In this study, ten gas sensors of the same type (IDT SGAS707, purchased from Integrated Device Technology Inc., San Jose, CA, USA) were used to construct an array for the detection of VOCs in soil gas. The basic metrological parameters of the SGAS707 are shown in table 1, where R_Air/R_Gas represents the ratio of the response in air to the response in VOC gas.
These sensors were arranged in a 2 × 5 array at a line spacing of 20 mm and column spacing of 10 mm. Each sensor contained a resistance element (referred to as the heater) capable  of modulating the working temperature, as shown in figure 2.
The working temperature is controlled by the heater voltage (V_H), and the selectivity of the sensors can be enhanced by temperature modulation [53]. There are usually two temperature modulation modes: isothermal modulation and dynamic thermal modulation [54]. In our study, isothermal modulation was adopted. To set the working temperature of the sensors, we needed the relationship between V_H and the sensor working temperature. Since the sensing material is located inside the metal protective casing and is difficult to measure without damaging the sensor, we used the temperature of the metal casing (W_T) instead of the temperature of the sensing material to carry out the temperature modulation work. The relationship between W_T and V_H was calibrated with a PT1000 platinum resistance thermometer (precision class B) attached to the metal casing of the MOS gas sensor, as shown in figure 3. A polynomial was fitted to the calibration data of W_T and V_H, giving formula (1) with a fitting degree (r²) greater than 0.99. In the design, the V_H values of the sensors were set in steps of 0.25 V over the range 1.25 V to 3.5 V (this range is limited by the temperature modulation circuits [49]). The corresponding working temperatures of each sensor were 34.
Prior to operation, a vacuum pump was used to flush the test chamber with inert helium. After the output of the sensor array stabilized, the rinsing was stopped and the pass valve was closed, sealing the test chamber. Measurement was then started. During the measurement, soil gases sealed in a 200 ml aluminum foil gas sampling bag were extracted with a 20 ml injector and transferred via the injection hole into the test chamber. Meanwhile, the sensor output signals processed by the signal processing circuit were collected at a 10 Hz sampling frequency and stored on hard disk. The sampling continued for 5 min.
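The W_T versus V_H calibration step can be sketched as a polynomial fit. The data below are hypothetical stand-ins for the paper's calibration readings, and the quadratic degree of the polynomial is an assumption; only the procedure (fit, then check r²) follows the text:

```python
import numpy as np

# Hypothetical (V_H, W_T) calibration pairs: heater voltage in volts,
# casing temperature in deg C. The paper's actual readings differ.
v_h = np.array([1.25, 1.5, 1.75, 2.0, 2.25, 2.5, 2.75, 3.0, 3.25, 3.5])
w_t = np.array([34.0, 48.0, 65.0, 85.0, 108.0, 134.0, 163.0, 195.0,
                230.0, 268.0])

# Fit W_T as a polynomial in V_H (degree 2 assumed here) and check the
# goodness of fit r^2, as done for equation (1).
coeffs = np.polyfit(v_h, w_t, deg=2)
pred = np.polyval(coeffs, v_h)
ss_res = float(np.sum((w_t - pred) ** 2))
ss_tot = float(np.sum((w_t - w_t.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot
```

With a satisfactory r² (> 0.99 in the paper), the fitted polynomial can be inverted or tabulated to pick the V_H value that yields a desired working temperature.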
It was necessary to verify the correctness (i.e. the selectivity and sensitivity) of the selected sensors before making large-scale measurements. Therefore, three typical soil gas samples, with the maximum, moderate and minimum organic matter contents, were selected from all the soil gas samples for verification. The results showed that the sensors responded differently, and to clearly different degrees, to the different soil gas samples, verifying that the selected sensors can realize SOM detection via the artificial olfactory method. After that, all the gas samples were measured. More information about the verification test and data collection can be found in our previous study [49].

Dataset
The dataset (comprising a training set and a validation set) was taken from our previous study [49]. A total of 102 soil samples were collected in Jilin Province. The SOM concentration of each sample was measured by the potassium dichromate method and regarded as the observed value, and the olfactory response curves of all samples were determined using the AOS device. The SOM concentrations are listed in table 2. The AOS measurements formed a 102 × 40 olfactory feature space of 102 samples and 40 features, the 40 features being V_max, MDCV, RAV and V_t on each of the ten curves. A boxplot of the feature data is illustrated in figure 4, in which S_i (i = 1, 2, …, 10) denotes the ten sensors. As can be seen from the figure, there are some obvious outliers in the feature data, indicating the existence of abnormal samples in the dataset. Therefore, it was necessary to remove the abnormal samples before establishing the prediction models.

Abnormal sample elimination
Abnormal samples can be produced by many factors, such as unstable instrument status or imperfect operation. MCCV is based on the assumption that the effect of an outlier on the model differs depending on whether it is selected into the calibration set or the prediction set [50]. Because outliers are unstable, the models built on the remaining samples do not apply to them [54], so the outliers can be regarded as abnormal samples. In this study, MCCV was used to identify the outliers in the SOFS. Firstly, 80% of the samples in the training set were randomly selected to establish a PLSR model, and the remaining 20% were used for prediction. Secondly, this process was repeated many times, so that many PLSR models were built. Thirdly, the models were sorted in ascending order of the predicted residual sum of squares (PRESS). Finally, the abnormal samples were identified according to the accumulative probability (f_ac). PRESS was defined as

PRESS = Σ_{i=1}^{k} (ŷ_i − y_i)²,

where ŷ_i and y_i are the predicted and observed values of the ith prediction sample respectively, and k is the number of prediction samples.
The accumulative probability of sample m over the first n ranked models was computed as

f_ac(m, n) = (1/n) Σ_{j=1}^{n} f_mj,

where m is the ordinal number of a sample and n is the index of the ranked PLSR models. N is the number of training set samples, and f_mj indicates whether sample m was included in the calibration set of the jth ranked model: if yes, f_mj = 1; otherwise, f_mj = 0. By this definition, f_ac represents the probability of each sample appearing in the calibration sets of the models ranked from best to worst by PRESS. As n increased, the f_ac of normal samples remained around 80% (the calibration sampling rate), whereas the f_ac of abnormal samples deviated from that of the normal samples.

Feature dimension reduction
GA is an evolutionary algorithm that simulates natural selection. Firstly, a population was randomly generated and after crossing, mutation and 'survival of the fittest' selection, the suitable individuals remained in the next generation until certain termination conditions were met [55]. To remove redundant information from the SOFS, we used GA-BP for dimension reduction. The concrete procedures are shown in figure 5.
(a) An initial population was generated, in which each individual encoded a candidate subset of the features as a binary string. (b) For each individual, a BP network was trained on the selected features, and its fitness f(x) was computed as the root mean square error of prediction on the validation set:

f(x) = √((1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²),

where n is the number of samples in the validation set, and ŷ_i and y_i are the predicted and observed values of the ith validation sample. (c) It was judged whether the relevant parameter (e.g. the value of f(x) or the number of iterations) met the output condition. If so, the optimal feature vector was exported and the operation stopped; otherwise, the selection, crossing and mutation operations of the GA were conducted, and steps (b) and (c) were repeated until the output condition was met.
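A minimal sketch of GA-based feature selection in the spirit of GA-BP follows. To keep it short, a linear least-squares model stands in for the BP network when evaluating fitness (an assumption of this sketch), and the data are synthetic; the population size, mutation rate and truncation selection are likewise illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data: 40 olfactory-style features, of which only
# the first 6 carry information about the target.
X = rng.normal(size=(66, 40))
y = X[:, :6].sum(axis=1) + 0.05 * rng.normal(size=66)
X_tr, y_tr, X_va, y_va = X[:50], y[:50], X[50:], y[50:]

def fitness(mask):
    """Validation RMSE for a feature subset. A linear least-squares
    model stands in for the BP network of GA-BP in this sketch."""
    if not mask.any():
        return np.inf
    A = np.c_[X_tr[:, mask], np.ones(len(X_tr))]
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    pred = np.c_[X_va[:, mask], np.ones(len(X_va))] @ coef
    return float(np.sqrt(np.mean((pred - y_va) ** 2)))

pop_size, n_gen, n_feat = 30, 60, X.shape[1]
pop = rng.random((pop_size, n_feat)) < 0.5        # random binary strings
best_hist = []

for gen in range(n_gen):
    scores = np.array([fitness(ind) for ind in pop])
    order = np.argsort(scores)                    # lower RMSE is fitter
    best_hist.append(scores[order[0]])
    parents = pop[order[:pop_size // 2]]          # truncation selection
    children = []
    while len(children) < pop_size - len(parents):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, n_feat)             # one-point crossover
        child = np.r_[a[:cut], b[cut:]]
        children.append(np.logical_xor(child, rng.random(n_feat) < 0.02))
    pop = np.vstack([parents, children])          # elitist: parents kept

best = pop[np.argmin([fitness(ind) for ind in pop])]
selected = np.flatnonzero(best)                   # indices of kept features
```

Because the fittest parents are carried into each new generation, the best fitness is non-increasing over generations, mirroring the converging fitness curve in figure 9.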

Single prediction models
As for the prediction of soil properties, commonly used regression algorithms include the BPNN, SVR, PLSR and ELM [38,39,56]. A BPNN, SVR and PLSR were also used in our previous research [49]; for comparison, these three algorithms were again adopted as prediction models in this study. The BPNN is a multilayer feed-forward neural network and the most widely used type of neural network. Its topology consists of an input layer, one or several hidden layers, and an output layer. The Kolmogorov theorem proves that a three-layer network containing one hidden layer can approximate any nonlinear function; thus, the number of hidden layers was set to 1 in this study. However, the BPNN is highly sensitive to the number of neurons in the hidden layer. In our work, this number was optimized using the empirical formula

h = √(n + p) + α,

where h, n and p are the numbers of hidden-layer neurons, input nodes and output nodes respectively, and α is a positive integer from 1 to 10. In the BPNN model, the activation function of the hidden-layer neurons was a sigmoid-type transfer function (tansig), and that of the output layer was the linear transfer function purelin. The number of training iterations was set to 1000, the learning rate to 0.01, and the target error to 0.001.
SVR, whose structure is highly similar to that of ANNs, can learn from experimental data [57] and is one of the most important predictive statistical models. The LIBSVM toolbox offers two types of regression method, ε-SVR and ν-SVR [58]. Here, ε-SVR was used with the radial basis function (RBF) as the kernel function. The penalty factor c (c > 0) and the kernel parameter g are the two major factors influencing the performance of SVR, so the SVR model was optimized: the parameter combination (c, g) was determined through the mesh search method and five-fold cross-validation, as in our previous studies.
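As a sketch, the empirical sizing rule h = √(n + p) + α (a common rule of thumb, with α an integer from 1 to 10; the exact rule used in the paper is assumed here) can simply be enumerated to produce the candidate hidden-layer sizes that would then be compared on the validation set:

```python
import math

def candidate_hidden_sizes(n_inputs, n_outputs, alpha_range=range(1, 11)):
    """Enumerate hidden-layer sizes from the empirical rule
    h = sqrt(n + p) + alpha (alpha an integer from 1 to 10),
    rounding the square root to the nearest integer."""
    base = round(math.sqrt(n_inputs + n_outputs))
    return [base + a for a in alpha_range]

# 18 optimized olfactory features in, 1 SOM concentration out.
sizes = candidate_hidden_sizes(18, 1)
```

For the 18-feature input and single SOM output, this yields candidates from 5 to 14 neurons; the optimum of eight reported later lies in this range.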
PLSR is very effective in predicting a group of dependent variables from a number of independent variables [59]. It is a multivariate statistical method that integrates PCA and multivariable linear regression. When the variables are highly linearly correlated, PLSR can produce very effective predictions. The number of principal component factors (PCFs) is the major cause of over-fitting or under-fitting in PLSR [39]. Here, leave-one-out cross-validation was used to determine the number of PCFs in the PLSR model.

Hybrid prediction models
SVM-GMDH.
SVM is an efficient way to solve nonlinear classification problems [60]. To make full use of this advantage of SVM, we put forward a prediction model of pre-classification followed by regression. Firstly, the training set was divided into several sub-training sets according to the SOM classification standard, and an SVM clustering model was established and used to cluster the validation set. Finally, several regression models were built on the sub-training sets to predict the values of the clustered validation samples. Because each sub-training set is small, establishing a reliable prediction model on it can be a challenge. For this problem, the group method of data handling (GMDH) provides a powerful tool [61]. GMDH is a learning machine based on heuristic self-organization, proposed by Ivakhnenko in 1976 [62], and has been widely used in energy conservation [63], marketing [64], fault recognition [65] and engineering geology [66]. In this study, SVM and GMDH were combined as SVM-GMDH for the first time for SOM detection.
Let the training set be T = {(X_j, y_j) | j = 1, 2, …, m} and the validation set be V = {(X_i, y_i) | i = 1, 2, …, n}, where X_j and X_i are the feature vectors of the training set and the validation set respectively, y_j and y_i are the observed SOM concentrations of the training set and the validation set respectively, and m and n are the sample numbers of the training set and the validation set respectively. The algorithm procedures of SVM-GMDH are as follows. (a) According to the SOM classification standard, T was divided into K classes C_k (k = 1, 2, …, K), such that

T = C_1 ∪ C_2 ∪ … ∪ C_K and C_k ∩ C_p = ∅ for k ≠ p.

(b) The training set T was used to construct an SVM clustering model. (c) Each C_k was used as a training set to build a GMDH model, giving K GMDH models in total. (d) Each sample (X_i, y_i) in V was assigned to a class C_k by the trained SVM clustering model, and the predicted value of y_i was obtained by applying the kth GMDH model to X_i.
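The classify-then-regress procedure of steps (a)-(d) can be sketched as follows. GMDH has no standard scikit-learn estimator, so a quadratic polynomial regression stands in for the per-class GMDH models (GMDH builds self-organising polynomial networks, so this is a simplification, not the paper's method); the data, class boundaries and SVM settings are all illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)

# Synthetic data: the target drives both the class label and regression.
X = rng.normal(size=(120, 6))
y = 20.0 + 5.0 * X[:, 0] + 2.0 * X[:, 1] ** 2 + 0.2 * rng.normal(size=120)
labels = np.digitize(y, bins=[18.0, 25.0])     # three classes, C1..C3

X_tr, y_tr, c_tr = X[:90], y[:90], labels[:90]
X_va, y_va = X[90:], y[90:]

# Step (b): an SVM classifier routes each validation sample to a class.
clf = SVC(kernel="rbf", C=10.0).fit(X_tr, c_tr)

# Step (c): one regression model per class. A quadratic polynomial
# regression is a stand-in for GMDH here.
models = {k: make_pipeline(PolynomialFeatures(2), LinearRegression())
               .fit(X_tr[c_tr == k], y_tr[c_tr == k])
          for k in np.unique(c_tr)}

# Step (d): predict each validation sample with its class's model.
pred = np.array([models[k].predict(x.reshape(1, -1))[0]
                 for x, k in zip(X_va, clf.predict(X_va))])
rmse = float(np.sqrt(np.mean((pred - y_va) ** 2)))
```

This structure also makes the paper's later finding plausible: any misclassification in step (d) routes a sample to the wrong regression model, so the overall accuracy is capped by the SVM's classification accuracy.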

PLSR-BPNN.
Due to the complex causality between soil gases and SOM, the SOM concentration may be related either linearly or nonlinearly (or both) to the soil olfaction feature space. Therefore, a hybrid prediction model, PLSR-BPNN, was proposed in this study. PLSR-BPNN is an organic combination of PLSR and a BPNN, which effectively utilizes the linear modeling capability of PLSR and the nonlinear mapping capability of the BPNN to improve the predictive performance of the AOS. The combination of PLSR and BPNN is illustrated in figure 6. The modeling and forecasting process of PLSR-BPNN is as follows: (a) a PLSR model was built on the training set and used to produce the prediction y_p for each validation sample; (b) a BPNN model was built on the same training set and used to produce the prediction y_b; (c) the final prediction ŷ_i was output by combining y_p and y_b as a weighted mean, where i = 1, 2, …, n.
The combination was expressed as

ŷ_i = k_1 y_p + k_2 y_b  (i = 1, 2, …, n),

where k_1 and k_2 are the weighting coefficients of PLSR and BPNN, respectively.
To determine k_1 and k_2, the concept of model validity [67] was introduced, expressed by S. The per-sample prediction precision A_i and the validity S can be defined as

A_i = 1 − |y_i − ŷ_i| / y_i  (i = 1, 2, …, n),
S = E(1 − σ),

where A_i (i = 1, 2, …, n) represents the prediction precision sequence of the samples, y_i is the observed value of the ith sample in the validation set, ŷ_i is the predicted value of y_i, and E and σ represent the mean and standard deviation of the A_i sequence respectively. These definitions imply that a larger S means a higher prediction precision. We denote the validity of the PLSR model and the BPNN model as S_p and S_b respectively. In this study, S_p and S_b were normalized to give k_1 and k_2, namely

k_1 = S_p / (S_p + S_b),  k_2 = S_b / (S_p + S_b).
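The validity-weighted combination can be sketched as follows. The exact validity index follows reference [67], which is not reproduced in the text; the form used below (S = E(A) × (1 − σ(A)) with per-sample precision A_i = 1 − |y_i − ŷ_i|/y_i) is an assumption of this sketch, as are the numeric values:

```python
import numpy as np

def model_validity(y_obs, y_pred):
    """Model validity S = E(A) * (1 - sigma(A)), where A_i is the
    per-sample precision 1 - |y_i - yhat_i| / y_i. This particular
    form of the validity index is an assumption of this sketch."""
    A = 1.0 - np.abs(y_obs - y_pred) / y_obs
    return float(A.mean() * (1.0 - A.std()))

# Hypothetical observed SOM values and the two branch predictions.
y   = np.array([20.5, 24.1, 31.0, 18.7, 27.9])
y_p = np.array([21.0, 23.5, 30.2, 19.5, 27.0])   # PLSR branch
y_b = np.array([20.1, 24.8, 32.1, 18.2, 28.3])   # BPNN branch

s_p, s_b = model_validity(y, y_p), model_validity(y, y_b)
k1 = s_p / (s_p + s_b)                            # normalised weights
k2 = s_b / (s_p + s_b)
y_hat = k1 * y_p + k2 * y_b                       # hybrid prediction
```

Because k_1 + k_2 = 1 with both weights positive, the hybrid prediction is always a convex combination lying between the two branch predictions, with the more valid branch weighted more heavily.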

Evaluation indices
The frequently used performance evaluation indices of soil property prediction models include the coefficient of determination (R²), ratio of performance to deviation (RPD), root mean square error (RMSE) and relative prediction error (RPE). As in previous studies, we used R², RMSE and RPD to assess the prediction models. In addition, the RPE was adopted as an evaluation index for parameter optimization steps such as feature optimization and abnormal sample elimination. The equations for R², RMSE and RPD can be found in [49].
An R² closer to 1 means higher model fitness. RPD can compensate for the limitations of R² in assessing nonlinear models, and a larger RPD indicates higher forecasting performance. A smaller RMSE implies higher prediction precision, and a smaller RPE suggests that the predicted value deviates less from the observed value. The RPE of each sample is calculated as

RPE_i = |ŷ_i − y_i| / y_i  (i = 1, 2, …, n),

where n is the sample number of the training set or validation set, y_i is the observed value of the ith sample, and ŷ_i is the predicted value of y_i.
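The evaluation indices can be sketched together as follows; the per-sample form RPE_i = |ŷ_i − y_i|/y_i and the definition RPD = SD(observed)/RMSE are the assumptions here, and the numeric values are hypothetical:

```python
import numpy as np

def evaluation_indices(y_obs, y_pred):
    """R^2, RMSE, RPD and the per-sample relative prediction error
    RPE_i = |yhat_i - y_i| / y_i (this per-sample form is assumed)."""
    resid = y_pred - y_obs
    rmse = float(np.sqrt(np.mean(resid ** 2)))
    r2 = 1.0 - float(np.sum(resid ** 2) / np.sum((y_obs - y_obs.mean()) ** 2))
    rpd = float(np.std(y_obs, ddof=1)) / rmse  # ratio of performance to deviation
    rpe = np.abs(resid) / y_obs
    return r2, rmse, rpd, rpe

# Hypothetical observed and predicted SOM values (g/kg).
y    = np.array([15.2, 22.8, 30.5, 41.0, 27.3, 19.6])
yhat = np.array([16.0, 21.9, 31.2, 39.5, 27.9, 20.1])
r2, rmse, rpd, rpe = evaluation_indices(y, yhat)
```

Note that R² and RPD grow with prediction quality while RMSE and RPE shrink, which is why the paper reports improvement as rises in R²/RPD alongside drops in RMSE/RPE.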

Descriptive statistics for all samples
The observed SOM values of the 102 soil samples were analyzed descriptively in the statistical software SPSS 13.0. Normality was tested via the Kolmogorov-Smirnov (K-S) method. The SOM content varied within 12.19-48.79 g kg⁻¹, with a mean of 23.131 g kg⁻¹. The coefficient of variation (CV) was 0.319, indicating moderate spatial variability of the SOM content. The K-S test value was 0.295 (P > 0.05), so the null hypothesis of normality could not be rejected (figure 7).

Results of abnormal sample elimination
The SOFS consisted of a training set (71 samples) and a validation set (31 samples). To eliminate the effects of abnormal samples, we used MCCV to identify the abnormal samples in the training set. In the MCCV calculation, 57 samples (80% of the 71 training samples) were randomly selected to build each of 1000 PLSR models, and the remaining 14 samples (20%) were used for prediction. Figure 8 shows the variation curves of f_ac against the sequence number of the ranked models, and the inset shows the f_ac of the first 119 models.
Clearly, as the sequence number of the ranked models increased, the f_ac values of most samples in the training set approached 80%, but the f_ac curves of samples 1, 18, 36, 38 and 70 deviated markedly from the other curves: the f_ac values of these five samples remained at 100% over a large range of model sequence numbers. Therefore, samples 1, 18, 36, 38 and 70 were identified as abnormal samples and needed to be removed.

Results of feature optimization
After removing the abnormal samples from the training set, a new training set was obtained. The GA-BP method was used to reduce the dimension of the new training set for optimization. The output condition of the GA-BP was set as 100 iterations. Figure 9 illustrates the fitness function evolution curve.
Obviously, after the number of iterations exceeded 10, the optimal fitness value no longer changed, indicating that the optimal effect had been achieved. In this case, an optimal group of features was screened out: 1, 2, 6, 8, 10, 13, 14, 16, 17, 18, 21, 24, 25, 26, 30, 36, 37 and 40, meaning that the feature space was reduced from 40 dimensions to 18. Thus, after the elimination of abnormal samples and feature dimensionality reduction, the training set was optimized from a 71 × 40 matrix to a new 66 × 18 matrix, and the validation set was transformed into a new 31 × 18 matrix.

Prediction results by single models
To test the effect of the optimized SOFS on modeling performance, the BPNN, SVR and PLSR were used to establish three single prediction models on the new training set (66 samples × 18 features), and the new validation set (31 samples × 18 features) was used to validate these models. In the BPNN modeling, the optimized number of neurons in the hidden layer was eight. In the SVR modeling, the optimal parameters were c = 1 048 576 and g = 1.5492 × 10⁻⁶. In the PLSR modeling, the optimal number of PCFs was four. The prediction results of the models are shown in figure 10.
The R² values of the three single models were 0.941, 0.918 and 0.913, respectively; the RMSEs were 2.036, 2.224 and 2.114; and the RPDs were 3.733, 3.418 and 3.377 (figure 10). These results suggest that all three models have high predictive ability. In terms of R², RMSE or RPD, the SVR model and the PLSR model do not differ significantly, but the R² and RPD of the BPNN model are both higher, and its RMSE lower, than those of the other two models. Thus, among the three, the BPNN model performs best and has the highest prediction accuracy.
SVM-GMDH.
However, since the organic matter content of all soil samples was greater than 10 g kg⁻¹, there were no level 5 or level 6 samples in table 2. In addition, once the abnormal samples 1 (20.51 g kg⁻¹), 18 (29.87 g kg⁻¹), 36 (41.10 g kg⁻¹), 38 (20.17 g kg⁻¹) and 70 (38.86 g kg⁻¹) were removed, only one level-1 soil sample remained in each of the training set and the validation set, which was not conducive to establishing the classification and regression models. Therefore, these two remaining level-1 samples (43.85 g kg⁻¹ and 48.79 g kg⁻¹) were grouped into level 2. In table 2, C1, C2 and C3 denote the three resulting categories: C1 (level 2), C2 (level 3) and C3 (level 4).
Based on the results in table 3, an SVM-GMDH prediction model was built, and the validation set was classified and predicted. The prediction results and performance evaluation indices (R², RMSE, RPD and RPE) are shown in figure 11. The results were R² = 0.79, RMSE = 3.52 and RPD = 2.16, indicating that the predicting ability of the SVM-GMDH model was not high (figure 11(a)). This was mainly attributed to the low classification accuracy (CA) of the SVM, which was 83.9% on the validation set (figure 11(a)). When the CA was manually set to 100%, the predicting indices of SVM-GMDH improved, with R² = 0.845 (figure 11(b)); these indices were all higher than those obtained with SVM-based classification. Furthermore, the RPE was smaller when CA = 100% (figure 11(c)). Admittedly, the low CA decreased the forecasting performance of SVM-GMDH.

PLSR-BPNN.
To evaluate the prediction performance of the hybrid PLSR-BPNN model, the optimized SOFS was used for modeling and prediction. In the PLSR-BPNN model, the parameters of the two parallel branch algorithms, PLSR and BPNN, were set the same as in the single PLSR and BPNN models. Figure 12 shows the results predicted by PLSR-BPNN. Clearly, PLSR-BPNN showed a favorable predicting trend and accuracy, with R² = 0.952, RMSE = 1.771 and RPD = 4.291 (figure 12). As per equation (11), the weighting coefficients of PLSR and BPNN were calculated to be k1 = 0.52 and k2 = 0.48, respectively.
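The parallel combination itself is simply a weighted sum of the two branch outputs. The sketch below applies the reported coefficients; since equation (11) is not reproduced in this section, the inverse-RMSE normalization shown in `inverse_rmse_weights` is only one common assumption for how such coefficients might be derived, not the paper's formula.

```python
import numpy as np

def combine_predictions(y_plsr, y_bpnn, k1=0.52, k2=0.48):
    """Weighted parallel combination of the two branch predictions,
    using the coefficients reported for equation (11)."""
    return k1 * np.asarray(y_plsr, dtype=float) + k2 * np.asarray(y_bpnn, dtype=float)

def inverse_rmse_weights(rmse_plsr, rmse_bpnn):
    """One common weighting scheme (an assumption here, since equation (11)
    is not reproduced above): weights proportional to inverse RMSE."""
    w = np.array([1.0 / rmse_plsr, 1.0 / rmse_bpnn])
    return w / w.sum()
```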

Discussion on optimization effect of SOFS
In this paper, we refer to the calibration model based on the unoptimized SOFS as the preliminary prediction model. In our previous paper [41], we reported the preliminary prediction results of the three single models (BPNN, SVR and PLSR), as listed in table 4. As can be seen from table 4, the R² values of the BPNN, SVR and PLSR are all greater than 0.8, and all RPDs are greater than 2.0. Vohland et al [69] reported a detailed method for evaluating goodness of prediction: R² and RPD values greater than 0.90 and 3.0, respectively, are considered 'excellent', whereas values of 0.82-0.90 (R²) and 2.5-3.0 (RPD) are defined as 'good'. Values between 0.66 and 0.81 (R²) and between 2.0 and 2.5 (RPD) indicate 'approximate quantitative prediction'. When the RPD value is 1.5-2.0 and the R² value is 0.50-0.65, the model can only be used to distinguish high and low values. 'Unsuccessful' predictions have RPD and R² values lower than 1.5 and 0.50, respectively. Therefore, the preliminary BPNN and SVR models have 'good' predictive performance, while the preliminary PLSR model has only approximate quantitative ability.
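The grading scheme of Vohland et al can be expressed as a small lookup function, which makes the comparisons in this section easy to reproduce. The handling of band boundaries (inclusive vs. exclusive) is an assumption here, since the cited bands abut.

```python
def prediction_grade(r2, rpd):
    """Map (R2, RPD) to the quality grades of Vohland et al [69].
    Both indices must reach a band for the grade to apply; the treatment
    of exact boundary values is an assumption."""
    if r2 > 0.90 and rpd > 3.0:
        return "excellent"
    if r2 >= 0.82 and rpd >= 2.5:
        return "good"
    if r2 >= 0.66 and rpd >= 2.0:
        return "approximate quantitative prediction"
    if r2 >= 0.50 and rpd >= 1.5:
        return "distinguish high/low values only"
    return "unsuccessful"
```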
The preliminary prediction results of the hybrid models are shown in figure 13. The results of SVM-GMDH were R² = 0.69, RMSE = 19.89 and RPD = 0.382 (figure 13(a)), an unsuccessful prediction (RPD < 1.5). On the one hand, as mentioned above, SVM-GMDH is affected by the accuracy of the SVM classification; on the other hand, it may be affected by abnormal samples and redundant features. The preliminary PLSR-BPNN model achieves a 'good' prediction, with R² = 0.848, RMSE = 3.043 and RPD = 2.497 (figure 13(b)).
The optimized SOFS, comprising the new training set (66 samples × 18 features) and the new validation set (31 samples × 18 features), was then used for modeling and prediction. We refer to the calibration model based on the optimized SOFS as the optimized model. The prediction results of the five optimized models (BPNN, PLSR, SVR, SVM-GMDH and PLSR-BPNN) are shown in figures 10(a)-(c), 11(a) and 12, respectively. In comparison with the preliminary prediction models, the R² values of the optimized models rise by 6.93%, 2.57%, 13.00%, 14.53% and 12.26%, respectively; the RMSEs decrease by 24.00%, 12.13%, 37.70%, 82.29% and 41.80%, respectively; and the RPDs increase by 31.58%, 13.82%, 50.76%, 464.92% and 95.29%, respectively. These results suggest that the predictive capabilities of all five models were greatly improved after the optimization of the SOFS, indicating that MCCV can effectively identify abnormal samples and GA-BP can effectively eliminate redundant features.
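The MCCV abnormal-sample screening step can be sketched as follows. This is a simplified illustration rather than the paper's procedure: a plain least-squares model stands in for the actual calibration model, and the threshold for flagging a sample as abnormal is left to the user.

```python
import numpy as np

def mccv_outlier_scores(X, y, n_iter=200, test_frac=0.3, seed=0):
    """Monte Carlo cross-validation outlier screening (simplified sketch):
    repeatedly split the data at random, fit a least-squares model on the
    training part, and accumulate each sample's absolute prediction error
    over the iterations in which it fell into the test part. Samples with
    an unusually high mean error are candidate abnormal samples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    err_sum = np.zeros(n)
    err_cnt = np.zeros(n)
    Xb = np.column_stack([X, np.ones(n)])  # add an intercept column
    n_test = max(1, int(test_frac * n))
    for _ in range(n_iter):
        idx = rng.permutation(n)
        test, train = idx[:n_test], idx[n_test:]
        beta = np.linalg.lstsq(Xb[train], y[train], rcond=None)[0]
        err = np.abs(Xb[test] @ beta - y[test])
        err_sum[test] += err
        err_cnt[test] += 1
    return err_sum / np.maximum(err_cnt, 1)
```

A sample whose score stands far above the rest (e.g. several times the median score) would be removed before recalibration, mirroring the five abnormal samples eliminated above.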

Comparison of optimized models
To find the optimal prediction model for SOM detection, we visualized the evaluation indices of the five optimized models (figure 14). Except for the optimized SVM-GMDH model (R² = 0.79, RPD = 2.16), which can only be used for approximate quantitative prediction, the other four optimized models (R² > 0.90, RPD > 3.0) achieve 'excellent' prediction.
According to figure 14(a), the prediction performance of the optimized models ranks as PLSR-BPNN > BPNN > SVR > PLSR > SVM-GMDH. Moreover, the RPE curves reflect the fluctuation of the prediction errors of the five models (figure 14(b)). Clearly, the RPE of PLSR-BPNN fluctuated within the smallest range (0-0.185) and thus showed the strongest predictive accuracy. Therefore, we believe that the optimized PLSR-BPNN model performs best in olfactory detection of SOM. This is because both linear and nonlinear relationships exist between the olfactory feature variables and SOM, and PLSR-BPNN can effectively handle both linear and nonlinear mappings through its weighting coefficients.
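For completeness, a per-sample RPE curve like the one in figure 14(b) can be reproduced from the predictions as below. The definition RPE_i = |ŷ_i − y_i| / y_i is an assumption on our part, chosen as the usual relative-error form consistent with the reported 0-0.185 range.

```python
import numpy as np

def relative_prediction_errors(y_true, y_pred):
    """Per-sample relative prediction error, RPE_i = |yhat_i - y_i| / y_i
    (assumed definition; positive SOM contents keep the denominator safe)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.abs(y_pred - y_true) / y_true
```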

Conclusions
The main intellectual merits of this work are the novel approach based on the AOS and PLSR-BPNN, and its effectiveness in solving the linear and nonlinear mapping problems via weighting coefficients when detecting SOM. The test results demonstrate that (a) MCCV and GA-BP are effective measures for optimizing the SOFS, and (b) the PLSR-BPNN model yields higher predictive performance and lower relative prediction errors than the BPNN, SVR, PLSR and SVM-GMDH models. Based on the tradeoff among the evaluation indices, MCCV + GA-BP + PLSR-BPNN (R² = 0.952, RMSE = 1.771 and RPD = 4.291) is selected as the most suitable method for predicting SOM in this study.
Compared with the near-infrared spectroscopy methods of reference [70] (test result: R² = 0.91) and reference [71] (test result: R² = 0.69), the method of the current study is superior. However, our samples need to be stored sealed for a period of time, which is time-consuming and not conducive to real-time detection. Therefore, our next work will focus on the factors influencing artificial olfactory detection of SOM and on rapid processing of soil samples. In addition, the number of soil samples should be increased to make the determination method more stable and robust.