Soil Classification System from Cone Penetration Test Data Applying Distance-Based Machine Learning Algorithms

Most work from the literature dedicated to soil classification systems from cone penetration test (CPT) data is based on simple two-dimensional charts. One alternative approach is using machine learning (ML) to produce new soil classification systems or to reproduce existing ones. The available studies within this research field can be considered limited, since most of them do not include more than two inputs within their analysis and are applicable only to specific regions. In this context, the aim of this work is to use distance-based ML techniques to replicate two chart-based methods from the literature. Combinations of up to five input features are tested, with the objective of discussing geotechnical aspects of soil classification systems. Results are compared using the Friedman statistical test with the Nemenyi post-hoc statistics and the Wilcoxon signed-rank statistical test. The dataset used can be considered diversified because it contains 111 CPT soundings from several countries. Results show that the ML techniques used maintain reasonable accuracy when inputs are substituted and when incomplete data are used, which can lead to cost reduction in real engineering projects. It is important to note that these observations would not be possible using the replicated soil classification systems alone.


Introduction
Most available systems for soil classification from CPT data use two-dimensional charts divided into regions which represent different soil types. Initially, these charts were based on soil type (grain size and plasticity) and used raw CPT data, like cone resistance and lateral friction (Begemann, 1965). Nonetheless, later studies produced better classification methods by focusing on soil behavior and by proposing normalizations for CPT data (Douglas, 1981; Robertson et al., 1986). Some popular classification methods make use of two charts instead of one, combining three normalized variables in pairs. In this context, some works propose normalizations to include the influence of depth and overburden (Robertson, 1990). Nevertheless, these methods are not accurate for offshore soils (Jefferies & Davies, 1991) due to the dilative behavior of highly overconsolidated clays commonly found in deep water soils (Robertson, 1991). This limitation is supported by experimental data (Jefferies & Davies, 1991; Ramsey, 2002). Thus, these methods fail to distinguish stiff or dense granular soils from overconsolidated clay (Schneider et al., 2008). Different normalizations and charts were proposed to address this problem (Schneider et al., 2008; Schneider et al., 2012). Nonetheless, Robertson (2016) states that soil classification systems that use charts may not be reliable for structured soils, meaning aged or cemented soils, like some offshore soils. He also recommends considering soils structured if the modified normalized small-strain rigidity index K*_G is above 330, although some geotechnical judgment is required.
Another possible approach for soil classification systems from CPT data is based on the use of statistics and ML techniques. Most authors interested in solving general geotechnical problems use artificial neural networks (ANN) to predict values of interest such as soil parameters (Goh, 1995; Goh, 1996; Schaap et al., 1998; Juang & Chen, 1999; Kumar et al., 2000; Juang et al., 2002; Juang et al., 2003; Hanna et al., 2007). Nevertheless, one can find works using support vector machines (SVM) (Goh & Goh, 2007), decision trees (DT) (Livingston et al., 2008) and random forests (RF) (Kohestani et al., 2015). For soil classification systems there are two main approaches: one is replicating existing soil classification systems and the other is proposing new ones. Most work in this research field is dedicated to the latter approach, using data clustering (Hegazy & Mayne, 2002; Facciorusso & Uzielli, 2004; Liao & Mayne, 2007; Das & Basudhar, 2009; Rogiers et al., 2017). Among the few works that investigate replicating existing soil classification systems such as Robertson charts (Arel, 2012), the only ML technique tested is usually ANN (Kurup & Griffin, 2006; Reale et al., 2018). Nonetheless, there is a study that compares different ML techniques when replicating existing systems for soil classification (Bhattacharya & Solomatine, 2006), although the dataset used is restricted to a few CPT soundings which are all taken from the same location. Other works related to classifying soil with ML are Bilski & Rabarijoely (2009), Rao et al. (2016) and Chandan & Thakur (2018).
In this work, two chart-based soil classification systems proposed by Robertson (1991) and Robertson (2016) are replicated using distance-based ML techniques. These techniques were chosen among other options for their simplicity and because this type of approach is scarce in the literature. The objective is to investigate and discuss geotechnical aspects of soil classification systems that cannot be disclosed by using the original Robertson methods. First, the stratigraphic profiles of 111 CPT soundings taken within several countries are obtained using a student version of CPeT-IT v2.0.2.5 software (Ioannides & Robertson, 2016), which employs Robertson charts in a soil classification system. Next, the so-called k-nearest neighbor (KNN) and distance-weighted nearest-neighbor (DWNN) ML techniques are used to replicate Robertson (1991) and Robertson (2016) charts. For each ML technique and each classification method, 33 input feature combinations are tested and all results are compared using the Friedman statistical test (Friedman, 1937) with the Nemenyi post-hoc statistics (Nemenyi, 1963) and the Wilcoxon statistical test (Wilcoxon, 1945). The proposed discussions produced several original contributions, like showing that: 1. Distance-based ML techniques are capable of reproducing Robertson soil classification systems with good accuracy; 2. Reasonable accuracy can be obtained without the normalizations proposed in the literature for the CPT data; 3. Including soil age as an input feature contributes to distinguishing between soil classes.

Soil Classification Systems
In this work, two soil classification systems available within a student version of CPeT-IT software are replicated using distance-based ML techniques. The objective of this section is to present the theory that underlies each of these methods.

Influenced by soil type (IST)
The first replicated method is based on the work of Robertson (1991). Although it was conceived as a behavior-oriented classification, the labels assigned to classes are inspired by conventional soil type classes, even showing some compatibility with real soil types (Kurup & Griffin, 2006). For this reason, this method is here considered influenced by soil type, being referred to as IST throughout this text. It adopts nine possible soil types, two of which are said to be heavily overconsolidated or cemented. The IST classes are in Table 1.
The initial inputs used by CPeT-IT to classify soil with the IST method are raw CPT data, namely cone resistance q_c (MPa), lateral friction f_s (kPa), pore pressure measured behind the cone tip u_2 (kPa) and depth z (m). These values are used to obtain the input features originally considered by Robertson (1990), namely normalized cone resistance Q_t1 (Eq. 1), normalized friction ratio F_r (Eq. 2) and normalized excess pore pressure B_q (Eq. 3). The cone resistance normalization was later updated to Q_tn (Schneider et al., 2008), resulting in the charts presented in Figs. 1a and 1b. Besides the nine classes predicted within these charts, an additional class 0 is used for misclassified soils.
To obtain the normalized values, the raw cone resistance q_c is first replaced by the total cone resistance q_t, which accounts for the pore pressure acting on the cone during penetration. The next step is estimating the soil unit weight γ (kN/m^3) (Lunne et al., 2002; Mayne et al., 2010; Mayne, 2014), which is used to obtain the total overburden pressure σ_v0 (kPa) and the effective overburden pressure σ'_v0 (kPa). If the water table is not known, it can be estimated by fitting a straight line in the z × u_2 chart (Fig. 2) when a drained penetration is observed. The water table depth is then used to compute the equilibrium pore pressure u_0, which is used to determine the excess pore pressure u_2 - u_0.
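The stress quantities named in the paragraph above can be sketched in a few lines. This is only an illustration under simplifying assumptions that are not in the text: a single unit weight γ for the whole profile, hydrostatic pore pressure below the water table, and a water unit weight of 9.81 kN/m^3. The function names are hypothetical.

```python
GAMMA_W = 9.81  # assumed unit weight of water (kN/m^3)


def in_situ_stresses(z, gamma, z_w):
    """Return (sigma_v0, u_0, sigma_v0_eff) in kPa at depth z (m).

    Assumes one soil unit weight gamma (kN/m^3) over the whole profile
    and hydrostatic conditions below the water table depth z_w (m).
    """
    sigma_v0 = gamma * z                    # total overburden pressure
    u_0 = GAMMA_W * max(z - z_w, 0.0)       # equilibrium pore pressure
    return sigma_v0, u_0, sigma_v0 - u_0    # effective stress is the difference


def excess_pore_pressure(u_2, u_0):
    """Excess pore pressure u_2 - u_0 (kPa)."""
    return u_2 - u_0


sigma_v0, u_0, sigma_eff = in_situ_stresses(z=10.0, gamma=18.0, z_w=2.0)
print(sigma_v0, u_0, sigma_eff)
```

In a layered profile the product γ·z would be replaced by a sum over layers; the single-layer form is kept here only for brevity.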
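Given these estimates, the following normalizations are obtained:

$$Q_{t1} = \frac{q_t - \sigma_{v0}}{\sigma'_{v0}} \qquad (1)$$

$$F_r = \frac{f_s}{q_t - \sigma_{v0}} \times 100\% \qquad (2)$$

$$B_q = \frac{u_2 - u_0}{q_t - \sigma_{v0}} \qquad (3)$$

Nevertheless, works from the literature state that the exponent n of σ'_v0 (n = 1 in Eq. 1) should vary from 0.5 for sands to 1 for clays (Zhang et al., 2002). To calculate n, one can consider its correlation with the classification index I_c (Robertson, 2009):

$$n = 0.381\, I_c + 0.05\, \frac{\sigma'_{v0}}{p_a} - 0.15, \qquad n \le 1$$

The normalized cone resistance Q_tn is then given by:

$$Q_{tn} = \left(\frac{q_t - \sigma_{v0}}{p_a}\right) \left(\frac{p_a}{\sigma'_{v0}}\right)^{n}$$

where p_a = 0.1 MPa is a reference pressure.
The CPeT-IT software uses only the Q_tn × F_r chart to generate the soil classification system outputs. Soil is considered misclassified and is labeled with class 0 if the values obtained for Q_tn and F_r are not within the ranges presented in this chart.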
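The normalizations above can be sketched as follows, with all pressures converted to kPa. Because n depends on I_c and I_c depends on Q_tn, the sketch iterates until n stabilises. The I_c expression and the coefficients for n are the forms commonly quoted from Robertson (2009); they are assumptions of this example, as are the function names, the tolerance and the input values.

```python
import math

P_A = 100.0  # reference pressure p_a in kPa (0.1 MPa)


def robertson_1990(q_t, f_s, u_2, sigma_v0, sigma_eff, u_0):
    """Return Q_t1, F_r (%) and B_q; every pressure is in kPa."""
    q_net = q_t - sigma_v0
    return q_net / sigma_eff, 100.0 * f_s / q_net, (u_2 - u_0) / q_net


def q_tn_iterative(q_t, f_s, sigma_v0, sigma_eff, tol=1e-3, max_iter=50):
    """Return (Q_tn, n), iterating between n and I_c until n converges."""
    q_net = q_t - sigma_v0
    f_r = 100.0 * f_s / q_net
    n = 1.0  # starting guess: clay-like exponent
    for _ in range(max_iter):
        q_tn = (q_net / P_A) * (P_A / sigma_eff) ** n
        # Classification index, assumed form from Robertson (2009)
        i_c = math.sqrt((3.47 - math.log10(q_tn)) ** 2
                        + (math.log10(f_r) + 1.22) ** 2)
        # Coefficients as commonly quoted from Robertson (2009); n capped at 1
        n_new = min(0.381 * i_c + 0.05 * (sigma_eff / P_A) - 0.15, 1.0)
        converged = abs(n_new - n) < tol
        n = n_new
        if converged:
            break
    return (q_net / P_A) * (P_A / sigma_eff) ** n, n


# Illustrative values only: q_t = 5 MPa, f_s = 50 kPa, u_2 = 100 kPa
Q_t1, F_r, B_q = robertson_1990(5000.0, 50.0, 100.0, 180.0, 101.52, 78.48)
Q_tn, n = q_tn_iterative(5000.0, 50.0, 180.0, 101.52)
print(round(Q_t1, 2), round(F_r, 2), round(B_q, 4), round(Q_tn, 2), round(n, 2))
```

The sketch assumes q_t > σ_v0 and f_s > 0, since the logarithms are undefined otherwise; real soundings would need those cases filtered first.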

Focused on soil behavior only (FSB)
The system proposed by Robertson (2016) establishes a fully behavior-oriented soil classification, which is why it is here considered more focused on soil behavior and named FSB throughout this text. The FSB method includes seven classes (Table 2).
One can observe that the three main soil types are sand-like, clay-like and transitional. Each of these soil types is divided into contractive or dilative. A seventh class is reserved for contractive clays that have high sensitivity to disturbance, which can be related to the friction ratio using the expression S_t = 7.1/F_r (Robertson, 2009). If sensitivity is greater than 3, which corresponds approximately to F_r < 2%, then the clay is considered sensitive. The upper limit for the normalized cone resistance for sensitive clays is defined as 10 because they are soft.
Soils and Rocks, São Paulo, 42(2): 167-178, May-August, 2019
As with the IST system, q_c, f_s, u_2 and z are the initial inputs used by CPeT-IT to classify soil with the FSB system. Nonetheless, in this case the soil classification system is based on the charts shown in Figs. 3 and 4, which use the normalized cone resistance Q_tn, the normalized friction ratio F_r and the normalized excess pore pressure U_2 (Schneider et al., 2008) as inputs. The FSB method also includes a class 0 for misclassified soil, which is identified if Q_tn, F_r or U_2 are not within the ranges presented in the charts, and also when the classes given by the two charts differ.
The excess pore pressure normalization U_2 is obtained as:

$$U_2 = \frac{u_2 - u_0}{\sigma'_{v0}}$$

The curves that separate soil classes are inspired by Schneider et al. (2008) and Schneider et al. (2012). The Q_tn × F_r chart has nearly circular curves in the IST method, while in Robertson (2016) the curves have hyperbolic shapes as suggested by Schneider et al. (2012). The Q_tn × U_2 chart was taken from Schneider et al. (2008) with minor changes, containing the classes originally proposed there.
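A minimal sketch of the U_2 normalization and of the class 0 rule described for the FSB method follows. The chart boundaries themselves are not reproduced; each chart is assumed to have already returned either a class number or `None` when the point falls outside its ranges. The sketch reads the two misclassification conditions as alternatives, which is an interpretation of the text, and the function names are hypothetical.

```python
def normalized_u2(u_2, u_0, sigma_eff):
    """U_2 = (u_2 - u_0) / sigma'_v0, with all pressures in kPa."""
    return (u_2 - u_0) / sigma_eff


def fsb_final_class(class_qtn_fr, class_qtn_u2):
    """Combine the outputs of the two FSB charts.

    Each argument is the class given by one chart, or None when the
    point lies outside that chart's ranges (assumed convention).
    """
    if class_qtn_fr is None or class_qtn_u2 is None:
        return 0  # out of range in at least one chart
    if class_qtn_fr != class_qtn_u2:
        return 0  # the two charts disagree
    return class_qtn_fr


print(normalized_u2(100.0, 78.48, 101.52))
print(fsb_final_class(3, 3), fsb_final_class(3, 5), fsb_final_class(None, 3))
```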

Distance-Based Techniques
In this work, distance-based ML techniques are used to replicate the soil classification systems described in Section 2. These ML techniques have the advantage of using an approach similar to the chart-based methods to be replicated, representing soil examples as points in a space composed of the input features. They also rely on the hypothesis that, if two soil examples produce close points, they are similar. One way of measuring the distance between points is with the Euclidean metric. Considering a pair (x_i, x_j) of objects in a d-dimensional feature space, the distance between them is given by:

$$d(x_i, x_j) = \sqrt{\sum_{l=1}^{d} \left(x_i^{l} - x_j^{l}\right)^2}$$

The distance-based ML algorithms used in this work predict the class of an unknown example using a dataset of examples whose classes are known. The simplest strategy is detecting which known example produces the point that is the nearest neighbor of the point that represents the unknown example. The unknown example is then assigned the same class as its nearest neighbor (Cover & Hart, 1967). It is also possible to use an arbitrary number k of nearest neighbors and decide the class of the unknown example by voting, which corresponds to the k-nearest neighbors (KNN) technique. Tests can be performed to calibrate which k leads to the best predictive performance. In this work, only odd values of k are tested starting from one, increasing k until decreasing predictive performance is observed.
It is also possible to weight the votes, so that closer neighbors are more valued than farther ones. In this case, the technique is named distance-weighted nearest neighbor (DWNN) (Dudani, 1976). One specific way of defining these weights is by using Gaussian weighting, which is defined by the following expression (Hechenbichler & Schliep, 2004):

$$w = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{dist^2}{2}\right)$$

where dist is the distance value. In this work, the KNN and the DWNN with Gaussian weighting are used and compared to replicate the soil classification systems presented in Section 2.
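The KNN rule and the calibration of k described above can be sketched in plain Python. This is an illustrative implementation rather than the code used in the study: the helper names, the toy data and the use of plain accuracy during calibration are assumptions.

```python
import math
from collections import Counter


def euclidean(x_i, x_j):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training examples."""
    order = sorted(range(len(train_x)),
                   key=lambda i: euclidean(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def calibrate_k(train_x, train_y, val_x, val_y):
    """Try k = 1, 3, 5, ... and stop once validation accuracy drops."""
    best_k, best_acc, k = 1, -1.0, 1
    while k <= len(train_x):
        hits = sum(knn_predict(train_x, train_y, x, k) == y
                   for x, y in zip(val_x, val_y))
        acc = hits / len(val_y)
        if acc < best_acc:
            break
        if acc > best_acc:
            best_k, best_acc = k, acc
        k += 2
    return best_k


# Toy data: two features already scaled to [0, 1], hypothetical classes 3 and 6
train_x = [(0.1, 0.1), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
train_y = [3, 3, 6, 6, 6]
print(knn_predict(train_x, train_y, (0.15, 0.2), k=3))
print(calibrate_k(train_x, train_y, [(0.15, 0.2), (0.8, 0.8)], [3, 6]))
```

Sorting all distances is quadratic over a full dataset; for soundings sampled every few centimetres a spatial index would be the usual replacement, at the cost of a dependency.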
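A Gaussian-weighted vote can be sketched as below. The weight is assumed to be the standard Gaussian kernel applied to the raw Euclidean distance; any rescaling of distances before weighting, as some implementations perform, is left out. Function names and data are illustrative.

```python
import math
from collections import defaultdict


def gaussian_weight(dist):
    """Standard Gaussian kernel evaluated at the distance value."""
    return math.exp(-dist ** 2 / 2.0) / math.sqrt(2.0 * math.pi)


def dwnn_predict(train_x, train_y, query, k=3):
    """Sum Gaussian weights per class over the k nearest neighbors."""
    pairs = sorted(((math.dist(x, query), y)
                    for x, y in zip(train_x, train_y)),
                   key=lambda pair: pair[0])
    scores = defaultdict(float)
    for dist, label in pairs[:k]:
        scores[label] += gaussian_weight(dist)
    return max(scores, key=scores.get)


# One close example of class 1 against two distant examples of class 2
train_x = [(0.0, 0.0), (2.0, 0.0), (2.2, 0.0)]
train_y = [1, 2, 2]
print(dwnn_predict(train_x, train_y, (0.2, 0.0), k=3))
```

With k = 3 an unweighted vote on this toy data would return class 2 by two votes to one, whereas the weighted vote favours the single close neighbor of class 1, which is the behavior the weighting is meant to produce.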

Datasets description
The ML programs used in this work require a dataset of known examples to predict new examples. This dataset can be formatted as a table, where each row represents a different soil example. Input features are represented as columns and the last column contains the output feature. Table 3 presents a sample with 10 soil examples (rows), within a 0.45 m soil layer. In this sample, the inputs are raw CPT data and the output is the corresponding IST soil class, obtained using the CPeT-IT software. This program is also used to produce other input-output combinations for the ML techniques, as described in more detail in Section 4.3.
Thirty-eight of the CPT soundings used to compose the datasets were sent directly by Professor P.K. Robertson. They are the same ones used in Robertson (2016) to produce the FSB method, which is described in Section 2.2. Since detailed information about these soundings can be found in the original reference, only a brief description of them is presented in Table 4.
The first column of Table 4 gives a general description of the soil types within the CPT soundings. The second identifies where the soundings were taken and the third gives the geological age when the soil was deposited. The last column presents a discrete ordered variable named "class of geology" (CG), which assigns 1 to the most recent age and numbers the others sequentially up to the oldest. The information from these 38 soundings plus the variable CG composes what is here named the geological dataset. The objective of including CG as an input feature in some of the studies presented in Section 5 is to investigate whether information about geological age can help differentiate one soil class from another.
Another 73 CPT soundings were obtained from the website of Professor P.W. Mayne and are summarized in Table 5. Further detail about the soundings can be found on the website. All these soundings were taken within the United States of America, and more specific information about location is presented in the table. Information about geological age was not available for these soundings, so they are not included in studies that make use of the variable CG. These soundings, grouped with the ones sent by Robertson, compose what is here named the complete dataset, totaling 111 CPT soundings. All CPT data used in this dataset were taken at intervals of 2 to 5 cm, the pore pressure was measured behind the cone tip (u_2) and the raw cone tip resistance q_c was corrected to q_t using CPeT-IT.

Data preprocessing
In this work, all CPT data were used to classify soil using the CPeT-IT software, which was later replicated using the methods described in Section 3. The accuracy of the final results depends on the quality of the datasets used, which can be improved with data preprocessing.
The first problem is that distance-based ML techniques are sensitive to data scale. When the distance between points is calculated, the importance of input features that vary within large ranges tends to be emphasized, while the ones with low variation tend to be ignored. The solution adopted here is normalizing all input features to the interval [0, 1].
Another issue is that data taken within CPT soundings can contain noise, which is here defined as any variable deviating severely from its expected value. Noise can have several causes, like sensor errors, formatting problems and human mistakes. The main noise types are missing data and outliers, which are here defined as distorted or corrupted values. CPeT-IT is unable to classify most noisy examples, assigning class 0 in both IST and FSB methods or no class whatsoever. Since the ML techniques presented in Section 3 are here used to replicate CPeT-IT, these errors tend to be replicated as well. Although it is difficult to completely eliminate noise from the datasets, it is desirable to reduce it as much as possible in order to avoid classification errors. In this work, dataset cleaning was first performed manually, removing the noisy examples that could be easily identified. This procedure was then complemented by an automatic cleaning procedure that makes use of the box-plots of the input features, as illustrated in Fig. 5.
In the box-plot, the base of the rectangle represents the first quartile Q_1 and the top of the rectangle represents the third quartile Q_3. The whiskers above and below the rectangle represent the interval [Q_1 - 1.5 × IQ, Q_3 + 1.5 × IQ], where IQ = Q_3 - Q_1. Values outside this range (white circles) are identified as potential outliers. Preliminary tests have shown that removing all potential outliers reduces accuracy, which indicates that relevant information is being eliminated. To solve this problem, the Edited Nearest Neighbor technique (Wilson, 1972) is used in this work as a second criterion to decide whether each potential outlier will, in fact, be removed. This technique compares the potential outlier with its nearest neighbor and removes it only if the classes given by CPeT-IT do not match. This procedure is illustrated in Fig. 6 for two input features, where the white dot represents the potential outlier and the black dots represent other known examples from the dataset. The numbers close to each dot represent the class assigned by CPeT-IT. One can observe that, in this example, the classes of the potential outlier and its nearest neighbor are the same. This means that this potential outlier will be kept.
The next issue to be evaluated is whether the number of examples within each soil class is balanced, considering both IST and FSB methods. Severe imbalance can compromise the accuracy of distance-based ML techniques because they tend to focus on majority classes and ignore minority classes. The distribution of examples among classes can be checked using histograms, as presented in Figs. 7a and 7b for the complete dataset and Figs. 7c and 7d for the geological dataset.
One can observe that the classes are, in fact, imbalanced, which is expected for real CPT soundings. In this work, data imbalance is addressed by eliminating examples of majority classes and creating new artificial examples for minority classes. Preliminary results have shown that random elimination does not affect predictive performance, which can be explained by the fact that CPT data contain redundancies, since several data items are taken within each soil layer.
To create new artificial examples for minority classes, the SMOTE technique (Chawla et al., 2002) was used. For better distribution within the input feature space, it is here proposed to estimate each d-dimensional new artificial object from d + 1 original examples. This corresponds to the number of vertices of a d-dimensional simplex. The maximum between 1000 and two times the number of elements of the minority class was stipulated as the final number of elements of each class for the balanced dataset. Since class 0 of the IST method could not be well represented within the geological dataset even with the use of SMOTE, examples of this class were completely removed from the geological dataset.
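Scaling every input feature to [0, 1] can be sketched with a column-wise min-max transformation. The function name and the handling of constant columns (mapped to 0) are assumptions of this example.

```python
def min_max_scale(rows):
    """Scale each column of a list of rows to the interval [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Columns could stand for z (m), q_t (MPa) and f_s (kPa); values are made up
rows = [[1.0, 5.0, 100.0], [3.0, 10.0, 300.0], [5.0, 7.5, 200.0]]
print(min_max_scale(rows))
```

When cross-validation is used, the minima and maxima would normally be taken from the training partition only and reused for the validation and test partitions; whether the study did so is not stated.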
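The two-stage cleaning rule can be sketched as follows: a row is flagged when any feature falls outside the box-plot whiskers, and a flagged row is dropped only if its nearest neighbor carries a different class. The quartile convention (Python's default "exclusive" method) and all names are assumptions of this sketch.

```python
import math
import statistics


def whisker_limits(values):
    """Return (Q1 - 1.5*IQ, Q3 + 1.5*IQ) for one feature."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iq = q3 - q1
    return q1 - 1.5 * iq, q3 + 1.5 * iq


def clean(rows, labels):
    """Return the indices of the rows kept after cleaning."""
    limits = [whisker_limits(col) for col in zip(*rows)]
    keep = []
    for i, row in enumerate(rows):
        flagged = any(not lo <= v <= hi for v, (lo, hi) in zip(row, limits))
        if flagged:
            # Nearest neighbor among all other rows
            j = min((j for j in range(len(rows)) if j != i),
                    key=lambda j: math.dist(rows[j], row))
            if labels[j] != labels[i]:
                continue  # potential outlier whose neighbor disagrees: drop
        keep.append(i)
    return keep


# One feature, eleven examples; the value 30 lies outside the whiskers
rows = [(float(v),) for v in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 30)]
print(clean(rows, [1] * 10 + [2]))  # the last row has a different class
print(clean(rows, [1] * 11))        # same class as its neighbor
```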
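The balancing target and the idea of building each artificial example from d + 1 originals can be sketched as below. How the d + 1 examples are selected and combined is not specified in the text; this sketch assumes they are drawn at random from the minority class and mixed with random convex weights, so the new point lies inside the simplex they span. That differs from classical SMOTE, which interpolates between an example and one of its nearest neighbors.

```python
import random


def target_class_size(class_counts):
    """max(1000, 2 x size of the smallest class), as stated in the text."""
    return max(1000, 2 * min(class_counts.values()))


def simplex_sample(examples, rng):
    """One synthetic d-dimensional point from d + 1 random examples.

    Assumption: random convex combination of the chosen vertices.
    """
    d = len(examples[0])
    vertices = rng.sample(examples, d + 1)
    raw = [rng.expovariate(1.0) for _ in vertices]
    total = sum(raw)
    weights = [r / total for r in raw]  # non-negative, sum to 1
    return [sum(w * v[i] for w, v in zip(weights, vertices))
            for i in range(d)]


rng = random.Random(0)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
print(target_class_size({1: 300, 2: 5000}))
print(simplex_sample(minority, rng))
```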

General strategy
Two ML algorithms are tested and compared, the classical KNN and the Gaussian DWNN, with respect to their capacity for replicating the IST and FSB soil classification systems. This comparison is made using several input feature combinations, including three basic sets:
• First set: depth z (m), corrected cone resistance q_t (MPa), lateral friction f_s (kPa) and pore pressure behind the cone tip u_2 (kPa);
• Second set: depth z (m), normalized cone resistance Q_t1, normalized lateral friction F_r (%) and normalized pore pressure B_q;
• Third set: depth z (m), normalized cone resistance Q_tn, normalized lateral friction F_r (%) and normalized pore pressure U_2.
The first set contains only non-normalized parameters, the second contains inputs of the IST method combined with depth and the third contains inputs of the FSB method combined with depth. For the main analysis, all combinations of two, three and four input features within each set were tested, although not all of them are presented in Section 5. Additional selected input feature combinations are tested in three complementary studies.
In order to generate statistically relevant comparisons, a 10-fold cross-validation procedure (Stone, 1974) was applied to evaluate classification accuracy. The procedure starts by randomly separating the dataset into ten partitions or folds of approximately the same size, maintaining the same proportion between classes observed in the complete dataset. At each cross-validation round, one partition is left for testing, one of the remaining partitions is chosen at random as a validation set and the others compose the training set. The validation set is used to calibrate the best number of neighbors k to be used in the distance-based algorithms.
For each cross-validation round, the average of the accuracies per class is taken. This avoids disregarding minority classes in the performance evaluation. After all folds are used for testing, the mean and standard deviation of the accuracy are computed. For comparing the results of the experiments, the Friedman statistical test (Friedman, 1937) with the Nemenyi post-hoc statistics (Nemenyi, 1963) and the Wilcoxon statistical test (Wilcoxon, 1945) are used, based on the 10 accuracies recorded (one per test fold).
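The combinations of the main analysis can be enumerated directly. The sketch below lists every subset of two, three and four features within each basic set; the set labels and variable names are illustrative.

```python
from itertools import combinations

FEATURE_SETS = {
    "raw": ("z", "q_t", "f_s", "u_2"),
    "IST": ("z", "Q_t1", "F_r", "B_q"),
    "FSB": ("z", "Q_tn", "F_r", "U_2"),
}


def main_analysis_combinations():
    """All subsets of size 2, 3 and 4 within each basic feature set."""
    combos = []
    for name, features in FEATURE_SETS.items():
        for size in (2, 3, 4):
            combos += [(name, c) for c in combinations(features, size)]
    return combos


combos = main_analysis_combinations()
print(len(combos))                   # 11 per set, 33 in total
print(len({c for _, c in combos}))   # distinct feature tuples
```

Counted per set, each set yields 6 + 4 + 1 = 11 combinations, which gives the 33 combinations reported for the main analysis. Under this reading the pair (z, F_r) is counted once in the second set and once in the third, so only 32 of the feature tuples are distinct.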
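The partitioning scheme can be sketched as follows: examples are dealt to ten folds class by class to preserve class proportions, and each round reserves one fold for testing and one randomly chosen fold for validation. The function names and the fixed seeds are assumptions made for reproducibility of the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Split example indices into folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            folds[pos % n_folds].append(idx)  # deal round-robin
    return folds


def cv_rounds(folds, seed=0):
    """Yield (train, validation, test) index lists, one round per fold."""
    rng = random.Random(seed)
    for test_id in range(len(folds)):
        others = [f for f in range(len(folds)) if f != test_id]
        val_id = rng.choice(others)
        train = [i for f in others if f != val_id for i in folds[f]]
        yield train, folds[val_id], folds[test_id]


labels = [0] * 50 + [1] * 30
folds = stratified_folds(labels)
rounds = list(cv_rounds(folds))
print(len(folds), len(rounds), [len(f) for f in folds])
```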
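The class-averaged accuracy and its summary over folds can be sketched in a few lines. The use of the sample standard deviation is an assumption, since the text does not say which estimator was adopted, and the fold scores are made-up numbers.

```python
import statistics


def mean_per_class_accuracy(y_true, y_pred):
    """Average of the accuracies computed separately for each true class."""
    hits, totals = {}, {}
    for truth, pred in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        hits[truth] = hits.get(truth, 0) + (truth == pred)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def summarize(fold_scores):
    """Mean and sample standard deviation over the test folds."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


# A classifier that always predicts the majority class
y_true = [1, 1, 1, 1, 2]
y_pred = [1, 1, 1, 1, 1]
print(mean_per_class_accuracy(y_true, y_pred))
print(summarize([0.90, 0.92, 0.88, 0.91, 0.89]))
```

In the toy case the plain accuracy would be 0.8, while the class-averaged value is 0.5 because the minority class is never predicted correctly. This is the effect the averaging is meant to expose.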

Results and Discussion
A total of 132 classification results were generated to produce the comparisons presented in this main analysis: 2 replicated classification methods (IST and FSB, described in Section 2) × 33 input feature combinations × 2 distance-based classification algorithms. The units used for the input features are z (m), q_t (MPa), f_s (kPa), u_2 (kPa) and F_r (%); the other ones are dimensionless. Each predicted soil class is compared to the one originally given by CPeT-IT to compute accuracy. The tables present the mean and standard deviation of the accuracy obtained within the 10-fold cross-validation procedure described in Section 4.3.
The combinations that presented the best performance with KNN for replicating IST outputs are presented in Table 6. Since the first combination uses the original IST inputs and outputs, it was expected that it would lead to the highest mean accuracy among all. Nonetheless, results of the Friedman statistical test with the Nemenyi post-hoc statistics show a statistical equivalence between the first two combinations in Table 6. Thus, the last two combinations shown in Table 6 can be considered of lower performance. This shows that adding more features to the original ones does not contribute to improving performance in this case. The same comparison is proposed for the combinations that lead to the best performance with the Gaussian DWNN technique for replicating IST outputs, which are presented in Table 7. One can observe that results are very close to those presented in Table 6, reinforcing the same conclusions.
Considering now the classical KNN technique for replicating FSB outputs, the best feature combinations are presented in Table 8. In this case, the Friedman statistical test with the Nemenyi post-hoc statistics shows that the last two feature combinations are equivalent and statistically better than the first two. One can observe that, as expected, the best combinations for this case include all three original FSB inputs, namely Q_tn, F_r and U_2. However, associating depth with these features contributed to improving performance, despite the bias introduced by the way in which the outputs were generated.
Finally, the feature combinations that produced the best performance for the Gaussian DWNN technique for replicating FSB outputs are presented in Table 9. One can observe that the results are very close to the ones from Table 8, reinforcing that using original FSB inputs leads to good accuracy and that including depth among these features contributes to improving performance.
Concerning more general observations, both tested ML techniques presented good performance for replicating both soil classification systems. With respect to the non-normalized inputs, good performance can be observed when they are associated with depth. For IST and both ML techniques, for example, accuracy is around 70% when only q_t and f_s are used as input features, but rises close to 90% when z is included. These observations suggest that proposing a soil classification system that uses only raw CPT data would be feasible if depth is included. Nevertheless, one should note that confirming this hypothesis would require further studies.
Another general observation concerns evaluating which classification technique is better, comparing the classical KNN and the Gaussian DWNN. The Wilcoxon test was employed for this task, adopting a significance level of 5%. Comparing all combinations, results show that the Gaussian DWNN presents better predictive performance than the classical KNN.
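The Wilcoxon signed-rank statistic used in such a paired comparison can be sketched without external libraries. The accuracies below are invented for illustration and are not results from this study; the function name is hypothetical, and the critical value of 8 for ten pairs at a two-sided 5% level is taken from standard tables rather than from the text.

```python
def wilcoxon_w(a, b):
    """Return (W, n): W = min(W+, W-) over the n non-zero paired differences."""
    diffs = sorted((x - y for x, y in zip(a, b) if x != y), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        while j < len(diffs) and abs(diffs[j]) == abs(diffs[i]):
            j += 1
        for pos in range(i, j):
            ranks[pos] = (i + 1 + j) / 2.0  # average rank for tied values
        i = j
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus), len(diffs)


# Hypothetical paired accuracies (%) for the two techniques
dwnn = [91, 92, 90, 93, 94, 92, 91, 95, 93, 92]
knn = [90, 90, 91, 90, 90, 90, 88, 90, 90, 89]
w, n = wilcoxon_w(dwnn, knn)
CRITICAL_W = 8  # n = 10, two-sided 5% level, from standard tables (assumption)
print(w, n, w <= CRITICAL_W)
```

A value of W at or below the critical value leads to rejecting the hypothesis that the two techniques perform equally; the sign of the dominant differences then indicates which one is better.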

Conclusions and Recommendations
In this work, distance-based ML techniques are used to replicate systems for soil classification from CPT data. It is important to note that the proposed discussions and obtained conclusions would not be possible using the original soil classification systems alone, because these original methods do not allow changing input features. It was the flexibility of the ML techniques that made it possible to evaluate, for example, whether raw inputs without normalizations carry enough information to reproduce the original methods accurately.
The main advantages of the proposed approach are the ease of applying it to different datasets and the little adaptation required for it to be associated with other ML techniques. The use of distance-based techniques can also be considered advantageous for its simplicity, since accurate results were obtained. Thus, the presented method can be considered rigorous compared to other works from the literature that make use of ML applications in geoscience, which do not present a data analysis as detailed as in Section 4.
A total of 132 tests were performed to draw the discussions and conclusions presented, and in all of them the mean accuracy is above 85%, which can be considered reasonable within geotechnical applications. The good results obtained using raw parameters are noteworthy, since they suggest that it would make sense to dispense with some types of data normalization that are proposed in the literature for soil classification systems. Reducing data normalization is advantageous because any data transformation applied to the original dataset tends to diminish its original information, especially if the original number of input features is reduced. Results presented here are not sufficient to affirm that using raw parameters would lead to greater performance; nonetheless, they can justify future studies about this issue. Other conclusions to be pointed out are:
• Highest accuracies were obtained when using the original IST inputs and outputs;
• Including depth as an input increased accuracy, in most cases;
• Gaussian DWNN is better than the classical KNN, considering the Wilcoxon test with a significance level of 5%.
Future studies that can be conducted include applying and comparing different ML techniques to this same problem, discussing other geotechnical issues about soil classification systems that cannot be exposed using distance-based techniques. Another possible investigation is applying clustering techniques to the problem, taking advantage of the ease of increasing dimensionality to test several normalized and non-normalized feature combinations. In this way, CPT data can be associated with data from other in situ experiments like the standard penetration test or the flat dilatometer test, exploring the problem with even higher dimensionality.
I_c: classification index
IQ: interquartile range
k: number of nearest neighbors
n: exponent of σ'_v0
p_a: reference pressure
Q_1: first quartile
Q_3: third quartile
q_c: cone resistance
q_t: total cone resistance
Q_t1: normalized cone resistance
Q_tn: updated normalized cone resistance
R_f: friction ratio
SD: standard deviation
S_t: sensitivity
u_0: equilibrium pore pressure
u_2: pore pressure measured behind the cone tip
U_2: updated normalized excess pore pressure
w: Gaussian weighting
x_i, x_j: points representing objects
z: depth
γ: soil unit weight
σ_v0: total overburden pressure
σ'_v0: effective overburden pressure

Figure 1 - a) Q_tn × F_r chart from Robertson (1991) updated by Robertson (2009). b) Q_tn × B_q chart from Robertson (1991) updated by Robertson (2009).

Figure 5 - Box-plot example using generic numbers. The rectangle represents ordinate values within the 1st and 3rd quartiles and the circles represent outliers.

Figure 6 - Edited nearest neighbor technique, with two possible classes (1 and 2) and black dots representing known examples. The unknown example (white) is labeled with the class of its nearest neighbor.

Figure 7 - Histograms. (a) For IST classes and the complete dataset. (b) For FSB classes and the complete dataset. (c) For IST classes and the geological dataset. (d) For FSB classes and the geological dataset.

Table 1

Table 3 - Sample of soil examples.

Table 5 - Number of CPTs and test location from P.W. Mayne database (acquired in years 2000-2003).

Table 6 - Best KNN predictive results for replicating IST.

Table 7 - Best DWNN predictive results for replicating IST.

Table 8 - Best KNN predictive results for replicating FSB.

Table 9 - Best DWNN predictive results for replicating FSB.