Application of kernel k-means and kernel x-means clustering to obtain soil classes from cone penetration test data Soils and Rocks

Most methods available in the literature for soil classification from cone penetration test (CPT) data define soil classes using laboratory tests. One disadvantage of this approach is that field soil conditions are difficult to replicate in a lab. The alternative adopted in this work is trying to define soil classes only by the similarity of the CPT measurements, using clustering. This study is the first, to the best knowledge of the authors, to cluster soil classes in a four-dimensional input feature space using measurements directly taken from the CPT experiment. Nine soil classes are produced from a general dataset containing 179 CPT soundings and, in a complementary study, four more specialized classes are obtained from 5 CPT soundings. Artificial neural networks (ANN) are used to produce simple models capable of reproducing both class groups, which are compared with clas-sical soil classifications from the literature and with standard penetration test (SPT) samples. Results show that both general and specialized class groups can be reproduced by ANN although accuracy is better for the latter, reaching a 97.04 % accuracy with a standard deviation of 1.24 %. Furthermore, it is shown that accuracies above 80 % are obtained even if incomplete data is used. This shows that the here proposed soil classes can become an interesting alternative in engineering practice.


Introduction
The more commonly used soil classification standard is the Unified Soil Classification System, which is based on granulometry and plasticity. Nevertheless, it has disadvantages like the difficulty of extracting undisturbed samples and the time delay required to get the results. On the other hand, the cone penetration test (CPT) allows an accurate measurement of soil parameters, which can be instantaneously used to classify soil layers along a vertical axis. One important issue concerning this classification is its connection to soil behavior in detriment of soil granulometry. In this context, although pioneer work proposing soil classification from CPT data focused only soil granulometry (Begemann, 1965), following studies stated that soil behavior should guide class definitions for being related to the soil load-bearing capacity (Douglas & Olsen, 1981). In later investigations, pore pressure information was included to define soil classes and propose normalizations for the cone resistance and lateral friction to account for the overburden pressure and better separate classes, which produced the well known Robertson charts (Robertson, 1990). A new friction ratio-based chart was later proposed, changing the circular curves of Robertson (1990) by hyperbolic ones (Schneider et al., 2012). Robertson (2016) modified these charts, defining a fully behavioral classification, including also the dilative and contractive behaviors for each of the three soil types.
Most work that use machine learning techniques for classifying soil from CPT data apply clustering to propose new soil classes (Hegazy & Mayne, 2002;Facciorusso & Uzielli, 2004;Bhattacharya & Solomtine, 2006;Liao & Mayne, 2007;Das & Basudhar, 2009;Rogiers et al., 2017;Wang et al., 2019). One limitation of these work is the reduced number of input features included, most times only two. Another limitation is that most work explore only hierarchical clustering techniques (Hegazy & Mayne, 2002;Facciorusso & Uzielli, 2004;Bhattacharya & Solomtine, 2006;Liao & Mayne, 2007). Nevertheless, a recent study stated that including depth as an input can improve cluster-ing results and that the x-means algorithm can lead to good results (Rogiers et al., 2017). In spite of these conclusions, to the best knowledge of the authors, no work from the literature investigated clustering techniques including all measured CPT parameters. Furthermore, the traditional x-means algorithm implemented with the original k-means can only be used for linearly separable classes.
The kernel k-means algorithm is an iterative clustering technique based on the minimization of the variance inside clusters. It allows objects changing from one cluster to another to reduce the overall variance. The kernel x-means algorithm works running kernel k-means several times, splitting the clusters into new ones in each round. In this context, the objective of this work is to use kernels k-means and kernels x-means to produce soil classification methods using four input features: depth, cone resistance, lateral friction and pore pressure. First, kernel k-means is applied to a dataset composed by 179 CPT soundings, of which 5 have paired SPT soundings, generating 9 soil classes. These classes are compared to SPT samples and to Robertson classification methods (Robertson, 1991(Robertson, , 2016 obtained with a student version of the CPeT-IT v2.0.2.5 software. An alternative specialized approach is also presented using the kernel x-means algorithm, which was found to be effective in previous work (Rogiers et al., 2017). It is shown that both proposed soil classification methods can be replicated by an ANN model, even if the pore pressure is not included as an input. This enables reproducing the obtained methods in simple spreadsheets.

Classification methods for comparison
The two soil classification methods here used for comparison were developed by Robertson. Only a brief view of their theory is presented here, once they are also used and described in previous work from the authors .

Influenced by soil granulometry (ISG)
This method was proposed by Robertson (1991) and its classes descriptions allude to granulometry: 1. Sensitive, fine grained 2. Organic soils -peats 3. Clays -clay to silty clay 4. Silt mixtures -clayey silt to silty clay 5. Sand mixtures -silty sand to sandy silt 6. Sands -clean sand to silty sand 7. Gravelly sand to sand 8. Very stiff sand to clayey sand 9. Very stiff, fine grained The normalized parameters used for classification are: where q t is the total cone resistance, which is a correction of the raw cone resistance q c . f s is the lateral friction, u 2 is the pore pressure measured behind the cone tip, u 0 is the hydrostatic pore pressure, s v0 is the total overburden stress and s v0 ' is the effective overburden stress. n is given by where p a = 0.1 MPa is a reference pressure and I c is defined as (Robertson, 2009):

Kernel k-means
The kernel k-means algorithm is a modification of the k-means algorithm, which groups the instances by partition, with a fixed number k of clusters. It is an iterative clustering technique based on the optimization of a clustering criterion, the mean squared error. For each iteration, differently from the hierarchical clustering, the objects can change from one cluster to another to reduce the error. The error is a measure of the variance inside the clusters, which has to be minimized. The mean squared error E is then given by the sum of the variances inside clusters for the k clusters as follows: where d(x i , x (j) ) is the distance between the object x i and the cluster centroid x (j) .
The algorithm does the following steps: 1. The first k centroids are randomly chosen 2. Each object is included in the group whose centroid is closer 3. A new centroid is then defined for each group in order to minimize the mean squared error 4. Steps (2) and (3) are repeated until conversion is observed, within a predefined error margin. The most used similarity measure is the Euclidean distance, which requires data normalization in order to avoid distortions due to data scale. The main advantage of k-means is its linear complexity, but its main disadvantages include the possibility of converging to local optimum and   being applicable only to linearly separable classes. Other weaknesses that can compromise analysis are its sensitivity to initialization, the possibility of generating imbalanced clusters and the need of previously fixing k. One simple alternative to search for the best k and avoid local minimums is running the algorithm several times, varying k and the initialization. This procedure is adopted in this work. One way to deal with classes that are not linearly separable is using a function to map the data from the original feature space into a higher dimensionality feature space wherein the objects are linearly separable. Nevertheless, non-linear transformation and high dimensionality are required to guarantee linear separability. Most work that make use of this approach do not define the function directly, but only a kernel function, which is sufficient to obtain the Euclidean distance. The Gaussian kernel adopted in this work is exp (-s||x i -x j || 2 ), where x i and x j are points within input feature space and s is the only calibration parameter required, which can be estimated from the data as the median of ||x i -x j || 2 .

Artificial neural networks
Artificial neural networks (ANN) are based on the brain functioning, with a structure constituted by processing units called neurons, which are connected by weighted signals called synapses. The first artificial neuron model, called Perceptron, was proposed by McCulloch & Pitts (1943). Its practical applicability was formalized with the work of Rosenblatt (1957).
In a Perceptron neuron, an object x receives n signals (inputs), which are weighted by a vector w. After these weighted inputs are gathered, an excitatory threshold or bias q is discounted, producing a net signal u. This net signal is then subjected to an activation function g to produce an output signal y = g(u) = g(w.x -q). This process is illustrated in Fig. 5. In this work, the sigmoid function is used for activation, which is presented below.
where l is a parameter to be calibrated. Data normalization is required, rescaling each input feature to the range [0,1]. One limitation of this model is that it can only be used for linearly separable classes. Non-linear cases require using multi-layer neural networks, which can be trained with the back-propagation algorithm (Rumelhart et al., 1986). Figure 6 represents the structure of this model, wherein each neuron is a Perceptron. According to the universal approximation theorem (Hornik et al., 1989), an ANN with one hidden layer is sufficient to replicate any continuous function. Thus, two hidden layers are enough to replicate even discontinuous functions.
Once there are infinite possibilities for an ANN model, restrictions must be defined to limit the number of calibration tests. The sigmoid function was fixed based on previous experience of the authors, the number of neurons for each layer was limited to double the number of classes and the number of layers was limited to 2. These decisions about architecture were based on the universal approximation theorem. Readers interested in further discussions about this issue are referred to .

Used datasets
Two datasets are used in this work, one named Full dataset and the other named Specific dataset. The objective is to demonstrate that more homogeneous datasets lead to ANN models with better accuracy. The Full dataset is composed by measurements taken within 179 CPT soundings, which are briefly described below:  of the Full dataset among the ISG and FSB classes, respectively. Even though there is an imbalance, the minority classes for the ISG and FSB methods have 381 and 5136 objects, respectively. Preliminary tests have shown that this is enough to represent these minority classes among the majority ones. The Specific dataset is a subset of the Full dataset and is composed by the measurements taken within the 5 CPT soundings provided by Professor Heraldo Giacheti. The paired SPT soundings provided 2847 soil samples, which are here divided into three classes, sands (60,2 % of samples), silts (16,1 % of samples) and clays (23,7 % of samples).
One of the objectives of this work is comparing these three SPT classes to the ones of the ISG method, of the FSB method and also to the ones here obtained by clustering.

Clustering analysis
Two separated studies are performed, one using the Full dataset and the other using the Specific dataset. Both of them are divided into two phases: clustering analysis and ANN modeling. First, the objects are grouped by the kernel k-means algorithm. For this step, the four measured CPT parameters are used to compose the original feature space: depth z (m), raw cone resistance q c (MPa), lateral friction f s (kPa) and pore pressure measured behind the cone tip u 2 (kPa). Using these inputs instead of normalizations such as Q t , B q and F r avoids reducing information within the dataset. Thus, a previous work from the authors suggests that dismissing this type of normalizations makes sense for soil classification . For both approaches the Gaussian kernel, which is calibrated by the median of the distance between points, is used to map the objects into a higher dimension feature space (see Section 3.1).
For the Full dataset, the procedure adopted to define the number of classes was manually varying this number and adopting the one with the lowest total variance inside clusters. This procedure lead to 9 classes, as described in Section 5.1. For the Specific dataset, the kernel x-means algorithm was employed. One basic version of this algorithm consists in running the kernel k-means several times from k = 2 and splitting the clusters into two new clusters in each round while a parameter called Bayesian Information Criterion is improved (Pelleg & Moore, 2000). Once this parameter gets any worse, the algorithm stops. The result for this case was 4 classes, as presented in Section 5.2.
After obtaining the clusters, they are compared to ISG classes, to FSB classes and to the three SPT classes defined in Section 4.1.

ANN modeling
In this work, ANN models are created to replicate soil classification systems obtained by clustering. The 10-fold cross-validation procedure (Stone, 1974) is employed to evaluate the predictive performance of the ANN models, as illustrated in Fig. 8. This procedure was adopted to avoid overfitting and to calculate a standard deviation of the accuracies obtained within the 10 iterations, which is an important information to be presented together with the mean accuracy.
The procedure starts dividing the dataset in 10 folds of the same size. At each step, one of the 10 folds is randomly selected and separated from the other 9. These 9 folds are then used for training, while the one kept apart is used for testing, obtaining an accuracy. Selection is made without reposition, allowing all folds to be tested after 10 steps. The mean and standard deviation of the obtained accuracies represent the predictive performance of the ANN model.
Notice that all soil samples received a class within the clustering procedure described in Section 4.2, making possible to check all predictions given by the ANN algorithm. Recall R i is defined as the number of right predictions for one class i divided by its number of examples n i :  where I ij = 1 if the model made a right prediction and I ij = 0 otherwise. In this work, the mean recall is used as performance measure and, from this point of the text, referred simply as accuracy A for a sake of clarity. For c classes, it is obtained as Preprocessing procedures are used within the 10-fold cross validation procedure to improve the predictive performance of the ANN algorithms. Once these procedures are described in previous work from the authors , they are here omitted for conciseness.

Clustering analysis with the full dataset
To produce the results presented within this section, the kernel k-means algorithm was applied. k was varied from 7 to 10, using the Full dataset and all CPT original measurements: z (m), q c (MPa), f s (kPa) and u 2 (kPa). The model with k = 9 was the one with the lowest total internal cluster variance, therefore it is the only one here presented. The 9 clusters, each one representing a soil class, have centers which coordinates are presented in Table 1.
In Tables 2 and 3 the clustering results are compared to ISG and FSB classes, respectively. Lines represent clustering classes and columns represent chart-based methods.
Each value is a percentage of soil samples that were assigned to a clustering class (line) and also to a specific ISG or FSB class (column). ISG class 0 is omitted from Table 2 due to its low representative among the used examples.
Observing Tables 2 and 3, the following interpretations were produced for the 9 cluster classes: • Classes 1 and 2: They present similar distributions among ISG classes, with a predominance of clay behavior (ISG classes 3 and 4). This predominance is also observed within FSB classes (1, 2 and 3), although the cluster classes appear to become different.
• Class 3: ISG classes 5 and 6, which represent sand behavior, compose 65 % of this cluster class. Similar per- 612 Carvalho & Ribeiro, Soils and Rocks 43(4): 607-618 (2020) Application of kernel k-means and kernel x-means clustering to obtain soil classes from cone penetration test data  centage is obtained if FSB classes 6 and 7 are added, which also represent sand behavior. • Classes 4 and 5: These classes clearly represent sand behavior, with high percentages assigned to ISG class 6 and FSB class 7. Their similarity suggests merging them together. • Classes 6 and 7: Once the behavior of these classes is well distributed among ISG and FSB classes, they are here considered transitional. In other words, behavior that cannot be clearly distinguished between sand and clay. • Class 8: This class is strongly identified with clay behavior, with 86 % of ISG classes 3 and 4 and 65 % of FSB classes 1, 2 and 3. • Class 9: Its behavior is also distributed among ISG and FSB classes, being here considered transitional. Table 4 was produced to compare ISG classes (columns) with the sample observations obtained via SPT sampling (lines). Numbers represent percentages, similarly to the previous tables, and some ISG classes are omitted for being underrepresented with samples. As defined in Section 3, SPT classes represent sand, silt and clay. Moving from ISG classes 3 to 6, one can observe an increase of sand and decrease of clay, which is coherent with their names given in Section 2.1. An analogous analysis is proposed with Table 5, comparing FSB classes (columns) to SPT (lines). The correspondence to the FSB class names given in Section 2.2 is not clear, except for FSB classes 3 and 7. This suggests that FSB is less sensitive to soil granulometry than ISG.
The clustering results were also compared to the SPT sample observations, resulting Table 6. Cluster classes 3 and 8 contain relevant parts of sand and clay, being here identified as transitional. Class 4 is the only one with predominance of clay and the other can be identified with sand. These observations do not match the ones provided by the comparisons to the ISG and FSB methods, showing that   soil granulometry alone is not enough to explain its mechanical behavior.
To better illustrate the cluster classes obtained, a case study is presented in Fig. 9. A 29.3 m sounding from the USA was used, being classified using the ISG classes (Fig.  9a), the FSB classes (Fig. 9b) and the cluster classes presented in this section (Fig. 9c). The name of the classes is the same adopted in Tables 2 and 3 and colors are used independently for each classification method.
The last step of this analysis is applying ANN to produce a model capable of reproducing the obtained classification method. This procedure resulted a model with an accuracy of 89.35 % with a standard deviation of 0.40 %, corresponding to an architecture with only one hidden layer with 18 neurons.
Another ANN model was trained using only z, q c and f s as input features. The objective is verifying if CPT equipment without a pore pressure filter can provide enough in-formation to approximate the method. The resultant model presented an accuracy of 84.47 % with a standard deviation of 0.30 %, corresponding to an architecture with two hidden layers, the first with 16 neurons and the second with 18 neurons.
The weight matrices and bias vectors produced for the ANN models of this section are here omitted for conciseness. Readers interested in this information are advised to contact the authors.

Specialized approach
Using CPT data from only 5 soundings, all from the same site, tends to improve classification accuracy. Nonetheless, the obtained model becomes limited to the soil types measured within these 5 soundings. For that reason, these clusters are here considered more specialized than those obtained in the previous section. This strategy is here investigated using the kernel x-means algorithm instead of varying manually the number of classes, which enables maintaining the minimum total internal cluster variance as a performance measure. This allows comparing different results given by this algorithm in cases wherein a high variation of the number of classes k is observed.
Only 5 CPT soundings are used to obtain the specialized classification method by clustering, all taken from the same location and paired with SPT soundings. With the kernel x-means algorithm, 4 classes were found to be the best for the considered dataset, with their centers presented in Table 7.
Crossing results with the ISG and FSB classification methods and to SPT soundings, Tables 8, 9 and 10 are obtained, respectively. As in the previous section, values represent percentages of soil assigned to a cluster class (line) and also to a reference method class (column).   For this case, some agreement can be observed for the soil type of the cluster classes when compared to ISG, FSB and SPT. Cluster class 1 shows a subtle predominance of sand over clay when compared to the ISG, which is also observed for FSB and SPT. The predominance of sand is clearer for cluster class 2, specially comparing to FSB. Cluster class 3 seems to confuse the ISG and FSB methods, although it can be identified as sand considering SPT alone. Finally, cluster class 4 can be also identified as sand, although such correlation is weaker than the one observed for cluster class 2.
Comparing these results with the ones of the previous section, one can conclude that specializing classification improves agreement with SPT sampling. This can be con-sidered an advantage, for uniting the model capability of predicting soil behavior to a correspondence with SPT visual-tactile observations.
A case study is also presented for the specialized cluster classes, which can be observed in Fig. 10. This sounding is 12.3 m long and is one of the 5 used to produce the specialized cluster classes used in this section. Class names are the same used in Tables 8 and 9.
In the end, an ANN model was produced in order to reproduce the obtained specialized classification method. The obtained model presented very good predictive performance, with an accuracy of 97.04 % and a standard deviation of 1.24 %. This result can be considered significantly better than the one obtained for the general approach, suggesting that limited extrapolations with the specialized approach are feasible. The weight matrices and their respective bias vectors for this last ANN model are: One can notice that these matrices correspond to an architecture with only one hidden layer containing five neurons. In this case it was also evaluated if suppressing pore pressure information prejudices predictive performance. The resultant ANN model, that makes use of only z, q c and f s , presented an accuracy of 90.37 % with a standard .
These matrices correspond to an ANN architecture with two hidden layers, the first with 6 neurons and the second with 5 neurons.

Conclusions and recommendations
This work explores the kernel k-means and kernel x-means clustering algorithms to group CPT data into different soil classes. Using a kernel function to modify the k-means algorithm enables evaluating classes that are not linearly separable. Next, ANN are used to create mathematical models which can be easily reproduced. Two different approaches are studied, one is general and the other more specialized. The general approach uses 179 soundings from different sources to develop an ANN model that can be better extrapolated to any new CPT data. On the other hand, the specialized approach requires running the kernel xmeans to generate specialized classes for each site investigation, as well as producing a new ANN model. The specialized model is expected to be more accurate for sites with soils similar to those for which it was trained, but it is also expected to be more limited for extrapolation. This approach is applied to 5 soundings for which the CPT soundings were paired with SPT soundings. Results confirm that the specialized model produces more well-defined classes and a more accurate ANN model. The mean accuracy (MA) and standard deviation (SD) obtained for all ANN models are summarized in Table 11.
These values can be considered reasonable when compared to other studies from the literature that used ANN to predict soil classes from CPT data, as Bhattacharya & Solomatine (2006) that achieved 83 % and Kurup & Griffin (2006) that achieved 86 %. Thus, Elkateb et al. (2003) cite a case study that shows that pure engineering judgment can lead to 70 % of poor to bad soil predictions.
One advantage of the here proposed methodology is that the ANN models can be reproduced with spreadsheets by simply combining the calibrated weights with the used activation functions. What makes it different from other methods from the literature is the possibility of approximating the soil classes without pore pressure information, becoming an important alternative for geotechnical engineers in cases that high accuracies are not required. Thus, to the best knowledge of the authors, this is the first study that produces ANN models from tropical soil CPT data, being recommended for projects within tropical countries. Nonetheless, this model can be considered limited to the types of soil for which the ANN models were trained, which is critical particularly for the specialized approach.