Spatial Autocorrelation Incorporated Machine Learning Model for Geotechnical Subsurface Modeling

Machine learning models for spatial prediction have been applied in various types of research. However, spatial relations have not been fully considered in modeling, since the Cartesian coordinates of the observed points are directly employed as the location information for machine learning features. This study presents a machine learning modeling process that incorporates spatial autocorrelation for geotechnical subsurface modeling. A new set of features called the Euclidean distance field (EDF) was generated based on the distances between the query points and the observed boreholes in order to incorporate spatial autocorrelation into the machine learning model. Principal component analysis (PCA) was performed to reduce the increased dimensionality of the dataset caused by the EDF features. Optimized machine learning models based on several popular algorithms (Support Vector Machine, Gaussian Process Regression, Artificial Neural Network, and k-Nearest Neighbor) were employed to predict several types of geotechnical information as the targets. The results showed that the optimized machine learning models constructed with the EDF modeling approach generate slightly lower Root Mean Square Error (RMSE) scores than the models with the direct XY coordinate approach, by 0.041, 0.046, 1.302, and 1.561 for ground surface elevation, groundwater level, SPT-N value, and percent finer than the 0.075 mm sieve, respectively. Both modeling approaches performed well for USCS-based soil classification, with the EDF model improving classification accuracy slightly, by 0.72%. Furthermore, the model can perform balanced multiclass classification, as indicated by the >95% precision, recall, F1 score, and balanced accuracy score. These results indicate that spatial autocorrelation has a noticeable effect. Hence, it needs to be considered to improve the overall performance of spatial machine learning modeling.
Comparison of geotechnical subsurface predictions generated based on different machine learning algorithms showed that the selection of the best-performing model based only on the lowest prediction error is not appropriate for spatial prediction modeling. Therefore, thorough analysis of the predicted data by visualization is necessary in the selection process for spatial prediction modeling.


Introduction
In a geotechnical project, an accurate prediction of geotechnical information is important. It can minimize adjustments to the initial construction plan during the construction process, which results in better efficiency during construction, as the designed schedule, methodology, and cost proceed as planned. In situ tests such as the standard penetration test (SPT) and laboratory soil tests are commonly performed to investigate soil properties at a project site [1][2][3][4]. In this study, principal component analysis (PCA) and hyperparameter optimization are applied to all constructed machine learning models. The effect of the different machine learning modeling approaches, which employ the Euclidean distance field (EDF) proposed by Behrens et al. [20], the EDF proposed in this study, and the raw XY coordinates as input features, is analyzed by comparing the performance scores of all constructed machine learning models. At the end of this study, the machine learning-based predictions of geotechnical properties and soil classification are analyzed by visualizing the geotechnical subsurface model using MATLAB R2021a.
Besides geotechnical engineering, the spatial predictive modeling approach proposed in this study is also applicable in other fields. In geological engineering, it is applicable to geologic mapping, where information from geological bore logs is commonly interpolated to generate geologic maps. In other fields such as environmental engineering, it is also applicable to three-dimensional predictive mapping of subsurface soil organic carbon and soil contamination.


Geotechnical Subsurface Modeling Methodology
The methodology for subsurface modeling of geotechnical properties is summarized and visualized in the flow chart presented in Figure 1. The first step is to prepare the soil investigation data as the machine learning dataset. The dataset is pre-processed using various methods, including data transformation. The new features, named the Euclidean distance field (EDF), are generated in the data pre-processing step and are used as input data in the machine learning modeling process. The dataset is then divided into training-validation data and test data. Dimensionality reduction is performed on the EDF features by employing the principal component analysis (PCA) technique. Afterwards, the process advances to machine learning training and validation. In this step, an optimized machine learning model is generated based on several popular machine learning algorithms (MLA). The five-fold cross-validation technique was chosen to validate the models. This step is processed iteratively until it reaches the predetermined stopping criteria. The machine learning models that best predict each of the targets are determined based on their ability to predict the targets from the test data. Finally, the chosen machine learning models are used to predict subsurface geotechnical properties in the study area. The results are then visualized in three-dimensional space using MATLAB R2021a.


Study Sites and Available Soil Investigation Data
The soil investigation data used as the dataset in this study were acquired from the Saemangeum onshore solar farm area in Gunsan, Republic of Korea. The study area is approximately 3.5 km², and the data were acquired from 33 boreholes inside Zones A, B, and C and from one borehole outside the area, as can be seen in Figure 2. The maximum distance between a test borehole and its adjacent borehole is approximately 418 m (C-BH-1 to C-BH-4). The dataset consists of the test borehole coordinates on the x, y, and z axes, the ground surface elevation, the groundwater level (GWL), the SPT-N value, and soil test data such as the percent finer than the 0.075 mm sieve (No. 200) and the classification of soil samples according to the USCS. The depth of the GWL at each borehole was identified to vary, with values ranging from 0.9 m to 1.5 m. The drilling depth of these boreholes varied from 23.3 m to 32.5 m, and most of the boreholes were drilled to depths of around 30.0 m. In the study area, three types of soil were identified according to the USCS soil classification system: sand with silty fines (SM), silt with a low liquid limit (ML), and clay with a low liquid limit (CL). Based on the soil investigation data, the soils within 0 m to 5 m depth are mostly loose silty sands. However, rock layers are also found within this range of depth at several locations. Geotechnical information within this depth range is important in choosing an appropriate foundation design that can securely support the load of the solar panel structure. Therefore, the machine learning models constructed in this study were expected to predict and identify geotechnical information and soil layers within 0 m to 5 m depth. Soil investigation data over a slightly wider range of depths, 0 m to 10 m, were used to train machine learning models that could cover predictions over the expected range of depth.
From the previously mentioned dataset, all variables for spatial modeling can be summarized as (1) X coordinate, (2) Y coordinate, (3) Z coordinate, (4) ground surface elevation (GSE), (5) groundwater level (GWL), (6) measured SPT-N value, (7) percent finer than the 0.075 mm sieve, and (8) USCS-based soil classification. Statistical information such as the mean, median, and standard deviation of each variable is presented in Table 1. In addition, histograms showing the frequency distribution of each attribute are presented in Figure 3. A multitude of information can be acquired from the frequency distributions. For the X-Coordinate feature, most of the samples are available at X coordinates between 4200 and 4700, while the distribution of the Y-Coordinate feature is more balanced. For the Z-Coordinate feature, most of the data are available within 2 m to −7 m. Ground surface elevations of the study area are recorded mainly at elevations > 1.5 m, and the groundwater levels are mostly recorded at elevations between 0.9 m and 1.5 m. SPT-N values for most of the samples lie between 0 and 16. The collected dataset has a balanced number of sand samples (SM, % finer than the 0.075 mm sieve < 50) and silt-clay samples (ML and CL, % finer than the 0.075 mm sieve ≥ 50). For the Soil Classification feature, the dataset has more SM samples and fewer CL samples. The frequency distributions presented in Figure 3 also visualize the distribution skewness: the Z-Coordinate and SPT-N data are negatively and positively skewed, respectively. The distribution of each attribute shows that the data are not in a proper form for training a machine learning model and need to undergo data pre-processing to make the data more computationally accountable and explicable [21].

To model the spatial autocorrelation, the raw X-Y coordinate features need to be transformed into a distance vector. Two forms of Euclidean distance feature set were used for machine learning modeling. The first form is the EDF model proposed by Behrens et al. [20], termed the "EDFB" feature set in this study, which integrates the X and Y coordinates, the Euclidean distances to the corners of a rectangle around the sample set, and the distances to the center location of the sample set [20]. The second form is the one proposed in this study, which calculates the Euclidean distance between the query point and all sample points (all borehole coordinates). This distance vector, termed the Euclidean distance field, is illustrated in Figure 4 and explained in Table 2.
Figure 4 describes the location information of two known boreholes and three unobserved points. The Euclidean distance between the query points and the known boreholes is calculated using the following equation:

d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}    (1)

where p and q are two points in Euclidean n-space, q_i and p_i are Euclidean vectors starting from the origin of the space (initial point), and n is the dimension of the space.

The new input features generated by this transformation can be seen in Table 2. When the raw X-Y coordinates are transformed into the Euclidean distance vector, the number of features grows in proportion to the number of sample boreholes. For example, as illustrated in Figure 4, there are two known boreholes and three unobserved points (the locations of the targets for the prediction). Therefore, two new features will be generated as a result of the transformation, containing the Euclidean distances between the known boreholes (BH1 and BH2) and the unobserved locations (L1, L2, and L3), as presented in Table 2. In the same way, as this study analyzes data from 34 known borehole points, 34 new features will be generated. The newly generated EDF features replace the X and Y coordinates as the location information features for machine learning modeling.
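As an illustration, the EDF transformation described above can be sketched in a few lines of Python. The borehole coordinates below are hypothetical, not values from the study dataset:

```python
import math

def edf_features(query_xy, borehole_xys):
    """Euclidean distance field: distance from one query point to every
    known borehole. One feature per borehole replaces the raw X-Y pair."""
    qx, qy = query_xy
    return [math.hypot(qx - bx, qy - by) for bx, by in borehole_xys]

# Two known boreholes (as in Figure 4) and one unobserved location.
boreholes = [(0.0, 0.0), (30.0, 40.0)]   # hypothetical BH1, BH2 coordinates
L1 = (30.0, 0.0)
print(edf_features(L1, boreholes))        # [30.0, 40.0]
```

With 34 boreholes, the same call returns a 34-element feature vector for each query point, matching the feature growth described above.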


Data Transformation
In this study, data transformation techniques were used to improve the features significant to the machine learning model development. Min-max normalization is applied for the transformation of the location information features, the GSE, the GWL, and the percent finer than the 0.075 mm sieve. The data normalization formula is

s_{norm} = \frac{s - s_{min}}{s_{max} - s_{min}}    (2)

where s_{norm} is the normalized value of variable v, s is the initial value of variable v, s_{min} is the minimum value of variable v, and s_{max} is the maximum value of variable v [5].
The log transformation technique is applied for the transformation of the SPT-N values. Log transformation is performed by replacing each data point s with the transformed value s_{trans}:

s_{trans} = \log(s + 1)    (3)

where the addition of 1 is meant to prevent the log(0) condition for every 0 value in the dataset.
For the Soil Classification attribute, the data need to be transformed by assigning each soil type category an integer value. In this study, ordinal encoding is used to transform the data from categorical to numerical form. The assigned integers for each category are 1 for SM, 2 for ML, and 3 for CL.
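The three transformations above can be sketched as follows. This is a minimal illustration: the function and variable names are ours, and `log_transform` assumes a base-10 logarithm, which the text does not specify:

```python
import math

def min_max_normalize(s, s_min, s_max):
    # Eq. (2): s_norm = (s - s_min) / (s_max - s_min)
    return (s - s_min) / (s_max - s_min)

def log_transform(spt_n):
    # Eq. (3): s_trans = log(s + 1); the +1 avoids log(0) for SPT-N values of 0
    return math.log10(spt_n + 1)

# Ordinal encoding of the USCS soil classes used in the study
USCS_CODES = {"SM": 1, "ML": 2, "CL": 3}

print(min_max_normalize(5.0, 0.0, 10.0))  # 0.5
print(log_transform(0))                    # 0.0
print(USCS_CODES["ML"])                    # 2
```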

Machine Learning Modeling Process
The Support Vector Machine (SVM) is a supervised learning algorithm for classification and regression. The SVM algorithm for classification works by finding the hyperplane that maximizes the distance between classes, using the samples closest to the hyperplane (the support vectors) to define the separation margin [22]. The greater the margin, the better the algorithm's ability to predict values for previously unseen data (the lower the generalization error). The SVM algorithm for regression works similarly to the classification SVM, except that the algorithm is trained to find the hyperplane with the maximum number of points fitted to it. Figure 5a shows an illustration of the SVM algorithm for regression and classification.

Artificial Neural Network (ANN)
The artificial neural network (ANN) consists of three different types of layers: the input layer, the hidden layer(s), and the output layer [14], as presented in Figure 5b. The input layer is constructed of features that contain the observed data that will be input into the training process. The hidden layer(s) are where the input data, together with weights and biases, undergo a mathematical process to generate the output value. The output layer is the last layer, where the final output value is calculated and presented.
The applied neural network is iteratively trained with the backpropagation algorithm. The output of the model is compared with the targets from the training-validation set to calculate the error and update the weights and biases in the hidden layer(s). In recent research, ANNs have been used to classify soils, identify soil parameters, and predict the SPT-N value [5,7,23].
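The iterative update of weights and bias from the prediction error can be illustrated with a deliberately minimal sketch: a single linear neuron trained by gradient descent on squared error. This is not the study's actual network architecture, only the core update rule that backpropagation applies layer by layer:

```python
def train_step(w, b, x, t, lr=0.1):
    """One training iteration: forward pass, error gradient, weight update."""
    y = w * x + b                              # forward pass
    grad_w, grad_b = 2 * (y - t) * x, 2 * (y - t)  # d(error)/dw, d(error)/db
    return w - lr * grad_w, b - lr * grad_b

w, b = 0.0, 0.0
for _ in range(50):                            # iterate until near convergence
    w, b = train_step(w, b, x=1.0, t=2.0)
print(round(w + b, 3))                         # 2.0: output approaches the target
```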


Gaussian Process Regression (GPR)
The Gaussian process regression (GPR) algorithm, as visualized in Figure 5c, is a form of Bayesian nonlinear regression and a nonparametric algorithm [24]. One of the main components of Gaussian process regression is the kernel or covariance function, which defines the proximity or similarity of the target values (t) between two close inputs i [25]. Therefore, the covariance function k needs to be calculated to prepare for the GPR. Using the presumption of a Gaussian distribution, and assuming that the observed data points (i, t) originate from noiseless distributions, the unseen target values (t*) can be predicted. The aim is to predict t* for new input data i* based on the Gaussian process prior and the observed data points (i, t). The joint prior distribution of the observed target values t and t* is presented in Equation (4) [26]:

\begin{bmatrix} t \\ t^* \end{bmatrix} \sim \mathcal{N}\left( 0, \begin{bmatrix} K & K_*^T \\ K_* & K_{**} \end{bmatrix} \right)    (4)

In Equation (4), \mathcal{N} is a Gaussian normal distribution with mean vector 0 and the given covariance matrix, and T is the matrix transpose. The covariance matrices for the GPR (K, K_*, and K_{**}) are explained in Rasmussen and Williams [25]. By assessing the mean and covariance matrix of the conditional probability in Equation (5), given the data i, i*, and t, t* can be sampled from the joint posterior distribution:

t^* \mid i, i^*, t \sim \mathcal{N}\left( \bar{t}^*, K_{**} - K_* K^{-1} K_*^T \right)    (5)

The t* realizations from the posterior are generated multiple times, and the best estimate for t*, acquired from the mean of this distribution (\bar{t}^*), is shown in Equation (6):

\bar{t}^* = K_* K^{-1} t    (6)
Details on the GPR and its kernel or covariance function can be found in Rasmussen and Williams [25]. The application of GPR in civil engineering can be found in Mahmoodzadeh et al. [27].
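The posterior mean of Equation (6) can be sketched in a few lines of NumPy. This is a toy example with a squared-exponential kernel and hypothetical one-dimensional inputs, not the study's kernel or data:

```python
import numpy as np

def rbf(A, B, length=1.0):
    # Squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 l^2))
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-d2 / (2 * length ** 2))

def gpr_predict(i_train, t_train, i_query, noise=1e-8):
    """Posterior mean t_bar* = K* K^-1 t (Eq. (6)); tiny jitter for stability."""
    K = rbf(i_train, i_train) + noise * np.eye(len(i_train))   # K
    K_s = rbf(i_query, i_train)                                # K*
    return K_s @ np.linalg.solve(K, t_train)

i = np.array([0.0, 1.0, 2.0])
t = np.array([0.0, 1.0, 0.0])
print(gpr_predict(i, t, np.array([1.0])))  # ~[1.0]: reproduces a training target
```

Because the data are assumed noiseless, the posterior mean passes through the observed targets; with observation noise the prediction would be smoothed instead.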
The k-nearest neighbor (kNN) algorithm is used for the classification model. This is a non-parametric algorithm that identifies the classes of the k nearest data points, those closest in distance to the new data point [28,29]. The algorithm performs classification based on the most common or majority class assigned to its k closest data points [30]. The kNN algorithm (Figure 5d) has four steps [31]. In the first step, the distance between the new data point and the entire dataset is calculated, commonly using the Euclidean distance, as shown in Equation (7):

d(a, b) = \sqrt{\sum_{m=1}^{n} (a_m - b_m)^2}    (7)

where a and b are two points in Euclidean n-space and a_m and b_m are their vector components. The distances are then ranked in the second step. In the third step, the k smallest values are taken, and the class for the new data point is determined in the final step.
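The four kNN steps above can be sketched as follows (a hypothetical two-feature example; the labels reuse the USCS classes for flavor):

```python
from collections import Counter
import math

def knn_classify(query, samples, k=3):
    """samples: list of (feature_vector, label). The four steps: compute
    distances (Eq. (7)), rank them, take the k smallest, majority-vote."""
    dists = sorted((math.dist(query, x), label) for x, label in samples)
    k_labels = [label for _, label in dists[:k]]
    return Counter(k_labels).most_common(1)[0][0]

samples = [((0, 0), "SM"), ((1, 0), "SM"), ((5, 5), "CL"), ((6, 5), "CL")]
print(knn_classify((0.5, 0.2), samples, k=3))  # SM
```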

Dataset Partitioning (Training, Validation, and Testing)
After data pre-processing, the machine learning dataset can be divided based on the three building processes of the machine learning model [14,32]. The first is the training process, where a mathematical model that represents the training set is created through the iterative learning of the training set by the MLA. The second is the validation process. k-fold cross-validation, one of the most common techniques for machine learning validation, was selected for this study; k = 5, a value commonly used for validation [33], was chosen for each machine learning model. The third is the testing process, where the previously trained and validated model is tested on previously unseen data. The test dataset is usually drawn as 20-30% of the entire dataset. In this study, the test data were assembled by randomly drawing 30% of the sample dataset.
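The index bookkeeping for the five-fold split can be sketched as below (illustrative only; the study's actual partitioning tool is not specified):

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices, then split them into k validation folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

folds = k_fold_indices(20, k=5)
print([len(f) for f in folds])  # [4, 4, 4, 4, 4]
# Each fold serves once as validation data; the other four folds are
# the training data for that iteration.
```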

Target, Algorithms, and Features for Machine Learning Models
All the features chosen for each machine learning model can be seen in Table 3. Besides the GSE, Z-Coordinate, GWL, SPT-N Value, and % Finer than 0.075 mm Sieve, a set of features named Location Information was included to represent the spatial relation in this modeling process. There are three forms of Location Information features to identify query point location: (1) EDF features (Section 2.3.1), (2) Raw X-Y Cartesian coordinate, and (3) EDFB features (proposed by Behrens et al. [20]).
Machine learning models with the EDF as features were separated into three models based on the PCA applied to the model: (1) without PCA, (2) with an 80% threshold for the PCA variance ratio, and (3) with a 90% threshold for the PCA variance ratio. As a result, seven machine learning models were generated for each target in this study. PCA reduces data dimensionality while maintaining as much variance as possible [34,35]. This is achieved by transforming the dataset into a new group of uncorrelated variables called the principal components, and then ordering them so that the first few components retain the majority of the variance contained in the initial variables. As explained by Ahn et al. [35], when PCA is performed on the dataset, the weight vector for the kth principal component (PC) is obtained as follows:

w_k = \arg\max_{\|w\| = 1} \|\hat{D}_k w\|^2    (8)

where w is the unit weight vector and \hat{D}_k is the dataset after subtracting the first k − 1 PCs from the initial dataset D; \hat{D}_k (with \hat{D}_1 equal to the initial dataset) is provided by

\hat{D}_k = D - \sum_{j=1}^{k-1} D w_j w_j^T    (9)

The number of PCs for the dataset in this study was acquired by predefining the threshold as 80% and 90% of the explained variance, the values suggested in the literature [34][35][36][37], instead of directly defining the number of PCs applied to the dataset.
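Choosing the number of PCs from an explained-variance threshold can be sketched as follows (NumPy, via an eigendecomposition of the covariance matrix; the synthetic six-feature dataset stands in for the 34 EDF features and is purely illustrative):

```python
import numpy as np

def n_components_for_threshold(D, threshold=0.90):
    """Number of principal components needed to retain `threshold`
    of the total variance."""
    Dc = D - D.mean(axis=0)                                   # center the data
    eigvals = np.linalg.eigvalsh(np.cov(Dc, rowvar=False))[::-1]  # descending
    ratio = np.cumsum(eigvals) / eigvals.sum()                # cumulative share
    return int(np.searchsorted(ratio, threshold) + 1)

rng = np.random.default_rng(0)
# 100 samples of 6 correlated features: two strong directions plus noise
base = rng.normal(size=(100, 2))
D = np.hstack([base, base + 0.01 * rng.normal(size=(100, 2)),
               0.01 * rng.normal(size=(100, 2))])
print(n_components_for_threshold(D, 0.90))  # 2: two directions dominate
```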

Machine Learning Model Optimization by Hyperparameter Tuning
The present study evaluates machine learning models with four different algorithms for predicting geotechnical properties. Optimal model performance can be achieved by assigning proper parameters for training the model, that is, by performing hyperparameter tuning. The optimum hyperparameters are obtained when the trained model reaches the best validation performance over the iterative training-validation process. Each MLA has its own hyperparameters to be tuned; thus, optimization needs to be performed for each model. Among the available optimization techniques, such as Bayesian optimization, grid search, and random search [38][39][40], the random search algorithm was chosen for this study. The next set of hyperparameters is obtained randomly while continuously improving the quality of the search [33]. This algorithm was chosen for its computational efficiency compared to other methods such as grid search. The chosen number of iterations for the random search is 60, suggested by Bergstra and Bengio [41] as an adequate number to reach a near-optimum value.
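A generic random-search loop of the kind described above might look like this; the search space and objective below are hypothetical stand-ins for a model's cross-validation score, not the study's actual hyperparameters:

```python
import random

def random_search(objective, space, n_iter=60, seed=0):
    """Sample hyperparameter sets at random; keep the best validation score."""
    rng = random.Random(seed)
    best, best_score = None, float("inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)          # e.g. five-fold cross-validation RMSE
        if score < best_score:
            best, best_score = params, score
    return best, best_score

# Hypothetical objective: pretend the validation error is minimized
# at lr = 0.1 and depth = 3.
space = {"lr": [0.001, 0.01, 0.1, 1.0], "depth": [1, 2, 3, 4, 5]}
obj = lambda p: abs(p["lr"] - 0.1) + abs(p["depth"] - 3)
print(random_search(obj, space, n_iter=60))
```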

Model Evaluation
Two performance indicators were used to evaluate the regression models in this study: the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE). Lower RMSE and MAE scores indicate better model performance. For the classification models, prediction accuracy in percentage (%), precision, recall, F1 score, and balanced accuracy were used as indicators. The formulae for calculating each performance indicator are shown in Equations (10)–(15). These are common indicators for quantifying machine learning model performance.
where $p_i$ is the actual value of an observed data point, $\hat{p}_i$ is the predicted value, and $n$ is the number of observed data points. For the accuracy calculation, TP is true positive, TN is true negative, FP is false positive, and FN is false negative.
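As a small worked sketch of these indicators on hypothetical values (scikit-learn's metric functions stand in for Equations (10)–(15)):

```python
# Regression and classification metrics on hypothetical predictions.
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score)

# Regression: RMSE penalizes large errors more heavily than MAE.
y_true = np.array([2.0, 3.5, 4.0, 5.5])
y_pred = np.array([2.2, 3.4, 4.3, 5.0])
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

# Classification: hypothetical USCS labels, one of five points misclassified.
c_true = ["SM", "ML", "CL", "ML", "SM"]
c_pred = ["SM", "ML", "CL", "CL", "SM"]
acc = 100 * accuracy_score(c_true, c_pred)             # accuracy in %
prec = precision_score(c_true, c_pred, average="macro")
rec = recall_score(c_true, c_pred, average="macro")
f1 = f1_score(c_true, c_pred, average="macro")
bal = balanced_accuracy_score(c_true, c_pred)          # mean per-class recall

print(round(rmse, 3), round(mae, 3), acc)
```

Balanced accuracy is the mean of the per-class recalls, which is why it is the indicator of choice when checking that no soil class is systematically neglected.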

Performance Evaluation of Constructed Machine Learning Models
Before comparing model performance, the predicted data were retransformed. The values predicted by machine learning were still in the transformed form, as a result of the data transformation explained in Section 2.3.2. Therefore, the predicted values need to be transformed back into their real form, using the formulae in Equations (16) and (17).
where $s_{RT}$ is the retransformed predicted value, $s_{pred}$ is the direct output prediction value from machine learning, $s_{min}$ is the minimum value of variable $v$, and $s_{max}$ is the maximum value of variable $v$. Equation (16) is used to retransform the predicted ground surface elevation, groundwater level, and percent finer than 0.075 mm sieve, while Equation (17) is used to retransform the predicted SPT-N value.

Next, the performance of machine learning with the EDF and XY coordinate modeling approaches is discussed. The performance of the models developed in this study is quantified by the RMSE and MAE scores. Figure 6 presents the RMSE and MAE scores for all constructed regression models. First, the model performance for ground surface elevation prediction is analyzed. Looking for the lowest RMSE score, GSE_GPR_EDF_90 (the best EDF model) has a slightly lower error than GSE_GPR_XY_90 (the best XY model) and GSE_GPR_EDFB (the EDFB model), with RMSE scores of 0.215, 0.221, and 0.289, respectively. In a similar way, we can identify the best-performing regression models for the other targets. GWL_GPR_EDF_90 and FINER_GPR_EDF_NOPCA show the best EDF model performance for predicting groundwater level and percent finer than 0.075 mm sieve, respectively. There is disagreement over the model with the lowest error for the SPT-N prediction: SPTN_ANN_EDF_80 has the lowest RMSE score, while SPTN_ANN_EDF_90 has the lowest MAE score. This result indicates that the predictions of SPTN_ANN_EDF_80 contain fewer large errors than those of SPTN_ANN_EDF_90. This study prefers to avoid occasional large prediction errors rather than minimize the overall average error, so the model with the lower RMSE is considered better. Therefore, SPTN_ANN_EDF_80 is considered the best-performing model.
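The retransformation step above can be sketched minimally as follows, assuming Equation (16) inverts a [0, 1] min-max normalization; the SPT-N back-transform of Equation (17) would follow the same pattern, with its own forward transform inverted:

```python
# Sketch of Equation (16): map a normalized prediction back to real units,
# assuming a min-max normalization was applied during preprocessing.
def retransform_minmax(s_pred, s_min, s_max):
    """Back-transform a [0, 1] prediction to the variable's original range."""
    return s_pred * (s_max - s_min) + s_min

# Hypothetical groundwater-level example, normalized over the range [1.2, 4.8] m.
print(retransform_minmax(0.5, 1.2, 4.8))
```

The retransformed value is what enters the RMSE and MAE calculations, so the error scores reported below are in the variables' real units (m, blows, %).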
Overall, the machine learning model with EDF approach shows lower RMSE score by 0.041, 0.046, 1.302, and 1.561 for ground surface elevation, groundwater level, SPT-N value and percent finer than 0.075 mm sieve, respectively, compared to the direct XY coordinate approach.
The performance of the classification models can be evaluated from the prediction accuracy presented in Figure 7. The EDF, XY, and EDFB models all show high prediction accuracy. All models predict USCS-based soil classification with the lowest accuracy at 96.39% (SC_SVM_EDF_80), which is satisfactory. This result shows that all modeling approaches can be applied interchangeably for USCS-based soil classification in the study area.
A detailed look at the performance of each best-performing machine learning model from the EDF, XY, and EDFB approaches is provided by Figure 8, which compares the predicted values with the observed values from the test data. The GSE_GPR_EDF_90 model for predicting ground surface elevation shows acceptable predictions, with the highest error not exceeding 0.3 m. The highest errors in the GSE_GPR_XY_NOPCA and GSE_GPR_EDFB predictions are slightly above 0.4 m and 0.5 m, respectively, showing a better result for the GSE_GPR_EDF_90 model. The next models analyzed are GWL_GPR_EDF_90, GWL_ANN_XY_NOPCA, and GWL_GPR_EDFB, constructed to predict the groundwater level. The highest prediction error of all three models does not exceed 0.6 m; however, GWL_ANN_XY_NOPCA and GWL_GPR_EDFB show larger prediction errors and are considered worse than GWL_GPR_EDF_90. For the SPT-N value predictors, both SPTN_ANN_EDF_80 and SPTN_ANN_XY_NOPCA performed quite well for SPT-N values under 10 blows, with performance gradually decreasing as the SPT-N value increases. The highest errors identified from the test data are 31 and 41 blows (N-value) for SPTN_ANN_EDF_80 and SPTN_ANN_XY_NOPCA, respectively. The prediction performance of the SPTN_SVM_EDFB model is less satisfying, as it produces higher prediction errors even for low SPT-N values (<10 blows); its highest prediction error on the test data is 36 blows. From this result and the RMSE-MAE comparison, SPTN_ANN_EDF_80 is considered the better model for predicting SPT-N, with high reliability for lower SPT-N values up to about 10 blows and low reliability for higher values. The FINER_GPR_EDF_NOPCA model shows overall good performance for predicting the percent finer than 0.075 mm sieve.
The identified highest prediction error does not exceed 13%. This performance is better than that of the FINER_ANN_XY_NOPCA and FINER_GPR_EDFB models, whose highest errors are 20.3% and 46.7%, respectively. Therefore, the FINER_GPR_EDF_NOPCA model is considered better for predicting percent finer than 0.075 mm sieve.
A different approach was applied to analyze the performance of the classification models for predicting USCS-based soil classification. First, the confusion matrices presented in Figure 9 were examined. They clearly show that the models with the EDF, XY, and EDFB modeling approaches all provide satisfactory soil classification results. The SC_ANN_EDF_80 model misclassified only one data point, classifying an ML soil class as CL. The SC_ANN_XY_NOPCA model misclassified two data points, classifying an SM soil class as ML and an ML soil class as CL, and the SC_ANN_EDFB model misclassified two data points, classifying SM and CL soil classes as ML.
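The confusion-matrix reading described here can be sketched with hypothetical USCS labels (scikit-learn's `confusion_matrix`, rows ordered by the `labels` argument):

```python
# Confusion-matrix sketch for hypothetical USCS classifications.
from sklearn.metrics import confusion_matrix

labels = ["SM", "ML", "CL"]
y_true = ["SM", "SM", "ML", "ML", "CL", "CL"]
y_pred = ["SM", "SM", "ML", "CL", "CL", "CL"]  # one ML point misread as CL

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)  # rows = actual class, columns = predicted class
```

Each off-diagonal entry counts one kind of misclassification (here, the single ML-predicted-as-CL point), which is exactly how the one- and two-point errors above are read off Figure 9.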

Precision, recall, and F1 score, as presented in Figure 10, reinforce the result obtained from the confusion matrices. All three performance indicators show high scores (>95%) for the soil classes SM, ML, and CL. These high scores indicate that all machine learning models perform balanced classification: they predict all classes equally well, without a tendency to classify some classes well and others poorly. With slightly better performance, the SC_ANN_EDF_80 model is considered the better choice for predicting USCS-based soil classification in the study area. The best-performing models for the EDF, XY, and EDFB approaches and their numbers of PCs after the dimensionality reduction using PCA are listed in Table 4. Overall, the superiority of the EDF model in predicting geotechnical properties and soil classification can be seen from the RMSE and MAE scores and the prediction accuracy of the chosen models. For all designated target attributes, the EDF model shows lower prediction error and higher performance than the XY and EDFB approaches. These results indicate that the spatial autocorrelation represented by EDF affects the performance of machine learning models for spatial data prediction.

Table 4. Best-performing machine learning models and algorithms with the EDF, XY, and EDFB approaches for predicting each target, based on RMSE score and prediction accuracy.

[Table 4 body omitted: columns are Model ID, Number of PCs or Features, and Algorithm, with rows grouped by target beginning with Ground Surface Elevation.]

Effect of Dimensionality Reduction on Machine Learning Training Speed
This section discusses the effect of dimensionality reduction on machine learning training speed. Dimensionality reduction with PCA was performed prior to the training process; as explained earlier, the thresholds were set so that the final PCs explain 80% and 90% of the variance. The training completion time, in seconds, for several machine learning models is presented in Figure 11, which compares the training completion time of the best-performing model for each target with other models that use the same MLA but a different number of features resulting from the PCA application. A rising trend in training completion time as the number of features increases can clearly be seen for every target. Although the relationship between the number of features and the increase in training time could not be precisely characterized, this result shows that dimensionality reduction reduces the time needed to complete training.
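The timing comparison described here can be sketched roughly as follows, with illustrative data and a GPR regressor as stand-ins; absolute times depend entirely on hardware and data size:

```python
# Rough training-time comparison: full feature set vs. PCA-reduced set.
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(2)
X = rng.random((300, 60))  # hypothetical high-dimensional EDF features
y = X[:, 0] + 0.1 * rng.standard_normal(300)

# Keep the PCs explaining >= 90% of the variance.
X_red = PCA(n_components=0.90).fit_transform(X)

t0 = time.perf_counter()
GaussianProcessRegressor().fit(X, y)
t_full = time.perf_counter() - t0

t0 = time.perf_counter()
GaussianProcessRegressor().fit(X_red, y)
t_red = time.perf_counter() - t0

print(f"full: {t_full:.3f}s  reduced ({X_red.shape[1]} PCs): {t_red:.3f}s")
```

Measuring both fits with `time.perf_counter` mirrors the per-model timing comparison reported in Figure 11; with real, spatially correlated EDF features the variance concentrates in far fewer PCs than with this random stand-in, so the speedup would be more pronounced.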


Effect of Hyperparameter Optimization on Machine Learning Model Performance
The machine learning training-validation process with hyperparameter optimization was carried out after the dimensionality reduction process. Figure 12 shows an example actual-versus-predicted plot for the percent finer than 0.075 mm sieve, visualizing the effect of hyperparameter tuning and of optimizing the number of PCs. The distribution of data points is closer to the central diagonal line in Figure 12b, the prediction result of the optimized model, than in Figure 12a, that of an unoptimized model. In addition, better performance is indicated by the RMSE and MAE scores, which decreased from 3.0359 to 2.1242 (RMSE) and from 1.0939 to 0.56282 (MAE). This result shows that appropriate hyperparameter optimization can significantly improve machine learning model performance.

(a) (b) Figure 12. Visualization of EDF model optimization effect by tuning the GPR algorithm hyperparameters for target class percent finer than 0.075 mm sieve (FINER model); (a) predicted vs. actual plot for unoptimized GPR algorithm; (b) predicted vs. actual plot for optimized GPR.

Performance of Various Algorithms on the Geotechnical Subsurface Model
This section analyzes the visualization of the prediction results of the machine learning models in two-dimensional space (the X-Y plane). This analysis is needed to understand the behavior of each MLA in generating the geotechnical subsurface model. It was performed by evaluating the predicted values of each target in the unobserved spaces between the boreholes; these values were examined to decide whether the machine learning predictions are sensible. A large difference between the predicted value in an unobserved space and the values at adjacent observed points could indicate low machine learning model performance.
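A hedged sketch of this grid-querying step: each query point on the X-Y plane is encoded by its Euclidean distances to the observed boreholes (the EDF features), mirroring the encoding used at training time. The borehole coordinates, target values, and the kNN stand-in model below are all hypothetical:

```python
# Querying a trained model over an X-Y grid using EDF features.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical borehole locations and an observed target at each (e.g., GSE in m).
boreholes = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
targets = np.array([5.0, 7.0, 6.0, 8.0])

def edf(points, holes):
    """Distance from each query point to every borehole (one feature per hole)."""
    return np.linalg.norm(points[:, None, :] - holes[None, :, :], axis=2)

model = KNeighborsRegressor(n_neighbors=2).fit(edf(boreholes, boreholes), targets)

# Predict on a coarse grid between the boreholes; implausible jumps relative to
# adjacent observed values would flag a non-sensible model.
gx, gy = np.meshgrid(np.linspace(0, 10, 5), np.linspace(0, 10, 5))
grid = np.column_stack([gx.ravel(), gy.ravel()])
preds = model.predict(edf(grid, boreholes))
print(preds.min(), preds.max())
```

Checking that the gridded predictions stay within a plausible band around the adjacent observed values is the programmatic counterpart of the visual peak-and-depression inspection described below.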
For the ANN-based regression models (the GSE, GWL, SPTN, and FINER models), an obvious distinction can be noticed in Figure 13c-1–c-4. The geotechnical subsurface models generated by the ANN algorithm tend to have more peaks or depressions filling the spaces between the boreholes in the predicted-value contour maps. The predicted values at these locations differ noticeably from the values at the adjacent observed data, especially for the SPT-N prediction. The GPR- and ANN-based models show little difference in the SPT-N value formation along the depth, as can be seen in Figure 13b-3,c-3. The GPR model in Figure 13b-3 behaves similarly to the ANN model, with some peaks in the unobserved spaces close to the observed data points, such as the peak close to C-BH-5 and the peak between B-BH-6, B-BH-9, and B-BH-12. The predicted SPT-N values at both peaks differ greatly (by more than 20 N) from the values at the closest observed points, which makes the prediction not sensible. The GPR model for predicting percent finer than 0.075 mm sieve shows the most sensible result, as its predicted values in unobserved areas are closer to the values at observed points; for example, the predicted value at elevation −4 m around A-BH-1 is 32%, while the value at the adjacent observed point is 32.3%. For the soil classification models, all algorithms generate quite similar results, as shown in Figure 13a-5–c-5. The generated models follow the formation of the percent finer than 0.075 mm sieve, wherein values lower than 50% are classified as SM and values greater than 50% as ML and CL. The SC_kNN_EDF_NOPCA model can be employed directly for the prediction of soil classification.
The analysis provided in this section demonstrates that trained models with low RMSE or MAE scores on the test data are not guaranteed to generate sensible spatial predictions. It is important to visualize the spatial predictions to examine their sensibility; machine learning performance for spatial prediction needs to be analyzed comprehensively instead of directly choosing the model with the best score in the testing process. The final models chosen in this study for each target are (1) GSE_GPR_EDF_90; (2) GWL_GPR_EDF_90; (3) SPTN_SVM_EDF_80; (4) FINER_GPR_EDF_NOPCA; and (5) SC_ANN_EDF_NOPCA. Some sectional views of the geotechnical subsurface models generated in this study are presented in Figure 14.

Conclusions
This study discusses the subsurface modeling of geotechnical properties in a study area. Modeling was performed using machine learning methods constructed from popular machine learning algorithms (SVM, GPR, ANN, and kNN). The targets consisted of ground surface elevation, groundwater level, geotechnical properties such as the SPT-N value and the percent finer than 0.075 mm sieve, as well as USCS-based soil classification. Subsurface modeling was performed by incorporating spatial autocorrelation into the machine learning model, since closer points tend to have similar properties. This spatial autocorrelation is represented by generating the Euclidean distance field (EDF) as the new features for the machine learning model, replacing the direct XY coordinates as the location information of a query point. The results of the study are summarized as follows:


• The Euclidean distance field (EDF) can replace raw XY coordinates as spatial input features, as it represents the existing spatial autocorrelation.
• Applying PCA to the high-dimensional EDF features improves the machine learning training speed, as it reduces the number of features for the learning process as well as the computational cost.
• Hyperparameter optimization clearly improves performance by finding the optimum hyperparameters for each algorithm. The optimization of the FINER_GPR_EDF_NOPCA model improved the RMSE score from 3.0359 to 2.1242 and the MAE score from 1.0939 to 0.56282.
• The best-performing machine learning models with the EDF approach show lower RMSE scores by 0.041, 0.046, 1.302, and 1.561 for ground surface elevation, groundwater level, SPT-N value, and percent finer than 0.075 mm sieve, respectively, compared to the direct XY coordinate approach.
• The analysis of the MLAs based on geotechnical subsurface contours shows that the model with the lowest prediction error on the test data is not guaranteed to be appropriate for spatial prediction; in this study, the ANN model is less appropriate for geotechnical subsurface modeling.
• Machine learning models covering a wider and deeper subsurface area for more advanced purposes can be constructed in future research by conducting a more adequate soil investigation for the machine learning dataset.