Machine learning and interactive GUI for concrete compressive strength prediction

Concrete compressive strength (CS) is a crucial performance parameter in concrete structure design. Reliable strength prediction reduces costs and time in design and prevents material waste from extensive mixture trials. Machine learning techniques can address structural engineering challenges such as CS prediction. This study used Machine Learning (ML) models to enhance the prediction of CS, analyzing 1030 experimental CS data points ranging from 2.33 to 82.60 MPa drawn from previous research databases. The ML models included both non-ensemble and ensemble types. The non-ensemble models were regression-based, evolutionary, neural network, and fuzzy-inference-system models, while the ensemble models consisted of adaptive boosting, random forest, and gradient boosting. There were eight input parameters: cement, blast-furnace slag, aggregates (coarse and fine), fly ash, water, superplasticizer, and curing days, with the CS as the output. Comprehensive performance evaluations included visual and quantitative methods and k-fold cross-validation to assess the study's reliability and accuracy. A sensitivity analysis using Shapley-Additive-exPlanations (SHAP) was conducted to better understand how each input variable affects CS. The findings showed that the Categorical-Gradient-Boosting (CatBoost) model was the most accurate predictor during the testing stage, with the highest coefficient of determination (R2) of 0.966 and the lowest Root-Mean-Square-Error (RMSE) of 3.06 MPa. The SHAP analysis showed that the age of the concrete was the most critical factor for predictive accuracy. Finally, a Graphical User Interface (GUI) was offered so designers can predict concrete CS quickly and economically instead of relying on costly computational or experimental tests.

1. To introduce diverse ML models, including non-ensemble (regression-based, genetic programming, neural network, fuzzy-inference-system) and ensemble models (adaptive boosting, random forest, gradient boosting, categorical boosting), and assess their effectiveness in predicting concrete CS.
2. To use Bayesian Optimization (BO) for hyperparameter tuning to significantly enhance predictive accuracy.
3. To conduct a detailed performance analysis across low, moderate, and high CS ranges.
4. To perform a sensitivity analysis using Shapley-Additive-exPlanations (SHAP) to determine the most influential parameters affecting CS prediction, offering insights for improving predictive accuracy.
5. To close the gap between complex computational predictions and practical real-world applications by creating a user-friendly GUI, highlighting the practical value of the study and its comprehensive approach to integrating theoretical model development with professional usability.

Research objectives
This research builds upon prior work by employing machine learning models to predict the CS of concrete across a broad spectrum of data, varying from 2.33 to 82.60 MPa. The main objective was to evaluate the efficacy of different ML models for predicting the CS of concrete. Figure 1 shows the flowchart of the methodological approach used in this study to predict the compressive strength of concrete. Initially, data from 1030 datasets are collected, including various components like cement and aggregates, and their properties are analyzed through histograms and heatmaps. Then, two types of predictive models are applied: non-ensemble models and ensemble models. The models' performance is evaluated by comparing actual and predicted values, using metrics like R2 and RMSE, and through k-fold cross-validation. Sensitivity analysis is conducted, and the results are benchmarked against previous studies to identify the best predictive model. This approach aims to facilitate the researcher's ability to gauge the effect of different variables on the prediction of CS in a more time-efficient and cost-effective manner compared to extensive experimental studies.

Database collection
To develop a model for predicting outcomes and to analyze the data statistically, researchers can use data from laboratory experiments or gather information from previously published studies. For this research, a substantial dataset consisting of 1030 data points related to the CS of concrete was assembled by reviewing past scholarly articles: Song et al.19, Song et al.37, and Yeh57. The data analysis of the study focused on eight principal attributes, which were used as the input variables: cement (C), blast furnace slag (Slag), fly ash (FA), water (W), superplasticizer (SP), coarse aggregate (Cagg), fine aggregate (Fagg), and the number of days of curing (Age). These were all considered to predict the final compressive strength, the outcome variable. Table 2 provides a concise overview of the statistical description of the collected data, presenting a comprehensive summary of its characteristics. Each row refers to a distinct variable, while the columns contain specific statistical measures for these variables. Furthermore, the frequency distribution of the dataset is visually represented in Fig. 2 through histogram plots. These plots are invaluable for understanding the distribution patterns of each variable, such as normality, skewness, and the presence of outliers, which align with the statistics presented in Table 2. The x-axis represents each variable, while the y-axis indicates the frequency of occurrences. This visualization enables a thorough assessment of these variables. The general observations include:
• Most variables (i.e., X2, X3, X5, and X8) show a robust positive skewness, indicating a higher concentration of lower values and fewer higher values.
• The X4, X6, X7, and Y variables display more balanced distributions with central tendencies.
• Outliers are more prominent in features with positive skewness, where higher values occur less frequently.

Correlation analysis
Examining the correlation between variables is crucial for comprehending the connections between the input features and the target strength factor, as this analysis seeks to determine the most effective prediction model. The most widely used measure for this purpose is the Pearson correlation coefficient (r), which helps to quantify these relationships58,59. It can be calculated as the ratio of the covariance (cov) of two variables (x, y) to the product of their standard deviations, as represented in Eq. (1):

$r = \frac{\mathrm{cov}(x,y)}{\sigma_x\,\sigma_y} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$ (1)

where $\bar{x}$ and $\bar{y}$ are the means of the two variables x and y, and n is the number of data points. Figure 3 presents a heatmap that demonstrates the influence of each variable on all other variables. Notably, the strongest positive correlations with Y are observed for X1, X5, and X8, with r-values of 0.50, 0.37, and 0.33, respectively. This indicates that the CS of concrete is significantly influenced by adding cement, followed by superplasticizer, and finally by the number of days of curing, as evidenced by these higher values. Conversely, the strongest negative correlation, between X4 and X5 (−0.66), suggests an inverse relationship between water and superplasticizer. There is also a substantial negative correlation between water and the CS of concrete, with an r-value of −0.29. The remaining variables show weak correlations with the CS of concrete and with each other, indicating that these variables do not have strong linear relationships. The absence of entirely uncorrelated features implies that all eight input parameters are relevant and can be effectively employed in predicting the CS of concrete.
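As a quick numerical check of the Pearson formulation, the coefficient can be computed directly from the covariance and the standard deviations and compared against NumPy's built-in routine. The cement and strength values below are illustrative placeholders, not drawn from the study's database:

```python
import numpy as np

# Illustrative values only (not the study's 1030-point dataset)
cement = np.array([150.0, 200.0, 275.0, 350.0, 425.0, 500.0])  # kg/m3
strength = np.array([18.0, 24.0, 33.0, 41.0, 52.0, 60.0])      # MPa

# Pearson r = cov(x, y) / (sigma_x * sigma_y), as in Eq. (1)
cov_xy = np.mean((cement - cement.mean()) * (strength - strength.mean()))
r = cov_xy / (cement.std() * strength.std())

# Cross-check against NumPy's built-in correlation matrix
assert np.isclose(r, np.corrcoef(cement, strength)[0, 1])
print(round(r, 4))
```

Both routes give identical values because `np.corrcoef` evaluates exactly the ratio in Eq. (1).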
Figure 4 presents a scatterplot matrix that provides a comprehensive visual analysis of the eight input variables and their relationship with the output variable (Y). The matrix includes histograms on the diagonal, illustrating the distribution of each variable individually, as previously discussed. The off-diagonal cells in the matrix contain scatter plots that show the pairwise relationships between variables, each providing a visual representation of the correlation between two variables. Regarding positive linear relationships, the input parameters X1 and X5 exhibit a clear positive linear relationship with the output variable Y, suggesting that increases in X1 and X5 accompany increases in Y, indicating a direct relationship. In addition, the variables X3 and X5 exhibit a positive linear relationship, indicating that as the value of X3 increases, the value of X5 also tends to increase.
A strong negative linear relationship is evident between the input parameters X3 and X4, indicating that as X3 increases, X4 consistently decreases. Furthermore, the correlation between X4 and X5 is highly negative, suggesting that higher values of X4 correspond to lower values of X5. In contrast, the X2 input exhibits weak or unclear linear associations with the majority of the other variables, suggesting a low level of correlation. Similar to X2, the input X6 does not exhibit distinct linear patterns with the majority of the other variables. Regarding the input parameter X7, while there is a minor correlation with X5, overall X7 does not exhibit significant linear associations with most other variables. The input X8 exhibits clear clusters when plotted against other input parameters, indicating the existence of sub-groups within the data. Finally, a subtle non-linear pattern can be observed in the relationship between inputs X6 and X7; to gain a better understanding of the underlying relationship, additional investigation may be necessary.

Data normalization
Some machine learning models may not function optimally when there is a variation in the scale of input data.
As indicated in Table 2, the cement content ranges between 102 and 540 kg/m3, while the superplasticizer ranges between 0.0 and 32.2%, highlighting the disparate magnitudes of different input features. To address this, data normalization or rescaling is employed, which adjusts all input variables to a uniform scale. This process utilizes the max-min mapping function, as outlined in Eq. (2):

$X_n = \frac{X - X_{min}}{X_{max} - X_{min}}$ (2)

In this equation, $X_n$ represents the normalized data, $X_{min}$ and $X_{max}$ denote the minimum and maximum values of each input variable, and X refers to the original dataset undergoing rescaling. The primary benefit of data rescaling lies in its ability to expedite computations and enhance the accuracy and stability of the machine learning-based prediction model.
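A minimal sketch of the max-min mapping in plain NumPy; the two columns (cement and superplasticizer) use the endpoint values quoted from Table 2 plus an arbitrary midpoint row for illustration:

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each column of X to [0, 1] via Eq. (2): Xn = (X - Xmin) / (Xmax - Xmin)."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

# Illustrative rows: [cement (kg/m3), superplasticizer (%)]
X = np.array([[102.0, 0.0],
              [321.0, 16.1],
              [540.0, 32.2]])
Xn = min_max_normalize(X)
print(Xn)  # each column now spans 0.0 to 1.0
```

The same transformation is available as scikit-learn's `MinMaxScaler`, which additionally stores the fitted minima and maxima so unseen data can be rescaled consistently.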

Non-ensemble models
In this research, six different non-ensemble models were utilized, namely Multiple Linear Regression (MLR), Multiple Nonlinear Regression (MNLR), Support Vector Regression (SVR), Gene Expression Programming (GEP), Artificial Neural Networks (ANN), and Adaptive Neuro-Fuzzy Inference System (ANFIS). These models were developed using the Python programming environment within the ANACONDA software, MATLAB, and SPSS programs. Concise explanations of each model are provided in the subsequent sub-sections.

MLR model
The MLR model is an extension of simple linear regression to predict a single output variable using multiple input variables60-62. This method assumes a linear relationship between the inputs and the output. It is particularly valuable in situations where various factors influence the response variable, allowing for the assessment of the relative contribution of each predictor. The MLR model is straightforward, interpretable, and widely used in various fields for its ability to provide insights into relationships between variables, and it is effective for prediction and for understanding the underlying data structure.
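A brief sketch of fitting an MLR model with scikit-learn; the synthetic eight-feature data and the generating coefficients below are placeholders, not the study's dataset or its Eq. (14):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
# Synthetic stand-in for the eight mix-design inputs and the CS output;
# the generating coefficients are arbitrary illustrations
X = rng.uniform(0, 1, size=(200, 8))
true_coef = np.array([30.0, 12.0, 8.0, -15.0, 10.0, 2.0, 1.0, 20.0])
y = 10 + X @ true_coef + rng.normal(0, 1, size=200)

mlr = LinearRegression().fit(X, y)
# The fitted coefficients recover the generating ones up to noise
print("R2 on training data:", round(mlr.score(X, y), 3))
```

The fitted `mlr.coef_` values directly quantify each predictor's contribution, which is the interpretability advantage noted above.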

MNLR model
Nonlinear models are straightforward, easy to understand, and effective for making predictions63. These models are versatile in terms of the range of average outcomes they can express. However, they might not be as adaptable as linear models when describing different data types. Nonetheless, if the nonlinear model is well-suited for a particular situation, it could be more efficient, use fewer parameters, and be simpler to understand. This clarity is often due to how parameters relate to significant, meaningful processes. The process of using the MNLR model involves several steps: firstly, identifying the variable to be predicted; secondly, creating a nonlinear equation that represents how this variable is affected by other variables; thirdly, inputting initial guesses for the parameters of this equation, with the Levenberg-Marquardt method being the chosen estimation technique; and finally, initiating the MNLR analysis to generate and review the results in the output log.
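The steps above can be sketched with SciPy's `curve_fit`, whose `method='lm'` selects the Levenberg-Marquardt estimator mentioned in the text. The power-law form and its coefficients are hypothetical illustrations, not the study's MNLR equation:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical nonlinear form for illustration only (not the paper's Eq. 15):
# strength = a * cement^b * age^c
def power_model(X, a, b, c):
    cement, age = X
    return a * cement**b * age**c

rng = np.random.default_rng(0)
cement = rng.uniform(150, 500, 100)
age = rng.uniform(1, 365, 100)
y = 0.05 * cement**1.1 * age**0.3 + rng.normal(0, 0.5, 100)

# Step 3: initial parameter guesses; method='lm' = Levenberg-Marquardt
params, _ = curve_fit(power_model, (cement, age), y,
                      p0=[0.1, 1.0, 0.5], method='lm')
print(np.round(params, 3))
```

A reasonable `p0` matters here: Levenberg-Marquardt is a local method, so poor starting guesses can stall it far from the best-fit parameters.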

SVR model
The SVR model is an extension of Support Vector Machines (SVMs) used for regression problems64,65. SVR effectively finds the best-fit hyperplane in a high-dimensional space that can predict continuous values, maintaining a balance between the complexity of the model and the amount of error tolerated. It is especially useful for datasets with many features and is known for its robustness against overfitting. This study uses the linear kernel to model relationships between the input variables and the target variable linearly. The linear kernel in SVR essentially represents a straight line in the feature space. It assumes that the relationship between the input features and the target variable is linear, meaning that a change in the input features results in a proportional change in the predicted value.
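A minimal linear-kernel SVR sketch in scikit-learn, trained on synthetic stand-in data rather than the concrete database; the `C` and `epsilon` values are illustrative, not the study's tuned settings:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(150, 8))   # synthetic stand-in for the 8 mix inputs
y = 20 + X @ np.array([25.0, 10.0, 6.0, -12.0, 8.0, 2.0, 1.0, 15.0]) \
    + rng.normal(0, 1, 150)

# kernel='linear' assumes a proportional input-output relationship,
# as described above; C trades off flatness against training error,
# epsilon sets the width of the no-penalty tube
svr = SVR(kernel='linear', C=10.0, epsilon=0.5).fit(X, y)
print("R2:", round(svr.score(X, y), 3))
```

Because the kernel is linear, `svr.coef_` is available and can be read like regression coefficients.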

GEP model
The GEP model was developed to create computer programs and is similar to Genetic Algorithms (GAs) and Genetic Programming (GP)61,66. Figure 5 shows a flowchart of the GEP model. The GEP model follows a structured flow that begins with creating an initial chromosome population, representing potential solutions. These chromosomes are then expressed as computer programs. Following this, each program is executed, and its performance is evaluated based on a predefined fitness function. If the termination condition is met, the process ends; otherwise, it iterates to produce a new generation. This involves selecting the best-performing programs to continue to the next cycle and using genetic operators to create a new generation of chromosomes. These genetic operators include mutation, inversion, one-point recombination, two-point recombination, gene recombination, and insertion sequence (IS) transposition. The cycle repeats, continually evaluating the fitness of programs and generating new ones until the best possible solution is found or another termination condition is met, at which point the model concludes.

ANN model
ANNs are a cornerstone of machine learning, inspired by the structure and function of the human brain. They are particularly adept at identifying complex, non-linear relationships within large datasets. ANNs consist of interconnected nodes or neurons, which collectively learn to perform tasks like regression and classification by considering examples. Their flexibility and adaptability make them suitable for various applications, from image recognition to natural language processing67-69. A typical multilayer perceptron in an ANN consists of three layers: an input layer, one or more hidden layers, and an output layer, as illustrated in a three-layered architecture. In predicting new data sets, a model employs numerous neurons organized into a network to process information. These neurons are interconnected through weights and biases, crucial determinants of a machine learning model's precision. Networks can be categorized into basic ANNs with a single hidden layer or deep neural networks with multiple layers. Utilizing additional hidden layers augments the ANN's ability to identify the connections between inputs and outputs, thereby enhancing model accuracy. Figure 6 shows that the variable Y, represented by the CS of concrete, was set as the output from the ANN model, while the eight variables were assigned as the inputs to the ANN model.
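A minimal multilayer-perceptron sketch using scikit-learn's `MLPRegressor`; the single 32-neuron hidden layer and the synthetic data are illustrative, not the study's tuned architecture:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(300, 8))   # eight normalized inputs, as in the study
# Non-linear synthetic target (illustrative only)
y = 30 * X[:, 0] + 15 * np.sin(3 * X[:, 7]) + 10 * X[:, 4] + rng.normal(0, 0.5, 300)

# One hidden layer of 32 neurons; deeper stacks would be declared as
# e.g. hidden_layer_sizes=(32, 32) for a two-hidden-layer network
ann = MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000,
                   random_state=0).fit(X, y)
print("R2:", round(ann.score(X, y), 3))
```

Normalizing the inputs first, as described in the data-normalization section, is what keeps training of such networks stable.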

ANFIS model
The ANFIS model, initially introduced by Jang70 and subsequently elaborated upon by Jang et al.71, constitutes a universal approximation methodology. In this capacity, it can approximate any real continuous function defined on a compact set with arbitrary precision. The ANFIS structure closely resembles an ANN, featuring five layers, each comprised of nodes, including rules. Notably, the Sugeno fuzzy model, as proposed by Takagi and Sugeno72, is frequently employed in ANFIS. A prototypical rule set for a first-order Sugeno fuzzy model, embodying two fuzzy If-Then rules, can be succinctly expressed as follows:

Rule 1: If x is $A_1$ and y is $B_1$, then $f_1 = p_1 x + q_1 y + r_1$
Rule 2: If x is $A_2$ and y is $B_2$, then $f_2 = p_2 x + q_2 y + r_2$

Here, ($A_1$, $A_2$, $B_1$, $B_2$) represent fuzzy sets, and ($p_i$, $q_i$, $r_i$) denote parameters associated with the consequent part of each rule. This formulation describes the basic configuration of a Sugeno fuzzy model as it applies to an ANFIS. Figure 7 depicts the equivalent ANFIS framework. In this ANFIS setup, nodes within the same level perform similar functions. The structure is composed of five layers: the first is the input layer, followed by the rule layer, then the normalization layer, the consequent layer, and finally, the output layer. For an in-depth explanation of the ANFIS framework, one can refer to Chang and Chang73.
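The five-layer Sugeno inference flow can be sketched for two rules as below; all membership-function centers, widths, and consequent parameters (p, q, r) are arbitrary illustrative values, not a trained ANFIS:

```python
import numpy as np

def gauss_mf(x, c, s):
    """Gaussian membership function with center c and width s."""
    return np.exp(-0.5 * ((x - c) / s) ** 2)

def sugeno_two_rules(x, y):
    """First-order Sugeno inference with two If-Then rules (illustrative parameters)."""
    # Layers 1-2: fuzzification and rule firing strengths (product t-norm)
    w1 = gauss_mf(x, 0.0, 1.0) * gauss_mf(y, 0.0, 1.0)   # Rule 1
    w2 = gauss_mf(x, 1.0, 1.0) * gauss_mf(y, 1.0, 1.0)   # Rule 2
    # Layer 3: normalization
    n1, n2 = w1 / (w1 + w2), w2 / (w1 + w2)
    # Layer 4: linear consequents f_i = p_i*x + q_i*y + r_i
    f1 = 2.0 * x + 1.0 * y + 0.5
    f2 = -1.0 * x + 3.0 * y + 1.0
    # Layer 5: weighted sum output
    return n1 * f1 + n2 * f2

print(sugeno_two_rules(0.5, 0.5))  # -> 2.0 (symmetric point, equal rule weights)
```

At (0.5, 0.5) both rules fire equally by symmetry, so the output is the plain average of the two consequents; training an ANFIS amounts to tuning the membership and consequent parameters shown here as constants.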

Ensemble models
In this research, four different ensemble models were utilized, namely Adaptive Boosting (AdaBoost), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and Categorical Gradient Boosting (CatBoost). The development of these models was carried out using the Python programming environment within the ANACONDA program. Concise explanations of each model are provided in the subsequent sub-sections.

AdaBoost model
Boosting is a well-known algorithm in machine learning, first suggested by Schapire75. Subsequently, Freund76 developed AdaBoost. This method focuses on combining several basic classifiers created during training into a single strong classifier, and it enhances the training process to improve the formation of these basic classifiers. The AdaBoost model is an ensemble technique that combines multiple weak learners to form a strong learner. In regression, it sequentially fits a model and adjusts the weights of instances based on the errors of the previous model, focusing more on difficult-to-predict instances. The AdaBoost model is often used to improve the accuracy of decision trees and is known for its simplicity and effectiveness in reducing bias and variance.
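A minimal AdaBoost regression sketch in scikit-learn, whose default weak learner is a depth-3 decision tree; the data and estimator count are illustrative, not the study's tuned configuration:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(300, 8))
# Interaction term makes single weak trees insufficient on their own
y = 40 * X[:, 0] * X[:, 7] + 10 * X[:, 4] + rng.normal(0, 0.5, 300)

# Weak learners (depth-3 trees by default) are fitted sequentially, with
# instance weights increased on the hardest-to-predict samples
ada = AdaBoostRegressor(n_estimators=100, random_state=0).fit(X, y)
print("R2:", round(ada.score(X, y), 3))
```

Each round reweights the training instances, so later trees concentrate on the residual difficult cases, which is the mechanism described above.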

RF model
The RF model is an ensemble learning method that operates by constructing many decision trees during training and outputting the average prediction of the individual trees. Breiman77 first developed the RF model, which combines the ideas of randomly selecting features and grouping data samples together. The RF model is widely used for both classification and regression tasks. It is particularly well-known for its ability to handle large datasets with higher dimensionality, and it provides estimates of feature importance, which can be very insightful.
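A short RF sketch in scikit-learn that also shows the feature-importance estimates mentioned above; the data and settings are illustrative placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(300, 8))
# Only features 0 and 7 drive the target in this synthetic example
y = 40 * X[:, 0] + 20 * X[:, 7] ** 2 + rng.normal(0, 0.5, 300)

# Many trees on bootstrapped samples with random feature subsets;
# predictions are averaged across the forest
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
# Importances for the two informative features should dominate
print(np.round(rf.feature_importances_, 3))
```

The importance vector sums to 1, so it can be read directly as the relative share of split quality attributable to each input.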

XGBoost model
The XGBoost model implements gradient-boosted decision trees designed for speed and performance. It is a highly flexible and versatile algorithm known for its efficiency in handling sparse data and its ability to perform well on a wide range of regression and classification problems. The XGBoost model has been used successfully in numerous machine learning competitions due to its scalability and ability to produce highly competitive predictive models62.
The choice to use XGBoost for this research is based on its useful characteristics. It uses regularization to avoid overfitting and second-order gradients for quicker convergence. It can handle missing data when finding splits and uses subsampling of the training data (stochastic gradient boosting) to increase diversity and reduce overfitting. Shrinkage, which scales each tree's contribution by a learning rate, is also employed to minimize overfitting. Furthermore, XGBoost is designed with system-level enhancements such as parallel processing and cache optimization, which make it both fast and capable of handling large datasets.
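The boosting mechanics described above, shrinkage via a learning rate and row subsampling, can be sketched with scikit-learn's `GradientBoostingRegressor` as a stand-in (the xgboost package itself is a separate dependency); all settings are illustrative, not the study's tuned values:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, size=(400, 8))
y = 30 * X[:, 0] + 20 * X[:, 3] * X[:, 4] + rng.normal(0, 0.5, 400)

# learning_rate applies shrinkage (each tree's contribution is scaled down);
# subsample < 1.0 gives stochastic gradient boosting via row subsampling,
# both of which curb overfitting, as described in the text
gbr = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                subsample=0.8, max_depth=3,
                                random_state=0).fit(X, y)
print("R2:", round(gbr.score(X, y), 3))
```

XGBoost adds explicit L1/L2 regularization terms, second-order gradients, and sparsity-aware split finding on top of this basic recipe.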

CatBoost model
Categorical Gradient Boosting (CatBoost) represents a recent advancement in gradient-boosting algorithms, designed to handle categorical features while minimizing information loss78. CatBoost distinguishes itself through two key characteristics: the utilization of ordered boosting to mitigate target leakage and its effectiveness, particularly on small datasets. Within the CatBoost model, a categorical feature value is replaced by a smoothed target statistic computed from the samples that precede it in a random permutation σ:

$\hat{x}_{\sigma_p,k} = \frac{\sum_{j=1}^{p-1} \mathbb{1}\!\left[x_{\sigma_j,k} = x_{\sigma_p,k}\right] y_{\sigma_j} + a\,P}{\sum_{j=1}^{p-1} \mathbb{1}\!\left[x_{\sigma_j,k} = x_{\sigma_p,k}\right] + a}$

where the indicator $\mathbb{1}[x_{\sigma_j,k} = x_{\sigma_p,k}]$ equals 1 when the condition $x_{\sigma_j,k} = x_{\sigma_p,k}$ is met, P represents a predetermined prior value, and a is the coefficient that determines the significance of P.
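The ordered target statistic underlying CatBoost's categorical handling can be sketched as below; this is a simplified single-permutation illustration of the idea, not the library's implementation:

```python
import numpy as np

def ordered_target_encode(cats, y, prior, a=1.0, seed=0):
    """Encode a categorical column using only 'past' samples in a random
    permutation (the ordered-boosting idea), smoothed by prior P with weight a.
    Simplified single-permutation illustration, not CatBoost's implementation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(cats))
    enc = np.empty(len(cats))
    for pos, idx in enumerate(perm):
        past = perm[:pos]                          # samples seen before idx
        same = past[cats[past] == cats[idx]]       # earlier samples, same category
        enc[idx] = (y[same].sum() + a * prior) / (len(same) + a)
    return enc

cats = np.array([0, 1, 0, 1, 0, 1])
y = np.array([10.0, 50.0, 12.0, 48.0, 11.0, 52.0])
enc = ordered_target_encode(cats, y, prior=y.mean())
print(np.round(enc, 2))
```

Because each sample's encoding only ever sees targets of samples earlier in the permutation, the sample's own target never leaks into its feature value, which is the target-leakage mitigation noted above.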

Hyperparameters tuning
To optimize the hyperparameter settings in the current study, BO is employed. This technique contrasts with traditional methods like Grid-Search by initially modeling the prior distribution of the objective function and iteratively refining the search within the hyperparameter space for the optimal configuration. Initially, each model's range and prior distribution of hyperparameters are specified. BO then identifies the configuration that maximizes performance within this predefined hyperparameter space. This approach enhances the efficiency of hyperparameter tuning, minimizing redundant experiments and accelerating the identification of the most effective hyperparameter combinations79-82. Figure 8 illustrates a framework example of BO-XGBoost.
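The BO loop described above, a surrogate model plus an acquisition function, can be sketched with a Gaussian-process surrogate and expected improvement over a one-dimensional toy objective; this is a conceptual illustration, not the study's tuning setup:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Toy objective standing in for validation error as a function of one
# hyperparameter (e.g., a scaled learning rate); true minimum at x = 0.7
def objective(x):
    return (x - 0.7) ** 2

def expected_improvement(x_cand, gp, y_best):
    """EI acquisition for minimization: trades off predicted mean vs. uncertainty."""
    mu, sigma = gp.predict(x_cand.reshape(-1, 1), return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 1, 3)                  # a few initial random evaluations
y_obs = objective(X_obs)

for _ in range(10):                           # BO iterations
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                  alpha=1e-6, normalize_y=True)
    gp.fit(X_obs.reshape(-1, 1), y_obs)       # refit the surrogate
    grid = np.linspace(0, 1, 201)
    x_next = grid[np.argmax(expected_improvement(grid, gp, y_obs.min()))]
    X_obs = np.append(X_obs, x_next)          # evaluate the most promising point
    y_obs = np.append(y_obs, objective(x_next))

best = X_obs[np.argmin(y_obs)]
print("best hyperparameter found:", round(best, 3))
```

Unlike Grid-Search, each new evaluation here is chosen where the surrogate expects the most improvement, which is why BO typically needs far fewer objective evaluations.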

Comprehensive performance evaluation of models
The dataset was methodically partitioned into two distinct sets: training and testing. The training set is used to fit the model, and the testing set is used to evaluate the model's predictive performance. The split ratio was carefully chosen to balance the need for sufficient training data against the necessity of a robust evaluation. Hence, the collected dataset was divided into 80% for training and 20% for testing. This ensures that the model is trained on a representative sample of the data while still being subject to rigorous testing on data it has not previously encountered. Two commonly used approaches, quantitative and visual methods, are employed to evaluate and compare the adopted ML models84.
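The 80/20 partition can be reproduced with scikit-learn's `train_test_split`; the arrays below merely match the dataset's shape (1030 rows, 8 features) with random placeholder values:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.uniform(0, 1, size=(1030, 8))    # same shape as the study's dataset
y = rng.uniform(2.33, 82.60, size=1030)  # CS range reported in the study

# 80% training / 20% testing, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)
print(len(X_train), len(X_test))  # -> 824 206
```

Fixing `random_state` keeps the split reproducible, so every model in a comparison sees the same held-out samples.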

Visual methods
Visual methods include scatter plots, violin boxplots, and Taylor diagrams. These methods offer a quick and informative way to compare models, providing insights into predictions across statistical measures such as the maximum, minimum, median, and quartiles, although they may not fully capture the overall performance and ranking of models. Scatter plots are used to visualize the relationship between two variables. Violin plots provide the full distribution of the data, which is crucial when comparing models because it shows not only the central tendency (i.e., mean or median) but also the spread and density of model performance metrics. Taylor diagrams are a specialized graphical representation that quantifies the similarity between actual and predicted values. These diagrams plot the correlation, the standard deviation, and the root mean square error of predictions on a single chart, providing a comprehensive view of a model's accuracy, variability, and overall performance compared to the actual observations. In the formulas for the quantitative indices, $x_i$ denotes the actual CS values, $\bar{x}$ is the mean of the actual CS dataset, and $y_i$ is the predicted CS value.

k-fold cross-validation
K-fold cross-validation (Fig. 9) is a widely used method to check the performance of ML models. It involves dividing the dataset into several parts, typically ten, known as "folds." In this tenfold system, the dataset splits into ten subsets. For each test, nine groups are used to train the model, and one group is kept for testing. This approach is suitable for understanding variability within the data and does not take too much time to compute62. Each of the ten subsets becomes the test set once, with the others being used for training. A reliable measure of model accuracy is obtained by averaging the results from all ten tests. This way of testing helps ensure effective training of the adopted ML models and reduces the chance of missing out on essential data in the dataset.
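The tenfold procedure can be sketched with scikit-learn's `KFold` and `cross_val_score`; the model choice and synthetic data are illustrative stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(300, 8))
y = 30 * X[:, 0] + 15 * X[:, 7] + rng.normal(0, 0.5, 300)

# Ten folds: each fold serves as the test set once, the other nine train
kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestRegressor(n_estimators=100, random_state=0),
                         X, y, cv=kf, scoring='r2')
print("mean R2 over 10 folds:", round(scores.mean(), 3))
```

Averaging the ten fold scores, as done here, is exactly the aggregate reported per model in the cross-validation comparison.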

SHAP feature importance analysis
To analyze the sensitivity and interpret ML models on both a wide-scale and a more detailed level, researchers use the SHAP approach, which draws on principles from cooperative game theory47. The SHAP method was employed to gauge the comparative impact of the input variables on the prediction process. As an advanced method within the realm of explainable artificial intelligence, SHAP helps clarify the complex interactions between the input variables and the model predictions, as shown in Fig. 10. It offers critical insights by identifying which features are most influential on predictions and how they modify the predicted results87,88. Equation (13) shows that the Shapley value $\phi_i$ for feature i is determined by calculating the average marginal contribution of that feature across all possible permutations of features:

$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\left[v(S \cup \{i\}) - v(S)\right]$ (13)

In this equation, N represents the set of all features, S represents a subset of features that excludes feature i, |S| denotes the cardinality of set S, v(S) represents the model's prediction when only the features in S are considered, and v(S ∪ {i}) represents the model's prediction when feature i is added to S89.
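The Shapley formulation can be evaluated exactly for a small toy model by enumerating all feature subsets; the additive "model" below is a hypothetical illustration for which the Shapley values are known in closed form:

```python
import numpy as np
from itertools import combinations
from math import factorial

def shapley_values(model_value, n_features):
    """Exact Shapley values: average marginal contribution of feature i
    over all subsets S of the remaining features, weighted as in Eq. (13)."""
    N = list(range(n_features))
    phi = np.zeros(n_features)
    for i in N:
        others = [j for j in N if j != i]
        for size in range(len(others) + 1):
            for S in combinations(others, size):
                weight = (factorial(len(S)) * factorial(n_features - len(S) - 1)
                          / factorial(n_features))
                phi[i] += weight * (model_value(set(S) | {i}) - model_value(set(S)))
    return phi

# Toy additive "model": v(S) = sum of fixed per-feature contributions in S
contrib = {0: 5.0, 1: -2.0, 2: 3.0}
v = lambda S: sum(contrib[j] for j in S)
print(shapley_values(v, 3))  # additive model: phi recovers each contribution
```

This brute-force enumeration is exponential in the number of features, which is why the SHAP library uses model-specific approximations (e.g., TreeSHAP for tree ensembles) in practice.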

Hyperparametric configuration of ML models
The optimal form of the equations developed by the MLR and MNLR models was the result of extensive testing. Considering the developed MLR model, the optimal linear form is given in Eq. (14); likewise, the nonlinear form derived from the MNLR model is given in Eq. (15). On the other hand, a trial-and-error approach was used to determine the hyper-parameters, architectures, and functions of the remaining ML models during the training stage. The model with the highest average prediction accuracy across the entire training set was chosen based on these results. The tuned hyper-parameter settings for the developed ML models are shown in Table 3.

Visual methods
The predictive accuracy of the adopted models for concrete's CS was evaluated using visual methods during the testing stage: scatter plots, violin boxplots, and Taylor diagrams. Figure 11 shows ten scatter plots of the predicted versus actual values for the training and testing stages across the adopted models. The distribution of most values along the black line, which represents a perfect prediction with 0% error, clearly demonstrates a high degree of precision. Furthermore, the figures include green and red boundary lines that indicate a ±10% margin of error; this range contains the majority of predictions, demonstrating the models' resilience and reliability. When evaluating the performance of individual models such as MLR, MNLR, and SVR, the data points are widely dispersed and deviate significantly from the equality line, with many points falling beyond the ±10% deviation lines in both the training and testing stages. Given that the R2-values for these models were below 0.62, it can be concluded that they exhibited poor predictive performance for the CS of concrete. The GEP model shows a scattered distribution of data points, with some clustering along the line of equality; approximately half the points lie above or below the ±10% margin. The R2 values of the GEP model were 0.81 and 0.82 for the training and testing stages, respectively, suggesting a moderate level of performance with a noteworthy degree of variability. Similar to the GEP model, the ANFIS model shows an acceptable level of performance, slightly surpassing the GEP model, with R2 values of 0.87 and 0.83 in the training and testing stages, respectively. Lastly, the ANN model outperformed all the non-ensemble models, achieving R2 values of 0.97 and 0.90 in the training and testing stages, respectively; this is evident from the fact that most data points are within a range of ±10%.
Considering the performance of the ensemble models, the AdaBoost model had the lowest predictive performance, with R2 of 0.83 and 0.80 in the training and testing stages, respectively; its performance is close to that of the GEP model. In contrast, the CatBoost model had the highest accuracy, with R2 values of 0.987 and 0.966 in the training and testing stages, respectively. Following the CatBoost model, the XGBoost and RF models present high predictive performance, with tightly clustered data points and high R2 values (i.e., R2 > 0.98 in training and R2 > 0.93 in testing). In summary, the ensemble models demonstrate superiority over the non-ensemble models regarding the visual distribution of predicted versus actual values, and CatBoost is the best-performing model among all those implemented.
Figure 12 shows the violin boxplots of the actual and predicted values during the testing stage (TS). This figure highlights the dataset's minimum, maximum, median, 25th and 75th percentile quartiles, and overall distribution for the actual and predicted values. It is evident that the model whose distribution shape most closely matches the actual dataset is the CatBoost model. Additionally, Fig. 13 shows a comparative analysis of the adopted models using the Taylor diagram in the testing stage. A Taylor diagram visually represents the degree of correspondence between a pattern and observed data: it shows the correlation coefficient along one axis, the standard deviation as the distance from the center, and the centered Root-Mean-Square-Difference (RMSD) from the reference point (i.e., the actual data). Closer proximity of points to the reference indicates superior model performance; models positioned along the arc with higher correlation coefficients (approaching 1.0) and smaller distances from the reference (lower RMSD) perform better. The CatBoost and XGBoost models were found to be the closest to the actual point, whereas the MNLR, MLR, and SVR models were the furthest.

Quantitative methods
Tables 4 and 5 show the seven performance indices calculated for the studied ML models for the training and testing stages, respectively. The tables show that the ensemble models were significantly better at making predictions than the non-ensemble models.
Table 3. Optimized primary hyperparameters for the adopted ML models.
Considering the non-ensemble models, the ANN consistently outperforms the others in both the training and testing stages: it obtains the highest R2 and WI values and the lowest error indices, including RMSE, MAE, and MAPE. Meanwhile, the GEP and ANFIS models exhibit a satisfactory level of performance, characterized by relatively high R2 and WI values and moderate error indices. Conversely, based on the lowest R2 and WI values and higher errors, the MLR, MNLR, and SVR models were the worst predictive models in both stages.
Among the ensemble models, the CatBoost model demonstrated superior performance during the training stage, achieving near-perfect scores in R2 and WI, and it maintained excellent performance during the testing stage. The XGBoost and RF models also exhibit strong predictive performance, as indicated by their high R2 and WI values and low error metrics. Although the AdaBoost model demonstrates satisfactory performance, it ranks slightly behind the other ensemble models, especially during the testing stage. In general, the ensemble models tend to exhibit superior performance compared to the non-ensemble models. The CatBoost model is the best-performing model overall, followed closely by the XGBoost and RF models. Among the non-ensemble models, the ANN exhibits superior performance, making it a strong candidate in the absence of the ensemble models.

k-fold cross-validation
Utilizing k-fold cross-validation reduces the risk of models fitting too specifically to one part of the dataset, thereby providing a more accurate evaluation of their effectiveness. This method is typically used to refine models, aiming for a more precise measurement of performance and lowering the risk of overfitting to a single train-test split. The outcomes from this method support the trustworthiness and accuracy of the evaluated models. Figure 14 compares the adopted models using the average performance index scores (R2, WI, RMSE, SI, MAE, MAPE, and MBE) across 10 folds combined with the Bayesian optimization (BO) process. The findings in Fig. 14 indicate that, among the non-ensemble methods, the ANN had the highest R2 and WI scores, followed by the ANFIS and GEP models. Among the ensemble methods, the CatBoost, XGBoost, and RF models demonstrated superior scores in these indices, indicating better overall performance. When evaluating RMSE, SI, MAE, and MAPE, both the ANN and the ensemble models demonstrated improved performance by achieving lower values. In addition, although certain models displayed either positive or negative biases, the ensemble models, except for the AdaBoost model, generally exhibited MBE values closer to zero. This suggests a decrease in bias in the predictions developed by the ensemble models.
Upon examining performance through k-fold cross-validation, it was found that the ensemble models outperformed the others in terms of the various indices. The CatBoost model consistently demonstrated high R2 and WI values close to 1.0, as well as RMSE, SI, MAE, MAPE, and MBE values close to 0.0. Based on the average performance across the ten folds, the CatBoost model stands out as the best-performing model, surpassing the alternatives in terms of accuracy, precision, and reliability.
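The evaluation loop behind Fig. 14 can be sketched as follows. This is an illustrative 10-fold procedure on synthetic data: scikit-learn's GradientBoostingRegressor stands in for the tuned CatBoost model (the CatBoost and BO libraries are not assumed available here), and the eight input columns and toy target are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, (300, 8))  # 8 synthetic mixture/curing inputs
y = 20 + 40 * X[:, 7] + 15 * X[:, 0] + rng.normal(0, 2, 300)  # toy CS target

kf = KFold(n_splits=10, shuffle=True, random_state=0)
r2s, rmses = [], []
for train_idx, test_idx in kf.split(X):
    # Stand-in for the BO-tuned CatBoost model; refit on each fold
    model = GradientBoostingRegressor(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    r2s.append(r2_score(y[test_idx], pred))
    rmses.append(np.sqrt(mean_squared_error(y[test_idx], pred)))

print(f"mean R2 = {np.mean(r2s):.3f}, mean RMSE = {np.mean(rmses):.2f} MPa")
```

Averaging the fold scores, as in Fig. 14, gives a performance estimate that is far less sensitive to any single train-test split.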

Performance of models across CS ranges
Table 6 provides a comparative analysis of the adopted ML models across three data ranges for the CS: (1) low CS (2.33–25.02 MPa), (2) moderate CS (25.08–55.02 MPa), and (3) high CS (55.06–82.60 MPa). It also focuses on how each model's accuracy and error indices vary with the complexity of the dataset.
In the low CS range, the CatBoost, XGBoost, and RF models showed remarkable performance, reaching the highest R2 values and the lowest RMSE and MAPE values; these results indicate strong predictive accuracy and reliability within this range. The CatBoost model was superior to all other ML models, with an R2 value of 0.854, an RMSE value of 2.37 MPa, and a MAPE value of 9.80%. Although the ANN achieved satisfactory performance, with an R2 value of 0.731, an RMSE of 3.26 MPa, and a MAPE of 11.10%, its error indices were slightly higher than those of the ensemble models, except for the AdaBoost model. In contrast, non-ensemble models such as MLR, MNLR, and SVR exhibited low R2 values and high RMSE and MAPE values, suggesting limited predictive ability within this range. For the moderate and high CS ranges, the same patterns were observed, with the ensemble models outperforming the others. The CatBoost model again demonstrated exceptional performance, highlighting its reliable predictive accuracy across the different CS ranges. The XGBoost and RF models performed well, especially in the moderate and high CS ranges compared with the low CS range. Although the ANN shows good performance, particularly in the low and moderate CS ranges, it is surpassed by the ensemble models, with the exception of the AdaBoost model. These findings suggest that for applications requiring precise ML predictions across varying CS ranges, the CatBoost and XGBoost models should be considered due to their consistent and superior performance.
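The per-range evaluation in Table 6 amounts to masking the test pairs by the actual CS band and rescoring each band separately. A minimal sketch, using the study's three band boundaries but synthetic actual/predicted pairs:

```python
import numpy as np

def metrics_by_range(actual, predicted):
    """Score predictions separately within the study's three CS bands (MPa)."""
    bands = {"low": (2.33, 25.02), "moderate": (25.08, 55.02),
             "high": (55.06, 82.60)}
    out = {}
    for name, (lo, hi) in bands.items():
        m = (actual >= lo) & (actual <= hi)   # select pairs by ACTUAL value
        a, p = actual[m], predicted[m]
        rmse = np.sqrt(np.mean((a - p) ** 2))
        r2 = 1 - np.sum((a - p) ** 2) / np.sum((a - a.mean()) ** 2)
        mape = 100 * np.mean(np.abs((a - p) / a))
        out[name] = {"n": int(m.sum()), "R2": r2, "RMSE": rmse, "MAPE": mape}
    return out

rng = np.random.default_rng(1)
actual = rng.uniform(2.33, 82.60, 500)
predicted = actual + rng.normal(0, 3.0, 500)  # hypothetical model error
for band, scores in metrics_by_range(actual, predicted).items():
    print(band, scores)
```

Note that R2 computed within a narrow band is lower than R2 on the full range for the same absolute error, since the within-band variance of the actual values is smaller; this is one reason band-wise comparisons are more demanding.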

Rank analysis
A rank analysis is performed to assess the overall performance of the adopted models based on the performance indices calculated in Tables 4 and 5. For each index, a rank of 1 represents the best performance and a rank of 10 represents the worst. The overall ranking of each model is determined by summing its individual ranks: the model with the highest total rank is considered the least effective, whereas the model with the lowest total rank is considered the most effective. Table 7 illustrates the results of the rank analysis in both the training (TR) and testing (TS) stages. The CatBoost model stands out above all ten models with an overall rank score of 20. XGBoost holds the 2nd overall rank with a score of 33. The RF model is ranked 3rd with a score of 41, followed by the ANN, ANFIS, and GEP models in the 4th, 5th, and 6th positions with scores of 52, 72, and 82, respectively. Finally, the AdaBoost, MLR, SVR, and MNLR models are ranked 7th, 8th, 9th, and 10th, with notably high scores of 101, 114, 121, and 134, respectively.
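The rank-analysis procedure can be sketched in a few lines: rank the models on each index (rank 1 = best), sum the ranks, and take the smallest total as the winner. The three-model score table below is illustrative (only CatBoost's R2, RMSE, and MAE match values reported in this study; the rest are hypothetical placeholders).

```python
# Higher-is-better flag for each performance index
indices = {"R2": True, "WI": True, "RMSE": False, "MAE": False, "MAPE": False}

# Hypothetical testing-stage scores for three of the ten models
scores = {
    "CatBoost": {"R2": 0.966, "WI": 0.99, "RMSE": 3.06, "MAE": 2.27, "MAPE": 9.0},
    "XGBoost":  {"R2": 0.950, "WI": 0.98, "RMSE": 3.60, "MAE": 2.60, "MAPE": 10.5},
    "MLR":      {"R2": 0.780, "WI": 0.93, "RMSE": 7.80, "MAE": 6.10, "MAPE": 24.0},
}

totals = {m: 0 for m in scores}
for idx, higher_better in indices.items():
    # Sort so the best model on this index comes first
    ordered = sorted(scores, key=lambda m: scores[m][idx], reverse=higher_better)
    for rank, m in enumerate(ordered, start=1):  # rank 1 = best on this index
        totals[m] += rank

best = min(totals, key=totals.get)  # lowest total rank = most effective model
print(totals, "->", best)
```

In the full analysis the same summation runs over all seven indices and both stages for all ten models, producing the totals (20, 33, 41, ...) reported in Table 7.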

SHAP feature importance
Sensitivity analysis serves to examine the data more closely and to understand the significance of each input parameter in relation to the output62,90. The SHAP feature importance results were generated using the best predictive model (CatBoost). Figure 15 shows several dependency charts that highlight the conclusions drawn from the model. Each dot in a dependence plot represents a single dataset prediction: the y-axis designates the corresponding SHAP value, and the x-axis shows the feature value obtained from the feature matrix. The SHAP value denotes the degree to which the model's output for a given prediction is affected by knowing the value of a particular feature. The color mapping in each plot represents a second feature that may interact with the plotted feature. Figure 15a shows the SHAP value for X1 versus X1, colored by X5. As X1 increases, the SHAP values also increase, demonstrating a positive correlation; higher values of X1 have a more substantial positive impact on the model's prediction. In the color bar for X5, data points in red (higher X5 values) correspond to higher SHAP values for X1, suggesting that high X5 values enhance the positive effect of X1 on the prediction. Figure 15b illustrates the SHAP value for X2 versus X2, colored by X1. As X2 increases, the SHAP values increase, showing a positive relationship; higher X2 values have a greater positive impact on the model's output, and higher X1 values (red) are associated with higher SHAP values for X2, indicating that X1 enhances the positive impact of X2. In Fig. 15c, the SHAP value for X3 versus X3 is shown, colored by X5. The relationship between X3 and its SHAP values is unclear, showing scattered positive and negative impacts; higher X3 values do not consistently correlate with higher SHAP values, and there is no strong correlation between X5 values and the SHAP values for X3.
Figure 15d shows the SHAP value for X4 versus X4, colored by X2. As X4 increases, the SHAP values decrease, indicating a negative correlation; higher X4 values have a stronger negative impact on the model's prediction. Higher X2 values (red) correspond to higher SHAP values for X4, suggesting that X2 mitigates the negative effect of X4. Figure 15e presents the SHAP value for X5 versus X5, colored by X1. Increasing X5 leads to mixed SHAP values with both positive and negative impacts; higher X5 values do not consistently correlate with higher SHAP values, and higher X1 values (red) show varied impacts on the SHAP values for X5. Figure 15f shows the SHAP value for X6 versus X6, colored by X2. As X6 increases, the SHAP values generally decrease, indicating a negative correlation; higher X6 values have a negative impact on the model's prediction, while higher X2 values (red) correlate with higher SHAP values for X6, indicating that X2 mitigates the negative effect of X6. Figure 15g illustrates the SHAP value for X7 versus X7, colored by X8. As X7 increases, the SHAP values generally decrease, indicating a negative correlation; higher X7 values have a negative impact on the model's prediction, and higher X8 values (red) are associated with higher SHAP values for X7, suggesting that X8 mitigates the negative effect of X7. Finally, Fig. 15h shows the SHAP value for X8 versus X8, colored by X1. As X8 increases, the SHAP values increase, showing a positive relationship; higher X8 values have a greater positive impact on the model's prediction, and higher X1 values (red) are associated with higher SHAP values for X8, indicating that X1 increases the positive effect of X8. In summary, the inputs X1, X2, and X8 show positive correlations with their SHAP values, indicating that increases in these features generally enhance the model's predictions. In contrast, the inputs X4, X6, and X7 exhibit negative correlations with their SHAP values, indicating that increases in these features generally reduce the model's predictions. Meanwhile, the inputs X3 and X5 show mixed or less clear impacts on their SHAP values, indicating that their influence on the model's predictions is inconsistent.
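A defining property of SHAP values, which underlies every dependence plot above, is additivity: for each sample, the per-feature SHAP values sum to the model's prediction minus the average prediction. For a linear model with independent features this has a closed form, φᵢ = wᵢ(xᵢ − E[xᵢ]), which the sketch below verifies on a toy eight-input model (weights and data are hypothetical, and this is a simplified stand-in for the TreeSHAP computation used with CatBoost).

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(0, 1, (1000, 8))  # 8 standardized inputs, X1..X8
w = np.array([3.0, 2.0, 0.1, -1.5, 0.2, -1.0, -0.8, 4.0])  # hypothetical weights
f = lambda X: X @ w + 30.0       # toy linear "CS model"

# Exact SHAP values for a linear model with independent features:
#   phi_i(x) = w_i * (x_i - E[x_i])
phi = w * (X - X.mean(axis=0))   # shape: (n_samples, n_features)

# Additivity: per-sample SHAP values sum to f(x) - E[f(X)]
assert np.allclose(phi.sum(axis=1), f(X) - f(X).mean())
```

The sign structure mirrors the dependence plots: features with positive weights (here columns 1, 2, 8) show SHAP values rising with the feature, while negative-weight features show the opposite trend.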
Figure 16a presents a summary plot illustrating the distribution of SHAP values for all features based on the best predictive model (CatBoost). The x-axis gives the SHAP value, signifying the impact of each feature on the model's output, while the y-axis lists the inputs. Each dot on the plot represents a single instance, with color coding to indicate feature values: blue for low values and red for high values. Inputs X8 and X1, for instance, show the widest spread of SHAP values. Figure 16b depicts the mean absolute SHAP value for each input, highlighting the average impact of each input on the model's predictions. The higher the mean SHAP value, the more critical the input is to the model. According to this plot, feature X8 (i.e., the number of curing days) has the highest mean SHAP value, marking it as the most important input, followed by inputs X1, X4, X2, X5, X7, X6, and finally X3. Thus, the ranking of the inputs by importance from highest to lowest is X8, X1, X4, X2, X5, X7, X6, and X3.
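The global ranking in Fig. 16b is simply the mean absolute SHAP value per feature, sorted in descending order. The sketch below shows the computation on a hypothetical SHAP matrix (in the study, this matrix would come from the fitted CatBoost model; the per-feature scales here are chosen arbitrarily so that X8 dominates, as in Fig. 16b).

```python
import numpy as np

# Hypothetical SHAP value matrix, shape (n_samples, n_features) for X1..X8.
rng = np.random.default_rng(3)
scale = np.array([5.0, 3.5, 0.5, 4.0, 2.5, 1.0, 2.0, 8.0])  # X8 largest, X3 smallest
shap_values = rng.normal(0, 1, (500, 8)) * scale

# Global importance = mean |SHAP| per feature, sorted most-to-least influential
mean_abs = np.abs(shap_values).mean(axis=0)
ranking = ["X" + str(i + 1) for i in np.argsort(mean_abs)[::-1]]
print(ranking)
```

Unlike the signed dependence plots, this ranking discards direction: it measures how much each feature moves the prediction on average, regardless of whether the effect is positive or negative.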

Comparison with previous studies
The developed ML models were compared with recent similar studies that also aimed to predict the CS of concrete. Table 8 shows the comparative analysis with previous studies. Liu91, using an XGBoost model, reported an exceptional R2 of 0.999, indicating very high accuracy; however, the dataset size was relatively small (60 records), and the result may not generalize to larger datasets. Hongwei et al.19 found that a bagging regressor was the best predictive model, with a high R2 (0.950) but higher RMSE and MAE values, likely due to the small dataset size (98 records). Satish et al.92, with R2 = 0.950, RMSE = 3.06 MPa, and MAE = 2.13 MPa, performed well on a moderately large dataset (633 records).
For larger datasets, the present study developed the CatBoost model with R2 = 0.966, RMSE = 3.06 MPa, and MAE = 2.27 MPa, demonstrating strong performance on a large dataset (1030 records). Feng et al.93 developed an AdaBoost model that performed well, with an R2 of 0.940, an RMSE of 1.93 MPa, and an MAE of 1.43 MPa on a dataset of the same size (1030). Elhishi et al.88, with R2 = 0.910 on a large dataset (1030), showed lower accuracy and higher error metrics than the present study. Thus, while the studies with similar dataset sizes, such as Feng et al.93 and Elhishi et al.88, show good performance, the CatBoost model in the present study outperforms them in terms of R2.

Interactive graphical user interface (GUI)
Finally, to address the practical needs of designers in efficiently applying ML models, this section introduces a significant advancement. Although the complex requirements of database assembly, model training, and testing hinder the seamless adoption of ML in everyday design tasks, a novel solution has been crafted: a Python web application that integrates the model with optimized hyperparameters via an intuitive GUI. This GUI is specifically designed to predict the concrete CS and streamline the design process, as illustrated in Fig. 17. The predicted CS value is displayed directly by clicking the "Calculate" button. The CatBoost-based GUI also offers guidance features and tutorials to assist users with varying levels of expertise. Finally, validating the model with real-world data and scenarios will be essential to ensure its robustness and reliability in practical applications.
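The core of such a GUI is a single prediction function wrapping the fitted model, which the web front end calls when "Calculate" is pressed. The sketch below is purely illustrative: a scikit-learn regressor trained on synthetic data stands in for the optimized CatBoost model (which the real application would load from disk), and the argument names and feature order are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in for the optimized CatBoost model, trained here on synthetic data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (300, 8))
y = 20 + 40 * X[:, 7] + 15 * X[:, 0] + rng.normal(0, 2, 300)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

def predict_cs(cement, slag, fly_ash, water, superplasticizer,
               coarse_agg, fine_agg, age):
    """Return the predicted compressive strength (MPa) for one mixture."""
    features = np.array([[cement, slag, fly_ash, water, superplasticizer,
                          coarse_agg, fine_agg, age]])
    return float(model.predict(features)[0])

# A web front end (e.g. Streamlit or Flask) would collect the eight values
# from input widgets and call predict_cs() on the "Calculate" button press:
print(predict_cs(0.5, 0.2, 0.1, 0.4, 0.05, 0.6, 0.5, 0.9))
```

Keeping the model behind one function like this cleanly separates the trained predictor from the interface layer, so the GUI code never needs to know anything about the model internals.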

Figure 3. Pearson correlation of input and output variables.

Figure 4. Scatter pair plots matrix with interaction variables.

Figure 6. Inputs and output variables used for ANN model development.

Figure 13. Taylor diagram for the performance of the adopted ML models in the testing stage.

Figure 14. Comparison of average values of performance indices based on the BO + 10-fold cross-validation process.

Figure 16. Summary plot of SHAP values for the input variables based on the best predictive model (CatBoost).

Table 1. ML models adopted in the past literature.

Table 2. Descriptive statistics for the collected database.
Scientific Reports | (2024) 14:16694 | https://doi.org/10.1038/s41598-024-66957-3

The quantitative methods include seven performance indices: Determination Coefficient (R2), Willmott Index (WI), Root Mean Square Error (RMSE), Scatter Index (SI), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and Mean Bias Error (MBE). The ideal values for these indices are as follows: R2 and WI should ideally be 1, indicating perfect prediction accuracy, while RMSE, SI, MAE, MAPE, and MBE should ideally be 0, indicating no error in the predictions. In summary, a predictive model is ideal if its performance indicators are close to or exactly at these values. The equations for calculating these indices are presented in Eqs. (6–12)85.

Table 4. Estimated performance indices for the studied ML models in the training stage. *The bold values indicate the best predictive models.

Table 5. Estimated performance indices for the studied ML models in the testing stage. *The bold values indicate the best predictive models.

Table 6. Comparison of the adopted ML models for different CS ranges. *The bold values indicate the best predictive models across the ranges of CS.

Table 7. Rank analysis of the adopted models.

Table 8. Comparative analysis for prediction of the CS of concrete.