The objectives of this study were (1) to conduct a comprehensive analysis, using the Python programming language, of the interactions between predictive variables in concrete composition and their impact on compressive strength, in order to gain in-depth insights into the relationships between these variables, and (2) to compare the advantages of ML techniques with traditional approaches for analyzing concrete properties and predicting compressive strength, considering specific models such as support vector regression (SVR), gradient boosting (GB), and artificial neural networks (ANNs) (Braga et al. 2020; Chaabene et al. 2020).
Jupyter Notebook
The Jupyter Notebook (version 6.4.12), an open-source web application, was selected as the development environment for implementing the ML models. It offers an interactive computing environment that seamlessly integrates code, documentation, and visualizations into a unified workflow (Wang and Zeller 2020). The Python programming language (version 3.9.13) and ML techniques were used to investigate the complex interactions between different variables (e.g., water, cement, fly ash, blast furnace slag, superplasticizer, coarse aggregate, fine aggregate, and curing time) and their influence on the compressive strength of concrete. Recent studies have shown that Jupyter Notebook provides flexibility for data exploration, pre-processing, modeling, and evaluating model performance (Paixão et al. 2022, 2023), in addition to streamlining the documentation of the process, enabling clear and cohesive communication of the analysis stages, decision-making, and results.
Database
An initial data set was supplemented with information extracted from other articles to enhance the diversity and robustness of the learning process (Achong and Gunter 2021; Da Silva and Silva 2022; Feng et al. 2020; Kamath et al. 2022; Paixão et al. 2022). The final data set consists of 1234 records on the compressive strength of concrete, which were used for training the ML algorithms. This data set provided comprehensive and representative information on the properties of concrete and their compressive strength values. An existing database was used to ensure that the research was based on a representative and diverse sample (Brownlee 2020), thereby making the results more robust and generalizable. Another advantage of this approach is that it allows for external validation of the model and results; this external validation serves as an independent check that strengthens the reliability of the study.
Furthermore, utilizing a pre-existing database was crucial for time and resource efficiency, as collecting data can be time-consuming and costly. By using this data set to train an ML algorithm, it was possible to explore the patterns and relationships between the eight input variables (water, cement, fly ash, blast furnace slag, superplasticizer, coarse aggregate, fine aggregate, and curing time) and the output variable (compressive strength). Each concrete variable has interdependent relationships, as demonstrated in Table 1.
Table 1
Characteristics of the variables that make up concrete
Variable | Characteristics |
Cement | The quantity of cement in the concrete mix (measured in kg/m³). |
Blast furnace slag | The amount of blast furnace slag in the concrete mix (measured in kg/m³). Blast furnace slag is a by-product of iron and steel production, and it can be used as a partial substitute for cement in concrete. |
Fly ash | The amount of fly ash in the concrete mix (measured in kg/m³). Fly ash is a by-product of coal combustion and can also be used as a partial substitute for cement in concrete. |
Water | The amount of water in the concrete mix (measured in kg/m³). Water is essential for cement hydration, which is the process of hardening. |
Superplasticizer | The amount of superplasticizer in the concrete mix (measured in kg/m³). Superplasticizers are chemical additives that reduce the water needed in concrete without compromising its workability. |
Coarse aggregate | The quantity of coarse aggregate in the concrete mix (measured in kg/m³). Coarse aggregate is typically crushed stone or gravel, giving volume and strength to the concrete. |
Fine aggregate | The amount of fine aggregate in the concrete mix (measured in kg/m³). Fine aggregate is usually sand and fills the gaps between the coarse aggregate particles. |
Test age | The age of the concrete at the time of testing (measured in days). Concrete strength increases with age, making it important to know the age of the concrete when interpreting test results. |
Concrete compressive strength | The compressive strength of concrete, measured in MPa. Compressive strength refers to the ability of concrete to resist compression under a load. |
Source: Adapted from Helene and Andrade (2007) and Helene and Silva Filho (2011). |
The algorithm undergoes a training process in which it learns from the training data, adjusting its parameters to create a model capable of making predictions or decisions based on the acquired knowledge. To assess the algorithm's performance, the data were divided into training and test sets in an 80:20 ratio. Before training, an initial exploratory analysis was conducted to evaluate data quality using descriptive statistics, which is crucial for understanding the nature of the data and identifying any potential obstacles to effective model training.
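The exploratory check and the 80:20 split described above can be sketched as follows. This is a minimal illustration using a small synthetic stand-in for the concrete data set (the real 1234-record database is not reproduced here); column names are assumptions based on the variables listed in the text.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the concrete data set (100 synthetic rows)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "cement": rng.uniform(100, 550, 100),
    "water": rng.uniform(120, 250, 100),
    "age": rng.integers(1, 365, 100),
    "strength": rng.uniform(5, 80, 100),
})

# Initial exploratory analysis via descriptive statistics
print(df.describe())

X = df.drop(columns="strength")
y = df["strength"]

# 80:20 train/test split, as described in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
print(len(X_train), len(X_test))  # 80 and 20 rows
```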
Attribute Selection
To identify the best attributes from data set X based on the target variable Y, we employed the “SelectKBest” feature selection technique from the sklearn.feature_selection package (Chandrashekar and Sahin 2014). The selection criterion used was the F regression (f_regression), which measures the linear correlation between each attribute and the target variable. After the analysis, studies were conducted with two different scenarios: one with k = 8, including the predictor variables water, cement, fly ash, blast furnace slag, superplasticizer, coarse aggregate, fine aggregate, and curing time; and another with k = 6, in which fly ash and fine aggregates were removed. This comparison allowed us to assess how the analysis performs when the number of predictive variables analyzed varies.
This attribute selection strategy simplifies the model by removing variables that contribute little valuable information to enable more efficient and interpretable modeling. The selection process, carried out by SelectKBest, prioritizes the most relevant variables, enhancing the model’s ability to identify patterns and make more accurate predictions on unseen data. By focusing on the most significant attributes in explaining the variability of the target variable, it is possible to improve the model’s effectiveness.
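The selection procedure described above can be sketched as follows, here for the k = 6 scenario. The data are synthetic placeholders (the real predictor matrix is not reproduced), but the `SelectKBest`/`f_regression` calls match the scikit-learn API named in the text.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical matrix with 8 predictor columns (water, cement, etc.)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
# Synthetic target strongly correlated with the first two columns
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# k = 6 scenario: keep the six attributes with the highest F scores
selector = SelectKBest(score_func=f_regression, k=6)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)         # (200, 6): two attributes removed
print(selector.get_support())  # boolean mask of retained columns
```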
Linear Regression
Linear regression is employed to model the relationship between a set of independent variables and a dependent variable (Chein 2019), and this study utilized a flexible approach to capture complex relationships. The linear regression model was implemented using the “LinearRegression” class from the “sklearn.linear_model” package. The model was trained on the data set X, containing the independent variables, and the dependent variable y.
The quality of the model’s fit was assessed using two metrics: the coefficient of determination (R²) and the root mean square error (RMSE). The former measures the proportion of the variance in the target variable explained by the model, while the latter quantifies the average magnitude of the prediction errors in the units of the target variable. For evaluation purposes, the results of model training were printed. The R² was calculated using the “score” method of the “LinearRegression” class, and the RMSE was obtained by applying the “sqrt” function from the “numpy” package to the output of the “mean_squared_error” function from the “sklearn.metrics” package.
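The fitting and evaluation steps above can be sketched as follows. The data are synthetic (the study's data set is not reproduced), but the `LinearRegression`, `score`, `mean_squared_error`, and `numpy` `sqrt` calls are the ones named in the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data with a strong linear signal plus small noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(150, 3))
y = 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(scale=0.3, size=150)

model = LinearRegression().fit(X, y)

# R² via the score method; RMSE via mean_squared_error + numpy sqrt
r2 = model.score(X, y)
rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}")
```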
Support Vector Machine for Regression
The parameters for a support vector machine model and a cross-validation object were defined. Subsequently, a “GridSearchCV” object was created to conduct a grid search over the SVR parameters using cross-validation (Smola and Schölkopf 2004). The “GridSearchCV” object was then applied to the training data, Xtrain and ytrain, to search the SVR parameter grid and fit the best model to the training data (Cervantes et al. 2020). The best estimator, retrieved via the “grid.best_estimator_” attribute, represents the model that exhibited the highest performance in the cross-validation, as determined by the scoring metric specified when configuring the “GridSearchCV” object.
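The grid search described above can be sketched as follows. The parameter grid, cross-validation settings, and data here are illustrative assumptions (the study's actual grid is not given); the `GridSearchCV`, `SVR`, and `best_estimator_` usage matches the scikit-learn API named in the text.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic training data standing in for Xtrain / ytrain
rng = np.random.default_rng(2)
X_train = rng.uniform(-1, 1, size=(120, 4))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=120)

# Illustrative parameter grid; the study's actual grid is not specified
param_grid = {"C": [1, 10], "gamma": ["scale", 0.1], "kernel": ["rbf"]}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

grid = GridSearchCV(SVR(), param_grid, cv=cv,
                    scoring="neg_root_mean_squared_error")
grid.fit(X_train, y_train)

# Model with the best cross-validated score under the chosen metric
best_model = grid.best_estimator_
print(grid.best_params_)
```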
Gradient Boosting
Another method used in this study was GB, an ML technique designed for regression and classification problems. This particular approach generates a prediction model as an ensemble of weak prediction models, typically represented by decision trees (Zhu et al. 2023). The parameters employed and their respective functions are as follows:
• n_estimators: This parameter represents the number of boosting stages. Since GB is fairly robust to overfitting, a larger number generally leads to better performance;
• max_depth: This parameter determines the maximum depth of the individual regression estimators. The maximum depth restricts the number of nodes in the tree, and adjusting this parameter aims to optimize performance, with the ideal value depending on the interaction of the input variables;
• min_samples_split: This parameter specifies the minimum number of samples required to split an internal node;
• learning_rate: This parameter refers to the learning rate, which shrinks the contribution of each tree by the learning_rate value. There exists a trade-off between learning_rate and n_estimators;
• loss: This parameter denotes the loss function to be optimized, where ‘squared_error’ refers to the mean squared error.
The model was trained using the GradientBoostingRegressor class from the sklearn.ensemble package. The trained model was then used to make predictions on the X test data set, and the RMSE of these predictions was calculated (Cherkassky and Ma 2004).
Artificial Neural Networks
To apply ANN, we first identified the predictor variables (X) and the output variable (y) and then divided the data into training and test sets. The architecture of the neural network was adjusted for optimization by considering the following parameters:
• solver = ‘adam’: This refers to the weight optimization solver. ‘adam’ is a gradient-based stochastic optimization algorithm particularly well-suited for large data sets.
• hidden_layer_sizes = (32, 64, 32): This represents the number of neurons in the hidden layers. The model has three hidden layers, with the first layer containing 32 neurons, the second 64 neurons, and the third 32 neurons.
• n_iter_no_change = 200: This indicates the maximum number of epochs allowed without an improvement in the training process before training is stopped.
• random_state = 1: This is the seed of the random number generator, which is used to initialize the weights randomly.
• max_iter = 5000: This represents the maximum number of iterations for the solver.
• learning_rate_init = 0.0001: This signifies the initial learning rate for the ‘adam’ solver.
• verbose = True: This setting enables the printing of the training progress.
Once the training is complete, the ANN model is ready to make predictions.
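The configuration above can be sketched with scikit-learn's `MLPRegressor`, using the hyper-parameters listed in the text. The data are synthetic placeholders, and `verbose` is set to False here only to suppress the per-epoch progress output.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictor variables X and output y
rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(200, 4))
y = 3 * X[:, 0] + X[:, 1] + 0.05 * rng.normal(size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Architecture and hyper-parameters as listed in the text
ann = MLPRegressor(
    solver="adam",
    hidden_layer_sizes=(32, 64, 32),
    n_iter_no_change=200,
    random_state=1,
    max_iter=5000,
    learning_rate_init=0.0001,
    verbose=False,  # the study uses True to print training progress
)
ann.fit(X_train, y_train)

# Once training is complete, the model is ready to make predictions
predictions = ann.predict(X_test)
print(predictions.shape)  # (40,)
```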