Data-Driven Insights into Controlling the Reactivity of Supplementary Cementitious Materials in Hydrated Cement

Supplementary cementitious materials (SCMs) play an essential role in sustainable construction due to their potential to reduce carbon emissions, promote circular economy principles, and enhance the properties of concrete. However, the inherent diversity of SCMs makes it challenging to predict their degree of reaction (DOR). This study applies machine learning techniques to predict DOR while exploring key parameters affecting it. Five machine learning models are utilized: linear regression, Gaussian process regression (GPR), decision tree regression, support vector machine and extreme gradient boosting, with GPR providing the most accurate and adaptable prediction. The study delves into the impact of various parameters on DOR, revealing their significance. Silica content emerges as the most critical, followed by particle size distribution, specific gravity, and water-to-cement (W/C) ratio. Optimizing DOR requires extending curing time, reducing particle size distribution, and considering optimal silica content and W/C ratio. This research emphasizes the importance of understanding the relationships between parameters and the DOR of SCMs, providing insights to enhance the efficiency of SCMs in cementitious systems through machine learning and data-driven analysis.

as water-to-cement (W/C) ratio, chemical composition, physical characteristics, and curing conditions can significantly impact the reactivity of SCMs, leading to variations in the properties and performance of cementitious matrices (Pacewska & Wilińska, 2020; Skibsted & Snellings, 2019). Therefore, comprehending the main properties of SCMs in hydrated cement is imperative in material selection and performance optimization within cementitious matrices.
SCMs are added to concrete to increase its performance and sustainability. Determining the DOR of SCMs is essential in foreseeing the characteristics and performance of the final products. Several key properties of SCMs influence their extent of dissolution in concrete. The properties of SCMs, such as particle size, specific surface area, amount of SCMs being used to replace Portland cement, chemical composition, W/C ratio, curing temperature, and curing time, can significantly impact their performance in cementitious matrices. For instance, the particle size and specific surface area of SCMs can affect the rate and extent of their reaction with cementitious materials. Smaller particle sizes and larger specific surface areas increase the contact area between the SCM and the cementitious materials, leading to a more significant DOR (Hallet et al., 2020; Mirzahosseini & Riding, 2015; Ndahirwa et al., 2022; Sanjuán et al., 2015). The chemical composition of SCMs also affects their compatibility with other materials in the matrix (Sabir et al., 2001; Sanjuán et al., 2015; Tironi et al., 2013). Suraneni et al. (2019) reported that SCMs with high silica, alumina, and calcium exhibit distinct characteristics regarding their utilization of calcium hydroxide and the amount of heat released. Additionally, the W/C ratio and curing conditions of the system can influence the DOR (Phung et al., 2021). Generally, a higher W/C ratio increases DOR (Escalante et al., 2001; Snellings et al., 2022). However, surpassing a certain level of W/C ratio can result in a more dilute cementitious system with increased particle distance, reducing the DOR (Navarrete et al., 2020). Finally, proper curing conditions, such as temperature and humidity, can improve the DOR by providing the necessary environment for chemical reactions to occur (de Azevedo Basto et al., 2022; Lothenbach et al., 2011; Snellings et al., 2022). The formation and development of hydration products within the cement matrix are influenced by its water content. As hydration progresses, the consumption of water decreases the internal relative humidity of the cement matrix, causing capillary pressure and shrinkage. Consequently, when the relative humidity is low, incomplete hydration may occur, leading to a lower DOR (Skibsted & Snellings, 2019).
Improving the efficiency of cementitious systems hinges on a profound understanding of their properties. Consequently, gaining insights through data-driven analysis becomes crucial, particularly in comprehending the fundamental properties of SCMs that influence the DOR. Leveraging large datasets from diverse sources offers the opportunity to uncover correlations between key SCM properties and their performance within cementitious matrices. Harnessing the power of machine learning (ML) methods further allows for thorough examination and comprehension of extensive data sets, thereby providing deeper insights into the underlying correlations between crucial SCM features and their performance in cementitious matrices. Previous research has already demonstrated the efficient utilization of ML models for parametric investigations, enabling accurate estimations of primary material properties that impact carbonation and compressive strength (Abuodeh et al., 2020; Chen et al., 2022). By adopting this approach, the predictability of SCM influence in hydrated Portland cement can be significantly enhanced by focusing on the major SCM properties affecting the DOR, ultimately resulting in an optimized model.
While several investigations on the DOR of SCMs have been conducted by employing microstructural analysis techniques such as X-ray diffraction (XRD) (Durdziński et al., 2017), scanning electron microscopy (SEM) (Pfingsten et al., 2018), and nuclear magnetic resonance (NMR) (Walkley & Provis, 2019), as well as different testing methods (i.e., selective dissolution (Kocaba et al., 2012) and modified R3 test (Ramanathan et al., 2022)), the existing research still has significant limitations. One of the primary concerns is the substantial variability in DOR observed due to factors such as the source and production process of the SCMs, necessitating independent investigation for reliable conclusions (Ndahirwa et al., 2022). Moreover, the lack of consensus on an appropriate testing method and a standard for evaluating the DOR of SCMs makes it challenging to compare results across studies (Durdziński et al., 2017; Li et al., 2018). Additionally, previous studies predominantly focused on specific characteristics of SCMs, such as their pozzolanic activity (Donatello et al., 2010; Snellings & Scrivener, 2016) or ability to improve concrete durability (Anurag et al., 2021; Ndahirwa et al., 2022), without a comprehensive assessment of their overall reactivity. Thus, further study is essential to highlight the DOR of SCMs in different contexts and develop robust methodologies for evaluating their performance.
This study aimed to identify the essential parameters that affect the DOR of SCMs. Accordingly, five ML methods were employed for predicting the DOR: linear regression, Gaussian process regression (GPR), decision tree (DT) regression, support vector machine (SVM) and extreme gradient boosting (XGBoost). The performance of each model was evaluated using various statistical methodologies. Subsequently, the most accurate, adaptable ML model was selected for a parametric investigation encompassing 22 parameters. The influence of input parameters was studied using the Shapley value. Moreover, the fundamental parameters were set and analyzed with existing theories to determine their potential effect on the DOR. These findings offer valuable insights for optimizing the DOR of SCMs in various applications.

Data Collection and Description
The experimental data collected to study the DOR in hydrated Portland cement considered various types of binders, encompassing a range of SCMs such as slag, fly ash, metakaolin, limestone, calcium sulfoaluminate cement, silica fume, magnesia-based cement, glass powder, calcined clay, calcium aluminate cement, and rice husk ash. Several factors affect the DOR, such as the W/C ratio, the oxide composition and proportions of Portland cement and SCMs, curing time and temperature, and physical properties such as particle size distribution, surface area, and specific gravity.
The dataset comprised 247 examples, with 22 input features and DOR as the output. Table 1 summarizes the statistical analysis of these inputs, detailing the units, minimum, maximum, average values, and standard deviations, which offers an essential insight into the range and variability of the features. Moreover, the accompanying histograms aim to illustrate the distribution density of each input, providing a preliminary, straightforward overview of the characteristics of the data. The full dataset is provided as a supplementary material for reference (Additional file 1).
However, the data collected for median particle size diameter (Dv50) were insufficient. To address this limitation, four techniques were employed to compensate for the missing Dv50 values, as described in Table 2. These techniques allowed for a thorough analysis of the impact of missing Dv50 values in machine-learning models.
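The contents of Table 2 are not reproduced in this text, so the exact four techniques cannot be restated here. As a minimal sketch, assuming three generic strategies (ignoring the Dv50 feature, dropping rows with a missing Dv50, and filling gaps with the mean of the observed values), missing entries could be handled as shown below. The record layout, field names, and numbers are hypothetical, and the mean fill in particular is an assumed illustration rather than a technique confirmed by the study.

```python
from statistics import mean

# Hypothetical records; None marks a missing Dv50 (values are illustrative only).
rows = [
    {"sio2": 35.0, "dv50": 12.0, "dor": 40.0},
    {"sio2": 52.0, "dv50": None, "dor": 55.0},
    {"sio2": 90.0, "dv50": 0.5, "dor": 80.0},
    {"sio2": 48.0, "dv50": None, "dor": 30.0},
]

def drop_feature(data, key="dv50"):
    """Ignore the feature entirely: remove the column from every row."""
    return [{k: v for k, v in r.items() if k != key} for r in data]

def drop_incomplete_rows(data, key="dv50"):
    """Keep only rows where the feature was actually measured."""
    return [r for r in data if r[key] is not None]

def fill_with_mean(data, key="dv50"):
    """Replace missing entries with the mean of the observed values (assumed strategy)."""
    observed = mean(r[key] for r in data if r[key] is not None)
    return [{**r, key: observed if r[key] is None else r[key]} for r in data]

no_dv50 = drop_feature(rows)
complete_only = drop_incomplete_rows(rows)
mean_filled = fill_with_mean(rows)
```

Each function returns a new list, so the same raw records can be passed through every strategy and the resulting model scores compared side by side.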

Machine Learning Algorithms
Predicting the DOR of SCMs is crucial for optimizing their use in various applications. A range of advanced ML algorithms were utilized to achieve this, including GPR, linear regression, DT, SVM and XGBoost. Before the model development, the collected dataset was randomly divided into training and test groups at a ratio of 80:20 to ensure the robustness and validity of the models. Comprehensive descriptions of each of the ML models employed in this study are provided in the following subsections.
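The random 80:20 division can be sketched in a few lines. The function name, the fixed seed, and the use of plain index placeholders instead of real samples are assumptions made for illustration; the study does not state how the split was implemented.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and cut it into training and test groups."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# 247 placeholder sample indices, matching the dataset size reported above.
train, test = train_test_split(range(247))
```

With 247 samples this yields 198 training and 49 test samples. Fixing the seed keeps the split reproducible across runs.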

Linear Regression
Linear regression is a statistical analysis tool used to describe the relationship between one or more independent variables and a dependent variable. The best-fit line or hyperplane representing this relationship is determined through linear regression. Equation (1) shows the general formula for linear regression models.
$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon \tag{1}$$
where $Y$ is the dependent variable, the $x_n$ values are the independent variables, $\beta_n$ are the regression coefficients (with $\beta_0$ the intercept), and $\varepsilon$ denotes the error (Chou et al., 2014).
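As a minimal sketch of how the regression coefficients can be estimated, the following fits a two-feature linear model by ordinary least squares through the normal equations. The helper names and the synthetic data are assumptions for illustration only; they are not taken from the study.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_linear(xs, ys):
    """Least-squares coefficients [b0, b1, ..., bn] from the normal equations."""
    design = [[1.0] + list(x) for x in xs]  # leading 1 carries the intercept
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)

def predict_linear(beta, x):
    """Evaluate the fitted hyperplane at one input vector."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

# Synthetic, noise-free data generated from Y = 2 + 3*x1 - 1*x2.
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
ys = [2 + 3 * a - b for a, b in xs]
beta = fit_linear(xs, ys)
```

Because the synthetic data contain no noise, the fit recovers the generating coefficients up to rounding; with real measurements the residual term absorbs the mismatch.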

Gaussian Process Regression
A Gaussian process is a collection of random variables, any finite subset of which has a joint Gaussian (normal) distribution. It is a stochastic process that is fully specified by its mean function and its covariance function, as indicated in Eq. (2).
$$f(x) \sim \mathcal{GP}\left(\mu(x),\, k(x, x')\right) \tag{2}$$
where $f(x)$ represents the output variable, $\mu(x)$ represents the mean function, $k(x, x')$ represents the covariance function, and $\mathcal{GP}$ represents the Gaussian process (Rasmussen, 2003; Shi & Choi, 2011).
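To make the roles of the mean and covariance functions concrete, the following sketch computes the GPR posterior mean and variance at a new point. It assumes a zero mean function, a squared-exponential covariance with unit length scale, scalar inputs, and a small noise term for numerical stability; none of these choices are stated in the study, and the training data are synthetic.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential covariance k(x, x') for scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def gp_predict(x_train, y_train, x_new, noise=1e-6):
    """Posterior mean and variance at x_new for a zero-mean GP prior."""
    k = [[rbf(p, q) + (noise if i == j else 0.0)
          for j, q in enumerate(x_train)] for i, p in enumerate(x_train)]
    k_star = [rbf(p, x_new) for p in x_train]
    alpha = solve(k, y_train)   # (K + noise*I)^-1 y
    v = solve(k, k_star)        # (K + noise*I)^-1 k_*
    mean = sum(ks * a for ks, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(ks * w for ks, w in zip(k_star, v))
    return mean, var

x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [math.sin(x) for x in x_train]
mean_at_known, var_at_known = gp_predict(x_train, y_train, 1.0)
mean_far, var_far = gp_predict(x_train, y_train, 10.0)
```

At a training point the posterior mean reproduces the observation and the variance collapses toward zero. Far from the data the prediction falls back to the prior mean with close to the prior variance, which is how GPR signals its own uncertainty.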

Decision Tree Regression
DT regression is a supervised learning approach that learns basic decision rules based on data characteristics to predict the value of a continuous target variable (Charbuty & Abdulazeez, 2021). This method constructs a tree-structured model with a root node, branches, internal nodes, and leaf nodes. The root node contains the entire dataset and has no incoming branches. The internal nodes reflect the characteristics of the data set, while the branches represent the decision criteria. The leaf nodes reflect the various outcomes of the target variable (Song & Lu, 2015). The method recursively partitions the data into subsets based on their characteristics until the subsets are more homogeneous with regard to the target variable. The model then predicts the target variable by averaging the values of the training data in each leaf node of the tree (Pal & Mather, 2001).
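The partitioning step can be illustrated with the smallest possible tree: a single split on one feature, with each leaf predicting the mean of its training targets. This is a sketch under the assumption of a squared-error splitting criterion, and the curing times and DOR values are invented for illustration. A full tree would apply the same search recursively inside each leaf.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Threshold on a single feature that minimizes the squared error of the two children."""
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2  # midpoint between neighbouring feature values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, t, sum(left) / len(left), sum(right) / len(right))
    return best[1:]  # (threshold, left leaf mean, right leaf mean)

def predict_stump(stump, x):
    """Route x to a leaf and return that leaf's mean target."""
    t, left_mean, right_mean = stump
    return left_mean if x <= t else right_mean

# Illustrative values only: curing time in days against a made-up DOR in percent.
days = [1, 3, 7, 28, 90, 180]
dor = [5.0, 8.0, 12.0, 40.0, 55.0, 60.0]
stump = best_split(days, dor)
```

For these values the search places the threshold at 17.5 days, separating the three early-age samples from the three later ones.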

Support Vector Machine
SVM is a machine-learning approach that can learn from data and produce predictions based on that data. It can handle regression and classification problems by determining the optimum function to match the data while minimizing errors. The function is often a linear combination of the input variables, but it may alternatively be a nonlinear transformation based on a kernel function. The kernel function enables the SVM to translate the input into a higher-dimensional space where a linear separator may be found. A linear separator is a hyperplane that splits data into two or more classes with the largest margin attainable. The margin is the distance between the hyperplane and the nearest data points, known as support vectors. The shape and location of the hyperplane are determined by the support vectors (Gholami & Fakhari, 2017; Noble, 2006).
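The role of the kernel function can be shown with a small numerical check. The sketch below uses a degree-2 polynomial kernel on two-dimensional inputs, chosen purely for illustration since the study does not state which kernel was used, and verifies that evaluating the kernel in the input space matches an inner product in an explicit three-dimensional feature space.

```python
import math

def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2 for 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def feature_map(x):
    """Explicit 3-D feature map whose inner product reproduces poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

a, b = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly_kernel(a, b)                       # computed in the input space
explicit_value = dot(feature_map(a), feature_map(b))   # computed in the mapped space
```

The two values agree up to rounding, so the SVM can work in the higher-dimensional space without ever constructing the mapped vectors.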

eXtreme Gradient Boosting
XGBoost is a scalable end-to-end tree-boosting system, which is effective for both regression and classification tasks (Chen & Guestrin, 2016). Rooted in the Gradient Boosting (Friedman, 2001) framework, an ensemble learning method, XGBoost harnesses the collective wisdom of multiple weak learners, often represented as simple decision trees, to enhance predictive accuracy. Its iterative approach involves training weak models sequentially, with each subsequent model dedicated to correcting the errors of its forerunners. Beyond its core functionality, XGBoost offers a range of essential features, including integrated regularization for guarding against overfitting, robust handling of missing data, streamlined parallel processing for efficient computation, customizable objective functions to adapt to specific use cases, and the incorporation of tree pruning techniques for fine-tuning model complexity control.
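The sequential error-correcting idea can be sketched with plain gradient boosting on one-split trees. This is not XGBoost itself: it omits regularization, pruning, missing-value handling, and parallelism, and it assumes a squared-error loss, a learning rate of 0.3, and invented data.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split of a 1-D feature under squared error."""
    best = None
    cuts = sorted(set(xs))
    for lo, hi in zip(cuts, cuts[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or cost < best[0]:
            best = (cost, t, lm, rm)
    return best[1:]

def boost(xs, ys, rounds=50, rate=0.3):
    """Additive model: a constant plus `rounds` shrunken stumps fitted to residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        # Residuals are the negative gradient of the squared-error loss (up to a constant).
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lm, rm = fit_stump(xs, residuals)
        stumps.append((t, lm, rm))
        preds = [p + rate * (lm if x <= t else rm) for x, p in zip(xs, preds)]
    return base, stumps

def predict(model, x, rate=0.3):
    """Sum the constant and all shrunken stump outputs for one input."""
    base, stumps = model
    return base + rate * sum(lm if x <= t else rm for t, lm, rm in stumps)

def mse(model, xs, ys, rate=0.3):
    """Mean squared error of the boosted model on the given data."""
    return sum((predict(model, x, rate) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 3.0, 3.5, 6.0, 6.5, 9.0, 9.5]
one_round = boost(xs, ys, rounds=1)
many_rounds = boost(xs, ys, rounds=50)
```

Each added stump lowers the training error. On a small dataset this same property is what makes boosted trees prone to overfitting unless the number of rounds and the learning rate are kept in check.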

Evaluation Method
Three independent statistical measures were used to assess the efficiency of the ML models: the root mean square error (RMSE), the mean absolute error (MAE), and the coefficient of determination (R²). These indicators were used to assess and compare the accuracy and reliability of the performance of the models. Employing these three separate statistical measures provides an advantage in obtaining a fair estimation of accuracy. For instance, RMSE is sensitive to outliers and penalizes large errors, thus facilitating their identification and removal from the dataset (Chai & Draxler, 2014). On the other hand, MAE is more suitable for datasets containing outliers (Willmott & Matsuura, 2005), and R² offers more information without being subject to the interpretability limitations of RMSE and MAE (Chicco et al., 2021; Zhang, 2017). The RMSE, MAE, and R² equations are expressed below in Eqs. (3), (4), and (5), respectively.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_i - Y_i\right)^2} \tag{3}$$
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|X_i - Y_i\right| \tag{4}$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(Y_i - X_i\right)^2}{\sum_{i=1}^{n}\left(Y_i - Y_m\right)^2} \tag{5}$$
For Eqs. (3), (4) and (5), $X_i$ is the predicted value, $Y_i$ is the actual value, $Y_m$ is the mean of the actual values, and $n$ is the number of samples.
K-fold cross-validation was applied to mitigate overfitting. It involves splitting the dataset into K subsets or "folds" of roughly similar size. The model is then trained and assessed K times, with each fold acting as the validation set once and the remaining folds serving as training folds. This technique contributes to a more robust estimation of the performance of the model by minimizing reliance on a single train-test split (Hastie et al., 2009). Considering the limited size of the dataset, K was set to 5 to find a balance between evaluating model performance and utilizing the available data.
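The three error measures and the fold construction can be sketched as follows. The metric functions use the standard definitions of RMSE, MAE, and R²; the function names, the contiguous (unshuffled) folds, and the sample numbers are assumptions for illustration.

```python
import math

def rmse(pred, actual):
    """Root mean square error between predictions and observations."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

def mae(pred, actual):
    """Mean absolute error between predictions and observations."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)

def r2(pred, actual):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for p, a in zip(pred, actual))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def k_fold_indices(n, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

# Made-up numbers, used only to exercise the functions.
actual = [10.0, 20.0, 30.0, 40.0]
pred = [12.0, 18.0, 33.0, 37.0]
folds = list(k_fold_indices(10, k=5))
```

In practice the sample order would typically be shuffled before the folds are cut, and the metrics would be averaged over the K validation sets.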

Feature Selection
ML predictions can often suffer from the limitation of not being able to recognize the effects of input parameters on the outcome. However, understanding these relationships is crucial as they provide valuable insights into the roles of the input parameters and serve as a foundation for future predictions. In this study, the ML model with the highest prediction performance was utilized, and the order of significance of input features on the desired outcome, which was the DOR, was determined using the Shapley value. The Shapley value is an idea developed from cooperative game theory that allocates a fair distribution of total costs across players of the game (Merrick & Taly, 2020). In the context of ML, the Shapley value can be utilized to quantify the contribution of each feature to a prediction for a given instance. The Shapley value of a parameter is the weighted mean of the marginal contributions of the feature, averaged across all possible feature subsets. The marginal contribution is the difference between the prediction with and without the feature (Cohen et al., 2005).
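The weighted averaging over feature subsets can be written out exactly for a small number of features. In this sketch a feature "without" its contribution is represented by replacing it with a baseline value, which is one common convention and an assumption here. The toy model is hypothetical and unrelated to the models trained in the study. Exact enumeration grows exponentially with the number of features, so practical tools rely on approximations.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a subset are set to their baseline value."""
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a subset of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))  # marginal contribution
        phi.append(total)
    return phi

# Hypothetical toy model with an interaction term between features 0 and 2.
def toy_model(f):
    return 2.0 * f[0] + 1.0 * f[1] + 0.5 * f[0] * f[2]

x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The interaction term is shared equally between the two features involved, and the values sum to the difference between the prediction at `x` and the prediction at the baseline.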

Results and Discussion
The accuracy of ML models for DOR predictions was assessed and summarized in Table 3. It is worth noting that XGBoost produced the most accurate results. However, compared to actual findings, this model produced very inconsistent outputs, which may be attributed to the structure of the dataset. The dataset had a limited number of observations compared to the independent variables. For XGBoost, this imbalance might lead to subpar model performance, possibly stemming from reduced generalization, computational intensity, and overfitting (Barnwal et al., 2022; Ma et al., 2021). As a result, the subsequent analysis utilized the next most accurate model: GPR. GPR demonstrated comparable accuracy to XGBoost while generating interpretable results. Specifically, GPR exhibited outstanding performance with an RMSE of 12.46, an MAE of 8.88, and an R² value of 0.79. In contrast, linear regression yielded less favorable results, showcasing an RMSE of 20.24, an MAE of 15.12, and an R² value of 0.42.
The substantial improvement observed in the prediction accuracy of GPR can likely be attributed to the integration of complete data. Conversely, alternative models showcase superior performance when not considering Dv50 values (DT regression and SVM) or eliminating rows with null Dv50 values (linear regression).
The experimental and modeled DOR values obtained using GPR are compared in Fig. 1. Following the identification of the optimal ML model for capturing the DOR, the significance of the features was further assessed using the Shapley value. Fig. 2 shows the relative importance of each feature. Notably, the top five features, listed in order of significance, are curing time, SiO₂ content of the SCM, Dv50, specific gravity, and the W/C ratio. The following subsections describe these features and their implications in detail.

Curing Conditions and Their Effect on SCMs Reactivity
Efficient control of the curing process is of paramount importance as it directly impacts the desired material properties and quality. Among the various factors that influence the curing process, curing time and temperature are widely recognized as crucial parameters. Gaining a comprehensive understanding of the relative significance of these factors can provide valuable insights for optimizing the curing process and enhancing its efficiency. According to the Shapley values, curing time accounts for most of the observed variation in the DOR.
It is important to note that the DOR of SCMs tends to increase as the curing time progresses (Haha et al., 2010; Kocaba et al., 2012). However, it is imperative to differentiate the effect of curing time on DOR from its effect on the rate of DOR. This distinction is essential because the rate of DOR is highly dependent on the type of SCM, exhibiting both increasing and decreasing trends (Skibsted & Snellings, 2019). The curing period of SCMs can be broadly categorized into three main stages to examine the fundamental relationships regarding DOR. Initially, during the early stage, the DOR of SCMs is relatively low as most of the available water is consumed by Portland cement (Skibsted & Snellings, 2019). At the same time, the filler aspects of the SCMs play a significant role (Lothenbach et al., 2011; Schöler et al., 2017). In the intermediate curing stage, the DOR of SCMs gradually increases due to improved water availability and pozzolanic reaction, which contributes significantly to the strength and durability of the concrete (Ahmed, 2019). Finally, the concrete reaches its maximum DOR in the long-term curing stage. However, since curing time is an inherent property of concrete that cannot be altered during the initial formulation of SCM in hydrated Portland cement, greater emphasis should be placed on optimizing other adjustable parameters. In contrast, the influence of curing temperature is relatively less pronounced, indicating that variations in temperature within the considered range do not significantly affect the DOR. While temperature variations can accelerate or retard early-age hydration and affect the stability of specific phases (de Azevedo Basto et al., 2022; Snellings et al., 2022), their impact is minor when compared to other parameters affecting DOR.

Chemical Composition of Cementitious Matrices and Its Effect on SCMs Reactivity
The main oxide compositions of SCMs affecting the DOR in cementitious systems are silica, alumina, and calcium oxide, particularly when exploring viable replacements that can enhance or maintain performance. The presence of silica and alumina generally contributes to the formation of additional hydrates of the form C-A-S-H (Simonsen et al., 2020). Silica holds high importance (i.e., ranked 2nd), primarily due to silica-based SCMs possessing a high specific surface area and fine particle size. The increased surface area offers more reaction sites, leading to a higher DOR. Silica-based SCMs have higher calcium hydroxide consumption than alumina- or calcium-based SCMs (Suraneni et al., 2019). Additionally, the C-S-H formed from excess silica exhibits a propensity for aluminum uptake, which occurs at the bridging sites within the silicate chains (Lothenbach et al., 2011). Fig. 3a illustrates the relationship between DOR and SiO₂ content of SCM. The DOR generally exhibits an upward trend as the SiO₂ content of SCM increases until it reaches an optimal replacement level. Beyond this point, as depicted in Fig. 3a, additional SiO₂ content becomes redundant, yielding no further changes.

Physical Properties of Cementitious Matrices and Their Effect on SCMs Reactivity
The physical properties investigated in this research included Dv50, surface area, and specific gravity. The Shapley values indicate that Dv50 and specific gravity possess higher significance. Dv50 measures the particle size distribution of a material, representing the size at which 50% of particle volumes are smaller than the given diameter (Arvaniti & De Belie, 2014). A smaller Dv50 (finer particle size distribution) can improve the DOR of SCMs by providing a larger surface area for interaction between the SCMs and the surrounding cementitious phases, such as calcium hydroxide. Additionally, finer particles facilitate more efficient diffusion and reactant adsorption, leading to improved DOR (Lothenbach et al., 2011; Skibsted & Snellings, 2019). These aspects are also supported by Fig. 3b, which illustrates the decreasing DOR trend with increasing particle size. Similarly, Liu et al. (2018) demonstrated the increased hydration rate for lower Dv50 values.
In contrast, Fig. 3c shows a direct relationship between DOR and specific gravity. However, establishing a straightforward correlation between the two is challenging since multiple factors influence it. This complexity arises from the diverse physical and chemical transformations within the hydrated cement matrix. Therefore, while the modeled observations give vital insights into the collected dataset, it is crucial to remember that specific gravity can wield positive and negative effects.

Water-to-Cement Ratio and Its Effect on SCMs Reactivity
The W/C ratio can influence the DOR of SCMs through various mechanisms. However, an optimal quantity of these materials must be added to leverage its benefits fully. Attaining an ideal W/C ratio is crucial for effective hydration. A lower ratio may lead to inadequate water supply for cementitious materials, diminishing their reactivity (Snoeck et al., 2014). Conversely, excessive water content can occupy space that should be filled by hydration products, resulting in adverse effects. Workability is another crucial consideration. While surplus moisture can enhance the placement and finishing processes (Reddy & Rao, 2014), it can also contribute to the obstruction of spaces meant for reaction products (Navarrete et al., 2020). Additionally, the W/C ratio impacts the curing process. Lower ratios enhance moisture retention during the initial stages of hydration while negatively influencing the hydration process (Patil & Dubey, 2023). Moreover, higher W/C ratios are known to increase the porosity of the hydrated cement matrix, yet their optimal utilization remains crucial, as both excessive and inadequate additions can yield adverse effects (Wong et al., 2020). These factors mentioned above collectively wield the potential to significantly influence the DOR of SCMs. Hence, a meticulous selection of the W/C ratio becomes imperative. Fig. 3d shows the effect of the W/C ratio on the DOR of SCMs. For the given W/C ratio ranges, the DOR of SCMs increases as the W/C ratio increases in agreement with previously published papers (Escalante et al., 2001; Snellings et al., 2022).

Conclusions
This study focused on investigating the factors that influence the DOR of SCMs in cementitious matrices to optimize their performance. Five different ML models were used: linear regression, GPR, DT regression, SVM, and XGBoost. The model with the best accuracy

Fig. 1 DOR of modeled versus experimental findings (%). The symbols and lines indicate the experimental results and linear fit of the SCMs modeled results, respectively

Fig. 3 Influence of key factors on the DOR: (a) SiO₂ content of the SCM, (b) Dv50, (c) specific gravity, and (d) W/C ratio, evaluated based on average oxide composition and physical properties. Conditions: W/C ratio = 0.5, curing time = 180 days and temperature = 25 °C

Table 1
Statistical parameters of the dataset

Table 2
Approaches for representing Dv50

Table 3
Performance of ML predictions
a The representation techniques for Dv50 are detailed in Table 2