The Prediction of Self-Healing Capacity of Bacteria-Based Concrete Using Machine Learning Approaches

Advances in machine learning (ML) methods are important in industrial engineering and have attracted great attention in recent years. However, a comprehensive comparative study of the most advanced ML algorithms is lacking. Six integrated ML approaches for predicting the crack repairing capacity of the bacteria-based self-healing concrete are proposed and compared. Six ML algorithms, namely Support Vector Regression (SVR), Decision Tree Regression (DTR), Gradient Boosting Regression (GBR), Artificial Neural Network (ANN), Bayesian Ridge Regression (BRR) and Kernel Ridge Regression (KRR), are adopted for the relationship modeling to predict the crack closure percentage (CCP). Particle Swarm Optimization (PSO) is used for the hyper-parameters tuning. The importance of the influencing variables is analyzed. It is demonstrated that the integrated ML approaches have great potential to predict the CCP, and that PSO is efficient in the hyper-parameters tuning. This research provides useful information for the design of the bacteria-based self-healing concrete and can contribute to design in other areas of industrial engineering.


Introduction
Concrete is the most widely used construction material, with an embodied energy of about 0.95 MJ/kg [Chilana, Bhatt, Najafi et al. (2016)]. Because of its high rate of consumption, a large amount of energy is required. One disadvantage of concrete is its sensitivity to cracking because of its limited tensile strength. Hence, an efficient method that can heal the cracks in cement and concrete would save energy and greatly reduce the environmental impact. Over recent decades, the notion of designing concrete with a self-healing behavior to heal the cracks has attracted great attention [Zhu, Chen, Yan et al. (2014); Zhu, Zhou, Yan et al. (2015a); Zhu, Zhou, Yan et al. (2015b); Zhu, Zhou, Yan et al. (2016); Zhou, Zhu, Yan et al. (2016); Zhou, Zhu, Ju et al. (2017); Zhuang and Zhou (2018); Quayum, Zhuang and Rabczuk (2015); Zhang and Zhuang (2018)]. Many self-healing strategies have been developed [...] A comprehensive comparison of ML algorithms is still lacking, even though the performance differences may be substantial in their application to the optimization of the bacteria-based self-healing concrete.
To address the above problems, this paper proposes and compares six integrated ML approaches for the optimization of the bacteria-based self-healing concrete. ML algorithms are adopted for the relationship modeling and Particle Swarm Optimization (PSO) is utilized for the hyper-parameters tuning. Six ML algorithms are used: Support Vector Regression (SVR), Decision Tree Regression (DTR), Gradient Boosting Regression (GBR), Artificial Neural Network (ANN), Bayesian Ridge Regression (BRR) and Kernel Ridge Regression (KRR). This research is a benchmark study of the application of ML approaches to biomaterials optimization. The objective of this work is to optimize the self-healing materials for the required healing properties (i.e. the required CCP). The ML and PSO algorithms are detailed in Section 2. Section 3 describes the dataset for the ML models. In Section 4, the methodology of the proposed integrated approaches is illustrated. Section 5 provides the results and the discussion, while Section 6 summarizes the findings.

Machine learning and PSO algorithms
A brief introduction to ML and PSO is presented in this section. To predict the self-healing capacity and provide useful information for the design of the bacteria-based self-healing concrete, six ML methods are selected to predict the CCP, since they have been widely used in industrial engineering and perform well in nonlinear prediction. Meanwhile, PSO is adopted for the hyper-parameters tuning. Firstly, PSO is used to obtain the optimum hyper-parameters of the six ML models. Then, a part of the dataset (i.e., the training set) is adopted to train the six ML models. Finally, the rest of the dataset (i.e., the testing set) is applied to test the six ML models.

ML algorithms

SVR
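The basic idea of SVR is to map the data x into a high-dimensional feature space via a nonlinear mapping φ and to perform a linear regression in this feature space by
$$f(x) = \langle w, \phi(x) \rangle + b \quad (1)$$
where w is the weight vector and b is the bias. The parameters are obtained by minimizing the regularized risk
$$R[f] = \sum_{i=1}^{n} C\big(f(x_i) - y_i\big) + \lambda \lVert w \rVert^2 \quad (2)$$
where λ is a regularization constant and the cost function is defined by
$$C\big(f(x) - y\big) = \max\big(0, \lvert f(x) - y \rvert - \varepsilon\big) \quad (3)$$
vol.59, no.1, pp.57-77, 2019
Eq. (3) can be illustrated in Fig. 1. If the point lies in the blue area (i.e. the distance between the point and the line is less than ε), the cost is zero. If the point lies outside of the blue area, the cost is |f(x) − y| − ε. More details of SVR can be found in previous research [Cristianini and Shaw-Taylor (2000); Vapnik (1999)].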
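The ε-insensitive cost described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study; the function name `epsilon_insensitive_loss` and the numerical values are assumptions chosen for the example.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Cost is zero inside the epsilon tube and grows linearly outside it."""
    return max(0.0, abs(y_pred - y_true) - epsilon)


# Hypothetical values: the first prediction lies inside the tube, the second outside.
inside = epsilon_insensitive_loss(y_true=0.50, y_pred=0.55, epsilon=0.1)
outside = epsilon_insensitive_loss(y_true=0.50, y_pred=0.80, epsilon=0.1)
print(inside, outside)
```

A prediction within ε of the target contributes nothing to the cost, which is why only the points on or outside the tube influence the fitted SVR model.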

DTR
DTR is a non-parametric procedure for predicting a continuous dependent variable, in which the data is partitioned into nodes on the basis of conditional binary responses [Breiman, Friedman, Olshen et al. (1984)]. Models use a binary tree to recursively partition the predictor space into subsets in which the distribution of y is successively more homogeneous [Chipman, George and McCulloch (1998)]. A decision tree P with t terminal nodes is used for communicating the decision. A parameter $\Phi = (\phi_1, \phi_2, \phi_3, \ldots, \phi_t)$ associates the parameter value $\phi_i$ ($i = 1, 2, 3, \ldots, t$) with the ith terminal node. The partitioning procedure searches through all values of the predictor variables to find the variable x that provides the best partition into child nodes, i.e. the partition that minimizes the weighted variance. A detailed discussion of the DTR model can be found in previous research [Breiman, Friedman, Olshen et al. (1984); Chipman, George and McCulloch (1998)].
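The split search described above can be illustrated for a single predictor as follows. This is a simplified sketch under stated assumptions: only one variable is searched, candidate thresholds are taken as midpoints between consecutive distinct values, and the toy data and the names `variance` and `best_split` are hypothetical.

```python
def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(x, y):
    """Return (threshold, weighted_variance) of the best binary split on x."""
    best_threshold, best_score = None, float("inf")
    candidates = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        # Weighted variance of the two child nodes.
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(y)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical toy data: narrow cracks close almost fully, wide cracks barely close.
crack_width = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
closure = [1.0, 0.9, 1.0, 0.2, 0.1, 0.2]
threshold, score = best_split(crack_width, closure)
print(threshold, score)
```

A full regression tree applies this search to every predictor at every node and recurses on the two child nodes until a stopping criterion is met.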

GBR
Gradient boosting regression is a machine learning technique for regression problems, typically built on decision trees. It builds the model in a stage-wise fashion and allows the optimization of an arbitrary differentiable loss function. The goal in GBR is to find a function F*(x) that minimizes the expected value of a loss function over the joint distribution of all (y, x) values. Boosting approximates F*(x) by an additive expansion. At each stage, the gradient boosting algorithm improves the current approximation by adding an estimator h to provide a better model. For the squared error loss, the residuals of a given model are the negative gradients of the loss function, so each new estimator is fitted to the residuals. This idea generalizes to loss functions other than squared error by fitting each new estimator to the negative gradient of the chosen loss. Hence, gradient boosting is a gradient descent algorithm, which can be generalized by plugging in a different loss and its gradient. More details can be found in previous research [Friedman (2002)].
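The stage-wise procedure can be sketched for the squared error loss, where the negative gradient is simply the residual. The example below is a minimal illustration and not the GBR model tuned in this study: the weak learner is a one-split stump on a single predictor, and the number of stages, the learning rate and the toy data are assumptions.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump to the residuals by least squares.

    Assumes x contains at least two distinct values.
    """
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_stages=50, learning_rate=0.1):
    """Stage-wise additive model for the squared error loss."""
    initial = sum(y) / len(y)  # F_0: the constant that minimizes squared error
    stumps = []

    def predict(xi):
        return initial + learning_rate * sum(h(xi) for h in stumps)

    for _ in range(n_stages):
        # The negative gradient of 0.5 * (y - F)^2 with respect to F is the residual.
        residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
        stumps.append(fit_stump(x, residuals))
    return predict


# Hypothetical toy data.
x = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
y = [1.0, 0.9, 1.0, 0.2, 0.1, 0.2]
model = gradient_boost(x, y)
mse_before = sum((yi - sum(y) / len(y)) ** 2 for yi in y) / len(y)
mse_after = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print(mse_before, mse_after)
```

Replacing the residual with the negative gradient of another differentiable loss gives the general algorithm without changing the rest of the loop.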

BRR
BRR combines Bayesian regression with Ridge Regression (linear least squares with l2-norm regularization). The l2 regularization used in BRR is equivalent to finding a maximum a posteriori estimate under a Gaussian prior over the parameters w with precision $\lambda^{-1}$. Instead of setting λ manually, it is possible to treat it as a random variable to be estimated from the data. To obtain a probabilistic model, the output y is assumed to be Gaussian distributed, with noise precision α. The prior for the parameter w is given by a spherical Gaussian. The priors over α and λ are chosen to be gamma distributions. More details can be found in previous research [MacKay (1992)].

ANN
ANN is a mathematical technique that uses an analogy to biological neurons to generate a general solution to a problem [Rumelhart, Hinton and Williams (1986)]. All neural functions are stored in the neurons and the connections between them. After learning from historical data, an ANN can be used effectively to predict new data. The training of an ANN is regarded as the establishment of new connections between neurons. The ANN architecture may have one or more hidden layers between the input and output layers. Each layer consists of neurons, which are connected to other neurons by weights that pass signals between them. When the total signal received by a neuron exceeds its threshold, the neuron is activated through its activation function and its output is treated as the input of the next neuron. An ANN can approximate an arbitrary nonlinear function with satisfactory accuracy [Zhang, Wu, Zhong et al. (2008)]. ANNs learn from examples by building an input-output mapping without an explicit derivation of the model equation. They have been widely used in pattern classification, function approximation, optimization, prediction and automatic control, and in many different domains, such as load forecasting and strength forecasting [Hadavandi, Shavandi and Ghanbari (2010); Khosravi, Nahavandi and Creighton (2013); Khotanzad, Elragal and Lu (2000); Bashir and El-Hawary (2009); Qi, Fourie, Chen et al. (2018)].
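The flow of signals through the layers can be sketched as a forward pass of a network with one hidden layer. This is only an illustration under assumptions: the sigmoid activation, the linear output neuron, the function names and all weight values are hypothetical and are not taken from the tuned ANN of this study.

```python
import math


def sigmoid(z):
    """Smooth activation that switches on as the weighted input grows."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: inputs -> one hidden layer (sigmoid) -> linear output."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Hypothetical network with three inputs and two hidden neurons.
prediction = forward(
    inputs=[0.3, 0.38, 0.37],
    hidden_weights=[[0.5, -0.2, 1.0], [-1.0, 0.4, 0.3]],
    hidden_biases=[0.1, -0.1],
    output_weights=[0.7, -0.3],
    output_bias=0.2,
)
print(prediction)
```

Training adjusts the weights and biases so that the output matches the observed values; the bias of each neuron plays the role of the threshold mentioned above.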

KRR
Kernel ridge regression (KRR) combines Ridge Regression (linear least squares with l2-norm regularization) with the kernel trick. For non-linear kernels, this corresponds to a non-linear function in the original space. The form of the model learned by KRR is identical to that of SVR. However, different loss functions are used: KRR uses the squared error loss, while SVR uses the ε-insensitive loss. Both adopt l2 regularization. In contrast to SVR, fitting KRR can be done in closed form and is typically faster for medium-sized datasets. The learned functions are very similar. For larger training sets, fitting SVR scales better. At prediction time, SVR is typically faster than KRR for all sizes of the training set, because SVR learns a sparse model while the KRR model is non-sparse. More details about KRR can be found in previous research [Murphy (2012)].
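The closed-form fit of KRR amounts to solving one linear system, (K + λI)α = y, where K is the kernel matrix of the training data. The sketch below illustrates this for a single predictor in plain Python. The Gaussian kernel, the values of `lam` and `gamma`, the helper names and the toy data are all assumptions made for the example.

```python
import math


def rbf_kernel(a, b, gamma=50.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-gamma * (a - b) ** 2)


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_krr(x, y, lam=1e-3, gamma=50.0):
    """Closed-form KRR: solve (K + lam * I) alpha = y, then predict with the kernel."""
    n = len(x)
    gram = [
        [rbf_kernel(x[i], x[j], gamma) + (lam if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    alpha = solve_linear_system(gram, y)
    # Every training point contributes to a prediction (non-sparse model).
    return lambda query: sum(a * rbf_kernel(query, xi, gamma) for a, xi in zip(alpha, x))


# Hypothetical toy data.
x = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
y = [1.0, 0.9, 1.0, 0.2, 0.1, 0.2]
model = fit_krr(x, y)
max_error = max(abs(model(xi) - yi) for xi, yi in zip(x, y))
print(max_error)
```

The prediction sums over all training points, which reflects the non-sparse character of KRR, whereas SVR only sums over its support vectors.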

PSO algorithm
There are many hyper-parameters in ML models, which need to be chosen properly. Traditional tuning methods, such as trial and error, are time-consuming. PSO can be utilized for the hyper-parameters tuning of the ML algorithms [Kennedy and Eberhart (1995)]. Compared with other metaheuristic algorithms [Montazeri-Gh, Poursamad and Ghalichi (2006); Ma, Xu, Wang et al. (2015); Long and Nhan (2012); Castaings, Lhomme, Trigui et al. (2016)], PSO requires fewer parameters to be tuned and less computational effort for multi-objective optimization [Yang (2014); Geng, Mills and Sun (2014)]. Inspired by the social behavior of bird flocking, PSO is an evolutionary optimization algorithm in which a population of individuals changes with time. The major concepts of PSO come from observations of the feeding behavior of bird swarms, where flocks are formed through the grouping of simple individuals that interact with each other. The unpredictable collective behavior generated from local information is simulated. The algorithm is highly efficient and does not use gradient information. Each particle represents a candidate solution, and the objective function value is calculated for each particle. The PSO algorithm is described as below:
$$v_{id} = w\, v_{id} + c_1\, \mathrm{rand}()\, (p_{id} - x_{id}) + c_2\, \mathrm{Rand}()\, (p_{gd} - x_{id}) \quad (4)$$
$$x_{id} = x_{id} + v_{id} \quad (5)$$
where w is the constant inertia weight; c1 and c2 are positive constants; rand() and Rand() are two random functions in the range [0,1]; xi=(xi1, xi2, … , xiD) represents the ith particle; pi=(pi1, pi2, … , piD) represents the best previous position of the ith particle; the symbol g represents the index of the best particle among all the particles in the population; and vi=(vi1, vi2, … , viD) represents the velocity of particle i. The swarm topology defines how particles are connected to one another to exchange information with the global best. The particles in the swarm make up a cloud that covers the entire search space in the initial iteration and gradually contracts as the iterations advance.
PSO has been successfully applied to many problems, such as artificial neural network training, function optimization, fuzzy control and pattern classification. Because of its ease of implementation and fast convergence to acceptable solutions, PSO has received broad attention in recent years [Poli (2008)]. More details can be found in previous research [Kennedy and Eberhart (1995)].
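A global-best PSO with a constant inertia weight can be sketched as follows. This is a minimal illustration rather than the configuration used in this study: the values of `w`, `c1` and `c2`, the swarm size, the clipping of positions to the search bounds and the quadratic test objective are all assumptions. In hyper-parameters tuning, the objective would instead return the cross-validated MSE of an ML model for a given hyper-parameter vector.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iterations=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO with a constant inertia weight w."""
    rng = random.Random(seed)
    dim = len(bounds)
    positions = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    best_positions = [p[:] for p in positions]  # p_i: best previous position
    best_values = [objective(p) for p in positions]
    g = min(range(n_particles), key=lambda i: best_values[i])  # index of best particle

    for _ in range(n_iterations):
        for i in range(n_particles):
            for d in range(dim):
                # Velocity update: inertia + pull to own best + pull to global best.
                velocities[i][d] = (
                    w * velocities[i][d]
                    + c1 * rng.random() * (best_positions[i][d] - positions[i][d])
                    + c2 * rng.random() * (best_positions[g][d] - positions[i][d])
                )
                # Position update, clipped to the search bounds (an assumption).
                lo, hi = bounds[d]
                positions[i][d] = min(hi, max(lo, positions[i][d] + velocities[i][d]))
            value = objective(positions[i])
            if value < best_values[i]:
                best_values[i] = value
                best_positions[i] = positions[i][:]
                if value < best_values[g]:
                    g = i
    return best_positions[g], best_values[g]


def sphere(position):
    """Hypothetical test objective with its minimum at the origin."""
    return sum(v * v for v in position)


best_position, best_value = pso_minimize(sphere, bounds=[(-5.0, 5.0), (-5.0, 5.0)])
print(best_position, best_value)
```

No gradient of the objective is needed, which is what makes the method convenient for tuning hyper-parameters whose effect on the validation error is not differentiable.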

Crack closure of the bacteria-based self-healing concrete
The performance of the six integrated ML approaches is verified and compared using a dataset collected from the literature. The dataset contains a considerable number of experimental cases from which the ML approaches can learn the relationship between the CCP and its influencing variables. The whole dataset consists of 1223 cases collected from previous research for the training and the testing of the prediction models [Stuckrath, Serpell, Valenzuela et al. (2014); Luo, Qian and Li (2015); Zhang, Liu, Feng et al. (2017); Khaliq and Ehsan (2016); Wiktor and Jonkers (2011)]. A wide range of variable values is covered. The successful application of the proposed ML approaches to the CCP relies on similar healing mechanisms, so only the non-ureolytic bacteria-based self-healing concrete is considered. For the purpose of crack prediction, three attributes are used: the number of bacteria, the healing time and the initial crack width. These attributes are considered to be the influencing variables that govern the healing of cracks. The carrier and the nutrient medium are treated as parameters affiliated with the number of bacteria, since they are used to maintain the number of bacteria. Each of these influencing variables is introduced as follows:
(1) The number of bacteria is defined as the number of bacteria in 1 g of concrete. The data is in the range [0, 214993302] with a mean of 64085635.
(2) The healing time is the time between the initiation of the crack in experiments (i.e. the start of the healing) and the measurement of the crack width after healing. More details about the healing time can be found in previous research [Zhang, Liu, Feng et al. (2017)]. The data is in the range [0, 100] with a mean of 38.
(3) The initial crack width is the crack width when the crack initiates. The data is in the range [0.06, 1] with a mean of 0.37.
The CCP is defined as
$$\mathrm{CCP} = \frac{W_{initial} - W_{healed}}{W_{initial}} \quad (6)$$
where $W_{initial}$ and $W_{healed}$ are the widths of the crack before and after healing, respectively.
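The CCP can be computed from the two crack widths as sketched below. The sketch assumes that the CCP is the relative reduction of the crack width, expressed as a fraction between 0 and 1; the function name `crack_closure` and the example widths are hypothetical.

```python
def crack_closure(initial_width, healed_width):
    """Fraction of the initial crack width that closed during healing."""
    if initial_width <= 0:
        raise ValueError("initial_width must be positive")
    return (initial_width - healed_width) / initial_width


# Hypothetical crack that narrows from 0.5 to 0.125 (same length unit for both).
ccp = crack_closure(initial_width=0.5, healed_width=0.125)
print(ccp)  # 0.75
```

A fully healed crack gives a value of 1, and a crack that does not change gives a value of 0.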

Dataset
The dataset used for the prediction is collected from 1223 specimens with different influencing variables (see Supplementary). The output of the ML models is chosen to be the CCP. The whole dataset needs to be split into the training set and the testing set. The training set is used for the model training and the hyper-parameters tuning, while the testing set is adopted for the performance evaluation of the models. The training set is chosen randomly in the beginning, and the rest of the dataset is the testing set. The same training set and the same testing set are used for all six ML models in this paper. In this paper, 70% of the whole dataset is included in the training set, and the remaining 30% is included in the testing set.
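The random 70/30 split can be sketched with index lists, so that the same split can be reused for all six models. The function name, the fixed seed and the rounding of the training set size are assumptions; the exact number of cases in each set is not stated in this paper.

```python
import random


def train_test_split_indices(n_samples, train_fraction=0.7, seed=0):
    """Shuffle the sample indices once and cut them into training and testing parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(train_fraction * n_samples))
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split_indices(1223)
print(len(train_idx), len(test_idx))
```

Under this rounding assumption, the 1223 cases are divided into 856 training cases and 367 testing cases. Fixing the seed keeps the split identical across models, which is what makes their testing errors directly comparable.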

K-fold cross validation
k-fold cross validation is widely used during the process of hyper-parameters tuning [Stone (1974)]. By using this method, the original training set is divided into k folds. Then, k-1 folds are used to train the ML models, while the remaining fold is adopted to validate the models. The training and validating process is repeated k times, each time with a different fold as the validating fold. The performance of the ML models is obtained by averaging the performances from the k iterations. Compared with other methods [Badawy, Msekh, Hamdia et al. (2017)], the advantage of k-fold cross validation is that all observations are used for both training and validation, and each observation is used for validation exactly once. In this study, k is set as ten [van der Gaag, Hoffman, Remijsen et al. (2006)]. The mean squared error (MSE) and the correlation coefficient R are utilized for the hyper-parameters tuning and the model evaluation. The MSE measures the mean squared distance between the predicted and the experimental values. R evaluates the degree to which the changes in two variables are associated. The MSE and R can be calculated using Eq. (7) and Eq. (8):
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i^{*} - y_i\right)^2 \quad (7)$$
$$R = \frac{\sum_{i=1}^{n} \left(y_i^{*} - \bar{y}^{*}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n} \left(y_i^{*} - \bar{y}^{*}\right)^2 \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2}} \quad (8)$$
where n is the number of samples, $y_i^{*}$ and $y_i$ are the predicted and the experimental values, respectively, and $\bar{y}^{*}$ and $\bar{y}$ are their means.
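The fold construction and the two metrics can be sketched in plain Python. This is a minimal illustration, not the code used in this study: the function names are hypothetical, the folds are contiguous blocks (so the data should be shuffled beforehand), and the sample size in the example is arbitrary.

```python
import math


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between experimental and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def correlation_coefficient(y_true, y_pred):
    """Pearson correlation coefficient R between experimental and predicted values."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    covariance = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    spread_t = math.sqrt(sum((t - mean_t) ** 2 for t in y_true))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in y_pred))
    return covariance / (spread_t * spread_p)


# Hypothetical training set of 103 samples split into ten folds.
folds = list(k_fold_indices(103, k=10))
print([len(validation) for _, validation in folds])
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(correlation_coefficient([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

During tuning, a model would be trained on each `train` list and scored on the matching `validation` list, and the k scores would be averaged to give the value that PSO minimizes.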

Hyper-parameters tuning
Before the construction of ML models, several important hyper-parameters need to be determined. Hyper-parameters tuning is necessary since the predictive performance of ML models varies widely with different hyper-parameters. A good predictive performance can be achieved with suitable hyper-parameters. Here, hyper-parameters to be tuned in the ML algorithm are summarized in Tab

Comparison of integrated ML approaches
Six integrated ML approaches have been used for the prediction of CCP, and their predictive performance is compared and discussed in this section. The MSE and R values are utilized for the performance evaluation. Fig. 2 shows the MSE and R values of the six ML models with the hyper-parameters obtained by PSO in Section 5.1. As can be seen in Fig. 2(a), the best prediction model with respect to the MSE is GBR, with MSE values of 0.028 and 0.057 on the training set and the testing set, respectively. DTR also performs well, with MSE values of 0.051 and 0.061 on the training set and the testing set, respectively. The other four models have a similar MSE of about 0.7. In Fig. 2(b), GBR also shows the greatest R among all six models, with R values of 0.93 and 0.74 on the training set and the testing set, respectively. With R values of 0.74 and 0.7 on the training set and the testing set, DTR also shows a good predictive ability. The R values of the other four models are similar and around 0.6. The R value of the testing set is slightly greater than that of the training set for SVR, ANN, BRR and KRR. Compared with DTR and GBR, the performance of BRR is relatively poor according to the MSE and R values. Fig. 2 shows that DTR and GBR achieve better predictive performance in predicting the CCP of the bacteria-based self-healing concrete. The R value between the experimental and predicted CCP values is not high enough [Smith (1986)]. The main reason is the great variability of the CCP data in experiments. With the lowest MSE and the highest R value, the optimum GBR model obtains better results than the other ML models. The optimum GBR and DTR models are therefore recommended for predicting the CCP of the bacteria-based self-healing concrete.
Detailed results on the performance of the six ML models on the training set are shown in Fig. 3. In Fig. 3, the data of the training set are close to the ideal fit line when GBR is adopted, which means that GBR is more likely to predict the correct CCP. In contrast, the data of the other five models are widely scattered on the training set. This scatter is a consequence of avoiding overfitting when k-fold cross validation is adopted.
The performance of the six ML models with the optimum hyper-parameters on the testing set is presented as histograms in Fig. 5, which plots the ratio of the predicted to the experimental CCP values for each optimum ML model. The means of the density curves are around one, which illustrates that the optimum ML models have a good predictive performance on the testing set. The GBR histogram peaks at 1.0, denoting that the optimum GBR model is accurate at predicting the CCP. The histograms for the SVR, KRR, ANN and BRR models peak slightly to the right of 1.0, indicating that these optimum ML models tend to predict slightly greater CCP values than the experimental values on the testing set. From this point of view, the performance of the optimum GBR and DTR models on the testing set can be considered better, as their peak ratio is approximately 1.0. It is worth mentioning that some predicted/experimental ratios are greater than 3 on the testing set; these are not shown in Fig. 5. The main reason for the dispersion in the histograms is that the self-healing effect varies greatly between experimental cases. This randomness of the self-healing effect is widespread in experiments, which makes it difficult to determine the relationship between the influencing parameters and the healing result [Zhu, Zhou, Yan et al. (2015b)]. Hence, the experimental CCP data are scattered, which decreases the accuracy of the ML models.

Relative importance of influencing variables
The sensitivity analysis of the influencing variables is implemented with the optimum GBR model to investigate the effect of the variables on the prediction of CCP. GBR is selected because of its outstanding performance on the testing set. The importance is calculated by the Gini importance, which is a measure of variable importance based on the Gini impurity index [Breiman, Friedman, Olshen et al. (1984)]. The feature importance scores are normalized and the result is shown in Fig. 6. Clearly, the initial crack width is the most sensitive variable. It accounts for more than half of the importance score among all variables influencing the CCP, which agrees with previous research [Qian, Chen, Ren et al. (2015)]. The importance scores of the other two influencing variables are almost equal. The number of bacteria accounts for 22.7%, which is a low value compared with the initial crack width. Several reasons contribute to this low value. Normal concrete samples, even without any bacteria, show some crack healing as well. The continued hydration of cement particles, the swelling of the cement matrix and the precipitation of calcium carbonate crystals can also heal the concrete to some extent [Wiktor and Jonkers (2011); Homma, Mihashi and Nishiwaki (2009); Khaliq and Ehsan (2016)]. Meanwhile, many bacteria die in concrete because of the pressure during the mixing stage and the formation of the dense microstructure after a period of time [Qian, Chen, Ren et al. (2015)]. Hence, the importance of the number of bacteria decreases. The healing time makes up 22.2%, which means that the healing time is also a necessary variable in designing the bacteria-based self-healing concrete. Different importance scores may be obtained with different datasets [Guyon, Gunn, Nikravesh et al. (2006)]. More representative results can be obtained as more valid experimental results about CCP become available in the future.
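As a quick consistency check, assuming that the three normalized importance scores sum to 100%, the share of the initial crack width follows from the two values reported for the number of bacteria and the healing time:

$$I_{width} = 100\% - 22.7\% - 22.2\% = 55.1\%$$

This is in line with the statement that the initial crack width accounts for more than half of the total importance score.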

Contribution and limitations
The primary strength of this study is that it proposes and compares six integrated ML approaches for predicting the CCP of bacteria-based self-healing concretes. This study contributes to concrete design and to other fields of industrial engineering. On the one hand, the integrated ML approaches based on ML algorithms and PSO are very promising for prediction problems. On the other hand, a model recommendation has been made for CCP, and the methodology in this paper can be adopted for wider application in other areas of industrial engineering. The omission of other influencing variables for crack healing, such as the carrier of the bacteria and the nutrient medium of the bacteria, is a limitation of the present study and is due to the lack of sufficient experimental data. Meanwhile, the experimental results are scattered, which decreases the predictive accuracy of the models. The performance of the proposed ML approaches is expected to improve when more experimental data become available in the future.

Summary and conclusion
In this study, a model for predicting the CCP of the bacteria-based self-healing concrete is proposed based on ML and PSO. The ML algorithms are used for the non-linear relationship modeling between the CCP and its influencing variables, and PSO is applied for the hyper-parameters tuning. The dataset is collected from experiments reported in the literature, covering different combinations of the influencing variables. The inputs are the number of bacteria, the healing time and the initial crack width. The output is the CCP. 10-fold CV is used as the validation method. The performance of the optimum ML models is verified using the MSE and the R value.
The results show that the PSO-ML method has great potential for the prediction of CCP. PSO is efficient in the hyper-parameters tuning of the ML models, with minimum MSE values being achieved for all six ML methods. The optimum GBR model performs well on both the training set and the testing set. The low MSE and the high R value between the predicted and the experimental CCP values on the training and testing sets denote that the optimum GBR model achieves a better prediction than the other five models. The relative importance of the influencing variables is studied, and the initial crack width is found to be the most important variable. The healing time and the number of bacteria are less important than the initial crack width. The findings of this paper can be used to design a more suitable bacteria-based self-healing concrete and can be applied more widely in other areas of industrial engineering.