Prediction of Cement Compressive Strength by Combining Dynamic Models of Neural Networks

This study aimed at developing models predicting cement strength based on shallow artificial neural networks (ANN) using exclusively industrial data. The models used physical, chemical, and early strength results to forecast the 28- and 7-day strengths. Neural networks were trained dynamically over a movable period and then used for a future period of at least one day. The study includes nine types of activation functions. The algorithms use the root mean square errors of the testing sets (RMSE Future) and their robustness as optimization criteria. The RMSE Future of the best models with optimum ANNs was in the range of 1.36 MPa to 1.63 MPa, which is near or within the range of long-term repeatability of a very competent laboratory. Continuous long-term application of the models under actual conditions of a cement plant showed a performance at least equivalent to that calculated during the design step.


Introduction
Machine learning techniques, and especially artificial neural networks (ANN), are widely used in predicting cement and concrete properties for quality control purposes. The product's compressive strength is the most frequently utilized property, cumulatively expressing the quality. The most significant among the strengths measured at different ages is considered to be the 28-day strength, at least for normative reasons. Compressive strength depends on a variety of the product's physical and chemical properties, which are due either to the raw materials used or are acquired during the production processes. Successful implementation of ANNs and other machine learning techniques in forecasting concrete strength is reported in numerous research papers, for multiple production technologies and categories of this product [1][2][3][4][5][6][7][8][9] . In the cement industry, due to the high daily production rates and the particular impact of a quality accident, forecasting the 28-day strength from earlier analysis results has long attracted great interest. Extensive reviews of published studies based on multilinear or polynomial models can be found in 10,11 .
In recent years, significant research has been devoted to machine learning methodologies for predicting cement strength based on laboratory or industrial quality data sets. Many researchers have applied several techniques of this type, such as genetic algorithms (GA), ANNs, fuzzy logic (FL), support vector machines (SVM), support vector regression (SVR), adaptive neuro-fuzzy inference systems (ANFIS), and other more specialized methods. The structure of Akkurt et al. 12 contained a GA-ANN correlating cement compressive strength with six months' worth of industrial quality data. Their results indicated that an increase in tricalcium silicate (C 3 S), sulfates (SO 3 ), and specific surface led to increased strength. In a further publication, Akkurt et al. 13 implemented an FL model on the same data set and compared the results with those of the ANN 12 . Thamma et al. 14 employed gene expression programming to predict the 28-day cement mortar strength, compared the results with those of the FL and GA-ANN models initially developed by Akkurt et al. 12,13 , and concluded that gene expression programming performs better than FL. Verma et al. 15 employed three different kernel-based models - SVR, relevance vector machine, and Gaussian process regression - in predicting cement strength. Afterwards, they compared the models with the ANN and FL models provided by Akkurt et al. 12,13 using the same data set. Similarly, Motamedi et al. 16 used SVR and ANFIS algorithms in forecasting the compressive strength of cockle shell-cement-sand mixtures. Their findings showed that ANFIS improved the generalization capability compared to SVR. Chen et al. 17 provided an approach based on SVM for predicting the compressive strength of cement mortars exposed to sulfate attack. They compared this model with an ANN containing one hidden node, and concluded that SVM performance was higher than that of the ANN. Neural networks were also used in two other cases 18,19 . Eskandari-Naddaf et al. 
18 developed a structure for predicting cement mortar strength, using three different cement strength classes in their mixtures. Pani et al. 19 used robust multivariate techniques and soft sensor models based on ANNs and FL to detect outliers in the quality of industrially produced clinker. Most of these models, estimating the parameters from a fixed training data set and then predicting the future strength, are characterized as static. Tsamatsoulis 20,21 introduced a dynamic approach to predicting the strength of cement using movable time horizons based on long-term process results and several types of ANNs. These models are dynamic in that they incorporate the uncertainty due to the time variability of factors not involved in the modeling - especially the clinker reactivity. In this manner, the author provides a robust solution to the challenging issue of predicting strength during the daily quality control of cement production in a cement plant. Apart from chemical and physical measurements, these studies utilize early strength results in predicting the typical 28-day strength. This technique uses the data of a predetermined period for training, namely, for computing the optimal parameters. Subsequently, the data belonging to a time interval that follows the training period are used for validating - in other words, testing - the model. The training and validation time intervals constitute the past and future periods, respectively. Tsamatsoulis 21 proved that the optimal future period is the 1-day interval, whereas the training time interval needs optimization to achieve the best prediction. This previous study employed three types of activation functions: sigmoid, hyperbolic tangent, and radial basis functions. It also used linear functions for comparison purposes. The aim of this study was threefold. 
Firstly, to explore in depth the ANN architectures implemented to forecast the output, using the capability of a model to predict the future cement strength as a criterion; this capability can be characterized as the generalization ability 22 of the modeling. Subirats et al. 23 clearly stated that one of the strategies for avoiding overfitting is the search for compact architectures. In this research, shallow ANNs were built with a varying number of nodes within a single hidden layer. A simple and robust method of pruning the connections between the input variables and the input layer, based on the correlations among input and output variables, was also examined and exploited. Additionally, weight decay 24-26 via L2-regularization 27 was used for all the architectures concerned. The second purpose of this study was to explore the behavior of various activation functions (AF) in the application under examination, as recent work 28 demonstrated that AFs are of high relevance and that research in this field is constantly evolving 29 . The third objective, though not least, was to build robust models predicting the cement 28- and 7-day strength, able to be applied in the daily quality control of a cement plant. Specialized algorithms appropriately combine the models' outputs, considering the time delays, aiming at augmenting the generalization ability. Throughout the present text, the errors corresponding to past periods are characterized as training errors, while the ones computed from future period data, for validation purposes, are named test errors.

Materials and testing methods
The application field of the modeling comprised four cement types, produced according to EN 197-1:2011 (Cement, Part 1: Composition, specifications and conformity criteria for common cements): CEM I 52.5 R, CEM I 42.5 R, CEM II A-LL 42.5 R, CEM II B-LL 42.5 R. Cements of CEM I type contain clinker and gypsum as main constituents. Moreover, limestone is one of the main components in the denoted CEM II types. Table 1 shows the raw materials' typical chemical characteristics. The oxide analysis was performed with XRF. Table 1 also presents the mineral composition of the clinker, according to the Bogue equations (1) - (4), which estimate the clinker's mineral phases from its oxide composition. Modeling utilized the results of composite daily average samples of cement produced in the mills of the Devnya plant. Each of these daily samples corresponded to a cement type produced by a specific mill. Training and verification of the models included more than 1600 datasets of daily cement results produced over about 27 months. The cement's physical, chemical, and mechanical features, and the analysis methods are shown in Table 2, resulting in a population of 11 input and output variables. Equation (5) of section "Mathematical models" normalizes the values of these variables for the entire population. Fig. 1 demonstrates the frequency distributions of the normalized values. Table 3 indicates the mean, minimum, and maximum values of these variables for each cement. These data clarify that the dispersion of values is quite large, providing an initial generalization capability to the models under consideration.
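The Bogue equations (1) - (4) referenced above are standard in cement chemistry; the following sketch uses their commonly cited coefficients, which are an assumption here, since the paper's exact forms of equations (1) - (4) are not reproduced in this text.

```python
def bogue(cao, sio2, al2o3, fe2o3, so3=0.0):
    """Estimate clinker mineral phases (wt %) from an XRF oxide analysis
    (wt %) using the classical Bogue equations. The coefficients are the
    commonly cited ones and may differ slightly from the paper's
    equations (1)-(4)."""
    c3s = 4.071 * cao - 7.600 * sio2 - 6.718 * al2o3 - 1.430 * fe2o3 - 2.852 * so3
    c2s = 2.867 * sio2 - 0.7544 * c3s
    c3a = 2.650 * al2o3 - 1.692 * fe2o3
    c4af = 3.043 * fe2o3
    return {"C3S": c3s, "C2S": c2s, "C3A": c3a, "C4AF": c4af}
```

For a typical clinker analysis (e.g. CaO 65 %, SiO2 21 %, Al2O3 5.5 %, Fe2O3 3.5 %, SO3 0.5 %), the estimated C 3 S content lands in the usual 55-65 % range.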

Mathematical models
The physical and chemical characteristics, namely, the residue at the 45 μm sieve, loss on ignition, and the SO 3 , SiO 2 , Al 2 O 3 , Fe 2 O 3 , CaO oxides, constituted the set of shared input variables in all the models elaborated. The chemical properties characterize both cement composition and clinker reactivity. The initial design included three basic models for predicting strength: (i) a model predicting the 28-day strength from the shared inputs and the 1- and 2-day strengths (Str_28_2); (ii) a model additionally using the 7-day strength (Str_28_7); and (iii) a model predicting the 7-day strength from the same inputs as Str_28_2 (Str_7_2). Therefore, the maximum number of input variables for models (i) and (iii) was nine, while for model (ii) it was ten. Usually, the plant laboratory prepares the mortar for the strength measurements the day after producing a cement batch. Thus, in terms of the derived predictions, the following time delays appeared for each model: the delay for models (i) and (iii) was three days, and the prediction delay for (ii) was eight days. Apart from the three basic models, the design included two additional ones, named C1_28_2 and C2_28_2. The C1_28_2 model adds to the strength predicted by Str_28_2 a correction term derived from Str_28_7. The C2_28_2 model exploits the result of Str_7_2 to add a similar quantity to the original forecast of Str_28_2. As described by Tsamatsoulis 21 , there is a stronger or weaker correlation among several of the model's physical, chemical, and mechanical inputs. For example, an increase in the contained limestone causes an increase in LOI and a decrease in SiO 2 , Al 2 O 3 , Fe 2 O 3 , CaO. The content of these oxides, as well as their ratio, is also related to clinker reactivity. Early strength depends not only on R45 and the chemical analysis, but also on "hidden" variables such as milling performance and clinker reactivity. Therefore, the input variables contain an independent part permitting a better description of the process. The models use the early strengths to include such hidden variables as clinker activity and grinding conditions in a cumulative manner. 
Although the C 3 S calculated by equation (1) is an indication of clinker activity, its introduction to the models as an independent variable has weaknesses if the goal is to use the models in daily quality control: (a) For a plant laboratory, it is difficult to measure the average clinker consumed in each cement batch and each cement mill in a daily program. If the mills have a shared clinker silo, then discrimination between the mills is very hard. Conversely, if there is one silo per mill, then a very intensive sampling and analysis program is required, which is hard to accomplish. Usually, cement plants concentrate their efforts on measuring and stabilizing the clinker at the kiln outlet. (b) The C 3 S calculated by the Bogue formula accumulates the uncertainties of each measurement on a spot basis.
The residue on a sieve or the specific surface is a partial indicator of grinding effectiveness and cement particle size. Measuring the particle size of each cement component after grinding is hard or impossible to accomplish in a daily program. The early strengths increase as the clinker activity increases or as the clinker fraction in the cement becomes finer, and vice versa. This is the rationale for introducing those strengths into the models.
The input and output variables for model Str_28_2 are named as follows: X 1 =SiO 2 , X 2 =Al 2 O 3 , X 3 =Fe 2 O 3 , X 4 =CaO, X 5 =SO 3 , X 6 =LOI, X 7 =R45, X 8 =Str_1, X 9 =Str_2, Y=Str_28. For Str_28_7, X 10 =Str_7 was also added in the inputs, whereas model Str_7_2 for the same inputs as Str_28_2 derives output Y=Str_7. These variables were normalized as follows: (1) For a given training and test set of data, the minimum and maximum values of the input variables X I and the output Y, X I,MIN , X I,MAX , Y MIN , Y MAX , respectively, were computed.
(2) Equations (5) and (6) provide the normalized variables, depending on the activation function utilized in the ANN.

In the case of models (i) or (ii) application, Str_28 Calc = Str Calc . Otherwise, if 7-day strength is to be predicted, Str_7 Calc =Str Calc .
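The normalization and back-calculation described above can be sketched as follows. The exact forms of equations (5) - (8) are not reproduced in this text, so standard min-max scaling to (0, 1), with a variant shifted to (-1, 1) for zero-centered activation functions, is assumed here.

```python
def normalize(x, x_min, x_max, zero_centered=False):
    """Min-max scaling of an input or output variable: to (0, 1), or to
    (-1, 1) when the activation function is zero-centered. The exact
    forms of the paper's equations (5)-(6) are assumed, not quoted."""
    u = (x - x_min) / (x_max - x_min)
    return 2.0 * u - 1.0 if zero_centered else u

def denormalize(y_norm, y_min, y_max, zero_centered=False):
    """Inverse transform (in the spirit of equations (7)-(8)),
    recovering the back-calculated strength Str_Calc in MPa."""
    u = (y_norm + 1.0) / 2.0 if zero_centered else y_norm
    return y_min + u * (y_max - y_min)
```

The minima and maxima are computed from the given training and test set, as described in step (1), so the round trip normalize -> denormalize recovers the original strength value.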

Shallow neural networks
Several architectures of feed-forward ANNs with three layers, one of which is hidden, have been developed. The number of nodes in the hidden layer varied, aiming to achieve the optimum generalization performance. Therefore, the examined ANNs were shallow and of variable width. All the software was developed in C#. Before feeding the data of a training dataset to the input layer, the following preprocessing was applied: (i) For the given training set of size N Tr , the correlation factors Correl I between each input variable X I (for I=1 to N I ) and the 28-day strength were computed. (ii) A positive threshold Correl Min was assumed. (iii) XN I is fed to the input layer only if the absolute value |Correl I | ≥ Correl Min .
The described comparator acts as a filter of the input variables. If X I passes the filter, then the filter's output is XF IP =XN I ; otherwise, I=I+1. Thus, IP lies in the range from 1 to N IP , where the total number of variables passed is N IP ≤ N I . Afterwards, for each dataset belonging to the training set, the activation function of each node J of the hidden layer accepts the linear combination of the XF I s computed from formula (9).
where XF 0 =1, to take into account the bias and J=1 to N N .
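The correlation filter and the node input of formula (9) can be sketched together. The Pearson form of the correlation factor and the function names are assumptions; formula (9) itself is not reproduced in this text, so the bias-augmented linear combination below follows the description "XF 0 =1 to take into account the bias".

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length
    sequences (assumed to be the Correl_I statistic of the paper)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def filter_inputs(columns, y, correl_min):
    """Pass input variable I to the input layer only if
    |Correl_I| >= Correl_Min (steps (i)-(iii) above)."""
    return [i for i, col in enumerate(columns) if abs(pearson(col, y)) >= correl_min]

def node_input(weights, xf):
    """Formula (9) as described: Z_J = sum_I w_IJ * XF_I,
    with XF_0 = 1 acting as the bias term."""
    return sum(w * x for w, x in zip(weights, [1.0] + list(xf)))
```

With Correl Min = 0.8, only strongly correlated inputs survive the filter, which is the pruning strategy described in the Introduction.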

Activation functions
Nine activation functions (AF) were examined and implemented in the software developed. The set of AFs comprised equations traditionally applied in ANNs as well as others developed over recent years. A short description of them follows: (1) Linear function (Linear): The identity function, having its input as output, can be characterized as a special AF, with which the ANN can have only one node in the hidden layer.
(2) Sigmoid function (Sigmoid): The sigmoid function is frequently referred to as the logistic function, and its results belong to the interval (0, 1).
(3) Hyperbolic tangent function (Tanh): The Tanh function lies within the range of -1 to 1 and is zero-centered.
The ReLU function, proposed and used by Nair et al. 30 , is the most widely used activation function for deep learning applications 29 .
(7) Sigmoid-weighted linear unit function (SiLU): The SiLU and dSiLU functions were introduced and applied by Elfwing et al. 31 , who were motivated by the high performance of similar equations in both classification and reinforcement learning.
The Softsign function, introduced by Turian et al. 32 , has been used in regression problems 33 . Its performance in cement strength prediction is evaluated in this study.
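Several of the AFs named above can be sketched in a few lines. The formulas used here are the standard published ones (SiLU = z·sigmoid(z), dSiLU its derivative, Softsign = z/(1+|z|)); the paper's equations (10) - (18) are not reproduced in this text, so exact correspondence is an assumption.

```python
import math

def sigmoid(z):
    """(2) Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """(3) Hyperbolic tangent, zero-centered, output in (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Rectified linear unit (Nair et al.)."""
    return max(0.0, z)

def silu(z):
    """(7) Sigmoid-weighted linear unit: z * sigmoid(z)."""
    return z * sigmoid(z)

def dsilu(z):
    """Derivative of SiLU: sigmoid(z) * (1 + z * (1 - sigmoid(z)))."""
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))

def softsign(z):
    """Softsign: z / (1 + |z|), output in (-1, 1)."""
    return z / (1.0 + abs(z))
```

Note that Sigmoid, SiLU, and dSiLU have outputs compatible with (0, 1)-type normalization, whereas Tanh and Softsign are zero-centered, which is why the text pairs them with different normalization formulae.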
For each hidden layer's node J, the input of AF is Z J , computed by equation (9). As for normalized inputs XF I , the algorithm applied formulae (5) for the AFs with equations (10) to (11) and (14) to (17).
For the three AFs whose output lies in the interval (-1, 1), i.e., those described by relations (12), (13), and (18), formulae (6) were used. The outputs of all nodes enter the output layer, which calculates the ANN normalized output. The back-calculated Str Calc s are computed via equation (7) if the AFs are given by (10) to (11) and (14) to (17); otherwise, equation (8) is applied.
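The forward pass of such a shallow network can be sketched as below. The output-layer equation is not reproduced in this text, so a linear output node with a bias term is an assumption.

```python
def ann_output(hidden_weights, output_weights, xf, af):
    """Forward pass of a shallow ANN: each hidden node J applies the
    activation function `af` to its input Z_J (the bias-augmented linear
    combination of formula (9)); the output layer then linearly combines
    the node activations. A linear output node with bias is assumed."""
    acts = [af(sum(w * x for w, x in zip(wj, [1.0] + list(xf))))
            for wj in hidden_weights]
    return sum(w * a for w, a in zip(output_weights, [1.0] + acts))
```

The normalized output would subsequently be denormalized via equation (7) or (8), depending on the AF's output interval, to obtain Str Calc in MPa.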

Dynamic models
First of all, a short description is needed of how the dynamic models process the experimental datasets to extract the training and test periods: (1) A certain number of days is assumed as a training period, T Tr . (2) For a given date t, all the production samples belonging to the interval [t, t+T Tr -1] constitute the training data set. (3) The algorithm trains the models Str_28_2, Str_28_7, Str_7_2 by computing the corresponding ANNs' weights via non-linear regression methodologies. (4) The future test set of each mentioned model is found considering the date t+T Tr -1 and the delay between the production date and the early strength measurement. The test period is the minimum possible time interval of one day, if the production dates allow it. According to the quality plan of the plant quality system, physical and chemical tests and mortar preparation are performed on daily average samples of each mill the day after the production date. A date variable, T C , is considered to be the current date. On this date, the following strength results appear if production samples exist:
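The moving-window extraction of steps (1) - (4) can be sketched as follows; how exactly the paper offsets the test date by the measurement delay is not fully reproduced here, so the one-day test horizon at the first post-window date is an assumption.

```python
from datetime import date, timedelta

def dynamic_split(samples, t, t_tr, delay_days):
    """Dynamic windowing: samples with production dates within
    [t, t + T_Tr - 1] form the training set; the test set is taken as the
    single production date following the training window, shifted by the
    production-to-measurement delay. `samples` maps production date to
    its dataset. The exact delay handling is an assumption."""
    train_end = t + timedelta(days=t_tr - 1)
    train = {d: s for d, s in samples.items() if t <= d <= train_end}
    test_day = train_end + timedelta(days=delay_days)
    test = {d: s for d, s in samples.items() if d == test_day}
    return train, test
```

Sliding t forward one day at a time reproduces the sequence of consecutive training/testing set pairs whose errors are aggregated into RMSE Future.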

Results and discussion
The principal objective of the optimization attempted was the best generalization ability of the models, i.e., minimization of RMSE Future . Once this was achieved, the model with the optimized parameters would be utilized in the daily quality control of the studied cement types. The synaptic weights of each ANN were determined by minimizing the objective function given by equation (20), which does not imply that the residual error of the training set is minimal. As the purpose was to minimize the test error, the optimization of the λ parameter became highly significant. Another major factor requiring optimum selection was the activation function type. This is the reason a large number of AFs were examined. Several other parameters also required optimization: (i) Length of the training period.
(ii) Number of nodes in the hidden layer for each chosen AF. (iii) Number of days, N pr , and the multiplier k C , described in steps (xiv) -(xvi) of the Appendix.
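An L2-regularized training objective of the kind discussed above can be sketched as follows. Equation (20) is not reproduced in this text, so the common form - the sum of squared training residuals plus λ times the sum of squared weights - is an assumption.

```python
def objective(residuals, weights, lam):
    """Assumed form of equation (20): training sum of squared errors
    plus an L2 (weight-decay) penalty scaled by the lambda multiplier.
    Minimizing this does not minimize the training error itself, which
    is the point of the regularization."""
    sse = sum(r * r for r in residuals)
    penalty = lam * sum(w * w for w in weights)
    return sse + penalty
```

For λ = 0, the objective reduces to the plain training error; increasing λ trades training accuracy for smaller weights and, as the results below show, for better generalization.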
The length of the training period selected was within the range of 20 to 300 days. For these training periods, the number of consecutive training sets, N Tot , was between 400 and 640. Table 4 shows the median value and the 10 % and 90 % percentiles of the number of data sets within each training set as a function of T Tr . Concerning the testing sets, the median value of the population within each one was two, and the 10 % and 90 % percentiles were 1 and 6, respectively. The small amount of data in each testing set was the reason for choosing RMSE as the single metric. The calculation of RMSE Future is always feasible, whereas choosing the widely used correlation coefficient may lead to erroneous conclusions for small populations or may be inapplicable. If the testing set has only one element, then the calculation of the regression coefficient is not possible because there is no standard deviation of the experimental data. Not only all the AFs but also the impact of the number of nodes on the resulting errors were thoroughly examined. For the presentation of the results, the following abbreviations were used: AFabbr_XN, where "AFabbr" is the abbreviation of each AF referred to in section "Activation functions", and "X" is the number of nodes.
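The argument for RMSE over the correlation coefficient can be made concrete: RMSE is defined for any non-empty testing set, even one with a single measured/predicted pair, whereas a correlation coefficient requires variance in the measurements.

```python
from math import sqrt

def rmse(measured, predicted):
    """Root mean square error of a testing set; computable for any
    non-empty set of (measured, predicted) pairs, including a single
    pair, unlike a correlation coefficient."""
    n = len(measured)
    return sqrt(sum((m - p) ** 2 for m, p in zip(measured, predicted)) / n)
```

For example, a one-sample testing set with a measured strength of 52.0 MPa and a prediction of 50.0 MPa still yields a well-defined error of 2.0 MPa, while the Pearson coefficient would involve a division by a zero standard deviation.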

Correlations between input and output variables
The correlation between each input variable and Str_28 was initially examined through the computation of correlation coefficients. Fig. 3 shows the percentile distribution of these statistics for the whole population of training data sets for a training period of 50 days. Fig. 3 also depicts a threshold Correl Min =0.8. The correlation coefficients between Str_28 and SiO 2 , LOI, Str_1, Str_2, Str_7 were continuously higher than 0.8 in absolute value. The other input variables had correlation coefficients less than 0.8, up to a certain percentile. These results can also be examined from a physical standpoint. A higher LOI corresponds to a higher limestone content in the cement composition, which has a negative impact on the 28-day strength. Thus, the correlation between LOI and Str_28 was strongly negative.
Lower CaCO 3 in the cement leads to a higher SiO 2 , Al 2 O 3 , Fe 2 O 3 , explaining the positive correlation between Str_28 and each of the three oxides.

Fig. 3 - Percentiles of correlation coefficients among input variables and Str_28
On the other hand, the CaO content in the cement was due not only to the clinker and gypsum, but also to the limestone. The result was a worsening of the correlation coefficient of CaO compared to those of the other three oxides. Table 3 indicates that cement types with a higher average LOI value, and therefore lower clinker content and strength, were designed with lower average SO 3 values, explaining the positive correlation between SO 3 and Str_28. As expected, a high correlation existed between any early strength and the typical 28-day one. R45 shows a negative correlation with Str_28, and up to a percentile of 0.3 its correlation is less than 0.8 in absolute value. These low values were attributed to the impact of the grinding conditions of each cement mill, the wide range of compositions of the cement types, and the variance of clinker reactivity.
A systematic search of the shape of the functions between Str_28 and the input variables follows. Concerning the early strengths, the intervals of the percentiles from 0.1 to 1 with a step of 0.1 were computed. In each interval, the average values of the early strength and Str_28 were determined. The same procedure was also applied for R45. The results are plotted in Fig. 4. A significant non-linearity occurs in the function between Str_1 and Str_28 in Fig. 4a, as the shape of the curve is sigmoid. Linearity increases as the early strength age increases from 1 day to 7 days (Figs. 4b - 4c). Specifically, the function between the 7-day and 28-day strength was strongly linear, meaning that the impact of the chemical and physical factors on both ages was similar for the cement types under consideration. The function between the residue at 45 μm and Str_28 in Fig. 4d verifies the results of Fig. 3b. The increase in strength for the last percentile of high R45 is explained by the data presented in Table 3: CEM I 42.5 R was the coarsest cement, but due to its high clinker content, it achieved a high typical 28-day strength. Fig. 5 demonstrates the correlation between the chemical characteristics of each cement type and the 28-day strength. For each of these characteristics, the percentiles from 0.2 to 1 with a step of 0.2 were computed. The function between LOI and Str_28 decreased for the entire population, but this trend was not verified for each cement type separately. For example, this function for CEM I 42.5 R increased non-linearly. The cause was attributed to other parameters, such as clinker activity, partly expressed by C 3 S, and cement fineness, partially explained by R45. The correlation between SO 3 and Str_28 for each cement type generally passed through a weak optimum value or was flat, meaning that the selection of the SO 3 target per cement type was in the optimal range. 
The functions between Str_28 and the four oxides, SiO 2 , Al 2 O 3 , Fe 2 O 3 , CaO, increased when taking all points as a whole. Concerning CEM I 42.5 R and CEM I 52.5 R, the types of the highest clinker content, the correlations for SiO 2 , Al 2 O 3 , and Fe 2 O 3 clearly decreased, the reason being that an increase in these oxides results in a decrease in clinker C 3 S. For these cement types, an increasing function between Str_28 and CaO was not observed, as could have been expected. This was attributed to the low limestone content permitted to be used as a minor component, limestone also being rich in calcium oxide.
Therefore, Figs. 4 and 5 contribute to detecting noticeable non-linearities, which could hardly be captured by linear models. Each curve RMSE Future =f(T Tr ) is convex, presenting a minimum for a definite training period. Initially, as T Tr increases, the testing error decreases. The cause of this behavior is that, for short training periods, some variables take values during the test period that do not belong to the values' range of the T Tr period; the models are therefore obliged to extrapolate, leading to a worsening of the prediction. Following this T Tr value, an increase in the training period causes an increase in RMSE Future , although the variables during the test period take values more likely to be within the values' range of the training period. As seen later, the function RMSE Past = f(T Tr ) is increasing for larger values of T Tr , which means that after a certain T Tr level, an increase in the training error leads to a worsening of the generalization ability. One may notice that the minimum is not sharply pronounced. For an allowable range [RMSE Future , (1+ε)·RMSE Future ], where ε is a small positive number, the corresponding T Tr s are spread over a narrower or wider range. However, the correct training period selection is crucial: for the Str_28_2 model and the Tanh_2N architecture, a choice of T Tr =260 days instead of the optimum of 50 days results in a 21 % higher error. Table 5 presents the ratio between the maximum and minimum RMSE Future for each ANN, over a wide range of training periods. The selection for the maximal training period is 200 days. The minimum T Tr is 30 days when one or two nodes exist in the hidden layer, and 40 days in the case of three nodes. The results verify that the optimal selection of the training period is critical for all the models and ANNs.
Figs. 7a and 7b demonstrate the impact of T Tr on RMSE Past for the model Str_28_2, implemented with the Tanh and ReLU as AFs, using structures with one to three nodes in the hidden layer: for the same AF and T Tr , the training error decreases as the number of nodes increases. Fig. 7c does not verify this trend for the testing errors, especially in the area of optimum test errors: these errors of the ANN with two nodes are smaller than those of the architectures with one and three nodes. Fig. 8 depicts the training and testing errors as a function of T Tr for the model Str_28_7 and the SoftSign and SiReLU functions, using one to three nodes in the hidden layer. One can observe that, for the same AF and T Tr , both types of error decrease as the node number increases. In the two-node case, the two ANNs behave differently, especially in the region of training periods from 50 to 100 days, which provides the optimum testing errors: while the training error of SoftSign_2N is less than that of SiReLU_2N, the inequality is reversed for the corresponding testing errors. This proves the effect of the AF type on the resulting errors. Fig. 9 demonstrates the correlation between the λ-multiplier and the testing error for several combinations of ANN architectures and AFs, for the Str_28_2 model. For each combination, the training period corresponds to that providing the minimum RMSE Future for the optimum λ. A wider or narrower but distinct range of minimum errors occurs in each curve. The shapes of the curves differ noticeably, and the minimal errors are located in a large region of λ for the eight cases presented. One can see that, in the majority of the cases, the λ of optimum error is between 10 and 100. However, the optimum value of this multiplier is 3 for Sigmoid_1N, whereas it is 50 for Tanh_2N. Therefore, no general rule exists for the multiplier's selection, and a case-by-case search is needed. The minimal error of each curve is significantly less than the corresponding error for λ = 0, leading to the conclusion that L2-regularization is an effective technique for improving the generalization ability. 
An optimal selection of the N pr and k C parameters appearing in equations (29) - (30) results in a significant reduction of the test errors of the C1_28_2 and C2_28_2 models compared with those of Str_28_2. Fig. 10 shows the RMSE Future as a function of the mentioned two parameters for both models, using two different AFs. The test errors of Str_28_2 also appear in this figure. It is worth noting that the optimal value of λ is not necessarily the same for the Str_28_2 model and the two models under consideration.
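A corrective model of the C1_28_2 / C2_28_2 kind can be sketched as follows. Equations (29) - (30) are not reproduced in this text, so the specific form below - shifting the base forecast by k C times the mean residual of the auxiliary model over the last N pr days - is an assumption, chosen only to illustrate the role of the two parameters.

```python
def corrected_prediction(base_pred, aux_residuals, n_pr, k_c):
    """Sketch in the spirit of C1_28_2 / C2_28_2: the base Str_28_2
    forecast is shifted by k_C times the mean residual
    (measured - predicted) of the auxiliary model (Str_28_7 or Str_7_2)
    over the last N_pr days. This form is an assumption; the paper's
    equations (29)-(30) are not quoted here."""
    recent = aux_residuals[-n_pr:]
    if not recent:
        return base_pred
    return base_pred + k_c * sum(recent) / len(recent)
```

Under this reading, N pr controls how much recent history the correction averages over, and k C scales how aggressively the base forecast is pulled toward the auxiliary model's recent behavior.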

Optimal neural networks
The combination of activation functions with the number of nodes results in 21 different ANNs.
All the cases studied present an optimum Correl Min threshold between 0.5 and 0.7 for the minimal RMSE Future . Figs. 11a, 11b, and 11c show the minimal mean test errors of each investigated ANN, corresponding to the optimal Correl Min , T Tr , and λ values, for the models Str_28_2, Str_28_7, and Str_7_2. To enlarge the height of each column, in each figure the maximum value of the Y-axis is 10 % greater than the minimum error. The black columns represent testing errors up to 1 % higher than the minimum RMSE Future , whereas the errors shown with grey columns are at least 10 % higher than the minimum one. Therefore, the 1 % margin is considered as the optimality criterion. The optimal point of the ANNs Tanh_1N, Tanh_2N, Tanh_3N, ReLU_1N, SiLU_2N, dSiLU_1N is continuously found in the optimum region for all three models. Additionally, in two out of the three models, the ANNs SiLU_1N and SoftSign_3N behave optimally. For every model, the ANN with linear activation functions failed to meet the optimality criterion. Some tests performed with four-node ANNs gave RMSE Future worse than those of three-node ANNs, which is a clear indication of overfitting, leading to the conclusion that a systematic search of such ANNs is not needed.
The robustness of the generalization ability of a particular ANN can be expressed by the range of training periods having a testing error in a suboptimal region. In each model, the interval [RMSE Future,Min , β · RMSE Future,Min ] is assumed as such a region, where RMSE Future,Min is the minimum mean testing error over all the ANNs. The value of 1.025 is selected for β. Then the following procedure is applied; Fig. 12 shows a graphical example.
(iv) The ratio F Opt = S Opt /S Ref expresses the fraction of the reference surface covered with optimal or suboptimal mean test errors, and it is the measure of the ANNs' robustness. (v) Among all the F Opt , the maximum value F Opt,Max is found. Then the ratios R Opt =F Opt /F Opt,Max are compared. Fig. 13 demonstrates the ratios R Opt for all three models Str_28_2, Str_28_7, and Str_7_2, and all the ANNs investigated. Implementing this optimization criterion, the range of optimal ANNs narrows considerably. Only two or three neural networks per model have R Opt around 0.90 or higher. Fig. 14 shows the corresponding R Opt ratios per ANN for the C1_28_2 and C2_28_2 models. The conclusion from comparing the results of this figure is that only dSiLU_1N meets both optimality criteria. Besides, these two models provide an improved minimum test error compared to that of the model Str_28_2. The minimal RMSE Future is 1.54 MPa for the former models, whereas it is 1.63 MPa for the latter, resulting in a 5.5 % improvement. Table 6 presents the ANNs that meet the optimization criteria for each of the five models, the best test errors, the R Opt s, and the model parameters. The following conclusions can be drawn from these results: (i) The Str_28_7 model is the most accurate of all, reaching an RMSE Future of 1.36 MPa by implementing ReLU_1N and dSiLU_1N, but its usage presents a delay time of eight days, while the Str_28_2 model involves a delay of only three days. This means that the incorporation of the 7-day strength better accounts for parameters not involved in the models, such as clinker reactivity, but with a loss in terms of early forecasting. (ii) The optimal training periods of the Str_28_7 model are approximately twice those of the Str_28_2. The comparison of the models Str_7_2 and Str_28_2 shows the same trend.  
(iii) The corrective models C1_28_2 and C2_28_2 provide a significant improvement in the test error of Str_28_2, with a simultaneous increase in the T Tr and λ parameters, for the optimum neural network dSiLU_1N.
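The robustness criterion of steps (iv) and (v) can be sketched as follows. This is a minimal illustration assuming that, for each ANN, the mean test errors over the grid of training periods are already available as an array; all names and numbers below are hypothetical:

```python
import numpy as np

def robustness_ratios(rmse_by_ann, beta=1.025):
    """For each ANN, F_Opt is the fraction of the reference grid whose mean
    test error falls in the suboptimal band [RMSE_Min, beta * RMSE_Min],
    where RMSE_Min is the global minimum over all ANNs.
    Returns R_Opt = F_Opt / F_Opt_Max per ANN."""
    global_min = min(float(np.min(v)) for v in rmse_by_ann.values())
    f_opt = {name: float(np.mean(np.asarray(v) <= beta * global_min))
             for name, v in rmse_by_ann.items()}
    f_max = max(f_opt.values())
    return {name: f / f_max for name, f in f_opt.items()}

# hypothetical mean test errors (MPa) over a grid of training periods
rmse = {"dSiLU_1N": [1.55, 1.54, 1.56, 1.57],
        "Tanh_2N":  [1.54, 1.60, 1.62, 1.65]}
r_opt = robustness_ratios(rmse)   # dSiLU_1N -> 1.0, Tanh_2N -> 0.25
```

An ANN whose errors stay inside the suboptimal band over most of the grid scores an R Opt near 1, matching the selection rule of R Opt around 0.90 or higher.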
The critical step in finding the optimal network is selecting the type of activation function and the number of nodes. Among all architectures applied, dSiLU_1N, ReLU_1N, and Tanh_2N provide the best results. Fig. 15 examines more thoroughly the correlation of test errors between the corrective models and Str_28_2, over a wide range of training periods and for two different ANNs. The test error of the first two models is consistently smaller than that of the latter, and markedly so for longer training periods, demonstrating that these models are more robust than Str_28_2 in terms of generalization ability.
Beyond RMSE Future , which constitutes a cumulative optimization criterion, it is worth looking more deeply at the distribution of the residual test errors, given by equation (26), for the optimal ANN of each of the five models. Table 7 shows cumulative distributions of these errors for percentiles from 50 % to 95 %, from which the following can be concluded: (i) The median residual error for all models is between 0.82 MPa and 1.24 MPa. (ii) The errors of the models C1_28_2 and C2_28_2 are significantly smaller than those of Str_28_2 for percentiles of 90 % and higher, showing that the contribution of the corrective models is to reduce the worst errors of Str_28_2. (iii) The Str_28_7 model consistently provides the best errors for all percentiles; however, for the 95 % percentile, its error may not be significantly different from that of the corrective models. (iv) The errors of Str_7_2 lie between those of Str_28_7 and the corrective models for all percentiles.
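A percentile tabulation of this kind can be reproduced with a short sketch; the residual values below are hypothetical, and the linear percentile interpolation is an assumption:

```python
import numpy as np

def error_percentiles(residuals, pcts=(50, 75, 90, 95)):
    """Cumulative distribution of the absolute residual test errors,
    evaluated at the percentiles reported per model."""
    abs_err = np.abs(np.asarray(residuals, dtype=float))
    return {p: float(np.percentile(abs_err, p)) for p in pcts}

# hypothetical residual errors (MPa) collected over consecutive test sets
residuals = [0.4, -0.9, 1.2, -1.5, 0.7, 2.1, -0.3, 1.8]
table = error_percentiles(residuals)
```

Comparing such tables across models highlights the tail behavior that a single cumulative RMSE hides, which is exactly the point of conclusions (ii) and (iii) above.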
The minimal values and the robustness of RMSE Future indicate that the C1_28_2, C2_28_2, Str_28_7, and Str_7_2 models could adequately be used for controlling the 28-day or 7-day strength of the cement produced in the mills. A manual or automatic control technique can use the clinker percentage as the control variable and the prediction from the corresponding model as the process variable. The need to construct a dynamic function of the effect of the clinker percentage on strength is evident; the process dynamics presented in Fig. 2 would be the core of the control loop in such a case.

Comparison of results and implementation in daily quality control
The main characteristic of this study is the exclusive usage of industrial quality data. A comparison of the results of this research with literature results in the same field is necessary for a more comprehensive evaluation, especially where the published studies include industrial data. Table 8 presents the results of the 28-day strength prediction, reporting the method implemented and the RMSE of the test sets. All RMSE values belong to the interval [1.3, 1.9] MPa, where the low limit is related to the long-term repeatability of the test method, as explained in the section "Optimal neural networks". The models developed in this study present testing-set RMSEs lower than the mean value of the mentioned interval, proving their generalization ability.
Finally, the application of the models in daily quality control was the actual measure of their performance: the two basic models, Str_28_2 and Str_28_7, were applied continuously for more than 15 months in the Devnya plant, predicting around 1000 daily 28-day strength measurements. Table 7 shows the models' settings and the statistical results. Comparison of the actual results of the two models with those shown in Fig. 11 and Table 7 led to the following conclusions: (1) All the results were in line with those found during the ANNs' design; in particular, the RMSE of Str_28_7 was lower than predicted.
(2) The operation of the algorithms in actual conditions verified that the design procedure built robust models of sufficient generalization ability.
(3) The optimization of the ANN parameters, in combination with the movable training period, guarantees long-term performance.

Conclusions
A series of five models predicting cement strength was developed based on shallow ANNs and exclusively industrial quality data. Three of the five were independent of each other. The models used physical and chemical data as well as: (i) 1- and 2-day strengths to predict those of 28 and 7 days (models Str_28_2 and Str_7_2, respectively); (ii) additionally, the 7-day strength to predict that of the 28-day (model Str_28_7).
The last two models, C1_28_2 and C2_28_2, combined the results of the previous ones, aiming at improving the predictions of the model Str_28_2 while taking into account the time delays. Input variables were filtered before the datasets entered the ANN's input layer: only variables whose correlation coefficient with the 28-day strength was higher than the threshold Correl Min entered the input layer. The ANNs contained from one to three nodes in the hidden layer and were trained dynamically over a period of T Tr days. Prediction of future strength followed for a test period of at least one day. Upon completion of this period, the process was repeated by moving the training period forward. For a given ANN, achieving the best test error requires optimization of the training period. Nine types of activation functions were studied: traditional ones such as the linear, sigmoid, and hyperbolic tangent, and some developed in recent years, like the rectified linear unit, the sigmoid-weighted linear unit, and its derivative. The algorithm applied L2-regularization to the objective function used to find the ANN's synaptic weights and optimized the corresponding weight coefficient λ to obtain the minimal test errors. As optimality criteria, the algorithm used the minimization of RMSE Future and its robustness. The type of activation function is a decisive factor in selecting the optimum ANN for each model. The ANN implementing the derivative of the sigmoid-weighted linear unit and having one node in the hidden layer, dSiLU_1N, showed the smallest and most robust RMSE Future for the models Str_28_2, C1_28_2, and C2_28_2. The test errors for the three models were 1.63 MPa, 1.54 MPa, and 1.54 MPa, respectively. Therefore, the two corrective models were equivalent to each other and improved the error of Str_28_2 by 5.5 %.
The optimum ANN of the model Str_28_7 contained one node in the hidden layer utilizing the rectified linear unit or the derivative of the sigmoid-weighted linear unit (ReLU_1N and dSiLU_1N, respectively), with an RMSE Future of 1.36 MPa. This error was excellent and 11.7 % lower than those of the corrective models, but it was achieved at the expense of delay time compared to the previous ones. The median test residual errors for the optimal ANNs were between 0.82 MPa and 1.24 MPa, within the range of the long-term repeatability of a very competent laboratory.
In conclusion, the main contribution of the present research is the combination of four factors impacting the generalization ability decisively: (a) The usage of early strengths as independent variables to uncover hidden variables like clinker activity and milling conditions. The fact that the models were dynamic and applied to industrial data spread over 27 months was a guarantee that they could be implemented effectively to the daily quality control of the same cement plant. Continuous application of the models in the Devnya plant in actual conditions for more than 15 months showed a performance at least equivalent to that calculated during the design step.
Furthermore, the modeling can be part of a control algorithm that uses the clinker percentage as the control variable and the prediction from the corresponding model as the process variable. The algorithms developed can be applied to the quality control of any cement plant after training with long-term plant data, as long as the data characterize the main properties of the cement. The accuracy and reliability of the raw data determine the quality of the strength prediction. A further improvement of these techniques could include as inputs the chemical characteristics of the clinker, such as C 3 S, C 3 A, equivalent alkalis, and free lime, in case these are measured routinely by an automatic sampling and measurement system.

Equation (20) incorporates the weight decay term by applying L2-regularization. The parameter λ needs optimization to obtain the best generalization ability of the ANN under study. For a given λ value, the optimal parameters of formula (20) are obtained using the Levenberg-Marquardt technique; it is worth mentioning that this method has been used efficiently in ANN training 34,35 . Equations (22), (23) are used if AFs (10)-(13) or (15)-(18) are implemented, and equations (24), (25) are applied in the case of AF (14).
(ix) The b I parameters are used to predict the 28-day strength, Str_28 Calc,J , for all the data belonging to the test set. Then the error of index k is computed as follows: (x) If the test set contains the last dataset of the entire population, the search for a new training set stops. Otherwise, the procedure is repeated starting from step (iii), setting T Fin,Tr = T Next , finding T In,Tr from the formula T Fin,Tr - T In,Tr ≤ T Tr , and increasing the index k by one. This means that, when the algorithm adds datasets for the new T Fin,Tr , it removes the datasets between the old T In,Tr and the new T In,Tr , making the training period movable.
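Steps (iii) to (x) amount to a sliding-window scheme over the dated records. The sketch below is a simplified illustration on integer indices, ignoring the measurement delay T D and calendar gaps, which the full algorithm handles:

```python
def movable_windows(n_records, t_tr, test_len=1):
    """Yield (train, test) index pairs: a training period of t_tr records
    followed by a test period; each iteration the window moves forward,
    adding new datasets and dropping the oldest ones (step x)."""
    windows = []
    start = 0
    while start + t_tr < n_records:
        train = list(range(start, start + t_tr))
        test = list(range(start + t_tr,
                          min(start + t_tr + test_len, n_records)))
        windows.append((train, test))
        start += test_len  # move the training period forward
    return windows

wins = movable_windows(10, t_tr=4, test_len=2)
```

Every record is tested exactly once after first serving in training, which is what allows RMSE Future to be accumulated over the whole population.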
(xi) By executing steps (i) to (x), processing of N Tot consecutive training and test sets occurs. Afterwards, the root mean square training and testing errors, RMSE Past and RMSE Future , respectively, are computed as follows: (xii) The impact of multiplier λ, characterizing the weight decay, on the model generalization is assessed by executing the algorithm for a range of λ from 0 to λ Max . The λ's value providing the minimum RMSE Future corresponds to the best generalization for the given training period.
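Step (xii), the sweep of the weight-decay multiplier λ, can be outlined as follows; the `train_and_test` callable stands in for the full training/testing cycle, and the toy error profile is purely illustrative:

```python
import numpy as np

def best_lambda(train_and_test, lambdas):
    """Run the full cycle for each lambda in [0, lambda_max] and keep
    the value minimizing RMSE_Future (root mean square testing error)."""
    best_lam, best_rmse = None, float("inf")
    for lam in lambdas:
        residuals = np.asarray(train_and_test(lam), dtype=float)
        rmse_future = float(np.sqrt(np.mean(residuals ** 2)))
        if rmse_future < best_rmse:
            best_lam, best_rmse = lam, rmse_future
    return best_lam, best_rmse

# toy residual profile: too little decay overfits, too much underfits
lam_opt, rmse = best_lambda(lambda lam: [1.8 - lam, 1.6 + (lam - 0.4) ** 2],
                            np.linspace(0.0, 1.0, 11))
```

The λ value providing the minimum RMSE Future then corresponds to the best generalization for the given training period.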
The same algorithm implements the two other models, Str_28_7 and Str_7_2, with the only difference that T D = 21 d for the former and T D = 5 d for the latter. There is no peculiarity in the ANN architecture for each model, except for a comparator placed before the input layer: for each training set, the correlation coefficient between each input variable and Str_28 is compared with a threshold value, and the variable enters the input layer only if the correlation is higher.
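The comparator in front of the input layer can be sketched as below; taking the absolute value of the correlation coefficient is an assumption here, and the data are synthetic:

```python
import numpy as np

def filter_inputs(X, y, correl_min):
    """Keep only the columns of X whose correlation coefficient with the
    target y (e.g. Str_28) exceeds the threshold Correl_Min; this is
    re-evaluated for every training set."""
    keep = []
    for j in range(X.shape[1]):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) > correl_min:   # assumption: magnitude, not signed value
            keep.append(j)
    return keep

rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = np.column_stack([y + 0.1 * rng.normal(size=200),  # strongly correlated
                     rng.normal(size=200)])           # uncorrelated noise
cols = filter_inputs(X, y, correl_min=0.5)
```

Because the filter runs per training set, the active inputs can change as the movable window advances, which keeps the network small without fixing the variable list in advance.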
The implementation of the C1_28_2 model follows the next steps, in addition to those already described: (xiii) After the execution of step (xii) for the triplet Str_28_2, Str_28_7, and Str_7_2, the first date of the first testing set, T C , is considered, as determined from the Str_28_2 model.
(xiv) The latest date that has results from both Str_28_7 and Str_28_2, lies at least five days before T C , and has a 7-day strength measurement is found. Starting from this date, T pr , the algorithm considers N pr consecutive past days.
(xv) For all the datasets contained in the time interval [T pr -N pr +1, T pr ], the algorithm calculates the differences between the Str_28_7 and Str_28_2 predictions.
(xvi) The mean value of the differences, D Aver , is multiplied by a coefficient k C , and the product is added to all the Str_28_2 predictions belonging to the test set with first date T C . Thus, for an output of Str_28_2 equal to Str_28_2 Calc , the result is the following: (xvii) The residual error is computed using equation (26) and the output of C1_28_2 instead of that of Str_28_2.
(xviii) The test date, T C , moves to the first one of the next testing set, and the process is repeated starting from step (xiv) until the T C becomes the last date of the entire population.
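Steps (xiv) to (xvi) reduce to a simple moving-average correction. A minimal sketch, with hypothetical prediction values in MPa:

```python
import numpy as np

def c1_correction(str_28_2_test, diffs_past, k_c):
    """Step (xvi): the mean D_Aver of the past differences between the
    Str_28_7 and Str_28_2 predictions, scaled by k_C, is added to every
    Str_28_2 prediction of the test set starting at T_C."""
    d_aver = float(np.mean(diffs_past))
    return [p + k_c * d_aver for p in str_28_2_test]

# hypothetical values: differences over N_pr past days, then the test set
diffs = [0.6, 0.2, 0.4]               # Str_28_7 - Str_28_2 (MPa)
corrected = c1_correction([52.0, 51.5], diffs, k_c=1.0)
```

The correction exploits the higher accuracy of Str_28_7 on past dates, where its delayed inputs are already available, to shift the earlier but less accurate Str_28_2 forecasts.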

ACKNOWLEDGEMENTS
The author is thankful to reviewers for their constructive comments, which have helped to improve the present paper.
References