Towards understanding and prediction of atmospheric corrosion of an Fe/Cu corrosion sensor via machine learning

The atmospheric corrosion of carbon steel was monitored by an Fe/Cu-type galvanic corrosion sensor for 34 days. Using a random forest (RF)-based machine learning approach, the impacts of relative humidity, temperature and rainfall on the initial atmospheric corrosion were identified to be higher than those of airborne particles, sulfur dioxide, nitrogen dioxide, carbon monoxide and ozone. The RF model demonstrated higher accuracy than artificial neural network (ANN) and support vector regression (SVR) models in predicting instantaneous atmospheric corrosion. The model accuracy can be further improved by taking into consideration the significant effect of rust formation on the sensor.


Introduction
Atmospheric corrosion is known to widely impact infrastructure, transportation, energy and other industries and generates high maintenance costs globally [1,2]. Atmospheric corrosion is influenced by the instantaneous (e.g. relative humidity/RH) and cumulative (e.g. deposition of chloride) effects of the constituents of local environments on the metal surface [3]. An accurate and real-time evaluation of atmospheric corrosion provides an important guide to materials selection and engineering design for corrosion mitigation. Monitoring sensors have been used in vehicles and bridges to track the dynamic process of atmospheric corrosion and to understand how such a process is influenced by complex environmental parameters [4,5]. For example, non-electrochemical sensors can monitor the damage status of the substrate materials based on measurements of the change in weight, electrical resistance [6,7] or acoustic properties [8][9][10]. An electrochemical sensor, on the other hand, functions by forming a closed electrochemical cell on the sensor surface and monitors the transient corrosion activities by measuring galvanic corrosion current [4], electrochemical impedance [11,12], linear polarization resistance [13] or electrochemical noise [14,15].
The principle of the galvanic-cell type atmospheric corrosion monitoring (ACM) sensor is based on galvanic corrosion between two electrodes with different electrochemical activities. The galvanic corrosion current of the couple is measured by a galvanometer to reflect the instantaneous corrosion rate in response to changes in the environment [16]. Galvanic-cell type ACM sensors are sensitive to environmental variation and are useful in understanding corrosion in a changing environment [16,17]. For example, Mizuno et al. installed Fe/Ag-type corrosion sensors on various parts of a vehicle for three months and showed that the atmospheric corrosion varied between locations that differed in RH and rainwater exposure [17]. In our earlier work, we analyzed the atmospheric corrosion of a zinc/copper ACM sensor in Beijing in winter and found that the air quality index had a high impact on atmospheric corrosion [4]. More recently, Fe/Cu-type ACM sensors were used in six outdoor atmospheric environments for one month [18]. Statistical analyses of the current values at different times showed that rainfall accounted for only 16.3-29.7 % of the total test duration but contributed 64.6-89.0 % of the atmospheric corrosion of the carbon steels. In addition, the low temperature at night was more conducive to the condensation of moisture, leading to more severe atmospheric corrosion. Compared with the monthly or yearly data obtained from standard corrosion coupons, the minutely or hourly data obtained from ACM studies are often large in quantity and contain rich information about corrosion kinetics as influenced by multiple environmental factors such as RH, temperature and pollutants.
https://doi.org/10.1016/j.corsci.2020.108697 Received 26 March 2020; Received in revised form 19 April 2020; Accepted 21 April 2020
Because of such complexity, it is difficult to establish causal relationships between individual environmental factors and the corrosion process based on physics-based corrosion laws, or to predict the corrosion life of materials [19][20][21][22]. In this case, machine learning, which automates the search for knowledge by learning from example data and experience without relying on predetermined equations, may offer new opportunities to better understand and predict atmospheric corrosion. For example, the artificial neural network (ANN) is the most commonly used data mining method [23,24]. It has been applied to model multivariate corrosion processes to predict crack growth rates [19,20] and the corrosion initiation time of metals [25]. Support vector regression (SVR) analysis has also been used for corrosion prediction. While ANN is more accurate with a large sample size, SVR may present good prediction results even with a small sample size [26]. Using SVR modelling, Fang et al. predicted the corrosion rate of zinc and steel from temperature, time of wetness (TOW), exposure time, sulfur dioxide and chloride concentration [27]. Wen et al. applied SVR to the prediction of the corrosion rate of 3C steel in seawater as influenced by temperature, dissolved oxygen, salinity, pH value and oxidation-reduction potential [28]. Other methods that are less commonly used but are also suitable for mining time-series corrosion data include Markov chains [29,30], grey analyses [31,32] and Monte-Carlo simulations [33,34].
Due to the deeper layers of model structure than general machine learning models, random forest (RF) models possess a good processing ability for data with high variability [35,36]. Thus, RF models are expected to be more suitable for the dynamic atmospheric corrosion processes. For example, the impacts of multiple environmental factors and alloying elements on the corrosion rates of 17 low alloy steels under six different environments over 16 years were quantitatively ranked by an RF approach. The results showed that the environmental factors played more important roles than the chemical compositions of the steels, especially for the initial period of atmospheric corrosion [37]. The pH of rainwater showed the highest influence on the corrosion rate among the environmental factors for the timespan over 16 years, whereas the importance of rainfall was more prominent in the first two years of atmospheric corrosion.
This study aims at describing the atmospheric corrosion on an Fe/Cu type galvanic corrosion sensor exposed in Qingdao, China for 34 days. An RF-based machine learning approach was employed to analyze the contribution of different environmental factors (i.e. temperature, RH, rainfall, sulfur dioxide, nitrogen dioxide, carbon monoxide, ozone and particulate pollutants) to the ACM sensor output. After the most influential parameters were identified, a rational RF model was built for the prediction of the dynamic variation of atmospheric corrosion on carbon steels.

Preparation of the ACM sensors
Fig. 1a shows the assembly of an ACM sensor, which consisted of seven pairs of steel/copper galvanic couples. The carbon steel (0.47 wt% C, 0.18 wt% Si, 0.59 wt% Mn, 0.01 wt% S, 0.01 wt% P, 0.01 wt% Ni, 0.02 wt% Cr, 0.01 wt% Cu) served as the anode, whereas copper (> 99.5 % pure) was used as the cathode. The electrode pairs were separated by glass fiber-reinforced epoxy (FR4) boards for electrical insulation. The exposed area of each electrode was 21 × 1 mm². Each electrode was individually connected with a wire, and the electrodes were assembled together and impregnated in epoxy to obtain the ACM sensor. The anodes and cathodes were connected to the different ends of a galvanometer. The sensor surface was abraded to #1200 using SiC paper prior to the exposure tests.
When a thin electrolyte layer is formed or a droplet is deposited across the insulating board (0.1 mm thick) between the steel and copper electrodes of the ACM sensor, the two electrodes are electrically connected and generate a galvanic corrosion current. Because the anodes and cathodes are separated by the insulating boards, the galvanic current passes through the galvanometer, and the ACM current (I_ACM) is detected and recorded. The relationship between I_ACM and the corrosion rate of the steel anode was reported in our previous study [18]. In other words, by measuring I_ACM, it is possible to quantitatively evaluate the atmospheric corrosion of the carbon steel.

Field exposure test
The ACM test was conducted at the Qingdao atmospheric corrosion test site (120°25′E, 36°03′N) in the National Environmental Corrosion Platform of China (Fig. 1b). The test site is in a coastal region and is located 25 m away from the Yellow Sea. The ACM sensors were installed more than 1 m above the ground, facing south at an angle of 45°. The temperature and humidity sensors were exposed under the same conditions near the ACM sensor. The exposure test was carried out for 34 days from August 2nd to September 5th, 2018. The galvanometer took a current reading every two seconds and recorded the data once per minute. The measuring error was less than 0.5 %, and the resolution of the galvanometer was 0.1 nA, according to the specification of the manufacturer. The measurable current ranged from 0.1 nA to 50 mA; currents outside this range could not be determined.
During the entire test time, at least 97 % of the detected currents were higher than 0.1 nA, indicating that the ACM sensor captured most of the corrosion signals. Each set of data accumulated in this study consisted of I_ACM, temperature (T) and RH data from the ACM sensor, and other environmental data such as rainfall status (rainy or not) and the concentrations of sulfur dioxide (SO2), nitrogen dioxide (NO2), carbon monoxide (CO), ozone (O3), airborne particles smaller than or equal to 2.5 microns (PM2.5) and those smaller than or equal to 10 microns (PM10). The data for SO2, NO2, CO, O3, PM2.5 and PM10 were obtained from the China Meteorological Administration. Chloride deposition at the test site was measured monthly using the dry plate method according to ISO 9225 [38].

Random forests
The RF model used in this study was implemented with the scikit-learn machine learning library. Fig. 2a illustrates the prediction process of the RF method. The RF model consists of 100 (an optimal value shown in Fig. S1a; see Supplementary material 1) classification and regression tree (CART) models, each of which is capable of training and prediction. T, RH, rainfall status, SO2, NO2, O3, CO, PM2.5 and PM10 are the input variables. For each CART in the RF model, the imported sets of data were produced by bootstrap sampling [36]. Bootstrap sampling ensures that the imported sets of data differ from each other, which leads to diversity in the outputs of the different CARTs. The corrosion current I_ACM-i (i = 1, 2, …, 100) was then predicted by the i-th CART model. The final prediction of I_ACM by the RF model was the average of all I_ACM-i values. Previous studies have shown that a single CART model may be unstable and may overfit when the sample size is small, but this negative effect is alleviated by averaging the outputs of the different CARTs assembled in the RF model [37,39]. Therefore, the RF approach constructs and combines several weak models to form a strong model.
In each CART model, the inputs were regarded as vectors distributed in a space. Fig. 2b shows the training and prediction processes of each CART model. The training process of a CART was to split the input space into individual subspaces based on the theory of regression trees [40]. The output of each subspace was assigned a constant value equal to the average output of all training vectors in that subspace. After the training was finished, the prediction process of the CART was to take a data sample without an output and locate the subspace to which it belongs. The prediction value was then acquired from the trained subspace. The training processes of the different CART models adopted the same principle.
Compared with other ensemble methods, the training process of the RF model is based on the random selection of training samples and of the inputs at each node, i.e. each subspace. Therefore, the base models of RF have more diversity, which leads to more precise ensemble results. The important parameters of the RF modelling are the minimum number of samples per subspace and the number of CARTs. When the sample size of a node is less than a certain value, i.e. the total number of training samples divided by 500 (an optimal value determined in Fig. S1b), node segmentation is stopped, i.e. no new subspaces are split off, and the present node becomes a leaf node. Meanwhile, at each node of the CART, a subset of the input features is randomly selected.
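The configuration described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the data are synthetic, the mapping of the "samples per node" rule to `min_samples_split` is an interpretation, and `max_features="sqrt"` is an assumed choice because the size of the random feature subset is not reported.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): nine inputs named after the paper's variables.
rng = np.random.default_rng(0)
n = 3000
feature_names = ["T", "RH", "rain", "SO2", "NO2", "O3", "CO", "PM2.5", "PM10"]
X = rng.uniform(0.0, 1.0, size=(n, len(feature_names)))
X[:, 2] = (X[:, 2] > 0.8).astype(float)  # rainfall status encoded as 0/1
# Made-up response, only so that the model has something to learn.
y = 50.0 * X[:, 1] ** 3 + 30.0 * X[:, 2] - 10.0 * X[:, 0] + rng.normal(0.0, 1.0, n)

rf = RandomForestRegressor(
    n_estimators=100,                    # 100 CARTs, as stated in the text
    min_samples_split=max(2, n // 500),  # stop splitting nodes smaller than N/500 (interpretation)
    max_features="sqrt",                 # random feature subset per node (assumed size)
    bootstrap=True,                      # each CART is trained on a bootstrap sample
    oob_score=True,                      # R² estimated on the out-of-bag samples
    random_state=0,
)
rf.fit(X, y)

# The forest prediction is the average of the individual CART predictions.
per_tree = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])
forest_pred = rf.predict(X[:5])
print("OOB R2:", round(rf.oob_score_, 3))
```

Comparing `per_tree.mean(axis=0)` with `forest_pred` confirms the averaging step described for Fig. 2a.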
The data samples not selected for training in each CART model during the bootstrap sampling step, termed the out-of-bag (OOB) samples, can be used to calculate the importance of the inputs [37]. For each CART, the importance is calculated by adding a disturbance to each input of the OOB data in turn and then evaluating the amplitude of the resulting variation in the predicted results. By comparing the variation amplitudes caused by the different inputs, the importance of each input in one CART is obtained. Finally, the importance results of all CARTs in the RF model are averaged to quantify the importance of the different inputs.
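The disturbance-based importance described above is closely related to permutation importance. The sketch below uses scikit-learn's `permutation_importance` on a held-out split as a stand-in; it is not the per-tree OOB procedure of the paper, and the synthetic data, in which only T, RH and rainfall carry signal, are an assumption made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): only the first three inputs influence the output.
rng = np.random.default_rng(1)
names = ["T", "RH", "rain", "SO2", "NO2", "O3", "CO", "PM2.5", "PM10"]
n = 2000
X = rng.uniform(0.0, 1.0, size=(n, len(names)))
X[:, 2] = (X[:, 2] > 0.8).astype(float)
y = 40.0 * X[:, 1] + 25.0 * X[:, 2] - 15.0 * X[:, 0] + rng.normal(0.0, 1.0, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle one input at a time and record how much the score degrades.
result = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=0)
order = np.argsort(result.importances_mean)[::-1]
ranking = [names[i] for i in order]
for i in order:
    print(f"{names[i]:>6s}  {result.importances_mean[i]:.3f}")
```

Inputs whose shuffling barely changes the score end up at the bottom of `ranking`, which mirrors how the pollutants are ranked below T, RH and rainfall status in this study.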

ANN and SVR modelling
For comparison, back-propagation ANN and SVR models were also implemented with the scikit-learn machine learning library in this study. The input and output variables remained the same as those for the RF model. For the ANN model, the rectified linear unit (ReLU) function was used as the activation function. The model has one hidden layer consisting of 100 units (default value). The stochastic gradient descent method was employed for the parameter tuning. The error tolerance was set to 1 × 10⁻⁴, and the maximum iteration number was 100 after parameter optimization, as shown in Fig. S1c, d. For the SVR model, the radial basis function (RBF) was set as the kernel function and the penalty coefficient as 300 (Fig. S1e, f). The width of the RBF was 1, which was selected after several trial simulations to avoid poor fitting and reduce prediction errors. The remaining parameters were the default values of the scikit-learn library. More details about the working mechanisms of the ANN and SVR models can be found in Supplementary Material 2.
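The two comparison models might be set up in scikit-learn roughly as shown below. Several points are assumptions rather than reported settings: the "width of RBF" is mapped to `gamma=1.0`, the inputs and output are standardized before fitting (a common precaution for gradient descent and kernel methods that the text does not mention), and the data are synthetic.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# Synthetic stand-in for (T, RH, rainfall status) -> current (assumption).
rng = np.random.default_rng(0)
n = 600
X = rng.uniform(0.0, 1.0, size=(n, 3))
X[:, 2] = (X[:, 2] > 0.8).astype(float)
y = 40.0 * X[:, 1] + 25.0 * X[:, 2] - 15.0 * X[:, 0] + rng.normal(0.0, 1.0, n)

# Standardization is an added assumption, not a step reported in the paper.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
y_std = (y - y.mean()) / y.std()

ann = MLPRegressor(
    hidden_layer_sizes=(100,),  # one hidden layer with 100 units
    activation="relu",
    solver="sgd",               # stochastic gradient descent
    tol=1e-4,
    max_iter=100,
    random_state=0,
)
svr = SVR(kernel="rbf", C=300.0, gamma=1.0)  # gamma=1.0 is an assumed mapping

with warnings.catch_warnings():
    # 100 iterations may stop before convergence; silence that notice here.
    warnings.simplefilter("ignore", ConvergenceWarning)
    ann.fit(X_std, y_std)
svr.fit(X_std, y_std)

# Map the standardized predictions back to the original units.
ann_pred = ann.predict(X_std) * y.std() + y.mean()
svr_pred = svr.predict(X_std) * y.std() + y.mean()
```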

Evaluation criteria
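The fitting errors of the training and prediction samples from the different models were evaluated based on the coefficient of determination (R²), calculated by Eq. (1), and the root mean square error (RMSE), calculated by Eq. (2):
$$R^2 = 1 - \frac{\sum_{n=1}^{N} (\hat{y}_n - y_n)^2}{\sum_{n=1}^{N} (y_n - \bar{y})^2} \qquad (1)$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (\hat{y}_n - y_n)^2} \qquad (2)$$
where $N$ is the total number of samples in the dataset; $\hat{y}_n$ and $y_n$ represent the prediction value and the true value of the n-th sample, respectively; and $\bar{y} = \sum_{n=1}^{N} y_n / N$ is the mean of the true values.

Results and discussion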

Effect of environmental factors in an outdoor environment
As shown in Fig. 3, the 10-dimensional atmospheric corrosion data, acquired in the one-month exposure test conducted in Qingdao, China, included I_ACM, T, RH, rainfall status, SO2, NO2, O3, CO, PM2.5 and PM10. The I_ACM, RH, temperature and rainfall data were reduced to a frequency of once per hour to match the other environmental data. As shown in Fig. 3a and b, the value of I_ACM varied as a result of the dynamic nature of the atmospheric corrosion. An increasing trend was observed for I_ACM during the one-month test period. This phenomenon may be due to the growth of the rust layer, which can absorb moisture on the sensor surface and reduce the threshold RH for corrosion to be detected by the ACM sensor [41,42]. The conductivity of the rust is far lower than that of the electrolyte on the sensor [43][44][45]. Thus, we infer that the change in the corrosion product conductivity during rust formation would not have a major impact on the sensor output current. Moreover, the variation of I_ACM showed a similar trend to those of the outdoor temperature and RH values. For the pollutants and airborne particles […] The low current values at the beginning of the exposure test were attributed to the difficulty in forming a continuous thin electrolyte layer on the fresh sensor surface. As corrosion developed on the sensor, we observed a sharp increase in the current value, which could be explained by the hygroscopic effect of the rust layer. Afterwards, the continuous buildup of the rust caused the current to decrease slightly over time.
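The reduction of the minute-level records to hourly values mentioned at the start of this subsection could be done as sketched below. The text does not state how the values within each hour were combined, so averaging is an assumption here, and a binary input such as rainfall status would need a different rule (for example "rainy if any minute was rainy"). The function name `hourly_mean` and the sample values are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def hourly_mean(timestamps, values):
    """Average minute-level readings within each clock hour (assumed rule)."""
    buckets = defaultdict(list)
    for ts, value in zip(timestamps, values):
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(vals) for hour, vals in sorted(buckets.items())}


# Two hours of made-up one-minute current readings (nA).
start = datetime(2018, 8, 2, 0, 0)
stamps = [start + timedelta(minutes=m) for m in range(120)]
currents = [float(m) for m in range(120)]

hourly = hourly_mean(stamps, currents)
for hour, value in hourly.items():
    print(hour.isoformat(), value)
```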
To compare the importance of each environmental parameter to the initial atmospheric corrosion over the one month of exposure, an RF model was established using the environmental parameters as the inputs and I_ACM as the output; the results are shown in Fig. 4. The three most important environmental parameters for the I_ACM output were temperature, rainfall status and RH. In general, the variations of the temperature and the RH showed opposite trends, and a lower temperature favors the condensation of a thin electrolyte layer on the ACM electrodes. The I_ACM value was also remarkably affected by rainfall, which provided a more dynamic and longer-lasting electrolyte film [46,47]. According to Fig. 4, PM2.5 is the most important factor among all pollutants, which may be attributed to the high mass fraction of water-soluble ions (62 %) in PM2.5 in Qingdao [48,49]. In comparison, the mass fraction of these ions in PM10 is 35 % [49]. Smaller airborne particles also tend to be acidic and contain more corrosive sulphates that influence the corrosion process [50][51][52]. However, the airborne particles and pollutants are far less important than temperature, RH and rainfall status, which may be explained by their low concentrations at the Qingdao test site and the short exposure time. The monthly average concentrations of PM2.5, PM10, NO2 and SO2 were only 16.9 μg m⁻³, 40.2 μg m⁻³, 12.9 μg m⁻³ and 9.5 μg m⁻³, respectively. It should be noted that the deposition of chlorides was not included in the model because its value is typically recorded at intervals of a month or longer by the dry plate method, according to ISO 9225 [38]. However, the impact of chlorides is cumulative and becomes more dominant in the long-term corrosion of steels under marine atmospheres [53][54][55][56][57].
Chloride enrichment, which could induce the formation of non-protective akaganeite [56][57][58][59], was not found in the corrosion product layer on the ACM sensor after one month of exposure (Figs. S3-S4). Besides the short exposure time, the absence of chloride enrichment may also be related to the relatively low chloride deposition rate (71.03 mg m⁻² d⁻¹) and the cleaning effect of the frequent rainfall in the test period [60]. Therefore, it is reasonable to consider that chlorides did not play a major role in the present study.

Prediction of atmospheric corrosion based on the RF model
Traditional methods to predict atmospheric corrosion of carbon steels generally follow ISO 9223:2012 [61] and require the accumulation of environmental data on a yearly basis. These yearly average values cannot reflect the dynamic variation of the environmental conditions and their complex instantaneous effects on the corrosion kinetics, which could lead to inaccurate predictions, especially for short-term corrosion [62]. In the previous section, temperature, RH and rainfall status showed much stronger influences than the rest of the environmental parameters on the atmospheric corrosion of carbon steels. Therefore, these three parameters were selected as the inputs to establish a model for the prediction of the I_ACM output. To validate the performance of the models, the corrosion data samples were divided into training data and testing data. The entire dataset, consisting of 48,647 data samples, was used for the models: 43,782 samples (90 %) were randomly selected as the training samples, and the other 4,865 samples (10 %) served as the testing samples for the evaluation of the predicting performance [63]. Fig. 5 presents the fitting results for both training and prediction. The abscissa of each figure is the actual I_ACM value obtained from the ACM sensor, while the ordinate represents the current value predicted by the ANN, SVR or RF model. The red diagonal line represents the true-prediction line on which the predicted values equal the corresponding true values. Points located closer to the red diagonal line represent smaller prediction errors. Compared to the fitting by the ANN and SVR models, the fitting by the RF model generated the lowest root mean square error (RMSE) (8,014.3 nA) and the highest coefficient of determination (R²) (0.526), demonstrating the highest prediction accuracy. The accuracy of the SVR prediction is substantially higher than that of the ANN prediction, which is consistent with previous studies [26,28].
The low accuracy of the ANN model may be attributed to its limited capacity for processing the large noise in the I_ACM data [64]. Fig. 5d-f shows the prediction results of the three algorithms. The prediction results of RF are the best of the three algorithms, and the value of R² is maintained at a good level. Meanwhile, RF shows no tendency to overfit in any range.
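The evaluation procedure used in this section, a random 90/10 split followed by R² and RMSE as defined in Eqs. (1) and (2), can be written out in a few lines of plain Python. The helper names (`split_indices`, `r2`, `rmse`) and the seed are illustrative assumptions; the sketch only shows the arithmetic and is not the code used in the study.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.1, seed=0):
    """Randomly split sample indices into training and testing sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def r2(y_true, y_pred):
    """Coefficient of determination, Eq. (1)."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error, Eq. (2)."""
    squared = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# With the dataset size reported above, the split sizes match the text.
train_idx, test_idx = split_indices(48647)
print(len(train_idx), len(test_idx))
```

A prediction that always returns the mean of the true values gives R² = 0, which is a useful reference point when reading the R² of 0.526 reported for the RF model.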

High-accuracy model for predicting atmospheric corrosion
As demonstrated above, the RF model produced more accurate predictions of the atmospheric corrosion than the SVR and ANN models. However, the prediction errors, as reflected by the RMSE and R² values, were still quite high, and the models required further improvement. Fig. 1b shows the sensor surface after 34 days of exposure in Qingdao. Clearly, the surface of the carbon steel electrodes of the sensor was almost fully covered with corrosion products. As discussed in the previous sections, it is reasonable to suspect that the growth of the rust layer on the steel electrode may affect the I_ACM output in response to the environmental impacts. The growing rust layer increased the roughness of the ACM sensor surface and enhanced its hygroscopicity, which made the surface RH differ from the ambient RH of the environment [65]. Meanwhile, a thicker rust layer would favor the storage of the electrolyte and promote corrosion. According to the positive correlation between ACM current and corrosion rate [17], the corrosion extent of the carbon steel anode can be reflected by the electrical quantity output of the ACM sensor (Q_ACM), which is obtained by integrating I_ACM over the test time according to Eq. (3), in which 1 min is the data acquisition interval. Fig. 6 shows the day-to-day variation of the importance indices of the top three environmental factors (i.e. temperature, RH and rainfall status) and of Q_ACM to the atmospheric corrosion, as determined by the RF model. The importance of the temperature remained relatively stable during the one-month test period. Notably, the importance of the rainfall was much higher than that of the RH during the first half of the exposure test. The importance of Q_ACM, which reflected the rust formation on the sensor, was generally low in the beginning but exhibited a sharp increase afterwards.
In the second half of the exposure test, the formation of the rust layer became the most important factor for the output of I_ACM. At the beginning of the exposure, the surface of the steel electrode of the ACM sensor was smooth, which made it difficult to absorb moisture from the air. At this stage, the formation of the thin electrolyte layer that is necessary to connect the anodes and cathodes and generate I_ACM outputs mainly depended on the rainfall. As the atmospheric corrosion progressed, the rust layer on the steel electrodes became thicker and the surface roughness increased. Thus, the ability of the sensor surface to collect moisture was enhanced, leading to a decrease in the critical RH needed to generate I_ACM outputs [66]. As a result, the importance of RH to the atmospheric corrosion increased after the first half of the exposure test.
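The running electrical quantity Q_ACM discussed above is a time integral of I_ACM sampled once per minute. A minimal sketch of that accumulation is given below, assuming a simple rectangle rule with a 60 s step and currents in nA, so that Q_ACM comes out in nC; the exact discretization and units used in the study may differ, and the function name is illustrative.

```python
from itertools import accumulate


def cumulative_charge(i_acm_na, dt_s=60.0):
    """Running Q_ACM (nC) from one-minute I_ACM readings (nA), rectangle rule."""
    return list(accumulate(current * dt_s for current in i_acm_na))


# Three made-up one-minute readings in nA.
q_acm = cumulative_charge([1.0, 2.0, 3.0])
print(q_acm)
```

Because Q_ACM only ever grows, it acts as a proxy for how much rust has accumulated, which is why it can carry information that the instantaneous T, RH and rainfall inputs cannot.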
The previous section confirmed that the formation of the rust layer on the ACM sensor had a crucial effect on the I_ACM output. Thus, the rust formation on the ACM sensor should be considered when developing predictive models for atmospheric corrosion. Fig. 7 summarizes the fitting results based on ANN, SVR and RF models that took the rust formation on the sensor (i.e. Q_ACM) into consideration as an input parameter in addition to temperature, RH and rainfall status. Compared with Fig. 5, Fig. 7 shows obvious improvements in the prediction accuracy for all models, with the data points being more narrowly distributed along the diagonal line. In particular, the value of R² for the RF model increased substantially from 0.526 to 0.940 and the RMSE decreased from 7,390.5 nA to 2,311.8 nA after adding Q_ACM as an input parameter.
To further verify the importance of the evolution of the rust formation for the prediction of I_ACM, the fitting results of the predictions by the ANN, SVR and RF models were compared over three different time periods (i.e. 0-11 days, 0-22 days, 0-34 days) (Fig. 8). When Q_ACM is not included in the models (Fig. 8a-c), the datasets become increasingly scattered as the exposure time is extended. The corresponding R² values generally decreased (Fig. 8g). These results indicate that, without the consideration of the rust formation, the models become less accurate as the time increases. In contrast, when rust formation was considered in the models (Fig. 8d-f), the distribution of the data points extended along the diagonal line but did not widen with increasing exposure, suggesting that the prediction accuracy is stabilized. As shown in Fig. 8g, the R² values for the ANN and SVR predictions increased sharply when a longer time was included in the models. The R² value for the RF prediction remained stable at an excellent level of 0.94. The growth of the rust layer on the ACM sensor had a significant impact on I_ACM, so a high-accuracy model should take Q_ACM into account for the prediction of atmospheric corrosion of steels.
Finally, as shown in Fig. 9a, two RF models were built and trained differently using the data from August 2nd to 30th to predict the data from September 1st to 5th. One model (RF1 in Fig. 9a) selected temperature, RH and rainfall status as the input parameters and the other (RF2) added Q_ACM as a further parameter. Unlike the prediction process in RF1, the Q_ACM at 23:59 on August 30th was selected as an input to the RF2 model for the first step, after which the prediction sequence in RF2 strictly followed chronological order from 00:00 on September 1st to 00:00 on September 5th, updating Q_ACM based on Eq. (3) at every minute. Fig. 9b shows the variation of the I_ACM predicted by the models with and without the Q_ACM correction as compared with the actual data collected from September 1st to 5th. Although the variations of the predicted I_ACM showed trends very similar to that of the actual values, a much larger fluctuation was observed in the curve predicted by the uncorrected RF model. To quantitatively compare the accuracy of the two models, the I_ACM values were divided into four levels (I_ACM = {I1, I2, I3, I4}, where I1, I2, I3 and I4 are 0-800 nA, 800-1500 nA, 1500-3000 nA and > 3000 nA, respectively), and the number of predicted values that fell in the same level as the actual values was counted [4]. As shown in Table 1, the rate of accurate prediction improved from 87.8 % to 94.7 % after correcting the model with the consideration of Q_ACM. In conclusion, the incorporation of Q_ACM as an input greatly improved the precision of the ANN, SVR and RF models. The RF model with the correction considering the rust formation on the ACM sensor can be used as a high-accuracy model for predicting atmospheric corrosion. Furthermore, this corrected model implies the possibility that, once sufficient training data are available, atmospheric corrosion could be predicted based on simple environmental sensors without expensive corrosion sensors.
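The RF2 prediction sequence and the four-level accuracy count described above can be sketched as follows. The `predict` argument is a hypothetical callable standing in for a trained model, the 60 s rectangle-rule update of Q_ACM is an assumed discretization, and placing the boundary values 800, 1500 and 3000 nA in the lower level is an assumption, since the text gives only the ranges.

```python
from bisect import bisect_left

# Upper edges of levels I1-I3 in nA; values above the last edge fall in I4.
LEVEL_EDGES_NA = (800.0, 1500.0, 3000.0)


def level_of(i_acm_na):
    """Return 0-3 for levels I1-I4 (boundary values go to the lower level)."""
    return bisect_left(LEVEL_EDGES_NA, i_acm_na)


def level_accuracy(actual, predicted):
    """Fraction of predictions that fall in the same level as the actual value."""
    hits = sum(level_of(a) == level_of(p) for a, p in zip(actual, predicted))
    return hits / len(actual)


def rollout(predict, env_rows, q0_nc, dt_s=60.0):
    """Predict minute by minute, feeding the updated Q_ACM back as an input.

    predict(temperature, rh, rain, q_acm) is a hypothetical model interface.
    """
    q_acm = q0_nc
    predictions = []
    for temperature, rh, rain in env_rows:
        i_pred = predict(temperature, rh, rain, q_acm)
        predictions.append(i_pred)
        q_acm += i_pred * dt_s  # accumulate predicted charge for the next step
    return predictions, q_acm


# Toy model (assumption): a constant 10 nA regardless of the inputs.
preds, q_final = rollout(lambda t, rh, rain, q: 10.0, [(25.0, 80.0, 0)] * 3, 0.0)
accuracy = level_accuracy([100.0, 900.0, 2000.0, 4000.0],
                          [200.0, 1600.0, 2500.0, 3500.0])
print(preds, q_final, accuracy)
```

In such a rollout, the measured current is needed only to supply the initial Q_ACM; every later step relies on the environmental inputs and the model's own accumulated output.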

Conclusions
An Fe/Cu type galvanic ACM sensor consisting of carbon steel anodes and pure copper cathodes was exposed in an outdoor atmospheric environment in Qingdao for 34 days. An RF-based machine learning method was used to assist in the analysis and prediction of I_ACM values, and the results were compared with those of ANN and SVR models. Among the environmental factors considered, temperature, RH and rainfall status showed the highest importance to I_ACM. The pollutants SO2, NO2, O3, CO, PM2.5 and PM10 were not the main factors for I_ACM because of their low concentrations in Qingdao. The predicting ability of the RF model was much stronger than that of the ANN and SVR models in training and predicting I_ACM. The growth of the rust layer on the ACM sensor surface reduced the accuracy of the I_ACM prediction. Taking into consideration the rust formation on the surface of the ACM sensor, an improved RF model with the addition of Q_ACM as an input was established and showed a significantly higher accuracy for the prediction of atmospheric corrosion of carbon steels. Potentially, the predictive model developed in this work could be useful in the future development of smart corrosion sensors with the ability to offset environmental damage to the sensor and maintain high accuracy over extended usage.

Declaration of Competing Interest
The authors declare no conflict of interests.
Supplementary material related to this article can be found, in the