Application of machine learning for crack detection on concrete structures using CNN architecture

ABSTRACT Cracks in concrete structures are caused by irregular contraction and expansion and by other damage to buildings. Engineers assess these irregularities and damages either manually or through machine-learning identification and prediction models, in order to evaluate their impact on the structural health of buildings. This research applies machine learning based on the VGG16-Net model to the detection of cracks in concrete structures. The proposed model has a CNN (convolutional neural network) + VGG based architecture. The study uses the gradient boosting algorithm for image segmentation. The datasets are obtained from Kaggle and the library used is Hugging Face Transformers. The performance of the developed models is evaluated with the metrics accuracy, precision, recall and F1-score. The accuracy obtained is compared against that of the ViT (Google vision transformer). The proposed model achieved a 98% validation accuracy rate with 0.3% loss. The research thus contributes a novel ML model that identifies cracks in concrete structures with lower loss and higher accuracy using a CNN architecture than ViT (vision transformer) models. The study also gives future researchers further evidence, for comparative analyses, that CNN models can be more accurate than ViT models.


INTRODUCTION
In recent years, research on the structural health of concrete has increased rapidly. Investigators have studied infrastructure such as dams, buildings, bridges and roads with respect to health, impact, mishandling and other structural observations [1]. The expansion and contraction of concrete in buildings has been observed as the primary reason to study the health of concrete structures. To identify damage, cracks and breaks in concrete structures, investigators have in recent years leaned towards machine learning models rather than direct physical inspection, which requires more time, budget, labour and human intervention.
Researchers have applied machine learning models in concrete structural health analysis to study variations in expansion, vibration, features (frequency and spatial), contraction and dampness [2]. Automatic detection of cracks in concrete structures, and of how they affect structural health, has recently centred on vision transformers (ViT).
The studies [3] and [4] examined pavement cracks with ViT models and maintained that convolutional neural networks (CNNs), used as the base network for identifying cracks in pavements, are the better-performing models for image segmentation and classification.
CNN models have been identified as the better models for detecting cracks in images of pavements and concrete structures. However, the study [5] maintained that combining deep-learning CNN models with unmanned aerial vehicles (UAVs) as a dual process is faster and more robust and reliable than detecting cracks in images from random inputs. Automated crack detection models have been evaluated more often with digital images as input than with real-time video. Cracks in concrete structures are measured through features such as the width, length, thickness, thinness, mass and depth of the crack [6]. Although manual crack detection is measurable, its outcome is not always reliable, since human intervention leaves considerable room for error and miscalculation. Hence, cracks in concrete structures and pavements are detected automatically through deep learning, where advanced techniques such as transformers have recently been reported to perform better than CNN models, with higher accuracy [7]. The NN models traditionally adopted are ResNet50 [8], AlexNet [9], VGG16-Net [10], LeNet [11], GoogleNet [12,13], MobileNets [14] and DenseNet [15].
Transformers were initially applied only in natural language processing (NLP), where they replaced RNN (recurrent neural network) models based on the long short-term memory (LSTM) approach. They were later used in question answering, language translation and text classification [16] applications as well. Transformer models have dominated SOTA (state-of-the-art) performance and efficiency on NLP datasets more than in other categories. Vision transformers (ViT) are similarly popular for their advantages over CNNs in cost, processing time and speed. However, there is an ongoing debate, since in certain areas CNN models are more accurate than transformers. ViT can thus be regarded as a major contributor to computer vision, where it is adopted for applications such as image segmentation, video understanding, object detection and image detection.
In this research the author develops a traditional CNN model to detect cracks in concrete structures and examine their health, with the main purpose of "applying machine learning on concrete structures to monitor and examine the health of concretes". The research also pursues the following objectives:
• To examine the accuracy achieved by a conventional model in detecting cracks in concrete structures;
• To examine the accuracy achieved by a contemporary model in detecting cracks in concrete structures;
• To compare the outcomes and identify the most appropriate model based on cost, time, resources and accuracy rate through metric evaluation.

LITERATURE REVIEW
The review focuses primarily on studies of convolutional networks for detecting cracks in concrete structures and monitoring concrete structural health. Models that use CNN architectures in deep learning and vision-based approaches (i.e. digital image segmentation and vision transformers) are reviewed as secondary literature.

Structural health monitoring and crack detection in convolutional networks
Authors [17,18] inspected concrete structures to study how cracks appear, using CNN architectures. Authors [17] developed a deep CNN model that achieved a 92% F1-score and recall. Similarly, the model developed by authors [18], which was based on an ANN (artificial neural network), achieved 99% accuracy with an image processing technique and a machine learning algorithm, using 1000 sample images. Authors [19] used multi-resolution analysis (ResNet50 and AlexNet) with deep learning as their data analysis technique. Their CNN model achieved 90% accuracy with 56000 images. The outcomes were compared against the SDNET2018 model developed by [20]. These studies show that the models mostly used CNN architectures for better accuracy and that the sample size varied from 1000 to 56000 images. However, the larger the dataset, the longer the processing time and the higher the computing cost. Hence, the authors advise in their studies that smaller samples be used for accurate and robust crack detection in concrete structures.
Adopting a hybrid machine learning approach, authors [21] developed a model that combines an SVM (support vector machine) with a CNN architecture and uses an unmanned aerial vehicle (UAV). The model achieved 92% accuracy, higher than single-classifier and image processing methods for crack detection in concrete structural health monitoring. Authors [22] used PCA (principal component analysis) with four machine learning algorithms (gradient booster, decision tree, AdaBoost and randomized tree) in their model. They found AdaBoost to be the effective algorithm, whereas the gradient booster and randomized tree algorithms showed overfitting issues that should be prevented in future studies to achieve higher accuracy. From these studies it can be deduced that adopting the decision and random tree algorithms can cause overfitting; researchers should therefore choose the algorithm carefully before analysing the data for crack prediction.

Traditional methods versus contemporary methods
Several researchers [23][24][25][26][27][28] have studied traditional methods of crack detection in buildings and pavements. Their common finding is that traditional methods are more accurate than transformer models. Traditional crack detection uses ResNet, AlexNet, MobileNet, Inception, convolutional, DenseNet and LeNet models, which have more layers than transformers. The accuracy of their outcomes was reported as 99%, with an error rate of 1%. In contrast, several authors [1][2][3][29,30] focused exclusively on vision-based crack detection models. They claimed that, although traditional models are more accurate than vision transformer models, which on average produce outcomes of 95% accuracy, transformer models are rapid and robust and incur lower cost and less time for computing and processing. Hence, recent researchers have adopted vision transformers as crack detection models, even though they are less accurate than traditional models in detecting micro cracks, macro cracks and complex and closely spaced cracks in concrete buildings, pavements and asphalt blocks. To monitor the health of concrete structures, authors have developed prediction and identification models that also reduce manual labour and the errors caused by human miscalculation; automated machine-learning models are thus adopted to remove human intervention from crack detection. To make the models efficient and accurate, researchers have increasingly adopted reliable models such as CNN, ANN, ResNet and vision transformers (ViT), depending on their needs and research objectives.

Transformers
Gehri et al. [6] used a digital image correlation technique in an automated model to detect cracks in images; the model achieved higher accuracy in detecting complex cracks than micro and macro cracks in concrete structures. The authors used a kinetic measurement technique and achieved higher accuracy. However, the kinetic measurement method gave biased outcomes for closely spaced cracks. Crack detection models that are biased in complex crack detection produce irregular and inconsistent results that could affect the research. Authors [3] later proposed a deep learning model for 'slope surface' crack detection from images. They found that the vision transformer model achieved 94% accuracy, whereas the other models (LeNet, AlexNet, InceptionA, MobileNet, ResNet and InceptionE) each achieved a higher accuracy of 99%. In 2021, authors [31] proposed a vision transformer model (SOTA) through a visual interpretation technique that uses a CNN architecture. They applied a CNN architecture model and a ViT model to Chinese and German asphalt samples. The ViT model obtained 99% accuracy on the German samples and 91% on the Chinese samples. The authors thus showed that the ViT model can also achieve high accuracy, with lower cost and less time for processing the data.
Similarly, authors [4] proposed visual interpretation through deep learning and authors [7] proposed a deep learning model for slope-surface crack detection. Both studies compared CNN and ViT architectures and found that CNN achieved a higher accuracy score than ViT. However, they also noted that cost, time, labour and resource requirements were high for the CNN model with its stacked-up layers. They therefore concluded that ViT models are better for crack detection in small-budget research, with good accuracy, whereas CNN models are costlier but highly efficient, with accurate outcomes. Yu et al. [32] examined damage in building structures using a deep learning CNN model and authors [33] examined vision-based crack detection in concrete structures using a hybrid CNN approach. Both studies concluded that the CNN model is accurate in crack detection. Later, the study [34] used an optimization algorithm (improved bird-swarm algorithm, IBSA) in a CNN model and found it to be significantly more accurate and precise, with minimal loss, than earlier approaches. Similarly, authors [35] examined concrete structures using a CNN model with the enhanced chicken-swarm algorithm (ECSA) and found that CNN models achieve higher accuracy and performance in detecting cracks. These studies indicate that using an optimization algorithm with a CNN model increases accuracy over a hybrid or plain CNN architecture.
Hence, to test these findings and to see how outcomes differ between ViT and CNN crack detection models, the current study compares the outcomes of the CNN and ViT models in terms of accuracy, computing cost and processing time.

PROPOSED METHOD
The proposed method comprises several processing phases. The flow of the research (refer to Figure 1) is:
• Pre-processing the images of the concrete structures,
• Evaluating, classifying and categorizing the images,
• Segmenting the crack and non-crack images and storing them in separate folders and
• Measuring the cracks and comparing the results of both models.
The concrete structural image is first input to the model. The image is then pre-processed and classified as crack or non-crack, the results are segregated, the final output is stored and the models are compared using accuracy as the evaluation metric.

Architecture of the proposed model
The proposed model is the VGG16-Net model for assessing concrete structural health. The images are obtained and pre-processed, and the model is then applied to the datasets.
The developed model has 7 stages. In the first stage, the input image (i.e. the concrete structure) is passed to the model. In the second stage, 2 × convolutional layers are stacked with a size of 224 × 224 × 64. In the third stage, one max-pool layer and 2 × convolutional layers are stacked with a size of 112 × 112 × 128. In the fourth stage, one max-pool layer and 3 × convolutional layers are stacked with a size of 56 × 56 × 256. The fifth and sixth stages use the same set of layers (1 × max-pooling and 3 × convolutional layers), with a size of 28 × 28 × 512 in the fifth stage and 14 × 14 × 512 in the sixth stage.
In the final, seventh stage, a max-pooling layer of size 7 × 7 × 512 is followed by a fully-connected layer of size 500, a dropout layer of size 500 and a final Softmax layer of size 2 (refer to Figure 2).
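The feature-map sizes listed for the seven stages follow from halving the spatial size at every max-pool layer. A minimal sketch of that arithmetic is given below; it assumes size-preserving convolutions and 2 × 2 max-pooling (as in the standard VGG16 design), and the function name is illustrative rather than taken from the study.

```python
# Sketch of the feature-map sizes of the VGG16-Net stages described in the text.
# Assumptions (not stated in the text): convolutions keep the spatial size,
# and every max-pool layer halves it (2 x 2 window, stride 2).

def vgg16_stage_shapes(input_size=224):
    """Return (height, width, channels) after each convolutional stage and the final pool."""
    # (number of convolutional layers, output channels) per stage, as listed in the text
    stages = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]
    shapes = []
    size = input_size
    for index, (_num_convs, channels) in enumerate(stages):
        if index > 0:
            size //= 2  # max-pool layer placed before this block of convolutions
        shapes.append((size, size, channels))
    # final max-pool layer of the seventh stage
    shapes.append((size // 2, size // 2, stages[-1][1]))
    return shapes


if __name__ == "__main__":
    for shape in vgg16_stage_shapes():
        print(" x ".join(str(dim) for dim in shape))
```

The last entry, 7 × 7 × 512, is the tensor that is flattened and passed to the fully-connected layer.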

Google transformer: adopted architecture
The ViT model adopted is the model developed by the authors [36]. The model has 7 layers. The first layer is the embedding layer, a 2D convolution with 3 input channels and an output size of 768, a kernel size of 16 × 16 and a stride of 16 × 16. The second layer is the encoder, which includes 7 × ViT layers, each consisting of self-attention, self-output, intermediate and output sub-layers and 2 × LayerNorm of size 768. The third layer is a dropout layer with p = 0.1 and the fourth is a linear layer with 768 input features and 256 output features. The same pattern is repeated in the fifth and sixth layers: a dropout layer with p = 0.2 followed by a linear layer with 256 input features and 2 output features. The seventh and final layer is the Softmax (refer to Figure 3).
The transformer model is applied to the datasets, where the images of the concrete structures are examined for cracks. The images are then classified into separate folders with the classes "cracks" and "non-cracks".
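The 16 × 16 kernel with a 16 × 16 stride means that the embedding layer cuts each image into non-overlapping patches. The sketch below works out the resulting patch grid, assuming a 224 × 224 RGB input (the image size used in this research); the function name is hypothetical.

```python
# Sketch of the patch arithmetic behind the ViT embedding layer.
# Assumption: a 224 x 224 input with 3 colour channels; the kernel size and
# stride of 16 are the values given in the text.

def patch_grid(image_size=224, patch_size=16, channels=3):
    """Return (patches per side, total patches, pixel values per patch)."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by the patch size")
    per_side = image_size // patch_size
    total_patches = per_side * per_side
    values_per_patch = patch_size * patch_size * channels
    return per_side, total_patches, values_per_patch


if __name__ == "__main__":
    per_side, total, values = patch_grid()
    print(f"{per_side} x {per_side} grid, {total} patches, {values} values per patch")
```

Each patch therefore carries 16 × 16 × 3 = 768 pixel values, which is the same number as the embedding size of 768 quoted for the first layer.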

Categorical cross-entropy (loss) function
In this research the loss is estimated by computing the Softmax function for the developed model. The Softmax loss is, strictly, neither a pure activation nor a pure loss function: it is a Softmax activation followed by a cross-entropy loss, and is therefore identified as the (categorical) cross-entropy function for estimating the loss of detection and image classification models. Researchers nevertheless adopt it as both an activation and a loss estimation function according to their needs. The formula used here is:

$$\delta(\vec{\nu})_x = \frac{e^{a_x}}{\sum_{y=1}^{C} e^{a_y}}$$

where $\delta$ = Softmax, $\vec{\nu}$ = input vector, $C$ = total number of classes in the multi-class classifier, $e^{a_x}$ = standard exponential function of the input vector and $e^{a_y}$ = standard exponential function of the output vector.
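As an illustration of how the Softmax output feeds the cross-entropy loss, a minimal sketch in plain Python follows. It uses the standard definitions (the negative log of the Softmax probability assigned to the true class); the function names and the example scores are illustrative, not values from the experiment.

```python
import math

# Sketch of Softmax followed by categorical cross-entropy for one sample.
# The input scores below are made-up values for a two-class
# ("cracks" / "non-cracks") classifier.

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(a - shift) for a in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target_index):
    """Negative log-probability that Softmax assigns to the true class."""
    return -math.log(softmax(scores)[target_index])


if __name__ == "__main__":
    scores = [2.0, 0.5]  # hypothetical scores for [cracks, non-cracks]
    print("probabilities:", softmax(scores))
    print("loss if the true class is 'cracks':", cross_entropy(scores, 0))
```

With equal scores for the two classes the loss equals ln 2 ≈ 0.693, which is the value a two-class model starts from before it has learned anything.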

Convolutional neural network (CNN)
The convolutional neural network (CNN) in PyTorch is used by researchers for both small and large datasets. In this research a CNN with pre-trained weights is used for the developed model. During pre-processing (size, resolution, colour, brightness, contrast and dimension), the image size is set to 224 × 224 pixels. Once the images are pre-processed, the following CNN layers are applied: convolutional layers, max-pooling layers, a dropout layer, a fully connected layer and a Softmax layer. The input is first passed through the convolutional and max-pooling layers, and then on to the dropout layer and the fully connected layer. Finally, the Softmax layer is connected and the outcome is segregated after classification into "cracks" and "non-cracks".

Adam optimizer as adopted algorithm
The Adam optimizer (AOA) is widely used in deep-learning identification and prediction models, especially to optimize outcomes or to evaluate existing models for better outcomes. It is mostly adopted for its speed and robustness, and it requires fewer processes than other algorithms, which is why researchers prefer it. As this research is a comparative study of the ViT and VGG16-Net models, Adam optimization was adopted for comparing the outcomes. The pseudo-code for the adopted algorithm is:
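A minimal sketch of the standard Adam update rule for a single parameter is shown here. The decay rates and epsilon are the commonly used defaults and are assumptions, as is the toy objective f(θ) = θ²; only the learning rate of 0.1 is taken from this study.

```python
import math

# Sketch of one Adam update step for a scalar parameter.
# beta1, beta2 and eps are the usual default values (assumed, not reported
# in the text); the quadratic objective is only a toy example.

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def minimize_quadratic(theta=1.0, steps=200, lr=0.1):
    """Minimize f(theta) = theta**2 with Adam and return the final theta."""
    m = v = 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * theta  # derivative of theta**2
        theta, m, v = adam_step(theta, grad, m, v, t, lr=lr)
    return theta


if __name__ == "__main__":
    print("theta after 200 steps:", minimize_quadratic())
```

In the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly one learning rate regardless of the gradient's magnitude.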
Using the AOA algorithm, the model identifies the concrete structures and classifies them as cracked or non-cracked.

EXPERIMENT AND RESULTS
First, the VGG16-Net model is applied to the datasets. The model is trained and tested, and the outcomes are stored under the classified labels. Second, the experiment is carried out for the ViT model and its outcomes are stored for comparison against the VGG16 model. The outcomes of the two models are then evaluated and compared to identify the better-performing model with higher accuracy.

Setting-up the experiment
The experiment is conducted on the datasets of concrete structures acquired from Kaggle, where the developed model is intended to detect cracks and non-cracks and to store the classified images under labelled, separate folders. The developed VGG16-Net model includes the Adam optimization algorithm. For the model, the pre-trained weights from the Google checkpoint (google/vit-base-patch16-224-in21k) have been used. The library used is Hugging Face Transformers.

Datasets
The dataset of images of cracked and non-cracked cement structures is acquired from Kaggle [37]. It comprises more than 20 thousand images of concrete structures from buildings, bridges and pavements. Of these, 70% (14000:14000 crack and non-crack images) are used for training and the remaining 30% (6000:6000 crack and non-crack images) for testing and validation. The split ratio for the dataset is thus 70:20:10 for training, testing and validation respectively.
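The 70:20:10 ratio can be turned into image counts with simple integer arithmetic. The sketch below assumes 20000 images per class, which is what the stated 14000 training and 6000 remaining images per class add up to; the function name is illustrative.

```python
# Sketch of the 70:20:10 split arithmetic.
# Assumption: 20000 images per class (14000 + 6000, as stated in the text).

def split_counts(total, ratios=(70, 20, 10)):
    """Return (training, testing, validation) counts for the given percentages."""
    if sum(ratios) != 100:
        raise ValueError("ratios must sum to 100")
    counts = [total * ratio // 100 for ratio in ratios]
    counts[0] += total - sum(counts)  # any rounding remainder goes to training
    return tuple(counts)


if __name__ == "__main__":
    print("per class:", split_counts(20000))
    print("both classes:", split_counts(40000))
```

Under this assumption each class contributes 14000 training, 4000 testing and 2000 validation images.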

Training and testing
For training, the images are first pre-processed and the first model (VGG16-Net) is then applied to the dataset. The model is trained until its accuracy is high and stable, and the final model with the highest accuracy is retained. The final VGG16-Net model is then compared against the ViT model.
The VGG16-Net model is trained with 70% of the dataset and its loss is evaluated using the cross-entropy. The loss functions are:
The training loss (refer to Figure 5) of the model was first observed at 0.59 at the 47th epoch. It reduced from 0.58 to 0.38 by the 75th epoch. The loss then fluctuated and reduced to 0.36 at the 380th epoch. Finally, at the 600th epoch the loss value reduced to 0.356.

Performance metric evaluation
The evaluation metrics for the developed VGG16-Net model are calculated through the following metric evaluation techniques:

F1-score
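The F1-score for the model is calculated through:

$$\mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

From Figure 6 the highest F1-score is observed as 0.985 at the 600th epoch.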

Inference:
The evaluation loss (refer to Figure 4) was observed as high (0.47) at the initial training epoch.

Recall
The recall for the model is calculated through:

$$\mathrm{Recall} = \frac{\mathrm{TruPos}}{\mathrm{TruPos} + \mathrm{FalNeg}}$$

where TruPos denotes true positives, TruNeg denotes true negatives, FalNeg denotes false negatives and FalPos denotes false positives. From Figure 7 the highest recall is observed as 0.991 from the 360th to the 600th epoch.

Precision
The precision for the model is calculated through:

$$\mathrm{Precision} = \frac{\mathrm{TruPos}}{\mathrm{TruPos} + \mathrm{FalPos}}$$

From Figure 8 the highest precision is observed as 0.98 at the 600th epoch.

Accuracy
The accuracy for the model is calculated through:

$$\mathrm{Accuracy} = \frac{\mathrm{TruPos} + \mathrm{TruNeg}}{\mathrm{TruPos} + \mathrm{FalPos} + \mathrm{TruNeg} + \mathrm{FalNeg}} \tag{5}$$

Table 1 shows that by the 10th epoch the accuracy increased as the losses decreased for both training and validation.
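The four metrics of this section can all be derived from the same confusion-matrix counts. A minimal sketch follows, using the standard definitions of precision, recall and F1-score together with the accuracy formula above; the counts in the example are hypothetical and are not results of this study.

```python
# Sketch of the evaluation metrics computed from confusion-matrix counts.
# The example counts are hypothetical; denominators are assumed to be non-zero.

def classification_metrics(tru_pos, fal_pos, tru_neg, fal_neg):
    """Return precision, recall, F1-score and accuracy as a dictionary."""
    precision = tru_pos / (tru_pos + fal_pos)
    recall = tru_pos / (tru_pos + fal_neg)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tru_pos + tru_neg) / (tru_pos + fal_pos + tru_neg + fal_neg)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # hypothetical counts: 9 cracks found, 1 false alarm, 9 clean images, 1 missed crack
    for name, value in classification_metrics(9, 1, 9, 1).items():
        print(f"{name}: {value:.3f}")
```

Because the F1-score is the harmonic mean of precision and recall, it stays low whenever either of the two is low, which makes it a stricter summary than accuracy on imbalanced data.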

Experimenting results
The outcomes of the VGG16 model are segregated into two classes, 'Cracks' and 'Non-cracks'. Predictions were made with the VGG16 model on the concrete images, and the label 'Cracks' was applied to the classified images before storing the results (Table 2). Likewise, the label 'Non-cracks' was applied to the classified images before storing the results (Table 3). Tables 2 and 3 show that the predictions made by the VGG16 model were largely correct: among the 10 sample images only one was predicted wrongly.

Comparative analysis of the models developed
The performance of both models is evaluated through metric evaluation techniques (refer to Table 4). The validation accuracy is measured as the metric for evaluating the performance of the 'ViT' and 'VGG16-Net' models (refer to Table 2). The ViT model achieved 96.6% and the VGG16-Net model achieved 95.98% validation accuracy.
The VGG16-Net model was trained by increasing the epochs from 10 to 30, with a batch size of 16 and a learning rate of 0.1.
Table 5 shows that the loss reduced from 0.65 to 0.35 between epoch 1 and epoch 30. The outcomes obtained from the VGG16 model for detecting cracked and non-cracked concrete structures are better than those of the transformer model. The evaluation metrics show that accuracy, recall and F1-score are higher for the VGG16 + CNN model than for the ViT model, whereas the precision of the ViT model is similar to that of the VGG16 model. This finding agrees with the conclusions of the studies [3,6,12,13], in which CNN models are more accurate than ViT models.
Figure 11 shows that the developed VGG16 model achieved higher accuracy than existing convolutional neural network models such as AlexNet, GoogleNet, InceptionNet and ResNet.

FINDINGS
The findings from the evaluated outcomes are presented in Table 6.
• The scores of the ViT model are: precision at 97.30%, accuracy at 92.72%, recall at 92.50% and f1-score at 94.87%.
• The validation accuracies of the ViT model (96.6%) and the VGG16-Net model (95.98%) both round to 96%.
Hence, when computing cost, scheduling time, processing time, the construction of stacked-up layers and other resources are considered, the ViT crack detection model is better; when accuracy, precision, recall and F1-score are considered, the CNN crack detection model is better.

DISCUSSION AND CONCLUSION
Existing research on crack detection models has examined processing speed, processing time and accuracy. However, comparative studies of how novel and traditional machine learning models differ in crack detection for concrete structures have been lacking. The current study applied the VGG model, with its CNN architecture, and the ViT model to datasets of concrete images. The images were acquired from Kaggle. The dataset comprised more than 20 thousand concrete images of buildings and pavements. The secondary sources were journals, e-journals, articles, research papers and studies on concrete structural health through machine learning.
The cross-entropy was used as the loss function to examine the training and validation loss of the model. Adam optimization was used in the developed VGG model to increase accuracy with minimal loss by fine-tuning the hyper-parameters, learning rate and epochs. To avoid overfitting (too many epochs) and underfitting (too few epochs), the number of epochs was set to 30 with a learning rate of 0.1. The dataset was split in a 70:20:10 ratio for training, testing and validation, and the outcomes were segregated into two classes, "cracks" and "non-cracks". Images predicted to contain cracks are stored under the folder "cracks" and images identified as "non-cracks" are stored under the non-cracks folder. The predictions of the ViT and VGG16 models were then compared. Both models (VGG and ViT) achieved 96% validation accuracy. In the performance metrics, the VGG model achieved higher accuracy (98%) than the ViT model. However, since a VGG model is stacked with CNN layers, its computing time and other resource costs are considerably higher than those of vision transformers. It is therefore reasonable to conclude that a researcher aiming at better performance and higher accuracy should consider traditional CNN models such as VGG16, AlexNet, LeNet and ResNet. Conversely, a researcher who has fewer resources and needs rapid processing is better served by the ViT model, which minimizes resource expenses.
From the findings the study concludes that using the vision transformer model to detect and classify crack images in structural engineering saves cost and processing time. In contrast, CNN-based models with the VGG16 architecture are simpler, more customizable and more mature to implement and train than vision transformers that utilize ImageNet. ViT splits the image into several visual tokens, whereas the CNN operates on pixel arrays. The differences in accuracy between the models are marginal, but when the benefits are weighed for structural health monitoring projects, the VGG16 + CNN architecture offers better accuracy and performance than vision transformers.

Future enhancements:
The current study developed a VGG16 model and compared its results with the ViT model. In future, the work could be extended by comparing other models such as ResNet, LeNet, AlexNet and architectures without CNN layers against the ViT model. Future research could also examine the precision, accuracy, recall and F1-score of crack prediction models for concrete structures to find the best model for predicting and detecting cracks. More CNN-based architectures will be analysed for performance and accuracy, and different evaluation metrics will be used to compare the models and to find the better model and metric. Furthermore, other NNs such as RNN (recurrent NN) and ANN (artificial NN) will be examined to find NN models that perform better in crack detection than CNNs and vision transformers.

Figure 2: Architecture of the VGG16-Net model adopted.

Figure 3: Architecture of the vision transformer adopted (Google model).
Around the 80th epoch the evaluation loss (Figure 4) reduced from 0.47 to 0.35. It then fluctuated between the 125th and 275th epochs while continuing to decrease. The loss reached 0.34 at the 575th epoch and remained the same until the 600th epoch.

From Figure 9 the highest accuracy is observed as 0.98 at the 600th epoch. The final training and testing of the model was done with 1261 samples for training and 239 for validation with a batch size of 32, and 165 samples for testing with a batch size of 16. The number of epochs was 10.

Figure 10 illustrates the comparison of the ViT (vision transformer) and VGG16-Net (CNN) models.

Figure 11: Comparison of few different CNN models.

Table 1: Epoch table for loss evaluation.

Table 2: Results of the VGG model classification - Crack.

Table 3: Results of the VGG model classification - Non-crack.

Table 4: Comparative analysis of the models.

Table 6: Accuracy comparison of CNN models.