Crack Detection Based on ConvNeXt and Normalization

Cracks, mostly caused by irregular expansion and contraction, indicate potential damage to a building and are therefore of great significance to building quality assessment and to predicting potential disasters such as earthquakes. In this paper, after a comparison of conventional crack-detection methods based on image processing techniques and the recently proposed CNN approach, ConvNeXt is adopted, which proves more efficient and stable. In the experiment, owing to the particularities of the datasets and the methodology, a series of image processing operations is applied as preprocessing before the data are used to train the CNN model. From two crack-specific datasets, Concrete Crack Images for Classification and SDNET2018, over 60,000 images are selected as training samples. After adequate training with AdamW as the optimizer, the expected accuracy of 99.0% is reached on the dataset.


Introduction
In reality, cracked solids (e.g. dirt, concrete) play a significant role in various fields. In geographical terms, cracks on the earth's surface are an important factor in predicting disasters such as landslides or earthquakes. In mechanical terms, parameters such as the number of cracks need to be measured, because detection in advance gives the staff concerned adequate time to take measures against probable building damage [1]. It should also be noted that most closed-circuit television cameras used in the architectural industry are static, so cracks can be detected by applying a classification network with sliding windows.
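The sliding-window idea above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `classify_patch` stands in for any binary crack/no-crack classifier (such as the CNN trained later in the paper), and the window size of 227 matches the patch size of the datasets used.

```python
import numpy as np

def sliding_window_detect(image, classify_patch, window=227, stride=114):
    """Scan a static camera frame with a fixed-size window and collect
    the top-left corners of windows the classifier flags as cracked."""
    hits = []
    h, w = image.shape[:2]
    for y in range(0, h - window + 1, stride):
        for x in range(0, w - window + 1, stride):
            patch = image[y:y + window, x:x + window]
            if classify_patch(patch):
                hits.append((y, x))
    return hits

# Toy check with a stand-in "classifier" that flags dark patches.
frame = np.ones((454, 454))
frame[0:227, 227:454] = 0.0          # one dark quadrant
found = sliding_window_detect(frame, lambda p: p.mean() < 0.5, stride=227)
# found → [(0, 227)]: only the dark quadrant is flagged
```

In practice the stride is chosen smaller than the window so that cracks crossing window boundaries are still covered by at least one patch.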
Recently, GPU computing power has grown rapidly, and a variety of convolutional neural networks have been continuously proposed since AlexNet [2]. As a result, a growing amount of manual work has been taken over by computers equipped with suitable GPUs. Furthermore, rather than identifying features manually and formulating a specialized algorithm, a CNN allows the computer to extract features automatically, which is more efficient and precise.
Since Transformer was successfully applied to computer vision, more and more researchers have chosen Transformer as their baseline instead of convolutional neural networks [3]. In 2021, Liu et al. proposed the Swin Transformer [4], a hierarchical vision transformer using shifted windows, and won the 2021 Marr Prize for it. In their research, the backbone was applied to, and held sway over, the three main topics in computer vision, i.e. object recognition, instance segmentation, and semantic segmentation. Nevertheless, in 2022, Liu et al. proposed ConvNeXt [5], which is adopted as the training model in this paper. Their research shows that existing structures like ResNet still have room for improvement.
During the past few years, quite a few optimizers (e.g. SGD, Adam [6], AMSGrad) have been proposed and applied in industry for acceleration purposes. Adam, designed to combine the merits of AdaGrad and RMSProp [7], became popular owing to its ability to anneal the step size. However, in 2018, also at ICLR, AdamW [8], claimed to overcome the limitations of common Adam implementations, outdid Adam in terms of gradient convergence. In this paper, AdamW is adopted as the optimizer of the training process.
Building on the theories mentioned above, instead of using conventional image processing alone for detection, this paper proposes a method of detecting cracks that combines image processing with a CNN and accelerates the training process with AdamW. In the experiment, we obtain an accuracy of 99.0% on the dataset.

Related work
Looking back at previous work on crack detection, most approaches can be classified into four categories: integrated algorithms, morphological approaches, percolation-based methods, and practical techniques.
In 2007, Fujita et al. proposed a method based on local variation and a line filter [9]. In their research, the line filter emphasized the outline of the crack for detection, whereas in a CNN the outline is a deep-level feature that can be extracted automatically. As a result, they reached a detection accuracy of about 98.13% at a relatively early date, but the drawbacks are obvious. Specifically, only 50 noisy images were used to evaluate performance, which is too few to meet industrial standards. To solve this problem, a large number of images are used here as training and validation samples.
In 2018, Mohan et al. presented an overview of previous crack-detection work [10]. They mentioned two things that affect detection accuracy and deserve more attention: the propagation directions of the cracks, and the poor quality of the images. To address the first problem, random flipping and resizing are introduced as preprocessing in this research. For the second, two dedicated datasets are used, Concrete Crack Images for Classification and SDNET2018, both of which consist of images of more than 20,000 pixels each.

Datasets
Two particular datasets, Concrete Crack Images for Classification and SDNET2018, are involved as training samples in the experiment [11,12].

Concrete Crack Images for Classification
Concrete Crack Images for Classification is a dataset of concrete images with and without cracks, collected from various METU campus buildings. The dataset consists of two classes, negative (without crack) and positive (with crack). Each class has 20,000 images, for a total of 40,000 RGB images of 227 x 227 pixels, all generated from 458 high-resolution images (4032 x 3024 pixels); no data augmentation such as random rotation or flipping was applied.
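The derivation of small samples from high-resolution originals can be sketched as below. This is an illustration only: the dataset's exact cutting scheme is not specified here, so the non-overlapping grid and the discarding of incomplete border strips are assumptions.

```python
import numpy as np

def extract_patches(image, size=227):
    """Cut a high-resolution image into non-overlapping size x size
    patches, discarding incomplete strips at the right/bottom borders."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            patches.append(image[y:y + size, x:x + size])
    return patches

# A blank stand-in for one 4032 x 3024 source image (H x W x RGB)
big = np.zeros((3024, 4032, 3), dtype=np.uint8)
patches = extract_patches(big)
# 3024 // 227 = 13 rows, 4032 // 227 = 17 columns → 221 patches
```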

SDNET2018: A concrete crack image dataset for machine learning applications
SDNET2018 is an annotated image dataset for crack detection, consisting of over 56,000 images of cracked and non-cracked concrete surfaces. As in the generation of the individual images in Concrete Crack Images for Classification, the sub-images in this dataset were extracted from 230 images of cracked and non-cracked surfaces at three types of location, D, P, and W, which stand for bridge decks, pavements, and walls respectively. Two labels are used for classification: a sub-image is marked 'C' if it contains a crack and 'U' if it does not.

Whitening
In images, adjacent pixels are strongly similar and correlated, which is redundant for the training process, so whitening is applied during preprocessing to reduce this redundancy. The process of whitening can be briefly defined by the following steps:
(1) Center and normalize all the data.
(2) Calculate the covariance matrix of the centered data and its singular value decomposition.
(3) Project every vector onto the resulting basis and rescale each component by the inverse square root of its singular value.
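The steps above can be sketched as a ZCA-style whitening transform. This is a minimal NumPy sketch under the usual formulation (center, covariance, SVD, rescale), not the paper's exact preprocessing code; `eps` is a small constant assumed for numerical stability.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """Whiten row-vector samples X (n_samples x n_features):
    center the data, take the SVD of its covariance matrix, and
    rescale each principal component by 1/sqrt(singular value)."""
    X = X - X.mean(axis=0)                            # (1) center
    cov = X.T @ X / X.shape[0]                        # (2) covariance
    U, S, _ = np.linalg.svd(cov)                      #     and its SVD
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T     # (3) ZCA transform
    return X @ W

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[:, 1] += 0.9 * X[:, 0]          # introduce pixel-like correlation
Xw = zca_whiten(X)
cov_w = Xw.T @ Xw / Xw.shape[0]
# cov_w is close to the 4 x 4 identity: features are decorrelated
```

The ZCA variant (multiplying back by U) keeps the whitened data in the original coordinate system, which is why it is commonly preferred for image data.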

ConvNeXt
ConvNeXt is a convolutional neural network proposed by Liu et al. in 2022. Interestingly, after Transformer, originally popular in natural language processing, was transferred to computer vision, more and more researchers turned their backs on CNNs. The appearance of ConvNeXt, however, may be the ice-breaker, since it outperformed the Swin Transformer, especially for large input sizes (Table 1). As described in the original paper, ConvNeXt is built from a series of improvements on previous work and their combination, such as an adjusted kernel size and the increased width of the ResNeXt module (Figure 2). The most intriguing aspect of the network is that, instead of the ResNet block, Liu et al. used a more radical residual structure, the ConvNeXt block, in which Batch Normalization (BN) is replaced by Layer Normalization (LN) and the Rectified Linear Unit (ReLU) by the Gaussian Error Linear Unit (GELU) (Figure 3).
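The block structure described above can be sketched in PyTorch as follows. This is a simplified sketch of the ConvNeXt block (7x7 depthwise convolution, LN instead of BN, inverted-bottleneck MLP with GELU instead of ReLU, residual connection); the layer-scale and stochastic-depth components of the original are omitted for brevity.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Minimal ConvNeXt residual block (layer scale / drop path omitted)."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)            # LN over the channel dim
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # 1x1 conv as Linear on NHWC
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x):                        # x: (N, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # NCHW -> NHWC for LN/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to NCHW
        return shortcut + x

block = ConvNeXtBlock(96)
y = block(torch.randn(1, 96, 56, 56))            # shape preserved
```

Applying LayerNorm and the pointwise convolutions in NHWC layout mirrors the original implementation, where the 1x1 convolutions are expressed as linear layers over the channel dimension.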

AdamW optimizer
AdamW can be seen as an improvement and correction of Adam, which in turn developed from ADADELTA [13]. In the paper proposing AdamW, the authors pointed out that common implementations of the Adam optimizer limit the merits of weight decay regularization, which is essential for optimizing the training process [6,7].
As shown in the charts below, compared with Adam, AdamW achieves a faster and more stable convergence of the gradient (Figure 4).

Figure 4. Different weight decays of Adam and AdamW
Let x_t denote the training weights, an n-dimensional vector at step t, where n is the number of parameters in the convolutional neural network, and let ω_t denote the weight decay applied at step t.
At the t-th step, with gradient g_t = ∇f_t(x_{t−1}), the Adam moment estimates and their bias-corrected versions m̂_t and v̂_t are

	m_t = β₁ m_{t−1} + (1 − β₁) g_t
	v_t = β₂ v_{t−1} + (1 − β₂) g_t²
	m̂_t = m_t / (1 − β₁ᵗ),  v̂_t = v_t / (1 − β₂ᵗ)

and the AdamW update decouples the weight decay from the gradient-based step:

	x_t = x_{t−1} − η_t ( α m̂_t / (√v̂_t + ε) + ω_t x_{t−1} )

In the iteration of plain Adam, by contrast, the accumulated gradient m_t is the sum of the current derivative of f_t(x_{t−1}) and the decayed m_{t−1}, with any weight decay folded into g_t itself.
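A single AdamW step can be written out in NumPy as below. This is an illustrative sketch of the update rule, not a production optimizer; the default hyperparameters (β₁ = 0.9, β₂ = 0.999, ε = 1e-8, decay 1e-2) are the commonly used ones and are assumptions here.

```python
import numpy as np

def adamw_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW update: standard Adam moment estimates plus a weight
    decay applied directly to the weights, not folded into the gradient."""
    m = beta1 * m + (1 - beta1) * grad          # first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * x)
    return x, m, v

# Minimise f(x) = x^2 (gradient 2x) starting from x = 5.0
x, m, v = np.array(5.0), 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adamw_step(x, 2 * x, m, v, t, lr=0.05)
# x has moved close to the minimum at 0
```

The key difference from Adam is the `weight_decay * x` term inside the update: it shrinks the weights independently of the adaptive gradient scaling, which is what the AdamW paper calls decoupled weight decay.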

Preprocessing
To preprocess the data for training, both whitening and random flipping are applied. Random flipping produces one of four possible results (Figure 5).
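The four outcomes of random flipping can be enumerated explicitly. This small sketch simply lists them for a toy 2 x 2 patch; in training, one of the four would be drawn at random per sample.

```python
import numpy as np

def flip_variants(img):
    """The four possible outcomes of random horizontal/vertical flipping:
    identity, horizontal flip, vertical flip, and both combined."""
    return [img,
            img[:, ::-1],        # horizontal flip
            img[::-1, :],        # vertical flip
            img[::-1, ::-1]]     # both

patch = np.arange(4).reshape(2, 2)   # [[0, 1], [2, 3]]
variants = flip_variants(patch)
# variants[1] → [[1, 0], [3, 2]]; variants[3] → [[3, 2], [1, 0]]
```

Because a crack is equally valid in any of the four orientations, this augmentation addresses the propagation-direction issue raised by Mohan et al. without changing the labels.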

Performance and comparative analysis
In the experiment, we built the models and trained them on a single NVIDIA RTX 3090 GPU. The batch size is set to 64 and the learning rate of both models to 0.003. After 100 epochs for both models, an accuracy of 99.0% is reached. The accuracies reached at different stages of the experiment, drawn from smoothed curves, are shown in the table below (Table 2). The chart illustrates that, despite the gaps in accuracy when using optimizers such as SGD and Adam (−1.6% and −1.3% respectively), ConvNeXt reached its highest accuracy when optimized with AdamW.
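A training loop matching the reported setup can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: `model` and `train_loader` stand in for the ConvNeXt classifier and the combined crack dataset, while the hyperparameters (batch size 64 via the loader, learning rate 0.003, AdamW, 100 epochs) are the ones stated in the text.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=100, lr=3e-3, device="cpu"):
    """Standard supervised loop with AdamW, as in the experiment."""
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()        # two classes: crack / no crack
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model

# Smoke test on a tiny stand-in model and one random batch
toy = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 2))
data = [(torch.randn(4, 3, 8, 8), torch.randint(0, 2, (4,)))]
train(toy, data, epochs=1)
```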

Discussion
During training, two intriguing things were observed. First, most of the models we trained failed to match ConvNeXt and Swin-T, trailing by noticeable margins, but one family stayed close to both throughout: VGG [14], including VGG-16 and VGG-19. To judge whether overfitting was involved, we brought in additional datasets for extra validation and found that VGG-19 outperformed Swin-T on quite a few of them, while falling short of ConvNeXt by only a slight gap. Second, over dozens of runs, the Adam optimizer seemed unable to handle the situation in which the weights are already very close to the best solution, showing noticeable oscillation.

Conclusion
This paper proposes and realizes a crack detection method using a classic classification model rather than a complex detection model such as YOLO or Fast R-CNN [15,16]. Before the data enter the CNN for training, dedicated preprocessing is applied. The experimental results show that, compared with Swin-T, ConvNeXt achieves higher accuracy. Furthermore, convergence is reached earlier with the proper optimizer, AdamW. The accuracy of the model reaches 99.0%, 3.2% higher than its counterpart with the same optimizer.