Runet: Convolutional Networks for Crack Detection

Cracks, such as road cracks and stone cracks, are a common phenomenon in daily life. They are usually difficult to identify quickly and intuitively because they exhibit poor continuity and low contrast. Among existing crack detection methods based on deep learning, the models are too large to be used directly in everyday applications. We design an end-to-end neural network, named Runet, which greatly reduces the number of model parameters through depthwise separable convolution and enlarges the receptive field of the network through dilated (atrous) convolution. At the same time, we introduce separate training-time and inference-time architectures to increase running speed while preserving precision.


Introduction
In work and daily life, detecting cracks in time can avert large economic losses and protect human safety. To date, however, existing defect detection methods are not widely deployed in everyday settings. If cracks could be detected directly with mobile phones, the prospect of large-scale application of deep learning for crack detection would become realistic.
Inspired by methods from semantic segmentation and image classification, we add depthwise separable convolution to the encoder network and apply dilated convolution to enlarge the receptive field. Inspired by the RepVGG [15] network, we also add separate training-time and inference-time architectures to our network. From these ingredients we construct a new end-to-end neural network for crack detection. Our main contributions are as follows: -The network makes full use of the information obtained from the encoder-decoder structure and establishes training-time and inference-time architectures through a structural re-parameterization technique.
-Compared with existing crack detection neural networks, the number of parameters of our model is greatly reduced.
-Despite the greatly reduced parameter count, our model still achieves good recall and precision.

Network Architecture
The network is based on U-Net; we call it Runet. As shown in Fig. 1, it consists of two paths, a contracting path and an expansive path, which can also be called the encoder path and decoder path. Following the principle of RepVGG, the network is built from a repeated basic convolution module consisting of a 3×3 convolution branch, a 1×1 convolution branch, and a batch normalization (BN) branch; each convolution is followed by batch normalization. In the contracting path, every two basic modules are followed by a 2×2 max pooling operation with stride 2 for downsampling. Every step in the expansive path consists of an upsampling operation, followed by a concatenation with the corresponding feature map from the contracting path and then two basic modules. At the end of the network, the last convolution produces a 1-channel feature map, in which 1 represents crack pixels.
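The training-time basic module described above can be sketched in PyTorch as follows. This is an illustrative reconstruction, not the authors' code; the class and variable names are ours, and the identity BN branch is included only when input and output channel counts match, which is the usual RepVGG-style convention.

```python
import torch
import torch.nn as nn

class RepBlock(nn.Module):
    """Training-time basic module: parallel 3x3 conv, 1x1 conv,
    and identity (BN-only) branches, each followed by BatchNorm."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv3 = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        # identity BN branch only exists when shapes match
        self.bn_id = nn.BatchNorm2d(out_ch) if in_ch == out_ch else None
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.bn3(self.conv3(x)) + self.bn1(self.conv1(x))
        if self.bn_id is not None:
            out = out + self.bn_id(x)
        return self.relu(out)

x = torch.randn(1, 16, 64, 64)
y = RepBlock(16, 16)(x)
print(y.shape)  # torch.Size([1, 16, 64, 64])
```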
As noted above, the expansive path uses upsampling, but we cannot ignore the fact that upsampling causes a loss of spatial resolution, which in turn biases region boundaries. To avoid losing edge information, we use bilinear interpolation in the expansive path. At the same time, we treat each pixel as a square rather than a point, and the feature map tensors are aligned through the centers of their corner pixels, thus preserving the values at the corner pixels. With these measures, we obtain upsampled feature maps that provide more precise location information for region boundaries.
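The corner-aligned bilinear upsampling described above corresponds to PyTorch's `align_corners=True` mode; a minimal sketch (our own toy tensor, not the paper's data):

```python
import torch
import torch.nn.functional as F

# Upsample a feature map by 2x with bilinear interpolation.
# align_corners=True treats pixels as points on a grid whose corner
# values are preserved, keeping boundary locations consistent
# between the contracting and expansive paths.
feat = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
up = F.interpolate(feat, scale_factor=2, mode="bilinear", align_corners=True)
print(up.shape)  # torch.Size([1, 1, 8, 8])
# the corner pixels keep their original values (0.0 and 15.0)
print(up[0, 0, 0, 0].item(), up[0, 0, -1, -1].item())
```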
The contracting path is designed to extract features from the input image. To understand how to capture more detailed information, consider the receptive field: the region of the input image to which a point on a feature map corresponds. The larger the receptive field, the larger the range of the original image it covers, which means it may contain more global and higher-level semantic features; the smaller the receptive field, the more local and detailed the features. For a semantic segmentation task, a larger receptive field ensures that important context is not ignored, so we use dilated convolution in the contracting path to enlarge the receptive field. At the same time, to reduce the parameters of the whole model, we borrow the idea of depthwise separable convolution; applying it to the backbone network greatly reduces the total parameter count.

As shown in Fig. 1, the basic module Rep consists of two convolution branches and a BN branch, but the module changes when Runet is used for inference. The change process, shown in Fig. 2, is called Re-param; its purpose is to blend the parameters of the 3×3 convolution, the 1×1 convolution, and the BN branch. Let μ_i, σ_i, γ_i, and β_i denote the accumulated mean, standard deviation, learned scaling factor, and bias of a BN layer, with i = 3 for the BN following the 3×3 conv, i = 1 for the BN following the 1×1 conv, and i = 0 for the identity BN branch. Let W_i and b_i denote the kernel and bias of the i×i conv in each branch. The key step is to fuse every BN with its preceding conv layer:

W'_i = (γ_i / σ_i) W_i,    b'_i = β_i − (μ_i γ_i) / σ_i.    (1)

For the BN branch, formula (1) can also be used because that branch can be viewed as a 1×1 conv with an identity matrix as its kernel.
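The parameter saving from depthwise separable convolution mentioned above is easy to verify concretely. The sketch below is our own illustration (the channel counts are arbitrary, not the paper's): a standard 3×3 convolution is replaced by a per-channel depthwise 3×3 convolution plus a 1×1 pointwise convolution, and a dilation of 2 enlarges the receptive field of the depthwise stage without adding any parameters.

```python
import torch.nn as nn

def param_count(m):
    return sum(p.numel() for p in m.parameters())

# Standard 3x3 convolution, 64 -> 64 channels.
standard = nn.Conv2d(64, 64, 3, padding=1, bias=False)

# Depthwise separable equivalent: per-channel 3x3 conv (groups=in_channels)
# followed by a 1x1 pointwise conv. dilation=2 widens the receptive field
# at no parameter cost.
separable = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=2, dilation=2, groups=64, bias=False),
    nn.Conv2d(64, 64, 1, bias=False),
)

print(param_count(standard), param_count(separable))  # 36864 4672
```

Here the separable variant needs 64·9 + 64·64 = 4672 parameters against 64·64·9 = 36864 for the standard convolution, roughly an 8x reduction at this width.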
After this change, there is one 3×3 conv and two 1×1 convs (counting the identity branch as a 1×1 conv); we zero-pad each 1×1 kernel to 3×3, giving three 3×3 convs. We then use formula (1) to fuse the parameters of every branch, so that each branch's parameters can be written as (W'_i, b'_i), and in the last step we add the three 3×3 kernels and biases into a single 3×3 conv.
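The whole Re-param procedure, including BN fusion via formula (1), zero-padding the 1×1 and identity kernels to 3×3, and summing the branches, can be sketched as follows. This is a minimal reconstruction under our own variable names (the randomized BN statistics merely stand in for trained ones); the final check confirms the fused single conv reproduces the multi-branch output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(w, bn):
    """Fold a BN layer into its preceding conv (formula (1)):
    W' = (gamma / sigma) W,  b' = beta - mu * gamma / sigma."""
    sigma = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / sigma
    return w * scale.reshape(-1, 1, 1, 1), bn.bias - bn.running_mean * scale

C = 8
conv3, bn3 = nn.Conv2d(C, C, 3, padding=1, bias=False), nn.BatchNorm2d(C)
conv1, bn1 = nn.Conv2d(C, C, 1, bias=False), nn.BatchNorm2d(C)
bn_id = nn.BatchNorm2d(C)
for bn in (bn3, bn1, bn_id):  # stand-in for trained BN statistics
    bn.running_mean.uniform_(-1, 1); bn.running_var.uniform_(0.5, 2)
    bn.weight.data.uniform_(0.5, 2); bn.bias.data.uniform_(-1, 1)
    bn.eval()

# Fuse each branch's BN, pad the 1x1 and identity kernels to 3x3, and sum.
w3, b3 = fuse_conv_bn(conv3.weight, bn3)
w1, b1 = fuse_conv_bn(F.pad(conv1.weight, [1, 1, 1, 1]), bn1)
eye = torch.zeros(C, C, 3, 3); eye[range(C), range(C), 1, 1] = 1.0
wi, bi = fuse_conv_bn(eye, bn_id)
w, b = w3 + w1 + wi, b3 + b1 + bi

x = torch.randn(1, C, 16, 16)
train_out = bn3(conv3(x)) + bn1(conv1(x)) + bn_id(x)  # multi-branch
infer_out = F.conv2d(x, w, b, padding=1)              # single fused conv
print(torch.allclose(train_out, infer_out, atol=1e-4))  # True
```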
The reason for separating the training and inference architectures is that training a multi-branch model mitigates the vanishing gradient problem, but a multi-branch architecture that is good for training is not good for inference. To resolve this conflict, we convert one architecture into the other after training.

Experimental Settings
We build Runet using PyTorch. In the contracting and expansive paths, batch normalization is used after every convolutional layer. The upsampling operation in the decoder network is performed by bilinear interpolation. The optimizer is RMSProp, which damps oscillations during gradient descent. The batch size is 16. The network is trained for 1000 epochs in total, and we select the best model over the whole process. Because the number of images in the datasets is small, we adopt data augmentation to expand them. All experiments are carried out on a single NVIDIA Tesla V100 with 32 GB of memory.
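A minimal sketch of this training setup (the tiny stand-in model and loss choice are our assumptions, not the paper's; only the optimizer and batch size follow the text):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for Runet; the real model follows the
# architecture section above.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 1, 1))

# RMSprop divides each step by a running average of squared gradients,
# which damps oscillations during descent.
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()  # 1-channel crack mask output

images = torch.randn(16, 3, 64, 64)  # batch size 16
masks = torch.randint(0, 2, (16, 1, 64, 64)).float()
optimizer.zero_grad()
loss = criterion(model(images), masks)
loss.backward()
optimizer.step()
print(loss.item() > 0)  # True
```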
We use two datasets to verify the accuracy and efficiency of the network. Each crack dataset is split 70% for training, 10% for validation, and the remaining 20% for testing.
To evaluate the strengths and weaknesses of our network, we use the precision-recall (P-R) curve, which is computed by comparing the detected cracks against the ground-truth image. By comparing the areas enclosed by the P-R curves and the coordinate axes, we can judge which model is better; when the areas are hard to compare, we can instead compare the break-even points where precision equals recall. Because cracks have width, a predicted crack pixel is still regarded as a true positive if it is no more than 1 pixel away from the ground-truth image.
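The 1-pixel tolerance described above is commonly implemented by dilating the crack masks before matching; the sketch below is our own illustration of that idea (assuming SciPy is available), not the paper's evaluation code.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def tolerant_precision_recall(pred, gt, tol=1):
    """Precision/recall where a predicted crack pixel counts as a true
    positive if it lies within `tol` pixels of a ground-truth crack
    pixel, and symmetrically for recall."""
    se = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    gt_zone = binary_dilation(gt, se)      # ground truth thickened by tol
    pred_zone = binary_dilation(pred, se)  # prediction thickened by tol
    tp_p = np.logical_and(pred, gt_zone).sum()  # predicted pixels near gt
    tp_r = np.logical_and(gt, pred_zone).sum()  # gt pixels near prediction
    precision = tp_p / max(pred.sum(), 1)
    recall = tp_r / max(gt.sum(), 1)
    return precision, recall

gt = np.zeros((8, 8), dtype=bool); gt[4, 1:7] = True      # thin crack
pred = np.zeros((8, 8), dtype=bool); pred[5, 1:7] = True  # off by 1 px
p, r = tolerant_precision_recall(pred, gt)
print(p == 1.0 and r == 1.0)  # True: a 1-px offset is not penalized
```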
To fully demonstrate the advantages of our network in inference accuracy and model parameters, we compare it with U-Net, Attention U-Net, PSPNet, and SegNet on the same datasets.

Experimental Comparison
In this part, we compare the results of the different networks on the two datasets; comparing results on the same dataset across networks shows which network is best. The results are shown in Fig. 3 (CrackTree260) and Fig. 4 (Stone331); in these figures, the abscissa is recall and the ordinate is precision. Since model size is connected to network speed (the smaller the model, the faster it runs), we also compare the parameters of the networks in Table 1, where the first column is the network model and the second column is the model size; the input size of all networks is 512×512. From Fig. 3 and Fig. 4, the performance of Runet is slightly lower than U-Net on the CrackTree260 dataset and slightly higher than U-Net on the Stone331 dataset; on both datasets, U-Net and Runet lead the other models. In terms of model size, Table 1 shows that Runet is only about half the size of U-Net and about one third of SegNet, so Runet's model-size advantage is far ahead of the others. Therefore, Runet achieves the best balance between crack identification performance and model size.

Conclusion
In this paper, we propose a new network, named Runet, to detect cracks. We add dilated convolution and depthwise separable convolution to Runet, and we also introduce separate training-time and inference-time architectures. As the figures and table show, these choices give Runet clear advantages in both accuracy and model size.