A pixel-wise framework based on convolutional neural network for surface defect detection

Abstract: Automatic surface defect detection systems support real-time inspection by reducing the amount of information and highlighting the critical defect regions for high-level image understanding. However, defects exhibit low contrast, varied textures and geometric structures, and great diversity, which makes surface defect detection difficult. In this paper, a pixel-wise detection framework based on a convolutional neural network (CNN) is proposed for strip steel surface defect detection. First, salient features are extracted by a pre-trained backbone network. Second, a contextual weighting module with different convolutional kernels extracts multi-scale context features to achieve overall defect perception. Finally, cross integration is employed to make full use of this context information and to decode it, which realizes feature information complementation. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art methods on a strip steel surface defect dataset (MAE: 0.0396; Fβ: 0.8485).


Introduction
Strip steel is widely used in industrial production, including the automobile, electromechanical, aerospace and shipbuilding industries. Strip steel inevitably suffers from quality problems. Surface defects not only affect the appearance and comfort of products; the defective areas are also usually the starting point of physical damage or chemical corrosion, which has a further adverse impact on the product.
The early detection methods for surface defects were mainly based on manual inspection, which has low efficiency and high cost. Recently, automatic defect inspection (ADI) methods based on machine learning have developed rapidly. ADI not only offers higher detection efficiency and accuracy, but also significantly reduces the human and financial resources required. Despite this, it is still very challenging for ADI to identify the intrinsic and diverse defects in steel. The varieties of strip steel surface defects are shown in Figure 1. The surface defects of strip steel have the following three main characteristics, which make surface defect detection difficult.
1) Low contrast quality. Surface defect images are usually captured by CCD cameras. However, the image acquisition environment is affected by light and dust, which results in low contrast between the background and the defects, as shown in Figure 1(a). This increases the difficulty of defect detection.
2) Different textural and geometric structures. Generally, defect images collected from different materials exhibit diverse textures. Figure 1(b) depicts the differences in texture features of the same type of defect in different materials, where the boundary of the defects is fuzzy and irregular. These factors also increase the difficulty of surface defect detection.
3) Diversity of defects. Surface defects include many categories, such as inclusion, patches and scratches, of which some have obvious features while others are ambiguous. Further, defects of the same category often show significant differences in appearance, while some defects of different categories look very similar, as shown in Figure 1(c). These factors further increase the difficulty of the detection process.
To address the above challenges, local binary patterns were applied to surface defect detection [1,2]. Djukic et al. [3] distinguished real defects from random noise pixels by dynamic threshold processing. An entity sparsity pursuit approach was also proposed for surface defect inspection [4]. Neogi et al. [5] suggested global adaptive percentile thresholding of gradient images, which segments the defect regions and retains the characteristics of the defect regardless of its size. In [6], a Gabor filter combination was proposed to detect tiny holes on steel slabs. Li et al. [7] proposed an unsupervised approach based on a small number of flawless samples to detect and locate defects in random colour textures. On the other hand, Cohen et al. [8] connected the Markov model with the Gaussian distribution, and proposed the Gaussian Markov Random Field to model the image of a non-defective fabric texture. However, all these methods rely on hand-crafted features, which lack generality.
Recently, CNN-based methods in particular have achieved outstanding results in the field of machine vision. These methods automatically extract target features, discover the internal relationships and regularities of the features in the samples through iterative optimization, adaptively learn image features and complete object detection tasks. They thereby overcome the low efficiency and low detection accuracy of manually designed features. A semi-supervised approach based on CNN was used to classify strip steel surface defects [9]. Since industrial defect images are difficult to collect, Natarajan et al. [10] adopted transfer learning to extract multi-level features and then fed these features into SVM classifiers to avoid the overfitting caused by small sample sizes. However, the accuracy of these methods needs further improvement.
In this work, a pixel-wise detection framework based on CNN is proposed for strip steel surface defect detection. It obtains multi-scale context information from high-level features using convolution kernels of different sizes. Cross integration is adopted to make effective use of this context information and to decode it, which realizes feature information complementation. The output of the framework is an accurate pixel-wise classification and localization of defects. The main contributions of this study are:
• A pixel-wise detection framework based on CNN for strip steel surface defect detection is introduced. The output of the detection framework is a pixel-wise binary saliency map of the defect regions, which can effectively evaluate the quality of strip steel products.
• A contextual weighting module is proposed, which uses convolutional kernels of different sizes to obtain multi-scale context feature information from the convolution layers and achieve overall perception of the defect.
• In the decoder module, cross integration is used to integrate the context information and the previously decoded information into the current decoding block to realize feature information complementation.
• The proposed method is tested on the NEU strip steel surface defect dataset, and the experimental results prove the effectiveness of the proposed method.

Related works
In this section, two kinds of surface defect detection methods are introduced: i) traditional approaches; ii) deep learning-based approaches.

Traditional approaches
The traditional methods for surface defect detection mainly include three categories: the statistical-based approaches, the filter-based approaches and the model-based approaches.

Statistical-based approaches
These methods analyse the distribution of random variables from a statistical perspective in order to describe the image texture. Neogi et al. [5] proposed global adaptive percentile thresholding of gradient images, which segments the defect regions and retains the characteristics of the defect regardless of its size. Win et al. [11] proposed two thresholding methods, namely the contrast-adjusted Otsu's method and the contrast-adjusted median-based Otsu's method, for an automated defect detection system. Ricci et al. [12] used the Canny operator to detect defect edges. Hu et al. [13] used Fourier shape descriptors to describe the outline features of steel surface defects. Zhao et al. [14] proposed a two-level labelling technique based on superpixels, which clusters pixels into superpixels and then superpixels into subregions. Wang et al. [15] extracted and fused features of the co-occurrence matrix and the histogram of oriented gradients to describe the local and global texture characteristics, respectively. Chu et al. [16] proposed smoothed local binary patterns by applying weights to the local neighbourhood. Fekri-Ershad et al. [17] applied a new noise-resistant, multi-resolution version of LBP to jointly extract colour and texture features. Song et al. [1] proposed noise-robust adjacent evaluation completed local binary patterns for defect inspection. Zhang et al. [18] used the gray level co-occurrence matrix (GLCM) and Hu invariant moments for feature extraction, and then applied an adaptive genetic algorithm for feature selection.

Filter-based approaches
The principle of these methods is to transform the original image into the frequency domain, and then use suitable filters to process the image and remove noisy and weakly correlated features, so that the algorithm can extract more valuable information. Ai et al. [19] adopted kernel locality preserving projections and the curvelet transform to extract features for detecting longitudinal surface cracks on slabs. In [6,20], a Gabor filter combination was proposed to detect tiny holes on steel slabs. On the other hand, Choi et al. [21] adopted two Gabor filters to detect seam cracks on steel plates, which achieves high detection performance and effectively reduces noise. Wu et al. [22] used the modular maximum of the inter-scale correlation of wavelet coefficients to determine the positions of the defects, and then used prior knowledge about the characteristics of the surface defects to classify them. Öztürk et al. [23] proposed a novel BiasFeed cellular neural network model for glass defect inspection. Li et al. [24] used second-order derivative and morphology operations, row-by-row adaptive thresholding, and the 2-D wavelet transform to process images showing different casting defects. Liu et al. [25] applied a non-subsampled shearlet transform and kernel locality preserving projection to surface defect detection. Akdemir et al. [26] applied wavelet transforms to glass surface defect detection.

Model-based approaches
These methods build a model of the image and use the statistics of the model parameters as texture features. Under certain assumptions, different textures are expressed as different values of the model parameters. In [7], an unsupervised approach based on a small number of flawless samples was used to detect and locate defects in random colour textures. Cohen et al. [27] connected the Markov model with the Gaussian distribution, and proposed the Gaussian Markov Random Field to model the image of a non-defective fabric texture. Song et al. [28] proposed a saliency propagation algorithm based on multiple constraints and improved texture features (MCITF) for surface defect detection.

Deep Learning-based approaches
Recently, deep learning approaches based on CNN have achieved outstanding performance in machine vision tasks, and many scholars have addressed industrial defect detection with deep learning. In [9], a semi-supervised approach based on CNN was used to classify strip steel surface defects. Since industrial defect images are difficult to collect, Natarajan et al. [10] adopted transfer learning to extract multi-level features and then fed these features into SVM classifiers to avoid the overfitting caused by small sample sizes. Masci et al. [29] proposed a multi-scale pyramidal pooling network for generic steel defect classification. He et al. [30] proposed a multi-group convolutional neural network (MG-CNN) to inspect defects on the steel surface. In [31], an end-to-end detection framework was proposed, which integrates multi-level features to detect strip steel surface defects. The output of the network locates the defect areas with dense bounding boxes and assigns a category name to each defect. Kou et al. [32] developed an end-to-end defect detection model based on YOLO-V3 for surface defect detection on strip steel. In [33], a pre-trained deep learning network was used to extract multi-scale features from raw image patches for image classification and defect segmentation. In [34], a multi-scale feature-clustering-based fully convolutional autoencoder was proposed for texture surface defect detection. Neven et al. [35] proposed a multi-branch U-Net for steel surface defect type and severity segmentation. Zhou et al. [36] proposed an edge-aware multi-level interactive network for salient object detection of strip steel surface defects. Song et al. [37] adopted an encoder-decoder residual network for salient object detection of strip steel surface defects. Dong et al. [38] proposed a pyramid feature fusion and global context attention network for automated surface defect segmentation.
Although these methods achieve outstanding performance in defect detection, they still need to be improved, especially in feature extraction and utilization. Unlike previous studies, this paper proposes a pixel-wise detection framework based on CNN for strip steel surface defect detection.

Overview of the structure
In this work, surface defect inspection is formulated as a pixel-wise segmentation task. Given a defect image, the proposed framework outputs a binary map in which the defect area is represented by "1" and the non-defect area by "0". The architecture of the framework mainly includes three parts: an encoder, the contextual weighting module and a decoder, as shown in Figure 2.
Given a defect image, the framework first extracts multi-level features, from the fine, shallow layers (enc1) to the coarse, deep layers (enc5), using a pre-trained VGG-16 [39] network, which is called the encoder module. The encoder module is composed of convolution layers and max pooling layers. In order to retain the spatial information of each pixel, the fully connected layers of the VGG-16 network are removed. Subsequently, a contextual weighting module is adopted to obtain multi-scale contextual information from the high-level features, which keeps the final features invariant to shape and size. In the encoder, the features extracted from enc3, enc4 and enc5 are considered high-level features. In the decoder, the output of each contextual weighting module is fused into the input of the corresponding decoder block in a feedback fashion. The final output of the decoder is a binary defect saliency map.

Encoder module
The encoder, which is built on the pre-trained VGG-16 network, is used to extract multi-level features from the defect images. The encoder module mainly consists of 5 convolution layers and 4 max pooling layers. The details of the encoder blocks encx, where x = 1, …, 5, are listed in Table 1. In the encoder, each convolutional layer slides a series of convolutional kernels over local areas of its input to obtain the features of the input image, followed by ReLU and BN. Let Xn denote an input image with its corresponding ground truth. The convolution of Xn is as follows:

$$ f = \sigma(W * X_n + b) $$

where W denotes the weights, b refers to the bias, and σ represents the ReLU activation. The feature sets are obtained by sliding the convolution kernels over the input. The pooling layers adopt a 2 × 2 pool filter to down-scale the input feature maps, which changes the spatial dimension and reduces the amount of calculation. The output of the pooling layer is given below:

$$ f^{pool} = \mathrm{pool}(f) $$

where pool denotes max pooling with a 2×2 pool filter and stride 2. The encoder finally generates feature maps at five resolutions, F = {f1, f2, …, f5}, where f1 denotes the enc1 features and so on.
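The two encoder operations described above, a convolution followed by ReLU and a 2×2 max pooling with stride 2, can be illustrated with a minimal single-channel sketch in plain Python. It is not the VGG-16 encoder itself: batch normalization and multi-channel kernels are omitted, no padding is used, and the function names `conv2d_relu` and `max_pool_2x2` as well as the toy image and kernel are hypothetical.

```python
def conv2d_relu(image, kernel, bias=0.0):
    """Single-channel 2-D convolution (cross-correlation, no padding) followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = bias
            for i in range(kh):
                for j in range(kw):
                    acc += kernel[i][j] * image[r + i][c + j]
            row.append(acc if acc > 0.0 else 0.0)  # ReLU activation
        out.append(row)
    return out


def max_pool_2x2(feature):
    """2x2 max pooling with stride 2; an odd trailing row or column is dropped."""
    out = []
    for r in range(0, len(feature) - 1, 2):
        row = []
        for c in range(0, len(feature[0]) - 1, 2):
            row.append(max(feature[r][c], feature[r][c + 1],
                           feature[r + 1][c], feature[r + 1][c + 1]))
        out.append(row)
    return out


# Toy 5x5 image and 2x2 kernel (illustrative values only).
image = [
    [1, 2, 3, 4, 5],
    [0, 1, 0, 1, 0],
    [2, 2, 2, 2, 2],
    [5, 4, 3, 2, 1],
    [1, 0, 1, 0, 1],
]
kernel = [[1.0, 0.0], [0.0, -1.0]]

feature = conv2d_relu(image, kernel)   # 4x4 feature map
pooled = max_pool_2x2(feature)         # 2x2 map, half the spatial resolution
print(pooled)
```

Each pooling step halves the height and width of the feature map, which is why the deeper encoder blocks produce coarser features.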

Contextual weighting module
The fusion of convolutional features obtained from different stages is a common mechanism in most detection methods, because these features contain not only low-level visual information but also high-level abstract information. Earlier methods [40,41] combine these features directly from bottom to top. However, this simple combination may cause some poor features to be integrated into the final prediction. To address this issue, a contextual weighting module (CWM), inspired by [42], is proposed, as shown in Figure 3. The CWM applies different convolution kernels to extract multi-scale contextual information from the high-level features, which provides a complete description for interpreting the whole scene, especially for objects of multiple scales and shapes. In the CWM, the features f3, f4 and f5 are used as high-level features. The CWM uses four stacked convolutional kernels (1×1, 3×3, 5×5, 7×7) to obtain multi-scale contextual information from the high-level features, and each kernel generates a feature map with the same size as the high-level features. For the high-level feature f3, the output multi-scale contextual information can be denoted by F3:

$$ F_3 = \{ M_i^3 \}, \quad M_i^3 = \sigma(\mathrm{BN}(W_{i \times i} * f_3)), \quad i \in \{1, 3, 5, 7\} $$

where BN denotes Batch Normalization, σ is the nonlinear activation function ReLU, and W_{i×i} denotes the i × i convolutional kernel. The size of each generated feature map M_i^3 (i = 1, 3, 5, 7) is the same as that of f3, and the number of channels is 32. These feature maps are then fused by concatenation. After that, 1×1 convolutional kernels are used to resize the channels of the concatenated features to reduce the computation of the contextual weighting. The output saliency map G3 is formulated as:

$$ G_3 = \sigma(\mathrm{BN}(W_{1 \times 1} * \mathrm{CAT}(M_1^3, M_3^3, M_5^3, M_7^3))) $$

where BN denotes Batch Normalization, σ is the nonlinear activation function ReLU, CAT denotes concatenation, and W_{1×1} is the 1×1 convolutional kernel with 128 channels. The number of channels of G3 is 128. For the high-level features f4 and f5, the model generates G4 and G5 in the same way as G3.
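The shape bookkeeping of the contextual weighting module can be traced with a short Python sketch. It only tracks tensor shapes, not values, and it assumes stride-1 convolutions with zero padding that preserves the spatial size, which the text implies but does not state. The helper names and the example input shape (64, 64, 256) for f3 are hypothetical.

```python
def conv_output_shape(shape, kernel_size, out_channels):
    """Shape (H, W, C) after a stride-1 convolution with size-preserving zero padding."""
    h, w, _ = shape
    pad = (kernel_size - 1) // 2              # padding per side for an odd kernel
    out_h = h + 2 * pad - kernel_size + 1
    out_w = w + 2 * pad - kernel_size + 1
    return (out_h, out_w, out_channels)


def concat_channels(shapes):
    """Shape after concatenating feature maps along the channel axis."""
    h, w, _ = shapes[0]
    if any(s[0] != h or s[1] != w for s in shapes):
        raise ValueError("spatial sizes must match before concatenation")
    return (h, w, sum(s[2] for s in shapes))


def cwm_shapes(feature_shape, kernel_sizes=(1, 3, 5, 7),
               branch_channels=32, out_channels=128):
    """Shapes of the four branch maps, the concatenation, and the 1x1 output."""
    branches = [conv_output_shape(feature_shape, k, branch_channels)
                for k in kernel_sizes]
    fused = concat_channels(branches)
    output = conv_output_shape(fused, 1, out_channels)
    return branches, fused, output


# Hypothetical shape of the high-level feature f3 (height, width, channels).
branches, fused, g3 = cwm_shapes((64, 64, 256))
print(branches)  # four maps, each with 32 channels and the spatial size of f3
print(fused)     # 4 x 32 = 128 channels after concatenation
print(g3)        # 128 channels after the 1x1 convolution
```

The four 32-channel branches add up to 128 channels, so the final 1×1 convolution mixes information across scales without changing the channel count stated in the text.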

Decoder module
In this section, a novel decoder module is proposed, which includes 4 blocks (dec2, dec3, dec4, dec5), as shown in Figure 2. The blocks dec3 and dec4 are fusion decoders, which combine the outputs of the preceding one or two decoders with the outputs of the contextual weighting modules connected to enc3 and enc4, respectively. To enable effective fusion, these features must have the same dimensionality. First, a series of 3×3×D convolution kernels is applied to reduce the channel dimension of the feature maps to be fused, where D is 32. Then bilinear interpolation is applied to upsample the low-resolution features to the target spatial resolution of the features with which they will be fused. Subsequently, these feature maps are fused by element-wise concatenation, as shown in Figure 4. In the definitions of the decoder output decx and the final prediction Yp, CAT refers to concatenation, up denotes upsampling, σ represents the ReLU activation and ch is a 3×3 convolution.
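The upsample-then-concatenate step of the decoder can be sketched in plain Python for single-channel maps. Bilinear interpolation has several conventions; this sketch assumes the corner-aligned one, which may differ from the one used by the framework. The function names `bilinear_upsample` and `fuse` and the toy values are hypothetical.

```python
def bilinear_upsample(feature, out_h, out_w):
    """Bilinear interpolation of a 2-D map to (out_h, out_w), corner-aligned."""
    in_h, in_w = len(feature), len(feature[0])
    scale_r = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    scale_c = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = []
    for r in range(out_h):
        src_r = r * scale_r
        r0 = int(src_r)
        r1 = min(r0 + 1, in_h - 1)
        wr = src_r - r0                        # vertical interpolation weight
        row = []
        for c in range(out_w):
            src_c = c * scale_c
            c0 = int(src_c)
            c1 = min(c0 + 1, in_w - 1)
            wc = src_c - c0                    # horizontal interpolation weight
            top = (1 - wc) * feature[r0][c0] + wc * feature[r0][c1]
            bottom = (1 - wc) * feature[r1][c0] + wc * feature[r1][c1]
            row.append((1 - wr) * top + wr * bottom)
        out.append(row)
    return out


def fuse(low_res, high_res):
    """Upsample low_res to the size of high_res and stack both as channels."""
    up = bilinear_upsample(low_res, len(high_res), len(high_res[0]))
    return [up, high_res]                      # list of channels (concatenation)


low = [[0.0, 2.0], [4.0, 6.0]]                 # coarse map from a deeper block
high = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
channels = fuse(low, high)
print(channels[0])
```

Upsampling before concatenation is what guarantees that all fused maps share the same spatial resolution, as the text requires.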

Loss function
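The loss function is a basic and key factor in machine learning, which is used to measure the quality of the model prediction. In this paper, three losses are applied to optimize the model. The final loss is defined as:

$$ l = l_{BCE} + l_{IoU} + l_{SSIM} $$

where lBCE, lIoU and lSSIM represent the BCE loss, IoU loss and SSIM loss, respectively. The BCE [43] loss is applied to compute the similarity between the prediction and the ground truth, which is defined as:

$$ l_{BCE} = -\sum_{(r,c)} \left[ T(r,c) \log P(r,c) + (1 - T(r,c)) \log(1 - P(r,c)) \right] $$

where T ∈ [0, 1] denotes the ground truth, and P ∈ [0, 1] is the predicted probability. The IoU [44] loss is used to measure the overlap between the prediction and the ground truth, which is defined as:

$$ l_{IoU} = 1 - \frac{\sum_{(r,c)} P(r,c)\, T(r,c)}{\sum_{(r,c)} \left[ P(r,c) + T(r,c) - P(r,c)\, T(r,c) \right]} $$

where T ∈ [0, 1] denotes the ground truth, and P ∈ [0, 1] is the predicted probability. SSIM [45] was originally applied to measure the structural similarity of two images. Let p and t be two corresponding patches of the prediction and the ground truth, respectively. The SSIM loss is defined as:

$$ l_{SSIM} = 1 - \frac{(2\mu_p \mu_t + C_1)(2\sigma_{pt} + C_2)}{(\mu_p^2 + \mu_t^2 + C_1)(\sigma_p^2 + \sigma_t^2 + C_2)} $$

where μp, μt and σp, σt are the means and standard deviations of p and t, respectively, and σpt is their covariance. C1 and C2 are small constants that are applied to avoid dividing by zero.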
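The three loss terms can be sketched in plain Python using their commonly used definitions. Several details are assumptions rather than facts from the text: the terms are summed without weights, the SSIM statistics are computed over a single flattened patch, predictions are clipped by a small `EPS` before the logarithm, and the constants `c1` and `c2` take the values 0.01² and 0.03² that are customary for SSIM.

```python
import math

EPS = 1e-7  # assumed clipping constant that keeps log() finite


def bce_loss(pred, target):
    """Binary cross-entropy summed over all pixels of a flattened map."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total


def iou_loss(pred, target):
    """One minus the soft intersection-over-union of prediction and ground truth."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return 1.0 - inter / union if union > 0 else 0.0


def ssim_loss(pred, target, c1=0.01 ** 2, c2=0.03 ** 2):
    """One minus SSIM, with statistics taken over one flattened patch."""
    n = len(pred)
    mu_p = sum(pred) / n
    mu_t = sum(target) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_t = sum((t - mu_t) ** 2 for t in target) / n
    cov = sum((p - mu_p) * (t - mu_t) for p, t in zip(pred, target)) / n
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))
    return 1.0 - ssim


def hybrid_loss(pred, target):
    """Unweighted sum of the three terms (the weighting is an assumption)."""
    return bce_loss(pred, target) + iou_loss(pred, target) + ssim_loss(pred, target)


truth = [1.0, 0.0, 1.0, 0.0]          # flattened ground-truth patch
good = [1.0, 0.0, 1.0, 0.0]           # perfect prediction
poor = [0.6, 0.4, 0.6, 0.4]           # uncertain prediction
print(hybrid_loss(good, truth), hybrid_loss(poor, truth))
```

The BCE term works pixel by pixel, the IoU term scores the region as a whole, and the SSIM term compares local structure, so the three terms penalize different kinds of error.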

Experimental results and analysis
This section mainly consists of six experimental parts: the details of implementation, the dataset and the evaluation metrics, the performance of the proposed method and other previous methods, followed by the ablation study and analysis of failure cases.

Implementation details
The proposed method is implemented in the TensorFlow [46] framework. The weights of the new convolution layers in the framework are initialized with a standard deviation of 0.01, and the biases are initialized to 0. The weights of the backbone network are initialized from a network pre-trained on ImageNet [47]. The momentum and weight decay are set to 0.9 and 0.0005, respectively. The initial learning rate is set to 5e-5 and is reduced by a factor of 10 after 10 epochs. The framework is trained for 300 epochs in total.
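The learning-rate schedule can be written as a small step function. The text does not say whether the rate drops once or repeatedly; this sketch assumes a single drop by a factor of 10 at epoch 10, with epochs counted from 0. The function name `learning_rate` and its parameters are hypothetical.

```python
def learning_rate(epoch, base_lr=5e-5, drop_epoch=10, factor=10.0):
    """Step schedule: base_lr before drop_epoch, base_lr / factor afterwards."""
    return base_lr if epoch < drop_epoch else base_lr / factor


TOTAL_EPOCHS = 300
schedule = [learning_rate(e) for e in range(TOTAL_EPOCHS)]
print(schedule[0], schedule[9], schedule[10], schedule[-1])
```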

Dataset
In the experiments of this study, three kinds of strip steel surface defects [1] are selected, namely Scratches, Patches and Inclusion, as shown in Figure 5. All categories of defects are considered detection targets. In the dataset, the training set includes 3630 defect samples, and the test set includes 792 defect samples. All the samples are resized to 256×256 during network training.

Evaluation metric
To evaluate the proposed framework against previous state-of-the-art approaches, the following metrics are used: precision-recall (PR) curves, the F-measure score and the mean absolute error (MAE). The PR curve shows the average precision and recall of the saliency maps at different thresholds, formulated as follows:

$$ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} $$

where TP, FP and FN denote the numbers of true positive, false positive and false negative pixels, respectively. The F-measure, denoted Fβ, is computed as the weighted harmonic mean of precision and recall under a non-negative weight β, which is defined as:

$$ F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}} $$

where β² is set to 0.3, as in other methods. MAE [48] is used to calculate the mean absolute error between the ground truth and the prediction. First, the prediction and the ground truth are binarized. Then, the MAE score is computed by:

$$ \mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| P(x, y) - S(x, y) \right| $$

where P and S refer to the prediction and the ground truth, respectively, while H and W are the height and width of the images, respectively.
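The evaluation metrics can be computed with a few lines of plain Python. The sketch below uses the standard definitions of precision, recall, F-measure (with β² = 0.3) and MAE on a toy 2×2 example; the function names, the threshold of 0.5 and the sample values are hypothetical.

```python
def binarize(saliency, threshold):
    """Turn a saliency map with values in [0, 1] into a 0/1 map."""
    return [[1 if v >= threshold else 0 for v in row] for row in saliency]


def precision_recall(pred_bin, truth_bin):
    """Pixel-wise precision and recall of a binary prediction."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred_bin, truth_bin):
        for p, t in zip(pred_row, truth_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_measure(precision, recall, beta_sq=0.3):
    """Weighted harmonic mean of precision and recall."""
    denom = beta_sq * precision + recall
    return (1 + beta_sq) * precision * recall / denom if denom else 0.0


def mean_absolute_error(pred, truth):
    """Mean absolute pixel difference between two maps of equal size."""
    height, width = len(pred), len(pred[0])
    total = sum(abs(p - t)
                for pred_row, truth_row in zip(pred, truth)
                for p, t in zip(pred_row, truth_row))
    return total / (height * width)


saliency = [[0.9, 0.8], [0.2, 0.6]]    # toy predicted saliency map
truth = [[1, 0], [0, 1]]               # toy binary ground truth
pred_bin = binarize(saliency, 0.5)
precision, recall = precision_recall(pred_bin, truth)
print(precision, recall, f_measure(precision, recall),
      mean_absolute_error(pred_bin, truth))
```

Sweeping the threshold passed to `binarize` over a range of values and recording the resulting precision and recall pairs yields one PR curve per method.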

Comparison methods
In this subsection, the proposed method is compared with 10 previous state-of-the-art methods, including BSCA [49], FT [50], MIL [51], RC [52], SMD [53], FCN [40], UNet [41], DN [54], DHSNet [55] and DSS [56], all of which are pixel-wise methods. For a fair comparison, the same evaluation metrics and code are used to evaluate the output prediction maps.
As shown in Figure 5, the proposed method can accurately detect defects and highlight them evenly in various challenging cases: i) low contrast between the defect region and the background (e.g., rows 1 and 3); ii) different textural and geometric structures (e.g., rows 4 and 5); iii) diversity of defects (e.g., rows 6, 7 and 8). For low contrast, some methods miss the defect or detect only a rough defect area that does not represent the defect clearly. For different textural and geometric structures, the defect regions detected by the other methods contain some noise and are not distinct. For diversity of defects, some methods cannot detect all categories of defects. The deep learning-based methods detect defects better than the traditional methods. However, for some minor defects, FCN, UNet, DN, DSS and DHSNet either miss the defect or detect it incompletely. In contrast, the proposed approach not only distinguishes the defect area from the background effectively under low contrast, but also accurately locates and detects defects of different positions, scales and shapes.

Quantitative comparison
The advantages of the proposed method are shown in Figure 6. The method achieves the best performance among all the compared methods on the strip steel surface defect dataset in terms of all evaluation metrics. It further improves the PR curve and the F-measure, and reduces the MAE significantly. As listed in Table 2, the proposed method outperforms the competing methods in Fβ and MAE. Compared with the traditional methods, Fβ is improved by 36.12% and MAE is decreased by 12.08%. Compared with the deep learning methods, Fβ is improved by 0.38% and MAE is decreased by 0.02%. The above qualitative and quantitative comparisons further prove the effectiveness of this method.

Ablation study
In this work, an ablation study is conducted on the proposed contextual weighting module (CWM) to verify its effectiveness. First, the CWM is removed, the feature maps output by the encoder are combined directly with the decoded feature maps through dense short connections, and the best model is saved after training the overall network. Then the CWM is added back into the proposed method, and the trained model is saved after the same training procedure. Finally, the two trained models are tested separately to produce saliency prediction maps. As listed in Table 3, the contextual weighting module reduces MAE by 0.21% and improves Fβ by 0.48%. These results prove the effectiveness of the contextual weighting module in the framework. In addition, an ablation study is conducted on the loss function to verify its effectiveness. As listed in Table 3, the loss function reduces MAE by 1.21% and improves Fβ by 14.60%.

Analysis of failure cases
The results of this study show that the proposed method outperforms the previous state-of-the-art methods on the strip steel surface defect dataset. However, some defect images still pose challenges to these methods. Images (c) and (d) of Figure 7 show that some defects are detected incompletely. Images (a), (b), (e) and (f) show that some defects are missed. Figure 7 suggests three causes of these failures: some defects are too small to be detected; some defects show low contrast, so it is difficult to judge whether they really are defects; and in some cases the characteristics of the defect areas change noticeably. In the future, I plan to focus on solving these problems.

Conclusion
In this paper, a pixel-wise inspection framework based on CNN is proposed for the surface defect inspection of strip steel. First, the encoder of the framework is built on the pre-trained VGG-16 network, which is used to extract multi-level features. Next, the contextual weighting module uses convolutional kernels of different sizes to obtain multi-scale context feature information from the convolution layers, which achieves overall perception of the defect. Finally, in the decoder module, cross integration is used to integrate the context information and the previously decoded information into the current decoding block, which realizes feature information complementation. The experiments of this study demonstrate that the proposed method outperforms the previous state-of-the-art methods on the strip steel surface defect dataset. To sum up, the proposed method detects defects accurately, which makes the network robust and effective in defect detection. In the future, I will further optimize the algorithm model.