Defect Detection of Aluminum Profiles based on Improved Feature Pyramids

Surface defects of aluminum profiles are multi-scale, often small, and irregular in shape. This paper proposes a defect detection algorithm based on an improved feature pyramid. The method compresses and preserves the feature information extracted by the backbone network and calculates the similarity between deep and shallow features, which alleviates the loss of feature information and the weakening of feature representation ability, thereby addressing the multi-scale and small-object problems. In addition, deformable convolution is introduced to enhance the feature extraction ability of the model and to alleviate the detection problems caused by irregularly shaped defects. To verify the effectiveness of the proposed method, ablation experiments were conducted with Faster R-CNN as the basic detection algorithm, and the method was compared with classical detection algorithms; it reaches a mean average precision (mAP) of 72.8%. The experimental results show that the proposed method performs well on the aluminum profile defect detection task and outperforms the compared detection algorithms.


Introduction
Aluminum profiles are widely used in infrastructure construction and are also integral to aerospace, automobile manufacturing, and other industries. However, various defects appear on the surface of aluminum profiles during production and transportation, so detecting these surface defects is crucial for improving quality. Traditional surface defect detection is typically performed by human inspection, which is slow and inaccurate; it cannot keep up with the actual production process, increases production costs, and hurts the economic benefits of enterprises. In recent years, convolutional neural networks (CNNs) have performed well in computer vision and have attracted the attention of scholars in various fields. A CNN automatically learns image features from low level to high level, which removes the time-consuming and labor-intensive manual design of feature extraction operators required by traditional techniques. Compared with traditional computer vision technology, deep learning does not require a tedious preprocessing pipeline, and the extracted features generalize better because they are learned by the model. Consequently, several high-performing CNN-based object detection algorithms have been proposed, such as R-CNN [1], SSD [2], and YOLO [3], which have promoted the application of CNNs to surface defect detection. Liao [4] improved YOLOv4 for surface defect detection on printed circuit boards (PCBs). Zhang [5] integrated an attention mechanism into the SSD detector to detect surface defects of rare earth magnetic materials. Wang [6] and Xiang [7] both used K-means clustering and feature pyramid networks to optimize Faster R-CNN for surface defect detection of aluminum profiles.
In summary, although many scholars have conducted in-depth research on surface defect detection and its applications, two problems remain:
(1) To handle multi-scale and small-object surface defects, most researchers combine Faster R-CNN [8] with feature pyramid networks (FPN) [9], but the FPN has design flaws that reduce the representation ability of the features and lose feature information.
(2) Surface defects are irregular in shape. Feature extraction networks enlarge the receptive field by stacking convolutional layers, but the receptive field is usually a regular grid that cannot match the target shape, so background information is mixed in during feature extraction.
In view of these problems, this paper makes the following two contributions:
(1) An information enhancement feature pyramid network (IE-FPN) is designed. By constructing information re-injection and information similarity modules, it alleviates the information loss and the weakening of multi-scale feature representation in the FPN.
(2) Deformable convolutional networks (DCN) [10] are introduced so that the receptive field of the feature extraction network conforms to the shape of surface defects, reducing the focus on background information. In addition, the area and aspect ratio distributions of the annotation boxes in the training set are analyzed statistically to generate appropriate anchors.

Related work
In defect detection, accurately detecting surface defects is particularly important. Because two-stage detection algorithms perform excellently, Faster R-CNN is selected as the basic framework in this paper. Faster R-CNN is composed of a feature extraction network (backbone network), region of interest pooling (ROI Pooling), and a prediction head. The structure of Faster R-CNN is shown in Figure 1. Compared with natural images, surface defect images contain small and multi-scale objects. However, Faster R-CNN uses only the last feature map output by the backbone network for prediction, which cannot handle these problems effectively. As the backbone network deepens, the down-sampling rate increases and the resolution of the feature maps decreases, so the detailed information about the detection targets carried by the low-level features is lost and small-area defects are detected poorly. To alleviate these two problems and improve surface defect detection, we introduce two new modules into Faster R-CNN: (1) An improved feature pyramid network, which maps defects of different scales to the corresponding feature maps for prediction and alleviates the poor detection performance caused by small targets and multi-scale problems. (2) DCN, which applies coordinate offsets to the ordinary convolution kernel so that the kernel can better adapt to the shape of the defects. To limit model complexity, DCN is introduced only into the last two stages of the backbone network.
In the feature extraction process, low-level features have rich location and detail information, which is conducive to detecting small-area targets, while high-level feature maps have rich semantic information and are suitable for handling large targets.
Therefore, FPN maps objects of different scales to feature maps of different receptive fields, and then uses the corresponding feature maps for prediction. The definition of FPN can be expressed in equation (1).
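Equation (1) can be sketched as follows, assuming the standard FPN top-down formulation. The symbols $C_i$ (backbone feature at level $i$), $M_i$ (merged feature before smoothing), and $P_i$ (pyramid output at level $i$) are notation assumed here for illustration; at the topmost level the upsampled term is absent.

```latex
\begin{aligned}
M_i &= W_{1\times 1}(C_i) + \mathrm{Upsample}(M_{i+1}), \\
P_i &= W_{3\times 3}(M_i).
\end{aligned}
```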
Where $W_{3\times 3}$ and $W_{1\times 1}$ are a 3×3 convolution and a 1×1 convolution respectively, and Upsample is bilinear interpolation. The structure of FPN is shown in Figure 2.
Feature pyramid networks have been widely used in surface defect detection because of their structural and performance advantages. However, AugFPN [11] points out two design flaws: (1) There is a semantic gap between the features of different levels, and fusing them directly weakens the representation ability of multi-scale features. (2) The lateral connection in the FPN uses a 1×1 convolution to reduce the channel dimension of the feature, which leads to information loss. To alleviate the information loss and the weakening of multi-scale feature representation, we propose IE-FPN. The structure of IE-FPN is shown in Figure 3. IE-FPN adds the information re-injection module (IRIM) and the information similarity module (ISM) to the FPN. The feature information extracted by the backbone network is passed through IRIM to alleviate the information loss caused by channel dimension reduction. ISM is introduced into the fusion step to alleviate the reduced representation ability caused by the semantic gap between the fused features. Because the similarity matrix computed between features is affected by the spatial size of the features, this paper computes it only for the feature pairs P4 and P3, and P3 and P2. The IE-FPN can be defined as follows.

Information Enhancement Feature Pyramid Networks
Where $i$ is the level index of the feature map, and $W_{3\times 3}$ denotes a 3×3 convolution. The two modules, IRIM and ISM, are introduced in detail below.
The information re-injection module targets the information loss caused by channel dimension reduction in the lateral connections of the FPN. It compresses the channel information at each pixel of the unreduced feature map into the spatial domain to obtain compressed information feature maps, which are then normalized to filter and preserve the information of the feature map before channel dimension reduction. The framework of IRIM is shown in Figure 4. The input feature map $X \in \mathbb{R}^{H \times W \times C}$ is the original feature output by the backbone network, and it is processed by three branches. First, pooling is performed along the channel dimension to obtain an information compression feature map of dimension $H \times W \times 1$. Because a single pooling operation may lose information, maximum pooling and average pooling are applied in parallel. Second, the two information compression feature maps are concatenated and convolved so that their information interacts, and the result is normalized by SoftMax to obtain the spatial attention map. Meanwhile, the feature map $X$ is reduced in dimension by a 1×1 convolution to obtain $X' \in \mathbb{R}^{H \times W \times C'}$. Finally, $X'$ is multiplied element-wise by the spatial attention map to obtain the output feature $Y$. The IRIM can be defined as follows.
Where $W_{X'}$ is the 1×1 convolution parameter, $W_P$ is the 3×3 convolution parameter, $\sigma$ is the SoftMax normalization function, $C$ is the number of channels of the feature map $X$, and $k$ is the channel index.
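The three IRIM branches described above can be sketched in plain Python on nested lists. This is a minimal illustration and not the authors' implementation: the function names, the zero padding of the 3×3 convolution, the absence of bias terms, and the choice to apply SoftMax over all spatial positions are assumptions made for this sketch.

```python
import math


def conv3x3_two_to_one(maps, weights):
    """Zero-padded 3x3 convolution from two H x W maps to one H x W map.

    weights[c][dy][dx] is the kernel applied to input map c (assumed layout).
    """
    h, w = len(maps[0]), len(maps[0][0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for c in range(2):
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        y, x = i + dy, j + dx
                        if 0 <= y < h and 0 <= x < w:
                            acc += weights[c][dy + 1][dx + 1] * maps[c][y][x]
            out[i][j] = acc
    return out


def spatial_softmax(m):
    """SoftMax over all H x W positions (assumed normalization axis)."""
    peak = max(v for row in m for v in row)
    exps = [[math.exp(v - peak) for v in row] for row in m]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def irim(x, w_reduce, w_pool):
    """x: H x W x C nested list; w_reduce: C' x C matrix acting as the 1x1
    convolution; w_pool: 2 x 3 x 3 kernel for the pooled maps."""
    # Branches 1 and 2: compress the channels at each pixel by max and mean.
    max_map = [[max(px) for px in row] for row in x]
    avg_map = [[sum(px) / len(px) for px in row] for row in x]
    # Concatenate, convolve, and normalize to get the spatial attention map.
    attn = spatial_softmax(conv3x3_two_to_one([max_map, avg_map], w_pool))
    # Branch 3: 1x1 convolution (per-pixel matrix product), then reweighting.
    y = []
    for i, row in enumerate(x):
        y_row = []
        for j, px in enumerate(row):
            reduced = [sum(wk * v for wk, v in zip(w_row, px)) for w_row in w_reduce]
            y_row.append([attn[i][j] * r for r in reduced])
        y.append(y_row)
    return y, attn


# Hypothetical toy input with H = 2, W = 2, C = 3, reduced to C' = 2.
x = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
     [[2.0, 2.0, 2.0], [4.0, 0.0, 1.0]]]
w_reduce = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
w_pool = [[[0.1] * 3 for _ in range(3)] for _ in range(2)]
y, attn = irim(x, w_reduce, w_pool)
```

With this normalization the attention values sum to one over the whole map, so the output keeps the spatial size of the input while the channel count drops from C to C'.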
In summary, the proposed IRIM compresses the channel information into the spatial domain by combining two pooling operations to obtain the information compression feature maps, which are normalized and then applied to the feature map after channel dimension reduction. This alleviates the information loss caused by reducing the feature channel dimension. At the same time, in the spatial domain, the response of the feature map locations carrying important information is strengthened, which improves the quality of the feature map.
The information similarity module targets the semantic information asymmetry between features of different levels in the FPN, where direct fusion weakens the feature representation ability. ISM calculates the pixel similarity between features at different levels and uses the resulting similarity matrix to reduce the information gap between the feature maps, thereby alleviating the weakening of feature representation. The framework of ISM is shown in Figure 5.
Where $i$ and $j$ are the horizontal and vertical coordinate indices of the feature map, and $k$ is the channel index over the $C$ channels. $M$ is a similarity matrix, and its expression is as follows.
In summary, inspired by the attention mechanism, IE-FPN constructs IRIM and ISM to alleviate the problem of information loss in feature pyramids and the weakening of feature representation capabilities.
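One possible form of a pixel-wise similarity between two feature levels is sketched below. It assumes cosine similarity over the channel index $k$ at each position $(i, j)$, computed between a shallow feature map and a deep feature map that has already been resized to the same spatial size. The choice of cosine similarity, the function names, and the fusion rule are assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine_similarity_map(shallow, deep, eps=1e-8):
    """Per-pixel cosine similarity between two H x W x C nested lists."""
    m = []
    for row_s, row_d in zip(shallow, deep):
        m_row = []
        for ps, pd in zip(row_s, row_d):
            dot = sum(a * b for a, b in zip(ps, pd))
            norm = math.sqrt(sum(a * a for a in ps)) * math.sqrt(sum(b * b for b in pd))
            m_row.append(dot / (norm + eps))  # eps avoids division by zero
        m.append(m_row)
    return m


def fuse(shallow, deep, m):
    """Assumed fusion rule: add the deep feature weighted by its similarity."""
    return [
        [[s + m_ij * d for s, d in zip(ps, pd)]
         for ps, pd, m_ij in zip(row_s, row_d, row_m)]
        for row_s, row_d, row_m in zip(shallow, deep, m)
    ]


# Hypothetical toy features with H = 1, W = 2, C = 2.
shallow = [[[1.0, 0.0], [0.0, 2.0]]]
deep = [[[2.0, 0.0], [3.0, 0.0]]]
sim = cosine_similarity_map(shallow, deep)
fused = fuse(shallow, deep, sim)
```

In this toy case the first pixel has parallel feature vectors, so the deep feature is added almost unchanged, while the second pixel has orthogonal vectors and the deep feature is suppressed.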

Deformable Convolution Networks
Currently, most detection models use ordinary convolution for feature extraction. An ordinary convolution samples the input feature map X on a regular sliding window R and sums the sampled values, weighted by the parameters of the sliding window, to obtain the output feature map Y, as shown in equation (8).
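Assuming the standard formulation used in the deformable convolution literature, and using the symbols explained in the following paragraph, equation (8) can be written as:

```latex
Y(p_0) = \sum_{p_n \in R} W(p_n) \cdot X(p_0 + p_n)
```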
Where $p_0$ is a pixel position on the output feature map Y, W holds the parameter values of the sliding window, and $p_n$ enumerates the coordinates of the parameter values in the sliding window R. For a 3×3 sliding window, $p_n \in \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}$. However, surface defects vary in shape, and if ordinary convolutions are stacked, the receptive field of the model does not adapt to the shape of the target, so the network fails to extract valid information from the target or extracts interfering background information. Therefore, we use deformable convolution to alleviate these problems. Deformable convolution uses an additional convolution branch to predict a coordinate offset for each sampling point of the sliding window on the feature map. The sampling points therefore no longer form a regular square but change with the shape of the target. Deformable convolution is calculated as shown in equation (9).
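Following the standard deformable convolution formulation, equation (9) adds a learned offset to each sampling position: Y(p0) = Σ W(pn) · X(p0 + pn + ∆pn), summed over pn in R. Because the shifted positions are generally fractional, X is read with bilinear interpolation. The sketch below evaluates a single output position in plain Python. The function names, the zero padding outside the map, and the hand-written offsets are assumptions for illustration; in the actual model the offsets are predicted by a convolution branch.

```python
import math


def bilinear(x, r, c):
    """Sample the 2-D map x at fractional (row, col); zero outside the map."""
    h, w = len(x), len(x[0])
    r0, c0 = math.floor(r), math.floor(c)
    value = 0.0
    for rr in (r0, r0 + 1):
        for cc in (c0, c0 + 1):
            if 0 <= rr < h and 0 <= cc < w:
                weight = (1 - abs(r - rr)) * (1 - abs(c - cc))
                value += weight * x[rr][cc]
    return value


# Regular 3x3 sampling grid R, in row-major order.
GRID = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def deform_conv_at(x, kernel, p0, offsets=None):
    """Output value at position p0 = (row, col).

    kernel: 3 x 3 list of weights; offsets: nine (d_row, d_col) pairs in the
    order of GRID, or None to fall back to ordinary convolution.
    """
    if offsets is None:
        offsets = [(0.0, 0.0)] * len(GRID)
    total = 0.0
    for (dr, dc), (o_r, o_c) in zip(GRID, offsets):
        total += kernel[dr + 1][dc + 1] * bilinear(x, p0[0] + dr + o_r, p0[1] + dc + o_c)
    return total


# Hypothetical 4 x 4 map whose value equals the column index.
x = [[float(c) for c in range(4)] for _ in range(4)]
mean_kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
ordinary = deform_conv_at(x, mean_kernel, (1, 1))
shifted = deform_conv_at(x, mean_kernel, (1, 1), offsets=[(0.0, 0.5)] * 9)
```

With all offsets set to zero the function reduces to ordinary convolution; shifting every sampling point by half a column moves the averaged window from columns 0 to 2 to columns 0.5 to 2.5.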
Where $\Delta p_n$ is the predicted coordinate offset of the sampling point. Figure 6 shows a schematic diagram of 3×3 ordinary convolution and deformable convolution. The red box in Figure 6 is the defect area. For a target with a large aspect ratio, ordinary convolution also extracts background information while extracting features. By relaxing the regular grid sampling of ordinary convolution and spreading out the sampling points, the feature extraction network can better fit the shape of the target. Therefore, deformable convolution can capture more challenging, irregularly shaped defects.
To solve the problem, we apply offline augmentation to the training set, combining operations such as rotation and image sharpening. Figure 7 shows the sample distribution before and after data augmentation, and it can be seen that the above problems are alleviated.
The experiment is based on the PyTorch framework. The GPU is an NVIDIA GeForce RTX 3060, the memory is 12 GB, and the CPU is an Intel(R) Core(TM) i5-11400F.

Implementation Details
The model parameters are updated with the Adam optimizer, with a weight decay of 0.0001. The initial learning rate is 0.0001 and the batch size is 4. A fixed-step learning rate decay strategy is used with a decay factor of 0.1, and the model is trained for 30 epochs. Because the images in the dataset have a very high resolution and would require a large amount of computation, the input images are scaled to 1/4 of the original resolution when training and validating the network. The performance of the aluminum profile defect detection model is evaluated using the mean average precision (mAP).
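The fixed-step decay and the input scaling described above can be sketched as follows. The initial learning rate, decay factor, and epoch count come from the text; the step size of 10 epochs and the example image size are hypothetical values chosen only for illustration, since the text does not state them. In PyTorch, the same schedule would typically be set up with `torch.optim.Adam` (with its `weight_decay` argument) and `torch.optim.lr_scheduler.StepLR`.

```python
def step_lr(epoch, base_lr=1e-4, gamma=0.1, step_size=10):
    """Learning rate at a given epoch under fixed-step decay.

    step_size is a hypothetical value; the text does not state it.
    """
    return base_lr * gamma ** (epoch // step_size)


def scaled_size(width, height, factor=0.25):
    """Input size after scaling the image to 1/4 of its original resolution."""
    return round(width * factor), round(height * factor)


# Learning rate for each of the 30 training epochs.
schedule = [step_lr(epoch) for epoch in range(30)]

# Hypothetical original resolution, used only to show the scaling step.
train_size = scaled_size(2560, 1920)
```

Under the assumed step size, the learning rate drops by a factor of ten after epochs 10 and 20.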

Ablation Experiments
To better set the model parameters and improve the performance of the algorithm, a statistical analysis of the aspect ratios of the training set annotation boxes and of the proportion of defect areas is performed. Figure 8 shows the aspect ratio distribution of the annotation boxes, and Table 1 shows the proportion of defect area.
(3) IRIM and ISM are introduced together on top of the basic detector, that is, the original FPN is replaced by the IE-FPN proposed in this paper; (4) on the basis of step 3, the aspect ratio and area parameters selected after the analysis of the annotation boxes are introduced; (5) finally, the DCN is introduced to obtain the final detector of this paper. The mAP results of each experiment are shown in Table 2. As can be seen from Table 2, the mAP of the basic detection algorithm is 59.3%. Replacing the FPN with the proposed IE-FPN (experiment 3) increases the mAP by 5.3 percentage points over the basic detection algorithm. The final model, which uses all the improvement strategies, reaches an mAP of 72.8%, an increase of 13.5 percentage points over the basic detection algorithm. In summary, IRIM, ISM, and the other optimization strategies improve the performance of the basic detection algorithm and the detection of surface defects of aluminum profiles.
To further verify the effectiveness of the proposed method, it is compared with representative deep-learning-based object detection algorithms.
The comparison results are shown in Table 3.
As can be seen from Table 3, the proposed method achieves the best detection performance among the compared popular object detection algorithms. Compared with the SSD512 algorithm, the mAP of our method is higher by 12.5 percentage points; compared with the Cascade R-CNN [13] + FPN, RetinaNet [14] + FPN, and YOLOv3 [15] algorithms, it is higher by 7.8, 6.3, and 6.8 percentage points respectively. In terms of detection speed, measured in frames per second (fps), our method processes 18.2 images per second, which is faster than Cascade R-CNN + FPN but slower than RetinaNet + FPN, YOLOv3, and SSD512. This is expected, because the basic detection framework in this paper is Faster R-CNN, which trades detection speed for detection accuracy.

Conclusion
To address the multi-scale, small-object, and irregular-shape problems of surface defects on aluminum profiles, this paper proposes a surface defect detection method based on an improved FPN. First, IRIM and ISM are constructed and introduced to alleviate the information loss and semantic gap of the FPN. Second, the DCN is introduced so that the receptive field of the feature extraction network conforms to the shape of the surface defects of aluminum profiles, reducing the attention paid to background information. Experiments on the aluminum profile surface defect dataset verify the effectiveness of the method, and its detection accuracy is better than that of the currently popular detection algorithms it was compared with. To make surface defect detection of aluminum profiles better suited to industrial scenarios and to improve product quality, we will continue to study this task in the future, aiming for better identification accuracy and detection speed.