Research on Vehicle Appearance Component Recognition Based on Mask R-CNN

Recognition of vehicle exterior components is one of the core algorithms in intelligent vehicle inspection. This paper focuses on a recognition algorithm for vehicle appearance parts that can locate each part in an image and identify its name. An enterprise self-built dataset is used. First, a vehicle close-range dataset is produced by cropping as a form of data augmentation. Second, Mask R-CNN models with ResNeXt-50+FPN, ResNeXt-101+FPN and ResNeXt-50 backbone networks are trained in three stages: on the panoramic dataset, on the close-range dataset, and on the integrated panoramic and close-range dataset. Finally, the model based on ResNeXt-101+FPN with Mask R-CNN performs best on the integrated dataset.


Introduction
In recent years, with the sustained and rapid development of China's economy and society, the number of motor vehicles in China has grown rapidly. At the same time, people pay increasing attention to vehicle safety. The condition of vehicle appearance components, including scratches, deformation and similar damage, is an important input in vehicle appearance quality inspection, damage-assessment claims in automobile insurance, second-hand vehicle valuation and other business scenarios. Recognition of vehicle appearance components is therefore particularly important as a prerequisite for detecting the condition of the vehicle exterior and locating damage.
With the rapid development of deep learning, object detection methods based on deep convolutional neural networks have been widely adopted. At present, research on vehicle recognition mostly focuses on detecting vehicles, vehicle types and license plates; image recognition of vehicle appearance parts has received little attention. For vehicle appearance component recognition, overlapping detection regions would introduce errors into subsequent applications such as damage detection. To avoid this, this paper proposes an instance segmentation method for components, in which appearance components are recognized by pixel-level object segmentation.
This paper builds a recognition model based on the Mask R-CNN network structure, using ResNeXt-50+FPN, ResNeXt-101+FPN and ResNeXt-50 as feature extraction networks respectively. The target vehicle appearance components are divided into 31 categories, including the front bumper, front fender and front grille. In the training phase, the self-built vehicle appearance component dataset CATARC-Vehicle is used, with data augmentation and careful parameter tuning, to study object detection in complex scenes.

Introduction of Mask R-CNN
The Mask R-CNN network model [1] was proposed by He et al. at ICCV 2017. It can effectively detect targets while outputting high-quality instance segmentations. Mask R-CNN improves on the Faster R-CNN network structure [2] by adding a target mask output, enabling the model to accomplish three tasks simultaneously: target recognition, candidate box regression and segmentation mask prediction.
Mask R-CNN first extracts features from the input image and feeds the feature maps into the RPN network [3] to obtain candidate boxes. It uses an RoIAlign layer to align RoI features precisely to the pixel grid, and then a head network to classify targets and regress candidate boxes. In parallel with classification and regression, a mask-prediction branch is added, so that each pixel in an RoI can be identified as belonging to the target category or not. The loss function of the Mask R-CNN network consists of three parts: classification loss, regression loss and segmentation loss. The Mask R-CNN network structure is shown in Figure 1.
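The three-part loss described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names are our own, the classification and box losses are passed in as precomputed scalars, and the mask loss is the average per-pixel binary cross-entropy over the ground-truth class's mask, as in the original Mask R-CNN formulation.

```python
import math

def bce(p, g, eps=1e-7):
    """Per-pixel binary cross-entropy between predicted probability p
    and binary ground-truth label g, clamped for numerical stability."""
    p = min(max(p, eps), 1 - eps)
    return -(g * math.log(p) + (1 - g) * math.log(1 - p))

def mask_rcnn_loss(loss_cls, loss_box, mask_probs, mask_gt):
    """Total Mask R-CNN loss for one RoI: classification loss + box
    regression loss + mask loss (mean BCE over the pixels of the
    ground-truth class's predicted mask)."""
    loss_mask = sum(bce(p, g) for p, g in zip(mask_probs, mask_gt)) / len(mask_gt)
    return loss_cls + loss_box + loss_mask

# Toy RoI with a two-pixel mask: one foreground pixel predicted at 0.9,
# one background pixel predicted at 0.1
total = mask_rcnn_loss(0.0, 0.0, [0.9, 0.1], [1, 0])
```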

Vehicle Appearance Component Recognition Model
ResNeXt [4] solves the problem of deep network degradation compared with LeNet [5] and AlexNet [6], and achieves better results than ResNet [7] with the same number of parameters. The feature pyramid network FPN [8] builds feature maps at different scales: low-level feature maps retain location information well, while high-level feature maps carry richer semantic information. By combining the high-level feature map at each layer with the corresponding low-level feature map, FPN handles small-scale targets better and makes fuller use of both location and semantic information.
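The high-level/low-level merge in FPN can be sketched numerically. In the sketch below (our own simplification, assuming NumPy is available), the coarser map is upsampled 2x by nearest-neighbour repetition and added element-wise to the finer lateral map; in the real FPN the lateral map first passes through a 1x1 convolution, which is omitted here.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return feat.repeat(2, axis=0).repeat(2, axis=1)

def fpn_merge(top, lateral):
    """One FPN top-down step: upsample the coarser (more semantic) map
    and add the finer (better-localized) lateral map element-wise."""
    return upsample2x(top) + lateral

# Toy shapes: a 2x2 high-level map merged into a 4x4 low-level map
top = np.ones((2, 2, 8))
lateral = np.zeros((4, 4, 8))
merged = fpn_merge(top, lateral)  # shape (4, 4, 8)
```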
Vehicle appearance component recognition requires both high recognition accuracy and fast prediction. To apply Mask R-CNN effectively to vehicle appearance part detection, this paper selects ResNeXt-50+FPN, ResNeXt-101+FPN and ResNeXt-50 as feature extraction backbones, combines each with Mask R-CNN, and builds the resulting network structures for the experiments.

Panoramic Dataset
There is no public dataset for vehicle appearance component recognition, so the dataset used in this experiment is the self-built CATARC-Vehicle dataset. To enable pixel-level detection of vehicle appearance parts, the CATARC-Vehicle dataset is annotated with polygon segmentation, totaling 10,026 panoramic images. The vehicles in the images are in a particular condition: no serious damage, no open components, etc. In addition, the images include daytime and night shots, making the dataset well suited to vehicle appearance detection in complex scenes. An annotation example is shown in Figure 2.

Close-Range Dataset
In the application scenario of vehicle appearance recognition, close-range images of the vehicle are taken to capture damage more clearly, so the recognition model must also handle local appearance components in close-range images well. Data augmentation, such as image flipping, rotation, scaling and cropping, is a common remedy when data is insufficient. To save annotation cost while constructing a close-range dataset, each image in the vehicle panoramic dataset is cut vertically and horizontally into four close-range images. The specific method is as follows: (1) a trained vehicle detection model locates the vehicle in the picture, and the horizontal and vertical cut axes are placed through the vehicle center; (2) the original picture is cut into four images along these axes; (3) the corresponding annotation information is split accordingly and four annotation files are generated. Particular care is needed with complex concave polygons that straddle a cut line.
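The cutting procedure above can be sketched as follows. This is a simplified sketch with our own function names: it computes the four crop boxes from the vehicle center and translates each polygon annotation into its quadrant's local coordinates. Polygons that cross a cut line (the concave-polygon case noted in the text) would require geometric clipping, which is out of scope here, so the sketch simply skips them.

```python
def quadrant_boxes(img_w, img_h, cx, cy):
    """Four crop boxes (x0, y0, x1, y1) formed by cutting the image
    horizontally and vertically through the vehicle center (cx, cy)."""
    return [
        (0, 0, cx, cy),          # top-left
        (cx, 0, img_w, cy),      # top-right
        (0, cy, cx, img_h),      # bottom-left
        (cx, cy, img_w, img_h),  # bottom-right
    ]

def shift_polygon(poly, box):
    """Translate a polygon annotation into the crop's local coordinates
    if it lies fully inside the crop box. Polygons straddling a cut line
    need proper clipping and are skipped (None) in this sketch."""
    x0, y0, x1, y1 = box
    if all(x0 <= x < x1 and y0 <= y < y1 for x, y in poly):
        return [(x - x0, y - y0) for x, y in poly]
    return None

# Toy example: a 100x80 image with the vehicle centered at (50, 40)
boxes = quadrant_boxes(100, 80, 50, 40)
local = shift_polygon([(60, 50), (70, 50), (65, 60)], boxes[3])
```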
The close-range dataset contains more images than the panoramic dataset, since each panoramic image is cut into four, and close-range views better match the targets encountered in practice. After data cleaning to exclude invalid pictures, 20,287 images were finally obtained. Cropped examples are shown in Figure 5.

Experiments and Results Analysis
The experiments in this paper were run on a high-performance computer with six NVIDIA GeForce GTX GPUs and the CentOS operating system, using CUDA 10.1, cuDNN 7.5.0, the deep learning framework Caffe2, and a Python 3.7 programming environment.
After the close-range images were obtained through data augmentation, 30,313 images were available in total. They were randomly divided into training, validation and test sets at proportions of 70%, 15% and 15%. After partitioning, the panoramic images comprise 6,951 training, 1,513 validation and 1,562 test images, and the close-range images comprise 14,174 training, 3,119 validation and 2,994 test images.
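A 70/15/15 random split of this kind can be sketched as below. The helper is our own illustration (not the paper's code): it shuffles with a fixed seed for reproducibility and assigns the remainder after the training and validation shares to the test set.

```python
import random

def split_dataset(items, train=0.70, val=0.15, seed=0):
    """Randomly split items into train/val/test sets; the paper uses
    proportions of 70%, 15% and 15%. The remainder after the train and
    validation shares goes to the test set."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# The paper's 30,313 images, split 70/15/15
train_set, val_set, test_set = split_dataset(range(30313))
```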
The experiments on the CATARC-Vehicle dataset are divided into three stages: (1) the panoramic dataset; (2) the close-range dataset; (3) the integrated panoramic and close-range dataset. Experiments are carried out on all three datasets in order to explore a faster and simpler method, and, to match the needs of practical applications, models are tested on the integrated dataset, which better evaluates whether they meet the requirements of the application scenario.

First Stage
In the first stage, experiments are carried out on panoramic dataset based on the networks of ResNeXt-50+FPN, ResNeXt-101+FPN and ResNeXt-50 backbone combined with Mask R-CNN. The experimental results of target segmentation are shown in Table 2.

Second Stage
In the second stage, experiments are carried out on close-range dataset based on three backbone networks of ResNeXt-50+FPN, ResNeXt-101+FPN and ResNeXt-50 combined with Mask R-CNN.
The experimental results of target segmentation are shown in Table 3.

Third Stage
In the third stage, experiments are carried out on the integrated panoramic and close-range dataset based on the ResNeXt-50+FPN, ResNeXt-101+FPN and ResNeXt-50 backbones combined with Mask R-CNN. The experimental results of target segmentation are shown in Table 4, with visualized results in Figure 6. Comparing the three stages, the model trained on the integrated panoramic and close-range dataset generally tests better than those trained on the other two datasets. The visualizations likewise show that the integrated dataset yields smoother mask edges and more accurate image segmentation.
Comparing the three networks, the one based on ResNeXt-101+FPN with Mask R-CNN achieves the highest accuracy in all three stages. In addition, FPN is good at handling small targets: on the APS metric, the networks that add FPN feature extraction outperform the network using ResNeXt alone.

Conclusion
In this paper, deep learning is applied to the recognition of vehicle appearance components in complex scenes. Mask R-CNN combined with three backbone networks is evaluated on the panoramic dataset, the close-range dataset and the integrated panoramic and close-range dataset. On the integrated dataset, the model based on ResNeXt-101+FPN with Mask R-CNN performs best, with a recognition accuracy of 84.4% and an AP50 as high as 95.1%.
To meet the needs of practical application scenarios, this paper overcomes the difficulty of insufficient data by using data augmentation to produce the close-range dataset. Experiments show that training is more effective after the close-range data is added.
In future research, more types of vehicle appearance components and more data can be added to improve the generalization ability of the model. In addition, the network structure can be further refined to improve accuracy.