A method of cross-layer fusion multi-object detection and recognition based on improved faster R-CNN model in complex traffic environment

doi:10.1016/j.patrec.2021.02.003

Pattern Recognition Letters

Volume 145, May 2021, Pages 127-134

https://doi.org/10.1016/j.patrec.2021.02.003

Highlights

• Cross-layer Fusion.
• Multi-object Detection and Recognition.
• Faster R-CNN Model.
• Multi-class cross entropy loss function.
• Soft-NMS.

Abstract

Improving detection accuracy and speed is a prerequisite for multi-object recognition in complex traffic environments. Although object detection based on deep neural networks has made significant advances, detecting small and occluded objects remains a challenge. We address this challenge through multiscale fusion. We introduce a cross-layer fusion multi-object detection and recognition algorithm based on Faster R-CNN, in which the five-layer structure of VGG16 (Visual Geometry Group) is used to obtain more feature information. We implement this idea by laterally embedding 1×1 convolution kernels, max pooling and deconvolution, in conjunction with a weighted balanced multi-class cross-entropy loss function and Soft-NMS to control the imbalance between difficult and easy samples. Considering the actual conditions of complex traffic environments, we manually label a mixed dataset. Experimental results on the Cityscapes and KITTI datasets show that the proposed model achieves better results than current mainstream object detection models.

Introduction

Object detection has always been a key technology for vehicles to cope with complex scenes reasonably and safely, and is one of the hot spots in computer vision research. Researchers have conducted a large number of studies on object detection methods. For example, Haar features and an AdaBoost classifier were combined with a sliding window for face detection. Features extracted by the Histogram of Oriented Gradients (HOG) [1] were classified by a Support Vector Machine (SVM) [2] for pedestrian detection. For general object detection, HOG features and the Deformable Part Model (DPM) algorithm were adopted. These methods use few features and greatly improve time efficiency; however, they have obvious limitations and inaccuracies.

In recent years, with the development of deep learning, convolutional neural networks have become significantly superior to traditional methods in accuracy and are now a research hotspot. Girshick et al. [3] proposed R-CNN, which applies high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects. Zhang et al. [4] improved object detection with deep convolutional networks via Bayesian optimization and structured prediction. Ren et al. [5] proposed Faster R-CNN and introduced the fully convolutional Region Proposal Network (RPN), which simultaneously predicts object bounds and objectness scores at each position. Kong et al. [6] proposed HyperNet, which combines Hyper features from the bottom, middle and upper layers to obtain better results on small objects. Zuo et al. [7] proposed traffic sign detection based on Faster R-CNN. Wang et al. [8] proposed A-Fast-RCNN, which introduces GAN [9], [10] to generate hard samples that improve the network’s adaptability to occlusion and deformation. Jian et al. [11] investigated salient feature fusion strategies in the human visual attention mechanism for saliency detection. Jian et al. [12] proposed a novel computational model for saliency detection that integrates the holistic center-directional map with the principal local color contrast map. Jian et al. [13] proposed a novel framework for underwater image saliency detection that exploits the Quaternionic Distance Based Weber Descriptor. Jian et al. [14] proposed a video saliency-detection model based on the human attention mechanism and fully convolutional neural networks. Jian et al. [15] described a simple visual saliency-detection model based on the spatial position of salient objects and background cues. Chen et al. [16] used 3D object proposals from stereo images for accurate object class detection. Peng et al. [17] designed a concurrent softmax to handle multi-label problems in object detection and proposed a soft-sampling method with a hybrid training scheduler to deal with label imbalance. Li et al. [18] provided the first systematic analysis of the underperformance of state-of-the-art models on long-tail distributions. Gkioxari et al. [19] proposed a Faster R-CNN-based model to detect and recognize human-object interactions. Hu et al. [20] proposed a deeply supervised method with short connections for salient object detection.

The Faster R-CNN algorithm has high precision and strong scalability, and in recent years many researchers have built improvements on it. To address the low accuracy and speed of multi-object detection in current complex traffic environments, we propose a cross-layer fusion multi-object detection and recognition algorithm based on Faster R-CNN. Our main contributions are as follows:

(1) We use the five-layer structure of VGG16 to obtain more feature information. Specifically, we laterally embed a 1×1 convolution kernel after the last convolution of layers 1, 3 and 5, add a max-pooling layer to layer 1 so that it can be fused with layer 3, and add a deconvolution operation to layer 5 so that it can also be fused with layer 3.

(2) To control the imbalance between difficult and easy samples, we use a weighted balanced multi-class cross-entropy loss function and Soft-NMS (NMS: non-maximum suppression).

(3) Considering the actual conditions of complex traffic environments, we manually labeled a mixed dataset. Experimental results show that the proposed model achieves better results than current mainstream object detection models.
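The fusion in contribution (1) only works if the three branches reach the same spatial resolution before they are combined. The sketch below checks that arithmetic for a standard VGG16 layout. The 224-pixel input, the 4× pooling and deconvolution factors, and the 128-channel 1×1 projection are illustrative assumptions, not values taken from the paper, and the function names are hypothetical.

```python
# Sketch only: input size, kernel/stride choices and the projected channel
# count are assumptions made for illustration, not the paper's settings.
# In standard VGG16 the last convolutions of stages 1, 3 and 5 output
# 64, 256 and 512 channels; a 1x1 convolution can map each to a common width.


def stage_output_size(input_size, stage):
    """Spatial size at the last convolution of a VGG16 stage.

    Convolutions inside a stage keep the size (3x3, padding 1); a 2x2
    stride-2 max pool between stages halves it, so stage k sees k-1 pools.
    """
    size = input_size
    for _ in range(stage - 1):
        size //= 2
    return size


def max_pool_size(size, kernel, stride):
    """Output size of max pooling without padding (floor mode)."""
    return (size - kernel) // stride + 1


def deconv_size(size, kernel, stride, padding=0):
    """Output size of a transposed convolution without output padding."""
    return (size - 1) * stride - 2 * padding + kernel


def fused_shape(input_size, projected_channels=128):
    """Shape (channels, height, width) after concatenating layers 1, 3 and 5."""
    size1 = stage_output_size(input_size, 1)
    size3 = stage_output_size(input_size, 3)
    size5 = stage_output_size(input_size, 5)

    down1 = max_pool_size(size1, kernel=4, stride=4)  # layer 1 down to layer 3
    up5 = deconv_size(size5, kernel=4, stride=4)      # layer 5 up to layer 3
    if not (down1 == size3 == up5):
        raise ValueError("feature maps do not align; adjust kernel/stride")

    # Three branches, each projected by a 1x1 convolution, then concatenated.
    return 3 * projected_channels, size3, size3


print(fused_shape(224))  # (384, 56, 56)
```

Under these assumptions layer 3 acts as the reference resolution: layer 1 is pooled down by a factor of 4 and layer 5 is upsampled by a factor of 4, so all three maps are 56×56 before concatenation.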

The overall workflow of our method is shown in Fig. 1.

The rest of this paper is organized as follows. In Section 2, we briefly introduce the Faster R-CNN and RPN. Section 3 introduces the improved network structure based on the Faster R-CNN model and the weighted balanced multi-class cross entropy loss function. In Section 4, we describe the training process and the experimental contrast results. Section 5 gives the discussion and conclusion.

Section snippets

Faster R-CNN for object detection

Faster R-CNN uses a Region Proposal Network (RPN) [21] in place of selective search (SS) [22] to select candidate boxes, which greatly improves detection speed. Faster R-CNN is widely used in object detection and recognition, but its detection accuracy for small and occluded objects needs to be improved. As shown in Fig. 2, neither the occluded cars nor the distant cars are recognized.
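One plausible contributor to missed occluded objects is hard non-maximum suppression, which discards any box that overlaps a higher-scoring box beyond a threshold. Soft-NMS, named in the contributions, decays the score of an overlapping box instead of deleting it. The sketch below uses the Gaussian decay form of Soft-NMS; the `sigma` and `score_threshold` values and the function names are assumptions for illustration, and the paper's exact variant and settings are not given in this excerpt.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: returns (box, score) pairs in selection order.

    sigma and score_threshold are illustrative defaults, not paper values.
    """
    candidates = [[box, score] for box, score in zip(boxes, scores)]
    kept = []
    while candidates:
        # Select the remaining box with the highest (possibly decayed) score.
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best)
        kept.append((best_box, best_score))

        survivors = []
        for box, score in candidates:
            # Decay the score smoothly with overlap instead of deleting the box.
            decayed = score * math.exp(-(iou(best_box, box) ** 2) / sigma)
            if decayed >= score_threshold:
                survivors.append([box, decayed])
        candidates = survivors
    return kept


# Two heavily overlapping boxes (for example a car partly hidden by another)
# plus one distant box.
detections = soft_nms(
    [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)],
    [0.9, 0.8, 0.7],
)
for box, score in detections:
    print(box, round(score, 3))
```

In this toy input the second box overlaps the first with an IoU of about 0.68, so hard NMS with a 0.5 threshold would drop it, whereas Soft-NMS keeps it with a reduced score.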

Visual geometry group network

The Visual Geometry Group network (VGG) is a deep convolutional neural network architecture [23], [24].

The improved faster R-CNN networks

In this paper, a cross-layer fusion multi-object detection and recognition algorithm is proposed. The five-layer convolutional structure of VGG16 serves as the main architecture, and small convolution kernels of different dimensions are added to the hidden layers at layers 1, 3 and 5. After pairwise cross fusion, the feature map is extracted, and classification and localization are then performed by RPN and ROI. The algorithm is divided into 4 parts, the first is the input images of any size and angle,

Experimental environment

Experiments are carried out in the Caffe (Convolutional Architecture for Fast Feature Embedding) software environment under Ubuntu 18.04. The hardware is an Intel i7-8700K CPU and a GTX 1070 Ti GPU with 8 GB of memory.

Training process

To verify the influence of multiscale fusion, the weighted balanced multi-class cross-entropy loss function and Soft-NMS on detection performance, the training process in this paper is divided into the following six steps. Algorithm 4 shows the pseudocode of training.
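To make the loss term concrete, here is a minimal sketch of a class-weighted multi-class cross-entropy. The inverse-frequency weighting and the normalization by the sum of weights are assumptions chosen for illustration; the paper's exact weighting scheme is not given in this excerpt, and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def inverse_frequency_weights(labels, num_classes):
    """Assumed weighting: rarer classes receive proportionally larger weights."""
    counts = [0] * num_classes
    for label in labels:
        counts[label] += 1
    n = len(labels)
    return [n / (num_classes * c) if c > 0 else 0.0 for c in counts]


def weighted_cross_entropy(logits_batch, labels, class_weights):
    """Weighted mean of -log p(true class), normalized by the sum of weights."""
    loss_sum = 0.0
    weight_sum = 0.0
    for logits, label in zip(logits_batch, labels):
        weight = class_weights[label]
        loss_sum += -weight * math.log(softmax(logits)[label])
        weight_sum += weight
    return loss_sum / weight_sum


# Toy batch: class 0 is frequent, class 1 is rare and is scored poorly.
labels = [0, 0, 0, 1]
logits_batch = [[2.0, 0.0]] * 4
weights = inverse_frequency_weights(labels, num_classes=2)

plain = weighted_cross_entropy(logits_batch, labels, [1.0, 1.0])
balanced = weighted_cross_entropy(logits_batch, labels, weights)
print(plain < balanced)  # True: the rare-class error counts for more
```

With uniform weights the function reduces to the ordinary mean cross-entropy, which makes it easy to compare training runs with and without the weighting.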

Loss function of training process

In Fig. 4, the initial value of

Conclusions

In this paper, a VGG16-based improvement of Faster R-CNN was proposed for multi-object detection and recognition. Experimental results show that, compared with previous neural networks such as Fast R-CNN and Faster R-CNN based on the VGG16 template, the improved Faster R-CNN model integrates low-level and high-level image semantic features, which allows the model to acquire more information, so the positioning accuracy of the object pixel feature is improved, and the weighted multi-class

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant 61701060 and the Doctoral Talent Training Project of Chongqing University of Posts and Telecommunications under Grant BYJS202007.

References (27)

S. Tian et al.
Multilingual scene character recognition with co-occurrence of histogram of oriented gradients
Pattern Recognit.
(2016)
Y.T. Zhang et al.
Improving object detection with deep convolutional networks via Bayesian optimization and structured prediction
Conference on Computer Vision and Pattern Recognition (CVPR)
(2015)
M.W. Jian et al.
Assessment of feature fusion strategies in visual attention mechanism for saliency detection
Pattern Recognit. Lett.
(2019)
M.W. Jian et al.
Saliency detection based on directional patches extraction and principal local color contrast
J. Vis. Commun. Image Repres.
(2018)
M.W. Jian et al.
Integrating QDWD with pattern distinctness and local contrast for underwater saliency detection
J. Vis. Commun. Image Repres.
(2018)
Y. Li et al.
Overcoming classifier imbalance for long-tail object detection with balanced group softmax
Conference on Computer Vision and Pattern Recognition (CVPR)
(2020)
Y.H. He et al.
Bounding box regression with uncertainty for accurate object detection
Conference on Computer Vision and Pattern Recognition (CVPR)
(2019)
T. Noi et al.
Comparison of random forest, k-nearest neighbor, and support vector machine classifiers for land cover classification using sentinel-2 imagery
Sensors
(2018)
R. Girshick et al.
Rich feature hierarchies for accurate object detection and semantic segmentation
Conference on Computer Vision and Pattern Recognition (CVPR)
(2014)
S.Q. Ren et al.
Faster R-CNN: towards real-time object detection with region proposal networks
IEEE Trans. Pattern Anal. Mach. Intell.
(2017)
T. Kong et al.
HyperNet: towards accurate region proposal generation and joint object detection
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2016)
Z.R. Zuo et al.
Traffic signs detection based on faster R-CNN
International Conference on Distributed Computing Systems Workshops (ICDCSW)
(2017)
X.L. Wang et al.
A-Fast-RCNN: hard positive generation via adversary for object detection
Conference on Computer Vision and Pattern Recognition (CVPR)
(2017)


Editor: Yuxin Peng.
