Elsevier

Pattern Recognition Letters

Volume 145, May 2021, Pages 127-134

A method of cross-layer fusion multi-object detection and recognition based on improved faster R-CNN model in complex traffic environment

https://doi.org/10.1016/j.patrec.2021.02.003

Highlights

  • Cross-layer Fusion.

  • Multi-object Detection and Recognition.

  • Faster R-CNN Model.

  • Multi-class cross entropy loss function.

  • Soft-NMS.

Abstract

Improving detection accuracy and speed is a prerequisite for multi-object recognition in complex traffic environments. Although object detection based on deep neural networks has made significant advances, detecting small and occluded objects remains a challenge. We address this challenge with multiscale fusion. We introduce a cross-layer fusion multi-object detection and recognition algorithm based on Faster R-CNN, in which the five-layer structure of VGG16 (Visual Geometry Group) is used to obtain more feature information. We implement this idea by laterally embedding 1×1 convolution kernels, max pooling and deconvolution, in conjunction with a weighted balanced multi-class cross entropy loss function and Soft-NMS to control the imbalance between difficult and easy samples. Considering the actual conditions of a complex traffic environment, we manually label a mixed dataset. On the Cityscapes and KITTI datasets, experimental results show that the proposed model outperforms current mainstream object detection models.

Introduction

Object detection has always been a key technology for enabling vehicles to cope with complex scenes reasonably and safely, and it is one of the hot spots in computer vision research. Researchers have conducted a large number of studies on object detection methods. For example, Haar features and an AdaBoost classifier were combined with a sliding window for face detection. Features extracted by the Histogram of Oriented Gradients (HOG) [1] were classified by a Support Vector Machine (SVM) [2] for pedestrian detection. For general object detection, HOG features and the Deformable Part Model (DPM) algorithm were adopted. These methods use few features and are highly time-efficient; however, they have obvious limitations and inaccuracies.

In recent years, with the development of deep learning, convolutional neural networks have become significantly more accurate than traditional methods and are now a major research focus. Girshick et al. [3] proposed the region-based R-CNN, which applies a high-capacity convolutional neural network (CNN) to bottom-up region proposals in order to localize and segment objects. Zhang et al. [4] improved object detection with deep convolutional networks via Bayesian optimization and structured prediction. Ren et al. [5] proposed Faster R-CNN and introduced the Region Proposal Network (RPN), a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. Kong et al. [6] proposed HyperNet, which combines the Hyper features of the bottom, middle and upper layers to obtain better results on small objects. Zuo et al. [7] proposed traffic sign detection based on Faster R-CNN. Wang et al. [8] proposed A-Fast-RCNN, which introduces a GAN [9], [10] to generate hard samples and thereby improve the network’s adaptability to occlusion and deformation. Jian et al. [11] investigated salient feature fusion strategies in the human visual attention mechanism for saliency detection. Jian et al. [12] proposed a novel computational model for saliency detection that integrates the holistic center-directional map with the principal local color contrast map. Jian et al. [13] proposed a novel framework for underwater image saliency detection that exploits the Quaternionic Distance Based Weber Descriptor. Jian et al. [14] proposed a video saliency-detection model based on the human attention mechanism and fully convolutional neural networks. Jian et al. [15] described a simple visual saliency-detection model based on the spatial position of salient objects and background cues. Chen et al. [16] used 3D object proposals from stereo images for accurate object class detection. Peng et al. [17] designed a concurrent softmax to handle multi-label problems in object detection and proposed a soft-sampling method with a hybrid training scheduler to deal with label imbalance. Li et al. [18] provided the first systematic analysis of the underperformance of state-of-the-art models on long-tail distributions. Gkioxari et al. [19] proposed a Faster R-CNN-based model to detect and recognize human-object interactions. Hu et al. [20] proposed a deeply supervised method with short connections for salient object detection.

The Faster R-CNN algorithm has high precision and strong scalability, and in recent years many researchers have proposed improvements to it. To address the low accuracy and speed of multi-object detection in current complex traffic environments, we propose a cross-layer fusion multi-object detection and recognition algorithm based on Faster R-CNN. Our main contributions are as follows:

(1) We use the five-layer structure of VGG16 to obtain more feature information. Specifically, we laterally embed a 1×1 convolution kernel after the last convolution of layers 1, 3 and 5, then add a max-pooling layer to layer 1 so that it can be fused with layer 3. For layer 5, we add a deconvolution operation so that it can also be fused with layer 3.

(2) To control the imbalance between difficult and easy samples, we use a weighted balanced multi-class cross entropy loss function and Soft-NMS (soft non-maximum suppression).

(3) Considering the actual conditions of a complex traffic environment, we manually labeled a mixed dataset. Experimental results show that the proposed model outperforms current mainstream object detection models.
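
The fusion in contribution (1) can be sketched in plain Python on toy feature maps. This is an illustrative sketch under assumptions, not the authors' implementation. The function names (`conv1x1`, `max_pool`, `deconv_ones`, `cross_layer_fuse`) are hypothetical. The resampling factor of 4 is assumed because, in standard VGG16, layer 1 sits at stride 1, layer 3 at stride 4 and layer 5 at stride 16. The deconvolution uses a fixed all-ones kernel where a real layer would learn its weights, and channel concatenation is assumed as the fusion operator, which this excerpt does not specify.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def conv1x1(fmap: FeatureMap, weights: List[List[float]]) -> FeatureMap:
    """1x1 convolution: mix channels independently at every pixel.

    weights[o][c] is the contribution of input channel c to output channel o.
    """
    h, w = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(wt * fmap[c][y][x] for c, wt in enumerate(row)) for x in range(w)]
         for y in range(h)]
        for row in weights
    ]


def max_pool(fmap: FeatureMap, k: int) -> FeatureMap:
    """Non-overlapping k x k max pooling (stride k)."""
    h, w = len(fmap[0]) // k, len(fmap[0][0]) // k
    return [
        [[max(ch[y * k + dy][x * k + dx] for dy in range(k) for dx in range(k))
          for x in range(w)]
         for y in range(h)]
        for ch in fmap
    ]


def deconv_ones(fmap: FeatureMap, k: int) -> FeatureMap:
    """Stride-k transposed convolution with a fixed all-ones k x k kernel.

    With these fixed weights each pixel is copied into a k x k block.
    Assumption: a real deconvolution layer would learn its kernel instead.
    """
    return [
        [[ch[y // k][x // k] for x in range(len(ch[0]) * k)]
         for y in range(len(ch) * k)]
        for ch in fmap
    ]


def cross_layer_fuse(c1: FeatureMap, c3: FeatureMap, c5: FeatureMap,
                     w1: List[List[float]], w3: List[List[float]],
                     w5: List[List[float]]) -> FeatureMap:
    """Bring layers 1 and 5 to layer 3's resolution, then concatenate channels."""
    p1 = max_pool(conv1x1(c1, w1), 4)     # layer 1: stride 1 -> stride 4
    p3 = conv1x1(c3, w3)                  # layer 3: already at stride 4
    p5 = deconv_ones(conv1x1(c5, w5), 4)  # layer 5: stride 16 -> stride 4
    return p1 + p3 + p5                   # assumed fusion: channel concatenation
```

For a 16×16 input, the layer-1 map is 16×16, the layer-3 map is 4×4 and the layer-5 map is 1×1; after pooling and deconvolution all three are 4×4, so they can be stacked into a single fused feature map.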

The overall process of our method is shown in Fig. 1.

The rest of this paper is organized as follows. In Section 2, we briefly introduce the Faster R-CNN and RPN. Section 3 introduces the improved network structure based on the Faster R-CNN model and the weighted balanced multi-class cross entropy loss function. In Section 4, we describe the training process and the experimental contrast results. Section 5 gives the discussion and conclusion.

Section snippets

Faster R-CNN for object detection

Faster R-CNN uses a Region Proposal Network (RPN) [21] to replace selective search (SS) [22] for selecting candidate boxes, which greatly improves the detection speed. Faster R-CNN is widely used in object detection and recognition, but its detection accuracy for small objects and occluded objects needs to be improved. As shown in Fig. 2, neither occluded cars nor distant cars were recognized.

Visual geometry group network

The Visual Geometry Group (VGG) network is a deep convolutional neural network architecture [23], [24].

The improved faster R-CNN networks

In this paper, a cross-layer fusion multi-object detection and recognition algorithm is proposed. The five convolutional layers of VGG16 form the main architecture, and small convolution kernels of different dimensions are added to the hidden layers at layers 1, 3 and 5. After pairwise cross fusion, the feature map is extracted, and classification and localization are then performed by the RPN and ROI stages. The algorithm is divided into four parts. The first is the input image, which may be of any size and angle,

Experimental environment

Experiments are carried out in the Caffe (Convolutional Architecture for Fast Feature Embedding) software environment under Ubuntu 18.04. The hardware is an Intel Core i7-8700K CPU and an NVIDIA GeForce GTX 1070 Ti GPU with 8 GB of memory.

Training process

To verify the influence of multiscale fusion, the weighted balanced multi-class cross entropy loss function and Soft-NMS on detection performance, the training process in this paper is divided into the following six steps. Algorithm 4 shows the pseudocode of training.
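
The loss and suppression components named above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's Algorithm 4. The excerpt does not give the paper's weighting scheme, so `weighted_cross_entropy` is a generic class-weighted cross entropy with caller-supplied `class_weights`. `soft_nms` uses the Gaussian score decay of the published Soft-NMS method, and the values `sigma=0.5` and `score_thresh=0.001` are assumed defaults rather than the authors' settings. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def weighted_cross_entropy(probs: Sequence[Sequence[float]],
                           labels: Sequence[int],
                           class_weights: Sequence[float]) -> float:
    """Class-weighted multi-class cross entropy, normalised by total weight.

    probs[i][c] is the predicted probability of class c for sample i.
    """
    total = sum(class_weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(class_weights[y] for y in labels)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes: List[Box], scores: List[float],
             sigma: float = 0.5,
             score_thresh: float = 0.001) -> List[Tuple[int, float]]:
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of
    deleting them outright.

    Returns (original index, rescored confidence) pairs in selection order.
    """
    remaining = list(range(len(boxes)))
    scores = list(scores)  # work on a copy so the caller's list is untouched
    kept: List[Tuple[int, float]] = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        kept.append((best, scores[best]))
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)  # Gaussian penalty
        # Drop boxes whose decayed score has fallen below the threshold.
        remaining = [i for i in remaining if scores[i] >= score_thresh]
    return kept
```

As a usage example, take `boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]` with scores `[0.9, 0.8, 0.7]`. The second box overlaps the first with an IoU of about 0.82, so hard NMS with a 0.5 threshold would discard it. `soft_nms` instead keeps all three boxes and only lowers the second box's score, which is the behaviour that helps with partially occluded objects.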

Loss function of training process

In Fig. 4, the initial value of

Conclusions

In this paper, a VGG16-based improvement of Faster R-CNN was proposed for multi-object detection and recognition. Experimental results show that, compared with previous neural networks such as Fast R-CNN and Faster R-CNN based on the VGG16 template, the improved Faster R-CNN model integrates low-level and high-level image semantic features, which allows the model to acquire more feature information, so the positioning accuracy of object pixel features is improved, and the weighted multi-class

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant 61701060 and the Doctoral Talent Training Project of Chongqing University of Posts and Telecommunications under Grant BYJS202007.

References (27)

  • T. Kong et al. HyperNet: towards accurate region proposal generation and joint object detection. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)

  • Z.R. Zuo et al. Traffic signs detection based on faster R-CNN. Computer International Conference on Distributed Computing Systems Workshops (ICDCSW) (2017)

  • X.L. Wang et al. A-Fast-RCNN: hard positive generation via adversary for object detection. Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

Editor: Yuxin Peng.