Yolo-tiny-MSC: A tiny neural network for object detection

Object detection is an important and active area of computer vision. However, the high computational cost and memory requirements make it challenging to deploy object detection networks on resource-constrained embedded devices. To address this problem, we design a multi-stage cascade module that reduces the computational cost of object detection networks. The proposed structure is used to build a compact and efficient object detection neural network. Experiments show that our multi-stage cascade module significantly reduces the computational and parameter budgets while achieving outstanding object detection accuracy.


Introduction
Object detection is one of the most important and active areas of computer vision. Owing to recent technological breakthroughs in deep learning, many convolutional neural network (CNN)-based object detection methods have been proposed, and they have significantly improved detection accuracy. These methods tend to adopt complex structures to enhance the representation ability of networks, such as R-CNN [1], SSD [2], Fast R-CNN [3], and Faster R-CNN [4]. However, even though these CNN-based detectors achieve state-of-the-art accuracy, it remains challenging to deploy them on embedded mobile devices for practical applications. A major reason is that these networks possess huge numbers of parameters and require large amounts of computing resources, so their inference depends heavily on high-performance graphics processing units (GPUs). It is therefore difficult to deploy them on embedded devices with limited computational ability and memory. This problem impedes the widespread application of these high-performance networks in stand-alone terminal detection devices that require near-real-time detection, such as video surveillance and unmanned aerial vehicles. It is thus necessary to design compact, lightweight CNN-based object detection networks suitable for edge devices with limited computational resources.
In the last few years, many lightweight single-shot deep-learning-based detectors have been proposed to reduce computational cost. Among them, the tiny Yolo series is an attractive family that greatly reduces the model size and computational cost of object detection networks, and it has been extended into many variants [5], [6]. Yolov3-tiny [7], a lightweight and efficient object detection network, is well received in many areas, but its performance needs further improvement. Compared with yolov3-tiny, yolov4-tiny [8] achieves higher object detection accuracy but requires more computational operations. To find a fine balance between detection accuracy and computational cost, in this paper we propose a multi-stage cascade (MSC) module. The MSC module is composed of a series of convolutional layers, and the features of all convolutional layers are merged in a specific way. We replace the CSP (cross-stage partial) module [9] in yolov4-tiny with our MSC module to build a compact network. Comparison experiments suggest that the MSC module significantly reduces the computational operations and the number of parameters of the network while achieving superior object detection accuracy.

Method
The basic aim of our multi-stage cascade module is to achieve a trade-off between the object detection accuracy and the computational complexity of the neural network. The design strategy of our multi-stage cascade module is based on two ideas: 1) use a multi-stage cascade structure as the backbone, and 2) use progressive aggregation to merge features from different stages.

Multi-stage cascade structure
Our multi-stage cascade structure is inspired by the Inception module [10], [11]. In Inception, the input feature is processed by a combination of convolutional layers with different kernel sizes, and the outputs of these layers are concatenated into a single output. The naïve Inception module without the pooling path is shown in Fig. 1(a). However, convolutions with kernel sizes larger than 5 are computationally expensive, and this structure can lead to a computational blow-up. Furthermore, [11] suggests that convolutions with large spatial filters can be factorized into a sequence of 3×3 convolutional layers to relieve their disproportionate computational cost. The resulting reduced Inception module is shown in Fig. 1(b). As shown in Fig. 1(b), features f1 and f2 are both generated by processing the base input with a 3×3 convolutional layer; in terms of receptive field, f1 and f2 are therefore equivalent. It is redundant to deploy two different convolutional layers to produce features at the same receptive-field level. Based on this observation, our multi-stage cascade structure merges convolution operators at the same receptive-field level in different paths to reduce computational cost. The backbone of the two-stage cascade structure is shown in Fig. 1(c) (excluding the parts marked with dashed lines). In fact, this structure can be further extended by cascading more convolutional layers behind the base structure and then concatenating the outputs together.
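The cascade idea above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's exact module: the channel widths, the halving of filters in the second stage, and the choice of which features enter the concatenation are assumptions made for the example.

```python
# Sketch of a two-stage cascade backbone: the first 3x3 conv is shared,
# its output f1 both joins the final concat and feeds the second 3x3 conv.
# Channel widths here are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

class TwoStageCascade(nn.Module):
    def __init__(self, in_ch, mid_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 3, padding=1)        # produces f1
        self.conv2 = nn.Conv2d(mid_ch, mid_ch // 2, 3, padding=1)  # produces f2

    def forward(self, x):
        f1 = self.conv1(x)
        f2 = self.conv2(f1)   # same receptive-field level as a 5x5 conv on x
        return torch.cat([f1, f2], dim=1)  # merge the two stages

x = torch.randn(1, 32, 56, 56)
y = TwoStageCascade(32, 32)(x)
# y keeps the spatial size and has 32 + 16 = 48 channels
```

Reusing f1 for the second stage is what removes the redundant parallel 3×3 path of the reduced Inception module in Fig. 1(b).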
In order to balance the dimensions of the input and output, we adopt two strategies: first, decreasing the number of filters stage by stage; second, employing 1×1 convolutional layers to adjust the channels of the features in different paths. Through these two strategies, the dimension of the input is kept equal to the dimension of the output. As an example, the detailed structure of the basic four-stage cascade module is shown in Fig. 2(a). The parameter budget P of the MSC module with n stages can be formulated as Eq. (1), where c is the number of channels of the input feature. The limit of Eq. (1) is lim P = 20c^2/3 ≈ 6.67c^2. For comparison, the parameter budget of a commonly used 3×3 convolutional layer with the same input and output channels is 9c^2, which is 1.35 times that of the MSC module.
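The stated 1.35× ratio can be checked with a quick back-of-the-envelope computation. The limiting budget 20c^2/3 is taken from the text as given; the example channel count is an arbitrary assumption.

```python
# Verify the quoted ratio between a plain 3x3 convolution and the
# limiting parameter budget of the MSC module (20c^2/3 as stated above).
c = 64                      # example input channel count (assumption)
msc_limit = 20 * c**2 / 3   # limiting parameter budget of the MSC module
conv3x3 = 9 * c**2          # 3x3 conv with c input and c output channels
ratio = conv3x3 / msc_limit
print(round(ratio, 2))      # 1.35
```

The ratio 9 / (20/3) = 27/20 = 1.35 is independent of c, so the comparison holds for any channel width.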

Progressive aggregation
In the basic structure of the multi-stage cascade module, the features from different stages are concatenated directly. However, due to the channel-decreasing strategy, the number of feature channels at high stages is much smaller than at low stages. As a result, the proportions of features generated in each stage are unbalanced in the output.
In order to solve this problem, we design a PA (progressive aggregation) structure that fuses features from different stages gradually, starting from the bottom layer. In each fusion, only the outputs of two adjacent stages are concatenated, so the dimensions of the two features are equal in every fusion operation. A 1×1 convolutional layer is then applied to further integrate the fused features. As an example, the structure of the four-stage cascade module with PA is shown in Fig. 2(b).
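The pairwise fuse-then-integrate pattern can be sketched as follows. This is only an illustration of the aggregation order: the per-stage channel widths and the way the 1×1 convolutions match channel counts are assumptions for the example, not the paper's exact configuration.

```python
# Sketch of progressive aggregation: fuse adjacent stage outputs pairwise
# from the deepest stage upward, with concat + 1x1 conv at each step.
# Stage channel widths below are illustrative assumptions.
import torch
import torch.nn as nn

class ProgressiveAggregation(nn.Module):
    def __init__(self, stage_chs):
        # stage_chs: channels per stage, shallowest first, e.g. [32, 16, 8, 4]
        super().__init__()
        self.fuse = nn.ModuleList()
        running = stage_chs[-1]
        for ch in reversed(stage_chs[:-1]):
            # 1x1 conv integrates the concatenated pair and adjusts
            # the running feature to the next (wider) stage's width
            self.fuse.append(nn.Conv2d(running + ch, ch, 1))
            running = ch

    def forward(self, feats):
        out = feats[-1]  # start from the deepest (narrowest) stage
        for conv, f in zip(self.fuse, reversed(feats[:-1])):
            out = conv(torch.cat([out, f], dim=1))
        return out

stages = [torch.randn(1, c, 28, 28) for c in (32, 16, 8, 4)]
fused = ProgressiveAggregation([32, 16, 8, 4])(stages)
# fused has the width of the shallowest stage: (1, 32, 28, 28)
```

Because each fusion only touches two adjacent stages, no single stage's features dominate the final output, which is the imbalance the PA structure is designed to avoid.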

Experiment Results and Discussion
Based on the architecture of the yolo-tiny series, we design a lightweight object detection network by replacing the CSP modules in yolov4-tiny with our MSC module. To study the efficacy of our method, we examine its computational resource cost and object detection accuracy on the COCO [12] dataset. Yolov3-tiny and yolov4-tiny are used as the baseline models since they are both classical and popular compact neural networks for object detection. All baselines and our methods are trained from scratch on COCO with the Darknet neural network framework, and all evaluations are conducted with an input size of 416×416. The final results are shown in Table 1, with the best results highlighted in bold. As shown in Table 1, our method possesses about 5.5×10^6 parameters, roughly 0.62 times the size of yolov3-tiny and also smaller than yolov4-tiny. In terms of FLOPs, our method requires the fewest computational operations, about 0.78 times those of yolov4-tiny. This indicates that our method significantly reduces the computational and parameter budgets of the network, which is important for resource-constrained embedded devices.
In addition, despite using the fewest parameters and computations, our methods achieve outstanding object detection accuracy. The mAP of our method is about 8% higher than that of yolov3-tiny, and the network with the progressive aggregation strategy even performs slightly better than yolov4-tiny.