A Fast Vehicle Counting and Traffic Volume Estimation Method Based on Convolutional Neural Network

Vehicle counting and traffic volume estimation on traffic videos have gained extensive attention from the multimedia and computer vision communities. Recent methods, including detection-based and time-spatial image (TSI) based approaches, have achieved significant improvements. However, balancing accuracy and speed remains a challenge for this task. In this paper, we design a fast and accurate vehicle counting and traffic volume estimation method. Firstly, traffic videos are converted to TSIs and the vehicle locations in the TSIs are annotated manually. Then, we design a simple TSI density map estimation network that uses an attention mechanism to strengthen the features at traffic locations for vehicle counting. Finally, we use the trained vehicle counting network to further estimate the traffic volume. Experiments on the UA-DETRAC dataset demonstrate that the vehicle counting network not only strikes a balance between counting accuracy and speed, but also estimates the traffic volume well even when the video data is insufficient.


I. INTRODUCTION
As one of the tasks of intelligent video surveillance, vehicle counting and traffic volume estimation plays an important role in intelligent transportation, including vehicle management and regulation. Due to its broad application prospects, it has attracted increasing attention from researchers in the fields of multimedia and computer vision. In recent years, machine learning based methods have brought great improvements to vehicle counting and traffic volume estimation. However, this task still faces challenges such as occlusion and differences in vehicle appearance.
Many methods have been proposed to address these challenges in order to improve the accuracy of vehicle counting and traffic volume estimation. These methods can be divided into detection-tracking based methods and TSI based methods. The former first use a detection algorithm to detect the vehicles in each frame, then use a tracking algorithm to match vehicle trajectories, and finally obtain the number of vehicles in the video from the tracking trajectories. Popular detection techniques such as background subtraction and deep neural networks (DNN) [1], [2] are usually used to detect the vehicles, while tracking methods such as Deep Simple Online Realtime Tracking (Deep SORT) [3] and the Kanade-Lucas-Tomasi (KLT) tracker [4] are often used to track them. Detection-tracking based methods are usually highly complex and therefore require more time for vehicle counting and traffic volume estimation.
(The associate editor coordinating the review of this manuscript and approving it for publication was Nabil Benamar.)
TSI based methods first convert the video into a time-spatial image (TSI), which represents the vehicles passing through a certain region within a period of time, and then estimate the vehicle number from the TSI. Background subtraction is the most commonly used technique for this, with algorithms including Block-sparse Robust Principal Component Analysis (RPCA) [5], the self-adaptive sample consensus background model [6], and the morphology model [7]. These methods avoid a cumbersome processing pipeline, but may encounter problems such as mutual occlusion between vehicles, vehicle deformation, and vehicles not being fully displayed in the TSI, all of which reduce counting accuracy.
Inspired by crowd counting, a TSI based vehicle counting method [8] was proposed and evidently improved the accuracy of vehicle counting. This method first generates a TSI and then uses a neural network to generate a vehicle density map, from which the number of vehicles passing through the region in a certain period of time is counted. It not only avoids the cumbersome processing caused by detection and tracking, but also ensures counting accuracy. However, its several multi-scale modules complicate the counting network and thus drastically reduce counting speed.
In this paper, we design a vehicle counting and traffic volume estimation method based on TSI regression. This method first converts videos to TSIs, and manually labels the training data to obtain accurate TSI density map labels. Then we design a simple neural network to balance the accuracy and speed of vehicle counting. Finally, we use the trained network to carry out vehicle counting and traffic volume estimation. Experimental results demonstrate the effectiveness of the proposed method.
The main contributions of this work are summarized as follows: 1) We propose a novel vehicle counting and traffic volume estimation method that avoids complex detection and tracking operations while balancing accuracy and speed. The counting method converts a video into a TSI and generates a density map representing the vehicles in the TSI. As a result, vehicle detection and tracking are not required in the proposed method.
2) We design a simple base network for vehicle counting, which achieves accurate vehicle counting and traffic volume estimation with fewer parameters. The base network has 6 layers with at most 60 channels per layer, yet its counting accuracy exceeds that of most counting methods.
3) We introduce spatial and channel attention mechanisms into the vehicle counting network, which effectively improves counting accuracy without evidently increasing the parameter size. The spatial attention mechanism guides the network to focus on vehicle locations, and the channel attention mechanism strengthens the important feature channels. Together, the two attention modules add only a very slight number of parameters to the counting network.
The remainder of the paper is organized as follows. Section II presents related works on vehicle counting. Section III introduces the proposed fast vehicle counting and traffic volume estimation method. Section IV presents and discusses the experimental results. Finally, Section V concludes the paper.

II. RELATED WORKS
In this section, we briefly summarize existing vehicle counting methods in three groups: methods based on detection and tracking, methods based on TSI, and methods based on regression.

A. VEHICLE COUNTING BY DETECTION AND TRACKING
Detection-tracking based vehicle counting methods usually contain detection and tracking procedures. The number of vehicles that cross an area can be calculated from the tracking trajectories. These methods can be further divided into background subtraction based methods [9]-[11] and DNN based methods [1], [2], [12].
Background subtraction is used to build a background model and extract the moving vehicles in the video. In this process, morphological operations are usually applied to segment the vehicles, and the crossing vehicles are then counted. The Gaussian mixture model (GMM) [9] is often used to model the background. Abdelwahab et al. [11] applied the background model to only a narrow area of the video frame; morphological processing is carried out on the extracted targets to enhance them and reduce the influence of vehicle occlusion. El-Khoreby et al. [10] modeled the background with an adaptive threshold algorithm and then applied morphological operations to detect the vehicles. Since background subtraction needs to extract enough features for background modeling, it may fail when dealing with complex backgrounds.
The other dominant detection approach uses neural networks such as Faster Region CNN (Faster-RCNN) [13], You Only Look Once (YOLO) [14], and the Single Shot MultiBox Detector (SSD) [15]. High accuracy can be obtained with DNN frameworks. However, these detectors only count the vehicles within a single frame; a tracking method is still needed to count vehicles across the video. Such methods use tracking algorithms [3], [4] to associate the detected vehicles over continuous frames. Liu et al. [16] adopted a state-of-the-art object detector to detect vehicles from monocular video and then used a Kalman filter based method to track and count them. Kanade-Lucas-Tomasi (KLT) trackers are used to extract trajectories and count vehicles in [4]. Hoai et al. [3] proposed a comprehensive vehicle counting framework integrating the latest detection and tracking technologies, using YOLO for detection and a tracker based on Deep Simple Online Realtime Tracking (Deep SORT). Tracking in vehicle counting is a complex multi-target process, and its accuracy is often reduced under occlusion or clustering.
VOLUME 9, 2021

B. VEHICLE COUNTING BASED ON TSI
To avoid the cumbersome process caused by detection and tracking, TSI based methods have been proposed in recent years. These methods first fix a virtual line in each frame of the video and extract the pixels on this line, then combine the pixels extracted from N successive frames to obtain the TSI. A background subtraction method is then designed to detect vehicles in the TSI, giving the traffic flow during this period.
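The pixel-stacking step above can be sketched in a few lines. This is an illustrative sketch with made-up toy frame data, not the authors' implementation; a real pipeline would read video frames and choose the virtual line per camera.

```python
# Sketch of TSI generation: stack the pixel column on a fixed virtual line
# from N successive frames, in chronological order.

def build_tsi(frames, line_x):
    """frames: list of 2-D grids (rows of pixel values) for successive frames.
    Returns a TSI with one row per frame: the pixels on column line_x."""
    return [[row[line_x] for row in frame] for frame in frames]

# Toy example: 3 frames of a 2x4 grayscale image, virtual line at column 1.
frames = [
    [[0, 10, 0, 0], [0, 20, 0, 0]],
    [[0, 11, 0, 0], [0, 21, 0, 0]],
    [[0, 12, 0, 0], [0, 22, 0, 0]],
]
tsi = build_tsi(frames, line_x=1)
```

A vehicle crossing the line over several frames leaves a contiguous "traffic block" of foreground pixels in consecutive TSI rows, which is what the later modules detect or regress.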
Aiming at the low-rank property of time-spatial images generated by stacking pixels on a virtual line over time, Gao et al. [5] introduced the block-sparse robust principal component analysis algorithm, using motion cues to highlight the foreground and achieve high-precision counting. On city roads, however, the required downward-facing cameras are rare. Zhang et al. [6] used an adaptive sample consensus background model to form a foreground TSI and count the vehicles crossing the virtual line. Chen et al. [7] used a support vector machine (SVM) and a deterministic non-model method to remove shadows in the TSI. They detected the Region of Interest (ROI) through a simple morphological process and used an ROI accumulative curve method and Fuzzy Constraint Satisfaction Propagation (FCSP) to handle occlusion.

C. VEHICLE COUNTING BY REGRESSION
Inspired by methods in the field of crowd counting, regression based vehicle counting methods have also appeared in recent years. These methods regress an input image, or every frame of a video, to obtain the vehicle number. Regression based approaches can be further divided into two categories: approaches based on traditional machine learning and approaches based on deep neural networks (DNN). The former [17], [18] design feature extraction methods to extract useful features from the image and then regress these features to get the vehicle number.
Different from machine learning based methods, DNN based methods train an end-to-end network on the images and directly output the vehicle numbers. Zhang et al. [19] designed the FCN-RLSTM network, which combines a Fully Convolutional Network (FCN) with a Long Short-Term Memory network (LSTM) through residual learning. To estimate vehicle density maps and vehicle counts, they take advantage of the FCN for pixel-level prediction and the LSTM for learning complex temporal dynamics. FCN-RLSTM implements a refined feature representation and a novel end-to-end trainable mapping from pixels to vehicle counts. Different counting tasks are evaluated on three datasets, and the experimental results demonstrate the effectiveness and robustness of the method. Tayara et al. [20] introduced an automatic detection and counting system for aerial images of vehicles. The system uses a convolutional neural network to extract a spatial density map of vehicles from aerial images, and is evaluated on two public datasets, namely the Munich and overhead imagery datasets. Experimental results show that the system achieves higher accuracy and recall than the compared methods. The first contribution of [21] is a new convolutional neural network named Counting CNN (CCNN), which shows that an object density map can be estimated accurately and efficiently by learning to map the appearance of an image patch to its object density map. The second contribution of [21] is a scale-aware counting network, Hydra CNN, which estimates the object density in very crowded scenes without any geometric information about the scene. Hydra CNN provides a scale-aware solution that learns a nonlinear regressor to generate object density maps from a pyramid of multi-scale image patches.
Combining TSI with regression can not only avoid the cumbersome process of detection and tracking, but also ensure counting accuracy. Li et al. [8] made the first attempt to regress vehicle numbers from TSIs. They designed a TSI density estimation network based on VGG16 [22], from whose density map the number of vehicles can be obtained. In this network, the authors first extract basic features with VGG16 and then stack several multi-scale modules (MSMs) to extract further features. The network extracts stable features and achieves high accuracy, but the MSMs introduce a large number of parameters and limit the counting speed.
Inspired by [8], we design a fast vehicle counting and traffic volume estimation method. The detail of the proposed method is introduced in section III.

III. OUR METHOD
The fast vehicle counting and traffic volume estimation method proposed in this paper includes three modules: a TSI and corresponding density map generation (TSI-DM) module, a vehicle counting (VC) module, and a traffic volume estimation (TVE) module. The TSI-DM module mainly pre-processes the dataset: we manually mark the positions of traffic flows in the TSI and then generate the density map labels according to Eq. 1. In the VC module, we design a simple and effective density map estimation network and introduce attention mechanisms to guide the network to focus on the traffic locations. The TVE module estimates the traffic volume using the trained vehicle counting module. Figure 1 shows the framework of the approach in this article.

A. TSI-DM MODULE
We follow TM-Net [8] to generate TSIs from traffic videos. Firstly, we select a virtual line perpendicular to the road that all vehicles in the view will pass through. Then, the pixels on this line are captured from each frame of the video sequence. Finally, the TSI is obtained by stacking these pixel lines in chronological order, as shown in figure 2(a). To conduct supervised training, the coordinates of vehicles in the generated TSI must also be marked. Since the video is converted to a TSI, we no longer need to mark each vehicle in every single frame. Instead, we manually mark the traffic flows in the TSI. Specifically, we label the top-left coordinate (x_1^i, y_1^i) and the bottom-right coordinate (x_2^i, y_2^i) of the traffic block generated by each volume of vehicles passing by in the TSI. From these, we compute the center point of each traffic block as well as its width and height. We then generate the TSI density map label according to formula 1:

D(p) = sum_j delta(p - o_j) * G_{sigma_j}(p),   (1)

where o_j denotes the center point of the jth traffic block, w_j and h_j denote the width and height of the jth traffic block (which determine sigma_j), delta(·) is the impulse function, and G(·) is a Gaussian function. The labeled TSI and the corresponding density map generated according to this formula are shown in figure 2(b).
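The label generation described above can be sketched as follows. This is a minimal pure-Python illustration: the grid size, block annotations, and the mapping from block size to Gaussian spread (sigma = size/4) are assumptions for the example, not the paper's exact settings. Each block's Gaussian is normalized to sum to 1, so the density map integrates to the number of traffic blocks.

```python
import math

# Sketch of the density-map label: one normalized Gaussian per annotated
# traffic block, with spread tied to the block size (assumed sigma = size/4).

def density_map(h, w, blocks):
    """blocks: list of (cx, cy, bw, bh) traffic-block annotations."""
    dmap = [[0.0] * w for _ in range(h)]
    for cx, cy, bw, bh in blocks:
        sx, sy = bw / 4.0, bh / 4.0          # assumed size-dependent sigma
        weights, total = {}, 0.0
        for y in range(h):
            for x in range(w):
                g = math.exp(-((x - cx) ** 2 / (2 * sx ** 2)
                               + (y - cy) ** 2 / (2 * sy ** 2)))
                weights[(y, x)] = g
                total += g
        for (y, x), g in weights.items():
            dmap[y][x] += g / total          # each block contributes mass 1
    return dmap

# Hypothetical 20x20 TSI with two annotated traffic blocks.
dm = density_map(20, 20, [(5.0, 5.0, 4.0, 4.0), (14.0, 12.0, 6.0, 4.0)])
count = sum(sum(row) for row in dm)          # recovers the block count
```

Normalizing per block is what makes "sum of pixel values = vehicle count" hold later in formula 3.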

B. VEHICLE COUNTING MODULE
Deep neural network methods have been widely used for traffic [19]-[21] and crowd [23]-[25] density estimation in static images, but rarely for TSI based vehicle density estimation. In this paper, a simple and effective convolutional neural network is designed by comprehensively considering the accuracy and speed of vehicle counting. The network is shown in figure 1 in the vehicle counting module.
The vehicle counting network consists of three units: a basic feature extraction unit, a TSI density map generation unit, and an attention mechanism unit. Inspired by MCNN [24], the basic feature extraction unit uses convolution kernels of different sizes in three branches for multi-scale feature extraction, to handle the scale differences of traffic blocks caused by differences in vehicle size and speed. The three kernel sizes are 7 × 7, 5 × 5, and 3 × 3, and the resulting features are concatenated into F_cat. On F_cat, a channel attention unit guides the network to adaptively enhance the effective feature channels, producing F_ch-am. We then apply three further convolution layers and a spatial attention mechanism that guides the network toward the features where traffic blocks are located, giving F_s-am. On F_s-am, two more convolution layers generate the density map. Since the architecture contains two max-pooling layers, the output density map has 1/4 the width and height (1/16 the area) of the original image. Details of the network parameters, including kernel sizes, numbers of feature channels, and numbers of convolution and pooling layers, can be found in figure 1.
CAM and SAM in figure 3 denote the channel attention mechanism and spatial attention mechanism, respectively. Attention mechanisms have proven effective in natural language processing [26]-[29] and computer vision [30]-[33]. The channel attention mechanism strengthens the effective feature channels, while the spatial attention mechanism strengthens the effective locations in the feature map. We introduce these two mechanisms in detail below.
Among the convolution branches, the one with the large kernel attends to large receptive fields, while the one with the small kernel is better suited to small traffic flows. To strengthen the feature channels adaptively for different inputs, we introduce the Squeeze-and-Excitation network (channel-wise attention mechanism) [34] after multi-scale feature fusion. The setting of the channel attention mechanism is shown in figure 3. Firstly, global average pooling is applied to the merged features; then the weight of each channel is obtained through two fully connected layers and applied to the corresponding channel; finally, F_ch-am is obtained by adding the original features back. In addition, there are often green belts along the roads, resulting in complex backgrounds in the generated TSI. To reduce the influence of these backgrounds on counting, we introduce a spatial attention mechanism, whose setup is also shown in figure 3. Because the number of feature channels is small, we directly use a 3 × 3 convolution to obtain a probability map, which is then multiplied element-wise with the original features. After training, this probability map strengthens the feature values at effective spatial locations. Notably, neither the channel attention nor the spatial attention mechanism significantly increases the number of network parameters.
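The squeeze-excite computation described above (global average pool, two FC layers, sigmoid, channel re-weighting) can be sketched without any deep learning framework. The toy feature maps and weight matrices below are made-up illustration values, not trained parameters from the paper's network.

```python
import math

# Toy squeeze-and-excitation (channel attention) forward pass in pure Python.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(features, w1, w2):
    """features: list of C channels, each an HxW grid of floats.
    w1: reduction weights (C -> r), w2: expansion weights (r -> C)."""
    # Squeeze: global average pooling per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in features]
    # Excitation: FC -> ReLU -> FC -> sigmoid gives one weight per channel.
    hidden = [max(0.0, sum(s * w for s, w in zip(squeezed, col))) for col in w1]
    scales = [sigmoid(sum(h * w for h, w in zip(hidden, col))) for col in w2]
    # Reweight each channel by its learned importance.
    out = [[[v * s for v in row] for row in ch]
           for ch, s in zip(features, scales)]
    return out, scales

feats = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]  # C=2, 2x2 maps
w1 = [[0.5, 0.5]]                 # hypothetical 2 -> 1 reduction weights
w2 = [[1.0], [-1.0]]              # hypothetical 1 -> 2 expansion weights
out, scales = channel_attention(feats, w1, w2)
```

The sigmoid keeps every channel weight in (0, 1), so the unit can only re-scale channels, which is why it adds few parameters relative to the convolutional backbone.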
Training is supervised, and we use the Euclidean distance to measure the difference between the output and the label, as formula 2 shows:

L(Theta) = (1 / 2N) * sum_{i=1}^{N} ||F(X_i; Theta) - D_i||_2^2,   (2)

where N is the number of training samples, D_i is the ground-truth density map, and F is the function mapping the input X_i to the estimated density map with parameters Theta. A TSI is generated from m continuous frames over a period t, so the number of vehicles passing during t can be obtained by counting the traffic blocks in the TSI. The estimated number of traffic flows is the sum of pixel values in the estimated density map, as shown in formula 3:

C_e = sum_i D_e(i),   (3)

where C_e is the estimated count for the TSI and D_e(i) is the ith pixel of the estimated density map D_e.
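The loss of formula 2 and the count readout of formula 3 reduce to a few lines of code. The flattened density maps below are small made-up examples, used only to show the arithmetic.

```python
# Sketch of the Euclidean training loss (formula 2) and the count readout
# (formula 3) on flattened density maps.

def euclidean_loss(pred_maps, gt_maps):
    """Mean over N samples of half the squared L2 distance between maps."""
    n = len(pred_maps)
    total = 0.0
    for pred, gt in zip(pred_maps, gt_maps):
        total += sum((p - g) ** 2 for p, g in zip(pred, gt))
    return total / (2 * n)

def count_from_density(dmap):
    """Estimated vehicle count = sum of all density-map pixel values."""
    return sum(dmap)

pred = [0.2, 0.3, 0.5, 1.0]   # hypothetical flattened estimated density map
gt   = [0.0, 0.5, 0.5, 1.0]   # hypothetical ground-truth map
loss = euclidean_loss([pred], [gt])
count = count_from_density(pred)
```

Because each labeled traffic block contributes unit mass to the ground-truth map, the pixel sum of a well-trained estimate approximates the vehicle count directly, with no detection or tracking step.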

C. TRAFFIC VOLUME ESTIMATION MODULE
Besides vehicle counting statistics, traffic volume is an important indicator for evaluating traffic conditions for traffic scheduling. Traffic volume is the number of vehicles passing a certain road in an hour. In this paper, it is estimated from the number of traffic flows in the time period T output by the vehicle counting module. Given the data acquisition frequency f fps, the time corresponding to m frames is m/f seconds. The vehicle counting module estimates the number of vehicles crossing the line in m frames, denoted c_e^m. Traffic volume is then estimated as:

V = c_e^m * 3600 / (m / f) = 3600 * f * c_e^m / m.   (4)

As formula 4 shows, both the video sampling frequency and the number m of continuous frames used to generate the TSI affect the traffic volume estimate. In the experimental section, we discuss the influence of m on traffic volume estimation.
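Formula 4 is a simple rate extrapolation, sketched below with illustrative numbers (a count of 10 vehicles over 600 frames at 25 fps, values chosen for the example rather than taken from the paper's experiments).

```python
# Formula 4 as code: scale a count over m frames (at f fps) up to one hour.

def traffic_volume(count_m_frames, m, fps):
    """Estimated vehicles per hour from a count over m consecutive frames."""
    seconds = m / fps                      # time spanned by the TSI
    return count_m_frames * 3600.0 / seconds

v = traffic_volume(count_m_frames=10, m=600, fps=25)  # 600 frames = 24 s
```

With these numbers, 10 vehicles in 24 seconds extrapolates to 1500 vehicles per hour; larger m averages over more traffic and, as the experiments later show, stabilizes the estimate.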
TSIs are generated by stacking pixels from a line perpendicular to the vehicles' forward direction. If vehicles change lanes while crossing this line, their flows in the TSI will be deformed. For online application, the camera can be placed above a road segment with solid lane markings, where vehicles always move straight ahead without changing lanes. The video over a period of time is then converted to a TSI by extracting the pixels on the virtual line from every frame, and the vehicle counting module generates the density map that gives the vehicle number in the TSI.

IV. EXPERIMENTS
The proposed method is evaluated on the UA-DETRAC dataset [35]. This section first introduces the public dataset, experimental settings, and evaluation metrics, and then presents experimental results that verify the effectiveness of the proposed method.

A. UA-DETRAC DATASET
The UA-DETRAC dataset consists of 10 hours of videos captured with a Canon EOS 550D camera at 24 different locations in Beijing and Tianjin in 2015. The videos are recorded at 25 frames per second (fps) with a resolution of 960 × 540 pixels. When using this dataset for vehicle counting and traffic volume estimation, we follow [8] in selecting a subset of the data; some videos captured from a tilted view are excluded, and we select the same data as the compared algorithms. In this dataset, 11 videos are used, including 20011, 20033, 20035, 20051, 20061, 20062, and 20064.
To generate training data, we first generate the corresponding TSI for each video sequence and mark the position of each vehicle in the TSI. The density maps are then generated from these annotations. For the training set, we use image cropping and flipping for augmentation: patches of 200 × 200 pixels are cropped at random locations to train the vehicle counting network. This yields 6300 patches, of which 450 are randomly selected as the validation set and the rest form the training set. For testing, we input the whole image to the network. Since the counting network contains two pooling layers, the width and height of the density map are 1/4 of the original, so we resize the width and height of each test image to an integral multiple of 4.

B. EXPERIMENTAL SETTING
In the training procedure, the learning rate and momentum are set to 10^-5 and 0.9 for the Adam optimizer, and the batch size is set to 1. TSI generation, data annotation, and density map generation are implemented in MATLAB. The traffic estimation network is implemented with the PyTorch framework, and all experiments are conducted on a GeForce GTX TITAN X.

C. EVALUATION METRICS
To quantify the effectiveness of the vehicle counting module, we adopt the mean absolute error (MAE) and accuracy as measurement standards. MAE is defined as formula 5:

MAE = (1 / N) * sum_{n=1}^{N} |C_e^n - C_g^n|,   (5)

where N is the number of TSIs to be measured, and C_e^n and C_g^n are the estimated and marked numbers of traffic flows, respectively. Accuracy is calculated as formula 6:

Accuracy = (1 - (1 / N) * sum_{n=1}^{N} |C_e^n - C_g^n| / C_g^n) × 100%.   (6)

To further measure the accuracy of traffic volume estimation, we introduce the mean absolute percentage error (MAPE) to quantify the effectiveness of the traffic volume module. MAPE is formulated as formula 7:

MAPE = |GT - ES| / GT × 100%,   (7)

where GT is the actual traffic volume and ES is the estimated traffic volume.
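The three metrics can be sketched directly. Note that the exact form of the accuracy metric is an assumption here (one minus the mean relative error, a common choice in counting papers), and the example counts and volumes are made-up values.

```python
# Sketches of MAE (formula 5), accuracy (formula 6, assumed to be one minus
# the mean relative counting error), and MAPE (formula 7).

def mae(est, gt):
    """Mean absolute error over N test TSIs."""
    return sum(abs(e - g) for e, g in zip(est, gt)) / len(gt)

def accuracy(est, gt):
    """Assumed form: (1 - mean relative error) * 100%."""
    rel = sum(abs(e - g) / g for e, g in zip(est, gt)) / len(gt)
    return (1.0 - rel) * 100.0

def mape(est_volume, gt_volume):
    """Absolute percentage error of a traffic-volume estimate."""
    return abs(gt_volume - est_volume) / gt_volume * 100.0

est_counts, gt_counts = [95.0, 105.0], [100.0, 100.0]  # hypothetical values
m = mae(est_counts, gt_counts)
acc = accuracy(est_counts, gt_counts)
err = mape(950.0, 1000.0)
```

MAE is in vehicles, while accuracy and MAPE are percentages, which makes results comparable across videos with different traffic levels.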

1) ABLATION STUDY
This section demonstrates the effectiveness of the proposed network and the attention units. We mainly compare against [8], since that method is also based on TSI and uses DNN regression to count traffic flows. Li et al. [8] take VGG16 as the base network, so we also compare against VGG16 and ResNet50, two typical backbone networks. The comparison is based on the estimated results for video 40192, whose ground-truth vehicle number is 258; the experimental results are shown in table 1.
In terms of accuracy, the proposed base network performs better than VGG16: its MAE is 2.63 lower. Introducing the channel attention mechanism reduces MAE by a further 0.59. Adding the spatial attention mechanism then forms the full vehicle counting module proposed in this paper, whose MAE is significantly reduced by 4.31, though it is still 0.79 higher than TM-Net. The parameter size of our BaseNet is 0.56 × 10^5, which is 1/78 of VGG16. Li et al. [8] take VGG16 as the basic network but do not give the specific parameters of the subsequent structure, so their network parameter count exceeds 44 × 10^5. The testing times also demonstrate the speed advantage of the proposed network; introducing CAM and SAM increases neither the parameter count nor the testing time significantly. The proposed vehicle counting method thus achieves good performance at a faster speed, effectively balancing the accuracy and speed of vehicle counting.

2) COMPARISON WITH OTHER METHODS
In order to verify the effectiveness of the proposed vehicle counting method, we compare it with several typical algorithms, including tracking-detection based methods, TSI based background subtraction method and DNN based method. We conduct different methods on video 40192 and the comparison results are shown in table 2.
In table 2, the first three methods count vehicles based on detection and tracking: they first detect vehicles in every frame of the video, use a tracking algorithm to obtain the trajectories, and then count the trajectories that cross the virtual line as the vehicle number. The results of these methods depend heavily on the detection algorithm, and processing every frame slows down counting. In [11], Abdelwahab et al. designed a background subtraction method on TSI for vehicle counting. The authors model the background of a rectangular area perpendicular to the traffic line and extract the foreground objects of this area; a whole vehicle is counted by measuring the common area of foreground objects between the nth frame and the previous one. We applied this algorithm to the UA-DETRAC dataset and found it slightly better than the detection-tracking based methods in both accuracy and speed. More recently, DNN regression based counting algorithms [20], [21] applied to TSI have outperformed FastBS [11]. We converted videos to TSIs and applied these methods to regress the TSI density map. CCNN [21] has a speed advantage but low accuracy, while FCRN [20] uses an encoder-decoder to estimate the density map and significantly improves counting accuracy over CCNN [21]. Both the proposed method and TM-Net perform vehicle counting based on TSI density map estimation. Compared with TM-Net, the proposed method has two advantages. Firstly, TM-Net labels the vehicles in the training TSIs with a detection-and-tracking algorithm, so detection and tracking errors affect the training labels; we instead label the training TSIs manually to guarantee accurate labels.
Secondly, we design a shallow vehicle counting network that achieves good counting performance at high speed. The accuracy of our method is slightly worse than that of TM-Net, but in terms of parameter count and testing time, the proposed method has a clear speed advantage.
To show the effectiveness of the proposed method more intuitively, we present the vehicle counting and traffic volume estimation results of all test videos in table 3, using the evaluation metrics introduced in section IV: the MAE and accuracy of vehicle counting are obtained from formulas 5 and 6, and MAPE is calculated from formula 7. Figure 4 shows the TSIs generated from the test videos and the density estimation results based on them; the ground-truth density maps are also shown for visual comparison. Because TSI sizes differ greatly, we scale them proportionally to show the results more clearly. As can be seen from figure 4, the proposed vehicle counting method effectively focuses on the vehicle flows while filtering out the backgrounds.

E. VOLUME ESTIMATION RESULTS
In practical application, the traffic volume may need to be estimated at any time. Firstly, a TSI is generated from m consecutive frames, and its density map is estimated by the trained vehicle counting network. The pixel values of the density map are then summed to obtain the traffic flow within the time represented by the m frames. Finally, the traffic volume, i.e. the number of vehicles passing in an hour, is predicted from the output of the vehicle counting network and m according to formula 4.
The method in [8] estimates the traffic volume by dividing the whole video into parts with stride m, estimating the vehicle number of every part, and summing the results; in other words, it always uses the whole video to predict the traffic volume, even for different m. In contrast, we use only the first m frames to estimate the traffic volume, which better measures the prediction ability of the volume estimation module. We take m from 200 to 1000 and conduct experiments on videos 40191 and 40192 to study the influence of m; the results are shown in table 4. As table 4 shows, the accuracy of traffic volume estimation improves steadily as m increases. The data acquisition frequency in the UA-DETRAC dataset is 25 fps. When m = 200, the MAPE of the two test videos exceeds 10%, indicating that too few continuous frames are insufficient for accurate traffic volume estimation. When m = 600, the MAPE drops to between 2% and 4%, indicating that the proposed traffic volume estimation method still performs well when the data is insufficient.

V. CONCLUSION
In this paper, we proposed a fast and effective vehicle counting and traffic volume estimation method. The method first generates TSIs from traffic videos and labels the traffic locations in the TSIs. The traffic numbers are then estimated from the TSIs, which avoids the complicated operations of detection-tracking methods. In addition, a simple and effective neural network for vehicle flow density map estimation is designed to achieve accurate counting without excessive parameters; the proposed density estimation network effectively balances the speed and accuracy of vehicle counting. On this basis, we use continuous video over a period of time to estimate the traffic volume, which can provide more comprehensive traffic condition information for intelligent traffic scheduling. However, the proposed method cannot handle well the case where vehicles change lanes while crossing the line from which TSI pixels are extracted. In the future, we will explore how to handle severe deformation of vehicle flows in TSIs and further improve accuracy to make the counting algorithm more widely applicable.

SHUANG LI received the Ph.D. degree in pattern recognition and intelligent systems from Shandong University, Jinan, China, in 2021. Her current research interests include automatic target detection and recognition, machine learning, deep learning, and traffic flow parameter estimation.

XIANGLIN DAI received the B.S. degree in computer science and technology from Jilin University and the M.S. degree in computer engineering from the Ocean University of China. His current research interests include pattern recognition and image processing.