Improved vehicle detection systems with double-layer LSTM modules

Vision-based smart driving technologies for road safety are popular research topics in computer vision. Precise moving-object detection with continuous tracking capability is one of the most important vision-based technologies today. In this paper, we propose an improved object detection system, which combines a typical object detector with long short-term memory (LSTM) modules to further improve detection performance for smart driving. First, starting from a selected object detector, we combine all vehicle classes into one and bypass low-level features to improve its detection performance. After spatial association of the detected objects, the outputs of the improved object detector are fed into the proposed double-layer LSTM (dLSTM) modules, which improve the detection performance for vehicles in various conditions, including newly-appeared, stably-detected and gradually-disappearing vehicles. With stage-by-stage evaluations, the experimental results show that the proposed vehicle detection system with dLSTM modules can precisely detect vehicles without increasing computation.


Related work
The YOLO approaches have higher speed performance than the SSD method. Recently, improved versions of YOLO with better accuracy and speed, namely YOLOv2, YOLOv3 and YOLOv4, have been proposed in [30][31][32]. The LSTM technique, which evolved from the recurrent neural network (RNN) for speech and language modeling [33, 34], can solve the vanishing gradient problem and model self-learned context information. By using temporal information, the LSTM module has been effectively used for precise 3D pose estimation [35] and single-object tracking [36, 37]. The long short-term memory (LSTM) extends the RNN by adding forget, input and output gates. As shown in Fig. 1, the cell state vector c_t and the output hidden state vector h_t of the LSTM module can be respectively computed by

c_t = f_t * c_{t−1} + i_t * c′_t, (1)

h_t = tanh(c_t) * o_t, (2)

where f_t, i_t and o_t are the activations of the tth forget, input and output gates, which are respectively given as

f_t = σ(w_f [h_{t−1}, x_t] + b_f), (3)

i_t = σ(w_i [h_{t−1}, x_t] + b_i), (4)

o_t = σ(w_o [h_{t−1}, x_t] + b_o), (5)

and the updated cell state vector is obtained by

c′_t = tanh(w_g [h_{t−1}, x_t] + b_g). (6)

In (3)-(6), the weights w_ρ and the offsets b_ρ for ρ = f, i, o and g need to be trained for the best connections of the previous hidden vector h_{t−1} and the current input vector x_t. As stated in (3)-(5), the gating activations are squashed into a range between 0 and 1 by the sigmoid function σ(·). A gating activation describes the ratio of the multiplied vector components being passed through the gate: "0" means the gate passes nothing, while "1" means all components pass. As exhibited in (6), the updated cell state vector c′_t is regulated by the tanh function. As shown in (1), the current cell state vector c_t is a combination of the previous cell state vector controlled by the forget gate and the updated cell state vector controlled by the input gate. Finally, as stated in (2) and (6), the vector components are regulated by the tanh function into a range between −1 and +1.
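The update in (1)-(6) can be sketched with NumPy. The weight and bias dictionaries `w` and `b` (keyed by ρ = f, i, o, g) are hypothetical placeholders for trained parameters, not values from this paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w, b):
    """One LSTM step following (1)-(6): gate activations from the
    concatenation [h_{t-1}, x_t], then the cell-state update and the
    tanh-squashed hidden output."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(w["f"] @ z + b["f"])       # forget gate, Eq. (3)
    i_t = sigmoid(w["i"] @ z + b["i"])       # input gate, Eq. (4)
    o_t = sigmoid(w["o"] @ z + b["o"])       # output gate, Eq. (5)
    c_upd = np.tanh(w["g"] @ z + b["g"])     # updated cell state c'_t, Eq. (6)
    c_t = f_t * c_prev + i_t * c_upd         # Eq. (1)
    h_t = np.tanh(c_t) * o_t                 # Eq. (2)
    return h_t, c_t
```

Because both the output gate and the cell state are squashed, every component of h_t stays strictly inside (−1, +1), which keeps the recurrence numerically stable over long sequences.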
To include temporal information, LSTM modules can be added to any layer of a detection network. In order to better track multiple vehicles, we adopt a modified long short-term memory (LSTM) module with a dual-layer and multi-stage structure. Instead of attaching the LSTM modules to inner layers, which have larger feature maps, we let them directly track the object detection outputs to simplify the computation. It is noted that the proposed LSTM modules can be applied to any of the latest detection methods. Without loss of generality, we choose YOLOv2 as the initial object detector, which is improved step by step to accomplish the performance improvements in this paper. Figure 2 exhibits the conceptual diagram of the proposed object detection and tracking system, which includes two subsystems: the improved YOLO (iYOLO) object detector and the double-layer LSTM (dLSTM) object refiner. After the iYOLO detector, the dLSTM refiner takes T consecutive outputs of the iYOLO to refine the final prediction. Before the dLSTM refiner, however, we need to spatially order the iYOLO outputs, which are the bounding boxes and confidences of the detected objects, to correctly characterize their spatial associations. After the spatial association, the dLSTM object refiner performs the final refinement of the iYOLO outputs. As shown in the top part of Fig. 2, the iYOLO object detector, the multi-object spatial association, and the dLSTM refiner are detailed in the following subsections.

The iYOLO object detector
The proposed object detection system, as shown in the top part of Fig. 2, first resizes the images to 416 × 416 or 448 × 448 as the inputs of the improved YOLO (iYOLO) object detection network. For performance improvement and computation reduction, the proposed iYOLO object detector is designed to classify 30 on-road moving object classes with one combined vehicle class that merges the car, bus and truck classes. The iYOLO also combines low- and high-level features to detect the objects. The details of the data representation, network structure, and loss functions of the iYOLO are stated as follows.
For moving object detection, we need to predict not only the locations and box sizes of the detected objects but also their classes. Therefore, in the iYOLO, the output data is a three-dimensional array of size 14 × 14 × D, where D denotes the number of channels carrying the detection and classification information. The 14 × 14 array corresponds to the grid cells of the image; thus, there are 196 grid cells in total. Each grid cell, as shown in Fig. 3, contains five bounding boxes, called "anchor boxes".
Each grid cell contains D elements, which carry the positions, confidences and class information, where D is given by

D = B × (5 + M), (7)

where M is the number of classes and B denotes the number of anchor boxes used for detection in a single grid cell. As shown in Fig. 4, each grid cell contains B bounding boxes while each bounding box comprises (5 + M) parameters. For the bth box, we have 5 parameters: its bounding box {x_b, y_b, w_b, h_b} and its occurrence probability P_b. The bounding box is defined by its center coordinates x_b and y_b and its width w_b and height h_b, respectively. In Fig. 4, C^b_i denotes the conditional probability that the bth box contains the ith class:

C^b_i = Pr(class_i | object), (8)

for 1 ≤ i ≤ M. Therefore, the probability of the ith object appearing in the bth box is given by

Pr(class_i) = C^b_i P_b, (9)

where P_b denotes the occurrence probability of the bth box. If C^b_i P_b passes a pre-defined threshold, we consider an object of the ith class to exist in the bth bounding box and to be detected by the proposed iYOLO detector.
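The class-score thresholding rule above can be illustrated with a short sketch for one anchor box; the function name and the threshold value are illustrative only:

```python
import numpy as np

def decode_box(class_cond_probs, box_conf, threshold=0.5):
    """For one anchor box: multiply the conditional class
    probabilities C_i^b by the occurrence probability P_b, then
    report which classes pass the detection threshold."""
    scores = np.asarray(class_cond_probs) * box_conf
    detected = np.flatnonzero(scores > threshold)
    return scores, detected
```

For example, with conditional class probabilities [0.9, 0.05, 0.05] and a box occurrence probability of 0.8, only the first class score (0.72) passes a 0.5 threshold.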

Network structure of iYOLO detector
The proposed iYOLO network structure, as shown in Fig. 5, is composed of several stages of convolutional and max-pooling layers. The convolutional layers [38] with batch normalization [39] mostly use 3 × 3 or 1 × 1 convolutions. The pooling layers perform direct down-sampling with stride 2. In this paper, we merge the car, truck and bus classes into a single vehicle class; thus, the number of output classes of the iYOLO is reduced to 30. As shown in Fig. 5, we eliminate three sets of two repeated 3 × 3 × 1024 convolution layers compared to YOLOv2. The reason for decreasing the high-level layers is that we can reduce the computation, since we use the vehicle class to represent all types of cars. The more high-level layers we remove, the less complex the model becomes.
In order to enhance the performance, the proposed iYOLO further includes two low-level features to help the final detection. As marked by the thick red lines and green-box functions in Fig. 5, we concatenate the outputs of the 12th layer (after the green-boxed max-pooling) and the 17th layer (after the green-boxed 3 × 3 × 512 convolution) with the final feature. To keep the size of the low-level features the same as that of the high-level feature, we introduce two convolution layers, 3 × 3 × 128 and 1 × 1 × 64 for the first low-level feature and 3 × 3 × 256 and 1 × 1 × 64 for the second, and reorganize their features into half of the original resolution in the marked functions before the concatenation.

The proposed iYOLO outputs the bounding box location, classification and confidence results simultaneously. Therefore, the prediction of the module involves three losses: (1) the location loss of the bounding box g = {x, y, w, h}, where (x, y) denotes the center position while w and h respectively represent the width and height of the bounding box; (2) the classification loss defined by the conditional probability for a specific class, p_s(i); and (3) the confidence loss related to the probability P_s(o) that an object exists in the sth grid cell. The total loss function is given by

L(f, c, g, ĝ) = L_loc + L_con + L_cls, (10)

where f is the input image data, c is the class confidence, and g and ĝ denote the predicted and ground-truth boxes, respectively. As stated in (10), the total loss is composed of the location, confidence and class loss functions, balanced by the weighting factors λ_loc, λ_obj, λ_noobj, and λ_cls, respectively. These loss functions are described as follows.
The location loss L_loc is given as

L_loc = λ_loc Σ_{s=1}^{S} Σ_{b=1}^{B} α^o_{s,b} ||g_{s,b} − ĝ_{s,b}||², (11)

where g_s = {x_s, y_s, w_s, h_s} and ĝ_s = {x̂_s, ŷ_s, ŵ_s, ĥ_s} are the predicted and ground-truth bounding boxes, and S and B denote the numbers of grid cells and anchor boxes, respectively. In (11), λ_loc represents the location weighting factor and α^o_{s,b} indicates the responsibility for the detection of the sth grid cell with the bth box: if the bounding box passes the intersection-over-union (IoU) threshold of 0.6, the box-specific index α^o_{s,b} is 1; otherwise α^o_{s,b} is 0. The confidence loss L_con is expressed as

L_con = λ_obj Σ_{s=1}^{S} Σ_{b=1}^{B} α^o_{s,b} (P_{s,b} − P̂_{s,b})² + λ_noobj Σ_{s=1}^{S} Σ_{b=1}^{B} (1 − α^o_{s,b}) (P_{s,b} − P̂_{s,b})², (12)

where the first term is the bounding-box confidence loss for the object case while the second term denotes the confidence loss for the no-object case. In (12), λ_obj and λ_noobj express the confidence weighting factors for the object and no-object cases, respectively. The location loss values are only valid for responsible bounding boxes with α^o_{s,b} = 1, since non-responsible bounding boxes have no ground-truth label.
The classification loss L_cls is given as

L_cls = λ_cls Σ_{s=1}^{S} Σ_{b=1}^{B} α^o_{s,b} Σ_{i=1}^{M} (p_s(i) − p̂_s(i))², (13)

where p_s(i) denotes the ith-class confidence in the grid cell of the specific anchor box and λ_cls denotes the classification weighting factor. If the bounding box does not contain any object, the predicted probability should decrease toward 0; on the contrary, if the box contains an object, the predicted probability should be pushed toward 1.

Before discussing the proposed dLSTM object refiner, we should properly associate the outputs of each detected object of the iYOLO according to its spatial position. The spatial association of the temporal information of multiple objects is designed to collect all the outputs of the same physical object in a spatial-priority order to become the time-series inputs of the dLSTM object tracking modules.
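As a rough illustration, the location, confidence and classification losses above can be evaluated over S·B flattened boxes with a toy sketch; it assumes plain squared errors (omitting the square-root parameterization of widths and heights used in some YOLO-style losses), and all tensor shapes are illustrative:

```python
import numpy as np

def yolo_losses(pred_box, true_box, pred_conf, true_conf, pred_cls, true_cls,
                resp, lam_loc=5.0, lam_obj=1.0, lam_noobj=0.5, lam_cls=1.0):
    """Toy total loss over N = S*B flattened boxes.
    resp is the responsibility mask alpha^o_{s,b} (1 if IoU > 0.6, else 0);
    boxes are (N, 4), confidences (N,), class probabilities (N, M)."""
    # location loss, only for responsible boxes
    l_loc = lam_loc * np.sum(resp[:, None] * (pred_box - true_box) ** 2)
    # confidence loss, split into object / no-object terms
    l_con = (lam_obj * np.sum(resp * (pred_conf - true_conf) ** 2)
             + lam_noobj * np.sum((1 - resp) * (pred_conf - true_conf) ** 2))
    # classification loss, only where a box is responsible
    l_cls = lam_cls * np.sum(resp[:, None] * (pred_cls - true_cls) ** 2)
    return l_loc + l_con + l_cls
```

When every prediction matches its ground truth, all three terms vanish and the total loss is zero, which is a quick sanity check for an implementation.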

Multi-object association
To make a good association of a series of outputs for each detected object, we need to design a proper association rule to construct a detected data array as the input of the dLSTM object refiner. Usually, the iYOLO presents detection results in confidence-priority order, which is not a robust index for object association since confidences vary from frame to frame. Therefore, we utilize the close-to-the-car distance as the object association index, since any on-road moving object physically travels around its nearby area smoothly. For the ith detected object with its bounding box, the association should be based on the spatial positions in the image frame. We set the left-upper corner of the frame as the origin (0, 0) and the right-lower corner as (W − 1, H − 1), where W and H are the width and height of the frame, respectively. The position (W/2, H), which is used to measure the closeness of a detected vehicle to the driving car, is set as the reference point. The closer a detected object is to the reference point, the more dangerous it is to the car. Thus, if the bounding box of a detected object is closer to the reference point, the object is more important and we should give it a higher priority. The priority of the detected object is spatially ordered by a priority-regulated distance to the reference point as

d_i = ρ x^d_i + (1 − ρ) y^d_i, (14)

where the horizontal distance between the ith bounding box and the reference point is given as

x^d_i = min_{x_{i,1} ≤ x ≤ x_{i,2}} |x − W/2|, (15)

and y^d_i = H − y_{i,2} is the vertical distance from the bottom of the bounding box to the bottom of the frame, with (x_{i,1}, y_{i,1}) and (x_{i,2}, y_{i,2}) denoting the top-left and bottom-right corners of the box. Equation (15) finds the minimum horizontal displacement of any point in the bounding box to the reference point. If x^d_i is smaller, the bounding box is horizontally closer to the reference point; if y^d_i is smaller, the bottom of the bounding box is closer to the bottom of the frame.
After computing the priority-regulated distances of all detected objects, the object indices are determined by sorting the priority-regulated distances: a smaller priority-regulated distance is given a higher order, i.e., a smaller priority index. For ρ = 0.5, Fig. 6 shows the order of the object confidences and the priority order of the priority-regulated distances between the reference point and the detected objects. The order of the objects under the regulated distances is spatially stable, since the spatial positions of real objects do not change too quickly. Even if an object moves horizontally and occludes other objects, the tracked objects remain reasonable and stable, since with one combined vehicle class we do not care about the occluded ones. We can then focus on the objects that are geometrically close to the driving car and give them higher priorities for tracking.
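A minimal sketch of the priority ordering under (14), assuming boxes given as (x1, y1, x2, y2) corners and taking the vertical term as the distance from the box bottom to the frame bottom:

```python
def priority_order(boxes, W, H, rho=0.5):
    """Rank detected boxes by the priority-regulated distance
    d_i = rho * x_d + (1 - rho) * y_d to the reference point (W/2, H).
    Boxes are (x1, y1, x2, y2); returns box indices, closest first."""
    dists = []
    for (x1, y1, x2, y2) in boxes:
        if x1 <= W / 2 <= x2:
            x_d = 0.0                          # box straddles the reference column
        else:
            x_d = min(abs(x1 - W / 2), abs(x2 - W / 2))
        y_d = H - y2                           # box bottom to frame bottom
        dists.append(rho * x_d + (1 - rho) * y_d)
    return sorted(range(len(boxes)), key=lambda i: dists[i])
```

A box centered on the lane of the driving car and near the bottom of the frame gets distance close to zero and therefore the first priority index.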

LSTM refiner
After determining the priority order of the detected objects, we collect the bounding boxes of each object as a 2D data array following the same priority order; for example, the data array for the first-priority object is denoted with superscript (1). For simplicity, we ignore the priority-order index below. For each detected object at instant T, we collect an array of bounding boxes as

L = [X_1, X_2, …, X_T], (16)

where

X_t = [x_{t,1}, y_{t,1}, x_{t,2}, y_{t,2}]^T (17)

denotes the positions of the top-left corner (x_{t,1}, y_{t,1}) and the bottom-right corner (x_{t,2}, y_{t,2}) of the bounding box at the tth instant. After collecting X_t for T consecutive samples, Fig. 7 shows the 2D time-series data array of bounding boxes for each detected object. With the 2D data array for each object, the double-layer LSTM (dLSTM) is designed to reduce unstable bounding boxes. We might use a longer LSTM module to achieve better performance; however, it would increase unnecessary tracking delay. As shown in Fig. 8, the double-layer LSTM (dLSTM) refiner contains K-element hidden state vectors over T time instants. The fully connected layer takes h^(2)_T, the Tth hidden state vector of the second LSTM layer, and outputs the 4-point predicted position of the bounding box, X̂. As stated in (17), the dLSTM network inputs a series of bounding box data X_t for t = 1, 2, …, T. In Fig. 8, c^(l)_t and h^(l)_t denote the cell and hidden states of the lth layer at the tth time step of the dLSTM model, respectively. To make the model deeper for more accurate learning [30, 31], we stack two LSTM layers to achieve better time-series prediction while avoiding long delay. The first LSTM layer inputs the location data array L in chronological order and generates the K-dimension hidden state h^(1)_t and the K-dimension cell state c^(1)_t. The hidden state features of the first LSTM layer then become the inputs of the second LSTM layer.
The second LSTM layer returns only the last step of its output sequence, dropping the temporal dimension. Finally, the fully connected layer following the second LSTM layer interprets the K-dimension feature vector as the predicted location X̂ learned by the dLSTM module.
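The dLSTM refiner of Fig. 8 can be sketched in Keras (the framework used in our experiments); the hidden size K below is a placeholder, not a value reported in this paper:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dlstm(T=10, K=64):
    """Sketch of the dLSTM refiner: T steps of 4-D box vectors,
    two stacked K-unit LSTM layers (the second returns only its last
    hidden state h^(2)_T), then a dense layer predicting 4 box coords."""
    inputs = keras.Input(shape=(T, 4))                  # T bounding boxes X_t
    h = layers.LSTM(K, return_sequences=True)(inputs)   # first LSTM layer
    h = layers.LSTM(K)(h)                               # second layer, last step only
    outputs = layers.Dense(4)(h)                        # refined bounding box
    return keras.Model(inputs, outputs)
```

Setting `return_sequences=True` on the first layer passes the full hidden-state sequence h^(1)_1, …, h^(1)_T to the second layer, while the default on the second layer drops the temporal dimension as described above.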
To train the dLSTM module, we use the IoU-location loss, which combines the intersection-over-union (IoU) and position mean-square-error (MSE) losses as

L_IoU-loc = −α log(IoU(X, X̂)) + β MSE(X, X̂), (18)

where X̂ and X denote the locations of the predicted and ground-truth bounding boxes, respectively. In (18), the IoU, which represents the ratio of the intersection to the union of the predicted and ground-truth bounding boxes, satisfies 0 < IoU ≤ 1; thus, we use −log(IoU) as the first loss term. In addition, the mean square error (MSE) of X̂ and X is used as the coordinate loss term. To balance the IoU and MSE losses, we need to select α and β for a good combination. It is noted that after the training process, dLSTM refiners with the same weights are used for all detected objects in this paper.
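A sketch of the IoU-location loss for a single box pair follows; the small `eps` guard against log(0) is an implementation detail, not part of the loss definition:

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def iou_location_loss(pred, truth, alpha=1.0, beta=0.5, eps=1e-7):
    """-alpha * log(IoU) plus beta-weighted coordinate MSE."""
    mse = float(np.mean((np.asarray(pred) - np.asarray(truth)) ** 2))
    return -alpha * np.log(iou(pred, truth) + eps) + beta * mse
```

A perfect prediction gives IoU = 1, so both terms vanish; as the boxes drift apart, the −log(IoU) term grows quickly and dominates the loss.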

Object tracking status
For each detected object, as shown in the bottom part of Fig. 2, we need to constantly monitor the object tracking condition, which could be a newly-appeared, tracked, or disappeared object. The tracking status helps to control the dLSTM refiner correctly. Besides the bounding boxes, we also need to adopt the occurrence and conditional probabilities, as shown in the bottom part of Fig. 2, for effective object tracking. We assume that we have already initiated P dLSTM refiners to track P priority objects. For each detected object at instant T, we collect the tracking information {X_t, p(i*), i*} from the iYOLO, where i* carries the index of the top-i class and p(i*) = C^b_{i*} P_b records the confidences of the top five classes. As shown in Fig. 9, there are four possible conditions of confidence plots in practical applications, where Th_p denotes the detection threshold of confidence. With all outputs of the detected object collected from the iYOLO, the status of the dLSTM refiner can be determined as follows. At instant T + 1, for a detected object whose confidence estimated by the iYOLO is higher than the threshold, we first need to distinguish whether the object is newly-appearing (red solid line) or stably-tracked (red dashed line), as shown in Fig. 9. With the same priority index, we check the IoU of the bounding boxes of the object obtained at T and T + 1. If the IoU of the two consecutive bounding boxes is greater than a threshold and the object is of the same class, we determine this object to be stably tracked. In this case, we update the collected information by left-shifting the data array as

X_t = X_{t+1}, for t = 1, 2, …, T, (19)

and update the top five confidences.
If the IoU of the current and previous bounding boxes is lower than the threshold or the class index is different, we treat it as a newly-appeared object. In this case, we initialize a new dLSTM refiner and a new data array filled with the same X_{T+1} as

X_t = X_{T+1}, for t = 1, 2, …, T. (20)

We also store the new set of top five confidences with the same {p(i*), i*}, for i = 1, 2, …, 5. The newly-appeared object becomes a tracked object in the following iterations. For both tracked and newly-appeared cases, we entrust the detection to the iYOLO: once the iYOLO yields a positive confidence greater than the threshold, the object must be active. It is noted that the dLSTM refiner does not change the detection performance for the stably-tracked (red dashed line) and newly-appearing (red solid line) objects; for these two cases, the miss-detected (MD) counter shown in the bottom part of Fig. 2 is reset to zero and the dLSTM refiners are used only to refine the tracked bounding boxes. Figure 9 shows the confidence plots of gradually-appearing, stably-tracked, unstably-detected and gradually-disappearing objects.
For a dLSTM refiner, we hope not only to improve the accuracy of the bounding boxes but also to raise the detection performance for the gradually-disappearing (blue dashed line) and unstably-tracked (blue solid line) conditions shown in Fig. 9, in which the dLSTM refiner receives miss-detection information from the iYOLO. Based on the fact that objects do not disappear suddenly, we do not turn off the dLSTM refiner at once when a miss-detection happens. To improve the detection performance, we further design a miss-detected (MD) counter, which is reset to zero whenever the iYOLO actively detects the object with sufficient confidence and a bounding box. Once the detected object does not have sufficient confidence at some instant, we increase the MD counter of the tracked object by one until MD exceeds the miss-detection threshold N_mis. When MD ≤ N_mis, the tracking system still treats the object as tracked, but in the "unstably-detected" condition; for any unstably-detected object, the system outputs the prediction of the dLSTM refiner as compensation, i.e., X_{T+1} = X̂_T. Once MD > N_mis, the tracking system deletes the object and stops the dLSTM refiner thereafter. In general, we can thus improve the detection performance for the unstably-detected and gradually-disappearing conditions shown in Fig. 9.
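The MD-counter logic above can be summarized as a small state-update sketch; the status labels are illustrative names, not identifiers from this paper:

```python
def update_status(md, detected, n_mis):
    """Miss-detected (MD) counter logic: reset on a confident
    detection, otherwise count up; the object is dropped once
    MD exceeds the threshold N_mis."""
    if detected:
        return 0, "tracked"
    md += 1
    return md, ("deleted" if md > n_mis else "unstably-detected")
```

While the status is "unstably-detected", the system substitutes the dLSTM prediction for the missing iYOLO output, so a few consecutive misses do not interrupt the track.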
If a detected object has a larger bounding box, it should not disappear in a short time; on the contrary, an object with a smaller bounding box could be closer to the disappearing case. As shown in Fig. 2, we suggest the adaptive missed-detection counting (AMC) threshold as

N_mis = { 10, if A_T > A_max; 5, if A_min ≤ A_T ≤ A_max; 2, if A_T < A_min }, (21)

where A_T = w_T h_T denotes the area of the latest bounding box of the object detected by the iYOLO before time instant T. Since the dLSTM refiner takes T = 10 sets of bounding boxes from the iYOLO, we empirically choose the AMC threshold N_mis to be 10, 5, and 2 for large, middle, and small detected bounding boxes, respectively. With the AMC threshold, we can raise the detection rate and properly limit the false positive rate. With this adaptive threshold, we can recover miss-detected objects, which are frequently disturbed by various environmental changes. Since the dLSTM module needs the data vectors of consecutive bounding boxes of the detected object, for a miss-detected object we replace the output of the iYOLO with X_{T+1} = X̂_T.
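The AMC rule maps the latest box area to N_mis; the default area bounds A_max and A_min below are the values selected in the experiments:

```python
def amc_threshold(w, h, a_max=5000, a_min=1000):
    """Adaptive missed-detection counting (AMC) threshold:
    larger boxes tolerate more consecutive misses before the
    dLSTM refiner is terminated."""
    area = w * h                  # A_T = w_T * h_T
    if area > a_max:
        return 10                 # large box
    if area >= a_min:
        return 5                  # middle box
    return 2                      # small box
```

A distant (small) vehicle is released after only 2 misses, while a nearby (large) vehicle is kept alive for up to 10, matching the intuition that large objects cannot vanish quickly.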

Results and discussion
With 1280 × 720 videos, the proposed object detection and tracking system and the other compared systems are implemented in Python 3.5 and Keras 2.1.5 with the TensorFlow 1.4.0 backend as the deep-learning library. We used a personal computer with an Intel Core i5-8400 CPU at 2.8 GHz and 16 GB of 2400 MHz RAM. The iYOLO and dLSTM networks utilize an NVIDIA GeForce GTX 1080 8 GB GPU to accelerate training and testing.
To evaluate detection performance, we started from COCO pre-trained weights and then combined the KITTI dataset [36] and a self-built dataset collected on Taiwanese highways to train the vehicle detectors. Since we focus on vehicle detection and tracking on Taiwanese highway traffic roads, we labeled cars, buses and trucks as a single vehicle class. In the KITTI dataset, we filtered out the images that do not contain any vehicles to obtain 7480 images. In the self-built dataset, we collected 2635 images by day and night under various tunnel, shading and weather conditions from the Taiwanese highway. The original resolutions of the KITTI and self-built datasets are 1224 × 370 and 1280 × 720, respectively.
To balance the loss functions stated in (10), we set λ_loc = 5, λ_obj = 1, λ_noobj = 0.5 and λ_cls = 1 in the training process. Table 1 shows the details of all compared detection network models with five model indices. With model indices #1 and #2, the "SSD COCO" and "YOLOv2 COCO" models are pre-trained with the COCO database, respectively. With model index #3, the "YOLOv2_Re-trained" model denotes YOLOv2 re-trained after combining the original automobile-related classes into one vehicle class and including the self-built dataset collected on Taiwanese roads in training. With model index #4, the "YOLOv2_Reduced" model denotes YOLOv2 with three sets of two repeated 3 × 3 × 1024 convolution layers eliminated. With model index #5, the "Proposed iYOLO" model shown in Fig. 5 concatenates the final feature of the "YOLOv2_Reduced" model with two additional low-level features for object detection. For the dLSTM refiner depicted in Fig. 8, we capture the trajectories of moving objects from the video sequences for training. Each trajectory is labeled by a sequence of (x_1, y_1, x_2, y_2), where (x_1, y_1) and (x_2, y_2) are the top-left and bottom-right corners of the bounding boxes of the detected object, respectively. The moving object is characterized by the vehicle's trajectory. There are 27 video sequences containing 8100 location sequences.

Vehicle detector by iYOLO
For practical applications in Taiwan, we demonstrate the experimental results of vehicle detection on the Taiwanese dataset. Generally, YOLOv2 achieves excellent object detection performance with good computation speed; however, it is not robust for detecting vehicles in some cases. In the testing phase, there are 10 videos with 3150 frames, which contain 10,707 vehicles. For detection, the true positive rate (TPR) and false positive rate (FPR) are respectively given as

TPR = TP / (TP + FN), (22)

FPR = FP / (TP + FN), (23)

where TP, FP and FN denote the numbers of true positive, false positive and false negative detections, respectively, so that both rates are normalized by the number of ground-truth vehicles. The F1-score is defined as

F1-score = (2 · TPR) / (2 · TPR + FNR + FPR), (24)

where FNR = 1 − TPR denotes the false negative rate. Table 2 shows the performances achieved by the proposed iYOLO and the other detection models listed in Table 1. In Table 2, the YOLOv2 with model index #2, trained on the COCO dataset, has poor detection performance on the Taiwanese highway dataset, since the COCO dataset does not match the true scenarios in Taiwan. Since the "YOLOv2_Re-trained" model with model index #3 deals with one combined vehicle class and is retrained with the self-built dataset, it improves both the detection rate and the false positive rate. The "YOLOv2_Reduced" model with model index #4, after network reduction, achieves a better detection rate but performs worse in false positive rate. Finally, the proposed iYOLO with model index #5, concatenated with two low-level features, achieves the best performance. If we slightly increase the input size to 448 × 448, we can further improve the detection and false positive rates simultaneously. Two detection examples for YOLOv2 with COCO, YOLOv2_Re-trained, YOLOv2_Reduced and the proposed iYOLO are visually exhibited in Fig. 10. The results also demonstrate that the proposed iYOLO performs better than the others.
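The evaluation scores can be computed as below; as a reconstruction assumption, FPR here normalizes false positives by the ground-truth vehicle count, which makes the stated F1 formula coincide with the standard F1-score 2·TP/(2·TP + FN + FP):

```python
def detection_scores(tp, fp, fn):
    """TPR, FPR and F1-score from detection counts; all rates are
    normalized by the ground-truth count (a reconstruction
    assumption, not stated explicitly in the paper)."""
    gt = tp + fn                      # number of ground-truth vehicles
    tpr = tp / gt
    fpr = fp / gt
    fnr = fn / gt                     # FNR = 1 - TPR
    f1 = (2 * tpr) / (2 * tpr + fnr + fpr)
    return tpr, fpr, f1
```

For example, 90 true positives with 10 false positives and 10 misses give TPR = 0.9, FPR = 0.1 and F1 = 0.9.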

Vehicle detection with dLSTM refiner
To evaluate the tracking performance achieved by the dLSTM refiner after the proposed iYOLO network, we first set up the experiments and determine the parameters {α, β} and {A_max, A_min} defined in (18) and (21), respectively. The training dataset of the dLSTM refiner is collected from the self-built traffic trajectory dataset with sequential data arrays of (x_1, y_1, x_2, y_2). With random error deviations added to the ground truth, the dLSTM model is trained for 15,000 epochs using the ADAM optimizer with a learning rate of 0.0001. Unlike traditional classification and detection training, which involves images, training on trajectory datasets of sequential data arrays does not take much time. During the training, the many-to-one LSTM strategy is used with the number of time steps set to T = 10; in other words, every 10 time steps of training data correspond to one ground-truth location. In the testing phase, we inherit the iYOLO testing videos, which comprise 10 videos with 3150 frames containing a total of 10,828 vehicles. Since the loss function plays an important role in training the dLSTM refiner, we first determine the weighting parameters defined in (18) to find a loss function that effectively uses both the MSE location loss and the IoU loss. In these experiments, we adopted direct object status determination with a fixed N_mis = 30. Table 3 shows the detection performances with different sets of α and β. The results show that α = 1 and β = 0.5 achieve the best performance.
With α = 1 and β = 0.5, we need to design a suitable adaptive missed-detection counting (AMC) threshold to further help the dLSTM refiner. As stated in (21), we need to determine a good pair {A_max, A_min}, which helps to achieve the best performance with a higher detection rate and a lower false positive rate. The performances of the AMC strategy with different pairs of {A_max, A_min} are shown in Table 4. For an image size of 448 × 448, we found that the AMC strategy with A_max = 5000 and A_min = 1000 helps the proposed dLSTM refiner achieve the best performance. With the AMC strategy, we successfully reduce the false positive rate by avoiding unreasonable time extensions for small missing objects. In summary, with α = 1, β = 0.5, and the AMC threshold with A_max = 5000 and A_min = 1000, the proposed dLSTM refiner achieves the best performance.

Vehicle detection performances with dLSTM
To evaluate the vehicle tracking performance improved by the dLSTM, Table 5 shows the detection rates, false positive rates and F1-scores achieved by the iYOLO alone and by the iYOLO with the dLSTM, with and without control by the AMC threshold. The simulation results show that the dLSTM refiners significantly improve the detection rate. The dLSTM refiner essentially infers the continued existence of a vehicle that has been tracked: if the tracked vehicle disappears, the dLSTM refiner does not release it at once. That is why the proposed tracking method can have a larger false positive rate (FPR). With the help of the AMC threshold, the proposed method reduces the FPR further. It is noted that all the false positive cases are fast-disappearing vehicles moving at high speed; for those cases, the proposed method does not cause any driving safety problem, since those distant vehicles are moving quickly away from the target driver. It is interesting that YOLOv4 has the highest detection rate but also the highest FPR; the blinking false cases of YOLOv4 are mostly suddenly-appearing vehicles. The FPR of YOLOv4 could be reduced if it adopted the same improvement procedures and cooperated with the dLSTM refiners. Figures 11 and 12 show two visual simulation results for video set #1 and set #2, respectively. The iYOLO occasionally cannot detect small vehicles; however, the iYOLO with the dLSTM can successfully track them. In summary, the proposed dLSTM refiner helps to compensate for the frames in which vehicles are not detected properly. The dLSTM refiner solves the blinking issues caused by miss-detected bounding boxes and successfully achieves multiple-object tracking with the support of the decision system stated in Fig. 2. With better performance, the proposed vehicle detection and tracking system achieves 21 frames per second, which nearly reaches the real-time requirement.

Conclusion
In this paper, we proposed a robust object detection system by introducing the double-layer long short-term memory (dLSTM) refiner. Starting from YOLOv2, the improved YOLO (iYOLO) vehicle detector is achieved by parameter reduction, a combined vehicle class, and the concatenation of two low-level features. With training data combining the KITTI and self-built Taiwanese highway traffic datasets, we can raise the vehicle detection rate considerably. Finally, we further proposed the dLSTM refiner, which relies on the multi-object association rule and the adaptive missed-detection counting (AMC) threshold to improve its performance. The multi-object association rule helps to collect the time-series detection results of the objects, while the AMC threshold helps to reduce the false positive rate for tracked vehicles. Simulation results show that the proposed system can successfully track temporal locations to compensate for miss-detected bounding boxes of unstably-detected and gradually-disappearing objects. Since the dLSTM refiner acts as a temporal predictor using past detected bounding boxes, we can obtain more precise detection results than those achieved by the original object detector. However, we cannot gain any benefit for gradually-appearing objects, which lack prior information. The improvements of the object detector and the designs of the dLSTM refiner can be applied to any of the latest object detectors.