Multiobject Tracking in Videos Based on LSTM and Deep Reinforcement Learning



Introduction
Multiobject tracking in videos plays an important role in a wide range of applications, for example, video surveillance, robot navigation, intelligent transportation systems, and video analysis [1,2]. Although the field has made tremendous progress since the early work, visual multiobject tracking remains a challenging problem because of frequent occlusions, appearance similarity between objects, a varying number of objects, and noisy measurements [3,4].

Related Work.
Tracking-by-detection methods [5][6][7] have emerged as one of the most successful strategies owing to recent advances in object detection [8][9][10]. Most recent tracking-by-detection algorithms decompose multiobject tracking into two stages: object detection and data association. These algorithms apply the object detector to each frame and associate its results across frames. Therefore, this kind of multiobject tracking method can recognize emerging or disappearing objects in a video sequence more easily, and the search space of object hypotheses is greatly reduced.
Tracking-by-detection methods are roughly classified into two categories: offline approaches and online approaches. Offline approaches often use the detections of all frames of the video sequence together to build long trajectories that are robust to false detections and occlusion. A crowded or cluttered scene usually causes detection failures, which in turn decrease the accuracy of data association. To compensate for these problems, many multiobject tracking algorithms based on global data association have been proposed [11][12][13][14]. However, the performance of offline approaches is still limited, and they are hard to apply to real-time applications. Online methods, in contrast, perform data association between detections and trackers frame by frame and can therefore be applied to real-time applications. Bae and Yoon [15] proposed an online visual multiobject tracking approach that can handle the similarity between multiple objects. Data association is the major issue of tracking-by-detection methods [16]. Classical data association approaches include the joint probabilistic data association filter (JPDAF) and multihypothesis tracking (MHT) [17]. The JPDAF considers all possible associations between objects and measurements to make the best assignment in each time step. MHT considers multiple possible associations over several time steps, but its applicability is usually limited by its complexity. Many recent multiobject tracking algorithms have concentrated on enhancing the performance of the object detector or designing better data association schemes [18][19][20].
In recent years, LSTM has attracted increasing attention for modeling sequential data. Its applications cover feature selection [21], machine translation [22], action recognition [23], video captioning [24], and human trajectory prediction [25]. The main advantages of LSTMs for modeling sequential data are that they allow end-to-end fine-tuning and that they are not confined to fixed-length inputs or outputs. Inspired by the successful applications of LSTM in computer vision, we adopt a data association method based on LSTM in this paper. LSTM includes nonlinear transformations and memory cells, which make it effective for data association.
Most previous multiobject tracking methods represent objects using raw pixels and low-level handcrafted features, such as histograms of oriented gradients (HOG) [26], Haar-like features [27], and local binary patterns (LBP) [28]. Although these features are computationally efficient, they are limited because handcrafted features cannot capture more complex characteristics of the objects. Recently, deep learning has received much attention, with state-of-the-art results in complicated tasks such as object detection [29], image classification [30], object recognition [31], and object tracking [32]. A deep-learning tracker (DLT) was proposed in [33], which uses a stacked denoising autoencoder to learn generic features offline from a large number of auxiliary images. However, the DLT tracker cannot describe the temporal invariance of deep features, which is important for visual object tracking. In [34], a deep-learning tracking method was developed that uses a two-layer convolutional neural network (CNN) to learn hierarchical features from auxiliary video sequences; this method takes appearance variations and complicated motion transformations of objects into account. In [35], the authors present a visual tracking algorithm that learns a specific feature extractor with CNNs from an offline training set; the CNNs learn spatial and temporal features jointly from image pairs of two adjacent frames. These deep-learning trackers often overlook how to search the region of interest of an object and how to select the best candidate as the tracking result.
With the recent achievements of deep learning, integrating deep-learning methods with reinforcement learning (RL), that is, deep reinforcement learning (DRL), has shown very promising results on decision-making problems. Deep neural networks make reinforcement learning algorithms more effective because they provide deep feature representations. DRL algorithms have achieved remarkable success in many challenging domains, for example, Atari games [36] and the board game Go [37]. In the computer vision community, there have also been many attempts to apply DRL to traditional tasks, such as action recognition [38], object localization [39], object tracking [40], and region proposal [41]. Yun et al. propose an end-to-end active object tracking algorithm via reinforcement learning, which addresses tracking and camera control simultaneously [42]. In [43], the authors present action-decision networks for visual tracking with deep reinforcement learning. However, these tracking methods based on deep reinforcement learning usually focus on a single object; there is little work on multiobject tracking. Unlike the aforementioned methods, our method explores how to apply deep reinforcement learning to the online multiobject tracking problem.

Summary of Contributions.
Our motivation is to design a real-time multiobject tracker via LSTM and DRL that incorporates appearance through DRL and learns a more effective association strategy through LSTM to improve tracking performance. The key contributions of this paper can be summarized as follows:
(i) We propose a novel visual multiobject tracking algorithm based on LSTM and deep reinforcement learning to solve the problems in the existing methods; the algorithm is model-free and requires no prior knowledge. To the best of our knowledge, we are the first to combine these concepts to overcome problems in visual multiobject tracking.
(ii) The proposed multiobject tracker includes three modules: an object detection module, a number of single-object trackers, and a data association module. We adopt YOLO V2 as an object detector as it is a real-time detection system. Each single-object tracker is treated as an agent, which is trained using DRL. An LSTM-based architecture is adopted to solve the joint data association problem.
(iii) To compare our multiobject tracker with other state-of-the-art methods qualitatively and quantitatively, we conducted extensive experiments on publicly available challenge benchmark datasets.
The rest of our paper is structured as follows: Section 2 reviews the background. Section 3 introduces the proposed multiobject tracking framework. Section 4 demonstrates the experimental results and analysis. Finally, we draw conclusions in Section 5.

Background
Long Short-Term Memory (LSTM).
Traditional recurrent neural networks (RNNs) contain cyclic connections that make them a powerful tool for learning complex temporal dynamics, as shown in Figure 1. The computation in an RNN is governed by the following formulas:
$$h_t = f(U x_t + W h_{t-1}), \qquad y_t = f(V h_t),$$
where $f$ is an element-wise nonlinearity, $x_t$ and $y_t$ represent the input vector and the output vector at time step $t$, and $h_t \in \mathbb{R}^N$ is the hidden-layer vector with $N$ hidden units at time step $t$. $U$, $W$, and $V$ are the weight matrices of the connections from input nodes to hidden nodes, from hidden nodes to hidden nodes, and from hidden nodes to output nodes, respectively. Though RNNs have been successfully used for sequence modeling tasks, they can only model the data within a fixed-size window. At the same time, training conventional RNNs is difficult because of exploding and vanishing gradients. These problems limit the capability of RNNs to learn long-term dynamics. LSTM was proposed in [44] to solve these problems. This paper uses the LSTM unit described in [45], which is shown in Figure 2.
In this subsection, we consider the LSTM for a single memory unit only. Let $x = (x_1, \ldots, x_T)$ be an input sequence and $y = (y_1, \ldots, y_T)$ an output sequence; an LSTM network iteratively computes a mapping from $x$ to $y$. In the LSTM equations, $\sigma(x) = 1/(1 + \exp(-x))$ is the logistic sigmoid function, $c_t$ is the cell activation vector, $i_t$ is the input gate, $f_t$ is the forget gate, and $o_t$ is the output gate. All of these have the same size as the hidden vector $h_t$. That is, in addition to a hidden vector $h_t \in \mathbb{R}^N$, the LSTM includes an input gate $i_t \in \mathbb{R}^N$, a forget gate $f_t \in \mathbb{R}^N$, an output gate $o_t \in \mathbb{R}^N$, and a memory cell $c_t \in \mathbb{R}^N$. The subscripts of a weight matrix indicate the connection it represents; for example, $W_{hi}$ is the hidden-to-input-gate matrix and $W_{xo}$ is the input-to-output-gate matrix. $b_i$, $b_f$, $b_o$, and $b_c$ are the bias terms added to $i$, $f$, $o$, and $c$, respectively.
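For reference, one widely used form of the LSTM update that is consistent with the symbols defined above is sketched below. It is an assumption rather than a transcription of the formulation in [45]: it omits peephole connections, uses $\tanh$ as the cell nonlinearity, writes $\odot$ for the element-wise product, and names the matrices $W_{xi}$, $W_{xf}$, $W_{hf}$, $W_{ho}$, $W_{xc}$, and $W_{hc}$ by analogy with $W_{hi}$ and $W_{xo}$.

```latex
% Standard LSTM update without peephole connections (assumed form).
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i), \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f), \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \\
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
```

In this form the forget gate scales the previous memory cell, the input gate scales the new candidate content, and the output gate controls how much of the cell is exposed through $h_t$.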

Deep Reinforcement Learning (DRL).
Reinforcement learning (RL) is commonly used to solve sequential decision-making problems. The process of reinforcement learning is shown in Figure 3. Recently, significant progress has been made by combining reinforcement learning with the ability of deep learning to learn feature representations. Deep Q network (DQN) and policy gradient are two popular methods in DRL. DQN is a form of Q-learning with function approximation; that is, it learns a state-action value function Q, represented by a neural network, by minimizing temporal-difference errors. To improve performance and stability, various network architectures have been built on the DQN algorithm, such as dueling DQN [46] and double DQN [47].
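The temporal-difference error that Q-learning minimizes can be illustrated with a one-step target. The sketch below is a minimal, framework-free illustration and not the training code of any particular DQN variant; the function names, the discount factor `gamma`, and the `done` flag are assumptions introduced for the example.

```python
def td_target(reward, next_q_values, gamma, done):
    """One-step Q-learning target: r + gamma * max_a' Q(s', a').

    `next_q_values` holds the Q estimates of all actions in the next state.
    At a terminal transition (`done`), no future value is bootstrapped.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def td_error(q_value, reward, next_q_values, gamma=0.99, done=False):
    """Temporal-difference error: target minus the current estimate Q(s, a)."""
    return td_target(reward, next_q_values, gamma, done) - q_value
```

A DQN would typically minimize the squared value of this error over sampled transitions, with the Q estimates produced by the network.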
A policy gradient approach is a type of reinforcement learning method that directly optimizes parametrized policies by using gradient descent [48]. Policy gradient methods have many advantages compared to traditional reinforcement learning approaches. For example, they need fewer parameters to represent the optimal policy than the corresponding value function and they do not suffer from the difficult problem caused by uncertain state information.

Proposed Visual Multiobject Tracking Algorithm
In Subsections 3.1-3.3, we first give a brief overview of the architecture of the proposed multiobject tracking algorithm and then describe the details of our method.

Architecture of the Proposed Multiobject Tracking Algorithm.
Our method consists of three major components: an object detection module, a number of single-object trackers, and a data association module, which are shown in Figure 4. First, we choose YOLO V2 [49] as the object detector because it is a state-of-the-art, real-time object detection system. YOLO V2 is applied to every frame and outputs a set of detections $D_t$ at time step $t$. In each frame, YOLO V2 may output many kinds of detections. To obtain the correct detections for the tracked objects, the intersection-over-union (IoU) distance between the ground truth and the detections is computed in the first frame. In the other frames, the IoU distance between the mean of each object's short-term history of validated detections and the current detections is computed instead. Second, each single-object tracker is a network composed of a CNN followed by an LSTM unit. Each tracker, regarded as an agent, is trained with deep reinforcement learning. Finally, inspired by [50], we adopt an LSTM-based architecture that learns to solve the joint data association problem from training data.

Single-Object Tracker via Deep Reinforcement Learning.
We cast the problem of object tracking as a Markov decision process (MDP), since this setting provides a formal strategy to model an agent that makes a sequence of decisions. In our formulation, a single-frame image is considered as the environment, in which the agent transforms a bounding box using a set of actions. The MDP includes a set of actions $a \in A$, a set of states $s \in S$, a state transition function $f(s, a)$, and a reward signal $r$. Our single-object tracking framework is illustrated in Figure 5. This section presents the details of these components. The action set $A$ is composed of six actions that can be applied to the bounding box and one action that terminates the search process, as shown in Figure 6. The state is a tuple $s_t = (p_t, v_t)$, where $p_t$ is the image patch within the bounding box of the object, specified by the 4-dimensional vector $p_t = (x_t, y_t, w_t, h_t)$, and $v_t$ is a vector holding the history of taken actions. The history vector stores the past 10 actions, so $v_t$ has 70 dimensions, as each action vector has 7 dimensions. At time step $t + 1$, the state $s_{t+1} = (p_{t+1}, v_{t+1})$ is determined by $s_t = (p_t, v_t)$ and the state transition functions, where $p_{t+1} = f_t(p_t, a_t)$ and $v_{t+1} = f_v(v_t, a_t)$.
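The bookkeeping of the action history can be sketched as follows. This is a minimal illustration that assumes each 7-dimensional action vector is a one-hot encoding and that the history starts as all zeros; neither detail is stated in the text, and the class and function names are hypothetical.

```python
from collections import deque

NUM_ACTIONS = 7   # six bounding-box actions plus the terminating action
HISTORY_LEN = 10  # number of past actions stored in the history vector


def one_hot(action, num_actions=NUM_ACTIONS):
    """Encode an action index as a one-hot vector (assumed encoding)."""
    vec = [0.0] * num_actions
    vec[action] = 1.0
    return vec


class ActionHistory:
    """Fixed-length history of the most recent actions (zero-initialized)."""

    def __init__(self):
        empty = [[0.0] * NUM_ACTIONS for _ in range(HISTORY_LEN)]
        # A bounded deque drops the oldest action when a new one is pushed.
        self._buf = deque(empty, maxlen=HISTORY_LEN)

    def push(self, action):
        """Transition v_{t+1} = f_v(v_t, a_t): append the newest action."""
        self._buf.append(one_hot(action))

    def vector(self):
        """Flatten the stored actions into the 70-dimensional vector v_t."""
        return [x for vec in self._buf for x in vec]
```

The flattened vector would then be concatenated with the features of the image patch to form the input of the agent.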
The agent receives a reward signal $r_t$ from the environment during the training process. In our method, the reward is given at the end of a tracking episode, when the object is tracked successfully. More specifically, the reward signal is $r_t = 0$ at every intermediate step of the MDP. When the "stop" action is selected at the termination step $T$, the reward signal $r_T$ is a thresholding function of $\mathrm{IoU}(p_T, g)$, where $\mathrm{IoU}(p_T, g) = \mathrm{area}(p_T \cap g)/\mathrm{area}(p_T \cup g)$ represents the overlap ratio of $p_T$ and the ground truth $g$ of the object.
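The overlap ratio and a thresholded terminal reward can be sketched as follows. The IoU computation follows the definition above; the box layout (top-left corner plus width and height), the threshold of 0.7, and the reward values of +1 and -1 are illustrative assumptions and are not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h).

    (x, y) is assumed to be the top-left corner of the box.
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def terminal_reward(box, ground_truth, threshold=0.7, hit=1.0, miss=-1.0):
    """Reward at the 'stop' action: a thresholding function of the IoU.

    `threshold`, `hit`, and `miss` are hypothetical values for illustration.
    """
    return hit if iou(box, ground_truth) > threshold else miss
```

With this kind of reward, every intermediate step yields zero and only the final box placement is scored.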
We adopt policy-based reinforcement learning methods because they are better at learning stochastic policies and have better convergence properties. Our whole network is parameterized by $W$; the policy-based method models the policy function $\pi(a \mid s; W)$ and the value function $V(a \mid s; W')$. The aim of training this network is to maximize the overall tracking performance by policy gradient approximation. At each time step $t$, the goal of the agent is to learn a policy function $\pi(a_t \mid s_t; W)$, which can be approximated by a stochastic gradient ascent algorithm. As there are very limited amounts of labelled data for multiobject tracking, we use synthetic data to supplement the real data in training. In the update rules for the parameters $W$ and $W'$, $R_t = \sum_{t'=t}^{t+T-1} \alpha^{t'-t} r_{t'}$ is the sum of future rewards up to $T$ time steps, discounted by $0 < \alpha \le 1$; $\lambda$ is the learning rate; $H(\cdot)$ is an entropy regularizer; and $\varepsilon$ is the regularizer factor.
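The discounted return $R_t$ defined above can be computed directly from a list of rewards. The sketch below is a minimal illustration; the function name is hypothetical, and truncating the sum at the end of the episode when fewer than $T$ rewards remain is an assumption.

```python
def discounted_return(rewards, t, horizon, alpha):
    """R_t = sum_{t'=t}^{t+T-1} alpha**(t'-t) * r_{t'} with T = horizon.

    The sum is cut off at the end of `rewards` if the episode is shorter
    than the horizon (an assumption for this sketch).
    """
    end = min(t + horizon, len(rewards))
    return sum(alpha ** (k - t) * rewards[k] for k in range(t, end))
```

Because the reward is zero until the terminating action, $R_t$ reduces to the terminal reward scaled by a power of $\alpha$ equal to the number of remaining steps.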
Our deep CNN is built on the VGG-16 network, which includes five convolutional stages, each followed by a pooling layer; the last convolutional layers of the five stages are Conv1-2, Conv2-2, Conv3-3, Conv4-3, and Conv5-3. The spatial resolution decreases gradually as the depth increases, because every pooling layer in the VGG-16 model has a 2 × 2 kernel and a stride of 2. For example, for an input image of size $M \times N$, the output feature maps of pooling 5 have a size of $M/2^5 \times N/2^5$. In our model, we use the feature maps from Conv3-3, Conv4-3, and Conv5-3, which are upsampled to the same size by bilinear interpolation.
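The resolutions of the feature maps that have to be brought to a common size can be worked out from the number of pooling layers that precede each layer. The sketch below assumes the standard VGG-16 layout, in which the 3 × 3 convolutions preserve the spatial size and each pooling layer halves it with floor rounding; the function and dictionary names are hypothetical.

```python
# Number of 2x2, stride-2 pooling layers applied before each output
# in the standard VGG-16 layout.
POOLINGS_BEFORE = {"conv3_3": 2, "conv4_3": 3, "conv5_3": 4, "pool5": 5}


def feature_map_size(height, width, layer):
    """Spatial size of the feature map of `layer` for a height x width input."""
    for _ in range(POOLINGS_BEFORE[layer]):
        height //= 2  # each pooling layer halves the resolution
        width //= 2
    return height, width
```

For a 224 × 224 input this gives 56 × 56, 28 × 28, and 14 × 14 for Conv3-3, Conv4-3, and Conv5-3, respectively, so the two deeper maps must be enlarged before the three can be combined.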

Data Association.
Let $P_t = \{p_t^1, \ldots, p_t^M\}$ represent the set of all outputs of the single-object trackers at time step $t$, where $p_t^i$ refers to the state of the $i$th output of a single-object tracker and $M$ is the number of objects that can be tracked simultaneously in one time step. The state of the $i$th object is represented by the 4-dimensional vector $p_t^i = (x_t^i, y_t^i, w_t^i, h_t^i)$. Similarly, let $Q_t = \{q_t^1, \ldots, q_t^N\}$ denote the set of detections from the object detector, with $q_t^j$ the $j$th detection and $N$ the number of detections. Let $D_t \in \mathbb{R}^{M \times N}$ denote the similarity matrix for data association, which measures the relation between an output of a single-object tracker $p_t^i$ and a detection $q_t^j$, where $D_t^{ij} = \lVert p_t^i - q_t^j \rVert_2$ is the Euclidean distance between $p_t^i$ and $q_t^j$. Data association based on LSTM for object $i$ is illustrated in Figure 6.
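The similarity matrix can be computed with a few lines of code. The sketch below is a minimal illustration that treats each tracker output and each detection as a 4-dimensional tuple; the function name is hypothetical.

```python
import math


def similarity_matrix(tracker_outputs, detections):
    """Return the M x N matrix of Euclidean distances.

    Entry [i][j] is the distance between the i-th tracker output and the
    j-th detection, both given as 4-dimensional tuples.
    """
    return [[math.dist(p, q) for q in detections] for p in tracker_outputs]
```

Each row of the matrix collects the distances from one tracked object to all current detections, which is the information the association step needs for that object.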
The task of data association is to predict the assignment for each object using the temporal step-by-step functionality of LSTM. The inputs at each step $i$ are the hidden state $h_i$, the cell state $c_i$, and the similarity matrix $D_t$. The outputs are the hidden state $h_{i+1}$, the cell state $c_{i+1}$, and the assignment probability vector $A_t^i$. $A_t^i$ is a vector of assignment probabilities for object $i$ and all available measurements, which is obtained by applying a softmax layer with normalization to the predicted values; $A_t^{ij}$ is the probability that object $i$ is assigned to the $j$th detection, and $\sum_j A_t^{ij} = 1$. Let $\varepsilon$ be the correct assignment; we adopt the negative log-likelihood loss of the correct assignment as the cost function to measure the misassignment cost. Data association is a more complex task and therefore requires more representational power. The LSTM of the data association module has two layers and 512 hidden units. It takes approximately 40 hours to train all the modules of our tracker on a CPU. The training can be sped up significantly by using GPUs.
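The softmax normalization and the negative log-likelihood cost can be illustrated for a single object as follows. This is a minimal sketch under the assumption that the network produces one raw score per detection; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into assignment probabilities that sum to one."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nll_loss(assignment_probs, correct_index):
    """Negative log-likelihood of the correct assignment."""
    return -math.log(assignment_probs[correct_index])
```

The loss is zero when the full probability mass is placed on the correct detection and grows as that probability shrinks.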

Qualitative Evaluation.
In this section, we compare our visual multiobject tracker with several state-of-the-art methods on the MOT Challenge benchmark [51] to show the performance of our algorithm. The synthetic datasets OVVV [52] and virtual KITTI [53] are used to supplement the real data in training. In the single-object tracker, the learning rate is set to 0.0001 for the CNN and to 0.001 for the fully connected layers. In the DRL network, the learning rate $\lambda$ is set to 0.0001, the regularizer factor $\varepsilon$ is set to 0.01, $T = 20$, and $\alpha = 0.95$.
The PETS09-S2L2 sequence consists of 436 frames of 768 × 576 pixels with heavy crowd density and illumination changes. The pedestrians undergo severe occlusion and scale changes in the sequence. The ADL-Rundle-3 sequence consists of 625 frames of 1920 × 1080 pixels. It shows a crowded pedestrian street captured from a stationary camera. Frequent occlusions, missed detections, and illumination variation occur among the multiple objects. The TUD-Crossing sequence shows a road crossing from a side view. It consists of 201 frames of 640 × 480 pixels and includes nonlinear motion, objects in close proximity, and occlusions. The AVG-TownCentre sequence contains 450 frames of 1920 × 1080 pixels. It shows a busy town center street from a single elevated camera. The sequence contains medium crowd density, frequent dynamic occlusions, and scale changes.
From these experimental results, we can see that our tracker performs well most of the time despite frequent occlusions, similarity among objects, scale changes, and illumination changes. Nevertheless, there are still some examples of unavoidable tracking failures, as illustrated in Figure 15. For example, the brightness of the environment
From the results, we can see that the object is missed by the detector, while it is still tracked by the single-object tracker trained with DRL.
The multiple object tracking accuracy (MOTA) combines three sources of error and is computed as
$$\mathrm{MOTA} = 1 - \frac{\sum_t (fn_t + fp_t + \mathrm{IDSW}_t)}{\sum_t gt_t},$$
where $fn_t$, $fp_t$, $\mathrm{IDSW}_t$, and $gt_t$ are the numbers of false negatives, false positives, identity switches, and ground truth objects at frame $t$. The multiple object tracking precision (MOTP) is the average dissimilarity between all true positives and their corresponding ground truth objects, measured by the intersection area over the union area of the bounding boxes. It is computed as
$$\mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t},$$
where $d_{t,i}$ denotes the bounding box overlap of object $i$ with its assigned ground truth object and $c_t$ is the number of matches in frame $t$. The better performance of the proposed method can mainly be attributed to the three parts of the tracker: YOLO V2 is a state-of-the-art object detector, the data association strategy based on LSTM can find a global optimal assignment, and the single-object trackers are able to find the location of the object via deep reinforcement learning.
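The two metrics can be computed from per-frame counts as sketched below, using the standard definitions MOTA = 1 − Σ(fn + fp + IDSW)/Σ gt and MOTP = Σ d / Σ c. This is a minimal illustration rather than the official MOT Challenge evaluation code; the function names and the input layout (one list entry per frame) are assumptions.

```python
def mota(false_negatives, false_positives, id_switches, ground_truths):
    """MOTA from per-frame counts of errors and ground truth objects."""
    errors = sum(false_negatives) + sum(false_positives) + sum(id_switches)
    return 1.0 - errors / sum(ground_truths)


def motp(overlaps_per_frame):
    """MOTP as the mean bounding box overlap over all matches.

    `overlaps_per_frame[t]` lists the overlaps d_{t,i} of the matches in
    frame t, so its length is the number of matches c_t.
    """
    total_overlap = sum(sum(frame) for frame in overlaps_per_frame)
    total_matches = sum(len(frame) for frame in overlaps_per_frame)
    return total_overlap / total_matches
```

Note that MOTA can become negative when the number of errors exceeds the number of ground truth objects, whereas MOTP only reflects the localization quality of the matched objects.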
We implemented the experiments of our proposed multiobject tracking algorithm on the Windows 10 operating system, using MATLAB R2016b as the software platform. The computer is configured with an Intel® Core™ i7-4712MQ CPU and a GeForce GTX TITAN X GPU with 12.00 GB of VRAM.
The running times on the MOT Challenge test dataset are shown in Table 2, where they are compared with those of some state-of-the-art trackers. Our method is a real-time tracking system; although it is slower than RNN-LSTM, which does not incorporate appearance, it outperforms RNN-LSTM on the other performance measures.

Conclusion
This paper proposes a visual multiobject tracking algorithm based on LSTM and deep reinforcement learning to overcome the problems of existing algorithms, for example, that handcrafted features cannot capture more complex characteristics of the objects and that tracking fails when the number of objects varies. We adopted the object detector YOLO V2 to detect the multiple objects. Each single-object tracker is a network composed of a CNN followed by an LSTM unit and, regarded as an agent, is trained with deep reinforcement learning. In each frame, we use LSTM to perform data association between the detections of the pretrained object detector and the outputs of the single-object trackers. The experimental results show that the proposed multiobject tracking method improves the robustness and accuracy of tracking.

Conflicts of Interest
The authors declare no conflict of interest.