Online Learned Siamese Network with Auto-Encoding Constraints for Robust Multi-Object Tracking

Multi-object tracking aims to estimate the complete trajectories of objects in a scene. Distinguishing among objects efficiently and correctly in complex environments is a challenging problem. In this paper, a Siamese network with an auto-encoding constraint is proposed to extract discriminative features from detection responses in a tracking-by-detection framework. Unlike recent deep learning methods, the simple two-layer stacked auto-encoder structure enables the Siamese network to operate efficiently with only small-scale online sample data. The auto-encoding constraint reduces the possibility of overfitting during small-scale sample training. The proposed Siamese network is then extended to extract a previous-appearance-next vector from each tracklet for better association. The new feature integrates the appearance and the previous and next stage motions of an element in a tracklet. With the new features, an online incrementally learned tracking framework is established. It contains reliable tracklet generation, data association to generate complete object trajectories, and tracklet growth to deal with missing detections and to enhance the new feature for tracklets. Benefiting from the discriminative features, the final trajectories of objects can be obtained by an efficient iterative greedy algorithm. Feature experiments show that the proposed Siamese network has advantages in terms of both discrimination and correctness. System experiments show the improved tracking performance of the proposed method.


Introduction
As a key technology in computer vision, multi-object tracking (MOT) has received growing attention from researchers all over the world. In recent years, with the improvements in object detection techniques [1][2][3], tracking-by-detection (TBD) has become one of the most successful strategies. It applies an object detector to produce detection responses in each frame, which are then used to generate complete trajectories. The data association process mainly depends on object features including appearance, motion, and other factors. It is often solved by the Hungarian algorithm [4,5], network flows [6][7][8], minimum energy models [9,10], conditional random field approaches [11,12], hyper-graph models [13], deep learning methods [14][15][16][17], and so on.
Object feature expression is the basis of data association. Handcrafted features, such as the histogram of oriented gradient (HOG) [18], local binary patterns (LBP) [19], and the histogram of color (HOC), are widely used in computer vision research [8,11,13,20]. These features were originally designed to distinguish objects from various backgrounds. Although a combination of different handcrafted features [11,13] is often used to improve discrimination, it is still not robust enough.
Meanwhile, detection responses given by object detectors are not always accurate and sometimes even false due to complex backgrounds, poor image quality, complicated movements, or occlusions of objects. Thus, how to better distinguish targets by online detection responses, how to deal with noise due to detection inaccuracy, and how to combine various cues of a target to enhance discrimination remain key issues that limit tracking performance.
With the development of deep learning in image classification, segmentation, and other applications, researchers have used deep architectures to learn discriminative features for multi-object tracking and have achieved good results. In [12,15,17,21-23], deep Siamese networks were adopted instead of traditional handcrafted methods [11,13]. A contrastive loss function was used with the aim of decreasing the feature distances for pairs of the same object while increasing the distances for pairs of different objects. Due to the shortage of online samples, the training of such deep neural networks mainly depends on offline learning. Although online fine-tuning is often adopted, the online data are too limited to train a deep network effectively.
In this paper, a Siamese network with an auto-encoding constraint (SNAC) is proposed, which is able to work well with a small-sized sample set. Different from previous deep Siamese networks, the SNAC has a simple structure with two fully-connected layers, an auto-encoder layer, and a code-mix layer. The simple network can be easily learned by online limited samples to extract discriminative features to distinguish objects on the scene. Inspired by stacked auto-encoder methods [24,25], the output of the encoder layer tries to represent the input detection response as accurately as possible. This is done by adding a constraint term to the loss function, called the auto-encoding constraint, which effectively prevents the network from overfitting while training with limited samples. To deal with inaccurate detection responses (red bounding box in Figure 1a), Gaussian distribution training samples are generated around detection responses to suppress noises. For each detection response, one SNAC is trained to distinguish it from others in adjacent frames. Meanwhile, in order to enhance robustness, following [22], the HOC is used as the input instead of raw pixels. With the discriminative detection features extracted by SNACs, reliable tracklets are generated.
To better distinguish tracklets, SNAC is extended to extract a composite previous-appearance-next (PAN) feature for each tracklet, which combines the previous and next step motions with the appearance of the tracklet element. Following [11,26], elements in the same tracklet are treated as positive samples, and negative samples are obtained from time-overlapped tracklets. A distribution description is proposed to express motion; it suppresses motion noise and is compatible with the appearance representation for joint learning of the PAN feature.
In order to solve the MOT problem with the proposed SNAC, an online incrementally learned tracking framework is established. First, one SNAC is trained online for each detection response, and reliable tracklets are generated mainly from the extracted features. Then, the PAN features are learned from tracklets by improved SNACs. To improve the training efficiency, SNACs are trained by incremental learning. During tracklet generation, the parameters of the SNAC for a detection in the new frame are inherited from the predecessor tracklet element, and the training samples are updated frame by frame. To extract PAN, the parameters are initialized from the SNAC of the related detection response. A tracklet growing process is used to deal with missing and partial detections (Figure 1b,c) before tracklet association. With the discriminative PAN feature, complete trajectories are solved efficiently by an iterative greedy algorithm.
The main contributions of this paper are summarized as follows:
(1) A Siamese network with a simple structure and an auto-encoding constraint is proposed to extract discriminative features efficiently for objects in the scene. The auto-encoding constraint is added to prevent overfitting when training data are limited.
(2) A composite tracklet feature, PAN, is defined, which combines appearance and motion through joint learning. Because PAN describes the sequential features of tracklets better, data association based on PAN is more reliable.
(3) A tracking framework is established that includes reliable tracklet generation by incremental learning with SNAC for the detection responses, tracklet growth to enhance PAN performance and deal with missing detections, and tracklet association with PAN to generate complete trajectories.
Figure 1. (a) Inaccurate detection; (b) missing detection; (c) partial detection.

Related Works
Tracking by detection (TBD) has been one of the most promising methods developed to solve the multi-object tracking (MOT) problem in recent years. It generates object trajectories based on detection responses given by pre-designed detectors. For reliable data association, most recent studies were based on tracklets. In [26], the dual-threshold method was proposed to generate reliable tracklets and use them to obtain the final trajectories hierarchically. In [27], a prototype of a three-frame triplet, which is a type of three-member tracklet, was designed to extract high-level features. The Hungarian algorithm was also used to generate reliable tracklets in [12,15]. On the basis of tracklets, [11] built an online learned conditional random field (CRF) model focused on distinguishing difficult pairs of objects. In [13], a hyper-graph model was developed to explore more complex relations among objects. The latest MOT methods [12,21] also focused on using tracklets. In these studies, tracklet building and feature expression are important for achieving reliable data association. This section mainly introduces object feature extraction methods for MOT.
From handcrafted methods to deep learning techniques, many studies have achieved significant improvements in extracting appropriate object features for MOT. In [11,13], a combination of multiple handcrafted features was proposed to distinguish objects by appearance. Their sample collection schemes were used in many subsequent studies. The development of deep learning has introduced new ideas for feature description in tracking. In [24,28-30], deep neural networks were adopted for single object tracking (SOT) and achieved significant improvements. In SOT problems, features of objects are used to distinguish them from the background. Different from SOT, MOT mainly distinguishes objects from each other. Due to this difference, the deep learning schemes of SOT cannot provide good results for MOT problems.
The deep learning methods for MOT can be summarized into two categories. The first builds a deep learning based tracking model to form the whole MOT system. Milan et al. [31] proposed a tracking model based on recurrent neural networks (RNN). The proposed RNN model described the whole tracking system including motion prediction, updating, object state judgment, and data association. It was trained online in an end-to-end manner to track various objects. Schulter et al. [14] proposed a deep network flow model for MOT, which instead of empirically hand-crafting costs, learned the parameterized costs of the network flow model by end-to-end training. This dynamic parameter setting method improved the robustness and accuracy of tracking. Zhou et al. [12] proposed a deep continuous conditional random field (DCCRF) model for solving online MOT problems. The unary term was used to provide a deep discriminative appearance feature for tracklet association, and a pairwise term was used to deal with inter-object relations. In [16], a deep neural network consisting of an encoder and a decoder was proposed. In their method, an encoder was a fully-connected network and a decoder was a bidirectional long short-term memory (LSTM). This network was able to learn the association matrix to solve MOT.
The second group uses a deep neural network to extract discriminative features for each object. Unlike the first group, this approach deals with the object feature extraction problem directly, and many researchers have followed this idea. Sadeghian et al. [32] proposed an RNN model that jointly used the appearance, motion, and interactions of an object to encode a discriminative long-term temporal relationship from these cues. Their discriminative appearance features were extracted by a deep CNN. Son et al. [33] designed a quadruplet CNN (QCNN) to learn the affinities among objects based on appearance and motion. The proposed quadruplet loss function guided the network to learn a temporally smooth appearance model with motion-aware constraints. Features extracted from the QCNN included time continuity, which enhanced the discrimination. In addition, Siamese networks, first defined and used for signature verification, have played an important role and achieved good results in face identification [34], people re-identification [35], and many other computer vision applications. Siamese networks are more suitable for distinguishing objects due to their symmetrical structures. Wang et al. [15] applied a Siamese CNN (SCNN) to construct an appearance affinity model for tracklets. They embedded a temporally constrained multi-task mechanism in their training process. Leal-Taixé et al. [22] used an SCNN to estimate the likelihood that two detections belong to the same object using multi-modal inputs including image and optical flow. Following [22], Yoon et al. [23] proposed the historical appearance matching method and trained a Siamese network by a two-step process to deal with noisy detections. In [17], a speed-up method was proposed to remove redundant appearance matchings of the SCNN for real-time tracking. In the DCCRF model [12], an SCNN was also used to extract discriminative features. Based on the SCNN, Bae et al. [21] proposed a confidence-based data association method for MOT. They used the SCNN to learn a discriminative appearance model from offline training datasets.

Online Learned Siamese Network with Auto-Encoding Constraint
In this section, a new Siamese network with an auto-encoding constraint (SNAC) is proposed. It is better at distinguishing objects in MOT. Benefiting from the simple structure of two fully-connected layers, an auto-encoder layer and a code-mix layer, the SNAC can be learned effectively. Meanwhile, with an auto-encoding constraint in the loss function, SNAC can prevent overfitting while training with limited online samples. In order to suppress detection noises, Gaussian distribution samples were generated around detection responses to make up the training set and HOC was used as the input instead of raw pixels. Then, an incremental learning algorithm was proposed to train the SNAC to generate reliable tracklets. Mathematical notations are listed in Table 1.

SNAC
The two-layer structure of the Siamese network with an auto-encoding constraint (SNAC) is shown in Figure 2a. Bounding boxes of detection responses were first resized to 48 × 32 as the inputs of the Siamese network. The two sub-networks (dashed boxes in Figure 2a) were identical in structure and shared parameters, including weights and biases. A contrastive loss function was employed to train the Siamese network.
As shown in Figure 2b, each sub-network consisted of an auto-encoder layer and a code-mix layer. The first layer contained three parallel auto-encoders corresponding to the red, green, and blue channels of the input RGB image, respectively. Similar to [22], the inputs were R, G, and B histograms, not pixel values, and they were denoted as 256-dimensional vectors: x0, x1, and x2. Because of the limited samples, training based on pixel values may lead to overfitting. Meanwhile, the histogram can also suppress detection noise. Each auto-encoder contained a forward encoder, a backward decoder, and an auto-encoding error evaluator. The encoder and decoder were fully-connected networks. The output of the encoder was a vector with 100 dimensions, and the output of the decoder was a reproduction of the corresponding input. The code-mix layer was fully connected and combined the three code vectors of the first layer to produce a feature vector with 100 dimensions as the final output. Mathematically, the sub-network can be written as:

$$
\begin{aligned}
y_m^k &= \sigma_E\left(W_E^k x_m^k + b_E^k\right), \\
\hat{x}_m^k &= \sigma_D\left(W_D^k y_m^k + b_D^k\right), \\
z_m &= \sigma_M\left(W_M \left[y_m^0; y_m^1; y_m^2\right] + b_M\right),
\end{aligned} \tag{1}
$$

where the subscript m indexes the upper (p) or lower (q) sub-network, the superscript k indexes the channel, $y$ is the code vector from an encoder, $\hat{x}$ is the reproduction of the input $x$ by the decoder, and $z$ is the final feature vector. $W$, $b$, and $\sigma$ are the weights, biases, and activation functions of the neural networks, with the subscripts E, D, and M indicating the encoder, decoder, and code-mix layer.
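The sub-network described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the sigmoid activation, the per-channel weight sets, the initialization range, and the names `init_subnetwork`, `forward`, and `siamese_forward` are all hypothetical. Only the layer sizes (256-bin histograms, 100-dimensional codes, 100-dimensional output) and the parameter sharing between the two branches come from the text.

```python
import math
import random


def sigmoid(v):
    # Element-wise logistic activation (assumed; the paper does not name one).
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def dense(W, b, x):
    # Fully-connected layer: W is a list of rows, one row per output unit.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_layer(n_out, n_in, rng):
    W = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def init_subnetwork(rng, n_bins=256, n_code=100):
    # One encoder/decoder pair per colour channel, plus one code-mix layer.
    return {
        "enc": [init_layer(n_code, n_bins, rng) for _ in range(3)],
        "dec": [init_layer(n_bins, n_code, rng) for _ in range(3)],
        "mix": init_layer(n_code, 3 * n_code, rng),
    }


def forward(params, hists):
    # hists: three 256-bin histograms (R, G, B) of one detection response.
    codes, recons = [], []
    for k in range(3):
        W_e, b_e = params["enc"][k]
        W_d, b_d = params["dec"][k]
        y = sigmoid(dense(W_e, b_e, hists[k]))      # code vector
        codes.append(y)
        recons.append(sigmoid(dense(W_d, b_d, y)))  # reproduction of the input
    W_m, b_m = params["mix"]
    z = sigmoid(dense(W_m, b_m, codes[0] + codes[1] + codes[2]))
    return codes, recons, z


def siamese_forward(params, hists_p, hists_q):
    # Both branches use the same parameter set (shared weights and biases).
    return forward(params, hists_p), forward(params, hists_q)
```

Because both branches call `forward` with the same `params`, identical inputs always produce identical feature vectors, which is the property the contrastive loss relies on.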

Loss Function and Auto-Encoding Constraint
To learn a Siamese network, a contrastive loss function was formulated based on similarity or difference measurements between the inputs of a pair. The objective was to train the network to sufficiently reduce the feature distances between pairs of the same object and to increase the feature distances between pairs of different objects. The distance of an input training pair is the Euclidean distance given in Equation (2), where $x_p$ and $x_q$ are the feature vectors from the two sub-networks in SNAC. Instead of the Euclidean distance, other measures, such as the Mahalanobis and Bhattacharyya distances, can be used.
Given a group of training samples, the loss function of SNAC to be minimized is the weighted combination of three terms, L1, L2, and L3, given in Equation (3), where α, β, and γ are weight coefficients between zero and one. The first term, L1, is a margin-based loss on the difference of sample pairs; δ is the decision margin, which satisfies 0 ≤ δ ≤ 1; and $l_{pq}$ is the sample indicator, with $l_{pq} = 1$ denoting a positive pair and $l_{pq} = 0$ denoting a negative pair. The L3 term is the regularization constraint. Deep neural networks, however, contain a large number of parameters and require huge sample sets for training. With limited online samples, the parameters of a deep model will often overfit, and the network will not work: it pays too much attention to local details of the training samples and fails to capture their general features. Therefore, inspired by the stacked auto-encoder in [24,25], the L2 term, an auto-encoding constraint (AC), was added to the loss function in Equation (3) to prevent overfitting, even when training with limited online samples.
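To make the roles of the three terms concrete, the following sketch evaluates a loss of this shape for a single training pair. The exact forms are assumptions: a standard margin-based contrastive term on the squared Euclidean distance for L1, a squared reconstruction error for the auto-encoding constraint L2, and a squared-weight penalty for L3. The function names and the default values of `alpha`, `beta`, `gamma`, and `delta` are hypothetical and are not taken from the paper.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def snac_pair_loss(z_p, z_q, label, inputs, recons, weights,
                   alpha=0.6, beta=0.3, gamma=0.1, delta=0.5):
    """Loss for one pair: label is 1 for a positive pair, 0 for a negative one.

    inputs / recons: lists of input histograms and their decoder outputs
    (all channels of both branches); weights: flat list of network weights.
    """
    d2 = squared_distance(z_p, z_q)
    # L1: pull positive pairs together, push negatives beyond the margin delta.
    l1 = label * d2 + (1 - label) * max(0.0, delta - d2)
    # L2: auto-encoding constraint, penalising poor reconstruction of the input.
    l2 = sum(squared_distance(x, r) for x, r in zip(inputs, recons))
    # L3: regularisation on the weights.
    l3 = sum(w * w for w in weights)
    return alpha * l1 + beta * l2 + gamma * l3
```

In this form the L2 term ties the code vectors to the content of the input histograms, so the network cannot reduce the pair loss by fitting a few samples with codes that no longer describe their inputs.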

Denoising through the Collection of Training Samples
For each detection response $d_i^t$ (the i-th detection in frame t), SNAC($d_i^t$) is trained to distinguish $d_i^t$ from other object detection responses in adjacent frames, not over a longer time period. The training samples of SNAC($d_i^t$) were collected online. Inspired by [11], $d_i^t$ is the only positive sample, and the remaining detection responses at frame t constitute the negative sample set. Although SNAC($d_i^t$) can be trained with a small sample set, an unbalanced sample set with only one positive sample is not sufficient to train it. To solve this problem, more samples are needed, which means additional detection responses of $d_i^t$. There is also a fundamental issue whereby detection responses are not always perfect, and their bounding boxes are often inaccurate, as shown in Figure 1a. When a noisy detection is used as a training sample, it impairs the parameters of SNAC. Since detection noise is inevitable, this error can be suppressed by generating more copies of $d_i^t$ with random noise. This noise processing also solves the positive sample shortage problem.
Detection noise was assumed to be modeled as additive noise as follows:

$$
\tilde{p} = p + n_p, \qquad \tilde{s} = s + n_s, \tag{4}
$$

where $p = (x, y)$ is the center position of the detection response, $s = (w, h)$ is the size vector of width and height, $(\tilde{p}, \tilde{s})$ is the perturbed bounding box, and $n_p$ and $n_s$ are additive noises that refer to position and size, respectively. $n_p$ and $n_s$ are assumed to follow the Gaussian distributions $G(0, \sigma_p)$ and $G(0, \sigma_s)$, where $\sigma_p$ and $\sigma_s$ are the corresponding covariances obtained by prior analysis. A group of random bounding boxes $\Psi(\{d_i^t\})$ was generated around $d_i^t$ according to Equation (4) with the distributions of $n_p$ and $n_s$. In the same way, random bounding boxes were generated around the remaining detection responses at frame t; the two groups are the positive and negative sample sets, respectively. Using these online collected samples, SNAC($d_i^t$) not only can extract discriminative features for $d_i^t$, but also can suppress detection noise.
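The sample collection scheme can be illustrated with a short sketch. It is a simplified reading of the text under stated assumptions: each noise component is drawn independently with a scalar standard deviation (`sigma_p`, `sigma_s`) instead of a full covariance, box sizes are clamped to at least one pixel, and the function names are hypothetical.

```python
import random


def jitter_boxes(box, n, sigma_p, sigma_s, rng):
    # box = (x, y, w, h): centre position and size of a detection response.
    x, y, w, h = box
    samples = []
    for _ in range(n):
        samples.append((
            x + rng.gauss(0.0, sigma_p),            # noisy centre x
            y + rng.gauss(0.0, sigma_p),            # noisy centre y
            max(1.0, w + rng.gauss(0.0, sigma_s)),  # noisy width, kept positive
            max(1.0, h + rng.gauss(0.0, sigma_s)),  # noisy height, kept positive
        ))
    return samples


def collect_samples(detections, i, n, sigma_p, sigma_s, rng):
    # Positive set: random boxes around detection i.
    # Negative set: random boxes around every other detection in the frame.
    positives = jitter_boxes(detections[i], n, sigma_p, sigma_s, rng)
    negatives = []
    for j, d in enumerate(detections):
        if j != i:
            negatives.extend(jitter_boxes(d, n, sigma_p, sigma_s, rng))
    return positives, negatives
```

Each sampled box would then be cropped from the frame and converted to R, G, and B histograms before being fed to the network.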

Iterative Tracklet Generation with SNAC by Incremental Learning
The above sections discussed the establishment and training of SNAC. Each detection response $d_i^t$ is associated with SNAC($d_i^t$), which extracts discriminative features to better distinguish $d_i^t$ from the other detections belonging to $D^{t+1}$. Moreover, connecting these originally independent networks not only increases the number of samples, but can also improve the training efficiency. On the one hand, SNAC($d_i^t$) can obtain more training samples from $d_j^{t-1}$ in the adjacent frame t − 1 through a relationship. On the other hand, with this relationship, SNAC($d_i^t$) does not need randomly initialized parameters for training, but inherits them from SNAC($d_j^{t-1}$), which reduces the training time and improves the efficiency. This relationship is the principle of tracklet linking, that is, the two detection responses in adjacent frames belong to the same object. Incremental learning of SNACs through this inheritance relationship can effectively match detection responses in adjacent frames. To generate reliable tracklets, an iterative algorithm with SNAC by incremental learning is proposed, as shown in Algorithm 1.

Algorithm 1 Iterative tracklet building with SNAC by incremental learning.
Input: $D = \{D^1, D^2, \ldots, D^t\}$, the detection set of each frame
Output: $T^t = \{T_k^t\}$, the tracklet set up to frame t
1: Initialization: t = 1, $T^1 = \emptyset$
2: for each $d \in D^1$ do
3:     Initialize $F_k^1$ with random parameters
5:     Train $F_k^1$ with P and N
7: end for
8: while t ≥ 2 do
9:     for each $T_k^{t-1} \in T^{t-1}$ and each $d \in D^t$ do
10:         Compute $\Lambda_a(T_k^{t-1}, d)$ as Equation (6)
11:         Compute $\Lambda(T_k^{t-1}, d)$ as Equation (5)
12:     end for
13:     For all $\Lambda(T_k^{t-1}, d)$ meeting the link requirement, select
14:     pairs of $T_k^{t-1}$ and d by the Hungarian algorithm.
15:     $T^t$ = renewed $T^{t-1}$ by linking the selected pairs.
16:     for each $T_k^t \in T^t$ having a new detection added do
18:         incrementally trained with P and N
21:     end for
23:     for each $d \in D_R^t$ do
24:         Add a new single member tracklet $T_k^t = d$,
25:         and set its $F_k^t$ as above.
26:     end for
27: end while

At the first frame t = 1, a new tracklet $T_i^1$ was established for each single detection $d_i^1$ in $D^1$, and the current total number of tracklets was $N^1$. To match the detection response belonging to the same object in the next frame (or to establish that none exists), a randomly initialized network, SNAC($d_i^1$), was associated with $T_i^1$ and used to compute the appearance similarity $\Lambda_a(T_i^1, d_j^2)$. Together with the position similarity $\Lambda_p(T_i^1, d_j^2)$ based on position and size, the total similarity $\Lambda(T_i^1, d_j^2)$ can be calculated. When the similarities of all detection responses in Frame 1 had been calculated, the Hungarian algorithm was used to determine whether there was a $d_j^2$ that could be combined with $T_i^1$. If $d_i^1$ and $d_j^2$ belong to the same object, $d_j^2$ joins $T_i^1$, and tracklet $T_i^1$ is updated to $T_i^2$. Otherwise, a new tracklet $T_{N^1+1}^2$ of $d_j^2$ is generated. Then, the processing moved on to Frame 2, and the SNACs of tracklets that contained detection responses in Frame 2 needed to be trained. Taking $T_i^2$ as an example, its last element was $d_j^2$. If $d_i^1$ exists as the former element of $d_j^2$ in tracklet $T_i^2$, the initial parameters of SNAC($T_i^2$), which is the same network as SNAC($d_j^2$), are inherited from the trained SNAC($d_i^1$). In addition, the positive and negative training sets can be expanded with the samples of SNAC($d_i^1$). Training of SNAC($T_i^2$) can be done with fewer iterations in this incremental manner. If $T_i^2$ is a newly added tracklet that only contains $d_j^2$, SNAC($T_i^2$) is trained in the same way as SNAC($d_i^1$). Finally, all reliable tracklets T are produced frame by frame. Next, the calculation of similarities between a tracklet and a detection response is explained.
The total similarity $\Lambda(T_k^{t-1}, d_j^t)$ between a tracklet and a detection response is given by Equation (5). The appearance similarity $\Lambda_a(T_k^{t-1}, d_j^t)$ was computed from the distance between the feature vectors output by SNAC($T_k^{t-1}$). It is given by Equation (6), where $T_k^{t-1}(e)$ denotes the end element of tracklet $T_k^{t-1}$, $F_k^{t-1}$ denotes the output feature vector of the SNAC for tracklet $T_k^{t-1}$, and g is a probability function on the squared distance of feature vectors. Because of the margin-based loss of SNAC, the function g is defined as in Equation (7), where δ is the decision margin given in the loss function of Equation (3).
Overlap is widely used to describe the positional relationship between detections. It takes both the coordinates and the size into account. The overlap $\Lambda_o(T_k^{t-1}, d_j^t)$ is given by Equation (8), where $A$ is the area function on a detection response and $A_\cap$ is the area function on the intersection of two detection responses.
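As an illustration of an overlap measure built from the two area functions, the sketch below computes the common intersection-over-union ratio for boxes given as center and size. Normalizing by the union is an assumption made here for illustration; the text only states that the measure is built from the area of each detection and the area of their intersection. The function names are hypothetical.

```python
def area(box):
    # box = (x, y, w, h): centre position and size.
    _, _, w, h = box
    return max(0.0, w) * max(0.0, h)


def intersection_area(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlapping rectangle (negative if disjoint).
    iw = min(ax + aw / 2, bx + bw / 2) - max(ax - aw / 2, bx - bw / 2)
    ih = min(ay + ah / 2, by + bh / 2) - max(ay - ah / 2, by - bh / 2)
    return max(0.0, iw) * max(0.0, ih)


def overlap(a, b):
    inter = intersection_area(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0
```

For a tracklet, `a` would be the bounding box of its end element and `b` the candidate detection in the next frame.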

Overall Framework
Based on SNAC, a tracking framework following TBD was established to solve the MOT problem. A TBD scheme can be described as solving an MAP problem by:

$$
T^{*} = \arg\max_{T} P(T \mid D), \tag{9}
$$

where $D$ is the set of given detection responses and $T$ is the set of trajectories. In the framework, tracklets were first generated. Because a tracklet is an ordered combination of detection responses, higher-order features can be extracted from it to better describe the relations between objects. Then, the problem can be converted into a more reliable tracklet association as follows:

$$
T^{*} = \arg\max_{T} P(T \mid \mathcal{T}), \tag{10}
$$

where $\mathcal{T}$ is the set of all tracklets. The whole framework is shown in Figure 3. First of all, the inputs were checked, and malformed detection responses, such as bounding boxes that were too large or too small, were deleted. SNAC was used to extract discriminative appearance features for detection responses. The online SNAC incremental learning method described above was used to generate reliable tracklets. The next step was to generate tracking results through tracklet association. Similar to the learning-based detection association, SNAC was extended to extract a new discriminative composite feature, PAN, for the tracklet instead of using traditional handcrafted methods. To enhance tracklet association, the tracklet growing module was embedded to extend tracklets as far as possible. With the discriminative PAN feature, tracklet association was converted to a linear programming problem that was solved by an efficient greedy iterative algorithm, and the final trajectories were obtained. For real-time tracking, the whole tracking process was carried out in sliding time windows.

Figure 3. Illustration of the overall online tracking by detection (TBD) framework. In addition to standard inputs and outputs, an online tracking framework is established with new facilities, including an iterative Siamese network with an auto-encoding constraint (SNAC) to learn the detection responses, previous-appearance-next (PAN) to represent the composite features of tracklets, and pre-processing of tracklet growth to cope with short-time detection failures. Finally, a greedy iterative algorithm is used to output robust trajectories in sliding windows.

Previous-Appearance-Next Feature of the Tracklet
A tracklet $T_m^{t2} = \{\ldots, d_k^{t2}\}$ is an ordered sequence of detection responses that represents a moving object over a short time from frame t1 to frame t2. To describe $T_m^{t2}$, appearance and motion are indispensable. They are often assumed to be independent of each other in several studies [12,21,36], so the similarity between two tracklets can only be expressed as a weighted sum of the two. To increase the flexibility and discrimination, a composite previous-appearance-next (PAN) feature was proposed. The new feature combined appearance and motion for the tracklet, and it was extracted jointly by an improved SNAC.
Taking $T_m^{t2}$ and $T_n^{t4}$ as examples, as shown in Figure 4b, $T_n^{t4}$ spans frames t3 to t4, and t2 < t3. To calculate the similarity between $T_m^{t2}$ and $T_n^{t4}$, it is better to use the tail part of $T_m^{t2}$ and the head part of $T_n^{t4}$ rather than their whole information. $T_m^{t2}(e)$ is the last element of tracklet $T_m^{t2}$, and $T_n^{t4}(s)$ is the first element of $T_n^{t4}$. The PAN($T_m^{t2}(e)$) vector integrated the appearance and the previous and next stage motions of $T_m^{t2}(e)$ to express the composite feature of the tail part of tracklet $T_m^{t2}$. Correspondingly, the PAN($T_n^{t4}(s)$) vector was defined for the composite feature of the head part of $T_n^{t4}$. The next stage motion of the tail of $T_m^{t2}$ and the previous stage motion of the head of $T_n^{t4}$ were computed by estimation methods. The SNAC for detection responses was revised to extract the PAN(.) vectors of tracklets. The new structure is shown in Figure 4a. The previous and next stage motions were used as additional inputs to the code-mix layer. The first layer of the new SNAC was the same as that of the old SNAC. $\Delta_p = (x_p, y_p)$ and $\Delta_n = (x_n, y_n)$ are the previous and next motion vectors of $T_m^{t2}(e)$, respectively. As shown in Figure 4b, $\Delta_p$ represents the x and y axis displacements of $T_m^{t2}$ from t2 − 1 to t2. For the next stage motion vector, $T_m^{t2}(e + 1)$, the estimation of $T_m^{t2}$ in frame t2 + 1, was computed first, and then $\Delta_n$ of $T_m^{t2}(e)$ was calculated.
Since $\Delta_p$ and $\Delta_n$ are two-dimensional vectors of displacements in the x and y directions, whereas the output of each auto-encoder in the first layer of SNAC is a 100-dimensional feature vector, they are totally different in type and cannot simply be used together. Meanwhile, detection noise makes deterministic motion descriptions inaccurate. A distribution description method was therefore proposed to represent the motion instead of specific values. Each displacement is assumed to follow a Gaussian distribution; the x-axis displacement $x_p$ of $\Delta_p$, for instance, is described by $G(x_p, \sigma_x)$, where $\sigma_x$ is set by pre-training. Likewise, $G(y_p, \sigma_y)$ describes the y displacement. The distribution description was given by sample vectors of $G(x_p, \sigma_x)$ and $G(y_p, \sigma_y)$, and its length was taken to be equal to that of the appearance vector because, in MOT, the motion feature is as important as appearance. The distribution description for $\Delta_n$ can be obtained in the same way. Then, they were merged with the three outputs of the first layer to form one mixed vector for the second-layer training.
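A minimal sketch of the distribution description follows. The layout of the sample vector is an assumption: the first half holds samples of the x displacement and the second half samples of the y displacement, with a total length of 100 to match one appearance code. The function names and the order in which the vectors are concatenated are likewise hypothetical.

```python
import random


def motion_description(dx, dy, sigma_x, sigma_y, length=100, rng=None):
    # Represent a displacement (dx, dy) by samples of G(dx, sigma_x) and
    # G(dy, sigma_y) instead of by two deterministic numbers.
    if rng is None:
        rng = random.Random()
    half = length // 2
    xs = [rng.gauss(dx, sigma_x) for _ in range(half)]
    ys = [rng.gauss(dy, sigma_y) for _ in range(length - half)]
    return xs + ys


def mix_layer_input(codes, prev_motion, next_motion):
    # Concatenate the three channel codes with the two motion descriptions
    # to form the single mixed vector fed to the second layer.
    mixed = []
    for c in codes:
        mixed.extend(c)
    return mixed + prev_motion + next_motion
```

With three 100-dimensional codes and two 100-dimensional motion descriptions, the mixed vector in this sketch has 500 entries, so appearance and motion contribute inputs of comparable size.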
Then, SNAC($T_m^{t2}(e)$) was trained to extract the tail PAN($T_m^{t2}(e)$) feature. The training samples of SNAC($T_m^{t2}(e)$) were also collected online. Similar to [11], elements in $T_m^{t2}$ are positive samples, and elements of tracklets that overlap with $T_m^{t2}$ in time are negative samples. The parameters of the first layer were inherited from the corresponding detection SNAC. After training SNAC($T_m^{t2}$), discriminative local composite features can be extracted to distinguish $T_m^{t2}$ from other subsequent tracklets. As shown in Figure 4b, the similarities between tracklets $T_m^{t2}$ and $T_n^{t4}$ were computed. After training, PAN($T_m^{t2}(e)$) and PAN($T_n^{t4}(s + 1)$), as shown by the blue dashed circle areas in the figure, were extracted. Then, the forward similarity was obtained as in Equation (11). To get a reliable similarity, the backward relationship was also computed, as shown in Equation (12).
The final similarity was given by:

$$
\Lambda_{PAN}(T_m, T_n) = g\left(\min\left(S_{m,n}^{F}, S_{m,n}^{B}\right)\right), \tag{13}
$$

where g is the probability function for the distance of feature vectors, as defined in Equation (7).

Tracklet Growing
If the frame gap between $T_m^{t2}$ and $T_n^{t4}$ was small, the variations in appearance and motion from $T_m^{t2}$ to $T_n^{t4}$ were not obvious, and PAN could work well. Otherwise, a long frame gap brought large variations in appearance and motion, which may reduce the performance. PAN considers the local elements of a tracklet to enhance the performance. In order to make tracklet association more reliable, it is effective to reduce the time interval between tracklets in the sliding windows as much as possible. Therefore, the tracklet growing process was used to extend each tracklet with estimated bounding boxes for the frames in which detections were missing. It contained forward and backward growth.
To extend tracklet $T_m^{t2}$ forward, the center position $p_1^f(T_m^{t2}) = (\hat{x}, \hat{y})$ in frame t2 + 1 was first estimated by quadratic fitting. Then, the optimal estimated bounding box was searched for according to Equation (14), where C is the set of candidate bounding boxes, whose center positions x and y are sampled according to the distribution $G(0, \sigma_m)$ and whose size is equal to that of $T_m^{t2}(e)$, and H denotes the color histogram of detection $T_m^{t2}(e)$. The goal was to find the most similar estimation. If the optimal estimation $d_o^{t2+1}$ was found, a conflict check was also required to avoid false alarms. If the overlap between $d_o^{t2+1}$ and an existing $d_i^{t2+1}$ exceeded the threshold, the forward growth of $T_m^{t2}$ stopped. Otherwise, $T_m^{t2}$ was updated to $T_m^{t2+1}$ with $d_o^{t2+1}$, and the growing process continued to frame t2 + 2. The backward extension was similar to the forward process. For isolated tracklets, random sampling was used to form the candidate estimations. After these missing detection compensation processes, tracklets were extended to improve the discrimination performance of PAN, and more reliable associations could be made.
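One forward growth step can be sketched as follows. Several simplifications are assumed: the quadratic fit is replaced by the exact parabola through the last three centers (a least-squares fit over more points would serve the same purpose), histogram intersection stands in for the unspecified histogram similarity, and `hist_of` is a hypothetical callback that returns the color histogram of a candidate box in the next frame.

```python
import random


def predict_center(centers):
    # Extrapolate the next centre from the last three centres, one per frame.
    # For a quadratic f sampled at unit steps: f(3) = f(0) - 3 f(1) + 3 f(2).
    (x0, y0), (x1, y1), (x2, y2) = centers[-3:]
    return (x0 - 3 * x1 + 3 * x2, y0 - 3 * y1 + 3 * y2)


def hist_intersection(h1, h2):
    # Similarity of two normalised histograms (1.0 means identical).
    return sum(min(a, b) for a, b in zip(h1, h2))


def best_candidate(center, size, ref_hist, hist_of, sigma_m, n, rng):
    # Sample n candidate boxes around the predicted centre, keep the tail
    # element's size, and return the one whose histogram best matches ref_hist.
    best_box, best_score = None, -1.0
    for _ in range(n):
        box = (center[0] + rng.gauss(0.0, sigma_m),
               center[1] + rng.gauss(0.0, sigma_m),
               size[0], size[1])
        score = hist_intersection(ref_hist, hist_of(box))
        if score > best_score:
            best_box, best_score = box, score
    return best_box, best_score
```

The returned box would still have to pass the conflict check against existing detections in frame t2 + 1 before it is appended to the tracklet.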

Tracklet Association in Sliding Windows
Tracklet association was the last module in MOT and generated the final trajectories of objects. The main task was to link tracklets belonging to the same object into a complete trajectory based on the similarities among tracklets. Solutions such as min-cost networks, energy minimization, successive shortest paths, and the Hungarian algorithm are widely used to generate tracking results. Global optimization is an ideal scheme because previous judgments can be revised to achieve the overall optimal result. In cases where it is difficult to distinguish objects, this dynamic scheme can achieve better tracking performance than a greedy strategy. Similar to the learned feature extraction tracking method in [15], network flow methods were not used here to obtain the tracking result. Instead, the MAP problem shown in Equation (10) was directly mapped to a generalized linear assignment, Equation (15). To solve Equation (15), the similarity $\Lambda(T_i, T_j)$ between tracklets was used; it is equal to the linking probability and is mainly based on PAN features. $\Lambda(T_i, T_j)$ was computed by Equation (13). However, PAN features cannot be extracted from tracklets with lengths of less than two elements. For this particular case, $\Lambda(T_i, T_j)$ degenerated into the traditional weighted combination of appearance and motion. $L_{ij}$ is the association indicator, where 1 indicates connection and 0 means disconnection. The constraints guaranteed the uniqueness of association. Because the PAN feature is more discriminative, the similarity matrix $\Lambda$ was normalized, and Equation (15) was solved by a greedy iterative algorithm.
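A greedy iterative solution of this kind of assignment can be sketched as follows. The text does not give the exact iteration, so this is only one plausible form: the function name `greedy_association`, the stopping threshold `min_similarity`, and the convention that entry (i, j) holds the similarity for linking tracklet i to a later tracklet j (with 0 for pairs that are not in temporal order) are assumptions made for illustration.

```python
import numpy as np

def greedy_association(similarity, min_similarity=0.5):
    """Repeatedly link the most similar remaining pair (i -> j).

    Each tracklet gets at most one successor and at most one predecessor,
    which mirrors the uniqueness constraints on the indicator L_ij.
    """
    sim = np.array(similarity, dtype=float)  # copy, the input is not modified
    np.fill_diagonal(sim, -np.inf)           # a tracklet cannot link to itself
    links = []
    while True:
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        if sim[i, j] < min_similarity:
            break                            # no sufficiently similar pair left
        links.append((int(i), int(j)))
        sim[i, :] = -np.inf                  # tracklet i now has a successor
        sim[:, j] = -np.inf                  # tracklet j now has a predecessor
        sim[j, i] = -np.inf                  # forbid the reverse link
    return links

# Three tracklets: 0 is followed by 1, and 1 is followed by 2.
example = [[0.0, 0.9, 0.2],
           [0.0, 0.0, 0.8],
           [0.0, 0.0, 0.0]]
links = greedy_association(example)
```

Chaining the returned links, here (0, 1) and (1, 2), yields one complete trajectory. Unlike a global solver, the greedy loop never revises an earlier link, so it relies on the similarity values being discriminative enough that the largest remaining entry is a correct association.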

Experiments
In this section, the performance of SNAC is first evaluated on detection responses and tracklets. Then, the proposed MOT system is tested on the MOT Challenge Benchmark [37].

Evaluation of SNAC
In the MOT system, the SNAC was proposed to extract discriminative features for detection responses and tracklets instead of handcrafted methods. Discrimination and accuracy were used as the main indicators to evaluate the performance of SNAC. The effects of the histogram input and the auto-encoding constraint were also evaluated. Following the order of the system framework, the performance of SNAC was first evaluated on detection responses and then on tracklets. Since current public platforms do not provide annotation data for tracklets, making a fair comparison with other methods is difficult. Therefore, the performance of SNAC was mainly compared across different constraints and against handcrafted methods. In this experiment, the training processes of SNAC were carried out on graphics processing units (GPUs).

SNAC for Detection Responses
During tracklet generation, an $SNAC(d_i^t)$ was established for each detection response $d_i^t$ to implement the explicit frame-by-frame association. Through an online learning process, $SNAC(d_i^t)$ was able to extract features for $d_i^t$ and $D^{t+1}$. Then, the similarity between $d_i^t$ and each detection of $D^{t+1}$ could be obtained from the Euclidean distance. The statistical discrimination and variance of $SNAC(d_i^t)$ can be calculated from these similarities. Discrimination reflects the strength of the distinguishing ability, and variance represents the robustness. To generate the tracklet set in sliding temporal windows, each $SNAC(d_i^t)$ was trained by an incremental learning algorithm. The indicators of discrimination and variance were computed from the overall results. Another important indicator in evaluating the SNAC is the tracklet accuracy (TA). To compute TA, tracklets were treated as the final tracking results in a time window, so the metrics of MOT [38] could be used to evaluate the accuracy of tracklets. In this case, the core indicator MOTA was equal to the TA in Equation (15):
$$TA = 1 - \frac{\sum_t \left(FN_t + FP_t + IDs_t\right)}{\sum_t GT_t}$$
where t is the frame index in the current time window; FN, FP, and IDs are the number of false negatives, false positives, and mismatches, respectively; and GT is the number of ground truth tracklets annotated by us in this experiment. Three subsequences of the 2DMOT2015 dataset were chosen for this experiment. TUD-Crossing is a static camera scene, whereas ETH-Jelmoli and ETH-Linthescher are moving camera sequences. Three time windows were selected from each sequence to create a total of nine video segments for the experiment, and the GTs of the nine video segments were annotated.
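As a small worked example of the TA indicator, the sketch below assumes that TA takes the standard MOTA form from [38], that is, one minus the errors (false negatives, false positives, and mismatches) summed over the frames of the window, divided by the summed ground truth counts. The function name and the per-frame list interface are hypothetical.

```python
def tracklet_accuracy(fn, fp, ids, gt):
    """TA over one time window from per-frame error and ground truth counts.

    fn, fp, ids, gt are sequences with one entry per frame t:
    false negatives, false positives, mismatches, and ground truth count.
    """
    if not (len(fn) == len(fp) == len(ids) == len(gt)):
        raise ValueError("all per-frame sequences must have the same length")
    total_gt = sum(gt)
    if total_gt == 0:
        raise ValueError("TA is undefined without ground truth objects")
    errors = sum(fn) + sum(fp) + sum(ids)
    return 1.0 - errors / total_gt

# Three frames with 20 ground truth objects in total and 2 errors.
ta = tracklet_accuracy(fn=[1, 0, 0], fp=[0, 1, 0], ids=[0, 0, 0], gt=[5, 5, 10])
```

Here `ta` evaluates to 1 - 2/20 = 0.9. Because the errors are not bounded by the ground truth count, the value can become negative when false positives dominate.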
As shown in Table 2, SNAC_L2 was compared with the original SN with the L2 regularization constraint (SN_L2), with SNAC_L2(pixel), which takes raw pixels as input, and with the RGB and HOG histogram methods. The comparison of SNACs with traditional methods is discussed first. In Table 2, the red number in each column represents the best performance. Compared with the RGB and HOG histogram methods, the average discriminations of the SNACs were clearly superior, implying that the SNACs distinguished objects better than the traditional RGB and HOG histogram methods. The lower variances of the HOC and HOG methods were due to their lower discrimination. TA curves are shown in Figure 5. The TA values varied with the appearance threshold. From Figure 5, it can be seen that the SNAC methods were clearly better than HOC and HOG over a large range of thresholds. This means that the SNACs were more robust. The value of TA was one when the appearance threshold was zero in these nine testing video experiments. To simplify the labeling work and clearly identify the relationships among objects, the nine segments were relatively simple videos with no complex interactions between objects. Thus, detections could be correctly associated through overlap relationships alone. However, position and size information alone is not sufficient in a complex environment; appearance is an essential factor in tracklet generation. Table 2 and Figure 5 show that when a histogram was used as input, SNAC_L2 and SN_L2 were superior to the method with raw pixels as the input for all indicators. This implies that using the histogram as input was more robust and better at suppressing detection noise. In the comparison between SNAC_L2 and SN_L2, no significant differences in TA or average discrimination were found, but the discrimination variance of SNAC_L2 was lower. The auto-encoding constraint was thus shown to enhance the robustness of SNAC and help it adapt to various environments.

Figure 5. Nine video sequences were sampled from the 2D MOT 2015 dataset and annotated. The abscissa indicates the appearance threshold from 0 to 1, and the ordinate represents the TA up to 100. These curves show that learned features are better than traditional methods at distinguishing objects in multiple object tracking (MOT). The auto-encoding constraint (AC) term and the histogram inputs proposed in this paper also showed reasonable results.

SNAC for Tracklets
To improve the reliability of tracklet association, SNAC was extended to distinguish tracklets, and its performance is evaluated in this section. To provide fair comparisons, the average discriminations of PAN features and hand-crafted methods were evaluated. Six testing video sequences were selected from the 2D MOT 2015 dataset, and the tracklets generated in a time window were annotated for this experiment. The discrimination was calculated from the GT of the tracklets, as shown in Table 3. In each sequence, the discrimination was significantly enhanced from the appearance feature to the composite PAN feature. Thus, PAN can effectively integrate appearance and motion to enhance discrimination.

Evaluation of the MOT System
In this section, the whole MOT system is evaluated on the MOT Challenge Benchmark, using the 2D MOT 2015 dataset for testing. The evaluation metrics are given by [38]. Multiple object tracking accuracy (MOTA) combines false positives, missed targets, and identity switches. Multiple object tracking precision (MOTP) indicates the misalignment between GTs and tracked bounding boxes. Mostly tracked targets (MT) is the ratio of GTs that are covered by a track hypothesis for at least 80% of their respective life span. Mostly lost targets (ML) is the ratio of GTs that are covered by a track hypothesis for at most 20% of their respective life span. FP and FN are the total numbers of false positives and missed targets, respectively. ID switch (IDs) is the total number of identity switches. Frag is the total number of times a trajectory is fragmented. The proposed MOT system was developed with the Theano library [39] in a Python environment. The workstation was equipped with a 4.0-GHz CPU and an NVIDIA GeForce GTX 1070 GPU.
The proposed MOT system was tested on the benchmark and compared with closely related works and state-of-the-art MOT methods, including those using traditional features [8,10,40,41], learning features [17,22,23,31,42,43], and higher order motion information [44]. The experimental results are listed in Table 4.

Table 4. Performance comparison of multiple object tracking (MOT) systems. Red represents the best. The upward arrow indicates the higher the better, and the downward arrow means the lower the better. MOTA, multiple object tracking accuracy; MOTP, multiple object tracking precision; MT, mostly tracked; ML, mostly lost; Frag, the total number of times a trajectory is fragmented.

The results for the MOT 2015 dataset showed that the proposed MOT system using SNAC obtained a better MOTA than the other competitors listed in Table 4. The proposed method showed a comprehensive performance improvement compared with the hand-crafted feature methods CEISP and DP_NMS. This means that online learned features can better distinguish among targets and complete data association than traditional hand-crafted methods. The comparison with MOT systems based on deep neural network features shows that learned features are suitable for MOT applications. A higher MT indicates that tracklet growth can extend short tracklets and enhance the PAN feature, making object trajectories as complete as possible. The lower ML also benefits from the tracklet growing module. Tracklet growth also has disadvantages: inaccurate detection compensation increases FP and FN, reduces MOTP, and degrades the performance of PAN, which results in more IDs. Further improvement is needed in this area. On specific indicators such as MT and ML, the proposed method was superior to several deep learning methods, especially the related deep Siamese network methods [17,22,23].
This implies that the online learned feature extraction method, which collects samples only from the current scene, can describe objects accurately and distinguish objects robustly. A feature extraction method with a simple structure and online training is useful for MOT. Although the proposed method was still no better than the state-of-the-art methods detailed in [37], the results suggest that a pure online solution is possible in terms of time and performance, although this needs to be confirmed by further research. Figure 6 shows some tracking results of the proposed method on the 2D MOT 2015 dataset. For the static camera cases in Figure 6a-e and the upper part of Figure 6f, the tracking results showed good performance. In Figure 6a, two pedestrians who are close together and similar in appearance walk side by side. This is a difficult situation in MOT, as their trajectories are likely to interfere with each other and produce false tracking results. With the help of discriminative features, the proposed method tracked them correctly. Figure 6d shows that the method can robustly track targets with complex movements. Although the scenes in the lower part of Figure 6f and in Figure 6g-i were difficult due to camera motion, the proposed method still worked properly and correctly distinguished objects. The execution efficiency of the proposed method is shown in Table 5. As the execution efficiency of the MOT methods tested on the MOT Challenge Benchmark was not measured officially but reported by the authors themselves, it is hard to make fair comparisons. Multiple object tracking is a system that includes tracklet generation, tracking model establishment, tracklet association, trajectory generation, and other specific modules. Table 5 therefore reports the runtime performance of the main modules in the proposed MOT system, which allows a module-level analysis.
In the proposed MOT system, tracklet generation, tracklet association, and tracking result generation were executed on a 4.0-GHz CPU, and detection training and tracklet training were run on an NVIDIA GeForce GTX 1070 GPU. From Table 5, the efficiencies of tracklet generation and trajectory generation basically met the real-time requirements. However, the training of SNAC consumed much time and reduced the efficiency of the whole MOT system. The main reason was that the code was written only for functional evaluation and had not been optimized for running efficiency. In addition, the hardware was not an engineering-grade graphics card. Further work will be carried out on a real-time implementation of the proposed MOT framework.

Table 5. Specific execution efficiency of the proposed MOT system. Time consumption (C) and execution efficiency (E) of the whole MOT system and its main modules are calculated.

Conclusions
In this paper, an SNAC method has been presented to better distinguish objects for MOT. The online learned SNAC can work well in noisy and small sample environments. An incremental learning SNAC algorithm was proposed to generate reliable tracklets. SNAC was also extended to extract a PAN feature that combines appearance and motion for distinguishing tracklets. Tracklet growth was used to compensate for missing detections to improve the association.
Two sub-experiments were designed to evaluate the performance of SNAC and the PAN feature. The experimental results showed that SNAC could extract discriminative features from detection responses and better distinguish them. Meanwhile, PAN significantly improved discrimination over the appearance feature extracted by SNAC alone and could better carry out tracklet association. The whole tracking system was evaluated on the 2D MOT 2015 dataset, and the results were compared with state-of-the-art methods, showing a comparable performance. The experiments showed that this kind of pure online feature extraction solution is suitable for MOT.
Further research includes two aspects. One is combining more useful information to improve the proposed feature extraction method and better distinguish objects for MOT. The other is improving the efficiency of the proposed method to achieve real-time tracking.
Funding: This research was funded by the National Natural Science Foundation of China under Grant 61671126.

Acknowledgments: The authors would like to acknowledge the Multiple Object Tracking Benchmark platform for providing fair comparative experimental data.

Conflicts of Interest: The authors declare no conflict of interest.