Multiple Object Tracking for Dense Pedestrians by Markov Random Field Model with Improvement on Potentials

Pedestrian tracking in dense crowds is a challenging task, even when using a multi-camera system. In this paper, a new Markov random field (MRF) model is proposed for the association of tracklet couplings. Equipped with a new potential function improvement method, this model can associate the small tracklet coupling segments caused by dense pedestrian crowds. The tracklet couplings in this paper are obtained through a data fusion method based on image mutual information. This method calculates the spatial relationships of tracklet pairs by integrating position and motion information, and adopts the human key point detection method for correction of the position data of incomplete and deviated detections in dense crowds. The MRF potential function improvement method for dense pedestrian scenes includes assimilation and extension processing, as well as a message selective belief propagation algorithm. The former enhances the information of the fragmented tracklets by means of a soft link with longer tracklets and expands through sharing to improve the potentials of the adjacent nodes, whereas the latter uses a message selection rule to prevent unreliable messages of fragmented tracklet couplings from being spread throughout the MRF network. With the help of the iterative belief propagation algorithm, the potentials of the model are improved to achieve valid association of the tracklet coupling fragments, such that dense pedestrians can be tracked more robustly. Modular experiments and system-level experiments are conducted using the PETS2009 experimental data set, where the experimental results reveal that the proposed method has superior tracking performance.


Introduction
Video multiple object tracking (MOT) is widely used in computer vision research applications, including video surveillance, traffic detection, and robotic assistance. With developments [1][2][3][4][5] in object detection technology, tracking by detection (TBD), such as in [6][7][8][9], has become a common tracking strategy. This tracking scheme performs data association based on the appearance and motion characteristics of detected information to obtain the complete trajectories of objects.
One of the major challenges in TBD is object occlusion. In a single-camera scene, trajectory estimation and data association [10,11] are used to deal with occlusion; however, frequent long-term occlusions cased by dense pedestrians scene may have a large effect and result in a significant reduction in tracking performance. In a multi-camera system with overlapping fields of view, some kinds of occlusions can be effectively solved by cross-view data fusion [12][13][14][15]. Figure 1a-c presents the first, second, and third view video frame images, respectively, of the 233rd frame of the PETS2009 experimental data set S2.L3. The man in yellow vest (displayed in the dashed yellow bounding box) is occluded by the crowd in the second view; however, he is visible in the first and third views (solid yellow bounding box). By data fusion of the first and third views, this kind of occlusion problem can be effectively solved. However, one of the most difficult challenges for multi-camera systems [12][13][14] is the tracking of dense pedestrians, in which several views or all views are largely or partially occluded. As illustrated in Figure 1, the object represented by the blue bounding box is partially occluded in Figure 1a and completely occluded in Figure 1c; similarly, the objects represented by red, yellow, and green are also occluded to varying degrees. These occlusions lead the target detection information to be highly inaccurate, resulting in large errors in the 3D reconstruction information of the same target from different cameras (as illustrated in Figure 1d), which can lead to errors in multi-view data fusion. In addition, due to the group motion characteristics of dense pedestrians, the complete occlusion points typically change with changes in time and position, which may result in a large number of short tracklets when the trajectory fragments are established. With insufficient information, these short tracklets cannot provide reliable features of the objects, thus resulting in a decline in data association performance. The two above-mentioned problems caused by dense object occlusions both have a large impact on multi-view multiple object tracking performance.
In this paper, the multi-camera tracking system builds cross-view tracklet couplings with a new data fusion method, and links them by association algorithm based on a new Markov random field (MRF) model. To address the problem of inaccurate detection caused by frequent occlusions in dense crowds, the human key points detection method [5] is used to improve the object positions. Then, two-dimensional (2D) tracklets are generated in each view and are reconstructed in three dimensions using camera parameters. A proposed data fusion method is used to calculate the spatial similarity of the cross-view tracklets, based on image mutual information. This method takes into account the position and motion relationships between two tracklets. The proposed MRF model uses the link candidates of two tracklet couplings as the observation node and their internal link state as the implicit node. The MRF model contains a new potential function improvement method for dense pedestrian crowd scenarios, including assimilation as well as extension processing and a message selective belief propagation (MSBP) algorithm. The former enhances the information of the short tracklet coupling and expands it through information sharing. The latter prevents the unreliable messages of short tracklet couplings from spreading throughout the network. The potentials of the model are improved with the help of iterative belief propagation processing [16,17]. Consequently, an effective association of the tracklet coupling fragments is achieved and improved tracking performance of dense pedestrian crowds obtained.
The main contributions of this paper are as follows.
(1) We propose a cross-view data fusion method based on image mutual information. Together with human key point optimization, it can generate more reliable tracklet couplings. (2) An MRF model is constructed and the potential function improvement method is proposed for better association of short cross-view tracklet couplings. (3) We construct a complete multi-view MOT system, which is tested on public data sets containing dense pedestrian scenarios and achieves favorable results.
The rest of this paper is organized as follows. Related works in Section 2; generation of cross-view tracklet coupling is described in Section 3; system framework and Markov random field model for data association are introduced in Section 4; experiments are shown in Section 5; Section 6 is the discussion and conclusion; the last section is the Appendix A.

Related Works
TBD is an effective solution in MOT, the main task of which is to determine the complete trajectories of objects through data association and estimation processing of the detected information. Multi-view MOT with overlapping fields of view has been designed to improve tracking performance by performing multi-view data fusion. However, a multi-view tracking system is more complex than a single-view system, and much research [18][19][20][21] has been conducted on this topic. In this section, we discuss multi-view data fusion and association.
Berclaz et al. [22] designed a probabilistic occupancy map [23] for multi-camera tracking systems to achieve the 3D reconstruction of multi-view detection information. They modeled data association as a linear programming problem and proposed the k-shortest paths algorithm. Dockstader et al. [18] constructed a complete multi-view MOT system and used a Bayesian belief network to accomplish multi-view data fusion. In addition, they adopted a Kalman filter to estimate the trajectories. For tracking in dense pedestrian crowds, Eshel et al. [19] established a multi-camera system with overlapping views. Placing the cameras at higher positions facilitated capturing the heads of objects, and robust tracking in a dense crowd was achieved based on head detection. In [15] and [20], thorough discussions of the reconstruction before tracking and tracking before reconstruction frameworks and proposed improvement measures were provided. Leal-Taixé et al. [12] proposed a global optimization scheme, which establishes a multi-layer graph model based on reconstruction matching and data association and solves it based on a multi-commodity flow algorithm. It takes the distance metric into consideration in 3D reconstruction and adopts the metric function proposed in [24] to convert the absolute distance into a probability. This function decreases gently within the threshold and declines rapidly when the distance is greater than the threshold, which may improve the performance of the system against detection noise. Hofmann et al. [13] established a multi-view network flow tracking model to simultaneously describe multi-view information reconstruction and time-domain data association. In addition, they added multi-view reconstruction to the network flow graph as an additional constraint. Wen et al. [14] constructed a global hypergraph model to describe multi-view reconstruction and tracking, taking into account high-order dependencies among nodes in addition to simple domain relationships. Duanmu et al. [25] proposed a multi-view MOT system, which generated tracklets in each view and used the graph matching method to solve the cross-view association problem. Nie et al. [26] proposed a general tracking framework for single-view and multi-view systems, which transformed the data association of tracklets into graph matching problems. Nithin et al. [27] proposed a grammar model with stochastic attributes to improve cross-view tracking performance using complementary and distinguishing attributes. In the association framework, Liu et al. [28] modeled the association of tracklets as a combination optimization problem based on appearance motion and geometric information, while considering the long-term and short-term occlusion problems and improving system efficiency.
In [13,14,25,26], Euclidean distance metrics have been used as cross-view metrics. Due to the ubiquitous detection noise caused by object occlusion, errors can occur during 3D reconstruction in single-view object detection. It is common to use the absolute distance directly to calculate the 3D reconstruction differences between detections at different views. However, this method is too sensitive to detection noise and may cause matching errors in difficult situations. Leal-Taixé et al. [12] considered this factor and used the Gaussian error function for calculation, which can improve the robustness of the matching. In this paper, we also considered the metric's ability to suppress detection noise, set a reasonable error threshold, and used the image mutual information metric to provide the matching index.
In data association-based MOT research, an effective method is to use a probability graph model solution, which can be globally optimized and achieves favorable tracking performance. The network flow tracking model presented in [8] clearly describes the relationship between detections and plans a possible match for all objects as a whole. It can also simultaneously solve the complete trajectories of multiple objects. There are a number of variants [29][30][31] in which, for instance, nodes have different meanings and different methods are used to describe the relationships between nodes. Due to its clear structure, the network flow tracking model has become one of the most popular tracking models. To handle more complex tracking scenarios, conditional random field models have been widely used. Yang et al. [32] added a high-order trajectory continuity constraint to ensure the reliability of matching while paying attention to the node connections. To deal with object pairs with similar positions and appearances, the authors of [7] modeled the relationship between such pairs as the edge of the conditional random field and mapped this relationship as a data association problem, with the binary energy function as a constraint, to robustly solve problems in complex situations. Milan et al. [33] modeled the trajectory smoothness problem [9] in tracking as a unitary energy function and modeled mutual exclusion as a binary energy function. They performed optimization and achieved favorable results. In [34][35][36][37][38], deep learning technology has been further applied to the conditional random field tracking model, in order to improve the distinction degree of object features. In [6,11], a larger range of node relationships were considered and a hypergraph model was established to address the data association problem. However, among the many existing data association models, it is more likely that the reliability of the nodes and the weights on the edges will be trusted. In an object-intensive scene, due to the ubiquity of occlusion, tracklets are too short to provide sufficient information. This situation makes the relationships between tracklets unreliable, and consequently data associations based on these unreliable relationships can lead to the generation of a large number of erroneous trajectories. In this study, the Markov optimization model is established for the data association of cross-view tracklets and a potential function improvement method is proposed to increase the reliability of the relationships between the nodes, which lay the foundation for trajectory generation.

Generation of Cross-View Tracklet Coupling
In the tracking framework, a 2D tracklet set T v is generated for each view, based on the detection input (where v ∈ V represents the view). Tracklets in multiple views are sequentially merged by data fusion to generate a multi-view tracklet coupling set T . To reduce the influence of detection noise caused by occlusion on data fusion accuracy, the human key points detection method is used to optimize the object detection in each view, thereby reducing the 3D reconstruction error of the object. A cross-view tracklet measurement method based on image mutual information is proposed, which can more accurately describe the spatial relationship between the cross-view tracklets and obtain a multi-view fusion tracklet coupling set through an iterative generation algorithm.

Object Position Data Optimization Based on Human Key Points Detection
The basis of multi-view data fusion is the 3D reconstruction of objects; that is, mapping the 2D detection information of multiple views into a unified 3D space. Generally, only the object landing location (center position at the bottom of the 2D detection bounding box) is 3D mapped [14] to reconstruct the pedestrian object to the ground of the public 3D space, which can effectively reduce the computational complexity. In each view, the detection algorithm provides an approximate outline of the object. Most solutions provide a rectangular area [8,14] containing the object; however, some methods provide an elliptical area [32]. The detection bounding box is generally accurate under the condition that the object is not substantially occluded. In a typical 3D reconstruction process, it is assumed that the precalibrated camera parameters are accurate and that only accurate 2D detection data can reconstruct reliable 3D position. However, the detection contains some noise and, for an object, there is always a deviation between the center point of the target detection bounding box at the bottom of the same target and the corresponding real point in 3D. In this study, these two deviations are collectively referred to as detection errors. As the connection between the camera and the object's landing location usually forms an acute angle (of less than 45 degrees) with the ground, the error caused by detection is magnified when mapped by the camera parameters into a 3D space. For dense crowd scenarios, as illustrated in Figure 1, a slightly severe occlusion causes large errors or even false detections of the object. Large errors occur when 3D position reconstruction is performed directly through the bottom center of the detection bounding box, which can lead to the failure of data fusion. Therefore, conducting multi-view data fusion based on existing detection methods is often not reliable. This study makes use of the human key points detection method to optimize detection information to obtain a more accurate object 3D reconstruction position, reduce the reconstruction error, and lay a solid foundation for multi-view data fusion.
We adopt the human key point detection algorithm proposed in [5]. For the detection d t i whose confidence is less than a threshold δ c , the human key point detection is executed on the area slightly larger than the detection. If detection overlapping happens due to dense objects, the processing area will be doubled. The obtained key point data KP(d t i ) is used to optimize the original detection. Detection optimization based on key point information can accomplish three tasks-removing error detection, finding missed detection, and correcting the detection bounding box-as shown in Figure 2 by white, black, and green arrows respectively. The optimization is mainly based on head key point KP h (d t i ) which is often available. If not severe occluded, as the 4th, 14th, and 16th detections indicated by the green arrows in Figure 2, the foot key point KP f (d t i ) can also be obtained. The top and bottom of the bounding box is then refined by KP h (d t i ) and KP f (d t i ). Due to the uncertainty of the pedestrian's posture, prior knowledge is used for width of detections instead of shoulder key points. If part of key point information is unavailable, prior knowledge will be used. When the output of human key points detection algorithm is completely null, the detection is taken as false and deleted, as the d t 2 6 at the white arrow in Figure 2a. In addition, the processing area is an extension of the original detection area d t i . If d t i is in a dense crowd, the processing area will contain other pedestrians, that is, KP(d t i ) contains multiple groups of key point information. This is helpful for missing objects. For example, the pedestrians at black arrows were not detected in Figure 2a and is now found by human key point detection algorithm in Figure 2b. In a densely crowded area, processing areas may overlap, resulting in multiple detection of an object. Therefore, overlapping processing is required.

Cross-View Tracklet Spatial Relationship Metric Based on Image Mutual Information
If object detections d t i (m) and d t j (n) from views m and n correspond to the same object, then the 3D coordinates of their landing locations should theoretically overlap. Due to the detection errors, the positions of their landing locations in the 3D reconstruction may become deviated, which may significantly affect the quality of cross-view image fusion in dense crowd situations.
Based on 2D tracklets, we perform data fusion from three dimensions in this study: time, space, and view. To achieve correct matching, it is necessary to accurately measure the spatial relationships between tracklets from different views. We take two 2D tracklets as an example, where T m i is the ith tracklet of view m and T n j is the jth tracklet of view n. To measure the spatial relationship between T m i and T n j , the authors of [13,14] provide the calculation method for combined dispersion. In [12], a Gaussian metric was used to calculate the positional similarity between tracklets. Both of these schemes can tolerate the position errors in 3D object reconstruction. However, in addition to the positional relationship between T m i and T n j , the motion relationship between them should also be taken into consideration. In studies of image registration [39,40], an effective method is to use mutual information as the similarity between two images a and b. Based on the reference image b, the geometric transformation of the input image a is carried out iteratively and the mutual information of a and b is calculated. When the mutual information reaches maximum, the optimal registration parameters are obtained. Inspired by this, we propose a spatial similarity measurement method that can comprehensively consider the position and motion information of two tracklets; that is, the method uses the image mutual information to calculate the spatial relationship between the tracklets.
As illustrated in Figure 3a, the 3D coordinates of T m i and T n j are first reconstructed using the calibrated camera parameters and detection information. Generally, it is assumed that, in a real tracking scene, the ground is flat and fluctuations can be neglected; furthermore, reconstruction is performed by utilizing only the center points at the bottom of the detection bounding boxes to obtain the 3D coordinates (z = 0) of the object's landing points. When considering the spatial relationship between T m i and T n j , the distance between the detections corresponding to the same frame is usually calculated and summed as follows.
(a) 3D coordinates of tracklets (b) Images for mutual information calculation As the presence of detection noise causes each 3D reconstruction coordinate to deviate from the true value, this noise may affect the calculation of the positional relationship. The Gaussian metric can tolerate reconstruction errors within a certain range (σ) and attenuate distances outside this range. In this study, the image mutual information method can simultaneously satisfy the three requirements for the spatial similarity calculation for the tracklets; namely, positional relationship, noise tolerance, and motion relationship.
To calculate the spatial similarity between two tracklets, they are described as two 8-bit grayscale images (I m i and I n j , respectively) with time-overlapping tracklets taken into consideration. As illustrated in Figure 3b, the background color is set to black; that is, the pixel value is set to 0. When I m i or I n j is established, a Gaussian gray block is constructed, frame-by-frame, with the reconstructed coordinates as the center. The scale of the gray block corresponds to the range of noise tolerance and the Gaussian attenuation reflects the weight attenuation of the deviation from the reconstructed coordinate point. The gray base values of each Gaussian block are incremented, step-by-step, in increasing order of T m i and T n j to indicate their direction of motion. The overlapping relationship between each Gaussian window can express the velocity information of the tracklets. By calculating the mutual information MI(I m i , I n j ) of the two images, the spatial similarity of T m i and T n j can be obtained.
An improved distinguishing effect can be achieved by using the image mutual information to calculate the spatial relationship between two tracklets. As illustrated in Figure 3a, it is assumed that T m i and T n j belong to the same object, whereas T m i and T n k belong to different objects. As the motion trajectories of T m i and T n k overlap, the distance superposition calculation or average distance calculation method would be unable to distinguish them effectively. However, the image mutual information can distinguish them well and, as seen in Figure 3b, there are significant differences between the images of I m i and I n k . As the lengths and distances between tracklets differ, there are differences in the sizes of each pair of images. Therefore, it is also necessary to standardize the mutual information to perform a unified comparison. To this end, we have set all images to the same size. Assuming that the information entropy of a certain image with a size of N 0 is H 0 (x), it can be easily proven that, when its size is expanded to N 0 + N a by adding zero value pixels, its information entropy H 1 (x) satisfies the following, is a function of n 0 , N 0 , and N a , and n 0 is the number of zero value pixels in original image N 0 . The specific form and the proof process of Equation (3) are given in the Appendix A.

Iterative Generation for Tracklet Couplings
In this section, an iterative generation algorithm for multi-view tracklet coupling is designed, based on the cross-view tracklet mutual information metric proposed in Section 3.2. According to Equation (4), the multi-view data fusion problem can be described as the problem of identifying a set of tracklet couplings that can obtain the maximum mutual information: T v j ; and T i = ∅ contains at least one tracklet. MI(T m j , T n k ) is the spatial similarity between tracklet pair (T m j , T n k ). T m j , T n k ∈ T i , m, n ∈ V and m = n. In the cross-view fusion process, the 2D tracklets must meet the temporal overlap of f t (T m j , T n k ) = 1 and the appearance consistency of f a (T m j , T n k ) = 1 before they can be coupled.
Using T m j and T n k as an example, t(T m j ) is the frame list of T m j , while t(T n k ) is the frame list T n k . If T m j and T n k can be coupled, then they contain at least one pair of detections that have the same frames; that is, there is a temporal overlap relationship, as presented in Equation (5).
Thus, T m j and T n k must meet cross-view appearance constraints at the same time, according to Equation (7). Due to the difference in views, there can be large differences in the colors of the cross-view images; furthermore, the difference in the angle of the field of view can cause the texture characteristics of the same object to be quite different in different views. These create challenges in the calculation of cross-view appearance similarity. An effective processing scheme is to use a neural network method for online training to extract the differentiated appearance features of the cross-view object. Due to the complexity of the online training process, it will be studied in follow-up work. In Equation (6), the current study applies a traditional color histogram to calculate the cross-view appearance constraint, using the Bhattacharyya function B(h(T m j (p)), h(T n k (q))) for the similarity calculation function. Before the calculation, color deviation preprocessing is performed for multiple views to reduce the calculation error caused by color deviation.
T m j and T n k can perform cross-view coupling, and their appearance similarity Λ c a (T m j , T n k ) must be greater than the set threshold δ c a , as indicated in Equation (7). Correspondingly, δ i a is the threshold for the same-view calculation.
In the process of coupling, as illustrated in Algorithm 1, if T i already contains other 2D tracklets T n l in view n, then T n k must also satisfy the appearance constraints of T n l from the same view before the coupling, as demonstrated in Equation (7), in which the calculation of appearance similarity in the same view also adopts the traditional color histogram method, as shown in Equation (8). Find the max and incomplete processed pair (T m j , T n k ).

6:
if the current (T m j , T n k ) does not belong to any existing T i then 7: Form this (T m j , T n k ) into new T i+1 , T i+1 = {T m j , T n k }.

8:
else if the current T m j only belongs to one existing T i then 9: if T n k and T i satisfy the Equation (5) and (7) then 10: Update T i = {T m j , T n k }.

11:
else 12: if (T m j , T n k ) has been unprocessed then 13: Mark the pair (T m j , T n k ) as preliminarily processed. 14: else if (T m j , T n k ) has been preliminarily processed then 15: Mark the pair (T m j , T n k ) as complete processed. 16: end if 17: end if 18: else if the current (T m j , T n k ) belongs to two existing T i1 and T i2 then 19: if T i1 and T i2 satisfy the Equation (5) and (7) then 20: Merge T i1 and T i2 .

21:
else 22: if (T m j , T n k ) has been unprocessed then 23: Mark the pair (T m j , T n k ) as preliminarily processed. 24: else if (T m j , T n k ) has been preliminarily processed then 25: Mark the pair (T m j , T n k ) as complete processed. 26: end if 27: end if 28: end if 29: end while

System Framework and Markov Random Field Model for Data Association
The proposed multi-camera tracking system is presented in Figure 4. It consists of position correction, tracklet building, cross-view tracklet coupling generation, the Markov random field model, potential function improvement, data association optimization, and the trajectories generation. The first three units in the framework have been discussed in the previous section and the tracklet coupling set T = {T i } was obtained. In the absence of severe occlusion, the length of the tracklet coupling is effectively expanded. However, in some cases of severe occlusions, especially in the case of dense crowds, a large number of tracklet couplings fragments are produced, making it impossible to obtain the complete trajectories of the objects. As the fragmented tracklet couplings contain scarce information, they cannot be connected well, even with the use of data association. For this purpose, we establish an MRF model in this study, and propose assimilation as well as extension processing and a message selection propagation algorithm (MSBP) to achieve the potential function improvement of fragmented tracklet couplings. This MSBP optimizes the MRF network parameters to obtain better trajectories of objects.

Markov Random Field Model
Among studies on MOT, the existing methods generally perform data association based on the correlations between the current tracklets. In the network flow model proposed in [8], each node represents a tracklet and the complete trajectories of objects are calculated by determining a global optimal association. The conditional random field model constructed in [32] considers the trajectory smoothness between nodes while searching for the optimal connection, while ensuring the reliability of the association. In [7], the mutual exclusion between node pairs was considered, on the basis of the work in [32], and the relationships between the difficulty pairs were handled more effectively. The hypergraph model constructed in [6] took into account the relationships among tracklets over a larger range and ensured reliable global association. When there are short and fragmented tracklets with scarce information, unreliable factors can be introduced into the data association, which reduces the overall accuracy of the association.
In this paper, we establish an MRF to describe the association of T , as illustrated in Figure 5. The node n p ∈ N, as represented by the circle in the figure, represents a link candidate for two tracklet couplings T i and T j , where T i and T j satisfy the time-successive relationship and the time interval is less than the threshold of the discontinuous processing. Each node has a corresponding observation node y m ∈ Y which, as represented by the block in the figure, reflects the observation data of the corresponding tracklet couplings of the node. The edge e pq ∈ E between n p (T i , T j ) and n q (T j , T k ) is established on the condition that they contain the same T j and meets the requirement that T i , T j , and T k are successive in time. The state l p of node n p is binary, where l p = 1 represents that T i and T j in the node are in a connected state. Conversely, l p = 0 represents a disconnected state. The state of the nodes in the Markov network is implicit and the set of states of all nodes in the network is represented as L = {l 1 , ..., l N }. When a node contains only one T k , this indicates that it is either a complete trajectory or a false alarm.  According to the research in [17], in an MRF containing a set X = {x p } of implicit nodes and a set of observation nodes Y = {y p }, the joint posterior probability of the implicit nodes can be calculated by Equation (9), where ψ p (x p , y p ) is the local evidence of the node (i.e., the observation probability) and ψ pq (x p , x q ) is the compatible matrix of nodes n p and n q : In Equation (9), X and Y correspond to L and Y, respectively, in this model. We set ψ pq (x p , x q ) = ψ pq (l p , l q ) = exp(−ϕ(l p , l q )).
In Equation (10), φ(l p ) = φ(T i , T j ) is the observation of the similarity between the two tracklet couplings contained in node n p = (T i , T j ), where the appearance and motion information of T i and T j jointly determine φ(T i , T j ) as shown in Equation (12): Similar to the method in [14], the appearance similarity Λ a (T i , T j ) in this model is calculated based on the traditional appearance feature extraction method. The appearance similarity Λ v a (T i , T j ) between the two tracklet couplings in each view is calculated separately, and the multi-view overall appearance similarity Λ V a (T i , T j ) is determined jointly by Equation (13), T v s ∈ T i and T v t ∈ T j , |V| is the number of views.
The motion similarity Λ m (T i , T j ) calculation method is similar to the method for single-view MOT [9][10][11]. It involves performing motion estimation of two tracklet couplingsT i andT j , calculating the distance D(T i ,T j ) between their estimated locations, and using Gaussian function for the motion similarity calculation, as shown in Equation (14): In Equation (15), ϕ(l p , l q ) is jointly determined by the motion and appearance relationships of the couplings contained in two nodes n p (T i , T j ) and n q (T j , T k ): where Λ(T p , T q )=Λ a (T p , T q )Λ m (T p , T q ), T j is the frame length of the common coupling, and τ ϕ is the frame length threshold. If the common coupling is short, the two nodes exhibit dependencies and the cascade similarity of the respective couplings is calculated. Otherwise, the similarity of the weaker is calculated. Finally, the posterior probability of the MRF is transformed into the following.

Improvement of Potentials of Nodes Containing Small Tracklet Couplings
In multi-view MOT, dense crowd scenarios may cause frequent occlusions of the objects, often resulting in a large number of small tracking fragments. The appearance and motion characteristics of the small tracklet couplings formed by these short tracking fragments are not abundant or accurate, which affects the similarity calculation between tracklet couplings and leads to erroneous data association.
The majority of conventional schemes directly use the obtained network parameters to perform optimization after the establishment of the association model without specifically addressing the inaccuracy of the parameters. In this study, the nodes with short tracklet couplings are processed before data association is performed; furthermore, the related potential functions are improved to lay the foundation for subsequent reliable network solutions. In particular, this includes the following three steps.
The first step is the assimilation and extension processing. In multi-view tracking of dense pedestrian crowds, the MRF model often contains both long tracklet couplings and a large number of fragmented tracklet couplings. As some local information is reliable and can be mined, we propose a method in this paper that starts from reliable tracklet couplings and then enhances the assimilation of adjacent small tracklet couplings and their extension processing. Beginning from the node of the longest tracklet coupling, if the similarity reaches the threshold, then they should be connected internally, which is called a soft link (for soft connections, real connection processing is not performed). After tracklet coupling in the soft-connect node is temporarily cascaded, the appearance and motion are recalculated, such that the two tracklet couplings are assimilated and the short tracklet coupling information is enhanced. Then, as a shared tracklet coupling, it directly affects the neighboring adjacent nodes, which is called assimilation extension processing, as illustrated in Figure 6. Assimilation and extension serve to improve the potential functions between the nodes.  demonstrates that the left node n p accepts the message passed by the right node n q and does not adopt the message passed by node n r .
The second step is MSBP processing. We use the belief propagation algorithm to calculate the marginal probability P(l p ) of each node state. Let m p (l p ) be the (normalized) local message sent by observation of node n p , as indicated in Equation (17): m pq (l q ) is the (normalized) message sent to n q by node n p , as indicated in Equation (18): Unlike conventional processing [16,17], we introduce a special message selection process into the belief propagation algorithm. It can be seen, from the definition of Equation (15), that when a common tracklet coupling is short, the associated potential function of nodes n p and n q is calculated based on the internal cascade result of the two nodes. Based on this special background, we adopt a message selection rule for short common tracklet couplings, as follows.
It means that, if the message of node n p to node n q is biased toward a link, then n q accepts the message; otherwise, n q does not accept the message (i.e., the message is set to be a binary equal probability distribution). It can prevent the propagation of unreliable messages introduced by small tracklet couplings.
In the third step, the MSBP described above is iteratively performed until the marginal distribution of all nodes tends to stability or a termination condition is met. After the iteration procedure is complete, we verify the resulting marginal probabilities: P(l p ) is the probability of node n p is calculated by Equation (20).
If the node contains a small tracklet coupling, we specify the following, The local message is associated with the local potential function, and the potential function is thereby improved.
The above three processes are also performed in combination with the iterative strategy; that is, returning to the first step after the third step, searching for a reliable node that contains longer couplings, and performing processing again until the nodes whose tracklet coupling lengths are greater than the threshold have all fulfilled assimilation and extension, as well as the potential function improvement tasks. By improving the potential functions, the MRF's parameters are optimized.
To generate trajectories of objects, in the existing research, based on the similarity between tracklet couplings, data association can be performed by the global dynamic programming algorithm [11], the continuous shortest path algorithm [10], or the minimum cost flow algorithm [8], as well as others. The global optimization-based methods can comprehensively make use of the relationship between trajectory segments and obtain the optimal association of multiple object trajectories from a global context. When the similarities exhibit high discrimination and accuracy, the local connection schemes, such as those presented in [7,37,38], can often achieve a global solution with better system efficiency. In the practical implementation, the MSBP method adopts maximum selection rule instead of Equation (19), i.e., only the maximum message is accepted. This simplifies the processing and still achieves good results. The stitching of complete trajectories is performed using the trajectory smoothness fitting method.

Experiment
In this section, we first introduce the evaluation indicators and experimental data sets. Then, we separately discuss the performance of key point optimization method, the tracklet couplings generation method, and the MRF data association optimization method. Finally, we compare the overall tracking system performance with that of other methods.

Evaluation Metrics and Experimental Dataset
Multi-view tracking performance can be evaluated using a single-view object-tracking evaluation system. In studies on multi-view tracking, one view is the main field of view and other views play an auxiliary role. Therefore, the authors of [13,14] proposed a feasible scheme using the tracking performance of the main view as the evaluation index for multi-view tracking. The main performance indicators include the tracking accuracy (MOTA), tracking precision (MOTP), mostly tracked target (MT), mostly lost target (ML), false negative (FN), false positive (FP), and object identity switches (IDs).
In this subsection, the whole tracking system is evaluated using the PETS2009 data set, where the evaluation metrics are those which were given in [41,42]. According to Equation (22), the MOTA combines FP t , FN t , and IDs t in frame t, and is given by GT t is the ground truth. The MOTP indicates the misalignment between tracked bounding boxes and their ground truth, and is given by where M t is the number of correct matches between target tracking results and GT in frame t, and D i t is the distance of each match. The MT is the ratio of GT that are covered by a track hypothesis for at least 80% of their respective life span, whereas the ML is the ratio of GT that are covered by a track hypothesis for at most 20% of their respective life span. FP and FN are the total number of false positives and missed targets, respectively. IDs represent the total number of identity switches.
The PETS2009 data set [43] provides three experimental data sets for multi-view MOT with overlapping fields of view; namely, the S2.L1, S2.L2, and S2.L3 video sequences. The difficulty levels are L1, L2, and L3 from low to high, respectively, according to crowd intensity. At the same time, the image resolution of the three experimental sets is not very high; therefore, the appearance feature extraction and target 3D reconstruction accuracy exhibit large deviations, compared to the results obtained when using high-resolution images. These three experimental sequences are, thus, difficult and can be used as experimental sequences for evaluating tracking methods.

Evaluation of Key Point Optimization
In this subsection, the impact of the key point optimization on the whole tracking system is evaluated. Comparison experiments with and without key point optimization module were conducted using the three data sets S2.L1, S2.L2, and S2.L3 and the experimental results are presented in Figure 7. . Evaluation results of the key point optimization. The horizontal axis represents the distance threshold, while the vertical axis represents the tracking accuracy. The solid line indicates that the experiment is conducted using data from two views, while the asterisk line indicates that the experiment is conducted using data from three views. Red represents the tracking system with key point optimization and blue indicates the tracking system without key point optimization.
The abscissa is the distance threshold between the tracklets, the vertical axis represents the tracking accuracy MOTA. The three charts represent the experimental results for S2.L1, S2.L2, and S2.L3, respectively. It can be seen the red curves are obviously better than the blue ones. This suggests that key point optimization is very helpful and improves the tracking performance largely.

Evaluation of Tracklet Coupling Generation
In this subsection, we analyze and compare the coupling method based on a Gaussian distance metric and image mutual information. We separately replaced the coupled tracklet generation modules in the overall tracking framework and keep the other module parameters unchanged to observe the influence of different coupling methods on system tracking performance. The parameters of the Gaussian distance measurement method and the image mutual information method were set according to the current threshold δ i . The mean µ of the Gaussian distance measurement function was set to 0, σ = 2δ i /3, and the Gaussian window size in the mutual information method was s = 3δ i /2. Comparison experiments were conducted using the three data sets S2.L1, S2.L2, and S2.L3 and the experimental results are presented in Figure 8.
The abscissa is the distance threshold between the tracklets: when the minimum distance between two tracklets was greater than the threshold, the coupling operation was not performed. The vertical axis represents the tracking accuracy. The three curve charts represent the experimental comparison result curves for S2.L1, S2.L2, and S2.L3, respectively. The red curve represents the mutual information coupling method, the blue curve represents the Gaussian distance measurement method, the solid line represents the coupling method using data from two views, and the asterisk line represents the coupling method using data from three views. It can be seen, from the three curves, that when the distance threshold was 0 (i.e., no coupling operation was performed), an increasing number of tracklets participated in coupling as the distance threshold increased, and the tracking performance also improved. This suggests that the use of multiple view data for coupling helped the improve tracking performance. In Figure 8a,c, the mutual information method appeared to have equal or superior results to the Gaussian distance measurement method within the entire threshold range. In Figure 8b, although the tracking performance of the mutual information method did not exceed the conventional method locally, it was better within other threshold ranges and had a significant advantage; furthermore, its peak value was higher than that of the Gaussian distance measurement method. In summary, as indicated by the experimental results, the mutual information method is feasible and produces superior results. . Evaluation results of tracklet coupling generation. The horizontal axis represents the distance threshold, whereas the vertical axis represents the tracking accuracy. The solid line indicates that the experiment is conducted using data from two views, while the asterisk line indicates that the experiment is conducted using data from three views. Blue represents the Gaussian distance metric and red represents the coupling method based on image mutual information.

Evaluation of Markov Random Field Optimization Method
In this subsection, we evaluate the data association optimization method based on MRF. Under the condition that the other module parameters of the system remained unchanged, the analysis was performed by comparing the impact on system performance with and without the optimization module. The final trajectory solutions all adopted the data association solving algorithm given in [38], to ensure the validity and fairness of the comparison. Experiments were conducted using the three data sets S2.L1, S2.L2, and S2.L3 and the experimental results are presented in Table 1. The direction of the arrow indicates good performance. The methods used data from three views to perform tracking. The difference was that, in data association, the normal method directly used the current trajectory similarity to generate the trajectory, whereas the MRF method optimized the tracklet similarity first and then generated the trajectory. The experimental results indicate that the MRF method is effective for optimizing the tracklets and the MOTA is improved through the optimization of data association. It is worth noting that during the optimization process some fragmented tracklets were activated, and that a decrease in the FN and IDs indicates the successful connection of difficult tracklet pairs.
To provide an effective analysis and discussion, the other parameters were kept unchanged. By changing the motion similarity threshold, the system tracking performance could be observed, as illustrated in Figure 9. Four experiments were conducted for each experimental sequence; that is, for the conventional methods (red) with two views, the MRF method (blue) with two views, the conventional method (green) with three views, and the MRF method with three views. Figure 9 demonstrates that, in the three sequences, the MRF method maintained an advantage in tracking performance over the conventional method with an increase in the motion similarity threshold. In addition, regardless of the method used, the tracking performance of the tracking system with three views was superior to that of the tracking system with two views, which also suggests that the tracking system in this study functions properly.  Figure 9. Evaluation results of MRF model. The horizontal axis represents the motion relationship threshold, while the vertical axis represents the tracking accuracy. The black curve represents the potential function improvement method with three views, the green curve represents the conventional method with three views, the blue curve represents the potential function improvement method with two views, and the red curve represents the conventional method with two views.

Discussion and Conclusions
In this paper, we used a multi-camera system with overlapping fields of view to study the problem of dense pedestrian tracking and proposed a new MRF model for cross-view tracklet couplings. This model is equipped with a new potential function improvement method that can perform effective association of tracklet coupling fragments caused by dense crowds. To generate reliable tracklet couplings, a data fusion method based on image mutual information was proposed. This method can calculate the spatial relationships of cross-view 2D tracklet pairs by integrating position and motion information. The human key point detection method was also adopted to correct the position data of incomplete and deviated objects in dense crowds.
We made use of the PETS2009 experimental data set for modular experiment. From the experimental results, human key points can effectively improve the object detection in dense pedestrians scene and lead to better 3D reconstruction of tracklet. The data fusion method based on image mutual information combines the motion and position information of the tracklets and provides a more discriminative spatial relationship. The potential function improvement method of our MRF model helps association of fragmented tracklet couplings in the case of dense crowds. These three steps provide an effective solution to the occlusion problem in dense pedestrians tracking.
We also provide comparisons between the tracking system proposed in this study and existing methods, such as those presented in [12][13][14]28]. The comparison results are presented in Table 2. The input and output evaluations of the experimental results were consistent with the evaluation system in [14], and the same input and ground truth were used to ensure the fairness of the comparison. Table 2 indicates that the tracking system performance in this study achieved favorable results with the S2.L1 experimental sequence, although it did not exceed the method of [14], the 100% MT and lower IDs values surpassed the other methods. In the more difficult dense crowd scenes of S2.L2 and S2.L3, our method achieved the best results, indicating that the Markov optimization model plays a key role in processing dense scenarios. The table also indicates that the method in this study makes better use of multi-view data information for tracking. In the three experimental sequences, the tracking performance using three views was superior to the tracking performance using only two views, which is consistent with the actual physical meaning and the original intention of multi-view research. Figure 10 presents the partial tracking results of the three experimental sequences, indicating that the object trajectory can be estimated more accurately when tracking is performed using the information of multiple views. At the same time, it can also be seen that the same object was correctly assigned the same tracking number in different views.  Figure 10. Tracking results. Panels (a-c) display the tracking results of the first to the 350th frame of the first, fifth, and seventh views in PETS2009 S2.L1, respectively. Panels (d-f) display the tracking results of the first to the 51st frame of the first, second, and third views in PETS2009 S2.L2, respectively. Panels (g-i) display the tracking results of the 160th to the 210th frame of the first, second, and fourth views in PETS2009 S2.L3, respectively.
In future research, cross-view feature extraction will be a core task. Due to difference in views between cameras, backgrounds can be quite different and the same target may have a different appearance. The direct use of traditional feature extraction methods may not be able to effectively identify the same object or distinguish among different objects. In recent years, the development of deep learning in the field of pedestrian recognition has supplied new solutions for cross-view appearance feature extraction. Consequently, deep-learning-based cross-view appearance feature extraction will be the next major focus of our research.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Suppose an image of size N 0 has n i pixels of value i, where 0 ≤ i ≤ 255. The information entropy is calculated as follows, Suppose that the image is expanded to size of N 1 = N 0 + N a by adding N a zero value pixels. Noting that the new image has n 0 = n 0 + N a pixels of zero value, its information entropy H 1 (x) satisfies Equation (25), that is,