A Distributed Tracking Algorithm for Counting People in Video by Head Detection

We consider the problem of people counting in video surveillance. It is one of the most popular tasks in video analysis, since the resulting data can be used for predictive analytics, improvement of customer service, traffic control, etc. Our method is based on object tracking in video with a low frame rate. We use the algorithm from [1] as a baseline and propose several modifications that improve the quality of people counting. The main modification is the use of a head detector instead of a body detector in the tracking pipeline. Head tracking proves to be more robust and accurate, as heads are less susceptible to occlusions. To find the intersection of a person with a signal line, we either raise the signal line to the level of the heads or regress body bounding boxes from the available head detections. Our experimental evaluation demonstrates that the modified algorithm surpasses the original in both accuracy and computational efficiency, showing a lower counting error at a lower detection frequency.


Introduction
Counting people passing through certain zones of public infrastructure, such as pedestrian crossings, sidewalks, squares, etc., is a practically important task. There are many solutions to this problem; one of them is object tracking. The task of object tracking is to create a track for each person. A track unambiguously corresponds to a person: it marks that particular person's locations in all frames in which he or she is visible. In order to count people, a signal line is usually specified in the frame (see fig. 1). If a track crosses the signal line, we can say with confidence that the person also crossed it. We propose a fully automatic people counting algorithm. The algorithm takes as input a video stream {F_i} of frames captured by a stationary camera and a signal line specified by an ordered pair of points (L_a, L_b) on the frame. The output of the algorithm is a set of events {E_i}, each represented by a triple of values E_i = (k_i, r_i, d_i). The first value is the number of the frame where the signal line was crossed, the second specifies the coordinates of the bounding box, and the last indicates the direction of the signal line crossing.
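The crossing test and its direction can be sketched with a standard segment-intersection check; a minimal Python example, where a track point is simply a 2D coordinate (the function names are illustrative):

```python
def side(a, b, p):
    """Sign of the cross product (b - a) x (p - a): which side of line a->b point p lies on."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def crossing_direction(la, lb, p_prev, p_curr):
    """Return +1 or -1 if moving from p_prev to p_curr crosses segment (la, lb), else 0.

    The sign encodes the direction of the crossing relative to the ordered
    pair (la, lb), matching the event value d_i in the text.
    """
    s_prev, s_curr = side(la, lb, p_prev), side(la, lb, p_curr)
    if s_prev == 0 or s_curr == 0 or (s_prev > 0) == (s_curr > 0):
        return 0  # no sign change: the infinite line was not crossed
    # the finite segments must intersect as well, not just the infinite lines
    if (side(p_prev, p_curr, la) > 0) == (side(p_prev, p_curr, lb) > 0):
        return 0
    return 1 if s_prev < 0 else -1
```

A track crosses the line when consecutive track positions fall on opposite sides of the segment, and the returned sign gives the crossing direction.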
Our solution is an extension of the algorithm described in [1]. In this paper we propose the following improvements: the use of a head detector instead of a full-body detector; a modified detection matching procedure that works with small head bounding boxes; the use of body regression from heads at different stages of tracking; and an algorithm that automatically determines a region of interest (ROI) from the signal line position to speed up detection.

Related Work
Modern tracking methods are based on tracking-by-detection. There are many ways to detect the desired object in a frame. The three most popular are detection of the body [2,3,4], detection of the head [5,6], and the use of key points. The first approach is popular because many datasets and ready-made solutions exist for it. The head-tracking approach is well suited to tracking people in a crowd: video surveillance cameras are usually installed above head height, where heads in a crowd are seen better than full bodies, and heads are more resistant to overlapping than bodies. However, the number of ready-made solutions and the amount of training data are smaller than for bodies. There are also methods that use body-part detectors for tracking [7], key points of the human pose [8], combined solutions (body and head) [9], and detector ensembles [10].
After detection, all detections need to be bound to tracks. As in the detection task, there are many methods. The first group of track-building algorithms is greedy. In most online algorithms tracks are constructed frame by frame: on each frame a matrix of matching costs between new detections and existing tracks is built, and the matching problem is then solved either by a greedy algorithm (searching for the maximum in each row/column) [11,12] or by the Hungarian algorithm [13] [2,3,4]. Sometimes MCMC is used to bind detections to tracks [5,14].
Recently, neural networks have been used more often in tracking. For example, the authors of [15] suggest using the detector itself to obtain a new detection by regressing the detection from the previous frame. However, this method has disadvantages: it works well only at a high frame rate, and it increases the load on the detector.

Baseline
We use the solution from [1] as the baseline, which is an extension of the SORT tracking algorithm [2]. We chose this algorithm because it is capable of working with detections computed on a sparse set of frames, which can be performed on remote servers. This significantly reduces the amount of computational resources required for large-scale video surveillance systems (see fig. 2). Our proposed method inherits the distributed nature of the baseline.
The baseline works in online mode and uses the Hungarian algorithm [13] to match detections. To improve results at a low detection rate, ASMS visual tracking [16] is used to estimate the speed of people between frames. The same approach to speed estimation via visual tracking is used in [11].
The baseline algorithm consists of the following steps: (1) detection; (2) evaluation of the speed of detections using visual tracking; (3) prediction of the position of tracks by the Kalman filter; (4) matching; (5) extrapolation of the tracks; (6) detection of signal line crossing events.
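Step (3) can be illustrated with a constant-velocity Kalman prediction over the gap between detection frames; a minimal sketch with numpy, assuming a state [x, y, vx, vy] for a track center and a simplistic process-noise model (the baseline's actual filter parameters are not reproduced here):

```python
import numpy as np

def kalman_predict(x, P, dt, q=1.0):
    """One Kalman prediction step under a constant-velocity motion model.

    x  -- state vector [x, y, vx, vy]
    P  -- state covariance (4x4)
    dt -- number of frames elapsed since the last detection
    q  -- process-noise scale (illustrative, not tuned)
    """
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    Q = q * np.eye(4)  # simplistic isotropic process noise
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred
```

At a low detection frequency dt can span many frames, which is why the speed components estimated by visual tracking (step 2) matter so much for the prediction.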
Proposed improvements to some of the steps above are described below.

Detection
Since heads are seen better in the video and are less prone to occlusions, we decided to use a head detector based on the SSD [17] approach instead of a body detector. Another advantage of the head detector is that neural network body detectors can merge nearby people into one bounding box, which happens less frequently for heads. The detector was trained on the public CrowdHuman [18] dataset and on a dataset collected by Video Analysis Technologies. Experimental evaluation showed an AUC of 0.66 on the test part of CrowdHuman. Usage of the head detector leads to the sub-task of restoring the bounding box of the entire body in order to find intersections with the signal line, which is located on the ground. We describe it in section 3.4.

Matching
During the experiments we realized that the IOU metric used in the basic algorithm is not suitable for head matching. Since head bounding boxes are small, errors in position and speed estimation often leave bounding box pairs that should be matched closely positioned but without intersection. The resulting IOU of zero leads to track partitioning. Therefore we suggest increasing the size of the bounding boxes by a factor of s (while preserving the center of each bounding box) before matching them with the IOU metric. This approach solves the problem described above.
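The enlargement trick can be sketched as follows; a minimal example where boxes are (x1, y1, x2, y2) tuples:

```python
def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter) if inter > 0 else 0.0

def scale_box(box, s):
    """Enlarge a box by factor s while preserving its center."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    hw, hh = (box[2] - box[0]) * s / 2, (box[3] - box[1]) * s / 2
    return (cx - hw, cy - hh, cx + hw, cy + hh)

def scaled_iou(a, b, s=2.0):
    """IOU of the two boxes after enlarging both by s around their centers."""
    return iou(scale_box(a, s), scale_box(b, s))
```

Two nearby heads with no overlap get a plain IOU of zero, while the scaled IOU is positive and still usable as a matching cost.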

Signal Line Crossing
We encountered the problem of missed signal line intersections: head tracks are located at human height while the signal line is located on the ground. The problem is most pronounced in scenes where the signal line is placed orthogonally to the camera. We therefore offer two solutions: raising the signal lines and regressing a body bounding box from the head.

Rising of Signal Lines
We propose to raise the signal line to the level of the human head (see fig. 3). This solution reduces the computational cost and running time of the algorithm, as it doesn't require any extra steps to project the head track onto the ground plane. The signal line can be raised either manually or automatically. The automatic approach we propose uses the following anatomical fact: the height of the head is 1/8 of the height of the entire human body. We can use a body detector to calculate the average height of a human body for the scene, and raise the signal line to a height equal to 7/8 of the average height of a person in the scene. This approach has a drawback: people's heights differ, which means that people's heads lie on different planes, while people's feet are always on the same plane. Therefore, the raised signal line is not clearly defined, since it is unclear where its plane is located.

Bounding Box Regression by Head
To resolve the drawbacks described in the previous paragraph, we propose to regress the body from the head with a model trained for the specific scene. Our idea consists of two stages: combining the head and body detections for the scene, and training a linear regression on the resulting data.
Combining head and body detections. First of all, we launch the head and body detectors. To combine heads and bodies, a matrix of head-body correspondence is built for each frame.
In eq. (1), nb_k and nh_k are the numbers of body and head detections on the k-th frame, and B_{k,i}, H_{k,j} are the bounding boxes of the i-th body and j-th head on the k-th frame:

ioh_{k,i,j} = area(B_{k,i} ∩ H_{k,j}) / area(H_{k,j}).   (1)

We calculate the matching cost as:

c_{k,i,j} = ioh_{k,i,j} if ioh_{k,i,j} ≥ I_1 and τ_1 ≤ d_{k,i,j} ≤ τ_2, and 0 otherwise,   (2)

where d_{k,i,j} is the vertical distance from the center of H_{k,j} to the top edge of B_{k,i}, normalized by the head height; I_1 is a threshold for ioh_{k,i,j} (we use I_1 = 0.5), and τ_1, τ_2 are the minimum and maximum normalized distances between the center of the head and the upper point of the body (we use τ_1 = −1 and τ_2 = 1). So at least half of the head bounding box area should lie inside the body bounding box, and vertically the head center should not be farther from the body top than one head height.
Next, the assignment problem is solved using the Hungarian algorithm to maximize the total cost. Combined head and body detections with non-zero matching cost form the training dataset for the regression (see fig. 4).
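The combination step can be sketched as follows, assuming a reading of the matching cost in eq. (2) as intersection-over-head gated by the I_1 and τ thresholds; scipy's linear_sum_assignment plays the role of the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ioh(body, head):
    """Intersection area over head area for boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(body[2], head[2]) - max(body[0], head[0]))
    iy = max(0.0, min(body[3], head[3]) - max(body[1], head[1]))
    head_area = (head[2] - head[0]) * (head[3] - head[1])
    return ix * iy / head_area if head_area > 0 else 0.0

def match_cost(body, head, i1=0.5, tau1=-1.0, tau2=1.0):
    """Cost per eq. (2): zero unless the head mostly lies inside the body
    and its center is within one head height of the body's top edge."""
    v = ioh(body, head)
    head_h = head[3] - head[1]
    d = ((head[1] + head[3]) / 2 - body[1]) / head_h  # normalized vertical distance
    return v if v >= i1 and tau1 <= d <= tau2 else 0.0

def match_heads_to_bodies(bodies, heads):
    """Maximize total matching cost with the Hungarian algorithm;
    pairs with zero cost are discarded."""
    cost = np.array([[match_cost(b, h) for h in heads] for b in bodies])
    rows, cols = linear_sum_assignment(-cost)  # minimizing -cost maximizes cost
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] > 0]
```

The surviving (body, head) pairs become the training set for the scene-specific regressor.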
Linear regression. After the previous step we have the data to train a linear regression. The training dataset consists of head and body bounding box pairs (B_i, H_i). The regressor predicts the values (B_h, B_w, shift_x), where B_h and B_w are the height and width of the body bounding box, and shift_x is the shift from the center of the head to the center of the predicted body, normalized by the head width. Prediction is done by linear regression with quadratic terms; B_w and shift_x are predicted in the same way as B_h, but with separate sets of coefficients. After learning the regression coefficients on the training dataset, we use the predicted values to restore the body bounding box. This approach keeps the signal line on the ground plane, which resolves the drawbacks of the previous approach. We now have a choice of performing speed estimation by visual tracking on head detections or on regressed body detections. As heads are less prone to occlusions, visual tracking of heads may be more reliable; on the other hand, body bounding boxes are several times larger and contain more visual information to track. We compare both choices in the experimental evaluation.
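The per-scene regressor can be sketched with ordinary least squares over quadratic features of the head box; the exact feature set used in the paper is not reproduced here, so the choice of head width and height plus their squares and product is an assumption:

```python
import numpy as np

def quad_features(head_wh):
    """Head (w, h) with quadratic terms; the exact feature set is an assumption."""
    w, h = head_wh[:, 0], head_wh[:, 1]
    return np.stack([np.ones_like(w), w, h, w * w, h * h, w * h], axis=1)

def fit_body_regressor(head_wh, targets):
    """Least-squares fit of one target value (e.g. body height B_h) from head size.

    head_wh -- (n, 2) array of head widths and heights
    targets -- (n,) array of the value to predict
    """
    X = quad_features(head_wh)
    coef, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return coef

def predict(coef, head_wh):
    """Apply fitted coefficients to new head sizes."""
    return quad_features(head_wh) @ coef
```

B_w and shift_x would each get their own coefficient vector fitted the same way on the same features.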

Limiting the Detection Area
The entire frame is not required to find signal line crossing events. We can limit the detection area to the region around the signal line, the region of interest (ROI). This speeds up the SSD detector, since detection is slower for higher-resolution images.
If the signal line is horizontal then the ROI is located on the top and bottom of the line. Otherwise the ROI is located on the left and right side of the line, as well as on the top of the line.
Let α ≤ π/2 be the minimum angle between the signal line and the horizon, w_mean and h_mean the average width and height of body detections, and (x_a, y_a), (x_b, y_b) the coordinates of the beginning and the end of the signal line. Then:

x_1 = min(x_a, x_b) − s_w · w_mean · (1 + sin α)/2   (9)
x_2 = max(x_a, x_b) + s_w · w_mean · (1 + sin α)/2   (10)
y_1 = min(y_a, y_b) − s_h · h_mean · (1 + cos α)/2   (11)
y_2 = max(y_a, y_b) + s_h · h_mean · (1 + cos α)/2   (12)

Then A = (x_1, y_1, x_2, y_2) is the region of interest. s_w and s_h in equations 9-12 are parameters. After visual testing we selected the following values for these parameters: s_w = 2, s_h = 1.
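The ROI computation can be sketched as follows; the y-bounds here are written by analogy with the published x-bound of eq. (10), using cos α and s_h, and are an assumption:

```python
import math

def roi_for_line(xa, ya, xb, yb, w_mean, h_mean, s_w=2.0, s_h=1.0):
    """Region of interest around a signal line.

    The x padding follows eq. (10); the y padding mirrors it with cos(alpha)
    and s_h, which is an assumed reconstruction of eqs. (11)-(12).
    """
    # minimum angle between the signal line and the horizon, in [0, pi/2]
    alpha = math.atan2(abs(yb - ya), abs(xb - xa))
    pad_x = s_w * w_mean * (1 + math.sin(alpha)) / 2
    pad_y = s_h * h_mean * (1 + math.cos(alpha)) / 2
    x1 = min(xa, xb) - pad_x
    x2 = max(xa, xb) + pad_x
    y1 = min(ya, yb) - pad_y
    y2 = max(ya, yb) + pad_y
    return x1, y1, x2, y2
```

For a horizontal line the ROI extends mostly above and below the line by the average body height, while for a steep line the x padding grows to cover people walking along it.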

Datasets
For an experimental evaluation of our algorithm we need datasets filmed by a static camera with body track markup. Video sequences should be long enough to evaluate people counting quality. If a dataset also provides head track markup, it allows us to check the effect of body regression by comparison with head tracking on raised signal lines (section 3.4). Most public datasets, including the popular MOTChallenge dataset [19], have short videos, provide only body track markup, or are filmed by a moving camera. So we used 19 videos from the collection of the Video Analysis Technologies company and the Towncentre dataset [5] to test our algorithm. For all videos, signal lines were manually drawn at ground level as well as at head level. Table 1 provides detailed information about each test video.

Metrics
As a quality metric we use the average error of counting the number of intersections (events) [1]. The resulting events can include both true and false ones; false events have no correspondences in the reference labeling. We say that an event E_i in the algorithm output matches an event in the reference labeling if they correspond to the same person crossing the signal line at the same time. We match all events as described in [1]. After the events have been matched, we divide the videos into segments of 10 reference events each and calculate the following characteristics on them:
- GT_seg is the number of reference events on the segment;
- FP_seg is the number of unmatched events from the algorithm on the segment;
- FN_seg is the number of unmatched reference events on the segment;
- E_seg = (FP_seg + FN_seg) / GT_seg is the error on the segment.

The final error is then calculated as E = (Σ E_seg) / N, where N is the number of segments.
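The metric can be sketched as follows, assuming the per-segment error is E_seg = (FP_seg + FN_seg) / GT_seg (the segment dictionaries and field names are illustrative):

```python
def counting_error(segments):
    """Average counting error over segments.

    Each segment is a dict with:
      gt -- number of reference events on the segment
      fp -- unmatched events produced by the algorithm (false positives)
      fn -- unmatched reference events (false negatives)
    """
    errors = [(s["fp"] + s["fn"]) / s["gt"] for s in segments]
    return sum(errors) / len(errors)
```

With segments of 10 reference events each, one extra and one missed event per segment correspond to a 20% counting error.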

Experimental Results
Rising of Signal Lines. At first we tested the baseline with the modifications proposed in sections 3.2 and 3.3 and manually raised signal lines as described in section 3.4. This algorithm is marked as heads-no-regression in the results table (see table 2), and the parameter s shows by how many times head bounding boxes were enlarged. The experiments with raising the signal line clearly show the advantage of the head detector over the body detector: the head detector gives a significant increase in accuracy, reducing the error by roughly a factor of 2. The algorithm without the matching modification (s = 1) performs poorly at low FPS due to the problems described earlier, which shows the importance of the modification proposed in section 3.3.
Body Bounding Box Regression by Head. Next we tested body bounding box regression by head. As mentioned in section 3.4, there are two alternative ways to apply it, and we tested both (see table 3). The configuration that performs visual tracking on heads gives better results: visualization showed that visual tracking of heads is more reliable, as heads are less prone to occlusions. It is worth noting that heads-vistrk-regression performed better than heads-no-regression (s = 2) at low detection frequency.
Limiting the Detection Area. Next we tested limiting the detection area (section 3.5). It allowed us to increase the speed of the algorithm almost without affecting the counting error (see table 4). The detection area was reduced by almost 60% on some of the videos.

Conclusion
We have proposed an algorithm for counting people in video, which is an extension of the algorithm described in [1]. The use of a head detector and body bounding box regression allowed us to increase counting accuracy, and limiting the detection area saves computational resources. Compared with the baseline, the proposed method is able to work at a lower detection frequency while showing a lower counting error.
For future work we plan to use automatic camera calibration algorithms (see fig. 5) [20,21]. This will allow us to perform person tracking on the ground plane map and further improve the accuracy of people counting.