
1 Introduction

The increasing number of vehicles makes the intelligent transportation system (ITS) more and more significant. For traffic surveillance, various techniques have been proposed to alleviate the pressure on transportation systems. Vision-based systems have attracted many researchers owing to their lower price and more direct observation information. ITS extracts useful and accurate traffic parameters for traffic surveillance, such as vehicle flow, vehicle velocity, vehicle count, vehicle violations, congestion level and lane changes. Such dynamic transportation information can be disseminated to road users, which reduces environmental pollution and traffic congestion and enhances road safety [1]. Multiple-vehicle detection and tracking is a challenging research topic in developing traffic surveillance systems [2]. It has to cope with complicated but realistic road conditions, such as uncontrolled illumination, cast shadows and visual occlusion [3]. Visual occlusion is the biggest obstacle among all the problems of outdoor tracking.

Current methods for handling vehicle occlusion can be divided into three categories: feature-based tracking [4–8], 3D-model-based tracking [9] and reasoning-based tracking [10]. Based on the normalized area of intersection, Fang [5] proposed to track occluded vehicles by matching local corner features. Gao [11] used a Mexican hat wavelet to modify the mean-shift tracking kernel and embedded a discrete Kalman filter to achieve satisfactory tracking. In [9], a deformable 3D model that can accurately estimate the position of an occluded vehicle was described; however, this method has high computational complexity. Anton et al. [10] presented a principled model for occlusion reasoning in complex scenarios with frequent inter-object occlusions. The above approaches do not work well when a vehicle is seriously or fully occluded by other vehicles.

In order to cope with vehicle occlusion in video sequences captured by a stationary camera, we propose a two-level framework. The basic workflow of the proposed method is illustrated in Fig. 1, and it consists of two modules: (1) a vehicle detection unit; (2) a vehicle tracking unit. In the vehicle detection module, features are extracted for fast vehicle localization by combining an improved frame difference algorithm with a background subtraction algorithm, and the vehicle cast shadow is reduced with the aid of the distance between the pair of tail lights. In the vehicle tracking module, the two-level framework handles vehicle occlusion: (1) on the NP level (No or Partial occlusion), vehicles are tracked by the mean-shift algorithm; (2) on the SF level (Serious or Full occlusion), occlusions are handled by an occlusion reasoning model. Occlusion masks are adaptively created, and the detected vehicles are tracked in both the created occlusion masks and the original images. For each detected vehicle, the NP or SF level is applied according to the degree of occlusion (see Fig. 2). With this framework, most partial occlusions are effectively managed on the NP level, whereas serious or full occlusions are successfully handled on the SF level.

Fig. 1.

Overview of the proposed framework.

Fig. 2.

Diagram of the two-level framework.

The rest of this paper is organized as follows: Sect. 2 presents the moving vehicle detection method. NP level occlusion handling is described in Sect. 3, and SF level occlusion handling is introduced in Sect. 4. Section 5 shows the experimental results and discussion. The conclusion is given in Sect. 6.

2 Moving Vehicle Detection

2.1 Moving Vehicle Detection

At present, frame difference [12] and background subtraction [13] are conventional approaches for detecting moving vehicles in vision-based systems. Frame difference preserves the contour of the moving vehicle very well. However, it fails when the inter-frame displacement is less than one pixel or larger than the length of the moving vehicle. Therefore, a bidirectional difference multiplication algorithm that can cope with both slow and fast speeds is used here. The proposed method overcomes the shortcomings of frame difference because it adopts discontinuous frame differences and bidirectional frame differences. The bidirectional difference multiplication algorithm detects targets based on relative motion and can detect multiple vehicles at the same time. To overcome the failure when the inter-frame displacement is larger than the length of the vehicle, the bidirectional frame differences are first calculated among the current kth frame, the (k − 2)th frame and the (k + 2)th frame; then the difference results are multiplied to strengthen the motion region. The frame difference image \( D_{fd} (x,y) \) is given by Eqs. (1) and (2):

$$ \left\{ \begin{aligned} D_{k + 2,k} (x,y) = \left| {f_{k + 2} - f_{k} } \right| \hfill \\ D_{k,k - 2} (x,y) = \left| {f_{k} - f_{k - 2} } \right| \hfill \\ \end{aligned} \right. $$
(1)
$$ D_{fd} (x,y) = D_{k + 2,k} (x,y) \times D_{k,k - 2} (x,y) $$
(2)

where \( D_{k + 2,k} (x,y) \) is the difference between the (k + 2)th frame and the kth frame, and \( D_{k,k - 2} (x,y) \) is the difference between the kth frame and the (k − 2)th frame. The basic diagram is shown in Fig. 3.

Fig. 3.

Diagram of the bidirectional multiplication algorithm.
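As an illustration, Eqs. (1) and (2) can be implemented in a few lines; the sketch below assumes the three frames are given as grayscale floating-point NumPy arrays of equal size and is not the authors' original code.

```python
import numpy as np

def bidirectional_difference(f_km2, f_k, f_kp2):
    """Bidirectional difference multiplication, Eqs. (1)-(2).
    All frames are assumed to be grayscale float32 arrays of equal size."""
    d_forward = np.abs(f_kp2 - f_k)    # D_{k+2,k}(x, y)
    d_backward = np.abs(f_k - f_km2)   # D_{k,k-2}(x, y)
    return d_forward * d_backward      # D_fd(x, y), the strengthened motion region
```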

The primary idea of background subtraction is first to subtract the current frame image from the background model pixel by pixel, and then to classify the obtained pixels as foreground or background using a threshold \( \delta \). Compared with complicated background modeling techniques, the multi-frame average method has high computation speed and a low memory requirement. Therefore, we utilize the multi-frame average method to update the background model. The foreground/background classification is formulated as follows:

$$ D_{bd} = \left\{ {\begin{array}{*{20}c} {1\quad foreground,\quad if\left| {I - B} \right| > \delta } \\ {0\quad background,\quad otherwise} \\ \end{array} } \right. $$
(3)

where I is the current frame image and B is the background image.

The background model needs to be updated continuously owing to the constant changes of the real traffic environment. Here the multi-frame average method is used to update the background model, as given by Eq. (4).

$$ B_{k + 1} = (1 - \alpha )B_{k} + \alpha I_{k} $$
(4)

where \( \alpha \,\left( {0\, < \, \alpha \, < \, 1} \right) \) determines the speed of the background update, and \( B_{k} \) and \( B_{k + 1} \) denote the background model at times k and k + 1, respectively.

A more precise foreground image D is obtained by Eq. (5), which combines \( D_{fd} \) and \( D_{bd} \) pixel by pixel: background subtraction keeps the integrity of the moving vehicle, while the bidirectional difference multiplication algorithm preserves its full contour.

$$ D = D_{fd} \cup D_{bd} $$
(5)
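A minimal sketch of how Eqs. (3)–(5) could be combined is given below; it assumes grayscale float images, and the values of \( \alpha \), \( \delta \) and the binarization threshold applied to \( D_{fd} \) are illustrative assumptions rather than the settings used in the experiments.

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Multi-frame average background update, Eq. (4); 0 < alpha < 1."""
    return (1.0 - alpha) * background + alpha * frame

def foreground_mask(frame, background, d_fd, delta=30.0, fd_thresh=20.0):
    """Foreground detection: background subtraction (Eq. 3) combined with the
    binarized bidirectional difference image D_fd by a pixel-wise union (Eq. 5)."""
    d_bd = np.abs(frame - background) > delta   # Eq. (3)
    d_motion = d_fd > fd_thresh                 # binarize D_fd (threshold assumed)
    return np.logical_or(d_bd, d_motion)        # D = D_fd ∪ D_bd
```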

2.2 Post-processing

The binarized foreground image D may contain holes and noise (see Fig. 4(b)), which affect the results of target tracking. A flood-fill operation is applied to D to eliminate the holes in the moving targets (see Fig. 4(c)). Moreover, morphological closing and opening operators are used to fuse narrow breaks and remove the noise (see Fig. 4(d)).

Fig. 4.

Moving vehicle detection. (a) Original image. (b) Detection result of Section 2.1. (c) The result after the flood-fill operation. (d) The final binarized image by using morphological closing and opening operators.
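The post-processing of Sect. 2.2 can be realized with standard OpenCV operations, as sketched below; the flood-fill seed at the image corner (assumed to lie on the background) and the 5 × 5 elliptical structuring element are assumptions, not values from the paper.

```python
import cv2
import numpy as np

def fill_holes_and_denoise(binary, kernel_size=5):
    """Flood-fill the holes inside the detected regions, then apply
    morphological closing and opening (Sect. 2.2). binary is a 0/255 uint8 image."""
    h, w = binary.shape
    flood = binary.copy()
    mask = np.zeros((h + 2, w + 2), np.uint8)
    cv2.floodFill(flood, mask, (0, 0), 255)      # fill from the corner (assumed background)
    holes = cv2.bitwise_not(flood)               # pixels enclosed by foreground
    filled = cv2.bitwise_or(binary, holes)

    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    closed = cv2.morphologyEx(filled, cv2.MORPH_CLOSE, kernel)  # fuse narrow breaks
    return cv2.morphologyEx(closed, cv2.MORPH_OPEN, kernel)     # remove small noise
```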

2.3 Tail Lights Detection

In order to reduce the effect of the vehicle cast shadow, the distance between the pair of tail lights is utilized to determine the precise region of the moving vehicle; this method was first proposed by Qing et al. [14]. The tail lights are detected within the obtained partial image. Then, the following constraint is used to determine the light pairs:

$$ \left\{ {\begin{array}{*{20}c} {\left| {h_{{c_{1} }} - h_{{c_{2} }} } \right| \le d} \\ {w_{\hbox{min} } \le w_{{c_{1} c_{2} }} \le w_{\hbox{max} } } \\ \end{array} } \right. $$
(6)

where \( c_{1} \) and \( c_{2} \) are the barycenters of the tail lights, \( h_{{c_{1} }} \) and \( h_{{c_{2} }} \) represent the heights of \( c_{1} \) and \( c_{2} \), respectively; d is a threshold determined by the image resolution, and \( w_{{c_{1} c_{2} }} \) is the width between the pair of tail lights of a moving vehicle.
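In practice, the light-pair constraint of Eq. (6) reduces to a simple check on the candidate barycenters; in the sketch below, the thresholds \( w_{min} \), \( w_{max} \) and d are assumed to be chosen for the particular camera setup.

```python
def is_tail_light_pair(c1, c2, w_min, w_max, d):
    """Light-pair constraint of Eq. (6): the two candidates must lie at similar
    heights and at a plausible horizontal distance. c1 and c2 are (x, y) barycenters."""
    height_diff = abs(c1[1] - c2[1])   # |h_c1 - h_c2|
    width = abs(c1[0] - c2[0])         # w_c1c2
    return height_diff <= d and w_min <= width <= w_max
```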

3 NP Level Occlusion Handling

In this paper, the mean-shift algorithm is utilized to track the detected vehicles in a video sequence. The mean-shift algorithm usually adopts the color histogram of the detected target as its feature [15]. In order to find the most similar candidate region in the neighborhood of the moving target, we define the Bhattacharyya coefficient [16] as the similarity function.

3.1 Target Model and Candidate Model

The first step is to initialize the position of the target region in the first frame. The target model can be described as the probability density distribution of the color feature values in the target region [17]. The probability density functions (PDFs) of the target model \( q_{u} \) and the candidate model \( p_{u} \) are calculated as follows:

$$ q_{u} (x) = C\sum\limits_{i = 1}^{n} k \left( {\left\| {\frac{{x_{i} - x_{0} }}{h}} \right\|^{2} } \right)\delta \left[ {b(x_{i} ) - u} \right] $$
(7)

where \( x_{0} \) is the barycenter of the target region, \( k(\left\| x \right\|^{2} ) \) is the kernel profile, h is the bandwidth of the kernel, b is the color histogram index function of the pixels, and C is the normalization coefficient.

$$ p_{u} (y) = C_{h} \sum\limits_{i = 1}^{{n_{h} }} {k\left( {\left\| {\frac{{x_{i} - y}}{h}} \right\|^{2} } \right)\delta \left[ {b(x_{i} ) - u} \right]} $$
(8)

where \( C_{h} \) is the normalization coefficient.
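Both Eqs. (7) and (8) are kernel-weighted color histograms and can be computed by the same routine; the sketch below assumes an Epanechnikov kernel profile (the paper does not state which profile is used) and precomputed color-bin indices \( b(x_{i} ) \).

```python
import numpy as np

def weighted_color_histogram(patch_bins, center, bandwidth, n_bins):
    """Kernel-weighted color histogram used as target/candidate model (Eqs. 7-8).
    patch_bins[i, j] holds the color-bin index b(x_i) of each pixel in the region,
    center is the barycenter (row, col), and bandwidth is the kernel bandwidth h."""
    rows, cols = np.indices(patch_bins.shape)
    dist2 = ((rows - center[0]) ** 2 + (cols - center[1]) ** 2) / bandwidth ** 2
    weights = np.maximum(1.0 - dist2, 0.0)   # Epanechnikov profile k(||x||^2), assumed
    hist = np.bincount(patch_bins.ravel(), weights=weights.ravel(), minlength=n_bins)
    return hist / (hist.sum() + 1e-12)       # normalization (C or C_h)
```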

3.2 Bhattacharyya Coefficient and Target Position

Here the Bhattacharyya coefficient is defined as the similarity function, which can be expressed as

$$ \rho (y) = \rho \left[ {p(y),q} \right] = \sum\limits_{u = 1}^{m} {\sqrt {p_{u} (y)q_{u} } } $$
(9)
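For two normalized m-bin histograms, Eq. (9) is a one-line computation; the sketch below assumes NumPy arrays.

```python
import numpy as np

def bhattacharyya(p, q):
    """Bhattacharyya coefficient of Eq. (9) between two normalized histograms."""
    return float(np.sum(np.sqrt(p * q)))
```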

The similarity function is expanded in a Taylor series around \( p_{u} (y_{0} ) \), which gives:

$$ \rho \left[ {p(y),q} \right] \approx \frac{1}{2}\sum\limits_{u = 1}^{m} {\sqrt {p_{u} (y_{0} )q_{u} } } + \frac{{C_{h} }}{2}\sum\limits_{i = 1}^{n} {\omega_{i} k\left( {\left\| {\frac{{y - x_{i} }}{h}} \right\|^{2} } \right)} $$
(10)

where

$$ \omega_{i} = \sum\limits_{u = 1}^{m} {\delta \left[ {b(x_{i} ) - u} \right]\sqrt {\frac{{q_{u} }}{{p_{u} (y_{0} )}}} } $$
(11)

The best candidate model has the largest value of \( \rho (y) \), which can be found by mean-shift iterations. First, the barycenter of the target region \( y_{0} \) in the current frame is initialized to the barycenter of the target region in the previous frame, i.e., \( y_{0} = x_{0} \). Then, the optimal matching position is searched around \( y_{0} \) by maximizing the Bhattacharyya coefficient, and the updated barycenter \( y_{1} \) is computed by Eq. (12). Finally, the iteration stops when \( \left\| {y_{1} - y_{0} } \right\| < \varepsilon \), and the barycenter of the target is replaced by \( y_{1} \).

$$ y_{1} = \frac{{\sum\limits_{i = 1}^{n} {\omega_{i} (x_{i} - y_{0} )g\left( {\left\| {\frac{{y_{0} - x_{i} }}{h}} \right\|^{2} } \right)} }}{{\sum\limits_{i = 1}^{n} {\omega_{i} g\left( {\left\| {\frac{{y_{0} - x_{i} }}{h}} \right\|^{2} } \right)} }} $$
(12)

Thus the barycenter of the moving target is gradually adjusted from its initial position to the real position.
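The weight computation and position update of Eqs. (11) and (12) might be implemented as follows; the Epanechnikov profile (for which g is constant), the convergence threshold and the candidate_window helper are assumptions introduced only for illustration.

```python
import numpy as np

def mean_shift_step(pixel_coords, pixel_bins, q, p, eps=1e-12):
    """One mean-shift update. pixel_coords is an (n, 2) array of pixel positions
    x_i in the candidate window, pixel_bins holds the color-bin indices b(x_i),
    and q, p are the target and candidate histograms."""
    w = np.sqrt(q[pixel_bins] / (p[pixel_bins] + eps))   # weights omega_i, Eq. (11)
    # With an Epanechnikov profile g = -k' is constant, so Eq. (12) reduces to
    # the weighted mean of the pixel coordinates.
    return (w[:, None] * pixel_coords).sum(axis=0) / (w.sum() + eps)

def mean_shift_track(candidate_window, q, y0, epsilon=0.5, max_iter=20):
    """Iterate the update until ||y1 - y0|| < epsilon (Sect. 3.2).
    candidate_window(y) must return the pixel coordinates, their bin indices and
    the candidate histogram p(y) for the window centred at y."""
    y0 = np.asarray(y0, dtype=float)
    for _ in range(max_iter):
        coords, bins_, p = candidate_window(y0)
        y1 = mean_shift_step(coords, bins_, q, p)
        if np.linalg.norm(y1 - y0) < epsilon:
            return y1
        y0 = y1
    return y0
```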

4 SF Level Occlusion Handling

On the NP level, partial occlusion can be easily resolved, whereas for serious or full occlusions this level is ineffective. Therefore, on the SF level, an occlusion reasoning model combined with constructed occlusion masks is proposed to estimate the information of the occluded vehicle.

On the SF level, the occlusion reasoning model tracks moving objects based on barycenters and motion vectors. Vehicle tracking is described as follows. Let \( VC_{n}^{i} \) denote the ith detected vehicle in the nth frame. The barycenter position of \( VC_{n}^{i} \) is denoted as \( \left( {C_{n,x}^{i} ,C_{n,y}^{i} } \right) \), and the average motion vector of \( VC_{n}^{i} \) is denoted as \( \left( {V_{n,x}^{i} ,V_{n,y}^{i} } \right) \). Therefore, we can estimate the barycenter position \( \left( {\tilde{C}_{n,x}^{i} ,\tilde{C}_{n,y}^{i} } \right) \) of \( VC_{n}^{i} \) in the (n + 1)th frame by Eq. (13):

$$ \left\{ {\begin{array}{*{20}c} {\tilde{C}_{n,x}^{i} = C_{n,x}^{i} + V_{n,x}^{i} } \\ {\tilde{C}_{n,y}^{i} = C_{n,y}^{i} + V_{n,y}^{i} } \\ \end{array} } \right. $$
(13)

Here \( VC_{n}^{i} \) and \( VC_{n + 1}^{j} \) are considered the same vehicle if they satisfy the following constraints:

$$ \left\{ {\begin{array}{*{20}c} {\left| {A_{n}^{i} - A_{n + 1}^{j} } \right| < Threshold_{A} } \\ {\sqrt {\left( {\tilde{C}_{n,x}^{i} - C_{n + 1,x}^{j} } \right)^{2} + \left( {\tilde{C}_{n,y}^{i} - C_{n + 1,y}^{j} } \right)^{2} } < Threshold_{C} } \\ \end{array} } \right. $$
(14)

where \( A_{n}^{i} \) and \( A_{n + 1}^{j} \) are the areas of the detected vehicles in the nth and (n + 1)th frames.
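Equations (13) and (14) translate into a simple prediction-and-matching test; the two threshold values in the sketch below are illustrative assumptions, not the ones used in the experiments.

```python
import numpy as np

def predict_barycenter(c, v):
    """Barycenter predicted for the (n+1)th frame from the average motion vector, Eq. (13)."""
    return (c[0] + v[0], c[1] + v[1])

def is_same_vehicle(area_n, area_np1, c_pred, c_np1,
                    threshold_a=400.0, threshold_c=15.0):
    """Matching constraints of Eq. (14): similar area and predicted barycenter
    close to the observed one (both thresholds are assumed values)."""
    area_ok = abs(area_n - area_np1) < threshold_a
    dist = np.hypot(c_pred[0] - c_np1[0], c_pred[1] - c_np1[1])
    return area_ok and dist < threshold_c
```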

Occlusion masks are estimated images composed of the moving-vehicle regions occluded by other vehicles or non-vehicle objects. The detected vehicles are tracked in both the adaptively created occlusion masks and the captured images.

The proposed occlusion reasoning model is described in Fig. 5. In the nth frame, the detected vehicles and the predicted vehicles are \( \left\{ {VC_{n}^{1} ,VC_{n}^{2} ,VC_{n}^{3} , \cdots ,VC_{n}^{i} } \right\} \) and \( \left\{ {\widetilde{VC}_{n}^{1} ,\widetilde{VC}_{n}^{2} ,\widetilde{VC}_{n}^{3} , \cdots ,\widetilde{VC}_{n}^{i} } \right\} \), respectively. First, the predicted vehicle \( \widetilde{VC}_{n}^{i} \) is matched with all of the vehicles in the (n + 1)th frame using Eq. (14). If \( \widetilde{VC}_{n}^{i} \) matches none of the vehicles in the (n + 1)th frame, it is checked against a preset exiting region, which is determined by the barycenter position of \( VC_{n}^{i} \), the vehicle area \( A_{n}^{i} \) and the mean motion vector. If the unmatched vehicle lies in the exiting region, the vehicle \( VC_{n}^{i} \) has moved out of the scene in the (n + 1)th frame. If not, it is assumed that the vehicle \( VC_{n}^{i} \) has moved into an occlusion in the (n + 1)th frame; therefore, an occlusion mask is created and the unmatched \( \widetilde{VC}_{n}^{i} \) is added to it. An occlusion mask is created for each occluded vehicle; that is, there is only one vehicle in each occlusion mask. Then, the position of the vehicle \( \widetilde{VC}_{n}^{i} \) in the occlusion mask is updated frame by frame according to its motion vector. Moreover, the vehicles in the occlusion masks are matched with all of the vehicles in the next frame. If a vehicle in an occlusion mask matches a vehicle in the next frame, it is assumed that the vehicle has moved out of the occlusion. Finally, the matched vehicle is removed together with its occlusion mask.

Fig. 5.

The workflow of the occlusion reasoning model.
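The control flow of the occlusion reasoning model might be organized as in the sketch below; the Vehicle record and the matches and in_exit_region helpers are hypothetical stand-ins for Eq. (14) and the exit-region test, introduced only to illustrate Fig. 5.

```python
from dataclasses import dataclass

@dataclass
class Vehicle:
    barycenter: tuple   # (x, y)
    velocity: tuple     # average motion vector (vx, vy)
    area: float

def predict(veh):
    """Barycenter predicted by the average motion vector, Eq. (13)."""
    return (veh.barycenter[0] + veh.velocity[0],
            veh.barycenter[1] + veh.velocity[1])

def reason_occlusions(tracked, detections, occlusion_masks, matches, in_exit_region):
    """tracked: {id: Vehicle} in frame n; detections: vehicles detected in frame n+1;
    occlusion_masks: {id: Vehicle} of currently occluded vehicles.
    matches(veh, det) implements Eq. (14); in_exit_region(pos) tests the preset
    exiting region."""
    for vid, veh in list(tracked.items()):
        if any(matches(veh, det) for det in detections):
            continue                                   # matched: normal NP-level tracking
        if in_exit_region(predict(veh)):
            del tracked[vid]                           # the vehicle has left the scene
        else:
            veh.barycenter = predict(veh)              # assume it entered an occlusion
            occlusion_masks[vid] = tracked.pop(vid)    # one mask per occluded vehicle

    # Occluded vehicles are propagated frame by frame by their motion vectors and
    # released as soon as they match a detection again.
    for vid, veh in list(occlusion_masks.items()):
        veh.barycenter = predict(veh)
        if any(matches(veh, det) for det in detections):
            tracked[vid] = occlusion_masks.pop(vid)    # the vehicle has reappeared
    return tracked, occlusion_masks
```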

5 Simulation Results

The proposed method is evaluated on traffic sequences. In a typical sequence, a vehicle being overtaken by another vehicle is partially occluded, then seriously or fully occluded, and finally reappears in the scene. The image sequences used in the experiments are captured by a stationary camera and by live traffic cameras, which are available at http://vedio.dot.ca.gov/.

Experimental results are shown in Fig. 6. The results show that the proposed method performs better than other vehicle detection algorithms: background subtraction [18], adaptive background subtraction [19], frame difference [20] and the bidirectional multiplication method [21].

Fig. 6.

Moving region detection. (a) Original image. (b) Moving region detected by background subtraction. (c) Moving region detected by adaptive background subtraction. (d) Moving region detected by frame difference. (e) Moving region detected by the bidirectional multiplication method. (f) Moving region detected by the proposed method.

In Fig. 7, the vehicles are accurately tracked in the absence of occlusion. A typical occlusion is used to demonstrate the effectiveness of the proposed framework: in Fig. 8, a vehicle is overtaken by another vehicle and becomes occluded. In Fig. 9, the vehicles are tracked by the mean-shift algorithm alone, which is not effective when a vehicle is seriously occluded.

Fig. 7.

Vehicles tracking by the proposed method without occlusion.

Fig. 8.

The proposed method handling of the vehicle occlusion.

Fig. 9.

Mean-shift algorithm handling of the vehicle occlusion.

The proposed framework is quantitatively evaluated on real-world monocular traffic sequences. Traffic sequences containing un-occluded and occluded vehicles are used in our experiments, and the results are shown in Table 1. For partial occlusion, the handling rate of the proposed framework is 85.3 %; failures mainly occur when the occluded vehicles have the same color. For full occlusion, the handling rate is 84.6 %, and tracking errors mainly appear when vehicles are seriously or fully occluded for a long time. The average processing times of the methods proposed by Faro [22], Zhang [23] and Qing [14] and of our method are shown in Table 2. The image sequences are captured by live traffic cameras, available at http://vedio.dot.ca.gov/. From the experiments, we can see that the proposed method achieves a good balance between vehicle counting accuracy and processing time.

Table 1. Quantitative evaluation of the proposed framework
Table 2. Comparison of the average processing times for 3-min-long videos

The mean processing time of the proposed method for a 3-min-long video is 236.3 s, whereas Faro’s method and Zhang’s method reach average processing times of about 247 s and 280.3 s, respectively. Qing’s method reaches an average processing time of about 223.7 s; however, it cannot handle serious or full occlusion.

In Fig. 10, the dashed line is the un-occluded vehicle, the solid line shows vehicle 2, which was occluded by vehicle 1, and the asterisk line is the position estimated by the proposed method. Vehicle 2 was occluded at the 19th frame and reappeared at the 30th frame. In Fig. 8, the position of vehicle 2 is estimated by the occlusion reasoning model from the 1340th to the 1360th frame. Figure 10 demonstrates that the positions estimated by the proposed method are accurate.

Fig. 10.

Real position and estimated position during a tracking.

6 Conclusions

This paper presents a novel two-level framework for handling vehicle occlusion, which consists of an NP level and an SF level. On the NP level, color information is selected as the observation model: un-occluded vehicles are tracked in parallel by the mean-shift algorithm, and partially occluded vehicles are separated and tracked according to their different colors. If the vehicles have the same color, the NP level fails to handle the occlusion and the framework resorts to the SF level. On the SF level, occlusion masks are created, and the occluded vehicles are tracked by the occlusion reasoning model.

The quantitative evaluation shows that 85.3 % of partial occlusions can be correctly tracked, and serious and full occlusions can be handled efficiently on the SF level. However, the proposed framework fails to handle occlusions between vehicles of the same color, because only color information is used as the observation model in the mean-shift algorithm. Future work will exploit additional features to handle occlusion tracking when same-colored vehicles are occluded.