Detection and Tracking of Moving Pedestrians with a Small Unmanned Aerial Vehicle

Small unmanned aerial vehicles (SUAVs), or drones, are very useful for visual detection and tracking due to their efficiency in capturing scenes. This paper addresses the detection and tracking of moving pedestrians with an SUAV. The detection step consists of frame subtraction, followed by thresholding, morphological filtering, and false alarm reduction that takes into account the true size of the targets. The center of each detected area is input to the subsequent tracking stage. Interacting multiple model (IMM) filtering estimates the state vectors and covariance matrices using multiple modes of Kalman filtering. In the experiments, a dozen people and one car are captured by a stationary drone above the road. The Kalman filter and the IMM filter with two or three modes are compared in terms of state estimation accuracy. The root-mean-square errors (RMSEs) of position and velocity are obtained for each target and show good accuracy in detecting and tracking the target position; the average detection rate is 96.5%. When the two-mode IMM filter is used, the minimum average position and velocity RMSEs obtained are around 0.8 m and 0.59 m/s, respectively.


Introduction
Recently, the use of small/miniature unmanned aerial vehicles (SUAVs) or drones has increased for a variety of applications. SUAVs range from micro air vehicles to man-portable UAVs, classified by their weight or size [1]. The SUAV is cost-effective for capturing aerial scenes. The camera can be easily mounted and manipulated in order to capture a scene of interest at a long distance; however, the computational resources of the drone are often too limited to process high-resolution video sequences in real time.
Visual object detection has been studied with various methods [2]. Methods based on background subtraction or frame difference were studied in [3][4][5][6]. Gaussian mixture modeling (GMM) was used to analyze the background and target regions [3,4]. In [5], the background was subtracted under a Gaussian mixture assumption, followed by a morphological filter. Long-range moving objects were detected by a drone in [6]. Visual tracking is of intense interest, with the development of digital camera and image processing technologies [7][8][9][10][11]. Various experimental studies were surveyed in [7]. A particle filter was utilized with background subtraction in [8]. Deep learning-based visual tracking was researched in [9]. In [10,11], supervised learning and reinforcement learning were adopted for visual tracking, respectively. Vision-based target tracking with UAVs was researched in [12][13][14][15]. Tracking of closely located objects was performed with feature matching and multi-inertial sensing data in [12]. Pedestrians were tracked by template matching in [13]. Small animals were tracked with a freely moving camera in [14]. A moving ground target was tracked by a UAV in areas dense with obstacles in [15].
Tracking can be performed by means of a consecutive estimation of the target state, such as the position, velocity, and acceleration [16]. The Kalman filter is known to be optimal under the independent Gaussian noise assumption in estimating the target's dynamic state in real time [17]. The interacting multiple model (IMM) filter can handle multiple targets with different maneuvers, because it can switch the target's dynamics between multiple modes [18]. The IMM filter was researched with an unscented Kalman filter (UKF) in [19]. The effect of the multi-modal approach on highly maneuvering targets was emphasized in [20].
Another consideration for multiple target tracking is data association, a method to assign each measurement to an established target, a new target, or a false alarm. The Bayesian data association approach, probabilistic data association (PDA), calculates the probabilities of association between the target and the measurement. It has been extended to joint probabilistic data association (JPDA) to handle multiple targets [21]. Another Bayesian approach, multiple hypothesis tracking (MHT) [22], requires hypothesis reduction techniques to limit its computational complexity, which increases exponentially. A non-Bayesian data association approach, N-dimensional (frame) assignment, was developed in [23].
In this paper, we address the detection and tracking of multiple moving pedestrians by an SUAV or drone. Visual detection is performed through frame subtraction, with thresholding, a dilation operation, and false alarm removal [24,25]. Each frame is subtracted from a past frame, separated by a constant interval. Then, thresholding generates a binary image and dilation is applied to the binary image to produce candidate target regions. Finally, false target regions are removed based on the known size of the real object. The centroids of the final region of interest (ROI) windows are considered x and y positions, which are fed to the next tracking stage as measurements. This detection approach requires neither an intense training process nor a heavy computational burden. Therefore, it is suitable for autonomous stand-alone aerial video surveillance systems with a drone, which have limited computational resources.
For state estimation, the IMM filter estimates the state of the target and the covariance matrix. Nearly constant velocity (NCV) models with two different covariance matrices of the process noise are assumed for the dynamic states of the target [16]. For data association, a gating process excludes measurements outside the validation region of each target. The nearest measurement-to-track association scheme assigns one measurement to the closest track based on the statistical distance of the residual. This nearest neighbor (NN) approach is efficient for the visual tracker because false measurements rarely appear in the area of a target of interest. It is assumed that the measurement from the target in the next frame is closest to the predicted state of the target.
In the experiments, a total of 13 moving pedestrians and one car are captured at a height of 15 m by a drone. Some people are clustered as one target during detection, thus a total of 10 tracks are established by the Kalman filter (an IMM filter with one mode) and the IMM filter with two or three modes. The RMSEs of position and velocity are obtained and compared between the filters, showing that the dynamic states are tracked with good accuracy; the average detection rate is 96.5%, and the minimum position and velocity RMSEs are around 0.8 m and 0.59 m/s, respectively, when the two-mode IMM filter is used.
The major contributions of this paper lie in the following: (1) we integrate visual detection based on image processing and target tracking derived from statistical estimation. In the literature, image-based detection and state estimation-based target tracking are often researched individually, but few studies integrate the two parts. It is noted that the proposed method instantly yields dynamic state estimates, such as position, velocity, and acceleration; (2) no massive training data are required for target detection and tracking. Thus, this method speeds up processing with fewer computational resources. Drones have limited computing power, memory, bandwidth, and battery, so a small computational load is essential for a drone system; (3) a practical solution is proposed for autonomous stand-alone aerial surveillance. The SUAV can move to any location where CCTV cameras cannot be installed and hover or maintain its position. It is very low cost and can be operated by non-experts. Thus, it is useful for combat missions, counter-terrorist operations, or search and rescue in military or commercial use. Figure 1 illustrates fully autonomous stand-alone aerial video surveillance with an SUAV. The SUAV continuously monitors human movement within the field of view of the attached camera at a certain altitude. If any threat is detected, an alert is sent to the authorities.
The remainder of the paper is organized as follows. Moving pedestrian detection is discussed in Section 2. Multiple target tracking with the IMM filter is presented in Section 3. Section 4 demonstrates the experimental results. The conclusion follows in Section 5.

Object Detection with Frame Subtraction
A current frame is subtracted from a past frame at a constant interval. A thresholding step follows to generate a binary image as:

B(m, n; k) = 1 if |I(m, n; k) − I(m, n; k − k_d)| > θ_T, and 0 otherwise, for 1 ≤ m ≤ M, 1 ≤ n ≤ N,

where I(m, n; k) and I(m, n; k − k_d) are the kth frame and the (k − k_d)th frame, respectively, k_d is a constant interval for frame subtraction, θ_T is a thresholding value, and M and N are the pixel sizes in the x and y directions, respectively. After this, a morphological filter, dilation, is applied to the binary image to enlarge the segmented regions. The dilation operation is defined as [26]:

B ⊕ D = ∪_{l ∈ D} B_l,

where D is the structuring element for dilation, B_l denotes B translated by l, and l denotes an integer offset less than the image size. All resulting regions are considered candidate target regions. At the last stage of detection, false target regions are removed as:

keep region O_i only if θ_s ≤ |O_i| ≤ θ_f,

where O_i is the i-th region, and θ_s and θ_f are the minimum and the maximum region sizes, respectively; they are determined based on the true size of the target. The center of each target region is considered a measured position for target tracking in the next section. Figure 2 is the block diagram of moving object detection.
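The detection pipeline above can be sketched as follows. This is a minimal Python/NumPy illustration (the paper's implementation was in MATLAB); the function names are ours, and a 4-connected structuring element is assumed for the dilation step.

```python
# Sketch of the detection pipeline: frame subtraction, thresholding,
# dilation, and size-based false alarm removal. Illustrative names;
# a cross-shaped (4-connected) structuring element is an assumption.
import numpy as np

def dilate(binary):
    # Binary dilation with a 4-connected structuring element
    out = binary.copy()
    out[1:, :] |= binary[:-1, :]
    out[:-1, :] |= binary[1:, :]
    out[:, 1:] |= binary[:, :-1]
    out[:, :-1] |= binary[:, 1:]
    return out

def detect(frame_k, frame_past, theta_T=30, theta_s=30, theta_f=1200):
    # Frame subtraction followed by thresholding -> binary image
    binary = np.abs(frame_k.astype(int) - frame_past.astype(int)) > theta_T
    dilated = dilate(binary)
    # 4-connected component labeling by flood fill
    labels = np.zeros(dilated.shape, dtype=int)
    centers, label = [], 0
    for seed in zip(*np.nonzero(dilated)):
        if labels[seed]:
            continue
        label += 1
        stack, pixels = [seed], []
        labels[seed] = label
        while stack:
            m, n = stack.pop()
            pixels.append((m, n))
            for mm, nn in ((m - 1, n), (m + 1, n), (m, n - 1), (m, n + 1)):
                if (0 <= mm < dilated.shape[0] and 0 <= nn < dilated.shape[1]
                        and dilated[mm, nn] and not labels[mm, nn]):
                    labels[mm, nn] = label
                    stack.append((mm, nn))
        # Keep regions whose size matches the known true target size
        if theta_s <= len(pixels) <= theta_f:
            px = np.array(pixels, dtype=float)
            centers.append(tuple(px.mean(axis=0)))  # centroid -> measurement
    return centers
```

The returned centroids become the position measurements fed to the tracking stage.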


System Modeling
The dynamic state of the target is modeled as a nearly constant velocity (NCV) model; the target's maneuvering is modeled by the uncertainty of the process noise, which is assumed to follow a Gaussian distribution. The following is the discrete state equation of target t:

x^t(k) = F(Δ) x^t(k − 1) + q(Δ) v(k), t = 1, …, N_t,

where x^t(k) = [x, ẋ, y, ẏ]^T is the state vector of target t at frame k, composed of the positions and velocities in the x and y directions, v(k) = [v_x, v_y]^T is a process noise vector composed of Gaussian white noise in the x and y directions, N_t is the number of targets at frame k, Δ is the sampling interval, and F(Δ) and q(Δ) are the transition and noise gain matrices, respectively. They are defined as:

F(Δ) = [ 1 Δ 0 0; 0 1 0 0; 0 0 1 Δ; 0 0 0 1 ],  q(Δ) = [ Δ²/2 0; Δ 0; 0 Δ²/2; 0 Δ ].

The filter modes of the IMM filter are set up with different covariance matrices of v as

Q_j = diag(σ²_xj, σ²_yj), j = 1, …, M,

where M is the number of filter modes. The following is the measurement equation of target t:

z^t(k) = H x^t(k) + w(k),

where z^t(k) is the measurement vector of target t, composed of the positions in the x and y directions, and w(k) is a measurement noise vector composed of Gaussian white noise in the x and y directions. It is assumed that the covariance matrix of w(k) is R = diag(r²_x, r²_y), and H is the measurement matrix, defined as:

H = [ 1 0 0 0; 0 0 1 0 ].
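The NCV transition and noise-gain matrices can be built directly from the definitions above. A short Python/NumPy sketch (illustrative names; the paper used MATLAB):

```python
import numpy as np

def ncv_matrices(delta):
    # State x = [x, vx, y, vy]; transition F and noise gain q of the NCV model
    F = np.array([[1.0, delta, 0.0, 0.0],
                  [0.0, 1.0,   0.0, 0.0],
                  [0.0, 0.0, 1.0, delta],
                  [0.0, 0.0, 0.0, 1.0]])
    q = np.array([[delta**2 / 2, 0.0],
                  [delta,        0.0],
                  [0.0, delta**2 / 2],
                  [0.0, delta       ]])
    return F, q

def process_noise_cov(q, sigma_x, sigma_y):
    # Q_j = q diag(sigma_x^2, sigma_y^2) q^T for IMM mode j
    return q @ np.diag([sigma_x**2, sigma_y**2]) @ q.T
```

With Δ = 0.033 s (30 fps) and the two mode noise levels 0.6 and 1 m/s² used in the experiments, this yields the two mode-dependent process noise covariances.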

Multi-Mode Interaction
The state vectors and covariance matrices of the IMM mode filters at the previous frame k − 1 are mixed to generate the initial state vectors and covariance matrices for each IMM mode filter at the current frame k:

x̂^t_0j(k − 1|k − 1) = Σ_{i=1}^{M} μ^t_{i|j}(k − 1) x̂^t_i(k − 1|k − 1),

P^t_0j(k − 1|k − 1) = Σ_{i=1}^{M} μ^t_{i|j}(k − 1) { P^t_i(k − 1|k − 1) + [x̂^t_i − x̂^t_0j][x̂^t_i − x̂^t_0j]^T },

with mixing probabilities μ^t_{i|j}(k − 1) = p_ij μ^t_i(k − 1) / Σ_{l=1}^{M} p_lj μ^t_l(k − 1),

where x̂^t_i(k − 1|k − 1) and P^t_i(k − 1|k − 1) are, respectively, the state vector estimate and the covariance matrix of mode i at the previous frame, μ^t_i(k − 1) is the i-th mode probability of target t, and p_ij is the mode transition probability from mode i to mode j.
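The mixing step can be sketched as below; a Python/NumPy illustration under the standard IMM mixing equations, with illustrative names.

```python
import numpy as np

def imm_mix(x_prev, P_prev, mu_prev, p_trans):
    # x_prev: (M, n) per-mode state estimates, P_prev: (M, n, n) covariances,
    # mu_prev: (M,) mode probabilities, p_trans[i, j] = p_ij
    M, n = x_prev.shape
    c = p_trans.T @ mu_prev                          # c_j = sum_i p_ij mu_i
    mix = (p_trans * mu_prev[:, None]) / c[None, :]  # mix[i, j] = mu_{i|j}
    x0 = np.einsum('ij,in->jn', mix, x_prev)         # mixed initial states
    P0 = np.zeros_like(P_prev)
    for j in range(M):
        for i in range(M):
            d = (x_prev[i] - x0[j])[:, None]
            P0[j] += mix[i, j] * (P_prev[i] + d @ d.T)  # spread-of-means term
    return x0, P0, c
```

The returned c is the vector of predicted mode probabilities, reused in the mode probability update.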

Mode Matched Kalman Filtering
A Kalman filter is run for each IMM mode. The first step is to predict the state and covariance of each target:

x̂^t_j(k|k − 1) = F(Δ) x̂^t_0j(k − 1|k − 1),  P^t_j(k|k − 1) = F(Δ) P^t_0j(k − 1|k − 1) F(Δ)^T + q(Δ) Q_j q(Δ)^T.

Next, the residual covariance S^t_j(k) and the filter gain W^t_j(k) are, respectively, obtained as:

S^t_j(k) = H P^t_j(k|k − 1) H^T + R,  W^t_j(k) = P^t_j(k|k − 1) H^T S^t_j(k)^{-1}.
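The prediction step for one mode follows directly; a Python/NumPy sketch (illustrative names):

```python
import numpy as np

def kf_predict(x, P, F, Q, H, R):
    # Prediction, residual covariance S, and filter gain W for one IMM mode
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    S = H @ P_pred @ H.T + R             # residual (innovation) covariance
    W = P_pred @ H.T @ np.linalg.inv(S)  # filter gain
    return x_pred, P_pred, S, W
```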

Measurement Gating and Data Association
Measurement gating is a pre-processing step of data association that reduces the number of candidate measurements. Let Z(k) be the set of measurement vectors detected at frame k:

Z(k) = { z_1(k), …, z_m(k)(k) },

where m(k) is the number of measurements at frame k. Measurement gating is chi-squared hypothesis testing under the assumption of Gaussian measurement residuals. Thus, a set of valid measurements for target t and mode j is obtained as:

Z^t_j(k) = { z_i(k) : [z_i(k) − H x̂^t_j(k|k − 1)]^T S^t_j(k)^{-1} [z_i(k) − H x̂^t_j(k|k − 1)] ≤ γ },

where γ is the gating size. The NN rule is adopted to associate a measurement with a track by minimizing the statistical norm of the residual:

z^t_j(k) = argmin_{z ∈ Z^t_j(k)} [z − H x̂^t_j(k|k − 1)]^T S^t_j(k)^{-1} [z − H x̂^t_j(k|k − 1)],

and m^t_j(k) is the number of candidate measurements that fall in the validation region for target t and mode j.
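The gating and nearest-neighbor selection can be sketched as follows; a Python/NumPy illustration with illustrative names, returning both the associated measurement index and the gated set.

```python
import numpy as np

def gate_and_associate(z_all, z_pred, S, gamma):
    # Chi-squared gating of measurements, then nearest-neighbor association;
    # z_all is a list of 2-D measurement vectors, z_pred = H x_hat(k|k-1)
    S_inv = np.linalg.inv(S)
    d2 = np.array([(z - z_pred) @ S_inv @ (z - z_pred) for z in z_all])
    valid = np.nonzero(d2 <= gamma)[0]           # validation region
    if valid.size == 0:
        return None, valid                       # no valid measurement
    return valid[np.argmin(d2[valid])], valid    # NN by statistical distance
```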

State Estimate and Covariance Update
The state estimate and the covariance matrix of each mode are updated as:

x̂^t_j(k|k) = x̂^t_j(k|k − 1) + W^t_j(k) ν^t_j(k),  ν^t_j(k) = z^t_j(k) − H x̂^t_j(k|k − 1),
P^t_j(k|k) = P^t_j(k|k − 1) − W^t_j(k) S^t_j(k) W^t_j(k)^T.

If m^t_j(k) is equal to zero, i.e., no measurement exists in the validation region, the state estimate and the covariance become the predictions of the state and the covariance:

x̂^t_j(k|k) = x̂^t_j(k|k − 1),  P^t_j(k|k) = P^t_j(k|k − 1).

The mode probability is updated as:

μ^t_j(k) ∝ c^t_j(k) N(ν^t_j(k); 0, S^t_j(k)),  c^t_j(k) = Σ_{i=1}^{M} p_ij μ^t_i(k − 1),

where N denotes the Gaussian probability density function. If no measurement exists in the validation region, the mode probability becomes the predicted mode probability:

μ^t_j(k) = c^t_j(k).

Finally, the state vector and covariance matrix of each target are combined over the modes:

x̂^t(k|k) = Σ_j μ^t_j(k) x̂^t_j(k|k),  P^t(k|k) = Σ_j μ^t_j(k) { P^t_j(k|k) + [x̂^t_j(k|k) − x̂^t(k|k)][x̂^t_j(k|k) − x̂^t(k|k)]^T }.

These procedures repeat until the track is terminated. Figure 3 is the block diagram of moving object tracking. A track is terminated when it continuously fails to update its state with validated measurements for a certain number of frames; note that when there is no measurement in the validation region, the track still updates its state with the predictions above. A terminated track is also considered false if its number of updates with validated measurements is too small; that is, a true target is assumed to generate at least a certain number of validated measurements.
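The per-mode update and the mode probability update can be sketched as below; a Python/NumPy illustration with illustrative names, where mu_pred stands for the predicted mode probabilities c_j.

```python
import numpy as np

def kf_update(x_pred, P_pred, z, H, S, W):
    # State and covariance update with the associated measurement
    nu = z - H @ x_pred                  # measurement residual
    x = x_pred + W @ nu
    P = P_pred - W @ S @ W.T
    return x, P, nu

def update_mode_prob(mu_pred, nus, Ss):
    # mu_j(k) proportional to c_j * N(nu_j; 0, S_j), normalized over modes;
    # with no valid measurement, mu(k) stays at the predicted mu_pred
    like = np.array([
        np.exp(-0.5 * nu @ np.linalg.inv(S) @ nu)
        / np.sqrt((2 * np.pi) ** len(nu) * np.linalg.det(S))
        for nu, S in zip(nus, Ss)])
    mu = mu_pred * like
    return mu / mu.sum()
```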


Performance Evaluation
Several metrics are used for performance evaluation: the position error, the velocity error, and the RMSEs of position and velocity. The position error of target t at frame k is obtained in the x and y directions, respectively, as:

e^t_x(k) = x̂^t(k) − x^t_true(k),  e^t_y(k) = ŷ^t(k) − y^t_true(k),

where x^t_true(k) and y^t_true(k) are the ground truth positions of target t in the x and y directions, respectively, and N_t and N_k are, respectively, the total number of targets and the total number of frames. The ground truth positions of the targets are obtained manually in each scene. The RMSE of position is obtained as:

RMSE^t_pos = sqrt( (1 / (K_t(f) − K_t(s) + 1)) Σ_{k=K_t(s)}^{K_t(f)} [e^t_x(k)² + e^t_y(k)²] ),

where K_t(s) and K_t(f) are the first and last frames in which target t is estimated. The velocity error of each target is obtained in the x and y directions analogously, where the ground truth of the velocity is approximated by a symmetric finite difference of the manually obtained positions over δ frames:

ẋ^t_true(k) ≈ [x^t_true(k + δ) − x^t_true(k − δ)] / (2δΔ),

where δ is set heuristically to the value producing the minimum velocity error in the experiments. The RMSE of velocity is obtained in the same way as the RMSE of position, with the velocity errors in place of the position errors.
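The evaluation metrics can be sketched as follows; a Python/NumPy illustration assuming the RMSE combines the x and y errors and that the velocity ground truth is a symmetric finite difference over δ frames, as described above.

```python
import numpy as np

def position_rmse(est, truth):
    # RMSE over the frames K_t(s)..K_t(f) in which target t is estimated;
    # est and truth are (K, 2) arrays of x-y positions (or velocities)
    err = est - truth
    return float(np.sqrt(np.mean(np.sum(err**2, axis=1))))

def approx_velocity_truth(pos, delta, dt):
    # Symmetric finite difference over +/- delta frames (delta is heuristic)
    return (pos[2 * delta:] - pos[:-2 * delta]) / (2 * delta * dt)
```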

Experimental Set-Up
A drone (DJI Phantom 4 Advanced) was used to capture moving objects. The drone, with an attached gimbal and camera, is shown in Figure 4. The gimbal can tilt the camera within a 120° range (−90° to 30°). The camera pitch was set to −30° during the experiments. The drone ascended to a height of 15 m and stayed still as a stationary sensor (platform), as shown in Figure 5. It maintained its position while capturing video sequences of moving objects in the campus area. Figure 6 shows the take-off/landing position; it was taken by a camera pointing directly downwards (−90° pitch) at a height of 100 m, to better visualize nearby buildings and structures. Figure 7a shows a sample frame extracted from the video. The size of one frame was 4096 × 2160 pixels; the frame was reduced to 20% of its size for efficient image processing, and the front and central part was cropped to 550 × 300 pixels, as shown in Figure 7b.

Scenario Description
The drone captured a video sequence at 30 frames per second (fps) at a height of 15 m. A total of 13 people and 1 car were captured over 550 frames (16.5 s). Figure 8a shows Targets 1-3 at the sixth frame, Figure 8b shows Targets 1-7 at the 350th frame, and Figure 8c shows Targets 3-10 at the 550th frame. Table 1 shows the duration, moving direction, and components of each target. All targets were composed of one person or one car, except for Targets 3 and 4, which were composed of two and three people, respectively. Target 1 was partly composed of two people, because the person in Target 2 merged into Target 1 after the 237th frame.

Detection of Moving Objects
The detection and tracking methods were implemented in MATLAB (version 8.5) on a PC (Intel i5-7500). The interval k_d in Equation (1) was set at 5; thus, the detection process was applied from the sixth frame. θ_T in Equation (1) was set at 30; D in Equation (3) was set at [1]_1×1; and θ_s and θ_f in Equation (4) were set at 30 and 1200 pixels, respectively. Figure 9 shows the detection process for the 6th, 350th, and 550th frames. The first row shows the detection results of Figure 8a; the second and third rows show the results of Figure 8b,c, respectively. Figure 9a shows the binary images after frame subtraction with thresholding. Assuming that the size of the targets is known, Equations (2) and (4) were applied to Figure 9a, resulting in Figure 9b. Figure 9c shows the target regions with rectangular windows. All targets in the three frames were detected, with one false alarm in the 550th frame, as shown in the third row of Figure 9c. Table 2 shows the detection rates of the ten targets; the average detection rate was 96.5%. The total number of false alarms detected was 638; thus, the false alarm rate was 1.17 per frame. The supplementary material for object detection, Video S1: Object Detection (AVI format), is available online.
Figure 10 shows all the measured (detected) positions of the 10 targets, including false alarms. The sampling time Δ in Equation (5) was 0.033 s, since the frame rate was 30 fps. It was assumed that one pixel corresponds to 0.1 m. The standard deviations of the process and measurement noise of the two-mode IMM filter were set at σ_x1 = σ_y1 = 0.6 m/s², σ_x2 = σ_y2 = 1 m/s², and r_x = r_y = 0.5 m, respectively.
A track was initialized by a two-point initialization method with speed gating, which limited the maximum speed to 1 m/s. Figure 11a-j show the tracking results of Targets 1-10, respectively, and Figure 11k shows all trajectories in one frame. The supplementary material for target tracking, Video S2: Human Tracking (AVI format), is available online. A track was terminated if there were no updates for more than 40 consecutive frames. After termination, the track was considered false if the number of updates with validated measurements was less than 60; thus, a true target should be detected in at least 60 frames (2 s).
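The two-point initialization with speed gating can be sketched as below; a Python/NumPy illustration with illustrative names, where the initial velocity is taken from the difference of two consecutive measurements.

```python
import numpy as np

def two_point_init(z1, z2, dt, v_max=1.0):
    # Two-point track initialization with speed gating: position from the
    # latest measurement z2, velocity from the measurement difference
    v = (z2 - z1) / dt
    if np.linalg.norm(v) > v_max:
        return None                       # speed gate rejects the pair
    return np.array([z2[0], v[0], z2[1], v[1]])  # [x, vx, y, vy]
```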

Table 2. Detection results for each target.

Target     | Initial Frame | Final Frame | # of Frames | # of Detections | Detection Rate (%)
Target 1   | 6             | 515         | 510         | 510             | 100
Target 2   | 6             | 237         | 232         | 230             | 99
Target 3   | 6             | 550         | 545         | 545             | 100
Target 4   | 128           | 550         | 423         | 423             | 100
Target 5   | 223           | 550         | 328         | 242             | 74
Target 6   | 271           | 550         | 280         | 280             | 100
Target 7   | 315           | 550         | 236         | 236             | 100
Target 8   | 384           | 550         | 167         | 167             | 100
Target 9   | 416           | 550         | 135         | 130             | 96
Target 10  | 402           | 550         | 149         | 143             | 96
Avg.       | -             | -           | 326         | 316             | 96.5

Figure 12 shows the ground truth positions of the targets. Figure 13 shows the approximated ground truth of the velocity in the x and y directions, obtained by Equation (33); δ was set at 65, the value producing the least average velocity RMSE. Figures 14 and 15 show the position and velocity errors in Equations (30) and (32). Table 3 shows the RMSEs of position and velocity obtained from Equations (31) and (34), respectively. The detection rate of Target 5 was particularly low, at 74%. Target 5 was located far from the drone, as shown in Figures 8b and 11e; its relatively low apparent speed caused detections to be missed during frame subtraction.
The false alarm rate was 1.17 per frame. The false alarms were mostly generated when the drone was swayed by the wind or when objects passed through a complex background. Table 3 shows the RMSEs of position and velocity. The average position RMSE was about 0.8 m, and the average velocity RMSE was about 0.586 m/s (≈2.1 km/h) for the two-mode IMM filter; the minimum RMSEs were obtained when the two-mode IMM filter was used. It turns out that the process noise standard deviations (0.6 and 1 m/s²) were properly chosen, because similar results were obtained from the Kalman filter with the average standard deviation (0.8 m/s²). The two-mode IMM filter provided slightly better results than the Kalman filter, especially for Target 7 (the car), which maneuvered more than the other targets. It is noted that the IMM filter with three modes did not provide better results in this scenario.

Discussion
The position RMSE varied from 0.458 m for Target 5 to 1.284 m for Target 10. The average RMSE (about 0.8 m) was close to half the human height. Besides Target 7 (the car), Targets 9 and 10 generated higher position errors than the other targets; there were biases between the measurements and the position estimates. The velocity RMSE varied from 0.342 m/s (1.23 km/h) for Target 1 to 0.959 m/s (3.45 km/h) for Target 2. The speed of human movement is important because threats can be recognized from unexpected movements.

Conclusions
In this paper, several moving pedestrians and a car were captured by an SUAV. The objects were detected based on frame subtraction, and ten targets were tracked with the Kalman and IMM filters. Experimental results show that the moving objects were detected and tracked with good accuracy. The number of filter modes and the target dynamics of each mode, such as the process noise variance, should be determined properly to cope with the maneuvering of multiple targets.
For security and defense applications, the trajectories and states of targets can be transferred to a control tower in real time. This system is also suitable for people counting in a crowded area. Fully autonomous, stand-alone aerial video surveillance systems are very useful in commercial as well as military/government applications. In this work, the drone was fixed in the air as a stationary sensor (platform). Target tracking with a moving platform remains a subject for future study.