A Systematic Solution for Moving-Target Detection and Tracking While Only Using a Monocular Camera

This paper focuses on moving-target detection and tracking in three-dimensional (3D) space and proposes a visual target tracking system that uses only a two-dimensional (2D) camera. To quickly detect moving targets, an improved optical flow method with detailed modifications to the pyramid, warping, and cost volume network (PWC-Net) is applied. Meanwhile, a clustering algorithm is used to accurately extract the moving target from a noisy background. Then, the target position is estimated using a proposed geometrical pinhole imaging algorithm and a cubature Kalman filter (CKF). Specifically, the camera's installation position and intrinsic parameters are used to calculate the azimuth and elevation angles and the depth of the target from 2D measurements alone. The proposed geometrical solution has a simple structure and fast computational speed. Simulations and experiments verify the effectiveness of the proposed method.


Introduction
Moving-target detection and tracking is widely required by various applications, such as security systems, obstacle detection, and search-and-rescue missions [1][2][3][4]. Consequently, the topic has attracted much attention. Previous studies on moving-target detection and tracking can be divided into different methods according to the applied sensors, e.g., lidar-, sonar-, somatosensory-, and visual-based methods [5,6]. The lidar-based method is mainly used for indoor applications, where a certain number of reflective plates must be installed in the environment; the installation accuracy and locations are subject to strict requirements, so costs are high. The sonar-based method is typically used in underwater scenarios. The somatosensory-based method requires the targets to carry special sensors, which are often unavailable, especially in security applications. Compared with the above methods, the visual-based method is the most popular in civilian applications. Owing to the rapid development of camera sensors, the visual-based method has many advantages, e.g., low cost, a high signal-to-noise ratio, convenient transmission, and fast operation speed [7,8]. Therefore, it is an appropriate choice for moving-target detection and tracking, even for microtargets [9][10][11].
In computer vision, moving-target detection has two steps, i.e., removing redundant information and extracting the region with positional changes in the image sequence. In the 1970s, Jain R. et al. proposed a method, called the frame difference method, that detects moving targets by using scene changes between adjacent frames [12]. The main advantage of the frame difference method is its low computational complexity; thus, the algorithm is simple to implement and can adapt to different dynamic environments. Beyond 2D detection, the realization of 3D tracking of moving targets is fundamental in industrial control. In recent years, many visual devices have emerged to obtain the 3D spatial information of moving targets, such as the RealSense depth camera, the Kinect camera, and the Leap Motion somatosensory device [37]. However, such devices are usually expensive, have various limitations, and are mostly confined to academic research [38]. Obtaining 3D information at low cost with computer technology and an ordinary optical camera is therefore of great significance. The Kalman filter (KF) is the most commonly used tool for the 3D tracking of moving targets. For nonlinear motion, researchers improved the KF and proposed many suitable variants, such as the extended Kalman filter (EKF), unscented Kalman filter (UKF), quadrature Kalman filter (QKF), and cubature Kalman filter (CKF) [39][40][41].
To summarize, there are some solutions for moving-target detection and tracking based on visual images, but the limitations of their complexity and cost have not been fully addressed. Specifically, some key problems are listed below. (1) Noise seriously affects the detection of moving targets. (2) Most previous papers on visual-based target detection used a 3D camera, which suffers from inaccurate depth estimation and complex computation, while previous methods using only a monocular camera could hardly satisfy practical requirements. (3) Systematic solutions, including improvements in both target detection and estimation, have not been fully addressed. This paper achieves significant improvements over our previous work [42]. A systematic method is proposed for moving-target detection and tracking. The three key contributions of this paper are presented below.

• An improved optical flow method with modifications to the PWC-Net is applied, and the K-means and agglomerative clustering algorithms are improved. Thus, a moving target can be quickly detected and accurately extracted from a noisy background.
• A geometrical solution using pinhole imaging theory and the CKF algorithm is proposed that has a simple structure and fast computational speed.
• The proposed method was verified with sufficient simulation and experimental examples, which has significant value for practical applications.
This paper is organized as follows. Section 2 introduces some related previous methods and the problem formulation. The optical flow-based moving target detection and extraction method is developed in Section 3. Section 4 proposes the pinhole geometrical target state estimation algorithm in detail. A CKF algorithm is modified in Section 5 to further improve the estimated results from the geometrical solution. The experimental verification of the proposed method is provided in Section 6. Lastly, Section 7 gives the conclusion.

Pyramid Lucas-Kanade Optical Flow Method
The classical optical flow method comprises two parts: obtaining the basic optical flow constraint equation and calculating the optical flow value. To acquire the constraint equation, two assumptions should be satisfied: (i) the brightness of the same object in pictures taken at different time instants is uniform, and (ii) the positional change of the moving target between adjacent frames is small. We define the central position of a target in an image as $(x, y)$, and its gray value at time $t$ is $I(x, y, t)$. The target central point moves to position $(x + d_x, y + d_y)$ in the next frame after $\Delta t$, and the gray value becomes $I(x + d_x, y + d_y, t + \Delta t)$. On the basis of the first assumption that the brightness is uniform, the following equation holds:

$$I(x, y, t) = I(x + d_x, y + d_y, t + \Delta t). \quad (1)$$

On the basis of the second hypothesis, we can acquire the Taylor expansion of (1), which takes the following form:

$$I(x + d_x, y + d_y, t + \Delta t) = I(x, y, t) + \frac{\partial I}{\partial x} d_x + \frac{\partial I}{\partial y} d_y + \frac{\partial I}{\partial t} \Delta t + \varepsilon, \quad (2)$$

where $\varepsilon$ is the second-order infinitesimal term that can be ignored since $\Delta t \to 0$. Therefore, we have

$$\frac{\partial I}{\partial x} d_x + \frac{\partial I}{\partial y} d_y + \frac{\partial I}{\partial t} \Delta t = 0. \quad (3)$$

We define $u, v$ as the velocity components of the optical flow along the coordinate axes $x$ and $y$, with $u = d_x / \Delta t$ and $v = d_y / \Delta t$. $I_x = \partial I / \partial x$, $I_y = \partial I / \partial y$, and $I_t = \partial I / \partial t$ are the partial derivatives of the gray value of the image pixel along the $x$, $y$, and $t$ directions; thus, the following basic constraint equation of optical flow is obtained:

$$I_x u + I_y v + I_t = 0. \quad (4)$$

In order to obtain unique $u, v$, many calculation methods can be applied. The most popular one is the Lucas-Kanade optical flow method, which requires a further assumption as an extra constraint: neighboring pixels have the same motion. Specifically, the projections of neighboring points in 3D space are also nearest neighbors in the image. On the basis of this assumption, the equations of the neighborhood pixels can be stacked, and the optical flow of the regional central pixel can be obtained using the weighted least-squares method [42].
The Lucas-Kanade (LK) optical flow method requires that the three assumptions above be met simultaneously, with the corresponding constraints satisfied. Therefore, the LK optical flow method cannot be applied directly to large motions; otherwise, the calculated optical flow has large errors. The pyramid LK optical flow method is an appropriate solution [22]. In the pyramid LK optical flow method, as shown in Figure 1, we first establish the image pyramid, which includes image smoothing through a Gaussian filter and downsampling to reduce the image resolution. The resolution is halved at each layer. Thus, a large-displacement motion is gradually reduced and decomposed into several small-displacement motions suitable for the cumulative optical flow calculation, even for high-speed motion. In the layer-by-layer iterative solution, the optical flow is first calculated with the LK method at the top of the image pyramid. The optical flow calculated from the Layer L−1 image is then taken as the initial value for the optical flow estimation of the Layer L−2 image. This process is repeated until the optical flow at the bottom of the pyramid is calculated. The pyramid operations satisfy the three assumptions mentioned before, so the optical flow can be calculated accurately. For example, if the maximal pixel displacement the LK optical flow method can process is $d_{max}$, the maximal displacement the pyramid LK method can process becomes $d_{max}^{final} = (2^{L+1} - 1) d_{max}$. Thus, the optical flow calculation error for a large-displacement moving target is significantly reduced.
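The per-window least-squares step at the heart of the LK method can be sketched as follows. This is a minimal single-level illustration in NumPy; the function name and window size are our own choices, and the pyramid loop itself is omitted:

```python
import numpy as np

def lk_flow(I0, I1, x, y, win=7):
    """Estimate the optical flow (u, v) of pixel (x, y) between frames I0 and I1
    with one Lucas-Kanade least-squares step over a local window: stack the
    constraint Ix*u + Iy*v = -It for every pixel in the window and solve it."""
    h = win // 2
    # Spatial gradients (central differences) and temporal gradient.
    Iy, Ix = np.gradient(I0.astype(float))
    It = I1.astype(float) - I0.astype(float)
    sl = np.s_[y - h:y + h + 1, x - h:x + h + 1]
    A = np.stack([Ix[sl].ravel(), Iy[sl].ravel()], axis=1)  # n x 2 system
    b = -It[sl].ravel()
    # Least squares (uniform weights in this sketch): solve A^T A d = A^T b.
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d[0], d[1]  # u (x velocity), v (y velocity)
```

In the full pyramid scheme, this step would be run from the coarsest layer downward, with each layer's (upscaled) flow initializing the estimate of the next layer.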

Target Localization Using Monocular Vision
Previous 3D spatial localization methods using only a monocular camera were generally based on processing two images acquired from different positions. An example of locating one target using two images is shown in Figure 2. To explain this monocular visual-spatial localization method, we drew a 2D plane schematic diagram, as shown in Figure 3. The positions of the camera's optical center before and after the translation are $L_1$ and $L_2$; $L_1$ is the origin, the camera optical axis is along the $Z$ axis, and the direction perpendicular to the optical axis is the $X$ axis, which establishes a 2D rectangular coordinate system. $L_1$ and $L_2$ are located on the same coordinate axis, and the distance between them is $b$. When the camera's optical center is at position $L_1$, the projected position of point $P$ on the imaging plane is $P_1$; when it is at position $L_2$, the projected position is $P_2$. The other geometrical details are shown in Figure 3. On the basis of Figure 3, the following proportional relationships exist in similar triangles:

$$\frac{X_1}{f} = \frac{X}{Z}, \quad (5)$$

$$\frac{X_2}{f} = \frac{X - b}{Z}, \quad (6)$$

where $X_1$ and $X_2$ are the image-plane coordinates of $P_1$ and $P_2$, and $(X, Z)$ is the position of point $P$.

From Equations (5) and (6), we have

$$Z = \frac{fb}{X_1 - X_2}. \quad (7)$$

Thus, the target's depth is obtained from the information of the target in the image, the camera's translation distance, and the camera's focal length. When the depth is known, the other two coordinates of the target can be obtained as follows:

$$X = \frac{Z X_1}{f}, \quad Y = \frac{Z Y_1}{f}. \quad (8)$$

$X_1 - X_2$ is called the parallax, and $b$ is called the baseline. The above method is commonly used for monocular vision 3D localization. Because the horizontal movement of the camera keeps the coordinates along the $y$ axis constant, the rectification step can be omitted, and the image feature points can be matched directly. By slightly moving the monocular camera, the principle of 3D target localization becomes similar to that of a binocular camera. However, there are also some limitations. First, the monocular camera must be mobile so that images can be obtained from two places. Second, the computed target coordinates $X, Y, Z$ are affected by the parallax $X_1 - X_2$, so calculation errors accumulate. The limitations of directly using a binocular camera were introduced in the previous section.
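Under the similar-triangle relations above, the depth and the remaining coordinates follow directly. A minimal sketch (function and variable names are ours):

```python
def locate_from_parallax(x1, x2, y1, f, b):
    """Monocular two-view localization: depth Z = f*b / (x1 - x2), where
    x1 - x2 is the parallax and b the baseline; X and Y then follow from
    the pinhole proportions X = Z*x1/f and Y = Z*y1/f."""
    Z = f * b / (x1 - x2)
    return Z * x1 / f, Z * y1 / f, Z
```

Note how a small parallax in the denominator amplifies measurement error, which is exactly the error-accumulation limitation described above.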

Nonlinear Filter
The extended Kalman filter (EKF) is a classical nonlinear estimator that is widely applied to target tracking [43]. The EKF has three main steps, i.e., state prediction, Kalman gain calculation, and state correction, and its algorithmic structure is similar to that of the conventional linear KF. Since the linearization in the EKF uses only the first-order term of the Taylor expansion, its estimation performance for nonlinear target tracking is limited. Although the EKF extends the KF to nonlinear applications, it still has many disadvantages: the linearization error can hardly be avoided, and divergence problems often occur. An improved nonlinear filter, the cubature Kalman filter (CKF) [44], is theoretically among the closest approximations to the Bayesian filter. After strict mathematical derivation, the estimation accuracy and convergence of the CKF are guaranteed in theory. The CKF borrows the idea of the particle filter: a set of points with identical weights is selected for nonlinear function propagation according to the cubature criterion and the prior probability density distribution. Therefore, the calculation cost is low, there is no need to linearize the nonlinear function, the linearization error is eliminated, and the divergence problem of the EKF is also resolved.
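The cubature-point generation that distinguishes the CKF can be sketched as follows. This is a schematic fragment of the point-selection rule only, not the full filter:

```python
import numpy as np

def cubature_points(mean, cov):
    """Generate the 2n equally weighted cubature points of the CKF:
    x_i = mean +/- sqrt(n) * S e_i, where S is the Cholesky factor of cov.
    Each point carries weight 1/(2n)."""
    n = len(mean)
    S = np.linalg.cholesky(cov)
    units = np.concatenate([np.eye(n), -np.eye(n)], axis=0)  # +/- e_i directions
    return mean + np.sqrt(n) * units @ S.T                   # shape (2n, n)
```

Propagating each point through the nonlinear state or measurement function and averaging the results yields the predicted mean and covariance without any Jacobian, which is what removes the EKF's linearization error.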

Moving-Target Detection and Extraction
Moving-target tracking based on a monocular camera first requires accurately capturing a moving target in a video image. Therefore, this section proposes a moving-target detection method based on an improved PWC-Net, and a moving-target extraction method that combines the improved K-means and agglomerative clustering algorithm, the frame difference method, and a morphological operation. The diagram of the proposed detection and extraction method is shown in Figure 4. Grayscale processing converts color images into grayscale images, which can enhance and highlight image features. Because grayscale information is simply expressed, it can still reflect local and overall brightness and scene information with different digital values. This paper uses the weighted average method to complete the image processing: different weights are allocated to the three primary colors, i.e., red, green, and blue. Since people are most sensitive to green and red, the grayscale weights are typically set to special values that emphasize these channels [45,46]. Two groups of video images were collected, representing different moving targets and motion states in indoor and outdoor scenes. As shown in Figure 5, a basketball is falling and a target person is running, and the processed figures are used for further target extraction.
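Since the paper's exact weights are not reproduced here, the sketch below uses the commonly adopted luminance weights, which likewise emphasize green and red (an assumption, not necessarily the paper's values):

```python
import numpy as np

# Commonly used luminance weights (assumed here; the paper's exact values
# may differ). Green and red dominate because the eye is most sensitive
# to those channels.
W_R, W_G, W_B = 0.299, 0.587, 0.114

def to_gray(rgb):
    """Weighted-average grayscale conversion of an (H, W, 3) RGB image."""
    return W_R * rgb[..., 0] + W_G * rgb[..., 1] + W_B * rgb[..., 2]
```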

Wavelet Transform Threshold for Noise Elimination
Noise remains in an image after grayscale processing and impacts the final results. Since the effective signal and the noise have different characteristics after orthogonal wavelet decomposition, the wavelet transform threshold is applied to eliminate the noise. The purpose of this method is to find a suitable threshold that retains the appropriate wavelet coefficients. The applied wavelet transform threshold method has three steps. We define the actual signal as

$$f(t) = s(t) + e(t),$$

where $s(t)$ is the effective signal and $e(t)$ is the noise. The first step is to select the wavelet base. We selected the Haar wavelet and performed the orthogonal wavelet transform on the measured signal. The second step is to select the threshold and the threshold processing function that determine the retained wavelet transform coefficients. We selected the soft threshold processing function

$$\hat{w} = \begin{cases} \operatorname{sign}(w)\,(|w| - \lambda), & |w| \geq \lambda, \\ 0, & |w| < \lambda, \end{cases}$$

where $w$ is a wavelet coefficient and $\lambda$ is the selected threshold. To set an appropriate threshold, we used $\lambda = \sigma \sqrt{2 \ln N}$, where $N$ is the total number of wavelet coefficients of the actual signal after wavelet decomposition and $\sigma$ is the standard deviation of the noise signal. The third step is reconstructing the signal from the low-frequency and high-frequency coefficients after wavelet decomposition. The process of the wavelet transform threshold method is shown in Figure 6. The images in Figure 5 were then processed using the wavelet transform threshold method, and the results, shown in Figure 7, indicate improved performance.
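The soft-threshold function and the universal threshold described above can be written compactly. This sketch covers only the thresholding step; the wavelet decomposition and reconstruction themselves are omitted:

```python
import numpy as np

def soft_threshold(w, lam):
    """Soft thresholding: shrink |w| by lam, zero out coefficients below lam."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def universal_threshold(sigma, N):
    """Universal threshold lambda = sigma * sqrt(2 ln N) for N coefficients
    with noise standard deviation sigma."""
    return sigma * np.sqrt(2.0 * np.log(N))
```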

Contrast Limited Adaptive Histogram Equalization (CLAHE)
To further reduce the complexity of object extraction in the later optical flow step, we needed to enhance the image contrast. Since the images had already been processed in the previous two steps, i.e., grayscale processing and noise elimination, we directly utilized the CLAHE method [47] to improve them. The CLAHE algorithm restricts the histogram of each sub-block region to an appropriate range to overcome the problem of over-amplifying the noise. Referring to [48], the image processed by the CLAHE algorithm is shown in Figure 8.
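The contrast-limiting idea can be illustrated with a global (non-tiled) version: histogram mass above the clip limit is cut off and redistributed before the equalization CDF is built. This is a simplified sketch of the mechanism only, not the full tiled CLAHE with bilinear interpolation between sub-blocks:

```python
import numpy as np

def clip_limited_equalize(img, clip=0.01, bins=256):
    """Global histogram equalization with a clip limit on an image with
    values in [0, 1]: excess histogram mass above the limit is removed and
    redistributed uniformly, which caps the slope of the mapping and so
    limits noise amplification."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    hist = hist.astype(float) / hist.sum()
    limit = max(clip, 1.0 / bins)
    excess = np.sum(np.maximum(hist - limit, 0.0))
    hist = np.minimum(hist, limit) + excess / bins  # redistribute the excess
    cdf = np.cumsum(hist)
    idx = np.clip((img * (bins - 1)).astype(int), 0, bins - 1)
    return cdf[idx]
```

Full CLAHE applies this per sub-block and interpolates between the block mappings to avoid visible tile boundaries.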

Optical Flow Estimation by Improved PWC-Net
With the pre-processed images, we next needed to identify a moving object using the optical flow method before accurately extracting the target object. PWC-Net is a deep-learning optical flow estimation network proposed in 2018 [19]. However, the PWC-Net method suffers from a warping problem that results in doubled images, white space, ambiguity, and invalid information, as shown in Figure 9. To resolve the warping problem, we needed to detect the areas exhibiting these phenomena and eliminate them accurately. The asymmetric occlusion-aware feature matching module (AsymOFMM), which can learn occlusion masks, was proposed in 2020 [21]. AsymOFMM can predict an occluded area and filter out the useless information generated by warping, without additional supervision and at an almost negligible computational cost. AsymOFMM's overall structure on one layer of the feature pyramid in the improved PWC-Net is shown in Figure 10. The parameter training strategy of the improved PWC-Net was similar to that of PWC-Net. FlyingChairs was the basic training set, with an initial learning rate of 0.0001 and a batch size of 8 for 1.2 M iterations; the learning rate was halved when the number of iterations reached 0.4, 0.6, 0.8, and 1 M. The network was then fine-tuned on the FlyingThings3D dataset with an initial learning rate of 0.00001 and a batch size of 4 for 0.5 M iterations; the learning rate was halved when the number of iterations reached 0.2, 0.3, and 0.4 M. The loss function adopted the standard error measure of optical flow estimation, the end-point error (EPE):

$$EPE = \sqrt{(u - u_g)^2 + (v - v_g)^2},$$

where $u$ and $v$ are the components of each pixel's predicted optical flow in the transverse and longitudinal directions, and $u_g$ and $v_g$ are the corresponding components of the ground-truth optical flow.
As shown in Table 1, in experiments on the Sintel, KITTI 2012, and KITTI 2015 datasets, the improved PWC-Net based on AsymOFMM achieved substantial improvements on all datasets. In the table, AEPE refers to the average EPE over all effective pixels in the image, and FL-all refers to the percentage of abnormal optical flow values among all effective pixels. The calculation formula of AEPE is

$$AEPE = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} EPE_{ij},$$

where $m$ and $n$ are the numbers of pixels in the horizontal and vertical directions of the image, respectively. We directly used PWC-Net to process the images in Figure 8 and acquired the results in Figure 11.
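The EPE and AEPE measures above can be stated directly in code (a small sketch with our own function names):

```python
import numpy as np

def epe(u, v, ug, vg):
    """Per-pixel end-point error between predicted flow (u, v) and
    ground-truth flow (ug, vg)."""
    return np.sqrt((u - ug) ** 2 + (v - vg) ** 2)

def aepe(u, v, ug, vg):
    """Average end-point error over all m x n effective pixels."""
    return epe(u, v, ug, vg).mean()
```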
The improved PWC-Net was used for optical flow prediction on two-frame images. The two visualization results are shown in Figure 12. Next, we needed to extract the moving objects from the processed optical flow images.

Target Extraction
Next, we needed to accurately locate the position of the moving target in the 2D camera view from the processed optical flow images. However, in these images, the optical flow of the moving target was still mixed with some background edge values that interfered with the accuracy of extracting the moving target, as shown in the blue areas in Figure 12. This paper proposes a target extraction strategy using an improved K-means and agglomerative clustering algorithm combined with a frame difference method and a morphological operation to eliminate the useless background optical flow. In addition, to improve efficiency, we first set a threshold on the optical flow values to filter the dynamic information of interest from the optical flow image in Figure 12. The optical flow after this simple threshold filtering is shown in Figure 13 as an example. Then, the proposed two-step extraction method was applied.


Improved K-Means and Agglomerative Clustering Algorithm
On the basis of improvements to the K-means and agglomerative clustering algorithms, the proposed algorithm clusters the optical flow samples in two stages while retaining the advantage of not having to specify the number of categories before clustering.
At the first stage, the key difference between the classical K-means algorithm and the proposed improved method lies in the initialization of the K cluster centers. Specific conditions were added to the selection of cluster centers to keep them sufficiently far apart, avoiding the adverse effects of completely random initialization. First, the corresponding sample features are established for each filtered optical flow vector in the optical flow field for similarity measurement. Specifically, suppose that, in the pixel coordinate system, the position of the j-th optical flow vector in the image is $(x_j, y_j)$, its optical flow amplitude is $A_j$, and its optical flow direction is $D_j$. Then, we define the four-dimensional sample feature $X_j$ of the j-th optical flow vector as

$$X_j = (x_j, y_j, A_j, D_j)^T.$$

To simplify the calculation, normalization processing is required so that each feature dimension lies in a comparable range. The construction of the other optical flow vector sample features is completed similarly. Second, the Euclidean distance was selected as the similarity measure of the first stage. Specifically, we define one of the categories in the clustering process as $S_k$ with cluster center sample feature $C_k$. Then, the similarity between the j-th optical flow vector $X_j$ and cluster category $S_k$ is expressed as the Euclidean distance between the sample feature $X_j$ and the cluster center sample feature $C_k$:

$$d(X_j, C_k) = \| X_j - C_k \|_2.$$

A small distance denotes a high similarity between the optical flow vector and a class. Lastly, the number of clusters and the cluster centers are determined automatically using the sample features and similarity measure defined above, and the optical flow clustering of the first stage is completed in combination with the classical K-means algorithm.
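The feature construction and similarity measure of the first stage can be sketched as follows. The min-max normalization is our assumption, since the paper's normalization formula is not reproduced here:

```python
import numpy as np

def flow_features(x, y, amp, direction):
    """Four-dimensional sample feature of one optical flow vector:
    pixel position (x, y), amplitude A, and direction D."""
    return np.array([x, y, amp, direction], dtype=float)

def normalize(features):
    """Min-max normalize each feature dimension across all samples
    (assumed form of the normalization step)."""
    f = np.asarray(features, dtype=float)
    lo, hi = f.min(axis=0), f.max(axis=0)
    rng = np.where(hi - lo == 0, 1.0, hi - lo)
    return (f - lo) / rng

def similarity(Xj, Ck):
    """Euclidean distance between a sample feature and a cluster center;
    a smaller distance means higher similarity."""
    return np.linalg.norm(Xj - Ck)
```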
The specific steps are as follows: (1) Initialize the first cluster center. Set the first optical flow class $S_1$, randomly select a filtered optical flow vector (blue parts in Figure 13), assign it to $S_1$ as the center of $S_1$, and take its sample feature as the central sample feature $C_1$. (2) Calculate the similarity. Select an unclassified optical flow vector $X$ in the optical flow field, calculate its similarity with each existing cluster, and record the Euclidean distance between its sample feature and that of the center of its most similar category. The optical flow after the first stage of the clustering algorithm is shown in Figure 14: the sample data shown in blue in Figure 13 were divided into the differently colored clusters in Figure 14, where the white dots are the cluster centers. The foreground optical flow aggregated into a large area owing to its similar characteristics, while the background optical flow scattered into many categories owing to its differences.
At the second stage of the improved algorithm, the idea of the agglomeration method in hierarchical clustering is used. After a certain number of clusters is obtained in the first stage, some clusters should be merged so that the number of clusters can finally be optimized. We propose a modified method to optimize the cluster number that generates auxiliary data and refers to the agglomeration method. Unlike the classical agglomeration method, the proposed method generates assist points to improve the clustering performance.
(1) Eight new data samples are calculated according to the position of each cluster center. The eight new data points are centered on the cluster center and arranged at equal angular intervals. Assuming that the cluster center position is $(C_{xk}, C_{yk})$, the newly produced data points are denoted $(m_i, n_i)$, $i = 1, \ldots, 8$. (2) Then, the distances between all newly generated data points from two different clusters (e.g., 8 points for Cluster 1 and another 8 points for Cluster 2) are calculated, and the shortest distance is selected as the inter-class distance. (3) Set a threshold and merge any two categories whose inter-class distance is less than the threshold.
(4) Update and return the final cluster centers and cluster members. The overall flow of the optical flow clustering algorithm in this paper is shown in Figure 15. The optical flow computed by the improved clustering algorithm is shown in Figure 16. Only a few clusters remained, which helped with target extraction. Next, after setting a threshold, most of the optical flow noise could be effectively eliminated through a binarization step. The final optical flow is shown in Figure 17; thus, only true moving-target information remained. The results in Figure 17 show that the target areas still had some black parts that divided the true targets into multiple pieces. We therefore extracted the final moving target with an extra step to improve accuracy.
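The assist-point merging step of the second stage can be sketched as follows. The radius r of the eight points and the 45-degree spacing are our assumptions, since the exact placement formula is not reproduced here:

```python
import numpy as np

def assist_points(cx, cy, r=1.0):
    """Eight auxiliary points around a cluster center, spaced every 45 degrees
    at an assumed radius r."""
    ang = np.arange(8) * np.pi / 4.0
    return np.stack([cx + r * np.cos(ang), cy + r * np.sin(ang)], axis=1)

def inter_class_distance(c1, c2, r=1.0):
    """Shortest distance between the two 8-point assist sets of two cluster
    centers; clusters whose inter-class distance falls below a threshold
    would be merged."""
    p1 = assist_points(c1[0], c1[1], r)
    p2 = assist_points(c2[0], c2[1], r)
    d = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=2)
    return d.min()
```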

Accurate Extraction Based on Frame Difference Method and Morphological Operation
The frame difference method is commonly used for moving-target detection and segmentation. The flow chart of the two-frame difference method is shown in Figure 18. The n-th and (n−1)-th frames in a video are recorded as $\gamma_n$ and $\gamma_{n-1}$, respectively, and the gray values of the corresponding pixels of the two frames are recorded as $\gamma_n(x, y)$ and $\gamma_{n-1}(x, y)$, where $(x, y)$ represents the position of a pixel in the image. When calculating a difference image, the gray values of the corresponding pixels at each position of the two frames are subtracted, yielding the absolute difference image

$$D_n(x, y) = | \gamma_n(x, y) - \gamma_{n-1}(x, y) |. \quad (21)$$

Next, we set another threshold $T$ for the gray value difference, binarized each pixel's gray value difference according to Equation (22), and obtained the binarized image $R_n(x, y)$:

$$R_n(x, y) = \begin{cases} 255, & D_n(x, y) > T, \\ 0, & D_n(x, y) \leq T. \end{cases} \quad (22)$$
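The difference and binarization steps above amount to a few lines (a sketch; the threshold T is chosen by the user):

```python
import numpy as np

def frame_difference(frame_n, frame_prev, T):
    """Two-frame difference: D_n = |frame_n - frame_prev|, then binarize
    with threshold T (255 for moving pixels, 0 for background)."""
    Dn = np.abs(frame_n.astype(int) - frame_prev.astype(int))
    return np.where(Dn > T, 255, 0).astype(np.uint8)
```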
Thus, the difference between two successive images is shown in Figure 19. Then, morphological processing was applied [49] to eliminate spurs and fill the narrow discontinuities and small holes in the image, using the results in Figures 17 and 19. In other words, the target areas in Figures 17 and 19 were fused together. The region of the target of interest after morphological processing and frame difference fusion is shown in Figure 20. The extraction strategy proposed in this paper separated the background and moving-target optical flow well. Figure 21 shows the final extracted moving target.
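The hole-filling effect of the morphological step can be illustrated with a plain-NumPy closing (a simplified square-kernel sketch, not the exact operator chain of [49]):

```python
import numpy as np

def dilate(mask, k=3):
    """Binary dilation with a k x k square structuring element,
    implemented as a sliding maximum over zero-padded shifts."""
    p = k // 2
    padded = np.pad(mask, p)
    out = np.zeros_like(mask)
    for dy in range(k):
        for dx in range(k):
            out = np.maximum(out, padded[dy:dy + mask.shape[0],
                                         dx:dx + mask.shape[1]])
    return out

def close_holes(mask, k=3):
    """Morphological closing (dilation then erosion) to fill narrow
    discontinuities and small holes in the extracted target region."""
    dilated = dilate(mask, k)
    # Erosion is dilation of the complement.
    return 1 - dilate(1 - dilated, k)
```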

Target Localization Based on Pinhole Imaging Theory
After the moving target is detected and extracted, its location can be calculated. This paper proposes a simple mathematical model based on pinhole imaging theory. The target position relative to the camera is calculated using only the camera parameters and the measured two-dimensional image. Camera installation information is required, e.g., the camera's mounting position and the fact that the camera lens is horizontally aligned.

Localization Problem Formulation
The pinhole imaging model underlying camera imaging is shown in Figure 22a. An actual object point $M$ in 3D space has a corresponding image point $m$ on the imaging plane after passing through the camera optical center $C$. A specific proportional relationship exists between the size of the true object and its image. According to the Gaussian imaging formula, we have

$$\frac{1}{f} = \frac{1}{a} + \frac{1}{b}, \quad (23)$$

where $f$ is the camera's focal length, $a$ is the object distance, and $b$ is the image distance. Since the object distance is much larger than the image distance, Equation (23) can be approximated as

$$b \approx f. \quad (24)$$

Given Equation (24), assuming that the imaging plane coincides with the focal plane is reasonable; the distance between the imaging plane and the optical center equals the focal length $f$ of the camera. Four coordinate systems are used in the proposed localization algorithm, i.e., the camera, world, image, and pixel coordinate systems. The camera coordinate system uses $(X_c, Y_c, Z_c)$ to represent a point, and its origin is the camera. The world coordinate system is denoted by $(X, Y, Z)$. $(X_i, y_i)$ represents a position in the image coordinate system, and in the pixel coordinate system a position is represented by $(u, v)$. The difference between the image and pixel coordinate systems is shown in Figure 22b. The origin of the image coordinate system is $o_i$, which lies at $(u_0, v_0)$ in the pixel coordinate system; the two coordinate systems share the same axes but differ in origin and scale. Under the assumption that the side length of each pixel is $d$, the relationship between the image and pixel coordinates obtained from Figure 22b takes the following form:

$$u = \frac{X_i}{d} + u_0, \quad v = \frac{y_i}{d} + v_0. \quad (25, 26)$$

Equations (25) and (26) can be rewritten as

$$X_i = (u - u_0)\, d, \quad y_i = (v - v_0)\, d. \quad (27)$$

In order to simplify the calculation and analysis, we built the pinhole imaging model shown in Figure 23.
A new imaging plane at focal length f from the optical center of the camera was built; thus, 3D points are projected onto the camera coordinate system. Specifically, as shown in Figure 23, O_c X_c Y_c Z_c is the camera coordinate system, o_i X_i y_i is the image coordinate system, O_c is the optical center, and the Z_c axis coincides with the camera's optical axis. P(X_c, Y_c, Z_c) is a point in 3D space in the camera coordinate system, and p(X_i, y_i) is its projection in the image coordinate system. According to Figure 23, combined with Equation (24), P(X_c, Y_c, Z_c) and p(X_i, y_i) satisfy the similar-triangle relationships X_i = f X_c / Z_c (28) and y_i = f Y_c / Z_c, (29) which can be written compactly as X_i/X_c = y_i/Y_c = f/Z_c. (30) On the basis of Equations (27) and (30), the pixel position (u, v) is obtained directly from (X_c, Y_c, Z_c) (Equation (32)). Equation (32) shows that, for any coordinate point in 3D space, a unique pixel point can be found on its imaging plane. However, depth information cannot be recovered this way; since we used monocular vision for 3D localization, additional information is required. Therefore, this paper proposes a strategy that uses the camera's installation height to calculate the final 3D position in two steps: (1) computing the target azimuth angle and depth measurements, and (2) calculating the target Cartesian position from these measurements.
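The projection chain above (camera frame to image plane, Equations (28) and (29), then image plane to pixel grid, Equations (25) and (26)) can be sketched in a few lines of Python. The focal length, pixel size, and principal point values below are illustrative placeholders, not parameters from the paper:

```python
# Minimal sketch of the pinhole projection chain; parameter values are
# illustrative placeholders, not values from the paper.
f = 0.008          # focal length [m]
d = 2e-6           # pixel side length [m]
u0, v0 = 640, 512  # principal point [pixels]

def project_to_pixel(Xc, Yc, Zc):
    Xi = f * Xc / Zc  # X_i = f * X_c / Z_c  (Eq. (28))
    yi = f * Yc / Zc  # y_i = f * Y_c / Z_c  (Eq. (29))
    u = Xi / d + u0   # u = X_i / d + u_0    (Eq. (25))
    v = yi / d + v0   # v = y_i / d + v_0    (Eq. (26))
    return u, v

# A point on the optical axis projects to the principal point:
print(project_to_pixel(0.0, 0.0, 10.0))  # -> (640.0, 512.0)
```

As Equation (32) implies, this map is many-to-one: scaling (X_c, Y_c, Z_c) by any positive constant yields the same pixel, which is why depth cannot be recovered without extra information.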

Angle and Distance Measurements
In the camera coordinate system, the azimuth and pitch/elevation angles, defined as α and β, respectively, are shown in Figure 23 and satisfy tan α = X_c / Z_c (33) and tan β = Y_c / Z_c. (34) Substituting Equations (28) and (29) into (33) and (34) yields tan α = X_i / f (35) and tan β = y_i / f. (36) Therefore, from the position of the object point on the image plane, the angular information of the target relative to the camera in 3D space can be obtained. This angular information is acquired using only the camera pixel image, the pre-known pixel size, and the focal length. Lens distortion was ignored, or equivalently, the noise caused by lens distortion was assumed to satisfy a Gaussian distribution. However, it is still impossible to calculate the target depth using only the azimuth and pitch angles; additional reference information is required, e.g., the camera installation height. As shown in Figure 24, the camera was installed at height h above the ground. Moreover, the camera was placed manually, so its observation direction, defined as θ, was known. The world coordinate system was established simultaneously, with its origin coinciding with that of the camera coordinate system. As an example, the angular diagram for a target moving on the ground is shown in Figure 24. The relationship between the world and camera coordinate systems can be expressed as a rotation by the tilt angle θ. From Equations (28) and (29), we can obtain the 2D image position of the target centroid, (X_i, y_i). Since the camera focal length f and the target centroid's Y-axis position in the camera coordinate system are both known, the target centroid position (X_c, Y_c, Z_c) can be calculated, and then the distance between the target and the camera. Since camera tilt angle θ is known, and the target centroid pitch angle β can be obtained from Equation (36), the geometric relationship in Figure 24 gives O_c A = h / sin(θ + β). In particular, when β = 0, that is, when y_i = 0, it follows from Figure 24 that Y_c = 0 and Z_c = h / sin θ. In that case, X_c = Z_c X_i / f is obtained from Equation (30).
In addition, when the height of the target cannot be ignored (i.e., the target is not a point mass on the ground), O_c A = (h − l) / sin(θ + β) holds, where l is the height of the target centroid above the ground. Accordingly, when β = 0, Z_c = (h − l) / sin θ. The target height can be estimated using some known or pre-given static reference object.
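As a minimal sketch of this geometric recovery, the following Python function computes the azimuth and elevation angles from an image-plane position and then the camera-frame position from the known installation height h and tilt angle θ. The depth expression Z_c = (h − l) cos β / sin(θ + β) is our reading of the triangle in Figure 24 (β measured downward from the optical axis); it reduces to Z_c = (h − l)/sin θ when β = 0, consistent with the text:

```python
import math

def angles_and_depth(Xi, yi, f, h, theta, l=0.0):
    # Azimuth and elevation from the image-plane position (Eqs. (35)-(36)).
    alpha = math.atan(Xi / f)
    beta = math.atan(yi / f)
    # The vertical drop h - l is covered at depression angle theta + beta,
    # giving the slant range O_cA; projecting onto the optical axis yields
    # the depth Z_c (Z_c = (h - l)/sin(theta) when beta = 0).
    slant = (h - l) / math.sin(theta + beta)
    Zc = slant * math.cos(beta)
    Xc = Zc * Xi / f  # from Eq. (28)
    Yc = Zc * yi / f  # from Eq. (29)
    return alpha, beta, (Xc, Yc, Zc)
```

For example, a camera 3 m above the ground tilted down by 30 degrees observing a ground target on the optical axis (X_i = y_i = 0) yields Z_c = 3/sin(30°) = 6 m.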

Moving-Target Tracking Using Visual Image and Cubature Kalman Filter
The discrete state-transition and observation equations of the nonlinear system in this paper take the following form: X_k = f(X_{k−1}) + ω_{k−1} and Z_k = h(X_k) + ζ_k, where X_k is the state vector at time instant k, Z_k is the measurement vector at k, f(·) is the dynamic model function, and h(·) is the measurement function. Here, we used both angle and distance measurements. ω_k and ζ_k were assumed to be zero-mean additive Gaussian white noise, representing the process and measurement noise, with covariance matrices Q_k and R_k, respectively.
Since the CKF has many advantages in nonlinear estimation [44], it has been widely applied as a state-of-the-art method. The structure of the CKF algorithm is divided into two parts, i.e., prediction and update with measurement steps.

Prediction update:
(i) Decompose the estimation error covariance matrix, P_{k|k} = S_{k|k} S_{k|k}^T.
(ii) Calculate the cubature points, X_{i,k|k} = S_{k|k} ξ_i + x̂_{k|k}, i = 1, ..., m.
(iii) Calculate the propagated cubature points of the state-transition function, X*_{i,k+1|k} = f(X_{i,k|k}), where X_{i,k|k} and X*_{i,k+1|k} are cubature points and m is the number of cubature points. When using the third-order spherical-radial criterion, the number of cubature points is twice the dimension n of the state vector of the nonlinear system, and ξ_i is the i-th cubature direction, ξ_i = sqrt(m/2) [1]_i, where [1]_i denotes the i-th column of [I_n, −I_n].
(iv) Calculate the state prediction, x̂_{k+1|k} = (1/m) Σ_{i=1}^{m} X*_{i,k+1|k}.
(v) Calculate the prediction covariance matrix, P_{k+1|k} = (1/m) Σ_{i=1}^{m} X*_{i,k+1|k} X*_{i,k+1|k}^T − x̂_{k+1|k} x̂_{k+1|k}^T + Q_k.
Update with measurement:
(i) Decompose the prediction covariance matrix, P_{k+1|k} = S_{k+1|k} S_{k+1|k}^T.
(ii) Calculate the updated cubature points, X_{i,k+1|k} = S_{k+1|k} ξ_i + x̂_{k+1|k}.
(iii) Calculate the propagated cubature points of the measurement function, Z_{i,k+1|k} = h(X_{i,k+1|k}).
(iv) Calculate the measurement prediction, ẑ_{k+1|k} = (1/m) Σ_{i=1}^{m} Z_{i,k+1|k}.
(v) Calculate the innovation covariance matrix, P_{zz,k+1|k} = (1/m) Σ_{i=1}^{m} Z_{i,k+1|k} Z_{i,k+1|k}^T − ẑ_{k+1|k} ẑ_{k+1|k}^T + R_{k+1}.
(vi) Calculate the cross-covariance matrix, P_{xz,k+1|k} = (1/m) Σ_{i=1}^{m} X_{i,k+1|k} Z_{i,k+1|k}^T − x̂_{k+1|k} ẑ_{k+1|k}^T.
(vii) Calculate the Kalman gain, K_{k+1} = P_{xz,k+1|k} P_{zz,k+1|k}^{−1}.
(viii) Calculate the estimated state value at k + 1, x̂_{k+1|k+1} = x̂_{k+1|k} + K_{k+1} (z_{k+1} − ẑ_{k+1|k}).
(ix) Calculate the estimation error covariance matrix, P_{k+1|k+1} = P_{k+1|k} − K_{k+1} P_{zz,k+1|k} K_{k+1}^T.
On the basis of the pinhole imaging and positioning model described in Section 4, the CKF is used to estimate the 3D target state, combined with the assumed dynamic models.
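The steps above can be condensed into a compact NumPy sketch of one CKF predict-update cycle (third-degree spherical-radial rule, m = 2n points); this is a generic illustration of the standard algorithm, not the authors' exact implementation:

```python
import numpy as np

def ckf_step(x, P, z, f_dyn, h_meas, Q, R):
    """One predict-update cycle of the cubature Kalman filter."""
    n = x.size
    m = 2 * n
    # Cubature directions: xi_i = sqrt(n) * (+/-)e_i, i = 1..m.
    xi = np.sqrt(n) * np.hstack([np.eye(n), -np.eye(n)])

    # Prediction, steps (i)-(v).
    S = np.linalg.cholesky(P)                                 # (i)
    X = x[:, None] + S @ xi                                   # (ii)
    Xp = np.column_stack([f_dyn(X[:, i]) for i in range(m)])  # (iii)
    x_pred = Xp.mean(axis=1)                                  # (iv)
    P_pred = Xp @ Xp.T / m - np.outer(x_pred, x_pred) + Q     # (v)

    # Measurement update, steps (i)-(ix).
    S = np.linalg.cholesky(P_pred)
    X = x_pred[:, None] + S @ xi
    Zp = np.column_stack([h_meas(X[:, i]) for i in range(m)])
    z_pred = Zp.mean(axis=1)
    Pzz = Zp @ Zp.T / m - np.outer(z_pred, z_pred) + R        # innovation cov.
    Pxz = X @ Zp.T / m - np.outer(x_pred, z_pred)             # cross cov.
    K = Pxz @ np.linalg.inv(Pzz)                              # Kalman gain
    x_upd = x_pred + K @ (z - z_pred)
    P_upd = P_pred - K @ Pzz @ K.T
    return x_upd, P_upd
```

Because only the Cholesky factor of P is propagated through f(·) and h(·) at the cubature points, no Jacobians are needed, which is the practical advantage over the EKF exploited later in the experiments.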

Experimental Examples
In this paper, we used an ordinary monocular industrial camera to acquire videos; the detailed parameters of the camera are shown in Table 2. The methods proposed in this paper were verified step by step: first, the improved optical flow method introduced in Section 3; then, the proposed geometrical algorithm using the pinhole imaging method; lastly, combined with the CKF algorithm, the effectiveness of the proposed strategy in Sections 3-5 was verified via different examples with some comparisons. Several moving-target videos with different target objects and motion states were randomly taken in indoor and outdoor environments. The optical flow calculation and moving-target extraction network using the proposed algorithm could effectively suppress the interference of background noise and accurately extract the region of interest. The experimental results are shown in Figure 25. The first column shows the pre-processed images, the second column the optical flow visualization images, the third column the moving-target segmentation results, and the fourth column the final extracted moving targets.

Moving Target Localization (Using the Proposed Methods in Sections 4 and 5)
In order to verify the accuracy of the positioning method, several identified target images were captured in this experiment. The data were then processed, and the results were compared with the true values measured with an advanced RGB camera. The first step was to set up the experimental scenarios and randomly place the experimental target in multiple positions; the camera's height, shooting angle, and target position all varied. In the second step, we recorded the relevant data needed for model calculation, such as the camera's height and shooting angle, collected the target image, extracted target information from the 2D image using the proposed algorithm in Section 3, and then calculated the targets' locations. Six pictures captured from the previous videos in Figure 25 were used. In the last step, the angle and distance measurements were used to calculate the 3D target positions of the different examples. The experimental results are shown in Table 3 [42]. Table 3 shows that the target locations computed with the proposed geometrical method were very close to the true values, but some errors still existed, caused by the extraction process and camera lens distortion noise, which are commonly assumed to follow a Gaussian distribution.
The modified CKF method was applied to process the measurements to improve the target tracking performance. The CKF utilized our contribution regarding the modified measurement model proposed in (39) and (40). In the CKF algorithm, we assumed that the target moved with constant velocity, and in each image, multiple angular measurements could be produced to keep the CKF running. In the CKF simulation, the sampling time interval was T. With the proposed pinhole imaging localization method, the state vector contained the 3D position and velocity of the moving target, and the measurement was the 2D target position in the image coordinate system. Therefore, the dynamic system model is X_k = F X_{k−1} + G ω_{k−1}, where F is the constant-velocity transition matrix, and the measurement model is Z_k = h(X_k) + ζ_k, with h(·) given by the pinhole projection in Equations (28) and (29). ω_{k−1} and ζ_k are the process and measurement noise, respectively, and G is a transformation matrix.
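Under these assumptions, the constant-velocity transition matrix F, the noise-driving matrix G, and a pinhole measurement function can be written as follows; the sampling interval matches Example 1 below, while the focal length value is an illustrative placeholder:

```python
import numpy as np

T = 0.15   # sampling interval [s]
f = 0.008  # focal length [m]; illustrative placeholder

# Constant-velocity transition for the state [X, Y, Z, Vx, Vy, Vz]^T.
F = np.block([[np.eye(3), T * np.eye(3)],
              [np.zeros((3, 3)), np.eye(3)]])

# Noise-driving matrix G: accelerations enter position (0.5*T^2)
# and velocity (T).
G = np.vstack([0.5 * T**2 * np.eye(3), T * np.eye(3)])

def h_meas(x):
    # Pinhole measurement: the target's 2D image position (Eqs. (28)-(29)).
    Xc, Yc, Zc = x[0], x[1], x[2]
    return np.array([f * Xc / Zc, f * Yc / Zc])
```

Plugging F (via f_dyn = lambda s: F @ s) and h_meas into a CKF step then propagates the 6D state while updating it with the 2D pixel measurement.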
To clearly show the advantages of the proposed CKF, a moving target with nonlinear process noise (maneuvering) was tracked. In the dynamic model, the process noise contains a sinusoidal driving term; ϑ and τ determine the amplitude and frequency, respectively, of the target's sinusoidal motion. The EKF method [50,51] was applied as a comparison, and both the EKF and CKF were run with 1000 Monte Carlo repeats. In this part, Examples 1 and 2 are provided with different true target initial positions (the EKF and CKF were given the same initial state). In Example 1, the true target initial state was [300, 400, −800, 2, 4, 6]^T. The initial estimate was X_0 = [320, 420, −820, 2.1, 3.8, 6.3]^T, with R_k = diag[1, 1] and P_0 = diag[100, 100, 100, 100, 100, 100] for both Kalman filters. The sampling time was set to 0.15 s, and the total running time was 300 s. To ensure the sinusoidal motion of the moving target, we set the amplitude ϑ = 600 m in the noise driving matrix and the frequency τ = 0.12 s in the process noise variance.
In Example 2, the true target initial state was [−800, 300, 400, 1, 1, 2]^T. The initial estimate was X_0 = [100, −100, 630, 4.2, 0.8, 2.2]^T, with R_k = diag[1, 1] and P_0 = diag[1000, 1000, 1000, 1000, 1000, 1000] for both Kalman filters. We set the amplitude ϑ = 100 m in the noise driving matrix and the frequency τ = 0.1 s in the process noise variance. The sampling time was 0.2 s, and the total running time was 200 s. The estimated results of the different methods were recorded, and the true and estimated target trajectories using the EKF and CKF were drawn as well. The RMSEs between the estimated results and the true target states of the different methods are shown in Figures 26-29. The first example shows that the proposed CKF achieved satisfactory convergence when tracking a sinusoidally moving target. The CKF also had better estimation accuracy than the EKF in both position and velocity estimation, while the EKF diverged in some of the Monte Carlo runs, resulting in target loss. In the second example, divergence did not occur in either method, since the true target initial position was more favorable for the EKF; however, the proposed CKF still produced more accurate estimates. The effectiveness and superiority of the proposed CKF in moving-target tracking were thus verified.

Conclusions
This paper addressed moving-target detection and tracking problems in a 3D space with a proposed visual target tracking system only using a 2D camera. First, an improved optical flow method combined with a clustering algorithm was developed to accurately detect and extract a moving target from a noisy background. Second, a geometrical pinhole imaging algorithm was proposed to roughly and quickly estimate the target position. Third, the CKF algorithm was modified by directly using the camera pixel measurement for moving-target tracking in a 3D space. Simulation and experimental results demonstrated the effectiveness of the proposed systematic method. The work in this paper has great practical value, as it provides a solution for mobile-target detection and estimation using low-cost equipment.
In the future, we will further improve the accuracy of moving-target detection and estimation by considering the camera lens distortion problem. In addition, a pan-tilt-zoom servo system will be combined with the proposed system to further improve moving-target tracking performance.