An Efficient Extended Targets Detection Framework Based on Sampling and Spatio-Temporal Detection

Excellent detection performance, real-time processing and low memory requirement are three vital requirements for target detection in high-resolution marine radar systems. Unfortunately, many current state-of-the-art methods achieve only the first of these when coping with highly complex scenes; real-time processing, low memory requirement and remarkable detection ability are difficult to reconcile. To address this issue, we propose a novel detection framework based on sampling and spatiotemporal detection. The framework consists of two stages, coarse detection and fine detection. Sampling-based coarse detection guarantees real-time processing and low memory requirement by locating in advance the areas where targets may exist. Unlike former detection methods, it uses multi-scan video data. In the fine detection stage, the candidate areas are grouped into three categories: single target, dense targets and sea clutter. Each category is processed by a different approach to achieve excellent performance. The superiority of the proposed framework over state-of-the-art baselines is substantiated in this work. The low memory requirement of the proposed framework was verified by theoretical analysis. The real-time processing capability was verified using the video data of two real scenarios. Synthetic data were used to show the improvement in tracking performance obtained with the proposed detection framework.


Introduction
Moving target detection plays a pivotal role in marine radar systems; it aims to detect moving objects completely and accurately from video data. Because of non-Gaussian sea clutter and complex backgrounds, extracting targets of interest such as vessels and low-flying aircraft from sequential radar images is a challenging task. The moving target detection problem involves two main tasks, target detection and target tracking. Target detection finds the candidate positions of targets in the video data originating from the radar front-end. Target tracking associates these positions into the trajectories of the targets. With the increased resolution of modern radar, a target occupies several resolution cells rather than a single one. A high-resolution radar therefore receives more than one point per time step from different corner reflectors of a single target, and the target can no longer be treated as a point. Consequently, extended target detection and tracking has recently become a hot research topic. The aim of this work is to develop a novel target detection framework that improves extended target tracking performance by providing more accurate target points and fewer false alarm points.
The coarse detection stage is designed to roughly locate, in advance, the areas where targets may exist by uniformly selecting seeds over the whole surveillance area. Only the selected seeds are used, which guarantees real-time processing and low memory requirement. In the fine detection stage, only the areas where targets may exist are processed. The candidate areas are classified by their contours into three categories, namely single target, dense targets and sea clutter [21]. The areas of dense targets are further separated into subareas using the Rain algorithm [22]. Each subarea is regarded as an individual target. Excellent performance can be achieved by the fine detection.
As presented in Figure 1, the input of the target detection is the video sequence of the radar. The results of target detection are three-dimensional points, i.e., two-dimensional positional information and its measuring time. The measuring time in target tracking algorithms can be simply represented by the frame number (see, e.g., [1][2][3][4]). The detection framework provides correct points, which further improve the final tracking performance. Figure 1 describes the relationship between the radar data processing and the existing methods mentioned above.
The remainder of the work is organized as follows. Section 2 defines the models and problems. Section 3 presents the implementation of the sampling-based spatiotemporal detection method and the proposed detection framework. The superiority of the proposed framework over state-of-the-art baselines is substantiated in Section 4 using real high-resolution marine radar data as well as synthetic data. Section 5 draws conclusions.

Target Model
Assume that the extended targets are randomly distributed on an x-y plane. We use M_k to denote the number of targets at the k-th scan. The size and quantity of targets are unknown. A general approach based on support functions for modeling smooth object shapes, presented in [23], is used here. The state of an individual extended target can be modeled in the state space R^s, s = 6. The state of the m-th target is St_m = (x_m, y_m, l'_m, w'_m, α_m, p_m), 1 ≤ m ≤ M. x_m and y_m denote the centroid of the m-th target on the x- and y-axes, respectively. l'_m and w'_m are the lengths of the major axis and the minor axis. α_m is the angle between the major axis and the line of sight. p_m denotes the intensity present in a single pulse return. The comparison between reflection models and real data [20] indicates that Swerling type 1 is more appropriate for expressing the magnitude of the target. The magnitude y_t of a pulse return follows the Rayleigh distribution, where y_t is the intensity of the echo originating from a target. The intensity of a target in an azimuth bin, M(r, a), is presented in Figure 2.

Noise Model
The clutter consists of two parts, sea clutter and measurement noise. The mean and variance of sea clutter are closely related to the sea state. The sea clutter distribution model of major theoretical and practical interest is the so-called K-distributed clutter model [24][25][26], whose PDF (probability density function) is given by Equation (2).
Here, p_c denotes the power of the sea clutter in this cell, Γ(v) is the gamma function, v is referred to as the shape parameter, b is a scale parameter, and K_v(u) denotes the modified Bessel function of the second kind of order v. Measurement noise is a zero-mean, white and uncorrelated Gaussian noise sequence.
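To make the clutter model concrete, the following minimal sketch draws K-distributed clutter amplitudes using the standard compound-Gaussian construction (gamma-distributed texture multiplied by Rayleigh speckle). It is an illustration rather than the simulation used in this work: the function name `k_clutter_amplitude`, the parameter `mean_power`, and the way `shape_v` and `mean_power` map onto the shape parameter v, the scale parameter b and the power p_c are assumptions.

```python
import numpy as np

def k_clutter_amplitude(shape_v, mean_power, size, rng):
    """Draw K-distributed clutter amplitudes as a compound-Gaussian product.

    shape_v and mean_power are hypothetical names; their exact relation to
    the shape and scale parameters of Equation (2) is an assumption.
    """
    # Texture: gamma-distributed local power whose mean is mean_power.
    texture = rng.gamma(shape=shape_v, scale=mean_power / shape_v, size=size)
    # Speckle: unit-power complex Gaussian, so its modulus is Rayleigh.
    speckle = (rng.standard_normal(size) + 1j * rng.standard_normal(size)) / np.sqrt(2.0)
    return np.sqrt(texture) * np.abs(speckle)

rng = np.random.default_rng(0)
clutter = k_clutter_amplitude(shape_v=2.0, mean_power=1.0, size=200_000, rng=rng)
```

A smaller `shape_v` gives a heavier tail (spikier clutter), while a large `shape_v` approaches Rayleigh-distributed amplitudes.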

Measurement Model
The surveillance area is divided into N_A × N_R grid cells in polar coordinates, where N_A and N_R are the numbers of cells on the azimuth and range axes, respectively. Each cell corresponds to a pixel in the radar video. Figure 2 shows that the video data can be modeled by Equation (3) in polar coordinates. Parameters r and a denote the location on the range and azimuth axes. Z(r, a) is the amplitude in the range-azimuth resolution cell (r, a).
Here, ω(θ) in Figure 2 is the antenna pattern function. The non-noise measurement of cell Z(r, a) is related to the RCS of the target M(r, a) and the clutter C(r, a). The distribution of M(r, a) is related to the shape and material of the target in cell (r, a). N(r, a) denotes the additive measurement noise. In Figure 2, an aircraft is illuminated by radar beams. The measurement model of marine radar [20,21] indicates that, once parts of the aircraft are illuminated by the beam, the radar echoes in this azimuth bin are affected by the aircraft. The target is illuminated by the main lobe of the beam when the direction of the beam equals ϕ. The range of ϕ, ϕ_1 ≤ ϕ ≤ ϕ_2, can be estimated by Equation (4). The areas where the echo is affected by the extended target on the azimuth and range axes are denoted by A and R, respectively. The proof of Equation (4) is presented in the Appendix of [7]. θ_0 denotes the 3 dB azimuth beam width of the radar. The expressions of A and R are given by Equation (5), where l_max and l_min are the upper and lower limits of l'. Equation (4) indicates that, because of the extension of the beams, the image of the target is larger than its real size. The azimuth and range bins (a_m, r_m) whose amplitudes might be affected by the object can be estimated using Equation (6).
Here, ⌈·⌉ denotes rounding up and CR denotes the coverage range of the radar. The measurements obtained from the radar front-end are a time series of images, each of which has N_A × N_R pixels.
Here, K is the number of images.

Problem Statement
The aim of target detection is to extract the states of the targets St_m from the video Z_k. The quantity of video data in a high-resolution radar system is enormous. The parameters of the high-resolution marine radar available in this work are presented in Table 1. Target detection must be completed within one radar scanning cycle (10 s). The images of the radar in two different scenarios are presented in Figure 3. Putting detection performance aside, CFAR-based methods [10][11][12][13] and spatiotemporal-based methods [16][17][18] take more than 200 s to complete the detection in the two real scenarios presented in Figure 3a,b. The video data for the two scenarios are quite different, mainly because of the locations of the two radars. Scenarios 1 and 2 correspond to Radars 1 and 2 in Figure 3c. Radar 1 is located on the hillside of a peninsula. Most of the areas near Radar 1 are sea or forest, whose echo intensity is far lower than that of objects in urban areas. In addition, the beams of Radar 1 are obscured by the peak in some azimuth bins. For these two reasons, relatively few clutter regions emerge. Radar 2, in contrast, is located at the peak of a mountain facing the sea. Two urban regions around Radar 2 can therefore be illuminated by the beams, and more clutter regions emerge. The video data of the two scenarios were processed by the existing detection methods. However, the methods in [8][9][10][11][12][15][16][17] are far from meeting the real-time processing requirement.

Meanwhile, an image of the whole surveillance area is about 200 Mb. It is therefore infeasible to store the past images directly, as the methods in [16][17][18] require, and these methods are difficult to run on the current hardware (MPC8640D PowerPC). The efficient methods in our former work [19][20][21] were developed to meet the requirements of real-time processing and low memory. However, we found that the methods in [19][20][21] are insufficient to cope with complex environments. Building on the methods in [16][17][18][19][20][21][22], we propose a novel detection framework that promises to meet the three requirements simultaneously.

Sampling-Based Spatiotemporal Thresholding Method
Some clutter regions originate from huge fixed objects such as buildings and islands. Areas with high sea states also produce clutter regions. Clutter regions are much larger than the resolution cell. Meanwhile, as can be seen directly in Figure 3a, the measurements are spatially correlated. A sampling-based spatiotemporal thresholding algorithm that exploits this spatial context is proposed. The input of the method is K successive images, each of which has N_R × N_A pixels. The result is a sampled thresholding map. The method consists of the following steps.
Step 1. The sample intervals in range and azimuth, d_R and d_A, are estimated according to the parameters of the radar and the size of the clutter sources. The values of the two sample intervals can be set to d_R and d_A when the clutter region in the image is no larger than 2d_R × 2d_A. The sample interval in time, d_t, is related to the variation rate of the clutter regions. A larger d_t can be set when the area and intensity of the clutter regions change slowly.
Step 2. To monitor the variation of clutter regions efficiently, only some of the pixels are uniformly selected from the images. The selected pixels used to evaluate the clutter, Z_m, are called marked cells here.
Step 3. The sampled spatiotemporal thresholding map M has (N_R/d_R) × (N_A/d_A) pixels. The intensity of these pixels is estimated from the marked cells Z_m. A (2w + 1) × (2w + 1) × (K/d_t) local patch is defined on the marked cells. The set of marked cells in the local patch used to evaluate the threshold of a marked cell (r, a) is given by Equation (9). The mean intensity of the local image patch at cell (r, a) is denoted by m(r, a) (Equation (10)).
Step 4. After the sampled spatiotemporal thresholding map M is obtained, the intensity of a non-marked cell can be estimated by the two-dimensional linear interpolation in Equation (11).
The result of the sampling-based spatiotemporal thresholding method is the thresholding map M. It is worth noting that not all of the intensities of non-marked cells are necessary for fine detection. The intensities of non-marked cells are calculated only when an extended target potentially exists in the area.
Only the cells involved in evaluating the sampled map are stored in the processor. Compared with existing spatiotemporal-based methods [16][17][18], far fewer cells are involved, which drastically reduces the computation.
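The four steps above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: the frames are assumed to be stored as an array indexed (time, range, azimuth), edge cells are handled by edge padding, and the function names `sampled_threshold_map` and `interpolate_threshold` are hypothetical.

```python
import numpy as np

def sampled_threshold_map(frames, d_r, d_a, d_t, w):
    """Steps 1-3: mean of the marked cells in a (2w+1) x (2w+1) x (K/d_t) patch.

    frames is assumed to have shape (K, N_R, N_A); edge padding is an assumption.
    """
    # Step 2: keep every d_t-th frame and every (d_r, d_a)-th pixel (marked cells).
    marked = frames[::d_t, ::d_r, ::d_a].astype(float)
    # The patch spans all sampled frames, so average over time first.
    temporal_mean = marked.mean(axis=0)
    # Step 3: box mean over (2w+1) x (2w+1) neighbouring marked cells.
    padded = np.pad(temporal_mean, w, mode="edge")
    rows, cols = temporal_mean.shape
    total = np.zeros_like(temporal_mean)
    for dr in range(2 * w + 1):
        for da in range(2 * w + 1):
            total += padded[dr:dr + rows, da:da + cols]
    return total / (2 * w + 1) ** 2

def interpolate_threshold(sampled, r, a, d_r, d_a):
    """Step 4: bilinear interpolation of the sampled map at full-resolution cell (r, a)."""
    fr, fa = r / d_r, a / d_a
    i0 = min(int(fr), sampled.shape[0] - 1)
    j0 = min(int(fa), sampled.shape[1] - 1)
    i1 = min(i0 + 1, sampled.shape[0] - 1)
    j1 = min(j0 + 1, sampled.shape[1] - 1)
    tr, ta = fr - i0, fa - j0
    top = (1 - ta) * sampled[i0, j0] + ta * sampled[i0, j1]
    bottom = (1 - ta) * sampled[i1, j0] + ta * sampled[i1, j1]
    return (1 - tr) * top + tr * bottom
```

Because `interpolate_threshold` works on one cell at a time, the threshold of a non-marked cell can be evaluated lazily, only for cells inside a candidate area.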

The Proposed Detection Framework
With the approach above, the spatiotemporal thresholding map covering all N_R × N_A cells, M = {m(r, a) | 1 ≤ r ≤ N_R, 1 ≤ a ≤ N_A}, is in theory available for detecting the targets. A contrast map C, which has N_R × N_A pixels, is defined first, and the intensity of cell (r, a) in the contrast map is denoted by C(r, a):
C(r, a) = Z(r, a, K) − m(r, a)
The input of the proposed detection framework is the contrast map C. As with the thresholding map M, the intensities of cells are not calculated unless necessary. The proposed detection framework consists of two stages, coarse detection and fine detection.
In the coarse detection stage, some cells are uniformly selected from the contrast map for efficiency. The input of the coarse detection is the contrast map.
Step 1. The sample intervals in range and azimuth, d_r and d_a, are estimated according to the parameters of the radar and the size of the targets.
Step 2. The approximate locations of targets are found efficiently by uniformly selecting some of the cells from the contrast map C. The selected cells, C_s, are called "seed cells".
Step 3. The candidate areas where targets may exist are found by applying a threshold T_d to the seed cells:
C(i d_r, j d_a) ≥ T_d: a target may exist at (i d_r, j d_a)
C(i d_r, j d_a) < T_d: no target exists at (i d_r, j d_a)
The false alarm rate P_FA(T_d) and the target detection rate P_D(T_d) can be derived as functions of the threshold when the parameters of the radar and targets are given. The optimal threshold can be obtained by Equation (15).
The derivation and simulation of the expressions for P_FA(T_d) and P_D(T_d) can be found in our previous work [20]. The results of the coarse detection are the seed cells whose intensities are larger than the threshold. The set of these seed cells, C_T, is assumed to have N_s elements.
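The coarse detection steps can be illustrated with the short NumPy sketch below. It assumes the contrast map is stored as a (range, azimuth) array and takes the threshold `t_d` as a given input; choosing the optimal threshold via Equation (15) is not reproduced here. The function names `contrast_map` and `coarse_detection` are hypothetical.

```python
import numpy as np

def contrast_map(current_frame, threshold_map):
    """C(r, a) = Z(r, a, K) - m(r, a) for the current (K-th) frame."""
    return current_frame - threshold_map

def coarse_detection(contrast, d_r, d_a, t_d):
    """Return full-resolution (range, azimuth) indices of seed cells with C >= t_d."""
    # Step 2: uniformly select the seed cells C_s from the contrast map.
    seeds = contrast[::d_r, ::d_a]
    # Step 3: keep the seeds whose contrast reaches the threshold T_d.
    i_idx, j_idx = np.nonzero(seeds >= t_d)
    return np.stack([i_idx * d_r, j_idx * d_a], axis=1)

# Toy example: one bright rectangular area on an otherwise empty contrast map.
frame = np.zeros((20, 40))
frame[5:9, 10:22] = 10.0
contrast = contrast_map(frame, np.zeros((20, 40)))
seeds = coarse_detection(contrast, d_r=2, d_a=4, t_d=5.0)
```

Only (N_R/d_r) × (N_A/d_a) cells are compared with the threshold, so the cost of this stage scales with the number of seed cells rather than with the full image size.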
The second stage is fine detection. The accurate state of the targets is estimated from the seed set C_T. The fine detection consists of the following steps.
Step 1. A seed cell in C_T is taken to find the contour of the candidate target in the contrast map C. The multiple contour tracking method in [19] is used to obtain the contours of the area under different thresholds.
Step 2. The area is assigned to one of four categories according to its contours. If the area is a huge plain without outstanding peaks, it is very likely unresolved clutter. If the area is larger than a normal target and has several outstanding peaks, its image usually originates from several nearby targets; go to Step 3 for further processing. If the area is moderate in size and has one outstanding peak, its image should originate from a single target; go to Step 4. If the area is very small, it is a false alarm; go back to Step 1.
Step 3. The image of multiple targets is partitioned into smaller subareas, each of which can be regarded as a single target. The multilevel thresholding method using the Rain algorithm in [20] was developed for this purpose. After obtaining the subareas, go to Step 4.
Step 4. The state of a single target is estimated from the image of the area. The state includes not only the location, size and posture of the target, but also the texture of the subarea. The texture is promising for improving the association in multi-target tracking [27]. Then, go to Step 1 to process the next seed cell in C_T.
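The decision logic of Step 2 can be sketched as a small rule-based classifier. This is only an illustration: the inputs `area_cells` and `n_peaks` are assumed to have been extracted from the contours beforehand, the size limits `min_target_cells` and `max_target_cells` are hypothetical parameters, and combinations that the text does not describe are returned as "undetermined".

```python
def classify_area(area_cells, n_peaks, min_target_cells, max_target_cells):
    """Assign a candidate area to one of the four categories of Step 2.

    area_cells: number of cells enclosed by the outer contour (assumed input).
    n_peaks: number of outstanding peaks found from the contours (assumed input).
    min_target_cells, max_target_cells: hypothetical size limits of one target.
    """
    if area_cells < min_target_cells:
        return "false alarm"            # very small area
    if area_cells > max_target_cells:
        if n_peaks == 0:
            return "unresolved clutter"  # huge plain without outstanding peaks
        if n_peaks >= 2:
            return "nearby targets"      # large area with several peaks
        return "undetermined"            # case not covered by the text
    if n_peaks == 1:
        return "single target"          # moderate size, one outstanding peak
    return "undetermined"               # case not covered by the text

label = classify_area(area_cells=60, n_peaks=1,
                      min_target_cells=10, max_target_cells=200)
```

Areas labelled "nearby targets" would then be passed to the partitioning of Step 3, and "single target" areas to the state estimation of Step 4.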
To better describe the proposed framework, two points are worth noting. The first is the relationship between the sample intervals in the spatiotemporal thresholding method and those in the coarse detection stage. Figure 4 presents an example of this relationship. The sample intervals d_R and d_A are used to locate the area of a clutter region, whereas d_r and d_a are used to locate the targets. Because the clutter regions are much larger than the targets, d_R and d_A are larger than d_r and d_a. The sample intervals d_R, d_A, d_r and d_a are related to the parameters of the radar and the size of the targets. Assume that the long axes of an extended target and a clutter region are l'_m and L'_m, and that the lower limits of l'_m and L'_m are l_min and L_min.
Then, according to Equation (6), there are at least d_a^t azimuth bins and at least d_r^t range bins whose amplitudes are affected by an extended target.
Similarly, there are at least d_A^C azimuth bins and at least d_R^C range bins whose amplitudes are affected by a clutter region.
The sample intervals d_R, d_A, d_r and d_a should be no less than the lower limits.
The second point is the multiple contours of the candidate area. As presented in Figure 5, the contours of a single target, nearby targets and a false alarm are represented by black lines of different intensities. The contour of the false alarm is small and irregular. The outstanding peaks in the area of targets can be found from the contours.
The flowchart of the proposed detection framework is presented in Figure 6. The input video and the current image are presented in the red dashed box. The video data contain an enormous number of cells to be processed by detection algorithms. The sampled video for the spatiotemporal thresholding method is presented in the blue dashed box. Only a few cells need to be stored in the processor, so much memory is saved. The (N_R/d_r) × (N_A/d_a) selected cells used in the coarse detection stage are presented in the green dashed box. The cells involved in the fine detection are presented in the purple dashed box. The state of the targets is estimated using only these cells. The small number of involved cells brings a significant decrease in calculations. The black dashed boxes at the bottom of Figure 6 indicate that the areas are clustered into three categories and that the points giving the locations of the targets can be obtained. These points are the results of the target detection.

Real Data
A suitable memory requirement and real-time performance are two basic requirements in target detection. However, it is hard to combine good detection ability with these two requirements. To evaluate the superiority of the proposed framework in memory requirement and computation, these two aspects are discussed for several representative methods in this section.
The computation of CFAR-based methods is closely related to the number of cells employed for estimating a threshold. Figure 7 shows the number of cells used in several methods. The x-, y- and z-axes in Cartesian coordinates represent the azimuth, range and time axes, respectively. The blue and red cells denote the pixels in the current frame and in past frames, respectively. The green cell represents the cell whose threshold is being estimated.
Parameters m and n are the size of one target in azimuth and range. The local region size is then (m + 2d_1) × (n + 2d_2). The guard area of m × n cells exists so that the clutter pixels are collected some distance away from the test cell and target pixels are prevented from contaminating the estimation of the clutter statistics. The target size in the image can be estimated by the models in Section 2.3. The parameters of the radar in this work are listed in Table 1 and presented in Figure 7. m and n equal 21 and 3, respectively. Here, we set d_1 = 5 and d_2 = 4. d_1 and d_2 denote the width of the protection cells on the range and azimuth axes, respectively. d_1 and d_2 should meet the criterion in Equation (19) to ensure that the selected cells do not belong to the target.
However, large values of d_1 and d_2 mean that more cells are employed to evaluate the threshold. In Cell-Averaging CFAR (CA CFAR) [28] and OS CFAR [10], (m + 2d_1) × (n + 2d_2) − m × n cells are employed for estimating the threshold. In CM CFAR [15], the p past cells at the same location are employed. We set p = 15 here. In spatiotemporal CFAR [16], both the (m + 2d_1) × (n + 2d_2) − m × n cells in the current frame and the p cells in past frames are employed for one threshold. In spatiotemporal CA CFAR, all cells in this region are necessary, i.e., (m + 2d_1) × (n + 2d_2) − m × n cells in the current frame and (m + 2d_1) × (n + 2d_2) × p cells in the past frames. In the proposed framework, as presented in Figure 7, a × b × c cells are sampled from the (a × d_R) × (b × d_A) × (c × d_t)-sized cube. a, b and c equal 11, 5 and 8 here, respectively. The sample intervals d_R, d_A and d_t are 5, 10 and 5, respectively.
The number of cells employed for one threshold in these methods is listed in Table 2. A sufficient number of employed cells and a large distance between the test cell and target pixels guarantee that a more suitable threshold can be obtained. Employing more cells also improves the robustness of the threshold estimate. However, it means that more calculations are spent on one threshold. In CFAR-based approaches, the thresholds of all cells, N_R × N_A here, are calculated, whereas only the thresholds of the (N_R/d_R) × (N_A/d_A) marked cells are estimated in our method. We set d_R = 5 and d_A = 10 here. Thus, only N_R × N_A / 50 thresholds are estimated. The total number of employed cells equals the number of cells for one threshold multiplied by the number of marked cells, i.e., 440 × (N_R/d_R) × (N_A/d_A) = 8.8 × N_R × N_A. Therefore, considerable calculations are saved. The total number of employed cells is presented in the fourth column of Table 2.
Table 2. The quantity of employed cells for thresholds.

| Method | The Quantity of Cells for One Threshold | The Value of the Quantity | The Total Number of Employed Cells |
|---|---|---|---|
| The proposed framework | a × b × c | 440 | 8.8 × NR × NA |
| Spatiotemporal CFAR [17] | (m + 2d1) × (n + 2d2) - m × n + p | 293 | 293 × NR × NA |
| OS CFAR [10] | (m + 2d1) × (n + 2d2) - m × n | 278 | 278 × NR × NA |
| CM CFAR [15] | p | 15 | 15 × NR × NA |
| CA CFAR [28] | (m + 2d1) × (n + 2d2) - m × n | 278 | |

Next, the memory requirement of the methods is compared. In OS CFAR and CA CFAR, only the current frame is utilized, so the memory requirement of the two approaches is NR × NA cells. In spatiotemporal CFAR [17], CM CFAR [15] and spatiotemporal CA CFAR, the (p - 1) past frames are also required. Therefore, the memory requirement of the three approaches is NR × NA × p cells. In the proposed framework, the current frame and (NR/dR) × (NA/dA) × (c - 1) cells in past frames are necessary. The expression and the value of the memory requirement are listed in Table 3. It can be inferred that the memory requirements of the spatiotemporal CFAR [17], CM CFAR [15] and spatiotemporal CA CFAR are much larger than the others.

Table 3. The memory requirement.

[Table 3 column headings: In Theory; In the Experiment. First row: The proposed framework.]

After the theoretical analysis of the calculation and memory requirements, we conducted an experiment in which the data for the two scenarios presented in Figure 3 were processed. The methods were run on a PowerPC MPC8640D (1.0 GHz, 4 GB RAM) in the Wind River Workbench 3.2 environment. As presented in Table 4, the elapsed time of the proposed framework was, as expected, much less than that of the others. CA CFAR had the second lowest calculation time owing to its strategy for estimating the threshold. OS CFAR [10] spent a huge elapsed time on sorting the cells and estimating the threshold iteratively. Its elapsed time in Scenario 2 was much larger than in Scenario 1 because Scenario 2 is more complex and more iterations were spent estimating each threshold, whereas the elapsed time of the other methods was stable across scenarios. The interval between two scans is 10 s in this radar. It is apparent that the proposed method is the only approach that can satisfy the real-time requirement.
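The theoretical counts used in the comparison above (cells per threshold, total employed cells and memory requirement) can be reproduced with a few lines of arithmetic. The sketch below uses the parameter values stated in the text; the frame size `N_R` × `N_A` is a hypothetical value chosen only for illustration, and the function names are assumptions of this example.

```python
def cells_per_threshold(m, n, d1, d2, p, a, b, c):
    """Cells employed to estimate one threshold, per method."""
    spatial = (m + 2 * d1) * (n + 2 * d2) - m * n
    return {
        "proposed": a * b * c,
        "spatiotemporal_cfar": spatial + p,
        "os_cfar": spatial,
        "cm_cfar": p,
        "ca_cfar": spatial,
    }


def thresholds_per_frame(n_r, n_a, d_r=1, d_a=1):
    """Number of thresholds estimated in one frame (d_r = d_a = 1: every cell)."""
    # Assumes n_r and n_a are divisible by d_r and d_a.
    return (n_r // d_r) * (n_a // d_a)


def memory_cells(n_r, n_a, p, d_r, d_a, c):
    """Cells that must be stored, following the expressions in the text."""
    current = n_r * n_a
    return {
        "single_frame": current,     # OS CFAR and CA CFAR
        "multi_frame": current * p,  # spatiotemporal CFAR, CM CFAR, spatiotemporal CA CFAR
        "proposed": current + (n_r // d_r) * (n_a // d_a) * (c - 1),
    }


per = cells_per_threshold(m=21, n=3, d1=5, d2=4, p=15, a=11, b=5, c=8)

N_R, N_A = 2000, 1000  # hypothetical frame size, not taken from the paper
total = {name: count * thresholds_per_frame(N_R, N_A) for name, count in per.items()}
# The proposed framework only estimates thresholds at the marked (sampled) cells.
total["proposed"] = per["proposed"] * thresholds_per_frame(N_R, N_A, d_r=5, d_a=10)
mem = memory_cells(N_R, N_A, p=15, d_r=5, d_a=10, c=8)

print(per)
print(total)
print(mem)
```

For this hypothetical frame size, the proposed framework spends more cells on each threshold than any single CFAR variant but estimates fifty times fewer thresholds, which is where the 8.8 × NR × NA total comes from.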
However, real data cannot be utilized to evaluate the tracking performance, mainly because the state of the targets in the two scenarios is unknown. Even if a trajectory were obtained, we would not know whether the trajectory originated from a target or from a clutter source. Therefore, synthetic data were applied in the following experiment.

Synthetic Data
Extensive experiments were conducted to verify the performance of the proposed framework in terms of robustness against noise, background suppression and target detection ability, and computation time. To fully assess the superiority of the proposed algorithm, the five approaches used in Section 4.1 were included for comparison.
In this example, a fleet of three parallel vessels is regarded as one group. The configuration of the fleet is shown in Figure 8a. The long axis and short axis of the three targets are 60 m and 15 m, respectively. The space between two targets is 35 m. The three targets, whose initial positions are (12,225 [20,21]). The echoes of the three targets over 20 scans are presented in Figure 8b. The three targets can be detected well when the sea clutter and measurement noise are absent. The synthetic scenario is presented in Figure 8c. In our former work [20], we showed that synthetic data are similar to real data in both distribution and texture. Therefore, synthetic data are suitable for evaluating the detection performance.

The seven groups of targets move in a surveillance area that consists of 13 subareas. The intensity of the K-distributed sea clutter in each subarea is different. The parameter v represents the shape parameter of the K-distributed sea clutter [26]. A larger value of v means stronger sea clutter, which can greatly deteriorate the detection rate and tracking precision. The synthetic images of the three targets under various shape parameters, together with the value of the shape parameter in each subarea, are shown in Figure 8c. It is worth noting that, in Group 0, no clutter exists and the only noise is the measurement error. This group is included to show the deterioration in performance caused by clutter regions. The targets are hard to follow by the naked eye when v equals 10 or 12.

The synthetic video data are fed to the detection approaches to obtain the points representing the positions of possible targets. Then, as presented in Figure 1, the points are associated to form trajectories using target-tracking approaches. In the experiment, the points obtained by the detection methods were fed to the PHD filter [1], and the trajectories of the targets were finally obtained.
A better set of trajectories can be achieved when an outstanding detection approach is employed. The optimal sub-pattern assignment (OSPA) distance [29] was used for evaluating the correctness of the trajectories. A lower OSPA distance means a more appropriate result. The OSPA distance between the ground truth of n targets T = {T1, T2, ..., Tn} and the estimated positions p = {p1, p2, ..., pn} in each scan is calculated as defined in [29], where Ω represents the set of permutations of length m with elements taken from T. The cut-off value c and the distance order p of the OSPA distance were set as c = 100 and p = 1.5. Note that the cut-off parameter

Figure 9 shows the OSPA distance for 20 scans and also reveals that the proposed detection approach performs better than the others. Comparison between Groups 1-7 and Group 0 indicates that the tracking performance is greatly deteriorated by the clutter, and it deteriorates further when the intensity of the clutter is high. However, the performance advantage of the proposed approach appears more obvious in severe scenarios (e.g., Group 6) because it has a strong capability of background suppression and target detection.

Figure 9. The OSPA distance of six scenarios at each scan: (a)-(f) Groups 1-7; and (g) Group 0.
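For readers who want to reproduce the metric, the following is a minimal sketch of the OSPA distance in its commonly used form, with the cut-off c and order p set to the values stated above. It is not the authors' implementation: the function name `ospa`, the brute-force search over permutations (adequate for a handful of targets) and the example coordinates are assumptions made for illustration.

```python
from itertools import permutations
from math import dist


def ospa(truth, estimate, c=100.0, p=1.5):
    """OSPA distance between two finite sets of 2-D points.

    Uses the standard definition: the best assignment of the smaller set to
    the larger one with distances cut off at c, plus a penalty of c for every
    unassigned point, averaged over the size of the larger set.
    """
    small, large = sorted((list(truth), list(estimate)), key=len)
    m, n = len(small), len(large)
    if n == 0:
        return 0.0  # both sets are empty
    # Brute force over all ordered selections of m points from the larger set.
    best = min(
        sum(min(c, dist(x, y)) ** p for x, y in zip(small, perm))
        for perm in permutations(large, m)
    )
    return ((best + (c ** p) * (n - m)) / n) ** (1.0 / p)


# Hypothetical positions (in meters) of three vessels and of the detected points.
truth = [(0.0, 0.0), (0.0, 35.0), (0.0, 70.0)]
detected = [(2.0, 1.0), (1.0, 36.0)]  # one vessel is missed
print(ospa(truth, detected))
```

A missed or false point contributes the full cut-off c to the sum, so a detector that merges two vessels into one point is penalized both for the localization error and for the cardinality mismatch.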
The existing CFAR detectors are insufficient for detecting the fleet because of several drawbacks. Firstly, in the CA CFAR and OS CFAR detectors, the cells of Target 1 may be employed to estimate the threshold of a cell that belongs to Target 2. The cells of other targets usually have a larger intensity than the normal background cells, so a higher threshold is obtained. This decreases the detection rate in the target detection stage. A satisfactory trajectory can hardly be achieved from only a few target points, even if an advanced target tracking algorithm is utilized. Secondly, because of the sea clutter, two or three targets would be regarded as one large target when the intensity of the cells between two targets is high. Taking the example in Figure 8c, Targets 1 and 2 in the patch of Group 4 would be regarded as one large target because of the sea clutter. Instead of two individual points, one inaccurate point is obtained, and misdetection and a huge localization error would arise in target tracking. In the proposed approach, however, the Rain algorithm [22] partitions the image of multiple targets into separate subareas, where each smaller subarea denotes one potential target. Therefore, two accurate points denoting the two targets can be obtained. Thirdly, because memory space is limited, saving many images in the cache is impossible. Therefore, compared to the cells that can be employed in the current frame, fewer cells in past frames can be employed in the CM CFAR detector and spatiotemporal CFAR detectors.
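The first drawback, a neighboring target inflating the threshold, can be illustrated with a deliberately simplified one-dimensional CA CFAR. The sketch below is a toy example rather than the detectors evaluated in this work: the profile values, the guard width, the number of reference cells and the scale factor `alpha` are all hypothetical.

```python
def ca_cfar_threshold(signal, idx, guard, ref, alpha):
    """CA CFAR threshold for cell idx in a 1-D profile.

    The threshold is alpha times the mean of the reference cells, i.e., the
    ref cells on each side of the test cell beyond the guard cells.
    """
    cells = [
        signal[k]
        for k in range(idx - guard - ref, idx + guard + ref + 1)
        if abs(k - idx) > guard and 0 <= k < len(signal)
    ]
    return alpha * sum(cells) / len(cells)


def make_profile(length, targets, level=8.0, background=1.0):
    """Flat background with rectangular targets given as (start, stop) ranges."""
    profile = [background] * length
    for start, stop in targets:
        for k in range(start, stop):
            profile[k] = level
    return profile


single = make_profile(40, [(19, 22)])          # one target
pair = make_profile(40, [(19, 22), (25, 28)])  # a second target close by

for name, profile in (("single", single), ("pair", pair)):
    thr = ca_cfar_threshold(profile, idx=20, guard=2, ref=6, alpha=3.0)
    print(name, thr, profile[20] > thr)
```

With these toy numbers, the cell at index 20 exceeds the threshold when the target is alone, but the threshold rises above the target level once the second target falls inside the reference window, so the same cell is no longer detected.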
Meanwhile, the performance of the CFAR-based method is related to the parameters of the radar and targets. The guard cells in azimuth and range were set to match the size of the target in this experiment. Therefore, it is hard to achieve a satisfying result in real engineering because the targets have various sizes. Setting multiple guard areas in different sizes is a promising way to alleviate the problem. However, it requires many more calculations. Meanwhile, the prior information of targets is unnecessary in selecting the employed cells in the proposed approach. Therefore, the proposed approach can solve the difficulty and is superior to the others in complex environments.
The average OSPA distance of different groups is presented in Table 5. The lowest OSPA distance in each group is emphasized in boldface. It is obvious that, with the utilization of the proposed detection framework, a better target tracking result can be achieved. The results show that the methods in [15,17], which consider several past frames, are superior to the other CFAR detectors because the clutter intensity of a cell can be estimated by the cells in the past frames in addition to the cells in the current frame. Although OS CFAR is more appropriate in a multi-target situation, it is still possible that both the employed cells and the cell under estimation are occupied by the same extended target. A higher threshold would then be obtained, which harms the detection rate. The problem is relieved by considering more cells in past frames. However, there is no free lunch: the methods in [15,17] need far more memory space than OS CFAR and CA CFAR to store the video data of past frames. The proposed framework, in contrast, is designed to detect extended targets, and non-extended targets may be missed because of the sampling.
The average running time of the performed methods is presented in Table 6. The lowest elapsed time in each group is emphasized in boldface. The result matches the analysis in Section 4.1. The elapsed time of the proposed approach was less than 1/80 of that of the CA CFAR detector. In the proposed approach, processing one frame of the synthetic image takes 0.5 ms at most. Meanwhile, the calculation time of the OS CFAR detector is still much larger than that of the others.
The experiment presented in this section verifies that the proposed approach is superior to the existing detectors with regard to detection performance, calculation and memory requirement simultaneously. The proposed detection framework is therefore a promising approach to improve target tracking performance in real engineering.

Conclusions
In this research, we present a target detection framework based on sampling and spatiotemporal detection. The coarse detection guarantees the real-time and low memory requirements by locating the area where targets may exist in advance. The fine detection can improve the detection performance by identifying and processing the single target, dense targets and sea clutter using different strategies.
The extensive experiments showed that excellent performance, real-time processing and low memory requirement can be achieved simultaneously by the proposed detection framework. The tracking performance can be improved by utilizing the proposed approach while far fewer calculations and less memory are spent. Meanwhile, far less prior information, such as the extension of targets, is necessary in using the proposed approach. This also makes the proposed approach more practical in real engineering.