Ground target localization and recognition via descriptors fusion

Keypoint matching can be defined as precisely locating the same point in two images. Recently, keypoint descriptors have had a great impact on target detection, as they aim to be strongly invariant to rotation, scale and translation. The detection task is carried out by matching a reference image against the scene image to localize the desired target in the input scene. An innovative approach is proposed in this work to fuse the state-of-the-art feature descriptors ORB and BRISK for accurate ground target detection in two phases. First, in the off-line phase, the fused features are extracted from different perspectives and azimuth angles of the desired target to build a comprehensive reference image representation. Second, in the on-line phase, the fused features are extracted from the whole scene and matched with the stored reference features to find the keypoint correspondences. Outliers are rejected using the Random Sample Consensus (RANSAC) algorithm, which speeds up the matching procedure. The conducted comparative analysis reveals the discriminative power of the fused features in localization and recognition tasks while keeping the proposed system working in real time.


Introduction
Keypoint matching is an essential task in several computer vision applications, such as image stitching [1], image retrieval [2] and object detection and recognition [3]. It is accomplished in three successive steps: first, detecting keypoint locations, such as blobs or corners, in both the reference and scene images; second, extracting a suitable feature to describe each detected keypoint; finally, finding the correspondences between the keypoints of both images via a similarity measure. Several aspects may be addressed to improve keypoint matching, but the most significant one is improving the keypoint description. Feature descriptors such as SIFT [3], SURF [4], ORB [5], BRISK [6] and AKAZE [7] have been introduced in previous years for this task, each with its own weaknesses and strengths. Generally, the more discriminative power a descriptor has, the more computation time it needs. Fusing several descriptors has been proposed to improve the matching capability. Fusion techniques can be classified into two main types [8], [9]: early fusion, where fusion is carried out before matching, and late fusion, where fusion is carried out after matching. This work aims to fuse binary descriptors to enhance keypoint matching accuracy. In the detection task, the matched points are used to localize the desired target in the input scene image. The contributions of this proposal are summarized as follows:
1-Building a comprehensive description of the reference image to overcome the lack of descriptors that are invariant to wide perspective angle variation.
2-Improving the discriminative power of the extracted features via descriptor fusion while keeping the detection system working in real time.
3-Providing the approaching direction alongside the target location in the scene image.
The paper is organized as follows: related work is presented in Section 2. Section 3 gives a brief background. The proposed detection system is explained in detail in Section 4. Section 5 presents the experimental results. Finally, Section 6 presents the conclusion.

Related works
This section briefly reviews related and previous work in two main subsections.

Keypoint descriptors
In the last few years, various keypoint descriptors have been introduced. Lowe [3] proposed SIFT (Scale-Invariant Feature Transform) as a robust descriptor in which the input image is convolved with Difference-of-Gaussian (DoG) functions at numerous scales to locate keypoints at scale-space extrema. Although SIFT is considered one of the most powerful descriptors, as shown in [10], its low speed has limited its use in many applications. Later, SURF (Speeded Up Robust Features) was proposed by Bay et al. [4] as a faster descriptor for the matching task. SURF uses the Fast-Hessian detector in the extraction phase, relying on the determinant of the Hessian matrix. Binary descriptors were introduced because various real-time applications demand faster feature extractors and low-memory descriptors. BRIEF (Binary Robust Independent Elementary Features) was proposed by Calonder et al. [11] as the first robust binary descriptor. Binary tests are performed on an image patch and their results are concatenated into a bit string that forms the descriptor. BRISK (Binary Robust Invariant Scalable Keypoints) was proposed by Leutenegger et al. [6]. It relies on a circular sampling pattern over which it computes various comparisons to form a binary string that is invariant to rotation and scale. As an improvement of the BRIEF descriptor and the FAST (Features from Accelerated Segment Test) detector, Rublee et al. [5] proposed ORB (Oriented FAST and Rotated BRIEF). Alahi et al. [12] proposed FREAK (Fast Retina Keypoint), which forms feature descriptors by selecting a set of binary tests that exhibit high variance among them.

Keypoint fusion
Significant effort has been devoted to improving keypoint descriptors. Although a specific keypoint descriptor may not be informative for a particular dataset, it may be helpful in another situation, and its contribution cannot be assessed if it is discarded. He et al. [13] proposed a multiple-descriptor approach that classifies images using the nearest neighbors (NN) to each category, computed over various types of feature descriptors extracted from the input images. The nearest neighbors to all classes are combined across the different kinds of descriptors, instead of applying k-NN over a single feature descriptor only. For keypoint fusion, a Bayesian approach was proposed by Mountney et al. [14], in which a set of powerful keypoint descriptors is chosen in a first training step and the chosen descriptors are then fused by a naive Bayesian network. The disadvantage of this approach is that it relies on a trained system that depends on the assigned dataset. Bakshi et al. [15] proposed a novel approach for iris recognition in which the matches achieved by the SIFT and SURF descriptors are fused. Matching is performed individually for the two descriptors and is then evaluated based on the total number of matches achieved by both. A fusion of two-dimensional and three-dimensional features for ground-target detection was proposed by Perakis et al. [16]. First, ground targets are learned from a labeled dataset. Second, a template matching technique is used for recognition. Various fusion methods for template matching were evaluated, presenting their strengths and weaknesses.

Figure 1 shows typical target detection and recognition via keypoint matching. A walk-through of the main components of the detection system is given here.

Keypoints detection
A feature detector is an algorithm used to detect keypoints (points of interest) in the scene image [17]. Corners, edges and blobs are the main forms of detected features. Feature detectors either come with their own description algorithm, such as ORB, AKAZE, SIFT, KAZE, BRISK and SURF, or exist on their own, like MSER, FAST and AGAST. As a case study, the FAST detector can be briefly explained as follows. FAST is a corner detector proposed by Rosten et al. [18] for detecting interest points in a scene image. An interest point is a pixel that has a well-defined position and can be robustly detected. The main idea of FAST is to compare the gray values of the pixels surrounding a candidate pixel (the nucleus) with the gray value of the nucleus itself. A point P is a feature if at least N contiguous pixels on a circle around P are all brighter than P by more than a threshold T or all darker than P by more than T. Although the FAST detector is preferable to other detectors because of its speed and low memory requirements, which make it suitable for real-time applications, it has a few drawbacks, such as its sensitivity to scale changes and its weak robustness to image noise [19]. As an improvement of the FAST detector, AGAST (Adaptive and Generic Accelerated Segment Test) was proposed by Mair et al. [20]. It estimates the decision tree in a more robust way, so that the tree is generic and does not have to be adapted periodically to a new environment. The AGAST detector is much faster and needs no adaptation, while maintaining the same corner response as the FAST detector.
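The segment test described above can be illustrated with a minimal pure-Python sketch. It assumes the commonly used 16-pixel circle of radius 3 around P; the function names, the default threshold of 20 and N = 9 are illustrative choices, not values taken from this work, and the sketch omits the high-speed rejection test and non-maximum suppression of a full detector.

```python
# Offsets (dx, dy) of the 16-pixel circle of radius 3 around the candidate pixel.
CIRCLE = [(0, -3), (1, -3), (2, -2), (3, -1), (3, 0), (3, 1), (2, 2), (1, 3),
          (0, 3), (-1, 3), (-2, 2), (-3, 1), (-3, 0), (-3, -1), (-2, -2), (-1, -3)]


def longest_circular_run(flags):
    """Length of the longest run of True values, treating the list as circular."""
    if all(flags):
        return len(flags)
    best = run = 0
    # Walking the list twice lets a run wrap around the end of the circle.
    for flag in flags + flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def is_fast_corner(image, x, y, threshold=20, n=9):
    """Segment test: P is a corner if at least n contiguous circle pixels are
    all brighter than P + threshold or all darker than P - threshold.
    `image` is a list of rows of gray values; (x, y) must lie at least
    3 pixels away from the image border."""
    p = image[y][x]
    ring = [image[y + dy][x + dx] for dx, dy in CIRCLE]
    brighter = [value > p + threshold for value in ring]
    darker = [value < p - threshold for value in ring]
    return max(longest_circular_run(brighter), longest_circular_run(darker)) >= n


if __name__ == "__main__":
    flat = [[100] * 7 for _ in range(7)]
    spot = [row[:] for row in flat]
    spot[3][3] = 10  # a dark pixel surrounded by bright ones
    print(is_fast_corner(flat, 3, 3), is_fast_corner(spot, 3, 3))
```

Treating the ring as circular matters because a valid arc of contiguous pixels may start near the end of the offset list and continue at its beginning.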

Description extraction
A feature descriptor represents the locality of each interest point via a feature vector. It must be distinctive and invariant to scale, rotation and translation. It must also provide additional description of the features/keypoints delivered by the detector [21]. Furthermore, it must be able to uniquely identify the corresponding feature points between two image frames to provide good feature matching accuracy. Descriptor vectors have binary or real values. A binary descriptor (e.g. BRIEF, ORB and BRISK) gives better performance in terms of time at the cost of encoding less information, while a real-valued descriptor (e.g. SIFT and SURF) encodes more information at the cost of a longer computation time. To clarify the difference between both types of descriptors, ORB, BRISK, SIFT and SURF are compared as follows:
3.2.1. ORB descriptor. ORB [5] is based on the BRIEF [11] visual descriptor and the FAST [18] keypoint detector. It aims to be faster and more efficient than SIFT in the detection task and addresses the fact that BRIEF is not invariant to rotation. It adds an orientation estimate to the FAST detection algorithm, thus providing rotation invariance. It is much more powerful than BRIEF, as it learns an efficient subset of binary tests. ORB computes a local orientation through the use of an intensity centroid [22], which is the intensity-weighted average of the pixel positions in the local patch and is assumed not to coincide with the center of the feature. The orientation is the vector between the feature location and the centroid. Keypoints are further detected at different scales. Although this orientation measure might look unstable, it compares well with the orientation assignment used in SIFT [5].
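The intensity-centroid orientation can be sketched in a few lines of Python. This is a simplified illustration rather than the ORB implementation: it assumes a square patch given as a list of rows (ORB restricts the moments to a circular region), and the function name is hypothetical. The angle is computed from the first-order moments m10 and m01 taken about the patch center.

```python
import math


def intensity_centroid_orientation(patch):
    """Angle (radians) of the vector from the patch center to its intensity
    centroid. With image coordinates (y pointing down), 0 means the centroid
    lies to the right of the center and pi/2 means it lies below it."""
    size = len(patch)
    center = (size - 1) / 2.0  # patch center, used as the origin
    m10 = m01 = 0.0
    for row_idx, row in enumerate(patch):
        for col_idx, value in enumerate(row):
            m10 += (col_idx - center) * value  # first-order moment along x
            m01 += (row_idx - center) * value  # first-order moment along y
    # The centroid is (m10/m00, m01/m00); its direction does not depend on m00.
    return math.atan2(m01, m10)


if __name__ == "__main__":
    bright_right = [[0, 0, 1], [0, 0, 1], [0, 0, 1]]
    bright_bottom = [[0, 0, 0], [0, 0, 0], [1, 1, 1]]
    print(intensity_centroid_orientation(bright_right))   # 0.0
    print(intensity_centroid_orientation(bright_bottom))  # pi / 2
```

Rotating the binary test pattern by this angle before sampling is what makes the resulting descriptor insensitive to in-plane rotation.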
3.2.2. BRISK descriptor. BRISK [6] provides both rotation and scale invariance. For scale invariance, BRISK detects keypoints in a scale-space pyramid, performing non-maxima suppression and interpolation across all scales. To describe the features, BRISK turns away from the random or learned patterns of BRIEF and ORB and instead uses a symmetric pattern. Sample points are positioned in concentric circles surrounding the feature, with each sample point representing a Gaussian blurring of its surrounding pixels. The standard deviation of this blurring increases with the distance from the center of the feature.
3.2.3. SIFT descriptor. SIFT [3] detects a number of keypoints by searching for extrema of a Difference-of-Gaussian (DoG) function at different scales. The location of the keypoints is then further refined. Using local image properties, an orientation is assigned to each keypoint over a neighborhood around the point of interest to provide invariance to rotation. Then, a descriptor (feature vector) is computed for each detected point, based on local image information at the characteristic scale. The orientation of a keypoint is estimated from the local image gradient: SIFT builds a histogram of the gradient orientations of sample points in a region around the keypoint, finds the highest peak in the histogram and uses the corresponding orientation as the main orientation of the keypoint. The SIFT descriptor is largely invariant to scale, orientation and illumination changes.
3.2.4. SURF descriptor. SURF [4] was proposed as an efficient improvement of SIFT that aims to reduce the computation time. It is much faster than SIFT and more robust. Feature extraction is based on the integral image, using box filters to efficiently approximate the second-order Gaussian derivative filters and calculating the Hessian value at the feature points and their surrounding points. The feature description is formed by calculating four kinds of Haar wavelet responses in a small area around the feature point. The description vector of the feature point is used for matching, and the SURF algorithm can complete image matching under moderate conditions. In this way it basically achieves real-time processing.
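The reason box filters are cheap on an integral image can be shown with a short sketch. It is a generic illustration of the integral-image technique, not the SURF implementation, and the function names are hypothetical: once the integral image is built, the sum over any axis-aligned rectangle costs four lookups regardless of the rectangle size.

```python
def integral_image(image):
    """Return a (h+1) x (w+1) table where entry [y][x] is the sum of all
    pixels in rows 0..y-1 and columns 0..x-1 of `image` (a list of rows)."""
    height, width = len(image), len(image[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def box_sum(table, x0, y0, x1, y1):
    """Sum of the pixels in columns x0..x1-1 and rows y0..y1-1 using four lookups."""
    return table[y1][x1] - table[y0][x1] - table[y1][x0] + table[y0][x0]


if __name__ == "__main__":
    image = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]
    table = integral_image(image)
    print(box_sum(table, 0, 0, 3, 3))  # 45, the whole image
    print(box_sum(table, 1, 1, 3, 3))  # 28, the lower-right 2x2 block
```

A box-filter response is then a weighted combination of a few such rectangle sums, which is why its cost is independent of the filter scale.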

Matching
Image matching is an important technology in computer vision and image processing and is based mainly on keypoint matching. The main goal of keypoint matching is to find pixel correspondences that represent the same real-world point in two images. Its principle is to search for a known image pattern in other images. The performance of matching techniques based on interest points depends on both the properties of the keypoints and the choice of their associated descriptors. The purpose of keypoint extraction is to match the same feature point in different images and thereby complete the matching between images. Once the keypoints and their associated descriptors have been extracted from two or more images, the next step is to establish feature matches between them. The brute-force matcher is examined here because of its simplicity and efficiency in the matching task.
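A distance measure between two interest point descriptors $\phi_i$ and $\phi_j$ can be defined. Since we use binary descriptors, the Hamming distance is adopted:

$$d_k(\phi_i, \phi_j) = \sum_{b=1}^{B} \phi_i[b] \oplus \phi_j[b] \qquad (4)$$

where $B$ is the descriptor length in bits and $\oplus$ denotes the bitwise XOR. Note that other distance measures are also used in matching descriptors, such as the Hellinger [23] and Mahalanobis [24] distances. Based on $d_k$, the points of Q are sorted in ascending order independently for each descriptor, creating the sets A match between the pair of interest points (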
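Brute-force matching of binary descriptors with the Hamming distance can be sketched as follows. This is a minimal standard-library illustration, assuming descriptors are stored as equal-length `bytes` objects; the function names are hypothetical, and a practical system would typically rely on an optimized library matcher instead.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def brute_force_match(query, train):
    """For every query descriptor, find the nearest train descriptor.
    Returns (query_index, train_index, distance) tuples sorted by ascending
    distance, so the most reliable matches come first."""
    matches = []
    for query_idx, descriptor in enumerate(query):
        distances = [hamming_distance(descriptor, candidate) for candidate in train]
        train_idx = min(range(len(train)), key=distances.__getitem__)
        matches.append((query_idx, train_idx, distances[train_idx]))
    return sorted(matches, key=lambda match: match[2])


if __name__ == "__main__":
    scene = [b"\xff\x00", b"\x0f\x0f"]
    reference = [b"\x0f\x0f", b"\xff\x01"]
    print(brute_force_match(scene, reference))  # [(1, 0, 0), (0, 1, 1)]
```

The exhaustive search compares every query descriptor with every reference descriptor, which is simple and exact but scales with the product of the two set sizes.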

RANSAC outliers' rejection algorithm
Random Sample Consensus (RANSAC) is a simple algorithm for robustly fitting models in the presence of many outliers. It is a non-deterministic algorithm, as it produces a reasonable result only with a certain probability, and this probability increases as more iterations are allowed. The probability that no all-inlier sample has been picked after k iterations is $(1 - G^{P})^{k}$, where G is the proportion of inliers among the matches found and P is the number of point pairs needed to fit the model (4 in the case of the homography matrix H). Given a dataset of matches containing both inliers and outliers, RANSAC uses a voting scheme to find the optimal fitting result. Data elements in the dataset are used to vote for one or multiple models. The implementation of this voting scheme is based on two assumptions: the noisy features will not vote consistently for any single model (few outliers), and there are enough features to agree on a good model (few missing data).
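The relation between the inlier proportion and the number of iterations can be made concrete with a small calculation. The sketch below assumes the failure probability after k iterations is (1 − G^P)^k, with G the inlier proportion and P the sample size, and solves it for the smallest k that reaches a desired confidence; the function names and the 0.99 confidence default are illustrative assumptions.

```python
import math


def failure_probability(inlier_ratio, sample_size, iterations):
    """Probability that none of the `iterations` random samples is all-inlier."""
    return (1.0 - inlier_ratio ** sample_size) ** iterations


def required_iterations(inlier_ratio, sample_size=4, confidence=0.99):
    """Smallest k such that at least one all-inlier sample is drawn with
    probability >= confidence. sample_size=4 corresponds to the four point
    pairs needed to estimate a homography."""
    good_sample = inlier_ratio ** sample_size
    if good_sample <= 0.0:
        raise ValueError("inlier_ratio must be positive")
    if good_sample >= 1.0:
        return 1  # every sample is all-inlier
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - good_sample))


if __name__ == "__main__":
    print(failure_probability(0.5, 4, 1))  # 0.9375
    print(required_iterations(0.5))        # 72
```

With half of the matches being inliers, a single four-pair sample is all-inlier only 6.25% of the time, which is why dozens of iterations are needed even for a moderate confidence level.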

Proposed detection system
The proposed detection system accomplishes the detection task in two phases. First, in the off-line phase, it integrates the descriptions extracted from a pre-selected set of target reference images. Second, in the on-line phase, it matches the description extracted from the input scene with the pre-extracted ones to precisely localize the desired target and find its viewpoint direction. Both phases are explained in the following subsections.

Building reference image description (off-line phase)
Detecting a ground target in an aerial scene involves many challenges. These challenges mainly arise from the variation in the ground target viewpoint, because the airborne camera may capture the desired target from any arbitrary direction around it. To that end, a set of reference images from eight distinct viewpoints is used to establish a comprehensive target description. These reference images are taken from eight directions, as the airborne camera approaches the target from the North (N), the North-East (NE), and so on through the remaining compass directions. The keypoints of each reference image are detected using the FAST detector and delivered to the descriptors. A fused feature vector is then established for each detected keypoint by concatenating the ORB and BRISK feature arrays horizontally. The resulting feature vector has a higher dimension and encodes more information about the detected keypoints to strengthen the matching performance. As depicted in Figure 2(a), each reference image was captured from a different approaching direction to tackle the viewpoint variance of the extracted features. Figure 2(b) demonstrates the stacking of the fused features to build up the reference image description. It is worth mentioning that all extracted feature points are indexed in a manner that preserves the reference image number, as follows:

$$\Phi(P) = \{\phi_{1,1}, \phi_{2,1}, \ldots, \phi_{N_1,1}, \phi_{1,2}, \phi_{2,2}, \ldots, \phi_{N_2,2}, \ldots, \phi_{1,8}, \phi_{2,8}, \ldots, \phi_{N_8,8}\}$$

Thus, for example, $\phi_{1,2}$ is the fused description extracted for the first keypoint in the second reference image, which is annotated with the NE approaching direction, and $N_k$, where k = 1, 2, ..., 8, is the total number of keypoints detected in reference image k.

Figure 3. The proposed detection system block diagram
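The horizontal concatenation and the index-preserving stacking can be sketched as follows. This is a minimal illustration under stated assumptions: descriptors are plain `bytes` objects, ORB and BRISK descriptors are taken to be 32 and 64 bytes long (the usual OpenCV defaults, not values stated in this work), and all function names are hypothetical.

```python
ORB_BYTES = 32    # assumed ORB descriptor length
BRISK_BYTES = 64  # assumed BRISK descriptor length


def fuse(orb_descriptor: bytes, brisk_descriptor: bytes) -> bytes:
    """Concatenate the two binary descriptors of one keypoint horizontally."""
    if len(orb_descriptor) != ORB_BYTES or len(brisk_descriptor) != BRISK_BYTES:
        raise ValueError("unexpected descriptor length")
    return orb_descriptor + brisk_descriptor


def build_reference_description(references):
    """`references` holds one list per reference image; each list contains an
    (orb, brisk) descriptor pair per detected keypoint. Returns a stacked list
    of (keypoint_index, reference_index, fused_descriptor) with 1-based
    indices, so every fused vector remembers its reference image."""
    stacked = []
    for ref_idx, keypoints in enumerate(references, start=1):
        for kp_idx, (orb, brisk) in enumerate(keypoints, start=1):
            stacked.append((kp_idx, ref_idx, fuse(orb, brisk)))
    return stacked


if __name__ == "__main__":
    north = [(bytes(32), bytes(64)), (b"\x01" * 32, b"\x02" * 64)]
    north_east = [(b"\xaa" * 32, b"\x55" * 64)]
    description = build_reference_description([north, north_east])
    print([(kp, ref, len(vec)) for kp, ref, vec in description])
    # [(1, 1, 96), (2, 1, 96), (1, 2, 96)]
```

Because both parts are binary strings, the fused vector can still be compared with the Hamming distance, which simply adds the ORB and BRISK bit differences.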

Desired target approaching direction and localization (on-line phase)
In the on-line phase, the input scene keypoints are detected and their descriptions are extracted in the same way as in the off-line phase. Subsequently, matching is carried out between the scene keypoints and the keypoints of all reference images via the brute-force matcher based on the Hamming distance. Then, all outliers (poor matches) are rejected using the RANSAC algorithm, leaving the inliers (good matches). These inliers are used to decide the approaching direction alongside the target spatial location as follows. The reference image with the maximum number of inliers and the minimum sum of distances is picked to obtain the approaching direction, according to this formula:

$$\rho_r = \frac{S_r}{n_r}, \qquad S_r = \sum_{k=1}^{n_r} d_{k,r}$$

where $\rho_r$ is the dissimilarity ratio for reference image r, $n_r$ is the number of inlier points for reference image r, and $S_r$ is the sum of the distances $d_{k,r}$ of all reference-scene inlier keypoint pairs k of reference image r. Thus, the reference image that achieves the minimum dissimilarity ratio is chosen as the approaching direction.
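The selection rule can be sketched as follows, under the assumption that the dissimilarity ratio of a reference image is the sum of its inlier distances divided by its number of inliers, which is consistent with favoring many inliers and a small distance sum. The function names and the ordering of the eight direction labels are illustrative assumptions.

```python
# Assumed labeling of the eight reference images; only N and NE as the first
# two entries follow from the text.
DIRECTIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def dissimilarity_ratio(inlier_distances):
    """Sum of inlier distances divided by the inlier count (assumed form).
    A reference image without inliers can never be selected."""
    if not inlier_distances:
        return float("inf")
    return sum(inlier_distances) / len(inlier_distances)


def pick_direction(inliers_per_reference):
    """`inliers_per_reference[r]` lists the matching distances of the inliers
    found for reference image r. Returns (direction_label, ratio) for the
    reference image with the minimum dissimilarity ratio."""
    ratios = [dissimilarity_ratio(distances) for distances in inliers_per_reference]
    best = min(range(len(ratios)), key=ratios.__getitem__)
    return DIRECTIONS[best], ratios[best]


if __name__ == "__main__":
    inliers = [[10, 20], [5, 5, 5], [], [30], [], [], [], []]
    print(pick_direction(inliers))  # ('NE', 5.0)
```

Returning infinity for an empty inlier list keeps reference images that RANSAC rejected entirely out of the comparison without special-casing them in the selection.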
The desired target center in the scene image is determined as the median of the coordinates of all inlier keypoints. Fig. 3 shows the block diagram of the proposed detection system.
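A minimal sketch of the median-based localization is given below, assuming the inliers are available as (x, y) scene coordinates and that the median is taken per coordinate; the function name is hypothetical.

```python
from statistics import median


def target_center(inlier_points):
    """Estimate the target center as the per-coordinate median of the inlier
    keypoint locations, given as (x, y) tuples in scene-image coordinates."""
    if not inlier_points:
        raise ValueError("at least one inlier is required")
    xs = [x for x, _ in inlier_points]
    ys = [y for _, y in inlier_points]
    return median(xs), median(ys)


if __name__ == "__main__":
    # The third point is a stray match far from the target.
    print(target_center([(10, 20), (12, 22), (200, 5)]))  # (12, 20)
```

Unlike the mean, the median is barely moved by a stray match that survives outlier rejection, as the example with the point (200, 5) shows.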

Experimental results
This section compares the processing time and detection accuracy of the proposed ORB-BRISK fusion with those of the SURF-MSER fusion in [25]. The reported times are the total time taken to detect keypoints and to extract descriptors for a dataset of 30 test images with different perspective, rotation and scaling of a stationary ground target, together with a set of 8 reference images taken from a different sensor, each representing an approaching direction. The speed of feature detection and extraction is an important factor in real-time target localization and recognition. Since the absolute times depend on the machine running the code, both algorithms are tested on the same machine, with a Core i7-7700HQ 2.8 GHz processor and 8 GB of RAM. Table 1 reports the detection accuracy and the mean processing time, from feature detection to target localization and recognition, for SURF-MSER and the proposed ORB-BRISK algorithm. As indicated in Table 1 and shown in Fig. 4, the object localization accuracy of SURF-MSER is slightly better than that of ORB-BRISK. However, Fig. 5 shows that the proposed ORB-BRISK takes a mean processing time of 36.93 msec (27 fps), while the SURF-MSER algorithm takes 220 msec (4 fps). Fig. 6 shows the matched points between the reference image and the target in the scene image using the proposed ORB-BRISK algorithm.

Conclusion
A novel approach for ground target detection via keypoint matching is presented in this paper. The approach accomplishes ground target detection in two phases. First, in the off-line phase, the fused binary features are extracted from different viewpoint angles of the desired target to build a comprehensive reference image representation. Second, in the on-line phase, the fused features are extracted from the whole scene and then matched with the stored reference features to find the keypoint correspondences. These correspondences are filtered via the RANSAC algorithm to find the good matches, which are used to localize the target and obtain its approaching direction. The conducted comparative analysis reveals the discriminative power of the fused features in localization and recognition tasks while keeping the proposed system working in real time.