PRIMITIVE VISUAL RELATION FEATURE DESCRIPTOR APPLIED TO STEREO VISION

In this study, we present a novel local image descriptor that is efficient to compute densely and carries semantic information based on visual primitives and the relations between them, namely coplanarity, cocolority, distance, and angle. The designed feature descriptor covers both geometric and appearance information. The proposed descriptor has demonstrated its ability to compute dense depth maps from image pairs, with good performance as evaluated by the Bad Matched Pixel criterion. Since the novel descriptor is very high dimensional, we show that a compact descriptor can be used instead. An analysis of size reduction was performed in order to reduce the computational complexity with no loss of quality, using different algorithms such as max-min or PCA. The novel descriptor obtains better results than state-of-the-art methods in the stereo vision task. In addition, a GPU implementation is presented that reduces processing time, using an NVIDIA GeForce GT640 graphics card and Matlab on a PC running Windows 10.


INTRODUCTION
Computer vision is an interdisciplinary field that seeks to perform processes similar to human vision, employing methods for acquiring, processing, and analyzing digital images and video so that they can be understood. Some tasks in computer vision include segmentation, object detection, and identification, which extract high-dimensional data from the real world and transform it using descriptors that can interface with other processes.
Stereo vision is one of the most active research areas in computer vision. Consequently, a variety of solutions and variations of existing methods have been presented for specific needs or requirements. The goal of stereo vision is to estimate the depth of a scene through disparity maps, matching similarities across a pair of images. A taxonomy of existing stereo algorithms that allows the dissection and comparison of individual algorithm components is presented in [1]. This taxonomy is based on four steps that stereo algorithms typically perform:
1. Matching cost computation
2. Cost aggregation
3. Disparity computation
4. Disparity refinement
The sequence of the steps depends on the type of algorithm. Local algorithms typically follow steps 1, 2, and 3, but some combine steps 1 and 2 and use matching costs based on the support region. Global algorithms, on the other hand, do not perform an aggregation step but rather seek a disparity assignment (step 3) that minimizes a global cost function (step 1). Some authors focus their efforts on one or more steps, depending on their particular goals. Different matching costs have been studied; the most common are pixel-based and include squared intensity differences (SD) and absolute intensity differences (AD); in the video processing field, the mean absolute difference (MAD) and mean-squared error are more frequently used. Other approaches use gradient-based measures and nonparametric measures, such as the rank and census transforms. It is also possible to perform a preprocessing step, using histogram equalization or Gaussian filters.
Local and window-based methods aggregate the matching cost over a support region, employing square windows or Gaussian convolutions, shiftable windows, or windows with adaptive sizes.
Algorithms can be classified by the disparity computation step into local methods, global methods, and dynamic programming methods.
Local methods emphasize the matching cost computation and cost aggregation steps, computing the final disparity with a "winner-take-all" method. Global methods, in contrast, often skip the aggregation step and instead formulate an energy minimization framework, where the objective is to find a disparity function that minimizes a global energy. More recently, max-flow and graph-cut methods have been proposed to solve a special class of global optimization problems. Dynamic programming methods find the global minimum for independent scanlines as an optimization problem. These approaches work by computing the minimum-cost path through the matrix of all pairwise matching costs between two corresponding scanlines.
Most state-of-the-art methods rely on local measures to estimate the similarity of pixels across images and then impose global shape constraints using techniques such as dynamic programming [2], level sets [3], graph cuts [4], PDE [5], or EM [6].
Image descriptors can be classified as global or local. 2D local features such as SIFT are commonly used in object detection tasks, while global descriptors such as visual contours have been shown to provide a semi-global overview of a scene and to give more information about the shape of an object than local features; they are also flexible enough for tasks such as classification and recognition.
2D visual contours and their relations have been used in computer vision and robotics in various contexts; for example, in [7], contour relations are used as features for object recognition. Similarly, Henricsson [8] uses geometrical relations such as proximity, curvilinearity, and symmetry between contours to describe objects based on combinations of these relations. Contours are important in computer vision because they provide a means to group local features together while preserving the spatial relations between these contours [9].
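The local pipeline described above (matching cost, aggregation over a support window, winner-take-all selection) can be illustrated with a minimal sketch. It assumes rectified grayscale images, an absolute-difference cost, and a plain box window; the function name `block_matching_wta` and its parameters are illustrative and do not come from any of the cited works.

```python
import numpy as np

def block_matching_wta(left, right, max_disp, half_win=2):
    """Local stereo sketch: AD cost, box aggregation, winner-take-all."""
    left = left.astype(np.float64)
    right = right.astype(np.float64)
    rows, cols = left.shape
    k = 2 * half_win + 1
    # Cost volume; columns that cannot be matched at disparity d stay at inf.
    costs = np.full((max_disp + 1, rows, cols), np.inf)
    for d in range(max_disp + 1):
        # Step 1: absolute difference between left(x) and right(x - d).
        ad = np.abs(left[:, d:] - right[:, :cols - d])
        # Step 2: aggregate the cost over a k x k window (box filter).
        padded = np.pad(ad, half_win, mode="edge")
        agg = np.zeros_like(ad)
        for dy in range(k):
            for dx in range(k):
                agg += padded[dy:dy + ad.shape[0], dx:dx + ad.shape[1]]
        costs[d, :, d:] = agg
    # Step 3: winner-take-all, i.e. the disparity with the lowest cost.
    return np.argmin(costs, axis=0)
```

A descriptor-based variant would replace the absolute difference of intensities in step 1 with a distance between per-pixel descriptor vectors, leaving steps 2 and 3 unchanged.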

RELATED WORKS
We present a brief review of related work. Local image descriptors have already been used in dense matching, although more traditionally they are used to match only sparse pixels that are feature points. Most of the existing stereo vision algorithms are based on pixel differences and involve a matching cost and disparity refinement. For example, in [10] (SAD+Wavelet), cost aggregation is performed using a multilevel disparity map (DM) approach with a matching cost such as SAD combined with a wavelet; finally, an adaptive filter is used during the postprocessing step. A variation of this approach is the technique presented in [11] (MDEC+SSIM), where a pyramid DM estimation is used with the SSIM measure as the matching cost. Methods based on a global approach, such as graph cuts, belief propagation, or semi-global matching, usually achieve better performance at a high computational cost. Paper [12] presents an algorithm based on the Random Walk with Restart (RWR) algorithm, updating the matching cost aggregated into superpixels.
Paper [13] presents a hybrid method using transition pixel values in horizontal and vertical orientations and a polynomial curve fitting, showing robustness under radically different radiometric conditions. This approach uses a "winner-take-all" disparity computation. Yong [14] presents a feature detector using the SURF, SIFT, and HOG algorithms to find interest points and to evaluate the quality of the detected points; then, a regression of the multimodal image is used to compute the disparity map.
The promising Daisy descriptor [15] builds on SIFT and GLOH; it has been designed to be robust to perspective and lighting changes and has been shown to be well suited to dense matching. Another local descriptor mainly used for image correlation is the Scale Invariant Descriptor (SID) [20]. This descriptor uses a combination of log-polar sampling with spatially varying filtering that converts image scaling and rotation into translations. Scale invariance is achieved by taking the Fourier Transform Modulus (FTM) of the transformed signals, because the FTM is translation invariant.
In this paper, we propose a novel local feature descriptor based on visual primitives (VP). Additionally, semantic information is obtained using the relations between visual primitives (VPR). These relations are cocolority, coplanarity, normal distance, and the angle between primitives. The principal difference between the novel descriptor and existing approaches is that it can extract structural and semantic information from an image; additionally, the designed descriptor has demonstrated robustness against radiometric distortions. A reduction of the feature descriptor size is also applied in order to save computational cost and memory. The designed descriptor is implemented on a GPU to accelerate processing, which is important for real-time applications. The designed descriptor is used with traditional depth map estimation algorithms, and its performance is confirmed via traditional quality metrics.
The remainder of the paper is organized as follows. In Sect. 2, the novel local feature descriptor is explained, and the framework for disparity estimation is presented together with the dimension reduction. In Sect. 3, the experimental results are presented. Finally, Sect. 4 concludes this study by discussing the results of the proposed approach.

VISUAL PRIMITIVE RELATION DESCRIPTOR
The framework of the designed descriptor is as follows. For a given input image in RGB space, we first convert it from the RGB color space to the CIELab space and then apply the monogenic filter to the L channel. This filter gives the information about the visual primitives: magnitude, phase, and orientation. Next, this information is used to obtain the relations between the primitives and to form the feature vector. We take advantage of the two degrees of freedom available when designing monogenic filters to extract information. To measure quality performance, the designed feature descriptor is used as a metric for stereo matching similarities across a pair of images. This measure is then used in the traditional block matching algorithm to estimate a depth map.

VISUAL PRIMITIVES
The visual primitives are a set of visual descriptors. These primitives [21] describe edge structures by means of several properties that are relevant for edges only. They have been used to formalize different contexts in visual scenes, as well as 6D motion and 3D spatial context. These descriptors have been employed in several applications such as learning of object representations, pose estimation, motion estimation, and vision-based grasping.
The primitives explicitly express important structural properties of the edges, such as local orientation, phase, color, and motion; this information is encoded in a multi-dimensional feature in which geometric and appearance cues are separated. Information about these different properties can be extracted from images by applying a variety of linear and non-linear local filtering operations.
The current work makes use of the monogenic signal presented by Felsberg and Sommer [36]. It uses a bandpass filter that is radially symmetric around the origin ('even') in both the frequency domain and the image domain. The log-Gabor filter is used as the even filter (eq. 1). The two odd parts of the filter, obtained using the Riesz transform, are presented in eqs. 2 and 3. Each of the two resulting filters is odd-symmetric, with the axes of symmetry along the two image axes. After filtering, the monogenic signal can be written as a combination of the three parts (one even, two odd), forming the vector shown in eq. 4. These three components can be interpreted in a spherical polar coordinate system, using the radius, elevation angle, and azimuthal angle. The local amplitude A(x) is the radial part of the representation; the local phase φ is the angle between the even part and the combined odd part; and the local orientation θ is the orientation of the odd filter and represents the dominant direction in the image at point x. The visual primitive is a vector as shown in eq. 3. Fig. 1 presents an example of the visual primitives obtained from an image.
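As an illustration of the construction just described, the following is a minimal sketch of a single-scale monogenic signal computed with the FFT. It assumes the standard log-Gabor radial profile and the usual frequency-domain form of the Riesz transform; the parameter names `wavelength` and `sigma_on_f` and their default values are assumptions made for this example, not values taken from the paper, and sign conventions for the odd filters vary between implementations.

```python
import numpy as np

def monogenic_signal(img, wavelength=8.0, sigma_on_f=0.55):
    """Return local amplitude, phase, and orientation of a 2D image."""
    rows, cols = img.shape
    u = np.fft.fftfreq(cols)[None, :]   # horizontal frequency (cycles/pixel)
    v = np.fft.fftfreq(rows)[:, None]   # vertical frequency (cycles/pixel)
    radius = np.sqrt(u ** 2 + v ** 2)
    radius[0, 0] = 1.0                  # avoid log(0) and division by zero at DC
    f0 = 1.0 / wavelength               # center frequency of the band-pass filter
    # Even part: radially symmetric log-Gabor band-pass filter.
    log_gabor = np.exp(-np.log(radius / f0) ** 2 / (2 * np.log(sigma_on_f) ** 2))
    log_gabor[0, 0] = 0.0               # remove the DC component
    # Odd parts: Riesz transform kernels in the frequency domain.
    h1 = 1j * u / radius
    h2 = 1j * v / radius
    spectrum = np.fft.fft2(img)
    even = np.real(np.fft.ifft2(spectrum * log_gabor))
    odd1 = np.real(np.fft.ifft2(spectrum * log_gabor * h1))
    odd2 = np.real(np.fft.ifft2(spectrum * log_gabor * h2))
    amplitude = np.sqrt(even ** 2 + odd1 ** 2 + odd2 ** 2)
    phase = np.arctan2(np.hypot(odd1, odd2), even)   # in [0, pi]
    orientation = np.arctan2(odd2, odd1)             # in [-pi, pi]
    return amplitude, phase, orientation
```

Calling the function once per scale and per smoothing kernel would give the multiple responses used later in the descriptor; how the Gaussian kernels enter the filter bank is not shown here.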

RELATIONS BETWEEN VISUAL PRIMITIVES
Since primitives carry geometrical and appearance information, they have attributes such as mean color, position, and orientation. The mean color is defined in the CIELab color space because the channels of an image are statistically less correlated in this space. These attributes, together with the geometrical and visual properties of the primitives, give relations between primitives that can be used within the context of various reasoning processes. We describe the primitive relations used in this work below.
Angle: The angle between two primitives is defined from the orientations of the primitives (eq. 6).
Normal Distance: The normal distance between two primitives is defined as the distance from one primitive's position to the line created by the other primitive's orientation and position. The distance between the ith and jth primitives in a scene is defined accordingly (eq. 7).
Coplanarity: The coplanarity of entities can be measured by their elongation with a common plane. We define the coplanarity between two primitives as the mean angle between a common plane and the best-fit lines of the primitives. The coplanarity between the ith and jth primitives in the scene is defined accordingly (eq. 8).
Cocolority: The cocolority between two primitives is defined as the difference between the colors of the primitives (eq. 9).
The relations between primitives are illustrated in Fig. 2.
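The verbal definitions above can be made concrete with a small sketch. The exact formulas are not reproduced in this text, so the functions below are one plausible reading: orientations are treated as undirected (modulo π), the normal distance is the point-to-line distance in the image plane, and the color difference is taken as the Euclidean distance in CIELab. All three choices, and the function names, are assumptions for illustration; coplanarity is omitted because it requires plane fitting that is not specified here.

```python
import numpy as np

def angle_relation(theta_i, theta_j):
    """Smallest angle between two undirected orientations, in [0, pi/2]."""
    diff = abs(theta_i - theta_j) % np.pi
    return min(diff, np.pi - diff)

def normal_distance(pos_i, pos_j, theta_j):
    """Distance from pos_i to the line through pos_j with orientation theta_j."""
    dx = pos_i[0] - pos_j[0]
    dy = pos_i[1] - pos_j[1]
    # Project the offset onto the unit normal (-sin(theta), cos(theta)).
    return abs(dx * np.sin(theta_j) - dy * np.cos(theta_j))

def cocolority(lab_i, lab_j):
    """Euclidean color difference between two CIELab triplets (assumed form)."""
    diff = np.asarray(lab_i, dtype=float) - np.asarray(lab_j, dtype=float)
    return float(np.linalg.norm(diff))
```

For example, `normal_distance((0, 1), (0, 0), 0.0)` measures how far the point (0, 1) lies from the horizontal line through the origin.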

VISUAL PRIMITIVE RELATIONS DESCRIPTOR
We present a formal definition of the designed visual primitive relations (VPR) descriptor. For a given input color image, we convert it from the RGB to the CIELab space, since the cocolority relation in eq. 9 is defined in that space. The monogenic signal is computed using $S$ scales and $\Sigma$ Gaussian kernels, so $S \times \Sigma$ filters are used. Each filter is then convolved with the L channel of the image, yielding $3 \times S \times \Sigma$ different components of the monogenic signal (three components for each scale wavelength $s_i$ and Gaussian kernel $\sigma_j$, with $i = 1, \ldots, S$ and $j = 1, \ldots, \Sigma$). We thus obtain the visual primitives with different responses.
At each pixel location, the designed descriptor consists of a vector made of values from the visual primitive relations located in a square window $W$ centered on the location. Let $h^{a}_{s,\sigma}(x, y)$ represent the vector formed of the values at location $(x, y)$ in an image:
$$h^{a}_{s,\sigma}(x, y) = \left[\, a_{s,\sigma}(x, y; u, v) \,\right]_{(u, v) \in W}, \qquad (10)$$
where $a_{s,\sigma}(x, y; u, v)$ is the angle relation described in eq. 6 between location $(x, y)$ and location $(u, v)$ in the neighborhood inside window $W$, computed from the filter response at scale $s$ and Gaussian kernel $\sigma$.
We normalize this vector to unit norm and denote the normalized vector by $\tilde{h}^{a}_{s,\sigma}(x, y)$. If $\Sigma$ is the number of Gaussian kernels used and $S$ is the number of scales of the monogenic signal, the feature vector of the angle relation $H^{a}(x, y)$ for a location $(x, y)$ is defined as the concatenation of the $\tilde{h}^{a}$ vectors:
$$H^{a}(x, y) = \left[\, \tilde{h}^{a}_{s_i,\sigma_j}(x, y) \,\right]_{i = 1, \ldots, S;\; j = 1, \ldots, \Sigma}.$$
For the normal distance and coplanarity relations, we use the same structure as for the angle relation, which gives the relationship vectors $\tilde{h}^{d}_{s,\sigma}(x, y)$ and $\tilde{h}^{p}_{s,\sigma}(x, y)$ at point $(x, y)$ and the corresponding feature vectors $H^{d}(x, y)$ and $H^{p}(x, y)$. Because the cocolority relation does not depend on the filter parameters, its feature vector is defined as:
$$H^{c}(x, y) = \left[\, c(x, y; u, v) \,\right]_{(u, v) \in W},$$
where $c(x, y; u, v)$ is the cocolority relation between the location $(x, y)$ and the neighborhood location $(u, v)$ inside the window $W$.
The final descriptor $D(x, y)$ for location $(x, y)$ is defined as the concatenation of the vectors of primitive relations. The order of the elements in the vector is selected using PCA: the relation vectors are sorted by their eigenvalues from maximum to minimum, forming the elements of the descriptor as follows:
$$D(x, y) = \left[\, H^{(1)}(x, y),\; H^{(2)}(x, y),\; H^{(3)}(x, y),\; H^{(4)}(x, y) \,\right],$$
where $H^{(1)}, \ldots, H^{(4)}$ denote the angle, normal distance, coplanarity, and cocolority feature vectors arranged in that order.
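The per-pixel window construction described above can be sketched densely with array shifts. The example below builds, for every pixel, the unit-normalized vector of angle relations to all positions in a (2w + 1) × (2w + 1) window from a single orientation map. The half-size parameter `w`, the edge padding at the image border, and the modulo-π angle difference are assumptions of this sketch rather than details confirmed by the text.

```python
import numpy as np

def angle_relation_vectors(theta, w):
    """Dense angle-relation vectors: output shape (rows, cols, (2w+1)^2)."""
    rows, cols = theta.shape
    padded = np.pad(theta, w, mode="edge")   # border handling is an assumption
    feats = []
    for dy in range(-w, w + 1):
        for dx in range(-w, w + 1):
            # Orientation of the neighbor at offset (dy, dx) for every pixel.
            neigh = padded[w + dy:w + dy + rows, w + dx:w + dx + cols]
            diff = np.abs(theta - neigh) % np.pi
            feats.append(np.minimum(diff, np.pi - diff))
    h = np.stack(feats, axis=-1)
    # Normalize each per-pixel vector to unit norm (guarding against zeros).
    norm = np.linalg.norm(h, axis=-1, keepdims=True)
    return h / np.maximum(norm, 1e-12)
```

Vectors obtained from several orientation maps (one per scale and Gaussian kernel) could then be joined with `np.concatenate(..., axis=-1)` to form the feature vector of one relation.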

Computational Complexity
The VPR descriptor is parameterized by the number of scales $S$, the number of Gaussian kernels $\Sigma$, and the half-size $w$ of the window $W$, which spans $(2w + 1) \times (2w + 1)$ pixels. Assuming that the image has $N$ pixels, filters of the same size are created in the frequency domain. These filters are applied to the image spectrum to produce the different response versions of the visual primitive components.
Therefore, at each location of an image, the relations between primitives are computed inside a block window $W$, where $(2w + 1)^2$ values are used for each primitive relation. Computing all the descriptors of an image therefore requires $4\Sigma$ convolutions, in addition to the relation computations inside each window.
The designed framework, which appears to deliver competitive quality, was implemented on an Intel Core i7-3770 CPU at 3.40 GHz with 8 GB of RAM. Time values were computed using this CPU. Table 3 presents the processing time values for the Daisy and VPR descriptors.
An advantage of the VPR descriptor is that it can be implemented using parallel programming. The main process of our work is based on the visual primitives and the monogenic signal, so these are computed on the GPU. We use Felsberg's monogenic filters described in eqs. 1 and 2, which allow the visual primitive components to be computed with the Fast Fourier Transform (FFT). The filters are constructed in the frequency domain and applied to the image spectrum in order to obtain the visual primitive components. Once the visual primitives are obtained, the relations can be computed in a parallel manner.
We launch one kernel per block window $W$; that is, the relations between primitives for each location $(x, y)$ are computed in one kernel. Each kernel performs $[(2w + 1)^2 - 1] \times \Sigma \times S$ operations. We applied this method to each primitive relation that has to be computed.
The time for the primitive calculation is reported for only one scale and one sigma value. The calculation time for a window size of 3x3 is 0.084 for each visual primitive relation. Finally, the total time to compute the designed descriptor is 0.76, as shown in Table 3, while the Daisy descriptor on GPU takes 9.96, as presented in [24]. We emphasize that the novel VPR descriptor, based on visual primitives and their relations, can capture the structure and semantic information of an image. The novel descriptor is robust against radiometric distortions such as illumination and exposure changes. Additionally, the VPR descriptor can be used together with state-of-the-art methods to improve quality. Even though the computational cost can be higher than that of the Daisy descriptor, the quality results show that it is worthwhile.

Feature Descriptor Reduction
The descriptor's size can be obtained from eq. 19 and is $[\Sigma \times (2w + 1)^2,\; 3S + 1]$, where $S$ is the number of scales, $\Sigma$ the number of Gaussian kernels, and $w$ the window half-size. The descriptor's size therefore grows rapidly with the parameters: using $S = 2$, $\Sigma = 4$, and $w = 4$, the descriptor size is [324,7] for each pixel in an image. If the image size is [490,720], we have 685,843,200 feature values, and using double format, we will need up to 4,677,684,480 bytes. In order to reduce this size, different reduction algorithms are applied, such as statistical algorithms, a direction approach, and PCA.
For the statistical algorithms, we apply to each relation descriptor vector an operation that can be the max, min, or mean, and the final descriptor is formed from the resulting values. The reduction in computational complexity for each algorithm is summarized in Table 4. This table shows that the statistical methods achieve the best reduction, while the PCA method can only reduce the size by half.
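The size formula lends itself to a quick arithmetic check. The helper below is a sketch that assumes the per-pixel size is Σ(2w + 1)² by 3S + 1, as stated above, and that every value is stored as an 8-byte double; the function and argument names are illustrative.

```python
def descriptor_shape(num_scales, num_kernels, half_win):
    """Per-pixel descriptor size: [Sigma * (2w + 1)^2, 3S + 1]."""
    return (num_kernels * (2 * half_win + 1) ** 2, 3 * num_scales + 1)

def descriptor_memory(height, width, num_scales, num_kernels, half_win,
                      bytes_per_value=8):
    """Total number of feature values and bytes for a dense descriptor."""
    rows, cols = descriptor_shape(num_scales, num_kernels, half_win)
    values = height * width * rows * cols
    return values, values * bytes_per_value
```

With S = 2, Σ = 4, and w = 4, `descriptor_shape(2, 4, 4)` gives (324, 7). As a check on the arithmetic, `descriptor_memory(420, 720, 2, 4, 4)` gives 685,843,200 values, which at 8 bytes per value corresponds to 5,486,745,600 bytes.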
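The two families of reduction mentioned above can be sketched as follows. The array layout (pixels × values per relation × relations) is an assumption made for this example, and the PCA step is a generic SVD-based projection rather than the exact procedure used in the experiments.

```python
import numpy as np

def reduce_statistic(desc, op="max"):
    """Collapse each relation vector to one statistic.

    desc has shape (n_pixels, n_values, n_relations); the result has
    shape (n_pixels, n_relations).
    """
    ops = {"max": np.max, "min": np.min, "mean": np.mean}
    return ops[op](desc, axis=1)

def reduce_pca(features, n_components):
    """Project (n_samples, n_dims) features onto the top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T
```

The statistical reduction keeps one value per relation column, which explains why it shrinks the descriptor far more than a projection that retains half of the dimensions.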

EXPERIMENTS AND RESULTS
In this section, we discuss the results of the experiments performed to assess the performance of the designed descriptor in the reconstruction of disparity maps. First, to understand the influence of the VPR parameters, we perform a parameter sweep experiment. Then, the designed descriptor is compared with other descriptors in depth map estimation.

Data
To evaluate the performance of the proposed method, we used the Middlebury dataset [23]. The 2014 dataset was employed for testing and for comparing disparity maps. These datasets contain up to nine different image pairs with their ground truth at full size (width: 1330-1390 pixels, height: 1110 pixels), half size, and one-third size. The 2014 dataset contains 33 image pairs divided into three sets: 10 for training, 10 for testing, and 13 additional images without a ground truth provided. Additionally, this dataset presents two views of each image pair, taken under several different illuminations (L) and different exposures (E).

Quality criteria
In the quality analysis, the quantitative metric Percentage of Bad Matching Pixels (B) is employed to assess the performance of the proposed framework. To compute this metric, we use the ground truth (GT) disparity maps (DM) obtained from the Middlebury Stereo Vision website for each stereo pair, together with the DM estimates obtained by the proposed descriptor.
The B values are calculated as follows:
$$B = \frac{1}{N} \sum_{(x, y)} \left( \left| d_{E}(x, y) - d_{GT}(x, y) \right| > \delta \right),$$
where $N$ is the total number of pixels in an image or frame, $d_{E}$ is the estimated disparity, and $d_{GT}$ is the ground truth. $\delta$ is the error threshold applied to each evaluated pixel; a commonly used value is 2.0.
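The metric can be computed in a few lines. The sketch below follows the definition above and reports the result as a percentage; skipping pixels whose ground truth is not finite is an assumption of this example, as are the function and argument names.

```python
import numpy as np

def bad_pixel_percentage(d_est, d_gt, delta=2.0, valid=None):
    """Percentage of pixels whose disparity error exceeds delta."""
    d_est = np.asarray(d_est, dtype=float)
    d_gt = np.asarray(d_gt, dtype=float)
    if valid is None:
        # Assumption: pixels without a finite ground truth are ignored.
        valid = np.isfinite(d_gt)
    bad = np.abs(d_est - d_gt) > delta
    return 100.0 * np.count_nonzero(bad & valid) / np.count_nonzero(valid)
```

For instance, an estimate that is off by 3 and 5 pixels at two of four valid locations, and within the threshold elsewhere, yields a B value of 50%.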

Comparison with other descriptors
To compare the novel feature descriptor with other descriptors, we used the Middlebury 2014 database at a quarter of the original image size. We employed the commonly used SID and Daisy descriptors for comparison, because Daisy is one of the most cited descriptors among state-of-the-art methods and SID is based on a monogenic signal. For VPR, we use the parameters $S = 2$, $\Sigma = 4$, and $w = 4$ (number of scales, number of Gaussian kernels, and window half-size); the parameters for SID and Daisy were chosen according to their respective works. Table 5 shows the results on the tested dataset for these three feature descriptors, using traditional block matching to compute the depth map. The first column presents results for the Daisy descriptor. This descriptor demonstrates sufficiently good performance, with B values of less than 20%; the worst performance is seen for the Jadeplant image, and the best performance appears for the Motorcycle image. The second column is the B value for SID; the descriptor appears to demonstrate better performance, but since the SID descriptor performs image reduction, the disparity map is also reduced and cannot be compared directly. Additionally, SID cannot be applied to large images. The third column presents results for the novel descriptor, which are very close to those of Daisy. The following columns show the results of testing the Daisy and VPR descriptors under image exposure (E) and lighting (L) changes. Overall, the novel descriptor shows better performance in almost all tested images, even with exposure and lighting changes.
Observing the results for the Piano image, one can see larger differences between the Daisy descriptor and the designed one, but these differences cannot be easily seen in the depth map. The Daisy descriptor exhibits lower performance when lighting differences are present, whereas the VPR descriptor appears to be robust against these changes.
Because we used traditional block matching for cost aggregation, occlusion problems cannot be resolved; in the areas where the depth could not be calculated correctly, most of the differences appear because of occlusion.
vision algorithm was performed in the preprocessing step; the novel descriptor demonstrates the ability to improve the quality of the results when combined with state-of-the-art methods. Computing our descriptor is more computationally complex on a single core, so it is important to seek a faster computation of the descriptor for all image pixels. This could have implications beyond stereo reconstruction, because dense computation of image descriptors is fast becoming an important technique in other tasks, such as object recognition, object detection, or facial aging analysis.
where $d_{s,\sigma}(x, y; u, v)$ is the normal distance relation between the location $(x, y)$ and the location $(u, v)$ described in eq. 7, and $p_{s,\sigma}(x, y; u, v)$ is the coplanarity relation shown in eq. 8. The feature vectors of the normal distance relation $H^{d}(x, y)$ and of the coplanarity relation $H^{p}(x, y)$ are concatenated as follows:

Table 2. Number of operations required for the proposed VPR and Daisy descriptors.
The novel descriptor has an advantage over the Daisy descriptor: it can be easily parallelized because each relation at point (x, y) is calculated separately. A summary is shown in Table 2, considering an image of size 250 × 250.