Efficient Depth Map Estimation Method Based on Gradient Weight Cost Aggregation Strategy for Distributed Video Sensor Networks

Video sensor networking technologies have developed very rapidly in the last ten years. In this paper, a cross-based framework strategy for cost aggregation is presented for depth map estimation in video sensor networks. We formulate the process as a local regression problem consisting of two main steps with a pair of video sensors. The first step calculates estimates for a set of points within a shape-adaptive local support region. The second step aggregates the matching cost, applying a gradient-based weight at the outermost pixel of the support region. With these two steps, the proposed algorithm achieves strong results in an efficient manner. Compared with the adaptive support weight (ASW) algorithm, it achieves improvements of up to 6.9%, 8.4%, and 8.3%; compared with the cross-based algorithm, it achieves improvements of 2.0%, 1.3%, and 1.0% in terms of nonocclusion, all, and discontinuities, respectively.


Introduction
Wireless sensor networks (WSNs) have drawn the attention of the research community in the last few years, driven by a wealth of theoretical and practical applications [1,2]. Recently, owing to rapid improvements and miniaturization in hardware, a single embedded device can be equipped with audio and visual information collection modules [3]. The availability of low-cost hardware is enabling the development of wireless multimedia sensor networks (WMSNs), that is, networks of resource-constrained wireless devices that can retrieve multimedia content such as video and audio streams, still images, and scalar sensor data from the environment [4][5][6][7].
There are many algorithms for developing WMSN applications [8][9][10][11][12]. In [8], Chi et al. studied the problem of compressing video surveillance sequences collected by a wireless sensor network. In particular, they proposed a low-complexity coding framework based on change detection and JPEG-like compression of regions of interest, along with a suitable low-complexity change detection algorithm. Huang et al. proposed a robot wireless sensor network that can enhance multimedia surveillance and provide the foundation for strategies based on multimodal sensor integration [9].
In [10], DeBardelaben investigated techniques that can be applied at each layer of the network protocol stack to produce clandestine, power-efficient wireless microsensor network implementations. Also, a smart camera network has been demonstrated that provides extensive coverage of a large virtual public space, a train station populated by autonomously self-animating virtual pedestrians [11,12]. Figure 1 shows that several adjacent sensors may have overlapping sensing areas, owing to the fields of view of their cameras. Since the overlapped image region contains information about the alignment of the images, stereoscopic view generation (depth estimation) becomes possible with a pair of neighboring camera sensors. This kind of work can provide richer information for object recognition and tracking strategies. Many different approaches have been taken towards solving the stereo correspondence problem that underlies stereoscopic view generation, and great progress has been made in the field during the last decade. Dense stereo matching forms the basis for extracting a three-dimensional scene structure and involves identifying, for every pixel in one image, the corresponding pixel in another image [13].
Because individual pixel values are not unique and corresponding pixels may not have the same values in different views, searching for correspondence at the level of individual pixels produces questionable results. To overcome this problem, local stereo matching algorithms generally aggregate similarities in the area around each pixel, with the implied assumption that all pixels in the area have the same depth.
Several algorithms have been proposed in attempts to solve the problem of optimizing the size of the region. Kanade and Okutomi [14] presented a method that selects a window adaptively by evaluating the local variations in both the intensity and the initial disparities. They used a statistical model to represent the uncertainty in the disparity of points over a given window. Veksler [15] proposed an algorithm that chooses an appropriate window shape by optimizing over the class of compact windows. A compact window class was achieved via the minimum ratio cycle algorithm. However, the shapes selected by the methods of Kanade and Okutomi and of Veksler are both constrained to a rectangle.
To overcome these constraints, Okutomi et al. [16] determined regions by recovering precise object edges and obtaining smooth surfaces using multiple stereo pairs and multiple windowing. Yoon and Kweon [17] adjusted the support weight of pixels in a given support window based on color similarity and geometric proximity in order to reduce image ambiguity. However, their algorithm consumes a huge amount of memory and has high computational complexity because it stores center pixel-dependent support weights. To reduce the computational complexity, Richardt et al. [18] introduced a real-time stereo matching technique based on a reformulation of Yoon's adaptive support weight algorithm.
Zhang et al. [19] proposed a cross local support aggregation algorithm that uses color similarities and connectivity constraints to construct a shape-adaptive full support region on the fly, merging horizontal segments of the crosses in the vertical neighborhood. However, the problem of finding the optimal support region and aggregating the matching costs still remains.
We propose a cross-based gradient weight cost aggregation algorithm to find the optimal support region. After constructing a support region, we aggregate matching costs over the support region using gradient-based weights in the outermost pixel of the support region.
Section 2 describes our proposed algorithm in detail, with observations. We verify the performance of the proposed algorithm in Section 3, and conclusions are presented in Section 4.

The Proposed Algorithm
Our proposed algorithm is based on the following assumptions: (1) color differences and variation in depth values exist in texture areas; (2) the variation in depth value is small between adjacent pixels in nontexture areas; (3) the gradient magnitude has a constant bandwidth within object boundaries. In local stereo matching, it is important to find areas of the same depth. Because the minimum matching cost is obtained by aggregating matching costs at the same depth level, we use color similarity information to find such areas. Most regions at the same depth level, such as homogeneous regions, have similar colors. Textured regions are a partial exception, but there the matching costs calculated in the previous aggregation step are discriminable across the disparity range. Hence we assume that color differences and variation in depth values exist in texture areas, and that the variation in depth value is small between adjacent pixels in nontexture areas.
The final assumption is that the gradient magnitude has a constant bandwidth within object boundaries. Because each object has a different bandwidth of edge magnitude at its boundary, we aggregate matching costs using the edge magnitude as a weighting factor to distinguish different depth levels. Figure 2 shows the overall procedure of the proposed algorithm for estimating the depth map. We focus on the cost aggregation step, which operates on the matching costs produced by matching cost initialization.
The proposed cross-based gradient weight cost aggregation algorithm consists of a few stages. First, for each pixel $p$, a set of four varying support arm lengths is determined for $p$ on the image. Before the support region is constructed, this step searches for nontexture areas in order to limit the support region. Figure 3(a) shows the limited arm lengths $\{L_l, L_r, L_u, L_b\}$ for the left, right, up, and bottom directions, respectively. A nontexture area is identified using the variance of the intensities over a 3 × 3 window.
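The variance test for nontexture areas can be illustrated with a minimal NumPy sketch. The function names, the `var_threshold` parameter, and the edge-replicated border handling are assumptions made for illustration; how the resulting mask is converted into the four limited arm lengths is not specified in the text and is therefore not shown.

```python
import numpy as np

def local_variance_3x3(gray):
    """Variance of the intensities in the 3 x 3 window around each pixel.

    Borders are handled by replicating edge pixels (an assumption).
    """
    g = np.asarray(gray, dtype=np.float64)
    h, w = g.shape
    padded = np.pad(g, 1, mode="edge")
    # Stack the nine shifted copies of the image, one per window position.
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    return windows.var(axis=0)

def nontexture_mask(gray, var_threshold):
    """True where the local variance is below the (hypothetical) threshold."""
    return local_variance_3x3(gray) < var_threshold

# Usage: a flat image with one vertical step edge between columns 2 and 3.
img = np.zeros((5, 6))
img[:, 3:] = 100.0
mask = nontexture_mask(img, var_threshold=1.0)
```

In this toy image only the two columns adjacent to the step are marked as textured; all other pixels fall in the nontexture area.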
Second, an upright cross with four arms is constructed for each pixel $p$. Figure 3(b) illustrates the construction: each arm is extended until it reaches an endpoint pixel $p_l$ that violates one of the two following conditions (see (1) and (4)). The first condition can be defined as

$$D_c(p_l, p) \le \tau, \tag{1}$$

where $D_c(p_l, p)$ is the color distance between $p_l$ and $p$, and $\tau$ is a preset threshold value. The color distance is computed from $I_c$, the CIELab color value at the endpoint $p_l$, using an update whose rate is controlled by the parameter $\alpha = 1/(h + 1)$, where $h$ indicates the arm length in the constructed support region. The second condition can also be defined as

$$D_s(p_l, p) \le L, \quad L \in \{L_l, L_r, L_u, L_b\}, \tag{4}$$

where $D_s(p_l, p)$ is the spatial distance between $p_l$ and $p$ and $\{L_l, L_r, L_u, L_b\}$ is the set of the limited arm lengths.

The following step aggregates matching costs over the constructed support region. Figure 3(c) shows the aggregation process. We assume that the gradient magnitude has a constant bandwidth within object boundaries. Based on this assumption, our algorithm aggregates the matching cost, applying the gradient as a weight at the outermost pixel of the support region (yellow block in Figure 3(c)). The aggregation cost is expressed in terms of $U(p)$, the support region at pixel $p$, and $E_H(q)$, the aggregation cost accumulated in the horizontal direction at a pixel $q$ within $U(p)$. After the horizontal costs are summed, the matching cost is merged and aggregated by summing the aggregation costs vertically. $E_H(q)$ is in turn represented in terms of $e(s)$, the matching cost at pixel $s$. If a pixel is in the outermost part of the support region, its aggregation cost is multiplied by a weight $w(s)$. The weight $w(s)$ is determined by $G(s)$, the edge magnitude at pixel $s$, and a preset threshold value $T$. The edge magnitude is usually defined as follows:

$$G(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2},$$

where $G_x(x, y)$ and $G_y(x, y)$ indicate the responses in both directions (horizontal and vertical) and $I$ is the intensity of the image plane on which they are computed. The coordinates $x$ and $y$ also serve as the mapped indices used when searching for color similarity in the horizontal and vertical directions.

The proposed gradient-based weight has a value in the range [0, 1]. For example, if the edge magnitude is strong, the gradient weight is near 0. If the edge magnitude is weak, the gradient weight is close to 1.
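As a rough illustration of a weight with these properties, the sketch below computes an edge magnitude and maps it into [0, 1]. Everything beyond those two properties is an assumption: the gradient operator (central differences via `np.gradient`), the function names, and in particular the clamped linear form of the weight, which is only one of several mappings that decrease from 1 to 0 as the edge becomes stronger.

```python
import numpy as np

def edge_magnitude(gray):
    """Gradient magnitude from horizontal and vertical responses.

    Central differences are an assumed stand-in for the edge operator.
    """
    g = np.asarray(gray, dtype=np.float64)
    gy, gx = np.gradient(g)  # derivative along rows, then along columns
    return np.hypot(gx, gy)

def gradient_weight(mag, T):
    """ASSUMED weight form: 1 for no edge, falling linearly to 0 at mag >= T."""
    return 1.0 - np.clip(mag / T, 0.0, 1.0)

# Usage: a step edge of height 90 between columns 2 and 3.
img = np.zeros((4, 6))
img[:, 3:] = 90.0
G = edge_magnitude(img)
w = gradient_weight(G, T=30.0)
```

Pixels next to the step receive a weight of 0, while pixels in the flat areas keep a weight of 1.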
Based on the response of the edge in the object, we designed an adaptive weighting scheme for cost matching. The proposed weighting scheme is based on the uncertainty of the edge feature. As the edge at an outermost pixel becomes stronger, the weight approaches 0 and the contribution of that pixel in the aggregation stage decreases. Otherwise, the weight stays close to 1 and the pixel contributes almost fully.
With the proposed gradient-based weight, the proposed algorithm produces a more accurate depth map in the aggregation step for application of wireless video sensor networks.
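The cross construction and the horizontal-then-vertical aggregation can be sketched for a single disparity slice as follows. This is a simplified illustration under several assumptions, not the authors' implementation: the color distance is taken as the largest absolute channel difference to the anchor pixel (without the update rule), a single limit `max_arm` replaces the four per-direction limits, and the weight is applied to the two end pixels of each horizontal segment. All function and parameter names are hypothetical.

```python
import numpy as np

def arm_length(img, y, x, dy, dx, tau, max_arm):
    """Number of pixels the arm from (y, x) extends in direction (dy, dx)."""
    h, w = img.shape[:2]
    n = 0
    while n < max_arm:  # spatial condition (assumed single limit)
        yy, xx = y + (n + 1) * dy, x + (n + 1) * dx
        if not (0 <= yy < h and 0 <= xx < w):
            break
        # Color condition (assumed distance: max channel difference to anchor).
        if np.max(np.abs(img[yy, xx] - img[y, x])) > tau:
            break
        n += 1
    return n

def aggregate(img, cost, weight, tau, max_arm):
    """Aggregate one cost slice over cross-based support regions."""
    img = np.asarray(img, dtype=np.float64)
    h, w = cost.shape
    horizontal = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            left = arm_length(img, y, x, 0, -1, tau, max_arm)
            right = arm_length(img, y, x, 0, 1, tau, max_arm)
            seg = cost[y, x - left:x + right + 1].astype(np.float64)
            # Weight only the outermost pixels of the horizontal segment.
            if left > 0:
                seg[0] *= weight[y, x - left]
            if right > 0:
                seg[-1] *= weight[y, x + right]
            horizontal[y, x] = seg.sum()
    out = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            up = arm_length(img, y, x, -1, 0, tau, max_arm)
            down = arm_length(img, y, x, 1, 0, tau, max_arm)
            # Merge the horizontal sums along the vertical arm.
            out[y, x] = horizontal[y - up:y + down + 1, x].sum()
    return out

# Usage: two flat regions, unit costs, and a uniform weight of 0.5.
img = np.zeros((3, 5))
img[:, 3:] = 100.0
agg = aggregate(img, np.ones((3, 5)), np.full((3, 5), 0.5), tau=10.0, max_arm=2)
```

The per-pixel loops keep the sketch readable; an efficient implementation would precompute the arms once and reuse them for every disparity level.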
The overall procedure of the proposed algorithm is summarized as follows (see Figure 2).
(i) First, before constructing support regions, we determine the size of the support region in order to find more homogeneous regions.
(ii) Support regions are constructed based on the assumption that pixels at the same depth level have similar colors. To improve performance, the matching costs must be collected within the same depth level.
(iii) Finally, we aggregate matching costs over the constructed support region based on the response of the edge in the object. The goal of the aggregation step is to collect the matching costs of depth levels. To find the correct depth level, we construct the support region within the same depth area using color similarity and aggregate matching costs, weighted by the edge response, over the support region to reduce disturbance at the object boundary.

Results and Discussion
To verify the performance of the proposed algorithm, we tested our system using the Middlebury benchmark [20]. Middlebury defines three measures for evaluating performance: nonocclusion, all, and discontinuities. The quantitative measure is the percentage of bad pixels in subsets of the estimated disparity map, which is defined as

$$B = \frac{1}{N} \sum_{(x, y)} \left( \left| d_C(x, y) - d_T(x, y) \right| > \delta_d \right),$$

where $d_C$ is the computed disparity, $d_T$ is the ground-truth disparity, $N$ is the number of pixels in the evaluated subset, and $\delta_d$ is the disparity error tolerance. The parameters given in Table 1 were kept constant for all the datasets. They comprise three thresholds: one for detecting nontexture areas, one for comparing the color similarity between pixels, and one for obtaining the weighting factor. The last threshold was set from the average edge magnitude over all datasets. These parameters were selected through experiments. Before applying our proposed method, we used a Census transform cost measure to initialize the cost volume [21]. The Census transform performs well on images with exposure and lighting changes.
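To show why a Census-based cost tolerates exposure changes, the following sketch computes a 3 × 3 Census signature and a Hamming-distance matching cost. The window size, border handling, disparity convention (left pixel at column x matched to right pixel at column x - d), and function names are assumptions for illustration; the text does not state the settings actually used.

```python
import numpy as np

def census_3x3(gray):
    """8-bit Census signature: one bit per neighbor, set if neighbor < center."""
    g = np.asarray(gray, dtype=np.float64)
    h, w = g.shape
    padded = np.pad(g, 1, mode="edge")  # assumed border handling
    sig = np.zeros((h, w), dtype=np.uint8)
    for dy in range(3):
        for dx in range(3):
            if dy == 1 and dx == 1:
                continue  # skip the center pixel itself
            neighbor = padded[dy:dy + h, dx:dx + w]
            sig = (sig << 1) | (neighbor < g).astype(np.uint8)
    return sig

def hamming_cost(sig_left, sig_right, d):
    """Hamming distance between left (y, x) and right (y, x - d) signatures."""
    h, w = sig_left.shape
    cost = np.full((h, w), 8, dtype=np.uint8)  # maximum cost where no match exists
    if d < w:
        diff = sig_left[:, d:] ^ sig_right[:, :w - d]
        # Count the differing bits of each 8-bit signature.
        cost[:, d:] = np.unpackbits(diff[..., None], axis=-1).sum(axis=-1)
    return cost

# Usage: the signature is unchanged by a monotonic change in exposure.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(6, 8)).astype(np.float64)
same = np.array_equal(census_3x3(img), census_3x3(2.0 * img + 10.0))
```

Because the signature depends only on the ordering of intensities within the window, any gain or offset that preserves that ordering leaves the cost unchanged.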
After performing the aggregation step, we used the winner-take-all (WTA) [13] local optimization technique to calculate the disparity map. Figure 4 shows the quantitative performance. We evaluated the results against the ASW algorithm [18] and the cross-based algorithm [19]. We achieved improvements of up to 6.9%, 8.4%, and 8.3% compared with the ASW algorithm and improvements of 2.0%, 1.3%, and 1.0% compared with the cross-based algorithm in terms of nonocclusion, all, and discontinuities, respectively. The proposed method produced an accurate disparity map, as shown in Figure 5. Figure 5(a) shows the ground truth for the tested Venus image. In particular, depth discontinuities were well preserved relative to the ground truth, whereas the other methods failed to preserve them. With the proposed algorithm, defects inside near objects are also avoided efficiently. Table 2 shows the Middlebury ranking of the proposed algorithm. Although the Middlebury ranking is low in terms of the overall procedure, we have improved the performance of the aggregation step, which is based on the edge response, compared with other aggregation algorithms. Our key idea is to design the weight from the edge magnitude at the object boundary and to construct the support region using color similarity in homogeneous regions, thereby improving accuracy in the aggregation step. In terms of the overall stereo matching procedure, the improvement may be smaller than when the aggregation step is considered alone.
From the above results, the average rank is improved by using the proposed algorithm. This result indicates that the proposed algorithm makes a meaningful contribution to the aggregation step.
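For reference, the WTA selection and the bad-pixel measure used in this evaluation can be sketched as below. The cost volume layout `[d, y, x]`, the function names, the optional `mask` for evaluating a subset of pixels, and the default tolerance of one disparity level are assumptions for illustration; the bad-pixel percentage is taken here as the share of pixels whose absolute disparity error exceeds the tolerance.

```python
import numpy as np

def wta_disparity(cost_volume):
    """Winner-take-all: pick the disparity with the lowest cost at each pixel."""
    return np.argmin(cost_volume, axis=0)

def bad_pixel_percentage(disp, truth, mask=None, delta=1.0):
    """Percentage of pixels whose absolute disparity error exceeds delta."""
    err = np.abs(np.asarray(disp, dtype=np.float64)
                 - np.asarray(truth, dtype=np.float64)) > delta
    if mask is not None:
        err = err[mask]  # restrict to a subset, e.g. nonoccluded pixels
    return 100.0 * err.mean()

# Usage: a toy cost volume with 3 disparity levels on a 2 x 2 image.
cost_volume = np.array([[[5, 1], [2, 9]],
                        [[1, 5], [3, 0]],
                        [[7, 7], [1, 4]]])
disp = wta_disparity(cost_volume)
truth = np.array([[1, 0], [0, 1]])
score = bad_pixel_percentage(disp, truth)
```

In this toy case one of the four pixels is off by two disparity levels, so a quarter of the pixels count as bad.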

Conclusions
We have proposed a cross-based gradient weight cost aggregation algorithm for efficient depth map estimation in WMSNs. Using a pair of neighboring camera sensors, we constructed a shape-adaptive support region and aggregated matching costs efficiently using a gradient-based weight within the support region. With the proposed feature and algorithm, we obtained promising results compared with the other aggregation algorithms.