Using Wavelet Transformation and Edge Detection to Generate a Depth Map from a Single Image

Abstract A depth map is essential for 3D display. Traditional depth maps are derived from multiple images; producing a depth map from a single image is a more challenging and much more difficult problem. We propose a novel approach that combines wavelet transformation with edge detection to generate depth information from a single image; a depth map is then generated by depth prediction. Peak signal-to-noise ratio (PSNR), the structural similarity index measure (SSIM), and computation time are used to evaluate the performance of the proposed algorithm. The experimental results reveal that the PSNR and SSIM of the proposed algorithm are higher than those of the traditional approach, while its computation time is lower. Furthermore, 1000 test images and the corresponding depth maps were produced in this study. The averages of the three metrics were found to be superior to those of the traditional approach, and the statistical tests on PSNR and computation time showed significant differences between the two approaches. We have also written and filed four patent applications with respect to this work.


Introduction
The success of three-dimensional (3D) video games and the online 3D virtual world highlights the demand for viewing experiences beyond the passive watching of two-dimensional (2D) television programs. In a traditional imaging system, a 3D scene is projected onto a 2D image sensor and depth information is lost. Many 3D TVs, which give a 3D appearance using binocular disparity, have a 2D-3D conversion function. Various techniques have been introduced for 2D-3D conversion, and many of them are based on depth image-based rendering (DIBR) [1][2][3]. DIBR is a technique used to render at least two images (for the left and right eyes) from a 2D image and a depth map. In a typical procedure, a depth map is first generated, in one of a number of different ways, and the 3D images are then generated by DIBR using the depth map.
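As a rough illustration of the DIBR idea, the sketch below synthesizes a left/right view pair by shifting each pixel horizontally in proportion to its depth. The function name, the linear depth-to-disparity rule, and the `max_disparity` parameter are illustrative assumptions, not the implementation used in any of the cited systems; real DIBR pipelines also fill the disocclusion holes that the shifts leave behind.

```python
import numpy as np

def dibr_stereo(image: np.ndarray, depth: np.ndarray, max_disparity: int = 8):
    """Render a left/right view pair from a 2D image and an 8-bit depth map.

    Nearer pixels (brighter in the depth map) receive a larger horizontal
    shift. Disocclusion holes are left as zeros for simplicity.
    """
    h, w = depth.shape
    left = np.zeros_like(image)
    right = np.zeros_like(image)
    # Linear depth-to-disparity mapping (an assumption of this sketch).
    disparity = (depth.astype(np.float64) / 255.0 * max_disparity).astype(int)
    for y in range(h):
        for x in range(w):
            d = disparity[y, x]
            if x - d >= 0:
                left[y, x - d] = image[y, x]
            if x + d < w:
                right[y, x + d] = image[y, x]
    return left, right
```

With a constant-zero depth map the two rendered views simply reproduce the input, which is a handy sanity check.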
Valencia and Dagnino [4] proposed a method for deriving a depth map from a single image using wavelet analysis and edge estimation based on Lipschitz exponents. The relative depth can be estimated from the values of the wavelet coefficients in the high-frequency bands. Based on this, Valencia et al. divided images into 16 × 16-pixel macroblocks; a wavelet transform of each macroblock, generating 256 wavelet coefficients, was then performed, and the relative depth was estimated by counting the number of non-zero coefficients. In images with limited depth of field (DOF), the high-frequency components represent the focused objects, and their high-frequency wavelet sub-bands contain high energy. For images of limited DOF, the objects in the image may not all be in focus [5]. Usually, the objects in the background are blurred and their texture is smooth, while the main focused foreground objects have sharp edges and detailed texture. In other words, the high frequencies are retained in the foreground and greatly attenuated in the background. This suggests that the local spatial frequency is directly related to the degree of blurring, and thus to the relative distance from the camera. The high frequencies are described by the coefficients of the wavelet transform of the image: if the energy of the high-frequency wavelet bands is significant, the region has more detail and less blurring.
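This depth cue can be sketched in a few lines: apply a one-level 2D Haar transform and sum the energy of the diagonal high-frequency (HH) sub-band per macroblock; blocks with larger energy are sharper and presumed nearer to the camera. This is a simplified illustration only — the Haar filters and the energy measure are assumptions of this sketch, whereas the cited work counts non-zero coefficients of a 16 × 16 macroblock transform.

```python
import numpy as np

def haar_hh_energy(img: np.ndarray, block: int = 16) -> np.ndarray:
    """Per-macroblock energy of the HH sub-band of a one-level 2D Haar
    transform, usable as a relative-depth (sharpness) cue."""
    x = img.astype(np.float64)
    # High-pass along the rows, then along the columns -> HH sub-band.
    row_hi = (x[:, 0::2] - x[:, 1::2]) / 2.0
    hh = (row_hi[0::2, :] - row_hi[1::2, :]) / 2.0
    bh = bw = block // 2          # the HH sub-band is half-resolution
    h, w = hh.shape
    energy = np.zeros((h // bh, w // bw))
    for i in range(h // bh):
        for j in range(w // bw):
            e = hh[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            energy[i, j] = np.sum(e ** 2)
    return energy
```

A flat (defocused-looking) region yields zero HH energy, while fine texture such as a checkerboard yields large energy, matching the foreground/background intuition above.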
The authors have written and filed four patent applications: two invention patents and two utility models. The two invention patents are 'A method and apparatus for combining wavelet transformation with corner point detection to generate a depth map from a single image' (application No. 105121680) and 'A method and apparatus for combining wavelet transformation and edge detection to generate a depth map from a single image' (application No. 105121679). The two utility model patents are 'An apparatus for combining wavelet transformation with corner points to generate a depth map from a single image' (application No. 105210335) and 'An apparatus for combining wavelet transformation with edge detection to generate a depth map from a single image' (application No. 105210334).
This paper is organized as follows: after this introduction, the second section describes the problem, the third section presents the proposed algorithm, and the fourth section reports the experiments, including the statistical analysis results. The conclusion is drawn in the last section.

The Problem
To produce a depth map from a single image is a challenging problem, and much more difficult than doing the same from multiple images. The essence of the problem is to construct a depth map from data extracted from a given 2D image with limited DOF. A depth map is a grayscale image in which the distance information for each pixel is represented by shades of gray. For example, the objective of the proposed algorithm would be the conversion of the image shown in Figure 1(a) to that shown in Figure 1(b). In this image, the insect is in front and the other subjects are in the background. The depth information is shown in the gradient layer on the right side of Figure 1(b): the color of the object from white to black corresponds to the distance of the object from near to far. Without loss of generality, we use uppercase letters to denote images, and images with M × N pixels are considered in this paper.

In an early effort to recover depth using focus/defocus cues, a deconvolution method in the frequency domain was introduced by Pentland [5] in 1987 to determine the amount of image blur. A method for depth estimation based on the focus/defocus cue, in which the entropy of the high-frequency sub-bands of the wavelet decomposition is regarded as the measure of blur, is presented in [6]. Mishiba et al. [7] introduced a depth estimation method that uses wavelet-based matching cost optimization to estimate sharp object boundaries. There is also commercial hardware such as the Kinect, which consists of an infrared projector, a camera, and a special microchip that generates a grid from which the location of a nearby object in three dimensions can be ascertained [8].

There are also some documented patents. For example, patent CN102427539B [9] proposes a method in which the input image is first wavelet transformed and the high-frequency components are separated into three blocks: foreground, midground, and background. The depth map is obtained after defocusing the blocks. Patent CN103049906B [10] describes an image depth extraction method comprising the following steps: (1) the original image is processed by Gaussian blurring to give N (N ≥ 2) blurred images; (2) edges are detected in the original image and the N blurred images, and the blur parameters are estimated; (3) based on the edge image, the corresponding Gaussian filter parameters are calculated for each edge pixel; (4) the N blur parameter estimates at each edge pixel are combined using a statistical method; (5) based on the parameter estimates at the edge pixels, the depth of each edge pixel is calculated to form a sparse depth map; (6) an interpolation process is used to obtain a dense depth map. Relative to prior art methods, the method described in this patent obtains high accuracy from the blur parameters, so the depth values can be accurately calculated.

Edge detection information is also a clue for the depth map. To create 2D-plus-depth format content from images of limited depth of field, the focused objects in the foreground are sharp-edged while the defocused objects in the background are blurred [11]. That is, the focused area contains high-frequency components and the blurred area contains low-frequency components. The degree of blurring can be directly related to the spatial frequency described by the wavelet transformation: high-frequency bands with more energy indicate more detail and less blurring, where the objects are closer to the camera. Accordingly, the relative depth of each pixel can be estimated from the values of the wavelet coefficients, and the high-frequency bands can be derived using wavelet analysis. After two stages of smoothing and scale manipulation, the depth map data, with fewer errors, can be used for 3D display [12].
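The two clues above — high-frequency wavelet content and detected edges — can be combined in a short sketch. A logical AND of the two binary maps keeps only the sharp, in-focus contours, and a crude 4-neighbour averaging spreads the depth values known at those contours over the whole image. Both function names and the fill scheme are illustrative assumptions; in particular, the averaging fill is only a stand-in for the optimization-based interpolation used in the literature [12][16].

```python
import numpy as np

def defocus_mask(wavelet_binary: np.ndarray, edge_binary: np.ndarray) -> np.ndarray:
    """Keep only pixels that are both high-frequency (wavelet map) and on a
    detected edge: the sharp, in-focus contours of the foreground object."""
    return np.logical_and(wavelet_binary > 0, edge_binary > 0)

def spread_depth(sparse_depth: np.ndarray, mask: np.ndarray,
                 iters: int = 200) -> np.ndarray:
    """Spread the depth known at mask==True pixels over the whole image by
    repeated 4-neighbour averaging. Note np.roll wraps at the borders, which
    a careful implementation would avoid."""
    known = sparse_depth.astype(np.float64)
    d = np.where(mask, known, known[mask].mean())
    for _ in range(iters):
        avg = (np.roll(d, 1, 0) + np.roll(d, -1, 0) +
               np.roll(d, 1, 1) + np.roll(d, -1, 1)) / 4.0
        d = np.where(mask, known, avg)  # known pixels stay fixed
    return d
```

Because the fill is a repeated averaging with the known pixels clamped, every interpolated value stays between the minimum and maximum of the known depths.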
Three metrics were used to evaluate the performance of the proposed algorithm. The first was the peak signal-to-noise ratio (PSNR) [13]. The PSNR is the ratio between the maximum possible power of a signal and the power of the corrupting noise, and it is usually expressed on a logarithmic decibel scale as

\mathrm{PSNR} = 10 \log_{10} \frac{\mathrm{MAX}^2}{\mathrm{MSE}}, \qquad \mathrm{MSE} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left[ X(i,j) - \hat{X}(i,j) \right]^2,

where MAX is the maximum possible pixel value (255 for 8-bit images) and X and X̂ represent the original image and the depth map, respectively.
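A minimal implementation of the PSNR metric, assuming 8-bit images (peak value 255):

```python
import numpy as np

def psnr(x: np.ndarray, y: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

For example, two 8-bit images that differ by a constant offset of 10 gray levels have MSE = 100 and hence PSNR ≈ 28.13 dB.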
The second metric used was the structural similarity index measure (SSIM) [14] (see Figure 2); the system diagram of the SSIM is shown in Figure 3. Suppose X and Y are the original image and the corresponding depth map, respectively. We define the mean intensities

\mu_X = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} X(i,j), \qquad \mu_Y = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} Y(i,j).

For luminance comparison,

l(X, Y) = \frac{2 \mu_X \mu_Y + C_1}{\mu_X^2 + \mu_Y^2 + C_1},

where C_1 = (K_1 L)^2, L is the dynamic range of the pixel values (255 for 8-bit grayscale images), and K_1 ≪ 1 is a small constant that keeps the denominator away from 0 when \mu_X^2 + \mu_Y^2 is very close to 0. The contrast comparison function takes a similar form:

c(X, Y) = \frac{2 \sigma_X \sigma_Y + C_2}{\sigma_X^2 + \sigma_Y^2 + C_2},

where C_2 = (K_2 L)^2 and K_2 ≪ 1 is a small constant that keeps the denominator away from 0 when \sigma_X^2 + \sigma_Y^2 is very close to 0. The structure comparison function is

s(X, Y) = \frac{\sigma_{XY} + C_3}{\sigma_X \sigma_Y + C_3},

where C_3 = C_2 / 2. In general, the SSIM is defined as

\mathrm{SSIM}(X, Y) = [l(X, Y)]^{\alpha} [c(X, Y)]^{\beta} [s(X, Y)]^{\gamma},

where α, β, γ > 0. Without loss of generality, K_1 = 0.01, K_2 = 0.03, α = β = γ = 1, and C_3 = C_2 / 2 are used in this study. The third metric used was computation time. High values of PSNR and SSIM indicate better depth map quality; for the same test hardware and software, the computation should be as fast as possible to ensure efficiency.
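A single-window (global) version of the SSIM with the constants above can be sketched as follows. With C_3 = C_2/2 and α = β = γ = 1, the product l · c · s collapses algebraically to the familiar two-factor formula used in the code. Note that standard SSIM implementations compute this over local sliding windows and average the results; this global sketch omits that step.

```python
import numpy as np

def ssim(x: np.ndarray, y: np.ndarray, L: float = 255.0,
         K1: float = 0.01, K2: float = 0.03) -> float:
    """Global (single-window) SSIM with alpha = beta = gamma = 1, for which
    luminance * contrast * structure reduces to the two-factor formula."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
```

An image compared with itself yields SSIM = 1, and an inverted copy yields a value well below 1, as expected.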

The Proposed Approach
The proposed algorithm has the following steps (see Figure 3): (1) 2D wavelet transformation of the input image with a specific threshold to produce a binary image; (2) Canny edge detection with a specific threshold to produce a binary image; (3) a logical AND of the two binary images to produce a defocused image; (4) production of the depth map.

Figure 4 shows the wavelet transformation procedure. The converted grayscale image I first passes through high- and low-pass filters and is decimated along the rows; the resulting image then passes through high- and low-pass filters and is decimated along the columns. This produces the HH, HL, LH, and LL sub-images, each one quarter of the size of the original. The HH sub-image contains the high-frequency information of the original image and hence has high energy. A threshold is needed to produce a binary image that represents the contour of the main object. The smoothed coefficient plot is shown in Figure 5, and the threshold can be found as shown in Figure 6. After some experimentation, the threshold was chosen as a function of a, b, and c, the top three values of the smoothed plot. Figure 7 shows a typical wavelet-transformed image computed with a specific threshold.

The Canny edge detector [15] with a threshold median value of 0.33 was used; Figure 8 shows the result of the Canny edge detection.

The defocused image is constructed by

f(m, n) = g(m, n) \wedge c(m, n),

where f(m, n) is the defocused image, g(m, n) is the wavelet-transformed binary image, and c(m, n) is the Canny edge-detected image. The defocused image of the example is shown in Figure 9. The depth interpolation can then be produced from the defocused image by the approach of Levin et al. [16]. Further, to avoid the stair effect in the image, suppose I and Î are the original image and the defocused image, respectively. We obtain the coefficients by solving the following equation [16]:

(L + \lambda D)\, d = \lambda D\, \hat{d},

where λ is a balancing parameter and D is a diagonal matrix with entry D_{ii} = 1 if the i-th entry of the defocused image is not 0, and L is a square matrix with entries

L_{ij} = \sum_{k : (i,j) \in w_k} \left( \delta_{ij} - \frac{1}{|w_k|} \left( 1 + (I_i - \mu_k)^{\top} \left( \Sigma_k + \frac{\varepsilon}{|w_k|} U_3 \right)^{-1} (I_j - \mu_k) \right) \right),

where δ_{ij} is the Kronecker delta function, U_3 is a 3 × 3 identity matrix, μ_k and Σ_k are the mean and variance of the color matrix w_k, respectively, I_i and I_j are the i-th and j-th entries of the original image I, respectively, ε is a regularization parameter, and |w_k| is the size of the window w_k. The depth value d is then obtained by solving this linear system. The resulting depth map is shown in Figure 10.

Experimental Results and Statistical Discussion
Photo images were adopted from Flickr and from Chun-Po Chen on Facebook. A total of 1000 images were used for the test database. We use the computation time, PSNR, and SSIM to evaluate the performance of the proposed approach and that of Zhuo and Sim [17].

The Experimental Results
For illustration, we use test image 1 and test image 2; the corresponding depth maps are shown in Figures 11 and 12, respectively. It can be seen that the two depth maps are almost the same. The values of PSNR, SSIM, and computation time for test image 1 and test image 2 are shown in Tables 1 and 2, respectively. It can be seen that the values of the three metrics of the proposed algorithm are superior to those of the conventional method. Figure 13 shows the proposed PSNR (solid line) and the conventional PSNR (dotted line) [17]; the proposed PSNR outperforms the traditional approach. Figure 14 shows the proposed SSIM (solid line) and the conventional SSIM (dotted line) [17]; the SSIM of the proposed approach is almost the same as that of the traditional approach. Figure 15 illustrates the relative difference (proposed computation time minus conventional computation time) between the two approaches [17].
Statistical Analysis of the Experimental Results
For the 1000 test images, the average PSNR and average SSIM are shown in Table 3. It can be seen that the proposed approach outperforms the conventional approach. To further illustrate the comparison of the performance of the two approaches, let

p = {X_p1, X_p2, ⋯, X_p1000} and c = {X_c1, X_c2, ⋯, X_c1000}

be the results from the proposed and conventional approaches, respectively. It is known that p and c are independent and that the conditions of the central limit theorem apply. Assume μ_p and μ_c are the means of p and c, respectively, and σ_p^2 and σ_c^2 are the variances of p and c, respectively. We consider a testing hypothesis on μ_p − μ_c.

Disclosure Statement
No potential conflict of interest was reported by the authors.

ORCID
Der-Feng Huang http://orcid.org/0000-0003-3955-6949
The p-values for PSNR and computation time are shown in Table 4. No matter whether the variances σ_p^2 and σ_c^2 are assumed equal or unequal, the p-values are extremely small [18]. That is, the test statistics are significant and the null hypothesis H_0 must be rejected in favor of the alternative hypothesis H_1. We conclude that the proposed approach outperforms the traditional approach in PSNR and computation time.
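Since there are n = 1000 samples per group and the central limit theorem applies, the two-sided test on μ_p − μ_c can be approximated by a large-sample z-test without any t-tables. The sketch below (an illustrative stand-in for whatever statistical package was actually used, with a Welch-style standard error for the unequal-variance case) computes the statistic and p-value using only the standard library:

```python
import math

def two_sample_z_test(xs, ys):
    """Large-sample two-sided test of H0: mu_x == mu_y vs H1: mu_x != mu_y.

    Uses the normal approximation justified by the central limit theorem,
    with a Welch-style (unequal-variance) standard error.
    """
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((v - mx) ** 2 for v in xs) / (nx - 1)  # sample variances
    vy = sum((v - my) ** 2 for v in ys) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    # Two-sided p-value from the standard normal CDF via erf.
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return z, p
```

A clear mean shift between two groups of 1000 observations drives the p-value to essentially zero, while identical groups give z = 0 and p = 1.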

Conclusion
Traditional depth maps are derived from multiple images; producing a depth map from a single image is a more challenging and much more difficult problem. In this study, an approach was presented that combines wavelet transformation with edge detection to generate depth information from a single image. We used PSNR, SSIM, and computation time to evaluate the performance of the proposed algorithm. The experimental results reveal that the proposed algorithm outperforms a conventional approach on all three metrics. One thousand test images and their corresponding depth maps were produced in this study, and the averages of the three metrics were found to be superior to those of the traditional approach. The statistical tests were also significant for PSNR and computation time between the two approaches. We have also written and filed four patent applications with respect to this work.