Improved Upsampling Based Depth Image Super-Resolution Reconstruction

Constrained by current sensing technology, depth cameras acquire only low-resolution depth images that do not meet practical requirements. To solve this problem, this paper takes a divide-and-conquer strategy to synthesize a high-resolution depth image from a low-resolution range image under the guidance of a registered high-resolution color image. Initially, the depth image is divided into planar areas and edge regions. For different zones, we exploit different methods to interpolate the missing depths. In planar areas, the linear interpolation method is employed to perform upsampling. In edge regions, a segmentation-separation upsampling method is used to interpolate the missing values. Then the upsampling results are refined by the Depth CNN built in this paper. We conduct extensive experiments on the benchmark database and real-world data with various upsampling rates to illustrate the upsampling ability of our method. The comparison with classical super-resolution algorithms demonstrates that our upsampling algorithm achieves the best quality with fewer artifacts, and our Depth CNN outperforms most state-of-the-art methods in terms of qualitative and quantitative evaluations.


I. INTRODUCTION
In recent years, depth images have been applied more and more widely in computer vision, telemedicine, autonomous driving and security monitoring to improve the performance of products. However, the resolution of depth images collected by sensing devices is relatively low due to equipment limitations. For example, the resolution of the depth image collected by the Mesa Swiss Ranger 4000 or the Microsoft Kinect V2 is only 176×144 or 512×424, respectively, which hardly meets practical requirements [1]. Therefore, reconstructing high-resolution (HR) depth images from low-resolution (LR) depth images has become a research hotspot.
The upsampling operation on the feature image is very important for image restoration in reconstruction. Different upsampling operations may directly affect the quality of the reconstructed image. Existing depth image upsampling methods can be roughly categorized into filtering-based methods, optimization-based methods and deep-learning-based methods. Filtering-based methods employ the color information of a color photo or the texture information of a texture image with various edge-preserving filters. Chang et al. [2] use potency guided upsampling and adaptive gradient fusion filters to enhance erroneous depth images and refine the upsampled depth results. Qiao et al. [3] construct a feature-based bilateral filter (FBF) for the interpolation, using extracted RGB shallow and multi-layer features to improve the upsampled depth image quality. Liu et al. [4] put forward a precise three-dimensional (3D) reconstruction method for uncooperative spacecraft based on a low-resolution time-of-flight (ToF) depth camera coupled with a high-resolution optical texture camera. Yang et al. [5] build a novel joint trilateral filter with two different modes: one for the pixels on the edges and the other for the pixels in the smooth regions. Jiang et al. [6] propose a deep edge map guided depth SR method which includes an edge prediction subnetwork and an SR subnetwork. Yang et al. [7] use depth-texture similarity to construct a pixel-level confidence calculation method for 3D view synthesis, and construct a confidence-based joint guided filter, which not only considers the smoothness between depth pixels but also incorporates depth-texture similarity, improving the performance of sampling on depth images and the quality of synthesized views. Lei et al. [8] propose a view synthesis quality based trilateral depth-map upsampling method, which considers depth smoothness, texture similarity and view synthesis quality in the upsampling filter. Filtering-based methods generally use the local or non-local neighborhood relationship of the depth image to estimate the high-resolution depth value. One advantage of filtering-based methods is that they can be easily parallelized on graphics hardware. However, to find enough support for each pixel, large filtering kernels are often used, or the filters have to be applied iteratively, which may lead to over-smoothed depth results.
The associate editor coordinating the review of this manuscript and approving it for publication was Yong Yang.
The second category is the optimization-based methods. Sharma [9] presents a depth image enhancement algorithm based on Riemannian geometry that performs depth image de-noising and completion simultaneously. Chen et al. [10] propose a new optimization model depending on the relative structures of both depth and color images for both depth image filtering and upsampling tasks. Yan et al. [11] use the non-local means algorithm to obtain an initial upsampled depth image, and then optimize it using an edge detection algorithm to improve the reconstruction quality of the depth image. Maxim et al. [12] present a new fuzzy method for creating a depth image based on a combination of the Canny detector with a three-level fuzzy system. Jung et al. [13] propose a post-processing algorithm to refine the depth image using super-pixel segmentation and considering the relation between multiple views. Wang et al. [14] propose an RGB-guided depth image recovery method to recover true boundaries in seriously distorted depth images. The optimization-based methods generally use the depth image degradation model and various prior knowledge to transform the reconstruction problem into a cost function optimization problem. One advantage of optimization-based methods is their applicability to multiple degradation models: only the data term of the cost function needs to change. However, these methods may lead to high computational complexity, and the selection of prior knowledge has a great impact on performance.
In recent years, convolutional neural networks have made remarkable achievements in the field of image super-resolution reconstruction. Dong et al. [15] proposed the end-to-end SRCNN (super-resolution convolutional neural network), which can learn the mapping from low resolution to high resolution and solve the problem of image super-resolution reconstruction based on deep learning. SRCNN is considered the first work of the third category, which is based on deep learning. Since then, a large number of researchers have tried to realize depth image super-resolution reconstruction using convolutional neural networks. Lim et al. [16] and Shi et al. [17] improve SRCNN and increase the depth of the network. Zhang et al. [18] and Zhou et al. [19] improve the quality of super-resolution reconstruction and limit the model parameters based on the recursive residual network, which effectively reduces the computational complexity. Cao et al. [20] propose a novel dual auto-encoder attention network (DAEANet) which includes two auto-encoder networks, where the guidance auto-encoder network (GAENet) and the target auto-encoder network (TAENet) aim to extract feature information from the intensity image and the depth image. Kim et al. [21] propose a novel depth image super-resolution method using guided deformable convolution, which obtains 2D kernel offsets of the depth features from the guidance features to significantly alleviate the texture copying artifacts in the resultant depth image. Guo et al. [22] propose a two-branch network to achieve depth image super-resolution with a high-resolution guidance image, which can be viewed as a prior that guides the low-resolution depth image to restore the missing high-frequency details of structures. Deep-learning-based methods use manifold learning, sparse coding, deep learning and other strategies to learn the relationship between high-resolution and low-resolution depth images by training on a large amount of data. The advantage of deep-learning-based methods is their good reconstruction performance. However, the training process of these methods is time-consuming, and the selection of training sets has a great impact on the performance.
In general, the filtering-based methods are applicable to scenarios that need strong real-time performance, the optimization-based methods are applicable to scenarios with multiple coexisting degradations or serious degradation in the depth image, and the deep-learning-based methods are applicable to scenarios with a large number of training samples. In this paper, we use an improved upsampling algorithm that comprehensively utilizes both optimization-based and filtering-based methods to obtain a higher resolution image, and then use a deep neural network to refine it.
The first part of the work focuses on the improved edge-guided depth image upsampling. The new upsampling algorithm should not only take advantage of the characteristics of depth images, but also preserve sharp edges while upscaling range images with a large upsampling rate. First, unlike general natural images, depth images are cartoon-like images. Generally speaking, cartoon images are composed of planar patches with sharp transitions at the boundaries between patches. For a high-quality upsampling result, it is necessary to embed this characteristic into the upsampling model. Second, the resolutions of depth sensors are extremely low, and we often need a depth image of the same size as the optical photo, whose resolution is at least 1000 × 1000. This implies that the upsampling rate is usually very large. Third, due to the large upsampling rate, the geometric information in the LR depth image is not enough to produce fine details. To supply extra edge information and suppress the blurring effect, it is commonly argued that the depth and color boundaries of a scene are closely correlated: an abrupt depth transition often leads to an abrupt color transition. Thus one way to enhance the resolution of a depth image is to use a high-resolution optical camera in tandem with the depth sensor. In this way, we can obtain a one-to-one relationship between the pixels of the HR depth image and the color image. Further, the acquired LR depth image is mapped into the HR depth image and forms a set of sparse seeds in the HR depth image. We interpolate the missing depth values under the guidance of the registered color photo.
Previous depth upsampling methods tend to produce artifacts in planar areas and are likely to smooth the sharp edges in edge regions. These methods directly use the registered color photo to indicate the geometric structure of the HR depth image. However, color edges do not completely coincide with depth edges. A planar area of the HR depth image usually corresponds to many color edges. This mismatch will inevitably introduce artifacts into planar areas. Previous methods also do not explicitly model the cartoon-like depth image. Instead, they employ a weight coefficient to indicate the similarity between coupled pixels. A large weight signifies that the coupled pixels are likely in the same region with the same depth; otherwise, they belong to different areas and should be assigned different depths. To preserve sharp depth edges, we want the weight coefficients to tend to zero when the coupled pixels belong to different regions. However, according to the definition of the weight coefficient in previous methods, the weight is not zero even if the coupled pixels actually lie across a depth boundary. Therefore these methods will inevitably leak depth information between different regions and produce intermediate values that smooth the sharp depth edges.
VOLUME 11, 2023
We model the characteristics of the depth image and take a divide-and-conquer strategy to upscale the LR depth image. Initially, the LR depth image is segmented into planar areas and edge regions, and these zones are mapped into the HR depth image. This process not only forms the planar areas and edge regions of the depth image, but also determines the reference areas on the registered color photo for the different regions of the depth image. We exploit different methods to interpolate the missing depths of different regions. In planar areas, we employ the simplest linear interpolation method to perform upsampling, because it achieves competitive performance at the lowest cost. In edge regions, we calculate a set of pseudo depth images and exploit them to compute the correlation coefficients of each interpolated pixel with respect to the given seeds. The depths of the most correlative seeds are chosen as interpolation values. This segmentation-like interpolation method unavoidably introduces zigzag artifacts into edge regions, so we develop a separation algorithm to separate them. The segmentation-separation edge region upsampling method does not produce any intermediate values, thus it can preserve the sharp edges.
The second part of the work focuses on the neural network for high-resolution reconstruction of depth images. In this paper, we build a new CNN for depth image super-resolution with a pre-upsampling structure and dense connections based on DenseNet and ResNet. The main contributions of our work are:
• We compare the upsampling performance of classical algorithms for different areas and find a way to improve the upsampling quality.
• Our upsampling algorithm takes a divide-and-conquer strategy. For planar areas, we exploit the linear interpolation method. For edge regions, we take advantage of the segmentation-separation interpolation method.
• The proposed Depth CNN refines the rough image obtained from the upsampling result, which significantly reduces the learning difficulty and can handle interpolation with arbitrary size and scaling factor.
• In both visual quality and quantitative evaluation, the results of our method outperform those of the compared methods.

II. UPSAMPLING ALGORITHM
We comprehensively utilize both optimization-based and filtering-based methods to perform upsampling, and show the pipeline of our upsampling algorithm in Fig. 1. Initially, the geometric structure is separated from the contaminated LR depth image. Then, we detect the edge regions and the planar areas of the LR depth image, and apply customized upsampling methods to upscale each region. Finally, the synthesized result is processed again by our separation algorithm to remove unwanted artifacts.
The data acquired by current depth capturing devices is somewhat like a degraded version of the underlying ground truth; we thus separate the geometric structure and noise from the LR range map under the guidance of the downsampled photo image, as shown in the red box. This step remarkably stabilizes the edge detection in the following process, illustrated in the green box. We use different upsampling methods to interpolate different areas. For planar areas, the simplest linear interpolation produces comparable results. For edge regions, we use a segmentation-separation method to interpolate the missing values. Finally, the synthesized result is processed again by our separation algorithm to remove the unwanted artifacts.

A. STRUCTURE AND NOISE SEPARATION
Both the acquired LR depth image $D^{\downarrow}$ and the upsampled HR depth image $D$ are likely contaminated by various noise and artifacts. We formulate the corrupted depth image as $z = x + y$, where $x$ is the geometric structure and $y$ is the contaminating noise or artifacts. Both components are of arbitrary magnitude. We use the well-known non-local regularization [23], [24] to separate the geometric structure $x$ and the noise component $y$ from the contaminated $z$ by solving the optimization problem (1), where $D_k$ represents the weighted differential operator in direction $k$.
With $\mathrm{shrink}(y, a) = \mathrm{sgn}(y)\max\{|y| - a, 0\}$, the closed-form solutions of subproblems (2)-(4) are given by formulas (5)-(7). The separation results $x$ and $y$ are obtained by iteratively computing $x_k$ and $y_k$ according to formulas (5)-(7). Finally, we note that the separation algorithm is not only a preprocessing step to suppress the noise in the acquired LR depth image, but also a post-processing step to remove artifacts produced by our upsampling method.
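The shrink operator defined above is the usual element-wise soft-thresholding function. The following minimal NumPy sketch illustrates only this operator, not the full separation solver; the sample values are hypothetical.

```python
import numpy as np

def shrink(y, a):
    """Soft-thresholding: sgn(y) * max(|y| - a, 0), applied element-wise."""
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.maximum(np.abs(y) - a, 0.0)

# Hypothetical sample values: entries with |y| <= a are set to zero,
# all other entries move toward zero by a.
values = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
shrunk = shrink(values, 1.0)
```

Small-magnitude entries are suppressed entirely while large ones are only shifted, which is why this operator is commonly used to isolate sparse components.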

B. PLANAR AREA UPSAMPLING
In this section, we show which method is most suitable for planar area upsampling by evaluating the bad pixel distribution of previous algorithms. The 8X statistics listed in Table 1 are collected from the results of highly cited classical methods [27], [28], [29], [30], [31]. Table 1 not only shows that, for planar area upsampling, linear interpolation is the most appropriate algorithm with the highest cost-performance ratio, but also indicates that our divide-and-conquer upsampling strategy is reasonable.
To determine the planar areas and edge regions of the HR depth image $D$, we detect planar areas and edge regions in the denoised LR depth image $D^{\downarrow}$. A pixel $i \in D^{\downarrow}$ is classified as 'planar' if its depth value $D^{\downarrow}(i)$ satisfies $|D^{\downarrow}(i) - D^{\downarrow}(j)| < \lambda$, $\forall j \in N(i)$; otherwise it falls into the edge region, where $N(i)$ is the second-order neighborhood of pixel $i$ and $\lambda$ is a depth threshold. For upsampling at rate $U$, a pixel $i \in D^{\downarrow}$ corresponds to a $U \times U$ patch in $D$. Thus, we can map the planar areas and edge regions of $D^{\downarrow}$ into $D$ and determine the corresponding planar areas and edge regions of $D$.
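The classification rule above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not our released implementation: the second-order neighborhood is taken to be the 8-connected neighborhood, image borders are handled by edge replication, and the function names `classify_planar` and `map_to_hr` are hypothetical.

```python
import numpy as np

def classify_planar(depth_lr, lam):
    """True where |D(i) - D(j)| < lam holds for all 8 neighbors j of pixel i."""
    d = np.asarray(depth_lr, dtype=float)
    h, w = d.shape
    padded = np.pad(d, 1, mode="edge")  # assumption: replicate border pixels
    planar = np.ones((h, w), dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            planar &= np.abs(d - neighbor) < lam
    return planar

def map_to_hr(mask_lr, rate):
    """Expand every LR pixel into a rate x rate patch of the HR grid."""
    patch = np.ones((rate, rate), dtype=np.uint8)
    return np.kron(mask_lr.astype(np.uint8), patch).astype(bool)

# Toy LR depth map with a vertical depth discontinuity.
depth_lr = np.array([[10, 10, 10, 50],
                     [10, 10, 10, 50],
                     [10, 10, 10, 50]], dtype=float)
planar_lr = classify_planar(depth_lr, lam=5.0)
planar_hr = map_to_hr(planar_lr, rate=2)
```

In the toy example, the two columns adjacent to the discontinuity are marked as edge region and the remaining columns as planar; the HR mask inherits this labeling patch by patch.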
We calculate the percentages of planar areas and edge regions (PP and PE) to reveal which zone dominates the depth image. The percentages of bad matching pixels at planar areas and edge regions (PBP and PBE) are given to compare the upsampling quality of different methods, where the error threshold is set to 1. In addition to PBP and PBE, we also count the numbers of bad matching pixels (BMP and BME) for planar areas and edge regions. To choose the best planar area upsampling method, the relative performance (RPP) of each method is computed using formula (8), where BPP_ref denotes the BPP performance of the comparison methods [27], [28], [29], [30], [31] and BPP_baseline is the BPP performance of the linear interpolation method.
The performance discrepancy of different methods at planar areas is not distinctive. From the data reported in the RPP row of Table 1, we can observe that the well-designed methods MRF [27] and AR [31] do not significantly surpass the linear interpolation method, and the performance of BF [28] and NLM [30] is even worse. The upsampling methods [27], [28], [29], [30], [31] assume that similar colors have similar depths, and utilize the color information of the color photo to indicate the geometric structure of the planar areas of the depth image. However, a planar area usually corresponds to multiple color edges, and this mismatch introduces artifacts that can degrade the upsampled planar regions. Depth surfaces of planar areas are usually planes, thus the linear model used by the linear interpolation method is an appropriate upsampling model for planar areas. We conclude that the simplest linear interpolation method is the most appropriate algorithm for planar area upsampling. Table 1 also reveals that over half of the bad pixels of each method lie in the edge regions of the depth image (refer to the PBE row), while most of the pixels in the disparity map belong to the planar regions (refer to the PP row). Moreover, there is a connection between BMP and BME: a lower number of bad pixels in edge regions suggests a lower number of bad pixels in planar areas (refer to the BMP and BME rows). We conclude that the restoration quality of an upsampling method depends on its restoration ability in the boundary areas, thus edge regions deserve more attention and should be carefully handled to get a satisfactory upsampling result.
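The region statistics discussed above can be illustrated with the following NumPy sketch. It is a hedged example rather than the evaluation script behind Table 1: it assumes that a pixel is "bad" when its absolute error strictly exceeds the threshold, it reports only PP, PE, BMP and BME (the exact normalization of the percentage rows is not restated here), and the function name `region_statistics` and the toy data are hypothetical.

```python
import numpy as np

def region_statistics(result, ground_truth, planar_mask, threshold=1.0):
    """Area percentages and bad-pixel counts for planar and edge regions."""
    err = np.abs(np.asarray(result, float) - np.asarray(ground_truth, float))
    bad = err > threshold                  # assumption: strict inequality
    planar = np.asarray(planar_mask, bool)
    edge = ~planar
    return {
        "PP": 100.0 * planar.mean(),       # percentage of planar pixels
        "PE": 100.0 * edge.mean(),         # percentage of edge pixels
        "BMP": int(np.count_nonzero(bad & planar)),
        "BME": int(np.count_nonzero(bad & edge)),
    }

# Toy data: a 4 x 4 image whose last column is the edge region.
gt = np.zeros((4, 4))
res = gt.copy()
res[0, 0] = 0.5   # error below the threshold, not counted
res[1, 3] = 3.0   # bad pixel inside the edge region
res[2, 0] = 2.0   # bad pixel inside the planar region
planar_mask = np.ones((4, 4), dtype=bool)
planar_mask[:, 3] = False
stats = region_statistics(res, gt, planar_mask)
```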
According to the above discussion, we ought to pay more attention to edge region upsampling and use linear interpolation to simplify the computation for planar areas.
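For reference, linear interpolation on a regular grid can be sketched as two separable passes of one-dimensional interpolation. The sketch below assumes that LR pixel (i, j) is placed as a seed at HR position (iU, jU) and that positions beyond the last seed repeat the border value; both choices, and the function name `linear_upsample`, are illustrative assumptions.

```python
import numpy as np

def linear_upsample(depth_lr, rate):
    """Separable linear interpolation of an LR depth map by an integer rate."""
    d = np.asarray(depth_lr, dtype=float)
    h, w = d.shape
    xs = np.arange(w * rate) / rate   # HR column positions in LR coordinates
    ys = np.arange(h * rate) / rate   # HR row positions in LR coordinates
    # Interpolate along each row, then along each column of the result.
    rows = np.stack([np.interp(xs, np.arange(w), row) for row in d])
    out = np.stack([np.interp(ys, np.arange(h), col) for col in rows.T], axis=1)
    return out

depth_lr = np.array([[0.0, 2.0],
                     [4.0, 6.0]])
depth_hr = linear_upsample(depth_lr, rate=2)
```

Because a planar depth surface is reproduced exactly by a linear model between seeds, this scheme introduces no color-induced artifacts in planar areas.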

C. EDGE REGION UPSAMPLING
The edge region upsampling ability of an algorithm determines the quality of the final result. To improve the upsampling quality, we introduce a set of pseudo depth images that represent the geometric structure of the HR depth image, and then jointly use these pseudo depth images to compute the correlation coefficients of interpolated pixels with respect to seeds. Finally, the depth values of the most correlative seeds are assigned to the missing-value pixels of the HR depth image.

1) PSEUDO DEPTH IMAGE ESTIMATION
Classical upsampling methods [27], [28], [30], [31] directly encode color edge information into their upsampling models, based on the assumption that a color edge with high contrast indicates an abrupt depth edge. This upsampling strategy limits further improvement, because the edges of the color photo do not completely coincide with the edges of the depth image. A sharp depth edge often corresponds to a blurred color edge with low contrast, thus previous upsampling algorithms inevitably produce intermediate depth values, and the depth transition at the boundary is smoothed by these intermediate depths.
We retrieve pseudo depth images from color photo to indicate the structure of HR depth image, instead of directly using color guidance. The edges of pseudo depth image will coincide with the edges of HR depth image. The values of pseudo depth image are only used to indicate the edges instead of representing the actual depth. Thus we can freely scale the values of pseudo depth image.
We use a segmentation-like interpolation algorithm to compute the pseudo depth image. First, we quantize the depth range of the LR depth image and obtain a pseudo LR depth image by mapping each disjoint depth interval to its median value. In this way, the contaminating noise is suppressed and the edges are enhanced. Second, the pseudo LR depth image is mapped into the pseudo depth image and forms a set of seeds. Third, we compute the correlation coefficient p_ij of each missing-value pixel j with respect to each seed i. Fourth, the quantized depth value of the most correlative seed is chosen as the interpolated depth for each interpolated pixel. It is worth noting that the assignment process of our algorithm can be interpreted as a kind of segmentation process, if we view the quantized depth value as a label for each seed.
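The first step, depth quantization, can be sketched as follows. This is a minimal illustration under assumptions: the depth range is split into equal-width intervals, and the "median value" of an interval is interpreted as its midpoint; the number of levels and the function name `quantize_depth` are hypothetical.

```python
import numpy as np

def quantize_depth(depth_lr, n_levels):
    """Map each depth to the midpoint of the equal-width interval containing it."""
    d = np.asarray(depth_lr, dtype=float)
    lo, hi = d.min(), d.max()
    if hi == lo:
        return d.copy()                       # constant image, nothing to do
    edges = np.linspace(lo, hi, n_levels + 1)
    # Interval index per pixel; clip so that the maximum falls in the last bin.
    idx = np.clip(np.digitize(d, edges) - 1, 0, n_levels - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])  # assumption: midpoint as label
    return centers[idx]

depth_lr = np.array([[0.0, 1.0, 2.0],
                     [9.0, 10.0, 8.0]])
pseudo_lr = quantize_depth(depth_lr, n_levels=2)
```

Small fluctuations inside an interval collapse to one label, while the jump between intervals is kept, which matches the noise suppression and edge enhancement described above.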
Pseudo depth image can preserve sharp edges of LR depth image, even if the corresponding color edges are blurred, for the reason that the depth interpolation, in a sense, is a kind of segmentation that does not yield any intermediate value. Therefore pseudo depth image is better than color photo to represent the geometric structure of depth image.

2) EDGE GUIDED DEPTH IMAGE UPSAMPLING
In this section, we take advantage of the random walk model [32], [33] to compute the most correlative depth under the guidance of the edge information of both the pseudo depth image and the color image. Let $N^f_i$ be the first-order neighborhood of pixel $i$, $w_{ij}$ be the weight coefficient between $i$ and $j \in N^f_i$, and $N_s$ denote a window neighborhood of seed $s$, where $N_s$ must be large enough to contain other seeds around $s$. We partition the pixels of $N_s$ into two sets, $v_S$ (seeds) and $v_U$ (unseeded nodes), and use $S_d$ to represent the set of seeds in $v_S$ whose depth values are equal to $d$ in $N_s$. Further, we define the indicator vector $\delta_{S_d}$ over $v_S$, whose entry is 1 for a seed in $S_d$ and 0 otherwise. Then we can use optimization problem (10) to estimate the correlation coefficients $p_U$ of $v_U$ with respect to $S_d$, where $L$ is the combinatorial Laplacian matrix built from the weights $w_{ij}$, with $L_U$ its block for the unseeded nodes and $B$ its block coupling seeds and unseeded nodes. For seeds with depth value $d$, the minimizer can be computed by $p_U = -L_U^{-1} B^T \delta_{S_d}$ according to the normal equation of $E(p_U, p_S)$. The correlation coefficients with respect to the depth values $\{d_1, \ldots, d_n\}$ can be computed simultaneously by $P_U = -L_U^{-1} B^T [\delta_{S_{d_1}}, \ldots, \delta_{S_{d_n}}]$. Finally, we find the maximal correlation coefficient in each row of $P_U$ to determine the most correlative depth.
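The linear solve $p_U = -L_U^{-1} B^T \delta_{S_d}$ can be illustrated on a tiny graph. The sketch below is a dense NumPy toy, not our implementation: it assumes a symmetric weight matrix given explicitly, builds the combinatorial Laplacian as degree matrix minus weight matrix, and uses the hypothetical function name `random_walk_labels`. A practical version would use sparse matrices over the window $N_s$.

```python
import numpy as np

def random_walk_labels(weights, seed_idx, seed_depths):
    """Solve L_U P_U = -B^T [delta_d1 ... delta_dn] and pick the best depth."""
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                 # combinatorial Laplacian
    seed_idx = np.asarray(seed_idx)
    unseeded = np.setdiff1d(np.arange(n), seed_idx)
    L_U = L[np.ix_(unseeded, unseeded)]            # unseeded block
    B = L[np.ix_(seed_idx, unseeded)]              # seed-to-unseeded block
    depths = np.unique(seed_depths)
    # One indicator column per distinct seed depth.
    delta = (np.asarray(seed_depths)[:, None] == depths[None, :]).astype(float)
    P_U = np.linalg.solve(L_U, -B.T @ delta)
    best = depths[np.argmax(P_U, axis=1)]
    return unseeded, P_U, best

# Chain of 5 pixels; the weak link between pixels 2 and 3 mimics a depth edge.
W = np.zeros((5, 5))
for (i, j), w in {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 0.01, (3, 4): 1.0}.items():
    W[i, j] = W[j, i] = w
unseeded, P_U, best = random_walk_labels(W, seed_idx=[0, 4], seed_depths=[10.0, 50.0])
```

In this toy chain, the pixels on each side of the weak link receive the depth of the seed on their own side, so no intermediate depth value is produced.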
Our weight coefficient calculation is much simpler than that of NLM [30] and AR [31]. Instead of computing complicated segmentation, edge saliency, anisotropic structure-aware filters and patch-based bilateral filters as NLM [30] and AR [31] do, we only employ a bilateral-filter-like kernel to compute the coefficients $w^c_{ij}$ and $w^d_{ij}$ from the color image and the pseudo depth image, respectively, and then add them together to obtain the final weight coefficients used in our upsampling model.

3) WEIGHT COEFFICIENTS FUSION
A single pseudo depth image cannot accurately represent the position of the ground truth boundary, thus the weight coefficients estimated from it are not very reliable. As a remedy, we produce a set of pseudo depth images $\{D_l\}$ under different parameter configurations, and synthesize more reliable weight coefficients $w^d_{ij}$ from the weight coefficient set $\{w^{d_l}_{ij}\}$ estimated from the pseudo depth images $\{D_l\}$, where $l \in \{1, \ldots, L\}$. Here, $P$ is the pixel set of the HR depth image. According to the index $k$, $w^d_{ij}$ can be divided into different sets $w^d_k = \{w^d_{i,k} \mid i \in P\}$, and each $w^d_k$ forms an image that has the same size as $D$ under the cyclic boundary condition. For each $k$, we use the TV model (14) to fuse the information of $\{w^{d_l}_k\}$, where $\partial_x w^d_k$ and $\partial_y w^d_k$ denote the weight differences between neighboring pixels along the x and y directions. Let $w^{d_a}_k = \frac{1}{L}\sum_{l=1}^{L} w^{d_l}_k$; then formula (14) is equivalent to formula (15). The energy function of (15) is quadratic and thus has a global minimum. We can use formula (16) to find the optimal solution [34],
where $\mathcal{F}$ is the FFT operator, $\mathcal{F}(\cdot)^*$ represents the complex conjugate, and $\mathcal{F}(1)$ is the Fourier transform of the delta function.
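To illustrate how a quadratic smoothing energy can be minimized in closed form with the FFT under the cyclic boundary condition, the sketch below solves an assumed energy of the form $\|w - w^{d_a}_k\|^2 + \lambda(\|\partial_x w\|^2 + \|\partial_y w\|^2)$. The exact form and weighting of formulas (14)-(16) are not restated here, so this energy, the parameter `lam` and the function name `fuse_weights` are assumptions made for illustration only.

```python
import numpy as np

def fuse_weights(weight_maps, lam):
    """Average L weight maps, then smooth the average via an FFT closed form."""
    stack = np.asarray(weight_maps, dtype=float)
    w_avg = stack.mean(axis=0)                      # average over the L maps
    h, w = w_avg.shape
    # Forward-difference kernels and delta kernel under periodic boundaries.
    dx = np.zeros((h, w)); dx[0, 0] = -1.0; dx[0, -1] = 1.0
    dy = np.zeros((h, w)); dy[0, 0] = -1.0; dy[-1, 0] = 1.0
    delta = np.zeros((h, w)); delta[0, 0] = 1.0
    Fdx, Fdy, Fdelta = np.fft.fft2(dx), np.fft.fft2(dy), np.fft.fft2(delta)
    denom = Fdelta + lam * (np.conj(Fdx) * Fdx + np.conj(Fdy) * Fdy)
    fused = np.fft.ifft2(np.fft.fft2(w_avg) / denom)
    return fused.real

# Hypothetical weight maps from two pseudo depth images.
rng = np.random.default_rng(0)
maps = rng.random((2, 8, 8))
fused = fuse_weights(maps, lam=0.5)
```

With `lam = 0` the result reduces to the plain average; larger values trade fidelity to the average for spatial smoothness while leaving the mean weight unchanged.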

III. DEPTH CNN
The Depth CNN network proposed in this paper (as shown in Fig. 2) is mainly based on ''resblock'' and ''denseblock'', where the ''batch-normalization'' layers are removed and their linear transformation is merged into the convolution layer. To reduce the number of channels and the computation, each denseblock is followed by a transition module, and the combined pair of denseblock and transition can be repeated multiple times according to the actual situation. The depth image is sent to DepthSRNet through the trunk network and the optimization branch, and then feature fusion and reconstruction are carried out to obtain the HR depth image.
The low-level feature extraction module is based on a residual network structure that includes three 3 × 3 convolution layers, with a residual skip connection added between the last two convolution layers. The high-level feature extraction module includes a plurality of dense connection layers and an equal number of transition layers, which are cascaded alternately; the connection can be expressed as $y_n = h([x_{n-1}, x_{n-2}, \ldots, x_1])$, where $h$ represents convolution layer and activation function processing and $[\ldots]$ represents the concatenation operation. The Pixel_shuffle layer upsamples the feature image in height and width while effectively retaining image details. In general, the LR depth image is sent into the trunk of DepthSRNet, and the original HR depth image that is upsampled and constructed from the LR depth image is sent into the optimization branch as the supervision signal for model training. Then, the loss between the HR depth image output by the neural network model and the original HR depth image is calculated. In the loss function used in training, $n$ represents the number of samples, $y_i$ represents the original HR depth image and $\hat{y}_i$ represents the HR depth image output by the model. The Adam gradient update algorithm is used in training, with exponential decay rates of (0.9, 0.999).
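The rearrangement performed by a pixel-shuffle layer can be written in plain NumPy as below. This is a sketch of the standard sub-pixel rearrangement for a single feature tensor of shape (C·r², H, W), using the channel ordering in which output pixel (c, h·r + i, w·r + j) is read from input channel c·r² + i·r + j; it is not the network code itself, and the function name `pixel_shuffle` is used here only for illustration.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange features of shape (C*r*r, H, W) into (C, H*r, W*r)."""
    c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)     # reorder to (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)  # merge (h, i) and (w, j)

# Four 2 x 2 feature channels become one 4 x 4 map for r = 2.
features = np.arange(4 * 2 * 2).reshape(4, 2, 2)
upsampled = pixel_shuffle(features, 2)
```

Since the operation only moves values from the channel dimension into the spatial dimensions, no information is lost or blurred during upsampling.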

IV. EXPERIMENTS AND COMPARISONS
We implement our upsampling program with Python 3.6 on a PC. The parameter configuration used in the separation solver (1) is $\sigma^{\downarrow}_C = 50$, $\sigma^{\downarrow}_S = 4$, $\tau = 1$, $\lambda_1 = 70$, $\lambda_2 = 10$. To indicate the edges of the depth image, we usually compute two pseudo depth images with parameter settings $\sigma_s = 3$, $\sigma_c = 20$ and $\sigma_s = 3$, $\sigma_c = 80$. Moreover, we find that the performance of computing correlation coefficient (10) [30] and AR [31], because the most time consuming calculation is solving the quadratic optimization.
Both synthetic examples and real-world examples are tested in the experiments. The Depth CNN is deployed on the PyTorch framework, and the experiment is conducted on a PC with an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz and an NVIDIA GeForce GTX 1060 5GB GPU. We compare the proposed method with state-of-the-art depth image SR networks (VDSR [35], TSDR [36], DepthSR-Net [37], MFR-SR [38], RYNet [39]) in terms of qualitative and quantitative performance at various scaling factors.

A. SYNTHETIC EXAMPLES EVALUATIONS
In the experiments, we employ two depth images, Art and Book from the Middlebury benchmark, to evaluate the upsampling performance. The low-resolution depth images are down-sampled from the ground truth depth images. The original color images are used as the high-resolution guidance images. We use the MAD (mean absolute difference) criterion to evaluate the upsampling quality. The MADs against the ground truth depth images are reported in Table 2. We can observe that our method obtains the lowest MADs for 4X, 8X and 16X upsampling. For a low upsampling rate such as 2X, the restoration ability on planar regions is the critical factor, because our algorithm degenerates to the simplest linear interpolation method when the number of pixels in the planar regions overwhelms the number in the edge regions. The statistics in Table 2 also show that the advantage of our method becomes more prominent as the upsampling factor increases or the boundary regions grow. For visual comparison, 8X upsampled depth images of Art are shown in Figure 3. Partial enlargements are drawn in Figure 3(a)-3(f). Figure 3(g)-3(l) are the corresponding section curves. We can see that the compared methods introduce obvious jaggy artifacts along the section curve. Although the result of AR [31] is comparable to ours, its edges are smoothed and blurred.
In the first two rows, we show close-ups of Art. In the next two rows, we illustrate the depth profile along the red section line shown in the first rows, where the red line denotes the ground truth and the blue line denotes the produced depths.
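The MAD criterion used in this subsection is the mean of the per-pixel absolute differences between an upsampled depth image and its ground truth. A minimal NumPy sketch follows; the function name and the toy values are hypothetical.

```python
import numpy as np

def mean_absolute_difference(result, ground_truth):
    """Mean absolute difference between two depth images of equal shape."""
    r = np.asarray(result, dtype=float)
    g = np.asarray(ground_truth, dtype=float)
    if r.shape != g.shape:
        raise ValueError("images must have the same shape")
    return float(np.mean(np.abs(r - g)))

ground_truth = np.array([[10.0, 20.0], [30.0, 40.0]])
result = np.array([[11.0, 20.0], [28.0, 40.0]])
mad = mean_absolute_difference(result, ground_truth)  # (1 + 0 + 2 + 0) / 4
```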

B. REAL-WORLD EXPERIMENTS
We use the standard database of the KITTI Vision Benchmark Suite to perform real-world experiments. All of the data in the benchmark is acquired by a standard station wagon with two color and grayscale video cameras. The accurate ground truth is provided by a Velodyne laser scanner. We employ the laser scanner data and the accompanying color images to perform our experiments.
Upscaling the depth images of KITTI is a tough task. All images of KITTI are captured on the streets of Karlsruhe, thus the scenes are very large and the geometric structures of objects are very small compared with the background scene in the image. In contrast to the artificial indoor pictures used in state-of-the-art methods [30], [31], the images of KITTI suffer from varied sensing noise. Figure 4 illustrates the 6X super-resolution results of our method. Zigzag artifacts are inevitable when the maximum correlation principle is used to fill the missing holes, since acquired depths in the practical environment vary dramatically and a lot of depth information is lost by downsampling the laser data. To deal with this situation, our structure and noise separation algorithm is used as a post-processing step to smooth the final results and separate the stair-step jumps from the structure of the depth image. All of the results are exhibited in Figure 4. We can observe that our separation algorithm keeps the sharp boundaries of the depth image (Figure 4(g)-4(h)) and simultaneously segregates the jagged artifacts (Figure 4(e)-4(f)) from the primitive results (Figure 4(c)-4(d)). The rendered images (Figure 4(i)-4(j)) suggest that our method reliably restores the geometric relationship. The depth-color pairs are shown at their original size ratio, and the image size is given in the captions.

C. DEPTH CNN EVALUATIONS
In the network evaluation phase, 100 RGB-D images are chosen from the Middlebury dataset to form the experimental dataset, with 82 images used for training and 18 images for validation. The root mean square error (RMSE) and peak signal-to-noise ratio (PSNR) are used to evaluate the super-resolution reconstruction performance of the network. The state-of-the-art depth image super-resolution methods under comparison include VDSR [35], TSDR [36], DepthSR-Net [37], MFR-SR [38] and RYNet [39], and the results are listed in Table 3 and Table 4. As we can see, the proposed Depth CNN and RYNet show suboptimal and optimal performance, respectively, for most test images.
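The two metrics can be computed as in the following sketch. The peak value used for PSNR depends on how the depth images are stored; the value 255 below (8-bit depth maps) is an assumption, as are the function names and the toy data.

```python
import math
import numpy as np

def rmse(result, ground_truth):
    """Root mean square error between two depth images."""
    diff = np.asarray(result, float) - np.asarray(ground_truth, float)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(result, ground_truth, peak=255.0):
    """Peak signal-to-noise ratio in dB; peak=255 assumes 8-bit depth maps."""
    err = rmse(result, ground_truth)
    if err == 0.0:
        return math.inf                 # identical images
    return 20.0 * math.log10(peak / err)

ground_truth = np.zeros((2, 2))
result = np.array([[3.0, 4.0], [0.0, 0.0]])
rmse_value = rmse(result, ground_truth)   # sqrt((9 + 16) / 4)
psnr_value = psnr(result, ground_truth)
```

A lower RMSE and a higher PSNR both indicate a reconstruction closer to the ground truth, so the two columns of a comparison table should rank methods consistently when the same peak value is used.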

V. CONCLUSION
We introduce a divide-and-conquer upsampling method to upscale the LR depth image with a registered high-quality optical image. The depth image is divided into planar areas and edge regions, and we exploit different methods to interpolate the missing depths of the different areas. The statistical data show that the simplest linear interpolation method produces competitive upsampling results in planar areas compared with state-of-the-art methods. For edge region upsampling, our segmentation-separation upsampling method outperforms previous methods and yields much better upsampled edge regions. We then propose a Depth CNN to refine the upsampled results, and the experiments show that its performance meets the requirements. In future work, we will attempt to add constraints such as prior images to further improve network performance, and also consider other network architectures such as GANs.