Multi-channel residual network model for accurate estimation of spatially-varying and depth-dependent defocus kernels

Abstract: Digital projectors have been increasingly utilized in various commercial and scientific applications. However, they are prone to the out-of-focus blurring problem since their depth-of-fields are typically limited. In this paper, we explore the feasibility of utilizing a deep learning-based approach to analyze the spatially-varying and depth-dependent defocus properties of digital projectors. A multimodal displaying/imaging system is built for capturing images projected at various depths. Based on the constructed dataset containing well-aligned in-focus, out-of-focus, and depth images, we propose a novel multi-channel residual deep network model to learn the end-to-end mapping function between in-focus and out-of-focus image patches captured at different spatial locations and depths. To the best of our knowledge, this is the first research work revealing that the complex spatially-varying and depth-dependent blurring effects can be accurately learned from a number of real-captured image pairs instead of being hand-crafted as before. Experimental results demonstrate that our proposed deep learning-based method significantly outperforms state-of-the-art defocus kernel estimation techniques and thus leads to better out-of-focus compensation for extending the depth-of-fields of digital projectors.


Introduction
In recent years, digital projection systems have been increasingly used to provide pixel-wise controllable light sources for various optical measurement and computer graphics applications such as Fringe Projection Profilometry (FPP) [1][2][3] and Augmented Reality (AR) [4][5][6]. However, digital projectors utilize large apertures to maximize their display brightness and thus typically have very limited depth-of-fields [7][8][9]. Once a projector is not precisely focused, its screen-projected images will contain noticeable blurring effects. A comprehensive analysis of the spatially-varying and depth-dependent defocus properties of projectors provides useful information for achieving more accurate three-dimensional (3D) shape acquisition and virtual object rendering.
When the setup of a digital projector is not properly focused, the light rays from a single projector pixel will be distributed in a small area instead of being converged onto a single point on the display surface. The distribution of light rays is typically depicted through defocus kernels or point-spread functions (PSF) [7,10]. In the thin-lens model, the diameter of the defocus kernel is directly proportional to the aperture size. As a result, projectors with larger apertures will suffer from smaller depth-of-fields and more severe out-of-focus blurring effects. Once the 2D spatially-varying defocus kernels of the projector at different depths are estimated as priors, the appropriate out-of-focus compensation method can be determined [7,8].
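The proportionality between kernel diameter and aperture size follows directly from thin-lens geometry. The following is a standard sketch of this relationship; the symbols A (aperture diameter), f (focal length), d (focus distance), and s (screen distance) are introduced here for illustration and do not appear in the original text:

```latex
% Thin-lens imaging equation for a lens focused at distance d:
\frac{1}{f} = \frac{1}{d} + \frac{1}{v},
% where v is the lens-to-image-plane distance. If the display surface sits
% at distance s \neq d, the light cone from a single pixel intersects the
% surface before (or after) converging, producing a blur circle of diameter
c = A \cdot \frac{f}{d - f} \cdot \frac{|s - d|}{s}.
```

Hence the kernel diameter c grows linearly with the aperture A and vanishes only at the focus plane s = d, which is why bright, large-aperture projectors exhibit shallow depth-of-fields.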
Previously, a number of techniques have been presented to estimate the defocus kernels or PSF. For instance, some methods directly acquire the PSF using point-like light sources [11,12]. Although these methods can achieve more accurate PSF measurement with a higher peak signal-to-noise ratio (PSNR), they require specifically designed optical instruments and have difficulty in generating multiple point sources to obtain the spatially-varying PSF. Non-blind methods, which are based on specific calibration patterns (e.g., checkerboard targets [13,14] or grid points [9]), are the most commonly used techniques for defocus kernel estimation. It is noted that most non-blind methods employ simplified parametric models to constrain the space of possible PSF solutions. However, these model-based methods impose strong/simplified prior assumptions on the regularity of the defocus kernel and thus cause inaccurate estimation results. In comparison, non-parametric kernels can accurately describe complex blurring effects [15]. However, it is difficult to adopt such high-dimensional representations to reflect the interrelationship between the kernel shape and the optical parameters (e.g., aperture size or projection depth), and thus the non-parametric methods are typically scene-specific or depth-fixed [8,13,15].
Recently, deep learning-based models (e.g., Convolutional Neural Networks) have significantly boosted the performance of various machine vision tasks including object detection [16,17], image segmentation [18] and target recognition [19]. Given a number of training samples, Convolutional Neural Networks (CNN) can automatically construct high-level representations by assembling the extracted low-level features. For instance, Simonyan et al. presented a very deep CNN model (VGG), which is commonly utilized as a backbone architecture for various computer vision tasks [17]. He et al. proposed a novel residual architecture to improve the training of very deep CNN models and achieved improved performance by increasing the depth of networks [16]. Moreover, some 3D CNN architectures have been proposed to extend the dimension of input data from 2D to 3D, processing video sequences for action recognition [20,21] or target detection [22]. Although CNN-based models have been successfully applied to solve many challenging image/signal processing tasks, very limited efforts have been made to explore deep learning-based methods for defocus kernel estimation or analysis.
In this paper, we present the first deep learning-based approach for accurate estimation of spatially and depth-varying projection defocus kernels and demonstrate its effectiveness for compensating the blurring effects of out-of-focus projectors. An optical imaging/displaying system, which consists of a single-lens reflex camera, a depth sensor, and a portable digital projector, is geometrically calibrated and used to capture projected RGB images at various depths. Moreover, we calculate a 2D image warping transformation, which maximizes the photoconsistency between in-focus and out-of-focus images, to achieve sub-pixel level alignment. Based on the constructed dataset containing well-aligned in-focus, out-of-focus, and depth images, we present a compact yet effective multi-channel CNN model to precisely estimate the spatially-varying and depth-dependent defocus kernels of a digital projector. The proposed model incorporates multi-channel inputs (RGB images, depth maps, and spatial location masks) and learns the complex blurring effects presented in the projected images captured at different spatial locations and depths. To the best of our knowledge, this represents the first research work revealing that the complex spatially-varying and depth-dependent blurring effects can be accurately learned from a number of in-focus and out-of-focus image patches instead of being hand-crafted as before. The contributions of this paper are summarized as follows:
(1) We construct a dataset that contains a large number of well-aligned in-focus, out-of-focus, and depth images captured at very different projection distances (between 50cm and 140cm). This new dataset could be utilized to facilitate the training of CNN-based defocus analysis models and to perform quantitative evaluation of various defocus analysis approaches.
(2) We propose a novel CNN-based model, which incorporates multi-channel inputs, including RGB images, depth maps, and spatial location masks, to estimate the spatially-varying and depth-dependent defocus kernels. Experiment results show that the proposed deep learning-based approach significantly outperforms other state-of-art defocus analysis methods and exhibits good generalization properties.
The rest of this paper is organized as follows. Section 2 provides the details of the optical displaying/imaging system and the constructed dataset for defocus analysis. Section 3 presents the details of the proposed multi-channel CNN model. Section 4 provides implementation details of the proposed CNN model and experimental comparison with the state-of-the-art alternatives. Finally, Section 5 concludes the paper.

Image acquisition
We have built an optical system that consists of a Nikon D750 single-lens reflex (SLR) camera, a Microsoft Kinect v2 depth sensor, and a PHILIPS PPX4835 portable projector. The spatial resolutions of the SLR camera and the digital light processing (DLP) projector are 6016 × 4016 and 1280 × 720 pixels, respectively. The Kinect v2 depth sensor is utilized to capture a 512 × 424 depth image, and its effective working distance ranges from 0.5m to 2.0m. These optical instruments are rigidly attached to preserve their relative position and orientation. The system moves along a sliding track in the direction approximately perpendicular to the projection screen, displaying/capturing images at different depths with spatially-varying and depth-dependent blurring effects. The system setup is illustrated in Fig. 1. We make use of the multimodal displaying/imaging system to capture a number of projected RGB images (using the SLR camera) and depth images (using the depth sensor). In total, we captured in-focus/out-of-focus projected images and depth maps at 13 different projection distances (the 50cm, 55cm, 60cm, 65cm, 70cm, 75cm, 80cm, 90cm, 100cm, 110cm, 120cm, 130cm, and 140cm positions on the sliding track). Note that the projector is properly focused at the 80cm position. In each position, we projected 200 images (1280 × 720) from the DIV2K dataset [23] (publicly available online for academic research purposes) for capturing the training images and selected another 100 images with large varieties as the testing images to evaluate the generalization performance of our proposed method. The complete data capturing process is illustrated in Fig. 2.

Image alignment
It is important to generate a number of precisely aligned in-focus and out-of-focus image pairs to analyze the characteristics of spatially and depth-varying defocus kernels. In each projection position, we establish corner correspondences between a checkerboard pattern input image and its screen-projected version. The transformation between the two images is modeled by a polynomial 2D geometric mapping function whose coefficients are estimated by least squares based on the found corner correspondences. In our experiment, we empirically use a 5th-order polynomial model. The computed polynomial mapping function is then utilized to rectify the geometrical skew of the projected images (both in-focus and out-of-focus images) to the front-parallel views, as illustrated in Fig. 3.
Fig. 2. The data (in-focus, out-of-focus, and depth images) capturing process at different projection positions. We projected hundreds of images (1280 × 720) from the publicly available DIV2K dataset [23] for capturing the training and testing images.
Fig. 3. Based on the established corner correspondences between a checkerboard pattern and its screen-projected image, a polynomial 2D geometric mapping function is computed to generate the viewpoint rectified images.
During the image acquisition process (capturing 200 training and 100 testing in-focus/out-of-focus images in each projection position), it is impractical to keep the SLR camera completely still. Therefore, the calculated polynomial mapping function cannot be used to achieve high-accuracy alignment of in-focus/out-of-focus images, as illustrated in Fig. 4(c). To address this problem, we further present a simple yet effective image warping-based technique to achieve sub-pixel level alignment between in-focus and out-of-focus image pairs. Given an in-focus image I_IF, we deploy the non-parametric defocus kernel estimation method [15] to predict its defocused version Î_DF.
Then, we calculate a 2D image displacement vector X* that maximizes the photoconsistency between the predicted (Î_DF) and real-captured (I_DF) defocused images as

X^* = \arg\min_{X} \sum_{p \in \Omega} \| \hat{I}_{DF}(p + X) - I_{DF}(p) \|_2^2,   (1)

where X* denotes the estimated sub-pixel level 2D displacement, and p denotes pixel coordinates on the 2D image plane Ω. Note that Eq. (1) represents a nonlinear least-squares optimization problem and can be minimized iteratively using the Gauss-Newton method. The calculated 2D displacement X* is utilized to warp input images to achieve sub-pixel level image alignment, as illustrated in Fig. 4(d). Finally, we make use of the calibration technique proposed by Moreno et al. [24] to estimate the intrinsic matrices of the depth sensor and the portable projector as well as the relative pose between them. The estimated six degrees of freedom (6DoF) extrinsic matrix is used to accurately align the coordinate systems of the two optical devices, transforming the depth images from the perspective of the depth sensor to that of the projector. In this manner, the captured depth data is associated with the viewpoint-rectified in-focus/out-of-focus images. Since the resolution of the depth images is lower than that of the screen-projected images, we apply bicubic interpolation to increase the size of the viewpoint-rectified depth images and fill the missing pixels. Figure 5 shows some sample images (1280 × 720) in the constructed dataset for defocus analysis. Note that the training and testing images present large varieties to evaluate the generalization performance of our proposed method. These well-aligned in-focus, out-of-focus, and depth images captured at different projection distances will be made publicly available in the future.
Fig. 5. Some well-aligned in-focus, out-of-focus, and depth images captured at different projection distances. We purposely use very different training and testing images to evaluate the generalization performance of our proposed method.
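The Gauss-Newton refinement of Eq. (1) can be sketched in a few lines of NumPy. This is a minimal single-channel illustration, not the authors' implementation; the function names and the bilinear warping scheme are our own choices:

```python
import numpy as np

def warp(img, dx, dy):
    """Sample img at (x + dx, y + dy) with bilinear interpolation."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xq = np.clip(xs + dx, 0.0, w - 1.001)
    yq = np.clip(ys + dy, 0.0, h - 1.001)
    x0, y0 = np.floor(xq).astype(int), np.floor(yq).astype(int)
    fx, fy = xq - x0, yq - y0
    return (img[y0, x0] * (1 - fx) * (1 - fy) + img[y0, x0 + 1] * fx * (1 - fy)
            + img[y0 + 1, x0] * (1 - fx) * fy + img[y0 + 1, x0 + 1] * fx * fy)

def align(pred, real, iters=30):
    """Estimate the sub-pixel shift X* minimizing sum_p ||pred(p+X) - real(p)||^2
    by Gauss-Newton: linearize the warp around the current X and solve a
    2-parameter linear least-squares problem for the update."""
    X = np.zeros(2)  # (dx, dy)
    for _ in range(iters):
        warped = warp(pred, X[0], X[1])
        r = (warped - real).ravel()               # photometric residual
        gx = np.gradient(warped, axis=1).ravel()  # Jacobian column w.r.t. dx
        gy = np.gradient(warped, axis=0).ravel()  # Jacobian column w.r.t. dy
        J = np.stack([gx, gy], axis=1)
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)
        X += step
        if np.linalg.norm(step) < 1e-5:
            break
    return X
```

On a smooth test image shifted by a known sub-pixel amount, this routine recovers the displacement to within a small fraction of a pixel, which is the level of alignment accuracy needed before per-pixel blur differences become meaningful.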

Deep learning-based defocus kernel estimation
In this section, we present a Multi-Channel Residual Deep Network (MC-RDN) model for accurate defocus kernel estimation. Given the in-focus input image I_IF, the aim of the proposed network is to accurately predict its defocused versions I_DF at different spatial locations and depths.

Image patch-based learning
In many previous CNN-based models [17,25], the full-size input images are directly fed to the network, and a reasonably large receptive field is utilized to capture image patterns presented in different spatial locations. However, training a CNN model by feeding entire images as input has two significant limitations. First, this approach requires a very large training dataset (e.g., the ImageNet dataset contains over 15 million images for training CNN models for object classification [26]). It is impractical to capture such large-scale datasets for the device-specific defocus analysis task. Second, its computational efficiency drops when processing a large number of high-resolution images (e.g., 1280 × 720 pixels) during the training process. To overcome the above-mentioned limitations, we propose to divide the full-size RGB/depth images into a number of sub-images, which are further integrated with two additional location maps (encoding the x and y coordinates) through concatenation, as illustrated in Fig. 6. As a result, our CNN model is capable of retrieving the spatial location of individual pixels within an image patch of arbitrary size without referring to the full-size images.
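The concatenation of RGB, depth, and location channels can be illustrated as follows. This is a minimal NumPy sketch; the normalization of the coordinate maps to [0, 1] is our assumption, since the paper does not state the exact encoding:

```python
import numpy as np

def make_input(rgb_patch, depth_patch, x0, y0, full_w=1280, full_h=720):
    """Stack an RGB patch, its depth patch, and two location maps that
    encode each pixel's (x, y) position within the full-size image."""
    ph, pw = rgb_patch.shape[:2]
    ys, xs = np.mgrid[y0:y0 + ph, x0:x0 + pw]
    x_map = xs / (full_w - 1.0)   # normalized column coordinates
    y_map = ys / (full_h - 1.0)   # normalized row coordinates
    return np.concatenate([rgb_patch,
                           depth_patch[..., None],
                           x_map[..., None],
                           y_map[..., None]], axis=-1)
```

A patch cropped at (x0, y0) = (400, 160) thus yields an 80 × 80 × 6 tensor whose last two channels let the network recover the patch's absolute position, so the learned kernel can vary across the image plane.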
Each full-size 1280 × 720 image is uniformly cropped into a number of 80 × 80 image patches. It is noted that many cropped image patches cover homogeneous regions and contain pixels of similar RGB values, as shown in Section A in Fig. 7. It is important to exclude such homogeneous image patches from the training process. Otherwise, the CNN-based model will be tuned to learn the simple mapping relationships between these homogeneous regions instead of estimating the complex spatially-varying and depth-dependent blurring effects. As a simple yet effective solution, we compute the standard deviation of pixels within an image patch as an indicator to decide whether this patch is suitable for training. A threshold θ is set to eliminate patches with low RGB variations. In our experiments, we set the threshold θ = 0.1. Only the image patches with abundant textures/structures, as shown in Section B in Fig. 7, are utilized for deep network training.
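The cropping and texture-based filtering step can be sketched as follows (a simple NumPy version; pixel values are assumed to be normalized to [0, 1] so that the threshold θ = 0.1 is meaningful):

```python
import numpy as np

def informative_patches(img, patch_size=80, theta=0.1):
    """Uniformly crop img into patch_size x patch_size tiles and keep only
    those whose pixel standard deviation exceeds theta, i.e., patches with
    enough texture/structure to constrain the defocus kernel."""
    h, w = img.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patch = img[y:y + patch_size, x:x + patch_size]
            if patch.std() > theta:      # reject homogeneous regions
                patches.append((x, y, patch))
    return patches
```

A flat patch has zero standard deviation and is discarded, while a textured patch easily clears the θ = 0.1 bar; the surviving (x, y) offsets also feed the location maps described above.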

Network architecture
The architecture of the proposed MC-RDN model is illustrated in Fig. 8. Given RGB image patches and the corresponding depth and location maps as input, our model extracts high-dimensional feature maps and performs a non-linear mapping operation to predict the defocused version. Since optical blurring effects are color-channel dependent [8,15,27,28], the MC-RDN model deploys three individual convolutional layers to extract the low-level features in the Red (R), Green (G), and Blue (B) channels of the input images as

F_0^c = \mathrm{Conv}_{1 \times 1}(I^c), \quad c \in \{R, G, B\},   (2)

where Conv_{1×1} denotes the convolution operation using a 1 × 1 kernel and F_0^c are the extracted low-level features in the R, G, and B channels. The F_0^c features are then fed into a number of stacked residual blocks to extract high-level features for defocus kernel estimation. We adopt the residual block used in EDSR [29], which contains two 3 × 3 convolutional layers and a Rectified Linear Unit (ReLU) activation layer. Within each residual block, we add skip connections between deeper and shallower convolutional layers to integrate both global and local contexts for improving the accuracy of image restoration. Moreover, shortcut connections enable the gradient signal to back-propagate directly from higher-level features to lower-level ones, alleviating the gradient vanishing/exploding problem of training deep CNN models. In the MC-RDN model, we empirically set the number of residual blocks N = 4 in each channel to achieve a good balance between restoration accuracy and computational efficiency. The informative features extracted by the N-th residual blocks are then fed into three 3 × 3 convolutional layers for predicting the out-of-focus images in the R, G, and B channels as

\hat{I}_{DF}^c = \mathrm{Conv}_{3 \times 3}(F_N^c), \quad c \in \{R, G, B\},   (3)

where F_N^c denotes the features extracted by the N-th residual block in channel c.
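The EDSR-style residual block used in each color channel can be sketched as a plain NumPy forward pass. This is illustrative only; the layer widths, the absence of biases, and 'same' zero-padding are our simplifications:

```python
import numpy as np

def conv3x3(x, w):
    """'Same' 2D convolution: x is (H, W, Cin), w is (3, 3, Cin, Cout)."""
    h, wd, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, wd, w.shape[3]))
    for i in range(3):
        for j in range(3):
            out += np.einsum('hwc,co->hwo', xp[i:i + h, j:j + wd], w[i, j])
    return out

def residual_block(x, w1, w2):
    """Conv(3x3) -> ReLU -> Conv(3x3), plus an identity skip connection."""
    y = np.maximum(conv3x3(x, w1), 0.0)
    y = conv3x3(y, w2)
    return x + y   # the skip carries low-level features directly forward
```

With w2 set to zeros, the block reduces to the identity mapping, which illustrates why residual stacks are easy to optimize: each block only needs to learn a correction to its input rather than a full mapping.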

Network training
Our objective is to learn the optimal parameters for the MC-RDN model, predicting a blurred image Î_DF that is as similar as possible to the real-captured defocused image I_DF. Accordingly, our loss function is defined as

L = \sum_{p \in P} \left( \alpha \| \hat{I}_{DF}^R(p) - I_{DF}^R(p) \|_2^2 + \beta \| \hat{I}_{DF}^G(p) - I_{DF}^G(p) \|_2^2 + \gamma \| \hat{I}_{DF}^B(p) - I_{DF}^B(p) \|_2^2 \right),   (4)

where ||·||_2 denotes the L2 norm, which is the most commonly used loss function for high-accuracy image restoration tasks [30,31], α, β, and γ denote the weights of the R, G, and B channels (we set α = β = γ = 1), and p indicates the index of a pixel in the non-boundary image region P. Note that the value of a pixel in the out-of-focus image depends on the distribution profile of its neighboring pixels in the corresponding in-focus image; thus, we only calculate the differences for non-boundary pixels, which can refer to enough neighboring pixels for robust defocus prediction. The loss function calculates the pixel-wise difference between the predicted Î_DF and real-captured I_DF in the R, G, and B channels, which is utilized to update the weights and biases of the MC-RDN model using mini-batch gradient descent based on back-propagation.
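The channel-weighted, boundary-excluded loss can be written compactly as below; the margin value is a hypothetical choice standing in for the defocus kernel radius, which the paper does not specify:

```python
import numpy as np

def mc_loss(pred, target, margin=8, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-channel squared errors over the non-boundary
    region P (pixels at least `margin` away from the patch border),
    with weights = (alpha, beta, gamma) for the R, G, B channels."""
    p = pred[margin:-margin, margin:-margin]
    t = target[margin:-margin, margin:-margin]
    return sum(w * float(np.sum((p[..., c] - t[..., c]) ** 2))
               for c, w in enumerate(weights))
```

Excluding the boundary band matters because a defocused boundary pixel depends on in-focus pixels outside the patch, so its prediction error would penalize the network for information it never received.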

Experiment results
We implement the MC-RDN model based on the Caffe framework and train it on an NVIDIA GTX 1080Ti with CUDA 8.0 and cuDNN 5.1 for 50 epochs. The SGD solver is utilized to optimize the weights by setting α = 0.01 and µ = 0.999. The batch size is set to 32 and the learning rate is fixed to 1e-1. We adopt the method described in [32] to initialize the weight parameters and set the biases to zeros. The source code of the MC-RDN model will be made publicly available in the future.

Defocus kernel estimation
We compare our proposed MC-RDN model with state-of-the-art defocus kernel estimation methods qualitatively and quantitatively. First, we consider two parametric methods that maximize the Normalized Cross-Correlation (NCC) between predicted and real-captured defocused images using a Gaussian kernel (Gauss-NCC [9]) and a circular disk (Disk-NCC [14]). Moreover, we consider a non-parametric defocus kernel estimation method (Non-para [15]), which deploys a calibration chart with five circles in each square to capture how step-edges of all orientations are blurred. Non-parametric kernels can accurately describe complex blurs, while their high dimensionality hinders understanding of the relationship between the defocus kernel shape and the setting of optical systems (e.g., spatial locations or projection distances). Therefore, Kee et al. also made use of a 2D Gaussian distribution to reduce the dimensionality and model the complex 2D defocus kernel shape (2D-Gauss [15]). Source codes of these hand-crafted methods are either publicly available or re-implemented according to the original papers. We first evaluate the performance of different defocus kernel estimation methods at a number of projection positions where the training/calibration images are available (the 50cm, 60cm, 70cm, 80cm, 100cm, 120cm, and 140cm positions on the sliding track). We adopt the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [33] as the evaluation metrics. Table 1 summarizes the quantitative results. It is observed that our MC-RDN model surpasses all of the previous hand-crafted methods in terms of PSNR and SSIM values.
Fig. 9. The predicted defocused images in the 50cm position using Gauss-NCC [9], Disk-NCC [14], 2D-Gauss [15], Non-para [15], and our MC-RDN model. Please zoom in to check details highlighted in the red bounding box.
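For reference, the PSNR metric used in the comparison is computed as follows (a standard NumPy implementation; the peak value depends on whether images are stored in [0, 1] or [0, 255]):

```python
import numpy as np

def psnr(reference, test, peak=1.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    mse = np.mean((np.asarray(reference, float) - np.asarray(test, float)) ** 2)
    if mse == 0:
        return float('inf')   # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Higher PSNR between the predicted and real-captured defocused images indicates a more faithful defocus model; SSIM [33] complements it by comparing local structure rather than raw pixel error.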
This deep learning-based approach constructs a more comprehensive model to accurately depict the blurring effects of an out-of-focus projector, achieving significantly higher PSNR and SSIM values compared with the parametric methods (Gauss-NCC [9], Disk-NCC [14], and 2D-Gauss [15]). Our MC-RDN model also performs favorably compared with the non-parametric method based on high-dimensional representations (Non-para [15]). A noticeable drawback of the non-parametric method is that it requires capturing in-focus and out-of-focus calibration images at each projection position to compute the optimal defocus kernels. In comparison, our proposed deep learning-based method is trained by utilizing image data captured at 7 fixed depths (the 50cm, 60cm, 70cm, 80cm, 100cm, 120cm, and 140cm positions on the sliding track), after which it can adaptively compute defocus kernels at various projection distances (e.g., the 55cm, 65cm, 75cm, 90cm, 110cm, and 130cm positions). Some comparative results with state-of-the-art defocus kernel estimation methods are shown in Fig. 9. Our method can more accurately predict blurring effects, providing important prior information for defocus compensation and depth-of-field extension. We also evaluate the performance of defocus kernel estimation without referring to the training/calibration images. The parametric methods (Gauss-NCC [9], Disk-NCC [14], and 2D-Gauss [15]) first calculate the defocus model at a number of fixed depths and then interpolate model parameters between measurement points. In comparison, our proposed MC-RDN model implicitly learns the characteristics of defocus kernels at a number of fixed depths and predicts defocus kernels at various projection distances. Note that the second-best performing non-parametric method is not applicable in this case since calibration images are not provided in these projection positions.
Experimental results in Table 2 demonstrate that our proposed method exhibits better generalization performance compared with these parametric methods, predicting more accurate defocused images at very different projection distances (between 50cm and 140cm).

Out-of-focus blur compensation
We further demonstrate the effectiveness of the proposed deep learning-based defocus kernel estimation method for minimizing out-of-focus image blurs. We adopt the algorithm presented by Zhang et al. [8] to compute a pre-conditioned image I*, which most closely matches the in-focus image I_IF after defocusing. The computation of I* is achieved through a constrained minimization problem as

I^* = \arg\min_{I} \sum_{p} \| DF(I(p)) + \varphi - I_{IF}(p) \|_2^2, \quad \text{s.t. } 0 \le I(p) \le 255,   (5)

where DF(I(p)) denotes the predicted defocused version of a pre-conditioned image I, and φ is the background radiance, which can be omitted in a completely dark environment [8]. Figure 10 shows some comparative results of out-of-focus blurring effect compensation using different defocus estimation methods. It is visually observed that the screen-projected image of I* computed using our deep learning-based method achieves better deblurring results compared with the other alternatives. More accurate defocus kernel estimation results lead to restoring sharper and clearer textures and structural edges, suppressing undesirable artifacts, and producing higher PSNR and SSIM values. The experimental results demonstrate that our proposed MC-RDN model provides a promising solution to extend the depth-of-field of a digital projector without modifying its optical system.
Fig. 10. Some comparative results of out-of-focus blurring effect compensation in the 50cm position using Gauss-NCC [9], Disk-NCC [14], 2D-Gauss [15], Non-para [15], and our MC-RDN model. Please zoom in to check details highlighted in the red bounding box.

Conclusion
In this paper, we attempt to solve the challenging defocus kernel estimation problem through a deep learning-based approach. For this purpose, we first construct a dataset that contains a large number of well-aligned in-focus, out-of-focus, and depth images. Moreover, we present a multi-channel residual CNN model to estimate the complex blurring effects presented in the screen-projected images captured at different spatial locations and depths. To the best of our knowledge, this is the first research work to construct a dataset for defocus analysis and to reveal that the complex out-of-focus blurring effects can be accurately learned from a number of training image pairs instead of being hand-crafted as before. Experiments have verified the effectiveness of the proposed approach. Compared with state-of-the-art defocus kernel estimation methods, it can generate more accurate defocused images, thus leading to better compensation of undesired out-of-focus image blurs.

Funding
National Natural Science Foundation of China (51575486, 51605428).

Disclosures
The authors declare no conflicts of interest.