Physics-based Shading Reconstruction for Intrinsic Image Decomposition

We investigate the use of photometric invariance and deep learning to compute intrinsic images (albedo and shading). We propose albedo and shading gradient descriptors which are derived from physics-based models. Using the descriptors, albedo transitions are masked out and an initial sparse shading map is calculated directly from the corresponding RGB image gradients in a learning-free, unsupervised manner. Then, an optimization method is proposed to reconstruct the full dense shading map. Finally, we integrate the generated shading map into a novel deep learning framework to refine it and also to predict the corresponding albedo image to achieve intrinsic image decomposition. By doing so, we are the first to directly address the texture and intensity ambiguity problems of shading estimation. Large-scale experiments show that our approach, steered by physics-based invariant descriptors, achieves superior results on the MIT Intrinsics, NIR-RGB Intrinsics, Multi-Illuminant Intrinsic Images, Spectral Intrinsic Images, and As Realistic As Possible datasets, and competitive results on the Intrinsic Images in the Wild dataset, while achieving state-of-the-art shading estimations.

We illustrate the steps of the proposed method in Figure 1.
(1.) We calculate the albedo gradients to identify true color (reflectance) changes. If the albedo gradient map shows no significant changes in a local neighborhood, it manifests a homogeneously colored patch. A homogeneously (single) colored patch means that the only source causing pixel values to change is the shading component. (2.) Therefore, we calculate image gradients only for homogeneously colored patches. Then, we use an off-the-shelf algorithm to compute the global least-squares reconstruction of the shading map from its shading gradient fields. This process generates a sparse shading map, where the albedo changes are masked out. (3.) Using an optimization framework, a shading smoothness constraint is employed to fill in the gaps (i.e., the masked-out albedo changes) based on neighboring pixel information. This process generates a dense shading map. The dense shading map, filled in with the smoothness constraint, is mostly blurry, suffers from scale problems, and lacks crisp geometry changes. (4.) Therefore, the final dense shading map is integrated into a deep learning framework to further refine it and also to predict the reflectance image to achieve full intrinsic image decomposition. In summary, Steps 1 and 2 are computed directly from the corresponding RGB image in a learning-free manner.
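Steps 1 and 2 can be sketched as a masked least-squares integration of the image gradients. The following is a minimal numpy illustration of the idea, not the authors' implementation: the off-the-shelf global least-squares solver is replaced here by a dense `lstsq` on a toy-sized image, and the albedo-transition mask simply zeroes out the untrusted gradient equations.

```python
import numpy as np

def reconstruct_sparse_shading(grad_x, grad_y, albedo_mask):
    """Global least-squares reconstruction of a shading map from its
    gradient fields. Gradient equations that fall on albedo transitions
    (albedo_mask == True) are dropped, so only shading gradients
    constrain the solution (a toy sketch of Steps 1-2).

    Conventions: grad_x[i, j] = S[i, j+1] - S[i, j],
                 grad_y[i, j] = S[i+1, j] - S[i, j]."""
    h, w = albedo_mask.shape
    rows, cols, vals, b = [], [], [], []
    eq = 0
    pid = lambda i, j: i * w + j  # pixel -> unknown index
    for i in range(h):
        for j in range(w):
            if j + 1 < w and not albedo_mask[i, j]:  # horizontal equation
                rows += [eq, eq]; cols += [pid(i, j + 1), pid(i, j)]
                vals += [1.0, -1.0]; b.append(grad_x[i, j]); eq += 1
            if i + 1 < h and not albedo_mask[i, j]:  # vertical equation
                rows += [eq, eq]; cols += [pid(i + 1, j), pid(i, j)]
                vals += [1.0, -1.0]; b.append(grad_y[i, j]); eq += 1
    # Gradients only determine shading up to a constant: anchor S[0, 0] = 0.
    rows.append(eq); cols.append(0); vals.append(1.0); b.append(0.0); eq += 1
    A = np.zeros((eq, h * w))
    A[rows, cols] = vals
    s, *_ = np.linalg.lstsq(A, np.asarray(b), rcond=None)
    return s.reshape(h, w)
```

On a toy image whose gradients come from a smooth shading ramp with an empty mask, this recovers the shading exactly up to the anchored global offset.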
Step 3 involves an optimization process using a single constraint, and is also learning-free.
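A minimal sketch of such a smoothness-based fill-in (our own illustration; the paper's optimization may differ): repeatedly replacing each masked pixel with the average of its 4-neighborhood diffuses the known shading values into the gaps, converging to a smooth interpolation.

```python
import numpy as np

def fill_holes(sparse_shading, hole_mask, n_iter=50):
    """Fill masked-out pixels (albedo transitions) with a shading
    smoothness constraint: each hole pixel is repeatedly replaced by the
    mean of its 4-neighborhood. Known pixels are never modified.
    (np.roll wraps at the borders; fine for interior holes in a sketch.)"""
    s = sparse_shading.astype(float).copy()
    s[hole_mask] = s[~hole_mask].mean()  # crude initialization of the holes
    for _ in range(n_iter):
        avg = (np.roll(s, -1, 0) + np.roll(s, 1, 0) +
               np.roll(s, -1, 1) + np.roll(s, 1, 1)) / 4.0
        s[hole_mask] = avg[hole_mask]  # update only the holes
    return s
```

This simple diffusion also illustrates why the resulting dense map is blurry at the filled-in regions, motivating the learned refinement in Step 4.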
Step 4 is based on an end-to-end deep CNN model trained using supervised learning. We design the CNN model such that the RGB image only refines the initial shading estimation and is not directly involved in the reconstruction phase, to avoid color leakage. The model is provided in Figure 2.
Encoders: Encoder blocks use strided convolution layers for downsampling (4 times). Each convolution is followed by residual blocks (He et al., 2016). The RGB encoder uses 4 consecutive residual blocks, while the shading encoder uses 1 block after each strided convolution. Only the last block of the shading encoder has 4 consecutive residual blocks with different dilation rates. An encoder block is illustrated in Figure 3, with the residual block in Figure 4. Fusion: The last layers of the RGB encoder and the shading encoder are fused with a 1x1 convolution and a contextual attention module to create a bottleneck such that the related RGB features can properly guide the shading estimation. The fusion block is illustrated in Figure 5. Before going into the contextual attention module, RGB features are first fed to a 1x1 convolution kernel for dimension reduction. As a result, the RGB features are fused with the shading features (1) as a (learnable) weighted combination using a 1x1 convolution, and (2) by the contextual attention module. Then, those two strate-
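The weighted-combination part of the fusion can be sketched as follows. A 1x1 convolution over an H×W×C tensor is simply a per-pixel linear map, so a framework-free numpy sketch reduces to a matrix product. The function name and weight shapes below are our own hypothetical choices; the contextual attention branch is omitted.

```python
import numpy as np

def fuse_features(rgb_feats, shading_feats, w_rgb, w_sh):
    """Learnable weighted fusion via 1x1 convolutions: each output pixel
    mixes the RGB and shading feature vectors at that spatial location.
    rgb_feats: (H, W, C_rgb), shading_feats: (H, W, C_sh),
    w_rgb: (C_rgb, C_out), w_sh: (C_sh, C_out) -- hypothetical shapes."""
    # A 1x1 conv is a per-pixel matmul over the channel dimension.
    return rgb_feats @ w_rgb + shading_feats @ w_sh
```

In a real implementation the two weight matrices would be trained jointly with the rest of the network, letting the model learn how strongly the RGB features steer the shading bottleneck.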

Loss Functions
The loss functions used to train the intrinsic images are as follows:

$$\mathcal{L} = \lambda_{pixel}\,\mathcal{L}_{pixel} + \lambda_{gradient}\,\mathcal{L}_{gradient} + \lambda_{dssim}\,\mathcal{L}_{dssim} + \lambda_{perceptual}\,\mathcal{L}_{perceptual} + \lambda_{image}\,\mathcal{L}_{image}, \quad (1)$$

where $\mathcal{L}_{pixel}$ is the pixel-wise reconstruction loss, $\mathcal{L}_{gradient}$ denotes the gradient-wise reconstruction loss, $\mathcal{L}_{dssim}$ assesses the structural dissimilarity, $\mathcal{L}_{perceptual}$ measures the reconstruction distance in several feature spaces of a pre-trained network, and $\mathcal{L}_{image}$ is the image formation loss. Table 1 provides the values of the $\lambda$s as the weighting factors of the loss functions. The individual loss functions are defined as follows. Let $\hat{I}$ be the ground-truth intrinsic image and $I$ be the estimation of the network. Then, the mean squared error (MSE) is defined as follows:

$$\mathcal{L}_{MSE}(\hat{I}, I) = \frac{1}{N} \sum_{c} \sum_{p} \left( \hat{I}_{c,p} - I_{c,p} \right)^2,$$

where $c$ is the color channel index, $p$ denotes the pixel coordinate, and $N$ is the total number of valid pixels. For the albedo estimations $c \in \{R, G, B\}$, whereas the shading is gray-scale (single channel). The scale-invariant MSE loss first scales (with $\alpha$) the estimation $I$, then compares its MSE with the ground truth $\hat{I}$ as follows:

$$\mathcal{L}_{SMSE}(\hat{I}, I) = \mathcal{L}_{MSE}(\hat{I}, \alpha I), \qquad \alpha = \arg\min_{\alpha} \sum_{c} \sum_{p} \left( \hat{I}_{c,p} - \alpha I_{c,p} \right)^2.$$

Following common practice, a weighted combination of MSE and SMSE is used for the pixel-wise reconstruction loss as follows:

$$\mathcal{L}_{pixel}(\hat{I}, I) = \omega_{1}\,\mathcal{L}_{MSE}(\hat{I}, I) + \omega_{2}\,\mathcal{L}_{SMSE}(\hat{I}, I).$$

Then, the gradient-wise reconstruction loss is defined as follows:

$$\mathcal{L}_{gradient}(\hat{I}, I) = \frac{1}{N} \sum_{c} \sum_{p} \left[ \left( \nabla_x \hat{I}_{c,p} - \nabla_x I_{c,p} \right)^2 + \left( \nabla_y \hat{I}_{c,p} - \nabla_y I_{c,p} \right)^2 \right],$$

where $\nabla$ denotes the gradient operation over a pixel, which is the difference between an adjacent and the current pixel. The loss is computed both horizontally ($x$) and vertically ($y$). The structural dissimilarity is derived from the structural similarity index (SSIM) by Wang et al. (2004) to quantify image reconstruction quality degradation as follows:

$$\mathcal{L}_{dssim}(\hat{I}, I) = \frac{1 - \mathrm{SSIM}(\hat{I}, I)}{2},$$

where the SSIM is calculated using the implementation provided by TensorFlow (Abadi et al., 2015). We only modify the filter size of the Gaussian kernel from 11 × 11 to 7 × 7. As the training takes place on object-centered images of size 256 × 256, we intuitively use a smaller kernel. The perceptual loss measures the distance in the feature spaces of a 16-layer VGG network (Simonyan and Zisserman, 2015) trained on the ImageNet dataset (Deng et al., 2009), denoted as $\phi$, as follows:

$$\mathcal{L}_{perceptual}(\hat{I}, I) = \sum_{l} \frac{1}{N_l} \left\| \phi_l(\hat{I}) - \phi_l(I) \right\|_2^2,$$

where $\phi_l$ is the $l$-th feature map and $N_l$ its number of elements. We use the feature maps obtained by the first three blocks of the VGG-16: conv1_1, conv1_2, conv2_1, conv2_2, conv3_1, conv3_2 and conv3_3. Finally, the image formation loss is used to force the estimated reflectance ($R$) and shading ($S$) images to reconstruct the original RGB image as follows:

$$\mathcal{L}_{image}(I_{RGB}, R, S) = \mathcal{L}_{MSE}(I_{RGB}, R \times S).$$

Table 1: Weight values for the loss functions.
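The scale-invariant term admits a closed-form optimal scale, since minimizing $\|\hat{I} - \alpha I\|^2$ over $\alpha$ is a one-dimensional least-squares problem. A minimal numpy sketch (our own illustration, not the authors' code):

```python
import numpy as np

def smse(pred, gt, eps=1e-12):
    """Scale-invariant MSE: solve for the scalar alpha that best aligns
    the prediction with the ground truth in the least-squares sense
    (alpha = <gt, pred> / <pred, pred>), then take the MSE of the
    rescaled prediction."""
    alpha = (pred * gt).sum() / max((pred * pred).sum(), eps)
    return float(((alpha * pred - gt) ** 2).mean())
```

Because of the rescaling, a prediction that is correct up to a global multiplicative factor incurs zero SMSE, which is why it is combined with the plain MSE rather than used alone.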

Training and Implementation Details
The input and output image sizes are fixed to 256 × 256 pixels. Images that are not 256 × 256 are resized to this resolution by bilinear interpolation. We do not apply any pre-processing step and use 8-bit images as both inputs and outputs. The TensorFlow framework is utilized for the experiments (Abadi et al., 2015). Convolution weights are initialized using He initialization (He et al., 2015) without any weight decay. The batch size is fixed to 8 images and the experiments do not include any data augmentation. The Adam optimizer (Kingma and Ba, 2014) is utilized with an initial learning rate of 0.000512. An exponential decay is applied such that the learning rate is lowered every 6000 iterations (approximately one epoch) with a base of 0.94. The models are trained for approximately 35 epochs. The training takes around 2.5 days on a single NVIDIA GeForce GTX TITAN X GPU.
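The decay schedule described above can be written framework-free as follows (a sketch assuming staircase decay, i.e. the rate drops in discrete steps every 6000 iterations, matching the "lowered every 6000 iterations" description):

```python
def learning_rate(step, base_lr=0.000512, decay_steps=6000, decay_rate=0.94):
    """Staircase exponential decay: the learning rate is multiplied by
    `decay_rate` once every `decay_steps` iterations (roughly one epoch)."""
    return base_lr * decay_rate ** (step // decay_steps)
```

For example, at around 35 epochs (~210,000 iterations) the rate has decayed by a factor of 0.94^35, roughly an order of magnitude below the initial value.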

Feature Maps
Finally, we provide the details of the feature maps. For simplicity, D denotes the dilation rate, S the stride size, and C the number of feature maps (channels). All convolutions use 3 × 3 kernels, except those in the fusion module, which use 1 × 1. The RGB encoder is designed to have more feature maps than the shading encoder, because it needs to disentangle both albedo and shading cues.

Fig. 1: Workflow of the proposed method. Steps 1 and 2 are computed directly from the corresponding RGB image in a learning-free manner. Step 3 involves an optimization process using a single constraint, also learning-free. Step 4 is based on an end-to-end deep CNN model trained using supervised learning.

Fig. 2: Proposed model. RGB guides the shading estimation during the fusion phase using a 1x1 convolution and a contextual attention module (Yu et al., 2018). The shading decoder receives the shading encoder features through skip connections so as not to be affected by high-resolution RGB color features. The albedo decoder only receives RGB features through skip connections.

Fig. 5: The Fusion Module for the RGB features to guide the shading features.