Intrinsic Image Decomposition using Physics-based Cues and CNNs: Supplementary Materials

Figure 1: Proposed encoder architecture

1 Architecture details

The encoder configuration is given in table 1, and figure 1 shows an illustration of the encoder. Each convolution is followed by batch normalisation and a ReLU non-linearity. For each input type (RGB image, CCR image and CR), a separate encoder is created based on the detailed configuration. Each encoder provides a bottleneck of 512 feature maps.
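As a sketch, the per-block pattern described above (convolution, batch normalisation, ReLU) could be written in PyTorch as follows. The kernel sizes, strides and intermediate channel widths here are illustrative assumptions, since the exact values are those of table 1; only the conv-BN-ReLU pattern, the per-input-type instantiation and the 512-channel bottleneck come from the text.

```python
import torch
import torch.nn as nn

def enc_block(c_in, c_out):
    """Conv -> BatchNorm -> ReLU, the pattern used for every encoder convolution."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class Encoder(nn.Module):
    """One encoder; a separate instance is created per input type (RGB, CCR, CR).
    The channel widths below are assumptions -- only the 512-channel bottleneck
    is taken from the text."""
    def __init__(self, c_in=3, widths=(64, 128, 256, 512)):
        super().__init__()
        blocks, c = [], c_in
        for w in widths:
            blocks.append(enc_block(c, w))
            c = w
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        skips = []
        for blk in self.blocks:
            x = blk(x)
            skips.append(x)   # kept for the skip connections to the decoder
        return x, skips       # x is the 512-channel bottleneck
```

A usage example: an RGB batch of shape (B, 3, 64, 64) produces a (B, 512, 4, 4) bottleneck plus one skip tensor per block.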
The decoder configuration is listed in table 2 and illustrated in figure 2. The decoder is parametrically simpler: instead of mirroring the encoder, each block consists of a single transposed convolution followed by batch normalisation and a ReLU non-linearity. This is repeated for every block of the decoder, except for the last layer. There, instead of upsampling, a convolution is used to condition the feature maps at the same spatial resolution before producing the final output. Two decoders are used, corresponding to the intrinsic components: the reflectance and the shading. In the shading decoder, the last convolution outputs a single channel (for the white light source assumption). Skip connections from the encoder blocks are connected to the corresponding decoder blocks; for each decoder block, the output feature maps are depth-wise concatenated before being passed to the next decoder block. The network is optimised with the Adam optimizer and a learning rate of 2e−4. The reflectance stack is pre-trained for 250k iterations, and the stack is then trained for a total of 750k iterations.
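A minimal PyTorch sketch of the decoder pattern just described: one transposed convolution per block, depth-wise concatenation of the encoder skip maps, and a plain final convolution. The kernel sizes, strides and widths are illustrative assumptions (the exact configuration is in table 2); only the block structure, the skip concatenation and the single-channel shading output follow the text.

```python
import torch
import torch.nn as nn

def dec_block(c_in, c_out):
    """Transposed conv -> BatchNorm -> ReLU, repeated for every decoder block."""
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class Decoder(nn.Module):
    """Illustrative decoder; widths are assumptions. out_channels is 3 for the
    reflectance decoder and 1 for the shading decoder (white-light assumption)."""
    def __init__(self, widths=(512, 256, 128, 64), out_channels=3):
        super().__init__()
        self.blocks = nn.ModuleList([
            dec_block(2 * w_in, w_out)   # x2: skip maps are depth-wise concatenated
            for w_in, w_out in zip(widths[:-1], widths[1:])
        ])
        # last layer: a plain convolution at the same spatial resolution
        self.head = nn.Conv2d(widths[-1], out_channels, kernel_size=3, padding=1)

    def forward(self, x, skips):
        # skips: encoder feature maps, shallowest first (deepest used first here)
        for blk, skip in zip(self.blocks, reversed(skips)):
            x = blk(torch.cat([x, skip], dim=1))  # fuse the skip connection
        return self.head(x)
```

Setting `out_channels=1` yields the single-channel shading head; `out_channels=3` yields the reflectance head.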

2 Loss Functions & Evaluation Metrics
The objective is to model the image formation

I(x) = R(x) × S(x)     (1)

where R is the reflectance image, corresponding to the (albedo) color of the object, and S is the shading component associated with the geometry and illumination. Following the work of Shi et al. (2017), a combination of a scale-invariant mean squared error (SMSE) and a standard L2 reconstruction loss is used to train the network:
L_MSE(J, Ĵ) = (1/N) Σ_{i,c} (J(x_i, c) − Ĵ(x_i, c))²     (2)

where x_i denotes the i-th pixel coordinate, c the color channel index and N the total number of pixels. Ĵ is the ground-truth intrinsic image and J the prediction. The SMSE version of the loss scales the prediction before comparing it to the ground truth:

L_SMSE(J, Ĵ) = L_MSE(αJ, Ĵ),  where α = argmin_α ||Ĵ − αJ||²     (3)

The losses (2) and (3) are combined into

L(J, Ĵ) = L_SMSE(J, Ĵ) + λ L_MSE(J, Ĵ)     (4)

An instance of eq. (4) is used for each of the predicted reflectance, shading, reconstruction and the CCR obtained from the reflectance. Hence, the total loss for the network becomes

L_net = L_ref + L_shd + L_rec + L_ccr + E_R     (5)

where L_net is the final loss for the network, L_ref is the reflectance loss, L_shd is the shading loss, L_rec is the reconstruction loss, L_ccr is the CCR loss and E_R is the edge regularisation (detailed in the main paper). The reconstruction is obtained by combining the predicted reflectance and shading using eq. (1); the reconstruction loss further enforces the interdependence of the intrinsic components. Following standard practice, MSE (eq. (2)), local MSE (LMSE) and DSSIM are reported as evaluation metrics. For LMSE, a window size of 20 × 20 is used.
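The MSE/SMSE pair can be sketched in NumPy as follows. The closed-form least-squares scale α = ⟨J, Ĵ⟩ / ⟨J, J⟩ follows directly from minimising ||Ĵ − αJ||²; the weighting `lam` in the combined loss is an assumption, as its value is not stated in this excerpt.

```python
import numpy as np

def mse(pred, gt):
    """Plain L2 reconstruction loss: mean squared error over all pixels/channels."""
    return np.mean((pred - gt) ** 2)

def smse(pred, gt):
    """Scale-invariant MSE: rescale the prediction by the least-squares optimal
    alpha before comparing to the ground truth (following Shi et al., 2017)."""
    alpha = np.sum(pred * gt) / np.maximum(np.sum(pred ** 2), 1e-12)
    return mse(alpha * pred, gt)

def combined_loss(pred, gt, lam=0.5):
    """Combination of SMSE and MSE; the weighting lam is an illustrative
    assumption, not a value from the paper."""
    return lam * smse(pred, gt) + (1 - lam) * mse(pred, gt)
```

Note that `smse` is zero for any prediction that differs from the ground truth only by a global scale, which is exactly the ambiguity the plain MSE term would otherwise penalise.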

3 Ablation Study
An ablation study on the proposed network is presented to justify the design choices. Since the physics-based priors can be directly calculated from the RGB image, they can also be modelled as losses: outputs that yield the same IID components of the image should result in the same CR and CCR descriptors. Specifically, the following configurations are studied: i) the influence of the physics priors, ii) the influence of CR and CCR as losses and iii) the influence of the stack and the curriculum. The details of the network setups are tabulated in table 3, and the results are shown in table 4. For the 'Non-Stacked' experiment, the reflectance and shading are decoded simultaneously, similar to the classical deep learning approach to IID. All the variants are trained on the NED dataset and, for fairness, for the same number of iterations. It is observed that adding the priors improves the performance. However, using CR and CCR as losses deteriorates the performance. This is because CR and CCR depend only on RGB values, which can still be correct even if the reflectance and shading are not, due to the commutative property of eq. (1). As a result, the CR and CCR losses do not necessarily signal errors in the predictions of the intrinsic components.
Removing the stack is shown to deteriorate the performance, since the explicit inter-dependence of the components is lost. Similarly, the order of the decomposition is also shown to be important: since CR is not a geometry-independent descriptor, geometry may leak into the estimate. This degrades the reflectance, which uses the shading as an additional prior in the stack.
Removing the curriculum also deteriorates the performance: since the reflectance is not pre-trained, the shading is not able to make use of correct reflectance cues. The proposed method addresses all these problems and achieves better scores with the same number of iterations. This shows that the (generalised) invariant representation and component separation are beneficial in modelling the image formation process, validating the design choices.