Joint Deep Estimation of Intrinsic and Dichromatic Image Decomposition

This paper proposes an image formation model that jointly combines the dichromatic and intrinsic image decomposition models. The two decomposition models analyze the image formation process from different perspectives and can be combined synergistically. We confirm that the joint estimation performs better than the individual decompositions. To the best of our knowledge, the joint estimation and the study of the decomposition order (‘intrinsic + dichromatic’ versus ‘dichromatic + intrinsic’) are the first of their kind. Experimental evaluations confirm that the proposed ‘intrinsic + dichromatic’ order is the better choice. We also exploit the temporal property of AC light sources, which further improves the decomposition performance. The experimental results show that the proposed model achieves accurate image decomposition and remarkable color constancy performance.


I. INTRODUCTION
There are two image decomposition models that inversely recover the color image formation process from an observed image: the intrinsic image decomposition model and the dichromatic model, both of which describe the properties of surface reflection. The former assumes that the image can be expressed as the pixel-wise product of reflectance (R) and illumination (L), I = R ⊗ L, where ⊗ denotes pixel-wise multiplication [1]. The latter assumes that the reflected light is the sum of diffuse (D) and specular (S) reflection, I = D + S [2]. The image formation of the two models is illustrated in Fig. 1.
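For illustration, the two forward models can be written directly; the following is a minimal sketch with numpy-style arrays, where the function names and shapes are illustrative and not the paper's implementation:

```python
def intrinsic_forward(R, L):
    """Intrinsic model: I = R (x) L, a pixel-wise product of reflectance and illumination."""
    return R * L

def dichromatic_forward(D, S):
    """Dichromatic model: I = D + S, the sum of diffuse and specular reflection."""
    return D + S
```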
These decompositions are among the most fundamental tasks in the computer vision and graphics communities. The two models are closely related to each other in that they both deal with surface reflection in the image formation process. Because these models describe the reflective interaction between objects and illumination well, they have been widely exploited for image quality enhancement. The dichromatic model is useful for color constancy [3], [4], [5] and highlight removal [6], while the intrinsic model is used for low-light enhancement [7] and relighting [8], [9]. However, the two decomposition models have fundamental limitations in unveiling the image formation process thoroughly. The existing intrinsic model assumes a Lambertian surface and thus works poorly for real scenes with highlight or saturation. The dichromatic model focuses on surface reflection only and has difficulty in obtaining intrinsic object characteristics. For shaded regions, it is hard to recover chromaticity, unlike the intrinsic model, as shown in Fig. 1. In other words, the two models look into the image formation process from different perspectives and can be combined synergistically for understanding its details.
Therefore, we propose to jointly learn the dichromatic and intrinsic models in order to accurately separate the reflection components from the observed image. This enables a deeper understanding of the details of the color image formation process. The proposed network learns the two models together and decomposes an input image in two ways simultaneously. Because the two inverse problems are highly ill-posed, the simultaneous learning further improves the accuracy of the model estimation compared with individual learning. The original intrinsic model often approximates the reflection component by diffuse reflection and neglects specular reflection. Recently, it has been extended by considering the specular component as an additive residual term [10], [11]. The extended model first removes highlight and is then followed by the conventional intrinsic decomposition. In the proposed method, however, intrinsic decomposition is performed first, and the separated reflectance is further decomposed into the diffuse and specular components. This follows the sequence of the imaging process, in which incident light is reflected on a surface in two ways (diffuse and specular) [3], [12]. Our work thoroughly studies the order of the decompositions (i.e., 'intrinsic + dichromatic' or 'dichromatic + intrinsic'). Experimental evaluations confirm that the proposed 'intrinsic + dichromatic' order performs better. They also show that the joint decomposition outperforms individual decomposition. To the best of our knowledge, the joint estimation and the study of the decomposition order are the first of their kind.
Estimating the two models from a single image is a highly ill-posed problem. Conventional methods [7], [13], [14], [15], [16], [17], [18], [19] assume a white-light environment for simplicity, whereas the proposed method also estimates illuminant chromaticity, which is more general and practical. As reported in previous works, the sinusoidal intensity variation of AC (alternating current) powered light sources can be an important clue for illuminant chromaticity estimation and image decomposition [5], [6], [20]. However, previous studies exploit this prior only as a cost term of the deep network. In the proposed method, knowledge distillation [21], [22], [23], [24], [25] is used to exploit the temporal feature more efficiently: the features of a teacher network that learns temporal variation are transferred to a student network that learns image decomposition. By leveraging the AC variation, the proposed network achieves better image decomposition performance.

II. RELATED WORKS
A. INTRINSIC IMAGE DECOMPOSITION
Although the intrinsic image decomposition has recently been extended to I = R ⊗ L + S (Lambertian shading L, reflectance R, and specularity S) [10], [11], [26], many previous works decompose an image into two components (R and L) by ignoring specularity for simplicity [27], [28], [29], [30]. Therefore, intrinsic image decomposition in this paper means separation into reflectance and illumination. Intrinsic image decomposition is an ill-posed problem, and several priors have been studied in conventional methods. One widely used prior is the Retinex model [31]. Recently, deep learning based intrinsic decomposition has been actively studied and achieves superior performance. However, these models are trained in a supervised manner and require ground truth of the intrinsic decomposition for training [32], [33], [34], [35], [36]. Because they use synthetic images that are far from real scenes, there is a fundamental limitation from the perspective of practical applications. Although human-labeled real-world datasets (IIW [37] and SAW [38]) have been created, their annotations are sparse and difficult to collect on a large scale [19]. On the other hand, several studies utilize time-lapse sequences [18], [39]. They assume varying illumination and constant reflectance in a scene. However, they require a large number of images and often fail in indoor scenes, because the assumption holds primarily outdoors [39]. Also, the conventional intrinsic model is generally inadequate for highlight, strong shadow regions, and colored illumination [39]. The lack of separation between specular and diffuse reflection makes the model work poorly for strongly-illuminated objects, leading to distorted visual quality (Fig. 8 (b-g)). The proposed method attempts to overcome these challenges by jointly combining intrinsic decomposition, dichromatic decomposition, and color constancy in a cooperative way.

B. DICHROMATIC MODEL BASED DECOMPOSITION
Dichromatic model based decomposition separates diffuse and specular reflection from an input image, and due to its ill-posed nature, various priors have been investigated. The thresholded Value in the HSV color space [40], [41] and the minimum intensity among the RGB channels of each pixel [42], [43] have been explored as priors for specular reflection. Several methods use a color dictionary to recover the chromaticity of diffuse reflection, based on the assumption that diffuse color can be expressed as a linear combination of some representative colors [44], [45], [46]. Other studies use multiple images captured from different viewpoints or with different light source directions [47], [48], [49], [50], [51], whereas our proposed method has no constraint on the positions of the camera and the light source. In these multiple-image based methods, including the proposed one, constant diffuse chromaticity becomes a useful prior. While many conventional methods rely only on spatial features of images, the proposed method utilizes both temporal and spatial features obtained from natural high-speed video frames.
A few studies have exploited temporal features for dichromatic decomposition. The works in [6], [20], and [52] use the intensity fluctuation of AC lights captured in high-speed video. Tsuji [20] assumes a linear relation between the minimum and maximum luminance of a high-speed video. Yoo et al. [6] proposed a deep network that estimates all parameters of the dichromatic model including the illuminant color. Ha et al. [52] introduced a new temporal dark prior for dichromatic model based decomposition. However, Tsuji [20] still shows color distortion in strong highlight regions, which is a common problem of highlight removal. Also, Yoo et al. [6] has a limitation in diffuse color recovery, as other dictionary based methods do.

C. MODEL IMPROVEMENT
Because intrinsic decomposition commonly assumes a Lambertian surface, specularity is hardly considered [10], [11], [26]. Cheng et al. [26] simply assume that diffuse reflection is dominant over specular reflection, so the specular term can be neglected. A couple of previous works [10], [11] then extended the intrinsic decomposition to accommodate specularity for highlight removal, extending (1) to I = R ⊗ L + S (4). Modeling as in (4) is similar to the proposed method in that it combines the intrinsic model with the dichromatic one. However, it merely treats the separation of the specular component as a preprocessing step for intrinsic decomposition of a Lambertian surface input, whereas we model the image formation process by closely combining both models.

III. THE PROPOSED METHOD
A. THE PROPOSED IMAGE FORMATION MODEL
The total reflected light at a surface point x, observed from viewpoint ω_{p*} under incident light L_i(x, ω_i) with incident direction ω_i, can be expressed as

I(x, ω_{p*}) = ∫_{Ω+} f_r(x, ω_i, ω_{p*}) L_i(x, ω_i) (n · ω_i) dω_i,   (5)

where Ω+ denotes the positive hemisphere over which the whole incident light is sampled and n is the surface normal. In (5), f_r(x, ω_i, ω_{p*}) is the bidirectional reflectance distribution function (BRDF), which is the fraction of radiance reflected toward the direction ω_{p*} for each incident direction ω_i. If a non-Lambertian surface is considered more generally, the BRDF in (5) is extended to the sum of a diffuse isotropic lobe (f_d) and a specular lobe (f_s) [53]:

f_r(x, ω_i, ω_{p*}) = f_d(x) + f_s(x, ω_i, ω_{p*}).   (6)

To derive the proposed image decomposition model, first assume a single-ray environment as in Fig. 2 (a). Under the non-Lambertian assumption, the reflected radiance caused by a single ray that comes through the i-th segment of Ω+ (denoted by A_i) can be expressed as

I_i(x, ω_{p*}) = ∫_{A_i} (f_d + f_s) L_i(x, ω_i) (n · ω_i) dω_i.   (7)

Diffuse reflection does not depend on the incident direction, so f_d has a constant value α_d. Also, the specular intensity observed at ω_{p*} can be assumed to have a constant reflectance (α_{s,i}) in a single-ray environment. Then, the reflected radiance in (7) can be re-expressed as

I_i(x, ω_{p*}) = (α_d + α_{s,i}) L_{A_i},   (8)

where L_{A_i} is the total amount of light incident on A_i. Because of the directional property of specular reflection, the specular reflection observed at ω_{p*} is generated only by incident light that comes through the small area A_p. Therefore, the specular reflectance for incident directions in A_p is assumed to be constant (α_s), and 0 for the other directions:

α_{s,i} = α_s if ω_i ∈ A_p, and α_{s,i} = 0 otherwise.   (9)

For N multiple incident rays (Fig. 2 (b)), each ray generates diffuse and specular reflection, and the total reflected light is the sum over all segments:

I(x, ω_{p*}) = Σ_{i=1}^{N} (α_d + α_{s,i}) L_{A_i}.   (10)

Then, by the relation between α_{s,i} and ω_i in (9),

I(x, ω_{p*}) = α_d Σ_{i=1}^{N} L_{A_i} + α_s Σ_{ω_i ∈ A_p} L_{A_i}   (11)
            = (α_d + k α_s) L_t,   (12)

where k is the ratio of the incident light through A_p to that through the whole positive hemisphere Ω+, and L_t is the total illumination. From (12), we derive the proposed joint decomposition model as

I = (D_R + S_R) ⊗ L.   (13)

With the proposed model, the image (I) is decomposed into diffuse reflectance (D_R), specular reflectance (S_R), and illumination (L). The specular reflectance is given by kα_s, i.e., the ratio between the specular reflection and the whole incident light, so it is highly affected by the viewing direction, while the diffuse reflectance does not depend on the direction. The reflectance in the conventional intrinsic image decomposition model under the Lambertian assumption corresponds to the diffuse reflectance of the proposed model. Since the intrinsic model assumes Lambertian reflection, the specular term is ignored; in the proposed decomposition model, it is retained as the specular reflectance, the ratio between the incident illumination and the specular reflection. Many conventional works have dealt with the decomposition of the image formation process, which is a crucial part of high-quality imaging. The intrinsic model mainly describes the reflective phenomenon of incident light, while the dichromatic model further analyzes the reflection into diffuse and specular components. The previously improved model in (4) primarily concentrates on intrinsic decomposition, and specular reflection is merely added to it. In contrast, we estimate both models simultaneously in a single deep network, aiming at a decomposition accuracy better than the individual model estimation.
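As a minimal sketch, the proposed forward model in (13) and its link to the dichromatic components amount to one line of element-wise arithmetic; the array names below are illustrative only:

```python
def joint_forward(D_R, S_R, L):
    """Proposed joint model, Eq. (13): I = (D_R + S_R) (x) L.
    D_R: diffuse reflectance (alpha_d), S_R: specular reflectance (k * alpha_s),
    L: illumination (L_t). The dichromatic components then follow as
    D = D_R * L and S = S_R * L."""
    return (D_R + S_R) * L
```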
Following the order of the imaging process, intrinsic image decomposition is performed first, and the resulting reflectance component is further separated into the diffuse and specular components based on the dichromatic model. This sequence of image decompositions matches the reflection flow of the incident light during image formation. The integrated model is cooperatively learned within a single deep network for more accurate decomposition. Fig. 3 shows the overall network structure of the proposed method. The proposed network consists of a Temporal Feature Network (Teacher, TF-Net) and an Image Decomposition Network (Student, ID-Net). The ID-Net consists of the Illumination and Reflectance subnets, which learn the features of illumination and reflectance, respectively. The subnets adopt a convolutional auto-encoder structure based on VGG16 as in [54]. In most conventional teacher-student learning studies [21], [22], [23], [24] for network compression, the student network is a lightweight version of the teacher network. However, the proposed method treats the teacher network as a temporal feature extractor, and a VGG16 based auto-encoder identical to the ID-Net is used.

B. NETWORK STRUCTURE
TF-Net learns temporal features by estimating the AC fitting map, M_AC, generated from high-speed frames. The intensity variation under an AC light source can be modeled as a sine curve [5], [6], [55]. Fig. 4 shows the estimated sine curves of a highly illuminated region (red) and a weakly illuminated region (blue). The more a region is affected by the AC light sources, the larger the observed intensity variation. Therefore, the AC fitting map is generated from the amplitude of each pixel's temporal variation, and it reflects the effect of the illumination. Fig. 5 shows examples of an input video and its M_AC. By training TF-Net to estimate the AC fitting map from N frames of a high-speed video, it learns the temporal variation of the incident light. By transferring these features to the ID-Net, the temporal feature can be exploited more effectively than by reflecting it in the cost function only.
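One possible way to build such an amplitude map is sketched below, assuming the flicker frequency is known and using a per-pixel linear least-squares sine fit; the exact procedure and the parameter names are assumptions, not the paper's implementation:

```python
import numpy as np

def ac_fitting_map(frames, fps, ac_hz=60.0):
    """Sketch: per-pixel amplitude of the sinusoidal intensity variation caused by an
    AC light source (flicker at twice the mains frequency).
    frames: (N, H, W) grayscale high-speed frames; returns an (H, W) amplitude map."""
    N, H, W = frames.shape
    t = np.arange(N) / fps
    w = 2.0 * np.pi * (2.0 * ac_hz)  # flicker frequency = 2 x AC frequency
    # Linear least-squares fit of a*sin(wt) + b*cos(wt) + c for every pixel.
    A = np.stack([np.sin(w * t), np.cos(w * t), np.ones_like(t)], axis=1)  # (N, 3)
    Y = frames.reshape(N, -1)                                              # (N, H*W)
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)                           # (3, H*W)
    amplitude = np.sqrt(coef[0] ** 2 + coef[1] ** 2)
    return amplitude.reshape(H, W)
```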
The proposed method takes a high-speed video as input and estimates the illuminant chromaticity, the achromatic illumination component, and the specular and diffuse components (corresponding to the reflectance in (1)). The input image I_t is the t-th frame of the input video, and every frame of the input video is sequentially fed into the Illumination encoder and the Reflectance encoder. N is the number of frames of the input video. For the t-th input frame, the illumination L_gray,t and its chromaticity, the specular component of reflectance S_R,t, and the diffuse component of reflectance D_R,t are generated by the proposed network.
The Illumination subnet estimates the illumination L_gray,t and its chromaticity. Motivated by FC4 [56], the proposed method estimates the illuminant as a weighted sum of local illuminants with a confidence map. The Illuminant estimation decoder generates a 4-channel output representing the local illuminants and the confidence map at 1/8 of the input resolution. Note that a truncated decoder is used for the Illuminant estimation decoder in order to efficiently generate a single illuminant RGB. Recall that the proposed method considers chromatic illumination. Since chromatic illumination is likely to lead to inaccurate prediction of the reflectance color [39], the input image is white-balanced with the predicted illuminant, yielding I_wb, which is then put into the Reflectance subnet. The Reflectance subnet separates the diffuse and specular components from the reflectance. As illustrated in Fig. 3, the Reflectance decoder outputs 32 channels of features (F), which go to two different convolutional blocks. The concatenation of F and the white-balanced input goes through convolutional blocks, producing the estimate of the specular component, S_R,t. Then, the concatenation of F, S_R,t, and I_wb is fed into another set of convolutional blocks, which generates the diffuse component, D_R,t. The prediction of S_R,t is followed sequentially by that of D_R,t, which is originally inspired by [57].
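The two-branch output head of the Reflectance subnet described above could look roughly as follows in PyTorch; everything beyond the 32 decoder feature channels and the concatenation order (kernel sizes, depths, activations) is an assumption for illustration:

```python
import torch
import torch.nn as nn

class ReflectanceHeads(nn.Module):
    """Sketch of the two output branches: specular first, then diffuse conditioned on it."""
    def __init__(self, feat_ch=32):
        super().__init__()
        # specular branch: decoder features F concatenated with the white-balanced input
        self.spec_head = nn.Sequential(
            nn.Conv2d(feat_ch + 3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, padding=1))
        # diffuse branch: F, the predicted specular map, and the white-balanced input
        self.diff_head = nn.Sequential(
            nn.Conv2d(feat_ch + 3 + 3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, F, I_wb):
        S_R = self.spec_head(torch.cat([F, I_wb], dim=1))       # specular reflectance first
        D_R = self.diff_head(torch.cat([F, S_R, I_wb], dim=1))  # then diffuse reflectance
        return S_R, D_R
```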
As explained above, the proposed network finally estimates the diffuse and specular reflection of the dichromatic model as well as the reflectance and illumination of the intrinsic decomposition. This is clearly confirmed by deriving the dichromatic model directly from (13), I = (D_R + S_R) ⊗ L = D_R ⊗ L + S_R ⊗ L = D + S, where D and S indicate the diffuse and specular components of the dichromatic model.

C. LOSS FUNCTIONS
To train the network, several losses that reflect the characteristics of the two models are exploited. The network is trained with a weighted sum of the losses, L_total = λ_recon L_recon + λ_CC L_CC + λ_invar L_invar + λ_smooth L_smooth + λ_AC L_AC + λ_KD L_KD.
The sub-losses L_recon, L_CC, L_invar, L_smooth, L_AC, and L_KD denote the reconstruction, color constancy, invariance, smoothness, AC fitting, and knowledge distillation losses, respectively.
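In code form, the total objective is simply a weighted sum of these six terms; the weight values are unspecified hyper-parameters:

```python
def total_loss(losses, weights):
    """Weighted sum of the six sub-losses (dicts keyed by loss name)."""
    return sum(weights[k] * losses[k]
               for k in ("recon", "CC", "invar", "smooth", "AC", "KD"))
```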

1) RECONSTRUCTION LOSS
Based on the proposed decomposition model, a target frame I_t should be equal to the frame reconstructed from the network outputs according to (13). For saturated pixels, the reconstructed value can exceed 255, which may cause the network to be trained with an inaccurate reconstruction loss. Therefore, the reconstruction loss is computed only on non-saturated regions:

L_recon = Σ_i Σ_j α_ij ||(1 − M_sat) ⊗ (I_i − (D_R,j + S_R,j) ⊗ L_i)||,

where α_ij is 1 for i = j and smaller than 1 otherwise, and M_sat is the saturated region mask. The objects in a scene and the camera are assumed to be static in the input video, which is quite reasonable because the time interval between the high-speed video frames is very short. Thus, the reflectance of all input frames should be constant, and the frame reconstructed with illumination L_i and the reflectance components (S_R,j and D_R,j) should be the same as the input frame I_i.
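A hedged sketch of this cross-frame, saturation-masked reconstruction loss follows; the choice of the L1 norm and the exact α_ij weighting scheme are assumptions:

```python
import torch

def reconstruction_loss(I, L, D_R, S_R, sat_mask, alpha):
    """Sketch: reconstruct frame i from frame-j reflectance and frame-i illumination,
    penalizing the difference on non-saturated pixels only.
    I, L, D_R, S_R: lists of per-frame tensors; sat_mask: saturated-region mask in [0, 1];
    alpha: NxN frame-pair weights (1 on the diagonal, <1 elsewhere)."""
    N = len(I)
    loss = 0.0
    for i in range(N):
        for j in range(N):
            recon = (D_R[j] + S_R[j]) * L[i]          # Eq. (13) with frame-i illumination
            diff = (1.0 - sat_mask) * (I[i] - recon)  # ignore saturated pixels
            loss = loss + alpha[i][j] * diff.abs().mean()
    return loss
```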
To find saturated pixels, previous studies [6], [40], [41] depend only on pixel intensity, whereas our proposed method additionally leverages a temporal constraint. Under AC light sources, the intensity of a saturated pixel is constant, while a non-saturated pixel varies sinusoidally. Hence, saturated pixels have near-zero temporal gradients, which is used to determine the saturated regions. A pixel with a high intensity I(i) and a small temporal gradient TG(i) is determined to be saturated:

M_sat(i) = 1 if I(i) > Th_1 and TG(i) < Th_2, and M_sat(i) = 0 otherwise,

where Th_1 and Th_2 are the threshold values for the intensity and the temporal gradient, and i is a pixel index.
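A minimal sketch of this saturation test; the threshold values below are illustrative assumptions for 8-bit intensities:

```python
import numpy as np

def saturation_mask(frames, th_intensity=250.0, th_grad=1.0):
    """Mark a pixel as saturated when its intensity is high and its temporal
    gradient over the high-speed frames is near zero.
    frames: (N, H, W) grayscale frames; returns a boolean (H, W) mask."""
    temporal_grad = np.abs(np.diff(frames, axis=0)).mean(axis=0)  # mean |I_{t+1} - I_t|
    high_intensity = frames.max(axis=0) > th_intensity
    return high_intensity & (temporal_grad < th_grad)
```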

2) COLOR CONSTANCY LOSS
Unlike other dichromatic and intrinsic decomposition studies, our proposed model does not assume gray illumination, so estimating the illumination chromaticity is crucial. As the loss for illuminant color estimation, the angular error, which is the common quality measure of color constancy, is exploited. The angular error between the estimated illuminant ℓ_i and the ground-truth illuminant ℓ_gt is expressed as L_CC = arccos( (ℓ_i · ℓ_gt) / (||ℓ_i|| ||ℓ_gt||) ).
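For reference, the angular error between two illuminant RGB vectors can be computed as follows:

```python
import numpy as np

def angular_error_deg(est, gt):
    """Angular error (degrees) between estimated and ground-truth illuminant vectors."""
    cos = np.dot(est, gt) / (np.linalg.norm(est) * np.linalg.norm(gt) + 1e-9)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```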

3) INVARIANT LOSS
As described in 'Reconstruction loss', the reflectance should be constant over all frames. The invariance of the diffuse reflectance is enforced with an L1 loss between the diffuse reflectance maps estimated for different frames.
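A sketch of such an invariance penalty, here measured against the mean reflectance over the frames; the exact frame pairing used in the paper is an assumption:

```python
import torch

def invariant_loss(D_R_frames):
    """L1 penalty keeping diffuse reflectance constant across the N frames."""
    D = torch.stack(D_R_frames)            # (N, B, 3, H, W)
    mean_R = D.mean(dim=0, keepdim=True)   # per-pixel mean reflectance over frames
    return (D - mean_R).abs().mean()
```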

4) SMOOTH LOSS
According to the Retinex model, the illumination should be spatially smooth [59]. Large image gradients come from reflectance variations, while small gradients are mostly related to the illumination. To reflect this property, a TV-L2 loss is applied to the illumination [59]. Also, specular reflection is spatially smooth on surfaces [60], so a TV-L2 loss is applied to the specular component as well. These smoothness terms help extract reflection components closer to the ground truth, and L_smooth is the sum of the two TV-L2 terms.
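A sketch of the two TV-L2 terms; the relative weights are assumptions:

```python
import torch

def tv_l2(x):
    """Total-variation (L2) penalty on spatial gradients of a (..., H, W) tensor."""
    dx = x[..., :, 1:] - x[..., :, :-1]
    dy = x[..., 1:, :] - x[..., :-1, :]
    return (dx ** 2).mean() + (dy ** 2).mean()

def smooth_loss(L_gray, S_R, w_L=1.0, w_S=1.0):
    """Smoothness applied to the illumination and to the specular reflectance."""
    return w_L * tv_l2(L_gray) + w_S * tv_l2(S_R)
```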

5) AC FITTING LOSS
As the input high-speed video is captured under AC light sources, the intensity of the incident light varies sinusoidally at twice the AC mains frequency, and the reflected light fluctuates accordingly. This periodic variation is fitted with the Gauss-Newton method [61], and the regression error is used as the AC fitting loss. The mean values of all the illumination frames are fitted with a sinusoidal function as in [6].
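A sketch of this fitting loss, with scipy's least_squares standing in for the Gauss-Newton solver; the parameterization of the sinusoid is an assumption:

```python
import numpy as np
from scipy.optimize import least_squares

def ac_fitting_loss(mean_illum, fps, ac_hz=60.0):
    """Fit a sinusoid at twice the AC frequency to the per-frame mean of the
    estimated illumination and use the residual as the loss.
    mean_illum: length-N array of per-frame mean illumination values."""
    t = np.arange(len(mean_illum)) / fps
    w = 2.0 * np.pi * (2.0 * ac_hz)

    def residual(p):
        a, phi, c = p
        return a * np.sin(w * t + phi) + c - mean_illum

    p0 = np.array([mean_illum.std(), 0.0, mean_illum.mean()])
    fit = least_squares(residual, p0)
    return np.mean(fit.fun ** 2)  # regression error used as the AC fitting loss
```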

6) KNOWLEDGE DISTILLATION LOSS
The teacher network is pretrained with an MSE loss between the estimated AC fitting map and its ground truth, and it is not updated while training the student network. The down-sampling layers of the networks are used as the breakpoints for feature transfer. Since the tasks of the teacher and the student are different, not every feature channel of TF-Net is equally beneficial. Therefore, we use a meta-network to decide which channels of the teacher network are useful for the ID-Net. The features produced by the meta-network are transferred to the student with an MSE loss.
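A sketch of this channel-selective distillation; modeling the meta-network as a 1x1 convolution with sigmoid gating is an assumption, not the paper's exact design:

```python
import torch
import torch.nn as nn

class ChannelMeta(nn.Module):
    """Learn per-channel weights that decide which teacher (TF-Net) feature
    channels are transferred to the student (ID-Net)."""
    def __init__(self, channels):
        super().__init__()
        self.select = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, teacher_feat):
        return self.select(teacher_feat) * teacher_feat

def kd_loss(teacher_feats, student_feats, metas):
    """MSE between meta-weighted teacher features and student features at each breakpoint."""
    return sum(((m(t) - s) ** 2).mean()
               for t, s, m in zip(teacher_feats, student_feats, metas))
```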

7) NETWORK TRAINING
The proposed network is trained with two types of losses: temporal and spatial. The temporal losses (L_AC and L_invar) require the outputs of all input frames, while the spatial losses (L_smooth and L_CC) are calculated for each frame. Therefore, the network is updated once per video sequence (N frames), not per frame.
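A sketch of one per-sequence update step; compute_all_losses is a hypothetical helper that gathers the sub-losses above, and total_loss is the weighted sum sketched earlier:

```python
def train_step(model, frames, optimizer, weights):
    """Decompose all N frames first so the temporal losses (L_AC, L_invar) can be
    computed, then back-propagate the weighted total loss once per sequence."""
    outputs = [model(f) for f in frames]           # per-frame illumination, S_R, D_R, ...
    losses = compute_all_losses(outputs, frames)   # hypothetical: spatial + temporal terms
    loss = total_loss(losses, weights)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```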

IV. EXPERIMENTAL RESULTS
A. COMPARISONS WITH CONVENTIONAL METHODS
The performance was compared with several conventional methods for dichromatic decomposition, intrinsic image decomposition, and color constancy. Since the high-speed video dataset has no ground truth for dichromatic and intrinsic decomposition, a qualitative comparison is made on it. A quantitative evaluation of highlight removal was conducted on SHIQ [57].

1) DICHROMATIC MODEL RESULT
The proposed method is compared with dichromatic model based methods: Akashi et al. [13], Yamamoto et al. [14], Yang et al. [58], Fu et al. [16], and JSHDR [57], which are single-image approaches, and Tsuji [20] and DDME [6], which are multiple-image approaches that exploit high-speed video captured under AC light sources (close to our method). Since the conventional methods except DDME [6] assume a gray illumination condition, the image white-balanced with the ground-truth illuminant is used as their input. Since JSHDR [57] is a supervised method and there is no ground truth in the high-speed video dataset, its model is trained with the same losses as the proposed method in an unsupervised manner; the network structure of [57] was not changed. Note that the learning-based models (JSHDR, DDME, and the proposed method) are trained with the high-speed video dataset. Fig. 6 compares the diffuse reflection component. As shown in the red boxed regions, which have strong specularity, the conventional methods suffer from color distortion or fail to remove the highlight properly, while the proposed method successfully reconstructs the inherent color. The methods that exploit both temporal and spatial features outperform the single-image methods that use only spatial features. As mentioned in the Introduction, the dichromatic model has a fundamental limitation in reconstructing the chromaticity of shadowed and dark regions. The proposed method alleviates this problem by jointly learning the dichromatic and intrinsic image decomposition, as observed in the results of (a4) in Fig. 6. The quantitative comparison is made with the real image dataset (SHIQ) proposed in [57], and it is shown in Table 1 and Fig. 7. Since SHIQ is a single-image dataset, the multi-image based methods, Tsuji [20] and DDME [6], cannot be evaluated on it. Although the proposed method is trained with multiple frames, the network can be evaluated with a single frame. The proposed network was fine-tuned for gray-illumination input. The proposed method exceeds the conventional methods in both qualitative and quantitative aspects, achieving the highest PSNR and SSIM.

2) INTRINSIC IMAGE DECOMPOSITION MODEL RESULT
Li et al. [17], Lettry et al. [18], Wei et al. [7], JieP [62], STAR [63], and UIDNet [64] are evaluated to compare their performance with the proposed method. Wei et al. [7] and Lettry et al. [18] take multiple-image approaches that exploit low/normal-light image pairs and time-lapse image datasets, respectively. Since the conventional methods except STAR [63] and JieP [62] require the illumination chromaticity as a prior, the image white-balanced with the ground-truth illuminant is used as their input. The reflectance of JieP [62] and STAR [63] is white-balanced with their estimated illuminant, computed as the global average of the illumination. The learning-based methods (Lettry et al. [18], Wei et al. [7], UIDNet [64], and the proposed method) are trained with the high-speed video dataset.
The experimental results are shown in Fig. 8. Our proposed network generates the reflectance component (which contains specularity) and its separated diffuse component, shown in (h) and (i). The intrinsic chromaticity is accurately recovered by removing specularity. Since the other intrinsic models do not consider the specularity of real scenes, they often fail to recover the chromaticity of regions with strong specularity, as shown in the red boxed regions of Fig. 8. Also, the conventional methods cause severe artifacts around saturated regions, while the proposed method accurately separates illumination and reflectance. One of the weak points of intrinsic image decomposition is failure on strong shadow regions. As shown in the blue boxed region of Fig. 8, strong shadow causes color distortion and artifacts in previous studies, while the proposed method successfully removes the shadow and reconstructs the intrinsic chromaticity.

3) COLOR CONSTANCY COMPARISON
The result of the proposed decomposition can usefully contribute to color constancy, and its performance is compared with state-of-the-art methods in Table 2. As shown in Table 2, the proposed method achieves a remarkable performance. Although the task of the proposed method is primarily image decomposition, its performance is better than that of the color constancy methods thanks to its accurate decomposition capability. DDME [6], JieP [62], and STAR [63] estimate the illuminant by decomposing the image based on the dichromatic and intrinsic models.

B. ABLATION STUDY
To confirm the effectiveness of the proposed decomposition model, three ablation studies were conducted. As shown in Fig. 10, we evaluated several image decomposition models: intrinsic decomposition, dichromatic decomposition, and the extended intrinsic decomposition. The decomposition results are shown in Fig. 9 (b-d), and their color constancy performances are compared in Table 3. The results show that our proposed method is superior to the other possible decomposition models for both color constancy and image decomposition. Specularity is removed from the reflectance in model C and in the proposed method. Unlike the other models, the proposed method accurately separates the specular component, so an accurate reflectance is obtained. Since the original intrinsic decomposition (A in Fig. 10) does not consider specularity, color distortion occurs in highly specular regions, which demonstrates the importance of considering specularity in intrinsic decomposition. The result of dichromatic decomposition (B in Fig. 10) is also degraded by severe color distortion and fails to reconstruct the inherent chromaticity in saturated regions. Although the extended intrinsic decomposition model (C in Fig. 10), which adopts the same assumption as [10] and [11], achieves better performance than models A and B, it shows over-smoothed results, as in the red boxes; image details are not reconstructed in the reflectance, while the proposed method correctly assigns the pattern to the reflectance.
To examine the effect of the illuminant color on the input to the Reflectance subnet, the original input image is used without white balance. As shown in Fig. 9 (e), the result without white balance suffers from severe color distortion. By transferring the illumination chromaticity from the Illumination subnet to the Reflectance subnet, the recovery of the reflectance chromaticity is improved and the color distortion is alleviated.
To confirm the importance of temporal features, a couple of experiments were conducted. First, a single frame is used as input instead of N frames, and accordingly the temporal losses (L_AC and L_invar) and L_KD are removed. Second, with N frames of input, only the temporal feature distillation is removed. Fig. 9 (f) and (g) show the results of the single-frame input and of training without the distillation loss. It is confirmed that temporal features are helpful for both color constancy and image decomposition. The single-image method is poor at separating the reflection components, and the intrinsic chromaticity of the shadow region is not reconstructed successfully. Also, Fig. 9 (g) shows a more blurred reflectance than the proposed method. Unlike the previous study [6] that reflects the temporal variation only in the training cost, the proposed method further improves performance by exploiting the temporal feature more efficiently through knowledge distillation.

C. LIMITATION
Although the proposed method performs better than the conventional methods, it still has limitations in some cases because of the ill-posed nature of image decomposition. The conventional methods reported that some regions (weak texture, strong shadow, and saturation) are incorrectly decomposed. Although this mis-classification is reduced in our method, there are still limitations in perfectly reconstructing the intrinsic properties of saturated and strongly shadowed regions, as shown in Fig. 6 and Fig. 8. This is because saturated regions and weakly illuminated shadows lack AC variation; note that the stronger the AC variation is, the better the temporal features are [70].

V. CONCLUSION
In this paper, we proposed a new image formation model that conducts dichromatic and intrinsic decomposition jointly. The experimental results show that the proposed model performs better than each single decomposition. Also, it was experimentally found that the decomposition performance depends on the order of the two decompositions: the proposed order ('intrinsic + dichromatic') performs better than the conventional one ('dichromatic + intrinsic'). Specular reflection is generally weak and sparse, and thus its separation is more difficult than that of the illumination in the intrinsic model. The fundamental limitation of intrinsic image decomposition is the Lambertian assumption, which makes it work poorly for real scenes; the proposed method alleviates this. Unlike conventional methods, the proposed model is trained in a semi-supervised manner on real scenes by leveraging the temporal property of AC light sources, and the performance is further improved by the temporal features. The illumination chromaticity estimated in the Illumination subnet is used for white balancing in the Reflectance subnet, leading to a significant reduction of color distortion. Although our main task is decomposition, the color constancy performance is better than that of SOTA methods.