Intrinsic Image Decomposition using Paradigms

Intrinsic image decomposition is the classical task of mapping an image to albedo. The WHDR metric allows methods to be evaluated by comparing predictions to human judgements ("lighter", "same as", "darker"). The best modern intrinsic image methods learn a map from image to albedo using rendered models and human judgements. This is convenient for practical methods, but cannot explain how a visual agent without geometric, surface and illumination models and a renderer could learn to recover intrinsic images. This paper describes a method that learns intrinsic image decomposition without seeing WHDR annotations, rendered data, or ground truth data. The method relies on paradigms - fake albedos and fake shading fields - together with a novel smoothing procedure that ensures good behavior at short scales on real images. Long scale error is controlled by averaging. Our method achieves WHDR scores competitive with those of strong recent methods that are allowed to see training WHDR annotations, rendered data, and ground truth data. Because our method is unsupervised, we can compute estimates of the test/train variance of WHDR scores; these are quite large, and it is unsafe to rely on small differences in reported WHDR.


INTRODUCTION
Many computer vision problems can be thought of as regressing some spatial fields (for example, normal, depth, shading, albedo, a processed image, a denoised image, and so on) against an image. An extremely powerful strategy for solving such problems is to collect a large set of representative tuples and use them to train a convolutional neural network. This supervised strategy runs into difficulties when it is hard to obtain data. We use predicting albedo and shading from an image (otherwise, intrinsic images) as a model problem, because the problem is well understood and well studied, and because there are strong established criteria for evaluation. For intrinsic images, the supervised framework is unsatisfying for several reasons. First, it is hard to obtain reliable data. Second, until recently quite unsophisticated unsupervised methods were competitive with supervised methods, suggesting that more sophisticated unsupervised methods are worth studying. Finally, supervised methods cannot explain how a visual agent might learn to produce intrinsic images without ever having seen an intrinsic image - visual animals are not provided with true or rendered albedo data at conception.
Intrinsic images have several important properties. An intrinsic image decomposition should explain the image - pixel values should be accurately predicted by albedo and shading. Intrinsic images have a local character - one can tell whether a moderately sized image patch is an albedo (resp. shading) patch without reference to the rest of the albedo (resp. shading) field. Similarly, the mapping from an image to an intrinsic image has a local character - for a large enough image patch, the intrinsic images recovered from the patch should be the same as those recovered from the whole image. Furthermore, the mapping from an image to an intrinsic image is equivariant under image translation and orthonormal transformations - for any two images of the same scene, the albedo (resp. shading) reported for overlapping regions should be the same. Similarly, the mapping from an image to an intrinsic image is somewhat scale equivariant - a moderate scaling of an image up or down should result in intrinsic images that are similarly scaled up or down.
Our method exploits these properties. We train a network to decompose fixed size tiles (128 × 128 in this paper) to albedo and shading estimates. We have no ground truth, but the local character of the problem means we can train a network using synthetic albedo (resp. shading) fields that need be accurate models only locally. If we pass a real tile through this network, its reported albedo (resp. shading) fields should "look like" the synthetic fields locally, too, and we achieve this with an adversarial loss. Furthermore, the albedo and shading fields should explain the image, so we penalize the residual with a loss. To ensure that we report a translation equivariant estimate, we cover a real image with randomly offset, overlapping tiles, compute albedo (resp. shading) fields for the tiles, then average, so the albedo at a given location is an average over all tiles covering that location. We apply this procedure over several scales and average the result to obtain scale equivariance. We compute a scale and translation equivariant estimate for a discrete set of rotations and reflections, and average those to estimate a rotation equivariant result. Finally, a simple pointwise procedure ensures that the residual is small.

RELATED WORK
Originally, an intrinsic image decomposition reduced an image to intrinsic components (properties of surfaces like albedo, specular albedo, roughness) and extrinsic components (like irradiance or shading) [1]. We adopt current usage, which implies a decomposition into albedo and shading.
This paper shares essential features with Retinex-like models. First, this work is unsupervised (or minimally supervised, if one uses data to choose a gradient magnitude threshold). Second, this work assumes that the key questions in recovering intrinsic images are deciding whether local phenomena are due to albedo or to shading (as in gradient thresholding), and then assembling a global estimate from those decisions (as in integration). In contrast to Retinex-like models, rather than using abstract models of albedo and shading to motivate formulations, samples from these models are used to train a decomposition procedure.

Evaluation
Quantitative evaluation of intrinsic image methods is a recent phenomenon. It is hard to produce data by experiment, and so only very small quantities of real albedo and shading data are available (e.g. [37], [38]). We choose to focus on WHDR measures, as they are based on images of real scenes. Alternative evaluations include: scores on the images of [37] (but there are very few images, in unrealistic illumination [39]); and scores on SINTEL frames (from [40]), as in [41] (but this rendered data is quite unlike real images; see [42], section 2).
The WHDR evaluation framework was put in place by [34], who constructed a dataset (Intrinsic Images in the Wild, or IIW) consisting of human judgements which compare the absolute lightness at pairs of points in real images. Each pair is labelled with one of three cases (first lighter; second lighter; indistinguishable) and a weight, which captures the certainty of labellers. One evaluates by computing a weighted comparison of algorithm predictions with human predictions; the comparison is known as the weighted human disagreement ratio (WHDR). Predictions were originally made by testing differences in estimated log-albedo against a standard threshold [34]. Other authors test against a threshold chosen using validation data (e.g. []). Yet other authors test differences in estimated albedo (e.g. []). The choice of predictor is significant. Differences in log-albedo are scale invariant, but this predictor may perform poorly over the full range of albedos: two quite similar dark albedos will have the same difference in logs as two quite different light albedos. Differences in albedo are not scale invariant, and this means that the scale on which the algorithm reports albedo and the test threshold are fungible. Some authors fix the threshold and learn the scale; others fix the scale and choose the threshold using validation data. In this paper, we use differences in albedo, and test against a variety of thresholds (section 4).
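A minimal sketch of the WHDR computation, using the difference-in-albedo predictor described above (the data layout and names here are illustrative, not those of any released implementation):

import numpy as np

def whdr(albedo, judgements, thresh=0.10):
    """Weighted human disagreement ratio (sketch).

    albedo: HxW array of estimated lightness.
    judgements: iterable of (y1, x1, y2, x2, label, weight), where label is
        '1' (first lighter), '2' (second lighter) or 'E' (indistinguishable).
    thresh: threshold on differences in estimated albedo (section 4);
        Bell et al. instead threshold differences in log-albedo."""
    disagree, total = 0.0, 0.0
    for y1, x1, y2, x2, label, w in judgements:
        d = float(albedo[y1, x1]) - float(albedo[y2, x2])
        if d > thresh:
            pred = '1'      # first point predicted lighter
        elif d < -thresh:
            pred = '2'      # second point predicted lighter
        else:
            pred = 'E'      # predicted indistinguishable
        disagree += w * (pred != label)
        total += w
    return disagree / max(total, 1e-10)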
There is a standard WHDR test-train split (20% test and 80% train) introduced by [29]. The choice of scale and threshold significantly affects reported WHDR (see table 1 of [29]). Table 1 shows reported WHDRs for a large selection of methods, using the best rescaled value known where appropriate.
WHDR scores can be improved by postprocessing, because most methods produce albedo fields with very slow gradients, rather than piecewise constant albedos. [44] demonstrate the value of "flattening" albedo (see also [45]); [27] employ a fast bilateral filter [46] to obtain significant improvements in WHDR.

Supervision
Direct supervision occurs when a method sees the albedo and shading of training images. When even a few ground truth images are available, local regression strategies have been successful [38]. The recent literature strongly emphasizes directly supervised CNN based models. One option is to regress lightness differences against image features using IIW data [48]. [30] smooth pairwise lightness comparisons (learned using WHDR data) into albedo and shading fields using a fully connected CRF. Recent methods emphasize direct supervision using CGI renderings of scene models [41], [47], [48]. However, models trained exclusively on rendered scenes do not do well on real images (e.g. [42], section 2), likely because rendered images are insufficiently "like" real images in some important ways. Competitive modern methods are trained using the training portion of the IIW dataset, then evaluated on the test portion. [32] obtain a SOTA WHDR of 14.45% in this way, but their method produces strange colors in albedo images, making its applicability in computational photography questionable and qualitative comparison unhelpful. [27] use a similar approach, but different network architectures, to obtain a mean WHDR of 17.18% with strong qualitative results; we use this method for qualitative comparison. There is good evidence that relatively little supervision is required, and that self-supervision can be successful. [49] apply a learned renderer to decompositions of unlabelled data to obtain a residual loss that improves performance. [50] show that a form of bootstrapping (augmenting training data with the results of previous models) is effective in improving performance.
Indirect supervision occurs when a method does not see the albedo and shading of training images, but sees equivalent information. This can take a variety of forms. Aligned views of the same real scene under distinct illuminants offer strong cues to intrinsic image decomposition, exploited in [51], [52], [53]. Alternatively, one can use aligned CGI renderings of the same scene [42]. [54] show how to exploit these cues to learn a method that, at inference time, can be applied to a single view. [33] show that it is enough to partially align images of real scenes (by matching sections of frames).
Indirect supervision can take a more abstract form, by providing the method only statistical models of albedo (resp. shading), much like the original Retinex assumption. [35] use CGI renderings of albedo and shading to build autoencoders. These are used to impose albedo (resp. shading) structure on the inferred components of the input image; the components must also compose to make the image. This method obtains the current SOTA WHDR for methods that use only indirect supervision (18.69%). Our method also receives only statistical models of albedo and shading, but it receives them directly: we multiply samples from albedo and shading paradigms, and train the method to decompose the product into the original samples. This training data is quite unlike real images or CGI renderings, and we rely on adversarial smoothing to ensure the decomposer applies to real images, resulting in a new SOTA WHDR for indirect supervision (17.04%).

Invariance and Equivariance
Most applications must control how a CNN behaves when an image is transformed. A classifier, for example, should not change prediction if the image is shifted or scaled. There is no crisp theoretical framework for transformations of the input. The theory of group actions does not apply exactly to image rotations, scaling or cropping, because almost all interesting transformations of this form involve information being gained or lost at the boundary of the image. For image classification, data augmentation - training with multiple crops, scalings, colorings and rotations of training examples - seems to result in classifiers that are robust to transformations (origins uncertain; survey in [55]). Averaging predictions over multiple distinct crops is now universal practice (origins again uncertain). Augmentation and averaging result in a property analogous to invariance, though a precise definition remains obscure (early attempts, in another context, in [56], [57]). Imposing augmentation robustness seems to constrain a representation strongly.

TABLE 1. Reported WHDRs for a selection of methods. The Y/N columns record whether a method sees WHDR training annotations, rendered data, and ground truth data.

Method | Reported in | WHDR ann. | Rendered | GT | WHDR (%)
[29] | ibid | N | N | N | 18.1
Bi et al. '18 [27] | ibid | N | Y | Y | 17.18
Zhou et al. '15 [30] | ibid | Y | N | Y | 15.7
Li and Snavely '18 [31] | ibid | Y | Y | Y | 14.8
Fan et al. '18 [32] | ibid | Y | N | Y | 14.45
*Zhao et al. '12 [14] | [29] | N | N | N | 26.4
Shen and Yeo '11 [23] | [29] | N | N | N | 26.1
Yu and Smith '19 [33] (a) | ibid | N | N | N | 21.4
Retinex (rescaled; color/gray) | [29] | N | N | N | 19.5*/18.69*
Bell et al. '14 [34] | [29] | N | N | Y | 18.6
Liu et al. '20 [35] | ibid | N | Y+ | N | 18.69
Bi et al. '15 [36] | ibid, [29] | - | - | - | (two figures; see note)

For our method, we report the held-out threshold value of WHDR. We report two figures for [36], because we found two distinct figures in the literature. Key: * - method uses IIW training data to set scale or threshold ONLY. + - [35] build models of albedo and shading from CGI, but do not use them for direct supervision. a - [33] use patches of registered images from MegaDepth.
More important in regression applications is equivariance. A function φ : x ∈ X → y ∈ Y is equivariant under the action of a group G if there are actions of G on X and Y such that φ(g • x) = g • φ(x). Again, information being gained or lost at the boundary is an obstacle to applying the theory of group actions exactly (except for certain finite groups [58]). If one relaxes the definition to require only an approximate match, well-known visual feature representations tend to have strong equivariance properties either by design or in practice [59]. Generally, equivariance properties have not been imposed on regression networks; we know of no better strategy for doing so than averaging.

FRAMEWORK
We model an image as a colored albedo field multiplied by a shading field and a single color. Generally, we use bold for vectors (position, color fields), and so write

I(x) = a(x) • [s(x)c],

where c is the color of the shading field and • is elementwise multiplication. There is strong support in the literature for the model of albedo as patches of constant color. For example, postprocessing with the fast bilateral filter makes this assumption and is helpful [27]; most priors are derived from this assumption; and most current regression methods produce albedos that look like patches of constant color (e.g. Figure 6). We adopt this model. Our shading model supports more complex phenomena like fast shading edges (like the cast shadows or the cloth folds in Figure 2). Imposing these models poses what are essentially local problems, such as deciding how an image gradient should be decomposed into shading and albedo effects; once these are solved, the albedo can be determined by "filling in" appropriate constant colors. But it is inconvenient to determine the details, or to (say) optimize a posterior. Instead, we train a fully convolutional network directly on synthetic examples which can represent how the local problems are to be solved (section 3.1); we then use adversarial smoothing methods to ensure that the network produces reasonable results on real images (section 3.2). Our network could be applied to any size of image, because it is fully convolutional; but doing so ignores the significance of scale in intrinsic image problems and produces solutions without the required equivariance properties. Instead, the network is trained on fixed size tiles, and the results on tiles are reassembled into a (somewhat) equivariant estimate (section 3.3). Finally, the result is postprocessed per-pixel to ensure that albedo and shading compose to make the image (section 3.5).

Paradigms
Our synthetic albedo (resp. shading, color) models, paradigms in what follows, are samples from easily sampled random processes that produce tiles that appear to capture the important properties of albedo (resp. shading, color) at a short scale. Paradigms can be thought of as priors represented in a form that is convenient - rather than using a loss that depends on the prior, we train the network to decompose examples from a prior model. For some kinds of constraint - for example, the requirement that an albedo be piecewise constant, with sharp edges - there may be a practical advantage to representing the prior with paradigms, rather than as a cost function, because it can be difficult to author cost functions that capture these constraints accurately. We require that paradigms represent albedo and shading only on a relatively short scale. This means that paradigm samples do not need to look like real albedo (resp. shading) images. The paradigms must be chosen by hand (we have no search procedure for paradigms).

TABLE 2. Models investigated (section 4).

Base Cases:
Base - all α are 1; N_t = 7, N_σ = 3; average over 3 checkpoints.
Ma01NP - as Base, but with an exponential moving average during training (section 3.2) with w = 0.9, and for every training example the decomposer sees paradigm ground truth for albedo or for shading, but not both.

Best:
BBA - same as Ma01NP, but N_t = 15, N_σ = 5.
NP - same as Ma01NP, but N_t = 15, N_σ = 5 and no location code.
BBAP - same as BBA, but with postprocessing.
BBAF - same as BBAP, but with discrete image averaging as well.

Variant Paradigms:
CGI - albedo and shading tiles from CGIntrinsics [47] are used rather than paradigm images; tiles are selected from shading and albedo independently.
CGIT - albedo and shading tiles from resized versions of CGIntrinsics [47] images are used rather than paradigm images; resizing is to 180 pixels on the shortest edge, and ensures albedo tiles have more structure; tiles are selected from shading and albedo independently.
CGITD - as CGIT, but dependence between shading and albedo tiles is preserved.
Dark - the paradigm for shading is modified to have a higher dynamic range (s_min = 0.05).
AlbFrag - the albedo paradigm contains very small fragments; d_max = 9, p_min = 100.
ShaFrag - the shading paradigm contains very small fragments; n_m = 16.
Our albedo paradigm uses a surface color model and a spatial model. The qualitative properties it is intended to capture are: albedos are piecewise constant; the color distribution should reflect likely surface colors; there should be a profusion of edges with no strong orientation bias; and there should be at least some vertices with degree greater than three. Surface color is modelled by drawing color samples uniformly and at random from the IIW training set. These must be adjusted for presumed illumination. We do so by assuming the range of illumination intensity is approximately the same as the range of lightnesses, and so dividing by the square root of intensity.
The spatial model is an evenly weighted mixture of two spatial models. The first models the albedo as a kd tree, with spatial splits chosen at random, a fixed maximum depth (d max = 6 unless otherwise stated), and a fixed minimum number of pixels per cell (p min = 1000 unless otherwise stated). For each cell, the color is chosen uniformly at random from the surface color model. The second models the albedo as a mondrian of rotated mondrians. We build a dictionary of rotated mondrians by first constructing random axis aligned rectangular grids, then filling in each grid cell with a sample from the color model, then applying a random rotation. Each mondrian is then obtained by constructing a random axis aligned grid (of n c cells on edge), and filling each grid cell with a correspondingly sized, randomly selected block from a random dictionary entry. The number of cells on edge n c is chosen uniformly and at random in the range 1 to n m = 4 (unless otherwise noted).
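The kd-tree branch of this spatial model is simple to sample; the sketch below illustrates it (the even mixture with the mondrian-of-rotated-mondrians branch is omitted, and the helper names are ours):

import numpy as np

def kd_albedo(colors, size=128, d_max=6, p_min=1000, rng=None):
    """Sample the kd-tree branch of the albedo paradigm (sketch).
    colors: (N, 3) palette of surface colors drawn from IIW training images."""
    rng = rng or np.random.default_rng()
    img = np.empty((size, size, 3), np.float32)

    def leaf(y0, y1, x0, x1):
        img[y0:y1, x0:x1] = colors[rng.integers(len(colors))]

    def fill(y0, y1, x0, x1, depth):
        h, w = y1 - y0, x1 - x0
        if depth >= d_max or h * w < 2 * p_min:   # cell cannot be split further
            leaf(y0, y1, x0, x1)
            return
        if rng.random() < 0.5:                    # random split on rows
            lo = int(np.ceil(p_min / w))          # rows per child for >= p_min pixels
            if h < 2 * lo:
                leaf(y0, y1, x0, x1)
                return
            s = int(rng.integers(lo, h - lo + 1))
            fill(y0, y0 + s, x0, x1, depth + 1)
            fill(y0 + s, y1, x0, x1, depth + 1)
        else:                                     # random split on columns
            lo = int(np.ceil(p_min / h))
            if w < 2 * lo:
                leaf(y0, y1, x0, x1)
                return
            s = int(rng.integers(lo, w - lo + 1))
            fill(y0, y1, x0, x0 + s, depth + 1)
            fill(y0, y1, x0 + s, x1, depth + 1)

    fill(0, size, 0, size, 0)
    return img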
Our shading paradigm uses a spatial model to combine samples from Perlin noise. The qualitative properties it is intended to capture are: shading contains many slow and very slow changes; there are some sharp shading edges; and the dynamic range of shading indoors is limited. Our shading model uses a Perlin noise field, constructed from five scales of smoothing. We construct five dictionaries, one per scale (σ ∈ [3,6,12,16,24]). Each dictionary contains IID unit normal images smoothed with gaussians at the corresponding σ. A shading component consists of a weighted sum of randomly chosen elements, one per dictionary, weighted by [0.2, 0.2, 0.4, 1, 1] respectively. A shading sample is obtained by: randomly constructing a shading component as a background; choosing a random number of masks to impose; then, for each mask, replacing the shading in the interior of the mask with the shading from another, randomly chosen, shading component. The masks are chosen from two options: leaves in a random kd tree of fixed maximum depth (d smax = 6) and minimum number of pixels per leaf (p smin = 1000); or cells in a dictionary of rotated mondrians, constructed as per the albedo mondrians. The resulting sample is rescaled to have fixed minimum (s min = 0.2) and maximum (s max = 1) value. Figure 1 shows typical samples.
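A sketch of the shading paradigm in the same vein (here fresh noise is smoothed per sample rather than drawn from cached dictionaries, and plain axis-aligned rectangles stand in for the kd-tree and rotated-mondrian masks):

import numpy as np
from scipy.ndimage import gaussian_filter

def shading_component(size=128, sigmas=(3, 6, 12, 16, 24),
                      weights=(0.2, 0.2, 0.4, 1.0, 1.0), rng=None):
    """One shading component: a weighted sum of IID unit normal images
    smoothed with Gaussians at the five scales."""
    rng = rng or np.random.default_rng()
    f = np.zeros((size, size), np.float32)
    for s, w in zip(sigmas, weights):
        f += w * gaussian_filter(rng.standard_normal((size, size)), s)
    return f

def shading_sample(size=128, s_min=0.2, s_max=1.0, max_masks=3, rng=None):
    """A shading sample: a background component, a few masked regions
    replaced by other components, rescaled to [s_min, s_max]."""
    rng = rng or np.random.default_rng()
    f = shading_component(size, rng=rng)
    for _ in range(rng.integers(0, max_masks + 1)):
        y, x = rng.integers(0, size, 2)
        h, w = rng.integers(size // 8, size // 2, 2)
        y1, x1 = min(y + h, size), min(x + w, size)
        f[y:y1, x:x1] = shading_component(size, rng=rng)[y:y1, x:x1]
    f = (f - f.min()) / max(f.max() - f.min(), 1e-8)
    return s_min + (s_max - s_min) * f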
Our model assumes there is no spatial variation in illumination color. A sample from the color paradigm is given by 0.5(1, 1, 1)^T + 0.5ξ, where ξ ∼ N(0, I). This means that paradigm images can have quite strong color casts (Figure 1).
For synthetic examples, we know: image; albedo; color; and shading. We must ensure that the predicted albedo for the tile is close to the true albedo (L_a); the predicted shading is close to the true shading (L_s); the predicted color is close to the true color (L_c); and the image is explained by the albedo and shading (L_r). We work with images, rather than log images, so albedo and shading must multiply to yield the image. Our loss is

L_paradigm = α_a L_a + α_s L_s + α_c L_c + α_r L_r

(the weights α are 1 unless otherwise noted; appendix D gives the individual terms).
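In code, this loss might look as follows (a sketch: the decomposer interface and the plain L1/L2 comparisons are stand-ins; appendix D specifies the mixed L1-L2 comparison actually used):

import torch
import torch.nn.functional as F

def paradigm_loss(net, image, a_true, s_true, c_true, alphas=(1., 1., 1., 1.)):
    """Loss on a synthetic (paradigm) tile: match albedo, shading and color,
    and require the decomposition to explain the image."""
    a, s, c = net(image)                      # (B,3,H,W), (B,1,H,W), (B,3)
    recon = a * (s * c[:, :, None, None])     # image model: I = a • [sc]
    la = F.l1_loss(a, a_true)
    ls = F.l1_loss(s, s_true)
    lc = F.mse_loss(c, c_true)
    lr = F.l1_loss(recon, image)
    aa, as_, ac, ar = alphas
    return aa * la + as_ * ls + ac * lc + ar * lr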

Adversarial Smoothing
For real examples, we do not know albedo or shading, but we can ensure that the image is explained by the albedo and shading (L_rr), and that the predicted albedo and shading are within a reasonable range (L_range). Our loss is

L_real = α_rr L_rr + α_range L_range

(details in appendix D). But this loss does not control what the model does to real tiles in any detail. Here is one way to diagnose whether the model is mapping real tiles appropriately. Take a population of real tiles, and decompose them. Cut pairs of patches of some appropriate fixed size out of each of the resulting albedo and shading fields - call these the real data pairs. Similarly, cut pairs of patches of the same size from the training data - call these the training data pairs. Because albedo (resp. shading) has a local character, we expect that, if the patch size is sufficiently small, real data pairs should be "like" training data pairs; any reliable distinction between the two categories is a sign that the model may not be behaving properly. We do not seek to match the distribution of albedos for decomposed images to that of paradigms (which does not work particularly well, as shown in Figure 10). Instead, we impose distribution matching only at the scale of patches, because we can trust the paradigm model only at fairly short scales. For the real pairs, the decomposer is a generator, because it makes image pairs for which we know no loss, so we can use an adversary to refine it. Some modifications are required. It is usual to write an adversarial loss and seek a saddle point [60]. If the saddle point exists (unlikely; see [61]), the generated distribution matches the data distribution [60]. For our purposes, this matching may be undesirable, as the training pairs may be at best a rough approximation of what real pairs look like. In practice, generators are implemented by taking some steps on the discriminator for fixed generator, then some steps of the generator for fixed discriminator. We follow this procedure (details in appendix E).
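The patch-pair construction above can be made concrete as follows (a sketch; the patch size and count here are arbitrary illustrative choices):

import torch

def patch_pairs(albedo, shading, k=32, n=16):
    """Cut n aligned (albedo, shading) patch pairs of size k from one
    decomposed tile, as input to the patch discriminator.
    albedo: (3,H,W); shading: (1,H,W)."""
    _, H, W = albedo.shape
    ys = torch.randint(0, H - k + 1, (n,))
    xs = torch.randint(0, W - k + 1, (n,))
    pairs = [torch.cat([albedo[:, y:y + k, x:x + k],
                        shading[:, y:y + k, x:x + k]], dim=0)
             for y, x in zip(ys, xs)]
    return torch.stack(pairs)      # (n, 4, k, k): 3 albedo + 1 shading channels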
In this case, training dynamics may not converge to a single model, but rather wander around a stationary set of distinct models, each of which produces somewhat different reconstructions. This effect may not be a nuisance for generators, because it affects only the mapping from latent variable to image, which does not usually matter. In our case, it is a potentially serious nuisance, because different checkpoints taken at the end of training may report very different albedos for the same image (Figure 4). We manage this effect by averaging model parameters, either over a fixed number of checkpoints or using a moving average of parameters, which gives better results. Write θ for the current estimate of the generator parameters; we maintain a separate set of parameters ψ, and update them by ψ → wψ + (1 − w)θ every 5000 images.
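The parameter average is the usual exponential moving average, sketched below:

import torch

@torch.no_grad()
def ema_update(psi, theta, w=0.9):
    """psi <- w * psi + (1 - w) * theta, applied every 5000 training images
    (section 3.2); psi and theta are two copies of the decomposer."""
    for p_avg, p in zip(psi.parameters(), theta.parameters()):
        p_avg.mul_(w).add_(p, alpha=1 - w)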

Equivariance and Averaging
An ideal intrinsic image method will report the same albedo for the same location in a scene, however that location is viewed. We know of no way to impose this criterion. A simpler equivariance requirement is that all image tiles (however located, oriented or scaled) containing some point x in the scene will report the same albedo and shading for that point. Note first that there is a problem to solve here: even a fully convolutional network is not equivariant under shifts of the image, because of boundary effects - some locations in the output depend in some way on units whose support extends outside the image and into the padding. This means that a pixel in the overlap of two tiles could be estimated (say) with padding in the first estimate and without padding in the second, and different estimates will result. A natural way to impose this equivariance requirement is to estimate the albedo at each point as the average of estimates made by multiple tiles (with different offset, location and scale) containing that point.
Experimental images are approximately 400 pixels on edge, with some range of variation. Cropping tiles of arbitrary scale and orientation is inefficient. Instead, for each scale, we average over a random set of tiles of fixed size. At a given scale, we cut images into an N_t × N_t grid of overlapping tiles, with dithered centers, arranged to cover the image, and then form a weighted average of the results for the tiles. Tiles are organized to ensure that each pixel is covered by at least one tile, though most pixels are covered by many tiles. We use a weighted average to suppress ringing artifacts; weights decline exponentially to the boundary of the window (detailed form in appendix B). We have not experimented with other window forms. To ensure that feature computation takes into account whether a location is near the center of the tile or near the edge, we augment input tiles with a simple location code (detailed form in appendix B). We have not experimented with other location codes.
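A sketch of translation averaging at a single scale (a uniform window stands in for the exponentially decaying window of appendix B, the location code channels are omitted, and the dither scheme here is illustrative):

import torch

def tile_average(net, image, tile=128, n_t=7, dither=8):
    """Cover the image with an n_t x n_t grid of overlapping, dithered tiles
    and blend per-tile albedo estimates. image: (3, H, W), H, W >= tile."""
    _, H, W = image.shape
    acc = torch.zeros(3, H, W)
    wsum = torch.zeros(1, H, W)
    win = torch.ones(1, tile, tile)          # stand-in weight window
    ys = torch.linspace(0, H - tile, n_t)
    xs = torch.linspace(0, W - tile, n_t)
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            # Dither interior tile positions; pin border tiles so that the
            # image stays fully covered.
            dy = 0 if i in (0, n_t - 1) else int(torch.randint(-dither, dither + 1, (1,)))
            dx = 0 if j in (0, n_t - 1) else int(torch.randint(-dither, dither + 1, (1,)))
            yy = min(max(int(y) + dy, 0), H - tile)
            xx = min(max(int(x) + dx, 0), W - tile)
            a, s, c = net(image[:, yy:yy + tile, xx:xx + tile].unsqueeze(0))
            acc[:, yy:yy + tile, xx:xx + tile] += win * a[0]
            wsum[:, yy:yy + tile, xx:xx + tile] += win
    return acc / wsum.clamp_min(1e-8)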
We average the albedo and shading estimates so obtained over several rescaled versions of the original image. We average translation averaged albedo and shading reconstructions over N_σ scales spaced evenly from approximately 1/√2 × image size to √2 × image size. It is trickier to achieve equivariance under orthogonal transformations by averaging. Recall that an orthogonal transformation is a rotation, possibly composed with a reflection. The number of samples required becomes large, and extracting tiles (resp. images) at arbitrary rotations is inefficient. We have investigated two simplified strategies. In the first, we compute all eight images obtained by rotation through a multiple of 90° optionally composed with a reflection, compute translation and scale averaged decompositions for each, then average the results (discrete image averaging). In the second, we average over all eight tiles so obtained for each tile processed during scale and translation averaging (discrete tile averaging). These averaging steps increase inference time eightfold, and so we investigate their effects only for models known to be strong.

Fig. 4 (caption). For the image shown, we compare albedo reconstructions from a reference model (BBAF, our best) with others. Each image has been passed through the underlying model (which is convolutional, and so applies at any scale). Model 1 and Model 0 are different checkpoints, separated by approximately 10,000 training images; notice the significant long scale differences, caused by the fact that the adversarial smoothing does not identify a unique best model. As Figure 8 shows, an exponential moving average resolves this effect. Scale shows the (rescaled) albedo for an image that was rescaled down by 1.4, then decomposed using model 0; comparing this to the result of model 0 shows a severe failure of scale equivariance. Similarly, Flip shows the (reflected) albedo for an image that was reflected in both axes, then passed through model 0; comparing this to the result of model 0 shows a severe failure of rotation equivariance. Finally, BR and TL show the results of cutting the image into two overlapping tiles and passing each through the network; comparing these shows a severe failure of translation equivariance. The symptom of these equivariance failures is long spatial scale error of a form disruptive to WHDR comparisons. Our strong WHDR performance shows that our averaging procedures control these effects.
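Discrete image averaging is then a loop over the eight compositions of axis-aligned rotation and reflection (a sketch; decompose stands for the translation and scale averaged estimator):

import torch

def dihedral_average(decompose, image):
    """Run `decompose` on all eight rotations/reflections of the image,
    undo each transform on the result, and average."""
    outs = []
    for flip in (False, True):
        im = torch.flip(image, dims=[-1]) if flip else image
        for k in range(4):
            a = decompose(torch.rot90(im, k, dims=(-2, -1)))
            a = torch.rot90(a, -k, dims=(-2, -1))    # undo rotation
            if flip:
                a = torch.flip(a, dims=[-1])          # undo reflection
            outs.append(a)
    return torch.stack(outs).mean(dim=0)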

Averaging Controls Error
Averaging across scale, translation and rotation helps control some form of model error. The smoothing procedure ensures that a generic image will produce an albedo (resp. shading) field that "looks like" the training data at the scale of patches.
The albedo is not controlled on a longer scale. This means that the predicted albedo for a tile may contain an error that is on a longer scale than the size of a patch and that depends on the input image. Here are some examples. The albedo model consists of piecewise constant patches, and the shading model contains some fast shading boundaries (section 3.1). The network could predict a fast change in albedo coordinated with a fast change in shading. Alternatively, the network could predict an albedo that has a slow (but not zero) gradient that is low enough that the difference from zero is hard to resolve at the scale of a patch. This error may depend on the long-scale structure of the input image - for example, mostly red images might get spurious fast changes in albedo. One strategy to control this effect is to have models that are very good on long spatial scales, too; but we do not know how to produce training data that properly represents the desired outcome at a long scale.

Fig. 5 (caption). [...] [35], and note near .17 WHDR with held out threshold for BBA, BBAP, BBAF. The standard WHDR test set may be easier than most subsets that size (green bars well below median in boxplots). Postprocessing and flipping may appear to weaken performance (cf. red/black bars for BBA, BBAP and BBAF), but this is an artifact of using one test set; as Figure 11 shows, BBAF beats BBA and BBAP for almost every simulated test set. Key: Fixed thresholds: shown in boxplots of WHDR values for 50 simulated test sets for the two fixed thresholds; green bars are the value for the standard test set. Oracle thresholds: heavy black bar. Held out threshold: heavy red bar. Boxplots: horizontal bar = median; notch = fraction of the interquartile range outside which a difference in medians is significant; bottom and top of the box = 25 and 75 percentiles resp.; whiskers extend to the most extreme data points that are not outliers; outliers - greater than 1.5 times the interquartile range outside top and bottom - are '+'. Best viewed in color.
The error in each tile will be referred to the coordinate system of that tile. As a result, for any given location x, we are averaging estimates of albedo and shading that have different error terms (because they have different locations in the tile coordinate systems for their tiles). Equivalently, the context used to produce the estimate at x is different from tile to tile. As a result, we expect the averaging process to suppress errors at long spatial scales. Figures 4 and 8 strongly suggest that this error control is important. Discrete image averaging significantly outperforms discrete tile averaging, likely because discrete tile averaging cannot control error on long scales.
An alternate view is this. Each albedo (resp. shading) in a tile estimate is the result of an estimator (the function implemented by the network at that point). But not every estimator is the same; some have support that reaches into the padding. Training ensures that the expected error of each estimator is zero, or close, but does not ensure that estimators have the same variance. By shifting, rotating, and scaling images, we are essentially producing multiple distinct estimates of the same albedo (resp. shading), and averaging reduces their variance.

Postprocessing
Averaging at a fixed scale has two important effects. First, the color estimate c is no longer constant as a function of position (each tile produces a constant, but the average may not be). Second, averaging means that the residual might be larger than desired. In particular, fine details in the images may be obscured by averaging across scales. These effects can be fixed at inference time by postprocessing, and the results demonstrate a small advantage to doing so (Figure 5). Assume that I has produced averaged albedo estimate a(x), averaged shading estimate s(x), averaged color estimate c(x) and residual r(x) = I − a • [sc]. Then we seek small δa(x), δs(x) so that (a + δa) • [(s + δs)c] is closer to I. As appendix C establishes,

δs = (r^T a)/(a^T a + s²) and δa = (1/s)(r − a (r^T a)/(a^T a + s²)).

Note that (a) the process can be iterated and (b) the computation is pointwise and fast. Our experience has been that the averaging method produces a fairly small residual, and few iterations are required. Where noted, postprocessing is applied for each scale's average, and then for the average across scales.
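A sketch of the resulting pointwise update (assuming the averaged color c is a strictly positive per-pixel field):

import torch

def polish(image, a, s, c, iters=2):
    """Pointwise residual correction (section 3.5, appendix C): choose small
    (da, ds) with a*ds + s*da = r, so (a + da)[(s + ds)c] better explains
    the image. image, a, c: (3,H,W); s: (1,H,W)."""
    eps = 1e-6
    for _ in range(iters):
        ae = a * c                                  # fold color into the albedo
        r = image - ae * s                          # residual
        ds = (r * ae).sum(0, keepdim=True) / ((ae * ae).sum(0, keepdim=True) + s * s)
        dae = (r - ae * ds) / (s + eps)             # update to effective albedo
        s = s + ds
        a = a + dae / (c + eps)
    return a, s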

Network Details
Discriminator network: We want our discriminator score to depend only on local neighborhoods. We use a straightforward trick, derived from the practice of training adversarial networks using a hinge loss []. Write I for the input field, and y for the label (-1 for real, 1 for generated). We achieve a local discriminator by structuring the network as a set of convolutional layers that produce a 1 × k × k tensor F from I. The scale of the patches is dictated by the size of the receptive field for the elements of F. The discriminator is trained using a mean hinge loss over all overlapping patches of that scale; this can be computed as mean(ReLU(1 − yF)). The loss used to train the generator is then obtained as mean(F). All experiments, except scale experiments, use the same network structure for the adversarial discriminator (appendix A). Scale experiments vary the number of layers and the size of the kernel to achieve the given patch size. All discriminator networks are trained with leaky ReLUs and spectral normalization.
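A sketch of such a local discriminator and its hinge losses (layer widths are illustrative; the spatial size k of the score map depends on the input size, and the configuration of appendix A.2 yields 8 × 8):

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

class PatchDiscriminator(nn.Module):
    """Spectrally normalized conv layers, kernel 4, stride 2, no padding,
    mapping a 4-channel (albedo, shading) field to a 1 x k x k score map."""
    def __init__(self, in_ch=4, width=64):
        super().__init__()
        chans = [in_ch, width, 2 * width, 4 * width, 1]
        self.layers = nn.ModuleList(
            spectral_norm(nn.Conv2d(chans[i], chans[i + 1], 4, stride=2))
            for i in range(4))

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i < 3:
                x = F.leaky_relu(x, 0.2)
        return x                                   # (B, 1, k, k) patch scores

def d_loss(scores, y):
    """Mean hinge loss over overlapping patches; y = -1 for real pairs,
    +1 for generated pairs (the text's convention)."""
    return F.relu(1 - y * scores).mean()

def g_loss(scores):
    """Generator loss from the discriminator's patch scores: mean(F)."""
    return scores.mean()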
Decomposer network: Our network accepts a 128 × 128 image tile p and produces a 3 × 128 × 128 estimate of absolute spectral albedo a(p; θ) (i.e. our albedo estimate is colored), a 1 × 128 × 128 estimate of absolute shading s(p; θ), and a 3-dimensional estimate of illuminant color c(p; θ). All experiments use the same network structure (appendix A), but with different losses and different training data, as noted.
Training details: All experiments train the network in single precision and use a batch size of 128; all networks see a total of 32M training images, evenly divided between real tiles and paradigms. For all experiments, training albedos for a batch are selected uniformly and at random from a cached dictionary of 4000 samples; similarly, training shading is selected uniformly and at random from a cached dictionary of 4000 samples. Real tiles are selected uniformly at random from a dictionary of 4000 samples. These samples are drawn only from the standard training set for IIW.

EVALUATION
Our procedure is intended to produce colored albedo (surface color) estimates. We reduce these to lightness estimates by averaging the three color channels, and test the difference in predicted lightnesses against a threshold; if the absolute difference is below the threshold, the pair is predicted indistinguishable. The threshold is chosen in one of three ways. For comparison with other algorithms, we compute WHDR on the standard test set, using both a held-out threshold (chosen as the threshold that gives the best WHDR for the training set) and an oracle threshold (the threshold that yields the best WHDR on the test set). Because we wish to investigate the performance of lightness algorithms that have never seen real training data, fixed thresholds (chosen in advance and largely independent of the WHDR dataset) are our primary interest. We investigate two: 0.1 (because Bell et al. [34] used this value, and because this yields about 10 distinguishable lightness values) and 0.165 (because a search on WHDR validation data gives this threshold as the one at which the differences in image intensities yield the best WHDR on a validation set; this is the only reliance on IIW data in choosing this threshold).

Fig. 6 (caption). Qualitative comparison to [27], [26], [48], [45] and [62], using parts of Figure 1 of [27]. As [27] remark, the methods of [26] and [48] are trained on rendered data alone, and face difficulties due to the difference between rendered data and real images. As [27] remark, the methods of [48] and [45] face difficulties due to the deep shadows in the scene. The albedo produced by our method does not show the "colored paper" effect seen in other methods and does not produce odd colors; this is an advantage (text). Our method reports albedo and shading up to image boundaries; that of [27] appears not to (the crop of the figures is as in the original paper; for our method, we show the whole image).

Models:
We have investigated a number of models, which use variants of our loss. Paradigms are used as training data; all α are 1; N_t = 7 and N_σ = 3 except where explicitly noted. Table 2 lists the models.

Standard Test WHDR
Other published methods are allowed to see training WHDR data (Retinex does so to choose a scale), so our methods can be compared with them using WHDR on the standard test set with the held-out threshold. Here our method sees the training set to choose the threshold (but for nothing else). As Table 1 indicates, in this comparison our best method (BBA), with 17.04% WHDR, strongly outperforms other unsupervised methods and is comparable to recent strong supervised methods. However, this comparison is not a particularly good way of choosing models.

Simulated Test WHDR
Because our method does not see any WHDR labels in training, we can estimate how WHDR reports change with the test set. We repeatedly draw a simulated test set from the IIW dataset, then compute WHDR on each of the collection of simulated test sets for each method. This exposes the variance in WHDR caused by the choice of test set. For each simulated test set, each image is chosen with probability 0.2, yielding simulated test sets that are the same size as the standard test set. We draw 50 simulated test sets to form a collection. Results are shown as box plots of WHDR for all sets in the collection at the two fixed thresholds.

Fig. 7 (caption). Qualitative comparison to [34], [14] and [25], using in part Figure 14 of [25]. The method of [25] is more successful than others at suppressing this complex mixed shadow, but produces "colored paper" effects in the albedo. The method of [34] does not handle the shadows well; the method of [14] is better, but washes out the albedo. By comparison, our method is moderately successful on this challenging image. Best viewed in color.

Our methods are strong: Figure 5 summarizes results for our strongest methods. All beat rescaled Retinex. The reported WHDR is not particularly sensitive to threshold (note how held-out threshold WHDR is very close to oracle WHDR). There is some evidence that the standard test set is easier than randomly selected sets of the same size (green bars in Figure 5 are mostly well below the median in the boxplots, and this is consistent across the figures).
Adversarial smoothing is important: Figure 8 compares various configurations. Adversarial smoothing is important to the method's success (NoSmo is relatively weak, but better than both Retinex and [35]).
Averaging is very important: Both Figure 5 and Figure 8 support the conclusion that methods that use discrete image averaging and also average over more boxes and more scales work noticeably better. The notches on the boxplots allow judgements of significance; the difference between BBAF at 0.165 and the other models is clearly significant.
Standard test set WHDR is unreliable: The WHDR varies quite strongly across simulated test sets - the standard deviation is 0.3% for the base method, but note the quite heavy tails. As a result, comparing methods using a single WHDR is unwise. For example, the held-out threshold WHDR for BBAP appears to be worse than that for BBA - postprocessing appears to make things worse - though qualitative differences strongly favor BBAP (Figure 11). Closer analysis reveals this is misleading.
Paradigms are better than CGI: Figure 9 compares results for different paradigms and for CGI tiles. The details of the paradigm do not seem to matter very much, but using CGIntrinsics for paradigm data causes a sharp loss in WHDR, resulting in methods that are outperformed by Retinex. Relative robustness to details of the paradigm is convenient, because we have no procedure to search paradigms. The paradigms described here are not the result of any systematic search.

Scale is important:
The discriminator is engineered to see albedo and shading patches of fixed size (the scale; section 3.6). This parameter is important. Figure 10 shows performance for discriminators that view patches of several different sizes. The scale of discriminator patches has a strong effect on performance (Figure 10): imposing the requirement that a predicted albedo (resp. shading) look like a paradigm at the wrong scale leads to a notable fall-off in performance. This - and the tremendous improvements resulting from averaging - supports the idea that paradigms are essentially local models.

Fig. 9 (caption). Varying the details of the paradigm has some effect; a Dark shading paradigm creates notable difficulties, but varying the size of shading (ShaFrag) and albedo (AlbFrag) fragments seems to have only minor effects. Using tiles excerpted from CGIntrinsics [47] leads to a significant fall-off in performance (CGI - tiles extracted from CGIntrinsics at original scale; CGIT - extracted from images shrunk so that tiles contain more detail; CGITD - dependency between shading and albedo preserved). Graphical conventions as in Figure 5. Best viewed in color.
Which model is best: Note that standard test set WHDR suggests that BBA is our best model (Figure 5; Table 1). Standard test set WHDR is a poor way to choose models, because WHDR varies quite strongly across simulated test sets, and because the standard test set seems to be somewhat easier than randomly selected test sets. From Figure 5, BBA, BBAP and BBAF are reasonable contenders for best model. Figure 11 shows a treatment effects comparison of these models. For each simulated test set, we compute the difference between the WHDR reported by two models (W_A − W_B). If a boxplot of these differences straddles 0, the models may be the same; if it lies far below (resp. above) 0, then model A (resp. B) is better, because on most simulated test sets it gets lower WHDR. Figure 11 shows these boxplots comparing BBA, BBAP and BBAF. BBAP appears slightly better than BBA; BBAF is clearly a lot better, because for every simulated test set the WHDR of its predictions is below that of BBA. Some variation in the reported difference in WHDR must be caused by the random offsets in the averaging process of section 3.3. We estimate this variance by computing WHDR for different averages of the same simulated test set. The figure suggests that this effect is not strong enough to explain the difference between BBAF and BBA.

Fig. 11 (caption). Boxplots of the difference in WHDR for simulated test sets reported by pairs of models reveal whether one is reliably better than another; here BBAP almost always reports a slightly smaller WHDR than BBA, and BBAF always reports a very much better WHDR than BBA. The dashed lines show the three standard deviation range for the variation caused by random offsets in the averaging process. The only question of significance is for the comparison between BBA and BBAP. But random offsets in the averaging process should affect each method equivalently, and every difference favors BBAP, suggesting that BBAP is genuinely better than BBA. The difference between BBAF and BBA is pronounced. BBAF is clearly the best of our current models. Best viewed in color.

Fig. 12 (caption). [...] The advantage of doing so is that a decomposition will then capture the thin bars of darkness associated with grooves separately from albedo (example decomposition shown here). Qualitatively, these thin bars do appear to be associated with grooves (but note the thin dark paint bars on the ceiling, which also appear in this map). The cost in WHDR (top right compares to BBAF) is noticeable, but may be tolerable in some applications. Best viewed in color.

[...] handle dark shadows well. Figure 6 shows comparisons to a number of strong recent methods. As these comparisons indicate, WHDR may be a limited guide to success. Methods that achieve strong WHDR on test can produce quite eccentric albedo fields. One difficulty comes with the choice of colors: methods that do not enforce a small residual can produce quite odd colors in the albedo field. For this particular example, the method of [26] produces very strongly saturated colors (and has a very poor WHDR). The method of [45] (which gets a strong WHDR on this scene) produces highly desaturated colors; [27] have a better WHDR and somewhat less desaturated colors. These methods use postprocessing procedures to impose a piecewise constant albedo. While this results in WHDR improvements, the resulting albedo fields may be hard to use. In particular, they display a "colored paper" effect, where surfaces look as though they are made of flat colored paper.
Figure 7 shows comparisons to other methods on a demanding outdoor image with dark shadows. The method of [25] produces very strong shading recovery at the cost of a strong "colored paper" effect. A piecewise constant albedo is entirely consistent with the spatial model underlying intrinsic image estimation. However, relatively few surfaces behave as if they have piecewise constant albedo. For example, the cupboard doors in Figure 6 likely do not have piecewise constant albedo, and moving the wood grain effect to the shading - as the method of [27] does - is likely a mistake: the grain is the result of real variations in albedo. More difficult is handling shading at narrow grooves in surfaces (for example, between the cabinet drawers in Figure 6). The narrow dark shadows here are clearly a shading effect, but they behave like an albedo effect. As [63] noted, deep grooves are hard to illuminate and so are dark for almost all shading fields, an effect confirmed in [64]. This means they behave very largely like intrinsic properties. Whether these effects belong in an albedo map or a shading map likely depends on the application. If, for example, one wishes to report physical albedo, then they should appear in shading, though this application is uncommon. Alternatively, one may wish to reshade images, insert objects, and so on. For these applications, it is likely better for these effects to appear in albedo. No current normal recovery method can resolve these effects, and leaving them out of albedo means reshading will omit them. One alternative is to recover them in a separate layer, distinct from shading and albedo. Our method can do this, using a straightforward extension (Figure 12). While quantitative evaluation methods for this kind of decomposition do not exist, qualitatively the thin bars appear in sensible places, at some cost in WHDR.

Qualitative evaluation
As the qualitative comparison shows, all intrinsic image methods suffer from indecisiveness. Albedo and shading reports are quite strongly correlated, likely because nothing forces the method to "make up its mind" -a shadow typically results in a dark patch in both albedo and shading (for example, the dark fridge in Figure 6; Figure 13). While this does not appear to cause problems for WHDR score, reports with this property must be inaccurate. Our method is somewhat less subject to this effect than the others shown in Figure 6 for that image. However, the effect appears strongly in Figure 7 for ours and other methods.

CONCLUSION
This paper has demonstrated a novel approach to intrinsic image decomposition. The method relies entirely on authored spatial models of the intrinsic components required. These paradigms serve as a convenient encoding of priors. A decomposition network is trained to (a) decompose authored paradigm images correctly and (b) produce albedo and shading layers for real images that are "like" paradigms at short spatial scales. Long scale error is controlled by a process of averaging over translations, rotations and scales. The method achieves better WHDR than any current unsupervised method. Qualitative evaluation suggests that the method's albedo maps may have advantages in computational photography applications, as they do not display "colored paper" effects and they do capture groove shading as an intrinsic (rather than extrinsic) phenomenon. The method can be extended to represent other intrinsic effects, by supplying spatial paradigms.

APPENDIX A NETWORK DETAILS
We have engaged in no organized search over network architectures, and do not claim either network to be optimal.

A.1 Decomposer
The decomposer is our implementation of a U-net with skip connections. The encoder accepts 128 × 128 × 7 tiles (3 color dimensions, 4 from the location code). A single layer of 1 × 1 convolutions increases the dimension of the input, which is then subjected to five convolutional layers, each of kernel size 4, stride 2 and no padding. Each layer uses a leaky ReLU (0.2) as a nonlinearity; it is possible a ReLU would have been a better choice [65], but we have had no problems with stability in training. The decoder accepts the resulting code, and applies five up layers, each of kernel size 5, stride 1, and padding 2. An up layer consists of: stacking the input block with the corresponding block from the encoder (which has the same set of spatial dimensions, whence the choice of padding and stride), applying a convolutional layer to the result, then upsampling by 2 using bilinear interpolation. Finally, a 1 × 1 convolution projects to 3 dimensions, and a tanh nonlinearity is applied.
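A simplified sketch of this architecture (here the encoder convolutions are padded and upsampling happens before stacking, so that skip connections align exactly on 128 × 128 tiles without cropping; widths are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Decomposer(nn.Module):
    """U-net sketch: 7-channel input (image + location code), colored albedo,
    shading, and an illuminant color estimate as outputs."""
    def __init__(self, in_ch=7, w=32):
        super().__init__()
        self.lift = nn.Conv2d(in_ch, w, 1)
        self.enc = nn.ModuleList(
            nn.Conv2d(w * 2 ** i, w * 2 ** (i + 1), 4, 2, 1) for i in range(5))
        self.dec = nn.ModuleList(
            nn.Conv2d(w * 2 ** (i + 1) + w * 2 ** i, w * 2 ** i, 5, 1, 2)
            for i in reversed(range(5)))
        self.head = nn.Conv2d(w, 4, 1)        # 3 albedo + 1 shading channels
        self.color = nn.Linear(w * 32, 3)     # illuminant color from bottleneck

    def forward(self, x):
        x = self.lift(x)
        skips = []
        for conv in self.enc:
            skips.append(x)
            x = F.leaky_relu(conv(x), 0.2)
        c = self.color(x.mean(dim=(2, 3)))    # (B, 3)
        for conv in self.dec:
            x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
            x = F.leaky_relu(conv(torch.cat([x, skips.pop()], dim=1)), 0.2)
        out = torch.tanh(self.head(x))
        a = 0.5 * (out[:, :3] + 1)            # albedo in [0, 1]
        s = 0.5 * (out[:, 3:4] + 1)           # shading in [0, 1]
        return a, s, c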

A.2 Discriminator
The standard discriminator consists of four convolutional layers, each with kernel size 4 and stride 2; there is no padding, and there is a bias. The nonlinearity is a leaky ReLU (0.2) and each layer is spectrally normalized. This produces an 8 × 8 × 1 block of activations u. The training loss for the discriminator is computed by averaging max(0, 1 + yu) over this block, with y = 1 when the batch consists of generated images and y = −1 when it consists of real images. The loss that the discriminator produces for the generator is computed for a batch of generated images, by averaging max(0, 1 − u) over this block.

APPENDIX B WEIGHTS AND CODES
Location codes: Each RGB tile is stacked with four code tiles. Each code tile represents distance to one of the four edges of the RGB tile, with the (i, j)'th location in the k'th code tile containing max(0, 40 − dist([i, j], edge_k)).
Weighting window: The weight window is the pointwise minimum of four separate x and y weighting windows. The (i, j)'th pixel of the first x-weighting window is

(1 − e^{(1−j)/40})/(1 − e^{−1}) for j ≤ 40, and 1 otherwise.
The second x-weighting window is a reflection of the first; the y-weighting windows are transposes of the x-weighting windows.
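Both the code tiles and the weight window can be computed directly (a sketch; the window below rises from 0 at the tile boundary to 1 over a 40-pixel band, matching the formula above as we have reconstructed it):

import numpy as np

def location_code(size=128, band=40):
    """Four code tiles: channel k holds max(0, 40 - distance to edge k)."""
    idx = np.arange(size, dtype=np.float32)
    to_low, to_high = idx, size - 1 - idx           # distance to the two edges
    code = np.empty((4, size, size), np.float32)
    code[0] = np.maximum(0, band - to_low)[None, :]   # left edge
    code[1] = np.maximum(0, band - to_high)[None, :]  # right edge
    code[2] = np.maximum(0, band - to_low)[:, None]   # top edge
    code[3] = np.maximum(0, band - to_high)[:, None]  # bottom edge
    return code

def weight_window(size=128, band=40):
    """Pointwise minimum of the four 1-D weighting windows."""
    j = np.arange(1, size + 1, dtype=np.float32)
    w = np.where(j <= band,
                 (1 - np.exp((1 - j) / band)) / (1 - np.exp(-1)), 1.0)
    w2 = np.minimum(w, w[::-1])                      # first window and its reflection
    return np.minimum(w2[None, :], w2[:, None])      # combine x and y windows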

APPENDIX C POLISHING ALBEDO AND SHADING
Write i for the image at some pixel, a for the albedo estimate at that location (which is colored, hence a vector), and s for the shading estimate. We wish to compute updates to albedo and shading so that i − (a + δa)(s + δs) = 0 to first order. Write r = i − as. Then, to first order, r = aδs + sδa. This is three equations in four unknowns; taking the minimum norm solution (substitute δa = (1/s)(r − aδs), then minimize δa^T δa + δs² with respect to δs) yields

δs = (r^T a)/(a^T a + s²) and δa = (1/s)(r − aδs),

the updates used in section 3.5.

APPENDIX D LOSSES
Write a (resp. s, c) for true and â (resp. ŝ, ĉ) for predicted albedo (resp. shading, color) for image I. We have

L_a = C(a, â), L_s = C(s, ŝ), L_c = C(c, ĉ) and L_r = C(I, â • [ŝĉ]),

where C compares two fields (we use the mixed L1-L2 loss of [] for albedo and shading, and L2 for color). For real images, true albedo (resp. shading) is not known. We assume the illuminant for real images is not colored.

APPENDIX E ADVERSARIAL SMOOTHING
Our procedure is as follows. Write θ for the generator's parameters and φ for the discriminator's parameters. Write R for a batch of N real pairs, r_i for the i'th example from that batch, T for a batch of training pairs, etc., g(·; φ) for the discriminator (a parametric function that maps a pair to a number) and h(x, y) = max(0, 1 − xy) for the hinge loss. Note that r_i is a function of the map parameters θ, because it was generated by applying the map to an image. Assume that we have estimates θ_k, φ_l of θ, φ. We update φ_l by taking an optimizer step using the gradient

∇_φ (1/2N) Σ_i [h(g(t_i; φ_l), −1) + h(g(r_i; φ_l), 1)]

to obtain φ_{l+1}. We now add the following term to the gradient with respect to θ:

α ∇_θ (1/N) Σ_i g(r_i(θ); φ_{l+1}),

and update θ_k by taking an optimizer step using the resulting gradient (α is as before).
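A sketch of one round of this procedure (base_loss stands for the real-image terms L_rr and L_range; the sign conventions follow section 3.6, and the optimizer plumbing is illustrative):

import torch

def adversarial_step(decomposer, disc, d_opt, g_opt, real_tiles,
                     paradigm_fields, base_loss, alpha=1.0, d_steps=1):
    """Some discriminator steps with the decomposer fixed, then one
    decomposer step whose gradient adds the adversarial term."""
    def fields(tiles):
        a, s, _ = decomposer(tiles)
        return torch.cat([a, s], dim=1)   # 4-channel field the discriminator scores

    for _ in range(d_steps):              # discriminator steps, decomposer fixed
        d_opt.zero_grad()
        with torch.no_grad():
            gen = fields(real_tiles)
        loss_d = (torch.relu(1 + disc(paradigm_fields)).mean()   # training pairs, y = -1
                  + torch.relu(1 - disc(gen)).mean())            # generated pairs, y = +1
        loss_d.backward()
        d_opt.step()

    g_opt.zero_grad()                     # decomposer step, discriminator fixed
    gen = fields(real_tiles)
    loss_g = base_loss(real_tiles) + alpha * disc(gen).mean()
    loss_g.backward()                     # stray disc grads are cleared by the
    g_opt.step()                          # next d_opt.zero_grad()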