The Cube++ Illumination Estimation Dataset

Computational color constancy has the important task of reducing the influence of the scene illumination on the object colors. As such, it is an essential part of the image processing pipelines of most digital cameras. One of the important parts of computational color constancy is illumination estimation, i.e. estimating the illumination color. When an illumination estimation method is proposed, its accuracy is usually reported by providing the values of error metrics obtained on the images of publicly available datasets. However, over time it has been shown that many of these datasets have problems such as too few images, inappropriate image quality, lack of scene diversity, absence of version tracking, violation of various assumptions, GDPR regulation violation, lack of additional shooting procedure info, etc. In this paper, a new illumination estimation dataset is proposed that aims to alleviate many of the mentioned problems and to help the illumination estimation research. It consists of 4890 images with known illumination colors as well as with additional semantic data that can further make the learning process more accurate. Due to the usage of the SpyderCube color target, for every image there are two ground-truth illumination records covering different directions. Because of that, the dataset can be used for training and testing of methods that perform single or two-illuminant estimation. This makes it superior to many similar existing datasets. The dataset, its smaller version SimpleCube++, and the accompanying code are available at https://github.com/Visillect/CubePlusPlus/.


I. INTRODUCTION
THE human visual system is able, in some conditions, to recognize colors despite the influence of the illumination on their appearance through the ability known as color constancy [1]. It is not yet fully understood how this ability functions and therefore it is not possible to directly model it. Nevertheless, various computational color constancy methods are used in the pipelines of digital cameras. They are usually designed to first identify the chromaticity of the light source and then to remove its influence on the scene. The latter is described in detail in [2]-[5]. For both of these tasks the commonly used image formation model that also includes the Lambertian assumption is usually given as
$$f_c(x) = \int_{\omega} I(\lambda, x) R(\lambda, x) \rho_c(\lambda) \, d\lambda \qquad (1)$$
where x is a pixel in the image f, c ∈ {R, G, B} is the color channel, λ is a wavelength in the visible light spectrum ω, I(λ, x) is the spectral distribution of the light source, R(λ, x) is the surface reflectance, and ρ_c(λ) is the camera sensitivity for the color channel c. It is often assumed that the scene illumination is uniform. This means that the spatial information is not required in the illumination estimation, so x can be dropped from I(λ, x) and the observed light source color e = (e_R, e_G, e_B)^T is given per channel as
$$e_c = \int_{\omega} I(\lambda) \rho_c(\lambda) \, d\lambda$$
For a somewhat satisfying color correction it is already enough to know the direction of e [6], which means that e can be described by chromaticities instead of colors. For example, r, g, and b chromaticity components are calculated as R, G, and B color components divided by their sum so that r + g + b = 1. Thus, knowing only two of them is enough.
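The chromaticity calculation described above can be sketched in a few lines. This is only a minimal illustration assuming NumPy; the function name `rgb_to_chromaticity` is hypothetical and not part of the dataset's accompanying code.

```python
import numpy as np


def rgb_to_chromaticity(rgb):
    """Divide the R, G, B components by their sum so that r + g + b = 1."""
    rgb = np.asarray(rgb, dtype=np.float64)
    total = rgb.sum(axis=-1, keepdims=True)
    if np.any(total <= 0):
        raise ValueError("color components must have a positive sum")
    return rgb / total


# Two illumination colors with the same direction but different intensity
# map to the same chromaticity, so only the direction of e is retained.
e_dim = rgb_to_chromaticity([0.2, 0.4, 0.2])
e_bright = rgb_to_chromaticity([1.0, 2.0, 1.0])
```

Since the three components sum to one, storing any two of them is sufficient to recover the third.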
Since there are more unknowns than equations, illumination estimation is an ill-posed problem and additional assumptions have to be made in order to tackle it. Because of that, numerous illumination estimation methods with various assumptions have been proposed and they are often divided in two groups: the low level statistics-based methods and the learning-based methods.
While the number of the proposed illumination estimation methods is ever growing, there are not too many illumination estimation datasets and even the existing ones have various problems. These include too few images, inappropriate image quality, lack of scene diversity, multiple poorly synchronized versions of the same dataset, violation of various assumptions, etc. A high-quality illumination estimation dataset should be:
• Diverse. The more content and illumination cases are covered, the higher the testing quality.
• Large. It is important that the datasets are not only diverse, but that they also contain many images for each particular case. This makes it possible to notice quality improvement even for rare cases [47].
• Informative. A dataset should contain as much information about each captured image as possible, namely the information available during the shooting procedure, meta-information about scene properties, information about light sources from different angles, etc.
• Updatable. Every illumination estimation dataset usually contains ground-truth illumination errors. Because of that, the dataset infrastructure should provide a simple and reliable way of debugging the dataset and tracking its versions.
• Verifiable. From the previous point it follows that the dataset should be available for verification, namely all provided markup and ground-truth can be collected and, if necessary, recreated by anyone who just downloads the source images.
• Accessible. The value of a dataset decreases when the downloading process is too complicated or time consuming.
• GDPR compliant. Even a very good dataset can be of limited use for European researchers if it is not compliant with GDPR, because it may prevent the researchers from publishing some of their results without breaking the regulations.
In this paper a new illumination estimation dataset named Cube++ with all of these properties is described. It contains 4890 images (see Fig. 2) carefully calibrated so as to get highly accurate ground-truth illumination. The images were collected in numerous countries, places, and illumination conditions. The countries in question include Austria, Croatia, Czechia, Georgia, Germany, Romania, Russia, Slovenia, Turkey, and Ukraine. In order to enable easy selection of images with specific properties, each image is accompanied by additional semantic information such as whether there are shadows in the image, whether it is an indoor or an outdoor image, whether the scene contains objects with known coloration, etc. An example of an image from Cube++ is shown in Fig. 1. The dataset is appropriate for different light source estimation use cases such as: single light source estimation, two light sources estimation, or estimation of at least one significant light source in the scene. Finally, some of the collected images were not included in the dataset and they are kept aside to be released later as part of a future illumination estimation benchmark somewhat similar to [48].
The paper is structured as follows: Section II describes the most important existing illumination estimation datasets and the problems associated with them, Section III gives the motivation for creating a new dataset, Section IV describes the methodology used to collect the dataset, Section V describes the newly proposed Cube++ dataset, Section VI presents a discussion about the scientific usefulness of contemporary datasets' form and about a potential improvement, and finally, Section VII concludes the paper.

II. INFLUENTIAL EXISTING DATASETS
One of the first illumination estimation datasets with a large number of images was the GreyBall dataset [49]. A gray ball was placed in the scene of each of the 11346 images to extract the ground-truth illumination. The main problem with this dataset is that the images are non-linearly processed and as such they do not comply with the image formation model given in Eq. (1). Furthermore, the images in the GreyBall dataset are relatively small with a size of 240 × 360. Finally, the images were extracted from a video that was captured at several locations, which means that many of them have highly correlated illuminations and content. To cope with this problem of high redundancy, it has been proposed to use only a subset of 1135 images from GreyBall [50], [51].
In 2008, the ColorChecker dataset [25] with its 568 images was published and the ground-truth illumination was extracted by means of putting a color checker instead of a gray ball in the image scenes. This dataset was created by two different cameras and its images, which are individually bigger than the ones in the GreyBall dataset, also underwent non-linear processing, which means that similarly as with the GreyBall dataset they are given as 8-bit per-channel JPEG images.
In 2011, the reprocessed version of the ColorChecker dataset that contains only linearly processed images was published [52]. However, as observed already in 2013 [53], it was not mentioned clearly enough that the black level was supposed to be subtracted before using the images. Despite this observation, a lot of papers continued publishing results of methods obtained on the technically unprepared images with the black level included. This effectively led to the circulation of at least three versions of the ColorChecker dataset and the problem was formally addressed in [54] by also bringing into question the alleged advances in the illumination estimation research. In 2018, there was an attempt to rehabilitate the ColorChecker dataset by publishing the recalculated accuracy of various methods by using the allegedly correct ground-truth [55]. However, this attempt was marred by serious technical faults and wrong calculations that included comparing the estimations obtained on older versions to the new ground-truth, which only introduced further confusion [56]. This effectively opens the possibility of more future versions of the results on the ColorChecker dataset. In short, using the ColorChecker dataset can be very confusing and problematic due to many circulating versions of the alleged results and consequent inappropriate comparisons, and therefore, to avoid problems, it should probably be omitted as the primary dataset choice.
In 2014, nine new NUS datasets with each of them taken by one of nine different cameras were published [18]. The images were only linearly processed, the black level subtraction was performed from the start in the initial paper, and the number of images was sufficiently high. The calibration object used to extract the ground-truth illumination was again a color checker. However, the problems with the NUS datasets include violations of the uniform illumination assumption when having only a single ground-truth illumination, a relatively small number of images per camera with 268 being the maximum, having the same scenes repeated in images, and, like all of the previous datasets, not being GDPR compliant.
In 2017, the Cube dataset [44] was published with 1365 images taken with a single camera and with a SpyderCube calibration tool used for calibration. Due to its geometry that is superior to the one of a color checker, SpyderCube allows for easier detection of the presence of two illuminations and their extraction. This was extensively used to carefully calibrate each of the images of the Cube dataset and to obtain an accurate ground-truth. Special care was also taken to avoid the violation of the uniform illumination assumption as much as possible. The main drawback of the Cube dataset is that it contains only outdoor images, which also negatively affects the ground-truth illumination distribution. This drawback was alleviated in the Cube dataset's extension named the Cube+ dataset [44]. It contains 342 additional indoor images for a total of 1707 images and a wider span of ground-truth illumination distribution similar to the one in other datasets.
A relatively recent dataset is the INTEL-TAU dataset [57], a successor to the INTEL-TUT dataset [58], with 7022 images taken by three different cameras. While the number of images is sufficiently high, its main drawback is the fact that most of its images do not contain a calibration object in their scenes. Namely, it was removed after the initial calibration. Although this removes the requirement for masking it out, it also makes it impossible to reliably check and verify the ground-truth calibration and it is known that such errors occur [59]. Additionally, since the original raw image files are not provided, the EXIF data with the meta-information that may be important to some methods is also not available. The INTEL-TAU dataset is also completely GDPR compliant. Instead of avoiding problematic scenes, GDPR compliance was achieved by having "privacy masking applied on all sensitive information" such as "recognizable faces, license plates, and other privacy sensitive information". The masking was performed so that "color component values inside the privacy masking area were averaged". However, this effectively changes the original content and it may be undesirable in some cases.
Another relatively recent dataset is the one for temporal color constancy [60], which contains 600 sequences of varying length between 3 and 17 frames. The dataset has not yet been made publicly available at the moment of writing this paper.
It is also important to mention that in contrast to all the described datasets that contain real-world images taken in mostly uncontrolled conditions, there are a lot of datasets made in fully controlled or even laboratory conditions, such as [6], [61]-[66].
The main advantage of laboratory datasets is that they allow a particular problem to be researched in fully controlled conditions, but the variability of such datasets is often too low.
While other illumination estimation benchmark datasets also exist, it can be argued that the ones mentioned here are the most influential ones. They also share many problems with other existing datasets and thus their descriptions also cover most of the problems of other datasets. Some characteristics of the datasets mentioned here are summarized in Table 1.

III. MOTIVATION
After laying out the brief descriptions of some of the best known illumination estimation benchmark datasets, it is possible to identify some of their main problems already recognized by the wider interested research community. Therefore, the motivation for creating a new illumination estimation dataset is to try to reduce or entirely eliminate some of the mentioned problems of the existing datasets.

A. SIMPLE TECHNICAL FAULTS
Probably the most serious and most detrimental problem is the one connected to the technical shortfalls that can happen when creating and publishing a dataset. Some of the main such shortfalls are using non-linearly processed images and providing confusing information on black level subtraction.
As for the non-linearly processed images, the solution is simply to avoid performing non-linear processing, which is easy to carry out.
In the case of the black level subtraction, with the earlier datasets this problem occurred due to the lack of an explicit mention of the black level value in the papers that originally described these datasets. Additionally, in some cases even a script that demonstrates the proper handling of the black level was either missing or put in a somewhat obscure location. In the case of the ColorChecker dataset such problems have led to multiple circulating versions of the ground-truth data and experimental results. Therefore, when publishing a new dataset, these and similar problems are a motivation to clearly provide all details required for the black level subtraction and also to provide an example of how to perform it.

B. RELIABLE GROUND-TRUTH
One of the probably least detectable technical faults with serious consequences is erroneous calibration and ground-truth illumination extraction. Based on the experience with existing datasets, it usually happens that there are multiple illuminations in the scene and the calibration object is under the influence of only one of them, which may not even be the dominant one. In that case, even if a method estimates the dominant illumination, it will be penalized because the ground-truth is based on another illumination. As mentioned earlier, this was already reported for the ColorChecker dataset.
To make the ground-truth reliable, one should use calibration objects that can detect the presence of multiple illuminations. Examples of such calibration objects include a gray ball such as in the GreyBall dataset or a SpyderCube instance such as in the Cube+ dataset, because they make it possible to simultaneously observe illuminations coming from different angles, and these can then be checked for any significant difference. An example of capturing two significantly different illuminations with a SpyderCube instance and showing the difference in how they affect color correction is given in [44]. If a significant difference is present, additional steps can be taken to either correctly determine which of the illuminations is the dominant one or to discard the image to prevent any future problems, which finally results in a correctly extracted and reliable ground-truth illumination.
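The check for a significant difference between two illuminations observed on a calibration object can be illustrated with the angle between the two color vectors. The following is only a sketch assuming NumPy; the function names and the threshold of 3 degrees are hypothetical choices for illustration and are not values taken from the dataset creation procedure.

```python
import numpy as np


def angle_between_deg(a, b):
    """Angle in degrees between two illumination color vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def has_two_illuminations(left, right, threshold_deg=3.0):
    """Flag an image whose two observed illuminations differ noticeably.

    threshold_deg is an assumed, illustrative value.
    """
    return angle_between_deg(left, right) > threshold_deg
```

Because the angle depends only on the direction of the vectors, the check is insensitive to differences in brightness between the two observed faces.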

C. VERIFIABLE GROUND-TRUTH
While the ground-truth should primarily be reliable, it should also be verifiable in order to add an additional layer of reliability. The simplest way of making the ground-truth of a dataset verifiable is to have all the dataset images contain a calibration object in their scenes. In that way the ground-truth can easily be extracted by other researchers and then compared to the originally provided one to look for potential errors. Additionally, the visual information itself can help identify cases such as having the calibration object in a shadow while the majority of the scene is outside of that shadow.

D. CONTENT VARIETY
A new illumination estimation dataset should have a high content variety. While this seems rather obvious, it is not always put into practice to the full extent. For example, while the GreyBall dataset contains over 11k images, they are highly correlated and thus effectively not as rich in content as it may seem at first. In the case of datasets such as the ColorChecker dataset or the NUS datasets, all images were taken at the same geographical location and during the same season. None of the images there were taken e.g. during winter or at night. Such content choice restriction results in failure to cover many interesting and challenging environments that illumination estimation methods encounter in real-world applications and that should also be included in the research.

E. ILLUMINATION VARIETY
Having an appropriate ground-truth illumination variety in an illumination estimation dataset is important for several reasons. The most important one is to cover as many as possible of the illuminations that are encountered in real-world applications because in that way the illumination estimation methods can be properly trained and tested. In most existing datasets the chromaticities of the illuminations are rarely too far away from the Planckian locus [67]. This means that less common colors of the artificial illumination sources are also effectively excluded. Therefore, another thing to consider when creating a new illumination estimation dataset is the inclusion of such less common ground-truth illuminations.
An additional reason to have a sufficient ground-truth illumination variety is to prevent abuses of some often used error statistics that are possible if the ground-truth illuminations are too clustered [68]. Such abuses can lead to false conclusions about the performance of the tested methods and consequently be detrimental for the research community and practitioners.

F. CHECKING FOR MULTIPLE ILLUMINATIONS
The majority of the illumination estimation datasets provide only a single ground-truth illumination per image. This effectively means that in terms of evaluation these datasets implicitly assume a uniform illumination. However, it is known that in illumination estimation datasets this is generally not the case [56]. As a matter of fact, any image with shadows has already effectively at least two illuminations that may differ significantly and this can also have a significant effect on the later color correction step [44]. Additionally, even if there are no shadows, it is still possible for an image to be under the influence of multiple illuminations. In that case a calibration object that is designed to capture the illumination from only one direction at a time will fail to capture all the illuminations in the scene, let alone to detect their presence. Capturing only a single illumination when there are more present also leads to a problem during the evaluation. Namely, if a method correctly estimates one of the illuminations, but the other one is marked as the ground-truth, it may be argued that in this case the method is being unfairly judged. Because of that, an illumination estimation dataset should preferably use calibration objects that can simultaneously capture the illumination color from multiple directions. This would solve at least two problems. First, it would detect whether there are multiple illuminations in the first place, and second, if there really are multiple illuminations in the scene, then such a calibration object will capture more information on them. An example of such a calibration object is the SpyderCube object that has been described earlier.

G. NUMBER OF IMAGES
While some of the previous datasets with non-linearly processed images are obviously disadvantageous, some of them like the GreyBall dataset have the advantage of having thousands of images, which still makes them attractive to many researchers. Therefore, besides having a technically correct dataset, it is also important to make it have a sufficiently large number of images. This can result in both making the dataset desirable by offering a lot of useful data as well as simultaneously discouraging researchers from using the inferior older datasets just because of their size. As for how large exactly a dataset should be, it should contain several thousands of images to outsize the existing datasets of lower quality and also to enable new breakthroughs. Finally, having a dataset with a large number of images is a prerequisite for achieving the previously mentioned content and illumination variety.

H. SEMANTIC DATA
In numerous cases, additional semantic information can be useful for research of specific illumination estimation methods. For example, some of the methods may be interested in being explicitly trained only on indoor or outdoor images. Others may be interested in training images that contain no shadows whatsoever since they introduce additional illuminations. More generally, it may be useful to know whether there is a violation of the uniform illumination assumption on a given image. In such cases, it can be highly practical to be able to efficiently filter out images from a dataset based on some given criteria.
Because of that, a useful addition to a new illumination estimation benchmark dataset would be semantic information for each image. In that way, the research could be sped up by not requiring researchers to label the images from scratch. Additionally, if such semantic information were given in advance, a lot of potential label mismatches between the labels created by different researchers could also be prevented.
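Filtering images by semantic criteria can be illustrated with a small sketch. The record layout, the field names `indoor` and `has_shadows`, and the image names below are purely hypothetical and are not the actual annotation schema of any dataset; they only show how per-image labels make such a selection a one-line operation.

```python
# Hypothetical per-image semantic records (illustrative field names only).
records = [
    {"image": "img_a", "indoor": False, "has_shadows": True},
    {"image": "img_b", "indoor": False, "has_shadows": False},
    {"image": "img_c", "indoor": True, "has_shadows": False},
]


def select_images(records, **criteria):
    """Return names of images whose labels match all given criteria."""
    return [
        rec["image"]
        for rec in records
        if all(rec.get(key) == value for key, value in criteria.items())
    ]


# Outdoor images without shadows, e.g. for a method that assumes
# a single uniform illumination.
outdoor_no_shadows = select_images(records, indoor=False, has_shadows=False)
```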

I. PRIVACY CONCERNS
With the recent arrival of regulations such as the General Data Protection Regulation (GDPR), it becomes ever more important to respect privacy in publicly available images. This also means that using images from previous datasets with e.g. recognizable people or registration plates may nowadays be potentially seen as problematic. With respect to this, for the sake of respecting privacy, any new illumination estimation dataset should also take care of avoiding images that would contain any content that could compromise someone's privacy.
On the other hand, if a public dataset is also supposed to be useful for development of methods that rely on e.g. faces [69], [70] or sclera [71], then it should obviously also contain images with faces. However, in that case it would be appropriate to obtain the consent for public use from the persons present in the image scenes. That would enable the researchers to use and show these images publicly in papers.

J. MULTIPLE INSTANCES OF THE SAME SENSOR CLASS
There can be significant differences between spectral characteristics of different sensors used by various cameras. This effectively means that a learning-based method that has been successfully trained on the images created by a camera of one model will not necessarily perform well on images created by a camera of another model without some adjustments. As a result, the problem of inter-camera color constancy has recently started to gain ever more attention [44], [45], [72]. Since almost every dataset was taken with another camera sensor, there is no shortage of training and testing images.
On the other hand, it is known and it can be shown that even for the instances of the same sensor class there are measurable differences in the spectral characteristics [73]. Hence, to check the significance of the impact of these differences on the accuracy of illumination estimation methods, an interesting feature of an illumination estimation dataset would be to have images created by several instances of the same sensor class. In addition to ground-truth illumination, such a dataset would also have to provide the sensor instance labels for each image.

IV. ACQUISITION METHODOLOGY
By identifying the problems with the existing datasets and describing some desired properties of the future datasets, the guidelines for creating a new illumination estimation dataset have been laid out. One of the main goals of this paper is not just to provide a theoretical framework, but also to create and propose a dataset by following these guidelines. The first step in doing so is to describe the used acquisition methodology.

A. TECHNICAL SETUP
1) Color target (SpyderCube) characterisation
As the calibration tool in the newly proposed dataset, the SpyderCube instances were used. SpyderCube is a color target for photographers whose main purpose is to help them to adjust the white balance manually. The general look of the SpyderCube is given in Fig. 3. A chrome ball is used to analyse specular highlights, the white on two faces is used to estimate true highlight value, the gray on two faces represents the midtone of the image and its color temperature, and the bottom black face is used to evaluate shadow values in the scene in relation to the black trap i.e. the hole, which represents absolute black.
According to the manufacturer company Datacolor, the gray cube faces are neutral gray with a reflection coefficient of 18%.
Since SpyderCube is a relatively low-cost tool, some doubts about its declared optical properties could arise. To validate its properties, two SpyderCube instances, labeled SC1 and SC2, were compared. Individual faces of these SpyderCube instances were named G1, G2 and W1, W2, as shown in Fig. 3.
Reflection spectra of SpyderCube parts were measured using an Eye-One Pro spectrophotometer by X-Rite in high-resolution mode of 3.3 nm with the help of the spotread utility from Argyll CMS. For each part, three measurements were made and the results were averaged. Figures 4 and 5 show the spectral reflection coefficients of the white and gray parts of the SC1 and SC2, respectively. These measurements lead to the following observations:
• Gray parts of both SpyderCube instances are not "ideal" gray, i.e. the reflection spectra slightly depend on the wavelength. The sensitivity of the blue sensor in many cameras has a maximum at around 450 nm wavelength, and the reflection coefficients of gray parts G1 and G2 have a noticeable drop in the blue band.
• Each SpyderCube instance has small differences between reflection coefficients of its own gray parts G1 and G2.
• There are rather big differences between the gray parts reflection coefficients of the two measured SpyderCube instances.
• White parts of both SpyderCube instances are also not "ideal" white, i.e. the reflection spectra are not horizontal lines.
• Differences between the reflection coefficients of the white parts W1 and W2 of both SpyderCube instances are small.
The idea behind SpyderCube as a calibration tool is that it does not distort the color of the illumination source, i.e. it is assumed to be "color neutral". From this point of view, what is important is the similarity between the shapes of the curves of reflection coefficients for the two SpyderCube instances, and not the differences between the curves' values. From the measurements it can be concluded that the curve shapes are indeed very similar. Therefore, the SpyderCube "color neutrality" assumption generally holds with one exception being the blue region of the spectrum for the white faces.
The degree of SpyderCube's color neutrality is one of the most important factors for accurate ground-truth extraction. The height of the grey reflection coefficient curve does not significantly influence the ground-truth extraction. Compared to other uncertainties, the measured deviations from color neutrality have only a rather small impact on the ground-truth.
Nevertheless, to measure the amount of this impact in terms of practical use, several images of two SpyderCube instances that were simultaneously in the same scene were captured with a Canon 600D camera under a D50-like illumination. The average difference between the ground-truth illuminations extracted from the faces of each SpyderCube instance and measured in terms of the angular reproduction error [74] was about 0.15°, which is insignificant and invisible in terms of color reproduction [75].
The performed measurements effectively demonstrate that using grey faces of different SpyderCube instances in different images has no significant effect on the overall ground-truth extraction quality. Still, SpyderCube quality should also be studied in detail for all types of complicated artificial light sources (such as gas discharge lamps, etc.).
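For illustration, the comparison of two extracted illuminations can be sketched as follows. The sketch assumes NumPy and the commonly used definition of the reproduction angular error, i.e. the angle between the white surface reproduced with the second illumination and the ideal white (1, 1, 1); the function name is hypothetical and the code is not taken from the dataset's accompanying scripts.

```python
import numpy as np


def reproduction_error_deg(e_reference, e_other):
    """Reproduction angular error in degrees between two illuminations.

    A white surface lit by e_reference and corrected with e_other
    yields e_reference / e_other, which is compared to ideal white.
    """
    e_reference = np.asarray(e_reference, dtype=np.float64)
    e_other = np.asarray(e_other, dtype=np.float64)
    white = e_reference / e_other
    cos = white.sum() / (np.linalg.norm(white) * np.sqrt(3.0))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```

Under this definition, two illuminations that differ only in brightness give an error of zero, so the measure reflects only the color difference between the two SpyderCube instances.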

2) Handheld setup
To collect the dataset in natural conditions, the following equipment shown in Fig. 6 was used:
• Canon 550D camera or Canon 600D,
• SpyderCube calibration tool, and
• special attachment of the cube to the camera.
Special cube fasteners were built that allow the cube to be positioned so that it appears near the lower right corner of the image. The fasteners can also be rotated both in horizontal and vertical planes. The distance of the cube from the camera can be adjusted using a telescopic monopod and during the capturing of the dataset images it was set to 50 cm. The experience gained while collecting the dataset images has led to the conclusion that the custom-built handheld setup is convenient to use.

B. DATA COLLECTION AND FILTRATION
The main thing to pay attention to during the image capturing was to assure that the used target cube and the majority of the observed scene are under the same illumination or illuminations. Examples of images with scenes where this requirement was not met are shown in Fig. 7.
Another significant factor that prevents accurate ground-truth extraction is the occurrence of glare on the color target. Images with this issue are usually characterized by clipping of the values in one of the color channels on the gray or white faces of the color target. The overexposure can be avoided in at least two ways: either by using manual camera settings or by specifying relative exposure compensation. Dimming by one step usually turned out to be enough during the image collecting. Manual camera settings and brightening by one step can also help to properly deal with the overly dark images. Examples of an overexposed and a too dark image are shown in Fig. 7c and Fig. 7d, respectively.
It should again be mentioned that there may often be several different illumination sources in one scene, commonly two, e.g., sun and sky or sky and streetlight. In this case, especially if the areas of the scene parts illuminated by different light sources are comparable, it is preferable to place the cube so that both illumination colors are captured by different cube faces. By doing so, it is later possible to simultaneously extract the illumination color of both influential scene light sources.
Fig. 7: Examples of images that should be excluded from the dataset: a) the color target is illuminated by a local lantern from a nearby shop, whose color is different from the lighting of most of the scene; b) the color target is illuminated by sources that have almost no effect on the lighting of the observed scene; c) overexposed color target; d) an overly dark image.
One of the main problems during image acquisition was to find the right position for the photographer to avoid differences between illuminations influencing the target cube and most of the observed scene. A lot of interesting scenes are available only in urban areas where there are a lot of different artificial illuminations. However, scenes in urban areas are usually full of different personal data like faces or plate numbers, which means that there are some difficulties related to GDPR. To make Cube++ GDPR compliant, the images with humans in the scene were filtered out and removed. This was done both automatically by using YOLOv3 [76] and manually by additionally checking each image.
During the final quality filtration, all images were divided into three categories: a) images with full light source estimation, where the cube was illuminated by all the main light sources in the scene, b) incorrect images, where the cube does not allow an illumination estimate consistent with the scene to be determined, and c) the rest, i.e. images with partial light source estimation, see Fig. 8.
As a result, about 400 images were marked as incorrect and removed from the dataset, about 524 images were marked as difficult images with a partial light source estimation, and the rest of the images were marked as good.
Additionally, the fiber on the top of the cube may fall onto the cube or remain in the image after the color target is cropped out. To prevent this, the fiber was glued to the cube or simply cut off for most images. All of the images were captured horizontally without the use of camera flash.

C. GROUND-TRUTH EXTRACTION
The ground-truth extraction was performed on raw images. First, a simple debayering was performed by transforming each RGGB Bayer pattern square into a single pixel. The red and blue channels of the pixel color were obtained directly from the R and B components of the pattern, while the green was obtained by averaging the two G values. No interpolation was performed and therefore the number of image rows and columns was halved. Next, the oversaturated pixels were masked out and then the black level of $2^{11}$ was subtracted from all pixels. Finally, the ground-truth illumination values were extracted by calculating the average chromaticity of the manually annotated areas of the SpyderCube triangles. Four chromaticities were calculated for every image. They correspond to the white and gray triangles on the left and right cube faces. Note that on brightly illuminated cubes, the white triangles may have oversaturated areas that cannot be properly used. Conversely, the chromaticities of the gray triangles on darker images may not be stable due to the black level noise. It is important to note that no image contains saturated gray edges, while some of the images contain saturated white edges; in such cases, a corresponding mark is provided.
The illumination for a triangle was calculated as the mean over its area after the triangle had been shrunk by 50% toward its barycenter. The value of 50% was selected as a simple empirical trade-off. Namely, a full-size triangle may contain non-triangle pixels because of an unfocused cube or markup inaccuracies, while a tiny triangle would contain too few pixels and would be affected by noise.
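The triangle shrinking and averaging described above can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function names, the (x, y) vertex convention, and the choice to normalize the mean color (rather than to average per-pixel chromaticities) are assumptions, and the input is assumed to be a debayered image with the black level already subtracted and saturated pixels excluded.

```python
import numpy as np

def shrink_triangle(vertices, factor=0.5):
    """Scale a triangle toward its barycenter (factor=0.5 halves its size)."""
    v = np.asarray(vertices, dtype=float)   # shape (3, 2), one (x, y) per vertex
    center = v.mean(axis=0)                 # barycenter of the triangle
    return center + factor * (v - center)

def triangle_mask(shape, vertices):
    """Boolean mask of the pixels that lie inside (or on) the triangle."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    (x0, y0), (x1, y1), (x2, y2) = vertices
    # Edge functions: the sign tells on which side of each edge a pixel lies.
    d0 = (x1 - x0) * (ys - y0) - (y1 - y0) * (xs - x0)
    d1 = (x2 - x1) * (ys - y1) - (y2 - y1) * (xs - x1)
    d2 = (x0 - x2) * (ys - y2) - (y0 - y2) * (xs - x2)
    has_neg = (d0 < 0) | (d1 < 0) | (d2 < 0)
    has_pos = (d0 > 0) | (d1 > 0) | (d2 > 0)
    return ~(has_neg & has_pos)             # inside when the signs do not disagree

def triangle_chromaticity(image, vertices, factor=0.5):
    """Mean RGB over the shrunk triangle, normalized so that r + g + b = 1."""
    mask = triangle_mask(image.shape[:2], shrink_triangle(vertices, factor))
    mean_rgb = image[mask].mean(axis=0)
    return mean_rgb / mean_rgb.sum()
```

A larger `factor` uses more pixels but risks including pixels outside the annotated face, which is the trade-off mentioned in the text.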

D. SEMANTIC MARKUP
When developing and testing an algorithm for illumination estimation in a scene, it is useful to be able to analyze the structure of errors. The average error over the entire dataset will often not reveal whether e.g. the method is much less accurate for indoor images than for outdoor images. To make such checks faster and easier, additional information about the scene and shooting conditions was added to each image in the dataset. In addition to the information available during the shooting, this also includes manual annotation (see Table 2). Finally, it is important to note that none of the fields had a preset default value. In that way, the value of every field had to be explicitly set by the annotating person. Namely, if some default field values were set in advance, this could increase the annotation bias.

V. THE PROPOSED DATASET
Having in mind all of the concerns and motivation from the previous section, a new dataset named Cube++ is proposed that continues on the previous Cube+ dataset. The dataset download link, the accompanying code, and the technical file description are available at https://github.com/Visillect/CubePlusPlus/.
The Cube++ dataset contains 4890 images. It includes only 1359 of the 1707 images from the Cube+ dataset and only 330 of the 363 images from the 1st Illumination Estimation Challenge (IEC2019) test set [77]. The other images were excluded because they may violate privacy by containing personal data such as faces and license plates or because they may be problematic for ground-truth extraction. The remaining 3201 images are brand new.
Cube++ has diverse scene illumination cases as demonstrated by Fig. 9. There it can be seen that the chromaticity coverage area is wider than in e.g. Cube+. In other words, the illumination variability has been significantly improved. The ground-truth illumination distribution can also be seen from another point of view by taking a look at Fig. 10. This figure shows that Cube+ and Cube++ have similar distributions, which in turn means that a lot of images were taken under outdoor daylight illumination.
One of the important features of the proposed dataset is the fact that it contains two ground-truth illumination records per image, one for each side of the SpyderCube instance. Even though in many of the images there is effectively only one dominant illumination in the scene, Fig. 11 helps to better understand the relation between the two recorded illuminations over the dataset images. Currently the average angular error of state-of-the-art illumination estimation methods is arguably somewhere between 1° and 2°. With that in mind, all images with larger angular difference between their illumination records can be treated as two-illumination cases.
Another important feature of the proposed dataset that has to be stressed additionally is that it contains semantic data for each image. All semantic information features are shown in Table 2. Different features provided in the semantic data can be helpful for algorithm tuning and profiling as they give potentially useful information about each image individually.

A. TECHNICAL DESCRIPTION
The dataset consists of several parts. First, there are the raw images with only simple linear debayering performed that are stored in 16-bit PNG format. Next, there are CSV files with ground-truth illumination values and CSV files with additional related properties. Furthermore, there are JPEG images generated by using the dcraw open-source tool 6 . Finally, there are also additional files for storing auxiliary information. All these files are automatically built from sources included in the dataset by running a script that is also provided. The sources contain the original CR2 images from the camera and JSON files with the manual annotation data.
The original camera JPEG images are not included as their generation depends on cameras' settings, which means that they cannot be recreated simply or even accurately [78].

1) PNG and JPEG images
The main 16-bit PNG images are generated from the original CR2 files in three steps. First, the CR2 files are decoded by using the dcraw tool with the options -D -4 -T. This generates a 16-bit 1-channel TIFF image. Second, the $[10, 10+5184] \times [4, 4+3456]$ rectangle is cropped to cover the same area as the default camera JPEG, which comes with certain advantages. Finally, a naive debayering is applied so that every $R, G_1, G_2, B$ pattern is converted to a pixel of color $(R, \frac{G_1+G_2}{2}, B)$. After that, the size of the generated PNG images is $2592 \times 1728 = 2^5 3^4 \times 2^6 3^3$. Even though the color channel values have 16 bits of storage, in practice their maximal value is always below $2^{14} - 1$. The black level that can be used for every dataset image is $2048 = 2^{11}$. For visualization purposes, the modified versions of JPEG images generated by the dcraw tool are included as well. The modification includes cropping and downscaling in order for the JPEG images to have the same size as the PNG images. Downscaling is required because JPEG images generated by dcraw are not downscaled like the PNG images. On the other hand, JPEG images generated by the camera were not included because they depend on camera settings and the camera's white balancing algorithm, which is proprietary, not fully documented, and may differ between the Canon 550D and 600D cameras that were used for image capturing. Because of that, they cannot be recreated reliably.
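The naive debayering step can be illustrated with a short NumPy sketch. It is not taken from the dataset's build script: the function name is hypothetical, and the code assumes that the cropped mosaic starts with an R sample in its top-left corner, i.e. that each 2 × 2 cell is laid out as R, G1 in the first row and G2, B in the second.

```python
import numpy as np

def naive_debayer(mosaic):
    """Collapse every 2x2 RGGB cell of a Bayer mosaic into one RGB pixel."""
    m = np.asarray(mosaic, dtype=np.float64)
    r = m[0::2, 0::2]    # top-left sample of each cell
    g1 = m[0::2, 1::2]   # top-right sample
    g2 = m[1::2, 0::2]   # bottom-left sample
    b = m[1::2, 1::2]    # bottom-right sample
    # No interpolation: the output has half as many rows and columns.
    return np.stack([r, (g1 + g2) / 2.0, b], axis=-1)
```

Applied to a mosaic with 3456 rows and 5184 columns, this yields an array with 1728 rows, 2592 columns, and 3 channels, matching the PNG size stated above.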

2) The ground-truth
The ground-truth illumination records are stored in the gt.csv file. Ground-truth illuminations are calculated as described in Section IV-C. The file contains an image column and, for each of the 4 triangles (left, right, left bottom, right bottom), three columns r, g, b with the corresponding RGB illumination chromaticities so that r + g + b = 1. The triangle brightness values are given in the properties.csv file.
Usually, computational color constancy datasets contain only a single ground-truth illumination vector, which represents the dominant illumination in the scene. In the Cube++ dataset such an illumination is not given, because precise single-illuminant estimation may require annotation by a specialist. Moreover, some images have two significantly different illuminations, which makes it harder to select the dominant one. If only a single ground-truth illumination is required and the possible errors that it leads to are acceptable, then one of the following methods can be used to obtain it:
• sample images with relatively similar left and right ground-truth illuminations (the suggested answers for such images are denoted in properties.csv);
• select the brighter of the left and right sides; then select the white triangle for a dark image and the gray triangle for a bright image.
Note that the difference between the sensitivities of the white faces is greater than the difference between the sensitivities of the gray ones (see Section IV-A). Additionally, since the white faces are more often overexposed than the gray ones, using the gray faces should be preferred. On the other hand, using white faces may be better on dark images as mentioned in Section IV-C.
We also checked whether the ground-truth values are distorted by pixels with clipped values. The images with overexposed gray triangles were removed from the dataset. The images with clipped values on white triangles are present in the dataset, but the overexposed triangles are marked in properties.csv.

3) Relevant meta-information
The properties.csv file contains the most relevant meta-information about the Cube++ images. It includes the average triangle brightness R+B+G, information about overexposed triangles, and a carefully selected subset of EXIF data fields.
The EXIF data was extracted from CR2 files using the PyExifTool library 7 . All the extracted values can be found in the corresponding JSON files. The properties table contains only a few selected ones. The EXIF data format slightly differs between the Canon EOS 550D and 600D cameras: there are 312 common fields, 2 in 550D only, and 21 in 600D only. All the selected EXIF fields are common.
The cam_estimation.csv file contains the EXIF fields that hold the camera's own light source estimation.

B. IMAGE PREPARATION
Finally, it is important to clearly specify how to properly prepare the provided Cube++ images before handing them over to illumination estimation methods that are to be tested.
There are three main steps that have to be taken.

1) Black level subtraction.
The first step is to subtract the approximate black level of 2048 from all image pixel color components. In some cases this can result in negative values, but such values should then be set to 0.
The second step is to calculate the maximum value m for all pixels across all color channels. After that, all pixels that have a value greater than or equal to m − 50 in any of their channels should have all their channel values set to 0. This would remove most of the incorrect pixels with clipped values. Nevertheless, it would leave some rare overexposed pixels, because the demosaicing procedure may mix them with normal ones. To get precise information about saturated pixels, it is recommended to analyze the images before demosaicing (the undemosaiced data can be extracted from the CR2 files).

3) Color target masking.
The last step is to mask out the lower right rectangle of the image that contains the color target to remove any potential bias and thus to have a relatively fair testing. The size of this rectangle is 700 × 1000 for all images. The rectangle is masked out by setting all channel values of all its pixels to 0, i.e. by making it black.
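The three preparation steps above can be combined into one short routine. The following is a minimal sketch rather than the official preparation code: the function and constant names are hypothetical, and the 700 × 1000 rectangle is interpreted as width × height (the same order in which the 2592 × 1728 image size is written), which is an assumption that should be checked against the dataset's technical description.

```python
import numpy as np

BLACK_LEVEL = 2048       # approximate black level stated in the text
SATURATION_MARGIN = 50   # pixels with a value >= m - 50 are treated as clipped

def prepare_image(raw_rgb, rect_width=700, rect_height=1000):
    """Apply black level subtraction, clipped-pixel masking, and target masking.

    raw_rgb: array of shape (rows, cols, 3), e.g. a loaded 16-bit PNG.
    rect_width, rect_height: size of the lower right rectangle to black out
    (assumed here to be given as width x height).
    """
    # Step 1: subtract the black level and set negative values to 0.
    img = np.asarray(raw_rgb, dtype=np.int64) - BLACK_LEVEL
    img = np.clip(img, 0, None)

    # Step 2: zero every pixel that is near the global maximum in any channel.
    m = img.max()
    saturated = (img >= m - SATURATION_MARGIN).any(axis=-1)
    img[saturated] = 0

    # Step 3: black out the lower right rectangle containing the color target.
    img[-rect_height:, -rect_width:] = 0
    return img
```

Note that the second step is relative to the image maximum, so on an image without any clipped pixels it still zeroes the brightest pixels; this follows the rule as stated rather than an absolute saturation level.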

C. INTENDED DATASET USAGE
With all its features, especially the two ground-truth illumination records, Cube++ is appropriate for several illumination estimation use cases. All datasets mentioned in Section II, except perhaps the TCC dataset, are designed for the most widely used classical illumination estimation problem: estimation of a single light source in the scene. Therefore, each image is provided with only a single light source ground-truth, even in cases when the scene is obviously under the influence of multiple illuminations. In contrast to these datasets, Cube++ allows work on the following problem statements:
1) Estimate the one and only dominant light source in a scene;
2) Estimate two dominant light sources in a scene;
3) Estimate at least one dominant light source in the scene.
For each of the listed problem statements we propose the following rules to filter the Cube++ images that are suitable for it. To form the dataset subset for the first problem, one needs to select all images where the angular difference between the two extracted ground-truth illuminations is below 1° (excluding the images with partial light source estimation, see Section IV-B). For the two light source estimation problem, one needs to do the opposite, i.e. to select all images that are not selected for the first problem (again excluding the images with partial light source estimation). Finally, to work on the third problem, all images can be chosen.
Here it is important to mention that the proposed rules are arbitrary, that they will result in some of the images being inappropriately selected, and that they may be improved. One example where these rules fail is shown in Fig. 12. There the extracted ground-truth illuminations differ significantly, but practically the whole scene is mostly under the illumination captured by the right cube face. This means that even though the scene is effectively under uniform illumination, the mentioned rules will result in the opposite conclusion based on the difference between the extracted illuminations on the two cube faces.
It is also important to mention that, at the moment, the part of the dataset with partial light source estimation is not provided with a subjective choice of a single illumination. The plan is to address this in future work.
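The filtering rules proposed in this subsection rely on the angular difference between the two ground-truth illuminations of an image. A minimal sketch of these rules is given below; the function names and the boolean `partial` flag (standing for the partial light source estimation category) are hypothetical, and which pair of triangles is compared for the left and right faces is left to the caller.

```python
import numpy as np

def angular_difference_deg(e1, e2):
    """Angle in degrees between two illumination vectors."""
    e1 = np.asarray(e1, dtype=float)
    e2 = np.asarray(e2, dtype=float)
    cos = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def suitable_problems(left, right, partial, threshold_deg=1.0):
    """Return the problem statements (1, 2, 3) an image is suitable for."""
    if partial:
        return [3]        # partial estimation images are only used for problem 3
    if angular_difference_deg(left, right) < threshold_deg:
        return [1, 3]     # effectively a single dominant illumination
    return [2, 3]         # two noticeably different illuminations
```

Because the angle depends only on the direction of the vectors, the function can be applied directly to the r, g, b chromaticities without any rescaling.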

D. SIMPLECUBE++ DATASET
In addition to the main 200GB Cube++ dataset, a smaller and simpler 2GB version of it has been prepared. The small dataset contains 4x downscaled images that have less than 1° difference between the left and right gray edge ground-truth illuminations. It includes only images with single-source illumination; consequently, its ground-truth file contains only one ground truth per image. This single ground truth was extracted in the following manner: first, average values for both gray triangles were calculated as in the main Cube++ dataset; second, they were normalized with the $l_1$-norm (r + g + b = 1 for both gray triangles); finally, the obtained ground-truth values were averaged and normalized again with the $l_1$-norm. This dataset has two main advantages: small size (around 2GB) and a single answer per image.
SimpleCube++ contains PNG and JPG files, gt.csv with ground-truth data, and properties.csv with manual annotation data. In addition, the small dataset was divided into train and test parts by the following rule: each image was independently assigned to the test set with probability 20%.
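The construction of the single SimpleCube++ ground truth from the two gray triangles can be sketched as follows. The function names are hypothetical and the inputs are assumed to be the average RGB values of the left and right gray triangles.

```python
import numpy as np

def l1_normalize(v):
    """Scale a non-negative RGB vector so that its components sum to 1."""
    v = np.asarray(v, dtype=float)
    return v / v.sum()

def single_ground_truth(left_gray_rgb, right_gray_rgb):
    """Average the two normalized gray-triangle illuminations and renormalize."""
    left = l1_normalize(left_gray_rgb)
    right = l1_normalize(right_gray_rgb)
    return l1_normalize((left + right) / 2.0)
```

Normalizing before averaging gives both cube faces equal weight regardless of their brightness; since the average of two vectors that each sum to 1 also sums to 1, the final normalization mainly acts as a numerical safeguard.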
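The independent train/test assignment can be illustrated with the sketch below. It does not reproduce the published split: the function name and the seed are hypothetical, and the official assignment should be taken from the dataset files.

```python
import numpy as np

def split_train_test(image_names, test_probability=0.2, seed=0):
    """Assign each image to the test set independently with a fixed probability."""
    rng = np.random.default_rng(seed)   # hypothetical seed, for repeatability only
    in_test = rng.random(len(image_names)) < test_probability
    train = [name for name, t in zip(image_names, in_test) if not t]
    test = [name for name, t in zip(image_names, in_test) if t]
    return train, test
```

With independent assignment the test set size is itself random, so it is only approximately 20% of the images rather than an exact fraction.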

VI. DISCUSSION
Having another high-quality illumination estimation dataset such as the one proposed in this paper is certainly beneficial to the interested research community as well as the industrial sector and there should probably be no discussion about that.
However, proposing a new dataset is still only an incremental move in terms of the overall paradigm of illumination estimation research since this has been done on numerous occasions while the dataset usage has remained relatively unchanged.
A much more constructive and necessary discussion that is rarely taken forward should be about the direction of how to better use or not use the datasets to achieve better progress in illumination estimation research. In terms of that, one of the burning issues is that the results in most illumination estimation papers are unverifiable and thus questionable. Therefore, for the sake of improving the state of the illumination estimation research, it would be quite useful to further discuss this problem as well as the potential solutions to it in more detail.

A. QUESTIONABLE RESEARCH PROGRESS
Obtaining low illumination estimation errors on a benchmark dataset is a regularly used approach when trying to demonstrate the superiority of a proposed illumination estimation method. For all well-known datasets the ground-truth illumination used during the test phase is publicly available and the actual error statistics calculation is usually performed by the authors themselves and published in their papers. However, this introduces several problems, the most serious being data dredging, i.e. p-hacking, and erroneous reporting.
The problem with data dredging in illumination estimation is that in cases when a model selection is required, the final results that are reported were not always obtained through nested cross-validation [79]. Instead, the reported results are the ones that were used to select the model in the first place. By using these results, a method's true performance on new unknown data may be masked and unfairly shown to be better than it actually is. This can prevent or slow down the progress in illumination estimation research by giving misleading clues about the validity of the method's underlying assumptions.
In the area of visual odometry similar problems with e.g. the KITTI dataset [48] have been prevented by simply keeping the ground-truth for the test secret. By having the evaluation of the results on the test set carried out by the dataset administrators, any serious attempts of p-hacking have been prevented.
Another problem that can be prevented if the evaluation is carried out by a third party is the erroneous reporting. For example, in [18] the results of the proposed illumination estimation method on several datasets were allegedly all obtained by using the same value of a hyperparameter. However, trying to re-implement the method fails to produce the same results and only after checking the associated webpage [80] it becomes clear that the hyperparameter value has to be changed for each dataset to fully reproduce the published results.
A somewhat similar example is the 2007 paper by van de Weijer et al. [23]. In an erratum published in 2008 [81] it was explained how testing was inadequately performed, which consequently resulted in reporting of erroneous error statistics.
Finally, any doubts in the validity of some reported error statistics could be reduced or fully eliminated if they were calculated not by the authors themselves, but by a reliable third party. This would also help the overall research progress.

B. ILLUMINATION ESTIMATION CHALLENGES
Inspired by the ideas mentioned in the previous subsection, two international illumination estimation challenges have already taken place [77], [82]. The challenges provided the participants with thousands of training images and their respective ground-truth illuminations, while for the test set only the images were provided and the ground-truth remained secret until the end of the challenge. Because of that, the error statistics for the illumination estimations submitted by the authors were calculated by the challenge organizers, which prevented many of the problems described in the previous subsection. The results were thus more trustworthy and they have shown e.g. high errors for some methods that were previously reported to be highly accurate. Additionally, the challenges helped to recognize further problems such as training a method to obtain excellent values for a given error metric [83], which results in issues related to the so-called Goodhart's law [84].

C. BENCHMARK
While the described international illumination estimation challenges have shown the advantages of having a reliable third party calculate the error statistics, they were fixed in time and they cannot be repeated on the same images anymore. Therefore, the next step would be to create a benchmark dataset similar to the KITTI dataset with an online user interface for submitting the results at any given time. This would surely represent a significant contribution to the illumination estimation research since it would simultaneously provide the researchers with trustworthy results and also eliminate many of the serious problems that were described earlier in this paper.
For the above reasons, the creation of such a benchmark is already underway at the time of writing this paper. The benchmark is planned to be based on images that were taken at the same time as the rest of the Cube++ images, but that were excluded from its final version. Because of that, this paper purposely contains no error statistics obtained on the Cube++ dataset by any of the illumination estimation methods. The error statistics will be published online and they will be based on the first version of the benchmark test set. This aims to avoid providing any results obtained on the Cube++ images with known ground-truth illumination. Namely, the idea is to separate the testing and the associated problems from the dataset and to relegate them to the benchmark. Therefore, the overall goal of this paper is to provide high quality training data without any testing. The role of testing data is to be assumed by the future benchmark.

VII. CONCLUSIONS
A new illumination estimation dataset named Cube++ has been proposed. Unlike similar existing illumination estimation datasets, it provides rich, reliable, and verifiable data on scene illumination with specific care being given to precise calibration. For each of its 4890 images there are two ground-truth illumination records as well as a multitude of semantic information, and the dataset is GDPR-compliant. Furthermore, a wide variety of scene content is covered, and numerous illuminations are captured. Cube++ contains images taken with several instances of the same model of the camera sensor. In addition to that, a centralized version control system for Cube++ has been established to simplify and document possible future changes in the dataset and error handling. By having these properties and novelties, Cube++ is technically superior to most similar illumination estimation datasets. One of the future steps, which should also represent significant progress in the overall illumination estimation research, is to create an online illumination estimation benchmark based on the infrastructure that was used to create the Cube++ dataset.