Avoiding the Inherent Limitations in Datasets Used for Measuring Aesthetics When Using a Machine Learning Approach

An important topic in evolutionary art is the development of systems that can mimic the aesthetic decisions made by human beings, e.g., fitness evaluations made by humans using interactive evolution in generative art. This paper focuses on the analysis of several datasets used for aesthetic prediction based on ratings from photography websites and psychological experiments. Since these datasets present problems, we propose a new dataset that is a subset of DPChallenge.com. Subsequently, three different evaluation methods were considered, one derived from the ratings available at DPChallenge.com and two obtained under experimental conditions related to the aesthetics and quality of images. We observed different criteria in the DPChallenge.com ratings, which had more to do with the photographic quality than with the aesthetic value. Finally, we explored learning systems other than state-of-the-art ones, in order to predict these three values. The obtained results were similar to those using state-of-the-art procedures.


Introduction
Estimating aesthetic value and the complexity of an image is a technological challenge that has recently been addressed by numerous fields, including psychology and artificial intelligence. Several research groups have attempted to create computer systems that are able to learn the aesthetic perception of a group of human beings as a part of a generative system (such as evolutionary art systems) or that can be used for automatic image selection or ordering. Given the subjective nature of the aesthetic problem, the selection of the dataset for the training is vital. This paper explores a new way to build a dataset and provides initial results obtained using machine learning techniques.
Previous research studies [1,2] have concluded that the degree of generalisation of some existing sample sets was not enough to take them as reference in the training of automated prediction and classification of images. Other functional limitations were identified in these datasets, which are also mentioned in this paper.
In order to solve the problems identified in these datasets, this paper describes the creation of a new set of images from the website DPChallenge.com, with greater statistical consistency. In addition, this new dataset was evaluated in terms of aesthetics and quality by a group of individuals under controlled experimental conditions. This makes it the first dataset evaluated by two different populations (the one evaluating at the DPChallenge.com portal and the one evaluating it in person).
With the new dataset created, several Machine Learning-based models were trained for the automated prediction of the aesthetic value, the quality value, and the DPChallenge.com rating.
This paper starts with a state-of-the-art section on the datasets created for the automated prediction and classification of images. In Section 3, the limitations found in such sample sets are provided. Section 4 describes the method for the creation of a new dataset with greater statistical coherence and the results of the evaluation procedure obtained under experimental conditions. Section 5 presents the Machine Learning prediction models that were used as well as the results obtained in the training based on the three available criteria for the images of the proposed set. There is a section discussing the results and another one with the final conclusions.

State of the Art
Some authors, such as Datta et al. [3], Wang et al. [4], Ke et al. [5], and Luo et al. [6], conducted studies aimed at automated aesthetic classification using a number of technical characteristics such as lightness, saturation, Rule of Thirds, etc. For these experiments, sets of large-format photographs from websites and the evaluations made by the users of such sites were used. On the other hand, other authors, including Cela-Conde et al. [7], Forsythe et al. [8], and Nadal et al. [9], carried out aesthetic perception and image complexity experiments using a sample set with a more limited number of images, but evaluated by a specific set of people under controlled experimental conditions. A brief analysis of these sample sets is presented below.

Photo.net (2006).
Datta et al. [3] created a dataset based on the photography website Photo.net, which has over a million images and 400,000 users. In this dataset, each image is rated on a range from 1 to 7 (1 being the worst possible score and 7 the best) based on aesthetics and originality. Statistical information on the rating can be found on the website. It does not provide information on the image evaluators, though. The full dataset comprises 3,581 images rated by at least 2 persons and has an average score between 3.55 and 7 and an overall total average of 5.06, with a standard deviation of 0.83. The high correlation found between the criteria of originality and aesthetics (Pearson's r = 0.891) suggests that users are not distinguishing between the two.
Datta et al. [3] and other researchers such as Wong et al. [8], who used this sample group, have established a division to obtain two different groups: (i) the images with an average score equal to or higher than 5.8 were branded high quality and (ii) those with scores equal to or lower than 4.2 were branded low quality. In the study conducted by Datta et al. [3], a success rate of 70.12% was achieved in the global classification using Support Vector Machines (SVM): 68.08% for high quality images and 72.31% for low quality images.
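The two-group division described above can be sketched as follows. This is only an illustration of the thresholding step, not the code used in [3]; the function name and the example scores are hypothetical, while the thresholds 5.8 and 4.2 are the ones quoted in the text.

```python
from typing import Optional

HIGH_THRESHOLD = 5.8  # mean score >= 5.8 is branded "high quality" (value from the text)
LOW_THRESHOLD = 4.2   # mean score <= 4.2 is branded "low quality" (value from the text)


def label_by_mean_score(mean_score: float) -> Optional[str]:
    """Return 'high', 'low', or None for images with intermediate scores."""
    if mean_score >= HIGH_THRESHOLD:
        return "high"
    if mean_score <= LOW_THRESHOLD:
        return "low"
    return None  # intermediate images are left out of the binary task


# Hypothetical mean scores, for illustration only.
scores = {"img_a": 6.1, "img_b": 5.0, "img_c": 4.2}
labels = {name: label_by_mean_score(s) for name, s in scores.items()}
kept = {name: lab for name, lab in labels.items() if lab is not None}
```

In this sketch `img_b` is discarded, which shows how a binary split of this kind removes every image with an intermediate rating from the training data.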

Photo.net (2008).
In 2008, a new study was published by Datta et al. [10], which introduced a second set of data from the website consisting of 20,278 images rated by an average of 16.81 persons with a standard deviation of 16.19. It should be noted that there were images evaluated by a minimum of four people and others by a maximum of 395. When comparing this study with the previous one, it becomes apparent that this statistical analysis is more complete, as it provides specific data for each image. The total set of images had at least four ratings per image, with scores ranging between 2.33 and 6.99, and a global average of 5.15, with a mean standard deviation of 0.58. From the same set, Wong et al. [8] displayed 44 metrics grouped into three categories with global characteristics, for which they used a reduced set of images from the original experiment down to a total of 3,161. After performing a classification using SVM with linear kernel and resorting to cross-validation with 5 independent runs, 78.2% of the images were successfully classified (82.9% high quality and 75.6% low quality).

DPChallenge.com.
Ke et al. [5] created a different sample set, which became one of the most commonly used in aesthetic classification experiments. For the construction of this set, the photography portal DPChallenge.com was used, with a total of 60,000 images rated by at least 100 persons being selected.
For the aesthetic classification experiments, two sets of 6,000 photographs were created by selecting the top and bottom 10% after arranging them according to their mean score. Subsequently, Ke et al. [5] carried out a subdivision into two new random subsets, thus obtaining 4 sets of 3,000 images (two high quality and two low quality sets). A set of each type was used to train the proposed systems, while the other was used to validate their capacity and efficacy.

Dataset Created by Psychologists.
Cela-Conde et al. [7] created a dataset consisting of a final standardized set of 800 images divided into 5 categories: artistic abstract (AA), nonartistic abstract (AN), artistic representational (RA), nonartistic representational (RN), and photographs of natural scenes and human constructions (NHS).
The images were shown to a group of 240 participants (112 men and 128 women, with a mean age of 22.03 years and a standard deviation of 3.75), randomly divided into subgroups of 30 persons in a controlled experimental environment. The images were shown for five seconds and participants were asked to rate the visual complexity of a subset of stimuli on a Likert scale from 1 to 5 (1 being the worst possible score and 5 the best). Consequently, each image had a total of 30 ratings. The mean value obtained by each subgroup for each stimulus was the value considered to represent the complexity of this stimulus in the final set. The stimuli in this set were used by Cela-Conde et al. [7], Forsythe et al. [8], Nadal et al. [9], and Machado et al. [11].

Limitations Found in the Dataset Available
The study of the generalisation capacity of the analysed datasets led to the conclusion that they did not provide a satisfactory degree of generalisation: the correlation is greater when the validation set belongs to the same source of data as the training set. However, in experiments where the test was performed with a set from a source different from the training set, the correlation results decreased notably. A clear example in this regard can be seen in experiments conducted in previous research studies [1,2]: when training on the subset of 6,000 images from DPChallenge.com built by Ke et al. [5], the result of the correlation was 91.38%. When validated with a set from a different source, however, the resulting percentage decreased to 56.21% using, for example, the dataset from Photo.net created by Datta et al. in 2006 [3], and to 55.39% with the dataset from Photo.net created by Datta et al. in 2008 [10].
Besides, the sample sets trained with ratings from the photography portals had some defects: the evaluation system did not have the same control as a psychological test because it was not possible to obtain all the information about the evaluating users or about the device used to see the image (smartphone, computer), or distance or lighting conditions; the number of images might be insufficient as there was no justified reason to choose a sample size and there was a huge difference in the number of people rating each image; user evaluations could be easily conditioned by personal relationships with the creator of the work or a momentary surge in popularity of certain styles. Lastly, in one of the cases [3] it was shown that the users of these portals did not have enough basis to differentiate between aesthetic and originality criteria, with Pearson's correlation coefficient of 0.891. Furthermore, as these datasets were designed for binary classification, only the images rated with extreme scores (those obtaining the highest and lowest ratings) were used, leaving out of the set the images with intermediate ratings.
In the set created by Ke et al. [5] there was another limitation in the collected evaluations, as the DPChallenge.com portal operated as if it were a photography competition and there was no specification of any criteria to assess the images. Consequently, any user can evaluate the image on their own criteria, which may have nothing to do with those of other people.
On the other hand, in the dataset created by Cela-Conde et al. [7] the number of images presented by category was not balanced. Therefore, the obtained results cannot be considered as representative of the whole. Besides, the set was built on the basis of a considerable number of subsets of images, which resulted in the dataset eventually becoming a number of datasets of independent themes of smaller size, with less internal coherence.
Once the limitations of the studied datasets were identified, a new dataset was built for the aesthetic prediction of images. This dataset was evaluated by humans under controlled experimental conditions using a coherent set of images.

Building a New Dataset
After identifying the limitations discussed above in the existing sets of images, we created a new dataset for the prediction and classification of images, in which the process of human evaluation of the images was carried out under controlled experimental conditions. This new method is generally put forward in [1] and includes the advantages of the sets of images studied in this article. This new method of creation makes it possible to build a set of images with greater statistical coherence from the rating results on the photography website DPChallenge.com, which is subsequently evaluated in a manner similar to the procedure used by Forsythe et al. [8]. Thus, we shall be able to analyse the correlation between the results obtained with subjects under controlled circumstances and those obtained through the photography portal.

Source Data.
We began by collecting a set of images from the DPChallenge.com photography portal. The images on the DPChallenge.com portal are rated by users within the range [1,10], where 1 is the lowest possible score and 10 the highest. The only information about the score in DPChallenge.com is that a score of 1 is a "bad" photo, and a score of 10 is a "good" photo. So the score is not clearly related to aesthetics, photographic quality, or originality. Nevertheless, this portal has been used in the past to obtain data for aesthetic classification experiments [5,12,13]. The original idea behind this site was for it to be a place where friends could teach themselves to be better photographers by giving each other a "challenge" for the week. Methodologically, DPChallenge organises weekly competitions into "themes" represented by a word or phrase (e.g., "Alfred Hitchcock", "Abstract: Black and White II", "Color Portrait IV"). For the current study, this aspect of the evaluation is not taken into account.
Images were collected in May 2012 using a brute force process whereby all data were retrieved for all images whose identifiers were between 10,000 and 172,000. All statistical information of the ratings was available for only 40,047 images. The images in this initial set were rated by an average of 233 subjects and the mean rating was 5.23 ± 0.78. All descriptive data are shown in Figure 1(a). The file with the evaluation data and the links to the images used (links only, for copyright reasons) is publicly available at https://doi.org/10.6084/m9.figshare.6127295.v1. Figure 1(c) shows the arrangement of votes based on each range and Figure 1(b) displays the distribution of the mean evaluations of the images within the range of scores, showing in both cases that they apparently follow a Gaussian model.

Dataset Proposed.
As noted above, only the images for which all the evaluation data were available were used. Then, only the images with at least 100 ratings were selected. The objective was that the mean value subsequently attributed to each image was the least biased possible.
Once this selection was made, images were arranged in groups according to the mean ratings given on DPChallenge.com. The images in our selection were classified according to 9 scoring ranges, one for each integer value of valid evaluation. Then, a minimum number of images was set for all groups. In our case, the minimum number was 200 (see Figure 2(b)). There were no sets of images numerous enough with mean scores below 3 or higher than 8. Consequently, the used groups were collected from the [3,8] range. From each of these groups, the 200 images with the lowest standard deviation were selected. In other words, these were the images with the most internally consistent scores. We used the more consistent image set in order to build a dataset that can be used as a ground truth dataset. The descriptive data for each of the ranges are detailed in Table 1. Figure 2 shows (a) the distribution of the number of votes within the range of valid scores and (b) the distribution of the mean ratings within the range of valid scores for the 1000-image dataset. This process provides a set of images with an equal number of elements for each range, with high scoring consistency, and which could eventually be the most representative.
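The selection procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original pipeline: the record schema (`id`, `mean`, `std`, `n_ratings`) is hypothetical, and the assumption that each range is the half-open interval [k, k+1) is ours, since the text does not state how the range boundaries were handled.

```python
from collections import defaultdict


def build_balanced_subset(images, min_ratings=100, low=3, high=8, per_bin=200):
    """Select, per integer score range, the per_bin images with lowest std.

    images: iterable of dicts with keys 'id', 'mean', 'std', 'n_ratings'
    (a hypothetical schema chosen for this sketch).
    """
    bins = defaultdict(list)
    for img in images:
        if img["n_ratings"] < min_ratings:
            continue  # too few ratings for a reliable mean
        if not (low <= img["mean"] < high):
            continue  # outside the range of usable groups
        # Assumption: the integer part of the mean defines the range [k, k+1).
        bins[int(img["mean"])].append(img)

    selected = []
    for key in sorted(bins):
        if len(bins[key]) < per_bin:
            continue  # group not numerous enough, so it is dropped
        ranked = sorted(bins[key], key=lambda im: im["std"])
        selected.extend(ranked[:per_bin])  # most internally consistent scores
    return selected


# Tiny hypothetical example with per_bin=2 instead of 200.
example = [
    {"id": 1, "mean": 3.2, "std": 1.0, "n_ratings": 150},
    {"id": 2, "mean": 3.8, "std": 0.5, "n_ratings": 150},
    {"id": 3, "mean": 3.5, "std": 0.7, "n_ratings": 150},
    {"id": 4, "mean": 3.1, "std": 0.1, "n_ratings": 50},
    {"id": 5, "mean": 4.4, "std": 0.3, "n_ratings": 200},
    {"id": 6, "mean": 8.5, "std": 0.2, "n_ratings": 300},
]
subset_ids = [im["id"] for im in build_balanced_subset(example, per_bin=2)]
```

With the defaults, five ranges of 200 images each would give the 1000-image set described in the text.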

Human Evaluation.
The dataset proposed above was evaluated by a number of humans under controlled experimental conditions. According to Infinite Population Sampling [14] with a minimum sample size of 8 individuals and 95% of confidence level, the true population rating of an image can be obtained, with a margin of error of 3%.
To this end, 5 subsets were created with randomly selected images out of a total of 1,000 available. Each person could rate the images in one or several of these subsets with a score between 1 and 5, where 1 is the lowest possible score and 5 the highest. Each set was evaluated by at least 10 persons (a total of 10,000 ratings).
Evaluations were carried out on February 1st and March 5th, 2018, by student volunteers of the University of A Coruña, Spain (mainly, students at the School of Communication Sciences). Ninety (33 male and 66 female) participants (18.7 years, age range 18-30) took part in this study. Each participant evaluated at least 200 images before the research study and under the same viewing conditions: screens with the same specifications, same lighting conditions, and same distance between evaluators and the screens.
For every image, users independently rated its aesthetic value and quality. The English translation of the text of the survey questions verbatim is: "In this task we want you to evaluate the quality and aesthetic value of each of the images that we propose. To score the "quality" you should look at the framing, focus, colors, etc. In general, professional photographs have higher quality than photographs taken by amateurs. The editing of images (use of Photoshop, filters, etc.) does not have to affect its quality. It may be that you do not like an image, but if it is well made, your quality score should be high. For the aesthetic score value we look for your personal opinion about the image, whether you like it or not. The semantic value should not influence. That is, a nice picture of a crying baby can have a high aesthetic value score." The data shown in Figure 3 correspond to the mean obtained for each image from the different evaluations made for both aesthetic and quality criteria.
The correlation between the scores given in person and those registered on the DPChallenge.com platform was calculated (see Figure 4). The correlation between the mean score on DPChallenge.com and the mean aesthetic score was 0.692 according to Pearson's and 0.690 according to Spearman's. The correlation between the mean score on DPChallenge.com and the mean quality score was 0.748 according to Pearson's and 0.756 according to Spearman's. Lastly, the correlation between the two measures obtained in the in-person experiment (aesthetics/quality) was 0.787 according to Pearson's and 0.786 according to Spearman's, higher than in the other two correlations. Figure 4 shows the scatterplots between ranks for the three possible combinations given the three criteria that are evaluated for the entire study.
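For readers who wish to reproduce this kind of comparison on the published ratings, the two coefficients can be computed as in the sketch below. It is written in plain Python for clarity and is not the code used in the study; the function names and the small example vectors are hypothetical. Spearman's coefficient is obtained as Pearson's coefficient applied to average ranks, which handles tied scores.

```python
import math


def pearson(x, y):
    """Pearson's r between two equally long sequences of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-image mean scores from two rating populations.
portal_means = [1, 2, 3, 4, 5]
lab_means = [1, 4, 9, 16, 25]
r = pearson(portal_means, lab_means)
rho = spearman(portal_means, lab_means)
```

In this toy example the relationship is monotonic but not linear, so Spearman's rho reaches 1 while Pearson's r stays below it; on the real data the two measures are reported to be very close.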

Machine Learning Approach
In this study, several state-of-the-art Machine Learning models were applied to the proposed dataset. The aim of these experiments was to study whether the existing correlation values between both human populations seen in the previous sections (DPChallenge and control group) can be replicated by a computer system for the proposed dataset.

Materials and Methods.
To characterise the images that make up the study set, the feature extractor available in WND-CHARM [15] was used. WND-CHARM is a multipurpose image classifier that can be applied to a wide variety of image tasks. According to its developers, the system extracts a large set of image features, including polynomial decompositions, high contrast features, pixel statistics, and textures, among others. These features are computed on the raw image, transforms of the image, and transforms of transforms of the image. The final feature vector comprises 2905 variables, each of which reports on a different aspect of image content. All features are based on greyscale images, so colour information is not currently used.
The authors tested the different computational models using a 10-fold cross-validation to split the data and 50 runs per model in order to evaluate the performance across different experiments. The performance of the models was evaluated using Spearman's correlation coefficient (rho) and Pearson's correlation coefficient (Pearson's r).
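A repeated cross-validation scheme of this kind can be sketched as below. The generator is a hypothetical helper written for illustration, not the splitting code used by the authors; it assumes that each of the 50 runs reshuffles the data before forming the 10 folds, which the text does not state explicitly.

```python
import random


def repeated_kfold_indices(n_samples, n_folds=10, n_runs=50, seed=0):
    """Yield (run, fold, train_indices, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for run in range(n_runs):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # assumption: a fresh shuffle for every run
        folds = [indices[f::n_folds] for f in range(n_folds)]
        for f in range(n_folds):
            test = folds[f]
            train = [i for g in range(n_folds) if g != f for i in folds[g]]
            yield run, f, train, test


# Two runs over a 1000-image dataset give 2 x 10 = 20 train/test splits.
splits = list(repeated_kfold_indices(1000, n_folds=10, n_runs=2))
```

Each split would then be used to fit a model on the training indices and to compute Spearman's rho and Pearson's r on the held-out indices.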

Computational Models.
The authors performed several experiments in order to select the best model using the R package and MATLAB©. Some of the computational models used looked for the smallest subset of variables of the original set that provided a performance better than [16], or at least equal to, that obtained when using all the possible variables; that is, they followed a Feature Selection (FS) approach [17][18][19].
More specifically, the methods used were the following: the well-known Support Vector Machines-Recursive Feature Elimination (SVM-RFE) [20,21], the Generalized Linear Model with Stepwise Feature Selection (GLM) [22], which selects features that minimise the AIC score, and the most basic standard Multiple Linear Regression (LM) without FS. The capabilities of the RRegrs package [23] were extended in order to implement SVM-RFE and GLM and to avoid selecting the best model according to the methodology proposed in that package since, according to [24], this selection should be based on a null hypothesis test. This package was also modified in order to avoid the initial splitting process, and an external cross-validation process was performed to avoid selection bias as suggested by [25]. The last step was modified in order to easily extract the results for all the models.
The K-nearest neighbour (k-NN) algorithm is a technique based on the cluster theory. In this case, a variant called weighted k-NN [26] was used. It is based on the idea that a new observation particularly close to an observation within the learning set should have great importance in the decision-making process and, conversely, an observation that is at a further distance should have much less importance [27]. For this algorithm, only the hyperparameter k was tuned, which represents the number of nearest neighbouring data points that are considered. The range of values was from 1 to 5.
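The idea behind distance weighting can be illustrated with the following sketch. It uses a simple inverse-distance weight, which is a simplification chosen for illustration and not necessarily the kernel used by the weighted k-NN variant in [26]; the function name, the `eps` parameter, and the toy data are hypothetical.

```python
import math


def weighted_knn_predict(train_x, train_y, query, k=3, eps=1e-12):
    """Predict a score as the distance-weighted mean of the k nearest targets.

    Assumption: weights are 1 / (distance + eps), so closer neighbours
    contribute more; eps avoids division by zero for exact matches.
    """
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


# Toy one-dimensional example with hypothetical feature values and scores.
features = [(0.0,), (1.0,), (10.0,)]
targets = [1.0, 3.0, 100.0]
prediction = weighted_knn_predict(features, targets, (0.25,), k=2)
```

The query lies three times closer to the first point than to the second, so the first target receives three times the weight and the prediction is pulled towards it.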
The generalized boosted models (GBM) applied the approach described in [28], which established the foundation of boosting algorithms. GBM estimation involves an iterative process with multiple regression trees to capture complex and nonlinear relationships without overfitting the data [29,30]. It works with continuous and discrete variables and is invariant to their monotonic transformations [31]. For this algorithm, two hyperparameters were tuned: the interaction depth, that is, the number of splits to be performed on a tree (starting from a single node), and the number of trees. The values used ranged from 1 to 4 for the interaction depth and were 100, 250, and 500 for the number of trees.
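The search spaces reported above for k-NN and GBM are small enough to be enumerated exhaustively. The sketch below shows one way to expand them into candidate configurations; the dictionary keys (`k`, `interaction_depth`, `n_trees`) are illustrative names chosen here and are not taken from the authors' code.

```python
from itertools import product

# Value ranges as reported in the text; key names are hypothetical.
KNN_GRID = {"k": [1, 2, 3, 4, 5]}
GBM_GRID = {"interaction_depth": [1, 2, 3, 4], "n_trees": [100, 250, 500]}


def expand_grid(grid):
    """Return every combination of hyperparameter values as a list of dicts."""
    names = list(grid)
    return [
        dict(zip(names, combo))
        for combo in product(*(grid[name] for name in names))
    ]


gbm_candidates = expand_grid(GBM_GRID)
knn_candidates = expand_grid(KNN_GRID)
```

Each candidate would be scored inside the cross-validation loop, and the configuration with the best mean correlation retained.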
The design of our experiments was based on a novel methodology for the development of experimental designs in regression problems with multiple machine learning regression algorithms [32]. For each model described above, the optimal set of parameters was sought using hyperparameter optimisation.

Results.
Figure 5 and Table 2 show the results obtained for each of the four methods studied according to Pearson's and Spearman's correlation value, using as reference the average ratings from the DPChallenge.com portal. Firstly, examining Spearman's correlation values, the maximum value of the SVM-based model was 0.574, using 1024 variables. The input set could be decreased down to 256 with no significant loss of performance (0.570), as the correlation values remained statistically constant between both figures. On the other hand, if we look at the values for Pearson's r, the same pattern remained, since, with 1024 input variables, 0.581 was obtained, whereas, with 256, 0.574 was obtained (with no significant difference in performance). In any case, both Spearman's and Pearson's values show a moderate uphill (positive) relationship, with the exception of k-NN.
The authors checked the significance of the difference between GLM, SVM, GBM, and k-NN with 256 input variables (see Figure 6) using a Kruskal-Wallis test, and our results showed that, with a very high level of confidence, SVM (cost = $2^{-6}$ and gamma = $2^{-9}$) was significantly better than the others, with a p-value < $2.2 \times 10^{-16}$. Consequently, it could be stated that the minimum input set with the best results was the one with 256 input variables in combination with an SVM prediction model with the specified parameters.
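As an illustration of the test statistic involved, the sketch below computes the Kruskal-Wallis H value (with the usual correction for ties) for several groups of values, such as the correlations obtained by each model over the repeated runs. It is a plain Python illustration rather than the routine used by the authors, it does not compute the p-value, and the example groups are hypothetical.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with tie correction for two or more groups."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)

    # Average ranks over the pooled sample, accumulating the tie term.
    order = sorted(range(n_total), key=lambda i: pooled[i])
    ranks = [0.0] * n_total
    tie_term = 0.0
    i = 0
    while i < n_total:
        j = i
        while j + 1 < n_total and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        t = j - i + 1
        tie_term += t ** 3 - t
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1

    # Sum of squared rank sums divided by group size.
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum * rank_sum / len(g)
        start += len(g)

    h = 12.0 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)
    return h / (1 - tie_term / (n_total ** 3 - n_total))


# Hypothetical, fully separated groups of three values each.
h_value = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
```

Under the null hypothesis H approximately follows a chi-squared distribution with (number of groups - 1) degrees of freedom, from which a p-value such as the one reported above would be derived.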
Once the method with the best results was identified using the average ratings obtained by the users of DPChallenge.com, the best SVM hyperparameters were calculated (cost = $2^{-4}$ and gamma = $2^{-12}$ in both cases) when training on the scores for "aesthetics" and "quality" obtained in the abovementioned experiment with humans.
As shown in Figure 7, the values for any of the 3 cases are below 0.60 on average. Specifically, using Spearman's rho as the measure of performance, the values were 0.578 for DPChallenge, 0.456 for aesthetics, and 0.539 for quality; using Pearson's r, they were 0.574, 0.451, and 0.562, respectively. On the negative side, it is particularly relevant that in the case of "aesthetics" there is a weak uphill (positive) relationship given the average value obtained with both measures.

Discussion
A correlation of 0.787 (Pearson's) was obtained between the ratings based on aesthetics and those based on quality. This indicated that the evaluation teams distinguished between both criteria when compared with the measurements made by Datta et al. [3], where Pearson's correlation between aesthetics and originality was 0.891. Regarding the correlation between DPChallenge and quality and aesthetics individually, we should begin by underscoring that the highest correlation was between DPChallenge and quality, which suggests that, at DPChallenge, the photographic quality is valued over the aesthetic value of the image.
In our opinion, there was no single reason that explained the difference between the correlations regarding DPChallenge, as far as the aesthetic and quality values were concerned:
(i) In the case of DPChallenge, the users' rating may be conditioned by affinity with the author of the photograph as we were dealing with a competition whereas in the case of the control group, the experimental conditions were controlled (for instance, everyone used the same screen model, at the same distance, with the same ambient light, etc.).
(ii) At DPChallenge, numerous devices can be used (smartphones, tablets, and high resolution screens) and conditions such as viewing distance and ambient light are heterogeneous.
(iii) In the case of the in-person group, the evaluation criteria were established: aesthetics and quality. At DPChallenge, as mentioned above, we were dealing with a photography competition and many different things may be evaluated such as quality, aesthetics, originality, etc.
The correlation obtained for the DPChallenge criterion was 0.578 (Spearman's rho) using SVM. This value is similar to those obtained by Marin and Leder [33] using as criteria "arousal" (Spearman's rho = 0.44) and "pleasantness" (Spearman's rho = 0.64) or by Nadal et al. [9] with "beauty" (Spearman's rho = 0.648) under similar experimental conditions with humans. These values were obtained using numerous state-of-the-art prediction methods and determining the best configuration for each of them through hyperparameterisation.
As to the correlation between the SVM model with quality and aesthetics individually (Figure 8), it follows that for the system it was simpler to learn the quality values than the aesthetic ones, which makes sense considering that the former is a less subjective component and more related to the characteristics of the image.

Conclusions
Taking into account a number of problems found regarding the state-of-the-art datasets, a dataset was developed following a new methodology. This dataset consists of 1000 images from the DPChallenge portal, which were evaluated in 3 different ways: (1) evaluation from the DPChallenge portal with at least 100 scores per image; (2) an aesthetic evaluation conducted under controlled experimental conditions and a minimum of 10 votes per image; (3) a quality assessment made under the same conditions as (2). As far as the authors are aware, this is the first time a dataset is evaluated based on three different criteria by two different populations.
The results of the correlation suggest that the evaluation of DPChallenge is closer to a quality criterion than to an aesthetic one. The DPChallenge users and in-person evaluators rate images differently and it is apparent that at DPChallenge each user may be following different criteria for the evaluation of images, such as originality, image editing, quality, aesthetics, etc.
Numerous state-of-the-art computational techniques were used and their optimal configurations were identified and applied to all three criteria (DPChallenge, aesthetic, and quality) and correlations of 0.578, 0.456, and 0.539, respectively, were achieved. These results are similar to those obtained in the state-of-the-art experiments. They show that machine learning techniques are more able to learn human assessment of technical quality than aesthetic value, despite the fact that the gap between them is very narrow.
It should be emphasized that machine learning approaches are better at predicting quality than aesthetics, perhaps because quality has a lower subjective component and a greater association with the intrinsic characteristics of the images.

Data Availability
The data used to support the findings of this study are included within the article.

Figure 1: Characterisation of all 44,047 images initially obtained from DPChallenge. (a) Descriptive data, (b) arrangement of the number of votes within the range of valid ratings, and (c) distribution of mean image evaluations within the range of valid ratings.

Figure 2: Characterisation of the 1000 images in the proposed set. (a) Distribution of the number of votes within the scoring range and (b) distribution of mean ratings in the images within the valid range of scores.

Figure 3: Distribution of the mean aesthetic (a) and quality (b) ratings obtained in the control group.

Figure 4: Scatterplots between ranks for the three possible combinations given the criteria evaluated for the entire study.

Figure 7: Distribution of correlations (Spearman's on the right and Pearson's on the left) obtained for each of the three criteria (DPChallenge, Aesthetic, and Quality) using 256 input variables and an SVM model optimised using hyperparameterisation.

Figure 8: Examples of images with different scores based on the three evaluation criteria. For each image, a number value is given according to each criterion, while bars show the normalised weight of such value within each assessment range (DPChallenge in the [1, 10] range and aesthetics and quality in the [1, 5] range).

Table 1: Descriptive data for each of the five sets of 200 images that make up the proposed dataset.

Table 2: Average results presented in Figure 5, identifying hyperparameters and input size for each model.
Distribution of the correlations obtained for each optimised model (Pearson's on the right and Spearman's on the left).For each pair, the p-value obtained using a Kruskal-Wallis test is shown.