No Reference, Opinion Unaware Image Quality Assessment by Anomaly Detection

We propose an anomaly detection based image quality assessment method which exploits the correlations between feature maps from a pre-trained Convolutional Neural Network (CNN). The proposed method encodes the intra-layer correlation through the Gram matrix and then estimates the quality score combining the average of the correlation and the output from an anomaly detection method. The latter evaluates the degree of abnormality of an image by computing a correlation similarity with respect to a dictionary of pristine images. The effectiveness of the method is tested on different benchmarking datasets (LIVE-itW, KONIQ, and SPAQ).


Introduction and Background
The advent of digital photography and the growth of the smartphone industry has led to an exponential growth of the number of images taken and stored. The easy access to a camera and the consequent nearly effortless task of taking photos has made shooting a picture tolerated nearly everywhere and at any time. We took photos to capture every moment of our lives, from the simplest one like an unexpected event to milestone events like weddings. Even if cameras and sensors are becoming increasingly sophisticated and precise, it is not rare to shoot pictures that have a low perceived visual quality. Poor external conditions, such as low light environment, backlight scenes, or moving objects, alongside erroneous capturing settings, like exposure, ISO (camera's sensitivity to light), and aperture, could cause annoying image artifacts and distortions that lead to an unsatisfactory perceived visual quality. Being able to automatically distinguish good quality images from the bad ones can help various types of applications to prune the input set of images, like automatic photo album creation, or even help users to discard bad quality images from their personal collection to save space and time for revisiting those pictures.
Even if the subjective assessment is the most accurate criterion of an image's quality, it is not possible to perform it over big collections of images in a relatively small amount of time. Without taking into account the expensiveness and the cumbersomeness of manually evaluate pictures, automatic Image Quality Assessment (IQA) algorithms have been widely studied in recent years.
The perceptual quality of an image is a subjective measure and it is usually defined as the mean of the individual ratings of perceived quality assigned by human subjects (Mean Opinion Score-MOS). Given an image, IQA systems are designed to automatically estimate the quality score. Existing IQA methods can be classified into three major categories: full-reference image quality assessment (FR-IQA) algorithms, e.g., [1][2][3][4][5][6][7], reduced-reference image quality assessment (RR-IQA) algorithms, e.g., [8][9][10][11], and no-reference/blind image quality assessment (NR-IQA) algorithms, e.g., [12][13][14][15][16]. FR-IQA methods compare the distorted image with respect to the reference image in order to predict a quality score, and because of that, they requires both the original image alongside the corrupted one. RR-IQA • We propose an anomaly detection no-reference opinion-unaware image quality assessment by exploiting the correlations between feature maps extracted by means of a convolutional neural network. • We demonstrate the effectiveness of our method with respect to one no-reference opinion-unaware image quality assessment and two opinion-aware methods, on three databases containing real distorted images.
• We propose a different way to evaluate NR-IQA methods based on their capabilities to discriminate good quality images from the bad ones. • We show the advantages of combining the correlations between feature maps with an anomaly detection method.
The rest of the paper is organized as follows. In Section 2 we describe in detail the proposed method. Section 3 presents the databases taken into account for the experiments and the metrics used for the evaluation of the proposed method. In Section 4 we report the experimental results, concluding with the Section 5.

Proposed Method
The aim of the proposed method is to estimate the quality score of a given picture exploiting the correlations between the feature maps coming from a VGG16 [30] trained on the Imagenet dataset [20]. In particular, we combine the arithmetic mean of this correlations, and a score resulting from an anomaly detection method, on the intra-layer correlation represented by the Gram matrix. The final image quality score is given by the sum of these two metrics after applying min-max scaling to them. A brief overview of the method is depicted in Figure 1. The intra-layer correlation is computed by the Gram matrix over the activation volumes of a Convolutional Neural Network (CNN). Then the abnormality and the average of the correlation are computed before applying the min-max scaling on both of them. In the end, the two metrics are summed resulting in the predicted image quality score.

Intra-Layer Correlation
The intra-layer correlation has been firstly introduced by Li et al. [26] as representative of the texture, and since then it has been used in several other studies: as a loss [27,31] for style transfer, or to classify the style of the images [28,29]. This property can be defined as the correlation between the feature maps in a given layer and it can be done by calculating the Gram matrix of the feature maps, which is defined as all possible inner products of the feature maps. The Gram matrix has also been used in several other domains and applications: from computing linear independence of a set of vectors to represents kernel functions as in [32].
Given an input image I and a neural network N e made of e hidden layers which can be interpreted as the functional composition of e functions L j followed by a final mapping L that depends on the task: N e = L • L e • ... • L 1 . Let N j (I) be the feature maps of the j-th layer of the network N e for the input I, with j ≤ e, which can be seen as a matrix of shape C j × H j × W j resulted from the application of the functional composition of the first j functions L 1 , · · · , L j of the network N e such that N j = L j • ... • L 1 . For the layer j-th the Gram matrix G N j can be defined as a matrix of shape C j × C j whose elements are given by: where c and c are the indices of the Gram matrix, and both vary in the range [1, C j ]. We report in the Figure 2 an illustration of how the Gram matrix is computed from a feature maps N j (I). G N j captures which features tend to activate together, therefore we expect G N j to be a symmetric matrix with a negligible diagonal representing the correlation between a filter and itself. For our purpose we can consider only the lower triangular matrix of G N j , which once flattened results in a feature vector x j of size 1 × [(C j (C j + 1)/2) − C j ]. For each layer j of the VGG16, in the Table 1 we report the resulting feature vector dimension representing the intra-layer correlation. The Gram matrix is invariant with respect to the size of the input image I since its dimension depends only on the number of filter C j of the j-th layer, and it can be computed efficiently if N j (I) is reshaped into a matrix A of size C j × (H j × W j ) so that we achieve G N j = AA T /C j H j W j . We report in Figure 3a schematic overview of the Gram matrix efficient computation.
(a) (b) Figure 3. Visual overview of the Gram matrix efficient computation and feature vector extrapolation. In (a) is shown how the activation volume is reshaped to efficiently compute the Gram matrix in (b) to finally compute the feature vector that represents the intra-layer correlation.

Anomaly Detection
Our anomaly detection technique is inspired by the approach presented in [33]. The degree of abnormality, that is the presence of distortions in the input image, is computed by measuring the similarity between the aforementioned intra-layer correlation representation of a given image, and a reference dictionary W of Gram matrices computed from a database of pristine images. The similarity is measured through the Euclidean distances between the feature vector x extracted from a given image I and the feature vectors of the dictionary W. The final abnormality score is the sum of the average distances and α times the standard deviation of the distances. It is defined as follow: where D is the number of words in the reference dictionary W = {w 1 , ..., w D }, and dist(x, w d ) is the Euclidean distance between the feature vector x representing the input image I and the words of the dictionary w d . The dictionary is built from a subset of pristine images P = {I 1 , ..., I D }: for each image I d we compute the Gram matrix respect a given jth layer of the network N e and keep only the flattened lower triangular matrix of G N j of shape (C j (C j + 1)/2) − C j . We then reduce the dimension of the feature vector to M with M < (C j (C j + 1)/2) − C j by applying the Principal Component Analysis (PCA) [34] to all the feature vector computed from P. M represents the number of principal components such that a given percentage of the data variance is retained.
Finally, we group all the reduced features vectors of P into clusters using Mean Shift [35], a centroid-based algorithm which use kernel density estimation to iteratively move the sample to the nearest local maxima of the probability density with respect to the input samples. It only has a parameter named bandwidth which represents the width of the kernel, and the output are the best k clusters corresponding to k centroids. The advantage of using Mean Shift as clustering algorithm is that it does not require to explicitly define the number of clusters k in advance. The number of clusters is instead variable and depends solely on the data and the width of the kernel chosen. A centroid is the most representative point within the cluster, and in this case, is the mean position of all the elements of the cluster. Our dictionary W is therefore composed by the coordinates of the center of each cluster found.
A schematic view of the dictionary creation pipeline is presented in the Figure 4.

Combining Method
The final quality score is the combination of factors: (1) the arithmetic mean of the CNNs correlations and (2) the abnormality score.
To make these two factors comparable we scale them [36]. Moreover, we flip the correlation of the abnormality score by computing 1-abnormality score. Then, with the purpose of having better interpretability, we divide by 2 and multiply by 100 the final score in order to have values that ideally range between 0 and 100 as the MOS.
Since the anomaly detection method relies on the concept of distance from a dictionary of pristine images, it implies that the closer the input image is to this dictionary, the less abnormal is the considered picture. On the contrary, the farther the input representation is from the dictionary, the higher is the abnormality score. The output of the anomaly detection method is therefore negatively correlated with the Mean Opinion Scores.
On the other hand, it is reasonable to believe that a quality artefact in an input image of a CNN, may alter the activation maps, hence it affects negatively the average intra-layer correlation: the more corrupted (less qualitative appealing) is the input image, the less correlated are the layers of the network. It reflects that the average of the intra-layer correlation is positively correlated with the Mean Opinion Scores.

Experiments
In this section, we start describing the databases taken into account for the experiments, followed by the metrics used for the evaluation of our approach and then the implementation details of the proposed method.

Database for Image Quality Assessment with Real Distorted Images
There are several databases for image quality assessment with real distorted images. For our experiments we take into account one of the most used dataset in the field of IQA which is the LIVE in the Wild Image Quality Challenge Database [37] (LIVE-itW) alongside with two recently presented large scale IQA databases namely: Smartphone Photography Attribute and Quality database (SPAQ) [38] and the KonIQ-10k [39] (KONIQ).
The task of gathering opinion scores on the perceived quality of images is highly subjective and subtle, for all the previously mentioned databases, therefore, every worker was first provided with detailed instructions to help them assimilate the task. Since LIVE-itW and KonIQ-10 rely on crowdsourcing frameworks, a selection of participants based on the worker's reliability was applied. Also, during and after the process of data annotation several filtering steps were implemented to ensure an acceptable level of quality for the resulting Mean Opinion Scores. To further validate the credibility of the collected scores, the authors of the three databases, also reports the mean inter-group agreement as the mean agreement between the MOS values of non-overlapping random groups of users in terms of Spearman's rank ordered correlation. In particular, the LIVE-itW database reaches a value of 0.9896 while KonIQ-10 and SPAQ measure a correlation of 0.973 and 0.923 respectively.
The LIVE-itW database [37] is a collection of 1162 authentically distorted images captured from many diverse mobile devices. The images have been evaluated by over 8100 unique human observers through amazon mechanical turk and each image has been viewed and rated on a continuous quality scale by an average of 175 unique subjects. The Mean opinion scores ranges from 0 to 100.
The KONIQ database [39] contains 10,073 images selected from the Yahoo Flickr Creative Commons 100 Million dataset (YFCC100M) [40]. Each image has a total of 120 reliable quality ratings obtained by crowd-sourcing, performed by 1459 subjects. The collected values of the Mean Opinion scores belongs to the interval [0, 5], for readability, in this paper, we scale them in the range [0, 100] to be coherent with the other datasets.
Finally, the SPAQ database [38] consists of 11,125 pictures taken by 66 smartphones. The images have been labeled through a subjective test, in a well-controlled laboratory environment, with a total o 600 subjects. Each image has been rated at least 15 times. The MOS varies between 0 and 100.
The datasets for image quality assessment with real distorted images are divided into good quality images and bad quality images according to the MOS. As a threshold, for each dataset, we decided to take the 75th percentile with respect to the MOS distribution. Therefore, we label images having a MOS over the 75th percentile as good quality images, while we labeled as poor quality images the remaining ones. These thresholds are 71.71 and 71.74 for the KONIQ and LIVE-itW respectively while for the SPAQ the 75th percentile respect the MOS is 68.82.
In the Figure 5 we report the distributions of the Mean Opinion Scores of the three datasets where it can be seen that KONIQ and LIVE-itW seam to distribute similarly with unimodal behaviour, while the SPAQ's distribution of the MOS appear to be bimodal. An overview of database properties is provided in Table 2 and samples images alongside MOS are in Figure 6.

Database for Image Quality Assessment with Real Synthetic Distortion
As our pristine collection of images, we decide to take into account a large scale database for image quality assessment with synthetic distortion named Konstanz Artificially Distorted Image quality Set (KADIS700k) [41]. For our purpose, we consider only the non-distorted pictures. The dataset is composed of 140,000 pristine images collected from Pixabay.com, a website for sharing photos and videos. According to the image quality guidelines of Pixabay.com, users must upload pictures that have a well defined subject, clear focus, and compelling colours. Also, images with chromatic aberration, JPEG compression artefacts, image noise, and unintentional blurriness are not accepted. Moreover, authors of KADIS700k have collected up to twenty independent votes by Pixabay users claiming that the quality rating process provides a reasonable indication that the released images are pristine. After selecting 654,706 images with resolutions higher than 1500 × 1200 they randomly sampled 140,000 images as pristine reference images. Under the aforementioned conditions is our belief that KADIS700k can model the concept of pristine images necessary to build our system. An example of the pictures present in the dataset is reported in Figure 7.

Experimental Setup
The most commonly used metrics to evaluate the performance of no-reference image quality assessment methods are respectively the Pearson's Linear Correlation Coefficient (PLCC) and the Spearman's Rank-order Correlation Coefficient (SROCC). Both are used to compare the scores predicted by the models and the subjective opinion scores provided by the dataset. The PLCC correlation evaluates the linear relationship between two continuous variables while the SROCC evaluates the monotonic relationship between two continuous or ordinal variables. Both indexes are used to evaluate correlation between variables. However, according to [42], the Spearman's index is more suitable to evaluate relationships involving ordinal variables than the Pearson's index. The aim of the automatic image quality assessment methods is to predict a quality index of the image which simulates the Mean Opinion Score achieved thanks to subjective evaluation of users. Since both the quality index and the MOS are continuous and ordinal variables, the SROCC is the most reliable evaluation metric for IQA methods.
The PLCC is a statistic that captures the linear correlation between the predicted scores and MOS; it ranges between +1 and −1, where a value of +1 or −1 reflects a total positive, or negative respectively, linear correlation, and a value of 0 is no linear correlation. Given n as the number of the considered samples, x i and y i the sample points indexed with i,x andȳ the means of each sample distribution; we can define the PLCC as follows: Instead, the SROCC operates on the rank of the data points ignoring the relative distances between them, hence assesses the monotonic relationships between the actual and predicted scores. As the PLCC, it varies in the interval [+1, −1] and for n considered samples it is defined as follow: where d i = (rank(x i ) − rank(y i )) is the difference between the two ranks of each sample.
To measure the capability of the methods of being able to discriminate good quality images from the bad one, we decided to take into account the area under a receiver operating characteristic (ROC) curve, abbreviated as AUC, and the area under the precision-recall curve (AUPR). In particular the AUC measures the overall performance of a binary classifier, and it ranges between 0.5 and 1.0, where the minimum value corresponds to the performance of a random classifier while the maximum value represents the oracle. Meanwhile, the AUPR reflects the trade-off between precision and recall as the threshold varies.

Implementation Details
The proposed method is trained and tested using Python. The training process consists of extracting the information of intra-layer correlation, from a subset of pristine images of KADIS700k [41], and compute the dictionary W of k centroids trough Mean Shift. Then the average intra-layer correlation and the degree of abnormality is computed from a different subset of pristine images of KADIS700k [41] in order to save the minimum and maximum value of both to lately perform min-max scaling in order to merge them. The parameter α for the anomaly detection method has been chosen performing a parameter tuning over a subset of synthetically distorted images for KADIS700k, resulting in a value of 2.
For the intra-layer correlation, we use the ImageNet [20] pre-trained VGG16 provided by the Torchvision package of the PyTorch framework [43]. The model was trained scaling the input images into the range of [0, 1] and then normalized using mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225]. As preprocessing images were first cropped such that the resulting size was evenly distributed between 8% and 100% of the original image area and then a random aspect ratio between 3/4 and 4/3 of the original aspect ratio was applied. The resulting crops were finally resized to 224 × 224. As layer for our representation, we rely on the output of the first convolutional layer of the second convolutional stage (conv2_1) of the VGG16, which consists of 128 filters, resulting in a feature vector of shape 8192. Moreover, the input of the VGG16 images are scaled so the smaller edge results of 512 pixels and normalized according to the mean value and standard deviation of ImageNet.
For the training phase, we randomly selected 10,000 pristine images from KADIS700k, and we compute the Gram matrix for each image. Subsequently, we apply the PCA with a percentage of retained variance of 97%. Finally, we use the Mean Shift algorithm with a flat kernel to find the representative centroids and build the dictionary for the computation of the abnormality score. The bandwidth for the Mean Shift was empirically estimated as the average pairwise distance between the samples that are in the same cluster applying the Nearest Neighbors algorithm. The resulting bandwidth is 1.804.
To perform the min-max scaling on both the abnormality score and the average value of the intra-layer correlation, we randomly select 1000 pristine images from KADIS700k different from the previous ones and we collect the minimum and maximum values for the two metrics.
The proposed method is compared against three different no-reference benchmark algorithms for the IQA, one opinion unaware, which does not require MOS during the training phase named ILNIQE [23] and two opinion aware methods which require the mean opinion scores of the images on which they are trained, namely CORNIA [17] and DIIVINE [18]. For these methods, we took their original implementations alongside the saved models released from the authors. For this reason, these methods are executed a single time.

Results
In this section, we compare the average performance in terms of SROCC, PLCC, AUC, and AUPR on the three considered databases (LIVE-itW, KONIQ, and SPAQ) across 100 iterations. We first present the average performance across all the three datasets and then we present the performance on each dataset separately. Table 3 reports the average performance over the three databases of our opinionunware method and state-of the-art methods, both opnion unware and opinion aware. This table permits to have a quick look of the best performing method whatever is the database considered. Our proposal is on average the first in terms of SROCC, AUC and AUPR against all the methods. In terms of PLCC, our method is the second best opinion-unware method. We believe that the PLCC gap is caused by the fact that the proposed method is not trained using Mean Opinion Score hence it is not forced to have a linear relationship with the target distribution. Moreover, as discussed in Section 3.3, SROCC is a more suitable metric than PLCC for the evaluation of ordinal variables. Table 3. Pearson's Linear Correlation Coefficient (PLCC), Spearman's Rank-order Correlation Coefficient (SROCC) area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPR) of the proposed solution respect the three state of the art methods (ILNIQUE, CORNIA, DIIVINE). In each column, the best values are marked in boldface.
In terms of correlation, DIIVINE appears to be the least correlated according to both SROCC and PLCC. ILNIQE and CORNIA are very close in terms of SROCC but they perform worse with respect to our method, while on the PLCC, CORNIA results to be the most effective method, followed by ILNIQE.
Regarding the capability of the methods to discriminate good quality images from the bad one, in terms of AUC, the CORNIA, and ILNIQE methods have similar performance, followed by the DIIVINE method which performs slightly better, and finally, by our method that achieves the best performance. Regarding the AUPR, the behavior is very similar to the AUC excepts for the DIIVINE method which results to be better than CORNIA.

Single Dataset Performance
The average results obtained on the three datasets show the generalization skills of the proposed algorithm. In Table 4 we report the performance of the methods under examination for each dataset. On the LIVE-itW database the performance highlights that our method performs better in terms of SROCC with respect to all methods, while it is the best in terms of AUC and AUPR against all the opnion-unware methods but the second best (with a negligible difference of about 0.006 on average) with respect to the DIIVINE, which is an opinionaware method. In terms of PLCC, ILNIQE is the best then followed by CORNIA, DIIVINE and our method. The LIVE-itW is the database on which all the methods perform worse with respect to the others datasets. This can be caused by the dimension of the dataset, which is one-tenth of the others.
On the KONIQ database, our method performs better in terms of SROCC and AUC with respect to all methods. Surprisingly, ILNIQE reaches the top performance on AUPR even if on the other measures is not highly competitive. The best in terms of PLCC is the CORNIA. Even in this case, performance of all methods, regardless the metric, are quite low.
Finally, on the SPAQ database, the overall behavior follows the one presented in the average performance (cf. Table 3): our method is the best in terms of SROCC, AUC, and AUPR while is not competitive in terms of PLCC.
Summing up, our method is first in terms or SROCC against all the methods over the three considered databases. It is the best opinion-unaware method in terms of AUC while it is the second best against an opinion-aware method only on the LIVE-itW dataset with a difference of 0.096. Since our method does not force any kind of linear relationship between the output and the data distribution, it is not competitive in terms of PLCC: it is the lowest performing method except for the SPAQ database were it is placed third. Considering the AUPR metric, except for the KONIQ database, we are the best of all methods on SPAQ and the second best on the LIVE-itW database of a small amount (0.0028).
In Figure 8 we report, for each of the databases, the predicted quality score distributions with respect to the ground truth MOS. While in the Figure 9 are reported two pictures from the three dataset LIVE-itW, KONIQ, and SPAQ, alongside the predicted image quality scores and the mean opinion scores.
(a) (b) (c) Figure 8. Scatter plots of the predicted quality scores respect MOS for the three considered datasets: (a) LIVE-itW, (b) KONIQ, and (c) SPAQ. In red is depicted the second-order interpolation line. The points in blue belongs to the bad quality images (MOS < 75 • percentile of the MOS distribution for that dataset) while the orange ones refer to the good quality images.

Ablation Study
In this section we focus on the contribution given by the anomaly detection method to the solely average of the intra-layer correlation. From Table 5 we can see that the intra-layer correlation achieves competitive results in terms of SROCC, AUC, and AUPR while is weaker with respect to the PLCC. On the other hand, the output from the anomaly detection method tends to be less effective on the four metrics. Although, the abnormality score tends to be more competitive on the PLCC except on the KONIQ database on which the PLCC scores is very low.
Even if the average intra-layer correlation alone shows interesting performance, combined with the abnormality score results in an increase of the PLCC while maintaining similar performance on the other metrics.  Figure 9. Examples of predicted quality score (QS) alongside the mean opinion score (MOS). First column images (a,d) belong to LIVE-itW database, second column picures (b,e) are from KONIQ dataset while the last column photos (c,f) belong to the SPAQ collection.

Conclusions
In this work, we have introduced a novel and completely blind image quality assessment method based on anomaly detection techniques. We have designed a model that classifies good and the bad quality pictures by exploiting the information deep encoded in CNN pre-trained on ImageNet, using only a subset of artifact-less images. To this end, we have developed a pipeline that relays on the Gram matrix computed over the activation volumes of a CNN to encode the intra-layer correlation. We then use this information in an anomaly detection fashion to improve the performance with respect to the only average of intra-layer correlation. Experimental results on three different datasets containing real distorted images demonstrated the effectiveness of the proposed approach. Moreover, cross-dataset results highlighted the robustness and generalization skills of the approach in comparison to other algorithms in the literature.
As future works, we would like to further investigate why the considered methods on certain databases, like SPAQ, perform on average better than on the others datasets. In view of the fact that our proposed algorithm lacks mainly concerning the PLCC, we are interested in improving such metric simultaneously with the others. To better compare our method with respect to the state of the art, we will take into account more recent approaches of no-reference opinion-unaware image quality assessment like the one by Mukherjee et al. [44]. Finally, to better prove the capability of the proposed method to discriminate good quality images with respect to bad quality ones, a possible alternative to the AUC and AUPR could be to take into account frameworks to stress image quality assessment methods [45].