Toward a Reliable Collection of Eye-Tracking Data for Image Quality Research: Challenges, Solutions, and Applications

Image quality assessment potentially benefits from the addition of visual attention. However, incorporating aspects of visual attention in image quality models by means of a perceptually optimized strategy is largely unexplored. Fundamental challenges, such as how visual attention is affected by the concurrence of visual signals and their distortions; whether visual attention affected by distortion or that driven by the original scene only should be included in an image quality model; and how to select visual attention models for the image quality application context, remain. To shed light on the above unsolved issues, designing and performing eye-tracking experiments are essential. Collecting eye-tracking data for the purpose of image quality study is so far confronted with a bias due to the involvement of stimulus repetition. In this paper, we propose a new experimental methodology to eliminate such inherent bias. This allows obtaining reliable eye-tracking data with a large degree of stimulus variability. In fact, we first conducted 5760 eye movement trials that included 160 human observers freely viewing 288 images of varying quality. We then made use of the resulting eye-tracking data to provide insights into the optimal use of visual attention in image quality research. The new eye-tracking data are made publicly available to the research community.


I. INTRODUCTION
Digital imaging systems generate, as a side effect, various types of distortion in visual signals [1]. Visual distortions degrade the quality of digital media content and, consequently, may affect consumers' visual experiences or lead to analytical errors in visual inspection tasks [2], [3]. To prevent the appearance of visual distortions and to control image quality, current imaging systems rely on algorithms that can automatically predict image quality as perceived by human observers. The basis of these algorithms is formed by the so-called objective quality metric (OQM).
Substantial progress has been made on the development of OQMs. State-of-the-art OQMs mainly benefit from advances in understanding and modelling early visual processing in the human visual system (HVS) and its underlying quality perception behaviour [4]-[6]. Significant findings in visual psychophysics, such as contrast sensitivity and masking, have been mathematically modelled and integrated in various OQMs [7]-[12]. By incorporating functional aspects of the HVS, distortion can be quantified in a way that reflects its genuine annoyance to the human eye, which consequently results in a more reliable image quality prediction.
A significant trend in current image quality research is to investigate the impact of visual attention, which is an essential aspect of the HVS. Visual attention refers to a mechanism that enables the HVS to select the most relevant information in a visual scene [13]. Such attentional selection is known to be guided by two types of mechanism, namely the stimulus-driven, bottom-up mechanism and the expectation-driven, top-down mechanism [13]. In the area of computer vision, visual attention is mainly concerned with the former attentional mechanism, and is often interchangeably referred to as saliency [14]-[24]. The empirical foundation of saliency modelling lies in the eye movements of human observers [25]-[28]. Computational models of visual saliency (i.e., the bottom-up attentional mechanism) aim at explicitly addressing the first few seconds of eye movements in free viewing of a visual stimulus [13]. A saliency model generally outputs a topographic map that represents the conspicuousness of scene locations, i.e., the parts of a scene that appear to an observer to stand out relative to their neighbouring parts.
Incorporating saliency has demonstrated great potential for further improvement of OQMs [29]-[31]; however, how to achieve such integration in a perceptually optimised way remains largely unexplored. The challenge lies in the fact that our knowledge about how saliency is actually affected by the concurrence of visual signals and their distortions, as well as the associated implications for image quality judgements, is very limited. Due to the lack of such knowledge, the vast majority of existing work has focused on simply utilising a specific saliency model as a weighting function to improve a specific OQM [32]-[36]. However, issues such as how to optimise the combination of saliency and OQMs and how to determine appropriate saliency models remain open and urgently need to be investigated.

II. RELATED WORK AND CONTRIBUTIONS

A. Related Work
Psychophysical studies have attempted to better understand visual saliency in relation to image quality assessment [37]-[43]. For example, an eye-tracking study was performed in [40] to investigate (via visual inspection of fixation patterns) how task-free fixations (i.e., saliency) of undistorted images may be affected by two variables, i.e., quality rating task and visual distortion. Based on the visualisations of eye-tracking data, white noise and blurring (under quality rating conditions) are not observed to significantly impact the fixation patterns (relative to the task-free conditions), whereas the impact tends to be more obvious in the case of compression artifacts. In [41], task-free eye-tracking experiments were conducted to investigate how JPEG compression affects fixations. The results show that the impact of JPEG artifacts on fixations is more disruptive at low image quality than at high quality. The eye-tracking data in [42] indicate that fixations change as visual distortion occurs, and that the extent of the change seems to be more related to the strength of artifacts than to the type of artifacts. In general, psychophysical studies reveal that visual distortions may lead to a deviation from the natural scene saliency, and that such deviation tends to depend on the visual content, the type of distortion and the level of distortion.
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/
Notwithstanding the above effort, it should be noted that the generalisability of the findings reported in these studies remains limited by the choices made in their experimental design. For example, some experiments used a limited number of human subjects [38]; some experiments were restricted to a small degree of stimulus variability in terms of scene content, distortion type and degradation level [40]-[42]; and some eye-tracking studies involved top-down aspects of visual attention (e.g., the involvement of a quality rating task) rather than studying free-viewing bottom-up saliency [42], [43].
Apart from the above drawbacks, existing studies by their nature potentially suffer from an inherent bias due to the involvement of stimulus repetition. Typical eye-tracking data collection for the purpose of image quality assessment often involves each observer viewing the same scene several times (with multiple variations of distortion) throughout a session. This repetition (i.e., repeated versions of the same scene) becomes massive as the number of distortion types and/or levels increases and would potentially skew the intended eye-tracking data. In [44], eye-tracking data were collected where participants first viewed 12 short videos and then, after a 2-min break, viewed the same 12 videos again. The results showed that there was a notable difference in the locations of the participants' gaze between the first and second viewings of the same video. The eye-tracking experiments in [45] included 10 original videos and their 50 impaired versions (i.e., five levels of degradation per original). The results showed evidence for a memory or learning effect over several viewings of the same video content, and that the observers' gaze behaviour tended to be affected by the involvement of stimulus repetition. Both studies suggest that, to ensure the consistency of oculomotor behaviour throughout the experiment (i.e., observing stimuli naturally rather than, e.g., being forced to learn where to look for visual artifacts) and thus to guarantee the reliability of fixation data collection, the impact of stimulus repetition needs to be reduced.

B. Contributions of the Paper
1) Recent literature [46], [47] has revealed potential limitations of existing approaches taken to integrate saliency into OQMs, and the need to investigate the real interactions between natural scene saliency and visual distortions via eye-tracking. To ensure the validity of fixation data collection, we propose a new experimental methodology with carefully justified control mechanisms. This methodology allows substantial eye-tracking data to be obtained reliably, with a large degree of stimulus variability in terms of scene content, distortion type and degradation level.
2) Unlike previous eye-tracking studies that have focused more on a limited dataset and rather qualitative analysis, the resulting eye-tracking data enable us to thoroughly evaluate the relation between saliency and distortion. In particular, we perform an exhaustive statistical analysis to provide a comprehensive view of the extent to which different types of distortion, each represented at different levels of degradation, can actually affect fixation deployment.
3) Up until now, little has been known about how to optimise the integration of saliency and OQMs in a perceptually meaningful way. An important question is whether saliency derived from an original natural scene or that from the same scene affected by unnatural artifacts should be included in OQMs. Based on our eye-tracking data, we assess whether the difference between both types of saliency is sufficiently large to actually affect the performance gain for existing OQMs.
4) Being able to effectively apply saliency in OQMs requires pre-screening of saliency models, since the effectiveness of saliency models differs across application domains. On the basis of our eye-tracking data, we benchmark state-of-the-art saliency models for the purpose of image quality assessment. We explicitly evaluate whether these saliency models possess sufficient capabilities of detecting natural scene saliency and its deviation due to quality changes, and the added value of modelled saliency to OQMs.
5) Moreover, we have made the eye-tracking data publicly available [48] to facilitate research on saliency modelling in image quality assessment.

III. EYE-TRACKING: REFINED EXPERIMENTAL METHODOLOGY
Unlike previous studies, our experiment contains a large degree of stimulus variability in terms of scene content, distortion type and distortion level. In addition, a dedicated protocol is devised to eliminate potential bias due to the involvement of massive stimulus repetition, which inherently occurs in a typical image quality study. An eye-tracking database was collected from 5760 eye movement trials, involving 160 human observers and 288 test stimuli.

A. Stimuli
A set of test stimuli is constructed by systematically selecting images from a widely recognised image quality assessment database (i.e., LIVE database [49]).
1) Construction of Source Images: From the fixation deployment perspective, natural scenes can be classified based on the degree of saliency dispersion [31]. As revealed by the eye-tracking studies in [50] and [51], if an image contains highly salient objects, most viewers will concentrate their fixations around them, whereas if there is no obvious object-of-interest, viewers' fixations will form a more evenly distributed pattern. Thus, images with salient objects tend to have less variation in fixations between viewers than images without salient objects. Using the eye-tracking data in [31], the degree of saliency dispersion (i.e., the degree of agreement between observers for human fixations) was determined and used to categorise all source images in the LIVE database. The results showed that the majority of images (i.e., 19 out of 29) clustered around the range of medium degree of saliency dispersion. To mitigate the unbalanced distribution of source images, we decided to remove some images having a medium degree of saliency dispersion. This yielded a rather balanced set of 18 source images, as illustrated in Fig. 1. The new makeup consists of 6 images with a small degree of saliency dispersion (e.g., images with distinct foreground/background configurations); 4 images with a large degree of saliency dispersion (e.g., images without any specific object-of-interest); and 8 images that fall into the range of medium degree of saliency dispersion.
2) Construction of Test Images: Test stimuli used in our experiment cover the full range of distortion types available in the LIVE database, including white noise (WN), JPEG compression (JPEG), Gaussian blur (GBLUR), JPEG2000 compression (JP2K) and simulated fast-fading in wireless channels (FF). For each distortion type, three distorted versions per source image were systematically selected, which were intended to reflect three distinct levels of perceived quality: "High" (i.e., with perceptible but not annoying artifacts), "Medium" (i.e., with noticeable and annoying artifacts) and "Low" (i.e., with very annoying artifacts). Taking advantage of the LIVE database, which contains a "ground truth" mean opinion score (i.e., DMOS) per image, distortion strengths/levels were adjusted perceptually by using the following mapping: DMOS in [10, 40] to "High" quality, DMOS in [40, 70] to "Medium" quality and DMOS in [70, 100] to "Low" quality. By doing so, for a specific distortion type, the selected 18 "High" quality versions of source images are meant to have approximately the same perceived quality; and similarly for other distortion levels (i.e., "Medium" and "Low"). In addition, a "High" quality version of any source image chosen under a specific distortion type is meant to have approximately the same perceived quality as the "High" quality version of the same source image chosen under any other distortion type; and similarly for other distortion levels (i.e., "Medium" and "Low"). The selection procedure resulted in a set of 288 test stimuli (including the originals) from the LIVE database. Fig. 2 illustrates the average DMOS of images (i.e., 90 images based on 18 source images × 5 distortion types) assigned to individual distortion levels. It clearly shows three distinct means of DMOS (i.e., 30, 55 and 83 within the score range [0, 100]); and hypothesis testing (i.e., based on t-test preceded by a test for the assumption of normality) reveals that the difference between these three pre-defined categories is statistically significant (i.e., with P < 0.01 at the 95% confidence level).
Fig. 2. Illustration of average DMOS of images assigned to a pre-defined level of distortion. The distortion levels are meant to reflect three perceptually distinguishable levels of image quality (i.e., denoted as "High", "Medium" and "Low"). The error bars indicate a 95% confidence interval.
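The DMOS-to-level mapping described above can be sketched as follows. This is a minimal illustration, not the selection code used for the study: the function name `quality_level` is hypothetical, and because the stated intervals share their end points (40 and 70), the sketch assumes that a boundary value falls into the lower-quality category.

```python
def quality_level(dmos: float) -> str:
    """Map a DMOS value (higher means worse quality) to a perceived-quality level.

    Assumption: the boundary values 40 and 70 are assigned to the
    lower-quality category, since the source intervals overlap there.
    """
    if not 10 <= dmos <= 100:
        raise ValueError("DMOS outside the range covered by the mapping")
    if dmos < 40:
        return "High"
    if dmos < 70:
        return "Medium"
    return "Low"
```

Applied to the three category means reported above (30, 55 and 83), the function returns "High", "Medium" and "Low", respectively.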

B. Proposed Experimental Protocol
There is little consensus on which method is the most appropriate for the conduct of an eye-tracking experiment for the purpose of image quality study. A within-subjects method, in which the same group of subjects views all test stimuli, is commonly used in relevant studies [29], [40]-[42]. This experimental methodology, however, potentially contaminates the results due to carry-over effects, which refer to any effect that carries over from one experimental condition to another [52]. Such effects become more pronounced as the number of test stimuli and/or the rate of stimulus repetition increases in eye-tracking. In our case, the test dataset contains a total of 288 stimuli representing 16 repeated versions (i.e., 15 distorted + 1 original) per source image, which makes the use of a within-subjects method prone to undesirable effects such as fatigue, boredom and learning from practice and experience, and thus increases the chances of skewing the results. To overcome these problems, an alternative method, namely the between-subjects method [53], was employed in our experiment. In a between-subjects method, multiple groups of subjects are randomly assigned to partitions of test stimuli, each containing little or no stimulus repetition. We decided to divide the test dataset into 8 partitions of 36 stimuli each, and to allow only 2 repeated versions of the same scene in each partition. To further reduce the carry-over effects, each session per subject was divided into two sub-sessions with a "washout" period between sub-sessions; by doing so, each subject effectively viewed 18 stimuli without any stimulus repetition in each sub-session. Mechanisms were further applied to control the order in which participants per group perform their tasks: (1) half of the participants view the first half of the partition first, and the other half of the participants view the second half first; (2) the stimuli in each sub-session are presented to each subject in a random order. A dedicated control mechanism was also adopted in each sub-session to deliberately include a mixture of all distortion types and the full range of distortion levels.
We recruited 160 participants in our experiment, consisting of 80 male and 80 female university students and staff members (between 19 and 42 years of age), all inexperienced with image quality assessment and eye-tracking. The participants were not tested for vision defects; we considered their verbal confirmation of sound vision to be adequate. The participants were first randomly divided into 8 groups of equal size, each with 10 males and 10 females; and the 8 groups of subjects were then randomly assigned to the 8 partitions of stimuli. Based on the rule of thumb for determining sample size in relevant studies (i.e., 5-15 subjects per test stimulus), we assume 20 subjects per stimulus is an adequate sample size (note that the validity of the sample size is further quantitatively tested in Sec. IV).
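The partitioning and counterbalancing scheme can be illustrated with the sketch below. It is an assumption-laden simplification rather than the protocol's actual implementation: the names `build_partitions` and `presentation_order` are hypothetical, versions are assigned to partitions purely at random, and the additional control that each sub-session contains a mixture of all distortion types and levels is not enforced here.

```python
import random

DISTORTIONS = ["WN", "JPEG", "GBLUR", "JP2K", "FF"]
LEVELS = ["High", "Medium", "Low"]
# 1 original + 5 distortion types x 3 levels = 16 versions per source image
VERSIONS = ["original"] + [f"{d}-{lvl}" for d in DISTORTIONS for lvl in LEVELS]


def build_partitions(n_scenes=18, n_partitions=8, seed=0):
    """Split all stimuli into partitions holding 2 versions of each scene.

    Each partition has two halves (sub-sessions); a scene appears exactly
    once per half, so no sub-session contains stimulus repetition.
    """
    if len(VERSIONS) != 2 * n_partitions:
        raise ValueError("each partition must receive exactly 2 versions per scene")
    rng = random.Random(seed)
    partitions = [{"first": [], "second": []} for _ in range(n_partitions)]
    for scene in range(n_scenes):
        shuffled = VERSIONS[:]
        rng.shuffle(shuffled)  # random assignment (assumption of this sketch)
        for p, part in enumerate(partitions):
            part["first"].append((scene, shuffled[2 * p]))
            part["second"].append((scene, shuffled[2 * p + 1]))
    return partitions


def presentation_order(partition, subject_index, rng):
    """Counterbalance which half is viewed first and randomise within halves."""
    halves = ["first", "second"] if subject_index % 2 == 0 else ["second", "first"]
    order = []
    for half in halves:
        stimuli = partition[half][:]
        rng.shuffle(stimuli)  # random presentation order per subject
        order.append(stimuli)
    return order
```

With the default arguments this yields 8 partitions of 36 stimuli (two sub-sessions of 18), covering all 288 stimuli exactly once; with 20 subjects per partition this corresponds to 160 × 36 = 5760 trials.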

C. Experimental Procedure
We set up a standard office environment according to the recommendations of [54] for the conduct of our experiment. The test stimuli were displayed on a 19-inch LCD monitor (native resolution 1024 × 768 pixels). The viewing distance was set to approximately 60 cm. Eye movements were recorded using an image-processing-based, contact-free tracking system with sufficient head movement compensation (SensoMotoric Instruments (SMI) RED-m). The eye-tracking system features a sampling rate of 120 Hz, a spatial resolution of 0.1 degrees and a gaze position accuracy of 0.5 degrees. Each subject was provided with instructions on the purpose and general procedure of the experiment before the start of the actual experiment. Each session per subject contained two successive sub-sessions with a break of 60 minutes between sub-sessions. Since each subject had only two viewings of the same scene, the 60-minute "washout" period was considered sufficient to balance between further reducing the carry-over effects and completing the entire data collection within a reasonable timescale. Each individual sub-session was preceded by a 9-point calibration of the eye-tracking equipment. The participants were instructed to look at the stimuli in a natural way ("view it as you normally would"). Each stimulus was shown for 10 seconds, followed by a mid-grey screen for 3 seconds.
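For readers who wish to relate the angular quantities above (e.g., tracker accuracy in degrees) to screen pixels, the following sketch converts one degree of visual angle to pixels. The function name `pixels_per_degree` is hypothetical, and the calculation assumes square pixels, an image that fills the whole 19-inch diagonal, and viewing along the screen normal; none of these details are stated in the setup description.

```python
import math


def pixels_per_degree(distance_cm, diagonal_inch, res_x, res_y):
    """Approximate number of pixels subtended by 1 degree of visual angle.

    Assumptions: square pixels, the image fills the panel diagonal,
    and the observer looks at the screen centre along its normal.
    """
    diagonal_px = math.hypot(res_x, res_y)
    pixel_pitch_cm = diagonal_inch * 2.54 / diagonal_px
    # Extent on the screen (in cm) covered by 1 degree at this distance
    cm_per_degree = 2 * distance_cm * math.tan(math.radians(0.5))
    return cm_per_degree / pixel_pitch_cm
```

Under these assumptions, the setup described above (60 cm, 19 inches, 1024 × 768) gives roughly 28 pixels per degree.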

A. Gaze Map
A gaze map representative of stimulus-driven, bottom-up visual attention is derived from the recorded fixations [29]-[31]. Fixations were extracted from the raw eye-tracking data using the SMI BeGaze Analysis Software with the minimum fixation duration threshold set to 100 ms. A fixation was rigorously defined by SMI's software using the dispersal- and duration-based algorithm established in [55]. Fig. 3(b) illustrates the collection of fixations over all subjects (i.e., 20) for each of the two sample stimuli. To construct a topographic gaze map for an average human observer, each fixation location (contained in the aggregated data as shown in Fig. 3(b)) gives rise to a grey-scale patch that simulates the foveal vision of the HVS. The activity of the patch is modelled as a Gaussian distribution whose width approximates the size of the fovea (2 degrees of visual angle). As in the relevant literature (see e.g., [29], [41], [42]), the duration of fixation was not included when creating a gaze map.
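The gaze map construction can be sketched as below, using only the Python standard library. This is an illustrative approximation: the function name `gaze_map` is hypothetical, the relation between the Gaussian's standard deviation `sigma_px` and the 2-degree foveal width is left to the caller because it is not specified above, each patch is truncated at three standard deviations, and the final map is scaled so that its peak equals 1.

```python
import math


def gaze_map(fixations, width, height, sigma_px):
    """Accumulate one Gaussian patch per fixation into a height x width map.

    fixations: iterable of (x, y) pixel coordinates; fixation duration is
    ignored, so every fixation contributes with equal weight.
    """
    gmap = [[0.0] * width for _ in range(height)]
    radius = int(math.ceil(3 * sigma_px))  # truncation radius (assumption)
    two_var = 2.0 * sigma_px ** 2
    for fx, fy in fixations:
        cx, cy = int(round(fx)), int(round(fy))
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            wy = math.exp(-((y - fy) ** 2) / two_var)
            row = gmap[y]
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                row[x] += wy * math.exp(-((x - fx) ** 2) / two_var)
    peak = max(max(row) for row in gmap)
    if peak > 0:
        # Scale to [0, 1] for display as a grey-scale map
        gmap = [[v / peak for v in row] for row in gmap]
    return gmap
```

A call such as `gaze_map(all_fixations, 1024, 768, sigma_px)` would aggregate the fixations of all observers of one stimulus; `all_fixations` and the choice of `sigma_px` are placeholders.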

B. Validation: Proposed Reliability Testing
Since a standardised methodology for the collection of eye-tracking data does not exist, researchers often follow best practice guidelines for the design of their own experiments. The resulting data, however, differ in their reliability depending on the choices made in the experimental methodology, such as the sample size and the ways of presenting stimuli [56]. To make use of eye-tracking data as a solid "ground truth", it is crucial to validate the reliability of the collected data. We, therefore, propose and perform systematic reliability testing to assess: (1) whether the variances in the eye-tracking data obtained from different subject groups (in a between-subjects method) are similar; (2) whether the sample size (number of participants) per stimulus is sufficient to create a stable gaze map; and (3) whether the eye-tracking data collected in our study are comparable to similar data obtained from other independent studies. Note that, hereafter, when performing a statistical significance test, a parametric test (e.g., t-test) is used if the assumption of normality is satisfied; otherwise, a nonparametric alternative (e.g., Wilcoxon signed rank test) is used.
1) Homogeneity of Variances Between Groups: Since a between-subjects method is adopted, assuming the representativeness of participants in each group is satisfied, we test whether variances of eye-tracking data across all groups are homogeneous. To identify such homogeneity, we measure the inter-observer agreement (IOA), which refers to the degree of agreement in saliency among observers viewing the same stimulus [57], [58]. In our implementation, per stimulus and per subject group, IOA is quantified by comparing the gaze map generated from the fixations of all-except-one observers to the gaze map built upon the fixations of the excluded observer, and by repeating this operation so that each observer serves as the excluded subject once. The similarity between two gaze maps is commonly measured by AUC (i.e., area under the receiver operating characteristic curve) [13]. Fig. 4 illustrates the IOA value averaged over all stimuli assigned to each subject group in our experiment. It shows that the IOA remains similar across the eight groups. A statistical significance test (i.e., analysis of variance (ANOVA)) is performed and the results show that there is no statistically significant difference between groups (i.e., with P > 0.05 at the 95% confidence level). The above evaluation indicates that a high degree of consistency across groups is found in our data collection.
2) Data (Saliency) Saturation: There is, unfortunately, no general agreement on how many participants are adequate to achieve reliable eye-tracking data. Researchers often use "data saturation" as a guiding principle to check whether a given/chosen sample size is sufficient to yield a "saturated" gaze map, i.e., a gaze map that has reached the point at which no new information is observed. We test the adequacy of the sample size required to reach saliency "saturation" (i.e., a proxy of a sufficient degree of reliability) in our experimental data. The validation is again based on the principle of IOA, which is extended to an inter-k-observer agreement measure (referred to as IOA-k, with k = 2, 3, ..., 20). More specifically, for a given stimulus, IOA-k is calculated by randomly selecting k participants among all observers. Fig. 5 illustrates the IOA-k value averaged over all stimuli contained in our entire dataset. It shows that "saturation" occurs with 16 participants, although a reasonably high degree of consistency in fixation deployment is already reached with 12 participants. It demonstrates that our chosen number of 20 observers for each subject group is sufficient to yield a stable/saturated gaze map.
3) Cross-Database Similarity: To further evaluate the reliability of our eye-tracking data as a "ground truth", we compare our data to other relevant databases that are publicly available and obtained from independent laboratories. In terms of free-viewing eye movement recordings related to the LIVE database, there exist three widely cited eye-tracking databases (with stimuli being only the 29 source images of the LIVE database), namely TUD [29], UN [59] and UWS [30]. An exhaustive comparative study was already conducted in [59], and shows a high degree of similarity between these databases, despite the fact that they were independently collected under different experimental conditions. As a reference provided in [59], when two independently generated gaze maps of the same image are compared by means of Pearson correlation, a result that falls into the range [0.8, 0.9] indicates a high degree of similarity. Since we only selected 18 source images from the LIVE database, the comparison had to be based on these 18 images only. The Pearson correlation averaged over all images between our data and TUD is 0.87; and is 0.87 and 0.89 with respect to UN and UWS, respectively. This suggests that our eye-tracking data can be considered a reliable "ground truth".
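The leave-one-out procedure behind the IOA measure in 1) above can be sketched as follows. Several points are assumptions of this sketch rather than details taken from the paper: the names `count_map`, `auc` and `inter_observer_agreement` are hypothetical; the excluded observer is represented by his or her fixated pixels (positives) rather than by a smoothed gaze map; all non-fixated pixels serve as negatives; and `count_map` is an unsmoothed stand-in for whatever gaze map builder is actually used.

```python
from bisect import bisect_left, bisect_right


def count_map(fixations, width, height):
    """Unsmoothed fixation-count map (stand-in for a smoothed gaze map)."""
    gmap = [[0.0] * width for _ in range(height)]
    for x, y in fixations:
        gmap[int(y)][int(x)] += 1.0
    return gmap


def auc(saliency, fixations):
    """Area under the ROC curve: fixated pixels are positives, all
    remaining pixels are negatives (rank-based formulation, ties count 0.5)."""
    fixated = {(int(x), int(y)) for x, y in fixations}
    pos = [saliency[y][x] for x, y in fixated]
    neg = sorted(v for y, row in enumerate(saliency)
                 for x, v in enumerate(row) if (x, y) not in fixated)
    if not pos or not neg:
        raise ValueError("need at least one fixated and one non-fixated pixel")
    score = 0.0
    for p in pos:
        lower = bisect_left(neg, p)          # negatives strictly below p
        ties = bisect_right(neg, p) - lower  # negatives equal to p
        score += lower + 0.5 * ties
    return score / (len(pos) * len(neg))


def inter_observer_agreement(fixations_by_observer, build_map):
    """Average AUC over all observers, each held out once."""
    observers = list(fixations_by_observer)
    scores = []
    for held_out in observers:
        rest = [f for o in observers if o != held_out
                for f in fixations_by_observer[o]]
        scores.append(auc(build_map(rest), fixations_by_observer[held_out]))
    return sum(scores) / len(scores)
```

For example, `inter_observer_agreement(data, lambda f: count_map(f, 1024, 768))` would return one IOA value for a stimulus, where `data` maps observer identifiers to lists of (x, y) fixations. An IOA-k variant could apply the same function to a random subset of k observers.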

C. Validation: Impact of Stimulus Repetition
We hereby investigate the impact of stimulus repetition on the reliability of data collection, via a dedicated eye-tracking experiment combining the ideas of both [44] and [45] as mentioned in Section II-A. Note that our main purpose here is to raise awareness of the need for eliminating stimulus repetition in the scenario where subjects have to view the same scene repeatedly, e.g., 16 times, rather than to compare the general usage of different subjective testing methodologies. Our experiment aims to investigate two aspects: 1) how stimulus repetition affects fixation behaviour when viewing several distorted versions of the same scene (as similarly studied for videos in [45]); 2) how stimulus repetition affects fixation behaviour when viewing the same undistorted scene several times (as similarly studied for videos in [44]).
We chose five source images to construct our test stimuli. In creating distorted stimuli, we selected 7 distorted images (covering all available distortion types and the full range of DMOS) per content from the LIVE database, resulting in 35 distorted images. In creating undistorted stimuli, we used the 5 source images three times. This gave a total of 50 test stimuli. As illustrated in Fig. 6, the 35 distorted stimuli were presented in a random order to each participant. The three groups of the same source images (presented in a random order within each group) were positioned at the beginning, middle and end of the presentation. Therefore, in terms of the distorted stimuli, there are 7 repetitions per content; and in terms of the undistorted stimuli, there are 3 repetitions per content. We recruited 20 participants (10 females and 10 males) in our experiment. Each participant freely viewed all stimuli. Each stimulus was shown for 10 seconds followed by a mid-grey screen for 3 seconds. We followed the same experimental setup as described in Section III-C.
1) The Effects for Distorted Stimuli (7 Repetitions): For each participant, first the similarity in fixations between each distorted image and the corresponding source image (presented in the beginning) is measured by AUC. Then, the 7 AUC values per content are ranked in the order of viewing, and averaged over all contents and all participants as shown in Fig. 7. It clearly shows the general trend that the similarity decreases as the viewing order increases, independent of the image content, distortion type and distortion level. The results of a t-test show that there is a statistically significant difference between the 1st viewing and the Nth viewing (N = 3 to 7) with P < 0.05 at the 95% confidence level. This suggests that stimulus repetition can significantly impact the fixation behaviour, and consequently bias the intended fixation data.
2) The Effects for Undistorted Stimuli (3 Repetitions): A mean gaze map (over all subjects) is produced for each undistorted stimulus, and is compared by AUC to the corresponding baseline gaze map taken from the TUD database [29]. The gaze maps contained in the TUD database were collected under task-free, no distortion, no stimulus repetition conditions, using the source images of the LIVE database. Fig. 8 illustrates the AUC values in viewing order, averaged over all 5 source images. It shows that the similarity dramatically drops after the first viewing of a scene, independent of image content. A Wilcoxon signed rank test shows that there is a statistically significant difference between the first and the second (or the third) viewing with P < 0.05 at the 95% confidence level.
The above study provides evidence that when subjects view the same stimuli repeatedly the fixation data are likely to be biased, and care should be taken to eliminate the effect of stimulus repetition in such a scenario.

D. Fixation Deployment
Fig. 9(a) illustrates an overview of all distorted versions (5 distortion types × 3 distortion levels) of a source image (of a large degree of saliency dispersion) and their corresponding gaze maps (referred to as distorted scene saliency (DSS)). The same layout of distorted images and DSS for a different source image (of a small degree of saliency dispersion) is illustrated in Fig. 9(b). The grids visualise typical correspondences and differences between DSS originating from the same source image. In general, there exist consistent patterns among the relevant DSS, e.g., the highly salient regions tend to cluster around the same positions. However, there are some deviations, which are seemingly caused by either the distortion type or the distortion level. It is observed in Fig. 9(a) that as the quality degrades (i.e., the strength of distortion increases), saliency patterns become more convergent (i.e., fewer heated areas in DSS); and that, at the same distortion level, how saliency disperses tends to depend on the distortion type, e.g., at "High" quality saliency is more spread out for JPEG, JP2K and FF than for WN and GBLUR. In addition, the two examples (originating from two different source images) exhibit different trends in terms of the variation in the array of DSS. For example, the change in quality seems to cause a more obvious rate of convergence in saliency in Fig. 9(a) than in Fig. 9(b). This may be due to the fact that the two source images fall into distinct categories of visual content in terms of saliency dispersion (see Fig. 1). It implies that image content also has an impact on the deployment of DSS, as already mentioned in [31].

V. INTERACTIVE RELATIONS BETWEEN SALIENCY AND QUALITY ASPECTS
The resulting eye-tracking data represent sufficient statistical power, which allows further statistical analysis on the observed tendencies in the changes of saliency induced by the changes of image quality aspects. More specifically, we evaluate the impact of three individual categorical variables (i.e., distortion type, distortion level and image content) on the deployment of fixation.

A. Investigation Framework
We use saliency derived from the original undistorted scene (i.e., referred to as scene saliency (SS)) as the reference, and quantify the deviation of DSS from its corresponding reference SS. The deviation between two gaze maps is often quantified by three similarity measures widely used in the literature. They are Pearson linear correlation coefficient (CC) [60], [61], normalized scanpath saliency (NSS) [62], [63] and AUC [13]. The use of these measures is already described in more detail in [64], and we only briefly repeat their meaning in our context as follows:
CC: When CC is close to -1 or 1, the similarity between SS and DSS is high; when CC is close to 0, the similarity is low.
NSS: When NSS>0, the higher the value of the measure the more similar DSS and SS are; whereas NSS<0 indicates that being able to use DSS to reproduce its reference SS is likely due to chance only.
AUC: AUC=1 means DSS can predict perfectly the characteristics of its reference SS; whereas AUC=0.5 corresponds to a prediction at chance level.
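As an illustration of the three measures, the following is a minimal sketch and not the implementation used in the study (see [64] for the authors' usage). It assumes that both gaze maps are 2-D arrays of equal size and that the reference SS is represented for NSS and AUC by a binary mask of fixated pixels. Here that mask is obtained by keeping the top 10% of SS values, which is a hypothetical choice made only for this example. The function names and the synthetic maps are likewise illustrative.

```python
import numpy as np

def cc(ss, dss):
    """Pearson linear correlation coefficient between two gaze maps."""
    a = ss.ravel() - ss.mean()
    b = dss.ravel() - dss.mean()
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))

def nss(dss, fix_mask):
    """Mean of the z-scored DSS map at the reference fixation locations."""
    z = (dss - dss.mean()) / dss.std()
    return float(z[fix_mask].mean())

def auc(dss, fix_mask):
    """Area under the ROC curve via the rank-sum identity: the probability
    that a fixated pixel has a higher DSS value than a non-fixated pixel
    (ties count as one half)."""
    pos = dss[fix_mask]
    neg = dss[~fix_mask]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

# Synthetic maps for illustration only: two Gaussian blobs, slightly shifted.
yy, xx = np.mgrid[0:32, 0:32]
ss = np.exp(-((xx - 16) ** 2 + (yy - 16) ** 2) / 50.0)
dss = np.exp(-((xx - 18) ** 2 + (yy - 15) ** 2) / 50.0)

# Hypothetical reference mask: the 10% most salient pixels of SS.
fix_mask = ss >= np.quantile(ss, 0.9)

print("CC :", cc(ss, dss))
print("NSS:", nss(dss, fix_mask))
print("AUC:", auc(dss, fix_mask))
```

The pairwise comparison in `auc` is quadratic in the number of pixels, which is acceptable for small maps; for full-resolution gaze maps a rank-based or thresholded ROC computation would be preferable.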

B. Investigation Results
The statistical evaluation is based on 270 data points (i.e., 270 distorted stimuli rooted from 18 originals) of SS-DSS similarity (i.e., the similarity calculated by CC, NSS and AUC between a given DSS and its corresponding SS). A full factorial ANOVA is conducted with the SS-DSS similarity as the dependent variable (the test for the assumption of normality indicates that the dependent variable is normally distributed), and the distortion type, distortion level and image content as independent variables. The results are summarized in Table I, and show that all main effects (except for the case of distortion type when AUC and NSS are used for SS-DSS similarity) are statistically significant.

1) Impact of Distortion Type on SS-DSS Similarity:
As shown in Table I, "distortion type" has a statistically significant effect on SS-DSS similarity measured by CC. The same effect, however, is not found when the SS-DSS similarity is calculated based on NSS or AUC. The inconsistency in the results is attributed to the fact that different similarity measures capture different characteristics of saliency changes while being coherent in measuring SS-DSS similarity, as already mentioned in [64]. CC focuses on the similarity in terms of the spatial distribution of fixation, whereas NSS and AUC are based on the estimation of similarity in terms of the locality and density of fixations. Fig. 10 illustrates the rankings of the five available distortion types in terms of the SS-DSS similarity measured by CC, NSS and AUC, respectively. They consistently produce the same rank order for the five distortion types. For each subplot, the results of hypothesis testing (i.e., Wilcoxon signed rank test) show that the impact of distinct distortion types (e.g., FF and GBLUR) on SS-DSS similarity is statistically different with P<0.05 at the 95% confidence level. The distortions contained in FF (i.e., high-frequency, localised artifacts) produce a large extent of saliency deviation, whereas the GBLUR distortions (i.e., low-contrast, uniformly distributed artifacts) cause only slight changes in saliency.
2) Impact of Distortion Level on SS-DSS Similarity: Table I shows that "distortion level" has a statistically significant effect on SS-DSS similarity, independent of the similarity measure used. The degree of saliency deviation increases as the perceived quality decreases (or the strength of distortion increases). Fig. 11 illustrates the measured SS-DSS similarity (again in terms of CC, NSS and AUC) for three levels of perceived quality. It reveals a statistically significant (i.e., based on t-test with P<0.05 at the 95% confidence level) drop in SS-DSS similarity at low quality relative to the other two cases, which means that the distraction power of the annoying artifacts (or strong distortions) present in an image comes to impact the perception of the natural scene.
3) Impact of Image Content on SS-DSS Similarity: Table I also shows that SS-DSS similarity is strongly affected by "image content" (i.e., classified by the degree of saliency dispersion). Fig. 12 illustrates the measured SS-DSS similarity (again in terms of CC, NSS and AUC) for images having different degrees of saliency dispersion. In the case of images that do not contain highly salient objects (i.e., a large degree of saliency dispersion), adding artifacts to these images results in substantial changes between SS and DSS, as indicated by the statistically significant (i.e., based on t-test with P<0.05 at the 95% confidence level) drop in SS-DSS similarity relative to the other two cases. On the other hand, images with highly salient objects (i.e., a small degree of saliency dispersion) are less sensitive to the distortions, as evidenced by the statistically significantly larger (i.e., based on t-test with P<0.05 at the 95% confidence level) values of CC, NSS and AUC.

VI. SS VERSUS DSS ON THE PERFORMANCE GAIN OF OQMs
Previous research [29] has demonstrated that adding "ground truth" SS does improve the performance of OQMs in predicting perceived image quality. The findings, however, also showed that the performance gain could be potentially optimised by taking into account the interactions between SS and distortion. DSS, to some extent, represents the interactive effect of the concurrence of natural scene and unnatural artifacts. The added value of DSS as opposed to SS in OQMs, however, has not been investigated. To provide insights into this matter, both types of saliency are added to several OQMs well-known in the literature.

A. Investigation Framework
We follow the general framework established in [65] for assessing the added value of saliency in OQMs. The basic idea is to quantify the performance gain of an OQM by comparing its predictive power with and without saliency. The predictive power of an OQM can be simply measured by the Pearson correlation (i.e., CC) between the output of the OQM and the subjective quality ratings [66]; and the performance gain can be effectively expressed by the increase in CC (i.e., ΔCC). The OQMs used in our evaluation are six full-reference (FR) OQMs, namely peak signal-to-noise ratio (PSNR) [1], universal quality index (UQI) [67], structural similarity index (SSIM) [12], multi-scale SSIM (MS-SSIM) [68], visual information fidelity (VIF) [69] and feature similarity index (FSIM) [70]; and four no-reference (NR) OQMs, namely generalized block-edge impairment metric (GBIM) [71], NR blocking artifact measure (NBAM) [72], NR perceptual blur metric (NPBM) [73] and just noticeable blur metric (JNBM) [74].
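The idea of measuring the gain as an increase in CC can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework of [65]: it assumes that an OQM produces a local distortion map per image, that saliency is integrated by a saliency-weighted average of that map, and that the absolute value of CC is taken so that the sign convention of the subjective scores does not matter. The function names and the toy data are hypothetical.

```python
import numpy as np

def pearson_cc(x, y):
    """Pearson linear correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])

def pooled_score(distortion_map, saliency=None):
    """Average a local distortion map, optionally weighted by a saliency map."""
    if saliency is None:
        return float(distortion_map.mean())
    return float((distortion_map * saliency).sum() / saliency.sum())

def delta_cc(distortion_maps, saliency_maps, subjective_scores):
    """Increase in |CC| obtained by saliency-weighted pooling (assumed scheme)."""
    base = [pooled_score(d) for d in distortion_maps]
    weighted = [pooled_score(d, s) for d, s in zip(distortion_maps, saliency_maps)]
    cc_base = abs(pearson_cc(base, subjective_scores))
    cc_sal = abs(pearson_cc(weighted, subjective_scores))
    return cc_sal - cc_base

# Toy data (synthetic, for illustration only): the subjective score follows
# the distortion inside the salient 2x2 corner, while the background
# distortion varies independently of it.
saliency = np.zeros((4, 4))
saliency[:2, :2] = 1.0
scores = [1, 2, 3, 4, 5, 6]
background = [3, 1, 2, 3, 1, 2]
maps = []
for k, e in zip(scores, background):
    d = np.full((4, 4), float(e))
    d[:2, :2] = float(k)
    maps.append(d)

gain = delta_cc(maps, [saliency] * len(maps), scores)
print(f"delta CC = {gain:.3f}")
```

In this toy case the weighted score tracks the subjective score exactly, so the gain is positive by construction; on real data the gain depends on the OQM, the saliency source and the stimuli.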

B. Investigation Results
1) Original Versus Saliency-Based OQMs: Per OQM, adding SS and DSS (i.e., as the implementation detailed in [65]) results in two new saliency-based OQMs. The performance (i.e., CC) of an OQM is calculated based on the subjective quality scores contained in our database, and is summarised in Table II. In general, it shows that the performance of OQMs is improved by using both SS and DSS. The gain (i.e., ΔCC) ranges from 0.002 (FSIM extended with SS) to 0.058 (GBIM extended with DSS). Note that VIF and FSIM obtain a relatively small gain by adding saliency, due to the fact that some well-established saliency aspects (i.e., information content feature in VIF [75] and phase congruency feature in FSIM [76]) are already embedded in these metrics, which consequently causes a saturation effect in saliency optimisation [65].
The observed effects are statistically analysed with hypothesis testing, selecting the metric strategy (SS-based vs. original or DSS-based vs. original) as the independent variable and the performance gain as the dependent variable. A Wilcoxon signed rank test is performed using the data points contained in Table II. The results, with P<0.01 at the 95% confidence level, reveal that both SS and DSS statistically significantly improve the original OQMs. To further check the effectiveness of adding saliency for individual OQMs, the differences were statistically analysed per OQM (i.e., as the implementation detailed in [65]): in the case of normality, a t-test was performed; otherwise a Wilcoxon signed rank test was conducted. The results are summarised in Table III.
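For a paired comparison over ten OQMs, the Wilcoxon signed rank test can be computed exactly by enumerating all sign patterns. The sketch below is an illustrative implementation, not the statistical software used in the study; the function names are hypothetical, and the CC values are made-up placeholders that are not taken from Table II.

```python
from itertools import product

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed rank test for paired samples.

    Returns (W_plus, p_value). Zero differences are dropped. The exact
    enumeration costs 2**n steps, so it is meant for small n only.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    count = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        count += 1
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / count

# Placeholder CC values for 10 OQMs (hypothetical, NOT the values of Table II).
cc_original = [0.70, 0.72, 0.74, 0.76, 0.78, 0.80, 0.82, 0.84, 0.86, 0.88]
cc_with_saliency = [0.704, 0.728, 0.752, 0.776, 0.800,
                    0.824, 0.848, 0.872, 0.896, 0.920]

w_plus, p_value = wilcoxon_signed_rank(cc_with_saliency, cc_original)
print(f"W+ = {w_plus}, p = {p_value:.5f}")
```

With ten pairs that all change in the same direction, the smallest attainable two-sided p-value is 2/1024, which shows how a consistent gain across ten OQMs can reach significance even when the individual gains are small.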
2) SS-Based Versus DSS-Based OQMs: As can be seen in Table II, on average (over all OQMs), the gain achieved by use of SS is similar to that of using DSS. To check the effects with a statistical analysis, a Wilcoxon signed rank test is performed, selecting the type of saliency as the independent variable and the performance as the dependent variable. The test results (i.e., P>0.05 at the 95% confidence level) show that there is no statistically significant difference between the inclusion of the two types of saliency.
In response to the investigation framework identified in Section V, we further assess how the performance gain between SS-based and DSS-based OQMs is affected by the observed main effects, i.e., the distortion type, distortion level and image content. More specifically, our database is again characterised at three individual aggregation levels, using "distortion type", "distortion level" and "image content" as the classification variables, respectively. Fig. 13(a) illustrates the performance gain (i.e., ΔCC) averaged once over all SS-based OQMs and once over all DSS-based OQMs, when assessing WN, JPEG, GBLUR, JP2K and FF, respectively. It shows that both types of saliency are beneficial for OQMs (i.e., ΔCC values are positive in all cases). Results of a Wilcoxon signed rank test show that the difference in performance gain between the use of SS and DSS is not statistically significant (P>0.05 at the 95% confidence level) for all distortion types except JP2K. For JP2K, using DSS improves the OQMs' performance more, which is in line with the conclusions drawn in [65] that when saliency is added to OQMs for assessing localised distortion, such as JP2K, taking into account the interactions between saliency and distortion can be used to optimise the performance gain. Note that the same trend is also observed for the localised JPEG and FF distortions, although the results are not significant in our current samples.
Fig. 13(b) shows the comparison of ΔCC between SS-based and DSS-based OQMs, when assessing images with three distinct levels of perceived quality. At low quality, OQMs do not benefit from the use of saliency (i.e., marginal values of ΔCC). At high quality, there is no statistically significant difference (i.e., based on t-test with P>0.05 at the 95% confidence level) between the added value of SS and DSS, which is attributed to the fact that SS and DSS are very similar (i.e., a small degree of SS-DSS deviation as shown in Fig. 11). In terms of the medium level of quality, the results of a t-test (with P<0.05 at the 95% confidence level) demonstrate that adding DSS to OQMs yields a statistically significantly higher performance gain than adding SS, suggesting that the use of saliency in OQMs potentially benefits from taking into account the interactions between saliency and distortion.
Fig. 13(c) illustrates the difference in ΔCC between SS-based and DSS-based OQMs, when assessing images with three distinct degrees of saliency dispersion. Adding saliency deteriorates the performance of OQMs for assessing images with a large degree of saliency dispersion, which should be avoided in saliency optimisation. This is mainly due to the uncertainty of a dispersed gaze map, which confuses the workings of OQMs by, e.g., unhelpfully downplaying the importance of high distortion in certain regions [31]. Images with a medium range of saliency dispersion do not profit from adding saliency to an OQM (i.e., marginal ΔCC). For images having a small degree of saliency dispersion, the use of DSS produces a statistically significantly (i.e., based on t-test with P<0.05 at the 95% confidence level) larger ΔCC than the use of SS. Again, this suggests that the interactions between saliency and distortion play a significant role in optimising the increase in the performance of OQMs.

VII. STUDY OF MODELLED SS AND DSS
A realistic OQM, however, will use a computational saliency model rather than eye-tracking. Before the application of a saliency model, it is highly desirable to validate its performance against the ground truth. Benchmarking saliency models against ground truth SS has been attempted [13], [65], [77]; however, little is known about the performance of existing saliency models in detecting DSS. Questions still remain whether these saliency models sufficiently cope with the distortions added to the undistorted scenes, or at least whether they operate on the original and the distorted stimuli in a similar manner. In addition, the benefits of including SS versus DSS in OQMs have been demonstrated by use of eye-tracking data in Section VI. It is worthwhile to verify whether the findings still remain significant, and potentially useful, when computational saliency is used in its place.
Existing saliency models are usually evaluated against the fixations collected with undistorted natural scene stimuli (i.e., SS); we now check their corresponding performance on distorted stimuli (i.e., modelled DSS). Fig. 15 illustrates the predictive power (i.e., based on SAUC) of the saliency models using the subset of undistorted stimuli and the subset of distorted stimuli in our database, where modelled SS is evaluated against ground truth SS and modelled DSS is evaluated against ground truth DSS. The Pearson correlation between the two sets of SAUC values is 0.98, indicating that the performance of individual saliency models is consistent in both cases. A t-test is also conducted between the two sets of SAUC values; and the results (i.e., with P>0.05 at the 95% confidence level) show that the average performance of saliency models does not differ significantly between the two cases.

B. Modelled SS Versus DSS in OQMs
We conducted a statistical evaluation using 27 state-of-the-art saliency models and the 10 best-known OQMs used in Section VI. The study thus resulted in 270 saliency-augmented OQMs; and the performance of each OQM was evaluated against the entire LIVE database (with 779 stimuli). In each case, modelled SS and modelled DSS are generated by applying a saliency model to the reference image and the distorted image, respectively. Table IV shows the performance (i.e., CC without non-linear regression) in each case, averaged over 27 saliency models.
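The bookkeeping behind such an averaged table can be sketched as follows. This is only an illustration of the aggregation step: the array names are hypothetical, and the CC values are seeded random placeholders that do not reproduce any entry of Table IV.

```python
import numpy as np

n_models, n_oqms = 27, 10
rng = np.random.default_rng(0)

# Placeholder CC of each original OQM (synthetic values, not measured data).
cc_original = rng.uniform(0.7, 0.9, size=n_oqms)

# Placeholder CC of every saliency-augmented OQM: one row per saliency model,
# one column per OQM, i.e., 27 x 10 = 270 combinations.
cc_augmented = cc_original + rng.normal(0.01, 0.005, size=(n_models, n_oqms))

# Average over saliency models (rows) to obtain one value per OQM,
# then express the gain relative to the original OQM.
mean_cc = cc_augmented.mean(axis=0)
mean_gain = mean_cc - cc_original

print("combinations:", cc_augmented.size)
print("mean gain per OQM:", np.round(mean_gain, 4))
```

The same grid would be filled twice in practice, once with modelled SS and once with modelled DSS, so that the two averaged columns can be compared per OQM.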
Compared with the conclusions drawn from Table II, Table IV reveals the following consistent findings: (1) OQMs generally benefit from including both modelled SS and DSS. A Wilcoxon signed rank test is performed using the data points of Table IV. The results, with P<0.05 at the 95% confidence level, show that both modelled SS and DSS statistically significantly improve the original OQMs. (2) On average (over all OQMs), there is no statistically significant difference between the use of modelled SS and DSS, as demonstrated by a Wilcoxon signed rank test with P>0.05 at the 95% confidence level.
We also repeated the same experiment as in Section VI to evaluate how the observed effects, i.e., the distortion type, distortion level and image content, impact the optimal use of modelled SS and DSS. Compared to the results reported in Fig. 13, Fig. 16 shows: (1) In terms of the impact of distortion type, the results of a t-test with P>0.05 at the 95% confidence level show that the difference between the use of modelled SS and DSS is not statistically significant for WN, FF and JPEG. For JP2K, modelled DSS yields a statistically significantly (i.e., based on t-test with P<0.05 at the 95% confidence level) larger ΔCC than modelled SS. The above findings are consistent with the results determined by eye-tracking data in Fig. 13. For GBLUR, modelled SS produces a statistically significantly (i.e., based on t-test) larger gain than modelled DSS with P<0.05 at the 95% confidence level, which is inconsistent with the results shown in Fig. 13. The relatively small gain obtained from modelled DSS is mainly caused by the fact that saliency models cannot fully capture the salient features of blurred images (or modelled saliency computed on blurred images is less accurate), which consequently reduces the usefulness of including saliency in an OQM. (2) In terms of the impact of distortion level and image content, Fig. 16 shows findings consistent with those presented in Fig. 13. We again performed t-tests on our data. The results show that using modelled DSS in OQMs produces a statistically significantly larger gain than using modelled SS, with P<0.05 at the 95% confidence level, when assessing images of medium quality and images having a small degree of saliency dispersion. The observed tendencies can therefore serve as useful tools in optimising the saliency integration in OQMs.

VIII. CONCLUSIONS
In this paper, we investigated a more reliable methodology for collecting eye-tracking data for image quality study. We proposed dedicated control mechanisms to effectively eliminate potential bias due to the involvement of massive stimulus repetition. The refined methodology resulted in a new eye-tracking database with a large degree of stimulus variability, including 288 test images distorted with different types of artifacts at various levels of degradation. The database contains 5760 eye movement trials recorded with 160 human observers.
Based on the "ground truth" data, we thoroughly assessed the interactions between saliency and distortion. An exhaustive statistical evaluation was conducted to provide insights into the tendencies in the changes of saliency induced by distortion. We found that the occurrence of distortion in an image tends to cause deviations in fixation deployment. We also quantified the extent of such deviation as a function of distortion type, degradation level and image content, respectively. In terms of optimal use of saliency in OQMs, we investigated whether saliency of the undistorted scene or that of the same scene affected by distortion would deliver the best performance gain for OQMs. The results show that both types of saliency are beneficial for OQMs, but the latter, which reflects the interactions between saliency and distortion, tends to further boost the effectiveness of the integration of saliency in OQMs.
We make use of our new eye-tracking database to benchmark saliency models for the purpose of image quality assessment. The evaluation indicates that existing saliency models operate on the undistorted and distorted scenes in a similar manner in terms of predicting human fixations. Moreover, the findings regarding the benefits of including SS versus DSS in OQMs still hold when using computational saliency instead of eye-tracking data.
Avenues for future research include an in-depth understanding of how visual attention plays a role in assessing image quality, and a quest for a perceptually optimised saliency integration strategy for quality assessment applications.

Fig. 1. Illustration of source images with different degrees of saliency dispersion used in our experiment, which yield 288 test images.

Fig. 3. (a) Two sample stimuli of distinct perceived quality (DMOS = 95.96 (top image) and DMOS = 32.26 (bottom image)). (b) The collection of human eye fixations over 20 subjects. (c) Gaze maps (the darker the regions are, the lower the saliency is). (d) Saliency superimposed on the sample stimuli.

Fig. 4. Illustration of inter-observer agreement (IOA) value averaged over all stimuli assigned for each subject group in our experiment. The error bars indicate a 95% confidence interval.

Fig. 5. Illustration of inter-k-observer agreement (IOA-k) value averaged over all stimuli contained in our entire dataset. The error bars indicate a 95% confidence interval.

Fig. 6. The construction of stimuli in a single trial. The boxes indicate 35 stimuli in random order. The 5 original images, as a group, are inserted in the front end, middle and back end of each trial in random order.

Fig. 7. Illustration of the impact of stimulus repetition on fixation behaviour. When viewing 7 distorted versions of the same scene, the similarity in fixations (measured by AUC) relative to its original decreases as the viewing order increases. The error bars indicate a 95% confidence interval.

Fig. 9. (a) Illustration of all distorted versions of a source image (of a large degree of saliency dispersion) and their corresponding gaze maps. The same layout of distorted images and gaze maps for a different source image (of a small degree of saliency dispersion) is illustrated in (b).

Fig. 10. Illustration of rankings of five distortion types contained in our database in terms of the SS-DSS similarity measured by CC, NSS and AUC, respectively. The error bars indicate a 95% confidence interval.

Fig. 11. The measured SS-DSS similarity in terms of CC, NSS and AUC for images of different perceived quality. The error bars indicate a 95% confidence interval.

Fig. 13. Comparison of performance gain between SS-based and DSS-based OQMs, with the effect of (a) distortion type dependency, (b) perceived quality level dependency and (c) saliency dispersion degree dependency. The error bars indicate a 95% confidence interval.

Fig. 14. Illustration of modelled saliency maps generated by twenty-seven saliency models for one of the source images (a) and one of its distorted versions (i.e., JPEG, DMOS=90.43) (b) in our database.

Fig. 15. Illustration of the predictive power for 27 saliency models. Modelled SS is evaluated against ground truth SS. Modelled DSS is evaluated against ground truth DSS. The error bars indicate a 95% confidence interval.

Fig. 16. Comparison of performance gain between modelled SS-based and modelled DSS-based OQMs, with the effect of (a) distortion type dependency, (b) perceived quality level dependency and (c) saliency dispersion degree dependency. The error bars indicate a 95% confidence interval.
Toward a Reliable Collection of Eye-Tracking Data for Image Quality Research: Challenges, Solutions, and Applications Wei Zhang, Student Member, IEEE, and Hantao Liu, Member, IEEE

TABLE I
RESULTS OF THE ANOVA TO EVALUATE THE IMPACT OF DISTORTION TYPE, DISTORTION LEVEL AND IMAGE CONTENT ON THE MEASURED SIMILARITY BETWEEN SS AND DSS. df DENOTES DEGREE OF FREEDOM, F DENOTES F-RATIO AND Sig DENOTES THE SIGNIFICANCE LEVEL

TABLE II
PERFORMANCE FOR 10 OQMs (CC WITHOUT NON-LINEAR FITTING) AND THEIR CORRESPONDING SALIENCY-BASED VERSIONS ON OUR DATABASE WITH 270 DISTORTED STIMULI

TABLE III
RESULTS OF STATISTICAL SIGNIFICANCE TESTING FOR INDIVIDUAL OQMs. "1" MEANS THAT THE DIFFERENCE IN PERFORMANCE IS STATISTICALLY SIGNIFICANT WITH P<0.05 AT THE 95% CONFIDENCE LEVEL. "0" MEANS THAT THE DIFFERENCE IS NOT SIGNIFICANT

TABLE IV
PERFORMANCE OF 10 OQMs (CC WITHOUT NON-LINEAR FITTING) AND THEIR CORRESPONDING SALIENCY-BASED VERSIONS ON LIVE DATABASE WITH 779 DISTORTED STIMULI. NOTE THAT CC IS AVERAGED OVER ALL SALIENCY MODELS