DeepGaze III: Using Deep Learning to Probe Interactions Between Scene Content and Scanpath History in Fixation Selection

Many animals make eye movements to gather relevant visual information about their environment. How fixation locations are selected has been debated for decades in neuroscience and psychology. One hypothesis states that “priority” or “saliency” values are assigned locally to image locations, independent of saccade history, and are only later combined with saccade history and other constraints to select the next fixation location. A second hypothesis is that there are interactions between saccade history and image content that cannot be summarised by a single value. Here we discriminate between these possibilities in a data-driven manner. Using transfer learning from the VGG deep neural network, we train a scanpath prediction model, “DeepGaze III”, on human free-viewing scanpath data. DeepGaze III can either be forced to use a single saliency map or can be allowed to learn complex interactions via multiple saliency maps. We find that using multiple saliency maps gives no advantage in scanpath prediction compared to a single saliency map. This suggests that – at least for free-viewing – no complex interactions between scene content and scanpath history exist, and that a single saliency map, independent of both current and previous gaze locations, may exist.


Introduction
How humans explore their visual environment has attracted research for many decades. A long-standing theory in the field of gaze prediction posits the existence of an image-dependent saliency map which is combined with task information and scanpath history to decide on the target of the next saccade. Different locations have been proposed for where such a map might be implemented in the brain, including V1 (Zhang, Zhaoping, Zhou, & Fang, 2012) and the superior colliculus. Starting with the Feature Integration Theory implemented in the seminal model by Itti, Koch, and Niebur (1998), many models have proposed different ideas about how such a saliency map might be computed. The last decades have seen great growth in the number and performance of models predicting the spatial fixation distribution, with the current state of the art being our model "DeepGaze II" (Kümmerer, Wallis, Gatys, & Bethge, 2017) according to the influential MIT Saliency Benchmark (saliency.mit.edu).
However, the saliency map hypothesis puts strong constraints on how fixations are selected. Interactions between saccade history and image content that cannot be summarised by a single value are not allowed. For example, if different image features drive the next fixation after long saccades than after short saccades, then it is impossible to assign a single saliency value to image locations.
In order to discriminate between these possibilities, here we move from predicting spatial fixation distributions to predicting sequences of fixations. We do so by extending our previous model DeepGaze II to predict fixation locations depending on where a subject fixated before.

Model
In Figure 1b we show the architecture of DeepGaze III. DeepGaze III first encodes image content and scanpath history into spatial feature maps. The image content is encoded via deep VGG features (Simonyan & Zisserman, 2014). The scanpath history is encoded via feature maps that contain, for each included previous fixation, the Euclidean distance as well as the differences in x and y coordinates to that fixation. These feature maps are then processed by a neural network using only 1 × 1 convolutions. This neural network is split into a purely image-dependent saliency network that computes one or multiple saliency maps, a purely scanpath-dependent scanpath network, and a final fixation selection network that combines the outputs of the previous networks. The fixation selection network outputs a single feature map that is subsequently blurred, combined with a center bias, and fed through a softmax to yield the final conditional fixation density for the next fixation given the previous fixations. We train DeepGaze III on the MIT1003 dataset (scanpaths from 15 human observers, 1003 images, 3 seconds free-viewing; Judd, Ehinger, Durand, & Torralba, 2009) using maximum-likelihood training via gradient descent and tenfold cross-validation to avoid overfitting.
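The forward pass described above can be illustrated with a minimal numpy sketch. All weights, channel counts, and the interface below are hypothetical placeholders for illustration only; the actual model uses learned parameters, VGG features, and a learned blur that are omitted here for brevity.

```python
import numpy as np

def conv1x1(x, w, b):
    """Pointwise (1x1) convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x) + b[:, None, None]

def scanpath_maps(fix_y, fix_x, H, W):
    """Encode one previous fixation as three spatial feature maps:
    Euclidean distance, dy, and dx from every pixel to that fixation."""
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    dy, dx = ys - fix_y, xs - fix_x
    return np.stack([np.hypot(dy, dx), dy, dx])

def fixation_density(vgg_feats, last_fix, center_bias, rng):
    """Toy forward pass: saliency network (image only), scanpath network
    (history only), fixation-selection network combining both, then
    center bias and softmax. Weights are random placeholders; the blur
    stage of the real model is omitted."""
    C, H, W = vgg_feats.shape
    sal = conv1x1(vgg_feats, rng.normal(size=(1, C)), np.zeros(1))    # one saliency map
    sp = scanpath_maps(*last_fix, H, W)
    sp_out = conv1x1(sp, rng.normal(size=(4, 3)), np.zeros(4))        # scanpath network
    combined = np.concatenate([sal, sp_out])                          # (5, H, W)
    out = conv1x1(combined, rng.normal(size=(1, 5)), np.zeros(1))[0]  # fixation selection
    logits = out + np.log(center_bias)
    e = np.exp(logits - logits.max())
    return e / e.sum()  # conditional density over pixels, sums to 1
```

Because the softmax normalises over the whole image, the output is directly a probability distribution for the next fixation location, which is what makes maximum-likelihood training straightforward.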
In Figure 2a we show fixation densities as predicted by the model for an example stimulus and different scanpath histories. It can be seen that the model prediction strongly depends on the scanpath history: the model favors locations close to the last fixation.
In Figure 2b we test how well DeepGaze III reproduces key properties of human scanpaths, for example a very specific distribution of saccade lengths and a tendency to favor horizontal saccades over vertical saccades and vertical saccades over diagonal saccades. To this end, we sampled new scanpaths from the model and compared these statistics between the empirical data and the sampled data in Figure 2b. All properties are better reproduced by DeepGaze III than by other models.
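Since the model outputs a full conditional density, sampling new scanpaths reduces to repeated draws from that density. A sketch of this procedure, where `density_fn` is a hypothetical stand-in for the trained model's conditional density:

```python
import numpy as np

def sample_scanpath(density_fn, image, n_fixations, rng):
    """Draw a scanpath autoregressively: each fixation is sampled from
    the model's conditional density given the history so far."""
    H, W = image.shape[:2]
    history = []
    for _ in range(n_fixations):
        p = density_fn(image, history)      # (H, W) density, sums to 1
        idx = rng.choice(H * W, p=p.ravel())
        history.append((idx // W, idx % W))
    return history

def saccade_stats(scanpath):
    """Amplitudes and directions of the saccades in a scanpath, for
    comparison against the corresponding empirical distributions."""
    fix = np.asarray(scanpath, dtype=float)
    d = np.diff(fix, axis=0)                # (n-1, 2) steps in (y, x)
    lengths = np.hypot(d[:, 0], d[:, 1])
    angles = np.degrees(np.arctan2(d[:, 0], d[:, 1]))
    return lengths, angles
```

Histograms of the sampled `lengths` and `angles` can then be compared directly against those computed from the human data.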

Evidence for a Spatiotopic Free-viewing Saliency Map
All results presented above use only one single saliency map as output of the saliency network, as stated by the saliency map hypothesis in the abstract. In order to collect evidence for or against that hypothesis, we trained additional versions of DeepGaze III in which the saliency network computes multiple saliency maps (Figure 1b, dashed feature map). Figure 3 shows that all models achieve very similar performance. This rules out more complicated interactions between image content and scanpath history, such as the ones exemplified in the introduction, and provides some evidence for the existence of a spatiotopic saliency map for free-viewing.
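Architecturally, this model comparison amounts to changing a single channel count. The sketch below (hypothetical shapes and random placeholder weights) shows a two-layer pointwise fixation-selection network combining K saliency maps with scanpath features: with K = 1, image content is bottlenecked to one value per pixel before it can interact with the scanpath history, while K > 1 allows the readout to weight different saliency maps depending on the scanpath features.

```python
import numpy as np

def fixation_selection(sal, sp, w1, b1, w2, b2):
    """Two-layer 1x1-conv network combining K saliency maps (sal) with
    S scanpath maps (sp). The nonlinearity lets K > 1 saliency maps
    interact with scanpath features; with K = 1 only a single saliency
    value per pixel enters the interaction."""
    x = np.concatenate([sal, sp])                                       # (K + S, H, W)
    h = np.maximum(0, np.einsum('oc,chw->ohw', w1, x) + b1[:, None, None])
    return np.einsum('oc,chw->ohw', w2, h)[0] + b2                      # (H, W) priority map

rng = np.random.default_rng(2)
K, S, H, W, hidden = 2, 3, 6, 7, 8       # K = 1 yields the constrained model
sal = rng.normal(size=(K, H, W))
sp = rng.normal(size=(S, H, W))
w1 = rng.normal(size=(hidden, K + S)); b1 = np.zeros(hidden)
w2 = rng.normal(size=(1, hidden)); b2 = 0.0
priority = fixation_selection(sal, sp, w1, b1, w2, b2)
```

Since everything else in the two model variants is identical, a performance difference could only come from the extra saliency channels, which is what makes the comparison interpretable.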
Figure 3: Whether DeepGaze III can use one or multiple saliency maps doesn't affect performance: complex interactions between scanpath history and image content don't seem to play a relevant role in fixation selection, providing some evidence for the existence of "the saliency map" of an image.

One might argue that our model is limited by the fact that it is not foveated. A retinotopic saliency map could show up in our model as multiple saliency maps, one that is used for the fovea and others that are used for the periphery. However, since we don't find evidence against the even stronger hypothesis of a spatiotopic saliency map, this doesn't affect our results.
Since we find that the saliency map has strong high-level components (faces, text), we expect that higher brain areas that are sensitive to these objects play an important role in computing this saliency map. The saliency map could be computed downstream from these areas, or these areas could feed back to earlier areas from which the saliency map is read out.

Discussion
The present work applies deep learning to learn a probabilistic model of free-viewing human scanpaths. Using this model, we probe interactions between scene content and recent scanpath history in fixation selection, and find that no interactions beyond a simple pixelwise saliency measure seem to exist.
The recent years have seen increasingly many applications of deep learning in neuroscience (Yamins et al., 2014; Hong, Yamins, Majaj, & DiCarlo, 2016; Jozwik, Kriegeskorte, Storrs, & Mur, 2017). Deep learning models as such are black boxes, and it is hard to understand what the models are actually learning. For this reason, their usefulness in neuroscience is often questioned. In some cases this critique might be justified: even more than for classical models, it is not enough for deep learning models to just predict the data well. Good prediction performance is merely a necessary prerequisite for being able to draw scientific conclusions. We want to argue that the present work showcases how deep learning can be applied in a way that tests a well-defined question and gives a clear answer: whether there are (on a functional level) interactions between scene content and scanpath history that cannot be described by a simple pixelwise saliency measure.
In order to answer this question by model comparison, there are two important factors. Firstly, the model that uses a simple pixelwise saliency measure has to be powerful enough to not be penalized simply due to the fact that it cannot learn a sufficiently good saliency measure. Secondly, the model that uses more complicated interactions has to be able to learn quite general and arbitrary interactions. If the first model is not able to extract a good saliency measure, the second model might perform better simply because it "misuses" parts of its architecture intended for interaction modeling to learn a better saliency measure, although there are no interactions. If the second model is too limited, it might just not be able to pick up existing interactions.
The deep learning based model architecture presented here is designed to circumvent exactly those problems. The architecture provides good modeling power in the form of DNNs to most parts of the model and only controls whether the model can use interactions beyond a simple saliency measure.