Adding Neurally-inspired Mechanisms to the SceneWalk model improves Scan Path Predictions for Natural Images

The selection of fixation locations during natural scene viewing depends in large part on image-dependent and observer-dependent factors. However, eye movement data from different images, viewers, and experimental designs also consistently contain systematic tendencies such as pronounced saccade angle distributions, return saccade statistics, and dependencies of these measures on fixation duration. When modelling complete human scan paths during extended natural image viewing, these systematic tendencies are critical. The SceneWalk model (Engbert et al., 2015) incorporates image-dependent information through saliency maps and uses attentional processing and inhibitory tagging mechanisms to dynamically generate scan paths. Currently, scan paths simulated with this approach only partially reproduce observed systematic tendencies. Here we propose adding several neurally-inspired mechanisms to the model to improve performance: pre-saccadic and post-saccadic attentional shifts as well as facilitation of return mechanisms. These mechanisms are well-established both in experiments and neurocognitive theories of vision. We find that this extension improves the model to generate scan paths which are in qualitative agreement with empirical data. As the model is firmly theory-based, all parameters are biologically interpretable and thus permit evaluations of theoretical predictions of behavior. We also discuss a fully Bayesian framework using adaptive Markov Chain Monte Carlo methods.


Background
The human visual system depends crucially on the ability to move the eyes over a scene. As only a small central region of the visual field, the fovea, receives detailed high-resolution input, humans scan their visual environment in a series of fast jumps (saccades) and periods of relative motionlessness (fixations). The sequence of fixation locations chosen as gaze positions is the subject of extensive research, as it permits insight into theories of visuomotor control and their impact on visual perception.
Eye movements are guided by a variety of different mechanisms. Firstly, the image itself contains regions which are inherently more informative than others. For example, objects attract fixations (Nuthmann & Henderson, 2010). This finding and other observations have inspired a class of models which use saliency maps to predict which regions in the image are particularly interesting and therefore likely to be fixated (e.g., Itti, Koch, & Niebur, 1998; Kümmerer, Wallis, & Bethge, 2016). The second category of mechanisms comprises observer-dependent top-down effects, stemming from task and motivation as well as individual differences (de Haas, Iakovidis, Schwarzkopf, & Gegenfurtner, 2019). Thirdly, there exists a category of mechanisms which is stable over both observers and images (Tatler, Vincent, et al., 2008). Examples of these are the central fixation bias, the distribution of inter-saccadic angles, and dependencies between saccade length and fixation duration.

SceneWalk
The SceneWalk model (Engbert, Trukenbrod, Barthelme, & Wichmann, 2015) uses attentional mechanisms coupled with inhibitory tagging to dynamically generate scan paths. Both streams are motivated by well-documented findings from the field of visual perception research. Attention is guided by image-dependent information and the foveated nature of the input (Engbert et al., 2015). Inhibitory tagging of previously fixated regions promotes image exploration (Mirpour, Bolandnazar, & Bisley, 2019).
In the model, the two streams exist as independent 2D activation maps which evolve over time and are later combined to form a target map from which fixations are selected probabilistically. We compute a Gaussian G_{A/F} centered around the current fixation position for each stream (A = attention map, F = fixation map/inhibitory tagging). Both streams are implemented on an L × L lattice and evolve via coupled differential equations. In these equations, S_{ij} is a saliency map of the image, for which we can use a map generated by another model or, in this case, the empirical fixation density on the image.
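The evolution of the two maps can be illustrated with a minimal sketch. The exact differential equations are not reproduced in this text, so the form below is an assumption: each map relaxes at its own rate toward a normalised input, the Gaussian-weighted saliency for attention and the plain Gaussian for inhibition. The names `gaussian_map`, `euler_step`, `sigma_A`, `sigma_F`, `omega_F`, and `dt` are hypothetical and chosen for illustration only.

```python
import numpy as np

def gaussian_map(L, center, sigma):
    """Unnormalised 2D Gaussian on an L x L lattice centred at (row, col)."""
    rows, cols = np.mgrid[0:L, 0:L]
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def euler_step(A, F, S, fix, sigma_A, sigma_F, omega_A, omega_F, dt):
    """One explicit Euler step of the ASSUMED map dynamics.

    Assumption (not stated in the text): dA/dt = omega_A * (input_A - A) and
    dF/dt = omega_F * (input_F - F), with both inputs normalised to sum to 1.
    """
    L = A.shape[0]
    input_A = gaussian_map(L, fix, sigma_A) * S   # attention sees saliency
    input_A = input_A / input_A.sum()
    input_F = gaussian_map(L, fix, sigma_F)       # inhibition ignores saliency
    input_F = input_F / input_F.sum()
    A_new = A + dt * omega_A * (input_A - A)
    F_new = F + dt * omega_F * (input_F - F)
    return A_new, F_new
```

With normalised initial maps, this update keeps each map summing to one, which makes the later combination step easier to reason about.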
The two pathways are each shaped by an exponent, λ or γ, respectively. Then we subtract the weighted (C_F) inhibition path from the attention path.
As this operation can cause negative activation, in the next step we take only the positive component of the map, and finally add noise (ζ).
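The combination steps just described (exponents, weighted subtraction, positive part, noise) can be sketched as follows. Two details are assumptions rather than statements from the text: each powered map is normalised before the subtraction, and the noise ζ enters as a mixture with a uniform map. The function names `target_map` and `sample_fixation` are hypothetical.

```python
import numpy as np

def target_map(A, F, lam, gamma, C_F, zeta):
    """Combine attention map A and inhibition map F into a probability map."""
    A_pow = A ** lam                      # shape attention by exponent lambda
    F_pow = F ** gamma                    # shape inhibition by exponent gamma
    # Assumption: normalise each stream before subtracting.
    u = A_pow / A_pow.sum() - C_F * F_pow / F_pow.sum()
    u_pos = np.maximum(u, 0.0)            # keep only the positive component
    total = u_pos.sum()
    if total > 0:
        u_pos = u_pos / total
    else:                                 # fully inhibited map: fall back to uniform
        u_pos = np.full_like(u_pos, 1.0 / u_pos.size)
    # Assumption: noise zeta is a mixture with a uniform distribution.
    return (1.0 - zeta) * u_pos + zeta / u_pos.size

def sample_fixation(pi, rng):
    """Draw one lattice cell (row, col) with probability pi[row, col]."""
    idx = rng.choice(pi.size, p=pi.ravel())
    return np.unravel_index(idx, pi.shape)
```

Under these assumptions the returned map is strictly positive whenever ζ > 0, so every lattice cell has a finite log-probability.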
As the SceneWalk model implements concrete theory-based mechanisms, the parameters of the model have clear biological interpretations.

The Challenge: Systematic Tendencies
The performance of a scan path model can be quantified by the likelihood of empirical data given the model. In our model, the target map π(i, j) for fixation selection can be used to directly read out the fixation likelihood for an upcoming experimentally observed fixation. In addition to likelihood-based inference, however, it is important to evaluate how well the model-generated data compare to the experimental data with respect to the empirically observed effects.
Fixation behavior produced by the SceneWalk model already resembles empirical scan paths on several important metrics such as the saccade amplitude distributions (see Fig. 1, top) or more complicated statistics like the pair correlation function of fixation locations (Engbert et al., 2015). Other systematic tendencies, however, are currently not reproduced by the model.
An example of an important statistic not reproduced by the SceneWalk model is the angle distribution of subsequent saccades (see Fig. 1, middle). The characteristic "W"-shape of the empirical data shows that saccades are more likely to either continue in the same direction as the previous saccade or return in the direction of origin than to continue in any other direction. Not surprisingly, neither the SceneWalk model nor density sampling or homogeneous point processes capture this dynamic of saccades. The same is true of the relationship between fixation duration and change in saccade direction (Fig. 1, bottom).
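Reading the likelihood off the target map can be sketched in a few lines. The sketch assumes that one normalised target map is available for each observed fixation, evaluated at the moment that fixation was selected; the function name `scanpath_log_likelihood` and the (row, col) indexing are hypothetical.

```python
import numpy as np

def scanpath_log_likelihood(target_maps, fixations):
    """Sum of log pi(i, j) over the observed fixations of one scan path.

    target_maps: iterable of 2D arrays, each summing to 1.
    fixations:   iterable of (i, j) lattice indices, one per target map.
    """
    log_lik = 0.0
    for pi, (i, j) in zip(target_maps, fixations):
        log_lik += np.log(pi[i, j])
    return log_lik
```

Summing log-probabilities rather than multiplying probabilities avoids numerical underflow for long scan paths.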
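The saccade angle statistic discussed above can be computed from a sequence of fixation positions as the change in direction between consecutive saccades. The following is a minimal sketch; the function name `saccade_turn_angles` and the convention that 0° means "same direction" and ±180° means "return" are assumptions made for illustration.

```python
import numpy as np

def saccade_turn_angles(fix_xy):
    """Change in saccade direction (degrees) between consecutive saccades.

    fix_xy: sequence of (x, y) fixation positions of one scan path.
    Returns one angle per pair of consecutive saccades, wrapped to [-180, 180).
    """
    fix_xy = np.asarray(fix_xy, dtype=float)
    vectors = np.diff(fix_xy, axis=0)                    # saccade vectors
    heading = np.arctan2(vectors[:, 1], vectors[:, 0])   # absolute direction
    turn = np.diff(heading)                              # relative direction
    turn = (turn + np.pi) % (2.0 * np.pi) - np.pi        # wrap around the circle
    return np.degrees(turn)
```

A histogram of these angles, for example `np.histogram(angles, bins=36, range=(-180, 180), density=True)`, would show the "W"-shape as peaks near 0° and ±180°.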
Thus, these tendencies are caused by mechanisms present in the visual system, but not implemented in previous versions of our model. Adding new mechanisms to the model can significantly improve the agreement between simulated and experimental data, as shown by the successful addition of a Central Fixation Bias mechanism to the model. The following section will expand on how we used previously proposed features of visuomotor control and attention to motivate the update of the SceneWalk model.
Figure 1: The figure outlines three systematic effects found in eye movements: saccade amplitude distribution, saccade angle distribution, and the relationship of fixation durations and saccade angles. We compare empirical scan paths to simple sampling from a density, a homogeneous process, and the local saliency as implemented in the SceneWalk model.

Extending SceneWalk
The existing literature includes evidence for attentional shifts directly preceding saccades (Deubel & Schneider, 1996) as well as attentional remapping immediately following saccades (Golomb, Chun, & Mazer, 2008). There are also indications that, in addition to inhibition of return, there is facilitation of return in certain time frames (e.g., Smith & Henderson, 2009). The proposed extensions of the SceneWalk model are based on splitting each fixation into three distinct phases of attention and saccade control:
• The fixation phase is implemented exactly as in the original SceneWalk model. Effects of the previous saccade are important at the beginning of this phase, and the upcoming fixation location is selected at its end.
• The pre-saccadic phase begins shortly before saccade onset. The attention Gaussian precedes the eye movement to the next fixation location while the inhibition Gaussian remains centered around the current position.
• The post-saccadic phase immediately follows the saccade. During this phase the attention Gaussian is shifted in the direction of the saccade, emulating a retinal remapping system.
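The three phases listed above differ mainly in where the two Gaussians are centred. The sketch below makes this explicit under stated assumptions: the size of the post-saccadic shift is not given in the text, so the hypothetical parameter `shift_gain` scales the last saccade vector, and the inhibition Gaussian is assumed to sit at the current fixation in every phase. The function name `gaussian_centers` is likewise hypothetical.

```python
import numpy as np

def gaussian_centers(phase, current, upcoming=None, previous=None, shift_gain=0.5):
    """Return (attention_center, inhibition_center) for one phase.

    phase:    "fixation", "pre" (pre-saccadic) or "post" (post-saccadic).
    current:  position of the current fixation.
    upcoming: already selected next fixation (needed for "pre").
    previous: fixation before the current one (needed for "post").
    """
    current = np.asarray(current, dtype=float)
    if phase == "fixation":
        attention = current
    elif phase == "pre":
        # Attention precedes the eye to the selected target.
        attention = np.asarray(upcoming, dtype=float)
    elif phase == "post":
        # Attention overshoots along the direction of the last saccade.
        saccade = current - np.asarray(previous, dtype=float)
        attention = current + shift_gain * saccade
    else:
        raise ValueError(f"unknown phase: {phase}")
    # Assumption: inhibition stays centred on the current fixation throughout.
    return attention, current
```

Because the post-saccadic centre lies beyond the landing position along the saccade direction, activation is biased toward continuing in the same direction, which is one way forward saccades could become more likely.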
To enable facilitation of return saccades in the model, we assume that there is a prolonged activation in the attention map at recently fixated locations. Mathematically, we implemented a location-dependent decay of the attention map, where a small window around the previous fixation location decays more slowly (ω_shift) than the rest of the map on average (ω_A).
In the next section, we report some qualitative analyses of the consequences of these model modifications for scan path statistics.
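A location-dependent decay of this kind can be sketched by giving every lattice cell its own decay rate. The square window shape, the half-width parameter `window`, the pure exponential decay without input, and the function names are assumptions made for illustration; only ω_A and ω_shift come from the text.

```python
import numpy as np

def decay_rate_map(L, prev_fix, window, omega_A, omega_shift):
    """Per-cell decay rates: omega_shift near prev_fix, omega_A elsewhere."""
    omega = np.full((L, L), omega_A, dtype=float)
    r, c = prev_fix
    r0, r1 = max(r - window, 0), min(r + window + 1, L)  # clip window to lattice
    c0, c1 = max(c - window, 0), min(c + window + 1, L)
    omega[r0:r1, c0:c1] = omega_shift
    return omega

def decay_step(A, omega, dt):
    """Exponential decay of the attention map over a time step dt."""
    return A * np.exp(-omega * dt)
```

With ω_shift smaller than ω_A, activation around the previous fixation outlasts activation elsewhere, leaving a residual peak that can attract a return saccade.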

Results
Using the extended SceneWalk model, we generate data and compare experimental scan paths from human participants with model-simulated data. As shown in Fig. 3, the model-simulated scan paths now qualitatively reproduce the shape of the saccade angle distribution. Furthermore, for the complex relationship between fixation durations and saccade angles we observe good qualitative agreement between experimental and simulated data (Fig. 4).
Our results lend support to the idea that pre- and post-saccadic attention shifts are responsible for some of the dynamics found in eye movement data. Thus, we find that neurally-inspired mechanisms are highly compatible with scan path generation when implemented within our dynamical framework of the SceneWalk model.

Outlook: Likelihood-based Parameter Inference
We set out to implement neurally-inspired visuomotor control principles to improve a model of scan path generation. Results reported here suggest that, with the modifications, the SceneWalk model can be improved to include important systematic tendencies of eye-movement behavior. Ongoing work focuses on likelihood-based parameter inference for the extended version of the SceneWalk model.
The SceneWalk model generates a continuous-time evolution of a target map for upcoming fixations. For likelihood-based parameter inference, this target map provides an efficient tool to compute the likelihood for experimentally observed fixation sequences. Thus, the likelihood can be computed numerically without approximation. Models with a computable likelihood function have two considerable advantages. First, it is straightforward to estimate model parameters by maximizing the likelihood of the model given some empirical data. Moreover, since the model is implemented efficiently, the likelihood opens the door to a fully Bayesian framework. Using a Differential Evolution Adaptive Metropolis algorithm (Laloy & Vrugt, 2012), we recently obtained pilot results for the improved version of the model. Secondly, models with a likelihood function are more easily compared to competing frameworks, as comparisons do not have to rely on ad-hoc performance metrics that are, in most cases, motivated by experimental research but lack statistical rigor.
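To illustrate how a computable likelihood enables Bayesian inference, the sketch below uses a plain random-walk Metropolis sampler. This is a deliberately simpler stand-in and not the Differential Evolution Adaptive Metropolis algorithm mentioned above. The callable `log_post` is a hypothetical placeholder for the model's log-likelihood plus a log-prior, and `step` is a hypothetical proposal width.

```python
import numpy as np

def metropolis(log_post, theta0, n_samples, step, rng):
    """Random-walk Metropolis sampler (simplified stand-in, not DREAM).

    log_post: function mapping a 1D parameter vector to an unnormalised
              log posterior (log-likelihood + log-prior).
    Returns an array of shape (n_samples, n_parameters).
    """
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    lp = log_post(theta)
    chain = np.empty((n_samples, theta.size))
    for t in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.size)
        lp_prop = log_post(proposal)
        # Accept with probability min(1, exp(lp_prop - lp)).
        if np.log(rng.random()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
        chain[t] = theta
    return chain
```

In a real application, each call to `log_post` would simulate the target map along every observed scan path and sum the log-probabilities of the observed fixations, which is why an efficient model implementation matters.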
Finally, estimated parameters can then be fed back into the model to simulate data on the level of individual observers. The fit between simulated and experimental data will shed light on the dynamical system that produces fixation behavior, including interindividual differences in fixation behavior (de Haas et al., 2019) and the underlying visuomotor mechanisms.