Visual backward masking: Modeling spatial and temporal aspects

In modeling visual backward masking, the focus has been on temporal effects. More specifically, an explanation has been sought as to why strongest masking can occur when the mask is delayed with respect to the target. Although interesting effects of the spatial layout of the mask have been found, only a few attempts have been made to model these phenomena. Here, we elaborate a structurally simple model which employs lateral excitation and inhibition together with different neural time scales to explain many spatial and temporal aspects of backward masking. We argue that for better understanding of visual masking, it is vitally important to consider the interplay of spatial and temporal factors together in one single model.


VISUAL BACKWARD MASKING: MODELING SPATIAL AND TEMPORAL ASPECTS
In visual backward masking, a target stimulus is followed by a mask, which impairs performance on the target. Although visual masking is often used as a tool in cognitive and behavioral sciences, its underlying mechanisms are still not well understood. The focus of masking research has been on understanding how it is possible that, for some combinations of target and mask, a delay of the mask yields stronger masking than having the mask immediately follow the target. This phenomenon is known as 'B-type masking' or 'U-shape masking,' of which the latter refers to the shape of the curve linking stimulus onset asynchrony (SOA) between the target and the mask to performance.
Explanations of B-type masking are either based on a single process (e.g., Anbar & Anbar, 1982; Bridgeman, 1978; Francis, 1997) or on a combination of two processes (e.g., Neumann & Scharlau, in press; Reeves, 1986). Most models which use a single process apply a mechanism which was termed 'mask blocking' by Francis (2000). The basic idea of this mechanism is that a relatively strong target can block the mask's signal at short SOAs, but fails to do so at intermediate SOAs due to the decaying trace of the target. The two-process theories assume that the U-shape curve in B-type masking actually consists of two parts, both of which are monotonic. The two underlying processes might relate to the accounts of 'integration' and 'interruption' masking (Scheerer, 1973), or to 'peripheral' and 'central' processes (Turvey, 1973).
While the focus of visual backward masking has been on temporal aspects, the effects of the spatial layout of the target and the mask have received much less interest (but see Hellige, Walsh, Lawrence, & Prasse, 1979; Kolers, 1962).
If spatial aspects were investigated, they mainly involved low-level aspects, such as the spatial distance between the target and the mask, and the spatial frequencies of the stimuli. Recently, Herzog and colleagues (Herzog, Schmonsees, & Fahle, 2003a, b; Herzog & Fahle, 2002; Herzog, Harms, Ernst, Eurich, Mahmud, & Fahle, 2003c) started to investigate the effects of the spatial layout of the mask systematically, while keeping the target (a vertical Vernier) constant.
http://www.ac-psych.org Frouke Hermens and Udo Ernst
Even though the mask consisted of simple bar elements only, slight changes in the layout of these elements resulted in large differences in masking strengths. For example, adding two collinear lines to a grating mask strongly impaired performance on the Vernier target (Herzog, Schmonsees, & Fahle, 2003a).
Only a few modeling attempts have been made to explain spatial aspects of visual masking. The aspects that were modeled include the effect of the distance of the mask to the target (modeled by Breitmeyer & Ganz, 1976; Bridgeman, 1971; Francis, 1997), and the distribution of the mask's contour (modeled by Francis, 1997).
Several of the existing masking models (Anbar & Anbar, 1982; Di Lollo, Enns, & Rensink, 2000; Weisstein, 1968) are constructed in such a way that they cannot account for spatial aspects of the target and the mask.
Here, we describe a structurally simple model that can explain several spatial aspects of visual backward masking as well as temporal aspects. The model we use is inspired by the basic structures found in the visual cortex, with excitatory and inhibitory neurons driven by feed-forward input, and exchanging action potentials via recurrent horizontal interactions. We describe neural activity in terms of population firing rates, whose dynamics are similar to the classical Wilson-Cowan differential equations (Wilson & Cowan, 1973) for spatially extended populations. Here, we will present new simulations of the effects of a shift of the mask either in space or time, embedded in an overview of results presented earlier by Herzog et al. (Herzog, Ernst, Etzold, & Eurich, 2003; Herzog, Harms et al., 2003c).

SETUP OF THE MODEL
The general structure of our model is illustrated in Figure 1. The input I(x,t) is filtered by a Mexican hat kernel and fed into an excitatory and an inhibitory layer. The activation of both layers is updated over time, where activation from both layers is mutually exchanged via the coupling kernels W_e and W_i. The activation dynamics of the model are determined by two coupled partial differential equations for the firing rates of neuronal populations, originally introduced by Wilson and Cowan (1973). We modified the original equations in order to match more recent work (Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Ernst, Pawelzik, Sahar-Pikielny, & Tsodyks, 2001). In these equations, τ_e and τ_i denote time constants, and w_ee, w_ie, w_ei, and w_ii are weighting coefficients for the interactions. x denotes the position of the neuronal population in the corresponding layer, and t denotes time. We assume an approximate retinotopical mapping of the visual input onto the cortical layer, such that x also describes position in the visual field.
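The structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a one-dimensional layer, forward Euler integration, simple rectification as the nonlinearity, Gaussian coupling kernels, a difference of two Gaussians as the Mexican hat input filter, the same filtered input for both layers, and inhibition entering with an explicit minus sign. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian(sigma, half_width):
    """Normalized 1-D Gaussian sampled at integer positions."""
    x = np.arange(-half_width, half_width + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def make_stimulus(n_x=101, n_steps=400, target_steps=20, n_elements=25, spacing=2):
    """I(x, t): one central bar (the target), followed by a grating mask."""
    stim = np.zeros((n_steps, n_x))
    center = n_x // 2
    stim[:target_steps, center] = 1.0
    offsets = (np.arange(n_elements) - n_elements // 2) * spacing
    stim[target_steps:, center + offsets] = 1.0
    return stim


def simulate(stim, dt=0.1, tau_e=1.0, tau_i=2.0,
             w_ee=0.5, w_ie=1.0, w_ei=0.8, w_ii=0.2):
    """Euler integration of two coupled rate equations (assumed form):

    tau_e dA_e/dt = -A_e + [w_ee (W_e * A_e) - w_ie (W_i * A_i) + F]_+
    tau_i dA_i/dt = -A_i + [w_ei (W_e * A_e) - w_ii (W_i * A_i) + F]_+

    where * is spatial convolution and F is the filtered input.
    Returns the excitatory activation with shape (time, position).
    """
    mexican_hat = gaussian(1.0, 12) - gaussian(3.0, 12)  # input filter
    W_e, W_i = gaussian(2.0, 12), gaussian(4.0, 12)      # coupling kernels

    def conv(a, k):
        return np.convolve(a, k, mode="same")

    A_e = np.zeros(stim.shape[1])
    A_i = np.zeros(stim.shape[1])
    history = np.empty_like(stim)
    for t, frame in enumerate(stim):
        F = conv(frame, mexican_hat)
        exc, inh = conv(A_e, W_e), conv(A_i, W_i)
        dA_e = (-A_e + np.maximum(w_ee * exc - w_ie * inh + F, 0.0)) / tau_e
        dA_i = (-A_i + np.maximum(w_ei * exc - w_ii * inh + F, 0.0)) / tau_i
        A_e, A_i = A_e + dt * dA_e, A_i + dt * dA_i
        history[t] = A_e
    return history


# Excitatory activation for a small and a large grating mask.
act_small = simulate(make_stimulus(n_elements=5))
act_large = simulate(make_stimulus(n_elements=25))
```

Plotting `act_small` and `act_large` as images (time against position) gives activation maps of the kind discussed in the following sections. Whether this sketch reproduces any particular result depends on the illustrative parameters and is not guaranteed.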

Recurrent interaction between the layers is mod-
The feed-forward filtered input into both layers is computed by using an input kernel defined as a difference of …

… correct decisions with a 20 ms Vernier duration), and weakest for gratings with more than 11 elements (about 91% correct).
First, we will focus on an explanation of why the five-element grating yields stronger masking, while a larger mask (25 elements) yields weaker masking. Figure 2B …

Stimulus sequence (A) and simulation results (B) of data presented by Herzog et al. (2003). A Vernier target was masked by a field of light the size of either a five-element (left) or a 25-element grating (right). The model correctly predicts that the field the size of a five-element grating masks the Vernier much more strongly than the field the size of a 25-element grating, as indicated by the longer Vernier trace for the larger field in the center of the image of the network activation.

While in a previous publication we employed this quantitative procedure, in this review article we use only qualitative measures, such as predicting the peak performance in a specific condition, to evaluate the model's performance.

Uniform fields of light
In the previous paragraph, we saw that a grating of five elements masks a Vernier target much more strongly than a grating consisting of 25 elements. This … Figure 3B shows the activation over time for the two light masks. The pattern of results resembles that obtained for grating masks (Figure 2B). The small field of light suppresses the Vernier activation more strongly than the larger one.
Whether small fields of light mask more strongly than larger ones was experimentally investigated by Herzog, Harms et al. (2003c). Vernier offset discrimination thresholds indicated that the small light-field was indeed a stronger mask, although the difference in thresholds between the two mask sizes was not as large as for the grating masks. By using a function that linked network activation to thresholds (the 'linking hypothesis'), Herzog, Ernst et al. (2003) showed that the model could accurately predict the observed thresholds.

… and −2 from the Vernier, as illustrated in the right part of Figure 4A. This slight change in mask layout also resulted in a strong increase in the masking strength.

Irregularities in the mask
The simulation plots of Figure 4B show how we can understand the strong increase in masking strength by the introduction of the gaps or the double luminance elements into our model. The model is sensitive to irregularities in the grating, which yield high activations in the neuronal layers. As the activation induced by the gaps or by the elements with doubled luminance is close to the preceding Vernier activity, the decay of the Vernier activation will be faster, and thus predicted performance will be low.
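The qualitative comparison used here, a faster decay of the Vernier activation meaning lower predicted performance, can be made concrete by measuring how long the activation at the Vernier position stays above a criterion. The following is a minimal sketch under stated assumptions: `trace_duration` is a hypothetical helper, the criterion value is arbitrary, and the peak of the trace is assumed to belong to the target rather than to the mask. The two exponential decays are toy data, not model output.

```python
import numpy as np


def trace_duration(activity, position, threshold, dt=1.0):
    """Time at which the trace at `position` first drops below `threshold`
    after its peak. `activity` has shape (time, position)."""
    trace = np.asarray(activity, dtype=float)[:, position]
    peak = int(np.argmax(trace))
    if trace[peak] < threshold:
        return 0.0  # the trace never reaches the criterion
    below = np.flatnonzero(trace[peak:] < threshold)
    end = peak + below[0] if below.size else trace.size
    return end * dt


# Toy traces: a quickly and a slowly decaying activation at one position.
t = np.arange(50)
fast = np.exp(-t / 5.0)[:, None]
slow = np.exp(-t / 20.0)[:, None]
print(trace_duration(fast, 0, 0.5))  # 4.0
print(trace_duration(slow, 0, 0.5))  # 14.0
```

Under this measure, the shorter trace would correspond to the stronger mask. Mapping trace durations onto thresholds would additionally require a linking function, which is not specified here.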
The simulations with the mask with the two gaps show that not only the mask affects the target, but also the target affects the mask. The inner edges at the two gaps show weaker activation than the outer edges, which can be understood as resulting from stronger inhibition of the inner edges by the target than the outer edges. Said differently, the target forwardly masks the mask.
Masking is predicted to be slightly weaker for the mask with the two gaps than for the mask with …

This observer was presented with a sequence of a Vernier presented for 12 ms (the optimal duration for this observer), followed by a 25-element grating for 300 ms. The center of the grating was shifted from 0", via 400", 800", 1600", 2000", 2200", to 2300" (edge close to the Vernier position), as is illustrated in the left part of Figure 6. Otherwise, the experimental procedure was the same as in earlier demonstrations of the shine-through effect (e.g., Herzog, Harms et al., 2003c). The right part of Figure 6 shows the results. Thresholds start to rise at a shift of about 1600" (≈ ±8 elements offset), and reach a maximum for a shift of 2300" (≈ ±11.5 elements offset), where no threshold could be measured anymore.
The model was correct in predicting that thresholds increase with an increase in the shift of the mask. In addition, the model could well predict for which shift thresholds would strongly rise, which suggests that the model is correct in its assumption that the distance to the mask's nearest edge determines the masking strength.
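The relation between the shift of the grating and the distance from the Vernier to the nearest edge can be worked out with simple arithmetic. The sketch below assumes an element spacing of 200", which is inferred from the conversions given in the text (1600" ≈ 8 elements, 2300" ≈ 11.5 elements), and assumes that the edge of the mask coincides with its outermost element. The function name is hypothetical.

```python
ELEMENT_SPACING = 200.0  # arc sec; inferred from 1600" ~ 8 elements (assumption)
N_ELEMENTS = 25          # size of the grating mask


def nearest_edge_distance(shift, n_elements=N_ELEMENTS, spacing=ELEMENT_SPACING):
    """Distance (arc sec) from the Vernier at position 0 to the nearest
    outermost element of a grating whose center is displaced by `shift`."""
    half_width = (n_elements - 1) / 2 * spacing  # center to outermost element
    return abs(half_width - abs(shift))


for shift in (0, 400, 800, 1600, 2000, 2200, 2300):
    print(shift, nearest_edge_distance(shift))
# 0 -> 2400.0, 1600 -> 800.0, 2300 -> 100.0
```

Under these assumptions, thresholds begin to rise when the nearest edge is about 800" (four elements) from the Vernier, and no threshold can be measured when the edge is about 100" away.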

Alternative explanations
Of the existing models of masking, only a few are implemented in such a way that spatial information about the target and the mask can be coded (Bridgeman, 1978; Francis, 1997; Öğmen, 1993). Other computational models represent target and mask in single neurons (Anbar & Anbar, 1982; Di Lollo et al., 2000; Weisstein, 1968), an approach which does not allow spatial information to enter the model system.
Of the models that can code for spatial properties, only the model by Bridgeman can easily be implemented. The remaining two models (Francis, 1997; Öğmen, 1993) involve many stages and complex processing. For example, the model by Francis (1997), which is based on the boundary contour system (Grossberg & Mingolla, 1985), consists of six layers with many complex interactions. To simulate these models, one would probably need the help of the authors to understand the full details and implement them correctly. Moreover, these models often have to be simplified before simulations can be performed. Due to these restrictions, we will only present simulation results of Bridgeman's model here.

Figure 6. The sequence of Vernier and mask (left) and Vernier offset discrimination thresholds for observer FH (right) as a function of the size of the shift of the center of the grating mask. Axes: center shift (arc sec) against threshold (arc sec). The data confirm the model's prediction that a close edge yields strong inhibition of the Vernier's signal, reflected in higher offset discrimination thresholds.

Figure 7. Cell activations in Bridgeman's (1978) … Panel: Vernier + 25 grating with gaps.

… (1) Bridgeman (1971, 1978). To initialize the network, 500 iterations were run in which only background activation was provided, before the stimuli were presented to the network. The target was presented for 2 time frames, the mask for the remaining 18 frames.

Onset of context
The small red horizontal bars indicate where the activity of the trace drops below a particular threshold. A Vernier target was masked by a grating consisting of a five-element center and a 20-element surround, which were presented at different onset times. Once presented, the stimulus remained on the screen until 300 ms after target offset. The model correctly predicts that the target strength remains strongest for simultaneous onset of the mask's center and surround.

As discussed before, a grating of five elements is a stronger mask than one consisting of 25 elements. Here, we will show simulation results in which the relative onset of the five central elements and the 20 surrounding elements of a 25-element mask was varied. Figure … Verbally, the explanation of the results can be phrased as follows. When the center and the surround are presented simultaneously, the network will consider the two parts as one object. The edges of this object are determined, and since they are far away from the Vernier target, they will hardly affect the signal of the Vernier. If the surround is presented earlier, the network will respond by detecting the edges of the two parts of the surround. Since the edges of these parts are much closer to the Vernier location, they will inhibit the Vernier more strongly.

Similarly, if the center is presented first, its edges will be detected, and since these edges are also close to the Vernier, they will inhibit the Vernier's … (Figure 6; Herzog, Ernst et al., 2003).

Optimal masking at a non-zero SOA
In the introduction, we mentioned the relatively strong focus of the masking research community on explaining that masking can be strongest at a non-zero SOA (B-type masking). The work by Francis (2000) suggests that many models that apply a non-linearity (rectification) and decay can explain B-type masking.
As our version of the Wilson-Cowan model contains both properties, we would expect that a combination of target and mask can be found for which the model shows strongest masking at a non-zero SOA. Figure 9 shows such a combination (…

GENERAL DISCUSSION
In this paper, we have argued that it is important to study both spatial and temporal aspects of visual backward masking. Temporal aspects have been studied for a long time. Although some basic spatial aspects, such as the distance between target and mask, and their spatial frequencies, have been studied in the past, it is only recently that spatial aspects have started to be investigated systematically. A similar trend can be seen for models of visual masking. Most earlier models (Anbar & Anbar, 1982; Weisstein, 1968) could only model temporal aspects of masking, simply because spatial aspects could not be coded by the models. An exception is the model by Bridgeman (1978), which allows for a representation of stimuli in a spatial array.
However, we showed that this model cannot account for the difference in masking strength between the 25-element grating (weak masking) and the five-element grating and the grating with two gaps (strong masking). Later models can represent the spatial layout of the stimuli, even in two dimensions (Francis, 1997; Öğmen, 1993).
However, these models are so complex that a single simulation can take a standard computer days to perform (see the appendix of Francis, 1997), while at the same time preventing any analytical investigation of the relevant mechanisms.
Here, we showed that a structurally simple cortical …

… Only when this activity has decayed sufficiently is the mask rendered effective. This mechanism in our model provides a putative neural basis for U-shaped masking curves.
By systematically comparing model output and experimental results, we can determine which aspects of masking can be explained with a simple mechanism, and which aspects need a more elaborate model. For example, the U-shaped dependence of performance on SOA for certain targets and masks can be explained with a single mechanism, and does not necessarily require two processes. However, Francis and Herzog (2004) showed that masking curves can intersect, even if the target and the task are kept constant, and just the mask is varied. This result poses strong restrictions on plausible models, suggesting that two or more neural processes underlie masking curves [as suggested by Reeves (1986) and Neumann and Scharlau (in press)].
Computational models are also necessary to determine which conclusions can be drawn from data, as is illustrated by a recent contribution by Di Lollo and colleagues (2000) … (Francis & Cho, 2007).
The ultimate goal of modeling visual processing will be to construct a predictive model of the visual cortex. However, current computer capacities and also our current knowledge of the visual system do not allow this yet. Until the ultimate model of the brain can be constructed, we will have to work with much simpler models. The best strategy hereby is to tightly combine experimental and modeling studies to test upcoming theories of visual information processing, and to break down visual processing as far as possible into distinct modules which can under certain conditions be studied separately from each other. In such an integrative approach, we have demonstrated that a structurally simple cortical network can explain a quite extensive set of data in visual masking, which suggests that masking phenomena can be easily understood through the dynamics of network structures that are common to many areas found in the visual cortex.