A theory of moving form perception: Synergy between masking, perceptual grouping, and motion computation in retinotopic and non-retinotopic representations

Because object and self-motion are ubiquitous in natural viewing conditions, understanding how the human visual system achieves a relatively clear perception for moving objects is a fundamental problem in visual perception. Several studies have shown that the visible persistence of a briefly presented stationary stimulus is approximately 120 ms under normal viewing conditions. Based on this duration of visible persistence, we would expect moving objects to appear highly blurred. However, in human vision, objects in motion typically appear relatively sharp and clear. We suggest that clarity of form in dynamic viewing is achieved by a synergy between masking, perceptual grouping, and motion computation across retinotopic and non-retinotopic representations. We also argue that dissociations observed in masking are essential to create and maintain this synergy.


INTRODUCTION
Studies at the turn of the 20th century analyzed the perception of moving form and laid the foundations of important discoveries related to visual masking (e.g., McDougall, 1904; Piéron, 1935) as well as the relationship between form and motion processing (Kolers, 1972). Surprisingly, however, most studies during the last three decades have focused on static form perception, and very little is known about the mechanisms underlying moving form perception. The goal of this paper is to provide a short overview of findings related to the perception of moving form and to lay the foundations of a theory of dynamic form perception. In this theory, masking, perceptual grouping, and motion computation interact within and across retinotopic and non-retinotopic representations of the stimuli.
The visible persistence of a briefly presented stationary stimulus is approximately 120 ms under normal viewing conditions (e.g., Haber & Standing, 1970; see also Coltheart, 1980). Based on this duration of visible persistence, one would expect moving objects to appear highly blurred. For example, a target moving at a speed of 10 deg/s should generate a comet-like trailing smear of 1.2 deg extent. The situation is similar to pictures of moving objects taken at an exposure duration that mimics visible persistence. As illustrated in Fig. 1, in such a picture, stationary objects are relatively clear but moving objects exhibit extensive blur.
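The smear estimate above is simply speed multiplied by the duration of visible persistence. The following minimal sketch reproduces that arithmetic; the function name and the additional speeds in the loop are illustrative assumptions, and only the 120 ms persistence and the 10 deg/s example come from the text.

```python
def predicted_smear_deg(speed_deg_per_s: float, persistence_ms: float = 120.0) -> float:
    """Extent of the trailing smear (deg) if each retinotopic position
    remains visible for persistence_ms after the target has passed."""
    return speed_deg_per_s * persistence_ms / 1000.0


if __name__ == "__main__":
    # 10 deg/s is the example from the text; the other speeds are illustrative.
    for speed in (1.0, 5.0, 10.0, 20.0):
        print(f"{speed:5.1f} deg/s -> {predicted_smear_deg(speed):.2f} deg of smear")
```

Under this simple linear assumption, the predicted smear grows in proportion to target speed, which is why fast-moving objects would be expected to look far more blurred than they actually do.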
Because object and self-motion are ubiquitous in natural viewing conditions, understanding how the human visual system achieves a relatively clear perception for moving objects is a fundamental problem in visual perception. While pursuit eye movements can retinotopically stabilize a moving target and help reduce its perceived smear (Bedell & Lott, 1996; Tong, Patel, & Bedell, 2005), even under these conditions, the problem of smear remains for other objects present in the scene. Furthermore, the initiation of an eye movement can take about 150-200 ms, during which a moving object can generate considerable smear. In the next section we present evidence that one mechanism that contributes to the perceived clarity of moving objects is metacontrast masking. This is followed by a section that highlights the importance of dissociation properties of metacontrast in achieving this task. In the subsequent section, we argue that, while metacontrast masking can reduce the extent of smear for moving objects, the synthesis of form for moving objects necessitates non-retinotopic feature processing. This leads to the section where we formulate specific hypotheses for dynamic form perception. Findings from anorthoscopic perception that provide empirical evidence for the proposed non-retinotopic form perception mechanisms are reviewed next. In the following section, we present our recent results indicating that non-retinotopic perception is not limited to anorthoscopic perception but applies to perception in general. Possible neural correlates for non-retinotopic mechanisms are discussed next. The final section concludes the manuscript.
Burr (1980) and others measured the perceived extent of motion smear produced by a random array of moving dots as a function of exposure duration. For exposure durations shorter than approximately 40 ms, the extent of perceived smear increased with exposure duration, as one would expect from the visible persistence of static objects. However, for exposure durations longer than 40 ms, the length of perceived smear was much less than that predicted from the persistence of static targets. This reduction of perceived smear for moving objects has been termed "motion deblurring" (Burr, 1980; Burr & Morgan, 1997).

Figure 1.
A picture taken at a shutter speed to illustrate the effect of visible persistence on blur. Reproduced with permission from Free-Foto.com.
http://www.ac-psych.org
Contrary to the reports of motion deblurring, it has long been known that isolated targets in real motion (e.g., Bidwell, 1899; McDougall, 1904) and in apparent motion (Castet, 1994; Dixon & Hammond, 1972; Farrell, 1984; Farrell et al., 1990) exhibit extensive smear. In order to reconcile the apparently contradictory observations of motion deblurring for a field of moving dots and extensive smear for isolated moving targets, we conducted experiments in which the density of moving dots was varied systematically, ranging from a single dot to 7.5 dots/sq-deg (Chen, Bedell, & Öğmen, 1995). Our results showed that isolated targets moving on a uniform background are perceived with extensive motion blur and that the reduction in the spatial extent of perceived motion blur (motion deblurring) increases as the density of moving dots in the array is increased. In other words, the motion deblurring reported by Burr (1980) is not a general phenomenon and applies principally to displays containing a relatively dense array of moving objects.
Several models have been proposed to explain motion deblurring based on a motion estimation procedure that is used to compensate for the adverse blurring effect resulting from object motion (e.g., Anderson & van Essen, 1987; Burr, 1980; Burr, Ross & Morone, 1986; Martin & Marshall, 1993). According to Burr (1980), motion estimation is achieved by the spatiotemporally oriented receptive fields of motion mechanisms. Martin and Marshall (1993) proposed a similar model wherein excitatory and inhibitory feedback connections suppress the persistent activity of neurons along the motion path. The "shifter-circuit" model of Anderson and van Essen (1987) uses an estimation of motion in order to generate a cortically localized (i.e., stabilized) representation of moving stimuli, thereby avoiding the smear that would result from the change of cortical locus of neural activities. All these motion estimation/compensation models predict that an isolated moving target should produce no visual blur provided that it sufficiently stimulates the motion estimation/compensation mechanisms. However, as stated above, this prediction is in sharp contradiction with the extensive blur observed for a moving isolated target (e.g., Bidwell, 1899; Chen et al., 1995; Lubimov & Logvinenko, 1993; McDougall, 1904; Smith, 1969a, b).
Several researchers suggested inhibition as a candidate mechanism for motion deblurring (e.g., Castet, 1994; Dixon & Hammond, 1972; Francis, Grossberg, & Mingolla, 1994; McDougall, 1904; Öğmen, 1993). Because inhibition is a rather general concept, it is important to determine how and where it operates to achieve motion deblurring. Empirical evidence supports the view that the inhibitory mechanisms underlying metacontrast masking are the ones involved in motion deblurring.
To test the relationship between metacontrast and motion deblurring computationally, we used a model of REtino-COrtical Dynamics (RECOD) (Öğmen, 1993), which has been applied to both paradigms. The general structure of this model is discussed in the next section.
There is also clinical evidence supporting the model's prediction that transient-on-sustained (M-on-P) inhibition plays a major role in motion deblurring: Tassinari et al. (Tassinari, Marzi, Lee, Di Lollo, & Campara, 1999) found that patients with a likely deficit in the M pathway, due to a compression of the ventral part of the pre-geniculate pathway, had substantially less motion deblurring than normal controls.
Figure 2 depicts the stimulus arrangements used by McDougall (1904) and Piéron (1935). McDougall reported that the blur generated by a leading stimulus ("a" in
This finding is in agreement with the more recent findings discussed in the previous section. Piéron (1935) modified McDougall's stimulus to devise a "sequential" version as shown in Figure 2B. A notable aspect of the percept generated by this sequential version (see also Otto et al., 2006) is that, under appropriate parametric conditions, segment "a" can suppress the visibility of segment "b", segment "b" in turn can suppress the visibility of segment "c", etc. In other words, even though the visibility of segment "b" is suppressed, its effectiveness as a mask suppressing the visibility of segment "c" remains intact, i.e., a dissociation occurs between the visibility of a stimulus and its masking effectiveness.
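The logical role of this dissociation in the sequential display can be made concrete with a toy rule. The sketch below is not the RECOD model and makes no claim about mechanism; the function and its boolean switch are hypothetical and only contrast two possibilities: a suppressed segment still acts as a mask (dissociation), or a segment can mask only if it is itself visible (no dissociation).

```python
from typing import List


def visible_segments(n_segments: int, invisible_masks_still_effective: bool) -> List[bool]:
    """Visibility of segments "a", "b", "c", ... in a sequential display.

    Toy assumption: segment i can suppress segment i + 1, as in the chain
    described in the text. Segment "a" has no masker in this chain.
    """
    visible: List[bool] = []
    for i in range(n_segments):
        if i == 0:
            visible.append(True)
            continue
        # With dissociation, the masker works regardless of its own visibility;
        # without it, only a visible masker can suppress the next segment.
        mask_effective = invisible_masks_still_effective or visible[i - 1]
        visible.append(not mask_effective)
    return visible


if __name__ == "__main__":
    print("with dissociation:   ", visible_segments(5, True))
    print("without dissociation:", visible_segments(5, False))
```

In this toy setting, suppression propagates along the whole chain only when masking effectiveness is dissociated from visibility; otherwise every other segment would escape suppression, and the masking could not operate in a sequential mode.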

FROM SHARPENED GHOSTS TO CLEAR FORMS: PROCESSING OF FORM INFORMATION FOR MOVING TARGETS OCCURS IN NON-RETINOTOPIC SPACE
Metacontrast mechanisms solve the motion blur problem only partly. If we consider the example shown in Fig. 1

A THEORY OF MOVING FORM PROCESSING
We put forward the following hypotheses for the basis of moving form perception:
that phenomenal visibility at a given instant requires a correlated activity at both of these levels. This hypothesis is elaborated further in the next section where we apply the theory to anorthoscopic perception.

Depiction of the stimulus used in anorthoscopic perception experiments
of two triangular shapes moving in opposite directions.
The tips of the triangles pass through the slit simultaneously, followed by the middle segments and finally the longest segments. Assume that the tip, the middle, and the base of the triangles cross the slit at t0, t1, and t2, respectively, with t0 < t1 < t2. Observers are required to fixate on the fixation cross and report the perceived shape of the stimuli. The time-of-arrival coding theory states that the time of arrival is used to construct spatial form. As shown in Fig. 7, according to this theory these times of arrival are converted to spatial positions s0, s1, and s2, respectively, with s0 < s1 < s2. As a result, the theory predicts that the observers should perceive the two triangles pointing in the same direction. The same prediction is made by the retinal painting theory.
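The prediction of the time-of-arrival coding theory follows from the fact that the time-to-position conversion does not use the direction of motion. A minimal sketch, assuming a linear conversion with an arbitrary gain and illustrative arrival times (none of these values come from the experiments):

```python
PARTS = ("tip", "middle", "base")


def time_of_arrival_layout(arrival_times_s, gain_deg_per_s=1.0):
    """Convert times of arrival at the slit into spatial positions.

    The mapping s = gain * t is monotonic, so t0 < t1 < t2 yields s0 < s1 < s2.
    The direction of motion is deliberately absent from the computation.
    The linear form and the gain value are assumptions for illustration.
    """
    return {part: gain_deg_per_s * t for part, t in zip(PARTS, arrival_times_s)}


# Both triangles deliver tip, middle, and base at the same instants t0 < t1 < t2.
ARRIVAL_TIMES_S = (0.0, 0.1, 0.2)  # illustrative values only

rightward_triangle = time_of_arrival_layout(ARRIVAL_TIMES_S)
leftward_triangle = time_of_arrival_layout(ARRIVAL_TIMES_S)

if __name__ == "__main__":
    print("rightward:", rightward_triangle)
    print("leftward: ", leftward_triangle)
```

Because the two layouts are identical, a purely time-based code assigns both triangles the same orientation regardless of their opposite directions of motion.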
This theory assumes that an involuntary eye movement shifts the retina with respect to the stimulus. Assume that the eye movement brings retinotopic positions s0, s1, and s2 in alignment with the slit at time instants t0, t1, and t2, respectively. As depicted in Fig. 8 , 1978;Sohmiya & Sohmiya, 1992, 1994. Not only does this experiment reject these two theories but it also highlights an essential part of anorthoscopic perception: If the direction of motion is not known, the stimulus is ambiguous in that a leftward moving image and its mirror-symmetric version moving rightward generate identical patterns in the slit. Therefore, the determination of the direction of motion is critical for anorthoscopic perception.
When viewing the stimulus shown in Fig. 10, observers typically "perceive" a circle and a square even though part of the square is not directly visible. This type of figural completion is called amodal completion (Michotte, Thinès, & Crabbé, 1964). From a terminological point of view, to distinguish this type of perception from the perception that arises in response to a "directly visible" stimulus, we use the term amodal visibility as opposed to phenomenal visibility. What is perceived behind the slit in anorthoscopic perception can be viewed as a dynamic version of amodal visibility. Even though all parts of the figure passing behind the slit are not simultaneously visible, observers "perceive" the complete shape. For example, after the tip of the triangle falls behind the occluder, observers continue to perceive the tip moving forward even though they do not directly see it. To accommodate this amodal effect, we simply assume that, at any given instant, the retinotopic and non-retinotopic activities that are linked by perceptual grouping (e.g., the tips of the triangle for t0, the middle parts of the triangles for t1, etc. in Fig. 9) become phenomenally visible. At any instant, the activity in the non-retinotopic space that has no correlated activity in the retinotopic space would be perceived "amodally". We designate this as dynamic amodal perception in that the non-retinotopic activity without correlated retinotopic activity will appear to move according to the velocity vector associated with that part of the figure.
Finally, let us point out that, due to the "aperture problem", the recovery of motion and form information in anorthoscopic perception is ill-posed (e.g., Shimojo & Richards, 1986). Our theory relates shape and motion distortions reported in anorthoscopic percepts to the errors in estimation of velocity vectors.
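The dependence of the recovered shape on the estimated velocity can be illustrated with a simple reconstruction rule. The sketch below is an assumption for illustration, not the theory's actual computation: each part that crossed the slit at time t is placed at x = v(T - t) relative to the slit, where T is the arrival time of the last part and v is the signed velocity estimate (positive for rightward motion).

```python
def reconstruct_positions(arrival_times_s, velocity_deg_per_s):
    """Place each part at x = v * (t_last - t) relative to the slit (deg).

    Hypothetical reconstruction rule: a part that crossed the slit earlier
    has traveled farther in the direction of the signed velocity estimate.
    """
    t_last = max(arrival_times_s)
    return [velocity_deg_per_s * (t_last - t) for t in arrival_times_s]


if __name__ == "__main__":
    times = (0.0, 0.1, 0.2)  # tip, middle, base; illustrative values only
    print("rightward, correct speed:", reconstruct_positions(times, 10.0))
    print("leftward, correct speed: ", reconstruct_positions(times, -10.0))
    print("rightward, underestimated:", reconstruct_positions(times, 5.0))
```

With identical slit patterns, reversing the sign of the velocity estimate mirrors the reconstructed shape, which is the direction ambiguity noted above, and an error in the magnitude of the estimate stretches or compresses the shape along the motion axis.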

NON-RETINOTOPIC PERCEPTION IS NOT RESTRICTED TO ANORTHOSCOPIC PERCEPTION
While anorthoscopic perception shows clearly that form perception can take place in the absence of a retinotopic image, generalization of the underlying non-retinotopic mechanisms to normal viewing requires the demonstration of non-retinotopic perception without the use of occluders or slits. Previous research revealed illusions where features of objects are perceived non-retinotopically, i.e., at locations different from their retinotopic location. Treisman and Schmidt (1982) showed examples of illusory feature conjunctions when observers' attention is divided. For example, in a small number of trials observers may report seeing a green square in response to a display containing red squares
Figure 10.
An example of a stimulus that leads to "amodal completion". Typically, observers perceive a square behind the circle, even though part of the square is not explicitly present in the image. This part is assumed to be present and occluded by the circle.
On the other hand, to provide support for our theory, we need to demonstrate cases of non-retinotopic perception that result not from errors of the visual system, but rather from its fundamental and lawful aspects. In particular, our Hypothesis 3 states that the transfer of information from the retinotopic to the non-retinotopic space is guided by perceptual grouping operations.
Introduced by Gestalt psychologists, the basic Ternus-Pikler display consists of two frames separated by an inter-stimulus interval (ISI). The first frame contains a given number of elements (e.g., three line segments) and the second frame consists of a spatially shifted version of the elements of the first frame such that a subset of the elements spatially overlaps in the two frames.
An example is shown in Fig. 11A where the two frames contain three elements arranged in such a way that two of the elements spatially overlap.
These displays are designed to investigate factors that control how objects, or parts thereof, maintain their identities during motion. When ISI is short, the prevailing percept is that of element motion (Fig. 11B), i.e., the leftmost element in the first frame is seen to move directly to the rightmost element in the second frame while the two central elements are perceived stationary (as depicted by the dashed arrows in Fig. 11B). When ISI is long, the prevailing percept is that of group motion, i.e., the three elements in the first frame move as a single group to match the corresponding three elements in the second frame (as depicted by the dashed arrows in Fig. 11C). Thus the resulting percepts can be understood in terms of motion-induced grouping operations. In element motion, the leftmost element in the first frame and the rightmost element in the second frame are perceived as "one object" moving from left to right. The remaining two elements together form a second group. This latter two-element group is perceived stationary and matched with the two-element group in the second frame according to the arrows in Fig. 11B. In
Consider first the retinotopic hypothesis. According to this hypothesis, features are perceived at the retinotopic positions where they are presented. Furthermore, features can be integrated retinotopically due to temporal integration properties of the visual system (Herzog, Parish, Koch, & Fahle, 2003). Consider for example the static control condition (Fig. 12C) which is identical to the Ternus display in Fig. 12A with the exception that the leftmost element of the first and the rightmost element of the second frame are omitted. In this control experiment no motion percept is elicited and the spatiotemporal integration combines the probe Vernier offset information retinotopically across the two frames.
As shown in Fig. 12C, the percentage of responses in agreement with the probe Vernier is high for element 1 and near chance for element 2 for ISI = 0 and 100 ms.
If the attribution of features in the two-frame display were made according to retinotopic relationships, we would expect a similar outcome for the Ternus-Pikler display provided that ISI is short enough to fall in the range where temporal integration occurs. Thus, we would expect the percentage of responses in agreement with the probe Vernier to be high for element 1 and near chance for elements 2 and 3 for ISI = 0 and 100 ms.
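The element-motion and group-motion percepts of the Ternus-Pikler display described above amount to two different correspondence maps between the elements of the two frames. The sketch below encodes those two maps; the function name, the integer position coding, and the string labels are hypothetical, and the percept is supplied by the caller rather than derived from ISI, since no ISI threshold is given in the text.

```python
def ternus_correspondence(frame1, frame2, percept):
    """Map each element position in frame1 to its matched position in frame2.

    percept == "group":   elements are matched in spatial order, so the whole
                          array shifts as a single group.
    percept == "element": spatially overlapping elements are matched to
                          themselves (perceived stationary); the remaining
                          elements are matched to each other.
    """
    if percept == "group":
        return dict(zip(sorted(frame1), sorted(frame2)))
    if percept == "element":
        overlap = set(frame1) & set(frame2)
        mapping = {x: x for x in overlap}
        unmatched1 = sorted(set(frame1) - overlap)
        unmatched2 = sorted(set(frame2) - overlap)
        mapping.update(zip(unmatched1, unmatched2))
        return mapping
    raise ValueError(f"unknown percept: {percept!r}")


if __name__ == "__main__":
    first, second = [0, 1, 2], [1, 2, 3]  # positions in arbitrary units
    print("element motion:", ternus_correspondence(first, second, "element"))
    print("group motion:  ", ternus_correspondence(first, second, "group"))
```

In the element-motion map, only the leftmost element changes position, jumping to the rightmost location, whereas in the group-motion map every element is matched to a shifted position. A strictly retinotopic attribution would instead keep each feature at the position where it was presented, irrespective of which map is perceived.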

Results in
Second, because they lack proper metacontrast mechanisms, they cannot predict when and how motion blur will be curtailed (Section "Motion deblurring in human vision"). Pääkkönen and Morgan (1994) proposed a two-phase motion deblurring model wherein the first stage is a "camera like exposure phase" that always produces motion blur. The second phase is proposed to carry out a "translation-invariant integration" of moving stimuli.
Our proposed theory goes beyond these previous models by including a retinotopic stage with "camera like" persistence whose extent is controlled by metacontrast interactions. Furthermore, the transition to non-retinotopic representation is governed by perceptual grouping operations, a property that allows us to explain Öğmen, Otto, & Herzog's (2006) experimental results summarized in this section. The theory can also be applied to other non-retinotopic percepts observed in anorthoscopic viewing conditions.

POTENTIAL NEURAL CORRELATES
The current neurophysiological knowledge of the primate brain is not detailed enough to map our theory directly to neural structures. However, it is well known that early visual areas V1, V2, V3, V4/V8 and V3a are retinotopic and contain a complete eccentricity and polar angle map. Beyond retinotopic cortex, the polar angle representation becomes cruder. Interestingly, a recent study by Yin, Shimojo, Moore and Engel (2002) investigated neural correlates of anorthoscopic perception using fMRI.
Their experiments included anorthoscopic percepts and control conditions with distorted stimuli that failed to generate anorthoscopic percepts. The activities in the retinotopic cortex did not correlate with whether the observers experienced anorthoscopic percepts or not. On the other hand, cortical activities in "object areas", in the Lateral Occipital Complex (LOC), a mainly non-retinotopic area, as well as those in the human motion area MT+ were correlated with anorthoscopic perception.
LOC is a cortical region that exhibits selectivity to pictures of intact "meaningful" objects compared to scrambled objects and pictures that lack a clear meaningful object interpretation (Allison, Ginter, McCarthy, Nobre, Puce, Luby et al., 1994; Allison, Puce, Spencer & McCarthy, 1999; Doniger, Foxe, Murray, Higgins, Snodgrass & Schroeder, 2000; Faillenot, Toni, Decety, Gregoire & Jeannerod, 1997; Grill-Spector, Kushnir, Hendler, Edelman, Itzchak & Malach, 1998; Grill-Spector, Kushnir, Hendler & Malach, 2000; Kanwisher, McDermott & Chun, 1997; Kourtzi & Kanwisher, 2000; Malach, Reppas, Benson, Kwong, Jiang, Kennedy et al., 1995; Murtha et al., 1999; Sergent, Ohta, & MacDonald, 1992). LOC also exhibits strong size and position invariance (Grill-Spector, Kushnir, Edelman, Avidan-Carmet, Itzchak & Malach, 1999; Malach et al., 1995). Hence, LOC and other similar non-retinotopic areas showing object selectivity can be candidates for our "non-retinotopic space". Yin et al.'s (2002) study suggests that the motion vectors, directly depicted in the non-retinotopic area in Fig. 4, may physically reside in area MT+. A recent study by Kim & Kim (2005) provides evidence that LOC has direct connections to MT+ and V3A and that MT+ and V3A have reciprocal connections. V3A is part of the V3 complex which has been implicated in the analysis of dynamic form (Zeki, 1991). Thus a tentative mapping would include areas extending to the V3 complex as our retinotopic space, LOC as the non-retinotopic space, and the connectivities between MT+, V3A, and LOC establishing the coupling of dynamic form and motion vector representations between these areas. While this mapping is highly speculative at this point, we believe that future neurophysiological studies can test more directly the neural correlates of the proposed functional theory.

CONCLUDING REMARKS
The three-dimensional structure of an object is mapped through the optics of the eye on two-dimensional retinae creating a "retinotopic image" of the object.
Retino-cortical pathways provide an orderly projection to the lateral geniculate nucleus and to the primary visual cortex so that neighboring points on the retina map to neighboring points in these areas, a property known as retinotopy. This retinotopic organization is found in numerous visual cortical areas. Through their "classical" receptive fields, neurons in these visual areas process information locally in the retinotopic space.
Retinotopic organization and retinotopically localized receptive fields have been two fundamental pillars upon which most theoretical accounts of visual form perception are built. However, these theories are based mainly on a static characterization of visual perception and focus on how form information is processed for static objects. On the other hand, very little is known about how the nervous system computes the form of moving objects. Based on an analysis of dynamic aspects of vision, we argued that non-retinotopic computational principles and mechanisms are needed to compute the form of moving objects. We designate as "non-retinotopic" those mechanisms that can generate perception of form in the absence of a retinotopic image. Indeed, perceptual data demonstrate that a retinotopic image is neither necessary nor sufficient for the perception of form: When a moving object is viewed behind a narrow slit cut out of an opaque surface (anorthoscopic perception, Fig. 5), all information about the moving object's shape collapses temporally on a narrow retinotopic locus in a fragmented manner, i.e., there is no spatially extended retinotopic image of the shape. Yet, observers perceive a spatially extended and perceptually integrated shape moving behind the slit instead of a series of fragmented patterns that is confined to the region of the slit. Anorthoscopic perception shows that a retinotopic image is not necessary for the perception of form.
The visibility of a "target stimulus" can be completely suppressed by a retinotopically non-overlapping "mask stimulus" that is presented in the spatio-temporal vicinity of the target stimulus, phenomena known as para- and metacontrast masking (Bachmann, 1984). These masking effects indicate that the existence of a retinotopic image is not a sufficient condition for the perception of form and that the dynamic context within which the stimulus is embedded plays a major role in determining whether form perception will take place.
In this manuscript, we presented a theory of moving form perception where masking, perceptual grouping, and motion computation interact across retinotopic and non-retinotopic representations. Due to visible persistence, moving targets are expected to generate extensive blur in retinotopic representations implemented in early visual cortex. We provided evidence showing that metacontrast masking controls the spatial extent of this blur. While this first step is critical in limiting the deleterious effect of motion blur, the computation of clear percepts for moving objects requires a non-retinotopic representation where figural information about moving objects is processed. We argued that motion-induced grouping is critical in transferring information from the retinotopic to non-retinotopic space. Dissociation between visibility and masking effectiveness allows metacontrast to be effective in a sequential mode.
The RECOD model captures this property. The RECOD model can also explain the dissociation between visibility and spatial localization. This dissociation allows the computation of motion information that can lead to motion grouping under metacontrast suppression conditions. Thus, taken together, RECOD can implement the deblurring of retinotopic activity while preserving information for motion-induced grouping. In addition to normal viewing conditions, the proposed theory can also be applied to anorthoscopic perception, which provides strong evidence that a "retinotopic image" is not necessary for the synthesis of a spatially extended percept. Our current work focuses on the interactions between perceptual grouping operations and non-retinotopic representations in order to develop a more detailed quantitative account for the remaining parts of the theory.
rations contributed to many aspects of the work de-