A causal inference model for the perception of complex motion in the presence of self-motion

Our subjective percept of motion differs from the actual velocity on the retina in striking ways that have traditionally been studied by disparate fields. Here, we present a Bayesian model, as well as new data to support it, that unifies our understanding of motion perception of complex objects (“grouping”) with our understanding of the influence of self-motion on our perception (“flow parsing”). The central (recurring) motif in our hierarchical model is a prior over velocity consisting of a mixture of a delta at zero and a Gaussian centered on zero. This simple modification of the classic slow-speed prior implies a “causal inference” process over whether the object is stationary or moving. Applied to multiple visual elements, it leads to a “chunking” of these elements into groups, and groups of groups, with the goal of making the relative speed of as many elements as possible zero with respect to the group they are inferred to belong to. As a result, our model infers individual motion relative to a group, and accounts for any inferred self-motion based on optic flow. Preliminary data from two experiments confirm new predictions of the model.


Introduction
A long line of empirical work has established that our perception of motion is hierarchical (Johansson, 1950): the motion of visual elements that the brain infers to belong to a group of other elements (e.g. an object) is perceived relative to the motion of that object. Such relative perception can even be nested, e.g. the motion of a part of an object is perceived relative to the motion of the object, which is itself perceived to move relative to an even larger group of objects. Gershman and colleagues recently proposed a hierarchical Bayesian model that can correctly infer both the grouping of visual elements and the perceived relative motion (Gershman, Tenenbaum, & Jäkel, 2016). In our work we modify and extend Gershman's model in two key ways. First, we reformulate it so that its ad-hoc simplicity prior, based on a Chinese restaurant process, is replaced by a slow-speed mixture prior that is justified by the statistics of natural inputs. Second, we extend it to account for self-motion and vestibular inputs, thereby enabling it to account for a large body of data describing human motion perception in the presence of optic flow (Warren & Rushton, 2009).

Central Model Motif for Motion Perception
Model Description: First consider the case of an object that is part of a larger object ("group") moving in the world. To determine the velocity of this object (v_object), the subject has two sources of information: (a) the observed velocity of the object as manifested on the retina (o_object), and (b) the predicted velocity of the object based on the knowledge that the larger object it belongs to ("group") has velocity v_group. We hypothesize that the subject combines these two sources of information by performing inference in the generative model in Figure 1A. In this generative model, the object velocity (v_object) is modeled as the sum of the velocity of the group the object belongs to (v_group) and the object's velocity relative to the group (v_object^group), with some uncertainty associated with approximate computations (Σ_estimation). E.g., the velocity of a person moving on a train is the sum of the velocity of the train and the person's velocity relative to the train. The observed object velocity (o_object) is modeled as v_object corrupted by visual sensory noise (Σ_estimation). The priors over v_group and v_object^group are modeled as a mixture of a delta at 0 and a normal distribution with mean 0 and variance Σ_prior, illustrated in Figure 1B. We hypothesize that the subject perceives the object velocity as the inferred v_object^group if it is non-zero, i.e. if the object moves relative to the group, and otherwise as the inferred group velocity (v_group).
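Written out, one reading of the motif described above is the following. The Gaussian form of both noise terms and the symbol \pi for the weight of the delta component are assumptions made for illustration; the text labels both noise terms Σ estimation, and that labeling is kept here.

```latex
\begin{align}
v_{\text{object}} \mid v_{\text{group}},\, v^{\text{group}}_{\text{object}}
  &\sim \mathcal{N}\!\left(v_{\text{group}} + v^{\text{group}}_{\text{object}},\ \Sigma_{\text{estimation}}\right) \\
o_{\text{object}} \mid v_{\text{object}}
  &\sim \mathcal{N}\!\left(v_{\text{object}},\ \Sigma_{\text{estimation}}\right) \\
p(v) &= \pi\,\delta(v) + (1-\pi)\,\mathcal{N}\!\left(v;\ 0,\ \Sigma_{\text{prior}}\right),
  \qquad v \in \left\{v_{\text{group}},\ v^{\text{group}}_{\text{object}}\right\}
\end{align}
```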
Link to slow speed prior: The mixture prior used in our model differs from the traditional slow-speed prior (a Gaussian centered on zero) used in earlier studies (Stocker & Simoncelli, 2006). We believe that this form of mixture better describes the statistics of the outside world, and hence the brain's expectations over velocities in the world: a substantial fraction of objects in the world are actually stationary, whereas a Gaussian prior without the delta at zero would predict that almost all objects move with some non-zero velocity.
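To make the difference from a purely Gaussian prior concrete, the following minimal sketch computes the posterior for a single one-dimensional velocity under the mixture prior. The function names, the scalar (1-D) simplification, the Gaussian observation noise and all parameter values are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

def normal_pdf(x, var):
    """Density of a zero-mean normal with variance `var`, evaluated at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)

def mixture_posterior(o, var_obs, var_prior, p_stationary):
    """Posterior over a 1-D velocity v given a noisy observation o = v + noise.

    Assumed prior: p_stationary * delta(v) + (1 - p_stationary) * N(0, var_prior).
    Returns (P(v == 0 | o), posterior mean of v given that v != 0).
    """
    # Marginal likelihood of o if the object is stationary (v = 0).
    like_stationary = normal_pdf(o, var_obs)
    # Marginal likelihood of o if v was drawn from the Gaussian component.
    like_moving = normal_pdf(o, var_obs + var_prior)
    evidence = p_stationary * like_stationary + (1.0 - p_stationary) * like_moving
    post_stationary = p_stationary * like_stationary / evidence
    # Usual Gaussian shrinkage toward zero for the moving component.
    mean_if_moving = o * var_prior / (var_prior + var_obs)
    return post_stationary, mean_if_moving

# Hypothetical values: a slow and a fast observed velocity.
slow = mixture_posterior(o=0.1, var_obs=0.25, var_prior=4.0, p_stationary=0.5)
fast = mixture_posterior(o=3.0, var_obs=0.25, var_prior=4.0, p_stationary=0.5)
```

With these illustrative numbers, the slow observation is attributed mostly to a stationary object, while the fast observation is attributed almost entirely to a moving one. A Gaussian-only prior has no counterpart to the first return value.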
Causal Inference: The delta component of the prior allows the model to group different motion elements into a hierarchy of motion groups. The reason is that multiple points moving with similar velocities have a higher likelihood under the model of being part of one object that moves with that velocity than of being separate elements, each with its own non-zero velocity. It is straightforward to show that this grouping process follows a causal inference mechanism analogous to that previously proposed for multi-sensory inference (Körding et al., 2007). In our model, the mixture prior allows the subject to infer which group an object belongs to and to infer the velocity of the object relative to that group.
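The likelihood argument above can be sketched numerically by comparing two hypotheses for a set of observed 1-D velocities: all elements share one group velocity (with zero relative velocity), or each element moves independently. This is a deliberately simplified illustration, not the full model: it assumes scalar velocities, Gaussian noise, and only the Gaussian component of the prior for any moving quantity. All names and parameter values are hypothetical.

```python
import math

def log_evidence_grouped(obs, var_obs, var_prior):
    """Log marginal likelihood if all elements share one group velocity
    v_group ~ N(0, var_prior) and have zero velocity relative to the group."""
    n = len(obs)
    total = sum(obs)
    sum_sq = sum(o * o for o in obs)
    # The covariance is var_obs * I + var_prior * 1 1^T. Its determinant and
    # inverse have closed forms (matrix determinant lemma, Sherman-Morrison).
    log_det = (n - 1) * math.log(var_obs) + math.log(var_obs + n * var_prior)
    quad = (sum_sq - var_prior * total * total / (var_obs + n * var_prior)) / var_obs
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)

def log_evidence_separate(obs, var_obs, var_prior):
    """Log marginal likelihood if every element moves independently,
    each with its own velocity ~ N(0, var_prior)."""
    n = len(obs)
    var = var_obs + var_prior
    sum_sq = sum(o * o for o in obs)
    return -0.5 * (n * math.log(2.0 * math.pi * var) + sum_sq / var)

def log_bayes_factor(obs, var_obs=0.01, var_prior=1.0):
    """Positive values favor the grouped hypothesis."""
    return (log_evidence_grouped(obs, var_obs, var_prior)
            - log_evidence_separate(obs, var_obs, var_prior))

similar_3 = log_bayes_factor([1.0, 1.0, 1.0])
similar_10 = log_bayes_factor([1.0] * 10)
dissimilar = log_bayes_factor([1.0, -1.0, 0.5])
```

Under these assumptions, similar velocities favor the grouped hypothesis, the evidence for grouping grows with the number of elements, and dissimilar velocities favor separate elements.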
Complex Motion Perception: This motif can be recursively extended to model motion perception at different hierarchical levels, from simple "dot" elements to groups of dots, from groups of dots to supergroups of groups, and so on (Figure 1C). Furthermore, it is straightforward to extend this model to incorporate self-motion (Figure 1C, left), the prior over which we again model as a mixture of a delta at zero and a Gaussian. This choice again implies a causal inference process over whether the subject is stationary or not, mathematically equivalent to that recently proposed by Dokka, Park, Jansen, DeAngelis, and Angelaki (2019). In the next sections, we adapt this general model to predict the motion perception of complex objects and motion perception under perceived self-motion for two specific experiments designed to test its novel elements: the mixture prior implying causal inference, and the combination with self-motion.
This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0
[Figure 1: (A) generative model for the central motif; (B) mixture prior over velocity; (C) recursive extension of the motif to groups, supergroups and self-motion.]

Hierarchical Motion Perception
In this section, we study how our perception of an object's velocity is affected by the presence of other objects with which it may be grouped. We adapt the generative model described in the previous section to predict behavior.
Task Description: The schematic for the task is shown in Figure 2A (top). The subject fixates at the center, observes the object (red dot) move at a particular angle θ with respect to the horizontal (Figure 2A, bottom), and reports the perceived direction of movement with a dial. The red dot is surrounded by N_{group dots} green dots, which move horizontally, concurrently with the red dot (N_{group dots} is chosen as 1, 2, 3, 5 or 10). All dots move back and forth once before moving in the direction that the subject has to report (indicated by a change in colour of the fixation dot). Trials with different N_{group dots} are interleaved with conditions in which the grouping dots are stationary. The speed of the group dots is chosen to match the x component of the red dot's velocity (Figure 2A, bottom).
Model Description: We model the subject's perception as inference in the generative model shown in Figure 2B, with the coloured boxes indicating the possible candidates for perceived variables. The subject observes the velocities of all the dots (o), which are modeled as the true dot velocities (ε) corrupted by external noise (e.g. screen resolution) and internal noise (e.g. photoreceptor noise). All velocities are represented by their x and y components. The inferred velocity of each dot is modeled using the motif, in terms of the velocity of a group and the dot's velocity relative to the group. While the model captures grouping due to common velocity components, objects can be part of different groups despite having a common velocity (e.g. a bird flying in front of a moving train). Thus, the subject also infers whether different objects belong to a group before splitting the velocity as per the motif. Here, we model this causal inference as inference over C_object^group, which indicates whether or not the object is part of the group.
Predictions: We predict the distribution of subject responses over different trials from the model. When the number of moving green objects is 0 (stationary green dots), we predict that the subject perceives the veridical value (top left panel). With concurrently moving green dots, we predict three modes for the subject's percepts: (a) the veridical value of the object velocity (identity line), if the subject does not infer the object and green dots to be part of the same group; (b) the group velocity (0 degrees), if the subject infers that the object moves as part of the group; and (c) the velocity relative to the group (90 degrees, since the horizontal velocities are matched). The mode at zero (the signature of causal inference) indicates a pull towards the group velocity, which is not predicted by traditional vector subtraction models.
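The three predicted modes follow directly from the task geometry. The sketch below computes their directions for a given stimulus angle; it covers only the geometry, not the model's inference over which mode is perceived. The function name and the unit speed are illustrative assumptions, and the calculation is intended for angles strictly between 0 and 90 degrees.

```python
import math

def predicted_modes(theta_deg, speed=1.0):
    """Directions (in degrees) of the three candidate percepts for a red dot
    moving at angle theta_deg, with 0 < theta_deg < 90."""
    theta = math.radians(theta_deg)
    obj = (speed * math.cos(theta), speed * math.sin(theta))
    # The group dots move horizontally and match the object's x component.
    group = (obj[0], 0.0)
    # Velocity of the object relative to the group (vector subtraction).
    relative = (obj[0] - group[0], obj[1] - group[1])

    def direction(v):
        return math.degrees(math.atan2(v[1], v[0]))

    return {
        "veridical": direction(obj),     # (a) object not grouped with the dots
        "group": direction(group),       # (b) object perceived as moving with the group
        "relative": direction(relative), # (c) object perceived relative to the group
    }

modes = predicted_modes(30.0)
```

A pure vector subtraction account yields only the "relative" entry; the "group" entry at 0 degrees is the additional mode that the text attributes to causal inference.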
Results: Model-predicted distributions are indicated by the red violin plots, and the responses of an example subject are indicated by the blue dots in Figure 2C. The distributions predicted by the model largely agree with our earlier predictions. The model predicts more reports of the relative velocity direction as the angle between the patch and the group increases. For small angles, the model predicts the mode at zero, which strengthens with the number of dots (there is a higher chance of grouping with more group dots). The responses of the example subject qualitatively match the model predictions.

Object Motion Perception under Self Motion
In this section, we study how our perception of object velocity is affected by perceived self-motion, and adapt the motif described in the previous section to model subject behavior.
Task Description: The schematic for the task is shown in . The subject fixates at the centre, observes the object (green dot) move at a particular angle θ to the horizontal on the retina (Figure 2A, bottom), and reports the perceived direction of the object with a dial. The background consists of red dots whose velocities simulate an optic flow indicative of self-motion either towards or away from the screen. The number of dots is interleaved across trials in a session and is chosen as 1, 10, 100 or 1000. The self-motion velocity is chosen such that the optic flow vector at the object location matches the horizontal component of the object's retinal velocity (Figure 3B).
Model Description: We model the subject's motion perception in the task as inference in the generative model in Figure 3C. The notation is similar to that of the model in the previous section, with o indicating the observed velocity, ε the true velocity and v the inferred velocity. The subject infers their self-motion velocity from their vestibular observations and the optic flow (visual), modeled using causal inference. The velocity of each dot on the retina is inferred as the sum of the dot's velocity in the world and the velocity due to the subject's self-motion. The latter is calculated as the velocity the dot would have on the retina if it were stationary in the world (denoted by v_stationary), which depends on the subject's self-motion and the position of the object in the world.
Predictions: The model predicts that the subject infers their self-motion from the observed dot velocities. Since there is a mixture prior (Figure 1B) over the perceived self-motion, the model predicts that the subject's percept of self-motion increases with the number of dots. This happens because, with more dots moving at velocities consistent with self-motion, it becomes more likely that the dots are actually stationary in the world and that the subject is moving. The inferred self-motion determines the predicted retinal velocity of a stationary object in the world, which is subtracted from the observed retinal velocity to determine the object's velocity in the world (the perception of which can be modeled as "flow parsing"). The magnitude of flow parsing can be quantified by the flow parsing gain, which is ≤ 1, where 1 indicates complete flow parsing (a percept of 90 degrees in the task). Hence we can use the flow parsing gain as an indicator of the increase in the subject's inferred self-motion.
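One way to make the flow parsing gain concrete is sketched below. It assumes that the perceived world velocity equals the retinal velocity minus the gain times the optic flow vector at the object's location, and that this flow vector equals the horizontal component of the retinal velocity, as in the task. The function names and the inversion formula are illustrative assumptions and may differ from the computation used for the reported data; the sketch is intended for angles strictly between 0 and 90 degrees.

```python
import math

def perceived_angle(theta_deg, gain, speed=1.0):
    """Perceived direction (degrees) of the object after subtracting
    gain * (optic flow at the object's location) from its retinal velocity."""
    theta = math.radians(theta_deg)
    retinal = (speed * math.cos(theta), speed * math.sin(theta))
    # Task design: the flow vector matches the horizontal retinal component.
    flow = (retinal[0], 0.0)
    world = (retinal[0] - gain * flow[0], retinal[1] - gain * flow[1])
    return math.degrees(math.atan2(world[1], world[0]))

def gain_from_report(theta_deg, reported_deg):
    """Invert perceived_angle: tan(reported) = tan(theta) / (1 - gain)."""
    return 1.0 - math.tan(math.radians(theta_deg)) / math.tan(math.radians(reported_deg))

no_parsing = perceived_angle(30.0, gain=0.0)    # veridical retinal direction
full_parsing = perceived_angle(30.0, gain=1.0)  # complete flow parsing
partial = perceived_angle(30.0, gain=0.5)
recovered = gain_from_report(30.0, partial)
```

Under this reading, a gain of 0 leaves the retinal direction unchanged, a gain of 1 gives a 90 degree percept, and intermediate gains can be recovered from the reported angle.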

Results: The subject's reports of the perceived angle of the moving object are given in Figure 3E. The reports are biased by the optic flow dots, consistent with the model predictions. Since the model predicts an increase of the flow parsing gain (linked to the self-motion percept) with the number of dots, we compute the gain for the example subject (Figure 3D). The flow parsing gain increases with the number of dots for both forward and backward motion, as predicted by the model.