Exploring and explaining properties of motion processing in biological brains using a neural network

Visual motion perception underpins behaviors ranging from navigation to depth perception and grasping. Our limited access to biological systems constrains our understanding of how motion is processed within the brain. Here we explore properties of motion perception in biological systems by training a neural network to estimate the velocity of image sequences. The network recapitulates key characteristics of motion processing in biological brains, and we use our access to its structure to explore and understand motion (mis)perception. First, we find that the network captures the biological response to reverse-phi motion in terms of direction. We further find that it overestimates and underestimates the speed of slow and fast reverse-phi motion, respectively, because of the correlation between reverse-phi motion and the spatiotemporal receptive fields tuned to motion in opposite directions. Second, we find that the distribution of spatiotemporal tuning properties in the V1 and middle temporal (MT) layers of the network is similar to that observed in biological systems. We then show that, in comparison to MT units tuned to fast speeds, those tuned to slow speeds primarily receive input from V1 units tuned to high spatial frequency and low temporal frequency. Next, we find that there is a positive correlation between the pattern-motion and speed selectivity of MT units. Finally, we show that the network captures human underestimation of low coherence motion stimuli, and that this is due to pooling of noise and signal motion. These findings provide biologically plausible explanations for well-known phenomena and produce concrete predictions for future psychophysical and neurophysiological experiments.


Introduction
The transduction of changing patterns of light into the perception of motion underpins adaptive behaviors ranging from depth estimation to navigation and grasping. For motion perception to guide these behaviors effectively, changes in visual input must be translated into accurate estimation of both direction and speed. This, uniquely, requires combining information across space and time. Many biological systems appear to be highly proficient at this task; for example, humans can reliably discriminate speed differences of 5% to 7% (de Bruyn & Orban, 1988; McKee, 1981), and over a century of research on motion processing has expanded our understanding of the neural computations that underlie this ability. However, the biological basis for many aspects of speed estimation remains unknown. A primary constraint on our understanding of these (and other) neural mechanisms is imposed by the limited access we have to biological systems. For example, we can measure the output of the system in response to different inputs (i.e., psychophysics), gross population activity (e.g., fMRI or EEG), or point measurements (i.e., cell recordings), but combining this information to extract the underlying neural computations and principles remains a challenge.
We recently demonstrated the potential of taking an artificial systems approach to bolster understanding of how biological systems function. In particular, we trained a shallow neural network ("MotionNet") to classify the velocity of motion sequences generated from natural images (Rideaux & Welchman, 2020). Using this approach, we revealed novel relationships between speed and direction encoding and explained drivers of biases in population tuning and perception.
Here we sought to extend this approach to test aspects of motion processing in relation to spatial and temporal frequency characteristics. Moreover, the architecture of the neural network used in our previous study constrained the units in the output layer that we described as being analogous to the middle temporal area (MT) in the primate visual system. This stood in contrast to the units in the layer corresponding to V1, which were unconstrained and therefore allowed us to gain valuable insights into population characteristics (e.g., tuning biases) that were chosen by the network to best estimate velocity. In this article we used a new neural network that did not predefine the V1 or MT stages of the model. Specifically, we train a new neural network ("MotionNet xy ") to estimate continuous measures of horizontal and vertical velocity by including an additional regression layer. This does not constrain the properties of the MT layer units, allowing them to develop characteristics that best serve the task of velocity estimation.
Using this artificial systems approach we examine how spatiotemporal information is combined to produce our (mis)perceptions of image velocity. For instance, when image contrast is reversed between motion frames, this produces a corresponding reversal in perceived motion direction (Anstis, 1970). Electrophysiological work shows that this perceptual illusion is also reflected in the responses of macaque V1 and MT neurons: their preferred direction is inverted (Duijnhouwer & Krekelberg, 2016). After verifying this behavior in the artificial system, we explore how these changes influence the calculation of speed. Unpublished observations suggest that observers overestimate and underestimate the speed of slow and fast reverse-phi motion, respectively (Parthasarathy, 2019; Ruda, Riesen, & Hock, 2016). We find that the network exhibits the same biases, and then use our access to the system to show that this is due to the similarity between reverse-phi motion and the receptive fields of spatiotemporal neurons tuned to opposite directions.
We then examine how spatial and temporal information is combined to compute speed. Electrophysiological work shows that V1 neurons are tuned to a range of spatial and temporal frequencies (Friend & Baker, 1993; Holub & Morton-Gibson, 1981; Tolhurst & Movshon, 1975), but their tuning for these properties is independent. By contrast, some MT neurons appear to show speed tuning, requiring joint encoding of spatial and temporal frequency (Perrone & Thiele, 2001; Priebe, Cassanello, & Lisberger, 2003). It has been proposed that MT neurons tuned to slow speeds receive input from V1 neurons sensitive to high spatial and low temporal frequencies, while the opposite is true for MT neurons tuned to high speeds. This notion is supported by some neurophysiological evidence (Priebe et al., 2003), but remains a challenge to directly test in biological systems due to the difficulty of tracking synaptic connections between brain regions. By contrast, the connections between layers in the artificial system are as accessible as the rest of its architecture; thus we test this possibility and find that the relationship predicted between spatiotemporal V1 and MT neurons in biological systems is evident in the network.
Although some MT neurons appear to achieve speed selectivity by pooling V1 activity, neurophysiological work suggests that many MT neurons exhibit selectivity indistinguishable from V1 neurons, that is, separable tuning to spatial and temporal frequency (Priebe et al., 2003). Similar diversity across MT neurons is also observed for direction selectivity, that is, whether a neuron responds to the individual components or combined pattern of a moving object (Movshon, Adelson, Gizzi, & Newsome, 1986). These two properties index the complexity of the information that is encoded by MT neurons in terms of speed and direction, and we find that they are positively correlated in the network, that is, MT units tuned to speed are also more likely to be tuned to pattern motion.
Finally, we show that the network recapitulates neural and psychophysical performance in response to reduced motion coherence (Britten, Shadlen, Newsome, & Movshon, 1992), exhibiting the same speed-opponency (noise reduction) mechanisms observed in biological systems (Mikami, Newsome, & Wurtz, 1986). In particular, we show that MotionNet xy underestimates the speed of low coherence motion stimuli (Schütz, Braun, Movshon, & Gegenfurtner, 2010) and demonstrate that this is due to pooling of noise motion (net velocity = 0) with signal motion.

Method

Naturalistic motion sequences
To train a neural network to estimate image velocity, we generated motion sequences using 200 photographs from the Berkeley Segmentation Dataset (https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/). Images were grayscale indoor and outdoor scenes (converted from RGB using the rgb2gray function in MATLAB; The MathWorks, Inc., Natick, MA). Motion sequences (six frames) were produced by translating a 32 × 32 pixel cropped patch of the image (Figure 1a). Motion direction and speed were randomly assigned from uniform distributions between 0° and 360° and between 0.8 and 3.8 pixels/frame, respectively. Images were translated in polar coordinates; for example, an image moving at a speed of 1 pixel/frame in the 0° (right) direction was translated by +[x = 1, y = 0] per frame, whereas an image moving at the same speed in the 45° direction was translated by +[x = .7071, y = .7071] per frame. Image translation was performed in MATLAB using Psychtoolbox v3.0.11 subpixel rendering extensions (Brainard, 1997; Pelli, 1997) (http://psychtoolbox.org/). The speeds used to train the network were selected because they did not exceed the image dimensions (32 × 32 pixels) and matched those used in our previous study (Rideaux & Welchman, 2020). We generated 32,000 motion sequences, which were scaled so that pixel intensities were between -1 and 1, and randomly divided into training and test sets, as described in the Training Procedure section.
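The conversion from a polar velocity (speed, direction) to a per-frame pixel displacement can be sketched as follows. This is only an illustrative Python sketch, not the MATLAB/Psychtoolbox code used to generate the stimuli; the function names `displacement_per_frame` and `frame_offsets` are hypothetical.

```python
import math

def displacement_per_frame(speed, direction_deg):
    """Convert speed (pixels/frame) and direction (degrees) to an (x, y) shift per frame."""
    theta = math.radians(direction_deg)
    return speed * math.cos(theta), speed * math.sin(theta)

def frame_offsets(speed, direction_deg, n_frames=6):
    """Cumulative (x, y) offset of the image patch on each of the n_frames frames."""
    dx, dy = displacement_per_frame(speed, direction_deg)
    return [(k * dx, k * dy) for k in range(n_frames)]

# 1 pixel/frame at 45 degrees: roughly 0.7071 pixels in x and in y per frame
dx, dy = displacement_per_frame(1.0, 45.0)
print(round(dx, 4), round(dy, 4))
```

Because the offsets are generally non-integer, any implementation along these lines would also need subpixel interpolation when rendering each frame.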

MotionNet xy architecture
All the networks described in the study were implemented in Python v.3.6.4 (https://python.org) using TensorFlow (www.tensorflow.org), a library for efficient optimization of mathematical expressions. We used a convolutional neural network that comprised (i) an input layer, (ii) one convolutional-pooling layer, (iii) one dense layer, and (iv) an output regression layer (Figure 1a).
Inputs were image patches (32 × 32 × 6 pixels; the last dimension indexing the motion frames; Figure 1a, input layer). In the convolutional layer, inputs passed through 64 three-dimensional kernels (6 × 6 × 6 pixels) producing 64 two-dimensional output maps (27 × 27 pixels; Figure 1a, V1 layer). This resulted in 18,496 units (64 maps of 27 × 27 pixels) forming 10,077,696 connections to the input layer (64 × 27 × 27 × 6 × 6 × 6). Because the mapping is convolutional, 13,888 parameters were learned for this layer (64 filters of dimensions 6 × 6 × 6 plus 64 offset terms). We chose units with rectified linear activation functions to model neurophysiological data (Movshon, Thompson, & Tolhurst, 1978). The activity, a, of unit j in the kth convolutional map was given by:

$$a_j^{(k)} = \left( w^{(k)} s_j + b_j \right)_+ \quad (1)$$

where $w^{(k)}$ is the 6 × 6 × 6 dimensional 3D kernel of the kth convolutional map, $s_j$ is the 6 × 6 × 6 motion sequence captured by the jth unit, $b_j$ is an offset term, and $(\cdot)_+$ denotes a linear rectification non-linearity (ReLU). Parameterizing the motion image frames separately, the activity $a_j^{(k)}$ can be alternatively written as:

$$a_j^{(k)} = \left( \sum_{n=1}^{6} w^{(t_n k)} s_j^{t_n} + b_j \right)_+ \quad (2)$$

where $w^{(t_n k)}$ represent the kth kernels applied to motion image frames (i.e., receptive fields at times 1 to 6), while $s_j^{t_n}$ represent the input images captured by the receptive field of unit j.
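The connection and parameter counts quoted for the convolutional layer follow from the kernel and map sizes. The short Python check below reproduces them, assuming an unpadded ("valid") convolution with a stride of 1, which is what the 32 → 27 pixel map size implies; the variable names are illustrative only.

```python
# Sizes stated in the text
input_size = 32    # pixels per side of the input patch
n_frames = 6       # motion frames per sequence
kernel_size = 6    # each kernel spans 6 x 6 pixels x 6 frames
n_kernels = 64

# Assumption: unpadded ("valid") convolution with stride 1
map_size = input_size - kernel_size + 1

weights_per_kernel = kernel_size * kernel_size * n_frames
connections = n_kernels * map_size * map_size * weights_per_kernel
parameters = n_kernels * weights_per_kernel + n_kernels  # weights plus one offset per kernel

print(map_size, connections, parameters)  # 27 10077696 13888
```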
A dense layer (1,183,776 connections; 23,328 per feature map, resulting in 1,183,744 parameters including the 64 offset terms; Figure 1a, MT layer) mapped the activities in the pooling layer to 64 fully connected units. The vector of dense layer activities r was obtained by mapping the vector of activities in the convolutional layer, a, via the weight matrix W and adding the offset terms b:

$$r = W a + b \quad (3)$$

Finally, a regression layer (128 connections, 64 for each of the two regression units, resulting in 130 parameters including the two offset terms; Figure 1a, output layer) mapped activities from the dense layer to two regression units, which represented the x and y velocity of the motion sequence. The regression unit activities were obtained using Equation (3).

Training procedure
Motion sequences were randomly divided into training (75%, n = 24,000) and test (25%, n = 8000) sets. No sequences were simultaneously present in the training and test sets. To optimize MotionNet xy , only the training set was used. We initialized the weights of the convolutional layer as Gaussian noise (mean, 0; SD, 0.001). The weights in the dense and regression layers and all offset terms were initialized to zero.
MotionNet xy was trained using mini-batch gradient descent with each batch comprising 32 randomly selected examples. For each batch, we computed the derivative of the mean squared loss function with respect to the parameters of the network via back-propagation, and adjusted the parameters for the next iteration according to the update rule:

$$w_{i+1} = w_i - \alpha \frac{\partial L}{\partial w}(D_i) \quad (4)$$

where $\alpha$ is the learning rate, and $\frac{\partial L}{\partial w}(D_i)$ is the average over the batch $D_i$ of the derivative of the loss function with respect to w, evaluated at $w_i$. The learning rate $\alpha$ was constant and equal to $1.0 \times 10^{-4}$. After evaluating all the batches once (i.e., completing one epoch), we tested MotionNet xy using the test image dataset. We repeated this for 25 epochs.
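The mini-batch update described above (subtract the learning rate times the batch-averaged gradient of the mean squared loss) can be illustrated on a toy problem. The sketch below is not the TensorFlow training code: it fits a small linear model with NumPy, the data are synthetic, the function names are hypothetical, and the learning rate is raised to 1e-2 (an assumption made only so that this toy example converges within 25 epochs).

```python
import numpy as np

def mse_grad(w, batch):
    """Batch-averaged gradient of the mean squared error for a linear model X @ w."""
    X, y = batch
    residual = X @ w - y
    return 2.0 * (X.T @ residual) / len(y)

def sgd_step(w, grad_fn, batch, lr):
    """One update: w <- w - lr * (batch-averaged gradient)."""
    return w - lr * grad_fn(w, batch)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(320, 3))            # synthetic inputs (illustrative only)
y = X @ np.array([1.0, -2.0, 0.5])       # synthetic targets

w = np.zeros(3)
loss_before = mse(w, X, y)
for epoch in range(25):                  # 25 epochs, as in the text
    order = rng.permutation(len(y))      # reshuffle examples each epoch
    for start in range(0, len(y), 32):   # batches of 32, as in the text
        idx = order[start:start + 32]
        w = sgd_step(w, mse_grad, (X[idx], y[idx]), lr=1e-2)
loss_after = mse(w, X, y)
print(loss_before, loss_after)
```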

Generation of test stimuli
A range of stimuli were used to test the response of the network after it had been trained on natural images. With the exception of sinewave and plaid stimuli, which were generated in Python using in-house scripts, all stimuli were generated using the Python toolbox PsychoPy (Peirce, 2007).

Decoding direction and speed
To avoid issues associated with using a circular variable to train a regression output, the network was trained to estimate the x and y velocity of motion sequences. These estimates were then converted to speed $\rho$ and direction $\varphi$ with the following:

$$\rho = \sqrt{v_x^2 + v_y^2} \quad (5)$$

$$\varphi = \operatorname{atan2}(v_y, v_x) \quad (6)$$

where $v_x$ and $v_y$ denote the x and y velocity components.
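A minimal Python sketch of this decoding step is given below. It assumes the standard Cartesian-to-polar conversion (Euclidean norm for speed, two-argument arctangent for direction) and wraps the direction into the 0° to 360° range; the function name `decode_velocity` is hypothetical.

```python
import math

def decode_velocity(vx, vy):
    """Convert x/y velocity estimates to speed and direction (degrees in [0, 360))."""
    speed = math.hypot(vx, vy)                              # sqrt(vx**2 + vy**2)
    direction = math.degrees(math.atan2(vy, vx)) % 360.0    # wrap negative angles
    return speed, direction

# A purely vertical velocity decodes to a direction of about 90 degrees
speed, direction = decode_velocity(0.0, 2.0)
print(speed, direction)
```

Using the two-argument arctangent rather than arctan(v_y / v_x) keeps the four quadrants distinct and avoids division by zero for vertical motion.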

Component- and pattern-motion selectivity
To compare the component- and pattern-motion selectivity of MotionNet xy units to those of neurons in macaque V1 and MT (extracted and replotted neurophysiological data from Figures 11-13 of Movshon et al., 1986), we measured the activity of V1/MT units in response to sinewave gratings and plaids (135° separation) moving in 16 evenly spaced directions between 0° and 360° at each unit's preferred spatial and temporal frequency (Figure 2c).
To classify each unit as component-selective (i.e., selective for the motion of the individual components comprising a plaid pattern), pattern-selective (i.e., selective for the motion of the plaid pattern), or unclassed (Figure 2c), we used the method described by Movshon et al. (1986). Briefly, we compared the unit responses to ideal "component" and "pattern" selectivity using goodness of fit statistics. Because the component and pattern selectivity responses may be correlated, we used the partial correlation in the form:

$$R_p = \frac{r_p - r_c r_{cp}}{\sqrt{(1 - r_c^2)(1 - r_{cp}^2)}} \quad (7)$$

where $R_p$ denotes the partial correlation for the pattern prediction, $r_p$ is the correlation of the data with the pattern prediction, $r_c$ is the correlation of the data with the component prediction, and $r_{cp}$ is the correlation between the two predictions. The partial correlation for the component prediction was calculated by exchanging $r_c$ for $r_p$ and vice versa. We labeled units as "component" if the component correlation coefficient significantly exceeded either zero or the pattern correlation coefficient, whichever was larger. Similarly, we labeled units as "pattern" if the pattern correlation coefficient significantly exceeded either zero or the component correlation coefficient.
Units were labeled as "unclassed" if either (i) both pattern and component correlations significantly exceeded zero but did not differ significantly from one another, or (ii) neither correlation coefficient differed significantly from zero. To demonstrate the consistency in training outcomes, we trained 10 networks and in Figure 2 present the cumulative distribution of all 10 networks.
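The partial correlations that feed this classification can be computed as sketched below. The sketch assumes the standard first-order partial correlation (the correlation with one prediction after removing what is shared with the other) and covers only that step, not the significance tests used to assign labels; the function name and the example correlation values are hypothetical.

```python
import math

def partial_correlation(r_target, r_other, r_between):
    """Partial correlation of the data with one prediction, controlling for the other.

    r_target:  correlation of the data with the prediction of interest
    r_other:   correlation of the data with the competing prediction
    r_between: correlation between the two predictions
    """
    return (r_target - r_other * r_between) / math.sqrt(
        (1.0 - r_other ** 2) * (1.0 - r_between ** 2)
    )

# Illustrative values only (not taken from the study)
r_p, r_c, r_cp = 0.8, 0.5, 0.4
R_p = partial_correlation(r_p, r_c, r_cp)   # pattern partial correlation
R_c = partial_correlation(r_c, r_p, r_cp)   # component: swap the roles of r_p and r_c
print(round(R_p, 3), round(R_c, 3))         # 0.756 0.327
```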
To compare the distribution of pattern-motion selectivity among V1 and MT units in MotionNet xy with those of our previous network ("MotionNet"; Rideaux & Welchman, 2020) and V1 and MT neurons, we projected the values shown in Figures 2b and 2c, in addition to data from Figure 3e of our previous study (Rideaux & Welchman, 2020), along the diagonal to establish a unified estimate of pattern-motion selectivity for each unit (Figures 2d-2f). We then compared the responses of component- and pattern-motion selective MT units to grating and plaid stimuli. We selected the 16 MT units with the highest and lowest pattern-motion selectivity index and measured their response to gratings and plaids (135° separation) moving in 16 directions between 0° and 360° (temporal frequency: 0.265; spatial frequency: 0.085).

Reverse-phi motion responses
To compare the phi and reverse-phi responses of MotionNet xy units to those of neurons in macaque V1 and MT (extracted and replotted neurophysiological data from Figures 3a and 4a of Duijnhouwer & Krekelberg, 2016), we measured the activity of V1/MT units in response to dot motion. Dot motion stimuli in the phi condition consisted of five randomly positioned white dots (pixel value, 1.0; radius, 4 pixels) on a mid-gray background (pixel value, 0.0), which were allowed to overlap (with occlusion) and wrapped around the image when their position exceeded the edge. Of the six motion sequence frames presented, only the first two frames comprised dot motion, whereas the last four were presented as uniform mid-gray. For each V1/MT unit, we presented dot motion stimuli moving in 16 evenly spaced directions (0-360°) at the unit's preferred speed. The reverse-phi dot motion stimuli were the same as those used in the phi condition, except that the contrast of the dots was reversed (from white to black) on the second frame. The responses of V1 and MT units from 10 networks were aligned to a common preferred direction, and the average for each is shown in Figures 3b and 3c.
To test how MotionNet xy estimated the speed of reverse-phi stimuli, we compared the speed decoded by the network in response to the phi and reverse-phi stimuli described above over a range of speeds (five linearly spaced speeds between 1.0 and 3.5 pixels/frame). We tested 10 networks and the average and standard deviation of their estimated speed is shown in Figure 3d. To explore why MotionNet xy misjudges the speed of reverse-phi stimuli, we separated the V1 and MT units into two groups, those that were more tuned to the displacement direction and those that were more tuned to the opposite-to-displacement direction, by assessing whether they were positively or negatively weighted to the v x regression output unit, respectively. This classification was straightforward for MT units, which are directly connected to the regression layer, but for V1 units we used the classification of the MT unit to which each V1 unit was most positively weighted. We then measured the average activity of these subpopulations of V1 and MT units in response to the phi and reverse-phi stimuli. Finally, to explain why the speed of reverse-phi motion is misjudged, we ran a simulation on a simplified version of the phenomenon. The simulation consisted of computing the cross-correlation between phi and reverse-phi stimuli and a bank of four spatiotemporal filters. The stimuli were 16 × 16 × 2 [x,y,t] pixel image sequences comprising a white (pixel value, 1) and black (pixel value, -1) vertical edge centered on the midline at time 0 and displaced to the right (+v x) at one of three displacement speeds (1, 2, or 3 pixels) at time 1. The filters were 8 × 8 × 2 [x,y,t] pixels, comprising a white and black vertical edge centered on the midline at time 0 and displaced at the same speed as the phi/reverse-phi stimuli to the right (+v x) or to the left (−v x) at time 1. The reverse-phi stimulus was the same as the phi stimulus, except that it reversed polarity at time 1, and both combinations of light-dark and dark-light edge filters were used. For each cross-correlation we calculated the average value. To emulate the computations of MotionNet xy, only positive and valid cross-correlation values were included.
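A simplified simulation of this kind can be sketched in Python with NumPy. The sketch below is an interpretation of the description rather than the authors' script: it assumes x runs along the first array axis, that "valid" means only fully overlapping filter positions are used, and that the rectified average is the mean of the positive cross-correlation values; all function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def edge_sequence(size, shift, polarity=1, reverse=False):
    """Two-frame [x, y, t] sequence of a vertical edge that steps by `shift` pixels.

    polarity=1 gives light-dark (white on the left), polarity=-1 gives dark-light.
    reverse=True flips the contrast of the second frame (reverse-phi).
    """
    seq = np.empty((size, size, 2))
    for t, edge in enumerate((size // 2, size // 2 + shift)):
        profile = np.where(np.arange(size) < edge, 1.0, -1.0) * polarity
        seq[:, :, t] = profile[:, None]        # same profile for every y
    if reverse:
        seq[:, :, 1] *= -1.0
    return seq

def valid_cross_correlation(stimulus, kernel):
    """Cross-correlation at every position where the kernel fits inside the stimulus."""
    windows = sliding_window_view(stimulus, kernel.shape)
    return (windows * kernel).sum(axis=(-3, -2, -1))

def rectified_mean(corr):
    """Average of the positive cross-correlation values (0 if there are none)."""
    positive = corr[corr > 0]
    return float(positive.mean()) if positive.size else 0.0

for shift in (1, 2, 3):
    for name, reverse in (("phi", False), ("reverse-phi", True)):
        stim = edge_sequence(16, shift, reverse=reverse)
        # Filter bank: rightward and leftward filters, each in both edge polarities
        same = np.mean([rectified_mean(valid_cross_correlation(
            stim, edge_sequence(8, shift, pol))) for pol in (1, -1)])
        opposite = np.mean([rectified_mean(valid_cross_correlation(
            stim, edge_sequence(8, -shift, pol))) for pol in (1, -1)])
        print(shift, name, same, opposite)
```

Comparing the `same` and `opposite` values across displacements for the two stimulus types gives a rough analogue of the filter-bank analysis described in the text.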

Spatiotemporal tuning properties
To compare the properties of V1 and MT units that emerged within MotionNet xy to those of V1 and MT neurons in biological systems, we extracted neurophysiological data of owl monkey V1 neurons from Figure 9A and Figure 10A of O'Keefe, Levitt, Kiper, Shapley, and Movshon (1998) and re-analyzed data of macaque MT neurons from Wang and Movshon (2016). To establish the spatial and temporal frequency tuning preferences of MotionNet xy V1 and MT units we tested the network with drifting sinewave gratings. The direction and spatiotemporal tuning preference of each unit was determined as the stimulus movement direction, spatial frequency, and temporal frequency that produced maximal activity (Figures 4a-c, right). Sixteen directions (linearly spaced between 0° and 360°), 10 spatial frequencies (logarithmically spaced between 8 and 25 pixels/cycle), and 10 temporal frequencies (logarithmically spaced between 4 and 25 cycles/frame) were tested, resulting in 1600 (16 × 10 × 10) stimulus types. For each stimulus type, we computed the average activation across 32 gratings with evenly spaced starting phases between 0° and 360°.
To assess the input from the V1 layer to MT units tuned to different speeds, we first established the preferred speed of MT units $\rho_{MT}$ with:

$$\rho_{MT} = \frac{tf_{MT}}{sf_{MT}} \quad (8)$$

where $sf_{MT}$ and $tf_{MT}$ denote the preferred spatial and temporal frequency of the MT unit. Then, for each V1 unit, we established the MT unit to which it was maximally connected and used a median split to separate the V1 units into those maximally connected to MT units that preferred slower or faster speeds. Finally, we compared the preferred spatial and temporal frequency tuning of these distributions (Figures 4d-e).
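The median-split analysis can be sketched as below. This is a hedged illustration rather than the analysis code: it assumes that preferred speed is temporal frequency divided by spatial frequency (with frequencies expressed in cycles/frame and cycles/pixel), that the V1-to-MT connectivity has been summarized as one weight per V1 unit and MT unit, and that "maximally connected" means the largest weight. The function name and the small example arrays are hypothetical.

```python
import numpy as np

def split_v1_by_mt_speed(weights, mt_sf, mt_tf):
    """Median-split V1 units by the preferred speed of their most strongly connected MT unit.

    weights: array (n_v1, n_mt) of summarized V1 -> MT weights (assumed form)
    mt_sf, mt_tf: preferred spatial / temporal frequency of each MT unit
    Returns two boolean masks over V1 units: (slow_group, fast_group).
    """
    mt_speed = np.asarray(mt_tf) / np.asarray(mt_sf)    # assumed: speed = tf / sf
    best_mt = np.argmax(weights, axis=1)                # MT unit with the largest weight
    v1_speed = mt_speed[best_mt]
    fast = v1_speed > np.median(v1_speed)
    return ~fast, fast

# Toy example: 4 V1 units, 2 MT units (values are illustrative only)
weights = np.array([[0.9, 0.1],
                    [0.2, 0.8],
                    [0.7, 0.3],
                    [0.1, 0.6]])
slow, fast = split_v1_by_mt_speed(weights, mt_sf=[0.1, 0.05], mt_tf=[0.1, 0.2])
print(slow, fast)
```

The preferred spatial and temporal frequencies of the V1 units in each mask can then be compared between the two groups.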
To demonstrate the consistency in training outcomes, we trained 10 networks and in Figure 4 present the mean values with error bars showing standard deviation.

Separable and covariate spatiotemporal tuning properties
To compare the separable spatial/temporal-frequency and speed-selectivity of MotionNet xy's units to those of neurons in macaque MT (extracted and replotted neurophysiological data from Figures 5b to 5d of Priebe et al., 2003), we measured the activity of V1/MT units in response to sinewave gratings moving in their preferred direction at six spatial frequencies (logarithmically spaced between 8 and 33 pixels/cycle) and six temporal frequencies (logarithmically spaced between 4 and 500 cycles/frame), resulting in 36 (6 × 6) stimulus types. This method yielded spectral response maps for each V1/MT unit in the network. We used the method described by Perrone and Thiele (2001) to fit a two-dimensional Gaussian model to the spectral response maps according to the following equation:

$$G(x, y) = p + A \exp\left( -\left( a (x - x_0)^2 + 2 b (x - x_0)(y - y_0) + c (y - y_0)^2 \right) \right) \quad (9)$$

where G(x, y) denotes the unit response at location (x, y), p is a constant offset, A is the amplitude of the peak, $(x_0, y_0)$ is the location of the center of the peak, and a, b, and c are positive-definite and defined as

$$a = \frac{\cos^2\theta}{2\sigma_x^2} + \frac{\sin^2\theta}{2\sigma_y^2}, \quad b = -\frac{\sin 2\theta}{4\sigma_x^2} + \frac{\sin 2\theta}{4\sigma_y^2}, \quad c = \frac{\sin^2\theta}{2\sigma_x^2} + \frac{\cos^2\theta}{2\sigma_y^2} \quad (10)$$

where $\theta$ denotes the orientation of the peak, and $\sigma_x$ and $\sigma_y$ indicate the width of the peak in x and y dimensions, respectively. To classify the units as independently tuned to spatial/temporal frequency, speed tuned, or unclassified, we used the method described by Priebe et al. (2003); that is, we compared the correlation of each unit's spectral response map to the model fit described in Equation (9) where the orientation is either zero (independent tuning) or at an angle that aligns the peak to the origin (speed tuning). Using these values, we performed the same assay as was conducted to determine the component- and pattern-motion selectivity to establish their independent and speed selectivity (Figure 5d).
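The two-dimensional Gaussian model can be written as a small Python function. The sketch assumes a commonly used parameterization of a rotated elliptical Gaussian (the sign convention of the cross term depends on the direction in which the orientation is measured, so it may differ from the one used in the original fits); the function name `gaussian_2d` is hypothetical.

```python
import math

def gaussian_2d(x, y, p, A, x0, y0, sigma_x, sigma_y, theta):
    """Rotated elliptical 2D Gaussian: offset p, amplitude A, center (x0, y0).

    theta is the orientation of the peak in radians; theta = 0 gives tuning that
    is separable along the x and y axes.
    """
    a = math.cos(theta) ** 2 / (2 * sigma_x ** 2) + math.sin(theta) ** 2 / (2 * sigma_y ** 2)
    b = -math.sin(2 * theta) / (4 * sigma_x ** 2) + math.sin(2 * theta) / (4 * sigma_y ** 2)
    c = math.sin(theta) ** 2 / (2 * sigma_x ** 2) + math.cos(theta) ** 2 / (2 * sigma_y ** 2)
    dx, dy = x - x0, y - y0
    return p + A * math.exp(-(a * dx * dx + 2 * b * dx * dy + c * dy * dy))

# At the center of the peak the response equals offset plus amplitude
peak = gaussian_2d(1.0, 2.0, p=0.1, A=2.0, x0=1.0, y0=2.0,
                   sigma_x=0.5, sigma_y=1.5, theta=0.3)
print(peak)
```

Fixing theta at zero corresponds to the independent-tuning model, whereas setting theta from the location of the peak corresponds to the speed-tuned model; in either case the remaining parameters would be fitted to each unit's spectral response map.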
To compare the distribution of speed selectivity among MT units in MotionNet xy to that among MT neurons, we projected the values shown in Figure 5a and Figure 5b along the diagonal to establish a unified estimate of speed selectivity for each unit (Figures 5c, 5d). To assess the relationship between pattern-motion and speed selectivity of MotionNet xy units we computed the Pearson correlation between pattern and speed indices of MT units (Figure 5e). In line with previous neurophysiological work (Priebe et al., 2003), units that were unclassified in both dimensions were omitted from the correlation analysis. To demonstrate the consistency in training outcomes, we trained 10 networks and in Figure 5 present the values of all 10 networks.

Speed opponency
To compare the direction discrimination performance of MotionNet xy at varying levels of motion coherence to neurophysiological recordings from macaque (extracted and replotted neurophysiological data from Figures 9a and 11a of Mikami et al., 1986), we measured individual MT unit activity in response to dot motion stimuli (dot pixel value, 1.0; background pixel value, −1.0; dot radius, 4 pixels) moving in either the preferred or nonpreferred direction at eight logarithmically (base 2) spaced speeds between the minimum (0.8 pixels/frame) and maximum (3.8 pixels/frame) speeds used to train the network.

Figure caption (fragment): ... Mikami, Newsome, and Wurtz (1986) showing the response of two MT neurons to a dot moving either in its preferred or non-preferred direction over a range of speeds. (a, b, right) The same as (a, b, left), but for the responses of selected MotionNet xy MT units.

Motion coherence
To compare the direction discrimination performance of MotionNet xy at varying levels of motion coherence to neurophysiological and psychophysical recordings from macaque (extracted and replotted neurophysiological/psychophysical data from Figures 4 and 6 of Britten et al., 1992), we measured the direction estimates of the network in response to dot motion stimuli. Dot motion stimuli consisted of 333 randomly positioned white dots (pixel value, 1.0; radius, 2 pixels) on a black background (pixel value, -1.0), which were allowed to overlap (with occlusion) and wrapped around the image when their position exceeded the edge. A proportion of the dots moved in the signal direction, while the remaining dots moved in directions randomly sampled from 0° to 360°; all dots moved at 3 pixels/frame. Seven coherence levels were tested, logarithmically spaced between 0.001 and 0.2. For each coherence level, 100 trials were performed and estimates within ±90° of the signal direction were considered correct. In line with Britten et al. (1992), we fit a Weibull function to the mean performance to estimate the threshold. Using a similar approach, we compared the speed estimates of MotionNet xy at varying levels of motion coherence with psychophysical data from humans (extracted and replotted psychophysical data from Figure 8b of Schütz et al., 2010). For this test, dot motion stimuli consisted of 10 randomly positioned dots, and we used five linearly spaced coherence levels between 0.2 and 1.0. To test whether MotionNet xy underestimated the speed of partially coherent dot motion stimuli because of pooling noise and signal, we computed the Pearson correlation between the mean activity of MT units across 10 networks in response to 0% and 100% noise, with the activity in response to 50% noise.
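Scoring a direction estimate as correct requires a circular comparison, because directions wrap around at 360°. The Python sketch below shows one way this could be done; it is illustrative only, the function names are hypothetical, and treating an error of exactly 90° as incorrect is an assumption.

```python
def angular_error(estimate_deg, signal_deg):
    """Smallest absolute angle between two directions, in degrees (0 to 180)."""
    diff = (estimate_deg - signal_deg) % 360.0
    return min(diff, 360.0 - diff)

def proportion_correct(estimates_deg, signal_deg, tolerance_deg=90.0):
    """Fraction of direction estimates that fall within the tolerance of the signal direction."""
    hits = [angular_error(e, signal_deg) < tolerance_deg for e in estimates_deg]
    return sum(hits) / len(hits)

# Estimates of 350 and 10 degrees are both 10 degrees away from a 0 degree signal
estimates = [10.0, 80.0, 100.0, 200.0, 350.0]
print(proportion_correct(estimates, signal_deg=0.0))  # 0.6
```

The proportion correct at each coherence level is the quantity to which a psychometric function, such as the Weibull function mentioned above, would then be fitted.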

Data availability
We performed analyses in Python using standard packages for numeric and scientific computing. All the code and data used for model optimization, and implementations of the optimization are freely and openly available at repository.cam.ac.uk/handle/1810/317333.

Network architecture and training
We created an artificial system, which we refer to as "MotionNet xy ", tasked with decoding the velocity of image sequences (Figure 1a). The network input comprised a sequence of image frames (x-y) depicting a scene moving through time (t). This was convolved with three-dimensional kernels (x-y-t). The resultant activity was then passed to a dense layer of units. Finally, the activity of the dense layer was read out by two output units, to produce estimates of horizontal (v x ) and vertical velocity (v y ). We referred to the convolutional and subsequent dense layer as V1 and MT, respectively, as their hierarchy was analogous to their namesake in biological systems.
We trained MotionNet xy to decode the velocity of natural images moving at a range of speeds (0.8-3.8 pixels/frame) and directions (0-360°); image sequences resembled viewing a translating natural image through a window. After training, there was a high correlation between the network's estimates and the velocity of novel motion sequences (v x , r = .89; v y , r = .93). V1 units were initialized with Gaussian noise, but after training they resembled (Figure 1b) receptive fields in primary visual cortex (Movshon et al., 1978;Rust, Schwartz, Movshon, & Simoncelli, 2005). However, unlike spatiotemporal receptive fields of neurons in V1, the receptive fields of MotionNet xy 's V1 units do not gradually decline in amplitude as a function of time. This is likely because the image sequences used to train the network consisted of constant rigid motion; it is possible that localized receptive fields would emerge if image sequences containing localized motion were used during training.

Component- and pattern-motion selectivity
To judge an object's movement, motion signals must be integrated across the stimulus as local motions are often ambiguous ("the aperture problem"). Experimental tests of motion integration often use plaid patterns composed of two sinewave components (Figure 2a). The individual components can move in different directions from the overall plaid (Movshon et al., 1986) and V1 neurons signal motion of the components (Gizzi, Katz, Schumer, & Movshon, 1990; Movshon et al., 1986). For example, the V1 neuron shown in Figure 2b responds most strongly to a leftwards moving grating; but when shown a plaid, it responds most strongly to motion above or below leftwards such that one of the component gratings moves leftwards. By contrast, some MT neurons show pattern-motion selectivity (Figure 2b, bottom), responding to the plaid's features rather than the individual components. The response of a neuron to sinewave and plaid stimuli can be used to classify it as either component- or pattern-motion selective. Applying this classification to a population of neurons shows that V1 neurons are exclusively component-motion selective, whereas MT contains a mixture of neurons selective to component and pattern motion (Figure 2b). We applied the same analysis to the units of MotionNet xy and found a similar pattern of results (Figure 2c).
We previously showed that a similar pattern of selectivity emerged in a neural network ("MotionNet") trained to make discrete velocity classifications (Rideaux & Welchman, 2020); however, these results differed from biological findings in that MT units were exclusively pattern-motion selective (rather than containing a mixture of selectivity; Figure 2d). This is likely because in the previous network, which performed discrete velocity classifications, MT units were constrained to represent specific velocities. By comparison, units in the MT layer of this network, like the units in V1, were unconstrained and could form characteristics that best served the output regression layer. As a result, here we found a pattern of selectivity that more closely resembled that found in biological systems (Figure 2e): V1 units were component-motion selective whereas units in the MT layer had a mixture of component- and pattern-selectivity (Figure 2f). A possible explanation for the emergence of component-motion selective units in MT, rather than uniform pattern-motion selectivity, is that these units provide better direction estimates of simple motion, such as a bar of light, than pattern-motion selective units. Consistent with this explanation, we found that although the tuning curves of component-motion selective units were broader than those of pattern-motion selective units in response to plaid stimuli, they were narrower in response to grating stimuli (Figure 2g). Thus, by populating MT with both component- and pattern-motion selective units, the network can achieve more accurate direction estimation of both simple and complex images.
How are signals transformed between V1 and MT layers? A popular model of motion processing proposed a readout scheme from V1 to MT that followed a von Mises distribution, with the maximum excitatory connections between V1 and MT units of the same direction preference (Rust, Mante, Simoncelli, & Movshon, 2006). By contrast, we previously found that the pattern of weights between MotionNet's V1 and MT formed a bimodal distribution when aligned by the preferred V1 unit's direction (Figure 2h, black circles), which resembled the shape found when weights were aligned by the direction of maximum inhibition (Figure 2h, blue squares), whereas aligning by the direction of maximum excitation produced a second-derivative Gaussian distribution (Figure 2h, red triangles). In support of our previous finding, we measured the weights between V1 and MT of MotionNet xy and found the same pattern of results (Figure 2i). However, here we found the readout weights were more balanced between inhibition and excitation and more sharply tuned (especially in the case of alignment to maximum excitation). This is likely due to differences in the architecture required to support classification (MotionNet) compared to that required for regression (MotionNet xy); however, the sharper tuning may also reflect a more diverse MT layer.

Reverse-phi motion
The direction selectivity of neurons can be dramatically altered, as in the case of "reverse-phi" motion, in which the contrast of images in a sequence is reversed between frames (Figure 3a). Perceptually this leads to the impression of movement in the opposite direction from true movement (Anstis, 1970). It has been shown that neurons in V1 and MT exhibit inverted preferences in this situation, such that they respond maximally to reverse-phi stimuli moving in the non-preferred direction (Duijnhouwer & Krekelberg, 2016; Figures 3b, 3c, left). We found that the responses of MotionNet xy's V1 and MT units were similarly reversed for reverse-phi stimuli (Figures 3b, 3c, right). It is encouraging to see that the network recapitulates this well-known phenomenon, but how does it estimate the speed of these stimuli? We tested the network with phi and reverse-phi motion stimuli over a range of displacement speeds. We found that for phi motion, the network consistently underestimated the speed of stimuli, which is likely because the network was trained on motion sequences comprising six frames, whereas our phi stimuli comprised only two (Figure 3d, cyan markers). By contrast, we found that the speed of reverse-phi stimuli was overestimated for low displacement speeds and underestimated for high speeds (Figure 3d, orange markers). Some evidence for the same pattern of behavior in humans has previously been found (Parthasarathy, 2019; Ruda et al., 2016), but more work is needed to explicitly investigate this phenomenon in biological systems.
To understand why this phenomenon occurs in the network, we measured the activity of V1 and MT units tuned to either the displacement direction (+v_x) or the opposite-to-displacement direction (−v_x), in response to phi and reverse-phi motion at different speeds (Figures 3e, 3f). For phi motion, the activity of the V1 +v_x subpopulation stays approximately the same as speed is increased, while that of the V1 −v_x subpopulation is reduced. This increasing difference in activity between subpopulations of V1 units is propagated to the MT units to produce a divergent pattern of activity. As the difference between the subpopulations' responses increases, the balance of activity shifts toward the displacement direction, evoking a faster estimate of speed in this direction (Figure 3d, cyan markers). This pattern of responses is consistent with our previous work (Rideaux & Welchman, 2020), where we showed that low-speed motion sequences moving in different directions are highly correlated; thus their directions are less distinguishable than those of high-speed sequences.
The responses evoked by reverse-phi are markedly different. First, as expected from evidence of the reversal of direction selectivity, the V1 −v_x subpopulation is more active than the V1 +v_x subpopulation. Second, the activity of both V1 subpopulations is lower than that seen for phi motion at the slowest speed and increases with displacement speed. This reflects the evolutionary adaptation of receptive fields to frequently occurring (phi) motion compared with infrequent (reverse-phi) motion. Finally, both subpopulations increase at approximately the same rate, so the relative difference between their activity reduces with displacement speed. To explain why this occurs, we simulated a simplified version of the phenomenon in which we measured the cross-correlation between a phi and a reverse-phi edge stimulus at three displacement speeds with four spatiotemporal filters tuned to leftward and rightward displacement with either light-dark or dark-light polarity arrangement (Figure 3g, left). At the lowest displacement speed (v_x = 1), the cross-correlation for reverse-phi is both attenuated and reversed compared to the cross-correlation for phi (Figure 3g, right). However, the relative difference between the cross-correlation for −v_x and +v_x filters is larger for reverse-phi. With increasing displacement (v_x = 2 and v_x = 3), the relative difference between −v_x and +v_x filters increases for phi, while decreasing for reverse-phi.
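The logic of this simplified simulation can be illustrated with a toy version. The sketch below is an assumption-laden stand-in and not the simulation behind Figure 3g: it uses two-frame, one-dimensional step edges, binary edge templates as "filters" tuned to a displacement of one pixel left or right, and a normalised dot product in place of a full cross-correlation. The frame width, edge position, and function names are all hypothetical.

```python
import numpy as np

N = 32        # pixels per frame (assumed)
CENTER = 16   # edge position in the first frame (assumed)

def edge(pos, polarity=1.0):
    """One frame containing a step edge at `pos`; polarity flips its contrast."""
    x = np.arange(N)
    return polarity * np.where(x >= pos, 1.0, -1.0)

def stimulus(v, reverse=False):
    """Two-frame sequence: the edge moves v pixels; reverse-phi flips frame 2."""
    return np.stack([edge(CENTER), edge(CENTER + v, -1.0 if reverse else 1.0)])

def direction_response(stim, d):
    """Best normalised match over both polarities of a filter tuned to displacement d."""
    matches = []
    for polarity in (1.0, -1.0):
        filt = np.stack([edge(CENTER, polarity), edge(CENTER + d, polarity)])
        matches.append(np.sum(stim * filt) / stim.size)
    return max(matches)

def relative_difference(v, reverse):
    """(rightward - leftward) / (rightward + leftward) filter response."""
    stim = stimulus(v, reverse)
    right = direction_response(stim, +1)
    left = direction_response(stim, -1)
    return (right - left) / (right + left)

for v in (1, 2, 3):
    print(f"v_x={v}: phi={relative_difference(v, False):+.3f}, "
          f"reverse-phi={relative_difference(v, True):+.3f}")
```

Under these assumptions the toy reproduces the qualitative pattern described above: the sign of the relative difference flips for reverse-phi, its magnitude is largest at the smallest displacement and shrinks as displacement grows, whereas for phi it grows slightly with displacement.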

Spatiotemporal tuning distributions and connections
In biological visual systems, the tuning of neurons in V1 and MT to spatial and temporal frequency follows a log-normal distribution (O'Keefe et al., 1998; Wang & Movshon, 2016; Figures 4a-4d, left). Similarly, we found that the preferred spatial and temporal frequencies of V1 and MT units in MotionNet xy also followed a log-normal distribution (Figures 4a-4d, right). Speed is determined by the ratio of temporal to spatial frequency, meaning that different combinations of spatial and temporal frequencies could be used to achieve the same speed selectivity. For example, the same speed could be produced by a combination of low spatial and low temporal frequency, or high spatial and high temporal frequency. How might this be implemented in terms of the readout of V1 activity by speed-selective MT units? We established the preferred speed to which MotionNet xy's MT units were tuned and separated these into "low" or "high" speed groups using a median split. We then compared the spatiotemporal tuning distributions of the V1 units to which each group was maximally connected, that is, those with the highest positive weights (Figures 4e, 4f). We found that compared to MT units tuned to fast speeds, slow-tuned units primarily received input from V1 units tuned to high spatial frequency and low temporal frequency. These results are consistent with work showing that the preferred speed of macaque MT neurons, as measured using dot motion stimuli, is negatively correlated with their preferred spatial frequency and positively correlated with their preferred temporal frequency (Priebe et al., 2003).
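The median-split readout analysis can be sketched as follows. This is an illustrative outline only: the arrays are random placeholders standing in for the network's measured tuning preferences and V1-to-MT weights, the variable names are hypothetical, and taking the single largest positive weight per MT unit is one assumed reading of "maximally connected".

```python
import numpy as np

rng = np.random.default_rng(0)
n_v1, n_mt = 64, 20  # hypothetical layer sizes

# Placeholder data; in practice these would be measured from the trained network.
v1_sf = rng.lognormal(0.0, 0.5, n_v1)        # preferred spatial frequency per V1 unit
v1_tf = rng.lognormal(1.0, 0.5, n_v1)        # preferred temporal frequency per V1 unit
mt_speed = rng.lognormal(1.0, 0.5, n_mt)     # preferred speed per MT unit
weights = rng.standard_normal((n_mt, n_v1))  # MT-by-V1 readout weights

def speed_from_frequencies(tf, sf):
    """Speed of a drifting grating: temporal frequency divided by spatial frequency."""
    return tf / sf

# Median split of MT units into slow- and fast-tuned groups.
slow = mt_speed <= np.median(mt_speed)

# For each MT unit, the V1 unit providing its strongest excitatory input.
strongest_v1 = np.argmax(weights, axis=1)

for label, group in (("slow", slow), ("fast", ~slow)):
    idx = strongest_v1[group]
    print(f"{label}-tuned MT: median input SF={np.median(v1_sf[idx]):.2f}, "
          f"median input TF={np.median(v1_tf[idx]):.2f}")
```

Because the placeholder data are random, no systematic difference between the groups is expected here; with the network's actual weights and preferences, the comparison would test whether slow-tuned units draw on higher spatial and lower temporal frequencies.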

Separable and covariate spatiotemporal tuning
Just as neurons can be classified according to their direction selectivity (i.e., component-/pattern-motion), they can be classified by their spatiotemporal selectivity.
In particular, neurophysiological evidence shows that V1 neurons are separately tuned to either spatial or temporal frequency. That is, they respond most strongly to a particular spatial frequency, regardless of the temporal frequency, or vice versa (Foster, Gaska, Nagler, & Pollen, 1985; Priebe, Lisberger, & Movshon, 2006; Tolhurst & Movshon, 1975). By contrast, some MT neurons are tuned to object speed, such that their sensitivity to spatial frequency is dependent on temporal frequency (Perrone & Thiele, 2001; Priebe et al., 2003). To identify whether a neuron has separable tuning or speed tuning, its response can be measured for a range of spatial and temporal frequencies. If the neuron has separable spatiotemporal tuning, the peak responses will align either horizontally or vertically with a particular spatial or temporal frequency (Figure 5a, top). By comparison, if a neuron is tuned to speed, the peak responses will extend radially from the origin, with the angle indicating the speed to which the neuron is tuned (Figure 5a, bottom). The fit of a two-dimensional Gaussian that is aligned either cardinally (horizontally/vertically) or radially to this activity can be used to quantitatively classify neurons as either separable or speed tuned (Figure 5a, left). That is, in the same way as the response of a unit to plaid stimuli can be classified as component- or pattern-motion selective based on its alignment to the plaid versus sinewave directions, we can use the radial versus cardinal alignment of a unit's responses to different spatial and temporal sinewaves to classify it as either separable or speed tuned. We performed this classification analysis on the V1 and MT units in MotionNet xy and found that, in line with biological systems (Priebe et al., 2003), V1 units were separably tuned, whereas MT units showed a mixture of separable and speed tuning (Figure 5b).
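The distinction between separable and speed tuning can be made concrete with a toy response map. The sketch below is a simplification and not the Gaussian-fitting procedure described above: it works in log2 frequency coordinates, where speed tuning appears as a ridge of unit slope rather than a radial line, and it estimates how strongly the preferred temporal frequency shifts with spatial frequency. The parameter `xi`, the frequency grids, and the function names are assumptions introduced for illustration.

```python
import numpy as np

log_sf = np.arange(-2, 3, dtype=float)  # log2 spatial frequencies (assumed grid)
log_tf = np.arange(-4, 5, dtype=float)  # log2 temporal frequencies (assumed grid)

def response_map(xi, sigma_sf=1.0, sigma_tf=1.0):
    """Toy 2D Gaussian tuning: xi=0 is separable, xi=1 is speed tuned."""
    sf, tf = np.meshgrid(log_sf, log_tf, indexing="ij")
    preferred_tf = xi * sf  # preferred TF shifts with SF when xi > 0
    return (np.exp(-sf**2 / (2 * sigma_sf**2))
            * np.exp(-(tf - preferred_tf)**2 / (2 * sigma_tf**2)))

def tuning_slope(resp):
    """Slope of preferred log2 TF against log2 SF (about 0 separable, about 1 speed tuned)."""
    peak_tf = log_tf[np.argmax(resp, axis=1)]
    return np.polyfit(log_sf, peak_tf, 1)[0]

print("separable unit slope:  ", round(tuning_slope(response_map(0.0)), 3))
print("speed-tuned unit slope:", round(tuning_slope(response_map(1.0)), 3))
```

Intermediate slopes would correspond to units lying between the two extremes, which is one way to picture a continuum of speed selectivity.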
Just as is observed in macaque (Figure 5c), we found a diverse range of MT units that were component-/pattern-motion selective and showed separable/speed tuning (Figure 5d). It is possible that direction and speed selectivity properties are related among MT units; that is, a unit selective for complex direction (pattern-motion) may be more likely to be selective for complex speed. We tested this in MotionNet xy and found a positive correlation between the pattern and speed indices of MT units (n = 568, Pearson r = .72, p = 1.9 × 10^−93; Figure 5e).

How do motion signals interfere with each other?
We next considered situations in which motion signals can degrade or interfere with each other. First, we tested how the response to a moving dot pattern is affected by superimposing dots moving at different speeds. Biological visual systems exhibit inhibitory mechanisms that are thought to reduce noise and sharpen activity in response to visual features. For instance, experimenters have presented moving dot patterns and then overlaid dots moving in a different direction. V1 neurons are not substantially affected by this manipulation; however, MT neurons show direction opponency and are suppressed by dots moving in a non-preferred direction (Qian & Andersen, 1994; Rust et al., 2006; Snowden, Treue, Erickson, & Andersen, 1991). We previously found comparable responses within a neural network trained to classify image velocity (Rideaux & Welchman, 2020). However, MT neurons also exhibit speed opponency and are suppressed by dots moving at a non-preferred speed (Mikami et al., 1986; Figures 6a, 6b, left). We tested whether this noise reduction mechanism was also present in MotionNet xy and found the same pattern of responses among MT units (Figures 6a, 6b, right).
We then tested MotionNet xy with random dot stimuli that have been widely used to study motion. Using these stimuli, it is possible to precisely titrate the relationship between dots moving in a particular direction (the signal) and dots moving in a randomly chosen direction (noise). We tested the ability of MotionNet xy to correctly estimate the direction of motion by varying the proportion of signal and noise dots in the stimulus (Figure 7a). Like individual neuronal responses (Britten, Shadlen, Newsome, & Movshon, 1992; Figure 7b) and macaque monkey psychophysical judgments (Figure 7c, blue markers), we found graceful degradation in estimates of motion direction (Figure 7c, red markers). We showed that reducing motion coherence reduces the accuracy of direction estimates, but how are speed judgements influenced? Previous psychophysical evidence shows that humans underestimate the speed of dot motion with reduced coherence (Schütz et al., 2010;Figure 7d, orange markers). We tested how MotionNet xy estimated the speed of dot motion at different coherence levels and found the same pattern of results (Figure 7d, cyan markers).
As the directions of noise dots are uniformly distributed around 360°, the average velocity of the noise is zero. The underestimation of the speed of partially coherent dot motion stimuli appears to adhere to a linear trend that is equal to the weighted average of noise (zero) and signal (nonzero) speed, where the weights are equal to the proportions of noise and signal dots. Thus, a possible explanation for this bias is that it is produced by pooling of noise and signal by the network. We reasoned that if the bias is produced by pooling of noise and signal, then we would expect the response of the network to 50% coherence motion to be similar to the pooled responses to 0% and 100% coherence. Consistent with this explanation, we found that the average activity of MT units in response to 50% coherence motion could be predicted with high accuracy by averaging their responses to 0% and 100% coherence (n = 64, Pearson r = .99, p = 1.4 × 10^−50; Figure 7e).
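The weighted-average argument can be checked with a small simulation of the stimulus itself. This is a sketch of the arithmetic only, not of the network: it assumes that all dots share one speed, that signal dots move rightwards, and that pooling amounts to averaging the dots' velocity vectors. The function name and dot count are hypothetical.

```python
import numpy as np

def pooled_velocity(coherence, speed, n_dots=100_000, seed=0):
    """Mean velocity vector of a dot field with the given signal proportion."""
    rng = np.random.default_rng(seed)
    n_signal = int(round(coherence * n_dots))
    directions = np.zeros(n_dots)  # signal dots move rightwards (0 rad)
    # Noise dots take directions drawn uniformly from the full circle.
    directions[n_signal:] = rng.uniform(0.0, 2.0 * np.pi, n_dots - n_signal)
    vx = speed * np.cos(directions)
    vy = speed * np.sin(directions)
    return vx.mean(), vy.mean()

for coherence in (1.0, 0.5, 0.25):
    vx, vy = pooled_velocity(coherence, speed=4.0)
    print(f"coherence={coherence:.2f}: pooled vx={vx:.2f}, "
          f"expected {coherence * 4.0:.2f}")
```

Because the noise velocities average to approximately zero, the pooled horizontal velocity is close to the coherence multiplied by the signal speed, which is the linear trend described above.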

Discussion
The ability to see movement underpins adaptive behaviors ranging from depth estimation to navigation and grasping. Here we explore and explain the neural computations that support motion estimation in biological systems by investigating the structures that emerge in an artificial system trained to estimate the velocity of image sequences. Using complete access to the artificial system, we reveal aspects of the neural architecture that instantiates motion estimation, producing concrete predictions for future empirical study. Specifically, we show that (i) the network overestimates the speed of slow reverse-phi motion while underestimating the speed of fast reverse-phi motion because of the correlation between reverse-phi motion and the spatiotemporal receptive fields tuned to motion in opposite directions, (ii) compared to MT units tuned to fast speeds, those tuned to slow speeds primarily receive input from V1 units tuned to high spatial frequency and low temporal frequency, (iii) there is a positive correlation between the pattern-motion and speed selectivity of MT units, and (iv) the network recapitulates human underestimation of low coherence motion stimuli, which is explained by pooling of noise and signal motion.
Reverse-phi motion is perceived as moving in the opposite direction to the actual movement (Anstis, 1970). The manner in which this image manipulation influences the preferred direction of neurons and the perceived direction of movement has been documented (Duijnhouwer & Krekelberg, 2016). Here we show that in addition to these effects related to direction, this manipulation may also produce biases in perceived speed. Furthermore, we lay bare the computational mechanism that explains this new phenomenon, namely the similarity between reverse-phi motion and the receptive fields of spatiotemporal units tuned to opposite velocities. Although some behavioral evidence for this bias has previously been documented (Parthasarathy, 2019; Ruda et al., 2016), future psychophysical and neurophysiological work is needed to directly test these predictions.
We previously showed that multiple physiological and psychophysical phenomena in motion processing are recapitulated by a network trained to classify the velocity of moving image sequences (Rideaux & Welchman, 2020). For example, we found that the anisotropic distribution of direction preferences in units in a layer representing V1 matched that of neurons in mouse V1. Here we found that the distribution of spatial and temporal frequency tuning also matched that found in macaque V1 and MT (i.e., log-normal distribution of neuronal frequency preference). Previous electrophysiological work suggested that the MT neurons tuned to low speeds primarily receive input from V1 neurons tuned to high spatial frequency and low temporal frequency, whereas the opposite pattern of transmission was true for MT neurons tuned to high speed (Priebe et al., 2003). This evidence was based on the activity of MT neurons, because measuring connections and preferences of neurons across cortical regions on a sufficiently large scale is beyond the limitations of current biological techniques. By contrast, this analysis is made possible within the artificial system, and we find evidence consistent with previous hypotheses: slow-tuned MT units receive more input from high spatial and low temporal frequency V1 units than fast-tuned MT units.
Considerable work has been undertaken to understand how the properties of spatiotemporal neurons in MT are distinguished from those in V1, as this knowledge can provide insight into the hierarchical computations that underlie motion processing. Neurons can be classified by their direction selectivity (i.e., component-/pattern-motion) or spatiotemporal selectivity (i.e., separable/speed). V1 only contains neurons selective for component-motion and separable spatiotemporal frequencies, while neurons selective for pattern-motion and speed are found in MT. This dichotomy supports the notion that "simple" motion signals from V1 are pooled in MT, yielding selectivity for more "complex" signals. However, neurophysiological work shows that the selectivity of many MT neurons is indistinguishable from that of neurons in V1. We found the same pattern of results for MotionNet xy: the MT layer comprised a mixture of units tuned to component- and pattern-motion, and to separable spatiotemporal frequency and speed. We further showed that component-motion selectivity in MT is likely retained to preserve sensitivity for simple image motion, such as a bar of light.
Our results indicate that rather than MT units either being separately tuned to a particular spatial/temporal frequency or being tuned to speed, the distribution of speed selectivity in MT reflected a continuum along this dimension. This tuning diversity is consistent with neurophysiological evidence from macaque (Priebe et al., 2003). We also found a positive relationship between direction and speed selectivity of MT units, indicating that units tuned to complex motion signals in one domain (e.g., direction) were more likely to be tuned to complex signals in the other (e.g., spatiotemporal). Given that the complexity of the selectivity for both direction and speed is derived from the same characteristic, i.e., diversity of connection weights between V1 and MT, it seems reasonable to expect that these properties would be related. However, previous neurophysiological work did not find evidence for this relationship in macaque (Priebe et al., 2003).
A possible explanation for this conflict is that there was an insufficient range of speed selectivity in the neurophysiological sample to detect the relationship. In our data, we recorded units ranging almost the entire speed selectivity continuum, whereas the neurophysiological data accounted for approximately half this range (possibly due to noise within the biological system reducing the effectiveness of the classification technique). More neurophysiological work is needed to test this possibility.
We previously demonstrated that the tendency for humans to underestimate the speed of objects moving at low visibility could be explained by the lawful relationship between spatiotemporal contrast and speed in natural image sequences, rather than exposure to a non-uniform distribution of motion speeds in the environment, that is, the "slow-world" bias (Rideaux & Welchman, 2020). There have been multiple psychophysical demonstrations of the bias under conditions of reduced contrast (Hürlimann, Kiper, & Carandini, 2002;Sotiropoulos, Seitz, & Seriès, 2014;Vintch & Gardner, 2014;Weiss, Simoncelli, & Adelson, 2002); however, there is also evidence that humans underestimate the speed of dot motion stimuli with reduced signal coherence (Schütz et al., 2010). This could be interpreted as evidence for the slow-world account, because reducing signal coherence likely reduces estimation certainty. However, we tested MotionNet xy and found the same pattern of results: the network underestimated the speed of dot motion stimuli with reduced signal coherence. Importantly, this phenomenon was an outcome of pooling signal and noise together, and unrelated to the mechanism that produces underestimation of low contrast motion signals.
Using an artificial systems approach, here we explored several aspects of motion processing; however, many avenues remain for future work. There are multiple ways in which the training image sequences could be altered to address remaining questions. For example, image sequences containing localized motion could be used to train the network to determine the influence of using rigid motion on the characteristics that emerge within the network. Alternatively, training images could be initially filtered with kernels representing center-surround receptive fields to represent ganglion inputs to V1. There is also scope to increase the complexity of the network to explore how more complex motion signals are processed. For example, by adding another layer, analogous to MST, future work could explore estimation of complex optic flow, such as rotation.
In recent years, deep neural networks comprising many layers have surpassed human performance on many tasks, for example, object recognition (He, Zhang, Ren, & Sun, 2016; Russakovsky et al., 2015). However, their scale and complexity often obscure inspection, limiting understanding of their internal processes as much as in biological systems. Here, we constrain the size of the artificial system, allowing us to apply in silico electrophysiological techniques that lay bare, and help us to understand, the processes that underlie perceptual (mis)estimation of velocity. We demonstrate how optimizing motion estimation in an artificial network using natural images recapitulates a diverse array of neurophysiological and perceptual phenomena. More importantly, we use this technique to explain the computational basis of existing perceptual phenomena and generate predictions for some yet to be tested.
Keywords: motion perception, neural network, speed and direction, reverse-phi, V1 and MT