Going in circles is the way forward: the role of recurrence in visual inference

Biological visual systems exhibit abundant recurrent connectivity. State-of-the-art neural network models for visual recognition, by contrast, rely heavily or exclusively on feedforward computation. Any finite-time recurrent neural network (RNN) can be unrolled along time to yield an equivalent feedforward neural network (FNN). This important insight suggests that computational neuroscientists may not need to engage recurrent computation, and that computer-vision engineers may be limiting themselves to a special case of FNN if they build recurrent models. Here we argue, to the contrary, that FNNs are a special case of RNNs and that computational neuroscientists and engineers should engage recurrence to understand how brains and machines can (1) achieve greater and more flexible computational depth, (2) compress complex computations into limited hardware, (3) integrate priors and priorities into visual inference through expectation and attention, (4) exploit sequential dependencies in their data for better inference and prediction, and (5) leverage the power of iterative computation.


I. INTRODUCTION
The primate visual cortex uses a recurrent algorithm to process sensory input 1-3 . Anatomically, connectivity is cyclic. Neurons are connected in cycles within their local cortical neighborhood [4][5][6] . Between cortical areas, as well, connections are generally reciprocal 7,8 . Physiologically, the dynamics of neural responses bear temporal signatures indicative of recurrent processing 1, 9,10 . Behaviorally, visual perception can be disturbed by carefully timed interventions that coincide with the arrival of re-entrant information to a visual area [11][12][13][14] . The evidence for recurrent computation in the primate brain, thus, is unequivocal. What is less obvious, however, is why the brain uses a recurrent algorithm.
Although computer vision and computational neuroscience both have a long history of recurrent models [28][29][30][31] , feedforward models have earned a dominant status in both fields. How should we account for this discrepancy between brains and models?
One answer is that the discrepancy reflects the fact that brains and computer-vision systems operate on different hardware and under different constraints on space, time, and energy. Perhaps we have come to a point at which the two fields must go their separate ways. However, this answer is unsatisfying. Computational neuroscience must still find out how visual inference works in brains. And although engineers face quantitatively different constraints when building computer-vision systems, they, too, must care about the spatial, temporal, and energetic limitations their models must operate under when deployed in, for example, a smartphone. Moreover, as long as neural network models continue to dominate computer vision, more efficient hardware implementations are likely to be more similar to biological neural networks than current implementations using conventional processors and graphics processing units (GPUs).
A second explanation for the discrepancy is that the abundance of recurrent connections in cortex belies a superficial role in neural computation.
Perhaps the core computations can be performed by a feedforward network 32 , while recurrent processing serves more auxiliary and modulatory functions, such as divisive normalization 33 and attention [34][35][36] . This perspective is convenient because it enables us to hold on to the feedforward model in our minds. The auxiliary and modulatory functions let us acknowledge recurrence without fundamentally changing the way we envision the algorithm of recognition.
However, there is a third and more exciting explanation for the discrepancy between recurrent brains and feedforward models: Although feedforward computation is powerful, a recurrent algorithm provides a fundamentally superior solution to the problem of visual inference, and this algorithm is implemented in primate visual cortex. This recurrent algorithm explains how primate vision can be so efficient in terms of space, time, energy, and data, while being so rich and robust in terms of the inferences and their generalization to novel environments.
In this review, we argue for the latter possibility, discussing a range of potential computational functions of recurrence and citing the evidence suggesting that the primate brain employs them. We aim to distinguish established from more speculative, and superficial from more profound forms of recurrence, so as to clarify the most exciting directions for future research that will close the gap between models and brains.

II. UNROLLING A RECURRENT NETWORK
What exactly do we mean when we say that a neural network, whether biological or artificial, is recurrent rather than feedforward? This may seem obvious, but it turns out that the distinction can easily be blurred. Consider the simple network in Fig. 1a. It consists of three processing stages, arranged hierarchically, which we will refer to as areas, by analogy to cortex. Each area contains a number of neurons (real or artificial) that apply fixed operations to their input. Visual input enters in the first area, where it undergoes some transformation, the result of which is passed as input to the second area, and so forth. Information travels exclusively in one direction (the forward direction, from input to output), and so this is an example of a feedforward architecture. Notably, the number of transformations between input and output is fixed, and equal to the number of areas in the network. Now compare this to the architecture in Fig. 1b. Here, we have added lateral and feedback connections to the network. Lateral connections allow the output of an area to be fed back into the same area, to influence its computations in the next processing step. Feedback connections allow the output of an area to influence information processing in a lower area. There is some freedom in the order in which computations may occur in such a network. The order we illustrate here starts with a full feedforward pass through the network. In subsequent time steps, neural activations are updated in ascending order through the hierarchy, based on the activations that were computed in the previous time step.
FIG. 1: [...] (b) The same network with lateral (blue) and feedback (red) connections added, to make it recurrent. (c) "Unrolling" the network in time clarifies the order of its computations. Here, the network is unrolled for three time steps before its output is read out, but we could choose to run the network for arbitrarily more or fewer steps. Areas are staggered from left to right to show the order in which their neural activities are updated. (d) Alternatively, we can unroll the recurrent network's time steps in space, by arranging the areas and connections from different time steps in a linear spatial sequence. Note how all arrows now once again point in the same (forward) direction, from input to output. Throughout panels (a-b), connections that are identical (sharing the same synaptic weights) are indicated by corresponding symbols. (e) If we lift the weight-sharing constraints from the previous network, this induces a deep feedforward "super-model", which can implement the spatially-unrolled recurrent network as a special case. This more general architecture may include additional connections (examples shown as light gray arrows) not present in the spatially-unrolled recurrent net.
This order of operations can be seen more clearly if we 'unroll' the network in time, as shown in Fig. 1c. In this illustration, the network is unrolled for a fixed number of time steps (3). In fact, recurrent processing can be run for arbitrary durations before its output is read out, a notion we will return to later. Notice how this temporally unrolled, small network resembles a larger feedforward neural network with more connections and areas between its input and output. We can emphasize this recurrent-feedforward equivalence by interpreting the computational graph over time as a spatial architecture, and visually arranging the induced areas and connections in a linear spatial sequence, an operation we call unrolling in space (Fig. 1d). This results in a deep feedforward architecture with many skip connections between areas that are separated by more than one level in this new hierarchy, and with many connections that are exact copies of one another (sharing identical connection weights).
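The equivalence between unrolling in time and unrolling in space can be illustrated with a minimal sketch. It is deliberately simpler than the network of Fig. 1 and rests on assumptions that are not part of the text: a single area with lateral recurrence only, a static input fed to every step, a tanh nonlinearity, and random weights. All function names (`area_update`, `run_recurrent`, `unroll_in_space`, `run_feedforward`, `count_weights`) are hypothetical and chosen for illustration.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def area_update(w_in, w_lat, x, h):
    """One update of a single area: h_new = tanh(w_in x + w_lat h)."""
    drive = matvec(w_in, x)
    lateral = matvec(w_lat, h)
    return [math.tanh(d + l) for d, l in zip(drive, lateral)]


def run_recurrent(w_in, w_lat, x, steps):
    """Unrolling in time: the same weights are applied at every step."""
    h = [0.0] * len(w_lat)
    for _ in range(steps):
        h = area_update(w_in, w_lat, x, h)
    return h


def unroll_in_space(w_in, w_lat, steps):
    """Unrolling in space: one feedforward layer per time step, each with
    its own copy of the weights. The copies start out identical."""
    return [([row[:] for row in w_in], [row[:] for row in w_lat])
            for _ in range(steps)]


def run_feedforward(layers, x):
    """Run the unrolled network; the input reaches every layer through
    skip connections."""
    h = [0.0] * len(layers[0][1])
    for layer_w_in, layer_w_lat in layers:
        h = area_update(layer_w_in, layer_w_lat, x, h)
    return h


def count_weights(w_in, w_lat):
    """Number of connection weights stored for one set of matrices."""
    return sum(len(row) for row in w_in) + sum(len(row) for row in w_lat)


# Illustrative sizes and random weights (assumptions, not from the text).
rng = random.Random(0)
n_in, n_units, steps = 4, 3, 3
w_in = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_units)]
w_lat = [[rng.uniform(-1, 1) for _ in range(n_units)] for _ in range(n_units)]
x = [rng.uniform(-1, 1) for _ in range(n_in)]

layers = unroll_in_space(w_in, w_lat, steps)
recurrent_out = run_recurrent(w_in, w_lat, x, steps)
feedforward_out = run_feedforward(layers, x)

recurrent_params = count_weights(w_in, w_lat)
unrolled_params = sum(count_weights(a, b) for a, b in layers)
```

In this sketch the two networks compute the same output, but the unrolled version stores `steps` copies of the weights. Once any copy in `layers` is changed independently of the others, the network is no longer expressible by the recurrent one; this corresponds to lifting the weight-sharing constraint.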
Thus, any finite-time RNN can be transformed into an equivalent FNN. But this should not be taken to mean that RNNs are a special case of FNNs. In fact, FNNs are a special case of finite-time RNNs, comprising those which happen to have no cycles. Moreover, we can always expand an FNN into a range of different RNNs by adding connections that form cycles. More practically, although the set of all FNNs contains a subset that are equivalent to unrolled RNNs in terms of their computational graph (Fig. 2a), not all of these are realistic (Fig. 2b). Realistic networks, here, are networks that conform to the real-world constraints the system operates under. For computational neuroscience, a realistic network is one that fits in the brain of the animal and does not require a deeper network architecture or more processing steps than the animal can accommodate. For computer vision, a realistic network is one that can be trained and deployed on available hardware at the training and deployment stages. For example, there may be limits on the storage and energy available, which would limit the complexity of the architecture and computational graph. A realistic finite-time RNN, when unrolled, can yield an unworkably deep FNN. Although the most widely used method for training RNNs (backpropagation through time) currently requires unrolling, an RNN is not equivalent to its unrolled FNN twin at the stage of real-world deployment.
FIG. 2: [...] Fig. 1a), or equivalently, setting the weights of these connections to zero. Vice versa, any one feedforward network can be expanded to an infinite variety of recurrent networks, by adding lateral or feedback connections. Feedforward networks, then, form an architectural subset of RNNs. In this illustration, we specifically consider RNNs that accomplish their task in a finite number of time steps. These finite-time RNNs (ftRNNs) have the special property that they can be unrolled into equivalent feedforward architectures (a concept we expound in the text). White points connected by black lines illustrate these mutually equivalent architectures. Thus, the feedforward NNs contain a subset of architectures that can be obtained by unrolling a ftRNN. (b) These sets of networks can be further subdivided into subsets that are, or are not, realistic to implement with available computational resources (areas below and above the dotted line, respectively). Very deep networks, or more generally networks with many neurons and connections, require more memory to store, and more computational time to execute and train, and are therefore more demanding to implement. Some realistic ftRNNs remain realistic when unrolled to a feedforward architecture; these are indicated in blue. Others, however, become too complex, when unrolled, to be feasible (exemplified in the figure by an 'unrolling connector' that crosses the realism line). This is because the unrolling operation induces a much deeper architecture with many more neural connections to be stored and learned. These not-realistically-unrollable ftRNNs are especially interesting, since they correspond to recurrent solutions that cannot be replaced by feedforward architectures.
An important recent observation is that the architecture that results from spatially unrolling a recurrent network resembles an architecture of FNNs called Residual Networks (ResNets) 17,37-39. These networks similarly have skip connections and can be very deep. ResNets may form a super-class of models (Fig. 1e), which reduce to recurrent-equivalent architectures when certain subsets of weights are constrained to be identical. Interestingly, when ResNets were trained with such recurrent-equivalent weight-sharing constraints, their performance on computer vision benchmarks was similar to unconstrained ResNets (even though the weight sharing drastically reduces the parameter count and limits the component computations that the network can perform) 37. This is especially noteworthy given that ResNets, and architecturally related DenseNets, are currently among the top-ranking DNNs on prominent computer vision benchmarks 17,40, as well as measures of brain-similarity 27. Today's best artificial vision models, thus, actually implement computational graphs closely related to those of recurrent networks, even though these models are strictly feedforward architectures.

III. REASONS TO RECUR
We have described how a recurrent network can be unrolled into a deep feedforward architecture. The resulting feedforward super-model offers greater computational flexibility, since weight-sharing constraints can be omitted and additional skip connections added to the network (Fig. 1e). So what would be the benefit of restricting ourselves to recurrent architectures? We will first discuss the benefits of recurrence in terms of overarching principles, before considering more specific implementations of these principles.
A. Recurrence provides greater and more flexible computational depth

Recurrence enables arbitrary computational depth
One important advantage of recurrent algorithms is that they can be run for arbitrary lengths of time before their output is collected. We can define computational depth as the maximum path length (i.e. number of successive connections and nonlinear transformations) between input and output. A recurrent neural network (RNN) can achieve arbitrary computational depth despite having a finite count of parameters and being limited to finite spatial components. In other words, it can multiply its limited spatial resources along time. In addition to enabling an arbitrarily deep computation given enough time, an RNN can adjust its computational depth to the task at hand. The computational depth of a feedforward net, by contrast, is a fixed number determined by the architecture. In particular, by adjusting their computational depth, RNNs can gracefully trade off speed and accuracy. This was recently demonstrated by Spoerer et al., who implemented recurrent models that terminate computations when they reach a flexible confidence threshold (defined by the entropy of the posterior, a measure of the model's uncertainty). An RNN could flexibly emulate the performance of different FNNs, with the RNN's accuracy at a given confidence threshold matching the accuracy of an FNN that requires a similar number of floating point operations 21 (Fig. 3). This presents a clear advantage of recurrence for animals, who may need to respond rapidly in some situations, must limit metabolic expenditures in general, and may benefit from slower and more energetically costly inferences when great accuracy is required. In fact, computer vision faces similar requirements in certain applications. For example, a vision algorithm in a smartphone should respond rapidly and conserve energy in general, but it should also be capable of high-accuracy inference when needed.
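The idea of terminating recurrent computation at a confidence threshold can be sketched in a few lines. This is a toy illustration under our own assumptions, not the model of Spoerer et al.: the recurrent "network" simply accumulates a fixed evidence vector at each step, the readout is the identity, and the names `classify_with_threshold`, `accumulate`, and `identity_readout` are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits into class probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; lower entropy means higher confidence."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def classify_with_threshold(update, state, readout, max_entropy, max_steps):
    """Run recurrent updates until the class distribution is confident
    enough (entropy <= max_entropy) or the step budget is exhausted.
    Returns the final probabilities and the number of steps used."""
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    for t in range(1, max_steps + 1):
        state = update(state)
        probs = softmax(readout(state))
        if entropy(probs) <= max_entropy:
            break
    return probs, t


# Toy recurrent dynamics (an assumption for illustration): each step adds
# the same evidence vector to the state, so confidence grows over time.
evidence = [0.8, 0.3, 0.1]


def accumulate(state):
    return [s + e for s, e in zip(state, evidence)]


def identity_readout(state):
    return state


# A lenient threshold yields a fast, less certain answer; a strict
# threshold spends more steps (more computation) for a sharper answer.
fast_probs, fast_steps = classify_with_threshold(
    accumulate, [0.0, 0.0, 0.0], identity_readout, max_entropy=1.0, max_steps=50)
slow_probs, slow_steps = classify_with_threshold(
    accumulate, [0.0, 0.0, 0.0], identity_readout, max_entropy=0.3, max_steps=50)
```

The same set of weights serves both operating points; only the stopping criterion changes. A feedforward network would need a different architecture for each depth.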

B. Recurrent architectures can compress complex computations in limited hardware
Another major benefit of recurrent solutions is that they require fewer components in space when physically implemented in recurrent circuits, such as brains. Compare Figs. 1b and 1e: the recurrent network is anatomically more compact than the feedforward network and has fewer connections. It is easy to see why evolution might have favored recurrent implementations for many brain functions: Space, neural projections, and the energy to develop and maintain them are all costly for the organism. In addition, synaptic efficacies must be either learned from limited experience or encoded in a limited-capacity genome. Beyond saving space, material, and energy, thus, smaller descriptive complexity (or parameter count) might ease development and learning.
Engineered devices face the same set of costs, although their relative weighting changes from application to application. In particular, a larger number of units and weights must either be represented in the memory of a conventional computer or implemented in specialized (e.g., neuromorphic) hardware. The connection weights in an NN model need to be learned from limited data. This requires extensive training, e.g., in a supervised setting, with millions of hand-labeled examples that show the network the desired output for a given input. The larger number of parameters associated with a feedforward solution might overfit the training data. The learned parameters then do not generalize well to new examples of the same task. DNNs for image recognition typically have many more parameters than they have training data.
In practice, such DNNs often turn out to generalize well even when they have very large numbers of parameters [41][42][43] . This phenomenon is thought to reflect a regularizing effect of the learning algorithm, stochastic gradient descent. Indeed, the trend is towards ever deeper networks with more connections to be optimized, and this trend is associated with continuing gains in performance on computer vision benchmarks 44 .
Nevertheless, it could turn out that recurrent architectures that achieve high computational depth with few parameters may bring benefits not only in terms of their storage, but also in terms of learnability. At the same time, computational resources are not infinite, even outside of biological constraints. Increasingly complex DNNs take increasingly longer to train on increasingly larger computing clusters, while drawing increasingly large amounts of power, a trend that is not sustainable. In the long run, therefore, computer vision too may benefit from the anatomical compression that can be achieved through clever use of recurrence.
Importantly, however, not every deep feedforward model can be compressed into an equivalent recurrent implementation. This anatomical compression can only be achieved when the same function may be applied iteratively or recursively within the network. The crucial question, therefore, is: what are these functions? What operations can be applied repeatedly in a productive manner?
FIG. 3: Recurrence enables a lossless speed-accuracy trade-off in an image classification task. Circles denote the performance of a recurrent neural network that was run for different numbers of time steps, until it achieved a desired threshold of classification confidence (quantified by the entropy of the class probabilities in the final network layer). Squares correspond to three architecturally similar feedforward networks with different computational depths. On the x-axis is the computational cost of running these models, measured by the number of floating point operations. For the feedforward models, this cost is fixed by the architecture. For the recurrent models, it is the average number of operations that was required to meet the given entropy threshold. The y-axis shows the classification accuracy achieved by each model. The performance of the recurrent model for different certainty thresholds follows a smooth curve, trading off computational cost (and thus computational speed) and accuracy. Note that this curve passes almost exactly through the speed/accuracy levels achieved by the feedforward models. Thus, a single recurrent model can emulate the performance of multiple feedforward models as it trades off speed and accuracy. This flexibility does not appear to come at a cost in terms of either parameters or computation: The recurrent model had a similar number of parameters as the feedforward models. For any desired accuracy, the recurrent model [...]
The remainder of this review will reflect on the various roles that have been proposed for recurrent processing for visual inference, from superficial to increasingly more profound forms of recurrence.
C. Feedback connections are required to integrate information from outside the visual hierarchy
A key, established role of recurrent connections in biological vision is to propagate information from outside the visual cortex, so that it can aid visual inference 45 . Here, we will briefly discuss two such outside influences: attention and expectations.

Attentional prioritization requires feedback connections
Animals have needs and goals that change from moment to moment. Perception is attuned to an animal's current objectives. For instance, a primate foraging for red berries may be more successful if its visual perception apparatus prioritizes or enhances the processing of red items. Since current goals are represented outside the visual cortex (e.g. in frontal regions), top-down connections are clearly required for this information to influence visual processing. Such top-down effects have been grouped under the label "attention", and they have been the subject of an entire sub-field of study. For our purposes, it is sufficient to note that the effects and mechanisms of top-down attention are well-documented and pervasive in visual cortex (for review, see [34][35][36]), and thus there is no question that this is one important function of recurrent connections.

Integrating prior expectations into visual inference requires feedback connections
Organisms may constrain their visual inferences by expectations 46 .
Visual input can be ambiguous and unreliable, and thus open to multiple interpretations. To constrain the inference, an observer can make use of prior knowledge [47][48][49] . One form of prior knowledge is environmental constants (e.g. "light tends to come from above" 50 ). Such unvarying knowledge may be stored within visual cortex, especially when it pertains to the overall prevalence of basic visual features (e.g. local edge orientations 51 ). Another form of prior knowledge is contextual information specific to the current situation. Such time-varying knowledge may require a flexible representation outside visual cortex (e.g. "I rang the doorbell at my mother's house, so I expect to see her open the door"). Such expectations, represented in higher cortical regions, require feedback connections to affect processing in visual cortex 46 .
The top-down imposition of attention and expectation must be mediated by feedback connections. However, it is unclear whether these influences fundamentally change the nature of visual representations or merely modulate these representations, adjusting the gain depending on the current relevance of different features of the visual input. As illustrated in Fig. 4a, for a given input such modulation would require only two "sweeps" of computation through the visual processing hierarchy: a feedback sweep that primes visual areas with top-down information, and a bottom-up sweep to interpret the visual input and integrate or modify this interpretation with the top-down signal (not necessarily in that order). Importantly, if the feedback signal merely enhances or suppresses some visual features, then the core inference algorithm need not be fundamentally recurrent: one can imagine that the bottom-up part of such a network is modeled perfectly by an FNN, while an optional recurrent module could be added in order to implement top-down contextual influences.

D. Recurrent networks can exploit temporal dependency structure
Contextual constraints on visual inference include not only information from outside the visual hierarchy, such as information from other sensory modalities and memory, as discussed in the previous section. The recent stimulus history within the visual modality also provides context, likely represented within the visual system.

Recurrent networks can dynamically compress the stimulus history
The primate visual system is thought to contain a hierarchy, not only of processing stages and spatial scales, but also of temporal scales 52,53 . Visual representations track the environment moment by moment. However, the duration of a visual moment, the temporal grain, may depend on the level of representation. These principles apply to all sensory modalities and have been empirically explored, in particular, for audition and speech perception. At the simplest level, a neural network could use delay lines to detect spatiotemporal, rather than purely spatial, patterns. Recurrent neural networks have internal states and can represent temporal context across units tuned to different latencies. An RNN could represent a fixed temporal window, by replicating units tuned to different patterns for multiple latencies. However, RNNs trained on sequence processing tasks, such as language translation, learn more sophisticated representations of temporal context 54 . They can represent context at multiple time scales, learning a latent representation that enables them to dynamically compress whatever information from the past is needed for the task. In contrast to a feedforward network, a recurrent network is not limited by spatial constraints in terms of its retrospective time horizon. It can maintain task-relevant information indefinitely, integrating long-term memory into its inferences.
FIG. 4: [...] In these examples, circles correspond to neurons (or neural assemblies) encoding the feature illustrated within the circle, and lines that connect to circles indicate neural connections with significant activity. (a) Top-down influences from outside the visual processing hierarchy may be incorporated through two computational sweeps: a feedback sweep priming the network with top-down information and a feedforward sweep to interpret visual input and combine this interpretation with the top-down signal. Note that the lateral connections here merely copy neural activities in each area to the next time point; this identity transformation could also be implemented in other ways, such as slow membrane time constants or other forms of local memory. In the example on the right, a top-down signal communicates the expectation that the upcoming input will be horizontal motion. This primes neurons encoding this direction of motion to be more easily or strongly activated, and sharpens the interpretation of the subsequent (ambiguous) visual input. (b) To efficiently perform inference on time-varying visual input, recurrent connections may implement a fixed temporal prediction function akin to the transition kernel in a Kalman filter, extrapolating the ongoing dynamics of the world one time step into the future. For instance, in the example on the right, a downward moving square was perceived at t = 1. This motion is predicted to continue, and this prediction constrains the interpretation of the (ambiguous) visual input at the next time point. For simplicity, only lateral recurrence is shown in this example. Note that each input is mapped onto its corresponding output in a single recurrent time step. (c) Static input may also benefit from recurrent processing that iteratively refines an initial, coarse feedforward interpretation. In this mode of recurrence, there are several processing time steps between input and output, whereas in (b) there was one input and output for each time step. Illustrated on the right is an iterative hierarchical inference algorithm. Here, a higher-level hypothesis, generated in the first time step, refines the underlying lower-level representation in the next time step, which in turn improves the higher-level hypothesis, and so forth, until the network converges to an optimal interpretation of the input across the entire hierarchy. For simplicity, lateral recurrent interactions are not shown in this example.

Recurrent dynamics can simulate and predict the dynamics of the world
Dynamic compression of the past exploits the temporal dependency structure of the sensory data. The purpose of representing the past is to act well in the future. This suggests that a neural network should exploit temporal dependencies not just to compress the past, but also to predict the future. In fact, an optimal representation of even just the present requires prediction, because the sensory data is delayed and noisy.
Changes in the world are governed by laws of dynamics, which by definition are temporally invariant. An ideal observer will exploit these laws in visual inference and optimally combine previous with present observations to estimate the current state. This implies an extrapolation of the past to generate predictions that improve the interpretation of the present sensory input. When the dynamics are linear and noise is Gaussian, the optimal way to infer the present state by combining past and present evidence is the Kalman filter 55, an algorithm widely used in engineering applications. A number of authors [56][57][58][59] have proposed that the visual cortex may implement an algorithm similar to a Kalman filter. This theory is consistent with temporal biases that are evident in human perceptual judgments 60-62 .
Kalman filters employ a fixed temporal transitional kernel. This kernel takes a representation of the world (e.g., variables encoding the present state of a physical system, such as positions and velocities) at time t, and transforms it into a predicted representation for time t + 1, to be integrated with new sensory evidence that arrives at that time. While the resulting prediction varies as a function of the kernel's input, the kernel itself is constant, reflecting the temporal shift-invariance of the laws governing the dynamics. Recurrent neural networks provide a generalization of the Kalman filter and can represent nonlinear dynamical systems with non-Gaussian noise.
Note that this type of recurrent processing is more profound than the two-sweep algorithm (Fig. 4a) that incorporated top-down influences on visual inference. The two-sweep algorithm is trivial to unroll into a feedforward architecture.
In contrast, unrolling a Kalman filter-like recurrent algorithm would induce an infinitely deep feedforward network, with a separate set of areas and connections for each time point to be processed. A finite-depth feedforward architecture can only approximate the recurrent algorithm. While the feedforward approximation will have a finite temporal window of memory to constrain its present inferences, the recurrent network can in principle integrate information over arbitrarily long periods.
Due to their advantages for dealing with time-varying (or otherwise ordered) inputs, recurrent neural networks are in fact widely employed in the broader field of machine learning for tasks involving sequential data. Speech recognition and machine translation are prominent applications that RNNs excel at 54,63-66 . Computer vision, too, has embraced RNNs for recognition and prediction of video input [67][68][69] . Note that these applications all exploit the dynamics in RNNs to model the dynamics in the data.
What if we trained a Kalman filter or sequence-to-sequence RNN (Fig. 4b) on a train of independently sampled static inputs to be classified? The memory of the preceding inputs would not be useful then, so we expect the recurrent model to revert to using essentially only its feedforward weights. The type of recurrent processing we described in this section, thus, uses memory to improve visual inference. In the next section, we consider how recurrent processing can help with the inferential computations themselves, even for static inputs.

E. Recurrence enables iterative inference
Recurrent processing can contribute even to inference on static inputs, and regardless of the agent's goals and expectations, by means of an iterative algorithm. An iterative algorithm is one that employs a computation that improves an initial guess. Applying the computation again to the improved guess yields a further improvement. This process can be repeated until a good solution has been achieved or until we run out of time or energy. Recurrent networks can implement iterative algorithms, with the same neural network functions applied successively to some internal pattern of activity.
In many fields, iterative algorithms are used to solve estimation and optimization problems. In each iteration, a small adjustment is made to the problem's proposed solution, to improve a mathematically formulated objective. A locally optimal solution is found by making small improvements until further progress is not required or not possible. The algorithm navigates a path in the space of the values to be estimated or the optimization parameters that leads to a good solution (albeit not necessarily the global optimum).
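As a minimal sketch of this scheme, consider gradient descent on a one-dimensional quadratic objective. The objective, step size, tolerance, and function name are illustrative assumptions. The loop applies the same update repeatedly and stops either when further progress is not required (the gradient is negligible) or when the iteration budget runs out.

```python
def iterative_minimize(grad, x0, step=0.1, tol=1e-8, max_iters=10_000):
    """Repeat the same small adjustment until the gradient is negligible
    or the iteration budget is exhausted."""
    x = x0
    for n in range(max_iters):
        g = grad(x)
        if abs(g) < tol:   # further progress is not required
            return x, n
        x = x - step * g   # small adjustment that improves the objective
    return x, max_iters    # ran out of time


# Hypothetical objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_star, n_iters = iterative_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_star, n_iters)
```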
Much of machine learning involves iterative methods. Gradient descent is an iterative optimization method, whose stochastic variant is the most widely used method for training DNNs. Many discrete optimization techniques are iterative. Iterative algorithms are also central to inference in machine learning, for example in variational inference (where inference is achieved by optimization), sampling methods (where steps are chosen stochastically such that the distribution of samples converges on the posterior distribution), and message passing algorithms (such as loopy belief propagation). In particular, such iterative inference algorithms are used in probabilistic approaches to computer vision 29,31 . It is somewhat surprising, then, that iterative computation is not widely exploited to perform visual inference in DNNs.
Visual inference is naturally understood as an optimization problem, where the goal is to find hypotheses that can explain the current visual input 47 . A hypothesis, in this case, is a proposed set of latent (i.e. unobserved) causes that can jointly explain the image. The hypothesized latent causes could be the identities and positions of objects in the scene. Visual hypotheses are hierarchical, being subdivided into smaller hypotheses about lower or intermediate-level features, such as the local edges that make up a larger contour. An iterative visual inference algorithm starts with an initial hypothesis, and refines it by incremental improvements. These improvements may include eliminating hypotheses that are mutually exclusive, strengthening compatible causes, or adjusting a hypothesis based on its ability to predict the data (the visual input). In a probabilistic framework, the optimization objective would be the likelihood (probability of the image given the latent representation) or the posterior probability (probability of the latent representation given the image).

Incompatible hypotheses can compete in the representation
There are often multiple plausible explanations for a given sensory input that are mutually exclusive. The distributed, parallel nature of neural networks enables them to initially activate and represent all of these possible hypotheses simultaneously. Recurrent connectivity between neurons can then implement competitive interactions among hypotheses, so as to converge on the best overall explanation.
There is some evidence that sensory representations are probabilistic 70-72 . In this case, the probabilities assigned to a set of mutually exclusive hypotheses must sum to 1. A strengthening of belief in one hypothesis, thus, should entail a reduction of the probability of other hypotheses in the representation. If neurons encode point estimates rather than probability distributions, then only one hypothesis can win (although that hypothesis may be encoded by a population response involving multiple neurons). The winning hypothesis could be the maximum a posteriori (MAP) hypothesis or the maximum likelihood hypothesis. Influential models of visual inference involving competitive recurrent interactions include divisive normalization 33 , biased competition 34 , and predictive coding 28,30,73 .
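A toy sketch of such competitive dynamics is given below. It is our own illustration and not an implementation of any of the cited models. Each unit amplifies its own activity and is then divided by the pooled activity of all units, so that the beliefs continue to sum to 1. The update rule, the exponent `sharpen`, and the initial beliefs are assumptions chosen for simplicity.

```python
def compete(beliefs, n_steps=10, sharpen=2.0):
    """Iterated self-amplification followed by divisive normalization."""
    total = sum(beliefs)
    p = [b / total for b in beliefs]          # start from normalized beliefs
    for _ in range(n_steps):
        boosted = [v ** sharpen for v in p]   # stronger hypotheses gain more
        pooled = sum(boosted)                 # pooled activity acts as lateral inhibition
        p = [v / pooled for v in boosted]     # normalization keeps the sum at 1
    return p


# Three mutually exclusive hypotheses with nearly equal initial support.
final = compete([0.30, 0.36, 0.34])
print([round(v, 4) for v in final])
```

Applied once, the operation only mildly favors the leading hypothesis. Applied repeatedly, it converges on a single winner, which is the behavior that a single feedforward pass cannot provide.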
Recent theoretical work has demonstrated that lateral competition can give rise to a robust neural code, and can explain certain puzzling neural response properties 73,74 . This theory considers a spiking neural network setting, in which different neurons encode highly overlapping or even identical features in their input. This degeneracy means that the same signal can be encoded equally well by a range of different response patterns.
When a particular neuron spikes, lateral inhibition ensures that other competing neurons do not encode the same part of the input again. Which neuron gets to do the encoding thus depends on which neuron fires first, because its membrane potential happened to be closest to a spiking threshold. This leads to trial-to-trial variability in neural responses that reflects subtle differences in initial conditions, conditions that may not be known to an experimenter, who may thus mistake this variability for random noise. This could explain the puzzling observation that individual neurons reliably reproduce the same output given the same electrical stimulation, but populations of neurons, wired together, display apparently random variability under sensory stimulation [75][76][77] . Since multiple neurons can encode the same feature, the resulting code is also robust to neurons being lost or temporarily inactivated.
FNNs do not incorporate lateral connections for competitive interactions, although they very often include computations that serve a similar purpose. Chief among these are operations known as max-pooling and local response normalization (LRN) 15,78 .
In max-pooling, only the strongest response within a pool of competing neurons is forwarded to the next processing stage. In LRN, each neuron has its response divided by a term that is computed from the sum of activity in its normalization pool. While neither of these mechanisms is mediated by explicit lateral connections in a DNN, a strictly connectionist implementation of these mechanisms (e.g. in biological neurons or neuromorphic hardware) would have to include lateral recurrence. This, then, is another way in which apparently feedforward DNNs can exhibit a (limited) form of recurrent processing "under the hood". Note, though, that each of these operations is carried out only once, rather than allowing competitive dynamics to converge over multiple iterations. Furthermore, in contrast to the lateral interactions in predictive coding or other normative models, LRN and max-pooling are not derived from normative principles, and do not necessarily select (or enhance) the best hypothesis (however "best" is defined).
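Both operations can be sketched in a few lines for a one-dimensional array of responses. The sketch assumes a sum-of-squares normalization pool, which is one commonly used form of LRN. The constants `k`, `alpha`, and `beta`, the pool radius, and the function names are illustrative and not taken from any particular network.

```python
def max_pool(responses, pool_size):
    """Forward only the strongest response within each non-overlapping pool."""
    return [max(responses[i:i + pool_size])
            for i in range(0, len(responses), pool_size)]


def local_response_norm(responses, radius=1, k=1.0, alpha=1.0, beta=0.5):
    """Divide each response by a term computed from the squared activity of
    its neighbours (an assumed sum-of-squares form of LRN)."""
    normalized = []
    for i, r in enumerate(responses):
        lo = max(0, i - radius)
        hi = min(len(responses), i + radius + 1)
        pool = sum(v * v for v in responses[lo:hi])   # activity in the normalization pool
        normalized.append(r / (k + alpha * pool) ** beta)
    return normalized


print(max_pool([1.0, 3.0, 2.0, 5.0], pool_size=2))
print(local_response_norm([0.0, 2.0, 0.0]))
```

Note that each function is applied exactly once to its input. Neither loop feeds its output back into the pool, which is the sense in which these operations provide only a single step of competition.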

Compatible hypotheses can strengthen each other in the representation
In feedforward models of hierarchical visual inference, neurons at higher stages selectively respond to combinations of simpler features encoded by lower-level neurons. Higher-level neurons thus are sensitive to larger-scale patterns of correlation between subsets of lower-level features. But such larger-scale statistical regularities may not be most efficiently captured by a set of larger-scale building blocks. Instead, they may be more compactly captured by local association rules. Consider, for instance, the problem of contour detection. Many combinations of local edges in an image can form a continuous contour. The resulting space of contours may be too complex to be efficiently represented with larger-scale templates. What all these contours have in common, however, is that they consist of pairs of edges that are locally contiguous, with sharper angles occurring with lower probability. Thus, the criteria for 'contour-ness' may be compactly expressed by a set of local association rules: these edges go together; those do not 79,80 . Contours may then be pieced together by repeatedly applying the same local association rules. Those edge pairs which are most clearly connected would be identified in early iterations. Later inferences can benefit from the context provided by earlier inferences, enabling the process to recognize continuity even where it is less locally apparent.
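The following toy sketch illustrates how repeated application of one local rule can propagate context along a contour. It is a deliberately simplified construction of ours: edge elements lie on a one-dimensional chain, and an element is accepted as part of the contour if its own evidence, plus a fixed amount of support from each accepted neighbor, reaches a threshold. The threshold, the support value, and the evidence values are hypothetical.

```python
def integrate_contour(evidence, threshold=0.8, support=0.3, max_iters=100):
    """Repeatedly apply one local association rule along a chain of edge elements."""
    accepted = [e >= threshold for e in evidence]   # first pass: bottom-up evidence only
    for _ in range(max_iters):
        updated = []
        for i, e in enumerate(evidence):
            neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(evidence)]
            context = support * sum(accepted[j] for j in neighbours)
            updated.append(e + context >= threshold)   # same rule at every iteration
        if updated == accepted:                        # converged: nothing changes
            break
        accepted = updated
    return accepted


evidence = [0.9, 0.6, 0.6, 0.6, 0.1]
single_pass = integrate_contour(evidence, max_iters=0)
converged = integrate_contour(evidence)
print(single_pass)
print(converged)
```

With bottom-up evidence alone, only the first element is accepted. Iterating the rule extends the contour through the three weaker elements one step at a time, while the last element, whose evidence is too weak even with support, remains excluded.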
This insight has inspired network models of visual inference that implement local association rules through lateral connections, to aid contour integration and other perceptual grouping operations 81 . Recent examples include Linsley et al., who developed horizontal gated-recurrent units (hGRUs) that learn local spatial dependencies 82 . A network equipped with this particular recurrent connectivity was competitive with state-of-the-art feedforward models on a contour integration task, while using far fewer free parameters. George et al. 83 similarly leveraged lateral interactions to recognize contiguous contours and surfaces, by modeling these with a conditional random field (CRF), using a message-passing algorithm for inference. This approach made their Recursive Cortical Network (RCN) the first computer vision algorithm to reliably beat CAPTCHAs (images of letter sequences under a variety of distortions, noise and clutter, which are widely used to verify that queries to a user interface are made by a person, and not an algorithm). CRFs were also used by Zheng et al. 84 , who incorporated them as a recurrent extension of a convolutional neural network for image segmentation. The model surpassed state-of-the-art performance at the time. Association rules enforced through lateral connections may also help to fill in missing information, such as when objects are partially hidden from view by occluders. Lateral connectivity has been shown to improve recognition performance in such settings 21,85,86 . Montobbio et al. showed that lateral diffusion of activity between features with correlated feedforward filter weights improves robustness to image perturbations including occlusions 86 .
Enhancement of mutually compatible hypotheses (this section) and competition between mutually exclusive hypotheses (previous section) can both contribute to inference. A more general perspective is provided by the insight that prior knowledge about what features in a scene are mutually compatible or exclusive may be part of an overarching generative model, which iterative algorithms can exploit for inference.

Iterative algorithms can leverage generative models for inference
Perceptual inference aims to converge on a set of hypotheses that best explain the sensory data. Typically, a hypothesis is considered to be a good explanation if it is consistent with both our prior knowledge and the sensory data. A generative model is a model of the joint distribution of latent causes and sensory data. Generative models can powerfully constrain perceptual inference because they capture prior knowledge about the world. In machine learning, defining generative models enables us to express and exploit what we know about the domain. A wide range of inference algorithms can be used to compute posterior distributions over variables of interest, given observed variables. The algorithms include variational inference, message passing, and Markov Chain Monte Carlo sampling, all of which require iterative computation.
In this section, we focus on a particular approach to leveraging generative models in visual inference, in which the joint distribution p(x, z) of the image x and the latents z is factorized as p(x, z) = p(z) · p(x|z), which we refer to as the top-down factorization. The architecture contains components that model p(x|z) and predict the image from the latents (or more generally lower-level latent representations from higher-level latent representations). Compared to the alternative factorization p(x, z) = p(x) · p(z|x), the top-down factorization has the potential advantage that the model operates in the causal direction, matching the causal process in the world that generated the image. The top-down model predicts what visual input is likely to result from a scene that has the hypothesized properties. This is somewhat similar to the graphics engine of a video game or image rendering software. This top-down model can be implemented via feedback connections that translate higher-level hypotheses in the network to representations at a lower level of abstraction.
Using generative models implemented with top-down predictions for inference is known as analysis-by-synthesis, an approach that has a long history in theories of perception 28,30,47 .
Arguably, the goal of perceptual inference, by definition, is to reason back from effects (sensory data) to their causes (unobserved variables of interest), and thus invert the process that generated the effects. The crucial question, however, is whether the causal process is explicitly represented in the inference algorithm. The alternative, which can be achieved with feedforward inference, is to directly approximate the inverse, without ever making predictions in the causal direction. The success of the feedforward approach then depends on how well the inverse can be approximated by a fixed mapping of inputs to hypotheses. To iteratively invert the causal process, a neural network can evaluate the causal model for a current hypothesis and update the hypothesis in a beneficial direction. This process can then be repeated until convergence. This process of analysis by repeated synthesis may be preferable to directly approximating the inverse mapping if the causal process that generates the sensory data is easier to model than its inverse. In particular, the causal process may be more compactly represented, more easily learned, more efficient to compute, and more generalizable beyond the training distribution than its inverse.
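The loop of synthesis, comparison, and update can be sketched for the simplest possible case. The sketch assumes a linear, noise-free generative model in which the image is a weighted sum of the latents, and it updates the hypothesis by gradient descent on the squared prediction error. The weight matrix `W`, the step size, and the function names are hypothetical, and real generative models of images are of course far from linear.

```python
def synthesize(W, z):
    """Top-down (causal) model: predict the image from the latent hypothesis."""
    return [sum(w * zj for w, zj in zip(row, z)) for row in W]


def analysis_by_synthesis(W, x, n_iters=200, step=0.3):
    """Invert the causal model iteratively instead of learning a direct inverse."""
    z = [0.0] * len(W[0])                                   # initial hypothesis
    for _ in range(n_iters):
        prediction = synthesize(W, z)                       # evaluate the causal model
        error = [xi - pi for xi, pi in zip(x, prediction)]  # prediction error
        # Move each latent in the direction that reduces the squared error.
        z = [zj + step * sum(W[i][j] * error[i] for i in range(len(W)))
             for j, zj in enumerate(z)]
    return z


W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]                  # 3 "pixels" generated from 2 latent causes
x = synthesize(W, [2.0, -1.0])    # image produced by the true latents
z_hat = analysis_by_synthesis(W, x)
print([round(v, 6) for v in z_hat])
```

The inference procedure never represents the inverse mapping explicitly. It only requires the ability to run the causal model forward and to propagate the resulting error back to the hypothesis.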
Another potential advantage of generative inference lies in robustness to variations in the input. While FNNs can accurately categorize images drawn from the same distribution that the training images were drawn from, it does not take much to fool them. A slight alteration imperceptible to humans can cause a DNN to misclassify an image entirely, with high confidence 87 . State-of-the-art DNNs rely more strongly on texture than humans, who rely more on shape 88 . More generally, FNNs seem to ignore many image features that are relevant to human perception 89 . One hypothesized reason for this is that these networks are trained to discriminate images, but not to generate them. Thus, any visual feature that reliably discriminates categories in the training data will be weighted heavily in the network's classification decisions. Importantly, this weight is unrelated to how much variance the feature explains in the image, and to the likelihood, i.e. the probability of the image given either of the categories. An ideal observer should evaluate the likelihood for each hypothesis and adjudicate according to their ratio 90 . A feedforward network may instead latch on to a few highly discriminative, but subtle image features that don't explain much and may not generalize to images from a different data set 89,91 . In contrast, visual features that are important for generating or reconstructing images of a given class may be more likely to generalize to other examples of the same category. In support of this intuition, two novel RNN architectures that employ generative models for inference were found to be more robust to adversarial perturbations 92,93 . Generative inference networks were also shown to better align with human perception, compared to discriminative models, when presented with controversial stimuli: images synthesized to evoke strongly conflicting classifications from different models 94 .
Despite these promising developments, generative inference remains rare in visual DNN models. The exceptions mentioned above are rather simple networks trained on easy classification problems, and are not (yet) competitive with state-of-the-art performance on more challenging computer vision benchmarks. Within computational neuroscience, by contrast, generative feedback connections appear in many network models of visual inference. Prominent examples are predictive coding 28,30 and hierarchical Bayesian inference 95 . However, these models have not had much success in explaining visual inference beyond its earliest stages. A notable exception is work by Wen et al. 96 , which shows that extending supervised convolutional DNNs with the recurrent dynamics of predictive coding can improve classification performance. The fields of computer vision and computational neuroscience both stand to benefit from the development of more powerful generative inference models.

Iteration is necessary to close the amortization gap
Iterative inference has many advantages. A drawback of iteration, however, is that it takes time for the algorithm to converge during inference. This is unattractive for animals who need to perform visual inference under time pressure. It is also a challenge when training a DNN, which already requires many iterations of optimization. If each update of the network's connections additionally includes an iterative inner loop to perform inference on each training example, this lengthens the time required for training.
A complementary inference mechanism is amortized inference 97,98 , where a feedforward model approximates the mapping from images to their latent causes. DNNs are eminently suited for learning complicated input-output mappings. A single transformation then replaces the trajectories that would be navigated by an iterative inference algorithm. In some cases, the iterative solution and the best amortized mapping may be exactly equivalent. A linear model, for instance, can be estimated iteratively, by performing gradient descent on the sum of squared prediction errors. However, if a unique solution exists, it can equivalently be found by a linear transformation that directly maps from the data to the optimal coefficients.
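The equivalence in the linear case can be checked with a minimal sketch. It assumes a single-coefficient model y ≈ w · x without an intercept, and the data, step size, and function names are illustrative. The iterative route follows a trajectory of gradient steps, whereas the direct route computes the least-squares coefficient in one step, as a fixed linear function of the observed y values.

```python
def fit_iterative(xs, ys, step=0.05, n_iters=100):
    """Gradient descent on the sum of squared prediction errors."""
    w = 0.0
    for _ in range(n_iters):
        # Derivative of sum((y - w * x)^2) with respect to w.
        grad = sum(-2.0 * x * (y - w * x) for x, y in zip(xs, ys))
        w -= step * grad
    return w


def fit_direct(xs, ys):
    """Single-step least-squares solution: a linear map from ys to the coefficient."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]
w_iter = fit_iterative(xs, ys)
w_direct = fit_direct(xs, ys)
print(w_iter, w_direct)
```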
In general, however, amortized inference incurs some error, compared to the optimal solution that might be found through iterative optimization. This error has been called the amortization gap 99,100 . It is analogous to the poor fit that may result from buying clothes "off the rack", compared to a tailored version of the same garment. The amortization gap is defined in the context of variational inference, when the iterative optimization of the variational approximation to the posterior is replaced by a neural network that maps from the image to the parameters of the variational distribution. The resulting model suffers from two types of error: (1) error caused by the choice of the variational approximation (variational approximation gap) and (2) error caused by the model mapping from images to variational parameters (amortization gap). One recent study has argued that the amortization gap is often the main source of error in amortized inference models 99 .
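The two sources of error can be written as a decomposition of the total inference gap. The notation below is our assumption and does not appear in the text above: \mathcal{L}[q] denotes the evidence lower bound obtained with variational distribution q, q^{*} is the best member of the chosen variational family for the image x, and q_\phi is the distribution produced by the amortized inference network with parameters \phi.

```latex
\log p(x) - \mathcal{L}[q_\phi(z \mid x)]
  = \underbrace{\log p(x) - \mathcal{L}[q^{*}(z \mid x)]}_{\text{approximation gap}}
  + \underbrace{\mathcal{L}[q^{*}(z \mid x)] - \mathcal{L}[q_\phi(z \mid x)]}_{\text{amortization gap}}
```

The left-hand side equals the Kullback-Leibler divergence from q_\phi(z \mid x) to the true posterior p(z \mid x). Iterative refinement of the variational parameters for each image can shrink the second term, but not the first, which is fixed by the choice of variational family.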
Amortized and iterative inference define a continuum. At one extreme, iterative inference until convergence reaches a solution through a trajectory of small improvements, explicitly evaluating the quality of the current solution at every iteration. At the other extreme, fully amortized inference takes a single leap from input to output. In between these extremes lies a space for algorithms that use intermediate numbers of steps, to approximate the optimal solution through a computational path that is more refined than a leap, but more efficient than full-fledged iterative optimization. Models that occupy this space include explicit hybrids of iterative and amortized inference [100][101][102] , as well as RNNs with arbitrary dynamics that are trained to converge to a desired objective in a limited number of time steps (e.g. 21,103-105 ).

F. Recurrence is required for active vision
Vision is an active exploratory process. Our eye movements scan the scene through a sequence of well-chosen fixations that bring objects of interest into foveal vision. Moving our heads and our bodies enables us to bring entirely new parts of the scene into view, and closer for inspection at high resolution. Active control of our eyes, heads, and bodies can also help disambiguate 3D structure as fixation on points at different depths changes binocular disparity, and head and body movements create motion parallax. Active vision involves a recurrent cycle of sensory processing and muscle control, a cycle that runs through the environment.
Our focus here has been on the internal computational functions of recurrent processing, and active vision has been reviewed elsewhere [106][107][108] . However, it is important to note that the internal recurrent processes of visual inference from a single glimpse are embedded within the larger recurrent process of active visual exploration. Active vision provides not just the larger behavioral context of visual inference. It also provides a powerful illustration of the fundamental advantages that recurrent algorithms offer in general. It illustrates how limited resources (the fovea) can be dynamically allocated (eye movements) to different portions of the evidence (the visual scene) in temporal sequence. A sensory system limited to a finite number of neurons, thus, can multiply its resources along time to achieve a detailed analysis. The cycle may start with an initial rough analysis of the entire visual field, followed by fixations on locations likely to yield valuable information. This is an example of an essentially recurrent process whose efficiency cannot be emulated with a feedforward system. The internal mechanisms of visual inference are faced with qualitatively similar challenges: Just like our retinae cannot afford foveal resolution throughout the visual field, the ventral stream cannot afford to perform all potentially relevant inferences on the evidence streaming in through the optic nerve in a single feedforward sweep. Internal shifts of attention, like eye movements, can sequentialize a complex computation and avoid wasting energy on portions of the evidence that are uninformative or irrelevant to the current goals of the animal.
Whereas the outer loop of active vision is largely about positioning our eyes relative to the scene and bringing important content into foveal vision, the inner loop of visual inference on each glimpse is far more flexible. Beyond covert attentional shifts that select locations, features, or objects for scrutiny, a recurrent network can decide what computations to perform so as to most efficiently reduce uncertainty about the important parts of the scene. In a game of twenty questions, we choose a question that most reduces our remaining uncertainty at each step. The budget of twenty would not suffice if we had to decide all the questions before seeing any answers. The visual system similarly has limited computational resources for processing a massive stream of evidence. It must choose what inferences to pursue on the basis of their computational cost and uncertainty-reducing benefit as it forages for insight [109][110][111] .

IV. CLOSING THE GAP BETWEEN BIOLOGICAL AND ARTIFICIAL VISION
We have reviewed a number of advantages that recurrence can bring to neural networks for visual inference. Going forward, neural network models of vision should incorporate recurrence; not just to better understand visual inference in the brain, but also to improve its implementation in machines.

A. Recurrence already improves performance on challenging visual tasks
Efforts in this direction are already underway, and turning up promising results. Some of this work has been described in previous sections, such as the use of lateral connections to impose local association rules [82][83][84] and generative inference for more robust performance outside the training distribution 92,93 . Several other recent findings are worth highlighting here, as they have shown improved performance on visual tasks, better approximations to biological vision, or both, through recurrent computations.
In particular, several studies have found that recurrence is required in order to explain or improve visual inference in challenging settings. Kar and colleagues 104 identified a set of 'challenge images' that required recurrent processing in order to be accurately recognized. A feedforward DNN struggled to interpret these images, whereas macaque monkeys recognized them as accurately as a set of control images. Challenge images were associated with longer processing times in the macaque inferior temporal (IT) cortex, consistent with recurrent computations. Neural responses in IT for images that took longer were well accounted for by a brain-inspired RNN model. In a different study 112 , this same recurrent architecture was found to account for behavior and neural responses in object recognition tasks, while also achieving good performance on an important computer vision benchmark (ImageNet 113 ).
One prominent challenge to visual inference is posed by partial occlusions, which hide part of a target object from view. In two recent studies, recurrent architectures were shown to be more robust to occlusions than their feedforward counterparts 85,114 . Interestingly, in both human observers and in an RNN model, object recognition under occlusion was impaired by backward masking 114 (the presentation of a meaningless noise image, shortly after a target stimulus, to disrupt recurrent processing 12,14,115 ). Another challenge for human perception is crowding, which occurs when the detailed perception of a target stimulus is disrupted by nearby flanker stimuli 116 . In certain instances, the target stimulus can be released from crowding if further flankers are added that form a larger, coherent structure with the original flankers. This uncrowding effect may be due to the flankers being 'explained away', thus reducing their interference with the target representation 117,118 . Recent work 119 has shown that both effects can be explained by architectures known as Capsule Nets 120,121 , which include recurrent information routing mechanisms that may be similar to perceptual grouping and segmentation processes in the visual cortex.
Note that, in all of these cases, it may be possible to develop a feedforward architecture that performs the task equally well or better. Trivially, and as we discussed previously, a successful recurrent architecture can always be unrolled (for a finite number of time steps) into a deep feedforward network with many more learnable connections. However, a realistic recurrent model, when unrolled, may map onto an unrealistic feedforward model (Fig. 2), where realism could refer to the real-world constraints faced by either biological or artificial visual systems. Future studies should compare RNN and FNN implementations for the same visual inference task, while matching the complexity of the models in a meaningful way. Setting a realistic budget of units, connections, and computational operations is one important approach. To understand the computational differences between RNN and FNN solutions, it is also interesting to (1) match the parameter count (number of connection weights that must be learned and stored), which requires granting the FNN larger feature kernels, more feature maps per layer, or more layers, or (2) match the computational graph, which equates the distribution of path lengths from input to output and all other statistics of the graph, but grants the FNN a much larger number of parameters 21 .

B. Freeing ourselves from the feedforward framework
Deep feedforward neural networks constitute an essential building block for visual inference, but they are not the whole story. The missing element, recurrent dynamics, is central to a range of alternative conceptions of visual inference that have been proposed 29,[106][107][108]122,123 . These ideas have a long history, they are essential to understanding biological vision, and they have great potential for engineering, especially in the context of modern hardware and software. The promise of active vision and recurrent visual inference is, in fact, boosted by the power of feedforward networks.
However, the beauty, power, and simplicity of feedforward neural networks also make it difficult to engage and develop the space of recurrent neural network algorithms for vision. The feedforward framework, embellished by recurrent processes that serve auxiliary and modulatory functions like normalization and attention, enables computational neuroscientists to hold on to the idea of a hierarchy of feature detectors. This idea might not be entirely mistaken. However, it is likely to be severely incomplete and ultimately limiting.
The insight that any finite-time recurrent network can be unrolled compounds the problem by suggesting that the feedforward framework is essentially complete. More practically, the fact that we train RNNs by unrolling them for finite time steps might in some ways impede our progress. DNNs are usually trained by stochastic gradient descent using the backpropagation algorithm. This method retraces in reverse the computational steps that led to the response in the output layer, so as to estimate the influence that each connection in the network had on the response. Each connection weight is then adjusted, to bring the network output closer to a desired output. The deeper the network, the longer the computational path that needs to be retraced. RNNs for visual inference typically are trained through a variation on this method, known as backpropagation through time (BPTT). To retrace computations in reverse through cycles, the RNN is unrolled along time, so as to convert it into a feedforward network whose depth depends on the number of time steps as shown in Fig. 1b-d. This enables the RNN to be trained like an FNN.
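The unrolling step can be sketched with a single recurrent unit. The sketch assumes a scalar state, a tanh nonlinearity, and a static input, and the weights and function names are arbitrary illustrative choices. Running the recurrent unit for T steps gives exactly the same output as a T-layer feedforward chain in which every layer holds its own copy of the two weights.

```python
import math


def run_recurrent(x, n_steps, w_rec, w_in):
    """One unit applied repeatedly: the same two weights are reused at every step."""
    h = 0.0
    for _ in range(n_steps):
        h = math.tanh(w_rec * h + w_in * x)
    return h


def run_unrolled(x, layers):
    """Feedforward chain: one (w_rec, w_in) pair is stored per layer."""
    h = 0.0
    for w_rec, w_in in layers:
        h = math.tanh(w_rec * h + w_in * x)
    return h


T = 4
recurrent_out = run_recurrent(0.5, T, w_rec=0.8, w_in=1.2)
unrolled_out = run_unrolled(0.5, [(0.8, 1.2)] * T)   # T tied copies of the weights
print(recurrent_out, unrolled_out)
```

The recurrent version stores 2 weights regardless of T, whereas the unrolled version stores 2 · T. Backpropagating through the unrolled chain and summing the gradients over the tied copies is what BPTT amounts to in this toy case.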
BPTT is attractive for enabling us to train RNNs like FNNs on arbitrary objectives. When it comes to learning recurrent dynamics, however, BPTT strictly optimizes the output at the specific time points evaluated by the objective (e.g., the output after exactly N steps). Outside of this time window, there is no guarantee that the network's response will be well-behaved. The RNN might reach the desired objective at the desired time, but diverge immediately after. Ideally, we would like a visual RNN presented with a stable image to converge to an attractor that represents the image and behave stably for arbitrary lengths of time. This would be consistent with iterative optimization, in which each step improves the network's approximation to its objective. While it is not impossible for BPTT to give rise to such dynamics, it does not specifically favor them.
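A toy sketch illustrates why an objective evaluated at a single time point leaves the later dynamics unconstrained. It does not perform BPTT. Instead, it assumes a scalar linear recurrence h ← a · h + b · x and solves directly for the input weight b that makes the state hit the target after exactly N steps, which stands in for what a finite-window objective enforces. The values of a, N, and the target are hypothetical.

```python
def run(a, b, n_steps, x=1.0):
    """Iterate the scalar linear recurrence h <- a * h + b * x, starting from h = 0."""
    h = 0.0
    for _ in range(n_steps):
        h = a * h + b * x
    return h


def fit_input_weight(a, n_steps, target, x=1.0):
    """Choose b so that the state equals `target` after exactly n_steps.
    Uses h_N = b * x * (1 + a + ... + a^(N-1))."""
    gain = sum(a ** k for k in range(n_steps))
    return target / (gain * x)


N, target = 3, 1.0
b_stable = fit_input_weight(0.5, N, target)     # contracting dynamics (|a| < 1)
b_unstable = fit_input_weight(1.5, N, target)   # expanding dynamics (|a| > 1)

for label, a, b in [("stable", 0.5, b_stable), ("unstable", 1.5, b_unstable)]:
    print(label, [round(run(a, b, n), 3) for n in (N, 2 * N, 20)])
```

Both networks meet the objective at step N equally well. The contracting network then settles at a fixed point, whereas the expanding network grows without bound, and nothing in the step-N objective distinguishes between them.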
In effect, BPTT shackles RNNs to the feedforward framework, in which the goal is still to map inputs to outputs, rather than to discover useful dynamics. BPTT is also computationally cumbersome, as every additional recurrent time step extends the computational path that must be retraced in order to update the connections. This complication also renders BPTT biologically implausible. Although the case for backpropagation as potentially biologically plausible has recently been strengthened [124][125][126] , its extension through time is difficult to reconcile with biology or implement efficiently in a finite engineered system for online learning, precisely because it requires unrolling and keeping track of separate copies of each weight as computational cycles are retraced in reverse.
Given these drawbacks, we speculate that a true breakthrough in recurrent vision models will require a training regime that does not rely on BPTT. Rather than optimizing an RNN's state in a finite time window, future RNN training methods might directly target the network's dynamics, or the states that those dynamics are encouraged to converge to. This approach has some history in RNN models of vision. Predictive coding models, for instance, are designed with dynamics that explicitly implement iterative optimization. Such models can update their connections through learning rules that require only the converged network state as input 28 , rather than the entire computational path to this state. Marino et al. 100 recently proposed iterative amortized inference, training inference networks to have recurrent dynamics that improve the network's hypotheses in each iteration, without constraining these dynamics to a particular form (such as predictive coding).

C. Going forward, in circles
We started this review with the puzzling observation that, whereas biological vision is implemented in a profoundly recurrent neural architecture, the most successful neural network models of vision to date are feedforward. We have argued, theoretically and empirically, that vision models will eventually converge to their biological roots and implement more powerful recurrent solutions. One appeal of this view is that it suggests that neuroscientists and engineers may work synergistically, to make progress on common challenges. After all, visual inference, and intelligence more generally, were solved once before.