Computer Vision and Image Understanding

and compare them with approaches taken by computer vision. Based on this comparative analysis of computer and biological vision, we present some recent models in biological vision and highlight a few models that we think are promising for future investigations in computer vision. To this extent, this paper provides new insights and a starting point for investigators interested in the design of biology-based computer vision algorithms, and paves the way for much-needed interaction between the two communities, leading to the development of synergistic models of artificial and biological vision.


Introduction
Biological vision systems are remarkable at extracting and analysing the essential information for vital functional needs such as navigating through complex environments, finding food or escaping from danger. Notably, biological visual systems perform all these tasks with both high sensitivity and strong reliability, even though natural images are noisy, cluttered, highly variable and ambiguous. Still, even simple biological systems can efficiently and quickly solve most of the difficult computational problems that remain challenging for artificial systems, such as scene segmentation, local and global optical flow computation, 3D perception or extracting the meaning of complex objects or movements. All these aspects have been intensively investigated in human psychophysics, and the neuronal underpinnings of visual performance have been scrutinised over a wide range of temporal and spatial scales, from single cells to large cortical networks, so that visual systems are certainly the best-known of all neural systems (see Chalupa and Werner, 2004 for an encyclopaedic review). As a consequence, biological visual computations are certainly the best understood of all cognitive neural systems.
It would seem natural that biological and computer vision research would interact continuously since they target the same goals at task level: extracting and representing meaningful visual information for making actions. Sadly, these interactions have remained weak since the pioneering work of Marr (1982) and colleagues, who attempted to marry the fields of neurobiology, visual psychophysics and computer vision. The unifying idea presented in his influential book entitled Vision was to articulate these fields around the computational problems faced by both biological and artificial systems rather than around their implementation. Despite these efforts, the two research fields have largely drifted apart, partly because of several technical obstacles that obstructed this interdisciplinary agenda for decades, such as the limited capacity of the experimental tools used to probe visual information processing or the limited computational power available for simulations.
With the advent of new experimental and analysis techniques, significant progress has been made towards overcoming these technical obstacles. A new wealth of multiple-scale functional analysis and connectomics information is emerging in brain sciences, and it is encouraging to note that studies of visual systems are at the forefront of this rapid progress (Editorial, 2013). For instance, it is now possible to identify selective neuronal populations and dissect out their circuitry at the synaptic level by combining functional and structural imaging. The first series of studies applying such techniques have focused on understanding visual circuits, at both retinal (Helmstaedter et al., 2013) and cortical (Bock et al., 2011) levels. At a wider scale, a quantitative description of the connectivity patterns between cortical areas is now becoming available and, here again, the study of visual cortical networks is pioneering (Markov et al., 2013). A first direct consequence is that detailed large-scale models of visual networks are now available to study the neurobiological underpinnings of information processing at multiple temporal and spatial scales (Chaudhuri et al., 2015; Kim et al., 2014; Potjans and M., 2014). With the emergence of international research initiatives (e.g., the BRAIN and HBP projects, the Allen Institute Atlas), we are certainly at the first steps of a major revolution in brain sciences. At the same time, recent advances in computer architectures now make it possible to simulate large-scale models, something that was unthinkable only a few years ago. For example, the advent of multi-core architectures (Eichner et al., 2009), parallel computing on clusters (Plesser et al., 2007), GPU computing (Pinto and Cox, 2012) and the availability of neuromorphic hardware (Temam and Héliot, 2011) promise to facilitate the exploration of truly bio-inspired vision systems (Merolla et al., 2014).
However, these technological advancements in both computer and brain sciences call for a strong push in theoretical studies. The theoretical difficulties encountered by each field call for a new, interdisciplinary approach to understanding how we process, represent and use visual information. For instance, it is still unclear how the dense network of cortical areas fully analyses the structure of the external world, and part of the problem may come from framing the wrong questions about mid-level and high-level vision (Gur, 2015; Kubilius et al., 2014). In short, we cannot see the forest (representing the external world) for the trees (e.g., solving face and object recognition), and reconciling biological and computer vision is a timely joint venture for solving these challenges.
The goal of this paper is to argue that novel computer vision approaches can be developed from these biological insights. It is a manifesto for developing and scaling up models rooted in experimental biology (neurophysiology, psychophysics, etc.), leading to an exciting synergy between studies in computer vision and biological vision. Our conviction is that the exploding knowledge about biological vision, the new simulation technologies and the identification of some ill-posed problems have reached a critical point that will nurture a new departure for a fruitful interdisciplinary endeavour. The resurgence of interest in biological vision as a rich source of design principles for computer vision is evidenced by recent books (Cristóbal et al., 2015; Frisby and Stone, 2010; Hérault, 2010; Liu et al., 2015; Petrou and Bharat, 2008; Pomplun and Suzuki, 2012) and survey papers (Turpin et al., 2014). However, we feel that these studies, first, were more focused on computational neuroscience than on computer vision and, second, remain largely influenced by the hierarchical feedforward approach, thus ignoring the rich dynamics of feedback and lateral interactions.
This article is organised as follows. In Section 2, we revisit the classical view of the brain as a hierarchical feedforward system (Kruger et al., 2013). We point out its limitations and portray a modern perspective of the organisation of the primate visual system and its multiple spatial and temporal anatomical and functional scales. In Section 3, we appraise the current computational and theoretical frameworks used to study biological vision and re-emphasise the importance of making task solving the main motivation for looking into biology. In order to relate studies in biological vision to computer vision, we focus in Section 4 on three archetypal tasks: sensing, segmentation and motion estimation. These three tasks are illustrative because they have similar basic-level representations in biological and artificial vision. However, the role of the intricate, recurrent neuronal architecture in figuring out neural solutions must be re-evaluated in the light of recent empirical advances. For each task, we will start by highlighting some of these recently identified biological mechanisms that can inspire computer vision. We will give a structural view of these mechanisms, relate these structural principles to prototypical models from both biological and computer vision and, finally, detail potential insights and perspectives for rooting new approaches in the strengths of both fields. Lastly, based on the prototypical tasks reviewed throughout this article, we will propose in Section 5 three ways to identify which studies from biological vision could be leveraged to advance computer vision algorithms.

The classical view of biological vision
The classical view of biological visual processing that has been conveyed to the computer vision community from visual neurosciences is that of an ensemble of deep cortical hierarchies (see Kruger et al., 2013 for a recent example). Interestingly, this computational idea was proposed in computer vision by Marr (1982) even before the anatomical hierarchy was fully detailed in different species. Nowadays, there is general agreement about this hierarchical organisation and its division into parallel streams in human and non-human primates, as supported by a large body of anatomical and physiological evidence (see Markov et al., 2013; Ungerleider and Haxby, 1994; Van Essen, 2003 for reviews). Fig. 1(a)-(b) illustrates this classical view, where information flows from the retina to the primary visual cortex (area V1) through two parallel retino-geniculo-cortical pathways. The magnocellular (M) pathway conveys coarse, luminance-based spatial inputs with a strong temporal sensitivity towards layer 4Cα of area V1, where a characteristic population of cells, called stellate neurons, immediately transmits the information to higher cortical areas involved in motion and space processing. A slower, parvocellular (P) pathway conveys retino-thalamo-cortical inputs with high spatial resolution but low temporal sensitivity, entering area V1 through layer 4Cβ. Such color-sensitive input flows more slowly within the different layers of V1 and then to cortical area V2 and a network of cortical areas involved in form processing.
The existence of these two parallel retino-thalamo-cortical pathways resonated with neuropsychological studies investigating the effects of parietal and temporal cortex lesions (Ungerleider and Mishkin, 1982), leading to the popular, but highly schematic, two visual systems theory (Milner and Goodale, 2008; Ungerleider and Mishkin, 1982; Ungerleider and Haxby, 1994), in which a dorsal stream is specialised in motion perception and the analysis of the spatial structure of the visual scene whereas a ventral stream is dedicated to form perception, including object and face recognition.
Fig. 1. (a) The two visual pathways theory states that primate visual cortex can be split between dorsal and ventral streams originating from the primary visual cortex (V1). The dorsal pathway runs towards the parietal cortex, through motion areas MT and MST. The ventral pathway propagates through area V4 all along the temporal cortex, reaching area IT. (b) These ventral and dorsal pathways are fed by parallel retino-thalamo-cortical inputs to V1, known as the Magnocellular (M) and Parvocellular (P) pathways. (c) The hierarchy consists of a cascade of neurons encoding more and more complex features through convergent information. Consequently, these neurons integrate visual information over larger and larger receptive fields. (d) Illustration of a machine learning algorithm for, e.g., object recognition, following the same hierarchical processing, where a simple feed-forward convolutional network implements two bracketed pairs of a convolution operator followed by a pooling layer (adapted from Cox and Dean, 2014).
At the computational level, the deep hierarchies concept was reinforced by the linear systems approach used to model low-level visual processing. As illustrated in Fig. 1(c), neurons in the primary visual system have small receptive fields, tiling a high-resolution retinotopic map. The spatiotemporal structure of each receptive field corresponds to a processing unit that locally filters a given property of the image. In V1, low-level features such as orientation, direction, color or disparity are encoded in different subpopulations forming a sparse and overcomplete representation of local feature dimensions. These representations feed several parallel cascades of converging influences so that, as one moves along the hierarchy, receptive fields become larger and larger and encode features of increasing complexity and conjunctions thereof (see DeYoe and Van Essen, 1988; Roelfsema et al., 2000 for reviews). For instance, along the motion pathway, V1 neurons are weakly direction-selective but converge onto the middle temporal (MT) area, where cells can precisely encode direction and speed in a form-independent manner. These cells project to neurons in the medial superior temporal (MST) area, where receptive fields cover a much larger portion of the visual field and encode basic optic flow patterns such as rotation, translation or expansion. More complex flow fields can be decoded by parietal neurons, which integrate this information and combine it with extra-retinal signals about eye movements or self-motion (Bradley and Goyal, 2008; Orban, 2008). The same logic applies along the form pathway, where V1 neurons encode the orientation of local edges. Through a cascade of convergence, units with receptive fields sensitive to more and more complex geometrical features are generated, so that neurons in the infero-temporal (IT) area are able to encode objects or faces in a viewpoint-invariant manner (see Fig. 1(c)).
Object recognition is a prototypical example where the canonical view of hierarchical feedforward processing nearly perfectly integrates anatomical, physiological and computational knowledge. This synergy has resulted in realistic computational models of receptive fields where converging outputs from linear filters are nonlinearly combined from one step to the next (Nandy et al., 2013). It has also inspired feed-forward models working at task level for object categorisation (Serre et al., 2007a; 2007b). As illustrated in Fig. 1(d), prominent machine learning solutions for object recognition follow the same feed-forward, hierarchical architecture, where linear and nonlinear stages are cascaded between multiple layers representing more and more complex features (Hinton and Osindero, 2006).
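The feed-forward cascade of Fig. 1(d) can be made concrete with a minimal NumPy sketch, given here for illustration only. It assumes a single-channel image, random (untrained) 3 x 3 filters and 2 x 2 non-overlapping max pooling; the function names conv2d_valid, max_pool, stage and receptive_field_size are hypothetical helpers introduced for this example and do not come from any of the cited models. The sketch shows how each stage of linear filtering, rectification and pooling shrinks the map while enlarging the portion of the input seen by one unit.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h = (x.shape[0] // size) * size
    w = (x.shape[1] // size) * size
    x = x[:h, :w]
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def stage(x, kernel, pool=2):
    """One stage: linear filtering, half-wave rectification, then pooling."""
    return max_pool(np.maximum(conv2d_valid(x, kernel), 0.0), pool)

def receptive_field_size(layers):
    """Input-space receptive field width after a list of (kernel, stride) layers."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k - 1) input steps
        jump *= s              # striding spreads later units further apart
    return rf

rng = np.random.default_rng(0)
image = rng.random((32, 32))            # stand-in for an input image
k1 = rng.standard_normal((3, 3))        # untrained filters, illustration only
k2 = rng.standard_normal((3, 3))

s1 = stage(image, k1)                   # first convolution + pooling pair
s2 = stage(s1, k2)                      # second convolution + pooling pair

one_stage = [(3, 1), (2, 2)]            # (kernel, stride) for conv then pool
rf1 = receptive_field_size(one_stage)
rf2 = receptive_field_size(one_stage * 2)
print(s1.shape, s2.shape, rf1, rf2)
```

With these assumed sizes, a unit after the first stage covers a 4-pixel-wide region of the input and a unit after the second stage covers a 10-pixel-wide region, which mirrors the growth of receptive fields along the hierarchy described above.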

Going beyond the hierarchical feed-forward view
Despite its success in explaining some basic aspects of human perception such as object recognition, the hierarchical feedforward theory remains highly schematic. Many aspects of biological visual processing, from anatomy to behaviour, do not fit in this cartoon-like framing. Important aspects of human perception such as detail preservation, multi-stability, active vision and space perception, for example, cannot be adequately explained by a hierarchical cascade of expert cells. Furthermore, taking into account high-level cognitive skills such as top-down attention, visual cognition or concept representation requires reconsidering these deep hierarchies. In particular, the dynamics of neural processing are much more complex than the hierarchical feed-forward abstraction, and very important connectivity patterns such as lateral and recurrent interactions must be taken into account to overcome several pitfalls in understanding and modelling biological vision. In this section, we highlight some of these key novel features that should greatly influence computational models of visual processing. We also believe that identifying some of these problems could help in reunifying natural and artificial vision and in addressing the more challenging questions that must be solved to build adaptive and versatile artificial systems that are deeply bio-inspired.
Vision processing starts at the retina and the lateral geniculate nucleus (LGN) levels. Although this may sound obvious, the role played by these two structures seems largely underestimated. Indeed, most current models take images as inputs rather than their retina-LGN transforms. Thus, by ignoring what is being processed at these levels, one could easily miss some key properties needed to understand what makes biological visual systems so efficient. At the retina level, the incoming light is transformed into electrical signals. This transformation was originally described by using the linear systems approach to model the spatio-temporal filtering of retinal images (Enroth-Cugell and Robson, 1984). More recent research has changed this view, and several cortex-like computations have been identified in the retina of different vertebrates (see Gollisch and Meister, 2010; Kastner and Baccus, 2014 for reviews, and more details in Section 4.1). The fact that retinal and cortical levels share similar computational principles, albeit working at different spatial and temporal scales, is an important point to consider when designing models of biological vision. Such a change in perspective would have important consequences. For example, rather than considering how cortical circuits achieve high temporal precision of visual processing, one should ask how densely interconnected cortical networks can maintain the high temporal precision of the retinal encoding of static and moving natural images (Field and Chichilnisky, 2007), or how miniature eye movements shape its spatiotemporal structure (Rucci and Victor, 2015).
Similarly, the LGN and other visual thalamic nuclei (e.g., pulvinar) should no longer be considered as pure relays on the route from retina to cortex. For instance, cat pulvinar neurons exhibit some properties classically attributed to cortical cells, such as pattern motion selectivity (Merabet et al., 1998). Strong centre-surround interactions have been shown in monkey LGN neurons, and these interactions are under the control of feedback cortico-thalamic connections (Jones et al., 2012). These strong cortico-geniculate feedback connections might explain why parallel retino-thalamo-cortical pathways are highly adaptive, dynamical systems (Briggs and Usrey, 2008; Cudeiro and Sillito, 2006; Nandy et al., 2013). In line with the computational constraints discussed before, both centre-surround interactions and feedback modulation can shape the dynamical properties of cortical inputs, maintaining the temporal precision of thalamic firing patterns during natural vision (Andolina et al., 2007).
Overall, recent sub-cortical studies give us three main insights. First, we should not underestimate the amount of processing done before visual inputs reach the cortex, and we must instead consider that the retinal code is already highly structured, sparse and precise. Thus, we should consider how cortex takes advantage of these properties when processing naturalistic images. Second, some of the computational and mechanistic rules designed for predictive coding or feature extraction can be much more generic than previously thought, and the retina-LGN processing hierarchy may again become a rich source of inspiration for computer vision. Third, the exact implementation (what is being done and where) may not be so important, as it varies from one species to another, but the cascade of basic computational steps may be an important principle to retain from biological vision.
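As an illustration of the linear systems description of retinal filtering mentioned above, the following sketch builds a centre-surround receptive field as a difference of two Gaussians and applies it to two small stimuli. The difference-of-Gaussians form is a textbook idealisation chosen for this example, and the kernel size and the two widths (sigma_center, sigma_surround) are arbitrary assumptions; the sketch covers only the spatial part of the filtering and none of the nonlinear retinal computations discussed in the recent literature.

```python
import numpy as np

def gaussian_2d(size, sigma):
    """Unit-sum 2D Gaussian sampled on a size x size grid centred on the patch."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def dog_kernel(size=15, sigma_center=1.0, sigma_surround=3.0):
    """Centre-surround kernel: narrow excitatory centre minus broad inhibitory surround."""
    return gaussian_2d(size, sigma_center) - gaussian_2d(size, sigma_surround)

def filter_response(patch, kernel):
    """Linear response of one unit: weighted sum of the patch under its receptive field."""
    return float(np.sum(patch * kernel))

kernel = dog_kernel()

uniform = np.ones((15, 15))             # spatially uniform illumination
spot = np.zeros((15, 15))
spot[7, 7] = 1.0                        # small bright spot on the centre

r_uniform = filter_response(uniform, kernel)   # centre and surround cancel out
r_spot = filter_response(spot, kernel)         # centre dominates
print(r_uniform, r_spot)
```

Because both Gaussians are normalised to unit sum, the unit is silent for uniform illumination and responds to local contrast, which is one simple sense in which the retinal output is already a structured transform of the image rather than the image itself.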
Functional and anatomical hierarchies are not always identical. The deep cortical hierarchy depicted in Fig. 1(b) is primarily based on gross anatomical connectivity rules (Zeki, 1993). Its functional counterpart is the increasing complexity of local processing and information content of expert cells as we go deeper along the anatomical hierarchy. There is, however, a flaw in attributing the functional hierarchy directly to its anatomical counterpart. The complexity of visual processing does increase from striate to extrastriate and associative cortices, but this is not attributable only to feed-forward convergence. A quick glance at the actual cortical connectivity pattern in non-human primates is sufficient to dispel this textbook view of how the visual brain works (Hegdé and Felleman, 2007; Markov et al., 2013).
For example, a classical view is that the primary visual cortex represents luminance-based edges whereas higher-order image properties such as illusory contours are encoded at the next processing stages along the ventral path (e.g., areas V2 and V4). Recent studies have shown, however, that illusory contours, as well as border ownership, can also be represented in macaque area V1 (Lee and Nguyen, 2001; Zhou et al., 2000). Moreover, multiple binocular and monocular depth cues can be used to reconstruct occluded surfaces in area V1 (Sugita, 1999). Thus, the hierarchy of shape representation now appears more opaque than previously thought (Hegdé and Van Essen, 2007), and much evidence indicates that the intricate connectivity within and between early visual areas is decisive for the emergence of figure-ground segmentation and proto-object representations (von der Heydt, 2015; [262]). Another strong example is visual motion processing. The classical feed-forward framework proposes that MT cells (and not V1 cells) are true speed-tuned units. It has been thought for decades that V1 cells cannot encode the speed of a moving pattern independently of its spatiotemporal frequency content (Rodman and Albright, 1987). However, recent studies have shown that there are V1 complex cells which are speed tuned. The differences between V1 and MT regarding speed coding are more consistent with a distributed representation, where slow speeds are represented in V1 and high speeds in area MT, than with pure serial processing. Decoding visual motion information at multiple scales to elaborate a coherent motion percept must therefore involve a large-scale cortical network of densely and recurrently interconnected areas. Such a network can extend to cortical areas along the ventral stream in order to integrate form and complex global motion inputs (Hedges et al., 2011; Zhuo et al., 2003). One final example concerns the temporal dynamics of visual processing.
The temporal hierarchy is not a carbon copy of the anatomical hierarchy depicted by Felleman and Van Essen. The onset of a visual stimulus triggers fast and slow waves of activation travelling throughout the different cortical areas. The fast activation in particular bypasses several major steps along both dorsal and ventral pathways to reach frontal areas even before area V2 is fully activated (for a review, see Lamme and Roelfsema, 2000). Moreover, different time scales of visual processing emerge not only from the feed-forward hierarchy of cortical areas but also from the long-range connectivity motifs and the dense recurrent connectivity of local sub-networks (Chaudhuri et al., 2015). Such a rich repertoire of temporal windows, ranging from fast, transient responses in primary visual cortex to persistent activity in association areas, is critical for implementing a series of complex cognitive tasks from low-level processing to decision-making.
These three different examples highlight the fact that a more complex view of the functional hierarchy is emerging. The dynamics of biological vision results from the interactions between different cortical streams operating at different speeds but also relies on a dense network of intra-cortical and inter-cortical (e.g., feedback) connections. Designing better vision algorithms could be inspired by this recurrent architecture where different spatial and temporal scales can be mixed to represent visual motion or complex patterns with both high reliability and high resolution.

Dorsal/ventral separation is an over-simplification. A strong limitation of grounding a theoretical framework of sensory processing upon anatomical data is that the complexity of connectivity patterns inevitably leads to undesired simplifications in order to build a coherent view of the system. Moreover, such a framework misses the complexity of the dynamical functional interactions between areas or cognitive sub-networks. A good example of such bias is the classical dorsal/ventral separation. First, interactions between parallel streams can be tracked down to the primary visual cortex, where a detailed analysis of layer 4 connectivity has shown that both Magnocellular and Parvocellular signals can be intermixed and propagated to areas V2 and V3 and, therefore, to the subsequent ventral stream (Yabuta et al., 2001). Such a mixing of M- and P-like signals could explain why fast and coarse visual signals can rapidly tune the most ventral areas along the temporal cortex and therefore shape face recognition mechanisms (Giese and Poggio, 2003). Second, motion psychophysics has demonstrated a strong influence of form signals on local motion analysis and motion integration (Mather et al., 2012). These interactions have been shown to occur at different levels of the two parallel hierarchies, from primary visual cortex to the superior temporal sulcus and the parietal cortex (Orban, 2008). They provide many computational advantages used by the visual motion system to resolve motion ambiguities, interpolate occluded information, segment the optical flow or recover the 3D structure of objects. Third, there are strong interactions between color and motion information, through mutual interactions between cortical areas V4 and MT. It is interesting to note that these two particular areas were previously attributed to the ventral and dorsal pathways, respectively (DeYoe and Van Essen, 1988; Livingstone and Hubel, 1988). Such a strict dichotomy is outdated, as both V4 and MT areas interact to extract and mix these two dimensions of visual information.
These interactions are only a few examples, mentioned here to highlight the need for a more realistic and dynamical model of biological visual processing. Although the coarse division between ventral and dorsal streams remains valid, a closer look at these functional interactions highlights the existence of multiple links, occurring at many levels along the hierarchy. Each stream is traversed by successive waves of fast/coarse and slow/precise signals so that visual representations are gradually shaped (Roelfsema, 2005). It is now timely to consider the intricate networks of intra- and inter-cortical interactions to capture the dynamics of biological vision. Clearly, a new theoretical perspective on the cortical functional architecture would be highly beneficial to both biological and artificial vision research.
A hierarchy embedded within a dynamical recurrent system. We have already mentioned that spatial and temporal hierarchies do not necessarily coincide, as information flows can bypass some cortical areas through fast cortico-cortical connections. This observation led to the idea that fast inputs carried by the Magnocellular stream can travel quickly across the cortical networks to shape each processing stage before it is reached by the fine-grained information carried by the Parvocellular retino-thalamo-cortical pathway. Such dynamics are consistent with the feed-forward deep hierarchy and are used by several computational models to explain fast, automatic pattern recognition (Rousselet et al., 2004; Thorpe, 2009).
Several other properties of visual processing are more difficult to reconcile with the feed-forward hierarchy. Visual scenes are crowded and it is not possible to process every one of their details. Moreover, visual inputs are often highly ambiguous and can lead to different interpretations, as evidenced by perceptual multistability. Several studies have proposed that the highly recurrent connectivity motif of the primate visual system plays a crucial role in this processing. At the theoretical level, several authors recently resurrected the idea of a "reversed hierarchy" where high-level signals are back-propagated to the earliest visual areas in order to link low-level visual processing, high-resolution representation and cognitive information (Ahissar and Hochstein, 2004; Bullier, 2001; Gur, 2015; Hochstein and Ahissar, 2002). Interestingly, this idea was originally proposed more than three decades earlier by Peter Milner in the context of visual shape recognition (Milner, 1974), and then quickly spread to computer vision research, leading to novel algorithms for top-down modulation, attention and scene parsing (e.g., Fukushima, 1987; Tsotsos, 1993; Tsotsos et al., 1995). At the computational level, Lee and Mumford (2003) reconsidered the hierarchical framework by proposing that concatenated feed-forward/feedback loops in the cortex could serve to integrate top-down prior knowledge with bottom-up observations. This architecture generates a cascade of optimal inference along the hierarchy (Lee and Mumford, 2003; Roelfsema et al., 2000; Rousselet et al., 2004; Thorpe, 2009). Several computational models have used such recurrent computation for surface motion integration (Bayerl and Neumann, 2004; Perrinet and Masson, 2012; Tlapale et al., 2010), contour tracing (Brosch et al., 2015a), or figure-ground segmentation (Roelfsema et al., 2002).
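The idea of integrating top-down prior knowledge with bottom-up observations can be illustrated by a single Bayesian update over two competing interpretations of an ambiguous stimulus. This is a deliberately minimal sketch and not the hierarchical model of Lee and Mumford (2003): the two-hypothesis setting, the numerical values and the function name posterior are assumptions made only for this example.

```python
import numpy as np

def posterior(likelihood, prior):
    """Combine bottom-up evidence with a top-down prior (Bayes' rule, discrete case)."""
    unnormalised = np.asarray(likelihood, dtype=float) * np.asarray(prior, dtype=float)
    return unnormalised / unnormalised.sum()

# Bottom-up evidence alone cannot decide between the two interpretations.
bottom_up = np.array([0.5, 0.5])

# Hypothetical top-down signal (context, attention or expectation)
# favouring the first interpretation.
top_down = np.array([0.8, 0.2])

flat_prior = np.array([0.5, 0.5])

without_feedback = posterior(bottom_up, flat_prior)   # remains ambiguous
with_feedback = posterior(bottom_up, top_down)        # ambiguity is resolved
print(without_feedback, with_feedback)
```

In a recurrent hierarchy, updates of this kind would be applied repeatedly, the posterior at one level serving as the prior for the level below; the single step shown here only illustrates why feedback can disambiguate inputs that a purely feed-forward pass leaves undecided.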
Empirical evidence for a role of feedback in support of these theories has long been difficult to gather. It was thus difficult to identify the constraints on top-down modulations that are known to play a major role in the processing of complex visual inputs, through selective attention, prior knowledge or action-related internal signals. However, new experimental approaches are beginning to give a better picture of their role and their dynamics. For instance, selective inactivation studies have begun to dissect the role of feedback signals in the contextual modulation of primate LGN and V1 neurons (Cudeiro and Sillito, 2006). The emergence of genetically encoded optogenetic probes targeting the feedback pathways in mouse cortex opens a new era of intense research on the role of feed-forward and feedback circuits (Issacson and Scanziani, 2011; Luo et al., 2008). Overall, early visual processing now appears to be strongly influenced by different top-down signals related to attention, working memory or even reward mechanisms, to mention just a few. These new empirical studies pave the way for a more realistic perspective on visual perception where both sensory inputs and brain states must be taken into account when, for example, modelling figure-ground segmentation, object segregation and target selection (see Kafaligonul et al., 2015; Lamme and Roelfsema, 2000; Squire et al., 2013 for recent reviews).
The role of attention is illustrative of this recent trend. Mechanisms of bottom-up and top-down attentional modulation in primates have been extensively investigated over the last three decades. Spatial and feature-based attentional signals have been shown to selectively modulate the sensitivity of visual responses even in the earliest visual areas (Motter, 1993; 2000). These works have been a vivid source of inspiration for computer vision in searching for a solution to the problems of feature selection, information routing and task-specific attentional bias (see Jancke et al., 2004; Tsotsos, 2011), as illustrated for instance by the Selective Tuning algorithm of Tsotsos and collaborators (Tsotsos et al., 1995). More recent work in non-human primates has shown that attention can also affect the tuning of individual neurons (Ibos and Freedman, 2014). It has also become evident that one needs to consider the effects of attention on population dynamics and the efficiency of neural coding (e.g., by decreasing noise correlation (Cohen and Maunsell, 2009)). Intensive empirical work is now targeting the respective contributions of the frontal (e.g., task-dependency) and parietal (e.g., saliency maps) networks in the control of attention and its coupling with other cognitive processes such as reward learning or working memory (see Buschman and Kastner, 2015 for a recent review). These empirical studies led to several computational models of attention (see Bylinskii et al., 2015; Tsotsos, 2011; Tsotsos et al., 2015 for recent reviews) based on generic computations (e.g., divisive normalisation (Heeger, 2009), synchrony (Fries, 2005), or feedback-feed-forward interactions (Khorsand et al., 2015)). Nowadays, attention appears to be a highly dynamical, rapidly changing process that recruits a highly flexible cortical network depending on behavioural demands and in strong interaction with other cognitive networks.
The role of lateral connectivity in information diffusion. The processing of a local feature is always influenced by its immediate surround in the image. Feedback is one potential mechanism for implementing context-dependent processing but its spatial scale is rather large, corresponding to far-surround modulation (Angelucci and Bressloff, 2006). Visual cortical areas, and in particular area V1, are characterised by dense short- and long-range intra-cortical interactions. Short-range connections are involved in proximal centre-surround interactions and their dynamics fit with the contextual modulation of local visual processing (Reynaud et al., 2012). This connectivity pattern has been overly simplified as overlapping, circular excitatory and inhibitory areas of the non-classical receptive field. In area V1, these sub-populations were described as being tuned for orthogonal orientations, corresponding to excitatory input from iso-oriented domains and inhibitory input from cross-oriented ones. In higher areas, similar simple schemes have been proposed, such as the opposite direction tuning of centre and surround areas of MT and MST receptive fields (Born and Bradley, 2005). Lastly, these surround inputs have been proposed to implement generic neural computations such as normalisation or gain control (Carandini and Heeger, 2011).
From the recent literature, a more complex picture of centre-surround interactions has emerged where non-classical receptive fields are highly diverse in terms of shape or feature selectivity (Cavanaugh et al., 2002; Webb et al., 2003; Xiao et al., 1995). Such diversity would result from complex connectivity patterns where neurons tuned for different features (e.g., orientation, direction, spatial frequency) can be dynamically interconnected. For example, in area V1, the connectivity pattern becomes less and less specific with increasing distance from the recording site. Moreover, far away points in the image can also interact through the long-range interactions which have been demonstrated in area V1 of many species. Horizontal connections extend over millimetres of cortex and propagate activity at a much lower speed than feed-forward and feedback connections (Bullier, 2001). The functional role of these long-range connections is still unclear. They most probably support the waves of activity that travel across the V1 cortex either spontaneously or in response to a visual input (Muller et al., 2014; Sato et al., 2012). They can also implement the spread of cortical activity underlying contrast normalisation (Reynaud et al., 2012), the spatial integration of motion and contour signals (Reynaud et al., 2012), or the shaping of low-level percepts (Jancke et al., 2004).
A neural code for vision? How information is encoded in neural systems is still highly disputed and remains an active field of theoretical and empirical research. Once again, visual information processing has been widely used as a seminal framework to decipher neural coding principles and their application in computer science. The earliest studies on neuronal responses to visual stimuli suggested that information is encoded in the mean firing rate of individual cells and its gradual change with visual input properties.
For instance, cells in V1 labelled as feature detectors are classified based upon their best response selectivity (the stimulus that evokes maximal firing of the neuron) and several non-linear properties such as gain control or context modulations, which usually vary smoothly with respect to a few attributes such as orientation, contrast and velocity, leading to the development of tuning curves and the receptive field doctrine. Spiking and mean-field models of visual processing are based on these principles.
Aside from changes in mean firing rates, another interesting feature of neural coding is the temporal signature of neural responses and the temporal coherence of activity between ensembles of cells, providing an additional potential dimension for specifically linking, or grouping, distant and different features (von der Malsburg, 1981; Von der Malsburg, 1999; Singer, 1999). In networks of coupled neuronal assemblies, associations of related sensory features are found to induce oscillatory activities in a stimulus-induced fashion (Eckhorn et al., 1990). The establishment of temporal coherence has been suggested to solve the so-called binding problem of task-relevant features through synchronisation of neuronal discharge patterns in addition to the structural linking patterns (Engel and Singer, 2001). Such synchronisation might even operate over different areas and therefore seems to support the rapid formation of neuronal groups and functional subnetworks and the routing of signals (Buschman and Kastner, 2015; Fries, 2005). However, the view that temporal oscillatory states might define a key element of feature coding and grouping has been challenged by different studies and the exact contribution of these temporal aspects of neural codes is not yet fully elucidated (e.g., Shadlen and Movshon, 1999 for a critical review). As a consequence, only a few bio-inspired and computer vision models rely on the temporal coding of information.
Although discussing the many facets of visual information coding is far beyond the scope of this review, one needs to briefly recap some key properties of neural coding in terms of tuning functions. Representations based on tuning functions can be the basis for the synergistic approach advocated in this article. Neurons are tuned to one or several features, i.e., they exhibit a strong response when stimuli contain a preferred feature such as local luminance-defined edges or proto-objects and a low or no response when such features are absent. As a result, neural feature encoding is sparse, distributed over populations (see Pouget et al., 2013; Shamir, 2014) and highly reliable (Perrinet, 2015), at the same time. Moreover, these coding properties emerge from the different connectivity rules introduced above. The tuning functions of individual cells are very broad such that the high behavioural performance observed empirically can be achieved only through some nonlinear or probabilistic decoding of population activities (Pouget et al., 2013). This could also imply that visual information is represented within distributed population codes rather than by grandmother cells (Lehky et al., 2013; Pouget et al., 2003). Tuning functions are dynamical: they can be sharpened or shifted over time (Shapley et al., 2003). Neural representation could also rely on spike timing, and the temporal structure of the spiking patterns can carry additional information about the dynamics of transient events (Perrinet et al., 2004; Thorpe et al., 2001). Overall, the visual system appears to use different types of codes, an advantage for representing high-dimensional inputs (Rolls and Deco, 2010).
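The idea that broadly tuned cells can jointly support a precise estimate can be sketched with a toy population code. The example below is only a minimal illustration under stated assumptions: the von Mises tuning shape, the population size, the firing rates, the 500 ms counting window and the function names are all hypothetical, and the population-vector read-out is one of the simplest decoders, not the probabilistic schemes reviewed in the works cited above.

```python
import numpy as np

def tuning_curves(theta, preferred, kappa=2.0, r_max=30.0):
    """Mean rates (spikes/s) of broadly tuned cells for stimulus direction theta (rad).

    Von Mises tuning is an illustrative assumption; kappa sets the tuning width.
    """
    return r_max * np.exp(kappa * (np.cos(theta - preferred) - 1.0))

def population_vector(counts, preferred):
    """Decode direction as the angle of the count-weighted sum of preferred directions."""
    x = np.sum(counts * np.cos(preferred))
    y = np.sum(counts * np.sin(preferred))
    return np.arctan2(y, x) % (2.0 * np.pi)

def circular_error(a, b):
    """Absolute angular difference in radians, wrapped to [0, pi]."""
    return np.abs(np.angle(np.exp(1j * (a - b))))

rng = np.random.default_rng(0)
preferred = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)  # hypothetical population
true_theta = np.deg2rad(70.0)
mean_rates = tuning_curves(true_theta, preferred)
counts = rng.poisson(mean_rates * 0.5)  # spike counts in an assumed 500 ms window
estimate = population_vector(counts, preferred)
print(f"decoding error: {np.rad2deg(circular_error(estimate, true_theta)):.2f} deg")
```

Although each simulated cell responds over a wide range of directions and its spike count is noisy, pooling over the population typically yields an error of a few degrees, which is the qualitative point made in the text.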

Marr's three levels of analysis
At the conceptual level, much of the current computational understanding of biological vision is based on the influential theoretical framework defined by Marr (1982) and colleagues. Their key message was that complex systems, like brains or computers, must be studied and understood at three levels of description: the computational task carried out by the system resulting in the observable behaviour, the instance of the algorithm used by the system to solve the computational task and the implementation that is embodied by a given system to execute the algorithm. Once a functional framework is defined, the computational and implementation problems can be distinguished, so that in principle a given solution can be embedded into different biological or artificial physical systems. This approach has inspired much experimental and theoretical research in the field of vision (Daugman, 1988; Granlund, 1978; Hildreth and Koch, 1987; Poggio, 2012). The cost of this clear distinction between levels of description is that many of the existing models have only a weak relationship with the actual architecture of the visual system or even with a specific algorithmic strategy used by biological systems. Such a dichotomy contrasts with the growing evidence that understanding cortical algorithms and understanding cortical networks are deeply coupled (Hildreth and Koch, 1987). Human perception would still act as a benchmark or a source of inspiring computational ideas for specific tasks (see Andreopoulos and Tsotsos, 2013, for a good example about object recognition). But the risk of ignoring the structure-function dilemma is that computational principles would drift away from biology, becoming more and more metaphorical, as illustrated by the fate of the Gestalt theory. The bio-inspired research stream for both computer vision and robotics aims at reducing this fracture (e.g., Cristóbal et al., 2015; Frisby and Stone, 2010; Hérault, 2010; Petrou and Bharat, 2008 for recent reviews).

From circuits to behaviours
A key milestone in computational neuroscience is to understand how neural circuits lead to animal behaviours. Carandini (2012) argued that the gap between circuits and behaviour is too wide to bridge without the help of an intermediate level of description, namely that of neuronal computation. But how can we escape from the dualism between computational algorithm and implementation as introduced by Marr's approach? The solution depicted in Carandini (2012) is based on three principles. First, some levels of description might not be useful to understand functional problems. In particular, the sub-cellular and network levels are decoupled. Second, the level of neuronal computation can be divided into building blocks forming a core set of canonical neural computations such as linear filtering, divisive normalisation, recurrent amplification, coincidence detection, cognitive maps and so on. These standard neural computations are widespread across sensory systems (Fregnac and Bathelier, 2015). Third, these canonical computations occur in the activity of individual neurons and especially of populations of neurons. In many instances, they can be related to stereotyped circuits such as feed-forward inhibition, recurrent excitation-inhibition or the canonical cortical micro-circuit for signal amplification (see Silies et al., 2014 for a series of reviews). Thus, understanding the computations carried out at the level of individual neurons and neural populations would be the key to unlocking the algorithmic strategies used by neural systems. This solution appears to be essential to capture both the dynamics and the versatility of biological vision. With such a perspective, computational vision would regain its critical role when mapping circuits to behaviours and could rejuvenate the interest in the field of computer vision, not only by highlighting the limits of existing algorithms or hardware but also by providing new ideas.
At this cost, visual and computational neurosciences would again be a source of inspiration for computer vision. To illustrate this joint venture, Fig. 2 depicts the relationships between the different functional and anatomical scales of cortical processing and their mapping onto the three computational problems encountered when designing any artificial system: how, what and why.
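Two of the canonical computations listed in this section, linear filtering and divisive normalisation, can be chained in a few lines. The sketch below assumes the commonly used form in which each unit's exponentiated drive is divided by a constant plus the drive pooled over a set of units; the stimulus, the two kernels, the exponent and the constant sigma are hypothetical choices made only for illustration.

```python
import numpy as np

def linear_filter_bank(signal, kernels):
    """Linear filtering: correlate a 1-D signal with each kernel (one output row per kernel)."""
    return np.stack([np.convolve(signal, k[::-1], mode="same") for k in kernels])

def divisive_normalisation(drive, sigma=0.1, n=2.0):
    """Divide each unit's drive (raised to n) by sigma**n plus the drive pooled over all filters."""
    energy = np.abs(drive) ** n
    pool = energy.sum(axis=0, keepdims=True)  # normalisation pool at each position
    return energy / (sigma ** n + pool)

# Hypothetical stimulus and filters: a luminance step and two derivative-like kernels.
signal = np.concatenate([np.zeros(50), np.ones(50)])
kernels = [np.array([-0.5, 0.0, 0.5]), np.array([1.0, -2.0, 1.0])]

for contrast in (1.0, 10.0):
    drive = linear_filter_bank(contrast * signal, kernels)
    response = divisive_normalisation(drive)
    # Raw energy at the edge grows with contrast squared; the normalised response saturates.
    print(contrast, (drive[:, 50] ** 2).sum(), response[:, 50].sum())
```

In this toy setting a tenfold increase in contrast multiplies the raw filter energy at the edge by one hundred, whereas the normalised response barely changes, which is the gain-control behaviour usually attributed to this computation.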

Neural constraints for functional tasks
Biological systems exist to solve functional tasks so that an organism can survive. Considering the existing constraints, many biologists consider the brain as a "bag of tricks that passed evolutionary selection", even though some tricks can be usable in different systems or contexts. This biological perspective highlights the fact that understanding biological systems is tightly related to understanding the functional importance of the task at hand. For example, there is in the mouse retina a cell type able to detect small moving objects in the presence of a featureless or stationary background. These neurons could serve as elementary detectors of potential predators arriving from the sky (Zhang et al., 2012). In the same vein, it has recently been found that the outputs of retinal direction-selective cells are kept separated from the other retino-thalamo-cortical pathways to directly influence specific target neurons in mouse V1 (Cruz-Martin et al., 2014). These two very specific mechanisms illustrate how evolution can shape nervous systems. Computation and architecture are intrinsically coupled to find an optimal solution. This could be taken as an argument for ignoring neural implementations when building generic artificial systems. However, there is also evidence that evolution has selected neural micro-circuits implementing generic computations such as divisive normalisation. These neural computations have been shown to play a key role in the emergence of low-level neuronal selectivities. For example, divisive normalisation has been a powerful explanation for many aspects of visual perception, from low-level gain control to attention (Carandini and Heeger, 2011; Reynolds and Heeger, 2009). The role of the feed-forward-feedback connectivity rules of canonical micro-circuits in predictive coding has also been identified (Bastos et al., 2012), and applied in the context of visual motion processing (Dimova and Denham, 2009).
These examples are extrema lying on the continuum of biological structure-function solutions, from the more specific to the more generic. This diversity stresses the need to clarify the functional context of the different computational rules and their performance dynamics so that fruitful comparisons can be made between living and artificial systems. This can help clarify which knowledge from biology is useful for computer vision.
Lastly, these computational building blocks are embedded into a living organism and low-to-high vision levels are constantly interacting with many other aspects of animal cognition (Vetter and Newen, 2014). For example, the way an object is examined (i.e., the way its image is processed) depends on its behavioural context, whether it is going to be manipulated or only scrutinised to identify it. A single face can be analysed in different ways depending upon the social or emotional context. Thus, we must consider the contextual influence of "why" a task is being carried out when integrating information (and data) from biology (Willems, 2011). All the above observations stress the difficulty of understanding biological vision as a highly adapted, plastic and versatile cognitive system where circuits and computation are like the two faces of Janus. However, as described above for recurrent systems, understanding the neural dynamics of versatile top-down modulation can inspire artificial systems regarding how different belief states can be integrated together within the low-level visual representations.

Matching connectivity rules with computational problems
Fig. 2.
Between circuits and behaviour: rejuvenating the Marr approach. The nervous system can be described at different scales of organisation that can be mapped onto three computational problems: how, what and why. All three aspects involve a theoretical description rooted in anatomical, physiological and behavioural data. These different levels are organised around computational blocks that can be combined to solve a particular task.

Fig. 3.
Matching multi-scale connectivity rules and computational problems for the segmentation of two moving surfaces. (a) A schematic view of the early visual stages with their different connectivity patterns: feed-forward (grey), feedback (blue) and lateral (red). (b) A sketch of the problem of moving surface segmentation and its potential implementation in the primate visual cortex. The key processing elements are illustrated as computational problems (e.g., local segregation, surface cues, motion boundaries, motion integration) and corresponding receptive field structures. These receptive fields are highly adaptive and reconfigurable, thanks to the dense interconnections between the different stages/areas.

In Section 2, we have given a brief glimpse of the enormous literature on the intricate networks underlying biological vision. Focusing on primate low-level vision, we have illustrated the richness, the spatial and temporal heterogeneity and the versatility of these connections. We illustrate them in Fig. 3 for a simple case, the segmentation of two moving surfaces. Fig. 3 (a) sketches the main cortical stages needed for a minimal model of surface segmentation (Orban, 2008; Tlapale et al., 2010). Local visual information is transmitted upstream through the retinotopically organised feed-forward projections. In the classical scheme, V1 is seen as a router filtering and sending the relevant information along the ventral (V2, V4) or dorsal (MT, MST) pathways (Kruger et al., 2013). We discussed above how information also flows backward within each pathway as well as across pathways, as illustrated by the connections between V2/V4 and MT in Fig. 3 (Markov et al., 2014). One consequence of these cross-over connections is that MT neurons are able to use both motion and colour information. We have also highlighted that area V1 takes on a more active role where the thalamo-cortical feed-forward inputs and the multiple feedback signals interact to implement contextual modulations over different spatial and temporal scales using generic neural computations such as surround suppression, spatio-temporal normalisation and input selection. These local computations are modulated by short- and long-range intra-cortical interactions such that visual features located far from the non-classical receptive field (or along a trajectory) can influence them (Angelucci and Bullier, 2003).
Each cortical stage implements these interactions, although with different spatial and temporal windows and through different visual feature dimensions. In Fig. 3, these interactions are illustrated within two (orientation and position) of the many cortical maps found in both primary and extra-striate visual areas. At the single neuron level, these intricate networks result in a large diversity of receptive field structures and in complex, dynamical non-linearities. It is now possible to collect physiological signatures of these networks at multiple scales, from single neurons to local networks and networks-of-networks, such that connectivity patterns can be dissected out. In the near future, it will become possible to manipulate specific cell subtypes and therefore change the functional role and the weight of these different connectivities.
How would these connectivity patterns relate to information processing? In Fig. 3 (b), as an example, we sketch the key computational steps underlying moving surface segmentation (Braddick, 1993). Traditionally, each computational step has been attributed to a particular area and to a specific type of receptive field. For instance, local motion computation is done at the level of the small receptive fields of V1 neurons. Motion boundary detectors have been found in area V2 while different subpopulations of MT and MST neurons are responsible for motion integration at multiple scales (see Section 4.3 for references). However, each of these receptive field types is highly context-dependent, as expected from the dense interactions between all these areas. Matching the complex connectivity patterns illustrated in Fig. 3 (a) with the computational dynamics illustrated in Fig. 3 (b) is one of the major challenges in computational neuroscience (Fregnac and Bathelier, 2015). But it could also be a fruitful source of inspiration for computer vision if we were able to draw out the rules and numbers by which the visual system is organised at different scales. So far, only a few computational studies have taken into account this richness and its ability to adaptively encode and predict sensory inputs from natural scenes (e.g., Beck and Neumann, 2010; Bouecke et al., 2011; Tlapale et al., 2011). The goal of this review is to map such recurrent connectivity rules onto the computational blocks and their dynamics. Thus, in Section 4 (see also Tables 1 and 2), we will recap some key papers from the biological vision literature in a task-centric manner in order to show how critical information gathered at different scales and in different contexts can be used to design innovative and high-performing algorithms.
In the context of the long-lasting debate about the precise relationships between structures and functions, we shall briefly mention the recent attempts to derive deeper insight into the processing hierarchy along the cortical ventral pathway. It has been suggested that deep convolutional neural networks (DCNNs) provide a potential framework for modelling biological vision. A directly related question is the degree of similarity between the learning process implemented over several hierarchies in order to build feature layers of different selectivities and the cellular functional properties that have been identified in different cortical areas (Kriegeskorte, 2015). One proposal to generate predictive models of visual cortical function along the ventral path utilises a goal-driven approach to deep learning (Yamins and DiCarlo, 2016). In a nutshell, such an approach optimises network parameters with respect to performance on a task that is behaviourally relevant and then compares the resulting network(s) against neural data. As emphasised here, a key element in such a structural learning approach is to define the task level properly and then map principled operations of the system onto the structure of the system. In addition, several parameters of deep networks are usually defined by hand, such as the number of layers or the number of feature maps within a layer. There have been recent proposals to optimise these automatically, e.g., by extensive searching or using genetic algorithms (Bergstra et al., 2013; Pinto et al., 2009).

Testing biologically-inspired models against both natural and computer vision
The dynamics of biological visual systems have been probed at many different levels, from the psychophysical estimation of perceptual or behavioural performance to the physiological examination of neuronal and circuit properties. This diversity has led to a fragmentation of computational models, each targeting a specific set of experimental conditions, stimuli or responses.
Let us consider visual motion processing in order to illustrate our point. When both neurally and psychophysically motivated models have been developed for a specific task such as motion integration, they have been tested using a limited set of non-naturalistic inputs such as moving bars, gratings and plaid patterns (e.g., Nowlan and Sejnowski, 1994; Rust et al., 2006). These models formalise empirical laws that can explain either the perceived direction or the emergence of neuronal global motion direction preference. However, these models can hardly be translated to velocity estimation in naturalistic motion stimuli since they do not handle scenarios such as a lack of reliable cues or extended motion boundaries. As a consequence, these models are very specific and not directly applicable to generic motion stimuli. To overcome this limitation, a few extended computational models have been proposed that can cope with a broader range of inputs. These computational models handle a variety of complex motion inputs (Tlapale et al., 2010) but the specific algorithms have been tuned to recover coarse attributes of global motion estimation such as the overall perceived direction or the population neuronal dynamics. Such tuning strongly limits their ability to solve tasks such as dense optical flow estimation. Still, their computational principles can be used as building blocks to develop extended algorithms that can handle naturalistic inputs (Solari et al., 2015). Moreover, they can be evaluated against standard computer vision benchmarks (Baker et al., 2011; Butler et al., 2012). What is still missing are detailed physiological and psychophysical data collected with complex scenarios such as natural or naturalistic images in order to further constrain these models.
A lesson to be taken from the above example is that a successful synergistic approach between artificial and natural vision should first establish a common set of naturalistic inputs against which both bio-inspired and computer vision models can be benchmarked and compared. This step is indeed critical for identifying scenarios in which biological vision systems deviate from the task definition adopted by computer vision. Conversely, state-of-the-art computer vision algorithms should also be evaluated relative to human perceptual performance for the classes of stimuli widely used in psychophysics. For the three illustrative tasks to be discussed below, we will show the interest of common benchmarks for comparing biological and computer vision solutions.
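To make the idea of a common benchmark concrete, the two error measures most commonly reported for dense optical flow, the average endpoint error and the average angular error, can be sketched as follows. This is a minimal illustration and not the evaluation code of any particular benchmark; the function names, the (H, W, 2) array layout and the toy flow fields are assumptions.

```python
import numpy as np

def average_endpoint_error(flow_est, flow_gt):
    """Mean Euclidean distance between estimated and ground-truth flow vectors.

    Both inputs are assumed to be arrays of shape (H, W, 2) holding (u, v) in pixels/frame.
    """
    return float(np.mean(np.linalg.norm(flow_est - flow_gt, axis=-1)))

def average_angular_error(flow_est, flow_gt):
    """Mean angle (degrees) between the space-time vectors (u, v, 1) of estimate and truth."""
    u, v = flow_est[..., 0], flow_est[..., 1]
    ug, vg = flow_gt[..., 0], flow_gt[..., 1]
    num = 1.0 + u * ug + v * vg
    den = np.sqrt(1.0 + u**2 + v**2) * np.sqrt(1.0 + ug**2 + vg**2)
    # Clip to guard against rounding slightly outside the domain of arccos.
    return float(np.mean(np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))))

# Toy example: uniform rightward ground-truth motion and a biased estimate.
flow_gt = np.zeros((4, 5, 2))
flow_gt[..., 0] = 1.0
flow_est = flow_gt + np.array([0.3, 0.4])
print(average_endpoint_error(flow_est, flow_gt))
print(average_angular_error(flow_est, flow_gt))
```

Any model that outputs a dense velocity field, whether bio-inspired or not, can be scored this way, which is what makes such measures a convenient meeting point for the two communities.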

Task-based versus general purpose vision systems
Several objections can be raised to question the need for a synergy between biological and artificial vision. A first objection is that biological and artificial systems could serve different aims. In particular, the major aim of biological vision studies is to understand the behaviours and properties of a general-purpose visual system that could subserve different types of perceptions or actions. This generic, encapsulated visual processing machine can then be linked with other cognitive systems in an adaptive and flexible way (see Pylyshyn, 1999; Tsotsos, 2011 for example). By contrast, computer vision approaches are more focused on developing task-specific solutions, with ever-growing efficiency, thanks to advances in algorithms (e.g., LeCun et al., 2015; Mnih et al., 2015) supported by growing computing power. A second objection is that the brain might not use the same general-purpose (Euclidean) description of the world that Marr postulated (Warren, 2012). Thus perception may not use the same set of low-level descriptors as computer vision, dooming the search for common early algorithms. A third, more technical objection is related to the low performance of most (if not all) current bio-inspired vision algorithms when solving a specific task (e.g., face recognition) compared to state-of-the-art computer vision solutions. Moreover, bio-inspired models are still too often based on over-simplistic inputs and conditions and not sufficiently challenged with high-dimensional inputs such as complex natural scenes or movies. Finally, artificial systems can solve a particular task with greater efficiency than human vision, for instance, challenging the need for bio-inspiration.
These objections question the interest of grounding computer vision solutions in biology. Still, many other researchers have argued that biology can help in recasting ill-posed problems, showing us how to ask the right questions and identifying the right constraints (Turpin et al., 2014; Zucker, 1981). Moreover, to mention one recent example, perceptual studies can still identify feature configurations that cannot be used by current models of object recognition and thus reframe the theoretical problems to be solved to match human performance (Ullman et al., 2016). Finally, recent advances in computational neuroscience have identified generic computational modules that can be used to solve several different perceptual problems such as object recognition, visual motion analysis or scene segmentation, to mention just a few (e.g., Carandini and Heeger, 2011; Cox and Dean, 2014; Fregnac and Bathelier, 2015). Thus, understanding task-specialised subsystems by building and testing them remains a crucial step to unveil the computational properties of building blocks that operate in largely unconstrained scene conditions and that could later be integrated into larger systems demonstrating enhanced flexibility, fault tolerance or learning capabilities. Theoretical studies have identified several mathematical frameworks for modelling and simulating these computational solutions that could be inspiring for computer vision algorithms. Lastly, current limitations of existing bio-inspired models in terms of their performance will also be solved by scaling up and tuning them such that they pass the traditional computer vision benchmarks.
We propose herein that the task level approach is still an efficient framework for this dialogue. Throughout the next sections, we will illustrate this standpoint with three particular examples: retinal image sensing, scene segmentation and optic flow computation. We will highlight some important novel constraints emerging from recent biological vision studies, how they have been modelled in computational vision and how they can lead to alternative solutions.

Solving vision tasks with a biological perspective
In the preceding sections, we have revisited some of the main features of biological vision and we have discussed the foundations of the current computational approaches to biological vision. A central idea is the functional importance of the task at hand when exploring or simulating the brain. Our hypothesis is that such a task-centric approach would offer a natural framework to renew the synergy between biological and artificial vision. We have discussed several potential pitfalls of this task-based approach for both artificial and bio-inspired approaches. But we argue that such a task-centric approach will escape the difficult, theoretical question of designing general-purpose vision systems, for which no consensus has been reached so far in either biology or computer vision. Moreover, this approach allows us to benchmark the performance of computer and bio-inspired vision systems, an essential step for making progress in both fields. Thus, we believe that the task-based approach remains the most realistic and productive approach. The novel strategy based on bio-inspired generic computational blocks will however open the door for improving the scalability, the flexibility and the fault-tolerance of novel computer vision solutions. As already stated above, we decided to revisit three classical computer vision tasks from such a biological perspective: image sensing, scene segmentation and optical flow. This choice was made in order to provide a balanced overview of recent biological vision studies about three illustrative stages of vision, from the sensory front-end to the ventral and dorsal cortical pathways. For these three tasks, there is a good set of multi-scale biological data and a solid set of modelling studies based on canonical neural computational modules. This enables us to compare these models with computer vision algorithms and to propose alternative strategies that could be further investigated.
For the sake of clarity, each task will be discussed with the following framework:
Task definition. We start with a definition of the visual processing task of interest.
Core challenges. We summarise its physical, algorithmic or temporal constraints and how they impact the processing that should be carried on images or sequences of images.
Biological vision solution. We review biological facts about the neuronal dynamics and circuitry underlying the biological solutions for these tasks stressing the canonical computing elements being implemented in some recent computational models.
Comparison with computer vision solutions. We discuss some of the current approaches in computer vision to outline their limits and challenges. Contrasting these challenges with known mechanisms in biological vision helps to foresee which aspects are essential for computer vision and which ones are not.
Promising bio-inspired solutions. Based on this comparative analysis between computer and biological vision, we discuss recent modelling approaches in biological vision and we highlight novel ideas that we think are promising for future investigations in computer vision.

Sensing
Task definition. Sensing is the process of capturing patterns of light from the environment so that all the visual information needed downstream to cater to the computational and functional needs of the biological vision system can be faithfully extracted. This definition does not necessarily mean that its goal is to construct a veridical, pixel-based representation of the environment by passively transforming the light the sensor receives.
Core challenges. From a functional point of view, the process of sensing (i.e., transducing, transforming and transmitting) light patterns encounters multiple challenges because visual environments are highly cluttered, noisy and diverse. First, illumination levels can vary over several orders of magnitude. Second, image formation onto the sensor is sensitive to different sources of noise and distortions due to the optical properties of the eye. Third, transducing photons into electrical signals is constrained by the intrinsic dynamics of the photosensitive device, be it biological or artificial. Fourth, transmitting luminance levels on a pixel basis is highly inefficient. Therefore, information must be (pre-)processed so that only the most relevant and reliable features are extracted and transmitted upstream in order to overcome the limited bandwidth of the optic nerve. At the end of all these different stages, the sensory representation of the external world must still be both energy-efficient and computationally efficient. All these aspects raise fundamental questions that are highly relevant for both modelling biological vision and improving artificial systems.
Herein, we will focus on four main computational problems (what is computed) that illustrate how biological solutions can inspire a better design of computer vision algorithms. The first problem is called adaptation and explains how retinal processing is adapted to the huge local and global variations in luminance levels from natural images in order to maintain high visual sensitivity. The second problem is feature extraction. Retinal processing extracts information about the structure of the image rather than mere pixels. What the most important features are that sensors should extract, and how they are extracted, are pivotal questions that must be solved to subserve optimal processing in downstream networks. Third is the sparseness of information coding. Since the amount of information that can be transmitted from the front-end sensor (the retina) to the central processing unit (area V1) is very limited, a key question is to understand how spatial and temporal information can be optimally encoded, using context dependency and predictive coding. The last selected problem is called precision of the coding, in particular what temporal precision of the transmitted signals would best represent the seamless sequence of images.
Biological vision solution. The retina is one of the most developed sensing devices (Gollisch and Meister, 2010; Masland, 2011). It transforms the incoming light into a set of electrical impulses, called spikes, which are sent asynchronously to higher level structures through the optic nerve. In mammals, it is sub-divided into five layers of cells (namely, photo-receptors, horizontal, bipolar, amacrine and ganglion cells) that form a complex recurrent neural network with feed-forward (from photo-receptors to ganglion cells), but also lateral (i.e., within bipolar and ganglion cell layers) and feedback connections. The complete connectomics of some invertebrate and vertebrate retinas is now becoming available (Marc et al., 2013).
Regarding information processing, a large number of studies have shown that the mammalian retina can tackle the four challenges introduced above using adaptation, feature detection, sparse coding and temporal precision (Kastner and Baccus, 2014). Note that feature detection should be understood as "feature encoding" in the sense that there is no decision making involved. Adaptation is a crucial step, since retinas must maintain high contrast sensitivity over a very broad range of luminance, from starlight to direct sunlight. Adaptation is both global, through neuromodulatory feedback loops, and local, through adaptive gain control mechanisms, so that retinal networks can be adapted to the whole scene illuminance level while maintaining high contrast sensitivity in different regions of the image, despite their considerable differences in luminance (Demb, 2008; Shapley and Enroth-Cugell, 1984; Thoreson and Mangel, 2012).
It has long been known that retinal ganglion cells extract local luminance profiles. However, we now have a more complex view of retinal form processing. The retina of higher mammals samples each point in the image with about 20 distinct ganglion cells (Masland, 2011), associated with different features. This is best illustrated in Fig. 4, showing how the retina can gather information about the structure of the visual scene with four example cell types tiling the image. They differ from one another in the size of their receptive field and in their spatial and temporal selectivities. These spatiotemporal differences are related to the different sub-populations of ganglion cells which have been identified. Parvocellular (P) cells are the most numerous (80%). They have a small receptive field and a slow response time, resulting in a high spatial resolution and a low temporal sensitivity. They process information about color and details. Magnocellular cells have a large receptive field and a short response time, resulting in a high temporal resolution and a low spatial sensitivity, and can therefore convey information about visual motion (Shapley, 1990). Thus visual information is split into parallel streams extracting different domains of the image spatiotemporal frequency space. This was taken as first evidence for feature extraction at the retinal level. More recent studies have shown that, in many species, retinal networks are much smarter than originally thought. In particular, they can extract more complex features such as basic static or moving shapes and can predict incoming events, or adapt to temporal changes of events, thus exhibiting some of the major signatures of predictive coding (Gollisch and Meister, 2010; Masland, 2011). A striking aspect of retinal output is its high temporal precision and sparseness.
Massive in vitro recordings provide spiking patterns collected from large neuronal assemblies so that it becomes possible to decipher the retinal encoding of complex images (Pillow et al., 2008). Modelling the spiking output of ganglion cell populations has shown high temporal precision of the spike trains and a strong reliability across trials. These coding properties are essential for upstream processing that will extract higher order features but will also have to maintain such high precision. In brief, the retina appears to be a dense neural network where specific sub-populations adaptively extract local information in a context-dependent manner in order to produce an output that is adaptive, sparse, overcomplete and of high temporal precision.
Another aspect of retinal coding is its space-varying resolution. A high-resolution sampling zone appears in the fovea while the periphery loses spatial detail. The retinotopic mapping of receptors into the cortical representation can be characterized formally by a non-linear conformal mapping operation. Different closed-form models have been proposed which share the property that the retinal image is sampled in a space-variant fashion using a topological transformation of the retinal image into the cortex. The smooth variation from central to peripheral vision may directly support a mechanism of space-variant vision. Such an active processing mechanism not only significantly reduces the amount of data (particularly with a high rate of peripheral compression) but may also support computational mechanisms, such as symmetry and motion detection.
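To make the idea of a space-variant conformal mapping concrete, the following minimal sketch uses the complex logarithm w = log(z + a), which is one commonly used closed form. The function names and the foveal constant `a` are illustrative assumptions and are not taken from any of the models cited here.

```python
import cmath

def retina_to_cortex(x, y, a=0.5):
    """Map a retinal position (x, y) to cortical coordinates with w = log(z + a).

    z = x + iy is the retinal position; `a` is a hypothetical foveal constant
    that removes the singularity of the logarithm at z = 0.
    """
    w = cmath.log(complex(x, y) + a)
    return w.real, w.imag

def magnification(eccentricity, a=0.5):
    """|dw/dz| on the horizontal meridian: cortical distance per unit retinal distance."""
    return 1.0 / (eccentricity + a)

# Equal retinal steps occupy less and less cortical space with eccentricity.
eccentricities = [0.0, 1.0, 10.0, 11.0, 40.0]
cortical_u = [retina_to_cortex(e, 0.0)[0] for e in eccentricities]
magnifications = [magnification(e) for e in eccentricities]
```

Under these assumptions, a unit step near the fovea (0 to 1) covers a much larger cortical distance than the same step in the periphery (10 to 11), which is the peripheral compression described above.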
There is a large and expanding body of literature proposing models of retinal processing. We attempted to classify them and isolated three main classes of models. The first class groups the linear-nonlinear-Poisson (LNP) models (Odermatt et al., 2012). In its simplest form, an LNP model is a convolution with a spatiotemporal kernel followed by a static nonlinearity and a stochastic (Poisson-like) spike generation mechanism. These functional models are widely used by experimentalists to characterise the cells that they record, map their receptive field and characterise their spatiotemporal feature selectivities (Chichilnisky, 2001). LNP models can simulate the spiking activity of ganglion cells (and of cortical cells) in response to synthetic or natural images, but they deliberately ignore the neuronal mechanisms and the details of the inner retinal layers that transform the image into a continuous input to the ganglion cell (or any type of cell) stages. Moreover, they implement static non-linearities, ignoring many existing non-linearities. Applied to computer vision, they nevertheless provide some inspiring computational blocks for contrast enhancement, edge detection or texture filtering.
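The three stages of an LNP cascade can be sketched in a few lines. The example below is a minimal, purely temporal illustration: the kernel values, the exponential nonlinearity, the gain, the bias and the bin width are all assumptions chosen for readability, not parameters fitted to any recorded cell.

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw a Poisson variate with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1

def lnp_spike_counts(stimulus, kernel, gain=1.0, bias=0.0, dt=0.01, rng=None):
    """Spike counts per time bin from a linear-nonlinear-Poisson cascade."""
    rng = rng or random.Random(0)
    counts = []
    for t in range(len(stimulus)):
        # Linear stage: causal convolution of the stimulus with the temporal kernel.
        drive = sum(kernel[k] * stimulus[t - k]
                    for k in range(len(kernel)) if t - k >= 0)
        # Nonlinear stage: a static exponential maps the drive to a firing rate (Hz).
        rate = math.exp(gain * drive + bias)
        # Poisson stage: stochastic spike count in a bin of width dt seconds.
        counts.append(poisson_sample(rate * dt, rng))
    return counts

kernel = [0.5, 0.3, 0.2]                 # hypothetical temporal kernel
stimulus = [0.0] * 200 + [1.0] * 200     # step increase in contrast
counts = lnp_spike_counts(stimulus, kernel, gain=2.0, bias=math.log(5.0))
```

With these illustrative settings the model fires at a baseline of about 5 Hz before the step and at a markedly higher rate after it. A spatiotemporal version would replace the one-dimensional convolution with a filter over space and time.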
The second class of models has been developed to serve as a front-end for subsequent computer vision tasks. They provide bio-inspired modules for low level image processing. One interesting example is given by Benoit et al. (2010) and Hérault (2010), where the model includes parvocellular and magnocellular pathways using different non-separable spatio-temporal filters that are optimal for form or motion detection.
The third class is based on detailed retinal models reproducing its circuitry, in order to predict the individual or collective responses measured at the ganglion cell level (Lorach et al., 2012; Wohrer and Kornprobst, 2009). Virtual Retina (Wohrer and Kornprobst, 2009) is one example of such a spiking retina model. This model enables large scale simulations (up to 100,000 neurons) in reasonable processing times while keeping a strong biological plausibility. These models are expanded to explore several aspects of retinal image processing such as (i) understanding how to reproduce accurately the statistics of the spiking activity at the population level (Nasser et al., 2013), (ii) reconciling connectomics and simple computational rules for visual motion detection and (iii) investigating how such canonical microcircuits can implement the different retinal processing modules cited above (feature extraction, predictive coding) (Gollisch and Meister, 2010).
Comparison with computer vision solutions. Most computer vision systems are rooted in a sensing device based on CMOS technology that acquires images in a frame-based manner. Each frame is obtained from sensors representing the environment as a set of pixels whose values indicate the intensity of light. Pixels tile the image domain homogeneously and their number defines the resolution of images. Dynamical inputs, corresponding to videos, are represented as a set of frames, each one representing the environment at a different time, sampled at a constant time step defining the frame rate.
To make an analogy between the retina and typical image sensors, the dense pixels which respond slowly and capture high resolution color images are at best comparable to P-cells in the retina. Traditionally in computer vision, the major technological breakthroughs for sensing devices have aimed at improving the density of the pixels, as best illustrated by the ever improving resolution of the images we capture daily with cameras. Focusing on how videos are captured, one can see that a dynamical input is no more than a series of images sampled at regular intervals. Significant progress has been achieved recently in improving the temporal resolution with the advent of computational photography, but at a very high computational cost (Liu et al., 2014). This kind of sensing for videos introduces many limitations, and the amount of data that has to be managed is high.
However, there are two main differences between the retina and a typical image sensor such as a camera. First, as stated above, the retina is not simply sending intensity information but is already extracting features from the scene. Second, the retina asynchronously processes the incoming information, transforming it into a continuous succession of spikes at the level of ganglion cells, which mostly encode changes in the environment: the retina is very active when intensity is changing, but its activity quickly becomes very low under purely static stimulation. These observations show that the notion of representing static frames does not exist in biological vision, which drastically reduces the amount of data required to represent temporally varying content.
Promising bio-inspired solutions. Analysing the sensing task from a biological perspective has potential for bringing new insights and solutions related to the four challenges outlined in this section. In terms of an ideal sensor, it is desirable to have control over the acquisition of each pixel, thus allowing a robust adaptation to different parts of the scene. However, this is difficult to realize on the chip as it would mean independent triggers to each pixel, thus increasing the information transfer requirements on the sensor. In order to circumvent this problem, current CMOS sensors utilize a global clock trigger, which gives no handle on local adaptation and thus forces a global strategy. This problem is tackled differently in biologically inspired sensors, by having local control loops in the form of event-driven triggering rather than a global clock-based drive. This helps the sensor to adapt better to local changes and avoids the need for external control signals. Also, since the acquisitions are to be rendered, sensory physiological knowledge could help in choosing good tradeoffs in sensor design. For example, the popular Bayer filter pattern has already been inspired by the physiological properties of retinal color sensing cells. With the advent of high dynamic range imaging devices, these properties are beginning to find interesting applications such as rendering on low dynamic range displays. This refers to the tone mapping problem, a necessary step to visualize high dynamic range images on low dynamic range displays, which span up to two orders of magnitude. There is a large body of literature in this area on static images (see Bertalmío, 2014; Kung et al., 2007 for reviews), with approaches which combine luminance adaptation and local contrast enhancement, sometimes closely inspired by retinal principles, as in Benoit et al. (2009); Ferradans et al. (2011); Meylan et al. (2007); Muchungi and Casey (2012), just to cite a few.
Recent developments concern video tone mapping, where a few approaches have been developed so far (see Eilertsen et al., 2013 for a review). We think it is for videos that the development of synergistic models of the retina is the most promising. Building on existing detailed retinal models such as Virtual Retina (Wohrer and Kornprobst, 2009), which mixes filter-based processing, dynamical systems and spiking neuron models, the goal is to achieve a better characterization of retinal response dynamics, which will have a direct application here.
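The combination of luminance adaptation and local contrast enhancement mentioned for tone mapping can be illustrated with a toy operator. The sketch below assumes a Naka-Rushton-like compression I / (I + sigma) with exponent 1, where sigma is either the global mean luminance or a local mean over a small window. It is not one of the operators cited above, and the function names and the one-dimensional "scene" are illustrative assumptions.

```python
def naka_rushton(luminance, sigma):
    """Compressive response I / (I + sigma), bounded in [0, 1)."""
    return luminance / (luminance + sigma)

def local_mean(signal, index, radius):
    """Mean of the signal in a window of the given radius around index."""
    lo = max(0, index - radius)
    hi = min(len(signal), index + radius + 1)
    window = signal[lo:hi]
    return sum(window) / len(window)

def tone_map(signal, radius=None):
    """Global adaptation if radius is None, local adaptation otherwise."""
    if radius is None:
        sigma_global = sum(signal) / len(signal)
        return [naka_rushton(v, sigma_global) for v in signal]
    return [naka_rushton(v, local_mean(signal, i, radius))
            for i, v in enumerate(signal)]

# A 1D "scene" with a dark and a bright region, three orders of magnitude apart.
scene = [1.0, 2.0, 1.0, 2.0, 1000.0, 2000.0, 1000.0, 2000.0]
global_out = tone_map(scene)
local_out = tone_map(scene, radius=1)
```

In this toy example the global operator squeezes the dark region into a tiny fraction of the output range, whereas the local operator keeps the contrast between neighbouring dark samples visible, which is the benefit of local adaptive gain control discussed earlier in this section.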
The way the retina performs feature detection and encodes information in space and time has received relatively little attention so far from the computer vision community. In most cases, retina-based models rely on simple caricatures of the retina. The FREAK (Fast Retina Keypoint) descriptor (Alahi et al., 2012) is one example where only the geometry and space-varying resolution have been exploited. In Alahi et al. (2012), the "cells" in the model only average intensities inside their receptive field. This descriptor model was extended in Hilario Gomez et al. (2015), where ON and OFF cells were introduced using a linear-nonlinear (LN) model. This gives a slight gain in performance in a classification task, although it is still far from the state-of-the-art. These descriptors could be improved in many ways, by taking into account the goal of the features detected by the 20 types of ganglion cells mentioned before. Here also the strategy is to build on existing retinal models. In this context, one can also mention the SIFT descriptor (Lowe, 2001), which was also inspired by cortical computations. One needs to evaluate the functional implication at a task level of some retinal properties. Examples include the asymmetry between ON and OFF cells (Pandarinath et al., 2010) and the irregular receptive field shapes (Liu et al., 2009).
One question is whether we would still need inspiration from the retina to build new descriptors, given the power of machine learning methods that automatically provide optimized features given an image database. What the FREAK-based models show is that it is not only about improving the filters. It is also about how the information is encoded. In particular, what is encoded in FREAK-based models is the relative difference between cell responses. Interestingly, this is exactly the same as the rank-order coding idea proposed as an efficient strategy to perform ultra-fast categorization (VanRullen and Thorpe, 2002), and which has been reported in the retina (Portelli et al., 2014). This idea has been exploited for pattern recognition and used in many applications, as demonstrated by the products developed by the company Spikenet (http://www.spikenet-technology.com). This means that the retina should serve as a source of inspiration not only for proposing features but, more importantly, for how these features are encoded at a population level.
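The link between pairwise comparisons of cell responses and rank-order coding can be made explicit with a toy example. The sketch below is neither the FREAK implementation nor the Spikenet algorithm: the function names and the response values are illustrative assumptions, and the "cells" are reduced to a list of scalar responses.

```python
from itertools import combinations

def rank_order(responses):
    """Cell indices sorted from strongest to weakest response (first to fire)."""
    return sorted(range(len(responses)), key=lambda i: -responses[i])

def pairwise_sign_code(responses):
    """Binary code: 1 if cell i responds more strongly than cell j, for all i < j."""
    return [int(responses[i] > responses[j])
            for i, j in combinations(range(len(responses)), 2)]

responses = [0.2, 0.9, 0.5, 0.1]
# Same pattern after a global gain and offset change (e.g., brighter lighting).
brighter = [2.0 * r + 0.3 for r in responses]

order = rank_order(responses)          # [1, 2, 0, 3]
code = pairwise_sign_code(responses)   # [0, 0, 1, 1, 1, 1]
```

Both codes are unchanged for `brighter`, because any increasing transformation of the responses preserves their order. This invariance is one reason why encoding relative differences, rather than absolute response values, is attractive for fast recognition.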
The fact that the retinal output is sparse and has a high temporal precision conveys a major advantage to the visual system, since it has to deal with only a small amount of information. A promising bio-inspired solution is to develop frame-free methods, i.e., methods using sparse encoding of the visual information. This is now possible using event-based vision sensors where pixels autonomously communicate change and grayscale events. The dynamic vision sensor (DVS) (Lichtsteiner et al., 2008; Liu and Delbruck, 2010) and the asynchronous time-based image sensor (ATIS) (Posch et al., 2011) are two examples of such sensors using address-event representation (AER) circuits. The main principle is that pixels signal only significant events. More precisely, an event is sent when the log intensity has changed by some threshold amount since the last event (see Fig. 5). These sensors provide a sparse output corresponding to pixels that register a change in the scene, thus allowing extremely high temporal resolution to describe changes in the scene while discarding all the redundant information. Because the encoding is sparse, these sensors appear as a natural solution in real-time scenarios or when energy consumption is a constraint. Combined with what is known about retinal circuitry, as in Lorach et al. (2012), they could provide a very efficient front-end for subsequent visual tasks, in the same spirit as former neuromorphic models of low-level processing such as Benoit et al. (2010) and Hérault (2010). They could also be used more directly as a way to represent visual scenes, abandoning the whole notion of a video that is composed of frame sequences. This provides a new operative solution that can be used to revisit computer vision problems (see Liu et al., 2015 for a review). This field is rapidly emerging, with the motivation to develop approaches more efficient than the state-of-the-art.
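The event-generation rule described above (an event is sent when the log intensity has changed by a threshold since the last event) can be sketched for a single pixel as follows. Real DVS and ATIS pixels implement this rule in continuous time with analog circuits; the sampled input, the function name and the threshold value used here are simplifying assumptions.

```python
import math

def events_from_samples(intensities, times, threshold=0.25):
    """Return (time, polarity) events for one pixel.

    An ON (+1) or OFF (-1) event is emitted each time the log intensity has
    moved by `threshold` away from the level stored at the last event.
    """
    events = []
    reference = math.log(intensities[0])
    for t, value in zip(times[1:], intensities[1:]):
        delta = math.log(value) - reference
        # A large change between two samples can trigger several events.
        while abs(delta) >= threshold:
            polarity = 1 if delta > 0 else -1
            events.append((t, polarity))
            reference += polarity * threshold
            delta = math.log(value) - reference
    return events

# Twelve samples of a pixel: constant, then brighter, then darker.
intensities = [100.0] * 4 + [200.0] * 4 + [90.0] * 4
times = [i * 0.001 for i in range(len(intensities))]
events = events_from_samples(intensities, times)
```

In this sketch, the twelve samples produce only four events (two ON at the first change, two OFF at the second), and the constant stretches produce none, which illustrates why the output of such sensors is sparse for slowly varying scenes.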
Some examples include tracking (Ni et al., 2011), stereo (Rogister et al., 2012), 3D pose estimation (Valeiras et al., 2016), object recognition (Orchard et al., 2015) and optical flow (Brosch et al., 2015b; Giuliani et al., 2016).

Segmentation and figure-ground segregation
Task definition. The task of segmenting a visual scene is to generate a meaningful partitioning of the input feature representation into surface- or object-related components. The segregation of an input stimulus into prototypical parts, characteristic of surfaces or objects, is guided by a coherence or homogeneity property that region elements share. Homogeneities are defined upon feature domains such as color, motion, depth, statistics of luminance items (texture), or combinations of them (Pal and Pal, 1993). The specificity of the behavioural task, e.g., grasping an object, distinguishing two object identities, or avoiding collisions during navigation, may influence the required detail of segmentation (Ballard et al., 2000; Hayhoe and Ballard, 2005). In order to do so, contextual information in terms of high-level knowledge representations can be exploited as well. In addition, the goal of segmentation might be extended to eventually single out a target item, or object, from its background in order to recognise it or to track its motion.
Core challenges. The segmentation of a spatio-temporal visual image into regions that correspond to prototypical surfaces or objects faces several challenges which derive from distinct interrelated subject matters. The following themes refer to issues of representation. First, the feature domain or multiple domains need to be identified which constitute the coherence or homogeneity properties relevant for the segregation task. Feature combinations, as well as the nested structure of their appearance in coherent surfaces or objects, introduce apparent feature hierarchies (Koenderink, 1984; Koenderink et al., 2012). Second, the segmentation process might focus on the analysis of homogeneities that constitute the coherent components within a region or, alternatively, on the discontinuities between regions of homogeneous appearance. Approaches belonging to the first group focus on the segregation of parts into meaningful prototypical regions utilising an agglomeration (clustering) principle. Approaches belonging to the second group focus on the detection of discontinuous changes in feature space (along different dimensions) (Nothdurft, 1991), and group them into contours and boundaries. Note that we make a distinction here: a contour refers to a grouping of oriented edge or line contrast elements, whereas a boundary already relates to a surface border in the scene. Regarding the boundaries of any segment, the segmentation task itself might incorporate an explicit assignment of a border ownership (BOwn) direction label, which implies the separation of figural shape from background by a surface that occludes other scenic parts. The variabilities in the image acquisition process caused by, e.g., illumination conditions, shape and texture distortions, might speak in favour of a boundary oriented process.
On the other hand, the complexity of the background structure increases the effort to segregate a target object from the background, which argues in favour of region oriented mechanisms. It should be noted, however, that the region vs boundary distinction might not be as binary as outlined above. In real world scenes, the space-time relationships of perceptual elements (defined over different levels of resolution) are often defined by statistically meaningful structural relations to determine segmentation homogeneities (Witkin and Tenenbaum, 1983). Here, an important distinction has been made between structure that might be influenced by meaning and primitive structure that is perceived even without a particular interpretation. While the previous challenges were defined by representations, the following themes refer to the process characteristic of segmentation. First, the partitioning process may yield different results given changing view-points or different noise sources during the sensing process. Thus, segmentation imposes an inference problem that is mathematically ill-posed (Poggio et al., 1985). The challenge is how to define a reliability, or confidence, measure that characterises meaningful decompositions relating to reasonable interpretations. To illustrate this, Fig. 6 shows segmentation results as drawn by different human observers. Second, figural configurations may impose different efforts for mechanisms of perceptual organisation to decide upon the segregation of an object from the background and/or the assignment of figure and ground direction of surface boundaries. A time dependence that correlates with the structural complexity of the background has in fact been observed to influence the time course needed in visual search tasks.
Biological vision solution. Evidence from neuroscience suggests that the visual system uses segmentation strategies based on identifying discontinuities and grouping them into contours and boundaries. Such processes operate mainly in a feed-forward and automatic fashion, utilising early and intermediate-level stages in visual cortex. In a nutshell, contrast and contour detection is quickly accomplished and is already represented at early stages in the visual cortical hierarchy, namely areas V1 and V2. The assignment of task-relevant segments occurs after a slight temporal delay and involves a recurrent flow of lateral and feedback processes (Roelfsema, 2006; Roelfsema and Houtkamp, 2011; Scholte et al., 2008).
The grouping of visual elements into contours appears to follow the Gestalt rules of perceptual organisation (Koffka, 1935). Grouping has also been studied with respect to the ecological validity of such rules, as they appear to be embedded in the statistics of natural scenes. Mechanisms that entail contour groupings are implemented in the structure of supragranular horizontal connections in area V1, in which oriented cells preferentially contact like-oriented cells that are located along the orientation axes defined by a selected target neuron (Kapadia et al., 1995). Such long-range connections form the basis for the Gestalt concept of good continuation and might reflect the physiological substrate of the association field, a figure-eight shaped zone of facilitatory coupling of orientation selective input and perceptual integration into contour segments (Geisler et al., 2001; Grossberg and Mingolla, 1985). Recent evidence suggests that the perceptual performance of visual contour grouping can be improved by mechanisms of perceptual learning. Once contours have been formed they need to be labelled in accordance with their scene properties. In the case of a surface partially occluding more distant scenic parts, the border ownership (BOwn) or surface belongingness can be assigned to the boundary (Koffka, 1935). A neural correlate of such a mechanism has been identified at different cortical stages along the ventral pathway, such as areas V1, V2 and V4 (O'Herron and von der Heydt, 2011; Zhou et al., 2000). The dynamics of the generation of the BOwn signals may be explained by feed-forward, recurrent lateral and feedback mechanisms (see Williford and von der Heydt, 2013 for a review).
Such a dynamical process of feedback, called re-entry (Edelman, 1993), recursively links representations distributed over different levels. Mechanisms of lateral integration, although slower in processing speed, seem to further support intra-cortical grouping (Kapadia et al., 1995; 2000). In addition, surface segregation is reflected in a later temporal processing phase but is also evident in low levels of the cortical hierarchy, suggesting that recurrent processing between different cortical stages is involved in generating neural surface representations. Once boundary groupings are established, surface-related mechanisms "paint", or tag, task-relevant elements within bounded regions. The feature dimensions used in such grouping operations are, e.g., local contour orientations defined by luminance contrasts, direction and speed of motion, color hue contrasts, or texture orientation gradients. As sketched above, counter-stream interactive signal flow (Ullman, 1995) imposes a temporal signature on responses in which, after a delay, a late amplification signal serves to tag those local responses that belong to a region (surrounded by contrasts) which has been selected as a figure (see also Roelfsema et al., 2007). The time course of the neuronal responses encoding invariance against different figural sizes argues for a dominant role of feedback signals when dynamically establishing the proper BOwn assignment. Grouping cells have been postulated that integrate (undirected) boundary signals over a given radius and enhance those configurations that define locally convex shape fragments. Such fragments are in turn enhanced via a recurrent feedback cycle so that closed shape representations can be established rapidly through the convexity in closed bounding contours (Zhou et al., 2000).
Neural representations of localized features composed of multiple orientations may further influence this integration process, although this is not firmly established yet (Anzai et al., 2007). BOwn assignment serves as a prerequisite of figure-ground segregation. The temporal dynamics of cell responses at early cortical stages suggest that mechanisms exist that (i) decide about ownership direction and (ii) subsequently enhance regions (at the interior of the outline boundaries) by spreading a neural tagging, or labelling, signal that is initiated by the region boundary (Roelfsema et al., 2002) (compare the discussion in Williford and von der Heydt (2013)). Such a late enhancement through response modulation of region components occurs for different features, such as oriented texture (Lamme et al., 1999) or motion signals (Roelfsema et al., 2007), and is mediated by recurrent processes of feedback from higher levels in the cortical hierarchy. It is, however, not clear whether a spreading process for region tagging is a basis for generating invariant neural surface representations in all cases. All experimental investigations have been conducted for input that leads to significant initial stimulus responses, while structure-less homogeneous regions (e.g., a homogeneously coloured wall) may lead to void spaces in the neuronal representation that may not be filled explicitly by the cortical processing (compare the discussion in Pessoa et al., 1998).
Yet another level of visual segmentation operates upon the initial grouping representations, those base groupings that happen to be processed effortlessly as outlined above. However, the analysis of complex relationships surpasses the capacities of the human visual processor, which necessitates serial staging of some higher-level grouping and segmentation mechanisms to form incremental task-related groupings. In this mainly sequential operational mode, visual routines establish properties and relations of particular scene items (Ullman, 1984). Elemental operations underlying such routines have been suggested, e.g., shifting the processing focus (related to attentional selection), indexing (to select a target location), coloring (to label homogeneous region elements), and boundary tracing (determining whether a contour is open or closed and which items belong to a continuous contour). For example, contour tracing is suggested to be realized by incremental grouping operations which propagate an enhancement of neural firing rates along the extent of the contour. Such a neural labelling signal is reflected in a late amplification in the temporal signature of neuronal responses. The amplification is delayed with respect to the stimulus onset time with increasing distance of the location along the perceptual entity that is indexed by the fixation point at the end of the contour (Roelfsema and Houtkamp, 2011). This led to the conclusion that such tracing is laterally propagated (via lateral or interactive feed-forward and feedback mechanisms), leading to a neural segmentation of the labelled items delineating feature items that belong to the same object or perceptual unit. Maintenance operations then interface such elemental operations into sequences to compose visual routines for solving more complex tasks, as in a sequential computer program.
Such cognitive operations are implemented in cortex by networks of neurons that span several cortical areas (Roelfsema, 2005). The execution time of visual cortical routines reflects the sequential composition of such task-specific elemental neural operations, tracing the signature of neural responses to a stimulus (Lamme and Roelfsema, 2000; Roelfsema, 2005).
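The incremental grouping idea described above can be made concrete with a toy example. The following is a minimal sketch, not a model from the cited studies: it assumes a binary grid in which 1 marks contour elements, 4-neighbour connectivity, and one propagation step per grid cell. The function name `spread_tag` and the grid are hypothetical and serve only to illustrate how a label spreading from a seed reaches more distant elements later and never reaches elements of a disconnected contour.

```python
from collections import deque

def spread_tag(grid, seed):
    """Breadth-first spreading of a label from `seed` over cells marked 1.

    Returns a dict mapping each reached (row, col) to the step at which
    the label arrived, a stand-in for the latency of response enhancement.
    """
    rows, cols = len(grid), len(grid[0])
    arrival = {seed: 0}
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 1 and (nr, nc) not in arrival):
                arrival[(nr, nc)] = arrival[(r, c)] + 1
                queue.append((nr, nc))
    return arrival

# Two separate contours (1 = contour element): an L-shaped one that
# starts at (0, 0) and a short one in the right-hand column.
grid = [
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
arrival = spread_tag(grid, seed=(0, 0))
```

In this sketch the arrival step grows with path length along the traced contour, which mirrors the distance-dependent delay of the late response amplification; elements of the second contour are never tagged.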
Comparison with computer vision solutions. Segmentation as an intermediate-level process in computational vision is often characterised as one of agglomerating, or clustering, picture elements to arrive at an abstract description of the regions in a scene (Pal and Pal, 1993). It can also be viewed as a preprocessing step for object detection/recognition. Not surprisingly, early attempts in computer vision were also drawn towards single aspects of segmentation, such as edge detection (Canny, 1986; Lindeberg, 1998; Marr and Hildreth, 1980) or grouping homogeneous regions by clustering (Coleman and Andrews, 1979). The performance limitations of each of these approaches on its own have led to the emergence of solutions that reconsider the problem as a juxtaposition of both edge detection and homogeneous region grouping, with implicit consideration of scale. The review paper by Freixenet et al. (2002) presents various approaches that attempt to merge edge-based information and clustering-based information in a sequential or parallel manner. The state-of-the-art techniques that successfully formulate the combined approach are variants of graph cuts (Shi and Malik, 2000), active contours, and level sets. At the bottom of all such approaches is the definition of an optimisation scheme that seeks a solution under constraints such as smoothness or the minimisation of a measure of total energy. These approaches match human-defined ground truth much better than simpler variants involving discontinuity detection or clustering alone. The performance of computer vision approaches to image partitioning has recently been boosted by numerous contributions utilizing DCNNs for segmentation (e.g., Hong et al., 2015; Noh et al., 2015). The basic structure of the encoder component of segmentation networks is similar to the hierarchical networks trained for object recognition (Krizhevsky et al., 2012).
For example, AlexNet has been trained by learning a hierarchy of kernels in the convolutional layers to extract rich feature sets for recognition from a large database of object classes. Segmentation networks (Noh et al., 2015) have been designed by adding a decoder scheme that expands the activations in the category layers through a sequence of deconvolution steps, as in autoencoder networks (Hinton and Salakhutdinov, 2006). Even more extended versions include a mechanism of focused attention to more selectively guide the training process using class labels or segmentations (Hong et al., 2016). The hierarchical structure of such approaches shares several features of cortical processing through a sequence of areas with cells that increase their response selectivity and the size of their receptive fields over different stages in the cortical hierarchy. However, the explicit unfolding of the data representation in the deconvolution step to upscale to full image resolution, the specific indexing of pixel locations to invert the pooling in the deconvolution, and the large amount of training data are not biologically plausible.
A major challenge is still how to compare the validity and the quality of segmentation approaches. Recent attempts emphasise comparing the computational results from operations on different scales with the results of hand-drawn segmentations by human subjects (Fowlkes et al., 2007). These approaches suggest possible measures for judging the quality of automatic segmentation given that ground truth data is missing. However, the human segmentation data do not elucidate the mechanisms underlying the processes that arrive at such partitions. Instead of a global partitioning of the visual scene, the visual system seems to adopt different strategies of computation to arrive at a meaningful segmentation of figural items. The grouping of elements into coherent form is instantiated by selectively enhancing the activity of neurons that represent the target region via a modulatory input from higher cortical stages (Lamme et al., 1998). The contribution of feedback to the segmentation of visual scenes has been elucidated above. Recent computer vision algorithms have begun to make use of such recurrent mechanisms as well. For example, since bottom-up data-driven segmentation is usually incomplete and ambiguous, the use of higher-level representations might help to validate initial instances and further stabilise their representation (Ullman, 2007). Along this line, top-down signalling applies previously acquired information about object shape (e.g., through learning), making use of the discriminative power of fragments of intermediate size, and combines this information with a hierarchy of initial segments (Ullman et al., 2002). Combined contour and region processing mechanisms have also been suggested to guide the segmentation. In Arbelaez et al. (2011), multiscale boundaries are extracted which later prune the contours in a watershed region-filling algorithm.
Algorithms of figure-ground segregation and border-ownership computation have been developed for computer vision applications to operate under realistic imaging conditions (Stein and Hebert, 2009; Sundberg et al., 2011). These were designed to solve tasks like shape detection against structured background and video editing. Still, the robust segmentation of an image into corresponding surface patches is hard to accomplish in a reliable fashion. The performance of the methods mentioned above depends on their parametrization and on the unknown complexity and properties of the viewed scene. Aloimonos and coworkers proposed an active vision approach that adopted biological principles like the selection of, and fixation on, image regions that are surrounded by closed contours (Mishra and Aloimonos, 2009; Mishra et al., 2012). The key here is that in this approach only the fixated region (corresponding to a surface of an object or the object itself) is selected and then segmented based on an optimization scheme using graph cuts. All image content outside the closed region contour is background with respect to the selected target region or object. The functionality requires an active component to relocate the gaze and a region that is surrounded by a contrast criterion in the image.
Promising bio-inspired solutions. Numerous models that account for mechanisms of contour grouping by linking orientation-selective cells have been proposed (Grossberg et al., 1997; Li, 1998). The rules of mutual support utilize a similarity metric in the space-orientation domain giving rise to a compatibility, or reliability, measure (see Neumann and Mingolla, 2001 for a review of generic principles and a taxonomy). Such principles migrated into computer vision approaches (Kornprobst and Médioni, 2000; Medioni et al., 2000; Parent and Zucker, 1989) and, in turn, provided new challenges for experimental investigations (Ben-Shahar and Zucker, 2004; Sigman et al., 2001). Note that the investigation of structural connectivities in high-dimensional feature spaces and their mapping onto a low-dimensional manifold led to the definition of a "neurogeometry" and of the basic mathematical principles underlying such structural principles (Citti and Sarti, 2014; Petitot, 2003).
As outlined above, figure-ground segregation in biological vision segments an image or temporal sequence by boundary detection and integration, followed by assigning border ownership direction and then tagging the figural component in the interior of a circumscribed region. Evidence suggests that region segmentation by tagging the items which belong to extended regions involves feedback processing from higher stages in the cortical hierarchy (Scholte et al., 2008). Grossberg and colleagues proposed the FACADE theory (form-and-color-and-depth; Grossberg, 1993; Grossberg and Mingolla, 1985) to account for a large body of experimental data, including figure-ground segregation and 3D surface perception. In a nutshell, the model architecture consists of mutually coupled subsystems, each one operating in a complementary fashion. A boundary contour system (BCS) for edge grouping is complemented by a feature contour system (FCS) which supplements edge grouping by allowing feature qualities, such as brightness, color, or depth, to spread within bounded compartments generated by the BCS.
The latter mechanism has recently been challenged by psychophysical experiments that measure subject reaction times in image-parsing tasks. The results suggest that a sequential mechanism groups, or tags, interior patches along a connected path between the fixation spot and a target probe. The speed of reaching a decision argues in favor of a spreading growth-cone mechanism that simultaneously operates over multiple spatial scales rather than the wave-like spreading of feature activities initiated from the perceptual object boundary (Jeurissen et al., 2016). Such a mechanism is proposed to also facilitate the assignment of figural sides to boundaries. BOwn computation has been incorporated in computer vision algorithms to segregate figure and background regions in natural images or scenes (Ren et al., 2006; Sundberg et al., 2011). Such approaches use local configurations of familiar shapes and integrate these via global probabilistic models to enforce consistency of contour and junction configurations, or rely on the learning of templates from ensembles of image cues to depth and occlusion.
Feedback mechanisms, as discussed above, make it possible to build robust boundary representations such that junctions may be reinterpreted based on more global context information (Weidenbacher and Neumann, 2009). The hierarchical processing of shape from curvature information in contour configurations can be combined with evidence for semi-global convex fragments or global convex configurations. Such activity is fed back to earlier stages of representation to propagate contextual evidence and quickly build robust object representations separated from the background. A first step has been suggested towards combining such stage-wise processing capacities and integrating them with feedback that modulates activities in distributed representations at earlier stages of processing. The step towards processing complex scenes from unconstrained camera images, however, still needs to be further investigated.
Taken together, biological vision seems to flexibly process the input in order to extract the most informative content from the optic array. The information is selected by an attention mechanism that guides the gaze to the relevant parts of the scene. It has been known for a long time that the guidance of eye movements is influenced by the observer's task when scanning pictures of natural scene content (Yarbus, 1967). More recent evidence suggests that saccadic landing locations are guided by constraints to optimize the detection of relevant visual information from the optic array (Ballard and Hayhoe, 2009; Hayhoe and Ballard, 2005). Such variability in fixation location has immediate consequences for the structure of the visual mapping into an observer representation. Consequently, segmentation might be considered as a separation problem that operates upon a high-dimensional feature space, instead of statically separating appearances into different clusters. For example, in order to separate a target object from the background in an identification task, fixation is best located approximately in the middle of the central surface region. A symmetric arrangement of bounding contours (with opposite direction of BOwn) helps to select the region against the background to guide a motor action. In order to generate a stable visual percept of a complex object, such information must be integrated over multiple fixations (Hayhoe et al., 1991). In the case of irregular shapes, the assignment of object belongingness requires a decision whether region elements belong to the same surface or not. Such a decision-making process involves a slower, sequentially operating mechanism of tracing a connecting path in a homogeneous region.
Such a growth-cone mechanism has been demonstrated to act similarly on perceptual representations of contours and regions, which might tag visual elements to build a temporal signature for representations that define a connected object (compare Jeurissen et al., 2016). In a different behavioral task, e.g., obstacle avoidance, fixation close to the occluding object boundary helps to separate the optic flow pattern of the obstacle from that of the background. Here, the obstacle is automatically selected as perceptual figure while the remaining visual scene structure and other objects more distant from the observer are treated as background. These examples suggest that biological segmentation might differ from computer vision approaches in that it incorporates active selection elements and builds upon much more flexible and dynamic processes.

Optical flow
Task definition. Estimating optical flow refers to the assignment of 2-D velocity vectors at sample locations in the visual image in order to describe their displacements within the sensor's frame of reference. Such a displacement vector field constitutes the image flow representing apparent 2-D motions that result from 3-D velocities being projected onto the sensor (Verri and Poggio, 1987). Optical flow algorithms use the change of structured light in the retinal or camera images, positing that such 2-D motions are observable from light intensity variations (and thus are contrast dependent) due to the change in relative positions between an observer (eye or camera) and the surfaces or objects in a visual scene.
Core challenges. Achieving a robust estimation of optical flow faces several challenges. First of all, the visual system has to establish form-based correspondences across the temporal domain despite the fact that physical movements induce geometric and photometric distortions. Second, velocity space has to be optimally sampled and represented to achieve robust and energy-efficient estimation. Third, the accuracy and reliability of the velocity estimation depend upon the local structure/form, but the visual system must achieve a form-independent velocity estimation. Difficulties arise from the fact that any local motion computation faces different sources of noise and ambiguity, such as the aperture and correspondence problems. Therefore, estimating optical flow requires resolving these local ambiguities by integrating different local motion signals while still keeping segregated those that belong to different surfaces or objects of the visual scene (see Fig. 7 (a)). In other words, image motion computation faces two opposing goals when computing the global object motion: integration and segmentation (Braddick, 1993). As already emphasised in Section 4.2, any computational machinery should be able to keep the different surface/object motions segregated, since one goal of motion processing is to accurately estimate the speed and direction of each of them in order to track, capture or avoid one or several of them. Fourth, the visual system must deal with complex scenes that are full of occlusions, transparencies or non-rigid motions. This is well illustrated by the transparency case. Since optical flow is a projection of 3D displacements in the world, some situations give rise to perceptual (semi-)transparency (McOwan and Johnston, 1996). In videos, several causes have been identified, such as reflections, phantom special effects, dissolve effects for a gradual shot change, and medical imaging such as X-rays (for example, see Fig. 7 (b)).
All of these examples pose serious problems for current computer vision algorithms.
Herein, we will focus on four main computational strategies used by biological systems for dealing with the aforementioned problems. We selected them because we believe these solutions could inspire the design of better computer vision algorithms. First is motion energy estimation, by which the visual system estimates a contrast-dependent measure of translations in order to indirectly establish correspondences. Second is local velocity estimation: contrast-dependent motion energy features must be combined to achieve a contrast-invariant local velocity estimation after de-noising the dynamical inputs and resolving local ambiguities, thanks to the integration of local form and motion cues. Third is global motion estimation of each independent object, regardless of its shape or appearance. Fourth, distributed multiplexed representations must be used by both natural and artificial systems to segment cluttered scenes, handle multiple/transparent surfaces, and encode depth ordering to achieve 3D motion perception and goal-oriented decoding.
Biological vision solution. Visual motion has been investigated in a wide range of species, from invertebrates to primates. Several computational principles have been identified as being highly conserved by evolution, for instance local motion detectors (Hassenstein and W., 1956). Following the seminal work of Werner Reichardt and colleagues, a huge amount of work has been done to elucidate the cellular mechanisms underlying local motion detection and the connectivity rules enabling optic flow detectors or basic figure-ground segmentation. Fly vision has been leading the investigation of natural image coding as well as active vision sensing. Several recent reviews can be found elsewhere (e.g., Alexander et al., 2010; Borst, 2014; Borst and Euler, 2011; Silies et al., 2014). In the present review, we decided to restrict the focus to the primate visual system and its dynamics. In Fig. 3, we have sketched the backbone of the primate cortical motion stream and its recurrent interactions with both area V1 and the 'form' stream. This figure illustrates both the advantages and the limits of the deep hierarchical model. Below, we will further focus on some recent data about the neuronal dynamics with regard to the four challenges identified for better optic flow processing.
As already illustrated, the classical view of the cortical motion pathway is a feed-forward cascade of cortical areas spanning from the occipital (V1) to the parietal (e.g., area VIP, area 7) lobes. This cascade forms the skeleton of the dorsal stream. Areas MT and MST are located in the depth of the superior temporal sulcus and are considered a pivotal hub for both object and self-motion (see, e.g., Bradley and Goyal, 2008; Orban, 2008; Pack and Born, 2008 for reviews). The motion pathway is extremely fast, with the information flowing in less than 20 ms from the primary visual area to the frontal cortices or brainstem structures underlying visuomotor transformations (see Bullier, 2001; Lamme and Roelfsema, 2000; Lisberger, 2010; Masson and Perrinet, 2012 for reviews). These short time scales originate in the magnocellular retino-geniculo-cortical input to area V1, which carries low spatial and high temporal frequency luminance information with high contrast sensitivity (i.e., high contrast gain). This cortical input to layer 4 β projects directly to the extra-striate area MT, also called the cortical motion area. The fact that this feedforward stream bypasses the classical recurrent circuit between area V1 cortical layers is attractive for several reasons. First, it implements a fast, feedforward hierarchy fitting the classical two-stage motion computation model (Hildreth and Koch, 1987; Nakayama, 1985). Direction-selective cells in area V1 are best described as spatiotemporal filters extracting motion energy along the direction orthogonal to the luminance gradient (Conway and Livingstone, 2003; Emerson et al., 1992). Their outputs are integrated by MT cells to compute local motion direction and speed.
Such spatio-temporal integration through the convergence of V1 inputs has three objectives: extracting motion signals embedded in noise with high precision, normalising them through centre-surround interactions, and solving many of the input ambiguities such as the aperture and correspondence problems. As a consequence, speed and motion direction selectivities observed at single-cell and population levels in area MT are largely independent of the contrast or the shape of the moving inputs (Born and Bradley, 2005; Bradley and Goyal, 2008; Orban, 2008). The next convergence stage, area MST, extracts object motion through cells with receptive fields extending up to 10 to 20 degrees (area MSTl), or optic flow patterns (e.g., visual scene rotation or expansion) that are processed with very large receptive fields covering up to 2/3 of the visual field (area MSTd). Second, the fast feedforward stream illustrates the fact that built-in, fast and highly specific modules of visual information are conserved through evolution to subserve automatic, behaviour-oriented visual processing (see, e.g., Borst, 2014; Dhande and Huberman, 2014; Masson and Perrinet, 2012 for reviews). Third, this anatomical motif is a good example of a canonical circuit that implements a sequence of basic computations such as spatio-temporal filtering, gain control and normalisation at increasing spatial scales (Rust et al., 2006). The final stage of all of these bio-inspired models consists of a population of neurons that are broadly selective for translation speed and direction (Simoncelli and Heeger, 1998), as well as for complex optical flow patterns (see, e.g., Grossberg and Mingolla, 1999; Layton and Browning, 2014 for recent examples).
Such a backbone can then be used to compute biological motion and action recognition (Escobar and Kornprobst, 2012; Giese and Poggio, 2003), similar to what was observed in human and monkey parietal cortical networks (see Giese and Rizzolatti, 2015 for a recent review).
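To illustrate the first stage of this feedforward view, the description of V1 direction-selective cells as spatiotemporal filters extracting motion energy can be sketched with a single quadrature pair of space-time oriented Gabor filters. This is a minimal sketch under stated assumptions rather than any of the cited models: the stimulus has one spatial dimension plus time, only one detector centred on the stimulus is evaluated, and the function name `motion_energy`, the Gaussian widths and the test grating are hypothetical choices.

```python
import numpy as np

def motion_energy(stimulus, fx, ft, sigma_x=4.0, sigma_t=4.0):
    """Energy of one space-time oriented quadrature pair centred on the stimulus.

    stimulus: 2-D array indexed [t, x]. fx, ft: preferred spatial and
    temporal frequencies in cycles per sample. With fx > 0, ft > 0 selects
    rightward motion and ft < 0 leftward motion.
    """
    nt, nx = stimulus.shape
    t = np.arange(nt)[:, None] - (nt - 1) / 2.0
    x = np.arange(nx)[None, :] - (nx - 1) / 2.0
    envelope = np.exp(-x**2 / (2 * sigma_x**2) - t**2 / (2 * sigma_t**2))
    phase = 2 * np.pi * (fx * x - ft * t)
    even = np.sum(stimulus * envelope * np.cos(phase))  # even-symmetric filter
    odd = np.sum(stimulus * envelope * np.sin(phase))   # odd-symmetric filter
    return even**2 + odd**2                             # phase-invariant energy

# Grating drifting rightward at one sample per frame (speed = ft / fx).
t = np.arange(32)[:, None]
x = np.arange(32)[None, :]
grating = np.cos(2 * np.pi * 0.125 * (x - 1.0 * t))

e_right = motion_energy(grating, fx=0.125, ft=0.125)
e_left = motion_energy(grating, fx=0.125, ft=-0.125)
e_half_contrast = motion_energy(0.5 * grating, fx=0.125, ft=0.125)
```

In this sketch the detector tuned to the rightward direction responds far more strongly than the leftward one, and halving the stimulus contrast divides the energy by four. The latter makes explicit why motion energy alone is a contrast-dependent measure that later stages have to normalise.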
However, recent physiological studies have shown that this feedforward cornerstone of global motion integration must be enriched with new properties. Fig. 3 depicts some of these aspects, mirroring functional connectivity and computational perspectives. First, motion energy estimation through a set of spatiotemporal filters was recently re-evaluated to account for the neuronal responses to complex dynamical textures and natural images.
When presented with rich, naturalistic inputs, responses of both V1 complex cells and MT pattern-motion neurons become contrast invariant (Cui et al., 2013; Priebe et al., 2003) and more selective (i.e., their tuning is sharper) (Gharaei et al., 2013; Priebe et al., 2003). Their responses also become sparser (Vinje and Gallant, 2000) and more precise (Baudot et al., 2013). These better sensitivities could be explained by a more complex integration of inputs, through a set of adaptive, excitatory- and inhibitory-weighted filters that optimally sample the spatiotemporal frequency plane (Nishimoto and Gallant, 2011). Second, centre-surround interactions are much more diverse, along many different domains (e.g., retinotopic space, orientation, direction), than originally depicted by the popular Mexican-hat model. Such diversity of centre-surround interactions in both areas V1 and MT most certainly contributes to several of the computational nonlinearities mentioned above. They involve not only the classical convergence of projections from one step to the next but also the dense network of lateral interactions within V1 as well as within each extra-striate area. These lateral interactions implement long-distance normalisation, seen as centre-surround interactions at population level (Reynaud et al., 2012), as well as feature grouping between distant elements. These intra- and inter-areal cortical interactions can support a second important aspect of motion integration: motion diffusion. In particular, anisotropic diffusion of local motion information can play a critical role in global motion integration by propagating reliable local motion signals within the retinotopic map. The exact neural implementation of these mechanisms is as yet unknown, but modern tools will soon make it possible to image, and manipulate, the dynamics of these lateral interactions.
The diversity of excitatory and inhibitory inputs can explain how the aperture problem is dynamically solved by MT neurons for different types of motion inputs such as plaid patterns (Rust et al., 2006) and elongated bars or barber poles (Tsui et al., 2010); these inputs are also thought to be important for encoding optic flow patterns (Mineault et al., 2012) and biological motion (Escobar and Kornprobst, 2012). Finally, the role of feedback in this context-dependent integration of local motion has been demonstrated by experimental (Nassi et al., 2014) and computational studies (Bayerl and Neumann, 2004, 2007) and is now addressed at the physiological level despite considerable technical difficulties (see Cudeiro and Sillito, 2006 for a review). Overall, several computational studies have shown the importance of the adaptive normalisation of spatiotemporal filters for motion perception (see Simoncini et al., 2012), illustrating how a generic computation (normalisation) can be adaptively tuned to match the requirements of different behaviours.
Global motion integration is only one side of the coin. As pointed out by Braddick (1993), motion integration and segmentation work hand in hand to selectively group the local motion signals that belong to different surfaces. For instance, some MT neurons integrate motion signals within their receptive field only if they belong to the same contour or surface (Stoner and Albright, 1992). They can also filter out motion within the receptive field when it does not belong to the same surface (Snowden et al., 1991; Stoner and Albright, 1992), a first step for representing motion transparency or structure-from-motion in area MT (Grunewald et al., 2002). The fact that MT neurons can thus adaptively integrate some local motion signals and explain away others is strongly related to the fact that motion-sensitive cells are most often embedded in distributed multiplexed representations. Indeed, most direction-selective cells are also sensitive to binocular disparity (Smolyanskaya et al., 2013), eye/head motion, and dynamical perspective cues (Kim et al., 2015), in order to filter out motion signals from outside the plane of fixation or to disambiguate motion parallax. Thus, depth and motion processing are two intertwined problems whose joint solution allows the brain to compute object motion in 3D space rather than in 2D space.
Depth-motion interaction is only one example of the fact that the motion pathway receives and integrates visual cues from many different processing modules. This is again illustrated in Fig. 3, where form cues can be extracted in areas V2 and V4 and sent to area MT. Information about the spatial organisation of the scene using boundaries, colours and shapes might then be used to further refine the fast and coarse estimate of the optic flow that emerges from the V1-MT-MST backbone of the hierarchy. Such cue combination is critical to overcome classical pitfalls of the feed-forward model. Notably, along the hierarchical cascade, information is gathered over larger and larger receptive fields at the cost of blurring object boundaries and shapes. Thus, the large receptive fields of MT and MST neurons can be useful for tracking large objects with the eyes, or avoiding approaching ones, but they certainly lower the spatial resolution of the estimated optic flow field. This feed-forward, hierarchical processing contrasts with the sharp perception that we have of the moving scene. Mixing different spatial scales through recurrent connectivity between cortical areas is one solution (Cudeiro and Sillito, 2006; Gur, 2015). Constraining the diffusion of motion information along edges or within surface boundaries is certainly another, as shown for texture-ground segmentation. Such form-based representations play a significant role in the disambiguation of motion information (Geisler, 1999; Heslip et al., 2013; Mather et al., 2012; McCarthy et al., 2012). They could also play a role in setting the balance between motion integration and segmentation dynamics, as illustrated in Fig. 3 (b).
Over the last two decades, several computational vision models have been proposed to improve optic flow estimation with a bio-inspired approach. A first step is to achieve a form-independent representation of velocity from the spatio-temporal responses of V1. A dominant computational model was proposed by Simoncelli and Heeger (1998), where a linear combination of afferent inputs from V1 is followed by a nonlinear operation known as untuned divisive normalisation. This model and its subsequent developments (Nishimoto and Gallant, 2011; Rust et al., 2006; Simoncini et al., 2012) replicate a variety of observations from physiology to psychophysics using simple, synthetic stimuli such as drifting gratings and plaids. However, this class of models cannot resolve ambiguities in regions lacking any 2D cues because of the absence of diffusion mechanisms. Moreover, their normalisation and weighted integration properties are still static. These two aspects may be the reason why they do not perform well on natural movies. Feedback signals from and to MT and higher cortical areas could play a key role in reducing these ambiguities. One good example was proposed by Bayerl and Neumann (2004), where dynamical feedback modulation from MT to area V1 is used to solve the aperture problem locally. An extended model of V1-MT-MST interactions that uses centre-surround competition in velocity space was later presented by Raudies et al. (2011), showing good optic flow computations in the presence of transparent motion. These feedback and lateral interactions primarily play the role of context-dependent diffusion operators that spread the most reliable information throughout ambiguous regions. Such diffusion mechanisms can be gated to generate anisotropic propagation, taking advantage of local form information (Beck and Neumann, 2010; Tlapale et al., 2010).
An attempt at utilising these distributed representations for integrating both optic flow estimation and segmentation was proposed in Nowlan and Sejnowski (1994). The same model explored the role of learning in establishing the best V1 representation of motion information, although this approach has been largely ignored in optic flow models, in contrast to object categorisation for instance. In brief, more and more computational models of biological vision take advantage of these newly elucidated dynamical properties to explain motion perception mechanisms. But it is not clear how these ideas percolate into computer vision.
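The two-step scheme described above for the model of Simoncelli and Heeger (1998), linear pooling of V1 afferents followed by untuned divisive normalisation, can be sketched as follows. This is a deliberately reduced illustration and not the published model: the pooling weights are random placeholders rather than weights derived from a velocity plane in frequency space, and the function name `mt_population`, the squaring nonlinearity and the semi-saturation constant `sigma` are assumptions made for the example.

```python
import numpy as np

def mt_population(v1_energy, weights, sigma=0.1):
    """Linear pooling of V1 responses followed by untuned divisive normalisation.

    v1_energy: (n_v1,) non-negative motion-energy responses.
    weights:   (n_mt, n_v1) pooling weights (hypothetical values here).
    sigma:     semi-saturation constant (assumed value).
    """
    drive = np.maximum(weights @ v1_energy, 0.0) ** 2  # rectify, then square
    # "Untuned": every unit is divided by the same pooled activity.
    return drive / (sigma**2 + drive.sum())

rng = np.random.default_rng(0)
weights = rng.uniform(0.0, 1.0, size=(8, 16))
v1 = rng.uniform(0.0, 1.0, size=16)

low = mt_population(0.5 * v1, weights)    # low-contrast input
high = mt_population(5.0 * v1, weights)   # same pattern, ten times the contrast
```

Under these assumptions the shape of the population response is the same at both contrasts and the summed response saturates below one, which is the sense in which the normalised code becomes largely contrast invariant. Because the weights and the divisor are fixed, the sketch also shares the static character criticised in the text.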
Comparison with computer vision solutions. The vast majority of computer vision solutions for optical flow estimation can be split into four major computational approaches (see Fortun et al., 2015; Sun et al., 2010 for recent reviews). First, a constancy assumption deals with the correspondence problem, assuming that brightness or color is constant across adjacent frames and assigning a cost in case of deviation. Second, the reliability of the matching assumptions is optimised using priors or a regularisation to deal with the aperture problem. Both of these solutions pose the problem as an energy function, and optical flow estimation itself is treated as an energy minimisation problem. Interestingly, a lot of recent research has been done in this area, continually pushing the limits of the state of the art. This research field has put a strong emphasis on performance as a criterion for selecting novel approaches, and sophisticated benchmarks have been developed. Compared with the early initiatives, current benchmarks cover a much wider variety of problems. Popular examples are the Middlebury flow evaluation (Baker et al., 2011) and, more recently, the Sintel flow evaluation (Butler et al., 2012). The latter has important features which are not present in the Middlebury benchmark: long sequences, large motions, specular reflections, motion blur, defocus blur, and atmospheric effects.
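As a concrete instance of the energy formulation described above, one classical choice is the variational model of Horn and Schunck (1981), which combines a linearised brightness-constancy data term with a smoothness prior. The notation is introduced here for illustration only: $I$ is the image intensity with partial derivatives $I_x$, $I_y$, $I_t$, $(u, v)$ is the flow field over the image domain $\Omega$, and $\alpha$ is an assumed regularisation weight.

```latex
E(u, v) = \int_{\Omega}
  \underbrace{\left( I_x u + I_y v + I_t \right)^2}_{\text{constancy (data) term}}
  \; + \;
  \alpha^2 \,
  \underbrace{\left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right)}_{\text{smoothness (prior) term}}
  \, \mathrm{d}x \, \mathrm{d}y
```

The data term penalises deviations from constant brightness along the flow, while the prior term selects, among all flows compatible with the data, the smoothest one. Larger values of $\alpha$ favour smoother fields at the expense of fidelity to the local measurements.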
Initial motion detection is a good example where biological and computer vision research have already converged. The correlation detector proposed by Hassenstein and Reichardt (1956) serves as a reference for a velocity-sensitive mechanism that finds correspondences of visual structure at image locations in consecutive temporal samples. Formal equivalence of correlation detection with multi-stage motion energy filtering has been demonstrated. There are now several examples of spatiotemporal filtering models that are used to extract motion energy across different scales. Initial motion detection is ambiguous since motion can locally be measured only orthogonal to an extended contrast. This is called the aperture problem, and mathematically it leads to an ill-posed problem. For example, in gradient-based methods, one has to estimate the two velocity components from a single equation called the optical flow constraint. In spatiotemporal energy based methods, all the spatiotemporal samples lie on a straight line in frequency space and the task is to identify a plane that passes through all of them (Bradley and Goyal, 2008). Computer vision has dealt with this problem in two ways: by imposing local constraints (Lucas and Kanade, 1981), or by imposing smoothness constraints through penalty terms (Horn and Schunck, 1981). More recent approaches have attempted to fuse the two formulations (Bruhn et al., 2005). The penalty term plays a key role as a diffusion operator that can act isotropically or anisotropically (Aubert and Kornprobst, 2006; Black et al., 1998; Scherzer and Weickert, 2000). A variety of diffusion mechanisms have been proposed so that, e.g., optical flow discontinuities can be preserved depending on velocity field variations or image structures. All these mechanisms have produced powerful results in complex scenes.
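The aperture problem and the local-constraint solution can be illustrated with a minimal sketch, assuming a single patch with one uniform velocity and small displacements. The function name `lucas_kanade_patch` and the two synthetic patterns are hypothetical, chosen for illustration only. The sketch solves the optical flow constraint in the least-squares sense and inspects the eigenvalues of the structure tensor to tell whether both velocity components are constrained.

```python
import numpy as np

def lucas_kanade_patch(frame0, frame1):
    """Estimate one velocity (u, v) for a whole patch, Lucas-Kanade style.

    Solves the optical flow constraint Ix*u + Iy*v + It = 0 in the
    least-squares sense over all interior pixels of the patch.
    """
    frame0 = np.asarray(frame0, dtype=float)
    frame1 = np.asarray(frame1, dtype=float)
    # np.gradient returns derivatives along rows (y) first, then columns (x).
    Iy, Ix = np.gradient(frame0)
    It = frame1 - frame0
    # Drop the one-pixel border, where np.gradient uses one-sided differences.
    Ix, Iy, It = (a[1:-1, 1:-1].ravel() for a in (Ix, Iy, It))
    A = np.stack([Ix, Iy], axis=1)  # one constraint row per pixel
    # Eigenvalues of the 2x2 structure tensor A^T A, in ascending order.
    # A (near-)zero smallest eigenvalue signals the aperture problem.
    eigvals = np.linalg.eigvalsh(A.T @ A)
    # lstsq returns the minimum-norm solution when A is rank deficient,
    # i.e. only the flow component along the image gradient (normal flow).
    velocity, *_ = np.linalg.lstsq(A, -It, rcond=None)
    return velocity, eigvals

def texture(x, y):
    """Hypothetical 2D pattern with structure along both axes."""
    return np.sin(0.2 * x) + np.cos(0.15 * y)

def grating(x, y):
    """Hypothetical vertical grating: contrast varies along x only."""
    return np.sin(0.2 * x)

Y, X = np.mgrid[0:32, 0:32].astype(float)
u_true, v_true = 0.5, -0.3  # translation in pixels per frame

vel_texture, eig_texture = lucas_kanade_patch(
    texture(X, Y), texture(X - u_true, Y - v_true))
vel_grating, eig_grating = lucas_kanade_patch(
    grating(X, Y), grating(X - u_true, Y - v_true))
```

For the textured patch both eigenvalues are well above zero and both velocity components are recovered approximately. For the grating the smallest eigenvalue vanishes, and only the component orthogonal to the bars is recovered, which is the aperture problem in its simplest form.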
Computational neuroscience models also tend to rely on diffusion mechanisms, but they differ in their formulation. A first difference stems from the fact that local motion estimation is primarily based on spatio-temporal energy estimation. Second, the representation is distributed, allowing multiple velocities at the same location and thus dealing with layered/transparent motion. The diffusion operator is also gated by local form cues, and it relies on an uncertainty estimate which could possibly be computed using the distributed representation (Nowlan and Sejnowski, 1994).
Promising bio-inspired solutions. A modern trend in bio-inspired models of motion integration is to use more form-motion interactions for disambiguating information. This should be further exploited in computer vision models. Future research will have to integrate the growing knowledge about how diffusion processes, form-motion interaction and multiplexing of different cues are implemented and how they impact global motion computation (McDonald et al., 2014; Rasch et al., 2013; Tsui et al., 2010). Despite the similarities between the biological and artificial approaches to optical flow computation, it is important to note that there is only little interaction between computer vision engineers and biological vision modellers. One reason might be that biological models have not been rigorously tested on regular computer vision datasets and are therefore considered to be confined to laboratory conditions. It would thus be very interesting to evaluate models such as those of Bayerl and Neumann (2007), Brinkworth and Carroll (2009), Simoncelli and Heeger (1998) and Tlapale et al. (2011), to identify complementary strengths and weaknesses in order to find converging lines of research. Fig. 8 illustrates work initiated in this direction, where three bio-inspired models have been tested on the Middlebury optical flow dataset (Baker et al., 2011). Each of these models describes a potential strategy applied by biological visual systems to solve the motion estimation problem. The first model demonstrates the applicability of a feedforward model that has been suggested for motion integration by MT neurons (Rust et al., 2006) to the estimation of optical flow, by extending it into a scale-space framework and applying a linear decoding scheme to convert MT population activity into velocity vectors. The second model investigates the role of contextual adaptations depending on form-based cues in feedforward pooling by MT neurons. The third model studies the role of modulatory feedback mechanisms in solving the aperture problem.
Fig. 8. Comparison between three biological vision models tested on the Rubberwhale sequence from the Middlebury dataset (Baker et al., 2011). The first column illustrates a model in which the authors revisited seminal work using spatio-temporal filters to estimate optical flow from V1-MT feedforward interactions. The second column illustrates an extension of the Heeger and Simoncelli model with an adaptive processing algorithm based on context-dependent, area V2 modulation of the pooling of V1 inputs onto MT cells. The third column illustrates a model which incorporates modulatory feedback from MT to V1. Optical flow is represented using the colour code from the Middlebury dataset.
Some elements of the mechanisms discussed above (e.g., the early motion detection stage) have already been incorporated in recent computer vision models. For instance, the solution proposed by Wedel et al. (2009) uses a regularisation scheme that considers different temporal scales, namely a regular motion mechanism (using short exposure frames) as well as a slowly integrating representation (using long exposure frames), the latter resembling the form pathway in the primate visual system (Sellent et al., 2011). The goal there was to reduce inherent uncertainty in the input (Mac Aodha et al., 2013). Further constraining the computer vision models by simultaneously including some of the above-described mechanisms (e.g., tuned normalisation through lateral interactions, gated pooling to avoid estimation errors, feedback-based long-range diffusion) may lead to significant improvements in optic flow processing methods and engineering solutions.

Discussion
In Section 4 we have revisited three classical computer vision tasks and discussed strategies that seem to be used by biological vision systems in order to solve them. Tables 1 and 2 provide a concise summary of existing models for each task, together with key references about corresponding biological findings. From this meta-analysis, we have identified several lines of research from biological vision that should be leveraged in order to advance computer vision algorithms. In this section, we will briefly discuss some of the major theoretical aspects and challenges described throughout the review.
[Table fragment: multi-scale implementation of a feedforward model based on spatio-temporal motion energy filters; dense optical flow estimation, evaluated on the Middlebury benchmark.]

Structural principles that relate to function
Table 2. Summary of the strategies highlighted in the text to solve the different tasks, showing where to find more details about the biological mechanisms and which models use these strategies.
Biological mechanism | Experimental paper | Models
Sensing
Visual adaptation | Kastner and Baccus (2014); Shapley and Enroth-Cugell (1984); Thoreson and Mangel (2012) | Hérault (
Contrast enhancement and shape representation

Studies in biological vision reveal structural regularities in various regions of the visual cortex. For decades, the hierarchical architecture of cortical processing has been the dominant view, in which response selectivities become more and more elaborate across levels of the hierarchy. The potential for using such deep feed-forward architectures for computer vision has recently been discussed by Kruger et al. (2013). However, it appears nowadays that such principles of bottom-up cascading should be combined with lateral interactions within the different cortical functional maps and the massive feedback from higher stages. We have indicated several computations (e.g., normalisation, gain control, segregation) that could be implemented within and across functional maps by these connectivity motifs. We have shown the impact of these interactions on each of the three example tasks (sensing, segmentation, optic flow) discussed throughout this article. We have also mentioned how these bio-inspired computational blocks (e.g., normalisation) can be re-used in a computer vision framework to improve image processing algorithms (e.g., statistical whitening and source separation (Lyu and Simoncelli, 2009), pattern recognition (Jarrett et al., 2009)). One fundamental aspect of lateral and feedback interactions is that they implement context-dependent tuning of neuronal processing, over short distances (e.g., the classical centre-surround interactions) but also over much larger distances (e.g., anisotropic diffusion, feature-based attention). We have discussed the emerging idea that these intricate, highly recurrent architectures are key ingredients for obtaining a highly flexible visual system that can be dynamically tuned to the statistics of each visual scene and to the demands of the on-going behavioural task on a moment-by-moment basis. It becomes indispensable to better understand and model how these structural principles, for which we are gaining more and more information every day, relate to functional principles. What is important in sensing, segmenting and computing optical flow is not so much the specific receptive fields involved in each of these problems but rather the common structural and computational architectures that they share (see Box 1).
For instance, bottom-up signal representations and top-down predictions would achieve a resonant state in which the context re-enters the earlier stages of representation in order to emphasise their relevance in a larger context (Edelman, 1993; Grossberg, 1980). These interactions are rooted in the generic mechanisms of response normalisation based on non-linear divisive processes. A corresponding canonical circuit, using spiking neuron representations, can then be proposed, as in Brosch and Neumann (2014) for instance. Variants of such computational elements have been used in models tackling each of the three example tasks, sensing, segmenting and optical flow (e.g., Bayerl and Neumann, 2004, 2007; Tlapale et al., 2010; Wohrer and Kornprobst, 2009), using either functional models or the neural fields formalism (see Box 1). More importantly, these different models can be tested on a set of real-world images and sequences taken from computer vision. This is just one example of the many operative solutions and algorithms that can be inspired by biology and computational vision. It is important to consider that the computational properties of a given architecture (e.g., recurrent connectivity) have been investigated from different theoretical perspectives (e.g., Kalman filtering) and in different mathematical frameworks (e.g., Dimova and Denham, 2009; Perrinet and Masson, 2012; Rao and Ballard, 1999). Some of the biologically plausible models assembled in Table 1 offer a repertoire of realistic computational solutions that can be a source of inspiration for novel computer vision algorithms.

Data encoding and representation
Biological systems are known to use several strategies, such as event-based sensory processing, distributed multiplexed representation of sensory inputs and active sensory adaptation to the input statistics, in order to operate in a robust and energy-efficient manner. Traditionally, video inputs are captured by cameras that generate sequences of frames at a fixed rate. The consequence is that the stream of visual input is regularly sampled at fixed time steps regardless of the spatio-temporal structure of the scene. In other words, the plenoptic function (Adelson and Bergen, 1991) is sliced into sheets of image-like representations. The result of such a strategy is a highly redundant representation of any constant features in the scene along the temporal axis. In contrast, the brain encodes and transmits information through discrete sparse events, and this spiking encoding appears at the very beginning of visual information processing, i.e., at the retina level. As discussed in Section 4.1, ganglion cells transmit a sparse asynchronous encoding of the time-varying visual information to LGN and then cortical areas. This sparse event-based encoding inspired the development of a new type of camera sensor. Events are registered whenever changes occur in the spatio-temporal luminance function, and they are represented as a stream of events, each with a location and a time stamp (Lichtsteiner et al., 2008; Liu and Delbruck, 2010; Posch et al., 2011). Apart from the decrease in redundancy, the processing speed is no longer restricted to the frame rate of the sensor. Rather, events can be delivered at a rate that is only limited by the refractory period of the sensor elements. Using these sensors brings massive improvements in terms of efficiency of scene encoding, and computer vision approaches could benefit from such an alternative representation, as already demonstrated on some isolated tasks.
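The contrast between frame-based and event-based encoding can be sketched with a toy conversion from frames to events. This is a simplified illustration and not a model of any particular sensor: the function name `frames_to_events`, the log-luminance change rule, the threshold value and the limit of one event per pixel per frame are all assumptions made for the example.

```python
import numpy as np

def frames_to_events(frames, timestamps, threshold=0.2, eps=1e-3):
    """Convert a frame sequence into a sparse list of change events.

    An event (t, row, col, polarity) is emitted when the log-luminance of a
    pixel has changed by at least `threshold` since the last event at that
    pixel. Simplification: at most one event per pixel per frame.
    """
    frames = np.asarray(frames, dtype=float)
    log_ref = np.log(frames[0] + eps)  # per-pixel reference level
    events = []
    for frame, t in zip(frames[1:], timestamps[1:]):
        log_now = np.log(frame + eps)
        diff = log_now - log_ref
        fired = np.abs(diff) >= threshold
        rows, cols = np.nonzero(fired)
        for r, c in zip(rows, cols):
            polarity = 1 if diff[r, c] > 0 else -1
            events.append((t, int(r), int(c), polarity))
        # Reset the reference only where an event was emitted.
        log_ref[fired] = log_now[fired]
    return events
```

On a static scene this encoder produces no output at all, whereas a frame-based camera would keep transmitting identical images. A single pixel that brightens once produces exactly one event.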
In terms of representation, examining the richness of receptive fields of cells from the retina to the visual cortex (such as in V1, MT and MST) shows that the visual system is almost always using a distributed representation for the sensory inputs. A distributed representation helps the system in a multiplicity of ways: it allows for an inherent representation of uncertainty, it allows for task-specific modulation and it could also be useful for representing a multiplicity of properties such as transparent/layered motion (Pouget et al., 2000; Simoncelli and Olshausen, 2001). Another important property of biological vision is that visual features are optimally encoded at the earliest stages for carrying out computations related to a multiplicity of tasks in higher areas. Lastly, we have briefly mentioned that there are several codes that can be used by visual networks in order to represent the complexity of natural visual scenes. Thus, it would be very helpful to take into account this richness of representations when designing systems that could deal with an ensemble of tasks simultaneously instead of subserving a single task at a time.
Recently, the application of DCNNs to computer vision tasks has boosted machine performance in processing complex scenes, achieving human-level performance in certain scenarios. Their hierarchical structure and their use of simple canonical operations (filtering, pooling, normalisation, etc.) motivated investigators to test their effectiveness in predicting cortical cell responses (Güçlü and van Gerven, 2015; Pinto et al., 2009). In order to generate artificial networks with functional properties which come close to primate cortical mechanisms, a goal-driven modelling approach has been proposed which achieved promising results (Yamins et al., 2014). Here, the top-layer representations are constrained during learning by the particular task of the whole network. The implicit assumption is that such a definition of the computational goal lies in the overlapping region of artificial and human vision systems, since otherwise the computational goals might deviate between systems as discussed above (Turpin et al., 2014, their Fig. 1). The authors argue that the detailed internal structures might deviate from those identified in cortex, but additional auxiliary optimisation mechanisms might be employed to vary structures under the constraint of matching the considered cortical reference system (Bergstra et al., 2013). The rating of any network requires the definition of a proper similarity measure, for instance dissimilarity measures computed from response patterns of brain regions and model representations to compare the quality of the input stimulus representations (Kriegeskorte, 2009).
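One way such a dissimilarity-based comparison can be set up is sketched below. It is a minimal illustration under our own assumptions: the function names `rdm` and `rdm_similarity` are hypothetical, dissimilarity is taken as one minus the Pearson correlation between response patterns, and the two matrices are compared with a plain Pearson correlation, whereas rank-based correlations are often preferred in practice.

```python
import numpy as np

def rdm(responses):
    """Representational dissimilarity matrix.

    responses: array of shape (n_stimuli, n_units), one response pattern
    per stimulus. Entry (i, j) of the result is 1 - Pearson correlation
    between the patterns evoked by stimuli i and j.
    """
    return 1.0 - np.corrcoef(np.asarray(responses, dtype=float))

def rdm_similarity(rdm_a, rdm_b):
    """Correlate the upper triangles of two RDMs (diagonal excluded)."""
    iu = np.triu_indices_from(rdm_a, k=1)
    return np.corrcoef(rdm_a[iu], rdm_b[iu])[0, 1]
```

Because only the pattern of pairwise dissimilarities is compared, the model layer and the brain region may have different numbers of units, as long as both are probed with the same set of stimuli.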

Psychophysics and human perceptual performance data
Psychophysical laws and principles which can explain large amounts of empirical observations should be further explored and exploited for designing robust vision algorithms. However, most of our knowledge about human perception has been gained using either highly artificial inputs for which the information is well-defined or natural images for which the information content is much less known. By contrast, human perception continuously adjusts information processing to the content of the images, at multiple scales and depending upon different brain states such as attention or cognition. For instance, human vision dynamically tunes decision boundaries in relation to changes observed in the environment. It has been demonstrated that this adaptation can be achieved dynamically by non-linear network properties that incorporate activation transfer functions of sigmoidal shape (Grossberg, 1980). In Chen et al. (2010), such a principle has been adopted to define a robust image descriptor that adjusts its sensitivity to the overall signal energy, similar to human sensitivity shifts. One of the fundamental advantages of these formalisms is that they can render biological performance at many different levels, from neuronal dynamics to human performance. In other words, they can be used to adjust the algorithm parameters to different levels of constraints shared by both biological and computer vision (Turpin et al., 2014).
Most of the problems in computer vision are ill-posed, and the observable data are insufficient to determine the variables to be estimated. In order to overcome this limitation, biological systems exploit statistical regularities. The data from human performance studies, either on highly controlled stimuli with careful variations in specific attributes or on large amounts of unstructured data, can be used to identify these statistical regularities, which is particularly significant for identifying operational parameter regimes for computer vision algorithms.
This strategy is already being explored in computer vision and is becoming more popular with the introduction of large-scale, internet-based labelling tools (Russell et al., 2008; Turpin et al., 2014; Vondrick et al., 2013). Classic examples of this approach in the case of scene segmentation are the exploration of human-marked ground truth data for static and dynamic scenes (Galasso et al., 2013). Thus, we advocate that further investigation of the front-end interfaces to learning functions, decision-making or separation boundaries for classifiers might improve the performance levels of existing algorithms as well as their next generations. Emerging work such as that of Scheirer et al. (2014) illustrates the potential in this direction. Scheirer et al. (2014) use human performance errors and difficulties in the task of face detection to bias the cost function of an SVM, so as to get closer to the strategies that we might be adopting or the trade-offs that our visual systems are banking on. We have provided other examples throughout the article, but it is evident that further linking learning approaches with low- and mid-level visual information is a source of major advances in both understanding biological vision and designing better computer vision algorithms.

Computational models of cortical processing
Over the last decade, many computational models have been proposed to give a formal description of phenomenological observations (e.g., perceptual decisions, population dynamics) as well as a functional description of identified circuits. Throughout this article, we have proposed that bio-inspired computer vision should consider the existence of a few generic computational modules together with their circuit implementation. Implementing and testing these canonical operations is important to understand how efficient visual processing as well as highly flexible, task-dependent solutions can be achieved using biological circuit mechanisms, and to implement them within artificial systems. Moreover, the genericness of visual processing systems can be viewed as an emergent property of an appropriate assembly of these canonical computational blocks within dense, highly recurrent neural networks. Computational neuroscience also investigates the nature of the representations used by these computational blocks (e.g., probabilistic population codes, population dynamics, neural maps), and we have proposed how such new theoretical ideas about neural coding can help to move beyond the classical isolated processing units that are typically approximated as linear-nonlinear filters. For each of the three example tasks, we have indicated several computational operative solutions that can be inspiring for computer vision. Table 1 highlights a selection of papers in which an even larger panel of operative solutions is described. It is beyond the scope of this paper to provide a detailed mathematical framework for each problem described or a comprehensive list of operative solutions. Still, in order to illustrate our approach, we provide in Box 1 three examples of popular operative solutions that can translate from computational vision to computer vision.
These three examples are representative of the different mathematical frameworks described above: a functional model such as divisive normalisation that can be used for regulating population coding and decoding; a population dynamics model such as neural fields that can be used for a coarse-level description of lateral and feedback interactions; and, lastly, a neuromorphic representation of data and of event-based computations such as spiking neuronal models.
The field of computational neurosciences has made enormous progress over the last decades and will be boosted by the flow of new data gathered at multiple scales, from behaviour to synapses. Testing popular computational vision models against classical benchmarks in computer vision is a first step needed to bring together these two fields of research, as illustrated above for motion processing. Translating new theoretical ideas about brain computations to artificial systems is a promising source of inspiration for computer vision as well. Both computational and computer vision share the same challenge: each one is the missing link between hardware and behaviour, in search for generic, versatile and flexible architectures. The goal of this review was to propose some aspects of biological visual processing for which we have enough information and models to build these new architectures.

Box 1 Three examples of operative solutions
Normalization is a generic operation present at each level of the visual processing flow, playing a critical role in functions such as controlling contrast gain or tuning response selectivity (Carandini and Heeger, 2011). In the context of neuronal processing, the normalization of the response $R_i$ of a single neuron can be written in the form
$$R_i^{\mathrm{norm}} = \frac{R_i^{\,n}}{k_{\mathrm{tuned}}\, R_i^{\,n} + \sum_j W_{ij}\, R_j^{\,n} + \sigma},$$
where $\sum_j$ indicates the summation over the normalization pool, $\sigma$ is a stabilization constant, $W_{ij}$ are weights, and $n$ and $k_{\mathrm{tuned}}$ are the key parameters regulating the behavior. When $k_{\mathrm{tuned}} = 0$ and $n = 1$ this equation represents a standard normalization. When the constant $k_{\mathrm{tuned}}$ is non-zero, normalization is referred to as tuned normalization. This notion has been used in computational models for, e.g., tone mapping (Meylan et al., 2007) or optical flow (Bayerl and Neumann, 2004; Solari et al., 2015).
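A minimal numerical sketch of divisive normalization is given below. It assumes the form R_i^n / (k_tuned * R_i^n + sum_j W_ij * R_j^n + sigma), which is one form consistent with the parameters named in the text; the function name `normalise` and the default parameter values are hypothetical.

```python
import numpy as np

def normalise(R, W, sigma=0.1, n=1.0, k_tuned=0.0):
    """Divisive normalization of a population response vector R.

    Assumed form: R_i^n / (k_tuned * R_i^n + sum_j W_ij * R_j^n + sigma).
    W has shape (len(R), len(R)) and defines each unit's normalization pool.
    """
    Rn = np.asarray(R, dtype=float) ** n
    pool = np.asarray(W, dtype=float) @ Rn  # weighted sum over the pool
    return Rn / (k_tuned * Rn + pool + sigma)

# Hypothetical population of three units sharing a uniform pool.
responses = np.array([1.0, 2.0, 4.0])
pool_weights = np.ones((3, 3))
standard = normalise(responses, pool_weights, sigma=1.0)
tuned = normalise(responses, pool_weights, sigma=1.0, k_tuned=1.0)
```

With a uniform pool and `k_tuned = 0`, all units share the same denominator, so the ratios between responses are preserved while the overall gain is compressed. A non-zero `k_tuned` adds a unit-specific term to the denominator, so strongly driven units are suppressed more than weakly driven ones.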
The dynamics of biological vision results from the interaction between different cortical streams operating at different speeds but also relies on a dense network of intra-cortical and inter-cortical connections. Dynamics is generally modelled by neural field equations, which describe spatially structured neural networks representing the spatial organization of the cerebral cortex (Bressloff, 2012). For example, to model the dynamics of two populations $p_1(t, r)$ and $p_2(t, r)$ (where $p_\cdot$ is the firing activity of each neural mass and $r$ can be thought of as defining the population), a typical neural field model takes the form
$$\frac{\partial p_1}{\partial t}(t, r) = -p_1(t, r) + S\Big( \int W_{1 \to 1}(r, r')\, p_1(t, r')\, dr' + \int W_{2 \to 1}(r, r')\, p_2(t, r')\, dr' \Big),$$
$$\frac{\partial p_2}{\partial t}(t, r) = -p_2(t, r) + S\Big( \int W_{1 \to 2}(r, r')\, p_1(t, r')\, dr' + \int W_{2 \to 2}(r, r')\, p_2(t, r')\, dr' \Big),$$
where the weights $W_{i \to j}$ represent the key information defining the connectivities and $S(\cdot)$ is a sigmoidal function. Some examples of neural field models in the context of motion estimation are Rankin et al. (2014) and Tlapale et al. (2011).
Event-driven processing is the basis of neural computation. A variety of equations have been proposed to model the spiking activity of single cells with different degrees of fidelity to biology (Gerstner and Kistler, 2002). A simple classical case is the leaky integrate-and-fire neuron (seen as a simple RC circuit), where the membrane potential $u_i$ is given by
$$\tau \frac{d u_i}{d t}(t) = -u_i(t) + R\, I_i(t),$$
with a spike emission process: the neuron $i$ will emit a spike when $u_i(t)$ reaches a certain threshold. $\tau$ is the time constant of the leaky integrator and $R$ is the resistance of the neuron. When the neuron belongs to a network, the input current is given by
$$I_i(t) = \sum_j W_{j \to i} \sum_f \alpha\big(t - t_j^{(f)}\big),$$
where $t_j^{(f)}$ represents the time of the $f$-th spike of the $j$-th pre-synaptic neuron, $\alpha(t)$ represents the post-synaptic current generated by the spike and $W_{j \to i}$ is the strength of the synaptic efficacy from neuron $j$ to neuron $i$. This constitutes the building block of a spiking neural network. In terms of neuromorphic architectures, this principle has inspired sensors such as event-based cameras (see Section 4.1). From a computational point of view, it has been used for modelling biological vision (Lorach et al., 2012; Wohrer and Kornprobst, 2009), but also for solving vision tasks (Escobar et al., 2009; Masquelier and Thorpe, 2010).
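The leaky integrate-and-fire neuron can be simulated in a few lines. The sketch below integrates tau * du/dt = -u + R * I(t) with a forward-Euler scheme and resets the potential after each spike; the function name `simulate_lif` and all parameter values (time step, time constant, resistance, threshold, reset value) are illustrative assumptions, not values taken from the cited models.

```python
import numpy as np

def simulate_lif(current, dt=1e-4, tau=0.02, R=1e7, threshold=0.02, u_reset=0.0):
    """Forward-Euler simulation of a leaky integrate-and-fire neuron.

    current: input current samples in amperes, one per time step of dt seconds.
    Returns the membrane potential trace (volts) and the list of spike times.
    """
    u = 0.0
    trace = np.empty(len(current))
    spike_times = []
    for k, I in enumerate(current):
        # Euler step of tau * du/dt = -u + R * I
        u += dt / tau * (-u + R * I)
        if u >= threshold:
            spike_times.append(k * dt)  # spike emission
            u = u_reset                 # reset after the spike
        trace[k] = u
    return trace, spike_times

steps = 2000  # 0.2 s of simulated time at dt = 0.1 ms
# Sub-threshold drive: R * I = 10 mV stays below the 20 mV threshold.
trace_weak, spikes_weak = simulate_lif(np.full(steps, 1e-9))
# Supra-threshold drive: R * I = 30 mV produces regular spiking.
trace_strong, spikes_strong = simulate_lif(np.full(steps, 3e-9))
```

With a constant sub-threshold current the potential relaxes towards R * I and no spike is emitted, whereas a supra-threshold current yields a regular spike train whose rate increases with the input. The output is a sparse list of spike times rather than a densely sampled signal.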

Conclusion
Computational models of biological vision aim at identifying and understanding the strategies used by visual systems to solve problems which are often the same as those encountered in computer vision. As a consequence, these models not only shed light on the functioning of biological vision but can also provide innovative solutions to engineering problems tackled by computer vision. In the past, these models were often limited and only able to capture observations at a scale not directly relevant to solving tasks of interest for computer vision. More recently, enormous advances have been made by the two communities. Biological vision is quickly moving towards a systems-level understanding, while computer vision has developed a great number of task-centric algorithms and datasets enabling rapid evaluation. However, computer vision engineers often ignore ideas that have not been thoroughly evaluated on established datasets, and modellers often limit themselves to evaluating their models on a highly selected set of stimuli. We have argued that the definition of common benchmarks will be critical for comparing biological and artificial solutions as well as for integrating recent advances in computational vision into new algorithms for computer vision tasks. Moreover, the identification of elementary computing blocks in biological systems and of their interactions within highly recurrent networks could help resolve the conflict between task-based and generic approaches to visual processing. These bio-inspired solutions could help scale up artificial systems and improve their generalisation, fault tolerance and adaptability. Lastly, we have illustrated how the richness of population codes, together with some of their key properties such as sparseness, reliability and efficiency, could be a fruitful source of inspiration for better representations of visual information.
Overall, we argue in this review that, despite its recent success, machine vision should turn again towards biological vision as a source of inspiration.