A model of collective behavior based purely on vision

From minimal visual information, organized collective behavior can emerge without spatial representation or collisions.


INTRODUCTION
Models of collective behavior often rely on phenomenological interactions of individuals with their neighbors [e.g., see (1–8)]. However, in contrast to physical interactions such as gravity or electromagnetism, these social interactions have no direct physical reality. The behavior of individuals is influenced by their representation of the environment, acquired through sensory information. Current models often assume that individuals respond to the state of movement of their neighbors, i.e., their (relative) positions and velocities, which are not explicitly encoded in the sensory stream. Thus, these phenomenological interactions implicitly assume internal processing of the sensory input to extract the relevant state variables. On the other hand, neuroscience has made tremendous progress in understanding various aspects of the relation between sensory signals and movement responses, yet connections to large-scale collective behavior are lacking. Although evidence has been found for neural representation of social cues in mice (9) and bats (10), the details and role of these internal representations remain unclear, particularly in the context of movement coordination. Collective behavior crucially depends on the sensory information available to individuals; thus, ignoring perception by relying on ad hoc rules strongly limits our understanding of the underlying complexity of the problem. It also obstructs the interdisciplinary exchange between biology, neuroscience, engineering, and physics.
Recently, the visual projection field has emerged as a central feature of collective movement in fish (11–14), birds (15), humans (16), and artificial systems (17, 18). Because of the geometrical nature of vision, i.e., the projection of the environment, vision is a natural starting point for exploring the relationship between sensory information and emergent collective behaviors. Some models have attempted to relate vision and movement (4, 15, 17, 19). However, they use vision only as a motivation to refine established social interaction models, or they rely on additional interactions based on information not explicitly represented in the visual input, such as the distance or heading direction of neighboring individuals. Furthermore, most of the above models consider only part of the interaction by assuming constant speed of individuals and focusing solely on their turning response. A more general modeling approach is required to investigate the role of adaptive speed in vision-mediated movement coordination.
Here, we propose a radically different approach by introducing a general mathematical framework for purely vision-based collective behavior. Following a bottom-up approach, we use the fundamental symmetries of the problem to explore what types of collective behavior can be obtained with minimal requirements.

MATERIALS AND METHODS
Formally, we can write the movement response of an agent to the visual projection field V in three spatial dimensions as the following evolution equation for its velocity vector v_i (see Fig. 1 for the geometry of the problem):

$$\dot{\mathbf{v}}_i = f_i(v_i)\,\hat{\mathbf{v}}_i + \mathbf{F}_{\mathrm{vis}}\big[V_i(\theta_i, \varphi_i, t)\big] \qquad (1)$$

The first term accounts for the self-propelled movement of an individual. Here, we used a simple linear propulsion function

$$f_i(v_i) = \gamma\,(v_0 - v_i)$$

with v_0 being the preferred speed of an individual, γ being the speed relaxation rate, and v̂_i being the heading direction vector of the focal individual with |v̂_i| = 1. The second term accounts for the movement response to the visual sensory input given by the visual projection field V_i(θ_i, φ_i, t) experienced by individual i. Here, θ_i and φ_i are the spherical angular coordinates relative to individual i, and F_vis is an arbitrary transformation of the visual field. This function has no explicit dependence on any other individual properties.
The physical, visual input corresponds to a spatiotemporal intensity and frequency distribution of the incoming light. In our framework, we considered V to be an abstract, arbitrary representation of the visual input. In particular, V can implicitly account for relevant sensory (pre-)processing, e.g., it can represent colors or brightness of the visual scene. Furthermore, V can also account for higher-order processing of visual stimuli such as object identification and classification. Equation 1 describes the projection of the full information encoded in the visual field onto the low-dimensional movement response and must hold for any particular choice of visual field.
To simplify the description, we first limited our analysis to the two-dimensional (2D) case. Without any loss of generality, F_vis can be written as

$$\mathbf{F}_{\mathrm{vis},i} = \int_{-\pi}^{\pi} \mathrm{d}\varphi_i \; \mathbf{h}(\varphi_i)\, G\big[V_i(\varphi_i, t)\big] \qquad (2)$$

The functional G[V] encodes what information from the visual input influences the movement response and how. An arbitrary G can be expanded as a series of derivatives in retinal space and time and as a power series of the visual field. This accounts for any function of the visual projection field, e.g., specific functions of the visual cortex such as the detection of edges in all directions or optical flow. The function h(φ_i): ℝ → ℝ², on the other hand, encodes the fundamental properties of the perception-motor system ("the observer") independent of the specific visual input, e.g., symmetries of the movement response or the spatial dependence of perception (e.g., a blind angle). Experimental data in fish have shown that the variation of orientation depends on the left-right position of the other individual, whereas variations of speed depend on the front-back position. The components of h are therefore expanded as a Fourier series in φ_i.
Up to this point, no approximation has been made; the model is as general as possible regarding the response to an arbitrary visual field. To develop a systematic understanding of how collective behavior can arise from the visual field, we proposed a minimal model of vision-based interactions. First, we assumed that individuals respond to an instantaneous, binary visual field, i.e., the visual projection field V(φ, t) only accounts for the presence or absence of objects and no other properties. Second, we considered an expansion of an arbitrary functional G in terms of the lowest-order retinal space and time derivatives of V. The velocity vector of an agent in 2D is determined by the speed v_i(t) along the heading direction and by the polar angle ψ_i(t) of the heading vector. The simplest equations of movement, satisfying the fundamental symmetries from (20), read

$$\dot{v}_i = \gamma(v_0 - v_i) + \alpha_0 \int_{-\pi}^{\pi} \mathrm{d}\varphi \, \cos(\varphi)\Big[{-V_i(\varphi,t)} + \alpha_1\, \partial_\varphi V_i(\varphi,t)^2 + \alpha_2\, \partial_t V_i(\varphi,t)\Big]$$

$$\dot{\psi}_i = \beta_0 \int_{-\pi}^{\pi} \mathrm{d}\varphi \, \sin(\varphi)\Big[{-V_i(\varphi,t)} + \beta_1\, \partial_\varphi V_i(\varphi,t)^2 + \beta_2\, \partial_t V_i(\varphi,t)\Big]$$

The first terms in the brackets describe the movement response to the perceived angular area (subtended angle) of the objects in the visual projection; the second ones describe the response to edges, while the third ones account for dynamical changes such as translation or loom. The coefficients α_m and β_n are arbitrary constants obtained from the expansion of G. In the following, we show that coordinated collective movement can emerge even without considering temporal derivatives, i.e., by setting α_2 = β_2 = 0. Our analysis below is thus restricted to a simple case where only a binary projection of the visual field is considered (Fig. 1, C and D).
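The response terms above (with α_2 = β_2 = 0) can be discretized on a retinal grid. The following is a minimal numerical sketch under our own discretization choices (grid resolution, edge handling), not the authors' implementation; for the binary field, each edge of the projection is represented as a unit-weight spike, so the integrals pick up cos φ and sin φ at the edge positions.

```python
import numpy as np

def binary_visual_field(i, pos, body_length=1.0, n_bins=720):
    """Binary retinal projection V_i(phi) of disk-shaped agents of
    diameter body_length, seen from agent i (world-fixed frame)."""
    phi = -np.pi + (np.arange(n_bins) + 0.5) * 2 * np.pi / n_bins
    V = np.zeros(n_bins)
    for j, p in enumerate(pos):
        if j == i:
            continue
        d = p - pos[i]
        dist = np.hypot(d[0], d[1])
        half = np.arcsin(min(1.0, 0.5 * body_length / dist))  # subtended half-angle
        center = np.arctan2(d[1], d[0])
        delta = (phi - center + np.pi) % (2 * np.pi) - np.pi  # wrapped angular offset
        V[np.abs(delta) <= half] = 1.0
    return phi, V

def vision_response(phi, V, alpha0, alpha1, beta0, beta1):
    """Speed and turning responses for alpha_2 = beta_2 = 0.  Edges of
    the binary field contribute unit weights at their retinal positions."""
    dphi = phi[1] - phi[0]
    edges = np.abs(np.diff(np.r_[V, V[0]])) / dphi  # unit-weight edge spikes
    dv = alpha0 * dphi * np.sum(np.cos(phi) * (-V + alpha1 * edges))
    dpsi = beta0 * dphi * np.sum(np.sin(phi) * (-V + beta1 * edges))
    return dv, dpsi
```

For example, with body_length = 1 and α_1 = 0.2 (so L_eq = 2.5 BL), a neighbor 10 BL straight ahead yields a positive speed response (attraction), while a neighbor at 1.5 BL yields a negative one (repulsion).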

RESULTS
The first terms, associated with the angular area of objects in the visual projection, create a short-range interaction that decreases as the object gets farther away (Fig. 2 and fig. S4). In contrast, the second terms, with the first derivative with respect to the visual field coordinate, yield a long-range interaction due to the nonlinearity of the sine/cosine functions (Fig. 2 and fig. S4). Thus, these lowest-order terms, neglecting temporal derivatives, are sufficient to generate short-range repulsion and long-range attraction: The individual is repelled by the subtended angle of the object on its visual field while being attracted by its edges. On the basis of the choice of the corresponding interaction parameters, we can define an equilibrium distance, where attraction and repulsion balance (see the Supplementary Materials for details). This equilibrium introduces a characteristic metric length scale into the system despite the lack of any representation of space at the level of individual agents.
The front-back equilibrium distance is L_eq^(fb) = BL/(2α_1), whereas the left-right equilibrium distance is L_eq^(lr) = BL/(2β_1), with BL (body length) being the diameter of individual agents. Here, we focus on the case α_1 = β_1, i.e., where the attractive terms associated with the edges are equal for turning and acceleration, resulting in a spatially isotropic equilibrium distance L_eq = L_eq^(fb) = L_eq^(lr) in 2D (see the Supplementary Materials for details).
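The front-back balance can be checked numerically. For a single neighbor straight ahead at distance L, the bracketed speed response reduces to −2 sin Δ + 2α_1 cos Δ, with subtended half-angle Δ = arcsin(BL/(2L)). The sketch below (our own check, not code from the paper) locates the zero crossing by bisection and compares it with the small-angle prediction BL/(2α_1):

```python
import numpy as np

def accel_response(L, body_length=1.0, alpha1=0.1):
    """Speed response (up to the prefactor alpha_0) to a single neighbor
    straight ahead: area term -2*sin(half) plus edge term 2*alpha1*cos(half)."""
    half = np.arcsin(0.5 * body_length / L)  # subtended half-angle
    return -2.0 * np.sin(half) + 2.0 * alpha1 * np.cos(half)

def equilibrium_distance(body_length=1.0, alpha1=0.1):
    """Bisection for the distance where repulsion and attraction balance."""
    lo, hi = 0.51 * body_length, 100.0 * body_length
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if accel_response(mid, body_length, alpha1) > 0.0:
            hi = mid  # net attraction: equilibrium lies closer
        else:
            lo = mid  # net repulsion: equilibrium lies farther out
    return 0.5 * (lo + hi)
```

For α_1 = 0.1 and BL = 1, the numerical equilibrium is about 5.02 BL, close to BL/(2α_1) = 5 BL; the small residual comes from the cos Δ factor neglected in the small-angle approximation.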
A systematic exploration of the collective behavior of multiple agents interacting through the minimal vision model reveals the emergence of a wide range of collective behaviors for different parameter sets and group sizes (Figs. 3 to 5). In particular, we observe robust self-organized collective movements for a large range of parameters, emerging from the interplay of visual perception and the movement response of individuals. The degree of coordination and the density of the flocks can be quantified through the normalized average velocity of the group, also referred to as orientational order or polarization, and the average nearest-neighbor distance (Fig. 5). We note that because of the vision-induced long-range attraction, fragmentation of groups is negligible for the group sizes considered. Figure 4 gives a qualitative overview of the location of different states in the (α_0, β_0) parameter plane, where α_0 controls the overall acceleration-deceleration response and β_0 controls the overall turning response. The exact boundary between the different regions depends on the number of individuals N and the equilibrium distance L_eq. In Fig. 5, we show the corresponding quantitative results for neighbor distances and polarization values for different N and L_eq. Some general principles can be extracted. First, when both α_0 and β_0 are small, i.e., individuals have a very low overall response to the visual projection field, no polarization is observed, and the average closest-neighbor distance is high. Obviously, in the limit of individuals not reacting to their neighbors, they simply move in straight lines, and the interindividual distance naturally grows without bound. As long as either α_0 or β_0 is large enough, the average distance to the closest neighbor decreases; thus, individuals remain close to at least one other individual. This distance becomes smaller than L_eq when there are more than two individuals.
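The two order parameters used here are standard and easy to compute from simulation output; a short sketch (variable names are ours):

```python
import numpy as np

def polarization(headings):
    """Orientational order P = |mean heading unit vector|:
    P -> 1 for a perfectly aligned group, P -> 0 for isotropic headings."""
    u = np.column_stack([np.cos(headings), np.sin(headings)])
    return float(np.linalg.norm(u.mean(axis=0)))

def mean_nearest_neighbor_distance(pos):
    """Average over agents of the distance to their closest neighbor."""
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude self-distances
    return float(d.min(axis=1).mean())
```

For identical headings P = 1, for headings spread evenly over the circle P ≈ 0; three agents at x = 0, 1, 5 give nearest-neighbor distances 1, 1, and 4, i.e., a mean of 2.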
For  0 = 0, where individuals do not modify their acceleration, two transitions can be observed. First, as the turning rate  0 increas-es, the system reaches a swarm-like state, where individuals permanently change their positions within the group in a fluid-like manner (Fig. 3D). When  0 is increased even further, the system remains in a disordered state, but the average positions of individuals become locked in place (Fig. 3E), i.e., individuals move around fixed relative positions resulting in a crystal-like group structure.
On the other hand, when  0 is small, i.e., the individual turning response is small, three transitions are observed. First, as the linear acceleration rate  0 increases, the system reaches a swarm-like state where the position of the individuals remains fluid inside the group. In this parameter regime, milling states can be observed (Fig. 3C). We note that most of the time, no common rotation direction emerges, i.e., individuals turn in both directions simultaneously. When  0 is increased further, a polarized state is observed, where individuals arrange approximately along a line perpendicular to the direction of motion (Fig. 3A). This is close to the trivial steady state with individuals arranged in a perfect line, where, because of occlusions, an individual in the middle interacts only with its two closest neighbors to the left and right. The shape of the polarized group can be modified from this line state to an elliptical shape by increasing the turning rate  0 (Fig. 3B). If  0 is increased further while  0 remains low, the group gets stuck in place. Individuals are oscillating forth and back along their heading direction: They approach their neighbors but then move back when they come too close. Because of the erratic individual motion in this regime, no ordered steady state emerges. Note that those patterns are modified when the number of individuals is modified or when L eq is changed (Fig. 4). In particular, groups with small numbers of individuals almost always display strong polarization. Two mechanisms may explain the observed decrease in polarization with group size N (Fig. 5): First, because of occlusions, agents only perceive visual projections of a subset of the entire swarm, which may lead to decreasing global coordination with increasing N. Second and more likely, it is a consequence of the binary nature of the visual projection. 
With increasing group size, the visual projection becomes less and less informative because of increasing overlaps of projections from different individuals at different distances, up to (partial) saturation of the visual field. This visual "confusion" inhibits the ability of the group to coordinate. The latter mechanism is also in line with the smaller parameter regions where large, polarized groups can be observed (P > 0.5, N ≥ 10) for the smaller equilibrium distance L_eq, which results in higher flock densities (Fig. 5).
Last, for large groups, another collective mode becomes very prominent. The group assumes a tube-like geometry by spreading out in one spatial dimension, with individuals moving mainly along the main axis of the tube (Fig. 3E). This state can also be observed in smaller groups for small values of L eq .
Besides the ability to exhibit ordered, directed collective movement, an often neglected property is the ability of agents to avoid collisions. This may be particularly important for artificial swarm robotic systems. Here, we can identify extended regions of parameter space without any collisions that overlap with the regions of ordered motion (Figs. 4 and 5).
The observation of coordinated motion without any collisions is particularly remarkable, as our minimal vision model does not take any time derivatives of the visual field (i.e., optical flow) into account and thus lacks any explicit or implicit alignment mechanism [e.g., see (6, 7, 21)]. Furthermore, individuals do not know where they are relative to others; thus, they do not use any information on the number or distance of other individuals.
The absence of collisions is observed in two main regions of the phase diagram (Fig. 5): when the turning rate is high (individuals swarm in a crystal-like configuration) and when the acceleration term is high. A balance needs to be found between the acceleration rate and the turning rate to generate noncolliding polarized swarms. Because of the symmetry of the interaction field, modifying linear acceleration is crucial for reliable avoidance of direct collisions. This emphasizes the importance of individual speed modulation [cf. (20)] and questions the generality of flocking models where individuals move at constant speed and respond to others only through changes in their direction of motion. The ability to accelerate and decelerate is critical for obtaining noncolliding polarized swarms in the absence of explicit velocity alignment forces [see also (22)].

Extension to 3D
Extending the model to three spatial dimensions is formally straightforward yet conceptually nontrivial. For this, we now consider the full visual projection in spherical coordinates for each individual by taking into account the corresponding elevation angle θ_i. An additional equation is required to account for the variation of velocity in the third dimension. This could be implemented either with cylindrical coordinates, through the variation of the velocity in the z direction, v_z,i (Fig. 6A), or with spherical coordinates, where the individuals are able to rotate in all spatial directions. This modeling choice raises new fundamental questions related to kinematics, perception, and neural representation in the context of collective behavior. If the individuals can rotate in 3D, should the visual projection field be linked to the individual and thus be decoupled from the outside references of left-right and up-down in the real world? Or should we rather assume an external reference frame, defined, e.g., via gravity, that anchors the visual projection so that the horizon remains horizontal? Here, recent insights from neuroscience may help resolve these questions; e.g., for bats, the existence of such a gravity-anchored reference frame has recently been suggested (23, 24).
Furthermore, the question of the role of edge detection and response along different directions becomes conceptually nontrivial. Here, in a simple extension of the 2D case discussed above, we focus only on left-right edges via the derivative ∂_{φ_i}V and neglect, for simplicity, the impact of up-down edges, ∂_{θ_i}V.
These conceptual questions require a deeper analysis that is beyond the scope of this paper. The simple example discussed here is meant as a proof of concept that the minimal model can be extended to 3D and already yields potentially interesting dynamics for binary visual projections. Specifically, we make the following simplifying assumptions: Individuals can move in the z direction without rotation in θ and independently in the (x, y) plane. The visual field is thus always anchored to the real world. Derivatives are considered only in the left-right direction, to be consistent with the analysis performed in 2D. The variation of velocity in the z direction is obtained by comparing elements that are up and down. The variation of movement in the horizontal plane is defined only by the individuals that are contained in that plane. The corresponding equations of motion read

$$\dot{v}_i = \gamma(v_0 - v_i) + \alpha_0 \int_{-\pi}^{\pi} \mathrm{d}\varphi \, \cos(\varphi)\Big[{-V_i(\varphi, 0, t)} + \alpha_1\, \partial_\varphi V_i(\varphi, 0, t)^2\Big]$$

$$\dot{\psi}_i = \beta_0 \int_{-\pi}^{\pi} \mathrm{d}\varphi \, \sin(\varphi)\Big[{-V_i(\varphi, 0, t)} + \beta_1\, \partial_\varphi V_i(\varphi, 0, t)^2\Big]$$

$$\dot{v}_{z,i} = -\gamma\, v_{z,i} + \delta_0 \int_{-\pi}^{\pi} \mathrm{d}\varphi \int_{-\pi/2}^{\pi/2} \mathrm{d}\theta \, \sin(\theta)\Big[{-V_i(\varphi, \theta, t)} + \delta_1\, \partial_\varphi V_i(\varphi, \theta, t)^2\Big]$$

We propose that in the z direction, the agents are still attracted to edges and repelled by the angular area of the objects in their visual projection. From the point of view of the symmetry of the system, it is then natural that the focal individual does not respond through vertical motion to objects in its horizontal plane. In other words, if all individuals are located in the horizontal plane, z_i = 0 and θ_i = 0, no movement direction in z (up or down) can be chosen unless a bias is introduced.
One needs to be careful here: When objects move farther away, their apparent size in the visual field is reduced not only along the φ axis but also along the θ axis [see Fig. 6 (B and C) for an impression of the visual field in 3D]. This leads to a situation where both the attraction and the repulsion strength decrease at infinity (see the Supplementary Materials for more details on the comparison between the 2D and 3D models). The attraction toward the edges of objects now acts only at intermediate ranges. However, a linear correlation still exists between the equilibrium distances in the horizontal plane (x, y) and the values of α_1^{-1} and β_1^{-1}. A similar equilibrium distance is expected in the z direction, given by the inverse of the vertical edge attraction coefficient, δ_1^{-1}.
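The different decay of the two terms in 3D can be made explicit for a spherical agent of diameter BL at distance L: its projection is a spherical cap of angular radius arcsin(BL/(2L)), so the area (repulsion) term scales as ~1/L² while the boundary (edge attraction) term scales as ~1/L, and both vanish at infinity. A small geometric sketch of this scaling (illustrative only, not part of the model equations):

```python
import numpy as np

def projection_terms(L, body_length=1.0):
    """Solid angle (area term) and boundary length (edge term) of the
    spherical projection of a sphere of diameter body_length at distance L."""
    half = np.arcsin(0.5 * body_length / L)    # angular radius of the cap
    area = 2.0 * np.pi * (1.0 - np.cos(half))  # spherical-cap solid angle, ~ 1/L^2
    edge = 2.0 * np.pi * np.sin(half)          # cap boundary length, ~ 1/L
    return area, edge
```

Doubling the distance from 10 BL to 20 BL reduces the area term by roughly a factor of 4 but the edge term only by a factor of 2, which is why the attraction dominates at intermediate ranges before both decay away.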
We focus here only on the effect of the equilibrium distance in the z direction by setting the equilibrium distances equal in both directions of the plane, α_1^{-1} = β_1^{-1} = 5BL. We set α_0 = 5 and β_0 = 2 so that polarization is observed in the horizontal plane (x, y) when δ_0 = 0. Last, we chose a large value for the vertical response parameter δ_0 to reduce the convergence time in the z direction: δ_0 = 10. With these settings, we investigate simple metrics quantifying the shape and coordination of the swarm, namely, the maximal extension of the swarm in the plane (x, y), r_max, and in the z direction, r_z,max, as well as the average polarization of the swarm (Fig. 6D).
Looking at the extension of the swarm in the z direction, r_z,max, a sharp transition is observed. When δ_1 > α_1, the system remains mainly in the horizontal plane, r_z,max < BL, while if δ_1 < α_1, the swarm expands more in the z direction, and r_z,max reaches values higher than 10BL. This qualitative pattern is independent of the number of individuals.
This transition can be intuitively understood through the analysis of the two equilibrium distances: When δ_1 > α_1, the equilibrium distance in the z direction is small compared to the equilibrium distance in the horizontal plane. The swarm extends in the direction of the larger equilibrium distance, i.e., the group flattens and becomes quasi-2D. An analogous explanation can be given when δ_1 < α_1. Here, the constraints in the horizontal plane dominate, and the system becomes effectively more compressed in the x and y directions. As the individuals come close together in x and y, they need to increase their distance in z, and the group extends vertically.
As the swarm extends vertically, i.e., when δ_1 is small, its extension in the horizontal plane also increases (Fig. 6D). A large extension in z results in a dilution of neighbors approximately in the same (x, y) plane as the focal individual. By construction of the model, individuals respond most strongly to others in the same horizontal plane. The stronger tendency of neighbors to be located outside the plane of the focal individual results in weaker overall visual attraction, which, in consequence, leads to an increase in the horizontal interindividual distance beyond the theoretical value L_eq obtained in the pure 2D case.
Last, for strongly anisotropic configurations, polarization seems to drop. In large swarms, polarization appears to become maximal when both equilibrium distances are of the same order of magnitude (δ_1/α_1 ≈ 1). For large z extensions, the reduced coordination can again be explained by the overall reduction in visual interactions due to non-negligible z differences between neighbors. For small extensions in the z direction, we effectively end up with a 2D group. Here, the visual confusion in large groups due to the binary visual projection, as discussed for the 2D case, leads to lower levels of coordination. Thus, more isotropic groups can be viewed as an optimal configuration of quasi-horizontal layers, which maximizes in-plane coordination while minimizing the off-plane dilution of visual interactions, resulting in maximal polarization of the entire group.

DISCUSSION
The central aim of this work is the formulation of a mathematical framework for collective movement based exclusively on the visual projection field. Following a bottom-up approach, we focused on the simplest possible case: the individual motion response to a binary visual projection field based on the lowest-order expansion of the vision processing function G[V].
By relating perception and movement response, we have shown how a simple, purely vision-based model of collective behavior can be constructed directly, without the need for explicit ad hoc rules of coordination between individuals. This model does not specify spatial representation, explicit alignment, or even an explicit representation of other individuals. Therefore, these features are not essential ingredients of the social interactions underlying organized collective behavior. It is important to emphasize this last point: The model cannot simply be rewritten and reformulated as a "classical" social-force model in the reference frame of an individual. Hence, it also calls into question the underlying representations implicitly assumed in social-force models.
Can animals identify the positions in space of other individuals? How many neighbors can be represented simultaneously from the vision of an individual? The answers to these questions should arise from neurophysiological data, but their link to the movement response needs to be stated explicitly. Furthermore, a spatial scale is introduced into the system through the size of the animal and not through ad hoc parameters in the equations. The mathematical framework formulated here allows us to study effects so far largely neglected in mathematical models of flocking, such as the role of the body shape of individuals in the visual projections or the role of coloration patterns in vision-mediated collective behaviors.
Ultimately, we are convinced that a perception-based modeling framework will help build bridges between collective behavior research and sensory neuroscience (4). Specifically, a systematic bottom-up approach revealing discrepancies between the predictions of minimal vision-based models and empirical observations will provide fundamental insights into the role of neural representations and higher-order processing of visual inputs in collective behavior.
In general, all theoretical models of flocking, including this one, should be critically assessed in terms of their relevance for real-world biological or artificial flocks. Despite the simplicity of our minimal model, we have shown that it can reproduce the social response map reported for pairs of fish (20) while, at the same time, producing coordinated movement patterns in larger groups of up to 100 agents. Nevertheless, even with this fundamental agreement, we should be cautious regarding the ability of this simple model to account for the broad range of collective behaviors observed in vertebrates. It relies only on the lowest-order terms in the expansion of the visual response function G[V]; thus, it is likely better suited for describing coordination in scenarios where higher-order processing can be expected to play a very limited role, as in collective escape cascades in fish schools (12). Low-order, vision-based interactions are likely more relevant for the collective behavior of invertebrates, such as insects [e.g., locusts (25) and midges (26)] or crustaceans [e.g., soldier crabs (27) and Antarctic krill (28)]. Here, juvenile locusts appear to be a promising study system, where we observe effective coordination and collective migration (25, 29) of individuals with stereotypical optomotor responses and a still-developing, thus limited, visual system (30–32).
We note that the estimation of the visual projection field from video data requires more information than the individual center-of-mass coordinates typically recorded in standard tracking experiments. It is essential to quantify body shapes, head positions, and orientations. While advanced tracking methods for animal groups are being actively researched (33), currently available datasets lack this information. Hence, substantial additional effort is required to extract visual projection fields even for existing datasets, which goes beyond the scope of this work. In the future, a particularly promising avenue for investigating vision-based social interactions is state-of-the-art virtual reality techniques (34).
For minimal vision models with binary visual projection in the limit of very large and dense flocks, we may obtain full saturation of the visual field for individuals within the flock. In this case, the social response of individuals with saturated vision vanishes, whereas the individuals at the boundary experience a social attraction into the flock. Our results suggest that even in the absence of full saturation, overlaps in the visual projection may inhibit coordination in large groups. These examples show the limitations of a minimal model relying on binary vision. However, we note that bird flocks operate at marginal opacity, without saturation of the visual projection field, and the attraction toward edges in the visual projection in the minimal model is an effective mechanism for density regulation toward marginally opaque states (15). Furthermore, the extremely high densities necessary for saturation of the visual projection in bird flocks would require interindividual distances so short that collisions would become very likely for empirically derived parameters (5). These extremely high densities are more likely to occur in large schools of pelagic fish, such as sardines or herring (35, 36). The simplest solution for avoiding saturation of the visual projection field is to abandon the restriction to a binary visual projection. For example, for plainly colored schooling fish, neighbors can be assumed to blend with the background with increasing distance. A simple way to model decreasing contrast with distance is to assume a grayscale visual projection field whose darkness decreases with distance. Thresholds in contrast detection would then naturally result in a visual response only to the first shells of nearest neighbors, avoiding full saturation of the visual field. Ideally, the corresponding distance dependencies and thresholds can be obtained from the properties of the visual system and/or the optical properties of the medium due to attenuation and scattering of light (37).
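A grayscale projection with distance-dependent contrast could be sketched as follows; the exponential attenuation length and detection threshold are purely illustrative assumptions, not values from this work:

```python
import numpy as np

def perceived_contrast(distance, attenuation_length=20.0, threshold=0.05):
    """Darkness of a neighbor against the background, decaying with
    distance (Beer-Lambert-like attenuation); contrasts below the
    detection threshold are treated as invisible (returns 0)."""
    c = np.exp(-np.asarray(distance, dtype=float) / attenuation_length)
    return np.where(c < threshold, 0.0, c)

# the effective interaction range is where contrast hits the threshold:
# r_max = -attenuation_length * ln(threshold)  (~60 in the units above)
```

Such a cutoff makes neighbors beyond the first few shells contribute nothing, which is exactly the saturation-avoiding behavior described in the text.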
Last but not least, for extremely high densities and corresponding short nearest-neighbor distances, other senses, such as touch and lateral line, will play an important role in movement coordination (38,39).
In contrast, at low density, a visual system with limited acuity may fail to detect other individuals that are too far away. Individuals would then become effectively invisible, and the interaction would vanish at long range. Care needs to be taken when designing artificial systems to ensure that the size of the individuals and the expected extent of the swarm can be resolved by the visual sensors used.
Even if the simple model discussed above does not account for the full complexity of sensory and cognitive processing in humans or many vertebrates, we have demonstrated its ability to produce various modes of collective movement already with minimal assumptions on the vision-based interactions. Therefore, it represents an interesting reference model for the self-organization of flocks, which is radically different from similarly idealized models widely used in the literature, such as the Vicsek model (7, 21) or more biologically inspired models relying on phenomenological social forces (1–3).
We believe that the model framework is also of relevance to the theory of dynamical systems from a very fundamental point of view. It is a paradigmatic example of a class of models where interaction between individual units is not based on physical force fields but solely on the perception and internal representation of the social environment by the local agent. The coupling between agents is based on a lower-dimensional projection of the actual dynamical behavior of many agents. The resulting flocking model is neither metric nor topological (11); thus, new mathematical approaches are needed to explore the emergent collective behaviors at the macroscopic scale. Furthermore, the simple vision-only interaction discussed here has some interesting properties. It does not correspond to a simple superposition of binary interactions and does not rely on arbitrary cutoffs or thresholds. Thus, it results in a self-consistent description of interactions from a single individual up to large groups, naturally accounting for effects like self-organized marginal opacity (15) due to saturation of the visual field.
This vision-based model can also be useful for the construction of terrestrial and aerial robotic swarms. The ability to avoid collisions is given directly to each individual agent without the implementation of specific algorithms (40). Organized collective behavior can emerge from the instantaneous reaction to the visual projection field. The whole system is fully decentralized, and the collective organization does not rely on any explicit exchange of information between individuals. Once an omnidirectional, binary visual field is available, the local computational requirements are low. The acquisition of a full field of view may pose a technical challenge, but the integrative nature of the model can be exploited efficiently: Expanding on works such as (41), an array of sensors can perform independent computations and exchange only the results of the local integration. We show that the reduction of complex environmental perception through integration is sufficient for effective coordination; only minimal information bandwidth is then required between the parts of the computational system. This final aspect reveals an interesting analogy to the perceptual modularity of our own brain, where the scene that we observe with both eyes does not need to be fully exchanged between the two hemispheres.
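Because the response is a single integral over the retina, it decomposes exactly over angular sectors: each sensor can integrate its own sector and report one scalar, so only as many numbers as sensors need to be exchanged, regardless of retinal resolution. A sketch of this idea (hypothetical sensor layout, our own construction):

```python
import numpy as np

def sector_partial_sums(phi, integrand, n_sensors=4):
    """Each sensor integrates the retinal integrand over its own angular
    sector; the full response is the sum of the n_sensors partial results."""
    dphi = phi[1] - phi[0]
    sector = ((phi + np.pi) / (2.0 * np.pi) * n_sensors).astype(int) % n_sensors
    return np.array([dphi * integrand[sector == s].sum()
                     for s in range(n_sensors)])

# any retinal integrand, e.g. cos(phi) times some processed visual field
phi = -np.pi + (np.arange(720) + 0.5) * 2.0 * np.pi / 720
integrand = np.cos(phi) * np.sin(3.0 * phi) ** 2
partials = sector_partial_sums(phi, integrand)
full = (phi[1] - phi[0]) * integrand.sum()  # centralized computation
```

Because the sectors partition the retina, the sum of the four partial results equals the centralized integral, which is the property that keeps the inter-sensor bandwidth minimal.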

SUPPLEMENTARY MATERIALS
Supplementary material for this article is available at http://advances.sciencemag.org/cgi/content/full/6/6/eaay0792/DC1

Fig. S1. Stable solution defined for the speeding force and the turning force.
Fig. S2. Derivatives of simple discontinuous function.