Deep attention networks reveal the rules of collective motion in zebrafish

A variety of simple models has been proposed to understand the collective motion of animals. These models can be insightful but lack important elements necessary to predict the motion of each individual in the collective. Adding more detail increases predictability but can make models too complex to be insightful. Here we report that deep attention networks can obtain in a data-driven way a model of collective behavior that is simultaneously predictive and insightful thanks to an organization in modules. The model obtains that interactions between two zebrafish, Danio rerio, in a large groups of 60-100, can be approximately be described as repulsive, attractive or as alignment, but only when moving slowly. At high velocities, interactions correspond only to alignment or alignment mixed with repulsion at close distances. The model also shows that each zebrafish decides where to move by aggregating information from the group as a weighted average over neighbours. Weights are higher for neighbours that are close, in a collision path or moving faster in frontal and lateral locations. These weights effectively select 5 relevant neighbours on average, but this number is dynamical, changing between a single neighbour to up to 12, often in less than a second. Our results suggest that each animal in a group decides by dynamically selecting information from the group. Highlights At 30 days postfertilization, zebrafish, Danio rerio, can move in very cohesive and predictable large groups Deep attention networks obtain a predictive and understadable model of collective motion When moving slowly, interations between pairs of zebrafish have clear components of repulsion, attraction and alignment When moving fast, interactions correspond to alignment and a mixture of alignment and repulsion at close distances Zebrafish turn left or right depending on a weighted average of interaction information with other fish, with weights higher for close fish, those in a collision path or those moving fast in front or to the sides Aggregation is dynamical, oscillating between 1 and 12 neighbouring fish, with 5 on average


Introduction
Aggregation rules that weigh all members in a collective equally can give estimations that are better on average than choosing one member at random [1,2]. However, this is a mathematical certainty only under some particular conditions [3,4]; e.g. when all individuals in a collective are noisy estimators but better than random and statistically independent [1,5]. There is an intense debate on whether the conditions under which these simple rules are guaranteed to work are useful to defend a truth-tracking version of majority or plurality voting in society [6,7,4,8]. These rules might still work when knowledge is in a minority, for example when an ignorant majority chooses with equal probability among all available options and the informed minority biases towards the correct answer [5,9,10]. However, even if this is the case, the effect of a minority is small and thus very sensitive to noise.
Many models of interactions in collective animal behavior also use an average-across-neighbours approach as a simplifying assumption [11,12,9,13,14,15,16] or other operations that also aggregate all individuals equally [17,18,19,20,21]. However, there are models, like the many-eyes model for predator detection, explaining that, after some individuals detect a predator, the rest do not aggregate all members equally but focus instead on the informed minority [22,23,24,25,26]. An ideal aggregation system would have these two cases -averaging or following very few individuals-as extremes, shifting between them to match the changing knowledge distribution in the group [4]. We set out to test for this type of aggregation rule using high-quality tracking data of large animal groups of zebrafish, Dario rerio, obtained using the recently developed idtracker.ai [27], and modelling techniques using deep attention networks [28,29,30].

Predicting the future using a deep interaction network
We recorded videos of groups of 60, 80 or 100 juvenile zebrafish, Danio rerio, (Fig. 1A for a detail of two fish and Fig. S1 for illustrative video frames). We tracked videos using our system idtracker.ai, obtaining high-quality position, velocity and acceleration values (see Materials and Methods). We used the trajectories to obtain data-driven models of fish interactions. We required our models to be predictive of the future of a focal fish in test data (video sequences not used to train the model). We found deep interaction networks, good predictors of physical systems like the motion of planets [31], to have the highest predictive power on test data. Our deep interaction network is divided in two parts: (i) n pair-interaction subnetworks, each describing the interaction of a focal fish with one of its n closest neighbors (Fig. 1B), and (ii) an aggregation subnetwork, aggregating the n outputs of the pair-interaction subnetworks (Fig. 1C, subnetwork to the right).
The inputs to the network are quantities expressed in a coordinate system centered at the focal fish and with the y-axis in the direction of the velocity of the focal (Fig. 1A, red). The pair-interaction subnetwork has as inputs the asocial information of the focal, α (Fig. 1B, red), and the social information of one neighbour i, σ i (Fig. 1B, blue). The asocial information of the focal is its speed, v, tangential acceleration, a , and normal acceleration, a ⊥ . We found that a had little impact on accuracy (Table S1), so we did not consider it in further computations. The social information is the neighbour position with respect to the focal, x i and y i , its velocity, v i,x and v i,y , and acceleration, a i,x and a i,y . Neighbour accelerations had little impact on accuracy (Table S1) and were not used in further computations.
Predictability of the turning side of the focal fish after 1 second improves with the number of neighbours, but with diminishing returns (Fig. S2); we chose n = 25 neighbors. The tangential acceleration of the focal and the acceleration of the neighbours have a small impact on accuracy of < 0.1% (Table S1) and we thus discarded them in further analysis. In the main text we provide analysis of groups of 100 animals and prediction at 1 s in the future for illustration purposes. Our models predict well a range of futures (Fig. S3). Results on how fish interact were found to be similar in computations using 250 ms, 500 ms and 1.5 s in the future (Fig. S4-S6) and for groups of 60 or 80 zebrafish (Fig. S7,8).
Predictability of the turning side after 1 s is higher for large turning angles than for turning angles close to 0 or 180 degrees (Fig. S9). For turning angles of 20 − 160 • , the interaction network predicted the correct side with an accuracy of 84.4%; up to 87 % for 40 − 100 • (Fig. S9). In contrast, a model using only focal variables failed to obtain a high accuracy and reached only 55 % . The high accuracy of the interaction network shows that the 6 × 25 = 150 dimensions capture an important part of the collective dynamics. The instances not predicted may originate from a variety of effects, including higher-order correlations, individuality and non-markovian effects, i.e. history-dependency, be it at short scales or at long scales (internal states or unaccounted behavioural variables, like posture or eye movements). Accuracies were larger the larger the group (Table S2), consistent with the idea that interactions lock individuals into social dynamics, less stochastic and less dependent on individual internal states than asocial dynamics.

Deep attention networks obtain a predictive and analyzable model
The interaction model obtained is too high-dimensional to provide useful insight of animal interactions. The pair interaction subnetwork takes the values of 6 variables as inputs and outputs 128 values (Fig. 1B). The aggregation subnetwork first sums up the 25 128-dimensional vectors to give a single vector of 128 components, and then processes it to output a single number, z (Fig.  1C). However, the interaction network remains a useful reference for how predictive the behavior is.
To gain insight, we used a deep attention network instead [28,29]. Like the interaction network, the attention network has a subnetwork to describe the interaction of the focal with each of the n neighbors, except now with a single output (Fig. 1D). The aggregation subnetwork is a function weighting differently each neighbor depending on its kinematic parameters and those of the focal (Fig. 1E). We found that focal and neighbor speed and neighbor position are the inputs to the aggregation subnetwork with the highest impact in accuracy. We can express the probability that the focal turns to the right after 1 s, p, as p = 1/1 + exp(−z), where z is the logit that the deep attention network outputs (Fig. 1F), .
The pair-interaction subnetwork, Π(α, σ i ), describes the interaction of the focal and one neighbour i. The aggregation network W gives different weights to the different neighbors i in the aggregation depending on the kinematic parameters of focal α (W ) and neighbor relative to focal σ (W ) i . The subscript indicates that these variables may be different to the ones in the pair-interaction subnetwork.
Since we want the pair-interaction subnetwork, Π, and the aggregation subnetwork, W , to represent the logit of turning after 1 s given a neighbor and a weight representing the importance of that neighbor, respectively, they must differ on several accounts. (i) Π can have any real output, while W must be always positive. (ii) Π must be antisymmetric with respect to reflection on the y-axis, while W must be symmetric. This is because we assume that a neighbour to the right makes the focal go to the right as much as an identical neighbour to the left of the focal makes the focal move to the left. For the aggregation weight W , however, we assume that the importance of the two cases is the same. (iii) The aggregation weights must sum 1. These three conditions are required and we enforced them by: (i) using an exponential as final activation function of W , (ii) antisymmetrizing Π and using symmetric input in W , and (iii) normalising the outputs of W by the sum across all neighbours prior to the integration with the outputs of Π.
The resulting attention network achieves 83.2% accuracy for turns between 20 • and 160 • and around 85.9% for 30-100 • (Fig S9). This is slightly less accurate than the interaction network, but the much lower dimensionality of the two subnetworks allows for a detailed analysis.

The structure of interaction of a pair of animals in a collective
The pair-interaction subnetwork Π is a six-dimensional function. We plotted its output, the logit of the focal fish turning to the right after 1 s, z, as a function of two variables: the angle, θ i , and the speed of the neighbour, v i (Fig. 2). We fixed the other four variables: focal at median velocity of 3.04 BL/s, focal normal acceleration at a ⊥ = 0, and neighbor position at x i = 7 BL and y i = 1 BL). At a neighbour velocity above the median ( Fig. 2A, left; median speed indicated with a horizontal line at 3.04 BL/S), the focal animal is sensitive to the neighbour orientation, with a high probability of turning right (left) after 1 s when the neighbour is moving away from (toward) the focal, resulting in an alignment of the focal to the neighbour. When the neighbour speed is below the median, however, the focal is attracted towards it regardless of the neighbour orientation. As a contrasting example, consider when the neighbour is closer and slightly in front, at x i = 3 BL and y i = 1 BL ( Fig. 2A, right). In this case, the focal gets repelled by the neighbour when the neighbor speed is below 3 BL/s.
These two examples illustrate how alignment, attraction and repulsion depend not only on the neighbour location but also on its speed (Fig. 2B, similar to Fig. 2A but for a 8x8 matrix of subplots, each for a different neighbour position), and on the speed and acceleration of the focal ( Fig. S10-14).
From this six-dimensional function we can define alignment regions as those where the logit changes sign with neighbour orientation, that is, when focal will turn right (left) if neighbour orients to the right (left) (Fig. 3, gray regions. The alignment score (see Materials and Methods) measures how sensitive the logit is to neighbour orientation (Fig. 3A, gray region; focal speed fixed at median value of 3.04 BL/s and neighbour speed indicated on top of each subplot). The alignment region increases in size and in score with increasing neighbour speed. At high neighbour velocities, strong alignment areas are 2-5 BL behind the focal and 3-5 BL at the sides (Fig. 3A, right, darker gray regions). In a region 5-7 BL behind the focal there is a weak orientation score but reversed in sign, with focal turning right (left) when neighbour orients to the left (right), (Fig. 3A, pink). This anti-alignment region extends when increasing focal speed, while keeping neighbour speed fixed at the median value of 3.04 BL/s (Fig. 3B, pink).
We define attraction (repulsion) regions as those where the logit does not change sign when changing the neighbour angle. Instead, the focal is attracted towards (repelled from) the neighbour's location independently of its orientation. The attraction-repulsion score (see Materials and Methods) measures how positive (attraction) or negative (repulsion) is the logit of turning towards the neighbour (Fig. 3A). Attraction regions shrink with increasing neighbour speed. They are mainly located to the side at 6-8 BL, extending to the back. Repulsion takes place only when the neighbour speed is below the median speed and neighbours are close to the focal (Fig.  3A,B, purple).
However, classifying interactions into only 4 classes is oversimplistic, and a more complete account is captured by the six dimensional pair-interaction function in Fig. 2. For example, when the neighbour is at (x i , y i ) = (3, 1) BL and at high velocity, there is alignment but with a much higher probability of turning left at angles below π/2 than turning right at angles above π/2. This asymmetry in angles makes the sensitivity to orientation to the neighbour a mix of alignment and repulsion. We can see the full extent of relative attraction and repulsion zones by plotting the average over neighbour orientation angles for all points in space ( Fig. 3C for different neighbour speeds and Fig. S15 for different focal speeds). There is an approximately 5 BL diameter region of relative repulsion around the focal. Regions with a mix of alignment and repulsion (or attraction) are those of alignment in Fig. 3A and Fig. 3B that overlap with regions of relative repulsion (attraction) in Fig. 3C and Fig. S15, respectively.

How information is aggregated: shifting between majorities and minorities
The aggregation subnetwork W outputs the (positive) weight of each neighbor in the aggregation. We found it to depend mainly on 4 variables: focal speed v as the asocial variable and neighbor speed v i and relative position of neighbour, x i and y i . In each subplot of Fig. 4A, we give W for Logit of focal fish turning right after 1 s   different neighbour positions, keeping neighbour and focal speed constant. Generally, W is higher for neighbours that are closer to the focal, and lower for neighbours behind the focal. In the upper row of Fig. 4A all subplots have the same focal speed at the median velocity of 3.04 BL/s, and each indicates the neighbour speed on top, with values v i = 1, 2, 4 and 8 BL/s. We see how W increases with neighbour speed for most neighbour positions, implying that faster neighbours carry more weight in the aggregation. This is more pronounced close by and to the side. In the lower row of Fig. 4A, all subplots have the same neighbor speed at the median value and each one indicates above the focal speed. We see how the mass of W increasingly shifts towards the front the faster the focal fish moves. Adding other variables to the attention marginally improves accuracy, and still further insight is gained. When the neighbour orientation angle is added, higher values of the weight W are obtained in positions leading to an immediate collision (Fig. S16). The final impact of each neighbour on the probability of the focal turning right is given by its weight relative to the weights of the rest of neighbours, W (α (w) , σ For example, if all neighbors are assigned the same weight by W , no matter how large or small this value is, the final logit is the average of the individual logits, z = 1 n n i=1 Π(α, σ i ). If one of the neighbours has a higher (lower) value of W , its importance in the integration increases (decreases), while the importance of the other neighbours decreases (increases).
The highest relative weight among all neighbours can be used to quantify the inequality of relative weights at any point in time. We find a wide distribution of values of this inequality score (Fig. 4B). In some cases, all neighbours have similar weights, and the inequality score is small (Fig. 4B, upper left). Most often, a subgroup of neighbours is weighed more, the most common inequality score being just below 0.2 (Fig. 4B, upper right). The distribution of the inequality score has a long tail, making high inequality disproportionately common. In some cases, one neighbour overweighs the others by far (Fig. 4B, lower right), up to the combined weight of all others.

Discussion
Our results show that animals in collectives can use an aggregation rule that naturally allows individuals to shift from simple averaging to following a single individual. This helps to close the gap between models using averages [11,12,15,16] and few individuals [22,23]. Also, it opens the door to study how animals match interactions among themselves to the knowledge distribution in the group [32,26]. Note that our models and its source code can be reused for any species.
From the pairwise interaction in the collective, we could extract attraction, repulsion and alignment as approximate notions [12,13]. Usually, these interaction classes are defined only in terms of relative position of neighbour. However, we found them to exist in a 6-dimensional space. This translates into these classes also depending on speeds of focal and neighbour, focal acceleration and relative orientation between the two fish. Moreover, the three classes are not cleanly separated as alignment regions are mixed with attraction or repulsion.
As a strategy to extract the relevant variables for behavior, we have required them to predict future behavior (like e.g. [33,34,30,16]). This approach has the additional advantage of automatically generating labelled data for supervised training of networks. It can be enriched, at the cost of increasing the model dimensionality, with more information about behavioral history, possible internal variables (parametrized, for example, by time of day or more direct internal measurements), explicit dynamics and posture using reduced variables [35,36,37,38]. A second requirement for our models was that they should work for data not used to obtain the model. These two requirements are standard in machine learning, but not in the study of collective animal behavior.
Our results illustrate how modular deep networks enable flexible data-driven modelling without losing insight. Each module is flexible, with tens of thousands of parameters, but implements a function with low dimensionality in the number of inputs and outputs. Combinations of modules [39,40], two types in the attention network, achieve high compositional complexity that adds flexibility without losing insight. 8

Data availability
60-and 100-fish as well as the new 80-fish videos can be found at www.idtracker.ai. Code is free and open source software ( https://github.com/fjhheras/fishandra)

Animal rearing and handling
Zebrafish, D. rerio, of the wild-type TU strain were raised by the Champalimaud Foundation Fish Platform, according to methods in [41]. Experimental procedures were approved by the Champalimaud Foundation Ethics Committee and the Portuguese Direcção Geral Veterinária in accordance to to the European Directive 2010/63/EU. Handling procedures were as in [27]. We used juveniles of 31-33 days post fertilization.

Videos and tracking
We used 6 videos of 60 and 100 freely swimming juvenile zebrafish from [27], and 3 new videos of 80 juveniles. The camera had a frame rate of 32 fps and 20 Mpx of definition. We obtained all the fish trajectories using idtracker.ai (Fig S1) with an accuracy of 99.95 (mean) ±0.01% (std) [27].

Preprocessing
We interpolated linearly the very small holes in the tracked trajectories (0.027% for 100-fish videos). We normalized trajectories, by translation (center of arena at (0,0)) and scaling (radius of the arena at 1). To reduce noise while preventing contamination by any future information, we smoothed the trajectories using a 5-frame half-Gaussian kernel with σ = 1 frame. We obtained velocity and acceleration by finite differences, using only current and past frames. To avoid direct border effects, we removed datapoints where the focal fish is further away from the center than 80% of the radius. Each video was divided in three parts, to obtain the training, validation and test datasets (97%/2%/1%).
In each video frame, for each individual, we found the n nearest neighbours (I). We then obtained (i) velocity and acceleration of the focal fish, (ii) relative position, absolute velocity and absolute acceleration of the closest n neighbours, (iii) whether the focal fish has turned right or left after N f frames in the future.

Deep networks
We implemented the Deep Networks using TensorFlow through its Python API [42]. We solved the following classification task: Given dynamical properties of a focal fish and its n closest neighbours, does the focal fish turn right or left after 1s? Asocial information is the set of speed and normal and tangential acceleration of the focal, Social information from a neighbour i is its location, velocity and acceleration whose coordinates we calculate in an instantaneous frame of reference that is not moving, which is centered in the focal fish and whose y-axis is co-lineal with the focal fish velocity. Note that v i is the absolute speed, while (x i , y i ) is the relative position of the neighbour, rotated to the frame of reference. In each network, we first obtain the logits z, and then the probabilities by using a logistic function p = 1/(1 + e −z ).

Interaction network
In the interaction network [31], given asocial (α) and social ({σ i , i ∈ I}) information, the logit of turning right is calculated as The function Π I is the pair-interaction subnetwork. We modelled it using a fully-connected network with 3 hidden layers of 128 neurons each, plus a readout layer of 128 neurons. There are rectified linear unit (ReLU, [43]) nonlinearities after each hidden layer (but not after the readout). The outputs of Π I for different neighbours are summed together and transformed by a second function, Γ. We modelled Γ as a fully-connected layer with one hidden layer of 128 neurons, plus a oneneuron readout layer. There are ReLU nonlinearities preceding the whole network and after each hidden layer (but not after the one-neuron readout layer).
To effectively multiply available data by n, we considered all neighbours to be equal. Equivalently, there is symmetry with respect to exchange of neighbour labels. We did not observe any turning side preference. Therefore, to effectively multiply available data by 2, we forced the network to be antisymmetrical with respect to a reflection along the body axis by antisymmetrization of I, where the star superscript represents a reflection along the longitudinal axis of the body, calculated by switching the sign of all x components.

Attention network
Eq. [1] in main text can be rewritten using a notation that compares directly with Eq. [4] in Methods as The function Π A captures the effect of pairwise interactions. It has the same structure as Π I except that its readout layer has only one neuron, and that we antisymmetrise it. W is an attention layer, weighting the logits of the different neighbours. W has the same structure as Π A , except that it accepts as input a y-axis-reflection-invariant subset of the asocial and social variables, and that there is an exponential function after the single-neuron readout signal.

Loss
Following standard procedures in binary classification, when training the network to estimate the probability p i of turning right, we minimised the cross-entropy loss, [43] summed along N b data points in the minibatch and where p * i is the probability given by the network to the actual turn. When the network predicts a right turn with probability p i , p * i = p i if the actual turn was to the right, and p * i = 1 − p i if the actual turn was to the left. We minimise loss using Adam [43]. We stopped training if validation loss did not reach a local minimum for 10 epochs and did increase 25% from the minimum, or after 100 training epochs. In the attention network, we annealed learning rate from 10 −4 to 10 −5 , using a batch size of 500. In the attention network, we annealed learning rate from 5 × 10 −5 to 10 −5 and trained with a batch size of 200. Dropout [43] did not improve accuracy.

Attraction-repulsion and alignment scores
Attraction-repulsion score is obtained as with attraction (repulsion) when the score is positive (negative). Alignment score is obtained as 10   Figure S1: Example video frames. Each fish has been individually tracked by identification using idtracker.ai [27]. Only focal 25 neighbours Figure S3: Predictability (test set) for different times to prediction. The prediction from an interaction model with 25 neighbours (orange) and from a model that is blind to any social information (blue). Predictability for immediate futures (<100 ms) is high for both models, because of correlations in the acceleration. Predictability with 25 neighbours has a local minimum for futures of 250 ms, it has a broad maximum when predicting futures between 1 and 10 s, and then it drops slowly when predicting more distant futures.      Figure S8: Pair interaction and aggregation, obtained from 60-fish videos. A Same as Figure 3A. The most conspicuous difference with Figure 3A is the weakening of anti-alignment B Same as Figure 3B C Same as Figure 4A. D Same as Figure 4B. Note the comparatively weak attention at the back of the focal.  Figure S16: Aggregation with information about relative orientation. Same as Fig. 4A, but when the attention subnetwork is trained with the relative orientation of the neighbour, in addition to the variables used in the main text. A The neighbour is parallel (at 0 degrees) to the focal. B The neighbour is at 45 degrees (towards the right) with the focal, C the neighbour is perpendicular (90 degrees) and pointing to the right of the focal. D The neighbour is antiparallel (180 degrees) to the focal.