Learning anisotropic interaction rules from individual trajectories in a heterogeneous cellular population

Interacting particle system (IPS) models have proven to be highly successful for describing the spatial movement of organisms. However, it is challenging to infer the interaction rules directly from data. In the field of equation discovery, the weak-form sparse identification of nonlinear dynamics (WSINDy) methodology has been shown to be computationally efficient for identifying the governing equations of complex systems from noisy data. Motivated by the success of IPS models in describing the spatial movement of organisms, we develop WSINDy for second-order IPSs to learn equations for communities of cells. Our approach learns the directional interaction rules for each individual cell that in aggregate govern the dynamics of a heterogeneous population of migrating cells. To sort cells according to the active force classes present in their models, we also develop a novel ad hoc classification scheme which accounts for the fact that some cells do not provide enough evidence to accurately infer a model. Aggregated models are then constructed hierarchically to simultaneously identify the different species of cells present in the population and to determine best-fit models for each species. We demonstrate the efficiency and proficiency of the method on several test scenarios, motivated by common cell migration experiments.

1. Introduction. Systems of autonomous agents are ubiquitous in the natural world. Research into their behavior has led to a plethora of proposed mathematical models, including the agent-based "boids" model [48], ordinary differential equation models for milling and flocking [32,16], and nonlocal partial differential equations [40,60], to name a few.
A general framework for rigorous analysis of these models is by now very mature [12].
Identifying the rules of interaction between agents is necessary for predicting and influencing the cooperative abilities of any such system, whether composed of autonomous robots, large multi-cellular animals, single-celled organisms, or even molecules. Methods for inferring the rules of interaction between agents using observed trajectory data have continued to advance since the early 2000s. Several of the principled techniques include force-matching [19,27], linear regression [35,34], mean-field formulations [57,36], information-theoretic tools [46], underdamped Langevin regression [8,9], Gaussian processes [23], and even a method based on topological rather than metric distances [2].
These and related techniques have been successfully used to identify the dominant drivers of collective behavior in a variety of social and biological systems [52,56,1], including schools of fish [67,62], flocks of birds [15,45], and pedestrian traffic [68], all directly incorporating measured trajectory data. While popular methods, such as force-matching, are useful in identifying fields of vision and spatial statistics of interactions, they cannot easily disentangle the combined effects of multiple forces (e.g. attraction, repulsion, and alignment) [41,20], let alone different interactions between multiple species of neighbors. This limits the classes of models they can identify and implies that new methods must be developed for heterogeneous populations.
The field of equation discovery is a highly active area of research [30,42,53,49,10,51,63,33,14], as it offers tools to directly learn governing differential equations. This approach is not only useful in prediction and validation, but can be used to simultaneously identify multiple active modes of inter-agent communication, such as repulsion, velocity alignment, and attraction. In this work, we tackle the problem of identifying governing equations for an interacting particle system (IPS) with multiple interacting species. Our proposed approach is completely naive with regard to species membership in order to specifically address problems of heterogeneity in collective cell migration [59].
Motivated by existing hypotheses regarding the anisotropy of cell-cell interactions [50,26,44], we introduce our framework in the context of directional interaction models, as defined below. Moreover, we note that the documented significance of anisotropic interactions in general collective systems [17,3,70] suggests that our approach may have wide applicability.
1.1. Heterogeneous Populations. Many collective populations arising in nature are inherently heterogeneous, with the rules of interaction varying across different subsets of the population. This is readily observable in complex mammalian populations, but is also seen in simpler organisms, such as honey bee swarms, where bees divide into scout and worker bee roles [55]. The advantages of heterogeneity in collective behavior have even inspired search optimization algorithms [18,28].
At the level of microorganisms, cells have been observed to adopt leader-like and follower-like roles during migration [66], without the aid of a central nervous system. Individual cell speed and persistence of motion have also been determined to be functions of the age and size of the cell [31,6,5], which may lead to heterogeneous responses to stimuli from neighboring cells. The mechanisms which produce these heterogeneities, and the extent to which heterogeneity is present in a given cell population, are current subjects of debate [43,54,25]. Data-driven techniques may be useful in formulating accurate mathematical models in the presence of heterogeneity.
The authors of [69] develop a highly versatile method for inferring explicit rules of interaction in a heterogeneous population, although it is assumed that species membership is known a priori. Several recent works have offered methods of assessing the degree of population heterogeneity [54,52], yet these methods do not provide explicit mathematical models for the different populations. In contrast, the method presented here allows one to classify the given population into different species according to the heterogeneous interaction rules present, and produces explicit mathematical models for each species as a byproduct.

1.2. Directional Interaction Forces.
It is now well-known that simple radial interaction models are incapable of explaining many observed collective behaviors in biological settings, and that directionally-dependent interaction rules, based on a limited field of view or sensing angle, offer a significant advantage [17,15,11,13,21]. At the cellular level, directional dependence of cell-cell interaction has been proposed in the context of intracellular polarization [26]; however, the cellular sensing range is not immediately obvious, since a migrating cell does not have an obvious "field of view". Recent works have sought to quantify the degree to which interactions are density-dependent [7], but not which directional modes (radial, dipolar, quadrupolar, etc.) are dominant during a collective migration event.
In addition to providing an explanation for certain observed phenomena [70], directional interaction rules are capable of generating spontaneous migration, due to the total directional force between particles not being conserved in general. In the modeling of active matter systems (such as migrating cells) [47,57], such symmetry breaking is commonly generated by a combination of Brownian forcing and a self-propulsion device [65]. However, it is not clear that self-propulsion is an appropriate mechanism for modeling cellular movement (in comparison to fish, which are constantly swimming). Directional forces may then be an important mechanism for symmetry breaking and spontaneous cellular migration.
1.3. Weak-Form Sparse Identification of Nonlinear Dynamics. At its core, our method involves learning ordinary differential equations for cells using available trajectory data. For this we employ the weak-form sparse identification of nonlinear dynamics algorithm (WSINDy), which has been shown to successfully identify governing equations from data at the levels of ordinary differential equations [38], partial differential equations [37], and 1st-order interacting particle systems [36], and even works in a small-storage streaming scenario [39].
A significant advantage of the WSINDy method is that it identifies a single governing equation which can be analyzed and simulated using conventional techniques of applied mathematics. It does not involve any black-box algorithms or mappings as would be generated by a neural network-based approach (see [4] for an example where the authors first learn a neural network model of the potential and then use sparse identification to learn the algebraic form of the potential).

Several alternative methods have been developed to accomplish the equation learning task for particle systems. In particular, Lu et al. [34] develop a method for learning general feature-dependent 2nd-order interaction rules for heterogeneous populations, where features may include directional interaction forces, speed dependence, and so on. The differences between that approach and our work are the following. (i) We are performing the unsupervised learning task of classifying agents by their interaction rules, whereas Lu et al.'s work assumes knowledge of the species membership. (ii) We are interested in sparse model representations, in particular selection of the correct modes of interaction (e.g. attraction, repulsion, alignment, and drag), whereas Lu's work assumes knowledge of both the feature-dependence and the types of forces present (e.g. for planetary systems, a priori knowledge is used to rule out the presence of an alignment force). (iii) Lastly, our models are initially extracted from single-cell trajectories: as described in the next section, rather than aggregating data which may come from multiple cell species, we aggregate models which are likely to describe the same species, and then use the aggregate model to perform classification.

1.4. Single-Cell Learning and Model Clustering. With a possibly heterogeneous population of cell trajectories available, one is tasked with the problem of deciding how to aggregate the data. If knowledge of the underlying species membership is available, a more accurate model can be inferred by pooling data from all individuals of a given species. On the other hand, pooling data from multiple species into a single model can result in a highly inaccurate model, depending on the difference between the interaction rules of each species.
In general, there exists a spectrum of possible pooling strategies, ranging from learning few models from large subsets of the population, to learning many models from small subsets of the population. The former intrinsically produces models with high bias and low variance, while the latter produces models with low bias and high variance. Such pooling strategies have been recently explored in [22], where it is found that identification of a single model can be improved by pooling models learned from subsets of the data. However, this has not been extended to classifying the data itself into species, and finding a model for each species. Moreover, the IPS setting offers a particular advantage on the subject of model validation, as data can easily be assimilated into forward simulations.
In this work we investigate the extreme case of learning an individual model M_i for the i-th individual trajectory, and then clustering the set of learned models M := {M_1, ..., M_N} according to their identified modes of interaction. This approach is counter-intuitive because there is no guarantee that a single cell trajectory will provide enough information on the interaction rules of its species. To be able to classify cells using the (potentially) insufficiently informative trajectories, we developed an ad hoc recursive classifier which we show (in Section 4) accurately clusters and sorts the models into species. This approach prevents any contamination that may result from combining trajectories of multiple species.
Once the models are clustered, an aggregate model M is computed by averaging the models in M belonging to the most populous cluster. The model M is then used to classify cells via forward simulations, which are made highly efficient by directly incorporating the data. In particular, for each trajectory in the dataset, we use M to simulate a new trajectory, but with all neighbor interactions computed using the data. That is, only the new trajectory is propagated forward in time by the model M, while the rest of the population is simply the data itself. This can then be trivially parallelized, reducing an O(N^2) computational cost per timestep to N cores performing O(N) updates per timestep with no communication overhead. We show through examples below that this hierarchical model-pooling and validation procedure produces both correct species classification and accurate governing equations, despite individual cell trajectory data carrying low levels of information. For further information on the classification algorithm, see Section 3.

1.5. Paper outline. In Section 2 we discuss the general form of directional interacting particle models that will be assumed in the learning process. In Section 3 we introduce our model selection and classification algorithm, which is composed of the five steps: (a) learn single-cell models, (b) cluster learned models according to force modes, (c) form an aggregate model by averaging models in the largest cluster, (d) validate the aggregate model using data-driven forward simulations, and (e) classify cells according to performance under the aggregate model. In Section 4 we examine the performance of the algorithm in learning and classifying homogeneous and heterogeneous populations of one, two, and three species. We discuss possible next directions in Section 5. Some additional information and a summary of notation are included in the appendix.
2. Directional Interacting Particle Models. We use a general 2nd-order directional interaction model framework, where the position and velocity (x_i, v_i) of cell i are governed by the differential equations

(2.1)    ẋ_i = v_i,    v̇_i = (1/N_tot) Σ_{j≠i} [ f_{a-r}(r_ij, θ_ij) (x_i − x_j)/r_ij + f_align(r_ij, θ_ij) (v_j − v_i) + f_drag(‖v_i‖, θ_ij) v_i/‖v_i‖ ] + η_i,

with r_ij := ‖x_j − x_i‖. Here, θ_ij is the angle between v_i and x_j − x_i (see the diagram in Figure 2.1) and (η_i)_{i∈[N_tot]} are white noise processes. The attractive-repulsive force f_{a-r}, alignment force f_align, and drag force f_drag define the rules by which cell i communicates with the rest of the population. Our primary objective is to identify a set of interaction rules {(f^ℓ_{a-r}, f^ℓ_align, f^ℓ_drag)}_{1≤ℓ≤S}, one for each of the S species present in the population.
2.1. Directionality θ_ij. As mentioned above, directional variation in the interaction forces between cells can arise from various factors, including intracellular polarization, heterogeneous distribution of membrane-bound receptors, asymmetry in the protrusion/contraction of lamellipodia as the cell crawls on the substrate, and so on. In the current study, we assume that each of these effects is unobservable; hence we model the aggregate directional effect using the angles θ_ij, depicted in Figure 2.1. Dependence on the angle θ_ij allows for interactions between cell i and cell j to vary depending on the direction of motion. Put another way, in the reference frame of cell i, the polar coordinates (r_ij, θ_ij) allow one to represent any interaction force that varies over the 2D plane.
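For concreteness, here is a minimal sketch (in Python, used purely for illustration; the function name and the degenerate-case convention are ours) of how θ_ij can be computed from tracked positions and velocities:

```python
import numpy as np

def theta_ij(v_i, x_i, x_j):
    """Angle in [0, pi] between cell i's velocity v_i and the relative
    position x_j - x_i: 0 means cell j lies directly ahead of cell i,
    pi means directly behind."""
    r = x_j - x_i
    nv, nr = np.linalg.norm(v_i), np.linalg.norm(r)
    if nv == 0.0 or nr == 0.0:
        return 0.0  # convention for degenerate (zero-velocity/overlap) cases
    c = np.clip(np.dot(v_i, r) / (nv * nr), -1.0, 1.0)  # guard against rounding
    return float(np.arccos(c))
```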
In this study we restrict the angular dependence to {1, cos(θ_ij), cos(2θ_ij)}, which allows for a combination of radial, dipolar, and quadrupolar interactions (see Figure 4.1 for examples of the dipolar (right) and quadrupolar (left) forces used in this study). Higher-order directionality can usually be assumed to be negligible; however, extension to higher modes is straightforward.
2.2. Attractive-repulsive force f_{a-r}. The interaction force f_{a-r} acts along the axis connecting cell i and cell j and captures short-range repulsion and long-range signaling. Many IPS models include only an attractive-repulsive force, due to its extensive pattern-forming capabilities [61,24].
We impose the following natural constraints on f_{a-r}:

(2.2)    f_{a-r}(r, θ) ≥ 0 for 0 ≤ r < r_nf,    f_{a-r}(r, θ) ≤ 0 for r ≥ r_ff,    f_{a-r}(r, ·) ∈ span{1, cos(θ), cos(2θ)},

where r_nf is the near-field threshold and r_ff is the far-field threshold, i.e., a large distance.
The first inequality enforces that f_{a-r} is near-field repulsive, which must be true by volume exclusion. In practice we define r_nf implicitly by

    ℙ(r_ij < r_nf) = p_nf,

where in this work we set p_nf = 0.001, and the dataset is used to compute the probability over observed neighbor-neighbor distances r_ij. This states that the near-field region is defined by short-distance interactions which account for less than 0.1% of observed neighbor-neighbor distances.
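A sketch of this empirical definition of r_nf, assuming position data stored as an (L, N, d) array (the array layout and function name are our own):

```python
import numpy as np

def near_field_threshold(X, p_nf=0.001):
    """r_nf as the p_nf-quantile of all observed neighbor-neighbor
    distances, so that r < r_nf accounts for a fraction p_nf of them."""
    dists = []
    for Xk in X:  # loop over the L timepoints
        D = np.linalg.norm(Xk[:, None, :] - Xk[None, :, :], axis=-1)
        iu = np.triu_indices(len(Xk), k=1)  # distinct pairs only
        dists.append(D[iu])
    return float(np.quantile(np.concatenate(dists), p_nf))
```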
The second inequality enforces long-range decay, as well as model stability. Decay is natural since interactions can be expected to be small outside of some large distance r_ff. We enforce that interactions are attractive at large distances (allowing for decay as well), so that in simulation the particles do not spread to infinity. We set r_ff = 1 throughout, although r_ff can easily be chosen from the data (e.g. r_ff = 50 r_nf corresponds to an effective interaction range of 50 cell radii).
(See Appendix A for the resulting values of r_nf and r_ff and other hyperparameters for the examples below.)
The third set inclusion simply reiterates the assumptions on directionality described above.

2.3. Alignment force f_align. The alignment force f_align captures cells' tendency to match the velocity of neighboring cells. There are many theories as to how this arises physically [58,50]. Perhaps protrusions from cells inform the cell about the bulk direction of motion, which would be a very local effect. However, alignment models which have been proven to lead to flocking depend on sufficiently long-range alignment [16]. We impose the following constraints on f_align:

(2.3)    f_align(r, θ) ≤ 0,    f_align(r, ·) ∈ span{1, cos(θ), cos(2θ)}.

The first inequality enforces that f_align is non-positive, which is necessary for the constant-velocity state to be a stable configuration. If not, small perturbations away from v_i = v_j result in cells accelerating away from each other, which is a redundant force given that cells can be pushed away from each other through f_{a-r} (it is also not hard to see that f_align > 0 is unphysical). The second constraint restricts the alignment force to be a combination of radial, dipolar, and quadrupolar modes, similar to f_{a-r}.
2.4. Drag force f_drag. The drag force f_drag captures energy expenditure due to general resistance to motion (resulting e.g. from substrate roughness); however, we allow an angular dependence on θ_ij to capture possible decreases or increases in drag depending on the local neighbor distribution. For this we impose the following constraints:

(2.4)    f_drag(s, θ) ≤ 0,    f_drag(s, ·) ∈ span{1, cos(θ)},

where s indicates the speed of the cell. The force f_drag is chosen to be negative so that cells do not have a "self-propulsion" device. As mentioned previously, many models of active matter include self-propulsion as a partial mechanism for symmetry breaking and general non-equilibrium effects. To reiterate, we do not expect cells to have a self-propulsion device; in fact, we wish to learn how migration occurs spontaneously, incited by pairwise interactions. In addition, a positive drag force leads to populations spreading outside of the range of meaningful interactions. In this way, negative drag is computationally beneficial, as it leads to improved model stability.
3. Algorithm. Our algorithm involves first learning an ensemble of directional force models M = {M_1, ..., M_N}, that is, one model for each of the N focal cells selected for learning. We next group models into clusters C := {C_1, ..., C_r} and compute an aggregate model M from the largest cluster, denoted by C. A new species S is then identified, with membership in S determined by the accuracy of data-driven forward simulations of the model M. Cells in the species S are then removed from the population, and the remaining cells are returned to the clustering phase. More explicitly, the algorithm is composed of the following steps, which are visualized in Figure 3.1:
(a) identify individual cell models M = {M_1, ..., M_N} using the WSINDy algorithm;
(b) partition the learned models M into clusters {C_1, ..., C_r} according to active force modes;
(c) identify the largest cluster C and average together the models in C to arrive at a single model M;
(d) simulate the model M separately for each remaining unlabeled cell, using the data to calculate neighbor interactions;
(e) classify cells based on simulation error under M and label the lowest-error class as the new species S;
(f) remove cells in S from the population and return to step (b).

To summarize, the result is a set of S models and species, where each model M^ℓ is constructed from an average of individual models within a cluster. We use the notation M_i for the model of the i-th cell, C_j for the j-th cluster of models, and S^ℓ for the set of cells identified as the ℓ-th species, which obey model M^ℓ.

We now give more detail on steps (a)-(f), including stopping criteria, leaving the more technical aspects to the appendix.
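Schematically, the outer loop of steps (a)-(f) reads as follows. This is a Python sketch with placeholder callbacks (learn, cluster, aggregate, validate, classify) standing in for the components detailed in Sections 3.1-3.5, and with a simplified stand-in for the stopping criteria:

```python
def classify_population(cells, learn, cluster, aggregate, validate, classify):
    """Steps (a)-(f): learn single-cell models once, then repeatedly
    cluster, aggregate, validate, and peel off one species per pass."""
    models = {i: learn(i) for i in cells}                       # (a)
    species, unlabeled = [], set(cells)
    while unlabeled:
        clusters = cluster({i: models[i] for i in unlabeled})   # (b)
        M_bar = aggregate(max(clusters.values(), key=len))      # (c)
        errors = {i: validate(M_bar, i) for i in unlabeled}     # (d)
        new_species = classify(errors)                          # (e)
        if not new_species:  # simplified stopping rule (ours)
            break
        species.append((M_bar, set(new_species)))
        unlabeled -= set(new_species)                           # (f)
    return species
```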
3.1. Learning single-cell models. The first step in the algorithm is to learn an ensemble of single-cell models, one for each of the N focal cells selected from the N_tot total cells tracked during the experiment. By "single-cell" model, we mean that the scope of each model is limited to learning only the dynamics of its focal cell; however, data from the remaining tracked cells are incorporated to learn the interaction forces.
WSINDy. The main ingredient in learning the single-cell models in M is the WSINDy algorithm, together with careful choices for the bases used to represent the three main forces f_{a-r}, f_align, and f_drag. Each cell i is identified by a position and velocity (x_i(t), v_i(t)) which we assume is well-approximated by a 2nd-order model of the form (2.1). The dynamics take the general form

(3.1)    ẍ_i(t) = F_i(X(t), V(t)),

where (X(t), V(t)) ∈ R^{2dN_tot} denotes the entire population of positions and velocities in the colony at time t. We then assume that we have available a dataset of positions X := (X(t_k))_{k=1}^L sampled from the system at L timepoints. Our goal is to identify F_i using X.

The SINDy approach involves representing F_i as a sparse linear combination of basis elements Θ(X, V) := (θ_1(X, V), ..., θ_J(X, V)). The available cell position data X is used to approximate velocities V̂ ≈ Ẋ and accelerations Â ≈ Ẍ, using e.g. finite differences, leading from (3.1) to the data-driven linear system

    Â_i ≈ Θ(X, V̂) w^(i).

With some abuse of notation, we denote by Θ(X, V̂) the matrix that results from evaluating the basis Θ(X, V) at the time-series data (X, V̂). The entries are Θ(X, V̂)_{(k,d),j} = (θ_j(X(t_k), V̂(t_k)))_d, where the subscript d indicates the spatial coordinate (d ∈ {1, 2} in this study). The basis Θ is chosen by the user and determines the accuracy of the learned model as well as the conditioning of the WSINDy algorithm.

The data X is often corrupted by measurement noise, which leads to inaccurate computations of the derivatives V̂. For the current setting, the standard SINDy approach just outlined requires second-order derivatives Â, which are even less accurate to compute from noisy data. To prevent some of the corruption from noise, we can use the weak form, which leads to WSINDy. Returning to equation (3.1), we convert to the weak form by multiplying by a test function φ(t) and integrating in time, where the inner product ⟨•, •⟩ denotes the time integral ⟨φ, g⟩ := ∫ φ(t)g(t) dt. Choosing φ to be twice differentiable and zero outside of some interval (a, b), we then integrate by parts twice on the left-hand side to arrive at

    ⟨φ̈, x_i⟩ = ⟨φ, F_i(X, V)⟩,

so that the second derivative has been removed from x_i and placed on φ. Choosing a basis of test functions Φ := (φ_q)_{1≤q≤Q}, we build the weak-form linear system

(3.4)    G^(i) w = b^(i),    b^(i)_q := ⟨φ̈_q, x_i⟩,    G^(i)_{q,j} := ⟨φ_q, θ_j(X, V̂)⟩.

As well as choosing Θ, in order to compute (G^(i), b^(i)) we need to compute V̂ from the position data X, choose a test function basis Φ, and discretize the integrals appearing in the linear system. For simplicity, we compute V̂ using 2nd-order centered finite differences, although a number of methods exist for numerical differentiation from data [29,64]. For integration, we use the trapezoidal rule, and we use test functions of the form

    φ_q(t) = C [(t − (t_q − m∆t)) ((t_q + m∆t) − t)]^p for t ∈ (t_q − m∆t, t_q + m∆t), with φ_q(t) = 0 otherwise,

for shape parameters m and p and timestamps t_q in the range of the available time series. We refer to [38,37] for methods of choosing (m, p, t_q) and for accuracy and robustness results concerning the trapezoidal rule combined with this particular class of test functions. In this work, we use the changepoint algorithm in [37] with τ = 10.

Regression. We solve the linear system (3.4) for coefficients w^(i) by approximately solving the following constrained sparse regression problem:

(3.6)    w^(i) ∈ argmin_{w : Cw ≤ d} ‖G^(i)w − b^(i)‖_2^2 + λ^2 ‖w‖_0.

The linear inequality constraint Cw ≤ d encodes the constraints listed in (2.2), (2.3), and (2.4) on the forces f_{a-r}, f_align, and f_drag, and λ is the sparsity threshold. We employ the modified sequential thresholding algorithm from [37,36], with least-squares iterations replaced by solving the associated linearly-constrained quadratic program (see Appendix C).
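A sketch of the weak-form assembly for one coordinate of one focal cell, assuming a uniform time grid; the shape parameters passed as defaults are illustrative only (the paper selects (m, p, t_q) via the changepoint algorithm of [37]), and all names are ours:

```python
import numpy as np

def phi_and_ddphi(t, a, b, p):
    """phi(t) = ((t - a)(b - t))^p on (a, b), zero outside, and its second
    derivative; p >= 2 makes the integration-by-parts boundary terms vanish."""
    g = (t - a) * (b - t)
    gp = a + b - 2.0 * t  # g'(t); note g''(t) = -2
    inside = (t > a) & (t < b)
    phi = np.where(inside, g**p, 0.0)
    ddphi = np.where(inside, p*(p - 1)*g**(p - 2)*gp**2 - 2.0*p*g**(p - 1), 0.0)
    return phi, ddphi

def weak_form_system(t, x_i, Theta, m=10, p=4, Q=20):
    """Assemble (G, b) of (3.4): b_q = <ddphi_q, x_i>, G_{q,j} = <phi_q, theta_j>,
    using trapezoidal-rule inner products. Theta is the (L, J) library matrix."""
    dt = t[1] - t[0]
    centers = np.linspace(t[m], t[-m - 1], Q)  # t_q well inside the series
    G, b = np.zeros((Q, Theta.shape[1])), np.zeros(Q)
    for q, tq in enumerate(centers):
        phi, ddphi = phi_and_ddphi(t, tq - m*dt, tq + m*dt, p)
        b[q] = np.trapz(ddphi * x_i, t)
        G[q] = np.trapz(phi[:, None] * Theta, t, axis=0)
    return G, b
```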
Since the coefficients w^(i) have no a priori absolute magnitude, we threshold only on the magnitude of each term relative to the response vector b^(i); namely, we define the thresholding operator H_λ(w) entrywise by

(3.7)    (H_λ(w))_j = 0 if ‖G^(i)_j‖ |w_j| < λ ‖b^(i)‖, and (H_λ(w))_j = w_j otherwise,

where G^(i)_j denotes the j-th column of G^(i).
The constrained sequential thresholding algorithm then consists of alternating between (I) solving (3.6) with λ = 0 for w̃ with the additional constraint supp(w̃) ⊂ supp(w^(ℓ)), and (II) setting w^(ℓ+1) = H_λ(w̃). A sweep over λ values in the range 10^{-4} to 10^0 is performed according to [38,36] to choose an appropriate threshold λ.
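A sketch of this loop, with plain (unconstrained) least squares standing in for the linearly-constrained quadratic program that the paper solves with MATLAB's quadprog (see Appendix C):

```python
import numpy as np

def sequential_threshold(G, b, lam, n_iters=25):
    """Modified sequential thresholding: alternate restricted least squares
    with the relative-magnitude thresholding of (3.7). The constraints
    Cw <= d are omitted here for brevity."""
    J = G.shape[1]
    support = np.ones(J, dtype=bool)
    col_norms, b_norm = np.linalg.norm(G, axis=0), np.linalg.norm(b)
    w = np.zeros(J)
    for _ in range(n_iters):
        w = np.zeros(J)
        w[support] = np.linalg.lstsq(G[:, support], b, rcond=None)[0]
        keep = col_norms * np.abs(w) >= lam * b_norm  # H_lambda of (3.7)
        if np.array_equal(keep, support):
            return w  # converged: all surviving terms exceed the threshold
        support = keep
    return w * support  # zero out thresholded terms if not converged
```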
Trial Basis Functions. In the case of the directional force model (2.1), we require three bases F_{a-r}, F_align, and F_drag for the three proposed forces f_{a-r}, f_align, and f_drag. We seek a sparse model, and so choose global basis functions, rather than a model composed of a large sum over spatially-localized basis functions.
For the attractive-repulsive basis F_{a-r} we choose products of cosines and scaled, weighted Laguerre polynomials,

    F_{a-r} = { cos(kθ) e^{−αr/2} p_ℓ(αr) : k ∈ {0, 1, 2}, 0 ≤ ℓ ≤ ℓ_max },

for the ℓ-th degree Laguerre polynomial p_ℓ. The scale α is chosen from r_max, the maximum observed distance between cells, such that e^{−α r_max/2} = ε_mach ≈ e^{−36}. We set α = 36 in all cases below since r_max ≈ 2.
The pattern of attractive and repulsive regions of the force f_{a-r} is not known a priori, hence the Laguerre basis offers flexibility. The choice of weighted Laguerre polynomials (with weight ω(r) = e^{−r/2}) is guided by the orthogonality relation

    ∫_0^∞ (ω(r) p_ℓ(r)) (ω(r) p_{ℓ'}(r)) dr = δ_{ℓℓ'}.

We find F_{a-r} leads to a well-conditioned matrix G despite orthogonality not holding with respect to the data distribution.
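A sketch of the resulting library evaluation (the number of radial degrees is illustrative; Laguerre.basis is NumPy's degree-ℓ Laguerre polynomial):

```python
import numpy as np
from numpy.polynomial.laguerre import Laguerre

def ar_basis(r, theta, alpha=36.0, n_radial=5):
    """Columns cos(k*theta) * exp(-alpha*r/2) * p_l(alpha*r) for angular
    modes k in {0, 1, 2} and Laguerre degrees l = 0, ..., n_radial - 1."""
    r, theta = np.asarray(r), np.asarray(theta)
    cols = []
    for k in (0, 1, 2):
        ang = np.cos(k * theta)
        for l in range(n_radial):
            p_l = Laguerre.basis(l)(alpha * r)  # degree-l Laguerre polynomial
            cols.append(ang * np.exp(-alpha * r / 2.0) * p_l)
    return np.stack(cols, axis=-1)
```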
For the alignment force we choose a basis of shifted cosines and exponential functions, which is informed by the fact that f_align must be negative. This is easily controlled with F_align by simply enforcing that the coefficients w_align be negative. For the same reason, we choose the drag force from a basis of monomials and cosines, as this can also easily be controlled to yield an overall negative force f_drag by constraining only the coefficients w_drag.
Moreover, monomials capture the physical assumption that resistance to motion should increase with speed.

3.2. Cluster. Once the ensemble of models M has been learned, each model is validated on a small neighborhood of cells, and models with poor performance are replaced by a neighboring model with better performance, if one exists. This step can be seen as a form of "cross-pollination" that greatly increases the chance of finding accurate models and eliminating inaccurate models. Details on this step can be found in Appendix D.
Models are then clustered according to the force modes present. Specifically, using the bases above, we can expand each force according to its distinct directional modes:

    f_{a-r}(r, θ) = f^(0)_{a-r}(r) + cos(θ) f^(1)_{a-r}(r) + cos(2θ) f^(2)_{a-r}(r),
    f_align(r, θ) = f^(0)_align(r) + cos(θ) f^(1)_align(r) + cos(2θ) f^(2)_align(r),
    f_drag(s, θ) = f^(0)_drag(s) + cos(θ) f^(1)_drag(s).

This leads to eight possible force modes, which we order as follows: f^(0)_{a-r}, f^(1)_{a-r}, f^(2)_{a-r}, f^(0)_align, f^(1)_align, f^(2)_align, f^(0)_drag, f^(1)_drag. We associate the sparsity pattern of the force modes with the set of all 8-bit codes, giving a total of 2^8 = 256 possible clusters.

Other options for model replacement and clustering include clustering based on the sparsity pattern of w, or simply on the presence or absence of each of the three forces f_{a-r}, f_align, f_drag. The former significantly increases the number of possible clusters, while the latter leads to just 8 possible clusters. Our choice reflects the desire to disentangle the directionalities of the governing forces without introducing a strong dependence on the bases used to approximate each force.
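A sketch of the mode-based clustering; the mapping from library coefficients to the eight modes depends on the basis layout, so mode_slices is left as an input, and all names are ours:

```python
import numpy as np
from collections import defaultdict

def force_mode_code(w, mode_slices, tol=0.0):
    """8-bit code recording which directional force modes are active in a
    learned coefficient vector w; mode_slices[m] holds the indices of the
    coefficients belonging to mode m."""
    return tuple(int(np.any(np.abs(w[idx]) > tol)) for idx in mode_slices)

def cluster_by_modes(W, mode_slices):
    """Group learned models (rows of W) by their active-mode bitcode."""
    clusters = defaultdict(list)
    for i, w in enumerate(W):
        clusters[force_mode_code(w, mode_slices)].append(i)
    return dict(clusters)
```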

3.3. Aggregate. Having formed the model clusters C, let C be the cluster with the most members. We then compute M by averaging over the coefficients of the models in C; that is, we compute

    w̄ = (1/|C|) Σ_{i : M_i ∈ C} w^(i),

where |C| denotes the number of elements in C. This ensemble average preserves the force modes that identify C. As explored in [22], in some cases it may be more appropriate to use the coefficient median, or take a weighted average. We leave these for future work.
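In code this step is a one-liner over the largest cluster; note that the mean automatically preserves mode-level sparsity, since modes inactive across the entire cluster have identically zero coefficients:

```python
import numpy as np

def aggregate_model(W, clusters):
    """Average coefficient vectors over the most populous cluster."""
    largest = max(clusters.values(), key=len)
    return np.mean([W[i] for i in largest], axis=0), largest
```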

3.4. Validate. To validate the aggregate model M, we perform forward simulations over the remaining unclassified cells in a highly parallelizable way that utilizes the experimental data to efficiently march forward in time.
Let N' ≤ N be the number of remaining unclassified cells. For each i = 1, ..., N', we simulate a new trajectory {(x̃_i(t_k), ṽ_i(t_k))}_{k=1}^L using the averaged model M with the experimental initial conditions (x̃_i(0), ṽ_i(0)) = (x_i(0), v_i(0)). We march forward in time according to the forward Euler update

    x̃_i(t + ∆t') = x̃_i(t) + ∆t' ṽ_i(t),    ṽ_i(t + ∆t') = ṽ_i(t) + ∆t' F(x̃_i(t), ṽ_i(t); X_{-i}(t), V_{-i}(t)),

where F denotes the force law of the model M and (X_{-i}(t_k), V_{-i}(t_k)) indicates (X(t_k), V(t_k)) with the i-th cell removed. Since the time resolution ∆t of the data is assumed to be coarse, we perform the simulation on a finer grid with timestep ∆t' = 2^{-5} ∆t, and use piecewise cubic Hermite interpolation to generate positions and velocities of neighbors (X_{-i}, V_{-i}) at intermediate times. We stress that we do not update the neighbor cells using the model M, which would be much more costly; we merely use neighbor positions and velocities from the data to compute the interactions that govern the motion of cell i. The resulting trajectories {(x̃_i, ṽ_i)}_{i=1}^{N'} can then be trivially computed in parallel.
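A sketch of one such data-driven simulation, assuming uniformly sampled data arrays X, V of shape (L, N, d) and a callback rhs implementing the learned force law of M (SciPy's PCHIP provides the piecewise cubic Hermite interpolation; all names are ours). The returned error implements (3.11) below:

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def data_driven_simulation(t, X, V, i, rhs, refine=32):
    """Forward Euler for cell i under the aggregate model, holding all
    neighbors to their measured (interpolated) trajectories."""
    neighbors = np.delete(np.arange(X.shape[1]), i)
    Xn = PchipInterpolator(t, X[:, neighbors], axis=0)  # cubic Hermite
    Vn = PchipInterpolator(t, V[:, neighbors], axis=0)
    tf = np.linspace(t[0], t[-1], refine * (len(t) - 1) + 1)  # dt' = dt/refine
    dt = tf[1] - tf[0]
    x, v = X[0, i].copy(), V[0, i].copy()  # experimental initial conditions
    vels = [v.copy()]
    for tk in tf[:-1]:
        a = rhs(x, v, Xn(tk), Vn(tk))  # neighbor interactions from the data
        x, v = x + dt * v, v + dt * a
        vels.append(v.copy())
    k = len(t) // 4  # compare over the first quarter of the series (L')
    v_sim = np.array(vels)[::refine][:k]  # subsample back to the data grid
    return np.linalg.norm(v_sim - V[:k, i]) / np.linalg.norm(V[:k, i])
```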
We then define the validation error for cells i = 1, ..., N' as the relative velocity difference

(3.11)    ∆V_i = ( Σ_{k=1}^{L'} ‖ṽ_i(t_k) − v_i(t_k)‖^2 / Σ_{k=1}^{L'} ‖v_i(t_k)‖^2 )^{1/2},

where L' ≤ L is a subset of the time series over which the simulation is expected to remain accurate. In this work we choose L' = 0.25L. With L = 200 timesteps in the examples below, this results in a comparison with the data over the first 50 timesteps at the original scale ∆t, or equivalently 1600 timesteps on the finer scale ∆t'.

3.5. Classify. Let VE be the set of validation errors, VE = {∆V_1, ..., ∆V_{N'}}. An empirical observation used in this work is that when M approximates well an underlying model for a true species, the log-transformed validation errors log_10(VE) are fit well by a Gaussian mixture model (GMM) with two mixtures (see Figures 4.4, 4.7, 4.9). We thus use a 2-mixture GMM to classify the remaining cells. Cells are granted membership into the mixture that yields the highest posterior probability of generating their log-validation error, and the class with the lowest mean error is labeled as a species. This can be thought of as a sequential binary classification scheme. For example, in each plot of Figure 4.4, a representative GMM resulting from a two-species population, the left-most mixture corresponds to low validation errors under the model M and is classified as a species S (in this case, species C in Table 1 is identified). The cells in S are subsequently removed from the population, and cells in the right-most mixture are returned to the cluster phase (b).

The classification loop terminates according to three stopping conditions. The first case is an obvious criterion to prevent infinite looping over outlier cells for which there is not enough information to learn an adequate model. The second condition skips the GMM fitting process when all cells have sufficiently low error, and directly assigns all unlabeled cells to a new species. This is necessary to account for cases of high-accuracy recovery, where it is observed that VE is no longer approximately log-normal; for example, this is observed in the rightmost plot of Figure 4.3. Finally, for N very large, it may be necessary to restrict the total number of species, which is encapsulated in the third condition, although we did not observe the number of iterations exceeding 5 in any trials with N ≤ 1000 and S ≤ 3 true species.
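A sketch of the GMM classification step on the log validation errors, using scikit-learn's GaussianMixture as a stand-in for the paper's MATLAB implementation:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def classify_errors(dV):
    """Fit a 2-mixture GMM to log10 validation errors (dV: dict cell -> error)
    and return the cells in the lower-mean mixture as the new species."""
    cells = sorted(dV)
    logs = np.log10([dV[i] for i in cells]).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, n_init=5).fit(logs)
    labels = gmm.predict(logs)  # highest-posterior mixture per cell
    low = int(np.argmin(gmm.means_.ravel()))  # mixture with lower mean error
    return [c for c, lab in zip(cells, labels) if lab == low]
```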
4. Numerical Experiments. We demonstrate the method on simulated populations of species A, B, and C. Each species is defined by a unique combination of the forces depicted in Figure 4.1, with the combinations and force definitions specified in Table 1. The forces include a quadrupolar attractive-repulsive force f_{a-r}, a dipolar alignment force f_align, and a monopolar drag force f_drag which is linear in its speed argument. As we will see below, species A and species B share the force f_{a-r} and hence result in similar dynamics, which presents a challenge to identification; it turns out that using a longer time series results in correct classification.
We let X_P denote a simulation with individuals from species P; for example, X_A is a simulation with only individuals from species A, and X_{A,B} is a simulation with a mixed population of species A and species B. Each simulation has 1000 individuals, split evenly across the species present (up to rounding). More details on the simulations can be found in the appendix.
We are particularly interested in three traits of our learning algorithm: (1) was the classification successful? (2) are the learned forces close to the true forces? (3) are simulated trajectories using the learned model close to the original trajectories? To assess (1), we report the classification success CS(i) for i ∈ {A, B, C} as the fraction of individuals from species i that ended up in the cluster in question, where clusters are listed as subrows (rows not separated by horizontal lines) within each row in Tables 2, 3, and 4, in the order they were identified in step (e) of the algorithm. For example, in row 2 of Table 3, two clusters are identified from the two-species data X_{A,C}, with the first cluster containing 100% of the species C cells, indicated by CS(C) = 1.000, and the second cluster containing 100% of the species A cells, indicated by CS(A) = 1.000, with no outliers.
To assess the accuracy of learned forces with respect to the ground-truth forces, for each of the three forces we report the relative L^2 error over r ∈ [0, 2] (since r_max ≈ 2 for all examples here) and θ ∈ {0, π/4, π/2, 3π/4, π}, denoting this by ∆f_{a-r}, ∆f_align, and ∆f_drag.
Lastly, we assess the difference in learned and true trajectories using the average validation error ∆V̄ = (1/|S^ℓ|) Σ_{i∈S^ℓ} ∆V_i, where ∆V_i is computed from the model M^ℓ associated with identified species S^ℓ using (3.11).

4.1. Homogeneous Populations. As an initial benchmark we detect single-species populations from homogeneous data. While simpler than the heterogeneous case, this is a nontrivial task due to the variability of single-cell trajectories and local environments within the population. Our method successfully identifies the models for species A, B, and C from homogeneous simulations, achieving less than 1% mean validation errors in each case, and less than 4% relative force errors ∆f (Table 2). In simulation X_B, three cells are identified as outliers (appearing in the right tail of Figure 4.3 (middle)), and all other cells in X_A, X_B, and X_C are correctly classified. A comparison between original and learned trajectories is depicted in Figure 4.2, with learned trajectories overlapping original trajectories in each case.

4.2. Two-Species Populations. Next we examine the ability of the learning algorithm to detect two-species populations along with accurate aggregate models. Figure 4.4 displays two representative Gaussian mixture fits to the log-validation errors for X_{A,C} (left) and X_{B,C} (right). In both cases, the log-errors are well-approximated by Gaussian mixtures with wide separations between mixtures. This allows for complete classification in both cases, as indicated by CS(A), CS(B), and CS(C) in rows 2 and 3 of Table 3. Force differences ∆f are less than 5% in each case, with trajectory validation errors less than 3.5%. In particular, species C achieves less than 0.3% validation error, which is due to the true force f_align existing in the learning library, whereas f_{a-r} is approximated using a truncated series expansion, resulting in larger errors. See Figures 4.5 and 4.6 for comparisons between original and learned trajectories.

For experiment X_{A,B}, initially the method is incapable of correctly classifying cells into species A and species B: three clusters are identified with sub-optimal models (Table 3 row 4). Accurate classification is achieved by running the algorithm on a longer experiment X_{A,B} (long) (Table 3 row 5), which is the continuation of X_{A,B} for twice the total time points, at the same temporal resolution. An initial cluster is identified containing 97.8% of species B along with 0.2% (a single cell) of the existing species A cells, followed by a second cluster with 99.4% of the species A cells and no cells from species B. The last two clusters correctly partition the remaining cells (12 in total), again finding accurate models, allowing for recombination with the first two clusters during post-processing. Figure 4.8 shows a comparison between original and learned trajectories for X_{A,B} (long), and Figure 4.7 depicts representative Gaussian mixture models. In particular, Figure 4.7 (left) shows increased overlap between the two Gaussian mixtures in the first iteration, compared to Figure 4.4; however, model performance is still sufficiently different as to classify ≈ 98% of cells correctly.

4.3. Three-Species Populations. Similar to the case X_{A,B}, we see improvements with a longer time series X_{A,B,C} (long). For the initial experiment X_{A,B,C}, species C is completely identified in the first cluster (Table 4 row 2), and in the second cluster 98.8% of species B cells are identified along with 9.1% of species A cells, leading to a fairly inaccurate model (∆V ≈ 0.06). The subsequent clusters divide the remaining species A and B cells. Doubling the time series with X_{A,B,C} (long), we find the majority of each species residing in its own cluster (Table 4 row 3). Cluster 1 contains all of the species C cells, cluster 2 consists of 94% of species B cells and 1.8% of species A, and cluster 4 consists of 95% of species A. Moreover, the aggregate models for each of these clusters result in validation errors under 1%.

5. Discussion. Several aspects of this approach deserve a more technical analysis, which may lead to improvements. We aim in future work to undertake more rigorous study of each component of the algorithm as outlined below. (III) Trial bases and constraints: our constraints enforce near-field repulsion, negative alignment, and negative drag. Directional modes enforce bilateral symmetry, and are low order (monopole, dipole, quadrupole). These can easily be adapted to incorporate other known information; moreover, the bases themselves may be adapted to the data (note this is partially done, using the neighbor distance distribution ρ_rr to restrict the range of interactions). One major assumption here is that there is no propulsion force, i.e. that energy is increased only through anisotropic interactions with neighbors.
It would be interesting to examine whether this assumption holds true.
(IV) Regression approach: We employ modified sequential thresholding, which looks for an overall sparse solution, although we threshold only on the term magnitude ‖G_j w_j‖ and not the raw coefficient w_j. This in particular allows f_{a-r}, f_align, and f_drag to have coefficients of widely differing magnitudes. (V) Validate & Classify: In practice, the validation errors can easily be checked to satisfy log-normalcy a posteriori. If this is not satisfied, it may not be straightforward to cluster based on the validation error. In particular, chaotic trajectories cannot be expected to achieve a low validation error, in which case another metric is needed. In this case, it is reasonable to require that trajectories be long enough to compute statistics. We aim to investigate the requirements for performing classification with chaotic interacting particles in future work.
Appendix C. Constrained regression. At each iteration ℓ of the thresholding algorithm, we solve the linearly-constrained least-squares problem min_{w : Cw ≤ d, supp(w) ⊂ I^(ℓ)} ‖G^(i)w − b^(i)‖_2^2, where I^(ℓ) is the set of coefficients of w^(ℓ) satisfying (3.7).
The constraint system Cw ≤ d has the following four components.
1. f_{a-r} ≥ 0 when 0 ≤ r < r_nf: for the near-field repulsion of f_{a-r} we discretize the region {(r, θ) : 0 ≤ r < r_nf, θ ∈ [0, 2π)}, choosing 5 equally-spaced points in r from 10^{-6} to r_nf and 5 equally-spaced points in θ from 0 to π. Evaluating each of the basis functions for f_{a-r} at this grid results in a constraint system C_{a-r,nf} w_{a-r} ≤ 0 of dimension 25 × J_{a-r}, where J_{a-r} is the number of basis functions used to approximate f_{a-r} and w_{a-r} is the restriction of w to the coefficients of f_{a-r}.
2. f_{a-r} ≤ 0 when r ≥ r_ff: similarly, for the far-field region we choose 10 equally-spaced points in r from r_ff to r_max, where r_max is the maximum observed neighbor-neighbor distance in the simulation, and θ over the same points as before. This results in a constraint system C_{a-r,ff} w_{a-r} ≤ 0 of dimension 50 × J_{a-r}.
3. f_align ≤ 0: since the basis is positive, the constraint system here is simply I_{J_align} w_align ≤ 0, where I_n indicates the identity on R^n.
4. f_drag ≤ 0: similarly, the basis is positive, so the constraint system is I_{J_drag} w_drag ≤ 0.

Altogether we get d = 0, with C formed by stacking the blocks C_{a-r,nf}, C_{a-r,ff}, I_{J_align}, and I_{J_drag}. We use MATLAB's quadprog with constraint tolerance 10^{-10} and maximum iterations set to 1000. Note that since d = 0, we do not lose feasibility during the thresholding step, which is possible in general.

Appendix D. Model replacement. Once the initial batch of N models M is learned, we simulate each model M_i as outlined in Section 3.4 on K different validation cells selected from the data, where K = 20 throughout. For a given model M_i, we select these K validation cells by finding cells in the population that match well certain statistics of cell i. In particular, we define the following distributions:

    ρ^(i)_rr(r) = (1/T) ∫_0^T ℙ_{x∼X'}( ‖x_i(t) − x(t)‖ < r ) dt,
    ρ^(i)_vv(s) = (1/T) ∫_0^T ℙ_{v∼V'}( ‖v_i(t) − v(t)‖ < s ) dt,
    ρ^(i)_v(s) = ℙ( ‖v_i‖ < s ),

where (X', V') denote the remainder of the cell population excluding cell i. Respectively, these denote the distribution of distances from cell i to all other cells, the distribution of velocity differences between cell i and all other cells, and the distribution of speeds that cell i experiences. (These statistics are likely to correspond to the information content that cell i carries about its own forces f_{a-r}, f_align, and f_drag, given the force dependencies.) We approximate these distributions from the data using histograms with 50 bins.

For each cell i we compute the Kullback-Leibler (KL) divergence between its distributions ρ^(i)_rr, ρ^(i)_vv, ρ^(i)_v and those of the rest of the population, where the KL divergence between densities ρ and ν is given by

    D_KL(ρ ‖ ν) = −∫ ρ(x) log( ν(x)/ρ(x) ) dx.

(To limit the computational overhead, we only compute these KL divergences for cell i's nearest 200 neighbors in the Euclidean sense.) The K validation cells used to validate model i are then the K cells with smallest cost, defined from these KL divergences.
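A sketch of the validation-cell selection from the histogram statistics; since the exact cost is truncated in our source, the sum of the three KL divergences is used here as a stand-in:

```python
import numpy as np

def kl_divergence(p_hist, q_hist, eps=1e-12):
    """D_KL(p || q) for two histograms on a common set of bins."""
    p = p_hist / p_hist.sum()
    q = q_hist / q_hist.sum()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

def validation_cells(stats, i, K=20):
    """Pick the K cells whose (rr, vv, v) histograms are closest to cell i's
    (cost assumed here to be the sum of the three KL divergences)."""
    costs = {j: sum(kl_divergence(stats[i][name], stats[j][name])
                    for name in ("rr", "vv", "v"))
             for j in stats if j != i}
    return sorted(costs, key=costs.get)[:K]
```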

Let the validation error ∆V_{i→j} be defined as in (3.11), but with model M_i used to validate cell j (i.e. using the initial conditions of cell j). We replace M_i with M_j if the following three conditions are met: (1) ∆V_{i→i} > ∆V_{j→i}, (2) ∆V_{i→j} > ∆V_{j→j}, and (3) max{∆V_{j→i}, ∆V_{j→j}} < tol, where we set tol = 0.25 in this work. In words, M_j replaces M_i if (1) M_j performs better than M_i on cell i, (2) M_j performs better than M_i on cell j, and (3) M_j achieves a reasonably low error (defined by tol) on both cell i and cell j.
(Note that cell i and cell j are required to be mutual validation cells for a model replacement to occur.) Furthermore, if M_j replaces M_i, and another model M_k replaces M_j, we reassign M_i to M_k as well, even if cells i and k are not mutual validation cells.
We find this approach to be crucial to increasing the accuracy of the learned models, as it transfers successful learning from the few cells with highly informative trajectories to cells with less informative trajectories. As with all validation steps of our algorithm, this approach would be infeasible if not for fast data-driven forward simulations.
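The replacement test itself is a three-way comparison of validation errors; a direct transcription of conditions (1)-(3) above:

```python
def should_replace(i, j, dV, tol=0.25):
    """True if model j should replace model i for mutual validation cells
    i, j; dV[a][b] is the error of model a simulated on cell b, cf. (3.11)."""
    return (dV[i][i] > dV[j][i] and         # (1) j beats i on cell i
            dV[i][j] > dV[j][j] and         # (2) j beats i on cell j
            max(dV[j][i], dV[j][j]) < tol)  # (3) j is accurate on both
```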

Fig. 2.1: Diagram of social interactions depending on the angle θ_ij between cell i's velocity and cell j's position relative to cell i.


Fig. 3.1: Classification pipeline for cells from heterogeneous populations. (a) An ensemble of models M = {M_1, ..., M_N} is learned, each from an individual trajectory; (b) M is partitioned into clusters C = {C_1, ..., C_r} according to the active forces in each model; (c) models in the largest cluster C are averaged together, giving M; (d) M is validated along each individual trajectory; (e) validation errors are classified, producing an identified species S (cyan checkmarks), and the remaining cells (red X's) are returned to step (b) to be clustered again. Steps (b)-(e) repeat until all cells are classified. Note that the number and members of the model clusters C and the resulting aggregate model M will change each iteration depending on the identity of the remaining unlabeled cells.



Fig. 4.1: Forces used to generate artificial data, motivated by experiment. Left: quadrupolar attractive-repulsive force f_{a-r}. Right: dipolar alignment force f_align. Note that all artificial species are prescribed the same drag force f_drag(s, θ) = −5s, linear in cell speed s.





Fig. 4.2: Examples of learned and original trajectories from homogeneous populations. Left to right: X_A, X_B, X_C.

Fig. 4.3: Distribution of log-validation errors for homogeneous cell experiments X_A, X_B, X_C. Distributions for X_A and X_B are fit well by a single Gaussian, indicating a single species is present. The distribution for X_C has a non-Gaussian tail, although all errors are below 1%, indicating that the candidate model fits the population up to the specified error tolerance.

Fig. 4.6: Example trajectories from experiment X_{B,C}. As in Figure 4.5, cells with true color labels are depicted on the left. Classified cells from species C (middle) and species B (right) are highlighted, showing excellent agreement between the original data and the output of the learned models.

Fig. 4.7: Log-validation errors for two-species population X_{A,B} (long). Strong similarities between the two species present an initial challenge to identification, which is overcome by taking a longer time series. The initial Gaussian mixture model (left) identifies a majority species B cluster. In the second iteration (right), a cluster with all species A cells is identified, and a small group of cells remains which is then partitioned correctly (see row 5, columns CS(A) and CS(B) of Table 3).

Fig. 4.8: Example trajectories from experiment X_{A,B} (long). Cells with true labels are depicted on the left, and classified cells from species B (middle) and species A (right) are depicted with model output overlapping the input data in each case.

Fig. 4.9: We see an initial complete separation of species C (left), followed by a mixed cluster containing 94% of the species B cells and 1.8% of the species A cells (middle). The next iteration classifies a majority species A cluster (right). Clusters 4 and 5 are effectively outliers and contain the remaining 31/1000 cells.

Fig. 4.10: Example trajectories simulated using learned models from X_{A,B,C} (long). Top left: example cells with true labels. Top right, bottom left, bottom right: example trajectories from clusters 1-3 (see Table 4 row 3 for details). Note in particular that the learned models for clusters 2 and 3 are able to capture sharp turns in the true dynamics.

Table 7: Wall times for the main components of one iteration of the algorithm, recorded for the X_{A,B,C} (long) experiment (N_tot = 1000 cells and L = 400 timepoints):

    x_i → M_i : 5-10 sec           (learn one single-cell model)
    M → (C, M) : 10-30 sec         (cluster models and aggregate)
    (x_i, M) → ∆V_i : 10-30 sec    (validate via forward simulation)
    VE → S : < 1 sec               (GMM classification)

Each component contains an inner iteration which may be trivially parallelized; the reported times are for one step of the respective inner iteration. Specifically, it takes 5-10 seconds to learn each model M_i, while the computation time for the full set of models M = {M_1, ..., M_N} depends on the number of available CPUs. To form the model clusters C = {C_1, ..., C_r}, find the largest cluster C, and compute the averaged model M, forward simulations of each model M_i (see Appendix D) are performed, which take 10-30 seconds (depending on the complexity of M_i). Similarly, computation of each ∆V_i requires one forward simulation (10-30 seconds). Lastly, to identify S, it takes < 1 second to perform GMM classification of log_10(VE), and the results of 20 rounds of classification are averaged. If full parallelization is available, the walltime is less than two minutes per outer iteration, or < 2S minutes in total, where S is the number of identified species. For comparison, generating the data for X_{A,B,C} (long) takes 5-6 hours.

Table 2: Performance of the model learning and classification algorithm on homogeneous populations. Columns: Experiment, ∆f_{a-r}, ∆f_align, ∆f_drag, CS(A), CS(B), CS(C), ∆V.

Table 5: Summary of notation used throughout.

In Table 5 we include all notation used throughout the article. In Table 6 we list several hyperparameter choices for the examples above which were not covered in Sections 3-4. Finally, we include walltimes for the main components of the algorithm in Table 7, recorded in MATLAB using an AMD Ryzen 7 PRO 4750U processor.

Table 6: Hyperparameters used for each example.