A Data-Driven Statistical Description for the Hydrodynamics of Active Matter

Modeling living systems at the collective scale can be very challenging because the individual constituents can themselves be complex and the respective interactions between the constituents are not fully understood. With the advent of high throughput experiments and in the age of big data, data-driven methods are on the rise to overcome these challenges. To directly uncover the underlying physical principles, we present a data-driven method for obtaining the phase-space density such that the solution to the stochastic dynamic equation for active matter readily emerges, from which time and space dependence of physical order parameters can be readily extracted. If the system is near a steady state, we illuminate how to construct a field theory to subsequently make physical predictions about the system. The method is first developed analytically and subsequently calibrated using simulated data. The method is then applied to an experimental system of particles actively driven by a {\it Serratia marcescens} bacterial swarm and in the presence of spatially localized UV light. The analysis demonstrates that the particles are in the steady-state before and sometime after the UV light and obey a Gaussian field theory with a spatially-varying"mass"in those regimes. This novel, yet simple, finding is surprising given the complex dynamics of the bacterial swarm. In response to the UV light, we demonstrate that there is a net flow of the particles away from the UV light and that the entropy of the particles increases away from the light. We conclude with a discussion of additional potential applications of our data-driven method such as when the internal structure of the individual constituents dynamically changes to result in a modified stochastic dynamic equation governing the system.

Modeling living systems at the collective scale can be very challenging because the individual constituents can themselves be complex and the respective interactions between the constituents are not fully understood. With the advent of high throughput experiments and in the age of big data, data-driven methods are on the rise to overcome these challenges. Although machine-learning approaches can help quantify correlations between the various players, they do not directly shed light on the underlying physical principles of such systems. To directly uncover the underlying physical principles, we present a data-driven method for obtaining the phase-space density such that the solution to the stochastic dynamic equation for active matter readily emerges, from which time and space dependence of physical order parameters can be readily extracted. If the system is near a steady state, we illuminate how to construct a field theory to subsequently make physical predictions about the system. The method is first developed analytically and subsequently calibrated using simulated data. The method is then applied to an experimental system of particles actively driven by a Serratia marcescens bacterial swarm and in the presence of spatially localized UV light. The analysis demonstrates that the particles are in the steady-state before and sometime after the UV light and obey a Gaussian field theory with a spatially-varying "mass" in those regimes. This novel, yet simple, finding is surprising given the complex dynamics of the bacterial swarm. In response to the UV light, we demonstrate that there is a net flow of the particles away from the UV light and that the entropy of the particles increases away from the light. We conclude with a discussion of additional potential applications of our data-driven method such as when the internal structure of the individual constituents dynamically changes to result in a modified stochastic dynamic equation governing the system.

I. INTRODUCTION
Complexities of living matter include its multi-scale nature, its multi-body nature, its disrespect of timereversal symmetry, and its ability to replicate and to learn, or adapt, from which new properties and/or symmetries may emerge. An example of a new emergent property at the cellular level is the elongation of cells during development [1] and at the onset of some diseases [2]. With such changes comes new, emergent interactions that need to be quantified. The above complexities call for a focus on in vitro systems, with only a few types of players, or constituents, if we are to build quantitative models of living matter with predictive power.
One of the candidate frameworks to model living matter is known as active matter [3][4][5][6]. In active matter, each constituent is internally driven, typically by consuming energy to generate motion. With this construction, nonliving material such as colloids, microbots, and other selfpropelled particles also fall under the purview of active matter [7][8][9][10][11]. To date, two main categories of theoretical approaches have been used to quantify active matter: analytical analysis of dynamic stochastic equations of an assumed form (see, for example, equations governing the hydrodynamics of flocking [12,13]) and more intricate computational approaches. While results from the former category are robust, the assumptions involved may limit its applicability.
The challenges of the analytic approaches have motivated machine-learning approaches at the opposite end of the spectrum of quantitative frameworks [14]. With machine-learning techniques, one searches out correla-tions by sifting through reams of data. While this approach comes with many advantages, there are shortcomings as well. One of the difficulties in machine-learning is the freedom in model selection, which could contribute towards the non-reproducibility issues reported in [15]. Application of machine-learning methods in the exact sciences, such as physics, encounters the additional problem of violating basic physical principles. In other words, both feature and model selections in machine-learning are not constrained as severely as principles in physics. Therefore, a researcher may train a model that eventually violates the second law of thermodynamics, or the system's energy may not be bound from below such that ghosts with negative kinetic energy are allowed to propagate. We should mention that there has been very recent progress in dealing with this very issue [16].
Our approach in this manuscript is to take advantage of learning from data but remaining within the strict framework of the governing laws of physics. Therefore, instead of working with the conventional machinelearning models, we adopt the conventional statistical models of many-body systems to be trained by data, where the phase-space distribution function serves as the prominent feature of the model. An advantage of our approach to learning from data is that the interpretation of both the model and its features is well understood. The approach also aims at overcoming the challenge of dealing with unknown interactions, emergent or otherwise. The expectation with this hybrid approach is that new and old physical principles will readily emerge in living systems at the collective scale.
We start with the time evolution equation for the phase-space density function as the underlying equation generating the non-equilibrium field equations for the order parameters, such as the number density and the polarization vector. The dynamical equation of the phasespace density is under-determined in its general form but is exact. In analytic approaches, one needs to make simplifying assumptions and approximations to overcome the latter problem to find analytic solutions to the field equations. In our approach, we work with the exact dynamical equation and estimate the phase-space density directly from data. Since in an experiment, the phasespace density function is truly driven by the exact equation, our estimation would be the solution to the exact field equations with no assumptions required. Also, we show that our data-driven approach can be used to build an effective statistical field theory for those systems that are close to steady state. For this purpose, we use our data-driven description of the time evolution of the order parameters to observe their fluctuations over time and their correlations over space. Since the latter is related to the functional derivatives of the effective action, we can then solve a system of equations to reconstruct an analytic effective action. Finally, we test our method using both simulations and experiments. The structure of this paper is as follows. We construct the theoretical framework in Sec. II. We validate our data-driven method in three simulations that are presented in Sec. III. In Sec. IV, we apply our method to an experimental system of spherical particles embedded within a bacterial swarm that respond to localized UV light. We draw conclusions in Sec. V.

II. THEORETICAL SETUP
The breaking of time-reversal symmetry in active matter results in a non-equilibrium system making the direct connection with equilibrium statistical mechanics approaches untenable, particularly in the absence of any steady-state. However, non-equilibrium statistical mechanics approaches exist. Generally speaking, the stochastic dynamics of any observableÂ(Γ, t) in a statistical system can be written as where Γ is a point in the N-body phase-space, and L consists of combinations of derivatives with respect to real-space positions, velocities, angles, angular velocities, etc. of each of the N particles in the system. It should be noted that the exact form of L depends on the system of interest. For self-propelled rods, for example, its form is given in Ref. [17], which is found by starting with the Ito calculus [18]. A measurable quantity is the ensemble average of the observable, which reads wheref N (Γ) is the N-body phase-space probability density, and in the second line, the dynamics is equivalently passed to the latter density function. Taking partial derivatives of Eq. (2) and integrating by parts, we conclude that We define the m-body phase-space density function as As will be discussed later, most of the order parameters in active matter, such as the number density in realspace, the polarization vector, and the nematic tensor, are known in terms of f 1 . Therefore, we are interested in the time evolution of the latter quantity. In the following, for simplicity, we drop the subscript and refer to the one-body distribution function as f ≡ f 1 . We apply ∂ ∂t to both sides of Eq. (4) for m = 1 and use Eq. (3) to rearrange the terms in the following form where points in the one-body phase-space are labeled by q ≡ ( x, v,û, ω, · · · ), consisting of positions, velocities, orientations, angular velocities, and other possible internal structures, and repeated indices indicate a sum. We will discuss C in the following subsections.

A. Hydrodynamic field equations
For simplicity, we assume that q ≡ ( x, v,û, ω) and neglect the rest of the possible internal structures of the particles. To arrive at the field equations for the order parameters in active matter, we need to compute the moments of Eq. (5). The continuity equation is the zeroth moment of Eq. (5) and reads where ∂ µ ≡ ∂ ∂x µ , dv, dû, and dω refer to the "volume" integrals in their corresponding spaces, and the particle density and the average velocity are defined respectively by Also, we have set the surface term from ∂ ∂ûµ equal to zero.
The time evolution of the average velocity can be found through the first velocity moment of Eq. (5) where we have set dv µ dt = dω µ dt = 0. Also, we assumed that the external forces applied to the particles are zero. Moreover, the stress tensor is defined by It should be noted that the definition of the stress tensor might be different in other sources where some of the terms from the right-hand side of Eq. 8 are absorbed into the definition. Nevertheless, we prefer our definition of the stress tensor because (i) it is Eq. (8) that bears the physical meaning and not the definition of the stress tensor, and (ii) our Eq. 9 remains the same for every form of C, which as we will see later depends on the model of active matter. As an example, if C = ∂ µ f F µ with F µ being proportional to a constant active propelled force, a modified stress tensor could be defined as Finally, the pressure is defined as the trace of the stress tensor divided by the dimension of the real-space where there is a sum over the repeated indices. The time evolution of the polarization vector of the particles is given by the first orientation moment of Eq. (5) where the polarization vector is defined as and we have defined A special but common model would be if the probability distribution of velocities and orientations are independent such that f = f v f s . In this case, Γ µν ∝ p µvν . Another order parameter of interest is the nematic tensor defined as whose time evolution is given by the second orientation moment of Eq. (5) Again, under the special case that f = f v f s , the integral in the second term on the left-hand side is proportional tov α ρQ µν + ρδµν d . Since the one-body phase-space density is a probability function, we use the Shannon definition of entropy as a measure of information in the system of particles through It is interesting to note that when the Boltzmann Htheorem is valid, the equation above is also equal to the thermodynamic entropy of the system. Additionally, when the interactions are strong, while the above is a subset of the total thermodynamic entropy, it may still be an insightful measure of the system. It is key to note that the goal of solving the hydrodynamic field equations (6,8,12,16) The difficulty is that although the hydrodynamic equations above are exact, they do not make a closed set of differential equations and depend on higher order moments. Therefore, if analytic solutions are of interest, one has to make assumptions to break the hierarchy. As can be seen from equations (7,9,13,15), the time and space dependence of all the order parameters above are known if f , the one-body phase-space density, can be found. Here, we compute the left-hand side of Eq. (5) directly from data to arrive at C without needing to break the hierarchy.

B. Analytic approaches
Even though Eq. (5) is exact, it is not closed because C is a functional of f 2 , f 3 , · · · . Therefore, analytic descriptions for f is not possible unless we make simplifying assumptions. A large subset of theoretical approaches to understanding active matter consists of proposing a form for C based on a set of assumed symmetries and interactions. In a wide class of models for active matter, C is given by the so-called Boltzmann equation such that where the detailed description of the two terms for active matter can be found in for example [19][20][21][22]. The Boltzmann equation have been used by numerous researchers to describe a wide range of systems of active matter, see for example [23][24][25]. Another prevalent class of models for active matter are considered to be explained by the Smoluchowski equation with its simplifying assumptions. For two dimensional self-propelled rods, C in Eq. (5) reads [17,26] which is a slightly modified Smoluchowski equation, where D R is a noise coefficient, θ is the angle of the orientation vector, τ is the mean-field torque from other particles, and F is a mean-field force. The original Smoluchowski equation of single-particle distribution function f , which again falls under the form of Eq. (5), has been used to describe active cytoskeletal filaments [27]. Many other assumptions regarding the form of C can be found in the literature. For example, in [28], the Gay-Berne potential is used to suggest a form for the time evolution of the one-body probability density f . One can also write the most general form for C by constructing all possible scalars out of f , v µ ,û µ , ∂ µ , ∂ ∂ v µ , ∂ ∂ûµ , etc. The most general form of C is the following: where a with any type of subscript stands for a passive or an active coefficient. As can be seen, we have an infinite number of models for active matter each corresponding with a C equal to a subset of the terms above. Nevertheless, for most of the commonly known models of active matter, only a few terms are non-zero. For example, a 1 in Eq. (20) is similar to the viscosity in the Navier-Stokes equation, and a 3 is similar to one of the active parameters in the dry flock model presented in Ref. [5], where the moment of the right-hand side of Eq. (5) reads In fact, this active-current term can be constructed within the Boltzmann framework of Eq. (18) as is shown in Ref. [22]. It is also possible to impose constraints on the moments of Eq. (5) to arrive at a certain arrangement of non-zero coefficients in Eq. (20). One possible route is to study the terms analytically by categorizing them according to some group properties. In that way, we will end up with a study similar to the one presented in [12,13]. In this paper, we follow a different approach to be discussed next.
C. Data-driven approach As we mentioned above, C accounts for all of the higher-body probability functions such that Eq. (5) is exact. However, simplifying approximations are needed to find an analytic solution for f because the aforementioned equation is not closed. An alternative to making simplifying assumptions, is to directly estimate f from data and use the exact form of Eq. (5) to find C as well as the order parameters such as ρ ( x, t),v µ ( x, t), p µ ( x, t), and Q µν ( x, t).
Biological systems are often made of many types of particles and we should write one Eq. (5) for every particle type. In this case, C not only accounts for higherbody distribution functions but also for the interactions between different types of particles. In the following, we assume that only one type of particle is of interest. We also assume that an experiment has collected a data sample containing the information of positions and orientations of the particle type of interest over time.
The starting point would be to note that the one-body phase-space density f ( q) is equal to the probability of finding a particle at point q in phase-space. Perhaps the most straightforward approach to estimating f ( q) from a data sample is to discretize the q space and count the number of particles that fall into each bin. The problem with this approach is that if the number of dimensions of q is high, most of the bins remain empty. For example, a typical dataset contains only a few hundred to a few tenthousand particles. On the other hand, even for a poor and blurry discretization of 10 bins per dimension, and for a typical six-dimensional phase-space, we end up with one million bins. Another problem with this approach is that the estimation often depends on the choice of the locations of the bin edges.
Although the simple approach above is not necessarily practical for estimating the phase-space density, its careful interpretation can guide us in devising a better estimator. An interpretation of the binning method is that all of the observed particles contribute to a given bin with binary weights. In other words, a particle's contribution to a given bin is zero or one depending on whether the phase-space distance of that particle to the center of the bin is greater or smaller than the bin size. The estimation of the phase-space density then reads where i runs over particles, N −1 is a normalization factor, and w i ( q) is the contribution of the ith particle to the probability at q. It is already obvious that the binary form of w i ( q) is behind the problems of the binning method, and w i ( q) should be a continuous function such that all of the observed particles have a non-zero contribution to a given location of phase-space. In statistics, Eq. (22) with a continuous weight factor is known as kernel density estimator [29]. Also, the weights of particles close to a given phase-space location q should be more significant than the weight of particles that are far from it. Therefore, we use the following Gaussian form for the weights in Eq. 22 where r eff is a free parameter. The optimized value for r eff would be the smallest number that still returns a smooth estimation of phase-space. Hence, by observing the position of particles in the phase-space q i , the weight w i ( q), and subsequently the density f ( q) can be estimated. Finally, phase-space density reads A pictorial description of the algorithm for our estimator of the phase-space density can be found in Fig. 1.
By inserting this continuous one-body phase-space density function into the equations (7 , 9, 13, 15, 17), we can estimate the number density, the bulk velocity, the stress tensor, the polarization vector, the nematic tensor, and the entropy as functions of time, and as the solutions of the hydrodynamic equations in section II A without needing to solve the differential equations.
Having estimated the phase-space density at different times t, and assuming that the external forces are either known or absent, all of the terms on the left-hand side of Eq. 5 are known. Therefore, we can find an estimation for C, which can be subsequently compared with Eq. (20) to find its significant terms.

D. Statistical field theory of active matter
A field theoretic description of a system is potentially powerful due to its capability in explaining various experiments that are seemingly different. It is this one-tomany relationship that highlights the importance of field theory. When an active matter is close to its steady-state, an effective statistical field theory of its order parameters can be constructed. An objective of this paper is to learn this effective field theory by observing one of the experiments that it can explain.
To start the construction, we note that, on the one hand, the correlations between the order parameters are determined by a probability functional, and if these correlations are known, one should be able to recover the probability functional through a reverse engineering. On the other hand, in Sec. II, we have devised a method to learn the space and time dependence of the order parameters, and consequently their correlations, from data. More specifically, by definition, the effective partition function is the sum of the probabilities of possible configurations (25) where ϕ represents the order parameters, and J is an auxiliary field. On the one hand, the correlations among the order parameters are given by On the other hand, the same correlations are known through the data-driven method of Sec. II. The objective is to construct the unknown and analytic expression on the right-hand side from the left-hand side, which is known numerically. In the following, we pursue the construction of the effective field theory by first expanding the right-hand side of Eq. (26) using the perturbation theory, which is valid given the assumed stationary state of the system. Each term in the expansion would be an unknown variable to be determined. Therefore, we need to observe as many correlation functions, for k = 2, 3, · · · , as the number of unknown variables. By solving the system of equations, the leading terms in the expansion of the partition function would be recovered from data. It should be noted that, usually, a few first terms of the expansion are of interest, and the higher-order terms can be neglected for the sake of practicality and depending on how much data is available.
To practically use the right-hand side of Eq. (26), we need to calculate the path integration in Eq. (25), and write Z[J] explicitly in terms of the leading perturbations in S total [ϕ]. Since the system is close to the equilibrium and the fluctuations around the saddle point are small, the exponential function in Eq. (25), equivalent to the probability of configurations, takes the following form where V (ϕ) refers to the higher order terms, and we have assumed that at the saddle point δZ To recover the leading term, we introduce subscript "0" to refer to V (ϕ) = 0, and define ϕ 0 (x) to be the solution to the corresponding field equation which can be used to calculate the non-interacting partition function Here, we have dropped a normalization factor, and ∆ ( x 1 − x 2 ) is the Greens' function defined as Finally, we achieve the desired expansion of the full partition function through a straightforward calculation The algorithm for reconstructing the effective action is as follows. First, we use the data-driven method, developed in Sec. II, to find the phase-space density at different moments and determine whether or not f evolves with time within some tolerance. If not, we collect many snapshots of the system and estimate the phase-space density for each of them. The phase-space densities will be used to learn the order parameters as functions of the real-space position in the corresponding snapshots. The snapshots act as the members of the ensemble of the system. Correlations between the order parameters of different locations, such as ϕ( x 1 ) · · · ϕ( x k ) , are equal to the average of the multiplications of order parameters, i.e. ϕ( x 1 ) · · · ϕ( x k ), over the snapshots.

III. SIMULATION TEST CASES
To test the data-driven approach outlined in Sec. II, we apply it to datasets whose properties are known. We consider the following three test cases.

A. Simulation test case I: Multivariate Gaussian distributions
In this subsection, we assume the following fivedimensional phase-space distribution function where q = (x, y, v x , v y , cos θ) with θ being the angle of orientation for rod-like particles, and Note that the Maxwell-Boltzmann distribution is a very special case of the distribution above when every component of Σ −1 is zero except Σ −1 33 = Σ −1 44 = 0. It should also be noted that many interactions have been assumed between positions, velocities, and polarization, and so the system is far from a non-interactive one.
We use the "numpy.random.multivariate normal" package in python to draw 15 snapshots, each with 1000 particles. A histogram of the four-dimensional phasespace position of observed particles, which is the naive binning method discussed above, is shown in Fig. 2(a). The estimated phase-space density is shown in Fig. 2(b). The deficiency of the simple binning approach can be seen by comparing Fig. 2(a) and Fig. 2(b). One of the snapshots can be seen in Fig. 2(c). The true and the estimated, using r eff = 0.5, phase-space densities at every point of the phase-space are compared. The result can be seen in Fig. 2(d). This panel shows that unlike the histogram approach, the method can accurately estimate the true distribution function.
We study the effects of tuning r eff by estimating the phase-space density with different values of r eff for each of the 15 samples and comparing their average as our final estimation with Eq. 32. Fig. B.2 shows the error in our estimation of the phase-space density in terms of r eff . As can be seen from the figure, r eff 0.5 returns the most accurate estimation of the phase-space density. On the other hand, Fig. B.3 show that the error of estimation can be reduced by increasing the number of samples, snapshots of the system.
Having estimated the phase-space density using 15 samples with r eff = 0.5, we are ready to compute the order parameters of interest using Eqs. (7,9,13,15). For instance, our estimations of the pressure and the entropy  of the simulated data are shown in Fig. 3 Also, assuming that no significant external force exist in the system, we can estimate C at every location of phase-space by inserting the estimated phase-space density into the left-hand side of Eq. (5) and find the significant terms in Eq. (20).

B. Simulation test case II: Non-linear, non-steady-state scenario
To test our method in a non-linear and non-steady system, we simulate a system of particles on a surface whose phase-space density is ).
Although the distribution function has a nonconventional form, we are still able to draw samples using python's random package with the following trick. First, we use the "numpy.random.uniform" to generate one mil-lion data point in v x space. We scale the velocity space by dividing v x by the length of the container of the particles such that the velocities are confined within an interval of (-4,4). We repeat the same procedure to generate uniform points in the v y dimension. By zipping the two arrays, we have one million uniformly generated data point in the two-dimensional phase-space. The number of uniformly generated data point is intentionally high to cover almost the entire phase-space. Next, we use Eq. (34) to compute the probability of each of the data point. Finally, we use "random.choices" package in python to draw samples of 10000 particles out of the one-million population based on their computed probability.
To increase the accuracy of the estimation, we take ten snapshots of the system at each time, corresponding to having ten replicates, and repeat the procedure at four different times. We insert the snapshots as an input to our data-driven method and estimate the phase-space density. The average of the ten estimated phase-space densities at each time is compared with the true function above. Figure 4 shows that our method accurately estimates such non-linear, and out of equilibrium systems. We choose r eff = 0.05 at all the time points to get the best optimized result.
Since our estimation of f (t, v x , v y ) is closely similar to the true density at all of the time points, our estimation of any of the order parameters of the system will be accurate. For example, the pressure of the system reads FIG. 5. Simulation test case II: (a) One of the ten snapshots of the self-propelled rods system. The parameters are chosen such that ρ0 > ρN with the director unit vectorn in the y-direction such that there are more particles in the y-direction than those in the x-direction in accordance with the analytic solution in Eq. 36. (b) The estimated versus the true phase-space density at uniformly distributed positions of the 6-dimensional phase-space. The comparison shows fair agreement between the two. Therefore, our method's estimations for the additional order parameters are fairly accurate too because these parameters are simply moments of the phase-space density f as defined in the theory section above and also in [26,30].

C. Simulation test case III: Self-propelled rods
We now evaluate the performance of our method by applying it to a system whose analytic hydrodynamic solution is known, namely, self-propelled hard rods on a substrate in two dimensions. Both the microscopic and continuum description of the system can be found in Refs. [26,30], where the order parameters such as the number density, the polarization vector, and the nematic tensor are defined in terms of the one-particle phase-space density that satisfies a modified Smoluchowski equation of the general form of Eq. (5) where q µ = ( r, θ, v, ω) refers to the position vector in the twodimensional configuration space, the angle that defines the orientation of the rods, the velocity of the rods, and their angular velocities. As discussed in the two references above, if the friction due to the substrate is large and the interactions are of the excluded volume type, and after a few more simplifying assumptions, the following would be a stationary solution where v 0 is the self-propulsion along the direction of the rod, and I = l 2 /12 with l being the length of the rods. Moreover, f x , the real-space probability distribution, takes the following form where ρ 0 is a constant, ρ N = 3π 2l 2 is the Onsager transition density, and Q αβ = S n αnβ − 1 2 δ αβ , with S andn being constant scalar and unit vector. For more detail, we refer to the two references above.
Before we proceed, it should be clarified that although many dynamical microscopic simulations exist in the literature, their corresponding phase-space density is not necessarily known. On the other hand, to evaluate our method, we need to compare our estimated phase-space density with the true one. Fortunately, this true phasespace density of the system of self-propelled rods has been analytically derived in Refs. [26,30] starting from the underlying microscopic system. Therefore, we test our method in the same manner as described in test case II.
In the following, we simulate this system by choosing the length of the rods as l = 1, the temperature as kT = 2, the active parameter as v 0 = 2, and without loss of generalityn = (0, 1). Also, we choose ρ 0 = 5 to satisfy ρ 0 > ρ N since this is the more complex scenario whose estimation is harder. We take 10 snapshots of the system to serve as the input data into our data-driven method. One snapshot of the system can be found in Fig. 5(a).
Next, we pass the snapshots to our data-driven method to estimate the single-particle phase-space density. We find that r eff = 0.1 gives the best optimized estimation. We compare our estimation with the analytic solution in Eq. (36) and validate the method in such active systems. Figure 5(b) shows that since our method returns a fair estimate of the density function f , its prediction for all the rest of the order parameters, such as the number density, are fair estimations of the corresponding analytic expressions in the mentioned references. It should be mentioned that the advantage of our method, ultimately, is in its performance ability in situations where the simplifying assumptions are not valid and no analytic solution exists.

IV. EXPERIMENTAL TEST CASE: ACTIVELY-DRIVEN SPHERICAL PARTICLES
We now turn to apply our approach to an experimental test case. Bacterial swarms [31] are a staple experimental system for active matter [3,4]. Each bacterium is self-propelled via molecular motors controlling flagella to move up to speeds of tens of microns per second. While even the motion of an individual bacterium is nontrivial, the collective motion of many bacteria exhibits a range of dynamic phenomena from jets to whirls to turbulence [32,33]. The predominant theoretical approach is nematic hydrodynamics and it does predict the onset of bacterial turbulence with associated topological defects [34,35]. So the collective motion of a bacteria swarm is highly non-trivial as well. In addition to quantifying the motion of the swarm, researchers have also studied particles embedded in the swarm. So while the particles themselves are inherently passive, they are actively driven [36] by a complex field. See the Appendix for details on the experimental set-up and conditions. The experiment at hand probes the effects of localized UV light on the collective motion of the swarm and, thus, the spherical particles driven by the swarm. The perturbation of the localized UV light causes the bacteria where the light is localized to ultimately stop moving after some time window [36]. What happens prior to this jamming was not resolved by the prior experiments in terms of determining if some of the bacteria flee the region of localized light before others eventually get jammed. We address the issue here.
To begin, we choose 150 consecutive movie frames well before and another 150 consecutive movie frames well after the ultraviolet light is emitted. We refer to these two movie sets, each approximately three seconds, as stage-I and stage-II in the following. A snapshot of the two stages can be seen in Fig. 6. We use the trackpy package [37] in python to trace the position of spherical particles over time. We apply conventional selections to remove the spurious particles. After the positions of particles are quantified in all of the movie frames, we use the change in the positions in the consecutive frames to find the velocity of the particles.
Having reconstructed the positions and the velocities of the spherical particles, we implement our method to estimate the phase-space density for each movie frame. A comparison of the densities reveals that they are close to the steady-state in both stage-I and stage-II. However, the system goes out of equilibrium when the ultraviolet light is turned and migrates from stage-I to stage-II. Our estimations for the phase-space densities and the interaction terms in stage-I and stage-II are shown in Figs. 7(a,b). The figure shows that the probability of finding particles at the center is reduced but increased at the upper-left corner and the bottom of the container. This figure qualitatively indicates that the spherical particles are running away from the ultraviolet light into the upper-left and bottom regions of the real-space. Since the spherical particles are passive, and assuming that they are indicator of the active bacteria around them, we argue that the living matter-the bacteria-is evading the UV light. Given the crowding effect, it is difficult for all of the bacteria to flee the UV light, just as it is in a crowd in a panic trying to exit a stadium, for example. However, as the bacteria become jammed in the region of the UV light [36], we speculate bacteria just outside the region re-route since their fellow jammed bacteria in the illuminated region now emerge as an obstacle. It is in this re-routing that presumably directs both the bacteria and spherical particles away from the UV light. It would be interesting to determine whether or not this phenomenon is robust, i.e., extends beyond our test case.
With this theoretical framework, we can do more. We use Eq. (5) to estimate the four-dimensional interaction term C in both stages I and II, which are shown in Figs. 7(c,d). It should be noted that in the Boltzmann form of Eq. (18), the interaction term has the following general form is the probability amplitude of scattering one particle with velocity v into velocity v off the rest of the system at position x. The right hand side of the latter formulation suggests that d 2 vC[ x, v] = 0. This is consistent with our estimation in Fig. 7 where each sub-plot is symmetric between positive and negative values with a sum of approximately zero. Since we did not enforce Eq. (37) in our estimation of C, the latter is a validation of our method. Figure 8 shows the estimated number density, the pressure, the entropy, and the equation of the state of the spherical particles in stage-I. These quantities are the averages of the corresponding ones overall the movie frames of stage-I since as we discussed above, the phase-space densities associated with the frames in stage-I do not change significantly over time suggesting that the system is close to steady-state. The latter is consistent with the linear relationship between the number density and the pressure of the system, as can be seen in the figure above. Both of these findings are nontrivial given the complexity of the bacterial swarm driving the spherical particles.
To further quantify the migration of the spherical par- ticles, we compute the real-space density of particles in stage-II and show its difference with respect to the number density in stage-I as δρ ≡ ρ II ( x) − ρ I ( x) in Fig. 9(a). The error associated with this difference is presented in Fig. B.7, which shows that the difference in the number densities is statistically significant. Fig. 9(b) shows the difference in the pressures of the spherical particles in stage-I and stage-II. As can be seen, the density is increased near the walls of the container and decreased at the ultraviolet site. We emphasize that our estimation of the phase-space density, and hence the real-space density, do not solely rely on counting the particles at the position of interest. Instead, we take the distance of every observed particle from the position of interest to estimate the probability of finding particles at that point. Therefore, our estimation of the probability at a position may be different from zero while no particle has been observed at that point in the dataset. This point has been made clear in Fig. 2, where using our knowledge of underlying distribution in simulations, we have shown that the true probability at a given point may not be zero but still, no particles exist at that point in the dataset due to the lack of statistics. On the other hand, in Fig. 2(d), we have shown that our estimation of the same probability is fairly close to the true one and overcomes the lack of statistics in the dataset. As can be seen in Figs. 7(c,d), the interactions in the velocity sub-spaces around the center of the container are slightly changed due to the ultraviolet light being shined at that location. The change in the interactions is a result of the ultraviolet light leading to a drop in the pressure of the spherical particles, as can be seen in  Fig. 9. On the other hand, the system has reached a steady-state, therefore, the balance of the forces should be zero at every location of the container. So the only existing force on the particles that balances the pressure is their interactions with the rest of the system, i.e. their interactions with each other via the bacteria. Therefore, we can use UV light as a means of controlling the net flow of the spherical particles, and perhaps the bacteria. By shining it at a high-density region such as the bottom of the container, we can drive the particles to move toward the empty regions at the top of the box. It should be noted that to derive the equation of state of the spherical particles, we have used Eqs. (7,9), to compute and compare the real-space densities and the pressures of the spherical particles at every position of the container. Our best fit lines, which can be seen in Figs. 8 and B.5, indicate that the equation of state is the same in both stages and reads P 0.04ρ, (38) where the coefficient indicates the effective temperature of the spherical particles in natural units. Here, we presented a novel derivation of the linear equation of state that is often assumed for such systems. See for example Ref. [38]. At this point, we turn our attention to the statistical field theory of the actively-driven spherical particles. First, we analyze the 150 consecutive movie frames of stage-I for this purpose. First, we define ϕ( x) ≡ ρ( x) − ρ( x) , where the expectation symbol refers to the mean of the real-space particle densities in stage-I. Second, is the perturbative formalism via the Greens' function method valid for this system over this time interval? To answer this question, we look at the expectation values of the terms that might appear in the exponential of Eq. (27) and determine their strengths. We report that ϕ( x) 2 is the most significant term and all the other terms, especially ∂ 2 ϕ , are at least two orders of magnitude smaller. The leading perturbation terms are ϕ∂ x ϕ , ϕ∂ y ϕ , and ϕ 3 which are two orders of magnitude smaller than the leading term. In the following, we neglect the rest of the corrections to the partition function. The expectation values above are shown in Fig. B.6.
At this point, we construct the Green's function of the driven spherical particle system. In light of our evaluations of the expectation values above, we can safely assume that to the leading order Z[J] Z 0 [J]. Therefore, from Eqs. (26,29), we can conclude that An assessment of the four-dimensional expectation value on the right-hand side of the equation above, whose value is known from data, shows that it can be split as where d( x 1 − x 2 ) 1, and ϕ( x) 2 is the variance of ϕ at position x regardless of other positions, and its numerical values at different x is shown in Fig. B.6. A direct substitution of the two equations above into Eq. (30) proves that The effective free energy of the system therefore reads and the effective chemical potential of the system is equal to The driven spherical particle system is, therefore, described by a Gaussian field theory with a spatiallydependent "mass" and, hence, a spatially varying chemical potential. We have, therefore, computed the pressure, the temperature, the entropy, the effective free energy, and the effective chemical potential of an actively-driven tracer particle collective. We must again emphasize that the bacterial swarm is quite dynamical and complicated at the microscopic scale, which indicates that its tracers are also dynamical at the same scale, though exhibiting a different symmetry in terms of shape. Given the complex flow of the bacterial swarm, the novelty of our method is that it provides a simple field theoretic description of the system at the continuum scale for the tracer particles.

V. DISCUSSION
We have presented a data-driven approach to obtain the single-particle phase-space density as the solution to the stochastic dynamic equation of an active matter system, from which physical quantities such as the number density, the bulk velocity, the stress tensor, the polarization vector, the nematic tensor, and entropy can be extracted. We do not assume a particular form for the particle interactions beforehand. In other words, we pose an inverse method in which the data reveals the solution.
Should a stationary state exist, an analytic field theory can be constructed to make further predictions. In this paper, we have focused on a scalar field theory given that we were analyzing spherically-shaped particles. Our approach can be readily extended to nematic or polar particles, provided the orientation and polarization data is available. This would be interesting to do so since there are a number of nematic hydrodynamic theories addressing the onset of bacterial turbulence and the emergence of topological defects [34,35]. Interestingly, there exists a recent data-driven approach to quantitatively model bacterial swarms rooted in simultaneous measurements of the orientation and velocity fields to obtain effective parameters that can then serve as inputs into an equation of motion with an assumed form [39]. Again, here, we make no assumptions about the form of the solution.
In the experiment, the spherical particles were driven by a bacterial swarm whose motility was affected by UV light. We found that the system was in the steady-state before and after the introduction of a localized region of UV light and found that the particles obey a Gaussian field theory with a spatially-varying variance, or mass, in the language of conventional field theory. We obtained a simple equation of state in which the pressure is proportional to the density so that an effective temperature is readily identifiable, despite the particles being driven by a complex bacterial flow. We also have found that within the swarm, the spherical particles flew away from the region of UV light. We propose that this outflow is due to re-routing of some of the Serratia since those within the localized region of the light become jammed, along with any spherical particles in the region as well. The novelty of our method is in constructing a simple continuum description of a system that is assumed to be complicated at the microscopic level.
In the active matter community, there has been key work addressing the existence of an equation of state [40,41]. Specifically, when considering active fluids, one may decide to define pressure as the mean force per unit area exerted by the particles on a confining wall. Alternatively, one can define extract pressure from the trace of a bulk stress tensor, as is done here. The two definitions may not be equivalent, at least not in the generic case, given that active matter is a non-equilibrium phenomenon. Some researchers have argued that an equation of state with pressure as a state function is useless, unless one knows the specifics of the particle-wall interactions in steady-state [40]. We are able to bypass such a quandary, at least in steady-state, because various physical quantities are extracted from the data thereby containing all the information about the interactions between the particles and between the particles and their confinement. We allow the data to talk--walls and all. We note that recent work extracting the single-particle density from data for equilibrium systems has been used to directly construct a free energy [42]. What we present here is more general.
While we have focused on an active matter system with particles at the micron scale, we can readily apply our method to particles at the molecular scale. Consider, for example, an enzyme interacting with DNA. Our method also applies to systems consisting of much larger scales, such as astronomical systems, where such data-driven methods have been of interest [43]. Herein lies the power of the single-particle phase-space density and its dynamic equation and its applicability to any many-body physical system, regardless of scale. Moreover, while in this paper we have used the method on a fixed number of particles, the C term in Eq. (5) also takes into account matter creation and/or annihilation, such as cell birth/death, or protein synthesis/degradation, depending on the system of focus.
Finally, our data-driven approach can be contrasted with machine-learning methods, which are typically devoid of physical principles. It may, therefore, be tricky to use machine-learning methods to answer a physics question seeking to answer how a phenomenon occurs as compared to a classification question. Such questions can ultimately be framed as a question with a yes or no answer, such an image looking more like a cat or a dog, or a many-particle system looking more like a gas or a liquid. However, some scientists appreciate this point and are trying to integrate physics-based modeling with convention machine-learning methods [44]. This approach could ultimately prove fruitful. However, in this paper, we have been guided by the single-particle phase-space density, the stochastic dynamic equation, and its datadriven solutions in a given situation. By studying the system in different situations and under various perturbations, we can discover the system's underlying physics in non-equilibrium, which is key for living matter. Moreover, by using the data to find solutions to dynamical stochastic equations, should the form of the equations themselves evolve with time as the system adapts, our method can account for such adaptations-a hallmark property of living matter.

APPENDIX A: EXPERIMENTAL METHODS AND MATERIALS
Serratia marcesens (strain ATCC 274, Manassas, VA) were grown on agar substrates prepared by dissolving 1wt% Bacto Tryptone, 0.5 wt% yeast extract, 0.5 wt% NaCl and 0.6wt% Bacto Agar in deionized water. Melted agar was poured into petri dishes, and 2 wt% of glucose solution (25 wt%) was then added. Bacteria were then inoculated on solidified agar plates and incubated at 34 o C. Colonies formed and grew outward from the inoculation site.
As for the 2 micron polystyrene particles, they are cleaned by centrifugation and then suspended in a buffer solution (67 mM NaCl aqueous solution) with a small amount of surfactant (Tween 20, 0.03% by volume). A small aliquot of this particle solution (20 µL) is gently pipetted into the bacterial colony where the expanding colony front meets the agar. We did not notice any visible change in the bacteria swarm due to the addition of the spherical particles.
The swarm with the spherical polystyrene particles were imaged 12 -16 hours after inoculation. We used an inverted microscope (Nikon, Eclipse Ti-U) to image bacteria with the free surface facing down. Depending on the experiment either a Nikon 10× (NA = 0.3) or a Nikon 20× (NA = 0.45) objective was used. Images were gathered at 30 frames per second (fps) with a Sony XCD-SX90 camera or at 60 fps and 125 fps with a Photron Fastcam SA1.1 camera.
To mimic the exposure of bacteria to naturally occurring wide-spectrum light, we used a mercury vapour lamp as our light source. Standard epifluorescence optical components with the optical light path passing through the objective were used to focus the light on the swarm. The bare unfiltered maximum intensity (measured at 535 nm) of the lamp was reduced to lower, filtered (maximum) intensities using graded neutral density filters. Since both maximum intensites depend on the objective, care was taken to calibrate the incident intensity carefully. Using a spectrophotometer (Thorlabs, PM100D), we measured maximum intensities, with an unfiltered maximum intensity of 980 mW cm −2 (at 535 nm) for the 10× and a filtered maximum intensity of 3100 mW cm −2 (at 535 nm) for the 20× objectives. Variations in the intensity of incident light were small, as indicated by checking the incident intensity on a blank slide. The velocity distribution along either the vx or vy axis of the actively-driven particles at a fixed position. The line represents a Gaussian function with fitted parameters. It can be seen that the velocity distributions along the two axes are not the same but are both Gaussians. Therefore, the distribution is slightly different than the Maxwell-Boltzmann distribution due to existence of cross-terms.   Fig. 9. The standard error of the mean in each of stages I and II is being considered as the error of our estimation δρ. This figure shows that our estimation of ∆ρ is statistically significant.