Deep Potential: Recovering the gravitational potential from a snapshot of phase space

One of the major goals of the field of Milky Way dynamics is to recover the gravitational potential field. Mapping the potential would allow us to determine the spatial distribution of matter, both baryonic and dark, throughout the Galaxy. We present a novel method for determining the gravitational field from a snapshot of the phase-space positions of stars, based only on minimal physical assumptions, which makes use of recently developed tools from the field of deep learning. We first train a normalizing flow on a sample of observed six-dimensional phase-space coordinates of stars, obtaining a smooth, differentiable approximation of the distribution function. Using the Collisionless Boltzmann Equation, we then find the gravitational potential, represented by a feed-forward neural network, that renders this distribution function stationary. This method, which we term "Deep Potential," is more flexible than previous parametric methods, which fit restricted classes of analytic models of the distribution function and potential to the data. We demonstrate Deep Potential on mock datasets, and show its robustness under various non-ideal conditions. Deep Potential is a promising approach to mapping the density of the Milky Way and other stellar systems, using rich datasets of stellar positions and kinematics now being provided by Gaia and ground-based spectroscopic surveys.


INTRODUCTION
To know the gravitational potential of the Milky Way is to know the three-dimensional distribution of matter. Stars and gas make up most of the baryonic mass of the Galaxy. However, thus far, dark matter is only detectable through its gravitational influence. Mapping the gravitational potential in 3D is therefore key to mapping the distribution of matter, both baryonic and dark, throughout the Galaxy.
The trajectories of stars orbiting in the Milky Way are guided by gravitational forces. If it were possible to directly measure accelerations of individual stars due to the Galaxy's gravitational field, then each star's acceleration would indicate the local gradient of the gravitational potential (Quercellini et al. 2008; Ravi et al. 2019; Chakrabarti et al. 2020). However, the typical scale of these gravitational accelerations, on the order of 1 cm s⁻¹ yr⁻¹, is well below the precision achieved by current spectroscopic and astrometric instruments (Silverwood & Easther 2019).
In general, we are only able to observe a frozen snapshot of stellar positions and velocities. As the gravitational potential guides the motion of stars through phase space, it determines how the phase-space density, the "distribution function," evolves in time. Unless one invokes further assumptions, any gravitational potential is consistent with any snapshot of the distribution function, as the potential only determines the time evolution of the distribution function. A critical assumption of most dynamical modeling of the Milky Way is therefore that the Galaxy is in a steady state, meaning that its distribution function does not vary in time (Binney 2013;Bland-Hawthorn & Gerhard 2016).
State-of-the-art dynamics modeling techniques generally work with simplified analytic models of the distribution function and gravitational potential. The results produced by such techniques can only be as good as the models that are assumed. Astrometric and spectroscopic surveys, such as Gaia (Gaia Collaboration et al. 2016), LAMOST (Cui et al. 2012; Zhao et al. 2012), GALAH (De Silva et al. 2015), APOGEE (Majewski et al. 2017) and SDSS-V (Kollmeier et al. 2017), are vastly expanding the number of stars with measured phase-space coordinates, revealing that the distribution function of the Milky Way is richly structured (Antoja et al. 2018; Trick et al. 2019). This motivates us to go beyond simple parametric models. Here, we demonstrate a technique that learns highly flexible representations of both the distribution function and potential, capable of capturing the complexity of newly available data. Our method makes only minimal assumptions about the underlying physics:

1. Stars orbit in a gravitational potential, Φ(x).
2. We have observed the phase-space coordinates of a population of stars that are statistically stationary (i.e., whose phase-space distribution does not change in time).
We do not require that the mass density ρ(x) that generates the gravitational potential be consistent with the stellar population that we use as kinematic tracers. That is, the density ρ(x) may contain additional components (such as dark matter) that go beyond our kinematic tracer population.
We make use of a number of mathematical tools from the field of deep learning. We represent the distribution function using a normalizing flow, a highly flexible representation of a probability density function (Jimenez Rezende & Mohamed 2015; Kobyzev et al. 2019;Papamakarios et al. 2019). We represent the gravitational potential using a densely connected feed-forward neural network (sometimes called a "multi-layer perceptron," or MLP; see Rosenblatt 1961;Rumelhart et al. 1986;LeCun et al. 2012), which takes a 3D position vector and outputs a scalar. These mathematical tools are particularly useful for dynamical modeling, as they can be automatically differentiated using backpropagation (for an introduction to automatic differentiation, see Baydin et al. 2018). We can thus easily calculate derivatives that commonly appear in the fundamental equations of gravitational dynamics, such as the Laplacian of the potential or the phase-space gradients of the distribution function.
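As a concrete illustration of the differentiability we rely on, the following sketch (plain NumPy rather than the TensorFlow implementation used in this work; all variable names are our own) evaluates a tiny one-hidden-layer "potential network" and its analytic gradient, and verifies that gradient against finite differences. Backpropagation provides exactly this kind of derivative automatically, for networks of any size.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny "potential network": one hidden tanh layer mapping R^3 -> R.
W1 = rng.normal(size=(16, 3))
b1 = rng.normal(size=16)
W2 = rng.normal(size=16)

def phi(x):
    """Scalar potential Phi(x) for a 3-vector x."""
    return W2 @ np.tanh(W1 @ x + b1)

def grad_phi(x):
    """Analytic gradient dPhi/dx (what autodiff computes for us)."""
    h = np.tanh(W1 @ x + b1)
    return (W2 * (1.0 - h**2)) @ W1

# Compare against central finite differences at a random point.
x = rng.normal(size=3)
eps = 1e-5
fd = np.array([(phi(x + eps * e) - phi(x - eps * e)) / (2 * eps)
               for e in np.eye(3)])
```

The same pattern extends to the Laplacian of the potential and to phase-space gradients of the distribution function.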
Our method for recovering the potential, which we term "Deep Potential," can be summarized as follows: we first train a normalizing flow to represent the distribution of the observed phase-space coordinates of the stars, and then find the gravitational potential that renders this distribution stationary, subject to the constraint that matter density must be positive. This paper builds on the short workshop paper Green & Ting (2020), providing a more pedagogical explanation of Deep Potential (Section 2), demonstrating its performance on toy systems with spherical symmetry and axisymmetry (Section 3), and testing its performance under various non-ideal conditions: observational errors, selection functions and non-stationarity (Section 4). Finally, we discuss prospects for further development of Deep Potential and its application to real data (Section 5).

METHOD
Our first assumption is that stars orbit in a background gravitational potential, Φ(x). The density of an ensemble of stars in six-dimensional phase space (position x and velocity v) is referred to as the distribution function, f(x, v). Liouville's theorem states that the total derivative of the distribution function of a collisionless system (in which the particles are not scattered by close-range interactions) is zero along the orbits of the particles. For an ensemble of particles orbiting in a gravitational potential, this leads to the Collisionless Boltzmann Equation:

df/dt = ∂f/∂t + v · ∂f/∂x − ∇Φ · ∂f/∂v = 0 .  (1)

Figure 1. Overview of our method. Using observed phase-space information, we train a normalizing flow to represent the distribution function, f(x, v). We then calculate the gradients of f at a large number of points in phase space. We represent the gravitational potential, Φ(x), by a feed-forward neural network. We update the neural network until the gradients of the potential and the distribution function satisfy the Collisionless Boltzmann Equation (CBE) for a stationary system at the sampled points in phase space. We heavily penalize solutions for which ∇²Φ < 0, which would correspond to negative matter densities.
Our second assumption, that the distribution function is stationary, implies that the density in any given region of phase space is constant in time: ∂f/∂t = 0. This assumption links gradients of the distribution function to gradients of the gravitational potential:

∂f/∂t = −v · ∂f/∂x + ∇Φ · ∂f/∂v = 0 .  (2)

Once we can describe the distribution function of a stationary system, in all physically plausible cases, the gravitational potential can be uniquely determined (up to an additive constant) by solving the Collisionless Boltzmann Equation (see Appendix A, as well as An et al. 2021). Realistic physical systems will not be completely stationary, and as such, there may not exist any potential which would render the system stationary. In general, therefore, Deep Potential recovers the potential which minimizes the amount of non-stationarity in the system (using a measure that will be discussed below). Fig. 1 gives a graphical overview of Deep Potential.

Note that we do not assume that the gravitational potential is sourced by the observed stellar population alone. Accordingly, we do not impose the condition ∇²Φ = 4πG ρ⋆, where ρ⋆ is the mass density of the observed tracer population. Additional mass components beyond the observed stellar population, such as unobserved stars or dark matter, may also contribute to the gravitational field. It is even possible, within our formalism, for the mass density corresponding to the recovered gravitational potential to be less than the mass density of the observed tracer population. We treat the observed stars merely as test masses, whose kinematics we can use to map the background potential.
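To make the stationarity condition concrete, the following toy check (illustrative only, not part of our pipeline) verifies numerically that an isothermal distribution in a one-dimensional harmonic potential satisfies ∂f/∂t = −v ∂f/∂x + (dΦ/dx) ∂f/∂v = 0, using finite differences in place of the auto-differentiation employed in this work:

```python
import numpy as np

# Stationary 1D toy system: Phi(x) = 0.5 * omega^2 * x^2, with an
# isothermal distribution function f ~ exp(-E / sigma^2).
omega, sigma = 2.0, 0.7

def f(x, v):
    E = 0.5 * v**2 + 0.5 * omega**2 * x**2
    return np.exp(-E / sigma**2)

def df_dt(x, v, eps=1e-5):
    """CBE estimate of df/dt = -v df/dx + (dPhi/dx) df/dv,
    with central finite differences for the f gradients."""
    df_dx = (f(x + eps, v) - f(x - eps, v)) / (2 * eps)
    df_dv = (f(x, v + eps) - f(x, v - eps)) / (2 * eps)
    return -v * df_dx + (omega**2 * x) * df_dv

# At arbitrary phase-space points, the stationary f gives df/dt ~ 0.
residual = max(abs(df_dt(x, v))
               for x, v in [(0.3, -1.2), (1.1, 0.4), (-0.8, 0.9)])
```

A non-stationary f (e.g., a displaced Gaussian) would instead give residuals of order the distribution-function gradients themselves.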

Modeling the distribution function
In practice, when we observe stellar populations, we obtain a discrete sample of points in phase space, which we will refer to as {x, v}. We do not directly observe the smooth distribution function from which the points are drawn, f(x, v). Typical methods of surmounting this difficulty include fitting a simple parametric model of the distribution function to the observed sample of phase-space locations of stars, often in action space (e.g., Binney 2010; Ting et al. 2013; McMillan & Binney 2013; Piffl et al. 2014; Trick et al. 2016), or working with moments of the distribution function ("Jeans modeling," e.g., Gnedin et al. 2010; Hagen & Helmi 2018). Here, we instead represent the distribution function using a highly flexible tool from the field of deep learning, known as a "normalizing flow" (for a review, see Kobyzev et al. 2019; Papamakarios et al. 2019). A normalizing flow represents an arbitrary probability density function p(z) as a coordinate transformation of a simpler distribution. For example, we might begin with a unit normal distribution in the space z′. Given a bijection (i.e., an invertible coordinate transformation) between z′ and z, we can transform this simple distribution in z′ into a possibly complicated distribution in z:

p(z) = p′(z′(z)) |det(∂z′/∂z)| ,

where p′ is the simple base distribution. The key is to find a parameterized class of highly flexible, nonlinear, bijective transformations, for which the Jacobian can be efficiently calculated. In this paper, we call the parameters governing this family of coordinate transformations ϕ, and we refer to the resulting probability density function in z as p_ϕ(z). Given a set of points {z} that are drawn from an unknown distribution p(z), we can then search for the parameters ϕ that maximize the likelihood of the points. This then yields a smooth approximation of the distribution from which the points {z} were drawn. The "training" of the normalizing flow usually proceeds through stochastic gradient descent using batches of points.
There are several classes of bijective transformations commonly used in normalizing flows; an overview can be found in Kobyzev et al. (2019). In this work, we use FFJORD (Grathwohl et al. 2018), though the detailed choice of which class of transformations to use is not critical to our method. Research into normalizing flows is proceeding rapidly, and as better methods become available, they can be incorporated into our method as a drop-in replacement. We give a more detailed description of our normalizing flow implementation in Section 2.4. In order to obtain a smooth approximation to the distribution function, we train a normalizing flow on an ensemble of stars with measured phase-space information. We refer to the normalizing flow as f_ϕ(x, v), where ϕ refers to the parameters controlling the coordinate transformation in the flow. The best-fit flow parameters are those that maximize the likelihood of the phase-space coordinates of the stars:

ϕ* = argmax_ϕ Σᵢ ln f_ϕ(xᵢ, vᵢ) .

We thus obtain an approximation to the distribution function, f_ϕ*. The great advantage of using a normalizing flow is that our representation is both highly flexible and auto-differentiable. When implemented in a standard deep learning framework, such as Tensorflow (Abadi et al. 2015) or PyTorch (Paszke et al. 2019), it is possible to automatically differentiate the distribution function at arbitrary points in phase space, in order to obtain the terms ∂f/∂x and ∂f/∂v in the Collisionless Boltzmann Equation.
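The change-of-variables formula at the heart of a normalizing flow can be illustrated in one dimension. The sketch below (our own toy example; FFJORD uses far more expressive, learned transformations) pushes a unit normal base density through the fixed bijection z = sinh(z′) and checks that the resulting density is still properly normalized:

```python
import numpy as np

# Base density: z' ~ N(0, 1). Bijection: z = sinh(z'), so
# z' = asinh(z) and |dz'/dz| = 1 / sqrt(1 + z^2).
def p_base(zp):
    return np.exp(-0.5 * zp**2) / np.sqrt(2.0 * np.pi)

def p_flow(z):
    """Density induced in z by the change-of-variables formula."""
    return p_base(np.arcsinh(z)) / np.sqrt(1.0 + z**2)

# The transformed density integrates to unity (trapezoidal rule).
z = np.linspace(-60.0, 60.0, 200001)
pz = p_flow(z)
integral = float(np.sum(0.5 * (pz[1:] + pz[:-1]) * np.diff(z)))
```

In a real flow, the transformation is parameterized and its parameters are tuned by maximizing the likelihood of the observed points, rather than being fixed in advance.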

Modeling the gravitational potential
After learning the distribution function, we find the gravitational potential Φ(x) that renders the distribution function stationary. The distribution is stationary when Eq. (2) is satisfied everywhere in phase space. We parameterize the gravitational potential as a feed-forward neural network, which takes a 3-vector, x, and returns a scalar, Φ. We denote the trainable parameters of this network (i.e., the weights and biases) by θ, and the resulting approximation function as Φ_θ(x). For any given value of θ, we can calculate the non-stationarity of our approximation of the distribution function at any arbitrary point (x, v) in phase space:

∂f_ϕ*/∂t = −v · ∂f_ϕ*/∂x + ∇Φ_θ · ∂f_ϕ*/∂v .

This is essentially a variational method, in which our Ansatz for the gravitational potential is a neural network Φ_θ, and in which we vary the parameters θ to minimize the non-stationarity of the distribution function.
We require a measure of the non-stationarity of the distribution function throughout all of phase space, which we can then use to find the optimal gravitational potential. We also require that the matter density be non-negative everywhere in space. By Poisson's equation, which links the potential to the density, this implies that ∇²Φ ≥ 0. Rather than strictly enforcing this condition, we will penalize solutions that include negative matter densities. There are different ways that one could quantify the total non-stationarity and extent of negative matter density of a system, but we find it useful to consider measures that have the following form:

∫ L(∂f/∂t, ∇²Φ) W(x, v) d³x d³v ,

where L is a function that takes the raw values of ∂f/∂t and ∇²Φ (both of which depend on the potential, and thus on the parameters θ) at a given point and returns a positive scalar penalty, and W determines how different points in phase space are weighted. In this work, we choose

L = asinh(|∂f/∂t| / α) + λ asinh(max(−∇²Φ, 0) / β) .

The first term penalizes non-stationarity, while the second term penalizes negative masses. We first take the absolute value of ∂f/∂t, in order to penalize positive and negative changes in the phase-space density equally. The inverse hyperbolic sine function down-weights large values, while the constant α sets the level of non-stationarity at which our penalty transitions from being approximately linear to being approximately logarithmic. This down-weighting of large non-stationarities is not strictly necessary, but in our experiments leads to more consistent results. In our toy models, typical distribution-function gradients can vary by orders of magnitude between different regions of phase space. Down-weighting large non-stationarities prevents small regions of phase space with large distribution-function gradients from dominating the fit. The second term penalizes negative mass densities, again using the inverse hyperbolic sine to down-weight large values.
The constant β sets the negative mass density at which this penalty transitions from linear to logarithmic. The hyperparameter λ sets the relative weight given to the non-stationarity and negative-mass penalties (we choose λ = 1 in this work). We set the weight function W(x, v) to be equal to our approximation of the distribution function. Thus,

L(θ) = ∫ [ asinh(|∂f/∂t| / α) + λ asinh(max(−∇²Φ, 0) / β) ] f_ϕ*(x, v) d³x d³v .

This integral is computationally expensive to evaluate directly, but it can be approximated by averaging the value of the bracketed term over N samples in phase space drawn from the distribution function:

L(θ) ≈ (1/N) Σᵢ [ asinh(|∂f/∂t|ᵢ / α) + λ asinh(max(−∇²Φ(xᵢ), 0) / β) ] .  (12)

We find the gravitational potential parameters, θ*, that minimize the above loss function:

θ* = argmin_θ L(θ) .

Finally, in order to regularize the potential and prevent over-fitting, it is possible to add a term to L(θ) that penalizes complicated potentials. As we implement Φ_θ(x) as a neural network, one obvious possibility is to penalize large weights in the network. For example, one might choose to add an L2 weight regularization penalty to the logarithm of the loss:

ln L′(θ) = ln L(θ) + (ϵ₂ / N_w) Σ_w w² ,  (13)

where {w} are the neural-network weights contained in the parameters θ, N_w is the number of weights, and ϵ₂ is a small constant that sets the strength of the penalty. In order to optimize our model of the potential, we begin by drawing a large number of phase-space points from our approximation to the distribution function, and calculating the gradients ∂f_ϕ*/∂x and ∂f_ϕ*/∂v at each point. Given a value of θ, this allows us to calculate ∂f_ϕ*/∂t at any of these points in phase space. We randomly initialize the gravitational potential parameters, θ, and then use stochastic gradient descent on batches of our phase-space points to minimize the loss and obtain θ*.
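The per-point penalty and its Monte Carlo average can be sketched in a few lines. This is our own minimal re-implementation of the loss as described in the text; the released code may differ in details such as batching and constants:

```python
import numpy as np

def penalty(df_dt, lap_phi, alpha=1.0, beta=1.0, lam=1.0):
    """Per-point loss: asinh-damped non-stationarity term plus a
    penalty on negative implied mass density (lap_phi = laplacian
    of the potential; G = 1 units)."""
    stationarity = np.arcsinh(np.abs(df_dt) / alpha)
    negative_mass = np.arcsinh(np.maximum(-lap_phi, 0.0) / beta)
    return stationarity + lam * negative_mass

# Monte Carlo estimate of the loss: an average over phase-space
# points drawn from the (approximate) distribution function.
df_dt_samples = np.array([0.0, 0.5, -2.0, 10.0])
lap_samples = np.array([1.0, 0.2, -0.5, 3.0])
loss = float(penalty(df_dt_samples, lap_samples).mean())
```

Note that asinh(y/α) ≈ y/α for y ≪ α and grows only logarithmically for y ≫ α, which is exactly the down-weighting of large residuals described above.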
Once the potential is obtained, forces and densities can be computed from its gradient and Laplacian, respectively. Because the potential is represented by a neural network, these derivatives can be calculated using auto-differentiation.
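As a sanity check of the "density from the Laplacian" step, the snippet below recovers the analytic density of a Plummer sphere (introduced in Section 3) from second differences of its potential. In Deep Potential itself the Laplacian is obtained by auto-differentiation of the network, not by finite differences; this is only an illustration of the Poisson-equation relation ρ = ∇²Φ / 4πG (with G = 1):

```python
import numpy as np

def phi_plummer(x, M=1.0, a=1.0):
    """Plummer potential, G = 1 units."""
    return -M / np.sqrt(np.dot(x, x) + a * a)

def laplacian(f, x, h=1e-3):
    """Central-difference Laplacian of a scalar field f at point x."""
    return sum((f(x + h * e) - 2.0 * f(x) + f(x - h * e)) / h**2
               for e in np.eye(3))

x = np.array([0.5, 0.0, 0.0])
rho_num = laplacian(phi_plummer, x) / (4.0 * np.pi)
# Analytic Plummer density (M = a = 1): 3/(4 pi) (1 + r^2)^(-5/2).
r2 = float(np.dot(x, x))
rho_true = 3.0 / (4.0 * np.pi) * (1.0 + r2)**(-2.5)
```

The same three lines of finite differencing also provide a quick independent check on densities returned by the trained potential network.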

Additional considerations
Before discussing our implementation of Deep Potential, we offer a few considerations on the approach sketched out above.
First, our stationarity assumption is frame-dependent. In the formulation we have given above, the frame of reference must be fixed at the outset (for example, one might choose a frame that is at rest with respect to the Galactic Center, or which is rotating with the pattern speed of the Galactic bar). However, it is possible to generalize our method by assuming that the distribution function is stationary in an arbitrary reference frame that is moving (or rotating) with respect to the frame in which the positions and velocities of the kinematic tracers have been measured. Instead of asserting that ∂f/∂t = 0, we can assert that the distribution function appears stationary for observers following a certain set of trajectories, defined by a spatially dependent velocity field u(x). This leads to the condition

∂f/∂t + u · ∂f/∂x + u̇ · ∂f/∂v = 0 ,

where u̇ is the acceleration along the trajectory. Inserting this condition into the Collisionless Boltzmann Equation (Eq. 1), we obtain a modified version of our stationarity condition (Eq. 2):

(v − u) · ∂f/∂x − (∇Φ + u̇) · ∂f/∂v = 0 .

This stationarity condition is nearly identical to Eq. (2), except for two modifications: a velocity shift (v → v − u), and a shift in the potential gradients (∇Φ → ∇Φ + u̇) related to fictitious forces felt by an accelerating observer. The form of the velocity field u(x) determines the change of reference frame. A frame of reference that is moving with constant speed with respect to the standard coordinates (x, v) is defined by a constant u(x) = u₀, with time derivative u̇ = 0. A frame of reference that is rotating with angular velocity vector Ω about an axis passing through the point x₀ is defined by u(x) = Ω × (x − x₀). The parameters that define the change of frame, such as u₀, x₀ and Ω, can be either fixed or treated as free parameters. If they are treated as free parameters, they can be varied at the same time as the potential is fit, so as to find the reference frame in which non-stationarity is minimized.
The Ansatz is then that the system is stationary in some reference frame that is not determined at the outset, but which is related to the measurement frame by a simple transformation.
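The frame transformation reduces to a few vector operations. A minimal sketch (our own notation; for a purely translating frame the u̇ term vanishes) is:

```python
import numpy as np

def u_field(x, u0, Omega, x0):
    """Velocity field of a frame translating at u0 and rotating with
    angular velocity Omega about an axis through x0."""
    return u0 + np.cross(Omega, x - x0)

def u_dot(v, Omega):
    """Rate of change of u along a trajectory with velocity v: for the
    rotating part, du/dt = Omega x (dx/dt)."""
    return np.cross(Omega, v)

# Example: pure rotation about the z-axis through the origin.
Omega = np.array([0.0, 0.0, 1.0])
x = np.array([1.0, 0.0, 0.0])
u = u_field(x, np.zeros(3), Omega, np.zeros(3))
acc = u_dot(u, Omega)  # centripetal term for a co-rotating tracer
```

In the modified stationarity condition, one simply replaces v by v − u and ∇Φ by ∇Φ + u̇ at each sampled phase-space point.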
A second consideration is that we assume that the kinematic tracers are observed in all six phase-space dimensions. While it is possible to obtain full phase-space information for millions of stars in the Milky Way, for many extragalactic systems, one or more phase-space dimensions (such as line-of-sight position within the system or proper motion) are missing. One approach to dealing with missing phase-space dimensions would be to fit the distribution function in the lower-dimensional space, and then to recover the missing dimensions with symmetry assumptions (e.g., in the absence of proper motions, assume that velocities are isotropically distributed).
Third, we do not impose any symmetries on the gravitational potential. However, it is possible to impose simple symmetries. For example, one could impose axisymmetry on the potential by only feeding cylindrical radius, R, and height, z, into the neural network that represents it: Φ(R, z). It is also possible to favor, without strictly imposing, symmetries. For example, one could favor axisymmetry by representing the potential as a sum of an axisymmetric component and a regularized component without axisymmetry.
We leave deeper consideration of these issues to future work, and focus here on demonstrating Deep Potential with full phase-space data in a fixed frame of reference and no assumed symmetries in the potential.

Implementation
In this paper, we implement Deep Potential in Tensorflow 2. All of our code is publicly available, under a permissive license that allows re-use and modification with attribution. 1 To represent the distribution function, we use a chain of three FFJORD normalizing flows (Grathwohl et al. 2018), each with 3 densely connected hidden layers of 128 neurons and a tanh activation function. For our base distribution, we use a multivariate Gaussian distribution with mean and variance along each dimension set to match the training dataset. 2 During training, we impose Jacobian and kinetic regularization with strength 10⁻⁴ (Finlay et al. 2020), which penalizes overly complex flow models and tends to reduce training time. We train our flows using the Rectified Adam optimizer (Liu et al. 2019), with a batch size of 2¹³ (8192). We find that this relatively large batch size leads to faster convergence (in wall time) than more typical, smaller batch sizes. We begin the training with a "warm-up" phase that lasts 2048 steps, in which the learning rate linearly increases from 0 to 0.005. Thereafter, we use a constant learning rate. We decrease the learning rate by a factor of two whenever the training loss fails to decrease 0.01 below its previous minimum for 2048 consecutive steps (this period is termed the "patience"). We terminate the training when the loss ceases to decrease by 0.01 over the course of several consecutive drops in learning rate. In our experiments with mock datasets (see Section 3), these settings allow us to accurately recover a range of different distribution functions with varying levels of complexity (e.g., with symmetries or caustics).
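The learning-rate schedule described above (linear warm-up, then patience-based halving) can be sketched as follows. This is an illustrative re-implementation in plain Python, not the exact training code:

```python
class WarmupPlateauSchedule:
    """Linear warm-up to lr_max over `warmup` steps; afterwards, halve
    the learning rate whenever the loss fails to improve on its best
    value by `min_delta` for `patience` consecutive steps."""

    def __init__(self, lr_max=0.005, warmup=2048, patience=2048,
                 min_delta=0.01):
        self.lr_max, self.warmup = lr_max, warmup
        self.patience, self.min_delta = patience, min_delta
        self.lr = 0.0
        self.best = float("inf")
        self.stale = 0
        self.step_num = 0

    def step(self, loss):
        self.step_num += 1
        if self.step_num <= self.warmup:
            # Warm-up phase: ramp linearly from 0 to lr_max.
            self.lr = self.lr_max * self.step_num / self.warmup
        if loss < self.best - self.min_delta:
            self.best, self.stale = loss, 0
        else:
            self.stale += 1
            if self.step_num > self.warmup and self.stale >= self.patience:
                self.lr *= 0.5  # plateau detected: halve the rate
                self.stale = 0
        return self.lr

sched = WarmupPlateauSchedule()
lr_first = sched.step(loss=10.0)  # first warm-up step: lr_max / 2048
```

Stopping ("loss ceases to decrease over several consecutive drops") can be layered on top by counting halvings without improvement.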
After training our normalizing flow, we draw 2²⁰ (∼1 million) phase-space coordinates, and calculate the gradients ∂f_ϕ*/∂x and ∂f_ϕ*/∂v at each point (using auto-differentiation), for use in learning the gravitational potential.
We represent the gravitational potential using a feed-forward neural network with 4 densely connected hidden layers, each with 512 neurons and a tanh activation function (we eschew more commonly used activation functions, such as ReLU, which have discontinuous derivatives, as these may lead to unphysical potentials). The network takes a 3-dimensional input (the position x in space), and produces a scalar output (the potential). We add an L2 regularization loss (see Eq. 13) with strength ϵ₂ = 0.1. We train the network using the Rectified Adam optimizer, with batches of 2¹⁴ (16,384) phase-space coordinates. We use a similar learning-rate scheme as before, with a warm-up phase lasting 2048 steps, an initial learning rate of 0.001, and a patience of 2048 steps. We set the parameters α, β and λ in Eq. (12), which control the penalties on non-stationarity and negative gravitational mass densities, to unity.
When fitting both the distribution function and the gravitational potential, we reserve 25% of our input data as a validation set. After each training step, we calculate the loss on a batch of validation data, in order to identify possible overfitting to the training data. Such overfitting would manifest itself as a significantly lower training loss than validation loss. In the experiments in this paper, no significant overfitting is observed: the difference in the likelihoods of the training and validation sets is typically less than 1%.
The choices made here are by no means set in stone, and can be altered without changing the overall Deep Potential framework. In particular, rapid advances have been made in research into normalizing flows over the preceding years. As more accurate and/or computationally efficient flows are developed, they can be used by Deep Potential.

DEMONSTRATIONS
We now evaluate the performance of our method on a number of toy physical systems in which the potential is known exactly. We will begin with ideal cases, and then in Section 4, we will show how our method performs in the presence of non-stationarity, observational errors and selection functions.

Plummer sphere
We begin with the Plummer sphere (Plummer 1911), a self-gravitating system with a spherically symmetric density and corresponding gravitational potential, given by

ρ(r) = (3M / 4πa³) (1 + r²/a²)^(−5/2) ,    Φ(r) = −GM / √(r² + a²) ,

where M is the total mass and a is the scale radius. The Plummer sphere admits a stationary distribution function with an isotropic velocity distribution, in which the distribution function is only a function of the energy E of the tracer particle (for simplicity, we set mass m = 1 for all of the particles):

f(E) ∝ (−E)^(7/2) for E < 0, with f = 0 for E ≥ 0.

Figure 3. The bottom-left panel shows the density residuals (estimates minus truth) at these same points. We accurately recover the true potential and density over a wide range of distances. Right two panels: our recovered potential (middle) and matter density (right) in the 2D plane defined by z = 0. The striped patterns in the recovered density field (in the right panel) are small fluctuations in regions of negligible density, as can be seen from the overplotted contours of true density.
We generate mock data by drawing 2¹⁹ (524,288) phase-space coordinates from the above distribution function. We truncate the distribution at a radius of r = 10, as in practice, we find that rare particles very far from the origin can sometimes cause large excursions during the training of our normalizing flow. 3 Using these coordinates as input data, we first train a normalizing flow to represent the distribution function. Our results are shown in Fig. 2.

Figure 4. Comparison of the spatial and velocity gradients of our normalizing flow with those of the true Plummer distribution function, at points drawn from the distribution function. Each column is a histogram of the residuals at a given value of ∂f/∂x (left panel) or ∂f/∂v (right panel). Because the Plummer distribution function is spherically symmetric in space and isotropic in velocity, we group the three spatial gradients together in the left panel, and the three velocity gradients together in the right panel. We generally estimate gradients to better than 10%, which we find to be sufficient to accurately recover the gravitational potential.

We then draw 2²⁰ (1,048,576) phase-space points from our normalizing flow, and calculate the gradients, ∂f_ϕ*/∂x and ∂f_ϕ*/∂v, at each point. We use these samples to fit the gravitational potential. Our results are shown in Fig. 3. We accurately recover both the gravitational potential and the matter density that sources it over a wide range of radii.
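The Plummer mock-data generation can be sketched with standard inverse-CDF radii and rejection-sampled speeds (the classic recipe of Aarseth, Hénon & Wielen 1974). This is an illustrative stand-in for our actual data-generation code, in G = M = a = 1 units and without the r = 10 truncation:

```python
import numpy as np

def sample_plummer(n, a=1.0, M=1.0, rng=None):
    """Draw n phase-space samples from an isotropic Plummer sphere."""
    rng = np.random.default_rng(rng)
    # Radii: invert the enclosed-mass fraction u = r^3/(r^2+a^2)^(3/2).
    u = rng.uniform(size=n)
    r = a / np.sqrt(u**(-2.0 / 3.0) - 1.0)
    # Isotropic position directions.
    x = rng.normal(size=(n, 3))
    x *= (r / np.linalg.norm(x, axis=1))[:, None]
    # Speeds: q = v/v_esc has density ~ q^2 (1 - q^2)^(7/2);
    # rejection-sample under a constant envelope of 0.1.
    q = np.empty(n)
    filled = 0
    while filled < n:
        q_try = rng.uniform(size=n - filled)
        y = rng.uniform(size=n - filled) * 0.1
        ok = y < q_try**2 * (1.0 - q_try**2)**3.5
        m = int(ok.sum())
        q[filled:filled + m] = q_try[ok]
        filled += m
    v_esc = np.sqrt(2.0 * M) * (r**2 + a**2)**(-0.25)
    v = rng.normal(size=(n, 3))
    v *= (q * v_esc / np.linalg.norm(v, axis=1))[:, None]
    return x, v

x, v = sample_plummer(100000, rng=0)
```

A quick consistency check: the half-mass (median) radius of a Plummer sphere is a(2^(2/3) − 1)^(−1/2) ≈ 1.305 a, and every sampled speed lies below the local escape speed.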
We find that the performance of our method depends sensitively on the accuracy with which we recover the gradients of the distribution function. Because the Plummer sphere has an analytic distribution function, we can calculate the true gradients exactly, and compare them with the estimate we obtain from our normalizing flow. As we show in Fig. 4, we generally recover both spatial and velocity gradients to better than 10% for our Plummer sphere toy model.
The computational time of our entire method is dominated by the training of the normalizing flow (which can range from a few hours to two days on one Nvidia A100 GPU, depending on the complexity of the distribution function) and the calculation of distribution function gradients at the sampled points. Training the potential takes comparatively little time (less than one hour on a single GPU).

Miyamoto-Nagai disk
The Plummer sphere studied above is a highly simplified system, with spherical symmetry and isotropically distributed velocities. In order to study a slightly more complicated system, we consider a population of kinematic tracers orbiting on near-circular orbits in a simple disk potential. Unlike in our previous example, this is not a self-gravitating system. That is, the kinematic tracer particles orbit in a fixed background potential, which is generated by an unseen mass distribution.
We make use of the Miyamoto-Nagai potential (Miyamoto & Nagai 1975), an axisymmetric disk potential given in cylindrical coordinates by

Φ(R, z) = −GM / √( R² + ( a + √(z² + b²) )² ) .

The Miyamoto-Nagai potential corresponds to a flattened disk, with thickness controlled by the ratio b/a. In the limit that b/a → 0, the potential reduces to the Kuzmin disk (Kuzmin 1956; Toomre 1963), which is the potential of a disk with infinitesimal thickness and surface density Σ(R) proportional to (R² + a²)^(−3/2). In the limit that a → 0, the potential reduces to that of the Plummer sphere. We choose a = 1, b = 0.1, representing a highly flattened disk. In order to obtain a sample of stationary kinematic tracers, we initialize 2¹⁹ (524,288) particles randomly, and then numerically integrate their trajectories for many dynamical times to allow them to phase-mix. While the initial distribution is non-stationary, the final distribution should be approximately stationary.

Figure 6. From left to right, residuals (recovered minus ideal) of the potential, z-, R- and φ-components of force, and density in the midplane (z = 0) of our Miyamoto-Nagai disk, as recovered using ideal mock data. The force residuals are divided by the norm of the true force vector at each location, while the density residuals are divided by the true density at each location. In the left and right panels, we overplot potential and density isocontours, respectively.

We draw the initial particle positions from an axisymmetric double-exponential density distribution, with h_R = 1, h_z = 0.1. As the potential is axisymmetric, we do not need to draw the azimuthal angle φ in order to integrate orbits. We integrate the particle trajectories in (R, z, v_φ, v_R, v_z), and then assume that φ is uniformly distributed.

Figure 7. Left panels, from top to bottom: residuals (recovered minus ideal) of the potential, z-component of force, and density of our ideal Miyamoto-Nagai disk, evaluated at the locations of the kinematic tracers. In each bin along the x-axis (of potential, force or density), we show a histogram of the residuals. In regions where we have kinematic tracers, we typically recover vertical forces and densities to within 10%. Right panels, from top to bottom: residuals of the potential, z- and R-components of force, and density of our ideal Miyamoto-Nagai disk, averaged over azimuth. In the top three panels on the right, we overplot isocontours of the density of our kinematic tracers, in logarithmic steps of 10×. In the bottom-right panel, we overplot isocontours of the density that sources the gravitational potential (∇²Φ / 4πG). Note that in regions of very low density, small absolute density residuals lead to large fractional density residuals. However, in regions of high density, we generally recover density to within ∼10% of the true value.

We initialize the particles on nearly circular orbits, with small initial velocity dispersions in the radial,
vertical and transverse (azimuthal) directions. In detail, we draw the three components of the initial velocity as follows: where δ i ∼ N (0, 1), and v c (R) is the circular velocity in the midplane at the particle's initial orbital radius, R. To ensure thorough phase mixing, we integrate each particle for approximately 128 dynamical times. We estimate the dynamical time for each particle as the maximum of three timescales: the orbital, epicyclic and vertical periods. In  Figure 8. Fitting a simple analytic potential to four different Miyamoto-Nagai mock datasets. From left to right, the panels show the results for mock data with: ideal conditions, an applied selection function, measurement noise, and non-stationarity. For each dataset, we plot the loss ln L (a measure of the non-stationarity of the distribution function under the assumed potential) as a function of the Miyamoto-Nagai disk potential parameters, a and b, computed using the distribution function gradients obtained from our normalizing flow. The true parameter values (a = 1, b = 0.1) are denoted by white crosshairs, while the parameter values that yield the lowest loss are denoted by a white dot. In all cases, we find close agreement between the true and minimum-loss parameter values, indicating that our normalizing flows provide sufficiently accurate distribution function gradients to tightly constrain the potential, and that the method is robust to various forms of non-ideal data. For ideal mock data, we recover best-fit parameter values of (a = 0.996, b = 0.104). The largest discrepancy from the true parameter values is found for the non-stationary case, in which our best-fit parameters are (a = 1.013, b = 0.112).
order to accurately trace the particle trajectories, the integration timestep for each particle is set to 1 /16 th of the minimum of the preceding three timescales. For the Miyamoto-Nagai disk with our chosen parameters, the longest of the three timescales is always the orbital period, while the shortest timescale is always the vertical period. The particle trajectories are integrated using a 6 th -order symplectic method (Yoshida 1990;Kinoshita et al. 1991), which is implemented by the galpy Python package (Bovy 2015). For more than 99.9% of the trajectories, energy is conserved to within 0.01%, indicating that integration errors are negligible.
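For readers implementing a similar pipeline, the integration-plus-diagnostic step can be sketched in a few lines of self-contained code. The illustration below is our own, not the paper's: the paper uses galpy's 6th-order symplectic method, whereas this sketch uses a simple 2nd-order kick-drift-kick leapfrog (also symplectic, so the energy error stays bounded), with illustrative step sizes and orbit parameters, and applies the same energy-conservation check quoted in the text.

```python
import numpy as np

G, M, a, b = 1.0, 1.0, 1.0, 0.1  # Miyamoto-Nagai parameters as in the mock data

def mn_potential(pos):
    x, y, z = pos
    s = a + np.sqrt(z**2 + b**2)
    return -G * M / np.sqrt(x**2 + y**2 + s**2)

def mn_accel(pos):
    x, y, z = pos
    zeta = np.sqrt(z**2 + b**2)
    s = a + zeta
    d3 = (x**2 + y**2 + s**2) ** 1.5
    return np.array([
        -G * M * x / d3,
        -G * M * y / d3,
        -G * M * z * s / (zeta * d3),
    ])

def energy(pos, vel):
    return 0.5 * np.dot(vel, vel) + mn_potential(pos)

def leapfrog(pos, vel, dt, n_steps):
    # kick-drift-kick leapfrog: 2nd-order and symplectic, so the energy
    # error oscillates instead of drifting secularly
    acc = mn_accel(pos)
    for _ in range(n_steps):
        vel = vel + 0.5 * dt * acc
        pos = pos + dt * vel
        acc = mn_accel(pos)
        vel = vel + 0.5 * dt * acc
    return pos, vel

# near-circular orbit at R = 1, slightly above the midplane
pos0 = np.array([1.0, 0.0, 0.02])
v_c = np.sqrt(-pos0[0] * mn_accel(pos0)[0])   # circular speed from radial force
vel0 = np.array([0.0, v_c, 0.0])

T_orb = 2 * np.pi * pos0[0] / v_c             # rough orbital period
pos1, vel1 = leapfrog(pos0, vel0, dt=T_orb / 512, n_steps=10 * 512)

rel_energy_error = abs(energy(pos1, vel1) - energy(pos0, vel0)) / abs(energy(pos0, vel0))
```

A higher-order scheme such as Yoshida's composes several leapfrog sub-steps with tuned coefficients; for production work, using galpy's built-in integrators (as the paper does) is the simpler choice.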
We feed these particle positions and velocities into our Deep Potential machinery, first learning a normalizing flow representation of the distribution function, and then finding the potential that renders the recovered distribution function stationary. An advantage of our method is that it can be applied to any system for which our assumptions hold (gravitational dynamics, stationarity and non-negative gravitational mass - though this last assumption can be dropped). We thus recover the gravitational potential of our Miyamoto-Nagai disk system using exactly the same code, without modification, as we use to recover the potential of the Plummer sphere.
Our resulting gravitational potential and gravitational mass density (ρ = ∇²Φ/(4πG)) are shown in Fig. 5. Fig. 6 compares the potential, gravitational forces (F = −∇Φ) and gravitational mass density that we recover in the midplane (z = 0) of the disk to the true values in the Miyamoto-Nagai potential. We recover the potential to within ∼0.1% and forces to within a few percent. The typical scale of radial force errors in the midplane, ∼1%, corresponds to errors in circular velocity of ∼0.5%. Within the regions of the midplane where ρ > 0.1 ρ_max, we recover density to within ∼10%. This follows a general trend that we observe across our tests of Deep Potential with mock data: we typically recover the potential to high precision, while precision decreases with each successive derivative of the potential (forces depend on first derivatives of the potential, while density depends on second derivatives). The right panel of Fig. 7 shows the residuals of these same quantities in the (R, z)-plane, averaged over azimuth, while the left panel shows the residuals of the gravitational potential, vertical force and density at the locations of the kinematic tracers.
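The relation ρ = ∇²Φ/(4πG) can be sanity-checked directly on the analytic model. The sketch below is our own illustration (the paper obtains the Laplacian by auto-differentiating a neural network, whereas this sketch uses finite differences on the closed-form potential): it compares a numerical Laplacian of the Miyamoto-Nagai potential against the closed-form Miyamoto-Nagai density (Miyamoto & Nagai 1975).

```python
import numpy as np

G, M, a, b = 1.0, 1.0, 1.0, 0.1

def mn_potential(x, y, z):
    s = a + np.sqrt(z**2 + b**2)
    return -G * M / np.sqrt(x**2 + y**2 + s**2)

def mn_density(x, y, z):
    # closed-form Miyamoto-Nagai density
    zeta = np.sqrt(z**2 + b**2)
    num = a * (x**2 + y**2) + (a + 3 * zeta) * (a + zeta) ** 2
    den = (x**2 + y**2 + (a + zeta) ** 2) ** 2.5 * zeta**3
    return (b**2 * M / (4 * np.pi)) * num / den

def laplacian(f, x, y, z, h=1e-3):
    # central second differences along each Cartesian axis
    return (
        (f(x + h, y, z) + f(x - h, y, z) - 2 * f(x, y, z)) / h**2
        + (f(x, y + h, z) + f(x, y - h, z) - 2 * f(x, y, z)) / h**2
        + (f(x, y, z + h) + f(x, y, z - h) - 2 * f(x, y, z)) / h**2
    )

# evaluate rho = Laplacian(Phi) / (4 pi G) at a test point off the midplane
x, y, z = 1.0, 0.0, 0.2
rho_fd = laplacian(mn_potential, x, y, z) / (4 * np.pi * G)
rho_true = mn_density(x, y, z)
rel_err = abs(rho_fd - rho_true) / rho_true
```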
Note that it is also possible to fit simple parameterized models of the potential to the distribution function gradients returned by the normalizing flow, by minimizing the loss L (Eq. 11) with respect to the model parameters. We demonstrate this here by fitting a Miyamoto-Nagai potential model (Eq. 18) to our recovered distribution function gradients. Fig. 8 shows the loss as a function of the Miyamoto-Nagai potential parameters a and b. We recover the correct parameters (a = 1, b = 0.1) to high precision (better than 1%), demonstrating that our distribution function gradients are sufficiently precise to constrain the potential.

Figure 9. As Fig. 6, but for our mock dataset with an applied selection function. The unobserved regions of the disk midplane are shown in white. In the observed region of the Miyamoto-Nagai disk midplane, we recover the potential, forces and densities to a similar accuracy as in our mock dataset with no selection function.
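This kind of parametric fit can be illustrated on a one-dimensional toy system: an isothermal distribution function f ∝ exp(−(v²/2 + ω²x²/2)/σ²) in a harmonic potential, whose gradients are analytic. The sketch below (a deliberately simplified stand-in for the loss of Eq. 11, with our own variable names) scans trial frequencies and recovers the true ω as the minimizer of the mean-squared Collisionless Boltzmann residual.

```python
import numpy as np

sigma, omega_true = 1.0, 2.0
rng = np.random.default_rng(0)

# sample phase-space points from the isothermal distribution function
x = rng.normal(scale=sigma / omega_true, size=4096)
v = rng.normal(scale=sigma, size=4096)

def f(x, v):
    # isothermal df in a harmonic potential Phi(x) = omega^2 x^2 / 2
    return np.exp(-(0.5 * v**2 + 0.5 * omega_true**2 * x**2) / sigma**2)

# analytic gradients of f at the sample points
df_dx = f(x, v) * (-omega_true**2 * x / sigma**2)
df_dv = f(x, v) * (-v / sigma**2)

def stationarity_loss(omega_trial):
    # CBE residual: df/dt = -v df/dx + Phi'(x) df/dv, with Phi'(x) = omega^2 x
    df_dt = -v * df_dx + omega_trial**2 * x * df_dv
    return np.mean(df_dt**2)

# grid scan over the single model parameter, as Fig. 8 does over (a, b)
omegas = np.linspace(1.0, 3.0, 201)
losses = np.array([stationarity_loss(w) for w in omegas])
omega_best = omegas[np.argmin(losses)]
```

For the true frequency the residual vanishes identically, so the scan recovers omega_best ≈ 2.0; in the paper the same logic runs over the two Miyamoto-Nagai parameters with flow-derived gradients.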

NON-IDEAL DATA
Most physical systems are non-stationary to some degree, and real-world data is subject to observational errors and selection functions. In this section, we will explore how Deep Potential performs in the presence of these effects. In each case, we use our Miyamoto-Nagai disk model.

Observational errors
In order to simulate the effects of observational errors, we place a virtual observer at (x, y, z) = (1, 0, 0) in our Miyamoto-Nagai disk, and add independent Gaussian noise to the "observed" distances, radial velocities and proper motions. For constant parallax errors, the distance error would scale with the square of the distance, r, from the observer. We therefore add Gaussian distance noise with fractional standard deviation given by σ_r/r = min(0.1 r, 0.1): the fractional distance error grows linearly to 10% at a distance of 1, and remains 10% beyond. We add Gaussian radial-velocity errors with a standard deviation of 0.0015, regardless of distance. Finally, we add constant proper-motion errors, which translate to tangential-velocity errors that grow linearly with distance; we choose σ_vT = 0.00015 r. These errors are chosen to be similar to what Gaia DR3 is expected to achieve for stars with an apparent magnitude of G = 17. If our distance units were kiloparsecs, our distance errors would correspond to a parallax uncertainty of 0.1 mas (out to r = 1). Our radial-velocity errors correspond to 1/400th of the maximum circular velocity (∼0.5 km s⁻¹ in the Milky Way), and our tangential-velocity errors to 1/4000th of the maximum circular velocity (equivalent to measuring proper motions to ∼0.1 mas yr⁻¹ at a distance of 1 kpc in the Milky Way).
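Assuming the error model described above, a minimal sketch of the noise injection might look as follows. All function and variable names are our own, and the projection used to keep the proper-motion noise tangential is one simple choice among several:

```python
import numpy as np

rng = np.random.default_rng(42)
OBSERVER = np.array([1.0, 0.0, 0.0])

def add_observational_errors(pos, vel):
    """Perturb true (N, 3) positions and velocities with the toy error model."""
    dx = pos - OBSERVER
    r = np.linalg.norm(dx, axis=1)
    u = dx / r[:, None]                          # line-of-sight unit vectors
    # distance errors: sigma_r / r = min(0.1 r, 0.1), i.e. parallax-like
    # (growing as r^2) out to r = 1, then a constant 10% fraction
    sigma_r = np.minimum(0.1 * r, 0.1) * r
    r_obs = r + sigma_r * rng.standard_normal(r.shape)
    pos_obs = OBSERVER + u * r_obs[:, None]
    # constant radial-velocity errors
    v_r = np.sum(vel * u, axis=1)
    v_r_obs = v_r + 0.0015 * rng.standard_normal(r.shape)
    # constant proper-motion errors -> tangential-velocity errors growing with r
    v_t = vel - v_r[:, None] * u
    noise = (0.00015 * r)[:, None] * rng.standard_normal(vel.shape)
    noise_tan = noise - np.sum(noise * u, axis=1)[:, None] * u  # keep it tangential
    vel_obs = v_r_obs[:, None] * u + v_t + noise_tan
    return pos_obs, vel_obs

# illustrative mock sample (a rough Gaussian disk, not the paper's sampler)
pos = rng.normal(scale=[1.0, 1.0, 0.1], size=(10000, 3))
vel = rng.normal(scale=0.05, size=(10000, 3))
pos_obs, vel_obs = add_observational_errors(pos, vel)
```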
The accuracy of our recovered potential, vertical forces and densities is summarized in the middle panel of Fig. 10. For the level of noise assumed in this test, the force and density residuals are only slightly larger than those we obtain with ideal mock data. This indicates that we can expect good results for the Milky Way using Gaia DR3 data for stars with apparent magnitudes down to G ∼ 17. In order to push into the low signal-to-noise regime, however, we expect that deconvolution would be required to more accurately recover the true distribution function.

Selection functions
As currently formulated, our method does not take into account selection functions. Nevertheless, because Deep Potential only uses information about local distribution function gradients to determine the gradient of the gravitational potential at each point, Deep Potential can be applied to observational data for which the selection function is a constant within some volume of phase space. In the volume in which the selection function is constant, the recovered distribution function will be unaffected (up to a normalizing constant), and the gravitational potential that renders the system stationary will be unchanged.
It is common for observational selection functions to depend on position (and often luminosity and color), but not on velocity. We therefore apply a simple selection function to our Miyamoto-Nagai disk mock data: all stars within a sphere of unit radius centered on (x, y, z) = (1, 0, 0) are observed, while all stars outside this sphere are unobserved. We generate a larger amount of mock data, so that after application of the selection function, 2^19 particles (the same number as in our other mock datasets) remain. We then apply Deep Potential to this mock dataset. As demonstrated in Figs. 9 and 10, within the volume in which we have data, we recover the potential, forces and density to similar accuracy as in the ideal case.

Figure 10. As Fig. 7, but for our mock dataset with an applied selection function (left column), with noisy data (middle column), and with non-stationarity (right column). As can be seen in the left column, in the region of the Miyamoto-Nagai disk with mock data, we obtain the potential to an accuracy of ∼0.1%, and the vertical restoring force and densities to better than 10%. Introducing measurement noise into the data increases the residuals slightly, but vertical forces and densities are still measured to within ∼10%. The introduction of significant non-stationarity into the system has a larger effect, introducing a bias of slightly less than 20% into the recovered vertical forces and densities. Nevertheless, the method recovers the potential, forces and densities remarkably well, given the sharp caustics in the non-stationary mock dataset.
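A minimal sketch of this selection-plus-oversampling step is given below, with a cheap stand-in sampler in place of the actual mock-disk sampler (all names, the sampler and the batch size are our own illustrative choices): draw stars in batches, keep only those inside the unit sphere around the observer, and stop once 2^19 survivors have accumulated.

```python
import numpy as np

rng = np.random.default_rng(7)
CENTER = np.array([1.0, 0.0, 0.0])   # observer position
N_TARGET = 2**19                     # same sample size as the other mock datasets

def draw_batch(n):
    # stand-in for the true mock sampler: positions from a rough disk-like
    # Gaussian, small random velocities (illustrative only)
    pos = rng.normal(scale=[1.5, 1.5, 0.15], size=(n, 3))
    vel = rng.normal(scale=0.05, size=(n, 3))
    return pos, vel

kept_pos, kept_vel = [], []
n_kept = 0
while n_kept < N_TARGET:
    pos, vel = draw_batch(2**19)
    mask = np.linalg.norm(pos - CENTER, axis=1) < 1.0   # spherical selection
    kept_pos.append(pos[mask])
    kept_vel.append(vel[mask])
    n_kept += mask.sum()

pos_sel = np.concatenate(kept_pos)[:N_TARGET]
vel_sel = np.concatenate(kept_vel)[:N_TARGET]
```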
When the selection function is not constant within the observed volume, the method laid out in this paper would require modification. There are two broad approaches to dealing with non-uniform selection functions. The first approach is to take the selection function into account when training the normalizing flow that represents the distribution function, in order to recover the true underlying distribution function (corrected for selection effects). The second approach is to modify the stationarity condition, Eq. (2), to include the selection function. We leave these considerations to future work, and suggest that initial real-world applications of Deep Potential should rely on carefully constructed datasets of kinematic tracers with selection functions that are approximately constant within some volume.

Non-stationarity
The Milky Way is not a perfectly stationary system. The Galactic bar and spiral arms are inherently non-stationary, the disk is perturbed by infalling satellite galaxies, such as the Sagittarius dwarf galaxy (Laporte et al. 2019), and dynamical times in the Galactic halo are on the order of gigayears. In the Galactic disk, in particular, Gaia has revealed non-stationary features, such as a phase spiral in the (z, v_z)-plane (Antoja et al. 2018). It is thus important to test the performance of Deep Potential in systems with some amount of non-stationarity. In such systems, rather than finding a potential that renders the distribution function stationary (in most non-stationary systems, no such potential exists), we find the potential that minimizes our measure of non-stationarity. In order to obtain a slightly non-stationary system, we truncate our integration of the particle trajectories in the Miyamoto-Nagai disk after only 8 dynamical times (rather than the usual 128), before the particles have fully phase-mixed. Fig. 11 compares the distribution function of the particles at 8 and 128 dynamical times.

Figure 11. Comparison of the distribution of the input training data for the non-stationary Miyamoto-Nagai disk (left column) with our normalizing flow (middle column), in several different two-dimensional projections of phase space. The right column shows the Poisson significance (σ) of the differences between the training data and our normalizing flow. Even for this highly challenging dataset, which contains many sharp caustics in phase space, our normalizing flow performs fairly well, recovering most of the sharp edges in the distribution function. This test is designed to represent an extreme case of non-stationarity; the actual Milky Way stellar distribution function is expected to be much smoother.

Figure 13. As Fig. 6, but for our non-stationary Miyamoto-Nagai disk mock dataset. Despite the presence of sharp caustics in phase space, in the midplane of the disk we are able to recover the potential to ∼1%, forces to better than 10%, and densities to ∼20%.
Despite the presence of sharp caustics in phase space, we are nevertheless able to recover the gravitational potential to ∼1%, forces to within 10%, and densities to within ∼20%, as seen in Fig. 13. The surprising robustness of Deep Potential in the presence of severe non-stationarity is likely due to the fact that at each location in space, a single force vector has to minimize non-stationarity over all velocities. At one point in physical space, different regions of velocity space may "prefer" different force vectors, with the errors introduced by different regions of velocity space partially canceling one another out. The possibility that such cancellation occurs can be seen more clearly in Fig. 12, which visualizes non-stationarity in three different projections of phase space. For example, holding cylindrical radius fixed but varying v_T, ∂f/∂t rapidly oscillates between positive and negative values, indicating inflow or outflow of particles, respectively. Neighboring inflows and outflows tend to pull the optimal solution for the local force vector in opposite directions, partially canceling one another's effect on the solution for the potential.
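This cancellation can be made concrete in a one-dimensional toy model: an isothermal distribution function in a harmonic potential, evaluated under a deliberately mismatched trial potential. In the sketch below (our own simplified illustration, not the paper's code), the Boltzmann residual ∂f/∂t at a fixed position changes sign across velocity space, and its signed mean nearly vanishes even though its typical magnitude does not.

```python
import numpy as np

sigma, omega = 1.0, 2.0        # isothermal df in a harmonic potential
omega_wrong = 2.2              # deliberately mismatched trial potential

def df_dt(x, v, omega_trial):
    # CBE residual: df/dt = -v df/dx + Phi'(x) df/dv, with analytic gradients
    # of f = exp(-(v^2/2 + omega^2 x^2/2) / sigma^2)
    f = np.exp(-(0.5 * v**2 + 0.5 * omega**2 * x**2) / sigma**2)
    df_dx = -f * omega**2 * x / sigma**2
    df_dv = -f * v / sigma**2
    return -v * df_dx + omega_trial**2 * x * df_dv

# at a fixed position, scan across velocity space
x0 = 0.5
v_grid = np.linspace(-3.0, 3.0, 601)
residual = df_dt(x0, v_grid, omega_wrong)

# neighboring regions of velocity space pull in opposite directions:
# the residual changes sign, and its signed mean nearly cancels
net = abs(np.mean(residual))
gross = np.mean(np.abs(residual))
```

Here the residual is odd in v, so inflows at one velocity are balanced by outflows at the opposite velocity; a single trial force at x0 cannot be pulled far in either direction, mirroring the argument in the text.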

DISCUSSION
Deep Potential is a fairly direct approach to solving the Collisionless Boltzmann Equation: we first model the observed distribution function, and then directly search for a potential that minimizes non-stationarity. The question then arises: why hasn't this approach been attempted previously? We believe there are two primary reasons. First, our approach makes use of relatively new tools from machine learning. In particular, normalizing flows give us the ability to learn the underlying smooth (and possibly highly complicated) probability distribution function of a set of observed points. Second, and perhaps more importantly, this approach relies on the availability of large numbers of stars with measured six-dimensional phase-space coordinates. Prior to Gaia and large ground-based spectroscopic surveys, such as LAMOST and APOGEE (Majewski et al. 2017), catalogs of stars with six-dimensional phase-space information were insufficient to model the full distribution function in a model-free manner, and parameterized models with relatively small numbers of adjustable parameters were required. The ongoing dramatic increase in the available amount of phase-space data allows us to move beyond simple parametric models, and to discard some of the assumptions that often underlie them (such as axisymmetry).
Going beyond simplifying assumptions about the distribution function is particularly important at the present moment, as Gaia has shown us that the true distribution function of the Milky Way is rich in details. Deep Potential makes use of the whole distribution function, rather than just broad features, and therefore more fully extracts the information that is available in phase space. The output of Deep Potential is not simply the gravitational potential: after learning the potential, we can look for evidence of non-stationary features in the distribution function. On real datasets, we expect diagnostics such as Fig. 12, which shows the rate of inflow or outflow in each region of phase space, to be useful in identifying and mapping non-stationary features.
Deep Potential makes few model assumptions, the most significant being that the dynamical system is stationary. However, as we demonstrate with our non-stationary Miyamoto-Nagai mock dataset in Section 4.3, Deep Potential is robust to non-stationary caustics in the distribution function that are larger than what we expect to observe in the Milky Way. Because it makes few model assumptions, Deep Potential can be applied to any system with six-dimensional phase-space measurements with minimal modification. There is no need to choose analytic forms for the distribution function and potential, or to specify the types of spatial symmetries expected in the problem. In this paper, we have demonstrated that the same pipeline works for a spherically symmetric system (Section 3.1), an axisymmetric Miyamoto-Nagai disk (Section 3.2), a system with a spatial selection function with sharp boundaries (Section 4.2), and a system with significant non-stationarity (Section 4.3).
There are several future directions in which Deep Potential can be developed. First, our current formulation can only handle selection functions that are constant in the region of interest. In Section 4.2, we impose a selection function which is constant within a spherical volume, and zero outside. However, a selection function that varies smoothly within the region of interest would require a modification of our formalism.
Second, our formalism does not include observational errors. In Section 4.1, we show that in a toy model, small observational errors do not significantly degrade our recovered gravitational potential. However, accommodating observational errors in the formalism would allow us to apply this method to data with less precisely measured phase-space information (in particular, less precise parallaxes and proper motions).
Third, our current implementation of Deep Potential fixes the frame of reference in which the system is stationary. As laid out in Section 2.3, it is possible to introduce additional free parameters that encode a change of reference frame. This could, for example, accommodate systems that are stationary in a rotating frame, or in a frame that is moving relative to the observer. This change of frame may itself have physical meaning, as it may indicate the rest frame of the Galactic barycenter, or the pattern speed of the bar or spiral arms.
Fourth, we assume that all six phase-space dimensions are observed. With Gaia, six-dimensional phase-space data will be available in much of the Milky Way. However, in external galaxies, or in very compact systems (such as globular clusters), we will generally lack at least one dimension (e.g., line-of-sight distance within the system). One approach to dealing with missing dimensions is to impose symmetries on the system, though we leave this line of investigation to future work.

Fifth, stellar metallicity offers an entirely new dimension that could be incorporated into Deep Potential. As pointed out by Price-Whelan et al. (2021), for phase-mixed systems, the stellar metallicity distribution should be constant along orbits. Price-Whelan et al. (2021) develop a method for finding the gravitational potential that aligns stellar orbits with surfaces of constant metallicity in phase space. This suggests a possible extension of Deep Potential: imposing stationarity on the distribution function, conditional on metallicity. For example, one could model the conditional distribution function, f(x, v | [Fe/H]), as a normalizing flow, and then find the potential that renders it stationary at a large number of points drawn from the joint distribution p(x, v, [Fe/H]). We likewise leave the development of such an approach to future work.
Finally, we would like to point to a related, likewise promising method, developed by An et al. (2021) and Naik et al. (2022), which also builds on Green & Ting (2020). This method likewise uses a normalizing flow to recover the gradients of the distribution function, but differs from ours primarily in that it directly solves for accelerations - rather than the potential field - at discrete points in space, based on a least-squares minimization of a measure of non-stationarity. The most significant difference between the two approaches, in our view, is that by solving for the potential field Φ(x), one can place a positivity constraint on the underlying matter density (∇²Φ ≥ 0) and can penalize overly complicated potentials (e.g., through L2 weight regularization). In the final stages of preparation of the present manuscript, Buckley et al. (2022) published a related technique, which derives accelerations using a method similar to that of An et al. (2021), and then calculates densities using Gauss's theorem.
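A positivity constraint of this kind can be imposed as a soft penalty on the implied density at a set of probe points. The sketch below is our own illustration (finite differences stand in for the auto-differentiation a real implementation would use, and all names are ours): it penalizes negative ∇²Φ, evaluated here for a Plummer potential, whose implied density is non-negative everywhere, and for its unphysical negation.

```python
import numpy as np

G, M, b = 1.0, 1.0, 1.0

def plummer_phi(points):
    # Plummer potential; its Laplacian (4 pi G rho) is positive everywhere
    r2 = np.sum(points**2, axis=-1)
    return -G * M / np.sqrt(r2 + b**2)

def fd_laplacian(phi, points, h=1e-3):
    # central-difference Laplacian at each (N, 3) probe point
    lap = np.zeros(len(points))
    for i in range(3):
        e = np.zeros(3)
        e[i] = h
        lap += (phi(points + e) + phi(points - e) - 2 * phi(points)) / h**2
    return lap

def positivity_penalty(phi, points):
    # soft penalty: relu(-Laplacian), zero wherever the implied density >= 0;
    # this term would be added to the stationarity loss during training
    return np.mean(np.maximum(-fd_laplacian(phi, points), 0.0))

rng = np.random.default_rng(3)
probes = rng.normal(scale=1.5, size=(2048, 3))

pen_good = positivity_penalty(plummer_phi, probes)               # vanishes
pen_bad = positivity_penalty(lambda p: -plummer_phi(p), probes)  # positive
```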

CONCLUSIONS
In this paper, we have shown that it is possible to accurately recover the gravitational potential of a stationary system using a snapshot of a sample of phase-mixed tracers. Rather than using moments or simple parametric models of the distribution function, our method makes use of the whole distribution function, in all its complexity. Automatic-differentiation frameworks such as TensorFlow and PyTorch provide ideal tools for creating smooth and differentiable - yet highly flexible - representations of the distribution function and gravitational potential. Our method, "Deep Potential," first represents the distribution function using a normalizing flow, and then finds the gravitational potential - represented using a feed-forward neural network - that renders the distribution function stationary.
Making use of the full distribution function is all the more important now, as orders of magnitude more six-dimensional phase-space measurements of Milky Way stars are just now becoming available. The Gaia space telescope is currently surveying parallaxes and proper motions of over a billion stars, and is additionally measuring radial velocities of tens of millions of stars (Gaia Collaboration et al. 2016). Ground-based spectroscopic surveys are set to deliver millions more high-precision radial velocities over the coming years (Steinmetz et al. 2006; Cui et al. 2012; Zhao et al. 2012; De Silva et al. 2015; Majewski et al. 2017; Kollmeier et al. 2017). We will thus soon have access to precise six-dimensional phase-space coordinates of tens of millions of stars throughout a large volume of the Milky Way.
Deep Potential provides a means of extracting the gravitational potential - and therefore the three-dimensional distribution of baryonic and dark matter in the Galaxy - from the rich kinematic datasets that are now becoming available. Deep Potential makes only minimal physical assumptions (steady-state dynamics in a background gravitational field that corresponds to a positive matter density), without resorting to restricted analytic models of the distribution function and potential. We have demonstrated that Deep Potential performs well on ideal mock data, and have shown that it is robust to non-stationarity, observational errors and simple selection functions. This method therefore has the potential to reveal the full, unseen mass distribution of the Galaxy, using only a set of visible kinematic tracers - the stars.