Simulation-based inference with approximately correct parameters via maximum entropy

Inferring the input parameters of simulators from observations is a crucial challenge with applications from epidemiology to molecular dynamics. Here we show a simple approach in the regime of sparse data and approximately correct models, which is common when trying to use an existing model to infer latent variables with observed data. This approach is based on the principle of maximum entropy (MaxEnt) and provably makes the smallest change in the latent joint distribution to fit new data. This method requires no likelihood or model derivatives and its fit is insensitive to prior strength, removing the need to balance observed data fit with prior belief. The method requires the ansatz that data is fit in expectation, which is true in some settings and may be reasonable in all settings with few data points. The method is based on sample reweighting, so its asymptotic run time is independent of prior distribution dimension. We demonstrate this MaxEnt approach and compare with other likelihood-free inference methods across three systems: a point particle moving in a gravitational field, a compartmental model of epidemic spread and molecular dynamics simulation of a protein.


Introduction
Simulation-based inference (SBI) is a class of methods that infer the input parameters and unobservable latent variables in a simulator from observational data.SBI is different than traditional statistical inference or machine learning because simulators are typically not differentiable and their likelihoods are intractable.There have been great strides in methods for SBI and a recent review may be found in [1].Most SBI methods are concerned with finding a few simulator parameters from a rich set of observations [2][3][4].Here, we consider updating a simulator with many trusted parameters to match a sparse set of observations.The ancestor for this line of research is in molecular dynamics simulations of proteins.These simulations require thousands of parameters and the observed data (macroscopic experimental values) is often on the order of 10 to 100 data points (e.g.Reißer et al. [5]).An approach that has emerged in molecular dynamics simulations is maximum entropy (MaxEnt) biasing [6][7][8][9].MaxEnt biasing minimially modifies the simulator to match observations.The premise of MaxEnt is that the original model is approximately correct and observations should be matched in expectation, which is not the usual approach in SBI.These two assumptions lead to a unique bias [10] to the simulator that is independent of the parameters and can be implemented as a simple reweighting procedure.The MaxEnt method's run-time scales only with sample number, rather than the number of model parameters which is atypical of most SBI methods because they require joint sampling.
The second setting of MaxEnt is to make an ansatz that an observation can be substituted as a distribution moment.For example, consider observing a particle trajectory and we would like to make inferences about where the particle will go next.Our observations are specific pairs of time and position that describe the trajectory.If we have a good model for how the particle behaves, MaxEnt will minimally change the model to agree with the particle trajectory on average.The MaxEnt posterior will agree in expectation exactly with the observed trajectory.Thus, from a practical point of view MaxEnt provides an accurate description of the data and a probability distribution for the posterior.The inferred continuation of the trajectory will come from the expectation of that posterior.An alternative would be treating the observation as exact and regressing the prior model, which would not give a posterior but instead a mode (most likely trajectory).Yet this gives no uncertainties with the predictions and can lead far away from the prior model, leading to issues like overfitting and covariate shift.Bayesian inference could be used to fit the particle trajectory by supposing a measurement error distribution.Yet this creates an over-constrained problem where a weak error distribution reduces the agreement with the observed trajectory and a strong error distribution reduces the agreement of the prior model.In essence, by "relaxing" the observation to be an average we enable agreement with the observation exactly, maximize the agreement with the prior, and do not require potentially ad-hoc construction of error distributions 3 .This can be justified through the principle of maximum parsimony: the MaxEnt formulation requires the fewest input parameters.If multiple observations are gathered, then the Bayesian inference setting is more appropriate because the distribution moment ansatz would exclude information about variance in multiple observations.Another potential application area could be in few-shot regression with Bayesian network models [15], where only a few examples are available in a new task.MaxEnt provides a way to fit a previously trained Bayesian network to those few examples balancing agreement with them exactly and while minimizing the effect on the trained model.
The MaxEnt method presented here has a run time scaling that is independent of the number of model/prior parameters; it acts entirely on samples.This also means that intractable or infinite dimensional priors (such as sampling both models/priors) can be treated with MaxEnt.This can be a large advantage over other approximate inference methods like approximate Bayesian computing and likelihood free inference.
The MaxEnt approach in simulation can be traced to Jayne's early work on deriving statistical physics from MaxEnt [16].It was shown, for example, that the Boltzmann distribution could be derived by simply adding a restraint on average energy that must be satisfied in expectation, analogous to matching an observation.A similar method of incorporating observations in expectation returned 50 years later in determining how to match protein molecular dynamics simulations to observations [17].This method was then recast as an approximation to MaxEnt [12].Matching observations in molecular dynamics with MaxEnt was also shown in Pitera and Chodera [10].This was followed by rapid progress to create practical methods for use in simulations [9,[18][19][20].The MaxEnt method based on reweighting has been presented in the context of molecular dynamics simulations in many forms over the years [5,[21][22][23][24][25][26][27][28][29].MaxEnt-based methods have a long history of use in the molecular dynamics community across various types of systems, and this approach is still widely used for biasing applications in modern molecular simulations, demonstrating ongoing interest and engagement in the community [30][31][32].A review by Bonomi et al. provides broader context for the use of maximum entropy and other similar methods in the molecular simulation community [7].The review by Cesari, Reißer, and Bussi [33] provides an overview of the mathematics of MaxEnt, its connections to Bayesian inference and maximum 2 Bayesian inference for fitting distribution moments requires specifying an error distribution that requires additional system insight and its strength relative to the prior belief affects the agreement with the distribution moment data.See system 1. 3 Our maximum entropy formulation does allow uncertainty on distribution moments if desired.
likelihood, and some discussion of the potential hurdles involved.Also, for a comparative study weighing the benefits of MaxEnt and restraint-based methods, see Rangan et al. 2018 [34].
Our contribution here is deriving a general MaxEnt framework that is applicable to arbitrary simulators, demonstrating its application to areas outside of molecular dynamics, and showing one method of improving the support (sampling) of the posterior, which is important when the simulator is far from the observations.In the remainder of this work, we develop the theory, discuss sampling issues, and compare the MaxEnt method to approximate Bayesian computing (ABC) [3,[35][36][37], sequential neural likelihood (SNL) [38], and direct Bayesian inference when the likelihood is tractable.Additional background on these methods used for comparison can be found in the supporting information.

Theory
Given a simulator f ( θ) with a set of parameters θ, we have a prior distribution of parameters P ( θ).For example, the function f ( θ) could be propagating a system of ODEs for some set number of timesteps or a molecular dynamics simulation with intrinsic noise.
Suppose we have some set of N observations, {ḡ} k , k ∈ [1, . . ., N ], which we would like to match with our model.Assume the measurement of each ḡk has some uncertainty k , where k is a random variable distributed according to some prior distribution about uncertainty, P 0 ( k ).We would like to constrain our model such that This means that we want the average over the distribution of our updated models (P ( θ)) to match the observations data, with an allowable average disagreement based on { k }.This is an unusual constraint and is weaker than most simulation inference methods.It reflects the strong belief in our prior model in this setting.Note that inclusion of the P 0 ( k ) and k terms is optional: it is not necessary to allow disagreement on average with data, unlike in a Bayesian framework.This would be equivalent to setting the error distribution to a Dirac delta about 0: Another difference is that this distribution of uncertainty is about bias.It accounts for systemic deviation in average agreement and does not describe the underlying variance of the observational data.This approach is analogous to Bayesian model averaging [39], in that it is an average over many model parameter settings, reweighted by the posterior likelihood.
The maximum entropy modification to the prior distribution P ( θ) to satisfy the N constraints is given by [9,10,12,40]: where Z is a normalization constant and λ k are chosen such that E[g k + k ] = ḡk .The dependence on k can be removed by computing the marginal, The problem is reduced to finding λ k such that the constraint is satisfied.Again, we must remove and is understood to still be a function of If the prior is an exponential family, the λ k s will exist and be unique under some mild assumptions about support of the prior and covariance of observables (i.e., cannot have perfectly correlated observables with incompatible observations) [10,12].

Computing weighted properties and sampling efficiency
In Algorithm 1, we show the procedure for sampling from the MaxEnt distribution defined in Equation 4 via importance sampling [41].Here, P ( θ) is the prior distribution over simulation parameters θ, f is the simulator, M is the number of Algorithm 1: MaxEnt Weights with Uncertain Observations 2 end end end return w samples from the prior to take, N is the number of constraints, P 0 ( k ) is the error distribution, g k is the kth observation, and ḡk is the target observable value to which we would like to constrain our simulator.η is the learning rate.The output of this algorithm, {w i }, are the weights of trajectories {f (θ i )}, and any desired property g can be computed as The challenge of using MaxEnt is sampling from P ( θ).Our assumption thus far is that our prior P ( θ) is approximately correct, so that samples from P ( θ) should be similar to P ( θ).In this ideal case, the algorithm is simply a matter of reweighting.One samples θ i , computes f ( θ i ), compute weights proportional to w i )] consistent with the experimental data (Algorithm 1), and then any other property is reweighted with the same weights.In the non-ideal case (if for instance sampling is expensive, the space is high-dimensional, or the model is far from correct), there can be insufficient support to agree with the constraints.To treat insufficient support, we take a simple approach and use gradient descent to modify the sampling distribution parameters θ j to minimize the cross-entropy with P ( θ): where w i [P ] depends on θ j via the expectation function.We remove the effect of the sampling distribution from the posterior via reweighting by P ( θ)/P ( θ j ) We refer to this approach as variational.

Trivial Simulation with Gaussian Noise
We first consider a toy simulator f that outputs a scalar r.We have a prior belief about the value of the constant as a normal distribution N (r, θ).This example serves to compare the MaxEnt approach with Bayesian inference.The observed data is a single point (r) and we treat it as an average constraint in the MaxEnt.That is, we have a single observation and we constrain our simulator to on average match this observation.Figure 1 panel a) shows how the MaxEnt posterior changes with different observations (r = 5 or r = 10).The r = 10 observation requires the variational sampling (Equation 6) because the observed value is outside the sampled support of the prior.The expected value of E[r] of the posterior always matches the observation and the moments of the posterior are identical to the prior, except the 1 st moment (the mean).Although Figure 1a is calculated with Algorithm 1, the analytic equation for the posterior is simply N (r, θ) [10].
With Bayesian inference, we must assume some noise model of our simulator so that we can compute the probability of the single observation arising from the simulator, namely P (data|model) [42].We take this to be N (r, σ).The Bayesian posterior balances this evidence with the prior distribution: The expected value of r will not match the observed value except in the limit of σ/θ reaching 0. E[r] will be between the observed value and the prior belief expectation.Figure 1b compares the Bayesian inference and MaxEnt posteriors.Panel a) shows how the MaxEnt method leaves the variance of r unchanged as we consider different observed values.Panel b) shows the use of Bayesian inference to match the observation at r = 10.It requires extreme ratios between prior belief and experimental uncertainty to match the observation at 10.This is not necessarily a disadvantage, we simply are showing that observations are matched in expectation with MaxEnt and not with Bayesian inference.Panel c) shows how the MaxEnt method keeps the posterior entropy maximized regardless of the observation value (x-axis), as expected for a maximum entropy method.Bayesian inference shows a more peaked distribution when the observed value is far away from the prior, giving less entropy.That is, to agree with the observation we must necessarily increase the strength of evidence, which peaks the posterior.
Example System 1: Point Particle Gravitation Simulation For a our first example system, Figure 2 shows a comparison of SNL and MaxEnt reweighting on a unit mass particle in a gravitational field of three attractors.The simulator here is a point particle following Newtonian gravitational mechanics.The goal here is to modify the simulation trajectories to align with a small set of observations.An example task might be fitting the trajectory of a comet to a small number of observations separated by years.
The parameters for this simulation were m 1 , m 2 , m 3 , v 0x and v 0y , the masses of the three attractors, and the initial velocity of the particle, respectively.The positions of the attractors and the initial position of the particle were all fixed.We treat these parameters as unknown, and the prior belief for them follows a normal distribution, shown in Figure 2.
Repeatedly sampling from this prior and running the simulator results in a distribution of trajectories, whose means are shown in Figure 2a).MaxEnt reweights this ensemble of trajectories to agree with five observed positions along the trajectory.(The mean path does not exactly pass through the observations because some zero-mean normally-distributed noise with standard deviation of 3 was added to each observation.)The average posterior trajectory indeed agrees with all observations.The prior and posterior for the parameters are shown in Figure 2b.
The observed points were synthesized by choosing a set of true parameter values and imposing zero-mean normallydistributed noise with standard deviation 3 on every 20th timestep on the 100-step simulation.Thus, one way to evaluate the MaxEnt performance is to see if the posterior means are close to these true values.We can see that the MaxEnt posteriors are closer to these values than the prior, but still largely in agreement with the prior.It fits the observations while staying as close to the prior as possible, because that maximizes entropy.In contrast, SNL results in a much narrower posterior around the true values, while diverging from the prior, because that maximizes likelihood.
We computed the cross-entropy of the prior and posterior produced by MaxEnt and SNL.These values were 5.09 for SBI with SNL, and 3.43 for MaxEnt reweighting.This demonstrates how MaxEnt minimally alters the prior distribution while still matching observations in expectation -the average path followed by the MaxEnt particle matches all target points, while matching the posterior to the prior's shape more closely than SNL.This example illustrates two key points.Firstly, it shows that MaxEnt is robust to chaotic systems, as it is still able to match observations on average, with minimal change to the prior.However, it is also an example of when another SBI method may be preferable, depending on the goal.The goal of MaxEnt is, by construction, to alter the prior as little as possible while agreeing with observations on average.In cases where the true underlying parameters governing a model are in a low-density region of the prior, the posterior resultant from MaxEnt will therefore assign relatively low probability to these parameter values as well.Thus, in the sense of estimating likelihood when the prior is not close to the true values, other methods like SNL can be preferable to MaxEnt.We can see that while SNL makes a better estimate of the true parameters used to generate those observations, it does not reproduce a path that aligns with the observations.This presents a choice to the simulator.If the goal is to alter a model to agree with observations, MaxEnt is preferred, especially if the prior is strongly trusted.If the goal is accurate likelihood estimation, methods like SNL are preferred, especially if the prior is not strongly trusted.
Example System 2: Epidemiological modeling In our third example, we apply our framework to modeling the spread of a pathogen in vulnerable populations.We consider an SEAIR compartmental model of epidemic spread (Figure 3) on metapopulations connected via a spatial network of patches.Each patch corresponds to a location such as a zipcode in a city, or a county, and connections between patches correspond to mobility flows of residents encoded in a M × M mobility matrix for M patches, where M ij is the number of people moving from patch i to patch j in one time increment.Contacts within patches occur in a fully-mixed mean field manner where individuals can be in any one of five states of infection: Susceptible (S), Exposed (E), Asymptomatic (A), Infected (I), and Resolved (R).The choice for this particular combination of compartments was inspired by its relevance in modeling the evolution of the current SARS-CoV-2 pandemic [43,44].Each individual patch is represented with fractions of S, E, A, I, R, rather than the count of individuals within each compartment.We first create a "reference" trajectory that represents the true disease model.From this reference trajectory, we extract observations which are used as the input to the MaxEnt methods, by extracting values at specific timepoints in the reference trajectory.A challenge in modeling the spread of epidemics is associated with reporting of the empirical number of confirmed cases (compartment I), which is typically very noisy [45].To simulate this uncertainty, we add random additive noise to the observations from the reference trajectory (see Methods for details).This reference trajectory is represented as dashed lines in Figure 4a).We choose 5 uniformly random data points within the first half of the trajectory of the compartment I in patch 1 as observations (represented as black dots).The performance of the model is evaluated by comparing the predicted trajectory and the reference in a different patch (3).In Figure 4b) we compare the performance of MaxEnt, a least-squares fit, and ABC in fitting the prior to the observations.Compared to MaxEnt, the result from the least squares method was a poor fit with high variance, as it over-fits to observation noise.This was shown by doing a 5-fold leave-one-out cross-validation of the observations and evaluating the standard deviation at times t = 0, 125 and 250 for each method (inset in Figure 4.b).Out of all methods evaluated, ABC had the least variance, but was computationally more expensive to run, whereas MaxEnt can include more model parameters without additional computational cost.Variational MaxEnt was also implemented to reweight the disease trajectory (See details in supporting information FIG.S2).

Example System 3: MBP Fragment Molecular Dynamics
Finally, we consider an application from biophysical modeling of the myelin basic protein (MBP) epitope fragment.MBP is a common autoimmune target for the disease multiple scelrosis (MS) [46].Spyranti et al. [47] characterized the specific region of MBP (83-99) that is the binding epitope for T-cell receptor recognition with solution Nuclear Magnetic Resonance (NMR).NMR provides per-atom chemical shifts, which are population averages of a measurement of an atoms' local environment [48].However, we must infer a specific structure to understand the molecular biology of MBP.In this example we use molecular dynamics as a prior model, the chemical shifts as maximum entropy restraints, and compute a posterior of protein configurations.MaxEnt analysis has been applied frequently already in molecular dynamics, although not this exact approach with uncertainty [34].
Our prior model is an empirical distribution consisting of MBP fragment atomic positions as sampled from molecular dynamics.The specific fragment sequence was ENPVVHFFKNIVTPRTP and the molecular dynamics was initialized from an extended conformation.Simulations was performed in NVT ensemble in Gromacs 2020 [49][50][51][52][53][54][55] with CHARMM27 force field at a density of 25 mg/ml [56,57].The simulation duration was 1.3µs with frames saved for this analysis every 500ps.To compute the chemical shift, g k , we use a graph neural network that can compute chemical shift from atomic positions [58].We only biased backbone HN atoms, due to their higher accuracy [58].The first 6 HN atoms were biased (NPVVHF), excluding the N-terminus.The remaining HN atom chemical shifts were unbiased (KNIVTRT).P 0 ( ) was chosen to be a Laplace distribution with scale parameter 0.05 -allowing a small amount of systematic disagreement.
Figure 5 shows the maximum entropy posterior average chemical shifts.As expected, exact agreement is found for the chemical shifts for which observations are provided.The posterior for which there are no observations follow the the prior closely (as expected in MaxEnt), although they move in the wrong direction at some positions.The protein structures are shown to the right of the plot."Spyranti" is a representative structure from the deposited structure constructed by Spyranti et al. [47] The posterior mode from MaxEnt is shown to the right.It has a helix, though not the α-helix as shown in Spyranti.An advantage of this MaxEnt approach to analyzing NMR data is that there are 6500 structures in the posterior, whereas the traditional approach of NMR structure refinement results in 5-20 structures.This large distribution can then be used for other tasks with better calibration, such as finding drugs to target the protein structure, predicting protein-protein interfaces, and assessing structural properties.

Discussion
We have presented MaxEnt reweighting as an inference method for altering an approximately-correct simulator to agree with observations.This method can be used on arbitrary simulators with arbitrary numbers of parameters, requiring only sufficient sampling of the prior distribution.The simulator need not have derivatives or tractable likelihoods.We demonstrated this by comparing with other SBI methods using three different simulators in different example contexts.The framework is particularly effective and robust when data is scarce or expensive (epidemic spreading being an archetypal example).MaxEnt provably changes the prior minimally to fit observations.While the method was initially developed for and particularly well-suited for molecular dynamics simulations-where experimental observations are much more costly and few in number compared to simulation-as demonstrated here, its applicability has potential for use in any setting of stochastic modeling where the derivative of the simulator's output with respect to latent variables is unavailable or intractable.
The approach to sampling described here is an implementation of variational inference to sample from the posterior.One could instead use Monte Carlo sampling.This would have the advantage of not requiring prior distribution derivatives, but since the derivatives here are closed-form it is computationally convenient to use importance sampling.MaxEnt's advantages over other widely used SBI methods, such as SNL, are that it is simple to implement, requires no hyperparameter choices like a neural network design, and can fit observations in expectation.

Methods Point Particle Gravitation Simulation
For this simulation, the prior parameter distribution was taken as a multivariate normal distribution centered at {m 1 = 85, m 2 = 40, m 3 = 70, v 0x = 12, v 0y = −30}, with covariance matrix I × 50.This wide prior was chosen because MaxEnt needs parameter support that overlaps with the observations we would like to fit.Fitting was done using the SBI package for Python [59] with the SNL method, [38] and a custom implementation of MaxEnt reweighting using Keras [60,61].Both methods used 2048 prior samples for fitting.SNL used default parameters from the SBI package [59] and MaxEnt used the Adam optimizer [62] with a learning rate of 0.0001 with mean squared error for 30000 epochs and batch size 2048.

Epidemiology Modeling
Epidemic spreading in networks can be modeled as a reaction-diffusion process.The reaction corresponds to an infection caused by interactions of subjects within a fully-mixed region or patch of varying granularities (a meta-population), while diffusion corresponds to movement of people (of various infection states) between patches [63].In this example, the meta-population system is comprised of three isolated local populations (patches) connected via flows corresponding to migrating individuals.The spreading process is represented through a temporally discretized ODE that includes the spatial distribution of the population as well as their mobility patterns [64].
In our simulation, the infection begins in patch 1, propagating to the other two patches according to a synthetic mobility matrix.This mobility matrix was randomly generated with dominating diagonal elements to satisfy the fully-mixed region assumption.Five uniformly random data points within the first half of the trajectory of the compartment I in patch 1 were considered as observations with a 5 percent random additive noise and Laplace prior of 0.01 (shown as restraints in Figure 4a).The true parameters for the reference epidemic trajectory are: {start I = 0.02, start A = 0.05, E period = 7, A period = 5, I period = 14}.Parameters for this simulation were asymptomatic, infectious and exposed periods along with the fractional starting values for I and A compartments.The prior parameter distribution were taken as a truncated-normal distribution centered at {start I = 0.001, start A = 0.001, E period = 2, A period = 2, I period = 10}, with variances of {0.8, 0.8, 1, 4, 5}, respectively.For the simulation, the pyABC [65] package was used with default parameters, and the same MaxEnt implementation was used with the Adam optimizer, a learning rate of 0.1, and loss of mean squared error for 1000 epochs with a batch size of 8192.

Figure 1 :
Figure 1: Comparison of Bayesian inference and MaxEnt reweighting.Panel a) shows the simulator prior distribution in orange and the two versions of MaxEnt posterior with observations of 5 and 10.Panel b) shows the interplay between strength of prior and assigned uncertainty to the observation at 10 for Bayesian inference.The value is arbitrary and chosen to illustrate how Bayesian inference strongly alters the shape of the posterior compared to the prior, whereas MaxEnt preserves the shape well.Note the scale change between panels a) and b).Panel c) further illustrates this by comparing the posterior entropy of the two as a function of the observation location.

Figure 2 :
Figure 2: Comparison of SNL and MaxEnt methods on a gravitational field simulation of a particle moving through a fixed field with three attractors.All units are SI values (m, m/s, kg).a): weighted mean paths generated by SBI with SNL (green) and MaxEnt (purple), alongside the path generated by the mean of the prior distribution (dash-dotted grey), and the true path used to generate observations (dashed black).Target points appear as black stars, and the attractors are black circles.b): Kernel density estimate of the posterior distribution of parameters after fitting, alongside their respective priors.

Figure 3 :
Figure 3: SEAIR model.Populations in each patch can be in any one of Susceptible (S), Exposed (E), Asymptomatic (A), Infected (I) and Resolved (R).Susceptible individuals can get exposed to the disease by having contacts with the asymptomatic or the infected at infectivity rate β.Once exposed, they become asymptomatic and infected at rates η and α.The infected finally recovers or dies at rate µ and becomes resolved.

Figure 4 :Figure 5 :
Figure 4: Maximum entropy reweighting of disease trajectory in a meta-population SEAIR model.a): Priorgenerated trajectory in one of the spatial patches for compartments S (blue), E (orange), A (green), I (red) and R (purple) are shown with solid lines and the reference trajectory is in dashed lines.The colored area represents the one-third higher and lower quantiles than average.Observations shown as restraints (black circles) are selected randomly from compartment I with 5 percent additive noise and Laplace prior of 0.01.b): Comparing the performance of MaxEnt (pink), Least-squares (blue), ABC (yellow) in fitting to reference model (black dashed line) in patch 3, based on observations in patch 1. Table inset shows standard deviations from 5-fold cross validation of the observations at three different times.Shading on MaxEnt is from ±67% weighted quantiles.