Training deep neural density estimators to identify mechanistic models of neural dynamics

Mechanistic modeling in neuroscience aims to explain observed phenomena in terms of underlying causes. However, determining which model parameters agree with complex and stochastic neural data presents a significant challenge. We address this challenge with a machine learning tool which uses deep neural density estimators—trained using model simulations—to carry out Bayesian inference and retrieve the full space of parameters compatible with raw data or selected data features. Our method is scalable in parameters and data features and can rapidly analyze new data after initial training. We demonstrate the power and flexibility of our approach on receptive fields, ion channels, and Hodgkin–Huxley models. We also characterize the space of circuit configurations giving rise to rhythmic activity in the crustacean stomatogastric ganglion, and use these results to derive hypotheses for underlying compensation mechanisms. Our approach will help close the gap between data-driven and theory-driven models of neural dynamics.


Introduction
New experimental technologies allow us to observe neurons, networks, brain regions and entire systems at unprecedented scale and resolution, but using these data to understand how behavior arises from neural processes remains a challenge. To test our understanding of a phenomenon, we often take to rebuilding it in the form of a computational model, for example as networks of spiking neurons.

Figure 1. Goal: algorithmically identify mechanistic models which are consistent with data. Our algorithm (SNPE) takes three inputs: a candidate mechanistic model, prior knowledge or constraints on model parameters, and data (or summary statistics). SNPE proceeds by 1) sampling parameters from the prior and simulating synthetic datasets from these parameters, and 2) using a deep density estimation neural network to learn the (probabilistic) association between data (or data features) and underlying parameters, i.e. to learn statistical inference from simulated data. 3) This density estimation network is then applied to empirical data to derive the full space of parameters consistent with the data and the prior, i.e. the posterior distribution. High posterior probability is assigned to parameters which are consistent with both the data and the prior, low probability to inconsistent parameters. 4) If needed, an initial estimate of the posterior can be used to adaptively generate additional informative simulations.

SNPE's only requirement is that one can run model simulations for different parameters, and collect the resulting synthetic data or summary features of interest.

We test SNPE using mechanistic models expressing key neuroscientific concepts. Beginning with a simple neural encoding problem with a known solution, we progress to more complex data types, large datasets and many-parameter models inaccessible to previous methods.
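The simulate-then-learn loop in steps 1–3 of Figure 1 can be illustrated with a deliberately minimal stand-in: a hypothetical one-parameter toy simulator, and a linear-Gaussian "density estimator" fitted by least squares in place of a deep network. This is only a sketch of the structure of the approach, not the actual SNPE algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) Sample parameters from the prior and simulate synthetic data.
n_sims = 5000
theta = rng.normal(0.0, 1.0, size=n_sims)        # prior: N(0, 1)
x = theta + rng.normal(0.0, 0.5, size=n_sims)    # toy simulator: x = theta + noise

# 2) "Train" a conditional density estimator q(theta | x).
#    Here: a linear-Gaussian model fitted by least squares, standing in
#    for a deep neural density estimator.
A = np.vstack([x, np.ones_like(x)]).T
w, b = np.linalg.lstsq(A, theta, rcond=None)[0]
sigma = (theta - (w * x + b)).std()

# 3) Apply the trained estimator to an "observed" datum x_o; the result
#    approximates the posterior p(theta | x_o).
x_o = 1.2
post_mean, post_std = w * x_o + b, sigma

# For this linear-Gaussian toy problem, the analytic posterior is
# N(0.8 * x_o, sqrt(0.2)), i.e. mean 0.96 and std ~0.447.
print(post_mean, post_std)
```

Because the estimator is trained on (parameter, data) pairs drawn from the prior, the same fitted mapping can be reused for any new observation, which is the essence of the amortized inference discussed later.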
We estimate visual receptive fields using many data features, demonstrate rapid inference of ion channel properties from high-throughput voltage-clamp protocols, and show how Hodgkin–Huxley models are more tightly constrained by increasing numbers of data features. Finally, we explore which network models can explain the activity in the stomatogastric ganglion [7], and study the geometry of the parameter space to provide hypotheses for which compensation mechanisms might be at play.

Concurrently with our work, Bittner and colleagues [42] developed an alternative approach to parameter identification for mechanistic models, and showed how it can be used to characterize neural population models which exhibit specific emergent computational properties. Both studies differ in their methodology and domain of applicability (see descriptions of underlying algorithms in our [37,38] and their [43] prior work), as well as in the focus of their applications.

[…] (For non-white data, the two will in general be different.) By randomly drawing receptive fields θ, we generated synthetic spike trains and calculated STAs from them (Fig. 2b), and subsequently trained a neural conditional density estimator to recover the underlying receptive-field model (Fig. 2c). This allowed us to estimate the posterior distribution over receptive fields, i.e. to estimate which receptive fields are consistent with the data (and prior) (Fig. 2c,d).

As a more challenging inference problem, we inferred the receptive field of a neuron in primary visual cortex (V1) [52,53]. Using a model composed of a bias (related to the spontaneous firing rate) and a Gabor function with 8 parameters [54] to describe location, shape and strength of the receptive field, we simulated responses to 5-minute random noise movies of 41 × 41 pixels. In this case, the STA has 1681 dimensions (Fig. 2e), causing classical ABC methods to fail (Supplementary Fig. 3).
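Such Gabor fits also admit symmetric, equally valid solutions, which a mixture-based conditional density (as used below via a mixture density network) can represent. The sketch below evaluates a hypothetical two-component mixture over the Gabor angle, purely to illustrate how a multimodal posterior looks; the mixture parameters are made up, not outputs of a trained network.

```python
import numpy as np

def mixture_density(grid, weights, means, stds):
    """Evaluate a 1-D Gaussian mixture: the kind of conditional density an
    MDN head parameterizes (weights/means/stds are functions of its input)."""
    w, mu, s = (np.asarray(a)[:, None] for a in (weights, means, stds))
    comps = w * np.exp(-0.5 * ((grid - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return comps.sum(axis=0)

# Hypothetical bimodal posterior over the Gabor angle psi: a 180-degree
# rotation symmetry yields two equally probable modes.
psi = np.linspace(0.0, 2.0 * np.pi, 1000)
p = mixture_density(psi, weights=[0.5, 0.5], means=[0.8, 0.8 + np.pi], stds=[0.2, 0.2])

# Two local maxima, pi apart; the density still integrates to ~1.
n_peaks = int(np.sum((p[1:-1] > p[:-2]) & (p[1:-1] > p[2:])))
mass = p.sum() * (psi[1] - psi[0])
print(n_peaks, mass)
```

A single-Gaussian posterior approximation would average the two modes and place mass on implausible intermediate angles, which is why a mixture output is the natural choice here.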
This problem admits multiple solutions (e.g. rotating the receptive field by 180°). As a result, the posterior distribution has multiple peaks ('modes'). Starting from a simulation result xo with known parameters, we used SNPE to estimate the respective posterior distribution. To deal with the high-dimensional data xo in this problem, we used a convolutional neural network (CNN), as this architecture excels at learning relevant features from image data [55,56]. To deal with the multiple peaks in the posterior, we fed the CNN's output into a mixture density network (MDN) [57], which can learn to assign probability distributions with multiple peaks as a function of its inputs (details in Methods). Using this strategy, SNPE was able to infer a posterior distribution that tightly enclosed the ground truth simulation parameters which generated the original simulated data xo, and closely matched a reference MCMC posterior (Fig. 2f; posterior over all parameters in Supplementary Fig. 4). We also applied this approach to electrophysiological data from a V1 cell [53], identifying a sine-shaped Gabor receptive field consistent with the original spike-triggered average (Fig. 2h; posterior distribution in Supplementary Fig. 5).

Functional diversity of ion channels: efficient high-throughput inference

We next show how SNPE can be efficiently applied to estimation problems in which we want to identify a large number of models for different observations in a database. We considered a flexible model of ion channels [59], which we here refer to as the Omnimodel. This model uses 8 parameters to describe how the dynamics of currents through non-inactivating potassium channels depend on membrane voltage (Fig. 3a).

[…] We therefore reasoned that by training a network once using a large number of simulations, we could subsequently carry out rapid 'amortized' parameter inference on new data using a single pass through the network (Fig. 3d), i.e. performing inference for all data from the prior (rather than for a specific observed datum). Generating these simulations took around 1000 CPU-hours and training the network 150 CPU-hours, but afterwards a full posterior distribution could be inferred for new data in less than 10 ms.

As a first test, SNPE was run on simulation data generated by a previous characterization of a non-inactivating potassium channel (Fig. 3b). Simulations of the Omnimodel using parameter sets sampled from the obtained posterior distribution (Fig. 3e) closely resembled the input data on which the SNPE-based inference had been carried out, while simulations using 'outlier' parameter sets with low probability under the posterior generated current responses that were markedly different from the data xo (Fig. 3f). Taking advantage of SNPE's capability for rapid amortized inference, we further evaluated its performance on all 350 non-inactivating potassium channel models in ICG. […]

Figure 3. Inference on a database of ion-channel models. (a) We perform inference over the parameters of non-inactivating potassium channel models.

[…] collective constraints on channel and membrane properties in the HH model. We also inferred HH parameters for 8 in vitro recordings from the Allen Cell Types database using the same current-clamp protocol. […]

We then investigated the geometry of the parameter space producing these rhythms [17,18]. First, we wanted to identify directions of sloppiness, and we were interested in whether parameter settings producing pyloric rhythms form a single connected region, as has been shown for single neurons [80], or whether they lie on separate 'islands.'
Starting from the two above parameter settings showing similar activity, we examined whether they were connected by a path through parameter space along which pyloric activity was maintained. To do this, we algorithmically identified a path lying only in regions of high posterior probability (Fig. 5c, white; details in Methods). Along the path, network output was tightly preserved, despite a substantial variation of the parameters (voltage trace 1 in Fig. 5d, Supplementary Fig. 10a,c). Second, we inspected directions of stiffness by perturbing parameters off the path. We applied perturbations that yield maximal drops in posterior probability (see Methods for details), and found that the network quickly produced non-pyloric activity (voltage trace 2, Fig. 5d). In identifying these paths and perturbations, we exploited the fact that SNPE provides a differentiable estimate of the posterior, as opposed to parameter-search methods which provide only discrete samples.

Overall, these results show that the pyloric network can be robust to specific perturbations in parameter space, but sensitive to others, and that one can interpolate between disparate solutions while preserving network activity.

[…] showing how these parameters can compensate for one another (Fig. 6c). When repeating this analysis across multiple network configurations, we found that these 'conditional correlations' are often preserved (Fig. 6c, left and right). We calculated conditional correlations for each parameter pair using 500 different circuit configurations sampled from the posterior (Fig. 6d). Compared to correlations based on the pairwise marginals (Fig. 6b), these conditional correlations were substantially stronger. They were particularly strong across membrane conductances of the same neuron, but primarily weak across different neurons (black boxes in Fig. 6d). […] conductance of I_H.
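The contrast between marginal and conditional correlations can be sketched numerically. The example below uses made-up "posterior samples" with a built-in compensation structure, and computes conditional correlations as partial correlations under a Gaussian approximation of the posterior; this is one simple way to realize the comparison, not necessarily the exact procedure used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "posterior samples" over three parameters: g1 and g2 compensate for
# one another, but much of g2's variability is tied to a third parameter
# g3, which dilutes the pairwise (marginal) correlation of g1 and g2.
n = 500
u = rng.normal(0.0, 1.0, n)
v = rng.normal(0.0, 2.0, n)
g1 = u
g2 = -u + v + rng.normal(0.0, 0.1, n)
g3 = v
samples = np.column_stack([g1, g2, g3])

# Marginal (pairwise) correlations:
marg = np.corrcoef(samples, rowvar=False)

# Conditional correlations under a Gaussian approximation: partial
# correlations, read off the precision (inverse covariance) matrix.
prec = np.linalg.inv(np.cov(samples, rowvar=False))
d = np.sqrt(np.diag(prec))
cond = -prec / np.outer(d, d)
np.fill_diagonal(cond, 1.0)

# Marginally, g1 and g2 look only moderately coupled; conditioned on g3
# they are almost perfectly anti-correlated (a compensation mechanism).
print(marg[0, 1], cond[0, 1])
```

This is exactly the qualitative pattern described above: pairwise marginals can hide strong dependencies that only appear once the remaining parameters are held fixed.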
Overall, we showed how SNPE can be used to study parameter dependencies, and how the posterior distribution can be used to efficiently explore potential compensation mechanisms. We found that our method can predict compensation mechanisms which are qualitatively consistent with experimental studies.

Discussion
How can we build models which give insights into the causal mechanisms underlying neural or behavioral dynamics? The cycle of building mechanistic models, generating predictions, comparing them to empirical data, and rejecting or refining models has been of crucial importance in the empirical sciences. However, a key challenge has been the difficulty of identifying mechanistic models which can quantitatively capture observed phenomena. We hypothesized that a generally applicable tool to constrain mechanistic models by data would expedite progress in neuroscience.

[…] or to high-dimensional summary features which are challenging for ABC approaches (Fig. 2). […] observations (∼100s of dimensions; also see [38]), but scaling to high-dimensional parameter spaces (>30) is challenging. Second, while it is a long-term goal for these approaches to be made fully automatic, our current implementation still requires choices by the user: as described in Methods, one needs to provide the architecture of the density estimation network, and specify settings related to network optimisation and the number of simulations and inference rounds.
Linear-nonlinear encoding models

We used a Linear-Nonlinear (LN) encoding model (a special case of a generalized linear model, GLM; [19,21,44-47]) to simulate the activity of a neuron in response to a univariate time-varying stimulus. Neural activity z_i was subdivided into T = 100 bins and, within each bin i, spikes were generated according to a Bernoulli observation model,

z_i ∼ Bernoulli(η(v_i⊤ f + β)),

where v_i is a vector of white-noise inputs between time bins i − 8 and i, f a length-9 linear filter, β is the bias, and η(·) = exp(·)/(1 + exp(·)) is the canonical inverse link function for a Bernoulli GLM.

For the spatial receptive field model of a cell in primary visual cortex, we simulated the activity of a neuron depending on an image-valued stimulus. Neural activity was subdivided into bins of length ∆t = 0.025 s and, within each bin i, spikes were generated according to a Poisson observation model,

z_i ∼ Poisson(η(v_i⊤ h + β) ∆t),

where v_i is the vectorized white-noise stimulus at time bin i, h a 41 × 41 linear filter, β is the bias, and η(·) = exp(·) is the canonical inverse link function for a Poisson GLM. The receptive field h is constrained to be a Gabor filter,

h(g_x, g_y) = g exp(−(x'² + r² y'²)/(2w²)) cos(2π f x' + φ),
with x' = (g_x − x) cos ψ + (g_y − y) sin ψ and y' = −(g_x − x) sin ψ + (g_y − y) cos ψ,

where (g_x, g_y) is a regular grid of 41 × 41 positions spanning the 2D image-valued stimulus. The parameters of the Gabor are gain g, spatial frequency f, aspect ratio r, width w, phase φ (between 0 and π), angle ψ (between 0 and 2π) and location x, y (assumed within the stimulated area, scaled to be between −1 and 1). Bounded parameters were transformed with a log- or logit-transform to yield unconstrained parameters. After applying SNPE, we back-transformed both the parameters and the estimated posteriors in closed form, as shown in Fig. 2. We did not transform the bias β and gain g.
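The temporal LN model above is easy to simulate, and its spike-triggered average follows directly from the generated spikes. The sketch below uses a hypothetical Gaussian-bump filter and bias (not the paper's ground-truth values) and the sigmoid inverse link.

```python
import numpy as np

rng = np.random.default_rng(0)

T, taps = 100, 9
f = np.exp(-0.5 * ((np.arange(taps) - 4) / 1.5) ** 2)   # hypothetical length-9 filter
beta = -2.0                                             # hypothetical bias

stim = rng.normal(0.0, 1.0, T + taps - 1)               # white-noise stimulus
# v_i: the stimulus between time bins i-8 and i, one row per bin.
V = np.stack([stim[i:i + taps] for i in range(T)])
rate = 1.0 / (1.0 + np.exp(-(V @ f + beta)))            # sigmoid inverse link
z = rng.random(T) < rate                                # Bernoulli spikes

# Spike-triggered average: mean stimulus snippet preceding each spike.
sta = V[z].mean(axis=0)
print(int(z.sum()), sta.shape)
```

With white-noise input, the STA is (up to scaling) an unbiased estimate of the filter direction, which is why it serves as a natural summary statistic for this inference problem.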
We used a factorizing Gaussian prior for the vector of transformed Gabor parameters [g, log f, log r, log w, l_{0,π}(φ), l_{0,2π}(ψ), l_{−1,1}(x), l_{−1,1}(y)], where the transforms l_{0,π}(X) = log(X/(π − X)), l_{0,2π}(X) = log(X/(2π − X)) and l_{−1,1}(X) = log((X + 1)/(1 − X)) ensured the assumed ranges for the Gabor parameters φ, ψ, x, y. Our Gaussian prior had zero mean and standard deviations [2, 0.5, 0.5, 0.5, 1.9, 1.78, 1.78, 1.78]. We note that a Gaussian prior on a logit-transformed random variable logit(X) with zero mean and standard deviation around 1.78 is close to a uniform prior over the original variable X. For the bias β, we used a Gaussian prior with mean −0.57 and variance 1.63, which approximately corresponds to an exponential prior exp(β) ∼ Exp(λ) with rate λ = 1 on the baseline firing rate exp(β) in the absence of any stimulus.

The ground-truth parameters for the demonstration in Fig. 2 were chosen to give an asymptotic firing rate of 1 Hz for 5 minutes of stimulation, resulting in 307 spikes, and a signal-to-noise ratio of −12 dB. Similar to the previous GLM, this model has a tractable likelihood, so we used MCMC to obtain a reference posterior.

We applied this approach to extracellular recordings from primary visual cortex of alert mice obtained using silicon […]. If […] receptive field, then there is little hope that even an appropriately chosen acceptance threshold will yield a good approximation to the posterior. These findings were also reflected in the results from SMC-ABC with a total simulation budget of 10^6 simulations (Supplementary Fig. 3b). The estimated posterior marginals for the 'bias' and 'gain' parameters show that the parameters related to the firing rate were constrained by the data xo, but marginals of parameters related to the shape and location of the receptive field did not differ from the prior, highlighting that SMC-ABC was not able to identify the posterior distribution here.
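The interval transforms l_{a,b}(X) = log((X − a)/(b − X)) and their closed-form inverses can be written down directly; the sketch below shows the round trip for the Gabor phase φ ∈ (0, π).

```python
import numpy as np

def logit_ab(x, a, b):
    """Map x in (a, b) to the real line: l_{a,b}(x) = log((x - a) / (b - x))."""
    return np.log((x - a) / (b - x))

def inv_logit_ab(y, a, b):
    """Closed-form back-transform: maps any real y into (a, b)."""
    return a + (b - a) / (1.0 + np.exp(-y))

phi = 2.0                                  # Gabor phase, constrained to (0, pi)
y = logit_ab(phi, 0.0, np.pi)              # unconstrained value seen by the network
phi_back = inv_logit_ab(y, 0.0, np.pi)     # recovers the original phase
print(phi_back)
```

Working in the transformed space lets the density estimator operate on unbounded variables, while the inverse guarantees that back-transformed posterior samples always respect the parameter bounds.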
Further comparisons of neural density estimation approaches with ABC methods can be found in the publications describing the underlying machine-learning methodologies [36,38,103].

Ion channel models

We simulated non-inactivating potassium channel currents subject to voltage-clamp protocols as

I_K = ḡ_K m (V − E_K),

where V is the membrane potential, ḡ_K is the density of potassium channels, E_K is the reversal potential of potassium, and m is the gating variable for potassium channel activation. m is modeled according to the first-order kinetic equation

dm/dt = (m∞(V) − m)/τm(V),

where m∞(V) is the steady-state activation and τm(V) the respective time constant. We used a general formulation of m∞(V) and τm(V) [59], where the steady-state activation curve has 2 parameters (slope and offset) and the time-constant curve has 6 parameters, amounting to a total of 8 parameters (θ1 to θ8). Since this model can be used to describe the dynamics of a wide variety of channel models, we refer to it as the Omnimodel.

We modeled responses of the Omnimodel to a set of five voltage-clamp protocols described in [58]. […] To amortize inference on the model, we specified a wide uniform prior over the parameters: θ1 ∈ U(0, 1), θ2 ∈ U(−10, 10), θ3 ∈ U(−120, 120), θ4 ∈ U(0, 2000), θ5 ∈ U(0, 0.5), θ6 ∈ U(0, 0.05), θ7 ∈ U(0, 0.5), θ8 ∈ U(0, 0.05).

Single-compartment Hodgkin-Huxley neurons

We simulated a single-compartment Hodgkin-Huxley type neuron with channel kinetics as in [67],

Cm dV/dt = −g_l (V − E_l) − ḡ_Na m³h (V − E_Na) − ḡ_K n⁴ (V − E_K) − ḡ_M p (V − E_K) + I + σ η(t),

where V is the membrane potential, Cm is the membrane capacitance, g_l is the leak conductance, E_l is the membrane reversal potential, ḡ_c is the density of channels of type c (Na+, K+, M), E_c is the reversal potential of c, (m, h, n, p) are the respective channel gating kinetic variables, and σ η(t) is the intrinsic neural noise. The right-hand side of the voltage dynamics is composed of a leak current, a voltage-dependent Na+ current, a delayed-rectifier K+ current, a slow M-type K+ current, […] Allen Cell Types Database (http://help.brain-map.org/download/attachments/8323525/BiophysModelPeri.pdf).

We applied SNPE to infer the posterior over 8 parameters (ḡ_Na, ḡ_K, g_l, ḡ_M, τ_max, V_T, σ, E_l), given 7 voltage features (number of spikes, mean resting potential, standard deviation of the resting potential, and the first 4 voltage moments: mean, standard deviation, skewness and kurtosis). The prior distribution over the parameters was uniform […].

We simulated the circuit model of the crustacean stomatogastric ganglion by adapting a model described in [7]. The model is composed of three single-compartment neurons, AB/PD, LP, and PY, where the electrically coupled AB and PD neurons are modeled as a single neuron. Each of the model neurons contains 8 currents: a Na+ current I_Na, a fast and a slow transient Ca2+ current I_CaT and I_CaS, a transient K+ current I_A, a Ca2+-dependent K+ current I_KCa, a delayed-rectifier K+ current I_Kd, a hyperpolarization-activated inward current I_H, and a leak current I_leak. In addition, the model contains 7 synapses. As in [7], these synapses were simulated using a standard model of synaptic dynamics [130]. The synaptic input current into the neurons is given by I_s = g_s s (V_post − E_s), where g_s is the maximal synapse conductance, V_post the membrane potential of the postsynaptic neuron, and E_s the reversal potential of the synapse.
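The first-order gating kinetics dm/dt = (m∞(V) − m)/τm(V) can be integrated under a voltage step with a few lines of forward-Euler code. The sketch below uses a hypothetical sigmoid m∞ and a constant τm as stand-ins for the Omnimodel's 2- and 6-parameter curves (whose exact forms are given in [59]); all numeric values are illustrative.

```python
import numpy as np

def m_inf(V):
    # Hypothetical steady-state activation (sigmoid stand-in).
    return 1.0 / (1.0 + np.exp(-(V + 30.0) / 10.0))

def tau_m(V):
    # Hypothetical, voltage-independent time constant in ms (stand-in).
    return 5.0

def simulate_step(V_hold, V_step, g_K=1.0, E_K=-90.0, dt=0.01, t_end=100.0):
    """Forward-Euler integration of dm/dt = (m_inf(V) - m) / tau_m(V)
    under a voltage step, returning the K+ current trace I = g_K m (V - E_K)."""
    n = int(t_end / dt)
    m = m_inf(V_hold)                  # start at steady state for the holding potential
    current = np.empty(n)
    for i in range(n):
        m += dt * (m_inf(V_step) - m) / tau_m(V_step)
        current[i] = g_K * m * (V_step - E_K)
    return current

I = simulate_step(V_hold=-80.0, V_step=0.0)
print(I[0], I[-1])   # current relaxes toward g_K * m_inf(0) * (0 - E_K)
```

Traces of this kind, generated for the five voltage-clamp protocols, are exactly the synthetic data on which the amortized density estimator is trained.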
The evolution of the activation variable s is given by

ds/dt = (s̄(Vpre) − s)/τs, with s̄(Vpre) = 1/(1 + exp((Vth − Vpre)/δ)) and τs = (1 − s̄(Vpre))/k−.

Here, Vpre is the membrane potential of the presynaptic neuron, Vth is the half-activation voltage of the synapse, δ sets the slope of the activation curve, and k− is the rate constant for transmitter-receptor dissociation.

[…]

In order to find directions of robust network output, we searched for a path of high posterior probability. First, as in [7], we aimed to find two similar model outputs with disparate parameters. To do so, we sampled from the posterior and searched for two parameter sets whose summary features were within 0.1 standard deviations (computed across all 174,000 posterior samples) of the observed experimental data, but whose parameters were strongly disparate from each other. In the following, we denote the obtained parameter sets by θs and θg.

Second, in order to identify whether network output can be maintained along a continuous path between these two samples, we searched for a connection in parameter space lying in regions of high posterior probability. To do so, we considered the connection between the samples as a path and minimized the following path integral: […] To minimize this term, we parameterized the path γ(s), s ∈ [0, 1], using sinusoidal basis functions with coefficients α_{n,k}:

γ_n(s) = θ_{s,n} + s (θ_{g,n} − θ_{s,n}) + Σ_{k=1..K} α_{n,k} sin(kπs).

Since sin(kπs) vanishes at s = 0 and s = 1, the starting and end points of the path are, for any coefficients α_{n,k}, exactly the two parameter sets defined above: γ(0) = θs and γ(1) = θg. With this formulation, we have framed the problem of finding the path as an unconstrained optimization problem over the parameters α_{n,k}. We can therefore minimize the path integral L using gradient descent over α_{n,k}. For numerical simulations, we approximated the integral in equation 1 as a sum over 80 points along the path and used K = 2 basis functions for each of the 31 dimensions.
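The endpoint-preserving path construction can be sketched directly: a straight line between the two parameter sets plus sinusoidal perturbations that vanish at both ends. The coefficients below are random placeholders standing in for the optimized α_{n,k}.

```python
import numpy as np

def path(s, theta_s, theta_g, alpha):
    """Endpoint-preserving path gamma(s), s in [0, 1]: straight line plus
    sinusoidal terms sin(k*pi*s), which vanish at s = 0 and s = 1 for any
    coefficients alpha[n, k]."""
    s = np.atleast_1d(s)[:, None]                  # shape (S, 1)
    line = theta_s + s * (theta_g - theta_s)       # shape (S, D)
    k = np.arange(1, alpha.shape[1] + 1)           # shape (K,)
    bumps = np.sin(np.pi * s * k) @ alpha.T        # (S, K) @ (K, D)
    return line + bumps

rng = np.random.default_rng(0)
D, K = 31, 2                                       # 31 parameters, K = 2 basis functions
theta_s, theta_g = rng.normal(size=D), rng.normal(size=D)
alpha = rng.normal(size=(D, K))                    # placeholder coefficients

gamma = path(np.linspace(0.0, 1.0, 80), theta_s, theta_g, alpha)
print(np.allclose(gamma[0], theta_s), np.allclose(gamma[-1], theta_g))
```

Because the endpoints are fixed by construction, gradient descent on the α_{n,k} only reshapes the interior of the path, which is what makes the search unconstrained.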

In order to demonstrate the sensitivity of the pyloric network, we aimed to find a path along which the circuit output quickly breaks down. For this, we picked a starting point along the high-probability path and then minimized the posterior probability. In addition, we enforced that this orthogonal path lies within the disk orthogonal to the high-probability path, leading to the following constrained optimization problem:

minimize_θ log p(θ|x) subject to nᵀΔθ = 0,

where n is the tangent vector along the path of high probability. This optimization problem can be solved using the gradient projection method [131]:

Δθ = − P ∇ log p(θ|x) / ((∇ log p(θ|x))ᵀ P ∇ log p(θ|x)),

with projection matrix P = 1 − (1/(nᵀn)) nnᵀ and 1 indicating the identity matrix. Each gradient update is a step along the orthogonal path. We let the optimization run until the distance along the path is 1/27 of the distance along the high-probability path.

Identifying conditional correlations

In order to investigate compensation mechanisms in the STG, we compared marginal and conditional correlations.
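The gradient-projection update defined in the previous paragraphs can be checked numerically on a toy problem. The sketch below uses a hypothetical quadratic log-posterior and an arbitrary tangent vector; it only verifies the two defining properties of the step: it is orthogonal to the tangent n, and it decreases the log-posterior.

```python
import numpy as np

def projected_step(grad, n):
    """One normalized gradient-projection step restricted to the plane
    orthogonal to the path tangent n, with P = I - n n^T / (n^T n)."""
    P = np.eye(len(n)) - np.outer(n, n) / (n @ n)
    g = P @ grad
    # Descend: step away from high posterior probability.
    return -g / (grad @ g)

# Toy quadratic log-posterior: log p(theta|x) = -0.5 * theta^T A theta.
A = np.diag([1.0, 4.0, 9.0])
theta = np.array([1.0, 1.0, 1.0])
grad = -A @ theta                  # gradient of the log-posterior at theta
n = np.array([1.0, 0.0, 0.0])      # hypothetical path tangent

step = projected_step(grad, n)
print(step @ n)                    # 0: the step stays in the orthogonal plane
```

Since step @ grad is negative, repeatedly applying this update walks the parameters into regions of lower posterior probability while never moving along the path itself.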