ABC-NN: Approximate Bayesian Computation with Neural Networks to learn likelihood functions for efficient parameter estimation

In cognitive neuroscience, computational modeling offers a principled interpretation of the functional demands of cognitive systems. Bayesian parameter estimation provides information about the full posterior distribution over likely parameters. Importantly, the set of models with known likelihoods is dramatically smaller than the set of plausible generative models. Approximate Bayesian Computation (ABC) methods facilitate sampling from posterior parameters for models specified only up to a data generating process, overcoming this limitation to afford Bayesian estimation of complex stochastic models (Wood, 2010; Beaumont, 2010; Akeret, Refregier, Amara, Seehars, & Hasner, 2015; Turner & Sederberg, 2014). Because these methods rely on model simulations to generate synthetic likelihoods, they come with substantial computational cost at inference, where simulations are typically conducted at each step of an MCMC algorithm. We propose a method that learns an approximate likelihood over the parameter space of interest using multilayered perceptrons (MLPs). This incurs a single upfront cost, but the resulting network constitutes a usable likelihood function that can be freely employed in standard inference algorithms. We test this approach in the context of drift diffusion models (DDMs), a class of cognitive process models commonly used in the cognitive sciences to jointly account for choice and reaction time data in a variety of experimental settings (Ratcliff, Smith, Brown, & McKoon, 2016).


General Idea
The main shortcoming of current Approximate Bayesian Computation (ABC) approaches to parameter estimation is the computational burden incurred by performing model simulations inside MCMC algorithms. This burden is especially salient when analytically intractable models are used and one wishes to conduct model comparison over various hierarchical model extensions (illustrated in Figure 1). Simulations need to be run for every combination of parameters searched, for each step in each chain, and for every model variant (e.g., in a cognitive task with multiple conditions, researchers often test models in which one or any number of parameters may or may not vary by condition). In these scenarios, applying ABC methods may limit thorough exploration of alternative variants simply for pragmatic computational reasons. Our goal here is to make this process more computationally efficient and to permit model estimation for otherwise intractable models. The basic idea of our approach is to separate the model simulation / likelihood estimation stage from the inference stage. More specifically, we attempt to separate the MCMC algorithms, employed locally by researchers at the inference stage, from the generation of approximate likelihoods, which can be computed up front for a given model class over a range of parameters and stored as a function that can then be called directly or inside software packages (e.g., HDDM: hierarchical Bayesian estimation of the drift diffusion model in Python (Wiecki, Sofer, & Frank, 2013)). While a likelihood function is analytically available for many popular models (e.g., the DDM), even slight changes to these models (e.g., a DDM with a collapsing bound) may leave no analytical form.
We propose here a machine learning approach that uses model simulations across parameters to learn the likelihood function up front with the help of a multi-layered perceptron (MLP, also simply referred to as a neural network in the following). Once such an approximate likelihood function is provided, simulating the DGP (data generating process) is no longer necessary at the stage of inference, during which MCMC algorithms need only evaluate the function, reducing the cost of Bayesian inference back to what is typical for analytic likelihood functions. Figure 2 illustrates the idea in terms of the change in the corresponding MCMC algorithm for inference. Using this approach we can first learn the likelihood manifold for a given model into a neural network, and then provide researchers with the learned network (a set of weights, requiring only matrix algebra) to be used for inference. We hope to exploit model simulations to directly inform an interpolation algorithm that can leverage regularities in the global structure of the likelihood manifold and then apply them during inference without need for further simulation.
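To make the separation concrete, the sketch below shows a random-walk Metropolis sampler in which the likelihood is only evaluated, never simulated, at each step. The learned network would be plugged in where the toy Gaussian log-likelihood stands; all names and settings here are illustrative, not part of our implementation.

```python
import math
import random

def metropolis(log_lik, data, theta0, n_steps=2000, step_sd=0.1, seed=1):
    """Random-walk Metropolis over a 1-D parameter with a flat prior.
    Each step only *evaluates* log_lik -- no model simulation is run."""
    rng = random.Random(seed)
    theta, samples = theta0, []
    ll = sum(log_lik(x, theta) for x in data)
    for _ in range(n_steps):
        prop = theta + rng.gauss(0.0, step_sd)
        ll_prop = sum(log_lik(x, prop) for x in data)
        if math.log(rng.random()) < ll_prop - ll:  # accept/reject
            theta, ll = prop, ll_prop
        samples.append(theta)
    return samples

# Stand-in for a learned likelihood: a unit-variance Gaussian with unknown mean.
def toy_log_lik(x, mu):
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)

data_rng = random.Random(0)
data = [data_rng.gauss(1.5, 1.0) for _ in range(200)]
samples = metropolis(toy_log_lik, data, theta0=0.0)
posterior_mean = sum(samples[500:]) / len(samples[500:])  # discard burn-in
```

Replacing `toy_log_lik` with a trained network's forward pass leaves the sampler unchanged, which is the point of the proposed separation.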

Tested Model
As an initial test we focus on the viability of our approach for the DDM, a popular model of decision making in which reaction times and decisions are described via the crossing of one of two boundaries by a single diffusion process, and for which the analytical likelihood is available (and hence can be used as a benchmark) but nontrivial. Figure 3 provides an illustration of the DDM.
For a starting point w, drift µ, and (symmetric) functional shape of the boundaries g(t), we have two defective probability distributions f+ and f−, which jointly provide the first passage distribution of (upper and lower) boundary crossings (Ratcliff, 1978) of an associated stochastic process X(t; w, µ). For a given set of parameters (w, µ), the sum of the two defective probability distributions, f = f− + f+, is a valid probability density (its time integral is 1). For a given data point, with choice c ∈ {−, +} and reaction time t ∈ (0, +∞), we can describe its likelihood given the model parameters as f_c(t; w, µ). As described above, our problem is that f+ and f− are given in analytic form only for a small class of models (e.g., for the DDM in which the drift rate and decision boundary are fixed across time), even though there are many situations in which these assumptions are untenable, motivating ABC approaches in the first place. To get an approximation of the likelihood of (c, t), given (w, µ) and g(t), Turner and Sederberg (2014) suggest the following approach. We simulate the process (for example using Euler-Maruyama) to obtain an artificial data set D = {(c_1, t_1), ..., (c_n, t_n)}, which we split according to whether c = + or c = −, and re-index to create two separate data sets D+ = {(+, t_1), ..., (+, t_{n+})} and D− = {(−, t_1), ..., (−, t_{n−})}, from which empirical likelihoods f̂+ and f̂− are computed (e.g., via kernel density estimation) and can then be used inside any MCMC algorithm. The approach proposed here is to use f̂+ and f̂−, so computed, not inside the Markov chain, but instead as a training set for a neural network, which learns an approximate likelihood function l̂_NN(t, c; w, µ) over the relevant parameter space a priori.
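The simulate-split-estimate step can be sketched as follows. This is a pure-Python illustration with parameter values, a fixed Gaussian bandwidth, and function names of our own choosing; a production implementation would use vectorized code and a principled bandwidth.

```python
import math
import random

def simulate_ddm(mu, a, w, n=500, dt=1e-3, seed=7):
    """Euler-Maruyama simulation of a DDM with drift mu, boundary
    separation a, relative start point w in (0, 1), unit diffusion."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x, t = w * a, 0.0
        while 0.0 < x < a:                        # absorb at 0 or a
            x += mu * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
            t += dt
        data.append(('+' if x >= a else '-', t))
    return data

def gaussian_kde(samples, bandwidth=0.05):
    """Return a callable empirical density built from 1-D samples."""
    n = len(samples)
    c = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def f(t):
        return c * sum(math.exp(-0.5 * ((t - s) / bandwidth) ** 2)
                       for s in samples)
    return f

data = simulate_ddm(mu=0.5, a=2.0, w=0.5)
rts_up = [t for c, t in data if c == '+']
rts_lo = [t for c, t in data if c == '-']
# Defective densities: weight each KDE by its choice proportion.
p_up = len(rts_up) / len(data)
kde_up, kde_lo = gaussian_kde(rts_up), gaussian_kde(rts_lo)
f_up = lambda t: p_up * kde_up(t)
f_lo = lambda t: (1 - p_up) * kde_lo(t)
```

Here `f_up` and `f_lo` play the roles of f̂+ and f̂− for a single parameter set; the training set is built by repeating this across the parameter grid.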

DDM: Training on known likelihood
We attempt to fit the first passage distribution of a standard DDM (constant boundaries at levels 0 and a). The functional form for the probability density of lower barrier crossings at time t (similarly for upper barrier crossings) given parameters (µ, a, w) is (Feller, 1968):

f(t | µ, a, w) = (π / a²) exp(−µaw − µ²t / 2) Σ_{k=1}^∞ k exp(−k²π²t / (2a²)) sin(kπw)

Training data were generated by sampling parameter sets and evaluating this density; a portion of the reaction times was additionally drawn from a wide interval ([5, 20]) to ensure that the training set contains reaction-time outliers for any given parameter set. While these proportions were not chosen based on a strict theoretical justification (more principled approaches are in progress, e.g., actively sampling parameters in more uncertain regions), empirically they were enough to constrain the behavior of the learned manifolds. Finally, we split the resulting data set into a training and a test set using p_train = 0.8.
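The Feller (1968) series can be evaluated directly by truncating the infinite sum; the sketch below assumes a unit diffusion coefficient, and the truncation point k_max and test parameters are arbitrary choices of ours. The truncated series is accurate once the exponential terms have decayed, i.e., for all but extremely small t.

```python
import math

def ddm_density_lower(t, mu, a, w, k_max=50):
    """Truncated series for the density of lower-boundary crossings at
    time t of a DDM with drift mu, separation a, relative start w."""
    s = sum(k * math.exp(-k ** 2 * math.pi ** 2 * t / (2 * a ** 2))
            * math.sin(k * math.pi * w)
            for k in range(1, k_max + 1))
    return (math.pi / a ** 2) * math.exp(-mu * a * w - mu ** 2 * t / 2) * s

# The upper-boundary density follows by symmetry: mu -> -mu, w -> 1 - w.
def ddm_density_upper(t, mu, a, w, k_max=50):
    return ddm_density_lower(t, -mu, a, 1.0 - w, k_max)
```

As a sanity check, the two defective densities should integrate to 1 over t for any fixed parameter set.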
Next we performed a hyperparameter search over network architectures. A random search was conducted, ranging over 3-5 hidden layers and 20-60 nodes per layer. The best performing model sat at the upper end of both search ranges, with 5 hidden layers and (50, 60, 50, 50, 60) nodes per layer. From there we increased the architecture to 6 hidden layers and tested setups with 80 and 100 nodes per layer and sigmoid activation functions throughout, yielding the currently best performing network with 100 nodes per layer and a test MSE of 0.0009. Increasing the network to 120 nodes per layer did not further improve performance. We plan to test deeper networks going forward.
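Once trained, evaluating such a network reduces to a handful of matrix-vector products. The sketch below builds an untrained sigmoid MLP of the stated shape (6 hidden layers of 100 nodes, linear output); the 5-dimensional input layout for (t, c, µ, a, w) and all function names are our own illustrative choices.

```python
import math
import random

def init_mlp(sizes, seed=0):
    """Random weights for a fully connected net; `sizes` lists layer
    widths, e.g. [5] + [100] * 6 + [1] for the 6x100 architecture."""
    rng = random.Random(seed)
    return [([[rng.gauss(0, 1 / math.sqrt(n_in)) for _ in range(n_in)]
              for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Evaluate the network: sigmoid hidden layers, linear output.
    Once trained, this is all an MCMC sampler has to call."""
    h = x
    for i, (W, b) in enumerate(params):
        z = [sum(w * v for w, v in zip(row, h)) + bi
             for row, bi in zip(W, b)]
        h = z if i == len(params) - 1 else [1 / (1 + math.exp(-zi)) for zi in z]
    return h

params = init_mlp([5] + [100] * 6 + [1])
# Input layout (t, c, mu, a, w) -- purely illustrative values.
out = forward(params, [0.8, 1.0, 0.5, 2.0, 0.5])
```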
The following approach was used for a parameter recovery study to test the performance of our network for inference. We sampled 100 parameter sets (µ_i, a_i, w_i) uniformly from [−2.5, 2.5] × [0.15, 0.85] × [0.5, 3]. For each parameter set, we simulated the corresponding data generating process 5000 (respectively 2500, 500, and 100) times and recorded the respective choice outcomes. For each set of parameters, we attempted parameter recovery via maximum likelihood methods. For this purpose a simple genetic algorithm was used (population size: 40, steps: 25, mutation probability: 0.1), but other approaches such as simplex optimization or MCMC for full Bayesian inference would work similarly. Figures 4 and 5 illustrate our preliminary results.
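A genetic algorithm of the stated size can be sketched as follows. To keep the example self-contained we minimize a toy Gaussian negative log-likelihood in place of the network-based DDM likelihood; the selection and mutation scheme (truncation selection, uniform crossover, Gaussian mutation) is one reasonable choice among many, not necessarily the one we used.

```python
import math
import random

def genetic_mle(neg_log_lik, bounds, pop=40, steps=25, p_mut=0.1, seed=3):
    """Minimal genetic algorithm over a box-bounded parameter space."""
    rng = random.Random(seed)
    dim = len(bounds)
    clip = lambda v, lo, hi: min(max(v, lo), hi)
    P = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop)]
    for _ in range(steps):
        P.sort(key=neg_log_lik)          # fittest first
        elite = P[:pop // 4]             # truncation selection
        children = []
        while len(elite) + len(children) < pop:
            pa, pb = rng.sample(elite, 2)
            child = [pa[i] if rng.random() < 0.5 else pb[i]  # crossover
                     for i in range(dim)]
            for i, (lo, hi) in enumerate(bounds):            # mutation
                if rng.random() < p_mut:
                    child[i] = clip(child[i] + rng.gauss(0, 0.1 * (hi - lo)),
                                    lo, hi)
            children.append(child)
        P = elite + children
    return min(P, key=neg_log_lik)

# Toy target: recover the mean of Gaussian data (constants dropped).
rng = random.Random(0)
data = [rng.gauss(0.7, 1.0) for _ in range(500)]
nll = lambda th: sum(0.5 * (x - th[0]) ** 2 for x in data)
best = genetic_mle(nll, bounds=[(-2.5, 2.5)])
```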

DDM: Training on empirical likelihood
While the previous section demonstrated the viability of the approach when a known analytic likelihood is used to train the network, here we assess its potential when this likelihood is not known but instead provided via empirical likelihoods using KDE. Specifically, rather than providing the actual likelihood function, f(t | µ, a, w) = (π / a²) exp(−µaw − µ²t / 2) Σ_{k=1}^∞ k exp(−k²π²t / (2a²)) sin(kπw), as the training target, we instead provide at each training step the approximate (empirical) likelihood, generated using Gaussian kernel density estimates over realizations of model simulations. While this procedure is more computationally intensive (more samples are needed for each training step), it is again a one-off cost. Notably, we were able to perform MLE-based parameter recovery with similar precision, serving as validation of our general method and a stepping stone toward application to models for which we have no access to a ground-truth likelihood function in the first place. The results are shown in Figure 6.
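Building the empirical training targets for one parameter set can be sketched as follows: simulate the process, then evaluate a choice-weighted Gaussian KDE on a grid of reaction times. All settings (trial count, time step, bandwidth, grid) are illustrative assumptions, not our actual training configuration.

```python
import math
import random

def kde_training_rows(mu, a, w, n_sim=300, dt=2e-3, bw=0.1, seed=11):
    """Return (params, choice, rt, empirical density) training rows for
    one parameter set, via Euler-Maruyama simulation plus Gaussian KDE."""
    rng = random.Random(seed)
    rts = {'+': [], '-': []}
    for _ in range(n_sim):
        x, t = w * a, 0.0
        while 0.0 < x < a:
            x += mu * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
            t += dt
        rts['+' if x >= a else '-'].append(t)
    rows = []
    for c, sample in rts.items():
        if not sample:
            continue
        weight = len(sample) / n_sim              # defective-density weight
        norm = len(sample) * bw * math.sqrt(2 * math.pi)
        for t in [0.1 * k for k in range(1, 51)]:  # RT grid 0.1 .. 5.0
            dens = weight * sum(math.exp(-0.5 * ((t - s) / bw) ** 2)
                                for s in sample) / norm
            rows.append(((mu, a, w), c, t, dens))
    return rows

rows = kde_training_rows(mu=0.5, a=2.0, w=0.5)
```

Repeating this over sampled parameter sets yields the training set for the network, exactly paralleling the analytic-likelihood pipeline of the previous section.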

Current Development / Future Directions
Figure 6: DDM: Parameter recovery using empirical likelihoods in training.

The current results provide only the first steps necessary for scaling up the approach for wider use. First, there is still room for improvement in the training procedures for the neural networks. Access to the data generating process predestines our method for an online learning implementation. Online learning in turn opens the opportunity to apply ideas from active learning (Bachman, Sordoni, & Trischler, 2017), essentially an interaction between the state of learning of the model and the specifics of the training data generation process. Second, we are currently validating more complicated models, such as DDMs with dynamically varying, parameterized boundaries (e.g., collapsing bounds) and other models such as the leaky competing accumulator (LCA). Third, we aim to provide researchers with two interfaces to the method proposed in this report: an interface for performing Bayesian inference with pre-learned approximate likelihoods of standard models in the literature, and an interface that provides researchers with a training pipeline for user-supplied process models.