Neural Likelihood

A large body of evidence shows that perceptual decision making in humans and animals accounts for uncertainty in the relevant stimulus variable. This suggests that the decision is based on a distribution over stimuli given the neuronal activity rather than single point estimates. The likelihood over the stimuli captures this uncertainty for a fixed neuronal response. Because the neuronal population response can be high dimensional, estimating a per-trial likelihood can be challenging. Previous work has thus focused on parametric models, which can introduce a bias by ignoring noise correlations. Here, we present a simple yet general method to decode a per-trial likelihood based on neural networks. Our method applies to discrete and continuous, as well as static and time-series data. We demonstrate it on recordings from two experimental visual paradigms in macaque V1 and V2.


Introduction
Bayesian models of behavior have been widely successful in explaining decision-making under various tasks in both human and monkeys, frequently demonstrating that animals make nearly Bayes optimal decisions (Ma & Jazayeri, 2014). This suggests that the underlying computations are based on representations of probability distributions rather than their individual moments, such as the mean or the maximum (Pouget, Dayan, & Zemel, 2003). Representations of probability arise naturally in noisy systems, such as the brain. If one stimulus s can yield several possible neural responses r, then each r is naturally associated with a posterior p(s|r) or a likelihood p(r|s) ≡ L r (s). While the posterior also accounts for prior expectations about s, the likelihood captures the information the population response carries about the stimulus. It simultaneously encodes the best estimate of the stimulus (the peak of L r (s)) as well as the associated uncertainty (e.g. the width of L r (s)) (Ma, Beck, Latham, & Pouget, 2006). Because r is often high dimensional, estimating a trial-bytrial likelihood function from recordings of cortical population responses is challenging. Existing methods typically make strong parametric assumptions about the stimulus conditioned distribution of the population response p(r|s), such as an independent Poisson (Graf, Kohn, Jazayeri, & Movshon, 2011) or Poisson-like distribution (Ma et al., 2006). While these assumptions considerably simplify computing the trial-by-trial likelihood function, they can also introduce biases by ignoring potential noise correlations between different neurons (see Walker, Cotton, Ma, and Tolias (2018) Supplementary Figure  3) or subsequent time points, or internal brain state fluctuations (see Denfield, Ecker, Shinn, Bethge, and Tolias (2018); Ecker et al. (2014) for examples of conditional dependencies).
Here we present a simple yet general approach to estimate the trial-by-trial likelihood function that makes no parametric assumption about the stimulus conditioned distribution p(r|s) and only requires knowledge of the stimulus prior p(s), which is known to the experimenter in most experimental paradigms. Our approach is completely generic and can be applied to static or dynamic responses, as well as discrete or continuous data. It relies on two key steps: 1. Instead of estimating the likelihood p(r|s) explicitly, which is infeasible in most cases, we estimate it implicitly by recovering an unnormalized likelihood from a model q(s|r, θ) of the posterior using the known prior. θ denotes the parameters of the posterior model.
2. We estimate the posterior q(s|r, θ) using a flexible probabilistic model such as a neuronal network.
Because our method relies on estimating the posterior directly, it will work particularly well in situations when dim(s) dim(r). Estimating a conditional distribution over the lower dimensional s instead of a the high dimensional r substantially simplifies the problem and allows us to use more flexible models such as neural networks. While algorithm yields a per-trial likelihood L r (s) for each r, it trades the ability to provide a distribution over r for a given s for a simplified estimation problem.
Here, we present the mathematical foundation as well as a continuous and recurrent version of the algorithm. We recently applied a discrete static version of this method to decode likelihood function from a population of V1 neurons responding to orientation stimulus in monkeys (Walker et al., 2018).

1042
This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0

Implicit estimation of the likelihood Given data {s
we train a flexible model q(s|r, θ) of the posterior by minimiz- between the true posterior and the model. Because the KLdivergence is a strictly non-negative quantity, (θ) = 0 if and only if D [p (s|r) q (s|r, θ)] = 0 for each r except for sets of measure zero. The minimum is attained at p (s|r) = q (s|r, θ) which yields with a proportionality constant that depends on r, but not on s. Again, note that the simplified estimation comes at the expense of the normalization constant, i.e. the method does provide an estimate of likelihood L r (s) not the conditional distribution of r given s.

Flexible models for finite discrete posteriors
We use neural networks to get flexible models for q(s|r, θ). In the following, we describe the finite discrete and the continuous case.
In the discrete case, the likelihood function, the posterior, and the prior for each response vector r are represented by vectors L, p s , q ∈ R n where L = L r (s) and n is the number of different states s can be in. For instance, n = 2 in a task where trials can have two possible outcomes. We use a deep neural network f (DNN) (Goodfellow, Ian, Bengio, Yoshua, Courville, 2016) to directly predict L = f (r) from the population response vector r. To get the objective function, we combine L and p s to compute the log posterior over the stimulus s up to some scalar value b(r), and take a softmax q = exp(z)/ ∑ n j=1 exp(z j ) to normalize it. The network f is then trained to maximize the log posterior for the available data where δ(s i ) is a one-hot encoding of the stimulus s i .

Flexible models for continuous posteriors
To estimate a flexible posterior model for continuous stimulus variable, we use the idea of projection pursuit density estimation (Friedman, Stuetzle, & Schroeder, 1984) recently repopularized as flow models (Dinh, Sohl-Dickstein, & Bengio, 2017). A flow model transforms random variables ξ from some easy to evaluate source density p ξ (ξ) into random variables x = f (ξ) using an invertible and differentiable function f . The density of x is then given by The function f is usually implemented by a deep neuronal network with special layers that are invertible and have an easy to compute log-determinant of the Jacobian. The full log-determinant is then simply the sum of the single logdeterminants.
In our case we use a conditional flow model s = f (ξ, r) that generates a density q(s|r, θ) for each population response vector r. Note that f needs to be invertible and differentiable in s only. Because we can explicitly compute density values in a flow model, we can use equation (1) to find the parameters θ. After q(s|r, θ) has been trained, we use equation (2) to get the likelihood function.

Experiments
We tested the discrete and continuous stimulus likelihood decoders on two distinct sets of macaque population recording, each collected by different groups. All experimental protocols were approved by the local authorities (Regierungspräsidium Tübingen and Baylor College of Medicine Institutional Animal Care and Use Committee).

Discrete stimulus task
Two male macaques performed a variant (Kawaguchi et al., 2018) of the disparity discrimination task described in Nienborg and Cumming (2009) while neuronal single and multi-unit activity in visual area V2 was recorded using linear multichannel recording electrodes (Plexon, Inc. V-probes, 24 channels). Monkeys had to classify trials into near and far, based on the predominantly occurring disparity in a random sequence of disparities. The signal strength on each trial was defined to be the proportion of stimulus frames that show the disparity from the class. The remaining frames of the trial were randomly picked from a set of predefined disparity values including the disparity value associated with the current class. Here, data from a single session (939 trials) in one animal was analyzed.

Continuous stimulus task
We obtained V1 population responses from two macaques performing on orientation classification task as was presented in Walker et al. (2018). Up to 96 channels of multi-unit activities were recorded from each recording session, and the contrast of the orientation stimuli were varied on a trial by trial basis. There were total of 132 recording sessions, and for each session, trials with same contrast were grouped into a contrast-session. There were total of 546 contrast-sessions in the whole dataset, with a total of 303,326 trials.

Models
Discrete case We modeled the mapping f (r) as a recurrent neural network (RNN) consisting of a single-layer gated recurrent unit (GRU) (Cho et al., 2014), which had a hidden state size of N h , a learned initial hidden state and the same linear readout at each time step. The training to validation set split was set to 70%:30%. We trained the network on the training set with a fixed learning rate of λ while monitoring its performance on the validation set for early stopping, which was carried out once the validation set loss failed to improve over 400 epochs. Upon training completion, the parameter set that produced the lowest loss on the validation set was restored.
Using a random grid search over candidate hyperparameter values (Table 1), we found N h = 2 and λ = 0.02 to be the combination that yielded the lowest loss on the validation set.
Continuous case For the continuous task, we modeled the mapping ξ = f −1 (s, r) by a monotonously increasing function in s using linear interpolation between 10 base points g i and predefined stimulus locations s i . This choice is motivated by the fact that any mapping between two one-dimensional continuous distributions can be written as (F −1 2 • F 1 )(s) where F k denote the cumulative distribution functions (cdfs). As both cdfs are monotonously increase, so must be their inverse and their composition. The base points g i were predicted from each r using a 3-layer fully-connected ResNet (He, Zhang, Ren, & Sun, 2016) with a final softmax layer followed by a cumulative sum and appropriate scaling.
As in Walker et al. (2018), separate instance of the model were fit for each contrast-session. Trials from a contrastsession were divided into training and validation sets based on a 80%:20% split. We trained the network on the training set with a fixed learning rate of 1e − 4 and monitored its loss on the validation set. Upon training completion, the parameter set that produced the lowest loss on the validation set over the course of training was restored.

Results
Here we demonstrate that our approach on two real world datasets. Our intention is not to provide biological insight but to showcase that our method behaves reasonably given our current knowledge about the neural system. Because of limited number of trials, we report the behavior on the validation set used for early stopping. In a real experiment, the decoder would be used on all data (including training), just like a fitted tuning curve. at different time points during the trial (Fig. 1, bottom). Since the prior distribution over the stimulus classes was uniform, this also reflects the logposterior ratio. As expected, the evidence for a particular class increases over time (Fig. 1, top) and becomes more pronounced for higher signal strengths.

Continuous case
We visually evaluate the model by plotting the likelihood over different stimulus orientations for different values of the contrast. For low contrast the uncertainty about the stimulus is higher so the likelihood curves become wider, as expected (Fig. 2).

Summary
We developed a simple yet general deep-learning based method for decoding per-trial likelihood functions from population responses which makes substantially less assumptions about the form of the likelihood function. Our method thus deviates from the traditional approach where likelihood functions are computed by using a conditional distribution over the neural responses r given s. Estimating this distribution is Figure 2: Average likelihood function decoded by the continuous decoder for each contrast. On each trial, the decoded likelihood function over the continuous stimulus orientation was shifted such that the peak of the normalized likelihood function occurred at 0 • . The centered likelihood functions were then averaged across all trials within the same contrast bin.
very hard due to the curse of dimensionality (Nagler & Czado, 2016) requiring infeasibly large number of samples. While the use of strong parametric assumption on the distribution, such as conditional independence, can substantially decrease the sample size needed for the parameter estimation, this places heavy constraint on the distribution which is often only justified by computational feasibility. Our method avoids these issues by directly learning the likelihood function from the data and avoids modeling in the generative direction s → r. This means that our resulting approximation of the likelihood func-tionL r (s) does not yield a generative model of r given s. However, as long as our interest lies on the likelihood function L r (s) as a function of s, our method correctly approximates the true likelihood function L r (s) up to a multiplicative constant (or constant offset in log domain), and typical applications of likelihood functions are insensitive to such constant multiplication.