A model-free approach to link neural activity to behavioral tasks

*Correspondence: haesemeyer.1@osu.edu



Introduction
Contemporary neuroscience generates large datasets of neural activity in behaving animals [1][2][3]. To gain insight from these large-scale recordings, it is often desired to identify neurons with activity related to the behavioral task at hand. A common approach to this problem is to intuit a functional form ("a model") by which predictors such as sensory stimuli, motor actions and internal states might be related to neural activity. Neurons can subsequently be classified into those with activity explained by features of the task and the chosen model and those with activity likely to be unrelated or background noise.
A simple yet powerful approach is to use linear regression to explain neural activity as the weighted sum of sensory and motor features recorded during the experiment [4][5][6][7]. Since linear regression can accommodate nonlinear transformations of input variables, this technique can encompass diverse relationships between predictors and neural activity. Similarly, more flexible solutions using basis functions 8,9 can be used to identify task-related neurons. However, brains are not engineered solutions to a well-defined problem but arose through selective pressure acting on random variation [10][11][12]. It is therefore unclear how well neural activity can be captured by a well-defined function chosen by the experimenter as the test model. In theory these issues could be alleviated by increasing the flexibility of the function using a large set of parameters. However, this approach is fraught with hazard, as model fits become less and less well constrained, leading to an explosion of variance and over-fitting to the training data 9,13. Artificial neural networks (ANN) can in principle accommodate any mapping of predictors to neural activity 14,15. At the same time, they can be designed to generalize well to untrained inputs 16,17. Due to this flexibility, insights into nonlinear receptive fields of visual 1 and auditory neurons [18][19][20][21][22] and into the encoding of limb motion in somatosensory cortex 23 have been gained using ANNs. However, an obvious drawback of ANNs is that they are much harder to interpret than models based on intuition and data exploration.
Here we introduce "Model-free identification of neural encoding" (MINE), an approach that uses simple convolutional neural networks (CNN) to learn mappings from predictors (stimuli, behavioral actions, internal states) to neural activity. By employing Taylor expansion approaches and techniques used in interpretable machine learning, such as the nonlinearity coefficient, we gain insight into how the CNN uses information about stimuli, states and behaviors to predict neural activity. This allows us to make statements about a) how strongly individual predictors contribute to the activity of individual neurons, b) whether the mapping is likely linear or nonlinear and c) the structure of the neuron's receptive field. Aside from network architecture and initialization conditions, MINE is essentially model-free. Furthermore, by incorporating a convolutional layer, temporal or spatial transformations of inputs introduced by the technique (such as calcium indicator effects) or by the brain (such as differentiation of a signal or edge detection) are learned seamlessly by the model and do not have to be captured through a priori transformations of the task variables.

Here we demonstrate the utility of MINE using a ground-truth dataset and a cortical mouse widefield imaging dataset. We then designed a set of experiments to exhaustively probe thermoregulatory circuits in larval zebrafish. Specifically, we exploit the flexibility of MINE to provide randomly varied temperature stimuli across zebrafish and imaging planes while maintaining the ability to identify functional groups of neurons based on features of the trained CNNs. Using MINE, we discover a new functional class of neurons integrating thermosensory with behavioral information. Combining MINE with anatomical mapping, we also identify brain regions containing neuronal responses of increased computational complexity.

Results
A model-free approach to identify task-relevant neurons

MINE (for "Model-free Identification of Neural Encoding") centers around convolutional neural networks (CNN) to overcome the challenges of predefined models while maintaining interpretability. Feed-forward artificial neural networks are capable of approximating any function 14,15 and therefore afford great flexibility in capturing the relationship between "predictors" (sensory stimuli, behavioral actions, internal states, etc.) and the activity of individual neurons. We designed a simple three-layered network architecture (Figure 1A). It consists of a linear convolutional layer (to accommodate temporal transformations such as calcium indicator effects) and two dense layers that capture nonlinear transformations. The architecture's simplicity speeds up training and eases interpretation while capturing transformations across time (i.e., the convolutional layer) and nonlinear effects (i.e., dense layers with nonlinear activation functions). We chose a continuously differentiable activation function for our dense network layers. Unlike the popular ReLU nonlinearity, this allows us to calculate higher-order derivatives, which capture interaction effects and quantify the nonlinearity of transformations. Most network hyperparameters, including the specific activation function ("swish" 24), were determined by minimizing test error on a small dataset (see Methods). We chose a fixed length of 10 seconds for our convolutional filters to capture temporal effects in all subsequent experiments; this parameter can easily be adjusted.
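As a concrete illustration, a minimal sketch of such a network in Keras is shown below. The filter length follows the text (10 s of history at the 5 Hz data rate used later, i.e., 50 samples); the number of convolutional filters and dense units are placeholders, not the tuned hyperparameters from the Methods.

```python
# Minimal sketch of the three-layer architecture described above.
# Assumption: 5 Hz data rate, so 10 s of history = 50 samples; the
# filter/unit count (80) is a placeholder, not the tuned hyperparameter.
import tensorflow as tf

def build_cnn(n_predictors: int, history: int = 50, n_units: int = 80) -> tf.keras.Model:
    return tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(history, n_predictors)),
        # linear convolution across time: learns temporal transformations
        # such as calcium-indicator dynamics (no activation function)
        tf.keras.layers.Conv1D(n_units, kernel_size=history, use_bias=False),
        tf.keras.layers.Flatten(),
        # two dense layers with the continuously differentiable "swish"
        # activation so that higher-order derivatives exist everywhere
        tf.keras.layers.Dense(n_units, activation="swish"),
        tf.keras.layers.Dense(n_units, activation="swish"),
        tf.keras.layers.Dense(1),  # predicted activity at time t
    ])
```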
To test MINE's ability to capture neural transformations and encoding of predictors, we generated a ground-truth dataset. This dataset consists of four randomly generated predictors (Figure 1B). Two of these, S1 and S2, vary continuously in time, mimicking sensory stimuli. The other two, M1 and M2, are discrete like motor or decision variables. From these predictors we generated "neural responses" that depend on single predictors
or interactions with and without intervening nonlinear transformations (Figure 1C and S1A). We added Gaussian noise to all neural responses after convolution with a calcium kernel to approach conditions that might be observed in functional calcium imaging experiments (Figure 1D). For each neuron in our ground-truth dataset we subsequently trained a CNN to predict the activity from the predictors (Figure 1E). To assess generalization, we split our dataset into a training set (two thirds of the data) and a test set (last third) (Figure 1B). The CNN captured most response types on the test data while a simple linear regression model did not (Figure 1F). This is expected, since the linear regression model, unlike the CNN, cannot learn the dynamics of the calcium indicator. We therefore constructed an alternative model in which we convolved the predictors with the known "calcium kernel" (Figure S1B). In this expanded linear model we also included all first-order interaction terms by including pairwise products between predictors (Figure S1B). As expected, this model matches the performance of the CNN in more response categories, including nonlinear interactions (Figure 1F). However, this is because the function it represents was designed using a posteriori knowledge about the responses. Nonetheless, the expanded linear model is poor at capturing nonlinear transformations of predictors and fails to capture responses that relate to the time-derivative of an input, e.g., as expected in adapting neurons (Figure 1F). This further illustrates the point that a model-based approach is limited by the constraints of the chosen model.

As shown, MINE can identify responses that depend on predictors independent of the linearity of these relationships. The underlying CNN is able to learn temporal transformations of inputs, such as shifts in time or convolutions, which otherwise have to be explicitly provided to a regression model 4,7. For example, this removes the requirement of estimating calcium response kernels, which might differ across neurons. MINE can also identify responses that depend on derivatives of the input, meaning that one predictor, such as position, can be used to identify neurons whose activity depends on the velocity or acceleration of a stimulus.

Taylor expansion identifies predictors driving the response
Moreover, we can empirically set a threshold on the Taylor metric (5x10−3) which separates the expected contributing terms from the others across our ground-truth dataset in all but one case (Figure 2C-E; Figure S2B-H). In summary, MINE was able to correctly identify contributions of predictors, such as sensory stimuli or behavioral actions, to neural responses by local expansions of the trained CNNs. MINE also correctly identifies nonlinear interactions in generating the neural responses on our ground-truth dataset. This indicates that we can reduce the lack of interpretability of the CNN models, approaching the interpretability of linear regression models while maintaining the ability to model nonlinear transformations and interactions of task variables.

Disambiguating linear and non-linear processing
Linear computations are limited in their expressivity, and it is generally believed that the computational power of the nervous system depends on nonlinear computation [26][27][28]. In spite of this, the responses of many neurons tend to depend almost linearly on sensory or behavioral features 4,7,29,30. The latter idea aligns with the hypothesis that information important to the animal should be linearly decodable by neural circuits 10,31,32.
Disambiguating linear from nonlinear processing therefore provides important insight into circuit structure and function. Once fit to neural data, our CNNs model the transformations from predictors to neural activity (or from neural activity to action). We therefore set out to quantify whether these networks implement a linear or nonlinear function, as a proxy for classifying the actual transformations between stimuli, behavior and neural activity.
To measure the nonlinearity encapsulated by the CNN, we calculated two metrics that quantify different aspects of nonlinear computation (Figure 3A and S3A). Linear functions are characterized by a constant first derivative and the absence of curvature. Hence, our first metric uses the Hessian (matrix of second-order derivatives) of the network output with respect to the input at various time points to quantify the curvature of the function approximated by the CNN (see Methods). The second metric is the "nonlinearity coefficient" (NLC), which measures the deviation of the CNN from its linear approximation by comparing the variance of true network outputs to that of a linear approximation 33,34.
We tested these metrics on ground-truth data in which we transformed a predictor to varying degrees with either a linear or a nonlinear function. We then trained the CNN to learn the transformation between the predictor and the mixed output and subsequently calculated the curvature and NLC metrics. Increasing the contribution of an arbitrary nonlinear function (i.e., tanh^2(x^3)) leads to gradual increases in both nonlinearity metrics (Figure 3B). Importantly, these metrics do not simply quantify a loss of linear correlation. Calculating the derivative of a function is a linear operation, and increasing the contribution of a derivative of the predictor indeed does not increase either the curvature or the NLC metric, in spite of reducing the linear correlation between the predictor and the response (Figure 3C and S3B-D). In some cases, the performance of the two metrics differs. When increasing the contribution of a rectified version of the predictor, the curvature metric correctly tracks this nonlinearity while the NLC does not (Figure 3D). We therefore systematically evaluated the classification performance of the two nonlinearity metrics. To this end we generated random linear and nonlinear transformations of the predictors (see Methods), trained the CNN and calculated the nonlinearity metrics. ROC analysis shows that both metrics consistently rank a nonlinear transformation higher than a linear one (in a random pair, the NLC ranks the nonlinear transformation higher in 86 % of cases, the curvature metric in 92 % of cases; Figure 3E). Training a logistic regression model to combine the two metrics and subsequently evaluating its performance on test data further increased the classification performance. Importantly, this model greatly improved the true-positive rate at thresholds with a low false-positive rate compared to the curvature metric, a feature that is highly desirable when applying the classification to real-world data (Figure 3E).
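To make the second metric concrete, the sketch below estimates an NLC-style quantity by linearizing a trained network around the data mean and comparing output variances. This is illustrative only; the exact normalization follows refs. 33,34, and the function name is ours.

```python
# Sketch of an NLC-style estimate: variance of the network's linear
# approximation (around the data mean) relative to the variance of the
# true outputs. Illustrative; see refs. 33,34 for the exact definition.
import numpy as np
import tensorflow as tf

def nlc_estimate(cnn: tf.keras.Model, inputs: np.ndarray) -> float:
    """inputs: (samples, history, n_predictors) array of predictor windows."""
    x0 = inputs.mean(axis=0, keepdims=True).astype(np.float32)
    x0_t = tf.constant(x0)
    with tf.GradientTape() as tape:
        tape.watch(x0_t)
        y0 = cnn(x0_t)
    jac = tape.gradient(y0, x0_t).numpy().ravel()  # d output / d input at the mean
    flat = inputs.reshape(len(inputs), -1)
    # outputs of the first-order (linear) approximation f(x0) + J.(x - x0)
    lin_out = float(y0.numpy().squeeze()) + (flat - x0.ravel()) @ jac
    true_out = cnn(tf.constant(inputs.astype(np.float32))).numpy().ravel()
    return float(np.var(lin_out) / np.var(true_out))
```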
In summary, we can further increase the interpretability of our CNN model and classify the transformations the model encapsulates according to their nonlinearity.
Since the CNN encapsulates the transformations that are enacted by neural circuits between stimuli and neural activity, or between neural activity and behavior, this information provides important insight into the computations that give rise to neural activity.

Figure 3 (caption excerpt): A) Schematic of network decomposition and parameters used for the metrics. C is related to the curvature of the function, which would be 0 for a linear network. The NLC on the other hand quantifies the relationship between the variance induced by a linear approximation of the network around the data mean versus the true output variance. B) Mixing varying degrees of a nonlinear response function with a linear response ("Nonlinear contribution") and its effect on network performance (left, fraction of explained variance on test data), the curvature metric (middle) and the NLC. Colored dots relate to the plots of linear correlations in Figure S3. C) As in B) but mixing a linear function of a predictor with a linear transformation of the predictor, namely the first-order derivative. D) As in B) but mixing a linear function of a predictor with a rectified (nonlinear) version. E) ROC plot comparing classifier performance for nonlinearity for the curvature metric (black), the NLC (orange) and a logistic model combining both metrics (dashed blue line). Note that the combined model identifies a large fraction of true positives while minimizing false positives.

Characterization of receptive fields
Receptive fields describe the stimulus features driving a neuron's response 35. They can reveal integration times (those times over which coefficients of the receptive field are different from zero), and zero-crossings within the receptive field indicate that a neuron encodes the derivative of a stimulus across time (e.g., velocity-encoding neurons) or that it detects edges across space. A powerful approach to characterize the receptive fields of neurons is spike-triggered analysis [36][37][38][39]. In this approach, white-noise stimuli are presented to an animal and the stimulus history before each spike is analyzed. Averaging the stimuli leading up to a spike yields the spike-triggered average (STA). The STA characterizes those stimulus features that increase the average firing rate of the neuron, classically referred to as its receptive field. Eigenvectors of the spike-triggered covariance (STC) matrix, on the other hand, reveal receptive fields that change the variance of the firing rate rather than the average. Together, STA and STC therefore characterize how key features of the stimulus drive the neural response: the projection of the stimulus along the STA influences the average firing rate, while the projections along different components of the STC alter the response variance.
The first two terms of the Taylor expansion of the network fit by MINE suggest a correspondence to both STA and STC. In the first-order term of the expansion, the Jacobian has the same effect as the STA: the projection of the stimulus along it increases the average response. Eigenvectors of the Hessian, on the other hand, mark directions that alter the variance of the response without affecting the average, much like the components of the STC (see Methods). A key difference, however, is that spike-triggered analysis is agnostic to nonlinearities associated with the receptive fields defined by the STA and STC. For MINE, on the other hand, nonlinear effects are largely contained in higher-order terms of the Taylor expansion.

Figure 4 (caption excerpt): B) Same as A) but approximations derived using spike-triggered analysis. The spike-triggered average (STA, left) and eigenvector of the spike-triggered covariance matrix (STC, right) are compared to RF_A and RF_V, respectively. C) Across N = 100 randomly generated, smoothly varying stimuli, distribution of cosine similarities between RF_A and the MINE Jacobian (orange), the spike-triggered average (blue) and a random vector (black). D) As in C) but distribution of cosine similarities between RF_V and the MINE Hessian eigenvector (orange), the eigenvector of the spike-triggered covariance matrix (blue) and a random vector (black).
We therefore wanted to test to what extent MINE could identify receptive fields that arise from the model underlying spike-triggered analysis. To this end, we generated a white-noise stimulus (Figure S4A) and created a simulated spike rate which depends on two receptive fields: one which affects the average response (RF_A) and one which only changes its variance (RF_V) (Methods). We subsequently analyzed the relationship of the stimulus and the spike rate using MINE, fitting the network and extracting both the Jacobian as well as the eigenvectors of the Hessian. Indeed, across many randomly generated stimuli, the Jacobian matched the receptive field increasing the average response, RF_A (Figure S4B and D). The eigenvectors of the Hessian with the largest two eigenvalues, on the other hand, always contained the receptive field increasing the response variance, RF_V (Figure S4B and E). As expected, performing actual spike-triggered analysis on this data recovered the same receptive fields through STA and STC (Figure S4C-E).
Presenting white-noise stimuli is generally not practical in in-vivo experiments. Stimuli experienced by animals are rarely white noise, and behaviors cannot be drawn from a distribution by the experimenter. We therefore repeated the analysis using a stimulus that varies smoothly across time, mimicking ongoing stimuli or behaviors (Figure S4F). Again, fitting MINE on the data, the expected receptive fields RF_A and RF_V were recovered, but the correspondence was slightly worse compared to using white-noise stimuli (Figure 4A and C-D). This makes sense given the strong autocorrelations in the stimulus. Analyzing the same data with spike-triggered analysis led to considerably worse correspondences, especially for RF_V (Figure 4B and C-D). While whitening the stimuli could ameliorate this effect for the STA, fully recovering RF_A, the STC analysis failed to recover RF_V regardless (Figure S4G-H). This indicates that MINE is better at recovering receptive fields for smoothly varying inputs, with less degradation than spike-triggered analysis.
Overall, these results suggest that MINE can recover the types of receptive fields obtained with spike-triggered analysis, and it does so under a broader set of biologically relevant stimulus conditions.

MINE characterizes cortical sensorimotor processing
Encouraged by MINE's ability to identify responses related to behavior and stimuli and to characterize the nature of that relationship, we wanted to test the method on biological data. We applied MINE to a publicly available wide-field calcium imaging dataset recorded in the mouse cortex 40. Taylor analysis revealed that left and right visual stimuli contribute to activity in visual cortical areas with the expected left-right asymmetry, while whisking-related activity maps to separate regions (Figure 5F,G). Furthermore, instructed left and right grab events largely contribute to motor cortical regions, again with the expected left-right asymmetry (Figure 5H). Repeating this procedure on all sessions (Figure S5) reveals general agreement. We note, however, that not all sessions have regions as clearly related to the chosen inputs according to the Taylor analysis, which is especially apparent for left and right instructed grabs (Figure S5C).
We next sought to determine the receptive fields that govern stimulus processing in individual pixels 18,35 .
We extracted the receptive fields across pixels that are strongly related to either left/right visual stimuli or left/right instructed grabs. We broadly clustered the receptive fields into two groups (see Methods) to separate excitatory and inhibitory effects. Receptive fields for left and right visual stimuli in contralateral brain regions are highly similar to each other (Figure 5I, left). The excitatory effect is stronger than the inhibitory effect (larger coefficients in the positive receptive fields) and both are clearly biphasic, indicating that events around the time of the stimulus, as well as ∼1.5 s in the past, strongly influence the activity of the visual neurons. We note that this biphasic structure mimics the linear regression kernel in Figure 3b of the original study 40. In summary, the results presented above demonstrate the applicability of our method to biological data and the potential for identifying diverse feature sets of sensorimotor processing.

Figure 5 (caption excerpt): E) Across all components from all 13 sessions identified by MINE, the Taylor metrics that were significantly larger than 0. Components (rows) have been sorted according to the predictor with the maximal Taylor metric. F) Per-pixel Taylor metric scores for the right visual stimulus ("rVisStim") subtracted from those of the left visual stimulus ("lVisStim"). A = anterior, L = left, R = right. G) As in F) but the sum of the visual stimulus Taylor metrics ("lVisStim+rVisStim") has been subtracted from the Taylor metric of the whisking predictor ("Whisk"). H) As in F) but the Taylor metric for the right grab predictor ("rGrab") has been subtracted from the Taylor metric of the left grab predictor ("lGrab"). I) Clustering of per-pixel receptive fields, separating excitatory (bold lines) from inhibitory responses (pale lines). Pixels were selected to only include the top 10 % of Taylor metrics for visual (left plot) and grab (right plot) predictors; left blue, right red. Numbers in parentheses indicate cluster size. The gray dashed line indicates time 0, i.e., the time at which calcium activity is measured. Note that the sensory receptive field (visual) lies exclusively in the past (future stimuli do not drive the neurons) while the motor receptive fields (grab) slightly overlap with the future, indicating that neurons ramp up activity before the behavioral action.

Functional characterization of thermoregulatory circuits
Since MINE was able to extract expected biological features from a mouse cortical dataset, we sought to use it to gain novel insight into zebrafish thermoregulatory circuits. Temporal transformations of stimuli are a notable feature of how zebrafish process temperature stimuli. In previous research we identified the neurons and centers in the larval zebrafish brain that process temperature: specifically, neurons that compute the rate of change of the temperature stimulus 41 as well as neurons whose activity is consistent with integration of temperature fluctuations 42. Due to the nature of these transformations, which were unknown a priori, regression-based approaches failed to identify neurons involved in temperature processing. We therefore previously used clustering to identify these neurons. However, because behavior is stochastic, one particular drawback of clustering was that it precluded identifying neurons that integrate thermosensory stimuli and behavioral actions. Since MINE can learn and capture both temporal transformations and interactions, we set out to gain deeper insight into thermoregulatory processing.
Classifying neurons by clustering requires either presenting the same stimulus to every animal and neuron while imaging, or a strategy of clustering event-triggered activity. With MINE we do not need to enforce any stimulus constraint. As a result, we were able to probe a wider variety of stimulus conditions. We imaged a total of 750 planes across 25 larval zebrafish that expressed the nuclear calcium indicator H2B:GCaMP6s 43 in all neurons and the excitatory neuron marker vglut2a-mCherry 44 in all glutamatergic neurons. This dataset provided slightly more than fourfold coverage of the larval zebrafish brain. On each imaging plane we presented a heat stimulus generated by randomly composing sine waves with frequencies between 0.005 Hz and 0.075 Hz (Figure 6A and S6A-B).
We concurrently recorded elicited tail motion at 250 Hz.
We segmented the imaging data using CaImAn 45, which identified ∼433,000 active neurons across our dataset.
From the tail motion data, we extracted (a) swim starts and (b) tail features that correlate with swim displacement ("Vigor") and turn angle ("Direction") in free-swimming experiments (Figure 6A and S6C). We used the information about the stimulus and the elicited behavior as inputs to MINE to model the activity of each recorded neuron. As a comparison, we fit a linear regression model to this dataset. The inputs included time-shifted versions of stimulus and behavioral variables to allow the model to learn temporal transformations 7, thereby putting it on the same footing as the CNN model (Figure 6C).
We used ridge regression 9 to improve the generalization of the linear model. At our designated cut-off, a total of ∼42,000 neurons were identified using either MINE or regression. Importantly, 40 % of these neurons were identified exclusively by MINE (Figure 6D), indicating the superiority of the model-free approach. In fact, MINE consistently identifies more neurons regardless of the cut-off imposed on test data (Figure S6E).
Using the Taylor decomposition approach described earlier, we classified the neurons according to which stimulus and behavioral features drove their activity.
We chose a stringent cut-off, requiring the Taylor metric to be significantly above 0.1 with a p-value < 0.05 on the whole dataset (Bonferroni corrected; effective p-value ∼1.26x10−6), to determine whether a predictor influenced the activity of a neuron. Additionally, we applied our nonlinearity model for classifying neurons (deemed nonlinear whenever the probability of nonlinearity was greater than 0.5) (Figure S6F-H). This led to the identification of 34 functional neuron classes (Figure 6E). We ignored nonlinear interaction terms since they had very small Taylor metrics. As a control, we labeled reticulospinal neurons in six fish via spinal backfills, and, as expected, the vast majority of these neurons were classified by MINE as driven by behavioral features (Figure 6F). Across the brain, we identified multiple functional classes of neurons with mixed selectivity, i.e., neurons with activity jointly driven by the temperature stimulus and behavioral outputs (Figure 6E, blue filled bars).
These mark a previously unidentified group of neurons within thermoregulatory circuits. Mixed-selectivity neurons are underrepresented in the pool identified by the linear model (Figure 6G). This points to potential problems of bias when identifying neurons by means of linear models.
We registered all our imaging data to a standard zebrafish reference brain (Z-Brain 46). This allowed us to assess the distribution of the identified functional neuron types throughout the zebrafish brain (Figure 6H and S6I-L). While stimulus-, behavior- and mixed-selectivity neurons are generally broadly distributed, they are enriched in specific brain regions (Figure 6H and S6J-L). We found stimulus-driven neurons to be enriched in telencephalic regions, as well as in the habenula and its output region, the interpeduncular nucleus. As expected, behavior-driven neurons are enriched in hindbrain regions. Mixed-selectivity neurons occupy some regions shared with stimulus-driven neurons, such as telencephalic area M4 and the interpeduncular nucleus, but are also strongly enriched in the raphe superior as well as the caudal hypothalamus and the torus longitudinalis (Figure 6H). Our current dataset did not cover sensory ganglia well (especially not the trigeminal ganglion), with the exception of the olfactory epithelium, where we found temperature-sensitive neurons in a medial zone (Figure S6I). This is in line with reports in mice and frogs that describe specific thermosensitive neurons in the olfactory epithelium of these species [47][48][49].
Overall, these results demonstrate that functional cell types identified by MINE segregate spatially within the brain, likely forming organizational units within the thermoregulatory circuit.

Mapping computational features to the zebrafish brain
We next sought to localize computational features of thermoregulatory circuits in the brain by subdividing neural classes according to the behavioral features they control and the sensory features they extract. Analyzing brain regions for enrichment of the encoding of different behavioral features suggests a segregation in the control of different behavioral outputs. Lastly, we subdivided neurons according to computational features themselves, i.e., those that occur between the sensory input and the observed activity of the neurons. Specifically, we were interested in the complexity of these computations (Figure 7D). MINE computes the probability that the relationship between predictors and neural activity is nonlinear, and temperature-responsive neurons can thus be grouped into those that are linear and those that are nonlinear (Figure 6E). We note that these are not necessarily absolute classifications, meaning that we could transform the stimulus predictor such that some nonlinear relationships would become linear.
Instead, this classification points to differential processing upstream of the observed neural activity in linear versus nonlinear neurons. We used the Taylor decomposition of the networks to identify contributing predictors (see above). The decomposition also provides a window into the complexity of nonlinear neurons. In particular, we asked how well a Taylor series around the data average, truncated after the 2nd-order term, could approximate the true neural activity. If this approximation is good, the neural activity could be described by a second-order regression model (Figure 7D). If, however, the Taylor series fails to explain most of the variance in the neural activity, we would conclude that the true relationship between sensory inputs and neural activity requires higher-order terms and hence a more complex model (Figure 7D). The goodness-of-fit of the 2nd-order Taylor series therefore allows us to further divide neurons classified as nonlinear. Plotting the goodness-of-fit of the 2nd-order model against the nonlinearity probability shows that all but one neuron considered linear (nonlinearity probability < 0.5) are also well explained by the 2nd-order model (Figure 7E). This serves as a sanity check for our approach. Among nonlinear neurons, we used a cut-off of 50 % explained variance by the 2nd-order model to divide them into two complexity classes (Figure 7E). Importantly, complexity maps onto both functional (Figure 7F) and anatomical (Figure 7G) features. Anatomically, nonlinear neurons of complexity class one are enriched in ventral forebrain regions while neurons of complexity class two are found in the cerebellum and anterior hindbrain (Figure 7G). This is in line with our previous observation that neurons in the cerebellum respond to increased temperature. In summary, these data reveal the power of MINE to identify important computational features, leading to a detailed classification of neurons into functional types.
This automated, unbiased and model-free analysis has the potential to greatly aid the analysis of large-scale neural data. To facilitate the adoption of MINE, we expose its functionality through a simple Python interface.
This interface allows fitting the corresponding CNN to neural data and extracting all metrics (Figure 8).

Discussion
A common goal in analyzing neural data is to relate the activity of neurons to sensory stimuli, behavioral actions, internal states or other features observed during ongoing behavior. To explore this connection, we neuroscientists often define a model dictating the structure of this relationship. When well thought out, these defined models have the great advantage that their parameters can be interpreted in a biologically meaningful manner. However, even flexible models, if defined a priori, run the risk of not being able to match the transformations occurring in the less-than-intuitive biological brain.
Model selection based on probabilistic programs overcomes some of these challenges, allowing for a model-free and flexible approach while maintaining predictive power 50. One drawback of this approach, however, is reduced interpretability of the nature of the relationship between inputs (in our case stimuli, behaviors, etc.) and neural activity. A related line of work in interpretable machine learning aims to identify important input features driving the output of a classifier. In our case, however, we were interested to know which inputs persistently (across experimental time) are involved in driving the output of the network, i.e., the activity of the modeled neuron.
Notably, once the network is trained, the different inputs are no longer separable. This means that there is no guarantee that leaving out an unimportant input will not affect the output of the trained network. A workaround would be to train the network with subsets of inputs 25.
However, this becomes infeasible even with a moderate number of inputs, since all combinations would need to be tested. We therefore chose a continuously differentiable activation function for our network layers. This enabled Taylor decomposition to transform the network, point by point, into a separable sum of terms depending on individual inputs. This approach allowed us to infer which specific inputs drive the network output and hence the activity recorded in the modeled neuron. Notably, this technique can also identify multiplicative interactions, which indicate coincidence detection. Yet, while such interactions were evident in our ground-truth dataset, we found little evidence of multiplicative interactions in the biological data. This is likely due to added noise obfuscating the contribution of these terms, which is comparatively small even on ground-truth data (Figure 2D).
In fact, most information about a multiplicative interaction would be carried in the first derivative: the derivative with respect to one input would vary according to another input. We did not exploit this insight in the present work, but it could be used to disambiguate additive integration from coincidence detection.
To define the linearity of the relationship between features and neural activity, we used two complementary measures that allowed us to arrive at a well-performing classifier. On the one hand, we directly compute a measure of the curvature of the function implemented by the network. On the other hand, we compute the nonlinearity coefficient, a concept from explainable machine learning that estimates the accuracy of a linear approximation of the network 33,34,56. The artificial neural network fit by MINE must encapsulate the transformations between the chosen features and the neural activity in order to predict the activity. The classification of linear versus nonlinear processing by the network should therefore apply to the neural transformations as well. In our zebrafish dataset, most stimulus-driven neurons were classified as nonlinear while most behavior-driven neurons were classified as linear. This, however, does not mean that the stimulus is processed less linearly than the behavior: it is likely a result of our stimulus feature being the raw temperature, while our behavior features are removed from the nonlinear transformation that changes swim commands into appropriate tail motion.
Applying MINE to biological data revealed its potential to discover relationships between stimuli, behavior and neural activity. We currently determine linear receptive fields 35,38 of fit neurons using the first-order derivatives. Similar to using spike-triggered covariance to obtain nonlinear relationships between stimuli and neural activity 57, one could analyze eigenvectors of the matrix of second-order derivatives to get a deeper understanding of how this receptive field changes with stimuli in the case of neurons classified as nonlinear. Analysis on ground-truth data suggests that these eigenvectors indeed match results obtained through spike-triggered covariance analysis (Figure 4). One crucial advantage of MINE is that, unlike spike-triggered analysis, it does not rely on specifically designed stimuli. For spike-triggered analysis, stimuli should ideally be white noise (uncorrelated), and for covariance analysis intensities must be drawn from a Gaussian distribution 38,58. These constraints are impractical, especially when the goal is to study neural activity in freely behaving animals. However, we note that in cases where at least the Gaussian distribution of stimuli can be maintained, adverse effects of correlations can theoretically be overcome 59. Independent of MINE, the structure of receptive fields is influenced by the acquisition modality. For our zebrafish data, we used a nuclear-localized GCaMP-6s. This variant has a long decay time 43, and since MINE is fit on calcium data, effects of this decay time will appear in the extracted receptive fields. This likely explains why the receptive fields shown for the zebrafish data extend over many seconds. This issue could be overcome by better deconvolution of the imaging data, which would require a higher acquisition framerate.
In summary, MINE allows for flexible analysis of the relationship between measured quantities, such as stimuli, behaviors or internal states, and neural activity. The flexibility comes from MINE being essentially model-free, except for the choice of architecture and hyperparameters, which limit the functions the network can possibly approximate. This advantage can be seen in MINE identifying more neurons than regression approaches on our zebrafish dataset (Figure 6D), and doing so in a less biased manner. The regression approach especially failed on stimulus-driven units, leading to an overrepresentation of behavior-related activity (Figure 6G). The ability to learn arbitrary temporal transformations through the use of convolutional networks makes MINE particularly helpful in analyzing calcium data, since MINE does not require convolving inputs with a "calcium kernel" that mimics the temporal effects of the indicator 4. Otherwise, this kernel must be obtained through simultaneous electrophysiology recordings or has to be estimated by other means 4,60.
We propose that the fit convolutional neural network models the computation of circuits upstream or downstream of the neuron in question, but we neither claim nor investigate that it models the circuitry itself. This could be investigated in the future, but given the simplicity of the CNN used in MINE it is unlikely to provide interesting details in this manner. Bringing more advanced network-wide analysis methods to bear on the fit CNN could reveal other interesting features of the computation "surrounding" the neuron in question.
"full."The Jacobian (vector of first order derivatives of the output with respect to the input) and the Hessian (matrix of second order derivatives of the output with respect to the input) were computed using Tensorflow.For all analysis presented in the paper, Taylor expansions were performed every second (every five timepoints at our data-rate of 5 Hz) and predictions were made five seconds into the future (25 timepoints at our data-rate of 5 Hz).
The state of the predictors at time point t and the state of the predictors five seconds (in our case 25 samples) later were obtained. Subsequently these quantities, together with the Jacobian at time point t and the Hessian at time point t, were used to compute a prediction of the change in network output over five seconds. At the same time the true change in output was computed. Comparing the predicted and true changes via correlation resulted in an R² value. This value measures how well the full Taylor expansion approximates the change in network output across experimental time. We note that predicted changes in the output were compared via correlation rather than predicted outputs, since we specifically wanted to understand which predictors are important in driving a change in output rather than identifying those that the network might use to set some baseline value of the output.

In the following, P(t) is a vector that contains the values of all predictors across all timepoints fed into the network (e.g., in our case 50 timepoints of history for each predictor) in order to model the activity at time t, f(x) is the CNN applied to a specific predictor input, J(x) is the Jacobian at that input and H(x) the Hessian. The Taylor expansion around P(t):

f(P(t) + ΔP) ≈ f(P(t)) + J(P(t))ᵀ ΔP + ½ ΔPᵀ H(P(t)) ΔP    (1)

The predicted change in network output according to the Taylor expansion, with ΔP = P(t + 25) − P(t):

Δf_pred(t) = J(P(t))ᵀ ΔP + ½ ΔPᵀ H(P(t)) ΔP    (2)

The true change in network output:

Δf_true(t) = f(P(t + 25)) − f(P(t))    (3)

Variance of the output change explained by the Taylor prediction, using (2) and (3):

R²_full = corr(Δf_pred, Δf_true)²    (4)

After the computation of R²_full, terms were removed from the Taylor series. We note that both individual timepoints fed into the network and predictors are separated at this level. However, only the removal of predictors and its effect on the quality of the prediction was tested, not the importance of specific timepoints across predictors (i.e., all timepoints of a given predictor or interaction were removed to test the importance of that predictor, instead of removing all predictors at one timepoint to test the importance of a specific timepoint). For the removal of a given predictor, both its linear (via J) and its squared contribution (via H) were subtracted from the full Taylor prediction, while for the removal of interaction terms only the corresponding interaction term (via H) was subtracted. The following quantities were computed to arrive at the Taylor metric of single predictors P_n or interactions of two predictors P_n, P_m, where P_n and J_n identify subvectors that contain all elements belonging to the specific predictor n, while H_{n,m} identifies all elements of the Hessian at the intersection of rows belonging to second derivatives with respect to predictor n and columns with respect to predictor m.

Change predicted by only considering P_n:

Δf_{P_n}(t) = J_nᵀ ΔP_n + ½ ΔP_nᵀ H_{n,n} ΔP_n    (5)

Change predicted when removing P_n, using (2) and (5):

Δf_{-P_n}(t) = Δf_pred(t) − Δf_{P_n}(t)    (6)

Change predicted when only considering the interaction between P_n and P_m:

Δf_{P_n P_m}(t) = ΔP_nᵀ H_{n,m} ΔP_m    (7)

Change predicted when removing the P_n P_m interaction, using (2) and (7):

Δf_{-P_n P_m}(t) = Δf_pred(t) − Δf_{P_n P_m}(t)    (8)

Variance explained by the Taylor prediction after removing P_n, using (3) and (6):

R²_{-P_n} = corr(Δf_{-P_n}, Δf_true)²    (9)

Variance explained by the Taylor prediction after removing the P_n P_m interaction, using (3) and (8):

R²_{-P_n P_m} = corr(Δf_{-P_n P_m}, Δf_true)²    (10)

The Taylor metric was then defined as:

T_n = (R²_full − R²_{-P_n}) / R²_full    (11)

and

T_{n,m} = (R²_full − R²_{-P_n P_m}) / R²_full    (12)

respectively. For the zebrafish data we additionally calculated an overall "Behavior" Taylor metric in which all terms (regular and interaction) that belonged to any behavioral predictor were removed and the Taylor metric was subsequently calculated as outlined above.
Related python file in the repository: taylorDecomp.py
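A minimal sketch of the core computation (derivatives via TensorFlow and the second-order prediction of the output change, equation 2) is given below. Names and shapes are illustrative and do not mirror the taylorDecomp.py API.

```python
# Sketch of the Taylor-expansion machinery: Jacobian/Hessian extraction and
# the second-order prediction of the output change (equation 2). Illustrative,
# not the repository implementation.
import numpy as np
import tensorflow as tf

def jacobian_hessian(cnn: tf.keras.Model, x: np.ndarray):
    """x: one (history, n_predictors) input window; returns flattened J and H."""
    xt = tf.constant(x[None].astype(np.float32))
    with tf.GradientTape() as t2:
        t2.watch(xt)
        with tf.GradientTape() as t1:
            t1.watch(xt)
            y = cnn(xt)
        jac = t1.gradient(y, xt)       # first derivatives
    hess = t2.jacobian(jac, xt)        # second derivatives
    n = x.size
    return np.reshape(jac.numpy(), n), np.reshape(hess.numpy(), (n, n))

def predicted_change(jac, hess, p_t, p_t_dt):
    """Second-order Taylor prediction of f(P(t+dt)) - f(P(t)), as in equation (2)."""
    dp = (p_t_dt - p_t).ravel()
    return float(jac @ dp + 0.5 * dp @ hess @ dp)

# Correlating predicted_change across timepoints with the true output changes
# and squaring yields R²_full (equation 4); zeroing the entries of jac/hess
# belonging to one predictor before prediction yields the removed-term R²
# (equations 6 and 9) and hence the Taylor metric (equation 11).
```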

Characterizing the linearity and complexity of the transformation
Two metrics related to the nonlinearity of the function implemented by the CNN were calculated: a "curvature metric", which quantifies the magnitude of the influence of the second-order derivative, and the "nonlinearity coefficient", which approximates the error made by linearizing the network. It is important to note that, given random inputs that are unrelated to the training data, it is very likely that the CNN will produce outputs that do not reflect the transformations learned from the data; both metrics were therefore evaluated at experimentally observed inputs.

Comparison with spike-triggered analysis

Stimuli were generated as described in "Ground-truth datasets", page 26. The model consisted of two receptive fields (stimulus filters), RF_a and RF_v, of length 50 timepoints, as well as two nonlinearities g_sta and g_stc. Given a stimulus s, the rectified spike rate r was constructed according to:

r = [ g_sta((s * RF_a) / σ_{s*RF_a}) + g_stc((s * RF_v) / σ_{s*RF_v}) ]_+

where [·]_+ denotes rectification, * denotes convolution and σ_x is the standard deviation of the indicated quantity. Normalization by the standard deviations was performed to ensure that not one filtered response dominated the overall spike rate.
The two nonlinearities were constructed such that g_sta was asymmetric and therefore led to a shift in the average spike rate, while g_stc was symmetric around 0 so that it influences the variance of the spike rate but not the average. The model therefore generates a response function that, upon spike-triggered analysis (using appropriate white-noise stimuli with intensities drawn from a Gaussian distribution), will result in a spike-triggered average resembling RF_a, while the eigenvector with the largest eigenvalue of the spike-triggered covariance matrix (ignoring RF_a itself) will resemble RF_v.
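Since the exact functional forms of g_sta and g_stc are not reproduced here, the sketch below uses stand-in nonlinearities with the stated properties (an asymmetric one and one symmetric around 0) to illustrate the generative model.

```python
# Sketch of the ground-truth rate model. The true g_sta and g_stc are not
# reproduced; exp(x) (asymmetric) and x**2 (symmetric around 0) are stand-ins
# with the properties described in the text.
import numpy as np

def simulated_rate(s: np.ndarray, rf_a: np.ndarray, rf_v: np.ndarray) -> np.ndarray:
    fa = np.convolve(s, rf_a)[: len(s)]  # stimulus filtered by RF_a
    fv = np.convolve(s, rf_v)[: len(s)]  # stimulus filtered by RF_v
    # normalize each filtered response so neither dominates the rate
    drive = np.exp(fa / fa.std()) + (fv / fv.std()) ** 2
    return np.maximum(drive, 0.0)        # rectified spike rate
```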
Examining the Taylor expansion of the network output (equation 1) suggests that J has a similar influence on the network response as RF_a (the receptive field) has on neural responses: the projection of the input along J will influence the average response, just like the projection of the stimulus along RF_a will influence the spike rate. Similarly, the eigenvectors of H are directions in stimulus space along which the variance of the response will be modified, similar to the eigenvectors of the spike-triggered covariance matrix.
To test whether RF_a and RF_v could be recovered by MINE, the underlying CNN was fit using two thirds of the stimulus s as input and the generated spike rate r as target activity. The Jacobian J and the Hessian H were subsequently extracted from the network at the data mean. The Jacobian was then compared to RF_a by measuring the cosine similarity. Eigenvectors and eigenvalues were extracted from H.
The two eigenvectors with the largest eigenvalues were subsequently compared to J using the cosine similarity, and the eigenvector with the lower similarity was then compared to RF_v, again using the cosine similarity. The rationale for this is that RF_a affects both the average spike rate and its variance and will therefore generally be one of the two eigenvectors with the largest eigenvalues.
To compare the results obtained from MINE to those obtained when using the data for spike-triggered analysis, both a spike-triggered average and a spike-triggered covariance matrix were extracted. To this end, vectors x_i were constructed such that each contained the stimulus history of 50 timepoints leading up to time point i. Using the vector of spike rates r, the spike-triggered average and the spike-triggered covariance matrix were computed as weighted averages and weighted covariances according to:

STA = ( Σ_i r_i x_i ) / ( Σ_i r_i )

STC = ( Σ_i r_i (x_i − STA)(x_i − STA)ᵀ ) / ( Σ_i r_i )

The eigenvectors of the STC with the two largest eigenvalues were directly compared to RF_v, and the vector with the larger cosine similarity was chosen as the estimate of RF_v.
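A short sketch of these rate-weighted computations, following the formulas above:

```python
# Sketch of the rate-weighted spike-triggered average and covariance.
# X rows are 50-sample stimulus histories, r the corresponding spike rates.
import numpy as np

def weighted_sta_stc(X: np.ndarray, r: np.ndarray):
    w = r / r.sum()                      # normalize rates into weights
    sta = w @ X                          # weighted average stimulus history
    Xc = X - sta                         # center histories on the STA
    stc = (Xc * w[:, None]).T @ Xc       # weighted covariance matrix
    return sta, stc
```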
For whitening, ZCA whitening 63 was performed to preserve the original rotation of the data. All x_i were adjusted by whitening the matrix X (whose rows are the x_i) and then using its rows as the new x_i in all calculations above. The following steps were performed. The centered matrix X was created by subtracting the column means. The data covariance C was computed according to:

C = XᵀX / (n − 1)

Eigendecomposition was performed on C:

C = U Λ Uᵀ

where U is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues. The whitening matrix W was subsequently computed according to:

W = U Λ^(−1/2) Uᵀ

The whitening transform W was applied to X:

X_white = X W

The whole procedure was repeated for 100 randomly generated stimuli to obtain different estimates for both MINE and spike-triggered analysis.
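The steps above translate directly into code; a sketch follows (the small epsilon guarding against numerically zero eigenvalues is an implementation detail, not from the text):

```python
# Sketch of ZCA whitening as described: W = U Λ^(-1/2) Uᵀ preserves the
# original rotation of the data. The epsilon is a numerical safeguard only.
import numpy as np

def zca_whiten(X: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    Xc = X - X.mean(axis=0)                    # centered matrix
    C = Xc.T @ Xc / (len(Xc) - 1)              # data covariance
    lam, U = np.linalg.eigh(C)                 # C = U Λ Uᵀ
    W = U @ np.diag(1.0 / np.sqrt(lam + eps)) @ U.T  # whitening matrix
    return Xc @ W                              # whitened data
```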
Related python file in the repository: cnn_sta_test.py

Zebrafish experiments

Larval zebrafish (mitfa −/−; elavl3-H2B:GCaMP6s +/−; vGlut2a-mCherry) between five and seven days post fertilization were used for all experiments. Larval zebrafish were mounted and stimulated as previously described 41, including scan stabilization to avoid artifacts associated with heating-induced expansion of the preparation. The tail of the larva was left free and tracked at 250 Hz using a previously described online tracking method 41. Whenever our online tail tracking missed a point along the tail, an image was saved from which the tail was subsequently re-tracked offline, as in 65. Since the experiments were performed under a new microscope, we re-calculated a model translating stimulus laser strength to fish temperature, as described in 41,66, subsequently referred to as the "temperature model".
For each imaging plane, 495 s of a "random wave" stimulus were generated by superimposing sine waves of randomized amplitudes with periods between 13 s and 200 s. The stimuli were set to generate temperatures between 22 °C and 30 °C, outside the noxious range.
For reticulospinal backfills, Texas-Red Dextran 10,000 MW (Invitrogen D1828) at 12.5 % w/v was pressure-injected into the spinal cord of larval zebrafish under Tricaine (Syndel MS222) anesthesia. Injections were performed in fish expressing nuclear-localized GCaMP-6s in all neurons (mitfa −/−; Elavl3-H2B:GCaMP6s) that were embedded in low-melting-point agarose, using a glass capillary with a 15 µm tip opening. Fish were unembedded after the procedure, woken from anesthesia and rested overnight. Fish were screened the next day for labeling, embedded again and imaged under the 2-photon microscope.

Processing of zebrafish data
Raw imaging data were pre-processed using CaImAn for motion correction and to identify units and extract associated calcium traces 45. Laser stimulus data were recorded at 20 Hz and converted to temperatures using the temperature model (see above). The 250 Hz tail data were used to extract bout starts based on the absolute angular derivative of the tail crossing an empirically determined threshold. The vigor was calculated as the sum of the absolute angular derivative across the swim bout, a metric of the energy of the tail movement. The directionality was calculated as the sum of the angles of the tip of the tail across the swim bout, a metric of how one-sided the tail movement is. Extracted calcium data, stimulus temperatures and the three behavior metrics were subsequently interpolated/downsampled to a 5 Hz timebase. To reduce linear dependence, the behavior metrics were then orthogonalized using a modified Gram-Schmidt process (a sketch follows below).
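For illustration, a minimal modified Gram-Schmidt pass over the behavior metrics might look as follows; the column order and the unit-normalization are assumptions, not specified in the text.

```python
# Sketch of modified Gram-Schmidt orthogonalization of the behavior metrics.
# Assumed column order (bout starts, vigor, directionality) and normalization
# are illustrative choices.
import numpy as np

def modified_gram_schmidt(M: np.ndarray) -> np.ndarray:
    """M: (timepoints, n_metrics); returns columns orthogonalized in order."""
    Q = np.array(M, dtype=float)
    for j in range(Q.shape[1]):
        for k in range(j):
            Q[:, j] -= (Q[:, k] @ Q[:, j]) * Q[:, k]  # remove projection
        Q[:, j] /= np.linalg.norm(Q[:, j])            # normalize column
    return Q
```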
CaImAn also identified components outside the brain as well as very dim components inside the brain. To avoid analyzing components that likely do not correspond to neurons labeled by GCaMP, all components for which the average brightness across imaging time was less than 0.1 photons were excluded. The number of analyzed components was thereby reduced from 706,543 to 432,882.
The CNN was subsequently fit to the remaining components using two thirds of imaging time as training data.
For each neuron, the stimulus presented during recording and the observed behavioral metrics were used as inputs.
Both the inputs and the calcium activity traces were standardized to zero mean and unit standard deviation before fitting. For every neuron for which the correlation r ≥ √0.5, all discussed MINE metrics were extracted.
To generate the cyclically permuted controls, all calcium data were rotated forward by a third of the experiment length, with wrap-around. The rationale for constructing the control data in this manner was that it maintains the overall structure of the data but, since both our stimuli and the elicited behavior are stochastic, it should remove any true relationship between predictors and responses. The CNN was fit to the control data in the same manner as to the real data, but no further metrics were extracted.
To identify which predictors significantly contributed to neural activity, R²_full, R²_{-P_n} and R²_{-P_n P_m} were bootstrapped, computing the Taylor metric for each variate. The standard deviation of the bootstrap variates was subsequently used to estimate confidence intervals according to a Normal distribution approximation. While confidence intervals could have been computed directly from the bootstrap variates, this would have meant storing all samples for all neurons in question to retain flexibility in the significance cut-off. The significance level was set to p < 0.05 after Bonferroni correction across all fit neurons (effective p-value of 1.26x10−6). To be considered a driver of neural activity, the Taylor metric had to be larger than 0.1 at this significance threshold.
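A sketch of the normal-approximation test described above; the threshold and corrected alpha come from the text, while the one-sided form and helper name are our assumptions.

```python
# Sketch of the significance test: the bootstrap standard deviation feeds a
# Normal approximation, and a predictor counts as a driver only if its Taylor
# metric exceeds 0.1 at the Bonferroni-corrected level (one-sided, assumed).
import numpy as np
from scipy.stats import norm

def is_driver(boot_metrics: np.ndarray, threshold: float = 0.1,
              alpha: float = 1.26e-6) -> bool:
    mean, sd = boot_metrics.mean(), boot_metrics.std(ddof=1)
    lower = mean - norm.ppf(1.0 - alpha) * sd   # lower confidence bound
    return lower > threshold
```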
The linear comparison model was constructed akin to 7. All predictors were time-shifted by between 0 and 50 timesteps, for a total of 200 predictor inputs to the regression model. This setup emulates the convolutional filters present in the CNN used by MINE. A modified Gram-Schmidt process was used for orthogonalization to avoid the problems of QR decomposition observed in cases of near-singular design matrices (singular up to the floating-point precision used). Ridge regression 9 was used to improve generalization. Setting the ridge penalty to 10−4 led to a 25-50 % increase in the number of identified neurons on small test datasets compared to a standard regression model.
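A sketch of the design-matrix construction and ridge fit; sklearn is used here for brevity and is an assumption, not necessarily the repository's implementation.

```python
# Sketch of the linear comparison model: each predictor is time-shifted by
# 0..49 samples (emulating the CNN's 50-sample filters) and fit with ridge
# regression (penalty 1e-4, as in the text).
import numpy as np
from sklearn.linear_model import Ridge

def shifted_design(P: np.ndarray, n_shifts: int = 50) -> np.ndarray:
    """P: (timepoints, n_predictors) -> (timepoints, n_predictors * n_shifts)."""
    T = len(P)
    cols = []
    for k in range(n_shifts):
        shifted = np.zeros_like(P)
        shifted[k:] = P[: T - k]     # predictor value k samples in the past
        cols.append(shifted)
    return np.concatenate(cols, axis=1)

# e.g., 4 predictors x 50 shifts = 200 inputs:
# model = Ridge(alpha=1e-4).fit(shifted_design(P), activity)
```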
All zebrafish stacks (except those acquired after spinal backfills) were first registered to an internal HuC-H2B:GCaMP6s reference using CMTK 67. A transformation from the internal reference to the Z-Brain 46 was subsequently performed using ANTs 68. Micrometer coordinates transformed into Z-Brain pixel coordinates, together with the region masks present in the Z-Brain, were then used to assign all neurons to brain regions.
To identify Z-Brain regions in which a specific functional type is enriched, the following process was used. The total number of all types under consideration in each region was computed (i.e., if the comparison was between stimulus, behavior and mixed-selectivity neurons, all these neurons were summed, but if the comparison was between different classes of behavior-related neurons, only those were summed). Subsequently, the fraction of the type of interest among the total in each region was computed (plotted as bars in Figures S5 and S6). Similarly, an overall fraction of the type across the whole brain was computed (plotted as vertical blue lines in Figures S5 and S6). This overall fraction was used to simulate a 95 % confidence interval of observed fractions given the total number of neurons considered in each region (horizontal blue lines in Figures S5 and S6). Any true fraction above this interval was considered significant enrichment. We note that exact confidence intervals could have been computed from the binomial distribution; however, the simulation approach allowed us to use the same method when computing confidence intervals around the average complexity of receptive field clusters.
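The simulation amounts to repeatedly drawing region counts at the brain-wide fraction; a sketch (simulation count and seeding are assumptions):

```python
# Sketch of the simulated confidence interval for regional enrichment: draw
# the number of type-of-interest neurons a region would contain if it matched
# the brain-wide fraction, and take the central 95 % of observed fractions.
import numpy as np

def enrichment_ci(p_brainwide: float, n_region: int,
                  n_sim: int = 10_000, level: float = 0.95):
    rng = np.random.default_rng(0)
    fracs = rng.binomial(n_region, p_brainwide, size=n_sim) / n_region
    tail = (1.0 - level) / 2 * 100
    return np.percentile(fracs, [tail, 100 - tail])  # (lower, upper) bounds
```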
For reticulospinal backfill experiments, neurons labeled by the backfill were manually annotated.This annotation was used to classify neurons (CaImAn identified components) as reticulospinal.
Experiments were excluded from further analysis for the following reasons: five fish died during functional imaging; eight fish could not be registered to the reference; tail tracking failed during the acquisition of six fish; and during the acquisition of four fish, problems in the acquisition hardware led to imaging artifacts.
Related python files in the repository: rwave_data_fit.py, rwave_decompose.py, utilities.py

Supplemental Figures
Figure S1: (Related to Figure 1) MINE accommodates a large set of predictor-activity relationships. A) Heatmap of all generated responses in the ground-truth dataset, 220 total across 11 groups including pure noise responses. B) Schematic of the expanded linear model, which includes all first-order interaction terms and convolution with the true "calcium kernel" used in the ground-truth dataset. We note, however, that since multiplication is a nonlinear operation, generating the interaction terms after convolution is not exactly the same as generating the interaction terms in the ground-truth dataset (where convolution occurs after the multiplication). C) For data not shown in Figure 1F, fraction of explained variance on test data for a simple linear regression model (blue), a linear regression model including first-order interaction terms and calcium kernel convolution (red) as well as the CNN model (orange). Each dot represents the average across 20 models. While the standard deviation is represented by a dash, it is smaller than the dot size in all cases and therefore not visible in the graph.

Figure S4 (caption excerpt): C) Same as B) but approximations derived using spike-triggered analysis. The spike-triggered average (STA, left) and eigenvector of the spike-triggered covariance matrix (STC, right) are compared to RF_A and RF_V, respectively. D) Across N = 100 randomly generated white-noise stimuli, distribution of cosine similarities between RF_A and the MINE Jacobian (orange), the spike-triggered average (blue) and a random vector (brown). E) As in D) but distribution of cosine similarities between RF_V and the MINE Hessian eigenvector (orange), the eigenvector of the spike-triggered covariance matrix (blue) and a random vector (brown). F) Example section of one smoothly varying stimulus trace used in the analysis. G) Across N = 100 randomly generated, smoothly varying stimuli, distribution of cosine similarities between RF_A and the MINE Jacobian (orange), the spike-triggered average (blue) after the stimulus has been pre-whitened using ZCA whitening, and a random vector (black). H) As in G) but distribution of cosine similarities between RF_V and the MINE Hessian eigenvector (orange), the eigenvector of the spike-triggered covariance matrix (blue) after the stimulus has been pre-whitened using ZCA whitening, and a random vector (black).

Figure S6 (caption excerpt): A) Across all imaged planes of all fish (orange lines), the autocorrelation of the provided stimulus. Black line is the average across all planes (N = 750). B) Boxenplot of the inter-trial correlations for the temperature stimulus and the three behavioral metrics. Note that we use the term "trial" here to denote equal thirds of the imaging period in one plane, which allows us to use two "trials" for training and one for testing the fit ANNs and linear models. Since there is no repetitive structure, the term "trial" is used loosely here. C) Scatterplot and correlations on free-swimming data between the "Vigor" (left) and "Directionality" (right) tail metrics and actual free-swimming displacements (left) and actual heading-angle changes (right). D) For different test-correlation thresholds (x-axis), the number of units for which the ANN has above-threshold test correlations (considered identified by the ANN). Red curve, real data; purple curve, rotated data, where the calcium activity has been rotated by one "trial" with respect to the stimulus and behavior predictors. Black dashed line indicates the threshold used in the paper and lists the ratio of identified units in the real
versus rotated data.E) Same as D) but comparing the ANN (replot from D) with above threshold correlations for the linear model.Dashed line again indicates the threshold used in the paper.F) Across all identified units, the average curvature metric according to the NLC metric.Data divided into 50 even-sized quantiles based on the NLC metric.Band is bootstrap standard error across all data-points within the quantile.G) Across all identified units, the average NLC metric according to the curvature metric.Data divided into 50 even-sized quantiles based on the curvature metric.Band is bootstrap standard error across all data-points within the quantile.H) For stimulus (orange), behavior (dashed green) and mixed selectivity (blue) neurons the distribution of nonlinearity probabilities.I) Identified temperature encoding units in the olfactory epithelium.Orange dots are units that according to Taylor metric encode the stimulus while black dots represent units within the epithelium that were not identified by the ANN to highlight medial enrichment of temperature encoding neurons.J-L) For each analyzed Z-Brain region the units encoding the stimulus (J), behavior (K) or both (L) as a fraction of the sum across these three groups.Red bars highlight the top five regions with significant enrichment plotted in Figure 5H.Open bars and gray text highlight those regions that are neither significantly enriched nor significantly depleted of the given type.Blue line is the expected (average) fraction of the given type.Error-bars mark a boot-strapped 95% confidence interval around the average of possible observed fractions if the region would in fact contain the expected fraction of units.Bars for which the top is above this confidence interval are therefore significantly enriched, bars for which the top is below are significantly depleted.Note that regions in which less than 50 neurons total across the listed categories have been identified, have not been analyzed for enrichment.Region abbreviations are listed in Table S1.
This allows us to make statements about a) how strongly individual predictors contribute to the activity of individual neurons, b) whether the mapping is likely linear or nonlinear and c) the structure of the neuron's receptive field. Aside from network architecture and initialization conditions, MINE is essentially model-free. Furthermore, by incorporating a convolutional layer, temporal or spatial transformations of inputs introduced by the technique (such as calcium indicator effects) or by the brain (such as differentiation of a signal, or edge detection) will be learned seamlessly by the model and do not have to be captured through a priori transformations of the task variables.
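As a concrete illustration of this architecture, the sketch below builds a small network of the shape described: one linear convolutional layer over a predictor history window followed by dense layers with a continuously differentiable activation. It uses Python/Keras; the window length, filter count and layer widths are illustrative assumptions, not the authors' exact hyperparameters.

```python
# A minimal sketch of a MINE-style network, assuming TensorFlow/Keras.
# Layer sizes and the 50-step history window are illustrative choices,
# not necessarily the authors' exact architecture.
import tensorflow as tf

n_predictors = 4      # e.g. S1, S2, M1, M2
hist_steps = 50       # input window of past timepoints (t - dt to t)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(hist_steps, n_predictors)),
    # Linear convolution across time; can absorb calcium-kernel-like
    # filtering or differentiation without hand-crafted regressors.
    tf.keras.layers.Conv1D(filters=32, kernel_size=hist_steps,
                           activation=None, use_bias=False),
    tf.keras.layers.Flatten(),
    # Continuously differentiable activations keep the learned
    # function amenable to Taylor expansion (Jacobian/Hessian).
    tf.keras.layers.Dense(64, activation="swish"),
    tf.keras.layers.Dense(64, activation="swish"),
    tf.keras.layers.Dense(1)  # predicted activity at time t
])
model.compile(optimizer="adam", loss="mse")
```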

Figure 1: MINE accommodates a large set of predictor-activity relationships. A) Schematic of the convolutional neural network used. B) The predictors that make up the ground-truth dataset. S1 and S2 are continuously variable predictors akin to sensory variables while M1 and M2 are discrete in time, akin to motor or decision variables. Dashed lines indicate a third each of the data, with the first two thirds used for training of models and the last third used for testing. C) Schematic representation of ground-truth response generation. D) Example responses in the ground-truth dataset. Labels on the right refer to the response types shown in C). E) The model predicts activity at time t using predictors across time t − ∆t to t as inputs. The schematic shows how this is related to the generation of training and test data. Top inset shows the development of training and test error across training epochs for 20 models trained on the R(S1)xR(S2) response type. Bottom inset shows an example prediction (orange) overlaid on the response (dark green). F) Fraction of explained variance on test data for a simple linear regression model (blue), a linear regression model including first-order interaction terms and calcium-kernel convolution (red), and the CNN model (orange). Each dot represents the average across 20 models. While the standard deviation is represented by a dash, it is smaller than the dot size in all cases and therefore not visible in the graph.
and D). Training for 100 epochs led to excellent predictions of "neural activity" while maintaining generalizability, as assessed by the fraction of explained variance (R² value) on the test data (Figure 1E-F). We sought to compare our ANN-based approach to another widespread approach to model single-neuron activity, linear regression. Using the four predictors (Figure 1B) as inputs, a simple linear regression model fails to explain activity in most cases (Figure 1F and S1C).
Compared to regression models, MINE seemingly has a drawback: a lack of interpretability. Statistical methods can identify factors that significantly contribute to a regression model's output. Similarly, the magnitude of individual weights in models fit to data can give an idea of the importance of specific predictors. Corresponding overt parameters do not exist in the CNN model. Theoretically, it would be possible to re-fit successful CNN models in a stepwise manner, leaving out or adding in specific predictors 25 . However, since we are interested in uncovering relationships between predictors, this could only succeed if all individual combinations of inputs were tested. This would be prohibitive for large sets of sensory and behavioral predictors, as it would require repeatedly re-training all networks. We therefore perform local Taylor expansions of the network at different experimental timepoints. In other words, we differentiate the network's learned transfer function that transforms predictors into neural activity. This allows us to transform the complex nonlinear CNN with interdependent parameters into a series of simple polynomial equations, much like the expanded linear regression model. Since we chose a continuously differentiable activation function for our network, we can compute the derivatives (first, second, third, etc.) of the output with respect to any predictor and timepoint in the input. We specifically focus on the first-order derivatives (the Jacobian J, i.e. single-predictor influences) and the matrix of second-order derivatives (the Hessian H, i.e. pairwise interactions between predictors). Since the decomposition is a sum of terms that depend on individual inputs or their combinations, we can use it to determine how important individual predictors, such as sensory stimuli, are in shaping the neural activity.
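These derivatives can be obtained with any automatic-differentiation framework. The following minimal sketch uses JAX, with a toy `predict` function standing in for the trained CNN's forward pass; the placeholder function is our assumption, not the published implementation.

```python
# Sketch: local Taylor expansion of a fitted network at one timepoint.
# `predict` is a toy stand-in for the trained CNN's scalar output as a
# function of the flattened predictor window x.
import jax
import jax.numpy as jnp

def predict(x):
    # placeholder forward pass; in practice this wraps the fitted model
    return jnp.tanh(jnp.sum(x) ** 2)

x0 = jnp.ones(200)                 # e.g. 4 predictors x 50 timepoints
J = jax.jacobian(predict)(x0)      # first-order terms (Jacobian)
H = jax.hessian(predict)(x0)       # pairwise interactions (Hessian)

def taylor2(x):
    """Second-order Taylor approximation of the network around x0."""
    d = x - x0
    return predict(x0) + J @ d + 0.5 * d @ H @ d
```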

Figure 2: Taylor decomposition reveals contributing single factors and interactions. A) The neural model translates predictors into neural activity. By computing the vector of first-order derivatives (Jacobian J) and the matrix of second-order derivatives (Hessian H), the function implemented by the CNN can be linearized locally. Relating changes in predictors to changes in activity for full and partial linearizations reveals those predictors and interactions that contribute to neural activity. B) Example of Taylor metric computation. Left, relationship between the CNN output and the full Taylor approximation. Middle, after removal of the term that contains the S1 predictor. Right, after removal of the term that describes the interaction between S1 and S2. C-E) Three example responses and associated Taylor metrics. Red bars indicate predictors that are expected to contribute, blue bars those that should not contribute. Dashed line indicates a threshold at 5×10⁻³ which identifies true contributing predictors in all cases. Error bars are 95% bootstrap confidence intervals. C) M2 response type. D) R(S1) x R(S2) response type. Arrowheads indicate the metrics shown in the middle and right examples of B). E) Response type encoding the absolute value of S1.
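To make the "remove one term, measure the loss" idea of panel B) concrete, the sketch below scores each input dimension by the explained variance lost when all Taylor terms involving it are dropped. J, H and the sampled input changes are synthetic stand-ins, and MINE's exact normalization of this metric may differ.

```python
# Illustrative Taylor-metric-style computation with synthetic derivatives.
import numpy as np

rng = np.random.default_rng(0)
n = 8                                            # predictor-timepoint dims
J = rng.normal(size=n)                           # first-order derivatives
H = rng.normal(size=(n, n)); H = (H + H.T) / 2   # symmetric Hessian

def taylor_change(d, keep=None):
    """Predicted activity change for input change d; optionally zero out
    all expansion terms involving dimensions not in `keep`."""
    if keep is not None:
        mask = np.zeros(n); mask[list(keep)] = 1.0
        d = d * mask
    return J @ d + 0.5 * d @ H @ d

deltas = rng.normal(size=(1000, n))
full = np.array([taylor_change(d) for d in deltas])

def taylor_metric(drop_dim):
    """Fraction of variance left unexplained when one dimension's
    linear and interaction terms are removed."""
    keep = [i for i in range(n) if i != drop_dim]
    partial = np.array([taylor_change(d, keep) for d in deltas])
    return np.var(full - partial) / np.var(full)

print([round(taylor_metric(i), 3) for i in range(n)])
```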

Figure 3: Measures of nonlinearity separate linear from non-linear predictor-activity relationships

Figure 4: Comparison of MINE and spike-triggered analysis. A) Approximations of the receptive field affecting the average spike rate (RFA) using the Jacobian of the CNN fit by MINE (left) and of the receptive field affecting the variance of the spike rate (RFV) using the eigenvector of the Hessian of the CNN fit by MINE (right). Black lines are the true receptive fields, thin orange lines are derived approximations across N = 100 randomly generated, smoothly varying stimuli. B) Same as A) but approximations derived using spike-triggered analysis. The spike-triggered average (STA, left) and the eigenvector of the spike-triggered covariance matrix (STC, right) are compared to the RFA and RFV, respectively. C) Across N = 100 randomly generated, smoothly varying stimuli, distribution of cosine similarities between the RFA and the MINE Jacobian (orange), the spike-triggered average (blue) and a random vector (black). D) As in C) but distribution of cosine similarities between the RFV and the MINE Hessian eigenvector (orange), the eigenvector of the spike-triggered covariance matrix (blue) and a random vector (black).
The dataset consists of 22 task-related predictors (stimuli, reward states, instructed and non-instructed movements) and calcium activity time series across 200 temporal components which were used to compress each session's widefield imaging data 7 from 13 mice. It had previously been analyzed using linear regression 7 . We analyzed one session from each of the 13 mice present in the dataset with MINE (Figure 5A). As in the ground-truth data, we split the data into a training set (first two thirds of the session time) and a test set (last third). To be able to capture not only activity caused by past stimuli but also preparatory activity leading up to movements, we shifted the predictor traces with respect to the activity trace (see Methods). As expected, correlations between MINE predictions and true activity on the test dataset are overall smaller than on the training set; however, in the majority of cases the CNN generalizes well to the new data (Figure 5B). Given that many individual neurons contribute to each pixel and temporal component within this dataset, and that much brain activity is likely unrelated to the behavioral task at hand, we chose a lenient correlation cut-off of r = 0.1 on the test data to decide that a component had been successfully fit by MINE (Figure 5B-C). On average this led to the identification of more than 91% of all components per session (Figure 5C), on which we computed the Taylor metric and nonlinearity probability. Notably, MINE assigns very low probabilities of nonlinearity to all but five components, meaning that the relationships between predictors and neural activity are largely linear (Figure 5D). While this may seem surprising given the high nonlinearity of cortical processing, it is likely caused by the low resolution of the data. Each component blends the activity of hundreds of thousands of neurons, and averaging across many individual neurons which each may have their own nonlinear responses likely obfuscates nonlinear effects. Across all sessions we found a broad representation of predictors (Figure 5E). As expected from the findings of Musall et al. 7 , uninstructed movements appear overrepresented, showing the largest Taylor metric in more than half of the fit components. We next mapped the results of MINE back into acquisition space and recalculated the Taylor metrics for each imaging pixel for one example session. The obtained results further validate MINE's utility on biological data. Specifically, visual stimuli are shown to contribute most strongly to responses in visual cortical areas with the expected left-right asymmetry (Figure 5F). At the same time, comparing visual sensory responses to whisk responses shows that cortical regions enhanced for whisking (Figure 5G) correspond very well to regions marked strongly with a whisk event kernel in Figure 2 of Musall et al. 7 .
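A minimal sketch of this preprocessing and identification rule follows, assuming a simple edge-padded shift; the shift length and function names are our own illustrations, not the published implementation.

```python
# Sketch: shift each predictor so the network also sees a short stretch
# of "future" relative to the activity (capturing preparatory activity),
# then apply the lenient r = 0.1 identification cutoff from the text.
import numpy as np

def shift_into_future(pred, n_future):
    """Move predictor values n_future frames earlier relative to the
    activity trace, padding the end with the final value."""
    return np.concatenate([pred[n_future:], np.full(n_future, pred[-1])])

def is_identified(prediction, activity, r_cut=0.1):
    """True if the test-data correlation clears the lenient cutoff."""
    return np.corrcoef(prediction, activity)[0, 1] >= r_cut
```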

Figure 5: MINE identifies cortical features of sensori-motor processing during a learned task. A) Simplified schematic of the widefield imaging experiment. B) MINE test-data vs. training-data correlations on 200 temporal components from 13 sessions across 13 mice. Black dashed line is identity, red dashed line is the test correlation threshold used to decide that a component had been identified by MINE. C) In each of the 13 sessions, the fraction of identified components (dots) as well as the average (bar). D) Across 200 components each in the 13 sessions, the distribution of nonlinearity probabilities computed by MINE. E) Across all components from all 13 sessions that have been identified by MINE, the Taylor metrics that were significantly larger than 0. Components (rows) have been sorted according to the predictor with the maximal Taylor metric. F) Per-pixel Taylor-metric scores for the right visual stimulus ("rVisStim") subtracted from those of the left visual stimulus ("lVisStim"). A=anterior, L=left, R=right. G) As in F) but the sum of the visual stimulus Taylor metrics ("lVisStim+rVisStim") has been subtracted from the Taylor metric of the whisking predictor ("Whisk"). H) As in F) but the Taylor metric for the right grab predictor ("rGrab") has been subtracted from the Taylor metric of the left grab predictor ("lGrab"). I) Clustering of per-pixel receptive fields, separating excitatory (bold lines) from inhibitory responses (pale lines). Pixels were selected to only include the top 10% of Taylor metrics for visual (left plot) and grab (right plot) predictors; left blue, right red. Numbers in parentheses indicate cluster size. The gray dashed line indicates time 0, i.e. the time at which calcium activity is measured. Note that the sensory receptive field (visual) is exclusively in the past (future stimuli do not drive the neurons) while the motor receptive fields (grab) slightly overlap with the future, indicating that neurons ramp up activity before the behavioral action.
The visual receptive fields have essentially zero weight after the current time, which is expected since future stimuli are unlikely to influence neural activity. The receptive fields of left and right grab neurons, on the other hand, are much sharper, indicating that these neurons influence movement over short timescales (Figure 5I, right). Furthermore, the grab-related receptive fields contain coefficients different from baseline for up to 100 ms into the future. This suggests preparatory activity, i.e., that future movements are reflected in current neural activity.
We used the temperature stimulus and behaviors across time as inputs to MINE, fitting CNNs to each neural response (Figure 6B and C), using the initial two thirds of time as training data and the last third as test data. Notably, due to the random nature of both our stimulus and the elicited behavior, test and training data were largely uncorrelated (Figure S6B). Predictive power over the test data is therefore a good indication that the neural network model generalizes and truly captures the relationship between stimulus and behavioral data and neural activity. We chose 50% of explained variance on the test data, corresponding to a correlation of r = √0.5, as our cutoff to determine whether a neuron can be modeled by MINE. Using cyclic permutations of our dataset as a control revealed that this cut-off corresponds to a 93-fold enrichment over the control data (Figure S6D).
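The control can be sketched as follows, assuming the recording splits into three equal "trials"; function names are illustrative rather than MINE's API.

```python
# Sketch of the cyclic-permutation control: rotate the calcium traces by
# one "trial" (a third of the recording) relative to the predictors,
# refit, and compare identification rates at the r = sqrt(0.5) cutoff.
import numpy as np

def rotate_one_trial(activity, n_trials=3):
    """Cyclically shift activity by one trial length along time."""
    return np.roll(activity, activity.shape[-1] // n_trials, axis=-1)

def fold_enrichment(test_r_real, test_r_rotated, r_cut=np.sqrt(0.5)):
    """Ratio of identified units in real vs. trial-rotated data."""
    return np.mean(test_r_real >= r_cut) / np.mean(test_r_rotated >= r_cut)
```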

Figure 6: Using MINE to probe larval zebrafish thermoregulatory circuits. A) Experimental design. Larval zebrafish expressing nuclear GCaMP6s in all neurons and mCherry in glutamatergic neurons are imaged under a 2-photon microscope while random heat stimuli are provided with a laser and behavior is inferred through tail motion (middle). Left, example temperature trajectory during one trial. Right, behavioral responses recorded during the same trial. B) Deconvolved calcium traces of all neurons identified in the plane imaged during the same trial as shown in A) (heatmap), sorted by the test correlation achieved after ANN fitting (plot on the left). Orange arrowheads mark the same time-points in A) and B). Orange dashed line indicates the fit cutoff used for deciding that a neuron was identified by MINE, blue line marks a Pearson correlation of 0. C) Illustration comparing MINE to the comparison linear regression model, which uses time-shifted predictors to have access to the same information as the convolutional neural network used in MINE. D) Venn diagram illustrating what fraction of CaImAn-identified neurons across the whole dataset were identified by MINE, the comparison LM model, or both. E) Upset plot of functional classes identified by Taylor analysis across 25 fish. Bar plot at the top indicates the total number of neurons in each class on a logarithmic scale. Dot plot at the bottom marks the significant Taylor components identified in each functional class. Functional classes are sorted by size in descending order. Horizontal barcode plot on the right indicates the total number of neurons whose activity depends on the given predictor. Orange filled bars mark classes only driven by the stimulus, green open bars those only driven by behavioral predictors, while blue bars mark classes of mixed sensori-motor selectivity. F) Classification of neurons labeled by reticulospinal backfills (inset shows example labeling) across six fish. Bar color as in E). G) For different functional neuron classes identified by MINE, the fraction also identified by the linear comparison model. H) Maps of the top five Z-brain regions enriched for stimulus-related neurons (left), behavior-related neurons (middle) and neurons of mixed selectivity (right). Dorsal view top, lateral view bottom. Enriched regions are marked in red. CH: Caudal Hypothalamus; H: Habenula; IPN: Interpeduncular nucleus; M4: Telencephalic migrated area 4; MVN: Medial Vestibular Nucleus; P: Pallium; RS: Raphe Superior; SP: Subpallium; TL: Torus Longitudinalis; TS: Rhombencephalic Vglut2 Stripe 2 and Vmat2 Stripe 2 and Glyt2 Stripe 2; VII: Facial motor and octavolateralis efferent neurons.

MINE thus identified a new functional class of neurons in the thermoregulatory circuit which might play a key role in behavioral thermoregulation by giving the animal the ability to relate behavioral output to temperature changes and thereby characterize the thermal landscape. Analyzing the success of the linear comparison model according to functional class revealed a bias towards the identification of behavior-related activity, while stimulus-related activity was frequently missed.

Figure 7: Functional subdivisions of thermoregulatory circuits. A) Top five brain regions enriched for neurons that encode specific aspects of movements. Swim starts left, vigor (related to swim displacement) middle and directionality (related to turn angle) right. DT: Dorsal Thalamus; LC: Locus Coeruleus; nIII: Tegmental Oculomotor nucleus III; P: Pallium; RH3: Rhombomere 3; RH4: Rhombomere 4; SP: Subpallium; SPV: Tectum Stratum Periventriculare; TSC: Torus semicircularis; VGS2: Vglut2 Stripe 2; VMS1: Vmat Stripe 1; VMS2: Vmat Stripe 2; VT: Ventral Thalamus; XV: tenth nerve vagus motorneuron cluster. B) Sub-clustering of stimulus-selective neurons according to their temporal receptive field (left heatmap). Right heatmap visualizes for each cluster what fraction of that cluster is present in the four major subdivisions of the zebrafish brain. Arrowheads indicate differentially distributed example clusters highlighted in C). C) Temporal receptive fields of example clusters. Each plot shows the influence of a change in temperature on the activity (as measured by calcium) of a neuron within the cluster. Numbers reflect the number of neurons present in each given cluster. Dashed lines indicate 0; the number of 0-crossings of each receptive field indicates whether the neuron responds to the absolute temperature value (no crossings, cluster 10), to the first derivative (velocity of temperature; increasing, cluster 1; decreasing, cluster 4) or to the second derivative (acceleration of temperature, cluster 3). D) Schematic of how the Taylor expansion of the fitted ANN around the mean of the data indicates what order terms (constant, linear, 2nd or higher order) are used in the transformation of the stimulus to the observed neural activity. E) Scatter plot of the goodness-of-fit of a second-order Taylor expansion model against the nonlinearity probability. These two metrics together divide the neurons into three classes of increasing complexity. F) Average complexity of each receptive field cluster shown in B) (gray bars). The blue horizontal line marks the total average complexity and vertical blue lines indicate bootstrapped 95% confidence intervals around the average complexity. If the gray bar is above or below that interval, the complexity within that cluster deviates significantly from the data's average complexity. G) For each complexity class, the significantly enriched brain regions (2 regions for complexity 0, 4 for complexity 1 and 3 for complexity 2). CB: Cerebellum; MTB: Medial tectal band; POA: Preoptic area of the hypothalamus; PT: Posterior tuberculum; RH1: Rhombomere 1; RH2: Rhombomere 2; RH6: Rhombomere 6; SP: Subpallium; SPV: Tectum Stratum Periventriculare.
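The zero-crossing heuristic from panel C) can be written down directly; the sketch below is illustrative, and the tolerance for ignoring near-zero coefficients is our assumption.

```python
# Sketch: classifying temporal receptive fields by their zero crossings,
# following the caption's heuristic (0 crossings ~ absolute temperature,
# 1 ~ first derivative, 2 ~ second derivative).
import numpy as np

def zero_crossings(rf, eps=1e-6):
    """Count sign changes, ignoring coefficients near zero."""
    signs = np.sign(rf[np.abs(rf) > eps])
    return int(np.sum(signs[1:] != signs[:-1]))

def rf_category(rf):
    return {0: "absolute value", 1: "first derivative",
            2: "second derivative"}.get(zero_crossings(rf), "higher order")
```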
(Figure S7E-G), indicating that this division is not arbitrary but rather reflects a meaningful difference in neuron function. Specifically, averaging the complexity score for the different receptive-field clusters reveals that four of these clusters have much lower complexity than expected, indicating that most neurons in these clusters have responses that depend linearly on the temperature.

Figure 8: The programmatic interface to MINE. […] and the output (in our case neural activity).

Here we have presented a method, "Model-free Identification of Neural Encoding" (MINE), that uses simple convolutional neural networks to relate behaviors, stimuli or internal states to concurrently recorded neural activity. Despite the method's increased flexibility over predefined models, MINE maintains interpretability by applying Taylor decomposition and concepts from explainable machine learning. As an example, MINE identified more neurons related to the experimental task in the zebrafish brain, with less bias, than a regression model. In particular, MINE's ability to jointly model the relationship of neural activity to ongoing stimuli and stochastic behavior allowed identifying a new class of thermoregulatory neurons. These integrate temperature stimuli with information about behavioral actions. Moreover, we quantified the computational complexity of neurons and characterized their receptive fields. Different paradigms are used to interpret neural activity. MINE specifically supports an encoding view, inspecting which information is encoded by the neurons of interest. Artificial neural networks have, however, also been used from a decoding perspective, probing what information can be gleaned about ongoing behavior from neural activity 51,52 . Here we restricted the question of information encoding to individual neurons. Nonetheless, MINE could easily be extended to model joint population encoding. The most straightforward approach to this problem would be to decompose the population activity through methods such as principal or independent component analysis. These components would then be fit by MINE instead of single-neuron activity (see the sketch below). Similarly, we only demonstrate the relationship of neural activity to external features (stimuli, behavioral actions), but neural population activity or local field potentials could be used as predictors. Artificial neural networks have been used extensively as models of brains to identify principles underlying neural processing 53 . In the present work, convolutional neural networks exclusively serve as tools to find relationships between features of interest and the activity of individual neurons. The network is a stand-in for the circuits transforming these features upstream (for sensation) or downstream (for behavior) as well as for the computational properties of the analyzed neuron. We are therefore interested in properties of the whole CNN instead of its individual units. Extracting the CNN properties aims to characterize the neuron of interest in terms of how it encodes the features. Such an approach has been used previously to characterize neural receptive fields, recognizing the power of ANNs in capturing nonlinear transformations, for example in characterizations of nonlinear visual receptive fields [18][19][20][21] and nonlinear auditory receptive fields 22 . We go beyond these approaches in this study. MINE handles and disambiguates multiple different inputs originating from different domains (such as sensory stimuli and behavioral actions). MINE furthermore performs a more detailed characterization, producing information about the linearity of the relationship between inputs and neural activity. These advances are made possible through mathematical analysis of the network. Different techniques aim at understanding how artificial neural networks arrive at their outputs given a set of inputs 54 .
learning" and are particularly important in understanding what features classifiers use to make their decisions.These include visualization techniques as well as layer-wise relevance propagation55 haviors and neural activity across species and imaging modalities.On a mouse cortical widefield imaging dataset MINE recovered previously published structure in the neural data.Task elements driving neural activity could be mapped to expected cortical regions and receptive field analysis revealed expected structure in visual receptive fields while the receptive fields associated with neurons encoding instructed movements showed preparatory activity.Performing 2-photon calcium imaging in larval zebrafish while simultaneously presenting random temperature stimuli and recording behavior, allowed a deeper dive into thermoregulatory circuits.MINE allowed us for the first time to identify neurons that integrate temperature stimuli and information about behavior.This was not possible previously since regression-based approaches failed to identify neurons encoding temperature while clustering to identify these neurons relied on removing stochastic motorrelated activity41 .Neurons integrating stimuli and behavior might serve an important role in thermoregulation.Specifically, they might allow larval zebrafish to estimate thermal gradients by comparing behavioral actions and their consequences in terms of changes in environmental temperature.Combining receptive field analysis with the idea of computational complexity, on the other hand, provides a way to functionally subdivide temperature responsive neurons in great detail.This will greatly aid future studies trying to understand thermoregulatory circuits.Notably, the functional features we identified map to anatomical features in the brain pointing to structural subdivisions mirroring function.The ability to compare neurons, even though we used randomized stimuli, indicates another big advantage of MINE.When recording neural activity during free exploration, across animals different neurons will be recorded under very different stimulus and behavior conditions.This means that their responses can usually be compared only by choosing different trigger events (certain behaviors, specific stimulus conditions) and analyzing neural activity around those.MINE on the other hand allows comparing networks fit on the whole activity time series removing the requirement and bias of choosing specific triggering events.One surprising finding in comparing the cortical and zebrafish brain responses was the prevalence of nonlinear processing in zebrafish and the apparent lack of it in the cortical dataset.This is likely due to the different spatial resolution in the datasets.While the zebrafish data has single neuron resolution this is not the case for the cortical wide-field data especially since we apply MINE not to individual pixels but activity components.Each of these components combines the activity of thousands of neurons.It is likely that this averaging obfuscates individual nonlinear effects.

Figure S2: (Related to Figure 2) Taylor decomposition reveals contributing single factors and interactions. A) Schematic highlighting key parts of the decomposition that isolates individual factors. B-H) Remaining responses and associated Taylor metrics. Red bars indicate predictors that are expected to contribute, blue bars those that should not contribute. Dashed line indicates a threshold at 5×10⁻³ which identifies true contributing predictors in all cases. Error bars are 95% bootstrap confidence intervals. B) Response type only depending on the S1 predictor. C) Response type only depending on the S2 predictor. D) Response type only depending on the M1 predictor. E) Response to multiplicative interaction of the S1 and S2 predictors. F) Response to multiplicative interaction of the rectified S2 and original M1 predictors. Note that the S2xM1 term, which should contribute, has the lowest metric. G) Response thresholding the S2 predictor. H) Response to the derivative of S1.

Figure S3: (Related to Figure 3) Measures of nonlinearity separate linear from non-linear predictor-activity relationships. A) Calculation of the nonlinearity metrics, curvature and NLC, and their relationship to the data and neural model. B-D) Linear correlations between input predictors and ground-truth "neural activity" for different mixtures. Colors of correlation plots correspond to the colored examples in Figure 3. B) Mixing of a nonlinear transformation y = tanh(2x)² and a linear transformation of the predictor. C) Mixing of a predictor and its derivative. D) Mixing of a predictor and a rectified version y = ln(eˣ + 1) of the predictor. Note: outputs are recentered to 0 mean and 1 standard deviation before being passed to the ANN and before being plotted here.
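The mixtures described in panels B-D can be generated as follows; the mixing weight and function name are illustrative assumptions.

```python
# Sketch of the ground-truth mixing used to test the nonlinearity
# measures: blend a predictor with a nonlinear transform of itself,
# then standardize as the caption notes.
import numpy as np

def mixed_response(x, a, kind="tanh_sq"):
    transforms = {
        "tanh_sq": lambda v: np.tanh(2 * v) ** 2,     # panel B
        "derivative": lambda v: np.gradient(v),       # panel C
        "softplus": lambda v: np.log(np.exp(v) + 1),  # panel D (rectifier)
    }
    y = a * transforms[kind](x) + (1 - a) * x         # nonlinear/linear mix
    return (y - y.mean()) / y.std()                   # 0 mean, 1 std
```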

Figure S4: (Related to Figure 4) Comparison of MINE and spike-triggered analysis. A) Example section of a white-noise stimulus trace used in the analysis. B) Approximations of the receptive field affecting the average spike rate (RFA) using the Jacobian of the CNN fit by MINE (left) and of the receptive field affecting the variance of the spike rate (RFV) using the eigenvector of the Hessian of the CNN fit by MINE (right). Black lines are the true receptive fields, thin orange lines are derived approximations across N = 100 randomly generated white-noise stimuli. C) Same as B) but approximations derived using spike-triggered analysis. The spike-triggered average (STA, left) and the eigenvector of the spike-triggered covariance matrix (STC, right) are compared to the RFA and RFV, respectively. D) Across N = 100 randomly generated white-noise stimuli, distribution of cosine similarities between the RFA and the MINE Jacobian (orange), the spike-triggered average (blue) and a random vector (brown). E) As in D) but distribution of cosine similarities between the RFV and the MINE Hessian eigenvector (orange), the eigenvector of the spike-triggered covariance matrix (blue) and a random vector (brown). F) Example section of one smoothly varying stimulus trace used in the analysis. G) Across N = 100 randomly generated smoothly varying stimuli, distribution of cosine similarities between the RFA and the MINE Jacobian (orange), the spike-triggered average (blue) after the stimulus has been pre-whitened using ZCA whitening, and a random vector (black). H) As in G) but distribution of cosine similarities between the RFV and the MINE Hessian eigenvector (orange), the eigenvector of the spike-triggered covariance matrix (blue) after the stimulus has been pre-whitened using ZCA whitening, and a random vector (black).

Figure S5: (Related to Figure 5) MINE identifies cortical features of sensori-motor processing during a learned task. A) Per-pixel Taylor-metric scores for the right visual stimulus ("rVisStim") subtracted from those of the left visual stimulus ("lVisStim") for the twelve mice not shown in Figure 5F. B) As A) but the sum of the visual stimulus Taylor metrics subtracted from the Taylor metrics of the whisking predictor for the same twelve mice. C) As A) but the Taylor metric for the right grab predictor has been subtracted from the Taylor metric of the left grab predictor.

Figure S6: (Related to Figure 6) Using MINE to probe larval zebrafish thermoregulatory circuits. A) Across all imaged planes of all fish (orange lines), the autocorrelation of the provided stimulus. Black line is the average across all planes (N = 750). B) Boxenplot of the inter-trial correlations for the temperature stimulus and the three behavioral metrics. Note that we use the term "trial" here to denote equal thirds of the imaging period in one plane, which allows us to use two "trials" for training and one for testing the fit ANNs and linear models. Since there is no repetitive structure, the term "trial" is used loosely here. C) Scatterplot and correlations on free-swimming data between the "Vigor" (left) and "Directionality" (right) tail metrics and actual displacements (left) and actual heading-angle changes (right). D) For different test-correlation thresholds (x-axis), the number of units for which the ANN has above-threshold test correlations (considered identified by the ANN). Red curve, real data; purple curve, rotated data in which the calcium activity has been rotated by one "trial" with respect to the stimulus and behavior predictors. Black dashed line indicates the threshold used in the paper and lists the ratio of identified units in the real versus rotated data. E) Same as D) but comparing the ANN (replotted from D) with above-threshold correlations for the linear model. Dashed line again indicates the threshold used in the paper. F) Across all identified units, the average curvature metric according to the NLC metric. Data divided into 50 even-sized quantiles based on the NLC metric. Band is bootstrap standard error across all data points within the quantile. G) Across all identified units, the average NLC metric according to the curvature metric. Data divided into 50 even-sized quantiles based on the curvature metric. Band is bootstrap standard error across all data points within the quantile. H) For stimulus (orange), behavior (dashed green) and mixed-selectivity (blue) neurons, the distribution of nonlinearity probabilities. I) Identified temperature-encoding units in the olfactory epithelium. Orange dots are units that according to the Taylor metric encode the stimulus, while black dots represent units within the epithelium that were not identified by the ANN, to highlight the medial enrichment of temperature-encoding neurons. J-L) For each analyzed Z-Brain region, the units encoding the stimulus (J), behavior (K) or both (L) as a fraction of the sum across these three groups. Red bars highlight the top five regions with significant enrichment plotted in Figure 6H. Open bars and gray text highlight those regions that are neither significantly enriched nor significantly depleted of the given type. Blue line is the expected (average) fraction of the given type. Error bars mark a bootstrapped 95% confidence interval around the average of possible observed fractions if the region in fact contained the expected fraction of units. Bars whose top is above this confidence interval are therefore significantly enriched; bars whose top is below are significantly depleted. Note that regions in which fewer than 50 neurons total across the listed categories were identified were not analyzed for enrichment. Region abbreviations are listed in Table S1.

Figure S7: (Related to Figure 7) Functional subdivisions of thermoregulatory circuits. A-C) For each analyzed Z-Brain region, the units encoding swim starts (A), vigor (B) or directionality (C) as a fraction of the sum across these three groups. Red bars highlight the top regions with significant enrichment plotted in Figure 7A. Open bars and gray text highlight those regions that are neither significantly enriched nor significantly depleted of the given type. Blue line is the expected (average) fraction of the given type. Error bars mark a bootstrapped 95% confidence interval around the average of possible observed fractions if the region in fact contained the expected fraction of units. Bars whose top is above this confidence interval are therefore significantly enriched; bars whose top is below are significantly depleted. Note that regions in which fewer than 50 neurons total across the listed categories were identified were not analyzed for enrichment. Region abbreviations are listed in Table S1. D) Temporal receptive fields of all clusters. Each plot shows the influence of a change in temperature on the activity (as measured by calcium) of a neuron within the cluster. Numbers after the colon reflect the number of neurons present in each given cluster. E-G) Same as A)-C) but for sensory units of complexity 0 (E), complexity 1 (F) or complexity 2 (G). Red bars highlight the top regions with significant enrichment plotted in Figure 7G. Note that since this comparison only contains stimulus-related units, far fewer regions pass the minimum threshold of 50 neurons.
[Fragment of Table S1 (region abbreviations): nucleus of the medial longitudinal fascicle; NNIV: Rhombencephalon, noradrenergic neurons of the Interfascicular and Vagal areas; Area of the Pretectum (M1).]