Estimating and interpreting nonlinear receptive fields of sensory responses with deep neural network models

Sensory processing by neural circuits includes numerous nonlinear transformations that are critical to perception. Our understanding of these nonlinear mechanisms, however, is hindered by the lack of a comprehensive and interpretable computational framework that can model and explain nonlinear signal transformations. Here, we propose a data-driven framework based on deep neural network regression models that can directly learn any nonlinear stimulus-response mapping. A key component of this approach is an analysis method that reformulates the exact function of the trained neural network as a collection of stimulus-dependent linear functions. This locally linear receptive field interpretation of the network function enables straightforward comparison with conventional receptive field models and uncovers nonlinear encoding properties. We demonstrate the efficacy of this framework by predicting the neural responses recorded invasively from the auditory cortex of neurosurgical patients as they listened to speech. Our method significantly improves the prediction accuracy of auditory cortical responses particularly in nonprimary areas. Moreover, interpreting the functions learned by neural networks uncovered three distinct types of nonlinear transformations of speech that varied considerably in primary and nonprimary auditory regions. By combining two desired properties of a computational sensory-response model; the ability to capture arbitrary stimulus-response mappings and maintaining model interpretability, this data-driven method can lead to better neurophysiological models of the sensory processing.


Introduction
Creating computational models to predict neural responses from the sensory stimuli has been one of the central goals of sensory neuroscience research (1)(2)(3)(4)(5)(6)(7)(8). Computational models can be used to form testable hypotheses by predicting the neural response to arbitrary manipulations of 2 stimulus and can provide a way of explaining complex stimulus-response relationships. As such, computational models that provide an intuitive account of how sensory stimuli are encoded in the brain have been critical in discovering the representational and computational principles of sensory cortices (9). One simple yet powerful example of such models is the linear receptive field model, which is commonly used in visual (2,10) and auditory (11,12) neuroscience research. In the auditory domain, linear spectrotemporal receptive field (STRF) (10,12,13) has led to the discovery of neural tuning to various acoustic dimensions, including frequency, response latency, and temporal and spectral modulation (14)(15)(16). However, despite the success and ease of interpretability of linear receptive field models, they lack the necessary computational capacity to account for the intrinsic nonlinearities of the sensory processing pathways (17). This shortcoming is particularly problematic in higher cortical areas where stimulus representation becomes increasingly more nonlinear (18,19). Several extensions have been proposed to address the limitations of linear models (20-32) (see (33) for review). These extensions improve the prediction accuracy of neural responses, but this improvement comes at the cost of reduced interpretability of the underlying computation, hence limiting novel insights that can be gained regarding sensory cortical representation. In addition, these methods assumed a specific model structure whose parameters are then fitted to the neural data. This assumed model architecture thus limits the range of the nonlinear transformations that they can account for. This lack of a comprehensive yet interpretable computational framework has hindered our ability to understand the nonlinear signal transformations that are found ubiquitously in the sensory processing pathways (26,34,35).
A general nonlinear modeling framework that has seen great progress in recent years is the multilayer (deep) neural network model (DNN) (36,37). Theses biologically inspired models 3 are universal function approximators (38) and can model any arbitrarily complex input-output relation. Moreover, these data-driven models can learn any form of nonlinearity directly from the data without any explicit assumption or prior knowledge of the nonlinearities. This property makes these models particularly suitable for studying the encoding properties of sensory stimuli in the nervous system (39-41) whose anatomical and functional organization remains largely speculative. A major drawback of DNN models, however, is the difficulty in interpreting the computation that they implement because these models are analytically intractable (42). Thus, despite their success in increasing the accuracy of prediction in stimulus-response mapping, their utility in leading to novel insights into the computation of sensory nervous systems is lacking.
To overcome these challenges, we propose a nonlinear regression framework in which a DNN is used to model sensory receptive fields. An important component of our approach is a novel analysis method that allows for the calculation of the mathematically equivalent function of the trained neural network as a collection of stimulus-dependent, linearized receptive fields.
As a result, the exact computation of the neural network model can be explained in a manner similar to that of the commonly used linear receptive field model, which enables direct comparison of the two models. Here, we demonstrate the efficacy of this nonlinear receptive field framework by applying it to neural responses recorded invasively in the human auditory cortex of neurosurgical patients as they listened to natural speech. We demonstrate that not only these models more accurately predict the auditory neural responses, but also uncover distinct nonlinear encoding properties of speech in primary and nonprimary auditory cortical areas.
These findings show the critical need for more complete and interpretable encoding models of neural processing, which can lead to better understanding of cortical sensory processing.

Neural recordings
To study the nonlinear receptive fields of auditory cortical responses, we used invasive electrocorticography (ECoG) to directly measure neural activity from five neurosurgical patients undergoing treatment for epilepsy surgery. One patient had high-density subdural grid electrodes implanted on the left hemisphere, with coverage primarily over the superior temporal gyrus (STG). All five patients had depth electrodes (stereotactic EEG) with coverage of Heschl's gyrus (HG) and STG ( Fig 1A). While HG and the STG are functionally heterogenous and each contain multiple auditory fields (43)(44)(45)(46)(47), HG includes the primary auditory cortex, while the STG is considered mostly a nonprimary auditory cortical area (48). The patients listened to stories spoken by four speakers (two females) with a total duration of 30 minutes. All patients had selfreported normal hearing. To ensure that patients were engaged in the task, we intermittently paused the stories and asked the patients to repeat the last sentence before the pause. Eight separate sentences (40 seconds total) were used as the test data for evaluation of the encoding models, and each sentence was repeated six times in a random order.
We used the envelope of the high-gamma  band of the recorded signal as our neural response measure, which correlates with the neural firing in the proximity of electrodes (49,50). The high-gamma envelope was extracted by first filtering neural signals with a bandpass filter and then calculating the Hilbert envelope. We restricted our analysis to speech-responsive electrodes, which were selected using a t-test between the average response of each electrode to speech stimuli versus the response in silence (t-value > 2). This criterion resulted in 50 electrodes in HG and 47 in STG. Electrode locations are plotted in Fig 1A on the average FreeSurfer brain (51) where the colors indicate speech responsiveness (t-values). 5 We used the auditory spectrogram of speech utterances as the acoustic representation of the stimulus. Auditory spectrograms were calculated using a model of the peripheral auditory system that estimates a time-frequency representation of the acoustic signal on a tonotopic axis (52). The speech materials were split into three subsets for fitting the modelstraining, 6 validation, and test. The test was obtained by averaging the neural responses to the repeated sentences to reduce the effect of neural noise on model comparisons. The remainder of the data was split between training and validation subsets (97% and 3% respectively). There was no stimulus overlap between the three subsets.

Linear and nonlinear encoding models
For each neural site, we fit one linear (LN) and one nonlinear (NLN) regression model to predict its activity from the auditory spectrogram of the stimulus ( Fig 1A). The input to the regression models was a sliding window with a duration of 400 ms with 10 ms steps, which was found by maximizing the prediction accuracy ( Supplementary Fig 1). The linear encoding model was a conventional STRF that calculates the linear mapping from stimulus spectrotemporal features to the neural response (10). The nonlinear regression model was implemented using a deep convolutional neural network (CNN (53)) consisting of two stages: a feature extraction network and a feature summation network. This framework, commonly used for nonlinear regression (54)(55)(56), consists of extracting a high-dimensional representation of the input (feature extraction), followed by a feature summation network to predict neural responses. The feature extraction network comprises three convolutional layers each with eight 3x3 2D convolutional kernels, followed by two convolutional kernels with 1x1 kernels to reduce the dimensionality of the representation, thus decreasing the number of model parameters. The feature summation stage was a two-layer fully-connected network with 32 nodes in the hidden layer and a single output node. All hyperparameters of the network were determined by optimizing the prediction accuracy ( Supplementary Fig 2). All hidden layers had rectified linear unit (ReLU) nonlinearity 7 (57), and the output node was linear. A combination of mean-squared error and Pearson correlation was used as the training loss function (see Methods).

Predicting neural responses using linear and nonlinear encoding models
Examples of actual and predicted neural responses from the LN and NLN models are shown in LN predictions for the majority of electrodes (87 out of 97). Even though the overall prediction accuracy is higher for HG electrodes than ones in STG, the absolute improved prediction of NLN over LN is significantly higher for STG electrodes (Fig 1D; p < 0.003, one-tailed t-test). This higher improvement in STG electrodes reveals a larger degree of nonlinearity in the encoding of speech in the STG compared to HG.

Interpreting the nonlinear receptive field learned by DNNs
The previous analysis demonstrates the superior ability of the NLN model to predict the cortical representation of speech particularly in higher order areas, but it does not show what types of 8 nonlinear computation results in improved prediction accuracy. To explain the mapping learned by the NLN model, we developed an analysis framework that finds the mathematical equivalent linear function that the neural network applies to each instance of the stimulus. This equivalent function is found by estimating the derivative of the network output with respect to its input (i.e., the data Jacobian matrix (59)). We refer to this equivalent function as the locally linear receptive field (LLRF). The LLRF can be considered a STRF whose spectrotemporal tuning depends on every instance of the stimulus. As a result, the linear weighting function of the NLN model that is applied to each stimulus instance can be visualized in a manner similar to that of the STRF (see Supplementary Videos 1

and 2).
Finding the LLRF is particularly straightforward for a neural network with rectified linear unit nodes (ReLU), because ReLU networks implement piecewise linear functions ( Fig 2C). A rectified linear node is inactive when the sum of its inputs is negative or is active and behaves like a linear function when the sum of its inputs is positive (Fig 2A). Obtaining a linear equivalent of such a DNN for a given stimulus consists of first removing the weights that connect to the inactive nodes in the network ( Fig 2B) and replacing the active nodes with identity functions. Next, the remaining weights of each layer are multiplied to calculate the overall linear weighting function applied to the stimulus. Because the resulting weighting vector has the same dimension as the input to the network, it can be visualized as a multiplicative template similar to STRF ( Fig 2B). The mathematical derivation of LLRF is shown in the Methods. An example of LLRF at different time points for one electrode is shown in Fig 2D where each column is the vectorized LLRF (time lag by frequency) that is applied to the stimulus at that time point (for better visibility only part of the actual matrix is shown). This matrix contains all the variations of the receptive fields that the network applies to the stimulus at different time

Complexity of the nonlinear receptive field
We defined the complexity of the NLN model as the sum of the normalized singular values of the lag-frequency by time matrix (Fig 2D). The complexity for all electrodes in HG and STG is shown in Fig 2F and Fig 3). Furthermore, the linearized functions averaged over all samples closely resemble the STRF for the corresponding electrode, with the similarity decreasing with the complexity of the neural site. (Supplementary Fig 4).

Identifying various types of nonlinear receptive field properties
We showed that the nonlinear function of the NLN model can be expressed as a collection of linearized functions. To investigate the properties of these linearized functions learned by the models for various neural sites, we visually inspected the LLRFs and observed three general types of variations over time, which we refer to as: I) gain change, II) temporal hold, and III) shape change. We describe and quantify each of these three nonlinear computations in this section using three example electrodes that exhibit each of these types more prominently  13 The second type of LLRF change is temporal hold, which refers to the tuning of a site to a fixed spectrotemporal feature but with shifted latency in consecutive time frames (Fig 3B).
This particular tuning nonlinearity results in a sustained response to arbitrarily fast spectrotemporal features, hence resembling a short-term memory of that feature. An example of temporal hold for a site (E2) is shown in Fig 3B where  The last dimension of LLRF variation is shape change, which refers to a change in the spectrotemporal tuning of a site across stimuli. Intuitively, a model that implements a more nonlinear function will have a larger number of piecewise linear regions, each exhibiting a different LLRF shape. An example of shape change for a site (E3) is shown in Fig 3B, showing three different spectrotemporal patterns at these three different time points. To quantify the degree of change in the shape of the LLRF for a site, we repeated the calculation of network complexity but after normalizing and removing the effect of gain change and temporal hold from 14 the LLRFs. Therefore, shape change is defined as the sum of the normalized singular values of the gain-and shift-corrected lag-frequency by time matrix (Fig 2D). Thus, the shape change indicates the remaining complexity of the LLRF function that is not due to gain change or temporal hold. This stimulus-dependent change in the spectrotemporal tuning of sites reflects a nonlinearity that appears as the sum of all possible shapes in the STRF, as shown in Fig 3B. The distribution of the three nonlinearity types defined here across all neural sites in HG  (64,65), which allows the stimulus information to be combined across longer time scales.
Finally, the average of the shape change values is significantly higher in STG sites than in HG (p < 0.05), demonstrating that STG sites have more nonlinear encoding, which requires more diverse receptive field templates than HG sites.

Finding subtypes of receptive fields
As explained in the shape change nonlinearity, the NLN model may learn several subtypes of receptive fields for a neutral site. To further investigate the subtypes of receptive fields that the NLN model learns for each site, we used the k-means algorithm (66)

Contribution of nonlinear variations to network complexity and prediction improvement
So far, we have defined and quantified three types of nonlinear computation for each neural site gain change, temporal hold, and shape changeresulting in three numbers describing the stimulus-response nonlinearity of the corresponding neural population. Next, we examined how much of the network complexity ( Fig 2F) and improved accuracy over a linear model (Fig 1D) can be accounted for by these three parameters. We used linear regression to calculate the complexity and prediction improvement for each site from the gain change, temporal hold, and shape change parameters. The predicted and actual complexity of the models are shown in Supplementary Fig 6. The high correlation value (r = 0.92, p << 1) confirms the efficacy of these three parameters to characterize the complexity of the stimulus-response mapping across sites.
Moreover, the main effects of the regression (68) shown in Fig 5A suggest a significant contribution from all three parameters to the overall complexity of LLRFs in both HG and STG. The high correlation values between the actual and predicted improved accuracy ( Supplementary   Fig 6; r = 0.72, p < 0.001) show that these three parameters also largely predict stimulusresponse nonlinearity. The contribution of each nonlinear factor in predicting the improved prediction, however, is different between the HG and STG areas, as shown by the main effects of the regression in Fig 5B. All three factors contribute to the improved accuracy in STG sites, while only the gain change is a good predictor of the improvement in HG sites. This result reiterates the encoding distinctions we observed in HG and STG sites where temporal hold and shape change were significantly higher in STG than HG ( Fig 3C). Notably, the three types of LLRF variations are not independent of each other and covary considerably (Supplementary Fig   7). Together, these results demonstrate how our proposed nonlinear encoding method can lead to a comprehensive, intuitive way of studying nonlinear mechanisms in sensory neural processing by utilizing deep neural networks.

Discussion
We propose a general nonlinear regression framework to model and interpret any complex The computational framework we propose to explain the neural network function (81) can be used in any feedforward neural network model with any number of layers and nodes, such as in fully connected networks, locally connected networks (82), or CNNs (83). Nonetheless, one limitation of the LLRF method is that it cannot be used for neural networks with recurrent connections. Because feedforward models use a fixed duration of the signal as the input, the range of the temporal dependencies and contextual effects that can be captured with these models is limited. Nevertheless, sensory signals such as speech have long-range temporal dependencies for which recurrent networks may provide a better fit. Although we did not find a significant difference between the prediction accuracy of feedforward and recurrent neural networks in our data ( Supplementary Fig 8), the recent extensions of the feedforward architecture, such as dilated convolution (84) or temporal convolutional networks (85), can implement receptive fields that extend over long durations. Our proposed LLRF method would seamlessly generalize to these architectures, which can serve as an alternative to recurrent neural networks when modeling the long-term dependencies of the stimulus is crucial.
In summary, our proposed framework combines two desired properties of a computational sensory-response model; the ability to capture arbitrary stimulus-response mappings and maintaining model interpretability. We showed that this data-driven method reveals novel nonlinear properties of cortical representation of speech in the human brain which provides an example for how it can be used to create more complete neurophysiological models of sensory processing in the brain.

Participants and neural recording
Five patients with pharmacoresistant focal epilepsy were included in this study. All patients underwent chronic intracranial encephalography (iEEG) monitoring at Northshore University Hospital to identify epileptogenic foci in the brain for later removal. Four patients were implanted

Brain maps
Electrode positions were mapped to brain anatomy using registration of the postimplant computed tomography (CT) to the preimplant MRI via the postop MRI. After coregistration, electrodes were 21 identified on the postimplantation CT scan using BioImage Suite. Following coregistration, subdural grid and strip electrodes were snapped to the closest point on the reconstructed brain surface of the preimplantation MRI. We used FreeSurfer automated cortical parcellation (51) to identify the anatomical regions in which each electrode contact was located with a resolution of approximately 3 mm (the maximum parcellation error of a given electrode to a parcellated area was < 5 voxels/mm). We used Destrieux parcellation, which provides higher specificity in the ventral and lateral aspects of the medial lobe. Automated parcellation results for each electrode were closely inspected by a neurosurgeon using the patient's coregistered postimplant MRI.

Stimulus
Speech materials consisted of continuous speech stories spoken by four speakers (2 male and 2 female). The duration of the stimulus was 30 minutes and was sampled at 11025 Hz. Eight sentences (40 seconds) were used for testing and presented to the patients six times to improve the signal-to-noise ratio. The 30-minute data were split into two segments for training (30 minutes) and validation (50 seconds). There was no overlap between the training, test, and validation sets.
The input to the regression models was a sliding window of 400 ms (40 timesteps), which was chosen to optimize prediction accuracy ( Supplementary Fig 1). The windowing stride was set to one to maintain the same final sampling rate, and as a result, the two consecutive input vectors to the regression models overlapped at 39 time points.

Acoustic representation
An auditory spectrogram representation of speech was calculated from a model of the peripheral auditory system (52). This model consists of the following stages: 1) a cochlear filter bank consisting of 128 constant-Q filters equally spaced on a logarithmic axis, 2) a hair cell stage consisting of a low pass filter and a nonlinear compression function, and 3) a lateral inhibitory 22 network consisting of a first-order derivative along the spectral axis. Finally, the envelope of each frequency band was calculated to obtain a time-frequency representation simulating the pattern of activity on the auditory nerve. The final spectrogram has a sampling frequency of 100 Hz. The spectral dimension was downsampled from 128 frequency channels to 32 channels to reduce the model complexity.

Calculating spectrotemporal receptive fields (STRFs)
Linear STRF models were fitted using the STRFlab MATLAB toolbox (86,87). For each electrode, a causal model was trained to predict the neural response at each time point from the past 400 ms of stimulus. The optimal model sparsity and regularization parameters were chosen by maximizing the mutual information between the actual and predicted responses for each electrode.

DNN architecture
We designed a two-stage DNN consisting of feature extraction and feature summation modules ( Fig 1A). In this framework, a high-dimensional representation of the input is first calculated (feature extraction network), and this representation is then used to regress the output of the model (feature summation network). The feature extraction stage consists of three convolutional layers with eight 3x3 2D convolutional kernels each, followed by a convolutional layer with four 1x1 kernels to reduce the dimensionality of the representation. The output of this stage is the input to another convolutional layer with a single 1x1 kernel. A 1x1 kernel that is applied to an input with channels has 1x1x parameters, and its output is a linear combination of the input channels, thus reducing the dimension of the latent variables and, consequently, the number of trainable parameters. The feature summation stage is a two-layer fully connected network with a hidden layer of 32 nodes, followed by an output layer with a single node. All layers except the output layer have L2 regularization, dropout (88), and ReLU (57) activations. The output layer has regularization and linear activation. The parameters of the model, including the number of layers, the size of the convolutional kernels, and the number of fully connected nodes, were found by optimizing the prediction accuracy ( Supplementary Fig 2).

DNN training and cross-validation
The networks were implemented in Keras using the TensorFlow backend. A separate network was trained for each electrode. Kernel weight initializations were performed using a method specifically developed for DNNs with rectified linear nonlinearities (89) for faster convergence. We in which is the high-gamma envelope of the recorded neural data from a given electrode, and ̂ is the predicted response of the neural network.

Evaluating model performance (test-retest)
To account for the variations in neural responses that are not due to the acoustic stimulus, we repeated the test stimulus six times to more accurately measure the explainable variance. To obtain a better measure of the model's goodness of fit, we used a noise-corrected adjusted R-squared value instead of the simple correlation. Having responses to the same stimulus, 1 to , we defined and as the averages of odd and even numbered trials. Then, we calculated the noisecorrected correlation according to equation (2), where is the predicted response of a model, 1 is the average correlation of the predicted and actual responses, 2 is the correlation between response groups, and is our reported R-squared of the noise-corrected Pearson correlation. (2)

Computing locally linear receptive fields for convolutional neural networks
The first step for calculating the locally linear receptive field (LLRF) of a CNN consists of converting the CNN into a multilayer perceptron (MLP) ( Supplementary Fig 9) because calculating LLRFs for an MLP is more straightforward. To achieve this task, we must first convert each convolutional layer to its equivalent locally connected layer, which is essentially a sparse fully connected layer. To do so, we find the equivalent matrix for convolutional kernels 1 − Assume that all zeros tensor has dimensions × × × × × where and are, respectively, the rows and columns of the input to a convolutional layer; is the number of channels in the input; and is the number of kernels in a layer. Additionally, assume (the -th kernel of the layer) has dimensions × × , where H and W are the height and width of the kernel, respectively, and is defined as before. We begin by populating according to equation (3) for all values of , , and . Then, we reshape to ( * * ) × ( * * ) to obtain the 2D matrix that will transform the flattened input of the convolutional layer to the flattened output.
Of course, can be directly populated as a 2D matrix for improved performance.
where is the weighted sum of inputs to nodes in layer for input , ℎ represents the output of nodes in layer to the same input, and denotes the dependence on the parameters of the network.
In a network with ReLU nodes: which means that we can replace the product of and −1 with an adjusted weight matrix ̂− 1 defined by equation (7), where and are indices of nodes in layer and − 1.
Because LLRF can be defined as the gradient of the output with respect to the input of the network, we can speed up the calculation process by utilizing TensorFlow's built-in gradient calculation capability.

Linearized function robustness
To ensure the validity of our LLRF analysis considering the intrinsic randomness involved in training deep neural networks using a stochastic gradient-descent algorithm from different parameter initializations by random sampling of training data, we need to show that our computed LLRFs are robust across different instances of a network. We do this by training 10 instances for each electrode, then grouping them into two equal sets of 5 networks each and averaging the LLRFs over instances for each group. We compare the LLRFs from the two groups for all time steps and electrodes (Supplementary Fig 3). The left panel histogram suggests that LLRF are robust to initialization, while the right panel shows that the similarity becomes even stronger when the LLRF is less noisy (has higher gain). To counter initialization noise, all LLRFs used in this paper are averaged over at least five network instances.

Complexity estimation
To quantify the nonlinearity of the network receptive field, we measure the diversity of the equivalent linear functions that the network learns for different instances of the stimulus. To measure this function diversity, we calculated the singular-value decomposition (90) of the matrix containing all the linearized equivalent functions of a network (Fig 2D). Each singular value indicates the variance in its corresponding dimension; therefore, the steepness of the sorted singular values is inversely proportional to the diversity of the functions that are learned by the network. We define the complexity of the network as the sum of the normalized singular values time points corresponding to 200 ms, which was empirically determined as a sufficient upper limit). To accomplish this goal, we first zero padded the LLRFs by 200 ms on each side and shifted the LLRF at the ℎ future frame relative to the current frame by time steps. The number of future frames for which we found a significant correlation between the current LLRF and future latencyshifted LLRFs indicates the number of time points in which the current LLRF feature will appear 28 as shifted in latency, which we refer to as the temporal hold value of the LLRF. This procedure is shown in Supplementary Fig 5.

Estimating shape change
The shape change parameter represents the diversity of the linear functions that is not due to the gain change or temporal hold nonlinearities. To calculate this parameter, we first normalized the effects of gain change and temporal hold by dividing the LLRFs by their magnitude and time aligning the shifted LLRF features. Finally, we repeated the same procedure used for the calculation of complexity equation (9) but instead used the normalized, shift-corrected LLRFs to calculate the singular-value decomposition. The sum of the sorted normalized singular values indicates the diversity of the linear functions learned by the network due to a change in their spectrotemporal feature tuning.

Data availability
The data that support the findings of this study are available upon request from the corresponding author [NM].

Code availability
The codes for pre-processing the ECoG signals and calculating the high-gamma envelope are available at http://naplab.ee.columbia.edu/naplib.html (91).