A dynamic network model of temporal receptive fields in primary auditory cortex

Auditory neurons encode stimulus history, which is often modelled using a span of time-delays in a spectro-temporal receptive field (STRF). We propose an alternative model for the encoding of stimulus history, which we apply to extracellular recordings of neurons in the primary auditory cortex of anaesthetized ferrets. For a linear-non-linear STRF model (LN model) to achieve a high level of performance in predicting single unit neural responses to natural sounds in the primary auditory cortex, we found that it is necessary to include time delays going back at least 200 ms in the past. This is an unrealistic time span for biological delay lines. We therefore asked how much of this dependence on stimulus history can instead be explained by dynamical aspects of neurons. We constructed a neural-network model whose output is the weighted sum of units whose responses are determined by a dynamic firing-rate equation. The dynamic aspect performs low-pass filtering on each unit’s response, providing an exponentially decaying memory whose time constant is individual to each unit. We find that this dynamic network (DNet) model, when fitted to the neural data using STRFs of only 25 ms duration, can achieve prediction performance on a held-out dataset comparable to the best performing LN model with STRFs of 200 ms duration. These findings suggest that integration due to the membrane time constants or other exponentially-decaying memory processes may underlie linear temporal receptive fields of neurons beyond 25 ms.


Author summary
The responses of neurons in the primary auditory cortex depend on the recent history of sounds over seconds or less. Typically, this dependence on the past has been modelled by applying a wide span of time delays to the input, although this is likely to be biologically unrealistic. Real neurons integrate the history of their activity due to the dynamical properties of their cell membranes and other components. We show that a network with a realistically narrow span of delays and with units having dynamic characteristics like those found in neurons, succinctly models neural responses recorded from ferret primary auditory cortex. Because these integrative properties are widespread, our dynamic network provides a basis for modelling responses in other neural systems.

Introduction
The tuning properties of auditory neurons are commonly described by a spectro-temporal receptive field (STRF) model, which characterizes the linear dependence of the neural response on the sound spectrum at a range of latencies [1][2][3][4][5][6][7][8][9][10][11][12][13][14]. A static non-linearity is often applied to the output from the linear STRF-this linear non-linear (LN) model estimates the responses of neurons significantly better than the linear estimate [15,16]. While STRF models are somewhat successful in explaining the dependence of neural responses on the past few hundred milliseconds of stimulus history [17], these models do not show how this temporal aspect of receptive fields might be implemented biologically. The most direct biological interpretation of the STRF model would involve auditory cortical neurons receiving inputs at a range of simple delays spanning out to 200 ms; however, this is biologically unrealistic. The onset latencies of neurons in the ventral division of the medial geniculate body (vMGB), which provides the primary ascending input to primary auditory cortex (A1), are typically less than~14 ms in the ferret [18], the animal species that provided the data that were modelled in this study. In the cat vMGB response latencies are typically less than~20 ms [19], in the mouse less than~10 ms [20], and in the marmoset less than~40 ms [21]. These values are far less than the duration of stimulus history which influences the responses of A1 neurons. Although a few vMGB units do show onset latencies beyond this range (in cat up to 70 ms [19], in mouse up to 60 ms [20], and in marmoset up to 300 ms [21]), this small fraction is unlikely to be the main determinant of the dependence of the neural response in primary auditory cortex on stimulus history beyond a few tens of milliseconds. This is because there are other plausible mechanisms, such as the dynamic temporally-integrative properties intrinsic to neurons. In this study, using recordings from anesthetised ferrets, we asked how much of the dependence on recent stimulus history-the temporal receptive fields, of primary auditory cortical neurons can instead be explained by certain of these simple dynamical aspects of neurons-an approach more consistent with the known biology. To model a neuron's response, we constructed a neural network model whose output is the weighted sum of the responses of multiple units, each of which resembles an LN model. However, the response of each unit of the network was modified in accordance with a dynamic firing-rate equation. This low-pass filters the unit's response (by convolving with an exponential decay impulse response), providing a simple exponentially decaying memory. This integrative characteristic can be related to the capacitance and resistance of nerve cell membranes, and the individual time constant of each unit can be related to the membrane time constant of real neurons [22]. The dynamic aspect of our network can alternatively be interpreted in terms of other neurobiological dynamical phenomena, such as channel-based neural adaptation, short term synaptic plasticity, or recurrent network properties. We found that our biologically-motivated dynamical model can accurately capture the temporal receptive fields of neurons without the need for a wide span of latencies of input.

Ethics statement
In vivo electrophysiological recordings in ferrets were performed under ketamine (5 mg/kg/h) and medetomidine (0.022 mg/kg/h) anaesthesia. All animal procedures were performed under license from the UK Home Office and were approved by the University of Oxford Committee on Animal Care and Ethical Review.

Stimuli
Models were fitted to single-unit neural responses of anesthetized ferrets to natural sound stimuli. Altogether, 20 sound clips were presented containing human speech in different languages, other animal vocalizations (e.g. ferret) and environmental sounds (e.g. wind and water). All clips were 5 s long with a sampling rate of 48,828.125 Hz and with root mean square intensity ranging from 75 to 82 dB SPL.

Experimental setup and neural responses
The electrophysiological data used in this study were taken from a series of experiments used in a previous study by Harper et al. [23]. Briefly, recordings were made from the primary auditory cortical areas, A1 and the anterior auditory field (AAF) of 6 adult pigmented ferrets (5 females and 1 male) under ketamine (5 mg/kg/h) and medetomidine (0.022 mg/kg/h) anaesthesia for 20 repeats of the 20 natural sound clips played in random order. All animal procedures were performed under license from the United Kingdom Home Office and were approved by the local ethical review committee. In total, 56 penetrations resulted in 549 single and multi-units, of which 284 were single units. A single unit was taken for analysis only if its activity was driven by the stimulus according to the noise ratio (see below). For each unit, the number of spikes was counted in each 5 ms time bin and averaged over repeats, to provide a response profile y n (t), where t is time, and n is the clip number. The total number of time bins in a clip is T. For simpler notation, we will drop the subscript n, unless we note otherwise.

Noise ratio
The noise ratio was calculated as a measure of how much the response of each unit is dependent on the stimuli [9,24]. The noise ratio was measured over all 20 stimuli and 20 repeats. Any unit with a noise ratio > 40 were excluded from the study. Also, only putative single units were used. This resulted in 73 neurons for the study.

Cochleagrams
The models receive the input as a cochleagram, a spectrogram-like transformation of the sound waveform. The cochleagram approximates the spectral filtering performed by the auditory periphery and was calculated as follows [23,25]. For each sound clip, the amplitude spectrum was measured using 10-ms Hanning windows, overlapping by 5ms. The number of frequency channels was then reduced by weighted summation using overlapping triangular windows to provide 34 log-spaced channels (500 Hz to 22,627 Hz center frequencies, adapted from melbank.m (http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html)). Next, a log function was applied to the value in each time-frequency bin, and any values below a low threshold was set to the threshold. The cochleagram was normalized to zero mean and unit variance over the training set (see Cross-validation and model testing). The input x fq (t) at time t was the chunk of cochleagram preceding t, given by x fq (t) = K f (t−q+1), where f is the frequency channel and q is the latency and K f (t) is the cochleagram.

The LN model
The two stage LN model consists of a linear model followed by a sigmoid output nonlinearity.
Linear stage: where a(t) is the model neuron's activation, w fq is the input weight and b is the baseline activation. Both w and b are free parameters of the model and were estimated by regressing neural response y(t) against input x(t) using glmnet [26]. To overcome overfitting, the weights were regularized using an L1-norm (LASSO regularization). Thus a(t) can be seen as the best linear estimate of y(t) from x(t). The regularization hyperparameter λ controlled the strength of the regularization. To select λ, the model was subjected to k-fold cross-validation (see Cross-validation and model testing). Nonlinear stage: The nonlinear stage of the model was a parameterized sigmoid nonlinearity: The four parameters ρ i of the function were fitted by minimizing the squared error between v(t) and y(t).

The network receptive field (NRF) model
The NRF model is an artificial neural network with a single hidden layer of 20 hidden units (HU) which converge onto an output unit (OU). Each unit of the model is similar to a single LN model. The activation of the j-th HU, a j (t) is, The output of the j-th where g(a j (t)) is a sigmoid non-linear activation function, 1/(1+exp(−a j (t))). The HUs feed these outputs to the OU. The activation function of the OU a o (t) is, The output v o (t) of the OU is,

The dynamic network (DNet) model
A neuron's response depends on the activity of its presynaptic neurons through the total synaptic current they evoke. However, a neuron's membrane potential, and consequently its firing rate, is not an instantaneous function of this current. Due to the capacitance and resistance of the neural cell membrane, the membrane potential and hence the firing rate is approximately a low-pass filtered version of the synaptic current. A common simple dynamic model of neurons that captures this phenomenon is the differential equation (Eq 7) known as the firing-rate equation [22].
Here, τ cont is the membrane time constant, v is post-synaptic firing rate, I(t cont ) is the synaptic current at time t cont ,F() is a nonlinear function that relates synaptic current to the post-synaptic firing rate, and t cont is time. Here we use t cont and τ cont with the superscript 'cont' to denote that for this equation these variables operate in continuous time rather than discrete time. τ cont determines how rapidly firing rate approaches steady state for a constant synaptic current, and can be related to the membrane time constant [22]. Although the relationship is not strict, we shall call τ cont the membrane time constant. This is equivalent to convolving the instantaneous output of function F() with an exponential decay impulse response.
Eq 7 can be discretized by the Euler method, where c = 1 time-bin = 5 ms. We redefine v and I to depend on time-bin number, t, rather than continuous time (t = t cont / c, where t is an integer). Hence, the discretized form of the firing rate equation is: The dynamic network model incorporates units that behave like Eq 8. In recognition that the characteristics of units in our model may not exactly correspond to biophysical properties of neurons described by the firing rate equation, and to be consistent with the NRF model, we replace F() with g() and the current I(t−1) with activation a(t).
As a result, activation a j (t) of the j-th HU of the DNet model is the same as the NRF model, However, the output v j (t) is different from that of the NRF model, incorporating the discretized firing-rate equation.
Similarly, the activation a o (t) of the OU is, Here, τ j and τ o are the membrane time constants of j-th hidden unit and output unit respectively.

The sDNet model
The firing-rate equation (Eq 7) assumes that the time constant that governs the relationship between the current and the firing rate is substantially larger than the time constant that governs the decay of post-synaptic potentials. However, if we assume the reverse, i.e., that the synaptic time constant is substantially larger, a different simple model of the relationship between the firing rate and the current becomes appropriate [22]. This model is given by: Here, I is synaptic current and v is post-synaptic firing rate, z(t cont ) is the immediate influence of the presynaptic spikes on the synaptic current, F() is the nonlinear function of synaptic current to post-synaptic firing rate, t cont s is synaptic time constant, and t cont is time. Similar discretization, like that used for the DNet model (see above), can be used for Eq 13 to use this for the units of a neural network model. Again, replacing the synaptic current, I, with the activation a, the activation of hidden units of the sDNet model can be modelled as the following: And the hidden unit output, replacing F() with g(), is given by: Hence, in the sDNet model, the activation of each HU, rather than the output of the nonlinearity, is convolved with an exponential decay impulse response.
Similarly, for the output units, And the output, Here, τ sj and τ so are the synaptic time constants of j-th hidden unit and output unit respectively.

Objective function of the NRF and DNet models
The free parameters w jfq , w j , b j , and b o (also τ j and τ o for the DNet and τ sj and τ so for the sDNet) were optimized by minimizing the squared error between v o (t) and y(t) subject to L1-regularization of the weights. Thus, the objective function is given by, Here, n is included to indicate clip number, but was left out of other equations for simplicity. N is the number of clips used in training.

Parameter estimation
All models except for the LN model are fitted by minimizing the objective function with respect to the free parameters using the sum-of-function optimizer algorithm [27]. In using this algorithm, we take one clip to be one minibatch. The optimization algorithm requires calculation of the error gradients in respect to each of the parameters. For the NRF model, error gradients are calculated using standard chain rule (backpropagation). For the dynamic models, the process is similar (see below).
For the network models, before training, the weights were initialized by modified Glorot initialization from a uniform distribution ranging from-1 ffi ffi ffi ffi ffiffi fqþl p to + 1 ffi ffi ffi ffi ffiffi fqþl p , where fq is the number of input weights to a HU and l = 1 is the number of output weights from a HU. The biases were initialized similarly [28].
For the DNet model, to prevent the 1 t term running into mathematical error (when τ = 0), we substitute 1 t with h d ð Þ ¼ 1 1þd 2 . Parameters, d j and d o (for hidden and output units) were each independently initialized with the square root of a random variable from an exponential distribution with a mean of 1. Similar, substitution is used for the sDNet model.

Error-gradient calculation for the DNet model
To estimate the parameters, the gradients of the error with respect to various parameters are calculated using a method similar to the one used for real time recurrent learning [29].
, the dynamic equations for HUs (Eq 10) and OU (Eq 12) of the DNet model become respectively: Using the chain rule, the gradient of the error term, E(t) at time t in respect to the weights of the output layer, w j , The prime denotes the derivative of the function. Similarly, the gradient of the error term, E (t) in respect to the bias of the output layer, b o , Finally, the gradient of the error term, E(t) with respect to the parameter, d o , Hence, the gradient is passed forward from the current time step to the next time step to be used in calculation of the gradient at that new time step. At t = 1, the values of @v o ðtÀ 1Þ

Cross-validation and model testing
From the data for the 20 sound stimuli, 4 were chosen as a test set which was not used during training and cross-validation. The cross-validation set (the remaining 16 stimuli) was used to fit the models using k-fold cross validation, where k = 8. The cross-validation set was randomly divided into a training set of 14 stimuli and a validation set of 2 stimuli. The model was trained on the training set for 18 different values of the hyperparameter λ. A log spaced range of lambda values was used, but with a somewhat lower density at the extremes. For the LN model, the exact values of λ used were: 1.00 x 10 −1 , 2.00 x 10 −2 , 1.17 x 10 −2 , 6.84 x 10 −3 , 4.00 x 10 −3 , 2.34 x 10 −3 , 1.37 x 10 −3 , 8.00 x 10 −4 , 4.68 x 10 −4 , 2.74 x 10 −4 , 1.60 x 10 −4 , 9.36 x 10 −5 , 5.41 x 10 −5 , 3.20 x 10 −5 , 6.40 x 10 −6 , 1.28 x 10 −6 , 2.56 x 10 −7 , and 5.12 x 10 −8 . For the rest of the models, the values of λ used were: 1.00 x 10 −3 , 2.00 x 10 −4 , 1.17 x 10 −4 , 6.84 x 10 −5 , 4.00 x 10 −5 , 2.34 x 10 −5 , 1.37 x 10 −5 , 8.00 x 10 −6 , 4.68 x 10 −6 , 2.74 x 10 −6 , 1.60 x 10 −6 , 9.36 x 10 −7 , 5.41 x 10 −7 , 3.20 x 10 −7 , 6.40 x 10 −8 , 1.28 x 10 −8 , 2.56 x 10 −9 , and 5.12 x 10 −10 . For each of the fitted models, neural responses were then predicted for the validation set, and the correlation coefficient between the actual neural responses and the prediction was measured. This process was repeated 8 times for different non-overlapping validation sets. The model was then retrained with the whole cross-validation set using the λ value that provided the highest mean correlation coefficient over all 8 folds. Next, the retrained network was used to predict the neural responses to the test set. All the correlation coefficients and normalized correlation coefficients shown are for this held out test set, and all the model parameters shown are for the retrained network.
To ensure that all models receive the same amount of training data, models with different latency spans of STRFs are provided with exactly the same durations of neural response. This was done by clipping the necessary number of time points from the beginning of the neural response data.

Effective hidden units
The hidden layer of each of the networks contained 20 HUs, but because their weights were L1-regularized, they tended to develop substantive weights only if they explained aspects of the neural response that were not explained by any other units. As a result, many HUs developed weights close to zero, and thus the network models tended to have only a small number of effective HUs. The 'effectiveness' of each unit j was calculated [23] as the variance over time of the unit's weighted output w o v j (t). Any HU with variance greater than 5% of the sum of the variances of all 20 HUs is considered effective.

IE score
The IE score [23] measures whether the degree to which a HU is excitatory or inhibitory, on a scale between -1 and 1. Most inhibitory is IE = -1, when all of the input weights are negative or zero, and the output weight is positive, or when all of the input weights are positive or zero and the output weight is negative. Most excitatory is IE = 1, when all of the input weights are positive or zero, and the output weight is positive, or when all of the input weights are negative or zero and the output weight is negative.

Membrane time constant
Because the membrane time constant is modelled as 1+d 2 and the duration of each time step is 5 ms, the value of membrane time constant in ms is 5 (1+d 2 ).

Knocking out large time constants of the DNet model
After fitting a network model, HUs with time constant greater than the length of the STRF are selected. The output weights connecting these units to the output unit are then set to zero. As a result, HUs with large time constants become inactive during the test process. Everything else in the model remained unchanged.

Results
The neural response dataset used in this study was recorded from the primary auditory cortical areas, A1 and AAF, of anesthetized ferrets in response to a diverse selection of natural sounds (20 stimuli of 5 s duration), including speech, animal vocalizations and environmental sounds [23,25]. In total, 73 single-unit neural recordings showed sensitivity to the sound stimuli (see Methods), as measured by their noise ratios, and these were analysed in this study.
To probe the computations underlying stimulus encoding, models were fitted to estimate the neural firing rate (averaged over stimulus repeats) of each neuron as a function of the preceding sound stimulation (Fig 1). First, for all models, the sound stimuli were pre-processed to generate a spectrogram-like representation (a cochleagram) that approximates the processing that occurs in the cochlea and auditory nerve. Second, using responses to 16 of the natural stimuli, a parameterized model was trained to estimate the firing rate as a function of the preceding cochleagram. After parameters of the models were fitted to the data, their fit quality was tested on a held-out test set composed of the responses to the remaining 4 natural stimuli.

Using standard models, access to 200 ms of recent stimulus history is needed to achieve best prediction
For each neuron, an LN model was fitted to estimate its firing rate as a function of the preceding cochleagram. The LN model consists of an STRF, followed by a sigmoid nonlinearity ( Fig  2A). As mentioned in the Methods, the STRF is the weighted sum of the cochleagram over frequency components at a span of latencies. The effect of different spans of latencies (lengths) of the STRF was investigated. For the held-out test set, and averaged over the 73 neurons, the mean normalized correlation coefficient, mean CC norm [30,31], between the actual neural responses and the prediction of LN models with 25, 50, 100, 200, and 400 ms long STRFs are respectively 0.51, 0.64, 0.69, 0.71, and 0.71 (Fig 2B), where 1.0 is the maximum achievable prediction given the estimated neuronal noise and 0.0 indicates that there is no correlation between actual and predicted neural responses. The means of non-normalized Pearson correlation coefficients (CC) are 0.40, 0.50, 0.54, 0.55, and 0.55 respectively. Fig 2C shows STRFs of some example neurons estimated by the LN model.
A network receptive field model (the NRF model) was also developed and fitted to each neuron. This model consisted of the weighted sum of multiple LN-like units. This model is the same as a previously published version [23], except that it uses sigmoid nonlinearities instead of scaled tanh non-linearities. A sigmoid non-linearity was used to be consistent with the dynamic network model that we will come to, which requires non-negative outputs and hence also uses sigmoid non-linearities. The NRF model we use here is essentially an artificial neural network with a single hidden layer, using a sigmoid non-linearity in its hidden units and its single output unit (Fig 3A). The NRF model makes use of multiple component STRFs, which allows modelling of neuronal sensitivity to non-linear interactions of multiple features [23]. STRFs of at least 400 ms are needed for best performance in predicting neural responses ( Fig  3B). For the test set, the mean CC norm between neural responses and the model predictions made by NRF models with 25, 50, 100, 200 and 400 ms long component STRFs are 0.51, 0.64, 0.70, 0.72 and 0.73, respectively (mean CC 0.40, 0.50, 0.54, 0.55 and 0.56, respectively). The NRF model predicts the neural responses equal to or better than the LN model over all durations examined, although by a more modest amount than seen in an earlier study [23]. This may be because the NRF model in this study uses a sigmoid rather than tanh nonlinearity, or because the earlier study [23] used a different way of selecting the test set. Fig 3C shows the input weights (analogous to STRFs) for the 'effective' HUs (see Methods) of the model-fit for one example neuron, obtained using the NRF model. Fig 3D and Fig 3E show the 'effectiveness' (see Methods) and IE score (see Methods) of the units shown in Fig 3C.

The DNet model can characterize neural responses using very short duration STRFs
Although both the LN and NRF models can capture dependence on stimulus history using a span of input latencies extending up to at least 200 ms, there is little biological evidence for such a wide span of latencies in the auditory thalamic neurons that feed to auditory cortex (see Introduction). With the aim of better understanding the biological underpinnings of stimulus history dependence in auditory cortical responses, a model with a dynamic aspect was developed that integrates the output of its component units over time in a manner similar to those of real neurons (see Methods). Each unit in the DNet model includes exponentially-decaying response integration, motivated by the integrative properties of real neuronal membranes ( Fig  4A). This model is in essence the NRF model but with dynamic units, each modelled by the dynamic firing-rate differential equation (Eq 7) [22]. That is, each unit of the model applies an exponentially decaying impulse response to the output of their non-linear activation function. Each unit's decay rate is governed by its individually fitted time constant, τ (see Methods).
C. STRFs of some example neurons, for LN models with different latency spans. Red, excitatory fields, blue, inhibitory. Excitatory and inhibitory fields extend up to 100 ms in most cases and up to 200 ms in some cases.
https://doi.org/10.1371/journal.pcbi.1006618.g002 We found that the mean CC norm between the actual neural responses and the predictions of the DNet model with 25, 50, 100, 200 and 400 ms STRFs are respectively 0.71, 0.71, 0.70, 0.68 and 0.68 (Fig 4B) (mean CC 0.55, 0.55, 0.54, 0.52 and 0.52, respectively). Because the 25-ms model performs the best and the biological range of latencies are typically less than 30 ms [32], we focused on the 25-ms DNet model for further analysis. Fig 4C shows the input weights (analogous to STRFs) for the 'effective' hidden units (see Methods) of the model fits for 8 example neurons, using a DNet model with 25-ms long STRFs. Fig 4D and 4E show the 'effectiveness' (see Methods) and IE score (see Methods) of the units shown in Fig 4C. The 25-ms DNet model out performs the DNet with longer (200 ms and 400 ms) STRFs, as measured by prediction of neural responses on a held-out test set averaged over all 73 neurons (Fig 5A). It also achieves performance similar to the best performing LN model, but performs a little worse than the best performing NRF model (Fig 5A). This can be seen qualitatively in the time course of the prediction for an example neuron and stimulus in Fig 5B. A neuron by neuron comparison shows that this result is consistent for most neurons (Fig 5C and 5D). Also, the 25-ms LN and NRF models are outperformed by a large margin by the 25-ms DNet model (Fig 5A). A neuron by neuron comparison shows that this result is also consistent for almost all neurons (Fig 5E and 5F).
Note that the DNet model is equivalent to the NRF model when the discretized time constant for every DNet model unit is at the smallest possible value (τ = 1, d = 0). Hence, because all possible NRF models are a subset of all possible DNet models, the DNet should in theory always equal or outperform the NRF model. The fact that this is not the case suggests that, at least in some cases, the very best fits are not being found for the DNet model, perhaps as a consequence of some overfitting. That the DNet model performs worse for large latency spans, where there are more variables, is again probably due to challenges in fitting the DNet model, likely due to some overfitting to the training set.

Network structure and response integration are both required for an effective DNet
To understand how much of an effect HUs of the DNet model with long time constants have on the model's predictions, the fitted model was modified by 'knocking out' (see Methods) HUs with time constants longer than the duration of their STRFs. This was achieved by setting the output weights of those units to zero, while keeping all other parameters intact. We found that the 25-ms DNet model is the DNet whose CC norm is most reduced by this alteration (Fig  6A).
A single-unit dynamic model was also developed to examine the effect of a time constant in an LN-like single STRF model. Prediction performance of this model was found to be worse than the LN model (Fig 6B). This indicates that both long time constants and a network architecture are needed for the 25-ms DNet model to achieve prediction performance similar to that of standard models. Although the DNet model had 20 hidden units, the L1 regularization eliminated unnecessary units. Hence there were fewer effective HUs, and for the 25-ms DNet model this number was found to be between 2 and 8 ( Fig 6C). We defined an effective HU as one whose effectiveness (variance of weighted output, see Methods) exceeded a certain threshold.
In the 25-ms DNet model, the time constants of the effective HUs ranged between 5 and 475 ms, often far greater than the duration of the STRFs (Fig 6D). The time constants of the OUs of the DNet model ranged from 5 to 130 ms (Fig 6E). To investigate whether long time constants are associated with effective HUs, for each neuron HUs were ranked in order of effectiveness and the mean of the time constants of all HUs in the same rank (over all neurons) were calculated. This showed that high-ranked HUs tend to have long time constants (Fig 6F).

Hidden units with long time constants tend to be more inhibitory
To test whether a relationship exists between the time constant and unit's inhibitory/excitatory nature, we classified the effective HUs of the 25-ms DNet model into excitatory or inhibitory using an IE score (see Methods). If the output weight is positive, an IE score of 1 means a unit with entirely excitatory influence and an IE score of -1 means a unit with entirely inhibitory influence. We found that almost all units having large time constants are inhibitory in nature (Fig 7).

Membrane time constant versus synaptic time constant
In the DNet model, the integration governed by the time constant occurs after each unit's nonlinearity. Although the time constant is conceived and implemented in a manner consistent with a membrane constant, within the network it may capture other dynamic neural phenomena such as synaptic time constants, forms of a channel-based or synaptic adaptation, or certain forms of network recurrency. An alternative network model places the integration just before the nonlinearity, which is more consistent with the time constant being synaptic in nature [22]. To investigate how much of the neural activity can be explained by such a model, we modified the DNet model to form the sDNet model (synaptic DNet model), with each unit governed by Eqs 13 and 14 [22].
The sDNet model achieves best performance in predicting neural responses when the duration of the STRFs is 200 ms (CC norm = 0.71) (Fig 8A). Although the sDNet with 25-ms STRFs predicts neural responses better (CC norm = 0.67) than the 25-ms LN model (CC norm = 0.51) and the 25-ms NRF model (CC norm = 0.51), it is worse than the 25-ms DNet model (CC norm = 0.71). This suggests that a large percentage of the variance of neural responses can be explained by a time constant before non-linearity, but that having a time constant after the non-linearity further improves prediction.
While the largest time constants of the effective HUs of the sDNet model can be as large as 470 ms (Fig 8B), the distribution of time constants of output units have smaller values compared to the DNet model ( Fig 8C). As with the DNet model the more effective HUs of the sDNet model tend to have longer time constants (Fig 8D) and units with long time constants tend to be inhibitory in nature (Fig 8E).

Discussion
We find that a model of primary auditory cortical neurons which incorporates exponentiallydecaying memory processes-the DNet model-can capture some of the temporal aspects of the receptive field using very short (25 ms) component STRFs in a network. Standard models such as the LN model (and the NRF model) represent temporal receptive fields of neurons using delay lines whose latencies extend up to a few hundred milliseconds. Our results suggest that, beyond 25ms, the temporal receptive field can be modelled more succinctly by exponentiallydecaying memory processes than by delay lines.
When auditory neurons are fitted using a standard LN model, the resulting STRFs often contain narrowband inhibitory fields that decay over approximately 200 ms. In contrast, we find that the STRFs of the DNet model have short inhibitory fields with long time constants. This suggests that these temporally extended inhibitory fields in STRFs may reflect neural mechanisms that can be approximated by exponentially-decaying memory processes.
Although they have not been described in the ferret, a few vMGB units do show onset latencies beyond a few tens of milliseconds: in cat latencies up to 70 ms have been described [19], in mouse up to 60 ms [20], and in marmoset up to 300 ms [21]. Areas other than vMGB that project to cortex may also include long latency responses. Hence, we cannot rule out the possibility that long-latency inputs at least partially explain the dependence of auditory cortical neurons on stimulus history-indeed it is likely that they do to some extent. However, these long-latency responses are relatively rare and we show that it is possible to explain a lot of the dependence of the neural response in primary auditory cortex by exponentially-decaying memory properties consistent with those properties intrinsic to neurons, such as membrane dynamics.
There are many biological mechanisms which may underlie exponentially-decaying memory processes. Neurons have various intrinsic dynamic integrative and adaptive characteristics originating from basic membrane dynamics, channel-based adaptation, post-synaptic potentials and short-term synaptic plasticity [22]. Recurrent network effects could also underlie such memory processes. In a previous study, a model with synaptic depression has been shown to partially capture the integration of stimulus history in A1 [33]. Here, we found that a better prediction of neural responses can be achieved by a model based on the integrative properties of neural membranes (the DNet model) than by the one based on the dynamics of synaptic potentials (the sDNet model). We note, however, that although the DNet model was developed with membrane dynamics in mind, aspects of all the above listed dynamic processes could be captured by the same exponentially-decaying memory in the DNet. These different processes are not exclusive of each other, and combinations of them may contribute to the effects we find.
The firing rate of A1 neurons has been demonstrated to depend on stimulus history up tõ 4 s in the past [17]. This dependence cannot be fully captured by linear models such as STRFs (17). This indicates that there are some nonlinear dynamic processes that provide history dependence which cannot be captured either by a linear or LN models, nor by the DNet model. Perhaps this is because the DNet model has limited capacity to capture temporal dynamics that are not exponential in nature.
Much of the relationship between stimulus and neural responses results from neurons being embedded in a network structure and from the nonlinear properties of the units in this network. From the auditory periphery to the A1, there are numerous feedforward, feedback and lateral connections. Many previous models of auditory cortical neurons have not explicitly represented this network character. However, some recent models have incorporated non-linear contextual effects [14,25,[34][35][36]. Furthermore, a few models do explicitly incorporate multiple receptive fields [37][38][39][40][41][42][43] or more extensive network structure [23], which perform better in predicting neural responses of A1/AAF neurons than STRF models [23,37].
When modelling the dependence of neural responses on stimulus history, one approach is to include the history of a neuron's spikes and those of its recorded neighbours in a generalized linear model (GLM) model [35,44]. However, unlike the DNet model, this approach does not model multi-layer network structure with hidden units that are inferred but not recorded. Modelling hidden network structure has been applied to the salamander retina, where in vitro responses to simple artificial stimuli are well predicted by a network model (LNFDSNF) with various non-linearities and response delays mimicking the retinal pathway [45]. The model. E. Relationship between a HU's inhibitory/excitatory (IE) score and its time constant for the sDNet model. Most of the HUs with time constants greater than the duration of the STRFs (units right of the vertical dotted line) are inhibitory in nature. https://doi.org/10.1371/journal.pcbi.1006618.g008 Dynamic network model of temporal receptive fields in auditory cortex LNFDSNF model is particularly relevant to our study because it is a network model whose units have some memory of their past activity, though the form of this memory differs from that of the DNet model. The LNFDSNF model units have a feedback kernel that acts before the nonlinearity-in contrast, the DNet model units integrate their activity after the nonlinearity using an infinite impulse response that decays exponentially.
The succinct and biologically-motivated description of neural responses provided by the DNet model provides a basis from which to develop further models to account for the~30% of the CC norm that remains to be explained. One variant of the model that can be developed involves a network with multiple hidden layers where the units each have just a single latency (i.e. each unit in the first hidden layer uses just a spectral receptive field at a single delay). This approach provides an opportunity to build deeper neural-network-like models with fewer parameters, potentially overcoming overfitting problems for deep network models of neural responses [46]. This network can provide a basis for new network models that incorporate a greater range of intrinsic dynamic processes more explicitly, for example modelling in each unit its basic membrane dynamics, synaptic depression and channel-based adaptation together, with different appropriate time constants in each case. Such models may show more predictive power.
Neurons are intrinsically dynamic systems, each with memory of past events. Furthermore, neurons are embedded in networks. We show that models incorporating both these characteristics can be valuable for explaining moment to moment in vivo responses of neurons in the auditory cortex. This is an arguably more biologically plausible way of capturing the history dependence of neural responses than a set of delay lines. In artificial neural networks, memory is usually implemented through recurrent connections between units. However, real biological neurons also have intrinsic memory processes with particular characteristics. Including such single-unit memory has potential advantages for explaining the in vivo responses of biological neurons and may also be useful in artificial neural networks. In summary, the dynamic network model we present here provides a biologically-plausible alternative encoding model to the classic LN model, as well as a basis for further development of encoding models with hidden units with diverse intrinsic dynamic processes.