Predictive Coding in Auditory Cortical Neurons of Songbirds

The inferential basis of perception is thought to arise in the sensory cortex through prediction of future events that aids processing efficiency. Predictive coding (PC), a theoretical framework in which the brain compares a generative model to incoming sensory signals, seeks to explain this inferential process. There is little understanding, however, of how PC might be implemented at a mechanistic level in individual neurons within the auditory system. Here, we examined responses of single neurons in the caudomedial nidopallium (NCM) and caudal mesopallium (CM), analogs of higher-order auditory cortex, in anesthetized European starlings listening to conspecific songs. We trained a feedforward temporal prediction model (TPM) to define a "latent" predictive feature space and a corresponding feature space representing prediction error. We show that NCM spiking responses are best modeled by the predictive spectrotemporal features of song, while CM responses capture both predictive and error features. This provides strong support for the notion of a feature-based predictive auditory code implemented in single neurons in songbirds.


Introduction
In sensory processing of temporally patterned signals such as speech and audio, existing knowledge supplies an internal model that generates expectations of future events. By predicting what it expects to encounter, the brain increases the efficiency of its perceptual representations. This inferential process of combining existing knowledge with information from the outside world has been modeled by predictive coding (PC). This theory posits that the brain is an active hypothesis-testing mechanism which compares an internal generative model to incoming sensory signals (Clark, 2013; Huang & Rao, 2011).
PC has been employed to explain perceptual and cognitive phenomena, and has inspired computational models. Most PC models rely on the hierarchical architecture of the cortex to implement a top-down algorithm that constantly predicts incoming sensory stimuli and compares these predictions with ascending sensory inputs to elicit a prediction error (Rao & Ballard, 1999). The error serves as feedback in an adaptive process that alters the prediction, resulting in an active system that continuously updates the internal model to minimize prediction error. The hierarchical generative model emerges in a natural, unsupervised manner from the single criterion of prediction error minimization.
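The error-minimization loop described above can be sketched in a few lines. This is a minimal illustration in the spirit of Rao & Ballard (1999), not the model used in this study: the generative weights, input, and learning rate are all placeholder assumptions.

```python
import numpy as np

# Minimal sketch of a predictive-coding update loop: the internal state r
# generates a top-down prediction W @ r, the prediction error feeds back,
# and the state is adjusted to reduce that error. All values are illustrative.
rng = np.random.default_rng(0)
W = rng.normal(size=(20, 5))   # generative weights: latent causes -> sensory input
x = W @ rng.normal(size=5)     # a sensory input the model can explain
r = np.zeros(5)                # latent estimate (the internal model's state)

errors = []
for _ in range(200):
    e = x - W @ r              # prediction error: input minus top-down prediction
    r += 0.01 * W.T @ e        # update the latent estimate to reduce the error
    errors.append(np.sum(e ** 2))
```

Running the loop drives the squared prediction error toward zero, which is the single criterion from which the hierarchical model is said to emerge.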
Evidence in support of PC has been observed across sensory modalities (Clark, 2013; Heilbron & Chait, 2018; Keller & Hahnloser, 2009). In songbirds, this theory provides an opportunity to investigate song learning and vocal communication, because these behaviors require the ability to distinguish self-generated vocalizations from external sounds, to differentiate between developing and learned song, and to recognize conspecifics. However, it remains unclear whether and how PC is implemented neurally in the auditory domain. To study these processes directly in European starlings, we developed a simple model to generate stimulus representations that capture predictive spectrotemporal features of song. We subsequently modeled neural activity fit to three separate stimulus representations corresponding to general, predictive, and error spectrotemporal auditory features.

Generative Model
This work is licensed under the Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0

To implement a generative PC model in the temporal auditory domain, we trained a simple, feedforward, single-layer temporal prediction model (TPM) (Singer et al., 2018) to predict short segments (10.5 ms) of future natural birdsong from the preceding 170 ms of spectrographic samples (Figure 1a). The model predicts the output from a linear mapping of input weights followed by a monotonic (sigmoid) nonlinear transformation, resembling the linear-nonlinear cascade used to model sensory neuron firing. We trained the TPM on an extensive corpus of birdsong spectrograms to generate a "latent" predictive feature space comprising 256 hidden units that facilitate predictions of the imminent future song. Under this model, the latent space containing predictive spectrotemporal features of song represents the prediction component of the PC framework, in contrast to the more general class of spectrotemporal features comprising the whole of incoming song. We capture the prediction error component of the PC framework by computing the mean squared error between the output of the TPM (i.e., the predicted song) and the true song (Figure 1b).
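The TPM forward pass and the derived error signal can be sketched as follows. The 170 ms input window, ~10 ms prediction target, and 256 hidden units follow the text; the frequency-bin count, time binning, and random weights are illustrative assumptions standing in for the trained model.

```python
import numpy as np

# Hedged sketch of the TPM: past spectrogram -> linear map -> sigmoid
# latent space (the "predictive features") -> linear readout of the
# near-future spectrogram. Weights here are random placeholders.
rng = np.random.default_rng(1)
n_freq, n_past, n_future = 32, 17, 1       # assumed ~10 ms bins: 170 ms past, ~10 ms future
n_hidden = 256                             # latent predictive feature space (from the text)

W_in = rng.normal(scale=0.1, size=(n_hidden, n_freq * n_past))
W_out = rng.normal(scale=0.1, size=(n_freq * n_future, n_hidden))

def tpm_predict(past_spec):
    """Linear mapping of input weights, sigmoid nonlinearity, linear readout."""
    z = W_in @ past_spec.ravel()
    latent = 1.0 / (1.0 + np.exp(-z))              # latent predictive features
    return latent, (W_out @ latent).reshape(n_freq, n_future)

past = rng.random((n_freq, n_past))                # past spectrogram segment
true_future = rng.random((n_freq, n_future))       # the song that actually occurs

latent, predicted = tpm_predict(past)
sq_error = (predicted - true_future) ** 2          # per-bin squared error
mse = sq_error.mean()                              # prediction-error signal (Figure 1b)
```

The `latent` vector plays the role of the prediction component, while the squared-error representation plays the role of the prediction-error component used downstream.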

Neural Responses Modeled to Stimulus Representations
To relate stimulus representations to neural activity, we computed receptive fields for NCM and CM neurons using the Maximum Noise Entropy (MNE) method (Fitzgerald, Sincich, & Sharpee, 2011), which describes well the sensitivity of NCM neurons to higher-order stimulus features (Kozlov & Gentner, 2016).
For each single unit in simultaneously recorded populations, we independently estimated the MNE model parameters that optimally relate the neuron's response to each of three different stimulus representations: the short-time Fourier transform spectrogram, the projection of the spectrogram into the TPM latent space, or the mean squared error between the TPM-predicted future spectrogram and the true spectrogram. This yields a version of each neuron's composite receptive field (CRF) fit to: 1) all spectrotemporal features of conspecific song (fft-CRF), 2) only the predictive spectrotemporal features of song (tpm-CRF), or 3) the spectrotemporal features of song corresponding to prediction error (mse-CRF). Examples of the three CRFs for a single NCM neuron are shown in Figure 1c. The parameters of each trained MNE model can be used to predict the spiking response of a neuron to novel stimuli (Figure 1d).
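Once fitted, a second-order MNE model maps a stimulus vector to a spike probability through a logistic function of a quadratic form, P(spike | s) = 1 / (1 + exp(a + h·s + sᵀJs)) (Fitzgerald et al., 2011). The sketch below shows that mapping; the dimensionality and parameter values are random placeholders standing in for fitted values.

```python
import numpy as np

# Sketch of MNE response prediction: bias a, linear filter h, and
# symmetric quadratic kernel J together define the spike probability.
# All parameters here are placeholders, not fitted model values.
rng = np.random.default_rng(2)
dim = 50                                   # stimulus dimensionality (assumed)
a = 0.5                                    # bias term
h = rng.normal(scale=0.1, size=dim)        # linear filter
J = rng.normal(scale=0.01, size=(dim, dim))
J = (J + J.T) / 2                          # quadratic kernel is symmetric

def mne_spike_prob(s):
    """Logistic of a quadratic form in the stimulus: P(spike | s)."""
    return 1.0 / (1.0 + np.exp(a + h @ s + s @ J @ s))

s = rng.normal(size=dim)                   # one frame of a stimulus representation
p = mne_spike_prob(s)                      # predicted spike probability in (0, 1)
```

The same machinery applies unchanged whichever representation supplies `s` (spectrogram, TPM latent projection, or squared-error features), which is what allows the fft-, tpm-, and mse-CRFs to be compared on equal footing.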

Conclusions
The sensory cortex develops internal models that are thought to generate predictions of incoming inputs, yielding efficient neural encoding. Here we investigated internal representations of sensory information under the computational framework of predictive coding in the auditory domain in songbirds. We trained a neural network as a proxy for the internal generative model and examined responses of individual neurons fit to separate components of this generative model. We found that single-neuron auditory responses in both NCM and CM are best modeled collectively by a signal representation that captures covariant structure in the predictive spectrotemporal acoustic features of song. We also found that spectrotemporal features capturing the discrepancy between actual and expected song contribute strongly to models of CM neuron responses. These results provide strong, direct support for the notion of a feature-based predictive auditory code implemented in single neurons in songbirds.