The Contribution of Brainstem and Cerebellar Pathways to Auditory Recognition

The cerebellum has been known to play an important role in motor functions for many years. More recently its role has been expanded to include a range of cognitive and sensory-motor processes, and substantial neuroimaging and clinical evidence now points to cerebellar involvement in most auditory processing tasks. In particular, an increase in the size of the cerebellum over recent human evolution has been attributed in part to the development of speech. Despite this, the auditory cognition literature has largely overlooked afferent auditory connections to the cerebellum that have been implicated in acoustically conditioned reflexes in animals, and could subserve speech and other auditory processing in humans. This review expands our understanding of auditory processing by incorporating cerebellar pathways into the anatomy and functions of the human auditory system. We reason that plasticity in the cerebellar pathways underpins implicit learning of spectrotemporal information necessary for sound and speech recognition. Once learnt, this information enables automatic recognition of incoming auditory signals and prediction of likely subsequent information based on previous experience. Since sound recognition processes involving the brainstem and cerebellum initiate early in auditory processing, learnt information stored in cerebellar memory templates could then support a range of auditory processing functions such as streaming, habituation, the integration of auditory feature information such as pitch, and the recognition of vocal communications.


INTRODUCTION
Despite extensive evidence for cerebellar involvement in auditory conditioned behaviors in animals (Aitkin and Boyd, 1975, 1978;Perrett et al., 1993;Lingenhöhl and Friauf, 1994;Ohyama et al., 2003;Thompson and Steinmetz, 2009), this paper describes the first neurocognitive model to integrate cortico-ponto-cerebellar pathways into the structure and functions of human auditory pathways. In particular, the neurophysiology of the auditory ponto-cerebellar pathways is first described in relation to conditioned behaviors in animals. Models of these networks are then integrated with the Object-Attribute Model (OAM) of auditory processing (McLachlan and Wilson, 2010), and supporting evidence from neuroimaging and lesion studies for cerebellar involvement in a wide range of human auditory behaviors is provided.
The importance of the cerebellum for motor control has been known since the early 19th century (Ghez and Fahn, 1985;Houk et al., 1996;Glickstein and Doron, 2008). Over a century later Marr proposed that, "the purpose of the cerebellum is to learn motor skills, so that when they have been learned a simple or incomplete message from the cerebrum will suffice to provoke their execution" (Marr, 1969). About the same time Ito (1970) showed that the output from the cerebellar cortex was inhibitory, suggesting that the cerebellum regulated the execution of motor sequences rather than initiated them. Detailed physiological data on cerebellar neural architecture (Eccles, 1967) has since been used by researchers to propose a wide range of information processing models to explain motor control and conditioned and unconditioned reflexes (Marr, 1969;Fujita, 1982;Thompson, 1986;Kawato et al., 1987;Moore et al., 1989;Houk et al., 1993;Wolpert et al., 1998;Ito, 2006). The common feature of these models is that the cerebellar cortex learns temporal sequences of sensory inputs associated with particular events that enable the precise timing of behaviorally relevant motor responses.
The cerebellum is connected to the rest of the brain by three peduncles: (i) the inferior peduncle receives input from the inferior olivary nucleus and the spinocerebellar and vestibular systems, (ii) the middle peduncle receives input from the pons, and (iii) the superior peduncle largely projects to the thalamus (Glickstein and Doron, 2008;Granziera et al., 2009). The principal cerebellar input from the motor system is via the red nucleus to the inferior olivary nucleus and then the inferior peduncle (Kawato et al., 1987;Wolpert et al., 1998;Granziera et al., 2009). In contrast, inputs to the middle peduncle from the pons originate from prefrontal and parietal cerebral pathways (Ramnani, 2006;Glickstein and Doron, 2008;Ito, 2008;Strick et al., 2009), although afferent sensory inputs have also been documented (Graybiel, 1974;Aitkin and Boyd, 1975;Glickstein and Doron, 2008). In particular, afferent auditory inputs from the inferior colliculus (IC) and cochlear nucleus (CN) to the pons (Snider and Stowell, 1944;Aitkin and Boyd, 1975, 1978) and the inferior olivary nucleus (Huang et al., 1982) have been reported. The principal cerebellar outputs from the superior peduncle are to the red nucleus, which relays information to premotor nuclei in the brainstem, and to the thalamus, which then connects broadly to the cerebral cortex and to the amygdala (Wolpert et al., 1998;Ramnani, 2006;Ito, 2008;Strick et al., 2009).

BEHAVIORAL ADAPTATION IN PRIMITIVE AUDITORY NETWORKS
The cerebellum is one of the oldest and most structurally conserved brain regions in vertebrates (Weaver, 2005;Apps and Hawkes, 2009), and in many mammalian species, it contains more than three times as many neurons as the neocortex (Herculano-Houzel, 2010). Iguanas, one of the most evolutionarily ancient lizard species, possess frequency selective hearing (Manley, 2002) and their auditory pathways include the brainstem, thalamus and cerebellum. Like all reptiles they lack neocortex (Wyneken, 2007) and many of the higher auditory processing centers found in primates. While marine iguanas do not communicate vocally, they are capable of distinguishing the predator alarm calls of mockingbirds from other mockingbird songs to initiate escape and alert behaviors (Vitousek et al., 2007). This behavior points to the early evolution of sound recognition abilities in primitive terrestrial vertebrates, and the involvement of sub-cortical auditory processing pathways in sound recognition.
Habituation to predictable sounds is another evolutionarily ancient behavioral response to auditory signals. Neural networks in the dorsal CN of mice (and even in the electrosensory systems of fish) have been shown to adapt to suppress responses to predictable stimuli (Roberts and Portfors, 2008). Lateral inhibitory fields adapt to sharpen spectral features in neural representations (or templates) of commonly occurring signals and amplify their edges. This makes these networks particularly sensitive to stimulus driven activation that differs from learnt templates for commonly occurring signals, so they can suppress responses to these stimuli without loss of sensitivity to novel stimuli (Tzounopoulos et al., 2004;Roberts and Portfors, 2008). Roberts and Portfors (2008) made the observation that the neural architecture of these dorsal CN circuits is very similar to the architecture of cerebellar circuits. However, the CN does not include a network capable of storing temporal sequences and operates on spectral information only.
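The principle of template-based habituation described above can be illustrated with a deliberately simplified sketch. It is not the circuit model of Roberts and Portfors (2008): the four frequency channels, the learning rate, the subtractive form of the inhibition and the function names `habituate` and `response` are all assumptions made for illustration. A learnt spectral template is subtracted from the incoming spectrum, so the response to a repeated sound shrinks while the response to a sound with energy in other channels is preserved.

```python
def habituate(template, stimulus, rate=0.2):
    """Move the learnt template a fraction `rate` toward the stimulus (assumed rule)."""
    return [t + rate * (s - t) for t, s in zip(template, stimulus)]


def response(template, stimulus):
    """Summed, rectified per-channel mismatch between stimulus and template."""
    return sum(max(0.0, s - t) for t, s in zip(template, stimulus))


# Four hypothetical frequency channels; values are arbitrary activation units.
common = [0.0, 1.0, 1.0, 0.0]  # frequently repeated sound
novel = [1.0, 0.0, 0.0, 1.0]   # sound with energy in the other channels

template = [0.0] * 4
first_response = response(template, common)

# Repeated exposure to the common sound builds up its template.
for _ in range(30):
    template = habituate(template, common)

habituated_response = response(template, common)
novel_response = response(template, novel)
print(first_response, habituated_response, novel_response)
```

In this sketch the template holds spectral information only, in keeping with the observation that the CN lacks a network for storing temporal sequences.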
Startle responses to acoustic stimuli are common in terrestrial vertebrates. In rats, startle responses have very short latencies of about 5 ms and habituation is frequency specific (Fleshler, 1965;Lingenhöhl and Friauf, 1994), indicating the presence of a rapid neural pathway with the capacity to learn spectral information. To account for the speed of the acoustic startle response, Lingenhöhl and Friauf (1994) described a neural circuit comprising only three synapses in which the CN innervates giant neurons in the pons, which in turn innervate cranial and spinal motor neurons. Consistent with this, Aitkin and Boyd (1978) described a subset of cells (15%) in the dorsolateral pontine nucleus with onset latencies to auditory stimuli as short as 3 ms that likely receive direct input from the CN (Huang et al., 1982). Adaptation of dorsal CN response fields enables habituation to common sounds (Tzounopoulos et al., 2004;Roberts and Portfors, 2008) and so could account for the frequency selectivity of habituation of startle responses to acoustic stimuli, while the pons connectivity to motor circuits provides specificity of the motor response to stimuli that elicit startle responses.
In rabbits, auditory conditioning of eye-blinks can be achieved by repeated paired presentations of an initially neutral stimulus, such as a tone, with puffs of air (Perrett et al., 1993;Ohyama et al., 2003;Thompson and Steinmetz, 2009). Rabbits can be taught to blink with high accuracy at any time between 100 and 600 ms after the onset of a tone at a specific frequency, with the accuracy and number of blink responses diminishing at longer time intervals up to about 3 s (Ohyama et al., 2003). In contrast, the minimum onset latency of an eye-blink in response to a puff of air in rabbits is about 25-40 ms (Thompson and Steinmetz, 2009). Lesions of the cerebellar cortex permanently abolish the adaptive timing of blinks, leaving only frequency specific reflex responses to tones with short and relatively fixed delays (Perrett et al., 1993;Ohyama et al., 2003). In humans, auditory conditioning of the eye-blink reflex with a delay of 720 ms between the conditioned and unconditioned stimuli has been observed in healthy people, but not in patients with damage to cerebellar structures (Daum et al., 1993). This indicates that the cerebellar cortex plays a role in learning accurate temporal relationships between acoustic and tactile stimuli in humans and other mammals.
In summary, very primitive auditory neural circuits in the CN are able to habituate behavioral responses to commonly occurring acoustic signals. In contrast, auditory conditioned eye-blinks in rabbits involve the timed initiation of a behavior associated with a specific auditory spectrum. Correct timing of conditioned eyeblink responses requires the involvement of the cerebellar cortex. Finally, the ability of marine iguanas to recognize mockingbird alarm calls likely involves a recognition pathway that includes the cerebellar cortex, since mockingbird song typically comprises large frequency sweeps over short temporal pulses that require both spectral and temporal information to categorize (Logan and Fulk, 1984).

Auditory Information Processing Architecture of the Cerebellum
The cerebellar cortex includes granule, basket, stellate, and Golgi cells. These cells are connected by the axons of granule cells that form long parallel fibers (about 3 mm in length) that excite profusely branched dendrites of Purkinje cells (Figure 1). Inhibitory basket and stellate cell axons run on either side of the excitatory parallel fibers that act on Purkinje cells, and so fine tune neural response fields in the network. Purkinje cells inhibit neurons in the deep cerebellar nuclei, thereby regulating the excitatory drive that these neurons receive from the pons. Purkinje cells project to the closest deep cerebellar nuclei, so the lateral hemispheres project to the dentate nucleus, the intermediate cortex projects to the globose and emboliform nuclei, and the vermis projects to the fastigial nucleus (Granziera et al., 2009). Small regions of the cerebellar cortex with similar somatotopic receptive fields form microcomplexes, which project to one region of the deep cerebellar nuclei (Apps and Garwicz, 2005). These microcomplexes are repeated throughout the cerebellum, generating a vast network of feed-forward information processing units interconnected by parallel fibers in the cerebellar cortex. Furthermore, microcomplexes located in different regions of cerebellar cortex may be innervated by the same inferior olive neuron and project to the same group of motor neurons, allowing parallel processing of sensory inputs to achieve integrated motor responses (Apps and Garwicz, 2005).
In each microcomplex, neurons in the deep cerebellar nuclei and the cerebellar cortex receive inputs from mossy fibers that originate in the pons. The pons receives inputs from the brainstem and cerebral cortex, and so mossy fiber inputs to granule cells convey both direct sensory information and cerebral contextual information. Purkinje cells also receive input from excitatory climbing fibers that originate in the inferior olive. The inferior olive receives excitatory input directly from the brainstem and indirectly from the parietal cortex via the red nucleus (Oka et al., 1979). Conjunctive stimulation of climbing and mossy fibers at different firing rates can either excite or inhibit Purkinje cells, thereby providing learning feedback mechanisms for the network by changing the weights of parallel fiber synapses to Purkinje cells (Ito et al., 1982). This alters the probability of a Purkinje cell firing to a particular set of cortical, sensory and inter-cerebellar inputs, and so with repeated exposure the cerebellar pathways could learn appropriate behavioral responses to sensory and/or proprioceptive information in relation to error messages from the inferior olive and the specific environmental/behavioral context provided by the cortical input via the pons (Ito, 2008).
FIGURE 1 | Learning circuit of the cerebellum (black neurons are inhibitory). Input to the pons becomes associated with motor and thalamic outputs via the deep nucleus and is also relayed via granule cells and parallel fibers to Purkinje cells. Granule cells have varying latencies allowing specific timing of inhibition of the deep nucleus by Purkinje cells. Golgi, basket and stellate cells refine spatio- (or spectro-) temporal response fields of Purkinje cells. Sensory input to the inferior olive can alter Purkinje cell synaptic strengths to parallel fiber inputs and the inferior olive is inhibited by activation of the deep nucleus. Adapted from Yeo and Hesslow (1998) and Sacchetti et al. (2005).
A computer simulation of auditory conditioning of the eyeblink reflex in rabbits has been developed to investigate how temporal information is processed in the cerebellar cortex. The simulation used physiologically realistic populations and connectivity of cerebellar cell types that were modeled with relatively simple "leaky integrator" model neurons. Auditory input to the pons was based on recordings in cats of mossy fiber responses to pure tones by Aitkin and Boyd (1975) in which about 4% of fibers displayed either onset or tonic responses to a particular tone. Consistent with the proposition that the pons associates spectral patterns of neural activation with motor reflexes, and transmits these spectral response patterns to the cerebellar cortex, Purkinje cell responses in the cerebellar vermis of cats display a wide variety of frequency response patterns (Aitkin and Boyd, 1975). Furthermore, these response patterns are stable over wide ranges of stimulus intensity, as would be required for recognition mechanisms that are independent of stimulus loudness. Axon and terminal degeneration in the dorsolateral pontine nucleus after lesion of the IC indicated that the IC was a source of the afferent auditory input to the mossy fibers (Aitkin and Boyd, 1978).
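For readers unfamiliar with the term, a "leaky integrator" unit can be sketched in a few lines. The sketch below is generic and is not taken from the simulation described above: the time constant, time step, input values and the function name `leaky_integrator` are assumptions chosen for illustration. The unit's activation `v` follows dv/dt = (-v + input) / tau, so it rises toward a sustained input and decays back toward zero when the input stops.

```python
def leaky_integrator(inputs, tau_ms=10.0, dt_ms=1.0):
    """Euler integration of dv/dt = (-v + input) / tau; returns the trace of v."""
    v = 0.0
    trace = []
    for drive in inputs:
        v += dt_ms / tau_ms * (drive - v)
        trace.append(v)
    return trace


# 100 ms of constant drive followed by 100 ms of silence (arbitrary units).
trace = leaky_integrator([1.0] * 100 + [0.0] * 100)
print(trace[99], trace[199])
```

With these assumed values the activation approaches the input level by the end of the drive and returns close to zero by the end of the silent period.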
In this model, the cerebellar cortex consisted of a layer of granule cells connected to Purkinje cells and a second layer of inhibitory Golgi cells. The modeled cerebellar cortex spontaneously evolved temporal specificity when the synaptic weights between granule and Purkinje cells decreased in strength if active in the presence of a climbing fiber input, and increased in strength if active in the absence of a climbing fiber input. After training over several hundred trials, various populations of the model granule cells developed temporal response characteristics that involved specific periods of enhanced or depressed firing rates. As a result, different sets of granule cells would be active at different times during the presentation of the conditioned stimulus (tone), and the inhibitory action of Purkinje cells on the deep nuclei could be timed to allow a blink reflex at the expected time of the unconditioned stimulus (e.g., an air puff). This model contrasts significantly with earlier models of cerebellar timing computation that propose a plurality of neural delay lines associated with specific motor or sensory functions (Eccles, 1967;Braitenberg et al., 1997). The principal difficulties with delay line models of cerebellar computations include: (1) the very short time domain over which sequences could be encoded (less than 200 ms), (2) the very precise neural timing required of Purkinje cells to synchronize with the velocity of neural conduction, and (3) the excitatory capacity of parallel fibers to activate Purkinje cells as required in delay line models (Braitenberg et al., 1997). In contrast to neural delay line models, this model provides the possibility that temporally imprecise sequences of pons activations associated with cortical inputs (such as phoneme or word sequences) with relatively long delay times may be learnt by the cerebellum.
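The weight-update rule described above can be illustrated with a toy sketch. It is not a reimplementation of the published simulation: it assumes a fixed temporal code in which each granule cell is active in a short window of the tone (in the model this code emerges from network dynamics), and the constants `N_STEPS`, `US_STEP`, `LTD` and `LTP`, together with all function names, are hypothetical. Granule-to-Purkinje weights fall when a cell is active together with the climbing fiber input and rise when it is active without it, so the summed Purkinje drive develops a pause at the expected time of the unconditioned stimulus.

```python
N_STEPS = 50            # time steps in one tone presentation (hypothetical)
US_STEP = 30            # step at which the climbing fiber input arrives (hypothetical)
LTD, LTP = 0.2, 0.005   # assumed weight decrease / increase per active step


def granule_activity(cell, step):
    """Assumed temporal code: each cell is active within one step of its own index."""
    return 1.0 if abs(cell - step) <= 1 else 0.0


def purkinje_drive(weights, step):
    """Summed parallel fiber drive onto a single Purkinje cell at one time step."""
    return sum(w * granule_activity(i, step) for i, w in enumerate(weights))


def train(weights, trials):
    """Apply the rule on every step of every trial, clipping weights to [0, 1]."""
    for _ in range(trials):
        for step in range(N_STEPS):
            climbing_fiber = step == US_STEP
            for i in range(len(weights)):
                if granule_activity(i, step) == 0.0:
                    continue  # inactive synapses are left unchanged
                delta = -LTD if climbing_fiber else LTP
                weights[i] = min(1.0, max(0.0, weights[i] + delta))
    return weights


weights = train([0.5] * N_STEPS, trials=200)
drive = [purkinje_drive(weights, s) for s in range(N_STEPS)]
pause_step = min(range(N_STEPS), key=drive.__getitem__)
print(pause_step)
```

Because Purkinje cells inhibit the deep nuclei, a pause in Purkinje drive at `pause_step` corresponds to a timed release of inhibition, which is the mechanism the text proposes for a correctly timed conditioned blink.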
Like motor reflexes, learnt fear responses to acoustic stimuli are rapid and frequency specific. They have been shown to evoke shifts in the frequency responses of neurons in the medial geniculate body (MGB) of the thalamus (Weinberger, 2011). The thalamus projects to the amygdala, which is associated with autonomic arousal in response to aversive stimuli under moderation from the anterior cingulate and other frontal cortical areas (Critchley, 2005). The thalamus is innervated by the deep nuclei of the cerebellum and the pons (Paré et al., 1990;Reese et al., 1995), forming a neural circuit that can rapidly learn to recognize auditory stimuli that are paired with pain or threat, generate autonomic arousal, and adapt these learnt associations (Figure 2) (Critchley, 2005;Sacchetti et al., 2005;Weinberger, 2011).
Many researchers have suggested that the consistency of neural architecture throughout the cerebellum indicates that it undertakes similar types of information processing on all its inputs, regardless of whether they are sensory, proprioceptive, or cortical in origin (Courchesne and Allen, 1997;Ramnani, 2006;Thach, 2007;Ito, 2008). As a consequence, existing neurocognitive models of cerebellar function should be consistent with existing models of auditory processing. Figure 2 shows a network model of the cerebellar cortex that adapts the cerebellar neural architecture of the simulation described above according to the proposition that sound recognition occurs in sub-cortical pathways, as suggested in the OAM of auditory processing by McLachlan and Wilson (2010). In Figure 2, spectral patterns of afferent auditory information are initially recognized in the CN-pons neural circuit. Multiple arrows between the CN, pons and deep nucleus in Figure 2 represent auditory input to giant cells in the pons from the CN that drive startle reflexes with very short latencies and low spectral specificity (Fleshler, 1965;Lingenhöhl and Friauf, 1994). Auditory inputs to the cerebellar cortex from the IC (via the pons) have a broad range of spectral and temporal response patterns including a small proportion of finely tuned tonic mossy fiber responses (Aitkin and Boyd, 1978) that provide higher spectral specificity than CN inputs over time (represented in Figure 2 as a single arrow from the IC).
Pons neurons can excite the deep nucleus to initiate motor reflexes, and the thalamus to initiate both autonomic arousal via the amygdala, and the association of symbolic or multimodal identities in the cerebral cortex. Purkinje cells in the cerebellar cortex also receive excitatory innervation from the pons via mossy and parallel fibers, and from the inferior olive via climbing fibers, and then project to the deep nucleus where they inhibit feed-forward motor and other responses if they receive an unexpected sensory input (Konnerth et al., 1990). The deep nuclei also send projections to the inferior olive which may inhibit error messages when a firing sequence is progressing according to expected sensory inputs (Best and Regehr, 2009).
FIGURE 2 | Network architecture in the cerebellum to support sound recognition and associated auditory processes (see text for a detailed explanation). Fine arrows denote neural connections involved in sound recognition and thick arrows denote connectivity between brain nuclei and regions (black denotes excitatory connections and gray inhibitory connections). CN, cochlear nucleus; DMGB, dorsal medial geniculate body; DN, deep nuclei of the cerebellum; and IC, inferior colliculus.
The association of spectrotemporal templates with identities in the secondary auditory cortex likely constitutes the second stage of sound identification (McLachlan and Wilson, 2010;McLachlan, 2011). This is consistent with previous research that identified contributions of both auditory feature processing and categorical knowledge formation in sound recognition mechanisms (Ballas, 1993). It is also consistent with neurophysiological data on sound recognition mechanisms (the "what" pathway) recorded in the anterolateral belt region of the auditory cortex of rhesus monkeys (Rauschecker and Tian, 2000). Rauschecker and Tian (2000) found neurons in this region that responded much more strongly to full bandwidth reproductions of monkey calls than spectrally filtered reproductions. These neurons also responded strongly to the complete temporal sequence of a monkey call, but only weakly to the first part of the call, and not at all to the second part of the call. In other words, neurons in the anterolateral belt of the auditory cortex preferentially respond to the specific temporal order of spectral information associated with a particular monkey call. Similar neurophysiological data has also been observed in the prefrontal cortex of monkeys (Averbeck and Romanski, 2006), suggesting that the semantic meaning of calls with particular spectrotemporal properties is stored in anterolateral belt regions of the auditory cortex and processed in frontal cortical regions.

THE CEREBELLUM IN HUMAN AUDITORY PROCESSING
The cerebellum contains two somatotopic representations (homunculi) in primates (Snider and Eldred, 1951;Grodd et al., 2001;Imamizu et al., 2003;Manni and Petrosini, 2004), and in macaque monkeys, magnetic resonance tractography has shown that the dominant cerebral input to the cerebellum is from the motor areas. However, in humans, the lateral regions of the neocerebellum are by far the largest part of the cerebellum and receive input from the prefrontal and parietal areas of the cerebral cortex via the pons (Kelly and Strick, 2003;Ramnani et al., 2006;Strick et al., 2009). Stoodley et al. (2012) reported functional magnetic resonance imaging (fMRI) data for humans that showed that finger-tapping activated right cerebellar lobules IV, V and VIII along with the sensorimotor cerebral cortices. In contrast, verb generation, mental rotation, and working memory tasks activated the lateral neocerebellar lobules VI (Crus I) and VII (Crus II), along with prefrontal and parietal cerebral cortices. These data are consistent with the idea that implicit learning in the cerebellum may involve cognitive processes in the lateral Crus I and II lobules that are more abstract than basic integration of sensory and motor information (Imamizu et al., 2003;Konoike et al., 2012). This idea is further supported by recent evidence of fMRI activation in these lateral cognitive regions for purely visual sensory processing tasks such as the recognition of emotions in facial images (Baumann and Mattingley, 2012), and the motion of visual bars on a monitor (Kellermann et al., 2012). These visual functions may be supported by afferent visual connections to the pons via the superior colliculus, lateral geniculate body and pretectal area in the midbrain of mammals that were described in the 1970s (Graybiel, 1974).
Specific regions in the lateral cerebellum in humans, particularly in the left Crus I area, are consistently activated in brain imaging studies involving auditory tasks ranging from passive listening to pure tones or clicks, to various types of active auditory discrimination of intensity, duration, location, pitch, timbre and speech (Petacchi et al., 2005;Callan et al., 2007;Sens and de Almeida, 2007;Wilson et al., 2009). Tasks involving temporal processing of auditory stimuli have shown that individuals with lateral cerebellar lesions could not accurately perceive the difference between 'longer' and 'shorter' acoustic tone bursts (Ivry et al., 1988), and cerebellar involvement in this task was later confirmed in normal individuals in an fMRI study by Mathiak et al. (2004). Similar studies have also indicated that the lateral cerebellum is involved in classification of speech stimuli based on temporal dynamics. Finally, Konoike et al. (2012) found fMRI activation in left Crus I of the cerebellum, the superior temporal gyrus, the inferior parietal lobes and the inferior frontal gyrus during a memory task requiring the encoding of a rhythmic phrase. Performance of the rhythm after a short time delay resulted in bilateral activation of the lobule VI regions of the cerebellum and in the cerebral motor areas. However, no activation was observed in the cerebellum during the maintenance of the rhythm in working memory. These data are consistent with involvement of the cortico-ponto-cerebellar system (including the frontal and parietal cortices) for encoding rhythmic information, but only the frontal regions of the cerebral cortex for maintaining a memory trace of the rhythm in working memory.

Speech and Sound Recognition
The rapid enlargement of the ventrolateral portion of the cerebellum in conjunction with the inferior frontal region of the cortex in humans suggests selective evolution of prefrontal input to the human cortico-ponto-cerebellar system (Ramnani, 2006), driven largely by increasing demands of language processing (Leiner et al., 1989). Murdoch (2010) suggested that the cortico-ponto-cerebellar system likely underpins the phonological loop, a short term memory store for phonological information that is supported by articulatory rehearsal of information (Baddeley et al., 1998), while Kotz and Schwartze (2010) postulated that the cerebellum plays an important role in the temporal regulation of cortical phonemic processing in relation to motor templates for speech production via the basal ganglia and thalamus. More recently, Schwartze and Kotz (2016) proposed that precise temporal processing in the cerebellum regulates cortical integration of auditory information through oscillatory feedback loops between cerebellar and cerebral cortices. They suggest that an event-based temporal representation of the speech signal could predict auditory dynamics and so regulate and optimize cortical attentional resources.
Lesion, neurophysiological and imaging studies have highlighted the role of the right lateral cerebellum in a range of non-motor aspects of language (Schlösser et al., 1998;Marien et al., 2001;Xiang et al., 2003;Frings et al., 2006;Callan et al., 2007;Stoodley et al., 2012;Moberget et al., 2014). For example, agrammatic speech can follow focal lesions that are relatively circumscribed to the right lateral cerebellum (Zettin et al., 1997), and repetitive transcranial magnetic stimulation over the right lateral cerebellum disrupted language function (Lesage et al., 2012). Given the homogeneity of cerebellar neural architecture (Apps and Garwicz, 2005), Moberget and Ivry (2016) explored whether predictive forward models that have been developed to explain cerebellar motor control would also be consistent with the demands of speech processing. This implies that nonspatial 'what' information can be processed by cerebellar neural architecture, and so they provide an account of how non-spatial error signals provided by climbing fibers in the cerebellar cortex (Figure 1) might enable learning of semantic information.
In response to reports of auditory neural plasticity in the thalamus of rats (Weinberger, 2011), McLachlan and Wilson (2010) and McLachlan (2011) speculated that the spectrotemporal template matching mechanisms proposed in the OAM may reside in the MGB. However, the vast neural resources available in the pons and lateral cerebellum, and their direct connectivity to the afferent pathways of the auditory brainstem and thalamus make them ideally suited to this role. Given that spectral integration of auditory inputs may occur in the pons prior to cerebellar processing (Snider and Stowell, 1944;Aitkin and Boyd, 1975, 1978), the ponto-cerebellar pathways are ideally suited to undertake matching of auditory inputs with spectrotemporal memory templates, and associate these templates with symbolic identities in secondary auditory cortex (Davis and Johnsrude, 2003;Schwartze and Kotz, 2016). Figure 3 is a schematic representation of the 'what and where' auditory pathways (Arnott et al., 2004) that were initially discussed in the OAM (McLachlan and Wilson, 2010). These have now been expanded to include the cerebellum as the neural substrate for spectrotemporal template matching of speech and other auditory information. According to the updated OAM shown in Figure 3, the medial MGB of the thalamus relays information from the pons and deep nuclei of the cerebellum to the secondary auditory cortex (specifically the anterolateral belt region) for sound identification in the auditory 'what' pathway (Rauschecker and Tian, 2000;Arnott et al., 2004;McLachlan and Wilson, 2010). The dorsal MGB projects in parallel to the caudolateral belt region of the auditory cortex, which in turn projects to the parietal cortex as part of the auditory 'where' pathway (Rauschecker and Tian, 2000;McLachlan and Wilson, 2010). Various authors have proposed a multidimensional memory representation of acoustic stimuli including pitch, location, and loudness (Grau and Kemler-Nelson, 1988;Gomes et al., 1995;McLachlan and Wilson, 2010) associated with the 'where' pathway (Arnott et al., 2004), and bound by an overall gestalt such as an identifiable timbre that has been encoded in the 'what' pathway (McLachlan and Wilson, 2010).
FIGURE 3 | A schematic representation of the auditory pathways that include the cerebellum. Blue arrows represent well-known primary auditory pathways, and red arrows represent the proposed sound recognition pathways that comprise sub-cortical aspects of the "what" auditory pathway including the cerebellum.
By proposing that spectral integration of auditory information for sound recognition occurs in the pons rather than the cerebral cortex, the extended OAM reduces the role of the cerebellum in sound and speech processing to the learning and recognition of temporal sequences of pons inputs. Similarly, integration of visual information in the pons (via the superior colliculus) would allow the cerebellum to process temporal sequences of visual shapes. This allows the cerebellum to readily integrate synchronous multimodal sensory inputs, since all sensory modalities are represented by sequences of pons neural activations. Cortico-ponto-cerebellar connectivity then allows specific spectrotemporal and/or multimodal templates to be associated with symbolic identities that are encoded in the cerebral cortex. In turn this will allow commonly occurring sequences of these higher level identities such as words and phrases to be learnt from cerebral inputs to the pons, and enable subsequent automatic processing in the cerebellum (Ramnani, 2006;Ito, 2008;Argyropoulos, 2016), as suggested by neuroimaging studies (Moberget et al., 2014) and neurophysiological studies (Argyropoulos, 2011;Lesage et al., 2012).
The proximity of speech recognition to speech articulation mechanisms in the cerebellum would allow articulatory motor control maps to integrate closely with spectrotemporal phoneme recognition templates (Kotz and Schwartze, 2010), thereby enhancing speech learning and perception, particularly when only partial information for either task is available (Lindblom, 1996). Furthermore, cerebellar disorders may give rise to ataxic dysarthria, which is characterized by distinct articulatory and phonatory deficits (Ackermann et al., 2007). The location of speech recognition templates in the lateral cerebellum may therefore contribute to the large size increase of the lateral cerebellum as humans evolved speech capacity. A wide range of behavioral deficits has been observed in patients with cerebellar lesions, including language deficits and impairment of executive functions, abstract reasoning and spatial cognition (Schmahmann and Sherman, 1998;O'Halloran et al., 2012). Consistent with findings that associate cerebellar damage with neuropsychological disorders, communication deficits in low functioning autism spectrum disorder (ASD) have been associated with decreased density of Purkinje cells in the cerebellar cortex, while high functioning ASD has been associated with enlarged cerebellar volumes and increased gray matter (Salmond et al., 2007;Amaral et al., 2008).
The pons and deep nuclei of the cerebellum project to the MGB, which regulates inputs to the auditory cortex from the IC. Inhibition and/or potentiation of MGB neurons according to spectrotemporal patterns stored in cerebellar sound recognition templates may cause rapid plasticity of response fields of primary auditory cortex neurons that are innervated by the ventral MGB (Paré et al., 1990;Fritz et al., 2005;Weinberger, 2011). Rapid plasticity of the response fields of primary auditory cortex neurons could enable the streaming and integration of auditory features that are associated with an expected or attended acoustic source (David et al., 2012;Zion Golumbic et al., 2013). Fritz et al. (2010) recorded response patterns from neurons in the frontal cortex of ferrets that rapidly adapted during trained behaviors to encode task-relevant information through enhanced responses to behaviorally conditioned tones and suppressed responses to other tones. Consistent with the proposition that frontal cortex modulates primary auditory cortex neural sensitivity, adaptation of primary auditory cortex neural responses during and after behavior was similar to the adaptation of the frontal cortex neurons that were recorded simultaneously. Thus the cerebellar network could play an important role in regulating the auditory 'where' pathway through its connectivity with frontal regions of the cerebral cortex as shown in Figure 3. Finally, rarely described efferent pathways from the cerebellar nuclei to the pons (Rand, 1954;Achenbach and Goodman, 1968) may provide a fast network for priming expected pons neurons, and so streaming auditory information based on predictions from cerebellar processing.

Learning
Association of a recognized sound with a symbolic identity in the cerebral cortex could occur via the neural pathway from the deep nucleus to the dorsal MGB and the auditory association areas (Figure 3). Ito (2008) suggested that prefrontal and parietal cortical connections with the inferior olive via the red nucleus could support cerebellar learning of purely cognitive tasks, and in support of this proposition, electrical stimulation of the parietal cortex in cats has been shown to activate climbing fibers from the inferior olive (Oka et al., 1979). Error messages from the inferior olive are used to alter Purkinje cell synaptic connections to parallel fibers within the cerebellar cortex. In motor control tasks these error messages may include pain, somatosensory and visual information (Wolpert et al., 1998;Ramnani, 2006;Ito, 2008). In sound recognition these messages may also be produced by complete inhibition of the deep nucleus by Purkinje cells when recognition mechanisms fail. The inferior olive receives inhibitory innervation from the deep nucleus, so that it is not active during successful recognition processing, and is most active when an unexpected change occurs in the stimulus timing (Liu et al., 2008;Best and Regehr, 2009). The inferior olive also receives excitatory input from the CN, so if template matching fails in the cerebellum during stimulus presentation, the inferior olive could send error messages to the cerebellar cortex to adjust Purkinje cell synaptic strengths according to the stimulus properties. Inferior olive neurons are strongly electrically coupled by gap junctions (Llinas et al., 1974) and so tend to fire synchronously (Welsh et al., 1995). This may provide widespread coordination across cerebellar microcomplexes of precise temporal sequences, such as occurs during eye-blink conditioning (Van Der Giessen et al., 2008).
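The error-driven loop described above can be illustrated with a deliberately simplified numerical sketch. It is not a model from this review: the rate-coded units, the linear summation, the learning rate, and the function names `olive_error` and `train` are all assumptions made for illustration. Purkinje cell output inhibits the deep nucleus, the deep nucleus inhibits the inferior olive, the CN excites the inferior olive, and the resulting olive activity depresses the weights of the active parallel fiber synapses.

```python
def olive_error(weights, pattern, cn_drive=1.0):
    """Inferior olive activity for one input pattern (toy rate-coded units)."""
    # Purkinje cell output: weighted sum of active parallel fiber inputs.
    purkinje = sum(w * x for w, x in zip(weights, pattern))
    # Purkinje cells inhibit the deep nucleus (fully silenced when purkinje >= 1).
    deep_nucleus = max(0.0, 1.0 - purkinje)
    # The olive is excited by the CN and inhibited by the deep nucleus.
    return max(0.0, cn_drive - deep_nucleus)


def train(weights, pattern, trials=50, rate=0.1):
    """Depress active parallel fiber synapses in proportion to the olive error."""
    w = list(weights)
    for _ in range(trials):
        err = olive_error(w, pattern)
        # Only synapses active in the pattern (x = 1) are changed; weights stay >= 0.
        w = [max(0.0, wi - rate * err * xi) for wi, xi in zip(w, pattern)]
    return w


initial = [1.0] * 6            # hypothetical naive synaptic weights
familiar = [1, 1, 1, 0, 0, 0]  # hypothetical pons activation pattern that is trained
novel = [0, 0, 0, 1, 1, 1]     # hypothetical pattern that is never trained

learnt = train(initial, familiar)
error_before = olive_error(initial, familiar)
error_after = olive_error(learnt, familiar)
error_novel = olive_error(learnt, novel)
```

In this sketch the olive error for the trained pattern falls towards zero as the deep nucleus is released from Purkinje cell inhibition, whereas the untrained pattern continues to silence the deep nucleus and so still produces a full error signal.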
These learning mechanisms may be sufficient to support learning to identify sounds without requiring error feedback from the prefrontal cortex, for which there is little neuroanatomical evidence to date. In other words, spectrotemporal templates in the ponto-cerebellar network might be organized by sensory and proprioceptive feedback, and simply labeled by association with symbolic and multimodal identities in the prefrontal cortex without cerebral cortical feedback to the inferior olive. Reptiles can learn complex auditory signals (Vitousek et al., 2007) without cerebral error feedback to the inferior olive, which could not evolve until after neocortex appeared in mammals. Feedback from cerebral cortex may also be too slow and largely redundant, since the cerebellar pathways can make rapid multisensory associations during motor planning (Wolpert et al., 1998).
The cortico-pontine-cerebellar pathway enables priming of cerebellar templates by cortical working memory and attentional mechanisms as described in Ramnani (2006) for motor control, and in the expanded OAM for sound recognition proposed in this review. According to Figure 2, activation of the pons by innervation from the association cortex in the absence of an auditory stimulus will activate a temporal sequence of neural activations in the deep nucleus. Activation of the deep nucleus could drive activation of the MGB in the thalamus in spectrotemporal sequences that would be similar to actually perceiving a well-remembered timbre, and lead to the experience of auditory imagination. This is consistent with the observation of patterns of cerebral fMRI activation associated with imagining sounds that are similar to those observed when hearing sound, albeit without activation of primary auditory cortex (Zatorre and Halpern, 2005). Moreover, fMRI activation of the left lateral regions of the cerebellum, and frontal and parietal cerebral areas was greater when people imagined well-known melodies compared to perceiving well-known melodies (Herholz et al., 2012), and non-motor related cerebellar activation was associated with covert singing when contrasted with overt singing (Kleber et al., 2007). This is consistent with the association of left lateral cerebellar activation with pitch discrimination rather than motor control (Holcomb et al., 1998), and suggests that pitch and timbral templates are stored in this region of the cerebellum.
The echoic trace was described in the OAM (McLachlan and Wilson, 2010) as a buffer of stimulus driven activation in the thalamus that enables review of recent auditory information when sound identification is ambiguous, or an individual makes a conscious decision to attend to detailed auditory information. However, in the extended OAM outlined in this review, the echoic trace is proposed to be the sequence of pons neural activations. Amygdala activation associated with increased arousal due to hypervigilance or the failure of recognition mechanisms activates the hippocampus to store sensory information (Phelps, 2004). Storage of the sequence of pons activations, rather than the full sensory fields that caused these activations, will substantially reduce the amount of information that must be stored in the hippocampus. This will also allow that information to be readily integrated with other sensory information (Schwartz and Their, 1999), recalled through the primary auditory pathways by sequential activation of the pons neurons, or consolidated with long-term memory via connectivity to the ponto-cerebellar pathways (Datta et al., 2005).

Music
McLachlan (2009, 2011) proposed that pitch was first approximated by both the frequency and the pattern of auditory nerve excitation, as in a harmonic template matching model of pitch (Terhardt et al., 1982), before being further refined by temporal processing of the stimulus waveform. According to the expanded OAM presented in Figures 2 and 3, initial spectral pitch processing would occur in the CN-IC-pons network. This is consistent with research that shows that the ponto-cerebellar network learns highly specific responses to frequency information from the auditory brainstem (Perrett et al., 1993;Ohyama et al., 2003), and the matching of learnt spectral templates to music stimuli in humans (McLachlan N.M. et al., 2013).
Pitch resolution is refined by waveform driven mechanisms in the IC to produce sharper frequency resolution after multiple stimulus periods (Langner and Schreiner, 1988;de Cheveigné, 2005;Meddis and O'Mard, 2006;McLachlan, 2009;McLachlan and Grayden, 2014). According to McLachlan (2009, 2011) the primed spectral pitch is associated with pitch dimension in a place code by neurons adjacent to the primary auditory cortex (Bendor and Wang, 2005). Lateral inhibition mechanisms in this network then operate as an adaptive filter, only allowing periodicity information from the IC that is consistent with the primed pitch to contribute to the pitch representation in short term memory. Consequently, as the frequency resolution of neural response patterns in the IC sharpens over multiple stimulus periods the pitch representation in auditory short term memory also sharpens. Furthermore, since the cerebellar cortex also receives input from the CN-IC-pons network over multiple stimulus periods, more refined frequency representations that develop over time in the IC may contribute to improved stimulus recognition over longer latencies. Efferent pathways from the deep nucleus of the cerebellum to the pons (Rand, 1954;Achenbach and Goodman, 1968) may contribute to refining pons firing patterns as pitch estimates become more precise over time.
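The gating role proposed for lateral inhibition can be illustrated with a minimal sketch. It is a crude stand-in rather than a model of the neural circuit: the proportional tolerance window, the example frequencies, and the function names `gate` and `track_pitch` are hypothetical. The primed spectral pitch defines a window, periodicity estimates from the IC that fall outside the window are discarded, and the stored pitch follows the consistent estimates as they become more precise over successive stimulus periods.

```python
def gate(primed_hz, candidates, tolerance=0.06):
    """Pass only periodicity estimates close to the primed spectral pitch."""
    lo = primed_hz * (1.0 - tolerance)
    hi = primed_hz * (1.0 + tolerance)
    return [hz for hz in candidates if lo <= hz <= hi]


def track_pitch(primed_hz, ic_estimates_per_period):
    """Update the stored pitch once per stimulus period from gated IC estimates."""
    pitch = primed_hz
    trajectory = []
    for candidates in ic_estimates_per_period:
        consistent = gate(primed_hz, candidates)
        if consistent:
            # Keep the consistent estimate nearest the current representation.
            pitch = min(consistent, key=lambda hz: abs(hz - pitch))
        trajectory.append(pitch)
    return trajectory


# Hypothetical IC periodicity estimates (Hz) over four stimulus periods.
# Each period may also contain harmonic or subharmonic candidates.
periods = [[450.0, 225.0], [446.0, 886.0], [443.5, 221.0], [443.0]]
trajectory = track_pitch(440.0, periods)
```

With a primed pitch of 440 Hz, the candidates near 220 Hz and 880 Hz never enter the representation, and the stored value moves from the coarse early estimate towards the later, sharper ones.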
Consistent with cerebellar involvement in pitch processing, Parsons et al. (2009) found that high functioning patients afflicted with varying degrees of global cerebellar degeneration had pitch discrimination thresholds that were, on average, over five times poorer than those of controls, with performance proportional to the degree of cerebellar ataxia. Furthermore, a positron emission tomography (PET) study by Holcomb et al. (1998) found increased blood flow in the middle and left lateral cerebellum associated with the motor and decision components of a pitch recognition task, respectively. Hutchinson et al. (2003) subsequently reported that male musicians had larger cerebellar volumes relative to their total brain volume compared to non-musicians. This potentially represents structural adaptation to extended periods of music practice for enhanced motor control, and in light of Holcomb et al. (1998), for better pitch discrimination. Consistent with this, Abdul-Kareem et al. (2011) reported increased white matter volumes in the middle and superior cerebellar peduncles of musicians and Sokolov et al. (2014) reported diffusion tensor imaging evidence for a structural loop between the left cerebellum and the right superior temporal sulcus. Finally, pitch processing by absolute pitch musicians is faster than that of other musicians (Hsieh and Saberi, 2007), suggesting that they are able to use rapid recognition mechanisms in the CN-IC-pons network to identify standard pitches for which they have previously established fine pitch associations in auditory short term memory (Wilson et al., 2012).
Observed differences in pitch processing between musicians and non-musicians (Seither-Preisler et al., 2007;Kraus and Chandrasekaran, 2010) and musicians from different cultures (McLachlan N.M. et al., 2013) are consistent with the generation of long-term memory templates for musical stimuli in the pontocerebellar network through music training. Recent behavioral data show that dissonance is experienced when musicians are presented with uncommon chords for which recognition mechanisms are likely to fail. Failure of cerebellar recognition mechanisms may lead to increased activation of the amygdala via activation of the thalamus by the deep nucleus (Figures 2 and 3). The amygdala also connects to the hippocampus and plays an important role in regulating autonomic arousal and the encoding of episodic memory (Poldrack et al., 2001) with additional input from the prefrontal cortex (Critchley, 2005). Increased hippocampal and parahippocampal activation has been associated with the experience of dissonance (Wieser and Mazzola, 1986;Blood et al., 1999), which is consistent with the activation of this pathway by failure of cerebellar recognition mechanisms for unfamiliar or incongruent musical stimuli.
In contrast, Blood et al. (1999) and Salimpoor et al. (2012) reported activation of the dopaminergic reward network when listeners reported feeling pleasure while listening to their favorite music. Activation of the dopaminergic brainstem pathways in humans has been shown for problem solving without explicit external rewards in other cognitive (non-musical) domains (Tricomi et al., 2006). The reward network was only activated by familiar music (Blood et al., 1999;Salimpoor et al., 2012), suggesting that successful predictions about the trajectory of musical features generated by cerebellar processing of familiar musical sequences may be involved as observed by Konoike et al. (2012) for rhythms. In rats, the nucleus accumbens receives input from frontal cortical regions that are innervated by the thalamus, and in turn, by the ponto-cerebellar pathways. The nucleus accumbens can either activate positively valenced reward, or negatively valenced dread behaviors, depending on the recognized stimulus and contextual information (Berridge and Kringelbach, 2013). When people listen to music, complex relationships between musical predictions based on well-known musical sequences may combine with episodic memories for peak personal experiences in frontal cortical regions to activate the reward network via the nucleus accumbens as observed by Blood et al. (1999) and Salimpoor et al. (2012).

Embodied Cognition
The cerebellum is involved in purely cognitive processes such as mental and perceptual timing, learning and reproducing new word and rhythmic phrases, and the mental rehearsal of speech and other movements (Fiez et al., 1992;Gebhart et al., 2002;Ravizza et al., 2006;Konoike et al., 2012). Ramnani (2006) suggested that the difference between cerebral and cerebellar information processing may be that in cerebellar circuits both the context and the rule are integrated in the same internal representation, and can only be efficiently applied in that context, whereas cerebral circuits abstract rules and relationships and maintain them in working memory. This allows cerebral circuits to rapidly adapt to novel stimuli and circumstances, whereas the cerebellum gradually learns implicit associations over multiple presentations of stimuli.
Broca's area in the inferior frontal cortical region in humans is important for learning sequences which contain abstract structures rather than fixed sequences (Dominey et al., 2003;Vandervert, 2011). It has a well-established role in the grammatical processing of phonological sequences (Roll et al., 2012), and has also been implicated in the processing of musical syntax (Maess et al., 2001). This suggests that commonly occurring sensory information may be learnt in patterns of neural connectivity in the cerebellum (templates) that are associated with conceptual identities stored in secondary auditory cortex via the thalamus (Figure 3), and arranged and manipulated in abstract grammars in Broca's area. In this way, implicit sensori-motor memories stored in the cerebellum may enable the automation of well-rehearsed cognitive processes as a form of embodied cognition (Mahon and Caramazza, 2008). This could enable more efficient performance of cognitively demanding tasks by releasing cerebral neural resources from undertaking well-rehearsed processes so that they can monitor and adapt behavior based on broader behavioral and environmental perspectives (Ito, 2008). In other words, the cerebellum can undertake automatic processing of well-rehearsed sensory, cognitive and motor functions.

CONCLUSION
Behaviors that are specific to a particular spectrotemporal feature of a sound may include its association with other sensory information or verbal labels, heightened autonomic arousal, startle and conditioned motor reflexes, and even spectral and temporal changes in the sensitivity of the auditory system itself. All of these behaviors are examples of neural plasticity associated with sound recognition, and have been observed in brain regions as early as the dorsal CN, where simple adaptation of auditory sensitivity for common sounds can occur (Tzounopoulos et al., 2004;Roberts and Portfors, 2008). Here we propose that more complex behaviors that include specific temporal properties of the stimulus involve the cerebellar cortex (Perrett et al., 1993;Ohyama et al., 2003), while cognitive tasks such as sound identification involve most of the auditory pathways including the anterior temporal lobe of the cerebral cortex (McLachlan and Wilson, 2010). The proposition that sound recognition commences in the cerebellar pathways is consistent with the initial premise of the OAM that sound recognition commences early in the auditory processing pathways (McLachlan and Wilson, 2010). In the expanded OAM, recognition of spectral features is proposed to occur in the CN-IC-pons network prior to recognition of temporal sequences of these features in the cerebellar cortex.
Overall, these observations point to the existence of an evolutionarily ancient sound recognition mechanism in the brainstem and hindbrain of terrestrial animals that is capable of implicit learning of spectrotemporal sequences of sound. The pons and deep nucleus of the cerebellum project to the thalamus, which regulates the response fields of neurons in the primary auditory pathway. This could enable auditory information to be integrated and streamed according to the spectrotemporal properties of recognized sound sources. In turn, expectations about auditory information that are generated in the cerebral cortex may prime sound recognition mechanisms via the cortico-pontine-cerebellar pathway, leading to enhanced auditory sensitivity to behaviorally relevant sound sources. It should be noted, however, that cerebellar involvement in sound recognition may not be required when spectral information alone is sufficient to recognize sounds, or when accurate temporal processing is not essential.

AUTHOR CONTRIBUTIONS
NM was the principal author and initially developed the models described in the paper. SW refined and expanded the models and undertook detailed reviewing and editing of the text in collaboration with NM.