Introduction

The articulatory features that distinguish different vowel sounds are conventionally described along a two-dimensional coordinate system that naturally represents the position of the highest point of the tongue during articulation, for example, in the International Phonetic Alphabet (IPA) chart1. The two axes of this system are height (the vertical position of the tongue relative to the roof of the mouth or to the aperture of the jaw) and backness (the position of the tongue relative to the back of the mouth). How is this structured articulatory production encoded and controlled in brain circuitry? The gross functional neuroanatomy of speech production has been described by multiple imaging, lesion and stimulation studies2,3, and includes primary, supplementary and pre-motor areas, Broca's area, the superior temporal gyrus (STG), the anterior cingulate cortex (ACC) and other medial–frontal regions2,3,4. The temporal dynamics of collective neural activity have been studied in Broca's area using local field potentials2,5. However, the basic encoding of speech features in the firing patterns of neuronal populations remains unknown.

Here, we study the neuronal encoding of vowel articulation in the human cerebrum at both the single-unit and the neuronal population levels. At the single-neuron level, we find signatures of two structured coding strategies: highly specific, sharp tuning to individual vowels (in medial–frontal neurons) and nonspecific, sinusoidally modulated tuning (in the STG). At the neural population level, we find that the encoding of vowels reflects the underlying articulatory movement structure. These findings may have important implications for the development of high-accuracy brain–machine interfaces for the restoration of speech in paralysed individuals.

Results

Speech-related neurons

Neuronal responses in human temporal and frontal lobes were recorded from 11 patients with intractable epilepsy monitored with intracranial depth electrodes to identify seizure foci for potential surgical treatment (see Methods). Following an auditory cue, subjects uttered one of five vowels (a/a/, e/ε/, i/i/, o/o/ and u/u/) or simple syllables containing these vowels (consonant+vowel: da/da/, de/dε/, di/di/, do/do/ and du/du/...). We recorded the activity of 716 temporal and frontal lobe units. As this study focuses on speech, and owing to the inherent difficulty of distinguishing between auditory- and speech-related neuronal activations, we analysed only the 606 units that did not respond to auditory stimuli. A unit was considered speech-related if its firing rate during speech differed significantly from the pre-cue baseline period (see Methods). Overall, 8% of the analysed units (49) were speech-related, of which more than half (25) were vowel-tuned, showing significantly different activation across the five vowels (see Supplementary Fig. S1).
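As an illustration of this selection criterion, the following minimal sketch (not the authors' code; the firing rates and unit counts are simulated) applies a per-unit paired t-test between baseline and response windows with Benjamini–Hochberg false-discovery-rate control, in the spirit of the procedure detailed in Methods.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_units, n_trials = 50, 10
baseline = rng.poisson(5, (n_units, n_trials)).astype(float)   # baseline rate per trial (spikes/s, simulated)
response = rng.poisson(5, (n_units, n_trials)).astype(float)   # post-speech-onset rate (spikes/s, simulated)
response[:4] += 8.0                                             # a few units elevate their rate during speech

# Paired t-test per unit, then Benjamini-Hochberg adjustment across units (q < 0.05)
p_vals = np.array([stats.ttest_rel(response[u], baseline[u]).pvalue for u in range(n_units)])
speech_related = stats.false_discovery_control(p_vals) < 0.05
print(f"{speech_related.sum()} of {n_units} units flagged as speech-related")
```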

Sharp and broad vowel tuning

Two areas commonly activated in speech studies2, the STG and a medial–frontal region overlying the rostral anterior cingulate and the adjacent medial orbitofrontal cortex (rAC/MOF; Brodmann areas 11 and 12; See Supplementary Fig. S2 for anatomical locations of the electrodes), had the highest proportions of speech-related (75% and 11%, respectively) and vowel-tuned units (58% and 77% of these units). In imaging and electrocorticography studies, the rACC was shown to participate in speech control2,6, the orbitofrontal cortex in speech comprehension and reading7, and the STG in speech production at the phoneme level8. Involvement of STG neurons in speech production was also observed in earlier single-unit recordings in humans9. We analysed neuronal tuning in these two areas and found that it had divergent characteristics: broadly tuned units that responded to all vowels, with a gradual modulation in the firing rate between vowels, comprised 93% of tuned units in STG (13/14) but were not found in rAC/MOF (0/10), whereas sharply tuned units that had significant activation exclusively for one or two vowels comprised 100% of the tuned rAC/MOF units (10/10) but were rare in STG (1/14).

Figure 1 displays responses of five sharply tuned units in rAC/MOF, each exhibiting strong, robust increases in their firing rate specifically for one or two vowel sounds, whereas for the other vowels firing remains at the baseline rate. For example, a single unit in the right rostral anterior cingulate cortex (rACC) (Fig. 1, top row) elevated its firing rate to an average of 97 spikes/s when the patient said 'a', compared with 6 spikes/s for 'i', 'e', 'o' and 'u' (P<10^−13, one-sided two-sample t-test). Anecdotally, in the first two trials of this example (red arrow) the firing rate remained at the baseline level, unlike the rest of the 'a' trials; in these two trials, the patient wrongly said 'ah' rather than 'a' (confirmed by the sound recordings).
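For readers who want to reproduce this kind of comparison, here is a minimal sketch (with hypothetical firing rates, not the recorded data) of the one-sided two-sample t-test used to contrast the 'a' trials with the trials of the other vowels.

```python
import numpy as np
from scipy import stats

rates_a = np.array([95., 101., 88., 103., 99., 97.])       # spikes/s on 'a' trials (hypothetical)
rates_other = np.array([5., 7., 6., 4., 8., 6., 5., 7.])   # spikes/s on e/i/o/u trials (hypothetical)

# One-sided two-sample t-test: H1 = mean rate on 'a' trials exceeds the rate on other trials
t_stat, p_val = stats.ttest_ind(rates_a, rates_other, alternative='greater')
print(f"t = {t_stat:.2f}, one-sided p = {p_val:.2e}")
```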

Figure 1: Sharply tuned medial–frontal (rAC/MOF) units.

Raster plots and peri-stimulus time histograms of five units during the utterance of the five vowels a, e, i, u and o. For each of the units, significant change in firing rate from the baseline occurred for one or two vowels only (Methods). Red vertical dashed lines indicate speech onset. All vertical scale bars correspond to firing rates of 20 spikes/s.

A completely different encoding of vowels was found in the STG, where the vast majority of tuned units exhibited broad variation of the response over the vowel space, during the articulation of both vowels (Fig. 2a) and simple syllables containing these vowels (Supplementary Fig. S3a). This structured variation is well approximated by sinusoidal tuning curves (Fig. 2b and Supplementary Fig. S3b) analogous to the directional tuning curves commonly observed in motor cortical neurons10. Units shown in Fig. 2 had maximal responses ('preferred vowel', in analogy to 'preferred direction') to the vowels 'i' and 'u', which correspond to a closed articulation where the tongue is maximally raised, and minimal ('anti-preferred') response to 'a' and 'o' where it is lowered.
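The sinusoidal tuning-curve fit mentioned here is specified in Methods as a + b·cos(c + i·2π/5) over the vowel order a, e, i, u, o. The sketch below (with hypothetical mean firing rates) shows one way to fit this curve and compute the R^2 criterion; it is an illustration, not the authors' code.

```python
import numpy as np
from scipy.optimize import curve_fit

vowel_order = ['a', 'e', 'i', 'u', 'o']
i = np.arange(5)                              # vowel index in the order above
mean_rate = np.array([4., 9., 18., 16., 6.])  # mean firing rate per vowel (spikes/s, hypothetical)

def tuning(i, a, b, c):
    # Sinusoidal tuning curve over the five vowels, as defined in Methods
    return a + b * np.cos(c + i * 2 * np.pi / 5)

(a, b, c), _ = curve_fit(tuning, i, mean_rate, p0=[mean_rate.mean(), np.ptp(mean_rate), 0.0])
pred = tuning(i, a, b, c)
r2 = 1 - np.sum((mean_rate - pred) ** 2) / np.sum((mean_rate - mean_rate.mean()) ** 2)
print(f"R^2 = {r2:.2f}; a unit counts as broadly (sinusoidally) tuned if R^2 > 0.7")
```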

Figure 2: Broadly tuned STG units.

(a) Raster plots and peri-stimulus time histograms during the utterance of the five vowels a, e, i, u and o. Significant change in firing rate from the baseline occurred for all or most vowels, with modulated firing rate (Methods). Red vertical dashed lines indicate speech onset; vertical bars, 10 spikes/s. (b) Tuning curves of the respective units in a over the vowel space, showing orderly variation in the firing rate of STG units with the articulated vowel.

Population-level decoding and structure

Unlike directional tuning curves, where angles are naturally ordered, vowels can be ordered in different ways. In the tuning curves of Fig. 2, we ordered the vowels according to their place and manner of articulation as expressed by their location in the IPA chart1, but is this ordering natural to the neural representation? Instead of assuming a particular ordering, we can try to deduce the natural organization of speech features represented in the population-level neural code. That is, we can try to infer a neighbourhood structure (or order) of the vowels in which similar (neighbouring) neuronal representations are used for neighbouring vowels. We reasoned that this neighbourhood structure could be extracted from the error structure of neuronal classifiers: when a decoder, such as the population vector11, errs, it is more likely to prefer a value that is a neighbour of the correct value than a more distant one. Thus, when the feature ordering accurately reflects the neighbourhood structure of the neural representation, classification error rates are expected to be higher between neighbours than between distant features, and the classifier's confusion matrix will have a band-diagonal structure.

To apply this strategy, we decoded the population firing patterns using multivariate linear classifiers with a sparsity constraint to infer the uttered vowel (Methods). The five vowels were decoded with a high average (cross-validated) accuracy of 93% (significantly above the 20% chance level, P<10^−5, one-sided one-sample t-test, n=6 cross-validation runs; Supplementary Table S1), and up to 100% when decoding pairs of vowels (Fig. 3a). Next, we selected the vowel ordering that leads to band-diagonal confusion matrices (Fig. 3b). Interestingly, this ordering is consistent across different neuronal subpopulations (Fig. 3b and Supplementary Fig. S4) and exactly matches the organization of vowels according to their place and manner of articulation as reflected by the IPA chart (Fig. 3c). As the vowel chart represents the position of the highest point of the tongue during articulation, the natural organization of speech features by neuronal encoding reflects a functional spatial-anatomical axis in the mouth.
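To make the ordering-selection step concrete, the following sketch (using a small hypothetical confusion matrix, not the one in Fig. 3b) scores each permutation of the five vowels by the fraction of classification errors that fall between cyclically adjacent vowels and keeps the ordering that concentrates errors on the band diagonal. Rotations and reflections of an ordering score identically, so the search returns one representative.

```python
import itertools
import numpy as np

vowels = ['a', 'e', 'i', 'o', 'u']
conf = np.array([           # rows = true vowel, cols = decoded vowel (hypothetical counts)
    [20,  3,  0,  2,  0],
    [ 3, 19,  2,  0,  0],
    [ 0,  2, 21,  0,  3],
    [ 2,  0,  0, 20,  2],
    [ 0,  0,  3,  2, 19],
])

def neighbour_error_mass(order):
    """Fraction of off-diagonal mass that falls between cyclically adjacent vowels."""
    p = list(order)
    m = conf[np.ix_(p, p)]
    off = m.sum() - np.trace(m)
    near = sum(m[k, (k + 1) % 5] + m[(k + 1) % 5, k] for k in range(5))
    return near / off if off else 1.0

best = max(itertools.permutations(range(5)), key=neighbour_error_mass)
print('inferred ordering:', [vowels[k] for k in best])
```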

Figure 3: Inferring the organization of vowel representation by decoding.

(a) Average decoding accuracy (±s.e.) versus the number of decoded vowel classes. Red dashed lines represent the chance level. (b) Confusion matrix for decoding population activity of all analysed units to infer the uttered vowels. The band-diagonal structure indicates adjacency of vowels in the order a, e, i, u and o in the neural representation. High confusion in the corner (between u and i) implies cyclicity. (c) The IPA vowel chart, representing the highest point of the tongue during articulation, overlaid on a vocal tract diagram. The inferred connections (blue lines) demonstrate the neuronal representation of articulatory physiology.

Discussion

These results suggest that speech-related rAC/MOF neurons use sparse coding for vowels, in analogy to the sparse bursts in songbirds' area HVc12 and to the sparse, highly selective responses observed in the human medial–temporal lobe13. In contradistinction, the gradually modulated speech encoding in STG points to a previously unrecognized correlate of a hallmark of motor cortical control, broad sinusoidal tuning, suggesting a role in the motor control of speech production9. Interestingly, speech encoding in these anatomical areas is opposite in nature to that of other modalities: broad tuning for motor control is common in the frontal lobe10 (versus STG in the temporal lobe), whereas sparse tuning to high-level concepts is common in the temporal lobe13 (versus rAC/MOF in the frontal lobe). Analogous to the recently found subpathway between the visual dorsal and ventral streams14, our findings may lend support to a speech-related 'dorsal stream' in which sensorimotor prediction supports speech production through state feedback control3. The sparse rAC/MOF representation may serve as a predictor state, in line with anterior cingulate15 and orbitofrontal16 roles in reward prediction. The broad STG tuning may support evidence that the motor system is capable of modulating the perception system to some degree3,17,18.

Our finding of sharply tuned neurons in rAC/MOF agrees with the DIVA model of the human speech system19, which suggests that higher-level prefrontal cortex regions involved in phonological encoding of an intended utterance sequentially activate speech sound map neurons that correspond to the syllables to be produced. Activation of these neurons leads to the readout of feedforward motor commands to the primary motor cortex. Owing to orbitofrontal connections to both STG20 and ventral pre-motor cortex21, rAC/MOF neurons may also participate in the feedback control map, where sharply tuned neurons may provide a high-level 'discrete' representation of the sound to utter, based on STG input from the auditory error map, before low-level commands are sent to the articulator velocity and position maps via the ventral pre-motor cortex. Our broadly tuned STG units may also be part of the transition from the auditory error map to the feedback control map, providing a lower-level 'continuous' population representation of the sound to utter.

Our results further demonstrate that the neuronal population encoding of vowel generation appears to be organized according to a functional representation of a spatial-anatomical axis: tongue height. This axis was shown to have a significant main effect on decoding from speech motor cortex units22. Whether these structured multi-level encoding schemes also exist in other speech areas, such as Broca's area2 and the speech motor cortex, and how they contribute to the coordinated production of speech are important open questions. Nevertheless, the structured encoding observed here naturally facilitates high-fidelity decoding of volitional speech segments and may have implications for restoring speech faculties in individuals who are completely paralysed or 'locked-in'23,24,25,26,27,28.

Methods

Patients and electrophysiology

Eleven patients with pharmacologically resistant epilepsy undergoing invasive monitoring with intracranial depth electrodes to identify the seizure focus for potential surgical treatment29 (9 right-handed, 7 female, ages 19–53) participated in a total of 14 recording sessions, each on a different day. Electrode placement was based exclusively on clinical criteria; each patient was implanted with 8–12 electrodes for 1–2 weeks, each terminating in a set of nine 40-μm platinum–iridium microwires. Electrode locations were verified by magnetic resonance imaging or by computed tomography coregistered to preoperative magnetic resonance imaging. Bandpass-filtered signals (0.3–3 kHz) from these microwires and the sound track were recorded synchronously at 30 kHz using a 128-channel acquisition system (Neuroport, Blackrock). Sorted units (WaveClus30 and SUMU31) recorded in different sessions are treated as distinct units in this study. All studies conformed to the guidelines of the Medical Institutional Review Board at the University of California Los Angeles.

Experimental paradigms

Patients first listened to isolated auditory cues (beeps) and to another individual uttering the vowel sounds and three syllables (me/lu/ha) following beeps (auditory controls). Then, following an oral instruction, patients uttered the instructed syllable multiple times, each utterance following a randomly spaced (2–3 s) beep. Syllables consisted of either a monophthongal vowel (a/a/, e/ε/, i/i/, o/o/ and u/u/) or a consonant (d/d/, g/g/, h/h/, j/dε/, l/l/, m/m/, n/n/, r//, s/s/ and v/v/) followed by one of these vowels (for example, da/da/, de/dε/, di/di/, do/do/ and du/du/)1. For simplicity, this paper employs the English rather than the IPA transcription, as described above. All sessions were conducted at the patient's quiet bedside.

Data analysis

Of the 716 recorded units, we analysed 606 that were not responsive during any auditory control (rAC and adjacent MOF cortex (rAC/MOF): 123 of 156; dorsal and subcallosal ACC: 68/72; entorhinal cortex: 124/138; hippocampus: 103/114; amygdala: 92/106; parahippocampal gyrus: 64/66; and STG: 32/64). The anatomical subdivisions of the ACC are according to McCormick et al.32 Owing to clinical considerations29, no electrode was placed in the primary or pre-motor cortex in this patient population. Each brain region was recorded in at least three subjects.

A unit is considered speech-related when its firing rate differs significantly between the baseline ([−1000, 0] ms relative to the beep) and the response ([0, 200] ms relative to speech onset; paired t-test, P<0.05, adjusted for false discovery rate33 control over multiple units and vowels, q<0.05; n ranges between 6 and 12 trials depending on the session). For these units, we found the maximal response among the four 100-ms bins starting 100 ms before speech onset, and computed mean firing rates in a 300-ms window around this bin. Tuned units are speech-related units for which the mean firing rate differs significantly between the five vowel groups (analysis of variance; F-test, P<0.05, false discovery rate33 adjusted, q<0.05; n between 6 and 12 for each group). Broad, sinusoidally tuned units are tuned units whose firing rate follows a + b·cos(c + i·2π/5) (where i=0,...,4 is the index of the vowel in the order a, e, i, u, o) with coefficient of determination R^2>0.7 (ref. 10). Sharply tuned units are tuned units for which the mean firing rate in the three vowel groups with the lowest mean firing rates is the same with high probability (analysis of variance; F-test, P>0.1; n between 6 and 12 for each group).

The vowel decoder is a regularized multivariate linear solver, which minimizes ||x||_L1 subject to ||Ax − b||_L2 ≤ σ (the basis pursuit denoising problem34). It has superior decoding performance and speed relative to neuron-dropping decoders35. A contains the feature inputs to the decoder: spike counts of all units in a baseline bin ([−1000, 0] ms relative to the beep) and in two 100-ms response bins that followed speech onset; b are 5-element binary vectors coding the individual vowels uniquely. All decoding results were sixfold cross-validated using trials that were not used for decoder training. The decoder was trained on all of the aforementioned features from the training set only, with no selection of the input neurons or their features. Instead, the sparse decoder automatically selects task-relevant features by allocating them higher weights under the minimal ||x||_L1 objective; task-unrelated features are thus suppressed by low weights. Owing to the high decoding accuracy, we randomly dropped 20% of the units (in each cross-validation training) when computing confusion matrices, to increase the number of confusions and allow the extraction of a meaningful band-diagonal structure (except for the STG-only training, Supplementary Fig. S4).
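As an approximate, self-contained illustration of this decoding pipeline (not the authors' implementation): the constrained basis pursuit denoising problem can be emulated with the closely related L1-penalized (Lasso) regression available in scikit-learn, trained one-versus-rest on binary vowel targets and read out by argmax. All data below are simulated, and the regularization strength stands in for the σ constraint.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_trials, n_units, n_vowels = 60, 40, 5
y = rng.integers(0, n_vowels, n_trials)                        # uttered vowel per trial (simulated)
X = rng.poisson(2.0, (n_trials, n_units * 3)).astype(float)    # baseline + two response bins per unit
X[:, :n_units] += 3.0 * (np.eye(n_vowels)[y] @ rng.random((n_vowels, n_units)))  # weak vowel tuning

acc = []
for train, test in KFold(n_splits=6, shuffle=True, random_state=0).split(X):
    B = np.eye(n_vowels)[y[train]]                             # 5-element binary target vectors
    models = [Lasso(alpha=0.1).fit(X[train], B[:, v]) for v in range(n_vowels)]
    scores = np.column_stack([m.predict(X[test]) for m in models])
    acc.append(np.mean(scores.argmax(axis=1) == y[test]))
print(f"sixfold cross-validated accuracy: {np.mean(acc):.2f} (chance 0.20)")
```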

The vowels in Fig. 3c were placed on the IPA chart according to the locations previously calculated for American speakers (ref. 1, page 42), and the overlaid connections (blue lines) were inferred by the maximal non-diagonal element for each row and each column of the confusion matrix.
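A minimal sketch of that selection rule (the confusion matrix below is hypothetical, with rows and columns ordered a, e, i, u, o): take the maximal non-diagonal entry of each row and of each column and connect the corresponding vowel pair.

```python
import numpy as np

vowels = ['a', 'e', 'i', 'u', 'o']
conf = np.array([[20,  3,  0,  0,  2],
                 [ 3, 19,  2,  0,  0],
                 [ 0,  2, 21,  3,  0],
                 [ 0,  0,  3, 20,  2],
                 [ 2,  0,  0,  2, 19]], dtype=float)

off = conf.copy()
np.fill_diagonal(off, -np.inf)                               # ignore correct classifications
links = set()
for k in range(len(vowels)):
    links.add(tuple(sorted((k, int(off[k].argmax())))))      # maximal off-diagonal entry in row k
    links.add(tuple(sorted((int(off[:, k].argmax()), k))))   # maximal off-diagonal entry in column k
print([f"{vowels[i]}-{vowels[j]}" for i, j in sorted(links)])
```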

Additional information

How to cite this article: Tankus, A. et al. Structured neuronal encoding and decoding of human speech features. Nat. Commun. 3:1015 doi: 10.1038/ncomms1995 (2012).