Abstract
Frequency-to-place mapping, or tonotopy, is a fundamental organizing principle throughout the auditory system, from the earliest stages of auditory processing in the cochlea to subcortical and cortical regions. Although cortical maps are referred to as tonotopic, it is unclear whether they simply reflect a mapping of physical frequency inherited from the cochlea, a computation of pitch based on the fundamental frequency, or a mixture of these two features. We used high-resolution functional magnetic resonance imaging (fMRI) to measure BOLD responses as male and female human participants listened to pure tones that varied in frequency or complex tones that varied in either spectral content (brightness) or fundamental frequency (pitch). Our results reveal evidence for pitch tuning in bilateral regions that partially overlap with the traditional tonotopic maps of spectral content. In general, primary regions within Heschl's gyri (HGs) exhibited more tuning to spectral content, whereas areas surrounding HGs exhibited more tuning to pitch.
SIGNIFICANCE STATEMENT Tonotopy, an orderly mapping of frequency, is observed throughout the auditory system. However, it is not known whether the tonotopy observed in the cortex simply reflects the frequency spectrum (as in the ear) or instead represents the higher-level feature of fundamental frequency, or pitch. Using carefully controlled stimuli and high-resolution functional magnetic resonance imaging (fMRI), we separated these features to study their cortical representations. Our results suggest that tonotopy in primary cortical regions is driven predominantly by frequency, but also reveal evidence for tuning to pitch in regions that partially overlap with the tonotopic gradients but extend into nonprimary cortical areas. In addition to resolving ambiguities surrounding cortical tonotopy, our findings provide evidence that selectivity for pitch is distributed bilaterally throughout auditory cortex.
Introduction
A key organizing principle of the auditory system is tonotopy, an orderly mapping of sound frequency to place. Tonotopy is established in the cochlea, where different frequencies maximally displace different locations along the basilar membrane, in a high-to-low ordering from the base to the apex (Von Békésy, 1960). This tonotopic organization has been found at numerous stages of the auditory pathways, up to and including auditory cortex (Saenz and Langers, 2014; Thomas et al., 2015). Studies of cortical mapping using functional magnetic resonance imaging (fMRI) have typically employed pure tones or narrowband noises (Formisano et al., 2003; Talavage et al., 2004; Da Costa et al., 2011; Striem-Amit et al., 2011; Saenz and Langers, 2014) in much the same way as has historically been done to establish tonotopy in earlier stages of the auditory processing hierarchy (Von Békésy, 1960; Bourk et al., 1981; Nuttall and Dolan, 1996; Ruggero et al., 1997; Schreiner and Langner, 1997; Narayan et al., 1998; Cooper, 1999). However, pure tones, narrowband stimuli, and even many natural sounds conflate two primary perceptual attributes of sound: pitch height and timbral brightness. In most sounds, pitch (i.e., the property defining melodies in music) is determined by the fundamental frequency (F0), whereas timbre (the perceptual property that distinguishes a trumpet from a clarinet, even when they play the same note at the same loudness) is affected by the spectral centroid (Fc) of the sound's energy distribution, with brightness increasing with increasing Fc (Krumhansl and Iverson, 1992; Marozeau et al., 2003; Allen and Oxenham, 2014).
Because previous studies have used stimuli in which these two dimensions covary, it remains unclear whether the spatial organization observed in cortex simply reflects frequency-to-place mapping, inherited from the cochlear representation of spectral content, or whether some or all portions of the cortical maps instead reflect one or more higher-level features, such as pitch and/or brightness.
Although precisely localizing primary auditory regions in human auditory cortex is an ongoing challenge (Moerel et al., 2014), there is mounting evidence that the primary area A1 [estimated to be within Heschl's gyrus (HG)] shows a preference for relatively simple acoustic features, whereas surrounding nonprimary areas show greater sensitivity to complex stimuli, such as speech and music (Norman-Haignere et al., 2015; de Heer et al., 2017; Kell et al., 2018). Thus, it may be that the multiple gradients identified in previous studies as multiple tonotopic maps (Moerel et al., 2014; Saenz and Langers, 2014) reflect not just different auditory fields, but also maps of different auditory features.
To distinguish between the mapping of Fc (timbral brightness) and F0 (pitch) in cortical representations, we used high-resolution 7T fMRI to measure cortical responses to sequences of pure tones that varied over a range of frequencies, and complex tones that varied in either Fc or F0. We then used computational models to characterize the spatial organization of responses to each of these features. Our results replicate previously found tonotopic maps produced by pure tones and find similar responses to complex tones with a distinct spectral peak, consistent with the organization found in the more peripheral auditory pathways. However, our results also reveal new tuning to pitch in bilateral regions that partially overlap with the tonotopic maps but are located primarily outside HG. Overall, our findings reveal the existence of spatially organized representations of both tonotopy and pitch bilaterally within human auditory cortex.
Materials and Methods
Participants
The Institutional Review Board (IRB) for human participant research at the University of Minnesota approved the experimental procedures. Written informed consent was obtained from each participant before data collection. Ten members of the University of Minnesota community [average (SD) age of 29.3 (4.2) years; six females, four males], all right-handed and having normal hearing, defined as audiometric pure-tone thresholds of 20 dB hearing level (HL) or better at octave frequencies between 250 Hz and 8 kHz, participated in this study. An eleventh participant was excluded after having great difficulty hearing the stimuli; their thresholds were found to have become elevated since their last audiogram, making them no longer eligible for participation.
Stimuli
All stimuli were generated in MATLAB (The MathWorks) and presented using the Psychophysics Toolbox (Kleiner et al., 2007). Stimuli were presented in three conditions: pure tones, complex pitch tones, and complex timbre tones. The 13 pure tones, each a single frequency, spanned six octaves (100–6400 Hz), in half octave steps (Fig. 1A). The complex pitch and timbre tones were bandpass filtered harmonic complex tones (Fig. 1B). For all complex tones, the components started in sine phase and were bandpass filtered with 12 dB per octave slopes around the center frequency (CF), and then lowpass filtered with a 16th order filter and a cutoff frequency of 10 kHz. The nine complex timbre tones had a fixed F0 of 200 Hz and varied in the location of the bandpass filter's CF, spanning four octaves (400–6400 Hz) in half octave steps (Fig. 1C). The nine complex pitch tones had a fixed bandpass filter CF of 2400 Hz and a varying F0, which spanned four octaves (100–1600 Hz) in half octave steps (Fig. 1D). The ranges for the pitch and timbre conditions were chosen to ensure that the Fc was always well above the F0, so that the peak of the spectral envelope (corresponding to the CF of the bandpass filter) was always defined by the amplitudes of the harmonics.
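The complex-tone construction described above can be sketched in Python (the original stimuli were generated in MATLAB). In this illustrative sketch, the 12 dB-per-octave bandpass slopes are approximated by attenuating each harmonic according to its log-frequency distance from the center frequency, and harmonics above 10 kHz are simply omitted rather than applying the 16th-order lowpass filter; the function name and default parameters are hypothetical:

```python
import numpy as np

def complex_tone(f0, cf, dur=0.2, fs=44100, slope_db_oct=12.0, lp_cutoff=10000.0):
    # Illustrative synthesis of a bandpass-filtered harmonic complex tone.
    # The 12 dB/octave bandpass is approximated by attenuating each harmonic
    # by its log2 distance from the center frequency cf (an assumption);
    # harmonics above lp_cutoff are omitted instead of lowpass filtering.
    t = np.arange(int(dur * fs)) / fs
    tone = np.zeros_like(t)
    for h in range(1, int(lp_cutoff // f0) + 1):
        f = h * f0
        atten_db = slope_db_oct * abs(np.log2(f / cf))
        tone += 10.0 ** (-atten_db / 20.0) * np.sin(2 * np.pi * f * t)  # sine phase
    return tone / np.max(np.abs(tone))  # peak-normalize; actual levels were set behaviorally
```

For example, `complex_tone(200.0, 2400.0)` yields a 200-Hz-F0 complex whose spectral envelope peaks at the 2400-Hz harmonic, matching the fixed-Fc pitch condition.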
Experimental design and task
Stimuli were presented via MRI-compatible Sensimetrics S14 foam tip earbuds with custom filters to flatten the frequency response. Sound attenuation with S14s is consistent with standard foam plugs. These earphones, which are commonly used in auditory fMRI studies, have well-documented distortion characteristics that are consistent across different pairs of S14s (Norman-Haignere and McDermott, 2016). Because of the broadband nature of the pitch and timbre stimuli, as well as the broadband masking provided by the continuous scanner noise (with the most prominent peaks falling around 1 kHz), any subtle increases in amplitude around certain harmonics caused by distortion products would have a negligible contribution to the sound percept, which is largely dominated by the stimulus itself.
The stimuli were adjusted to be of equal perceptual loudness. This was done through a multistep process. First, in a separate session from the main experimental session, two participants, while wearing the S14 earphones, listened to repetitions of a single tone type, in blocks lasting 15 s. The scanner was not running during this session. Participants were instructed to adjust the level of the tone by pressing button “1” on the button box to decrease the level and button “2” to increase the level, until it was clearly and comfortably audible. Once they were satisfied with the level, the participants would press “3” to stay at that level for the remainder of the block. If they did not press “3,” they would automatically advance to the next block at the end of the 15-s block. In each subsequent block, they were instructed to make the tone as loud as the tone presented in the previous block. In these blocks, tones were ordered randomly without replacement. The participants performed three repetitions of this task, with the aim of making all the tones equal in loudness. We calculated the median level of the three trials for each tone. These median levels for each tone were then increased by 25–30 dB to account for the presence of the scanner noise. Since equal loudness percepts across frequencies tend to compress at higher levels (International Organization for Standardization, 2003), the same participants then listened to the sounds while the scanner was running and continued to make adjustments until all sounds were again of approximately equal loudness over the scanner noise. These levels were further adjusted and customized for each participant at the beginning of their respective sessions, as needed, until all the tones were reported as being of roughly equal loudness. The final equal loudness contours were similar across participants, with only small offsets in the mean level required for comfortable audibility. 
The mean (SD) level was 83.4 (5.2) dB sound pressure level (SPL) for the pure tones, 80.2 (3.9) dB SPL for the timbre tones, and 75.3 (1.8) dB SPL for the pitch tones.
We incorporated a “Morse code”-like rhythm into the stimuli to enhance their perceptual salience over the sound of the MR pulse sequence, inspired by the stimulus design of Thomas et al. (2015). Each stimulus was presented with an equal number of short (50 ms) and long (200 ms) tone bursts, including 20-ms onset and offset ramps. Every 700 ms consisted of two short and two long tones, each followed by a 50-ms gap, presented in random order. This process was repeated 11 times, with random shuffling of the tones for each repetition, for a total stimulus length of 7.7 s (Fig. 1G). All tones presented within the 7.7 s had an identical F0 and Fc but varied in duration (50 or 200 ms). After a 700-ms gap, a new stimulus was presented (with a new frequency, F0, or Fc, depending on the condition) for 7.7 s, and so on, until all stimuli within a given condition were presented once (i.e., one condition block) in a random order, followed by a 12-s silent gap (Fig. 1F). There were 12 experimental runs (three pure-tone runs and nine complex-tone runs), each about 6 min long (Fig. 1E). The order of the pure-tone and complex-tone runs was counterbalanced across participants. Each pure-tone run consisted of three pure-tone blocks and each complex-tone run consisted of two pitch blocks and two timbre blocks, presented in a random order. To avoid run-specific differences across conditions, both complex-tone conditions were included within each run. Because of the well-established and robust nature of pure-tone tonotopy, most of the scanning session was used to acquire data for the complex-tone conditions. Within a session, there were a total of nine trials for each pure-tone stimulus and 18 trials for each of the pitch and timbre stimuli. Ten seconds of silence was added to the beginning and end of each run. Participants were instructed to keep very still and resist any desire to move to the rhythm of the stimuli. 
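The timing of this "Morse code" rhythm can be sketched as follows (a Python illustration; the 20-ms onset/offset ramps and the audio synthesis itself are omitted, and the function name is hypothetical):

```python
import numpy as np

def morse_sequence(rng, n_cells=11, short_dur=0.05, long_dur=0.2, gap=0.05):
    # Each 700-ms cell: two short (50-ms) and two long (200-ms) bursts in
    # random order, each followed by a 50-ms gap; 11 cells -> 7.7 s total.
    onsets, durs, t = [], [], 0.0
    for _ in range(n_cells):
        for d in rng.permutation([short_dur, short_dur, long_dur, long_dur]):
            onsets.append(t)
            durs.append(float(d))
            t += d + gap
    return onsets, durs, t
```

Each cell sums to 0.05 + 0.05 + 0.2 + 0.2 s of tone plus four 0.05-s gaps, i.e., 0.7 s, so 11 cells give the 7.7-s stimulus described in the text.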
For each 7.7-s stimulus, the participants' task was to indicate, via button box, whether the current stimulus was lower or higher (in either pitch or timbral brightness) than the previous one.
MRI
All data were acquired using Siemens scanners at the Center for Magnetic Resonance Research (CMRR, University of Minnesota). Functional data were acquired at the passively shielded 7T Siemens MAGNETOM scanner using a single transmit 32-channel Nova Medical head coil. The acquisition parameters for the gradient-echo EPI sequence used were: repetition time (TR) = 1400 ms; echo time (TE) = 20 ms, field-of-view (FOV) = 198 mm; matrix size 180 × 180; number of slices = 44; 1.1-mm isotropic voxels; multiband factor = 2; generalized autocalibrating partially parallel acquisition (GRAPPA) acceleration factor = 3. Slices were angled to align with the Sylvian fissure of each participant to fully encapsulate auditory cortices. The sound level of the functional sequence at the center of the bore was 101 dBA before earphone attenuation. Four fieldmaps were also collected throughout each session for distortion correction. The acquisition parameters for the fieldmaps were: TR = 190 ms; first TE = 4.08 ms; second TE = 5.1 ms; 2.2 mm isotropic voxels; 22 slices. The complex-tone runs had 258 volumes, and the pure-tone runs had 267 volumes.
Anatomical (T1 and T2-weighted) data were acquired at the Siemens 3T Prisma scanner with a 32-channel head coil. MPRAGE T1-weighted parameters were: TR = 2400 ms; inversion time (TI) = 1000 ms; TE = 2.22 ms; flip angle = 8°; 0.8-mm isotropic voxels. T2-weighted parameters were: TR = 3200 ms; TE = 563 ms; 0.8-mm isotropic voxels. Six T1s and three T2s were acquired for each participant.
Half of the participants used custom foam Caseforge head cases (https://caseforge.com/). The posterior portion of each head case was used to help stabilize participants' heads during the scans and additional padding was added under the neck and around the ears for further stabilization and comfort. The remaining participants used standard MR-compatible foam padding on the back of the head, along with additional neck and ear padding.
Anatomical and functional preprocessing
The data were preprocessed using a custom pipeline (Kay et al., 2019). Gradient unwarping, which corrects image distortions caused by gradient nonlinearities, was performed on the T1-weighted and T2-weighted anatomical volumes using the gradient coefficient file provided by Siemens. All six T1 volumes for a given participant were then coregistered using rigid-body transformation with six degrees of freedom and cubic interpolation. Once aligned, the volumes were averaged together to improve contrast between the gray and white matter for high-quality segmentation. The same process was used for the three T2 volumes. The averaged T2 volume was then aligned to the averaged T1 volume for each participant.
Cortical reconstruction was performed via FreeSurfer (Fischl, 2012) using the averaged T1 volume. Since the anatomical data had submillimeter resolution, a “hires” flag was added, and an expert file was used to specify a larger number of inflation iterations (50). Segmentation results were then visually inspected in Freeview. The functional data were sampled across the cortical thickness at 25%, 50%, and 75% cortical depths and then averaged together. Note that while the analyses were performed on vertices in cortical surface space, for simplicity, the term “voxel” will be used throughout. For group-level surface maps, individual participant results were mapped to FreeSurfer's fsaverage cortical surface group space via nearest-neighbor interpolation. Fsaverage is an anatomical surface template to which individual participants can be aligned via curvature-based alignment.
Functional data preprocessing included slice time correction, fieldmap-based undistortion, and motion correction. Functional data were aligned to the anatomical data using an affine transformation. As part of the slice time correction step, the data were temporally resampled from the original 1.4-s TR to a 1-s sampling interval. In the motion correction step, the data were sampled onto the FreeSurfer depth-dependent surfaces. No smoothing was applied to the data.
The GLMdenoise technique was used to process the data and obtain a clean estimate of the BOLD response related to the experimental conditions (Kay et al., 2013). The GLMdenoise toolbox is available at http://kendrickkay.net/GLMdenoise/. The β weights were used to specify the amplitude of the BOLD response to each stimulus, and polynomial regressors were used to specify the baseline response in each run. Each 7.7-s stimulus was analyzed as a block, and a canonical hemodynamic response function (HRF) was assumed. Leave-one-run-out cross-validation was performed, and R2 was used to quantify the proportion of the time-series variance that could be explained by the stimuli across all conditions. The three pure-tone runs (each containing three repetitions of each tone, totaling nine repetitions of each pure tone across the scanning session) were used to estimate three β weights per tone. Likewise, the nine complex-tone runs (each containing two repetitions of each tone, totaling 18 repetitions of each complex tone across the scanning session) were used to estimate two β weights for each pitch and timbre tone.
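As a rough illustration of the block design analysis (not the actual GLMdenoise implementation), a run's design matrix could be built with one HRF-convolved boxcar per 7.7-s stimulus block plus polynomial baseline regressors. The double-gamma HRF constants below are a common SPM-style assumption, not necessarily the canonical HRF used in the analysis:

```python
import numpy as np
from scipy.stats import gamma

def canonical_hrf(t):
    # Double-gamma HRF (SPM-like constants; an assumption for illustration)
    return gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0

def design_matrix(onsets, n_vols, tr, block_dur=7.7, poly_deg=2):
    # One HRF-convolved boxcar regressor per stimulus block, plus
    # polynomial baseline regressors (constant, linear, quadratic)
    t = np.arange(n_vols) * tr
    hrf = canonical_hrf(np.arange(0, 30, tr))
    cols = []
    for onset in onsets:
        box = ((t >= onset) & (t < onset + block_dur)).astype(float)
        cols.append(np.convolve(box, hrf)[:n_vols])
    for d in range(poly_deg + 1):
        cols.append(np.linspace(-1, 1, n_vols) ** d)
    return np.column_stack(cols)
```

The β weights would then follow from ordinary least squares, e.g., `np.linalg.lstsq(X, y)`, applied voxel-wise.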
Statistical analysis
Encoding models
Encoding models were used to explore how similar or dissimilar topographic representations of pure tones were to the representations of complex tones varying either in their F0 or Fc. The first model implemented was the feature tuning model, in which the response of each voxel to a stimulus with feature value x (frequency, Fc, or F0) was modeled as a Gaussian function, response = g · exp[−(x − CF)2/(2σ2)] (Eq. 1), with parameters gain (g), center frequency (CF), and bandwidth (σ).
We assessed model performance using n-fold cross-validation, with pure tones having 3 folds (two β weights per stimulus used for training, one for testing), and pitch and timbre each having 2 folds (one β per stimulus for training, one for testing), because of the number of β estimates that came out of the general linear model (GLM) analysis. For each fold, model R2 values were derived using the held-out data, by computing the proportion of the original variance in the data that was unaccounted for by the model fit and subtracting this quantity from 1 (i.e., R2 = 1 − SSresidual/SStotal).
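The held-out R2 computation described above can be sketched as below; whether the total variance is taken around the mean of the held-out data or around zero is not specified here, so the mean-centered convention is an assumption:

```python
import numpy as np

def cv_r2(y_test, y_pred):
    # Cross-validated R^2: 1 minus the proportion of variance in the
    # held-out beta weights left unexplained by the model predictions.
    # Total variance is computed around the mean (an assumed convention).
    ss_res = np.sum((y_test - y_pred) ** 2)
    ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
    return 1.0 - ss_res / ss_tot
```

A perfect prediction yields R2 = 1, and predicting the mean of the held-out data yields R2 = 0.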
The second model implemented was the spectral tuning model, which was inspired by the population receptive field (pRF) method (Dumoulin and Wandell, 2008; Thomas et al., 2015). Instead of characterizing responses to each stimulus on the basis of a single-valued stimulus property, as was done for the feature tuning model, the spectral tuning model took into account the entire frequency spectrum of each stimulus. The form of this model is the same as Equation 1, except that the response was computed by applying the Gaussian filter to the full frequency spectrum of each stimulus rather than to a single feature value.
While this model was the same as the feature tuning model for the pure-tone stimuli, which were characterized as a single frequency in both cases, it changed the input for the pitch and timbre stimuli, which are harmonic complex tones containing many frequencies. Because the input feature for this model was the frequency spectrum of the stimuli, the same model could be simultaneously applied to all conditions. However, to more closely compare the results of the feature tuning model to the spectral tuning model, this model was applied to each condition separately.
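A minimal sketch of the spectral tuning model's response computation, assuming a Gaussian weighting on a log-frequency axis whose output is summed across the stimulus spectrum (the exact parameterization and normalization are assumptions for illustration):

```python
import numpy as np

def spectral_tuning_response(spectrum_f, spectrum_amp, g, cf, sigma):
    # pRF-style spectral tuning sketch: weight each frequency component of
    # the stimulus by a Gaussian on the log2-frequency axis, then sum.
    # The log-frequency axis and linear amplitude weighting are assumptions.
    w = np.exp(-(np.log2(spectrum_f) - np.log2(cf)) ** 2 / (2 * sigma ** 2))
    return g * np.sum(w * spectrum_amp)
```

For a pure tone (a single component), this reduces to the feature tuning model's Gaussian evaluated at that frequency, which is why the two models coincide for the pure-tone condition.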
ROIs
The ROIs for this study were the main tonotopic regions within and around HG. The ROIs for each participant (one per hemisphere) were manually defined based on several criteria: macroanatomical landmarks of the auditory cortices (identifying HG for each participant), myelin density maps, and functional data (i.e., the GLM R2 maps and pure-tone tonotopy results of the feature tuning model). The boundaries of HG were identified by two independent raters with experience locating HG on the cortical surface, and then cross-checked with the Destrieux atlas delineations in FreeSurfer (aparc.a2009s). The myelin density observed in and around that region was used to expand the ROI. These myelin density maps were generated by dividing the averaged T1 by the aligned and averaged T2 of a given participant. Myelin density was sampled across the cortical thickness at 25%, 50%, and 75% cortical depths. These samples were then averaged together for a mapping of density across cortical depths (Fig. 2). For all participants, these maps showed the greatest cortical myelin density in somatosensory, visual, and auditory regions, consistent with earlier studies (Glasser and Van Essen, 2011). Since myelin density maps are gradients lacking clear boundaries, minor adjustments were made using the functional data to ensure that the ROIs were neither too conservative (omitting voxels with high R2 values or parts of the main tonotopy gradients) nor too liberal (including an excessive number of uninformative voxels). These ROIs were then used for all surface plots for a given participant. Group-level ROIs were the intersection of all 10 participants' ROIs in each hemisphere.
For analyses involving the estimation of tuning properties within versus outside HG, the ROIs were divided into HG and non-HG sections. The HG ROI was manually drawn, based on the criteria described in the previous paragraph. The non-HG ROI was created by excluding the HG ROI from the original ROI.
Representational similarity analysis
In order to quantify the similarity of multivariate patterns of voxel activation elicited by different stimuli, representational similarity matrices (RSMs) were computed. For each participant, the average pattern of voxel-wise β estimates for each stimulus was computed within the two ROIs (both hemispheres), and voxels with a GLM R2 of at least 10% (reflecting robust stimulus-driven activation) were selected for use in this analysis. Subsequently, activation patterns for all stimulus pairs were used to compute pairwise Pearson's correlations, generating a matrix whose off-diagonal elements reflect representational similarities between different pairs of stimuli. The group-level RSM was computed by averaging participant-level RSMs across all 10 participants.
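The RSM computation reduces to pairwise Pearson correlations over the retained voxels; a minimal Python sketch, using the 10% GLM R2 voxel-selection threshold described above:

```python
import numpy as np

def rsm(betas, glm_r2, r2_thresh=0.10):
    # betas: (n_stimuli, n_voxels) array of voxel-wise beta estimates.
    # Voxels below the GLM R^2 threshold (10%, per the text) are excluded;
    # np.corrcoef then gives all pairwise Pearson correlations between
    # the stimulus-specific activation patterns.
    keep = glm_r2 >= r2_thresh
    return np.corrcoef(betas[:, keep])
```

The group-level RSM is then simply the element-wise average of the participant-level matrices.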
Results
Behavioral results
Behavioral performance in judging whether the current stimulus was higher or lower than the previous one (Fig. 1) was high across all three conditions for all participants, suggesting that they successfully attended to the stimuli. The average proportion of correct responses was 96.8% (SD = 2.3%) in the pure-tone condition, 93.1% (4.5%) in the timbre condition, and 95.8% (4.4%) in the pitch condition. Because of near-ceiling performance in all conditions, a nonparametric Friedman test was run to detect differences in performance between conditions, which indicated a significant main effect (χ2(2) = 9.6, p = 0.008). Post hoc analysis with a two-tailed Wilcoxon signed rank test was then conducted to compare conditions. After a Bonferroni correction for multiple comparisons, setting α to 0.017 (0.05/3), none of the paired comparisons reached significance (pitch vs timbre: Z = −1.78, p = 0.074; pitch vs pure tones: Z = 0.00, p = 1.00; timbre vs pure tones: Z = −2.35, p = 0.019). Therefore, differences in cortical representations between the three conditions are unlikely to be because of differences in behavioral performance.
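The reported test sequence (Friedman omnibus test followed by Bonferroni-corrected two-tailed Wilcoxon signed-rank tests) can be illustrated with SciPy. The per-participant scores below are simulated from the reported group means and SDs, not the actual data:

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Hypothetical per-participant proportion-correct scores, simulated from
# the reported group means/SDs (illustrative only; not the real data).
rng = np.random.default_rng(1)
pure = np.clip(rng.normal(0.968, 0.023, 10), 0.0, 1.0)
timbre = np.clip(rng.normal(0.931, 0.045, 10), 0.0, 1.0)
pitch = np.clip(rng.normal(0.958, 0.044, 10), 0.0, 1.0)

# Nonparametric omnibus test across the three conditions
stat, p_omnibus = friedmanchisquare(pure, timbre, pitch)

# Post hoc two-tailed Wilcoxon signed-rank tests, Bonferroni-corrected
alpha = 0.05 / 3
pairs = {'pitch_vs_timbre': wilcoxon(pitch, timbre)[1],
         'pitch_vs_pure': wilcoxon(pitch, pure)[1],
         'timbre_vs_pure': wilcoxon(timbre, pure)[1]}
```

A pairwise comparison would be declared significant only if its p value fell below the corrected α of 0.017.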
To verify that our pitch stimuli produced highly salient pitch percepts, psychophysical tests were conducted (Moore et al., 1984). A two-alternative forced-choice adaptive staircase procedure was used with a two-down one-up adaptive tracking rule that tracks the 70.7% correct point of the psychometric function (Levitt, 1971), consistent with our earlier work (Allen and Oxenham, 2014). F0 difference limens (F0DLs) were measured on four participants (three of whom participated in the main experiment) for six of the complex pitch tones used in the main experiment (F0s: 100, 141, 200, 400, 800, and 1600 Hz). The resulting F0DLs were small (all below 1%). For reference, a semitone difference, the smallest interval used in Western music, is around 6%. These results are in line with our earlier measures of F0DLs for similar broadband harmonic tone complexes (Allen and Oxenham, 2014), suggesting the salience was high across the entire range of our pitch stimuli.
Topographic mapping of both spectral content and fundamental frequency
To assess the patterns of topographic cortical mapping for each of the three conditions (pure tones, timbre, and pitch), we constructed a separate feature tuning model, in which the GLM β estimates for each voxel in the ROI in the auditory cortices of each participant were fit with a Gaussian filter, with parameters gain (g), CF, and σ, applied to the respective stimulus feature (frequency, Fc, or F0). Figure 3 shows the resulting filters' CFs in the pure-tone condition on a cortical surface for one representative participant as well as the group average. Both individual and group levels of analysis show robust high-low-high tonotopic gradient reversals, in line with earlier studies (Formisano et al., 2003; Langers and van Dijk, 2012; Thomas et al., 2015), with a region of lower CFs (warmer colors) being anteriorly and posteriorly flanked by regions of higher CFs (cooler colors), centered roughly on HG. At both the individual and group levels, there are additional smaller clusters of low-CF and high-CF voxels, also as reported in earlier studies (Da Costa et al., 2011; Moerel et al., 2013).
To determine whether the well-established tonotopic organization found with pure tones reflects spectral energy or F0 in more complex sounds, we compared the pure-tone CF maps to the CF maps in the timbre and pitch conditions. Since it can be difficult to visualize the auditory cortices within the Sylvian fissure on an inflated lateral surface, Figure 4 shows these maps on spherical representations of the cortices for several representative individual participants and the group average. As with the individual participant maps, the group-level maps shown are unsmoothed, though smoothed versions of these figures produced very similar results. Although the ranges of CFs (frequency, Fc, and F0) differ between conditions, because Fc must always be higher than F0 (see Materials and Methods, Stimuli), normalized color ranges were chosen to help visualize similarities in the general pattern of tuning (i.e., highs and lows) across conditions. We found the timbre maps to be broadly similar to the pure-tone maps in terms of their high-low-high (blue-red-blue) structure. The topographic organization in the pitch condition seems less well defined, although a similar high-low-high gradient can be identified in both the individual and group-level data. The cortical locations of the high and low CF regions are reasonably similar for timbre and pitch, despite the fact that they are derived from independent acoustic features, the spectral peak and the F0, respectively. Maps for each of the 10 participants are shown in Figure 5. While the pitch CF maps appear noisier than those of the other conditions, the CF maps were highly consistent across model fit estimates, as can be seen in Figure 6. Bandwidths (BWs) of the Gaussian filters were also estimated for each condition (Fig. 7). For both the pure-tone and timbre conditions, the narrowest BWs tend to be clustered centrally, around HG, consistent with earlier reports using just pure tones (Thomas et al., 2015).
The distribution of BWs for the pitch condition is again less clear-cut, although some participants show some indication of a central region with sharper tuning.
To better understand which regions in auditory cortex are driven by each condition, Figure 8 shows the variance accounted for (R2) of the held-out data in the β weights by each voxel's filter in each of the three conditions. As with the model CF and BW parameters, the spatial distribution of the high R2 voxels is similar in the pure-tone and timbre conditions. In the pitch condition, the number of voxels with a substantial amount of variance explained is reduced, with the exception of regions around the border of HG. While there appears to be some interindividual variability in the spatial patterns of high R2 voxels, the group average results show a small cluster of higher R2 values lining the anterolateral side of HG, bilaterally (Fig. 8, lower right). This location is consistent with previous studies' reports of the location of pitch-sensitive regions in both humans (Penagos et al., 2004; Norman-Haignere et al., 2013) and nonhuman primates (Bendor and Wang, 2006). However, because of the anatomical variability across individuals seen both in the present study and reported in earlier work (Rademacher et al., 2001), group-level maps have somewhat limited value and should be considered in conjunction with cortical representations from individual participants. R2 heat maps for each participant and comparisons of data with feature tuning model fits can be seen in Figures 9 and 10, respectively.
Pure-tone cortical tonotopy primarily reflects spectral content
The analysis shown in Figures 3 and 4 for each condition separately indicates strong similarities between the pure-tone and timbre conditions, suggesting that traditional pure-tone cortical tonotopy primarily reflects a sound's spectral content, rather than its pitch. To provide a more direct comparison of the cortical responses for different conditions, we calculated RSMs both within and across conditions (Fig. 11). Each matrix cell shows the correlation coefficient between voxels' responses for a given pair of stimuli (in terms of F, Fc, or F0) within the same ROIs as shown in the surface maps. High correlations in cells near the main diagonal, as seen in the within-condition comparisons for both pure-tone and timbre conditions (top-left and center boxes in each panel), indicate that tones that are closer in frequency (or Fc or F0) produce activation patterns that are more strongly correlated across voxels than tones that are distant in frequency. A similar diagonal correlation pattern can be seen when comparing patterns of activation between the pure-tone and timbre conditions (left-middle box), suggesting that voxels are responding to similar features in both conditions. In contrast, within-condition comparisons for the pitch condition (bottom right box) show higher correlations across all tones, and the RSMs comparing pitch and pure-tone conditions and comparing pitch and timbre conditions (bottom-left and bottom-middle boxes) show similarly high correlations for all higher frequencies (or Fc), independent of F0. This is likely driven by the relatively high Fc (2400 Hz) of all pitch stimuli. Overall, the RSM analysis confirms our initial analysis showing that classic tonotopy likely reflects the spectral content, and not the F0, of complex tones.
Shared and distinct tuning properties
To determine whether tuning to the different dimensions was anatomically distinct, we investigated voxels demonstrating clear tuning to one or more conditions. We did this by categorizing each voxel as being selective along a certain dimension if the fitted Gaussian function for that voxel accounted for at least 30% of the variance in that condition. We did this for each of the three conditions (pure tones, pitch, and timbre), resulting in each voxel being categorized independently as selective (or not) along each of the three dimensions. Figure 12A provides a surface plot for one participant, with voxels color-coded to indicate the condition(s) under which each voxel was categorized as selective.
In general, a large proportion of voxels jointly tuned to pure tones and timbre, as well as many voxels tuned specifically to timbre, are centered on HG. Beyond HG, while all combinations of tuning are represented in regions posterior to HG, there are prominent clusters of voxels in regions anterior to HG in both hemispheres, tuned either to both pure tones and pitch or just to pitch, in line with previously postulated pitch-sensitive regions (Patterson et al., 2002; Penagos et al., 2004; Norman-Haignere et al., 2013). Along with the surface plots in Figure 12A is a Venn diagram showing the proportions of voxels with each type of tuning for the same sample participant. The Venn diagram for the group-average data is shown in Figure 12B, along with examples of the data and model fits from individual voxels that provide examples of selectivity along one, two, or all three dimensions. Surface plots and Venn diagrams for each participant can be found in Figure 13. The relative proportions shown in these Venn diagrams remain similar for a range of R2 thresholds and are not specific to the selected 30% threshold. Visual inspection of the surface maps for each participant for cross-validation fold 1 versus fold 2 showed a high degree of consistency, and paired-sample t tests comparing the proportions within each of the seven sections of the Venn diagram across all 10 participants found no significant differences (at the p < 0.05 level) for any of the across-fold comparisons. A simulation was also run to determine the amount of overlap that would be expected by chance, assuming the tuning for each condition was distributed independently. In all cases, with the exception of the overlap between pitch and timbre, the overlap found in the present study exceeded the amount of overlap that could be ascribed to chance.
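The chance-overlap comparison can be approximated with a simple permutation test under the independence assumption described above. The selectivity masks below are hypothetical:

```python
import numpy as np

def chance_overlap(mask_a, mask_b, n_perm=1000, seed=0):
    """Compare observed joint selectivity against an independence null.

    Shuffles one boolean selectivity mask relative to the other, so each
    permutation preserves the two marginal proportions of tuned voxels
    but destroys any coupling between them.
    """
    rng = np.random.default_rng(seed)
    observed = int(np.sum(mask_a & mask_b))
    null = np.array([np.sum(mask_a & rng.permutation(mask_b))
                     for _ in range(n_perm)])
    p_value = float(np.mean(null >= observed))
    return observed, float(null.mean()), p_value

# Two strongly co-localized hypothetical maps over 200 voxels
mask_a = np.zeros(200, dtype=bool); mask_a[:60] = True
mask_b = np.zeros(200, dtype=bool); mask_b[:50] = True
obs, expected, p = chance_overlap(mask_a, mask_b)
# Here obs = 50, whereas independence predicts about 60 * 50 / 200 = 15
```

When the observed overlap exceeds the null distribution, as for most condition pairs in the study, the co-localization cannot be ascribed to the two maps' sizes alone.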
Overall, the greatest proportion of voxels are tuned to just the timbre of complex tones, followed by voxels jointly tuned to timbre and pure tones. Thus, over 70% of voxels are tuned to some aspect of spectral content (F, Fc, or both), with a relative lack of tuning to F0. The fact that many voxels appear to have selectivity for the spectral content of complex tones but not for the pure tones is consistent with findings from single-unit studies that have reported the existence of many cortical neurons that respond more strongly to spectrally complex sounds than to pure tones (Rauschecker et al., 1995; Bendor and Wang, 2005; Feng and Wang, 2017). Nevertheless, over 20% of voxels appear to exhibit tuning to pure-tone frequency without showing similar selectivity for the overall spectral shape of complex sounds.
Although the population appears to be dominated by voxels with spectral content selectivity, the resulting Venn diagrams are consistent with our other measures (e.g., Fig. 8) in showing a substantial proportion of voxels, approaching 30%, that appear to have F0 tuning, either exclusively or in combination with tuning to other dimensions.
Similarities in cortical tuning properties for pure-tone frequency, timbre, and pitch
Although the timbre response patterns, reflecting spectral content, seem to resemble pure-tone tonotopy (Fig. 11), and show the greatest degree of overlap in the Venn diagrams (Figs. 12, 13, purple section), some similarities were also observed between the pure-tone and pitch conditions, reflecting sensitivity to F0, as well as some overlap in the Venn diagrams (Figs. 12, 13, green section). Here, we provide a quantitative assessment of these similarities by comparing the model's CFs obtained in the different conditions for individual voxels. Figure 14 shows scatterplots of voxels that demonstrated tuning (i.e., selectivity along the dimension being tested) for both pure tones and timbre (Fig. 14A) or for both pure tones and pitch (Fig. 14B). As expected, there was a strong relationship between voxel CFs derived in the pure-tone condition and the CFs for the same voxels derived in the timbre condition (r = 0.89) with an average relationship close to unity.
Interestingly, although there were fewer voxels that were responsive to both pure-tone and pitch conditions, the correlation between the CFs for those voxels was similarly high (r = 0.79). This finding suggests that voxels tuned to both pitch and pure tones often have a best frequency corresponding to the best F0. Finally, the kernel density histograms indicate a broad peak of voxels with CFs (in terms of F, Fc, and F0) around 800 Hz, suggesting a somewhat nonuniform distribution of CFs across all three dimensions.
Pitch tuning in auditory cortex
Although the primary cortical tonotopic gradients seem to be dominated by spectral content, as shown by the close correspondence between responses in the pure-tone and timbre conditions, evidence for tuning to F0 or pitch was also observed in all participants. Voxels from one participant showing tuning to low, medium, and high F0s are shown in Figure 15A. To further explore the spatial layout of F0 tuning, we examined pitch-tuned voxels that were not sensitive to changes in either the pure-tone or timbre conditions (i.e., Figs. 12, 13, yellow section of Venn diagrams). Figure 15B shows voxels pooled across all participants with an R2 threshold of 0% (since model fit is evaluated on held-out data, the variance of the residuals can exceed that of the data, i.e., R2 < 0%), and a second, more stringent, map with an R2 threshold of 30%. These results suggest there may be some organization to these exclusively pitch-tuned voxels that is relatively insensitive to the R2 cutoff used in the analysis. There is a trend for a high-low-high F0 organization around the edges of HG, as denoted by the white and black arrows.
The impression that F0-tuned voxels were more likely to be found in nonprimary auditory cortex than frequency-tuned or Fc-tuned voxels was tested quantitatively by comparing the proportions of tuned voxels within each participant's ROI that were inside HG (the macroanatomical landmark associated with the “core” or primary auditory cortex, based on histologic methods; Wallace et al., 2002), versus outside HG. A two-tailed paired-samples t test revealed a significant increase in the proportion of pitch-tuned voxels (i.e., Figs. 12, 13, yellow section of Venn diagrams) outside of HG compared with inside HG (mean proportion of pitch voxels within HG = 11.59%; mean proportion of pitch voxels outside of HG = 18.36%; p < 0.01), mirrored by a proportional decrease in voxels tuned to spectral content (i.e., all other sections of the Venn diagrams, excluding yellow) outside of HG compared with inside HG (mean proportion of spectral voxels within HG = 88.41%; mean proportion of spectral voxels outside of HG = 81.64%; p < 0.01). Although the precise locus of A1 remains debated and is subject to large interindividual variability (Moerel et al., 2014), the fact that F0-tuning is found more outside than inside HG is consistent with pitch being represented more in higher-level cortical regions relative to spectral processing.
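The inside-versus-outside-HG comparison is a standard paired-samples t test across participants. A minimal numpy version follows; the per-participant percentages are hypothetical, chosen only to roughly match the reported group means, and are not the study's data:

```python
import numpy as np

def paired_t(x, y):
    """Paired-samples t statistic (df = n - 1) for two matched samples."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

# Hypothetical percentage of exclusively F0-tuned voxels per participant
outside_hg = np.array([18.4, 20.1, 15.2, 22.3, 17.8,
                       19.5, 16.0, 21.2, 14.9, 18.2])
inside_hg  = np.array([11.6, 13.0,  9.8, 14.2, 10.5,
                       12.1, 11.0, 13.5,  9.2, 11.0])
t_stat = paired_t(outside_hg, inside_hg)  # positive: more F0 tuning outside HG
```

Because the complementary spectral-content proportions are just 100% minus these values, the same test on them yields the mirrored decrease the paper reports.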
Critically, the voxels demonstrating clear F0 tuning (R2 > 30%) are distributed bilaterally throughout auditory cortex and are not confined to an isolated region or a single hemisphere. Across participants, there was no significant difference between hemispheres in the number of voxels selective for F0 (p = 0.58). This broad distribution of pitch tuning, which has also been found in macaques (Kikuchi et al., 2019), may help explain why it has proved difficult to build a consensus on the presence or location of a “pitch center” in auditory cortex (Hall and Plack, 2009; Bendor, 2012).
In addition, there appears to be a region along the STG that is tuned predominantly to low F0s, as indicated by the rectangles in Figure 15B. This area has been identified as a pitch-sensitive region, in addition to the region anterolateral to HG, which contains a large cluster of high F0-tuned voxels (e.g., Patterson et al., 2002; Penagos et al., 2004; Norman-Haignere et al., 2013). However, it is important to note that pitch sensitivity (i.e., stronger cortical responses to sounds with greater pitch salience), found in previous studies, is distinct from the representations of pitch selectivity (i.e., tuning to specific F0s) that we demonstrate here.
Spectral tuning model
To further support the claim that the topography shown in Figure 15 is a reflection of F0 tuning and cannot be explained by the subtle differences in spectral fine structure that occur with changes in F0, we employed a spectral tuning model. In this model, instead of using the spectral peak as the input for the timbre condition, and F0 as the input for the pitch condition, the Gaussian weighting function was applied to the full sound spectrum and was fitted separately to each of the three conditions. Performance for the pure-tone condition was the same as in the feature tuning model (as the input for both models is the frequency of the pure tone) and performance for the timbre condition was very similar in both models. However, as expected, given the lack of change in spectral envelope across the range of F0 values tested, this model explained virtually no variance for the pitch conditions and predicted essentially a flat line across all stimuli in the pitch condition (Fig. 16). Changing the color range for pitch to be around the Fc of the stimuli did not improve these CF maps. The fact that the spectral tuning model could account for the observed responses in the pure-tone and timbre conditions, but not in the pitch condition, further supports the claim that the tonotopy observed in studies using pure tones is predominantly driven by spectral content.
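The logic of the spectral tuning model can be sketched as follows: a Gaussian filter over log frequency weights the full stimulus spectrum, and the weighted sum is the predicted response. This is an illustrative sketch, not the authors' implementation; the voxel filter parameters and envelope bandwidth are hypothetical, and spectra are normalized to unit summed amplitude as a stand-in for level equalization:

```python
import numpy as np

def harmonic_spectrum(f0, fc, env_bw=0.5, f_max=16000.0):
    """Harmonic complex: components at multiples of f0 under a fixed
    Gaussian spectral envelope centered at fc (log2-frequency units),
    normalized to unit summed amplitude (stand-in for level matching)."""
    freqs = np.arange(f0, f_max, f0)
    amps = np.exp(-0.5 * ((np.log2(freqs) - np.log2(fc)) / env_bw) ** 2)
    return freqs, amps / amps.sum()

def spectral_model_response(freqs, amps, cf, bw):
    """Spectral tuning model: a Gaussian filter (center cf, bandwidth bw,
    log2 units) weights the full spectrum; the weighted sum is the
    predicted voxel response."""
    weights = np.exp(-0.5 * ((np.log2(freqs) - np.log2(cf)) / bw) ** 2)
    return float(np.sum(weights * amps))

voxel_cf, voxel_bw = 2400.0, 0.5  # hypothetical voxel filter

# Pitch condition: F0 varies, envelope fixed at Fc = 2400 Hz
pitch_resp = [spectral_model_response(*harmonic_spectrum(f0, 2400.0),
                                      voxel_cf, voxel_bw)
              for f0 in (100.0, 200.0, 400.0, 800.0)]

# Timbre condition: F0 fixed at 100 Hz, envelope peak (Fc) varies
timbre_resp = [spectral_model_response(*harmonic_spectrum(100.0, fc),
                                       voxel_cf, voxel_bw)
               for fc in (1200.0, 2400.0, 4800.0)]
```

In this sketch, `pitch_resp` is nearly constant across F0 — the flat line the model predicts in the pitch condition — whereas `timbre_resp` peaks when the envelope matches the voxel's filter, which is the asymmetry that lets the model account for the pure-tone and timbre data but not the pitch data.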
Relationship between F0 and spectral density
Although our results are consistent with F0 selectivity, it is important to note that F0 is inversely related to spectral density: as F0 increases, the spacing between neighboring spectral components grows. Thus, it is, in principle, possible that voxels that appear to be tuned to F0 are instead tuned to different degrees of spectral density. We addressed this potential confound by using our pure-tone and timbre data, noting that pure tones are spectrally less dense than complex tones. Specifically, if spectral density were determining the responses in our F0-tuned voxels, voxels tuned to high F0s (i.e., potentially reflecting tuning to lower spectral density) should respond more strongly to pure tones than to timbre tones with the same CF. In contrast, voxels tuned to low F0s (i.e., potentially reflecting tuning to higher spectral density) should then respond more strongly to the timbre tones, which are spectrally dense. To test this prediction, we calculated the difference between the mean pure-tone and mean timbre responses in each F0-tuned voxel (defined as a voxel whose F0 model fit exceeded an R2 of 30%) and plotted this difference as a function of the voxels' best F0 (Fig. 17A). If the apparent F0 tuning instead reflected a tuning to spectral density, then the relative response to pure tones over timbre tones should increase as a function of best F0, leading to a positive slope. In fact, we found no relationship between best F0 and the difference in mean response to pure tones and timbre tones (mean slope = 0.00, 95% confidence interval (CI) = [−0.16,0.06]). This outcome is consistent with responses driven by F0, rather than spectral density. To confirm these voxels' selectivity for F0, we analyzed their responses using the same approach that was used to rule out their selectivity for spectral density. Specifically, we subtracted the mean β response to the lowest pitch stimulus (F0 = 100 Hz) from the mean β response to the highest pitch stimulus (F0 = 1600 Hz).
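The slope test just described can be sketched as follows, using simulated F0-tuned voxels whose pure-tone versus timbre preference is, by construction, unrelated to best F0 (all values hypothetical):

```python
import numpy as np

def density_confound_slope(best_f0, pure_mean, timbre_mean):
    """Regress (mean pure-tone minus mean timbre response) on log2 best F0.

    A spectral-density account predicts a positive slope (high-F0 voxels
    should favor the spectrally sparse pure tones); genuine F0 tuning
    predicts a slope near zero.
    """
    diff = np.asarray(pure_mean) - np.asarray(timbre_mean)
    slope, _intercept = np.polyfit(np.log2(best_f0), diff, 1)
    return float(slope)

rng = np.random.default_rng(1)
best_f0 = np.geomspace(100.0, 1600.0, 100)   # voxels' best F0s
pure = rng.normal(1.0, 0.05, best_f0.size)   # mean pure-tone responses
timbre = rng.normal(1.0, 0.05, best_f0.size) # mean timbre responses
slope = density_confound_slope(best_f0, pure, timbre)  # ~0 by construction
```

The confirmatory analysis in the next step is the same regression with the response difference replaced by the β difference between the highest and lowest pitch stimuli, where a positive slope is the signature of genuine F0 preference.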
The positive slope in Figure 17B (mean slope = 1.13, 95% CI = [0.92,1.19]) indicates an increased preference for higher pitch stimuli as a function of pitch CF preference, thus confirming their selectivity for F0. While more research is needed expressly controlling for effects of spectral density, the present findings are consistent with the findings of Penagos et al. (2004), who reported no difference in activation maps when contrasting responses to spectrally sparse versus spectrally dense harmonic tones, so long as both evoked a similar degree of pitch salience. The argument in favor of F0 tuning, as opposed to spectral-density tuning, is further bolstered by the scatterplot in Figure 14B, which shows a good correspondence between the best frequencies and best F0s for voxels tuned to both pure-tone frequency and F0.
Discussion
This fMRI study used complex tones to dissociate the auditory cortical representations of F0 (which determines pitch) from spectral content (which influences timbre) to determine which of these underlies the well-known tonotopic organization observed with pure tones. Consistent with previous pure-tone studies (Formisano et al., 2003; Da Costa et al., 2011; Striem-Amit et al., 2011; Saenz and Langers, 2014; Thomas et al., 2015), we found bilateral V-shaped high-low-high gradient reversals overlapping HG in all participants, with narrower tuning BWs around HG and broader tuning BWs in surrounding regions. Although the alignment across participants is complicated by individual differences in the size, shape, and number of HGs in each hemisphere (Rademacher et al., 2001), this high-low-high pattern of tonotopy was preserved at the group level (Fig. 3).
A similar high-low-high pattern was found with harmonic complex tones that maintained a constant F0 but varied systematically in their spectral peak or Fc, resulting in changes in timbre along a dull-bright continuum (Fig. 4). The similarity of the pure-tone frequency and complex-tone Fc representations both within and beyond HG suggest that the tonotopy observed in earlier studies was driven primarily by spectral content. However, the fact that many voxels exhibited selectivity for spectral content specifically in complex tones but not pure tones and vice versa (Figs. 12, 13) suggests an organization more complex than the simple filtering found in the cochlea. Most strikingly, we found evidence of an orderly representation of pitch-tuned voxels, particularly in regions surrounding HG, when examining responses to complex stimuli with a fixed spectral peak but varying in F0 (Figs. 4, 5, 15).
Relationship to previous studies
In addition to the pure tones commonly used to reveal robust cortical tonotopy, complex natural sounds, such as speech, musical instruments, and animal vocalizations, have been used to derive feature representations in auditory cortex using fMRI (Moerel et al., 2012; De Angelis et al., 2018). However, as with pure tones, the positive correlation between F0 and spectral energy often found in natural sounds (Assmann and Nearey, 2008; Hillenbrand and Clark, 2009; McAdams, 2013) makes it difficult to conclude whether the derived maps reflect spectral energy distributions, F0, or a combination of the two. The present study resolves this issue by independently varying F0 and Fc to tease apart the cortical topography of these features.
Early MEG studies using complex tones attempted to study the relationship between pitch and sound spectra, but reached differing conclusions: either that cortical tonotopy reflects pitch (Pantev et al., 1989) or that it comprises orthogonal representations of both pitch and spectral distribution (Langner et al., 1997). However, the limited spatial resolution of MEG makes it poorly suited to fine-grained analysis of the topographical organization of cortical representations. The present study used high-field fMRI to explore the topography of these features at a much higher spatial resolution.
Lastly, several studies have used fMRI in an effort to identify regions of human auditory cortex that respond preferentially to pitch-eliciting stimuli (Penagos et al., 2004; Hall and Plack, 2009; Norman-Haignere et al., 2013; De Angelis et al., 2018). While their focus was on finding regions exhibiting categorical pitch sensitivity, the present study extends this work by treating pitch (F0) as a continuous variable and dissociating F0 selectivity from frequency selectivity (tonotopy). Our results reveal voxel-wise tuning to different ranges of F0, in regions surrounding HG, that is distinct from tonotopy. As such, our work provides a unique and complementary perspective on F0 encoding in human auditory cortex.
Voxel tuning across multiple dimensions
A considerable proportion of voxels exhibited tuning in two or more of the conditions tested, particularly in the pure-tone and timbre conditions (Figs. 12, 13). Although a majority of the voxels that exhibited tuning to more than one dimension were selective in the pure-tone and timbre conditions, it may seem surprising that the overall proportion was not greater, given the evidence that tuning in both these conditions is driven by spectral content. This incomplete overlap between the populations may be due in part to the fact that the range of pure-tone frequencies (100–6400 Hz) was greater than the Fc range (400–6400 Hz), but may also reflect genuine differences in selectivity based on higher-level features, such as sound complexity or BW. Indeed, our findings are in line with single- and multi-unit studies in other species that have identified neurons that are sensitive to either pure tones or complex sounds, but not both (Rauschecker et al., 1995; Feng and Wang, 2017; Kikuchi et al., 2019).
Our data suggest the existence of two distinct cortical representations: one based on frequency selectivity (i.e., tonotopy), and the other based on pitch or F0 selectivity. While partially overlapping, responses in HG were predominantly driven by spectral content, as reflected by the strong model fits in both the pure-tone and timbre conditions. Pitch representations, on the other hand, were mostly found in regions surrounding HG. These findings are consistent with the idea that lower-level frequency content is processed predominantly in primary auditory region A1 and higher-level sound features (such as pitch) are processed predominantly in surrounding nonprimary (belt and parabelt) regions.
Pitch tuning in auditory cortex
While many studies have explored responses to pitch in auditory cortex, the approach has generally been to compare sounds with salient pitches to those with weak pitches or to present a variety of pitch-evoking stimuli to identify pitch-sensitive regions (Penagos et al., 2004; Hall and Plack, 2009; Norman-Haignere et al., 2013). In contrast, the present study explored voxel-wise tuning to different pitches while controlling for spectral variations. We were able to identify voxels in all participants that exhibited selectivity along the F0 dimension. The initial maps incorporating all voxels exhibiting F0 selectivity suggested a (somewhat noisy) high-low-high organization of F0. However, when only voxels exclusively selective along the F0 dimension (R2 < 30% in the other two dimensions) were selected (Fig. 15), regions distinctly tuned to low, medium, and high F0s were found bilaterally, with clear clusters of voxels tuned to low F0s around the medial portion of HG (black arrows) and lining STG (black rectangles), and clear clusters of voxels tuned to high F0s in regions anterolateral to HG (white arrows). This distributed pitch coding throughout auditory regions is consistent with findings from cortical surface electrode recordings in humans (Gander et al., 2019).
Bilaterality in cortical representations
There is some disagreement in the literature regarding the laterality of cortical pitch representations. Bilateral sensitivity to pitch has been found in a number of studies (Patterson et al., 2002; Penagos et al., 2004; Warren et al., 2005; Hall and Plack, 2009; Norman-Haignere et al., 2013; Allen et al., 2017, 2018; De Angelis et al., 2018), whereas some others have suggested a right hemisphere lateralization for some forms of pitch processing (Zatorre et al., 2002; Hyde et al., 2008; Albouy et al., 2020). No significant difference in the number of pitch-tuned voxels between hemispheres was found in the present study, suggesting that pitch selectivity is represented bilaterally.
Limitations
Our results are consistent with pitch selectivity across extended regions of auditory cortex. However, as noted above, F0 is inversely proportional to spectral density of the harmonics. Although our additional analyses supported the hypothesis that the responses reflected selectivity to F0 and not spectral density, the use of both harmonic and inharmonic tones could help to better dissociate F0 and pitch from spectral density.
Our design involved varying either the F0 or Fc of a complex tone, while keeping the other dimension fixed. A fully-crossed stimulus design (i.e., pairing all possible F0s with all possible values of Fc) would have required more scanning sessions, but would help determine the generalizability of the F0 and Fc representations, such as whether the tuning to F0 of a given voxel is independent of the stimulus Fc. Now that the existence of F0 tuning has been established, such additional questions could be addressed in a more extensive study.
Finally, an open question is how relative changes are represented in cortex (i.e., contour and interval size, regardless of absolute changes in frequency or F0). Relative pitch processing is essential for both music and speech comprehension, but little is known about its cortical representations. There is some fMRI evidence of lateralization differences for the processing of relative contour changes versus absolute interval sizes (Stewart et al., 2008), and recent studies using subdurally implanted electrodes, recorded while participants listened to variable pitch contours in speech, provide evidence of both absolute and relative pitch encoding in human auditory cortex (Tang et al., 2017; Hamilton et al., 2021). While the present study suggests there may be systematic maps of absolute pitch, follow-up work is needed to explore the contributions of relative pitch in the cortical processing hierarchy and to assess whether there is any systematic representation based on the magnitude and/or direction of change. Since relative changes are based on the relationships between consecutive sounds, this would point toward a higher, more holistic level of processing in the hierarchy compared with absolute pitch representations.
Footnotes
This work was supported by the National Institutes of Health (NIH) Grant R01 DC005216. The Center for Magnetic Resonance Research (CMRR) is supported by NIH Grants P41 EB027061, P30 NS076408, and S10 RR026783 and by the W.M. Keck Foundation. We thank Anahita Mehta, Omer Faruk Gulban, Andrea Grant, Cheryl Olman, and Stephen Engel for helpful assistance, training, and advice.
The authors declare no competing financial interests.
Correspondence should be addressed to Emily J. Allen at prac0010@umn.edu.