Frequency-specific attentional modulation in human primary auditory cortex and midbrain

Paying selective attention to an audio frequency selectively enhances activity within primary auditory cortex (PAC) at the tonotopic site (frequency channel) representing that frequency. Animal PAC neurons achieve this 'frequency-specific attentional spotlight' by adapting their frequency tuning, yet comparable evidence in humans is scarce. Moreover, whether the spotlight operates in human midbrain is unknown. To address these issues, we studied the spectral tuning of frequency channels in human PAC and inferior colliculus (IC), using 7-T functional magnetic resonance imaging (FMRI) and frequency mapping, while participants focused on different frequency-specific sounds. We found that shifts in frequency-specific attention alter the response gain, but not tuning profile, of PAC frequency channels. The gain modulation was strongest in low-frequency channels and varied near-monotonically across the tonotopic axis, giving rise to the attentional spotlight. We observed less prominent, non-tonotopic spatial patterns of attentional modulation in IC. These results indicate that the frequency-specific attentional spotlight in human PAC as measured with FMRI arises primarily from tonotopic gain modulation, rather than adapted frequency tuning. Moreover, frequency-specific attentional modulation of afferent sound processing in human IC seems to be considerably weaker, suggesting that the spotlight diminishes toward this lower-order processing stage. Our study sheds light on how the human auditory pathway adapts to the different demands of selective hearing.


Introduction
Listening in natural auditory scenes typically involves multiple sound sources that simultaneously stimulate the ears. To effectively extract information about a given source of interest from the complex acoustic input, the auditory system relies on selective attention. This process involves the allocation of neural and cognitive resources to acoustic input features representing the target source. This prioritizes the processing of the aforementioned features and gives them a perceptual advantage over non-attended features. Which specific feature is selected depends on the relative saliency of the available features and the listener's current goal: the listener's focus of attention can be involuntarily attracted by highly salient acoustic input features ('bottom-up' or exogenous attention) or volitionally allocated based on the listener's current goals ('top-down' or endogenous attention). In everyday life, these two forces often compete: for example, when attempting to follow a conversation in a noisy environment, the listener pays endogenous selective attention to the voice of interest to overcome the distracting effect of irrelevant salient sounds (for reviews, see e.g. (Alain and Bernstein, 2008;Shinn-Cunningham, 2008)).
Human neuroimaging studies have shown that focusing endogenous selective auditory attention on a specific frequency in acoustic input (frequency-specific attention) selectively enhances the activity of neuronal populations in primary auditory cortex (PAC) that are tuned to this attended frequency, relative to populations tuned to the nonattended frequencies (Paltoglou et al., 2009;Da Costa et al., 2013;Oh et al., 2013;Riecke et al., 2017). Analogous to the visual system (Tootell et al., 1998;Brefczynski and DeYoe, 1999), this operation can be conceptualized as that of an 'attentional spotlight' that focuses afferent sound processing in a behaviorally relevant and frequency-specific manner so as to prioritize processing in primary cortical areas that are tuned to the frequency of interest. Despite the converging evidence for the existence of such a tonotopic attentional spotlight in human PAC, two important questions regarding the neural basis of endogenous frequency-specific auditory attention in the human brain remain unanswered.
First, does frequency-specific attention modulate the frequency tuning of human PAC? Animal electrophysiology studies have shown that PAC neurons implement an attentional spotlight mechanism by adapting their frequency response, i.e., their responsiveness to different audio frequencies (for a review, see (Fritz et al., 2007a)). More specifically, attention can modulate the overall strength of neurons' frequency response (response gain) and/or the shape of this response (frequency tuning): neurons whose preferred frequency lies close to the attended frequency increase their responses to input (Atiani et al., 2009; Lakatos et al., 2013; O'Connell et al., 2014), compared with neurons tuned to more distant, unattended frequencies. In addition to this response gain, the 'near' neurons adapt (or 'reshape') their frequency tuning so as to increase and reduce their relative responsiveness to the attended and unattended frequencies, respectively (Fritz et al., 2007b; Atiani et al., 2009; David et al., 2012). In humans, it is still unclear whether attention affects response gain, frequency tuning, or both (Lee et al., 2014).
Invasive electrocorticography studies in neurosurgical patients have found modulation of spectrotemporal tuning in auditory cortex induced by variations in speech context and speaker-specific attention (Mesgarani and Chang, 2012; Holdgraf et al., 2016). Magnetoencephalography studies have observed attentional enhancement of tone-evoked long-latency auditory-evoked magnetic fields, which is thought to reflect modulation of neuronal-population frequency tuning (Kauramaki et al., 2007; Okamoto et al., 2007, 2009; Ahveninen et al., 2011). A recent functional magnetic resonance imaging (FMRI) study investigating the laminar organization of frequency tuning in human PAC found that intermodal (auditory vs. visual) attention modulates the bandwidth of frequency tuning in the upper layers of PAC (De Martino et al., 2015). However, it remains unclear whether the attentional spotlight in the human brain emerges from modulation of response gain, frequency tuning, or both, as in animals.
Second, does frequency-specific attention modulate afferent sound processing in human auditory subcortical regions? Studies of frequency-specific attention have focused almost exclusively on cortical regions; however, recent electrophysiological recordings from the inferior colliculus (IC) of ferrets reveal attentional modulation of frequency tuning (Slee and David, 2015). This modulation is similar to that observed in PAC (Fritz et al., 2003; David et al., 2012), suggesting that frequency-specific attentional retuning in PAC is partially passed on to IC. Moreover, electric stimulation of PAC neurons has been shown to alter the frequency tuning of IC neurons (bats: Zhang et al., 1997; Yan and Suga, 1998; mice: Yan and Ehret, 2002; gerbils: Sakai and Suga, 2001), presumably via tonotopic corticofugal projections (cats: Andersen et al., 1980; Winer et al., 1998; rats: Herbert et al., 1991; ferrets: Bajo et al., 2007). In humans, findings of subcortical frequency-specific attentional modulation are relatively scarce (Galbraith et al., 1998; Maison et al., 2001; Lehmann and Schonwiesner, 2014; Yamagishi et al., 2016) and partially controversial (discussed in (Varghese et al., 2015)). Moreover, the locus of the reported subcortical attentional modulation remains unclear due to the limited spatial resolution of the applied methods (otoacoustic emission recordings and scalp electroencephalography).
In the present study, we addressed these unresolved questions by investigating frequency tuning in the human auditory pathway with ultra-high field FMRI during intramodal auditory attention tasks on target sounds with distinct frequencies. By measuring cortical and subcortical responses to various frequency-mapping stimuli under different states of endogenous frequency-specific attention, we could assess whether and how goal-directed frequency-specific attention shifts alter the frequency response of neuronal populations at different stages of the human auditory pathway. We demonstrate that the attentional spotlight in human PAC, as measured with FMRI, arises primarily from local alterations in overall response gain (i.e., elevation or reduction of the entire frequency response) rather than from frequency retuning (i.e., no shifts in characteristic frequencies or sharpening/widening of the frequency-tuning profile). We further provide results suggesting that frequency-specific attentional modulation can be observed with FMRI in human IC.

Participants
Ten healthy volunteers (four females, ages: 21-39 years) participated in the study. They reported no history of hearing disorders, or vision or motor problems, and had normal hearing (defined as pure-tone hearing thresholds of less than 25 dB HL at 0.25, 0.5, 0.75, 1, 2, 3, 4, and 6 kHz), except for one participant (P7) who had a slightly elevated threshold (30 dB HL) in one ear at 3 kHz. Another participant's (P10) task performance was observed to approximate ceiling in the FMRI experiment (see Results). Excluding these two participants' data from the analysis did not alter the conclusions that can be drawn from the study. All participants gave written informed consent according to a protocol approved by the local research ethics committee (Ethical Review Committee Psychology and Neuroscience, Maastricht University) and received a monetary reimbursement for their participation.

Tasks
Auditory stimuli are described in Fig. 1A-B, and in more detail in the next section. The experimental manipulation of listeners' focus of attention was implemented by varying the listener's task. An active task served to direct frequency-specific attention to either a low-frequency auditory target stream (L-target) or a high-frequency target stream (H-target). A passive task served to either enable localization of functional brain regions of interest (see Definition of functional-anatomical regions of interest) or establish an attentional baseline. For the active task, listeners were instructed to attend to a pre-defined target stream (L-target or H-target) and count the number of gaps in it. For the passive task, they were instructed to ignore any sound and rest. We refer to these three different experimental conditions as 'L-task', 'H-task', and 'No-task'; see Fig. 1B. Potential confounding of attention effects by physical stimulus variations was excluded by using the same set of stimuli for all task conditions. Fig. 1C visualizes the design of the experimental trials. Trials lasted 13 s and began with the simultaneous presentation of a visual cue (symbol "L", "H", or "+", depending on the task) and an auditory stimulus, in the absence of FMRI scanning. After the auditory stimulus, a cluster of two consecutive FMRI volumes was acquired. Subsequently and only in the active tasks, listeners gave a behavioral response, and visual feedback was provided by displaying the correct response in either green or red color, depending on the correctness of the listener's response. Stimulus presentation and FMRI were timed to capture the blood oxygenation level-dependent (BOLD) responses that were evoked primarily by the task and stimuli rather than by ambient scanner noise. 
This relative timing was estimated based on convolving the auditory stimulation protocol (task stimuli and scanner noise, modeled as boxcar waveforms) with a standard hemodynamic response function; for details see (Riecke et al., 2017). Occasional null trials served to establish a silent baseline. These trials involved no auditory stimulus but only the task cue for the upcoming block of trials. Participants were instructed to rest on these trials.

Stimuli
The stimulus design was inspired by a recent visual attention study, which mapped retinotopy in human primary visual cortex with FMRI while participants covertly attended to various peripheral target locations in visual space (Klein et al., 2014). The present auditory stimuli were composed of three simultaneous frequency bands: a specific mapping-frequency band and two target-frequency bands, all of which lasted 5.5 s; see Fig. 1A. The center frequency (CF) of the mapping band was varied across trials to enable the identification of frequency channels and the assessment of frequency tuning (frequency mapping). The CFs of the target bands were fixed above and below the edges of the mapped frequency range to provide listeners with distinct, frequency-specific targets on which they could focus their attention during frequency mapping. Exemplary stimuli are provided as audio files in the Supplementary Material.

Mapping bands
Mapping bands had a CF of 0.4, 0.7, 1.0, 1.5, 2.1, 3.0, 4.3, or 6.0 kHz (corresponding to a fixed CF-spacing of three equivalent rectangular bandwidths [ERBs]). We refer to these eight different experimental conditions as 'mapping frequencies'; see Fig. 1B. The bandwidth of each mapping band equaled two ERBs, so that adjacent bands were separated by a fixed spectral gap of one ERB. Mapping bands comprised a sequence of tone pairs pulsating at a fixed rate of 4 Hz, which has been shown to evoke strong BOLD responses in human PAC (Harms and Melcher, 2002; Hart et al., 2003). For each individual tone pair within the overall sequence, the inter-tone interval was 42 ms and the two tone frequencies were randomly chosen from each ERB flanking the current CF. Tones lasted 42 ms and had fixed amplitude. Mapping bands were matched for loudness (see Procedure) to promote a uniform activation level across PAC.
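As a rough check of the stated three-ERB spacing, the mapping-band CFs can be converted to an ERB-number scale. The sketch below assumes the Glasberg and Moore (1990) ERB-number formula; the text does not state which ERB formula was used, and the function and variable names are illustrative only.

```python
import math

# Mapping-band center frequencies in kHz, as listed in the text.
MAPPING_CFS_KHZ = [0.4, 0.7, 1.0, 1.5, 2.1, 3.0, 4.3, 6.0]


def erb_number(freq_khz):
    """Position of a frequency (kHz) on the ERB-number scale.

    Assumption: Glasberg & Moore (1990) formula; the study may have
    used a different ERB definition.
    """
    return 21.4 * math.log10(4.37 * freq_khz + 1.0)


erb_positions = [erb_number(f) for f in MAPPING_CFS_KHZ]
# Spacing between adjacent CFs, in ERBs.
erb_steps = [hi - lo for lo, hi in zip(erb_positions, erb_positions[1:])]

for cf, step in zip(MAPPING_CFS_KHZ[1:], erb_steps):
    print(f"{cf:.1f} kHz lies {step:.2f} ERBs above the previous CF")
```

Under this assumed formula the steps come out at roughly three ERBs each; the residual deviations are plausibly due to the CFs being reported to one decimal place.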

Target bands
The low-frequency target (L-target) and high-frequency target (H-target) had a CF of 0.2 kHz and 8.3 kHz, respectively. They comprised a 0.5-ERB-wide band of spectrally uniform noise, which was generated by mixing tones with random phases, fixed amplitude, and fixed frequency spacing (0.25 Hz). To facilitate attentional selection of targets, the CFs of the noise bands were modulated (sinusoidal modulation; depth: ±0.25 ERB, frequency: 1.2 Hz and 0.4 Hz, starting phase: 0° and 180°, for L-target and H-target respectively). Thus, the targets fluctuated within a 1-ERB band that was separated by 1.3 ERBs from the adjacent mapping band. To implement the gap detection task (see Tasks), a random number of maximally three gaps (duration: 313 ms) was inserted independently in each target band. Gaps could occur with a fixed probability of 50% within each third of the stimulus interval (sub-intervals), for each target band. The exact timing of gaps within these sub-intervals was random, with the constraint that they did not overlap with the edges of the sub-interval (its initial and final 375 ms) or any gap in the other target band.

Mixing of mapping band and target bands
To equalize task difficulty, the perceptual separability of targets from the distracting mapping bands was matched across the different stimuli and targets in two steps. First, spectrotemporal grouping cues were equalized across stimuli: increases in the spectral proximity of target band and mapping band were counteracted by decreasing the temporal coherence of these two bands. This was done by applying an amplitude modulation (a sinusoid with the same frequency as the tone-pair rate in the mapping band and fixed starting phase) to each target and varying its saliency (modulation depth: 100-0%) as a negative linear function of spectral proximity. For example, low-CF mapping bands would render the H-target more separable than the L-target due to their larger distance from the H-target; to compensate for this perceptual difference, the separability of the H-target was reduced by increasing its temporal coherence with the mapping band, i.e., by applying a more salient modulation to this target. Second, listeners fine-tuned the intensity of each target before the FMRI experiment (for details, see Procedure). The resulting target band intensities were substantially lower than the intensity of mapping bands, on average −20 dB and −35 dB for the L-target and H-target, respectively. Therefore, auditory-evoked brain responses were driven primarily by the mapping bands, and potential distortion of frequency mapping by physical target variations, which were inevitably induced by the difficulty-matching procedure, was reduced.
Fig. 1 (caption, panels B-E). B. Schematic graph of the experimental design. The CF of the mapping band (mapping frequency) was varied across the eight conditions shown on the vertical axis. Listeners' task was varied across the three conditions (No-task, L-task, or H-task) shown on the horizontal axis. Each cell within the graph sketches sound level as a function of frequency (power spectrum) of the auditory stimulus that was presented in the respective condition. Dashed outlines indicate the specific target band on which listeners were instructed to focus their attention and count gaps. C. Schematic of the trial design. Trials involved the simultaneous presentation of a visual cue (symbol "+", "L", or "H" to indicate the No-task, L-task, and H-task condition, respectively) and an auditory stimulus. These stimuli were presented in a silent period of 10 s, followed by the acquisition of a pair of FMRI volumes (2 × 1500 ms). Subsequently, listeners reported the number of detected target gaps and received visual feedback on response correctness (in L-task and H-task). The illustration exemplifies a late portion of a null trial (−6 to 0 s), followed by a trial from the L-task condition (0-13 s). Thus on the latter trial, the participant should count the number of gaps in the L-target while ignoring the H-target. Given the stimulus shown in panel A, the correct response is "2", which is visually fed back to the participant in red because the participant gave an incorrect response ("1") in this example. D. Schematic of the block design. Each block contained four trials of the same task preceded by a null trial, during which the task cue for the upcoming task was already presented. E. Schematic of an exemplary experimental run. Each run contained four blocks of each task in random order. Each of these four-block sets contained two full sets of mapping frequencies in random order.
Onsets and offsets in the auditory stimulus components were ramped using 5-ms raised-cosine ramps, except for target gaps, which were adjoined by 30-ms ramps. The auditory stimuli were presented diotically via a soundcard (Sound Blaster X-Fi Xtreme Audio) with 16-bit resolution and 44.1-kHz sampling frequency at an overall sound pressure level of approximately 80 dB. Non-flatness of the frequency response of the MR-compatible audio system (Sensimetrics S14, Malden, MA, USA) in the auditory stimulus range (0.2-8.7 kHz) was corrected by applying the manufacturer's equalization filters. Stimulus presentation and button response collection were controlled using Presentation 17.1 software and trials were triggered by MR image acquisitions.

Procedure
The experimental procedure involved an initial training session and a subsequent FMRI session using the same MR-compatible audio system.

Training session
In the training session, participants first matched the loudness of the mapping bands to that of a fixed reference signal by adjusting the intensity of each band. Second, they were familiarized with the gap-detection task. Third, as mentioned above (see Mixing of mapping band and target bands), they adjusted the intensity of each target to the minimum level they deemed sufficient for just detecting the gaps. This was done in the presence of each of the lowest and highest mapping bands and repeated until listeners reported that task difficulty matched across all stimuli and targets. For intermediate mapping bands, the intensity of each target was set by interpolating logarithmically between the intensities obtained with the extreme mapping bands. Fourth, the obtained individual settings were applied in a subsequent training of the L-task and H-task, which lasted approximately 30 min.
To enable a manipulation check, a control '?-task' was included that was identical to the L- and H-tasks, except that the visual cue was presented only after the auditory stimulus; thus listeners could not reliably focus their attention exclusively on the actual target as in the L- and H-tasks (Riecke et al., 2017). Impaired performance in this control task would indicate that listeners reliably paid attention exclusively to the specified target in the L-task and H-task conditions.

FMRI session
In the FMRI session, participants first verified whether task difficulty still matched across the individualized stimuli inside the scanner and further fine-tuned them if needed. Before each functional run, participants balanced loudness across the two ears if needed. In addition to the insert earphones, they wore ear muffs, which together attenuated ambient noise by approximately 45 dB. Participants' heads were stabilized with cushions and they were instructed to keep as still as possible while being inside the scanner. They were further instructed to keep their eye gaze focused on the center of the screen throughout the experiment to reduce visual input variations. They gave behavioral responses by pressing one of four buttons with the four fingers of their right hand.
Each repetition (FMRI run) of the experiment comprised 48 task trials (two repetitions per mapping-frequency × task condition) and twelve null trials, lasting in total 13 min. Trials of the same task condition were clustered into blocks of four, preceded by a null trial during which the task cue was presented to specify the upcoming task condition; see Fig. 1D. The twelve blocks and four task trials within each block were presented in random order; see Fig. 1E. Overall, an FMRI session comprised anatomical scans (overall duration: 8 min) followed by five runs of the experiment (except for participants P1 and P4 who underwent four and three functional runs, respectively).
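The run-level numbers above can be cross-checked with a few lines of arithmetic. The sketch below uses only counts stated in the text (including the 120 volumes per run reported under Imaging settings) and assumes that null trials have the same duration and number of acquired volumes as task trials; the variable names are illustrative.

```python
# Design counts taken from the text.
N_MAPPING_FREQS = 8
N_TASKS = 3              # No-task, L-task, H-task
REPS_PER_CONDITION = 2   # per run
TRIALS_PER_BLOCK = 4
N_NULL_TRIALS = 12
TRIAL_DURATION_S = 13    # assumed to hold for null trials as well
VOLUMES_PER_TRIAL = 2    # assumed to hold for null trials as well

task_trials = N_MAPPING_FREQS * N_TASKS * REPS_PER_CONDITION
n_blocks = task_trials // TRIALS_PER_BLOCK
total_trials = task_trials + N_NULL_TRIALS
run_minutes = total_trials * TRIAL_DURATION_S / 60
volumes_per_run = total_trials * VOLUMES_PER_TRIAL

print(task_trials, n_blocks, total_trials, run_minutes, volumes_per_run)
# prints: 48 12 60 13.0 120
```

One null trial per block (twelve blocks, twelve null trials) is consistent with the block design described for Fig. 1D.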

Imaging settings
Brain images were collected with a Siemens Magnetom 7 T MRI scanner (Siemens Medical Systems, Erlangen, Germany) equipped with a 32-channel head coil (Nova Medical Inc., Wilmington, MA, USA). BOLD-signal changes were recorded using a modified gradient-echo echo-planar imaging (EPI) sequence at a nominal resolution of 1.5 × 1.5 × 1.5 mm³ (field of view: 201 × 201 mm², 29 transverse slices without gap, echo time: 17 ms, repetition time: 1500 ms, readout bandwidth: 1776 Hz/pixel, in-plane GRAPPA acceleration factor: 2, nominal flip angle: 60°). Images were acquired such that a pair of two consecutive volumes was acquired on each trial followed by a 10-s interval without scanning (Fig. 1D), yielding two clustered volumes every 11500 ms. This was done to increase sensitivity and enable the separation of auditory stimuli and scanner noise in terms of both perception and evoked BOLD responses (Hall et al., 1999). During each functional run, 120 volumes were collected. Slices were positioned approximately orthogonally to the Sylvian fissure and in 45° orientation relative to the longitudinal axis of the brainstem, so as to cover bilaterally the brain regions of interest (Heschl's region and IC) and reduce cardiac-induced pulsatile motion artefacts in IC (Slabu, 2010). Structural T1-weighted images optimized for gray-white matter contrast were obtained using a magnetization-prepared two rapid-acquisition gradient-echo (MP2RAGE; Marques et al., 2010) sequence at 0.7 × 0.7 × 0.7 mm³ resolution (field of view: 224 × 224 × 168 mm³, echo time: 2.47 ms, repetition time: 5000 ms, readout bandwidth: 250 Hz/pixel, GRAPPA acceleration factor: 3, inversion times: 900 ms and 2750 ms, flip angles: 5° and 3°). All functional and anatomical images were acquired in anterior-posterior phase encoding direction. 
To enable subsequent correction for echo-planar image distortions using TOPUP (Andersson et al., 2003), an additional pair of EPI volumes (correction volumes) with reversed phase-encoding direction was acquired between the second and third functional run.

Functional data processing
Data were analyzed using BrainVoyager QX 2.8 (Brain Innovation BV, Maastricht, The Netherlands) and Matlab (The Mathworks Inc., Natick, MA, USA) software. Preprocessing of functional images included the following steps: First, for each run, volume pairs were merged into single average volumes. This was done by decomposing the acquired series of volumes into a pair of two series (n = 1, 2) and computing their weighted average as described elsewhere (Riecke et al., 2017). All subsequent analyses were applied to the resulting average volumes. Second, head motion was corrected by co-registering all volumes with the last volume of the second functional run (target volume). Third, a map of estimated magnetic susceptibility-induced distortions was computed by combining the target volume with the acquired correction volume and then used to reduce distortions in all functional images. Fourth, images were co-registered with the individual's structural images and spatially normalized to stereotactic Talairach space. Fifth, slow signal fluctuations were removed by temporal high-pass filtering (cut-off: 3 cycles/run) and the signal level was normalized to its time average. All aforementioned preparatory steps were applied on a voxel-by-voxel basis separately for each run and region of interest (ROI).
A supplementary analysis was applied to behaviorally weighted FMRI data. Data preparation involved extracting the FMRI-signal level from individual trials and normalizing it to its time average. Then, the weighted average of the resulting z-normalized single-trial FMRI signal was computed across trials, separately for L-task and H-task trials, with the weight of the ith trial depending on e_i, where e_i corresponds to the absolute magnitude of the behavioral error (a value between 0 and 3; see section Behavioral data analysis) and x_i to the z-normalized single-trial FMRI signal observed on the ith trial. Thus, more weight was given to trials with better performance.

Anatomical data processing
Structural images were resampled to 0.5 × 0.5 × 0.5 mm³ resolution, i.e., to one third of the voxel size of the functional images. The resulting volume was spatially normalized to Talairach space. Gray-white matter borders were extracted, from which individual cortical surface meshes were reconstructed. All FMRI data analyses were conducted in volumetric space. Only for visualization purposes, functional maps were sampled using trilinear interpolation onto inflated versions of the reconstructed cortical-surface meshes.

Definition of functional-anatomical regions of interest
Individual ROIs were defined based on participants' individual functional brain anatomy. This approach takes into account inter-individual functional-anatomical variations and thereby improves inter-subject alignment.
First, areal borders were estimated using common macroanatomic criteria. The putative borders of PAC were manually defined on the inflated cortical surface meshes, based on Heschl's gyrus (HG) and possible duplications of it. The anterior border was marked by the most anterior transverse temporal sulcus. For cerebral hemispheres containing a single HG, PAC was defined as the medial two-third portion of that HG as suggested by ex vivo microanatomic findings (Rivier and Clarke, 1997; Hackett et al., 2001; Morosan et al., 2001; Wallace et al., 2002; Sweet et al., 2005). For all other hemispheres, the posterior border of PAC was extended to include the medial two-third portion of Heschl's sulcus and, in the case of a full HG duplication, the medial two-third portion of the second HG (Hackett et al., 2001; Sweet et al., 2005; Da Costa et al., 2011). For each of the resulting cortical surface patches, a volumetric region was extracted by sampling −1 and 3 mm along the normal of each vertex in the direction of white matter and gray matter, respectively. IC was defined from anatomical images in volumetric space using a segmentation procedure based on a region-growing approach.
Second, from each anatomically defined region, only voxels reliably exhibiting sound-evoked activation were retained, which was defined as a significantly higher BOLD response in the No-task condition (pooled across mapping-frequency conditions) vs. silent baseline (uncorrected P < 0.05).
Third, each of the resulting functional-anatomical regions was further sub-divided into eight frequency channels. Frequency channels were identified by selecting voxels with matching best frequency (BF). Voxels' BF was defined as the CF of the mapping band that evoked the highest BOLD response in the No-task condition.
Finally, each identified region was merged with its counterpart in the contralateral brain hemisphere. Subsequent analyses were applied separately to each (cortical and subcortical) bilateral ROI and each frequency channel.
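The channel definition in the third step above amounts to an argmax over mapping frequencies followed by grouping voxels that share the same BF. A minimal sketch with made-up response values follows; the function names and voxel identifiers are hypothetical and do not come from the authors' analysis code.

```python
from collections import defaultdict

# Mapping-band center frequencies in kHz, as listed in the text.
MAPPING_CFS_KHZ = [0.4, 0.7, 1.0, 1.5, 2.1, 3.0, 4.3, 6.0]


def best_frequency(no_task_responses):
    """Return the mapping CF (kHz) that evoked the largest No-task response."""
    if len(no_task_responses) != len(MAPPING_CFS_KHZ):
        raise ValueError("expected one response per mapping frequency")
    best_index = max(range(len(no_task_responses)),
                     key=lambda i: no_task_responses[i])
    return MAPPING_CFS_KHZ[best_index]


def group_into_channels(voxel_responses):
    """Map each BF to the list of voxel ids that share this BF."""
    channels = defaultdict(list)
    for voxel_id, responses in voxel_responses.items():
        channels[best_frequency(responses)].append(voxel_id)
    return dict(channels)


# Made-up No-task responses for three hypothetical voxels.
example_voxels = {
    "v1": [0.9, 0.5, 0.3, 0.2, 0.1, 0.1, 0.0, 0.0],
    "v2": [0.1, 0.2, 0.4, 0.8, 0.5, 0.2, 0.1, 0.0],
    "v3": [0.8, 0.6, 0.2, 0.1, 0.1, 0.0, 0.0, 0.0],
}
channels = group_into_channels(example_voxels)
print(channels)
```

In this toy example, voxels v1 and v3 fall into the 0.4-kHz channel and v2 into the 1.5-kHz channel.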

Statistical analysis
The study used a balanced, full factorial experiment (three factors: eight mapping frequencies × three tasks × eight frequency channels) and a within-subject design. All measures described below were obtained separately for each participant and then submitted to second-level (random-effects) group analyses using parametric statistical tests (ANOVAs and paired t-tests) and a significance criterion α = 0.05. Type-I error probabilities inflated by multiple comparisons were corrected where indicated by controlling the false-discovery rate (Benjamini and Hochberg, 1995).
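For readers unfamiliar with false-discovery-rate control, the Benjamini-Hochberg step-up procedure can be sketched as follows. This is a generic illustration with made-up p-values, not the authors' code; the function name and the example values are assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the null hypothesis is
    rejected while controlling the false-discovery rate at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_rank = rank
    # Reject all hypotheses up to and including that rank.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Made-up p-values from eight hypothetical tests.
p_example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
decisions = benjamini_hochberg(p_example)
print(decisions)
```

With these example values only the two smallest p-values survive correction, although five of them fall below the uncorrected criterion of 0.05.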

Behavioral data analysis
Two measures of behavioral performance in the active tasks were analyzed. Accuracy was defined as the proportion of trials on which the participant gave a correct response. Error magnitude was assessed per trial by subtracting the actual number of target gaps from the participant's reported number of gaps; thus, positive and negative values indicate an over- and underestimation of the number of target gaps, respectively.
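The two behavioral measures can be written out directly from their definitions. The sketch below uses made-up responses from five hypothetical trials; the function names are illustrative.

```python
def accuracy(reported, actual):
    """Proportion of trials on which the reported gap count was correct."""
    n_correct = sum(r == a for r, a in zip(reported, actual))
    return n_correct / len(actual)


def error_magnitudes(reported, actual):
    """Signed per-trial error: reported minus actual number of gaps.

    Positive values indicate overestimation, negative values underestimation.
    """
    return [r - a for r, a in zip(reported, actual)]


# Made-up gap counts for five hypothetical trials.
reported_gaps = [2, 1, 0, 3, 1]
actual_gaps = [2, 2, 0, 2, 1]

print(accuracy(reported_gaps, actual_gaps))          # prints: 0.6
print(error_magnitudes(reported_gaps, actual_gaps))  # prints: [0, -1, 0, 1, 0]
```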

FMRI data analysis
A general linear model (GLM) including 24 regressors (coding for eight mapping frequencies × three tasks) was applied to the preprocessed functional data to estimate each voxel's average BOLD response in each experimental condition. Subsequent analyses were applied to the fitted GLM parameters (beta values), except for behaviorally-weighted FMRI data analysis (see section Functional data processing). FMRI data quality was checked based on two functional maps. First, sound-evoked activations were localized by contrasting the No-task condition vs. silent baseline. Second, tonotopic arrangement of frequency channels was visualized by color-coding each voxel according to its BF (see Definition of functional-anatomical regions of interest). The subsequent main analyses were applied separately to each ROI (PAC or IC) and each frequency channel within that ROI. For these ROI analyses, voxels exhibiting the same BF were pooled to yield a dataset comprising 192 averaged beta values (eight mapping frequencies × three tasks × eight frequency channels) per ROI.

Frequency-tuning analysis
Frequency tuning was assessed based on the frequency response, which we defined as the beta value as a function of mapping frequency. From this response, we further defined BF as the frequency that exhibited the maximum (i.e., the CF of the mapping band that evoked the highest BOLD response, as described already above). Each of these measures was obtained separately for each task condition in order to extract a 'passive' frequency response (No-task) and two 'active' frequency responses (L-task and H-task) as well as a 'passive' BF and two 'active' BFs. For the active tasks, also a 'worst frequency' (WF) was extracted, which we defined analogously to BF, but with the difference that this frequency evoked the minimum response. Effects of frequency-specific attention, which for simplicity we refer to as 'attentional modulation', were assessed by contrasting L-task vs. H-task. Thus, positive and negative values of this measure indicate relative enhancement induced by focusing attention on the low- and high-frequency target, respectively.
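The measures defined above (BF, WF, and the L-task vs. H-task contrast) can be illustrated on a single frequency channel. The sketch below uses made-up beta values chosen so that the two active responses differ only by a constant offset, i.e., a pure gain change without retuning; the numbers and function names are hypothetical.

```python
# Mapping-band center frequencies in kHz, as listed in the text.
MAPPING_CFS_KHZ = [0.4, 0.7, 1.0, 1.5, 2.1, 3.0, 4.3, 6.0]


def best_and_worst_frequency(response):
    """Return (BF, WF): the CFs evoking the maximum and minimum beta value."""
    indices = range(len(response))
    bf = MAPPING_CFS_KHZ[max(indices, key=lambda i: response[i])]
    wf = MAPPING_CFS_KHZ[min(indices, key=lambda i: response[i])]
    return bf, wf


def attentional_modulation(l_task, h_task):
    """L-task minus H-task beta value, per mapping frequency."""
    return [l - h for l, h in zip(l_task, h_task)]


# Made-up frequency responses of one hypothetical low-frequency channel.
l_task_response = [1.2, 0.9, 0.6, 0.4, 0.3, 0.2, 0.2, 0.1]
h_task_response = [1.0, 0.7, 0.4, 0.2, 0.1, 0.0, 0.0, -0.1]

modulation = attentional_modulation(l_task_response, h_task_response)
print(best_and_worst_frequency(l_task_response))
print(best_and_worst_frequency(h_task_response))
print([round(m, 2) for m in modulation])
```

Here the modulation is positive and approximately constant across mapping frequencies while BF and WF are identical in both tasks, which is what a gain change without a change in tuning profile would look like in these measures.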

Multi-voxel pattern analysis
Frequency-specific attentional modulation patterns were assessed using multi-voxel pattern analysis (MVPA). First, the silent baseline (average across all null trials) was subtracted from the z-normalized, unweighted single-trial FMRI signal for each trial. Second, on each iteration of a leave-one-run-out procedure, single trials from L-task and H-task conditions from all but one experimental run were used for classifier training, i.e., for estimating the function that mapped the measured BOLD response patterns to the corresponding frequency-specific attention conditions. This was done within each ROI (i.e., no searchlight approach). Third, the trials from the omitted run were used for subsequent testing, i.e., to assess the generalization performance of the trained classifier. The omitted run varied across iterations. Fourth, classification accuracy was computed as the percentage of correctly classified test trials across all iterations. Finally, the chance level was estimated empirically as the median of 500 classification accuracy values obtained iteratively using the same procedure as above, but after permuting the task labels (L vs. H) on each iteration. A non-linear classifier (nearest-neighbor model, Statistics and Machine Learning Toolbox in Matlab) was used, which is especially suited for datasets comprising a small number of features, such as IC (Duda et al., 2000). The classifier's only free parameter (the number of neighbors) was optimized on each iteration of the training step using three-fold cross-validation to minimize prediction error.
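The cross-validation and permutation logic described above can be sketched as follows. This is an illustrative pure-Python re-implementation, not the Matlab code used in the study: the function names are hypothetical, the number of neighbors k is fixed rather than optimized by inner three-fold cross-validation, and distances are assumed to be Euclidean.

```python
import random
from statistics import median


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k training trials closest to x (squared Euclidean distance)."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    votes = [train_y[i] for i in order[:k]]
    return max(sorted(set(votes)), key=votes.count)


def loro_accuracy(patterns, labels, runs, k=1):
    """Leave-one-run-out accuracy (%) across all held-out trials."""
    correct = 0
    for held_out in sorted(set(runs)):
        train = [i for i, r in enumerate(runs) if r != held_out]
        test = [i for i, r in enumerate(runs) if r == held_out]
        train_x = [patterns[i] for i in train]
        train_y = [labels[i] for i in train]
        correct += sum(
            knn_predict(train_x, train_y, patterns[i], k) == labels[i] for i in test
        )
    return 100.0 * correct / len(labels)


def empirical_chance(patterns, labels, runs, k=1, n_perm=500, seed=0):
    """Median accuracy (%) after permuting the task labels on each iteration."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        accuracies.append(loro_accuracy(patterns, shuffled, runs, k))
    return median(accuracies)
```

Here, patterns holds one baseline-corrected voxel pattern per trial, labels the attended target ('L' or 'H'), and runs the run index of each trial. Comparing the observed accuracy with the permutation-based median mirrors the comparison against the empirical chance level.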

Supplementary IC experiment
The main experiment was primarily designed to address our first research question, i.e., to assess frequency responses in PAC and detect their attentional modulation. Following up on our observation of comparatively weak attentional modulation in IC (see Results), we conducted a supplementary experiment with auditory-stimulus and FMRI settings adjusted to better fit our second research question (i.e., to detect subcortical attentional modulation).
Five normally-hearing, trained listeners participated in the supplementary experiment (one female, ages: 23-39 years; including two of the ten participants from the main experiment). Methods largely matched those applied in the main experiment, except for the following differences: Auditory stimuli were composed of two alternating, individually loudness-matched frequency bands within the pitch range, i.e., with a CF of 0.4 kHz (L-target) and 3.2 kHz (H-target) respectively. Each band comprised a common melody composed of a sequence of 29 pure-tone pairs. Overall, tone pairs spanned a range of seven semitones and pulsated at a fixed rate of 5.3 Hz (Riecke et al., 2017). Within each pair, tones resembled the same note and pulsated at 27 Hz. Tone-pair omissions (gaps) were inserted in each band as described above. An exemplary stimulus is provided as audio file in the Supplementary Material. As before, these multi-band stimuli were presented in L-target and H-target conditions to enable extracting attentional modulation. The experimental design further included null trials and passive conditions during which only one of the two bands was presented in isolation; these conditions served to functionally define IC. In total, 60 trials were presented per experimental condition. Functional imaging data were acquired with a simultaneous-multi-slice gradient-echo EPI sequence (Setsompop et al., 2012) at a resolution of 1.1 × 1.1 × 1.1 mm³ (field of view: 201 × 201 mm², 50 slices, echo time: 21 ms, repetition time: 1500 ms, readout bandwidth: 1446 Hz/pixel, in-plane GRAPPA acceleration factor: 3, slice acceleration factor: 2, nominal flip angle: 50°) according to the sparse protocol described above. Structural images were acquired with an MP2RAGE sequence at 0.55 × 0.55 × 0.55 mm³ resolution.
Compared with the main experiment, the target bands in this supplementary experiment were expected to enhance attentional modulation in IC because of their higher intensity (due to their at least 20 dB higher sound level and the absence of any simultaneous band), higher modulation rate, and higher bandwidth, all factors that have been shown to promote IC activation (Harms and Melcher, 2002;Hawley et al., 2005;Sigalovsky and Melcher, 2006;Overath et al., 2012). Moreover, the absence of any acoustic stimulus variation (i.e., no simultaneous mapping-frequency manipulation) and the increased spatial resolution of imaging were expected to facilitate the detection of finer-grained attentional modulation patterns.

Behavioral results
Listeners' behavioral performance in the FMRI session was accurate and stable across conditions. On average, they slightly underestimated/overestimated the number of gaps in the L-task/H-task, respectively (Fig. 2A), but the magnitude of these errors did not differ significantly from zero for any attention task (all P > 0.22) or across attention tasks (P = 0.19). Listeners on average responded correctly on 90.6% of the L-task trials and on 91.4% of the H-task trials, with no significant difference between these tasks in the FMRI session (F(1,63) = 0.18, P = 0.68) or in the training session (F(1,63) = 0.03, P = 0.88); see Fig. 2B. Accuracy in the L-task and H-task was generally better than in the '?-task' (all P < 10⁻⁴); see Fig. 2B. Accuracy further differed across mapping frequencies in the H-task (F(7,63) = 3.46, P = 0.003), but not in the L-task (F(7,63) = 1.11, P = 0.37); see Fig. 2C. The interaction of these factors was not significant (no mapping-frequency × target-frequency interaction: F(7,63) = 1.71, P = 0.12); see Fig. 2D. The observation of no significant task difference or interaction indicates that task difficulty was similar across the different conditions. The significant difference with the '?-task' further suggests that listeners focused reliably on the cued target during the active tasks inside the FMRI scanner, indicating that our experimental manipulation of frequency-specific attention was successful.
In the supplementary IC experiment, behavioral accuracies were 96.9% (L-task) and 97.0% (H-task), without significant difference between these tasks as above (t(4) = 0.01, P = 0.99; data not shown).

Fig. 2. A. The bar plot shows error magnitude, defined as the number of miscounted target gaps per trial, for each active task (L-task: red, H-task: blue) pooled across mapping frequencies in the FMRI session. Positive/negative values represent overestimation/underestimation of the number of target gaps. B. The plot shows accuracy, defined here as the proportion of error-free trials, for each active task pooled across mapping frequencies, and for each session (FMRI: dark colors, training: light colors). The horizontal line at the bottom represents accuracy in the '?-task' during the training session (shaded area: sem). C. Accuracy is shown as a function of mapping frequency, for each active task in the FMRI session. D. The plot shows the difference between L-task vs. H-task (attentional modulation; derived from panel C). E. Single-subject data are shown analogously to the group data in panel B (upper row: accuracy in FMRI session) and panel D (lower row: attentional modulation as a function of mapping frequency). The group data (panels A-D) show mean ± sem across participants. The single-subject data (panel E, upper row) show mean ± sem across mapping frequencies. n.s. non-significant, *P < 0.05. The horizontal axis in panels B and C was slightly shifted for each data series to improve visibility.

FMRI results
Functional-anatomical definition of PAC and PAC frequency channels
Fig. 3A visualizes the location of PAC and its frequency channels, defined from functional-anatomical criteria, for each participant. Auditory stimuli in the passive task reliably activated most of the anatomically-defined PAC compared with silent null trials. These passively sound-activated PAC regions further exhibited spatial variations in BF that roughly agreed with tonotopic gradients (for reviews, see (Moerel et al., 2014;Saenz and Langers, 2014)), despite the fact that target bands were present during the BF mapping.
Fig. 3B shows the response characteristics of the functional-anatomical PAC during the passive task, averaged across participants. Frequency channels within PAC did not vary systematically with regard to their passive overall auditory-evoked response or number of voxels (no main effect of frequency channel in No-task condition on the BOLD response [F(7,63) = 0.39, P = 0.91; Fig. 3B left] or the number of voxels [F(7,63) = 1.81, P = 0.10]). Similarly, the passive auditory-evoked response of the overall PAC varied only slightly across mapping-frequency conditions, exhibiting a weak skewness in PAC's passive frequency response (no main effect of mapping frequency in No-task condition: […] activation levels across tonotopic PAC (Fig. 3B left) and that overall PAC was approximately equally sensitive to these different bands (Fig. 3B right). Fig. 3C shows PAC's passive frequency response separately for each frequency channel. While the observation of a marked single peak in each channel is trivial (because the channel had been defined based on this prior observation), a non-trivial observation is that the frequency response decreased near-monotonically beyond this channel BF, revealing band-pass frequency filtering in human PAC in the presence of the target bands. Fig. 3D-E are analogous to Fig. 3B-C, but show data separately for each active task (L-task and H-task). Fig. 3F-G shows the difference between these tasks (L-task minus H-task) to visualize our main measure: attentional modulation.

Fig. 3. […] Supplementary Fig. 1. B. The plots show the average activation level of PAC evoked by passive auditory stimulation. On the left, this measure is plotted as a function of frequency channel's BF to illustrate spatial variations in passive sound-evoked activation across PAC frequency channels. On the right, it is plotted as a function of mapping frequency to visualize overall PAC's passive frequency response. Numbers in the subtitle denote the overall volume of PAC. C. The plots show frequency channels' passive frequency responses (i.e., data from panel B right, but stratified for channels). The title shows the passive BF of the channel and the subtitle denotes the channel's ROI volume. The activation level at BF is biased due to the averaging of data previously used for defining the channel. D-E. Analogous to panels B-C, except that data stem from L-task (red) and H-task (blue). Arrows highlight channels' active response at their passive BF (i.e., in the active mapping-frequency condition that matched the channel's passive BF). Unlike panel C, the activation level at BF here is not biased by prior channel definition. F-G. Analogous to panels D-E, except that the difference between the L-task vs H-task is shown to visualize effects of frequency-specific attention (attentional modulation). Data in panels B-G show mean ± sem across participants. Frequencies are expressed in units of kHz. BOLD response strength is expressed as beta value (arbitrary units). n.s. non-significant, *,*** corrected P < 0.05, 0.0005. The horizontal axis in panels D and E was slightly shifted for each data series to improve visibility.

Effects of frequency-specific attention on PAC
As predicted by the 'attentional spotlight' model, PAC's spatial activation profile varied across L-task vs. H-task. Frequency-specific attention altered the relative overall response strength of PAC frequency channels with remarkably high reliability across participants (frequency-channel × target-frequency interaction: F(7,63) = 10.0, P = 10⁻⁷); see Fig. 3F left ('Channel BF'). This spatially-selective attentional enhancement exhibited a gradual change in sign across frequency channels, indicating qualitatively a relative overall response enhancement in low- and high-frequency channels by low- and high-frequency attention, respectively.
In contradiction with the original 'attentional retuning' hypothesis, frequency-specific attention did not reliably alter or 'reshape' the frequency response of the overall PAC; see Fig. 3F right ('Mapping freq.'): both the height and the shape of this response remained essentially unchanged during L-task vs. H-task (no main effect of target frequency: F(1,9) = 1.43, P = 0.26; no mapping-frequency × target-frequency interaction: F(7,63) = 0.99, P = 0.45). Similar observations were made when comparing each active task with the passive task (no main effect of task; L-task: F(1,9) = 1.39, P = 0.27, H-task: F(1,9) = 0.54, P = 0.48; no mapping-frequency × task interaction; L-task: F(7,63) = 0.57, P = 0.78, H-task: F(7,63) = 0.74, P = 0.64), which was done for a purely anatomically defined PAC to avoid circularity in the analysis. This latter observation is in line with previous human FMRI studies comparing auditory attention with passive stimulation (Paltoglou et al., 2009) or visual attention (Petkov et al., 2004;Woods et al., 2009).
In sum, these results confirm the existence of the tonotopic attentional spotlight in human PAC (Fig. 3F left), as observed in previous studies that did not investigate frequency responses (Paltoglou et al., 2009;Da Costa et al., 2013;Oh et al., 2013;Riecke et al., 2017). However, they do not provide FMRI evidence for an attention-induced reshaping of frequency tuning in human PAC (Fig. 3F right).

Effects of frequency-specific attention on frequency channels within PAC
We did not statistically compare the frequency response of each frequency channel between active vs. passive tasks (Fritz et al., 2003;Atiani et al., 2009), because this comparison would be biased by the fact that we had already defined the channels based on the passive-condition data (Fig. 3C). Instead, we focused on the frequency response during the active tasks; see Fig. 3E: this 'active' frequency response varied significantly across frequency channels (mapping-frequency × frequency-channel interaction: F(49,441) = 15.3, P = 10⁻¹⁶) in a similar way as the passive frequency response (Fig. 3C), an observation that we elaborate on in the next section.
As predicted by the 'spotlight' model and the whole-PAC results reported above, frequency-specific attention altered the average response of PAC frequency channels: overall response-gain modulation reached or approached statistical significance specifically in the three lowest frequency channels (0.4-1.0 kHz, post-hoc tests: t(9) > 2.26, corrected P < 0.07; see Fig. 3F left), revealing a slight asymmetry along the tonotopic axis. In further contradiction with the 'retuning' hypothesis, frequency-specific attention did not appreciably alter the shape of the frequency response of any frequency channel (no mapping-frequency × target-frequency interaction: all F(7,63) < 0.55, corrected P > 0.7; see Fig. 3G).
These observations indicate that the observed tonotopic modulation pattern in PAC (Fig. 3F left) emerged primarily from modulation of frequency channels' overall response gain rather than modulation of their frequency tuning (Fig. 3G) as measured with FMRI. Put differently, frequency-specific attention modulated the overall responsiveness of PAC frequency channels and this modulation depended primarily on the channel's characteristic frequency (i.e., its tonotopic location), not the channel's specific input (i.e., the mapping frequency).

Characteristics of frequency-specific attentional modulation in PAC
To further characterize the observed attentional modulation of response gain of PAC frequency channels, we replotted these channels' responses to a specific mapping frequency (i.e., the frequency nearest to the target band, the frequency farthest away from the target band, or the channels' respective BF) as a function of the channel's passive BF. Fig. 4A shows that attentional modulation of channels' response to the lowest mapping frequency (nearest to the L-target) varied across channels in a slightly asymmetric tonotopic modulation pattern that was similar to the pattern observed for the overall channel response (cf. Fig. 3F left). As shown in Fig. 4B, attentional modulation of channels' response to the highest mapping frequency (nearest to the H-target) also exhibited such an across-channel pattern, although with a shallower slope. The latter resulted in exclusively positive values, indicating that the observed modulation was strongly driven by attention to the L-target. Fig. 4C shows that attentional modulation of channels' BF-response (i.e., the channel's response in the mapping-frequency condition that matched the channel's passive BF) exhibited an across-channel pattern that more closely resembled the (non-significant) frequency response shape changes observed in overall PAC (cf. Fig. 3F right; average correlation coefficient: 0.77 ± 0.04).
Together, these observations indicate that the tonotopic response-gain modulation observed in the overall PAC emerged from modulation of frequency channels' responsiveness to a range of frequencies (including the lowest and highest ones), not specifically the BF or the frequency nearest to the target. This observed lack of input-frequency specificity corroborates the above notion that attentional modulation of frequency channels in PAC depends primarily on the channel's characteristic frequency ('spotlight' model), not its specific input (original 'retuning' hypothesis).
Fig. 4. The plots show frequency-specific attentional modulation (L-task vs H-task) of frequency channels' response to a specific mapping frequency (shown in the title), as a function of the channel's passive BF (i.e., per frequency channel). In panels A, B, and C, the mapping frequency corresponds respectively to the lowest mapping frequency (0.4 kHz, nearest to the L-target), the highest mapping frequency (6.0 kHz, nearest to the H-target), and the channel's respective BF. D. The plots show frequency channels' active BF during the L-task (red) and H-task (blue), as a function of the channel's passive BF (i.e., per frequency channel). E. Analogous to panel D, except that the difference between the L-task vs H-task is shown to visualize attentional modulation of channels' active BF. F-G. Analogous to panels D-E, except that data refer to frequency channels' active worst frequency (WF), as a function of the channel's passive BF. All data show mean ± sem across participants. Frequencies are expressed in units of kHz. n.s. non-significant, ***P < 0.0005. The horizontal axis in panels D and F was slightly shifted for each data series to improve visibility.
In another attempt to detect the originally hypothesized attentional retuning, we assessed effects of frequency-specific attention on frequency channels' BF and WF (see Materials and Methods). As suggested by our channel frequency-response results (see preceding section), the active BF, which we defined from the maximum of the active overall frequency response (Fig. 3E), increased significantly and monotonically across frequency channels (main effect of frequency channel: F(7,9) = 15.0, P = 10⁻¹⁰), exhibiting a pattern resembling the passive channel BFs; see Figs. 4D and 3A vs. Supplementary Fig. 1. More importantly, and again in contradiction with the 'retuning' hypothesis, frequency-specific attention did not modulate this pattern (no target-frequency × frequency-channel interaction: F(7,63) = 0.92, P = 0.50; see Fig. 4E) or the active BF itself (no main effect of target frequency: F(1,9) = 1.58, P = 0.24). Similar outcomes were obtained for the WF; see Fig. 4F-G (main effect of frequency channel: F(7,9) = 11.9, P = 10⁻⁸; no target-frequency × frequency-channel interaction: F(7,63) = 1.59, P = 0.15; no main effect of target frequency: F(1,9) = 0.004, P = 0.95).
Together, these results show that PAC frequency channels' most and least characteristic frequencies remained relatively stable under frequency-specific attention. This indicates that these channels were reproducible across task conditions and again provides no FMRI evidence for attentional modulation of frequency tuning in human PAC.
Anatomical definition of IC and IC-frequency channels
Fig. 5 shows results from the main experiment for IC, in a similar format as the PAC results in Fig. 3. The anatomically defined IC exhibited spatial variations in BF (De Martino et al., 2013); see Fig. 5A. However, in some participants, this region was not reliably activated in every mapping-frequency condition, hampering the definition of a complete set of frequency channels. To circumvent this problem, which affected only our analysis of channels' frequency responses (not MVPA; see next section), we defined IC based on anatomical criteria alone (i.e., without prior functional voxel selection) for this specific analysis. The anatomically-defined IC exhibited passive sound-evoked responses reliably across participants (Fig. 5B). These responses were significantly above the silent baseline, yet much weaker than in PAC. Similar to PAC, these passive sound stimulation-evoked activation levels were relatively stable across frequency channels in IC (no main effect of frequency channel in No-task: F(7,63) = 1.25, P = 0.29; Fig. 5B left) and the passive frequency response was nearly flat (no main effect of mapping frequency in No-task: F(7,63) = 0.82, P = 0.58; Fig. 5B right). The passive frequency response of IC-frequency channels to our auditory stimulus set is illustrated in Fig. 5C.

Frequency-specific attentional modulation of spatially-distributed activation patterns in IC
The overall response of IC was slightly stronger in active vs. passive tasks, but this difference was not statistically significant (t(9) = 0.30, P = 0.38). Applying the previous analyses for frequency-specific attention effects (see sections 3.2.2 and 3.2.3) to the IC dataset yielded no significant result (all P > 0.05); see Fig. 5D-G. Thus, our univariate analyses did not reveal any indication of a tonotopic spotlight or frequency retuning in human IC.
An alternative approach for assessing frequency-specific attentional modulation within a given ROI is based on the multivariate analysis of spatially distributed response patterns. In contrast to conventional univariate (voxel-by-voxel) analyses, this MVPA approach does not rely on the definition of a BF or frequency channel but takes into account intervoxel dependencies, making it sensitive to finer-grained attentional modulation patterns. Fig. 6A shows results obtained from applying MVPA to the functional-anatomically defined IC in our main experiment. The trained classifier could decode the listener's focus of attention from the BOLD response patterns in IC measured during the active tasks with limited success: permutation testing revealed a significant outcome for one listener (P6: P < 0.014) and on average, the observed classification accuracy was higher than chance level (52.3 ± 1.4% vs. 50.0 ± 0.06%), a small difference that was not statistically significant (t(9) = 1.61, P = 0.071). Fig. 6B shows matching results obtained analogously from the supplementary IC experiment (52.7 ± 2.8% vs. 49.9 ± 1.6%, t(4) = 0.95, P = 0.20). Pooled analysis of both datasets revealed that classification accuracy was significantly above chance (52.6 ± 1.4% vs. 50.0 ± 0.06%, t(12) = 1.88, P = 0.042). These data suggest that spatially-distributed patterns of frequency-specific attentional modulation may be observed with FMRI in human IC.
Overall, our observations in PAC and IC were stable across various analyses: the conclusions that can be drawn from our study did not change when applying the ROI analyses separately to each cerebral hemisphere (all P > 0.05), when applying behavioral weights to the single-trial FMRI signal level (see Functional data processing), when normalizing the active frequency response to the passive frequency response (e.g. (Fritz et al., 2003)), when applying more liberal functional-anatomical PAC definition criteria (e.g., larger anatomical masks or no functional mask), or when applying an alternative, regression-based frequency-response analysis that aims to separate gain from shape changes (for details, see (Atiani et al., 2009)).

Discussion
Our results demonstrate that the spotlight of frequency-specific attention, as measured with FMRI, emerges primarily from modulation of overall response gain, rather than from frequency retuning, of frequency channels in human PAC. They further suggest that frequency-specific attention may alter afferent sound processing in human IC.

Frequency-specific attentional modulation of response gain in human PAC
Our analysis of spatial activation profiles confirmed that shifts in the focus of frequency-specific attention selectively alter spatial (tonotopic) activation patterns in human PAC, such that voxels with characteristic frequencies near the attended frequency are enhanced relative to voxels with characteristic frequencies far away from the attended frequency. This result confirms previous findings of the tonotopic attentional spotlight in PAC by human FMRI studies that did not specifically assess frequency responses (Paltoglou et al., 2009;Da Costa et al., 2013;Oh et al., 2013;Riecke et al., 2017). It further reveals that this phenomenon can be observed reliably even during simultaneous frequency mapping. Crucially, this latter aspect of our study enabled us to investigate frequency responses. This revealed that human PAC achieves the attentional spotlight by applying an overall gain to the outputs of specific frequency channels, i.e., a response modulation that depends on the channel's characteristic frequency, not the channel's specific input. The relative 'flatness' of this response gain implies that the originally hypothesized frequency retuning (i.e., reshaping of the frequency response induced by, e.g., shifts in channels' most or least characteristic frequencies or sharpening/widening of the tuning profile) seems to contribute relatively little to the attentional spotlight in human PAC, although this interpretation requires some cautionary remarks that we make in the next section.
Animal electrophysiological studies have indicated that gain modulation in PAC neurons can arise from synaptic depression (Rabinowitz et al., 2011) and temporal shifts in excitability (Lakatos et al., 2013;Reig et al., 2015) especially in lower PAC layers (O'Connell et al., 2014), and may be triggered by top-down signals originating in frontal cortex (Fritz et al., 2010). These mechanisms probably also contributed to our human PAC findings, which may be verified in future studies using invasive single-unit recordings in neurosurgical patients (e.g. (Howard et al., 2000;Mukamel et al., 2005;Bitterman et al., 2008)).
Our observation that gain modulation is strongest in frequency channels with low characteristic frequencies (0.4-1.0 kHz; Fig. 3F left) and strongly driven by selective attention to low frequencies (Fig. 4A-B) suggests that frequency-specific attentional modulation is asymmetric across the tonotopic axis of PAC. Although frequency discrimination worsens at high frequencies (Sek and Moore, 1995), the observed asymmetry cannot be attributed to our individualized auditory stimuli, the task difficulty, or the sensitivity of the passive PAC in our study because none of these aspects varied appreciably across the attention tasks or the tonotopic axis. Instead, it might reflect a contribution from what we refer to as 'harmonic attentional modulation'. More specifically, frequency-specific attention possibly also modulated neuronal populations tuned to the higher harmonics of the attended frequency in human PAC (Fritz et al., 2007b;Borra et al., 2013). Because our frequency-mapping range covered harmonics of only the L-target (0.4, 0.6, … kHz), but not of the H-target (… 16.6, … kHz), such putative harmonic modulations could only be observable in our L-task.

Fig. 6. Classification results. A. MVPA results in IC from the main experiment. The bar plot on the left shows the average activation level of functional-anatomically defined IC, for each active task (L-task: red, H-task: blue) pooled across mapping frequencies. The title denotes the average number of voxels in the ROI. Data show mean ± sem across participants. The plot in the middle shows classification accuracy for each participant (circles), i.e., how well the trained classifier could decode the listener's focus of attention from the BOLD response patterns measured in IC during the active tasks. The gray horizontal line illustrates the empirically derived chance level (shaded area: SD across permutations). *P < 0.05 (permutation testing). The plot on the right shows classification accuracy averaged across participants, with the gray line illustrating the average chance level (line width: sem across participants). B. Analogous to panel A, but for data from the supplementary experiment. C-D. Analogous to panels A-B, but for data from PAC for reference.
To explore this idea, we applied our BF-channel analyses to peripheral spike activity evoked by our auditory stimuli, which we estimated with a computational model of the peripheral auditory system (Meddis et al., 2013) incorporating eight frequency channels (centered on our mapping frequencies) and a multiplicative harmonic attentional gain (i.e., a scaling of each channel's output by its response to the attended target alone). These simulations indeed revealed harmonic response patterns evoked by the L-target, but not the H-target, within the sampled range of the tonotopic axis. In particular, applying the same analysis for frequency-specific attentional modulation (see FMRI results in section 3.2.2) showed a low-frequency bias across this tonotopic range (Fig. 7). This indicates that harmonic attentional modulation can in principle account for part of the asymmetry in our PAC results.
Fig. 7. Analogously to the FMRI results shown in Fig. 3F, the plot shows attentional modulation of peripheral auditory responses as a function of frequency channel (dark magenta, scaled to match the scale of the FMRI results). Spike rates were estimated using an eight-channel computational model of the peripheral auditory system. Harmonic attentional gain was modeled by multiplying each channel's output by its response to the attended target stream alone. The same BF-channel analysis as for the FMRI data was applied to the spike rate data. The resulting spatial pattern of attentional modulation shows a slight asymmetry across the tonotopic axis: the majority of frequency channels, specifically those up to 3 kHz, show stronger enhancement during low- vs. high-frequency attention. The FMRI results are plotted in brighter color for reference. The horizontal axis was slightly shifted for each data series to improve visibility.
Detailed observations of attentional modulation of frequency responses in auditory cortex have been limited so far mostly to animals (see Introduction). In humans, MEG studies have provided rather indirect and coarse insights into this topic (Kauramaki et al., 2007;Okamoto et al., 2007, 2009;Ahveninen et al., 2011). More detailed invasive electrocorticography studies have associated modulation of spectrotemporal tuning in human auditory cortex with selective listening to speech (Mesgarani and Chang, 2012;Holdgraf et al., 2016) but did not investigate frequency-specific attention. An FMRI study of the laminar organization of frequency tuning in human PAC observed that intermodal attention alters the width of the frequency response of upper PAC layers (De Martino et al., 2015). However, attentional targets were confounded by the mapping stimuli, leaving relative contributions of the attentional spotlight, response gain, and frequency tuning unclear. Our study extends the previous findings by showing that the tonotopic attentional spotlight in human PAC, as measured with FMRI, emerges primarily from modulation of response gain rather than from frequency retuning (but see cautionary remarks in next section).
Does frequency-specific attention modulate frequency tuning in human PAC?
Our analysis of PAC frequency channels' responses to frequencies near or far from the target bands (Fig. 4A-B) revealed that attentional modulation of channels' responsiveness to non-BFs is sufficient to give rise to the attentional spotlight in human PAC. However, we found no statistical evidence for frequency retuning: the overall shape of the frequency response, as well as the most and least characteristic frequencies of frequency channels, did not change systematically across attentional states (Figs. 3G, 4E and 4G). This null result could indicate that the frequency-tuning properties of PAC remain more stable under frequency-specific attention in humans than was expected based on animal findings (Jaramillo and Zador, 2011; Massoudi et al., 2013). Indeed, this interpretation would fit with the fact that frequency tuning in PAC and electrically induced modulation thereof vary across species (Suga and Ma, 2003; Bitterman et al., 2008). However, an equally possible interpretation is that attentional frequency retuning exists in human PAC as suggested by animal models, but we were unable to detect it due to the following potential methodological limitations. Firstly, the signal-to-noise ratio in our frequency-response analysis was possibly insufficient, considering that this analysis required stratifying the data according to mapping frequencies (i.e., trials), not according to channels (i.e., voxels, as in our more successful spatial activation-profile analysis). A general lack of statistical power can be excluded because our spatial activation-profile analysis replicates several FMRI findings in PAC (Petkov et al., 2004; Paltoglou et al., 2009; Woods et al., 2009; Da Costa et al., 2013; Oh et al., 2013; Riecke et al., 2017).
Secondly, the spectral resolution of our frequency mapping was one ERB⁻¹; thus presumed finer-grained spectral modulations probably averaged out in our measures. The fact that our target frequencies fell outside the mapped frequency range is unlikely to have caused our null result, given that attentional gain modulation and frequency retuning in animals are not limited to PAC neurons with BFs equal to the target frequency, but extend to neurons tuned to nearby frequencies (Atiani et al., 2009).
Thirdly, the spatial resolution of our FMRI measurements was 1.1-1.5 mm⁻¹; thus finer-grained spatial variations, especially putative local heterogeneities in frequency tuning (Castro and Kandler, 2010), averaged out within voxels. It further cannot be excluded that our functionally-anatomically defined PAC lacked or contained a fraction of primary or non-primary tissue, respectively, as there currently exists no unequivocal definition of human primary AC in vivo (Moerel et al., 2014). However, our PAC definition is unlikely to have caused our null result, given that our results replicate several PAC findings (see above).
Finally, our measure of brain activity (the BOLD response) merges neural excitation and inhibition (Logothetis et al., 2001; Logothetis, 2008); thus their putative contributions to frequency retuning in PAC or cortical feedback to IC (Yan and Suga, 1998; Weinberger, 2004; Ohl and Scheich, 2005; Galindo-Leon et al., 2009) probably averaged out in our measures.
Therefore, attentional frequency retuning may be observed in future human studies using methods that can resolve finer-grained spectral/spatial modulations and disambiguate potential contributions of neural excitation and inhibition, e.g., invasive neuroelectric recordings. Given the aforementioned potential limitations, our findings cannot be straightforwardly compared to related animal electrophysiology findings (Fritz et al., 2003, 2005, 2007b; David et al., 2012; Slee and David, 2015). As noted above, our spectral and spatial resolutions were much coarser and we used a fundamentally different measure of neural activity (for further discussion of FMRI-electrophysiology differences, see e.g. (Mukamel et al., 2005; Langers et al., 2012; Overath et al., 2012)). Consequently, caution must be exercised when attempting to relate our macroscopic measures of frequency channels, frequency tuning, and response gain to those obtained from invasive electrophysiology. Moreover, most animal studies investigated auditory attention vs. passive stimulation, leaving it unclear to what extent the reported effects reflect task performance-induced changes in the animals' arousal. Finally, in the animal studies target and mapping stimuli were typically presented consecutively, whereas in our study they were presented simultaneously (i.e., as an auditory scene). This latter difference seems to facilitate the observation of gain modulation: the only animal study that also presented target and mapping stimuli simultaneously found prominent gain modulations in PAC, in addition to the previously observed frequency retuning (Atiani et al., 2009); this possibly reflects contributions from additional (mapping stimuli-related) neural suppression induced by streaming-related processes.

Fig. 7. Frequency-specific attentional modulation of simulated peripheral auditory responses.
Despite these differences, our human FMRI findings converge with animal electrophysiology findings showing that frequency-specific attention in auditory scenes induces frequency-specific response gain especially in 'near' neuronal populations in PAC.
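The within-voxel averaging argument made in this section can be illustrated with a toy sketch. Everything in it is an assumption chosen for illustration: Gaussian tuning curves, a tuning width of 1.5 mapping steps, eight equally spaced mapping frequencies, and two equally sized subpopulations per voxel whose BFs shift by one step in opposite directions. It is not a model of the measured data.

```python
import math


def gaussian_tuning(best_frequency, mapping_frequencies, width=1.5):
    """Hypothetical Gaussian frequency response sampled at the mapping frequencies."""
    return [
        math.exp(-((f - best_frequency) ** 2) / (2 * width ** 2))
        for f in mapping_frequencies
    ]


def voxel_response(subpopulation_bfs, mapping_frequencies, width=1.5):
    """Average the tuning curves of all subpopulations assumed to share one voxel."""
    curves = [gaussian_tuning(bf, mapping_frequencies, width) for bf in subpopulation_bfs]
    return [sum(values) / len(values) for values in zip(*curves)]


def peak_index(response):
    """Index of the mapping frequency that evokes the largest response."""
    return max(range(len(response)), key=lambda i: response[i])


if __name__ == "__main__":
    mapping = list(range(8))  # eight mapping frequencies, one step apart
    baseline = voxel_response([3.0, 3.0], mapping)
    retuned = voxel_response([2.0, 4.0], mapping)  # opposite BF shifts within the voxel
    print("voxel BF index at baseline:", peak_index(baseline))
    print("voxel BF index after retuning:", peak_index(retuned))
```

Under these assumptions the voxel-level peak stays at the same mapping frequency although both subpopulations have shifted their BF. The averaged curve does become broader, so the sketch speaks only to peak-based measures of tuning and not to measures of tuning width.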

Does frequency-specific attention modulate response patterns in human IC?
Our pattern-analysis results indicate that frequency-specific attentional modulation can be observed in the human PAC and to a much lesser extent in the human IC. The trained classifier could decode listeners' focus of frequency-specific attention from spatially-distributed, sound-evoked activation patterns, but with limited success in IC. The observed decoding accuracies were rather modest, yet pooled across the two experiments, the average accuracy was reliably above chance, which suggests that sound-evoked activation patterns in IC may carry at least some information about the focus of frequency-specific attention. Therefore, it is likely that endogenous frequency-specific attention can influence afferent sound processing in human IC, probably via top-down attentional signals fed from PAC to IC.
This tentative interpretation fits with previous human studies that have used otoacoustic emission recordings and scalp electroencephalography to observe frequency-specific attentional modulation in subcortical stages as early as the cochlea (Galbraith et al., 1998; Maison et al., 2001; Lehmann and Schonwiesner, 2014; Yamagishi et al., 2016). Our study extends these findings by attempting to specify the subcortical locus of attentional modulations in IC, which could not previously be achieved due to the limited spatial detail of the employed non-invasive compound measures of neuroelectric activity. Our study further extends animal electrophysiology findings of frequency-specific attentional modulation in IC (Slee and David, 2015) to humans. The frequency-response changes in IC observed in that animal study arose from modulation of response gain rather than from frequency retuning, and this attentional response-gain modulation was exclusively suppressive. Otherwise it was similar to that observed in animal PAC (Fritz et al., 2003; David et al., 2012).
In contrast to the aforementioned animal IC study and our PAC findings, we found no evidence for a frequency channel-specific attentional response gain in IC. Although our pattern analysis suggests that spatial patterns of frequency-specific attentional modulation may exist in human IC, the observed lack of a univariate effect in IC indicates that this presumed modulation is weaker or less frequency channel-specific than in PAC. This further suggests that the operation of the attentional spotlight diminishes along the efferent cortico-collicular pathway, an idea that remains to be validated in future research. Human IC is a small structure (average volume in our study: 303 mm³) and attentional modulation is inhomogeneous across central vs. peripheral IC nuclei in ferrets (Slee and David, 2015), which may explain why we observed only subtle attentional modulations with pattern analysis and none with univariate BF-based analyses. Moreover, as described in the previous section, finer-grained within-voxel spatial variations of attentional enhancement and suppression probably averaged out in our FMRI measurements. In particular, we could not define a complete set of reliably activated frequency channels in all participants, although we observed spatial variations in BF and activation patterns that were on average approximately flat across BF channels (Fig. 5B). This hints at a partial lack of activation, which could be mitigated in future FMRI studies by using naturalistic sounds that have already proven effective for activating tonotopic frequency channels in human IC (De Martino et al., 2013). Cardiac gating of FMRI may help to further unmask attentional modulations from cardiac-induced pulsatile motion, which is more prominent in IC than PAC (Guimaraes et al., 1998).
Our method could not correct for such motion artefacts; however, we oriented FMRI slices so as to reduce these putative artefacts and could reliably observe sound-evoked IC activation in all task conditions when pooling mapping-frequency conditions and channels for our pattern analysis (Fig. 5B-D, left).
Importantly, attentional modulation in our study was driven purely by endogenous frequency-specific attention, i.e., in the absence of physical stimulus variations. Human FMRI studies of interaural attention (Rinne et al., 2008) and perceptual switches in auditory streaming (Schadwinkel and Gutschalk, 2011) found similar top-down attentional modulation in the absence of physical stimulus variations in human IC, but based on listeners' perceived location of the auditory targets. Given the prominent role of IC in spatial processing (King et al., 2001), these spatial cues have been suggested to be particularly effective in producing attentional modulation in human IC (Rinne et al., 2008). Our study suggests that sound processing in human IC may also be modulated by endogenous attention to spectral cues alone.
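The statement in this section that decoding accuracy was above chance can be made concrete with a generic group-level check. The sketch below uses an exact one-sided sign-flip permutation test on per-participant accuracies. This is an assumption for illustration and not necessarily the test applied in the study; the accuracy values are invented, and the chance level of 0.5 assumes two balanced attention conditions.

```python
from itertools import product


def sign_flip_p_value(accuracies, chance=0.5):
    """Exact one-sided sign-flip test of mean accuracy against the chance level."""
    diffs = [accuracy - chance for accuracy in accuracies]
    observed = sum(diffs) / len(diffs)
    at_least_as_large = 0
    total = 0
    # Enumerate every way of flipping the sign of each participant's deviation.
    for signs in product((1, -1), repeat=len(diffs)):
        permuted = sum(sign * diff for sign, diff in zip(signs, diffs)) / len(diffs)
        if permuted >= observed - 1e-12:  # small tolerance for rounding error
            at_least_as_large += 1
        total += 1
    return at_least_as_large / total


if __name__ == "__main__":
    # Hypothetical per-participant decoding accuracies (proportion correct).
    example_accuracies = [0.55, 0.60, 0.52, 0.58, 0.54, 0.57, 0.53, 0.56]
    print("p =", sign_flip_p_value(example_accuracies))
```

The exhaustive enumeration grows as 2 to the power of the number of participants, so this form is practical only for small groups; larger samples would call for random sign flips instead.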

Conclusion
We demonstrated that the spotlight of frequency-specific attention in human PAC as measured with FMRI emerges primarily from tonotopic modulation of neuronal population-level response gain. Contributions from fine-grained frequency retuning in human PAC (induced by, e.g., neural excitation and inhibition) remain a possibility to be investigated with methods that enable assessing frequency tuning with higher spatial and spectral resolution. Our study further indicates that endogenous frequency-specific attention induces comparatively subtle response-pattern modulations in human IC, suggesting that the operation of the attentional spotlight may diminish from PAC toward hierarchically lower auditory processing stages. In sum, frequency-specific neuronal population responses in the human auditory pathway appear not to be fixed, but adapt constantly to meet changes in listening demands. Notably, our study reveals that frequency tuning in human PAC can be reliably assessed even during simultaneous presentation of task-relevant, frequency mapping-unrelated auditory stimuli, a methodological advancement that opens up new possibilities for future FMRI studies investigating attentional modulation of auditory feature tuning.