Speech frequency-following response in human auditory cortex is more than a simple tracking

The human auditory cortex was recently found to contribute to the frequency-following response (FFR), and this cortical component has been shown to be more relevant to speech perception. However, it remains unclear how the cortical FFR contributes to the processing of the speech fundamental frequency (F0) and dynamic pitch. Using intracranial EEG recordings, we observed a significant FFR at the fundamental frequency (F0) for both speech and speech-like harmonic complex stimuli in the human auditory cortex, even in the missing-fundamental condition. Both the spectral amplitude and the phase coherence of the cortical FFR showed a significant harmonic preference and attenuated from the primary auditory cortex to the surrounding associative auditory cortex. The phase coherence of the speech FFR was significantly higher than that of the harmonic complex stimuli, especially in the left hemisphere, demonstrating the high timing fidelity of the cortical FFR in tracking dynamic F0 in speech. Spectrally, the frequency band of the cortical FFR largely overlapped with the range of human vocal pitch. Taken together, our study parsed the intrinsic properties of the cortical FFR and revealed a preference for speech-like sounds, supporting its potential role in processing speech intonation and lexical tones.


Introduction
The phase-locking response is an important means by which the auditory system preserves temporal information in sounds ( Langner 1992; Schnupp 2011; Wang 2018 ) and can be noninvasively recorded as the frequency-following response (FFR) in humans ( Bidelman 2018; Chandrasekaran and Kraus 2010; Coffey et al., 2019 ). In recent years, studies have shown that the auditory cortex also contributes to the FFR recorded via EEG and MEG, which has thus been referred to as the "cortical FFR " ( Bidelman 2018; Coffey et al., 2016; Coffey et al., 2017b ). Compared with the subcortical responses, the cortical FFR shows more relevance to speech perception and attentional modulation ( Coffey et al., 2017a; Hartmann and Weisz 2019; Holmes et al., 2018; Puschmann et al., 2018 ). In speech sounds, the fundamental frequency (F0) and its dynamic changes constitute the intonation that facilitates speech perception ( Binns and Culling 2007; Fairbanks 1940; Steinhauer et al., 1999 ). The speech FFR, as a measurable indicator of neural synchrony to the speech F0 and its harmonics, is closely related to an individual's perception of speech pitch; in the proposed pitch-processing hierarchy, pitch is first extracted and the patterns of the pitch contour are then analyzed and recognized ( Griffiths 2003; Patterson et al., 2002 ). In the human cortex, this hierarchy seems to start from posterior-medial Heschl's gyrus (HG) and proceed to lateral HG and then to more anterior regions ( De Angelis et al. 2018; Patterson et al., 2002; Penagos et al., 2004 ). Specifically, the left HG has been proposed to play an important role in the processing of linguistic pitch ( Xu et al., 2006 ). However, how dynamic pitch is represented in these regions remains unclear. We hypothesize that the human auditory cortex may encode the pitch contour in speech signals by temporally tracking the dynamic F0, and that this coding may have a specific spatial distribution, thus providing precise F0 information for the subsequent processing of intonation and lexical tones.
Another widely debated question concerns the frequency limits of the cortical FFR ( Bidelman 2018; Coffey et al., 2019; Tichko and Skoe 2017 ). There is a general trend that the upper limits of frequency-following responses decrease gradually along the ascending auditory pathway in humans ( Bidelman 2018; Zhang and Gong 2019 ). Using an EEG-FFR source-localization technique, the contribution of the auditory cortex was observed in the low-frequency range (< 100 Hz) ( Tichko and Skoe 2017 ). Studies have also shown that the FFR at low F0s is more sensitive to attentional modulation than that at high F0s ( Galbraith and Doan 1995; Hartmann and Weisz 2019; Holmes et al., 2018 ), indicating that the cortical contributions to the speech FFR likely occur in the low F0 range. However, direct evidence is still lacking, and the full range of speech F0 has not yet been examined. Under the hypothesis that the cortical FFR encodes F0 contours of speech signals, we infer that the bandwidth of the cortical FFR is optimized to fit the F0 distribution of human speech.
Scalp EEG and MEG, though offering sufficient temporal resolution, are hindered by their limited spatial resolution in pinpointing the source of the cortical FFR ( Bidelman 2018; Coffey et al., 2017b; Hartmann and Weisz 2019; Penagos et al., 2004 ). Intracranial EEG (iEEG), directly recorded from the human cortex, has high resolution in both time and space, allowing fast-fluctuating responses to be recorded at the millisecond scale and neural sources to be accurately localized to the primary auditory cortex, which is embedded in the Sylvian fissure ( Nourski and Howard 2015; O'Sullivan et al., 2019; Zhang et al., 2018 ). Taking advantage of intracranial EEG, we designed a group of speech-like F0 modulated stimuli, with the F0 covering the range of human vocal pitch, to directly quantify the tracking accuracy and frequency limits of the cortical FFR. To further investigate the role of the cortical FFR in speech processing, natural speech with lexical tones was also tested. An F0 tracking response (FFR at F0) to both speech and speech-like harmonic complex stimuli was observed bilaterally in the human auditory cortex, including Heschl's gyrus (HG) and the surrounding superior temporal gyrus (STG) regions. We further demonstrated the preference of the cortical FFR for encoding speech F0s with harmonics and showed that the frequency band of the cortical FFR overlaps with the human vocal range.

Subjects
The experiments were carried out in 8 epilepsy patients (1 female; median age 22.5 years, range 12-35 years; Table 1 ) who were implanted with intracranial depth electrodes (8-16 macro contacts on each depth electrode, 0.8-mm diameter, 3.5-mm centre-to-centre spacing) as part of their clinical evaluation for epilepsy surgery. The electrode implantation trajectories were determined by the patients' clinical needs. No seizures were observed within one hour before or after the tests in any patient. All patients were right-handed native Mandarin Chinese speakers with normal hearing and normal speech perception, as evaluated with the Chinese version of the Western Aphasia Battery (WAB) ( Kertesz 1982 ). None of the patients had musical training experience. All patients signed informed consent forms. The experimental protocol was approved by the institutional review board of Tsinghua University and the affiliated Yuquan Hospital, Tsinghua University.

Non-speech stimuli
Non-speech stimuli were generated in MATLAB (MathWorks, Natick, MA, USA). Since the F0 contour in speech varies continuously ( Zatorre and Baum 2012 ), mainly within the range of approximately 85-255 Hz ( Baken 2010; Keating and Kuo 2012 ), a group of F0 modulated stimuli (sweeps) with the F0 ranging from 20 Hz to 335 Hz was designed to cover the speech fundamental frequency ( Fig. 2 A, top panels ). The sweeping rate of the F0 contour was 157.5 Hz per second upwards (or −157.5 Hz per second downwards), and each sweep lasted 2 s. Furthermore, since speech also features rich harmonic structure, stimuli with different harmonic structures but the same F0 contour were designed: a harmonic complex sweep (HCS; 6 equal-amplitude harmonics including F0, from the 1st harmonic/F0 through the 6th harmonic/H6, added in sine phase), a missing fundamental sweep (MFS; 6 equal-amplitude harmonics without F0, from the 2nd harmonic/H2 through the 7th harmonic/H7, added in sine phase), and a pure tone sweep (PTS; a single component, the F0 only). The spectrograms of all stimuli were generated directly in MATLAB.
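The stimulus construction above can be sketched as follows. The authors generated the stimuli in MATLAB; this Python/NumPy version is an illustrative equivalent, and the peak normalization step is our assumption:

```python
import numpy as np

fs = 44100              # audio sampling rate (Hz)
dur = 2.0               # sweep duration (s)
t = np.arange(int(fs * dur)) / fs
f0 = 20.0 + 157.5 * t   # linear upward F0 sweep, 20 -> 335 Hz

# Instantaneous phase of the fundamental: 2*pi * running integral of f0(t)
phase = 2 * np.pi * np.cumsum(f0) / fs

def sweep(harmonics):
    """Sum of equal-amplitude harmonics of the F0 sweep, added in sine phase."""
    x = sum(np.sin(k * phase) for k in harmonics)
    return x / np.max(np.abs(x))   # peak-normalize (our assumption)

pts = sweep([1])          # pure tone sweep: F0 only
hcs = sweep(range(1, 7))  # harmonic complex sweep: H1 (=F0) .. H6
mfs = sweep(range(2, 8))  # missing fundamental sweep: H2 .. H7
```

Note that the MFS contains no acoustic energy at F0 itself, yet its waveform shares the F0-rate periodicity of the HCS, which is exactly the contrast the missing-fundamental condition exploits.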
In the spectral domain, the three stimuli in the same direction had the same F0 contour but distinct spectrograms ( Fig. 2 A, top panels ). In the temporal domain, their waveforms had the same periodic structure (blue dots, Fig. 2 A, bottom panels ) but different fine structures within each period. If the neural response were a simple tracking of the F0 contour, similar responses would be expected for these F0 modulated stimuli; if, however, the cortical response shows a particular preference for speech-like sounds, unbalanced responses should be observed. Furthermore, to confirm that the F0 tracking responses were driven by the F0 contour rather than by acoustic energy at the fundamental frequency, we expected the response to the missing-fundamental stimuli to match that to the F0-present stimuli ( Coffey et al., 2017b; Tang et al., 2017 ).

Speech stimuli with and without F0
In Mandarin Chinese, the F0 contours at the syllable level are categorically perceived as lexical tones ( Duanmu 2000 ), which are essential for determining the meaning of a word. For example, the syllable /yi/ can be accented in four lexical tones (i.e., level tone T1, rising tone T2, dipping tone T3, and falling tone T4) to represent four distinct word meanings: "medicine", "aunt", "desk", or "difference", respectively ( Si et al., 2017 ). In this study, 7 Chinese syllables (/da1/, /da4/, /da2/, /yi1/, /yi4/, /yi2/, and /yi3/, Fig. 5 A) were retrieved from the Mandarin monosyllabic speech corpora of the Chinese Academy of Social Sciences Institute of Linguistics and concatenated into a sound stream of the same length as the synthesized sweeps (~2 s). The speech stream was meaningless at the sentence level but was still recognizable syllable by syllable. All syllables were spoken at normal speed (~270 ms) by a male speaker and were of good quality. By setting the spectral energy around the F0 component to zero ( Boersma 1993 ), a group of syllables without F0 was obtained. We estimated the spectrograms of all stimuli by short-time Fourier transform with a 50 ms Gaussian window ( Fig. 5 A).
The speech stimuli were used to examine the cortical FFR to more perceptually meaningful sounds. It should be noted, however, that in addition to the linguistic differences, there are acoustic discrepancies between the speech stream and the harmonic sweeps. For example, the synthesized non-speech signals are continuous, while the speech signals consist of seven discrete tones, and the rate and direction of the F0 contour changes differ between the speech and non-speech stimuli.

Passive listening task
Speech streams and synthesized sweeps were normalized to the same RMS amplitude in MATLAB (MathWorks, Natick, MA, USA) and then played to the participants binaurally at a sampling rate of 44,100 Hz via a pair of insert earphones (ER2, Etymotic Research, USA) under the control of the Psychophysics Toolbox ( Brainard 1997 ). All stimuli were played in random order and repeated 30 times. To avoid expectation effects, the stimulus onset interval was set to 1100 ms, with 5% jitter. During listening in all conditions (both speech and non-speech), silent films were played on a tablet to keep the subjects awake. For each subject, the sound was played at a comfortable level based on self-report, and the sound intensity was kept consistent across stimulus materials within each subject. To ensure that harmonic distortions created by the earphones did not reintroduce energy at F0 ( Norman-Haignere and McDermott 2016 ), we measured the sound output from both sets of earphones post hoc with a precision microphone (PCB378C10, PCB Piezotronics Inc; MA3 stereo microphone amplifier) and a TDT RZ6 (Tucker-Davis Technologies, Gainesville, FL, USA) and found no evidence that any power had been reintroduced at F0 (Fig. S1).

Data acquisition and preprocessing
Intracranial EEG (iEEG) signals were recorded using a g.USBamp amplifier/digitizer system (g.tec, Graz, Austria). The amplifier sampled data at 1200 Hz with a high-pass filter with a cut-off frequency of 0.1 Hz and notch filters centred at the 50 Hz harmonics to remove power-line noise. The data were subsequently re-referenced using a local scheme whereby the signal of each electrode was referenced to the signals of its nearest neighbours ( O'Sullivan et al., 2019; Stolk et al., 2018 ). Electrodes showing epileptiform activity during clinical monitoring were labelled by the clinician and removed from the following analysis.
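A minimal sketch of such a local re-referencing scheme along one depth electrode, assuming contacts are ordered along the shaft and each channel is referenced against its immediate neighbour (the authors' exact neighbour weighting may differ):

```python
import numpy as np

def local_reref(data):
    """Nearest-neighbour (bipolar-style) re-referencing along a depth electrode.

    data: (n_contacts, n_samples) array of raw contact signals, ordered
    along the shaft. Each channel is re-expressed relative to its next
    neighbour, yielding n_contacts - 1 locally referenced channels.
    """
    return data[:-1] - data[1:]
```

This differencing suppresses far-field activity common to adjacent contacts, which is the motivation for local referencing in iEEG.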

Anatomical location of electrodes
The locations of the electrodes relative to the cortical surface were determined according to the recommended procedure provided by FreeSurfer ( Fischl 2012; Stolk et al., 2018 ). (1) A presurgical high-resolution T1-weighted structural MRI scan and a post-implantation CT scan were acquired for each subject. The CT images were registered to the presurgical MRI images with the statistical parametric mapping (SPM) implementation of the mutual information-based transform algorithm ( Wells et al., 1996; Zhang et al., 2013 ) in FreeSurfer ( Fig. 1 A).
(2) For each bipolar iEEG channel, the channel position was taken as the midpoint between the two corresponding electrode contacts, and its location on the cortical surface was identified as the vertex nearest to the contact site in 3D-volume space. Recording sites with a contact-vertex distance larger than 5 mm were not used ( Zhang et al., 2018 ). (3) The auditory cortex of interest was divided into three parts within each subject's individual space ( Fig. 1 B) ( Desikan et al., 2006 ): HG, aSTG, and pSTG. The STG regions anterior to HG were considered "anterior STG " (aSTG), and the STG regions posterior to HG were considered "posterior STG " (pSTG) ( Da Costa et al. 2011; Sammler et al., 2015 ). The electrodes were classified according to these anatomical parcellations. (4) For electrode visualization, the individual brains were coregistered to the fsaverage standard brain in FreeSurfer, and the electrode locations were then non-linearly projected onto the standard brain ( Fig. 1 C) ( Greve and Fischl 2009; Thomas Yeo et al. 2011 ).

Auditory response analysis
iEEG data processing was performed in MATLAB. With the Hilbert transform, the analytic amplitude of eight Gaussian filters (centre frequencies: 60-200 Hz, the high gamma band) was computed. The high gamma power was taken as the average analytic amplitude across these eight bands ( Khalighinejad et al., 2017 ) and was then downsampled to 200 Hz and z-scored to a silent baseline. The baseline period was defined as 50-300 ms before stimulus onset, and the task-related analysis window was defined as 0-2300 ms after stimulus onset. An electrode was regarded as auditory responsive if its response power was significantly larger than the baseline for a period lasting at least 50 ms (Wilcoxon signed-rank test, Bonferroni correction, p < 0.05) ( Si et al., 2017 ). Electrodes without auditory responses to any of the stimuli were excluded from subsequent analyses. Of the 344 electrodes on the brain surface, 63 (LH: 34 of 183; RH: 29 of 161, Fig. 1 C) were responsive to the auditory stimuli. Follow-up analyses were restricted to these 63 electrodes.

The spectral amplitude of the F0 tracking response (FFR-F0)
The spectral amplitude of the FFR provides information on the robustness of auditory processing ( Krizman and Kraus 2019 ). The FFR amplitude was calculated in the time-frequency plane. Corticograms were generated from the pre-processed data by the filter-bank method ( Edwards et al., 2009; Stolk et al., 2018 ): (1) time-frequency analysis was performed using a Gaussian filter bank and the Hilbert transform to extract the envelope of each frequency channel separately, with centre frequencies ranging from 10 Hz to 350 Hz in 1 Hz increments; (2) spectrograms of the data averaged across trials were generated to create the corticograms; (3) the corticogram was downsampled to 200 Hz and z-scored to the baseline (50-300 ms before stimulus onset) at each frequency channel.
To demonstrate how the iEEG responses tracked the F0 contours of the stimuli, F0(t) (blue line, Fig. 2 B), the dominant frequencies (DF(t), the frequency with peak amplitude at each time point) were first extracted. Then, the latency of the F0 tracking response was estimated as the peak lag of the cross-correlation between DF(t) and F0(t). After the response latency was calibrated, the F0 tracking range was defined by the criterion |DF(t) − F0(t)| < 5 Hz ( Bidelman and Powers 2018; Chandrasekaran and Kraus 2010 ). The calibrated DF of the exemplar response that tracked F0 is shown as a white line in Fig. 2 B. To prevent false alarms, the F0 tracking segment had to cover at least 100 ms (~16 Hz of F0 sweep). With this criterion, 39 electrodes (out of 63) were identified as significant "F0 tracking electrodes " (gray dots in Fig. 3 B, 3 C).
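A sketch of the latency estimation and the |DF − F0| < 5 Hz tracking criterion. We use a normalized cross-correlation (Pearson correlation at each positive lag); the exact correlation variant and maximum lag used by the authors are not specified, so both are assumptions here:

```python
import numpy as np

def track_f0(dominant_freq, f0, fs_frame, max_lag_s=0.2, tol_hz=5.0):
    """Estimate response latency and the F0-tracking mask.

    dominant_freq, f0: per-frame dominant frequency of the response and the
    stimulus F0 contour (Hz), both sampled at fs_frame. The latency is the
    positive lag (response lagging stimulus) maximizing the normalized
    cross-correlation; frames where |DF - F0| < tol_hz after lag
    correction count as F0 tracking.
    """
    max_lag = int(max_lag_s * fs_frame)
    lags = np.arange(1, max_lag + 1)
    # Pearson correlation between the lag-shifted DF contour and the F0 contour
    cc = [np.corrcoef(dominant_freq[k:], f0[:-k])[0, 1] for k in lags]
    lag = int(lags[int(np.argmax(cc))])
    aligned = dominant_freq[lag:]
    mask = np.abs(aligned - f0[:len(aligned)]) < tol_hz
    return lag / fs_frame, mask
```

A minimum-duration check on contiguous runs of `mask` (at least 100 ms, i.e. 0.1 * fs_frame frames) would then implement the false-alarm criterion from the text.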
The FFR amplitude was calculated as the mean of spectral amplitude within ± 5 Hz around the stimulus F0 ( Bidelman and Powers 2018 ; Krizman and Kraus 2019 ). For visualization, the spectrograms of iEEG responses (corticograms) were temporally shifted according to the response latencies so that they aligned with the F0 contours, as displayed in Fig. 2 B.
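Extracting the FFR amplitude as the mean corticogram amplitude within ±5 Hz of the stimulus F0 might look like this (the frequencies × times array layout is our assumption):

```python
import numpy as np

def ffr_amplitude(corticogram, freqs, f0_contour, half_bw=5.0):
    """Mean corticogram amplitude within +/- half_bw Hz of the stimulus F0.

    corticogram: (n_freqs, n_times) z-scored amplitude; freqs: centre
    frequency (Hz) of each corticogram channel; f0_contour: stimulus F0
    (Hz) at each time frame, already latency-aligned to the response.
    """
    amps = np.empty(len(f0_contour))
    for i, f0 in enumerate(f0_contour):
        band = np.abs(freqs - f0) <= half_bw   # channels within +/-5 Hz of F0
        amps[i] = corticogram[band, i].mean()
    return amps
```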

Phase coherence of the F0 tracking response
Phase coherence is considered a frequency-specific measure of timing fidelity ( Omote et al., 2017 ). The calculation of phase coherence across trials was also performed in the time-frequency plane, based on the phase spectrogram θ(t, f). The phase of each pixel in θ(t, f) was averaged across trials to obtain the intertrial phase-locking value ITPL(t, f) ( Krizman and Kraus 2019; Zhang and Gong 2019 ). The original ITPL varies between 0 (no phase synchronization across trials) and 1 (perfect phase synchronization across trials). To facilitate the comparison of the phase coherence with the spectral amplitude, ITPL(t, f) was downsampled to 200 Hz and z-scored to the baseline (50-300 ms before stimulus onset). Although the raw ITPL is bounded between 0 and 1, it is no longer limited to this range after baseline normalization. The phase coherence of the FFR at F0 was calculated as the mean of the ITPL within ± 5 Hz around the stimulus F0 (blue line, Fig. 2 C).
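The ITPL computation reduces to averaging unit phasors across trials; a minimal sketch (the trials × frequencies × times array layout is our assumption):

```python
import numpy as np

def itpl(phase):
    """Intertrial phase-locking value from a phase spectrogram.

    phase: (n_trials, n_freqs, n_times) array of instantaneous phase
    (radians). Averaging the unit phasors exp(i*phase) across trials and
    taking the magnitude yields ITPL(f, t) in [0, 1]: 0 = phases spread
    uniformly across trials, 1 = perfect phase locking.
    """
    return np.abs(np.mean(np.exp(1j * phase), axis=0))
```

Baseline z-scoring, as described above for the amplitude measure, would be applied to this raw ITPL afterwards.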

FFR band limit characterization
Since the F0 changed linearly from 20 Hz to 335 Hz for the sweep stimuli, the spectral amplitude or phase coherence was calculated for each frequency channel separately (projected onto the F0 axis). Portions of the spectral profile where the DF did not track F0 were set to zero, and the profile was then averaged across the four harmonic sweeps (HCSup, HCSdown, MFSup, MFSdown) to obtain the spectral profile of the FFR for each site ( Fig. 6 A). For the sweep stimuli, the upper and lower limits were defined as the upper and lower frequencies of the F0 tracking response, respectively, and the bandwidth of the FFR was defined as the difference between them. To estimate the F0 distribution of the human vocal pitch ( Fig. 6 D), long narrative stories spoken by a male speaker and a female speaker (3 stories per speaker, ~4 min per story) were retrieved from the Mandarin monosyllabic speech corpora of the Chinese Academy of Social Sciences Institute of Linguistics.

Frequency tuning
Short pure tones (270 ms) at 20, 40, 80, …, 10,240 Hz (logarithmically spaced) were used to measure the frequency tuning curve (FTC, Fig. 6 E) of the recording sites. The frequency tuning curve shows the high-gamma response as a function of the frequency of the pure tones, which reflects the frequency selectivity of the local neural population ( Jenison et al., 2015; Palmer 1987 ). The high gamma power (60-200 Hz) was averaged to quantify the response to each pure tone and was considered significant if it was larger than the baseline for at least 50 ms ( Si et al., 2017 ). The FTC shapes were characterized, and the following tuning properties were extracted: (1) the best frequency (BF), the frequency at which the iEEG response power is largest; and (2) the bandwidth of the frequency tuning, for which the FTC was first z-scored across frequencies and the width (in octaves) of the region where the z-scored FTC exceeded 0 was taken as the bandwidth ( Bitterman et al., 2008 ). Fifty-three sites responded to pure tones and were included in the analysis of the relationship between the tuning properties and the FFR properties ( Fig. 6 F, 6 G).
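The two tuning properties can be sketched as follows, reading the bandwidth definition as the octave span of the region where the z-scored FTC exceeds 0 (our interpretation of the text, which assumes that region is contiguous):

```python
import numpy as np

def tuning_properties(ftc, freqs):
    """Best frequency and tuning bandwidth from a frequency tuning curve.

    ftc: high-gamma response power per pure-tone frequency; freqs: the
    tone frequencies (Hz, log-spaced). BF is the frequency of the peak
    response; the bandwidth is the width, in octaves, between the lowest
    and highest frequencies whose z-scored response exceeds 0.
    """
    bf = freqs[int(np.argmax(ftc))]
    z = (ftc - ftc.mean()) / ftc.std()       # z-score across frequencies
    above = np.flatnonzero(z > 0)            # suprathreshold channels
    bw_oct = np.log2(freqs[above[-1]] / freqs[above[0]])
    return bf, bw_oct
```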

Statistical analysis
In this study, repeated-measures analyses of variance (ANOVAs) and post hoc comparisons (with Bonferroni corrections) were used to quantify the effects of stimulus type and electrode location on the FFR measures ( Figs. 3-5 ). The number of sites (each site was treated as an entry) was 63 for all tests. For the repeated-measures ANOVAs, the within-subject factors are stated in the Results section; Greenhouse-Geisser corrections were applied when sphericity could not be assumed. For pairwise comparisons, the t-value and effect size (Cohen's d) ( Cohen 1988 ) were also calculated.

Data and code availability
The data and codes that support the findings of this study are available upon request.

Cortical FFRs to F0 modulated sweeps
In this study, the subjects listened to the F0 modulated sweeps passively, and the cortical responses were recorded with intracranial electrodes. An auditory responsive electrode (E-L3 from subject S5) adjacent to the left posterior Heschl's sulcus was chosen for demonstration (green dot in Fig. 1 A). The iEEG responses (corticograms) of the example site to the F0 modulated sweeps are displayed in Fig. 2 B, which shows that the F0 contours of the stimuli (blue line, Fig. 2 B) were encoded by the dominant frequency (DF, white line, Fig. 2 B) of the iEEG responses. At this site, the dominant frequency of the responses to harmonic complex stimuli (HCS and MFS in both directions) tracked the F0 contours consistently, with a mean difference between the DF and the F0 contours of less than 0.5 Hz. Conversely, the F0 tracking responses to the pure tone sweeps (PTS) were much weaker than those to the harmonic sweeps. Corresponding to the sound waveform displayed in Fig. 2 A (duration: 100 ms, bottom panels), a short segment of the averaged iEEG response (duration: 100 ms) was plotted beneath each corticogram in Fig. 2 B, which clearly shows the temporal synchrony between the responses (black traces) and the stimulus peaks (blue dots). In addition to the spectral amplitude, we also examined the intertrial phase-locking (ITPL) of the iEEG response ( Fig. 2 C), which is more sensitive to the timing fidelity across trials ( Coffey et al., 2017b; Krizman and Kraus 2019; Zhang and Gong 2019 ). The ITPL tracking was restricted to a narrower range than the spectral energy, especially for the upward harmonic sweeps ( Fig. 2 C). The enhanced ITPL did not appear at other times or frequencies, indicating a frequency-specific phase coherence to the F0 contour. Since the cortical FFRs to sweeps at the first harmonic (H1 = F0) were dominant and the responses at higher harmonics were weak, we focused on the FFR at F0 in the subsequent analyses.

Stimuli specificity and anatomical distribution of the cortical FFR
By pooling the responses of all responsive sites ( N = 63), we obtained an overall picture of the cortical responses, which demonstrated that the F0 tracking responses (white line, Fig. 3 A) precisely encode the F0 contour (blue line, Fig. 3 A). The average discrepancy between the DF and the F0 for the F0 tracking electrodes was 0.02 Hz ± 0.29 Hz (mean ± SD). At the group level, the cortical FFR was also more sensitive to harmonic complex sounds than to pure tone sweeps. To examine the selectivity of the cortical FFRs to the stimuli, repeated-measures ANOVAs were performed for the two complementary FFR measures (FFR amplitude and ITPL, Materials and Methods ) separately. Two within-subject factors (spectral structure: HCS, MFS, PTS; direction: upward, downward) and two between-subject factors (hemisphere: LH, RH; region of interest: HG, pSTG, aSTG) were used to build the full model.
For the FFR amplitude, the repeated-measures ANOVA revealed a significant main effect of spectral structure (F(1.2, 68) = 43.841, p < 0.001, η² = 0.435), whereas the effect of direction was not significant (F(1, 57) = 0.576, p = 0.451, η² = 0.01) and the interaction effect was weak (F(1.98, 112.8) = 3.740, p = 0.033, η² = 0.062). Post hoc pairwise comparisons (with Bonferroni corrections) showed that the FFR amplitudes for HCS (with F0) and MFS (without F0) were significantly stronger than that for PTS (F0 only) (HCS > PTS, t(62) = 6.896, p < 0.001, d = 1.143, Fig. 3 B; MFS > PTS, t(62) = 7.126, p < 0.001, d = 1.193). Moreover, the difference between the amplitudes for HCS (with F0) and MFS (without F0) was not significant (n.s., t(62) = −2.155, p = 0.74, d = −0.085, Fig. 3 C). Although our analysis was based on all 63 auditory responsive electrodes, the FFR measures of the significant tracking electrodes (gray, N = 39) and non-tracking electrodes (blue) are visualized differently in Fig. 3. We further investigated whether the F0 tracking response differed between the primary auditory cortex and the associative auditory cortex. The FFR amplitudes were averaged across the harmonic sweeps for each responsive site and then projected onto the averaged brain ( Fig. 3 D). The distribution of FFR amplitudes on the brain surface shows a tendency toward a stronger FFR in HG than in other regions. We then recalibrated the repeated-measures ANOVA model by removing the measures for pure tones. The between-subject analysis showed that the effect of the ROI on the FFR amplitude was significant (F(2, 57) = 9.77, p < 0.001, η² = 0.255), but the effects of the hemisphere and the interaction were not (hemisphere: F(1, 57) = 0.795, p = 0.376, η² = 0.014; interaction: F(2, 57) = 0.308, p = 0.736, η² = 0.011).
Post hoc comparisons of the FFR amplitudes across the 3 cortical regions showed a stronger FFR in HG (> aSTG, t(32) = 5.310, p < 0.001, d = 1.878; > pSTG, t(44) = 2.379, p = 0.026, d = 0.743; pSTG > aSTG, t(44) = 3.388, p = 0.016, d = 1.058; with Bonferroni corrections, Fig. 3 E). In addition, we analyzed the relationship between the response latencies and FFR amplitudes. Fig. 3 F shows a negative correlation between the FFR and the processing latencies, which also indicates the degradation of the FFR along the processing hierarchy.
Anatomically, the ITPL showed a gradient distribution across the three ROIs similar to that of the FFR amplitude (F(2, 57) = 6.595, p = 0.003, η² = 0.188). Post hoc comparisons revealed the highest ITPL in HG (HG > aSTG, t(32) = 4.183, p < 0.001, d = 1.479; HG > pSTG, t(44) = 2.387, p = 0.004, d = 0.746; pSTG > aSTG, t(44) = 2.471, p = 0.047, d = 0.772; with Bonferroni corrections, Fig. 4 E). Unlike the FFR amplitude, the effect of the hemisphere was also significant (LH > RH, F(1, 57) = 21.236, p < 0.001, η² = 0.271). We mapped the phase coherence to harmonic sounds on the brain surface and found that the electrodes with high timing fidelity concentrated in the left HG ( Fig. 4 D). A negative correlation between the phase coherence and the processing latencies was also observed ( Fig. 4 F), which again reveals the degradation of the phase coherence along the processing hierarchy.
Despite minor discrepancies between the two measures, both consistently showed that the cortical FFR to harmonic complex sweeps was significantly stronger than that to pure tones, suggesting a harmonic preference of the cortical FFR. The results also revealed that the cortical FFR attenuates from the primary auditory cortex to the associative auditory cortex.

Cortical FFR to speech stimuli
We used F0-modulated sweeps as a simplified version of speech signals. A further question is whether the cortical FFR to harmonic complex sweeps can predict the responses to speech stimuli. To determine the role of the F0 tracking response in speech processing, speech stimuli composed of 7 simple Chinese syllables (SY) with different lexical tones (F0 contours) were also presented to the participants ( Fig. 5 A). After listening, the subjects successfully repeated the syllables they had heard, which suggested that they had a clear perception of the lexical tones and syllable identities. At the exemplary site ( Fig. 1 A), in contrast to the responses to harmonic complex sweeps detailed above ( Fig. 2 B), a less obvious FFR but more induced high gamma responses, which may represent the phonetic features of speech, were observed ( Fig. 5 B, top panel ) ( Cheung et al., 2016; Mesgarani et al., 2014 ). However, the phase coherence (ITPL, Fig. 5 B, bottom panel ) still showed clear F0 tracking. Parallel repeated-measures ANOVAs (1 within-subject factor: speech vs. missing fundamental speech; 2 between-subject factors: ROIs and hemispheres) were performed for the measurements of the speech-evoked FFRs. The results on the FFR amplitude were highly consistent with those of the harmonic sweeps ( Fig. 5 C). First, the effect of the missing fundamental was not significant (F(1, 57) = 0.590, p = 0.445, η² ≤ 0.001). Second, the main effect of the ROI was significant (F(2, 57) = 9.464, p < 0.001, η² = 0.249), but the effects of the hemisphere (F(1, 57) = 0.114, p = 0.736, η² = 0.002) and the interaction between the ROI and hemisphere (F(2, 57) = 3.145, p = 0.051, η² = 0.099) were not. Post hoc comparisons showed stronger FFRs in HG than in the STG (HG > aSTG, t(32) = 5.281, p < 0.001, d = 1.867; HG > pSTG, t(44) = 2.636, p = 0.02, d = 0.823; pSTG > aSTG, t(44) = 2.651, p = 0.022, d = 0.828; with Bonferroni corrections, Fig. 5 C).
However, the phase coherence of the speech-evoked FFRs was significantly higher than that of the harmonic sweeps (t(62) = 4.760, p < 0.001, d = 0.622, Fig. 5 D). Another discrepancy was that the effect of the hemisphere on the phase coherence of the speech-evoked FFRs was significant. We further compared the FFR measures between the speech and harmonic complex stimuli. The spectral amplitude did not differ significantly between the speech-evoked and harmonic-complex-evoked FFRs (t(62) = 1.074, p = 0.6, d = 0.099, Fig. 5 E, scatter plot). The amplitudes of the speech-evoked FFRs also showed no significant difference between the two hemispheres (t(61) = 1.512, p = 0.136, d = 0.388, two-sample t-test, Fig. 5 E, bar plot inset). In contrast, the phase coherence of the speech-evoked FFRs was higher than that of the harmonic-complex-evoked FFRs ( Fig. 5 D, 5 F, scatter plot) and showed left lateralization (t(61) = 7.360, p < 0.001, d = 1.891, two-sample t-test, Fig. 5 F, bar plot inset). In Fig. 5 G, we mapped the ITPL of the speech-evoked FFRs on the brain surface, showing prominently high timing fidelity in the left auditory cortex.

Frequency limits of the cortical FFR
By design, the F0 in the sweep stimuli changed as a linear function of time, which provided a spectral profile of the phase-locked activity with increasing frequency (Fig. 6 A, top panel). Since the cortical FFR to the harmonic sweeps was stronger than that to the pure tone sweeps and was similar for both sweep directions, we combined the spectral profiles of the FFRs to the harmonic sweeps (HCSup, HCSdown, MFSup, and MFSdown) for all sites. The spectral profiles of the ITPL (blue) and FFR amplitude (orange) were overlaid (Fig. 6 A, bottom panel). Electrodes with significant F0 tracking responses (N = 39) were included in this analysis and were sorted according to the bandwidth of the FFR amplitude. The FFR bandwidth measured from the amplitude was correlated with that measured from the ITPL (r = 0.64, p < 0.001, Pearson's correlation). According to the FFR bandwidth, these electrodes were split into two groups of equal size, the "wide" group and the "narrow" group (Fig. 6 A, bottom panel). Electrodes with F0 tracking responses were primarily from HG (N = 16) and the pSTG (N = 19); 12 electrodes from HG and 7 from the STG fell into the "wide FFR" group, and 4 electrodes from HG and 12 from the STG fell into the "narrow FFR" group. The F0 tracking response was restricted to a specific band rather than spanning the full frequency range of the pitch contours, indicating that the F0-modulated sweeps used in this study were sufficiently wide to probe the frequency limits of the cortical FFR.
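The bandwidth measure used to sort electrodes can be sketched as follows. The half-maximum criterion here is an assumption for illustration; the paper's exact threshold is not restated in this section.

```python
import numpy as np

def ffr_bandwidth(freqs, profile, frac=0.5):
    """Bandwidth of an FFR spectral profile.

    Hypothetical criterion (an assumption, not the paper's stated
    threshold): the contiguous frequency range around the peak where
    the profile stays above `frac` of its maximum.
    Returns (low_edge_hz, high_edge_hz).
    """
    peak = int(np.argmax(profile))
    thresh = frac * profile[peak]
    lo = peak
    while lo > 0 and profile[lo - 1] >= thresh:
        lo -= 1
    hi = peak
    while hi < len(profile) - 1 and profile[hi + 1] >= thresh:
        hi += 1
    return freqs[lo], freqs[hi]

# Toy profile: a Gaussian bump centred at 100 Hz with ~40 Hz spread,
# loosely mimicking a bandpass FFR profile.
freqs = np.arange(20, 400, 2.0)
profile = np.exp(-((freqs - 100) ** 2) / (2 * 40 ** 2))
lo, hi = ffr_bandwidth(freqs, profile)
```

Applying the same function to the amplitude profile and the ITPL profile of each electrode would give the two bandwidth estimates whose correlation is reported above.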
We averaged the spectral profiles of the FFR amplitude and phase coherence separately across the electrodes in the wide group (Fig. 6 B, mean ± SEM), and the results clearly show the bandpass property of the cortical FFR. The passband of the FFR amplitude was somewhat wider than that of the phase coherence, but both measures provide direct evidence that the cortical FFR is limited to a low F0 range. On average, the cortical FFR of the wide group covered F0s from ~50 to ~160 Hz (Fig. 6 C, top bar, mean ± SEM), with the widest FFR band ranging from ~40 to ~240 Hz (Fig. 6 C, bottom bar). This band covers the F0s used in previous studies that reported cortical contributions to the FFR (Bidelman 2018; Brugge et al., 2009; Coffey et al., 2016; Coffey et al., 2017b; Griffiths et al., 2010; Nourski et al., 2013). Fig. 6 D shows the F0 distributions of the male Chinese speaker (90%: 85-220 Hz) and the female Chinese speaker (90%: 125-335 Hz), which match the previously reported F0 ranges of English speakers (male: 85-180 Hz; female: 165-255 Hz) (Baken 2010; Keating and Kuo 2012). The frequency limits of the cortical FFR thus largely overlap with the human vocal pitch range, especially that of male speakers.
To gain insight into the underlying mechanisms of the cortical FFR, we tested the relationship between the FFR and the frequency tuning properties of the local neuronal population. We measured the frequency tuning curve of each site (Fig. 6 E) and extracted the key tuning properties, i.e., the best frequency and the tuning bandwidth. Although the frequency tuning ranged from ~100 Hz to 5 kHz, the specific band of the F0 tracking response varied little (Fig. 6 A). We first compared the best frequency (BF) of the wide and narrow groups and found that electrodes with wide FFRs tended to have lower preferred frequencies (t(33) = −2.062, p = 0.047, d = −0.7182, two-sample t-test, Fig. 6 F). In animal studies, the tuning bandwidth of low-frequency channels tends to be wider (Fishman et al., 2013; Sayles and Winter 2008). We therefore also compared the tuning bandwidth (in octaves relative to the BF) of the two groups but found no significant difference (t(33) = 1.354, p = 0.185, d = 0.471, two-sample t-test, Fig. 6 G). By comparing the band limits of the FFR with those of frequency tuning, we found that the cortical FFR in the region preferring 100 Hz to 1 kHz has a wider F0-FFR band, which more closely matches the human vocal range.
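The two tuning properties extracted above can be sketched from a tuning curve as follows. The definitions used here (peak response for BF, a half-maximum range converted to octaves for bandwidth) are illustrative assumptions, not necessarily the paper's exact criteria.

```python
import numpy as np

def tuning_properties(freqs, responses, frac=0.5):
    """Best frequency (BF) and tuning bandwidth in octaves.

    Illustrative definitions: BF is the frequency with the maximal
    response; bandwidth is log2(f_hi / f_lo) over the contiguous
    range where the response exceeds `frac` of the peak.
    """
    i = int(np.argmax(responses))
    bf = freqs[i]
    thresh = frac * responses[i]
    lo, hi = i, i
    while lo > 0 and responses[lo - 1] >= thresh:
        lo -= 1
    while hi < len(responses) - 1 and responses[hi + 1] >= thresh:
        hi += 1
    bw_octaves = np.log2(freqs[hi] / freqs[lo])
    return bf, bw_octaves

# Toy tuning curve on a logarithmic frequency axis (100 Hz to 5 kHz,
# matching the range reported above), peaking near 250 Hz.
freqs = np.geomspace(100, 5000, 40)
responses = np.exp(-(np.log2(freqs / 250) ** 2) / (2 * 0.8 ** 2))
bf, bw = tuning_properties(freqs, responses)
```

For this toy curve the BF lands near 250 Hz, i.e., within the 100 Hz to 1 kHz region that the wide-FFR electrodes preferred, and the bandwidth comes out a little under two octaves.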

Discussion
The F0 contour gives rise to intonation and lexical tone in human speech (Binns and Culling 2007; Brown et al., 2011; Laures and Bunton 2003; Plack et al., 2005; Steinhauer et al., 1999; Tang et al., 2017). Using iEEG recordings of the human auditory cortex, we investigated the potential role of the cortical FFR in speech processing and tested our hypothesis that the cortical FFR is optimized for processing the F0 contour in speech. On the one hand, the cortical FFR showed a significant harmonic preference: pure tones in the same frequency range did not evoke the cortical FFR effectively, and speech syllables, which contain rich harmonic structure, induced higher phase coherence of F0 tracking than the synthesized harmonic complex stimuli. On the other hand, prominent F0 tracking responses to the harmonic complex stimuli were found in the frequency range of 50 to 160 Hz, which largely overlaps with the F0 distribution of the human vocal pitch. Electrodes with a wider frequency-following band tended to be located in the low-frequency tuning region (100 Hz to 1 kHz) of the auditory cortex. Anatomically, the cortical FFR to both speech and harmonic complex stimuli was strongest in the primary auditory cortex and decreased significantly in the associative auditory cortex. Although the cortical FFR is not an exclusive encoding mechanism for the F0 contour in speech, the convergence of its encoding range onto the statistics of speech may mark the beginning of speech-specialized processing (Tang et al., 2017; Zatorre and Baum 2012). Moreover, both the stronger encoding of harmonics and the frequency limits of the cortical FFR may inform future non-invasive FFR studies using EEG and MEG.

F0 encoding and harmonic preference in cortical FFR
The present study provides direct evidence that the human auditory cortex has the capacity to encode the dynamic F0 contours of speech and harmonic complex stimuli in a phase-locked manner. Previously, the temporal coding of periodic fluctuations has been shown to gradually transform into rate coding as signals travel from the midbrain to the auditory cortex (Gao and Wehr 2015; Langner 1992; Wang et al., 2008). However, electrophysiological studies in humans and macaque monkeys have shown that temporal coding of F0 in the cortex can reach frequencies of 200 Hz or higher (Fishman et al., 2013; Griffiths et al., 2010; Johnson et al., 2012). These findings, together with our data, suggest that specialized neural populations with fast temporal response characteristics persist in the human auditory cortex.
The present study also shows that the cortical FFR preferentially encodes harmonic sounds: the cortical FFR was not effectively driven by the pure tone sweeps, whereas the FFRs to the missing fundamental stimuli were similar to those evoked by the F0-present harmonic complex stimuli (Figs. 3 B, 4 B). A harmonic preference of cortical responses has previously been reported in both humans and animals (Hall et al., 2002; Hullett et al., 2016; Liang et al., 2002). Our results support this preference and extend it to the phase-locking response. We also measured the FFR evoked by natural speech, using a set of Chinese syllables. Unlike the harmonic stimuli, natural speech signals carry not only linguistic meaning (i.e., the lexical tone) but also richer acoustic features, such as the syllable rate, a complex formant structure, and more variable speed and direction of F0 contour change. Notably, these differences between speech and non-speech stimuli affected the phase coherence and the spectral amplitude of the FFR differently: phase coherence tracked the speech F0 better than spectral amplitude did, suggesting that the timing fidelity of cortical responses may ensure the precise encoding of dynamic F0 in speech sounds (Fig. 5 D, 5 F).
In EEG studies, a similar FFR may be evoked by pure tone stimuli and missing fundamental stimuli (Galbraith 1994; Galbraith and Doan 1995; Greenberg et al., 1987), and pure tone sweeps have proven effective for studying aging and hearing impairment (Clinard and Cotter 2015; Fu et al., 2019). Although pure tone stimuli failed to evoke an FFR in the auditory cortex, the harmonic preference of the auditory cortex cannot yet be taken as a biomarker differentiating the cortical FFR from other components. Because the auditory system is highly nonlinear, FFRs at all levels may be influenced by factors such as stimulus polarity and the stimulus set (Lerud et al., 2014; Skoe and Kraus, 2010). The origins of the human FFR, particularly when elicited by speech sounds, are still under debate.

Band limit of cortical FFR
The subcortical origins of the FFR have been widely explored with scalp EEG (Bidelman 2018; Tichko and Skoe 2017); however, the frequency limits of the cortical FFR remain under debate. In this study, using synthetic harmonic complex stimuli with F0s covering the human vocal range, the F0 tracking response was found to be concentrated in the frequency range of 50-160 Hz, with a maximum tracking range of 40-240 Hz (Fig. 6 A, 6 D). Although the upper limit of the cortical FFR is much lower than that of the subcortical FFR (Bidelman 2018; Tichko and Skoe 2017), the F0 of human speech mostly ranges between 85 and 255 Hz (Baken 2010; Keating and Kuo 2012); thus, the phase-locking response within this range in the cortex may be important in supplying precise temporal information on speech intonation and lexical tones.
The F0 tracking response observed in this study had a lower boundary of approximately 50 Hz, which mainly reflects the tracking of periodicity (Rosen et al., 1992). Additionally, the temporal tracking of slower fluctuations (Griffiths et al., 2010; Nourski et al., 2013; Nourski et al., 2009) and of the temporal structure of linguistic units (e.g., syllable rates, phrases) also occurs in large-scale cortical networks (Ding et al., 2016; Giraud and Poeppel 2012). Together with our data, these findings suggest that cortical frequency-following responses at different time scales concurrently track the time course of speech structures and constitute an important mechanism for speech encoding at the cortical level. There is also a large body of EEG work on the auditory steady-state response (ASSR), which peaks at 40 Hz and also originates from the auditory cortex (Krishnan et al., 2009b; Ross et al., 2002). The ASSR is believed to be more than a superposition of evoked responses to individual clicks, instead involving the perceptual binding processes of a large neural network (Krishnan et al., 2009b; Ross et al., 2005). In contrast, the cortical FFR we recorded with intracranial EEG is a response of local neural populations to detailed spectro-temporal features of sound, which may not reflect the 40 Hz signature of network integration (Waldert et al., 2009).
Frequency tuning is another basic property of neurons in the auditory cortex and is informative for pitch-related processing (Bendor and Wang 2005; Feng and Wang 2017; Fishman et al., 2013; Kikuchi et al., 2019). Electrodes with a wider FFR band were found to have tuning frequencies within 100 Hz to 1 kHz (Fig. 6 E, 6 F), which is highly consistent with findings in macaques (Fishman et al., 2013). A specialized pitch processing region has also been identified in the low-frequency tuning region of the primary auditory cortex of marmosets (Bendor and Wang 2005), where pitch extraction relies on temporal cues for low F0s and spectral cues for high F0s (Bendor et al., 2012). The F0 tracking response located in the low tuning frequency region may thus be a candidate temporal mechanism for pitch extraction. Nevertheless, it should be noted that the findings of the present study were observed at the population level and reflect both the neural firing pattern and the neural synchrony of the local population (Denker et al., 2011; Ray et al., 2008). Our findings also support the notion that neural populations may be better suited than single neurons for the temporal coding of pitch contours (Bendor et al., 2012; Fishman et al., 2013; Johnson et al., 2012; Yin et al., 2011).

Anatomical distribution of cortical FFR
Imaging studies have suggested that pitch processing involves HG, the PT, and the STG (Allen et al., 2017; De Angelis et al. 2018; Hall and Plack 2009; Warren et al., 2003). Other studies have further ranked these areas into a pitch processing hierarchy in which processing moves anterolaterally from the primary auditory cortex as patterns in pitch variation are analyzed (Griffiths 2003; Patterson et al., 2002). Consistent with previous studies, the cortical FFR was observed bilaterally in HG and its surrounding areas in the STG. Moreover, the FFR strength decreased from the primary auditory cortex to the associative auditory cortex (Figs. 3 E, 4 E) (Nourski et al., 2013).
Previous studies on auditory pitch processing have also suggested different functional roles for the anterior STG (aSTG) and posterior STG (pSTG) (Liégeois-Chauvel et al., 1998), and different connectivity has been shown for the two regions (Sammler et al., 2015). In general, the pSTG is tuned to temporally fast-varying speech sounds and has a short temporal integration window, whereas the aSTG is tuned to temporally slow-varying speech sounds and has a longer integration window (Hullett et al., 2016). Specifically, imaging and intracranial EEG studies have found that the pSTG and the planum temporale (PT), which lies posterior to HG, are highly related to linguistic pitch processing (Liang and Du, 2018; Warren et al., 2003; Xu et al., 2006), including lexical tone (Si et al., 2017) and intonation (Tang et al., 2017). In contrast, the function of the aSTG is less well known; some studies have suggested that its activation may be affected by melodic context and expectations (Seger et al., 2013; Warrier and Zatorre, 2004). In this study, we observed a stronger cortical FFR in the pSTG than in the aSTG, in line with the notion that the aSTG and pSTG have different functional foci and neural connections.
Cross-linguistic differences in the FFR have long been a focus of auditory research, and many studies have shown that native speakers of tonal languages may have stronger FFRs (Krishnan et al., 2009; Song et al., 2008; Zhao and Kuhl 2018). Although the current data do not provide a direct comparison, we propose that non-tonal language speakers also have a cortical FFR but that its response properties may differ. In the present study, we observed higher phase coherence in the left HG. This tendency toward higher timing fidelity in left HG may indicate an influence of language experience on primary auditory processing (Krishnan et al., 2005; Warrier and Zatorre 2002; Xu et al., 2006). Future work should further explore the differences between the cortical FFRs of tonal- and non-tonal-language speakers.

Limitations and future research questions
The present work studied the F0 tracking response in the human auditory cortex directly by taking advantage of intracranial EEG recordings. Based on our findings, we propose that the cortical FFR optimizes the processing of the F0 contour in speech signals in two main respects: a preference for harmonic structures and selective coding of F0 within the range of human vocal pitch. However, it should be noted that the F0 tracking response observed in our study may not be a direct representation of pitch perception (Gockel et al., 2011); rather, it may facilitate subsequent intonation and lexical tone processing at higher stages by encoding the pitch-bearing information, especially within the vocal pitch range.
Although intracranial EEG benefits the investigation of the cortical FFR, the number of electrodes and their coverage were determined by each patient's clinical needs, making them difficult to keep consistent across subjects. The limited accessibility and clinical setting further complicate subject recruitment and do not permit complex experimental designs (Parvizi and Kastner, 2018). Although the subjects included in this study were beyond the maturation period of the FFR (Skoe et al., 2013), future studies with patients of similar ages and consistent electrode coverage across hemispheres and subjects could help clarify FFR lateralization and its developmental effects.

Data and code availability statement
The data and codes that support the findings of this study are available upon request.

Declaration of Competing Interest
None.