Perception of frequency modulation is mediated by cochlear place coding

Natural sounds convey information via frequency and amplitude modulations (FM and AM). Humans are acutely sensitive to the slow rates of FM that are crucial for speech and music. Two coding mechanisms are believed to underlie FM sensitivity, one based on precise stimulus-driven spike timing (time code) for slow FM rates, and another coarser code based on cochlear place of stimulation (place code) for fast FM rates. We tested this long-standing explanation by studying individual differences in listeners with varying degrees of hearing loss that resulted in widely varying fidelity of place-based or tonotopic coding. Our findings reveal that FM detection at both slow and fast rates is closely related to the fidelity of place coding in the cochlea, suggesting a unitary neural code for all FM rates. These insights into the initial coding of important sound features provide a new impetus for improving place coding in auditory prostheses.


Modulations in frequency (FM) and amplitude (AM) carry critical information in biologically relevant sounds, such as speech, music, and animal vocalizations (Attias and Schreiner, 1997; Nelken et al., 1999). In humans, AM is crucial for understanding speech in quiet (Shannon et al., 1995; Smith et al., 2002), while FM is particularly important for perceiving melodies, recognizing talkers, determining speech prosody and emotion, and segregating speech from other competing background sounds (Zeng et [...] cues to the auditory system (Chen and Zeng, 2004; Ives et al., 2013). This lack of success is partly related to a gap in our scientific understanding regarding how FM is extracted by the brain from the information available in the auditory periphery.

The coding of AM begins in the auditory nerve with periodic increases and decreases in the instantaneous firing rate of auditory nerve fibers that mirror the fluctuations in the temporal envelope of the stimulus (Schreiner and Langner, 1988; Joris et al., 2004). As early as the inferior colliculus and extending to the auditory cortex, rapid AM rates are transformed into a code that is no longer time-locked to the stimulus envelope and instead relies on overall firing rate, with different neurons displaying bandpass, lowpass, or highpass responses to different AM rates (Schreiner and Langner, 1988; Wang et al., 2008). The coding of FM is less straightforward. For a pure tone with FM, the temporal envelope of the stimulus is flat; however, the changes in frequency lead to dynamic shifts in the tone's tonotopic representation along the basilar membrane, resulting in a transformation of FM into AM at the level of the auditory nerve (Zwicker, 1956; Moore and Sek, 1995; Saberi and Hafter, 1995; Sek and Moore, 1995).
Although this FM-to-AM conversion provides a unified and neurally efficient code for both AM and FM based on periodic fluctuations in the instantaneous auditory-nerve firing rate in both cases (Saberi and Hafter, 1995), it falls short of explaining human behavioral trends in FM sensitivity, specifically at low carrier frequencies (fc < ~4-5 kHz) and slow modulation rates (fm < ~10 Hz), where sensitivity tends to be considerably better than at higher carrier frequencies or fast modulation rates (Demany and Semal, 1989; Moore and Sek, 1995; Sek and Moore, 1995; Moore and Sek, 1996; He et al., 2007; Whiteford and Oxenham, 2015). This discrepancy is important, because low frequencies and slow modulation rates are the most important for human communication, including speech and music, as well as animal vocalizations. The enhanced sensitivity to slow FM at low carrier frequencies has been explained in terms of an additional neural code based on stimulus-driven spike timing in the auditory nerve that is phase-locked to the temporal fine structure of the stimulus. Although such a time-based code can potentially provide greater accuracy (Siebert, 1970; Heinz et al., 2001), and is used for spatial localization (Moiseff and Konishi, 1981; Grothe et al., 2010), it is not known whether or how this timing information is extracted by higher stages of the auditory system to code periodicity and FM.

If the detection of FM at fast modulation rates depends on an FM-to-AM conversion, whereas the detection of FM at slow rates does not, then fast-rate FM detection thresholds should depend on the sharpness of cochlear tuning (Fig. 1) [...] (Robertson and Manley, 1974; Liberman et al., 1986; Moore, 2007).
In contrast, damage to the cochlea is not thought to lead to a degradation of auditory-nerve phase locking to temporal fine structure for sounds presented in quiet (Henry and Heinz, 2012), so we would not expect to find a strong relationship between slow-rate FM detection thresholds and hearing-loss-induced changes in cochlear tuning.

[Figure caption fragment] Panels C and E demonstrate that a place code for FM would result in a greater change in output level on the low-frequency side of the excitation pattern (purple lines) relative to the high-frequency side (green lines).

Here we measured FM and AM detection at slow (fm = 1 Hz) and fast (fm = 20 Hz) modulation rates in a large sample of listeners with hearing thresholds at the carrier frequency (fc = 1 kHz) ranging from normal (~0 dB sound pressure level, SPL) to severely impaired (~70 dB SPL), consistent with sensorineural hearing loss (SNHL). The fidelity of cochlear frequency tuning was assessed using a psychophysical method to estimate the steepness of the forward masking function around 1 kHz. The results revealed a relationship between the estimated sharpness of cochlear tuning and sensitivity to FM at both fast and slow modulation rates. This relationship remained significant even after controlling for degree of hearing loss, sensitivity to AM, and age. Our results suggest that the fidelity of coding of slow FM depends on the fidelity of cochlear filtering, as predicted by a unified theory of AM and FM coding, and that an additional neural timing code may not be necessary to explain human perception of FM.

Effects of hearing loss on masking functions.
The fidelity of place coding at the test frequency (1 kHz) was measured using pure-tone forward-masking patterns. Participants heard two tones, one at a time, and were instructed to select the tone that had a short 20-ms tone pip directly following it. The masker tones were fixed in frequency (1 kHz) and level, while the tone pip level was adaptively varied to measure the lowest sound level that the participant could detect. Without the presence of a masker, the level of the tone pip reflects the absolute threshold (Supplementary Fig. 1, unfilled circles). In the presence of a pure-tone forward masker, the level of the tone pip depends on the tone pip's frequency proximity to the masker and the shape of the individual's cochlear filters, where detection for tone pips close in frequency to the masker is much poorer (i.e., the level must be higher) than for tone pips farther away in frequency. For each participant, the steepness of the low- and high-frequency slopes of the masking function was estimated by calculating linear regressions between the thresholds for the four lowest (800, 860, 920, and 980 Hz) and highest [...]

[Figure caption fragment] [...] thresholds are plotted in percent peak-to-peak frequency change (2∆f(%)) and 20log(m), where ∆f is the frequency excursion from the carrier and m is the modulation depth (ranging from 0-1)). For all tasks, lower on the x- or y-axis represents better thresholds. Correlations marked with an * are significant after Holm's correction (****p < .0001, ***p < .001, **p < .01, and *p < .05).
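The slope estimate described in this section, a linear regression over the thresholds at the four lowest and the four highest tone-pip frequencies, can be illustrated with a minimal sketch. It assumes the regression is run on frequency in Hz on a linear axis, which the text does not specify. The threshold values and the function names `ols_slope` and `masking_slopes` are hypothetical and chosen only for illustration.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def masking_slopes(freqs_hz, thresholds_db, n_points=4):
    """Return (low-side slope, high-side slope) in dB per Hz.

    The low-side slope uses the n_points lowest tone-pip frequencies and
    the high-side slope uses the n_points highest. Regressing on linear
    Hz is an assumption made for this sketch.
    """
    pairs = sorted(zip(freqs_hz, thresholds_db))
    low = pairs[:n_points]
    high = pairs[-n_points:]
    low_slope = ols_slope([f for f, _ in low], [t for _, t in low])
    high_slope = ols_slope([f for f, _ in high], [t for _, t in high])
    return low_slope, high_slope


# Tone-pip frequencies from the text; the masked thresholds are made up.
freqs = [800, 860, 920, 980, 1020, 1080, 1140, 1200]
example_thresholds = [20.0, 26.0, 32.0, 38.0, 36.0, 33.0, 30.0, 27.0]
low_slope, high_slope = masking_slopes(freqs, example_thresholds)
```

In this made-up example the low-frequency side rises more steeply toward the masker than the high-frequency side falls away from it, which is the asymmetry the text attributes to the excitation pattern.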

The role of frequency selectivity in FM detection.
The unitary neural coding theory of FM and AM predicts that steeper masking functions (implying sharper cochlear tuning) should be related to better FM detection thresholds (Zwicker, 1956). The current consensus is that this theory applies to fast but not slow FM detection (Moore and Sek, [...], 1996 [...] steeper than the high slope, it provides more stimulus information relative to the high side (Fig. 1, leftmost column), and is therefore predicted to dominate FM performance (Zwicker, 1956).

[Figure caption fragment] [...] Hz; black) and fast (fm = 20 Hz; white) FM detection (n=55). Correlations marked with an * are significant after Holm's correction (****p < .0001, ***p < .001, **p < .01, and *p < .05).

[Figure caption fragment] [...] detection (y-axes) after variance due to audibility, sensitivity to AM, and age has been partialled out for n=55 participants. Units of the x- and y-axes are arbitrary because they correspond to the residual variance for slow (fm = 1 Hz; black) and fast FM detection (fm = 20 Hz; white). Correlations marked with an * are significant after Holm's correction (****p < .0001, ***p < .001, **p < .01, and *p < .05).

A unitary code for FM
Our finding that cochlear place coding is equally important for both slow- and fast-rate FM detection was unexpected. Humans' acute sensitivity to slow changes in frequency at carriers important for speech and music has been thought to result from precise neural synchronization to the temporal fine structure of the waveform (Demany and Semal, 1989 [...]

The present study extends these previous findings by suggesting that the resolved harmonics, which are most important for human pitch perception, may be represented by their place of stimulation in a way that depends on the lower (and steeper) slope of the excitation pattern, rather than just via the temporal fine structure information encoded via the stimulus-driven spike timing (phase locking) in response to resolved harmonics. This conclusion is consistent with other studies showing that pitch perception is possible even with spectrally resolved harmonics that are too high in frequency to elicit phase locking (Oxenham et al., 2011; Lau et al., 2017). In addition, the fact that timing fidelity in the human auditory nerve is no greater than that found in smaller mammals (Verschooten et al., 2018) supports our conclusion that differences in pitch perception between humans and other mammals cannot be ascribed to differences in timing fidelity and phase locking, but instead may be due to differences in the sharpness of cochlear tuning.

Alternative interpretations
One alternative interpretation of our results is that hearing loss leads to a degradation in both spectral resolution and neural phase locking to temporal fine structure, and that it is the degradation in the phase locking, not cochlear filtering, that drives the relationship between [...]

Finally, the relationship between FM and the slopes of the masking function was specific to the low-frequency side of the masking function. Zwicker (1956) predicted over half a century ago that the steeper, low-frequency slope should play a larger role in FM-to-AM conversion. If the current findings were a spurious effect of time coding degrading with hearing loss, then the correlation should not be specific to the low-frequency slope, as the high-frequency slope is also strongly affected by hearing loss (r = .717, p < .0001). For the sake of parsimony, it seems more reasonable to interpret the similar correlations between the lower masking slope and both slow and fast FM as reflecting the same coding mechanism than to interpret them as coming from different sources with a coincidentally similar correlation.

Explaining superior FM perception at low rates within a unitary framework
A pure cochlear place-based model for FM proposes that FM is transduced to AM through cochlear filtering (Zwicker, 1956). As the frequency sweeps across the tonotopic axis, the auditory system monitors changes in the output of the cochlear filters. For a place-only model to explain FM, it would need to account for the rate-dependent trends in FM and AM sensitivity observed here (Fig. 3) and in many previous studies (Viemeister, 1979; Sheft and Yost, 1990; [...] slow modulation rates) to play a functional role. Thus, such a code would function more efficiently at slow than at fast rates, producing the observed differential effect.

Alternatively, a combined place-time code may predict better sensitivity for slow, low-carrier FM relative to the same carrier at faster rates (Fig. 3) [...]

[...] Task 1 were not included in the study. Participants with symmetric hearing (n = 37; asymmetries ≤ 10 dB at 1 kHz from Task 1) completed all monaural experimental tasks in their worse ear. Six participants had SNHL at 1 kHz in both ears, but loss in the poorer ear exceeded the study criterion; for these subjects, tasks were completed in the better ear only. One additional participant was only assessed in their better ear because loss in the poorer ear was near the study criterion (68.6 dB SPL at 1 kHz), and the subject indicated the level was uncomfortable. An additional three participants had one normal-hearing ear and one ear with SNHL at 1 kHz, and only measurements from the impaired ear were used in analyses. The final nine participants had asymmetric SNHL in both ears, defined as an asymmetry > 10 dB on Task 1. For eight of these subjects, the experimental tasks were completed for both ears separately. One participant with asymmetric hearing only completed tasks in their poorer ear due to time constraints (Table 1) [...]

[Table note fragment] Task 1 asymmetry > 10 dB; n=3 had normal hearing in their better ear, and n=8 had SNHL in both ears.

Stimuli.
Stimuli were generated in Matlab (MathWorks) with a sampling rate of 48 kHz using a 24-bit Lynx Studio L22 sound card and presented over Sennheiser HD650 headphones in a sound-attenuating chamber. Tasks were measured monaurally with threshold equalizing noise (TEN) (Moore et al., 2000) presented in the contralateral ear in order to prevent audible cross-talk between the two ears. The TEN was presented continuously in each trial, with the bandwidth spanning 1 octave, geometrically centered around the test frequency. Except for tasks that involved detection of a short (20 ms) tone pip, the TEN level (defined as the level within the auditory filter's equivalent rectangular bandwidth at 1 kHz) was always 25 dB below the target level, beginning 300 ms before the onset of the first interval and ending 200 ms after the offset of the second interval. Because less noise is needed to mask very short targets, the TEN was presented 35 dB below the target level for tasks that involved detection of a short, 20-ms tone pip (Tasks 4 and 7). This noise began 200 ms before the onset of the first interval and ended 100 ms after the offset of the second interval.

To obtain a more precise estimate of sensitivity for the test frequency, pure-tone absolute thresholds were measured for each ear at 1 kHz. The target was 500 ms in duration with 10-ms raised-cosine onset and offset ramps. The reference was 500 ms of silence, and the target and the reference were separated by a 400-ms interstimulus interval (ISI). Tasks involving modulation detection were assessed for the same frequency (fc = 1 kHz) at slow (fm = 1 Hz) and fast (fm = 20 Hz) rates. The target was an FM (Tasks 2 and 3) or AM (Tasks 5 and 6) pure tone while the reference was an unmodulated pure tone at 1 kHz. Both the target and the reference tones were 2 s in duration with 50-ms raised-cosine onset and offset ramps.
In the FM tasks, the starting phase of the modulator frequency was set so that the target always began with either an increase or decrease in frequency excursion from the carrier frequency, with 50% probability determined a priori. A similar manipulation was used for the AM tasks, so that the target always began at either the beginning or middle of a sinusoidal modulator cycle and so was either increasing or decreasing in amplitude at the onset. Stimuli for the modulation tasks were presented at 65 dB SPL or 20 dB sensation level (SL), whichever was greater, based on individualized absolute thresholds at 1 kHz from Task 1.

Detection for a short (20 ms), pure-tone pip was measured with and without the presence of a 1-kHz, 500-ms pure-tone forward masker. Tone-pip frequencies were 800, 860, 920, 980, 1020, 1080, 1140, and 1200 Hz, and both the tone pip and the masker had 10-ms raised-cosine onset and offset ramps. The tone pip was presented to one ear, directly following the offset of the masker, and the masker was presented to both ears to avoid potential confusion effects between the offset of the masker and the onset of the tone pip (Neff, 1986). The masker was fixed in level at either 65 dB SPL or 20 dB SL, whichever was greater, based on absolute thresholds for the 500-ms test frequency in the target ear (Task 1). The starting level of the tone pip was always 10 dB below the masker level in the masked condition. For unmasked thresholds, the starting level of the tone pip was either 40 dB SPL or 20 dB SL, whichever was greater, and the tone pip was preceded by 500 ms of silence.
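As an illustration of the FM target described in this section (1-kHz carrier, sinusoidal FM at 1 or 20 Hz, 2-s duration, 50-ms raised-cosine ramps, 48-kHz sampling rate), the following is a minimal Python sketch. It is not the authors' Matlab code. The function names, the sine-phase convention for the modulator, and the unit-amplitude scaling are assumptions made for illustration; level calibration in dB SPL is not modeled.

```python
import math


def raised_cosine_envelope(n_total, n_ramp):
    """Flat envelope with raised-cosine onset and offset ramps."""
    env = [1.0] * n_total
    for i in range(n_ramp):
        gain = 0.5 * (1.0 - math.cos(math.pi * i / n_ramp))
        env[i] = gain
        env[n_total - 1 - i] = gain
    return env


def instantaneous_frequency(t, fc=1000.0, fm=1.0, peak_to_peak_pct=5.02,
                            start_rising=True):
    """Instantaneous frequency (Hz) of the sketch's FM tone at time t (s)."""
    delta_f = fc * peak_to_peak_pct / 200.0  # 2*delta_f is given in % of fc
    phi = 0.0 if start_rising else math.pi
    return fc + delta_f * math.sin(2.0 * math.pi * fm * t + phi)


def fm_tone(fc=1000.0, fm=1.0, peak_to_peak_pct=5.02, dur=2.0, ramp=0.05,
            fs=48000, start_rising=True):
    """Sinusoidally frequency-modulated tone with unit peak amplitude.

    start_rising picks the modulator starting phase so that the frequency
    initially increases (True) or decreases (False) from the carrier.
    """
    delta_f = fc * peak_to_peak_pct / 200.0
    beta = delta_f / fm  # modulation index
    phi = 0.0 if start_rising else math.pi
    n = int(round(dur * fs))
    env = raised_cosine_envelope(n, int(round(ramp * fs)))
    samples = []
    for i in range(n):
        t = i / fs
        # Phase is the time integral of 2*pi*instantaneous_frequency(t).
        phase = (2.0 * math.pi * fc * t
                 + beta * (math.cos(phi) - math.cos(2.0 * math.pi * fm * t + phi)))
        samples.append(env[i] * math.sin(phase))
    return samples


# Slow-rate target at the starting excursion used in the FM tasks (5.02%).
target = fm_tone(fm=1.0, peak_to_peak_pct=5.02)
```

With a 5.02% peak-to-peak excursion at 1 kHz, the instantaneous frequency in this sketch swings between 974.9 and 1025.1 Hz. An AM target could be built analogously by multiplying an unmodulated carrier by `1 + m * sin(2 * pi * fm * t + phi)`.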

Procedure.
Procedures were adapted from [...] and are described in full below. The experiment took place across 3-6 separate sessions, with each session lasting no longer than 2 hours. All tasks were carried out using a two-interval, two-alternative forced-choice procedure with a 3-down 1-up adaptive method that tracks the 79.4% correct point of the psychometric function (Levitt, 1971). The target was presented in either the first or second interval with 50% a priori probability, and the participant's task was to click the virtual button on the computer screen (labeled "1" or "2") corresponding to the interval that contained the target. Each corresponding response button illuminated red during the presentation of the stimulus (either reference or target). Visual feedback ("Correct" or "Incorrect") was presented on the screen after each trial. All participants completed the tasks in the same order, and the tasks are described below in the order in which they were completed by the participants.

Task 1: Absolute Thresholds at 1 kHz. Participants were instructed to select the button on the computer screen that was illuminated while they heard a tone. The target was a 500-ms, 1-kHz pure tone presented to one ear, and the reference was 500 ms of silence. Three runs were measured for each ear, and the order of the presentation ear (left vs. right) was randomized across runs. Three participants were only assessed in their better ear, due to an extensive amount of hearing loss in the poorer ear according to their 1 kHz audiometric thresholds (all ≥ 80 dB HL). The remaining participants completed monaural absolute thresholds for both ears.

On the first trial, the target was presented at 40 dB SPL. The target level changed by 8 dB for the first reversal, 4 dB for the next 2 reversals, and 2 dB for all following reversals.
Absolute thresholds were determined by calculating the mean level at the final 6 reversal points. If the standard deviation (SD) across the three runs was ≥ 4 dB, then 3 additional runs were conducted for the corresponding ear, and the first three runs were regarded as practice.

Tasks 2 and 3: FM Detection. Participants were instructed to pick the tone that was "modulated" or "changing". At the beginning of each run, the target had a peak-to-peak frequency excursion (2∆f) of 5.02%. The 2∆f varied by a factor of 2 for the first two reversal points, a factor of 1.4 for the third and fourth reversal points, and a factor of 1.19 for all following reversal points. The FM difference limen (FMDL) was defined as the geometric mean of 2∆f at the final 6 reversal points.

Three runs were conducted for each modulation rate, and all three runs for slow-rate FM (fm = 1 Hz) were completed before fast-rate FM (fm = 20 Hz). Asymmetric participants with two qualifying ears completed six runs (three runs per ear) for each modulation rate, and the order of the presentation ear was randomized across runs. If the SD across the three runs for a given ear was ≥ 4, the participant completed an additional three runs, and only the last three runs were used in analyses.

Task 4: Detection for 20-ms Tones. Participants were instructed to select the button (labeled "1" or "2") on the computer screen that was illuminated while they heard a short, 20-ms target tone pip. The target was presented at 40 dB SPL or 20 dB SL, whichever was greater, for the first trial of each run. The level of the target changed by 8 dB for the first two reversals, 4 dB for the following two reversals, and 2 dB for all following reversals. The absolute threshold was defined as the mean target level at the final six reversal points.
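The adaptive track used in these tasks can be sketched as follows. A 3-down 1-up rule converges on the level at which three consecutive correct responses are as likely as not, so p^3 = 0.5 and p ≈ 0.794, which is the 79.4% point cited from Levitt (1971). The sketch uses the Task 4 step sizes (8, 4, then 2 dB) and averages the final six reversals. The stopping rule of ten reversals, the function name `run_staircase`, and the deterministic toy observer are assumptions for illustration and are not specified in the text.

```python
def run_staircase(respond, start_level, step_schedule=((2, 8.0), (4, 4.0)),
                  final_step=2.0, n_down=3, n_reversals=10, n_final=6,
                  max_trials=1000):
    """Simulate an n_down-down 1-up track on stimulus level (dB).

    respond(level) returns True for a correct response. step_schedule is a
    sequence of (reversal_count, step) pairs: the given step applies while
    fewer than reversal_count reversals have occurred; final_step applies
    afterwards. Stopping after n_reversals is an assumption of this sketch.
    """
    level = start_level
    correct_in_row = 0
    last_direction = None
    reversals = []
    for _ in range(max_trials):
        if len(reversals) >= n_reversals:
            break
        if respond(level):
            correct_in_row += 1
            if correct_in_row < n_down:
                continue  # level changes only after n_down correct in a row
            correct_in_row = 0
            direction = "down"
        else:
            correct_in_row = 0
            direction = "up"
        # A reversal is a change in the direction of the track.
        if last_direction is not None and direction != last_direction:
            reversals.append(level)
        last_direction = direction
        step = final_step
        for upto, size in step_schedule:
            if len(reversals) < upto:
                step = size
                break
        level += step if direction == "up" else -step
    final = reversals[-n_final:]
    return sum(final) / len(final), reversals


# Percent correct tracked by a 3-down 1-up rule.
tracked_proportion = 0.5 ** (1.0 / 3.0)

# Toy observer (hypothetical): always correct at or above 30 dB, else wrong.
threshold, reversal_levels = run_staircase(lambda level: level >= 30.0, 40.0)
```

For the FM tasks the same logic would apply with multiplicative steps (factors of 2, 1.4, and 1.19) and a geometric mean, which is equivalent to running the additive track on the logarithm of 2∆f.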
Participants completed one run for each of the eight tone-pip frequencies: 800, 860, 920, 980, 1020, 1080, 1140, and 1200 Hz. The order of the tone-pip frequency conditions was randomized across runs. Asymmetric participants with two qualifying ears had the order of the runs further blocked by presentation ear, so that 8 runs for the same ear had to be completed before any conditions in the opposite ear were measured. Whether the right or left ear was assessed first was randomized. One additional run was conducted for any conditions with an SD ≥ 4 dB, and only the final run for each condition was used in analyses.

[...] for the next two reversals, and 1 dB for all following reversals. The AM difference limen (AMDL) was defined as the mean modulation depth (in 20log(m)) at the last 6 reversal points.

In the same manner as the FM tasks, all three runs for slow-rate AM (fm = 1 Hz) were completed before fast-rate AM (fm = 20 Hz). Asymmetric participants with two qualifying ears completed six runs (three runs per ear) for each modulation rate, and the order of the presentation ear was randomized across runs. If the SD across the first three runs for a given condition was ≥ 4 dB, then three additional runs were conducted, and only the final three runs were analyzed.

Task 7: Forward Masking Patterns. The task was to determine which of two tones was followed by a short, 20-ms tone pip. Two runs were measured for each of the eight tone-pip frequencies (800, 860, 920, 980, 1020, 1080, 1140, and 1200 Hz), for a total of 16 runs, and the order of the tone-pip condition was randomized across runs. Asymmetric participants with two qualifying ears had the order of the runs further blocked by presentation ear, so that 8 runs for the same ear had to be completed before any conditions in the opposite ear were presented.
The 1-kHz, 500-ms masker tones were fixed in frequency and level, presented binaurally at 65 dB SPL or 20 dB SL based on absolute thresholds from Task 1, whichever was greater. Within a trial, each masker was directly followed by either a 20-ms tone pip, presented monaurally to the target ear, or 20 ms of silence. The starting level of the tone pip was 10 dB below the masker level in the corresponding ear. The level of the tone pip changed by 8 dB for the first two reversals, 4 dB for the third and fourth reversals, and 2 dB for the following reversals. The masked threshold for each tone-pip frequency condition was calculated as the mean tone-pip level at the final 6 reversal points. For a given subject, if the SD of the masked threshold across the two runs was ≥ 4 dB, then the subject completed two additional runs for the corresponding tone-pip frequency. For these conditions, only the final two runs were used in analyses, and the first two runs were regarded as practice. The average across the final two runs for each tone-pip frequency was used in analyses.

Sample size. Because the strength of the relationship between FM sensitivity and forward masking slopes was unknown in listeners varying in degree of SNHL, and the number of people with SNHL at 1 kHz was expected to be limited, we set a minimum sample size requirement for SNHL subjects based on the smallest effect we would like to be able to detect. To detect a moderate correlation between masking function slopes and FM sensitivity (r = .4, alpha = .05, one-tailed test) with a power of .9, we needed a sample of n=47. We also aimed to recruit an additional 10 participants with normal hearing (NH) at 1 kHz of similar age to the SNHL subjects. The NH sample was limited to 10 people to ensure a relatively even distribution of absolute thresholds at 1 kHz.
One of these anticipated NH subjects had mild SNHL at 1 kHz in their worse ear, leading to a sample size of n=57, with n=9 NH listeners and n=48 SNHL. One SNHL subject reported a history of neurological issues and was excluded from the study. Another SNHL subject had unusually poor FM sensitivity at both rates, with thresholds greater than 3 SD from the group mean. This outlier was excluded from all analyses, leading to a final sample size of n=55. Including the outlier in all analyses generally did not affect the results (Supplementary Text 2, Supplementary Table 1, and Supplementary Figs. 2-3).

Statistical analyses. The mean log-transformed thresholds (10log10(2∆f (%)) and 20log10(m)) were used in all analyses to better approximate normality, where 2∆f (%) is the peak-to-peak frequency excursion (for FM) as a percentage of the carrier frequency, and m is the modulation index (for AM). All reported means (x̄) and standard deviations (s) correspond to the log-transformed data. Confidence intervals (CIs) are 95% CIs. Pearson correlations were used to assess continuous trends; the corresponding p values were adjusted using Holm's method to correct for family-wise error rate (Holm, 1979). The p values corresponding to the correlations in [...] t-tests were used to assess rate-dependent differences, and effect sizes were calculated using Cohen's dz (Lakens, 2013). The cocor package was used to calculate significant differences between correlations using Steiger's modification (Steiger, 1980; Diedenhofen and Musch, 2015). All tests were one-tailed unless otherwise stated in the results.

Bootstrap analyses were conducted to estimate the highest possible correlation detectable for each modulation task and the forward masking task, in order to ensure that correlations with these measures were not limited by test-retest reliability.
For each subject and for each modulation condition, performance was simulated by randomly sampling 6 runs (3 test and 3 retest) from a normal distribution based on the individual means and standard deviations from the corresponding task. An analogous procedure was conducted for each individual's masked thresholds for every tone-pip condition, with 4 runs (2 test and 2 retest) sampled from each individualized normal distribution. The average simulated runs were used to estimate the low and high frequency slopes of the masking function by calculating a linear regression between the 4 lowest and 4 highest tone-pip frequency conditions for the average test and the average retest runs (4 regressions per iteration). Simulated test-retest correlations were calculated using the simulated slopes for n=55 subjects (for forward masking) or the simulated average test and retest thresholds for each subject (for the modulated tasks). This process was repeated for 100,000 iterations. The correlations were transformed using Fisher's r to z transformation, averaged, and then transformed back to r, yielding an average test-retest correlation whose maximum is limited by within-subject error.

KLW and AJO conceived of and designed the experiment; HAK and KLW collected the data; KLW analyzed the data; KLW and AJO wrote the paper.