Speech and Music Have Different Requirements for Spectral Resolution

https://doi.org/10.1016/S0074-7742(05)70004-0

Publisher Summary

This chapter reviews the evidence that speech and music are quite different in their requirements for spectral resolution, suggesting that the importance of spectral resolution can only be judged against the demands of the listening task. The number of spectral channels needed depends on the difficulty of the listening task and the situation. Speech recognition, because it is a highly trained pattern recognition process, requires only four spectral channels of envelope information in the appropriate tonotopic place. Six to eight spectral channels are required for speech recognition in noisy listening conditions, for difficult speech materials, or for listeners who are not native speakers of the language. In contrast, music requires at least 16 spectral channels even for identification of simple familiar melodies played as a single stream of notes. Recognition and enjoyment of more complex music and music with multiple instruments require at least 64 channels of spectral resolution, and possibly many more. This large difference between music and speech highlights the difference in how the brain utilizes information from the auditory periphery. To understand processing in the auditory system, it is important to understand the relative roles of fine detail from the periphery and of top-down pattern processing by the brain for different tasks.

Introduction

Auditory research for the last 50 years has been dominated by the study of the cochlea. The general feeling has been that the cochlea is the “bottleneck” through which all auditory information must pass, so we could not understand further processing until we fully understood the end organ. And what a beautiful end organ it is: with its complex hydrodynamics and nonlinearities, the cochlea is a marvel of evolution. But in this chapter it will be argued that this cochlear fixation has cost us dearly in terms of understanding hearing in all its complexity: some hearing tasks are dominated and limited by the cochlea, while others are not. The obsessive focus on the cochlea has resulted in a relatively poor understanding of central and cognitive aspects of hearing. The difference in processing required for speech and music presents an excellent contrast between hearing limited by central processing and hearing limited by cochlear processing.

Of course, one of the most elegant features of the cochlea is its ability to process sound energy of different frequencies at different tonotopic locations; that feature is the basic theme of this entire book. Sound energy is separated by the hydromechanical resonance system so that low-frequency energy is primarily represented in the apex and high-frequency energy is primarily represented in the base of the cochlea. The tonotopic organization of the cochlea is reproduced at all levels of the auditory system, from the cochlear nucleus complex to the auditory cortex. Considerable detail is known about the mechanics and physiology of this tonotopic representation. Models and theories of hearing have based higher-order hearing on fine features of the tonotopic and temporal representation of sound in the cochlea and auditory nerve. The design of electrodes and signal processing for cochlear implants (CIs) is based on reproducing the tonotopic pattern of information in the cochlea. However, research on CIs has demonstrated that auditory perception is complex and multidimensional. Auditory perception is not completely determined by the properties of the cochlea and auditory nerve; it only starts there.
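To make the tonotopic map concrete, the cochlear frequency-position relationship is commonly approximated with Greenwood's (1990) function. A minimal Python sketch, using the commonly quoted human constants, illustrates how characteristic frequency runs from a few tens of hertz at the apex to roughly 20 kHz at the base:

    import numpy as np

    def greenwood_frequency(x, A=165.4, a=2.1, k=0.88):
        """Greenwood (1990) frequency-position map for the human cochlea.

        x is the proportion of basilar-membrane length measured from the
        apex (0.0 = apex, 1.0 = base). Returns the characteristic
        frequency in Hz; A, a, and k are the commonly quoted human values.
        """
        return A * (10.0 ** (a * x) - k)

    # Characteristic frequency at a few places along the cochlea
    for x in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"x = {x:.2f} -> {greenwood_frequency(x):8.1f} Hz")

Low frequencies map to the apex and high frequencies to the base, exactly the ordering described above.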

The basic premise of this chapter is that the study of auditory perception needs to consider the demands of the perceptual task and the requirements of the brain for information from the auditory periphery. Some complex perceptual tasks can be accomplished with coarse and distorted information from the periphery while other tasks require precise detailed information. This distinction is most clearly demonstrated in the difference between speech and music.

Section snippets

What Is a Spectral Channel?

The idealized concept of spectral processing is that there is a narrowly tuned representation in the nervous system that responds only to acoustic stimulation in a narrow frequency range. In an extreme form, each neuron would respond to a single acoustic frequency. This extreme case cannot be realized in a biological system, but it is common to think of sound being analyzed into narrow, independent bands in the cochlea. However, most representations of spectral selectivity in the cochlea are …
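To give the notion of a fixed number of spectral channels a concrete form, the sketch below divides a single analysis range into 4, 16, and 64 contiguous, logarithmically spaced bands and prints how wide the individual bands become. The 300 Hz to 8 kHz range and the logarithmic spacing are illustrative assumptions, not parameters taken from the studies discussed in this chapter.

    import numpy as np

    def log_spaced_band_edges(n_channels, lo_hz=300.0, hi_hz=8000.0):
        """Edges of n_channels contiguous bands, spaced logarithmically in
        frequency between lo_hz and hi_hz (a rough stand-in for the
        approximately logarithmic cochlear frequency map)."""
        return np.geomspace(lo_hz, hi_hz, n_channels + 1)

    for n in (4, 16, 64):
        widths = np.diff(log_spaced_band_edges(n))
        print(f"{n:3d} channels: first band {widths[0]:7.1f} Hz wide, "
              f"last band {widths[-1]:7.1f} Hz wide")

With only four channels each band spans a substantial fraction of the speech spectrum, whereas 64 channels begin to approach the fine spectral grain available to a normal cochlea.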

Comparison of Spectral Resolution in Normal Hearing and Cochlear Implants

In a CI, the residual auditory nerve (AN) in deaf listeners is activated by presenting electrical pulses on a series of electrodes positioned inside the scala tympani. The selectivity of activation can be controlled to some extent by the configuration of the active and return electrodes, but it appears that the main factor determining selectivity is the distance between the electrode and the stimulable neurons. When the distance is small, local and selective activation can be achieved, while a large distance …
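The dependence of selectivity on electrode-to-neuron distance can be illustrated with a deliberately simplified toy model in which neural activation decays exponentially with distance from the electrode. The geometry and the 3 mm length constant below are illustrative assumptions rather than measured values; the sketch only shows the qualitative effect of moving the electrode farther from the neurons.

    import numpy as np

    def activation_profile(neuron_positions_mm, electrode_position_mm,
                           electrode_distance_mm, length_constant_mm=3.0):
        """Toy spread-of-excitation model along the cochlea.

        Activation is assumed to decay exponentially with the straight-line
        distance from the electrode to each neuron; the 3 mm length constant
        is an illustrative assumption, not a measured value.
        """
        lateral = neuron_positions_mm - electrode_position_mm
        distance = np.sqrt(lateral ** 2 + electrode_distance_mm ** 2)
        return np.exp(-distance / length_constant_mm)

    positions = np.linspace(0.0, 30.0, 301)   # neurons along 30 mm of cochlea
    for d in (0.5, 2.0, 4.0):                 # electrode close vs. far (mm)
        profile = activation_profile(positions, 15.0, d)
        above = positions[profile >= 0.5 * profile.max()]
        print(f"distance {d} mm: excitation width at half max ≈ "
              f"{above[-1] - above[0]:.1f} mm")

In this toy model the excitation pattern broadens as the electrode-to-neuron distance grows, which is the qualitative behavior described above.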

Effects of Spectral Resolution and Distortion on Speech

When CIs were gaining popularity in the 1970s, it was widely thought that they would never be able to restore good speech understanding because they could not reproduce the fine structure of the normal cochlea in either the temporal or the spectral domain. With such a crude signal produced by the implant, most people thought it could not produce useful perceptual results. When multichannel implants were introduced, some patients were able to recognize speech well enough to converse on the …

Effects of Spectral Resolution on Music

Speech and music are the two most important cultural acoustic signals. Are they similar in their demands on auditory processing? We have just reviewed studies showing that speech can be recognized when represented by only four bands of modulated noise. As the speech materials and listening conditions become more difficult, more spectral resolution is required. How does music depend on spectral resolution? Is it similar to speech? The answer is clearly no. Speech and music have completely different …
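The "bands of modulated noise" referred to here come from channel-vocoder processing: the signal is divided into a small number of analysis bands, the temporal envelope of each band is extracted, and each envelope is used to modulate band-limited noise in the same frequency region. The following is a minimal Python sketch of that approach; the Butterworth filters, the four log-spaced bands between 300 Hz and 6 kHz, and the 160 Hz envelope cutoff are simple assumptions, not the exact parameters of the original studies.

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def noise_vocoder(signal, fs, n_channels=4, lo_hz=300.0, hi_hz=6000.0,
                      env_cutoff_hz=160.0):
        """Minimal noise-band vocoder: analysis bands -> envelopes -> noise carriers."""
        edges = np.geomspace(lo_hz, hi_hz, n_channels + 1)
        env_sos = butter(2, env_cutoff_hz, btype="low", fs=fs, output="sos")
        out = np.zeros_like(signal, dtype=float)
        for lo, hi in zip(edges[:-1], edges[1:]):
            band_sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
            band = sosfiltfilt(band_sos, signal)             # analysis band
            envelope = sosfiltfilt(env_sos, np.abs(band))    # rectify and smooth
            carrier = sosfiltfilt(band_sos, np.random.randn(len(signal)))
            out += np.clip(envelope, 0.0, None) * carrier    # envelope-modulated noise
        return out

    # Example: vocode one second of a synthetic amplitude-modulated tone at 16 kHz
    fs = 16000
    t = np.arange(fs) / fs
    test = np.sin(2 * np.pi * 120 * t) * (1.0 + 0.5 * np.sin(2 * np.pi * 3 * t))
    vocoded = noise_vocoder(test, fs, n_channels=4)

With only four channels the output carries almost no spectral detail, yet the temporal envelope within each broad band is preserved, which is the information that allows listeners to recognize the speech.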

Conclusions

The number of spectral channels needed depends on the difficulty of the listening task and situation.

Speech recognition, because it is a highly trained pattern recognition process, requires only four spectral channels of envelope information in the appropriate tonotopic place. Six to eight spectral channels are required for speech recognition in noisy listening conditions, for difficult speech materials, or for listeners who are not native speakers of the language. Even more spectral channels …
