Elsevier

Speech Communication

Volume 27, Issue 1, February 1999, Pages 1-18
Acoustic and perceptual study of phonetic integration in Spanish voiceless stops

https://doi.org/10.1016/S0167-6393(98)00064-8

Abstract

The relationship between the acoustic content and the perceptual identification of Spanish voiceless stops, /p/, /t/ and /k/, in word initial position, has been studied in two different conditions: (a) Isolated plosive noise (C condition); (b) Plosive noise plus 51.2 ms of the following vowel (CV condition). The purpose of the study was to assess whether there was a clear correspondence between the perceptual identification made by listeners and the acoustic classification performed using a spectral representation combined with the duration, energy and zero-crossings of the plosive noise. The acoustic classification was represented by a distance profile, formed by the acoustic distances between a given token and the three classes corresponding to the three stops. The acoustic distances were defined as the a posteriori probabilities of membership in each class (APP scores). The perceptual identification was represented by a response profile, formed by the number of listeners' responses assigned to each of the classes for a given token. The correlation between the acoustic and perceptual distances increased from the C condition (overall correlation 0.81) to the CV condition (0.95), indicating that as the signal is more precisely defined from the perceptual point of view, the acoustic content is also less ambiguous, since both the perceptual and acoustic classifications improve in the CV condition. The best correlations were achieved when the variables obtained in the temporal domain (duration, energy and zero-crossings of the plosive noise) were included in the analysis.

Summary (German abstract)

The relationship between the acoustic content and the perceptual identification of the Spanish voiceless stops /p/, /t/ and /k/ in word-initial position was studied under two conditions: (a) isolated plosive noise (C condition); (b) plosive noise plus 51.2 ms of the following vowel (CV condition). The purpose of the study was to determine whether there was a clear correspondence between the perceptual identification by the listeners and the acoustic classification, which was performed by combining a spectral representation with the duration, energy and zero-crossings of the plosive noise. The acoustic classification was represented by a distance profile consisting of the acoustic distances between a given token and the three classes corresponding to the three stops. The acoustic distances were defined as the a posteriori probabilities of membership in each class (APP scores). The perceptual identification was represented by a response profile consisting of the number of listener responses assigned to each class for a given token. The correlation between the acoustic and perceptual distances increased from the C condition (overall correlation 0.81) to the CV condition (0.95): as the signal becomes more precisely defined from the perceptual point of view, the acoustic content also becomes less ambiguous, since both the perceptual and the acoustic classifications improve in the CV condition. The best correlations were achieved when the variables from the temporal domain (duration, energy and zero-crossings of the plosive noise) were included in the analysis.

Introduction

One of the main problems in the study of speech is how to characterize and represent the process employed by listeners to perceive and identify the phonetic content of the acoustic waveform. Closely related to that problem is the fact that speech sounds are not organized sequentially and independently from the point of view of perception. The listener is able to decode the complex speech signals and sequentially organize them in a series of meaningful linguistic units, phonemes, syllables, etc. Those units, however, do not seem to be present in the acoustic signal in the same way as in the perceptual representation. That means that each phoneme does not necessarily correspond to a given portion of sound in the utterance. It is a well-known fact that adjacent segments often influence each other in the perception of a given phoneme (Nygaard and Pisoni, 1995).

Stop consonants are one of the classes of phonemes whose acoustic and perceptual study is interesting due to their highly dynamic characteristics, and the coarticulatory processes involved. Voiceless stop consonants are formed by a sudden transient which is usually called the release burst, plus a friction segment and an aspiration noise, which may or may not be present in the signal. The unvoiced segment formed by the release burst, the friction segment, and the aspiration noise, whenever it exists, will be henceforth denoted as plosive noise. After that unvoiced segment comes the onset of voicing and a vocalic transition following the movement of the articulators. Voiceless Spanish stops have three places of articulation: labial (/p/), dental (/t/) and velar (/k/). Their acoustic characteristics have been described in several works (Quilis, 1989; Torres and Iparaguirre, 1996), and are similar to those of French or Dutch stops, which are also unaspirated (Smits et al., 1996; Bonneau et al., 1996).

The burst of labial stops shows a diffuse spectrum, although concentrations of energy can sometimes be seen in some vocalic contexts. The dental burst shows, on average, a concentration of energy roughly between 2500 and 4000 Hz in the context of /i/, while in the /u/ context that prominence is located between 2000 and 3500 Hz. In other contexts, the spectrum is relatively flat. The velar burst shows a spectral prominence close to 1000 Hz in the /o/ and /u/ contexts, and around 3500 Hz in the /i/ context (Fig. 1).

Numerous papers have been devoted to the study of the perception of stop consonants in a vowel environment, from different points of view. The main issues in stop perception concern the amplitude, durational and spectral characteristics of the burst section, and whether the associated vocalic transition is a necessary or sufficient cue for stop perception. Associated with those issues is the question of whether invariant cues to place of articulation can be found in the speech signal, or whether the signal lacks such invariance, so that context-dependent cues are necessary for identification.

Ohde and Stevens (1983) studied the relative amplitude of the release burst in synthetic CV syllables and found that it significantly affected the perception of the place of articulation of voiceless stop consonants. Hedrick and Jesteadt (1996) studied the effect of the relative amplitude of the burst, together with presentation level and vowel duration, on the perception of voiceless stop consonants by normal and hearing-impaired listeners. Their results suggest that normal listeners weight the relative amplitude and vocalic transitions differently from hearing-impaired subjects. Van Tasell et al. (1987) studied the time-intensity envelope of speech and found that envelope features can be efficiently used for stop perception even in the absence of spectral information.

Some authors have also studied the temporal characteristics of voiceless stops, particularly the durations required for the identification of place. It is generally accepted that 20–30 ms of initial stops are enough for stop identification (Tekieli and Cullinan, 1979; Krull, 1990; Bonneau et al., 1996). That does not necessarily mean that shorter portions should exhibit lower scores. Tekieli and Cullinan (1979) determined the minimum initial portion durations required by listeners for the identification of consonant–vowel syllables. They found that the first 10 ms of the CV syllable contained enough information for better-than-chance identification of place of articulation in voiceless stop consonants, and concluded that the burst is sufficient for the correct identification of stop consonants. This opinion is shared by other authors such as LaRiviere et al. (1975), Winitz et al. (1972), and Cole and Scott (1974a), for whom the vocalic transition is neither a sufficient nor a necessary cue for voiceless stop recognition. More recent works hold the opposite view. For instance, the results of Ohde (1988) support a model of stop consonant perception that includes spectral and time-varying spectral properties as integral components of analysis. A similar view is shared by Dorman et al. (1977), who found trading relationships between the release bursts and vocalic transitions: where the perceptual weight of one increased, the weight of the other declined. Bonneau et al. (1996) found that, although 20–30 ms of the initial CV syllables contained enough cues for a correct identification of the voiceless stops without a priori knowledge of the subsequent vowel, performance was context-dependent. Moreover, they also found that near-perfect identification of stops can only be achieved when all the main cues (burst spectrum, burst duration and onset of vocalic formants) are present simultaneously.

Voiceless stop consonants have also been the subject of extensive acoustic analysis (see, for instance, Crystal and House, 1988; Deng and Braam, 1994; Jongman et al., 1985; Kobatake and Ohtani, 1987; Stevens and Blumstein, 1978; Halle et al., 1957; Blumstein and Stevens, 1979; Kewley-Port, 1982). Perhaps the most interesting works have been devoted to the acoustic classification of the stop consonants using two basic approaches. The first adopts the invariant theory of Stevens and Blumstein (1978), based on static cues obtained at consonant release (Jongman and Miller, 1991; Torres and Iparaguirre, 1996). The second is based on the dynamic approach of Kewley-Port (1982), which emphasizes the need to use the dynamic information contained in the vocalic transition (Nossair and Zahorian, 1991; Tanaka, 1981; Forrest et al., 1988). A similar approach had already been considered by Searle et al. (1979) in their study on stop consonant discrimination. The correct recognition scores for the voiceless stop consonants in the first approach are generally below 80%, while those of the second approach vary roughly between 80 and 95%. Moreover, the results of Nossair and Zahorian (1991) emphasize the need for using dynamic spectral transition cues, which they found to be invariant for place of articulation in initial voiceless stops.

The purpose of this paper is to study the phonetic integration between the static and dynamic cues to place of articulation in word-initial Spanish stop consonants from both an acoustic and a perceptual point of view, and to assess the degree of correlation between the acoustic representation and the perceptual identification. There is no clear definition of phonetic integration. From a perceptual point of view, two segments corresponding to acoustically different components of the signal are integrated when they both contribute to the perception of a given phonetic category (Repp, 1988). From an acoustical point of view, phonetic integration can be interpreted as a procedure that combines the acoustic characteristics of both segments and evaluates them jointly.

Two different perceptual experiments were carried out: first the plosive noise was presented to a set of listeners (henceforth denoted as the C condition); and second 51.2 ms of the following vowel were added to the noise (henceforth denoted as the CV condition). Those two segments were then submitted to an acoustical analysis, and the signals were acoustically classified in terms of place of articulation. Then, the correlation between the acoustic and perceptual representation was calculated. Our hypothesis is that, if the stop consonant is better defined in the CV condition from a perceptual point of view, then the acoustic representation must reflect that fact, and the correlation between perceptual and acoustic representation should be higher than in the C condition. If there is no improvement in the perceptual identification for the CV condition, the correlations should also reflect that fact.
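The comparison described above can be illustrated with a minimal sketch: each token has an acoustic profile (three APP scores) and a perceptual response profile (listener counts per class), and the two are correlated. The token values, the helper names `pearson` and `response_profile`, the use of Pearson's coefficient, and the pooling of all profile entries into a single overall coefficient are assumptions made for illustration; the text does not specify how the overall correlation was computed.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


def response_profile(counts):
    """Convert listener response counts for (/p/, /t/, /k/) into proportions."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical tokens (not data from the study):
# (APP scores for /p/, /t/, /k/), (listener response counts for /p/, /t/, /k/)
tokens = [
    ([0.80, 0.15, 0.05], [16, 3, 1]),
    ([0.10, 0.70, 0.20], [2, 15, 3]),
    ([0.05, 0.25, 0.70], [1, 4, 15]),
]

# Assumption: pool every profile entry of every token into one long vector.
acoustic = [p for app, _ in tokens for p in app]
perceptual = [p for _, counts in tokens for p in response_profile(counts)]

overall_r = pearson(acoustic, perceptual)
print(f"overall correlation: {overall_r:.2f}")
```

Under the hypothesis stated above, running the same computation on C-condition and CV-condition profiles would be expected to give a higher coefficient for the CV condition.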

Perceptual analysis

Eighteen subjects (9 men and 9 women) served as speakers in the experiments. They were all native Spanish speakers, aged between 20 and 40, with no known history of speech or hearing disorders. They were asked to utter a series of two-syllable words (CVCV) in citation form, whose first syllable was formed by combining a voiceless stop (/p/, /t/ or /k/) with one of the five vowels (/a/, /e/, /i/, /o/ and /u/). The total number of stimuli was 270 = 3 stops × 5 vowels × 18

Perceptual tests

Prior to any analysis of the perceptual data, a statistical procedure was carried out on the listeners' responses in order to select a group of listeners for which the correct identification scores are within the interval average value ± one standard deviation of the sample. This procedure was done to avoid the inclusion of listeners with extremely low/high correct scores in the final results. For that purpose, the average value and standard deviation of the correct identification scores of all

Discussion

In this paper we have studied the perception and acoustic characterization of the Spanish voiceless stops, and the relation between them. For that purpose, two different conditions were chosen: isolated plosive noise (C condition), and plosive noise plus 51.2 ms of the following vowel (CV condition), in five vocalic contexts.

The results obtained in the perceptual experiments do not completely agree with those of other authors. In our study, in the C condition, /k/ is better identified than /p/

Acknowledgements

The authors would like to thank Francisco Cruz for his assistance in the perceptual experiments.

References (43)

  • S.B. Davis et al., Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust. Speech Signal Process. (1980)
  • L. Deng et al., Context-dependent Markov model structured by locus equations: Applications to phonetic classification, J. Acoust. Soc. Amer. (1994)
  • M.F. Dorman et al., Stop-consonant recognition: Release bursts and formant transitions as functionally equivalent, context-dependent cues, Perception and Psychophysics (1977)
  • Feijoo, S., Dominguez, J.A., Viso, M., Balsa, R., 1995. A pattern recognition approach for the identification of...
  • K. Forrest et al., Statistical analysis of word-initial voiceless obstruents: Preliminary data, J. Acoust. Soc. Amer. (1988)
  • Fukunaga, K., 1972. Statistical Pattern Recognition. Academic Press, New...
  • S. Furui, Speaker-independent isolated word recognition using dynamic features of speech spectrum, IEEE Trans. Acoust. Speech Signal Process. (1986)
  • S. Furui, On the role of spectral transition for speech perception, J. Acoust. Soc. Amer. (1986)
  • M. Halle et al., Acoustic properties of stop consonants, J. Acoust. Soc. Amer. (1957)
  • M.S. Hedrick et al., Effect of relative amplitude, presentation level, and vowel duration on perception of voiceless stop consonants by normal and hearing-impaired listeners, J. Acoust. Soc. Amer. (1996)
  • A. Jongman et al., Method for the location of burst-onset spectra in the auditory-perceptual space: A study of place of articulation in voiceless stop consonants, J. Acoust. Soc. Amer. (1991)