Acoustic and perceptual study of phonetic integration in Spanish voiceless stops
Introduction
One of the main problems in the study of speech is how to characterize and represent the process employed by listeners to perceive and identify the phonetic content of the acoustic waveform. Closely related to that problem is the fact that speech sounds are not organized sequentially and independently from the point of view of perception. The listener is able to decode the complex speech signals and sequentially organize them in a series of meaningful linguistic units, phonemes, syllables, etc. Those units, however, do not seem to be present in the acoustic signal in the same way as in the perceptual representation. That means that each phoneme does not necessarily correspond to a given portion of sound in the utterance. It is a well-known fact that adjacent segments often influence each other in the perception of a given phoneme (Nygaard and Pisoni, 1995).
Stop consonants are one of the classes of phonemes whose acoustic and perceptual study is interesting due to their highly dynamic characteristics, and the coarticulatory processes involved. Voiceless stop consonants are formed by a sudden transient which is usually called the release burst, plus a friction segment and an aspiration noise, which may or may not be present in the signal. The unvoiced segment formed by the release burst, the friction segment, and the aspiration noise, whenever it exists, will be henceforth denoted as plosive noise. After that unvoiced segment comes the onset of voicing and a vocalic transition following the movement of the articulators. Voiceless Spanish stops have three places of articulation: labial (/p/), dental (/t/) and velar (/k/). Their acoustic characteristics have been described in several works (Quilis, 1989; Torres and Iparaguirre, 1996), and are similar to those of French or Dutch stops, which are also unaspirated (Smits et al., 1996; Bonneau et al., 1996).
The burst of labial stops shows a diffuse spectrum, although concentrations of energy can sometimes be seen in some vocalic contexts. The dental burst shows, on average, a concentration of energy roughly between 2500 and 4000 Hz in the context of /i/, while in the /u/ context that prominence is located between 2000 and 3500 Hz. In other contexts, the spectrum is relatively flat. The velar burst shows a spectral prominence close to 1000 Hz in the /o/ and /u/ contexts, and around 3500 Hz in the /i/ context (Fig. 1).
Numerous papers have been devoted to the study of the perception of stop consonants in a vowel environment, from different points of view. The main issues in stop perception concern the amplitude, durational and spectral characteristics of the burst section, and whether the associated vocalic transition is a necessary or sufficient cue for stop perception. Associated with those issues is the question of whether invariant cues to place of articulation can be found in the speech signal, or whether the signal lacks invariance, so that context-dependent cues are necessary for identification.
The relative amplitude of the release burst in synthetic CV syllables was studied by Ohde and Stevens (1983), who found that it significantly affected the perception of the place of articulation of voiceless stop consonants. Hedrick and Jesteadt (1996) studied the effect of the relative amplitude of the burst, together with presentation level and vowel duration, on the perception of voiceless stop consonants by normal and hearing-impaired listeners. Their results suggest that normal listeners weight the relative amplitude and vocalic transitions differently from hearing-impaired subjects. The time-intensity envelope of speech was studied by Van Tasell et al. (1987), who found that envelope features can be efficiently used for stop perception even in the absence of spectral information.
Some authors have also studied the temporal characteristics of voiceless stops, particularly the durations required for the identification of place. It is generally accepted that 20–30 ms of initial stops are enough for stop identification (Tekieli and Cullinan, 1979; Krull, 1990; Bonneau et al., 1996). That does not necessarily mean that shorter portions should exhibit lower scores. Tekieli and Cullinan (1979) determined the minimum initial portion durations required by listeners for the identification of consonant–vowel syllables. They found that the first 10 ms of the CV syllable contained enough information for better than chance correct identification of place of articulation in voiceless stop consonants. They came to the conclusion that the burst is sufficient for the correct identification of stop consonants. This opinion is shared by other authors such as LaRiviere et al. (1975), Winitz et al. (1972), and Cole and Scott (1974a): the vocalic transition is neither a sufficient nor a necessary cue for voiceless stop recognition. More recent works take the opposite view. For instance, the results of Ohde (1988) support a model of stop consonant perception that includes spectral and time-varying spectral properties as integral components of analysis. A similar view is shared by Dorman et al. (1977), who found trading relationships between the release bursts and vocalic transitions: where the perceptual weight of one increased, the weight of the other declined. Bonneau et al. (1996) found that, although 20–30 ms of the initial CV syllables contained enough cues for a correct identification of the voiceless stops without a priori knowledge of the subsequent vowel, performance was context-dependent. Moreover, they also found that near-perfect identification of stops can only be achieved when all the main cues (burst spectrum, burst duration and onset of vocalic formants) are present simultaneously.
Extensive acoustic analyses have also been made of voiceless stop consonants (see, for instance, Crystal and House, 1988; Deng and Braam, 1994; Jongman et al., 1985; Kobatake and Ohtani, 1987; Stevens and Blumstein, 1978; Halle et al., 1957; Blumstein and Stevens, 1979; Kewley-Port, 1982). Perhaps the most interesting works have been devoted to the acoustic classification of the stop consonants using two basic approaches. The first adopts the invariance theory of Stevens and Blumstein (1978), based on static cues obtained at consonant release (Jongman and Miller, 1991; Torres and Iparaguirre, 1996). The second is based on the dynamic approach of Kewley-Port (1982), which emphasizes the need to use the dynamic information contained in the vocalic transition (Nossair and Zahorian, 1991; Tanaka, 1981; Forrest et al., 1988). A similar approach had already been considered by Searle et al. (1979) in their study on stop consonant discrimination. The correct recognition scores for the voiceless stop consonants in the first approach are generally below 80%, while those of the second approach vary roughly between 80 and 95%. Moreover, the results of Nossair and Zahorian (1991) emphasize the need for using dynamic spectral transition cues, which they found to be invariant for place of articulation in initial voiceless stops.
The purpose of this paper is to study the phonetic integration between the static and dynamic cues to place of articulation in word-initial Spanish stop consonants from both an acoustic and a perceptual point of view, and to assess the degree of correlation between the acoustic representation and the perceptual identification. There is no clear definition of phonetic integration. From a perceptual point of view, two segments corresponding to acoustically different components of the signal are integrated when they both contribute to the perception of a given phonetic category (Repp, 1988). From an acoustical point of view, phonetic integration can be interpreted as a procedure that combines the acoustic characteristics of both segments and evaluates them jointly.
Two different perceptual experiments were carried out: first the plosive noise was presented to a set of listeners (henceforth denoted as the C condition); and second 51.2 ms of the following vowel were added to the noise (henceforth denoted as the CV condition). Those two segments were then submitted to an acoustical analysis, and the signals were acoustically classified in terms of place of articulation. Then, the correlation between the acoustic and perceptual representation was calculated. Our hypothesis is that, if the stop consonant is better defined in the CV condition from a perceptual point of view, then the acoustic representation must reflect that fact, and the correlation between perceptual and acoustic representation should be higher than in the C condition. If there is no improvement in the perceptual identification for the CV condition, the correlations should also reflect that fact.
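The comparison described above can be illustrated with a minimal sketch. The text does not state how the correlation was computed, so this sketch assumes a Pearson correlation between flattened 3 × 3 confusion matrices (rows: stop presented; columns: stop perceived or classified). The function names and all matrix values are hypothetical and serve only to show the shape of the calculation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def flatten(matrix):
    """Turn a list of rows into a single flat list of cells."""
    return [cell for row in matrix for cell in row]

# Hypothetical confusion matrices (proportions), order /p/, /t/, /k/.
# These numbers are invented for illustration, not taken from the study.
perceptual_c = [[0.70, 0.20, 0.10],
                [0.15, 0.75, 0.10],
                [0.05, 0.10, 0.85]]
acoustic_c = [[0.65, 0.25, 0.10],
              [0.20, 0.70, 0.10],
              [0.10, 0.10, 0.80]]

# Correlation between perceptual and acoustic representations (C condition).
r_c = pearson(flatten(perceptual_c), flatten(acoustic_c))
print(f"C condition correlation: {r_c:.3f}")
```

Under this assumed measure, the same calculation would be repeated with the CV-condition matrices, and the hypothesis predicts a higher coefficient there if the perceptual identification improves.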
Perceptual analysis
Eighteen subjects (9 men and 9 women) served as speakers in the experiments. They were all native Spanish speakers with no known history of speech or hearing disorders, with ages between 20 and 40 years old. They were asked to utter a series of two syllable words (CVCV) in citation form, whose first syllable was formed by the combination of a voiceless stop (/p/, /t/ and /k/), with one of the five vowels (/a/, /e/, /i/, /o/ and /u/). The total number of stimuli was 270=3 stops × 5 vowels × 18
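The stimulus count given above can be checked by enumerating the design. This is a small sketch only: the speaker numbering and the tuple layout are assumptions made for illustration, not the labelling used in the study.

```python
from itertools import product

stops = ["p", "t", "k"]              # voiceless stops
vowels = ["a", "e", "i", "o", "u"]   # vocalic contexts
n_speakers = 18                      # 9 men and 9 women

# One stimulus per (speaker, stop, vowel) combination; speaker IDs are hypothetical.
stimuli = [
    (speaker, stop + vowel)
    for speaker, stop, vowel in product(range(1, n_speakers + 1), stops, vowels)
]

print(len(stimuli))  # 3 stops x 5 vowels x 18 speakers
```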
Perceptual tests
Prior to any analysis of the perceptual data, a statistical procedure was carried out on the listeners' responses in order to select a group of listeners for which the correct identification scores are within the interval average value ± one standard deviation of the sample. This procedure was done to avoid the inclusion of listeners with extremely low/high correct scores in the final results. For that purpose, the average value and standard deviation of the correct identification scores of all
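The listener selection rule described above (keep scores within the average ± one standard deviation) can be sketched as follows. The use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not specify it, and the listener labels and scores are invented for illustration.

```python
from statistics import mean, stdev

def select_listeners(scores):
    """Keep listeners whose correct-identification score lies within
    mean ± one standard deviation of the whole sample."""
    values = list(scores.values())
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1): an assumption
    low, high = avg - sd, avg + sd
    return {name: s for name, s in scores.items() if low <= s <= high}

# Hypothetical percent-correct scores; not data from the study.
example_scores = {
    "L1": 82.0, "L2": 79.5, "L3": 55.0,
    "L4": 84.0, "L5": 97.0, "L6": 80.5,
}

selected = select_listeners(example_scores)
print(sorted(selected))
```

In this invented example the rule excludes both the lowest and the highest scorer, which matches the stated aim of avoiding listeners with extremely low or high correct scores.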
Discussion
In this paper we have studied the perception and acoustic characterization of the Spanish voiceless stops, and the relation between them. For that purpose, two different conditions were chosen: isolated plosive noise (C condition), and plosive noise plus 51.2 ms of the following vowel (CV condition), in five vocalic contexts.
The results obtained in the perceptual experiments do not completely agree with those of other authors. In our study, in the C condition, /k/ is better identified than /p/
Acknowledgements
The authors would like to thank Francisco Cruz for his assistance in the perceptual experiments.
References (43)
- et al., The duration of American-English stop consonants: An overview, J. Phonetics (1988)
- et al., Acoustic properties for dental and alveolar stop consonants: A cross-language study, J. Phonetics (1985)
- et al., Comparative study of several distortion measures for speech recognition, Speech Communication (1985)
- On spectral coarticulation in stop-vowel-stop syllables: Implications for automatic speech recognition, J. Phonetics (1987)
- et al., Acoustic parameters for place of articulation identification and classification of Spanish unvoiced stops, Speech Communication (1996)
- et al., Vowel identification: orthographic, perceptual and acoustic aspects, J. Acoust. Soc. Amer. (1982)
- et al., Acoustic invariance in speech production: Evidence from measurements of the spectral characteristics of stop consonants, J. Acoust. Soc. Amer. (1979)
- et al., Perception of the place of articulation of French stop bursts, J. Acoust. Soc. Amer. (1996)
- et al., The phantom in the phoneme: Invariant cues for stop consonants, Perception and Psychophysics (1974)
- et al., Toward a theory of speech perception, Psychological Review (1974)
- Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Trans. Acoust. Speech Signal Process.
- Context-dependent Markov model structured by locus equations: Applications to phonetic classification, J. Acoust. Soc. Amer.
- Stop-consonant recognition: Release bursts and formant transitions as functionally equivalent, context-dependent cues, Perception and Psychophysics
- Statistical analysis of word-initial voiceless obstruents: Preliminary data, J. Acoust. Soc. Amer.
- Speaker-independent isolated word recognition using dynamic features of speech spectrum, IEEE Trans. Acoust. Speech Signal Process.
- On the role of spectral transition for speech perception, J. Acoust. Soc. Amer.
- Acoustic properties of stop consonants, J. Acoust. Soc. Amer.
- Effect of relative amplitude, presentation level, and vowel duration on perception of voiceless stop consonants by normal and hearing-impaired listeners, J. Acoust. Soc. Amer.
- Method for the location of burst-onset spectra in the auditory-perceptual space: A study of place of articulation in voiceless stop consonants, J. Acoust. Soc. Amer.