Elsevier

Speech Communication

Volume 54, Issue 1, January 2012, Pages 40-54
Automatic analysis of Mandarin accented English using phonological features

https://doi.org/10.1016/j.specom.2011.06.003

Abstract

The problem of accent analysis and modeling has been considered from a variety of domains, including linguistic structure, statistical analysis of speech production features, and HMM/GMM (hidden Markov model/Gaussian mixture model) classification. These studies, however, fail to connect speech production from a temporal perspective through a final classification strategy. Here, a novel accent analysis system and methodology which exploits the power of phonological features (PFs) is presented. The proposed system exploits the knowledge of articulation embedded in phonology by building Markov models (MMs) of PFs extracted from accented speech. The Markov models capture information in the PF space along two dimensions of articulation: PF state-transitions and state-durations. Furthermore, by utilizing MMs of native and non-native accents, a new statistical measure of “accentedness” is developed which rates the articulation of a word by a speaker on a scale of native-like (+1) to non-native-like (−1). The proposed methodology is then used to perform an automatic cross-sectional study of accented English spoken by native speakers of Mandarin Chinese (N-MC). The experimental results demonstrate the capability of the proposed system to perform quantitative as well as qualitative analysis of foreign accents. The work developed in this study can be easily expanded into language learning systems, and has potential impact in the areas of speaker recognition and ASR (automatic speech recognition).

Highlights

► This study presents a new model for accent based on phonological features (PFs).
► Pronunciation variations are viewed as different paths in the PF space-time.
► Markov models learn the pronunciation variations between native and non-native speakers.
► Native speakers of Mandarin and American English are used for evaluation.
► Proposed system shows high correlation with human judgment of accent (0.89 with p < 0.01).

Introduction

Automatic accent analysis and classification is useful in speech science, with impact in many areas of speech technology such as automatic speech recognition (ASR) (Salvi, 2003, Zheng et al., 2005), speaker recognition (Mangayyagari et al., 2008), pronunciation modeling, pronunciation scoring, and language learning (Mak et al., 2003, Neri et al., 2006, Wei et al., 2006). Accent analysis is the process of identifying speech characteristics that contribute to a speaker’s accent. Accent structure can be based on one of three perspectives: (i) physical speech production analysis including phonemic, prosodic, and linguistic structure, (ii) acoustic waveform analysis based on signal processing feature extraction, and (iii) human perception, which is based on the salient traits extracted by the listener which characterize an accent. These represent the science and technology domains for accent research. In contrast, accent classification identifies a speaker’s accent based on the most discriminating speech characteristics. Here, cepstrum-based features are most widely used for accent classification (Angkititrakul and Hansen, 2006, Choueiter et al., 2008). Additionally, low-level speech features such as VOT (voice-onset time), word/phone durations, intonation patterns, formant behavior, etc. have also been used for modeling and classifying accents (Arslan and Hansen, 1996a, Arslan and Hansen, 1996b, Das and Hansen, 2004, Hansen et al., 2010). Finally, a number of modeling techniques including GMMs (Gaussian mixture models), HMMs (hidden Markov models), and SVMs (support vector machines) have been employed to learn accent characteristics and have shown good performance in classification (Arslan and Hansen, 1996b, Pedersen and Diederich, 2007). While the above-mentioned features and modeling techniques provide good classification accuracy, they do not offer a comprehensive insight into the major differences among the accents under consideration.
Since the origins of accent are embedded in production differences, it would be beneficial to automatically capture, compare and contrast the major articulatory characteristics of accents. Therefore, it is asserted that a framework based on phonological features (PFs) would offer the necessary breadth and depth to comprehensively analyze distinct accents. The ability of PFs to capture fine articulatory variations in speech has been demonstrated in the research literature (Scharenborg et al., 2007, Sangwan and Hansen, 2008, King et al., 2007). This motivates the design and development of a PF-based accent analysis system.

In this study, the objective is to develop an accent analysis system that automatically models the major differences in articulation characteristics of two accent groups (native and non-native speakers). Using these accent models, the system is able to identify the speech characteristics of an individual speaker as native or non-native. In fact, the models are used not only to identify speech characteristics of different accents but to score them as well. These new scores represent the measure of “accentedness”. As shown in Fig. 1, it is proposed to form a continuum of “accentedness”, which is bounded over [−1, +1], where values of −1 and +1 imply extremely non-native-like and native-like accent characteristics, respectively. The bounds identify the extremes in proficiency, and an individual speaker’s proficiency can be rated on this continuum. It is expected that the distribution of accentedness scores of non-native and native speakers would be similar to that shown in Fig. 1(a).
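
To make the bounded continuum concrete, the following is a minimal sketch of one possible way to map a pair of model log-likelihoods onto [−1, +1]. It is an illustration under stated assumptions, not the scoring formula used in this study: the function name `accentedness` and the equal-prior posterior mapping are hypothetical choices made here for illustration only.

```python
import math


def accentedness(loglik_native: float, loglik_nonnative: float) -> float:
    """Map two log-likelihoods to a score in [-1, +1].

    Assumption (not from the paper): with equal priors,
    P(native | x) = sigmoid(delta), where delta is the log-likelihood
    difference, so 2 * P(native | x) - 1 = tanh(delta / 2).
    """
    delta = loglik_native - loglik_nonnative
    return math.tanh(0.5 * delta)


# Equal evidence for both models gives the midpoint of the continuum.
neutral = accentedness(-10.0, -10.0)
# Strong evidence for the native model pushes the score towards +1.
native_like = accentedness(-5.0, -50.0)
# Strong evidence for the non-native model pushes the score towards -1.
nonnative_like = accentedness(-50.0, -5.0)
```

Under this assumed mapping the score is antisymmetric in its two arguments, so swapping the native and non-native likelihoods flips its sign.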

Our approach towards building the automatic accent analysis system relies on the use of phonological features (PFs). The use of PFs is especially beneficial, owing to the close relationship between PFs and articulatory/acoustic phonetics. The various PF dimensions enable a comprehensive accent analysis in the space of speech articulators. In this manner, the proposed accent analysis system is able to assign accentedness scores to various articulatory/acoustic traits of a speaker (e.g., aspiration, nasalization, rounding, etc.). Therefore, PFs allow the proposed system to independently look at fine articulatory/acoustic details, resulting in a more refined assessment of accent characteristics. PFs also enable differential diagnosis of a speaker’s accent, as shown in Fig. 1(c). As seen in the figure, the different PF dimensions can be independently assessed for accentedness per speaker, resulting in unique accent profiles. These accent profiles can provide very useful information for language learners as they clearly identify the student’s strengths and weaknesses. Furthermore, as shown in Fig. 1(b), a longitudinal study of the accent profile would identify areas of speech production improvement and stagnation as accent-relaying properties. Finally, the information provided by accent profiles can also be potentially used as (i) input features for dialect, accent, speaker or language identification systems, and (ii) conditioning knowledge for ASR systems.

The proposed analysis system models accent by exploiting two important properties of articulation embedded within PF sequences, namely, state-occupancy and state-transitions. State-occupancy captures the durational aspect of articulation. State-transitions capture the characteristics of articulatory motion. For example, the tongue movement from a velar to a dental place-of-articulation is a state-transition. In contrast, the time spent in the dental place-of-articulation is a state-duration (i.e., state-occupancy). Here, it is hypothesized that given the same articulation task (e.g., pronouncing a word) the statistical nature of state-transitions and state-durations of native and non-native speakers would be dramatically different. Hence, we propose to learn the statistical nature of state-transitions and state-durations for native and non-native articulation using Markov models (MMs). Subsequently, the MMs are used to compute the likelihood that an utterance was articulated by a native or non-native speaker. In this manner, the likelihoods can then be used to generate accentedness scores.
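
The idea of learning state-transition and state-duration statistics can be sketched as follows. This is a toy illustration, not the authors' implementation: the first-order transition model with add-alpha smoothing, the geometric duration model, the function names, and the frame sequences are all assumptions introduced here for illustration.

```python
import math
from collections import defaultdict
from itertools import groupby


def run_lengths(frames):
    """Collapse a frame-level state sequence into (state, duration) pairs."""
    return [(state, sum(1 for _ in group)) for state, group in groupby(frames)]


def train(sequences, states, alpha=1.0):
    """Estimate transition probabilities (add-alpha smoothed) and mean durations."""
    trans = {a: {b: alpha for b in states if b != a} for a in states}
    dur_sum, dur_cnt = defaultdict(float), defaultdict(int)
    for frames in sequences:
        runs = run_lengths(frames)
        for (a, _), (b, _) in zip(runs, runs[1:]):
            trans[a][b] += 1.0
        for state, duration in runs:
            dur_sum[state] += duration
            dur_cnt[state] += 1
    for a in states:
        total = sum(trans[a].values())
        for b in trans[a]:
            trans[a][b] /= total
    # States never observed in training fall back to a mean duration of 1 frame.
    mean_dur = {s: dur_sum[s] / dur_cnt[s] if dur_cnt[s] else 1.0 for s in states}
    return trans, mean_dur


def log_likelihood(frames, model):
    """Score a sequence: transition terms plus geometric duration terms."""
    trans, mean_dur = model
    runs = run_lengths(frames)
    ll = 0.0
    for (a, _), (b, _) in zip(runs, runs[1:]):
        ll += math.log(trans[a][b])
    for state, duration in runs:
        # Geometric duration model with the trained mean; p is clamped below 1
        # so that log(1 - p) stays finite.
        p = min(1.0 / mean_dur[state], 0.999)
        ll += (duration - 1) * math.log(1.0 - p) + math.log(p)
    return ll


# Hypothetical place-of-articulation sequences (one label per frame).
states = ["velar", "dental", "alveolar"]
native_train = [["velar"] * 3 + ["dental"] * 4, ["velar"] * 2 + ["dental"] * 5]
nonnative_train = [
    ["velar"] * 5 + ["alveolar"] * 2 + ["dental"] * 2,
    ["velar"] * 6 + ["alveolar"] * 1 + ["dental"] * 2,
]
native_model = train(native_train, states)
nonnative_model = train(nonnative_train, states)

test_utterance = ["velar"] * 3 + ["dental"] * 4
ll_native = log_likelihood(test_utterance, native_model)
ll_nonnative = log_likelihood(test_utterance, nonnative_model)
```

In this toy setup the test utterance follows the direct velar-to-dental path seen in the hypothetical native data, so the native model assigns it the higher log-likelihood. Real PF streams would have one such model per PF dimension and far more training data.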

The proposed accent analysis system is evaluated on the CU-Accent corpus (Angkititrakul and Hansen, 2006). In our experiments, native speakers of American English (N-AE) and Mandarin Chinese (N-MC) are drawn from CU-Accent as the native and non-native speaker groups. The N-MC speakers are further divided into two groups based on their AE-exposure (which is assumed to be equal to the length of their stay in the U.S.A.). The N-MC 1 and N-MC 2 groups correspond to the low and high exposure N-MC speakers, respectively. Using the target speakers in the above-mentioned speaker groups, a number of experiments are conducted to demonstrate the accuracy and utility of the proposed system. In the first experiment, human-assigned accentedness scores are collected for the N-AE and N-MC speakers in a listener evaluation study. Subsequently, as shown in Fig. 1(d), the correlation between human-assigned scores and automatic scores (generated by the proposed system) is studied. Our results show a good correlation (0.8, p < 0.0001) between the human-assigned and machine-assigned accentedness scores. Additionally, the correlation between the machine-assigned accentedness scores and L2-exposure of N-MC speakers is also reported. The results reported in this study corroborate previous findings where non-native proficiency is seen to increase with L2-exposure (Flege et al., 1997, Flege, 1988, Jia et al., 2006). Encouraged by these findings, an in-depth differential analysis which compares and contrasts the articulatory dissimilarities of the native and non-native speaker groups is performed. As shown in Fig. 1(c), the differential analysis assigns accentedness scores to every PF-dimension. Hence, the proficiency of speakers is easily compared along individual articulators. The differential analysis performed in this study suggests an imbalanced increase in proficiency among N-MC speakers with increased L2-exposure.
In particular, it is observed that N-MC speakers gain greater proficiency in (i) vowel articulation as opposed to consonant articulation, and (ii) duration aspects of articulation as opposed to transitional aspects. In this manner, the proposed accent analysis system is able to offer a comprehensive comparative analysis of non-native and native articulation. Additionally, the proposed accent-analysis algorithm is easy to implement and versatile in its usage. Therefore, the proposed system is beneficial to language learners and speech scientists as an assistive analysis tool. Finally, the proposed scheme can be further developed to integrate into speech technology such as ASR, speaker recognition, and accent/dialect/language identification.
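
The comparison between human-assigned and machine-assigned accentedness scores described above rests on a correlation coefficient. A minimal sketch of a Pearson correlation is given below, assuming that this is the statistic meant. The function name `pearson` and the score lists are hypothetical placeholders, not data or results from this study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical per-speaker scores on the [-1, +1] accentedness scale.
human_scores = [-0.8, -0.5, -0.1, 0.4, 0.9]
machine_scores = [-0.6, -0.7, 0.0, 0.3, 0.8]
r = pearson(human_scores, machine_scores)
```

A value of `r` close to +1 would indicate that the automatic scores rank speakers in much the same way as the listeners do.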

The remainder of this paper is organized as follows: in Section 2, the CU-Accent speech corpus is described. In this study, data from the CU-Accent corpus has been used for development and analysis. In Section 3, a brief review of PFs with respect to speech technology is presented. The hybrid features (HFs) system is also introduced, and our HMM-based PF extraction system is described. In Section 4, the proposed accent analysis model based on Markov models is developed. Finally, in Section 5, the in-depth comparative analysis of native vs. non-native accent is presented and discussed.

Section snippets

CU-Accent speech corpus

The CU-Accent corpus consists of speech utterances spoken by native speakers of American English (AE), Mandarin Chinese (MC), Turkish, Thai, Spanish, German, Japanese, Hindi, and French (Angkititrakul and Hansen, 2006). The corpus consists of several male as well as female speakers per native language (as mentioned above), where the utterances for each speaker were recorded over multiple sessions. In each data-collection session, the speakers were required to speak 5 tokens of 23

Phonological features (PFs)

Phonological features (PFs) are a generic concept in linguistics with several manifestations such as the binary features in The Sound Pattern of English (SPE), government phonology (GP), multi-valued (MV) features, and hybrid features (HFs) (Scharenborg et al., 2007, Frankel et al., 2007a). The different PF definitions are inspired by articulatory, acoustical, phonological, or a combination of different aspects of speech. In this paper, we use the HFs definition owing to their close relationship

Proposed accent model

In this section, we develop the proposed accent model. The HF system described in Section 3.2 captures two important aspects of articulation, namely, the HF state-transitions as well as HF state-occupancy. For example, in the articulation of the diphthong /aw/, the tongue shifts from a low-to-high position while occupying each state for a finite duration of time. As shown in Fig. 3, the vertical and horizontal movements of the tongue from a low to high position, and mid-front to mid-back

Results and discussion

The experimental evaluations presented in this section use the data from N-AE (native American English) and N-MC (native Mandarin Chinese) speakers in the CU-Accent corpus. To facilitate a cross-sectional study, the N-MC speakers are divided into two groups: N-MC 1 and N-MC 2 based on their L2-exposure. In particular, the N-MC 1 and N-MC 2 groups have L2-exposures of less than and greater than 2 yr, respectively. Furthermore, the 3 speaker groups are also divided into test and train groups. The train and test

Conclusion

In this study, an accent analysis system based on PFs (phonological features) was presented. The use of PFs as a framework for analysis was strongly motivated by the capability of PFs to capture the fine articulatory variations in speech. It was argued that since the origins of accent are strongly embedded in speech production, PFs would form an ideal platform for analysis. In the study presented, two aspects of articulation were exploited to model accent, namely, articulatory transitions and

Acknowledgement

This work was supported by the USAF under a subcontract to RADC, Inc., Contract FA8750-09-C-0067. (Approved for public release. Distribution unlimited.)

References (32)

  • L.M. Arslan et al. A study of temporal features and frequency characteristics in American English foreign accent. J. Acoust. Soc. Amer. (JASA) (1996)
  • Choueiter, G., Zweig, G., Nguyen, P., 2008. An empirical study of automatic accent classification. In: ICASSP, pp....
  • F. Chreist. Foreign Accent (1964)
  • Das, S., Hansen, J.H., 2004. Detection of voice onset time (VOT) for unvoiced stops (/p/,/t/,/k/) using the Teager...
  • J. Flege. Factors affecting degree of perceived foreign accent in English sentences. J. Acoust. Soc. Amer. (JASA) (1988)
  • Frankel, J., Magimai-Doss, M., King, S., Livescu, K., Cetin, O., 2007a. Articulatory feature classifiers trained on...