
ORIGINAL RESEARCH article

Front. Psychol., 07 February 2022
Sec. Psychology of Language

The Role of Categorical Perception and Acoustic Details in the Processing of Mandarin Tonal Alternations in Contexts: An Eye-Tracking Study

  • 1PhD/MA Program in Teaching Chinese as a Second Language, National Chengchi University, Taipei, Taiwan
  • 2Department of Chinese Language and Literature, Fudan University, Shanghai, China

This study investigated the perception of Mandarin tonal alternations in disyllabic words. In Mandarin, a low-dipping Tone3 is converted to a high-rising Tone2 when followed by another Tone3, a process known as third tone sandhi. Although previous studies showed statistically significant differences in F0 between a high-rising Sandhi-Tone3 (T3) and a Tone2, native Mandarin listeners failed to correctly categorize these two tones in perception tasks. The current study utilized the visual-world paradigm in eye-tracking to further examine whether acoustic details in lexical tone aid lexical access in Mandarin. Results showed that Mandarin listeners tend to process Tone2 as Tone2, whereas they tend to first process Sandhi-T3 as both Tone3 and Tone2, then detect the acoustic differences between the two tones revealed by the sandhi context, and finally activate the target word during lexical access. The eye-tracking results suggest that subtle acoustic details of F0 may facilitate lexical access in an automatic fashion in a tone language.

Introduction

Mandarin Chinese is a tonal language, which uses pitch to distinguish lexical meaning. It has four lexical tones, a high-level Tone1, a mid-rising Tone2, a low-dipping Tone3, and a high-falling Tone4, as well as a neutral tone (Chao, 1930). With this tonal inventory, Mandarin is well-known for its third tone sandhi, whereby a low-dipping Tone3 (T3) immediately followed by another T3 is altered into a rising tone similar to the mid-rising Tone2 (T2) (Chao, 1930; Lin, 2007). As a result, the sandhi-rising (SR) T3 and the canonical-rising (CR) T2 are both realized as rising tones and appear to be neutralized in this context. Neutralization is a phenomenon in which two different phonemes are realized as the same sound in certain phonetic environments. The third tone sandhi rule is traditionally/pedagogically described as a T3 becoming a T2 when followed by another T3. The extent of neutralization between sandhi-rising T3 (SR-T3) and canonical-rising T2 (CR-T2), however, remains controversial. Previous studies comparing SR-T3 and CR-T2 have suggested incomplete neutralization in acoustic details (Peng, 2000; Zhang and Lai, 2010; Yuan and Chen, 2014) but complete neutralization in perception in identification tasks (Wang and Li, 1967; Peng, 2000). In other words, although previous studies showed statistically significant differences in F0 between SR-T3 and CR-T2, native Mandarin listeners failed to correctly categorize these two tones in perception tasks (Peng, 2000).

With the development of research methodology, the perception and processing of Mandarin tones have been explored through eye-tracking and electroencephalography (EEG). In the field of phonetics, eye-tracking experiments were initially conducted to investigate the perception of segmental sounds. For example, the eye-tracking studies of consonants by McMurray et al. (2002, 2009) found that participants can perceive within-category voice onset time (VOT) differences of 5 ms. Their findings demonstrated effects of word-initial VOT on lexical access and supported models of spoken word recognition in which sub-phonemic detail is preserved in patterns of lexical activation for competing lexical candidates throughout the processing system. Eye-tracking techniques were later used to explore the perception of Mandarin tones. Malins and Joanisse (2010) used this method to examine how segmental and tonal information affect Chinese word recognition. Their results showed that in the process of Chinese word recognition, participants integrate segmental and tonal information in parallel. Such findings cannot be obtained from previous off-line, end-state experiments, whereas eye-tracking provides on-line, real-time data for exploring language processing. The eye-tracking technique can reveal the use of fine-grained acoustic information that off-line measurements or tasks do not capture, shedding light on how this information modulates target and competitor word activation as the speech signal unfolds. Later, Shen et al. (2013) conducted an eye-tracking experiment on the perception of Mandarin monosyllabic words with T2 and T3, which investigated how lexical tone perception of Mandarin T2 and T3 was influenced by the pitch height of the tone at onset, turning point, and offset. They found that native Mandarin listeners perceived the tone with a high-offset pitch as T2, while they perceived the tone with a low-offset pitch as T3. Shen et al. (2013) further explained that a low turning-point pitch served as a pivotal cue for T3 and prompted more eye fixations on T3 items, until the offset pitch directed significantly more fixations to the final tone choice. The findings indicated that in the perception of tones, the pitch height at critical points serves as an important perceptual cue. The results support the view that perception of tones is an incremental process.

In addition, Qin et al. (2019) compared the processing of Mandarin T1 and T2 by native Mandarin listeners and English listeners learning Chinese as a second language. They conducted an eye-tracking experiment using the visual world paradigm. Based on the phonetic distance between the target tone and the competitor, stimuli were manipulated such that the target tones were categorized into three conditions, including Standard condition (i.e., the target tone was canonical), Close condition (i.e., the target was phonetically closer to the competitor), and Distant condition (i.e., the target was phonetically more distant from the competitor). They found that within-category tonal information influenced both native and non-native participants’ word recognition, but did so in a different way for the two groups. In comparison with the Standard condition, Mandarin participants’ target-over-competitor word activation was enhanced in the Distant condition and inhibited in the Close condition, while English participants’ target-over-competitor word activation was inhibited in both the Distant and Close conditions.

Meanwhile, the processing and representation of Mandarin disyllabic words are relatively understudied and need more research. Since Mandarin T3 sandhi arises in context, examining T3 sandhi not only in isolation but also in context may provide more information. Chien et al. (2016) conducted an auditory–auditory priming lexical decision experiment to investigate the processing of Mandarin third tone sandhi words during spoken word recognition and their mental representations. In their priming experiment, each disyllabic tone sandhi target word (e.g., /tʂhu3 li3/) was preceded by one of three monosyllabic primes: a T2 prime (Surface-Tone overlap, /tʂhu2/), a T3 prime (Underlying-Tone overlap, /tʂhu3/), or a control prime (Baseline condition, /tʂhu1/). Their results showed that T3 primes (Underlying-Tone) elicited significantly stronger facilitation effects for the sandhi targets than T2 primes (Surface-Tone), with little effect of target frequency on the pattern of the priming effects. Thus, they proposed that Mandarin third tone sandhi words are represented as /T3 T3/ in the mental lexicon.

The EEG technique has also been applied to research on the perception and processing of Mandarin disyllabic words. For instance, Chien et al. (2020) used the oddball paradigm to elicit the mismatch negativity (MMN) in order to investigate the processing and representation of third tone sandhi words. This study used disyllabic /T2+T3/ (T2 condition), /T3+T4/ (T3 condition), and /T3+T3/ (sandhi condition) words as standards and an identical monosyllable [tʂu2] as the deviant in three separate conditions. The results in the first syllable time window showed that the T2 condition, in which 竹叶 /tʂu2 yε4/ “bamboo” (T2) served as the standard and [tʂu2] as the deviant, produced an MMN effect. They argued that this MMN effect was due to the surface acoustic differences between the first syllable of the standards and the deviant. The results in the first syllable position for the T3 condition, in which 主页 /tʂu3 yε4/ “main page” (T3) served as the standard and [tʂu2] as the deviant, also showed an MMN effect. This MMN effect could be due to the surface differences between the first syllable of the standards and the deviant, or to differences in the underlying representation. Interestingly, no MMN effect was elicited in the first syllable position for the sandhi condition, in which 主演 /tʂu3 jεn3/ “starring” served as the standard and [tʂu2] as the deviant. They argued that this was probably because the participants perceived the deviant [tʂu2] as the first syllable of 主演 /tʂu3 jεn3/ and converted the surface T2 into its underlying representation, or because the representation of the first syllable of T3 sandhi words is phonologically underspecified, so there was no mismatch between the deviant and the first syllable of the sandhi standards. According to their results, the surface acoustic information of T3 sandhi words seems less important when the experimental condition helps participants predict the following word; retrieval of the underlying phonological representation is the key.

In addition to on-line processing in the perception of Mandarin T3 sandhi words, one study has examined on-line processing in the production of Mandarin T3 sandhi words. Zhang et al. (2015) investigated event-related potentials (ERPs) in the covert production of Mandarin third tone sandhi in disyllabic words. Their stimuli included real words and pseudowords with T2-T3 and T3-T3 tonal combinations. The results showed that, in comparison to the disyllabic words with T2-T3, the second syllable of the sandhi words with T3-T3 induced a greater P2 amplitude, a component sensitive to phonological processing. Zhang et al. (2015) claimed that the results suggest that the phonological encoding of tonal combinations with T3 sandhi may be more effortful. They further claimed that phonological processing may not differ qualitatively between real words and pseudowords in the P2 time window. In addition, the findings indicated that the phonetic/phonological encoding of T3 sandhi occurs before the initiation of articulation. This research revealed on-line processing in the production of T3 sandhi words in Mandarin.

Previous studies on Mandarin T3 sandhi focused either on the acoustic and perceptual neutralization between SR-T3 and CR-T2 (Peng, 2000), or on how T3 sandhi words are processed and represented in the mental lexicon (e.g., Nixon et al., 2015; Chien et al., 2016, 2020). Few studies examined the dynamic processing of SR-T3 and CR-T2 as their acoustic signals unfold. In this study, we not only revisited the extent of neutralization between SR-T3 and CR-T2 in both production and perception, but, most importantly, also investigated the role of within-category acoustic details in the dynamic and automatic processing of tonal alternations in context. In order to approach this issue, we adopted an eye-movement tracking technique to provide detailed on-line processing information. Previous eye-tracking studies have shown that participants are sensitive to within-category differences in VOT (McMurray et al., 2002, 2009). In line with this, the current study employed the visual-world paradigm in eye-tracking, which taps into automatic processes, to investigate whether native listeners can perceive the differences between the two tones at the suprasegmental level, and whether the acoustic details in lexical tone can facilitate lexical access in Mandarin. It is expected that the findings can shed light on the role of categorical perception and acoustic details in the processing of Mandarin tonal alternations in context. In addition, the current results can reveal the subtle dynamic processing of disyllabic words with SR-T3 or CR-T2, as well as provide hints about later stages of spoken word recognition. Specifically, the main research questions are as follows: (1) Are SR-T3 and CR-T2 acoustically incompletely neutralized? (2) Are SR-T3 and CR-T2 perceptually completely neutralized? (3) Are native Mandarin listeners sensitive to the acoustic details distinguishing SR-T3 and CR-T2, and able to use this information automatically in lexical access? The current study included three experiments. The first was a speech production task comparing SR-T3 and CR-T2 to see whether we could replicate previous studies showing incomplete neutralization in F0 between them. The second was an identification task, and the last was an eye-tracking experiment. The identification task tapped into the phonological level since it induced more categorical processing, while the eye-tracking experiment tapped into automatic processing, at which level Mandarin listeners may show stronger sensitivity to subtle acoustic details.

Experiment 1: Production

The speech production experiment aimed to replicate previous studies which observed incomplete neutralization in F0 between sandhi-rising tone 3 (SR-T3) and canonical-rising tone 2 (CR-T2) (Peng, 2000; Zhang and Lai, 2010; Yuan and Chen, 2014). This experiment also provided the critical stimuli used in Experiment 2 (identification) and Experiment 3 (eye-tracking). We predicted that systematic differences in F0 between SR-T3 and CR-T2 would be obtained. Specifically, SR-T3 would show a lower average F0, a larger F0 difference between the onset and turning point, and a later turning point than CR-T2, indicating the influence of their respective underlying representations.

Participants

Twenty native Mandarin Chinese speakers from Northern China, aged between 20 and 24, were recruited (10 males and 10 females). None of them spoke any other Chinese dialects at the time of testing. They were also not simultaneous bilingual or early bilingual speakers of another non-Chinese language. All participants were university students with no reported language disability or hearing impairment. This research was reviewed and approved by the Human Subjects Committee of the Department of Chinese Language and Literature at Fudan University. All participants were asked to provide informed consent before the production experiment and were paid for their participation.

Stimuli

Ten minimal pairs of disyllabic T3 (SR) + T3 and T2 (CR) + T3 words with identical segments were used as critical stimuli (e.g., 百马 /paj3 ma3/ “hundreds of horses” vs. 白马 /paj2 ma3/ “white horse”). These two sets of words were adopted from Zhang et al. (2015) and matched in log word frequency [SR-T3 words: M = 0.480, SD = 1.046; CR-T2 words: M = 0.899, SD = 1.088; t(18) = 0.876, p = 0.393] (based on the corpus of Cai and Brysbaert, 2010) and stroke number [SR-T3 words: M = 8.2, SD = 4.54; CR-T2 words: M = 8.5, SD = 3.60; t(18) = 0.164, p = 0.872]. In addition, their first morphemes can be combined with several other morphemes to form disyllabic words. Another twenty disyllabic words with different tonal combinations were included as fillers: eight began with T1, four with T2, and eight with T4; eight ended with T1, four with T2, and eight with T4. Information on the 20 critical stimuli is shown in Appendix Table A1.

Procedure

First, participants completed a language background questionnaire and a consent form in a quiet room. Then they did the production experiment run by Paradigm (Tagliaferri, 2019) and were recorded in the Phonetics and Psycholinguistics Lab at Fudan University, with a cardioid microphone (Shure, model SM57) and a digital solid-state recorder (Zoom H4N), using a sampling rate of 44,100 Hz.

In each trial of the production experiment, the participants first saw a fixation cross in the middle of the screen for 500 ms, and then the stimulus for 2,000 ms, during which they were instructed to produce it as naturally as possible. Five practice trials were first provided to ensure that the participants fully understood the procedure of the task. The main experiment then presented a total of 120 tokens (20 critical stimuli and 20 fillers, each with three repetitions) in random order. The whole experiment took approximately 20 min. Participants’ productions of the critical stimuli were subjected to further analysis.

Data Analysis

The F0 tracks of the first vowel of the SR-T3 and CR-T2 words were measured using Praat (Boersma and Weenink, 2019) and defined from the onset of periodicity in the waveform to the peak of the pitch track in Praat. F0 tracks were extracted using the ProsodyPro Praat script (Xu, 2005/2010) and measured at every 11.11% of the F0 duration, generating 10 measurement points for each target vowel. The extracted F0 tracks were then checked for octave jumps. Whenever there were octave jumps, the target vowel was divided into ten equally spaced points and the value at each point was manually calculated as F0 = 1/T, where T is the duration (in seconds) of one period of the waveform. A total of 27 tokens (27/1,200 = 2.25%) were discarded due to creakiness (17 tokens) or mispronunciation (10 tokens). All tokens were judged by the two authors, who are native speakers of Mandarin Chinese.

The extracted F0 tracks using ProsodyPro were then converted into semi-tone using the formula in (1) below in order to better reflect pitch perception (Rietveld and Chen, 2006). Moreover, the semi-tone values were transformed into z-scores using the formula in (2) below to minimize variation due to gender and speaker identity (Rose, 1987; Zhu, 2004). The z-scores were subjected to statistical analysis.

$ST = 39.87 \times \log_{10}(\mathrm{Hz}/50)$   (1)
$z_{ST_x} = \dfrac{ST_x - \frac{1}{n}\sum_{i=1}^{n} ST_i}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(ST_i - \frac{1}{n}\sum_{j=1}^{n} ST_j\right)^{2}}}$   (2)
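To make formulas (1) and (2) concrete, the conversion can be written in a few lines of R. This is a minimal sketch, not the authors' actual script; the data frame f0_data and its columns f0_hz and speaker are hypothetical names.

```r
library(dplyr)

# Formula (1): convert F0 in Hz to semitones relative to a 50 Hz reference
hz_to_st <- function(hz, ref = 50) {
  39.87 * log10(hz / ref)
}

# Formula (2): z-score the semitone values within each speaker
# (sd() uses the n - 1 denominator, matching the formula)
f0_data <- f0_data %>%
  group_by(speaker) %>%
  mutate(st   = hz_to_st(f0_hz),
         st_z = (st - mean(st)) / sd(st)) %>%
  ungroup()
```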

Results and Discussion

Growth curve analyses (Mirman, 2014) were conducted to model the semi-tone z-scores, represented by ten data points, of SR-T3 and CR-T2 using the lme4 package in R (Bates et al., 2015), with p-values calculated by the lmerTest package (Kuznetsova et al., 2017). The linear, quadratic, and cubic time polynomials were entered as fixed factors. Three models were created by adding the three time polynomials one at a time as fixed factors and as random slopes for the participant random effect. A series of likelihood ratio tests were conducted to compare the three models, and the model that explained the most variance in the data was selected as the best model. Results showed that the model containing all three terms was optimal [linear vs. linear + quadratic: χ2(4) = 7,307.1, p < 0.001; linear + quadratic vs. linear + quadratic + cubic: χ2(5) = 1,389.1, p < 0.001], and within it all three terms were significant, indicating that the F0 tracks of SR-T3 and CR-T2 had an incomplete S-shape on an angle, as shown in Figure 1 (linear: β = 1.547, SE = 0.090, t = 17.708, p < 0.001; quadratic: β = 1.840, SE = 0.069, t = 26.566, p < 0.001; cubic: β = –0.656, SE = 0.031, t = –20.769, p < 0.001).
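As an illustration of the model-building procedure just described, the three nested models could be specified in R roughly as follows. This is a sketch under assumptions: the data frame d and its columns (st_z, point, participant) are hypothetical, and the exact random-effects structure used by the authors may differ.

```r
library(lme4)
library(lmerTest)

# Orthogonal linear, quadratic, and cubic polynomials over the 10 measurement points
# (point is assumed to be an integer from 1 to 10)
ot <- poly(1:10, 3)
d$ot1 <- ot[d$point, 1]
d$ot2 <- ot[d$point, 2]
d$ot3 <- ot[d$point, 3]

# Add one time term at a time, both as a fixed effect and as a participant random slope
m1 <- lmer(st_z ~ ot1 + (ot1 | participant), data = d, REML = FALSE)
m2 <- lmer(st_z ~ ot1 + ot2 + (ot1 + ot2 | participant), data = d, REML = FALSE)
m3 <- lmer(st_z ~ ot1 + ot2 + ot3 + (ot1 + ot2 + ot3 | participant), data = d, REML = FALSE)

# Likelihood ratio tests between the nested models
anova(m1, m2, m3)
summary(m3)  # p-values for the fixed effects via lmerTest
```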

Figure 1. Mean semi-tone z-scores of the Sandhi-Rising T3 and Canonical-Rising T2 syllables.

In order to further evaluate whether or not canonical-rising T2 (CR-T2) and sandhi-rising T3 (SR-T3) were acoustically completely neutralized in tone, two additional models were built based on the best model above in which three time terms were included. Model A included all the three time polynomials and Tone (CR-T2, SR-T3, with CR-T2 serving as the baseline) as fixed factors. Model B included the three time polynomials, Tone, and their interactions as fixed factors. For both models, a set of random effects were also included to capture participant-level variability in all three time polynomials and in Tone. Results of likelihood ratio tests showed that Model B was significantly better than Model A [χ2(3) = 28.967, p < 0.001]. Within Model B, all three time polynomials were significant, indicating that the F0 tracks of SR-T3 and CR-T2 (the baseline) had an incomplete S-shape on an angle (linear: β = 1.622, SE = 0.089, t = 18.239, p < 0.001; quadratic: β = 1.795, SE = 0.071, t = 25.186, p < 0.001; cubic: β = –0.683, SE = 0.036, t = –19.021, p < 0.001). Moreover, SR-T3 showed significantly lower semi-tone z-scores than CR-T2, as reflected by the negative estimate for Tone (β = –0.141, SE = 0.039, t = –3.590, p = 0.002). Significant interaction effects between Tone and the linear time term (β = –0.151, SE = 0.034, t = –4.429, p < 0.001) as well as between Tone and the quadratic time term (β = 0.090, SE = 0.034, t = 2.637, p = 0.008) showed that the shapes of SR-T3 and CR-T2 were different. More specifically, the negative estimate for the interaction between the linear time term and Tone indicates that SR-T3 had a more negative slope compared to CR-T2, while the positive estimate for the interaction between the quadratic time term and Tone suggests that SR-T3 had a more convex shape (i.e., U shape) relative to CR-T2. The production results replicated those of previous studies (Peng, 2000; Zhang and Lai, 2010; Yuan and Chen, 2014), showing lower average F0 for SR-T3 and differences in F0 contour between the two tones. SR-T3 and CR-T2 were acoustically incompletely neutralized.
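Continuing the sketch above (same hypothetical data frame d, with a tone column releveled so that CR-T2 is the baseline), Models A and B might be specified as follows; the random-effects specification is again an assumption rather than the authors' exact code.

```r
d$tone <- relevel(factor(d$tone), ref = "CR-T2")

# Model A: time polynomials plus a main effect of Tone
mA <- lmer(st_z ~ ot1 + ot2 + ot3 + tone +
             (ot1 + ot2 + ot3 + tone | participant),
           data = d, REML = FALSE)

# Model B: additionally includes the Time-by-Tone interactions
mB <- lmer(st_z ~ (ot1 + ot2 + ot3) * tone +
             (ot1 + ot2 + ot3 + tone | participant),
           data = d, REML = FALSE)

anova(mA, mB)  # likelihood ratio test with 3 df for the three interaction terms
summary(mB)
```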

Experiment 2: Identification

The identification experiment aimed to investigate whether native Mandarin listeners were perceptually sensitive to the acoustic differences between SR-T3, derived from third tone sandhi, and CR-T2. It examined native speakers’ categorical perception of Mandarin tonal alternations. In order for the identification stimuli to reflect the overall production pattern in F0, the SR-T3 tokens used for the identification task had lower F0 than the CR-T2 tokens. Given this stimulus selection, better-than-chance signal detectability would suggest that native Mandarin listeners are sensitive to the subtle acoustic differences between the two tones and able to use them for lexical access. Chance-level performance would suggest listeners’ inability to detect the small acoustic differences between SR-T3 and CR-T2 words during spoken word recognition.

Participants

In the identification task, 32 native Mandarin listeners (18 females and 14 males; age range: 21–24 years old; mean age: 23.6 years old) from Northern China were recruited. None of them spoke any other Chinese dialects at the time of testing. They were also not simultaneous bilingual or early bilingual speakers of another non-Chinese language. None of them had participated in the production experiment. They were all university students with no reported language disability or hearing impairment. This research was reviewed and approved by the Human Subjects Committee of the Department of Chinese Language and Literature at Fudan University. All participants were asked to provide informed consent before the identification experiment and were paid for their participation.

Stimuli

The words used in the identification experiment were the same as the 10 pairs of critical words used in the production experiment (e.g., 百马 /paj3 ma3/ “hundreds of horses” vs. 白马 /paj2 ma3/ “white horse”) (Zhang et al., 2015). The auditory stimuli were taken from one Shandong female speaker and one Hebei male speaker’s productions in Experiment 1, whose average F0 of the 10 pairs of words was closely matched with that produced by all the 20 speakers. The selected SR-T3 tokens were always lower than the selected CR-T2 tokens in average F0 (see Figure 2), which was consistent with the statistical results obtained in the production experiment. In addition, the mean first syllable duration of the SR-T3 words was 272 ms, while the mean first syllable duration of the CR-T2 words was 260 ms. The mean second syllable duration for both groups of words was 411 ms. Independent-samples t-tests showed that neither the first syllable duration [t(38) = –1.088, p = 0.283] nor the disyllable duration [t(38) = –0.436, p = 0.665] was significantly different between the two groups.

Figure 2. Mean F0 tracks of the Sandhi-Rising T3 and Canonical-Rising T2 syllables for the female (left panel) speaker and male (right panel) speaker.

Procedure

First, participants completed a language background questionnaire and a consent form in the Phonetics and Psycholinguistics Lab at Fudan University. Then they did the forced-choice identification task run by Paradigm (Tagliaferri, 2019). During each trial, the participants first saw a fixation cross in the middle of the screen for 1,000 ms. As soon as it disappeared, a pair of SR-T3 and CR-T2 words with the tonal pattern of T3+T3 and T2+T3 were shown in Simplified Chinese characters (for example, 白马 vs. 百马) for 3,000 ms, and the participants were instructed to look at the two words during this time. The SR-T3 word always appeared to the right of the previously presented fixation cross and the CR-T2 word to the left. After the disappearance of the two words, the participants heard one of them via headphones. Immediately after the offset of the auditory stimulus, the two words were shown on the screen again. The participants were requested to identify which word they just heard by clicking the mouse, with the right button representing the SR-T3 word and the left button referring to the CR-T2 word. Before the main experiment, eight practice trials were presented to ensure that all the participants understood the experimental procedure. The 20 critical stimuli produced by the male speaker and the 20 by the female speaker were presented in two separate blocks. The block order was counterbalanced across participants and the trials were randomized within each block. The whole experiment took approximately 20 min.

Results and Discussion

Mandarin listeners’ identification performance was evaluated using the formula in (3) for calculating A-prime scores (Grier, 1971; Snodgrass et al., 1985; Peng, 2000; So and Best, 2010), which reflect signal detectability and consider not only correct responses, but also false alarms. A’ scores range from 0 to 1, with a score of 1 indicating perfect performance and 0.5 representing random responses.

$A' = 0.5 + \dfrac{(y - x)(1 + y - x)}{4y(1 - x)}$   (3)¹

The 32 Mandarin listeners’ mean A’ score was 0.517 with a standard deviation of 0.024. A one-sample t-test was conducted on participants’ A’ scores with a test value of 0.5. Although the mean A’ score was numerically very close to 0.5, it was still statistically significantly different from 0.5 [t(31) = 3.944, p < 0.001], suggesting that Mandarin listeners may be sensitive to the subtle acoustic differences between SR-T3 and CR-T2 in the forced-choice identification task, which required controlled processes.
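A minimal R sketch of how the A’ scores in formula (3) and the one-sample t-test can be computed; the per-participant hit-rate and false-alarm-rate vectors (hits, fas) are hypothetical names.

```r
# Formula (3): A-prime from hit rate y and false-alarm rate x (assumes y >= x)
a_prime <- function(y, x) {
  0.5 + ((y - x) * (1 + y - x)) / (4 * y * (1 - x))
}

a_scores <- a_prime(hits, fas)  # one score per participant

# Test whether sensitivity exceeds the chance level of 0.5
t.test(a_scores, mu = 0.5)
```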

In order to further understand the A’ results, a series of generalized linear mixed-effects models were fitted to participants’ accuracy data using the lme4 package (Bates et al., 2015) in R, with p-values calculated using the lmerTest package (Kuznetsova et al., 2017). Participants’ accuracy was entered as a binomial dependent variable, with correct responses coded as 1 and incorrect responses coded as 0. Condition (SR-T3 vs. CR-T2), Talker (Female Talker vs. Male Talker), and their interaction were treated as fixed factors. For Condition, CR-T2 was set as the baseline to which SR-T3 was compared, while for Talker, Female Talker was set as the baseline to which Male Talker was compared. Participant and Item were entered as random factors. Likelihood ratio tests using forward stepwise selection were conducted to determine the best model. The model that contained the most fixed factors and fit significantly better than the model with one fewer variable was selected as the optimal model and is reported below.
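A sketch of what the full accuracy model might look like in lme4; the data frame acc and its columns are hypothetical, and simpler nested models (e.g., without the interaction) would be compared against it with anova() during stepwise selection.

```r
library(lme4)

acc$condition <- relevel(factor(acc$condition), ref = "CR-T2")   # CR-T2 as baseline
acc$talker    <- relevel(factor(acc$talker),    ref = "Female")  # Female Talker as baseline

# Accuracy (1 = correct, 0 = incorrect) modeled with Condition, Talker, and their interaction,
# plus random intercepts for Participant and Item
m_acc <- glmer(correct ~ condition * talker + (1 | participant) + (1 | item),
               data = acc, family = binomial)
summary(m_acc)
```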

Table 1 shows the results of the accuracy data obtained in the identification task. As can be seen, the negative coefficient estimate for Condition indicates that Mandarin listeners made significantly more errors for SR-T3 words than for CR-T2 words. The negative coefficient estimate for Talker reveals that Mandarin listeners made significantly more errors when hearing the male speaker’s stimuli than when hearing the female speaker’s stimuli. Since the interaction between Condition and Talker was significant, two subsequent generalized linear mixed-effects models were conducted on participants’ accuracy data within Male and Female talkers, respectively, with Condition (SR-T3 vs. CR-T2) as a fixed factor, and Participant and Item as random factors. Results showed that Mandarin listeners made significantly more errors for SR-T3 words than for CR-T2 words when hearing the female speaker’s stimuli (β = -1.026, SE = 0.330, t = –3.105, p = 0.002), while they made similar numbers of errors for SR-T3 and CR-T2 words when hearing the male speaker’s stimuli (β = 0.152, SE = 0.309, t = 0.494, p = 0.621). The result pattern may be due to the fact that the acoustic difference between the SR-T3 and CR-T2 stimuli produced by the female speaker was larger than that between the SR-T3 and CR-T2 stimuli produced by the male speaker.

Table 1. Fixed effect estimates for the best model results of the accuracy data in identifying SR-T3 words and CR-T2 words.

These findings indicate that SR-T3 and CR-T2 may be incompletely neutralized perceptually in identification, which is not consistent with previous studies showing complete neutralization in perception in identification tasks (Wang and Li, 1967; Peng, 2000). The results may be due to the stimuli used in the current experiment, with SR-T3 words having a slightly lower average F0 value than CR-T2 words. To better understand the differences between studies, Figure 3 plots the raw F0 tracks of the individual stimuli produced by the female and male speakers. As displayed in Figure 3, there are considerable differences in the degree of neutralization between speakers. The male speaker’s productions of SR-T3 and CR-T2 overlap almost fully, with the individual productions occupying the same acoustic space, while the female speaker’s productions of the two tones include some tokens from each tone group that fall outside the shared acoustic space, which may contain cues that the participants were picking up on during the identification experiment. Such speaker-level individual differences, shown in Figure 3, are likely to lead to differences between the stimuli used in different experiments, which may explain the inconsistent findings on perceptual neutralization between SR-T3 and CR-T2 across studies.

Figure 3. Raw F0 tracks of the Sandhi-Rising T3 and Canonical-Rising T2 syllables for the female (left panel) speaker and male (right panel) speaker.

Given that the forced-choice identification task required controlled processing and that the participants judged whether the target word was a T3+T3 (starting with SR-T3) or a T2+T3 (starting with CR-T2) word only after hearing the whole disyllable, a question arose as to whether native Mandarin listeners would be able to recognize the target word (either a T3+T3 or a T2+T3 word) before hearing the whole disyllabic word. More specifically, it is worth examining the dynamic processing of SR-T3 and CR-T2 as their acoustic signals unfold. It is also crucial to investigate the role of within-category acoustic details in the dynamic and automatic processing of the two incompletely neutralized tones in the sandhi context. In order to approach this issue, an eye-tracking experiment was conducted; this technique is argued to be highly sensitive and implicit (McMurray et al., 2002, 2009; Qin et al., 2019), allowing us to examine Mandarin listeners’ processing of SR-T3 (T3+T3) and CR-T2 (T2+T3) words before behavioral responses such as word identification. The findings of the experiment should shed light on the role of acoustic detail and phoneme/toneme categories during spoken word recognition.

Experiment 3: Eye-Tracking

Since the A’ score in the identification experiment barely exceeded 0.5, we conducted an eye-tracking experiment with the visual world paradigm to further examine the extent of perceptual neutralization between SR-T3 and CR-T2. The eye-tracking technique has been utilized to show listeners’ sensitivity to within-category changes that are not usually captured by identification tasks which require participants’ overt responses (McMurray et al., 2002, 2009; Qin et al., 2019). Using such a method allows us to investigate whether Mandarin listeners are sensitive to the subtle acoustic differences between SR-T3 and CR-T2 during automatic processing stages as well as before overt behavioral responses. It also allows us to examine how SR-T3 and CR-T2 compete as the acoustic signal unfolds.

Participants

In the eye-tracking experiment, 32 native Mandarin speakers (22 females and 10 males; age range: 20–28 years old; mean age: 23.7 years old) from Northern China were recruited. None of them spoke any other Chinese dialects at the time of testing. They were also not simultaneous bilingual or early bilingual speakers of another non-Chinese language. None of them had participated in Experiment 1 or Experiment 2. They were all university students with no reported language disability or hearing impairment. This research was reviewed and approved by the Human Subjects Committee of the Department of Chinese Language and Literature at Fudan University. All the participants were asked to provide informed consent before the eye-tracking experiment and were paid for their participation.

Stimuli

The eye-tracking stimuli were the 40 disyllabic words used in Experiment 1, among which 10 were tone 3 sandhi words (e.g., 百马 “hundreds of horses” /paj3 ma3/) and 10 were the counterparts of the 10 tone 3 sandhi words (e.g., 白马 “white horse” /paj2 ma3/). The two groups of critical stimuli were matched in log word frequency [SR-T3 words: M = 0.480, SD = 1.046; CR-T2 words: M = 0.899, SD = 1.088; t(18) = 0.876, p = 0.393] (based on the corpus of Cai and Brysbaert, 2010) and stroke number [SR-T3 words: M = 8.2, SD = 4.54; CR-T2 words: M = 8.5, SD = 3.60; t(18) = 0.164, p = 0.872] (see Appendix Table A1). In addition, their first morphemes can all be combined with several other morphemes to form disyllabic words. The remaining 20 disyllabic words were fillers with varied segments and tonal combinations. The auditory stimuli in the eye-tracking experiment were taken from the same Shandong female speaker and Hebei male speaker as in Experiment 2, whose average F0 of SR-T3 and CR-T2 was closely matched with that produced by all the 20 speakers in Experiment 1. The disyllabic stimuli were presented in Simplified Chinese characters since not all of them were easily imageable (Huettig and McQueen, 2007; McQueen and Viebahn, 2007).

The 40 disyllabic words were further divided into 10 groups of four, with each group consisting of one tone3 sandhi word (SR-T3 word), the counterpart of the tone3 sandhi word (CR-T2 word), and two fillers. Within each group, every word served as the target once, and in every trial, the same four words appeared in a different location of an invisible 2 × 2 grid on the screen, resulting in a total of 40 trials. In addition, a given SR-T3 target was separated from its counterpart CR-T2 target by a minimum of ten trials, and vice versa. For example, the trial with 百马 (“hundreds of horses” /paj3 ma3/) as the target and the trial with 白马 (“white horse” /paj2 ma3/) as the target were at least ten trials apart. The location of targets was balanced across the 40 trials. Among the 40 trials, 20 of them had their targets produced by the male speaker, while the other 20 of them had their targets produced by the female speaker.

Apparatus

Eye movements were recorded with an SR Research EyeLink 1000 Plus eye tracker at a sampling rate of 1,000 Hz. The visual stimuli were Simplified Chinese characters presented on a 19-inch LCD monitor with a resolution of 1,024 × 768 pixels, using white text on a black background. The auditory stimuli were played through a MIDIMAN M-TRACK 2X2M and DJ-600 professional monitor headphones to ensure accurate timing of sound presentation. The experiment was programmed in EyeLink Experiment Builder 2.1.140, and the eye-movement data were analyzed using EyeLink Data Viewer 3.1.97.

Procedure

First, the participants completed a language background questionnaire and a consent form. Then they did the eye-tracking experiment in the Phonetics and Psycholinguistics Lab at Fudan University. The participants sat about 70 cm from the monitor with their head on a chin rest to reduce head movements. The experiment started with a 13-point calibration. Once the calibration check was completed accurately (<0.50 degrees of error), the experimenter advanced the screen to display four practice trials with feedback, followed by the 40 trials of the main experiment without feedback. Within each trial, participants saw four disyllabic words presented for 5,000 ms, during which they were instructed to read the four words covertly in order to ensure that the phonological representations of the words were activated. Upon the disappearance of the four words, a fixation cross appeared in the middle of the screen for 500 ms, during which participants were instructed to look at the fixation cross so that their eye fixations would be brought to the display center. Immediately after the disappearance of the fixation cross, the four words reappeared on the screen in the same locations, with the sound of the target word simultaneously presented via headphones. Participants were requested to click on the target word with the mouse as quickly and accurately as possible upon hearing it. Participants’ eye fixations were measured from the onset of the auditory stimuli, and their behavioral responses were recorded as well. After the mouse click, the trial ended, with the next trial starting 2,000 ms later. The forty trials were equally divided into two blocks, with the targets in one block produced by the male speaker and the targets in the other block produced by the female speaker. Trials were randomly presented within each block, while the block order was counterbalanced across participants. The whole experiment consisted of 40 trials (10 SR-T3 words, 10 CR-T2 words, 20 fillers) and lasted approximately 20 min.

Data Analysis

Participants’ eye movements in the four regions of interest corresponding to the four words on the screen were analyzed. Proportions of fixations to targets, competitors, and distractors were extracted in 8-ms time bins from the onset of the sound presentation to 1,256 ms after the onset, resulting in 157 bins. Target ratios were computed as the proportion of fixations to the target divided by the summed proportions of fixations to the target and competitor, separately for the SR-T3 and CR-T2 target conditions; competitor ratios were computed analogously as the proportion of fixations to the competitor divided by the summed proportions of fixations to the target and competitor. Statistical analyses were conducted on SR-T3 target ratios and CR-T2 competitor ratios when SR-T3 words were the target (i.e., hearing SR-T3 words), on CR-T2 target ratios and SR-T3 competitor ratios when CR-T2 words were the target (i.e., hearing CR-T2 words), on the proportions of fixations to SR-T3 words when serving as the target and as the competitor, and on the proportions of fixations to CR-T2 words when serving as the target and as the competitor.
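The binning and ratio computations described above could be implemented along the following lines; fix_data and its columns (time_ms, roi, condition, participant) are hypothetical names used only for illustration.

```r
library(dplyr)

binned <- fix_data %>%
  filter(time_ms >= 0, time_ms < 1256) %>%        # 0-1,256 ms window
  mutate(bin = floor(time_ms / 8)) %>%            # 8-ms bins -> 157 bins
  group_by(participant, condition, bin) %>%
  summarise(p_target = mean(roi == "target"),     # proportion of samples on the target
            p_comp   = mean(roi == "competitor"), # proportion of samples on the competitor
            .groups  = "drop") %>%
  mutate(target_ratio     = p_target / (p_target + p_comp),
         competitor_ratio = p_comp   / (p_target + p_comp))
```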

Four series of growth curve analyses (Magnuson et al., 2007; Mirman et al., 2008; Mirman, 2014; Connell et al., 2018) were conducted using the lme4 package (Bates et al., 2015) in R, with p-values calculated using the lmerTest package (Kuznetsova et al., 2017). Target ratios, competitor ratios, and participants’ proportions of fixations to CR-T2 and SR-T3 words between the 200 and 1,256 ms time window were modeled to accommodate the time that eye movements need to reflect speech processing (Hallett, 1986; Salverda et al., 2014). The end point of this time window was determined based on the duration of the 20 critical stimuli (around 675 ms) and participants’ reaction times in the identification task of Experiment 2 (around 1,150 ms).

For the first series of analyses, SR-T3 target ratios and CR-T2 competitor ratios were modeled; for the second series, CR-T2 target ratios and SR-T3 competitor ratios were modeled; for the third series, SR-T3 words served both as the target and competitor; for the fourth series, CR-T2 words served both as the target and competitor. All series included Condition as a fixed factor (two levels). For Condition, CR-T2 was treated as the baseline to which SR-T3 was compared for the first two series of analyses. For the third series, SR-T3 competitor was treated as the baseline to which SR-T3 target was compared, while for the fourth series, CR-T2 target was deemed the baseline to which CR-T2 competitor was compared. Time (linear, quadratic, cubic) and interactions between Time and Condition were also included as fixed factors to capture the non-linear nature of the eye-tracking data. In addition, all analyses also included a set of random effects to capture Participant-level and Participant-by-Condition variability in the three time polynomials (Mirman, 2014). Likelihood ratio tests using forward stepwise selection were conducted to determine the best model for all series of analyses. The model that contained the most fixed factors and fit significantly better than the one with one less variable was determined as the optimal model and reported below.
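As an illustration of one of the four series (e.g., target vs. competitor ratio curves when hearing CR-T2 words), a growth curve model of this kind might be specified as below. The data frame eye and its ratio, condition, bin, and participant columns are hypothetical, and the random-effects structure is one plausible reading of "Participant-level and Participant-by-Condition variability", not the authors' exact code.

```r
library(lme4)
library(lmerTest)

# eye: one row per participant x curve type (condition) x 8-ms bin, with a ratio column
eye <- subset(eye, bin * 8 >= 200)   # restrict to the 200-1,256 ms analysis window

# Orthogonal time polynomials over the remaining bins
ot <- poly(eye$bin, 3)
eye$ot1 <- ot[, 1]
eye$ot2 <- ot[, 2]
eye$ot3 <- ot[, 3]

# Condition: target vs. competitor curve, with the baseline set as described in the text
m_eye <- lmer(ratio ~ (ot1 + ot2 + ot3) * condition +
                (ot1 + ot2 + ot3 | participant) +
                (ot1 + ot2 + ot3 | participant:condition),
              data = eye, REML = FALSE)
summary(m_eye)
```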

The former two series of analyses allow us to investigate how the acoustically incompletely neutralized SR-T3 and CR-T2 words compete as the acoustic signal unfolds within the same trial. The latter two series of analyses grant us the opportunity to compare the recognition process for identical visual stimuli when serving as the target and as the competitor in different trials. If participants look more to the target words relative to their corresponding competitors, and if proportions of fixations to targets increase more strongly as a function of time, the results would further support those in Experiment 2, indicating that Mandarin listeners are sensitive to the subtle acoustic details between SR-T3 and CR-T2. Therefore, Mandarin listeners should not only show significant differences between target and competitor ratios, between proportions of fixations to SR-T3 targets and competitors, and between proportions of fixations to CR-T2 targets and competitors, but also reveal significant interactions between Condition and at least one of the time polynomials (linear, quadratic, cubic). Such interactions would suggest that the difference between target and competitor ratios changed significantly during the course of target recognition, and so did the difference between the proportions of fixations to the SR-T3 target and competitor, and between the proportions of fixations to the CR-T2 target and competitor.

Results and Discussion

Table 2 presents the results of the growth curve analysis with the best fit on the ratios of SR-T3 targets (e.g., 百马 “hundreds of horses” /paj3 ma3/) and CR-T2 competitors (e.g., 白马 “white horse” /paj2 ma3/) when hearing SR-T3 words. As summarized in Table 2, the positive estimate for the interaction between the quadratic time polynomial and Condition indicated that the competitor ratio curve is more of a concave shape (i.e., upside-down U shape) than the target ratio curve. Interestingly, the effect of Condition was not significant, indicating that Mandarin listeners did not look more to the SR-T3 words compared to the CR-T2 words when hearing the SR-T3 words in the 200–1,256 ms time window. The lack of the effect of Condition may be due to the fact that Mandarin listeners looked more to SR-T3 words (tone 3 sandhi words) in the first syllable time window. However, they reconsidered CR-T2 words after hearing the second syllable. Shortly after the end of the second syllable, they looked more to the SR-T3 words again (see Figure 4), which were the targets. The crossing of the SR-T3 and CR-T2 fixation curves may have led to the insignificance of Condition.

Table 2. Results of growth curve analysis on SR-T3 target ratios and CR-T2 competitor ratios when hearing SR-T3 words.

Figure 4. Top-left panel: SR-T3 target ratios and CR-T2 competitor ratios when Mandarin listeners heard SR-T3 words; top-right panel: CR-T2 target ratios and SR-T3 competitor ratios when Mandarin listeners heard CR-T2 words; bottom-left panel: Mandarin listeners’ proportions of fixations to SR-T3 targets and SR-T3 competitors; bottom-right panel: Mandarin listeners’ proportions of fixations to CR-T2 targets and CR-T2 competitors.

Table 3 demonstrates the results of the growth curve analysis with the best fit on the ratios of CR-T2 targets (e.g., 白马 “white horse” /paj2 ma3/) and SR-T3 competitors (e.g., 百马 “hundreds of horses” /paj3 ma3/) when hearing CR-T2 words. The negative estimate for the effect of Condition indicates that Mandarin listeners looked more to the CR-T2 words than to the SR-T3 words in the 200–1,256 ms time window when the CR-T2 words were the target. The negative estimate for the interaction between the linear time polynomial and Condition suggests that the target ratio curve has a more positive slope than the competitor ratio curve. As Figure 4 shows, Mandarin listeners could not distinguish between the CR-T2 words and SR-T3 words in the first syllable time window, as evidenced by the two adjacent ratio curves before the offset of the first syllable. However, after the onset of the second syllable, they started to look more to the CR-T2 words (target). The CR-T2 advantage persisted into the post-lexical time window.

Table 3. Results of growth curve analysis on CR-T2 target ratios and SR-T3 competitor ratios when hearing CR-T2 words.

Table 4 shows the results of the growth curve analysis with the best fit on the proportions of fixations to SR-T3 words (e.g., 百马 “hundreds of horses” /paj3 ma3/) when serving as the target and competitor. As can be seen in Table 4, neither Condition nor any of the interactions between the time polynomials and Condition was significant, indicating that Mandarin listeners’ overall proportions of fixations to the SR-T3 targets did not differ from those to the SR-T3 competitors in the 200–1,256 ms time window, nor did they change differentially as a function of time (see Figure 4). The lack of significance for Condition may be due to the fact that Mandarin listeners did not look more to SR-T3 targets until the post-lexical time window, which is consistent with the comparison in which the ratios of SR-T3 targets were compared with those of CR-T2 competitors when SR-T3 words were the target.

Table 4. Results of growth curve analysis on Mandarin listeners’ proportions of fixations to SR-T3 targets and SR-T3 competitors.

Table 5 displays the results of the growth curve analysis with the best fit on the proportions of fixations to CR-T2 words (e.g., 白马 “white horse” /paj2 ma3/) when serving as the target and competitor. The negative estimate for the effect of Condition indicates that Mandarin listeners’ overall proportions of fixations to the CR-T2 targets were higher than those to the CR-T2 competitors in the 200–1,256 ms time window. The negative estimate for the interaction between the linear time polynomial and Condition indicates that the CR-T2 targets’ fixation curve has a more positive slope than the CR-T2 competitors’ fixation curve. As Figure 4 reveals, Mandarin listeners did not look more to the CR-T2 targets before the first half of the second syllable. After the middle of the second syllable, they started to look more to the CR-T2 targets. This pattern persisted into the post-lexical time window. These results were in line with those in which ratios of CR-T2 targets were compared with those of SR-T3 competitors when CR-T2 words served as the target.

Table 5. Results of growth curve analysis on Mandarin listeners’ proportions of fixations to CR-T2 targets and CR-T2 competitors.

Taken together, these eye-tracking results seem to suggest that Mandarin listeners, in general, were able to differentiate SR-T3 words from CR-T2 words in the automatic processing stages. The results also suggest that SR-T3 was a more ambiguous tone, which confused Mandarin listeners before the sandhi context was fully revealed. Only shortly after the offset of the second syllable could Mandarin listeners utilize the subtle acoustic differences between SR-T3 and CR-T2 to recognize the target words. By contrast, CR-T2 did not exhibit such ambiguity, allowing Mandarin listeners to differentiate CR-T2 from SR-T3 words no later than the middle of the second syllable, indicating that immediately after the appearance of the sandhi context, Mandarin listeners could incorporate the contextual information of sandhi into the word recognition process.

In addition to the eye-tracking data, we also analyzed Mandarin listeners’ identification performance in the visual-world paradigm. As in Experiment 2, A-prime scores were calculated in order to evaluate listeners’ sensitivity to the difference between SR-T3 and CR-T2 words (Peng, 2000). Results showed that the mean A’ score was 0.511 with a standard deviation of 0.017. A one-sample t-test was conducted to examine whether listeners’ A’ scores were significantly better than the chance level of 0.5. Consistent with the results of Experiment 2, the Mandarin listeners’ mean A’ score obtained in the word identification task of the visual-world paradigm was significantly better than chance [t(31) = 3.629, p = 0.001]. The eye-tracking results and the identification results obtained in the visual-world paradigm and in Experiment 2 together indicate that Mandarin listeners may be able to detect the subtle acoustic differences between the SR-T3 and CR-T2 words at automatic processing stages. This sensitivity then carries over into later processing stages to aid word recognition.

General Discussion

The study revisits the issue of perceptual neutralization between SR-T3 and CR-T2 in the literature. Based upon previous research, the current study employs the eye-tracking technique, which can provide on-line processing data, to examine tone sandhi in the context of disyllabic words. In order to investigate the extent of neutralization between SR-T3 and CR-T2, this study conducts three experiments, including the production experiment, the identification experiment, and the eye-tracking experiment, to integrate the findings from acoustic analysis, perceptual recognition, and cognitive processing.

For the production data, the acoustic analysis demonstrates that SR-T3 and CR-T2 are different in F0 contour and SR-T3 has lower average F0 than CR-T2. The results replicate those of previous studies (Peng, 2000; Zhang and Lai, 2010; Yuan and Chen, 2014). Thus, the findings in the production analysis suggest that SR-T3 and CR-T2 are acoustically incompletely neutralized.

The results of the identification task show that Mandarin listeners tend to be aware of the subtle acoustic differences between SR-T3 and CR-T2 in the forced-choice perception task, indicating that SR-T3 and CR-T2 are perceptually distinguishable. The current results are not consistent with previous research by Peng (2000), which showed that Mandarin listeners failed to correctly categorize these two tones. This is probably because the stimuli used in the current study reflect actual production patterns of the two tones; that is, SR-T3 has a generally lower average F0 than CR-T2. It may also be due to individual differences in the productions of the two tones between speakers across studies. Despite these potential differences, the current findings in the identification task suggest that SR-T3 and CR-T2 are perceptually incompletely neutralized.

In the eye-tracking results, we compared target ratios and competitor ratios when hearing SR-T3 and CR-T2 words, respectively. When hearing SR-T3 words, Mandarin listeners looked more to SR-T3 words in the first syllable time window. After encountering the sandhi context (i.e., hearing the second syllable), however, they started to consider the target tone as CR-T2 and looked more to CR-T2 words in the second syllable time window. Over the entire word window, they looked at both SR-T3 and CR-T2 words. Then, shortly after the end of the entire word, they looked back more to SR-T3 words in the post-lexical time window. We speculate that SR-T3 (a high-rising tone with a lower average pitch) is marked and thus first drew Mandarin listeners’ attention, whereas CR-T2, which is in the tonal inventory, did not stand out until the listeners heard the onset of the second syllable and reconsidered the CR-T2 words. When hearing CR-T2 words, Mandarin listeners looked more to CR-T2 words than to SR-T3 words overall, but they were not sensitive to the differences between the two tones and could not distinguish them in the first syllable time window. Immediately after they encountered the sandhi context (i.e., upon hearing the second syllable), Mandarin listeners looked more to the CR-T2 words toward the end of the entire word.

We also compared the proportion of fixations to SR-T3 targets with that of SR-T3 competitors as well as compared the proportion of fixations to CR-T2 targets with that of CR-T2 competitors. The results suggest that SR-T3 was a more ambiguous tone. When hearing SR-T3 words, Mandarin listeners tended to be confused between SR-T3 and CR-T2 until the sandhi context was fully revealed, as shown by the fact that the fixation curve of SR-T3 targets was not significantly different from that of SR-T3 competitors (i.e., when hearing CR-T2 words), indicating that both CR-T2 and SR-T3 words were activated to a similar degree until the sandhi context was fully revealed. By contrast, the results seem to show a bias toward CR-T2 words in the sense that even when hearing SR-T3 words, the proportion of fixations to CR-T2 words (CR-T2 competitors) was not different from that to CR-T2 targets until the middle of the second syllable. These results are probably because CR-T2 is in the tonal inventory while SR-T3 is not, and therefore SR-T3 is more ambiguous than CR-T2.

The current results support the view that the perception of tone is an incremental process in which the pitch height at critical points serves as an important perceptual cue. The sandhi context, i.e., the appearance of the second T3 syllable, is at play in identifying SR-T3 or CR-T2 in the early processing stages of spoken word recognition. In sum, the findings demonstrate that Mandarin listeners tend to process CR-T2 as T2, whereas they tend to first process SR-T3 as both T3 and T2, later detect the acoustic differences between the two tones revealed by the sandhi context, and finally activate the target word during lexical access. The findings in the eye-tracking experiment suggest that Mandarin listeners are sensitive to the acoustic details distinguishing SR-T3 and CR-T2 and are able to use this information automatically in lexical access.

Concluding Remarks

This study explores the extent of neutralization of SR-T3 and CR-T2 in Mandarin. Mandarin T3 sandhi is traditionally/pedagogically described as tonal neutralization within category; that is, a T3 is altered to a T2 when it is followed by another T3. Previous studies showed inconsistent results in that SR-T3 and CR-T2 were incompletely neutralized in acoustic details but completely neutralized in perceptual identification. The current study aims to resolve those inconsistencies by conducting production, perception, and eye-tracking experiments. The production and perception results show that SR-T3 and CR-T2 are incompletely neutralized in acoustics and perception. In addition, the eye-tracking results show that native Mandarin listeners can distinguish SR-T3 from CR-T2. The eye-tracking data further demonstrate the on-line processing of tonal alternations in sandhi contexts; that is, Mandarin listeners tend to perceive SR-T3 as both SR-T3 and CR-T2 over the entire word window, whereas they tend to process CR-T2 as both tones only in the first syllable position, then detect the acoustic differences between the two tones revealed by the sandhi context, and eventually retrieve the target word. In conclusion, our findings suggest that native Mandarin listeners are able to use the detailed acoustic differences “within category” in lexical access, but they also rely on phonological contexts to perceive phonetic differences. If purely acoustic-phonetic details determined processing, the listeners should have been able to distinguish SR-T3 from CR-T2 words within the first syllable. The eye-tracking results, however, showed that the listeners generally could not identify CR-T2 until the appearance of the second syllable, and SR-T3 until the post-lexical time window. In line with this, the results support a hybrid model of lexical representation that considers both surface acoustic-phonetic information and the underlying representation during spoken word recognition (e.g., Deelman and Connine, 2001; Connine, 2004; Connine and Pinnow, 2006; Ranbom and Connine, 2007). Future studies should examine which acoustic cues Mandarin listeners can use to disambiguate SR-T3 from CR-T2 during the recognition process. The present results also imply that, in language learning or training for language disorders, phonetic contrasts may be better learned through vocabulary at the lexical level.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics Statement

The studies involving human participants were reviewed and approved by Zhongmin Chen, Yueling Ping, and Liang Ma, Department of Chinese Language and Literature, Fudan University. The patients/participants provided their written informed consent to participate in this study.

Author Contributions

J-YT contributed 60% of the work. Y-FC contributed 40% of the work. Both authors contributed to the article and approved the submitted version.

Funding

This study was funded by the Fudan Frontier Pilot Funding Program (Grant No. IDH3151032/013). It was also funded by the Ministry of Science and Technology, Taiwan (Grant No. MOST109-2410-H-004-187).

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

  1. ^ x is the ratio of false alarms (i.e., identification of CR-T2 words as SR-T3 words and of SR-T3 words as CR-T2 words), and y is the ratio of correct responses (i.e., identification of SR-T3 words as SR-T3 words and of CR-T2 words as CR-T2 words).
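
Since Grier (1971) is cited for nonparametric indexes of sensitivity and bias, these two ratios presumably enter the sensitivity index A′, with y serving as the hit rate and x as the false-alarm rate. The snippet below is a hypothetical illustration of that computation under this assumption, not code from the study.

```r
# Grier's (1971) nonparametric sensitivity index A', computed from the
# hit rate y (correct responses) and the false-alarm rate x, as defined
# in footnote 1. Hypothetical helper, not taken from the study's scripts.
a_prime <- function(y, x) {
  if (y >= x) {
    0.5 + ((y - x) * (1 + y - x)) / (4 * y * (1 - x))
  } else {
    0.5 - ((x - y) * (1 + x - y)) / (4 * x * (1 - y))
  }
}

a_prime(0.80, 0.30)  # e.g., hit rate 0.80, false-alarm rate 0.30 -> ~0.83
```

An A′ of 0.5 indicates chance-level discrimination between SR-T3 and CR-T2 words, while values approaching 1 indicate reliable discrimination.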

References

Bates, D., Maechler, M., Bolker, B., and Walker, S. (2015). Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1–48. doi: 10.18637/jss.v067.i01

Boersma, P., and Weenink, D. (2019). Praat: Doing Phonetics by Computer [Computer Program]. Version 6.0.53. Available online at: http://www.praat.org/

Cai, Q., and Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS One 5:e10729. doi: 10.1371/journal.pone.0010729

Chao, Y. R. (1930). A system of tone letters. Le Maître Phonétique 45, 24–27.

Chien, Y.-F., Sereno, J. A., and Zhang, J. (2016). Priming the representation of Mandarin tone 3 sandhi words. Lang. Cogn. Neurosci. 31, 179–189.

Chien, Y. F., Yang, X., Fiorentino, R., and Sereno, J. A. (2020). The role of surface and underlying forms when processing tonal alternations in Mandarin Chinese: a mismatch negativity study. Front. Psychol. 11:646. doi: 10.3389/fpsyg.2020.00646

Connell, K., Hüls, S., Martínez-García, M. T., Qin, Z., Shin, S., Yan, H., et al. (2018). English learners’ use of segmental and suprasegmental cues to stress in lexical access: an eye-tracking study. Lang. Learn. 68, 635–668. doi: 10.1111/lang.12288

Connine, C. M. (2004). It’s not what you hear, but how often you hear it: on the neglected role of phonological variant frequency in auditory word recognition. Psychonomic Bull. Rev. 11, 1084–1089. doi: 10.3758/bf03196741

Connine, C. M., and Pinnow, E. (2006). Phonological variation in spoken word recognition: episodes and abstractions. Ling. Rev. 23, 235–245.

Deelman, T., and Connine, C. M. (2001). Missing information in spoken word recognition: non-released stop consonants. J. Exp. Psychol. Hum. Percept. Perform. 27, 656–663. doi: 10.1037//0096-1523.27.3.656

Grier, J. B. (1971). Nonparametric indexes for sensitivity and bias: computing formulas. Psychol. Bull. 75, 424–429.

Hallett, P. E. (1986). “Eye movements,” in Handbook of Perception and Human Performance, eds K. R. Boff, L. Kaufman, and J. P. Thomas (New York, NY: Wiley), 11–10.

Huettig, F., and McQueen, J. M. (2007). The tug of war between phonological, semantic and shape information in language-mediated visual search. J. Memory Lang. 57, 460–482. doi: 10.1016/j.jml.2007.02.001

Kuznetsova, A., Brockhoff, P. B., and Christensen, R. H. B. (2017). lmerTest: Tests in Linear Mixed Effects Models. R package version 2.0.32.

Lin, Y. H. (2007). The Sounds of Chinese. Cambridge: Cambridge University Press.

Magnuson, J. S., Dixon, J. A., Tanenhaus, M. K., and Aslin, R. N. (2007). The dynamics of lexical competition during spoken word recognition. Cogn. Sci. 31, 133–156. doi: 10.1080/03640210709336987

Malins, J. G., and Joanisse, M. F. (2010). The roles of tonal and segmental information in Mandarin spoken word recognition: an eye-tracking study. J. Memory Lang. 62, 407–420. doi: 10.1016/j.jml.2010.02.004

McMurray, B., Tanenhaus, M. K., and Aslin, R. N. (2002). Gradient effects of within-category phonetic variation on lexical access. Cognition 86, B33–B42.

McMurray, B., Tanenhaus, M. K., and Aslin, R. N. (2009). Within-category VOT affects recovery from “lexical” garden-paths: evidence against phoneme-level inhibition. J. Memory Lang. 60, 65–91. doi: 10.1016/j.jml.2008.07.002

McQueen, J. M., and Viebahn, M. C. (2007). Tracking recognition of spoken words by tracking looks to printed words. Quart. J. Exp. Psychol. 60, 661–671. doi: 10.1080/17470210601183890

Mirman, D. (2014). Growth Curve Analysis and Visualization Using R. Boca Raton, FL: Taylor & Francis.

Mirman, D., Dixon, J. A., and Magnuson, J. S. (2008). Statistical and computational models of the visual world paradigm: growth curves and individual differences. J. Memory Lang. 59, 475–494. doi: 10.1016/j.jml.2007.11.006

Nixon, J. S., Chen, Y., and Schiller, N. O. (2015). Multi-level processing of phonetic variants in speech production and visual word processing: evidence from Mandarin lexical tones. Lang. Cogn. Neurosci. 30, 491–505. doi: 10.1080/23273798.2014.942326

Peng, S.-H. (2000). “Lexical versus ‘phonological’ representations of Mandarin sandhi tones,” in Papers in Laboratory Phonology V: Acquisition and the Lexicon, eds M. B. Broe and J. Pierrehumbert (Cambridge: Cambridge University Press), 152–167.

Qin, Z., Tremblay, A., and Zhang, J. (2019). Influence of within-category tonal information in the recognition of Mandarin-Chinese words by native and non-native listeners: an eye-tracking study. J. Phonet. 73, 144–157. doi: 10.1016/j.wocn.2019.01.002

Ranbom, L. J., and Connine, C. M. (2007). Lexical representation of phonological variation in spoken word recognition. J. Memory Lang. 57, 273–298. doi: 10.1016/j.jml.2007.04.001

Rietveld, T., and Chen, A. (2006). “How to obtain and process perceptual judgments of intonational meaning,” in Methods in Empirical Prosody Research, eds S. Sudhoff, D. Lenortová, R. Meyer, S. Pappert, P. Augurzky, I. Mleinek, et al. (Berlin: Walter de Gruyter), 283–319.

Rose, P. (1987). Considerations in the normalization of the fundamental frequency in linguistic tone. Speech Commun. 6, 343–351. doi: 10.1016/j.cortex.2012.11.012

Salverda, A. P., Kleinschmidt, D., and Tanenhaus, M. K. (2014). Immediate effects of anticipatory coarticulation in spoken-word recognition. J. Memory Lang. 71, 145–163. doi: 10.1016/j.jml.2013.11.002

Shen, J., Deutsch, D., and Rayner, K. (2013). On-line perception of Mandarin tones 2 and 3: evidence from eye movements. J. Acoust. Soc. Am. 133, 3016–3029.

Snodgrass, J. G., Levy-Berger, G., and Haydon, M. (1985). Human Experimental Psychology. New York: Oxford University Press Inc.

So, C. K., and Best, C. T. (2010). Cross-language perception of nonnative tonal contrasts: effects of native phonological and phonetic influences. Lang. Speech 53, 273–293. doi: 10.1177/0023830909357156

Tagliaferri, B. (2019). Paradigm. Available online at: http://www.paradigmexperiments.com/index.html (accessed October 19, 2019).

Wang, W. S.-Y., and Li, K.-P. (1967). Tone 3 in Pekinese. J. Speech Hear. Res. 10, 629–636. doi: 10.1044/jshr.1003.629

Xu, Y. (2005/2010). ProsodyPro.praat. Available online at: http://crdo.fr/crdo000723

Yuan, J., and Chen, Y. (2014). Third tone sandhi in Standard Chinese: a corpus approach. J. Chin. Ling. 42, 218–236.

Zhang, C. C., Xia, Q. S., and Peng, G. (2015). Mandarin third tone sandhi requires more effortful phonological encoding in speech production: evidence from an ERP study. J. Neurolinguistics 33, 149–162. doi: 10.1016/j.jneuroling.2014.07.002

Zhang, J., and Lai, Y. (2010). Testing the role of phonetic knowledge in Mandarin tone sandhi. Phonology 27, 153–201. doi: 10.1017/s0952675710000060

Zhu, X.-N. (2004). Jipin guiyihua – ruhe chuli shengdiao de suiji chayi? (F0 normalization—how to deal with between-speaker tonal variations?) Yuyan Kexue Linguist. Sci. 3, 3–19.

Appendix

Table A1. The critical stimuli.

Keywords: tone sandhi, Mandarin Chinese, tonal alternations, neutralization, eye-tracking

Citation: Tu J-Y and Chien Y-F (2022) The Role of Categorical Perception and Acoustic Details in the Processing of Mandarin Tonal Alternations in Contexts: An Eye-Tracking Study. Front. Psychol. 12:756921. doi: 10.3389/fpsyg.2021.756921

Received: 11 August 2021; Accepted: 29 December 2021;
Published: 07 February 2022.

Edited by:

William Choi, The University of Hong Kong, Hong Kong SAR, China

Reviewed by:

Jessie S. Nixon, University of Tübingen, Germany
Jeffrey Malins, Georgia State University, United States

Copyright © 2022 Tu and Chien. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Yu-Fu Chien, chien_yc@fudan.edu.cn
