Articulation, Acoustics and Perception of Mandarin Chinese Emotional Speech

Abstract This paper studies articulatory, acoustic and perceptual characteristics of Mandarin Chinese emotional utterances as produced by two speakers, expressing Neutral, Angry, Sad and Happy emotions. Articulatory patterns were recorded using ElectroMagnetic Articulography (EMA), together with acoustic recordings. The acoustic and articulatory analysis revealed that Happy and Angry were generally higher-pitched, louder, and produced with a more open mouth than Neutral or Sad. Sad is produced with a low, back tongue dorsum position and Happy with a forward position, and for one speaker, duration was longer for Angry and Sad. Moreover, F1 and F2 are more dispersed (i.e., hyperarticulated) in emotional speech than in Neutral speech. Perception tests conducted with 18 native listeners suggest that listeners were able to perceive the expressed emotions far above chance level. The louder and higher pitched the utterance, the more emotional the speech tends to be perceived. We also explore specific articulatory and acoustic correlates of each type of emotional speech, and how they impact perception.


General aims
Emotion is part of our human nature, and it is a subject that is interesting to study from a linguistic perspective. Our everyday experience tells us that speakers can change their speech when they are emotional, which inevitably affects its acoustic realizations. Also, listeners can often, though perhaps not always, accurately perceive the emotional state of the speaker. Studying this observation in a scientific manner is one task of modern phonetics.
There is an increasing body of literature on the acoustic characteristics of emotional speech, including acted emotion by professionals and non-professionals and also spontaneous, non-scripted emotional speech (see tables comparing different studies in detail in Erickson 2005). A multitude of different languages have been examined in this research tradition, including Mandarin Chinese (e.g., ChangLiao 2004; Gu & Lee 2007; Lin & Fon 2012; Liu & Pell 2012; Yang et al. 2007; Yuan et al. 2002; Wang et al. 2005; Wen et al. 2011; Zhang et al. 2006). However, how speakers articulate emotional speech is less well-studied, with a few exceptions (Erickson, Huang et al. 2008; Li et al. 2010; Nguyen et al. 2008). The perception of emotional speech is also less well studied than its acoustics. To fill these gaps, this paper examines the articulation, acoustics, and perception of emotional speech in Mandarin Chinese, in an attempt to further our understanding of emotional speech utterances.

Previous studies on emotional speech
We start this paper by reviewing some previous phonetic studies on emotional speech, which served as the basis of the current study (see Erickson 2005 for a more extended summary of other studies). Erickson et al. (2000) used ElectroMagnetic Articulography (EMA) to examine properties of acted emotional utterances of two American English speakers. They showed that jaw and tongue dorsum position change as a function of the particular emotion, and specifically, the emotion Anger may involve increased jaw lowering (throughout this paper, for the sake of exposition, we use small capitals to express emotion types). They also found that emotional speech showed particular acoustic characteristics, realized in terms of F0 and formant structures.
An acoustic and articulatory study of spontaneous Mandarin Chinese by Erickson, Huang et al. (2008) examined a female native speaker, as she was speaking to her friend over a telephone-type connection set up in the lab, recalling past emotional events in her life, including the very Sad story of how her husband was murdered. The acoustic analysis showed that Happy syllables were significantly louder, higher in pitch, and shorter in duration than Sad syllables. Also for Sad, a breathy voice quality was found, as well as lowered lip and jaw, and more tongue tip protrusion compared to Happy. Li et al. (2010) reported on acoustic and EMA recordings of Mandarin Chinese emotional speech (Angry, Sad, Happy and Neutral, 9 vowels with 111 sentences) for a single male speaker. Among their findings was that Happy has the highest F0 maximum, followed by Angry, and then Neutral/Sad. Articulation also changed as a function of the emotion, such that Angry has the highest tongue body position. They also reported increased lip protrusion for Sad as well as for Angry. In addition, the study found intonation differences in final boundary tones, with Happy having a high rising tone, Angry, a high falling tone, and Sad and Neutral, low tones. Wang et al. (2005) reported that in contrast to neutral speech, the pitch register of Happy speech is higher, and the slope of F0 contour of the final syllable of each prosodic word is steeper, especially for the syllable at the end of the sentence.

The current study
Compared to acoustic examinations of emotional speech, the number of articulatory studies on emotional speech is limited; more case studies are thus warranted to advance our understanding of emotional speech in terms of articulation. The paucity of articulatory data of emotional speech is most likely due to the challenges of making articulatory recordings such as EMA, which is not available in every phonetics laboratory, along with the relative newness of such techniques, combined with the difficulty of recording emotional utterances in a lab setting.
The motivation for our study is to add more data to the literature on emotional speech in Mandarin Chinese (and, we hope, more generally in natural languages) by reporting on two additional Mandarin Chinese speakers' (one male, one female) articulation of emotion; moreover, in contrast to some of the student participants in the pilot tests, the two speakers in our study were middle-aged with considerable ability to comfortably express emotions, as will be described later.
Previous studies on Mandarin Chinese tended to assume that all utterances intended by the speaker as emotional also convey that intended emotion to listeners. However, even with spontaneous emotional expressions (Erickson et al. 2006; Spring et al. 1992) or professionally-acted emotions (e.g., Dang et al. 2010), not all utterances are perceived as the emotion intended by the speaker. A related theoretical question along these lines is whether emotions are expressed to be communicated to others (Ohala 1994), which predicts that particular emotions should have particular acoustic targets, or whether they are part of a human's cathartic system, produced to bring about relief to one's intense emotional experiencing of pain, pleasure, fear, sorrow, etc. and as such "targetless" in terms of acoustics (see e.g., Mazo et al. 1994). The answer may actually depend on the speaker and the specific situation (see e.g., Erickson et al. 2012) and will be discussed later in the interpretation of our acoustic and articulatory results. In this study, we tested what acoustic characteristics allowed listeners to perceive the "right" emotional state. Since only a small number of speakers' articulatory data has been reported on in the literature, the additional two speakers analyzed in this study will both substantiate previous findings and uncover new ones, as part of the ongoing process of understanding the articulation, acoustics and perception of emotional speech in Mandarin Chinese, and again, natural languages in general.

Method
Articulatory and acoustic recordings were made of two Mandarin Chinese speakers producing Neutral, Angry, Sad and Happy utterances. Perception tests were conducted to see how well the emotions were perceived by listeners.

Data recording
Articulatory recordings were made using the 3D EMA (Carstens AG500) at the Japan Advanced Institute of Science and Technology (JAIST, Kanazawa, Japan). Acoustic signals were simultaneously recorded. Based on the findings reviewed in section 1.2, especially Erickson et al. (2000), this paper reports on the recordings of two EMA sensors: one glued to the lower medial incisors to track jaw motion and another glued to the tongue dorsum (see Figure 1). Head movement was corrected using four sensors attached to the upper incisors, the bridge of the nose, and the left and right mastoid processes behind the ears. The articulatory and acoustic data were digitized at sampling rates of 200 Hz and 16 kHz, respectively. The occlusal plane was estimated using a bite plane with three additional sensors. In post-processing, the articulatory data were rotated to the occlusal plane and corrected for head movement using the reference sensors after low-pass filtering at 20 Hz. The lowest vertical position (maximum displacement) of the jaw with respect to the bite plane was located for each target syllable of the utterance using the MATLAB-based software mview (Haskins Laboratories); this measure indicates the amount of the articulator's displacement from the occlusal plane.
Tongue dorsum measurements were made at the time of the lowest vertical jaw position. The horizontal position of the tongue dorsum relative to the upper incisor is indicated as TDx, and high values indicate forward position of tongue; the vertical position of the tongue dorsum relative to the occlusal plane is indicated as TDz, and large values indicate low tongue dorsum position.
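The measurement step described above can be illustrated with a short sketch. It is not the mview procedure itself, only a minimal Python illustration under stated assumptions: the function names, the syllable boundaries (start_s, end_s) and the sign convention (more negative jaw_z means a lower jaw) are hypothetical, and the input is assumed to be already low-pass filtered and rotated to the occlusal plane.

```python
# Sketch: locate the maximum jaw displacement within one target syllable and
# read the tongue dorsum position at that same sample.
# Assumption: jaw_z holds vertical jaw positions in mm relative to the
# occlusal plane, and more negative values mean a lower jaw.

SAMPLE_RATE_HZ = 200  # articulatory sampling rate reported in the text


def lowest_jaw_sample(jaw_z, start_s, end_s, rate_hz=SAMPLE_RATE_HZ):
    """Return (sample_index, time_s, displacement_mm) for the lowest jaw
    position between start_s and end_s (hypothetical syllable boundaries)."""
    first = max(0, int(round(start_s * rate_hz)))
    last = min(len(jaw_z), int(round(end_s * rate_hz)) + 1)
    if first >= last:
        raise ValueError("syllable interval contains no samples")
    # The lowest jaw position is the minimum vertical value in the interval.
    idx = min(range(first, last), key=lambda i: jaw_z[i])
    # Displacement is reported as a positive distance from the occlusal plane.
    return idx, idx / rate_hz, abs(jaw_z[idx])


def tongue_dorsum_at(td_x, td_z, idx):
    """Read the tongue dorsum coordinates (TDx, TDz) at the jaw-minimum sample."""
    return td_x[idx], td_z[idx]
```

Taking the tongue dorsum values at the index returned by lowest_jaw_sample mirrors the statement that tongue dorsum measurements were made at the time of the lowest vertical jaw position.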
Brought to you by | Kobe Daigaku Authenticated Download Date | 12/28/17 8:55 AM Acoustic measurements of the vowels were obtained using Praat (Boersma & Weenink 2015). Duration, maximum F0, maximum intensity, F1 and F2 were measured, by creating a 10 ms analysis window centered in the middle of the vowel. Approximately 6 repetitions of each utterance-type were successfully recorded and analyzed.

Stimuli and analyses
For this paper, we report on 3 Mandarin Chinese phrases: 巴拿马 [ba1 na2 ma3] ('Panama'), 妈妈骂马 [ma1 ma ma4 ma3] ('Mother curses the horse'), and 大把大把 [da4 ba3 da4 ba3] ('a lot of'). The vowels in the syllables are all the same, /a/, since jaw displacement varies according to vowel quality. For example, in English, high vowels and mid vowels can differ by approximately 2 mm, and so do mid vowels and low vowels (Menezes & Erickson 2013; Williams et al. 2013); similar findings have been reported for Japanese (Erickson & Kawahara 2016; Kawahara et al. 2014, submitted). Although no studies that we know of have addressed this issue in Mandarin Chinese, we assume that there is a similar relationship between vowel height and jaw displacement. This study thus controlled for the vowel quality in the stimuli. We examined the utterances with the low /a/ vowel, because jaw displacement patterns tend to be most clearly articulated with low vowels, but ongoing work is also looking at utterances with high and mid vowels.
For the current acoustic and articulatory analyses, we focus specifically on the word/phrase final syllable for the following reason. Recent research suggests that utterance-final syllables in Mandarin Chinese are prominent, showing increased jaw displacement and increased vowel duration (Iwata et al. 2015). For the speakers in this study, increased jaw displacement for the final syllable is also seen, regardless of the emotion expressed (see Figures 2 and 3 below). Additional motivation for focusing on the last syllable comes from reports by Wang et al. (2005) that the greatest excursion of F0 occurs on the final syllable of each prosodic word and especially on sentence-final syllables.

Speakers and elicitation
Two middle-aged native speakers of Mandarin Chinese served as the speakers for the current experiment, one male (C02) and one female (C03), both born and raised in Beijing.
The utterances were spoken with different emotional expressions: Neutral, Angry, Sad and Happy. The speakers were asked to first speak the Neutral sentences, six randomized repetitions as part of a larger data set; then they were instructed to change their emotional attitude by remembering a situation where they felt very Angry, and to speak 6 repetitions of a set of words presented in a randomized order. Then, they were asked to put themselves in a "sad emotional situation" and to speak the sentences, remembering a situation where they were very Sad, and then the same for Happy. Between each set of emotions, the speakers were given time to set their moods.
C02 had no experience of acting, but had participated in amateur comedy talk shows in his college days. C02 describes himself as a "cry-baby" i.e., he cries easily when seeing a movie, reading a book, listening to music. So he was able to cry with tears even during the experiment. During the experiment, he imagined himself in the pain of losing his loved ones for Sad, or being angry with someone for something they had said to him for Anger; for Happy, he thought about his son and daughter.
C03 had no experience of acting, although her aunt was a professional actress, and she also was very convincing in her expressions of emotion. When she was doing the experiment, she imagined that she was very mad at someone who did something wrong (Anger), or felt very relaxed and excited about a trip she was going to take (Happy), or imagined that she would fail to do anything, so what an unfortunate woman she was (Sad).
Each speaker afterwards reported that they experienced physiological changes for each of the emotion sets, especially C02, who was speaking with tears in his eyes for Sad and with his hands shaking and heartbeat rising for Angry. For this speaker, it took about ten minutes to recover from each emotion, before he could do the next one, and music was played to calm him down. C03 also reported experiencing physiological changes, but not to the same degree.

Perceptual evaluations
The recorded utterances were randomized to make two perception tests, one for C02 and one for C03, for a total of 72 utterances and 73 utterances, respectively. The stimuli were presented to 18 Mandarin Chinese students at a Japanese university in the Kansai area.1 Each listener participated in the two perception tests. The listeners were asked to (a) rate how emotional the utterances were (1 to 5, with "5" as "extremely emotional", "3" as "emotional", and "1" as "not emotional") and (b) identify what emotion they heard: Angry, Happy, Sad, Neutral, or Other emotion. Each test was preceded by a practice test of 5 utterances. The tests were presented on a computer interface through headphones in a quiet room.

Perception results
The results of the perception tests are the basis for the subsequent acoustic and articulatory analyses. We analyzed those utterances that were (a) rated by listeners as being "emotional", i.e., given a rating of 3, 4 or 5 in answer to the first question in the perception test, and (b) judged to be the intended emotion (answer to the second question in the perception test). In this way, the paper focuses on perceived emotion, not intended or produced emotion, i.e., we examined the acoustic and articulatory characteristics of emotional speech that was perceived correctly.
The overall perception test results for speakers C02 and C03 are shown in Tables 1, 2 and 3. Table 1 shows, on average, how emotional each utterance was judged by the listeners. Table 2 shows the total number of Neutral, Angry, Happy and Sad utterances, and how many of these were perceived accurately. Table 3 is a confusion matrix. Table 1 shows that listeners generally judged emotional utterances as more emotional than the Neutral utterances. All the emotional utterances, except for Happy of C02, were rated by listeners with "3" or above ("3" was "emotional" on the given scale). The highest rated was C02's expressions of Sad, which were rated as "very emotional". Neutral was rated as not emotional for both speakers; speaker C02's Happy was rated as not very emotional. Table 2 shows that Sad utterances were best perceived as the speaker intended and Happy, the least well perceived.
1 An anonymous reviewer suggested it would be interesting to ask the speakers themselves to do a self-evaluation; while we agree that this is an interesting possibility, we asked other listeners to make evaluations as a way to validate the emotional stimuli for the perception tests. Our analysis is thus based on perceived emotion, rather than intended emotion.
In Table 3, the accurately-perceived answers are shown in bold. The results show that listeners perceived all expressed emotions far better than chance level (1/5 = 20% is chance level). Listeners' correct identification rate was highest for Sad for both C02 and C03 (99% and 95%, respectively) and lowest for Happy (47% and 72%, respectively). Happy for C02 was confused with Neutral (23%) or Angry (17%) and for C03, with "other" (16%). Recall also that C02's Happy speech was not judged to be very emotional (Table 1).
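To make the confusion-matrix percentages and the 20% chance level concrete, the following is a minimal Python sketch of how row percentages could be tallied from listener responses. The function names and the input format (one pair of intended and perceived labels per listener judgment) are assumptions made for illustration, not a description of how the tables in this paper were produced. The comparison with chance here is purely descriptive and is not a significance test.

```python
from collections import Counter, defaultdict

# The five response options offered to listeners in the perception test.
RESPONSE_OPTIONS = ["Angry", "Happy", "Sad", "Neutral", "Other"]
CHANCE_LEVEL = 1 / len(RESPONSE_OPTIONS)  # 1/5 = 0.2, as stated in the text


def confusion_percentages(responses):
    """responses: iterable of (intended, perceived) pairs, one per judgment.
    Returns {intended: {perceived: percentage of that row's judgments}}."""
    counts = defaultdict(Counter)
    for intended, perceived in responses:
        counts[intended][perceived] += 1
    table = {}
    for intended, row in counts.items():
        total = sum(row.values())
        table[intended] = {opt: 100.0 * row[opt] / total for opt in RESPONSE_OPTIONS}
    return table


def above_chance(table):
    """Return the intended emotions whose correct-identification rate
    exceeds the chance level (a descriptive comparison only)."""
    return {emo for emo, row in table.items()
            if row.get(emo, 0.0) / 100.0 > CHANCE_LEVEL}
```

Each row of the resulting table sums to 100%, so the diagonal entry of a row is the correct identification rate for that intended emotion.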

Correlation analyses between phonetic characteristics and listener ratings of emotional degree ("emotional-ness")
In order to understand what phonetic characteristics contribute to an utterance being perceived as emotional, a Pearson-correlation analysis was run between the phonetic characteristics of each utterance and the emotional degree, i.e., how emotional the speech was rated by the listeners. The results are shown in Table 4. The results show that for both speakers, the higher pitched (F0 Max) and louder (Intensity Max), the more emotional the speech was judged to be. For C03, in addition, the longer the syllable, the more emotional, whereas for C02, the more open the jaw/the higher the F1, the more emotional. For C03, the amount of tongue dorsum (TDx) fronting may function as a cue for emotion: the more fronted the tongue dorsum, the more emotional it was perceived to be.
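For readers who wish to reproduce this kind of analysis on their own data, a minimal sketch of the Pearson correlation coefficient in plain Python is given below. The function name and the example variable names in the usage note are hypothetical; the sketch computes only the coefficient r and does not compute the associated significance level.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences,
    e.g. a phonetic measure per utterance and its mean emotional rating."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)
```

With, say, a list f0_max of maximum F0 values and a list rating of mean listener ratings for the same utterances (both hypothetical names), pearson_r(f0_max, rating) returns a value between -1 and 1, where positive values correspond to the pattern reported here: the higher the pitch, the more emotional the rating.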

Analyses of specific measures
Next, let us move on to the articulatory characteristics of different types of emotional speech. We analyze here those utterances that were perceived above chance by listeners as the emotion intended by the speaker. The overall pattern of mean jaw displacement for each of the emotional expressions is shown in Figures 2 and 3, in which a larger bar indicates a greater jaw displacement/mouth opening, and the x-axis indicates syllables. One general pattern we observe is that the jaw opens the most in the sentence-final syllables, regardless of the emotional type. Another observation is that on the utterance-final syllables, C02 has the largest jaw opening for Angry voice, whereas C03 shows the largest jaw opening for Happy speech. C02 also has larger jaw opening for Happy compared to Neutral and Sad.2
Fig. 2. Amount of jaw displacement by syllable for all utterances that were well-perceived as the emotion intended by speaker C02. (All analyses in this paper concern the emotions that were well-perceived, i.e., perceived above chance, as the emotion intended by the speaker.) The y-axis indicates the amount of jaw displacement (mm), so that a larger value indicates a larger mouth opening. The x-axis indicates the syllable. Error bars show standard error. The colors indicate the different well-perceived emotions.
In order to better document and understand the salient articulatory and acoustic characteristics of each well-perceived (i.e., perceived above chance) emotional expression (Neutral, Angry, Happy, Sad), Tables 5 and 6 show the mean values (and the standard deviations) of the phonetic characteristics of the final syllable: maximum jaw displacement and tongue dorsum position, as well as maximum F0, maximum intensity, duration, F1 and F2. The jaw and tongue dorsum (TDz) measurements are in terms of mm from the occlusal plane; the tongue dorsum-x (TDx) measurements are such that larger negative numbers indicate tongue position is further back in the mouth.
2 An anonymous reviewer asked whether it is a relative difference between the neutral "baseline" and each particular emotion or it is absolute characteristics of each emotion type that matters for the perception of each type of emotion. It is not surprising if listeners use the Neutral speech as their perceptual baseline for some kind of normalization, but we note that it is not impossible to tell a speaker's emotional state, even when we do not know that person's speech beforehand. This question, therefore, requires further thinking and experimentation.
Table 5. Mean values and standard deviations of phonetic measurements for the final syllables for each emotional category perceived above chance level (20%) for C02. Number of utterances (N): for Sad = 15~17; Angry = 17~18; Happy = 16~17; Neutral = 19. The jaw and tongue dorsum (TDz) measurements are in terms of mm from the occlusal plane; the tongue dorsum-x (TDx) measurements are such that larger negative numbers indicate tongue position is further back in the mouth.
In order to examine whether the emotions differ significantly in terms of their phonetic characteristics, ANOVAs were run with item as the random factor, EMOTION as the independent factor and each of the 8 measured phonetic values as the dependent variables. The results, shown in Tables 7 and 8, demonstrate that for both speakers, each of the phonetic values changes significantly, depending on the emotional expression; the one exception is duration for C02, for whom it does not change as a function of the emotion.
As an additional way of investigating the characteristics of well-perceived Happy, Sad, Angry and Neutral speech, we present a set of 4 scatter plots of the measured data. Figure 4 shows the scatter plots for F1 and F2 for both speakers.
Figure 4 shows that vowel quality changes as a function of the emotion. For both speakers, the /a/ vowel for Happy and Angry has higher F1 (i.e., the mouth is more open); in terms of F2, Happy /a/ for C02 has higher F2 (i.e., the tongue is more fronted) than Angry, but about the same degree of frontness for C03. For both speakers, the vowels in Neutral speech tend to be more central, which indicates that emotional speech may involve some sort of hyperarticulation. For C03, Sad is more back and higher. As for F2, for C02 it is highest for Happy and significantly higher than Neutral or Angry.
For C03, in contrast, F2 is significantly higher for Angry/Happy (no significant difference between Angry and Happy) vs. Neutral vs. Sad.
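One way to quantify the observation that Neutral vowels are more central and emotional vowels more dispersed is to measure how far each emotion's tokens lie, on average, from the overall centroid of the F1-F2 space. The sketch below is only one possible operationalization and is not the analysis carried out in this paper; the function names and the input format are assumptions. Distances are computed in raw Hz, which weights F1 and F2 equally; an auditory scale or normalized formant values may be preferable in practice.

```python
import math


def centroid(points):
    """Mean (F1, F2) of a list of (F1_hz, F2_hz) tokens."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def mean_distance_from(points, center):
    """Average Euclidean distance of the tokens from a reference point."""
    return sum(math.dist(p, center) for p in points) / len(points)


def dispersion_by_emotion(tokens):
    """tokens: {emotion: [(F1_hz, F2_hz), ...]} (hypothetical input format).
    Returns {emotion: mean distance from the centroid of all tokens}."""
    all_points = [p for pts in tokens.values() for p in pts]
    center = centroid(all_points)
    return {emo: mean_distance_from(pts, center) for emo, pts in tokens.items()}
```

Under this measure, a larger value for Happy, Angry or Sad than for Neutral would be consistent with the hyperarticulation interpretation suggested by Figure 4.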
F0 and intensity also change as a function of the emotion. Figure 5 shows scatter plots for C02 and C03, respectively. For F0 Max of C02, Angry is higher than Happy, and Sad is higher than Happy/Neutral (no significant difference between Happy vs. Neutral). For C03, F0 Max is significantly different for all the emotions, except Angry vs. Sad; specifically, Happy is higher than Angry, then Sad, and then Neutral. As for maximum intensity, for C02, it is significantly different for all the emotions, except Happy vs. Sad; specifically, Angry is louder than Happy/Sad, and then Neutral. For C03, maximum intensity is significantly different for all the emotions; specifically, Angry is louder than Happy, then Sad, then Neutral. Overall, in terms of maximum F0 and intensity, the emotions tend to be better separated for C03 than for C02.
Figure 6 plots jaw opening against duration. For C02, the amount of jaw displacement is significantly different for all the emotions, with significantly larger jaw displacement for Angry, then Happy, then Sad, and then Neutral; however, duration does not change significantly. For C03, the jaw is significantly different for all the emotions, except for Angry vs. Neutral. Happy syllables (not Angry) are associated with the most open mouth, then Neutral, and then Angry and Sad. With regard to duration, for C03, it is significantly longer for Angry/Sad than Happy/Neutral. In addition, for C03, there is an interesting relationship between jaw opening and duration as a function of the emotion: for Angry, it is mostly vowel duration that tends to increase, while for Happy and Sad, it is mostly jaw opening that increases. For Neutral, we see that as the jaw opening increases, duration becomes longer (a similar finding for Neutral was also reported by Iwata et al. 2015 and).
Finally, looking at Figure 7 for C02, we see that Sad has the lowest, most back TD position; for both C02 and C03, the TDx position is significantly different for all the emotions, except for Angry vs. Neutral for C02. As for the TDz for both speakers, it is significantly lower for Sad than the other emotions. A summary of the acoustic and articulatory findings of emotional speech compared with neutral speech is shown in Table 11.
Table 11. Summary of acoustic and articulatory findings of well-perceived emotional speech compared with well-perceived Neutral speech. Significance levels are shown with the number of asterisks: *** = p<.001; ** = p<.01; * = p<.05. Blank cells indicate no significant differences.
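As an illustration of the kind of test that underlies significance levels such as those summarized above, the sketch below computes the F statistic of a simple one-way ANOVA with emotion as the only factor. This is a deliberate simplification: the ANOVAs reported in this paper treated item as a random factor, which this sketch omits, and the function name and input format are assumptions. Converting F to a p-value requires the F distribution, which is available in statistics packages (for example, scipy.stats.f_oneway performs the whole one-way test).

```python
def one_way_anova_f(groups):
    """groups: list of lists of measurements, one inner list per emotion.
    Returns (F, df_between, df_within) for a fixed-effects one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or any(len(g) == 0 for g in groups) or n_total <= k:
        raise ValueError("need at least two non-empty groups and n > k")
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group variability: group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group variability: observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variance")
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within
```

For example, passing four lists of final-syllable jaw displacements (one per well-perceived emotion) would yield the F value and degrees of freedom for the effect of emotion on jaw displacement under this simplified model.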

Discussion
First let us summarize the phonetic characteristics that cue an utterance being heard by listeners as emotional, as reported in Table 4. The results suggest that louder and higher pitched utterances are heard by listeners as "emotional", a finding also reported by other studies on emotional speech (see e.g., Erickson 2005). Additionally, increased F1 served as a cue to emotion, similarly reported for happy laughing speech by Szameitat et al. (2011). For one speaker (C03), increased duration is a cue to an utterance being heard as emotional, also as reported by e.g., Erickson (2005). In terms of articulation, for one speaker (C02), the more open the jaw, the more listeners heard the utterance as emotional (see also Erickson et al. 2000); also for this speaker, the lower the tongue dorsum, the greater the perceived emotion, especially for Sad. The second topic concerns whether listeners could perceive the specific emotions intended by the speakers. This study shows that listeners perceived all emotions better than chance. Sad was best perceived, especially for C02, and Happy least well. That C02's Sad was well-perceived as emotional is not surprising, considering that this speaker was actually crying with tears while speaking. Our current analysis focused on a small subset of possible acoustic/articulatory characteristics of emotional expressions; oral, nasal, pharyngeal resonances, which may have arisen due to actual crying, among many other things, are not examined here, but probably contribute to the listeners' perception of Sad. That Happy in general is not as well perceived as the other emotional expressions is not surprising either, since similar findings have also been reported in the literature (e.g., Erickson 2005; Scherer et al. 2001). More work needs to be done to address why; however, our conjecture based on work with social affective expressions (e.g., Shochi et al. 2009) is that Happy is a socially positive expression and as such, is less well marked.
Angry and Sad are more marked in that they convey information related to survival and self-protection. Another reason that C02's productions were especially difficult for listeners to recognize as Happy may be that the maximum F0 for his Happy was rather low, whereas generally Happy has a high F0 (e.g., Li et al. 2010). It is also surprising that C02's Angry expressions were not better perceived as Angry, especially since C02 was actually physically shaking with anger during the recordings. This may have to do with the fact that the nine Angry utterances that ended without rising intonation were those that were perceived as Angry, while those nine Angry utterances that were perceived as Happy ended with rising intonation. A similar finding about boundary tones and emotions was reported by Li et al. (2010), i.e., Angry utterances tend not to have boundary tones, while Happy ones had rising ones.
A third topic concerns our findings about the acoustic and articulatory characteristics on the word/phrase final syllables of well-perceived emotions (Happy, Angry, Neutral, Sad). In terms of F0, Happy had the highest F0 Max for one of the speakers (C03), similar to that reported by Li et al. (2010) and Wang et al. (2005); however, for the other speaker, C02, Angry had the highest F0 Max, and perhaps this is one of the reasons that his Angry was sometimes confused with Happy, as discussed above. However, for both speakers, Angry was the loudest.
In terms of jaw displacement, the largest jaw opening is for Angry for C02 (see Erickson et al. 2000 who reported a similar finding), while it is for Happy for C03. As for vowel duration, we see for C03 that the negative emotions (Sad and Anger) have longer durations, as was also reported by Lin and Fon (2012); also by Zhang et al. (2006) for Sad. For C02, however, we do not see this. Concomitant increases in jaw opening and duration have been reported for final (Neutral) syllables by Iwata et al. (2015). It is interesting that for emotional speech, the jaw and duration may, at least for some speakers, work independently. This interplay between emotion, jaw displacement and duration needs to be investigated further.
In terms of tongue dorsum positions, for both speakers Sad has a low back tongue dorsum position and Happy the most forward tongue dorsum for Speaker C02. It is interesting that tongue positions (Figure 7) do not seem to match the F1-F2 patterns (Figure 4). Prior work with English comparing formant frequencies with TDx-z positions of emphasized vs. non-emphasized syllables shows a strong match with tongue articulation and formants: for emphasized /a/-vowels, the jaw opens more with the tongue more back and low, resulting in higher F1 and lower F2 (Erickson 2002). However, for emotional speech, at least for these two speakers, we do not clearly see this pattern. We do not have an explanation for this discrepancy at present, other than the conjecture that different types of emotional expressions are produced with different types of tongue (as well as lip, and jaw) articulations. It is possible that these articulatory changes lead to changes in voice qualities. This is a topic of future research.
In actuality, emotions are complex and often there is no single emotion expressed in an utterance (see e.g., Dang et al. 2010). Moreover, the emotional labels used can lead to confusing results. For instance, in this paper, we have analyzed what is often referred to as "Hot Anger", in contrast to "Cold Anger". The characteristics of these two types of anger are different; for instance, the former tends to be high-pitched and loud (e.g., Scherer 1989), with increased jaw displacement (Erickson et al. 2000), whereas the latter is low-pitched and soft (e.g., Scherer 1989), with decreased jaw displacement (Kim et al. 2014). The term "Sad" is similar: frequently it refers to a soft, low-pitched expression (e.g., Scherer 1989), which is what C03 performed; this contrasts with active grieving sadness (e.g., Erickson et al. 2006; Scherer 1989), which is what C02 performed.
One interesting remaining question is, when experiencing emotion, does the speaker have an acoustic target, i.e., does he or she want to produce a sound such that listeners perceive the emotion? Or, does the emotional experience cause changes in the articulation that result in the acoustic changes? Differently put, is emotional speech listener-oriented or speaker-oriented? Along these lines, according to Nguyen et al. (2008), spontaneous emotional speech has different glottal characteristics than acted emotional speech. Erickson et al. (2006) found that acted emotions are better perceived than real spontaneous emotion. Erickson et al. (2009) reported that smiling speech was sometimes judged as Sad, and they discussed possible reasons for discontinuities between how a speaker produces the sound, and how it is perceived by listeners. Erickson et al. (2006) reported that there may be a difference in the phonetic characteristics of emotional speech as produced by a speaker in a highly intense emotional situation vs. those characteristics evaluated by listeners as being emotional. These findings may suggest that a highly emotional person may be "inside his/her own world" where the expression of emotion is a personal, cathartic activity. This contrasts with a more "acted" style of emotion, in which the speaker is aware of both him/herself and the other outside person, with the emotional expression more a communicatory activity. With regard to the current two speakers, we might say that the expressions of C03 were well-expressed, well-perceived acted emotions, whereas those of C02, especially for Sad and Angry, were spontaneous, heart-felt, experienced emotions. The type of spontaneous expressions of emotion by C02 may have no acoustic target per se, whereas those of C03 may indeed have acoustic targets, along the lines advocated by Ohala (1994).
Much more research is needed on the challenging and complex topics of acted vs. experienced emotions, and the larger topic of the presence of articulatory/acoustic targets.

Summary
This study examined the acoustic and articulatory changes associated with emotions as produced by two speakers of Mandarin Chinese. The results confirm that a speaker's voice changes in terms of acoustics and articulation when he or she is expressing different emotions. Moreover, listeners are able to make judgments about how emotional the speech is, and what specific emotion the speaker is expressing, although not all intended emotions were perceived as that particular emotion by listeners. Another new aspect of this study is that we report on articulatory and acoustic characteristics of "accurately" perceived emotional expressions. In general, emotional speech, and especially Angry or Happy, tends to be louder and higher pitched; moreover, F1 and F2 (for the vowel /a/) are more dispersed (i.e., hyperarticulated) in emotional speech than non-emotional speech, a finding also reported by e.g., Li et al. (2010).
Other new findings are that Happy tends to be produced with a more fronted vowel, and Sad, with a more backed vowel; jaw and tongue dorsum position also tend to be lower for emotional speech, with the tongue more forward for Happy speech and more back (and low) for Sad speech. Duration, at least for one speaker, however, does not change as a function of the emotion, even though jaw displacement does. This suggests that jaw displacement and duration can be independent,4 especially during emotional speech. Numerous other interspeaker differences also are reported, and more work is needed, especially data analysis of more speakers.
This study is but a tip-of-the-iceberg report on various observations about some of the acoustic and articulatory characteristics of Mandarin Chinese emotional speech, specifically for phrase/word final syllables on Tone 3. Future work needs to examine the effect of different tones on emotional expressions, ideally with different sets of vowels. Although this paper is based on a small set of data, it is offered as a stepping stone toward a better understanding of the acoustic and articulatory characteristics of emotional speech expressions in Mandarin Chinese.