Emotional responses of Korean and Chinese women to Hangul phonemes according to the gender of an artificial intelligence voice

Introduction This study aimed to explore the arousal and valence that people experience in response to Hangul phonemes based on the gender of an AI speaker, through a comparison of Korean and Chinese cultures. Methods To achieve this, 42 Hangul phonemes, combining three Korean vowels and 14 Korean consonants, were used to explore cultural differences in arousal, valence, and the six foundational emotions based on the gender of an AI speaker. A total of 136 Korean and Chinese women were recruited and randomly assigned to one of two conditions based on voice gender (male or female). Results and discussion This study revealed significant differences in arousal levels between Korean and Chinese women when exposed to male voices. Specifically, Chinese women exhibited clear differences in emotional perceptions of male and female voices in response to voiced consonants. These results confirm that arousal and valence may differ with articulation types and vowels due to cultural differences and that voice gender can affect perceived emotions. This principle can be used as evidence for sound symbolism and has practical implications for voice gender and branding in AI applications.


Introduction
Artificial intelligence (AI), which can be considered the core technology of the Fourth Industrial Revolution, has various applications. In particular, speakers, an important function of AI, facilitate communication between humans and AI. However, a challenge remains as to whether AI speakers can evolve from providing basic convenience functions, such as weather notifications and alarm settings, to providing better human-machine interaction. Previous studies on the emotion recognition and emotional expression of AI have been actively conducted toward better comprehensive communication. Most studies aimed to determine whether AI could recognize and express human-like emotions based on a user's emotional state. However, communication is two-way in nature. Thus, it is important to consider not only the emotions conveyed by AI but also how humans perceive the messages delivered by AI speakers. In the context of marketing channels, when users of AI speakers experience positive rather than negative emotions with AI speakers, their preference for these devices increases (Jang and Ju, 2019).
Humans communicate, express, and spread emotions through language. Notably, the rapid identification of emotional elements in the sound stimuli of communication plays an important role in survival and adaptation (Ramachandran and Hubbard, 2001). According to sound symbolism, specific phonemes, which are the fundamental elements of sound in a language, convey meaning independently (Lowrey et al., 2003). The Bouba-Kiki effect, a good example of sound symbolism, refers to the phenomenon whereby people associate round shapes with "Bouba" and pointed shapes with "Kiki" (Ramachandran and Hubbard, 2001; Maurer et al., 2006; Pejovic and Molnar, 2017). Similarly, the gleam-glum effect demonstrates that words containing /iː/, such as "gleam," are perceived as more positive than words containing /ʌ/, such as "glum" (Yu, 2021). However, little agreement exists on whether these effects are universally applicable, regardless of native language or age.
The question of whether phonemes have sound symbolism remains unanswered (Slunecko and Hengl, 2006). Although some studies have indicated a common theme of sound symbolism, the results vary, which is likely because the phonemes can be classified into consonants and vowels. Recent studies have revealed that the Bouba-Kiki effect varies between Eastern and Western cultures (Chen et al., 2016) and can change depending on differences in native language (Styles and Gawne, 2017). These previous findings suggest that sound-shape mapping related to consonants may be influenced by individual perceptual style and linguistic experience (Rogers and Ross, 1975; Chen et al., 2016; Shang and Styles, 2017; Chang et al., 2021).
This study adopted the classification of consonants as plain, aspirated, and voiced consonants, which is a common method and is recognized to evoke similar emotional impressions in various languages, including English and Korean. However, sound-size mapping associated with vowels is a common phenomenon across cultures and languages because of its lack of sensitivity to cultural backgrounds or native languages (Shinohara and Kawahara, 2010; Hoshi et al., 2019; Chang et al., 2021). Therefore, vowels were selected based on the symbolism of vowel sounds. For instance, in the early research on sound symbolism by Sapir (1929), experiments were conducted on the size symbolism of the vowels /a/ and /i/ using the meaningless words "mal" and "mil." Participants were asked to identify which word referred to a large table and which to a small table. Approximately 80% of participants indicated that "mal" denoted the large table and "mil" the small table. This suggests that /a/, when added to an existing word, conveys a soft feeling because it is a central, low vowel, indicating augmentation for distant or large objects or long durations. Conversely, /i/ is considered to represent close and small objects or short durations. These research findings highlight the influence of mouth shape during pronunciation. In terms of the dimension of aperture, high vowels, such as /i/ or /u/, involve a smaller aperture, whereas low vowels, such as /a/, involve a larger aperture, potentially conveying different symbolic meanings (Shinohara and Kawahara, 2010). Based on this evidence, this study adopted three representative vowel types that can induce different states and constructed 42 combinations of consonants and vowels.
To date, research on sound symbolism has mainly focused on vowels rather than consonants because consonants cannot be pronounced without vowels; thus, the sound symbolism of vowels has been considered greater than that of consonants (Aveyard, 2012). However, actual language cannot ignore the influence of consonants, and comparing only the differences in vowels can limit recognition of the emotional meaning. Therefore, considering the practicality of language, this study attempts to measure the emotional values of both vowels and consonants through classification according to the articulation method (Kim, 2019).
When evaluating the emotional values of stimuli, arousal and valence are the two most basic dimensions (Russell, 1980). Arousal is evaluated based on how exciting or calming a stimulus is, and valence is evaluated based on how pleasant or unpleasant a stimulus is. According to empirical studies, the arousal and valence dimensions are not independent of each other and exhibit a U-shaped relationship. Thus, unpleasant stimuli are considered more arousing than pleasant stimuli, and both unpleasant and pleasant stimuli are more arousing than neutral stimuli (Libkuman et al., 2007; Grühn and Scheibe, 2008). In general, negative emotional stimuli are considered to have a higher arousal value than positive or neutral stimuli (Ekman et al., 1983).
However, preferences for words that express emotions show differences, indicating that cultural differences can also occur between language and emotional meanings (Park et al., 2018). Thus, even in the same emotion-evoking situation, the terms cognitively interpreted and used differ across cultures, and depending on how emotional words are translated, they can have distinct meanings (Hahn and Kang, 2000). Furthermore, many studies have measured the properties of sounds, such as rough, soft, strong, or weak. However, because of the ambiguous nature of these adjectives and their lack of integration across studies, accurately classifying how people feel about sound is not possible. Considering these points, this study adopted universal emotions to distinguish between subjective emotional states. The six basic emotions, namely anger, disgust, fear, sadness, surprise, and happiness (Ekman and Oster, 1979), were used to measure subjective emotional states instead of relying on somewhat ambiguous emotional expressions (e.g., softness, strength, weakness, and sharpness) based on the degree of arousal and valence (Russell, 1980, 2003; Barrett, 2006a,b).
The effect of AI speakers on human emotion recognition may include variables such as the voice gender of AI speakers, as well as sound-shape and sound-size. Studies have demonstrated differences in preference for "themes" by voice gender (Kim and Yun, 2021) and in AI usage behavior based on human gender and experience (Ji et al., 2019; Obinali, 2019; Ernst and Herm-Stapelberg, 2020; Kim and Yun, 2021; Wang et al., 2021). For example, in "warm news" delivery, female voices are highly appreciated in terms of understanding, reliability, and favorability, whereas, for news with serious content, male voices are preferred (Kim and Yun, 2021). Thus, gender preferences for an AI speaker may differ according to the gender of the human listener. Although arousal and valence can be perceived in phoneme units, studies on the voice gender of AI speakers have not yet observed an effect of voice gender on phoneme units.
Recently, active research has been conducted on how emotions are expressed and recognized in literary works using AI-based natural language processing and machine learning techniques. In addition, studies and reflections on AI-generated speech and listeners' emotional responses have rapidly evolved in recent years (Val-Calvo et al., 2020). In particular, with advancements in speech recognition technology, there is growing interest in exploring how the tone and expression used by AI when speaking can evoke emotional responses in listeners (Poon-Feng et al., 2014; Zheng et al., 2015). Such studies provide crucial insights into understanding the impact of AI speech technology on people's emotional responses, aiming to inform the effective development and application of this technology.

Frontiers in Psychology | frontiersin.org | Lee et al. 10.3389/fpsyg.2024.1357975

This study used a cultural comparison to explore the arousal and valence that people experience in response to Hangul phonemes according to the gender of an AI speaker. For this purpose, the most basic unit, the Hangul phoneme, was used as an experimental stimulus to evaluate arousal and valence in Korean and Chinese women who can speak Korean. This study aimed to examine the cultural differences in arousal and valence using the articulation method, vowels, and the gender of the AI speaker.
TABLE 2
English notation for articulation in Korean.

Participants and procedure
Using G*Power 3.1.9.7 (University of Düsseldorf, Düsseldorf, Germany), a power analysis was conducted with an effect size of 0.25, an alpha error probability of 0.05, a power of 0.80, and the number of groups set to 4. The analysis showed that the minimum sample size required was 128 participants (32 participants per condition). In total, 136 participants were recruited (136 women; M age = 27.19 years, SD = 4.30) via a university bulletin board in South Korea. The sample consisted of 68 Korean and 68 Chinese participants, who were randomly assigned to one of two conditions in a between-subjects design with voice gender (male or female). All participants were informed that they had been recruited for a psychological experiment measuring emotional responses to phonemes and that all experimental processes would be conducted online. The study was limited to women to better isolate differences in the variables of interest, considering that women generally exhibit greater emotional responsiveness than men. The inclusion criteria were as follows: women who were (1) over the age of 20 years, (2) of Korean or Chinese nationality, and (3) able to speak Korean. Before taking part in the experiment, all participants provided informed consent and were informed that they could stop the experiment at any time. Each participant received $20 for their participation.
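The reported sample-size figure can be checked against a direct noncentral-F power computation. The sketch below is a hypothetical reconstruction of G*Power's "ANOVA: fixed effects, special effects and interactions" routine with numerator df = 1 (the interaction contrast implied by the four groups); the function name and search loop are ours, not part of the study.

```python
# Hypothetical reconstruction of the reported power analysis
# (f = 0.25, alpha = 0.05, power = 0.80, 4 groups, numerator df = 1).
from scipy.stats import f as f_dist, ncf

def min_total_n(f_effect=0.25, alpha=0.05, target=0.80, df_num=1, n_groups=4):
    """Smallest total N whose noncentral-F power reaches the target.

    Noncentrality parameter lambda = f^2 * N; denominator df = N - n_groups.
    """
    n = n_groups + df_num + 1  # smallest N giving a positive denominator df
    while True:
        df_den = n - n_groups
        crit = f_dist.ppf(1 - alpha, df_num, df_den)  # rejection threshold
        power = 1 - ncf.cdf(crit, df_num, df_den, f_effect**2 * n)
        if power >= target:
            return n
        n += 1

print(min_total_n())  # 128, matching the figure reported above
```

Note that 128 is the same total as a two-sample t-test with d = 0.5 at 80% power (64 per group), to which a single-df interaction contrast is equivalent.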

Measures
To control for emotional variables, the participants were asked to complete a questionnaire, described below. No significant differences were found in psychological characteristics between groups (Table 1).

Positive and Negative Affect Schedule Scale
The Korean version (K-PANAS; Lee et al., 2003) and the Chinese version (C-PANAS; Huang et al., 2003) of the PANAS were used to evaluate positive and negative affect. The PANAS comprises 20 items, with 10 evaluating positive affect (PANAS-P) and 10 evaluating negative affect (PANAS-N). Participants rated their responses on a 5-point Likert scale, where 1 indicates "very slightly or not at all" and 5 indicates "extremely." Higher scores indicate higher levels of positive and negative affect, respectively. Cronbach's alpha values were 0.60 and 0.83 for the K-PANAS-P and C-PANAS-P, respectively, and 0.61 and 0.85 for the K-PANAS-N and C-PANAS-N, respectively.
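The Cronbach's alpha values reported in this and the following subsections follow the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). A minimal sketch of that computation (illustrative only, not the study's analysis code):

```python
# Cronbach's alpha for a participants x items response matrix.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = participants, columns = Likert items."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of the scale total
    return k / (k - 1) * (1 - item_vars / total_var)

# Perfectly parallel items yield alpha = 1.0.
print(cronbach_alpha(np.array([[1, 1], [2, 2], [3, 3]], dtype=float)))
```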

State-Trait Anxiety Inventory
The Korean version (K-STAI; Kim and Shin, 1978) and the Chinese version (C-STAI; Tsoi et al., 1986) of the STAI were used to measure state and trait anxiety (Spielberger et al., 1970). This scale consists of 40 items, with 20 measuring trait anxiety (STAI-T) and 20 measuring state anxiety (STAI-S). Participants rated items on a 4-point Likert scale, where 1 indicates "not at all" and 4 indicates "very much so." The higher the score, the more intensely or more often an individual felt anxious. Cronbach's alpha values of the K-STAI-T and K-STAI-S were 0.91 and 0.92, respectively, and those of the C-STAI-T and C-STAI-S were 0.75 and 0.74, respectively.

Center for Epidemiologic Studies Depression Scale
The Korean version (K-CES-D; Chon et al., 2001) and the Chinese version (C-CES-D; Chi and Boey, 1993) of the CES-D were used to measure baseline depressive mood in participants (Radloff, 1977). This scale consists of 20 items; participants report how often each item occurred over the past week, with four response options ranging from 0 ("rarely or none of the time") to 3 ("most or all of the time").

Stimuli
For the experiment, 42 Hangul phonemic stimuli with artificial human sounds were created with a text-to-speech (TTS) program. All stimuli had an equal duration of 500 ms. Considering participants' fatigue, 42 voice stimuli were used, combining three Korean vowels and 14 Korean consonants (Table 2). The three vowels used were those with the largest differences in pronunciation structure (Lee and Lee, 2000), and the 14 consonants used excluded double consonants. These consonants were classified into three types according to the articulation system of classification (Kim, 2019). Specifically, they were classified as lenis consonants if they did not require heavy breathing or straining of the throat, aspirated consonants if they involved the release of a burst of strong air during plosive sounds, and voiced consonants if they resonated in the mouth or nose when pronounced.

Data analysis
A dataset of 136 samples was included in the final analysis. As the first step in the analysis, a 2 (nationality: Korean, Chinese) × 3 (articulation: lenis, aspirated, voiced) ANOVA and a 2 (nationality: Korean, Chinese) × 3 (vowel: /a/, /u/, /i/) ANOVA were conducted to identify differences in sound symbolism between participants of different nationalities. In addition, a 2 (nationality: Korean, Chinese) × 2 (voice gender: female, male) two-way ANOVA was conducted to explore the arousal and valence patterns according to nationality and voice gender. Finally, a 2 (voice gender: female, male) × 3 (articulation: lenis, aspirated, voiced) mixed ANOVA and a 2 (voice gender: female, male) × 3 (vowel: /a/, /u/, /i/) mixed ANOVA were conducted separately for Korean and Chinese participants to explore the patterns of arousal and valence for each nationality.
An independent sample t-test was performed for continuous variables to compare psychological characteristics and perform planned comparisons.
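The 2 × 2 between-subjects step can be sketched as follows. This is a hypothetical illustration on simulated ratings (the variable names, cell means, and injected interaction are ours, not the study data); only the design (nationality crossed with voice gender, yielding a residual df of 132) mirrors the text.

```python
# Sketch of the 2 (nationality) x 2 (voice gender) ANOVA on arousal,
# run on simulated ratings rather than the study data.
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
rows = []
for nat in ("Korean", "Chinese"):
    for gender in ("female", "male"):
        # Hypothetical interaction: higher arousal for male voices
        # among Chinese raters only.
        shift = 0.8 if (nat == "Chinese" and gender == "male") else 0.0
        for a in rng.normal(3.0 + shift, 1.0, 34):  # 34 per cell -> N = 136
            rows.append({"nationality": nat, "voice_gender": gender, "arousal": a})
df = pd.DataFrame(rows)

model = ols("arousal ~ C(nationality) * C(voice_gender)", data=df).fit()
table = anova_lm(model, typ=2)  # main effects plus the interaction term
print(table)
```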

FIGURE 3
Interaction effects between nationality and voice gender in arousal.

FIGURE 4
Simple main effect analysis of nationality and voice gender. *P < 0.05.

Differences between nationalities for articulation and vowels
Arousal, valence, and basic emotions were used as dependent variables; three types of articulation (lenis, aspirated, and voiced) and three vowels (/a/, /u/, and /i/) were used as within-subjects factors; and nationality (Korean or Chinese) was used as a between-subjects factor in the repeated-measures ANOVA.
The results presented in Figure 1 indicate that the interaction between the three types of articulation and nationality was significant for arousal [F(2, 268) = 11.93, P < 0.001, η² = 0.08] and valence [F(2, 268) = 7.33, P = 0.001, η² = 0.05]. Planned comparisons were then performed using independent-sample t-tests, which revealed no differences between Korean and Chinese participants for lenis or voiced articulation. However, for aspirated articulation, arousal was higher in Chinese participants than in Korean participants [t(134)

Differences between nationalities for voice gender
The results of the two-way ANOVA, presented in Figure 3, provide statistical support for the interaction between nationality (Korean or Chinese) and voice gender (female or male) for arousal [F(1, 132) = 7.92, P = 0.006, η² = 0.06]. However, no significant interaction effects were found for valence or basic emotions.
However, the one-way ANOVA for the simple main-effect analysis revealed differences in voice gender by nationality (Figure 4).

FIGURE 5
Differences in basic emotions for voice gender by nationality. *P < 0.05, **P < 0.01.

Differences in the value of articulation for voice gender by nationality
As shown in Table 3, the articulation types showed differences in arousal based on voice gender by nationality. Aspirated articulation elicited more arousal than voiced articulation, regardless of voice gender. However, unlike Korean participants, Chinese participants showed significant differences in arousal based on voice gender. In particular, for aspirated and voiced articulation, arousal ratings were significantly higher for the male voice than for the female voice. Valence, however, exhibited the opposite pattern: although arousal was rated lower for the female voice than for the male voice, valence was rated higher for the female voice than for the male voice. Regarding basic emotions, clear differences were observed in the patterns between nationalities (Figure 5). Korean participants showed no significant differences in scores regardless of voice gender, whereas Chinese participants exhibited a difference in scores between voice genders, especially for voiced articulation.

Differences in the value of vowels for voice gender by nationality
As presented in Table 4, differences were found in arousal for the three vowels based on voice gender by nationality. Unlike the articulation results, for vowels, the patterns of arousal for voice gender differed depending on nationality. In Chinese participants, the difference in arousal based on voice gender was remarkable for all vowel types; however, in Korean participants, a difference in arousal based on voice gender was found only for /i/. Regarding basic emotions, clear differences were observed in the patterns between nationalities. For Korean participants, the scores were not significantly different regardless of voice gender. However, for Chinese participants, scores differed between voice genders (Figure 6).

Discussion and conclusion
This study aimed to explore cultural differences by comparing the degree of arousal and valence experienced by Korean and Chinese women in response to Hangul phonemes, based on the gender of an AI voice. The results revealed significant differences in arousal levels between Korean and Chinese women in response to male AI voices. In particular, Chinese women exhibited distinct differences in emotional perceptions of male and female voices for voiced consonants. In addition, this study classified participants by nationality and identified cultural differences in arousal and valence patterns according to articulation and vowels.
This study revealed that arousal and valence levels differed between Korean and Chinese women, even for phonemic units without conceptual meaning. This is consistent with Russell's claim that responses to emotional stimuli may differ across cultures. For vowels, the results contradict previous findings that emotional responses are universal regardless of culture. This disparity is likely because Chinese, unlike Korean, is a tonal language. Although Korean has dialects whose pitch differs from that of the standard language, it is not a tone language that conveys variations in meaning through pitch differences. By contrast, Chinese tones change the meanings of words. Therefore, Korean and Chinese listeners may experience different arousal and valence patterns when hearing the same sound. This finding is supported by a study comparing the vowels /a/, /u/, and /i/ in Chinese, which has lexical tone, and English, which does not, highlighting differences in sound symbolism based on the presence or absence of lexical tones (Chang et al., 2021).
We aimed to observe differences between arousal and emotional outcomes. Clear and distinct differences were apparent in arousal, whereas for emotion, only the interaction between nationality and consonants proved significant. To elucidate, the most substantial difference between Korean and Chinese syllables, apart from phonemic constraints, lies in the permitted pronunciation elements. Korean allows seven syllable-final consonants: ᄇ /p/, ᄃ /t/, ᄀ /k/, ᄆ /m/, ᄂ /n/, ᄋ /ŋ/, and ᄅ /l/. By contrast, in Chinese, syllable-final consonants are restricted to two: /n/ and /ŋ/. This disparity in syllable-consonant combination constraints is the primary cause of phonological variation between Korean and Chinese. The importance lies not only in the difference in the phonemes themselves but also in the interaction effect between nationality and voice gender. The arousal and valence experienced in response to Hangul phonemes varied by nationality depending on whether the voice was female or male. In particular, Chinese women experienced negative emotions when voiced consonants were presented with a male voice, even though voiced articulation elicits less arousal than lenis and aspirated articulation. Although the study was conducted among Chinese participants living in Korea rather than in China, cultural values do not change easily (Hofstede, 1984, 1998), and this result may be attributed to cultural differences in gender roles. According to a cultural-level study, China has a wider gender power gap than Korea; China highly values the image of masculinity (Moon and Woo, 2019), and Chinese women feel that they are not free to express themselves and are restricted in their opportunities to demonstrate their abilities because of men (Sun, 2022). Consequently, Chinese women are less dependent on men and experience a sense of competition with them. Therefore, compared with Korean participants, Chinese participants experienced more negative emotions toward a male voice than a female voice. This trend can be further clarified through research on cultural differences in gender-related issues.
In recent years, the generation and interpretation of literary works and various forms of literature through AI have highlighted the increasing importance of studying people's emotional responses. Specifically, examining the ability of AI to convey or impart specific emotions is important. Currently, utilizing natural language processing and machine learning techniques, research studies investigate how emotions are expressed and recognized in literary works, as well as how the tone and expression AI employs when narrating stories evoke emotional responses in listeners (Spezialetti et al., 2020; Lettieri et al., 2023). This study aimed to determine the emotional responses elicited in listeners by the tones and expression styles used by AI when delivering verbal expressions. Furthermore, in an era marked by an active international approach to media use, this study is significant in exploring specific differences in emotional responses to voices in Chinese and Korean, languages characterized by distinct intonations and cultural and political expressions despite their geographical proximity.

Limitations
This study has several limitations. First, although the study focused on AI, various AI voices were not used for the investigation. Various AI speakers have been released in both Korea and China, and people exhibit different preferences. Therefore, the sound stimuli used in this study may not perfectly reproduce the degree of arousal and valence experienced with the voices of currently available AI speakers. For marketing applications, further research using the voices of various AI speakers is needed. In addition, some AI speakers now allow customers to directly purchase celebrity voices that they like, in which case the results of this study would be challenging to apply, even for male voices.
Second, the study did not include comparisons with other cultures. Korea and China are both cultural regions in Northeast Asia, yet subtle differences emerged across their languages and cultures. The fact that differences according to voice gender were found even between these neighboring cultures suggests that such differences may occur in other cultures and may be relatively larger there. In the future, multicultural studies should be conducted to compare a wide range of languages and cultures.
Third, this study was conducted with female participants. Women tend to respond more emotionally than men, and only women were recruited based on the existing claim that sound symbolism does not differ significantly by gender. However, cultural differences may interact with gender.
Finally, this study used only three vowels (/a/, /u/, and /i/) and 14 consonants; fortis consonants were not used. In Chinese, intonation plays a significant role alongside pronunciation in conveying the meanings of individual words. Hence, future research should carefully examine variations in intonation within words of identical pronunciation to ascertain their emotional impact conveyed through speech. Because intonation is used in Chinese in addition to articulation, a more detailed classification is required to reflect the actual situation.
Despite these limitations, this study is significant in demonstrating that arousal and valence may differ with articulation types and vowels depending on cultural differences, and that voice gender can also affect perceived emotions. This principle supports sound symbolism and has practical implications for voice gender and branding in AI applications.

FIGURE 1
Interaction effects between the three types of articulation and nationality in arousal, valence, and basic emotions: disgust and happiness. *P < 0.05, **P < 0.01.

FIGURE 2
Interaction effects between three vowels and nationality in arousal, valence, and basic emotions: disgust and happiness. *P < 0.05.

TABLE 1
Psychometric characteristics of the participants.
N = 136; all group comparisons are not statistically significant at P < 0.01. Values are mean ± standard deviation. PANAS-P, Positive and Negative Affect Schedule-Positive; PANAS-N, Positive and Negative Affect Schedule-Negative; STAI-T, State-Trait Anxiety Inventory-Trait Anxiety; STAI-S, State-Trait Anxiety Inventory-State Anxiety; CES-D, Center for Epidemiologic Studies Depression Scale.

TABLE 3
Differences in the types of articulation for voice gender by nationality.

TABLE 4
Differences in the value of vowels for voice gender by nationality.