1 Introduction

For several years now, households have been talking to Voice Assistants (VAs) in their homes and have welcomed them as everyday companions [1,2,3,4]. Most users employ them predominantly to control and access home appliances and internet-based services [1, 5, 6], e.g., playing music, setting alarms, requesting weather forecasts, or asking for specific information [5]. By now, VAs contribute significantly to the consumption of and interaction with information [1].

The progress in speech synthesis [7,8,9,10] and voice design [11] allows designers to make voices more human-like [11], less annoying [12], more appealing [13], or more charismatic [14], or to provide contextual cues implicitly [11]. In addition, these new opportunities allow designers to play with gender stereotypes [15], enable voice branding [16], or enrich the voice experience in general [17].

However, most users expect merely efficient and convenient interaction in a utilitarian sense, as past experiences have disappointed them through a lack of personal bonding and emotion [18]. Besides finding the voice interaction boring and monotonous [19], users hope for a lively assistant resembling a friend, one that can express opinions and emotions itself and engage in a conversation [18].

In addition, the auditory channel bears untapped potential for sound design. Several researchers propose to explore interaction and experience beyond the dichotomy of human and machine and to establish new design approaches for voice interaction [4]. Others emphasize integrating more sound design as well [20, 21]. The principles of sound design, such as the sonification of data and interactions [22], musical expression [23], and the design of earcons and auditory icons [24, 25], hold great potential to enrich and enhance the current state of VAs. As stressed by Fagerlönn and Liljedahl: “Sound design can be described as an inherently complex task, demanding the designer to understand, master and balance technology, human perception, aesthetics and semiotics.” [26].

While sound and the sonification of data could supplement the repertoire of speech synthesis and voice design by communicating information and expressing, e.g., moods, atmospheres, and emotions [22, 23], interaction designers have so far not systematically adopted these additional options. In this light, we draw from concepts and theories of sound design to explore the following two research questions:

RQ1: How might sound add to the user experience of Voice Assistants?

RQ2: How can we use sonification of data in information design for Voice Assistants?

In our work, however, we consider sound in its serving function to illustrate and enrich what is spoken by a Voice Assistant. In other words, we focus on the overlay quality of sound as a supplement to the speech output. As weather forecasts are a frequently used VA service, we decided to investigate this use case and its sonification. To this end, we first conducted a user survey with 33 participants to empirically gather associative concepts and sounds for seven perceptible weather conditions. In the next step, we analyzed the design material and developed a sound library of seven distinct audio clips that illustrate our concept of sonic overlays. Finally, we evaluated and discussed our library with 15 participants in a qualitative interview study.

Our work shows that complementing voice interaction with illustrative soundscapes enriches the communication of VAs and is appreciated by potential users. As our empirical findings reveal, layering sound and speech requires careful consideration of how the two relate and of the intended message. Therefore, we propose a user-centered design approach grounded in sound design that employs conceptual associations and the combination of iconic, abstract, and symbolic sounds. Sound overlays, as outlined in this paper, could serve as an alternative to the advancements in speech science that focus on modulating emotions through voice and speech as a design material. Furthermore, implementing sound design in voice interaction might complement the emotional tone of voice of VAs in future designs, and soundscapes add atmosphere that lets speakers tell thrilling stories, as we know from the sound design practices of modern media. Finally, we propose four design implications: (1) investigating soundscapes for voice interaction design, (2) supplementing vocal messages with sound, (3) aiming for authentic soundscapes, and (4) finding a balance between expressiveness and informativeness while coping with trade-offs between clarity and the sonification of information.

2 Related work

Our work is grounded in the following research fields: VAs and voice interaction design (see Section 2.1), earcons and sonic information design (see Section 2.2), and the design of sound effects and sonic experiences in general (see Section 2.3). The first field focuses especially on the use of speech to enable natural conversational interaction with the user and addresses advancements in the speech sciences to reflect on vocal speech as a key design material in voice interaction design. The second field deals with the auditory sense as an additional channel to encode and convey information. In terms of this work, we understand encoding of information as the process of using auditory channels to express information that humans can process with their auditory senses and understand in a meaningful way. In contrast to the previous perspectives, the third field focuses on the effect and use of soundscapes in related fields of HCI and investigates the use of sound effects to enrich the experience of interactive media. To the best of our knowledge, only a few studies adopt concepts from sound design in the context of voice assistance and voice interaction design. In particular, current voice interaction research focuses exclusively on speech to make the voice output more natural and informative.

2.1 Voice interaction design

Voice interaction design represents a new type of interaction [19] that is primarily concerned with encoding and conveying information in spoken language. In particular, the text-to-speech capabilities of current Natural Language Processing (NLP) systems [27, 28] enable and drive the emergence and growth of voice-first applications. The ephemeral character of speech-embodied information, in comparison to text, poses distinct challenges for information communication by VAs, such as cognitive load or dead-end conversations [2, 4, 20]. Due to the lack of a persistent manifestation, cognitive load is increased, and listeners are required to focus deeply in order to process and react to information [20]. Grice [29] argues that communication practices should always consider the quantity (right amount) and quality (speaking the truth) of information, as well as sharing only relevant information with a maximum of clarity.

However, user expectations regarding the capabilities of VAs frequently remain unfulfilled, causing disappointment and frustration, as users expect an effortless and engaging exchange of information [18, 30]. Often, well-known usability issues like limited NLP and speech recognition, system errors, misunderstandings, and failed feedback cause this phenomenon [6, 31, 32]. As a result, interaction styles emerge that are based on “guessing and exploration [rather] than knowledge recall or visual aids” [31]. Additionally, this type of conversational interaction does not feel natural and lacks sufficient positive experiences to motivate users to engage frequently [18]. Consequently, VAs need reliable usability to protect users from negative experiences [4, 31, 32], and further research should investigate the positive aspects of user experience that might contribute to an enchanting, playful, meaningful, and engaging interaction [33].

Accordingly, current research studies anthropomorphic effects and how to mimic human-human conversation successfully [32, 34], even though some work points to negative effects of too much human likeness [32]. Further experience dimensions for conversational agents might build on a more flexible attitude toward the categories of “human” and “machine” [4, 35]; agents should “fit into and around conversations” [19] and, respectively, the routines of their users [3, 6, 19]. We should understand speech as an act of performance, a kind of storytelling [34], and employ affective communication strategies [36] to enrich the interaction and stimulate experiences. For instance, new modes of articulation like “whispering” already extend the dimension of sonic experiences and prevent the VA from being perceived as boring and monotonous [19].

Moreover, human information processing is not linear but complex. The Elaboration Likelihood Model [37, 38], for instance, stresses that humans process information via two routes: via the central route, people decode the content of the message by listening carefully to the semantics, the strength of arguments, and the credibility of included facts. In contrast, via the peripheral route, people respond emotionally to the message, where they are more likely to rely on general impressions, peripheral cues, and subliminal tones.

Affective and emotional speech research [14, 39], especially speech emotion recognition [7,8,9], emotional speech synthesis [10], and emotional speech production [40], represents an emerging research area addressing these subtle but vital aspects of communication. A body of work studies, for instance, how our voice and our way of speaking express a range of emotions like sadness, joy, anger, fear, surprise, and boredom [10, 11, 41]. Furthermore, various studies have shown that speech and voice impact credibility, trust, charisma, attractiveness, likeability, and personality perception in general [11, 14, 42].

Research, machine learning in particular, has also identified the features responsible for communicating emotions. For example, research on emotional speech has uncovered that acoustic features such as frequency, bandwidth, pitch, intensity, loudness, speech rate, pausing, duration, and the intonation of phonemes, words, and utterances influence the perception of emotions [7, 9, 39, 43]. Further, several linguistic and paralinguistic features, among more abstract ones like gender, age, or accent, influence users’ perception of speech and voices [7, 11].
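To make these features tangible, the following minimal sketch extracts a few of them with the librosa library; it is our illustration under assumed file names, not the tooling used by the cited studies:

```python
import librosa
import numpy as np

# Load a mono recording of a spoken utterance ("voice.wav" is a placeholder).
y, sr = librosa.load("voice.wav", sr=None, mono=True)

# Pitch: fundamental frequency contour estimated with the YIN algorithm.
f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)

# Loudness/intensity proxy: root-mean-square energy per frame.
rms = librosa.feature.rms(y=y)[0]

# Rough speech-rate proxy: detected onsets per second of audio.
onsets = librosa.onset.onset_detect(y=y, sr=sr)
duration_s = len(y) / sr

print(f"mean pitch:  {np.mean(f0):.1f} Hz")
print(f"mean energy: {np.mean(rms):.4f}")
print(f"onset rate:  {len(onsets) / duration_s:.2f} /s")
```

Features like these are the typical inputs from which machine learning studies predict perceived emotion.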

Regarding speech emotion design, researchers have specified various notation systems, such as the emotion markup language [44, 45], which allows designers to annotate parts of sentences to be spoken with a particular emotion. To support designers, Shi et al. [46] outline the concept of state-emotion mapping that may serve to drive human-VA conversational interaction. To save designers this additional annotation work, researchers have also proposed text-based emotion detection algorithms that contextually determine the emotional phrasing and pronunciation of sentences [39].
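For illustration, such an annotation can be generated programmatically. The sketch below uses Alexa’s SSML emotion extension as one concrete notation; the markup languages cited above differ in element names but follow the same idea of tagging spans of text with an emotion:

```python
def annotate_emotion(text: str, emotion: str, intensity: str = "medium") -> str:
    """Wrap a sentence in an Alexa-style SSML emotion tag.

    Alexa supports the emotions "excited" and "disappointed" with
    intensities "low", "medium", and "high"; other notations such as
    EmotionML use analogous but differently named elements.
    """
    return (
        "<speak>"
        f'<amazon:emotion name="{emotion}" intensity="{intensity}">'
        f"{text}"
        "</amazon:emotion>"
        "</speak>"
    )

# Example: let the assistant deliver a forecast in an excited tone.
print(annotate_emotion("Tomorrow will be sunny with a high of 25 degrees.", "excited"))
```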

Our approach investigates an alternative route of interaction design: rather than modulating emotions through speech, we aim to supplement these advances in speech science to create engaging experiences between users and VAs.

2.2 Sound design and data sonification

Even though sound design is an active research field in the HCI community, there is a call for more scientific approaches to enable reproducible results [26]. So far, the field moves between craftsmanship and art and depends on skillful sound designers, as “Sound design can be described as an inherently complex task, demanding the designer to understand, master and balance technology, human perception, aesthetics and semiotics.” [26]. Sound is an integral part of media and system design, conveying captivating narratives, and a key component of audiovisual storytelling [47].

Therefore, data sonification represents an integral process to encode data and interactions so that the intended meaning is not misunderstood. According to Enge [48], sonification can be seen as “the use of nonspeech audio to convey information” [49], whereas visualization is understood as “the use of computer-supported, interactive, visual representations of abstract data to amplify cognition” [50]. Visualizations support a clear understanding of information, while sonification, despite its means to convey information, frequently leaves more room for interpretation [22]. The most common approaches to encoding information auditorily in interaction design are auditory icons and earcons [24]. A fundamental difference between the two is that earcons can be considered arbitrary, symbolic representations, while auditory icons can be regarded as analogical representations. Blattner et al. [24] defined earcons as “non-verbal audio messages used in the user-computer-interface to provide information to the user about some computer object, operation, or interaction”. Brewster further specifies that earcons are “abstract, synthetic tones that can be used in structured combinations to create auditory messages” [51].

The sonification of data is not only able to encode information but is also capable of expressing and inducing emotions. Depending on the design goal, inaccuracies may arise, as humans evaluate emotions very subjectively [22, 23]. Experiences are thereby based on the affective and functional perception of the design. This poses a challenge for research, which aims to investigate sonic elements and their impact objectively but competes with the narrative qualities of music and its affective and emotional impact [52]. While an interesting and positive experiential design may stimulate emotions, there will be a trade-off between the sonic experience and the clarity of the information [22]. The expression of emotion is defined by psychophysical relationships between musical elements and the perceptual impressions of the listener. Further, capturing emotional expression in music is possible by focusing on listeners’ agreement, as no one can effectively deny their experience [23, 53, 54]. In contrast to expression, communication further depends on accurately recognizing the intended information and emotion [23, 55]. Therefore, our work aims to explore the relation between a clear understanding of information and the enrichment of emotions by combining sound and speech.

2.3 The role of sound design in modern immersive media

Scholars such as Simpson [20] and Sanchez-Chavez et al. [21] argue that advanced methodologies and design principles for Conversational User Interfaces (CUIs), e.g., interfaces for VAs and chatbots, are needed. So far, current designs follow ingrained and trusted GUI principles to present and represent information without considering the dimensions of auditory information processing, such as the ephemeral state of speech, memory, imagination, and user interpretation [4, 20, 21]. Sanchez-Chavez et al. [21] propose to go even beyond current conversational design “to include more nonverbal and paralinguistic elements” that could expand the design space further when considering sound interaction as a primary form of interaction.

In light of the above, sound is in most cases regarded as a complementary means to enrich the experience of visual media such as games and movies: “Auditory cues play a crucial role in everyday life as well as VR, including adding awareness of surroundings, adding emotional impact, cuing visual attention, conveying a variety of complex information without taxing the visual system, and providing unique cues that cannot be perceived through other sensory systems. [...] VR without sound is equivalent to making someone deaf without the benefit of having years of experience in learning to cope without hearing” [56]. Further design studies revealed that soundscapes affect tasting experiences by adding a significant hedonic value [57, 58]. Soundscapes are defined as an “acoustic environment as perceived or experienced and/or understood by a person or people, in context” [59], which means that they represent a sign to their perceivers. We can also observe, for example, that consciously chosen sounds stimulate children’s behavior differently with regard to the play experience and the play itself [60]. Overall, sound design creates imaginative spaces in research and practice and is particularly important for narrative designs [61]. By adopting sound design principles for voice interaction design, we aim to enrich the narrative strength of VAs and explore how this affects potential users.

3 Conceptualization and empirical investigation of a sound library

Weather reports are among the most frequently used VA services. In light of our research questions, we aim to build and evaluate a library of sonified weather reports as a case study. We decided to adopt the approach proposed by Mynatt et al. [62], who discussed potential pitfalls during the design of a sound-based interface and subsequent recognition failures by users during its use. In particular, the authors emphasized considering four categories when designing auditory icons: identifiability, conceptual mapping, physical parameters, and user preference. In the following, we first discuss relevant theoretical concepts from related fields of sound design. Second, we present a user survey that collects conceptual mappings and physical parameters as design material to empirically ground the design space for sonic overlays.

3.1 Theoretical implications from sound design

Current design practices for VAs focus on advances in speech modulation and interaction but have not yet established complementing speech-based output with soundscapes. In this context, a sonic overlay can technically be characterized as a second track played in parallel with the voice as the primary track (see Table 1).
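To make the two-track idea concrete, the following minimal sketch mixes a voice track with an ambient track using the pydub library; the file names are placeholders, and the sketch is our illustration rather than the toolchain used for the clips in this paper:

```python
from pydub import AudioSegment

# Primary track: the synthesized weather report (placeholder file name).
voice = AudioSegment.from_file("voice_report.wav")

# Secondary track: the illustrative soundscape, e.g., rain (placeholder file name).
ambience = AudioSegment.from_file("rain_ambience.wav")

# Play both in parallel. overlay() keeps the length of the base segment,
# so the ambience should be at least as long as the voice track.
mixed = ambience.overlay(voice, position=0)
mixed.export("sonic_overlay_demo.wav", format="wav")
```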

Regarding the goal of sonic overlays, two fundamental requirements can be identified that the design should keep in mind:

  • Discrimination quality: As the primary information is given by speech, the sonic overlay must not impede or interfere with the information transmission of the first (spoken) channel.

  • Conceptual mapping: The second track is not arbitrary but should supplement the first to render the output more expressive and informative.

Table 1 Enhancing the voice track with a sonic overlay
Table 2 Sonic variables and their discrimination quality related to voice output

3.1.1 Increasing the discrimination quality of sonic overlays

In contrast to earcons, the aim of sonic overlays is not to substitute and summarize one specific piece of information but to enhance the experiential quality of information articulated via speech. Therefore, voice and sonic overlays have to be designed in synchronized co-existence to communicate and express information auditorily and in parallel. Hence, we place a special focus on what we call the discrimination quality — a category and feature that allows the user to isolate, separate, and process speech- and sound-based information directly.

Krygier [63] adapted the basic concept of visual signifiers to the auditory channel. He outlines the concept of sonic variables by focusing on abstract sounds that can be modulated by frequency, volume, or timbre to encode information. Studying the variations systematically, he concludes that sound location and volume, pitch, register, timbre, duration, rate of change, order (sequential), and attack/decay are viable sonic variables to enhance geographic visualization. In contrast to Krygier [63], we move the design space beyond abstract sound and consider speech-based output as an embedded and discriminable quality of a holistic audio clip. In this sense, Table 2 presents a non-exhaustive set of sonic variables that aims at the clearest possible discrimination between sonic overlays and speech-based output.

For our design, we took the discrimination variables Loudness, Timbre/Motives, and Temporal position into account, which we regard as most impactful for our design. We discarded the variable Frequency band because we aimed for simple, non-modified soundscapes. As smart speakers vary in the technical quality of their loudspeakers, we decided against building on Location as a discriminative quality. However, this dimension might be worth considering in future design studies, as some listeners, using high-end speakers or headphones with the VAs on their smartphones, have the technical equipment to experience localization in 3D sound spaces. It might support immersion by, for example, indicating the incoming direction of wind and rain in acoustic weather forecasts. In the following paragraphs, we detail how the chosen variables add to and are reflected in our design.

3.1.1.1 Loudness

Humans can distinguish between different volumes from about 3 dB up to 100 dB. Loudness has an inherent ordering function. If a sound experience stays flat without any variance, loudness may fade from conscious attention over time. Hence, different magnitudes of loudness can highlight and contrast parts of the sonic experience [63]. In particular, different volume levels can increase the discrimination between speech- and sound-based information by lowering the illustrative sounds and turning up the voice volume.
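In a prototype, this can be realized as classic “ducking”: lowering the overlay’s gain while the voice is speaking. A minimal sketch, again assuming pydub, placeholder file names, and hypothetical timings:

```python
from pydub import AudioSegment

voice = AudioSegment.from_file("voice_report.wav")
ambience = AudioSegment.from_file("rain_ambience.wav")

INTRO_MS = 5000  # ambience alone before the voice starts (assumed timing)
DUCK_DB = 8      # how far to lower the ambience under speech (assumed value)

# Split the ambience at the points where the voice starts and ends,
# and lower only the middle part that competes with the speech.
before = ambience[:INTRO_MS]
during = ambience[INTRO_MS:INTRO_MS + len(voice)] - DUCK_DB
after = ambience[INTRO_MS + len(voice):]

ducked = before + during + after
mixed = ducked.overlay(voice, position=INTRO_MS)
mixed.export("ducked_overlay.wav", format="wav")
```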

3.1.1.2 Timbre/motives

Krygier [63] defines the timbre of a sound as the encoding of information by the character of a sound. Analogously, instruments have a characteristic sound, such as the brassy sound of a trumpet, the warm sound of a cello, or the bright sound of a flute. Similar to the human voice, Alexa, Siri, and other VAs have a distinct sound that is distinguishable by the human ear. By choosing and incorporating distinct timbres for sonic overlays, their discrimination quality can be increased. Consequently, using tones or pieces of music, like a bird’s flutter or a synthetically produced ambient sound, contributes to recognizing both auditory tracks. This way, information on both tracks can be encoded independently. Additionally, music and sounds transport atmospheres and expressions of emotions, often recognizable as a distinct motive, which in movies even underlines principal characters. Such superimposition of motives supports the construction of compound earcons [24] but can also be applied to sonic overlays.

3.1.1.3 Temporal position

By their very nature, audio tracks have a temporal structure and order. Thus, discrimination can also be supported by separating the sonic overlay and the voice track in time. The intro and the outro take a particular temporal position here. For instance, the speech may start right away, or the sound of falling raindrops may play before the assistant begins talking. Further, incorporated background sounds may support the discrimination of auditory information when the speaker pauses.

3.1.2 Conceptual mapping: the semiotic of sonic overlays

We aim to create sonic overlays that are not arbitrary but related to the speech-based information. The main goal of sonic overlays is to serve as an illustration of what has been said, leading to information doubly encoded by speech and a sonic overlay. For instance, if the VA reports rain for the next day, the sound of heavy rain supports this information. To characterize the relation between the vocal output and the sonic overlay, we apply Peirce’s semiotics [64], similar to David Oswald [65] in his work on the semiotic structure of earcons. The core of Peirce’s semiotics is the sign as a triadic relation between the object, the interpretant, and the sign-carrier:

  • Sign: the sign-carrier, which has a perceptual representation

  • Object: a thing, a concept, an experience, or an emotion the sign refers to.

  • Interpretant: the perception and interpretation, in the form of a perceived object, mood, or emotion, in the mind of the perceiver

The sign mediates between the object and the interpretant. For instance, the ringtone of a mobile phone signals to its owner that someone is calling her. In this case, the knowledge of the call is the interpretant, the referred call represents the object, and the ringtone is the sign that caused that interpretation. In Peirce’s semiotics [64], we can say that the mobile phone’s ringing and its vibration both refer to the same object (the call) as well as the same interpretant (the knowledge of the call). In the same way, we can now characterize the relationship between speech and its sonic overlay.

Fig. 1 Gradual transition of icon to symbol, from high iconicity to high conventionality (adopted from [65])

Looking at the meaning encoded in this process of creating sonic overlays [65], Gaver [25], for instance, distinguishes between iconic, metaphorical, and symbolic perceptual mappings. In contrast, Oswald [65], in the Peircean tradition [64], distinguishes between iconic, indexical, and symbolic signs. Our view is influenced by both authors. Focusing on the experience, we follow Oswald’s comment that the constitutive element of iconic signs is similarity, not physical causality. For the same reason, we focus on associations, metaphors, and signal correlations that establish a link between a sign and its object. Consequently, we distinguish between three sign categories referring to three kinds of relationships:

  • Iconic: the representation based on the similarity of the signs and the signals produced by the object

  • Associative: the representation based on associations, metaphors, or correlations between sign and object and the signals produced by the object

  • Symbolic: the representation based on convention only, no natural link between sign and object

Moreover, we consider this distinction a heuristic classification in which the icon and the symbol represent extremes (see Fig. 1); normally, a sign has both qualities to some degree: the iconic quality of semantic and/or signal proximity to the referenced object, and the symbolic quality of referring to the object merely by convention and repeated experience. However, we consider such smooth transitions among the categories unproblematic in practice, as the primary goal is not to uncover the essence of a sign but to sensitize designers to the various opportunities to encode information in a sonic overlay. In the following, we discuss the three categories in more detail.

3.1.2.1 Iconic mapping

An icon is a visible, audible, or otherwise perceptible representation of the thing for which it stands. In the auditory world, iconic auditory signs are sounds that sound similar to their object [65]. Thus, the iconic character results from an imitation of sounds typically produced by the referenced object. For instance, a recorded bark iconically refers to the barking dog, and engine noise serves as an iconic auditory sign of a moving car. Iconic sound design is typically used in radio plays, movies, and computer games to enrich the user experience. In some cases, weather has a strong iconic representation, like, for example, thunder. We further aim to uncover which iconic sounds, and which combinations of them, are useful to incorporate in sonic overlays.

3.1.2.2 Associative mapping

Going one step beyond iconic representations, we can uncover associations that are reduced and linked to a distinct characteristic or feature. In the case of Star Wars, Ben Burtt looked for familiar animal or machine sounds to establish credibility and ensure recognizable semantics for the sound effects: “The basic thing I do in all of these films [Star Wars and its sequels] is to create something that sounds believable to everyone, because it’s composed of familiar things that you can’t quite recognize immediately.” — Ben Burtt, quoted by Whittington [66].

Such associative mappings rest on some similarity between the sound and the referent, but the link is not as strong as in auditory icons at the iconic level. Gaver [25] argues that, in general, iconic/nomic mappings are more powerful than symbolic and metaphoric/associative mappings because iconic/nomic mappings show a direct relationship between the auditory icon and the physical source of the referent.

3.1.2.3 Symbolic mapping

“Auditory icons require there to be an existing relationship between the sound and its meaning, something that may not always exist” [67]. For example, this is the case if weather conditions do not come with literal sounds. A telling example is the difference between thunderstorms and cloudy weather conditions. Whereas thunder offers an iconic mapping through its distinctive sound of rolling thunder, cloudy weather has no such explicit feature. In the absence of an iconic mapping, we must apply symbolic mapping, which “is essentially arbitrary, depending on the convention for its meaning” [25]. For example, when the VA announces cloudy weather, the consistent use of a particular sound establishes a symbolic relationship, similar to a ringtone that a user associates — over time — with a particular application.

Table 3 Semantic and sonic associations regarding weather conditions

3.2 User survey design and procedure

The first step in our design of sonic overlays is to define a conceptual mapping that users can understand. Sonic overlays are more recognizable if they are based on iconic and associative mapping, with an active and purposeful link between what is said and what is heard. This has the advantage that no social conventions have to be established beforehand. Therefore, we conducted an online survey to collect associations with basic weather report events, namely rain (1), fog (2), frost (3), cloud (4), snow (5), thunder (6), and sun (7). In total, we received 33 complete answers but decided to also incorporate the described associations of 15 incomplete answers. We thereby collected a data set of 48 participants aged between 23 and 66 (male: 12, female: 19, non-binary: 1; mean age: 36.9 years).

Our survey did not aim at statistical validity; it was intended to sensitize our design phase. The survey was distributed in Germany and Great Britain using social media services. All questions were open-ended and optional. We asked two questions for each of the seven weather conditions. First, participants were asked to name three concepts or terms they spontaneously associate with the given weather condition. Second, they were asked to name or describe three associated sounds, noises, and/or pieces of music. Even if these associations are not explicitly set to music, they give sound designers an impression of the semantic field evoked by each weather condition. Finally, we collected demographic information such as age, gender, and education. Afterwards, we descriptively summarized the results (see Table 3), clustering identical and very similar meanings. Table 3 shows the ten most frequently named concepts.

3.3 Results of semantic and sonic associations

3.3.1 Iconic mapping

The survey showed that participants had varying difficulty associating tones, sounds, or music with the specific weather events, and these associations varied widely. The associations turned out to be most coherent where there is a natural iconic mapping, i.e., where a weather event naturally causes sounds; rain and lightning are fitting examples. In the case of rain, for instance, the associated semantic field revolves around the theme of wetness, water, and raindrops. These are also associated with certain moods, such as chill and discomfort but also calmness, and certain colors, such as dark and gray.

The theme of rain and raindrops can also be found in associations such as pouring, as well as in associated objects for personal protection, such as an umbrella. The sonic field translates the semantic theme of water into nature sounds caused by rain, e.g., splashing, rushing water, dripping, pattering, and drumming. Besides, mainly naturalistic or nature-simulating associations were named, e.g., thunder and lightning, a running faucet, or rice grains being shaken back and forth. In the case of lightning, an iconic mapping arises in most cases from the fact that the electrical discharge produces not only lightning but also thunder. This has led to a quite homogeneous sonic field, in which most participants directly associate thunder or specific forms of thunder such as rumbling, crashing, or banging. Furthermore, it turns out that lightning is semantically and sonically associated with rain, expressed, for example, by sonic associations such as drumming, waterfall sounds, or pattering rain.

3.3.2 Symbolic mapping

The opposite case comprises weather events for which a natural iconic mapping does not exist. The most prominent example is cloudy weather. In contrast to rain or lightning, participants do not associate specific natural events or activities but a vague, general impression of gray, dark shadows, coldness, and a quite unspecific, melancholic mood of discomfort, sluggishness, and bad temper. The theme of coldness was also expressed through mentions of protective measures, like bringing a jacket, or “sweater weather”. Occasionally, there are associations with seasons, e.g., autumn, or places like Germany, as well as activities such as doing sports outside or city trips. This broad, unspecific, and, as it were, soundless semantic field echoes in the sonic field. Here we observe heterogeneous answers that translate the vague ideas of gray, gloom, and melancholy into sound in different ways. In addition, two participants answered the question about associated concepts but omitted the question about sonic equivalents. The other answers show a wide range of sonic associations. What is striking here is the frequent tonal characterization of cloudy weather by general musical characteristics (melancholic music, a ponderous beat, a polyphonic male choir), certain musical trends (lo-fi beats, jazz music), individual instruments such as strings, and styles of sound such as muffled sounds or a dull hum, without associating one specific sound or piece of music. Participants mentioned natural sounds such as wind or water less frequently. We also find it inspiring that some participants associated human noises like sighing, breathing, and yawning to give the melancholic mood a sound.

3.3.3 Inbetween iconic and symbolic

The answers further show that most associations cannot be unambiguously classified as iconic or symbolic mapping but mostly represent something in between. Therefore, in our view, it makes sense to understand the schema outlined in Section 3.1.2 as a heuristic rather than a strict category system. Sunny weather is one of the examples where iconic, associative, and symbolic mapping are balanced. In the semantic field, we see strongly iconic responses, e.g., warmth/heat, brightness, and blue sky, but also various answers more indirectly related to sunny weather, such as summer, expressions of summer feeling like being motivated or happiness — for instance, expressed by laughing — as well as diverse summer activities such as cycling or eating ice cream. In addition, some answers refer to measures for sun protection, e.g., sunshades or sunscreen. The corresponding sonic field also reflects this semantic field. Unlike lightning or rain, the sun does not directly cause sounds. The associated natural sounds are therefore not iconic but rather associative. Various participants mention sonic expressions typical of a sunny sea holiday, such as the sound of waves, splashing in the water, or voices at the beach. These associations present indexical signs in the sense of Peirce [64] because of a causal chain: the sun causes heat, heat motivates a refreshing beach holiday, and the holiday brings sea sounds.

By the same token, they present a metaphorical mapping in the sense of Gaver [25], because in Western societies beach sounds have become a metaphor for a hot summer, good feelings, and sunny weather. In addition, some participants associate sunny weather with crickets chirping or birds singing. Again, there is an element of indexicality and metaphoricity in these associations (as rainy, non-sunny weather physically impedes both chirping and singing, both natural events have become metaphors for a sunny summer). Less indexical but more metaphorical are answers such as laughing or cheering. While neither is a direct metaphor for sunny weather, both are metaphors for happiness and good feelings — one of the associations in the semantic field. This feeling of lightness, sunny-weather lifestyle, and good mood is represented by many musical associations, both regarding styles (light pop music, light electronic music, major keys, reggae, Latin American music) and particular songs such as “Sunshine Reggae” by Laid Back, “O.P.P.” by Naughty by Nature, and two ice cream commercials, “So schmeckt der Sommer” (Engl. “This is how summer tastes”) and “Like Ice in the Sunshine”.

Overall, the answers indicate that iconic mapping is prominent when the weather event causes typical, easy-to-remember sounds. In contrast, when such sounds were not available, participants suggested symbolic mappings more often.

4 Developing a library for sonic overlays

Our library for sonic overlays is based on the empirical, descriptive results of the survey described in Section 3.2. Further, we use the categories of iconic, associative, and abstract sound to cluster the results and produce sound clips with a high discrimination quality for all seven weather types. We explain our design rationale and the corresponding steps in the following.

4.1 Design approach to enrich Alexa’s weather report

In contrast to Mynatt et al. [62], we decided to gather conceptual mappings and physical parameters through a free-form survey before the design phase. Further, our goal is not to design auditory icons but to illustrate speech using iconic, associative, and abstract soundscapes that are not synthesized into an identifiable sound-only design but serve to illustrate the spoken information.

The seven most distinguishable weather types were chosen as the core of this design: sunny, rainy, cloudy, foggy, snow, frost, and thunderstorms. The authors sorted the responses into categories depending on each sound’s connection with the weather in question:

  • Iconic sounds, which are caused directly by the weather

  • Associated sounds, which are expected to occur in conjunction with the weather but are not directly caused by it

  • Abstract sounds, which have a connection to the stated weather type in the respondent’s mind but are not necessarily linked to it

This categorization is based on the conceptual considerations introduced and explained in Section 3.1.2 and enables easier identification of users’ positive or negative reactions to certain types of sound. Further, we want to highlight that, in contrast to rain, certain weather conditions like foggy and cloudy have no iconic sounds. This must be taken into account when creating the respective soundscapes. It also provides an opportunity to evaluate how a lack of iconic sounds affects the user’s overall perception of a soundscape.

Therefore, as a first step, we categorized the survey results described in Section 3.2, sorted from most common to least common, for all seven weather types (see Table 3). Table 4 exemplifies this for the rainy weather condition (see below).

Table 4 Sorting and categorization of survey results using the example for rain
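For illustration, such a categorization can be held in a simple data structure. The entries below are hypothetical examples distilled from the survey themes reported in Section 3.3, not a verbatim copy of Table 4; the helper mirrors the design rule, described in Section 4.2, of preferring direct (iconic) sounds and falling back to more abstract ones:

```python
# Hypothetical excerpt of categorized survey answers for "rain";
# the actual entries of Table 4 may differ.
rain_sounds = {
    "iconic": ["pattering raindrops", "splashing water", "dripping"],
    "associated": ["thunder", "running faucet"],
    "abstract": ["rice grains shaken back and forth"],
}

def pick_sounds(categories: dict, prefer: str = "iconic") -> list:
    """Return the preferred category if populated, else fall back in order."""
    order = ["iconic", "associated", "abstract"]
    order.remove(prefer)
    for key in [prefer, *order]:
        if categories.get(key):
            return categories[key]
    return []

print(pick_sounds(rain_sounds))                            # rain has iconic sounds
print(pick_sounds({"abstract": ["heavy traffic noise"]}))  # e.g., cloudy falls back
```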

4.2 Structure and elements of sonic overlays

Sonic overlays and earcons/auditory icons share multiple features, such as conceptual mapping and encoding information through sound. Yet, the survey of Cabral and Remijn [68] shows that, in contrast to sonic overlays, earcons are quite short (mostly between 0.5 and 3 s). As our sonic overlays attempt to illustrate the speech-based information of VAs, we need to take into account that talking often lasts from a few seconds to minutes. For instance, the weather report of the German Google Assistant takes about 10 s, allowing sound designers further options regarding rhythm, pauses, ambient sonic overlays, and other temporal parameters. Another main difference is that in sound overlays, the voice conveys the primary information, which frees sound designers to encode information more subtly and, for example, emphasize or ironically comment on the spoken information through sound. However, it also creates new constraints, such as that the sound overlay must not interfere with the voice, which would make it difficult for the user to understand what the assistant has said.

The examples created were each around 25 s in length and incorporated sounds based on the most frequent answers given in the survey, in combination with a synthesized voice similar to what would be heard from a VA. Further, a proper difference in loudness between the soundscape and the speech ensures the discrimination quality within the sonic overlays. The structure of each sound overlay clip was consistent across all weather types: each starts with around 5 s of sound effects building up a soundscape that represents the weather, then a voice explains the weather condition and temperature, followed by an additional 10–15 s of audio. If a clip includes any musical elements, these are incorporated into the soundscape after the voice has spoken.
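This clip structure can be reproduced in a few lines; the sketch below assembles intro, spoken report, outro, and an optional musical element with pydub, where the file names and exact timings are placeholders rather than our production pipeline:

```python
from pydub import AudioSegment

voice = AudioSegment.from_file("voice_forecast.wav")       # spoken report (placeholder)
ambience = AudioSegment.from_file("weather_ambience.wav")  # soundscape (placeholder)
melody = AudioSegment.from_file("melody.wav")              # optional musical element

INTRO_MS = 5000    # ~5 s of soundscape before the voice starts
OUTRO_MS = 12000   # ~10-15 s of audio after the voice

total_ms = INTRO_MS + len(voice) + OUTRO_MS
# Keep the soundscape below the voice; a uniform -8 dB is used here for
# brevity, but the segment-wise ducking from Section 3.1.1.1 could be applied.
bed = ambience[:total_ms] - 8

clip = bed.overlay(voice, position=INTRO_MS)
# Musical elements enter only after the voice has finished speaking.
clip = clip.overlay(melody, position=INTRO_MS + len(voice))
clip.export("sonic_overlay_clip.wav", format="wav")
```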

Musical elements and soundscapes are essential for creating an expressiveness of information that speech alone could not achieve. Two examples therefore incorporated musical elements besides sounds and spoken words. The example for the sunny weather condition incorporated a guitar melody inspired by “Here Comes The Sun” by The Beatles, as multiple survey respondents mentioned this song in association with sunny weather. Further, the example for the frosty weather condition incorporated an original melody using tones and timbres identified in the survey as conveying a feeling of cold, icy weather. When creating the soundscapes, sounds with a rather direct connection to the weather type in question were prioritized, e.g., the sound of wind or falling rain. Where this was impossible, more abstract sounds were used instead, e.g., the cloudy soundscape featured heavy traffic noise. In either case, all sounds featured in the soundscapes were selected from the survey responses.

Table 5 Study participants (n = 15) representing international differences in culture and residence

5 Evaluation of the sonic library

5.1 Interview study design and procedure

Frequently, associations and imagination are linked to prior experiences and cultural background [23]. Therefore, we did not aim at a statistical representation of the populations of Germany and Great Britain but looked for participants with heterogeneous cultural backgrounds who were able to speak and understand English. For recruitment, we used snowball sampling in our extended networks [69]: we posted requests in social networks like Facebook, international Telegram groups, and private messenger services. To further diversify our sample, we asked the first participants for references from their extended networks. Most of the 15 participants (4 male, 11 female) currently lived either in Germany or the United Kingdom, with one participant living in France and one in Palestine. However, their geographical backgrounds were considerably more diverse, including South-East Asia, Sri Lanka, Canada, and Russia, among other regions. This diversity in backgrounds helps identify how a person’s current or past environment might affect the evaluation of sonic experiences and weather types. Table 5 provides an overview of the corresponding data regarding age, gender, and current and previous residence. Most interview participants had at least some previous experience with VAs, and those inexperienced in interacting with VAs had a basic understanding of how they work. Therefore, we only needed to explain the sound overlay concept. Participation in our study was voluntary and did not involve any compensation.

We chose a qualitative interview study approach to explore the subjective perception and usefulness of the sound overlay library. Each participant listened to both conditions — speech only and speech with sound overlay — for three randomly chosen weather types. We created a randomized experimental design without repetition, so that each participant was played two clips per chosen weather type, i.e., the weather report with and without sound overlay, drawn from rain (1), fog (2), frost (3), cloud (4), snow (5), thunder (6), and sun (7). First, randomization without repetition ensured that at least six subjects listened to each of the seven weather reports. Second, the randomization was intended to minimize sequence or order effects. The experimental design randomized the order as well as the combination of samples (e.g., snow with thunder, or frost with sun) to account for possible changes in opinion brought about by hearing particular examples in combination. Additionally, the order of the two clips for each weather type was randomly selected, taking into account that listening to the first clip might influence the next. We uploaded the sound library to YouTube so that only the links to the chosen clips had to be shared during the interview. After listening to each clip, the interviewee was asked specific questions about what they had just heard, followed by more general questions about the concept and their impressions of it, e.g., “Did you recognize the sound as the correct type of weather?”, “How long did it take?”, and “Did the information come across, and how did it make you feel?”. Each interview lasted around 35 min on average and was conducted over Zoom.
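An assignment of this kind can be generated with a few lines of Python; the sketch below is our illustration of the described randomization, not the script actually used in the study:

```python
import random

WEATHER_TYPES = ["rain", "fog", "frost", "cloud", "snow", "thunder", "sun"]

def assign_conditions(participant_seed: int, n_types: int = 3):
    """Draw weather types without repetition and shuffle the clip order
    (speech only vs. speech + overlay) independently per type."""
    rng = random.Random(participant_seed)        # reproducible per participant
    chosen = rng.sample(WEATHER_TYPES, n_types)  # no repetition within a participant
    plan = []
    for weather in chosen:
        order = ["speech only", "speech + overlay"]
        rng.shuffle(order)  # counter order effects between the two clips
        plan.append((weather, order))
    return plan

for pid in range(1, 4):  # e.g., the first three participants
    print(pid, assign_conditions(pid))
```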

Finally, the interviews were transcribed verbatim and coded inductively and independently in MaxQDA by two researchers using thematic analysis [70]. We focused on the effective sonic experience of the weather types and the perceived differences in design and usefulness. Also, we explored the impact of combining speech and sound and its implications for structuring and contextualizing information.

5.2 Findings

Some participants regularly used VAs to check weather forecasts, but the majority relied on websites or smartphone apps instead, usually citing the level of detail offered as the reason. Several stated that the short spoken summaries of VAs did not give enough specific detail to plan a whole day.

5.2.1 Supporting imagination and experience

Sonification aimed to help people produce mental images that draw on the emotions and prior experiences associated with distinct and ambient noises. Using the weather examples, we observed clear design challenges for two specific groups of weather types: almost silent events like fog, sun, frost, and cloud, and loud events like rain, snow, and thunder. Although the prestudy foreshadowed possible challenges in designing recognizable and unambiguous soundscapes, cloudy weather caused the majority of problems in correctly understanding the presented information.

Most participants responded positively to the idea when listening to the samples and gave vivid accounts of their imagination. Some welcomed their emotional responses and explained that this makes the interaction less boring and monotonous and more dynamic (P1). Listening to the audio clip for “fog” evoked a space “like being on a boat in the ocean” (P3). According to P1, weather reports supported by soundscapes felt less “artificial” than speech-only reports and created a kind of “haptic feedback” of the information:

“I think it’s more emotional because you do have like, an image, sort of, in your mind. Yeah, I like the fact that it’s not only rain. It feels like car and rain or some background noise. You know, it feels like you are really in the middle of the city. And you don’t have an umbrella, and you are suffering from a pool. (...) In this context, I think you want to use a temporary, really precise message of the weather, and I think this achieved their goals.” — P1

In particular, the soundscapes emphasized typical feelings associated with specific weather conditions: participants explained that the thunder sounds made them anxious (P3), that a sunny city equaled good feelings (P2, P7), or that freezing temperatures signaled not to go outside:

“So we were like heavy winds, which were full of crystallized snow. And you could hear yourself like walking through the huts. Cold, like the freezing or the snow, which feels like the ground. And, yeah, the wind was so strong that you did not want to go outside at all.” — P14

The soundscapes of pleasant and unpleasant imagined situations alike enhanced the intended message and supported possible adaptations of the participants’ behavior, like being motivated to go out (P8). Some found the concept particularly useful for special occasions and ambient background information needs (P13). Moreover, P1 and P14 reported that the sonic overlays contributed to a calm and relaxed feeling.

“Natural sounds in general. Also the crows and animals and things like that. Because sometimes people are stressed about everyday life or life pretty often. So they have, they want, like something to relax. And maybe one selling point of this app or a voice assistant would be like that one can relax, that are in our everyday life.” — P14

Sound was not considered strictly necessary for a system solely designed to give factual information (P12). While regular forecasts are unbiased, sound adds a character that can have positive or negative connotations. This can help in forming decisions based on the weather because it is easier to imagine oneself in the context. P12 indicated that the specific information might not be as memorable, but the overall impression was much stronger and helped with understanding the consequences of the weather conditions. Several participants also noted that the soundscapes made it easier for them to visualize the weather and think about how to prepare for or react to it. P3, P11, and P10 considered this useful for morning routines or directly after waking up in a dark room. Moreover, P10 called the design concept more reassuring, as it gave a feeling of naturalness and coziness. P3 was also surprised that it was not already commonplace for VAs, since visual apps use graphics to add more context and to communicate information in a more appealing fashion (P6).

Further, this concept offers a chance to give friends coming to visit a more precise idea of the weather conditions and makes the information more interesting to share (P13). Additionally, it might help one feel a deeper connection to and experience of the represented location when living far away, as long as the information reflects reality:

“Let’s say I want to go to London and I’m checking the weather in London. Or maybe I want to see the weather in a different country right now. For a particular reason, it is important to me. (...) but instead of saying rain and the strength of the rain, it might add more because if it is on real-time as opposed to a forecast, if it is music, then I feel it. This level of, you know, the burden of interpretation. But if they are actual, it’s almost as if they are giving real-time Information. Then if they are making me hear it, how it is, how snow is flowing. They know how it is raining in London or wherever away from the I can see from my window. I can see data that has been an interesting dimension that I would be interested to see.” — P9

Meanwhile, a lack of experience with certain weather conditions or landscapes might contribute to misleading interpretations or less precisely perceived information. For example, P15 could not recognize or relate well to the foghorn sound that represented foggy weather, in contrast to P4, who imagined their current residence:

“I could picture the coast where I live, which is a harbor, small harbor and the sea and foggy sea and the fog coming into onto the land, which it does where I live (...) quite often. So yeah, a totally foggy, virtually visible. With the emphasis on sounds that you hear rather than what you see.” — P4

As P1 grew up in a large city, hearing footsteps in the snow made it difficult for her to differentiate between snowy and frosty weather, and it conveyed the overall impression of a hiking vacation in nature rather than an intuitive sense of the weather conditions. She missed the noises of traffic, for example, cars. In contrast, P6 argued against including traffic noises because they do not symbolize sunny weather to her. In a similar vein, P8 and P13 did not consider children playing outside an appropriate illustration of sunny weather, and hearing splashes reminded P13 of rain instead.

5.2.2 Sonic information design

The sonification of information relies on abstract and iconic sounds, as well as relevant pieces of music and speech. Particularly the abstract sounds contributed to an active imagination and conveyed the meaning of the weather conditions. Accordingly, all participants pointed out that the incorporation of related sounds gave a better impression of the scenario:

“I think all of these have given me very if you’ll pardon my illusion, Animal Crossing kind of vibes. I don’t know if that was a deliberate image or just circumstantial. But it’s not the weather. The tones fit the weather, the sounds of the light. With this one, you could hear like it was like birds singing. Nice day, kids having fun. Like, I think that was a roller coaster. And then the marimba at the end or like a guitar.” — P12

Overall, the concept does not represent a simple sequence of symbolic sounds; the soundscape has to be layered with consideration. An urban environment might sound different from pure nature, but it has an equivalent impact when background sounds are combined that indicate events happening during this kind of weather or at the place where it is experienced.

“I like that. Not just the sound of it. It really sounds like you try to mix it with different elements like the surroundings. Sometimes the sound is not really directly about the weather, distinctive. But I think that’s really awesome. Some feedback is that, for example, there’s the second one I have the most problem understanding. The foggy one.” — P1

The participants appreciated the incorporation of musical elements, which acted in a similar vein to convey information and emotions that noises could not. For example, P11 stated that music represented “icy” conditions much better than footsteps. Likewise, this type of sonification supports the differentiation of similar states like frost, ice, and snow. P2 explained that the music was thematic and thereby indicated light and pleasant snow:

“I think it was very thematic in the sense that it gave you an idea of what to expect. It kind of indicated it’s going to be like, you know, sort of like, oh, it’s nice. You can walk in it. It’s going to be like pleasant thunder. It didn’t seem to be indicating snowstorm: Stay in your house!” — P12

Likewise, the use of a guitar, for example, may produce a “calming effect” (P8). In contrast, P11 described VAs as a convenience and aimed for efficient interaction, in which music might get in the way. Further, P12 was concerned that not everybody would appreciate such a design decision:

“I liked it. I mean, again, it’s I think the sort of people that would be put off by the extra fluff at the end. People that would just look at a website and wouldn’t use the service anyway. So I think it’s adding an additional level of sort of engagement to people that are going to be using the product.” — P12

However, the music proved to be an effective element for supporting imagination and speech-based information:

“All the right information came across straight away. And what was interesting was that because I’d heard the music first, I had this same image of this road going into the distance and everything, a little bit orange. Don’t ask me why, but maybe going into the sunrise, sunset, you know, a pleasant travel image, basically.” — P4

An overall trend in the results was that soundscapes that more heavily used iconic sounds were better received than those relying solely on abstract sounds. This presents an issue for weather types that do not have associated iconic sounds, such as cloudy or foggy. Iconic sounds in particular are well suited to represent precise information, entail clear messages, and at the same time evoke past experiences as associations. Further, natural sounds are closely tied to the expectations of weather conditions:

“And because of the sound of the birds, you kind of feel it’s sunny and the kind of feel that people outside and that things are happening outside. So you assumed your kind of mental image was this sort of like sunnier, drier weather.” — P9

In comparison, rain and thunder in particular were tangible noises that were recognized quickly and reliably. Afterward, participants (P13, P3, P2) discussed, for example for snow and frost, how the granularity of weather conditions and their differentiation could be supported by a variety of iconic noises.

“And as I mentioned before, you could play a different thing. So the severity of it. So you’ve now winden and instead of sort of a lighter sound, but more heavy, I assume they were sort of sleigh bells or reindeer to indicate a more hazardous conditions maybe. Yeah, but yeah, I know it was all very easy to hear that it gave across everything you were trying to say.” — P2

P13 added that it could be confusing to hear snow sounds when there is, for example, only a 50% chance of snow. P3 suggested building on a wider bank of sounds for variations, for example, a concise representation of temperature, and noted that “Rain sounded maybe not as ‘heavy’ as the voice said”.

Besides, difficulties arise with conditions that cannot be represented iconically because they lack inherent noises, for example, sun, clouds, or fog. Trying to substitute them with sounds that merely occur in such conditions, such as crows or fog horns, might cause confusion: P11, P10, P4, and P1 had trouble understanding the meaning of the crow noises and considered them confusing.

Besides iconic sounds, a deliberate choice in the design of the sound overlays was to incorporate speech providing precise weather information. Many participants stated that without speech, they could not identify the correct weather conditions, especially concerning fog, frost, and clouds:

“Well, what I noticed is that the abstract sound only came after she talked. The voice (...), there was no ambiguity. And I really knew that it was the frost that made the sound.” — P11

In contrast, some participants indicated that in the case of rain, the speech felt unnecessary, and that in the case of thunder, the sound was even clearer than the vocal information:

“I felt like it basically brought things across. The voice said heavy thunderstorms. And I feel like maybe the rain wasn’t heavy, heavy, heavy. But at the same time, that would raise the question of, well, how many different words does a voice assistant use when describing weather? And then can you map all of those words on to a sound of rain, like the thunder sounded heavy?” — P3

Overall, the intended, sonified meaning of rain, sun, snow, and thunder was recognized most frequently and almost immediately. P11 added that the imagery evoked by the sound made it even easier to remember to bring an umbrella. Further, P12 explained that listening to the thunder gave him clear thoughts on how to prepare for the upcoming stormy weather.

“I think it was like supporting the voice. Sometimes I also think that the voice was completely unnecessary. In extreme beavers and extreme weather conditions, for example, when it was like snowing or raining. But a service (...) it will be like necessary to at least say the temperature. And I mean the information about that it’s snowing.” — P14

However, most participants considered speech valuable for quick and precise information, such as temperature indications (P14), especially those participants who might be impatient because they are in a hurry (P14, P1, P9). Furthermore, participants feared that voice and soundscape could sometimes compete for attention, e.g., because of false expectations regarding the timed structure:

“Since, I think, it’s one minute. Whenever, (...) it’s not necessary, but it can be of it can be a bit frustrating if you missed the moment that it starts saying.” — P10

Further, P10, P13, and P12 expressed concerns that the voice and the background noises overlapped too much, e.g., children screaming while playing outside (P13). Hence, despite a better image of a complete scenery, the speech-based information was at risk of being drowned out:

“In the same instance, you get like in films, sometimes there’s a dialog scene. And then the orchestral score or the things in the background is so loud, you actually can’t hear what’s going on, which then detracts from the product, which I think is something you guys have managed to avoid.” — P12

Additionally, P11 mentioned that sound should not appear to contradict speech, so as not to add ambiguity and confusion:

“It doesn’t add more information to this, to the stuff that she’s saying. Because in the first part of the snow, it added snow. She didn’t say anything about snow. And the second one added wind, even though the voice just said it’s foggy, not windy. And it must be very difficult to achieve. But I think that’s really important that the sound is very much in line with the words and not adding or taking away information.” — P11

P11, P8, and P9 stated that the use of sound prolongs the interaction and requires patience. Consequently, when in need of quick information, they would prefer speech only, either through the voice or by glancing at their phones.

“In a car. Probably like when you need to just have the information (...). But when my mind is like, I just want to know this and then I want to do something else. I don’t know in which situation that’s the case. Usually, most of the time, but when I ask: ‘Okay, what’s the weather going to be like?’ And then they tell me and then I cannot ask another question for like 5 seconds because I have to wait until the rain stops. That would annoy me so much.” — P11

In total, we observed balanced opinions on whether voice or sound should come first in the structure of the sonic overlays. Some of the participants (P6, P14, P9, P11, P5) argued for starting with speech when designing sonic overlays; P14, for example, suggested:

“I think it will be better to start with a voice or maybe a millisecond off or a nanosecond. I’m not sure of like of forever, of a silence and then the voice. Because I think sometimes people don’t have patience. Some people don’t have the patience for waiting until the voice pops up.” — P14

P6 demanded to have speech instantly (“facts not thrills”) but could imagine a short sonic fade-in before and a quick fade-out afterward. A further advantage of speech-first might be reduced ambiguity, with sound as an additional layer that can be interpreted more easily (P9). P11 suggested making the clip shorter overall to make it more efficient, although this might lead to impressions that interfere with the voice.

“Waiting in suspense for the voice - then it happens suddenly. Voice and sound should start at the same time then let the sound carry on for just a few seconds afterward to leave an extra impression.” — P11

On the other hand, participants had found reasons to start with sound as well:

“No, I think the fact, that the lead-in was an audio clip of the weather type or something alluding to the weather type followed by the information, then followed by another weather clip with a bit more music. I think it gave you an idea of what was coming. It was then clarified and then you got this sort of little ribbon on the top of whatever you’re referring to us.” — P2

Many participants appreciated the current design structure of the sound overlays. They pointed out that the sound introduces impressions and scenes before the speech fades in to confirm and clarify the weather conditions. Besides, P10 described this design as feeling less aggressive than the assistant speaking at you immediately. Nonetheless, participants like P14 and P4 emphasized that this concept takes some getting used to.

5.2.3 Sonic contextualization of experiences

The sonification of information might be extended to other applications and design spaces, as the following statements of our participants show. However, they expected some limits regarding usefulness and experiential value: suitable contexts are, in particular, situations that allow for ambient sound and personal moods that welcome entertainment, e.g., driving in the car or waiting in general. For instance, P8 considered background ambiance, like the sound of a fireplace or ASMR (Autonomous Sensory Meridian Response), relevant for cooking or studying. P13 would find hearing the sounds of frying, chopping, and the like more amusing. Additionally, P4 described a possible situation at work:

“When I’m working in home office, I’m able to choose. When I go out for a walk, I could look out the window. But in Scotland, that won’t tell you. You really need to know that temperature, preferably what it feels like. I mean, that’s peculiar to Scotland. It doesn’t set up. The temperature is what you really want. And yes, I could come out to whatever I’m writing or reading. And I could click or met Office, and I could get it. But if I could just get it instantly, you know, like that just: ‘Oh, I wonder if I need a hat and a scarf as well as a coat today. Do I need two pairs of gloves or one?’ Then I would quite like that. A fun way of doing it, especially as I want to then forget about work, although I actually associate my laptop with work. So for me, just to have some quick little sound, and off I go for my walk” — P4

Besides asking for the weather or specific information, the news is a frequently used service of VAs and radios alike. However, our participants had contradictory thoughts on the sonification of this service. P4 could imagine a benefit in applying sounds to the presentation of traffic updates, travel reports, or election and sports results, especially at times when you want the information in a flash. In contrast, P1 expressed that sound might distract from or manipulate the information. Further, for P3, bad or scary effects might be reinforced.

“Honestly, I, I don’t, I cannot think of anything that would benefit from that. Because it always conveys some sort of interpretation or maybe opinion or emotion. So if you add it to a news article, it’s not neutral anymore. And I read the news to make my own opinion. So I wouldn’t like to be presented with somebody else’s emotions.” — P11

More participants, however, saw potential design spaces at home in enhancing other media and smart home applications. P3, for example, wished for audible feedback on loading times and the completion of tasks. P1 explained in further detail how a sound or earcon library of a current VA might enhance the notification experience for deliveries:

“Alexa might have some sound ding ding on this topic. Another possibility is when I’m anticipating a package, I know the different stages of the package, like, is it a ship that is delivering (...). It will be quite helpful because right now, they treat it as a notification. Like maybe you have, you can extend these to some parts of: ‘Are going to arrive today’. If they can have a different sound to describe where exactly my package is.” — P1

6 Discussion and implications

In light of our research questions, we want to discuss our results and provide implications for the design of future voice interaction. So far, Alexa is seen as a Voice Assistant that is very neutral in its answers, with little capability to express emotions [18]. A significant amount of research in speech science aims to address this shortcoming through emotional speech and voice design [7,8,9,10, 14, 39, 40, 44, 45]. In this paper, we complement this area of research by outlining a supplementary approach, using sound as a modality that could add a new dimension to voice interaction and enrich the user experience. In particular, we focus on the relation between speech and sound and the balance between communicating information and inducing emotions through sonification.

6.1 Sonic encoding for voice interaction design

6.1.1 Building soundscapes

The prevalent design paradigm regarding sound is to precisely encode information to substitute functions and representations [24], leading to different kinds of highly recognizable auditory icons and earcons. However, this requires either a clear sonic representation or that users learn its meaning first. In current VAs, and computer systems in general, earcons are used to signal warnings or direct attention to events on short notice [24, 25]. However, iconic sonification might come at the expense of rich soundscapes capable of transporting emotions, atmospheres, and further experiential qualities, as known from the design of classical media and extended realities [22, 23, 56, 57].

Extending the purpose of sound beyond substituting single functions and representations, our results indicate that sonic overlays may support voice interaction to encode, illustrate, and communicate messages. The combination of iconic, abstract, and symbolic sounds shows a positive impact on the perception of speech-based weather reports. Participants described their experience as stimulating and entertaining, quite the opposite of previous experiences with VAs. Thereby, iconic elements support the recognizability of intended messages. Some weather types received noticeably less positive feedback than others, particularly those that relied more heavily on abstract sounds, such as cloud and fog. As these require the listener to draw connections between the sounds and the weather in a less direct way, they are more open to interpretation and have more potential to cause confusion. These issues first appeared as early as the pre-survey: these weather types had fewer associated sounds suggested overall, and the most common response for a sound associated with fog was “silence”. Musical elements as well as abstract soundscapes serve as an illustrative layer to build a holistic impression of the specific weather conditions and act as a carrier for moods and emotions. Without a combination with iconic sounds, however, some information might be obscured.

With our work, we present a structured design approach to sonify and illustrate voice interaction and, thus, enrich the experience of weather reports. So far, little work exists on methods and design approaches for voice interaction, especially in combination with sound design [20, 21]. Current approaches to voice interaction design are based on collecting example dialogues, spoken terms, expressions, and paths as design materials. Similarly, we collected associative mappings for each message of a weather event and categorized them into abstract, iconic, and symbolic design elements to develop a non-exclusive sound library. Although the design was well appreciated, we need to balance abstract soundscapes that shape the experience with iconic sounds, while ensuring the recognizability of the intended message to communicate information successfully.
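To make this design approach tangible, the following minimal sketch shows how such a non-exclusive library could be organized in code, tagging each clip per weather condition as iconic, abstract, or symbolic. It is an illustration only; the file names and pairings are hypothetical and do not reproduce our actual library:

```python
from dataclasses import dataclass

@dataclass
class SoundClip:
    file: str      # path to an audio file (hypothetical names)
    category: str  # "iconic" | "abstract" | "symbolic"

# A non-exclusive library: each weather condition layers several clips
# drawn from the associative mappings gathered in the pre-survey.
SOUND_LIBRARY: dict[str, list[SoundClip]] = {
    "rainy": [
        SoundClip("rain_on_window.wav", "iconic"),
        SoundClip("soft_pad_minor.wav", "abstract"),
    ],
    "sunny": [
        SoundClip("birdsong.wav", "iconic"),
        SoundClip("children_playing.wav", "symbolic"),
        SoundClip("acoustic_guitar.wav", "abstract"),
    ],
    "foggy": [
        # No inherent noise exists, so only abstract and
        # symbolic layers remain, which risks ambiguity.
        SoundClip("muffled_drone.wav", "abstract"),
        SoundClip("fog_horn.wav", "symbolic"),
    ],
}

def layers_for(condition: str) -> list[SoundClip]:
    """Return the clips to be layered for a given weather condition."""
    return SOUND_LIBRARY.get(condition, [])
```

Conditions without an iconic layer, such as fog, make the trade-off described above explicit: the soundscape then rests entirely on abstract and symbolic material and becomes more open to interpretation.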

6.1.2 Layering sound and speech

As our results indicate, the sonification of interaction opens the design space for more ways of expression [20,21,22]. However, voice remains a precise channel to communicate information and is perceived as an efficient and convenient way of interacting. Therefore, participants expect sounds to illustrate exactly the information of the voice channel and to avoid contradictions between the two channels. Further, when using associative concepts like “children playing outside in the sun”, designers have to be careful not to layer soundscapes that contain human voices in parallel with the assistant’s speech; otherwise, the discriminability of the speech is not guaranteed. Besides, more research into the differences between similar weather types like frost and snow could prevent misunderstandings. Participants were skeptical whether, e.g., a 50% probability of rain could be communicated via sound; yet, they still desired a high granularity to express the characteristics of weather conditions.

The structure of the audio clips regarding the temporal position of sound and voice received mixed feedback from the participants. Some liked the structure of starting with the sound, then introducing the voice, and ending with more sounds, as it gave them time to form an impression of the weather from the sound that was later confirmed and clarified by the voice. Other participants, however, felt that the clips in their current form were too long and wasted too much time compared to a voice simply speaking the weather forecast in a few seconds. Although almost all believed that the sounds produced a better connection to the weather than the voice alone, several interviewees indicated wanting to hear the voice first to get the most information as quickly as possible. A closer coupling of both channels might reinforce the impression that the sound illustrates what the voice is saying in real time; currently, the voice simply speaks over the soundscape after a few seconds.

Overall, sonic overlays illustrated and strengthened the voice message. Speech added precision of information, especially for events or impressions that are naturally silent and hard to sonify. Besides, a certain granularity and discriminative quality in the sound design might positively impact the precision of information. However, the temporal position of sound and speech has to be purposefully integrated into the overall design and needs more research before clear implications can be given.
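As a hedged sketch of the two timing variants our participants discussed (sound-first with a delayed voice versus speech-first with a short sonic tail), such structures could be prototyped with the pydub library; the durations, ducking level, and clip handling below are illustrative assumptions, not values from our prototype:

```python
from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

def sound_first(soundscape: AudioSegment, speech: AudioSegment,
                lead_in_ms: int = 3000, duck_db: float = 6.0) -> AudioSegment:
    """Sound-first: the soundscape sets the scene, then the voice enters.
    The bed is ducked while the speech plays so that the speech-based
    information is not drowned out. Assumes the soundscape is longer
    than lead-in plus speech."""
    bed = soundscape.fade_in(500).fade_out(1000)
    before = bed[:lead_in_ms]
    during = (bed[lead_in_ms:lead_in_ms + len(speech)] - duck_db).overlay(speech)
    after = bed[lead_in_ms + len(speech):]
    return before + during + after

def speech_first(soundscape: AudioSegment, speech: AudioSegment,
                 tail_ms: int = 2000) -> AudioSegment:
    """Speech-first ('facts not thrills'): the voice starts immediately;
    a short sonic tail leaves an extra impression afterward. Assumes
    the soundscape is at least as long as speech plus tail."""
    bed = (soundscape[:len(speech) + tail_ms] - 6.0).fade_out(500)
    return bed.overlay(speech, position=0)
```

Ducking the soundscape under the speech, as in the first variant, directly addresses the overlap concerns raised by P10, P13, and P12.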

6.2 Balancing emotion and information

6.2.1 Authentic soundscapes

Data sonification may serve both purposes, conveying information and emotion [22]. Sound design in science fiction gives the future a voice, linking the effects to the imagery to enhance the credibility of the cinematic reality [66]. The same holds for the role of sound design in games and XR [56]. Oftentimes, the goal is to create new worlds and experiences that are nonexistent or less prevalent in real life. Our study was quite the opposite, because participants expected to understand the sonic overlays effortlessly. The main goal shifted to imitating the surroundings of known places and building on past experiences to encode information. As our results indicate, social context and the personal residential environment greatly impact the associations that arise and their respective interpretations. For example, people who live in big cities might practice hiking as a rare leisure activity, whereas people from the countryside might have a distinctly different understanding. The same applies to cultural experiences, e.g., festivities like Christmas being associated with specific music and instruments. However, besides supporting the imagination of the known, places in different parts of the world can be illustrated in the same way. Yet, in this case too, representations close to the original experiences of the people living in those areas might be perceived as more worthwhile.

Finally, experiences could be personalized even further by using location data, information on the surroundings of an area, the time of day, and other temporal data to match the experience of the area. This approach would allow for enhanced recognition of sonified information and let users empathize with new places and experiences.
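A minimal sketch of such personalization might look as follows; the ambience mappings, tags, and the urban/rural split are invented for illustration and would need to be grounded in empirical associations:

```python
from datetime import datetime

# Hypothetical ambience layers per context; none of these mappings
# stem from our study data.
URBAN_AMBIENCE = {"rainy": "traffic_on_wet_road.wav", "sunny": "street_cafe.wav"}
RURAL_AMBIENCE = {"rainy": "rain_on_leaves.wav", "sunny": "meadow_birds.wav"}

def choose_ambience(condition: str, is_urban: bool, now: datetime) -> list[str]:
    """Pick ambience layers matching the user's surroundings and daytime."""
    base = URBAN_AMBIENCE if is_urban else RURAL_AMBIENCE
    layers = [base.get(condition, "neutral_pad.wav")]
    if now.hour >= 21 or now.hour < 6:
        layers.append("night_crickets.wav")  # quieter nocturnal layer
    return layers

# Example: a rainy evening in a rural area.
print(choose_ambience("rainy", is_urban=False, now=datetime(2023, 6, 1, 22)))
# -> ['rain_on_leaves.wav', 'night_crickets.wav']
```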

6.2.2 Encoding emotion

So far, VAs lack an engaging experience that motivates users to interact on a regular basis [18] and are regarded mostly in utilitarian ways by users. Following the call of researchers to explore potential experiential qualities of VAs [20, 21, 71], speech science research [7,8,9,10, 14, 39, 40, 44, 45] aims at encoding emotional information and expressiveness into the sound of voice and the way of speaking. With our alternative design approach, we investigated the design space to develop and promote an expressive context for dubbing, voice-overs, and future voice acting [72, 73].

Furthermore, our study focuses on exploring the various options for designing surrounding and ambient sound that contributes to the affective experience of VAs. Our results indicate that sound overlays could enhance imagination in comparison to voice-only interface design. Moreover, our participants reported both calming and unsettling effects that either feel relaxing or symbolize and prompt action. This is also because sound builds up a more complete scene, making it easier to visualize and respond to than hearing words alone.

In the tension field between expressive and informative interaction, designers have to act responsibly and consciously regarding the sonification of positive and negative experiences. As our data shows, some participants were concerned about the manipulative misuse of sounds, for example, when discussing news as a further context for sonification. Clearly, some prefer “facts not thrills” (P6) and do not want their information emotionalized. Further, some users explicitly do not want negative feelings to be triggered. Therefore, designers might also aim to balance hazardous weather conditions like thunder with sounds that indicate the positive feeling of a safe place or home. Nonetheless, future studies could focus more deeply on the relation between voice and (weather) sounds to experiment with fitting voice modulations that mirror the context. In general, sound bears an opportunity to reinforce calming situations, as raindrops against the window were positively associated.

6.3 Limitations

Our study investigated just one potential use case of sound overlays for VAs. A more holistic investigation that tests several use cases is needed to thoroughly understand how to use sound in voice interaction. Nevertheless, we observed positive reactions to our design.

We mainly focused on developing a design approach and examining the general feasibility of a basic concept. At this point, we did not include advanced methods to examine the discriminative quality of the voice within our sonic overlays. Hence, we expect room for improvement in this area. In future work, additional quantitative studies, e.g., asking participants to transcribe the speech of the VA afterward or using established Quality of Experience measurements as applied in telecommunications engineering [74], might significantly improve the discriminative quality.
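As one way such a transcription test could be scored (a sketch under the assumption that intelligibility is operationalized as word error rate, which is our suggestion rather than a method prescribed by [74]):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word tokens;
    lower values indicate that the speech remained intelligible
    under the soundscape."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

# Example: one word of the forecast was masked by the soundscape.
print(word_error_rate("heavy thunderstorms expected at noon",
                      "heavy storms expected at noon"))  # 0.2
```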

The same holds for our insights into semantic mapping and sonic associations. In the tradition of explorative qualitative research [75], our study uncovers relations and suggests hypotheses without statistical validation. For instance, our study suggests that the mappings and sonic associations are more coherent when the illustrated situation (e.g., “it is raining”) refers to natural sounds. Future studies should evaluate our insights and implications quantitatively to gain validated results that either confirm our hypotheses or reveal further areas of improvement [75].

Furthermore, the examples we tested did not represent the real-time weather conditions at the location of our participants, nor were they presented in a realistic situation, e.g., under time pressure or with participants knowing they need to leave the house in the next 10 minutes. To provide more robust results, tests need to be conducted that both resemble more realistic situations and feature the actual outdoor weather. Finally, our test was based on a rudimentary prototype that was not implemented and run on an actual smart speaker. We think that rerunning our study in a realistic and practice-based context might reveal further design principles and limits of usability, but also opportunities for more sonic design.

7 Conclusion

We presented a study that investigates what designers can learn from sound design when they want to enrich the experience with Voice Assistants. Focusing on one of the most popular use cases, we presented a user-centered approach to designing sonic overlays that complement the vocal messages of Voice Assistants and contribute to their user experience. Specifically, we were interested in how the sonification of data might enhance voice interaction by using iconic, associated, and abstract sounds, using the example of weather forecasts. Based on a prestudy with 48 participants, we constructed a sound library for creating soundscapes for seven weather conditions: sunny, cloudy, foggy, thunder, rainy, freezing, and snowing. We further evaluated the resulting soundscapes in an interview study with 15 participants to learn more about the effects of underlaying spoken information with complementary soundscapes. Our study revealed both positive and negative feedback from our interviewees, based on which we elicited respective design implications. Our design approach aims to open the design space for further sonic investigations and designs that enrich voice interaction.