DOI: 10.1145/3613904.3642193 · CHI Conference Proceedings
Research Article · Open Access

Better to Ask Than Assume: Proactive Voice Assistants’ Communication Strategies That Respect User Agency in a Smart Home Environment

Published: 11 May 2024

Abstract

Proactive voice assistants (VAs) in smart homes predict users’ needs and autonomously take action by controlling smart devices and initiating voice-based features to support users’ various activities. Previous studies on proactive systems have primarily focused on determining actions based on contextual information, such as user activities, physiological state, or mobile usage. However, there is a lack of research that considers user agency in VAs’ proactive actions, which would empower users to express their dynamic needs and preferences and promote a sense of control. Thus, our study explores verbal communication through which VAs can proactively take action while respecting user agency. To delve into communication between a proactive VA and a user, we used the Wizard of Oz method to set up a smart home environment with controllable devices and unconstrained communication. This paper proposes design implications for the communication strategies of proactive VAs that respect user agency.


1 INTRODUCTION

Voice assistants (VAs) have enabled human-smart home interaction through natural speech that supports users’ daily lives. With VAs serving as gatekeepers, users can operate interconnected smart devices, including lighting, appliances, security, and temperature. They can also request information about weather or schedules and ask for recommendations on media entertainment or recipes, making various daily activities in the home environment more convenient. Current VAs are generally reactive; they merely respond to user commands. However, they are evolving to become proactive. Proactive VAs autonomously take action by controlling devices (e.g., lighting, temperature) or initiating voice-based features (e.g., reminders, recommendations, nudges) to serve users’ needs before users make a request. For example, proactive VAs can remind users of items they may have forgotten [1, 3] and automate their routines [42, 49]. VAs can also recommend music or food based on users’ moods and state [28, 32] and provide nudges for healthcare support [36] and energy saving [48]. These features extend users’ capabilities and free them from an overload of minor decisions, letting them focus on tasks they value more [13].

In designing proactive systems, it is essential to consider user agency, which empowers users by reflecting their preferences, diversifying controls, adjusting controllability, and explaining the reasoning behind the system’s proactive actions or recommendations [10, 13, 17, 20, 41]. Sundar [45] defines user agency as the extent to which the self serves as a relevant actor in human-computer interaction, highlighting that users should become a “self-as-source” of communication. When a proactive system does not take user agency into account, users may feel a loss of control [17], develop concerns about over-dependence [10], and indeed experience a decreased sense of agency [8, 29]. Earlier research on proactive systems has primarily centered on accurately predicting users’ needs and intentions and taking single-turn actions based on contextual information such as movements (activities, proxemics) [6, 32], the external environment (time, weather, location) [18, 28], users’ state (mood, biophysical signs) [1, 36], mobile phone data [44], and a previous yes or no answer [48]. However, these proactive actions were largely driven by contexts or specific conditions rather than by the users themselves, and relying solely on one-way proactive actions without interactive communication has resulted in insufficient attention to user agency. We believe that users can effectively exercise their agency through verbal communication, the most fundamental, natural, and effective means of human information exchange. Therefore, our study aims to explore how VAs can proactively take action while respecting user agency through two-way, multi-turn verbal communication. In pursuit of this aim, our research question investigates when and how VAs should communicate to provide proactive actions that align with user agency. Additionally, we examine how users perceive and respond to the proactive actions and communication of VAs, as well as how user engagement progresses. This exploration makes a meaningful contribution to the field of human-computer interaction (HCI) research.

To set up the study environment, we adopted a modified Wizard of Oz (WoZ) method [25, 26] that assigns a participant, rather than the experimenter, to the wizard role to simulate a proactive VA. Wizard participants stood in for a proactive VA that could not only proficiently interpret and predict users’ contexts and intentions but also converse freely with users, unconstrained by current technologies. This approach allowed us to closely examine open, unrestricted communication between participants: the users stayed in a home setting while the wizards simulated the proactive VA. We recruited a total of 12 participants, organized into 6 pairs, each consisting of a wizard and a user, and prepared a lab-based smart home setting for the user and a control room for the wizard. While the user stayed in the smart home setting, the wizard, simulating a proactive VA, operated smart home devices and communicated with the user to support the user’s various activities while reflecting user agency. Our study consisted of 5-hour sessions that included a 3-hour experimental observation to collect communication logs and a 2-hour debriefing interview to gather qualitative data on the participants’ experiences during the experiment.

In our findings, we present when and how wizards (simulating VAs) proactively communicate to align with user agency in their proactive actions, considering 3 aspects: communication types, proactivity levels, and communication timing. Based on the communication logs proactively initiated by the VAs (wizards), we categorized the types of VAs’ proactive communication into exploration, suggestions (proactive services), and follow-ups. We also describe the users’ perceptions of and reactions to the VA’s proactive communication. We found that users’ acceptance is distinct from their preferences: users may readily reject or ignore a VA’s suggestion while still liking it. We also classified the progress of user engagement into 3 phases: explaining, reflecting, and engaging. Drawing from our findings, we discuss why communication is imperative for VAs to adapt to user agency, highlighting interpersonal and intrapersonal variability in users’ acceptance of VAs’ proactive actions. Simply making assumptions from contextual information is insufficient for proactive actions that genuinely take user needs and preferences into account. We also put forth design implications for VAs’ communication strategies to carry out proactive actions while respecting user agency. These practical implications provide valuable insights for interaction designers and HCI researchers designing voice user interfaces of proactive VAs.


2 RELATED WORKS

The concept of a proactive system has long been a topic in computing and HCI fields, including ubiquitous computing and Human-Robot Interaction (HRI). The proactivity of VAs is complex and multilayered as it involves the processes of context interpretation, task determination, and autonomous action. In addition, voice interaction in a proactive system encompasses addressing the interaction timing and its initiation methods. Although our study focuses on the communication aspect of proactive VAs that adapt to user agency, in this related works section, we comprehensively review various aspects that constitute a proactive system. Furthermore, we have confined our research domain to smart homes where individuals can communicate verbally with the VA comfortably. In these settings, where the unique everyday life of individuals unfolds, there are vast possibilities for exploring how proactive VAs can act and communicate with users. We excluded domains such as automobiles or workplaces from our study, because the primary activities (e.g., driving or having meetings) and other related factors (e.g., safety and social relationships) may lead to different user experiences from those at home.

2.1 Proactive VAs in Smart Home

2.1.1 Context-aware Computing: How to Interpret User’s Context.

Context refers to all sorts of information that can be used to characterize the situation of a subject [11]. Context-aware computing understands and analyzes the contextual information that surrounds users. Previous studies devised context-aware models that infer what users intend and need, leveraging contextual information such as speech, location, time, biophysical signals, and mobile usage [1, 6, 44]. Bahrainian and Crestani [1] utilized sentiment and biophysical data extracted from conversations to remind users of information that is easy to forget. Chahuara et al. [6] presented a framework that builds context-aware decision processes from speech, agitation, localization, and activity data collected via real sensors in a smart home environment. This model makes decisions such as turning the light on/off, opening/closing the blinds, warning about an unlocked door, or making an emergency call. In addition, Sun et al. [44] developed a contextual intent tracking model that anticipates users’ intentions from contextual information (e.g., app usage, places they visited, current location, and time). The model analyzes what users plan to do and automatically generates ‘if-do-triggers’ that lead to proactive actions; for example, it plays the news on the phone when the user arrives at the office or plays music after 6:30 PM.
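The ‘if-do-trigger’ pattern described above can be pictured as a small rule engine. The sketch below is an illustrative assumption in Python: the `Context` fields, rule conditions, and action names are ours for exposition, not the actual model from Sun et al. [44].

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Context:
    """A minimal slice of the contextual signals mentioned in the text."""
    location: str
    minutes_since_midnight: int

@dataclass
class Trigger:
    condition: Callable[[Context], bool]  # the "if" part
    action: str                           # the "do" part

def proactive_actions(triggers: List[Trigger], ctx: Context) -> List[str]:
    """Evaluate every if-do-trigger against the current context."""
    return [t.action for t in triggers if t.condition(ctx)]

# Illustrative rules mirroring the paper's two examples
rules = [
    Trigger(lambda c: c.location == "office", "play_news"),
    Trigger(lambda c: c.minutes_since_midnight >= 18 * 60 + 30, "play_music"),
]

print(proactive_actions(rules, Context("office", 19 * 60)))  # both rules fire
```

In this framing, the model-generated part is the list of `Trigger` rules; the runtime merely re-evaluates them whenever the context changes.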

2.1.2 Proactive Scenarios: What Tasks to Perform.

Several studies have proposed potential scenarios in which VAs proactively perform tasks in a domestic environment. These scenarios have been evaluated through storyboards [31, 40, 54], short films [4], and a context-based online survey [34] rather than real-life settings. In particular, Meurisch et al. [34] conducted an in-situ survey using a mobile application that matched the user’s presumed activity based on his/her contextual mobile data. Despite an extensive range of scenarios about VAs’ proactivity, no general preferences for specific scenarios have emerged across studies, except for urgent safety situations, such as when users faint or there is a risk of fire [4, 54]. Elderly participants preferred proactive scenarios that address cognitive difficulties due to aging, such as forgetting to take medication and other memory problems [4]. Additional scenarios created for proactive services include: reminders (schedules, changing tires), healthcare (mental health, physical health, using coughing sounds as probable signs of a cold), activity support (shopping, finding directions, traveling), technical support, home control (lighting, temperature, domestic chores), cooking inspiration, nudging (e.g., notifying users of excessive screen time), and fact-checking (time, history) [34, 40, 54]. However, most of the scenarios from these studies have shown that users’ responses and preferences vary by individual and situation [31, 34, 54].

2.1.3 Proactivity Levels: To What Extent to Autonomously Take Action.

Prior studies have investigated the level of VAs’ proactivity. In AI systems, this level can range from reactive, merely executing orders, to full automation, a framing mostly rooted in the 10 levels of automation first proposed by Sheridan et al. [43]. More recent studies divide a system’s proactivity into 3 to 4 levels. Most of them found that users prefer a medium level of proactivity, where VAs make assumptions and ask for users’ permission before acting [24, 31, 34, 38]. Peng et al. [38] designed 3 levels of proactivity (high, medium, and low) based on the extent of assumption and intervention in recommendations for deciding which shoes to buy. They found that medium proactivity is more helpful in narrowing down choices and sharing users’ opinions, but also emphasized that the level of proactivity should flexibly adjust to users’ responses and emotional reactions. Meurisch et al. [34] categorized the proactivity level into reactive support, proactive support I, proactive support II, and autonomous support. Participants tended to expect proactive support II, in which VAs provide personalized recommendations and intervene in their lives. Luria et al. [31] also used their scenarios to classify levels of proactivity: reactive, proactive, and proactive recommender. They found that the desired level of VA proactivity differs by individual preference and situation, but participants still do not want decisions to be forced upon them. For example, one parent wanted to receive parenting advice only when she asked the VA (i.e., a low proactivity level (reactive)); however, she wanted to be notified immediately if her teenager drank beer (i.e., a higher proactivity level (proactive)).

2.1.4 Interaction Timing: When to Interrupt.

Some studies have sought the opportune moment for VAs to proactively initiate interaction. A major concern in these studies is interruptibility, since users may not want to be disturbed [5, 23, 51] and providing proactive services at an inappropriate time could distract or even irritate them [15, 16, 46]. To discover interruptibility based on users’ activities in a home environment, these studies have used the voice-based experience sampling method (ESM), which collects users’ availability by asking, for example, ‘Is now a good time to talk?’, either at random or triggered by contextual information [5, 51]. These studies have intensively examined interruptible moments; however, they have not considered tasks or scenarios involving VAs’ proactive utterances. They found that an individual’s level of engagement, mood, and activity transitions may affect users’ interruptibility. However, no common rules about opportune moments with respect to users’ activities have yet been clearly established. Komori et al. [23] reported that users are more available when they have settled after a transition and relax on the bed, but availability fluctuated even for the same behavior. Cha et al. [5] identified 3 contextual factors, personal, movement-related, and social, that can affect students’ availability with regard to proactive VAs. The study found that participants tend to avoid interruptions when they are focused on their work, busy, or in a bad mood, but are generally more open to interruptions after entering a room or during transitions between physical activities. Wei et al. [51] indicated a significant correlation of boredom and mood with perceived availability in general, and participants were found to be more available when engaged in entertaining tasks rather than studying or working.
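The ESM probing logic described above, a probe delivered either at random or on a contextual trigger, can be sketched in a few lines. The `activity_transition` signal and the sampling rate below are illustrative assumptions, not the exact triggers used in the cited studies.

```python
import random

def should_prompt(context: dict, random_rate: float = 0.1) -> bool:
    """Decide whether to deliver the ESM probe ('Is now a good time to talk?').

    Fires deterministically on a contextual trigger (here, an activity
    transition, an illustrative choice), and otherwise with a small
    random probability, mirroring the random/context-triggered split.
    """
    if context.get("activity_transition"):
        return True
    return random.random() < random_rate

# Context-triggered probe fires regardless of the random rate
print(should_prompt({"activity_transition": True}, random_rate=0.0))
```

Real deployments would feed `context` from sensors or mobile data rather than a hand-built dictionary.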

2.1.5 Interaction Starter: How to Start the Interaction.

Studies have also examined ways for VAs to start proactive interactions, using audible signals (e.g., a beep or ringtone) and visual cues (e.g., a sparkling motion); they suggest that users are likely to prefer VAs that speak directly [47, 51]. Tan and Zhu [47] classified 3 strategies: arouse and wait, arouse and output, and direct output. The majority of participants rated the direct-output scenario as the most satisfying and comfortable, reflecting their preference for practicality and the heartwarming role VAs can play at home. Wei et al. [51] also experimented with 3 different methods to start the interaction: baseline, earcon starter, and utterance starter. Most participants favored the utterance starter, in which the VA asked, ‘Hey, are you available?’ and the conversation began only after users responded with ‘yes.’

Figure 1: (Left) Actual view of the smart home setting. (Right) A diagram of the environment setup, indicating where the voice assistant, 4 cameras, smart devices, and activity resources were placed; researcher A took care of the setting.

Figure 2: (Left) Control room workstation with researchers B and C present. The wizard participant simulated the proactive VA, sitting in the middle. (Right) A 4-panel split-screen was playing the smart home setting in real-time.

2.2 User Agency in Smart Home Control

Ever since Weiser and Brown [52] first envisioned ubiquitous smart technologies seamlessly integrating into the background, Rogers [41] has suggested a shift from Weiser’s concept of ‘calm computing’ to a more user-engaging approach to smart systems. Rogers [41] posed important questions about how designers should decide which tasks ought to remain under human control and which can be managed by automated systems. Subsequent research has sparked ongoing discussions about proactive systems, bringing attention to the increasing role of user agency [10, 13, 17]. These studies emphasize the importance of upholding a balance between the proactive system (device agency) and user agency. Jia et al. [17] conducted interviews after showing a video about the future of the Internet of Things (IoT) and found that participants favored a user-centered approach, expressing a desire to exercise their own agency. However, participants did not want to put too much effort into purposeful customization; instead, they hoped for the system to learn and adapt to their preferences through ordinary interactions. In addition, Desjardins et al. [10] used the co-speculation method ‘Bespoke Booklet’ to explore 5 design avenues for home IoT, including rich negotiation between system and user agency. The study underscored how users reacted with curiosity or felt excluded due to a lack of agency, and how agency is flexible and complex, going beyond a simple binary opposition. Moreover, Garg and Cui [13] sought to understand when and how IoT home devices can support users’ daily lives through co-design sessions and interviews. They presented design considerations for proactive scenarios, the roles users want devices to play, and potential conflicts in designing future home IoT.


3 METHOD

3.1 Study Setting

3.1.1 Modified Wizard of Oz in Smart Home Setting.

We used a modified Wizard of Oz (WoZ) method in a lab-based smart home setting. The WoZ method has been widely adopted in speech-based HCI studies [7, 12, 33]. The core idea of this method is that a human operator, called a ‘wizard,’ invisibly simulates a technology that has not yet been fully developed, making users believe that they are interacting with a real, functioning system. It allows researchers to observe users’ genuine reactions and understand their expectations of and needs for new technology. In a typical WoZ study, experimenters take the role of the wizard; however, the modified WoZ method [21, 25, 26] we used has a participant play the wizard. Having participants act as wizards enables them to directly operate the proactive VA and express their expectations and desires through it, thereby ensuring a user-centered perspective and keeping the study free from experimenter biases.

We set up both a ‘smart home’ for user participants and a ‘control room’ for wizard participants. By recruiting 6 participants for each role, we were able to compare and analyze their proactive actions, communication, reactions, and experiences. In addition, the lab-based setting ensured consistency in the experimental conditions, leading to more reliable data collection and analysis. The study was approved by an institutional review board. In the study scenario, the users went about their daily routines in a smart home environment embedded with a proactive VA. Our wizard participants were tasked with holistically interpreting the vast and complex context of the user, anticipating their needs and intents by considering subtle cues such as nonverbal signals, atmosphere, tone of voice, and even periods of silence. From this, the wizards intuitively determined how to communicate with users, adapting to user agency for proactive actions. They led and engaged in verbal communication with the user via a text-to-speech (TTS) system.

Group | User (ID, Gender, Age, Household Type) | Wizard (ID, Gender, Age) | Time of Study | User's Activities
Group 1 | U1, M, 22, Family Household | W1, F, 29 | Daytime | Listening to music, searching for a neck stretching video, watching TV, …
Group 2 | U2, F, 22, Single-dormitory | W2, F, 29 | Nighttime | Watching TV, shopping for clothes/shoes, reading a book, stretching, …
Group 3 | U3, M, 23, Shared Dormitory | W3, M, 29 | Daytime | Eating instant ramen, watching TV, exercising (push-ups, pull-ups, …)
Group 4 | U4, F, 30, Single-person Household | W4, F, 27 | Nighttime | Ordering/eating the delivered sashimi, watching TV, shopping for …
Group 5 | U5, M, 24, Single-person Household | W5, M, 28 | Daytime | Cooking pasta, playing a guitar, listening to music, watching TV, …
Group 6 | U6, F, 29, Single-person Household | W6, F, 28 | Nighttime | Listening to music, doing yoga, searching for yoga mats, watching …

Table 1: Demographic information of study participants and users’ activities during the study.

3.1.2 Smart Home Setting for Users.

For the smart home, we rented a 52 m² studio apartment, consisting of a room and a bathroom, fully furnished to create a home-like atmosphere (see the left of Figure 1). We intentionally chose this layout to minimize blind spots during camera recording and to ensure that the sound from the smart speaker could be heard anywhere in the smart home. We installed smart home devices for the wizards to control, including a smart speaker (Samsung Bixby Home Mini) for the VA and music, a smart TV (Samsung The Frame, 65 inch), a robot vacuum cleaner (Samsung Power bot), lighting (I/O Switcher), an IoT plug socket (Brunt Plug), and a smart blind (Brunt Blind Engine ver. 2). We also prepared non-smart items to facilitate user activities, such as a pull-up bar, a yoga mat, a guitar, books, cooking tools, and ingredients. Snacks and beverages were also provided. For the scenario in which the VA recommends food to users, researcher A stocked the refrigerator and snack pantry with food before every experiment, took a picture, and sent it to the control room. We mounted 4 webcams (Jooyontech IP cameras, IPC-JA4-A22N) to observe users from all angles in real-time, with the exception of the bathroom and wardrobe area (see the right of Figure 1). IP cameras with 360-degree coverage and 200-megapixel (MP) resolution allowed the wizards to see the users’ facial expressions and postures. Lastly, we installed a laptop, smart speaker, and microphone for the TTS-based communication system. Except for the smart speaker, these were placed under the kitchen island to hide them from the users.

3.1.3 Control Room Setting for Wizards Simulating the Proactive VA.

The control room was equipped to manage the smart devices, monitor the users, and operate voice interactions through a TTS and audio system, thereby simulating a proactive VA (see the left of Figure 2). The control room, located in our research lab, had 2 researchers (researchers B and C) assigned to assist the wizard. On the right side of the workstation, we set up an iPad with control apps installed to operate all the smart devices in the smart home, and a PC for music and web searches. Researcher B helped the wizard with smart home controls, music playback, and information searches. On the left side, researcher C was responsible for entering and managing communication logs during the experiment. At the front of the wizard’s workstation, a 27-inch monitor simultaneously displayed a split-screen view from the IP cameras installed at 4 different angles (see the right of Figure 2). All footage was recorded and saved in cloud storage connected to the IP cameras. We also provided a 13-inch laptop running a TTS web application developed for our experiment. Through the laptop’s speakers, the wizards could hear in real-time what the users were saying into the microphone placed in the smart home. Every time the wizard typed a phrase into the TTS application, it was immediately sent online, converted to speech, and broadcast through the smart speaker in the smart home. The TTS system made use of WebRTC APIs and the Google TTS system [14]. The graphical user interface of the TTS web application consisted of a text input field and a ‘speak’ button.
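The wizard-to-speaker pipeline above (type a phrase, synthesize it, broadcast it in the smart home) can be summarized in a short sketch. The `WizardTTSRelay` class and the fake synthesizer below are illustrative assumptions for exposition, not the actual web application.

```python
import queue
from typing import Callable

class WizardTTSRelay:
    """Sketch of the control-room pipeline: the wizard submits a phrase,
    it is synthesized to audio, and the audio is queued for playback on
    the smart speaker in the smart home."""

    def __init__(self, synthesize: Callable[[str], bytes]):
        self.synthesize = synthesize  # e.g. a TTS service client
        self.playback_queue: "queue.Queue[bytes]" = queue.Queue()

    def speak(self, phrase: str) -> None:
        """Triggered by the 'speak' button in the web application."""
        self.playback_queue.put(self.synthesize(phrase))

# A fake synthesizer stands in for the real TTS service
relay = WizardTTSRelay(synthesize=lambda text: f"<audio:{text}>".encode())
relay.speak("Would you like me to dim the lights?")
```

In the actual setup, a playback process on the smart-speaker side would drain the queue; injecting the synthesizer as a parameter keeps the relay testable without a network call.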

Figure 3: The overall process of the 5-hour study: 3 sessions and positions of the participants and experimenters for each session.

3.2 Participants

A total of 12 participants, consisting of 6 user-wizard pairs, were recruited for the 5-hour study. To recruit users, we distributed a screening survey via a university online forum and flyers and selected the final 6 participants after 2 rounds of screening. In the first round of selection, we had 114 respondents. The survey inquired about their proficiency with AI technology, frequency of using VAs, and demographic information such as gender, age, major, and type of household. We narrowed these down to 32 potential participants, opting for those who were familiar with VAs (VA usage frequency scoring over 3) and had limited expertise in AI technology (AI knowledge level scoring 1) based on their self-rated scores on a 5-point scale. Since it is critical in the WoZ method for users to believe they are interacting with actual technology, we excluded participants with advanced knowledge of AI technology to minimize potential doubts about the system’s feasibility. Most of them lived in single-person households, which aligned well with our experiment scenario. In the second round, we asked user participants to draft a hypothetical plan of how they would typically spend 3 hours at home: what they usually do after finishing their daily tasks during the week or while enjoying the daytime on a weekend. We made this inquiry because they were expected to spend a few hours alone in new surroundings, regard the smart home setting as their personal space, and act on their own accord. After a thorough review of the responses to include diverse activities distinct from one another, we selected 6 user participants (3 females and 3 males). We confirmed by phone that the users had no concerns about undertaking various activities in a new environment.

For the wizard’s role, we recruited students from a graduate school of design who had experience studying and designing voice interaction with VAs, for 2 reasons. First, their academic knowledge of user experience design and their problem-solving skills enabled them to interact with users more flexibly and creatively than experimenters, who are likely to be fixed in established practices of VA proactivity. Second, their experience in voice user interface (VUI) design provided them with a fundamental understanding of VAs, which was required for playing the role of a proactive VA. We shared a recruitment post on the communication channel of a university’s industrial design department and received 13 responses. We chose wizard participants based on their self-reported VA usage frequency and VUI design knowledge, both rated on a 5-point scale with a minimum score of 3. We selected 6 wizard participants (4 females and 2 males) and paired them with the 6 user participants (3 females and 3 males). The groups were randomly formed but matched by gender where possible; we expected that same-gender pairing would help the wizards understand the users’ gender-specific preferences and interests, such as sports games, fitness activities, fashion, and cosmetics, more effectively. All participants received compensation of 100,000 Korean won (approximately 75 U.S. dollars) for their participation in the experiment. Table 1 presents basic information about the users and wizards, along with the activities they performed during the study.

3.3 Procedures

Our study had 3 phases: 1) an introductory session, 2) an observation session, and 3) a debriefing interview. Figure 3 illustrates the overall process of our user study.

3.3.1 Introductory Session.

3 researchers were involved in conducting all experimental procedures. After researcher A and the user arrived at the smart home, researcher A gave the user instructions about the experiment. The users were asked to make themselves at home in the smart home setting, interacting comfortably and freely with the VA designed to proactively assist with their daily activities. They were informed that they could respond to the VA however they wished (e.g., they could choose not to answer) or start a conversation themselves. To convince the users that the system was actually operational, we explained that the VA was a beta version in development. We also mentioned the possibility of occasional delays in response time to mitigate potential errors. The user participants were aware that the smart devices were controllable through the VA. They were given a detailed tour of the smart home to familiarize them with the space. Furthermore, we reminded users of the experiment details they had previously agreed upon. We transparently disclosed the location of the cameras and the fact that they were being observed and recorded by the researchers in real-time. The user participants were told that they could discontinue the experiment at any time and would be compensated depending on the duration of their participation. All participants consented to join the experiment.

At the same time, the wizard was in the control room receiving instructions from researchers B and C. They were instructed on how to operate the TTS web application and the features of the smart devices installed in the smart home. The wizards were guided to predict the users’ needs and intents based on their human intuition and senses, and also to take ownership of their communication so as to adapt to user agency and proactively provide services. They were informed that the users believed they were interacting with a VA, not a human, and were told to avoid overly human-like behavior that might deviate from the general mental model of a VA. They were also asked to prioritize the quality and grammatical correctness of their responses, even at the cost of delayed responses. Then, both the users and wizards simultaneously went through the voice interaction onboarding process through the smart speaker located in the smart home. This helped them grasp how the voice interaction worked. The users were guided to kick-start the onboarding by saying, “Hi, Bixby. Let’s start the experiment.” The wizards, who controlled the VA, were instructed to ask the users 5 questions from a prepared questionnaire, covering their name, favorite songs/singers, sports to play or watch, viewing content, and food. After the instruction and onboarding were completed, researcher A left the room and waited near the lounge area.

3.3.2 Observation Session.

After the 30-minute introductory session, the experiment proceeded for about 2 hours and 30 minutes. We opted for this 2-to-3-hour duration for the following reasons. In our 2-hour pilot study, we were able to accumulate substantial communication logs of more than 100 turns, a range that seemed reasonable for conducting thorough debriefing interviews that closely examine each communication log. We were also concerned that a longer experiment might impose stress or fatigue, particularly on the wizards. To maintain consistent quality in the wizard’s judgment and speech, we chose to have a single wizard for the observation session within a reasonable timeframe.

In addition, the pilot study revealed that users were more stationary than expected; many of them lay down looking at their mobile phones. For this reason, prior to the experiment, the users were asked to list at least 3 activities they typically do at home, encouraging them to be as active as possible. As a result, the users engaged in the various activities described in Table 1. They were instructed to reach out to researcher A if any issues arose, but no such situation occurred. Meanwhile, the wizards monitored the users in real time and diligently performed the role of proactive VAs: they keenly observed the users’ facial expressions and behaviors, pre-searched related information, and noted the users’ responses and potential suggestions based on previous conversations. The primary role of researcher B was to help the wizard with functional operations such as controlling the smart home devices or the TTS app. Researcher C was responsible for transcribing all the users’ speech along with the wizard’s TTS input for the communication logs. Throughout the experiment, users communicated with the VA without difficulty. No grammatical errors or incorrect responses were observed, with only a few instances of fallback feedback such as “I’m sorry, I can’t help with that.” After the experiment was completed, researcher A and the user participant moved to the control room, about a 10-minute drive away, to join the wizard participant and researchers B and C.

3.3.3 Debriefing Session.

Following each experiment, we conducted a debriefing interview with the wizard and the user participants together for about 1 hour and 30 minutes. Before the interview, we disclosed to the users that the VA had actually been operated by a human wizard, not a system, and provided an overview of the purpose and setup of the experiment. Based on the communication logs transcribed in real time during the experiment, both the users and wizards went through in-depth interviews covering nearly every instance of the VA’s proactive communication from a dyadic perspective. The wizards were asked about their intentions, strategies, and reasons for proactively initiating communication. The users took turns answering questions about their experiences, emotions, and thoughts in response to these proactive communications.

3.3.4 Data Collection & Analysis.

To examine the communication proactively initiated by the wizards and the corresponding experience of the users, we collected 3 types of data: 15 hours of observation videos, 1,416 communication logs, and 8 hours of interview recordings. All logs and interview recordings were transcribed. We also compiled and time-stamped observation video clips into sequential conversation segments, filmed from 4 different angles in a 4-way split-screen format. We then aligned these clips with the interview transcripts and communication logs using the time stamps of the observation video.

We first conducted conversation analysis [39], delving into fragments of continuous two-way conversation. We segmented the logs into fragments by identifying the beginnings and endings of communication, resulting in a total of 279 fragments. We scrutinized fragments containing the VA (wizard)’s proactive communication logs, where the wizard either initiated or proactively continued the communication. To identify patterns in the VAs (wizards)’ communication types and timing, we extracted and coded 180 proactive communication logs from a total of 794 VA (wizard) logs. Following this, 3 researchers applied open coding to the corresponding interview data using thematic analysis [2]. The lead author generated initial codes, which were then iteratively reviewed with two other researchers until consensus was reached. We further refined the codes through axial coding, considering user responses and their mutual conversation. We present the findings of our analysis in the following section.
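As an illustrative sketch of this log-processing step (the data structure, field names, and time-gap threshold here are hypothetical, not taken from the study), consecutive turns can be grouped into conversation fragments and the VA (wizard) turns that initiated communication can then be filtered out:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    t: float          # seconds since session start (hypothetical field)
    speaker: str      # "VA" or "USER"
    initiated: bool   # True if the VA spoke without a preceding user prompt

def segment_fragments(turns, max_gap=60.0):
    """Group consecutive turns into conversation fragments, starting a new
    fragment whenever the silence between turns exceeds max_gap seconds."""
    fragments = []
    for turn in turns:
        if fragments and turn.t - fragments[-1][-1].t <= max_gap:
            fragments[-1].append(turn)
        else:
            fragments.append([turn])
    return fragments

def proactive_turns(turns):
    """Extract the VA (wizard) turns that initiated or proactively
    continued the communication."""
    return [turn for turn in turns if turn.speaker == "VA" and turn.initiated]
```

In practice the paper’s segmentation was done manually from the transcripts; the sketch only conveys the shape of the filtering that yielded the 180 proactive logs out of 794.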

Proactive Exploration
- Personal Information Inquiry: Collect a range of user information about individual characteristics and preferences [38]. Examples: “Which genre do you prefer?” (W2), “Do you have specific pop artists you like?” (W2), “What is your skin type?” (W2), “Do you have any plans today?” (W3), “What kind of food do you like?” (W4), “Do you like visiting art galleries?” (W5), “Have you ever made Cacio e Pepe pasta before?” (W5)
- User Behavior/Intent Verification: Ask questions to accurately understand the user’s ambiguous behaviors, needs, or intents [53]. Examples: (Lying in bed) “Are you getting ready for bed?” (W2), (Trying to open the window) “Is it hot in here?” (W3), (Clattering a frying pan in the kitchen) “Are you cooking by any chance?” (W5), (Busily gathering various items) “Are you getting ready to go out?” (W3, W6)

Proactive Suggestions
- Information Provision: Provide information about the external environment or the status of running features. *Information related to Decision-making Support and Well-being Advice can be found in their respective sections. Examples: “It’s starting to rain outside now.” (W1), “The weather is quite cold. Make sure to dress warmly before you go out.” (W5, W6), “K. Will (a singer)’s national tour concert you just mentioned, there are more than 300 seats still available in Daejeon (a city).” (W4)
- Smart Home Controls: Control (turn on/off) smart home devices. Examples: “Would you like me to turn off the TV?” (W3, W4, W6), “Do you want me to open/close the blinds?” (W2, W3, W5), “Would you like me to order delivery food?” (W2, W3, W6), “Would you like me to play/turn off the music?” (W2, W3, W5, W6)
- Decision-making Support: Provide recommendations or necessary information during the user’s decision-making [38]. Examples: “Would you like me to recommend you a sports/pilates channel?” (W3, W6), “Would you like me to recommend you some clothing/shoe trends to shop?” (W2, W4), “You can also watch highlights of the soccer match between South Korea and Lebanon on December 14th on the SPOTV channel.” (W3), “Would you like me to play songs similar to those of (name of a singer)?” (W1)
- Well-being Advice: Improve the user’s overall quality of life [54], playing the role of a digital coach [34] or alerting safety notifications [4]. Examples: “You had a late-night snack; how about some home workouts?” (W2), “Using mobile phones for a long period isn’t good for your eyes. Please take breaks and look into the distance.” (W3), “How about I recommend some eye exercises to help alleviate your eye strain?” (W4), “Cleaning up the dishes right after eating can prevent bugs.” (W1), “Please be cautious of fire while cooking.” (W3)
- Social Talks: Build a pleasant and comfortable social relationship with the user. Examples: “Did you enjoy your meal?” (W4), “How was your day?” (W4, W6), “Would you like to hear a joke?” (W2, W5), “Take care not to catch a cold.” (W3, W6), “You are very good at playing the guitar.” (W5)

Proactive Follow-ups
- Recommendation Adjustment: Make prompt adjustments or offer alternatives [38]. Examples: “I see. How about another trend, Angora knits?” (W2), “If you don’t like (a TV series), how about (another TV series)?” (W2), “Would you like me to recommend another Disney movie?” (W6)
- Explicit Feedback: Ask directly for feedback about the user’s experience [53]. Examples: “How did you find my recommendation?” (W2), “Looks like you’re watching (YouTube content) I recommended. How are you finding it?” (W3), “Did you enjoy the sashimi you ordered for dinner?” (W4), “Did you enjoy it?” (W6)

Table 2: Categories of the VA (Wizard)’s Proactive Communication Types: Proactive Exploration, Proactive Suggestions, and Proactive Follow-ups.


4 FINDINGS

Based on the communication logs and interviews, we analyzed how the wizard participants, who simulated the VA, acted and communicated to enhance user agency across three aspects: communication types, proactivity level, and communication timing. We also describe how the users perceived and reacted to the VA’s proactive suggestions and progressively engaged in communication. To support our findings, we present empirical evidence from conversation fragments and interview quotations.

4.1 VAs (Wizards)’ Proactive Communication Types

We extracted 180 of the VAs’ proactively initiated utterances from overall communication logs and classified them into 3 categories: Proactive Exploration, Proactive Suggestions, and Proactive Follow-ups. Each of these categories is further divided into subcategories with actual examples as shown in Table 2. Our findings that align with previous studies are indicated with their references.

4.1.1 Proactive Exploration.

Although the VAs (wizards) could recognize context at a human level, they often encountered situations where it was challenging to predict users’ needs or intents solely from behavior and context. In such cases, the uncertain wizards simply asked straightforward exploratory questions before proceeding with proactive suggestions. In the first subcategory of Proactive Exploration, the VAs (wizards) inquired about personal information to learn each individual’s unique characteristics, such as preferred genres of movies and songs, favorite musicians, past cooking experiences, visited places, personal schedules, and eating habits, in order to provide more personalized suggestions. In the second subcategory, the VAs (wizards) questioned the users to verify potentially ambiguous behaviors, needs, or intentions. For example, when the user (U2) lay down on her bed, the wizard (W2) asked, “Are you getting ready for bed?” to clarify whether she intended to go to sleep or just rest.

4.1.2 Proactive Suggestions.

The VAs (wizards) suggested services that could support the users’ various activities either functionally or psychologically. We grouped these services into the following 5 subcategories, shown in Table 2: information provision, smart home controls, decision-making support, well-being advice, and social talks. First, the VAs (wizards) provided information about the outside environment that the users, being indoors, might not be aware of (e.g., cold weather) or that suddenly changed (e.g., the onset of rain), as well as the status of running features (e.g., delivery status updates, alarms). Second, the VAs (wizards) made proactive suggestions about smart home controls, such as turning off unused appliances based on the user’s activities or pulling down the blinds as the sun set. Extended services such as ordering food delivery or grocery shopping are included in this broader sense of smart home controls. Third, the VAs (wizards) offered customized recommendations or necessary information during the users’ decision-making processes: searching for what to watch, choosing music to listen to, contemplating what to buy, or deciding where to go for a date. For controlling a music player, turning the music on and off is classified under smart home controls, while specific song recommendations, such as “Do you want me to play (title of a song) for you?”, were sorted under decision-making support. Fourth, the VAs (wizards) gave advice to promote a better quality of life, support health and safety management, and foster good habits. Fifth, the VAs (wizards) made small talk, compliments, and jokes to cultivate a pleasant and comfortable interaction and to demonstrate empathy. All proactive suggestions were either derived from preceding exploratory questions or prompted directly.


Figure 4: (Left) Quantities and proportions of communication types for each group. (Right) Quantities and proportions of communication timings, and percentage of communication types for each communication timing.

4.1.3 Proactive Follow-ups.

After making proactive suggestions, the VAs (wizards) tended to ask follow-up questions, either refining their prior suggestions according to the users’ responses or directly asking for explicit feedback. In the first subcategory of Proactive Follow-ups, the VAs (wizards) continuously modified their recommendations on the fly depending on how the users responded. When the users expressed dissatisfaction with an initial recommendation, the wizards offered alternatives in the following turn, for example, ‘If you are not happy about (this), how about (this)?’ Such promptly adjusted follow-up recommendations were observed in most decision-making support scenarios. In the second subcategory, the VAs (wizards) often asked directly for explicit feedback, such as ‘How was my recommendation?’, for future reference. By gathering the users’ feedback, the wizards aimed to better understand individual preferences and satisfaction levels so that they could provide more tailored suggestions next time.

It should be noted that the utterances initiated by the VAs (wizards) have been put into categories for a comprehensive understanding of our data, and these categories can be intertwined within one utterance. For example, “VA (W2): Are you hungry? Would you like me to recommend you nearby delivery restaurants?”, “VA (W2): It’s getting dark, do you want me to close the blinds?”, and “VA (W3): If you’d like to take a nap, just let me know. I can turn off the music and set an alarm for you.”
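The recommendation-adjustment pattern described above can be pictured as a simple loop: each rejection, optionally carrying a disliked attribute, prunes the remaining candidates until the user accepts or deliberately halts. The sketch below is only an illustration with hypothetical data structures and function names, not the wizards’ actual procedure:

```python
def adjust_recommendations(items, ask):
    """items: list of (title, tags) pairs. ask(title) returns one of
    ("accept", None), ("reject", disliked_tag_or_None), ("stop", None)."""
    pool = list(items)
    while pool:
        title, tags = pool.pop(0)
        verdict, disliked = ask(title)
        if verdict == "accept":
            return title          # suggestion settled
        if verdict == "stop":
            return None           # user deliberately halts the process
        if disliked is not None:  # prune candidates sharing the disliked tag
            pool = [(t, tg) for (t, tg) in pool if disliked not in tg]
    return None
```

For instance, a user rejecting an animation “about robots” would remove every remaining candidate tagged with that attribute before the next offer, mirroring how the wizards incorporated rejections into the following turn.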

4.2 VAs’ (Wizards) Proactivity Level and Communication Timing

4.2.1 Proactivity Level for Smart Home Controls.

Even though the VAs (wizards) were capable of manipulating the smart devices in an autonomous and ambient manner without talking to the users, we did not observe a single instance where the VAs controlled a device independently. They asked the users for consent every time before operating a device or taking any action (e.g., “Would you like me to turn off the TV?”). Wizards (W1, W2) mentioned that their role was to anticipate what users might need, but the decision to proceed was up to the users. Likewise, none of the users wanted the VAs to take action without their permission, even in seemingly obvious and straightforward situations. Users (U5, U6) explained that, given dynamic changes in situations, moods, and other factors, whether they accept a suggestion can vary regardless of its relevance. Thus, they preferred VAs that seek approval before acting over having to undo an action that was unintentionally taken.

For example, W5 noticed that the sound of U5’s guitar playing was being drowned out by the background music. W5 offered to lower the volume, asking, “VA (W5): I can hear your guitar playing. Would you like me to lower down the music?” U5 was satisfied with the suggestion but stated that the VA should always seek agreement before autonomously lowering the volume.

I really liked the suggestion. It was exactly what I wanted. But I don’t want it to decrease the volume without my permission. Volume can be relative, depending on individuals and situations. What is loud for some might be okay for me. So, I want it to ask for my opinion before lowering down. (U5)

In a similar manner, U6 was listening to a song but then turned on the TV to watch a movie. Because people usually do not listen to music and watch a movie at the same time, W6 assumed that the user was about to watch a film and offered to turn off the music that was still playing: “VA (W6): Would you like me to turn off the music for a better movie experience?” Even though U6 accepted the suggestion and found it highly appropriate, she did not want the VA to turn off the music without prior approval.

Turning off the music was a thoughtful offer, assuming that I no longer needed the music. I was like, ‘Cool, it’s setting up the right ambiance for me.’ But, no matter how evident the situation might be, making sure before turning off anything would be nice. Even in such obvious situations, there might be rare moments where I want both. If it turns off something without asking, I might have to turn it back on, which seems very annoying. So, the idea of turning off something without my permission is not right for me. (U6)
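The consent-before-action principle the users described can be expressed as a small gate around any device command: the assistant proposes, and only an explicit “yes” triggers execution. This is a minimal sketch under hypothetical names (`propose_action`, `ask_user`, etc.), not an implementation from the study:

```python
def propose_action(description, execute, ask_user):
    """Ask for permission before acting; never operate a device autonomously.

    description: human-readable action, e.g. "turn off the music"
    execute:     zero-argument callable that performs the action
    ask_user:    callable taking the question string, returning True/False
    """
    if ask_user(f"Would you like me to {description}?"):
        execute()
        return True
    return False   # declined: leave the environment untouched
```

Declining leaves the environment unchanged, which matches U6’s preference for being asked over having to undo an unwanted action.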

4.2.2 Communication Timing.

As mentioned in the previous section, the VAs (wizards) never operated devices autonomously; they acted only through verbal communication after seeking the users’ consent. Since there were no instances of the VAs directly controlling devices on their own, we focused on when the VAs proactively initiated utterances. We retrospectively classified the 180 proactive communication logs into 6 patterns of proactive communication timing, primarily based on the users’ behavior: notification, pre-activity, main activity, post-activity, idle, and in-conversation (with the VA). Without claiming statistical significance, we present the quantitative data for visual reference in Figure 4.

The VAs (wizards) promptly provided notifications (6.7%) upon changes in the external environment (weather) or whenever there was an update in a service status (food delivery tracking, timer). The VAs (wizards) also initiated communication prior to the main activity (pre-activity, 9.4%) and during the main activity, directly assisting the users in certain activities or creating a supportive environment for what they were doing: for example, information provision (offering a recipe while cooking (U5)), smart home controls (repeating 20 seconds of a song during guitar play (U5)), well-being advice (providing a fitness coaching guide during push-ups (U3)), and decision-making support (selecting content to watch (U1, U6) or shopping items while using a mobile phone (U2, U4)). Furthermore, the VAs (wizards) started communication after a main activity (post-activity, 17.8%), when the users were idle (idle, 9.4%), and even while already conversing with the users (in-conversation, 21.1%); in-conversation indicates that the VAs (wizards) proactively joined conversations that followed the users’ initial voice commands. Overall, the VAs (wizards) initiated communication with each group an average of 30 times over approximately 2.5 hours, and not a single user reported feeling annoyed or disturbed by these communications.
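The percentages above are simple shares of the 180 coded utterances. As a sketch (the sample labels below are made up, standing in for the actual coded logs), the distribution can be tallied as:

```python
from collections import Counter

# The paper's six timing categories
TIMINGS = ("notification", "pre-activity", "main activity",
           "post-activity", "idle", "in-conversation")

def timing_proportions(labels):
    """labels: one timing code per proactive utterance.
    Returns the percentage of utterances in each timing category."""
    counts = Counter(labels)
    total = len(labels)
    return {t: round(100.0 * counts.get(t, 0) / total, 1) for t in TIMINGS}
```

Running this over the real coded logs would reproduce the proportions reported in Figure 4.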

4.3 Users’ Perception and Responses

4.3.1 Users’ Acceptance and Preference: No Doesn’t Mean Dislike.

The users showed various responses to the VA’s proactive suggestions: they accepted, declined, or ignored them. Interestingly, while all users easily declined or ignored suggestions, doing so did not necessarily imply dislike. All users stated that declining a suggestion from the VA simply meant they did not require it at that particular moment, not that they were rejecting the suggestion itself. They found such suggestions potentially useful in similar future situations and expressed a desire to be asked again. For example, when U4 was looking at her phone in bed, W4 judged that the TV, displaying only a search screen, was not being watched and suggested turning it off. U4 did not accept the suggestion, saying, “U4: Uh...no, just leave it on.” However, she explained in the interview that although the suggestion was not immediately useful, she acknowledged its potential value and expressed a preference for the VA to continue making such suggestions, noting that she might accept them in the future.

01  W4: Would you like me to turn off the TV? [Smart Home Control]
02  U4: Uh... no, just leave it on.

Table 3: Asking to turn off the TV when the user was not watching

When it offered to turn off the TV, I thought it was trying to save or manage power usage. I told it to leave the TV on since I might want to watch it again soon. However, it would be good if the VA could ask me whenever I’m not watching TV. I sometimes get too lazy to turn it off myself, and having this feature could be useful. (U4)

As another example, U3 moved to open the window, and W3 proposed turning on the air conditioner (A/C), assuming that the user might be hot. Since U3 had already opened the window, he did not accept the suggestion. U3 clarified that this was not because he disliked the suggestion but because he did not need it at that moment. Once aware that the VA could control the A/C, he preferred the VA to keep inquiring in similar situations, considering it might be useful in the future.

I didn’t know that this feature (turning on the A/C) existed. Knowing this makes me want to try it next time. Since I had already opened the window, I turned down the offer. If this feature only activates when it is hot or humid, I’d like to keep using it. (U3)

In addition, all users easily and nonchalantly ignored the VA’s suggestions, even when they found them handy, feeling no need to respond every time since it is just a system. For example, U1 had been watching YouTube videos on neck disc issues. Seeing this, W1 asked about his interest in neck health and then proactively provided information on neck disc prevention. U1 simply let it pass because he did not feel the need to respond to the system, but he found it useful that the VA autonomously provided relevant information.

I didn’t respond because reacting to the system isn’t necessary, unlike human-to-human communication. I like having information directly handed to me. It was cool and helpful. (...) There’s a lot of information on the Internet and the fact that not everyone is savvy in searching online. I think those people will find it particularly beneficial. (U1)

Moreover, in terms of the VAs assisting users’ decision-making, the users (U2, U4, U6) repeatedly declined recommendations, yet they still valued them. The wizards (W2, W4, W6) continuously adjusted their recommendations, promptly incorporating the users’ feedback. The wizards experienced a substantial burden when they were unable to offer recommendations that completely satisfied the users. Contrary to the wizards’ concerns, the users were not bothered by the VA’s inability to make spot-on recommendations. In fact, despite their consistent discontent with the VA’s recommendations, the users appreciated its continuous efforts to narrow down the options, expressing a desire for the process to continue until they deliberately halted it. For example, while W2 was recommending Netflix content to U2, the user kept rejecting the suggestions, saying, “U2: Seen that one already.” or “U2: I don’t like robots.” W2 felt pressure from the ongoing failure to meet the user’s expectations, worried about potentially losing trust, and consequently stopped recommending. On the other hand, U2 explained that even though she kept rejecting the recurrent recommendations, she considered it a learning process of accumulating data on her movie preferences. She was pleased with the VA’s persistent attempts until she explicitly directed it to stop.

01  U2: Can you recommend something on Netflix?
02  W2: Would you like me to recommend a reality show like the previous time, or are you interested in a different genre? [Personal Information Inquiry]
03  U2: Hmm… thrillers.
04  W2: How about ‘Shutter Island’? It is directed by Martin Scorsese, who has been in the spotlight.
05  U2: Seen that one already.
(Omitted)
06  W2: What do you say to ‘Bates Motel’ or ‘Black Mirror’? [Recommendation Adjustment]
07  U2: Play ‘Bates Motel.’ What’s it about?
08  W2: It’s a prequel to the movie ‘Psycho.’ It’s the early life story of the psychopathic murderer, Norman Bates.
09  U2: Thanks. Sounds interesting.
10  U2: But Bates Motel seems long, can you recommend something shorter?
11  W2: Would you like to try ‘Love, Death and Robots’?
12  U2: What’s that about?
13  W2: It’s a collection of short animations about robots.
14  U2: I don’t like robots.
(Recommendation Discontinued)

Table 4: Recommending what to watch on Netflix

4.3.2 Users’ Response to the VA’s Well-being Advice.

The users (U1, U3, U4, U6) who received well-being advice from the VAs perceived the tips positively. We asked how they would respond to such comments if they were given on a daily basis. They stated that they would be open to proactive advice every day, as long as the VAs’ utterances were not mechanically repetitive, did not spout the same phrases, and did not rush their comments each time. For example, W3, monitoring the user’s continuous use of fire to cook instant ramen, uttered a safety warning: “VA (W3): Please be cautious of fire while cooking.” U3 mentioned that the VA’s warning heightened his safety awareness and expressed a willingness to receive it repeatedly, but only during prolonged use of fire.

It was good to be reminded about safety. I don’t mind being asked every time, as long as replying isn’t mandatory. And I’d prefer the voice assistant to alert me when I’ve been using the fire for a long time instead of mentioning it every time I turn it on. (U3)

Some users (U1, U4) perceived the VA’s advice as less burdensome than nagging. For example, after U1 finished his steak, W1 advised, “VA (W1): Cleaning up the dishes right after eating can prevent bugs.”, taking extra care to phrase it in a non-authoritative manner. U1 found this experience beneficial for maintaining a disciplined lifestyle. He was fine with repeated advice but preferred to receive it a bit later and more tailored to his daily routine, such as opting out on his day off.

01  W1: Have you finished eating?
02  W1: Cleaning up the dishes right after eating can prevent bugs. [Well-being Advice]
03  U1: Cool, thanks.

Table 5: Encouraging the user to clean up after a meal

I cook a lot at home and often find myself getting lazy and chilling after meals, sometimes feeling I’m wasting my time. Hearing the voice assistant nudge me to clean up really hits differently compared to just knowing in my head that I should do it. (...) If it keeps advising like this, I want some variation based on the situation. On weekdays, when meal times are mostly fixed, I hope it reminds me to clean up timely. On weekends, I prefer being left to relax without such prompts. Also, being asked right after finishing a meal may be annoying. Suppose I usually start cleaning up about 2 minutes after I’ve finished eating; in that case, receiving a reminder roughly 3 minutes later would be nice. (U1)

4.3.3 Users’ Response to the VA’s Social Talks.

Regarding the VA’s proactive social talks, 2 users (U4, U6) enjoyed having social conversations with the VA. For example, when U4 finished eating the delivered sashimi, W4 offered common pleasantries: “VA (W4): Did you enjoy the meal?” U4, who lives by herself, appreciated that the VA noticed she had finished eating and gave her a chance to voice how the food was. In addition, both users (U4, U6) liked it when the VA asked, “How was your day?”, which prompted them to recount their days. They felt comforted by the VA’s empathy for their fatigue and busyness, even though it came from a machine. U4 explained that receiving frequent check-in questions is fine, as she can simply ignore them when she is not in the mood. She also wanted the VA to engage in social talk while taking her mood and fatigue level into account.

01  W4: How was your day? [Social Talk]
02  U4: Well... I went to a lab mate’s defense presentation and had a meeting with the professor... It was a busy day.
03  W4: Sounds like you’ve had quite a day.
04  U4: Yeah, it really was a hectic day. I even had to rush through lunch.
05  W4: Are you feeling tired?
06  U4: My eyes are a bit strained...
07  W4: How about I recommend some eye exercises to help alleviate your eye strain? [Well-being Advice]
08  U4: Sounds good.

Table 6: Inquiring “How was your day?”

I felt nothing special when it(VA) asked how my day was. But it was good to reflect on how I spent my day answering that question. Sometimes, I hesitate to tell my struggles to my family because I don’t want them to worry or nag about it. However, I felt I could openly share about my tough day with it. I’d likely enjoy chatting with it, or I can easily ignore it when tired. It would be even better if it could sense my mood and tiredness, adjusting its interactions accordingly. (U4)

4.4 Users’ Engagement Over Time: Exploring, Reflecting, and Engaging

In this study, we observed changes in the users’ engagement as the VAs tried to engage proactively while reflecting user agency. The users, aware that their responses influenced subsequent suggestions, gradually responded more specifically, exhibiting a willingness to train the VAs. We classified this process into 3 stages: exploring, reflecting, and engaging.

4.4.1 Exploring: The VA Learns More about Users through Communication.

At the beginning of the experiment, the VAs (wizards) knew almost nothing about the users except the basic information from the default onboarding questions. The VAs (wizards) put effort into gathering implicit information from the users’ behaviors and voice commands, such as which YouTube channels the users watched, what kinds of food they ate, and which instruments they could play. For example, when U4 asked, “User (U4): Are there any nearby cafes where I can go with dogs?”, W4 inferred that U4 might have a dog. Also, in the process of narrowing down the users’ decisions through recommendations, the VAs learned more about the users’ detailed and unique preferences. Furthermore, the VAs (wizards) proactively asked exploratory or follow-up questions to collect explicit information about the users.

4.4.2 Reflecting: Users Becoming Aware that Their Words Are Taken into Account.

As the users engaged in more conversations with the VA, they realized that it responded adaptively. Most users (U2, U3, U4, U5, U6) stated that they noticed the VA remembered their previous comments and incorporated them into subsequent suggestions or recommendations.

For example, U5 was cooking Cacio e Pepe pasta, and W5 asked whether he had made the pasta before. When the user said it was his first time, the wizard proactively guided him through the cooking process by providing a step-by-step recipe and setting a timer. Even while chatting with the VA about other topics, the user was still informed of the next cooking steps in a timely manner. After the meal was ready, W5 followed up by asking how the food turned out. This experience allowed U5 to sense that the VA retained prior information and posed pertinent follow-up questions.

01  W5: Are you cooking by any chance? [User Behavior Verification]
02  U5: Yeah. Just about to start.
03  W5: What are you planning to make? [User Intent Verification]
04  U5: I’m making Cacio e Pepe pasta.
05  W5: Have you made Cacio e Pepe pasta before? [Personal Information Inquiry]
06  U5: Nope, it’s my first time.
07  W5: Would you like me to guide you through a recipe? [Information Provision]
08  U5: Sure, go ahead.
(User continues cooking with the VA’s guidance)
09  W5: Let me know once your water’s boiling. I’ll set a timer for the pasta.
(Waiting for the water to boil, side-chatting with the VA)
10  U5: Water is boiling!
11  W5: Put in the pasta. How long should I set the timer for?
12  U5: 8 minutes and 30 seconds.
13  W5: Just 1 minute left on the timer. After the pasta is cooked, move it to a pan and add enough pasta water to cover it.
14  W5: Your timer is up! (Timer alarm sounds for 1 minute)
(After the dish is ready and U5 starts eating)
15  W5: How is the pasta you just made? [Social Talk]
16  U5: Not bad.

Table 7: Providing a recipe for cooking pasta

While waiting for the water to boil, I started new small talk with the VA, concerned it might have lost the previous conversation. However, it swiftly resumed the next cooking step. As the meal was ready, the VA asked, ‘How is the pasta you just made?’ instead of a simple ‘Is it good?’ question. This made me think the VA still remembered I had been cooking earlier. So, I came to understand that it continuously uses previous information to keep our conversation going. (U5)

In another example, during U2’s winter clothing shopping, W2 initially recommended popular styles for the fall and winter seasons. Taking U2’s feedback into account, W2 flexibly recommended other items. Eventually, U2 liked a recommended piece of clothing and asked about its material. W2 recalled that she did not prefer a specific knit material and said, “VA (W2): Since you don’t like Angora, I suggest looking for pieces mixed with wool and cashmere.” U2 noticed that the VA remembered what she had said and appreciated its ability to bring up even minor details she had mentioned in passing.

Table 7:
01 W2: Are you getting ready for bed? [User Intent Verification]
02 U2: No, I’m shopping on my phone.
03 W2: What are you looking for? I can help you out. [Decision-making Support]
04 U2: Knitwear. For winter.
05 W2: One of the top fashion trends for this fall and winter is cable knitwear. [Recommendation Adjustment]
06 U2: I’m not really into cable knit.
07 W2: I see. How about another trend, Angora knits?
08 U2: Angora sheds too much.
09 W2: Got it. Well, what do you think of a cable-knit short-sleeve sweater? They’re selling well in online stores lately.
10 U2: Thanks. Where should I buy it?
11 W2: Naver Shopping or the Zigzag app are the most popular among women in their 20s.
12 U2: What material should I go for?
13 W2: Since you don’t like Angora, I suggest looking for pieces mixed with wool and cashmere.
14 U2: Thank you.
15 W2: Just a heads up: too much polyester can cause a lot of pilling. [Information Provision]
16 U2: Oh, really? I’ll keep that in mind.

Table 7: Assisting with knitwear shopping

4.4.3 Engaging: As Conversation Continues, Users Provide Increasingly Explicit Feedback.

As the conversation progressed, users realized that the VAs incorporated their conversations into future suggestions, which motivated them to train the VAs further by adding more information to their responses. The users (U2, U3, U4, U6) began to volunteer additional information about their preferences and directly expressed their dislikes, intending for the VAs to avoid making similar suggestions in future interactions.

For instance, to facilitate U2’s reading experience, W2 offered to play music, recalling the user’s fondness for the Korean musician ‘AKMU’ from past conversations. Upon hearing it, U2 went beyond a simple correction and clarified that she didn’t like to listen to Korean songs while reading a book because the Korean lyrics distract her. She specifically explained, “When I’m reading, I prefer pop songs. Korean lyrics are a bit distracting,” indicating that her more detailed response was driven by the expectation that the VA would remember her preferences.

Table 8:
01 W2: Would you like me to play some music while you are reading? [Smart Home Control + Decision-making Support]
02 U2: That’s nice, thanks.
(Korean music by AKMU plays.)
03 U2: When I’m reading, I prefer pop songs. Korean lyrics are a bit distracting.
04 W2: Oh, I see. Do you have specific pop artists you like?
05 U2: Not really.
06 W2: Alright then, I’ll prepare a playlist with chill pop songs for you.
(Music plays.)
07 U2: Can you turn the volume down a bit?
(The volume is turned down.)
08 U2: Switch to classical music.
09 W2: Sure, changing it right away.
(Classical music plays.)

Table 8: Asking to play music while the user reads

As another example, W4 suggested that U4, a horror movie enthusiast, turn off the lights for a more immersive horror movie experience. But U4 did not just decline; she also provided a reason, saying, “No, it would get too scary,” expecting the VA to remember it.

Table 9:
01 W4: Would you like me to switch off the lights to watch the movie? [Smart Home Control]
02 U4: No, it would get too scary.
03 W4: Okay, got it.

Table 9: Asking to turn off the lights for the horror movie


5 DISCUSSIONS

5.1 Why is Communication Important for VAs to Take Proactive Action? Variability in User Acceptance Among and Within Individuals

Existing studies on proactive VAs have focused on identifying general tendencies in user acceptance of proactive service scenarios [34, 40, 54], proactivity levels [24, 38], and opportune moments to interrupt [5, 23, 51]. Some studies have already noted inconsistent user acceptance among individuals, leaving it as an area for future exploration [23, 51]. Our findings echo that user acceptance of VAs’ proactive suggestions differs greatly from person to person (i.e., interpersonally). For example, U4, a fan of horror movies, was watching one. W4 (VA), drawing on W4’s personal experience, suggested turning off the lights for a more immersive viewing experience. But U4 did not accept the offer, as she was too scared to watch in the dark, and chose to keep the lights on. Even a seemingly ideal proactive suggestion in a particular context might not be acceptable to some people. We believe this variability stems from each person’s unique personality, preferences, lifestyle, routines, and more. This underscores that it is essential for VAs to listen and pay attention to the unique voice of each individual, respecting their agency, rather than pursuing general tendencies in user acceptance.

In addition, some earlier studies regarded a user’s declining or ignoring a suggestion as an indication of disliking the scenario [34, 40, 54] or as a sign that it was not a ‘good time to talk,’ deeming the moment inopportune for proactive interactions [5, 23, 51]. However, our findings uncovered that even when users did not accept the VA’s suggestions (i.e., rejected or disregarded them), they still appreciated some proactive suggestions and found them useful. This implies that, regardless of user preference and the perceived usefulness of a VA’s proactive suggestions, acceptance can differ from moment to moment even for the same person (i.e., intrapersonally), because their mood, state, and intention fluctuate constantly. For example, when U4 was in bed looking at her phone and the TV was on the search screen, W4 (VA) suggested turning it off. U4 declined the suggestion but found it beneficial and wanted the VA to offer it again later. As shown in our findings, users who initially turned down or ignored the VAs’ suggestions often expressed a desire for the VAs to provide similar recommendations in the future (refer to Section 4.3.1). This indicates that users’ rejecting or ignoring a VA’s proactive suggestion does not necessarily mean they had a negative experience with it. Mere rejection or disregard from users should therefore not be hastily interpreted as aversion or annoyance. VAs should differentiate acceptance from preference. If VAs are uncertain about users’ responses, they need to ask and communicate with users to understand the true intent underlying their answers.

The user acceptance of VAs’ proactive actions can vary widely between individuals (interpersonally) and even within a single person (intrapersonally). Given the ever-changing standards of acceptability, we highlight the crucial role of communication, which includes directly asking questions and engaging in conversation to discover users’ current and explicit needs, intentions, and preferences, going beyond context-based assumptions. With that in mind, we discuss implications for when and how VAs should communicate to enact proactive actions that adapt to user agency in the following sections.

5.2 When Should Proactive VAs Communicate? Mirage of the Opportune Moment

Previous studies on proactive VAs have sought to identify universally opportune moments for VAs to initiate interactions, assuming that proactive interaction may be disruptive [5, 23]. These studies indicate that VAs should deliver proactive interactions when users are more interruptible, such as during transitions between tasks, after returning from outdoors, while resting, or while using a smartphone, rather than when they are deeply engaged in specific tasks. While our findings partially align with the idea that proactive interactions should occur during behavioral transitions, they distinctly reveal that the VAs (wizards) intervened with proactive suggestions even when users were deeply engrossed in specific tasks (refer to Figure 4). The VAs (wizards) provided proactive suggestions that were relevant to users’ ongoing activities, directly supporting those activities or establishing an environment conducive to concentration. For example, the VAs (wizards) offered step-by-step cooking recipes (W5), adjusted music volume when the user was playing the guitar (W5), helped to narrow down shopping choices (W2), and suggested closing the blinds when the sun set while the user was watching TV (W2) (refer to Section 4.1.2). This contrasts with prior studies suggesting that users should not be interrupted when fully concentrated on tasks. Users in our study, regardless of the extent of interruption, generally perceived the VAs’ proactive, context-based suggestions as supportive. They easily and naturally declined or ignored suggestions that were unwanted at the moment and provided detailed feedback on suggestions they disliked, considering it a way to train the system. We interpret this to mean that users did not feel bothered because the VAs (wizards) communicated without acting dominantly in matters concerning the users’ actions.
Regarding concerns raised in prior research about potential user annoyance or focus disruption due to VA proactivity, our study suggests that these concerns can be alleviated when VAs offer proactive suggestions that are pertinent to users’ activities and secure users’ approval beforehand. Therefore, more emphasis should be placed on what services VAs can assist users with than on identifying when users might not be disrupted. In the following sections, we discuss how communication should unfold for VAs to provide proactive actions that reflect user agency.

5.3 Design Implications: Proactive VAs’ Communication Strategies that Respect User Agency

5.3.1 Ask Questions When Assumptions Are Uncertain About Users’ Needs or Intents.

Our findings showed that the VAs (wizards) straightforwardly inquired about users’ personal information, such as preferences, experiences, and unique characteristics. Additionally, when the VAs (wizards) were uncertain, they verified users’ intent, mood, or behavior. Through these exploratory questions, the VAs (wizards) could effectively align with user agency and provide proactive suggestions that are difficult to derive from contextual information alone (refer to Section 4.1.1). For example, W5 (VA) asked U5, who was clattering a frying pan in the kitchen, whether he was planning to cook and whether he had previous experience cooking the dish, before offering a recipe. There would be no need to suggest a recipe if the user was not cooking or was preparing something he often cooks. Even if context-aware technology reaches a top-notch level, it would still be limited in grasping all of a user’s intentions and unique personality solely from contextual information. Thus, when VAs find users’ needs or intents uncertain, they should ask exploratory questions about preferences, experiences, moods, and behavior to make more tailored proactive suggestions.
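The strategy above can be sketched as a confidence gate: act on a confidently inferred intent, otherwise ask an exploratory question first. Everything in this sketch (the threshold value, the function and field names) is our own illustrative assumption, not part of the study's system:

```python
# Illustrative "ask when uncertain" rule: below a confidence threshold,
# the VA verifies the inferred intent instead of acting on it.
ASK_THRESHOLD = 0.8  # assumed cutoff; a real system would tune this

def decide_next_move(intent: str, confidence: float) -> dict:
    """Return the VA's next move: suggest based on the intent, or verify it."""
    if confidence >= ASK_THRESHOLD:
        return {"move": "suggest", "utterance": f"Would you like help with {intent}?"}
    # Uncertain: ask an exploratory question before suggesting anything.
    return {"move": "verify", "utterance": f"Are you about to start {intent}?"}

# Clattering pans strongly suggests cooking; a faint cue does not.
print(decide_next_move("cooking", 0.9))
print(decide_next_move("cooking", 0.4))
```

A real VA would derive the confidence score from its context-recognition pipeline; the point of the sketch is only that the question, not the assumption, is the default under uncertainty.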

5.3.2 Incorporate What Users Said When Providing Proactive Suggestions.

In our findings, the VAs (wizards) incorporated users’ explicit feedback into their subsequent proactive suggestions; for example, “VA (W5): How is the pasta you just made?”, “VA (W2): Since you don’t like Angora…”, and “VA (W6): Do you want me to continue playing songs by your favorite BOL4?”. Such explicit user information was collected from previous communications, including user commands and answers to the VAs’ exploratory or follow-up questions. As communication with the VAs progressed, users came to understand that what they said was being reflected in the VAs’ responses and future proactive suggestions. This realization greatly motivated the users to engage in training their VAs. Some users began to willingly pinpoint the reasons they disliked certain suggestions from the VAs (refer to Section 4.4.3). We highlight that clear and rich feedback from users becomes valuable information, enabling VAs to learn and progressively adapt to user agency [53]. Therefore, VAs should weave what users have mentioned in previous conversations into their proactive suggestions. This would elicit explicit feedback from users, creating an interaction loop essential for an AI system to reflect user agency [20].
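One minimal way to realize this loop is a small preference memory that filters candidate suggestions and echoes the remembered reason back to the user, the way W2 recalled “you don’t like Angora.” The class and phrasing below are hypothetical, offered only as a sketch:

```python
# Hypothetical preference memory: folds users' explicit statements into
# later proactive suggestions and surfaces the remembered reason.
class PreferenceMemory:
    def __init__(self):
        self.dislikes = set()

    def record_dislike(self, item: str) -> None:
        self.dislikes.add(item.lower())

    def phrase_suggestion(self, candidates: list) -> str:
        # Filter out disliked options and echo the stored feedback, so the
        # user can see their earlier statement being used.
        kept = [c for c in candidates if c.lower() not in self.dislikes]
        avoided = [c for c in candidates if c.lower() in self.dislikes]
        prefix = f"Since you don't like {', '.join(avoided)}, " if avoided else ""
        return prefix + f"I suggest {', '.join(kept)}."

memory = PreferenceMemory()
memory.record_dislike("Angora")
print(memory.phrase_suggestion(["Angora", "wool", "cashmere"]))
```

Echoing the stored feedback in the utterance itself is what closes the loop: users who hear their own words reflected back are, as our findings suggest, more motivated to keep supplying explicit feedback.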

5.3.3 Do Not Hastily Interpret Users’ Simple Refusals or No Response as Dislike.

We found that, despite valuing the VAs’ proactive suggestions, users often easily rejected or ignored them. Users simply did not want the suggestions at that specific moment but expressed hope for similar suggestions to be made again in the future (refer to Section 4.3.1). This indicates that it is challenging to discern users’ genuine desires toward a VA’s suggestions solely from their rejection or non-response. For example, W3 (VA) recommended a just-updated sports highlight video on YouTube to U3, who usually enjoys watching sports, but U3 ignored the suggestion. Following up, W3 (VA) asked U3 what he thought of the recommendation, and U3 replied that he intended to watch it later during dinner. Despite U3’s initial disregard, based on his later explanation, W3 interpreted this as the user still having an interest in the latest sports highlights and decided to continue sharing them. VAs should therefore refrain from making hasty judgments when users simply reject or ignore suggestions and instead ask follow-up questions to elicit users’ explicit feedback. As mentioned in the prior section, once users begin to engage with VAs by providing explicit feedback on unwanted or disliked proactive suggestions (exercising user agency), simple rejections can be read as temporary disinterest. In such instances, VAs should continue to offer similar suggestions in relevant situations.
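This implication amounts to a simple state rule: only explicit dislike retires a suggestion, while a plain refusal or silence triggers a follow-up question and keeps the suggestion eligible. The states and response labels below are our own illustration, not from the study's protocol:

```python
# Illustrative state rule for Section 5.3.3: rejection alone is treated as
# temporary disinterest, not dislike.
def update_suggestion_state(state: str, user_response: str) -> str:
    if user_response == "explicit_dislike":
        return "retired"            # stop offering this suggestion
    if user_response in ("reject", "ignore"):
        return "ask_follow_up"      # probe the intent behind the refusal
    if user_response == "accept":
        return "reinforced"
    return state                    # no signal: keep the current state

print(update_suggestion_state("offered", "ignore"))           # follow up
print(update_suggestion_state("offered", "explicit_dislike")) # retire
```

In U3’s case, the `ignore` response would route to a follow-up question, and his answer (“I’ll watch it at dinner”) would keep the sports-highlight suggestions active rather than retiring them.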

5.3.4 Seek Permission from Users for Control-Related Tasks, Even in Seemingly Obvious Situations.

Even though the VAs (wizards) could control smart devices instantly, they never attempted to operate devices such as lighting, blinds, the TV, music, the robot vacuum, or food delivery autonomously. They consistently sought users’ approval before taking any control action (refer to Section 4.2.1). Similarly, although users found instantly delivered voice-based information and recommendations useful, they did not want VAs to autonomously perform tasks related to device control. Even in seemingly obvious situations, users preferred to be asked about device operation every time, to avoid having to reverse any unintended actions. These findings align with earlier studies in which users mostly favored a medium level of proactivity, where VAs make assumptions and verify them with users [24, 31, 38]. Therefore, in control-related tasks, VAs need to adopt a medium proactivity level that secures users’ permission before taking action.
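The medium proactivity level can be sketched as a permission gate that distinguishes voice-only actions from device control. The action names and categories here are illustrative assumptions:

```python
# Illustrative permission gate: voice-only actions may run immediately;
# any device control first asks the user, even in obvious situations.
CONTROL_ACTIONS = {"lights_off", "close_blinds", "start_vacuum"}

def execute(action: str, user_grants_permission=None) -> str:
    if action not in CONTROL_ACTIONS:
        return f"done:{action}"                 # e.g., speak a reminder
    # Control-related: verify with the user before acting.
    if user_grants_permission is None:
        return f"ask:Shall I {action.replace('_', ' ')}?"
    return f"done:{action}" if user_grants_permission else "skipped"

print(execute("speak_reminder"))                          # runs directly
print(execute("lights_off"))                              # asks first
print(execute("lights_off", user_grants_permission=True)) # runs once granted
```

The design point is that the gate never times out into autonomous action: with no answer, the control request simply stays pending, so there is nothing for the user to reverse.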

5.3.5 Keep Supporting Users’ Decision-making Until Users Explicitly Say ‘Stop.’

In assisting with user decisions on what to watch or buy, it was hard for the VAs (wizards) to precisely meet users’ expectations at once. In most cases in our findings, a series of multi-turn communications ensued, with recommendations adjusted based on users’ responses (refer to Section 4.3.1). During this process, the wizards felt significant pressure from their inability to provide the right recommendation immediately and the continuous need for adjustments. However, the users found the process meaningful, as it helped them narrow down their decisions by rejecting choices they did not want and refining their ideal selection until they were satisfied. They also expected this process to accumulate more user information, enabling VAs to make more personalized suggestions. Consequently, in the decision-making support process, VAs should persistently adjust recommendations based on users’ responses, even if they fail to offer a satisfying recommendation immediately, until users clearly say ‘stop.’
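The multi-turn adjustment process reduces to a loop with exactly two exits: acceptance or an explicit ‘stop.’ The candidate list and scripted responses below are made up for the sketch:

```python
# Illustrative decision-support loop: each rejection triggers an adjusted
# candidate; only 'accept' or an explicit 'stop' ends the loop.
def recommend_until_stop(candidates, user_responses):
    """Offer candidates one by one; halt on 'accept' or an explicit 'stop'."""
    offered = []
    for item, response in zip(candidates, user_responses):
        offered.append(item)
        if response in ("accept", "stop"):
            break
        # 'reject': adjust and continue with the next candidate
    return offered

# Resembling U2's knitwear search: two rejections, then an acceptance.
print(recommend_until_stop(
    ["cable knit", "Angora knit", "wool-cashmere blend", "fleece"],
    ["reject", "reject", "accept", "reject"],
))
```

Notably, the rejected items are not wasted turns: each one both narrows the user’s decision and adds preference information (here, two recorded dislikes) for later suggestions.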

5.3.6 Moderate the Level of Social Talk by Considering the Disposition of Social Chatters and Users’ Current State.

Users have mainly prioritized transactional interactions over social ones and expressed no desire to build relationships with VAs [7]. However, in our findings, when VAs proactively engaged in social talk such as “How was your day?” or “Did you enjoy your meal?”, users enjoyed sharing their day, even with a machine. They felt at ease confiding their concerns or struggles to VAs, unlike with family members, who might worry about them (refer to Section 4.3.3). Lucas et al. [30] similarly revealed that participants were more open to sharing information when they believed they were being interviewed by a computer rather than by a human. Kim et al. [22] emphasize that VAs should play two roles, being both a helpful assistant and an enjoyable social partner. Building upon this, we propose that VAs should calibrate their role as a heartwarming medium across a diverse spectrum of users: those who benefit from emotional support and those who are indifferent to it. For users who do not particularly enjoy social talk with VAs, such conversation should be minimized. Conversely, for users who feel psychologically comfortable in social conversations, VAs should foster social relationships by chatting about the personal, everyday stories in their lives. Furthermore, U4, who preferred the VA’s social talk, described that she might not want social conversations on days when she is tired or busy. This reflects that, even for users who generally favor chatting with VAs, the VAs should take users’ mental and physical states, such as sentiment and fatigue level, into account to adjust the level of social interaction accordingly.

5.3.7 Encourage Users’ Well-being in a Laid-back Manner with Varying Expression Each Time.

In Zargham’s study [54] exploring scenarios of VAs’ proactive services, the ‘Nudging Scenario’ was perceived primarily negatively: concerns were raised that unsolicited advice could be annoying and give the impression of a judgmental agent. However, our empirical findings indicated that users took well-being advice positively, especially when it was relevant to their current activities, such as reminders about extended mobile phone use or doing the dishes after meals, safety warnings about prolonged use of fire, and suggestions of mild exercise after eating (refer to Section 4.3.2). Users felt less burdened since the advice came from a system and did not feel like a mother’s nagging. They found it encouraging to hear advice directly from VAs, even on matters they were already aware of. Nevertheless, our findings showed that when VAs serve as life coaches giving advice on daily well-being, users do not favor a mechanical and repetitive style. VAs should avoid mechanically repeating the exact same phrases and should vary their expressions with each interaction. Additionally, the advice can vary depending on the user’s routine: VAs could foster good habits mainly on weekdays, when users follow a daily pattern, and ease off on weekends to promote relaxation. In terms of tone and manner, it is important for VAs to maintain a laid-back style, allowing enough time without any rush or pressure.
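The two concrete rules in this implication, vary the phrasing each time and ease off on weekends, can be sketched in a few lines. The phrasings and the weekday rule below are our own illustrative assumptions, not study output:

```python
# Illustrative well-being nudger: cycles through paraphrases so the exact
# same phrase is never repeated back-to-back, and stays quiet on weekends.
import itertools

PHRASINGS = [
    "You've been on your phone for a while. Maybe a short break?",
    "How about resting your eyes for a minute?",
    "A quick stretch might feel good right now.",
]

def make_nudger(weekday: bool):
    cycle = itertools.cycle(PHRASINGS)   # vary expression each time
    def nudge():
        if not weekday:
            return None                  # ease off on weekends
        return next(cycle)
    return nudge

nudge = make_nudger(weekday=True)
print(nudge())
print(nudge())  # a different phrasing from the first call
```

A fuller system might also condition on the user’s sentiment or fatigue, per Section 5.3.6, but cycling paraphrases alone already removes the “mechanical and repetitive” quality users disliked.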

Overall, we discussed our findings in light of previous studies, arguing that conventional efforts to identify general trends in user acceptance of VA proactivity and its opportune moments may prove elusive due to the inherent diversity among and within individuals. These differences should not be seen as obstacles to providing proactive action that aligns with user agency; rather, they can be navigated through communication. Our design implications suggest communication strategies for VAs to reflect user agency in their proactive actions, paving the way for further investigation in HCI research. More practically, these implications can be used to refine prompts for proactive VAs.
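As one example of such prompt refinement, the implications could be folded into a system prompt for an LLM-driven proactive VA. The prompt below is entirely our own hypothetical wording, offered only as a sketch of how the strategies might be operationalized:

```python
# Hypothetical system prompt encoding the communication strategies above.
SYSTEM_PROMPT = """You are a proactive smart-home voice assistant.
- If you are uncertain about the user's intent or needs, ask a short
  exploratory question before suggesting anything.
- Incorporate what the user has said in earlier turns into your suggestions.
- Do not treat a simple refusal or silence as dislike; ask a follow-up
  question and keep the suggestion eligible for later.
- Always ask permission before controlling a device, even when the
  situation seems obvious.
- Keep adjusting recommendations until the user explicitly says 'stop.'
- Match the amount of social talk to the user's disposition and state.
- Phrase well-being nudges in a laid-back way, varied each time."""

print(SYSTEM_PROMPT.splitlines()[0])
```

How faithfully a given model follows such instructions is an empirical question; the sketch only shows that each implication translates directly into one behavioral directive.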


6 LIMITATIONS AND FUTURE WORKS

This section points out a few limitations of our study and suggests future research that could build upon our findings. First, our study explored the implications of proactive VAs in single-person households. One participant (U3) noted that the VAs’ proactivity might need to be adjusted based on the presence of others, stating, “It’s great when I’m alone, but when I am with others, I’d prefer it only to respond to my request or reduce its proactive suggestions.” Previous studies have also indicated that expectations for VAs’ proactivity may differ according to one’s household composition [31, 37]. Further studies may delve into how proactive VAs should adapt in various multi-person households, such as families with parents and children, or couples.

Second, our study consisted of an intensive 2.5-hour observation session that paid particular attention to the first encounter and initial experience. We limited the scope to the initial experience to maintain the wizards’ high concentration level and to conduct a detailed debriefing interview for every interaction. However, our study, confined to initial adoption, may not fully capture how the user experience changes with long-term usage. Users gradually integrate new technology into their spaces and lives, go through a process of trial and error, and ultimately decide whether to reaffirm their initial adoption or discontinue use [9, 19, 35]. In addition, users tend to be highly accepting in the onboarding phase due to the novelty effect [50]. Still, based on the instances in our study where users rejected or ignored the VAs’ proactive suggestions, we believe the users reflected their genuine experience in this experiment. For these reasons, a future study spanning a couple of weeks or longer would be necessary to gain a richer understanding of how users’ long-term experience with proactive VAs changes as they transition to the adoption and integration phases.

Third, we intentionally did not account for potential voice interaction errors that may arise in a VA system, choosing instead to focus on verbal communication. Despite substantial progress in text-based chatbots’ natural language processing, voice interfaces can still produce recognition errors due to human disfluency, including context-dependent omissions, verbosity, and self-corrections [27]. Commercialized VAs also cannot yet fully support “conversational interaction” [39]. Our study instead presumes the continued progression of voice interaction technology, capable of proficiently processing users’ voice input and enabling adaptive multi-turn communication. Therefore, our findings should be interpreted in light of prospective technological advancements.


7 CONCLUSION

Our study aimed to explore how VAs can proactively take action through verbal communication while respecting user agency. To this end, we utilized a modified Wizard of Oz method to investigate dyadic communication between a proactive VA, simulated by the wizard participants, and the user participants who stayed in a smart home setting. This approach allowed us to create a study environment where VAs demonstrated human-level abilities in understanding context and user speech, letting us explore how VAs can offer proactive actions aligned with user agency through rich communication. Based on the communication logs and interview data, we presented the VAs’ proactive communication types, proactivity levels for smart controls, and communication timing. Furthermore, we examined the underlying user perceptions and reactions, and how user engagement progressed over time in relation to the VAs’ proactive actions and communications. One of our main findings is that users became more motivated to train the VAs by providing explicit feedback once they realized that the VAs were incorporating their previous comments. We found this to be significant for exercising user agency through communication. Based on these findings, we elaborated on implications for VAs’ communication strategies that respect user agency. We hope our research inspires interaction designers and HCI researchers to create VAs that proactively communicate with users, considering user agency, to provide truly user-centric proactive services.


ACKNOWLEDGMENTS

This work was supported by Samsung Electronics Company, which provided financial support and opportunities. The authors express their gratitude to Jaeyoung Lee, Hyunjin Kim, Giyong Choi, and Soyoung Min for their invaluable assistance and constructive feedback, as well as to Soan Jung for English language editing in the preparation of this manuscript.

Supplemental Material

Video Presentation (mp4, 152.5 MB)

References

1. Seyed Ali Bahrainian and Fabio Crestani. 2017. Towards the Next Generation of Personal Assistants: Systems That Know When You Forget. In Proceedings of the ACM SIGIR International Conference on Theory of Information Retrieval (Amsterdam, The Netherlands) (ICTIR ’17). Association for Computing Machinery, New York, NY, USA, 169–176. https://doi.org/10.1145/3121050.3121071
2. Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2 (2006), 77–101. https://doi.org/10.1191/1478088706qp063oa
3. Ian Carlos Campbell. 2021. Alexa’s ‘Tell Me When’ skill combines reminders with contextual information. https://www.theverge.com/2021/1/9/22221443/amazon-alexa-new-reminder-skill-tell-me-when. Accessed: 12 December 2023.
4. Amedeo Cesta, Gabriella Cortellessa, Vittoria Giuliani, Federico Pecora, Riccardo Rasconi, Massimiliano Scopelliti, and Lorenza Tiberio. 2007. Proactive Assistive Technology: An Empirical Study. In Human-Computer Interaction – INTERACT 2007. Springer Berlin Heidelberg, Berlin, Heidelberg, 255–268.
5. Narae Cha, Auk Kim, Cheul Young Park, Soowon Kang, Mingyu Park, Jae-Gil Lee, Sangsu Lee, and Uichin Lee. 2020. Hello There! Is Now a Good Time to Talk? Opportune Moments for Proactive Interactions with Smart Speakers. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 4, 3, Article 74 (Sep 2020), 28 pages. https://doi.org/10.1145/3411810
6. Pedro Chahuara, François Portet, and Michel Vacher. 2017. Context-aware decision making under uncertainty for voice-based control of smart home. Expert Systems with Applications 75 (2017), 63–79. https://doi.org/10.1016/j.eswa.2017.01.014
7. Leigh Clark, Philip Doyle, Diego Garaialde, Emer Gilmartin, Stephan Schlögl, Jens Edlund, Matthew Aylett, João Cabral, Cosmin Munteanu, Justin Edwards, and Benjamin R Cowan. 2019. The State of Speech in HCI: Trends, Themes and Challenges. Interacting with Computers 31, 4 (2019), 349–371. https://doi.org/10.1093/iwc/iwz016
8. David Coyle, James Moore, Per Ola Kristensson, Paul Fletcher, and Alan Blackwell. 2012. I Did That! Measuring Users’ Experience of Agency in Their Own Actions. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Austin, Texas, USA) (CHI ’12). Association for Computing Machinery, New York, NY, USA, 2025–2034. https://doi.org/10.1145/2207676.2208350
9. Maartje MA de Graaf, Somaya Ben Allouch, and Jan AGM van Dijk. 2018. A phased framework for long-term user acceptance of interactive technology in domestic environments. New Media & Society 20, 7 (2018), 2582–2603. https://doi.org/10.1177/1461444817727264. PMID: 30581364.
10. Audrey Desjardins, Jeremy E. Viny, Cayla Key, and Nouela Johnston. 2019. Alternative Avenues for IoT: Designing with Non-Stereotypical Homes. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland, UK) (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3290605.3300581
11. Anind K. Dey. 2001. Understanding and Using Context. Personal Ubiquitous Comput. 5, 1 (Jan 2001), 4–7. https://doi.org/10.1007/s007790170019
12. Jens Edlund, Joakim Gustafson, Mattias Heldner, and Anna Hjalmarsson. 2008. Towards human-like spoken dialogue systems. Speech Communication 50, 8 (2008), 630–645. https://doi.org/10.1016/j.specom.2008.04.002
13. Radhika Garg and Hua Cui. 2022. Social Contexts, Agency, and Conflicts: Exploring Critical Aspects of Design for Future Smart Home Technologies. ACM Trans. Comput.-Hum. Interact. 29, 2, Article 11 (Jan 2022), 30 pages. https://doi.org/10.1145/3485058
14. Google Cloud. 2023. Cloud Text-to-Speech API. https://cloud.google.com/text-to-speech/docs/reference/rpc. Accessed: 12 December 2023.
15. Scott Hudson, James Fogarty, Christopher Atkeson, Daniel Avrahami, Jodi Forlizzi, Sara Kiesler, Johnny Lee, and Jie Yang. 2003. Predicting Human Interruptibility with Sensors: A Wizard of Oz Feasibility Study. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Ft. Lauderdale, Florida, USA) (CHI ’03). Association for Computing Machinery, New York, NY, USA, 257–264. https://doi.org/10.1145/642611.642657
16. Shamsi T. Iqbal and Brian P. Bailey. 2005. Investigating the Effectiveness of Mental Workload as a Predictor of Opportune Moments for Interruption. In CHI ’05 Extended Abstracts on Human Factors in Computing Systems (Portland, OR, USA) (CHI EA ’05). Association for Computing Machinery, New York, NY, USA, 1489–1492. https://doi.org/10.1145/1056808.1056948
17. Haiyan Jia, Mu Wu, Eunhwa Jung, Alice Shapiro, and S. Shyam Sundar. 2012. Balancing Human Agency and Object Agency: An End-User Interview Study of the Internet of Things. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing (Pittsburgh, Pennsylvania) (UbiComp ’12). Association for Computing Machinery, New York, NY, USA, 1185–1188. https://doi.org/10.1145/2370216.2370470
18. Yucheng Jin, Nyi Nyi Htun, Nava Tintarev, and Katrien Verbert. 2019. ContextPlay: Evaluating User Control for Context-Aware Music Recommendation. In Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization (Larnaca, Cyprus) (UMAP ’19). Association for Computing Machinery, New York, NY, USA, 294–302. https://doi.org/10.1145/3320435.3320445
19. Evangelos Karapanos, John Zimmerman, Jodi Forlizzi, and Jean-Bernard Martens. 2009. User Experience over Time: An Initial Framework. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Boston, MA, USA) (CHI ’09). Association for Computing Machinery, New York, NY, USA, 729–738. https://doi.org/10.1145/1518701.1518814
20. Hankyung Kim and Youn-kyung Lim. 2021. Teaching-Learning Interaction: A New Concept for Interaction Design to Support Reflective User Agency in Intelligent Systems. In Proceedings of the 2021 ACM Designing Interactive Systems Conference (Virtual Event, USA) (DIS ’21). Association for Computing Machinery, New York, NY, USA, 1544–1553. https://doi.org/10.1145/3461778.3462141
21. Keunwoo Kim, Minjung Park, and Youn-kyung Lim. 2021. Guiding Preferred Driving Style Using Voice in Autonomous Vehicles: An On-Road Wizard-of-Oz Study. In Proceedings of the 2021 ACM Designing Interactive Systems Conference (Virtual Event, USA) (DIS ’21). Association for Computing Machinery, New York, NY, USA, 352–364. https://doi.org/10.1145/3461778.3462056
22. Yelim Kim, Mohi Reza, Joanna McGrenere, and Dongwook Yoon. 2021. Designers Characterize Naturalness in Voice User Interfaces: Their Goals, Practices, and Challenges. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 242, 13 pages. https://doi.org/10.1145/3411764.3445579
23. Mitsuki Komori, Yuichiro Fujimoto, Jianfeng Xu, Kazuyuki Tasaka, Hiromasa Yanagihara, and Kinya Fujita. 2019. Experimental Study on Estimation of Opportune Moments for Proactive Voice Information Service Based on Activity Transition for People Living Alone. In Human-Computer Interaction. Perspectives on Design. Springer International Publishing, Cham, 527–539.
24. Matthias Kraus, Marvin Schiller, Gregor Behnke, Pascal Bercher, Michael Dorna, Michael Dambier, Birte Glimm, Susanne Biundo, and Wolfgang Minker. 2020. "Was That Successful?" On Integrating Proactive Meta-Dialogue in a DIY-Assistant Using Multimodal Cues. In Proceedings of the 2020 International Conference on Multimodal Interaction (Virtual Event, Netherlands) (ICMI ’20). Association for Computing Machinery, New York, NY, USA, 585–594. https://doi.org/10.1145/3382507.3418818
25. Sang-Su Lee, Jeonghun Chae, Hyunjeong Kim, Youn-kyung Lim, and Kun-pyo Lee. 2013. Towards More Natural Digital Content Manipulation via User Freehand Gestural Interaction in a Living Room. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing (Zurich, Switzerland) (UbiComp ’13). Association for Computing Machinery, New York, NY, USA, 617–626. https://doi.org/10.1145/2493432.2493480
26. Sang-su Lee, Jaemyung Lee, and Kun-pyo Lee. 2017. Designing Intelligent Assistant through User Participations. In Proceedings of the 2017 Conference on Designing Interactive Systems (Edinburgh, United Kingdom) (DIS ’17). Association for Computing Machinery, New York, NY, USA, 173–177. https://doi.org/10.1145/3064663.3064733
  27. Yaniv Leviathan. 2018. Google Duplex: An AI System for Accomplishing Real-World Tasks Over the Phone. https://blog.research.google/2018/05/duplex-ai-system-for-natural-conversation.html. Accessed: 12 December 2023.Google ScholarGoogle Scholar
  28. Jiayu Li, Zhiyu He, Yumeng Cui, Chenyang Wang, Chong Chen, Chun Yu, Min Zhang, Yiqun Liu, and Shaoping Ma. 2022. Towards Ubiquitous Personalized Music Recommendation with Smart Bracelets. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 6, 3, Article 125 (sep 2022), 34 pages. https://doi.org/10.1145/3550333Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Hannah Limerick, David Coyle, and James W. Moore. 2014. The experience of agency in human-computer interactions: a review. Frontiers in Human Neuroscience 8 (2014). https://doi.org/10.3389/fnhum.2014.00643Google ScholarGoogle ScholarCross RefCross Ref
  30. Gale M. Lucas, Jonathan Gratch, Aisha King, and Louis-Philippe Morency. 2014. It’s only a computer: Virtual humans increase willingness to disclose. Computers in Human Behavior 37 (2014), 94–100. https://doi.org/10.1016/j.chb.2014.04.043Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Michal Luria, Rebecca Zheng, Bennett Huffman, Shuangni Huang, John Zimmerman, and Jodi Forlizzi. 2020. Social Boundaries for Personal Agents in the Interpersonal Space of the Home. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3313831.3376311Google ScholarGoogle ScholarDigital LibraryDigital Library
32. Yumeng Ma and Jiahao Ren. 2023. ProactiveAgent: Personalized Context-Aware Reminder System (UIST ’23 Adjunct). Association for Computing Machinery, New York, NY, USA, Article 115, 3 pages. https://doi.org/10.1145/3586182.3625115
33. Nikolas Martelaro, Sarah Mennicken, Jennifer Thom, Henriette Cramer, and Wendy Ju. 2020. Using Remote Controlled Speech Agents to Explore Music Experience in Context. In Proceedings of the 2020 ACM Designing Interactive Systems Conference (Eindhoven, Netherlands) (DIS ’20). Association for Computing Machinery, New York, NY, USA, 2065–2076. https://doi.org/10.1145/3357236.3395440
34. Christian Meurisch, Cristina A. Mihale-Wilson, Adrian Hawlitschek, Florian Giger, Florian Müller, Oliver Hinz, and Max Mühlhäuser. 2020. Exploring User Expectations of Proactive AI Systems. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 4, 4, Article 146 (dec 2020), 22 pages. https://doi.org/10.1145/3432193
35. William T. Odom, Abigail J. Sellen, Richard Banks, David S. Kirk, Tim Regan, Mark Selby, Jodi L. Forlizzi, and John Zimmerman. 2014. Designing for Slowness, Anticipation and Re-Visitation: A Long Term Field Study of the Photobox. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Toronto, Ontario, Canada) (CHI ’14). Association for Computing Machinery, New York, NY, USA, 1961–1970. https://doi.org/10.1145/2556288.2557178
36. Kim Olivia Snooks, Joseph Lindley, Daniel Richards, and Roger Whitham. 2021. Context-Aware Wearables: The Last Thing We Need is a Pandemic of Stray Cats. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI EA ’21). Association for Computing Machinery, New York, NY, USA, Article 25, 9 pages. https://doi.org/10.1145/3411763.3450367
37. Sunjeong Park and Youn-kyung Lim. 2020. Investigating User Expectations on the Roles of Family-Shared AI Speakers. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3313831.3376450
38. Zhenhui Peng, Yunhwan Kwon, Jiaan Lu, Ziming Wu, and Xiaojuan Ma. 2019. Design and Evaluation of Service Robot’s Proactivity in Decision-Making Support Process. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland, UK) (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3290605.3300328
39. Martin Porcheron, Joel E. Fischer, Stuart Reeves, and Sarah Sharples. 2018. Voice Interfaces in Everyday Life. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal, QC, Canada) (CHI ’18). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/3173574.3174214
40. Leon Reicherts, Nima Zargham, Michael Bonfert, Yvonne Rogers, and Rainer Malaka. 2021. May I interrupt? Diverging opinions on proactive smart speakers. In Proceedings of the 3rd Conference on Conversational User Interfaces (Bilbao (online), Spain) (CUI ’21). Association for Computing Machinery, New York, NY, USA, Article 34, 10 pages. https://doi.org/10.1145/3469595.3469629
41. Yvonne Rogers. 2006. Moving on from Weiser’s Vision of Calm Computing: Engaging UbiComp Experiences. In UbiComp 2006: Ubiquitous Computing. Springer Berlin Heidelberg, Berlin, Heidelberg, 404–421.
42. Samsung Newsroom. 2022. Samsung Showcases Evolution of SmartThings and Introduces New Device Experiences at SDC22. https://news.samsung.com/global/samsung-showcases-evolution-of-smartthings-and-introduces-new-device-experiences-at-sdc22. Accessed: 12 December 2023.
43. Thomas B. Sheridan, William L. Verplank, and T. L. Brooks. 1978. Human/Computer Control of Undersea Teleoperators. In Proceedings of the 14th Annual Conference on Manual Control. NASA Ames Research Center.
44. Yu Sun, Nicholas Jing Yuan, Yingzi Wang, Xing Xie, Kieran McDonald, and Rui Zhang. 2016. Contextual Intent Tracking for Personal Assistants. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). Association for Computing Machinery, New York, NY, USA, 273–282. https://doi.org/10.1145/2939672.2939676
45. S. Shyam Sundar. 2008. Self as source: Agency and customization in interactive media. Routledge Taylor & Francis Group, 58–74. https://doi.org/10.4324/9780203926864
46. Yoshinao Takemae, Shuichi Chaki, Takehiko Ohno, Ikuo Yoda, and Shinji Ozawa. 2007. Analysis of Human Interruptibility in the Home Environment. In CHI ’07 Extended Abstracts on Human Factors in Computing Systems (San Jose, CA, USA) (CHI EA ’07). Association for Computing Machinery, New York, NY, USA, 2681–2686. https://doi.org/10.1145/1240866.1241062
47. Hao Tan and Min Zhu. 2019. Scenario-Based User Experience Differences of Human-Device Interaction at Different Levels of Proactivity. In Cross-Cultural Design. Methods, Tools and User Experience, Pei-Luen Patrick Rau (Ed.). Springer International Publishing, Cham, 280–290.
48. Tianzhi He, Farrokh Jazizadeh, and Laura Arpan. 2022. AI-powered virtual assistants nudging occupants for energy saving: proactive smart speakers for HVAC control. Building Research & Information 50, 4 (2022), 394–409. https://doi.org/10.1080/09613218.2021.2012119
49. Jennifer Pattison Tuohy. 2022. How to set up Google Home Household Routines. https://www.theverge.com/23428772/google-home-household-routines-how-to-set-up. Accessed: 12 December 2023.
50. Martijn H. Vastenburg, David V. Keyson, and Huib De Ridder. 2008. Considerate home notification systems: a field study of acceptability of notifications in the home. Personal and Ubiquitous Computing 12 (2008), 555–566.
51. Jing Wei, Tilman Dingler, and Vassilis Kostakos. 2022. Understanding User Perceptions of Proactive Smart Speakers. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 5, 4, Article 185 (dec 2022), 28 pages. https://doi.org/10.1145/3494965
52. Mark Weiser and John Seely Brown. 1997. The Coming Age of Calm Technology. Copernicus, USA, 75–85.
53. Ziang Xiao, Sarah Mennicken, Bernd Huber, Adam Shonkoff, and Jennifer Thom. 2021. Let Me Ask You This: How Can a Voice Assistant Elicit Explicit User Feedback? Proc. ACM Hum.-Comput. Interact. 5, CSCW2, Article 388 (oct 2021), 24 pages. https://doi.org/10.1145/3479532
54. Nima Zargham, Leon Reicherts, Michael Bonfert, Sarah Theres Voelkel, Johannes Schoening, Rainer Malaka, and Yvonne Rogers. 2022. Understanding Circumstances for Desirable Proactive Behaviour of Voice Assistants: The Proactivity Dilemma. In Proceedings of the 4th Conference on Conversational User Interfaces (Glasgow, United Kingdom) (CUI ’22). Association for Computing Machinery, New York, NY, USA, Article 3, 14 pages. https://doi.org/10.1145/3543829.3543834