
SimUser: Generating Usability Feedback by Simulating Various Users Interacting with Mobile Applications

Published: 11 May 2024

Abstract

The conflict between the rapid iteration demand of prototyping and the time-consuming nature of user tests has led researchers to adopt AI methods to identify usability issues. However, these AI-driven methods concentrate on evaluating the feasibility of a system, while often overlooking the influence of specified user characteristics and usage contexts. Our work proposes a tool named SimUser, based on large language models (LLMs) with a Chain-of-Thought structure and a user modeling method. It generates usability feedback by simulating the interaction between users and applications, which is influenced by user characteristics and contextual factors. An empirical study (48 human users and 21 designers) validated that, in the context of a simple smartwatch interface, SimUser could generate heuristic usability feedback with similarity varying from 35.7% to 100% depending on the user group and usability category. Our work provides insights into simulating users with LLMs to improve future design activities.


1 INTRODUCTION

AI-driven user simulation has been widely applied to generate usability feedback to support mobile application prototyping [42, 63]. In the prototyping stage, designers propose multiple alternative solutions in a short period, calling for rapid iteration, which cannot be supported by user tests that typically last several days or even months. Therefore, there has been increasing research trying to approximate how users will perceive and interact with the interface. These methods measure visual saliency [28], tappability [76], possible inputs [47], etc. However, as indicated by ISO 9241-11 (Ergonomics of human-system interaction — Part 11: Usability: Definitions and concepts) [5, 6, 27], usability not only refers to the extent to which a system, product or service can be used to achieve specified goals with effectiveness, efficiency and satisfaction, but also highlights the influence of specified users and a specified context of use. In other words, usability tests should target users with similar characteristics in terms of demographics, capabilities, or experience, coupled with definitive usage scenarios, including environment, tasks, etc. This increases the utility and credibility of usability tests.

The limited performance of AI-driven simulation of specified users and contexts stems from the limitations of data sets and the complex, dynamic nature of usability. Existing methods predict interface usability issues by learning from vast amounts of general user data [16, 29, 79], but do not deduce the underlying reasons from the perspective of specified user groups. Moreover, usability feedback is constructed during the interaction process between users and applications under specified contexts [55]. As a result, it is quite challenging for machine learning models to model the usability feedback of specified users.

Large language models (LLMs) such as GPT-4 [69] and LLaMA [85] have shown potential in general user simulation and experience imagination. LLMs can display different linguistic styles and habits of various human groups in conversations, and they also possess the capability to infer users’ emotional states and thoughts through in-context learning [19, 82]. However, designers still doubt the efficiency of LLMs. In a pilot study, an LLM identified only 24% of the static usability issues and had little chance of discovering problems related to the interaction between interfaces [64]. This stems from an inadequate design and knowledge base for simulating usage experiences, leading LLMs to generate common but superficial suggestions. Therefore, the application of LLM-generated usability feedback during prototyping is still far from mature.

This article aims to create an LLM-based tool called SimUser to rapidly generate heuristic usability feedback across diverse user groups. It infers reasonable usability insights by simulating the interaction between users and mobile applications under a specified context, hoping to support designers in prototyping. Specifically, our study is guided by the following research questions:

RQ1: What are the expectations and concerns of designers regarding LLM-generated usability feedback?

RQ2: How can usability feedback be generated from the perspectives of specified user groups in context?

RQ3: How effective and heuristic is the LLM-generated usability feedback?

We initially conducted a formative study with 10 UX practitioners to collect their expectations and concerns about the LLM-based tool. In response to the needs of designers, we proposed SimUser to generate usability feedback by simulating diverse groups of users interacting with mobile applications. An empirical study was conducted on a set of simple smartwatch interfaces. We verified SimUser’s performance across five usability categories and its reflection of different user characteristics and contextual factors by comparing it with human users. Design practitioners considered the tool easy to use and found the results inspiring for identifying usability opportunities.

Our work does not replace the role of human usability tests but rather serves as a heuristic tool during the prototyping stage. In summary, our work contributes in the following ways:

We summarize the challenges design practitioners face in design practice, and distill their expectations and concerns about using LLMs to generate usability feedback;

We create a tool to generate usability feedback toward mobile application prototypes, which is influenced by characteristics of user groups and contextual factors;

We examine the feasibility of simulating users by LLM and offer an opportunity to extend the application of user simulation in the design field.


2 RELATED WORK

2.1 Usability Feedback in Prototyping

In the prototyping stage, designers lean on the internal team or experts to acquire usability feedback, using informal techniques such as cognitive walkthroughs [44] and heuristic evaluations [65]. These methods examine the interface and judge its compliance with recognized usability principles [62]. These approaches necessitate the involvement of experienced experts; otherwise, the results may be inadequate and biased [65].

Recent research usually obtains effective usability feedback during the user flow, which is a set of interactions describing the typical steps needed to accomplish tasks [2, 32]. However, such work focuses on collecting users’ feedback toward current interfaces, yet frequently overlooks the potential impact of users’ past experiences and expectations on usability feedback [96]. Researchers point to expectation disconfirmation: whether people’s expectations are satisfied or not influences their feedback on usability and user experience [58, 68, 89]. Besides, it is important to consider specified user characteristics by creating user personas, which need to be described in detail and supplemented with factors relevant to the test application.

Using LLMs to simulate the above process involves two main challenges. One is to empathize with specified users and portray their characteristics. Another is to simulate how human users perceive and interact with the mobile application in the user flow. Next, we will discuss existing research that provides us with insights to address these two challenges.

2.2 Act Like a Certain User Using LLM

LLMs offer a new approach for inferring and mimicking human characteristics from textual descriptions [37]. Some studies have taught LLMs to act as a specified role with prompts of the form "You are an expert of...", "Act as a...", etc. [74, 92]. However, with such prompts LLMs can only represent a general human group. Recent studies also warn of a potential danger of systemic discrimination produced in the training process of LLMs. To reduce toxicity in persona generation [17], more user data from online analytics and web technologies, which liberate static personas via interactive user interfaces, is utilized to improve LLMs’ ability of in-context learning [9]. Notably, opinions from stakeholders and the most relevant opinions from individual users enable LLMs to simulate more accurately [13, 25, 56, 95]. However, for usability assessment, being merely "human-like" is insufficient. LLMs can mimic the language style of specified individuals, but simulating human cognition and behaviors remains an area for further exploration. The simulation needs to be extended to how humans connect with their task and surrounding environment [94].

Current user modeling research provides methods to assist LLM in representing user characteristics during simulation. Nolte et al. proposed a modification of personas for hearing-impaired users by defining them in terms of ability ranges [66]. Casas et al. detailed user personas into user levels toward the system usage, interaction methods, and acceptance of audio and visual displays [10] to parameterize users and allow the system to better respond to user characteristics. Inspired by such methods, we think supplementing personas with detailed and parameterized descriptions of ability and task-related characteristics may help LLMs exhibit the characteristics of target users.

2.3 Methods to Simulate User Flow

The interaction between users and the application is another critical component. Simulated users must be able to understand and interact with the interface so that mobile application usability can be assessed through user flow simulation.

LLMs can process textual data but cannot directly interact with GUI. Although the latest multimodal LLMs begin to process image information [49, 88], they still fall short of accurately understanding interfaces. To bridge this gap, Wang et al. contributed an algorithm that used depth-first search traversal to convert mobile UI’s view hierarchy [86] into HTML syntax. This study emphasized that LLMs require external models like SalGAN and UEyes to understand multimodal mobile applications containing text, image, and structural information [28, 70]. Transforming the combination of code files and visual models into natural language descriptions has the potential to enhance LLMs’ ability to interact with interfaces.

Automated UI testing methods measure interaction results and interface feasibility, yet they do not focus on the interaction process and user experience. Previous methods have already been able to manipulate GUI elements like human users [22, 81, 86], but they usually require a lot of human data, step-by-step instructions, or designed task-specified reward functions for task completion [24, 46]. LLMs bring new chances with their language comprehension and inference abilities. Liu et al. extended this line of research by using an LLM to generate semantic input text according to the GUI context, building on LLMs’ outstanding progress in text generation [50]. In their subsequent work, they asked the LLM to play the role of a human tester and chat with mobile apps to identify usability issues [51]. Their LLM tool was capable of executing more complex operations by extracting the static context of the GUI page and the dynamic context of the iterative testing process. Beyond individual interfaces, mobile applications embody the logic associated with user tasks [45]. Obtaining usability feedback from aspects such as the application’s framework and interaction logic, rather than just individual interfaces, also requires exploration.

LLM-based agent frameworks offer opportunities to conduct usability tests within the context of the interactive user flow. They assign intelligent agents different capabilities, which can humanize interactive and emotional experiences, enhancing realism in simulating social scenarios [3, 77, 90]. Meanwhile, chain-of-thought (CoT) prompting improves LLMs’ inference accuracy even with zero-shot prompts [36, 91]. By combining such a framework and structure, we think it is feasible to simulate interactions between users and mobile applications while reflecting the characteristics of users and the impact of scenarios.


3 FORMATIVE STUDY

To gain insights into the challenges encountered in prototyping practices and the opinions of designers regarding LLM-based methods for usability tests, we conducted 60-minute semi-structured interviews with 10 frontline practitioners consisting of UX designers and evaluators (6 male, 4 female; average age = 27.9), with experience ranging from 2 to 10 years. All of them were familiar with LLMs and user research methods, and two of them were developing LLM-based design tools. The practitioners were recruited from social media and provided their professional credentials.

We created a case study from our proposal and presented it to practitioners. Initially, we manually translated high-fidelity interfaces into natural language for the LLM’s comprehension. Subsequently, we prompted GPT-4 to behave as ’an energetic 12-year-old child’, interact with the application, and provide usability feedback. This process also revealed the agent’s perception and operational approach to the designers. The interview covered two main aspects:

What are the obstacles and challenges designers encounter when obtaining usability feedback during the prototype stage?

What are the expectations and concerns regarding the use of LLMs for generating usability feedback?

The first aspect synthesized insights from the literature and current design practices. The second aspect delved deeper, encompassing practitioners’ experiences using LLMs, the information they hoped to gain from the LLM-generated feedback, their concerns about the LLM-based system, etc. The detailed questions are listed in the supplementary materials.

Data from interviews was collected in the form of field notes and audio recordings. These field notes were later compiled, while audio files were anonymized, transcribed, and translated. Our study employed an ethnomethodological perspective to clarify how participants skillfully organize their practice and integrate technologies [72]. We focused on identifying challenges and expectations in design practice and system optimization. The data analysis, undertaken by one author in conjunction with two others involved in the interviews, concentrated on specific topics, reflecting the ethnomethodological approach [87].

3.1 Usability Evaluation Practices in Prototyping Stage

Personas require supplements. Personas are vital for practitioners to define target user groups. However, it is necessary to supplement user personas with more information. Practitioners customize user persona traits specified to the function of the prototype under test. Occasionally, they also incorporate task-related information and stakeholders’ insights to form a more comprehensive description.

User feedback is expected but absent in design practice. Practitioners recognize the importance of user feedback, yet involving users during prototyping proves challenging. Designers often substitute direct user involvement with internal estimations based on their own experience and empathy, which may lead to inaccurate judgments.

3.2 Expectations and Concerns toward LLM Agent

LLM should generate a wider range of contextual usability feedback in a quicker and lower-cost way. Promisingly, practitioners point out that one advantage of our LLM agent is its ability to quickly and cost-effectively simulate the feelings of target users. Practitioners believe they can have the opportunity to explore feedback from a wider variety of user groups. Compared to AI methods that only predict usability issues, they prefer the LLM approach that can infer user thoughts and underlying reasons. Besides, they think our demo broadened their perspectives on users by offering a new way to expand scenarios, imagining user flows in both main and extreme scenarios.

LLM should infer characteristic and reasonable requirements of specified users. Although they believe our approach is inspiring, its credibility depends on reasonably reflecting user characteristics. They mention that the existing digital personas provide a solid foundation for the LLM agent to understand users’ characteristics. Based on that, they hope the LLM agent could identify user insights with characteristics that are distinctly related to the mobile applications, thereby enriching the diversity and completeness of the user group.

Their concerns mainly concentrate on the ability of LLMs. First, they think it is very difficult for an LLM to "see" or understand interfaces and interactions. They agree that using code like HTML/CSS may enable an LLM to know the functions or layouts superficially. Still, its process of understanding and interacting with the application should mimic human user perception patterns. Besides, the tactile and auditory feedback of hardware should also be felt by the LLM agent. To summarize, the key to making LLMs truly understand interfaces and interactions lies in bridging the gap between language models and multimodal information such as visual, tactile, and auditory cues.

Second, practitioners are uncertain whether the LLM agent can provide reasonable usability feedback through simulating interactions in the user flow. It requires further empirical study for validation. They advise that feedback output should be described from the users’ perspective and summarized into heuristic insights. For instance, responses like "I hope I can contact parents through this interface" are more human-like than "I wish there is a ’contact parents’ button". In addition, the LLM agent should present the specific factors that lead to this usability feedback to the designer.

Table 1: Usability categories with explanations and examples of user feedback.

Usability Category | Explanation | Example of User Feedback
Information and Functions | The content of interfaces, including visual information and available functions. | "I need to know my heart rate data on this interface."
Interface Layout | The design of interfaces, including readability and comprehensibility. | "The text is too small for me to read clearly."
Interaction Operations | The method of user input, like clicking on the screen or pushing the side button on the watch. | "Sliding feels more natural to me than clicking."
Interaction Feedback | The output of user operations, including visual variation, auditory cues, and vibration. | "I need sound to confirm that I have successfully operated."
Interaction Logic | The navigation and framework of the app. | "I hope to directly access the sports interface from the homepage."


Figure 1: Overview of SimUser procedure. SimUser first creates simulated users and applications, then generates simulated usage scenarios based on the Task. Within these scenarios, it simulates interactions, ultimately producing usability feedback.


4 DESIGN OF SIMUSER TO GENERATE USABILITY FEEDBACK

To achieve our design goal, we create SimUser, an LLM tool that can understand mobile applications and simulate different target users to give usability feedback while interacting with them. Based on formative research and literature review [33, 40], we add interactive factors and categorize usability issues into five types, as listed in Table 1. SimUser is designed to offer these five types of usability feedback but does not involve intelligent elements like data accuracy or complex algorithms like recommendations. Meanwhile, it focuses on user ability characteristics (eyesight and comprehension ability) and task-related characteristics (skills, experience, and usage proficiency) [18, 38], while demographic characteristics such as cultural background have not yet been considered. Besides, it has to consider contextual factors including the task, the physical environment (location, light, and temperature, e.g., a playground on a cold night), and the hardware [31, 73].

To simulate the interaction between users and applications, we segment SimUser into two distinct LLM sub-agents: the Mobile Application Agent (MA) and the User Agent (UA), a structure inspired by ReAct [93]. In the early stages of development, we found that if only one LLM agent were established, it became easily influenced by the context, confusing its role as the user or the application, which caused numerous hallucinations. To clarify, MA serves as a representation of the mobile application prototype under test, covering information about both the application and the mobile device. Meanwhile, UA stands for the target user and is modeled based on the user personas.
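As a rough illustration of this two-agent split, the sketch below instantiates MA and UA as two separate chat sessions so that neither role leaks into the other. It assumes the OpenAI Python client and GPT-4 API access, and the system prompts are illustrative placeholders rather than SimUser's exact prompts.

```python
# Minimal sketch: MA and UA as two separate chat sessions so that neither
# agent confuses its role. Prompts here are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

class SubAgent:
    """A thin wrapper keeping one role's system prompt and message history."""

    def __init__(self, system_prompt: str, model: str = "gpt-4"):
        self.model = model
        self.history = [{"role": "system", "content": system_prompt}]

    def ask(self, message: str) -> str:
        self.history.append({"role": "user", "content": message})
        reply = client.chat.completions.create(
            model=self.model, messages=self.history
        ).choices[0].message.content
        self.history.append({"role": "assistant", "content": reply})
        return reply

# MA represents the prototype under test; UA represents the target user.
ma = SubAgent("You are a smartwatch sports application prototype. "
              "Describe interfaces objectively and react to user operations.")
ua = SubAgent("You are a 62-year-old user with weak eyesight using a "
              "smartwatch sports application. Think aloud while interacting.")
```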

To be more human-like, SimUser needs to learn how human users perceive and interact with interfaces [78, 84]. We found that LLM agents often adopt a machine-like logic to understand interfaces, such as skipping the browsing process and directly locating the target widget. To achieve human-like thinking, we utilize the CoT approach to mimic the real-life perception process and facilitate a step-by-step interaction between MA and UA.

Figure 1 illustrates the process of SimUser. In the subsequent sections, we will detail the functional modules of both sub-agents and explain how they work together to achieve the simulation.

4.1 Interactive Process Between Mobile Application Agent and User Agent

During the simulation, MA and UA collaborate to simulate an authentic user flow, as visualized in Figure 2. The simulation starts when designers upload the prototype, user personas, and tasks. The prompt structure is shown in Figure 3 and detailed in the supplemental material.


Figure 2: Interaction simulation process of simulated user and mobile application. The user flow is simulated in UA (2), which infers user’s expectations, comprehension, and operations, finally leading to usability feedback. UA (1) is the generated User characteristics and Usage scenarios, which will influence the feedback in UA (2). MA (3) presents the interface description and responds to the simulated user’s operations.


Figure 3: The prompt structure for generating usability feedback in SimUser has four parts: Expectation Generation, Interface Comprehension, Operation, and Usability Feedback. There are two sets of external knowledge: one comprising Principles that must be adhered to, and the other consisting of Usability References.

Step 1. MA will generate interface descriptions in accordance with the high-fidelity application uploaded by the designer in advance. Meanwhile, UA will refine user characteristics based on the user personas to specified factors and expand their usage scenarios accordingly.

Step 2. Next, UA will simulate the user flow by building expectations of the upcoming interface. Then, MA describes the interface information to UA. UA will then determine whether this page allows it to continue the task. If the answer is negative, UA returns to the previous page and operates again; if the answer is positive, UA continues with the following steps. When UA thinks the task has been accomplished, it quits the user flow simulation after finishing the steps on the current interface.

Step 3. UA will try to continue the task. After that, UA’s operations are transmitted to MA, and MA will determine whether the operation is successful and provide potential visual feedback on the interface or haptic and auditory feedback from the hardware for UA.

Step 4. UA compares its expectations with the actual interface in terms of the usability categories mentioned above, providing its thoughts according to expectation disconfirmation.

Step 5. Since UA has gone through the whole process of the task, UA will evaluate the application’s interaction logic and framework as a supplement to the usability feedback.

Throughout the process, UA should consistently play the role of the target user specified by the designer, immersing itself in the extended usage scenarios. On each interface, UA must make three evaluations, expressing its thoughts with the "think-aloud" method. An overall assessment of the application is also provided at the end of the simulation. When providing the user’s thoughts, UA should give reasons based on user-related contexts. The details of each module are described in the following sections, and a minimal sketch of the overall loop appears below.
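The following sketch shows how Steps 1-5 could be orchestrated as a loop of exchanges between the two sub-agents, reusing the SubAgent wrapper from the earlier sketch. The message wording, stop condition, and step granularity are assumptions rather than SimUser's released prompts.

```python
# Sketch of the Step 1-5 loop, reusing the SubAgent wrapper above.
# Message wording and stop conditions are assumptions.
def simulate_user_flow(ma, ua, task: str, max_steps: int = 10) -> list[str]:
    feedback = []
    ua.ask(f"Your task: {task}. Refine your characteristics and usage scenario.")   # Step 1 (UA side)
    for _ in range(max_steps):
        expectation = ua.ask("Before seeing the next interface, state what you "
                             "expect to find on it and why.")                        # Step 2
        description = ma.ask("Describe the current interface objectively.")
        operation = ua.ask(f"The interface shows: {description}\n"
                           "Decide whether it lets you continue the task, and "
                           "state your next operation (or 'go back' / 'task done').")  # Step 3
        if "task done" in operation.lower():
            break
        reaction = ma.ask(f"The user performs: {operation}. "
                          "State the resulting feedback and the next interface.")
        feedback.append(ua.ask(f"You expected: {expectation}\nYou received: {reaction}\n"
                               "Compare expectation and outcome for each of the five "
                               "usability categories and report your feelings."))    # Step 4
    feedback.append(ua.ask("Now evaluate the overall interaction logic and "
                           "framework of the application."))                         # Step 5
    return feedback
```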

4.2 Input from Designers

Information of the application and mobile device. Designers upload the codes along with images of each interface to SimUser for MA. Additionally, designers should define the interaction method and feedback method of the mobile device.

Information of the user personas and tasks. The processed user personas from interviews and questionnaires should be input as reference materials for SimUser to learn the characteristics of certain user groups. User personas must include ability and task-related characteristics. Additionally, designers need to define tasks that users are supposed to complete within the mobile application.

4.3 Mobile Application Agent

4.3.1 Interface description generator.

MA will generate natural language descriptions of each interface, including the positions, contents, sizes, colors, contrast, and visual saliency of widgets and texts, based on the code files. To organize interface information, MA integrates the code files of the interface and outputs a new structured text file, encoding essential layout elements and structure while adding global styles. Besides, MA extracts the absolute positions of widgets and combines them with the image to segment the interface. The segmented interface is processed by a visual model to calculate the visual saliency of each widget. The visual saliency is also included in the text file as a supplement to the definitions. Most importantly, MA has to describe the interface objectively. For instance, the description should not be “icon of heart rate” but “icon that looks like a heart”, and evaluative adjectives are forbidden. An example is shown in Figure 4.


Figure 4: Example of interface description generation. The description is divided into three parts: an overall visual description, a description of widgets segmented by area, and a summary.
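A minimal sketch of how widget geometry and per-widget saliency might be merged into such an objective description, in the spirit of Figure 4; the field names, example values, and phrasing are assumptions.

```python
# Sketch: merge widget geometry with per-widget saliency into an objective,
# non-evaluative description. Field names and phrasing are assumptions.
def describe_widget(w: dict) -> str:
    x, y, wd, ht = w["x"], w["y"], w["width"], w["height"]
    return (f"A {w['shape']} widget at ({x}, {y}), {wd}x{ht} px, "
            f"color {w['color']}, showing '{w['content']}'; "
            f"it attracts {w['saliency']:.0%} of the predicted visual attention.")

widgets = [
    {"shape": "round button", "x": 120, "y": 300, "width": 80, "height": 80,
     "color": "green", "content": "an icon that looks like an arrow", "saliency": 0.41},
    {"shape": "text label", "x": 40, "y": 60, "width": 240, "height": 30,
     "color": "white on black", "content": "00:00:00", "saliency": 0.22},
]

interface_description = "\n".join(describe_widget(w) for w in widgets)
print(interface_description)
```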


Figure 5: Overview of MA workflow. (1) represents the Interaction logic constructor, (2) represents the Interface description generator, and (3) represents the User action reactor. The User action reactor interacts with UA to simulate the user flow.

4.3.2 Interaction logic constructor.

This module aims to create the foundational framework for the application. MA extracts data on methods and logic of interactions from code files, organizing them into categorized groups. Each group comprises a single operation and its associated feedback, which can be either visual on the interface or multimodal on the device. The interaction logic between different interfaces constitutes the overall framework of the mobile application. The framework is also exported as another structured text file.
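One plausible encoding of these categorized groups and inter-interface links is sketched below; the exact schema used by SimUser is not specified here, so this layout and the interface names are assumptions.

```python
# Illustrative schema for the interaction-logic file (layout is an assumption).
interaction_logic = {
    "A1_home": {
        "tap_start_button": {
            "feedback": {"visual": "button turns dark green", "haptic": "short vibration"},
            "next_interface": "A2_running",
        },
        "press_side_button": {
            "feedback": {"visual": "app list appears", "haptic": None},
            "next_interface": "system_app_list",
        },
    },
    "A2_running": {
        "long_press_end_button": {
            "feedback": {"visual": "progress ring fills", "auditory": "confirmation tone"},
            "next_interface": "A3_summary",
        },
    },
}
```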

4.3.3 User action reactor.

The mobile application agent reacts to user actions throughout the user flow. If the operation is available, MA will inform the user what kind of feedback they will receive. At the same time, MA will recall the interaction logic of the mobile application and determine the next interface the user will be directed to. When the simulated user accesses an interface, MA provides its description sourced from the pre-generated files. The overall workflow of MA is shown in Figure 5.
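A minimal sketch of the reactor's lookup over the interaction-logic structure above; how unavailable operations are handled is an assumption.

```python
# Sketch of MA's reaction to a user operation, using the interaction_logic
# dictionary from the previous sketch. Handling of unavailable operations is assumed.
def react_to_operation(current_interface: str, operation: str,
                       descriptions: dict[str, str]) -> tuple[str, str]:
    entry = interaction_logic.get(current_interface, {}).get(operation)
    if entry is None:
        # Unavailable operation: stay on the same interface, no feedback.
        return current_interface, "Nothing happens."
    next_ui = entry["next_interface"]
    feedback = ", ".join(f"{k}: {v}" for k, v in entry["feedback"].items() if v)
    return next_ui, f"{feedback}. You now see: {descriptions.get(next_ui, '')}"
```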

4.4 User Agent

User Agent (UA) serves to simulate users in the interactive user flow. The simulation method is shown in Figure 6 and described in the following:


Figure 6: Overview of UA workflow. (1) represents the Target user and scenario generator and (2) represents the User flow simulator. In the User flow simulator, the target user will interact with MA in the scenario.

4.4.1 Target user and scenario generator.

Analyze user persona. Given the target user definition, UA creates several target user descriptions that meet the designers’ task requirements, based on the uploaded user persona reference. Along with the user persona analysis, UA organizes a discussion to gather insights from three simulated stakeholders, who only need to propose characteristics of the target user rather than design suggestions and who have to explain their thoughts from their roles or areas of expertise. The description of each user should cover characteristics from both sources rather than being derived solely from past training data, and should display differences across these characteristics.

Table 2: Factors used to parameterize user characteristics.

Factor | Secondary Factors | Explanation
Visual demand | Contrast, Magnification, Color, Brightness | How visual factors influence users’ perception
Input methods | Voice, Touch, Sensor | How feasible and favorable these interaction methods are for users
Output methods | Text, Graphics, Sound, Vibration | How feasible and favorable these interaction methods are for users
Usage proficiency | | How experienced users are with mobile applications
Attitudes | | How receptive users are to mobile applications

Describe detailed characteristics. To enhance UA’s capability in aligning user personas with expectations and the user flow, we have delved deeper into detailing the user characteristics. Focusing on mobile applications, four additional factors are introduced: visual demands, interaction demands (input and output methods), usage proficiency, and attitudes toward intelligent applications. Each factor has several secondary factors, shown in Table 2, which are assigned values on a five-point scale. SimUser also summarizes these factors as a structured user description and re-inputs it into UA.
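An illustrative structured user description with the Table 2 factors scored on a five-point scale might look as follows; the field names and values are invented for illustration, and SimUser's actual format may differ.

```python
# Illustrative structured user description (values and field names are
# invented for illustration; SimUser's actual format may differ).
user_profile = {
    "persona": "retired teacher, 63, walks in the park every morning",
    "visual_demand": {"contrast": 4, "magnification": 5, "color": 3, "brightness": 4},
    "input_methods": {"voice": 4, "touch": 3, "sensor": 2},
    "output_methods": {"text": 2, "graphics": 3, "sound": 5, "vibration": 3},
    "usage_proficiency": 2,   # 1 = novice, 5 = expert
    "attitude": 3,            # 1 = reluctant, 5 = enthusiastic
}
```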

Generate possible scenarios. Relevant usage scenarios are created based on tasks and user characteristics. The simulated scenarios need to conform to the habits of the user and are inferred from the user’s physical ability, cognitive level, and so on. In response to practitioners’ suggestions, we specifically ask the LLM to generate edge scenarios, such as extreme weather or accidents. Besides usage scenarios, the contexts also include previous thoughts, behaviors, and emotions of the target user. We hypothesize that these factors influence the LLM’s generation of user expectations and its simulation of the user flow.

4.4.2 User flow simulator.

Generate user expectations. We employ zero-shot prompts to assist UA in inferring user expectations for the designer’s mobile application. Compared to human users, LLM agents lack prior experience, leaving them without a basis for assessing usability. Therefore, enabling them to develop expectations for upcoming pages and interactions in advance may help them become more critical. The expectations generated by UA should reflect the influences of the target user and scenario. To be closer to reality, UA predicts the next interface based on its operations on the current one. UA envisions the upcoming interface from the perspective of the usability categories in Table 1 before MA describes the interfaces. It is worth noting that UA should provide reasons for each user expectation it generates. These reasons must be traced back to fundamental aspects such as user characteristics, scenarios, and device information.
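A zero-shot prompt of the kind that could elicit such expectations, reusing the ua agent and user_profile from the earlier sketches, might read as follows; the wording is an assumption, not SimUser's released prompt.

```python
# Illustrative zero-shot prompt for expectation generation (wording assumed).
expectation_prompt = f"""You are the user described below, in the given scenario.
User: {user_profile}
Scenario: running around a lake on a cold evening, wearing gloves.
Task: record a running exercise on the smartwatch application.

Before the next interface is shown, list your expectations for it under each of
the five usability categories (information and functions, interface layout,
interaction operations, interaction feedback, interaction logic). For every
expectation, give the reason, traced back to your characteristics, the scenario,
or the device."""
expectation = ua.ask(expectation_prompt)
```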

Perceive and comprehend the interfaces. In our earlier demos, we found that LLMs would identify task-related elements directly from the interface, bypassing the perception process typical of human users. Therefore, we prompt SimUser to perceive the interfaces in a human-like way. First, UA gains a broad understanding of the interface layout and the relative positions of widgets and text. UA then searches for a target widget that may help it continue the task. UA also possesses decision-making capabilities: if the current interface is inappropriate, UA returns to the previous page and operates again; otherwise, it interacts with the identified widget to proceed to the next interface, or concludes the user flow simulation if it is the end.

Interact with the mobile application. When UA believes a widget may lead to the right page, it chooses one operation from click, slide, long press, and so on to interact with it and trigger its function. After operating, UA receives corresponding feedback from the mobile application. UA may also misoperate on the interface, receiving no feedback or being navigated to an unwanted interface.

Evaluate the single interface and corresponding interaction. After completing the perception and operation steps on a single interface, UA should compare the actual interface with the user’s expectations and determine the feelings these differences evoke. UA objectively identifies specific differences and then explains its feelings. The five usability categories listed in Table 1 should be addressed one by one. To obtain interactive usability feedback, we set up various principles and standards based on different stages of tasks. The principles cover how to imitate a characteristic user and how to perceive and operate on interfaces, while the standards consist of SUS [8], NASA-TLX [21], and UEQ [39], which instruct SimUser in assessing usability. Detailed information on the transformation methods, principles, and standards is listed in the supplemental material.


Figure 7: The inputs and outputs of the web tool. Designers upload user documents, test tasks, and prototypes. They can supervise the interface descriptions and the user flow, and then receive usability feedback after a waiting period.

4.4.3 Interaction logic evaluator.

According to the user flow, UA reports subjective feelings about the interaction logic of the mobile application. It needs to consider whether the interaction logic is simple or complex, as well as whether it aligns with user habits.

4.5 Implementation

SimUser is implemented using the following tools. We export interface images and use Locofy [53] to convert Figma files to HTML/CSS files. HTML files include information on interfaces and their interaction logic, while CSS files describe the features and layout. Other forms of code, such as React, are also feasible, as long as they include this information. After that, we logically divide functional areas based on the interface layout and obtain the absolute positions of key widgets. In conjunction with this, SalGAN [70] is employed to measure the visual saliency proportion of widgets.
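A sketch of turning a saliency map into per-widget saliency proportions; predict_saliency is a hypothetical wrapper around SalGAN, and a random array stands in for its output here.

```python
# Sketch: per-widget saliency proportion from a grayscale saliency map.
# predict_saliency is a hypothetical wrapper around SalGAN.
import numpy as np

def widget_saliency(saliency_map: np.ndarray,
                    boxes: dict[str, tuple[int, int, int, int]]) -> dict[str, float]:
    """boxes maps widget name -> (x, y, width, height) in pixel coordinates."""
    total = saliency_map.sum() or 1.0
    proportions = {}
    for name, (x, y, w, h) in boxes.items():
        proportions[name] = float(saliency_map[y:y + h, x:x + w].sum() / total)
    return proportions

# saliency_map = predict_saliency("A1_home.png")   # hypothetical SalGAN wrapper
saliency_map = np.random.rand(448, 368)            # placeholder for this example
print(widget_saliency(saliency_map, {"start_button": (120, 300, 80, 80)}))
```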

The web tool for SimUser is developed in Python and uses the Gradio library for rapid GUI construction. Designers can input test tasks and user reference documents. SimUser displays prototype previews along with three sets of resettable Users and their corresponding Scenarios. When SimUser successfully processes the structured files, three types of images for the current page (original, segmented, and saliency grayscale), as well as the natural language description, are presented. To save tokens of the GPT-4 API in the user flow simulation, we deploy prompt engineering methods to recall contextual information as described above. After SimUser completes the simulation, designers receive SimUser’s usability feedback for the current interface along with the user’s perceptions and behaviors during the process. The usage of the web tool is illustrated in Figure 7.
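A minimal Gradio layout of the kind described, where run_simulation is a placeholder for the SimUser pipeline and the widget labels are illustrative.

```python
# Minimal Gradio sketch of the web tool's inputs and outputs.
# run_simulation is a placeholder for the SimUser pipeline described above.
import gradio as gr

def run_simulation(task, persona_file, prototype_zip):
    return "Usability feedback would appear here after the simulation finishes."

with gr.Blocks(title="SimUser") as demo:
    task = gr.Textbox(label="Test task", value="Record a running exercise")
    persona_file = gr.File(label="User persona reference document")
    prototype_zip = gr.File(label="Prototype (HTML/CSS export and interface images)")
    run_button = gr.Button("Simulate user flow")
    feedback = gr.Textbox(label="Generated usability feedback", lines=12)
    run_button.click(run_simulation, [task, persona_file, prototype_zip], feedback)

demo.launch()
```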


5 EMPIRICAL STUDY

We conducted an empirical study to verify the reasonableness of SimUser by comparing it with user testing of human users, and then further ascertained whether SimUser could offer designers assistance and inspiration. We chose the smartwatch and a sports application as our evaluation case. Smartwatches are frequently used in HCI research [26, 34, 61]. Furthermore, smartwatch applications possess fundamental interface information, such as functions, layout, and diverse interaction methods integrated with hardware, all of which are extendable to other smart interfaces like smartphones. The choice of a sports application is motivated not only by its widespread usage [11, 15] but also by the greater influence of contextual factors on its usability feedback.


Figure 8: The test prototypes A and B. The two sets of prototypes perform differently in usability categories, with Interface A possessing a more direct interaction logic.

5.1 Apparatus: test prototypes and the mobile device

We invited design graduate students to create two simple prototypes of the same sports mobile application for the smartwatch in Figma which only included basic functions. The two prototypes (A and B) only differ in usability concerning interaction logic, interface layout, and interaction feedback, as shown in Figure 8.

In our studies, we chose Apple Watch Series 8 as the test device. Not only is it a prevalent device, but it also encompasses several typical screen interaction methods. Additionally, users can interact with physical buttons on the side of the Apple Watch. We incorporated these hardware interactions as complements to the screen-based interactions that both human users and SimUser could utilize during our testing.

The experiment was conducted in a university’s design studio. One researcher controlled the prototype interface on the watch via a laptop, while another researcher used a laptop to document participants’ feedback, with the entire session being audio recorded. The whole process of user testing was conducted in Mandarin, and transcribed by the transcription service of Tencent after removing identifiable information.

5.2 Participants

To compare and verify the effectiveness of SimUser, user tests were conducted with two groups of participants of different user types. We invited 24 university students and 24 elderly individuals, with average ages of 22.58 (SD = 2.53) and 60.79 (SD = 8.87), respectively. All participants were recruited from social media; the average user test lasted 45 minutes, and participants received 50 CNY as compensation for their time.

Once the SimUser web tool was developed, we invited 21 design practitioners with an average age of 23.5 (SD = 2.25) and an average of 5.43 years of experience (SD = 2.06). The recruitment method was the same as in the formative study.

5.3 Procedure for collecting usability feedback from human users

In the human user groups, all participants were asked to fill out questionnaires about their background and demographics and to perform user testing on prototypes A and B separately. In the beginning, we gave participants time to familiarize themselves with the application. Then, we introduced the test task: they should record a running exercise on the test applications and imagine themselves in their most common exercise scenarios. Afterward, the experimenter reviewed each interface with participants, retracing their experience and guiding them to identify usability issues of the interfaces. Participants also stated how they perceived the interface and found the target widget. In the end, we interviewed them to conclude their experience and asked them to contemplate potential extended scenarios. The order of the two test prototypes was counterbalanced.


Figure 9: Study procedure for collecting usability feedback from human users and SimUser.

5.4 Procedure for collecting usability feedback from SimUser

With the consent of the participants, we collected their information according to the user characteristics SimUser requires and processed it into the SimUser’s reference document. SimUser analyzes these data and generates deeper user characteristics, as mentioned in section 4.4.

To determine how many rounds of SimUser feedback should be collected in the study, a pre-test was conducted with different numbers of rounds. We had SimUser simulate 10 rounds of results and compiled the generated usability feedback (the analytical method is detailed in Section 6). We observed that randomly sampling five rounds of results already covered over 90% of the outcomes from all ten rounds, which suggested the feedback had largely saturated and 10 rounds were sufficient. Accordingly, we employed SimUser to generate 10 rounds of usability feedback as elderly individuals and university students for prototypes A and B respectively (40 rounds in total). The output over 10 rounds was collected and then compared with the results from real users.
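A sketch of this saturation check, assuming each round's feedback has already been coded into a set of theme identifiers; the example theme sets are invented.

```python
# Sketch of the saturation pre-test: do 5 randomly sampled rounds cover most
# of the themes found across all 10 rounds? Theme sets are assumed pre-coded.
import random

def sampled_coverage(rounds: list[set[str]], sample_size: int = 5,
                     trials: int = 1000) -> float:
    all_themes = set().union(*rounds)
    ratios = []
    for _ in range(trials):
        sample = random.sample(rounds, sample_size)
        ratios.append(len(set().union(*sample)) / len(all_themes))
    return sum(ratios) / trials

rounds = [{"A1-text-size", "A2-start-icon"}, {"A1-text-size", "A3-vibration"},
          {"A2-start-icon"}, {"A1-text-size"}, {"A3-vibration", "A1-contrast"},
          {"A1-contrast"}, {"A2-start-icon", "A1-text-size"}, {"A3-vibration"},
          {"A1-text-size"}, {"A1-contrast", "A2-start-icon"}]
print(f"Average coverage of 5-round samples: {sampled_coverage(rounds):.1%}")
```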

5.5 Procedure for gathering designers’ suggestions

We recorded a demonstration video, enabling these designers to remotely experience the tool after we introduced how to use SimUser, explained the input materials, and presented the generated usability feedback to them. Afterward, they were asked to fill out a System Usability Scale (SUS) questionnaire [8] and took part in brief semi-structured interviews so we could gather their opinions. The results were coded and analyzed with the same methods as the formative study.


6 RESULTS


Figure 10: The coding process. This shows how usability feedback was analyzed and counted to calculate coverage.

A coding approach combining deductive and inductive methods was employed [54, 57]. To analyze the transcribed data, we created a 7-category deductive coding framework. It included the 5 usability categories and 2 types of usage processes about how users perceive and behave on the interfaces. After that, data of human users and SimUser was inductively coded under a unified framework [7]. Two annotators independently identified usability feedback labels and corresponding reasons on each page. They generated themes from these labels and classified the themes into seven categories. The final code structure contained three layers called category, theme, and label, as shown in Figure 10.

For better identification, we defined structured annotations of the form "interface number - object - attribute - attitude" for each theme. When annotators realized that existing themes did not describe the data well, codes were merged, split, altered, or generated to better explain the data [43, 54, 60, 80]. A consistency check was conducted to examine the validity of the annotations (Cohen's Kappa = 0.72).

Within each category, we counted the number of occurrences of every theme across all participants and extracted themes shared by SimUser and human users. The coverage was calculated by the following formula:

\( \textit{Coverage} = \frac{\text{Number of shared themes}}{\text{Number of total themes in human data}} \times 100\% \quad (1) \)
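Equation (1) amounts to a set-overlap ratio; a minimal sketch, assuming themes are represented as coded strings.

```python
# Coverage as in Equation (1): shared themes over all themes in the human data.
def coverage(human_themes: set[str], simuser_themes: set[str]) -> float:
    return len(human_themes & simuser_themes) / len(human_themes) * 100

human = {"A1-text-too-small", "A2-no-sound-feedback", "A3-deep-navigation"}
simuser = {"A1-text-too-small", "A3-deep-navigation", "A2-missing-heart-rate"}
print(f"Coverage: {coverage(human, simuser):.1f}%")   # 66.7%
```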

Further, we analyzed characteristic labels of the elderly and university students between the results of human users and SimUser. We also verified whether SimUser can infer usage scenarios according to user personas and the task.

Table 3: SimUser's coverage of human users' usage scenarios and scenario-based needs.

Category | User Type | SimUser Coverage | Examples
General scenarios | Elderly individuals | 80.0% | Park/lakeside/outdoor, community/near home, sidewalk/road
General scenarios | University students | 88.2% | Campus/playground, indoor/gym, park/lakeside
Specified scenarios | Elderly individuals | 76.9% | Rainy days, falls/injuries, temporary phone calls
Specified scenarios | University students | 80.9% | Encountering obstacles, rainy days, mountains, marathons
Scenario-based user needs | Elderly individuals | 73.5% | Easy and fast operation, safety requirements, voice interaction
Scenario-based user needs | University students | 73.4% | Listen to music, synchronize data with other devices, social functions

6.1 Usage scenarios

SimUser covered over 70% of usage scenarios of both the elderly and university students, as shown in Table 3, and performed better in empathizing with university students. For example, in general scenarios, outdoor locations like "park" and "lakefront" were most frequently mentioned by simulated elderly individuals, while simulated university students often mentioned "campus playground" and "gym". Specifically, SimUser provided many special scenarios not mentioned by human users, such as running in an unfamiliar city while traveling or running with a dog.

6.2 Usability Feedback from SimUser and Human User

In total, 2,875 labels were collected from human users, and the overall coverage of SimUser was 80.0% (n = 2,299). In the usability data comprising 7 categories and 2 user groups, totaling 14 data groups, 6 groups of SimUser covered over 80% of human user feedback and another 6 groups had coverage of around 70%. However, 2 groups, the students’ views on operation methods and interaction feedback, fell below 50%. The coverage rates are shown in Table 4, and we highlight some interesting results behind the data below.

6.2.1 Perception processes and interactive behavior.

The perception process and interactive behaviors were compared between SimUser and human participants. SimUser’s simulation of human users’ perception and behavioral processes achieved over 85% coverage rate. For instance, SimUser mentioned, "The green color and the arrow make me think of ’start’" and "I am going to long press the red button marked with a cross to end the run", similar to what human users thought and acted.

Table 4: Percentage of human data covered by SimUser (E = elderly individuals, U = university students).

Prototype | User | Perception Process | Interactive Behavior | Information and Functions | Interface Layout | Interaction Feedback | Interaction Operations | Interaction Logic
A | E | 77.8% (112/144) | 94.4% (136/144) | 76.5% (13/17) | 91.3% (136/149) | 81.9% (95/116) | 81.0% (17/21) | 100.0% (11/11)
A | U | 98.1% (156/159) | 100.0% (144/144) | 39.2% (22/56) | 78.6% (99/126) | 45.8% (60/131) | 42.4% (14/33) | 100.0% (10/10)
B | E | 90.0% (180/200) | 97.9% (188/192) | 67.9% (19/28) | 78.9% (127/161) | 67.6% (121/179) | 78.6% (22/28) | 92.3% (12/13)
B | U | 86.0% (167/194) | 97.9% (188/192) | 90.3% (56/62) | 56.6% (90/159) | 52.7% (77/146) | 35.7% (15/42) | 66.7% (12/18)

6.2.2 Performance in usability feedback categories.

Information and functions. SimUser’s coverage within this category varied from 39.2% to 90.3%. Human users seemed to have difficulty providing usability feedback from the perspective of information and functions. In contrast, SimUser generated a more diverse range of such needs. 45.9% of the data generated by SimUser, which was not previously mentioned by human users, was also recognized as valuable.

Interface layout. SimUser covered 84.8% of the usability feedback of the elderly users and 66.3% of that of the students. In particular, it could distinguish the stylistic differences in layout between prototypes A and B. Prototype A was designed with higher-contrast colors, larger text and widgets, and clearer iconography, and the SimUser results correspondingly indicated that prototype A received more positive evaluations.

Interaction feedback. SimUser’s coverage in the interaction feedback varied from 45.8% to 81.9%. SimUser favored comprehensive and multimodal interaction feedback. Nevertheless, nearly half of both human users mentioned that “lacking auditory or tactile feedback is acceptable”. Contrary to SimUser, 8 university students believed that auditory feedback in public places is highly undesirable.

Interaction operations. SimUser showed differences in feedback on operation methods between user groups: a lower coverage for university students (38.7%) while a higher coverage for the elderly (79.6%). SimUser struggled to conceptualize trendy and varied operation methods such as custom gestures, frequently cited by university students.

Interaction logic. SimUser covered 86.5% of the usability issues in interaction logic. By comparing the characteristic labels for prototypes A and B, we found that over 90% of the SimUser data offered feedback about the excessive depth of access in applications. Human users without design experience could perceive that prototype A was better through comparison, but found it difficult to detect specific issues in the interaction logic. However, SimUser did not identify the confusing interaction logic in certain interfaces (B3, B4, and B5), whereas students pointed out this kind of issue.

6.2.3 User group characteristics reflected by SimUser.

The characteristic labels of the perception process generated by SimUser showed a 93.4% similarity to human data. For example, elderly individuals exhibited more confusion compared to university students, including misunderstandings of icon meanings and difficulties with some specialized vocabulary. The weaker eyesight of the elderly was also a common factor affecting usability feedback. The characteristic labels of the two user types in SimUser data mirrored these human users’ preferences in interaction modalities (coverage ranging from 67.6% to 87.1%). Elderly users often expressed challenges with text information and they preferred auditory feedback, especially voice-based notifications, while university students preferred vibration and visual feedback.

However, the effect of task-related characteristics was not obvious. Due to a lack of experience, elderly human users only provided 5 characteristic labels, all of which were completely covered by SimUser, which additionally generated more safety and health requirements. Nevertheless, SimUser only covered 48.6% of the task-related characteristics of the university students. With extensive proficiency and experience using smart devices, students proposed more characteristic labels, including expectations for innovative interaction experiences, social media sharing, and personalized settings. Additionally, SimUser lacked experience in using other related applications; 7 human users mentioned sports applications they had used, such as "keep" and "the joyrun", and compared them to prototypes A and B.

6.2.4 Contextual influence reflected by SimUser.

In SimUser data, more than 50 labels addressed how the physical environment (e.g., lighting, usage scenario) and the mobile device (hardware) influenced interface usage. For example, SimUser mentioned that for prototype A, the feedback sound (a crisp "click" sound upon tapping) "might not be sufficiently noticeable in a noisy gym or outdoor running environments". In contrast, human users rarely accounted for context, even though we instructed participants to imagine themselves in running scenarios during the experiments.

6.3 Differences in SimUser and Human Data

SimUser provided extra information not mentioned by human users, accounting for 45.5% of all the feedback it offered, which was not counted in the coverage. Part of this feedback was considered inspiring. For instance, SimUser suggested "I have high blood pressure and I need emergency assistance", a reasonable consideration that was absent from the human data of 24 elderly individuals. However, the excessive amount of information increases the difficulty of collating and filtering, potentially obscuring valuable feedback.

6.4 Feedback of SimUser Web Tool from Design Practitioners

The web tool’s average score of the processed SUS data was 66.22 (SD = 7.55), which was close to the average performance of 68. Unlike other AI-driven design assistance tools with a high level of complexity, our web tool was easy to use as it aligns more closely with a ’designerly understanding’ [48] attributed to its intuitively explained functionalities. Designers thought functions in the tool were well-integrated and easy to use. Nonetheless, they experienced difficulty in assimilating extensive textual outputs, indicating that enhancements through visualization or integrated report generation could markedly enhance the tool’s presentation. Also, a modular approach to segregate the functionalities of the web tool was recommended for improving usability. Besides, encapsulating the reasoning process of CoT in the backend resulted in designers losing control over the process, affecting explainability and their trust in the outcomes. Some designers indicated that they would like to control intermediate steps to adjust the output.

These designers all acknowledged the value of using SimUser to generate diverse usability feedback, which was innovative and worth trying. Designers also proposed many intriguing ideas. In addition to traditional user research data, they believe that commercial data, astrological signs, and MBTI personality types could also serve as user persona inputs. Two of them even suggested that reverse-clustering users from usability feedback might iterate a more accurate categorization of user groups.


7 DISCUSSION

7.1 Enable LLM to Generate Usability Feedback of Mobile Application Prototypes

Designers often find it challenging to accurately predict what information users expect to receive, or how specified usage scenarios may influence the interaction. To support user-centered design, contextual usability feedback with user characteristics is becoming a trend in design practices and research [67].

Across 10 rounds of generated feedback, SimUser covered most of the usability issues raised by 24 human users, with an overall similarity of 80%. SimUser reached high coverage rates in the perception processes and interactive behavior of human users, and it showed the potential to generate usability feedback in the categories of information and functions, interface layout, and interaction logic. Unfortunately, SimUser faced challenges in fully capturing the nuances of real human data, primarily due to its limitations in replicating human aesthetic sensibilities, experiential knowledge, and grasp of emerging trends. Such constraints resulted in certain aspects of usability feedback that SimUser was currently unable to generate effectively. Meanwhile, the coverage indicates similarity to human users and does not completely reflect the output of SimUser, since a lot of additional usability feedback that human participants overlooked was produced. These extra problems augment the amount of usability feedback from small-sample user tests but might obscure the real usability issues as clutter also increases. Some of them may be reasonable and enlightening, but designers can only use them as heuristic inspiration, not as replacements for human results.

Without clear standards and references, we find that GPT-4 has difficulty directly identifying issues within individual prototype designs. Compared to human users, GPT-4 exhibits greater ’tolerance’, being willing to traverse a deeper interface hierarchy. Similarly, if we only describe interface information to GPT-4 and ask it to express its feelings, it struggles to pinpoint specific usability issues and may unreasonably consider the interface design satisfying. We introduce the concept of expectation disconfirmation to assist SimUser in identifying issues through comparison while making it more ’critical’ through prompts. Additionally, we educate SimUser on how design factors of mobile applications impact their usability through prompts. As a result, SimUser becomes more stringent and better equipped to generate reasonable usability feedback.

In our work, we adopt two LLM agents to simulate the interaction between users and mobile applications. We divide the two agents (MA and UA) into separate modules and conduct in-module inference using the CoT approach. As tasks are decomposed into steps and SimUser records the result of every step, it can create users with detailed characteristics and generate accurate interface descriptions. The rate of hallucination in SimUser’s inference decreases, echoing the results of previous studies on CoT [20]. The two agents complete their tasks respectively and show the potential to collaborate interactively once we establish the sequence and rules of tasks. Moreover, SimUser performs well on interface layout feedback, which relies on visual observation. We suppose that the visual saliency model, by predicting attention allocation on interfaces, helps SimUser understand visual information.

7.2 Reflect User Characteristics and Contextual Influence in LLM-Generated Usability Feedback

Besides simulating usage scenarios and conversations between an imaginary prototype and fictional users [35], SimUser can understand user characteristics and generate usage scenarios similar to those of human users. For example, the most frequent exercise scenario for the elderly is the park, while for students it is the playground. More promisingly, SimUser is more capable of inferring extreme scenarios. Human users find it difficult to anticipate unexpected situations that might occur during running, but SimUser can generate such scenarios, like a potential fall for the elderly or sudden social needs for students.

SimUser effectively reflects the ability characteristics, which is particularly obvious for elderly people who have noticeable limitations in their capabilities. Our results indicate that by leveraging the inferential capabilities of LLMs and the user research materials we provide, it is possible to manifest the characteristics of user groups through in-context learning without extra training [74, 83]. Besides, we draw upon theories of user modeling to add factors that are more relevant to the applications and tasks to be tested. User modeling helps SimUser capture the broad characteristics of users more accurately, thereby better understanding the capabilities, preferences, and attitudes of the user group that influence usability feedback. However, SimUser cannot effectively represent task-related characteristics, especially those related to experience and proficiency. Optimizing these factors may further improve the alignment between SimUser and human users.

The generated usability feedback reflects the contextual influence of the physical environment and the hardware. Recent AI-driven prototype evaluation tools hardly consider the context in which users employ mobile applications, but SimUser can imagine how the usage scenario and hardware will shape usability feedback. For instance, some simulated users interacted with the mobile application by pressing the side buttons on the smartwatch, and a simulated student reported that auditory feedback felt inefficient because the gym was very noisy.

Last but not least, it is challenging for some human users to imagine possible edge scenarios, due to their lack of experience, and to express their expectations of interfaces, both of which are easy for SimUser to infer. Existing user research techniques still leave a gap between researchers and users. For example, the elderly may not be familiar with these technologies, and there may be a generational divide in communication with younger researchers [23]. SimUser creates a channel for barrier-free communication between researchers and simulated users.

7.3 Implications for LLM-based Usability Feedback in Design Practice

Our formative study indicates that although designers want usability feedback on prototypes, it is hard for them to acquire user insights in design practice. As discussed in the first part, we meet practitioners' expectation that LLM tools like SimUser can quickly diverge into various target user groups and reflect the influence of user characteristics and scenarios in usability feedback. We address the concern about how an LLM understands mobile applications by generating natural language descriptions from interface code files and combining them with a visual model. For the other concern, ensuring that the LLM infers reasonable usability feedback, we first help the LLM simulate user flow steps in the form of CoT, enhancing accuracy by having it output these intermediate steps, and we additionally establish a set of principles through prompt engineering that guide the LLM to report experiences and trace their causes in a user-like manner. After their concerns are addressed, designers believe that we have not only essentially met the need to identify usability issues during prototype iteration, but also provided further insights, such as how usage scenarios influence usability.
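For the point about describing interfaces, the sketch below shows one way component metadata (for example, parsed from an interface code file) could be merged with saliency scores from a visual model into a textual screen description. The data schema, values, and threshold are illustrative assumptions, not SimUser's actual pipeline.

```python
# Minimal sketch: merge component metadata with saliency scores from a visual
# model into a natural-language screen description for the LLM. The schema,
# values, and threshold are illustrative assumptions.

components = [
    {"name": "Start run button", "type": "button", "size": "large", "saliency": 0.82},
    {"name": "Settings icon",    "type": "icon",   "size": "small", "saliency": 0.21},
    {"name": "Heart-rate label", "type": "text",   "size": "small", "saliency": 0.35},
]


def describe_screen(components, attention_threshold=0.4):
    """Render components as prose, ordered by predicted visual attention."""
    ordered = sorted(components, key=lambda c: c["saliency"], reverse=True)
    sentences = []
    for c in ordered:
        prominence = ("immediately noticeable" if c["saliency"] >= attention_threshold
                      else "easy to overlook")
        sentences.append(f"A {c['size']} {c['type']} labelled '{c['name']}' that is {prominence}.")
    return " ".join(sentences)


print(describe_screen(components))
```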

Even though LLM agents are capable of generating a variety of outcomes, their hallucinations remain a concern, and designers should treat the results as heuristic insights. The LLM agent should aim to broaden designers' insights into users, serving as an auxiliary tool for discovering user requirements, as it cannot replace human user research. The key is to help designers rapidly expand the range of target users and scenarios so as to obtain as many usability insights as possible.

We think it is beneficial to extend LLM-based user simulation tools to other fields of design. The generative and inferential abilities of LLMs may be especially well suited to design, since design is an ill-structured field [30]. Like other AI-driven methods, LLM-based methods have the advantages of rapid generation and low cost, and in particular they provide designers with an ’ever-lasting’ user, which can be regarded as a supplement to human users. In design practice, experienced experts and target users are sometimes hard to recruit, but AI is always available. Some research has already used LLMs to generate user interviews [71]. After mobile applications are launched, LLMs could also be used to infer users' thoughts and feelings behind online data. It may even be meaningful to employ simulated users in design education, allowing students to quickly test their prototypes and build their user research skills.

7.4 Limitations and Future Work

Our empirical study was conducted with a limited sample of two groups of participants and several rounds of LLM tests. The interfaces used in the study were not derived from real design practice but were artificially designed to contain common issues found in mobile applications. Although we helped SimUser remember its experience, hallucination still occurred, as in other LLM tools [4]. Besides, we have explored the possibility of LLMs generating usability feedback, but the web tool has not been fully tested [41] and requires feedback from long-term practice.

How to establish a more effective connection between interface visual attributes and usability in SimUser requires further exploration. We think that additional models, such as activity exploration [52] and visual complexity [14], will further extend the usability feedback SimUser can generate, for example toward aesthetic evaluations of interfaces [59].

Currently, we have only completed testing on a simple smartwatch sports application, but the descriptive method we propose covers the basic components of mobile applications, including visual information, functions, layout, and interaction methods. Its compatibility lies in the fact that more complex interfaces, such as smartphone and in-car interfaces, share similar or even identical basic components [12]; only the amount of information increases. This becomes more promising as the capabilities of LLMs improve.

The impact of context merits more in-depth exploration. We have not yet instructed SimUser on the rules governing how contextual factors influence usability feedback, as we did for interface factors. Furthermore, we have only considered how the current interface influences users' expectations for the next one. How dynamically accumulated factors, such as emotions, affect usability feedback throughout the user flow is also an interesting direction for future study [1, 75].


8 CONCLUSION

We conducted a formative study to find out what designers expect from, and what concerns them about, usability feedback generated by LLMs. Guided by their opinions, we proposed SimUser, an LLM tool that produces heuristic usability feedback and is composed of two LLM agents that simulate interactions between users and mobile applications. SimUser aims to reflect the characteristics of target user groups and the influence of usage scenarios in its usability feedback. In the context of a simple smartwatch interface, an empirical study indicated that SimUser can identify between 35.7% and 100% of the usability issues identified by human users, depending on the user group and the usability category. Designers think it is beneficial to iterate prototypes with the generated usability feedback, although the web tool still needs improvement. In sum, our work takes a first step toward applying LLMs to generate usability feedback for prototype iteration and offers insights for developing more LLM tools across the whole design process.


Supplemental Material

Video Preview (mp4, 60.1 MB)

Video Presentation (mp4, 154.6 MB)

References

1. Anshu Agarwal and Andrew Meyer. 2009. Beyond usability: evaluating emotional response as an integral part of the user experience. In CHI ’09 Extended Abstracts on Human Factors in Computing Systems. 2919–2930.
2. Majed Alshamari and Pam Mayhew. 2008. Task design: Its impact on usability testing. In 2008 Third International Conference on Internet and Web Applications and Services. IEEE, 583–589.
3. Lisa P Argyle, Ethan C Busby, Nancy Fulda, Joshua R Gubler, Christopher Rytting, and David Wingate. 2023. Out of one, many: Using language models to simulate human samples. Political Analysis 31, 3 (2023), 337–351.
4. Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. 2023. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023 (2023).
5. Nigel Bevan, Jim Carter, Jonathan Earthy, Thomas Geis, and Susan Harker. 2016. New ISO standards for usability, usability reports and usability measures. In Human-Computer Interaction. Theory, Design, Development and Practice: 18th International Conference, HCI International 2016, Toronto, ON, Canada, July 17-22, 2016, Proceedings, Part I 18. Springer, 268–278.
6. Nigel Bevan, James Carter, and Susan Harker. 2015. ISO 9241-11 revised: What have we learnt about usability since 1998?. In Human-Computer Interaction: Design and Evaluation: 17th International Conference, HCI International 2015, Los Angeles, CA, USA, August 2-7, 2015, Proceedings, Part I 17. Springer, 143–151.
7. Virginia Braun and Victoria Clarke. 2006. Using thematic analysis in psychology. Qualitative Research in Psychology 3, 2 (2006), 77–101.
8. John Brooke. 1996. SUS: a ’quick and dirty’ usability scale. Usability Evaluation in Industry 189, 3 (1996), 189–194.
9. Giulio Carducci, Giuseppe Rizzo, Diego Monti, Enrico Palumbo, and Maurizio Morisio. 2018. TwitPersonality: Computing personality traits from tweets using word embeddings and supervised learning. Information 9, 5 (2018), 127.
10. Roberto Casas, Rubén Blasco Marín, Alexia Robinet, Armando Roy Delgado, Armando Roy Yarza, John Mcginn, Richard Picking, and Vic Grout. 2008. User modelling in ambient intelligence for elderly and disabled people. In Computers Helping People with Special Needs: 11th International Conference, ICCHP 2008, Linz, Austria, July 9-11, 2008. Proceedings 11. Springer, 114–122.
11. Xiao Chen, Wanli Chen, Kui Liu, Chunyang Chen, and Li Li. 2021. A Comparative Study of Smartphone and Smartwatch Apps. In Proceedings of the 36th Annual ACM Symposium on Applied Computing (Virtual Event, Republic of Korea) (SAC ’21). Association for Computing Machinery, New York, NY, USA, 1484–1493. https://doi.org/10.1145/3412841.3442023
12. Xiao Chen, Wanli Chen, Kui Liu, Chunyang Chen, and Li Li. 2021. A Comparative Study of Smartphone and Smartwatch Apps. In Proceedings of the 36th Annual ACM Symposium on Applied Computing (Virtual Event, Republic of Korea) (SAC ’21). Association for Computing Machinery, New York, NY, USA, 1484–1493. https://doi.org/10.1145/3412841.3442023
13. Jiale Cheng, Sahand Sabour, Hao Sun, Zhuang Chen, and Minlie Huang. 2022. PAL: Persona-Augmented Emotional Support Conversation Generation. arXiv preprint arXiv:2212.09235 (2022).
14. Francesco Chiossi, Changkun Ou, and Sven Mayer. 2023. Exploring Physiological Correlates of Visual Complexity Adaptation: Insights from EDA, ECG, and EEG Data for Adaptation Evaluation in VR Adaptive Systems. In Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI EA ’23). Association for Computing Machinery, New York, NY, USA, Article 118, 7 pages. https://doi.org/10.1145/3544549.3585624
15. J Clement. 2020. App stores: number of apps in leading app stores 2020. Statista (2020).
16. Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A Mobile App Dataset for Building Data-Driven Design Applications (UIST ’17). Association for Computing Machinery, New York, NY, USA, 845–854. https://doi.org/10.1145/3126594.3126651
17. Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. 2023. Toxicity in ChatGPT: Analyzing Persona-assigned Language Models. arXiv:2304.05335 [cs.CL]
18. Andrew Dillon and Charles Watson. 1996. User analysis in HCI — the historical lessons from individual differences research. International Journal of Human-Computer Studies 45, 6 (1996), 619–637. https://doi.org/10.1006/ijhc.1996.0071
19. Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234 (2022).
20. Hao Fei, Bobo Li, Qian Liu, Lidong Bing, Fei Li, and Tat-Seng Chua. 2023. Reasoning Implicit Sentiment with Chain-of-Thought Prompting. arXiv preprint arXiv:2305.11255 (2023).
21. Sandra G Hart and Lowell E Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In Advances in Psychology. Vol. 52. Elsevier, 139–183.
22. Zecheng He, Srinivas Sunkara, Xiaoxue Zang, Ying Xu, Lijuan Liu, Nevan Wichers, Gabriel Schubiner, Ruby Lee, and Jindong Chen. 2021. ActionBert: Leveraging user actions for semantic understanding of user interfaces. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 5931–5938.
23. Syariffanor Hisham. 2009. Experimenting with the use of persona in a focus group discussion with older adults in Malaysia. In Proceedings of the 21st Annual Conference of the Australian Computer-Human Interaction Special Interest Group: Design: Open 24/7. 333–336.
24. Peter C Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Adam Santoro, and Timothy Lillicrap. 2022. A data-driven approach for learning to control computers. In International Conference on Machine Learning. PMLR, 9466–9482.
25. EunJeong Hwang, Bodhisattwa Prasad Majumder, and Niket Tandon. 2023. Aligning Language Models to User Opinions. arXiv:2305.14929 [cs.CL]
26. Alaul Islam, Ranjini Aravind, Tanja Blascheck, Anastasia Bezerianos, and Petra Isenberg. 2022. Preferences and Effectiveness of Sleep Data Visualizations for Smartwatches and Fitness Bands. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 27, 17 pages. https://doi.org/10.1145/3491102.3501921
27. ISO. 2018. Ergonomics of human-system interaction—Part 11: Usability: Definitions and concepts (ISO 9241-11:2018).
28. Yue Jiang, Luis A Leiva, Hamed Rezazadegan Tavakoli, Paul RB Houssel, Julia Kylmälä, and Antti Oulasvirta. 2023. UEyes: Understanding Visual Saliency across User Interface Types. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–21.
29. Yue Jiang, Yuwen Lu, Christof Lutteroth, Toby Jia-Jun Li, Jeffrey Nichols, and Wolfgang Stuerzlinger. 2023. The Future of Computational Approaches for Understanding and Adapting User Interfaces. In Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems. 1–5.
30. Marina Johnson, Abdullah Albizri, Antoine Harfouche, and Samuel Fosso-Wamba. 2022. Integrating human knowledge into artificial intelligence for complex and ill-structured problems: Informed artificial intelligence. International Journal of Information Management 64 (2022), 102479.
31. Satu Jumisko-Pyykkö and Teija Vainio. 2010. Framing the context of use for mobile HCI. International Journal of Mobile Human Computer Interaction (IJMHCI) 2, 4 (2010), 1–28.
32. Kate Kaplan. 2023. User Journeys vs. User Flows. https://www.nngroup.com/articles/user-journeys-vs-user-flows
33. Jayden Khakurel, Antti Knutas, Helinä Melkas, Birgit Penzenstadler, Bo Fu, and Jari Porras. 2018. Categorization framework for usability issues of smartwatches and pedometers for the older adults. In Universal Access in Human-Computer Interaction. Methods, Technologies, and Users: 12th International Conference, UAHCI 2018, Held as Part of HCI International 2018, Las Vegas, NV, USA, July 15-20, 2018, Proceedings, Part I 12. Springer, 91–106.
34. Konstantin Klamka, Tom Horak, and Raimund Dachselt. 2020. Watch+Strap: Extending Smartwatches with Interactive StrapDisplays. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–15. https://doi.org/10.1145/3313831.3376199
35. A Baki Kocaballi. 2023. Conversational AI-powered design: ChatGPT as designer, user, and product. arXiv preprint arXiv:2302.07406 (2023).
36. Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2023. Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916 [cs.CL]
37. Michal Kosinski. 2023. Theory of mind may have spontaneously emerged in large language models. arXiv preprint arXiv:2302.02083 (2023).
38. Sari Kujala and Marjo Kauppinen. 2004. Identifying and Selecting Users for User-Centered Design. In Proceedings of the Third Nordic Conference on Human-Computer Interaction (Tampere, Finland) (NordiCHI ’04). Association for Computing Machinery, New York, NY, USA, 297–303. https://doi.org/10.1145/1028014.1028060
39. Bettina Laugwitz, Theo Held, and Martin Schrepp. 2008. Construction and evaluation of a user experience questionnaire. In HCI and Usability for Education and Work: 4th Symposium of the Workgroup Human-Computer Interaction and Usability Engineering of the Austrian Computer Society, USAB 2008, Graz, Austria, November 20-21, 2008. Proceedings 4. Springer, 63–76.
40. Dave Lawrence and Soheyla Tavakol. 2007. Website Usability. Balanced Website Design: Optimising Aesthetics, Usability and Purpose (2007), 37–58.
41. David Ledo, Steven Houben, Jo Vermeulen, Nicolai Marquardt, Lora Oehlberg, and Saul Greenberg. 2018. Evaluation Strategies for HCI Toolkit Research. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal, QC, Canada) (CHI ’18). Association for Computing Machinery, New York, NY, USA, 1–17. https://doi.org/10.1145/3173574.3173610
42. Chunggi Lee, Sanghoon Kim, Dongyun Han, Hongjun Yang, Young-Woo Park, Bum Chul Kwon, and Sungahn Ko. 2020. GUIComp: A GUI design assistant with real-time, multi-faceted feedback. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 1–13.
43. Yoonjoo Lee, John Joon Young Chung, Jean Y. Song, Minsuk Chang, and Juho Kim. 2021. Personalizing Ambience and Illusionary Presence: How People Use “Study with Me” Videos to Create Effective Studying Environments. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 355, 13 pages. https://doi.org/10.1145/3411764.3445222
44. Clayton Lewis and Cathleen Wharton. 1997. Cognitive walkthroughs. In Handbook of Human-Computer Interaction. Elsevier, 717–732.
45. Toby Jia-Jun Li and Oriana Riva. 2018. KITE: Building conversational bots from mobile apps. In Proceedings of the 16th Annual International Conference on Mobile Systems, Applications, and Services. 96–109.
46. Yuanchun Li and Oriana Riva. 2021. Glider: A reinforcement learning approach to extract UI scripts from websites. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1420–1430.
47. Yuanchun Li, Ziyue Yang, Yao Guo, and Xiangqun Chen. 2019. Humanoid: A deep learning-based approach to automated black-box Android app testing. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1070–1073.
48. Q Vera Liao, Hariharan Subramonyam, Jennifer Wang, and Jennifer Wortman Vaughan. 2023. Designerly understanding: Information needs for model transparency to support design ideation for AI-powered user experience. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–21.
49. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023).
50. Zhe Liu, Chunyang Chen, Junjie Wang, Xing Che, Yuekai Huang, Jun Hu, and Qing Wang. 2023. Fill in the blank: Context-aware automated text input generation for mobile GUI testing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1355–1367.
51. Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. 2023. Chatting with GPT-3 for Zero-Shot Human-Like Mobile Automated GUI Testing. arXiv:2305.09434 [cs.SE]
52. Zhe Liu, Chunyang Chen, Junjie Wang, Yuekai Huang, Jun Hu, and Qing Wang. 2022. Guided Bug Crush: Assist Manual GUI Testing of Android Apps via Hint Moves. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (New Orleans, LA, USA) (CHI ’22). Association for Computing Machinery, New York, NY, USA, Article 557, 14 pages. https://doi.org/10.1145/3491102.3501903
53. Locofy.ai. 2023. Figma to React, React Native, HTML/CSS, Next.js, Gatsby, Vue. https://www.figma.com/community/plugin/1056467900248561542/Locofy-FREE-BETA—Figma-to-React%2C-React-Native%2C-HTML%2FCSS%2C-Next.js%2C-Gatsby%2C-Vue
54. Maria Lungu. 2022. The coding manual for qualitative researchers. American Journal of Qualitative Research 6, 1 (2022), 232–237.
55. Martin Maguire. 2001. Context of use within usability activities. International Journal of Human-Computer Studies 55, 4 (2001), 453–483.
56. Tara Matthews, Tejinder Judge, and Steve Whittaker. 2012. How Do Designers and User Experience Professionals Actually Perceive and Use Personas?. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Austin, Texas, USA) (CHI ’12). Association for Computing Machinery, New York, NY, USA, 1219–1228. https://doi.org/10.1145/2207676.2208573
57. Nora McDonald, Sarita Schoenebeck, and Andrea Forte. 2019. Reliability and Inter-Rater Reliability in Qualitative Research: Norms and Guidelines for CSCW and HCI Practice. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 72 (Nov 2019), 23 pages. https://doi.org/10.1145/3359174
58. Jaroslav Michalco, Jakob Grue Simonsen, and Kasper Hornbæk. 2015. An exploration of the relation between expectations and user experience. International Journal of Human-Computer Interaction 31, 9 (2015), 603–617.
59. Aliaksei Miniukovich and Antonella De Angeli. 2015. Computation of interface aesthetics. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems. 1163–1172.
60. Bilge Mutlu and Jodi Forlizzi. 2008. Robots in organizations: the role of workflow, social, and environmental factors in human-robot interaction. In Proceedings of the 3rd ACM/IEEE International Conference on Human Robot Interaction. 287–294.
61. Ali Neshati, Bradley Rey, Ahmed Shariff Mohommed Faleel, Sandra Bardot, Celine Latulipe, and Pourang Irani. 2021. BezelGlide: Interacting with Graphs on Smartwatches with Minimal Screen Occlusion. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Yokohama, Japan) (CHI ’21). Association for Computing Machinery, New York, NY, USA, Article 501, 13 pages. https://doi.org/10.1145/3411764.3445201
62. Jakob Nielsen. 1992. Finding Usability Problems through Heuristic Evaluation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Monterey, California, USA) (CHI ’92). Association for Computing Machinery, New York, NY, USA, 373–380. https://doi.org/10.1145/142750.142834
63. Jakob Nielsen. 2012, January 3. Usability 101: Introduction to Usability. https://www.nngroup.com/articles/usability-101-introduction-to-usability/
64. Jakob Nielsen. 2023, October 20. Unreliability of AI in Evaluating UX Screenshots. https://jakobnielsenphd.substack.com/p/ai-ux-evaluation
65. Jakob Nielsen and Rolf Molich. 1990. Heuristic Evaluation of User Interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (Seattle, Washington, USA) (CHI ’90). Association for Computing Machinery, New York, NY, USA, 249–256. https://doi.org/10.1145/97243.97281
66. Amelie Nolte, Karolin Lueneburg, Dieter P. Wallach, and Nicole Jochems. 2022. Creating Personas for Signing User Populations: An Ability-Based Approach to User Modelling in HCI (ASSETS ’22). Association for Computing Machinery, New York, NY, USA, Article 50, 6 pages. https://doi.org/10.1145/3517428.3550364
67. Adi Nugroho, Paulus Insap Santosa, and Rudy Hartanto. 2022. Usability Evaluation Methods of Mobile Applications: A Systematic Literature Review. In 2022 International Symposium on Information Technology and Digital Innovation (ISITDI). IEEE, 92–95.
68. Richard L Oliver. 1980. A cognitive model of the antecedents and consequences of satisfaction decisions. Journal of Marketing Research 17, 4 (1980), 460–469.
69. OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
70. Junting Pan, Cristian Canton Ferrer, Kevin McGuinness, Noel E O’Connor, Jordi Torres, Elisa Sayrol, and Xavier Giro-i Nieto. 2017. SalGAN: Visual saliency prediction with generative adversarial networks. arXiv preprint arXiv:1701.01081 (2017).
71. Stefano De Paoli. 2023. Writing user personas with Large Language Models: Testing phase 6 of a Thematic Analysis of semi-structured interviews. arXiv:2305.18099 [cs.CL]
72. David Randall, Richard Harper, and Mark Rouncefield. 2007. Fieldwork for design: theory and practice. Springer Science & Business Media.
73. Tom Rodden, Keith Cheverst, K Davies, and Alan Dix. 1998. Exploiting context in HCI design for mobile systems. In Workshop on Human Computer Interaction with Mobile Devices, Vol. 12. Glasgow.
74. Leonard Salewski, Stephan Alaniz, Isabel Rio-Torto, Eric Schulz, and Zeynep Akata. 2023. In-Context Impersonation Reveals Large Language Models’ Strengths and Biases. arXiv:2305.14930 [cs.AI]
75. Juergen Sauer and Andreas Sonderegger. 2009. The influence of prototype fidelity and aesthetics of design in usability tests: Effects on user behaviour, subjective evaluation and emotion. Applied Ergonomics 40, 4 (2009), 670–677.
76. Eldon Schoop, Xin Zhou, Gang Li, Zhourong Chen, Bjoern Hartmann, and Yang Li. 2022. Predicting and explaining mobile UI tappability with vision modeling and saliency analysis. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems. 1–21.
77. Sivan Schwartz, Avi Yaeli, and Segev Shlomov. 2023. Enhancing Trust in LLM-Based AI Automation Agents: New Considerations and Future Challenges. arXiv preprint arXiv:2308.05391 (2023).
78. Ben Shneiderman, Catherine Plaisant, Maxine Cohen, Steven Jacobs, Niklas Elmqvist, and Nicholas Diakopoulos. 2016. Designing the user interface: strategies for effective human-computer interaction. Pearson.
79. Makram Soui and Zainab Haddad. 2023. Deep learning-based model using DensNet201 for mobile user interface evaluation. International Journal of Human–Computer Interaction 39, 9 (2023), 1981–1994.
80. Anselm Strauss and Juliet Corbin. 1998. Basics of qualitative research techniques. (1998).
81. Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, and Kai Yu. 2022. META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI. arXiv preprint arXiv:2205.11029 (2022).
82. Xiaofei Sun, Xiaoya Li, Shengyu Zhang, Shuhe Wang, Fei Wu, Jiwei Li, Tianwei Zhang, and Guoyin Wang. 2023. Sentiment Analysis through LLM Negotiations. arXiv preprint arXiv:2311.01876 (2023).
83. Silvia Terragni, Modestas Filipavicius, Nghia Khau, Bruna Guedes, André Manso, and Roland Mathis. 2023. In-Context Learning User Simulators for Task-Oriented Dialog Systems. arXiv preprint arXiv:2306.00774 (2023).
84. Dejan Todorovic. 2008. Gestalt principles. Scholarpedia 3, 12 (2008), 5345.
85. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971 [cs.CL]
86. Bryan Wang, Gang Li, and Yang Li. 2023. Enabling conversational interaction with mobile UI using large language models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–17.
87. Ding Wang, Santosh D. Kale, and Jacki O’Neill. 2020. Please Call the Specialism: Using WeChat to Support Patient Care in China. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’20). Association for Computing Machinery, New York, NY, USA, 1–13. https://doi.org/10.1145/3313831.3376274
88. Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. 2023. CogVLM: Visual Expert for Pretrained Language Models. arXiv preprint arXiv:2311.03079 (2023).
89. Xiaorui Wang, Ronggang Zhou, and Renqian Zhang. 2020. The impact of expectation and disconfirmation on user experience and behavior intention. In Design, User Experience, and Usability. Interaction Design: 9th International Conference, DUXU 2020, Held as Part of the 22nd HCI International Conference, HCII 2020, Copenhagen, Denmark, July 19–24, 2020, Proceedings, Part I 22. Springer, 464–475.
90. Zhenhailong Wang, Shaoguang Mao, Wenshan Wu, Tao Ge, Furu Wei, and Heng Ji. 2023. Unleashing Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration. arXiv preprint arXiv:2307.05300 (2023).
91. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. 2023. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL]
92. Benfeng Xu, An Yang, Junyang Lin, Quan Wang, Chang Zhou, Yongdong Zhang, and Zhendong Mao. 2023. ExpertPrompting: Instructing Large Language Models to be Distinguished Experts. arXiv:2305.14688 [cs.CL]
93. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629 (2022).
94. Yao Yao, Zuchao Li, and Hai Zhao. 2023. Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Large Language Models. arXiv:2305.16582 [cs.CL]
95. Yong Zheng. 2019. Multi-Stakeholder Recommendations: Case Studies, Methods and Challenges. In Proceedings of the 13th ACM Conference on Recommender Systems (Copenhagen, Denmark) (RecSys ’19). Association for Computing Machinery, New York, NY, USA, 578–579. https://doi.org/10.1145/3298689.3346951
96. Qihao Zhu and Jianxi Luo. 2023. Toward Artificial Empathy for Human-Centered Design: A Framework. arXiv preprint arXiv:2303.10583 (2023).
