User intent perception by gesture and eye tracking

Abstract: User intent is a highly cohesive human activity that is difficult to encode and measure. In the area of product design, it is critical to capture the target user's psychological expectations, feelings, needs, and other elements of the user's true intentions. This paper proposes a task-based method to perceive user intent by capturing "Hand movement" and "Eye movement" while target users perform user tasks. For "Hand movement", user intents are obtained through gesture recognition based on computer vision; in particular, task-based gesture recognition solves the classic "Midas Touch" problem of vision-based gesture interfaces. For "Eye movement", users' interests in and concerns about the product are captured by eye-tracking equipment. As "Hand movement" and "Eye movement" represent the vast majority of users' nonverbal expressions of intent, the method is shown by both theory and experiment to be practical and well targeted.


Introduction
When a witness describes a fugitive's appearance to the police, it is hard to imagine that the police could accurately sketch the fugitive's portrait from that description alone. The witness cannot draw the fugitive's face himself, yet a police artist with professional painting skill, who knows nothing about the fugitive, can sketch an accurate portrait by repeatedly communicating with the witness.

PUBLIC INTEREST STATEMENT
Users desire products with good usability. They have expectations based on their experiences prior to encountering a product; they hold a mental model of the product drawn from their experience of, or even prejudices about, similar products, such as "what are its functions?", "what will it look like?", and "how is it used?". These user intents are highly cohesive human activities that are difficult to encode and measure. In the area of product design, it is critical to capture the target user's psychological expectations, feelings, needs, and other elements of the user's true intentions. This paper proposes a task-based method to perceive user intent by capturing "Hand movement" and "Eye movement" while target users perform user tasks. For "Hand movement", user intents are obtained through gesture recognition based on computer vision; for "Eye movement", users' interests in and concerns about the product are captured by eye-tracking equipment.

Related work
User intent perception is a kind of intangible knowledge that cannot be found in books, documents, manuals, brochures, or other carriers. It is a natural human skill, judgment, and intuition that cannot be easily described, because this kind of knowledge is subjective, random, and vague. User intent perception was first proposed by Michael Polanyi, a British physical chemist and philosopher (Polanyi, 1958). He held that imagery perception exists in the mind under particular circumstances and is difficult to formulate and communicate (Lowry, Roberts, & Romano, 2013; Nielsen, 1995; Tang, Lee, & Gero, 2011; Wu & Wang, 2012); it is an important form of experiential knowledge related to innovation. Since then, imagery perception has been studied and expressed through visualization by scholars from philosophy, linguistics, psychology, education, library science, management science, computational science, and other fields (Antunes, Herskovic, Ochoa, & Pino, 2014; Jacob & Wachs, 2013; Kela et al., 2006; Morency, Sidner, Lee, & Darrell, 2005; Niezen & Hancke, 2008; Torralba, Murphy, Freeman, & Rubin, 2003).
User intent perception is a frontier research field at the intersection of psychology, industrial design, computer technology, and other disciplines. Research on user intent has not yet formed a complete theoretical system, but it has begun to take shape (de Vries & Masclet, 2013; Liu, Wang, Zhong, Wickramasuriya, & Vasudevan, 2009).
In the field of industrial design, the most commonly used method to perceive user intent concerning the appearance of a product is protocol analysis. The Semantic Differential Method, proposed by Osgood, Suci, and Tannenbaum (1957), is one of its basic implementations. This method relates user intent, on a Likert scale, to the semantics of the studied object, such as product appearance and color (Tang et al., 2011). Since the method is easy to understand and use, scholars have applied it to the design of furniture, watches, telephones, cars, and so on (Lowry et al., 2013). Although protocol analysis reflects human cognitive activity directly, it is a qualitative, descriptive analysis, and problems arise in its application: whether it reports all of the hidden thought content, whether it affects the thinking process of participants (Fiorentino, Uva, Gattullo, Debernardis, & Monno, 2014), whether the procedures of translating, coding, and analyzing the oral reports are sufficiently reliable, and so on.
As computer technology grew, in the field of software products, Nielsen proposed usability testing to obtain user intent through the operation of products (Huang, Yang, & Chu, 2012; Li & Selker, 2001). Usability testing refers to evaluating a product or service by testing it with representative users. Typically, during a test, users are asked to complete typical user tasks (UT) while usability experts watch, listen, and take notes. The goal is to collect qualitative and quantitative operating data and to determine user intent and the usability problems of the product. However, usability testing can only capture user intent under the special circumstance that a usability expert is present; moreover, the method depends on the expert's personal judgment, so it is not very objective. This paper proposes a UT-based method to perceive user intent. With this method, the subjective, random, and vague user intent is divided into three states, so that the meaningful user intent can be separated out and the elusiveness of user intent resolved. Meanwhile, since a computer vision-based gesture recognition program gives feedback to user operations in real time without an expert's real-time observation, the method is objective. In addition, a user-task-based method can predict user behavior to some extent.

User task
Definition of a user task (UT): in a product or system, all the operations a user performs to complete a certain function, from the initial state to the target state, make up a user task. A user task is not necessarily a single operation; it is usually a combination of operations from the initial state to the target state. Take making a phone call as an example:
UT 1: Open contact list → Choose contact → Make a call
UT 2: Open the keypad → Enter number → Make a call
Note that users always try to understand the product in their own way. A user's understanding of the product may be completely incompatible with its actual working principle, and the user's operations may not meet the system's expectations, but this is entirely acceptable, because all of these operations are the user's true intent, which can be captured while the user operates the product. Once the user's true intent is vividly perceived and accurately described, research can be carried out to find out why users could not follow the system's expectations, and the necessary steps can then be taken to redesign the product. This is so-called user-oriented design. In this paper, user intent perception is built on the study of the user's "Hand movement" and "Eye movement" while performing user tasks.
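The task definition above lends itself to a very simple representation. As a minimal sketch (all names here are illustrative, not the paper's implementation), a user task is an ordered sequence of operations, and an observed operation sequence reaches the goal when it ends with one of the pre-set tasks:

```python
# A user task as an ordered sequence of operations, from the initial
# state to the target state. The phone-call tasks follow the UT 1 / UT 2
# example in the text; operation names are illustrative.
PHONE_CALL_TASKS = {
    "UT1": ["open_contact_list", "choose_contact", "make_call"],
    "UT2": ["open_keypad", "enter_number", "make_call"],
}

def reaches_goal(operations, task_set):
    """True if the observed operation sequence completes any pre-set task."""
    return any(operations[-len(seq):] == seq for seq in task_set.values())

print(reaches_goal(["open_keypad", "enter_number", "make_call"], PHONE_CALL_TASKS))
```

Either UT 1 or UT 2 reaches the same target state, which is exactly why a task is defined by its goal rather than by a single fixed operation path.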
The user's imagery of a product is always vague, which makes it difficult to capture the user's true intent, especially to perceive user imagery through non-verbal channels such as expression and posture. The work of converting a user's blurry imagery into appropriate product design elements, and of making the product design as consistent as possible with the user's imagery model, or even exceeding expectations, involves too many unknown variables and too much computation for both designers and perception systems. The user-task method is a good solution to this problem, because it provides a very clear user goal, which is the key to user intent perception. The task-based user intent perceptual system divides user intent into three types. The first type (U1) is the intent expected by the perceptual system: effective operations that trigger responses from the perception system; this interaction between user and system is exactly what we want. The second type (U2) also triggers responses but is not expected: users operate according to their own understanding and experience, sometimes in a wrong way, but this is still a true user intent that cannot be ignored. The third type (U3) accounts for the largest proportion, yet it is meaningless for the perceptual system. U3 includes unconscious operations, as well as conscious operations that are nonexistent or meaningless for the perceptual system; such behaviors and operations are ignored to reduce the number of unknown variables and the amount of computation.
Take the Attia pot as an example to explain how to perceive user intent with the user-task method. The interaction tasks between the Attia Pot and the user are summarized in Table 1. There are five UT: User Interface, Move Object, Cook Food, Get Food, and Clean Pot. Each user task is decomposed into one or more subtasks, each assigned a description set Gi (i = 1, 2, …, n).

"Hand movement"
To make the study of "Hand movement" more universal and objective, this paper presents a gesture recognition program based on computer vision. Unlike usability testing, user behaviors and operations are judged by the perceptual system instead of by experts, which is much more objective; in particular, task-based gesture recognition solves the classic "Midas Touch" problem of vision-based gesture interfaces. All meaningful gesture operations are connected with the functions of the product system, like an operation map in which every operation corresponds to a related function. The functions of the product system are limited, which makes the meaningful operations limited, so all the UT of the product can be pre-set as shown in Table 1. The rotation operations (G7, G13) of the gesture sets are subject to the "Midas Touch" problem. For example, when the user's goal is to seal the Attia Pot, he rotates the lid clockwise (G7); the hand then recovers to its normal state without any purpose, yet this recovery would trigger a command because it is read as an anticlockwise rotation (G13). This is the second type of user intent (U2): the gesture triggers a command without a user goal, and the perceptual system should give no response in this situation.
To solve the "Midas Touch" problem, researchers have studied different approaches, such as context-based gestures (Kumar & Herbert, 2003), guessability studies (Ruiz, Li, & Lank, 2011), and user-centered gesture interfaces (Norman, 2013), but none of them is as flawless, convenient, and effective as the task-based method. The logic of the perceptual system (Figure 1) works the same way a human's would: because there is no user task such as G7 → G13 or G13 → G7, the perceptual system gives no response when the gesture recovers to its normal state after a clockwise rotation.
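The filtering logic described above can be sketched as a prefix check against the pre-set task list. This is a hedged illustration of the idea, not the paper's code: a gesture only receives a response while the observed sequence can still complete some task, so the stray recovery rotation after a seal is silently dropped.

```python
# Sketch of the task-based "Midas Touch" filter (the Figure 1 logic).
# Gesture names G7/G13 follow the paper; the task sequences are illustrative.
TASKS = {
    "seal_pot": ["grab_lid", "move_lid", "put_lid", "G7"],   # clockwise seal
    "open_lid": ["grab_lid", "move_lid", "put_lid", "G13"],  # anticlockwise open
}

def is_valid_prefix(history, tasks):
    """A gesture sequence gets a response only if it can still complete a task."""
    return any(seq[:len(history)] == history for seq in tasks.values())

# After sealing (... G7), the hand's recovery is seen as an anticlockwise
# rotation G13, but no task contains G7 followed by G13, so no response.
history = ["grab_lid", "move_lid", "put_lid", "G7", "G13"]
print(is_valid_prefix(history, TASKS))  # False: the stray G13 is ignored
```

Because the task set is finite and known in advance, this check replaces any explicit "engagement" gesture or dwell-time heuristic that other Midas Touch solutions require.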
The Attia Pot is a perfect object for studying user intent: it involves the interaction between gestures and the perceptual system, the fingers' interaction with the physical buttons and touch screen, and the eyes' interaction with the shape and color of the product. Considering all these features of the interaction between gesture operations and the Attia Pot, five feature points are extracted: the thumb tip, the index finger tip, the palm, the fit between thumb and index finger, and the hollow area of the hand (Figure 2). According to the user tasks, a gesture set G = {G1, G2, G3, …, Gn} is defined for each gesture operation (Table 2); for example, G13 stands for the gesture set of Open Lid. Every gesture set shares the same feature vector F, a fifth-order matrix built from the five feature points above.
To prove the versatility and replicability of the method, this paper adopts a conventional gesture recognition experiment, in which 16 participants were recruited to perform the UT and the gesture operations were captured by a single camera; the framework of the experiment is shown below. Take the task Cook Food as an example: the user task is decomposed into operations, as shown in Figure 2. The operation flow drawn with green arrows is the expected task, which belongs to U1; the black arrows stand for U2, and the blue arrows for U3. In this case, the perceptual system predicts that the user's intent is either to seal the Attia Pot by rotating the lid clockwise or to open the cover, with the gesture flow Grab → Move → Put. Reading gestures with the task-based method has two advantages: firstly, the amount of computation is greatly reduced because meaningless gestures (U3) are ignored; secondly, user intent can be predicted to some extent, like the suggested phrases of an input method: when the user performs Grab Lid, there are only two possible continuations, Seal (G7) and Open Lid (G13).
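One way to picture the feature vector F is as the stacked image coordinates of the five feature points. This is only an assumption for illustration: the paper describes F as a fifth-order matrix without giving its exact layout, and the coordinates below are made-up pixel values.

```python
import numpy as np

# Hypothetical sketch of the five hand feature points of Figure 2
# packed into one feature matrix F. Coordinates are made-up pixel values;
# the paper's exact "fifth-order matrix" layout is not specified.
feature_points = {
    "thumb_tip": (120, 80),
    "index_tip": (140, 60),
    "palm": (100, 130),
    "thumb_index_fit": (130, 90),        # the fit between thumb and index finger
    "hollow_area_center": (115, 105),    # centroid of the hand's hollow area
}
F = np.array(list(feature_points.values()))  # one row per feature point
print(F.shape)
```

Since every gesture set shares the same feature layout, classifying a gesture reduces to comparing successive F matrices rather than re-deciding which features to track.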
According to the user task, Cook Food is made up of three parts. Part 1: grab the lid (Figure 3); part 2: rotate the lid clockwise to seal (Figure 4); part 3: press the button on the touch screen to cook (Figure 7). Part 3 will be explained in a separate experiment in the next section, because it is not yet well solved in the field of computer vision. The grab rule of part 1 is simple: if the hollow area intersects the object, the command is executed; if the hollow area does not intersect the object, the command is not activated.
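The grab rule above reduces to a geometric intersection test. A minimal sketch, assuming both the hollow area and the object are approximated by axis-aligned bounding boxes (the rectangle representation is an illustrative assumption; the paper does not specify the region shapes):

```python
# Grab detection sketch: a command fires only when the hand's hollow
# area intersects the object. Boxes are (x1, y1, x2, y2) in pixels.
def boxes_intersect(a, b):
    """True if two axis-aligned boxes overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def grab_command_active(hollow_area_box, object_box):
    return boxes_intersect(hollow_area_box, object_box)

print(grab_command_active((50, 50, 100, 100), (90, 90, 150, 150)))   # True
print(grab_command_active((50, 50, 100, 100), (120, 120, 150, 150))) # False
```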

Part 2: rotation. (In the corresponding figure, the object rotates together with the gesture.) Only a single camera is used in the experiment, so gesture operations are effectively detected only in the XY plane, not in the XZ or YZ plane. Rotation operations around the X or Y axis therefore need to be transformed into the XY plane.
Here is how the rotation operation works in this experiment. First, record the beginning position of the gesture as R0(xR0, yR0); when the gesture has rotated through a certain angle, record the new position as R1(xR1, yR1).
Take the position of the object centroid as Rc(xRc, yRc); the included angle ∠R0RcR1 of these three points is the angle the gesture has rotated. It can be calculated from the two vectors R0 − Rc and R1 − Rc:
θ = arccos( ((R0 − Rc) · (R1 − Rc)) / (|R0 − Rc| · |R1 − Rc|) )
If yR1 > yR0, the rotation is clockwise; otherwise, it is anticlockwise.
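The included-angle computation can be written directly from the definitions above. A small sketch (coordinates are made-up; the direction rule follows the text's yR1 > yR0 test):

```python
import math

# Rotation measurement in the XY plane: the included angle R0-Rc-R1
# between the start and end gesture positions around the object centroid Rc.
def rotation_angle(r0, r1, rc):
    v0 = (r0[0] - rc[0], r0[1] - rc[1])
    v1 = (r1[0] - rc[0], r1[1] - rc[1])
    dot = v0[0] * v1[0] + v0[1] * v1[1]
    norm = math.hypot(*v0) * math.hypot(*v1)
    angle = math.degrees(math.acos(dot / norm))
    # Direction rule from the text: yR1 > yR0 means clockwise.
    direction = "clockwise" if r1[1] > r0[1] else "anticlockwise"
    return angle, direction

angle, direction = rotation_angle((1.0, 0.0), (0.0, 1.0), (0.0, 0.0))
print(round(angle), direction)  # 90 clockwise
```

Note that the arccosine alone gives only the magnitude of the angle; the separate y-coordinate comparison supplies the sign, which is what distinguishes sealing (G7) from opening (G13).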
During the experiment, the perceptual system successfully identified all the gesture operations and judged the user intent under the normal condition that the camera frame rate was below 30 f/s and the movement speed of the gesture was within 5-10 mm/s; the average time of gesture recognition was about 17.6 ms (Figure 5), which basically satisfies the real-time requirement. However, when the movement speed of the gesture rose above 22 mm/s, recognition became unstable, because the analyzed video frames of the gesture are relatively static and discrete, which requires high-definition video frames of the gesture. This flaw is rarely a problem in practice, because the movement speed of user operations is within 5-10 mm/s in the vast majority of application systems; moreover, it can easily be fixed with a camera of higher frame rate.
Unexpectedly, when we reviewed the recordings, we found some interesting operations that belong to U2 yet mean a lot for product design: 3 of the 16 participants tried to open the Attia Pot by rotating the Rotary Knob, which is actually designed for exhaust (Figure 6). Obviously, the design of the Rotary Knob misleads users, whose behavior is based on their experience and cognitive models. The experiment is thus not only able to perceive user intent, but can also improve product design by testing and evaluating products.

"Eye movement"
In the field of computer vision, researchers have long studied gesture interaction but have rarely been concerned with eye interaction. However, eye interaction also plays an important role in user intent, such as the user's expectations of product color, product shape, and product function. All of these are beyond the reach of computer vision theory, as is the gesture of clicking on the touch screen (G10).
Part 3 of the Cook Food user task is to click the button Cook on the screen, but the perceptual system cannot tell the difference between clicks on different buttons: the gestures are the same and all the feature vectors are the same; only the eyes' attention changes when different buttons are clicked.
This paper proposes using eye-tracking equipment to solve these two problems. Eyes are the windows of the mind; they convey many user expectations, especially about the color and shape of a product and attention to its functions, and thus help to perceive user intent more comprehensively. The eye-tracking equipment is used for recording eye-tracking data, constructing visualizations, and filtering and exporting data while the user performs tasks; it is widely used in areas involving visual perception and eye movement, such as human factors, collaborative systems, virtual reality, marketing, and advertising.
In part 3 of the experiment, participants try to click the button Cook on the touch screen. As the participants look at the Attia Pot, the eye-tracking equipment tracks the participants' pupils and the concentration of their gaze, and then renders the data as a heat map, as shown in Figure 7. The heat map represents where the participants concentrate their gaze and how long they gaze at each area. Generally, a color scale from blue to red indicates the duration of focus; thus, a red spot indicates an area the participants focused on for a longer period of time. From the heat map, we know exactly where the users' attention is and, by combining the perceptual system with the eye-tracking equipment, how the gesture of Click works.
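A gaze heat map of this kind is typically built by accumulating duration-weighted "heat" around each fixation point. The sketch below is a hedged illustration with made-up data; the actual export format of the eye-tracking equipment is not specified in the paper.

```python
import numpy as np

# Heat-map sketch: deposit a Gaussian blob of "heat" at each fixation,
# weighted by fixation duration, so long fixations show up as red spots.
def heat_map(fixations, width, height, sigma=20.0):
    ys, xs = np.mgrid[0:height, 0:width]
    heat = np.zeros((height, width))
    for x, y, duration in fixations:  # (pixel x, pixel y, seconds)
        heat += duration * np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heat

# Two long fixations near the Cook button, one brief glance elsewhere.
fixations = [(60, 40, 1.5), (61, 42, 0.8), (200, 150, 0.2)]
hm = heat_map(fixations, width=320, height=240)
y, x = np.unravel_index(np.argmax(hm), hm.shape)
print(x, y)  # the hottest spot lies near the long fixation cluster
```

Mapping the accumulated values through a blue-to-red color scale then yields exactly the kind of visualization described for Figure 7.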
Beyond the Cook Food user task experiment, well-accepted products on the market (Figure 8) were shown to the participants, who were asked to select their favorite, in order to find out the user expectations on the color and shape of products through three measurements: Attraction, Attention, and Popularity. Attraction is measured by the time a participant takes to give the product a first glance from the beginning of the experiment: the shorter, the better. Attention is the total time the participants look at the product during the experiment: the more, the better. Popularity is the number of times a product is selected: the more, the better. At first, we used a grayscale image to lead participants to focus on the shape of the products (Experiment 2-B); then we used a normal color image (Experiment 2-C).
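The three measurements defined above can be sketched as simple aggregates over per-participant logs. All data below are made up for illustration; the field names and units are assumptions, not the study's recording format.

```python
# Attraction: mean time until a participant's first fixation on the
# product (seconds, lower is better).
def attraction(first_fixation_times):
    return sum(first_fixation_times) / len(first_fixation_times)

# Attention: total gaze time on the product across participants
# (seconds, higher is better).
def attention(dwell_times):
    return sum(dwell_times)

# Popularity: how many times the product was selected as the favorite
# (higher is better).
def popularity(selections, product_id):
    return selections.count(product_id)

# Made-up logs for one product across three participants, plus the
# favorite-product choices of five participants:
print(attraction([0.8, 1.2, 0.5]))     # mean time to first fixation
print(attention([3.2, 1.1, 2.0]))      # total gaze time
print(popularity([6, 3, 6, 8, 6], 6))  # object 6 chosen 3 times
```

Comparing these three numbers between the grayscale run (2-B) and the color run (2-C) is what lets the analysis separate shape-driven interest from color-driven interest.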
The eye-tracking data are shown in Table 2, which indicates the following. Firstly, the data of objects 3, 4, and 8 change significantly; this is well explained from the point of view of human psychology, since all of them are red, a color that quickly attracts attention. Secondly, objects 5 and 7 drew more of the participants' attention in the grayscale image, indicating that their shapes interested the participants; but this interest dropped quickly when the image became colorful, so their color schemes are obviously failures. Thirdly, object 6 was chosen once in the grayscale image but three times in the color image, which indicates that its yellow scheme is excellent.

Summary and prospect
This paper proposes a task-based method to perceive user intent, in which user intent is divided into three types; one of them is separated out and ignored because it is aimless, meaningless, and vague, thus fundamentally reducing the number of unknown variables and the amount of computation. A versatile and replicable experiment based on computer vision is conducted to perceive user intent from gestures, while eye-tracking equipment is used as a complementary means of obtaining user intent. The results show that the combination of these two techniques can effectively perceive user intent and guide product design.