Attention Behavior Evaluation during Daily Living Based on Egocentric Vision

In this paper, we propose a system for understanding attention behavior in daily living using egocentric vision. As visual attention models, bottom-up attention and top-down attention have been studied extensively. In the human brain, however, there are also functions that respond specifically to faces and body parts. We therefore define a "category-specific attention model" for this specific response function and integrate the three model types of top-down attention, bottom-up attention, and category-specific attention to generate an attention map. We then extract the areas of higher saliency on the attention map. Within the extracted areas, we verify whether divided attention is functioning using two approaches. The first approach detects what is being looked at through object recognition. The second approach focuses on the movement of the hand holding the food, described by a movement orientation histogram calculated from the optical flow. Concretely, we evaluated divided attention with the following four tasks: (i) a serial task of cutting only; (ii) a parallel task of cutting while checking whether a deep pan is boiling over; (iii) the serial task with a cognitive load task; (iv) the parallel task with a cognitive load task. Significant differences were found in both the movement orientation histogram of the hand and the number of deep pan checks depending on the presence of the cognitive load task, and it was confirmed that these measures can distinguish a state of reduced cognitive function from the normal state.


I. INTRODUCTION
Attention function disorder, one type of acquired brain injury, is a disorder in which it is difficult to direct attention to the necessary areas or to concentrate one's attention. In recent years the number of patients with acquired brain injury has been increasing, and according to a survey in the Tokyo metropolitan area, they are estimated to number more than 500,000 people [1]. Rehabilitation is vital for patients with acquired brain injury, and systems in which the rehabilitation is closely connected to daily life, in particular, have been proposed [2][3]. We also propose a form of rehabilitation closely linked to lifestyle, and carry out a rehabilitation program that promotes independence through the activities of cooking and cleaning [3]. Until now, we have conducted reflective rehabilitation in which multiple cameras capture the behavior of the person with the disability; points are awarded and comments written for attention function and executive attention, and these are presented as ratings and comments alongside the actual video. Improvements have been reported in the enthusiasm and attention of the person with the disability towards the rehabilitation [4]. However, this approach required the effort of installing multiple cameras and suffered from problems such as occlusion. For this reason, we aim to establish a system in which attention behavior can be captured using egocentric vision.

Manuscript received September 12, 2016; revised May 15, 2017.
Images obtained from egocentric vision are well suited to observing and analyzing everyday behavior, and visual attention has been studied using egocentric-perspective cameras. For example, Yamada et al. evaluated the performance of visual attention estimation by experimentally analyzing the relationship between a saliency map obtained from egocentric-perspective images and the line-of-sight position [5]. Furthermore, Kase et al. estimated the egocentric perspective from multiple fixed-point cameras and estimated the attention area using a visual acuity distribution and a saliency map [6]. In these studies, the movement of the head and the visual distribution were used as top-down attention on top of a bottom-up attention model. In an actual task, however, what matters is not only the movement of the head but also what the attention is being directed toward. In a cutting task, for instance, attention is consciously directed toward the cleaver and the nearby area.
In this paper, we create an attention function model based on the human visual model and propose a system to evaluate attention behavior during multiple tasks.

II. ATTENTION FUNCTION
Attention, when classified neurophysiologically, can be divided into bottom-up attention and top-down attention. Bottom-up attention is thought to be invoked or redirected by external stimuli: through sensory input acting as cue stimuli, sensory information processing of the target is selectively promoted or suppressed. In contrast, top-down attention works mainly by intentionally directing one's attention to a specific spot of one's choice [7][8][9].
Furthermore, in terms of the characteristics of attention, Kashima et al. have organized the classifications and terms used by a large number of researchers into the following four convenient categories: strength/persistence/range; selectivity/concentration/stability; convertibility/mobility; and controllability.

A. Human Visual Attention Model
Systems proposed so far for a human visual attention model have included bottom-up attention expressed as a saliency map and top-down attention in which attention is intentionally directed through the movement of the head and the application of a spotlight [4][5]. In actual human processing, however, before the top-down and bottom-up attention functions begin to work, the processing functions known as the FFA (Fusiform Face Area), EBA (Extrastriate Body Area), and PPA (Parahippocampal Place Area) instinctively classify the human face, body parts, and background areas in advance [10]. Object detection and spontaneous stimulus detection take place after this classification. In other words, for the human face and body parts, this processing takes place separately from bottom-up and top-down processing in humans. In this paper, we follow this human processing model, define attention to human areas such as the hand area and face area as category-specific attention, and propose a model integrating the three types of processing: category-specific attention, top-down attention, and bottom-up attention.

B. Category-Specific Attention Model
Category-specific attention requires detecting the face and body parts in advance. To detect the face and body parts, we extract the skin area. As we record cooking operations using egocentric vision in this paper, we mainly extract the hand area. The method of extracting the hand area is shown below.
(1) Extraction of the hand area
The method of extracting the hand area first smooths the image as a whole to remove noise, and then extracts the human areas and the hand area through skin-color detection. Concretely, the input RGB color space is converted to the HSV color space, which is comparatively robust to changes in lighting. In this paper, we define the skin-color range as (Hue : Saturation : Value) = (9-15 : 110-200 : 10-200). After extracting the skin color, the image is binarized, and noise is further removed with erosion and dilation processing. Next, labeling processing is performed on the extracted area. At this time, as shown in Fig. 1, the characteristics of a first-person-perspective camera during cooking mean that the hand area is highly likely to appear at the bottom of the screen, so any area mistakenly extracted from the top of the screen is determined not to be a hand area.
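The extraction step above can be sketched as follows; this is a minimal NumPy sketch assuming OpenCV-style HSV ranges (H in 0-179, S and V in 0-255), with the thresholds taken from the values above. The function names and the 50% cutoff for rejecting the upper part of the screen are illustrative; the smoothing, erosion/dilation, and labeling steps (e.g., via OpenCV) are omitted here.

```python
import numpy as np

# Skin-color thresholds from the paper (OpenCV-style HSV ranges).
H_RANGE = (9, 15)
S_RANGE = (110, 200)
V_RANGE = (10, 200)

def skin_mask(hsv):
    """Binarize an HSV image of shape (H, W, 3) into a 0/1 skin mask."""
    h, s, v = hsv[..., 0], hsv[..., 1], hsv[..., 2]
    mask = ((h >= H_RANGE[0]) & (h <= H_RANGE[1]) &
            (s >= S_RANGE[0]) & (s <= S_RANGE[1]) &
            (v >= V_RANGE[0]) & (v <= V_RANGE[1]))
    return mask.astype(np.uint8)

def keep_lower_region(mask, top_fraction=0.5):
    """Discard detections in the upper part of the frame, since the hands
    appear near the bottom of the screen in egocentric cooking footage."""
    out = mask.copy()
    cutoff = int(out.shape[0] * top_fraction)
    out[:cutoff, :] = 0
    return out
```

In practice the binary mask would then be cleaned with morphological operations and split into connected components by labeling.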

C. Bottom-Up Model
For bottom-up processing, Itti et al. established, as a computational model, a saliency map generated from the low-level features of color, intensity, and orientation [11]. In this paper, since attention is directed toward movements in cooking, we construct an active saliency map that adds a motion element to these three basic elements. The processing flow is shown in Fig. 4. In the saliency map, the various feature maps are integrated through a linear combination. Here, we focus on the weighting parameters for each feature map. Normally the weighting parameters are uniform; in that case, as no intentional parameters are specified, the map is considered close to bottom-up attention. Increasing the weighting parameter for a specific feature map, however, means that the intentional judgments of humans are involved. In other words, adjusting the weighting parameters can be considered to incorporate top-down attention. For example, when attention is placed on a moving region, the weighting parameter for the motion feature map is increased. In this paper, attention is placed on the moving areas (hand areas) during the task, and the feature weights are set to (color : orientation : intensity : motion) = (0.1 : 0.1 : 0.1 : 0.7). These parameters were determined empirically. The results of the active saliency map are shown in Fig. 5.
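The weighted linear combination described above can be sketched as follows, using the paper's empirical weights. The min-max normalization is a simplified stand-in for the map normalization in Itti's model, and the function names are hypothetical.

```python
import numpy as np

# Empirical weights from the paper: (color, orientation, intensity, motion).
WEIGHTS = {"color": 0.1, "orientation": 0.1, "intensity": 0.1, "motion": 0.7}

def normalize(feature_map):
    """Scale a feature map to [0, 1] (a simplified normalization step)."""
    lo, hi = feature_map.min(), feature_map.max()
    if hi == lo:
        return np.zeros_like(feature_map, dtype=float)
    return (feature_map - lo) / (hi - lo)

def active_saliency(feature_maps):
    """Linear combination of normalized feature maps, emphasizing motion."""
    sal = np.zeros_like(next(iter(feature_maps.values())), dtype=float)
    for name, fmap in feature_maps.items():
        sal += WEIGHTS[name] * normalize(fmap)
    return sal
```

With uniform weights the same code reduces to the ordinary bottom-up saliency map; raising the motion weight is what makes the map "active" in the paper's sense.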

D. Top-Down Model
In top-down attention processing, an object related to the specified task is normally detected on the screen, recognized as a specific object, and, if it is a target object, its saliency is increased. For example, if the task is to focus on a red area, the red areas are extracted from the screen and their saliency is set high. In this paper, it is necessary to judge whether the respective areas were intentionally looked at during the serial task (cutting) and the parallel task (cutting and deep pan checking). For the cutting work, attention was judged to be placed on the hand area using the category-specific attention model described above and the motion feature map weighting parameters; for the deep pan checking work, the deep pan was detected and recognized on the screen and the saliency near the deep pan was set higher. In other cases, attention was assumed to be placed at the center of the screen owing to the characteristics of egocentric vision.
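The top-down adjustment can be sketched as a simple saliency boost inside a detected object's bounding box. The function name and the gain value are illustrative assumptions; detecting the deep pan itself is assumed to be done by a separate object recognizer.

```python
import numpy as np

def boost_region(saliency, box, gain=2.0):
    """Raise saliency inside a detected object's bounding box (top-down cue).
    box = (row0, row1, col0, col1), half-open NumPy slicing convention."""
    out = saliency.copy()
    r0, r1, c0, c1 = box
    out[r0:r1, c0:c1] *= gain
    return out
```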

E. Attention Model
We hypothesized that, when the three types of attention models discussed so far are linearly integrated, the areas of high saliency are the human attention points. The results are shown in Fig. 6. In the linear integration, the weighting parameters (category-specific attention : bottom-up attention : top-down attention) = (W_SA : W_BU : W_TP) were set so that the human-area recognition is treated separately, as in the normal human model, and W_TP > W_BU. The specific weighting parameters used in this paper were determined empirically.
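The integration can be sketched as follows. Since the exact weight values are not published above, the defaults below are illustrative only, chosen to satisfy W_TP > W_BU; the highest-saliency location is returned as the attention point, following the hypothesis above.

```python
import numpy as np

def integrate_attention(sa_map, bu_map, td_map, w_sa=0.4, w_bu=0.2, w_td=0.4):
    """Linearly integrate category-specific (sa), bottom-up (bu), and
    top-down (td) maps. Weights are illustrative, with w_td > w_bu as
    the paper requires; returns the map and the peak-saliency location."""
    assert w_td > w_bu
    att = w_sa * sa_map + w_bu * bu_map + w_td * td_map
    peak = np.unravel_index(np.argmax(att), att.shape)
    return att, peak
```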

F. Attention Function Evaluation
The Clinical Assessment for Attention (Japan Society for Higher Brain Dysfunction, 2008) and the Disaster Care Assistance Term (Hatta et al., 2001) are methods of evaluating the attention function. Both examinations use simple equipment on a test sheet to evaluate the attention function. However, these examinations are only carried out periodically, making it difficult for them to grasp real-time recognition states and action events, and they cover only non-daily test activities. In this paper, to evaluate the attention function in daily life, we build on [9] and the attention characteristics described in Section II, as shown in Table I, and perform an evaluation using the attention function evaluation indexes. We focus on (4) divided attention and evaluate whether there is conscious attention during parallel tasks. Concretely, during the main cutting work the attention is focused nearby (the left hand is shaped like a cat's paw and consciously presses down in a precise manner). When it is necessary to direct attention elsewhere, attention is shifted there; when returning to the cutting work, we evaluate whether attention is once again placed on the nearby area (a short attention switchover time) or is directed at multiple designated areas (divided attention).

III. ATTENTION EVALUATION METHOD
We focus on the shape of the hands and the movement of the left hand during the cooking task, and evaluate whether divided attention was achieved.

A. The Movement of Hand
Normally, when carrying out the cutting task, the food is held fixed to prevent injury, and the hand fixing the food does not move very much. This is because we recognize the activity as dangerous and consciously hold the food in place. If the hand fixing the food moves significantly, the food is not fixed, and using the cleaver in this state is dangerous. Here, to detect the attention state, we focus on the movement of the fixing hand. Using the hand-area extraction method described in Section II-B and focusing only on the fixing hand, we calculate an orientation histogram from the optical flow of the hand area, and classify the movement into the normal state and others. The orientation histogram was computed with either 16 or 32 orientations, and the two settings were compared. Fig. 7 shows a histogram of the left-hand movement during the respective cutting tasks.
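The orientation histogram can be sketched as follows, assuming a dense optical flow field (computed elsewhere, e.g., with OpenCV's Farneback method) restricted to the hand area. The magnitude threshold for ignoring near-static pixels is an assumption not specified above.

```python
import numpy as np

def orientation_histogram(flow_x, flow_y, bins=16, min_mag=0.5):
    """Quantize optical-flow vectors of the hand area into an orientation
    histogram with 16 or 32 bins, normalized to sum to 1."""
    mag = np.hypot(flow_x, flow_y)
    ang = np.arctan2(flow_y, flow_x) % (2 * np.pi)  # angles in [0, 2*pi)
    keep = mag >= min_mag                           # drop near-static pixels
    idx = (ang[keep] / (2 * np.pi) * bins).astype(int) % bins
    hist = np.bincount(idx, minlength=bins).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist
```

Comparing such histograms across tasks is then what separates the normal fixing movement from larger, riskier hand movements.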

B. Recognizing Shape of Hand
While the movement of the hand gives a general indication of the level of attention to the task, in order to judge the details of a dangerous state we focus on the shape of the fixing hand. One existing method of recognizing hand shape is hand-shape recognition for sign language using a fixed camera [12]. In that setting, the hand appears enlarged and stretched out toward the camera; the hand is extracted using color extraction, and then either HOG (Histogram of Oriented Gradients) features are used, or the shape is identified using 3D Active Appearance Models for gesture recognition [13]. In our study, it is necessary to identify the detailed movement of the fingers; because egocentric vision is used to identify hand shapes during the cooking task, the available viewpoints are limited, and there are not enough hand-shape views for sign-language-style recognition. In this paper, we therefore classified the hand shapes using several methods and features. The methods are k-NN, Support Vector Machine, Logistic Regression, and a three-layer Neural Network. The features are the HOG feature, Bag of Features based on SIFT features, and a DCNN (deep learning). A training sample of the hand area is shown in Fig. 8. Basically, we consider the state dangerous when the fingers are open.
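As one illustration of the feature-plus-classifier pipeline above, the sketch below pairs a crude global gradient-orientation histogram (a much-simplified stand-in for HOG, which uses cells and block normalization) with a 1-NN classifier. It is not the implementation evaluated in this paper, and all names are hypothetical.

```python
import numpy as np

def hog_like(gray, bins=9):
    """A crude HOG-style descriptor: one global histogram of unsigned
    gradient orientations, weighted by gradient magnitude."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi                 # unsigned orientation
    idx = (ang / np.pi * bins).astype(int) % bins
    hist = np.bincount(idx.ravel(), weights=mag.ravel(), minlength=bins)
    n = np.linalg.norm(hist)
    return hist / n if n > 0 else hist

def knn_predict(train_feats, train_labels, feat):
    """1-NN over the two classes (safe vs. dangerous hand shape)."""
    d = np.linalg.norm(train_feats - feat, axis=1)
    return train_labels[int(np.argmin(d))]
```

A real system would instead use proper HOG (e.g., from OpenCV or scikit-image), SIFT-based Bag of Features, or a trained DCNN, as listed above.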

IV. EXPERIMENTS
In this paper, we focused on (4) divided attention from among the functional characteristics of attention. As the experimental method, the four patterns in Table II were applied to each of six healthy persons. The cooking operation was a cutting task. However, since there is a danger that a learning effect would appear if the same method of cutting were repeated, multiple methods of cutting were prepared and one was presented at random before cutting started. For the deep pan check, liquid (in this case, milk) was poured into the deep pan, and the task of confirming that the deep pan was not boiling over was added (parallel task). Furthermore, to reduce cognitive function, a calculation task was added to place load on the brain. The task was to repeatedly subtract one of the values 7, 11, 13, or 17 from 1000. Participants were instructed to voice the calculation results and to keep answering even if they made a mistake. To add stress, they were made to answer at an interval of 3-5 seconds. However, as the memory function was not being evaluated, the previous answer was told to them if they forgot it.

V. EVALUATION AND DISCUSSION
First, the results for the movement of the hand holding down the food are shown in Fig. 9. Fig. 9 is the average orientation histogram for each task.
For both 16 and 32 orientations, we can see that the movement of the hand increases when cognitive load is applied: (i) < (iii) and (ii) < (iv). Focusing on (ii) and (iv), to which the divided attention task was applied, a t-test yielded the significant difference shown in Fig. 10: for both 16 and 32 orientations, p(T<=t) = 0.045, a significant difference. Next, we discuss the identification performance for the hand shape. This was a two-class problem of whether the shape was safe or dangerous, and the image size was normalized to 100 x 100. The training images were cropped manually from the actual cooking footage and numbered approximately 200; the test set used for the DCNN consisted of 150 images. The identification performance is shown in Table III. The best result, 75%, was obtained with the DCNN and Bag of Features; with the HOG feature, which focuses on edge gradients, k-NN was best at 59.5%. One cause of reduced accuracy was that some images in the safe state contained bent fingers that appeared open, and such images were misdetected.
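The t-test above can be sketched as follows; it is not stated whether a paired or unpaired test was used, so this shows Welch's unpaired t statistic as one possibility (in practice one would use a statistics library such as scipy.stats).

```python
import numpy as np

def welch_t(a, b):
    """Welch's two-sample t statistic and approximate degrees of freedom."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df
```

The p-value p(T<=t) then follows from the t distribution with df degrees of freedom.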
Next, the operation of carefully checking the deep pan was detected as divided attention, and the number of checks per minute is shown in Table IV. A check of the deep pan was counted when the deep pan appeared within the area of attention. For every participant, the number of checks decreased when cognitive load was applied to the parallel task compared with the normal state. In other words, in many cases they forgot to direct their attention toward the deep pan. Furthermore, there were cases where so much attention was paid to the task that checking of the deep pan was neglected, and in two cases the deep pan boiled over. For the two participants whose deep pan boiled over, the time until the first check was long, and since there were more opportunities to check after the pan had boiled over once, the rate of checks is thought not to have changed compared with the other participants. In fact, when asked after the experiment why they did not look until the deep pan had boiled over, they answered that full attention was placed on the cooking task or the calculation problem and they forgot to look at the deep pan. Finally, for the parallel task, the significance of the difference between the conditions with and without cognitive load was verified with a t-test. The result is shown in Fig. 11: p(T<=t) = 0.045, demonstrating a significant difference.

VI. FUTURE WORK
Moving forward, we plan to carry out recognition that includes finger patterns as part of hand-shape recognition and to increase the accuracy of the identification system. With an identification rate of 90% or more, a more detailed attention evaluation can be performed by using hand-shape information in addition to hand-area movements. Furthermore, while this work focused only on divided attention, we plan to evaluate other aspects of the attention function as well as functions related to it, such as executive attention. We are currently conducting experiments focusing on the daily activity of cleaning in addition to cooking. In this work, the identification rate for the hand shape was approximately 50-60%, because the hand shapes in the normal state and the dangerous state have similar regions. As an improvement, it is believed that identification can be made more accurate by adding information such as the bending and length of the fingers in addition to the shape of the hand as a whole.

VII. CONCLUSION
In this work, using an attention map that integrates category-specific attention, top-down attention, and bottom-up attention with egocentric vision, we detected the visual attention area. We proposed a method of evaluating divided attention through recognition of nearby movements and of objects in the visual field. We evaluated whether divided attention could be achieved, in terms of the two measures of hand-area movement and deep pan checking, with and without cognitive load for six healthy people, and a significant difference was obtained. With this approach, we confirmed that a state of reduced cognitive function can be distinguished from the normal state.

ACKNOWLEDGMENT
This work was supported by JSPS KAKENHI Grant Number 15K0036.

Figure 1. Extraction of the hand area

(2) Segmentation of the left and right hands
As a method of dividing the area extracted in (1) into left and right hands, and on the presupposition that the two extracted areas shown in Fig. 2(a) do not cross, the area on the left side is taken as the left hand and the area on the right side as the right hand. Furthermore, as shown in Fig. 2(b), for cases other than these two areas, an area on the left side of the image is taken as the left hand and an area on the right side of the image as the right hand.

Figure 2. Segmentation of the hand area
Figure 3.

Figure 4. Process flow of active saliency map

Figure 5. Active saliency map for cutting

Figure 6. Process flow of a gaze area
Figure 8. The state of the left hand during cutting: (a) safe state, (b) dangerous state

Figure 10. The significant difference of the hand movement

Figure 11. The significant difference of the deep pan checks

TABLE I.
ATTENTION FUNCTION INDEXES

TABLE II.
THE EXPERIMENTAL TASKS
Experiments: Task contents
(i) Serial task: cutting task
(ii) Parallel task: cutting task, checking a deep pan
(iii) Serial task + cognitive load task: cutting task, calculation task
(iv) Parallel task + cognitive load task: cutting task, checking a deep pan, calculation task

TABLE III.
RESULTS OF THE SHAPE OF HANDS

TABLE IV.
THE NUMBER OF DEEP PAN CHECKS AND WHETHER THE DEEP PAN BOILED OVER