Coarse-to-fine online learning for hand segmentation in egocentric video

Hand segmentation is one of the most fundamental and crucial steps for egocentric human-computer interaction. The egocentric view brings new challenges to hand segmentation, such as unpredictable environmental conditions. The performance of traditional hand segmentation methods depends on abundant manually labeled training data. However, these approaches do not capture all the properties of egocentric human-computer interaction because they neglect the user-specific context: it is only necessary to build a personalized hand model of the active user. Based on this observation, we propose an online-learning hand segmentation approach that requires no manually labeled training data. Our approach consists of top-down classifications and bottom-up optimizations. More specifically, we divide the segmentation task into three parts: a frame-level hand detection, which detects the presence of the interactive hand using motion saliency and initializes hand masks for online learning; a superpixel-level hand classification, which coarsely segments hand regions from which stable samples are selected for the next level; and a pixel-level hand classification, which produces a fine-grained hand segmentation. Based on the pixel-level classification result, we update the hand appearance model and optimize the upper-layer classifier and detector. This online-learning strategy makes our approach robust to varying illumination conditions and hand appearances. Experimental results demonstrate the robustness of our approach.


Introduction
Recently, wearable computers with embedded first-person cameras, such as augmented reality headsets and smart glasses, have been growing rapidly and urgently require interaction patterns suited to egocentric vision. One feasible option is to take the user's hand as the medium for human-computer interaction. The wearable computer interprets hand position, posture, and gesture as commands and produces appropriate responses to the user. Estimating these hand properties must be preceded by reliable hand detection and segmentation from the egocentric video. The egocentric view brings opportunities for hand detection and segmentation. Since the video is recorded from a first-person perspective, occlusions of the attended hand are less likely, and the user tends to concentrate on the region in the center of the view field. Meanwhile, egocentric video also presents new challenges, including rapid changes in illumination, significant camera motion, and background clutter.
Great efforts have been made in detecting the user's hand from egocentric video, especially in pixel-level detection [1][2][3][4][5][6][7]. Most of these methods rest on the implicit assumption that the hand is present in the video all the time. However, the assumption fails in many situations in which the hand is not used, such as before or after the human-computer interaction. Subsequently, some cascade detection methods were put forward to remove this assumption by checking for hand presence before performing pixel-by-pixel classification [8][9][10]. However, these approaches rely on a large training set containing a broad variety of data collected from multiple users under diverse illumination conditions. Hand appearance varies greatly across users and environmental conditions. Not only does such a training set cost a lot of manual effort in data collection and labeling, but it also does not guarantee that the approach adapts to every hand appearance and environmental condition.
To address this issue, we propose a method for unsupervised hand detection and segmentation in egocentric video. In our approach, frame-level hand presence or absence is determined from motion saliency, which is characteristic of the egocentric view. By combining motion and appearance properties, we obtain unsupervised labeling results for the superpixel-level hand classification. Then, hand pixel samples are extracted according to the confidences of the superpixels and used to train a pixel-level classifier, which produces a fine-grained hand segmentation. To be robust to varying environmental conditions, we constantly update the classifier and detector using a bottom-up optimization method. We test our method on challenging datasets, and the experimental results show that it robustly produces precise segmentations, as illustrated in Fig. 1.
In summary, this paper makes three main contributions. First, we propose a frame-level hand presence detection method that utilizes hand motion saliency in egocentric human-computer interaction, which reduces the false positive rate for the final target of pixel-level hand segmentation. Second, we present a top-down cascaded classification method that segments the hand hierarchically at the frame, superpixel, and pixel levels to reduce computational cost, and in which the classifiers are trained on-the-fly so as to be robust to diverse users. Third, we analyze and optimize the online-trained classifiers with a bottom-up method, which makes the hand segmentation robust to varying environmental conditions.

Related work
Egocentric vision is an emerging area in computer vision. According to the survey in [11], the most commonly explored objective of egocentric vision is object recognition and tracking. Furthermore, hands are among the most common objects in the user's field of view, and their proper detection, localization, and tracking could be a main input for other objectives, such as gesture recognition, understanding hand-object interactions, and activity recognition [5][12][13][14][15][16][17][18][19][20]. Recently, egocentric pixel-level hand detection has attracted more and more attention.
Most of the proposed methods are based on pretraining classifiers using abundant manually labeled data. Li and Kitani [1,4] propose a pixel-level hand detection method using color- and texture-based features. Zhu et al. [2] propose a method which uses local hand shape information in the training data and enforces shape constraints in the estimation. Serra et al. [3] integrate temporal and spatial consistency to complement the appearance features. Betancourt et al. [21] identify the left and right hands and model hand occlusions to improve the accuracy of hand segmentation. These methods improve the precision of pixel-level hand detection but still rest on the implicit assumption of hand presence in all frames. This assumption is not always true, since the hand may be absent before or after the egocentric human-computer interaction.
Some of the proposed methods tackle the hand segmentation task sequentially. Betancourt [8,22] proposes a sequential classifier consisting of a hand detector and a hand segmentator. Betancourt et al. [9] extend an SVM-based hand detector with a dynamic Bayesian network. These methods reduce the false-positive rate of hand segmentation but still need offline training, which requires manually labeled data. Kumar et al. [10] describe an on-the-fly hand detection training method which is initialized by a calibration gesture performed by the user. This simple preprocessing step saves a great deal of manual labeling but may not be friendly to the user. Zhu et al. [23] propose a two-stage detector which first generates bounding box proposals and then evaluates the proposals with a convolutional neural network. Moreover, all of these methods are still challenged by varying environmental conditions since they do not have any model updating strategy.
Fig. 1 Results of the proposed method in challenging cases. Panels a-g show hands with motion blur, skin-colored background, overexposed frames, hands in high-contrast shadow, underexposed frames, hands interacting with objects, and hands in varying poses
In this paper, we present our fine-grained hand segmentation method, which leverages an unsupervised online learning pattern to robustly segment the hand at the pixel level from egocentric video.

Method
In this section, we discuss an unsupervised online learning method for fine-grained hand segmentation based on top-down classification and bottom-up optimization. By learning hand appearance and motion features on-the-fly, we segment the hand with a precise boundary from egocentric video captured under varying illumination conditions. Following the top-down strategy, we divide the classification task into three parts: frame-level detection, superpixel-level classification, and pixel-level classification. Before scanning pixel by pixel, we first estimate whether a frame contains a hand and whether a region of the frame contains hand pixels. By doing this, we reduce false positives and initialize samples for further online training. After that, we learn features from the labeled regions and train the two-level classifiers. To make sure the classifiers adapt to varying hand appearance, we update the hand appearance model and optimize the upper-layer classifier and detector. Figure 2 shows the framework of our method.

Ego-saliency-based hand detection
Before scanning the frame pixel by pixel, the first task is to detect the presence of the hand from a frame-level perspective and then automatically initialize hand masks for subsequent classifications. Motion-based methods [24][25][26] have been proposed for background subtraction with a freely moving camera. In general, it is difficult to determine whether the hand is present or not without prior information about the environment or the appearance of the hand. Fortunately, the egocentric interaction scenario provides many constraints that are suggestive of the hand's presence.
From the point of view of an interaction cycle, the motion of the hand in the egocentric view has a periodic character. In the interaction preparatory phase, the whole hand and part of the arm together gradually enter the view field. During the interaction, the whole hand moves around the center of the view field, and the fingers are likely to move more vigorously than the palm and arm, for example when making a gesture. When the interaction is finished, the whole hand and part of the arm together gradually move out of the view field. We observe that the preparatory phase is a natural bootstrap, since the hand motion is more salient than that of other regions and the hand is unlikely to enter the view field from the top side.
Based on this observation, we define an ego-saliency metric E_f, consisting of spatial and temporal terms, to estimate how likely the hand is to be present in frame f. The higher the ego-saliency value, the more likely the hand is present.
where the first term is the spatial cue, which requires the hand motion to be salient and to occur in the right position. The second term is the temporal cue, which requires the hand motion to increase consistently over adjacent frames. W and H denote the width and height of the frame, respectively. M_f(i, j) is the motion saliency of the pixel at position (i, j), calculated from the optical flow map using the method of [27]. As shown in Fig. 3d, we set a non-interactive border of width W and height h at the top of the frame; we set h to one tenth of the frame height in the experiments. We use a distance-based exponential weight to require that hand motion occur away from the non-interactive border, where λ is the factor controlling the weight response. The farther a pixel is from the non-interactive border, the greater the weight it is assigned. N_t is the number of non-zero values in the motion saliency map M_t. The consecutive motion increment is observed by a sign function sgn(·) based on the number of pixels having salient motion in n adjacent frames.
After detecting the presence of hands, we initially segment the moving hand regions based on motion and appearance clustering. Using dual TV-L1 optical flow [28], we extract dense motion flow fields and obtain a motion map. We cluster the motion map into k groups of regions using K-means, with k set to 10 in the experiments. The motion clustering naturally divides the foreground and skin-colored background into different regions, since they usually move differently. Figure 3d shows the regions obtained from motion clustering and the non-interactive border. With the help of the non-interactive border, we easily select a set of background regions {R_BG}, namely those which intersect the border. The remaining unknown regions are further determined based on appearance clustering. According to Eq. (2), we calculate the likelihood H_f(R_i) that an unknown region R_i belongs to the hand based on the similarity between R_i and the background regions {R_BG}. S(·,·) is a function calculating the color histogram similarity of two regions. Then, we pick out the hand regions, which have low color similarity with all the background regions. Figure 3 shows the initialized hand mask generated using motion and color clustering.
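As an illustration of how the two cues described above could interact, the following is a minimal sketch under stated assumptions and is not the paper's Eq. (1). It assumes a saturating exponential weight on the row distance from the border, a temporal term equal to the share of adjacent frame pairs in which the salient-pixel count grows, and a product as the combination. The function names and the default value of `lam` are hypothetical.

```python
import math

def spatial_term(M, h, lam=0.01):
    """Border-weighted mean motion saliency of one frame.

    M is an H x W list of lists with saliency values in [0, 1]; rows
    0..h-1 form the non-interactive border at the top of the frame.
    """
    H, W = len(M), len(M[0])
    total = 0.0
    for i in range(h, H):                      # border rows contribute nothing
        # Assumed weight: grows with distance from the border, saturates at 1.
        w = 1.0 - math.exp(-lam * (i - h + 1))
        total += w * sum(M[i])
    return total / (W * H)

def temporal_term(counts):
    """Share of adjacent frame pairs in which the salient-pixel count grows.

    counts holds N_t for the recent frames and needs at least two entries.
    """
    sgn = lambda x: (x > 0) - (x < 0)
    steps = [sgn(b - a) for a, b in zip(counts, counts[1:])]
    return max(0.0, sum(steps) / len(steps))

def ego_saliency(M, h, counts, lam=0.01):
    # Assumed combination: product of the spatial and temporal cues.
    return spatial_term(M, h, lam) * temporal_term(counts)
```

With this form, motion confined to the border rows or a shrinking salient area yields a score of zero, which matches the qualitative behavior the text asks for.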
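The region selection described above can be sketched as follows. This is only an illustration: it assumes the clustering result is available as a 2-D map of region ids, uses histogram intersection as a stand-in for S(·,·), and scores a region by one minus its highest similarity to any background region. None of these choices is confirmed by the text, and the function names are hypothetical.

```python
def border_regions(labels, h):
    """Ids of regions that touch the top h rows (the non-interactive border)."""
    return {lab for row in labels[:h] for lab in row}

def hist_intersection(p, q):
    """Similarity of two normalized histograms, in [0, 1] (assumed S)."""
    return sum(min(a, b) for a, b in zip(p, q))

def hand_likelihood(hist, background_hists):
    """High when the region looks unlike every background region."""
    return 1.0 - max(hist_intersection(hist, b) for b in background_hists)
```

A region whose color histogram overlaps strongly with any border-touching region receives a low likelihood, so skin-colored background that moves with the scene is rejected.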

Online training two-level hand classifiers
As the interaction preparatory phase ends, the hand motion attenuates and may eventually become much less salient, for example when only the fingers move to make a gesture while the palm holds still. Moreover, motion-based segmentation usually produces results with blurry and noisy boundaries around objects. Therefore, appearance features are more discriminative than the motion cue for fine-grained hand segmentation during the interaction phase (Fig. 4).
Here, we describe a coarse-to-fine hand segmentation method that learns the appearance features of hand and background on-the-fly. Based on the initial set of hand masks {B_t} obtained from the frame-level detection, we first train a superpixel-level hand classifier so as to segment frames into superpixel regions, from which stable pixel samples are selected. Then, we use the selected pixel samples to train a pixel-level classifier which produces fine-grained hand segmentations.
In the frame-level detection step, we obtain a coarse segmentation of the hands using motion and ego-saliency cues. It provides the initial ground truth labels of hand regions for superpixel-level training. We over-segment the most recent n consecutive frames {F_t} into superpixels using a modification of the simple linear iterative clustering (SLIC) algorithm [29]. The K-means clustering of the motion map yields a binary segmentation separating the foreground from the background. However, the K-means segmentation has coarse boundaries which are sometimes inconsistent with those of the superpixels. To select good samples for superpixel-level training, we initialize a label map based on the portion of positive pixels in each superpixel and refine it by energy optimization. Figure 5 illustrates the process of superpixel sample selection. Given a binary mask from the K-means segmentation, we assign the superpixels having 80% positive pixels as foreground candidates and their dilated superpixels as background candidates. The candidates are further selected based on confidence score calculation and energy optimization.
We define a confidence score to describe how much more similar a superpixel is to its homogeneous neighbors than to its heterogeneous neighbors. For a candidate superpixel, we calculate its confidence score as in Eq. (3). After normalization, we get a score map as shown in Fig. 5d.
where Ω_i^- and Ω_i^+ are sets containing samples collected from the neighborhood of superpixel i; the superscript "-" indicates that the samples have a different class label from superpixel i, while "+" indicates the same class label; and Z is a normalization factor. h_SIFT and h_RGB denote the SIFT and RGB histograms, respectively, D(h_i, h_j) is the Chi-square distance between the histograms h_i and h_j, and c_k is a constant to normalize the k-th descriptor.
We take the score as a label and optimize it for each superpixel using the Ising model [30]. The foreground and background candidates constitute a foreground system and a background system, respectively. The energy of each system consists of the affinities and consistencies of superpixels with their neighborhood within the system. Color and texture are useful cues, since the foreground tends to have a different appearance from the background behind it. Therefore, the affinity between a superpixel and its neighbor is computed from the Chi-square distance between their color and texture histograms. Higher affinity indicates stronger consistency in belonging to the same class. Accordingly, we optimize the labels based on an energy which encourages coherence among superpixels of similar appearance. For a superpixel, we invert its label and calculate the energy change caused by the inversion. The label inversion is directly accepted if the system energy increases. Otherwise, the inversion is further judged by an acceptance function. This routine is executed repeatedly until the system reaches equilibrium, at which point the superpixel labels are optimized.
Given a labeled region, we calculate the energy of each superpixel within it and accumulate these energies to describe the system energy. For a superpixel, we first compute an affinity score and a label consistency score for each pair of adjacent superpixels. After normalizing the scores, we calculate their correspondence, which is proportional to the superpixel energy. Based on the exponential correspondence, we obtain the superpixel energy. After that, we compute the system energy by accumulating the superpixel energies, where Ω_i^o is the neighborhood of superpixel i within the system, S(i, j) is the affinity, and L(i, j) is the label consistency between two adjacent superpixels.
To describe the appearance of a superpixel, we compute the histograms of SIFT features and RGB values from the image area it occupies. Considering that appearance features tend to be coherent within a local region, we use the distance between two adjacent superpixels to restrict the contribution of the neighboring superpixel: a larger distance indicates a smaller contribution. Moreover, superpixels near the system boundary tend to be unstable. Hence, the distance from the superpixel to the boundary is also a term of the affinity score. Based on these four descriptors, the affinity score S(i, j) is defined for the superpixel pair (i, j) as in Eq. (6).
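The foreground-candidate rule can be sketched as below. The sketch assumes the superpixel segmentation and the binary K-means mask are given as equally sized 2-D lists, and it reads "80% positive pixels" as "at least 80%", which is an interpretation. The function name is hypothetical.

```python
from collections import defaultdict

def foreground_candidates(superpixels, mask, min_ratio=0.8):
    """Ids of superpixels whose share of positive mask pixels is >= min_ratio."""
    pos, area = defaultdict(int), defaultdict(int)
    for sp_row, m_row in zip(superpixels, mask):
        for sp, m in zip(sp_row, m_row):
            area[sp] += 1                 # pixel count of the superpixel
            pos[sp] += 1 if m else 0      # positive pixels inside it
    return {sp for sp in area if pos[sp] / area[sp] >= min_ratio}
```

Background candidates would then be drawn from the superpixels adjacent to this set, which the sketch leaves out.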
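The ingredients named above can be sketched in code. The Chi-square distance below uses the common symmetric definition with a factor of one half. The `confidence` function is an assumed form, not Eq. (3) itself: it averages the distance to different-label neighbors, subtracts the average distance to same-label neighbors, and divides each descriptor's contribution by c_k. Both neighbor lists are assumed to be non-empty, and the final normalization by Z is omitted.

```python
def chi_square(h1, h2, eps=1e-12):
    """Symmetric Chi-square distance between two histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))

def confidence(desc_i, same, diff, c=(1.0, 1.0)):
    """Assumed confidence of superpixel i before normalization.

    desc_i: tuple of histograms for superpixel i, e.g. (h_SIFT, h_RGB).
    same / diff: non-empty lists of such tuples for the neighbors with the
    same / a different class label.
    """
    score = 0.0
    for k, c_k in enumerate(c):
        d_diff = sum(chi_square(desc_i[k], n[k]) for n in diff) / len(diff)
        d_same = sum(chi_square(desc_i[k], n[k]) for n in same) / len(same)
        score += (d_diff - d_same) / c_k
    return score
```

A superpixel that resembles its same-label neighbors and differs from the others receives a positive score under this form.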
where A(i, j) is the Euclidean distance between the adjacent superpixel centers, and B(j) is the Euclidean distance from superpixel j to the system boundary.
We invert the label of superpixel i and obtain its updated label consistency score L'(i, j) with the adjacent superpixel j. The energy of the system is updated correspondingly. Then, we compute the increment ΔE of the system energy. Based on the increment, we decide whether to accept the label inversion.
where label^{-1}(i) is the inverted value of label(i), β is a weight factor, and R is a pseudo-random number drawn from a uniform distribution. In short, the label inversion is accepted if it increases the correspondence between the superpixel's appearance similarity and its classification type.
Given the labels of all superpixels, we initially train the superpixel-level classifier on appearance features which consist of color and gradient statistics in each superpixel. The classifier selects the superpixels belonging to hands together with confidence values. Note that the motion cue may eventually become much less discriminative. Therefore, we apply SLIC to the color frame to get the superpixel segmentation in the subsequent online training. Benefiting from the relatively accurate boundaries produced by SLIC, the superpixel-level segmentation improves on the result of the frame-level detection. However, a superpixel with a low confidence value may still partially contain the hand region. Treating such superpixels as background and selecting negative pixel samples from them would cause misclassification. Therefore, we propose a sample selection strategy for pixel-level classifier training.
For pixel-level training, we select samples from the superpixels based on their classification confidence values. The negative samples are selected from the superpixels having confidence smaller than a threshold value T_U. The positive samples are selected from the candidate superpixels which have confidences greater than a threshold value T_L. The unstable superpixels having confidences between T_U and T_L are abandoned as unknown. Moreover, the higher the confidence that a superpixel belongs to the hand region, the more positive samples are extracted from it. Based on the properties of superpixels generated by the SLIC method, we suppose that pixels near the center of a superpixel are more likely to be in the same class as the superpixel. Therefore, we divide the pixels of candidate superpixels into training and unknown groups based on the distance between the pixel and the superpixel's center. By combining the area A_sp and confidence W_sp of a superpixel, we define the distance threshold T_sp as in Eq. (8). Then, the candidate superpixels are eroded based on the threshold T_sp. The pixels removed by the erosion are put into the unknown group, while the remaining pixels near the center are selected as positive training samples.
Following the previous pixel-level segmentation approach [1], we extract color features from the RGB, HSV, and LAB color spaces and texture features using HOG [31]. Using a pool of feature combinations and random forest classifiers [32], we classify the unknown pixels and obtain fine-grained hand segmentations. After that, we also get a more precise description of the confidence that a superpixel belongs to the hand region: the confidence values of superpixels are updated with their portion of positively labeled pixels. Then, we re-train the superpixel-level classifier using the superpixels having high confidence values. By doing this, we update the hand and background models on-the-fly, which makes the method more robust to a varying environment. Note that the two-level classifiers select the pixels that are most likely to belong to the hands. The motion cue becomes salient and discriminative again when the interactive hand gradually moves out of the view field. Therefore, we still have to monitor hand absence with the aid of the egocentric saliency metric, to which a confidence term is added, as described in Eq. (9).
where the first term denotes the motion saliency, the second term observes the consecutive motion decrement, the third term is the average superpixel confidence of frame f, and m is the number of superpixels having confidence greater than 0.5.
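The label-inversion routine can be sketched as below. The rule that an inversion is accepted outright when the system energy increases follows the text. The fallback test, exp(β·ΔE) > R, is a Metropolis-style assumption standing in for the acceptance function, whose exact form is not reproduced here. The `energy` callable, the function names, and the default `beta` are hypothetical.

```python
import math
import random

def accept_flip(delta_e, beta=1.0, rng=random):
    """Accept directly if the energy increases, otherwise probabilistically."""
    if delta_e > 0:
        return True
    # Assumed acceptance function: compare exp(beta * delta_e) with a
    # uniform pseudo-random number R in [0, 1).
    return math.exp(beta * delta_e) > rng.random()

def sweep(labels, energy, beta=1.0, rng=random):
    """One pass of tentative label inversions; labels are +1 or -1."""
    for i in range(len(labels)):
        before = energy(labels)
        labels[i] = -labels[i]                     # tentative inversion
        if not accept_flip(energy(labels) - before, beta, rng):
            labels[i] = -labels[i]                 # rejected: restore label
    return labels
```

Calling `sweep` repeatedly until no label changes would correspond to running the routine to equilibrium. A larger `beta` makes energy-decreasing inversions rarer.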
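The confidence-based split can be sketched as follows, keeping the text's naming in which negatives lie below T_U and positives above T_L. The threshold values, the function names, and the linear rule for the number of positive samples are illustrative assumptions. The erosion by T_sp is not shown because it depends on Eq. (8).

```python
def split_superpixels(confidences, t_u=0.3, t_l=0.7):
    """Partition superpixel ids by their hand confidence.

    confidences maps a superpixel id to its confidence in [0, 1].
    """
    pos = [k for k, w in confidences.items() if w > t_l]
    neg = [k for k, w in confidences.items() if w < t_u]
    unknown = [k for k, w in confidences.items() if t_u <= w <= t_l]
    return pos, neg, unknown

def positive_sample_count(confidence, max_samples=100):
    """More confident superpixels contribute more positive samples."""
    return int(round(confidence * max_samples))
```

Only the `pos` and `neg` groups feed the pixel-level classifier, so uncertain superpixels cannot inject mislabeled samples.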

Evaluation to update classifier
In the evaluation stage, we use a bottom-up strategy: we evaluate the bottom classifiers and feed the loss back to the upper levels. The superpixel-level classifier is directly affected by the precision of the pixel-level classification, since the confidence of a superpixel is calculated from the pixel classification results. In the initialization step, we consider all frames of a sequence to contribute equally to the pixel-level classifier. Since the background changes constantly, the appearance of the hand varies considerably and may become different from previous frames, for example when the hand enters a shadowed place. Therefore, the history frames should contribute differently, and we calculate weights W_t for the n history frames of the training set {F_t} to make their contributions more rational based on the error of the pixel-level classifier. The weight W_t consists of a local metric W_t^L and a global metric W_t^G.
Given a labeled training set {F_t}, we train a collection of classifiers {C_t}. Using the classifier C_t, we get the confidence value W_t^{sp_k} that a superpixel SP_k belongs to the hand regions in the current test frame f. The local metric W_t^L requires that the result of classifier C_t have low variance with respect to the other classifiers of the set. Therefore, we calculate the loss of using training data from frame t based on the difference between the classification results of test frame f produced by C_t and by the average classifier C_{F_t}.
where m is the number of superpixels in the current test frame f and n is the number of frames in the training set {F_t}.
From a global point of view, we estimate the loss of using training data from frame t based on the difference between the classification result of frame f produced by C_t and the classification result of frame f-1 produced by the previous classifier C_{F_p}^{f-1}, which is trained using data from {F_p} under the constraint of weight W_p.
Generally speaking, a precise classification segments the hand region from the background with a clear boundary while remaining smooth and flat inside the region. We calculate the gradient map of the classification probability map and define three gradient-based constraints to evaluate the global loss. First, the magnitude of the biggest contour in the gradient map should be large. Second, the gradient at the junction of two superpixels should be small; that is, the number of contours in the gradient map should be small. Third, the shapes of the biggest contours in the current and previous gradient maps should be similar. Based on these three constraints, we calculate a global loss function having terms based on the average magnitude G_f of the biggest contour, the number N_f of contours, and the shape S_f of the biggest contour in the classification result of test frame f.
where the superscript denotes the classifier that has been used, C_t or C_{F_p}^{f-1}, and D(·,·) is a function estimating the difference between two shapes.
By combining W_t^L and W_t^G, we evaluate the effectiveness of training samples from frame t not only in local superpixels but also in the global hand region. Based on the weight W_t, we optimize the pixel-level classification result, which is used to update the superpixel-level classifiers. Note that the terms of the weight function are normalized before combination.
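The idea behind the local metric can be sketched as follows. The sketch takes the "average classifier" to be the per-superpixel mean of the confidences produced by all C_t, measures each classifier's loss as its mean absolute deviation from that average, and turns losses into weights by normalized inversion. The absolute difference and the inverse mapping are assumptions, and the function names are hypothetical.

```python
def local_losses(conf):
    """conf[t][k]: confidence of superpixel k in frame f under classifier C_t.

    Returns one loss per classifier: its mean absolute deviation from the
    average confidence over all n classifiers.
    """
    n, m = len(conf), len(conf[0])
    mean = [sum(conf[t][k] for t in range(n)) / n for k in range(m)]
    return [sum(abs(conf[t][k] - mean[k]) for k in range(m)) / m
            for t in range(n)]

def weights_from_losses(losses, eps=1e-6):
    """Assumed mapping: lower loss gives a higher weight; weights sum to 1."""
    inv = [1.0 / (eps + loss) for loss in losses]
    total = sum(inv)
    return [v / total for v in inv]
```

A history frame whose classifier disagrees with the others, for example one recorded under very different lighting, ends up with a smaller weight.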

Results and discussion
We evaluate our cascaded hand segmentation method on two types of egocentric data which correspond to different levels of human-computer interaction. In the first type, both hands are visible with little variation in gesture and interact with objects, such as holding a cup. In the second type, the hands perform gestures, such as typing on a virtual keyboard, without directly interacting with any object. We first compare our cascaded hand segmentation with state-of-the-art methods and analyze the validity of our framework. Then, we illustrate that egocentric human-computer interaction can benefit from our hand segmentation approach.

Evaluation on benchmark dataset
To compare with baseline methods, we first test our approach on the benchmark dataset CMU EDSH [1], which consists of egocentric videos containing diverse indoor and outdoor illumination and hand poses. The videos were collected by a subject wearing a head-mounted standard color camera and passing through scenes with varying illumination, including the extreme cases of underexposure and overexposure, at a resolution of 720p and a speed of 30 FPS. Besides the change of skin color, the hand pose also changes while the subject performs daily activities. The dataset contains 19,788 frames and 743 ground truth labels from three video clips: EDSH1, EDSH2, and EDSHK. EDSH1 and EDSH2 contain bare hands with a few intentional gestures, while EDSHK records hands interacting with objects in a kitchen. In order to match the scale of the ground truth, we downsample the frames from 1280 × 720 to 640 × 480 pixels. We conduct quantitative and qualitative evaluations on the benchmark dataset to compare our detection performance with the prior art.
In Table 1, we compare the F-score of our method with that of three state-of-the-art methods. Li and Kitani [1] predict hand pixels using color and gradient features based on random forest classifiers. Zhu et al. [2] extend the pixel-level method by introducing shape information of pixels based on structured forests. Baraldi et al. [5] use a temporal and spatial coherence strategy to improve the hand segmentation of the pixel-level method. These methods use video clip EDSH1 as training data and are tested on the remaining clips, EDSH2 and EDSHK. The corresponding F-scores are taken from their papers. Since our approach uses an online training strategy, we report our F-scores on all the clips. As the F-scores in Table 1 show, our approach improves the detection precision in most experiments. We implemented our algorithm and tested the non-optimized code on an Intel-based PC with an i7-4500U CPU running at 1.80 GHz. Most of the time is spent on superpixel sample selection and online training. The time cost can be reduced by decreasing the number of samples used in all stages. In Table 2, we compare the running time of our method with that of the three state-of-the-art methods. Figures 6 and 7 show a visual comparison of test images overlaid with the detection results provided by their papers and by our method in the challenging cases of extreme lighting conditions and background color.
In Fig. 6, the test frames of EDSH2 were taken under extreme lighting conditions: overexposure, underexposure, and high-contrast shadows. Parts of the hand blend into the background because of strong or insufficient light, while the color and texture of the other parts are heavily faded. Li and Kitani [1] fail to give good predictions in these cases. In contrast, our approach has much higher detection precision. Our continuous online training strategy makes the classifiers robust to varying illumination, even in extreme conditions. Figure 7 shows the case of a background sharing similar color and texture with the hand. The methods of both Li and Kitani [1] and Zhu et al. [2] fail to distinguish the hand from the textureless, skin-colored background. In contrast, our approach gives more correct predictions in this case. By using online learning, our method gradually updates the hand and background models so that the classifiers are more robust to a varying scene.
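For reference, the F-score of a binary hand mask against a ground truth mask can be computed as in the sketch below, using the standard definition F = 2PR / (P + R) with precision P and recall R. The exact evaluation protocol behind Table 1 may differ in details such as averaging over frames, and the function name is hypothetical.

```python
def f_score(pred, truth):
    """F-score of a binary mask against ground truth (2-D lists of 0/1)."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1          # hand pixel found
            elif p and not t:
                fp += 1          # background labeled as hand
            elif t and not p:
                fn += 1          # hand pixel missed
    if tp == 0:
        return 0.0
    # Equivalent to 2PR / (P + R) with P = tp/(tp+fp), R = tp/(tp+fn).
    return 2 * tp / (2 * tp + fp + fn)
```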

Evaluation on egocentric application
Fingertip position is one of the most practical pieces of information for egocentric vision-based human-computer interaction, for example when the user inputs commands via a virtual keyboard. Fingertip detection directly benefits or suffers from the precision of the hand segmentation. Therefore, we use a simple fingertip detection method to further evaluate our hand segmentation method from a practical point of view. As shown in Fig. 8, we evaluate the applicability of the hand segmentation in a virtual keyboard interaction application. The ready gesture, with the index finger up, triggers the virtual keyboard to show up. Then, the egocentric view field is divided into grids, each of which corresponds to a key. In the experiment, we divide the view field into 5 × 7 grids, which provides a relatively comfortable interaction scale for the user. Dwelling the fingertip on a grid activates the key input, and the corresponding position lights up. We extract the tip position of the index finger from the hand segmentation result by convex hull analysis. The video was recorded by a subject wearing a head-mounted Logitech camera in an indoor scene at a resolution of 640 × 480 and a speed of 30 FPS. The test video contains 1439 frames in total and covers the whole interaction procedure, including the hand moving into the view field, the ready gesture showing up, the fingertip hovering and moving through keys, and the hand moving out of the view field. Figure 8a-c illustrates the robustness of our hand segmentation-based fingertip detection. Figure 8d shows a failure case caused by noise in the segmentation, which could be removed by extra postprocessing. Figure 9 shows the performance of our hand segmentation method in the virtual keyboard interaction application. The red and blue dots are, respectively, the detected fingertip position and the ground truth in the interaction frames. We can see that the detected fingertip keyboard position is stable, with little jitter. The fingertip detection accuracy is 0.9867 over the test video. Figure 9b shows the 17 failure cases in total over the 1277 interactive frames. This suggests that our hand segmentation approach is reliable and well suited for use in egocentric vision-based human-computer interaction.
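The mapping from a fingertip position to a key and the reported accuracy can be illustrated with a short sketch. It assumes that "5 × 7" means 5 rows by 7 columns laid uniformly over the 640 × 480 frame, which the text does not state, and the function name is hypothetical. The accuracy line simply checks the arithmetic for 17 failures in 1277 interactive frames.

```python
def key_cell(x, y, width=640, height=480, rows=5, cols=7):
    """Grid cell (row, col) hit by a fingertip at pixel (x, y)."""
    col = min(int(x * cols / width), cols - 1)
    row = min(int(y * rows / height), rows - 1)
    return row, col

failures, frames = 17, 1277
accuracy = 1 - failures / frames
print(round(accuracy, 4))  # 0.9867
```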

Conclusions
In this paper, we presented an unsupervised on-the-fly hand segmentation method which consists of top-down classification and bottom-up optimization. From the point of view of the egocentric interaction loop, an unsupervised frame-level hand detector is proposed to reduce the false positives caused by hand absence. We implement the frame-level detection by setting a non-interactive border, based on the assumption that the hand is unlikely to enter the view field from the top side in egocentric interaction. Based on the frame-level detection result, the superpixel-level and pixel-level classifiers are trained on-the-fly sequentially, with the aim of improving the reliability of hand segmentation. To get stable samples for superpixel-level training, we select the candidates through confidence score calculation and energy optimization. To be robust to varying environmental conditions, the classifiers are updated from the bottom up based on the proposed performance evaluation method. Experiments carried out on public datasets validate the generality of the proposed approach. This paper shows the potential of unsupervised methods for pixel-level hand segmentation in egocentric interaction. We believe that the approach can be transferred to pixel-level object segmentation by combining it with gaze analysis, and that it can contribute to activity recognition.