Sensor-Assisted Face Tracking

Generally, face detection and tracking focus only on visual data analysis. In this paper, we propose a novel method for face tracking in camera video. By making use of the context metadata captured by wearable sensors on human bodies at the time of video recording, we can improve the performance and efficiency of traditional face tracking algorithms. Specifically, when subjects wearing motion sensors move around in the field of view (FOV) of a camera, motion features collected by those sensors help to locate the frames most likely to contain faces in the recorded video, which saves a large amount of time otherwise spent filtering out faceless frames and cuts down the proportion of false alarms. We conduct extensive experiments to evaluate the proposed method and achieve promising results.


Introduction
Locating and tracking faces in video streams have long been among the most fundamental techniques in computer vision. They are stepping stones of almost all facial analysis algorithms, including face alignment, face modeling, face recognition, and gender/age recognition, and have enabled numerous applications such as human-computer interaction (HCI), video surveillance, and many other multimedia applications. Particularly in the context of HCI, only when computers can understand human faces well can they begin to truly figure out people's intentions and thoughts and react in a proper manner.
In general, the goal of face detection is to determine whether or not there are any faces present in an arbitrary image and, if present, return the location and extent of each face. While this appears to be a simple task for human beings, it is very difficult for computers and has been a hot topic in machine vision that has attracted top researchers all over the world for the past few decades. The difficulties associated with face detection can be attributed to the many variations in lighting conditions, scale, location, orientation, pose, facial expression, occlusion, and so forth. In addition, intraclass variations that arise from make-up, beards, mustaches, and glasses on the same person make the face detection problem even harder.
In recent years, face detection has made significant progress and been increasingly utilized in real-world applications and products, like Google's Picasa. Nowadays most digital cameras are equipped with a built-in face detector to assist autofocus. However, face detection in unconstrained settings remains a challenging task. Modern face detection algorithms are mostly based on low-level feature extraction and statistical model training, and they focus their attention wholly on visual data analysis. Hence more and more complex features and rigorous learning algorithms are developed to extract as much information as possible from the visual content.
Automatic face tracking requires face detection to initialize the tracking process. It is an application of object tracking. In its simplest form, tracking can be defined as the problem of estimating the trajectory of an object in the image plane as it moves around a scene [1]. In terms of face tracking, a tracker assigns consistent labels to detected faces in consecutive video frames. The main challenge in tracking is clutter, the phenomenon in which features expected from the target are difficult to discriminate from features extracted from other objects in the scene. Another challenge is introduced by appearance variations of the target itself. Intrinsic appearance variability includes pose variation and shape deformation, whereas extrinsic appearance variability includes illumination change, camera motion, and different camera viewpoints [2].
In this paper, we approach the task of face tracking from a new perspective. Typical face detection and tracking are conducted frame by frame and window by window. In terms of face detection, the time spent filtering out a faceless frame is comparable to that spent identifying a frame containing faces, because every search window must be checked to ensure all possible faces are detected. Faces are then tracked in subsequent frames using relatively less computationally expensive methods; in case of track failure, face detection runs again to reinitialize the tracker. A large amount of time is thus wasted searching for faces in faceless frames. For example, when a subject in a video turns his back and walks away from the camera, his face disappears entirely, and from that moment on there is no need to apply face detection and tracking. To improve performance and cut time cost, we take advantage of context metadata collected at the time of video capture in a sensor-assisted environment to rule out potentially faceless frames.
The rapid advances in consumer electronics have led to a wide proliferation of cheap, powerful wearable sensors, such as accelerometers, digital compasses, gyroscopes, and GPS. The availability of these sensors, initially included in smart phones to improve user experience, is now changing the landscape of potential applications and providing reliable sources of contextual information that help to model human behavior. In this study, we employ smart phones as sensing platforms to collect orientation sensor measurements and to help interpret human moving direction, which is explored and utilized to improve face detection and tracking. To summarize, the main contributions of this paper are twofold. First, we present a sensor-assisted fast face detection and tracking approach. As far as we know, this is the first attempt to integrate personal sensing technologies into face detection and tracking in video. This integration of a new sensing model broadens the domain of semantic analysis of visual content and will be catalyzed by the growing popularity of wearable devices and concurrent advances in ubiquitous computing. Second, we implement a set of state-of-the-art multiobject tracking algorithms and conduct extensive experiments to evaluate our method.
The remainder of this paper is organized as follows. Section 2 presents the related work. Section 3 introduces the problem we intend to address. Section 4 details the proposed method. Section 5 describes our experiments together with result analysis. Concluding remarks are placed in Section 6.

Related Work

Face Detection.
There have been hundreds of reported approaches to the problem of face detection. Based on the early work of Yang et al. [3], existing face detection approaches can be grouped into four categories: knowledge-based methods, feature invariant approaches, template matching methods, and appearance-based methods. Knowledge-based methods employ predefined rules to determine face presence based on human knowledge; feature invariant approaches aim to find face structure features that are robust to pose and lighting variations; template matching methods make use of prestored face templates to judge whether a face exists in an image; appearance-based methods rely on techniques from statistical analysis and machine learning to find the relevant characteristics of face and nonface images. The learned characteristics take the form of distribution models or discriminant functions that are subsequently used for face detection. Meanwhile, dimensionality reduction is usually carried out for the sake of computational efficiency and detection efficacy. Among these approaches, appearance-based methods have distinguished themselves as the most promising and have shown performance superior to the others.
There are mainly two important factors that determine the success of a face detector: the features used for representing face images and the learning algorithm that implements the detection. Histogram based features have become very popular in recent years due to their excellent performance and efficiency, including local binary patterns [4], local ternary patterns [5], and histograms of oriented gradients [6]. Most state-of-the-art face detection methods usually use a combination of these features by concatenating them or by optimizing combination coefficients at the learning stage.
In terms of learning, most approaches treat face detection problem as a binary classification problem and determine whether current search window contains a face. Various machine learning methods ranging from the nearest neighbor classifier to more complex approaches such as neural networks, convolution neural networks, and classification trees have been employed for face detection. Among them, boosting based cascades have attracted a lot of research interest. Viola and Jones [7] introduced a very efficient face detector by using AdaBoost to train a cascade of patternrejection classifiers over rectangular wavelet features. Each stage of the cascade is designed to reject a considerable fraction of the negative cases that survive to that stage, so most of the windows that do not contain faces are rejected early in the cascade with comparatively little computation. As the cascade progresses, rejection typically gets harder so the single-stage classifiers grow in complexity. The structure of the cascaded detection process is essentially that of a degenerate decision tree. In our work, we employ this detector to initialize face tracking.

Object Tracking.
Two tracking paradigms have been presented in [8]. Recursive tracking methods estimate the current state of an object by applying a transformation to its state in the previous frame based on the measurements taken in the respective images; this recursive estimation depends on the state of the object in the previous frame and is susceptible to error accumulation. For instance, Lucas and Kanade [9] propose a method for estimating sparse optical flow within a window around a pixel; the optical flow is fit into a transformation model that is used to predict the new position of the object. Comaniciu et al. [10] propose a tracker based on mean shift, in which the transformation of the object state is obtained by finding the maximum of a similarity function based on color histograms. Tracking-by-detection methods [11, 12], in contrast, treat tracking as a classification problem and train a classifier to distinguish the object from the background. The object detector (classifier) can be static or updated online. In [13], an SVM (support vector machine) classifier is integrated into an optical-flow-based object tracker; however, the SVM classifier is trained beforehand and unable to adapt. Adaptive discriminative trackers [14, 15] train and update a classifier using new training examples acquired during tracking.
The research efforts mentioned above focus wholly on the analysis of visual data. In this study, by contrast, we provide a novel method that exploits contextual information collected at the time of video recording to help face detection and tracking in video.

Problem Formulation
Subjects carrying smart phones move around casually in the FOV of a fixed digital camera. Video data are continuously recorded, and motion measurements are collected by the sensors embedded in the phones. As depicted in Figure 1, the direction measurements captured by the orientation sensor during the period marked by the red brace indicate that the subject is moving toward the camera, and during this period the camera can most likely capture clear faces. Based on this judgment, we can apply face detection and tracking directly to the frames recorded in this period and skip the faceless frames before and after it. Our objective is to identify these advantageous situations and improve face detection and tracking in various situations with the help of on-body sensors.

Proposed Method
In this section, we elucidate the proposed method in detail. As illustrated in Figure 2, vector c is the direction of the camera lens and can be adjusted as needed. As the camera FOV is symmetric about c, we focus our analysis on half of the FOV. Vector m is the moving direction of a subject carrying a smart phone. We assume that the moving direction of a subject stays consistent with his facing direction. The smaller the angle θ between m and c is, the more likely a face will appear. When θ equals 0°, the subject moves straight toward the camera and frontal faces can be captured. When θ is less than 90°, a portion of facial features is lost and the performance of face detection and tracking varies with different algorithms. When θ is equal to or greater than 90°, detailed features of frontal faces disappear entirely and most face detection algorithms fail to locate any faces. Thus we only need to consider situations satisfying 0° ≤ θ ≤ 90°. Based on the above analysis, we propose a two-stage automatic face tracking framework. In the first stage, context metadata collected at the time of video capture are scanned to identify advantageous situations for visual analysis, and video frames are automatically labeled to indicate whether they are likely to contain faces. In the second stage, face detection and tracking are applied to the positively labeled frames.

Figure 1: An application scenario of the proposed method. We believe that the time period marked by the red brace is probably most suitable for face detection and tracking, during which the subject moves toward the camera.
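Under these assumptions, the advantageous-situation test reduces to comparing two compass bearings. A minimal sketch follows; the function names and wraparound handling are our own illustration, not code from the paper:

```python
def bearing_diff(a, b):
    """Smallest absolute difference between two compass bearings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def is_advantageous(subject_azimuth, toward_camera_bearing, threshold=45.0):
    """True when the subject's moving direction is within `threshold` degrees
    of the bearing pointing from the subject toward the camera, that is,
    when the face is likely visible in the frame."""
    return bearing_diff(subject_azimuth, toward_camera_bearing) <= threshold
```

The modulo arithmetic handles bearings on either side of north, so a subject walking at 350° toward a camera lying at bearing 10° is still correctly classified.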

Sensor Description.
Two types of sensors are involved in the proposed method: a camera sensor and an orientation sensor. Video streams recorded from the camera are saved on disk as discrete files. In this subsection, we put emphasis on the collection of orientation measurements from the orientation sensor on smart phones.
Currently most smart phones are equipped with various types of specialized sensors originally aimed at improving user experience, including an orientation sensor. An orientation sensor usually consists of an accelerometer and a magnetometer and can sense the orientation of a smart phone relative to the earth with three values, pitch, roll, and azimuth, as shown in Figure 3. Pitch indicates rotation about the x-axis and ranges from −180° to 180° inclusively, with positive values when the z-axis tilts toward the y-axis; roll indicates rotation about the y-axis and ranges from −90° to 90° inclusively, with positive values when the x-axis tilts toward the z-axis. Azimuth indicates the angle between the y-axis and the magnetic north direction and ranges from 0° to 359° inclusively. Experiments have demonstrated that, with the phone attached as shown in Figure 3, the azimuth angle of the smart phone can be utilized to estimate the moving direction of the human body.
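Because azimuth wraps around at 0°/360°, a naive moving average misbehaves near magnetic north (the mean of 359° and 1° should be 0°, not 180°). A circular-mean smoother is one standard remedy; the sketch below is our own illustration, not necessarily the exact filter used in the paper:

```python
import math

def circular_mean(angles_deg):
    """Mean of compass bearings, robust to the 0/360 wraparound:
    average the unit vectors, then take the angle of the resultant."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

def smooth_azimuth(samples, window=5):
    """Sliding circular-mean filter over a sequence of azimuth samples."""
    half = window // 2
    return [circular_mean(samples[max(0, i - half):i + half + 1])
            for i in range(len(samples))]
```

Averaging the unit vectors rather than the raw degrees is what keeps readings straddling north from being pulled toward 180°.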

Frame Labeling.
We conduct a preliminary experiment to quantitatively measure the impact of moving direction on face detection. First, we divide the 90° angular range equally into six sectors, labeled sector 1 to sector 6 counterclockwise; each sector covers an angle of 15°, as shown in Figure 4. We recruit three participants to move toward the camera in each sector from a far point where the face can barely be distinguished with the naked eye. To rule out interference factors, we require each participant to move in his most relaxed way, in order to maintain the same motion status in all sectors, and to keep facing his moving direction as far as possible. We use a Logitech C615 HD webcam to record video clips, as shown in Figure 3. Video resolution and frame rate are set to 640 × 480 and 20 frames per second, respectively. Video frames are encoded with libx264 and stored as mp4 files on an Ubuntu 12.10 PC.
We manually count the number of faces in every video clip and then perform face detection using a Haar feature based face detector [7] in OpenCV [16]. Statistics about the collected data and the detection results are listed in Table 1, where FrNo(i, j) is the number of frames of video clip i in sector j.
As listed in Table 1, AvgFr(j) decreases from 102 in sector 1 to 52 in sector 6. This is mainly due to the fact that, when each participant starts from the same position and moves toward the camera at the same speed, the smaller the angle between the moving direction and the camera direction in Figure 2 is, the shorter the time he stays in the camera FOV and the fewer frames are recorded in the corresponding sector. AvgFa(j) and AvgFaD(j) are, respectively, the averaged FaNo(i, j) and FaDNo(i, j), where FaNo(i, j) and FaDNo(i, j) are, respectively, the numbers of manually labeled and detector-predicted faces in video clip i in sector j. From the results we can conclude that faces, whether manually labeled (AvgFa) or detector predicted (AvgFaD), start to decrease dramatically from sector 3, and the detector can barely detect any faces when participants move in the last two sectors (sector 5 and sector 6), even though a dozen profile faces can be distinguished by human eyes. This is due to the absence of enough facial features; the faces detected in sectors 4 and 5 are mostly false positives (fake faces). Thus we define advantageous situations as those in which the angle between the moving direction and the camera direction is at most 45°, and we skip frames on other occasions. Although this is a coarse-grained threshold, we demonstrate in the following sections the great improvement it brings to face detection and tracking.
With the obtained threshold, we design a frame labeling algorithm, listed in Algorithm 1. All collected orientation measurements are first smoothed to reduce noise and then scanned to label the video frames recorded in the corresponding time period in preparation for face detection and tracking. Once a qualified azimuth sample is detected within a time window, all frames within that window are labeled positive. By this strategy we alleviate the adverse impact of sudden body turning. With respect to situations in which multiple subjects exist, to simplify the problem, we assume that all subjects stay within the camera FOV in all experiments. The final result of frame labeling is calculated by applying a logical OR to the sets of labels obtained from analyzing the orientation measurements of each subject.
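The windowed labeling and the logical OR across subjects can be sketched as follows. The timestamps, the window size, and the `predicate` test for a qualified azimuth sample are illustrative assumptions, not the paper's exact parameters:

```python
def label_frames(frame_times, samples, predicate, window=0.5):
    """Label each frame True if any orientation sample within `window`
    seconds of the frame timestamp satisfies the predicate.
    `samples` is a list of (timestamp, azimuth) pairs for one subject."""
    labels = []
    for t in frame_times:
        near = [az for (ts, az) in samples if abs(ts - t) <= window]
        labels.append(any(predicate(az) for az in near))
    return labels

def combine_subjects(per_subject_labels):
    """Logical OR across subjects: keep a frame if ANY subject may show a face."""
    return [any(flags) for flags in zip(*per_subject_labels)]
```

Labeling the whole window positive as soon as one sample qualifies is what absorbs brief, sudden body turns instead of dropping isolated frames.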

Face Tracking.
In this subsection, we provide three sensor-assisted tracking algorithms to track faces in the labeled frame sequences: track by detection, track by mean shift, and track by TLD (tracking-learning-detection). In these algorithms, we employ the Viola-Jones face detector [7] to initialize the tracking process. Moreover, to reduce detection time, we filter out nonskin areas from each frame using the skin model presented in [17]. To deal with multiface tracking, we design a face classification algorithm to group different faces, as shown in Algorithm 2. The algorithm takes as input an image patch containing a human face and a vector of face descriptors; each descriptor stores the HSV color histogram of a face, its bounding box, its normalized face patch, and the indexes of the frames containing that face. The logic of the algorithm is straightforward. For each face, we search for the most similar descriptor by comparing normalized face patches; in this paper we resize each face to a 15 × 15 pixel patch before comparison. Patch similarity is computed at the pixel level using the normalized correlation metric defined in (3), where μ1, μ2, σ1, and σ2 are the means and standard deviations of image patches p1 and p2. The value of the metric ranges from −1 to 1; the larger it is, the more similar p1 is to p2. When the most similar descriptor is found and the similarity exceeds a threshold, the descriptor is updated with the features of the current face; otherwise a new descriptor is created and saved. The descriptors are used in the following sensor-assisted face tracking algorithms.

(1) Track by Detection. As shown in Algorithm 3, we apply face detection over each positively labeled frame. The detected faces are then classified using Algorithm 2. The performance of this algorithm relies entirely on the generalizability and representativeness of the training samples of the detector. We use this algorithm as a benchmark for parameter optimization in Section 5.1.
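The patch similarity in (3) is a normalized correlation over pixel values. A pure-Python sketch over flattened pixel sequences follows (a real implementation would operate on 15 × 15 image patches; the 0.65 default threshold follows Section 5.1):

```python
import math

def ncc(p1, p2):
    """Normalized cross-correlation of two equal-length pixel sequences.
    Returns a value in [-1, 1]; larger means more similar."""
    n = len(p1)
    m1, m2 = sum(p1) / n, sum(p2) / n
    s1 = math.sqrt(sum((x - m1) ** 2 for x in p1) / n)
    s2 = math.sqrt(sum((x - m2) ** 2 for x in p2) / n)
    cov = sum((x - m1) * (y - m2) for x, y in zip(p1, p2)) / n
    return cov / (s1 * s2)

def match_descriptor(patch, stored_patches, threshold=0.65):
    """Return the index of the best-matching stored patch, or None
    when no similarity exceeds the threshold (i.e. a new face)."""
    scores = [ncc(patch, d) for d in stored_patches]
    if not scores:
        return None
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] >= threshold else None
```

Because the metric subtracts the mean and divides by the standard deviation, it is invariant to uniform brightness and contrast changes between the two patches.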
(2) Track by Mean Shift. We provide a tracking algorithm based on mean shift [10] in Algorithm 4. Mean shift is a procedure for locating the maxima of a density function given discrete data sampled from that function. A confidence map of the face in the current frame is first created using the color histogram of the face in the previous frame; then, by searching for a local peak in the confidence map, the most probable position of the face in the current frame is estimated from its previous position. To filter out long-lost faces, we only consider face descriptors that remain active up to the previous frame. We adopt the spatial overlap of face bounding boxes as the metric to distinguish successful detections, as defined in (4). Fast motion such as a quick turn of the head may cause a zero overlap; when the overlap falls below a threshold value, a track failure occurs. To handle these situations, we conduct face detection at a fixed interval int to reinitialize the tracking process, as illustrated in Algorithm 4. The mean shift algorithm has difficulty tracking fast-moving targets and suffers from the local-optimum problem caused by the hill-climbing optimization it uses for the search.
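The overlap metric in (4) is not spelled out in the extracted text; a common choice is intersection-over-union, sketched below under that assumption (boxes are (x, y, w, h) tuples; the 0.4 default follows Section 5.1):

```python
def overlap(b1, b2):
    """Intersection-over-union of two boxes given as (x, y, w, h).
    Returns 0.0 when the boxes are disjoint and 1.0 when identical."""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2 = min(b1[0] + b1[2], b2[0] + b2[2])
    y2 = min(b1[1] + b1[3], b2[1] + b2[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = b1[2] * b1[3] + b2[2] * b2[3] - inter
    return inter / union if union else 0.0

def track_failed(prev_box, cur_box, threshold=0.4):
    """Declare a track failure when the overlap across consecutive
    frames drops below the threshold."""
    return overlap(prev_box, cur_box) < threshold
```

A fast head turn moves the face bounding box far enough that the intersection becomes empty, which is exactly the zero-overlap failure case described above.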
(3) Track by TLD. TLD was proposed by Kalal et al. [18, 19]. It is a framework designed for long-term tracking of an unknown object in unconstrained environments. The object is tracked and simultaneously learned in order to build a detector that supports the tracker once it fails. The detector is built upon the information from the first frame as well as the information provided by the tracker. The original TLD tracker tracks only one object; we create a multiobject tracker based on the OpenTLD implementation provided in [20]. In Algorithm 5, we first initialize the TLD tracker with the descriptive information of the faces detected in the first N1 positively labeled frames, where N1 is the count of positively labeled frames used for initialization. During initialization, an internal classifier is trained for each face using training data extracted from these first N1 positive frames. Positive training samples are obtained by applying affine warping transformations to the detected face region; negative samples are obtained by collecting rectangles of similar dimensions from nonface areas of the frame. In addition, the positive samples of one face are added to the negative training sets of the other faces to improve performance. While tracking, each new frame is processed by the internal tracker and detector in parallel, and their results are fused. Positive and negative training samples are then extracted again based on the fusion result to update the detector. The main limitation of the algorithm is that it cannot track faces that appear after initialization; thus, this method only applies to situations where the number of subjects will not increase. In this paper, we set N1 to values that include all faces that will appear in a video.
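The cross-training trick, in which each face's positive samples serve as negatives for every other face, can be sketched as follows (background negatives harvested from nonface areas are omitted for brevity; the names are ours):

```python
def build_training_sets(face_patches):
    """face_patches maps face_id -> list of warped positive patches.
    Returns face_id -> (positives, negatives), where the negatives of
    each face include the positive patches of every OTHER face."""
    sets = {}
    for fid, patches in face_patches.items():
        negatives = [p for other, ps in face_patches.items()
                     if other != fid for p in ps]
        sets[fid] = (list(patches), negatives)
    return sets
```

Feeding each classifier the other faces' positives as hard negatives is what keeps the per-face detectors from confusing similar-looking subjects.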

Experiments
In this section, we conduct extensive experiments to evaluate the proposed method. In addition to the video devices and capture settings used in Section 4.2, we also utilize Android smart phones equipped with orientation sensors. Two subjects are recruited to take part in our experiments, and the phones are attached to a waist belt, where the moving direction of subjects can be best approximated, as shown in Figure 3. A simple GUI application is created to start and stop data collection on the phones. Orientation measurements are recorded and saved in text files on the phone SD card and later accessed via USB.
We implement the labeling and tracking algorithms proposed in Section 4 based on the OpenCV library [16]. The collected data comprise the recorded video clips and thirty-two text files of orientation measurements.

Tracking Optimization.
To label frames using Algorithm 1, we set the direction threshold to 45°, which was experimentally validated in Section 4.2. In terms of the time window w, a larger window leads to fewer missed faces at the cost of potentially more faceless frames. We define a metric in (6) to measure the performance of Algorithm 1 on video v: P(v) is the ratio between the number of positively labeled frames containing faces and the number of positively labeled frames. It ranges from 0 to 1; the larger P(v) is, the more efficient Algorithm 1 is. To rule out interference factors in multiple-face situations, such as mutual occlusion of human bodies, we run Algorithms 3 and 1 over the single-face data with different w. The averaged P obtained in indoor and outdoor situations is illustrated in Figure 5. An optimal P is achieved in the vicinity of w = 0.5:

P(v) = (#positively labeled frames containing faces) / (#positively labeled frames). (6)
The similarity threshold in Algorithm 2 affects face classification accuracy and varies with different tracking algorithms and application scenarios; we use a value of 0.65 for all our experiments. In Algorithm 4, two parameters affect tracking performance. When the overlap between the bounding boxes of a face tracked across two consecutive frames is below the overlap threshold, a track failure occurs. The larger this threshold is, the more likely a true positive tracking result will be obtained; its optimal value varies with different human motion patterns, and in our experiments we set it empirically to 0.4. The parameter int specifies the interval at which we reinitialize trackers to deal with tracking failures. When int = 1, Algorithm 4 degrades into Algorithm 3 and applies face detection to each positively labeled frame. When int is set to the length of a video clip, Algorithm 4 stops as soon as mean shift fails. A larger int costs less detection time but risks a higher probability of false faces. We employ the metrics defined in (6a), (6b), and (6c) and run Algorithm 4 on the single-face data. F-score can be interpreted as a weighted average of precision and recall; precision is the fraction of face detections that are true faces; recall is the fraction of true faces that are detected; tp is the number of detections that are true faces; fp is the number of detections that are false faces; fn is the number of missed faces. Precision, recall, and F-score reach their best value at 1 and their worst at 0.
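The metrics in (6a), (6b), and (6c) follow directly from these definitions and can be sketched as:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-score from counts of true positives,
    false positives, and false negatives; all three lie in [0, 1]."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score
```

Guarding the zero denominators matters here: a clip whose frames are all labeled negative produces no detections at all, so tp + fp can legitimately be zero.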

Tracking Comparison.
In this subsection, we compare the sensor-assisted face tracking algorithms depicted in Algorithms 3, 4, and 5 with their sensorless counterparts in terms of performance and processing speed. To conduct sensorless face tracking, we simply set all elements of the output of Algorithm 1 to true and keep the subsequent tracking algorithms unchanged. We run the optimized algorithms on the collected data and calculate the averaged results under each situation. As shown in Figure 7, the sensor-assisted algorithms achieve performance comparable to their sensorless counterparts in terms of recall, while the precision of the sensor-assisted versions clearly exceeds that of the sensorless ones. This is attributed to the removal of false alarms that would otherwise be raised on negatively labeled frames, as illustrated in Figure 8. In addition, due to the exclusion of faceless frames, the sensor-assisted tracking algorithms achieve higher processing speeds, as illustrated in Figure 9. The superiority becomes especially evident in single-face situations, where a comparatively larger percentage of frames are labeled negative. Extracted frames from the results are illustrated in Figure 10.

Conclusion
In this paper, we propose a novel method for fast face tracking. The method innovatively leverages sensor-captured contextual information and can be utilized as a preliminary step to assist various algorithms for face detection and tracking in video. Experimental results demonstrate the performance improvement brought by the proposed method. However, the method is limited in the following aspects. First, users have to register and carry their smart phones in order to facilitate the tracking process. This necessary attachment of sensors damages the unobtrusiveness of visual sensing, causes inconvenience to users, and limits application of the method to specific groups of people in restricted places where their healthcare and security are concerned, such as inpatients in hospitals and elders in nursing homes. Second, frames might be mislabeled on some occasions, which may damage the performance of the method. For example, when a subject turns his head toward attractions around the camera while his body faces away from it, the video frames recorded during this period are labeled negative by the proposed method and his face may be missed. In another case, when a subject moves out of the camera FOV while still facing toward the camera, the frames recorded at this moment are labeled positive even though the subject is not in them, and the tracking analysis over these frames is wasted. Third, the proposed method does not apply to video archives created in the past due to the absence of contextual metadata. Much work remains to make the method better. In the future, we plan to explore the possibility of applying other wearable sensors to the content analysis of visual data.