Stereo Vision-Based Human Tracking for Robotic Follower

Abstract This paper addresses the problem of real-time vision-based human tracking that enables a mobile robot to follow a human co-worker. A novel approach combining stereo vision-based human detection with human tracking based on a modified Kalman filter is presented. Stereo vision-based detection combines features extracted from 2D stereo images with reconstructed 3D object features to detect humans in the robot's environment. For human tracking, a modified Kalman filter recursively predicts and updates estimates of the 3D coordinates of the human in the robot's camera coordinate system. This prediction enables human detection to be performed on an image region of interest only, contributing to cost-effective human tracking. The performance of the presented method was tested within the working scenario of a mobile robot intended to follow a human co-worker in indoor as well as outdoor applications.


Introduction
Human detection and tracking has been an extremely active research area in the computer vision community over the past decade. The importance of this area arises from its numerous applications, such as video surveillance, smart vehicles and virtual reality [1]. One of the broadest application areas is robot vision [2]. Vision-based human tracking is crucial for human-robot interaction in numerous human-centred robot applications. A person-following robot finds a specified person using visual tracking functions and follows him/her in order to provide different services or to accomplish different tasks, depending on the application [3]. Person-following robots developed so far use various types of cameras for detecting a target person, and some of them use other sensors in combination with vision. For example, in [3] a stereo vision system and, additionally, a Laser Range Finder (LRF) are mounted on the robot body to enhance the person-following behaviour of the robot. Furthermore, some authors have developed methods that include human-body wearable sensors [4], which are, however, inadequate in some applications. In order to avoid the complex data fusion algorithms required when different sensor types are combined, and in order to avoid sensors that require human contact, a number of authors have developed methods based on vision as the only sensor.
Although people detection and tracking with a single camera is a well-explored topic, it has been shown that detecting humans from a single image while maintaining a low false detection rate is a very difficult problem [5]. The use of stereo vision provides a higher grade of information, which brings several advantages when developing human-robot applications. The disparity information is more invariant to illumination changes than the images provided by a single camera, which is a very advantageous property for environment estimation. Furthermore, the possibility of knowing the distance to the person is of great assistance in tracking: the robot calculates the distance between the target person and itself using stereo vision and follows him/her with the appropriate speed to maintain that distance. In [6] a robust human detecting and tracking system, which can be used in indoor environments, is presented. Besides this, there are numerous other human tracking methods applicable to indoor applications, which are based on background subtraction and on detecting the skin colour of the face and hands in a captured image. In contrast to these methods, in this paper a stereo vision-based human tracking method that can be used in both indoor and outdoor applications is presented. This is achieved by using only 3D information for human detection and tracking: disparity (depth) information for the segmentation of object regions in the scene, and 3D human characteristics extracted from the disparity image for human detection. This paper also addresses the need for a fast vision module and a distributed system architecture to enable real-time robot motion control. An open robot controller architecture that easily connects a number of software modules, including a target detection module and a robot motion control module, even if these modules are located on different computers, has been presented in [3]. In this paper, a realization of distributed processing using the ROS (Robot Operating System) [7] is considered. Support of time-critical robot control is also achieved by a cost-effective vision module for human tracking. This is achieved by implementing a Kalman filter, as is usual in vision-based human tracking [2], yet here with the goal of performing human detection on an image region of interest (ROI), rather than on the whole disparity image.
The remainder of the paper is organised as follows. Section 2 presents the layout of the vision-based robotic human follower system. Section 3 presents the first part of the vision module of the robotic follower: stereo vision-based human detection. The second part of the vision module, human tracking based on a modified Kalman filter, is presented in Section 4. The performance evaluation of the presented stereo vision-based human tracking within the working scenario of a mobile robot intended to follow a human co-worker in indoor as well as outdoor applications is presented in Section 5.

Vision-based robotic human follower -system layout
The presented vision-based human tracking system is intended to provide sensor input for vision-based control of a mobile robot, which works in a team helping a human co-worker with the transportation of different objects in indoor applications, as well as in outdoor applications such as the investigation of a hazardous environment. For example, in the working scenario of investigating contaminated/hazardous environments, the robot works as a transportation robot, helping the human to carry containers with samples collected from the environment. The vision system for human tracking within such robot working scenarios has to be able to detect the human, calculate the distance to the human and track the human, keeping a constant distance between them. After sensing a reduction in distance, indicating the human's intention to approach the robot, the robot has to stop and allow the human to place the containers with the collected samples onto the robot's mobile platform. The principal layout of the visually controlled robotic system for human tracking, whose vision module is described in this paper, is shown in Figure 1.

The presented robotic system has a stereo camera system on board, as well as a low-power industrial PC (Intel i7-620M @ 2.66 GHz, 4 GB of RAM, 64 GB SSD, ~65 W). As the platform is intended to run on batteries, so as to move independently of external power supplies, low power consumption of the on-board components is critical. Therefore, the on-board PC functionality is limited to capturing images from the stereo camera and sending them, compressed, over the wireless link to the off-board vision module, as well as sending direct commands obtained from the robot control to the wheel controllers.
In order to assure that the control commands are sent at regular time intervals, the vision module and the robot control module run on different PCs, as illustrated by the differently coloured blocks in Figure 1. The computationally expensive vision algorithms run on a dedicated high-end PC (Intel Xeon E5520 @ 2.26 GHz, 6 GB of RAM) with the goal of tracking the human co-worker in front of the robot and determining his/her 3D position with respect to the coordinate system of the left stereo camera, as depicted in Figure 1. The obtained 3D position is then sent to the robot control module located on a different desktop PC (Intel E4700 @ 2.6 GHz, 2 GB of RAM), which computes the required velocities for each wheel such that the robot keeps following the human. These velocities are then sent at regular time intervals over the wireless link to the on-board PC, which sends direct commands to the wheel controllers. If the robot control module ran on the same PC as the vision module, it might not be able to send the new velocities out in time due to the vision module blocking the CPU (Central Processing Unit).
The communication between the three computers is carried out via the ROS (Robot Operating System), a widely used communication framework that, among other features, allows easy configuration of multiple computers for cooperation in order to improve the time effectiveness of the system. This time effectiveness is achieved by splitting computationally expensive tasks into modules, which run on different computers.
The focus in this paper is on the robot's vision module. It consists of two sub-modules: stereo vision-based human detection and human tracking based on a modified Kalman filter.

Stereo vision-based human detection
A block-diagram of the presented human detection system is given in Figure 2 and the individual processing steps are explained in the following sections.

Object segmentation
Stereo image information is used to aid the definition of the regions of different objects, including humans, in camera images. The acquired image pair is used to compute a disparity map using a block-matching algorithm [8]. In principle, the disparity map is found by computing the stereo correspondences between the image points of the left and right stereo images. For a given 3D point P(X, Y, Z), the corresponding points in the left and right image are p_L(u_L, v_L) and p_R(u_R, v_R) respectively, where u and v are the coordinates of an image point in the image coordinate system with the origin in the camera's optical centre. The difference in the u coordinates of corresponding points in the stereo images is known as the disparity d:

d = u_L − u_R.    (1)
The disparity of an image point is inversely proportional to the distance of the original 3D point to the camera coordinate system, known as depth. The resulting disparity map is a 2D image in which the value of each pixel equals the disparity (1). The pixel coordinates in the disparity map correspond to the pixel coordinates in the left stereo image, as the left stereo image is usually taken as the reference image when building the disparity map. The resulting disparity map in the proposed system is segmented using a connected pixel labelling-based method. The main idea behind this segmentation method is to group pixels with the same or very close values, as it is assumed that they belong to the same object. Namely, neighbouring pixels in the disparity map belonging to the surface of an object have close disparity values, while on the edges of the object the difference in disparity values between the pixels of the object and of the background is large. These transitions in disparity values are used for the segmentation. The details of the disparity map segmentation method used are given in [9]. The segmentation result in the case of the human tracking robot scenario considered here is shown in Figure 3(c). Differently coloured regions in the image in Figure 3(c) represent different objects, which are at different distances from the robot's camera. As can be seen, the ground has been removed from the segmented image (represented by black colour) in order to avoid merging with other objects, including humans, placed on the ground. The ground plane removal was done by detecting the regions in the lower part of the disparity map whose disparity values gradually change, i.e., whose image gradient in the vertical direction gradually changes. In contrast to ground plane pixels, the regions of object surfaces in the disparity map have almost constant disparity values.
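As an illustration, the grouping of disparity pixels into object regions can be sketched with a simple 4-connected flood fill. This is a minimal stand-in for the labelling method of [9]; the similarity threshold and minimum region size are illustrative assumptions.

```python
from collections import deque

def segment_disparity(disp, max_diff=1, min_region=4):
    """Label connected regions of similar disparity (4-connectivity).
    Pixels with disparity 0 (no stereo match / removed ground plane)
    stay unlabelled; regions smaller than min_region are dropped as noise."""
    h, w = len(disp), len(disp[0])
    labels = [[0] * w for _ in range(h)]
    next_label = 1
    for sr in range(h):
        for sc in range(w):
            if disp[sr][sc] == 0 or labels[sr][sc] != 0:
                continue
            # breadth-first flood fill from the seed pixel
            region = [(sr, sc)]
            labels[sr][sc] = next_label
            queue = deque(region)
            while queue:
                r, c = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < h and 0 <= nc < w
                            and labels[nr][nc] == 0 and disp[nr][nc] != 0
                            and abs(disp[nr][nc] - disp[r][c]) <= max_diff):
                        labels[nr][nc] = next_label
                        region.append((nr, nc))
                        queue.append((nr, nc))
            if len(region) < min_region:
                for r, c in region:
                    labels[r][c] = -1          # mark as visited noise
            else:
                next_label += 1
    return [[l if l > 0 else 0 for l in row] for row in labels]
```

On a real disparity map the threshold would be tuned to the expected disparity step at object boundaries, so that pixels on one object surface merge while the large transitions at object edges separate the regions.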

Feature-based object classification
Once the objects have been segmented in the disparity map, different features describing the segmented object regions are calculated. The chosen features have been defined so as to enable humans to be distinguished from other objects in the robot's perceived environment. The features used can be separated into two groups: 2D features and 3D features. The former are calculated from the 2D segmented image, while the latter result from 3D object reconstruction.

2D features.
In order to describe the shape of every segmented region, the first three so-called Hu invariant moments [10] are used, as they have proven to be effective shape descriptors. They are calculated over the pixels of each segmented object region according to the following formulas:

H1 = η_20 + η_02    (2)
H2 = (η_20 − η_02)² + 4·η_11²    (3)
H3 = (η_30 − 3·η_12)² + (3·η_21 − η_03)²    (4)

where η_pq is the normalised central moment obtained from the central moment μ_pq, defined as:

μ_pq = Σ_u Σ_v (u − C_u)^p (v − C_v)^q I(u, v).    (5)

In (5), C_u and C_v are the image coordinates of the centre of mass of the segmented object region (the average over the image coordinates of all segmented pixels in the region) and I(u, v) is the intensity level of the image point with coordinates u and v. In the case of a segmented image, I(u, v) is 1 for each pixel segmented as belonging to an object and 0 for each background (not segmented) pixel.

3D features.
In order to effectively distinguish humans from other objects in the robot's environment, the Hu moments, as descriptors of the shapes of segmented object regions in the 2D segmented disparity map, are supported by two 3D object features: object height and object width. To calculate these two features, first the bounding box of each segmented object region is defined as the smallest rectangle containing the segmented region in the 2D segmented disparity map. Then, the upper left corner and the bottom right corner of each bounding box are identified and their coordinates in the left stereo image are determined as (u_LU, v_LU) and (u_RB, v_RB) respectively. In order to calculate the real 3D object width and height, the 3D reconstruction (so-called 2D to 3D mapping [11]) of the bounding box corner points is performed according to:

X_LU = u_LU·b/d,  Y_LU = v_LU·b/d,  Z_LU = f·b/d    (6)
X_RB = u_RB·b/d,  Y_RB = v_RB·b/d,  Z_RB = f·b/d    (7)

where f is the focal length of the camera (in pixels) and b is the stereo camera baseline (in metres), representing the line connecting the camera centres of the stereo cameras. In (6) and (7), d is the disparity of the centre of mass (C_u, C_v) of the segmented object region in the segmented disparity map. With the 3D coordinates (6) and (7) of the corner points, the object height h and the object width w are calculated as:

h = |Y_LU − Y_RB|,  w = |X_RB − X_LU|.    (8)

The defined features are used in the proposed system for the classification of an object as belonging or not belonging to the class "human being". The classifier used is a backpropagation neural network with one hidden layer [12]. The training of the classifier, i.e., the adjustment of the neural network parameters, was done using a training set of 577 feature vectors (H1, H2, H3, h, w) extracted from segmented human regions in disparity maps of stereo image pairs acquired indoors as well as outdoors. The testing of the developed classifier was done using another 423 feature vectors, obtained by extraction from segmented regions of different objects, including humans, in disparity maps of stereo image pairs acquired indoors as well as outdoors. The obtained classification results were very good, with a classification performance rate of 96%. Misclassification, i.e., the inability to classify an object from the robot's environment as human, happened in cases of significant human occlusion or when humans were segmented as connected to objects from the environment. This indicates the need for a module that predicts and estimates the 3D position of the human, so as to provide an input to the robot control even in the case of human classification failure. In the proposed system such a module is the modified Kalman filter described in Section 4.
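The first three Hu invariants can be computed from a binary region mask as sketched below. This is a minimal sketch that assumes the standard scale-normalised central moments; the exact normalisation used in the paper may differ.

```python
def hu_moments_3(mask):
    """First three Hu invariant moments of a binary object region.
    mask: 2D list, 1 where the pixel belongs to the segmented object."""
    pts = [(u, v) for v, row in enumerate(mask)
                  for u, val in enumerate(row) if val]
    m00 = float(len(pts))
    cu = sum(u for u, _ in pts) / m00      # centre of mass C_u
    cv = sum(v for _, v in pts) / m00      # centre of mass C_v

    def mu(p, q):                          # central moment mu_pq, as in (5)
        return sum((u - cu) ** p * (v - cv) ** q for u, v in pts)

    def eta(p, q):                         # scale-normalised central moment
        return mu(p, q) / m00 ** (1 + (p + q) / 2.0)

    h1 = eta(2, 0) + eta(0, 2)
    h2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    h3 = (eta(3, 0) - 3 * eta(1, 2)) ** 2 + (3 * eta(2, 1) - eta(0, 3)) ** 2
    return h1, h2, h3
```

For a symmetric region such as a filled square, H2 and H3 vanish and H1 stays approximately constant under scaling, which is what makes these moments useful as shape descriptors.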

Human detection
The last processing step in the proposed vision system is human detection. Once the human has been successfully classified, his/her 3D position with respect to the coordinate system of the left stereo camera is calculated from the image coordinates (u_HC, v_HC) of the centre of mass of the segmented human region in the segmented disparity map:

X_H = u_HC·b/d,  Y_H = v_HC·b/d,  Z_H = f·b/d.    (9)

Besides the 3D coordinates (9), the outputs of the human detection module are also the 3D coordinates of the corner points of the human bounding box, calculated according to (6) and (7), so that the final output of human detection is the nine-dimensional vector:

S_H = [X_H, Y_H, Z_H, X_LU, Y_LU, Z_LU, X_RB, Y_RB, Z_RB]^T.    (10)

The 3D coordinates of the corner points of the human bounding box are further used by the tracking module. The tracking module is based on a modified Kalman filter and recursively predicts the estimates of the 3D coordinates of the corner points, allowing the prediction of a region of interest (ROI) in the robot's camera images. This prediction enables human detection to be performed on the image region of interest rather than on the whole image, contributing to the cost effectiveness of human tracking, which is very important for robot control.
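A minimal sketch of the 2D-to-3D mapping and the assembly of the nine-dimensional detection vector (10). Following the convention above, (u, v) are taken relative to the optical centre of the left image; the function names are illustrative.

```python
def reconstruct_3d(u, v, d, f, b):
    """2D-to-3D mapping for a rectified stereo pair, as in (6), (7) and (9):
    (u, v) are pixel coordinates relative to the optical centre of the left
    image, d the disparity in pixels, f the focal length in pixels and
    b the stereo baseline in metres."""
    z = f * b / d
    x = u * b / d
    y = v * b / d
    return x, y, z

def detection_vector(centre, corner_ul, corner_br, d, f, b):
    """Nine-dimensional output S_H as in (10): the 3D centre of mass of the
    human region followed by the 3D bounding-box corners, all reconstructed
    with the disparity d of the region's centre of mass."""
    s_h = []
    for u, v in (centre, corner_ul, corner_br):
        s_h.extend(reconstruct_3d(u, v, d, f, b))
    return s_h
```

For example, with f = 800 px, b = 0.12 m and d = 48 px, any point in the region reconstructs to a depth of Z = 800 x 0.12 / 48 = 2 m.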

Tracking system overview
Human tracking starts after N video frames have been used for the initialization of tracking based on the modified Kalman filter, where N is the number of first consecutive frames in which human detection was successful. The presented application assumes one human co-worker in the robot's environment. However, the presented method is also applicable in cases where there is more than one human in the robot's environment. In that case, the first detected human (or the human closest to the robot) in the initialization phase is selected as the person to be followed, and he/she is tracked in the tracking phase. In this way, even if other humans enter the scene, the robot is able to continue tracking the person selected in the initialization phase. In order to fulfil this task, the tracking system works under the assumption that the 3D position of the tracked person cannot change suddenly between two consecutive frames, and therefore it is able to correctly locate the selected person in successive frames. This holds as long as the humans in the scene can be clearly distinguished from each other. However, if one of the humans temporarily occludes the person selected for tracking, he/she is no longer in the camera's field of view, but the proposed modified Kalman filter continues to predict his/her position, as explained in the following. As illustrated in Figure 4, the stereo vision-based human detection is first applied to the whole image of the first N stereo image pairs to obtain the human detection vector S_H.
The processing of whole images, which means extracting the features of all the segmented objects in the robot's environment and classifying each of them as belonging or not belonging to the "human" class, is very time consuming. In order to achieve time-effective human tracking, which is very important for real-time robot control, from frame N+1 onwards the described stereo vision-based human detection is applied to an image region of interest (ROI), rather than to the whole image. The image ROI is set using the predicted value of S_H, i.e., the predicted values of the 3D position of the human with respect to the coordinate system of the left stereo camera, as well as the predicted values of the 3D coordinates of the corner points of the human's bounding box. Using the predicted 3D point coordinates, the 2D coordinates of these points in image frame pair N+1, which define the image ROI, are calculated by so-called 3D to 2D mapping, i.e., by inverting (6)-(9). As illustrated in Figure 4, the predicted value of S_H results from the modified Kalman filter according to the procedure described in the following.
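The ROI setting can be sketched by inverting this mapping: the Kalman-predicted 3D corner points are projected back to pixel coordinates and padded with a safety margin. The margin value and the function names are illustrative assumptions.

```python
def project_to_image(x, y, z, f, b):
    """3D-to-2D mapping (the inverse of the reconstruction (6)-(9)):
    a predicted 3D point is mapped to pixel coordinates relative to the
    optical centre, together with its expected disparity."""
    u = f * x / z
    v = f * y / z
    d = f * b / z
    return u, v, d

def predicted_roi(corner_ul_3d, corner_br_3d, f, b, margin=20.0):
    """Image ROI for the next frame, computed from the Kalman-predicted 3D
    bounding-box corners and enlarged by a safety margin (in pixels)."""
    u1, v1, _ = project_to_image(*corner_ul_3d, f, b)
    u2, v2, _ = project_to_image(*corner_br_3d, f, b)
    return (min(u1, u2) - margin, min(v1, v2) - margin,
            max(u1, u2) + margin, max(v1, v2) + margin)
```

The margin compensates for the prediction error between frames, so that the human stays inside the ROI even if he/she moves slightly faster than the constant-velocity model assumes.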

Modified Kalman filter
A number of approaches to prediction and tracking in robot vision are based on the traditional Kalman filter [2].
In the Kalman filter approach, it is presumed that the behaviour of a moving object can be characterized by the following predefined models of motion and measurement respectively:

x_k = A·x_{k−1} + w_{k−1}    (11)
z_k = H·x_k + v_k    (12)

where the state vector x_k corresponds to image frame k. In (11), A represents the state transition matrix, which determines the relationship between the present state x_k and the previous one x_{k−1}, and in (12) the matrix H describes the relationship between the measurement vector z_k and the state vector x_k. The vectors w_k and v_k are noise terms, which are assumed to be independent of each other and Gaussian with zero mean and covariance matrices Q and R respectively. In the presented system we want to predict and estimate the state of the vector S_H, defined in (10), based on its last known state. We assume that the velocity of the tracked human is constant between subsequent video frames, so the state vector is simplified and does not include an acceleration term. The state vector in the presented tracking system is:

x_k = [S_k^H  V_k^H]^T    (13)

where the vector S_k^H represents the measurements of the 3D position of the detected human in frame k, and the vector V_k^H represents the 3D velocities of the centre of mass of the segmented human region and of the corner points of the human's bounding box. Under the constant-velocity assumption, the state transition matrix A can be determined from the basic kinematic equations as:

A = [ I  Δt·I
      0   I ]    (14)

where I is the 9×9 identity matrix and Δt is the sampling interval.
In the presented system the measurements z_k of the 3D position of the detected human in frame k are defined by (10). In order to avoid large discrepancies that could arise between two subsequent frames due to errors in stereo vision-based human detection, a modification of the Kalman filter is suggested, which involves the replacement of the measurement vector z_k in (12) with a smoothed measurement vector z_k^s. The smoothing of z_k is performed taking previous measurements into account according to:

z_k^s = a_0·z_k + a_1·z_{k−1} + a_2·z_{k−2} + b_1·z_{k−1}^s + b_2·z_{k−2}^s    (16)

where z_k^s is the smoothed measurement vector and a_0, …, b_2 are smoothing coefficients, which are calculated according to the equations given in [13]. These coefficients are constants and depend on the cut-off frequency, which was chosen by residual analysis [13]. The initialization of the smoothing (16) is done during tracking initialization, as illustrated in Figure 4.
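One plausible realisation of the recursive smoothing (16) is a second-order low-pass Butterworth filter obtained via the bilinear transform; the actual coefficients in the paper are derived from [13] and may differ, so the cut-off and sampling frequencies below are illustrative.

```python
import math

def butterworth2_coeffs(fc, fs):
    """Coefficients a0, a1, a2, b1, b2 of a 2nd-order low-pass Butterworth
    filter (bilinear transform) for cut-off fc and sampling frequency fs."""
    c = 1.0 / math.tan(math.pi * fc / fs)
    a0 = 1.0 / (1.0 + math.sqrt(2.0) * c + c * c)
    a1, a2 = 2.0 * a0, a0
    b1 = 2.0 * (c * c - 1.0) * a0
    b2 = -(1.0 - math.sqrt(2.0) * c + c * c) * a0
    return a0, a1, a2, b1, b2

class MeasurementSmoother:
    """Recursive smoothing of one measurement component, as in (16):
    z_s[k] = a0 z[k] + a1 z[k-1] + a2 z[k-2] + b1 z_s[k-1] + b2 z_s[k-2]."""
    def __init__(self, coeffs, z0):
        self.a0, self.a1, self.a2, self.b1, self.b2 = coeffs
        # initialise the histories with the first measurement
        # (the tracking-initialization step of Figure 4)
        self.z1 = self.z2 = z0
        self.s1 = self.s2 = z0

    def update(self, z):
        s = (self.a0 * z + self.a1 * self.z1 + self.a2 * self.z2
             + self.b1 * self.s1 + self.b2 * self.s2)
        self.z2, self.z1 = self.z1, z
        self.s2, self.s1 = self.s1, s
        return s
```

The coefficients have unit DC gain, so a constant measurement passes through unchanged while frame-to-frame detection jitter above the cut-off frequency is attenuated.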
The presented Kalman filter-based tracking has two stages: the prediction stage and the correction stage. In the prediction stage, the state vector x_k in the current frame is predicted from the corresponding state vector in the previous frame via the matrix A. In the correction stage, the smoothed measurement vector in the current frame is used to update the state estimate and the error covariance matrix.
The prediction stage can be expressed as follows:

x̂_k⁻ = A·x̂_{k−1}    (17)
P_k⁻ = A·P_{k−1}·Aᵀ + Q    (18)

where x̂_k⁻ is the a priori state estimate and P_k⁻ is the a priori estimate error covariance matrix in frame k. The correction stage can be expressed through the following three equations:

K_k = P_k⁻·Hᵀ·(H·P_k⁻·Hᵀ + R)⁻¹    (19)
x̂_k = x̂_k⁻ + K_k·(z_k^s − H·x̂_k⁻)    (20)
P_k = (I − K_k·H)·P_k⁻    (21)

where x̂_k is the a posteriori state estimate at frame k, given the measurements up to that time, P_k is the a posteriori estimate error covariance matrix at frame k, indicating the accuracy of the estimated state vector, and K_k is the Kalman gain. The tracking process is recursive.
The estimates computed from (20) are used as the state vectors on the right-hand side of (17) for the next video frame k+1. The resulting prediction of the vector S_H is used for setting the image ROI, as illustrated in Figure 4.
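A sketch of the constant-velocity predict/correct recursion (17)-(21) for a single coordinate of S_H; the full tracker would run this jointly over all nine coordinates, and the noise parameters below are illustrative.

```python
class ConstantVelocityKalman:
    """Modified-Kalman-filter recursion (17)-(21) for one coordinate of S_H
    under the constant-velocity model; state x = [position, velocity],
    measurement z = the (smoothed) position."""
    def __init__(self, dt, q=1e-3, r=1e-2, p0=1.0, x0=0.0):
        self.dt = dt
        self.x = [x0, 0.0]                      # state estimate
        self.P = [[p0, 0.0], [0.0, p0]]         # estimate error covariance
        self.q, self.r = q, r                   # process / measurement noise

    def predict(self):
        dt = self.dt
        # (17): x_k^- = A x_{k-1}, with A = [[1, dt], [0, 1]]
        self.x = [self.x[0] + dt * self.x[1], self.x[1]]
        p = self.P
        # (18): P_k^- = A P_{k-1} A^T + Q
        p00 = p[0][0] + dt * (p[0][1] + p[1][0]) + dt * dt * p[1][1] + self.q
        p01 = p[0][1] + dt * p[1][1]
        p10 = p[1][0] + dt * p[1][1]
        p11 = p[1][1] + self.q
        self.P = [[p00, p01], [p10, p11]]
        return self.x[0]                        # predicted position (sets the ROI)

    def correct(self, z):
        # (19): K = P^- H^T (H P^- H^T + R)^-1, with H = [1, 0]
        s = self.P[0][0] + self.r
        k0, k1 = self.P[0][0] / s, self.P[1][0] / s
        innovation = z - self.x[0]
        # (20): a posteriori state estimate
        self.x = [self.x[0] + k0 * innovation, self.x[1] + k1 * innovation]
        # (21): P = (I - K H) P^-
        p = self.P
        self.P = [[(1 - k0) * p[0][0], (1 - k0) * p[0][1]],
                  [p[1][0] - k1 * p[0][0], p[1][1] - k1 * p[0][1]]]
        return self.x[0]
```

When detection fails in a frame, only predict() is called, so the tracker keeps outputting a plausible position, which is exactly the behaviour exploited during temporary occlusions.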

Performance evaluation
The performance of the presented stereo vision system for human tracking was tested within the working scenario of a mobile robot intended to follow a human co-worker in indoor as well as outdoor applications.
Experiments were conducted in which a human walking in front of the mobile robot was imaged by a Point Grey Bumblebee XB3 [14] stereo camera mounted on the robot. The image pairs were grabbed at full resolution at 12 fps (frames per second). Each pair of stereo frames was processed in order to extract information for the stereo vision-based reconstruction of the human's walk with respect to the camera coordinate system. Some examples of processed images with superimposed extracted bounding boxes of humans are shown in Figure 5. The results of human detection in five frames from the video captured indoors are shown in Figure 5(a)-(e). Figure 5(f)-(j) shows the results of human detection in five frames from the video captured outdoors.
In order to evaluate the performance of the system with respect to the accuracy of the reconstruction of the 3D coordinates of the person, and therefore of the reconstruction of the robot's distance to the person, the experimental results were compared with ground truth obtained in two ways. In the first experiment the ground-truth distance to the person was obtained with a Bosch PLR 50 digital laser rangefinder [15]. In the second experiment a reference path to be followed by the person was drawn on the floor. As the focus of this paper is the presentation of the vision module of the robotic follower, in the performed experiments the robot only observes the person without following him/her, so that errors possibly occurring in vision can be decoupled from possible errors introduced by the robot control.
In the first experiment, the distance computed from the output of the proposed stereo vision-based tracker according to:

D = √(X_H² + Y_H² + Z_H²)    (22)

was compared with the distance obtained by the Bosch PLR 50 digital laser rangefinder. According to the datasheet, the accuracy of this device is ±2 mm regardless of distance. The system was tested on 1055 frames and the average error in distance was 2.38%, with a standard deviation of 2.12%. It can be observed that there is a region of the reconstructed path, in the upper right corner, where the proposed human detection gives no results. As is evident, the Kalman filter estimates and predicts the 3D position of the human in these frames, so that at any moment the proposed tracker outputs valid data. This is very important for proper robot control, to avoid random movement of the robot. Also, the proposed tracker makes the reconstructed path smoother, which helps the robot control to avoid erratic movements of the robot platform.
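The distance (22) and the reported error statistics can be reproduced as follows; the helper names are illustrative.

```python
import math

def distance_to_human(x_h, y_h, z_h):
    """Euclidean distance (22) from the camera origin to the tracked human."""
    return math.sqrt(x_h * x_h + y_h * y_h + z_h * z_h)

def percent_error_stats(estimated, ground_truth):
    """Mean and standard deviation of the relative distance error (in %),
    as used to summarise the rangefinder comparison."""
    errs = [abs(e - g) / g * 100.0 for e, g in zip(estimated, ground_truth)]
    mean = sum(errs) / len(errs)
    var = sum((e - mean) ** 2 for e in errs) / len(errs)
    return mean, math.sqrt(var)
```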
Besides providing reliable information for robot control, the Kalman prediction enables human detection to be performed on the image region of interest contributing to the cost effectiveness of human tracking.The processing time saved is between 30% and 70% of the time required to process the whole frame, depending on the size of the predicted region.
Besides the cost-effective vision method, real-time robot control is supported by the proposed distributed computing. As mentioned before, the proposed system acquires image pairs from the on-board stereo camera at a rate of 12 Hz. If image acquisition, the vision module and the robot control all ran on the on-board PC, due to the limited computational capabilities of the on-board low-power PC, the vision module could only run at a rate of about 4 Hz and could interfere with the robot control by constantly blocking the CPU with image processing tasks.
In order to avoid CPU blocking, the operations would need to run sequentially, as shown in Figure 7(a).
It can be observed that the total processing time of the system, from image acquisition to sending speed values to the wheels, is the sum of the three operations:

T = T1 + T2 + T3    (23)

where T1, T2 and T3 represent the time periods required for each of the three operations to finish; they are 10 ms, 250 ms and 10 ms respectively. Hence, if the operations ran sequentially, the output rate of the robot control would be limited to less than 4 Hz, which could cause abrupt movements of the robot.
In order to overcome these problems, distributed computing was introduced into the presented system. This allowed the use of a separate high-end PC for running the vision module and a desktop PC for running the control module. In this way an individual module fully uses all the available cores of its CPU without the need to leave resources free for concurrent modules, as they run on separate machines. The introduction of such distributed computing results in a reduction of the processing time of both the vision module (from 250 ms to 75 ms) and the robot control module (from 10 ms to 4 ms), as can be seen in Figure 7(b). The processing time for image acquisition remains the same (10 ms) since, according to the system specifications given in Section 2, a low-power on-board PC has to be used for acquiring the images. In the presented distributed computing system the three modules use the ROS library for communication, which implements socket communication over TCP (Transmission Control Protocol) in a way that is transparent to the user and therefore easy to use, debug and deploy.
Besides the introduction of distributed computing, real-time system characteristics are supported by pipelining, as the use of distributed computing allows the operations to be performed in a pipelined fashion. That is, an upstream module can already process new data while the downstream module processes its output. However, by using multiple computers, additional transmission delays are introduced into the system, which are on average about 1 ms for cabled connections (T23) and 5 ms for wireless connections (T12 and T31). Figure 7(b) shows an overview of the timing when distributed computing is used. Black bars illustrate the times needed for image acquisition, vision and robot control, while the grey bars illustrate transmission delays. As the operations now run in parallel, the total processing time per frame is given by the slowest operation, to which the communication times from and to that module are added:

T = max(T1, T2, T3) + T_To + T_From    (24)

where T_To and T_From represent the transmission times required to send data from the previous module to the slowest module and from the slowest module to the next module in line. In Figure 7(b) the slowest module is the vision module, to which the transport times from image acquisition (T12) and to robot control (T23) are added. In pipelined systems there is another measure, the system latency, which describes the time required for the final output to be delivered for a specific input. This is the sum of all operations, including all communication times:

T_L = T1 + T2 + T3 + T12 + T23 + T31.    (25)
In other words, after T_L has passed and the pipeline is full, every subsequent sample arrives after a period T. Pipelines are therefore especially useful when all operations take approximately the same amount of time to complete, since no module wastes PC resources waiting for its next input.
It can be seen that, while in the case of sequential processing, according to (23), the total processing time, and implicitly the latency, would be 270 ms, corresponding to 3.7 Hz, with distributed computing the processing period is 81 ms, leading to a potential rate of 12.34 Hz with a latency of 100 ms. Therefore, the 12 Hz rate of the stereo camera can be maintained by all modules in the proposed system. Even though 12 Hz is a good rate for obtaining reference values for robot control, in order to ensure smooth movement of the robot, the robot control in the proposed architecture additionally interpolates between two consecutive samples obtained from the vision module.
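The timing figures can be checked with the simple relations for sequential and pipelined operation; the function names are illustrative.

```python
def sequential_period_ms(times_ms):
    """Total processing time when acquisition, vision and control
    run sequentially on one machine, as in (23)."""
    return sum(times_ms)

def pipelined_period_ms(times_ms, t_to, t_from):
    """Pipelined throughput period, as in (24): the slowest stage plus the
    communication times into and out of it."""
    return max(times_ms) + t_to + t_from

def latency_ms(times_ms, comm_ms):
    """End-to-end latency: every stage plus every communication link."""
    return sum(times_ms) + sum(comm_ms)
```

With the numbers above: sequential_period_ms([10, 250, 10]) gives 270 ms (about 3.7 Hz); pipelined_period_ms([10, 75, 4], 5, 1) gives 81 ms (about 12.34 Hz); latency_ms([10, 75, 4], [5, 1, 5]) gives 100 ms.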

Conclusions
In this paper the vision module of a person-following robot has been presented. It consists of two sub-modules: stereo vision-based human detection and human tracking based on a modified Kalman filter. The robustness of human detection is provided by combining disparity (depth) image-based features and reconstructed 3D human features. The Kalman filter-based prediction enables performing human detection over an image region of interest rather than over the whole image, which makes the proposed human tracking method cost effective. The cost-effective vision method is supported by distributed computing to support the real-time performance of the robot control. Experimental results on the detection of humans in videos captured in both indoor and outdoor environments are given. The performance evaluation was carried out with respect to the accuracy of human tracking, as well as with respect to the cost effectiveness of the presented vision module.

Figure 1 .
Figure 1. Principal layout of the robotic system for following a human co-worker.

Figure 2 .
Figure 2. Block diagram of the proposed stereo vision-based human detection.

Figure 3 .
Figure 3. Left stereo image of a human walking in front of the robot (a). Disparity map (b). Segmented disparity map (c).

Figure 4 .
Figure 4. Block diagram of the proposed human tracking system.

Figure 6.
Figure 6. Comparison of the reference human's path and the human's paths reconstructed with the proposed stereo vision-based tracker, with and without the tracking filter.

Figure 6 shows the result of the second experiment, in which the reference path, in the form of a 2.3 m × 2.3 m square, was drawn on the floor and followed by the person.

Figure 7 .
Figure 7. Timing diagram in the case when all operations run sequentially on one low-power computer (a) and when they run on different computers (b).