Object segmentation in cluttered environment based on gaze tracing and gaze blinking

People with disabilities, such as patients with motor paralysis conditions, lack independence and cannot move most parts of their bodies except for their eyes. Supportive robot technology is highly beneficial in supporting these types of patients. We propose a gaze-informed location-based (or gaze-based) object segmentation, which is a core module of successful patient-robot interaction in an object-search task (i.e., a situation when a robot has to search for and deliver a target object to the patient). We have introduced the concepts of gaze tracing (GT) and gaze blinking (GB), which are integrated into our proposed object segmentation technique, to yield the benefit of an accurate visual segmentation of unknown objects in a complex scene. Gaze tracing information can be used as a clue as to where the target object is located in a scene. Then, gaze blinking can be used to confirm the position of the target object. The effectiveness of our proposed method has been demonstrated using a humanoid robot in experiments with different types of highly cluttered scenes. Based on the limited gaze guidance from the user, we achieved an 85% F-score of unknown object segmentation in an unknown environment.


Introduction
Many patients suffer from devastating conditions such as amyotrophic lateral sclerosis (ALS) [1], brain stroke, and muscular dystrophy [2]. These patients usually retain full consciousness but can only blink or move their eyebrows. As humanoid robots are expected to coexist with humans in everyday environments in the near future, the ability of a humanoid robot to assist such patients is in high demand. The most common situation in which patients need assistance from a robot is an object-search application, where the robot is expected to deliver a specific object in the environment according to the user's needs. Even in this common application, object segmentation remains a very challenging problem.
To autonomously segment objects in a cluttered environment, many techniques have been proposed in the computer vision field, such as the active contour model (snake) [3], level sets [4], and graph cuts [5]. To extend the graph cut method, many researchers have incorporated predefined information, such as shape [6], or kernel methods [7]. Recently, many learning-based approaches [8], in which the robot learns object categories in advance, have been introduced. However, it is still hard to achieve high accuracy from passive observation. Learning-based methods are also hard to apply in practical situations, in which new objects are introduced every day, because it is not practical to register all new objects in a database.
Interaction with a user makes object segmentation feasible for practical situations. Based on the interactive capability of a patient, we can use facial engagement, head pose commands, and Brain Machine Interface (BMI) to interact with the robots [9]. Recently, BMI has played an important role in helping patients control electronic appliances and the movements of robotic or prosthetic limbs [10]. However, specifying an object's position and boundaries in a cluttered environment is difficult, as transmitting a precise command via these interaction methods is complicated.
*Correspondence: photcyber@gmail.com, Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
To let disabled patients interact with a robot in an object segmentation task, we introduce a Gaze Tracking Device (GTD). The gaze-interaction task is divided into three steps. First, the patient gazes at the target object, using either their own vision or visual feedback from the robot's vision, and then gives a command by blinking to select the target object. Next, the robot navigates to the target object and sends visual feedback to the user's monitor, which allows the user to select and confirm the target object with another gaze. Finally, the robot grasps the object and brings it back to the user (Fig. 1).
In this paper, we propose a gaze-based object segmentation method that uses gaze interaction from users to optimize image labels and to segment mixed, multicolored, and occluded objects in a cluttered environment. As the objects in an image are multicolored, noisy, and of low resolution, we transform the original image to a higher-dimensional kernel space using iterative image segmentation, and we propose a method to filter the target object label from the image based on gaze information. Specifically, we propose two types of gaze interaction for object segmentation:
• Gaze Tracing (GT): in which the user passively gazes into the area surrounding the object in an image.
• Gaze Blinking (GB): in which the user blinks at the center of the object for confirmation.
These two types of gaze can be integrated with vision-based object segmentation to achieve accurate object segmentation in a cluttered environment.

System architecture
This section briefly explains the object-search system architecture (Fig. 2) [11], which allows a robot to navigate to the target location according to the user's gaze command. The calibration of the GTD is initially conducted with the method described in [12]. Then, a Kalman filter [13] is applied to smooth the gaze position $(g^p_x, g^p_y)$. Finally, our proposed gaze-based object segmentation is applied for target object segmentation.
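As an illustration of the gaze-smoothing step, the following minimal sketch applies a scalar constant-position Kalman filter to each gaze coordinate independently. The state model, the noise variances `q` and `r`, and the function names are assumptions made for this example; the filter configuration actually used in the system is not specified in the text.

```python
def kalman_smooth(samples, q=1.0, r=25.0):
    """Smooth a 1-D sequence with a constant-position Kalman filter.

    q is an assumed process-noise variance, r an assumed measurement-noise
    variance (both hypothetical values, in squared pixels).
    """
    if not samples:
        return []
    x = float(samples[0])  # state estimate, initialised at the first sample
    p = r                  # variance of the estimate
    out = [x]
    for z in samples[1:]:
        p = p + q              # predict: state unchanged, uncertainty grows
        k = p / (p + r)        # Kalman gain
        x = x + k * (z - x)    # correct with the measurement residual
        p = (1.0 - k) * p
        out.append(x)
    return out


def smooth_gaze(points, q=1.0, r=25.0):
    """Filter the x and y gaze coordinates separately and re-pair them."""
    xs = kalman_smooth([pt[0] for pt in points], q, r)
    ys = kalman_smooth([pt[1] for pt in points], q, r)
    return list(zip(xs, ys))


# Raw gaze samples with one jump at index 3.
raw = [(100, 200), (104, 198), (97, 203), (160, 240), (101, 199), (99, 201)]
smoothed = smooth_gaze(raw)
```

With these assumed variances the jump at the fourth sample is strongly damped, which is the behaviour wanted before the gaze points are used for segmentation.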

Blink detection
We also detect the user's eye blinks using the method of [14], which allows voluntary long blinks to be distinguished from involuntary short blinks. Different blink patterns issue different robot commands, as follows:
• Location command: This command specifies a target's location. It is issued by performing two consecutive blinks at the target's location in the real environment.
• Object confirmation command: After gaze tracing, the user confirms an object's location by gaze blinking, i.e., by performing three consecutive blinks.

Fig. 1 The overall concept of an object search using a Gaze Tracking Device (GTD). The patient gazes at the target object and then the robot autonomously navigates the environment to search for the object while providing visual feedback to the patient
Each blink interval must be around 300 ms. This blink pattern does not occur in natural eye motion and is clearly separated from involuntary short blinks, so it does not increase cognitive load and does not disturb the user's natural gaze pattern.
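The mapping from blink patterns to commands can be sketched as follows. This is an illustrative sketch only: it interprets the blink interval as the time between blink onsets, and the tolerance `tolerance_ms`, the function name, and the command strings are assumptions that do not come from the blink detector of [14].

```python
def classify_blink_command(blink_times_ms, interval_ms=300, tolerance_ms=100):
    """Map a burst of blink onset times (ms) to a command name.

    Blinks count as one voluntary burst only if every gap between
    consecutive onsets lies within interval_ms +/- tolerance_ms.
    """
    if len(blink_times_ms) < 2:
        return "none"  # a single blink is treated as involuntary
    gaps = [b - a for a, b in zip(blink_times_ms, blink_times_ms[1:])]
    if any(abs(g - interval_ms) > tolerance_ms for g in gaps):
        return "none"
    if len(blink_times_ms) == 2:
        return "location"             # two consecutive blinks
    if len(blink_times_ms) == 3:
        return "object_confirmation"  # three consecutive blinks
    return "none"
```

For example, `classify_blink_command([0, 310])` yields the location command, while three blinks spaced roughly 300 ms apart yield the object confirmation command.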

Robot teleoperation using gaze
The user can teleoperate the robot by performing a location command, in which the user gazes directly at the goal location in the environment. Our robot platform can independently navigate to that target location based on a prior map and indirect search algorithms [15]. Once the robot arrives at the target location, it streams visual data of the target object to the user for gaze interaction.
Since the robot has a limited field of view, it can adapt its observation point to provide different perspectives of the object, following the approach of the authors' previous work [11].

Methods
This section presents our proposed gaze-based object segmentation. Our strategy is to use gaze interaction to enhance vision-based target-object segmentation. The gaze interaction is designed so that object segmentation can be performed with as few gaze interactions as possible. First, the user applies only a few examples of gaze tracing (GT) and gaze blinking (GB) to the target object. Afterward, we apply image segmentation to divide the image into different labels, with parameters optimized from the obtained gaze information. Finally, we filter the target object label based on the GT and GB information.

Image segmentation

Label creation
The goal is to segment image I into K regions ($r_{l=1}, r_{l=2}, \ldots, r_{l=K}$) that are smooth and consistent, so that the user can perform gaze interaction on them. This is a labeling problem, in which we use the graph cut method [7] to find a labeling f that minimizes the energy

$$E(f) = \sum_{p} D_p(f) + \alpha \sum_{\{p,q\} \in \mathcal{N}} S_{p,q}(f) \quad (1)$$

where α is a resolution constant, $\mathcal{N}$ is the set of neighboring pixel pairs, and $D_p(f)$ and $S_{p,q}(f)$ are the data term and the smoothness term, respectively. The data term is defined as:

$$\sum_{p} D_p(f) = \sum_{l=1}^{K} \sum_{p \in R_l} (c_l - I_p)^2 \quad (2)$$

where $c_l$ is the piecewise constant model parameter of region $R_l$ and $I_p$ is the image value at pixel p. The data term is derived from the observed data and measures the cost of assigning label $f_p$ to pixel p. The smoothness term is defined as:

$$S_{p,q}(f) = s(f(p), f(q)) \quad (3)$$

where $s(f(p), f(q))$ measures the cost of assigning the labels $f_p$, $f_q$ to neighboring pixels p, q. We define $s(f(p), f(q))$ as the truncated squared absolute distance:

$$s(f(p), f(q)) = \min\left(ct^2, \left|c_{f(p)} - c_{f(q)}\right|^2\right) \quad (4)$$

where ct is the constant of the truncated squared absolute distance.
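To make the roles of the data term and the smoothness term concrete, the sketch below evaluates such an energy for a given labeling of a small grayscale image. It assumes a squared-difference data term against a per-label constant, a truncated squared difference between label constants as the smoothness cost, and a 4-connected neighborhood; the neighborhood choice and all names are assumptions of this example rather than details stated in the text.

```python
def labeling_energy(image, labels, centroids, alpha, ct):
    """Data term plus alpha times a truncated smoothness term.

    image, labels: equally sized 2-D lists; centroids[l] is the constant
    model parameter of label l; ct is the truncation constant.
    """
    rows, cols = len(image), len(image[0])
    data = 0.0
    smooth = 0.0
    for y in range(rows):
        for x in range(cols):
            l = labels[y][x]
            data += (centroids[l] - image[y][x]) ** 2
            # Visit each neighboring pair once: right and down neighbors.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < rows and nx < cols:
                    diff = centroids[l] - centroids[labels[ny][nx]]
                    smooth += min(ct ** 2, diff ** 2)
    return data + alpha * smooth


image = [[10, 12, 200],
         [11, 13, 205]]
clean = [[0, 0, 1],
         [0, 0, 1]]
noisy = [[0, 1, 1],
         [0, 0, 1]]
centroids = [11.5, 202.5]
e_clean = labeling_energy(image, clean, centroids, alpha=0.5, ct=50)
e_noisy = labeling_energy(image, noisy, centroids, alpha=0.5, ct=50)
```

The labeling that follows the true intensity boundary has the lower energy; mislabeling a single pixel raises both the data cost and the number of penalized label boundaries.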

Kernel mapping
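The data term $D_p(f)$ of the image data is mapped through a kernel function so that the method is able to segment data that are not linearly separable. Therefore, Eq. 1 is transformed to:

$$E_K(\{c_l\}, f) = \sum_{l} \sum_{p \in R_l} \left\|\phi(c_l) - \phi(I_p)\right\|^2 + \alpha \sum_{\{p,q\} \in \mathcal{N}} s(f(p), f(q)) \quad (5)$$

where φ(.) is a nonlinear mapping from image space to a higher dimensional feature space depending on the number of segmentation regions $N_R$. Based on the kernel trick [7], the distance in feature space can be expressed through a kernel function $\mathcal{K}(y, z) = \phi(y)^T \phi(z)$ as:

$$\left\|\phi(I_p) - \phi(c_l)\right\|^2 = \mathcal{K}(I_p, I_p) + \mathcal{K}(c_l, c_l) - 2\,\mathcal{K}(I_p, c_l) \quad (6)$$

where we use the radial basis function (RBF) kernel, which is suited for pattern data clustering. The RBF kernel is defined as:

$$\mathcal{K}(y, z) = \exp\left(-\frac{\|y - z\|^2}{\sigma^2}\right) \quad (7)$$

where σ is the kernel width.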
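The kernel trick can be illustrated with a few lines of code: the squared distance between two mapped samples is computed from kernel evaluations alone, without ever forming the mapping φ. The sketch assumes an RBF kernel of the form exp(-||y - z||² / σ²); the width `sigma` and the function names are hypothetical.

```python
import math


def rbf_kernel(y, z, sigma=1.0):
    """RBF kernel between two feature vectors (sequences of floats)."""
    sq = sum((a - b) ** 2 for a, b in zip(y, z))
    return math.exp(-sq / sigma ** 2)


def kernel_distance(y, z, sigma=1.0):
    """Squared feature-space distance ||phi(y) - phi(z)||^2 via the kernel trick."""
    return (rbf_kernel(y, y, sigma) + rbf_kernel(z, z, sigma)
            - 2.0 * rbf_kernel(y, z, sigma))


same = kernel_distance((0.2, 0.4, 0.1), (0.2, 0.4, 0.1))
far = kernel_distance((0.0, 0.0, 0.0), (10.0, 10.0, 10.0))
```

For an RBF kernel this distance is bounded: it is 0 for identical samples and approaches 2 for samples that are far apart, which limits the influence of outlier pixels on the data term.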

Optimization
To obtain an optimized image label, Eq. 5 is minimized with an iterative two-step optimization method. The first step updates the centroid data $c_l$ of each label based on the following fixed-point condition [7]:

$$c_l - \frac{\sum_{p \in R_l} \mathcal{K}(I_p, c_l)\, I_p}{\sum_{p \in R_l} \mathcal{K}(I_p, c_l)} = 0 \quad (8)$$

where $\mathcal{K}$ is the RBF kernel. The second step finds the optimal label of the image given the label centroid data provided by the first step. Each iteration thus updates the centroid data and creates a new optimal label. The algorithm iterates these two steps until the energy converges to a local minimum.
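The alternation between the two steps can be sketched on a one-dimensional list of pixel intensities. This is a simplified stand-in, not the method of [7]: the graph cut of the second step is replaced by a nearest-centroid assignment in kernel space (that is, the smoothness term is ignored), the centroid update uses a kernel-weighted mean as a fixed-point iteration, and `sigma`, the iteration counts, and all names are assumptions.

```python
import math


def rbf(a, b, sigma):
    """RBF kernel between two scalar intensities."""
    return math.exp(-((a - b) ** 2) / sigma ** 2)


def nearest_label(v, centroids, sigma):
    # For an RBF kernel, the largest kernel value is the smallest kernel distance.
    return max(range(len(centroids)), key=lambda l: rbf(v, centroids[l], sigma))


def update_centroid(members, c, sigma, iters=20):
    """Fixed-point iteration c <- sum(K(v, c) * v) / sum(K(v, c))."""
    for _ in range(iters):
        weights = [rbf(v, c, sigma) for v in members]
        total = sum(weights)
        if total == 0.0:
            break
        c = sum(w * v for w, v in zip(weights, members)) / total
    return c


def two_step_segmentation(values, init_centroids, sigma=50.0, max_iter=10):
    centroids = list(init_centroids)
    labels = [nearest_label(v, centroids, sigma) for v in values]
    for _ in range(max_iter):
        # Step 1: update the centroid of every non-empty label.
        for l in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == l]
            if members:
                centroids[l] = update_centroid(members, centroids[l], sigma)
        # Step 2 (stand-in for the graph cut): relabel by kernel-space distance.
        new_labels = [nearest_label(v, centroids, sigma) for v in values]
        if new_labels == labels:
            break  # labels are stable, so the iteration has converged
        labels = new_labels
    return labels, centroids


values = [10, 12, 11, 200, 205, 198, 13, 202]
labels, centroids = two_step_segmentation(values, [0.0, 255.0])
```

In the full method the second step would minimize the complete energy, including the smoothness term, so that neighboring pixels are encouraged to share a label.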

Target object selection by gaze interaction
Even though the image has been segmented into different labels, the target-object label still needs to be defined by the user, and we use gaze interaction to assist in this object selection. There are several existing gaze-based user interfaces. Rivu et al. [16] proposed using the user's gaze to gradually reveal information on demand. In Augmented Reality and Virtual Reality, targets can be confirmed using gaze selection instead of a mouse or gestures [17]. Gaze is also used to select objects in 3D environments based on hybrid gaze and controller techniques [18]. Recently, the combination of gaze and gestures has become an active field with several applications, such as object manipulation [19] and gaze-enhanced menu interfaces [20].
In this work, we focus on gaze-only interaction as a potentially implicit and effortless method for selecting an object from a cluttered environment. The user can provide the robot with a clue about the target object's boundaries by gaze tracing and blinking at the target object on a computer screen. The gaze interaction process is divided into two steps:

Gaze tracing (GT)
The user is asked to gaze at the target object on the label image (Fig. 3a). This step is defined as a passive gaze. Based on the positions of the gaze tracing, we then build a heat map [21], which is an ellipsoid distribution centered at each gaze position $(g^p_x, g^p_y)$ and weighted by the heat-map weight $w_{hm}$; here x and y denote the position of each pixel in the image, and p is the index of the gaze points, arranged in chronological order. The single ellipsoid heat maps from the individual gaze points are combined into one heat map.
For gaze tracing, it is hard to decide which of the labels the user gazed at should be assembled into the object label, since the user can easily be distracted and look at other spots. As presented in Fig. 3b, we asked the user to look at the scissors, which was the target object. However, the user also unintentionally gazed at other points. Including all gaze-tracing points yields an error in the object-label selection.
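A heat map of this kind can be sketched as follows. The exact ellipsoid distribution is not reproduced here, so the example assumes a Gaussian profile with hypothetical axis scales `sx` and `sy` and assumes that the contributions of the individual gaze points are summed; these choices and all names are illustrative only.

```python
import math


def gaze_heat_map(gaze_points, width, height, w_hm=1.0, sx=20.0, sy=10.0):
    """Sum one elliptical Gaussian per gaze point over a width x height grid."""
    heat = [[0.0] * width for _ in range(height)]
    for gx, gy in gaze_points:
        for y in range(height):
            for x in range(width):
                d = ((x - gx) / sx) ** 2 + ((y - gy) / sy) ** 2
                heat[y][x] += w_hm * math.exp(-0.5 * d)
    return heat


# Two gaze points close together and one isolated point.
points = [(10, 10), (12, 11), (30, 20)]
heat = gaze_heat_map(points, width=40, height=30)
```

Pixels that are gazed at repeatedly accumulate more heat than pixels touched by a single fixation, which is what later makes local maxima of the heat map useful as candidate object locations.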
We apply a Gaussian Mixture Prediction (GMP) [22] approach to fit Gaussian distributions to the GT data, as shown in Fig. 3c. The number of clusters is set to 2, since we want to separate GT points at the target location from GT points at non-target locations. The covariance of each Gaussian distribution determines the contour of the cluster. GMP uses the Expectation Maximization (EM) algorithm [23] to optimize the soft clustering. In our case, a non-negative regularization of $10^{-6}$ is added to the diagonal of the covariance, and the convergence threshold for stopping the EM iterations is set to $10^{-3}$.
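The clustering of GT points into a target group and a distraction group can be sketched with a small two-component Gaussian mixture fitted by EM. This pure-Python version is a stand-in for illustration, not the implementation of [22]: the deterministic initialization, the initial covariance, and the use of the change in mean log-likelihood as the stopping rule are assumptions, while `reg` and `tol` mirror the values quoted in the text.

```python
import math


def gauss2d(p, mean, cov):
    """Density of a 2-D Gaussian with covariance [[a, b], [b, d]]."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    maha = (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))


def fit_two_gaussians(points, reg=1e-6, tol=1e-3, max_iter=100):
    # Assumed initialization: the first point and the point farthest from it.
    first = points[0]
    far = max(points, key=lambda p: (p[0] - first[0]) ** 2 + (p[1] - first[1]) ** 2)
    means = [first, far]
    covs = [[[100.0, 0.0], [0.0, 100.0]] for _ in means]
    weights = [0.5, 0.5]
    prev = None
    for _ in range(max_iter):
        # E-step: responsibilities of each component for each point.
        resp, loglik = [], 0.0
        for p in points:
            dens = [w * gauss2d(p, m, c) for w, m, c in zip(weights, means, covs)]
            total = sum(dens)
            loglik += math.log(total)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and regularized covariances.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            mx = sum(r[j] * p[0] for r, p in zip(resp, points)) / nj
            my = sum(r[j] * p[1] for r, p in zip(resp, points)) / nj
            sxx = sum(r[j] * (p[0] - mx) ** 2 for r, p in zip(resp, points)) / nj + reg
            syy = sum(r[j] * (p[1] - my) ** 2 for r, p in zip(resp, points)) / nj + reg
            sxy = sum(r[j] * (p[0] - mx) * (p[1] - my) for r, p in zip(resp, points)) / nj
            means[j] = (mx, my)
            covs[j] = [[sxx, sxy], [sxy, syy]]
            weights[j] = nj / len(points)
        mean_ll = loglik / len(points)
        if prev is not None and abs(mean_ll - prev) < tol:
            break
        prev = mean_ll
    labels = [max((0, 1), key=lambda j: r[j]) for r in resp]
    return means, covs, labels


gt_points = [(98, 101), (102, 99), (100, 103), (101, 97), (99, 100), (103, 102),
             (30, 40), (33, 38), (28, 43)]
means, covs, labels = fit_two_gaussians(gt_points)
```

In this toy example the six points around the object and the three distraction points end up in separate clusters, and each cluster's covariance describes its elliptical contour.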

Gaze blinking (GB)
Label selection based only on gaze tracing yields an object label with noise, as presented in Fig. 3d. To confirm the target object, the user is required to give three consecutive blinks at the center of the target object. GB is used to confirm which cluster belongs to the target object. A mask centered on the confirmation location (presented as a red ellipsoid in Fig. 3b) is then created, whose size is 2 standard deviations of the confirmed cluster. The mask keeps only the local maximum points of the heat map that lie inside it, and the labels at these points are integrated into the object label, as presented in Fig. 3e. Points that fall outside the ellipsoid are considered outliers.
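The confirmation step can be sketched as an elliptical gate around the blink location. The example assumes that each cluster is described by a mean and a 2 × 2 covariance, that the confirmed cluster is the one closest to the GB point in Mahalanobis distance, and that the mask is the 2-standard-deviation ellipse of that cluster re-centered on the GB point; the function names and data layout are hypothetical.

```python
def mahalanobis_sq(p, center, cov):
    """Squared Mahalanobis distance of p from center for covariance [[a, b], [b, d]]."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = p[0] - center[0], p[1] - center[1]
    return (d * dx * dx - 2.0 * b * dx * dy + a * dy * dy) / det


def confirm_with_blink(gb_point, clusters, candidates, n_std=2.0):
    """clusters: list of (mean, cov). Returns (confirmed cluster index, kept points)."""
    idx = min(range(len(clusters)),
              key=lambda j: mahalanobis_sq(gb_point, clusters[j][0], clusters[j][1]))
    cov = clusters[idx][1]
    # Keep candidates inside the n_std ellipse centered on the blink location.
    kept = [p for p in candidates if mahalanobis_sq(p, gb_point, cov) <= n_std ** 2]
    return idx, kept


clusters = [((100, 100), [[9.0, 0.0], [0.0, 16.0]]),
            ((30, 40), [[4.0, 0.0], [0.0, 4.0]])]
candidates = [(98, 101), (104, 95), (30, 40), (120, 100)]
idx, kept = confirm_with_blink((101, 99), clusters, candidates)
```

Candidates outside the ellipse, such as the distraction point at (30, 40), are discarded as outliers.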

Smoothness and number of region optimization
We use gaze pattern information to choose the smoothing parameter (α) and the number of segmented regions (K). From our observation, users tend to produce more gaze points (GT) when they select complex or multicolored objects and only a small number of gaze points on a simple object. As a result, we propose to choose the number of segmented regions K as follows:

$$K = w_K \frac{P}{W_o H_o} \quad (11)$$

where P is the number of local maximum points of the heat map from gaze tracing (GT) that lie within the ellipsoid defined by GB, $W_o$ and $H_o$ are the lengths of the major and minor axes of the ellipsoid derived from GT and GB, and $w_K$ is the weight of the number-of-regions adaptation. Furthermore, we update α, the smoothing term of the regions, by

$$\alpha = 1 - w_a \frac{P}{W_o H_o} \quad (12)$$

where $w_a$ is the weight of the alpha adaptation. From observation, we found that $w_K = 1.2$ and $w_a = 1$ give the best performance on our dataset and the TOSD dataset. With K and α optimization, we can achieve high accuracy and low recall rate of object segmentation while reducing the segmentation time of a simple object.

Experiment setup
We developed and tested the system on a humanoid robot called ENON [24]. The robot was equipped with two RGB-D sensors (Kinect V1), one for navigation and another on its head at a height of 1.8 m to stream visual data to the user. Users were asked to wear GTDs and accelerometers (to measure the user's head orientation), as presented in Fig. 4.
Our experiments were conducted in an office environment, as presented in Fig. 4. Five users, all master's students at Osaka University, Japan, participated. Their ages ranged from 23 to 31 years (M = 27.20, SD = 3.35); 3 participants were male and 2 were female. None of the users had prior experience using GTDs. The users used our system to guide the robot to a target location and to perform gaze interaction with objects on a table.
Each user individually performed a calibration exercise to confirm gaze precision. The subject was required to look at different reference points from different ranges. Overall, the average error was found to be an acceptable 1.51° with a variance of 0.77°. The robot sent a visual stream with an image size of 640 × 480 pixels, which was then processed in MATLAB running on Windows 10 on a personal computer (PC) (E5-1620 3.50 GHz Xeon CPU, 16384 MB RAM, NVIDIA Quadro K2200 graphics card). The processing time varied depending on the number of iterations, the number of segmentation regions, and the resolution set for the kernel graph cut segmentation.
Ten objects of different shape, size, and appearance were prepared. The experiment was conducted 10 times for each object, varying the object's appearance from strongly occluded with textured sides to sparsely or partially textured, non-textured, multicolored, or with unicolored sides.

Evaluation
We evaluated our proposed segmentation algorithm on two datasets: our dataset and the TOSD dataset [25]. We selected TOSD as a comparison dataset since it consists of scenes with varied object-configuration complexities. It is composed of images with complex and cluttered scenes, as well as scenes where only several boxes or other simple objects are presented. The TOSD dataset consists of 111 scenes for training and 131 scenes for testing.
We also compare our object-segmentation method with the active [26] and saliency [25] segmentation methods for GT, GB, and the combination of GT and GB. The quality of segmentation was measured by the precision and recall of the segmentation, i.e., how many points in the final object label corresponded to the ground truth (the object manually selected by the user). We compared the methods using the F-measure, defined by

$$F = \frac{2PR}{P + R} \quad (13)$$

where the precision P is the fraction of the segmentation mask overlapping with the ground truth and the recall R is the fraction of the ground truth overlapping with the segmentation mask.
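As a concrete reading of these definitions, the sketch below computes precision, recall, and the F-measure (taken here as the harmonic mean of precision and recall) from two binary masks. The mask representation as nested lists of 0/1 and the function name are assumptions of this example.

```python
def f_measure(mask, truth):
    """Return (precision, recall, F) for two equally sized binary 2-D masks."""
    overlap = sum(1 for rm, rt in zip(mask, truth)
                  for m, t in zip(rm, rt) if m and t)
    mask_area = sum(sum(row) for row in mask)
    truth_area = sum(sum(row) for row in truth)
    if overlap == 0:
        return 0.0, 0.0, 0.0  # no overlap: all three scores are zero
    precision = overlap / mask_area   # fraction of the mask on the ground truth
    recall = overlap / truth_area     # fraction of the ground truth covered
    return precision, recall, 2.0 * precision * recall / (precision + recall)


mask = [[1, 1, 0],
        [1, 0, 0]]
truth = [[1, 1, 1],
         [0, 0, 0]]
precision, recall, f = f_measure(mask, truth)
```

Here two of the three mask pixels lie on the ground truth and two of the three ground-truth pixels are covered, so precision, recall, and F are all 2/3.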

Object label segmentation analysis
As a piecewise constant model, the system starts by segmenting an image into different regions based on K-means clustering. The result is presented in Fig. 5, in which the initial label image is not smooth and consistent. A target object selected by gaze interaction from this initial label would still result in an object label with noise from areas both inside and outside the label. For example, consider an initial object label calculated from K-means clustering in which the number of labels is set to four. The initial label is converted via the RBF kernel function to a higher dimension of four different images. As a result, the characteristic of each label is represented in its own kernel image.
Next, the iterative graph cut algorithm is applied to each kernel image. The algorithm iteratively merges the small noise inside the label while preserving the minimal energy of each label. The advantage of the kernel images is that they are resilient to noise, owing to the Gaussian assumption of the RBF kernel function. From our observation, the algorithm converges within five iterations, as presented in Fig. 6. With the resulting label image, only one gaze point on each label is sufficient to include that label as part of the object label.

Fig. 4 The experiment setup

Fig. 5 Left: the example of the initial object label is based on K-means clustering, in which the number of labels is set to 4. The initial label is not smooth and consistent, so it is converted via RBF kernel function to a higher dimension of four different images in which the distribution of each kernel is changed based on each label characteristic

Gaze-based object segmentation result
Based on the optimized label obtained from the iterative graph cut, the user can select the region with GT points and confirm it with GB. Example results are presented in Fig. 7, in which the second and third columns show the optimized boundaries and labels, respectively. The user viewed the object label image and performed GT and GB interaction. The result of gaze tracing is presented in the fourth column of Fig. 7, in which the target label is selected by gaze tracing only, so noise remains. The final object, in which the noise is removed by GB, is presented in the fifth column of Fig. 7. The result confirms that this system is applicable to the segmentation of multicolored objects (as shown in the first row) and occluded objects (as shown in the second and third rows). Figure 8 shows a comparison of the proposed segmentation performance with an active segmentation approach. For GB only, active segmentation performed at an average of 38.7%, since the algorithm only works when the object has a linear color distribution. Objects 1, 2, and 3, however, were multicolored and noisy, so the performance of the active segmentation algorithm dropped to an average of 19% on them. In contrast, the kernel-based method handled noise and nonlinear data more robustly, with an average precision of 45%.
By integrating GT and GB into the object-segmentation algorithm, as presented in Fig. 8, the robot achieved better segmentation, and the performances of the kernel-based and active segmentation approaches improved to 86.9% and 80.3% precision, respectively. This was due to the GT and GB clues from the user, which helped the system integrate multicolor labels into the same object.

Interaction analysis
Fig. 6 The result of energy optimization of iterative kernel segmentation. At each iteration, the algorithm updates the kernel image and output label. Therefore, the final label at the last iteration has a smooth and consistent label with a minimum energy guarantee. This optimizes the label (where all the holes are filled), allowing the user to easily interact with the label

Since each interaction may lead to a different segmentation for an object, we also analyzed the results with four different F-scores. First refers to the segmentation from the first interaction, Best refers to the best segmentation with respect to the ground truth, All refers to the average score over all the segmentations for an object, and Worst refers to the worst score among all the segmentations for an object.
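The four summary scores can be computed from the per-interaction F-scores of one object as in the short sketch below. The function name and the input format (a list of F-scores in the order in which the interactions were performed) are assumptions of this example.

```python
def summarize_f_scores(f_scores):
    """Summarize the F-scores of repeated interactions on one object."""
    if not f_scores:
        raise ValueError("at least one interaction is required")
    return {
        "First": f_scores[0],                  # score of the first interaction
        "Best": max(f_scores),                 # best segmentation
        "All": sum(f_scores) / len(f_scores),  # average over all interactions
        "Worst": min(f_scores),                # worst segmentation
    }


summary = summarize_f_scores([0.8, 0.9, 0.6])
```

A small gap between Best and Worst indicates that the segmentation is resilient to where exactly the user places the GT and GB points.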
Ideally, we would like each interaction to yield the best segmentation. However, the segmentation algorithm depends strongly on the position of the GT and GB points. Therefore, an algorithm that is resilient to different interactions (i.e., different sets of GT and GB points) is desirable. To show how GT and GB interaction can improve segmentation for other, traditional approaches (Active and Saliency), we also applied three types of interaction, GT-only, GB-only, and GT + GB, to the conventional approaches. For GB-only interaction, without GT points, GMP uses the centroids of labels that have a similar color (± 5 of hue value) as the points for clustering. For example, if the GB point is located on a green label, all the centroids of the green labels are used in the GMP process. If the target object has a single color or similar colors, GB-only interaction can assist segmentation well. However, GB-only interaction is not robust for multicolor object segmentation. For GT-only interaction, since there is no object confirmation from a GB point, the system selects the cluster that has more GT points as the target object. This approach can fail when the user wants to select a tiny object, since the GT points on a small object are usually fewer than the distraction points.
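The GB-only rule for choosing clustering points can be sketched as a hue filter over label centroids. The hue scale is not stated in the text, so this example assumes hue in degrees on a circular 0 to 360 scale; the data layout and names are likewise hypothetical.

```python
def similar_hue_centroids(gb_hue, label_info, tol=5.0, hue_range=360.0):
    """Return centroids of labels whose hue is within tol of the GB label's hue.

    label_info: list of (centroid_xy, hue) pairs; hue is treated as circular.
    """
    selected = []
    for centroid, hue in label_info:
        diff = abs(hue - gb_hue) % hue_range
        diff = min(diff, hue_range - diff)  # shortest distance around the circle
        if diff <= tol:
            selected.append(centroid)
    return selected


label_info = [((10, 10), 118.0), ((50, 60), 124.0), ((80, 20), 200.0), ((5, 5), 126.0)]
green_centroids = similar_hue_centroids(120.0, label_info)
```

Only the centroids returned by this filter would then be passed to the clustering step in place of GT points.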
We present all F-scores in Table 1 for all cases (All, First, Worst, and Best), for all objects, and all scenes. GT + GB interaction generally improves overall performance in all approaches and, on average, our proposed gaze-based object segmentation outperformed Active segmentation, Saliency segmentation, and Kernel segmentation by 24%, 15%, and 10%, respectively.

Comparison with state of the art methods
We also compared our segmentation algorithm with other state-of-the-art object segmentation approaches using a publicly available dataset, the Table Object Scene Dataset (TOSD) [25]. Table 2 compares our results to state-of-the-art object segmentation.
With GT and GB interaction, our method achieves an F-score of 0.75, which is an over 10% relative improvement over the previous best entry (SGN [27]) and is also 7% better than the concurrent work MASK R-CNN [28]. Compared to the best entry using fine data only, we achieve a 15% improvement. We also performed the evaluation within each individual category. Our method shows a substantial improvement in each category over the other approaches (a relative improvement of 15% on Glass and Bottle and 20% on Plate, Book, and Mug).

Discussion
As only color information from a monocular camera was used, we analyzed the characteristics of object segmentation and how the smoothness and the number of regions affect the accuracy of segmentation. Typically, the image segmentation algorithm [7] is robust enough to segment an image into fine image labels, and it can filter out noise in the higher-dimensional kernel space. However, as in typical unsupervised learning, some parameters, such as the number of regions and the smoothness, must be predefined. Single-color and multicolor objects require different sets of parameters to achieve high accuracy. In our study, we not only use GT and GB to directly segment the target object label, but also adapt the smoothness and the number of regions based on gaze clues from the user, which is another factor in achieving high accuracy (Fig. 9).
For a single-color object, the user usually produces only a few gaze-tracing points. If the label segmentation is too fine, a region of the same color can be split into several labels, and some target object labels might not be selected by gaze tracing. Therefore, a low number of regions and a high degree of smoothness should be set so that the object label is not too fine. On the other hand, the user usually produces more gaze-tracing points on different regions of a multicolor object. To achieve high precision for multicolor objects, a high number of regions and a low degree of smoothness should be set. As a result, object labels are more distinct from other object labels and can be matched with gaze-tracing points.

Conclusion
Many patients suffer from locked-in syndrome and are unable to live independently. We proposed object segmentation based on gaze interaction, which lets patients interact with a robot in an object-search application in a cluttered environment. To interact with the labels in an image, we introduced the concepts of GT and GB to help the robot determine its target object. The patient performs GT by freely gazing at the area around a target object. Then, the user provides the robot with the object location through GB, which involves three consecutive blinks. Afterwards, the kernel-based segmentation algorithm, with parameters selected from gaze information, is performed for image labeling. The result of the gaze interaction is integrated with the image labeling to confirm the final object label. Our experimental results show that the proposed gaze-based method outperforms the conventional method (with an F-score of 85% for a combination of GB and GT) for noisy, multicolored, and occluded object segmentation, with an average precision of 54.8% for GT and 86.9% for the combination of GB and GT.
Our future work will focus on integrating this system with autonomous navigation for autonomous wheelchairs.