Egocentric upper limb segmentation in unconstrained real-life scenarios

The segmentation of bare and clothed upper limbs in unconstrained real-life environments has been little explored. It is a challenging task that we tackled by training a deep neural network based on the DeepLabv3+ architecture. We collected about 46 thousand carefully labeled real-life RGB egocentric images with a great variety of skin tones, clothes, occlusions, and lighting conditions. We then extensively evaluated the proposed approach and compared it with state-of-the-art methods for hand and arm segmentation, e.g., Ego2Hands, EgoArm, and HGR-Net. We used our test set and a subset of the EgoGesture dataset (EgoGestureSeg) to assess the model's generalization on challenging scenarios. Moreover, we tested our network on hand-only segmentation since it is a closely related task. We carried out a quantitative analysis through standard metrics for image segmentation and a qualitative evaluation by visually comparing the obtained predictions. Our approach outperforms all the compared models in both tasks, proving its robustness to hand-to-hand and hand-to-object occlusions, dynamic user/camera movements, different lighting conditions, skin colors, clothes, and limb/hand poses.


Introduction
Hands are one of the main channels of human communication, allowing people to relate to each other and interact with objects. Our hands are often in our field of vision, for example, during daily activities (Pirsiavash and Ramanan 2012). In many cultures, hands support verbal communication and increase comprehensibility by adding meaning and emphasis to words (Maricchiolo et al. 2005). Many applications involving the use of the hands are based on localization methods and, in particular, on hand segmentation. The latter is usually used as a pre-processing step in various contexts, such as hand gesture recognition (Poularakis and Katsavounidis 2015), human-robot interaction (HRI) (Ju et al. 2017), human-computer interaction (HCI) (Maurya et al. 2018), and mixed reality (MR) (Herumurti et al. 2017).
With the spread of wearable devices, systems for analyzing and detecting hands from a first-person perspective, called egocentric or first-person vision (FPV), have increasingly developed (Betancourt et al. 2015). Although several approaches were designed to tackle this task from the third-person point of view (TPV) (Matilainen et al. 2016; Bojja et al. 2019), the visual appearance differs significantly from the egocentric view. FPV combines the challenges of segmentation (Kok and Chan 2016; Ren et al. 2020; Gruosso et al. 2021a; Minaee et al. 2021), mainly due to a large variety of backgrounds and the presence of shadows and occlusions, with the inherent difficulties of ego vision (Alletto et al. 2015), which involves rapid changes in the lighting conditions of the captured scene and dynamic movement of the camera and the wearer that can cause motion blur (Li and Kitani 2013).
In this study, we extended the hand segmentation task, focusing on the upper limb in egocentric vision and unconstrained real-world scenarios, where not only the hand but also the rest of the upper limb is framed, for example, when using RGB cameras with a wide field of view. Although several approaches exist, most of them focus only on the hand up to the wrist (Urooj and Borji 2018) or the bare arm (Li and Kitani 2013; Wang et al. 2019; Lin and Martinez 2020). We were, instead, interested in the whole limb, also taking into account clothes and occlusions, which further increases the task difficulty.
To achieve our goal and overcome the limitations of existing methods, we trained a deep convolutional neural network for upper limb segmentation in egocentric vision. It is based on the DeepLabv3+ architecture (Chen et al. 2018), a state-of-the-art (SOTA) deep convolutional network for semantic segmentation (Gruosso et al. 2021b, c). The main problem we faced was the lack of well-annotated real-life RGB images covering all case studies. Larger datasets consist of synthetic data collected in virtual environments or semi-synthetic composited images (e.g., obtained using a green screen setup) that often lack realism (Mueller et al. 2018), limiting the model's generalization ability and making them unsuitable for real-world unconstrained applications.
Therefore, we collected a real-life upper limb segmentation dataset in FPV that consists of about 46 thousand varied RGB images with accurate labels, which may enable a deep neural network to learn a wide range of realistic activities and achieve good results without any fine-tuning or domain adaptation. In particular, the data includes carefully selected images and labels from two well-known datasets (EDSH (Li and Kitani 2013) and TEgO (Lee and Kacorri 2019)) and our manually labeled EgoCam dataset, which was acquired using two different cameras from an egocentric point of view. EgoCam increases the variety of the upper limb segmentation dataset and includes additional limb sections.
In this paper, we mainly focused on testing and comparing our approach with SOTA networks for hand and arm segmentation, extending our previous works (Gruosso et al. 2021b, c). We performed a quantitative analysis using well-known metrics for image segmentation (Chen et al. 2018; Gruosso et al. 2021a; Minaee et al. 2021) and a qualitative assessment comparing the obtained predictions. In particular, the quantitative metrics used are mean pixel Accuracy (mAcc), Intersection over Union (IoU), and F1 score (see Sect. 4).
We focused on two tasks: whole upper limb segmentation and hand-only segmentation. Both were considered in an egocentric view and unconstrained real-life scenarios. Our trained network achieved impressive results in both cases and was robust to occlusions, motion blur, and various lighting conditions, skin tones, clothes, limb positions, hand poses and sizes. To the best of our knowledge, the proposed work is the first to evaluate and prove the effectiveness of a deep learning model for upper limb segmentation in such cases, outperforming SOTA methods on average by ∼21% in mAcc, ∼26% in IoU, and ∼30% in F1 score, and to collect a vast and varied set of well-annotated real-life images.
The remainder of this paper is structured as follows: Sect. 2 provides an overview of the related works; Sect. 3 describes the neural network, the training phase, and the upper limb segmentation dataset we created; Sect. 4 illustrates the evaluation methods; Sect. 5 presents the obtained results and discusses the comparisons with the other approaches; Sect. 6 summarizes our contributions and future works; finally, Sect. 7 provides online resources about the data and code available for research purposes and the demo video of our work.

Localization methods
The localization area refers to all approaches useful for detecting hands or parts of them within images, such as detection (Narasimhaswamy et al. 2019), identification (Betancourt et al. 2017), segmentation (Lin and Martinez 2020; Gonzalez-Sosa et al. 2020), pose estimation and tracking (Zimmermann and Brox 2017; Gruosso et al. 2020; Capece et al. 2020). Hand segmentation is the most demanding hand localization task and is usually involved as an input pre-processing technique in many contexts, since it allows identifying hand regions with pixel-level detail and distinguishing the hands from the background and objects (Dadashzadeh et al. 2019; Bandini and Zariffa 2020). One of the first segmentation approaches based on deep learning was proposed by Betancourt et al. (2017), who extended traditional methods by introducing an intermediate hand identification step to detect right and left hands using a Maxwell distribution of angle and position. Cai et al. (2020) proposed an approach for hand segmentation consisting of a model adaptation framework based on a Bayesian CNN to deal with the typical generalization problem that affects this type of task and the scarcity of large well-annotated datasets.
Another approach for hand segmentation was designed by Li et al. (2019) and consists of a semi-supervised framework based on optimized noisy masks and a small number of labeled data. Video-based segmentation approaches often use temporal sequences to retrieve information on occluded hand segments and improve accuracy. Furthermore, many of them, even in the best cases with ideal lighting conditions, do not exceed an F1 score of 94%. Our approach, on the other hand, does not require information beyond the current frame and has proved to work well even with variable lighting conditions, occlusions, and self-occlusions, reaching an F1 score close to 97%. Several recent supervised methods include neural networks trained with semi-synthetic data, for example, obtained using a green screen setting and composited with new backgrounds, as proposed by Lin and Martinez (2020) and Gonzalez-Sosa et al. (2020). Semi-synthetic images can be easily collected and annotated, although they usually have an artificial appearance. In particular, foreground images do not exhibit varied lighting conditions since they are captured in controlled environments, are often not well blended with the backgrounds, and show significant chromatic discrepancies between foreground and background that can lead to poor performance in real-life conditions.
To the best of our knowledge, the proposed work is the first to design a deep learning approach for upper limb segmentation and to collect a large number of well-annotated real RGB images.

Interpretation methods
The interpretation area collects all those approaches that can deduce high-level information starting from that obtained by the localization methods. Indeed, hand segmentation is often part of the pipeline for dynamic and static gesture recognition problems (Urooj and Borji 2018; Paul et al. 2020), hand activities (Bambach et al. 2015; Nguyen et al. 2018), etc. In the field of interpretation methods, an interesting approach was proposed by Cai et al. (2017), who dealt with hand grasp analysis using an egocentric vision-based system. In particular, the visual grasp structures were learned in a supervised fashion using data provided by a wearable camera. Similarly, Bambach et al. (2015) investigated to what extent hand segmentation can help distinguish between different activities more accurately. The authors considered topics such as hand detection, segmentation, and disambiguation of interacting people in first-person videos dealing with realistic contexts, detecting hands through strong appearance models via CNNs. The main disadvantage of this type of approach is the decrease in performance on tasks performed in uncontrolled environments, where real-life hand movements occur. A recent method for gesture recognition was proposed by Chalasani et al. (2018). They introduced an end-to-end deep learning method combining an Egohand mask encoder, which is part of a hand segmentation network, with an RNN for temporal discrimination and hand gesture recognition from an egocentric viewpoint. Compared to ours, this approach uses semi-synthetic composite images obtained via a green screen. In general, we considered hand images in real-life scenarios without specific constraints, making our approach generalize better to different environmental contexts.

Application methods
Interpretation and localization methods are useful for designing real-world applications in egocentric vision, such as detecting gestures and translating them into actions for HCI applications (Rautaray and Agrawal 2015; Brancati et al. 2015; Haria et al. 2017). HCI is closely related to VR, Augmented Reality (AR), and MR applications. In this context, there is extensive use of human segmentation techniques due to the increasing success of headsets (Yueming et al. 2007; Caggianese et al. 2015; Thalmann et al. 2015), as well as for robot interaction (Ju et al. 2017). However, depth sensors are often preferred to RGB cameras, both to reduce the influence of lighting variations and to achieve more efficient hand localization (Bandini and Zariffa 2020). A human-robot interaction method based on hand gesture segmentation was proposed by Ju et al. (2017). They employed a Microsoft Kinect to capture RGB-D images in order to segment and recognize gestures useful for enabling human-robot interaction and improving the understanding and interpretation performance of a NAO robot. In particular, the RGB images and the depth map were aligned through a genetic algorithm to detect the key points. Furthermore, the authors provided an edge refinement method for the tracked hand gestures from RGB images based on Bayesian networks. However, in this approach hand segmentation is a secondary task, and indeed some results are affected by mismatched pixels between the background and the hand. In contrast, our approach is able to distinguish and clearly separate the edges of the hand from the background, showing high-quality results. Free-hand interactive AR applications could also benefit from hand segmentation. An interesting approach was designed by Dave et al. (2019), in which AR was used for analytical chemistry experiments to remove possible risks during experiments and reduce the waste of chemicals. A complete pipeline was provided in this approach, starting from egocentric videos and following three stages: frame processing, which includes hand segmentation, pose estimation, and interaction between the hand and virtual objects. VR and MR also take advantage of hand segmentation for more accurate hand tracking and gesture recognition. Indeed, Herumurti et al. (2017) explored MR technology in a virtual room arrangement application by implementing a fingertip interaction method using hand segmentation. As reported by the authors, approaches based on adaptive hand segmentation only work well in appropriate lighting conditions. Our dataset, instead, was created considering a wide range of lighting situations (sunlight, flashlight, shadow, ambient light, etc.). For this reason, the quality of our results is not affected by particular lighting conditions. In a similar application context, Maurya et al. (2018) proposed an approach to extend the interaction capability of low-cost head-mounted displays by implementing an RGB-based hand segmentation system to enable HCI. In the best cases, this approach achieves an IoU of around 94% compared to 98% for our approach. Indeed, real-time applications often have to trade off between quality and computational performance.

Upper limb segmentation network
We propose a deep neural network for egocentric upper limb segmentation in unconstrained real-world scenarios, considering a great variety of skin colors, occlusions (inter-hand and caused by objects), lighting conditions, both bare and clothed arms in different frame positions, and dynamic user/camera movements. It is based on the encoder-decoder DeepLabv3+ architecture (Chen et al. 2018), a SOTA architecture for semantic segmentation that has achieved impressive results in various research fields (Harkat et al. 2020; Wang and Liu 2021; Wu et al. 2021; Kong et al. 2021) and on many benchmark datasets. The encoder extracts low-level features and semantic information from the input image, gradually reducing the feature map size. As shown in Fig. 1, it consists of a backbone network, followed by an atrous spatial pyramid pooling (ASPP) module (Chen et al. 2017b) and a 1 × 1 convolutional layer. The ASPP module captures multi-scale context information through three atrous convolutions (Papandreou et al. 2015), a 1 × 1 convolution, and an image pooling layer in parallel with each other. We set the atrous rates of the ASPP atrous convolutions to 6, 12, and 18, respectively. The decoder, instead, is built using convolutional and bilinear upsampling operations in order to retrieve spatial information from the encoder features and refine the segmentation result, obtaining detailed object boundaries.
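The following Keras-style snippet sketches the ASPP structure described above (a parallel 1 × 1 convolution, three 3 × 3 atrous convolutions with rates 6, 12, and 18, and an image-level pooling branch, fused by a final 1 × 1 convolution). It is a simplified illustration under our assumptions, not the official DeepLabv3+ implementation; filter counts and layer choices are ours.

```python
import tensorflow as tf
from tensorflow.keras import layers

def aspp(features, filters=256, rates=(6, 12, 18)):
    """Simplified ASPP: parallel 1x1 conv, three atrous 3x3 convs,
    and image-level pooling, concatenated and fused by a 1x1 conv."""
    branches = [layers.Conv2D(filters, 1, padding="same", activation="relu")(features)]
    for rate in rates:
        branches.append(
            layers.Conv2D(filters, 3, padding="same", dilation_rate=rate,
                          activation="relu")(features))
    # Image-level pooling branch: global average pool, 1x1 conv, upsample back.
    pooled = tf.reduce_mean(features, axis=[1, 2], keepdims=True)
    pooled = layers.Conv2D(filters, 1, activation="relu")(pooled)
    pooled = tf.image.resize(pooled, tf.shape(features)[1:3])
    branches.append(pooled)
    fused = layers.Concatenate()(branches)
    return layers.Conv2D(filters, 1, padding="same", activation="relu")(fused)
```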
Although different neural networks can be used as the backbone, we chose the Xception model (Chollet 2017), motivated by the promising qualitative and quantitative results it achieves in fast computation time on the image classification task, outperforming previous networks such as VGG-16, ResNet-152, and Inception V3. Moreover, we tested several network configurations and experimentally verified that it was the best model for our case study (Gruosso et al. 2021b). In particular, we chose the Xception-65 model adapted by Chen et al. (2018) to the task of semantic segmentation. It is characterized by 65 layers, in which the original max-pooling layers are replaced by atrous depthwise separable convolutions (also called atrous separable convolutions), which factorize a standard convolution into a depthwise convolution (a spatial convolution performed independently for each channel) with an atrous rate, followed by a pointwise (1 × 1) convolution. In addition, batch normalization and ReLU are added after each 3 × 3 depthwise convolution.
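As a rough sketch of the atrous separable convolution just described (a dilated 3 × 3 depthwise convolution followed by a pointwise 1 × 1 convolution, with batch normalization and ReLU after the depthwise step), one could write something like the block below; the exact layer ordering and hyperparameters in the modified Xception-65 may differ from this assumption.

```python
from tensorflow.keras import layers

def atrous_separable_conv(x, filters, rate):
    """Atrous separable convolution: dilated 3x3 depthwise convolution
    followed by a pointwise 1x1 convolution; BN + ReLU follow the
    depthwise step, as described in the text."""
    x = layers.DepthwiseConv2D(3, padding="same", dilation_rate=rate,
                               use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)  # pointwise
    return x
```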

Training details
The network was trained using our upper limb segmentation train set (see Sect. 3.3 for details) and a training protocol similar to Chen et al. (2017a). In particular, we adopted the cross-entropy loss function and the stochastic gradient descent optimization algorithm with momentum equal to 0.9, setting the batch size to 8 and the base learning rate $\eta_0$ to 0.0001, and using a polynomial learning rate policy (also known as the "poly" policy), which proved to be more effective than other policies, leading to faster convergence (Liu et al.):

$$\eta_t = \eta_0 \times \left(1 - \frac{t}{T}\right)^{p} \tag{1}$$

where $\eta_t$ is the learning rate at the current iteration step $t$, $T$ is the total number of iterations, equal to 90K for our training phase, and $p$ is the power value, set to 0.9. The network training was performed using weights pretrained on the ImageNet (Russakovsky et al. 2015) and MS-COCO (Lin et al. 2014) datasets and GPU acceleration through one Nvidia Titan Xp GPU with 12 GB of memory. We adopted Python 3.6 and the TensorFlow (Abadi et al. 2015) machine learning library, tested on Microsoft Windows 10 Pro. Other requirements are described in the online resources reported in Sect. 7. Finally, data augmentation was applied by randomly left/right flipping images and labels during training to avoid model overfitting.
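The snippet below sketches how the "poly" schedule of Eq. (1) and the random flip augmentation could be expressed with the TensorFlow/Keras API; it is an illustration under our assumptions, not the exact training code used in this work.

```python
import tensorflow as tf

# "Poly" learning rate policy of Eq. (1): eta_t = eta_0 * (1 - t / T) ** p,
# with eta_0 = 1e-4, T = 90K iterations, and p = 0.9.
lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1e-4, decay_steps=90_000,
    end_learning_rate=0.0, power=0.9)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)

def random_flip(image, label):
    """Data augmentation: joint random left/right flip of image and label."""
    if tf.random.uniform(()) > 0.5:
        image = tf.image.flip_left_right(image)
        label = tf.image.flip_left_right(label)
    return image, label
```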

Datasets description
One of the main challenges of this research field is the lack of well-annotated RGB images captured in varied real-life unconstrained environments. Most datasets are limited to synthetic/semi-synthetic images (Shilkrot et al. 2019; Lin and Martinez 2020; Gonzalez-Sosa et al. 2020; Lin et al. 2021), which are easy to label but often look artificial and unrealistic. This can lead to poor results in the real-world domain, and scene adaptation techniques are often used to improve accuracy in specific scenarios (Lin and Martinez 2020). On the other hand, available real datasets often contain few data or low-quality images with coarse segmentation masks (Tang et al. 2018), making them unsuitable for deep learning approaches (Bandini and Zariffa 2020). Furthermore, some real-life datasets show the whole upper limb (both bare and clothed arms) in their images although only the hands or bare arms were labeled (Bambach et al. 2015; Urooj and Borji 2018; Wang et al. 2019), resulting in misclassifications for our case study. Therefore, we collected a large, comprehensive upper limb segmentation dataset (see Sect. 7) to overcome the limitations of existing datasets (Gruosso et al. 2021c). It consists of 46,021 well-annotated RGB images captured in unconstrained real-world scenarios and showing a wide range of situations, e.g., different indoor and outdoor environments, lighting conditions (sunlight, ambient light, flashlight, shadows), bare and clothed arms, skin tones, hand-to-hand and hand-to-object occlusions, and a variable amount of motion blur. In the case of occlusions, we considered the limb as foreground and the objects as part of the background. All collected data are in an egocentric perspective and come from three different datasets: (1) EDSH (Li and Kitani 2013), which includes indoor and outdoor video frames showing different lighting conditions and a user's bare limb (hands and forearms) during real-life actions, such as preparing tea, climbing stairs, and opening doors; (2) TEgO (Lee and Kacorri 2019), which is a large dataset including high-resolution indoor images showing two subjects' hands and forearms with different skin tones, lighting, and object occlusions; (3) our manually labeled EgoCam dataset, showing four male and female people in simple and cluttered environments, indoor and outdoor real-life scenes, inter-hand occlusions, different lighting conditions, and skin tones. Compared to the other two datasets, EgoCam includes more limb sections (e.g., the elbow and part of the upper arm) that can easily be framed by cameras with a wide-angle field of view, also increasing the number of well-annotated images in the upper limb segmentation dataset. We only considered a subset of the first two datasets since we found and discarded data whose labels contained errors. In the case of TEgO, we also deleted images where the clothed arm was labeled as background, as we were not only interested in the bare limbs. Since the data had different aspect ratios and orientations, we performed a square crop and spatially resized the images to 360 × 360 to accelerate training. Then, the upper limb segmentation dataset was divided into training (43,837 images) and test (2,184 images) subsets.
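A minimal sketch of this preprocessing step is given below, assuming a centered square crop (the crop position is not specified above) and nearest-neighbour resizing for the label masks; the helper name and implementation details are ours and purely illustrative.

```python
import tensorflow as tf

def square_crop_and_resize(image, mask, size=360):
    """Centered square crop (assumed) followed by resizing to 360x360;
    nearest-neighbour interpolation preserves the integer mask labels.
    Both inputs are expected as HxWxC tensors (mask as HxWx1)."""
    h, w = tf.shape(image)[0], tf.shape(image)[1]
    side = tf.minimum(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    image = tf.image.crop_to_bounding_box(image, top, left, side, side)
    mask = tf.image.crop_to_bounding_box(mask, top, left, side, side)
    image = tf.image.resize(image, (size, size))
    mask = tf.image.resize(mask, (size, size), method="nearest")
    return image, mask
```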
Furthermore, we considered another test set to assess the network generalization level on difficult cases. We named it EgoGestureSeg since it is a subset of a benchmark dataset for egocentric hand gesture recognition called EgoGesture (Zhang et al. 2018). EgoGestureSeg consists of 235 images captured in challenging indoor/outdoor scenarios and manually labeled by Gonzalez-Sosa et al. (2020). They show both clothed and bare limbs captured up to the forearm, under natural or artificial light (e.g., light bulbs or flashes). There are also shadows, darkened limbs captured against the light, and hand-to-hand occlusions. Furthermore, some frames also have a noticeable amount of motion blur.

Evaluation methods
To evaluate the performance and effectiveness of the proposed approach, we carried out both quantitative and qualitative evaluations, comparing our approach with SOTA methods. The quantitative analysis was conducted using standard and well-known metrics for the segmentation task, while the qualitative evaluation was made by visually comparing the segmentation masks predicted by all models. Both assessments are beneficial and discriminating: metrics are useful indices for comparing different methods on a specific task or benchmark datasets and are used to obtain overall information on model performance, while visual inspection of the output can reveal strengths or unfavorable scenarios in which the network fails, thus influencing the final choice of a model (Minaee et al. 2021).
We considered Accuracy (also known as Pixel Accuracy) as the first quantitative index (Minaee et al. 2021). It is the ratio of the correctly predicted pixels to the total number of pixels for $K + 1$ classes. It can also be expressed in terms of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) values. Formally,

$$\text{Acc} = \frac{\sum_{i=0}^{K} p_{ii}}{\sum_{i=0}^{K}\sum_{j=0}^{K} p_{ij}} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$

where $p_{ii}$ is the number of pixels of class $i$ predicted as belonging to class $i$ and $p_{ij}$ is the number of pixels of class $i$ predicted as belonging to class $j$. This is a useful metric when considering a class-based assessment, but it may not be reliable for the overall evaluation of a test set if the classes are not well balanced (Lateef and Ruichek 2019). In this case, mAcc is used, computing the per-class accuracy and averaging over the total number of classes. The second metric chosen is the IoU, also known as the Jaccard index, which is the most commonly used metric for evaluating segmentation tasks and quantifies the overlap between the ground-truth labels (GT) and the predicted masks (Pred). More specifically, it computes the ratio between the intersection and the union of these two sets. IoU can also be formulated considering the number of true positive, false positive, and false negative values:

$$\text{IoU} = \frac{|GT \cap Pred|}{|GT \cup Pred|} = \frac{TP}{TP + FP + FN} \tag{3}$$
Similar to Accuracy, the mean IoU (mIoU) is employed in the case of the whole test set. It is defined as the per-class IoU averaged over all classes. The last metric is the F1 score, which is widely used to measure the accuracy of image segmentation approaches (Minaee et al. 2021) and the contour matching (i.e., how well the predicted boundary of a segmented object matches its GT boundary). It combines the precision $p$ and recall $r$ of the test, where

$$p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN} \tag{4}$$

In particular, the F1 score is computed as the harmonic mean of precision and recall:

$$F1 = \frac{2 \times p \times r}{p + r} \tag{5}$$

In the case of such an index, the mean F1 (mF1) is usually used. For each class, it is the average F1 score of that class over all images. Instead, for the aggregated set, mF1 represents the average F1 score of all images.
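For reference, the per-image metrics above can be computed from the TP/FP/TN/FN pixel counts as in the following sketch for the binary case (limb vs. background); the helper name is ours, and the aggregation into mAcc, mIoU, and mF1 follows the per-class and per-image averaging described above.

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """Per-image binary metrics (limb = 1, background = 0): pixel accuracy,
    IoU, and F1 score computed from TP/FP/TN/FN pixel counts."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    acc = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return acc, iou, f1
```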

Experimental results
We extensively tested the proposed approach in several real-life scenarios and conditions and compared the obtained results with several deep learning methods. We first compared our results with SOTA methods for both hand and arm segmentation in egocentric vision, i.e., Ego2Hands (Lin and Martinez 2020) and EgoArm (Gonzalez-Sosa et al. 2020). In addition, we considered HGR-Net (Dadashzadeh et al. 2019), which showed interesting results in challenging scenarios, although it was not specifically designed for egocentric vision segmentation. Quantitative and qualitative results can be found in Sects. 5.1 and 5.2. Second, we performed a further comparison on hand segmentation, which is a task closely related to upper limb segmentation. Relevant findings and discussions are provided in Sect. 5.3. All tests were conducted on the same computer equipped with an Intel Core i7 3rd-generation CPU, 16 GB RAM, and one Nvidia Titan Xp GPU with 12 GB memory.

Comparisons on upper limb segmentation dataset
The first test was done considering our upper limb segmentation test set. Table 1 illustrates the obtained metric values expressed in percentage. As can be noted, our approach achieved the best results (shown in bold) in all cases, and the difference from the other networks is significant for each metric. The second-best result is displayed in blue and was obtained by Ego2Hands and EgoArm, with a difference of about 9% for mAcc, 30% for mIoU, and 26% for mF1 on the overall test set. In the case of the limb category, a small difference can be found for accuracy (6%) and a greater difference for IoU (50%) and mF1 (45%) compared to the second-best values. On the other hand, less variability in the values of each metric calculated on the background class can be noted. This is most evident when observing the obtained predictions, in which few background classification errors were made by all networks. Some images and the corresponding output masks are shown in Fig. 2. The worst results were obtained by HGR-Net (fourth row of Fig. 2), which was unable to identify and correctly classify the limbs, as can also be seen from the per-class metrics (Table 1). In our opinion, this is because HGR-Net was not specifically trained to segment limbs captured from an egocentric perspective. Hence, a general-purpose solution for human limb segmentation may not be sufficient to achieve optimal results. EgoArm, instead, generally identified a large part of the limb, although it made several mistakes on the background pixels, incorrectly classifying objects as limbs. On the other hand, Ego2Hands was not always able to segment the limbs correctly (no limb pixels were identified in the last image of Fig. 2). Contrary to the other models, our network obtained excellent segmentation masks in various scenarios, with different lighting conditions (e.g., flashlight and shadows), skin tones, and hand-to-object occlusions, as can be seen in the fifth row of Fig. 2.

Comparisons on EgoGestureSeg dataset
To evaluate our approach in more challenging scenarios and compare it with the chosen SOTA networks, we considered the EgoGestureSeg dataset (see Sect. 3.3 for details). The quantitative analysis shows the superiority of our approach, which obtained the best values for all metrics, as can be seen in Table 2. The second-best value was achieved by EgoArm in most cases, differing by about 2% for mAcc and about 7% for mIoU on the overall image set evaluation. For the limb class, we can see a similar trend, in which the accuracy obtained by EgoArm is slightly lower than the best value (1.7%), while mIoU differs more (11.8%). The second-best mF1 score was obtained by Ego2Hands in the case of the overall evaluation and by EgoArm for the limb class. In the worst case, very low values were recorded, e.g., HGR-Net reached values lower than 32% for the limb class metrics and less than 66% for the metrics computed on the overall set. This seems to support our hypothesis on the need for an ad hoc model for egocentric segmentation. Instead, fewer errors on the background class were made by all approaches, and values greater than 78% were obtained.
Figure 3 shows some qualitative results in several scenarios. The third image represents a challenging situation, where the hand and arm are very dark and similar to the background color and, hence, difficult to recognize. In addition, both bare and clothed limbs are shown. Furthermore, some photos were captured under outdoor and indoor light, i.e., the first two and the last three images, respectively. The first, fourth, and fifth images present motion blur, especially around hands and fingers, which were segmented with difficulty by most comparison models. On the other hand, our network achieved better performance in all those cases and was robust to various clothes and lighting conditions, as shown in the fifth row of Fig. 3.

Fig. 3 The output obtained by testing all models on EgoGestureSeg images, which are shown in the first row. The ground-truth (GT) segmentation masks can be found in the last row

Comparisons on hand segmentation
We tested our model on the hand-only segmentation task, since it is related to upper limb segmentation, and compared the proposed network with the approach designed by Urooj and Borji (2018), who provided four networks for hand segmentation in the wild. Each one is based on the RefineNet model (Lin et al. 2017), was trained using a specific hand segmentation dataset (i.e., EgoHands (Bambach et al. 2015), EYTH (Urooj and Borji 2018), GTEA (Fathi et al. 2011; Li et al. 2015), and HOF (Urooj and Borji 2018), respectively), and obtained impressive results in the case of hands captured from an egocentric viewpoint in unconstrained environments. Despite the correlation between the two tasks, there is a slight difference: the hands and arms are marked as foreground for upper limb segmentation, while the arms are classified as background and only the hands are recognized as foreground in the case of hand segmentation.
Hence, an impartial and equal comparison on the whole test sets employed in the previous evaluations could not be performed, especially in the case of the quantitative analysis. Therefore, we selected a subset of our test set showing only hands up to the wrist (1514 images) and tested both our network and the four RefineNet-based models on this subset. This can further demonstrate the superiority and robustness of our approach even if only the hand is framed by the camera, such as when the limb is moving in or out of the field of view. Table 3 illustrates the quantitative results obtained on the hand segmentation subset. RefineNet models are indicated with the acronym RN for brevity. Metrics were computed on the overall subset and for each class. As can be noted, the best and second-best values were mostly achieved by our network and the RefineNet model trained using the GTEA dataset, respectively. However, the difference between them is significant, especially with regard to the metrics calculated for the limb class. In particular, there is a difference of about 23% in the case of Accuracy and IoU, and about 28% for the mF1 score. The worst values computed on the overall subset and for the limb class were obtained by the RefineNet models trained using the HOF and EYTH datasets. It is noteworthy that HOF contains images of hands on the face, hence not captured from an egocentric viewpoint, and EYTH is a small dataset consisting of both egocentric and third-person point of view images (30% of hands captured in TPV and 70% in FPV) (Urooj and Borji 2018). Conversely, EgoHands contains video frames in which two subjects are playing board games; each frame contains one subject in FPV or both subjects, one in FPV and the other in TPV. Instead, GTEA contains only FPV scenes. This further emphasizes the need for a dedicated training dataset where all images show the limbs in an egocentric view. Moreover, a qualitative comparison was conducted. Some predictions can be seen in Fig. 4. The tested input images, which show only the hands up to the wrist, and the ground-truth masks can be found in the first and the last row, respectively. The sixth row contains our outputs and the other rows present the masks predicted by the models of Urooj and Borji (2018). All RefineNet networks were unable to properly segment the black hand in the first image. In addition, only the models trained using EgoHands and GTEA correctly classified most of the hand pixels in the last photo, where the back of the hand was exposed to a flashlight. The worst results were obtained by RefineNet-HOF, as expected. As can be seen, our network correctly identified all hands and also achieved a good segmentation level on contours.

Conclusion
We investigated upper limb segmentation in egocentric vision and unconstrained real-life RGB images. In particular, we trained a deep neural network based on the well-known DeepLabv3+ architecture (Chen et al. 2018; Gruosso et al. 2021c). Although several hand and arm segmentation datasets exist, they are limited to a small amount of well-annotated real photos or a large number of synthetic/semi-synthetic images that lack realism and are captured in constrained or artificial environments. Therefore, the available data is often unsuitable for real-world applications. Hence, to train our network, we collected 46,021 images, which show a wide range of real-life scenarios, skin tones, clothes, occlusions, and lighting conditions (Gruosso et al. 2021b). They include carefully selected images and ground-truth masks from the EDSH (Li and Kitani 2013) and TEgO (Lee and Kacorri 2019) datasets, which met the main requirements we looked for, and our manually labeled EgoCam dataset.
To prove the robustness of the proposed approach, we tested it extensively and compared it to the SOTA for hand and arm segmentation, namely Ego2Hands (Lin and Martinez 2020), EgoArm (Gonzalez-Sosa et al. 2020), and HGR-Net (Dadashzadeh et al. 2019). In addition, we focused on hand-only segmentation since it is a task closely related to upper limb segmentation. In this context, we made a comparison with the most recent approach for egocentric hand segmentation in the wild, which consists of four deep neural networks based on the RefineNet model (Lin et al. 2017) trained using different hand segmentation datasets (Urooj and Borji 2018). We employed our upper limb segmentation test set and the EgoGestureSeg dataset labeled by Gonzalez-Sosa et al. (2020) for the upper limb comparisons, and a manually selected hand-only subset of images for the hand segmentation assessment. In particular, we performed a quantitative analysis computing standard metrics for the image segmentation task, achieving the best values for Accuracy, IoU, and F1 score calculated on the overall test sets and for the limb class with a considerable margin over the competitors, as illustrated in Tables 1, 2 and 3. Furthermore, we visually compared the predictions of all models. Our network obtained impressive segmentation mask accuracy in diverse real-world scenarios with no model/scene adaptation, e.g., hand-to-hand and object-to-hand occlusions, indoor and outdoor areas, different lighting conditions, shadows, skin tones, and user/camera movements (Figs. 2, 3, and 4). The worst results were achieved by the networks trained using a small number of images captured from the third-person point of view or a combined dataset of TPV and FPV data, such as HGR-Net and RefineNet-HOF. Therefore, the evaluations conducted demonstrate the need for an approach specifically designed for FPV and for a sufficiently varied and large training dataset that can enable a good generalization level.
To the best of our knowledge, the presented approach is the first to introduce an effective and robust deep learning solution that outperforms existing approaches and to collect a large and comprehensive upper limb segmentation dataset in egocentric vision with accurate labels. This study can be the basis for building an immersive scenario with an upper limb segmentation network running locally on VR devices, such as latest-generation headsets. For example, it could improve the user's sense of presence and body ownership in virtual environments by allowing users to see their real upper limbs in the virtual environment instead of virtual avatars (Gruosso et al. 2021b). A user study could then be conducted in the future to assess the usefulness of our approach for VR/MR applications.
Finally, we will release our dataset and code to encourage future research on this topic (see Sect. 7).

Online resources
We have shared our code for research purposes through a GitHub repository: https://github.com/Unibas3D/Upper-Limb-Segmentation. The trained models can be found at this

Fig. 2 Some qualitative results obtained by testing all models on our Upper Limb Segmentation test set. The first two images come from EgoCam and the last three images are from TEgO. The input images and the ground-truth (GT) segmentation masks are shown in the first and the last row, respectively

Fig. 4 Comparisons using images showing only the hand (first row). The predictions obtained by the four RefineNet-based networks can be found from the second to the fifth row. Our output and the ground-truth (GT) segmentation masks are shown in the sixth and the last row, respectively