Deep Learning for 3D Ear Detection: A Complete Pipeline from Data Generation to Segmentation

The human ear has distinguishing features that can be used for identification. Automated ear detection from 3D profile face images plays a vital role in ear-based human recognition. This work proposes a complete pipeline including synthetic data generation and ground-truth data labeling for ear detection in 3D point clouds. The ear detection problem is formulated as a semantic part segmentation problem that detects the ear directly in 3D point clouds of profile face data. We introduce EarNet, a modified version of the PointNet++ architecture, and apply rotation augmentation to handle different pose variations in the real data. We demonstrate that PointNet and PointNet++ cannot manage the rotation of a given object without such augmentation. The synthetic 3D profile face data is generated using statistical shape models. In addition, an automatic tool has been developed and made publicly available to create the ground-truth labels of any 3D public data set that includes co-registered 2D images. The experimental results on the real data demonstrate higher localization accuracy compared to existing state-of-the-art approaches.


I. INTRODUCTION
The external shape of the human ear has distinguishing features that differ significantly from person to person. Research shows that even the ears of identical twins are different [1]-[3]. Importantly, the ear shape of a person remains steady between the ages of 8 and 70 [4]-[6]. These two factors have attracted the research community to investigate using images of the ear for numerous applications, including biometric identification; 3D ear reconstruction from partially occluded ear images [7] or from a single 2D ear image [8]; gender recognition; genetic study; and asymmetry analysis for clinical purposes [9]-[13].
In ear-based biometrics, one of the significant steps is to localize ears in profile face images. Most ear detection approaches use 2D images for ear region localization as they require fewer computations [11], [12], [14]. Due to the importance of being able to handle unconstrained images for object detection and segmentation, recently numerous deep learning-based methods have been proposed including simple convolutional neural network (CNN) based [15]-[18], landmark-based [19], Faster R-CNN based [20], pixel-wise [21], and geometric-based [22]. However, 2D image-based approaches are limited to constrained scenarios due to their sensitivity to lighting conditions and pose variations. Therefore, 3D images can be used to overcome the limitations of 2D images [23].
Recent developments in 3D imaging techniques open the door for 3D image-based applications, including biometrics, robotics, medical diagnosis, and autonomous driving [23], [24]. Generally, 3D data can be represented in various forms, such as point clouds, volumetric grids, depth images, and meshes. The point cloud representation is becoming more popular as it preserves the original geometric information in the 3D domain without discretization. However, conventional convolutional neural networks cannot be applied directly to point clouds due to the irregular order of the points. Therefore, most work using 3D images first converts point cloud data to a Euclidean structured format before sending it to CNN architectures. This representation conversion introduces unnecessarily voluminous data and obscures natural invariances of the data by generating quantization artefacts. Recently, Qi et al. introduced two novel deep learning architectures named PointNet [25] and PointNet++ [26] that can identify features directly on 3D point clouds. These two networks solve the problem mentioned above and open the door to many other research questions in classification and semantic segmentation, including Engelmann [27], PointSIFT [28], 3DContextNet [29], ShellNet [30], LSANet [31], PointCNN [32], PCCN [33], ConvPoint [34], KPConv [35], InterpCNN [36], RSNet [37], G+RCU [38], 3P-RNN [39], DGCNN [40], SPG [41], GACNet [42], DPAM [43]. For more details, the reader is referred to the comprehensive survey on point cloud data by Guo et al. [44].
Here, the ear detection problem is expressed as a semantic part segmentation problem where the profile face data (3D point clouds) is divided into two parts: ear and non-ear. As the problem is formulated in a single class with 2 parts, we are motivated to use a simpler network. In this work, we propose a deep learning-based approach named EarNet to detect ears directly on 3D point clouds of profile face data by modifying the PointNet++ [26] architecture. To handle the pose variation in the test data sets, we include a rotation augmentation block during the transfer learning of the EarNet.
Conventionally, a large set of training data is required to train a deep neural network efficiently. To the best of our knowledge, however, labeled 3D point cloud data for ear detection is not available. Therefore, we propose a novel approach for generating a large 3D synthetic profile face data set using two publicly available statistical 3D face models to train the proposed EarNet. Three public data sets are utilized to evaluate the performance of the trained model. Moreover, to examine the robustness of this approach, we also use a challenging 3D profile face data set from the University of Western Australia (UWA) that contains occlusions due to earphones. The contributions of this work can be summarized as follows:
1) A novel deep learning-based ear detection model named EarNet is proposed. EarNet is a modified version of PointNet++ [26] with a rotation augmentation block addressing pose variation problems in the real data.
2) A novel approach is proposed to synthetically generate a large number of 3D profile face scans, which are used to train the proposed EarNet.
3) A novel approach is proposed to create the ground-truth labels on real 3D data where co-registered 2D images are available. The ground-truth data is used for quantitative evaluation of the EarNet.
4) Comprehensive experiments are conducted, demonstrating state-of-the-art performance on the largest publicly available 3D profile face data set.
The rest of the paper is organized as follows. Related work for 3D ear detection is described in Section II. The proposed ear detection pipeline is elaborated in Section III. The performance evaluation is explained in Section IV, followed by the conclusion in Section V.

II. RELATED WORK
The main focus of this work is ear detection in 3D data. Therefore, we only include the existing approaches that used 3D data for ear detection and categorize them into two groups: conventional machine learning-based approaches and deep learning-based approaches. These are summarized below.

A. MACHINE LEARNING-BASED APPROACHES
Existing machine learning-based approaches for ear detection in 3D data are either shape model-based, landmark-based, or graph-based. Chen et al. [45] proposed a shape model-based approach for localizing ears in 3D profile face images. The helix and anti-helix parts of the ear were represented by a shape model consisting of a discrete set of 3D vertices. The authors extracted step edges from the profile images because of their strong visibility along the ear helix. The edge segments were thinned, dilated, and classified into several clusters. A modified iterative closest point (ICP) algorithm was applied to align the edges and the shape model, and ear detection was obtained by the minimum registration error between the cluster and the shape model. The reported detection accuracy was 92.6% on 312 test images from 52 subjects. The limitation of the approach was its sensitivity to scale and pose variation. Zhou et al. [46] presented a 3D shape model to extract a set of shape-based features to train a support vector machine (SVM) classifier. They reported 100% accuracy; however, this result was obtained on only 142 test images.
A landmark-based ear detection technique that achieved 100% detection accuracy on the UND J2 data set was proposed by Lei et al. [23]. They presented a tree-based graph (ETG) to represent the ear and a curvedness map for localizing ear landmarks. However, their approach required manual intervention for landmark annotation.
An edge connectivity graph was proposed by Prakash et al. [47] for ear detection on 3D images from the UND J2 data set, achieving 99.38% detection accuracy. The authors used a connectivity graph technique to extract the initial ear edge image. Their approach handled scale and in-plane rotation. However, it did not handle off-plane rotation, and as a result the authors had to discard some images from the UND J2 data set because of poor detection quality. Pflug et al. [48] proposed a binary mean curvature map for edge detection on 3D profile images and reported 95.65% accuracy on the UND J2 data set. The detected edges were used for semantic analysis to reconstruct the helix contour of the ear. A successful detection was defined as 50% overlap between the ground-truth and the predicted ear region. As a result, their approach included additional pixels in the ear region, including clothes (e.g., scarves and collars).

B. DEEP LEARNING-BASED APPROACHES
The earliest attempt at ear detection on 3D point clouds was EpNet [49], where the network layers of PointNet [25] were customized to detect ear points on 3D profile faces. In EpNet, the input points are mapped into a feature vector by multilayer perceptron (MLP) networks. A max-pooling operator is then employed on these feature vectors. This pooling operation results in a permutation-invariant global feature vector. Finally, using an MLP, the point feature vector and the global feature vector are combined and transformed into an output vector. Although this solves the permutation and transformation invariance in the point clouds, it cannot capture the local structure in the Euclidean space. As a result, detection accuracy is affected by pose variations.

TABLE 1: Summary of existing 3D ear detection approaches.
Method               Technique            Test images   Accuracy (%)
Zhou et al. [46]     Histogram-based      142           100
Prakash et al. [47]  Connectivity graph   1604          99.38
Pflug et al. [48]    Edge-patterns        2414          95.65
Lei et al. [23]      Landmark-based       1800          100
Mursalin et al. [49] Point cloud          1800          93.09
Zhu et al. [50]      PointNet++           1385          93
Recently, Zhu et al. [50] proposed an ear segmentation approach using PointNet++. The authors trained their network using transfer learning on the pre-trained weights from the ShapeNet data [51]. They used one 3D scan per subject (415 subjects in total) from the UND J2 data set for fine-tuning their segmentation network. Their approach was tested on the remaining data from the UND J2 data set. However, the authors did not examine the use of data augmentation while training their deep neural network. They also did not examine the effect of pose variations on the detection performance. The existing 3D ear detection methods are summarized in Table 1. Note that the authors usually reported different performance metrics and relied on their own evaluation protocols. As a result, direct comparisons between the detection accuracies reported in Table 1 should be avoided.

III. PROPOSED EAR DETECTION PIPELINE
In this work, a large synthetic data set is produced by using two publicly available statistical models to train the proposed EarNet. The ground-truth of real data sets is generated by utilizing the Mask R-CNN [52]. The complete processing pipeline of ear detection is described below.

A. TRAINING DATA GENERATION
To create an extensive 3D data set for training, two publicly available statistical models, the Basel Face Model (BFM) [53] and the Liverpool-York Head Model (LYHM) [54], are used. The aim of using two models is to increase the variations in the training data. Both models were created using a dimensionality reduction technique, Principal Component Analysis (PCA). By varying the shape parameters, different face instances are generated. It is straightforward to label the ear points of these generated data because a known one-to-one correspondence exists among the data.
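Face instance generation from such a PCA model can be sketched as follows. The vertex count, component count, and ear vertex indices below are hypothetical stand-ins for what BFM/LYHM actually provide; a real model ships its own mean shape, components, and variances.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for what a statistical model (BFM/LYHM) provides:
# a mean shape, orthonormal principal components U, and per-component
# standard deviations sigma; real models ship these arrays with the data.
n_vertices, n_components = 500, 40
mean_shape = rng.normal(size=3 * n_vertices)
U = np.linalg.qr(rng.normal(size=(3 * n_vertices, n_components)))[0]
sigma = np.linspace(3.0, 0.1, n_components)

def sample_face(alpha):
    """One face instance: mean_shape + U @ (alpha * sigma), reshaped to (N, 3)."""
    return (mean_shape + U @ (alpha * sigma)).reshape(n_vertices, 3)

face = sample_face(rng.normal(size=n_components))   # a new synthetic identity

# Dense correspondence: vertex i is the same anatomical point in every
# instance, so one list of ear vertex indices labels every generated face.
ear_idx = np.arange(100, 140)                       # hypothetical ear vertices
labels = np.zeros(n_vertices, dtype=int)
labels[ear_idx] = 1                                 # 1 = ear, 0 = non-ear
```

Because all instances share the model's vertex topology, the ear label mask is defined once and reused for every sampled face.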
All 3D profile face images of the UND J2 and UND F data sets are left-side profile face images, and our literature review indicates that ear detection is conducted mostly on profile face 3D images. However, both of the above-mentioned statistical models produce a full-face image. Therefore, we transform the full-face image of the statistical models into left-side profile face data using the following steps. First, the nose tip is detected using the coarse-to-fine approach proposed by Mian et al. [55], where each 3D face is sliced horizontally at multiple steps. The location on a slice with the largest altitude triangle is regarded as a possible nose tip and is given a confidence value equal to the altitude. This process is iterated over all the slices to get one candidate point per slice corresponding to the nose ridge. Candidate points that do not correspond to the nose ridge are considered outliers and are eliminated using Random Sample Consensus (RANSAC) [56]. The point with the maximum confidence value is selected as the nose tip. Second, the detected nose tip is taken as the reference point for the current viewpoint, and the full-face image is rotated to different azimuth (-45°, -60°, -90°) and elevation (±30°) angles. This rotation introduces pose variations into the training data. Third, the hidden points for each rotation angle are removed using a hidden point removal algorithm [57]. Finally, the preprocessed data is downsampled to reduce the computation for the EarNet. Three downsampling techniques (random sampling without replacement, uniform box grid, and non-uniform box grid) are applied. The non-uniform box grid method is selected as it best retains the overall geometric shape of the 3D face data. The number of points is selected empirically to preserve the overall shape of the face and ear region.
We tested 1024, 2048, and 4096 points and chose 4096 points for better visual quality.
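A simplified sketch of the nose tip detection step is given below, assuming a roughly frontal frame (y vertical, z toward the viewer). The protrusion-based confidence and the RANSAC consensus on the candidates' x coordinate are our simplifications of the altitude-triangle scheme of Mian et al. [55], not the exact method.

```python
import numpy as np

def nose_tip(points, n_slices=30, n_iters=200, tol=5.0, seed=0):
    """Per horizontal slice, keep the most protruding point (max z) as a
    nose-ridge candidate with a protrusion-based confidence, reject
    off-ridge outliers with a RANSAC consensus on the candidates' x
    coordinate, and return the surviving candidate with maximum confidence."""
    rng = np.random.default_rng(seed)
    ys = np.linspace(points[:, 1].min(), points[:, 1].max(), n_slices + 1)
    cands, conf = [], []
    for lo, hi in zip(ys[:-1], ys[1:]):
        sl = points[(points[:, 1] >= lo) & (points[:, 1] < hi)]
        if len(sl) == 0:
            continue
        k = int(sl[:, 2].argmax())
        cands.append(sl[k])
        conf.append(sl[k, 2] - np.median(sl[:, 2]))  # protrusion above the slice
    cands, conf = np.array(cands), np.array(conf)
    best = np.zeros(len(cands), dtype=bool)
    for _ in range(n_iters):                         # RANSAC: ridge points share x
        x0 = cands[rng.integers(len(cands)), 0]
        inliers = np.abs(cands[:, 0] - x0) < tol
        if inliers.sum() > best.sum():
            best = inliers
    return cands[best][conf[best].argmax()]

# Toy profile: a curved base surface, a nose ridge peaking at (0, 0, 20),
# and one spurious spike (e.g. hair) that the RANSAC step should reject.
gx, gy = np.meshgrid(np.linspace(-50, 50, 41), np.linspace(-60, 60, 31))
base = np.column_stack([gx.ravel(), gy.ravel(), -0.01 * gx.ravel() ** 2])
nose = np.array([[0.0, float(y), 20.0 - 1.5 * abs(y)] for y in range(-10, 11)])
spike = np.array([[40.0, -50.0, 25.0]])
tip = nose_tip(np.vstack([base, nose, spike]))
```

On the toy data, the spike has the highest raw protrusion but falls outside the ridge consensus, so the returned point is the true nose tip.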
After nose tip detection and downsampling, we use a threshold-based technique proposed by Gautam Kumar [58] to label the ear points. The threshold values are selected empirically. We observe that the distance between the ear and the nose is around 26 mm. The ear points lie within a region of 20 mm width and 35 mm height. The training data preparation is summarized in Figure 1.
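A minimal sketch of this threshold-based labeling follows, assuming a canonical left-profile frame with the +x axis running from the nose toward the ear; the axis convention and exact box placement around the stated 26 mm, 20 mm, and 35 mm thresholds are our assumptions, not necessarily those of [58].

```python
import numpy as np

def label_ear_points(points, nose_tip, dist=26.0, width=20.0, height=35.0):
    """Label points inside an empirically placed box behind the nose tip as
    ear ('1') and everything else as non-ear ('0'). Assumes a canonical
    left-profile frame with +x running from the nose toward the ear; this
    axis convention is an illustrative assumption."""
    rel = points - nose_tip
    in_depth = (rel[:, 0] >= dist) & (rel[:, 0] <= dist + width)   # 20 mm wide
    in_height = np.abs(rel[:, 1]) <= height / 2.0                  # 35 mm tall
    return (in_depth & in_height).astype(int)
```

Because the synthetic faces are generated in a known pose with a detected nose tip, this single box suffices to label every training sample consistently.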

B. GROUND-TRUTH LABELING
Data labeling is a crucial task in deep learning. This work proposes a novel approach for labeling 3D public data sets where corresponding co-registered 2D images are available. The purpose of the data labeling process is to enable quantitative evaluation of the proposed ear detection model. Our data labeling process is divided into two stages. First, the ear region is detected on the 2D profile face images using Mask R-CNN [52]. Mask R-CNN is an extended version of Faster R-CNN [59] that adds a branch to predict an object mask within the bounding box detected by Faster R-CNN. The purpose of using Mask R-CNN is to localize the pixels belonging to the ear region instead of just bounding boxes. The output of the Mask R-CNN is a 2D binary mask for a given 2D color profile face image. Second, the detected mask is projected onto the co-registered 3D data for labeling. We label '1' for points that belong to the ear and '0' for points that belong to the non-ear region. The block diagram of the ground-truth labeling on real data is illustrated in Figure 2.
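The mask-to-point-cloud transfer in the second stage can be sketched as follows, assuming each 3D point carries an integer (row, col) pixel index from the co-registration; that indexing convention is an assumption for illustration.

```python
import numpy as np

def project_mask_to_points(mask, pixel_uv):
    """Transfer a 2D binary ear mask onto a co-registered 3D point cloud.
    mask     : (H, W) binary array predicted by Mask R-CNN.
    pixel_uv : (N, 2) integer (row, col) pixel of each 3D point; the
               (row, col) convention is an assumption about the registration.
    Returns per-point labels: 1 = ear, 0 = non-ear; points falling outside
    the image bounds are labeled 0."""
    h, w = mask.shape
    r, c = pixel_uv[:, 0], pixel_uv[:, 1]
    valid = (r >= 0) & (r < h) & (c >= 0) & (c < w)
    labels = np.zeros(len(pixel_uv), dtype=int)
    labels[valid] = mask[r[valid], c[valid]].astype(int)
    return labels
```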
The Mask R-CNN implementation by Waleed et al. [60] is used in this study. To train the Mask R-CNN, we randomly take a few sample images from each data set mentioned in Section IV-A. The total number of images for training and testing is 200 and 40, respectively. The VGG Image Annotator (VIA) [61] is used for labeling the 2D color images. We train the 2D ear detection Mask R-CNN starting from pre-trained COCO weights [60]. The results show an intersection over union score of 90.32% on 40 test images. Since the predictions are not perfect, we visually checked all the predicted ear regions and corrected them manually where needed.

FIGURE 2: Block diagram of the ground-truth labeling procedure on real data. Here, the input 2D color image passes through the Mask R-CNN. The output of the network is the detected and masked 2D color image. This masked image is transferred to the co-registered 3D point cloud.

C. EAR DETECTION NETWORK (EARNET)
The proposed EarNet is a deep neural network customized from the PointNet++ [26] layers for ear detection. The PointNet++ part segmentation network is designed for 16 different classes with 50 parts, while we train our proposed EarNet for 1 class with 2 parts. Therefore, a smaller network with a lower number of parameters can learn the variations. For this reason, we empirically drop some of the MLP layers in [26]. As a result, the execution time is significantly reduced without decreasing the accuracy. In addition, a data augmentation block is added to rotate the full 3D point cloud with respect to the x and y axes. The purpose of this augmentation is to expose the network to different orientations of a given object. This augmentation improves the performance of ear detection in 3D point clouds. We also demonstrate that in the presence of pose variations on the 3D profile face data, the EarNet performs better than the PointNet++.
The architecture of our ear detection model is shown in Table 2. The EarNet consists of several layers, including set abstraction, feature propagation, and segmentation layers. In the set abstraction layers, sampling and grouping are conducted using point convolutions and the furthest point sampling method. Skip link concatenation is used for feature propagation to the next layers of the network. In the segmentation block, a fully connected layer is utilized to estimate the per-point class scores for every point in the input data. We use the same notation as [26] for describing the architecture of EarNet: a set abstraction level is denoted SA(K, r, [l_1, ..., l_d]), where K is the number of local regions of ball radius r and [l_1, ..., l_d] are the fully connected layers with l_i (i = 1, ..., d) output channels. Here, n is the number of parts, which is two in this case. The ReLU activation is used in all layers, and dropout is applied in the last two layers.
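The SA(K, r, [l_1, ..., l_d]) notation can be written down as a configuration sketch. The K, r, and channel values below are placeholder assumptions for illustration, not the actual entries of Table 2.

```python
# Illustrative EarNet-style layer specification in the SA(K, r, [l_1, ..., l_d])
# notation of PointNet++; the K, r, and channel values are placeholder
# assumptions, not the actual Table 2 values.
earnet_config = {
    "set_abstraction": [
        {"K": 512, "r": 0.2, "mlp": [64, 128]},     # fewer MLP layers than [26]
        {"K": 128, "r": 0.4, "mlp": [128, 256]},
        {"K": None, "r": None, "mlp": [256, 512]},  # global abstraction level
    ],
    "feature_propagation": [[256, 256], [128, 128], [128, 128]],
    "segmentation_head": {"fc": 128, "dropout": 0.5, "n_parts": 2},  # n = 2 parts
}
```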

D. EVALUATION METRICS
The performance of the proposed ear detection model is evaluated using two commonly used metrics, namely accuracy and intersection over union (IoU). These metrics are calculated from five variables: true positive (TP), false positive (FP), true negative (TN), false negative (FN), and Total. Here, TP represents ear points correctly classified as ear points, while TN represents non-ear points classified as non-ear points. FP represents non-ear points classified as ear points, and FN represents ear points classified as non-ear points. Total is the number of points in a given point cloud. The accuracy is estimated as

Accuracy = (TP + TN) / Total,

and the intersection over union is calculated as

IoU = TP / (TP + FP + FN).
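These definitions translate directly into code; a minimal sketch:

```python
def accuracy(tp, tn, total):
    """Accuracy = (TP + TN) / Total."""
    return (tp + tn) / total

def iou(tp, fp, fn):
    """IoU = TP / (TP + FP + FN): overlap of predicted and ground-truth
    ear points over their union."""
    return tp / (tp + fp + fn)
```

Note that IoU ignores TN, so it penalizes a model that labels everything non-ear even when accuracy stays high on ear-sparse clouds.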

IV. EXPERIMENTS

A. DATA
Three publicly available 3D profile face data sets, namely UND F [62], UND G [63], and UND J2 [64], are used to evaluate the performance of the proposed ear detection model. These data sets were developed by the University of Notre Dame and have been used as benchmark data sets by the ear biometrics community. The 3D scans were captured at different times, and the number of subjects (UND J2 is the largest and UND G is the smallest) varies among these data sets. The UND G data set is comprised of images with significant pose variations compared to the other two. A brief description of these data sets is given below.
The UND F data set consists of 942 3D profile face scans with co-registered 2D color images. The total number of subjects is 302 (176 males and 126 females). The distribution of scans per subject is not uniform: there are 562 scans of male subjects and 380 scans of female subjects.
The UND J2 data set comprises a total of 1800 3D scans from 415 different subjects (178 females and 237 males). Each subject has a different number of images with scale and pose variations. Some images include occlusion by hair and earrings. In this study, a set of randomly selected 415 scans is kept separated for transfer learning of the proposed model and other purposes (see Sections IV-B and IV-C4), and the remaining 1385 scans are used for model evaluation.
The UND G data set includes 738 3D profile face scans with yaws of 45, 60, 75, and 90 degrees. There are 437 left-side and 301 right-side profile face scans. In this work, we only use the left-side profile face scans.
Apart from the UND data set, we also acquired 3D profile face data from the University of Western Australia (UWA) [65]. The authors collected data from 50 subjects using a Minolta Vivid 910 range scanner. All these images contain earphones.

B. TRAINING
First, we train the proposed EarNet from scratch using 20,000 synthetic scans (80% training and 20% testing). The hyperparameters are selected empirically. The optimal batch size is 16; we observe that the model fails to run if the batch size is larger. The number of data points for each scan is set to 4096. The optimizer is Adam [66] with a momentum of 0.9, and the initial learning rate is set to 10^-3. This work uses the default optimizer and learning rate values from [26].
Second, we apply transfer learning to the network using 150 real 3D scans. These scans are randomly selected and separated from the total 415 scans in the UND J2 data set. In addition, we separate another 50 scans randomly from the remaining scans to evaluate the transfer learning technique. During transfer learning, we apply rotation augmentation. The total number of data becomes 3000 (each scan yields 20 rotations) after the augmentation. In this work, all experiments are performed on a Lambda Blade machine with 8x GeForce GTX 1080 Ti GPUs. The code is implemented in PyTorch version 1.3.1.
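The rotation augmentation applied during transfer learning can be sketched as follows. The ±45° range and uniform angle sampling are assumptions for illustration; the paper states only that each scan yields 20 rotations about the x and y axes.

```python
import numpy as np

def rotate_cloud(points, ax, ay):
    """Rotate an (N, 3) point cloud about the x axis by ax and the y axis
    by ay (radians), matching the x/y rotations used by the augmentation block."""
    cx, sx, cy, sy = np.cos(ax), np.sin(ax), np.cos(ay), np.sin(ay)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    return points @ (Ry @ Rx).T          # apply Rx, then Ry, to every point

def augment(points, n_rotations=20, max_deg=45.0, seed=0):
    """Generate n_rotations rotated copies of one scan; the angle range and
    uniform sampling are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    lim = np.radians(max_deg)
    angles = rng.uniform(-lim, lim, size=(n_rotations, 2))
    return [rotate_cloud(points, ax, ay) for ax, ay in angles]
```

Since per-point labels are pose-invariant, each rotated copy reuses the labels of the original scan, which is how 150 scans expand to 3000 training samples.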

C. RESULTS AND DISCUSSIONS 1) Detection Accuracy
The average accuracy of our proposed EarNet on different public data sets is reported in Table 3. We obtain a consistent accuracy throughout the data sets. Sample ear detection results on each data set are illustrated in Figure 3. The ear points are shown in blue, and the non-ear points are shown in red. We also examine the case where no ear points are present in the profile face point clouds. The results demonstrate the correctness of our network: no ear points are detected on those test data. A sample outcome is shown in Figure 4.
The robustness to the occlusions due to earrings is illustrated in Figure 5. The results show that ear points are detected correctly even in the presence of earrings. We also demonstrate the robustness of our model in the presence of earphones. Our ear detection model achieves 98.89% accuracy on the UWA ear data set. A sample result is shown in Figure 6.
We also compare the performance of our approach on the UND J2 data set with the recently published work by Zhu et al. [50]. They used 415 scans (one scan per subject for all 415 subjects) for transfer learning of the basic PointNet++ network, and reported an accuracy of 93% on the remaining 1385 scans. In contrast, our approach achieves an accuracy of 98.62% using only 150 scans for transfer learning. The better performance of our approach may be explained as follows: the learned weights of the basic PointNet++ were established by training on 16 different object classes with 50 parts, and the UND J2 data set is entirely different from the data used to train the PointNet++. Our proposed EarNet also outperforms EpNet [49], which is based on PointNet. The performance comparison is summarized in Table 4. To validate our detection accuracy, we compare our approach with the ear detection approach proposed by Islam et al. [65] using bounding boxes. Figure 7 shows that even with the lowest (<60%) IoU value, our detection is inside the bounding box, and the ear shape is clearly visible.

2) Mean IoU
The mean IoU (mIoU) results achieved on different data sets by our model and by the PointNet and PointNet++ models are reported in Table 5. Our approach shows higher mIoUs for all the data sets. We observe five failure cases on the UND G data set using both PointNet and PointNet++, as illustrated in Figure 8. All of these images contain significant pose variations. The first column shows the ground-truth labels, and the remaining columns show the predictions of the different models. We see that PointNet captures the global shape but misses the local understanding of the shape. Although PointNet++ captures the local shape, it still lacks an understanding of the global structure. Our model captures both global and local shapes and, therefore, does not have any complete failure cases.
To demonstrate the robustness of our model to data point resolution, sample results are shown in Figure 9, where the point cloud resolution decreases from top to bottom. Although our model demonstrates considerable robustness against occlusion due to hair, we notice that in the four cases where our model obtains an mIoU of less than 50%, a portion of the ear is covered by hair. These cases are illustrated in Figure 10. The corresponding 2D images (bottom row) show that these cases have hair over the ear (last two images) along with significant pose variations (first two images).

3) Detection Speed
The proposed EarNet achieves a faster detection speed than PointNet++. The average inference time per real (non-synthetic) 3D scan is 0.11 s on a GeForce GTX 1080 Ti GPU. The detection speeds of the different models are reported in Table 6. Although PointNet shows a faster detection speed, it has lower accuracy than the other two models.

4) Other Experiments
We perform experiments to evaluate the effect of the synthetic training data size on the network performance. First, we train our network with 35,000 synthetic scans. Then we retrain the network multiple times, dropping 5,000 scans each time. Our experiments demonstrate that 20,000 scans are optimal for training (Figure 11). The quantitative results of our ear detection model trained on synthetic data only are presented in Table 7. The overall accuracy on the three public data sets (UND J2, UND G, and UND F) is around 90%. The real data has more variability, which our trained model is not able to capture; as a result, we see a lower mIoU value. This result indicates that there is room to improve the model performance.
Therefore, we also conduct experiments on the effect of adding real data from the UND J2 data set using transfer learning. A total of 415 scans (one per subject) are separated from the UND J2 data set. We perform three experiments, each using 50 scans selected from the 415: the first 50 subjects, 50 randomly selected subjects, and 50 subjects that appear visually hard to detect. The 50 hard scans are selected manually and are visually challenging in terms of occlusion. We perform the experiment three times for the randomly selected data and report the average result. As illustrated in Figure 12, the performance of the network does not depend on how the set of 50 real scans is selected for transfer learning. On the other hand, although a small number of real scans contributes a significant improvement in accuracy, no significant changes in performance are observed when conducting transfer learning with more than 50 real scans (see Figure 13). Therefore, for transfer learning we use 150 real scans randomly selected from the pool of 415 scans (kept separate from the testing set). We also compare the effect of training from scratch (on synthetic and real data together) with training on synthetic data first and then transfer learning with real data.
Our experiments do not show any significant differences in terms of accuracy. However, the transfer learning from the trained network with synthetic data requires 2.25 hours less than the training from scratch on the same machine.

V. CONCLUSION
This work aims to detect ears directly on 3D point clouds of profile face data by applying a deep neural network named EarNet. A large set of synthetic profile face data is generated for training the proposed EarNet. Additionally, a novel approach is proposed to create ground-truth labels on real 3D data with corresponding co-registered 2D images. The experimental results demonstrate that our model performs significantly better than the existing deep learning models for ear detection directly from 3D point clouds. A possible direction for future research is to incorporate the proposed ear detection model into an ear recognition pipeline. In addition, we will also investigate different deep learning-based 2D segmentation networks for the ground-truth labeling pipeline.