A robust deep learning approach for glasses detection in non-standard facial images

Automated glasses detection is a cardinal component of facial/ocular analysis that powers forensic, surveillance and biometric authentication systems. Throughout the literature, glasses detection has been approached using either hand-crafted or deep learning features. Nevertheless, in both cases, highly standard face/ocular images were needed to derive the suggested technique. Both approaches performed reasonably well, but the results were bound to the quality of the facial image and the extracted features, where a slight shift and/or rotation in the input face image negatively affects the results. In addition, the obtained performance is even worse on real-world (non-standard) images, especially when compared to recent achievements in other computer vision research areas. In this paper, we present a robust deep learning approach for glasses detection from selfie photos: full/partial frontal-body non-standard images captured in real-life uncontrolled environments, without utilizing any facial landmarks. To the best of our knowledge, this paper is the first to experiment with detecting glasses from selfie photos using a robust deep learning approach. Experimental results on various benchmark facial analysis datasets demonstrated the superior performance of the proposed technique, with 96% accuracy.


| INTRODUCTION
Identification of human soft biometrics is a hot computer vision (CV) research field. This is mostly attributed to the increased reliance on surveillance systems that provide vast amounts of visual data to be analysed [1,2] in an off-line manner. Smartphones are a key player as well, where these soft biometrics can be acquired using a regular smartphone camera. The majority of soft biometrics identification systems are driven by human facial analysis [3], where the human face has the benefit of being a recognizable demographic core attribute [4]. However, the process is not straightforward, where some technical factors, such as image resolution, presence of hands and illumination, could severely affect the system's performance.
The presence of eyeglasses is a technical obstruction in facial analysis systems. This is attributed to the shadow, reflection and/or occlusion generated by the glasses' frame that covers the eyes (the most important part of the face). This leads to inaccurate output of facial/ocular analysis systems, where the appearance of facial images [5] is changed. The situation is even worse with sunglasses, where the entire ocular area is covered, causing failure in eye detection and subsequent systems.
However, in recent years many studies have been proposed to reduce the impact of the existence of eyeglasses on ocular and facial analysis systems [6]. These studies proposed numerous solutions, ranging from detecting the presence of eyeglasses up to their removal using sophisticated techniques [7]. The situation is more challenging with real-world unconstrained images [8] that depict full/partial frontal facial images with extra occlusion under varying viewing angles. These images are typically represented by modern selfie images, which are very difficult to analyse using standard methods. This is attributed to their high variability in resolution, deformation and occlusion, due to the real-life environments they were captured in. Moreover, it is even more difficult to detect glasses in these selfie photos, as they are not taken in ideal/near-ideal capturing conditions, that is, they feature non-controlled lighting conditions and non-textured backgrounds. The standard approaches that utilize facial landmarks to detect glasses will not succeed, due to the images' non-standard nature, where it is difficult to locate any facial landmark, for example, the nose-piece. Figure 1 illustrates the different ways to detect glasses from facial images. The situation remains unsolved even with the recent advances in machine learning (ML) approaches, mainly because of ML's high sensitivity to the unlimited variations [9] that could be depicted in selfie images.
The contribution of this paper is presenting a robust convolutional neural network (CNN) model to effectively detect glasses, for the first time, from selfie images. The proposed deep learning (DL) model utilizes knowledge transfer-learned from the million-image ImageNet dataset [12], in addition to new features and kernels learned from the selfie images dataset [13]. The utilization of transfer learning in combination with newly learned feature maps compensates for the tough uncontrolled nature of selfie images. This enables the network to achieve a higher glasses detection rate. Finally, the research results provide a solid baseline for glasses identification from selfie images using DL-based approaches, while leaving room for possible future improvements.
The rest of the paper is organized as follows: Section 2 presents and discusses the related literature, covering (1) selfie images and their importance as a new trend in the CV field, and (2) recent trends in glasses detection. The proposed CNN model is presented in Section 3. Section 4 presents the experiments and results. Finally, the paper is concluded in Section 5.

| Glasses detection
Glasses detection is a key problem in CV research, due to its direct relation to facial analysis systems, as described earlier.
However, there is limited research targeting this specific problem. Early on, the problem was commonly approached by localizing the eyes and defining surrounding regions where glasses are expected, evaluating their presence using grey-level discontinuities between the frame and the face [14]. A combination of edge information (strength and orientation) and geometric features (convexity, symmetry, smoothness and continuity) was used for eyeglasses detection in [15]. In another work, the task was carried out using edge information within a small area defined between the eyes, that is, the nose-piece [15]. Bayes rules were used as well to detect and remove glasses [15]. Later, image descriptors were utilized to assist in glasses detection: for example, Local Binary Patterns were used in [10,16], wavelets in [17], and more recently HOG in [10] and Haar-like features in [18]. Three-dimensional features were also used to detect glasses [19], following their wide effectiveness in describing image landmarks [20]. Recently, the use of DL-based techniques has been spreading rapidly through many areas of CV [21]. This is because DL-based techniques do not suffer fatigue or moods; thus, they can process huge amounts of data at inconceivable speed, outperforming humans in terms of accuracy. For example, in [5] the Caffe framework [22] was used for glasses detection on an iris image dataset; however, this dataset did not depict full faces, but frontal cropped ocular regions. Later, a two-stage CNN was proposed in [6], but the full experiments were carried out using 23k images for testing and only 1k images for validation. Recently, a different CNN architecture based on adversarial learning was proposed in [23]. This work utilized a separate discriminator and recognizer, where the latter can successfully capture the connections among multiple face analysis tasks by sharing feature representations towards better detection.
However, that work utilized standard facial images that do not reflect current real-life facial images.
Conclusively, the key glasses detection literature is summarized in Table 1. The table reveals that the majority of related work utilized hand-crafted features, used either directly [15,17] or fused into learning-based methods [10,16], that is, SVM and AdaBoost, or into deep learning [5,18]. Although the reported detection accuracies ranged from 80% to 100%, there are some serious limitations that could be summarized in the following points:
- Some approaches were tested on in-house datasets that are small in size and typically recorded in ideal conditions [15,19]; moreover, they even require the exact location of the eye [18] prior to processing. This led to artificially high lab accuracy, where such techniques might not achieve the same performance on real-life images.
- Even the larger datasets used do not reflect true variability, as they contain repeated indoor/outdoor sessions for the same person taken for the purpose of the experiment, that is, the VISOB dataset [24].
- The majority of related work localizes the eyes as a first step and finds the nose-piece as a second step to detect the glasses' position [15,18]. This requires the facial images to be perfectly standard and frontal, which is not applicable to today's realistic datasets, that is, the selfie dataset [13].

Figure 1: Illustration of the different ways to detect glasses. The classic method employs edge detection based on the nose-piece as a well-known facial landmark. The deep learning method learns to extract features directly from input images. Some of the figure parts are adapted from [10,11].
- Most of the training data for ML-based approaches were artificially stamped with eyeglasses [6], as the frame images were aligned for superposition using facial landmarks, that is, eyebrow, eye, ear and nose [6]. Furthermore, the diversity of real glasses shapes makes the artificial stamping neither accurate nor representative of real training and testing images. Figure 2 depicts sample images with stamped glasses versus real glasses. The images with real glasses are more challenging, as they contain natural reflection and shadows.
- For deep learning-based approaches, the majority were trained on either synthetically stamped images [6], cropped ocular images [6,25] or iris images [5]. This led to the lab-based 100% accuracy.
In general, the majority of glasses detection related work is still immature in handling fully non-standard images, that is, selfies. This is attributed to the usage of either cropped facial images or ocular images that do not adequately reflect real-world appearance variations (the 120k VISOB dataset [24] is entirely an ocular dataset). Figure 3 depicts sample images from the common glasses detection datasets, showing the ideal nature that facilitated the reported high detection accuracies. Furthermore, DL-based techniques are powerful and suit the job, as they combine feature extraction and classification in a comprehensive end-to-end model that receives the raw input data and produces the final classification results. Conclusively, there is still much room for work in glasses detection to cover the aforementioned limitations, in terms of using a tailored network architecture and a fully realistic dataset such as the selfie images described in the following section.

| Selfie images
The emergence of selfies in such enormous volumes granted them a big-data aspect and forced their existence as a new CV research field. Moreover, traditional CV techniques could not effectively handle selfies. This is attributed to two main reasons: (1) their big-data nature, where hand-crafted features are expensive to extract and might not generalize well in such volumes [35]; (2) their non-standard capturing style, which makes them prone to extreme occlusion of facial landmarks. Moreover, these images might depict a side view of the face in addition to added emojis or artificial effects, for example, a cartoon moustache. Such problems contribute to the difficulty of glasses detection in selfie images. A group of selfie images that illustrate some of the aforementioned problems is depicted in Figure 4.
From a research perspective, selfie images have seen little use throughout the CV literature, as they were more linked to psychology research [36]. From a CV perspective, selfies were studied according to various attributes, that is, senior, youth, Asian, etc., where SIFT [37] and HOG [38] features were fed to an SVM to inspect these attributes [13]. However, the reported performance was very poor, that is, below 40% accuracy for detecting glasses using either SIFT or HOG [13].
Conclusively, selfie images are a new global phenomenon that represents a very sophisticated, unique case in CV research, which is worth studying and analysing. This will help to unleash the true benefit of such enormous selfie volumes (≫24 billion images [39]). Furthermore, their realistic nature (not being recorded for research purposes) gives them the ultimate diversity and realism for glasses detection work. This could help to improve forensic science and biometric authentication. For biometric authentication, the work is useful in situations where a retinal scan is required and the subject is wearing glasses; the system in this case detects the glasses and asks the subject to remove them prior to the scan, because glasses act as an obstacle for retinal scan authentication [5]. Regarding forensic science, it helps in situations where surveillance videos are examined to find a specific suspect, that is, one wearing glasses. Figure 4 depicts sample selfie images with/without glasses that reflect the diversity compared to the previous datasets depicted in Figure 3.

| PROPOSED METHOD
The existence of gigantic labelled data repositories, that is, ImageNet, allowed CNNs to infer and learn rich feature representations, which made a huge boost in multiple visual recognition problems [40]. Moreover, for the first time these learned features and representations could be harnessed and transferred to a new problem with some network reshaping. This is often employed when the amount of sample data required to train a deep neural network from scratch is simply not available [41]. However, even with the transfer learning approach, some major changes and training still need to be performed on the original network structure to fit the new problem. The new training and parametrization could still extend to weeks and months, as the network needs to go through all the training data to learn/remap features and adjust the final layers [41]. Hence, the main objective of this paper is to unleash the power of CNNs using transfer learning to better handle selfie images. For this reason we propose the DL architecture illustrated in Figure 5. The core of DL work is convolutional operations over input images, stacking layers to generate corresponding feature maps. The convolution operation is described as:

(X ⊛ K)(i, j) = Σ_m Σ_n X(i·s + m, j·s + n) · K(m, n)    (1)

where X is the input image, K is a 2D convolution matrix (kernel) and ⊛ represents the discrete convolution operation. The K matrix slides over the input matrix with stride parameter s.

The proposed glasses detection CNN architecture is designed based on the famous AlexNet [42] structure. AlexNet is a CNN that was trained on more than a million images from the ImageNet database. The original network has eight main layers (5 convolutional + 3 fully connected) and can classify images into 1000 different object categories, such as keyboard, mouse, pencil and many animals. As a result, the network has learned rich feature representations for a wide range of images, which makes it a good starting point for the proposed glasses detection CNN.

Figure 3: Sample facial/ocular images from six common glasses detection datasets, from top to bottom: CAS-PEAL-R1 [31], FERET [11], CASIA-IRIS4 [32], LFW [33], NIVE [34] and VISOB [24]. The images show full face/ocular images captured in a controlled environment with/without eyeglasses. The photos were taken against textured backgrounds, in ideal lighting conditions and at excellent viewing angles.

Figure 2: Sample images with synthesized [26] and real glasses [13]. The synthesized glasses are not realistic, with no reflection or shadow in the ocular area, and mostly depict similar frames, which eases the task of glasses detection.
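As a concrete illustration (a minimal sketch, not the authors' implementation), the discrete convolution operation described above can be written in a few lines of NumPy. Note that, strictly speaking, CNN libraries compute cross-correlation (no kernel flip), which is what this sketch does as well:

```python
import numpy as np

def conv2d(X, K, stride=1):
    """Valid 2D discrete convolution (cross-correlation) of image X with kernel K.

    The kernel slides over the input with the given stride,
    producing one feature-map value per position.
    """
    kh, kw = K.shape
    oh = (X.shape[0] - kh) // stride + 1
    ow = (X.shape[1] - kw) // stride + 1
    Y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = X[i * stride:i * stride + kh, j * stride:j * stride + kw]
            Y[i, j] = np.sum(patch * K)  # element-wise product, then sum
    return Y

# Example: a 3x3 vertical-edge (Sobel-like) kernel applied to a 5x5 image
X = np.arange(25, dtype=float).reshape(5, 5)
K = np.array([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])
Y = conv2d(X, K, stride=1)
print(Y.shape)  # (3, 3)
```

With a larger stride the kernel skips positions, shrinking the output feature map accordingly; this is the mechanism each convolution layer of the network applies, just with learned kernel values.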
The transfer learning problem could be mathematically approached by considering source domain data defined as D_S = {(x_{S_1}, y_{S_1}), …, (x_{S_{n_S}}, y_{S_{n_S}})}, where x_{S_i} ∈ X_S is a data instance and y_{S_i} ∈ Y_S is the corresponding class label, and target domain data D_T = {(x_{T_1}, y_{T_1}), …, (x_{T_{n_T}}, y_{T_{n_T}})}, where x_{T_i} ∈ X_T is a data instance and y_{T_i} ∈ Y_T is the corresponding class label. In most cases the target data is not available in the same amount as the source data, that is, n_T ≪ n_S, where here the target is the selfie images data and the source is the ImageNet data. Transfer learning aims to help improve the learning of a predictive function f_T(·) for the target domain problem D_T using the knowledge from the source domain problem D_S and its learning task T_S, where in general D_S ≠ D_T and T_S ≠ T_T. The function f(·) is the objective predictive function that could be learned from the training data pairs {x_i, y_i} ≡ {feature, label}, where x_i ∈ X and y_i ∈ Y; the feature space is represented by X, while the label space is represented by Y.

Figure 4: Sample selfie images with/without glasses that reflect the dataset problems, for example, occlusion, partial-face/side-face view and artificial effects. The images belong to the selfie dataset [13].

The original AlexNet CNN has an image input of size 224 × 224, which is changed to 227 × 227 to fit the new data image size and save the resizing step to speed up training; this is already tackled in the data augmentation step. Moreover, it has an output layer of 1000 softmax-normalized neurons, one for each of the ImageNet [12,42] object classes. Thus, the last three layers had to be replaced for the proposed glasses detection problem. This was achieved by adding a fully connected layer, a softmax layer and a classification output layer with only two classes, that is, glasses/no-glasses.
Such a fine-tuning approach allows the network model to pick up the specifics and bias of the selfie images dataset based on the generic features learnt in the first layers. Furthermore, two dropout layers (50% random dropout) were added to counter overfitting, given the modest size of the selfie image dataset. Figure 6 depicts the architecture of the proposed glasses detection CNN after adding all of the required layers.
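A minimal NumPy sketch of the replaced head may clarify the structure (this is an illustrative assumption about layer sizes, not the authors' code; the 4096-dimensional input matches AlexNet's penultimate fully connected layer): a fully connected layer mapping features to two logits, a softmax layer, and 50% inverted dropout applied only during training:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dropout(x, rate=0.5, training=True):
    """Inverted dropout: zero units at random, rescale survivors by 1/(1-rate)."""
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

# Replacement head: 4096-d AlexNet features -> 2 classes (glasses / no-glasses)
W = rng.standard_normal((4096, 2)) * 0.01
b = np.zeros(2)

features = rng.standard_normal((1, 4096))       # one image's deep features
h = dropout(features, rate=0.5, training=True)  # 50% random dropout
probs = softmax(h @ W + b)                      # two class probabilities, sum to 1
print(probs.shape)  # (1, 2)
```

At inference time `training=False` disables dropout, and the class with the larger probability is taken as the glasses/no-glasses prediction.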
The next section discusses the selfie image dataset specifications and presents the experimental results based on the proposed glasses detection CNN model.

| EXPERIMENTS AND RESULTS
This section investigates the performance of the proposed CNN model for glasses detection. The selfie image dataset is introduced in Section 4.1. Section 4.2 highlights the network training phase details, followed by the experimental results and their related analysis in Section 4.3.

| Datasets
In this paper we used the first and, to the best of our knowledge, only selfie images dataset [13,43]. The selfie dataset is composed of 46,836 images annotated with 36 different attributes, divided into several categories as follows:
- Accessories: glasses, sunglasses, lipstick, hat, earphone.
- Gender: is female.
- Age: baby, child, teenager, youth, middle age, senior.
- Race: white, black, Asian.
- Face shape: oval, round, heart.
- Facial gestures: smiling, frowning, mouth open, tongue out, duck face.
- Hair colour: black, blond, brown, red.
- Hair shape: curly, straight, braid.
- Misc.: showing cellphone, using mirror, having braces, partial face.
- Lighting condition: harsh, dim.
The important attributes in this dataset for the proposed work are the glasses/sunglasses attributes, which are used for the network's supervised learning phase. A diverse group of selfie images from this dataset is depicted in Figure 4.

| Network training phase
The proposed CNN model takes advantage of data augmentation to reduce the effects of overfitting. Before presenting an example image to the network, all dataset images are preprocessed by randomly translating them in the (−30, 30) pixel range and randomly reflecting them. The random translation step is necessary to avoid the positional bias in the data: most selfie images tend to be centred, and without this step the model would only be reliable on perfectly centred test images. The choice of the ±30 translation range restricts the effect to ∼10% of the image size, which is common in CNN data training [44,45]. These preprocessing steps are applied consistently to all images to artificially increase the dataset size using label-preserving transformations [46]. Moreover, to reduce the computational load during the training phase, all of the augmented images are produced from the originals at runtime, without storing them on disk.
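The augmentation described above can be sketched as follows (an illustrative assumption: a simple wrap-around shift via `np.roll` stands in for whatever padded translation the original pipeline used, and the reflection is horizontal):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Random translation in (-30, 30) pixels plus a random horizontal flip.

    Both transforms are label-preserving: the glasses/no-glasses
    label of the image is unchanged.
    """
    dy, dx = rng.integers(-30, 31, size=2)            # translation offsets
    out = np.roll(img, shift=(dy, dx), axis=(0, 1))   # shift (wraps at edges)
    if rng.random() < 0.5:
        out = out[:, ::-1]                            # horizontal reflection
    return out

img = rng.random((227, 227, 3))   # dummy RGB input at the network's 227x227 size
aug = augment(img)
print(aug.shape)  # (227, 227, 3)
```

Generating `aug` on the fly each epoch, rather than materializing the transformed copies, is what keeps the disk footprint at the original dataset size.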
The model was trained using stochastic gradient descent with a batch size of 10 examples, momentum of 0.9 and weight decay of 0.0004. The small weight decay value is important to facilitate learning, as it helps to reduce the model's training error. Furthermore, the weights of the early layers are frozen, as they had already learned abstract features from the ImageNet dataset. The training and results were obtained using an Intel Core i7, 3.3 GHz, with 16 GB of RAM. The training time extended for a whole 15 days to iterate through all the training data (original + augmented ≈ 100k) and fine-tune the network parameters based on the new data. The next section presents and discusses the network performance for the targeted problem.
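For reference, a single parameter update under these hyper-parameters (momentum 0.9, weight decay 0.0004) looks as follows; the learning rate is not reported in the text, so the value below is purely an assumption for illustration:

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.001, momentum=0.9, weight_decay=0.0004):
    """One SGD update with momentum and L2 weight decay.

    v is the velocity buffer; weight decay adds a small pull of the
    weights towards zero, acting as L2 regularization.
    """
    v = momentum * v - lr * (grad + weight_decay * w)
    return w + v, v

w = np.ones(3)                       # toy weight vector
v = np.zeros(3)                      # velocity starts at zero
grad = np.array([0.5, -0.5, 0.0])    # toy mini-batch gradient
w, v = sgd_momentum_step(w, v, grad)
print(w)
```

Freezing the early layers simply means skipping this update for their parameters, so only the replaced head (and any unfrozen layers) moves during fine-tuning.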

| Results and discussion
Following the common experimental setup, the dataset was randomly split by assigning 70% of the images to training and 30% to validation (unseen by the network). The split in this case does not affect the results, since there is no established test-set split for the selfie dataset. Regarding the quantitative evaluation, the accuracy measure [47,48] (Equation 4) is used, as it reports the percentage of correctly identified images with/without glasses with respect to the whole dataset. Additionally, the false rejection ratio (FRR), false acceptance ratio (FAR) and equal error rate (EER) metrics are also utilized, as they are commonly used for biometric system evaluation [49].
Accuracy = (100/N) Σ_{i=1}^{N} Σ_{k=1}^{K} I_k(y_i) · I_k(ŷ_i)    (4)

where K is the number of test-set categories and N is the number of testing samples. I_k(y) is an indicator function that evaluates to one when y = k, that is, I_k(y) = 1, and otherwise evaluates to zero; y_i and ŷ_i are the true label and predicted label of the i-th sample, respectively. The validation loss is also used to provide an extra measure of the model's performance, as it indicates how well the model generalizes to unseen data. Equation (5) depicts the loss function:

L = (1/N) Σ_{i=1}^{N} λ(ŷ_i, y_i)    (5)

where ŷ_i is the network prediction, y_i is the ground-truth value and λ is the individual loss function, that is, log-loss in the proposed model.
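The evaluation metrics above can be sketched for the binary glasses/no-glasses case as follows (an illustrative sketch; it treats "glasses" = 1 as the positive class, which is an assumption about the labelling convention):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Equation (4) for the binary case: percent of correctly identified images."""
    return 100.0 * np.mean(y_true == y_pred)

def log_loss(y_true, p_pred, eps=1e-12):
    """Equation (5) with a log-loss lambda; p_pred is the predicted P(glasses)."""
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def far_frr(y_true, y_pred):
    """False acceptance / false rejection ratios for the binary task."""
    far = np.mean(y_pred[y_true == 0] == 1)  # negatives wrongly accepted
    frr = np.mean(y_pred[y_true == 1] == 0)  # positives wrongly rejected
    return far, frr

y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1])
print(accuracy(y_true, y_pred))  # 60.0
```

The EER is then the operating point (over the decision threshold on P(glasses)) where FAR equals FRR.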
The network achieved 96% accuracy on the 46k selfie dataset with 0.15 log-loss. Furthermore, Table 2 depicts the standard biometric metric values of the proposed network model. In general, the results are very good considering the challenging, realistic nature of the selfie dataset. This result proves that the features learned from the ImageNet dataset are generic enough to generalize to the selfie dataset. However, a high percentage of this accuracy is attributed to the extra feature maps learned during the network training phase, which enriched the final classification stage. The proposed CNN model's performance is also compared against four baselines to emphasize its effectiveness. The selected baselines represent the current research themes in glasses detection, namely hand-crafted features and CNN-based approaches. The hand-crafted approaches are Dense SIFT [13] and Dense HOG3D [13]. The CNN-based approaches are represented by the state-of-the-art Google Teachable Machine [50], which is based on a TensorFlow implementation, and the two-stage CNN approach [6]. The benchmarking results, depicted in Table 3 and visualized in Figure 7, emphasize the robust performance of the proposed glasses detection CNN model, as it outperformed the hand-crafted approaches by 60 ± 1.4% and the CNN-based approach by 10.8 ± 0.2%.
Regarding the two-stage CNN baseline [6], the reported 100% glasses detection accuracy was obtained on the VISOB dataset [24]. This dataset only consists of ocular images, that is, not full facial images. The testing data were generated by digitally stamping eyeglasses on the dataset images. The testing and validation were carried out using 23,826 and 1191 images respectively (5% of the dataset for validation). In the proposed work, however, the entire selfie dataset (32,785 images for training and 14,050 images for validation) was used. This constitutes an 87.22% increase in training and validation data (without data augmentation) compared to [6], which highly contributes towards a more robust result. In addition, the proposed work did not utilize any images with digitally stamped frames, as the selfie dataset is entirely realistic.
Furthermore, the performance of the proposed network model was validated on six common facial analysis benchmark datasets. The first, that is, CAS-PEAL-R1 [31], is a Chinese face dataset composed of 99,594 images of 1040 individuals (595 males and 445 females). The dataset was constructed using nine cameras mounted horizontally on an arc arm to simultaneously capture images across different poses. The second, that is, FERET [11], is a standard face recognition dataset composed of 14,126 images that includes 1199 individuals and 365 duplicate sets of images; a duplicate set is a second set of images of a person already in the database, usually taken on a different day. The third, that is, NIVE [34], is a face expression analysis dataset collected from 215 test subjects captured under different facial expressions. The fourth, that is, LFW [33], is a public face verification benchmark composed of 13,233 images of 5749 people. The fifth, that is, Caltech [51], contains a total of 10,524 faces in 7092 images collected from the web. All five of these datasets depict subjects with glasses, including dark-framed glasses, frameless glasses and sunglasses, which is the main concern of the proposed work. The final dataset is a fully synthesized version [52] of the famous MS-Celeb [53] dataset (47,917 images) that was virtually stamped with thick black-framed eyeglasses. This dataset is very challenging, as it maximizes the intra-variations caused by eyeglasses [52].
The proposed model's glasses detection accuracy (%) on the six aforementioned public datasets is shown in Figure 8. The figure reflects an average performance of 91.7 ± 4.7% over the FERET, LFW, NIVE, CAS-PEAL-R1, synthesized MS-Celeb and Caltech WebFaces datasets. In addition, to further confirm the robustness of the proposed CNN model, a large, diverse group of selfie images was collected from the Internet and tested with the proposed CNN. The system achieved an accuracy of 97.05%, which consolidates the previous result.
Moreover, to emphasize the benefit of transfer learning, the full CNN network depicted in Figure 6 was reset and fully trained from scratch on the selfie dataset only. The layers' implementation details are given in Table 4. After a full cycle of training epochs, following the same data setup proposed earlier, the network achieved a detection accuracy of 86.5% on the selfie dataset. As Figure 9 shows, this is 9.5% lower than the version that relies on transfer learning, which quantifies the benefit of the knowledge transferred from ImageNet. However, this result is not bad considering the size of the selfie dataset, which is only 4.6% of the ImageNet dataset.
For a deeper look into the network learning phase, Figure 10 illustrates the evolution of the convolutional kernels through the convolution layers, up to the image classes learnt by the final fully connected layer. The figure shows that the network has learned the different components of images, like edges and colour blobs, in addition to a group of frequency and orientation filters, where the glasses shape is clearly depicted in the final-layer image class. Figure 11 shows activations from the last convolutional layer during the forward pass for the input image depicted in the same figure, where the bright areas in the activation channels correspond to activation based on the presence of glasses. Finally, for the qualitative performance of the proposed DL model, Figure 12 depicts a group of challenging selfie images that were classified based on their glasses attribute using the proposed CNN model. The results reflect the classification power of the network, as it has learnt a variety of useful features to identify glasses in such selfie images. A clear example that depicts the developed CNN's power is the image indexed at 4 × 4 (row × column), where the subject is not wearing glasses but has them placed over her head, and the network correctly identifies this image as not wearing glasses. However, there are some cases where the network failed to detect the glasses. For example, the image in Figure 12 indexed at location 6 × 2 was mistakenly classified as wearing glasses; this was attributed to the magnifier covering the eyes, which was mistakenly identified as glasses. Furthermore, the image indexed at location 6 × 1 was also mistakenly classified because of the small subject-to-scene ratio, that is, the subject is very small in the image.

Table 4: The layers and layer parameters of the proposed glasses detection network.

Figure 10: Evolution of the convolutional kernels through the glasses detection CNN until the final fully connected layer: conv1 = 96 kernels, conv2 = 256, conv3 = 384, conv4 = 384 and conv5 = 256. The final fully connected layer depicts images that most closely resemble each image class, that is, with/without glasses. The glasses shape is clearly depicted in the final-layer image class (left image).

| CONCLUSION
This paper presented an effective CNN model for glasses detection from selfie images. Distinct from previous work, this paper utilizes a realistic dataset, that is, the selfie dataset, with non-synthesized glasses and challenging frontal faces with full/partial body. The selfie dataset is highly challenging (46k images) and has seen little prior use, due to its unprecedentedly variable, uncontrolled nature (photos taken by ordinary users of themselves anytime/anywhere). The proposed CNN model achieved 96% accuracy. To reach such a result, the proposed model was built with knowledge transferred from the 1.2-million-image ImageNet dataset, which enables abstraction from raw data and generalization to unseen data. However, even with such transfer learning, the network needed extra layers and full epochs (complete training cycles on the whole training data) that extended for almost two weeks. After such extensive training, the network had learnt a variety of different image components, that is, edges and colour blobs, which are important in detecting the existence of glasses in the input image.
The results presented in this paper are sufficient for the targeted problem. However, there is still room to improve the achieved accuracy, by implementing other CNN models or even combining the proposed CNN with a long short-term memory (LSTM) network towards a better result.

Saddam Bekhet
https://orcid.org/0000-0002-3028-6500

Figure 12: Sample images from the selfie dataset that were classified using the proposed deep learning model. Correct predictions are tagged with a green label and wrong predictions are tagged with a red label.