On the Effect of Selfie Beautification Filters on Face Detection and Recognition

Beautification and augmented reality filters are very popular in applications that use selfie images captured with smartphones or personal devices. However, they can distort or modify biometric features, severely affecting the ability to recognize an individual's identity or even to detect the face. Accordingly, we address the effect of such filters on the accuracy of automated face detection and recognition. The social media image filters studied either modify the image contrast or illumination, or occlude parts of the face with, for example, artificial glasses or animal noses. We observe that the effect of some of these filters is harmful to both face detection and identity recognition, especially if they obfuscate the eyes or (to a lesser extent) the nose. To counteract this effect, we develop a method to reconstruct the applied manipulation with a modified version of the U-NET segmentation network, which is observed to contribute to better face detection and recognition accuracy. From a recognition perspective, we employ distance measures and trained machine learning algorithms applied to features extracted with a ResNet-34 network trained to recognize faces. We also evaluate whether incorporating filtered images into the training set of the machine learning approaches is beneficial for identity recognition. Our results show good recognition when filters do not occlude important landmarks, especially the eyes (identification accuracy >99%, EER <2%). The combined effect of the proposed approaches also allows mitigating the impact of filters that occlude parts of the face, achieving an identification accuracy >92% with the majority of perturbations evaluated, and an EER <8%. Although there is room for improvement, when neither U-NET reconstruction nor training with filtered images is applied, the accuracy with filters that severely occlude the eyes is <72% (identification) with an EER >12%.


Introduction
Selfie images captured with smartphones enjoy huge popularity and acceptability, and social media platforms centered around sharing such images offer several filters to "beautify" them before uploading. Filtered images are more likely to be viewed and commented on, achieving a higher engagement (Bakhshi et al., 2021), and selfies are also increasingly used in security applications since mobiles have become data hubs used for all types of transactions (Rattani et al., 2019).
A challenge posed by such filters is that facial features may be distorted or concealed. Given their low cost and immediate availability, they are a commodity that many people use daily, not necessarily with the aim of compromising face recognition systems. However, the capability of recognizing individuals may be affected, and even the possibility of detecting the face itself before any recognition can take place. This may be crucial, for example, in crime investigation on social media (Powell and Haynes, 2020), where automatic pre-analysis is necessary given the magnitude of information posted or stored in confiscated devices (Hassan, 2019). There are multiple examples of crimes captured on mobiles (Berman and Hawkins, 2017; Pagones, 2021), the most striking lately being the use of posted videos of the US Capitol to identify rioters (Morrison, 2021). There is, therefore, interest in studying the consequences of different levels of image manipulation and concealment of facial parts due to these "beautification" filters. It would also be of interest to evaluate methods that remove the filter's effect so as to avoid a decrease in face detection and recognition performance. The purpose and contributions of this work are therefore multi-fold: i) we summarise related work on digital image manipulation, in particular with the purpose of facial beautification; ii) we study the impact of image enhancement and Augmented Reality (AR) filters both on the detection of filtered faces and on the recognition of individuals; iii) we develop a method to reverse some of the applied manipulations (in particular, those that entail obfuscating parts of the face) and evaluate its benefit for improving both detection and identity recognition of filtered faces; and iv) we study whether training the identity recognition system with manipulated images helps to increase recognition accuracy when such manipulations are present in test images.
As manipulation methods, we use the most popular selfie filters on Instagram (Canva, 2020), which mainly modify contrast/lighting, as well as AR filters that obfuscate the eyes or nose by adding items like glasses or animal noses. These are applied to 4,324 images of the LFW database (Huang et al., 2007). Face detection is carried out with the CNN of Geitgey (2018), while feature extraction is done with a ResNet-34 pre-trained for face recognition (King, 2017). To reverse the manipulations, we use a modified U-NET (Ronneberger et al., 2015) that we train on 203K images of the CelebA database (Liu et al., 2015). We focus on reversing modifications to the eye region, since these are observed to have the biggest impact on face detection and recognition. For example, obfuscating the eyes reduces face detection accuracy to <90%, while modifying contrast/lighting or obfuscating the nose does not have a significant impact (>98%). A similar trend is observed in face identification and verification experiments. After applying the generative U-NET reconstruction, detection is recovered to the levels of the original unmodified images. This highlights both the impact of eye occlusion on the face detector and the benefit of the reconstruction method employed. t-SNE scatter plots of the vectors given by the feature extraction network are also provided, showing progressively higher intra-class (identity) variability and smaller inter-class variability as the nose or the eyes become occluded. This mirrors the results of face detection, and suggests that identity separation will be impacted as well.
The rest of the paper is organized as follows. Related works on facial manipulation are outlined in Section 2. The beautification filters are presented in Section 3, and the U-NET reconstruction method in Section 4. The employed data, its preparation, and the experimental protocol are described in Section 5, with the obtained results discussed in Section 6. Finally, conclusions are drawn in Section 7.

Related Works
Facial manipulation can be done in the physical or the digital domain. Physical manipulation can be achieved, for example, via make-up or surgery, while digital manipulation or retouching is done via software. Physical manipulation can be permanent (surgery) or non-permanent (make-up). Make-up can be quickly applied, so the same person may appear different even after a short period. Also, given the wide acceptance of cosmetics, it may appear in enrolment data too. Digital retouching allows modifications similar to those of surgery or cosmetics, but in the digital domain, as well as other changes such as re-positioning or resizing facial landmarks. A common aim of these modifications is to improve attractiveness (beautification). Of course, it is also possible that someone pretends to look like somebody else to gain illegitimate access, or hides their own identity to avoid recognition (Ramachandra and Busch, 2017; Scherhag et al., 2019). Another manipulation is the use of facial masks, either surgical due to the current pandemic (Damer et al., 2020) or artificial as used in Presentation Attacks (Ramachandra and Busch, 2017). However, these are out of the scope of this paper, since they are not oriented towards beautification.
Some works focus on detecting retouched images. The methods proposed include Supervised Restricted Boltzmann Machines (SRBM) (Bharati et al., 2016), semi-supervised autoencoders (Bharati et al., 2017), and Convolutional Neural Networks (CNN) (Jain et al., 2018). These works also present new databases, such as the ND-IIITD Retouched Faces database (Bharati et al., 2016) or the MDRF (Multi-Demographic Retouched Faces) database (Bharati et al., 2017), generated with paid and free applications that provide, for example, skin smoothing, face slimming, eye/lip color change, eye/teeth brightening, etc. The authors of the MDRF database also analyze the impact of gender or ethnicity, showing that detection accuracy can vary greatly with demographics. The work of Jain et al. (2018) also analyzes the detection of GAN-altered images. All these approaches consider the use of a single image (the retouched one). In contrast, Rathgeb et al. (2020) propose a differential approach where the unaltered image is also available, something which, according to the authors, is plausible in some scenarios (e.g. border control). They use texture and deep features with images of the FERET and FRGCv2 datasets. Retouching is done with free applications from the Google Play Store, arguing that free applications are more likely to be used by consumers.

Another set of works analyze the impact of manipulated images on recognition performance. Dantcheva et al. (2012) gather two databases of Caucasian females with makeup: one from before/after YouTube tutorials mostly affecting the ocular area, and another by modification of FRGC images with lipstick, eye makeup or full makeup. The study employs Gabor features, LBP and a commercial system, showing an increase in error when testing against makeup pictures. They also find that applying LBP to Gabor-filtered images (as opposed to the original image) partly compensates the effect. Ferrara et al. (2013) study alterations such as barrel distortion or aspect ratio change. They also simulate surgery digitally, including injectables, wrinkle removal, lip augmentation, etc. They employ the AR database with two commercial systems and a SIFT-based algorithm, concluding that the systems can overcome limited alterations, but struggle with heavy manipulations. Digital retouching is studied in (Bharati et al., 2016) with the ND-IIITD database, using a commercial system and OpenBR, an open-source face engine, finding that performance is considerably degraded when testing against retouched images. Image retouching is also examined by Rathgeb et al. (2020) with a commercial system and the open-source ArcFace, showing its negative impact as well.

Beautification Filters
We focus on two manipulations: image enhancement and Augmented Reality (AR). These modifications, in particular AR filters, have not been addressed in the related literature. For enhancement, we use the 9 most popular selfie Instagram filters (Canva, 2020), which mostly change contrast and lighting (Figure 1). The ranking is based on the number of images tagged with a particular filter and the hashtag "#selfie". Since the Instagram API does not allow applying filters to a large amount of data, the filters are recreated with a four-layer neural network that learns the changes of each filter (Hoppe, 2021). The AR filters, in turn, obfuscate parts of the face that can be critical for recognition (eyes and nose, Figure 2). Such filters are very popular in social media (e.g. Snapchat) and even in conference platforms such as Zoom. In particular, we apply: "Dog nose", "Transparent glasses", "Sunglasses with slight transparency", and "Sunglasses with no transparency". These are merged with the face by using the landmarks (Figure 3b) given by Geitgey (2018).
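For illustration, the following is a minimal sketch of how a small network could learn the color transform of one filter. The paper only states that a four-layer network is used (Hoppe, 2021); the per-pixel formulation, layer widths and all names below are our own assumptions.

```python
import torch
import torch.nn as nn

class FilterMimic(nn.Module):
    """Learns the RGB -> RGB color mapping of a single Instagram-style
    filter from (original pixel, filtered pixel) training pairs."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(          # four layers, as in Hoppe (2021)
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),          # predicted filtered RGB
        )

    def forward(self, rgb):                # rgb: (N, 3) pixel values in [0, 1]
        return self.net(rgb)

# Once trained on pixel pairs taken from a few genuinely filtered images,
# the model can recreate the filter on arbitrarily many LFW images.
model = FilterMimic()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
```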

Image Reconstruction with U-NET
We use a modified version of the U-NET network. Originally presented for image segmentation (Ronneberger et al., 2015), it outperformed more complex networks in accuracy and speed while requiring less training data. It has a compression or encoding path with convolutions and max-pooling, followed by a decompression or decoding path with up-convolutions. This gives the network a U-shape (Figure 4), hence its name. Skip connections link feature maps of the encoding and decoding paths, with channels concatenated, allowing the model to focus on the parts of the image that change. We modify the original network, since our task is different. Inspired by Springenberg et al. (2015), max-pooling and up-convolutions are replaced by strided convolutions and transposed convolutions, respectively. Also, map concatenation in the skip connections is replaced by addition, halving the number of channels. With this, we expect to still retain changes of image patches while counteracting over-fitting.
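A minimal sketch of the modified architecture is given below, in PyTorch. The number of scales and the channel widths are illustrative (the paper does not list them), but the two modifications, strided/transposed convolutions instead of pooling/up-convolutions and additive instead of concatenated skip connections, are as described above.

```python
import torch
import torch.nn as nn

def down(cin, cout):
    # Strided convolution replaces convolution + max-pooling
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.ReLU(inplace=True))

def up(cin, cout):
    # Transposed convolution replaces the up-convolution
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.ReLU(inplace=True))

class MiniUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.d1, self.d2, self.d3 = down(3, 64), down(64, 128), down(128, 256)
        self.u3, self.u2 = up(256, 128), up(128, 64)
        self.out = nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1)

    def forward(self, x):               # x: (N, 3, H, W), H and W divisible by 8
        e1 = self.d1(x)                 # H/2
        e2 = self.d2(e1)                # H/4
        e3 = self.d3(e2)                # H/8 (bottleneck)
        d3 = self.u3(e3) + e2           # additive skip link: halves the channel
        d2 = self.u2(d3) + e1           #   count compared to concatenation
        return torch.sigmoid(self.out(d2))
```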

Databases
We use the version aligned by funneling (Huang et al., 2012) of Labeled Faces in the Wild (LFW) (Huang et al., 2007). It contains 13,233 images of 5,749 celebrities from the web with large variations in pose, light, expression, etc. To ensure a sufficient amount of images per person, we remove people with fewer than 10 images, resulting in 158 individuals and 4,324 images. Five datasets are then created by applying the Instagram filters and the four AR filters of Section 3. The Instagram dataset is created by applying one filter of Figure 1, chosen randomly, to each unfiltered image. Additionally, images with sunglasses are processed with U-NET, giving two more datasets of reconstructed images. This results in the 8 different datasets of Table 1. U-NET is trained to reconstruct the filters shades leak and shades no leak (Figure 2c, d) with the CelebA dataset (202,599 pictures of 10,177 people) (Liu et al., 2015). We use a batch size of 64, with Adam as optimizer and the MSE between the output and the target (unfiltered) images as loss. CelebA is not used for the biometric recognition experiments, allowing us to test the generalization ability of the U-NET model on unseen data.
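A sketch of the corresponding training loop follows. The batch size, optimizer and loss are as stated above; the dataset object `pairs` (yielding filtered/unfiltered CelebA image tensor pairs), the epoch count, and the `MiniUNet` class from the previous sketch are assumptions.

```python
import torch
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
loader = DataLoader(pairs, batch_size=64, shuffle=True)  # `pairs` is hypothetical
model = MiniUNet().to(device)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.MSELoss()

for epoch in range(10):                  # epoch count not stated in the paper
    for filtered, target in loader:
        optimizer.zero_grad()
        recon = model(filtered.to(device))
        loss = criterion(recon, target.to(device))  # MSE vs. unfiltered target
        loss.backward()
        optimizer.step()
```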

Face Detection and Feature Extraction
The 8 datasets of Table 1 are further encoded into feature vectors that will be used for biometric authentication. First, faces are detected with the "face location" function of Geitgey (2018). The detector used is the more accurate CNN model, rather than the default HOG model. If more than one face or no face is found, the image is discarded (e.g. Figure 3c). Note in Table 1 that the accuracy of the detector varies across the datasets, suggesting that the applied manipulations have a different impact (Section 6). Feature extraction is done with a ResNet-34 network reduced to 29 convolutional layers and with the number of filters per layer halved (King, 2017). This CNN has been trained from scratch for face recognition using around 3 million faces of 7,485 identities from the VGG dataset, the FaceScrub dataset, and other pictures scraped from the internet. The accuracy of this network on the LFW benchmark is 99.38% (King, 2017). It takes input images of 64×64 and produces a 128-dimensional vector (taken from the last layer before the classification layer).
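Both steps are exposed by Geitgey's face_recognition package; a minimal sketch (the filename is illustrative):

```python
import face_recognition

image = face_recognition.load_image_file("selfie.jpg")   # illustrative path
# CNN-based detector, more accurate than the default HOG model
boxes = face_recognition.face_locations(image, model="cnn")
if len(boxes) == 1:                     # discard images with 0 or >1 faces
    # 128-dimensional descriptor from the ResNet-34 model (King, 2017)
    vec = face_recognition.face_encodings(image, known_face_locations=boxes)[0]
```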

Face Identification and Verification
For identification, we carry out closed-set experiments. To find the closest subject of the database, we use both distance measures (Euclidean, Manhattan, and Cosine) and trained approaches (Support Vector Machines, SVM (Cortes and Vapnik, 1995), and Extreme Gradient Boosting, XGBoost (Chen and Guestrin, 2016)). Since the SVM is a binary classifier, we adopt a one-vs-all approach with multiple SVMs, taking the decision of the model that is most confident. XGBoost is multi-class, using softmax with cross-entropy loss as its objective. The SVM is a widely employed classifier with good results in biometric authentication (Fierrez et al., 2018), and XGBoost has wide adoption in industry, having obtained top rankings in recent machine learning challenges (DMLC, 2021). Before the experiments, feature vectors are scaled with min-max normalization, so each element is in the [0, 1] range.
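In scikit-learn / XGBoost terms, the pipeline could look as follows; the one-vs-all SVMs, softmax objective and min-max scaling follow the description above, while all remaining hyperparameters, variable names and the placeholder data are our own assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from xgboost import XGBClassifier

# Placeholder data standing in for the real 128-d face descriptors
# (the actual experiments use 158 identities from LFW)
rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 128)), rng.integers(0, 10, 200)
X_test, X_enrol = rng.random((50, 128)), rng.random((10, 128))

# Min-max normalization so each feature element lies in [0, 1]
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
X_enrol = scaler.transform(X_enrol)

# Distance-based identification: nearest enrolled template
d = cdist(X_test, X_enrol, metric="euclidean")   # or "cityblock", "cosine"
pred_dist = d.argmin(axis=1)

# One-vs-all SVMs; the most confident per-class model gives the decision
svm = OneVsRestClassifier(SVC()).fit(X_train, y_train)
pred_svm = svm.predict(X_test)

# Multi-class XGBoost with a softmax / cross-entropy objective
xgb = XGBClassifier(objective="multi:softmax").fit(X_train, y_train)
pred_xgb = xgb.predict(X_test)
```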
To measure identification accuracy, we first obtain the confusion matrix, which reports the number of instances of each actual class vs. the instances of each predicted class. The number of true positives (TP_i), false positives (FP_i), false negatives (FN_i), and true negatives (TN_i) of each class i is then obtained (i = 1, ..., N, with N being the number of classes). In a multi-class problem, the results are treated as a collection of one-vs-all binary problems, one for each class. Then, the average TP, FP, FN and TN over all classes is computed (for example, TP is obtained by averaging TP_1, ..., TP_i, ..., TP_N). Finally, the Accuracy, Precision, Recall and F1-measure metrics are obtained (Lever et al., 2016). In principle, these four metrics should differ, but since we are averaging the TP_i, FP_i, FN_i and TN_i of all classes, the four measures turn out to be equal (noa, 2020). For this reason, we will report a single identification metric that we call "Accuracy". This method, called micro-averaging, is the recommended choice in multi-class classification with unbalanced classes (noa, 2020), as is the case with the LFW database. An alternative would be to first obtain the Accuracy, Precision, Recall and F1-measure of each individual class, and then average them. This is called macro-averaging, since it combines the per-class metrics, and it would result in the four measures being different.
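The coincidence of the micro-averaged metrics can be seen directly from their definitions. In a single-label multi-class problem, every misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so:

```latex
\mathrm{Pr}_{\mu} = \frac{\sum_{i=1}^{N}\mathrm{TP}_i}{\sum_{i=1}^{N}(\mathrm{TP}_i+\mathrm{FP}_i)},
\qquad
\mathrm{Re}_{\mu} = \frac{\sum_{i=1}^{N}\mathrm{TP}_i}{\sum_{i=1}^{N}(\mathrm{TP}_i+\mathrm{FN}_i)},
\qquad
\sum_{i}\mathrm{FP}_i=\sum_{i}\mathrm{FN}_i
\;\Rightarrow\;
\mathrm{Pr}_{\mu}=\mathrm{Re}_{\mu}=\mathrm{F1}_{\mu}=\mathrm{Accuracy}.
```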
For the verification experiments, we use the same distance measures as previously (Euclidean, Manhattan and Cosine). Accuracy is measured by obtaining the False Rejection Rate (FRR) and False Acceptance Rate (FAR) at different distance thresholds (Wayman, 2009). Then, we report the Equal Error Rate (EER), which is the error at the threshold where FRR=FAR.
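A minimal sketch of how the EER can be computed from the two score distributions (our own illustration, not the paper's code):

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """EER from genuine (mated) and impostor (non-mated) distance scores.
    A comparison is accepted when its distance is below the threshold."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    frr = np.array([(genuine > t).mean() for t in thresholds])    # false rejections
    far = np.array([(impostor <= t).mean() for t in thresholds])  # false acceptances
    i = np.argmin(np.abs(frr - far))        # threshold where FRR is closest to FAR
    return (frr[i] + far[i]) / 2
```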

Results and Discussion

Table 1 provides the number of images of each dataset for which a face is detected. The benchmark (unfiltered) dataset has a detection rate of ∼99%. The face is also detected successfully in the case of Instagram filters, which can be expected since they mainly enhance contrast or lighting. Faces with a dog nose are also well detected, suggesting that the detector is not sensitive to nose occlusions. On the other hand, occlusions in the eye region have a high impact, even with transparent glasses. When the reconstruction network of Section 4 is applied to images with shades, detection accuracy is recovered to a great extent, highlighting the benefits of the employed reconstruction. Figure 5 depicts the reconstruction of the images of Figure 2c, d, showing a clear reconstruction of the majority of the eye area in the case of shades with 95% opacity. In the case of 100% opacity, the reconstruction is less successful, although sufficient to obtain a good detection accuracy.

Figure 6 shows the class separation of the various datasets after feature extraction, visualized by t-SNE (van der Maaten and Hinton, 2008) with perplexity 30. Only the five most frequent classes are colored, given the limited number of distinguishable colors. For the benchmark (unmodified) and the Instagram datasets, the clusters appear well separated, which suggests that class (identity) separation is possible with high accuracy. Clusters of the dog nose dataset are still separated, although closer and with higher intra-class variability. In the datasets with glasses, and especially with shades, the clusters appear much closer. A parallelism can also be seen between the t-SNE plots and the face detection results reported in Table 1, in the sense that faces where the nose or eyes appear obfuscated are more difficult to detect and to recognize.
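The t-SNE projection discussed above can be reproduced with scikit-learn; a minimal sketch, where the random arrays stand in for the real 128-dimensional descriptors and identity labels:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.random.rand(500, 128)     # stand-in for real face descriptors
labels = np.random.randint(0, 5, 500)   # stand-in identity labels

# 128-d descriptors -> 2-d points for plotting (perplexity 30, as in the paper)
emb = TSNE(n_components=2, perplexity=30).fit_transform(features)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=5)
plt.show()
```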
We now report face identification experiments in Table 2. With distance measures, the first original (unfiltered) image is used for enrolment, and identification attempts are done with filtered images of the different datasets. Comparatively, the Euclidean distance performs best, although just 1-2% better than the other distance metrics. The performance on the non-filtered benchmark and Instagram datasets is the highest, with minuscule differences between them (≈92-93% for all distance measures). The dog dataset follows at ≈56%, and the transparent glasses dataset at ≈50%. The performance of the shades leak and shades no leak datasets is poor, especially the latter. After reconstruction, the shades recon leak dataset shows some performance recovery (66.3%). On the other hand, reconstruction with the non-leaking shades (100% opacity) does not contribute to any performance improvement with distance measures.

To carry out identification experiments with trained methods, the datasets are split into 80% (training) and 20% (test). Training is done either with benchmark unfiltered images (denoted as "Train=Benchmark") or with filtered images of the corresponding datasets ("Train=Filter"). Identification tests are always done with filtered images. A first observation is that both trained methods behave similarly on the different datasets, at least on those that do not involve shades (eye obfuscation). For the datasets with shades, the SVM is significantly better than XGBoost. Also, both trained methods benefit in the majority of cases from training with filtered images. This is especially true for the SVM with the datasets involving shades, where accuracy improvements between 9.1% and 25.5% are observed. Also relevant is the improvement observed between shades leak (86.6%) and shades recon leak (94.6%), highlighting again the benefits of the image reconstruction method. On the other hand, as observed in the previous paragraph too, reconstruction with the non-leaking shades (100% opacity) does not contribute to any performance improvement. A significant difference, however, is that the performance of the SVM on the datasets involving shades is much better than with distance measures (compare the best cases marked in bold). Another benefit of trained methods is that the performance with datasets entailing any type of obfuscation (either nose or eyes) is significantly better. For example, with the SVM, the performance on all datasets is around 85% or above. Notably, the dog dataset reaches 96.6% (56.2% with distance measures), the transparent glasses 92.2% (49.8%), the shades recon leak 94.6% (66.3%), and the shades recon no leak 84.9% (36.8%).
Finally, we report verification experiments (Table 3). As before, the first original (unfiltered) image is used for enrolment, and the remaining filtered images of the different datasets for verification attempts, both genuine (mated) and impostor (non-mated). As can be observed, the Euclidean distance performs best, although the other distances are less than 1% behind. The best EERs are for the benchmark and Instagram sets at ≈2%, with the rest (in descending order of performance) at 7% (dog), 8.2% (glasses), 12.5% (shades leak), and 14% (shades no leak). The EER for the reconstructed shades leak, at 6%, even surpasses the dog results. Also, as before, reconstruction with the non-leaking shades (shades recon no leak) does not show any improvement.

Conclusions
Social media platforms offer many different filters to "beautify" selfie images before they are uploaded. Augmented Reality (AR) filters that modify the face image by adding items like "noses" or "glasses" are also popular in social media and video-conference applications. Unfortunately, such filters may distort or occlude facial features, affecting the capability of recognizing individuals or even of detecting the face itself. This may hinder, for example, the investigation of crimes on images from social media platforms or confiscated mobile devices, where the huge amount of data to be analyzed demands some automatic pre-screening (Powell and Haynes, 2020; Hassan, 2019). Accordingly, we are interested in studying the effect of popular selfie "beautification" filters on the accuracy of both face detection and recognition. We use the most popular selfie filters on Instagram (Figure 1) as well as AR filters that add artificial animal noses or glasses to the face (Figure 2). We consider three types of glasses: transparent, sunglasses with 95% opacity, and sunglasses with 100% opacity. The first class of filters mostly alters the contrast or illumination of the image, while the second class obfuscates part of the face with the artificial item that is introduced. These are applied to 4,324 images from 158 individuals of the Labeled Faces in the Wild (LFW) database (Huang et al., 2007), after which we apply a CNN-based face detector (Geitgey, 2018). Recognition is done using features from a ResNet-34 network trained to recognize faces (King, 2017).
The effect of some of the employed filters has been observed to be detrimental to both face detection and identity recognition, especially those that obfuscate the eye region. Thus, we also explore methods to reverse the applied manipulations. To do so, we train a modified U-NET network (Ronneberger et al., 2015; Springenberg et al., 2015) on a separate set of 203K images from the CelebA database (Liu et al., 2015) to recover the obfuscated eye region (Figure 5). Regarding face detection, changes to image contrast/illumination or obfuscation of the nose region do not have a great impact (detection accuracy >98%). Obfuscation of the eye region, on the other hand, reduces the accuracy by 10%. This means that the detector relies on the availability of the eye region, even if the rest of the face is visible. This is confirmed after applying the U-NET reconstruction network, when detection accuracy is recovered to the levels of unmodified images. Face occlusion is a difficulty known to make detection systems struggle (Zeng et al., 2021), although our results indicate that eye occlusion is more critical than, for example, nose occlusion.
A similar trend is observed in the face identification and verification experiments. Alterations to the eye region produce a much stronger decline in accuracy than altering the nose area or changing contrast/illumination. This indicates that the employed ResNet model also relies strongly on the eye region to distinguish identities. Recognition experiments are done both with distance measures between feature vectors and with trained classifiers (Cortes and Vapnik, 1995; Chen and Guestrin, 2016). As enrolment images, we evaluate both the use of original unfiltered images and the use of filtered images. In general, trained classifiers outperform distance measures, and the use of filtered images for enrolment also provides better accuracy. In overall terms, by combining the use of filtered images to train machine learning methods with the reconstruction of eye regions by U-NET, we manage to achieve an identification accuracy above 92% with the majority of image perturbations. In the verification experiments, the EER is kept generally below 8%. The application of sunglasses affects the system most heavily, and in the case of sunglasses with 100% opacity, the recognition performance does not change after applying U-NET. Even if face detection is improved in this case too, the achieved reconstruction of the eye region (Figure 5) is not sufficient to improve recognition. This can be somewhat expected, given that the original pixels behind the sunglasses are completely destroyed. Thus, a faithful reconstruction is not achieved as with the glasses of 95% opacity, where the original information behind the glasses is preserved to a certain extent.
As future work, we are exploring how to improve the reconstruction performance further, for example by incorporating image translation methods based on adversarial training (Zhu et al., 2017). We expect to achieve a more realistic and accurate result, especially with non-leaking (100% opacity) AR filters such as the sunglasses, which have been shown in our experiments to have a negative impact on both face detection and identity recognition. We will also explore the reconstruction of other face areas, such as the nose or the mouth.
Another direction that we have not addressed in this work is the detection of the applied manipulations. Our results have shown that accuracy can be improved by using filtered images to train the recognition algorithm and by reconstructing the altered face regions. To enable this, knowing which specific alteration has been applied, or which face regions have been modified, is useful. We predict that detecting alterations at the patch level will be a fruitful avenue (Jain et al., 2018). This will be combined with the use of detection and recognition methods based on local analysis, so that if one particular region is occluded or altered, it can be excluded from contributing to the task. This is similar, for example, to using detectors of the periocular region that do not rely on the full face being available (Alonso-Fernandez and Bigun, 2016), but applied to different regions of the face.