Masked Face Detection and Calibration with Deep Learning Models

Under the COVID-19 pandemic, the demand for face detection systems that can handle masked faces has become imperative. In this study, we utilize several state-of-the-art face detection models and compare them on various unmasked and masked human face datasets. By analyzing the results, we evaluate these disparate models and uncover several problems. To overcome the problems discovered, we propose and implement several improvements and acquire further results for analysis. Finally, we propose some directions for future research.


Introduction
Since the outbreak of the coronavirus in January 2020, more than 100 million people have been infected with the COVID-19 virus 1 . Epidemiologists claim that wearing a mask can efficiently prevent the virus from spreading 2 on account of its capacity for airborne transmission. Therefore, authorities worldwide have begun to request that the public wear masks in public areas. However, such policies pose serious challenges to existing face recognition systems, since facial features of the nose and mouth are covered, leaving fewer features for recognition systems to detect.
Traditionally, adaptive boosting and cascade classifiers are used for face detection, including for masked cases [1]. More recently, however, deep learning models have become dominant in masked face detection, as we discuss in Section 2, e.g., fine-tuned InceptionV3 in a transfer learning based approach and pre-trained VGG- or ArcFace-based models. Our aim is to compare various models, e.g., YOLO and SSD face detection models, on a number of different datasets, such as the Real-World-Masked-Face-Dataset released by Wuhan University and its associated laboratory, and simulated masked datasets such as masked LFW (Labeled Faces in the Wild). Many studies contribute new datasets or focus on masked face recognition. Datasets of masked faces are relatively rare, which impedes the training of proposed models, so these dataset contributions are considerable. There are plenty of machine learning and deep learning methods for detecting masks on faces, some based on fine-tuning existing pre-trained models, and all achieving good performance. As for face recognition, there are two research directions: using the remaining visible face regions for recognition, or recovering the whole face to provide as many features as possible for models to recognize. In general, these methods use different amounts of facial features, and their performance therefore varies widely as well.
Nonetheless, there is still little research focusing on the condition in which people wear masks. Some studies propose new datasets, and many others aim to detect masked faces directly or via different approaches. However, building new datasets is time-consuming and costly, and researchers have to invest a large quantity of resources to collect data to train novel models. In addition, plenty of existing models already deployed in many systems would have to be replaced because of their low performance in masked face detection. The problem we need to tackle is whether we can detect masked faces based on a combination of existing models.
The performances of these models on masked datasets decrease to various degrees. For example, the accuracy of the YOLOFace model decreases by 30-40% (from more than 90% to around 50-60%) when detecting masked faces from the datasets listed in Section 3, and the accuracy of the SSDFace model decreases by 5-10% (from nearly 100% to 90-95%). Moreover, the accuracy of the RetinaFace model decreases by about 20% (from nearly 100% to 80%), the accuracy of the CenterFace model decreases with remarkable variation, and the accuracy of LFFDFace decreases by 2-5% (from nearly 100% to 95%).
The rest of our study is organized as follows:
ꞏ We analyze and compare the performances of different deep learning models trained on several datasets.
ꞏ We then make some judgments in Section 4 to explore the results further, with histograms illustrating the situation, and expose some general and detailed problems of the YOLOFace and SSDFace detection models.
ꞏ We improve certain models we utilize, namely the YOLOv2 detection model, and use the improved YOLOv3 and YOLOv4 models to generate new results for overall judgment and comparison.
ꞏ Finally, we raise several open problems and give some possible solutions, hoping to inspire further research.

Related Work
In this section, we discuss related work from three aspects, namely, masked face datasets, masked face detection, and masked face recognition.

Masked Face Datasets
There are two approaches to creating masked face datasets. The first is to collect real-world images of masked faces on the Internet. However, this approach is time-consuming and costly, and it is difficult to build a large image dataset this way. The alternative is to detect anchor points in face images and append fake masks. With this approach it is easy to create a large masked face dataset, because many open face image datasets, e.g., WebFace and LFW, already exist. Some competitions are also related to masked face recognition and release corresponding datasets for research [2]. In this part, we introduce some of these masked face datasets.
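As a toy illustration of the simulated-mask approach, the placement geometry can be sketched from a handful of facial anchor points; the landmark names and the simple quadrilateral below are our own illustrative assumptions, not the procedure of any specific dataset:

```python
def mask_quad(landmarks):
    """Sketch of simulated mask placement from facial anchor points.

    `landmarks` maps illustrative names (not a real API) to (x, y)
    pixel coordinates; a fake mask texture would be warped onto the
    returned quadrilateral covering nose, cheeks, and chin.
    """
    return [landmarks["left_cheek"], landmarks["nose_bridge"],
            landmarks["right_cheek"], landmarks["chin"]]

def quad_area(pts):
    """Shoelace area of the quadrilateral, e.g. to check that the
    simulated mask covers a plausible fraction of the face box."""
    n = len(pts)
    s = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
            for i in range(n))
    return abs(s) / 2.0
```

A fake mask image would then be warped onto the returned quadrilateral; the shoelace area offers a quick sanity check that the simulated mask covers a plausible fraction of the face region.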
In [3], the authors present a novel dataset focusing on Indian faces, considering that existing datasets lack cultural diversity and are collected in unrestricted settings. The proposed dataset contains 1,574 images covering several kinds of unconstrained circumstances. The authors conduct two experiments to evaluate the proposed IMFW dataset. The first evaluates the performance of four pre-trained models, namely, VGGFace, ResNet-50 (trained on the VGGFace2 dataset), LightCNN29, and ArcFace. The second utilizes existing loss functions, i.e., contrastive loss and triplet loss, for masked face recognition. The results show that the models perform poorly on both the subsets and the complete set, with the lowest accuracy of 20% for the VGGFace model and the best accuracy of 60% for the ArcFace model.
Many face recognition systems need to improve their performance on masked face recognition. Since most existing face recognition approaches are based on deep learning, larger publicly available datasets are needed. In [4], the authors propose to construct masked face datasets by different means, yielding three types of datasets. The first is the Masked Face Detection Dataset (MFDD), which contains 24,771 masked face images. The second is the Real-world Masked Face Recognition Dataset (RMFRD), which contains 5,000 images of masked people and 90,000 unmasked images of the same subjects. The last is the Simulated Masked Face Recognition Dataset (SMFRD), with 500,000 face images of 10,000 subjects.
In [5], the authors propose three masked face detection datasets, namely, the Correctly Masked Face Dataset (CMFD), the Incorrectly Masked Face Dataset (IMFD), and their combination, MaskedFace-Net. There are 137,016 images in total, divided into correctly worn and incorrectly worn masked face categories.

Masked Face Detection
With masks, it becomes more difficult to detect either faces or masks, both of which have been considered in the literature. During the COVID-19 period, wearing a mask has become a common way to prevent the virus from spreading, and to comply with social distancing policies, many restaurants and markets have to ask customers to wear masks. With existing cameras, some tech companies have proposed automatic mask detection solutions. Furthermore, we want to detect the existence of faces to make sure that previously deployed systems still work even with masks. In this part, we introduce some relevant studies in this direction.
Many deep learning models [6] aid in identifying a masked face. In [7], the aim is to develop a deep learning model for detecting unmasked people. The work first applies image augmentation, then uses fine-tuned InceptionV3 as the pre-trained model in a transfer learning based approach. The model is further trained on the Simulated Masked Face Dataset, which consists of 785 masked and 785 unmasked faces. The InceptionV3 model attains an accuracy of 99.92% in the training phase and 100% in the testing phase, with a precision of 99.9% in training and 100% in testing, a remarkable performance that exceeds many other models.
There is an increasing need to detect subjects with masks to prevent the virus from spreading during the COVID-19 pandemic. A full training pipeline based on the ArcFace work is proposed in [8] to address this problem. The network is based on a modified ResNet-50 model, trained on the MS1MV2 dataset, which contains 5.8 million images of 85,000 identities. In the testing phase, the model achieves nearly 100% accuracy for mask-usage verification. The results surpass all others, with a 12% accuracy increase over the original.
Helping existing monitoring systems in public areas identify whether a person wears a mask is also studied in [9]. While people are required to wear masks in public areas under the COVID-19 pandemic, existing monitoring instruments face challenges in detecting masked people efficiently. The authors in [11] propose a deep learning model based on multi-graph convolutional networks (MGCN). The model is trained on a real-world masked face dataset containing more than 7,600 images and achieves an accuracy of 97.9%.

Dataset Exhibition
In this study, we utilize all the public datasets we can find in the literature to compare models on unmasked and masked face detection. Table 1 describes the selected datasets and their sizes; this comprehensive collection may also be useful for follow-up studies. There are six datasets in total, four of which have both an unmasked part and a masked part, namely the MFRD, AgeDB-30, LFW, and WHN datasets, while CASIA-WebFace and RMFD (Real-World-Masked-Face-Detection) are masked-only datasets.
AFDB, whn-masked, and RMFD were all created by Wuhan University during the coronavirus pandemic, aiming to help deal with the hygienic emergency and contribute to face detection research [12]. AFDB consists of 90,000 unmasked images and about 2,000 masked faces, with each masked face corresponding to several unmasked face images. In addition, we use RMFD, which contains about 4,000 images, as a supplementary masked dataset. The designers describe WHN as a verification dataset containing more than 4,000 images of about 300 actors, where, similarly, each unmasked person corresponds to various masked images. Notably, images in the RMFD dataset differ significantly from each other: some are stretched to relatively abnormal proportions, while others have varied backgrounds and dim lighting conditions. The AgeDB-30 [13], CASIA-WebFace [14], and LFW [15] datasets are all simulated masked datasets.
AgeDB-30 consists of images of 16,488 celebrities, such as scientists, politicians, and writers. We divide them into an unmasked part of 6,000 images and a masked part of an equal 6,000 images. The CASIA-WebFace dataset contains nearly 500,000 images of more than 10,000 celebrities, all collected from the IMDb website. LFW is a dataset often used for face recognition, in which the images derive from natural scenes with a variety of factors such as different postures, lighting conditions, expressions, and occlusion. It contains more than 13,000 pictures, all sized 250x250. We select 6,000 images for the unmasked dataset and 6,000 for the masked dataset.
During the research process, we also tried the CFP (Celebrities in Frontal-Profile in the Wild) dataset. Nonetheless, we abandoned it, since images in the unmasked part are frontal views whereas those in the masked part are profiles.

Models utilized
In this work, we utilize various networks to test on several datasets, all of which are highly advanced models that have been used and refined by many researchers. Below, we summarize these works and the details of their contributions.

YOLO 9000 [16]
Building on YOLOv1, the researchers make several improvements to increase the accuracy of the model while keeping its speed advantage. Batch Normalization (BN) enables YOLOv2 to converge much faster while reducing the dependence on other regularization methods (dropout can be abandoned without overfitting), ultimately achieving more than a 2% improvement in mAP. The authors also fine-tune the classification network at the higher input resolution for 10 epochs, addressing the problem that the previous YOLO model [17] had to switch directly from the 224 x 224 resolution used in ImageNet pretraining to a resolution of 448 at detection time. This gives the network time to adjust to the higher input resolution and attains a further 4% improvement in mAP.
Drawing on Faster R-CNN [18], the authors remove the fully connected layers from the previous YOLO and instead predict bounding boxes using anchor boxes, which results in a tiny drop in mAP but a significant increase in recall. To obtain better priors, the authors run k-means clustering on the training boxes, superseding the original Euclidean distance with a distance metric based on IOU. To ensure the stability of the model, the authors follow the location prediction method of YOLO, predicting location coordinates relative to the grid cell location.
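The IOU-based k-means step can be sketched in plain Python; as in the paper, boxes are reduced to (w, h) pairs and compared as if they shared a corner, and the function names here are ours:

```python
import random

def iou_wh(a, b):
    """IOU of two boxes given as (w, h) pairs, assuming a shared corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs with distance d = 1 - IOU, as in YOLOv2."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # nearest centroid = highest IOU (lowest 1 - IOU distance)
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        new = []
        for i, cl in enumerate(clusters):
            if cl:
                new.append((sum(w for w, _ in cl) / len(cl),
                            sum(h for _, h in cl) / len(cl)))
            else:
                new.append(centroids[i])  # keep empty cluster's centroid
        if new == centroids:
            break
        centroids = new
    return sorted(centroids)
```

The IOU metric makes large and small boxes contribute equally to the clustering, which is exactly why the authors prefer it over Euclidean distance on raw widths and heights.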
Additionally, there are smaller refinements, such as adding a passthrough layer that exploits finer-grained feature maps to localize smaller objects, and training YOLOv2 on images of various sizes, which improves accuracy while still allowing relatively fast detection at small resolutions.

SSDFace [19]
SSD is a light-weight algorithm with high accuracy, faster than YOLO and nearly as accurate as Faster R-CNN. Compared with Faster R-CNN [20], SSD has no region proposal stage, which significantly improves detection speed. Unlike traditional methods that detect objects of various sizes separately and then merge the results with non-maximum suppression, SSD achieves the same effect by utilizing feature maps from different layers. The backbone network is VGG16, which the authors fine-tune by removing, converting, and adding some fully connected layers. SSD uses convolutional layers at different depths to predict objects of different sizes: high-resolution feature maps in lower layers are assigned smaller anchors for small objects, and higher layers vice versa. Two different convolution kernels process the feature maps: one outputs class confidences for classification, while the other outputs locations for regression. The authors introduce default boxes involved in prediction, i.e., prior boxes similar to the anchors in Faster R-CNN; matching prior boxes to ground truth by IOU generates positive and negative examples, which are kept at a fixed ratio, making training easier to some extent. The data augmentation the authors implement also has a significant effect on performance: each sample is resampled to generate several candidates, from which one is selected stochastically for training.
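The multi-scale default (prior) box layout can be sketched as follows; the feature map sizes, scales, and aspect ratios below are illustrative values of our choosing, not the exact SSD configuration:

```python
def default_boxes(fmap_sizes, scales, ratios=(1.0, 2.0, 0.5)):
    """Generate SSD-style default boxes as (cx, cy, w, h) in [0, 1] coords.

    fmap_sizes: side lengths of square feature maps, e.g. [8, 4];
                lower (finer) maps get the smaller scales.
    scales:     one box scale per feature map (illustrative values).
    """
    boxes = []
    for fsize, scale in zip(fmap_sizes, scales):
        for i in range(fsize):
            for j in range(fsize):
                # centre of cell (i, j), normalised to [0, 1]
                cx, cy = (j + 0.5) / fsize, (i + 0.5) / fsize
                for r in ratios:
                    # aspect ratio r reshapes the box at constant area
                    boxes.append((cx, cy, scale * r ** 0.5, scale / r ** 0.5))
    return boxes
```

Each location of each feature map contributes one box per aspect ratio, which is how a single forward pass covers many object sizes at once.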

Retinaface [21]
Two backbone networks can be used for feature extraction during training, ResNet and MobileNetV1-0.25; the former contributes to higher accuracy, while the latter enables real-time detection on a CPU.
As a key method in this work, depthwise separable convolution greatly reduces the number of parameters. Similar to RetinaNet, RetinaFace adopts a Feature Pyramid Network that focuses on the later feature layers, using 1x1 convolutions to adjust the channel counts and then upsampling to fuse the features. For further feature extraction, the authors utilize the SSH module [22], whose parallel structure enlarges the receptive field by stacking 3x3 convolutions in place of 5x5 and 7x7 convolutions.
The predictions fall into three categories: face classification, face box regression, and facial landmark regression, which are used respectively to judge whether an anchor contains a face, to adjust the anchor into a predicted box, and to adjust the anchor to locate facial landmarks. Non-maximum suppression is then applied to eliminate duplicate anchors. By annotating five facial landmarks and adding an extra supervised and self-supervised mesh decoder branch, the authors minimize a multi-task loss function to obtain a strong model.
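The upsample-and-add fusion at the heart of the Feature Pyramid Network can be sketched in NumPy; we omit the 1x1 and 3x3 convolutions and assume the channel counts are already equalised, so this is a toy model of the fusion step only:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_fuse(features):
    """Top-down FPN fusion on a list of (C, H, W) maps, finest first.

    Channel counts are assumed already equal (the role of the 1x1
    convolutions); only the upsample-and-add step is shown here.
    """
    fused = [features[-1]]                    # coarsest level unchanged
    for f in reversed(features[:-1]):
        # add the upsampled coarser result into the finer map
        fused.append(f + upsample2x(fused[-1]))
    return list(reversed(fused))              # finest first again
```

This is why FPN outputs at fine resolutions still carry the semantic information of the deep, coarse layers.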
Compared with MTCNN [23], RetinaFace achieves a significant improvement in face verification accuracy. Additionally, it achieves real-time detection on VGA-resolution images on a single CPU.

LFFD [24]
Light and Fast Face Detector (LFFD) is an anchor-free, one-stage algorithm suitable for edge devices. Inspired by the correlation between the receptive field (RF) and the effective receptive field (ERF), the authors place RFs at the core of their design. Considering the drawbacks of anchors, they argue that RFs in feature maps are natural anchors that address the shortcomings of anchor-based methods, where different image sizes correspond to different ERFs. By controlling the stride of the RFs, 100% coverage of faces can be attained in theory. Additionally, RFs can handle continuous face sizes, which solves the anchor imbalance problem.
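The 100% coverage claim can be illustrated with a 1-D toy model of RF centres; the helper names below are ours, not LFFD's:

```python
def rf_centers(image_size, rf_stride, rf_start=None):
    """Centres of receptive fields tiled along one image axis.

    With stride s and the first centre at s // 2, every pixel lies
    within s / 2 of some RF centre, so faces are covered with no
    gaps -- the 'natural anchor' property LFFD relies on.
    """
    start = rf_stride // 2 if rf_start is None else rf_start
    return list(range(start, image_size, rf_stride))

def max_gap_to_center(image_size, centers):
    """Largest distance from any pixel to its nearest RF centre."""
    return max(min(abs(p - c) for c in centers) for p in range(image_size))
```

Shrinking the stride tightens the worst-case gap, which is what "controlling the stride of the RF" achieves in the paper.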
The authors design a network architecture with four parts for different face sizes, each of which has several loss branches, to ensure both robust face features and 100% coverage. They also utilize data augmentation, including random sampling, horizontal flipping, and colour distortion.
The loss function in each branch includes two sub-branches, for face classification and bounding box regression, allowing the model to handle RFs of various sizes.

CenterFace [25]
CenterFace is a light-weight, anchor-free algorithm able to predict face boxes and facial landmarks in real time with high accuracy, taking advantage of contextual maps and facial landmarks during learning. Using the centre point of the face bounding box, facial landmarks can be regressed directly from the features.
The authors select MobileNetV2 as the backbone network, keeping the structure small but powerful. Similar to RetinaFace discussed above, CenterFace also adopts a Feature Pyramid Network to adjust channels and perform further feature extraction, from which the outputs are regressed directly.
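Decoding a centre heatmap into face boxes can be sketched as follows; the array shapes and the simple thresholding (no peak pooling or NMS) are illustrative simplifications of the actual CenterFace pipeline:

```python
import numpy as np

def decode_centers(heatmap, sizes, threshold=0.5):
    """Decode a centre heatmap into face boxes, CenterFace-style sketch.

    heatmap: (H, W) centre-point confidences in [0, 1]
    sizes:   (H, W, 2) predicted (w, h) at each location
    Returns (x1, y1, x2, y2, score) for every peak above threshold.
    """
    boxes = []
    ys, xs = np.where(heatmap > threshold)
    for y, x in zip(ys, xs):
        w, h = sizes[y, x]
        # expand the centre point into a box using the predicted size
        boxes.append((x - w / 2, y - h / 2, x + w / 2, y + h / 2,
                      float(heatmap[y, x])))
    return boxes
```

Because each box is recovered from a single centre point, no anchor enumeration is needed, which is what keeps the method light-weight.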

YOLOv3 [26]
While still utilizing dimension clusters as anchor boxes, the authors predict an objectness score for each bounding box using logistic regression, assigning a score of 1 to the prior that best overlaps a ground-truth box; this keeps recall relatively high for small-object detection. YOLOv3 extracts features at three different scales by utilizing feature pyramid networks [27]. By using residual connections and upsampling, YOLOv3 can combine deeper semantic information with finer-grained information from earlier layers.
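The location prediction shared by YOLOv2 and YOLOv3, centre coordinates relative to the grid cell and sizes relative to the anchor prior, follows the formulas in the papers and can be written directly:

```python
import math

def decode_yolo_box(tx, ty, tw, th, cx, cy, pw, ph):
    """YOLOv2/v3 bounding-box decoding relative to grid cell (cx, cy)
    and anchor prior (pw, ph), as described in the papers."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = cx + sigmoid(tx)          # sigmoid keeps the centre inside its cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)         # width/height scale the anchor prior
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```

Constraining the centre to its grid cell is what gives the model the training stability mentioned in the YOLOv2 discussion above.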

Evaluation and Improvement
In this section, we focus on analyzing these results. Table 2 below shows the results obtained from the models tested on the datasets described above. The wide line divides the results into datasets with unmasked-masked pairs and masked-only datasets. Furthermore, we indicate some advantages and drawbacks of the models concerning certain problems.

General Analysis
For unmasked datasets, the results indicate that all of the models we use achieve excellent results, with accuracies near 100%. Since these unmasked images are of good quality, clear, and with little occlusion, the algorithms face few challenges. We therefore focus our analysis on the results for masked datasets. All of the models perform worse to various extents, and their performances differ significantly from each other, ranging, for example, from 50% to 95% on the AgeDB-30-masked dataset.

Analysis of significant abnormalities
We focus our analysis on the results for masked datasets, particularly those marked in red in Table 2. There are two abnormal parts to address. The first is the performance of YOLOFace on masked datasets, particularly on AFDB, AgeDB, LFW, and CASIA. The second is that SSDFace attains an aberrantly high accuracy on AFDB and AgeDB, contrary to its low performance on other masked datasets.
For the first problem, YOLOv2 itself is a relatively less accurate but fast algorithm. We propose that such performance results either from the weak adaptability of YOLOv2 on some datasets or from mistakes in the algorithm's pipeline. To verify our conjecture, we output the aspect ratios of YOLOFace detections on AFDB, AgeDB, LFW, and CASIA-WebFace; Figure 1 shows these aspect ratios. The intervals into which the aspect ratios fall are relatively normal, indicating that the cause is in fact the relatively low precision of YOLOv2. Concerning the second problem, we propose that SSDFace may detect some irrelevant regions as facial regions or detect facial landmarks falsely, which inflates its precision. To verify this conjecture, we draw the bounding boxes produced by the two models on the AFDB-masked and AgeDB-30-masked datasets and find that there is indeed a problem with the detection process of SSDFace: some bounding boxes cover only one or no facial landmark, yet SSDFace still reports a detection, making it hard to judge whether an image is properly detected. Using the aspect ratio histograms as before to judge the correctness of detections, we rectify the results to some extent. Figure 2 shows the histograms of ratios for the AFDB and AgeDB results. Restricting the reference ratio to 0.6-1.0 on AFDB and 0.7-1.1 on AgeDB, we obtain corrected accuracies of 78.73% on AFDB (90.92% originally) and 87.64% on AgeDB (94.48% originally).
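The aspect-ratio rectification described above amounts to a simple range filter over the detected boxes; the function below is a minimal sketch with our own naming:

```python
def filter_by_aspect_ratio(boxes, lo, hi):
    """Keep only detections whose width/height ratio lies in [lo, hi].

    boxes: iterable of (x1, y1, x2, y2) corner coordinates. The
    reference intervals (e.g. 0.6-1.0 on AFDB, 0.7-1.1 on AgeDB)
    come from the aspect-ratio histograms in Figure 2.
    """
    kept = []
    for x1, y1, x2, y2 in boxes:
        ratio = (x2 - x1) / (y2 - y1)
        if lo <= ratio <= hi:
            kept.append((x1, y1, x2, y2))
    return kept
```

Detections with extreme ratios are treated as false positives, which is how the inflated SSDFace accuracies are corrected downward.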
Furthermore, we use the more reliable LFFDFace algorithm to re-examine the bounding boxes output by SSDFace and check their validity. The results show that only 30.84% of SSDFace's boxes on the AFDB dataset and 21.06% on the AgeDB-30 dataset are valid. Given such low precisions, SSDFace's accuracy is indeed artificially high compared with other models, although some detections missed by LFFDFace may pull the measured precision down. The results also indicate that YOLOv3 achieves a significant improvement over YOLOv2 on AFDB and AgeDB-30. However, on the LFW and CASIA datasets, the performance decreases greatly. A reasonable explanation is that the YOLOv3 model is less competent on simulated masked images; otherwise, the accuracies on AgeDB-30, LFW, and CASIA would be higher. Meanwhile, all of these datasets contain large-scale face regions, which negates YOLOv3's remarkable advantage in small-scale object detection.
However, there are confusing results: some detected faces are falsely classified as other objects, such as cats or dogs. Moreover, facial landmarks are important for detection, since images with sunglasses and profile images are rarely detected at all.
Generally, YOLOv3 should generate much more accurate results, limited here by the simulated datasets and the large-scale image sizes. In the future, if more real-world masked datasets become available, this explanation can be properly verified.

Conclusion
In general, we test the YOLOv2, SSDFace, LFFDFace, RetinaFace, and CenterFace models on several datasets, namely, AFDB, AgeDB-30, LFW, CASIA-WebFace, RMFD, and whn. Evaluating the resulting series of results, we find that algorithms such as LFFDFace, RetinaFace, and CenterFace achieve remarkable performance, whereas YOLOFace and SSDFace show problems. By analyzing the aspect ratios of the results from YOLOFace and SSDFace, we discover that YOLOv2 has limited adaptability to masked datasets, and that SSDFace is far less suitable for masked face detection. Beyond YOLOv2, we additionally test YOLOv3 and YOLOv4 on the same datasets and obtain unexpected results, which, as our later analysis concludes, may stem from using simulated rather than real masked datasets and from large-scale facial images.
There are still several drawbacks in this work. We had limited time to build our own datasets and lacked sufficient resources to train large models, so we chose well-designed, established datasets and used well-trained networks directly for assessment. In future work, we hope to design and train our own network to make the study more comprehensive.