FACE RECOGNITION FOR SMART ATTENDANCE SYSTEM USING DEEP LEARNING

Galuh Putra Warman, Gede Putra Kusuma


INTRODUCTION
Attendance systems were previously managed using traditional methods, whose implementation can be time-consuming and impractical. A smart attendance system that combines face detection and face recognition methods can overcome this ineffectiveness. Naufal et al. [1] therefore introduced image processing techniques to perform face recognition in an attendance system, where a student's face captured in the classroom can be labeled and compared with images accumulated in a database. A face recognition smart attendance system automatically marks students' attendance in a class by recognizing their faces. The system is divided into several steps, of which face detection and face recognition are the main ones. First, a database of each face is needed to mark attendance. The camera device captures all facial images in the classroom, face detection is carried out, and face recognition is then performed on the detected faces by comparing them with the images in the database. A smart attendance system reduces attendance errors, produces the attendance list automatically, makes it easy for the parties concerned to record attendance, and saves time in registering attendance.
Huang et al. note that over the last ten years, deep learning has obtained interesting results in various fields, one of which is face recognition. In recent years, deep learning for face detection and face recognition has been widely applied in everyday life, based on algorithms that learn from large amounts of data by studying many factors such as faces, expressions, angles, and lighting Qu et al. [2]. Several face detection methods have been proposed. Deng et al. [3] proposed the RetinaFace method, which achieves stable face detection with average precision results of 96.713%, 96.082%, and 91.44% on the WiderFace dataset. Ren et al. [4] used the multi-task cascaded convolutional network (MTCNN) method to perform face detection on pedestrians; the evaluation results showed a detection rate of 74.1% for near-distance pedestrians, 66.8% for medium distance, and 34.2% for long distance.
Chen et al. [5] proposed the YOLO V3 method for face detection, achieving a result of 92.79%.
The method proposed by Naufal et al. [1] uses the Haar feature detection algorithm, which performs face detection and feature extraction on images to obtain input, with an accuracy rate of 95.23% on a synthetic dataset. Several face recognition methods have also been proposed. William et al. [6] proposed the FaceNet method for face recognition, using MTCNN to extract face areas from images, training them on a pre-trained model, and evaluating face recognition on the Yale, JAFFE, and AT&T datasets with results of 100%, 97.5%, and 100%, respectively. Jie et al. [7] proposed a face recognition method for students using Additive Angular Margin Loss (ArcFace), employing MTCNN for face extraction and feeding the results into an ArcFace model trained with ResNet50; the results were evaluated on the LFW dataset with a face recognition accuracy of 99.80%. Anand et al. [8] proposed a method with the GoogleNet (Inception) model using the Caffe and Nvidia DIGITS frameworks to perform face recognition, with an accuracy of 91.43% on the LFW dataset. Massoli, Amato, and Falchi [9] proposed the SeNet50 model, with an accuracy of 72% for facial recognition and 75.3% for image sizes from 8 x 8 pixels to 24 x 24 pixels.
Based on the recent studies above, several solutions for face detection and face recognition have been proposed. Therefore, this study presents an approach that uses the RetinaFace and MTCNN face detection models and the FaceNet and ArcFace face recognition models for a smart attendance system.
The objective of this paper is to find the combination of face detection and face recognition methods that produces the best accuracy and speed. The contributions of this paper are evaluating several face detection and face recognition models for smart attendance systems and identifying the best combination model for organizations to use in a smart attendance system.

RELATED WORKS
In this section, related works are divided into three parts: face detection, face recognition, and a summary of related works. For face detection and face recognition, we analyze which model has the best performance and is most suitable for implementation as described in the related works.

Face Detection
Deng et al. [3] conducted a study using RetinaFace to perform face localization. RetinaFace combines the prediction of face boxes, the localization of 2D facial landmarks, and 3D vertex regression in the image plane; it performs face localization to obtain details of the face so that it can be integrated with 3D regression. It uses a feature pyramid network to process the image input and produce five feature maps, then takes the features from a map at a specific scale and calculates the multi-task loss. The metric evaluation is based on AP using the WIDER FACE dataset, with final results of 96.713%, 96.082%, and 91.44%, for an average AP result of 55.02%.
Zhang et al. [10] proposed Multi-task Cascaded Convolutional Networks (MTCNN). The purpose of MTCNN is to construct a cascaded structure and use it for multi-task learning to predict the position of the face in a coarse-to-fine way. MTCNN also seeks to connect the two tasks. In application, MTCNN can identify faces in real time with high accuracy.
The MTCNN model is composed of three networks. The first is the Proposal Network (P-Net), which obtains candidate faces and gives each face a bounding box. The Refine Network (R-Net) then filters and refines these candidates, and the Output Network (O-Net) produces the final bounding boxes together with five facial landmark positions.

Face Recognition
William et al. [6] proposed face recognition using the FaceNet method. In this study, the FaceNet implementation used two pre-trained models, CASIA-WebFace and VGGFace2. In the FaceNet training process, face detection is first carried out by MTCNN (Multi-task Cascaded Convolutional Neural Networks) to obtain face images with a size of 182 pixels x 182 pixels; model training is then carried out on the edited image datasets using the two pre-trained models, CASIA-WebFace and VGGFace2, to improve the accuracy of face recognition.
Deng et al. [14] proposed Additive Angular Margin Loss (ArcFace), using the ArcFace loss function to further improve the discriminative power of the face recognition model and its training process. It uses the arc-cosine function to calculate the angle between the current feature and the target weight, adds an additive angular margin to the target angle, and obtains the target logit back again via the cosine function. The logits are then scaled as in the softmax loss for classification tasks. The evaluation was carried out on several datasets, and LFW obtained the best accuracy value of 99.82%.
Moon, Seo, and Pan [15] proposed face recognition using a multiple-distance facial Convolutional Neural Network (CNN) architecture, with the CNN as a feature extractor, trained on facial images captured at optimal distances to achieve the best performance. Data preprocessing uses interpolation and histogram equalization on the images; the CNN then extracts all the facial features, and the Euclidean distance is used to match faces against the database. The CNN consists of five layers, with the original images captured at different distances and sizes: the first layer is the input layer, the second is a convolution layer, the third a sub-sampling layer, the fourth a convolution layer, and the fifth a sub-sampling layer, with a total of nine face images per person used for training. Face recognition using the Euclidean distance method, comparing against the facial images in the database, achieved an average result of 88.9% within 1-9 meters.
Zhao et al. [16] proposed a research method using a deep neural network combined with a new alignment algorithm, PCA, and a Bayesian network to perform multi-view face recognition.
Three different CNNs are used to obtain feature vectors, and two classification methods are applied. L2 regularization is used during the training process to reduce overfitting, and the activation functions ReLU, PReLU, and ELU are compared. The PCA algorithm is used for dimensionality reduction of the features, and the joint Bayesian method is used for similarity assessment of the vectors in face recognition. The evaluation of face image recognition on the CAS-PEAL dataset obtained an accuracy rate of 98.52%.

Summary Related Works
Based on the reviews of the face detection and face recognition methods above, RetinaFace by Deng et al. [3] and MTCNN by Zhang et al. [10] show better performance in the face detection and alignment stage, with quite high accuracy results that exceed those of other face detection methods. For face recognition, the best models were FaceNet by William et al. [6] and ArcFace by Deng et al. [14]. FaceNet achieves high verification accuracy on the Essex faces dataset, and the ArcFace model consistently outperforms the state of the art in comprehensive experiments against other methods.
Based on these conclusions, this study evaluates combinations of the two sides, face detection and face recognition: the RetinaFace and MTCNN models are evaluated on the face detection side, and the FaceNet and ArcFace models on the face recognition side.

THEORY AND METHODS
In this section, the theory and methods are divided into three parts: deep learning, and two subsections on the face detection and face recognition models that will be used in this study.

Deep Learning
Deep learning is a subset of machine learning based on algorithms inspired by the structure and function of the brain. Deep learning enables computers to produce results by training on previously available examples, which makes it possible to achieve high results in related learning tasks [17]. The Convolutional Neural Network (CNN) is one of the deep learning algorithms; it works by taking an input image, convolving it with a filter or kernel to extract features, and then assigning weights and biases to the resulting data in order to study the various aspects or objects in the image and distinguish one from another [18].
The CNN architecture consists of several layers: an input layer, convolution layers, pooling layers, activation functions, fully connected layers, and an output layer [19]. The work in a CNN is divided into two modules, a feature extractor and a classifier. The feature extractor module takes the input image through a convolution process over the pixels of the image; this convolution is carried out repeatedly to obtain a result at each step. The final output of the convolution layers is a feature vector, which the classification module reads to find patterns; the recognized pattern is then classified and produced as the CNN's prediction output. These CNN components can not only perform classification but can also be applied to face detection and face recognition.
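As a concrete illustration of the convolution step described above, the following minimal sketch slides a 3x3 kernel over a small image; the input and kernel values are made up for illustration.

```python
# Minimal sketch of the convolution a CNN feature extractor performs.

def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and
    sum the element-wise products at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = sum(image[i + di][j + dj] * kernel[di][dj]
                      for di in range(kh) for dj in range(kw))
            row.append(acc)
        out.append(row)
    return out

# A simple vertical-edge kernel applied to a tiny binary image.
image = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 0]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]
feature_map = conv2d(image, kernel)   # responds strongly at the edge
```

In a real CNN the kernel weights are learned during training rather than hand-chosen as here.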

Face Detection Model
Face detection is used to detect faces in each image; in this study, the face detection models examined are RetinaFace and MTCNN.

RetinaFace
RetinaFace consists of three main components: a feature pyramid network (FPN), a context modeling module, and a cascade multi-task loss, as shown in Figure 1 below. The first part is the FPN with feature pyramid levels from P2 to P6, where P2 to P5 are calculated from the outputs of the ResNet residual stages C2 to C5. P6 is calculated by a 3x3 convolution with stride 2 on C5. C1 to C5 come from a classification network pre-trained on the ImageNet-11k dataset, while P6 is randomly initialized. The second part is the context modeling module, which strengthens the deformable convolutional network (DCN); this module applies feature mapping apart from the 3x3 convolution. The third part is the cascade multi-task loss, a loss function used to improve the performance of face localization. A cascade regression approach with multi-task loss is applied with 1x1 loss heads whose convolutions are spread over feature maps of different dimensions. The first context head produces a bounding box on the regular anchor, and the second context head predicts the bounding box more accurately from the regressed anchor. The loss function used is called mesh regression loss, a combination of vertex loss and edge loss.

MTCNN

In the MTCNN model, the first step, given an image, is to resize it to different scales to build an image pyramid, which is the input to the following three-stage cascaded framework: P-Net, R-Net, and O-Net. The original input picture is used to build the multi-scale image pyramid, which is then used to construct candidate regions via the fully convolutional form of P-Net. In R-Net, non-maximum suppression is used to remove candidate boxes with a high degree of overlap, and O-Net then identifies face regions with more supervision. In particular, the network outputs five facial landmark positions, as shown in the structure diagram of the MTCNN model in Figure 2 below.
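The pyramid-building step can be sketched as follows; the 0.709 scale factor and the 12-pixel network input size are defaults commonly seen in MTCNN implementations, assumed here for illustration.

```python
def pyramid_scales(width, height, min_face=20, factor=0.709, net_input=12):
    """Scales at which to resize the image so that a face of size
    min_face maps to the P-Net input size at some pyramid level."""
    m = net_input / min_face          # first (largest) scale
    min_side = min(width, height) * m
    scales = []
    while min_side >= net_input:      # stop once the image is too small
        scales.append(m * factor ** len(scales))
        min_side *= factor
    return scales

# Hypothetical 640x480 frame, looking for faces of at least 40 px.
scales = pyramid_scales(640, 480, min_face=40)
```

Each scale produces one resized copy of the frame; P-Net runs on every copy, so a larger `min_face` shortens the pyramid and speeds up detection.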

Face Recognition Model
Face Recognition is a method for identifying or verifying an individual's identity using their face. In this study, the face recognition models used are FaceNet and ArcFace.

FaceNet
FaceNet uses a deep convolutional network to optimize the embedding directly, rather than an intermediate bottleneck layer as in other deep learning approaches. FaceNet consists of a batch input layer and a deep CNN followed by L2 normalization, which produces the face embedding, and uses the triplet loss during the training process, as described in Figure 3.
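The triplet loss mentioned above can be sketched numerically; the 2-D embedding values below are hypothetical stand-ins for real FaceNet embeddings.

```python
def squared_dist(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a-p||^2 - ||a-n||^2 + margin): push the positive
    closer to the anchor than the negative by at least the margin."""
    return max(0.0, squared_dist(anchor, positive)
                    - squared_dist(anchor, negative) + margin)

# Toy unit-length embeddings (hypothetical values).
a = [1.0, 0.0]     # anchor
p = [0.8, 0.6]     # same identity, close to the anchor
n = [0.0, 1.0]     # different identity, far from the anchor
satisfied = triplet_loss(a, p, n)   # constraint met -> zero loss
violated  = triplet_loss(a, n, p)   # swapped roles -> positive loss
```

During training, triplets with positive loss produce gradients that pull same-identity embeddings together and push different identities apart.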

ArcFace
ArcFace is a face recognition model that takes two face images as input and outputs the distance between them to indicate how likely they are to be the same person.
ArcFace uses a similarity learning mechanism that allows distance metric learning to be solved as a classification task by introducing an Angular Margin Loss to replace the Softmax loss. The distance between faces is calculated using the cosine distance, a method also used by search engines, which can be computed as the inner product of two normalized vectors. If the two vectors are the same, θ is 0 and cos θ = 1; if they are orthogonal, θ is π/2 and cos θ = 0. It can therefore be used as a similarity measure.
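The cosine measure described above can be computed directly from the inner product of the normalized vectors, as in this minimal sketch:

```python
import math

def cosine_similarity(u, v):
    """Inner product of the two vectors divided by their norms,
    i.e. cos(theta) between them."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

same = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # parallel vectors
orth = cosine_similarity([1.0, 0.0], [0.0, 5.0])   # orthogonal vectors
```

Parallel vectors give cos θ = 1 and orthogonal vectors give cos θ = 0, matching the two cases in the text.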

The ArcFace loss function in Eq. (1) essentially takes the dot product of the weight w and the feature x, where θ is the angle between w and x, and then adds a penalty m to it. w is normalized using the L2 norm, and x is normalized with the L2 norm and scaled by a factor s. This makes the predictions depend only on the angle θ, i.e., the cosine distance between the weights and the features.
In a typical classification task, after the features are computed, the fully connected (FC) layer takes the inner product of the features and weights and applies Softmax to the output. In ArcFace, cos θ is calculated by normalizing the features and the FC layer weights and taking their inner product; the loss is then calculated by applying Softmax to cos θ. At this point, arccos is applied to the cos θ values after taking the inner product, and an angular margin of +m is added only for the correct labels. In this way, the weights of the FC layer are prevented from becoming overly dependent on the input dataset. During ArcFace inference, the features of the two faces are normalized and their inner product is computed to determine whether both pictures show the same person.
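The margin step can be illustrated numerically; the margin m = 0.5 and scale s = 64 below are values commonly used with ArcFace, assumed here for illustration.

```python
import math

def arcface_logit(cos_theta, margin=0.5, scale=64.0, is_target=True):
    """Recover the angle with arccos, add the margin m only for the
    target class, take the cosine again, and scale by s before Softmax."""
    theta = math.acos(cos_theta)
    if is_target:
        theta += margin
    return scale * math.cos(theta)

plain    = 64.0 * 0.8                  # ordinary scaled softmax logit
penalized = arcface_logit(0.8)         # same cos(theta), margin applied
```

Because the target logit is penalized, the network must produce a larger angular gap between classes to minimize the loss, which is what makes the embeddings more discriminative.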

PROPOSED METHODOLOGY
This study focuses on creating a smart attendance system that applies combinations of face detection models, namely MTCNN by Zhang et al. [10] and RetinaFace by Deng et al. [3], and face recognition models, namely FaceNet by William et al. [6] and ArcFace by Deng et al. [14]. The proposed methodology is illustrated in Figure 4 below.

Dataset
The dataset used in this study consists of three datasets: WiderFace for the face detection models, and the combination of Essex Faces94 and Faces95 to determine the best-performing combination method for the face recognition models.

WiderFace
The WiderFace dataset is used to evaluate the face detection models. Testing is performed on its validation set, which is divided into easy, medium, and hard subsets on which the average precision (AP) is reported.

Essex Faces94
In this study, the Essex Faces94 dataset was used to evaluate the face recognition model, with a total of 153 identities, an image size of 180 x 200, and a total of 3,080 images. Subjects sat at approximately the same distance from the camera and were asked to speak while a sequence of twenty images was taken; the speech was used to introduce moderate and natural facial expression variations. An example from the dataset is shown in Figure 6 below.

Figure 6. Sample Essex Faces94

Essex Faces95
The Essex Faces95 dataset is used in this study to evaluate the performance of the face recognition model, with a total of 72 individual identities, an image size of 180 x 200, and a total of 1,440 images. A sequence of 20 images per individual was taken using a fixed camera. During the sequence, the subjects take one step forward towards the camera. This movement is used to introduce significant head (scale) variations between images of the same individual, as shown in the sample dataset in Figure 7 below.

EXPERIMENTAL DESIGN
This section explains the experiments that will be carried out: the face detection evaluation and the evaluation of the combination of the two models, measuring performance in terms of recognition rate and speed. In the face detection experiment, the WiderFace dataset was divided as follows: 40% for training, 10% for validation, and 50% for testing, with the training data run on the MTCNN and RetinaFace face detection models. The validation data is used for testing the models, because WiderFace does not provide ground-truth evaluation for the test data. The first step of the experiments is to take the image annotations from the training set and train the face detection model; testing is then performed on the validation set by obtaining the average precision (AP) on the three validation subsets, easy, medium, and hard, respectively.
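A split along these proportions can be sketched as follows; the file names and fixed seed are hypothetical, and WiderFace in practice ships with a predefined split.

```python
import random

def split_dataset(paths, train=0.4, val=0.1, seed=42):
    """Shuffle and slice an annotation list into the 40/10/50
    train/validation/test proportions used here."""
    paths = list(paths)
    random.Random(seed).shuffle(paths)   # fixed seed for reproducibility
    n = len(paths)
    n_train = int(n * train)
    n_val = int(n * val)
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

imgs = [f"img_{i:04d}.jpg" for i in range(100)]   # hypothetical file names
tr, va, te = split_dataset(imgs)
```

Shuffling before slicing avoids ordering bias, and the fixed seed keeps the split identical across runs of the experiment.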

Evaluation of Combination Models
In the combination models experiment, the EssexFaces94 and EssexFaces95 datasets are combined, for a total of 4,519 images with 224 identities, split into 60% training, 20% validation, and 20% testing. The first experiment extracts facial images with the face detection models, looping over the dataset subfolders to obtain the images and identities.
Next, embedding is performed to obtain the facial features that will be input into the FaceNet or ArcFace face recognition models and stored in an array. After that, the embedding results are normalized with the L2 norm. The last step is to fit a model and obtain training, validation, and test results, using a multilayer perceptron (MLP) classifier to get the rank-1 recognition rate.
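The L2 normalization applied to the embeddings can be sketched as follows; the 2-D embedding is a hypothetical stand-in for a real FaceNet or ArcFace feature vector.

```python
import math

def l2_normalize(vec):
    """Scale the embedding to unit length before classification."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

embedding = [3.0, 4.0]           # hypothetical 2-D embedding
unit = l2_normalize(embedding)   # now has Euclidean norm 1
```

After this step, inner products between embeddings equal their cosine similarities, which is what the recognition stage relies on.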

Performance Measures
This section explains the evaluation metrics used to measure the performance of the face detection and face recognition models, and the speed of the combination of the two models.

Face Detection Performance
The performance measurement used to evaluate the face detection models is the average precision (AP), a method of condensing the precision-recall curve into a single value representing the mean of all precisions. Eq. (2) below is used to calculate the AP: the difference between the current recall and the previous recall is multiplied by the current precision, looping over all precision/recall pairs. In other words, the AP is the weighted sum of the precisions at each threshold n, with the weight corresponding to the increase in recall.

AP = Σ_n (R_n − R_{n−1}) · P_n (2)
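The AP computation described above can be sketched directly; the precision-recall points below are hypothetical.

```python
def average_precision(precisions, recalls):
    """Weighted sum of precisions, weighted by the increase in recall:
    AP = sum_n (R_n - R_{n-1}) * P_n."""
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# Hypothetical points along a precision-recall curve.
precisions = [1.0, 1.0, 0.67, 0.75]
recalls    = [0.25, 0.5, 0.5, 0.75]
ap = average_precision(precisions, recalls)
```

Note that points where the recall does not increase contribute nothing, which is why AP rewards detectors that keep precision high as recall grows.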

Face Recognition Performance
The performance measurement used to evaluate face recognition is the rank-1 recognition rate. It relies on a gallery list of images and a list of probe images with the same identities: for each probe image, the similarities to all images in the gallery are calculated, and the probe is counted as correctly identified if the most similar gallery image has the same identity as the probe. For identification, the query image is compared to the set of target images in the gallery, which are then sorted by similarity from the most to the least similar target image, with the formula in Eq. (3) below.

Rank-1 recognition rate = (number of probes whose top-ranked gallery match has the correct identity) / (total number of probes) (3)
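A minimal sketch of the rank-1 computation, with hypothetical identities and 2-D embeddings in place of real gallery and probe features:

```python
def rank1_rate(probes, gallery, similarity):
    """For each probe, pick the most similar gallery entry and
    count it as correct when the identities match."""
    correct = 0
    for probe_id, probe_vec in probes:
        best_id = max(gallery, key=lambda g: similarity(probe_vec, g[1]))[0]
        if best_id == probe_id:
            correct += 1
    return correct / len(probes)

def dot(u, v):
    """Inner product; equals cosine similarity on normalized vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical (identity, embedding) pairs.
gallery = [("alice", [1.0, 0.0]), ("bob", [0.0, 1.0])]
probes  = [("alice", [0.9, 0.1]), ("bob", [0.2, 0.8]), ("alice", [0.3, 0.9])]
rate = rank1_rate(probes, gallery, dot)
```

In the toy data, the third probe is mislabeled as "bob" because its embedding drifted toward the wrong gallery vector, giving a rate of 2/3.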

Processing Time Performance
The processing time performance is measured as the time an image takes to pass through detection, embedding, and recognition, performed for each combination of face detection and face recognition models, to determine the best combined model based on elapsed processing time, as shown in Eq. (4) below.

Processing time per image = total elapsed time / number of images processed (4)
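The timing procedure can be sketched as follows; the `process` callable is a hypothetical stand-in for the real detection, embedding, and recognition pipeline.

```python
import time

def time_per_image(process, images, repeats=10):
    """Run the full pipeline over the batch `repeats` times and
    return the mean elapsed time per image in milliseconds."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()   # high-resolution monotonic clock
        for img in images:
            process(img)
        total += time.perf_counter() - start
    return total / (repeats * len(images)) * 1000.0

# Trivial stand-in workload over 5 dummy "images".
ms = time_per_image(lambda img: sum(img), [[1, 2, 3]] * 5, repeats=2)
```

Repeating the run and averaging, as in the experiment with 100 images measured 10 times, smooths out scheduling jitter in the per-image measurement.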

EXPERIMENTAL RESULT
In this study, the experimental results are divided into three parts: the evaluation results of face detection, the evaluation results of the combination models, and the evaluation results of processing speed.

Evaluation Results of Face Detection
The experimental testing results on the WiderFace dataset show that RetinaFace outperforms MTCNN, with the AP values given in Table 2.

Evaluation Results of Combination Models
The experimental results were obtained by combining the RetinaFace and MTCNN face detection models with FaceNet and ArcFace, with predictions made by the Multi-Layer Perceptron (MLP) classifier. The results in Table 4 show that the testing results for all combination models are quite high; RetinaFace+FaceNet holds the best recognition rate, with a value of 99.114%.

Evaluation Results of Processing Speed
In this section, an experiment was conducted to measure the performance of the combination models based on speed, with a total of 100 images measured 10 times and averaged to obtain the processing speed per image. Based on the results of the experiments, the RetinaFace+FaceNet combination model has a higher rank-1 recognition rate than the other three models, while in terms of speed RetinaFace+ArcFace processes images slightly faster, by 13.92 ms, than RetinaFace+FaceNet. Therefore, the recommended combination model, based on speed and higher accuracy, is RetinaFace+FaceNet.

CONCLUSIONS
In this study, experiments were conducted on the RetinaFace and MTCNN face detection models, measured by AP, and each was combined with the FaceNet and ArcFace models. It can be concluded that, among the combinations of face detection and face recognition models, RetinaFace+FaceNet has advantages in terms of recognition rate and speed for implementation in a smart attendance system.
In this study, the evaluation was run on sufficiently powerful hardware, with specifications better than system-on-chip devices such as the Arduino and Raspberry Pi. As future work, it is necessary to evaluate the performance of this combination model on hardware with lower computing capacity, such as an Arduino or Raspberry Pi. This research has only evaluated processing speed; the computing resource requirements also need to be evaluated.