Face mask recognition system using MobileNetV2 with optimization function

ABSTRACT The world has experienced a health crisis with the outbreak of the COVID-19 virus. The mask has been identified as the most effective way to prevent the spread of the virus. This has led to the need for a face mask recognition system that not only detects the presence of a mask but also reports the accuracy with which a person is wearing it. In addition, the face mask should be recognized from all angles. This project aims to create a new and improved real-time face mask recognition tool using image processing and computer vision approaches. A dataset consisting of images with and without a mask was used. For the purposes of this project, a pre-trained MobileNetV2 convolutional neural network was used, and the performance of the model was evaluated. The model presented in this project can detect the face mask with an accuracy of 99.21%. The face mask recognition tool can also effectively detect a face mask viewed from the side, which makes it more useful. An optimization function built around custom training loops is also used.


Introduction
Coronavirus is the latest evolving virus that has taken over the world in just a few months. It causes a type of pneumonia that first appeared in early December 2019 near the city of Wuhan, Hubei Province, China, and it was declared a global pandemic by the World Health Organization (WHO) on March 11, 2020 (Fong, Dey, and Chaki 2021). According to WHO statistics, as of February 24, 2021, more than 111 million people were infected with the virus and about 2.46 million deaths were registered (Canete 2021). The most common symptoms of coronavirus are fever, dry cough and fatigue, among many others. It is mainly spread by close direct contact between people and by the respiratory droplets of an infected person, produced by coughing, sneezing or exhaling. Because these droplets are too dense to travel long distances through the air and fall quickly to floors or surfaces, the virus is also spread when people touch contaminated surfaces and then touch their face (e.g., eyes, nose, and mouth) (Vaishya et al. 2020). WHO declared a state of emergency worldwide and developed emergency precautions to limit the spread of the virus, i.e., washing hands regularly with soap and water for 20 seconds, using disinfectants, keeping distance, regularly disinfecting surfaces, using disposable tissues when coughing or sneezing, and especially wearing masks in public places (Bhagat et al. 2020; Xiao and Estee Torok 2020). Just as it helped control the spread of the SARS virus in the community during the 2003 SARS outbreak (Peng et al. 2003), community-wide masking has also been very effective in controlling the spread of the coronavirus (Cheng et al. 2020; Matuschek et al. 2020). Effective control of respiratory droplets has made wearing masks an essential element of the response to COVID-19. For example, the effectiveness of N95 and surgical masks in blocking virus transmission (by blocking respiratory droplets) is 91% and 68%, respectively (Tom et al. 2020).
Wearing face masks can effectively prevent the entry of viruses and airborne particles, so that pollutants cannot enter the respiratory system of another person (Qin and Li 2020). Global scientific cooperation has greatly improved due to the outbreak of the coronavirus, and researchers are looking for new tools and methods to fight this virus. One such technology is artificial intelligence (AI) (Bhagat et al. 2020). It can quickly track the spread of the virus, identify high-risk patients, and help control the epidemic in real time (Gayathri, Kumar, and Kumar Gunjan 2022). It is also useful for early prediction of infection by analyzing previous patient data, which in turn can reduce the risk of dying from the virus (Vaishya et al. 2020).
As discussed, wearing masks is the most effective measure to protect against the transmission of coronavirus, but ensuring that masks are worn in public places is a challenge for the government and relevant authorities. Fortunately, AI applications (using machine learning (ML) or deep learning (DL) algorithms) can help enforce the wearing of masks in public places by detecting masks in real time through a camera network (a network of surveillance cameras or any other). It is an easy way to monitor people locally, maintain social distancing and make sure everyone wears masks (Alrammahi and Radif 2019).
The MobileNet architecture is special because it needs very little computation power to run. This makes it a good fit for mobile devices, embedded systems, and computers without GPUs. MobileNetV1 is the first version of the MobileNet models. It has more layers and convolution parameters than MobileNetV2. MobileNetV2 is the second version of the MobileNet model (Mohd et al. 2022). It has significantly fewer parameters, which makes the network lighter. Because of its light weight, it is ideal for embedded systems and mobile devices. As an updated version of MobileNetV1 (Ahmed et al. 2022), it is more efficient and effective. Because of the importance of face mask detection in public places, in this study we demonstrate the use of one popular DL-based architecture, i.e., MobileNetV2, for effective face mask detection (Figure 1).

Related Work
Over the past few years, object recognition algorithms employing deep learning models have become theoretically more competent than shallow models in tackling complicated tasks (Yadav 2020). One example is building a real-time system/model that is capable of detecting whether people are wearing a mask in public areas.
Militante and Dionisio (Militante and Dionisio 2020) developed an automatic system to detect whether or not a person is wearing a mask and to generate an alarm if the person is not wearing one. To develop the system, the authors used the VGG-16 CNN architecture. The system achieved an overall detection accuracy of 96%. In the future, the authors plan to create a system that will not only detect whether a person is wearing a mask, but will also measure the physical distance between people and trigger an alert if the physical distance is not correctly observed.
Sanjaya and Rakhmawan (Sanjaya and Adi Rakhmawan 2020) introduced a model using a DL algorithm to determine whether a person wears a mask in public places. To do this, they used MobileNetV2, a pre-trained image classification model. In this experiment, the authors used two datasets, namely (1) RMFD obtained from Kaggle and (2) a dataset collected from 25 cities in Indonesia using CCTV cameras, traffic lights and shop cameras. Both datasets were used to train their model. The trained model achieved detection accuracy of 96% and 85% on the test sets of these two datasets, respectively.
P. Gupta, N. Saxena, M. Sharma, and co-authors proposed an approach in which only extracted facial features, rather than raw pixel values, are provided as input. The facial features are extracted using a Haar cascade, and these characteristic facial features are retained in place of the pixel values. As the number of redundant input features decreases, so does the complexity required of the neural network recognizer. Using a DNN instead of a convolutional network also makes the process smoother and faster. The proposed method does not compromise accuracy; the average accuracy obtained was 97.05%.
Nizam et al. (Din et al. 2020) developed a GAN-based system to remove detected face masks and synthesize the missing facial components with fine details and region reconstruction. The proposed GAN used two discriminators: the first captured the overall face structure, while the second focused on the area covered by the face mask. They used two synthetic datasets in the model training process.
Loey et al. (Loey et al. 2021) introduced a face mask detection model that combines deep transfer learning with classical ML classifiers (classical ML classifiers refer to ML algorithms that work on features manually extracted and designed from input data). The Residual Neural Network (ResNet-50) algorithm was used to extract the features.
The extracted features were then used to train three classical ML algorithms, namely the Support Vector Machine (SVM), the Decision Tree (DT), and Ensemble Learning (EL). Three different face mask datasets were used in the study, namely (i) the Real-World Masked Face Dataset (RMFD), (ii) the Simulated Masked Face Dataset (SMFD), and (iii) Labeled Faces in the Wild (LFW). Finally, the trained classifiers were tested for face mask detection. During the test, the SVM classifier achieved the highest detection accuracy compared to the DT and EL classifiers. It achieved 99.64% and 99.49% detection accuracy on RMFD and SMFD, respectively, and 100% detection accuracy on LFW.
Bhuiyan, Akter Khushbu, and Sanzidul Islam (2020) proposed a system whose purpose is to identify masked people, with faces detected using the advanced YOLOv3 architecture. YOLO (You Only Look Once) is based on a Convolutional Neural Network (CNN) learning algorithm and, through its hidden layers, can detect and localize objects in any type of image. The implementation begins by feeding 30 unique images from the dataset into the model and then combining the results to derive action-level predictions. It provides excellent imaging results as well as good detection results. The model was also applied to live video to measure its frame rate and its detection capability with two classes, masked and unmasked. On video, their model produced impressive outputs, averaging 17 fps. The authors report that this system is more efficient and faster than other methods that use their own dataset.
Venkateswarlu, Kakarla, and Prakash (2020) recommended using a pretrained MobileNet with a global pooling block, which can be used for facial recognition and detection (Fadhil and Abbas Marhoon 2021). The pretrained MobileNet produces a multidimensional feature map from a color image. In the proposed model, overfitting is not a problem, as it uses a global pooling block.

Dataset
The dataset used in this project is the face mask dataset (with/without mask) from Kaggle.com (Figure 2). The dataset consists of 3,832 images divided into two classes:
• Images with a mask: 1914
• Images without a mask: 1918
The images vary in size and resolution.

Methodology of the Proposed Study
In order to predict whether a person has put on a mask, the model needs to learn from a well-curated dataset, as discussed later in this section. The model uses convolutional neural network (CNN) layers as its backbone architecture. Along with this, libraries such as OpenCV, Keras, TensorFlow and scikit-learn are also used. The proposed model is designed in three phases: data pre-processing, CNN model training, and applying the face mask detector, as described in Figure 3.

Data Pre-Processing
The accuracy of the model depends on the quality of the dataset. The original data is first cleaned to remove defective images found in the dataset. The images are resized to a fixed size of 224 × 224, which gives optimal results. Images are then labeled as with or without a mask. Each image is then converted to a NumPy array for fast computation. The preprocess_input function provided by MobileNetV2 is also used. Then, data augmentation is used to increase the size of the training set and improve its quality. The ImageDataGenerator function is used to create multiple versions of the same image with appropriate values for rotation, zoom, and horizontal or vertical flipping. The augmented training samples are added to avoid overfitting; this increases the generalization and robustness of the trained model. The entire dataset is then divided into training data and test data in an 8:2 ratio by randomly selecting images from the dataset. The stratify parameter is used to keep the class proportions of the original dataset in both the training and test sets.
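As a rough illustration of the stratified 8:2 split described above, the following sketch uses only the Python standard library. The function name `stratified_split`, the seed, and the placeholder file names are assumptions made for illustration and do not come from the project code; in practice the same effect is usually obtained with scikit-learn's `train_test_split` and its `stratify` argument.

```python
import random
from collections import defaultdict


def stratified_split(items, labels, test_ratio=0.2, seed=42):
    """Split items into train/test so each class keeps roughly the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)

    train, test = [], []
    for label, group in sorted(by_class.items()):
        rng.shuffle(group)  # random selection within each class
        n_test = round(len(group) * test_ratio)
        test += [(item, label) for item in group[:n_test]]
        train += [(item, label) for item in group[n_test:]]
    return train, test


# Hypothetical file names standing in for the 1914 masked / 1918 unmasked images.
items = [f"mask_{i}.jpg" for i in range(1914)] + [f"nomask_{i}.jpg" for i in range(1918)]
labels = ["with_mask"] * 1914 + ["without_mask"] * 1918
train, test = stratified_split(items, labels)
print(len(train), len(test))
```

Splitting each class separately is what keeps the near 1:1 class balance of the full dataset in both subsets.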

MobileNetv2
We used the MobileNetV2 package from TensorFlow to preprocess our images so that they work with the MobileNetV2 architecture. MobileNetV2 is a pre-trained model for image classification. Pre-trained models are deep neural networks that have been trained on a large image dataset. By using pre-trained models, developers need not build or train the neural network from scratch, thereby saving development time (Figure 4).
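To make the preprocessing step more concrete, the sketch below mimics what the Keras `preprocess_input` function for MobileNetV2 is understood to do, namely rescaling 8-bit pixel values from [0, 255] to [-1, 1]. The helper name `scale_pixels` and the tiny example image are hypothetical; this is a plain-Python stand-in, not the library routine itself.

```python
def scale_pixels(image):
    """Map pixel values in [0, 255] to floats in [-1, 1] (x / 127.5 - 1)."""
    return [[[channel / 127.5 - 1.0 for channel in pixel] for pixel in row]
            for row in image]


# A hypothetical 1 x 2 RGB image, given as rows of pixels of channel values.
tiny_image = [[[0, 127.5, 255], [255, 0, 64]]]
scaled = scale_pixels(tiny_image)
print(scaled[0][0])
```

Keeping inputs in the same range the network saw during pre-training is the reason this step is applied before the images reach the model.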

CNN Model Training
Extensive experiments were conducted on the image dataset to evaluate the performance and effectiveness of the suggested models. Figure 6 shows the MobileNetV2 model's training and validation curves on the dataset. Figure 5 shows that over 10 epochs, the training and validation accuracy achieved by MobileNetV2 are both 99%. Hence, the MobileNetV2 model achieved equal training and validation accuracy on the dataset. We made use of the existing MobileNetV2 architecture from Keras. We remove the ADD layer and replace it with our own softmax layer. In our model, gradient descent is used, which is an optimization algorithm used when training a machine learning model. It assumes a convex cost function and iteratively adjusts the parameters to drive that function toward a local minimum. We start by defining the initial parameter values and from there use the gradient to adjust the values iteratively so that the specific cost function is minimized.
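The iterative procedure just described can be sketched on a toy problem. The cost function, starting point, learning rate and stopping tolerance below are assumptions chosen for illustration only; they are not the settings used to train the network.

```python
def gradient_descent(grad, theta, learning_rate=0.1, tol=1e-8, max_iters=10_000):
    """Repeat theta_j := theta_j - alpha * dJ/dtheta_j until the update is tiny."""
    for _ in range(max_iters):
        g = grad(theta)
        new_theta = [t - learning_rate * gj for t, gj in zip(theta, g)]
        if max(abs(a - b) for a, b in zip(new_theta, theta)) < tol:
            return new_theta
        theta = new_theta
    return theta


# Toy convex cost J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2, minimized at (3, -1).
def grad_J(theta):
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]


theta_star = gradient_descent(grad_J, [0.0, 0.0])
print([round(t, 4) for t in theta_star])
```

Because the toy cost is convex, the local minimum that the iteration reaches is also the global one.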
Repeat until convergence:
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$
where $\alpha$ is the learning rate and $J(\theta)$ is the cost function.
To improve the training of the model, data augmentation was used. Data augmentation refers to techniques used to increase the amount of data by adding slightly modified copies of already existing data or newly created synthetic data derived from existing data. It acts as a regularizer and helps reduce overfitting when training a machine learning model (Shorten and Khoshgoftaar 2019). It is closely related to oversampling in data analysis. Our model is trained on the output of ImageDataGenerator() (data augmentation) applied to the image dataset that was preprocessed previously.
Average pooling is also used. It involves calculating the average for each patch of the feature map, which means that each 7 × 7 square of the feature map is downsampled to the average value in the square.
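The average pooling step mentioned above, in which a 7 × 7 feature map is reduced to its mean value, can be illustrated with a short standard-library sketch. The function name `global_average_pool` and the two example feature maps are hypothetical.

```python
def global_average_pool(feature_maps):
    """Reduce each H x W feature map to the mean of its values."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


# Two hypothetical 7 x 7 feature maps: one constant, one counting 0..48.
constant_map = [[2.0] * 7 for _ in range(7)]
ramp_map = [[float(r * 7 + c) for c in range(7)] for r in range(7)]
pooled = global_average_pool([constant_map, ramp_map])
print(pooled)
```

Each feature map contributes a single number, so a stack of 7 × 7 maps becomes one short vector that a small softmax head can consume.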
The optimization procedure used in this model is a custom training loop, which makes it easy to adjust how the gradients and the loss are calculated, to accelerate training with various tricks (e.g., teacher forcing), and to tune hyperparameters better (e.g., using a periodic learning rate). For every epoch, the loop iterates over the entire dataset in batches. For every batch, it performs a forward pass and records every operation on a tape, calculates the loss with respect to the actual labels, uses the recorded operations to perform backpropagation and calculate the gradients, and uses the optimizer to adjust the layers' weights by applying the gradients. Once the pass over the entire training set finishes, the training loop performs a forward pass over the entire validation set in batches. For every batch, it makes sure the model is in inference mode, performs a forward pass, and calculates the validation loss. It accumulates these losses to determine the validation loss of the current epoch.
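The structure of such a loop can be sketched without any deep learning library. In the minimal example below, a one-feature logistic model and hand-derived gradients stand in for the network and for the recorded tape (a role played by `tf.GradientTape` in TensorFlow); the synthetic data, learning rate, batch size and all function names are assumptions made purely for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, y):
    """Binary cross-entropy for one prediction, clipped for numerical safety."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def batches(data, size):
    for i in range(0, len(data), size):
        yield data[i:i + size]


def train(train_set, val_set, epochs=20, batch_size=8, lr=0.5):
    w, b = 0.0, 0.0  # parameters of a one-feature logistic model
    history = []
    for _ in range(epochs):
        # Training phase: forward pass, gradients, parameter update per batch.
        for batch in batches(train_set, batch_size):
            preds = [sigmoid(w * x + b) for x, _ in batch]
            grad_w = sum((p - y) * x for p, (x, y) in zip(preds, batch)) / len(batch)
            grad_b = sum(p - y for p, (_, y) in zip(preds, batch)) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
        # Validation phase: forward passes only, no updates; accumulate the loss.
        total = 0.0
        for batch in batches(val_set, batch_size):
            total += sum(bce(sigmoid(w * x + b), y) for x, y in batch)
        history.append(total / len(val_set))
    return w, b, history


rng = random.Random(0)
# Synthetic 1-D data: label 1 when x > 0, a stand-in for masked vs. unmasked.
points = [rng.uniform(-2, 2) for _ in range(200)]
data = [(x, 1.0 if x > 0 else 0.0) for x in points]
w, b, history = train(data[:160], data[160:])
print(f"first val loss {history[0]:.3f}, last val loss {history[-1]:.3f}")
```

Writing the loop by hand exposes the per-batch update and the per-epoch validation pass separately, which is what makes it straightforward to change how the loss or the gradients are computed.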
In the last step, the CNN model is integrated into a hosted web-based application so that it can be shared easily with other users, who can upload an image or a live video feed to the model to recognize face masks and obtain the predicted result. Streamlit, an open-source Python library, is used to design and construct a simple web app that allows users to submit a picture with a single click of a button and receive the outcome in a matter of seconds.

Experimental Results and Analysis
The final results, as shown in Table 1, were achieved after multiple experiments using various hyperparameter values such as learning rate, number of epochs, and batch size.

Model Testing
The model was tested on a variety of images, some of which are shown in Fig. 10. A green rectangular box indicates a person correctly wearing a mask, with the accuracy score displayed at the top, whereas a red rectangular box indicates that the individual is not wearing a mask. In summary, the model learns from the training dataset in order to label and then predict (Unal et al. 2022) (Figure 8).
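The mapping from a prediction to the colored box and score described above can be sketched as follows. The function name `describe_prediction`, the label strings and the example probabilities are assumptions; the colors are written in BGR order on the assumption that the box and text are drawn with OpenCV (for example with `cv2.rectangle` and `cv2.putText`).

```python
GREEN = (0, 255, 0)  # BGR order, as used by OpenCV drawing functions
RED = (0, 0, 255)


def describe_prediction(p_mask, p_no_mask):
    """Return the overlay text and box color for one detected face."""
    if p_mask > p_no_mask:
        label, color, score = "Mask", GREEN, p_mask
    else:
        label, color, score = "No Mask", RED, p_no_mask
    return f"{label}: {score * 100:.2f}%", color


# Hypothetical softmax outputs for one face.
text, color = describe_prediction(0.9921, 0.0079)
print(text, color)
```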

Comparison with Other Models
Apart from the custom CNN architecture implemented in this research, there exist other architectures, such as MobileNetV2 (Kaur et al. 2021). The proposed model was compared to several models by training them on the same dataset. MobileNetV2 is a CNN model of 53 layers and 19 blocks (Kaur et al. 2021). Comparing several models with the proposed model on accuracy, size, and training speed, we find that MobileNetV2 performs marginally worse and is substantially slower to train, but has a smaller memory footprint. In contrast, our proposed model exhibits slightly stronger performance than the main paper, trains faster, and is considerably smaller than the model of Kaur et al. (2021).

Conclusion
In conclusion, this project presents a machine learning model for mask detection. After a process of training, validation and testing, the model is able to accurately measure the proportion of people wearing masks in certain cities. COVID-19 is one of the fastest spreading viruses and poses a threat to human health, global trade and the economy. Its mutation and rapid spread have made it difficult to control the situation. Taking preventive measures will reduce the spread of the virus, and one of the most important measures is wearing masks in public places. Therefore, in this project, a deep learning-based approach was applied to automatically detect face masks. The learning models, i.e., a Convolutional Neural Network (CNN) and the MobileNetV2 model, were evaluated on the dataset, which contains 3,832 images of individuals with and without masks. The comparative results show that MobileNetV2 achieved 99.21% classification accuracy. The program can also be linked to entrance gates, allowing only those who are wearing masks to enter. It can also be used in shopping malls and universities.

Disclosure statement
No potential conflict of interest was reported by the author(s).