A novel DCNN-ELM hybrid framework for face mask detection

The Coronavirus disease 2019 (COVID-19) has caused massive loss of human lives and capital around the world. The latest variant, Omicron, has proved to be the most infectious of all its predecessors: Alpha, Beta and Delta. Various measures have been identified, tested and implemented to minimize its attack on humans. Face masks are one measure shown to be very effective in containing the infection; however, their use requires continuous monitoring for enforcement. In the present manuscript, a detailed research investigation using different ablation studies is carried out to develop a framework for face mask recognition in which pre-trained deep convolution neural network (DCNN) models are used in conjunction with a fast single hidden layer feed-forward neural network, commonly known as the Extreme Learning Machine (ELM), as the classification technique. The ELM is well known for its real-time data-processing capabilities and has been successfully applied to both regression and classification problems in the image processing and biomedical domains. This paper is the first to propose the use of the ELM as a classifier for face mask detection. As a precursor, six pre-trained DCNNs, namely Xception, Vgg16, Vgg19, ResNet50, ResNet101 and ResNet152, are tested for feature extraction. The best testing accuracy is obtained with the ResNet152 transfer learning model used with the ELM as the classifier. Performance evaluation through different ablation studies on testing accuracy shows that the ResNet152-ELM hybrid architecture is not only the best among the selected transfer learning models but also outperforms several other classifiers used for the face mask detection operation. Through this investigation, the novelty of the ResNet152 + ELM face mask detection framework in the real-time domain is established.


Introduction
For the last two years, the whole world has been suffering from the Coronavirus disease, also known as COVID-19. As per the World Health Organization (WHO) (WHO, 2022), as of 2 February 2022, a total of 380,321,615 confirmed cases of COVID-19, including 5,680,741 deaths, had been reported worldwide. Fig. 1 depicts the plots of COVID-19 cases and deaths reported around the world as of 30 January 2022. The virus initially presents with common cold, fever or nausea in human beings and can lead to serious medical conditions such as severe acute respiratory syndrome (SARS) and COPD. It was first identified in Wuhan, China in December 2019 and has since spread across the entire globe.
A few months back, a new variant of the COVID-19 virus known as Omicron was found to be the most infectious among all its counterparts: Alpha, Beta and Delta. The European and North American countries are presently in a vulnerable state due to this infection. As it creates serious health problems, it is imperative to protect everyone from this pandemic.

Importance of face mask to prevent COVID-19
According to WHO guidelines (WHO, 2020), all persons must follow COVID-appropriate behavior to protect not only themselves but also others, as it is a highly contagious and communicable disease. The standard protocol requires the use of face masks, maintaining physical distance, washing hands regularly, etc. Of these measures, the use of masks is one of the most essential, as it is very difficult to maintain physical distance in crowded places. Many investigators have considered this important aspect in their research works (Feng et al., 2020; Howard et al., 2020; O'Dowd et al., 2020; Schünemann et al., 2020; Spitzer, 2020). Feng et al. (2020) reported that since the spread of COVID-19, the use of masks has spread in China and other Asian countries such as South Korea and Japan, which enforced mandatory face mask guidelines in public places.
According to Howard et al. (2020), the use of masks in public places can reduce the spread of infection in areas with high population densities. Reducing contagion could thus significantly reduce both the loss of life and the economic impact. They therefore recommend the use of face masks as an effective way to control the spread of the virus.
According to O'Dowd et al. (2020), a few important strategies to deal with this severe pandemic of the 21st century are social distancing, washing hands and covering the mouth and nose with a cloth. According to them, around the world the mask has become an indispensable part of public measures to combat the current pandemic. Schünemann et al. (2020) describe in detail the advantages and disadvantages of wearing face coverings, concluding that masks can be useful both for the general population and for healthcare professionals. They reflected on the initial idea of wearing face masks, as well as delving into the political discussions associated with them. Spitzer (2020) shows that masks can control the spread of the coronavirus, especially among asymptomatic people. In addition, he reflects on the benefits, importance and uses of masks for school children.
In order not to be affected by COVID-19 in the above scenario, it is important to wear a mask at all times. In such situations, with conventional measures, it is very difficult to know whether or not a person is wearing a mask in a public gathering or at the workplace. As a result, face mask identification and detection has become a critical computer vision problem to aid global civilization. It is important to note that the research related to development of an efficient mask detector with high precision is still scant. Therefore, this has become an area of great interest to the research community. To achieve a high degree of accuracy, various machine learning or deep learning architectures are used, tested and compared.
Besides its necessity for COVID-19 protection, wearing a face mask is also becoming a healthy habit due to increased vehicular and industrial pollution, especially in metropolitan cities. Overall, the use of face masks is likely to acquire an important position in our daily lives, and effective strategies for enforcing it will be needed in the future.

Review of literature on face mask detection
A brief survey of the technical models for face mask detection is presented here. Ge et al. (2017) utilized a local linear embedding algorithm with CNNs for facial mask identification and obtained an accuracy of 76.1%. Bu et al. (2017) proposed a cascade CNN model to identify masked faces in the MASKED FACE dataset. Inamdar & Mehendale (2020) developed a transfer learning method known as the "Facemasknet" framework to identify persons not wearing a mask in public places, claiming an accuracy of 98.6%. Loey et al. (2021a) proposed a deep learning model utilizing YOLO v2 with ResNet50 for face mask detection, claiming that their model achieved a precision of 81%. In another work, Loey et al. (2021b) proposed a hybrid deep learning model that combined the ResNet50 model with three different classifiers, namely SVM, decision tree and an ensemble classifier, and achieved a maximum precision of 99.64% with the ensemble classifier. Jiang et al. (2020) proposed a retinal facial mask model, also known as a "single-stage detector." This model is based on the transfer learning concept, using ResNet or MobileNet as the backbone of the model, a Feature Pyramid Network (FPN) as the neck and context attention modules as the head. They stated that their proposed model achieved a precision of 94.5%.
As already mentioned, in addition to protection against COVID-19, the use of masks is also becoming standard due to increasing levels of pollution in the environment. Therefore, it is necessary to distinguish people who wear a mask from those who do not, and there is a clear need for intelligent techniques to identify and recognize people with and without a mask. Feature extraction using local feature descriptors such as LBP, BSIF, HOG, SIFT, etc. encounters various problems, e.g. facial expression, pose variation, scene background and illuminance. Methods based on deep learning can overcome such problems and are able to extract complicated feature maps from the images of the dataset (Filippidou & Papakostas, 2020; Xu et al., 2015). Therefore, pre-trained DCNN models for feature extraction are found to outperform the traditional feature mapping techniques mentioned above, leading to better results in terms of classification accuracy and computation time.

Contribution of the proposed work
In this research paper, we propose a novel deep learning framework architecture based on the concept of transfer learning for intelligent face mask detection and recognition. To implement this framework and its underlying model, we use two different face mask image datasets, namely the Real-World Masked Face Recognition Dataset (RMFD) (Wang et al., 2020) and the Face Mask Detection Dataset (FMDD) (Larxel's Face Mask Detection Dataset, 2021). The RMFD contains 90,000 images of unmasked faces and 5000 images of masked faces, and the FMDD contains 853 actual images of various subjects. The proposed study is carried out in six phases. In the first phase, six different pre-trained deep convolution neural networks (DCNN), namely Xception, Vgg16, Vgg19, ResNet50, ResNet101 and ResNet152, are used for the extraction of feature maps from the face mask images. The extracted feature map is then used to complete the classification task using the ELM classifier with four different activation functions: Leaky_relu, Sigmoid, Relu and Tanh. Ismael & Şengür (2021) have used various deep transfer learning approaches for COVID-19 detection using X-ray image classification. They use AlexNet, VGG and ResNet architectures for feature extraction along with various other classifiers, including the SVM. We follow a similar analogy, using Xception, VGG and ResNet architectures for feature extraction along with the ELM as the classification tool. Note that it is a well-known fact that the ELM is a potential classifier capable of working under real-time constraints (Huang et al., 2006). To the best of our knowledge and information, the ELM classifier has never been used in combination with transfer learning models to implement the face mask recognition task. The proposed framework is described in detail in Section 3.1.
In the second phase, the six pre-trained DCNNs mentioned above are fine-tuned using the face mask image datasets and their respective classification accuracies are calculated and compiled. This is carried out to compare the results of the second phase with those of the first phase, in order to examine the effectiveness of the ELM as a classification tool used in connection with the DCNNs for the aforementioned image datasets. These results are described, discussed and compared in detail in Section 3.2. In the third phase, a deep analysis of the ResNet152 model is carried out through feature map visualization of seven different layers of the model. These results are presented and analyzed in Section 3.3. In the fourth phase, an ablation study of the proposed hybrid ResNet152+ELM model is performed. The results of five different ablation studies are presented and analyzed in Section 3.4. In the fifth phase, five different classifiers, namely Support Vector Machine (SVM), Decision Tree (DT), K-Nearest Neighbor (K-NN), Random Forest (RF) and Ensemble Classifier (EC), are used to classify the extracted feature maps obtained from the various pre-trained DCNNs. This is carried out to examine the comparative superiority of a particular classifier over the others; the ELM is again shown to outperform all other classifiers. These results are presented and analyzed in Section 3.5. In the sixth and final phase, we compare our results with other first-line research works on the same problem in Section 3.6. The novel idea of merging a pre-trained DCNN with the ELM as a real-time classifier for image processing applications in general, and the face mask detection task in particular, is the major contribution of the proposed paper. This is particularly relevant as face mask detection operations tend to use big datasets and are therefore required to finish their tasks under real-time constraints.
The ELM classifier has never been used before for real-time classification in the face mask detection operation. This classifier, as our results show, proves suitable for achieving the best classification accuracy under real-time constraints, which brings the desired novelty to the proposed work. The outcomes are discussed in detail in Section 3.

Datasets
In the present work, two notable datasets are used for training and testing of the proposed model. These datasets can be accessed freely over the internet. The first dataset is the Real-World Masked Face Recognition Dataset (RMFD), which was prepared by Wang et al. (2020). To prepare this set, the authors used a Python crawler tool to collect front-face images of public figures and their corresponding masked face images from internet resources. They carried out many manual tasks, aided by semi-automation tools, to filter out unwanted images. This dataset contains both masked and unmasked images. We have used 4653 images from this dataset in the present work. A few images from this dataset are shown in Fig. 2.
The second dataset is the Face Mask Detection Dataset (FMDD) (Larxel's Face Mask Detection Dataset, 2021), which contains 853 real-life images of different subjects. The annotation document provided by the creators of this dataset is additionally used to crop these images into single individual images for the classification task. This dataset also contains both masked and unmasked images. In our proposed work, we utilize 4072 such single individual images. Fig. 3 depicts a few selected images from this dataset.

Transfer learning approach of image classification
The design of the face mask detector framework architecture in the proposed research work is based on the transfer learning approach (Brownlee, 2017; Marcelino, 2018). In this design, six different pre-trained deep convolution neural networks, namely Xception, Vgg16, Vgg19, ResNet50, ResNet101 and ResNet152, are used for feature extraction. These pre-trained models were trained on more than 1,000,000 images belonging to 1000 classes and have already been used to successfully solve many other classification problems, such as self-driving cars, healthcare image classification, etc. (Filippidou & Papakostas, 2020; Xu et al., 2015).
A CNN is a deep learning architecture that comprises numerous layers stacked together and exploits local connections between values. When the dataset consists of images, the CNN is the deep learning model of choice, as it can perform both feature extraction and classification. There is no need to manually supply features to the classifier: the convolution layers extract the features directly from the images.
A typical CNN model comprises the following four layers:
1. The first layer is the convolutional layer. As the name suggests, it performs a convolution operation between a pixel window of the image and a kernel. The convolution operation (denoted as *) extracts features from the image; with input image J and kernel K (also known as a feature detector), the output is known as a feature map. The convolution operation is given by Eq. (1):

(J * K)(i, j) = Σ_m Σ_n J(i + m, j + n) K(m, n)    (1)

2. The second layer is the ReLu (Rectified Linear Unit) layer, which applies the ReLu activation function to the generated feature map. This function maps all negative values to 0 and leaves all other values unchanged, so the feature map values lie in the range (0, max).
3. The third layer is the pooling layer, which performs dimensionality reduction on the output of the previous layer. Max pooling is the most common pooling strategy; it retains only the maximum value in each input window.
4. The fourth layer is the fully connected layer, which finally performs the classification on the pooled feature map.
These layers can be stacked over one another to achieve a high degree of accuracy.
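The convolution, ReLU and max-pooling steps described above can be illustrated with a short NumPy sketch; the 3×3 image and 2×2 kernel below are illustrative values, not taken from the paper.

```python
import numpy as np

def conv2d(J, K):
    """Valid 2-D convolution of input image J with kernel K
    (cross-correlation form, as conventionally used in CNNs),
    producing a feature map as in Eq. (1)."""
    h, w = K.shape
    H, W = J.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # element-wise product of the pixel window with the kernel
            out[i, j] = np.sum(J[i:i + h, j:j + w] * K)
    return out

J = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
K = np.array([[1., 0.],
              [0., 1.]])
F = conv2d(J, K)          # 2x2 feature map: [[6, 8], [12, 14]]
relu = np.maximum(F, 0)   # ReLU layer: negative values mapped to 0
pooled = relu.max()       # max pooling over the whole 2x2 map -> 14.0
```

Each layer of the sketch corresponds to one of the four layer types listed above, with the fully connected layer omitted since the proposed framework replaces it with the ELM.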
In the present work, unlike the conventional methods of feature extraction, we use pre-trained deep convolution neural network (DCNN) models, which extract the features learned during their training phase. We use six pre-trained DCNNs (Ayyar, 2022; Kurama, 2020), namely Xception, Vgg16, Vgg19, ResNet50, ResNet101 and ResNet152, for feature extraction. Thereafter, the computed feature map is used as a precursor to the classification task. For classification, in place of the fully connected layer of the pre-trained DCNNs, we use different machine learning classifiers to classify the facial images with or without masks. We have used five machine learning classifier techniques: Extreme Learning Machine (ELM), Support Vector Machine (SVM), K-Nearest Neighbor (K-NN), Random Forest (RF) and Ensemble Classifier (EC). We use external classifiers for two specific reasons. (1) There are a number of good research publications available which have used external classifiers such as SVM, RF, K-NN and EC; we further use the ELM, a classifier proven to classify successfully under real-time constraints. (2) The accuracy of the ELM as an external classifier is extremely good in other image processing applications (Karpagachelvi et al., 2012; Kim et al., 2009; Yang et al., 2013; Agarwal et al., 2014; Mishra et al., 2018), and we attempt to reproduce a similar efficiency for face mask detection as well. We compile our results from these classifiers and, in the first phase of this simulation, compare them among themselves to assess the best of the selected models. Note that the feature extraction step is the same for all classifiers; hence their comparison is based on their efficacy in classifying the masked and unmasked facial images from the two datasets. A brief introduction to these classifiers is presented in Section 2.3.

Machine learning classifiers for image classification operation
This section presents a brief description of all the machine learning classifiers used in present work.

Extreme learning machine (ELM)
The ELM was proposed by Huang et al. (2004) as an efficient alternative to the backpropagation algorithm for single hidden layer feed-forward networks (SLFN). It is a fast feed-forward network with good generalization performance compared to other conventional feed-forward networks. In this architecture, the input weights are randomly initialized and the output weights are calculated using the Moore-Penrose matrix inverse (Fill & Fishkind, 2000).
Given a set of training examples (x_i, y_i) for i = 1..R, with S hidden neurons and U output neurons, the mathematical formulation of the ELM is given by Eq. (2):

y_i = Σ_{l=1}^{S} β_l g(z_l · x_i + b_l),  i = 1, …, R    (2)

In Eq. (2), g is the activation function, z_l denotes the connection weights from the input layer neurons to the lth hidden neuron, b_l the bias of the lth hidden neuron, and β_l the connection weights from the lth hidden neuron to the output neuron.
The above equation can also be represented in matrix form as given by Eq. (3):

Gβ = Y    (3)

where G is the hidden-layer output matrix, β the matrix of output weights and Y the matrix of targets. The set of linear equations in Eq. (3) can be solved using the Moore-Penrose generalized inverse operation given by Eq. (4):

β = G†Y    (4)
In Eq. (4), G† is the Moore-Penrose generalized inverse (Fill & Fishkind, 2000) of the matrix G, which can be calculated as G† = (GᵀG)⁻¹Gᵀ. A prototype of the ELM algorithm is described below:

ELM (Training Dataset, g, S)
Training Dataset: (x_i, y_i) for i = 1..R; g: activation function; S: number of hidden neurons
• Initialize: hidden node parameters (z_i, b_i) for i = 1, 2, …, S at random
• Calculate: G, the hidden-layer output matrix
• Calculate: the output weights β = G†Y

It is a well-established fact that the ELM outperforms conventional learning methods in speed and generalization capability (Huang et al., 2004, 2006). This is specifically true in comparison with gradient-based learning algorithms such as back-propagation neural networks, in which the weights are adjusted iteratively to capture the non-linear relationship between input and output. The ELM, on the contrary, computes the output weights in a single step and is therefore extremely fast. It has shown excellent results in several image processing applications such as face recognition and image and video watermarking (Karpagachelvi et al., 2012; Kim et al., 2009; Yang et al., 2013; Agarwal et al., 2014; Mishra et al., 2018). In the present face mask detection framework, the leaky_relu, sigmoid, relu and tanh activation functions are used in the ELM classifier for our investigations.
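The algorithm above can be sketched in a few lines of NumPy; the toy two-class dataset, random seed and hidden-layer size below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, Y, S, g):
    """Train an ELM: random hidden weights z and biases b; output
    weights beta solved in one step via the Moore-Penrose
    pseudoinverse, beta = G† Y (Eq. 4)."""
    z = rng.standard_normal((X.shape[1], S))
    b = rng.standard_normal(S)
    G = g(X @ z + b)              # hidden-layer output matrix (Eq. 3)
    beta = np.linalg.pinv(G) @ Y
    return z, b, beta

def elm_predict(X, z, b, beta, g):
    return g(X @ z + b) @ beta

# Toy two-class problem: points above/below the line x1 = x2.
X = rng.standard_normal((200, 2))
y = (X[:, 0] > X[:, 1]).astype(float)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
z, b, beta = elm_train(X, y, S=50, g=sigmoid)
pred = (elm_predict(X, z, b, beta, sigmoid) > 0.5).astype(float)
acc = (pred == y).mean()  # training accuracy on the toy problem
```

Note that training involves no iterative weight updates: the only learned parameters are the output weights β, obtained in a single least-squares step, which is the source of the ELM's speed.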

The support vector machine (SVM) classifier
The SVM was proposed by Vapnik (Vapnik, 1998; Cortes & Vapnik, 1995). It can be used for both regression and classification problems. As a classifier, it gives very good accuracy and shows comparatively better time complexity than other traditional classifiers, and it works well with high-dimensional data. The main objective of the SVM is to find the optimum hyperplane that separates the data into two or more classes for two-way or multi-way classification. The mathematical formulation of the linear hyperplane which separates the data into two classes is given by Eq. (5):

w · x + b = 0    (5)

where b is the intercept (bias) term and w is the weight vector of the hyperplane equation.
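As a minimal illustration of Eq. (5), the sketch below classifies two points by the sign of w · x + b for a hypothetical 2-D hyperplane; w, b and the points are made-up values, not a trained SVM.

```python
import numpy as np

# Hypothetical separating hyperplane w . x + b = 0 in 2-D.
w = np.array([1.0, -1.0])   # weight vector of the hyperplane
b = 0.0                     # intercept / bias term
points = np.array([[2.0, 1.0],   # w . x + b = +1 -> class +1
                   [1.0, 3.0]])  # w . x + b = -2 -> class -1
labels = np.sign(points @ w + b)
```

A real SVM would additionally choose w and b to maximize the margin between the two classes; here only the decision rule itself is shown.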

The K-Nearest neighbor (K-NN)
The K-NN is a versatile algorithm which can also be used for handling missing values and resampling datasets. To classify a test data point, K-NN finds the K training samples nearest to it under the Euclidean distance measure and assigns the test point to the class that has the highest number of members among those K nearest neighbors.
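A minimal NumPy sketch of the K-NN rule described above; the toy training set and query point are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, K):
    """Classify x by majority vote among its K nearest training
    points under Euclidean distance."""
    d = np.linalg.norm(X_train - x, axis=1)   # distances to all points
    nearest = y_train[np.argsort(d)[:K]]      # labels of K nearest
    return Counter(nearest).most_common(1)[0][0]

X_train = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
y_train = np.array([0, 0, 1, 1])
# Query near the class-1 cluster; its 3 nearest neighbors vote 1, 1, 0.
label = knn_predict(X_train, y_train, np.array([4.5, 5.2]), K=3)
```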

The random forest classifier
The Random Forest classifier works by building many decision trees on sub-samples of the dataset. It uses bagging and feature randomness when building the trees, and averaged voting for making predictions. For a binary decision tree, node importance can be computed using the Gini-based formula given by Eq. (6):

in_j = w_j P_j − w_{left(j)} P_{left(j)} − w_{right(j)} P_{right(j)}    (6)

where in_j is the importance of node j, w_j is the weighted number of samples reaching node j, P_j is the impurity value of node j, left(j) is the child node from the left split on node j and right(j) is the child node from the right split on node j.
Then the feature importance (fim) for each feature on a decision tree can be computed using Eq. (7):

fim_l = ( Σ_{j : node j splits on feature l} in_j ) / ( Σ_{k ∈ all nodes} in_k )    (7)

where fim_l is the importance of feature l and in_j is the importance of node j.
The computed feature importance values are then normalized between 0 and 1 using Eq. (8):

normfim_l = fim_l / Σ_{m ∈ all features} fim_m    (8)

At the random forest level, the importance of each feature is computed by averaging over all T trees using Eq. (9):

RFfim_l = ( Σ_{j ∈ all trees} normfim_{l,j} ) / T    (9)

In the present work, the grid search method is used to find the values of the hyperparameters.
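Eqs. (6)-(9) can be traced with a small worked example; the node weights, impurity values and second-tree importances below are purely illustrative numbers, not measured quantities.

```python
def node_importance(w_j, P_j, w_left, P_left, w_right, P_right):
    """Eq. (6): weighted impurity decrease at node j."""
    return w_j * P_j - w_left * P_left - w_right * P_right

# Hypothetical single tree with two internal nodes: node 0 splits on
# feature "A", node 1 splits on feature "B" (illustrative numbers).
in_0 = node_importance(1.0, 0.5, 0.6, 0.2, 0.4, 0.1)   # ~0.34
in_1 = node_importance(0.6, 0.2, 0.3, 0.0, 0.3, 0.0)   # ~0.12
total = in_0 + in_1

# Eq. (7): per-tree feature importance as share of total node importance.
fim = {"A": in_0 / total, "B": in_1 / total}
# Eq. (8): normalize so the values sum to 1.
norm = {f: v / sum(fim.values()) for f, v in fim.items()}
# Eq. (9): average normalized importances over all T trees
# (a made-up second tree is used here to complete the example).
trees = [norm, {"A": 0.6, "B": 0.4}]
rf_fim = {f: sum(t[f] for t in trees) / len(trees) for f in norm}
```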

The ensemble classifier
The Ensemble classifier makes predictions on a voting basis: not on the basis of just one classifier, but of a set of classifiers. It combines the predictions of the different classifiers to improve the overall performance of the model. In the present work, the classifiers used in the ensemble method are K-NN, Linear Regression and Decision Tree.
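A minimal sketch of hard (majority) voting over a set of base classifiers; the per-classifier predictions below are made-up values for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier predictions by majority voting,
    one column per sample."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*predictions)]

# Hypothetical predictions from three base classifiers on four samples.
preds = [
    [1, 0, 1, 1],  # e.g. K-NN
    [1, 1, 0, 1],  # e.g. decision tree
    [0, 0, 1, 1],  # e.g. a third base learner
]
final = majority_vote(preds)  # each sample takes the majority label
```

With an odd number of voters, as here, the majority label is always well defined for binary classes.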

Experimental work and results
We use the following system configuration for executing the proposed research work: an Intel i5 processor (16 GB RAM) using Python 3.7 and Google Colab (n1-highmem-2 instance, 2 vCPU @ 2.2 GHz, 13 GB RAM, 100 GB free space). In this work, 80% of the images from the datasets are used for training the model and 20% for testing the trained model. Note that the classification accuracy is used for comparative performance evaluation of the proposed and similar face mask detector frameworks. The mathematical formulation of classification accuracy is given by Eq. (10).

Accuracy = (Total correct predictions made by the classifier) / (Total predictions made by the classifier)    (10)
The research work carried out in this paper is organized as a detailed study which is divided into six phases as follows:

DCNN-ELM hybrid framework
As mentioned in Section 2, a Convolution Neural Network consists mainly of four types of layers: (i) the convolution layer, which performs the convolution operation using a specified kernel and extracts the features from the images; (ii) the ReLu (Rectified Linear Unit) layer, which applies the ReLu activation function to the output of the convolution layer, mapping all negative values to zero while retaining the positive values; (iii) the pooling layer, which applies a max pooling function to the data and retains only the maximum value in each window; and (iv) finally, the fully connected layer, which performs the classification task. In the proposed approach, the ELM is used as the classifier instead of the fully connected layer of the deep CNN model. Therefore, in this model, the fully connected layer is eliminated and the ELM architecture is used for classification of the dataset. The feature map computed by the pre-trained DCNN model is divided into training and testing data; the training feature map is given as input to the ELM for training and the testing feature map is used for testing the ELM model. The proposed hybrid DCNN-ELM model is depicted in Fig. 4.
In the present simulation, six different pre-trained DCNN models are used, namely Xception, Vgg16, Vgg19, ResNet50, ResNet101 and ResNet152. These were trained on the ImageNet dataset, which consists of around 14 million images categorized into 1000 classes (Brownlee, 2017; Marcelino, 2018; Ayyar, 2022). The images in the face mask datasets are resized according to the input size needed by each DCNN model; one example of image resizing is shown in Fig. 5. In the present work, these pre-trained models are used to compute the feature map for the given dataset, and classification of images into masked and non-masked categories is carried out using the ELM classifier. As mentioned in Section 1.3, Ismael & Şengür (2021) carried out a detailed study on COVID-19 classification of X-ray images using the same methodology. They extracted features of these images using AlexNet, VGG and ResNet deep transfer learning models and further used the SVM with different kernel functions for image classification, showing that the ResNet50 feature extractor with a linear-kernel SVM gives the best classification accuracy of 94.7%. We have carried out a similar work, albeit with a few different transfer learning models and an altogether different classifier: we use the ELM instead of the SVM as the classification tool. Further, our results are found to be the best for the ResNet152 feature extractor. To the best of our knowledge and information, the ELM classifier has never been used in conjunction with transfer learning models for the face mask detection task. These results are presented and analyzed in great detail in this section. The ELM is a fast single hidden layer feed-forward neural network, and its use for image classification therefore allows the task to be accomplished on a real-time scale with significantly less processing time.
The ELM classifier is used with four activation functions in the present work namely, leaky_relu, sigmoid, relu and tanh with varying number of hidden neurons (S). For performance evaluation, the classification accuracy score is used. Variation of accuracy score as a function of number of hidden units (S) is shown in Fig. 6(a-d) and Fig. 7 (a-d) for all four activation functions namely leaky_relu, sigmoid, relu and tanh of ELM for the RMFD and FMDD datasets respectively.
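For reference, the four activation functions used in the ELM classifier can be written as follows; the leaky_relu slope of 0.01 is a common default and an assumption here, as the paper does not state the value used.

```python
import numpy as np

def leaky_relu(v, alpha=0.01):
    """Keeps a small slope (alpha) for negative inputs."""
    return np.where(v > 0, v, alpha * v)

def sigmoid(v):
    """Squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

def relu(v):
    """Zeroes out negative inputs."""
    return np.maximum(v, 0)

def tanh(v):
    """Squashes inputs into (-1, 1)."""
    return np.tanh(v)

v = np.array([-2.0, 0.0, 3.0])
# leaky_relu(v) -> [-0.02, 0.0, 3.0], relu(v) -> [0.0, 0.0, 3.0]
```

Unlike relu, leaky_relu never maps a hidden unit's response exactly to zero, which keeps every hidden neuron contributing to the least-squares solution for β; this is one plausible reason for its slight edge in the reported accuracy scores.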
The results are best for S = 1024, where the accuracy score is maximum. These plots clearly indicate that the accuracy score is best for the leaky_relu activation on the RMFD dataset. The sigmoid activation function comes close to leaky_relu, while the relu and tanh activations are third and fourth, in that order of performance. For the FMDD dataset, a similar outcome is observed: leaky_relu is found to be slightly superior to the other three activation functions in terms of accuracy score plotted as a function of the number of hidden units (S), and in this case too the results are best for S = 1024. The results emerging from Fig. 6 and Fig. 7 are compiled in Table 1 and Table 2 for the RMFD and FMDD image datasets respectively, for S = 1024 only. In these tables, the results of all pre-trained models are compiled. On careful observation of the two tables, we find that the accuracy score (99.78%) is best for the ResNet152 transfer learning model with the leaky_relu activation function and S = 1024 in the ELM architecture on the RMFD dataset. Table 3 compiles the computation time (in seconds) consumed by the hybrid DCNN+ELM model for the different activation functions with S = 1024. Note that for the RMFD dataset of 4653 images, the average time consumed for classification using the ELM with any of the pre-trained DCNN models is found to be very small. The best (smallest) average computed time span is 1.12 s, indicating that the processes are accomplished on a real-time scale. The ELM has the advantage of only one parameter to be tuned and hence exhibits real-time outcomes. This, together with the input weights and hidden biases being chosen at random, makes it a very fast neural network; however, this feature of the ELM may also require more hidden nodes than traditional tuning-based learning algorithms.
The ELM is well known for its real-time capabilities in various other image processing applications (Karpagachelvi et al., 2012; Kim et al., 2009; Yang et al., 2013; Agarwal et al., 2014; Mishra et al., 2018). A similar outcome is observed in the present classification task on the masked facial image datasets as well.

Fine tuning of DCNN models
In the second phase of this experimental simulation, fine-tuning of the transfer learning models using the RMFD dataset is carried out; these results are tabulated in Table 4. In this case, the DCNN models themselves are trained on the RMFD dataset.
ResNet152 is found to be the model with the best classification accuracy of 98.93%. Hence, for further investigations, only ResNet152 is considered. Fig. 8(a) shows the fine-tuning plot of training and testing accuracies as a function of the number of epochs for the ResNet152 model trained on the RMFD image dataset. Similarly, Fig. 8(b) shows the fine-tuning plot of loss as a function of the number of epochs for the same model.
It is quite clear that training is very stable, while testing also stabilizes as the number of epochs increases. Along similar lines, the loss function is minimized as the number of epochs increases. Hence, from the investigations carried out in phase 2 of this simulation, it becomes evident that the ResNet152 model trained on the RMFD dataset is the best among all the transfer learning models considered in phase 1 of this paper. In fact, after comparing the results of the two phases of this experimental simulation, we can conclude that the results obtained in phase 1 are better than those obtained in phase 2. This outcome results from using the ELM architecture for the classification task, which not only produces better performance but also yields these results in the smallest time.

Analysis of feature visualization of ResNet152 model
It is a well-known fact that CNN/DCNN models are powerful tools for image classification and recognition tasks. They achieve this by learning the features of an image through the various filters applied at each convolution layer, and the features learnt by each layer vary significantly. To understand how a DCNN model works and learns image features, we need to visualize the feature maps computed by the model across the different layers of its architecture. A feature map shows how the DCNN layers learn to identify and extract the different features present in an image, which helps us understand the reasons behind the predictions the model makes for a given class. Feature visualization turns the internal features contained in an image into recognizable or visible image patterns, helping us clearly decipher the features learnt by a given DCNN model.
As each of the mentioned DCNN architectures consists of a huge number of layers, we list the feature maps of only a few layers of the ResNet152 architecture. Fig. 9 depicts the input image supplied to the ResNet152 model from the RMFD dataset. Fig. 10(a-g) depicts the feature maps captured at the outputs of the convolution layers; using the ResNet152 model, we observe the outputs from convolution layer 1 to convolution layer 5. Note that we carried out this extraction for all the models mentioned in this work; however, due to space limitations, we only present the feature maps of the ResNet152 model. A second reason for presenting these results is that in Sections 3.1 and 3.2 we already concluded that the ResNet152 model produces better results than the other DCNN models used in this work. Observing the feature maps in Fig. 10(a-g), it is clear that as the number of convolution layers increases, features are extracted at an increasingly microscopic level of the input image. We also observe that the initial layers are easier to visualize and interpret: they capture predominant low-level features of the input image such as edges and orientation. As the number of layers increases, the features become less interpretable and more abstract; they capture high-level features which help differentiate between various classes of images.
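To convey what the early-layer feature maps capture, a small hand-rolled example helps: convolving an image with edge-selective kernels produces exactly the kind of low-level edge and orientation responses described above. This is an illustrative sketch using Sobel-style kernels on a toy image, not filters extracted from the trained ResNet152.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution (DL-style cross-correlation) of a grayscale image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A toy image: dark left half, bright right half (i.e. one vertical edge)
img = np.zeros((8, 8))
img[:, 4:] = 1.0

# Sobel-style kernels, similar to the low-level filters early conv layers learn
vertical_edge = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
horizontal_edge = vertical_edge.T

v_map = conv2d(img, vertical_edge)    # responds strongly at the vertical edge
h_map = conv2d(img, horizontal_edge)  # stays flat: no horizontal edge exists
```

Deeper layers compose many such responses, which is why their feature maps become progressively more abstract and less directly interpretable.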

Ablation studies of the proposed hybrid ResNet152+ELM model
The objective of this detailed study is to identify and select only those parameters for which the model produces the best or optimum results. We now present the compiled results for these variations below:

Replacing the avg_pool (GlobalAveragePooling2D) layer by a different layer
In our proposed model, the avg_pool layer of ResNet152 gives us a 2-D feature map (one feature vector per image) which is used by the ELM classifier for the classification task. In this study, we replace the avg_pool layer with a Flatten layer and a GlobalMaxPooling2D layer to observe the impact of these layers on the performance of the model. Table 5 shows that the avg_pool layer gives the highest accuracy, whereas with both the Flatten layer and the GlobalMaxPooling2D layer the accuracy drops slightly.
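The difference between the three candidate layers can be seen directly from the shapes and values they produce. Below is a sketch, assuming the standard 7 × 7 × 2048 output of ResNet152's last convolutional stage for a 224 × 224 input; the random array merely stands in for a real feature map.

```python
import numpy as np

# Hypothetical final feature map: a 7 x 7 spatial grid with 2048 channels,
# the shape ResNet152's last convolutional stage yields for a 224x224 input
feature_map = np.random.default_rng(0).random((7, 7, 2048))

gap = feature_map.mean(axis=(0, 1))   # GlobalAveragePooling2D -> 2048 values
gmp = feature_map.max(axis=(0, 1))    # GlobalMaxPooling2D     -> 2048 values
flat = feature_map.reshape(-1)        # Flatten                -> 100352 values
```

Average pooling summarizes each channel over the whole spatial grid, max pooling keeps only the strongest response per channel, and Flatten preserves every spatial position at the cost of a 49-fold larger feature vector fed to the classifier.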

Changing the classifier used in the ResNet152+ELM model
Table 6 compiles and compares the classification accuracy obtained with different classifiers combined with the proven ResNet152 model as feature extractor. The highest accuracy is achieved when the ELM is used as the classification technique.

Varying the number of hidden neurons in the ELM classifier
As the ELM is finally selected as the classifier for the proposed hybrid architecture, it is pertinent to vary the number of hidden neurons (S) in this single-layer feed-forward neural network. This number is varied as shown in Table 7. The accuracy reaches 99.78% at S = 1024 and stagnates even when S is increased further.

Changing the activation function in the ELM classifier of the ResNet152+ELM model
We also selected the best activation function from the pool of available functions; Table 8 compiles these values. The leaky_relu function is found to be the best, yielding an accuracy of 99.78%.
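The four candidate activation functions are easily stated in NumPy. Leaky_relu differs from relu only in its treatment of negative inputs; the slope `alpha = 0.01` below is an assumed default, as the exact value used in the experiments is not reported here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def tanh(x):
    return np.tanh(x)

def leaky_relu(x, alpha=0.01):
    # Unlike relu, it retains a small response for negative inputs,
    # so no hidden unit is ever completely silenced
    return np.where(x > 0, x, alpha * x)
```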

Varying the number of layers in the residual network (ResNet)
In this ablation study, the number of layers is varied in the ResNet architecture. Table 9 compiles the classification accuracy for ResNet models with 50, 101 and 152 layers. The ResNet model with 152 layers gives the highest accuracy of 99.78%.
From the results compiled in Tables 5-9, it is clearly evident that the best accuracy of the ResNet152 + ELM hybrid architecture is obtained with the specific parameter choices identified above. Table 10 compiles the best parameters selected from each of the five ablation studies.
Under these circumstances, it is concluded that the proposed ResNet152+ELM hybrid architecture gives the best accuracy of 99.78% under the given pre-processing considerations defined in Section 3.1 for the images of the RMFD dataset.

Hybrid DNN with different classifiers
The outstanding performance of the ELM as a classifier in this model remains to be tested, compared, and evaluated vis-à-vis other standard classifiers; this section investigates and analyzes that comparison. For this purpose, several standard classifiers are considered: Support Vector Machine (SVM), Decision Tree (DT), Random Forest combined with grid search, K-Nearest Neighbor (K-NN) classifier and an Ensemble Voting classifier. All these classifiers are used to classify the feature maps computed by the pre-trained DCNN models on the RMFD dataset. The pre-trained DCNN models other than ResNet152 are also considered, to establish a comprehensive comparison between these models used in conjunction with the ELM (with leaky_relu and S = 1024) and the other selected classifiers mentioned above (please see Table 6 in Section 3.4.2). Table 11 compiles the complete data for all the pre-trained DCNN models and all the classifiers considered in this work on the RMFD dataset. It is very clear that the highest accuracy score, 99.78%, is obtained for the ResNet152 DCNN model with the ELM as the classifier. Hence, two conclusions are drawn from these results, together with the results compiled and analyzed in Section 3.4: (1) the ResNet152 model is the best among all the DCNN models considered, and (2) the ResNet152 DCNN model amalgamated with the ELM architecture for the classification task is found to be the best under rigorous, ablation-based performance evaluation. Note that Ismael & Şengür (2021), in their work on X-ray imagery for COVID-19 detection, reported a best recognition accuracy of only 94.7% using ResNet50, whereas in Table 11 the accuracy of ResNet50 on face-masked images is 99.677% for the RMFD dataset, which is very satisfactory for the present domain. The underlying reason lies in the effective processing of the face mask image datasets and the better tuning of the SVM in the present work.
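The wiring of such a head-to-head classifier comparison can be mimicked with scikit-learn on stand-in features. The sketch below only shows how the evaluation is organized; the synthetic data, the hyperparameter grid and the train/test split are illustrative assumptions, not the settings used to produce Table 11.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the feature vectors a DCNN backbone would produce
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (100, 32)), rng.normal(1, 1, (100, 32))])
y = np.array([0] * 100 + [1] * 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

classifiers = {
    "SVM": SVC(),
    "DT": DecisionTreeClassifier(random_state=0),
    # Random Forest combined with grid search, as in the comparison above
    "RF+grid": GridSearchCV(RandomForestClassifier(random_state=0),
                            {"n_estimators": [10, 50]}, cv=3),
    "K-NN": KNeighborsClassifier(),
    "Voting": VotingClassifier([("svm", SVC()),
                                ("knn", KNeighborsClassifier())]),
}
scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
          for name, clf in classifiers.items()}
```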
Further, Table 12 compiles the time (in seconds) taken by the different DCNN models with different classifiers on RMFD dataset.
Hence, the major contribution of the proposed framework is to establish the supremacy of the pre-trained DCNN model ResNet152 in conjunction with the ELM as classifier for classifying real-life images in the face mask detection operation, and that too in a real-time frame. In this combination, the proposed framework is capable of training and testing on the entire dataset in a fraction of a second. This is remarkable performance, given that processing CCTV footage at runtime certainly requires the fastest possible handling of its frames. Additionally, the training accuracy is close to 100%, which shows that this combination is the best among the selected ones and is also very suitable for developing real-time mask detection applications. Fig. 11 shows the complete configuration flowchart for the proposed ResNet152+ELM model, which is found to be the best among all the other architectures considered in this paper, according to the results presented in Sections 3.1 to 3.5. Hence, a comparison of our best results with those of other similar works is also important. For this purpose, many other published architectures are considered; they are compared and analyzed against our proposed model shown in Fig. 11. This comparison is carried out in Table 13.

Comparison with other similar works
A careful comparison of the data in Table 13 shows that only the entry pertaining to the work of Loey et al. (2021b), who obtained 99.6% testing accuracy with the ResNet50 transfer learning model in conjunction with an ensemble classifier on the RMFD dataset, is worth consideration. We, however, obtain a better testing accuracy of 99.78% on the same dataset using ResNet152 in conjunction with the ELM as classifier. This proves beyond doubt that the proposed model, a pre-trained DCNN in conjunction with an ELM classifier, is capable of producing the best performance under real-time constraints. Other researchers have not tested the real-time behaviour of their proposed architectures. It is well known that image processing applications involving a classification step are time-consuming, all the more so when a DCNN is used as the precursor, since tuning the DCNN architecture is itself a time-consuming task. Overall, both processes must be completed in the smallest possible time, and a further precondition is obtaining the best recognition rates in the performance evaluation. The proposed DCNN + ELM framework proves suitable for achieving this objective: the DCNN produces a feature map which is then classified by the ELM with a 99.78% recognition rate. Further, several attempts are made to minimize the computational time, and in the present case the classification of the masked faces of the RMFD dataset is handled in the smallest possible computation time. This is the major contribution of the present work.

Conclusion
This paper thoroughly investigates an important image processing application which is also very relevant under the current COVID-19 situation across the world. Several transfer learning models are tested in conjunction with a series of classifiers such as ELM, SVM, DT, K-NN and an Ensemble classifier. A detailed performance evaluation is carried out through ablation studies, calculating the testing accuracy along with the computation time used for feature extraction and the classification task. The entire work is executed in three phases. The combination of ResNet152 with the ELM classifier proves to be the best in terms of testing accuracy (99.78%), with an average computation time of only 1.12 s for classifying the face-masked images of the RMFD dataset. This shows that the proposed ResNet152 + ELM architecture is not only the best among the selected ones but is also fit to exhibit real-time capabilities, which are very useful for an image and video processing application such as this one. Another contribution of this paper is establishing the supremacy of the proposed ResNet152 + ELM model over other similar models used for the face mask detection operation.