YOLO vs. CNN Algorithms: A Comparative Study in Masked Face Recognition

Purpose: This research investigates the effectiveness of YOLO (You Only Look Once) and Convolutional Neural Network (CNN) in real-time face mask recognition, addressing the challenges posed by mask-wearing in infectious disease prevention. Method: Utilizing a diverse dataset and employing YOLO's object detection and a combined Haar Cascade Algorithm with CNN, the study evaluated key performance indicators, including accuracy, framerate, and F1 Score. Results: CNN outperformed YOLO in accuracy (99.3% vs. 79.3%) but operated at a slightly lower framerate. YOLO excelled in recall and precision, presenting a compelling choice for specific application needs. The research underscores the importance of considering factors beyond accuracy for informed decision-making in face mask recognition. Novelty: This research evaluates the real-time performance of YOLO and CNN algorithms in masked face recognition, highlighting the crucial balance between framerate efficiency and detection accuracy.


INTRODUCTION
The field of Artificial Intelligence (AI) is rapidly growing within computer science, with computer vision emerging as an extensively researched subfield [1]-[3]. This subfield has a specific case study, face recognition, which has attracted significant attention from researchers due to its potential applications [4], [5]. The ability of face recognition to identify individuals based on their facial features provides numerous advantages across various domains [6]. For instance, organizations such as schools or companies can enhance efficiency in attendance systems by incorporating face recognition technology [7]. In addition, face verification, a specialized application within face recognition, plays a vital role in ensuring secure access to functions like smartphone locks and payment systems [8].
Several studies have demonstrated the impressive performance of face recognition systems. A study proposed a method for constructing a face recognition system, attaining an accuracy of 95.97% on the AR Face dataset with 120 individuals and 97.20% on the VTU-BEC-DB multimodal database [9]. Similarly, another study developed a face recognition system for school attendance, achieving an accuracy of 97.29% in their proposed attendance system [10]. Notably, surveys employing various methodologies conducted by several researchers in recent years have consistently indicated satisfactory performance of observed face recognition systems [11]-[13]. Although face recognition seems to be a well-addressed topic in AI development, this does not diminish the need for ongoing research by scholars exploring different cases and conditions in face recognition systems.
One recent case concerns the use of face masks, which can prevent face recognition from functioning. This is particularly true during an outbreak of a highly concerning infectious disease that spreads through touch and droplets, when people are required to wear masks both indoors and outdoors. This precautionary measure obstructs facial visibility, making it difficult for individuals to recognize each other. For this reason, a study developed masked face recognition systems that achieved higher accuracy (97%) and improved true positive rates by including masked face images in their training data [14]. Another study simulated face dataset augmentation with nonphysical masks [15]. Other studies have found that YOLO-V3 and YOLO-V3-tiny achieved higher accuracy in detecting face masks compared to CNN. A study reported an average accuracy of 91.28 for YOLO-V3, an average precision value of 86.65 for CNN, and an accuracy of 95% for YOLO-V3-tiny and 84% for CNN [16]. Meanwhile, another study proposed a novel dataset and methods for real-time detection of masked and unmasked faces, achieving an accuracy of 99.5% using YOLO and a CNN architecture [17].
However, until now, there has been no comprehensive comparison between the YOLO algorithm and the Haar-Cascade CNN algorithm. Thus, this research focuses on face mask recognition using two deep learning models, namely YOLO and CNN. YOLO is included because of its known speed: its detection architecture treats frame detection as a regression problem and does not require a complex pipeline. On the other hand, the use of dimensions greater than 1 in CNN will affect the overall scale of an object. In this research, the two deep learning models are compared experimentally to determine which of them can, at the very least, be identified as superior when applied to a camera for mask-wearing identification.

METHODS
The dataset was collected from three individuals, each of whom provided 15 images with masks and 15 images without masks, resulting in a total of 90 images. Each photo had a resolution of 286x286 pixels and was captured using a Logitech C920 camera. Various perspectives were captured across the photos, creating a diverse dataset for testing the intended detection methods. Figure 1 shows an example of the dataset employed for the implementation database.
In this study, two different detection methods were compared: YOLO (You Only Look Once) [18], [19] and a combined method of the Haar Cascade Algorithm with a Convolutional Neural Network (CNN) [20]. YOLO is a real-time object detection algorithm that divides the image into a grid and predicts bounding boxes and object classes in a single processing step [21]. Meanwhile, the Haar Cascade Algorithm is a classical method that detects objects based on visual features, and CNN is a type of artificial neural network architecture effective in understanding spatial data structures [22]. The CNN layers used in data processing include normalization to adjust the pixel value range [23], a filtering layer to extract crucial features, ReLU (Rectified Linear Unit) to introduce non-linearity, max pooling to reduce spatial dimensions, flattening to transform data into a vector, a fully connected layer for feature combination, and softmax as the activation function for final classification [24].
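The layer sequence listed above can be illustrated with a minimal, dependency-free sketch. It is not the network used in this study: the 5x5 toy image, the 2x2 edge kernel, and the dense-layer weights are all hypothetical values chosen only to make each step easy to follow by hand.

```python
import math

def normalize(image):
    """Scale 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return [[p / 255.0 for p in row] for row in image]

def convolve2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the core of a filtering layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [sum(image[i + di][j + dj] * kernel[di][dj]
             for di in range(kh) for dj in range(kw))
         for j in range(out_w)]
        for i in range(out_h)
    ]

def relu(fmap):
    """Replace negative activations with zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    return [
        [max(fmap[i + di][j + dj] for di in range(size) for dj in range(size))
         for j in range(0, len(fmap[0]) - size + 1, size)]
        for i in range(0, len(fmap) - size + 1, size)
    ]

def flatten(fmap):
    """Turn a 2D feature map into a 1D vector."""
    return [v for row in fmap for v in row]

def dense(vector, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy input: dark left side, bright right side.
image = [[0, 0, 255, 255, 255] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]            # hypothetical vertical-edge filter
weights = [[0.5, 0.0, 0.5, 0.0],       # hypothetical weights, three output labels
           [0.0, 0.5, 0.0, 0.5],
           [0.25, 0.25, 0.25, 0.25]]
biases = [0.0, 0.0, 0.0]

pooled = max_pool(relu(convolve2d(normalize(image), kernel)))
logits = dense(flatten(pooled), weights, biases)
probs = softmax(logits)
predicted_label = probs.index(max(probs))
```

In a trained network the kernel and dense weights are learned from data rather than written by hand; the sketch only shows how data flows through the layers.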
In this research, the Keras library was used to create a face detector, with detect_face declared as a cropping function. Following this, an augmentation stage was applied to enhance data diversity. Image augmentation techniques were used to create new variations of existing faces [25], including rotation, flipping, or cropping to generate similar but different images [26]. This increases the diversity of data available for training the face recognition model, allowing the model to better recognize various facial characteristics. As a result, a total of 1890 images (augmentations and originals) with three labels each were generated. Expanding the dataset of facial images with more variations and labels is aimed at training a more effective face recognition model. In this research, 20% of the data were used as test data, while 80% served as training data. The number of samples and the dimensions of the training and testing data were then checked to understand the data structure used in model training and testing.
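As a rough illustration of the augmentation and the 80/20 split described above, the sketch below applies a flip, a 90-degree rotation, and a center crop to a tiny nested-list "image", then splits 1890 placeholder samples. The helper names, the fixed seed, and the use of integers as stand-ins for images are assumptions for illustration; the study itself used Keras and its own pipeline. Note that 1890 / 90 = 21, which would correspond to each original plus 20 variants if the augmentations were spread evenly (also an assumption).

```python
import random

def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def rotate_90_clockwise(image):
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*image[::-1])]

def center_crop(image, size):
    """Keep a size x size window around the image center."""
    top = (len(image) - size) // 2
    left = (len(image[0]) - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

tiny = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
flipped = flip_horizontal(tiny)
rotated = rotate_90_clockwise(tiny)
cropped = center_crop(tiny, 1)

# Integers stand in for the 1890 augmented and original images.
samples = list(range(1890))
train, test = train_test_split(samples)
print(len(train), len(test))
```

With a 20% test fraction, the 1890 samples divide into 1512 training and 378 test items.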
In terms of research indicators, the study selected three key variables: accuracy, framerate, and F1 Score. Accuracy is a holistic measure of the model's correctness, providing an overall assessment of its performance. Framerate, measured in frames per second, offers insights into the real-time applicability of the detection methods. The inclusion of the F1 Score, a metric combining precision and recall, ensures a balanced evaluation that considers both false positives and false negatives [27]. The webcam feed of each individual was analyzed through the YOLO and CNN algorithmic systems [28], [29]. Different layers were employed based on the system's operational principles [30]-[32].
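For reference, the standard definitions of these indicators can be written as a short sketch over confusion-matrix counts. The counts passed in below are hypothetical and are not taken from this study's experiments.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 45 true positives, 5 false positives,
# 0 false negatives, 50 true negatives.
metrics = classification_metrics(tp=45, fp=5, fn=0, tn=50)
print(metrics)
```

Because F1 is a harmonic mean, it reaches 1.0 only when precision and recall are both 1.0; with perfect recall and imperfect precision, as in the hypothetical counts above, it falls between the two values.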

RESULTS AND DISCUSSIONS
The analysis began by collecting images from three individuals during the experiment. Subsequently, the system was tested with the two algorithms under comparison, YOLO and CNN, using the same three persons in each trial. Table 1 provides a statistical comparison of the two methods. In terms of frame rate, CNN operated at 3.57 FPS, slightly below YOLO at 3.67 FPS. Figures 3 and 4 illustrate the success of both systems in facial detection and their capability to distinguish individuals effectively. In terms of accuracy, however, CNN outperformed YOLO with a rate of 99.3% compared to 79.3%. The reported precision was 97.3% for CNN and 81.3% for YOLO. Both methods were reported to achieve perfect recall, indicating their ability to capture all relevant instances. The reported F1-Score, which balances precision and recall, was 0.97 for CNN and 1.00 for YOLO.
These findings suggest that CNN excels in accuracy and precision, while YOLO demonstrates competitive performance, especially in recall and F1-Score. The choice between the two methods may depend on specific priorities, such as real-time processing (where YOLO has a slight edge in frame rate) or overall accuracy (where CNN excels). The comparison in Figure 5 indicates that the YOLO algorithm can achieve higher detection accuracy than CNN on images where CNN exhibits detection errors. When considering these results, it becomes evident that both CNN and YOLO possess their own strengths and excel in specific areas. CNN's superior accuracy and high F1-Score, for example, make it a robust option for applications where precision and overall model performance are paramount. In contrast, YOLO's reported recall and F1-Score imply a strong capability to detect and classify instances.
The choice between these methods may be subjective and contingent upon the specific requirements of the application. If real-time processing is the priority, the slight frame-rate advantage of YOLO might be considered a decisive factor. Likewise, if the emphasis lies on a well-balanced model in terms of precision and recall, as reflected in the F1-Score, YOLO could be a compelling choice, whereas CNN is preferable when accuracy matters most. Figure 6 shows the test accuracy of YOLO and CNN, as well as their frame rate tests.
The classification results can be seen in Figure 7. Overall, Table 1 provides a valuable foundation for decision-making, but further considerations, such as computational efficiency and specific application requirements, should be taken into account to make an informed choice between CNN and YOLO for face mask recognition. The results can be compared to the previous study, which examined CNN and YOLO solely on full-face images without masks. Additionally, this study involved a real-time application in which detection was performed directly: in the case of mask usage, the CNN algorithm demonstrated a detection accuracy of 99.3%, while YOLO showed 79.3%. CNN ran at 3.57 Frames Per Second (FPS), slightly below YOLO, which recorded 3.67 FPS.
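Frame rates such as those reported above are typically obtained by timing the detector over a sequence of frames. The following is a minimal sketch of that measurement, not the procedure used in this study; `measure_fps` and `SimulatedDetector` are hypothetical names, and the simulated detector replaces a real model so that the example is deterministic.

```python
import time

def measure_fps(process_frame, frames, clock=time.perf_counter):
    """Return processed frames per second over the whole sequence."""
    start = clock()
    count = 0
    for frame in frames:
        process_frame(frame)
        count += 1
    elapsed = clock() - start
    return count / elapsed if elapsed > 0 else float("inf")

class SimulatedDetector:
    """Stand-in for a real detector: each call advances a virtual clock."""

    def __init__(self, seconds_per_frame):
        self.now = 0.0
        self.seconds_per_frame = seconds_per_frame

    def clock(self):
        return self.now

    def process(self, frame):
        self.now += self.seconds_per_frame

# A detector that needs 0.25 s per frame should measure 4 FPS.
detector = SimulatedDetector(seconds_per_frame=0.25)
fps = measure_fps(detector.process, range(20), clock=detector.clock)
print(fps)
```

With a real model, `process_frame` would wrap the inference call and the default `time.perf_counter` clock would be used; averaging over many frames reduces the influence of warm-up and scheduling jitter.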

CONCLUSION
In conclusion, the CNN method demonstrated high accuracy in recognizing both masked and unmasked faces, with an accuracy rate of 99.3% and 97.3%, respectively. However, it operated at a slightly slower processing speed, achieving an FPS of 3.27, and exhibited some prediction errors as indicated by imperfections in precision, recall, and F1-score. Despite these limitations, the CNN method remains viable for effective face mask recognition applications. On the other hand, the YOLO algorithm offered a comparable average processing speed of 3.8 FPS, but its accuracy was lower, at 79.3% for masked faces and 81.3% for unmasked faces. Therefore, the choice between these methods depends on the specific priorities regarding accuracy and processing speed in the context of face mask recognition applications. Both methods have their strengths and limitations, and selecting the most suitable one should be based on the specific requirements of the application at hand.

Figure 2. The workflow of the system execution
Figure 2 depicts the flow diagram of the system utilized in this research. The figure illustrates that an individual's webcam feed was analyzed through the YOLO and CNN algorithmic systems [28], [29]. Different layers were employed based on the system's operational principles [30]-[32].

Figure 5. (a) Confusion matrix of the CNN classification, (b) confusion matrix of the YOLO classification

Table 1. Comparison of statistical analysis of the CNN and YOLO methods