Image Segmentation Performance using DeepLabV3+ with ResNet-50 on Autism Facial Classification

In recent years, facial recognition technology has advanced significantly through the use of convolutional neural networks, particularly in identification applications. This study applies ResNet-50 in conjunction with the DeepLabV3+ segmentation method to face recognition. It analyzes the performance of ResNet-50 without and with DeepLabV3+ segmentation on datasets comprising faces of children on the autism spectrum. DeepLabV3+ serves two purposes: first, to mitigate noise within the datasets, and second, to eliminate unnecessary features, with the aim of improving overall accuracy. On the datasets without segmentation, ResNet-50 reaches an accuracy of 83.7%. With DeepLabV3+ segmentation, accuracy rises to 85.9%. This gain of 2.2 percentage points indicates that segmenting the images with DeepLabV3+ reduces noise and extraneous features, and thereby improves the accuracy of facial recognition on datasets featuring faces of children on the autism spectrum.


Introduction
In recent years, facial recognition technology has advanced significantly, and convolutional neural networks (CNNs) have demonstrated outstanding performance in facial recognition and identification applications [1]. Several studies on facial recognition have been undertaken [2][3][4][5][6]. This study proposes a facial recognition system for autistic children that uses deep learning and a segmentation method to recognize the characteristics of autism in facial images. The faces of autistic and non-autistic children are nearly identical in appearance [4]. However, children diagnosed with autism have an atypically broad upper face, which includes wide-set eyes. They also have a shorter middle region of the face, including the nose and cheekbones [7]. CDC data examined year by year over 80 years show that head circumference (HC) has increased by 11 mm for boys and 7 mm for girls [8].
The study in [7] used MobileNet and two dense layers to perform feature extraction and image classification. In [9], three pre-trained deep learning models, namely Xception, VGG19, and NASNetMobile, were implemented to detect ASD. Another study proposes a practical ASD screening solution using facial images by applying VGG16 transfer-learning-based deep learning [10]. However, these studies did not implement any segmentation model.
In this study, autism and non-autism images are analyzed. The study explores a CNN architecture using ResNet-50 and a segmentation method called DeepLabV3+. These methods are intended to produce models with good accuracy and architecture performance in the recognition and classification processes. ResNet-50 is a popular deep CNN model for extracting features from facial images; it was designed for recognition tasks and outperforms simple CNNs [11,12]. DeepLabv3+ is currently one of the best semantic segmentation algorithms and achieves high segmentation accuracy [13].
Based on the available research, a benchmark comparison of autistic face image classification using ResNet-50 and ResNet-50 with DeepLabv3+ segmentation can be summarized as follows:
• The accuracy of classifying autistic and non-autistic children using ResNet-50 for facial image classification ranges from 43% to 96% [14][15][16].
• Deep learning and transfer learning have been applied to the detection of autism through facial images, with various studies reporting different accuracy levels using different models and techniques [14][15][16].
• The use of DeepLabv3+ segmentation with ResNet-50 for autistic face image classification is not specifically reported in the studies reviewed.
This study analyzes the performance of ResNet-50 and DeepLabV3+ in the recognition of autism and non-autism. The results are analyzed to determine whether the DeepLabv3+ segmentation method increases accuracy. It is hoped that these methods can serve as a reference for classifying autism and non-autism. The decision to opt for an algorithm other than DeepLabv3+ for segmenting the autistic spectrum face dataset stems from the necessity to identify the most appropriate method, considering factors such as algorithm compatibility, performance evaluation criteria, feature extraction requirements, and potential for originality. Despite the effectiveness of DeepLabv3+ in semantic segmentation tasks, its specific structure and features might not perfectly match the dataset's characteristics, prompting the exploration of alternative algorithms that could potentially offer superior performance or novel perspectives. Exploring algorithms previously unused in this context offers an opportunity to discover new methodologies and to improve the understanding of autistic spectrum facial attributes.
https://ejournal.ittelkom-pwt.ac.id/index.php/infotel

Research Method
This study analyzes the performance of ResNet-50 in classifying autistic and non-autistic datasets before and after DeepLabv3+ is applied. The following is the flowchart of the study. As a comparative benchmark, the study uses the ResNet-50 architecture as the baseline model for autism face dataset classification. This baseline showed good accuracy and efficiency in discerning facial features associated with autism. The study then integrates segmentation using DeepLabv3+. Augmenting ResNet-50 with DeepLabv3+ improved the classification process: segmentation helped the model focus on facial structures and on the spatial relationships within the images. Consequently, the benchmark showed that the combined approach surpassed standalone ResNet-50, which points to the potential of segmentation techniques for more accurate autism face dataset classification.

Facial Recognition System
The primary goal of a face recognition system is to recognize human identity from static images. Generally, a face recognition system treats the input image as a classification problem [2]. The system recognizes faces in an input image and matches the supplied image against the images in the dataset [17]. The face recognition process is carried out using a computerized method that includes detecting and verifying a person through an image [18]. Although facial recognition systems have been widely studied, several challenges, such as misalignment, illumination variations, and expression variations, still require new approaches and tests to improve the accuracy and precision of face recognition.

Convolutional Neural Network (CNN)
The CNN (or ConvNet) is a popular discriminative deep learning architecture that learns directly from the input without the need for manual feature extraction [19]. CNNs are used to construct the majority of computer vision algorithms [20]. They are the most widely used architecture in the image processing field because of their ability to recognize patterns in images. A CNN model can contain several types of layers. The most frequently used are convolutional, pooling, and fully connected layers. The feature-extraction part of a CNN consists of two layer types, namely convolution and pooling [21,22].
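The convolution and pooling layers described above can be illustrated with a minimal plain-Python sketch. It is not the network used in this study: the 5x5 toy image, the 2x2 vertical-edge kernel, and the function names are assumptions chosen only to show how a convolution responds to a pattern and how pooling reduces the resolution of the resulting feature map.

```python
def conv2d(image, kernel):
    """'Valid' 2D convolution (cross-correlation) of a 2D list with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(fmap):
    """Element-wise ReLU: negative responses are set to zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool2x2(fmap):
    """Non-overlapping 2x2 max pooling (a trailing odd row or column is dropped)."""
    return [
        [
            max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
            for j in range(0, len(fmap[0]) - 1, 2)
        ]
        for i in range(0, len(fmap) - 1, 2)
    ]


# Toy image (assumption): dark left part, bright right part.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# Hypothetical kernel that responds to dark-to-bright vertical edges.
vertical_edge = [[-1, 1], [-1, 1]]

features = relu(conv2d(image, vertical_edge))  # 4x4 map, non-zero only at the edge
pooled = max_pool2x2(features)                 # 2x2 map, edge response preserved

print(features)
print(pooled)
```

In a trained CNN the kernel values are learned rather than hand-written, and a fully connected layer would typically follow the pooled features to produce the class prediction.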

DeepLabV3+
DeepLabv3+ is currently one of the best semantic segmentation algorithms available. Building on DeepLabv3, it forms an encoder-decoder structure by adding a simple and effective decoder. DeepLabV3+ therefore has two components: an encoder and a decoder. The encoder is made up of a backbone feature extraction network and an atrous spatial pyramid pooling (ASPP) structure, whereas the decoder takes low-level features from the backbone feature extraction network and up-samples them to produce pixel-by-pixel classification results the same size as the input image [27]. The framework of DeepLabV3+ is shown in Figure 3.

Dataset
The dataset is shown in Figure 3.

Backbone network (ResNet-50)
The input image first enters the backbone network, ResNet-50, which functions as a feature extractor. ResNet-50 has deep layers and passes information forward through residual blocks.
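The idea of a residual block can be sketched as follows. This is a simplified illustration, not the ResNet-50 implementation: real ResNet-50 blocks use convolutions and batch normalization, whereas this sketch uses small fully connected layers on a vector, and all names and values are assumptions. It shows only the defining step, adding the block input back to the transformed output.

```python
def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU on a vector."""
    return [max(0.0, v) for v in x]


def residual_block(x, w1, b1, w2, b2):
    """Compute relu(F(x) + x), where F is two dense layers with a ReLU between."""
    hidden = relu(linear(x, w1, b1))
    transformed = linear(hidden, w2, b2)
    # Shortcut connection: the input is added back to the transformed signal.
    return relu([t + xi for t, xi in zip(transformed, x)])


x = [1.0, 2.0, 3.0]
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3

# With all weights at zero, F(x) = 0 and the block passes its input through.
passthrough = residual_block(x, zero_w, zero_b, zero_w, zero_b)
print(passthrough)
```

Because the shortcut carries the input forward unchanged, a block only has to learn a correction to its input, which is the property that makes very deep networks easier to train.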

Atrous spatial pyramid pooling (ASPP)
Atrous Spatial Pyramid Pooling (ASPP) uses atrous convolutions with varying dilation rates to capture information at different scales without sacrificing resolution. Each ASPP branch applies an atrous convolution with a different dilation rate, followed by Spatial Pyramid Pooling (SPP) to combine information from the various receptive field sizes. The ASPP output is used as input for subsequent steps in DeepLabv3+, such as up-sampling and merging with low-level features from the backbone, assisting the model in semantic segmentation tasks [28], [29], [30].
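A one-dimensional sketch can make the atrous (dilated) convolution concrete. The function name, the toy signal, the kernel, and the dilation rates 1, 2, and 3 are assumptions for illustration; the source does not state which rates were used, and real ASPP branches operate on 2D feature maps with padding so that their outputs have equal size.

```python
def atrous_conv1d(signal, kernel, rate):
    """1D atrous convolution in 'valid' mode: kernel taps are `rate` samples apart."""
    span = (len(kernel) - 1) * rate  # distance between the first and last tap
    return [
        sum(kernel[k] * signal[i + k * rate] for k in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


signal = list(range(10))  # a steadily increasing ramp: 0, 1, ..., 9
kernel = [1, 0, -1]       # difference between the first and last tap

# One branch per dilation rate, as in an ASPP-style parallel arrangement.
branches = {rate: atrous_conv1d(signal, kernel, rate) for rate in (1, 2, 3)}

for rate, output in branches.items():
    print(rate, output)
```

The same three-weight kernel compares samples that are 2, 4, and 6 positions apart as the rate grows, so each branch sees a wider context without adding parameters.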

Multi-scale contextual
The multi-scale contextual process in DeepLabv3+ is the model's ability to understand context at various spatial scales in the image. This is achieved through Atrous Spatial Pyramid Pooling (ASPP), which uses atrous convolutions with different dilation rates to capture information at varying scales. Multi-scale features therefore play an important role in DeepLabv3+: they improve the model's ability to detect and segment objects in the input image, reduce segmentation boundary problems, and increase the model's robustness to image variations [31,32].
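How the dilation rate changes the spatial scale can be quantified with the standard relation for the effective extent of a dilated kernel, k_eff = k + (k - 1)(r - 1). The sketch below evaluates it for a 3x3 kernel; the rates 1, 6, 12, and 18 are illustrative assumptions, since the source does not list the rates used.

```python
def effective_kernel_size(kernel_size, rate):
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (r - 1)."""
    return kernel_size + (kernel_size - 1) * (rate - 1)


# Illustrative dilation rates (assumption, not taken from the source).
sizes = {rate: effective_kernel_size(3, rate) for rate in (1, 6, 12, 18)}
print(sizes)
```

Under these assumed rates, a 3x3 kernel covers windows of 3, 13, 25, and 37 pixels per side while keeping nine weights, which is how the branches observe the same location at several scales.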

Up-sampling
Up-sampling uses bilinear interpolation to increase the resolution of the ASPP feature maps. The up-sampled result is then combined with feature maps from the earlier backbone layers, preparing the model to produce more detailed and accurate semantic segmentation predictions [32].
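Bilinear interpolation can be sketched in a few lines. This version maps the corner pixels of the input onto the corner pixels of the output, which is one of several conventions; deep learning frameworks may use a different pixel alignment, so this is an assumption made for clarity. The function name and the 2x2 example are likewise illustrative.

```python
def bilinear_upsample(fmap, out_h, out_w):
    """Resize a 2D feature map with corner-aligned bilinear interpolation."""
    in_h, in_w = len(fmap), len(fmap[0])
    out = []
    for i in range(out_h):
        # Fractional source row for this output row.
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            # Fractional source column for this output column.
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            # Interpolate along the row direction first, then between rows.
            top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
            bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


coarse = [[0.0, 2.0],
          [4.0, 6.0]]
fine = bilinear_upsample(coarse, 3, 3)
for row in fine:
    print(row)
```

Each new value is a distance-weighted average of its four nearest source values, so the up-sampled map is smooth but contains no detail that was absent from the coarse map.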

Low-level features
Low-level features in DeepLabv3+ are obtained from the initial or lower layers of the network architecture and store detailed local and structural information. They preserve spatial information and details that may be lost during deep pooling and convolution processes. Low-level features work together with the high-level features produced by ASPP. They help improve the resolution of the segmentation results and give the model the ability to capture fine details that are important for semantic segmentation tasks [33][34][35].

Concat (concatenate)
Concatenation combines the up-sampled output of the Atrous Spatial Pyramid Pooling (ASPP) module with the low-level features originating from the initial or lower layers of the network architecture. It allows the model to combine information from different scales and levels of resolution, improving its ability to capture the detail and context necessary for semantic segmentation tasks [36,37].
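Concatenation along the channel axis can be sketched with nested lists laid out as [channel][row][column]. The channel counts, values, and function names below are assumptions for illustration only; the point is that both inputs must share the same spatial size, which is why the ASPP output is up-sampled first.

```python
def shape(fmaps):
    """Return (channels, height, width) of a [channel][row][col] nested list."""
    return (len(fmaps), len(fmaps[0]), len(fmaps[0][0]))


def concat_channels(a, b):
    """Stack two sets of feature maps along the channel axis."""
    if shape(a)[1:] != shape(b)[1:]:
        raise ValueError("spatial sizes must match before concatenation")
    return a + b


# Hypothetical inputs: 3 up-sampled ASPP channels and 1 low-level channel, 2x2 each.
aspp_upsampled = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(3)]
low_level = [[[1.0, 0.0], [0.0, 1.0]]]

fused = concat_channels(aspp_upsampled, low_level)
print(shape(fused))
```

The fused tensor keeps every input channel untouched; mixing the two sources is left to the convolution layers that follow.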

Convolutional
After concatenation, the model applies additional convolution layers to fuse and refine the concatenated features. The role of convolution in the final stage of DeepLabv3+ is to produce accurate semantic segmentation predictions, which allows the model to classify each pixel in the input image [38,39].

Output/segmentation results
The results obtained are segmentation results from the input image.
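As a rough illustration of what a segmentation result looks like and how it could be used to remove unnecessary features, the sketch below converts per-pixel class scores into a label map and then blanks out every pixel that is not labelled as face. The source does not describe how the segmentation output was applied to the images, so the masking step, the two-class setup, and all names and values here are assumptions.

```python
def argmax_labels(scores):
    """scores[c][i][j] -> label map holding the highest-scoring class per pixel."""
    n_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(n_classes), key=lambda c: scores[c][i][j])
         for j in range(width)]
        for i in range(height)
    ]


def apply_mask(image, labels, keep_class=1, fill=0):
    """Keep pixels labelled `keep_class`; replace all other pixels with `fill`."""
    return [
        [pix if lab == keep_class else fill
         for pix, lab in zip(img_row, lab_row)]
        for img_row, lab_row in zip(image, labels)
    ]


# Hypothetical per-pixel scores for a 2x2 image.
scores = [
    [[0.9, 0.2], [0.8, 0.1]],  # class 0: background
    [[0.1, 0.8], [0.2, 0.9]],  # class 1: face
]
image = [[10, 20], [30, 40]]

labels = argmax_labels(scores)
masked = apply_mask(image, labels)
print(labels)
print(masked)
```

Under this assumption, the classifier would receive only the pixels labelled as face, with the background replaced by a constant value.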

Proposed Method
The datasets used in this study have 2168 images which are divided into two classes, autistic and non-autistic. Each class contains 1890 images. In the proposed method (shown in Figure 4), the following steps are carried out:
1. A dataset that has not been segmented is trained first.
2. Then the segmented dataset is trained.
3. The classification process can be seen in Figure 4.
4. Both results are evaluated.

Accuracy
Performance evaluation is handled with scikit-learn, using metrics based on the confusion matrix to assess the binary classification model's performance [39]. The values that contribute to the performance evaluation are as follows:
(i) False positive (FP): the data is negative, yet the model predicts positive.
(ii) False negative (FN): the data is positive, yet the model predicts negative.
(iii) True positive (TP): both the data and the model prediction are positive.
(iv) True negative (TN): both the data and the model prediction are negative.
Accuracy is the proportion of all predictions that are correct and indicates how closely the trained model matches actual system behavior. The system performance can be evaluated using the Accuracy, Recall, Precision, and F-Score metrics [40]. The four combinations of the actual value and the predicted value are used to assess the performance of a classification system with two output classes [9]. The training and validation results for ResNet-50 can be observed on the curves shown in Figure 5 and Figure 6.
Figure 5 shows the relationship between epochs and accuracy using ResNet-50: the blue curve is the accuracy at the training stage, and the red curve is the validation accuracy. Figure 6 shows the relationship between epochs and loss using the ResNet-50 architecture: the blue curve is the training loss, and the red curve is the validation loss. Based on the curves in Figure 6, the loss in both the training and validation stages is still quite large for the ResNet-50 architecture.
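The confusion-matrix metrics named in this section can be sketched in plain Python as follows. This is not the evaluation script of the study, which relies on scikit-learn (where `sklearn.metrics` offers `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` for the same quantities); the helper names and the eight toy labels are assumptions chosen so that the arithmetic is easy to follow, with 1 standing for the autistic class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, and TN for a binary classification problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-score from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_score": f_score}


# Toy labels (assumption): 1 = autistic, 0 = non-autistic.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
results = classification_metrics(*counts)
print(counts)
print(results)
```

In this toy case one autistic image is missed and one non-autistic image is flagged, so six of eight predictions are correct and all four metrics equal 0.75.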

Training And Validation using ResNet-50 with Segmentation using DeepLabV3+
The dataset segmented with DeepLabV3+ is used in the training stage. The ratio of this dataset is the same as the ratio of the previous dataset. The respective training and validation results can be observed on the curves shown in Figure 7 and Figure 8.
Figure 7 shows the relationship between epochs and accuracy using ResNet-50 with segmentation using DeepLabV3+. The blue curve is the accuracy at the training stage, and the red curve is the accuracy at the validation stage. Based on Figure 7, the accuracy in the training and validation stages continues to change as the number of epochs increases, so the number of epochs used during training is tested to find the optimum accuracy value. Figure 8 shows the relationship between epochs and loss using ResNet-50 with segmentation using DeepLabV3+. The blue curve is the loss at the training stage, and the red curve is the loss at the validation stage.

Comparison of Accuracy Results
The accuracy of the ResNet-50 models, both with and without segmentation using DeepLabV3+, has been evaluated. The confusion matrix, together with equations (1), (2), (3), and (4), is used to quantify the accuracy of each method. Without the segmentation method, ResNet-50 achieved an accuracy of 83.7% on the datasets. After segmentation using DeepLabV3+, accuracy rose to 85.9%. This improvement indicates that DeepLabV3+ can refine classification accuracy and augment the performance of existing models. These results support the effectiveness of the applied methodology and instill confidence in its capability to yield improvements in real-world applications. The findings are presented in Table 1, which shows the accuracy gained by integrating DeepLabV3+ segmentation into the classification pipeline. Such empirical evidence underscores the value of segmentation techniques for enhancing classification tasks.
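The size of the reported improvement can be read in two ways, sketched below using only the two accuracy values given in the text. The variable names are assumptions, and the derived quantities are plain arithmetic on the reported figures rather than additional results of the study.

```python
baseline_acc = 83.7   # ResNet-50 without segmentation (%), as reported
segmented_acc = 85.9  # ResNet-50 with DeepLabV3+ segmentation (%), as reported

# Absolute gain in percentage points.
gain_pp = round(segmented_acc - baseline_acc, 1)

# Relative reduction of the error rate (100% minus accuracy).
baseline_err = 100 - baseline_acc
segmented_err = 100 - segmented_acc
relative_error_reduction = round(
    100 * (baseline_err - segmented_err) / baseline_err, 1
)

print(gain_pp, relative_error_reduction)
```

Segmentation thus adds 2.2 percentage points of accuracy, which corresponds to roughly 13.5% fewer misclassified images relative to the baseline error rate of 16.3%.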

Discussion
The datasets used contain unnecessary features. This is one of the reasons the author decided to use a segmentation method to reduce these features. DeepLabV3+ was chosen because it is one of the best semantic segmentation methods and has good accuracy in segmenting images. The choice was further motivated by its ability to capture fine details within the images. Beyond reducing features, it delineates objects precisely, which improves the overall performance and interpretability of the dataset. Its semantic segmentation architecture classifies objects at the pixel level, giving a more granular and accurate representation of the data, and it takes into account the contextual relationships between different elements in the images. As a result, the segmented dataset mitigates the impact of unnecessary features while retaining the relevant visual information, making the method a robust choice for intricate image analysis tasks. The author also opted for ResNet-50 as the backbone for DeepLabV3+ due to its proven success in handling complex visual information.

Conclusion
DeepLabV3+ succeeded in increasing accuracy, which indicates that the applied method works as expected. ResNet-50 produces 83.7% accuracy on the datasets without segmentation. After the datasets were segmented using DeepLabV3+, accuracy increased to 85.9%.

Figure 5: The relationship between epoch and accuracy using ResNet-50.

Figure 6: The relationship between epoch and loss using ResNet-50.

Figure 7: The relationship between epoch and accuracy using ResNet-50 with segmentation using DeepLabV3+.

Figure 8: The relationship between epoch and loss using ResNet-50 with segmentation using DeepLabV3+.
ResNet-50's deep architecture facilitates the extraction of hierarchical features from the input images, allowing DeepLabV3+ to capture both low-level and high-level features efficiently. This choice improves the model's ability to recognize intricate patterns and enables more effective feature integration during the segmentation process. By leveraging the representations learned by ResNet-50, DeepLabV3+ can achieve higher accuracy in image segmentation, making it a well-founded selection for the task at hand. Furthermore, ResNet-50's residual learning framework mitigates the vanishing gradient problem, allowing deeper neural networks to be trained successfully. This characteristic is particularly advantageous for DeepLabV3+, where intricate segmentation tasks demand a deep architecture. The residual connections in ResNet-50 let gradients flow smoothly during backpropagation, enabling the model to learn and adapt to the intricacies of the dataset. Additionally, the robustness of ResNet-50 aids in the extraction of meaningful features that are essential for semantic segmentation. The model's ability to capture and retain detailed information across multiple scales contributes to the success of DeepLabV3+ in accurately segmenting diverse and complex images. The combination of DeepLabV3+ with ResNet-50 as its backbone thus addresses the challenges posed by unnecessary features and provides a robust and reliable solution for image segmentation tasks.

Table 1: Performance comparison.