Transfer-Learning Approach for Enhanced Brain Tumor Classification in MRI Imaging

Background: Intracranial neoplasm, often referred to as a brain tumor, is an abnormal growth or mass of tissue in the brain. The complexity of the brain and the associated diagnostic delays cause significant stress for patients. This study aims to enhance the efficiency of MRI analysis for brain tumors using deep transfer learning. Methods: We developed and evaluated the performance of five pre-trained deep learning models: ResNet50, Xception, EfficientNetV2-S, ResNet152V2


Introduction
Waiting for a medical diagnosis can be anxiety-inducing for patients, especially when it concerns intracranial neoplasms (brain tumors). Brain tumors are uncontrolled, abnormal growths of cells in the brain. They are classified into primary tumors, which originate in brain tissue, and secondary tumors, which spread from other parts of the body to the brain via the bloodstream [1]. Given the intricate nature of the brain, an enormous and complex organ that controls the nervous system and contains around 100 billion nerve cells, the uncertainty surrounding a potential brain tumor diagnosis intensifies the anxiety experienced by patients awaiting medical assessments [2]. Patients are concerned about the impact of brain tumors on their cognitive functions, treatment options, and overall quality of life, which amplifies the emotional strain they face.
Among brain tumors, glioma and meningioma stand out as lethal primary tumor types, with glioma ranking as the most prevalent brain tumor in humans [3]. The World Health Organization (WHO) classifies brain tumors into four grades: grades 1 and 2 represent less severe tumors such as meningioma, and grades 3 and 4 indicate more serious types such as glioma. In clinical practice, meningioma, pituitary, and glioma tumors account for approximately 15%, 15%, and 45% of cases, respectively [4]. Understanding the differences between these tumor types and their grades is important for accurate diagnosis and effective treatment.
The median wait time to see a specialist across hospital providers in the US state of Vermont is 41 days, but it varies significantly by location. The wait time for radiologists in Vermont ranges from 7 to 112 days. Additionally, the average wait time for a primary physician in the United States overall is 20.6 days [5]. The challenges do not end there: after completing the MRI scan, patients often must wait several weeks or months for their next appointment to receive their MRI results [6]. This prolonged wait for a diagnosis causes stress and anxiety for many patients. Integrating artificial intelligence into the medical diagnosis process can reduce the wait time for a brain tumor diagnosis [7].
Convolutional neural networks (CNNs) are popular deep learning models designed for image classification tasks [8], which makes them particularly adept at detecting intricate patterns within medical images, such as those obtained from MRI scans [9]. Many CNN models are already trained and available for image classification purposes; such models are known as pre-trained models. Training models from scratch can be time-consuming and computationally intensive, often requiring significant resources such as GPUs. Using pre-trained models mitigates this cost by significantly reducing the time and resources needed for training. Transfer learning, a technique that reuses knowledge from models pre-trained on large datasets, enhances the efficiency and effectiveness of the classification process by fine-tuning these models to adapt to new tasks or datasets [10]. This approach allows the model to build upon previously learned features and patterns, accelerating the learning process and improving performance on new tasks.
This research presents a comparative analysis of five widely used pre-trained deep learning models (ResNet50, Xception, EfficientNetV2-S, ResNet152V2, and VGG16) on the task of brain tumor classification using MRI images. While extensive research exists on individual models for MRI-based brain tumor classification, our study stands out by providing a direct comparison of these five specific models on a single dataset. The key contributions of this study include: (1) a systematic evaluation of these models' performance on a publicly available MRI dataset using metrics such as accuracy, F1 score, and precision; and (2) the identification of the most effective pre-trained model for brain tumor classification. We hope that this study provides a robust framework for future research in the medical diagnostics field.
The rest of the paper is structured as follows: Section 2 describes the methodology, including the dataset and image augmentation, pre-trained model descriptions, model architecture, and model fine-tuning. Section 3 presents the results of the performance analysis of individual models and a comparison of model performance. The paper concludes with Section 4, which discusses the conclusions and future work.

Dataset and Image Augmentation
The Brain Tumor MRI dataset used in this research is a publicly available dataset containing a total of 7023 MRI images: 5712 training images and 1311 testing images [11]. The data are grouped into four distinct categories: pituitary, meningioma, glioma, and no tumor. Specifically, the testing subset comprises 300 pituitary images, 306 meningioma images, 300 glioma images, and 405 no-tumor images, while the training subset includes 1457 pituitary images, 1339 meningioma images, 1321 glioma images, and 1595 no-tumor images.
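As a quick consistency check on the counts above, the following sketch tallies the reported per-class image counts and prints each class's share of the training subset. The dictionary keys and the helper name `class_shares` are illustrative choices, not names from the dataset itself; only the numbers come from the text.

```python
# Per-class image counts as reported in the text for the Brain Tumor MRI dataset.
train_counts = {"pituitary": 1457, "meningioma": 1339, "glioma": 1321, "no_tumor": 1595}
test_counts = {"pituitary": 300, "meningioma": 306, "glioma": 300, "no_tumor": 405}


def class_shares(counts):
    """Return each class's fraction of the subset total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


train_total = sum(train_counts.values())
test_total = sum(test_counts.values())
print("train:", train_total, "test:", test_total, "all:", train_total + test_total)

# Show how evenly the four classes are represented in the training subset.
for label, share in class_shares(train_counts).items():
    print(f"{label}: {share:.1%} of training images")
```

The totals agree with the 5712 training and 1311 testing images stated above, and each class makes up between roughly 23% and 28% of the training subset, so the classes are reasonably balanced.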
The MRI images were split into training and validation sets with a split ratio of 50%, stratified on class labels to maintain class balance. Before being used for training, the images were preprocessed and augmented so that the model could handle a diverse range of images, such as those with low brightness and various orientations [12]. Table 1 shows the values set for the augmentation parameters. Pixel values were rescaled by a factor of 1/255, and the brightness range was set to 0.8 to 1. Rotation, zoom, shift, shear, and flip were adjusted as listed in the table. These parameter settings increase the diversity of the training data and improve the model's ability to generalize to a wide range of MRI images [13]. Figure 1 provides a visualization of the brain tumor image dataset.

ResNet50
Residual Network 50 (ResNet50) is a pre-trained convolutional neural network architecture consisting of 50 layers, developed by Microsoft Research in 2015. As neural networks grow deeper, the gradients in the earlier layers diminish significantly, a problem known as the vanishing gradient problem. ResNet50 tackles this challenge by introducing residual connections, which enable the network to learn residual functions [14]. This architecture was chosen for its deep structure and ability to handle complex image recognition tasks [15].
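To make the idea of a residual connection concrete, the sketch below implements a toy residual block on plain Python lists. It is a deliberate simplification: the real ResNet50 blocks use convolutions and batch normalization, whereas the functions `linear`, `relu`, and `residual_block` here are hypothetical stand-ins that only show the `F(x) + x` structure.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def linear(v, weights):
    """Multiply vector v by a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F is linear -> ReLU -> linear."""
    fx = linear(relu(linear(x, w1)), w2)
    # The skip connection adds the block input back onto the learned residual F(x).
    return relu([f + xi for f, xi in zip(fx, x)])


# With all-zero weights, F(x) is zero and the (non-negative) input passes through unchanged.
zeros = [[0.0, 0.0], [0.0, 0.0]]
passthrough = residual_block([1.0, 2.0], zeros, zeros)
print(passthrough)
```

Because the input is added back, the block only has to learn the difference from the identity mapping, and the gradient always has a direct path through the skip connection.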

Xception
Extreme Inception (Xception) is an extension of the Inception architecture that uses depthwise separable convolutions to reduce the number of parameters while maintaining high performance. This model performs well in complex image recognition tasks due to its unique architecture, which effectively captures intricate patterns [16,17].
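The parameter saving from depthwise separable convolutions can be illustrated with simple arithmetic. The sketch below compares weight counts for a single layer, ignoring biases; the layer size (3x3 kernel, 128 input channels, 256 output channels) is an arbitrary example and is not taken from the Xception architecture.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard convolution: one k x k x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depthwise separable convolution (biases ignored)."""
    depthwise = kernel * kernel * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out             # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


std = standard_conv_params(3, 128, 256)
sep = separable_conv_params(3, 128, 256)
print(std, sep, round(std / sep, 2))
```

For this example layer, the separable form needs 33,920 weights instead of 294,912, a reduction of roughly 8.7 times.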

EfficientNetV2-S
Efficient Network Version 2 Small (EfficientNetV2-S) is optimized for resource-constrained environments without compromising performance. This variant, denoted as "S", strikes a balance between model size and computational efficiency [18]. It was selected because the study needed a model that could deliver accuracy while remaining computationally efficient.

ResNet152V2
Residual Network 152 Version 2 (ResNet152V2) is an improved version of ResNet with 152 layers. It retains the skip connections from the original ResNet architecture, making it adept at training deep networks. This model's robustness and accuracy in handling complex image classification tasks make it a reliable choice for applications requiring high-level feature extraction [19].

VGG16
Visual Geometry Group 16 (VGG16) is a 16-weight-layer deep learning model developed by the Visual Geometry Group. Known for its simplicity and effectiveness, VGG16 consists of 13 convolutional layers followed by 3 fully connected layers. Its ability to precisely capture complex characteristics in images makes it valuable for a variety of image classification applications [20].

Model Architecture and Fine-Tuning
The sequential model configuration is a common structure that remains the same across all five model variants. This standardized setup includes Flatten layers for reshaping data, Dropout layers for regularization to prevent overfitting [21], and Dense layers with activation functions for feature transformation and classification.
The Dense layer with 256 units is regularized using L2 kernel regularization with a coefficient of 0.015, L1 activity regularization with a coefficient of 0.005, and L1 bias regularization with a coefficient of 0.005 [22]. The activation function used for this Dense layer is ReLU [23]. Additionally, BatchNormalization layers are applied to stabilize training by normalizing the input to each layer. The batch normalization layer is configured with a momentum of 0.99 and an epsilon value of 0.001 [24][25][26]. The final Dense layer, or output layer, consists of 4 units with a Softmax activation function for multi-class classification, providing the probability distribution of the input belonging to each of the four classes: pituitary, meningioma, glioma, and no tumor [27].
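As a rough illustration of how the three regularization terms on the 256-unit Dense layer could add to the loss, the sketch below computes the penalty for a single example. The function name `dense_penalty` and the toy weights are hypothetical; only the coefficients (0.015, 0.005, 0.005) come from the text. It assumes the usual definitions, with L2 as the coefficient times the sum of squares and L1 as the coefficient times the sum of absolute values; a framework may additionally average the activity term over the batch.

```python
def dense_penalty(kernel, bias, activations,
                  l2_kernel=0.015, l1_activity=0.005, l1_bias=0.005):
    """Regularization penalty for one Dense layer and one example (assumed definitions)."""
    kernel_term = l2_kernel * sum(w * w for row in kernel for w in row)   # L2 on weights
    activity_term = l1_activity * sum(abs(a) for a in activations)        # L1 on layer output
    bias_term = l1_bias * sum(abs(b) for b in bias)                       # L1 on biases
    return kernel_term + activity_term + bias_term


# Toy 2x2 layer with made-up values, only to show the arithmetic.
kernel = [[1.0, -2.0], [0.5, 0.0]]
bias = [0.2, -0.4]
activations = [3.0, 0.0]
penalty = dense_penalty(kernel, bias, activations)
print(round(penalty, 5))
```

The penalty is added to the classification loss, so larger weights, biases, or activations make the total loss larger and are discouraged during training.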
In the model compilation step, the Adam optimizer (Adaptive Moment Estimation) with a learning rate of 0.0001 is used to minimize the loss function during neural network training [28]. Categorical cross-entropy, a suitable loss function for multi-class classification tasks, measures the error between the actual and predicted labels [29]. Recall, precision, accuracy, and F1 score are tracked as metrics to assess the model's classification performance comprehensively.
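The softmax output and the categorical cross-entropy loss described above can be sketched in a few lines of plain Python. The logits below are made up, and the class order in `CLASSES` is an assumption for illustration, since the text does not state how class indices map to labels.

```python
import math

# Assumed class order, for illustration only.
CLASSES = ["pituitary", "meningioma", "glioma", "no tumor"]


def softmax(logits):
    """Turn raw scores into a probability distribution (shifted for numerical stability)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-7):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


probs = softmax([2.0, 0.5, 0.1, -1.0])          # hypothetical network outputs
loss_correct = categorical_cross_entropy([1, 0, 0, 0], probs)
loss_wrong = categorical_cross_entropy([0, 0, 0, 1], probs)
print(CLASSES[probs.index(max(probs))], loss_correct < loss_wrong)
```

The loss is small when the true class receives high probability and grows as the predicted probability of the true class falls, which is what the optimizer minimizes.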
The model is fine-tuned using the fit method for 10 epochs with the training dataset. To prevent overfitting, the EarlyStopping callback with a patience of 2 is utilized to monitor validation loss and save the best weights throughout the training process [30]. The model's performance is evaluated on the training, validation, and testing datasets, where metrics such as loss, accuracy, and F1 score are computed. The confusion matrix is used to visualize the model's classification performance, providing a detailed breakdown of the predicted versus actual labels, including true positives, true negatives, false positives, and false negatives. Table 2 illustrates the model architecture layers.
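The early-stopping rule with a patience of 2 can be sketched as follows. This is a plain-Python imitation of the logic, not the callback itself; the function name `early_stopping_epoch` and the validation losses are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch (its weights would be kept) and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses for a run of at most 10 epochs.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)
```

In this made-up run, training stops at epoch index 4 because the loss failed to improve for two consecutive epochs, and the weights from epoch index 2 would be the ones kept. In Keras, keeping those weights corresponds to the `restore_best_weights` option of `EarlyStopping`; whether the authors set it is not stated in the text.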

Performance Analysis of Individual Models
Table 3 summarizes the training, validation, and testing metrics for all five models that we evaluated.

ResNet50
Another crucial part of the evaluation process is the analysis of the model's classification outcomes over the different tumor classes. For ResNet50, the per-class F1 scores ranged from 0.82 to 0.92, precision from 0.81 to 0.94, and recall from 0.74 to 0.99. Table 3 shows the model performance metrics, Table 4 shows the classification metrics, and Figure 2 shows the confusion matrix for ResNet50.

Xception
For Xception, the minimum F1 score across all the classes is 0.97, and the maximum is 0.99. Recall reached the maximum possible value of 1 for class 2, with an overall average recall of 0.98 and an average precision of 0.98. Upon validation, the accuracy reached 99.39% with an F1 score of 0.9933 and a validation loss of 0.5079; these metrics are very close to the training metrics, which suggests that the model generalized well to the validation data. Upon testing, the model had a test accuracy of 98.17%, which is the highest test accuracy across the five pre-trained models. It also had the highest test F1 score, 0.9817, with a test loss of 0.5265. The model's classification outcomes over the different tumor types averaged 0.98 across precision, recall, and F1 score. Table 3 shows the model performance metrics, Table 5 shows the classification metrics, and Figure 3 shows the confusion matrix for Xception.

EfficientNetV2-S
For EfficientNetV2-S, the per-class precision ranged from 0.94 to 1.00 and recall from 0.91 to 0.99, giving F1 scores from 0.94 to 1.00 across the tumor classes. During training, the model achieved an accuracy of 96.69% and an F1 score of 0.9673, with a training loss of 3.0644. In the validation phase, the accuracy was 95.27% and the F1 score was 0.9586, with a validation loss of 3.0861. Upon testing, the accuracy was 96.19% and the F1 score was 0.9629, with a test loss of 3.0673. The loss was approximately the same across the training, validation, and testing phases. Overall, the model achieved the second-highest test F1 score of the five models. Table 3 shows the model performance metrics, Table 6 shows the classification metrics, and Figure 4 shows the confusion matrix for EfficientNetV2-S.

ResNet152V2
The ResNet152V2 model showed per-class precision ranging from 0.54 to 1.00, recall from 0.20 to 0.97, and F1 scores from 0.33 to 0.96. The training accuracy was 88.45%, with a training F1 score of 0.8747 and a training loss of 2.9106. The model had a validation accuracy of 88.09%, a validation F1 score of 0.8707, and a validation loss of 2.9141, which is very similar to the training and test losses. On the test set, the F1 score was 0.7998, the loss was 2.9849, and the accuracy was 78.51%. The drop of nearly ten percentage points from validation accuracy to test accuracy indicates that the model generalized poorly to the test set. Table 3 shows the model performance metrics, Table 7 shows the classification metrics, and Figure 5 shows the confusion matrix for ResNet152V2.

VGG16
The VGG16 model's performance varied across classes, with precision ranging from 0.51 to 0.98, recall from 0.50 to 0.97, and F1 scores from 0.65 to 0.92. The overall accuracy was 0.77, indicating moderate predictive capability. During training, the model had a loss of 0.9147, an accuracy of 87.96%, and an F1 score of 0.8850. On the validation set, the model performed similarly, with a loss of 0.9948, an accuracy of 86.11%, and an F1 score of 0.8654. In the testing phase, the loss was 1.1323, the accuracy was 76.83%, and the F1 score was 0.7756. This is the lowest test F1 score of the five models, so VGG16 generalized poorly to unseen images. Table 3 shows the model performance metrics, Table 8 shows the classification metrics, and Figure 6 shows the confusion matrix for VGG16.
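The per-class precision, recall, and F1 ranges reported in the model analyses above are the kind of values that can be derived from a confusion matrix. The sketch below shows that derivation for a four-class problem; the matrix is hypothetical and does not reproduce any of the paper's figures, and the function name `per_class_metrics` is an illustrative choice.

```python
def per_class_metrics(matrix):
    """Return a (precision, recall, f1) tuple per class.

    matrix[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(matrix)
    results = []
    for k in range(n):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                          # true class k, predicted otherwise
        fp = sum(matrix[i][k] for i in range(n)) - tp     # predicted k, true class differs
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


# Hypothetical confusion matrix with 10 samples per class.
matrix = [
    [9, 1, 0, 0],
    [2, 8, 0, 0],
    [0, 0, 10, 0],
    [0, 0, 0, 10],
]
metrics = per_class_metrics(matrix)
accuracy = sum(matrix[k][k] for k in range(4)) / sum(map(sum, matrix))
for k, (p, r, f) in enumerate(metrics):
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.3f}")
```

A class with low recall, such as the 0.20 minimum reported for ResNet152V2, corresponds to a row of the confusion matrix in which most samples fall outside the diagonal.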

Comparison of Model Performances
The evaluation and comparison of the pre-trained models (ResNet50, Xception, EfficientNetV2-S, ResNet152V2, and VGG16) in classifying brain tumors into glioma, meningioma, pituitary, and no tumor reveal distinct performance characteristics that matter for the precision and efficiency of magnetic resonance imaging (MRI) analysis in medical diagnostics. These models can help address the need for accurate and expedited brain tumor diagnosis. Table 3 shows the performance of all five models; the values highlighted in green (Xception) belong to the model that performed best.
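The comparison can be reproduced from the test F1 scores quoted in the text. The sketch below simply sorts those five values; the variable names are illustrative, and no numbers beyond the reported scores are used.

```python
# Test F1 scores as reported in the text for each pre-trained model.
test_f1 = {
    "ResNet50": 0.7963,
    "Xception": 0.9817,
    "EfficientNetV2-S": 0.9629,
    "ResNet152V2": 0.7998,
    "VGG16": 0.7756,
}

# Rank models from highest to lowest test F1 score.
ranking = sorted(test_f1.items(), key=lambda item: item[1], reverse=True)
best_model, best_score = ranking[0]

for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {score:.4f}")
```

Sorting by a single metric is a simplification; a fuller comparison would also weigh test loss, accuracy, and per-class recall.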

Conclusions and Future Work
This paper presented a comparative analysis of various pre-trained models for brain tumor classification using MRI images. Xception stood out with a test F1 score of 0.9817. EfficientNetV2-S showed the second-highest test F1 score of 0.9629. ResNet152V2 achieved a test F1 score of 0.7998, followed by ResNet50 with a test F1 score of 0.7963. VGG16 had a test F1 score of 0.7756. These results highlight Xception's superior F1 score in brain tumor classification, making it a highly effective model for MRI analysis.
We envision a software application using the Xception model that clinicians might use to classify images of a patient's brain after an MRI scan is taken to give an immediate (but tentative) diagnosis. While a specialist should still evaluate such images for a conclusive diagnosis, integrating a well-developed model could potentially speed up the process and reduce the wait time by weeks. This could give a patient some initial indication of whether the scans suggest evidence of a tumor, rather than having to wait an indeterminate amount of time to learn anything at all.
For future work, we will explore the integration of additional pre-trained models and the application of ensemble learning techniques to further improve classification accuracy.

Figure 2. ResNet50 Model: Confusion Matrix.

During the training phase, the Xception model achieved a training accuracy of 99.49%, with an F1 score of 0.9949 and a training loss of 0.5046.

Table 3. Training, Validation, and Testing Metrics for all models (best model in green).

Training F1 score of 0.9347, corresponding to a training loss of 3.4093. Nevertheless, the validation accuracy was 91.60%, accompanied by a validation loss of 3.4513 and an F1 score of 0.9074. In the testing phase, the ResNet50 test accuracy reached 87.96%, and the test F1 score reached 0.7963, with a test loss of 3.6171.