Enhancing Weather Scene Identification Using Vision Transformer

Abstract: Accurate weather scene recognition is critical in a world where weather affects every aspect of daily life, particularly in areas such as intelligent transportation networks, autonomous vehicles, and outdoor vision systems. Manual identification techniques are outdated, unreliable, and time-consuming, while real-time local weather scene recognition demands high accuracy. This work leverages computer vision to address these challenges. Specifically, we employ a fine-tuned Vision Transformer model to distinguish between 11 different weather scenarios. The developed model achieves a remarkable accuracy of 93.54%, surpassing widely used baselines such as MobileNetV2 and VGG16. These findings extend computer vision techniques into new domains and pave the way for reliable weather scene recognition systems with extensive real-world applications across various industries.


Introduction
Weather identification is the task of recognizing and forecasting weather patterns using advanced technologies such as computer vision and machine learning. This capability matters to our daily lives, since it enables informed decisions about outdoor activities, clothing choices, and travel. Accurate weather recognition also improves our capacity to anticipate and respond to emergencies by providing early warning of extreme weather events such as hurricanes, tornadoes, and floods. Furthermore, it is essential in guiding decisions in industries such as energy, transportation, and agriculture, all of which are greatly affected by atmospheric conditions.
Individuals rely on weather information for a given time period because it affects their everyday activities and habits. People often make judgments and schedule their activities based on the prevailing weather; this can include deciding to go for a bike ride, booking a trip, or organizing a vacation. Weather is also a factor when planning company operations, transit systems, sporting events, and sightseeing trips, so it is critical to consider the conditions at the location where activities are held.
Weather is specific to a given location and is often measured via human observations or sensors. However, the high cost of camera sensors can burden local budgets. As technology advances, artificial intelligence (AI) is expected to play an increasingly important role in embedded systems, enabling more accurate weather analysis while lowering hardware costs. AI is growing more prominent in people's lives, making numerous tasks easier, and many prominent organizations are incorporating AI into their technology and continuing to invest in its development. Deep learning, a branch of AI, uses architectures with hidden layers to automatically extract information from images, making it an effective tool for weather recognition.
Over the last couple of decades, the research community and industry have actively invested in autonomous vehicle technology, aiming to revolutionize the transportation sector with techniques that make vehicles affordable, safe, efficient, and convenient. Among the many challenges faced by autonomous vehicles is accurately perceiving the environment in terms of weather conditions (such as rain, fog, and snow), which can affect the associated sensors and compromise safety. Effective prediction of weather conditions from visually captured camera data is therefore paramount to making autonomous vehicles safer.
Considerable research effort has been dedicated to weather image categorization employing deep learning architectures. Within these investigations, a variety of approaches and methodologies have been utilized to attain high classification precision. As an illustration, Elhoseiny et al. [1] leveraged attributes extracted from the fully connected layers of the AlexNet framework to categorize weather images into two distinct classes. By applying the Soft-Max function to classify the attributes derived from the final layer, they achieved a commendable accuracy of 91.1%.
Guerra et al. [2] conducted a thorough investigation into the classification of weather images, covering three types of meteorological data. A hybrid approach combining augmentation techniques and superpixel technology was utilized to consistently improve pixel distribution across all images. A Convolutional Neural Network (CNN) model was trained over several iterations before the dataset was classified using the Support Vector Machine (SVM) method. Notably, the ResNet-50 model achieved a total accuracy of 80.7%, emerging as the top performer.
The significance of weather recognition is pronounced in various practical scenarios, particularly within systems aiding self-driving technology. This is evidenced by its ability to enhance road safety through measures such as adjusting vehicle speed and modulating lighting based on real-time weather information. As a result, a subset of research has been dedicated to weather recognition via in-car cameras.
An approach known as template matching was proposed in studies [3,4] for the identification of raindrops on windshields, as they serve as robust indicators of rainy conditions. In [3], a framework was devised to establish three distinct global features that discriminate between overcast, sunny, and wet weather.
Roser and Moosmann [5] introduced a technique wherein the entire image was divided into thirteen equal sections, with diverse histogram data extracted from each region individually to facilitate rain detection. Beyond rain, investigations have also extended to fog and haze: the application of Koschmieder's Law [6] was demonstrated in [7] to compute visibility in foggy scenarios.
In studies [8,9], power spectra computation followed by Gabor filters was employed to extract features for fog recognition. Bronte et al. [10] proposed an edge-oriented technique employing Sobel filters to identify edges and consequently assess the presence of fog. Gallen et al. [11] introduced a method that used the backscattered radiance pattern of light to detect fog during nighttime conditions. Furthermore, a cluster of investigations in weather recognition focuses on common outdoor photographs [12] for estimating prevailing weather conditions, achieved by employing illumination calculations on multiple images captured at a specific location. In the pursuit of identifying weather conditions, a range of global features was explored in [13], encompassing power spectral slope, edge gradient energy, inflection point particulars, contrast, and saturation.
To enhance weather category classification, Li et al. [14] devised an approach that combined Support Vector Machines (SVMs) and decision mechanisms with an array of global characteristics. In a departure from previous methodologies, Lu et al. [15] proposed a solution to the two-class weather classification problem, employing a diverse range of local cues such as the sky, shadows, and reflections.
Efforts to tackle multiclass weather classification were undertaken by Zhang et al. in studies [16,17], where a blend of global and local features was harnessed. Addressing the two-class weather recognition problem, these researchers combined hand-crafted features with features extracted from Convolutional Neural Networks (CNNs), yielding notably improved outcomes.
Presently, computers can analyze satellite imagery to ascertain prevailing weather conditions and formulate forecasts. While this information is readily accessible through the internet, weather varies substantially across geographical locations. Within industries such as transportation, the real-time classification of weather conditions holds particular significance. This is exemplified in applications such as self-driving vehicles, where weather images inform decisions such as activating windshield wipers during rain. Nonetheless, classifying weather images is challenging due to inherent similarities between distinct weather phenomena, such as mist and snow, or cloudiness and rainfall.
Image classification, a technology enabling computers to discern weather patterns from real-time images, holds immense potential and serves as a foundational tool for the development of Advanced Driver Assistance Systems (ADASs) and autonomous machines. To categorize weather into four classes (cloudy, wet, snowy, or clear), the study [18] employed the AlexNet and GoogleNet architectures. GoogleNet demonstrated an accuracy of 92.0%, while AlexNet achieved 91.1%. However, pertinent details about the distribution of training and test data, as well as dataset acquisition methods, were omitted.
Meanwhile, Xia et al. [19] classified weather images into four categories (foggy, rainy, snowy, and sunny) and achieved an accuracy of 86.47% with AlexNet; the computation duration for each design was not reported.
In a different study [20], CNN and transfer learning were harnessed for binary classification of weather images into "With Rain" (WR) and "No Rain" (NR) using the VGG16 architecture, achieving an accuracy of 85.28%.
Transfer learning, a pivotal technique in machine learning, addresses the inherent challenge of limited training data [21]. Built on the assumption of unbiased and evenly distributed training and testing data, this approach facilitates the transfer of knowledge from a source domain to a target domain. Within computer vision, transfer learning expedites learning and enhances performance: pre-trained models are typically trained on diverse image datasets and subsequently retrained on specific target datasets. Notably, past research has not explored multiclass weather image identification employing diverse CNN architectures via transfer learning.
The study proposed by Chu et al. [22] classifies weather images across six distinct categories: cloudy, foggy, rainy, sunny, snowy, and sunrise. To expedite model development with enhanced performance, transfer learning is employed, with the ImageNet dataset serving as its foundation. For rigorous comparison with future research, the dataset is curated from publicly available sources such as Kaggle and the Camera as Weather Sensor (CWS) dataset [22]. Performance is assessed using accuracy, precision, recall, and the F1 score.
Highlighting the relevance of weather recognition across diverse sectors, including autonomous driving and agriculture, Młodzianowski [23] proposes an image-centric weather detection system that harnesses transfer learning to accurately classify weather conditions even with a limited dataset. The study introduces three weather recognition models grounded in the ResNet50, MobileNetV2, and InceptionV3 architectures and compares their efficiencies.
From the literature, we conclude that most studies are based on traditional CNN architectures and a limited number of classes; therefore, there is a need for an effective model that can identify a larger number of weather scenes. The contributions of this study are listed below:

1. Evaluation of an Enhanced Vision Transformer (ViT) Model: The study introduces and assesses a fine-tuned Vision Transformer (ViT) model for weather scene recognition. The evaluation showcases the model's superiority when compared against two conventional pre-trained CNN-based models (VGG16, MobileNetV2), shedding light on the enhanced capabilities of the ViT model in accurately discerning weather patterns.

2. Global Feature Extraction: We introduce a patch-wise self-attention module and a global feature extraction technique, both of which constitute significant contributions of this research.

3. CNN-Based Pre-Trained Models: We additionally conducted fine-tuning of pre-trained CNN-based models, namely VGG16 and MobileNetV2, for comparison with the proposed ViT model.

4. Advancement to Multiclass Weather Scene Classification: Going beyond binary classification, this study concentrates on multiclass weather scene classification, acknowledging the intricate and diverse nature of real-world weather scenarios.

5. In-Depth Exploration of Transfer Learning: The study examines the efficacy of transfer-learning-based pre-trained models, with a particular focus on the fine-tuned ViT model. Through thorough investigation and comparison, the research provides valuable insights into the potential advantages of employing transfer learning for weather recognition tasks.

Materials and Methods
In this section, the dataset description, preprocessing, and methodology are presented in detail. Preprocessing ensures that the dataset is optimally prepared for the subsequent stages.

Dataset Description
The selected dataset comprises 11 distinct classes, each representing a different weather condition [24]. The number of images varies by class. For a visual depiction of the dataset, see Figure 1.

Vision Transformer (ViT) Architecture
ViT was developed as a deep neural network (DNN) architecture for image recognition in 2020 [25]. The transformer architecture was initially designed with natural language processing in mind, and ViT introduced the innovative idea that images can be viewed as sequences of patches, or tokens. ViT uses the capabilities built into the transformer design to skillfully process these token sequences. Notably, the transformer architecture, which serves as the foundation for ViT, has showcased its adaptability and effectiveness across a diverse array of tasks, including image restoration and object detection [26], underscoring its broad applicability and performance capabilities [27]. Thanks to the combined effects of tokenization and embedding, ViT can extract a full perspective from the input image that includes both local and global features.
ViT introduces predefined positional embeddings: additional vectors that encode the positions of tokens within the sequence before they are processed by the transformer layers. This integration allows the model to determine the relative placement of tokens and extract important spatial information from the input image.

The ViT architecture is based on the Multi-head Self-Attention (MSA) mechanism. MSA empowers the model to simultaneously focus on various regions within the image. The mechanism comprises distinct "heads", each with the capacity to independently compute attention, and each head can concentrate on different image segments. The resulting representations eventually join together to form the complete representation of the image. This simultaneous attention to multiple parts grants ViT the capability to capture intricate relationships among input elements. However, this enhancement increases the model's complexity and computational requirements due to the larger number of attention heads and the additional processing required to aggregate their outputs. MSA can be formulated as

MSA(X) = Concat(head_1, ..., head_h) W_O, where head_i = Attention(X W_i^Q, X W_i^K, X W_i^V). (1)

The self-attention mechanism plays a fundamental role in transformer architectures, modeling interactions and associations across sequences in predictive tasks. In contrast to Convolutional Neural Networks, the self-attention layer consolidates insights and characteristics from the complete input sequence; by combining global information with local information, it yields a more accurate representation. The attention mechanism operates by computing the scalar product between the query and key vectors, normalizing the resulting attention scores with the SoftMax function, and then weighting the value vectors to produce an improved output. A comprehensive investigation by Cordonnier et al. [28] delved into the interplay between convolution-layer operations and the self-attention mechanism. Their research revealed that self-attention emerges as an exceptionally versatile mechanism capable of capturing both local and global characteristics, highlighting the adaptability and flexibility that set it apart from conventional convolutional techniques.
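The multi-head computation of Equation (1) can be sketched in a few lines of NumPy. This is an illustrative shape-check only: the weight matrices are random stand-ins, not trained parameters, and the head/dimension sizes are the ViT-Base values assumed elsewhere in this paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Multi-head self-attention over a token sequence X of shape (n, d)."""
    n, d = X.shape
    dh = d // num_heads  # per-head dimension d_k
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # each (n, d)
    # split into heads: (num_heads, n, dh)
    Q = Q.reshape(n, num_heads, dh).transpose(1, 0, 2)
    K = K.reshape(n, num_heads, dh).transpose(1, 0, 2)
    V = V.reshape(n, num_heads, dh).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(dh)   # (h, n, n) attention scores
    A = softmax(scores, axis=-1)                      # rows sum to 1
    heads = A @ V                                     # (h, n, dh)
    concat = heads.transpose(1, 0, 2).reshape(n, d)   # Concat(head_1, ..., head_h)
    return concat @ Wo                                # output projection W_O

rng = np.random.default_rng(0)
n, d, h = 197, 768, 12            # 196 patches + [CLS] token, ViT-Base sizes
X = rng.standard_normal((n, d)) * 0.02
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) * 0.02 for _ in range(4))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, h)
print(out.shape)  # (197, 768)
```

The sequence length and embedding dimension are preserved end to end, which is what lets ViT stack identical encoder layers.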
For a visual representation of the ViT network at an abstract level, please refer to Figure 2, which illustrates the major components of an effective ViT model.
Patch Embedding: In the ViT framework, the input image is partitioned into fixed-size, non-overlapping patches. Each patch undergoes a linear transformation, facilitated by a learned linear transformation matrix, converting the 2D spatial information within the image into a sequential arrangement of embeddings [29]:

E_patch = X · W_patch. (2)

In Equation (2), E_patch, X, and W_patch represent the patch embeddings, the flattened image patches, and the learned linear transformation matrix, respectively.
Positional Embedding: Since the transformer architecture lacks an inherent understanding of the spatial arrangement of these patches, positional information must be infused. This is achieved by adding positional embeddings to the patch embeddings; these positional embeddings furnish crucial details about the spatial position of each patch within the original image [30].
E_input(i, j) = E_patch(i, j) + E_pos(i, j). (3)

In Equation (3), E_pos represents the positional embeddings, and i and j represent the spatial coordinates of the patch within the image.
Transformer Encoder: The positionally augmented embeddings then traverse a sequence of transformer encoder (TE) layers. Each layer comprises two components: a self-attention mechanism and a feedforward neural network [31]. The self-attention mechanism empowers each patch to attend to all other patches, effectively capturing intricate relationships across the whole image [32]. The feedforward network then conducts additional processing to refine these attended representations. This encoding process generates a collection of contextualized embeddings for each patch, adeptly encapsulating a rich blend of both localized and global information inherent in the image.
A = SoftMax( Q K^T / √(d_k) ) V. (4)

In Equation (4), A, Q, K, V, and d_k represent the attention output, query matrix, key matrix, value matrix, and dimension of the key vectors, respectively.
Classification Head: The final contextualized embeddings from the transformer encoder serve as the foundation for downstream tasks, including image classification. For classification, various strategies can be employed to process these contextualized embeddings. A common approach takes either a dedicated classification token or the average of the embeddings; fully connected layers are then applied to generate class predictions from this processed information.
P_Class = SoftMax( W_Class · AveragePooling(E_Contextualized) ). (5)

In Equation (5), P_Class, W_Class, and AveragePooling(E_Contextualized) represent the class predictions, the weights of the classification layer, and the average pooling of the contextualized embeddings, respectively.
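A minimal sketch of the average-pooling variant of Equation (5), assuming illustrative sizes (196 patch embeddings of dimension 768, 11 weather classes) and a random stand-in for W_Class rather than trained weights:

```python
import numpy as np

def classification_head(E_contextualized, W_class):
    """Average-pool the contextualized patch embeddings, then apply a
    linear classification layer followed by SoftMax."""
    pooled = E_contextualized.mean(axis=0)   # AveragePooling over patches -> (d,)
    logits = pooled @ W_class                # linear layer -> (num_classes,)
    e = np.exp(logits - logits.max())        # stable SoftMax
    return e / e.sum()                       # class probabilities P_Class

rng = np.random.default_rng(1)
E = rng.standard_normal((196, 768))          # contextualized embeddings from the encoder
W = rng.standard_normal((768, 11)) * 0.02    # hypothetical weights for 11 weather classes
p = classification_head(E, W)
print(p.shape)  # (11,); entries sum to 1
```

The predicted class is then simply `p.argmax()`.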

Hyperparameters for ViT Pre-Trained Model
In this research, the input images were preprocessed and resized to 224 × 224 pixels. The resized images were then partitioned into non-overlapping patches, each measuring 16 × 16 pixels.
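The patch partition described above can be verified with a short NumPy sketch: a 224 × 224 × 3 image yields a 14 × 14 grid of 16 × 16 patches, i.e., 196 tokens of 16 · 16 · 3 = 768 values each.

```python
import numpy as np

def image_to_patches(img, patch=16):
    """Split an H x W x C image into non-overlapping patch x patch pieces,
    each flattened into a vector of length patch*patch*C."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    gh, gw = H // patch, W // patch            # patches per side
    patches = (img.reshape(gh, patch, gw, patch, C)
                  .transpose(0, 2, 1, 3, 4)    # group by patch position
                  .reshape(gh * gw, patch * patch * C))
    return patches

img = np.zeros((224, 224, 3))
p = image_to_patches(img)
print(p.shape)  # (196, 768)
```

Each flattened patch is exactly the token that the learned projection of Equation (2) maps into the embedding space.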
The model employed in this research was pre-trained on the large-scale ImageNet-21k dataset [33,34]. This extensive dataset encompasses approximately 14 million images categorized into 21,841 distinct classes, tailored explicitly for large-scale image classification tasks. The architecture comprises 12 transformer layers, each with a hidden size of 768. The model's substantial capacity is reflected in its 85.8 million trainable parameters, which greatly contribute to its learning capabilities. The specific parameter values and configurations utilized in the ViT model are given in Table 1.
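The 85.8 million figure can be reproduced from the configuration above. The sketch below tallies the parameters of the standard ViT-Base/16 backbone (patch embedding, [CLS] token, positional embeddings, 12 encoder layers with a 4× MLP, and a final LayerNorm, classification head excluded); the layer-internal breakdown is the conventional ViT-Base layout, assumed here rather than stated in the text.

```python
def vit_base_param_count(d=768, layers=12, patch=16, img=224, channels=3, mlp_ratio=4):
    """Parameter count for a ViT-Base/16 backbone (classification head excluded)."""
    n_patches = (img // patch) ** 2                 # 14 * 14 = 196
    patch_embed = channels * patch * patch * d + d  # linear patch projection + bias
    cls_token = d
    pos_embed = (n_patches + 1) * d                 # +1 for the [CLS] token
    per_layer = (
        2 * d                                  # LayerNorm before attention
        + 3 * (d * d + d)                      # Q, K, V projections
        + (d * d + d)                          # attention output projection
        + 2 * d                                # LayerNorm before MLP
        + (d * mlp_ratio * d + mlp_ratio * d)  # MLP expansion (768 -> 3072)
        + (mlp_ratio * d * d + d)              # MLP contraction (3072 -> 768)
    )
    final_norm = 2 * d
    return patch_embed + cls_token + pos_embed + layers * per_layer + final_norm

total = vit_base_param_count()
print(total)  # 85798656, i.e. ~85.8 million
```

The count lands on 85,798,656, matching the 85.8 M trainable parameters reported for the backbone.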

Results and Discussion
In this section, we provide an in-depth exploration of the evaluation metrics employed, delve into the specifics of our experimental procedures, and present the outcomes derived from the methodology we have proposed.

Performance Evaluation Metrics
Evaluating the efficacy of machine learning and deep learning models hinges on key performance indicators. These metrics play a pivotal role within machine learning, deep learning, and statistical investigation [35]. This research focuses on four indispensable evaluation criteria to gauge the efficiency of the proposed model.
• Accuracy: The accuracy metric gauges the overall correctness of the model's predictions as the ratio of correctly classified instances to the total number of samples. Nevertheless, for imbalanced datasets, or when different types of errors carry varying degrees of importance, accuracy alone may prove inadequate for a comprehensive assessment.
• Precision: Precision measures a model's ability to correctly identify positive samples among the instances it predicts as positive. This metric quantifies the proportion of true positives relative to the total of true positives and false positives.
• Recall: Recall, also referred to as sensitivity or the true positive rate, evaluates the model's ability to detect positive samples within the actual positive pool. It is the ratio of true positives to the sum of true positives and false negatives; in essence, recall offers insight into the comprehensiveness of positive predictions.
• F1-Score: The F1-score is the harmonic mean of precision and recall. It falls within the range of 0 to 1, with optimal performance achieved at a score of 1.
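The four metrics above follow directly from binary confusion-matrix counts, as the short sketch below shows. The counts used are hypothetical, chosen only so that they total the 511 test samples reported later.

```python
def prf1(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)                           # among predicted positives
    recall = tp / (tp + fn)                              # among actual positives
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical per-class counts (e.g., for "rain"), totalling 511 samples
acc, prec, rec, f1 = prf1(tp=45, fp=5, fn=5, tn=456)
print(round(acc, 4), round(prec, 2), round(rec, 2), round(f1, 2))
```

Note how the example illustrates the accuracy caveat: with 456 true negatives, accuracy (≈0.98) looks far stronger than the per-class precision and recall (0.90).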

Experimental Results
The fine-tuned ViT model was employed to identify weather scenes. ViT's key benefit over conventional CNNs is that it can be trained directly with supervision on large datasets, eliminating the need for pre-training on auxiliary tasks. Furthermore, ViT attains cutting-edge performance across a wide range of image recognition applications while maintaining a lower parameter count than comparable CNN designs. In light of these compelling advantages, we rigorously fine-tuned the ViT model for the specific purpose of identifying weather scenes, producing outstanding results.
Table 3 presents a detailed overview of the hyperparameters employed during the fine-tuning of the ViT model. The experiments were run on Google Colab with a Tensor Processing Unit (TPU) [36]. We partitioned the dataset into training and testing subsets using a split ratio of 0.2; the resulting partition is displayed in Table 4. We evaluated the proposed model's performance using class-specific precision, recall, and F1-scores, as shown in Table 5. The "support" column in the table indicates the sample count for each class; for example, the "snow" class contains 41 test samples, while the "rain" class contains 50. Notably, the cumulative sum of the "support" column equals 511, indicating that our model was evaluated on 511 samples. In situations with uneven class distributions, or significant discrepancies in the cost of misclassifying different classes, a confusion matrix is invaluable for assessing the performance of a classification model [37]; it underpins essential measures such as accuracy, precision, recall, and F1-score.
To assess the resilience of our proposed model, we conducted a comparative analysis between the ViT model and two leading pre-trained CNN-based models: VGG16 [38] and MobileNetV2 [39]. The effectiveness of the ViT model can be attributed to its advanced global feature extraction technique, as illustrated in Figure 4. We achieved an accuracy of 0.9061 with MobileNetV2 and 0.8991 with the VGG16 pre-trained model; the proposed ViT model outperformed both in terms of all evaluation metrics, i.e., accuracy, precision, recall, and F1-score. The method we have introduced also demonstrates superior performance compared to state-of-the-art approaches, as evidenced by the results presented in Table 6. In Table 6, Xia et al. achieved 96.03% accuracy [19], but their dataset was based on only four classes. Most of the studies in Table 6 are based on four classes; in the study of Li et al. [10], the number of classes is 10, but all of the classes are cloud types, and they achieved an accuracy of 80%. In our study, we have 11 classes and achieve an effective accuracy of 93.54%.
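The 0.2 split described above can be sketched with a simple seeded shuffle. The dataset size used here (2555 items, which yields the 511 test samples mentioned above at a 0.2 ratio) and the file names are illustrative assumptions, not the actual dataset listing.

```python
import random

def train_test_split(items, test_ratio=0.2, seed=42):
    """Shuffle a list deterministically and split it into train/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seeded for reproducibility
    n_test = int(len(items) * test_ratio)
    return items[n_test:], items[:n_test]

# Hypothetical (image_path, label) pairs over 11 weather classes
data = [(f"img_{i}.jpg", i % 11) for i in range(2555)]
train, test = train_test_split(data)
print(len(train), len(test))  # 2044 511
```

A stratified split (per-class shuffling) would keep class proportions closer to those in Table 4, at the cost of slightly more bookkeeping.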

Robustness and Limitation of the Proposed ViT
The proposed ViT for weather scene identification brings notable benefits along with some inherent limitations. ViT models excel at capturing global contextual information, a critical aspect of understanding weather patterns. Their scalability allows researchers to adapt model size to the complexity of the task, from basic weather classification to fine-grained analysis. Using pre-trained weights enables knowledge transfer from huge datasets, which improves performance. ViT's self-attention mechanism allows selective concentration on relevant image regions, which aids in recognizing subtle weather cues. However, challenges persist, including the need for substantial labeled data, the computational resources required for training, reduced interpretability compared with simpler models, and sensitivity to image resolution. When evaluating the use of ViT models for weather scene recognition, researchers must carefully weigh these considerations, aiming to maximize the benefits while mitigating the limitations.
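The global contextual reasoning described above comes from self-attention, in which every patch token attends to every other token. The sketch below illustrates a single attention head with no learned query/key/value projections, purely to show the mechanism; the token count (196) and dimension (64) are illustrative choices, not the paper's configuration.

```python
import numpy as np

def self_attention(tokens):
    """Single-head scaled dot-product attention, no learned projections."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)        # all-pairs token similarity
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # rows sum to 1
    return weights @ tokens                        # each output mixes ALL tokens

# 196 patch tokens of dimension 64 (e.g., a 224x224 image with 16x16 patches).
x = np.random.default_rng(1).standard_normal((196, 64))
out = self_attention(x)
print(out.shape)
```

Because every output row is a weighted mix of all input tokens, even a single layer gives each patch a global receptive field, unlike the local receptive fields of early CNN layers.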

Theoretical and Practical Implications
This study makes theoretical contributions by broadening the scope of ViT applications in computer vision to the complex realm of weather scene detection. It advances our understanding of complex scene recognition and of ViT's strong generalization capabilities across a wide range of environmental conditions. On a practical level, the findings have far-reaching impacts. They equip weather forecasting agencies with ViT-based tools for more accurate predictions, enhancing decision making in agriculture, transportation, and disaster management. ViT-equipped autonomous vehicles and drones gain enhanced safety, promoting advances in self-driving technology and surveillance systems. Additionally, ViT models support environmental monitoring, climate research, renewable energy optimization, and smart city development, while also enhancing consumer-oriented weather applications. In summary, this research bridges theory and practice, fostering innovation and informed decision making across a spectrum of industries in response to dynamic weather scenarios.

Conclusions
This paper presents the utilization of patch-based technology, specifically the ViT model, for single-image weather scene identification through deep learning and computer vision. The proposed ViT model can recognize 11 distinct weather-related classes. The main aim of this research is to showcase the practical use of deep learning and computer vision in enhancing scene awareness, particularly in urban environments. These insights have broad applications, including enabling autonomy in urban settings and beyond.
The rigorous evaluation of our three models (MobileNetV2, VGG16, and the fine-tuned ViT) in the context of weather scene recognition clearly established the ViT model as the leader. While VGG16 achieved an admirable accuracy of 89.91% and MobileNetV2 reached 90.61%, the fine-tuned ViT model demonstrated exceptional performance with an accuracy rate of 93.54%. These conclusive results firmly establish the ViT model's superiority over its CNN-based counterparts, positioning it as the foremost choice for weather recognition tasks requiring heightened accuracy.
Extending our research to real-time weather scene recognition, particularly under dynamic weather conditions, holds immense potential for practical applications. Such endeavors could yield invaluable insights, further enhancing the ViT model's real-world effectiveness and ensuring its relevance in the ever-evolving field of weather scene identification.

1. Evaluation of Enhanced Vision Transformer (ViT) Model: The study introduces and assesses a fine-tuned Vision Transformer (ViT) model for weather scene recognition. The evaluation showcases the model's superiority when compared against two conventional pre-trained CNN-based models (VGG16 and MobileNetV2), shedding light on the enhanced capability of the ViT model to accurately discern weather patterns.
2. Global Feature Extraction: We introduce a patch-wise self-attention module and a global feature extraction technique, both of which constitute significant contributions of our research.
3. CNN-Based Pre-Trained Models: We additionally fine-tuned pre-trained CNN-based models, namely VGG16 and MobileNetV2, for comparison with the proposed ViT model.
4. Advancement to Multiclass Weather Scene Classification: Going beyond binary classification, this study advances the field by concentrating on multiclass weather scene classification, acknowledging the intricate and diverse nature of real-world weather scenarios.
5. In-Depth Exploration of Transfer Learning: The study examines the efficacy of transfer-learning-based pre-trained models, with a particular focus on the fine-tuned ViT model. Through thorough investigation and comparison, the research provides valuable insights into the potential advantages of employing transfer learning for weather recognition tasks.

Figure 2 .
Figure 2. ViT abstract-level architecture diagram [25]. Patch Embedding: In the ViT framework, the input image is partitioned into fixed-size, non-overlapping patches. Each of these patches undergoes a linear transformation,
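The patch-embedding step described in the caption can be sketched in a few lines of NumPy: the image is tiled into non-overlapping patches, each patch is flattened, and a shared linear projection maps it to a token embedding. The patch size (16) and embedding dimension (64) below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def patch_embed(image, patch=16, dim=64, seed=0):
    """Split an HxWxC image into non-overlapping patches and project each."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    # Rearrange into (num_patches, patch*patch*c): one flat row per patch.
    patches = (image.reshape(h // patch, patch, w // patch, patch, c)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(-1, patch * patch * c))
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((patch * patch * c, dim)) * 0.02  # shared projection
    return patches @ W  # (num_patches, dim) token embeddings

# A 224x224 RGB image with 16x16 patches yields (224/16)^2 = 196 tokens.
tokens = patch_embed(np.zeros((224, 224, 3)))
print(tokens.shape)
```

In a real ViT, the projection matrix is learned, and position embeddings plus a class token are added before the transformer encoder.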

Table 1 .
Hyperparameter configurations of the ViT model. In this study, the Keras deep learning framework and the Python programming language were used for the experiments. The experiments were carried out with the free version of Google Colab (https://colab.research.google.com/ (accessed on 1 April 2024)). The experiment configuration details are available in Table 2.
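The 80/20 train-test partition used in the experiments (split ratio of 0.2) can be sketched with the standard library alone. The class names and sample counts below are placeholders for illustration; they are not the paper's dataset figures.

```python
import random

def split_dataset(samples, test_ratio=0.2, seed=42):
    """Shuffle and hold out test_ratio of the samples for testing."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# Placeholder class-labeled samples; real data would be image paths.
data = ([("rain", i) for i in range(250)] +
        [("snow", i) for i in range(205)])
train, test = split_dataset(data)
print(len(train), len(test))
```

In practice, a stratified split (e.g., scikit-learn's `train_test_split` with `stratify=`) is preferable for the uneven class sizes noted in Table 4, so that each class keeps roughly the same train/test proportion.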

Table 3 .
Hyperparameters for Vision Transformer model.

Table 4 .
Class-wise dataset samples for training and testing.

Table 5 .
Class-wise performance of the ViT model.

Table 6 .
Comparison of our study with state-of-the-art research.