YOLOv5s-Fog: An Improved Model Based on YOLOv5s for Object Detection in Foggy Weather Scenarios

In foggy weather scenarios, the scattering and absorption of light by water droplets and particulate matter cause object features in images to become blurred or lost, presenting a significant challenge for target detection in autonomous driving vehicles. To address this issue, this study proposes a foggy weather detection method based on the YOLOv5s framework, named YOLOv5s-Fog. The model enhances the feature extraction and expression capabilities of YOLOv5s by introducing a novel target detection layer called SwinFocus. Additionally, the decoupled head is incorporated into the model, and the conventional non-maximum suppression method is replaced with Soft-NMS. The experimental results demonstrate that these improvements effectively enhance the detection performance for blurry objects and small targets in foggy weather conditions. Compared to the baseline model, YOLOv5s, YOLOv5s-Fog achieves a 5.4% increase in mAP on the RTTS dataset, reaching 73.4%. This method provides technical support for rapid and accurate target detection in adverse weather conditions, such as foggy weather, for autonomous driving vehicles.


Introduction
In the field of autonomous driving, object detection is a crucial technology [1], and its accuracy and robustness are of paramount importance for practical applications [2]. However, in foggy weather scenarios, challenges arise due to weakened light and issues such as blurred object edges, which lead to a decline in algorithm performance, consequently affecting the safety and reliability of autonomous vehicles [3]. Therefore, conducting research on target detection in foggy weather scenes holds great significance.
In recent years, researchers have made notable progress in addressing the problem of object detection in foggy weather conditions [4,5]. Traditional methods primarily rely on conventional computer vision techniques such as edge detection, filtering, and background modeling. While these methods can partially handle foggy images, their effectiveness in complex scenes and under challenging fog conditions is limited. To address object detection in complex foggy scenes, researchers have explored the use of physical models to represent foggy images. He et al. [6] proposed a single-image dehazing method based on the dark channel prior, while Zhu et al. [7] presented a fast single-image dehazing approach based on the color attenuation prior. These dehazing methods improve the visibility of foggy images and thereby enhance the accuracy of subsequent object detection. However, physical model-based methods require an estimate of fog density, which makes it difficult to handle the varying fog densities found in complex scenes.
With the continuous development of deep learning techniques, deep learning has gradually become a research hotspot in the field of object detection [8,9]. Compared to traditional methods, deep learning models can learn tasks directly from raw data and exhibit improved generalization through training on large-scale datasets [10]. Attention-based architectures, in particular, can capture the relationships among different regions within an image, thereby enhancing model performance. In this study, the Swin Transformer [40] component is incorporated into the YOLOv5s model to improve detection accuracy in adverse weather conditions.
The main contributions of this study are as follows:
1. On the basis of the YOLOv5s model, we introduce a multi-scale attention feature detection layer called SwinFocus, based on the Swin Transformer, to better capture the correlations among different regions in foggy images;
2. The traditional YOLO head is replaced with a decoupled head, which decomposes the object detection task into separate subtasks, reducing the model's reliance on specific regions of the input image;
3. In the non-maximum suppression (NMS) stage, Soft-NMS is employed to better preserve target information, effectively reducing false positives and false negatives.
The remaining sections of this paper are organized as follows. In Section 2, we provide a brief overview of the original YOLOv5s model and elaborate on the innovations proposed in this study. Section 3 presents the dataset, experimental details, and results obtained in our experiments. Finally, in Section 4, we summarize our work and propose some future research directions.

Overview of YOLOv5
YOLOv5 [17] is an efficient and highly accurate real-time object detection algorithm that extends the YOLO series [15,16]. The algorithm employs a single neural network to perform bounding box and category predictions. In comparison to its previous versions, YOLOv5 incorporates several improvements, including a new backbone network based on the CSP architecture [41], dynamic anchor allocation methods, and data augmentation techniques such as Mixup [42]. These enhancements enable the algorithm to achieve outstanding performance on multiple benchmark datasets while maintaining real-time inference speeds on both CPU and GPU platforms. The YOLOv5 model comes in four configurations: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. In general, YOLOv5s is well suited for real-time object detection in scenarios with limited computational resources, while YOLOv5x is more suitable for applications that require high-precision detection. Considering the real-time detection requirements in foggy weather conditions, this study employs YOLOv5s as the experimental model. The operational flow of the YOLOv5s-Fog model proposed in this paper is illustrated in Figure 1.

Figure 1. Operational procedure of YOLOv5s-Fog. The framework incorporates an augmented predictive feature layer to bolster the network's regional comprehension, employs a decoupled head to address scenarios characterized by diminished contrast and indistinct boundaries, and applies Soft-NMS for the integration of bounding boxes.

The Architecture of YOLOv5s-Fog Network
In foggy weather conditions, object features in images often become blurry or even lost due to the presence of fog [43]. In this paper, we address this issue by improving the YOLOv5s network architecture. Our proposed network, YOLOv5s-Fog, is illustrated in Figure 2. Firstly, we introduce a new feature detection layer called SwinFocus, which enhances object detection by better capturing the subtle features of objects in the image. Compared to traditional convolutional neural networks, SwinFocus achieves global interaction and aggregation of feature map information by decomposing the spatial and channel dimensions of the feature maps, enabling the network to better detect objects concealed in fog. Secondly, to enhance the flexibility of the model during the detection stage, we employ a decoupled head, in which the classification and regression heads are processed separately, making better use of the network's expressive power. Finally, we utilize Soft-NMS in the post-processing stage to effectively handle overlapping objects in foggy images.

In challenging weather conditions such as fog, traditional Convolutional Neural Networks (CNNs) face a range of limitations in object detection tasks [44]. Firstly, the presence of fog causes image blurring, reduced contrast, and color distortion, making it difficult for traditional convolutional operations to effectively extract clear object edges and fine details. Secondly, lighting variations and occlusions in foggy scenes make it challenging for traditional CNNs to accurately localize and detect objects.

Swin Transformer [40] is a neural network based on the Transformer [38] architecture that has demonstrated outstanding performance in computer vision tasks such as image classification, object detection, and semantic segmentation; its architecture is illustrated in Figure 3. Unlike traditional CNNs, Swin Transformer introduces a Patch Partition module, which divides the input image into blocks and flattens them along the channel dimension to better capture the subtle features of objects in the image. Specifically, each 4 × 4 group of adjacent pixels forms a patch that is flattened in the channel dimension. Four stages then generate feature maps of different sizes: Stage 1 uses a linear embedding layer, while the remaining three stages employ Patch Merging for downsampling, and these stages are stacked in a repeated manner.
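As a concrete illustration (a minimal PyTorch sketch under our reading of the Patch Partition module, not the authors' implementation), the block-wise partition maps an input of shape (B, 3, H, W) to a (B, H/4, W/4, 48) grid of patch tokens:

```python
import torch

def patch_partition(x: torch.Tensor, patch_size: int = 4) -> torch.Tensor:
    """Split an image into non-overlapping patch_size x patch_size patches
    and flatten each patch along the channel dimension, as in the
    Swin Transformer Patch Partition step."""
    b, c, h, w = x.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Carve the spatial grid into patch blocks.
    x = x.view(b, c, h // patch_size, patch_size, w // patch_size, patch_size)
    # Group each patch's pixels with the channel axis and flatten them.
    x = x.permute(0, 2, 4, 1, 3, 5).reshape(
        b, h // patch_size, w // patch_size, c * patch_size * patch_size)
    return x

# A 3-channel 224x224 image becomes a 56x56 grid of 48-dimensional tokens.
tokens = patch_partition(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 56, 56, 48])
```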
To enhance the feature representation in adverse weather conditions, we introduce an additional Swin Transformer-based feature detection layer, SwinFocus, to the YOLOv5 framework. The basic structure of SwinFocus is illustrated in Figure 4. SwinFocus plays a critical role in object detection under challenging weather scenarios, with its hierarchical feature representation mechanism being the core component. Through multiple stages of downsampling, it can extract features at different scales, capturing information from objects of various sizes. This ability enables SwinFocus to adapt better to size variations and diversity in targets. Furthermore, SwinFocus inherits the window attention mechanism, which transforms global attention into local attention, allowing it to focus more on subtle details and edge information in the image. In foggy conditions where images may be affected by blurring and reduced visual quality, the window attention mechanism can precisely localize objects and extract crucial features. The computational formulas for two consecutive SwinFocus layers are as follows:
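$$\hat{z}^{l} = \mathrm{W\text{-}MSA}\big(\mathrm{LN}(z^{l-1})\big) + z^{l-1}, \qquad z^{l} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l})\big) + \hat{z}^{l}$$

$$\hat{z}^{l+1} = \mathrm{SW\text{-}MSA}\big(\mathrm{LN}(z^{l})\big) + z^{l}, \qquad z^{l+1} = \mathrm{MLP}\big(\mathrm{LN}(\hat{z}^{l+1})\big) + \hat{z}^{l+1}$$

Here, W-MSA and SW-MSA denote the window-based and shifted-window multi-head self-attention modules, LN denotes layer normalization, and $\hat{z}^{l}$ and $z^{l}$ denote the outputs of the (S)W-MSA module and the MLP module of block $l$, respectively, following the standard Swin Transformer formulation [40] that SwinFocus inherits.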

Decoupled Head
Deep learning-based object detection methods typically adopt a shared feature detector to simultaneously predict the class and location information of objects [45]. This coupling approach is beneficial as it improves model efficiency and accuracy through shared feature representations. However, in foggy weather conditions, this tightly coupled approach may face limitations and challenges. Firstly, the image quality in foggy environments is severely affected, resulting in visual impairments that make it difficult for traditional shared feature detectors to accurately extract clear object edges and fine details. Consequently, this impacts the accuracy of object localization and detection. Secondly, due to light absorption and scattering effects in foggy weather, the visibility of objects is reduced, causing indistinct object edges and easy blending with the background.
The Decoupled Head separates feature extraction from spatial position information by employing two independent network heads [46]. This design effectively reduces the coupling between feature extraction and spatial position information [45], enabling the model to better handle complex lighting variations caused by light propagation attenuation and scattering in foggy weather conditions. Moreover, the separation of feature extraction and spatial localization tasks allows the feature extraction head to focus on extracting discriminative features, while the spatial information head can concentrate on processing positional information. The structure of the decoupled head, as illustrated in Figure 5, involves a 1 × 1 convolutional layer to reduce the channel dimension, followed by two parallel branches, each containing two 3 × 3 convolutional layers [46]. This approach not only reduces the complexity of the network architecture but also enhances the accuracy of the model.
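To make this structure concrete, the sketch below gives a minimal PyTorch version of such a decoupled head; the hidden width, anchor count, and output layout are illustrative assumptions rather than the exact configuration used in YOLOv5s-Fog:

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Decoupled detection head: a 1x1 conv reduces channels, then two
    parallel branches (two 3x3 convs each) handle classification and
    box regression separately."""
    def __init__(self, in_ch: int, num_classes: int,
                 num_anchors: int = 3, hidden: int = 256):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, hidden, 1)  # 1x1 channel reduction

        def branch() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU())

        self.cls_branch, self.reg_branch = branch(), branch()
        self.cls_pred = nn.Conv2d(hidden, num_anchors * num_classes, 1)
        self.box_pred = nn.Conv2d(hidden, num_anchors * 4, 1)  # x, y, w, h
        self.obj_pred = nn.Conv2d(hidden, num_anchors * 1, 1)  # objectness

    def forward(self, x):
        x = self.stem(x)
        cls = self.cls_pred(self.cls_branch(x))  # class scores
        reg = self.reg_branch(x)                 # shared localization features
        return cls, self.box_pred(reg), self.obj_pred(reg)

# Example: a head over a 40x40 feature map for the five RTTS categories.
head = DecoupledHead(in_ch=128, num_classes=5)
cls, box, obj = head(torch.randn(1, 128, 40, 40))
```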

Soft-NMS
In comparison to normal weather conditions, foggy weather exhibits differences in light propagation, contrast, color, and visibility [1]. These issues make the overlapping of objects more prominent. Traditional non-maximum suppression (NMS) methods may excessively suppress overlapping bounding boxes when selecting the one with the highest confidence, leading to the erroneous exclusion of important objects [47]; the specific details are shown in Figure 6. Soft-NMS addresses this by introducing a confidence decay factor, which helps preserve the confidence information of overlapping objects to a certain extent and reduces the likelihood of suppressing important objects. Additionally, traditional NMS relies solely on the intersection over union (IoU) between bounding boxes for suppression, disregarding the confidence information of the objects [48]. This can cause low-confidence bounding boxes with a high IoU to be retained, while high-confidence bounding boxes may be erroneously suppressed. Soft-NMS adjusts the suppression based on both the confidence and the overlap of the bounding boxes, thereby better preserving high-confidence bounding boxes and improving the localization accuracy of objects in foggy weather conditions. Compared to traditional non-maximum suppression (NMS), Soft-NMS introduces a softening function that gradually reduces the scores of other bounding boxes overlapping with the one having the highest confidence, instead of directly setting their scores to zero [47]. This principle can be formalized as follows: for a set of input bounding boxes $B = \{b_1, b_2, \ldots, b_n\}$, where each bounding box $b_i$ consists of four coordinates and a confidence score $s_i$, Soft-NMS measures their similarity by computing the Intersection over Union (IoU) values between the boxes.
Then, the scores of each detection box are adjusted based on their similarity. Specifically, for the currently processed detection box $b_i$, its final weight is given by:

$$\omega_i = \begin{cases} s_i, & s_i > \theta \\ s_i \exp\left(-\dfrac{\mathrm{IoU}(M, b_i)^2}{\sigma}\right), & s_i \le \theta \end{cases}$$

In this equation, $\theta$ represents a threshold and $M$ denotes the box with the highest confidence. When $s_i$ is greater than $\theta$, the original score is retained; otherwise, a Gaussian function is used to suppress similar detection boxes, with $\sigma$ controlling the rate of weight reduction. The final weight $\omega_i^{*}$ is then adjusted by linear interpolation with the confidence score $s_i$ of the current detection box:

$$\omega_i^{*} = \alpha\,\omega_i + (1 - \alpha)\,s_i$$
Among them, $\alpha$ is a parameter that controls the ratio between the adjusted score and the original score. Finally, for each detection box $b_i$, the Soft-NMS function adjusts its score to $\omega_i^{*}$, so that detection boxes with higher similarity appear with lower weights in the output, thus avoiding issues such as excessive suppression and the exclusion of correct detections. Soft-NMS gradually reduces the scores of overlapping bounding boxes while preserving a certain degree of overlap. This allows better handling of occlusions, blurriness, and overlapping instances in complex environments during object detection, making the selection of detection boxes more reasonable and stable and enhancing the overall performance of the detection system.
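For illustration, the following minimal NumPy sketch implements the Gaussian score-decay principle of Soft-NMS [47]; the threshold logic and the $\alpha$-interpolated weighting described above are variants layered on top of this basic form:

```python
import numpy as np

def soft_nms(boxes: np.ndarray, scores: np.ndarray,
             sigma: float = 0.5, score_thresh: float = 0.001) -> list:
    """Gaussian Soft-NMS: decay the scores of boxes overlapping the
    current best box instead of discarding them outright.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,)."""
    scores = scores.copy()
    remaining = list(range(len(scores)))
    keep = []
    while remaining:
        # Select the remaining box with the highest (possibly decayed) score.
        m = max(remaining, key=lambda i: scores[i])
        if scores[m] < score_thresh:
            break
        keep.append(m)
        remaining.remove(m)
        for i in remaining:
            # Intersection-over-union between box m and box i.
            x1 = max(boxes[m, 0], boxes[i, 0]); y1 = max(boxes[m, 1], boxes[i, 1])
            x2 = min(boxes[m, 2], boxes[i, 2]); y2 = min(boxes[m, 3], boxes[i, 3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            area_m = (boxes[m, 2] - boxes[m, 0]) * (boxes[m, 3] - boxes[m, 1])
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            iou = inter / (area_m + area_i - inter + 1e-9)
            # Gaussian decay: heavy overlap -> strong suppression, never zero.
            scores[i] *= np.exp(-(iou ** 2) / sigma)
    return keep
```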

Dataset
Few datasets are available for training and testing object detection algorithms under adverse weather conditions, which can adversely affect their performance, particularly that of CNN-based methods. Additionally, the traditional atmospheric scattering model [15] fails to accurately simulate real-world foggy scenes [36]. To ensure fairness, we selected a total of 8201 images as the training set (V_C_t), sourced from the VOC [49] and COCO [50] datasets. For the test sets, we utilized V_n_ts [36] and RTTS [51]: RTTS was employed to evaluate the method's object detection capability in foggy weather conditions, while V_n_ts was used to assess its performance on standard data. The dataset encompasses five categories: people, cars, buses, bicycles, and motorcycles. Further details regarding dataset usage are presented in Table 1.

Experimental Details
The experimental setup of YOLOv5s-Fog is shown in Table 2. During training, we employed several effective data augmentation techniques, including Mixup [42] and Mosaic [16]. We also adopted a cosine learning rate scheduling strategy, with an initial learning rate of $3 \times 10^{-4}$, a batch size of 16, and 30 training epochs.
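As a sketch of this schedule (placeholder model and training loop; only the hyperparameters above are taken from our setup), the cosine decay can be realized in PyTorch as follows:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # placeholder for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)

for epoch in range(30):
    # ... one pass over the training set (batch size 16) would go here ...
    optimizer.step()   # placeholder for the per-batch parameter updates
    scheduler.step()   # anneal the LR once per epoch along a cosine curve
    print(epoch, f"{scheduler.get_last_lr()[0]:.6f}")
```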

Evaluation Metrics
This study evaluates the detection performance of the model using mean Average Precision (mAP), a metric commonly used to assess object detection algorithms. It represents the average area under the precision-recall curve, providing a comprehensive evaluation of both the localization accuracy and the recognition accuracy of the classifier; a higher mAP value indicates better detection performance. The specific calculation is as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$AP = \sum_{n} (R_n - R_{n-1})\, P_n, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

In this context, TP represents true positives, FP false positives, and FN false negatives; $R_n$ denotes recall, $P_n$ the maximum precision at that recall, and $N$ the number of classes.
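For reference, a compact NumPy sketch of the per-class AP computation implied by these formulas (all-point interpolation; an illustrative helper, not the exact evaluation code used in our experiments):

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP as the area under the precision-recall curve:
    AP = sum_n (R_n - R_{n-1}) * P_n, with P_n the maximum precision
    at recall >= R_n. Inputs are sorted by ascending recall."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    steps = np.where(r[1:] != r[:-1])[0]  # recall change points
    return float(np.sum((r[steps + 1] - r[steps]) * p[steps + 1]))

# mAP then averages AP over the N classes: mAP = (1/N) * sum_i AP_i.
```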

Experimental Results
To validate the effectiveness of YOLOv5s-Fog, we compared it with various existing methods for foggy scene object detection, including deep learning-based object detection networks [15,16,17], dehazing methods [33,52], domain adaptation [32,35], and image-adaptive enhancement [36]. The specific results are shown in Table 3. Table 4 presents a comprehensive comparison between our proposed approach and state-of-the-art object detection models. Figure 7 illustrates the evolution of the key metrics, including bounding box loss, object loss, and class loss, as well as Precision, Recall, mAP, and mAP50-95, after each epoch during the training and validation of YOLOv5s-Fog.

The Analysis of Experimental Results
From Table 3, it can be observed that YOLOv5s-Fog outperforms the other methods on both the conventional weather dataset and the foggy weather dataset. Notably, combining a deep learning architecture with image-adaptive methods outperforms traditional image dehazing approaches. One example is IA-YOLO [36], which achieves 72.65% and 36.73% on V_n_ts [49] and RTTS [51], respectively. This superiority can be attributed to the ability of image-adaptive algorithms to consider different regions within the image and make adjustments based on regional characteristics and requirements. In contrast, conventional dehazing algorithms [35] typically apply the same processing to the entire image, without fully considering local variations.

IA-YOLO utilizes the YOLOv3 [15] network architecture. To further investigate the impact of network architecture on object detection in foggy conditions, we conducted experiments using the original YOLOv5s [17]. Compared with IA-YOLO, YOLOv5s achieves significant improvements, reaching 87.56% on V_n_ts and 68% on RTTS, owing to a range of architectural designs and techniques, including multi-scale fusion, anchor box design, and classifier optimization. Building upon YOLOv5s, YOLOv5s-Fog introduces the additional feature detection layer [40] and a decoupled head [45,46] to enhance the network's ability to explore challenging details in foggy scenes, and employs Soft-NMS [47] in the post-processing stage to address occlusion. Ultimately, our proposed method achieves mAP scores of 92.23% and 73.40% on V_n_ts and RTTS, respectively. Furthermore, YOLOv5s-Fog does not rely on image dehazing: it retains the original end-to-end detection pipeline and avoids interference from artificially added noise during the detection phase.

Figure 8 showcases partial detection results of the three models that performed well on RTTS. The first row presents IA-YOLO [36], which employs image-adaptive techniques to remove weather-specific information and restore the underlying content; although this improves detection performance, it introduces undesired noise to the object detector. The second and third rows display the detection results of YOLOv5s and YOLOv5s-Fog, respectively, without image dehazing or enhancement. It is evident from Figure 8 that YOLOv5s-Fog exhibits excellent detection capabilities in foggy and low-light environments and can identify smaller objects in dense fog more effectively.

Table 3. Comparison of the performance of each method on the conventional dataset (V_n_ts) and the foggy weather dataset (RTTS). The rightmost two columns present the mAP (%) on the two test sets, V_n_ts and RTTS.

Ablation Studies
In order to validate the effectiveness of each module, we conducted an ablation study on the RTTS dataset. To ensure scientific rigor and to evaluate the proposed model comprehensively, we employed three metrics: mAP, mAP50-95, and GFLOPs. The impact of each module on the detection results is listed in Table 5, and Table 6 documents the detection performance of YOLOv5s-Fog on each object category in the RTTS dataset after incorporating the different modules.

The Impact of SwinFocus
Through experimental validation, we observed that SwinFocus significantly enhances the model's mAP. This can be attributed to the adoption of the cross-domain self-attention mechanism during training, which enables the model to capture global features more effectively. Although the additional detection layer increases the model's parameters and computational burden, this cost is justified for application scenarios in adverse weather conditions such as fog; Table 5 demonstrates the notable performance improvements.

The Impact of the Decoupled Head
By incorporating the Decoupled Head, the total number of layers in the model increased by 12, and the GFLOPs rose by 1.2. The adoption of the Decoupled Head not only enhances mAP but also enables adaptability to diverse object detection tasks and datasets, showcasing excellent scalability.

The Impact of Soft-NMS
For object detection in foggy conditions, Soft-NMS primarily addresses large numbers of densely overlapping instances. In Figure 9, we present the detection results of YOLOv5s-Fog on the RTTS dataset. Compared to traditional NMS, Soft-NMS handles similar objects in complex environments noticeably better, highlighting its significant advantage.

Conclusions
In this paper, we propose YOLOv5s-Fog, a novel approach to address the challenges of object detection under foggy conditions. Unlike previous research, we do not rely on dehazing or adaptive enhancement techniques applied to the original images. Instead, we enhance the YOLOv5s model by introducing additional detection layers and integrating advanced modules. Our improved model demonstrates higher accuracy in foggy conditions. Experimental results show the potential of our proposed method in object detection tasks under adverse weather conditions. In the future, we plan to invest more efforts in constructing datasets for object detection in extreme weather conditions and develop more efficient network architectures to enhance the model's accuracy in extreme weather detection.

Data Availability Statement:
The data presented in this study are available upon request from the corresponding author.