Comparison of CNN-Based Models for Pothole Detection in Real-World Adverse Conditions: Overview and Evaluation

Abstract: Potholes pose a significant problem for road safety and infrastructure. They can cause damage to vehicles and present a risk to pedestrians and cyclists. The ability to detect potholes in real time and with a high level of accuracy, especially under different lighting conditions, is crucial for the safety of road transport participants and the timely repair of these hazards. With the increasing availability of cameras on vehicles and smartphones, there is a growing interest in using computer vision techniques for this task. Convolutional neural networks (CNNs) have shown great potential for object detection tasks, including pothole detection. This study provides an overview of computer vision algorithms used for pothole detection. Experimental results are then used to evaluate the performance of the latest CNN-based models for pothole detection in different real-world road conditions, including rain, sunset, evening, and night, as well as clear conditions. The models evaluated in this study include both conventional and the newest architectures from the region-based CNN (R-CNN) and You Only Look Once (YOLO) families. The YOLO models demonstrated a faster detection response and higher accuracy in detecting potholes under clear, rain, sunset, and evening conditions. R-CNN models, on the other hand, performed better in the worse-visibility conditions at night. This study provides valuable insights into the performance of different CNN models for pothole detection in real road conditions and may assist in the selection of the most appropriate model for a specific application.


Problem Description and Motivation
Road transport plays a vital role in the economic development of a country, connecting communities and businesses while providing access to educational and employment opportunities, as well as healthcare and social services [1]. However, with the fast economic expansion and technological advancements in recent years, the quality of the transportation system has been affected. Ultimately, as road conditions deteriorate, traffic resilience weakens [2]. One of the most significant issues faced by drivers is the presence of potholes on roads. These not only cause damage to vehicles but also pose a risk to driver safety. According to a study by the AAA in 2016, potholes cost U.S. drivers USD 15 billion over the past five years in vehicle repairs, equivalent to nearly USD 3 billion per year [3].
Potholes can be caused by a variety of environmental factors, such as severe weather, high traffic loads [4], and heavy vehicles, as well as poor construction methods and lack of proper maintenance [5]. These issues are particularly difficult to address as they are often unexpected and require constant monitoring to ensure the road infrastructure [6] is in good condition. To mitigate the risk posed by potholes, a system that can detect and communicate their presence to maintenance services or other drivers could be implemented. This would not only improve driver safety but also assist relevant authorities in determining which potholes pose the greatest risk and need to be repaired as soon as possible [7]. The development of a variety of technologies that enable the collection, processing, and transfer of data in real time between fixed traffic devices, moving vehicles, and data centres has opened up a wide range of new possibilities for Intelligent Transportation Systems (ITSs) [8].
The development of intelligent pothole detection technology is crucial to addressing issues with road traffic. However, in natural settings, the detection of potholes can be hindered by adverse weather conditions, such as rain, snow, and fog. On rainy days, potholes may be hidden under puddles or resemble puddles. Water on the windshield of a car can obstruct the driver's field of view and make it difficult to see any damage to the road. The reduced visibility induced by fog increases the likelihood that the car will sustain damage from potholes. Additionally, even in favourable conditions, there is still the possibility of erroneous detection. The results of pothole detection in good-visibility conditions are displayed in Figure 1. On a sunny day, it is possible to observe instances of false detection, such as the reflection of a pothole on a car or a small road patch, denoted in green.
Pothole detection in real conditions is a complex task that involves several factors, including the type of road surface, the condition of the road, and the traffic volume [12]. One of the main challenges in pothole detection is the ability to accurately identify and locate potholes in real time. Regardless of the approach used, pothole detection in real conditions requires a combination of hardware and software to accurately identify and locate potholes. This includes the use of advanced sensors, sophisticated algorithms, and machine learning techniques to process and analyse the data collected. Additionally, it is important to consider the environmental conditions, such as weather and lighting, which can affect the accuracy of pothole detection [13]. It is expected that with the continued development of sensor technology and machine learning techniques, pothole detection will become increasingly accurate and reliable in the future.

Pothole Detection
The location of road potholes may be determined in a number of different ways. Low-cost methods based on accelerometers measuring vibration can have a high degree of accuracy. Although their performance is independent of visibility conditions, their reaction time may be slow, and the vehicle must drive over a pothole to detect it [14][15][16].
Three-dimensional reconstruction techniques using laser [17,18] or stereo vision [19][20][21] can be utilized to model point cloud representations of road inconsistencies. 3D lasers based on the principle of transmitting light pulses to the target object/surface can detect objects in low visibility but come at a higher cost. Stereo vision systems can be susceptible to vibrations and need several camera alignments as well as significant surface reconstruction calculations.
Deep-learning-based object detection has been utilized in research to perform accurate and quick pothole detection. Object detection is a technique used to identify and classify objects within an image while determining their location. It is a critical challenge in the field of computer vision, and it has been the subject of intensive research and development in recent years. With the rapid development of deep learning, significant advancements have been made in improving object recognition performance using deep neural models. Among these models, Convolutional Neural Networks (CNNs) are considered one of the most innovative breakthroughs in image processing. One of the key advantages of using CNNs for object detection is that they can be trained end to end on large datasets, allowing them to learn complex features and representations of objects.
Region-based CNNs (R-CNNs) were among the first deep-learning-based object detection methods, proposing a CNN to extract features from input images and Support Vector Machines (SVMs) to classify objects [22]. However, their slow processing speed limits their practical use. Fast R-CNN [23] was later introduced to improve R-CNN by adding a Region-of-Interest (RoI) pooling layer that reduced processing time and allowed for end-to-end training of the network. You Only Look Once (YOLO) [24], a single-stage architecture, significantly improved detection accuracy, making it comparable to and, in some cases, even superior to R-CNN models. Providing significant accuracy and speed of inference, the YOLO models and their modifications appeared most often in research works regarding pothole detection. In addition, Faster R-CNN was also deployed for improved visual representation and detection accuracy. Although inference speed is a limitation of Faster R-CNN compared to newer versions of YOLO, there has been significant development of models in the R-CNN family recently.
In addition to the aforementioned architectures, EfficientDet [25], Single-Shot Detector (SSD) [26], and RetinaNet [10] are other models used for pothole detection. EfficientDet uses a compound scaling method to optimize the detection of objects at various sizes and scales. It achieves high accuracy with fewer parameters, making it a more efficient architecture for real-time applications. Similarly, RetinaNet uses a novel loss function to address the issue of class imbalance and improve the detection of smaller objects. SSD, on the other hand, simplifies the object detection pipeline by eliminating the region proposal step, resulting in faster detection times. A detailed overview of works related to pothole detection is given in Section 2.
Despite the progress that has been made in pothole detection using CNNs, there are still several challenges that need to be addressed to make this technology more practical and effective [27][28][29]. These include the need for larger and more diversified training datasets and algorithms capable of detecting potholes in real time and operating in a range of real-world settings, such as changing lighting and weather conditions. The motivation of this paper is to provide a comprehensive overview of the different models available and to compare their performance in pothole detection in a complex environment through experiments. To the best of our knowledge, a comparison of pothole-detection-based systems concerning heterogeneous weather conditions is currently lacking. The main contributions of this paper are summarized as follows:
• An overview of contemporary neural networks employed in the pothole detection task and an evaluation of the effectiveness of the Fast, Faster, Mask, Cascade, and Sparse R-CNN models and the YOLOv3, -v4, -v5, -v6, and -v7 computer vision models for pothole detection. The results of the study can provide valuable information for future research and can aid in the development of more accurate and reliable pothole detection systems.
• Comparison of models' performance in terms of detection accuracy under different weather conditions and exploring the capability of models dealing with adverse weather and light conditions. Although YOLO architectures perform detection with significant accuracy and speed, R-CNN models may handle very low-visibility detection more successfully.
• Determining the limitations of existing deep neural networks for pothole detection and identifying techniques by which future research could improve pothole detection performance under adverse visual conditions.
Detecting potholes in adverse conditions and achieving a high level of accuracy at the same time is quite a difficult task, as indicated above. The main challenge in solving this problem is to work with the data collected under adverse conditions and to select an appropriate algorithm for detection.
The remainder of this paper is organized as follows: Section 2 reviews and discusses existing methods for detecting road damage. Section 3 reviews the state-of-the-art algorithms for object detection. Section 4 describes the methodology used in this study, the data acquisition method, and the evaluation metrics. Section 5 discusses the experimental results and performs a comparative analysis between the state-of-the-art systems. In Section 6, we discuss the experiments, the benefits and limitations of the research, and recommendations for future research.

Related Work on Pothole Detection
Although the use of deep CNNs for pothole detection has been studied for several years [7,11,25,27,29,30], efficient algorithms for this task are still limited. In many scenarios, the pothole object is relatively small compared to the size of the road image, making it challenging to train a CNN on a high-resolution image due to the large amount of memory required. In a recent study by Pena-Caballero et al. [9], the performance of object identification and semantic segmentation algorithms was evaluated based on detection time and overall system accuracy. The results indicated that while segmentation algorithms had high accuracy, they often came with an increase in computational complexity. YOLO version 3 (v3) was found to be faster and achieve a better mean Average Precision (mAP) than YOLOv2.
Park et al. [30] presented a method for automated pothole detection based on YOLO models. Three YOLO models (YOLOv4, YOLOv4-tiny, and YOLOv5) were used in the training and testing phases, with a dataset of 665 pothole images divided into training, validation, and testing subsets. The results showed that the mAP@0.5 values of YOLOv4, YOLOv4-tiny, and YOLOv5 were 77.7%, 78.7%, and 74.8%, respectively, with YOLOv4-tiny having the best performance in detecting potholes. This study has limitations, such as low accuracy in detecting small, distant potholes and a lack of testing under poor weather and lighting conditions.
W. Ye et al. [27] presented a pothole detection method based on CNN pre-pooling. The dataset comprised 400 raw pothole images collected from different pavements under varying lighting conditions, cropped into 96,000 small images for training or testing. The proposed method utilized a pre-pooling layer in the CNN architecture, improving accuracy compared to a conventional CNN. The authors analysed the proposed method's advantages in terms of accuracy, robustness to different lighting and pavement conditions, and superiority over conventional methods (e.g., Sobel edge detection and K-means cluster analysis).
Ahmed [11] performed an extensive evaluation of various object detection architectures, including YOLOv5 models (for three different model sizes) with ResNet101 backbone, YOLOR, Faster R-CNN with ResNet50, VGG16, MobileNetV2, and InceptionV3 backbones. The author also proposed a modified VGG16 (MVGG16) architecture, which effectively reduced computational costs while retaining detection accuracy. Upon comparison, Faster R-CNN with ResNet50 achieved the highest precision rate of 91.9% and an inference time of 0.098 s for larger images. YOLOv5, on the other hand, offered the fastest inference speed with an inference time of 0.009 s but at the cost of reduced accuracy, making it more suitable for real-time applications.
H. Chen et al. [29] proposed a pothole detection method based on localization and partial-classification CNNs. The proposed method comprised two subnetworks: Localization Network (LCNN) and the Part-based Classification Network (PCNN). The LCNN utilized a high recovery network to identify candidate regions, and the PCNN performed classification on the candidates. The authors used two different image sizes as inputs to the network to balance accuracy and efficiency. The results showed that the accuracy, precision, recall, and F1-score were 95.0%, 95.2%, 92.0%, and 93.6%, respectively, with the proposed method outperforming existing methods. The proposed method has the potential for scalability to detect not only potholes but also other road defects, but its performance may degrade with low-resolution input data. The authors suggest using scale selection and cascading techniques to further improve performance.
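The F1-score quoted above follows directly from the reported precision and recall, since it is their harmonic mean. The short sketch below only illustrates that relationship; the function name is an assumption made for this example and is not taken from the cited work.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing is detected
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the LCNN/PCNN method [29].
f1 = f1_score(0.952, 0.920)
print(f"{f1:.3f}")  # prints 0.936
```

The result agrees with the reported F1-score of 93.6%.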
Heo et al. [7] proposed improvements in YOLOv4 Tiny through the integration of two multi-scale feature modules, namely Spatial Pyramid Pooling (SPP) and a Feature Pyramid Network (FPN), with the CSPDarknet53-tiny backbone. The additional modules compensate for the loss of spatial information: SPP applies a series of max-pool filters to extract features of varying sizes, and FPN combines high-level feature maps from the deep convolutional layers with low-level features from the shallow layers. The proposed SPFPN-YOLOv4 tiny model outperformed YOLOv2, YOLOv3, and YOLOv4 tiny, not only in mAP@0.5 but also in frames per second (FPS).
Table 1 provides a summary of recent pothole detection systems. It is important to note that individual pothole detection systems may differ in the road damage dataset and hardware used, as well as in the training settings.

Current State of Object Detection Algorithms
The purpose of detection networks is to locate objects in an image, typically by defining an axis-aligned bounding box centred on each object, and to categorize that object at the same time. A detection is considered accurate when there is sufficient overlap between the object's estimated location and the bounding box that a human has drawn around the object (the ground truth). In this section, several of the most well-known models for object detection in images are compared with respect to their internal structure. These models can be roughly classified into the following two categories:
• Two-stage: The operation of these models is divided into two phases. In the first phase, region proposals are generated. In the second phase, predictions are computed for these regions using an image classification model, and object classification is performed according to the results. This method is relatively slow because the procedure is applied to each proposed region. Representatives include models from the R-CNN family.
• Single-stage: These models perform object localization and classification in a single pass through the neural network. This makes them significantly faster, and they can perform real-time detection. Typical representatives include YOLO models and their variants.
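The "sufficient overlap" criterion mentioned at the start of this section is commonly measured with the Intersection over Union (IoU) of the predicted and ground-truth boxes. The following is a minimal sketch under stated assumptions: boxes are given as (x_min, y_min, x_max, y_max) tuples, and the 0.5 threshold is a common convention used here only for illustration.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_box, gt_box, threshold=0.5):
    """A prediction counts as correct when its IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold


# Hypothetical predicted and ground-truth pothole boxes (pixel coordinates).
print(is_correct((0, 0, 10, 10), (1, 1, 10, 10)))
```

Raising the threshold makes the criterion stricter, which is why metrics such as mAP@0.5 state the IoU value they were computed at.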

Region-Based CNN
The original R-CNN model [22] was introduced in 2014, and it has been a crucial development in object detection systems, providing higher accuracy than earlier approaches. Instead of applying a CNN exhaustively across the image, the R-CNN model proposes a limited set of RoIs and applies the convolutional network to each of these regions. R-CNN uses a heuristic search technique (selective search) to identify candidate regions; the convolutional network receives these scaled regions of interest and generates a feature matrix for each. The presence of an object is then assessed by an SVM using this feature matrix. In the final stage, bounding-box offsets are estimated to fine-tune and correct the position of the bounding box.
The main drawback of the R-CNN model is its computational cost. Training and running the network are very time consuming, as each image may require the classification of up to 2000 distinct regions of interest, making real-time recognition virtually impossible. Additionally, the selective search algorithm is fixed and involves no learning, so it cannot learn to provide more appropriate proposals over time. The remainder of this section provides an overview of the successors of R-CNN.

Fast R-CNN
The Fast R-CNN [23] was developed to address the limitations of the R-CNN. In this model, the complete input image is given to the CNN, which generates a feature matrix. The RoIs are then located on this matrix and passed through an RoI pooling layer, which scales each of them to the same fixed size. A fully connected network is then used to classify the object and refine the bounding box. The Fast R-CNN model is faster than the R-CNN model because the convolutional network is applied to the entire image rather than to each region separately. Training is about ten times faster while consuming the same amount of hardware resources.
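To make the role of the RoI pooling layer more concrete, the sketch below max-pools an arbitrary rectangular region of a feature map into a fixed-size grid. It is a simplified illustration, not the Fast R-CNN implementation: it assumes a single-channel feature map stored as a list of lists, RoI coordinates already expressed in feature-map cells with exclusive end indices, and a 2 × 2 output.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region roi = (row0, col0, row1, col1) to out_h x out_w.

    Rows row0..row1-1 and columns col0..col1-1 belong to the RoI.
    """
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        # Bin i spans [floor(i*h/out_h), ceil((i+1)*h/out_h)) within the RoI,
        # so every bin is non-empty and the bins cover the whole region.
        rs = r0 + (i * h) // out_h
        re = r0 + ((i + 1) * h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            cs = c0 + (j * w) // out_w
            ce = c0 + ((j + 1) * w + out_w - 1) // out_w
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled


# Hypothetical 4 x 6 feature map with two RoIs of different shapes.
fm = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
print(roi_max_pool(fm, (0, 0, 4, 4)))  # 4 x 4 region
print(roi_max_pool(fm, (1, 2, 4, 6)))  # 3 x 4 region
```

Both regions produce a 2 × 2 output even though their sizes differ, which is what allows the subsequent fully connected layers to accept RoIs of any shape.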

Faster R-CNN
The Faster R-CNN model [35] is another addition to the R-CNN family and is designed for real-time object detection. Unlike the R-CNN and Fast R-CNN models, the Faster R-CNN model replaces the selective search technique with a Region Proposal Network (RPN) to propose candidate object locations. The input image is sent through the convolutional network, and the RPN is used to identify the RoIs. The RoI pooling layer is then used to scale the RoIs, and the final step of object categorization and bounding box refinement is identical to that of the Fast R-CNN model. The Faster R-CNN model is up to ten times faster than the Fast R-CNN model, making it suitable for real-time applications.

Mask R-CNN
Mask R-CNN is a deep learning model that extends the Faster R-CNN architecture by adding another output, the object mask. This model was introduced by He et al. [36] in 2017 and has since become a popular model for object detection and instance segmentation tasks.
One of the key innovations of Mask R-CNN is the use of the RoI Align layer, which replaces the RoI pooling layer used in Faster R-CNN. The RoI Align layer helps to prevent misalignment at the pixel level, which leads to a more precise spatial localization of the object. In Mask R-CNN, the two-stage procedure is maintained, with the same RPN being used in the first stage. In the second stage, Mask R-CNN outputs not only the class and bounding box predictions but also a binary mask for each RoI. This mask represents the spatial layout of the object, and its pixel-to-pixel correspondence provides a natural way to extract the object's spatial structure. However, one of the main limitations of Mask R-CNN is its high computational cost, as it requires additional computations to generate the object masks.

Cascade R-CNN
Cascade R-CNN is an object detection model that improves the performance of the Faster R-CNN model. The model was introduced by Cai and Vasconcelos [37] in 2018 and addresses the issue of missed detections and false positives in Faster R-CNN by adding a series of detection heads with increasing Intersection Over Union (IoU) thresholds to the Faster R-CNN model. In Cascade R-CNN, each detection head is trained with a different IoU threshold, and the output of one detection head is used as the input for the next head. This cascading approach helps to reduce the number of false positives and improve the accuracy of object detection.
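The effect of training successive heads with increasing IoU thresholds can be illustrated with a small sketch. The thresholds (0.5, 0.6, 0.7) and the per-stage IoU values below are assumptions made for this example; the sketch shows only how the same box is labelled at each stage, not how the heads refine it.

```python
# Assumed IoU thresholds for three successive detection heads.
STAGE_IOU_THRESHOLDS = (0.5, 0.6, 0.7)


def stage_labels(ious_per_stage, thresholds=STAGE_IOU_THRESHOLDS):
    """Return, for each stage, whether the box entering it is a positive sample.

    ious_per_stage[k] is the IoU between the box entering stage k and its
    ground-truth box.
    """
    return [iou >= t for iou, t in zip(ious_per_stage, thresholds)]


# A box that each head refines a little stays positive at every stage ...
print(stage_labels([0.55, 0.68, 0.74]))
# ... while a box that is not refined is rejected by the stricter heads.
print(stage_labels([0.55, 0.55, 0.55]))
```

Because later heads only see boxes already improved by earlier ones, they can be trained with stricter thresholds without running out of positive samples.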
However, one of the main limitations of Cascade R-CNN is its increased complexity, as it requires multiple detection heads to be trained and deployed, which increases the computational cost and reduces the inference speed. In addition, the model may not perform well on smaller objects, as it relies on the output of previous heads, which may not detect small objects accurately.

Sparse R-CNN
Sparse R-CNN is an object detection model that is based on a predefined number of proposal boxes, which allows for more efficient processing of the input image. The model was introduced by Sun et al. [38] in 2021 and is designed to address the limitations of existing object detection models that rely on dense candidate boxes or region proposal networks. Instead of using dense candidate boxes, as used in single-stage architectures, or dense-to-sparse candidate boxes obtained using an RPN in two-stage architectures, the Sparse R-CNN utilizes learnable proposal boxes that are updated iteratively during training. This approach reduces the computational complexity of the model, allowing for faster processing times. Sparse R-CNN uses an FPN based on the ResNet backbone, along with a dynamic instance interactive head. The authors showed that the Sparse R-CNN model performs comparably to well-established detector baselines, achieving 45.0 AP on the COCO dataset with a 3× training schedule and ResNet-50 FPN, and an inference speed of 22 FPS.
As seen in Table 2, all five models (Fast, Faster, Mask, Cascade, and Sparse R-CNNs) offer unique improvements and advancements in object detection. However, each model also faces its own set of challenges and limitations. The Fast R-CNN improved accuracy by using a selective search for object proposals, but it suffered from high computational complexity. The Faster R-CNN improved speed and efficiency by incorporating the RPN, but it had limited scalability and precision. The Mask R-CNN increased accuracy with its binary mask output, but it also had a more complex architecture and limited scalability. The Cascade R-CNN improved accuracy with its series of detection heads, but it had limited scalability and could miss some objects. The Sparse R-CNN improved efficiency with its learnable proposal boxes, but it had limitations in accuracy for smaller objects [38].

Table 2. Comparison of R-CNN models.

| Model | Key Features | Improvements | Issues |
|---|---|---|---|
| Fast R-CNN | Utilizes selective search for object proposals. | Improved detection accuracy and reduced computation time. | Reliant on the quality of the object proposals. |
| Faster R-CNN | Integrates object proposal generation into the network. | Increased speed and accuracy compared to Fast R-CNN. | Limited to fixed-sized object proposals. |
| Mask R-CNN | Adds object mask output to the bounding box and class predictions. | Improved object detection with the more precise spatial distribution of the object. | Requires additional computation for the mask output. |
| Cascade R-CNN | Employs a series of detection heads with increasing IoU thresholds. | Improved accuracy by addressing the issue of missed detections. | Increased computation time compared to Faster R-CNN. |
| Sparse R-CNN | Uses predefined, learnable proposal boxes. | Improved detection accuracy and reduced computation time compared to other R-CNN models. | Limited to the number of proposal boxes defined. |

YOLO
The YOLO model, introduced by Joseph Redmon [24], is a state-of-the-art object detection algorithm that offers real-time object detection capabilities. Unlike traditional object detection methods that require multiple scans of an image, YOLO detects objects in a single forward pass through a neural network. In YOLO, the input image is divided into a grid, and a set of bounding boxes is predicted for each grid cell to indicate the presence and location of objects within that cell. Each bounding box carries its location information together with a probability score indicating the likelihood that it contains an object of a specific class. The final predictions are made by selecting bounding boxes with a probability score higher than a predefined threshold. The YOLO framework comprises three main components:
• Backbone extracts high-level features from the input image;
• Neck generates feature pyramids to enable multi-scale object detection;
• Head performs the final detection by applying anchor boxes and generating the output vectors with class probabilities, scores, and bounding boxes.
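The grid assignment and score thresholding described above can be sketched in a few lines. This is a minimal illustration rather than the YOLO implementation: the function names, the dictionary layout of a detection, the 7 × 7 grid, and the 0.25 threshold are all assumptions chosen for the example.

```python
def responsible_cell(cx, cy, img_w, img_h, grid_size):
    """Return (row, col) of the grid cell containing the box centre (cx, cy)."""
    col = min(int(cx / img_w * grid_size), grid_size - 1)
    row = min(int(cy / img_h * grid_size), grid_size - 1)
    return row, col


def filter_detections(detections, score_threshold=0.25):
    """Keep only detections whose score reaches the threshold."""
    return [d for d in detections if d["score"] >= score_threshold]


# Hypothetical raw predictions for a 640 x 480 image.
raw = [
    {"box": (300, 220, 340, 260), "score": 0.91, "label": "pothole"},
    {"box": (10, 400, 60, 430), "score": 0.12, "label": "pothole"},
]
kept = filter_detections(raw)
print(len(kept))
print(responsible_cell(320, 240, 640, 480, grid_size=7))
```

In practice, overlapping boxes that pass the threshold are usually merged or suppressed in a further post-processing step, which this sketch omits.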
YOLO's single-pass object detection strategy results in faster inference times compared to traditional object detection algorithms such as R-CNN. However, one of the limitations of YOLO is its difficulty in detecting smaller objects in an image. Over the years, several variants of YOLO have been proposed, including YOLOv3 [39], YOLOv4 [40], YOLOv5 [41], YOLOv6 [42], and YOLOv7 [43], which differ in the backbone, neck, head, and loss function used, as shown in Table 3. These variants offer improved performance and accuracy, making YOLO a powerful tool for object detection in various applications.

YOLOv3
YOLOv3 [39] has a 106-layer architecture, including 53 convolutional layers. This model has nearly tripled the number of CNN layers compared to YOLOv2. This allows for the detection of objects on three different scales, which is a major improvement over the previous version. One of the key features of YOLOv3 is its ability to upsample or downsample the incoming image, which determines the different scales at which objects can be detected. This resampling leads to the prediction of roughly ten times more bounding boxes than was possible with YOLOv2, sacrificing some detection speed for improved accuracy. Additionally, YOLOv3 includes skip connections and upsampling, which further improve its ability to detect small objects. Another notable difference between YOLOv3 and its predecessor is the method used for class prediction. While earlier versions used a softmax function, YOLOv3 employs independent logistic classifiers. This allows a single object to belong to multiple classes, making the system more flexible and accurate.
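The difference between softmax and independent logistic classifiers can be seen on a small example. The sketch below is only an illustration: the three class logits are hypothetical values, and the 0.5 decision threshold is an assumption.

```python
import math


def softmax(logits):
    """Class probabilities that compete with each other and sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def independent_logistic(logits):
    """One sigmoid per class; the scores do not have to sum to 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Hypothetical logits for three overlapping classes of one detected object.
logits = [2.0, 1.5, -3.0]

print([round(p, 3) for p in softmax(logits)])
print([round(p, 3) for p in independent_logistic(logits)])
```

With softmax, only the first class exceeds 0.5 because the classes compete for probability mass. With independent logistic classifiers, the first two classes both exceed 0.5, so the object can carry both labels.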

YOLOv4
YOLOv4 [40] brings several significant improvements over its predecessor, YOLOv3. The most notable change is the integration of the Cross-Stage Partial Network (CSP-Net) into the Darknet architecture, creating a new backbone feature extractor known as CSP-Darknet53. The convolutional architecture used in YOLOv4 is based on a modified version of the DenseNet model, which has several advantages over traditional models. These advantages include improved gradient stability, facilitated back-propagation, reduced computational requirements, and enhanced learning. The neck in YOLOv4 is a combination of SPP [44] and the Path Aggregation Network (PANet) [45], which enhances the detection process by increasing the receptive field and separating out the most significant features. This results in improved accuracy and reduced computational time. YOLOv4 also utilizes a "Bag of Freebies" and a "Bag of Specialties," which have been shown to significantly improve the performance of the algorithm. The "Bag of Specialties" includes features such as the Mish activation function and a modified PANet, which were not available in previous versions.

YOLOv5
YOLOv5 [41] offers significant improvements over its predecessor, YOLOv4. At the core of YOLOv5 is the combination of the CSP-Darknet53 network and the Focus layer, which form an efficient and highly accurate backbone. This backbone solves the problem of repeated gradient information in large networks, resulting in faster inference times, improved accuracy, and reduced model size. One of the key changes in YOLOv5 is the replacement of the BottleneckCSP block with the C3 block, which features three convolutional layers and enhances the information flow. PANet acts as the neck, improving the precision of object localization, especially for tiny and large objects. The SPP layer helps overcome the limitation of a fixed network input size, and the use of the SiLU activation function instead of the Mish activation function adds to the efficiency of the system. Another innovative feature of YOLOv5 is its use of mosaic augmentation, where the network is trained on images composed of four training photos. This allows for smaller training batch sizes and enables the use of GPUs with lower memory. YOLOv5 is also offered in five architecture sizes: Nano, Small, Medium, Large, and XLarge. Each architecture is designed to meet different size, speed, and precision requirements.
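For reference, the two activation functions mentioned above can be written out explicitly using their standard definitions: SiLU(x) = x · sigmoid(x) and Mish(x) = x · tanh(softplus(x)). The sketch below is a plain-Python illustration rather than code from either YOLO version; the helper names are assumptions.

```python
import math


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def softplus(x):
    """log(1 + exp(x)), written to avoid overflow for large x."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def silu(x):
    """SiLU (also called swish): x * sigmoid(x)."""
    return x * sigmoid(x)


def mish(x):
    """Mish: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for x in (-2.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  silu={silu(x):+.4f}  mish={mish(x):+.4f}")
```

Both functions are smooth and allow small negative outputs. SiLU needs one exponential per evaluation, whereas Mish additionally needs a logarithm and a hyperbolic tangent, which is one plausible reason for the efficiency gain mentioned above.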

YOLOv6
YOLOv6 [42] has been designed to ensure optimal performance and to accommodate the hardware's capabilities, specifically the CPU power and memory bandwidth. To accomplish this, the backbone and neck architecture of YOLOv6 were re-designed utilizing the Rep-PANet and EfficientRep technology, making it faster and more efficient. One of the major improvements in YOLOv6 is its decoupled head architecture, which allows it to detect objects at double the speed of the previous version, YOLOv5, without sacrificing accuracy. Eliminating the features shared between classification and box regression allows YOLOv6 to be more precise in object detection, resulting in fewer false-positive and false-negative results. Additionally, YOLOv6 adopts an anchor-free paradigm, which improves detection accuracy. The SimOTA label assignment procedure and SIoU bounding box regression loss further enhance the accuracy of the object detection system. At present, YOLOv6 is available in three different sizes, Nano, Tiny, and Small, to cater to a wide range of applications, from small edge devices to high-performance systems.

YOLOv7
YOLOv7 [43] is the latest advancement in object detection algorithms. YOLOv7 brings several important improvements that make it a more powerful tool for object detection. One of the major differences between YOLOv7 and YOLOv6 is the backbone network. YOLOv7 uses an Extended Efficient Layer Aggregation Network (E-ELAN) instead of the EfficientRep used in YOLOv6. The E-ELAN makes use of the cardinality of expand, shuffle, and merge operations to continually enhance the network's learning potential while preserving the gradient path. This leads to improved accuracy and speed in object detection tasks. Another improvement in YOLOv7 is the use of the FPN and PANet in the neck. This results in an FPN-PANet structure that combines the best of both networks. The primary head classifies objects, while the secondary heads train in the middle layers, leading to improved accuracy and faster detection times.
In addition to the improved network architecture, YOLOv7 also utilizes Neural Architecture Search (NAS) instruments to optimize the scaling of models for deployment. This NAS-based approach increases scalability and robustness, as the NAS algorithm iteratively identifies the optimal scaling factors based on resolution, width, depth, and stage (number of feature pyramids). Finally, YOLOv7 also employs module-level re-parameterization, where different segments of the model are optimized using exclusive strategies. The gradient flow propagation channels determine which segments of the model require recalibration, leading to improved robustness and accuracy.

Methodology
In this study, an overview of current visual data-based pothole detection is performed. According to the findings obtained, the R-CNN and YOLO models were widely used architectures within the related literature.
As for the evaluation part, selected models from the R-CNN and YOLO families were tested for their detection performance on a pothole dataset. We investigated to what extent adverse lighting and weather affect the accuracy of object detectors. The research objective was formulated as follows: does the current state of the art have adequate tools to effectively handle deviations from clear conditions?

Experimental Setting
The proposed models in this study were tested under laboratory conditions, and the results were analysed both quantitatively and qualitatively. The operating system utilized was Windows 10, and the tools and environment used for deep learning model training included Anaconda3, CUDA v11.3 with a model acceleration training function, cuDNN8, and PyTorch 1.11.0 as the training framework. The hardware used was a high-performance CPU: Intel Core i9 12900HX Alder Lake and a powerful GPU: NVIDIA GeForce RTX 3090 Ti 24GB.
The transfer learning technique, which plays a crucial role in deep learning, was employed in this study. The technique leverages information from a previously trained network (pre-trained weights) and applies it to related problems. Weights pre-trained on the ImageNet and COCO datasets, which are widely used for object detection, were used to initialize model training in this study. The R-CNN models used in this study are available in [46]. Each R-CNN model utilizes a ResNet-50 backbone for feature extraction. Furthermore, publicly available implementations of the YOLO architecture were utilized, and larger models were selected to match the size of the R-CNN network. A visual representation of the workflow of the detection process is given in Figure 2. It should be noted that the annotations were converted to COCO format for the training of R-CNN models.

Before training, each detector required the configuration of a set of hyperparameters.
The default values were used as a starting point and several models were trained by changing one or more of the main hyperparameters. The models with converged loss values and optimal performance on the validation set were selected for this research. The training process parameters of the R-CNN and YOLO models are presented in Table 4. The SGD optimizer with momentum was utilized for all models. The Early Stopping function automatically halted the training process when the validation loss did not improve for five consecutive epochs.
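The early stopping rule described above can be sketched as a small helper. This is a framework-independent illustration, not the function used in the experiments; the class name, the `min_delta` tolerance, and the example loss values are hypothetical.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta  # hypothetical tolerance, not from the source
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


if __name__ == "__main__":
    # Made-up validation losses, for illustration only.
    losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.66, 0.68, 0.69, 0.70]
    stopper = EarlyStopping(patience=5)
    for epoch, loss in enumerate(losses, start=1):
        if stopper.step(loss):
            print(f"stopping after epoch {epoch}, best val loss {stopper.best_loss:.2f}")
            break
```

In a real training loop, `step` would be called once per epoch with the measured validation loss, and the weights from the best epoch would typically be kept.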

Dataset
In the field of pothole detection, it is of the utmost importance to use datasets that closely simulate real-world scenarios. This typically involves a camera mounted on a moving vehicle that captures images while on the road in adverse weather conditions. Unfortunately, the existing datasets for pothole recognition fall short of replicating these conditions. To tackle this issue, we created a database that is accessible in [47].
Our image data collection showcases a wide range of potholes, including those that pose significant road hazards. The images with a resolution of 1920 × 1080 were captured from a single stretch of road located in an industrial area of a city, known for its inadequate road conditions, over a period of three months (May, June, and July). This timeline allows for the capture of the evolving surroundings, such as the presence of pedestrians, passing or stationary vehicles, and changes in road conditions. Each image is stamped with an exact date and time of capture.
The dataset also includes images of manhole covers, which can pose a challenge to the model as they can be easily mistaken for potholes due to their circular shape. Including them helps to improve the generalization capability of the computer vision model and reduce the likelihood of false positives. The annotated collection of 1052 photos was captured during clear weather conditions, but the dataset also covers four unfavourable weather conditions: rain, sunset, evening, and night. These conditions pose a challenge to the detection models; thus, the availability of a realistic dataset is crucial for the development of robust and reliable pothole detection systems. Figure 3 provides a visual representation of the real-world scenarios, while Table 5 summarizes the statistics of the dataset. In addition to the number of images in each subset, statistics include the number of occurrences of potholes and manhole covers in each of the two categories. For training purposes, the clear dataset was split into training, test, and validation subsets in a 70:15:15 ratio. The data from the remaining adverse conditions were then used for testing only.
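A 70:15:15 split of this kind can be sketched as follows. The function name, the fixed seed, the rounding rule, and the file names are assumptions made for illustration; the exact procedure used to split the dataset is not described in the text.

```python
import random


def split_dataset(items, ratios=(0.70, 0.15, 0.15), seed=0):
    """Shuffle `items` and split them into train, test, and validation lists."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_test = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]  # remainder, so no image is dropped
    return train, test, val


if __name__ == "__main__":
    # Hypothetical file names standing in for the 1052 clear-weather images.
    images = [f"img_{i:04d}.jpg" for i in range(1052)]
    train, test, val = split_dataset(images)
    print(len(train), len(test), len(val))
```

With this rounding rule, 1052 images yield subsets of 736, 158, and 158 images; the actual subset sizes reported in Table 5 may differ slightly.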

Dataset Augmentation
Data augmentation is a crucial technique in deep learning to mitigate the impact of limited dataset size and prevent overfitting. By artificially increasing the size of the dataset, data augmentation helps models to learn and generalize better, leading to improved accuracy. In the case of YOLO and R-CNN, various data augmentation techniques were used to improve the performance of the models. In YOLO, the unique technique of mosaic augmentation was employed. This involves combining multiple randomly cropped images into a grid, creating a more diverse set of data for the model to learn from. The following parameters were used for data augmentation in YOLO training: scale factor of 0.5, shear of 0.5, up and down flip of 0.2, left-right flip of 0.5, mosaic of 1, and translation of 0.1. Similarly, R-CNN performance was enhanced through the use of augmentation techniques, including flipping and translation. The parameters used were a flip factor of 0.5, a vertical flip factor of 0.008, and a horizontal flip factor of 0.5. These augmentations helped R-CNN and YOLO to avoid overfitting and generalize better to new data, leading to improved accuracy.
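The idea behind mosaic augmentation can be illustrated with a deliberately simplified sketch that stitches random crops of four images into a 2 × 2 grid. Images are represented as plain nested lists of pixel values, and the function names are hypothetical. Practical implementations also rescale the images, pick a random mosaic centre, and remap the bounding-box annotations, none of which is shown here.

```python
import random


def random_crop(image, height, width, rng):
    """Return a random height x width crop of a row-major 2D image."""
    top = rng.randint(0, len(image) - height)
    left = rng.randint(0, len(image[0]) - width)
    return [row[left:left + width] for row in image[top:top + height]]


def mosaic(images, tile_height, tile_width, rng=None):
    """Stitch random crops of four images into one 2 x 2 mosaic."""
    if len(images) != 4:
        raise ValueError("mosaic expects exactly four images")
    rng = rng or random.Random()
    # One crop per quadrant: top-left, top-right, bottom-left, bottom-right.
    tl, tr, bl, br = (random_crop(im, tile_height, tile_width, rng) for im in images)
    top = [left + right for left, right in zip(tl, tr)]
    bottom = [left + right for left, right in zip(bl, br)]
    return top + bottom


if __name__ == "__main__":
    # Four constant-valued 6 x 8 "images" make the quadrants easy to see.
    sources = [[[k] * 8 for _ in range(6)] for k in range(4)]
    for row in mosaic(sources, tile_height=3, tile_width=4, rng=random.Random(0)):
        print(row)
```

Because each training sample then contains content from four different scenes, a single batch exposes the model to more context, which is consistent with the smaller batch sizes mentioned for YOLOv5.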

Evaluation Metrics
The accuracy of an object detector can be evaluated according to a variety of parameters. The most typical assessment metrics include precision, recall, and Average Precision (AP) or mAP. The AP metric is intended to offer a reliable and consistent assessment of the categorization and detection processes. In addition, the frame rate is a key indicator of the speed of the object detector; therefore, it is important to pay attention to both metrics. To describe the mAP metric, it is important to define the following terms first:

•
Precision is the ratio of successfully recognized occurrences, denoted as True Positives (TPs), to all positively detected instances (TPs + False Positives (FPs)), as shown in Figure 4. Recall is defined as the ratio of correctly recognized instances to all tested instances (TPs + False Negatives (FNs)).

•
The Intersection over Union (IoU) measures the amount of overlap between the predicted and ground truth bounding boxes. A detection is counted as a TP when the overlap between the two boxes is greater than a particular threshold. An FP occurs when the overlap falls below that threshold. An FN indicates a ground truth object for which no correct detection was made.
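The definitions above can be sketched in a few lines. The greedy matching by descending confidence is a common convention and is assumed here rather than taken from the evaluation code used in this study; the function names and the box layout (x_min, y_min, x_max, y_max) are likewise illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_matches(predictions, ground_truths, iou_threshold=0.5):
    """Greedily match predictions to ground truths; return (TP, FP, FN).

    `predictions` is a list of (box, confidence) pairs for one image and one class.
    """
    unmatched = list(ground_truths)
    tp = fp = 0
    # Higher-confidence predictions get the first pick of ground truth boxes.
    for box, _score in sorted(predictions, key=lambda p: p[1], reverse=True):
        best = max(unmatched, key=lambda gt: iou(box, gt), default=None)
        if best is not None and iou(box, best) > iou_threshold:
            unmatched.remove(best)  # each ground truth can be matched only once
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched)


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

For example, with two ground truth boxes and two predictions of which only one overlaps a ground truth by more than the threshold, the counts are TP = 1, FP = 1, FN = 1, giving a precision and recall of 0.5 each.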
The accuracy of the prediction can be determined by using the precision/recall (PR) curve and the Area Under Curve (AUC). The AP measure that was decided upon for The PASCAL Visual Object Classes Challenge 2010 [48] is calculated from the PR curve through the process of interpolating the precision at eleven different recall levels [0, 0.1, . . . , 1] (Equation (1)). ρ_interp(r) specifies the precision at each recall level, and it is interpolated by taking the greatest precision measured for which the corresponding recall is greater than r.
The abbreviation mAP stands for "mean average precision," which is determined by averaging the AP over all n categories (Equation (2)). The mAP is typically evaluated with different values of the IoU threshold, e.g., mAP@.5 means the mean of AP calculated for each class with an IoU threshold > 0.5; if not given, mAP is calculated across all IoU threshold values. The mAP@[0.5:0.95] is a popular measure that provides the most information about the quality of object identification and it was used as part of the COCO Detection Challenge [49]. It refers to the mAP averaged over IoU thresholds between 0.5 and 0.95 in steps of 0.05. A high score for this metric indicates that the model can identify and correctly classify visual objects.
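The eleven-point interpolation and the averaging over classes can be sketched as follows. The function names are hypothetical, and the interpolation takes the greatest precision among points whose recall is at least r, which is the usual reading of the PASCAL VOC definition and is an assumption here. The PR points in the example are made up.

```python
def interpolated_precision(recalls, precisions, r):
    """Greatest precision among PR points whose recall is at least r (0 if none)."""
    candidates = [p for rec, p in zip(recalls, precisions) if rec >= r]
    return max(candidates, default=0.0)


def average_precision_11pt(recalls, precisions):
    """AP as the mean interpolated precision at recall 0, 0.1, ..., 1."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(recalls, precisions, r) for r in levels) / 11


def mean_average_precision(ap_per_class):
    """mAP as the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    # Made-up PR curve with two operating points, for illustration only.
    recalls = [0.5, 1.0]
    precisions = [1.0, 0.5]
    ap = average_precision_11pt(recalls, precisions)
    print(f"AP = {ap:.4f}")
```

In this toy example, the interpolated precision is 1.0 at the six recall levels up to 0.5 and 0.5 at the remaining five, so AP = 8.5 / 11. For the dataset in this study, the mAP would average the AP values of the two categories, potholes and manhole covers.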

Results and Discussion
The comparison between R-CNN and YOLO models has been a topic of interest in the field of object detection for a long time. Both families of models have been widely used for object detection tasks, but each has its strengths and weaknesses. In this research, we evaluate the performance of R-CNN and YOLO models on a database that contains different weather conditions (clear, rainy, sunset, evening, and night).

Performance of R-CNN
According to the evaluation results shown in Table 6, the precision, recall, and mAP scores of the R-CNN-based models decrease as the weather and lighting conditions become more difficult. In clear weather, all models performed relatively well, with precision and recall scores ranging from 0.475 to 0.711, mAP@.5 scores ranging from 0.672 to 0.746, and mAP@[0.5:0.95] scores ranging from 0.269 to 0.338. The Faster R-CNN, Cascade R-CNN, and Sparse R-CNN models are leading in terms of mAP success rates, with the Faster R-CNN and Sparse R-CNN having the highest precision score of 0.692 and the Cascade R-CNN having the highest recall score of 0.552. They also showed the largest performance improvement compared to the standard R-CNN model. Faster R-CNN improved by 5% in precision and 4.8% in recall, Cascade R-CNN improved by 6.9% in precision and 9.7% in recall, and Sparse R-CNN improved by 5% in precision and 8.7% in recall.
Under more challenging conditions, such as rain and the changing light of the late hours of the day, R-CNN models saw a significant drop in performance. As models improve and their detection accuracy under clear conditions increases, the gap in performance between clear and worsened conditions may also increase. However, there are some exceptions, especially with the newer architectures, such as Cascade and Sparse R-CNN. Each model responds differently to adverse road conditions, but a common phenomenon was a significant decrease in detection accuracy under night conditions. Compared to the models' performance under clear conditions, mAP@.5 decreased, on average, by 26%, 23.5%, 20.3%, and 41.2% for the rain, sunset, evening, and night subsets, respectively.
The continuous change in the mAP accuracy throughout the different versions of R-CNN models is marked in Table 6, while the accuracy of the basic R-CNN model is considered as the baseline starting point. The difference in accuracy is calculated considering the current highest value achieved for individual data subsets.
A comparison of the overall performance of the models, when the mean value of accuracy across all conditions is calculated, is shown in Table 7. We can see that the various object detection models function differently. To better understand the strengths and weaknesses of each model, we will compare them in several ways, including model size, number of parameters, inference time, and performance.

•
Mask and Sparse R-CNN, whose size is more than double that of R-CNN, can achieve both short inference time and higher detection performance. While a multi-stage object detector, Cascade R-CNN, can provide more accurate results, the processing time of the inference is still high.

•
Parameters: The number of parameters is another critical factor in determining the model's performance. A model with more parameters has the potential to learn more complex relationships in the data, which can lead to better performance. However, having more parameters also means that the model is more prone to overfitting and requires more computational resources during both training and inference. In this comparison, Cascade R-CNN has the largest number of parameters, with 107 million, followed by Sparse R-CNN with 77.8 million and Mask R-CNN with 64.1 million parameters. On the other hand, Fast and Faster R-CNN are some of the smallest models, with 42.5 million and 53.2 million parameters, respectively.

•
Performance: To evaluate the ability of models to cope with all adverse conditions, the mean detection performance over all conditions is considered in this comparison. Sparse R-CNN has the highest overall accuracy, with a mean precision of 55.4%, a mean recall of 46.6%, and the highest mean mAP@.5 value of 51.4%. The differences in performance can be attributed to the architectures of the models and the methods they use to extract features from the input image.
Overall, the results in Table 7 suggest that the different models have their strengths and weaknesses, and the choice of model will depend on the specific requirements of the task at hand. For instance, if accuracy is a priority, Cascade R-CNN and Sparse R-CNN are suitable models, while if inference speed is a concern, Mask R-CNN with the lowest inference time is more suitable. Ultimately, Sparse R-CNN provides both the highest accuracy and a low inference time. These results could be because of the differences in the architectures and optimization techniques used in each of these models. The Cascade R-CNN model has a multi-stage architecture, which enables it to refine its predictions at each stage, resulting in improved performance. On the other hand, the Sparse R-CNN model uses a sparse proposal generation technique, which helps it to be more efficient and faster, leading to better performance.

Performance of YOLO
Table 8 shows a comparison of the performance of different YOLO object detection models. As we can see, the latest models (YOLOv6 and YOLOv7) perform better than the older ones (YOLOv3 and YOLOv4) in terms of precision, recall, and mAP across different weather conditions, with improvements observed in the worsened lighting conditions.
Adverse weather and reduced light have a natural impact on camera sensing and subsequent object detection performance. As in the previous case of R-CNN models, the performance of YOLO models degraded when subjected to more difficult visual conditions. Despite the detection accuracy being continuously improved in clear settings, the performance difference between clear and deteriorated conditions may also widen. The mAP@.5 for rain, sunset, evening, and night subsets decreased, on average, by 28.8%, 26.2%, 32.11%, and 64.5% in comparison to clear conditions.
The continuous change in the mAP accuracy throughout the different versions of YOLO models is marked in Table 8. The accuracy of the YOLOv3 model is considered the baseline starting point. The difference in accuracy is calculated considering the current highest value achieved for individual data subsets.
As can be seen from Table 8, each new version of YOLO comes with additional improvements in the mAP@.5 measure for every data subset. However, YOLO's detection performance rapidly declines at night. Whereas R-CNN models reach up to 30% mAP@.5 at night, the YOLO models do not exceed 20% mAP@.5.
A comparison of the overall performance of the models, when the mean value of accuracy across all conditions is calculated, is shown in Table 9. The accuracy of the pothole detection task increases with the newer versions of the YOLO architecture. In addition, we compare individual YOLO models from several perspectives, including model size, number of parameters, inference time, and performance. The YOLOv7 model has the best performance, the smallest size, and the fastest inference time, which makes it suitable for applications with computational constraints. This model has shown significant improvements in all aspects compared to other models, making it a strong contender for various object detection applications. It is important to note that the YOLOv5m and YOLOv6m models also showed impressive results, particularly in the areas of accuracy, model size, and inference time.

Comparison of R-CNN and YOLO Results
Based on the performance metrics presented in Sections 5.1 and 5.2, we can draw several conclusions about the relative strengths and weaknesses of R-CNN and YOLO for the task of pothole detection. The results demonstrate that YOLO models outperform R-CNN models in terms of accuracy and speed of inference. Figure 5 shows the detection accuracy of the best-performing YOLO and R-CNN architectures. As can be seen, the models from the R-CNN group achieved higher results (up to 30% mAP@.5) on the night data subset, which presents the worst visibility conditions. In conclusion, the overall accuracy represented by Mean_mAP@.5 shows that R-CNN models are competitive regarding the detection task under adverse visual conditions. A visual comparison of models' accuracy versus the number of parameters and model size is shown in Figure 6. YOLOv7 is the most efficient model, with a Mean_mAP@.5 of 55.6%. The best-performing R-CNN model, Sparse R-CNN, achieved a Mean_mAP@.5 of 51.4%. YOLOv7 has a smaller model size of 35.5 MB and requires only 36.9 million parameters, which is approximately a 91.5% and 52.6% reduction compared to the Sparse R-CNN model, which has a size of 415.4 MB and 77.8 million parameters. This makes YOLOv7 highly effective for real-time object detection in applications such as pothole detection.
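The quoted reductions in model size and parameter count follow from a simple relative-difference calculation, sketched below with the figures stated in the text. The function name is chosen here for illustration.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is smaller than `baseline`."""
    return 100.0 * (baseline - value) / baseline


if __name__ == "__main__":
    # Figures quoted in the text: Sparse R-CNN (baseline) vs. YOLOv7.
    size_cut = relative_reduction(415.4, 35.5)   # model size in MB
    param_cut = relative_reduction(77.8, 36.9)   # parameters in millions
    print(f"size: {size_cut:.1f}%  parameters: {param_cut:.1f}%")
```

Rounded to one decimal place, this reproduces the 91.5% and 52.6% reductions reported above.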
When comparing YOLOv7 to R-CNN models in terms of detection speed, the percentage improvement is substantial. The YOLOv7 model achieved an inference time of 78.3 ms, which is approximately a 46.1% reduction compared to the Sparse R-CNN model, with an inference time of 146.4 ms. Additionally, with each new version of the YOLO model, performance continues to improve.
Interestingly, the YOLOv5m model has a higher Mean_mAP@.5 (53.0%) than all the R-CNN models, while also having a smaller number of parameters and model size than most R-CNN as well as some YOLO models. These results demonstrate the efficiency and effectiveness of the YOLO architecture in the pothole detection task, particularly with its latest versions of YOLOv5, YOLOv6, and YOLOv7.
Although R-CNN-based models have significantly more parameters, a larger model size, and longer inference times, they can extract useful features that contribute to the improvement in pothole detection, especially under the worst visibility at night.
The findings obtained throughout this work have practical implications for pothole detection applications, as YOLOv7 can accurately detect potholes in real time with a low computational cost, making it suitable for deployment on resource-constrained devices. This can potentially lead to more efficient road maintenance and improved road safety. It is important to note that the performance of any object detection algorithm is highly dependent on the quality and size of the dataset used for training, as well as the specific implementation details.

Mitigating Adverse Visual Conditions
Computer vision systems are challenged by low-contrast images, shadows, or adverse weather. Several research works have been conducted to mitigate the effects of adverse visual conditions on camera sensing and subsequent object detection.
Deep learning has shown impressive performance in low-light image enhancement. Techniques, such as multiscale [50,51] and attention feature maps [52], have been successfully utilized. An end-to-end attention-based multi-branch CNN was developed in [53]. It performs denoising and low-light enhancement simultaneously to deal effectively with the colour distortion and noise that also occur in dark images. Consisting of four subnetworks, the system creates an under-exposed attention map to handle the under-exposed image parts, and then the noise map is derived. Both image representations are further used in a multi-branched subnetwork for the image enhancement task. A final separate CNN is utilized for contrast, exposure, and colour improvement. The proposed model outperformed other state-of-the-art methods in terms of enhancement rate, with testing time reported as 0.05 s for the lightweight model version and 0.48 s for the model tested on the SID dataset. Apart from low-light image enhancement, rainy images can also be effectively derained [54,55] or defogged [56,57]. The high-quality image enhancement techniques can be used as a pre-processing step to improve object/pothole detection. However, the computational constraints and overall inference speed should be considered.
Another approach is to perform enhancement and detection jointly in an end-to-end manner. According to [58], a learnable pre-processing module for low-light image enhancement may decrease the accuracy of detection in some cases. The authors of [58] therefore proposed learnable low-light image enhancement implemented jointly with the detection task in a twin architecture. By using information at both the original and enhanced feature levels, an improvement in face detection was achieved. In [59], an image-adaptive YOLO deals with weather-specific information using a fully differentiable image processing (DIP) module to pre-process high-resolution images for input to YOLOv3. The hyperparameters of the DIP module are learned by a separate CNN predictor network. The proposed system performed effectively under foggy and low-light scenarios with an inference time increase of 13 ms over the YOLOv3 baseline.
Improving the internal modules of an existing object detection architecture may increase model robustness and allow for better regulation of the model speed. A Trans-Decoupled YOLO [60], which is designed for small object detection in complex environments, integrates transformer modules with a self-attention mechanism into the YOLOv5 backbone for global contextual feature extraction. Additionally, a decoupled lightweight head for both simplified and more accurate detection was proposed. Improvements of 6.4% (mAP@.5) and 6.8% (mAP@[0.5:0.95]) over YOLOv5 on the TT100K dataset were reported. The Convolutional Block Attention Module (CBAM) is another visual attention-based module, used not only to improve small object detection but also to mitigate adverse weather conditions [61,62]. Inspired by CBAM, the authors of [63] proposed a global attention mechanism with 3D permutation in its channel part and an increased number of convolutional layers in its spatial part.
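As an illustration of the channel and spatial parts mentioned above, the following NumPy sketch follows the commonly described CBAM layout: channel attention from a shared two-layer MLP applied to average- and max-pooled descriptors, followed by spatial attention from a convolution over channel-pooled maps. It is a forward pass only, the weights are supplied by the caller rather than trained, and the function names and array shapes are assumptions. It is not the implementation used in [61-63].

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """feat: (C, H, W); w1: (C // r, C) and w2: (C, C // r) form the shared MLP."""
    avg = feat.mean(axis=(1, 2))  # (C,) average-pooled descriptor
    mx = feat.max(axis=(1, 2))    # (C,) max-pooled descriptor

    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # ReLU hidden layer

    return sigmoid(mlp(avg) + mlp(mx))  # (C,) weights in (0, 1)


def spatial_attention(feat, kernel):
    """kernel: (2, k, k) weights applied to the [mean, max] channel-pooled maps."""
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])  # (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    # windows has shape (2, H, W, k, k): one k x k patch per output position.
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k), axis=(1, 2))
    return sigmoid(np.einsum("chwij,cij->hw", windows, kernel))  # (H, W)


def cbam(feat, w1, w2, kernel):
    """Apply channel attention first, then spatial attention."""
    feat = feat * channel_attention(feat, w1, w2)[:, None, None]
    return feat * spatial_attention(feat, kernel)[None, :, :]
```

The output has the same shape as the input, which is why such a block can be inserted between existing backbone stages without changing the rest of the detector.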
A common approach to improve a model's generalization to adverse conditions is to incorporate images captured under different weather and lighting conditions in the training dataset [9,10]. The available data can also be augmented with synthetically created images or with original images translated into different conditions. For instance, synthetically produced rain data added to training images improved object detection by 21% in [64].
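The augmentation idea can be sketched with a very simple rain overlay. This is a hypothetical, hand-written approximation that draws short slanted streaks and blends them into the image. It is not the rain synthesis used in [64], and all parameter names and default values are assumptions.

```python
import numpy as np


def add_synthetic_rain(image, n_streaks=200, length=12, slant=3,
                       intensity=0.6, seed=None):
    """Overlay light, slightly slanted streaks on an H x W x 3 uint8 image."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    mask = np.zeros((h, w), dtype=np.float64)

    # Random start point for each streak.
    ys = rng.integers(0, h, size=n_streaks)
    xs = rng.integers(0, w, size=n_streaks)

    # Extend every streak downwards, drifting `slant` pixels over its length.
    for step in range(length):
        yy = ys + step
        xx = xs + (step * slant) // length
        inside = (yy < h) & (xx < w)  # drop the parts that leave the frame
        mask[yy[inside], xx[inside]] = 1.0

    rain_colour = 200.0  # light grey, an arbitrary illustrative choice
    alpha = intensity * mask[..., None]
    blended = (1.0 - alpha) * image.astype(np.float64) + alpha * rain_colour
    return np.clip(np.round(blended), 0, 255).astype(np.uint8)
```

Because the overlay does not move or resize anything in the frame, existing bounding-box annotations can be reused unchanged for the augmented copy.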
The influence of adverse visual conditions can also be mitigated by the fusion of different sensors. Although accelerometer-based pothole detection is sensitive to vehicle speed, it is particularly useful for detection in low visibility. For more accurate results, video data were combined with an acceleration sensor for vehicle vibration measurements in public crowdsourcing applications in [65]. As in previous cases, the multi-sensor system can be enhanced with an attention mechanism and enabled to adapt effectively to varying adverse weather conditions [66].
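One simple way to combine the two sensing modalities is late fusion of their confidence scores. The sketch below is an assumption made for illustration, not the method of [65] or [66]: it maps the peak vertical-acceleration deviation in a short window to a score in [0, 1] and merges it with the camera confidence using a noisy-OR rule. The 3.0 m/s^2 threshold is an arbitrary placeholder that would need calibration against vehicle speed.

```python
import numpy as np


def vibration_score(accel_z, threshold=3.0):
    """Map the peak vertical-acceleration deviation (m/s^2) to [0, 1].

    `threshold` is a hypothetical calibration constant, not a value from the study.
    """
    accel_z = np.asarray(accel_z, dtype=np.float64)
    # The median removes the gravity offset without being skewed by the spike.
    peak = np.max(np.abs(accel_z - np.median(accel_z)))
    return float(np.clip(peak / threshold, 0.0, 1.0))


def fuse_noisy_or(p_camera, p_vibration):
    """Noisy-OR: either sensor alone can raise the fused confidence."""
    return 1.0 - (1.0 - p_camera) * (1.0 - p_vibration)
```

With this rule, a weak night-time camera detection is reinforced when the accelerometer registers a jolt, while a flat accelerometer trace leaves the camera confidence unchanged.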

Conclusions
Computer vision techniques have shown promising results in automating pothole detection, but selecting the best model for deployment can be challenging, especially when potholes must be detected under different weather conditions. In this article, we introduced the current state-of-the-art CNNs that are used for pothole detection. The main objective was to compare their performance through experiments and to provide information that can be useful for future research in this field. This study evaluated the effectiveness of different computer vision models, including Fast R-CNN, Faster R-CNN, Mask R-CNN, Cascade R-CNN, Sparse R-CNN, and YOLO versions 3 to 7, for the task of pothole detection under adverse visual conditions, such as rain, sunset, evening, and night. The models' performances were compared in terms of detection accuracy under different weather and lighting conditions.
Our experimental results revealed that YOLOv7, followed by YOLOv6l and YOLOv6m, demonstrated the best performance across all weather conditions. YOLOv5l and YOLOv5m also showed good performance, with slight variations in different weather conditions. These results indicate that YOLO architectures may be the most suitable for pothole detection under adverse visual conditions, such as rain, sunset, and evening. However, it is worth noting that R-CNN models, despite their significant computational costs, proved to be the most suitable for night-time detection. Although YOLO architectures perform detection with significant accuracy and speed, R-CNN models may handle the very low-visibility detection more successfully.
The results showed that the performance of the models was negatively affected by lighting conditions, with night data showing the lowest performance. When compared to the clear subset, the mAP@.5 for the night subset decreased, on average, by 41.2% and 64.5% for R-CNN and YOLO models, respectively. These findings highlight the importance of considering different weather and low-light conditions when selecting object detection models. Our study's contributions may provide valuable information for researchers interested in improving pothole detection performance under adverse visual conditions. The proposed study may also contribute to the development of ITS, which aims to improve road safety and reduce the number of accidents caused by potholes.
Future research could focus on more diverse and weather-specific data augmentation techniques using generative networks. These methods could enable the generation of synthetic data that accurately captures the complexities of different weather conditions, thus improving the generalization capability of the models. Moreover, model modifications, such as self-attention modules (e.g., Transformers, CBAM, GAM) for salient feature extraction, could be incorporated to improve the detection of relatively small objects in challenging conditions. Novel multi-scale features could also be implemented to capture objects at different scales and enhance model performance. Furthermore, incorporating additional information into the object detection pipeline, such as semantic segmentation or depth estimation, could help to further improve the accuracy of object detection in challenging weather conditions. Additionally, it would be useful to extend the dataset to other weather conditions (snowfall, hail, fog) to assess the robustness of the models. Finally, an interesting direction for future research would be to implement the models on hardware platforms for real-world testing. This could involve deploying the models on drones or vehicles to evaluate their effectiveness in detecting objects in real time, which would have significant implications for applications such as autonomous driving and aerial surveillance.