STC-YOLO: Small Object Detection Network for Traffic Signs in Complex Environments

The detection of traffic signs is easily affected by changes in the weather, partial occlusion, and light intensity, which increases the number of potential safety hazards in practical applications of autonomous driving. To address this issue, a new traffic sign dataset, namely the enhanced Tsinghua-Tencent 100K (TT100K) dataset, was constructed; it includes a large number of difficult samples generated using various data augmentation strategies, such as fog, snow, noise, occlusion, and blur. Meanwhile, a small traffic sign detection network based on the YOLOv5 framework (STC-YOLO) was designed for complex environments. In this network, the down-sampling multiple was adjusted, and a small object detection layer was adopted to obtain and transmit richer and more discriminative small object features. Then, a feature extraction module combining a convolutional neural network (CNN) and multi-head attention was designed to break the limitations of ordinary convolution and obtain a larger receptive field. Finally, the normalized Gaussian Wasserstein distance (NWD) metric was introduced into the regression loss function to compensate for the sensitivity of the intersection over union (IoU) loss to the location deviation of tiny objects, and more accurate anchor box sizes for small objects were obtained using the K-means++ clustering algorithm. Experiments on the detection of 45 types of signs showed that STC-YOLO outperformed YOLOv5 by 9.3% in mean average precision (mAP) on the enhanced TT100K dataset, and that its performance was comparable with that of state-of-the-art methods on the public TT100K and CSUST Chinese Traffic Sign Detection Benchmark (CCTSDB2021) datasets.


Introduction
The traffic sign detection system is an important part of an intelligent transportation system. It can effectively provide the driver with current road traffic information, and it can also ensure the operational safety of the intelligent vehicle control system. In recent years, due to the far-reaching impact of this technology on traffic safety, this field has been deeply studied by many researchers.
Traditional traffic sign detection algorithms mainly rely on color segmentation combined with features such as shape and contour for feature extraction, and then recognize traffic signs by completing feature classification through classifiers [1][2][3][4][5][6]. The handcrafted features used in these traditional techniques are labor-intensive to design and lack sufficient robustness to deal with complex and changeable traffic environments. In recent years, traffic sign detection algorithms based on deep convolutional neural networks have been widely developed. They are mainly divided into two categories: the two-stage object detection algorithms represented by the region-based convolutional network (R-CNN) series [7][8][9], and the one-stage object detection algorithms represented by the you only look once (YOLO) series.
The main contributions of this paper are as follows: (1) The down-sampling multiple was adjusted, and a small object detection layer was added to reduce the loss of small object information during the down-sampling operation. (2) The Swin Transformer structure was combined with a convolutional neural network (CNN) to provide both local relevance and global modelling capabilities. (3) Complete-IoU (CIoU) and the normalized Gaussian Wasserstein distance (NWD) metric were combined as the loss function, and the robustness of the model in small object detection was improved by adjusting their proportional relationship. (4) The K-means++ algorithm was used to obtain new initialized anchor box sizes by clustering the instance label information, which improves the matching degree between the anchor boxes and the real samples.
The rest of this paper is structured as follows: Section 2 introduces the relevant research on small object detection and traffic sign detection. Section 3 describes the proposed methods in detail. Section 4 describes the experimental results and analysis, including comparative studies and ablation studies. Finally, Section 5 presents the discussion of the experimental results, and Section 6 includes the conclusion and future prospects of this paper.

Small Object Detection
There are usually two ways to define small objects. One is a relative size definition: an object is regarded as small if its area is less than 0.12% of the image area; this paper takes this definition as a reference. The other is an absolute size definition: an object is regarded as small if it is smaller than 32 × 32 pixels. Because such objects carry few pixels and little detail, small object detection has always been a difficult topic in the field of object detection. At present, multi-scale fusion, receptive field enlargement, high-resolution detection, and context-aware detection are the main approaches to small object detection. In high-resolution detection [26,27], high-resolution feature maps are established and predicted on to obtain fine details, but context information is lost. In addition, to obtain the context information of the object, several methods [28,29] use top-down and bottom-up paths to fuse the features of different layers, which can greatly increase the receptive field. In this paper, the feature pyramid network (FPN) [30] + PAN structure was used as the feature fusion module of the network, and a multi-head attention mechanism was introduced in the model backbone to enhance the learning of context and expand the receptive field, so as to effectively improve the accuracy of small object detection.
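The two definitions above can be expressed as a small predicate. This is an illustrative sketch (function and threshold names are ours, not from the paper); the relative rule is interpreted as box area below 0.12% of the image area:

```python
def is_small_object(box_w, box_h, img_w, img_h,
                    rel_thresh=0.0012, abs_thresh=32):
    """Check both common definitions of a 'small object'.

    Relative: box area below 0.12% of the image area (the definition
    this paper takes as a reference).
    Absolute: box smaller than 32 x 32 pixels.
    """
    relative = (box_w * box_h) / (img_w * img_h) < rel_thresh
    absolute = box_w < abs_thresh and box_h < abs_thresh
    return relative, absolute

# A 48 x 48 sign in a 2048 x 2048 TT100K image is relatively small
# (about 0.055% of the image area) but not absolutely small.
rel, ab = is_small_object(48, 48, 2048, 2048)
```

This illustrates why high-resolution datasets such as TT100K contain many objects that are "small" relatively even when they exceed 32 × 32 pixels.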

Traffic Sign Detection
The key to traffic sign detection is to extract distinguishable features. Due to limitations in computing power and available dataset sizes, the performance of traditional methods depends on the effectiveness of manually extracted features, such as color-based [31,32] and shape-based methods [33,34]. These methods are easily affected by factors such as extreme weather, illumination changes, variable shooting angles, and obstacles, and can only be applied to limited scenes.
In order to promote traffic sign detection in real scenes, many authors have published excellent traffic sign datasets, such as the Laboratory for Intelligent and Safe Automobiles (LISA) dataset [35], GTSDB, CCTSDB, and TT100K. Since the TT100K dataset covers partial occlusion, illumination changes, and viewing angle changes, it is closer to real scenes than other datasets. With the development of deep learning technology and the publication of several excellent public datasets, traffic sign detection algorithms based on deep learning have significantly outperformed traditional ones. Zhang et al. [36] used Cascade R-CNN [8] combined with a sample balance method to detect traffic signs, achieving ideal detection results on both CCTSDB and GTSDB. Sun et al. [37] proposed a feature-expression-enhanced SSD detection algorithm, which achieved an 81.26% and 90.52% mAP on TT100K and CCTSDB, respectively. However, the detection speed of this algorithm was only 22.86 FPS and 25.08 FPS on these datasets, which could not achieve real-time performance. Liu et al. [38] proposed a symmetric traffic sign detection algorithm, which alleviates latency by reducing the computing overhead of the network and, at the same time, improves traffic sign detection performance in complex environments, such as scale and illumination changes, achieving a 97.8% mAP at 84 FPS on the CCTSDB dataset. However, the integration of multiple modules leads to insufficient global information acquisition.

Materials and Methods
The YOLOv5 model in the YOLO series has many advantages, such as high detection accuracy, fast operation speed, and easy deployment, and has been widely used in many industrial fields. However, due to its poor performance on small objects, the network needs to be modified to improve its small object detection. Considering the high requirements of traffic sign detection for model accuracy and speed, YOLOv5 was selected as the baseline network for subsequent improvement in this paper, and a small traffic sign detection network, STC-YOLO, was constructed for complex scenes. The overall structure is shown in Figure 1. In the feature extraction part, a 16-times down-sampling operation was applied instead of 32-times down-sampling, and a shallow branch was added to reduce the loss of small object information during feature propagation. A feature extraction module with a stronger characterization ability was designed to replace the C3 module to make up for the decrease in the receptive field caused by the adjustment of the down-sampling operation. On the basis of the CIoU loss, the NWD was introduced into the localization loss to balance the sensitivity of the IoU to the location deviation of tiny objects. The K-means++ algorithm was used to replace the K-means algorithm in the original network to improve the anchor matching degree for small objects.


Feature Pyramid
The neck part of the YOLOv5 model uses the information obtained from the backbone to strengthen the representation capability of features through the FPN and PAN structure. The structure is shown in Figure 2a, where CUCC denotes the Conv, Upsample, Concat, and C3 modules, and CCC denotes the Conv, Concat, and C3 modules. The backbone network extracts features from the input image and outputs three feature maps of different scales: {P3, P4, P5} are the feature maps after the input image has been down-sampled {8, 16, 32} times, respectively. In the feature pyramid, the 32-times down-sampled map has the largest receptive field and maps to the largest area of the full-size image, making it more suitable for predicting large objects. However, most objects in traffic sign images are small objects composed of dozens of or even a few pixels, and effective feature information (i.e., color, shape, size, and texture) is scarce, which leads to poor detection of small objects. The small-scale prediction output of the YOLOv5 model is a feature map with a size of 20 × 20. Taking an image with a size of 2048 × 2048 from the TT100K dataset as an example, when the image is down-sampled to 20 × 20, objects smaller than 103 × 103 pixels are compressed to less than one pixel; since most of the traffic sign objects in the image are smaller than 103 × 103 pixels, the small-scale prediction in YOLOv5 is of little significance for small object detection. However, directly cutting the small-scale detection layer would cause a lack of semantic information in the deep networks, which in turn would affect the accuracy of fine-grained classification. In order to transmit more small object details and output feature maps with a stronger characterization capability for small objects, this paper improves the multi-scale structure on the basis of the original feature pyramid. The multi-scale path aggregation network (MPANet) structure is shown in Figure 2b, where {C2, C3, C4, C5} are the feature maps after the input image has been down-sampled {4, 8, 16, 16} times, respectively.
In this work, the 32-times down-sampling layer in the backbone network was replaced so that the maximum down-sampling factor is 16, and at the 16th layer, the feature map output after 4-times down-sampling by the backbone and the feature map of the neck after 2-times up-sampling were concatenated to obtain a 160 × 160 feature map for predicting small objects. This feature map has a smaller receptive field and rich object information. After multi-scale fusion, the network can better learn object features, enhancing its ability to capture smaller objects and improving the detection effect.
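The grid sizes involved follow directly from the down-sampling factors. A minimal sketch (the helper and the 640-pixel input size are illustrative assumptions, not from the paper):

```python
def feature_map_sizes(input_size, strides, first_level=3):
    """Spatial size of each prediction grid for a square input image:
    a stride-s map on an N x N input is (N // s) x (N // s)."""
    return {f"P{first_level + i}": input_size // s
            for i, s in enumerate(strides)}

# Original YOLOv5 head: strides 8/16/32 on a 640 input -> 80/40/20 grids.
yolo = feature_map_sizes(640, [8, 16, 32])
# Modified head as described above: the shallow stride-4 branch yields a
# 160 x 160 map for small objects on the same 640 input.
stc = feature_map_sizes(640, [4, 8, 16], first_level=2)
```

The stride-4 map preserves objects of only a few pixels that a stride-32 map would compress below one cell.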


C4STB Module
Aiming at the problem that traffic signs are difficult to detect due to their small size, leading to poor detection performance, the above method uses 1 × 1 convolution instead of a down-sampling 3 × 3 convolution operation to ensure the transmission of more detailed information. Although CNNs have achieved great success in image processing, their limited perceptual range restricts their ability to capture global contextual information. In contrast, the Swin Transformer [39] adopts a more flexible self-attention mechanism, which can better communicate global semantic information and excels at extracting global context. On this basis, a feature extraction module with a stronger characterization ability was constructed by combining four convolutional modules and the Swin Transformer Block (C4STB). Its structure is shown in Figure 3a. The bottleneck in the YOLOv5 feature extraction unit was replaced with the Swin Transformer Block (STB), the receptive field was expanded with the help of the window self-attention module, and a 3 × 3 convolution was added to enhance the local information of the object.
As shown in Figure 3b, the STB structure consists of windows multi-head self-attention (W-MSA), shifted windows multi-head self-attention (SW-MSA), and a multi-layer perceptron (MLP). A residual connection is applied after each MSA and MLP module, and a layer norm (LN) layer is inserted between the modules. This part can be expressed as follows:

X̂^l = W-MSA(LN(X^(l−1))) + X^(l−1)
X^l = MLP(LN(X̂^l)) + X̂^l
X̂^(l+1) = SW-MSA(LN(X^l)) + X^l
X^(l+1) = MLP(LN(X̂^(l+1))) + X̂^(l+1)

where X̂^l and X^l denote the output features of the W-MSA (SW-MSA) module and the MLP module for block l, respectively. In the traditional ViT, multi-head self-attention (MSA) needs to process all of the image information at the same time, so the computational complexity is relatively high. In contrast, W-MSA in the STB uses a window as a unit (the window size is set to 7 by default) to control the calculation area with less computation. This reduces the network complexity such that the computational cost scales linearly with the image size. However, windowing also blocks the information transmission between different windows, making it necessary to use the SW-MSA module to solve this problem, effectively extract long-distance information, and achieve a more accurate semantic understanding. As shown in Figure 4, compared with W-MSA, SW-MSA adds a shift operation and establishes information interaction between different windows without increasing the computational overhead. The W-MSA adopts a regular window-partitioning mechanism for the input images and calculates the self-attention within each window. The result of window segmentation is shown in Figure 4b. The SW-MSA module shifts the partition when performing window division, thus generating new windows.
With half of the window size as the step size, the feature map is cyclically shifted toward the upper left; the blue and red areas in Figure 4c then move to the lower and right sides of the image, respectively, finally achieving the window division offset shown in Figure 4d.
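The partitioning described above can be sketched with NumPy. This is a minimal illustration of the window split and the cyclic shift only (the attention computation itself is omitted, and the helper names are ours):

```python
import numpy as np

def window_partition(x, ws):
    """Split an (H, W) map into non-overlapping ws x ws windows (W-MSA)."""
    H, W = x.shape
    return (x.reshape(H // ws, ws, W // ws, ws)
             .transpose(0, 2, 1, 3)
             .reshape(-1, ws, ws))

def shifted_window_partition(x, ws):
    """SW-MSA partition: cyclically shift the map by half the window size
    toward the upper left (negative np.roll shift), then partition as usual,
    so each new window mixes pixels from several original windows."""
    shifted = np.roll(x, shift=(-(ws // 2), -(ws // 2)), axis=(0, 1))
    return window_partition(shifted, ws)

x = np.arange(16, dtype=float).reshape(4, 4)
plain = window_partition(x, 2)            # 4 windows of 2 x 2
shifted = shifted_window_partition(x, 2)  # windows after the cyclic shift
```

Because the shift is cyclic, no padding is introduced and the total number of windows (and hence the attention cost) is unchanged, which matches the "no extra computational overhead" claim above.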


Loss Function
The original YOLOv5 network uses CIoU [40] as the regression loss function. The CIoU function considers three important geometric metrics: the center point distance, the overlap area, and the aspect ratio. Its performance is better than that of other IoU variants, and it can provide a movement direction even when the bounding boxes do not overlap. The calculation formula of CIoU is as follows:

L_CIoU = 1 − IoU + ρ²(b_A, b_B)/c² + αv
v = (4/π²) · (arctan(w_B/h_B) − arctan(w_A/h_A))²
α = v / ((1 − IoU) + v)

where ρ²(b_A, b_B) indicates the squared Euclidean distance between the center points of the predicted and real boxes, and c represents the diagonal distance of the smallest circumscribed rectangle of the two boxes; the weight factor is denoted by α, and the aspect ratio consistency is denoted by v; w_A and h_A represent the width and height of the predicted box; w_B and h_B represent the width and height of the real box.
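The CIoU terms can be checked with a small plain-Python sketch (boxes in (cx, cy, w, h) format; the function name is ours):

```python
import math

def ciou_loss(box_a, box_b):
    """CIoU loss between two (cx, cy, w, h) boxes:
    1 - IoU + center-distance penalty + aspect-ratio penalty."""
    (cxa, cya, wa, ha), (cxb, cyb, wb, hb) = box_a, box_b
    # Convert to corner coordinates.
    ax1, ay1, ax2, ay2 = cxa - wa/2, cya - ha/2, cxa + wa/2, cya + ha/2
    bx1, by1, bx2, by2 = cxb - wb/2, cyb - hb/2, cxb + wb/2, cyb + hb/2
    inter = (max(0.0, min(ax2, bx2) - max(ax1, bx1))
             * max(0.0, min(ay2, by2) - max(ay1, by1)))
    union = wa * ha + wb * hb - inter
    iou = inter / union
    # Squared center distance over squared enclosing-box diagonal.
    rho2 = (cxa - cxb)**2 + (cya - cyb)**2
    c2 = ((max(ax2, bx2) - min(ax1, bx1))**2
          + (max(ay2, by2) - min(ay1, by1))**2)
    # Aspect-ratio consistency v and its weight alpha.
    v = (4 / math.pi**2) * (math.atan(wb / hb) - math.atan(wa / ha))**2
    alpha = v / (1 - iou + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v
```

For identical boxes the loss is zero; for disjoint boxes the center-distance term still provides a gradient direction, which is the non-overlap behavior noted above.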
Considering that intersection over union (IoU)-based metrics (such as the IoU itself and its extensions) [41,42] are excessively sensitive to the location deviation of tiny objects, applying anchor-based detectors to them results in a drastic deterioration of the detection performance. To alleviate this, this paper combines CIoU and NWD [43] to calculate the localization loss. The bounding boxes are first modeled as two-dimensional Gaussian distributions, and the similarity between them is calculated from the Gaussian distributions corresponding to the predicted object and the real object. Next, the normalized Wasserstein distance between them is calculated according to Equation (9). Finally, the localization loss is calculated according to the proportional relationship between CIoU and NWD in Equation (10), which is defined as follows:

W₂²(N_A, N_B) = ‖(cx_A, cy_A, w_A/2, h_A/2) − (cx_B, cy_B, w_B/2, h_B/2)‖₂²
NWD(N_A, N_B) = exp(−√(W₂²(N_A, N_B)) / C)   (9)
L_loc = β · L_CIoU + (1 − β) · (1 − NWD(N_A, N_B))   (10)

where N_A and N_B are the Gaussian distributions modeled by A = (cx_A, cy_A, w_A, h_A) and B = (cx_B, cy_B, w_B, h_B), W₂(N_A, N_B) is the second-order Wasserstein distance between them, C is a constant closely related to the dataset, and β is the weight proportional coefficient.
For the detected objects, regardless of whether they overlap, the localization loss can be measured using the distribution similarity. In addition, NWD is not sensitive to the scale of the objects, making it more suitable for measuring the similarity between small objects [44]. In the regression loss function, the NWD loss is added to compensate for the disadvantage of the CIoU loss in small object detection, while the CIoU loss is retained, which makes the algorithm converge faster when predicting the bounding box localization and improves the model performance.
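The insensitivity of NWD to small location deviations of tiny boxes can be seen numerically. A self-contained sketch (the constant C = 12.8 and the helper names are illustrative assumptions; the paper only states that C is dataset-dependent):

```python
import math

def nwd(box_a, box_b, C=12.8):
    """Normalized Gaussian Wasserstein distance between (cx, cy, w, h)
    boxes. Each box is modeled as a 2-D Gaussian with its center as the
    mean and (w/2, h/2) on the covariance diagonal; the second-order
    Wasserstein distance between such Gaussians has a closed form."""
    (cxa, cya, wa, ha), (cxb, cyb, wb, hb) = box_a, box_b
    w2_sq = ((cxa - cxb)**2 + (cya - cyb)**2
             + ((wa - wb) / 2)**2 + ((ha - hb) / 2)**2)
    return math.exp(-math.sqrt(w2_sq) / C)

def iou(box_a, box_b):
    """Plain IoU of (cx, cy, w, h) boxes, for contrast."""
    (cxa, cya, wa, ha), (cxb, cyb, wb, hb) = box_a, box_b
    iw = max(0.0, min(cxa + wa/2, cxb + wb/2) - max(cxa - wa/2, cxb - wb/2))
    ih = max(0.0, min(cya + ha/2, cyb + hb/2) - max(cya - ha/2, cyb - hb/2))
    inter = iw * ih
    return inter / (wa * ha + wb * hb - inter)

# A 1-pixel shift of a 4 x 4 box drops IoU to 0.6, while NWD stays high.
tiny_a, tiny_b = (100, 100, 4, 4), (101, 100, 4, 4)
```

This mirrors the motivation above: for tiny objects, IoU reacts sharply to a one-pixel deviation, whereas the distribution similarity degrades smoothly.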

Anchor Box
The original YOLOv5s model uses the K-means algorithm to cluster the COCO dataset [45], so that the feature maps of different sizes each have three anchor boxes with fixed widths and heights. However, the K-means algorithm is affected by the random selection of the initial cluster centers, which may place an initial cluster center far away from the optimal location; this not only affects the convergence speed of the model, but also leads to poor detection results. At the same time, since large and medium objects account for the majority of the COCO dataset, the generated anchor sizes are too large to meet the actual needs of traffic sign detection. To solve these problems, this paper uses the K-means++ algorithm to re-cluster all labeled object boxes in the training dataset.
The K-means++ clustering algorithm is an optimization of the K-means algorithm. Its main purpose is to improve the selection of the initial cluster centers, make the anchor box sizes fit the training dataset more closely, and improve the detection accuracy of the model for small objects. The algorithm first selects a random sample point from the dataset as the first cluster center. Then, for each sample point, the shortest distance to the cluster centers chosen so far is calculated, and each point is assigned a probability of becoming the next cluster center proportional to this squared distance. The next cluster center is sampled according to these probabilities, and these steps are repeated until k cluster centers have been selected. Finally, each sample in the dataset is assigned to the class of the nearest cluster center, and the cluster centers are updated iteratively until their positions no longer change. In this way, the K-means++ algorithm can better position the cluster centers and obtain more accurate clustering results.
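The seeding step described above can be sketched in a few lines. This is an illustrative implementation using Euclidean distance on (w, h) box sizes (YOLO-style anchor clustering often uses a 1 − IoU distance instead; the function name is ours):

```python
import random

def kmeans_pp_init(boxes, k, seed=0):
    """K-means++ seeding over (w, h) box sizes: the first center is
    uniformly random; each subsequent center is sampled with probability
    proportional to the squared distance to its nearest chosen center."""
    rng = random.Random(seed)
    centers = [rng.choice(boxes)]
    while len(centers) < k:
        # Squared distance from every box to its nearest current center.
        d2 = [min((w - cw)**2 + (h - ch)**2 for cw, ch in centers)
              for w, h in boxes]
        # Sample the next center proportionally to d2 (roulette wheel).
        r, acc = rng.uniform(0, sum(d2)), 0.0
        for box, dist in zip(boxes, d2):
            acc += dist
            if acc >= r:
                centers.append(box)
                break
    return centers

boxes = [(10, 10), (12, 11), (100, 100), (98, 97)]
centers = kmeans_pp_init(boxes, 2)
```

Because already-chosen centers have zero distance, they can never be re-selected, and widely separated size clusters (here, small and large anchors) tend to each receive an initial center.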

Experimental Results and Analysis
The TT100K traffic sign dataset provides 100,000 high-resolution images (with a resolution of 2048 × 2048), containing 30,000 traffic sign instances; the sizes of the instances range from 16 × 20 to 160 × 160 pixels. A total of 45 categories with more than 50 instances each were chosen for this experiment. The dataset has a total of 7962 images containing complete annotation information, of which 6262 were selected as the training set and 1700 were selected as the test set. The corresponding sign images and category names are shown in Figure 5, where pl* includes pl100, pl120, pl20, pl30, pl40, pl5, pl50, pl60, pl70, and pl80; pm* includes pm20, pm30, and pm55; ph* includes ph4, ph4.5, and ph5; and il* includes il100, il60, and il80.
The CCTSDB2021 dataset contains 17,856 images, with 15,886 images for training and 1970 images for testing. It contains scenes such as urban roads and highways, with resolutions between 600 × 900 and 1024 × 768. The sizes of the traffic signs in the images range from 20 × 20 to 573 × 557 pixels. There are three main categories of traffic signs, namely "warning", "prohibited", and "mandatory".

Data Augmentation
In the face of the complex environments in traffic sign detection, this paper followed the approach of enhancing corruption proposed in the literature [36] and chose to enhance the TT100K training dataset. Considering the environmental conditions and the size of the traffic signs dataset, we added fog, snow, Gaussian noise, random occlusion, and motion blur and adjusted the contrast, brightness, and saturation of the randomly selected image to expand the minority samples. After data augmentation, the TT100K dataset was extended to 22,776 images as the enhanced TT100K dataset, of which 21,076 images (14,814 in harsh environments and 6262 in natural environments) were selected as the training set and 1700 images were selected as the test set. The enhancement results of one picture in the TT100K dataset are shown in Figure 6.
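Two of the corruptions listed above can be sketched with NumPy. This is a minimal illustration, not the paper's pipeline; the noise sigma and brightness factor are illustrative choices (fog, snow, occlusion, and motion blur would need analogous transforms):

```python
import numpy as np

def augment(img, rng, mode):
    """Minimal sketches of two corruptions used to build the enhanced
    TT100K set: additive Gaussian noise and a brightness adjustment."""
    img = img.astype(np.float32)
    if mode == "noise":
        img = img + rng.normal(0.0, 15.0, img.shape)  # sigma in pixel units
    elif mode == "brightness":
        img = img * 1.3                               # brighten by 30%
    return np.clip(img, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
img = np.full((8, 8, 3), 128, dtype=np.uint8)
noisy = augment(img, rng, "noise")
bright = augment(img, rng, "brightness")
```

Clipping back to [0, 255] and restoring uint8 keeps the augmented images valid inputs for the same training pipeline as the originals.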





Experimental Environment and Parameter Settings
In this work, all experiments were conducted on the Windows 11 operating system with 64 GB of RAM and an RTX 3090 graphics card with 24 GB of video memory; the deep learning framework used was PyTorch 1.11.0, and the programming language was Python 3.8.
The optimization algorithm used for model training was stochastic gradient descent (SGD). The initial learning rate was 0.01, the momentum was 0.937, and the weight decay coefficient was 0.0005. In addition, the model was trained for 200 epochs, the batch size of the TT100K dataset was set to 32, and the batch size of the CCTSDB2021 dataset was set to 16.

Experimental Evaluation Index
The evaluation indexes are mainly divided into two aspects: detection accuracy and detection speed. Precision (P) mainly measures the degree of model error detection; recall (R) mainly measures the degree of model missed detection; average precision (AP) is the area under the P-R curve; mAP is the AP average of all categories. They are calculated as follows:

Experimental Environment and Parameter Settings
In this work, all experiments were conducted using the Windows11 operating system, 64 GB RAM, and a GTX-3090 graphics card with 24 GB of video memory; the deep learning framework used was Pytorch1.11.0; and the programming language was Python 3.8.
The optimization algorithm used for model training was stochastic gradient descent (SGD). The initial learning rate was 0.01, the momentum was 0.937, and the weight decay coefficient was 0.0005. In addition, the model was trained for 200 epochs, the batch size of the TT100K dataset was set to 32, and the batch size of the CCTSDB2021 dataset was set to 16.
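These training settings translate directly into a PyTorch optimizer configuration; the one-layer model below is only a stand-in for STC-YOLO, while the hyper-parameters follow the paper.

```python
import torch

# Toy model standing in for STC-YOLO; only the optimizer settings matter here.
model = torch.nn.Conv2d(3, 16, kernel_size=3)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,           # initial learning rate
    momentum=0.937,    # momentum coefficient
    weight_decay=0.0005,
)
```

A learning-rate schedule (warm-up plus decay, as is common for YOLOv5-style training) could be layered on top with `torch.optim.lr_scheduler`, but the paper does not specify one, so none is shown here.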

Experimental Evaluation Index
The evaluation indexes are mainly divided into two aspects: detection accuracy and detection speed. Precision (P) mainly measures the degree of model false detection; recall (R) mainly measures the degree of model missed detection; average precision (AP) is the area under the P-R curve; mAP is the average AP over all categories. They are calculated as follows:

P = TP / (TP + FP), R = TP / (TP + FN), AP = ∫₀¹ P(R) dR, mAP = (1/n) Σⱼ AP(j),

where TP means true positive, TN means true negative, FP means false positive, and FN means false negative; n is the number of categories; and AP(j) represents the AP of the jth category. The detection speed adopts the FPS, which represents the number of images that can be processed per second.
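These metrics can be sketched in a few lines of Python. The area under the P-R curve is computed here with the common "all-points" interpolation; that interpolation choice is an assumption, since the paper does not state its scheme.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """P = TP/(TP+FP), R = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recall, precision):
    """Area under the P-R curve (all-points interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

p, r = precision_recall(tp=80, fp=20, fn=20)          # p = 0.8, r = 0.8
ap = average_precision(np.array([0.2, 0.5, 0.8]),
                       np.array([1.0, 0.8, 0.6]))
# mAP would then average AP(j) over all n categories.
```

The recall/precision arrays would normally come from sweeping the detector's confidence threshold over a test set; here they are synthetic values for illustration.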
Model complexity is measured by the number of parameters, calculated as follows:

Params = C_o × C_i × k_w × k_h,

where C_o represents the number of output channels, C_i represents the number of input channels, and k_w and k_h represent the width and height of the convolution kernel, respectively.
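The per-layer count can be checked with a one-line helper. Note that the optional bias term, which adds C_o extra parameters, is included here as an assumption; the formula above counts only the kernel weights.

```python
def conv_params(c_in, c_out, k_w, k_h, bias=True):
    """Parameter count of one conv layer: C_o * C_i * k_w * k_h (+ C_o biases)."""
    return c_out * (c_in * k_w * k_h + (1 if bias else 0))

n_with_bias = conv_params(3, 16, 3, 3)               # 16 * (3*3*3 + 1) = 448
n_weights_only = conv_params(3, 16, 3, 3, bias=False)  # 16 * 27 = 432
```

Summing this over every convolutional layer (plus any normalization and attention layers) gives the model totals reported in the comparison tables.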

Experimental Analysis

Feature Fusion Layer Improvement Experiment
To determine the final number of detection heads of the model, different detection head branches and their performance were compared. The experimental results are shown in Table 1. As can be seen, the network models with the p2 detection head improved the mAP over YOLOv5s by about 0.4% to 6.8%. With a reduction in the down-sampling multiple and the addition of the p2 detection head module, the mAP showed the largest increase with the smallest number of parameters compared with YOLOv5s, which indicates that this structure transmits more small object information. Considering both the number of parameters and the detection accuracy, the prediction branch corresponding to the improved {p2, p3, p4} was selected as the output detection head.

Ablation Study
To verify the contribution of each module to the model performance, the added modules and YOLOv5s were combined to conduct ablation experiments on the enhanced TT100K dataset. The experimental results are shown in Table 2, which lists four evaluation indicators: AP, AR, mAP, and FPS. Compared with YOLOv5s, the model proposed in this paper increased the AP value by 7.9%, the AR value by 9.7%, and the mAP value by 9.3%, while the speed only slightly decreased (from 101.01 FPS to 87.71 FPS). This shows that the network model proposed in this paper can substantially improve the detection accuracy of small objects while ensuring real-time performance.

Each module adopted in this work improved the detection accuracy of the network to some extent. Compared with YOLOv5s, the MPANet module brought significant improvements (mAP from 79.6% to 86.4%), with the AP, AR, and mAP being improved by 5.4%, 6.0%, and 6.8%, respectively. This shows that prediction at larger feature map sizes can make better use of the detailed information of traffic signs in the image, so as to detect small traffic signs more accurately. When the C4STB module proposed in this paper was applied to the YOLOv5s model, the AP, AR, and mAP were increased by 1.8%, 1.1%, and 1.1%, respectively. This shows that the C4STB module can effectively extract features with a better discriminating ability for small object detection and, at the same time, expand the receptive field to ensure the accuracy of medium and large-size objects. When the combination of NWD and CIoU was used as the loss function, the AP, AR, and mAP were increased by 5.7%, 2.8%, and 2.5%, respectively. This shows that introducing NWD into the regression loss function helps reduce the sensitivity of the IoU-based metric to small object position deviations, thereby improving the detection accuracy of small objects.
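The NWD term can be sketched as follows, following the usual Gaussian-Wasserstein formulation in which each box is modelled as a 2-D Gaussian. This is a minimal sketch, not the paper's exact implementation: the normalizing constant `c` is dataset-dependent, and the mixing weight `a` in the combined loss is an assumption.

```python
import math

def nwd(box1, box2, c=12.8):
    """Normalized Gaussian Wasserstein distance for (cx, cy, w, h) boxes.

    W2 is the squared Wasserstein distance between the two Gaussians;
    c is a dataset-dependent normalizing constant (assumed here).
    """
    cx1, cy1, w1, h1 = box1
    cx2, cy2, w2, h2 = box2
    w2_dist = ((cx1 - cx2) ** 2 + (cy1 - cy2) ** 2
               + ((w1 - w2) / 2) ** 2 + ((h1 - h2) / 2) ** 2)
    return math.exp(-math.sqrt(w2_dist) / c)

# Identical boxes give NWD = 1; a small center shift degrades it smoothly,
# unlike IoU, which can drop abruptly for tiny boxes. A combined regression
# loss could then mix the two terms, e.g.:
#   loss = a * (1 - nwd_value) + (1 - a) * (1 - ciou_value)
same = nwd((10, 10, 4, 6), (10, 10, 4, 6))
shifted = nwd((10, 10, 4, 6), (12, 10, 4, 6))
```

The smooth decay under position shifts is exactly what makes the metric attractive for tiny objects, where a two-pixel deviation can zero out the IoU entirely.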
The K-means++ clustering algorithm was used to obtain a set of anchor boxes that correspond to the traffic sign dataset, which increased the AP, AR, and mAP by 5.4%, 3.7%, and 2.1%, respectively. This shows that generating candidate boxes that are more suitable for small object detection can better cover small objects in the dataset and effectively solve the problem of the low detection rate of candidate boxes in small object datasets.
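A minimal sketch of the anchor clustering step, assuming the common 1 − IoU distance on (w, h) label boxes; the synthetic boxes, seed, and iteration count are illustrative, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(42)

def wh_iou(box, clusters):
    """IoU between one (w, h) box and each cluster, all anchored at the origin."""
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_pp_anchors(boxes, k, iters=50):
    """K-means++ seeding plus Lloyd updates under a 1 - IoU distance."""
    # k-means++ initialisation: sample seeds proportionally to their distance
    # from the nearest existing seed, which avoids poor local optima.
    centers = [boxes[rng.integers(len(boxes))]]
    while len(centers) < k:
        d = np.array([min(1 - wh_iou(b, np.array(centers))) for b in boxes])
        centers.append(boxes[rng.choice(len(boxes), p=d / d.sum())])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        assign = np.array([np.argmax(wh_iou(b, centers)) for b in boxes])
        for j in range(k):
            if np.any(assign == j):
                centers[j] = boxes[assign == j].mean(axis=0)
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]  # sort by area

boxes = rng.uniform(8, 120, size=(200, 2))  # synthetic (w, h) label boxes
anchors = kmeans_pp_anchors(boxes, k=6)
```

On the real dataset, the (w, h) pairs would come from the training-set labels, and the resulting anchors would be split across the detection heads by scale.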
The P-R curves with each improved module added to the YOLOv5s network were drawn on the same axes, as shown in Figure 7. It can be clearly seen that the curve with the MPANet, C4STB, NWD, and K-means++ modules added simultaneously covers the curves with a single module added. Intuitively, it is concluded that each improved module in this paper provides a certain performance improvement to the network.

Performance on the Enhanced TT100K Dataset
To confirm the validity of the network model proposed in this work, the detection results of three mainstream object detection networks, namely YOLOv3, YOLOv6, and YOLOv7, were reproduced on the enhanced TT100K dataset and compared with those of STC-YOLO and YOLOv5s. The comparison results are shown in Table 3. It can be seen that the model proposed in this paper achieved relatively excellent results in the AP, AR, mAP, and FPS. The mAP of STC-YOLO was 88.9%, which was 5.8%, 7.6%, and 31.8% higher than that of YOLOv3, YOLOv6, and YOLOv7, respectively. The detailed mAP values for each algorithm in each category are shown in Table 4. It can be seen that the STC-YOLO model achieved the best performance in most of the 45 categories. All results were obtained using the same hardware. In this table, the best results are in bold.
The visualization of traffic sign detection using STC-YOLO and the YOLOv5s network on the enhanced TT100K dataset is presented in Figure 8. The tested images were processed by simulating noise, blur, weather, and lighting changes, and all images used for testing were unseen during training. In Figure 8(1b,1c), it can be seen that under normal conditions, the proposed method accurately detected each traffic sign, while YOLOv5s missed the small objects "pl40" and "p11". This type of missed detection was also reflected in foggy conditions, as presented in Figure 8(2b). The decrease in image clarity makes traffic signs increasingly difficult to detect; YOLOv5s missed a small object from the "i5" category, while STC-YOLO correctly detected all signs. After adding motion blur, which caused interference in the image, the YOLOv5s network did not detect any objects, while STC-YOLO accurately detected all the objects in the image, as presented in Figure 8(3b,3c). Under the interference of rain and snow, YOLOv5s falsely detected the "p11" sign, while STC-YOLO correctly detected all the objects in the image, as shown in Figure 8(4b,4c). Comparing Figure 8(5b,5c), it can be found that under the interference of noise, YOLOv5s missed the distant small objects "pl30" and "pn", while STC-YOLO correctly detected all the signs in the image. When the image was disturbed by illumination changes, both YOLOv5s and STC-YOLO correctly detected the traffic signs, but STC-YOLO had a higher detection accuracy for the "pne" category of small objects, as shown in Figure 8(6b,6c).

Performance on the TT100K Dataset
To further prove the superiority of the network model proposed in this paper, the STC-YOLO network was compared with five other networks, namely SSD + Aligned Matching [46], TSR-SA [47], PSG-Yolov5 [48], AIE-YOLO [21], and YOLOv5s, on the TT100K dataset. These networks are all common frameworks and represent the latest improvements in the field of small object detection in recent years. The experimental results are shown in Table 5. It can be seen that, compared with YOLOv5s, the mAP of the network proposed in this paper increased by 12.9% in real-time detection. Compared with the other state-of-the-art methods, the proposed network also showed certain advantages in detection accuracy and real-time detection, with 88.49 FPS on a GTX-3090 GPU. Overall, the network proposed in this paper takes into account both accuracy and real-time performance and performs better in detecting small traffic signs.

Performance on the CCTSDB2021 Dataset
To ensure the authenticity and effectiveness of the proposed network, experiments on the CCTSDB2021 dataset were conducted. The comparison of STC-YOLO with state-of-the-art methods on the CCTSDB2021 dataset is shown in Table 6. It can be seen that STC-YOLO outperformed the one-stage methods ESSD [37], YOLOv3 + MAF + SIA [49], M-YOLO [38], and YOLOv5s in the mAP. Compared with the two-stage method Faster R-CNN + ACFPN + Auto Augment [50], STC-YOLO achieved a comparable mAP and outstanding speed for real-time detection. The effectiveness of the proposed STC-YOLO model in small object detection was also verified on the VisDrone2019 public dataset [51], which contains 288 video clips with a total of 10,209 still images captured by various drone cameras and 10 categories of objects. The detailed experimental results can be found in the Supplementary Materials.

Discussion
The novelty of this study lies in the generation of a large amount of new augmented data related to traffic signs. These data cover partial occlusion, illumination changes, viewpoint changes, and extreme weather conditions. In addition, the robustness of the proposed model to complex environments and small objects was tested on multiple publicly available datasets such as TT100K and CCTSDB2021.
It can be seen from the results in Section 4.4 that, by reducing the down-sampling multiple and designing a larger prediction head, the model proposed in this paper performed better in small object detection and obtained the highest mAP. Even though the detection speed decreases slightly, it still meets the real-time requirements. The contribution of each added module was also examined in the ablation study. As shown in Table 2, the addition of each module improves the network detection accuracy to a certain extent. The MPANet module feeds back more shallow features after multi-scale fusion, enhancing the ability of the network to capture smaller objects. The C4STB module combines the Swin Transformer structure with a convolutional neural network to compensate for the reduced receptive field caused by the smaller down-sampling multiple, and thus for the limited ability to acquire global context information. The NWD metric, combined with the CIoU loss function as a measure of the similarity between the ground-truth boxes and the predicted boxes, captures the spatial information of the object better and reduces the impact of changes in object size or shape on the detection results, thus significantly improving the accuracy of object detection. The K-means++ algorithm clusters the label boxes and optimizes the initial centroids, avoiding the local optima that the standard k-means algorithm may fall into.
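The idea of pairing convolution with multi-head attention can be illustrated in PyTorch. This is an illustrative analogue only: the name `ConvAttnBlock` and its exact layout are assumptions, not the paper's C4STB definition. A 3×3 convolution keeps local detail, while self-attention over the flattened feature map adds global context, enlarging the effective receptive field.

```python
import torch
import torch.nn as nn

class ConvAttnBlock(nn.Module):
    """Conv + multi-head self-attention with a residual fusion (illustrative)."""

    def __init__(self, channels, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):
        x = self.conv(x)                          # local feature extraction
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)        # (B, H*W, C) token sequence
        q = self.norm(seq)
        out, _ = self.attn(q, q, q)               # global self-attention
        return x + out.transpose(1, 2).reshape(b, c, h, w)  # residual fusion

block = ConvAttnBlock(32)
y = block(torch.randn(2, 32, 8, 8))  # spatial shape is preserved
```

The real C4STB additionally uses windowed (Swin-style) attention to keep the cost linear in the number of tokens; full self-attention, as above, is quadratic and only practical on small feature maps.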
Based on the comparative experiments in Section 4.5, the STC-YOLO network showed higher accuracy and computational efficiency compared to other mainstream networks. In addition, the model's performance was also improved under interference such as snow, fog, noise, motion blur, and partial occlusion. However, there are still some limitations. First, this study did not take into account all natural conditions, such as nighttime conditions and when traffic signs are faded or damaged. Secondly, because the fusion of multiple modules may increase the amount of computation, the detection speed is slightly decreased. In future studies, it is planned to optimize the STC-YOLO model for more complex environments.

Conclusions
To deal with traffic sign detection in complex real environments, the TT100K public dataset was expanded to 22,776 images, and a traffic sign detection network, STC-YOLO, based on the framework of YOLOv5 was constructed. The STC-YOLO network achieved 88.9% in the mAP, which was 9.3% higher than that of the original YOLOv5s. In the STC-YOLO network, aiming at the small size and difficult localization of traffic signs, the ability to capture smaller objects was enhanced by adjusting the down-sampling multiple. The NWD metric was introduced to make up for the sensitivity of the IoU loss to the positional deviation of tiny objects. The K-means++ clustering algorithm was used to obtain anchor box scales that are more suitable for the traffic sign dataset. Additionally, in the feature fusion stage, the feature enhancement module C4STB was designed to take the local and global information obtained from the feature map as the low-level feature of the fusion. The adaptability experiments on the TT100K and CCTSDB2021 datasets further proved the STC-YOLO network's superiority: it is well suited to small object detection tasks in complex environments.

This work provides ideas for environment perception in autonomous driving and can be extended to the field of small object detection. However, there are many difficulties in detecting traffic signs at night, such as street lighting and reflected-light interference, and the robustness of the STC-YOLO algorithm to nighttime conditions has not been verified. In the future, datasets could be collected to focus on traffic sign detection in night scenes. Finally, the performance and stability of the model should be further optimized, and the design and development of mobile terminal systems should be a focus.

Data Availability Statement:
The data that support the findings of this study are available upon request from the authors.

Conflicts of Interest:
The authors declare no conflict of interest.