A New Multi-Branch Convolutional Neural Network and Feature Map Extraction Method for Traffic Congestion Detection

With the continuous advancement of the economy and technology, the number of cars keeps increasing, and traffic congestion on some key roads is becoming increasingly serious. This paper proposes a new vehicle information feature map (VIFM) method and a multi-branch convolutional neural network (MBCNN) model and applies them to traffic congestion detection based on camera image data. The aim of this study is to build a deep learning model that takes traffic images as input and outputs congestion detection results, providing a new method for the automatic detection of traffic congestion. The deep learning-based method in this article can effectively utilize the existing massive camera network in the transportation system without requiring much additional investment in hardware. This study first uses an object detection model to identify vehicles in images. Then, a method for extracting a VIFM is proposed. Finally, a traffic congestion detection model based on MBCNN is constructed. This paper verifies the effectiveness of the method on the Chinese City Traffic Image Database (CCTRIB). Compared to other convolutional neural networks, other deep learning models, and baseline models, the proposed method yields superior results, obtaining an F1 score of 98.61% and an accuracy of 98.62%. Experimental results show that this method effectively solves the problem of traffic congestion detection and provides a powerful tool for traffic management.


Introduction
As the transportation system continues to grow, the number of motor vehicles continues to rise, resulting in severe traffic congestion in some areas of large cities [1,2]. Efficient and reasonable traffic congestion monitoring helps transportation departments manage congestion problems more effectively [3,4]. Currently, there are three main methods that can be applied to traffic congestion monitoring. The first method is to detect congestion through induction coils, which can effectively measure the traffic flow and speed on the road [5,6]. An induction coil is a device used to measure vehicle speed, commonly used in traffic control and traffic monitoring systems. Its working principle is based on changes in the induced electromagnetic field. The induction coil is usually buried in the road, with its length perpendicular to the direction of vehicle travel. When a vehicle passes over the coil, the metal parts of the vehicle (such as the wheels) disturb the electromagnetic field generated by the coil. This disturbance causes changes in the induced current within the coil, and the vehicle's speed is calculated from these changes [7].
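As an illustration of loop-based speed measurement, speed is often derived from paired loops a known distance apart; the sketch below assumes such a hypothetical dual-loop setup (the loop spacing and timestamps are made up, and this variant differs from the single-coil current-change scheme described above):

```python
# Sketch: speed estimation from a hypothetical dual-loop detector setup.
# spacing_m and the trigger timestamps are illustrative assumptions.

def loop_speed_kmh(t_first: float, t_second: float, spacing_m: float) -> float:
    """Speed from the time gap between activations of two loops spacing_m apart."""
    dt = t_second - t_first
    if dt <= 0:
        raise ValueError("second loop must trigger after the first")
    return spacing_m / dt * 3.6  # m/s -> km/h

# A vehicle crossing loops 4 m apart, 0.18 s apart -> 80.0 km/h
print(round(loop_speed_kmh(10.00, 10.18, 4.0), 1))
```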
Sensors 2024, 24, 4272

The second method is to use vehicle GPS data for congestion monitoring [8,9]. GPS equipment can provide real-time feedback on vehicle position and speed. The principle of GPS speed measurement is based on the Doppler effect: when the transmitting source and the receiving source are in relative motion, the frequency of the received signal changes. In GPS speed measurement, after receiving the signal transmitted by a satellite, the receiver measures the frequency of the signal and calculates its own speed relative to the satellite based on the Doppler effect. By measuring and combining the frequencies of multiple satellite signals, GPS receivers can accurately measure their own speed [10]. By calculating the average speed of vehicles on each road section, it can be determined whether that road section is congested. Although these two methods are relatively simple for detecting congestion, they require extensive hardware support: installing induction coils on various road sections and GPS equipment in vehicles consumes substantial manpower and financial resources. This article therefore focuses on a traffic congestion detection method with lower hardware cost.
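The segment-level decision described above can be sketched in a few lines of Python (the 20 km/h congestion cutoff is an illustrative assumption, not a value from this paper):

```python
# Sketch: flagging a road segment as congested from GPS speed samples.
# The 20 km/h threshold is hypothetical; real systems tune it per road class.

def segment_congested(speeds_kmh, threshold_kmh=20.0):
    """Average the GPS-reported speeds on a segment; congested if below threshold."""
    mean_speed = sum(speeds_kmh) / len(speeds_kmh)
    return mean_speed < threshold_kmh, mean_speed

flag, mean = segment_congested([12.0, 8.5, 15.0, 10.5])
print(flag, mean)  # True 11.5
```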
The third method is to use camera data and computer vision algorithms to detect whether there is congestion on the road [11,12]. At present, a large number of cameras have been widely deployed across the transportation system. Using these existing cameras for traffic congestion monitoring reduces hardware requirements but demands complex algorithms. Therefore, the focus of this research is to use camera data to determine whether each road section is congested. We study image recognition models for traffic congestion detection: the input of the model is image data captured by cameras, and the output is whether the road section is congested. Currently, there are three main types of research in this field. (1) The first method uses a target detection algorithm to detect the number and location of cars, trucks, and buses in the picture and, based on this, determines whether the road is congested [13,14]. (2) The second method extracts traffic status features from images, including the number of vehicles and their speed, and then uses machine learning models to determine whether the image shows congestion [15,16]. (3) The third method uses an image classification model to determine whether there is congestion in the area covered by the camera [17,18].
Many scholars have conducted traffic-related image detection and image classification based on visual features. These features mainly include bag features [26], Haar features [27], edge features [28], shape features [29], gradient histogram features [30], and CNN features [31].
In the problem of traffic congestion detection based on camera data, current research mainly relies on machine learning methods based on feature extraction or on image classification models based on deep learning [9,40]. These methods currently suffer from problems such as insufficient detection accuracy and excessive model size. The research in this article is mainly based on object detection models. This article proposes a new vehicle information feature map (VIFM) for extracting traffic congestion features, as well as a new multi-branch convolutional neural network (MBCNN) for traffic congestion detection.
In building an image-based traffic congestion recognition model, the main challenges include accurate detection of vehicles in images, reasonable feature extraction, and efficient construction of the classification model. This article adopts a deep learning-based approach because deep learning has achieved great success in the image field, and we apply these methods in order to advance the problem of traffic congestion recognition.
This study uses a target detection algorithm to process the images captured by the camera and automatically identify traffic elements such as vehicles. The picture is divided into several small squares, and the number of vehicles in each small square is counted. From this grid, the maximum value, the sum, and the number of squares exceeding a threshold are extracted as higher-level features. By extracting image features in this way, the model can learn the difference between congested and non-congested states, laying the foundation for subsequent congestion detection.
Next, this article proposes a new multi-branch convolutional neural network model. The three branches of the model each use a different convolution kernel size. Finally, the outputs of the three branches are passed through a fully connected layer to determine whether there is congestion. This classification model is used for traffic congestion detection.
Finally, this study combines actual traffic data to train and test the proposed congestion detection method. In the testing phase, the actual traffic conditions are compared with the detection results to evaluate the F1 score and accuracy of the model. The main contribution of this paper is to propose a new method for extracting vehicle information feature maps (VIFMs) and a multi-branch convolutional neural network (MBCNN) model. The proposed model achieves better results than existing convolutional neural networks, deep neural networks, and baseline models, and enables traffic congestion detection based on camera data. Due to the limitations of our experimental conditions, we only use cameras that capture 2D images. At the same time, our method can provide intelligent algorithms for the massive number of ordinary traffic surveillance cameras without adding hardware costs.

Method
The traffic congestion detection in this article is mainly divided into three aspects: vehicle target detection, feature map extraction, and classifier-based detection of traffic congestion. First, this article establishes a vehicle information extraction model based on the You Only Look Once v8 (YOLOv8) model. Then, this paper proposes a feature map extraction method based on vehicle information. Finally, this article establishes a multi-branch convolutional neural network model for traffic congestion detection.

Target Detection
This article uses the most advanced YOLOv8 target detection model to detect the location and number of vehicles in each picture. The structure of the YOLOv8 target detection model is shown in Figure 1. For a detailed introduction to YOLOv8, readers can refer to [41,42]; due to space limitations, this article will not go into details.


Vehicle Information Feature Map (VIFM)
This paper proposes a vehicle information feature map (VIFM) method. After identifying the location of each vehicle in the picture, the information needs to be processed to extract features as model input for the subsequent classifier. This article divides a two-dimensional image into m × m equally sized small squares and then counts the number of vehicles in each small square, yielding the feature map. Its specific definition is as follows:

f(i, j) = Σ_{n=1}^{N} I( (i−1)·w ≤ x_n < i·w and (j−1)·h ≤ y_n < j·h ),  i, j = 1, …, m

where f is the calculation result of the feature map, N is the number of detected vehicles, n is the vehicle index, x_n is the x coordinate of the n-th vehicle, and y_n is the y coordinate of the n-th vehicle. w and h are the width and height of a small square. I is an indicator function defined as follows:

I(E) = 1 if the logical expression E is true, and 0 otherwise.

The schematic diagram is shown in Figure 2. The significance of this formula is that it first divides the image into several rectangular squares at equal intervals and then counts how many vehicles fall in each small square. The detailed calculation process of the VIFM algorithm is given in Algorithm 1. After obtaining the feature map, we further extract three features from it, namely the sum of the feature map, the maximum of the feature map, and the number of elements greater than a threshold:

total = Σ_{i,j} f(i, j)
max = max_{i,j} f(i, j)
count = Σ_{i,j} I( f(i, j) > Thresh )

where I is the indicator function defined above and Thresh is the vehicle number threshold. In subsequent sections, the feature map and these 3 features are input to the classifier to identify traffic congestion.
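A minimal plain-Python sketch of the grid counting and the three scalar features, assuming the detector returns (x, y) box centers and the image size is known (the coordinates below are hypothetical):

```python
# Sketch of the VIFM computation: divide the image into m x m cells, count
# detected vehicle centers per cell, then derive the three scalar features.

def vifm(vehicles, width, height, m=6, thresh=3):
    w, h = width / m, height / m          # cell size
    fmap = [[0] * m for _ in range(m)]
    for x, y in vehicles:
        i = min(int(x // w), m - 1)       # clamp points on the far edge
        j = min(int(y // h), m - 1)
        fmap[j][i] += 1
    cells = [c for row in fmap for c in row]
    total = sum(cells)                    # total vehicle count
    peak = max(cells)                     # densest cell
    count = sum(1 for c in cells if c > thresh)
    return fmap, total, peak, count

fmap, total, peak, count = vifm([(10, 10), (15, 12), (500, 300)], 640, 360, m=6)
print(total, peak, count)  # 3 2 0
```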

Multi-Branch Convolutional Neural Network (MBCNN)
This paper proposes a multi-branch convolutional neural network (MBCNN) for automatic detection of traffic congestion. Since the features proposed in this article form a two-dimensional matrix, their structure is similar to image data; therefore, this paper uses a convolutional neural network as the classifier. The classifier outputs a confidence value: if the value is close to 1, the image is congested, and if the value is close to 0, it is non-congested.
In order to capture features of different scales, the feature map first passes through 3 convolution branches. The convolution kernel sizes in the three branches are 1, 2, and 3, respectively, and the padding is set to "same", so the feature map size remains m × m throughout. In each branch, the input passes through a convolutional layer, a ReLU layer, a second convolutional layer, and a second ReLU layer. Then, the features extracted from the three branches are concatenated. The concatenated feature map passes through a dropout layer, a flattening layer, and finally a fully connected layer with an output length of 3 to obtain the vector representation of the image. The inference formulas for the convolutional part of the classifier are as follows:

F_k = ReLU( W_2^(k) ⊗ ReLU( W_1^(k) ⊗ image + b_1^(k) ) + b_2^(k) ),  k = 1, 2, 3
H = W_3 · Flatten( Dropout( F_1 ⊕ F_2 ⊕ F_3 ) ) + b_3

where image is the feature map input, ⊗ is the convolution operator, and ⊕ represents concatenation. W_1^(k) and W_2^(k) are the convolutional kernels of the two convolutional layers in branch k, and b_1^(k) and b_2^(k) are their bias terms. ReLU is a nonlinear activation function. W_3 and b_3 are the parameters of the fully connected layer, whose output dimension is 3. Dropout is used to prevent overfitting, with a dropout probability of 0.5.
In order to incorporate the three scalar features (the total number of vehicles, the maximum value, and the count of values greater than the threshold), the features extracted by the convolutional network are first concatenated with them. Then, after two fully connected layers, the output is the congestion confidence. The final fully connected inference process is as follows:

Y = Sigmoid( W_5 · ReLU( W_4 · ( H ⊕ [total, max, count] ) + b_4 ) + b_5 )

where ⊕ represents vector concatenation, W_4 and W_5 are the weights of the two fully connected layers, and b_4 and b_5 are their bias terms. Sigmoid is the final output layer: Y > 0.5 means the picture is congested, and Y ≤ 0.5 means it is non-congested. total is the total number of vehicles in the feature map, max is the maximum element of the feature map, and count is the number of elements in the feature map greater than the threshold. The structure of the classifier is shown in Figure 3, where input1 is the feature map and input2 consists of the total number of vehicles, the maximum value, and the number of elements greater than the threshold. The output is the confidence probability that the image shows traffic congestion.
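A toy NumPy sketch of this forward pass, reduced to one filter and a single convolutional layer per branch with random weights (illustrative of the multi-branch structure only, not the trained model described above):

```python
# Minimal NumPy forward-pass sketch of the MBCNN idea: three convolution
# branches with kernel sizes 1, 2, 3 ("same" padding) and ReLU, concatenated
# with the three scalar features, then fully connected layers ending in a
# sigmoid. Toy-sized, random weights, illustrative only.
import numpy as np

rng = np.random.default_rng(0)
m = 6
fmap = rng.poisson(1.0, size=(m, m)).astype(float)  # stand-in VIFM

def conv_same(x, kernel):
    """2D cross-correlation with 'same' padding."""
    k = kernel.shape[0]
    top = left = (k - 1) // 2
    xp = np.pad(x, ((top, k - 1 - top), (left, k - 1 - left)))
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k, j:j + k] * kernel)
    return out

# Three branches with kernel sizes 1, 2, 3, each followed by ReLU, flattened
feats = np.concatenate(
    [np.maximum(conv_same(fmap, rng.normal(size=(k, k))), 0.0).ravel()
     for k in (1, 2, 3)]
)
scalars = np.array([fmap.sum(), fmap.max(), float((fmap > 3).sum())])
h = np.concatenate([feats, scalars])        # fuse conv and scalar features

W4 = rng.normal(size=(12, h.size)) * 0.05   # first FC layer (12 units, ReLU)
W5 = rng.normal(size=(1, 12)) * 0.05        # second FC layer (1 unit, sigmoid)
y = 1.0 / (1.0 + np.exp(-(W5 @ np.maximum(W4 @ h, 0.0))))
print(feats.size, 0.0 < y[0] < 1.0)  # 108 True
```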
The loss function for training this model is the binary cross-entropy loss, defined as follows:

L = −(1/B) Σ_{s=1}^{B} [ y_s · log(Y_s) + (1 − y_s) · log(1 − Y_s) ]

where B is the number of training samples, y_s is the ground-truth label of sample s (1 for congested, 0 for non-congested), and Y_s is the predicted confidence. The classification model is trained using the Adam optimization algorithm; Adam is a common optimization algorithm, and readers can refer to [43].
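For a single sample, this loss can be written directly in Python (the clamping constant eps is a standard numerical-stability guard, not a value from the paper):

```python
# Binary cross-entropy loss for one sample: y is the ground-truth label
# (0 or 1), p the predicted congestion confidence in (0, 1).
import math

def bce(y: int, p: float, eps: float = 1e-12) -> float:
    p = min(max(p, eps), 1.0 - eps)   # guard against log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

print(round(bce(1, 0.9), 4))  # 0.1054
```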

Evaluation Index
This article uses the F1 score as the main evaluation index for traffic congestion detection:

F1 = 2·TP / (2·TP + FP + FN)

The second indicator in this article is accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP (true positives) is the number of pictures whose actual category is congested and that are identified as congested; TN (true negatives) is the number of pictures whose actual category is non-congested and that are identified as non-congested; FP (false positives) is the number of pictures whose actual category is non-congested but that are identified as congested; and FN (false negatives) is the number of pictures whose actual category is congested but that are identified as non-congested.
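Both indices follow directly from the four confusion counts; the counts in the example below are illustrative, not the paper's:

```python
# F1 score and accuracy from the confusion counts defined above.

def f1_and_accuracy(tp: int, tn: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return f1, accuracy

# Hypothetical counts on a 729-image test set: 360 TP, 359 TN, 5 FP, 5 FN
f1, acc = f1_and_accuracy(360, 359, 5, 5)
print(round(f1 * 100, 2), round(acc * 100, 2))  # 98.63 98.63
```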

Numerical Experiments
This section will conduct numerical experiments for traffic congestion detection based on actual data. Section 3.1 introduces the data used for the numerical experiments. Section 3.2 discusses the results of object detection. Section 3.3 presents the results of feature map extraction. Section 3.4 presents the classification results. Section 3.5 compares the detection accuracy of different classification models and different object detection models. Section 3.6 shows the experimental results on other datasets. Section 3.7 analyzes time and space complexity. Section 3.8 shows the choice of hyperparameter m.

Dataset
The Chinese City Traffic Image Database (CCTRIB) is an image dataset used for road congestion status detection [44]. CCTRIB images come from traffic videos captured by surveillance cameras on key roads in multiple cities, including highways, urban roads, and expressways. Images are sampled from the videos once every 500 frames. The videos collected on each key road cover various situations such as lighting changes, weather changes, and different imaging scales.
The dataset has a total of 9200 images, including 4600 traffic congestion images and 4600 non-congestion images. The image resolution ranges from 480 × 320 to 1920 × 1080 pixels, suitable for training and testing road congestion detection algorithms. There are 8471 pictures in the training set and 729 pictures in the test set, with congested and non-congested images each making up 50% of both sets. Figure 4 shows examples from the CCTRIB dataset: the upper part shows congested images, and the lower part shows non-congested images.
The congested images include 1160 daytime images and 870 nighttime images under clear skies, 970 daytime images and 800 nighttime images under cloudy conditions, and 390 daytime images and 410 nighttime images in rain and mist. The non-congested images include 940 daytime images and 1160 nighttime images under clear conditions, 890 daytime images and 890 nighttime images under cloudy conditions, and 390 daytime images and 330 nighttime images under rainy and foggy conditions. Figure 5 shows the distribution of the CCTRIB dataset images.
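The breakdown above can be checked for consistency: the day/night counts per weather condition should sum to 4600 images per class:

```python
# Consistency check of the CCTRIB breakdown reported above.
congested = {"clear": (1160, 870), "cloudy": (970, 800), "rain_fog": (390, 410)}
non_congested = {"clear": (940, 1160), "cloudy": (890, 890), "rain_fog": (390, 330)}

total_c = sum(day + night for day, night in congested.values())
total_n = sum(day + night for day, night in non_congested.values())
print(total_c, total_n)  # 4600 4600
```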

Target Detection Results
This article first performs target detection on the camera images using the YOLOv8 model and extracts the location and quantity information of vehicles in all images of the training and test sets, plotting the detected locations and confidence levels. The detection results on congested images are shown in Figure 6, and the detection results on non-congested images are shown in Figure 7.



Feature Map
This article extracts a feature map based on the vehicle position information obtained by the target detection model and uses it for the final classification model. We divide the entire image into m × m squares and, in this experiment, set the parameter m = 6. We chose m = 6 because 5-fold cross-validation on the training set showed that it produces a near-optimal result; Section 3.8 discusses the value of m in detail. The number of vehicles in each square is then counted, and the resulting feature map is used as the input of the classification model. Figure 8 shows the feature map extraction results for four images: the upper two are feature maps of congested images, and the lower two are feature maps of non-congested images.

Classification Results
We use the CCTRIB training set to train our MBCNN model and evaluate the accuracy of the predictions on the test set. There are 8471 pictures in the training set and 729 pictures in the test set. The overall prediction process is to first use the pre-trained YOLOv8 model to detect vehicles, then use a VIFM to extract the feature map and higher-level features, and finally use the MBCNN model to predict whether there is traffic congestion in the image.
We set the size of the feature map to 6 × 6 and the element threshold thresh = 3. For the first branch, the convolution kernel size is set to 1 × 1, and the output channel numbers of the two convolution layers are 64 and 32, respectively; the nonlinear activation function is ReLU, and the padding is set to "same". For the second branch, the convolution kernel size is set to 2 × 2, with the same channel numbers, activation function, and padding. For the third branch, the convolution kernel size is set to 3 × 3, again with output channels 64 and 32, ReLU activation, and "same" padding. The dropout layer has an inactivation probability of 0.5. The output size of the first fully connected layer is 3 with ReLU activation, the output size of the second fully connected layer is 12 with ReLU activation, and the output size of the third fully connected layer is 1 with sigmoid activation.
This article first shows the prediction results for some test-set data. Figure 9 shows 16 pictures predicted to be congested, and Figure 10 shows 16 pictures predicted to be non-congested. As can be seen from the figures, the method in this paper can effectively detect traffic congestion. The decision threshold for this experiment is 0.5: a confidence greater than 0.5 indicates that the image is more likely congested; otherwise, it is more likely non-congested. On the entire test set, the F1 score of this algorithm is 98.61%, and the accuracy is 98.62%.


Comparative Experiment
This article uses the CNN model as the final classification model. First, our baseline model only considers the number of vehicles: YOLOv8 detects vehicles, the number in each image is counted, and images with counts above a threshold are identified as congested while those below are identified as non-congested. The threshold is determined using a logistic regression model [45].
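A minimal sketch of this count-only baseline; a simple threshold sweep on hypothetical training counts stands in here for the logistic-regression fit used in the paper:

```python
# Sketch of the count-only baseline: pick the vehicle-count cutoff that best
# separates congested from non-congested training images.

def fit_count_threshold(counts, labels):
    """counts: vehicles per image; labels: 1 = congested, 0 = non-congested."""
    best_t, best_acc = 0, 0.0
    for t in range(max(counts) + 1):
        preds = [1 if c > t else 0 for c in counts]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Toy training data (hypothetical vehicle counts)
counts = [22, 30, 18, 25, 4, 7, 9, 6]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(fit_count_threshold(counts, labels))  # (9, 1.0)
```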
This article compares this model with feed-forward neural networks [46], support vector machines [47], etc. During comparison, the feature map is first converted into a 1-dimensional vector and then input into the classification model for classification.
The number of hidden layer units of the feed-forward neural network is set to 256, the hidden layer activation function is ReLU, the output layer has 1 unit with the sigmoid activation function, the binary cross-entropy loss function is used, and the optimization algorithm is Adam. The parameter C of the SVM classifier is set to 1, the kernel function is the RBF kernel, the degree is set to 3, the tolerance is set to 0.001, the cache size is set to 200, and the decision function is one-vs-rest (ovr).
At the same time, this article also compares the classification accuracy of several large deep neural networks, as shown in Table 1. We use large pre-trained networks to extract features directly from the original images. The VGG16 model [48] outputs a 512-dimensional feature vector, the ResNet50 model [49] outputs a 2048-dimensional feature vector, and the EfficientNet_b7 model [50] outputs a 2560-dimensional feature vector. For these large pre-trained networks, the final classifier has 128 hidden units with the ReLU activation function and a single output unit with the sigmoid activation function. It is worth noting that convolutional neural networks are still rarely used for traffic congestion detection; the convolutional neural network references cited in this article come from applications in other fields, and these models are used mainly for comparison.

Model                       F1 (%)   Accuracy (%)
Feed-forward NN [46]        97.04    97.11
SVM [47]                    98.17    97.80
CNN (VGG16) [48]            98.12    98.08
CNN (ResNet50) [49]         97.17    97.11
CNN (EfficientNet_b7) [50]  95.25    95.19
MBCNN (this paper)          98.61    98.62

In order to choose the optimal target detection model, this article also compares the traffic congestion detection results obtained when using the YOLOv8 [20], YOLOv5 [21], SSD [22,23], and Haar Cascade [19] models (Table 2).

Model               F1 (%)   Accuracy (%)
Haar Cascade [19]   71.54    75.44
YOLOv8 [20]         98.61    98.62
YOLOv5 [21]         95.38    95.47
SSD [22]            72.49    76.26

In order to display the detection effects of the four target detection models more intuitively, we plot the detection results for one image from the dataset in Figure 11. It can be seen that the YOLOv8 and YOLOv5 models achieve relatively better results.

Experiments on Other Datasets
To further verify the generalization performance of the proposed method, we conducted inference experiments on a subset of the traffic net dataset [51]. The traffic net dataset contains four categories of data (accidents, dense traffic, fires, and sparse traffic), with 1100 images per category.

We trained our model on the dense traffic and sparse traffic categories of the traffic net dataset and validated the effectiveness of the various methods on the test set. The hyperparameter calibration of these models is the same as that in Section 3.5. The results are shown in Table 3. It can be seen that the model in this paper achieves better results than the baseline model, the machine learning models, and the pre-trained convolutional neural network models. It is worth noting that some cameras in the traffic net dataset are installed at locations different from those in CCTRIB, and the traffic net training set contains only 1800 images, so the slightly worse generalization performance is expected. We extracted 10 images each from the dense and sparse traffic categories, for a total of 20 images, and used the MBCNN model for inference. The results are shown in Figure 12. As can be seen from the figure, the model in this article can effectively identify congested and non-congested images in the traffic net dataset.
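The F1 scores and accuracies reported in Tables 1-3 follow the standard binary-classification definitions, which can be computed as below; the labels and predictions here are illustrative only, not actual model outputs:

```python
import numpy as np

def f1_and_accuracy(y_true, y_pred):
    """Binary F1 score and accuracy, with congested = 1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = np.mean(y_true == y_pred)
    return f1, accuracy

# Toy example: 6 images, one congested image missed by the classifier.
f1, acc = f1_and_accuracy([1, 1, 0, 0, 1, 0], [1, 1, 0, 0, 0, 0])
print(round(float(f1), 3), round(float(acc), 3))
```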

Time and Space Complexity Analysis
In order to analyze the performance of this algorithm more comprehensively, this article examines its time and space complexity. The experiments were run on a computer with a 1.9 GHz CPU, 16 GB of memory, and a 64-bit operating system. The analysis results are shown in Table 4. It can be seen that the time and space requirements of this method are well within current computing capabilities.
To determine the impact of the feature map size m on classification accuracy, we used 5-fold cross-validation on the training set; that is, in each fold, 20% of the training data were used as the validation set. This article tested the classification performance on the validation set for m = 4, 5, 6, and 7. The results are shown in Table 5, from which it can be seen that the optimal value of m is 6.
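The selection procedure for m can be sketched as follows. Only the 5-fold splitting protocol reflects the experiment; the scoring function is a synthetic placeholder standing in for training and evaluating MBCNN on each fold, shaped to peak at m = 6 so that the sketch mirrors the reported outcome:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train = 8471                      # CCTRIB training-set size
indices = rng.permutation(n_train)

def five_fold_splits(idx):
    # Each fold holds out 20% of the training data as a validation set.
    folds = np.array_split(idx, 5)
    for k in range(5):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train, val

def validation_accuracy(m, train_idx, val_idx):
    # Placeholder score: in the real experiment this would train MBCNN
    # on m x m feature maps and evaluate on the held-out fold.
    return 0.95 + 0.01 * (3 - abs(m - 6)) + rng.normal(0, 1e-4)

scores = {}
for m in (4, 5, 6, 7):
    accs = [validation_accuracy(m, tr, va) for tr, va in five_fold_splits(indices)]
    scores[m] = float(np.mean(accs))

best_m = max(scores, key=scores.get)   # hyperparameter with best mean validation score
print(best_m)
```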

Conclusions
This article studies automatic traffic congestion detection from image data and proposes a new congestion feature extraction method, VIFM, and a new congestion classifier, MBCNN. The approach can provide effective data support for traffic management, reduce system operating costs, and offer a new method for automatic detection.
This method uses the YOLOv8 model to detect vehicle information in images, extracts feature maps and high-level features of the vehicle information with the VIFM method, and uses MBCNN to identify whether an image shows traffic congestion. The method proposed in this paper can effectively utilize the existing massive network of cameras in the traffic system to automatically detect traffic congestion without increasing hardware costs.
In Section 3, the detection results of YOLOv8, the feature maps extracted by VIFM, the congestion recognition results of MBCNN, the comparisons with other classifiers and object detection models, and the time and space complexity of the method are discussed (see Tables 1-3). Numerical experiments show that the proposed method achieves good results.
In future research, we will study end-to-end multi-stage traffic congestion recognition methods that integrate target detection, feature map extraction, and congestion classification into a single end-to-end model to further improve the accuracy and computational efficiency of traffic congestion recognition.

Figure 2 .
Figure 2. Schematic diagram of feature map extraction.

Figure 3 .
Figure 3. Classifier structure diagram based on convolutional neural network.

Figure 4 .
Figure 4. Examples of CCTRIB dataset images. The congested images include 1160 daytime and 870 nighttime images under clear skies, 970 daytime and 800 nighttime images under cloudy conditions, and 390 daytime and 410 nighttime images in rain and mist. The non-congested images include 940 daytime and 1160 nighttime images under clear conditions, 890 daytime and 890 nighttime images under cloudy conditions, and 390 daytime and 330 nighttime images under rainy and foggy conditions. Figure 5 shows the distribution of the CCTRIB dataset images.

Figure 5 .
Figure 5. Image distribution of the CCTRIB dataset.

Figure 6 .
Figure 6. Object detection results of images from the category "congested". Each picture shown is randomly selected from the congestion category of the dataset.

Figure 7 .
Figure 7. Object detection results from images with the category "non-congested". Each picture shown is randomly selected from the non-congested category of the dataset.

Figure 8 .
Figure 8. Camera image feature map extraction results.

We use the training set data of CCTRIB to train our MBCNN model and evaluate the accuracy of the prediction results on the test set. There are 8471 pictures in the training set and 729 pictures in the test set. The overall prediction process is to first use the pre-trained YOLOv8 model to detect vehicles, then use VIFM to extract feature maps and high-level features, and finally use MBCNN to classify congestion.
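As a rough, simplified stand-in for the VIFM feature-map step (an assumption for illustration only, not the paper's exact construction, which is described earlier), one can picture rasterizing the detected vehicle boxes onto an m x m grid of per-cell vehicle counts:

```python
import numpy as np

def vehicle_count_grid(boxes, img_w, img_h, m=6):
    """Rasterize detected vehicle boxes onto an m x m grid by counting
    box centers per cell. A simplified illustration of a vehicle-information
    feature map, not the paper's exact VIFM construction."""
    grid = np.zeros((m, m))
    for x1, y1, x2, y2 in boxes:
        cx = (x1 + x2) / 2.0 / img_w    # normalized box-center coordinates
        cy = (y1 + y2) / 2.0 / img_h
        col = min(int(cx * m), m - 1)   # clamp centers on the right/bottom edge
        row = min(int(cy * m), m - 1)
        grid[row, col] += 1
    return grid

# Three hypothetical YOLOv8 boxes (x1, y1, x2, y2) in a 640 x 480 image.
boxes = [(10, 20, 110, 80), (300, 200, 380, 260), (320, 210, 400, 280)]
g = vehicle_count_grid(boxes, 640, 480, m=6)
print(int(g.sum()))
```

A grid of this kind would then be the two-dimensional input that a convolutional classifier such as MBCNN consumes.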

Figure 9 .
Figure 9. Traffic congestion prediction results. A confidence level close to 1 indicates traffic congestion.

Figure 10 .
Figure 10. Traffic congestion prediction results. A confidence level close to 0 indicates no traffic congestion.

Figure 11 .
Figure 11. Detection results of the four target detection models.

Figure 12 .
Figure 12. The inference results of the method on the traffic net dataset. The text in each image is the prediction category, and the number is the prediction confidence.

In Section 3: (1) this article analyzes the detection results of YOLOv8; (2) it analyzes the feature maps extracted by VIFM; (3) it analyzes the congestion recognition results of MBCNN; (4) it compares the congestion recognition results of different classifiers and different object detection models; (5) it analyzes the spatial and temporal complexity of the method. This article verifies the effectiveness of the proposed method on the CCTRIB dataset: (1) it verifies the effectiveness of the VIFM method and the MBCNN classifier, with the classification model achieving an F1 score of 98.61% and an accuracy of 98.62% on the test set; (2) it compares the detection accuracy of the model with other object detection models, other convolutional neural network models, other deep learning models, and baseline models (see Tables 1-3).

Table 1 .
Comparison between the classification model of this article and other classification models.

Table 2 .
Comparison of target detection models.

Table 3 .
Comparison of prediction results on the traffic net dataset.

Table 4 .
Time and space complexity analysis results.

Table 5 .
Results for different values of the hyperparameter m.