
A Vehicle Detection Method Based on an Improved U-YOLO Network for High-Resolution Remote-Sensing Images

1 School of Transportation and Logistics Engineering, Wuhan University of Technology, Wuhan 430070, China
2 College of Transportation Engineering, Xinjiang University, Urumqi 830046, China
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(13), 10397; https://doi.org/10.3390/su151310397
Submission received: 27 May 2023 / Revised: 27 June 2023 / Accepted: 28 June 2023 / Published: 30 June 2023
(This article belongs to the Special Issue Transportation and Vehicle Automation)

Abstract

The lack of vehicle feature information and the limited number of pixels in high-definition remote-sensing images cause difficulties in vehicle detection. This paper proposes U-YOLO, a vehicle detection method that integrates multi-scale features, attention mechanisms, and sub-pixel convolution. The adaptive fusion module (AF) is added to the backbone of the YOLO detection model to increase the underlying structural information of the feature map. Cross-scale channel attention (CSCA) is introduced to the feature fusion part to obtain the vehicle's explicit semantic information and further refine the feature map. The sub-pixel convolution module (SC) replaces the linear interpolation up-sampling of the original model and enlarges the vehicle target feature map to further improve vehicle detection accuracy. The detection accuracies on the open-source NWPU VHR-10 and DOTA datasets were 91.35% and 71.38%, respectively. Compared with the original network model, the detection accuracy on these two datasets increased by 6.89% and 4.94%, respectively. Compared with the commonly used classic target detection networks RFBNet, M2det, and SSD300, the average precision values increased by 6.84%, 6.38%, and 12.41%, respectively. The proposed method effectively addresses the problem of low vehicle detection accuracy and provides an effective basis for promoting the application of high-definition remote-sensing images in traffic target detection and traffic flow parameter detection.

1. Introduction

The rapid development of intelligent transportation in recent years, coupled with the increasing ability of remote-sensing technology to acquire Earth observation data, has led to more attention being paid to vehicle detection using high-resolution remote-sensing images. As the foundation and core of smart travel, vehicle detection has important practical significance for target tracking and event detection [1,2,3,4,5]. However, traditional vehicle detection methods have a high installation cost and a large installation workload. The use of remote-sensing-image-based vehicle detection can cover a larger area of ground, which is more suitable for large-scale vehicle target detection. The detection results can be used for intelligent regulation of the number of vehicles, traffic flow detection, and construction of intelligent transportation systems [6,7].
Traditional remote-sensing-image-based vehicle detection methods usually use sliding-window-based search or feature extraction algorithms, which have a high computational cost and low detection accuracy. The rapid development of deep learning has produced methods that are clearly superior to traditional methods for the detection of high-resolution remote-sensing images. However, there is still a need to consider the lack of sufficient vehicle feature information and pixel values in remote-sensing images, which results in insufficient distinguishability of information. Current effective strategies to improve detection accuracy mainly include enhancing feature context information and extracting target saliency features.
To improve detection accuracy by enhancing feature context information, Liu et al. [8] proposed a YOLOv3-FDL model for multi-scale feature fusion, which reassigns contextual information to the four scales via the K-Means++ clustering algorithm. Zou et al. [9] improved the detection accuracy of traffic signage via a bidirectional feature pyramid to extract the context information in the feature layer. Hua et al. [10] improved the ability of the model to recognize fine details by constructing a multi-scale hybrid attention module to aggregate contextual information of the input image. Yadav et al. [11] designed a multi-scale feature fusion module by analyzing three characteristics of grayscale images, which makes full use of the contextual information of the features and improves detection accuracy. Ye et al. [12] used a feature adaptation module with a residual structure [13] to obtain contextual information about the target and reinforce the target features. Liu et al. [14] proposed a feature pyramid composite neural network structure combining contextual enhancement and feature refinement to address the problem of feature scattering and semantic differences between layers for tiny targets. Jiao et al. [15] added an enhancement module of contextual information combined with convolution to the feature extraction module to improve the global characterization capability of the model.
To improve detection accuracy by extracting target salient features, Wu et al. [16] enhanced the semantic information of small penguins in remote-sensing images by designing a multi-frequency feature fusion module (MAConv) and a bottleneck-efficient aggregation layer (BELAN) and further extracted low-frequency information using a lightweight Swin Transformer (LSViT) and an attention mechanism. Chen et al. [17] added an extended Dilated Attention Module (DAM) to the YOLOV3 [18] detection framework to expand the perceptual field of the convolution kernel, highlight the difference between background and target, and improve the accuracy of remote-sensing detection. Guo et al. [19] added polymorphic and group attention modules to YOLOV3 to capture multi-scale and multi-shape features of targets and to enhance feature structure information. Qu [20] proposed a combination of adaptive spatial fusion and residual attention to fuse aircraft feature information. The attention-based approach can selectively suppress regionally irrelevant information, enhance feature-related information, and improve the effectiveness of feature representation. The above-mentioned attention-based mechanisms to obtain salient information only guide the learning phase of single-scale features without considering the multi-scale features of the target.
This study synthesizes the above research to address the problem of low vehicle detection accuracy caused by the limited vehicle feature information and limited number of pixels in high-definition remote-sensing images. A one-stage remote-sensing image target detection network, the improved U-YOLO, is proposed by combining the two approaches of enhancing feature context information and extracting target saliency features. The method retains more valid vehicle information features via the cross-scale channel attention module (CSCA) and enhances the feature information of vehicles in remote-sensing images via the sub-pixel convolution module (SC). We then verify the generalization and robustness of this model on the NWPU VHR-10 and DOTA datasets and use transfer learning for model pre-training. The experimental results show that the U-YOLO model effectively alleviates the low accuracy of vehicle detection caused by the small amount of vehicle feature information and the limited number of pixels in high-definition remote-sensing images and can provide an effective basis for advancing the application of high-definition remote-sensing images in the detection of traffic flow parameters.

2. Improved U-YOLO Network Construction

2.1. YOLOV3

YOLOV3 is a fast one-stage target detection network. In this paper, we cluster the targets according to the boxes in the remote-sensing image dataset and redesign the prior frame size for detecting the targets based on the YOLOV3 detection framework. In order to improve the accuracy of small target detection, the detection layer is increased to four output layers, where the output size of each layer is given by Equation (1) as follows:
$$
\begin{aligned}
p_4 &= (n_{class} + n_{predict}) \times n_{box} \times 13 \times 13 \\
p_3 &= (n_{class} + n_{predict}) \times n_{box} \times 26 \times 26 \\
p_2 &= (n_{class} + n_{predict}) \times n_{box} \times 52 \times 52 \\
p_1 &= (n_{class} + n_{predict}) \times n_{box} \times 104 \times 104
\end{aligned}
\tag{1}
$$
where nclass is the number of categories, npredict contains five prediction values (x, y, w, h, confidence), and nbox is the prior box designed for each pixel, which takes the value of 4. A total of 16 prior boxes of different scales are used.
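As a quick illustration of Equation (1), the following minimal Python sketch prints the four head output shapes; the 416 × 416 input size implied by the 13/26/52/104 grids and the 10-class setting are assumptions for the example rather than values stated above.

```python
# Minimal sketch of the four detection-head output shapes from Equation (1).
# The names n_class, n_predict, and n_box follow the text; the grid sizes
# assume a 416 x 416 input downsampled by strides of 32, 16, 8, and 4.

def head_output_shape(n_class, grid, n_box=4, n_predict=5):
    """Return the (channels, height, width) shape of one detection head."""
    return ((n_class + n_predict) * n_box, grid, grid)

if __name__ == "__main__":
    n_class = 10  # e.g., the 10 categories of NWPU VHR-10
    for name, grid in zip(("p4", "p3", "p2", "p1"), (13, 26, 52, 104)):
        print(name, head_output_shape(n_class, grid))
    # 4 output layers x 4 prior boxes per layer = 16 priors in total
```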

2.2. Improved U-YOLO Network

In order to enhance the connection between feature information and improve the adaptability of the network to multi-scale targets, an improved U-YOLO remote-sensing image target detection network based on the practice of output feature layer detection in the YOLOV3 network is constructed, as shown in Figure 1. Firstly, the AF module is used to process the main output feature map to obtain the multi-scale features of the target. Secondly, the CSCA module is used to process the output features of the AF module and refine the feature map by constructing multi-scale fusion weight vectors so as to extract the target saliency features. Thirdly, the SC module is used to restructure the multi-channel feature map of the target and increase the resolution of small target features by enlarging the feature map.

2.2.1. Introduction of AF Module

Because shallow information has to pass through many layers of the backbone network, excessive convolutional layers destroy the original structural information of the target and reduce localization accuracy. Therefore, the AF module is introduced to reduce the loss of contextual information in the feature map. In the AF module, the multi-scale information of the target is fused via a two-layer convolutional structure. The output of each feature layer after processing by the AF module is given by Equation (2) as follows:
$$
\begin{aligned}
ss_2 &= ba_2 \\
ss_3 &= ba_3 + DW_1(ss_2) \\
ss_4 &= ba_4 + DW_2(ss_3)
\end{aligned}
\tag{2}
$$
where [ba_4, ba_3, ba_2] is the three-layer output of the backbone network, which is processed by the AF module into [ss_4, ss_3, ss_2], and DW is a 3 × 3 convolution with a stride of 2. The AF module uses convolution-based down-sampling to connect and pass the underlying structural information to the higher-level information, reducing the loss of the target's structural information.
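A minimal PyTorch sketch of Equation (2) is given below. The channel widths (256, 512, 1024) and the interpretation of DW as a standard 3 × 3, stride-2 convolution are assumptions made to keep the example self-contained; the exact layer configurations are not specified above.

```python
import torch
import torch.nn as nn

class AFModule(nn.Module):
    """Sketch of the adaptive fusion (AF) module from Equation (2).

    Each DW block is taken to be a 3x3, stride-2 convolution that downsamples
    the shallower map so it can be added to the next deeper one.
    """

    def __init__(self, c2=256, c3=512, c4=1024):
        super().__init__()
        self.dw1 = nn.Conv2d(c2, c3, kernel_size=3, stride=2, padding=1)
        self.dw2 = nn.Conv2d(c3, c4, kernel_size=3, stride=2, padding=1)

    def forward(self, ba2, ba3, ba4):
        ss2 = ba2                   # ss2 = ba2
        ss3 = ba3 + self.dw1(ss2)   # ss3 = ba3 + DW1(ss2)
        ss4 = ba4 + self.dw2(ss3)   # ss4 = ba4 + DW2(ss3)
        return ss2, ss3, ss4

if __name__ == "__main__":
    ba2 = torch.randn(1, 256, 52, 52)
    ba3 = torch.randn(1, 512, 26, 26)
    ba4 = torch.randn(1, 1024, 13, 13)
    for ss in AFModule()(ba2, ba3, ba4):
        print(ss.shape)
```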

2.2.2. Improvement of CSCA Module

Since vehicles carry little information in remote-sensing images, the distinguishability of target information is insufficient, and attention-based methods (SE, CBAM) are commonly used to improve the representation of target features. However, such methods frequently neglect the multi-scale information of the target, which weakens multi-scale detection. Thus, the CSCA module is proposed to obtain the saliency information of the target by generating fusion weights. In remote-sensing images with a large span of target scales, the target area features can be effectively enhanced and the background features weakened. The structure of the CSCA module is shown in Figure 2.
The input features Ψ_p(i,j) ∈ R^(C×H×W) have dimensions of C × H × W. The channel weights are obtained by compressing the spatial dimensions of the feature layer via adaptive mean pooling, and the output is given by Equation (3) as follows:
$$
w_p = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \psi_p(i,j)
\tag{3}
$$
where p ∈ {2, 3, 4} and the dimension of w_p is 1 × 1 × c_p. The channel weights contain the saliency information of each channel; they are downscaled via a fully connected layer and adjusted by Equation (4) as follows:
$$
s_p = \sigma(w_1 w_p)
\tag{4}
$$
where w_1 is a fully connected layer, σ is the ReLU activation function, and the dimension of s_p is 1 × 1 × 128. The adjusted channel weight vectors are stacked along the H direction, yielding a 3 × 1 × 128 fusion weight. Then, normalization is used to learn the feature distribution, and the output is shown in Equation (5).
$$
s_k = \sigma\big(w_2\,\mathrm{Concat}(s_4, s_3, s_2)\big)
\tag{5}
$$
where w_2 is a 3 × 1 convolution kernel. The resulting s_k is the fusion weight vector, which serves as the basis for matching the multi-scale features; the feature distribution of s_k is learned separately for each scale. The output is a three-channel weight vector, which is used to fine-tune the w_p weight vector according to Equation (6) as follows:
$$
f_p = \delta\big(\varphi(\sigma(w_1 s_k)) + w_p\big)
\tag{6}
$$
where φ is the normalization function and δ is the Sigmoid function. Finally, the salient information of the feature layer is obtained using the fusion weights, as shown in Equation (7):
$$
cs_p = f_p \cdot ss_p
\tag{7}
$$
where cs_p is the output feature and ss_p is the input feature. The CSCA module extracts the salient features of the target by obtaining the fused feature weights, which requires fewer parameters than up- and down-sampling-based methods, and it can fuse feature information at different scales.
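The following PyTorch sketch shows one plausible realization of Equations (3)–(7). The channel widths (256, 512, 1024), the ReLU in the fusion step, and the sigmoid gating used for Equation (6) are assumptions made so the example is concrete and runnable; they are not specified above.

```python
import torch
import torch.nn as nn

class CSCAModule(nn.Module):
    """Sketch of the cross-scale channel attention (CSCA) module.

    Per-scale channel weights from global average pooling are projected to a
    common 128-d space, fused across the three scales with a 3x1 convolution,
    and expanded back to re-weight each scale (Equations (3)-(7)).
    """

    def __init__(self, channels=(256, 512, 1024), hidden=128):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                        # Eq. (3)
        self.squeeze = nn.ModuleList(
            [nn.Linear(c, hidden) for c in channels])              # Eq. (4)
        self.fuse = nn.Conv2d(1, 1, kernel_size=(3, 1))            # Eq. (5)
        self.expand = nn.ModuleList(
            [nn.Linear(hidden, c) for c in channels])              # Eq. (6)
        self.relu, self.sigmoid = nn.ReLU(), nn.Sigmoid()

    def forward(self, ss2, ss3, ss4):
        feats = (ss2, ss3, ss4)
        w = [self.pool(f).flatten(1) for f in feats]               # w_p: (B, c_p)
        s = [self.relu(fc(wp)) for fc, wp in zip(self.squeeze, w)]
        stacked = torch.stack(s, dim=1).unsqueeze(1)               # (B, 1, 3, 128)
        sk = self.relu(self.fuse(stacked)).squeeze(1).squeeze(1)   # s_k: (B, 128)
        out = []
        for fc, wp, f in zip(self.expand, w, feats):
            fp = self.sigmoid(fc(sk) + wp)                         # f_p, Eq. (6)
            out.append(f * fp[:, :, None, None])                   # cs_p, Eq. (7)
        return tuple(out)

if __name__ == "__main__":
    x = [torch.randn(1, c, s, s) for c, s in ((256, 52), (512, 26), (1024, 13))]
    for cs in CSCAModule()(*x):
        print(cs.shape)
```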

2.2.3. Replacement of SC Module

After multi-scale information fusion in the AF module and target feature enhancement in the CSCA module, an up-sampling method based on sub-pixel convolution [20] is applied, which enlarges the feature layer resolution and enhances the feature integrity of small targets. The input feature map is defined as Ψ_p(i,j) ∈ R^(C×H×W) with channel dimension C and resolution H × W. After sub-pixel convolution reorganizes the channel information in the feature layer, the output is Ψ'_p(i,j) ∈ R^((C/r²)×rH×rW) with channel dimension C/r² and resolution rH × rW. This up-sampling method reorganizes the pixel points at the same position across channels according to the arrangement shown in Figure 3. In this way, the channel information of the feature layer is exploited to obtain more and finer feature information.
High accuracy can often be achieved by enlarging the feature layer to detect small targets. Linear-interpolation-based up-sampling methods zoom in on the image by computing interpolated values from neighboring pixels, but they do not take channel feature information into account. Deconvolution-based methods use inverse convolution kernels to generate estimates that complete the image, but they also ignore multi-channel information and are relatively computationally intensive. Therefore, this paper proposes an improved SC up-sampling module based on sub-pixel convolution, and its output is shown in Equation (8).
$$
\begin{aligned}
sc_4 &= w(cs_4) \\
sc_3 &= w\big(cs_3 + \mathrm{supix}(sc_4)\big) \\
sc_2 &= w\big(cs_2 + \mathrm{supix}(sc_3)\big) \\
sc_1 &= w\big(ba_1 + \mathrm{supix}(sc_2)\big)
\end{aligned}
\tag{8}
$$
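As a sketch of one supix step in Equation (8), the following PyTorch example pairs a 1 × 1 convolution with nn.PixelShuffle; the channel numbers and the scale factor r = 2 are illustrative assumptions rather than values given above.

```python
import torch
import torch.nn as nn

class SCUpsample(nn.Module):
    """Sketch of one sub-pixel convolution (SC) up-sampling step.

    A 1x1 convolution expands the channels by r^2, and nn.PixelShuffle
    rearranges them into an r-times larger feature map, as in Figure 3.
    """

    def __init__(self, in_channels, out_channels, r=2):
        super().__init__()
        self.expand = nn.Conv2d(in_channels, out_channels * r * r, kernel_size=1)
        self.shuffle = nn.PixelShuffle(r)  # (B, C*r^2, H, W) -> (B, C, rH, rW)

    def forward(self, x):
        return self.shuffle(self.expand(x))

if __name__ == "__main__":
    cs4 = torch.randn(1, 1024, 13, 13)
    up = SCUpsample(1024, 512)   # one supix step: 13x13 -> 26x26
    print(up(cs4).shape)         # torch.Size([1, 512, 26, 26])
```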

3. Experiments and Results Analysis

3.1. Experimental Data

In this paper, we used the NWPU VHR-10 and DOTA datasets for training and validating the model. The NWPU VHR-10 aerial remote-sensing image target detection dataset was released in 2015 and consists of 800 images covering 10 types of targets: 650 images with target information and 150 images without target information. The DOTA dataset is a remote-sensing dataset jointly released by Wuhan University and Huazhong University of Science and Technology and contains a total of 2806 large images. To improve training efficiency and avoid the compression applied at the image input, we cropped the images to a 500 × 500 resolution via a random window. The relative area statistics of each target in the above datasets are shown in Figure 4 and Figure 5.
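A minimal sketch of the random-window cropping step is shown below. The text only states that a 500 × 500 random window was used, so the function name and the PIL-based implementation are assumptions; in practice, the box annotations would also need to be shifted into the chosen window.

```python
import random
from PIL import Image

def random_window_crop(path, size=500):
    """Crop one random size x size window from a large remote-sensing image."""
    img = Image.open(path)
    w, h = img.size
    left = random.randint(0, max(w - size, 0))
    top = random.randint(0, max(h - size, 0))
    return img.crop((left, top, left + size, top + size))

# Example: window = random_window_crop("dota_sample.png")
```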

3.2. Evaluation Metrics and Experimental Analysis

3.2.1. Evaluation Metrics

The average precision (AP) and the mean average precision (mAP) were used to evaluate the detection performance. The P–R curve equation is defined in Equation (9) as follows:
$$
\mathrm{Precision} = \frac{N_{TP}}{N_{TP} + N_{FP}}, \qquad
\mathrm{Recall} = \frac{N_{TP}}{N_{TP} + N_{FN}}
\tag{9}
$$
where Precision is the percentage of true positives among the detected samples, Recall is the percentage of correctly identified targets among all targets, AP is the area under the P–R curve, and mAP is the average of the AP values over the categories in the dataset. It is calculated as shown in Equation (10).
$$
mAP = \frac{1}{N} \sum_{i=1}^{N} \int_{0}^{1} P_i(R)\,dR
\tag{10}
$$
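For concreteness, the sketch below computes AP as the area under a sampled P–R curve and mAP as the mean over classes, following Equations (9) and (10); the toy precision/recall values are made up for the example.

```python
import numpy as np

def average_precision(precision, recall):
    """Area under the P-R curve, integrated over recall (trapezoidal rule)."""
    order = np.argsort(recall)
    return float(np.trapz(np.asarray(precision)[order], np.asarray(recall)[order]))

def mean_average_precision(ap_per_class):
    """mAP as the mean of the per-class AP values (Equation (10))."""
    return float(np.mean(ap_per_class))

if __name__ == "__main__":
    recall = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
    precision = np.array([1.0, 0.95, 0.9, 0.8, 0.6, 0.4])
    ap = average_precision(precision, recall)
    print(ap, mean_average_precision([ap]))  # ~0.79 for this toy curve
```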

3.2.2. Experimental Analysis

For the experiments in this paper, we used a Windows 10 Pro 64-bit operating system with an AMD R7 3700 processor and an NVIDIA GeForce RTX 3070 GPU, together with the CUDA 11.01 software environment, the PyTorch 1.7 deep-learning framework, and the Python 3.8 programming language. Due to the large variety of remote-sensing satellites, non-uniform resolutions, and large differences in sensor accuracy, a network trained on a single dataset generalizes poorly. Therefore, pre-training weights were generated by loading the NWPU VHR-10 and DOTA datasets for model training and ablation experiments.
(1) Experimental analysis of the NWPU VHR-10 dataset
The backbone feature network of U-YOLO was replaced with Resnet50, Resnet101, and Efficient-B2 [21] in turn for the experiments. The experimental dataset was NWPU VHR-10, and the experimental results were evaluated with two metrics, mAP and Frames Per Second (FPS). The results are shown in Figure 6.
It can be seen from Figure 6 that the detection speed is the fastest when replacing the backbone network of U-YOLO with Resnet50, but its mAP is the lowest. Resnet101 and Efficient-B2 achieved better results in mAP, but their deeper convolutional layers resulted in slower detection. The results show that U-YOLO combined with Darknet53 has the best detection performance in the NWPU VHR-10 dataset. The accuracy of each module’s contribution to the detection performance of the entire network is explored in conjunction with the Darknet53 backbone network and validated in the NWPU VHR-10 dataset. The results are shown in Table 1, where method B is the original YOLOV3 detection method, while methods C, D, and E are the detection results after gradually adding the AF, CSCA, and SC modules.
As seen from Table 1, in the ablation experiment with the NWPU VHR-10 dataset, the mAP of U-YOLO improved by 6.89% compared with YOLOV3. In YOLOV3, the AF module fuses the structural information contained in the underlying feature map with the high-level semantic information to help target localization. Its accuracy contribution was 1.96%. The CSCA module obtains the target saliency information by constructing channel fusion weights. It enhances the representation of multi-scale feature information of the target with an accuracy contribution of 3.48%. The SC module uses a multi-channel feature reorganization method to amplify the feature map with an accuracy contribution of 1.45%. The experiments show that the CSCA module contributes the most to the accuracy of the whole network.
In order to reflect more intuitively the differences in network learning ability under different module combinations, the images were processed using the Class Activation Map (CAM) method. Representative images of various types were selected from the NWPU VHR-10 dataset for visualization and analysis of the targets. Figure 7 shows the CAM visualization results for some of the targets.
The yellow arrow indicates that the network has a low response value to the target feature, while the red arrow indicates the incorrect response of the network to the target. The red region in the heat map indicates that the neural network has a high response value in that region. The higher the response value of the network to the target, the easier it is to be detected.
It can be seen from Figure 7 that, with the original prior frame setting in method B, the network has a low response to the bridge (B) class and an incorrect response to the harbor (H) class. In method C, the AF module is combined with the corresponding prior frame design, and the boundary and central structure information is better preserved for some small target classes. Method D combines the AF and CSCA modules; there are more red regions in the harbor (H) class than with method C, as the added attention enhances its feature expression. In method E, the structural information of the target is enhanced while the feature map is enlarged by restructuring the salient features of the target; a better prior frame is then used for regression, yielding better results.
(2) Experimental analysis of the DOTA dataset
The DOTA dataset was used for experimental analysis due to its large number of small targets. Based on the use of Darknet53 as the backbone network, the impact of each module on the whole network is explored. The experimental results are shown in Table 2, where method B is the original YOLOV3 detection method, and methods C, D, and E are the detection results after incrementally adding the AF, CSCA, and SC modules.
It can be seen from Table 2 that U-YOLO improves the mAP by 11.38% compared to YOLOV3. The comparison among methods C, D, and E shows that the AF module improves the mAP of the network by 3.15%. The CSCA module improves the mAP by 4.94%, contributing the most to the mAP of the network and showing a powerful ability to obtain salient features in small-target datasets. By reorganizing the salient features, the SC module contributes 3.24% to the accuracy of the whole network, which also achieves good results. CAM visualization was then applied to the corresponding module combinations to show their impact on the model. Representative targets were selected from the DOTA dataset for visualization and analysis. The visualization results are shown in Figure 8.
It can be seen from Figure 8 that method E can guarantee better target structure information, and the network’s response value to small targets is higher than for other methods. Method B uses the original prior frame design in the YOLOV3 detection framework, which makes the network have lower response values for small targets such as large vehicles (LV) and small vehicles (SV). The AF module is added in method C, which improves the response value of the network for small targets; however, the network has more false responses to the targets. The CSCA module is added via method D, which obtains the salient features of the target by fusing the attention weights while reducing the proportion of non-sample features and weakening the background features, thereby reducing the target error response. Method E extends the feature map using feature reorganization to make the network better adaptable to the edges and some features of the target.

4. Case Analysis of Road Network Vehicle Extraction

4.1. V547 Dataset

In this paper, we used the V547 dataset for model evaluation and validation. This dataset is derived from Google Earth, WorldView-3, and Gaofen-1 remote-sensing images. It contains 547 valid images with 5000 vehicle samples. We obtained 6350 valid samples using data augmentation methods such as horizontal flipping and random noise and divided them into a training set and a validation set at a ratio of 8:2. The open-source annotation software labelme 3.16.7 was used to annotate the images and store the labels in VOC format. Different road sections, road conditions, and occlusion conditions were considered when annotating the data. Part of the dataset is shown in Figure 9.
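A toy version of the augmentation described above (horizontal flipping plus additive random noise) is sketched below; the noise scale is an arbitrary illustrative choice, and in a real pipeline the bounding-box annotations of flipped images would also have to be mirrored.

```python
import numpy as np
from PIL import Image, ImageOps

def augment(image):
    """Return a horizontally flipped copy and a noisy copy of a PIL image."""
    flipped = ImageOps.mirror(image)                          # horizontal flip
    arr = np.asarray(image, dtype=np.float32)
    noisy_arr = arr + np.random.normal(0.0, 8.0, arr.shape)   # mild Gaussian noise
    noisy = Image.fromarray(np.clip(noisy_arr, 0, 255).astype(np.uint8))
    return flipped, noisy
```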

4.2. Experimental Analysis of V547 Data Set

Transfer learning was used to load the WDOTA pre-training weights into the U-YOLO network and freeze the backbone. The V547 dataset was used to train the feature fusion part of U-YOLO, with the training set, test set, and validation set split at a ratio of 3:1:1. Once the feature fusion part had converged, the WDOTA backbone weights were unfrozen, and the backbone network and the feature fusion part were trained jointly. The network convergence behavior is shown in Figure 10.
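The two-stage schedule above can be sketched as follows in PyTorch; the attribute name model.backbone, the weight file name, and the surrounding training loop are assumptions about the implementation, not details given in the paper.

```python
import torch

def set_backbone_trainable(model, trainable):
    """Freeze or unfreeze the backbone submodule for two-stage fine-tuning."""
    for p in model.backbone.parameters():  # assumes the model exposes `backbone`
        p.requires_grad = trainable

# Illustrative schedule (model, data loaders, and the training loop are assumed):
#   model.load_state_dict(torch.load("w_dota.pth"), strict=False)  # WDOTA weights
#   set_backbone_trainable(model, False)   # stage 1: train the feature fusion part
#   ...train until the fusion part converges...
#   set_backbone_trainable(model, True)    # stage 2: joint training of the whole net
```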
Afterward, a feature visualization comparison across multiple networks was established: the matrix information of a single feature map in each network was extracted and rendered with a visualization tool. The RFBNet [22], M2det [23], and SSD (300) [24] networks, which are relatively effective at detecting small targets, were chosen as comparison networks. The feature layer visualization results were uniformly scaled to the same size, and the feature visualization results of the different networks are shown in Figure 11.
It can be seen from Figure 11 that, compared to SSD (300) and M2det, U-YOLO produces better feature layer visualization results, with more obvious feature information and higher target distinguishability. Each target detection network was trained on the vehicle detection dataset, and 20% of the dataset was used for testing. P–R (Precision–Recall) curves are used to compare the networks on the test set. The results are shown in Figure 12.
It can be seen from Figure 12 that the detection accuracy of U-YOLO is higher than other comparison networks when the recall rate remains the same. The AP evaluation index was used to quantitatively analyze the results, as shown in Table 3.
In terms of detection accuracy, the U-YOLO accuracy is significantly higher than the other networks, and the AP values were improved by 6.84%, 6.38%, and 12.41% compared to RFBNet, M2det, and SSD300, respectively. In order to visually verify the accuracy of the algorithm proposed in this paper, the improved U-YOLO network was used to conduct experiments on vehicle extraction on real roads. The results are shown in Figure 13.
As seen in Figure 13, U-YOLO detection outperformed the comparison network on remote-sensing images with 0.3 m resolution, obtaining 94.5% accuracy and only a 6.8% false detection rate. Compared to RFBNet, M2det, and SSD (300), the accuracy rate was improved by 8.2%, 11%, and 15%, and the false detection rate was reduced by 2.7%, 4.2%, and 6.8%, respectively. It is clear from the experimental results that the U-YOLO method was better than the comparison networks at improving vehicle detection accuracy by amplifying the scale of the feature layer.

5. Conclusions

In this paper, an improved U-YOLO vehicle target detection network was proposed to address the problem of low vehicle detection accuracy due to the limited vehicle feature information and limited number of pixels in high-definition remote-sensing images. By introducing the AF module, the underlying structural information is connected and passed on to the higher-level information, reducing the loss of structural information of the target. By designing the CSCA module, fusion weights are generated to obtain the multi-scale saliency information of the target. By introducing the SC module, the pixel points at the same position in the feature map are reorganized in a fixed arrangement to enlarge the feature layer. First, we compared the Resnet50, Resnet101, Efficient-B2, and Darknet53 backbone networks, and experiments on two public datasets, NWPU VHR-10 and DOTA, verified the generality of the model. For the NWPU VHR-10 dataset, Resnet50 obtained 27.43 FPS and 85.62 mAP, Resnet101 obtained 20.45 FPS and 89.24 mAP, Efficient-B2 obtained 21.26 FPS and 88.43 mAP, and Darknet53 obtained 26.16 FPS and 91.35 mAP. These experiments show that the improvement strategy proposed in this paper can effectively improve the accuracy of small target detection in remote-sensing images while maintaining a good detection speed, which can provide effective help for vehicle detection in remote-sensing images.
In addition, the effectiveness of vehicle detection was validated on the self-built V547 dataset. The model was compared with the RFBNet, M2det, and SSD (300) networks, which are relatively effective at detecting small and medium-sized targets in remote-sensing images, and the comparison was also validated on remote-sensing images with 0.3 m resolution. The experimental results showed that the algorithm in this paper improves the AP value by 6.84%, 6.38%, and 12.41% compared with RFBNet, M2det, and SSD300, respectively. It can better detect vehicle targets in remote-sensing images, and its detection results can provide data support for the intelligent regulation of vehicle numbers, traffic flow detection, and the construction of intelligent traffic systems.
Although the improved U-YOLO remote-sensing vehicle target detection network designed in this paper achieved good results, the problem of distinguishing dark vehicles from the background is aggravated by shadow occlusion, which results in a high rate of missed detections of dark vehicles in shadowed areas. In our next study, we will explore approaches to enhance the sparse features of dark vehicles in shadows and extract global information on dark vehicles to alleviate the degradation of dark vehicle detection accuracy caused by shadow occlusion.

Author Contributions

Conceptualization, D.G. and X.L.; methodology, D.G. and Y.W.; validation, X.L. and Y.W.; investigation, D.G. and S.Z.; writing—original draft preparation, D.G.; writing—review and editing, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Xinjiang Autonomous Region key research and development project (2022B01015).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All results and data obtained can be found in open-access publications.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Silva, L.F.O.; Oliveira, M.L.S. Remote Sensing Studies Applied to the Use of Satellite Images in Global Scale. Sustainability 2023, 15, 3459. [Google Scholar] [CrossRef]
  2. Liu, H.; Ding, Q.; Hu, Z.; Chen, X. Remote Sensing Image Vehicle Detection Based on Pre-Training and Random-Initialized Fusion Network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  3. Fang, X.; Hu, F.; Yang, M.; Zhu, T.; Bi, R.; Zhang, Z.; Gao, Z. Small object detection in remote sensing images based on super-resolution. Pattern Recognit. Lett. 2022, 153, 107–112. [Google Scholar]
  4. Khan, M.A.; Nasralla, M.M.; Umar, M.M.; Ghani-Ur-Rehman; Khan, S.; Choudhury, N. An Efficient Multilevel Probabilistic Model for Abnormal Traffic Detection in Wireless Sensor Networks. Sensors 2022, 22, 410. [Google Scholar] [CrossRef] [PubMed]
  5. Rehman, G.U.; Zubair, M.; Qasim, I.; Badshah, A.; Mahmood, Z.; Aslam, M.; Jilani, S.F. EMS: Efficient Monitoring System to Detect Non-Cooperative Nodes in IoT-Based Vehicular Delay Tolerant Networks (VDTNs). Sensors 2023, 23, 99. [Google Scholar] [CrossRef] [PubMed]
  6. Li, Y.; Wu, Z.; Li, L.; Yang, D.; Pang, H. Improved YOLOv3 model for vehicle detection in high-resolution remote sensing images. J. Appl. Remote Sens. 2021, 15, 026505. [Google Scholar] [CrossRef]
  7. Li, X.; Guo, K.; Subei, M.; Guo, D. High-resolution remote sensing vehicle automatic detection based on feature fusion convolutional neural network. In Proceedings of the International Conference on Computer Vision, Application, and Design (CVAD 2021), Sanya, China, 19–21 November 2021; SPIE: Bellingham, WA, USA, 2021; Volume 12155, pp. 141–146. [Google Scholar]
  8. Liu, Z.; Gu, X.; Chen, J.; Wang, D.; Chen, Y.; Wang, L. Automatic recognition of pavement cracks from combined GPR B-scan and C-scan images using multiscale feature fusion deep neural networks. Autom. Constr. 2023, 146, 104698. [Google Scholar] [CrossRef]
  9. Zou, H.; Zhan, H.; Zhang, L. Neural Network Based on Multi-Scale Saliency Fusion for Traffic Signs Detection. Sustainability 2022, 14, 16491. [Google Scholar] [CrossRef]
  10. Hua, Z.; Yu, H.; Jing, P.; Song, C.; Xie, S. A Light-Weight Neural Network Using Multiscale Hybrid Attention for Building Change Detection. Sustainability 2023, 15, 3343. [Google Scholar] [CrossRef]
  11. Yadav, D.P.; Kishore, K.; Gaur, A.; Kumar, A.; Singh, K.U.; Singh, T.; Swarup, C. A Novel Multi-Scale Feature Fusion-Based 3SCNet for Building Crack Detection. Sustainability 2022, 14, 16179. [Google Scholar] [CrossRef]
  12. Ye, X.; Xiong, F.; Lu, J.; Zhou, J.; Qian, Y. F3-Net: Feature Fusion and Filtration Network for Object Detection in Optical Remote Sensing Images. Remote Sens. 2020, 12, 4027. [Google Scholar] [CrossRef]
  13. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  14. Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar]
  15. Jiao, L. Remote Sensing Image Change Detection Based on Deep Multi-Scale Multi-Attention Siamese Transformer Network. Remote Sens. 2023, 15, 842. [Google Scholar] [CrossRef]
  16. Wu, J.; Xu, W.; He, J.; Lan, M. YOLO for Penguin Detection and Counting Based on Remote Sensing Images. Remote Sens. 2023, 15, 2598. [Google Scholar] [CrossRef]
  17. Chen, L.; Shi, W.; Deng, D. Improved YOLOv3 based on attention mechanism for fast and accurate ship detection in optical remote sensing images. Remote Sens. 2021, 13, 660. [Google Scholar] [CrossRef]
  18. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  19. Guo, W.; Li, W.; Li, Z.; Gong, W.; Cui, J.; Wang, X. A slimmer network with polymorphic and group attention modules for more efficient object detection in aerial images. Remote Sens. 2020, 12, 3750. [Google Scholar] [CrossRef]
  20. Qu, Z.; Zhu, F.; Qi, C. Remote Sensing Image Target Detection: Improvement of the YOLOv3 Model with Auxiliary Networks. Remote Sens. 2021, 13, 3908. [Google Scholar] [CrossRef]
  21. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019. [Google Scholar]
  22. Liu, S.; Huang, D.; Wang, Y. Receptive Field Block Net for Accurate and Fast Object Detection. In Computer Vision—ECCV 2018; Lecture Notes in Computer Science; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; Volume 11215. [Google Scholar]
  23. Zhao, Q.; Sheng, T.; Wang, Y.; Tang, Z.; Chen, Y.; Cai, L.; Ling, H. M2Det: A Single-Shot Object Detector Based on Multi-Level Feature Pyramid Network. Proc. AAAI Conf. Artif. Intell. 2019, 33, 9259–9266. [Google Scholar] [CrossRef] [Green Version]
  24. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision—ECCV 2016; Lecture Notes in Computer Science; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: New York, NY, USA, 2016; Volume 9905. [Google Scholar]
Figure 1. Structure diagram of the improved U-YOLO network.
Figure 2. CSCA module structure diagram.
Figure 3. Schematic diagram of sub-pixel convolutional channel reorganization.
Figure 4. Statistical analysis of various targets for the NWPU VHR10 dataset. Categories: baseball field (BD), storage tank (ST), ship (S), bridge (B), port (H), basketball court (BC), aircraft (A), athletic field (GTF), tennis court (TC), and vehicle (V).
Figure 5. Statistical analysis of various targets in the DOTA dataset. Categories: Basketball court (BC), tennis court (TC), small vehicle (SV), water tank (ST), swimming pool (SP), ship (SH), soccer field (SBF), roundabout (RA), aircraft (PI), large vehicle (LV), helicopter (HC), harbor (HA), athletic field (GTF), bridge (BR), and baseball field (BD).
Figure 6. Accuracy and speed comparison with different backbone networks.
Figure 7. Results of target CAM visualization in NWPU VHR-10 dataset.
Figure 8. Results of target CAM visualization in the DOTA dataset.
Figure 9. V-547 remote-sensing image vehicle detection dataset.
Figure 10. Network convergence effect of transfer learning.
Figure 11. Comparison chart of feature layer visualization results.
Figure 12. P–R curve comparison.
Figure 13. Example of vehicle detection results in remote-sensing images. (Note: Accuracy rate = accurate number/box number; False detection rate = (box number − accurate number)/vehicle number).
Table 1. Accuracy of the combined modules in the NWPU VHR-10 dataset.

Methods | AF | CSCA | SC | mAP (%)
B       |    |      |    | 84.63
C       | +  |      |    | 86.42
D       | +  | +    |    | 89.90
E       | +  | +    | +  | 91.35
Table 2. Accuracy of the combined modules in the DOTA dataset.

Methods | AF | CSCA | SC | mAP (%)
B       |    |      |    | 60.00
C       | +  |      |    | 63.15
D       | +  | +    |    | 68.09
E       | +  | +    | +  | 71.38
Table 3. Comparison of AP evaluation metrics.

Method  | Backbone | Pre-Train | AP
U-YOLO  | Vgg16    | Yes       | 95.85%
RFBNet  | Vgg16    | Yes       | 89.01%
M2det   | Vgg16    | Yes       | 89.47%
SSD300  | Vgg16    | Yes       | 83.44%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
