Bridge detection method for HSRRIs based on YOLOv5 with a decoupled head

ABSTRACT The varied imaging conditions of high spatial resolution remote sensing images (HSRRIs) tend to cause large differences in the background information of bridges in the images, leading to problems such as difficult detection of multiscale bridges, missed detection of small bridges, and insufficient detection accuracy. To address these problems, a YOLOv5 network with a decoupled head for the automatic detection of bridges in HSRRIs is proposed in this paper. First, the problem of inconsistent scales during the information fusion of features in the feature pyramid network is solved using a weighted bi-directional feature pyramid network (BiFPN). Then, the convolutional block attention module (CBAM) is fused into the three effective feature layers after feature pyramid network processing, so that bridge feature information is effectively extracted from the channel and spatial dimensions. Next, a decoupled head is fused into the YOLO Head to separate the classifier and regressor, which speeds up network convergence and simultaneously improves detection accuracy. Finally, the practical effect is evaluated by calculating the average precision (AP). According to the experimental results, the AP of the proposed method is 98.1%, an improvement of 4.1%∼23.5% over other models.


Introduction
As important transportation facilities, bridges have high status in both military and civilian areas. As key hubs between water and land routes, bridges have many military forces deployed around them and are key military strike objects during wartime. Therefore, it is of strategic and tactical importance to develop research on the automatic detection of bridges using HSRRIs. In civil areas, bridges stimulate the economy. First, during construction, a bridge provides a first-round boost to GDP. Second, after completion, a bridge acts as a water and land transportation hub and provides a second-round boost to transportation. Finally, a completed bridge exerts a long-term third-round effect on the formation of city belts, industry belts, market belts, tourism belts, and regional economic zones. Therefore, there is important socioeconomic value in developing automatic bridge detection algorithms. Traditional bridge detection methods are typically based on spectral features, geometric attributes, or medium- to high-level semantic features. For example, Chaudhuri et al. used domain operators and geometric features of bridges to detect bridges over water in multispectral images (Chaudhuri and Samal 2008). Hao et al. used the Hough transform to achieve bridge detection in complex backgrounds (Hao et al. 2007). Wang et al. proposed an SAR image object detection method based on fuzzy support vector machines, which can effectively identify bridges over water (Wang, Yin, and Sun 2014). Most traditional methods extract color, texture, and shape using the spectral features of water and bridges and then apply classifiers such as logistic regression, decision trees, and support vector machines. However, problems such as high time complexity, window redundancy, and slow classification persist, making it difficult to achieve the expected detection performance for small bridges and complex backgrounds.
With the development of deep learning, object detection methods based on deep learning have been widely used in various fields, such as pathology detection (Krithiga and Geetha 2021; Salvi et al. 2021; Zhou et al. 2020), face detection (Ming et al. 2022; Singh, Rathore, and Kumar 2022; Wang et al. 2022), crop pest and disease detection (Butera et al. 2022; Du et al. 2022), road traffic (Huang et al. 2022; Qiu, Huang, and Tang 2022), and text detection (Akallouch et al. 2022; Raj et al. 2022; Tian et al. 2018). These methods have wide adaptability, a broad range of applications, and good detection performance. Many scholars have also researched bridge detection. For example, Sun et al. proposed a YOLOv4 network combined with random erasure for automatic bridge detection in HSRRIs (Sun et al. 2022b). Wang et al. used traditional unsupervised methods to extract a priori information and designed auxiliary modules to receive and incorporate that information, implementing a bridge detection algorithm that combines traditional methods and deep learning (Wang et al. 2021). Chen et al. fused various kinds of feature information extracted by balanced and attention mechanisms to achieve automatic detection of bridges in SAR images (Chen et al. 2020).
The bridge detection methods proposed by these scholars provide good references for this study. However, some important problems in bridge detection remain. Traditional bridge detection methods cannot accurately detect bridges in HSRRIs with complex backgrounds under varying illumination and cloud cover conditions. Additionally, they suffer from high time complexity, window redundancy, and slow detection speed. For deep learning-based methods, automatic bridge detection still faces difficulties with multiscale bridges, and small bridges are easily missed and detected with low accuracy. Therefore, to solve these problems, this study uses a YOLOv5 network (Glenn, Alex, and Jirka 2020) with a decoupled head to achieve automatic recognition and detection of bridges in HSRRIs. The network improves the overall bridge detection accuracy while reducing the missed detection of small bridges in HSRRIs and obtains a better detection effect for multiscale bridges.
The primary contributions of this paper are as follows: (1) A novel feature pyramid structure is constructed. A weighted BiFPN is used in feature pyramid upsampling and downsampling to reconcile the inconsistent information of bridge features at different scales. This algorithm fuses multiscale bridge features and enhances the feature representation capability. The CBAM is fused into the effective feature layers after the feature pyramid network to extract bridge feature information from the channel and spatial dimensions, achieving a more effective focus on multiscale bridge features and further enhancing the extraction of bridge feature information. (2) A novel classification and regression detection model is developed. In place of the original network, in which the classifier and regressor are combined, classification and regression are performed separately by the fused decoupled head, achieving faster network convergence while improving detection accuracy. Comparison and ablation experiments show that the proposed YOLOv5 network with a decoupled head performs better in multiscale bridge detection in HSRRIs than existing methods and can achieve high-precision, accurate detection.
The remainder of this paper is organized as follows: Section 2 describes related work. Section 3 describes the research methods used in this study in detail. Then, experimental results are reported in Section 4. In Section 5, we provide a discussion, and we draw conclusions in Section 6.

Object detection in HSRRIs
As remote sensing satellites, UAVs, aerial cameras, and other equipment develop, the acquired remote sensing images are used in a variety of fields. Examples include agriculture (Blickensdorfer et al. 2022; Gururaj, Umesh, and Shetty 2022; Kumar et al. 2022), forestry (Chadwick et al. 2022; Wang et al. 2022b), environmental monitoring (Sharma and Arya 2022; Tebaldini et al. 2022), and disaster assessment (Alsamhi et al. 2022; Li et al. 2022; Qin et al. 2022). HSRRIs have become an important data source for object detection due to their abundant data resources, high resolution, and diverse data types. With the development of deep learning, the interpretation of HSRRIs has been achieved with deep learning frameworks. For example, Dong et al. designed an HSRRI object detection method based on a CNN to solve problems with the scale range of the region of interest and the object feature representation (Dong et al. 2020). Zhu et al. proposed a shadow detection method based on a contextual detail-aware network to address the issues of shadow detection in HSRRIs (Zhu et al. 2022a).
Other scholars have achieved object detection in HSRRIs through distillation of deep learning frameworks, enhancing the detection capability of lightweight models. For example, Yang proposed an adaptive reinforcement supervision distillation (ARSD) model to improve the detection capability of lightweight models for HSRRI object detection, addressing the large amount of noise generated by knowledge distillation methods, which hampers model training (Yang et al. 2022). Zheng et al. addressed the high false detection and missed detection rates of urban plantation trees by implementing single-tree detection and localization based on the YOLOv4-Lite object detection framework with HSRRIs (Zheng and Wu 2022).
When applying object detection to HSRRIs, some scholars have proposed fine-grained object detection methods to address the difficulty of fine-grained object detection. For example, Sun et al. performed fine-grained object detection in HSRRIs with an improved cascaded hierarchical object detection network (Sun et al. 2022a). Zhou et al. addressed the problem of small differences between classes and high similarity in fine-grained object detection in HSRRIs; an attention-based group feature enhancement and a learning strategy emphasizing subsignificant features were proposed to improve the two-stage classifier in object detection (Zhou et al. 2022).

Feature extraction in remote sensing images
Feature extraction is a critical step in object detection. The popular feature extraction methods used for remote sensing images fall primarily into two types. The first type is feature extraction with traditional methods, which primarily use the spectral features, spatial features, texture features, and other information in remote sensing images. For example, Wang et al. exploited the scattering between bridges and backgrounds in SAR images; automatic detection of bridges is achieved through three processes: river segmentation, potential bridge area detection, and bridge discrimination (Wang et al. 2009). The other type is feature extraction based on CNNs, which perform feature extraction through the network depth, the number and size of the convolution kernels, and the network structure. For example, Zhu et al. proposed a graph focusing aggregation network to represent the structural features of remote sensing objects, addressing the difficulty traditional CNNs have in extracting rotation and scale factors (Zhu et al. 2022b). Jiang et al. improved the detection of multiscale objects by combining a two-stage neural network with a staggering localization strategy (Jiang et al. 2020).
The above paragraphs describe object detection in HSRRIs and methods for feature extraction in remote sensing images. The advantages of HSRRIs have made them usable in various fields. Due to their poor feature extraction ability, slow extraction speed, and low accuracy, traditional methods cannot meet the demand for high-precision, accurate detection in remote sensing images. Owing to their excellent detection performance, CNNs can achieve high-precision, accurate detection in remote sensing images. Therefore, this paper proposes a YOLOv5 network with a decoupled head, trained on bridge data from HSRRIs, to solve the problems of difficult detection of multiscale bridges, missed detection of small bridges, and low detection accuracy. The proposed method achieves automatic detection of bridges, which is important for the study of high-precision, accurate detection of bridge objects.

Research method
The proposed method primarily includes the steps of feature extraction, feature fusion, classification prediction, and regression prediction. First, to address the problem of multiscale feature fusion with inconsistent feature information at each scale, the representation of features is enhanced using a weighted BiFPN (Tan, Pang, and Le 2020) with residual connections. BiFPN also adds a weight for each fused scale feature, which adjusts the contribution of each scale and improves the feature pyramid network's extraction of bridge feature information. Then, the CBAM attention mechanism (Woo et al. 2018) is fused into the effective features after feature pyramid processing, and more attention is given to the feature information in the channel and spatial dimensions of the effective feature layers to enhance the extraction of multiscale bridge feature information. Next, to improve bridge detection accuracy, the extracted feature layers are passed through the decoupled head network (Ge et al. 2021) to perform the classification and regression operations. Finally, the detection results output by the decoupled head are evaluated for accuracy. Figure 1 shows the flow chart of the proposed method.

BiFPN
The pyramid structure in the original YOLOv5 network adopts the FPN + PAN structure (Jin and Zheng 2020; Lin et al. 2017). FPN + PAN primarily draws on PANet (Jin and Zheng 2020); this structure achieves multiscale feature fusion through the three effective feature layers obtained from the backbone feature extraction network, with enhanced feature extraction by upsampling and downsampling. However, this method's feature fusion cannot exploit the feature information between different scales, which reduces the detection accuracy of the network, and the importance of the feature information differs across scales. Therefore, this paper uses a weighted BiFPN in the feature pyramid to integrate the effective feature layers from the backbone network with efficient bi-directional cross-scale connectivity and weighted feature fusion, which enhances the capability to express feature information. The bridge feature information between different scales can thus be fully used. The BiFPN structure is shown in Figure 2. BiFPN is implemented as follows (a minimal sketch of the weighted fusion is given after the list): (1) Delete nodes with only one input edge. When the feature pyramid network fuses different features, these nodes have only one input edge and perform no feature fusion. Their contribution to the network is negligible; thus, removing them simplifies the bi-directional network.
(2) If the input and output nodes are in the same network layer, an additional edge is added to fuse more features.
(3) Unlike the PANet structure, BiFPN can process the feature maps along each bi-directional path, enabling a higher level of feature fusion.
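As an illustration, the following is a minimal PyTorch sketch of the fast normalized (weighted) fusion that BiFPN applies at each fusion node, assuming the input feature maps have already been resized to a common shape. The module name, channel counts, and tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion in the style of BiFPN (Tan, Pang, and Le 2020).

    Each input feature map receives a learnable non-negative weight that
    controls its contribution to the fused output. ReLU keeps the weights
    non-negative, and they are normalized to sum to approximately one.
    """

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        # inputs: list of tensors with identical shape (already resized).
        w = torch.relu(self.weights)
        w = w / (w.sum() + self.eps)  # fast normalized fusion
        return sum(w[i] * x for i, x in enumerate(inputs))

# Example: fuse a pyramid level with an upsampled deeper level.
fusion = WeightedFusion(num_inputs=2)
p4 = torch.randn(1, 256, 40, 40)
p5_up = torch.randn(1, 256, 40, 40)   # deeper level after upsampling
fused = fusion([p4, p5_up])           # shape (1, 256, 40, 40)
```

The learned weights play the role described above: they let the network adjust how much each scale contributes to the fused bridge features.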

CBAM
The attention mechanism is a common method and technique used in CNNs. Its core is to make the network focus on the required local object and hence obtain more contextual semantic information. Typically, attention mechanisms are channel attention mechanisms, spatial attention mechanisms, or a combination of the two. The channel attention mechanism primarily considers the importance of each feature channel in the CNN and is established to enhance or inhibit the relationships between channels for different object tasks. The spatial attention mechanism pays more attention to task-relevant areas and ignores irrelevant areas in the image to address the critical regions in the network. The third type combines the two attention mechanisms above.
The CBAM used in this study can effectively extract bridge feature information from the channel and spatial dimensions. CBAM can be divided into a channel attention module and a spatial attention module. The channel attention module focuses on bridge features, while the spatial attention module focuses on the location information of bridge features. Their combination allows the network to focus more on the feature information of the bridge, improving detection accuracy. CBAM includes the following steps: (1) The input feature layer is first processed by the channel attention mechanism. Global average pooling and global max pooling are performed on the feature layer, and a shared fully connected layer processes the pooled results. Its formula is shown in (1):

$$M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big) \quad (1)$$
where $M_c(F)$ is the result processed by the channel attention mechanism; $\sigma(\cdot)$ denotes the sigmoid applied to the fully connected layer's pooling results; MLP represents a shared multilayer perceptron with a hidden layer; $W_1$ and $W_0$ are the output layer weights and hidden layer weights of the MLP; and $F^c_{avg}$ and $F^c_{max}$ represent the global average-pooled and global max-pooled features processed by the channel attention mechanism.
(2) Some image location information is lost after channel attention processing. Therefore, it is necessary to process the feature layer with a spatial attention mechanism. The core idea of spatial attention processing is to take the maximum and average over the channels of each feature point in the input feature layer and use a convolution operation to obtain the spatial attention map. Its formula is shown in (2):

$$M_s(F) = \sigma\big(f^{7\times 7}([F^s_{avg}; F^s_{max}])\big) \quad (2)$$

where $M_s(F)$ is the result processed by the spatial attention mechanism; $f^{7\times 7}$ represents a 7 × 7 convolution; and $F^s_{avg}$ and $F^s_{max}$ represent the global average-pooled and max-pooled features processed by the spatial attention mechanism, respectively.
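To make the two steps concrete, here is a minimal PyTorch sketch of CBAM following Equations (1) and (2). The reduction ratio of 16 and the overall structure follow the CBAM paper (Woo et al. 2018) rather than this paper's exact configuration and are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel attention, matching Equation (1): a shared MLP (weights W0, W1)
    is applied to the global average-pooled and max-pooled features."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),  # W0
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),  # W1
        )

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """Spatial attention, matching Equation (2): channel-wise mean and max maps
    are concatenated and passed through a 7 x 7 convolution."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Applies channel attention, then spatial attention, to a feature layer."""

    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)   # reweight channels
        return x * self.sa(x)  # reweight spatial locations
```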
In this study, CBAM is fused into the three effective output feature layers after feature pyramid fusion. The YOLOv5 network integrated with the CBAM attention mechanism enhances the saliency of bridges in complex backgrounds and can extract higher-level features. This process enriches the extracted high-level feature information and improves bridge detection accuracy in HSRRIs.

YOLOv5 network fused with a decoupled head
The YOLO Head is the classifier and regressor of YOLOv5. After backbone feature extraction, BiFPN-enhanced feature extraction, and CBAM attention processing, the three effective feature layers are input into the YOLO Head to obtain the classification and regression prediction results. Figure 3 shows a schematic diagram of the YOLOv5 network structure fused with the decoupled head. In the original YOLO Head, the classification and regression tasks are implemented together in one convolution. However, there is a conflict between these two tasks that affects detection accuracy. Therefore, the proposed method incorporates a decoupled head into the YOLOv5 head: a classification prediction branch and a regression prediction branch handle the classification and regression tasks separately, accelerating the convergence of the network and improving detection accuracy.
Inspired by the YOLOX algorithm (Ge et al. 2021), the decoupled head of the proposed algorithm treats the feature map as a collection of feature points. The decoupled head network performs classification and regression by judging the feature points separately and finally integrates the results for prediction. The decoupled head network is shown in Figure 4. It is implemented with operations such as convolution, normalization, and activation functions. The decoupled head first adjusts the number of channels with a 1 × 1 convolution. Then, two parallel 3 × 3 convolution branches separate the classification and regression tasks so that they are performed independently. Next, the classification, localization, and confidence detection tasks are implemented by 1 × 1 convolutions on the classification and regression branches. For each feature point, the decoupled head judges the confidence of the object class it contains and its regression coefficients; the regression coefficients are adjusted to obtain the coordinates of the prediction box, and the presence or absence of a corresponding object is determined. Finally, these three prediction results are stacked and integrated. Incorporating the decoupled head accelerates network convergence while improving detection accuracy.
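A minimal PyTorch sketch of such a decoupled head is given below, assuming a YOLOX-style layout (a 1 × 1 stem, two parallel 3 × 3 branches, and 1 × 1 predictors for class, box, and objectness). The channel width of 256, the SiLU activation, and the output stacking order are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

def conv_bn_act(in_ch: int, out_ch: int, k: int) -> nn.Sequential:
    """Convolution + batch normalization + SiLU activation."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.SiLU(inplace=True),
    )

class DecoupledHead(nn.Module):
    """Decoupled detection head: classification and regression are computed
    on separate branches and stacked at the end."""

    def __init__(self, in_channels: int, num_classes: int, width: int = 256):
        super().__init__()
        self.stem = conv_bn_act(in_channels, width, 1)  # channel adjustment
        self.cls_branch = nn.Sequential(conv_bn_act(width, width, 3),
                                        conv_bn_act(width, width, 3))
        self.reg_branch = nn.Sequential(conv_bn_act(width, width, 3),
                                        conv_bn_act(width, width, 3))
        self.cls_pred = nn.Conv2d(width, num_classes, 1)  # Cls.: class confidence
        self.reg_pred = nn.Conv2d(width, 4, 1)            # Reg.: box coefficients
        self.obj_pred = nn.Conv2d(width, 1, 1)            # Obj.: objectness

    def forward(self, x):
        x = self.stem(x)
        cls_feat = self.cls_branch(x)
        reg_feat = self.reg_branch(x)
        # Stack the three per-feature-point predictions along the channels.
        return torch.cat([self.reg_pred(reg_feat),
                          self.obj_pred(reg_feat),
                          self.cls_pred(cls_feat)], dim=1)

# Example: one head applied to one effective feature layer.
head = DecoupledHead(in_channels=512, num_classes=1)   # single "bridge" class
out = head(torch.randn(1, 512, 40, 40))                # shape (1, 6, 40, 40)
```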

EIoU_Loss
The original loss function of YOLOv5 uses the GIoU loss (Rezatofighi et al. 2019). The GIoU loss incorporates the minimum bounding rectangle of the prediction box and the ground truth box, which mitigates the problem that the distance between the two cannot be optimized when they do not intersect. However, when the prediction box is inside the ground truth box, the GIoU loss does not change, and the positional relationship between the prediction box and the ground truth box cannot be determined. Therefore, the EIoU loss function is used in this paper to solve this problem. Compared with the GIoU loss, the EIoU loss (Zhang et al. 2021) considers the distance between the center points of the prediction box and the ground truth box, solving the problem that GIoU cannot determine the positional relationship when one box contains the other. The EIoU loss improves the convergence speed and regression accuracy of the model by considering the overlap between the prediction box and the ground truth box and by directly computing penalty terms for the width and height.
The EIoU loss function comprises the overlap loss between the prediction box and the ground truth box, the center distance loss, and the width and height losses. It is defined as:

$$L_{EIoU} = L_{IoU} + L_{dis} + L_{asp} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{C_w^2} + \frac{\rho^2(h, h^{gt})}{C_h^2}$$

where IoU (Jiang et al. 2018) is the intersection-over-union ratio of the areas of the prediction box and the ground truth box; $\rho^2(b, b^{gt})$ is the squared Euclidean distance between the center points of the prediction box and the ground truth box; $\rho^2(w, w^{gt})$ and $\rho^2(h, h^{gt})$ are the squared differences between the widths and heights, respectively, of the two boxes; $c$ is the diagonal length of the minimum bounding rectangle of the prediction box and the ground truth box; and $C_w$ and $C_h$ represent the width and height of that closure rectangle.
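The following is a minimal PyTorch sketch of the EIoU loss written directly from the equation above. The corner-format box layout (x1, y1, x2, y2) and the epsilon value are assumptions for illustration.

```python
import torch

def eiou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """EIoU loss sketch (after Zhang et al. 2021) for (N, 4) boxes in
    (x1, y1, x2, y2) format: IoU loss + center-distance loss + width/height losses."""
    # Intersection area.
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)

    # Union and IoU.
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Minimum bounding (closure) rectangle: width C_w, height C_h, diagonal c.
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Squared distance between box centers: rho^2(b, b_gt).
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tcx, tcy = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    dist = (pcx - tcx) ** 2 + (pcy - tcy) ** 2

    # Width and height penalty terms: rho^2(w, w_gt)/C_w^2 and rho^2(h, h_gt)/C_h^2.
    pw, ph = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    tw, th = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    w_term = (pw - tw) ** 2 / (cw ** 2 + eps)
    h_term = (ph - th) ** 2 / (ch ** 2 + eps)

    return 1 - iou + dist / c2 + w_term + h_term
```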

Experimental data
The experimental data in this paper are from the bridge detection dataset provided by the 2020 High Resolution Remote Sensing Image Interpretation Competition (Sun et al. 2022b). Figure 5 shows an example data sample. The dataset consists of 2000 bridge images in 164 different scenes, taken by Gaofen-2 and constructed from historical image sequences. The spatial resolution of Gaofen-2 is 1 m for panchromatic images and 4 m for multispectral images. The image sizes are 668 × 668 pixels and 1000 × 1000 pixels. According to the statistics, there are 3286 bridges in this dataset; bridges shorter than 50 pixels are defined as small bridges, and bridges longer than 100 pixels are defined as large bridges. Each image contains at least one bridge object. The bridges have different environmental backgrounds, lighting conditions, and noise, which ensures data diversity and adequate training of the network for automatic detection in HSRRIs. The dataset is split 9:1, with 90% of the data used for training and 10% for testing.
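As an illustration of the 9:1 partition, a minimal Python sketch follows. The directory layout, file extension, and random seed are hypothetical, since the paper does not describe how the dataset is stored.

```python
import random
from pathlib import Path

# Hypothetical layout: one image file per sample; adjust to the real dataset.
images = sorted(Path("bridge_dataset/images").glob("*.png"))
random.seed(0)          # fixed seed so the split is reproducible
random.shuffle(images)

split = int(0.9 * len(images))              # 9:1 train/test partition
train_set, test_set = images[:split], images[split:]
print(f"train: {len(train_set)}, test: {len(test_set)}")
```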

Experimental environment
The experiments in this paper are performed on the Windows 10 operating system with an Intel(R) i9-11900K CPU, an NVIDIA GeForce RTX 3090 graphics card with 24 GB of video RAM, and 64 GB of system RAM. The experiments use the GPU for training and testing, with PyCharm version 2020.1 as the development platform (downloadable from https://www.jetbrains.com/pycharm/download/, Prague, Czech Republic). During training, 300 epochs are run, the batch size is set to 6, the learning rate is 1 × 10⁻³, and the IoU threshold is 0.5.

Quantitative evaluation
The accuracy of the proposed method is verified using precision, recall, and AP as the quantitative evaluation metrics; their definitions are shown in Table 1.
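As an illustration of the metrics defined in Table 1, the following Python sketch computes AP as the area under the precision-recall curve using the common monotonic-envelope method. The example recall and precision values are made up, and the paper does not state which AP interpolation it uses, so this is an assumption.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve p(r) on [0, 1], computed with the
    monotonic-envelope method commonly used for VOC-style AP."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Make precision monotonically non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangular areas where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Example: precision = TP / (TP + FP) and recall = TP / (TP + FN),
# accumulated over score-sorted detections (values here are illustrative).
recall = np.array([0.2, 0.4, 0.6, 0.8])
precision = np.array([1.0, 0.9, 0.85, 0.7])
print(average_precision(recall, precision))
```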

Comparison experiments
The experiments compare the proposed network with existing object detection networks, including CenterNet (Law and Deng 2018), RetinaNet (Lin et al. 2020), SSD (Liu et al. 2016), YOLOv3 (Redmon and Farhadi 2018), YOLOv4 (Bochkovskiy, Wang, and Liao 2020), YOLOv5 (Glenn, Alex, and Jirka 2020), and YOLOx (Ge et al. 2021). The quantitative precision, recall, and AP were calculated and compared after training on the HSRRI bridge dataset. As shown in Table 2, the bridge detection accuracies of the different network models are reported separately; the improvement points in Table 2 are the differences in AP between each network and the proposed method. Table 2 shows that the detection precision for bridges in HSRRIs ranges from 48.7% to 98.3%, the recall ranges from 59.3% to 96.6%, and the AP ranges from 74.6% to 98.1%. The precision, recall, and AP of the proposed method are the best on each index in the comparison experiments, showing that the proposed algorithm can achieve high-precision detection of bridges in HSRRIs.

In this study, we also compare the effects of fusing the weighted BiFPN, CBAM, and decoupled head network using ablation experiments on the HSRRI bridge dataset. The quantitative precision, recall, and AP are calculated and compared, and the ablation results are shown in Table 3. The detection accuracy of the original YOLOv5 network improves with each added module, and the proposed algorithm achieves a suboptimal recall of 96.6%, only 0.4% below the optimal result, while achieving the optimal precision and AP of 98.3% and 98.1%, respectively.
To demonstrate the effectiveness of the proposed method in detecting multiscale bridges and small bridges, prediction experiments were conducted for bridges in different scenarios. Table 4 shows the detection results of the different networks in the different scenarios, and Figures 6-8 show the corresponding bridge detection results.
(1) Scene 1: Analysis of bridge detection results. Figure 6 shows multiscale bridge detection in a large-format remote sensing image. Scene 1 includes three small bridges and one large bridge, which exhibit a marked contrast in scale. In this scene, the AP of bridge detection by the different networks ranged from 0.59 to 1.00. Considering Figure 6 and Table 4, the large bridge can be detected by all networks in the large-format remote sensing image. However, when detecting the small bridges, the RetinaNet, SSD, and YOLOv3 networks all produced missed detections, with the number of missed detections ranging from 1 to 3. Although the other networks detect the bridges correctly, their detection accuracies are low and their abilities to detect small bridges are reduced; for example, the CenterNet network achieved only 0.59 AP in this scenario. Compared with the other networks, the proposed method detects the large bridge correctly, produces no missed detections, and achieves the highest accuracy for the small bridges.

(2) Scene 2: Analysis of bridge detection results. Figure 7 shows the detection of multiscale bridges in a small-format remote sensing image. Scene 2 includes bridges at different scales, and the image information is unclear. In this scene, the SSD network misses small bridges, while all other networks detect the bridges accurately. Considering Figure 7 and Table 4, the AP values of the different networks vary from 0.65 to 1.00. The experimental results show that the different networks achieve good detection accuracy on clear, large bridges. However, where the HSRRI information is unclear, such as the bridge in the upper left corner of Figure 7(a), all networks show decreased detection accuracy. In this scenario, for both small and unclear bridges, the proposed method makes accurate detections and achieves the highest detection accuracy of all tested methods.
(3) Scene 3: Analysis of bridge detection results. Figure 8 shows the detection of juxtaposed bridges; four bridges lie side by side in Scene 3. The imaging conditions in this scene are poor: some bridges are obscured by clouds, the background of the bridges is complex, and the features are unclear. Therefore, the detection accuracy of all networks is reduced. However, the proposed method still shows good detection capability when bridges are obscured by clouds and can still detect the bridge objects accurately and with high precision.

Discussion
The AP of the proposed method for detecting bridges in HSRRIs is 98.1%. The comparison experiments show AP improvements ranging from 4.1% to 23.5%, which verifies that the proposed algorithm can detect bridges effectively, correctly, and with high precision. In the ablation experiments, the improvements in detecting bridge targets in HSRRIs range from 0.4% to 8.0%. Together, the comparison and ablation results show that the accuracy of the proposed method in bridge detection is markedly improved over existing algorithms.
To demonstrate the accuracy of the proposed method, the detection results of the different networks on bridges in different scenarios were examined. The results show that, compared with existing methods, the proposed algorithm achieves accurate detection with the highest accuracy and no missed detections. In these scenarios, particularly when detecting small bridges in large images and bridges obscured by clouds, the proposed method achieves high detection accuracy and good generalizability.

Conclusions
To solve the problem of difficult bridge detection in HSRRIs, a YOLOv5 network with a decoupled head is proposed. The algorithm combines the BiFPN to enhance the representation capability of bridge features and to solve the problem of conflicting information across feature fusion scales. CBAM is used to enhance the feature extraction capability for multiscale bridges by focusing on the feature information in the feature layers in both the channel and spatial dimensions. Finally, in the classification and regression tasks, the decoupled head network handles the classification, localization, and confidence detection tasks to improve detection accuracy. The proposed algorithm achieves 98.1% detection accuracy on HSRRIs, an increase of 4.1%∼23.5% over existing models. The proposed method also generalizes well, performing better in various scenarios, such as when bridges are obscured by clouds, when multiscale bridges are present in one image, and when bridges are small.
However, the proposed method complicates the model while improving detection accuracy. Subsequent work should focus on lightweight networks so that the model can be optimized to meet high-precision, lightweight detection requirements while improving accuracy. We also found that when a bridge tilts at a large angle, the area enclosed by the horizontal box is large and contains irrelevant background information, which leads to poor detection results. Therefore, to solve this problem, we plan to apply rotated-box detection in future work.

Figure 1. Flow chart of the proposed method.

Figure 3. The architecture of the proposed method.

Figure 4. The architecture of the decoupled head network. Cls. determines the class confidence of the feature points; Reg. obtains the prediction box coordinates; Obj. determines whether a corresponding object is present at each feature point.

Figure 6. Scene 1 bridge detection results. Panel (a) shows the original image, with rectangular boxes marking the locations of the bridges. Panels (b) to (i) show the detection results of the different models.

Figure 7. Scene 2 bridge detection results. Panel (a) shows the original image, with rectangular boxes marking the locations of the bridges. Panels (b) to (i) show the detection results of the different models.

Figure 8. Scene 3 bridge detection results. Panel (a) shows the original image, with rectangular boxes marking the locations of the bridges. Panels (b) to (i) show the detection results of the different models.

Table 1. Quantitative evaluation metrics. TP is the number of bridges correctly detected by the model; FP is the number of objects of other categories incorrectly detected as bridges; FN is the number of true bridges that the model misses or detects as other categories. p stands for precision, $p = TP/(TP + FP)$, the percentage of the bridges detected by the model that are correct. r stands for recall, $r = TP/(TP + FN)$, the percentage of all true bridges that the model detects. AP is the area enclosed by the precision-recall curve $p(r)$ over $r$ in the range [0, 1].

Table 2. Detection results of the different models.

Table 3. Results of the ablation tests. The best results are in bold font.

Table 4. Bridge detection results in different scenarios. The best results are in bold font. Number is the number of missed detections.