MRFF-YOLO: A Multi-Receptive Fields Fusion Network for Remote Sensing Target Detection

High-altitude remote sensing target detection suffers from low detection precision and a high missed detection rate. In order to enhance the performance of detecting remote sensing targets, a new YOLO (You Only Look Once)-V3-based algorithm is proposed. In our improved YOLO-V3, we introduce the concept of multi-receptive fields to enhance the performance of feature extraction; the proposed model is therefore termed Multi-Receptive Fields Fusion YOLO (MRFF-YOLO). In addition, to address the flaws of YOLO-V3 in detecting small targets, we increase the number of detection layers from three to four. Moreover, in order to avoid gradient fading, an improved DenseNet structure is adopted in the detection layers. We compared MRFF-YOLO with YOLO-V3 and other state-of-the-art target detection algorithms on the Remote Sensing Object Detection (RSOD) dataset and the UCS-AOD dataset of object detection in aerial images. With this series of improvements, the mAP (mean average precision) of MRFF-YOLO increased from 77.10% to 88.33% on the RSOD dataset and from 75.67% to 90.76% on the UCS-AOD dataset. The missed detection rate is also greatly reduced, especially for small targets. The experimental results show that our approach achieves better performance than the original YOLO-V3 and other state-of-the-art models for remote sensing target detection.


Introduction
The high-altitude remote sensing images [1][2][3][4] obtained by satellites and aircraft are widely used in military applications, navigation, disaster relief, etc. Remote sensing target detection [5][6][7] has therefore become an important research hotspot. Interference from lighting changes, the environment, and other complex backgrounds makes targets in remote sensing images difficult to detect. At present, there are still problems such as low detection accuracy, false detections, and missed detections.
In order to realize remote sensing target detection, researchers have made unremitting efforts. An algorithm of elliptical Laplace operator filtering based on Gaussian scale space was proposed in 2010 [8]. It treated vehicle targets as elliptical objects and employed elliptic operators in different directions to perform convolution filtering with the targets; a k-nearest-neighbor classifier was then used to reject false targets. In 2015, a new method for remote sensing target detection was proposed by Naoto Yokoya et al. [9]. It combined feature detection based on sparse representation with the generalized Hough transform. By learning dictionaries of targets and backgrounds, the class-specific sparse image representation was continually refined, and remote sensing target detection was thereby realized. For the detection of ship targets in high-resolution optical satellite images, Buck et al. [10] first considered the use of frequency domain characteristics to extract the candidate areas of ship targets. Then, the length-width ratio of the superstructure and the

The main contributions of this paper are as follows. (1) Borrowing the core idea of Res2Net, the residual units of the feature extraction network are improved to increase the receptive fields of each layer. (2) The convolutional layers in the detection layers are replaced by a densely connected network (DenseNet). (3) To enhance the performance of detecting remote sensing targets with small size, a 4th scale is added to the framework of YOLO-V3.

The rest of this paper is organized as follows. Section 2 introduces the framework of YOLO-V3. Section 3 details the improvements of our approach. Section 4 presents the experimental verification of our approach. Finally, Section 5 concludes this paper.
Introduction to YOLO
YOLO is the most popular regression-based target detection algorithm due to its conciseness and high speed. Compared with region-based algorithms such as Fast R-CNN and Faster R-CNN, YOLO is well suited to engineering applications due to its simple and efficient network. Since its advent, YOLO has evolved from YOLO-V1 to YOLO-V2 and YOLO-V3.

The Fundamental of YOLO
When detecting targets, YOLO first divides the input image into S × S grid cells. The grid cell in which the center of a target falls is responsible for detecting it. For each grid cell, YOLO predicts B bounding boxes. For each bounding box, YOLO predicts five values: four for the location of the bounding box and one for its confidence. The confidence is defined as $\Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}}$. Confidence measures two aspects: whether a target lies in the bounding box, and the accuracy of the bounding box in predicting the position of the target. If no target lies in the bounding box, the confidence of the bounding box is 0. If the bounding box contains a target, then $\Pr(\text{Object}) = 1$, and the confidence equals the IOU (Intersection-Over-Union) between the bounding box and the ground truth. In addition, YOLO predicts a set of C conditional class probabilities for each grid cell: $\Pr(\text{Class}_i \mid \text{Object})$.
From the above, the output of the network covers S × S grid cells; each grid cell predicts B bounding boxes, each bounding box predicts five values, and each grid cell additionally predicts C categories. So, the size of the output tensor of the network is S × S × (B × 5 + C).
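For illustration, the following minimal Python sketch computes this output size; the values S = 7, B = 2, and C = 20 are the classic YOLO-V1 settings and are used here only as an example.

```python
def yolo_output_shape(S, B, C):
    """Size of the YOLO output tensor: an S x S grid, where each cell predicts
    B bounding boxes (4 location values + 1 confidence) and C class probabilities."""
    return (S, S, B * 5 + C)

# Classic YOLO-V1 example: a 7 x 7 grid, 2 boxes per cell, 20 categories.
print(yolo_output_shape(7, 2, 20))  # (7, 7, 30)
```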

The Principle of YOLO-V3
To extract deeper features, YOLO-V3 upgrades the feature extraction network from Darknet19 to Darknet53 and makes heavy use of residual units. Instead of pooling layers, YOLO-V3 adopts convolutional layers with stride = 2 to implement down-sampling. The structure of YOLO-V3 is shown in Figure 1.

The feature extraction network down-samples the image by 32×, which means that the size of the final feature map is 1/32 the size of the input image. To enhance the performance of detecting small targets, detection is carried out on the feature maps down-sampled by 32×, 16×, and 8×, respectively. Up-sampling is adopted because the deeper the network, the better its feature expression. For example, when detecting targets with the feature map down-sampled by 16×, directly using the 4th down-sampling layer for detection generally does not work well. So, the network doubles the size of the feature map down-sampled by 32× by up-sampling with a step size of 2. In this way, the dimensions of the two feature maps match, and the network concatenates them to achieve feature fusion. The same is done for the other detection layers.
The final outputs of YOLO-V3 are three scales: 13 × 13, 26 × 26, and 52 × 52, which are responsible for the detection of big targets, medium-sized targets, and small targets, respectively.
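For illustration, the up-sample-and-concatenate fusion described above can be sketched with tf.keras as follows; the channel widths and the 416 × 416 input resolution are assumptions for the example, not the exact configuration of the network.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Assumed shapes for a 416 x 416 input image.
coarse = tf.keras.Input(shape=(13, 13, 512))  # feature map down-sampled by 32x
fine = tf.keras.Input(shape=(26, 26, 256))    # feature map down-sampled by 16x

# Double the spatial size of the coarse map (up-sampling with a step size of 2),
# so that the two feature maps match spatially.
up = layers.UpSampling2D(size=2)(coarse)         # -> (26, 26, 512)
# Concatenate along the channel axis to achieve feature fusion.
fused = layers.Concatenate(axis=-1)([up, fine])  # -> (26, 26, 768)

model = tf.keras.Model([coarse, fine], fused)
```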
In YOLO-V3, the loss function can be divided into three parts: coordinate prediction error, IOU error, and classification error [43].
The coordinate prediction error is defined as:

$$E_{\text{coord}} = \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \quad (1)$$

In Equation (1), $S^2$ represents the number of grid cells of each scale, and B denotes the number of bounding boxes for each grid cell. $I_{ij}^{\text{obj}}$ indicates whether a target falls in the j-th bounding box of the i-th grid cell. $(x_i, y_i, w_i, h_i)$ and $(\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i)$ represent the center coordinates, width, and height of the predicted box and the ground truth, respectively.
The IOU error is defined as:

$$E_{\text{IOU}} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} I_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \quad (2)$$

In Equation (2), $C_i$ and $\hat{C}_i$ denote the true and predicted confidence, respectively. The third part is the classification error:

$$E_{\text{class}} = \sum_{i=0}^{S^2} I_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2 \quad (3)$$

In Equation (3), $p_i(c)$ refers to the true probability that the target belongs to class c, while $\hat{p}_i(c)$ refers to the predicted value.
From the above, the final loss function is shown in Equation (4):

$$Loss = E_{\text{coord}} + E_{\text{IOU}} + E_{\text{class}} \quad (4)$$
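For illustration, the sketch below evaluates the three loss terms for one scale with NumPy. The mask obj_mask plays the role of $I_{ij}^{\text{obj}}$, and the weighting factors λ_coord = 5 and λ_noobj = 0.5 are the values used in the original YOLO loss, assumed here since Equation (4) does not fix them.

```python
import numpy as np

def yolo_loss(pred, truth, obj_mask, lam_coord=5.0, lam_noobj=0.5):
    """Sketch of Equations (1)-(4) for one scale. pred and truth are dicts of
    arrays: x, y, w, h, conf of shape (S, S, B) and p of shape (S, S, C)."""
    noobj_mask = 1.0 - obj_mask
    # Equation (1): coordinate prediction error (square roots soften the
    # penalty for large boxes).
    e_coord = lam_coord * np.sum(obj_mask * ((pred['x'] - truth['x']) ** 2
                                             + (pred['y'] - truth['y']) ** 2))
    e_coord += lam_coord * np.sum(obj_mask *
                                  ((np.sqrt(pred['w']) - np.sqrt(truth['w'])) ** 2
                                   + (np.sqrt(pred['h']) - np.sqrt(truth['h'])) ** 2))
    # Equation (2): IOU (confidence) error, down-weighted where no target falls.
    e_iou = np.sum(obj_mask * (pred['conf'] - truth['conf']) ** 2)
    e_iou += lam_noobj * np.sum(noobj_mask * (pred['conf'] - truth['conf']) ** 2)
    # Equation (3): classification error for the cells that contain a target.
    cell_obj = obj_mask.max(axis=-1, keepdims=True)
    e_cls = np.sum(cell_obj * (pred['p'] - truth['p']) ** 2)
    # Equation (4): the final loss is the sum of the three parts.
    return e_coord + e_iou + e_cls
```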


Methodology
Even the state-of-the-art YOLO-V3 model still has poor performance in detecting remote sensing targets due to the complex background and small size of the targets. For real-time remote sensing target detection, it is necessary to increase the receptive fields and extract features more effectively without deepening the network.
Therefore, based on the original YOLO-V3 model, several improvements are proposed for the feature extractor and detection layers.

The Feature Extractor Based on Res2Net
In order to alleviate the problem of gradient fading while deepening the network, the feature extractor of YOLO-V3 employs the structure of ResNet. The Darknet53 of YOLO-V3 contains five residual blocks, and each residual block consists of one or more residual units. The structure of the residual unit is exhibited in Figure 2. The residual blocks in YOLO-V3 overcome the problem of gradient fading when deepening the feature extraction network and enhance the performance of feature expression. Representing features on multiple scales is important for many visual tasks. However, ResNet still represents multi-scale features in a layer-wise hierarchical manner, which leaves the features within each layer underutilized. To solve this problem, Gao et al. [44] proposed a new connection method for the residual units. They constructed hierarchical residual-style connections within a single residual block and proposed a new building block named Res2Net. Res2Net represents multi-scale features at a finer granularity and increases the range of receptive fields of each layer. Borrowing the core idea of Res2Net, we added several tiny residual terms to the original residual units to increase the receptive fields of each layer. The structure of our proposed 'Res2 unit', compared with the original residual unit, is shown in Figure 3.
In the 'Res2 unit', after the 1 × 1 convolutional layer, we divide the input feature map evenly into N sub-features (N = 4 in this paper). Each sub-feature is represented as $x_i$ (i = 1, 2, . . . , N). Each $x_i$ has the same spatial size but contains only 1/N of the channels of the input feature map [45]. $K_i(\cdot)$ represents a 3 × 3 convolutional layer, and we denote by $y_i$ the output of $K_i(\cdot)$:

$$y_i = \begin{cases} K_i * x_i, & i = 1 \\ K_i * (x_i + y_{i-1}), & 1 < i \le N \end{cases} \quad (5)$$

In particular, $y_1, y_2, y_3, y_4$ can be expressed as (* represents convolution):

$$y_1 = K_1 * x_1, \quad y_2 = K_2 * (x_2 + y_1), \quad y_3 = K_3 * (x_3 + y_2), \quad y_4 = K_4 * (x_4 + y_3) \quad (6)$$

In this paper, we treat N as the controlling parameter, which means that the input channels can be divided evenly into multiple feature channels. The larger N is, the stronger the multi-scale capability of the network, and we obtain outputs with different sizes of receptive fields.
Compared with the residual unit, the improved 'Res2 unit' makes better use of contextual information and helps the classifier detect small targets, as well as targets subject to environmental interference, more easily. In addition, the extraction of features at multiple scales enhances the semantic representation of the network.
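For illustration, a minimal tf.keras sketch of the 'Res2 unit' under the reading of Equation (6) is given below: the input passes a 1 × 1 convolution, is split into N = 4 sub-features, flows through the hierarchical 3 × 3 convolutions with the tiny residual terms, and is concatenated and restored by a final 1 × 1 convolution. The channel widths and activations are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def res2_unit(inputs, channels, n=4):
    """Sketch of the 'Res2 unit' of Equation (6) with hierarchical residual
    connections inside a single residual unit, following Res2Net."""
    x = layers.Conv2D(channels, 1, padding='same', activation='relu')(inputs)
    # Split into n sub-features x_1..x_n, each with channels/n channels.
    subs = layers.Lambda(lambda t: tf.split(t, n, axis=-1))(x)
    ys, prev = [], None
    for xi in subs:
        # y_1 = K_1 * x_1; y_i = K_i * (x_i + y_{i-1}) for i > 1.
        zi = xi if prev is None else layers.Add()([xi, prev])
        yi = layers.Conv2D(channels // n, 3, padding='same', activation='relu')(zi)
        ys.append(yi)
        prev = yi
    y = layers.Concatenate(axis=-1)(ys)
    y = layers.Conv2D(channels, 1, padding='same')(y)
    return layers.Add()([inputs, y])  # the outer residual connection
```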

Densely Connected Network for Detecting Layers
The structure of YOLO-V3 in Figure 1 shows that there are six convolutional layers in each detecting layer. In order to avoid gradient fading, we introduce the concept of densely connected networks (DenseNet).
DenseNet [46][47][48][49] was first proposed by Huang et al. in 2017. It connects each layer to every other layer in a feed-forward fashion. The structure of DenseNet is shown in Figure 4.
In Figure 4, $x_i$ is the output feature map of the i-th layer, while $H_i$ represents the transport layer. There are $l(l+1)/2$ connections in a network with l layers. Each layer is connected to all the other layers; thus, each layer receives the feature maps of all preceding layers. The feature map of each layer can be expressed as in Equation (7):

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]) \quad (7)$$

where $[x_0, x_1, \ldots, x_{l-1}]$ denotes the concatenation of the feature maps of layers 0 to l − 1. The structure of DenseNet makes it easy to alleviate gradient fading. In addition, DenseNet can also enhance feature transmission and reduce the number of parameters to a certain extent. The structure of our proposed densely connected network is described in detail in Section 3.4.
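For illustration, Equation (7) can be sketched in tf.keras as follows; the number of layers and the growth rate are illustrative assumptions rather than the exact configuration of our 'Dense blocks', which is given in Section 3.4.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(inputs, num_layers=4, growth_rate=32):
    """Equation (7): each transport layer H_l receives the concatenation
    [x_0, x_1, ..., x_{l-1}] of all preceding feature maps."""
    features = [inputs]
    for _ in range(num_layers):
        if len(features) == 1:
            x = features[0]
        else:
            x = layers.Concatenate(axis=-1)(features)
        x = layers.Conv2D(growth_rate, 3, padding='same', activation='relu')(x)
        features.append(x)
    return layers.Concatenate(axis=-1)(features)
```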


Multi-Scale Detecting Layers
Three scales of detecting layers are used in YOLO-V3 to detect multi-scale targets. Among them, the scale with a feature map down-sampled by 32× is responsible for detecting the big targets. The scale with a feature map down-sampled by 16× is responsible for detecting the medium-sized targets, and the scale with a feature map down-sampled by 8× is responsible for detecting the small targets. The remote sensing images contain a large number of small targets. In order to get more fine-grained features and more detailed location information, the 4th scale with a feature map down-sampled by 4× is added to the network as a new detecting layer.
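Assuming the standard 416 × 416 YOLO-V3 input resolution (an assumption for this example), the grids of the four detection scales are:

```python
input_size = 416  # the standard YOLO-V3 input resolution, assumed here
for stride in (32, 16, 8, 4):  # the 4x stride belongs to the added 4th scale
    grid = input_size // stride
    print('down-sampled by %2dx -> %3d x %3d grid' % (stride, grid, grid))
# 13 x 13, 26 x 26, 52 x 52, and the new 104 x 104 grid for small targets
```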



Our Model
From what has been discussed above, the proposed MRFF-YOLO adopts ResNet, Res2Net, DenseNet, and multi-scale detecting layers. The structure of MRFF-YOLO is shown in Figure 5.
Since each $x_i$ in the 'Res2 block' has the same spatial size but contains only 1/N of the channels of the 'Res block', the number of parameters of the network is not increased.
The structure of the Dense blocks in Figure 5 is shown in Figure 6, in which $H_0$ represents the convolutional layer.


K-Means for Anchor Boxes
The anchor box is used to detect multiple targets in one grid cell; the concept was first proposed in Faster-RCNN. Inspired by Faster-RCNN, YOLO-V3 adopts anchor boxes to better match the length-to-width ratios of targets. Different from Faster-RCNN, which sets the sizes of the anchor boxes manually, we execute K-means on the dataset in advance to acquire the anchor boxes for YOLO-V3. The K-means procedure performs dimensional clustering to make the anchor boxes and the adjacent ground truths as similar as possible, which means they have larger IOU values. Each ground truth is denoted $gt_j(x_j, y_j, w_j, h_j)$, $j \in \{1, \ldots, N\}$, where $(x_j, y_j)$ represents the center of the ground truth and $(w_j, h_j)$ refers to its width and height. The distance between the ground truth and a bounding box is defined as follows [50]:

$$d(\text{gt}, \text{box}) = 1 - \text{IOU}(\text{gt}, \text{box}) \quad (8)$$

IOU represents the intersection over union, which is defined in Equation (9):

$$\text{IOU} = \frac{\text{area}(\text{gt} \cap \text{box})}{\text{area}(\text{gt} \cup \text{box})} \quad (9)$$

The larger the IOU between the ground truth and the bounding box, the smaller the distance. The steps of the algorithm are shown in Table 3.
Table 3. K-means for anchor boxes.
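For illustration, a minimal NumPy sketch of the clustering in Table 3 is given below. Boxes are represented by their (w, h) pairs only, since aligning boxes at the origin makes the centers cancel out of the IOU computation; k = 9 is the YOLO-V3 default and is an assumption here.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IOU between (w, h) pairs, with boxes and centroids aligned at the origin."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0])
             * np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = ((boxes[:, 0] * boxes[:, 1])[:, None]
             + (centroids[:, 0] * centroids[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100):
    """Cluster ground-truth (w, h) pairs with d = 1 - IOU (Equation (8))."""
    centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the centroid with the smallest distance (largest IOU).
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        centroids = np.array([boxes[assign == i].mean(axis=0)
                              if np.any(assign == i) else centroids[i]
                              for i in range(k)])
    return centroids
```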

Decoding Process
In order to get the final bounding boxes, we need to decode the predicted values. The relationship between the bounding box and its corresponding prediction is shown in Figure 7, where $t_x, t_y, t_w, t_h$ refer to the predicted values and $(c_x, c_y)$ represents the offset of the grid cell relative to the upper-left corner of the image. The location and size of the bounding box are given in Equation (10):

$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h} \quad (10)$$

where $\sigma(\cdot)$ is the sigmoid function and $p_w$ and $p_h$ are the width and height of the anchor box.
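For illustration, Equation (10) can be evaluated for a single prediction as follows; $p_w$ and $p_h$ are the width and height of the matched anchor box, and the outputs are in grid units.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Equation (10): map the raw predictions to a bounding box (grid units)."""
    bx = sigmoid(tx) + cx   # center x, offset into the grid cell at (cx, cy)
    by = sigmoid(ty) + cy   # center y
    bw = pw * np.exp(tw)    # width, scaled from the anchor prior
    bh = ph * np.exp(th)    # height
    return bx, by, bw, bh
```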


Remove Redundant Bounding Boxes
After decoding, the network generates the bounding boxes of the targets. In order to eliminate redundant bounding boxes that correspond to the same target, we run Non-Maximum Suppression (NMS) on the bounding boxes. The NMS algorithm contains three steps. Step 1: Select the bounding box with the highest confidence and calculate its IOU with each of the remaining boxes. Step 2: If the IOU is larger than the threshold, consider the two boxes to correspond to the same target and retain only the bounding box with the higher confidence. Step 3: Repeat Steps 1 and 2 on the remaining boxes until all the retained boxes are determined.
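For illustration, a minimal NumPy sketch of this procedure is given below; boxes are given as corner coordinates, NMS is assumed to be run per category, and the threshold value is an assumption.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """The three NMS steps. boxes: (n, 4) array of (x1, y1, x2, y2)."""
    order = np.argsort(scores)[::-1]   # process boxes from high to low confidence
    keep = []
    while order.size > 0:
        k = order[0]                   # Step 1: retain the highest-confidence box
        keep.append(k)
        rest = order[1:]
        # IOU of the retained box with every remaining box.
        x1 = np.maximum(boxes[k, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[k, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[k, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[k, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_k = (boxes[k, 2] - boxes[k, 0]) * (boxes[k, 3] - boxes[k, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_k + area_r - inter)
        # Step 2: discard boxes that overlap the retained box above the threshold.
        # Step 3: repeat on the remaining boxes.
        order = rest[iou <= iou_threshold]
    return keep
```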

Results
In this section, we conduct experiments on the RSOD and UCS-AOD datasets and compare our approach with other state-of-the-art target detection models such as YOLO-V2, YOLO-V3, SSD, Faster-RCNN, etc. The experimental conditions are as follows. Framework: Python 3.6.5, tensorflow-GPU 1.13.1. Operating system: Windows 10. CPU: i7-7700k. GPU: NVIDIA GeForce RTX 2070. We trained for 50,000 steps; the learning rate decreased from 0.001 to 0.0001 after 30,000 steps and to 0.00001 after 40,000 steps. The initialization parameters are displayed in Table 4.
(Figure: the training curve of the RSOD dataset and the training curve of the UCS-AOD dataset.)

The Evaluation Indicators
To evaluate a binary classification model, we can divide all results into four categories: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). The confusion matrix is exhibited in Table 6. As shown in Table 6, TP denotes a sample that is positive in actuality and positive in prediction; FP denotes a sample that is negative in actuality but positive in prediction; FN denotes a sample that is positive in actuality but negative in prediction; TN denotes a sample that is negative in actuality and negative in prediction. With the confusion matrix, precision and recall are defined in Equations (11) and (12):

$$\text{Precision} = \frac{TP}{TP + FP} \quad (11)$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad (12)$$

Precision and recall are two indicators that check and balance each other, and the tradeoff between them is hard. Aiming at measuring the precision of detecting targets of different categories, the average precision (AP) and mean average precision (mAP) are introduced, which are the most important evaluation indicators for target detection. Average precision (AP) is defined as:

$$AP_i = \int_0^1 P_i(R_i) \, dR_i$$

where $P_i$ refers to the precision of the i-th category and $R_i$ refers to the recall of the i-th category; $P_i(R_i)$ is the function with $R_i$ as its independent variable and $P_i$ as its dependent variable. It measures the performance of target detection for a certain category. The mean average precision (mAP) is defined as:

$$mAP = \frac{1}{c} \sum_{i=1}^{c} AP_i$$

It measures the performance of target detection over all c categories.
In addition, FPS (Frames Per Second) is also an important indicator for target detection, measuring real-time performance. It refers to the number of frames processed by the target detection algorithm in one second.
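For illustration, AP can be computed by numerically integrating precision over recall, as sketched below; detections are assumed to be matched to the ground truth already, and the simple trapezoidal integration differs slightly from the interpolated AP used in some benchmarks.

```python
import numpy as np

def average_precision(tp, conf, num_gt):
    """AP as the area under P_i(R_i). tp: 1/0 per detection; conf: confidences;
    num_gt: number of ground-truth targets of this category."""
    order = np.argsort(conf)[::-1]
    tp = np.asarray(tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    precision = cum_tp / (cum_tp + cum_fp)   # Equation (11)
    recall = cum_tp / num_gt                 # Equation (12)
    return np.trapz(precision, recall)       # numerical integral of P_i(R_i)

def mean_average_precision(aps):
    """mAP: the mean of the per-category APs over all c categories."""
    return float(np.mean(aps))
```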

Experimental Process and Analysis
To evaluate the validity of our approach for remote sensing target detection, we selected RSOD and UCS-AOD as our experimental datasets. Generally speaking, if the ground truth of a target takes up less than 0.12% of the pixels of the whole image, we place it in the category of small targets; if it takes up 0.12-0.5% of the pixels, we place it in the category of medium targets; otherwise, if it takes up more than 0.5% of the pixels, we place it in the category of large targets. The RSOD dataset contains a large number of aerial images. The targets marked in the samples are divided into four categories: aircraft, playground, oil tank, and overpass. Among them, most of the aircraft and oil tank targets are small or medium in size, while the playgrounds and overpasses are large. In addition to scale diversity, the samples were also obtained under different light conditions and backgrounds of varying degrees of complexity. UCS-AOD is a dataset for target detection in aerial images. Tables 7 and 8 present the statistics of the datasets.


Experimental Results and Comparative Analysis
Three evaluation indexes are adopted to verify the performance of our approach: mAP, Frames Per Second (FPS), and the missed detection rate. We compared the performance of our approach with state-of-the-art target detection algorithms on the RSOD dataset. The contrastive results are shown in Table 9; results differentiated by target size are shown in Table 10.
Table 9 demonstrates that the proposed MRFF-YOLO is superior to the other state-of-the-art detectors in mAP, while its FPS is not reduced much compared with YOLO-V3. The mAP of MRFF-YOLO for remote sensing target detection is 88.33%, an increase of 11.23%, 10.54%, and 11.75% over YOLO-V3, UAV-YOLO, and DC-SPP-YOLO, respectively. In addition, the accuracy of detecting small and medium targets such as aircraft and oil tanks is significantly improved. In terms of detection speed, MRFF-YOLO satisfies the real-time requirement for remote sensing targets. The experimental results indicate that the improved MRFF-YOLO can obviously improve the accuracy of detecting remote sensing targets under complex backgrounds while meeting the demand of real-time detection; the detection of small targets in particular is more advantageous. Table 10 shows the contrastive results for different sizes, from which we can see that MRFF-YOLO is superior to YOLO-V3 in detecting small targets.
To test the universality of MRFF-YOLO for remote sensing target detection, we chose another dataset, UCS-AOD, for additional verification. The contrastive results are given in Table 11. We can see from Tables 10 and 11 that the missed detection rate of MRFF-YOLO is prominently lower than those of the original YOLO-V3 and other classical target detection algorithms.
Figure 10 exhibits some of the detection results of our proposed MRFF-YOLO. The set of 19 samples contains remote sensing targets of different sizes and categories, under backgrounds of varying degrees of complexity and in different light conditions; the angles of view from which the images were acquired also differ considerably. Every marked target in Figure 10 is detected, which certifies the excellent performance of our approach for remote sensing target detection.


Ablation Experiments
Section 4.3.1 has proved the advantage of MRFF-YOLO. In order to analyze the impact of the 'Dense blocks', the 'Res2 blocks', and the 4th detection layer on mAP and FPS, different module combinations were set up in the experiments, which were conducted on the RSOD dataset.
To verify the validity of the 'Res2 block' in the feature extraction network and of the 4th detection layer, different module combinations were set in the experiment. The experimental results are shown in Tables 12 and 13. Among them, Table 12 exhibits the ablation results with three detection layers (the same as those of the original YOLO-V3), while Table 13 exhibits the ablation results with four detection layers.
The contrast in Tables 12 and 13 shows that with the 'Res2 block' in the feature extraction network, the mAP improved from 77.10% to 78.12% and from 86.25% to 87.49%, respectively, while the detection speed improved from 29.7 to 31.5 FPS and from 22.8 to 23.5 FPS, respectively. This contrast certifies the effectiveness of the improvement in the feature extraction network. In order to verify the impact of the additional detection layer on the detection accuracy, we compared the 1st to the 4th experiments in Table 12 with those in Table 13, respectively. The mAP improved by 9.15%, 8.95%, 9.01%, and 9.37%, respectively. For smaller targets such as aircraft, the accuracy improved more obviously, which proves that the additional detection layer is suitable for detecting smaller remote sensing targets.
Table 14 compares the experimental effects of each 'Dense block' in Figure 6. With 'Dense block 1' to 'Dense block 4' added to the detection layers, the mAP improved from 87.49% to 88.33% and the FPS improved from 23.5 to 25.1, which indicates that the Dense blocks we propose in the detection layers can modestly improve the accuracy of detecting remote sensing targets and accelerate detection simultaneously.
The ablation experiments in Tables 12-14 indicate that each module we propose is efficient for improving the accuracy of remote sensing target detection. Among them, the proposed 'Res2 blocks' in the feature extraction network and 'Dense block 1' to 'Dense block 4' in the detection layers not only improved the accuracy but also sped up detection. In addition, the 4th detection layer improved the performance of detecting small targets to a large degree at the expense of some detection speed. Generally speaking, MRFF-YOLO is an excellent model for real-time remote sensing target detection.
Figure 10 showed the excellent performance of MRFF-YOLO for remote sensing target detection. Besides the detection results in Figure 10, a comparison of the detection effects of MRFF-YOLO and YOLO-V3 is also provided. The RSOD dataset contains a large number of small targets, so we chose its detection results for the comparison. Figure 11 provides 20 images in 10 sets to compare the detection effects of YOLO-V3 and MRFF-YOLO intuitively. The images contain a large number of densely distributed targets that are small or medium in size. Among them, the 1st and 2nd columns are the images detected by YOLO-V3, while the 3rd and 4th columns are the images detected by MRFF-YOLO. Figure 11 clearly shows that several targets were not detected or were erratically detected by YOLO-V3. Especially when the targets are small and densely distributed, YOLO-V3 may predict several targets as one (a (3), a (7), a (8), a (9)) or judge shadows as targets (a (10)). On the other hand, MRFF-YOLO detected all the marked targets faultlessly. The contrast experiment in this section shows that our improved YOLO-V3, MRFF-YOLO, can detect densely distributed small and medium remote sensing targets better than the original YOLO-V3.

Conclusions
Aiming at the characteristics of remote sensing images, in which a large number of small targets exist and their distribution is relatively dense, a series of improvements was proposed based on YOLO-V3. In order to realize multi-scale feature extraction, Res2Net was adopted to improve the capability of the feature extraction network. Then, to address the difficulty of extracting features of small targets in high-altitude remote sensing images, we increased the number of detection scales from three to four. In addition, in order to avoid gradient fading, the 'Dense blocks' we proposed were used to replace the convolutional layers in each detection layer. Tables 9-11 and Figure 10 show that the proposed MRFF-YOLO is superior to other state-of-the-art algorithms in remote sensing target detection. Since MRFF-YOLO is based on YOLO-V3, the ablation experiments in Tables 12-14 show that each module we proposed is valid for improving the accuracy of remote sensing target detection, and the slight loss in detection speed is acceptable. The comparison of detection effects reveals that MRFF-YOLO performs better than YOLO-V3 in detecting densely distributed small targets in remote sensing images. In general, our approach is more suitable for remote sensing target detection than YOLO-V3 and other classical target detection models, and it basically meets the requirement of real-time detection. In future work, other networks based on receptive field amplification will be researched.
Author Contributions: D.X.: methodology, software, provided the original idea, performed the experiments, wrote this paper, and collected the dataset. Y.W.: contributed modifications and suggestions to the paper, writing-review and editing. All authors have read and agreed to the published version of the manuscript.