YOLO-FIRI: Improved YOLOv5 for Infrared Image Object Detection

To solve object detection issues in infrared images, such as a low recognition rate and a high false alarm rate caused by long distances, weak energy, and low resolution, we propose a region-free object detector named YOLO-FIR for infrared (IR) images with the core of YOLOv5 by compressing channels, optimizing parameters, etc. An improved infrared image object detection network, YOLO-FIRI, is further developed. Specifically, while designing the feature extraction network, the cross-stage-partial-connections (CSP) module in the shallow layer is expanded and iterated to maximize the use of shallow features. In addition, an improved attention module is introduced in the residual blocks to focus on objects and suppress background. Moreover, multiscale detection is added to improve the detection accuracy of small objects. Experimental results on the KAIST and FLIR datasets show that YOLO-FIRI achieves a qualitative improvement over state-of-the-art detectors. Compared with YOLOv4 on the KAIST dataset, the mean average precision (mAP50) of YOLO-FIRI is increased by 21%, the detection time is reduced by 62%, the parameters are decreased by 89%, the weight size is reduced by more than 94%, and the computational costs are reduced by 84%. Compared with YOLO-FIR, YOLO-FIRI shows an approximately 5% to 20% improvement in AP, AR (average recall), mAP50, F1, mAP50:75, etc. Furthermore, to address the high noise and weak features of infrared images, image fusion can be applied in image preprocessing as a data enhancement method by fusing visible and infrared images with a convolutional neural network.


I. INTRODUCTION
Object detection in infrared images has received extensive attention within the field of computer vision due to its important research value for applications. It also occupies an irreplaceable position in many fields, such as diagnosing diseased cells [1], video surveillance [2], drone cruising [3], infrared warning [4], infrared night vision [5], infrared guidance [6], and other civilian and military fields. Although object detection models have achieved promising results in various tasks, object detection in infrared images remains challenging, as most models focus only on visible images.
Obtained through thermal radiation, infrared images have outstanding characteristics, such as object detection from long distances, high concealment, and availability in both the daytime and nighttime. With the expansion of long-distance imaging, the demand for intelligent object detection in infrared images has become ever more urgent. However, constructing models for infrared images that achieve the desired results is restricted by the longer imaging wavelength, larger noise, poorer spatial resolution, and greater sensitivity to environmental temperature changes compared with visible images. In recent years, several studies have investigated the possibility of object detection in infrared images using methods such as spatial filtering [7], frequency domain filtering [8], and sparse representation [9]. However, these traditional object detection methods for infrared images are restricted to a single application scenario. They also have a slow recognition speed and weak generalizability, making it difficult for them to fully extract important features when applied to multi-scene and real-time detection applications.
Convolutional neural network (CNN) models have the ability to learn the deep features of the input. High-level features can be learned from the original pixels of the training data to obtain a better feature expression capability for complex context information. Some networks have greatly improved their accuracy and generalization ability [10], and have solved the problems of image classification [11], image segmentation [12], superresolution [13], etc. Examples include region-based two-stage object detection algorithms, such as R-CNN [14], Fast R-CNN [15], and Faster R-CNN [16], as well as region-free one-stage object detection algorithms in the SSD (Single Shot Multi-box Detector) series [17], [18] and the YOLO series [19]-[22]. In comparison to visible images, the lower signal-to-noise ratio of infrared images makes objects more easily submerged by noise and interference; the complex background of infrared images makes the object areas dark and uneven; and the distance between objects and infrared sensors is relatively long, making objects occupy small areas of the whole image. Consequently, CNN models that are robust on visible images do not transfer directly to infrared images. Despite the effectiveness of these studies, using a convolutional neural network to detect weak and small objects in infrared images remains a difficult and active research topic.
Based on the characteristics of infrared objects, we propose an object detection algorithm named YOLO-FIR for infrared images based on the region-free detector YOLOv5 [23]. As a state-of-the-art detector, YOLOv5 has the advantages of fast convergence, high precision, and strong customization. It also has strong real-time processing capabilities and low hardware computing requirements, meaning that it can be easily transplanted to mobile devices. These advantages are very helpful for ensuring the detection accuracy of objects in infrared images. Then, we make further improvements and propose a novel detection model, YOLO-FIRI, for small and weak objects in infrared images. The model proves to be reliable and efficient for object detection in infrared images. In the design of the YOLO-FIRI network, the CSP [24] module in the backbone network is extended to focus on shallow information and extract features to the maximum, while the feature extraction module is iterated to extract the detailed information and the deep features more thoroughly. Meanwhile, the SK (Select Kernel) [25] attention module is introduced and improved in the residual blocks, and the features are re-weighted and fused along the channel dimension. In the detection stage, to better detect small and weak objects, multiscale feature detection is improved. Four-scale feature maps are used to detect multiscale objects, especially to enhance the detection of small objects. Additionally, to evaluate the contribution of each part of the network model designed in this paper, we conduct an ablation experiment on the KAIST [26] infrared pedestrian dataset. Finally, in the experimental analysis, the image fusion method is adopted to realize data enhancement by using the Densefuse [27] network to fuse visible and infrared images. We also use another dataset, FLIR [28], to further evaluate the performance.
Experimental results demonstrate that the presented model improves the detection accuracy and ensures real-time detection speed in infrared images.
The main contributions of the research can be summarized as follows:
1. Based on studying the unique features of infrared images, this paper proposes the YOLO-FIR method for infrared objects with the core of YOLOv5 by analyzing the network structure, compressing channels, optimizing parameters, etc. Further improvements are made to design a novel network, YOLO-FIRI, where the feature extraction network is designed for complete use of the shallow features, and the detection head network has four layers to focus on small and weak objects.
2. Our designed feature extraction network extends and iterates the shallow CSP module, which uses an improved attention module, forcing the network to pay more attention to the shallow and detailed features in infrared images and making the model more robust by learning more distinguishable features.
3. In image preprocessing, we use a convolutional neural network, the Densefuse network, to fuse visible and infrared images. Thus, image fusion can be used as a data enhancement method to strengthen the features of infrared images.

II. RELATED WORK
Infrared image object detection has the advantage of not being disturbed by the environment, and it is a research hotspot in the field of object detection [29]. Currently, infrared image object detection approaches have been divided into two types: traditional algorithms and CNN models.

A. TRADITIONAL INFRARED OBJECT DETECTION
Traditional infrared object detection methods consider infrared images as three parts: object, background, and noise. The idea is to suppress background and noise, thus strengthening the object to achieve detection by various methods. K. Zhao et al. [7] first used a detection method based on spatial filtering for infrared object detection. In terms of the different gray values of object and background, the background is selected and suppressed, and thus the object is detected. However, this method allowed all isolated noise points of small objects to pass, leading to a low detection rate. To address this problem, T.S. Anju [8] used the frequency difference between the object and background to separate the high-frequency part and the low-frequency part to achieve the detection task. Compared with spatial filtering methods, the detection effect of frequency domain filtering is a substantial improvement, but it incurs high computational complexity. P. Jiao et al. [9] adopted sparse representation to cast the principle of infrared object detection in the form of low-rank matrix and sparse matrix recovery, thus achieving object segmentation and detection. The performance on the signal-clutter ratio (BCR) and background suppression factor (BSF) is much better than that of the filtering methods, but the nonlocal autocorrelation of the infrared image background cannot be used well, which weakens the background suppression effect. As the BCR decreases, the image background becomes increasingly complex. The enormous computational costs make the above models impractical for real-time detection.
Traditional infrared pedestrian detection methods use artificially designed feature extractors, such as Haar [30], histogram of oriented gradients (HOG) [31], or aggregate channel features (ACF) [32], to extract the features of objects. They then take advantage of a sliding window to extract local features and use a support vector machine (SVM) [33] or AdaBoost [34] to determine whether there is an object in the region. Unfortunately, these infrared object detection algorithms have strong pertinence, a high time complexity, and window redundancy. They are also not robust to changes in object diversity.

B. NEURAL NETWORK OBJECT DETECTION ALGORITHMS
Due to the improvement in computing power and the widespread use of infrared imaging equipment, many datasets have been released to the public, such as KAIST [26], FLIR [28], and OTCBVS [35], which prompts deep learning to be gradually applied in the field of infrared image object detection. Thanks to their strong capability for feature expression, CNN models open new horizons and create a large amount of excitement in object detection. They preserve the neighborhood relations and spatial locality of the input in their latent higher-level feature representations. Additionally, the number of free parameters describing their shared weights does not depend on the input dimensionality, meaning that a CNN can scale well to realistic-sized high-dimensional images in terms of computational complexity. Object detection networks mainly include two-stage detectors and one-stage detectors [36].
In 2014, R. Girshick et al. presented a pioneering two-stage object detector, R-CNN [14], whose main idea was to divide object detection into two steps: generating proposals and predicting objects. However, the computation was not shared and was extremely time-consuming. To accelerate inference and achieve better detection accuracy, Fast R-CNN [15] and Faster R-CNN [16] were developed by using ROI (Region of Interest) pooling and a novel proposal generator, the RPN (Region Proposal Network), respectively. Currently, there are many network model variants based on Faster R-CNN for solving different problems, such as R-FCN [37], Mask R-CNN [12], and Sparse R-CNN [38]. In the field of infrared image object detection, D. Ghose et al. [39] proposed a method with a few modifications based on the classical two-stage detector Faster R-CNN, which used the corresponding saliency maps to enhance the infrared images. However, since the process of training the saliency network was not integrated into Faster R-CNN, the non-end-to-end multitask training was very time-consuming. C. Devaguptapu et al. [40] developed a multimodal Faster R-CNN to obtain high-level infrared features through RGB channels, but the multimodal design undoubtedly increased the inference time. J. Park et al. [4] developed a CNN-based human detection method for infrared images. The performance was improved by performing pixelwise segmentation and making fine-grained predictions, but the proposed method lacked generality across different datasets. Practically speaking, two-stage detectors have difficulty achieving real-time inference.
To address the issues of two-stage detectors, J. Redmon et al. [19] proposed a region-free one-stage object detection algorithm, YOLO, in 2016, which divided the image into grid cells and considered each cell as a proposal to detect the object. Compared with Faster R-CNN, YOLO omitted proposal generation and achieved end-to-end detection, which enabled real-time detection. Subsequently, the greatly improved detection speed gave rise to extensive research, such as SSD [17], YOLOv3 [21], and YOLOv4 [22]. To detect infrared objects, M. Kristo et al. [41] used the one-stage detector YOLOv3 to detect persons at night in different weather conditions. YOLOv3 is faster than the two-stage detector Faster R-CNN. However, YOLOv3 easily misses small objects, so its detection accuracy is very low. M. Li et al. proposed SE-YOLO [42], a real-time pedestrian object detection algorithm for small objects in infrared images, which improved the feature expression ability of the network by combining it with the SE block [43]. To further improve the speed and accuracy of object detection, especially when objects are small and occluded, Y. Li et al. [44] developed a detector, YOLO-ACN, by introducing an attention module, CIoU (Complete Intersection over Union) [45], [46] loss, improved Soft-NMS (Non-Maximum Suppression) [47], and depthwise separable convolution. YOLO-ACN can focus on small objects and avoid the deletion of occluded objects. However, it still had so many parameters that the weight file was too large, which made it difficult to deploy on mobile devices. In addition to the above methods based on the classic YOLOv3, there were some other one-stage network models for infrared image object detection. Y. Cao et al. [48] presented a DNN-based one-stage detector, ThermalDet, which included a dual-pass fusion block (DFB) and a channelwise enhancement module (CEM).
The mAP of ThermalDet is 74.6% on the FLIR dataset, which does not achieve the desired results. X. Dai et al. [49] presented an SSD-like object detection method, TIRNet. In this method, VGG was adopted to extract features, and a residual branch was introduced to obtain robust features. Although TIRNet only cost a little additional time, its detection performance still could not meet actual application requirements. X. Song et al. [50] harnessed the features of infrared images and visible images to achieve fused features. Then, a multispectral feature fusion network (MSFFN) was proposed based on YOLOv3 to detect pedestrian objects, but the excellence of the MSFFN was only obvious when the input images were of a small size.
These methods achieved a better performance for nighttime object detection in different fields, such as pedestrian detection [4], [8], [41], [44] and autonomous driving [48], [49]. Despite the recent progress, it was difficult to transplant these models to mobile devices after training, especially for drone equipment, satellite equipment, infrared cameras, etc. To solve the problems in the existing models, this paper studies the state-of-the-art detector YOLOv5, which was first released on June 25, 2020. Based on studying the unique features of infrared images, this paper proposes the YOLO-FIR method for infrared objects with the core of YOLOv5 by analyzing the network structure, compressing channels, optimizing parameters, etc. Furthermore, by extending the thickness of the shallow CSP module that contains rich feature information in the backbone network, incorporating an improved SK attention module in the residual blocks to boost the feature extraction ability, and adding a detection layer to detect smaller objects, YOLO-FIRI, an improved infrared image object detection framework, is further proposed. In addition, the Densefuse network is used to fuse infrared images and visible images to generate more informative fused infrared images. Experimental results show that, compared with the latest infrared image object detection models, whether on the KAIST dataset or the FLIR dataset, the proposed object detection model YOLO-FIRI for infrared images brings notable improvements in detection accuracy, detection speed, and model size. In addition, the detection accuracy of the proposed models on the fused dataset is improved to a certain degree.

III. PROPOSED METHOD
A. NETWORK ARCHITECTURE
One-stage deep convolutional neural networks YOLOv3 and YOLOv4 have achieved a good performance in object detection. YOLOv5 uses a variety of network structures and two types of CSP modules to improve YOLOv4, so that YOLOv5 is very conducive to object detection and recognition in terms of detection accuracy and computational complexity. Therefore, this paper proposes a method, YOLO-FIR, for infrared objects with the core of YOLOv5 by analyzing the network structure, dividing the data, compressing channels, optimizing parameters, and training and testing the model. As a result, a novel network model, YOLO-FIRI, is designed and implemented to detect small and weak objects quickly in infrared images. The structure of YOLO-FIRI is shown in Figure 1, which mainly includes three parts: a backbone network with a lightweight feature extraction network, a neck network to realize cross-stage feature fusion, and a multiscale detection head.
In Figure 1, the input infrared image changes from 512×512×1 (infrared images have a single channel) to 256×256×4 after the focus operation. Then, in the backbone network, the extended CSP module is used to extract rich information from shallow and deep feature maps after the focus operation, and the attention mechanism introduced in the CSP module guides the assignment of different weights to emphasize the extraction of weak and small features. In addition, the SPP (Spatial Pyramid Pooling) [51] layer concatenates the results obtained in the channel dimension through four pooling windows to solve the alignment problem of anchors and feature maps. We use the SK attention module to enhance the extracted features. Next, in the neck network, PANet (Path Aggregation Network) [52] is used to generate feature pyramids, and the top-down and bottom-up fusion structures are both used to effectively fuse the multiscale features extracted from the backbone network and enhance the detection of objects with different scales. Finally, in the detection head, the four sets of output feature maps are detected, and the anchor boxes are applied on the output feature maps to generate the final output vectors with a class probability score, a confidence score, and a bounding box. Then, according to the NMS [53] postprocessing, the results detected by the four detection layers are screened to obtain the final detection results. The additional set of feature maps can solve the problem of missed and false detections caused by long-distance shooting. The proposed network models demonstrate a qualitative improvement over the latest infrared image object detectors in terms of detection accuracy, inference speed, and network parameters.
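The focus slicing step described above (512×512×1 to 256×256×4) can be sketched as follows. This is a minimal PyTorch illustration of the pixel-sampling operation only; the actual YOLOv5 Focus layer additionally applies a convolution to the stacked result, and the function name here is illustrative:

```python
import torch

def focus_slice(x: torch.Tensor) -> torch.Tensor:
    """Sample every second pixel into four sub-images and stack them on
    the channel axis: (N, C, H, W) -> (N, 4C, H/2, W/2)."""
    return torch.cat([
        x[..., ::2, ::2],    # even rows, even columns
        x[..., 1::2, ::2],   # odd rows, even columns
        x[..., ::2, 1::2],   # even rows, odd columns
        x[..., 1::2, 1::2],  # odd rows, odd columns
    ], dim=1)

# A single-channel 512x512 infrared image becomes 4 channels of 256x256.
x = torch.randn(1, 1, 512, 512)
print(focus_slice(x).shape)  # torch.Size([1, 4, 256, 256])
```

No information is discarded in this step; spatial resolution is traded for channel depth, which makes the subsequent convolutions cheaper.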

B. EXTENDED CSP MODULE
With network layers deepening gradually, the convolutional neural network can better extract the semantic information of high-level features, but the resolution of the high-level feature maps is low. In contrast, the resolution of the feature map in the shallow layer is high, whereas the feature semantic information extracted by the shallow network is weak. For small and weak objects with a few features in infrared images, deep convolution may cause object features to be difficult to extract or even lost. To maximize the extraction of features that are conducive to the detection of weak and small objects in infrared images, it is necessary to make full use of the high-level resolution features of the convolutional neural network in the shallow layer. Thus, in the feature extraction stage, we extend the thickness of the CSP module in the shallow feature extraction process. Through feedback and iteration step-by-step, the object features in the feature maps can be fully extracted to achieve multifeature extraction from shallow layers to deep layers. Moreover, when deepening the CSP modules in the overall feature extraction network by controlling the width and depth factors, we only extend the thickness of the CSP module to extract shallow features. The backbone structure of the entire feature extraction network is shown in Figure 2. In this way, without substantially increasing the size of the network model and the complexity of the algorithm, the ability to extract shallow feature information is enhanced, which is conducive to the detection of weak and small objects in infrared images. In addition, the CSP structure divides the feature maps into two branches to extract features and then merge them, which can achieve a richer gradient combination while reducing the amount of calculations.
In Figure 2, after the focus slicing operation, the Conv and CSP modules are stacked three times, the shallow layer is extended to the same number of CSP module feedback iterations as the deep layer, and feature maps of different sizes are obtained step-by-step. Then, the fine-grained features of shallow information and the deep high-level semantic information are fully extracted; the specific CSP module structure is shown in Figure 3. Conv represents the three operations of standard convolution, normalization, and the activation function, while Conv2d represents standard convolution alone. Through the concatenate operation, the feature maps of the two branches containing the Conv and the attention module (SK Layer) are merged, and the 128×128 features are fully extracted in the shallow layer. Compared with the YOLOv5m model, which adds 108 layers, we only add 18 layers to the network by extending the shallow CSP module. This ensures that the detection speed of the model is not reduced while improving its detection accuracy.
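A minimal PyTorch sketch of the two-branch CSP structure described above, under stated assumptions: the class names, the even channel split, and the SiLU activation are illustrative, and the SK attention layer is left out of the residual units for brevity.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """'Conv' in Figure 3: standard convolution + batch norm + activation."""
    def __init__(self, c_in: int, c_out: int, k: int = 1, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class CSPBlock(nn.Module):
    """Two-branch CSP: one branch runs n stacked residual-style units
    (where the SK layer would sit), the other is a plain 1x1 conv; the
    branches are concatenated and fused. `n` is the iteration count that
    the extended shallow module increases."""
    def __init__(self, c_in: int, c_out: int, n: int = 1):
        super().__init__()
        c_half = c_out // 2
        self.branch1 = ConvBNAct(c_in, c_half)
        self.branch2 = ConvBNAct(c_in, c_half)
        self.blocks = nn.Sequential(*[
            nn.Sequential(ConvBNAct(c_half, c_half, 1),
                          ConvBNAct(c_half, c_half, 3))
            for _ in range(n)
        ])
        self.fuse = ConvBNAct(c_out, c_out)

    def forward(self, x):
        y1 = self.blocks(self.branch1(x))  # feature-extraction branch
        y2 = self.branch2(x)               # shortcut branch
        return self.fuse(torch.cat([y1, y2], dim=1))
```

Because only `n` grows for the shallow module, the extension adds few layers while letting the shallow features be re-extracted iteratively.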

C. IMPROVED SK ATTENTION MODULE
The visual system tends to pay attention to the part of the information that assists with judging the image and ignores the unimportant information [54]. In object detection, an attention mechanism can be introduced in the residual blocks of the shallow feature extraction stage to effectively select object information, and more weights can be assigned to small and weak objects to improve the feature expression ability of small objects for accurate detection. The SKNet [25] network can adaptively adjust the size of the receptive field according to multiple scales of the input information and better extract objects with different sizes and distances. Therefore, we introduce an improved SK attention module in each CSP module and use two convolutional operations with different convolution kernel sizes to learn the channel weights. The output vector then undergoes a 1×1 convolution operation. The corresponding position in the CSP module is the SK layer in Figure 3, and the specific improved SK attention mechanism structure is shown in Figure 4. As shown in Figure 4, after two Conv modules, which include standard convolution, normalization, and activation functions, the improved SK attention module is directly embedded into the residual blocks, and it is mainly divided into three parts: split, fuse, and scale. The split operation separates the input vector by performing the Conv operation with two different kernel sizes, 3×3 and 5×5, to obtain the output vectors U1 and U2, and the vector U is obtained after the addition operation. The fuse stage uses global average pooling (F_gp) to compress the matrix to 1×1×C and uses a channel descriptor to represent the information of each channel. Therefore, the dependency between the channels is established, which can be expressed as (1), and the fully connected layer (F_fc) makes the relationship between the channels flexible and nonlinear.
Here, two fully connected layers are also used to add more nonlinearity, fit the complex correlation between the channels, reduce the number of parameters and calculations as much as possible, and obtain the weight value, which is given by the following equations:

s_c = F_gp(u_c) = (1 / (W × H)) Σ_{i=1}^{W} Σ_{j=1}^{H} u_c(i, j) (1)

F_fc(s) = σ(B(ω s)) (2)

In (1), W and H are the width and height, respectively, and i and j are the i-th row and j-th column of the image, respectively. In (2), ω is the weight, σ is the ReLU activation function [55], and B represents the batch normalization operation.
Scaling is a simple weighting operation. The weight values calculated in the fuse stage are multiplied back onto the original matrix to obtain the final output of the SK blocks, which can strengthen the useful information of weak and small objects in different channelwise scenarios. The matrix vectors are added again and merged to make full use of the shallow and deep layer information. By using a simple and effective fully connected layer, the weight value obtained after the sigmoid activation is directly multiplied by the vector U to obtain the vector V, instead of generating the two weight vectors a and b of the original SKNet to multiply. Thus, the computational complexity is reduced, and the reduction in inference speed caused by additional network layers is avoided.
V = F_scale(U, F_fc) (3)

In (3), F_scale(U, F_fc) is channelwise multiplication, multiplying the feature maps U by the weight value obtained in the F_fc stage and outputting the weighted feature maps.
SK is a lightweight module that can be directly embedded in the network. It has a strong generalization ability, acquiring different receptive field information through an adaptive adjustment structure, which is beneficial to the detection of pedestrians in infrared images. Moreover, systemic improvements can be achieved with a minimal computational burden.
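The split-fuse-scale pipeline of the improved SK module can be sketched in PyTorch as follows. The 3×3/5×5 kernels, the two fully connected layers, and the single sigmoid-weighted output follow the description above, while the class name, reduction ratio, and hidden width are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ImprovedSK(nn.Module):
    """Simplified SK attention as described in the text: split (3x3 and
    5x5 convs), fuse (global average pooling + two FC layers), scale
    (one sigmoid weight multiplied back onto U, instead of the two
    softmax vectors a and b of the original SKNet)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv5 = nn.Conv2d(channels, channels, 5, padding=2, bias=False)
        hidden = max(channels // reduction, 4)
        self.fc = nn.Sequential(           # F_fc: two fully connected layers
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        u = self.conv3(x) + self.conv5(x)           # split + add -> U
        s = u.mean(dim=(2, 3))                      # F_gp: global average pooling
        w = self.fc(s).unsqueeze(-1).unsqueeze(-1)  # channel weights, (N, C, 1, 1)
        return u * w                                # F_scale: V = U * w
```

Replacing the two-vector softmax selection with a single sigmoid weight is the simplification the text argues keeps inference fast.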

D. MULTISCALE FEATURE DETECTION
The YOLOv5 network uses three scales of output feature maps to detect objects with different sizes and uses the 8× downsampled output feature maps to detect small objects. The objects in the KAIST [26] dataset are small and weak; therefore, we add a feature scale to focus on smaller objects. When the feature maps are upsampled to the size of 64×64, we continue to upsample them to obtain 4× downsampled feature maps. At the same time, the expanded 128×128 feature maps are fused with the same-size feature maps of the second layer in the backbone network to make full use of the shallow and deep features. After multiscale fusion, the four feature scales are 128×128, 64×64, 32×32, and 16×16, as shown in Figure 5. The 32×32 marked in the grid division represents the size of each grid. The nine anchors with three detection scales are increased to twelve anchors with four detection scales. YOLOv5 can adaptively calculate suitable anchors according to different datasets, making it easier for the model to converge and predict objects with different scales. In Figure 5, the left part is the prediction of the model, and the four detection layers (P2-P5) predict the values, i.e., the central point tx and ty, the width tw and height th, and the confidence score. The right part is the ground truth of the objects, and the network obtains the label information of the input images. Then, the loss between the prediction value and the ground truth is established to calculate the loss of each detection layer. Through the feedback of the loss, the model gradually optimizes its performance and completes the training.
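The four feature scales follow directly from the downsampling strides, assuming strides of 4, 8, 16, and 32 for the P2-P5 layers on a 512×512 input:

```python
# Feature-map sizes of the four detection layers (P2-P5) for a 512x512
# input at downsampling strides 4, 8, 16, and 32 (assumed strides).
input_size = 512
strides = [4, 8, 16, 32]
scales = [input_size // s for s in strides]
print(scales)  # [128, 64, 32, 16]
```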
The loss calculation method for each detection layer is the same: the sum of the bounding box regression loss, class loss, and confidence loss. For example, the loss of the P2 detection layer is given by:

loss_P2 = loss_CIoU + loss_cls + loss_obj (4)

Here, the bounding box loss (loss_CIoU) uses CIoU, the class loss is calculated through BCE (Binary Cross Entropy) loss, and the confidence loss is realized by BCE with logits loss for numerical stability.
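Eq. (4) can be sketched per detection layer as follows. This is a simplified illustration that assumes the CIoU values and the targets have already been computed and matched, and it omits the per-term weighting a full implementation would typically use:

```python
import torch
import torch.nn as nn

# BCEWithLogitsLoss combines sigmoid + BCE for numerical stability,
# matching the "BCE with logits" confidence loss described in the text.
bce_logits = nn.BCEWithLogitsLoss()

def layer_loss(ciou, cls_logits, cls_targets, obj_logits, obj_targets):
    """Eq. (4) for one detection layer: box regression (1 - CIoU),
    class BCE, and objectness BCE. `ciou` is assumed to hold the
    precomputed CIoU between matched predictions and ground truth."""
    loss_ciou = (1.0 - ciou).mean()
    loss_cls = bce_logits(cls_logits, cls_targets)
    loss_obj = bce_logits(obj_logits, obj_targets)
    return loss_ciou + loss_cls + loss_obj
```

The total training loss is then the sum of this quantity over the four detection layers P2-P5.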

IV. EXPERIMENT ANALYSIS
To test the performance of the infrared image object detection models YOLO-FIR and YOLO-FIRI proposed in this paper, we use the public KAIST and FLIR infrared datasets. First, the latest detection algorithms and the detection models proposed in this paper are compared in terms of detection accuracy, speed, computational complexity, parameters, etc. Second, an ablation experiment is carried out on the improved YOLO-FIRI model to test the performance brought by the different improvements. Third, the KAIST dataset provides well-aligned visible and infrared image pairs, so the Densefuse deep neural network is used to fuse the visible and infrared images, enhance the characteristics of the objects, and generate a fused infrared image dataset. Then, YOLO-FIR and YOLO-FIRI are tested on the fused dataset. Finally, the FLIR dataset is also used to further test the performance of the proposed models.

A. COMPARISON OF THE DETECTION PERFORMANCE
The KAIST infrared pedestrian dataset is a classic public object detection dataset evaluated by most infrared image object detection algorithms. It contains large-scale, accurate manual annotations and well-aligned visible and infrared image pairs, with a total of 95,328 pairs of images (640×512 resolution), including various conventional traffic scenes on campus, street, and countryside. The dataset labels include two pedestrian object categories: person and people. Individuals who are easy to distinguish are labeled as person, and groups of individuals who are not easy to distinguish are labeled as people. C. Li et al. show that these data are sufficient to achieve successful model training and performance evaluation, while also achieving a high detection accuracy. Table 1 compares the state-of-the-art infrared image object detection algorithms and the proposed YOLO-FIR and YOLO-FIRI in terms of the various evaluation indicators. Here, the size of the input image is 512×512, the training epochs are set to 300, the batch size is 16, the initial learning rate is 0.001 with a decay of 0.01 every 5 epochs, the IoU threshold is set to 0.20, and the momentum and weight decay are 0.937 and 0.0005, respectively. We set mosaic to 1 to use the mosaic data enhancement algorithm to expand the diversity of object samples. The training of all experiments is based on PyTorch and carried out on a GeForce GTX 1660 GPU.
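The training settings above can be collected into a single configuration sketch; the key names here are illustrative and do not correspond to the actual YOLOv5 hyperparameter file keys:

```python
# Training configuration reported in the text; key names are illustrative.
train_cfg = {
    "img_size": 512,          # input image size
    "epochs": 300,
    "batch_size": 16,
    "lr0": 0.001,             # initial learning rate
    "lr_decay": 0.01,         # decay applied every 5 epochs
    "iou_threshold": 0.20,
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "mosaic": 1,              # mosaic data augmentation enabled
}
```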
In Table 1, YOLO-FIR and YOLO-FIRI are the proposed infrared image object detection models based on YOLOv5. Compared with YOLOv3, YOLOv4, and YOLO-ACN, YOLO-FIR and YOLO-FIRI show great improvements in various indicators. In particular, YOLO-FIRI is 24.8%, 26.9%, and 26.0% higher than the latest classic one-stage object detection algorithm YOLOv4 in the detection accuracy indicators AP, AR, and F1, and the mAP50 is also improved by approximately 21.4%. Compared with YOLO-FIR, the AP, AR, and mAP50 improve by 3.9%, 8.1%, and 13.2%, respectively. The object features in infrared images are weak and small. If the IoU threshold is set too high, it is more unfavorable for detecting objects in infrared images. In contrast, setting the highest IoU value to 0.75 is more common and suitable for actual applications. We use mAP50:75 to calculate the average detection accuracy under six different IoU thresholds. Table 1 shows that the mAP50:75 of YOLO-FIRI is approximately 62.3% higher than that of YOLOv4, and the detection time and calculation amount decrease by approximately 62.2% and 84.1%, respectively, which greatly improves the real-time processing capability and reduces the hardware calculation requirements. In terms of the number of parameters, YOLO-FIRI directly reduces from tens of millions to millions. The corresponding weight file size also reduces from more than 200 MB for YOLOv3 and YOLOv4 to 15 MB for YOLO-FIRI, which is more conducive to model transplantation for mobile devices. This is mainly because the YOLO-FIR and YOLO-FIRI models use CSPDarknet as the backbone and perform channel compression as well as parameter optimization. As a result, the models avoid problems such as the repeated gradient information that occurs during network optimization in the backbones of other large-scale convolutional neural network frameworks.
The gradient changes are integrated into the feature map from beginning to end, thus reducing the number of parameters and flop value of the model, which not only ensures the inference speed and accuracy, but also reduces the model size.
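The cross-stage-partial idea described above, splitting the feature map so that only part of it passes through the residual stack and then concatenating the two paths, can be sketched roughly as follows in PyTorch. This is a generic CSP block in the YOLOv5 style, not the authors' exact YOLO-FIR implementation; all module names and channel splits are illustrative.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """Convolution + batch norm + SiLU, the basic YOLOv5-style unit."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Residual bottleneck used inside the CSP branch."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNAct(c, c, 1)
        self.cv2 = ConvBNAct(c, c, 3)
    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class CSPBlock(nn.Module):
    """Cross-stage-partial block: one half of the channels goes through n
    bottlenecks, the other half bypasses them, and both halves are
    concatenated. Splitting the gradient path this way is what reduces
    the duplicated gradient information mentioned in the text."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_half = c_out // 2
        self.split_main = ConvBNAct(c_in, c_half, 1)
        self.split_bypass = ConvBNAct(c_in, c_half, 1)
        self.blocks = nn.Sequential(*[Bottleneck(c_half) for _ in range(n)])
        self.fuse = ConvBNAct(2 * c_half, c_out, 1)
    def forward(self, x):
        return self.fuse(torch.cat([self.blocks(self.split_main(x)),
                                    self.split_bypass(x)], dim=1))
```

Stacking several such blocks with increasing `n` in the shallow stages corresponds to the "expanded and iterated" shallow CSP modules discussed earlier.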
YOLO-FIRI is an improved infrared image object detection framework for weak and small objects in infrared images. Compared with YOLO-FIR, with the detection time, computational complexity, and parameters essentially unchanged, the detection accuracy indicators AP, AR, mAP, and F1 improve by approximately 4.2%, 9.1%, 5.6%, and 6.8%, respectively, and mAP50:75 in particular by approximately 13.2%. This indicates that detection accuracy progresses significantly once the detection of weak and small objects in infrared images is improved. To further compare the accuracy differences between the two classes, we test each accuracy evaluation index on the KAIST dataset with YOLO-FIR and YOLO-FIRI; the results are shown in Table 2. In the detection of the two classes, AP and AR increase by 4.6%-8.1%. Moreover, the mAP50 for person and people reaches 98.8% and 99.0%, respectively, and the mAP50:75 for the people class increases by 13.6%.
To study how the different detection models change during training, Figure 6 shows the training accuracy of YOLOv3, YOLOv4, YOLO-FIR, and the different improved methods based on YOLO-FIR. The left image is the mean average precision of the two classes at an IoU of 0.5 (mAP50), and the right image is the mean average precision over six different IoU thresholds from 0.5 to 0.75 (mAP50:75). Because YOLOv4 improves the feature extraction network and data enhancement technology of YOLOv3, its detection accuracy is slightly better than that of YOLOv3. However, the YOLO-FIR and YOLO-FIRI detection methods proposed in this paper perform much better than YOLOv4, both in mAP50 on the left and mAP50:75 on the right, and require relatively fewer training epochs to reach a stable state. For YOLOv3 and YOLOv4, mAP50 approaches 80% and mAP50:75 approaches 60%, whereas the improved YOLO-FIRI approaches 98% and 95%, respectively, which greatly improves the average detection accuracy. In addition, as the epochs increase, the detection performance of YOLOv3 and YOLOv4 does not always trend upward: if training is continued, overfitting occurs [58], causing a slight decrease in detection performance. However, YOLO-FIR and YOLO-FIRI adopt a variety of CSP structures to extract and fuse features, so their detection performance remains steady as the number of training epochs increases. Loss plays an important role in the training process, reflecting the relationship between the true value and the predicted value: the smaller the loss, the closer the prediction to the true value, and the better the performance of the model. Figure 7 shows the loss convergence of YOLOv3, YOLOv4, YOLO-FIR, and different improved methods based on YOLO-FIR.
From Figure 7, we can see that the bounding box loss, class loss, and object loss on both the training and validation sets fall and eventually stabilize. In both training and validation, the losses of the improved YOLO-FIRI are considerably lower than those of YOLOv4. Compared with the proposed YOLO-FIR, although the curves are close, the bounding box regression loss on the validation set still decreases slightly. At 300 epochs, the bounding box regression loss of YOLOv4 is 0.037, whereas that of YOLO-FIRI is 0.012, which means that the proposed method can significantly accelerate the network's training and converge to a lower loss when optimizing the neural network.
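The bounding box regression loss discussed above is IoU-based. A minimal sketch of the underlying IoU term, in plain Python with boxes given as (x1, y1, x2, y2); note that YOLOv5 actually uses the CIoU variant, which adds center-distance and aspect-ratio penalty terms not shown here:

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_loss(pred, target):
    """Basic IoU-based box regression loss: 1 - IoU.
    Sketch only; CIoU adds distance and aspect-ratio penalties."""
    return 1.0 - box_iou(pred, target)
```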
Examples of the test results of the YOLOv4 and YOLO-FIRI models are shown in Figure 8. In terms of the number of detected objects, YOLO-FIRI detects one or two more pedestrians than YOLOv4 in each image. As shown in (a2) and (b2), YOLO-FIRI solves YOLOv4's missed detections by effectively extracting features for possibly occluded parallel pedestrians; for the long-distance pedestrians detected in (c2) and (d2), YOLO-FIRI detects small objects through multiscale detection. Although objects in infrared images are difficult to distinguish from the background, the improved model can still detect pedestrians at different distances by enhancing shallow features, fusing multiple features, and improving multiscale detection.

B. ABLATION STUDY
To see more intuitively how each of the different improvements affects model performance, we conduct an ablation experiment. Specifically, keeping the structure of YOLOv5s unchanged and introducing only the extended CSP module, we observe the impact on performance. We then add the improved SK attention module and multiscale feature detection in turn to observe the experimental results and analyze their influence.
Our ablation experiment also trains for 300 epochs. When the training result stabilizes, training is completed and the model is tested on the KAIST test set; the indicators are shown in Table 3. By introducing the extended CSP module, the improved SK attention module, and the added detection head, the accuracy indicators of object detection improve accordingly. When these improvements are integrated and tested as the final network model, the tested indicators show better detection accuracy than the three methods introduced separately. The corresponding mAP50 increases by 2.8%, and mAP50:75 achieves a maximum increase of 10.2%.
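The SK (selective-kernel) attention module ablated above can be sketched in its standard form: two branches with different receptive fields are summed, squeezed by global pooling, and a per-channel softmax over the branches selects between them, which is what lets the network focus on the object and suppress the background. This is a generic SK sketch under assumed settings (two branches, dilation in place of a true 5×5 kernel, reduction 4), not the paper's exact improved module.

```python
import torch
import torch.nn as nn

class SKAttention(nn.Module):
    """Selective-kernel attention (generic sketch, not the paper's exact
    variant): channel-wise softmax weights choose between a 3x3 branch and
    a dilated branch with a 5x5-equivalent receptive field."""
    def __init__(self, c, reduction=4):
        super().__init__()
        self.branch3 = nn.Conv2d(c, c, 3, padding=1)
        self.branch5 = nn.Conv2d(c, c, 3, padding=2, dilation=2)
        d = max(c // reduction, 8)
        self.squeeze = nn.Linear(c, d)       # fuse: global descriptor -> d
        self.expand = nn.Linear(d, 2 * c)    # one weight vector per branch
    def forward(self, x):
        b, c, _, _ = x.shape
        u3, u5 = self.branch3(x), self.branch5(x)
        s = (u3 + u5).mean(dim=(2, 3))                   # global average pool
        z = torch.relu(self.squeeze(s))
        w = self.expand(z).view(b, 2, c).softmax(dim=1)  # softmax over branches
        w3 = w[:, 0].view(b, c, 1, 1)
        w5 = w[:, 1].view(b, c, 1, 1)
        return w3 * u3 + w5 * u5                          # select: weighted sum
```

Because the output keeps the input's shape, such a module can be dropped into a residual block without changing the surrounding channel arithmetic.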

C. INFRARED OBJECT DETECTION ON THE FUSED KAIST
In the field of image analysis, the quality of the image directly affects the design of the algorithm and the accuracy of detection. Compared with visible images, infrared images have lower resolution and blurred visual effects. Current deep-learning-based object detection models perform well on high-quality imagery, but when applied to weak and small objects in infrared images, their performance drops greatly. Therefore, to improve the detection of weak and small objects, data enhancement has proven to be an effective way to address the challenges of object detection tasks, for example Cutout [59], CutMix [60], KeepAugment [61], and other data enhancement methods that improve image resolution [17], [62]. Image fusion can serve as a data enhancement method that fuses several different types of source images. Compared with a single image, a fused image offers a substantial improvement in quality and clarity. The KAIST dataset provides infrared and visible image pairs of the same scene; given the lack of publicly available infrared datasets, the generated fusion dataset also adds data diversity. Visible images reflect the spectral properties of objects, contain more detailed information, and better match the visual characteristics of the human eye, while the thermal radiation characteristics of infrared images are more sensitive to objects and areas and avoid interference caused by scene changes. Infrared and visible images are therefore complementary, and the image fusion method can combine them into a more informative infrared image to achieve the purpose of data enhancement.
Figure 9 shows the DenseFuse [27] framework for fusing infrared and visible images, which mainly includes three parts: an encoder, a fusion layer, and a decoder. The encoder mainly includes two kinds of layers, C1 and a dense layer. C1 contains a 3×3 convolution kernel to extract rough features, and the dense layer contains three convolutional layers to extract deep high-level features. The encoder's convolution kernel size and stride are 3×3 and 1, respectively, so it can accept images of any size, while the dense layer retains depth features as much as possible in the encoding network. The fusion layer uses an additive strategy to fuse the infrared and visible image features extracted by the encoder, which can be written as:

F_m(x, y) = λVis_m(x, y) + (1 − λ)Ir_m(x, y) (5)

In (5), Vis_m(x, y) represents the visible image of the m-th channel, Ir_m(x, y) represents the infrared image of the m-th channel, F_m(x, y) indicates the fusion result of the m-th channel, and λ is the weighting coefficient. The output of the fusion layer enters the decoder, which contains four 3×3 convolutional layers to reconstruct the fused image. The loss function of the network, H(x, y), is a weighted sum of the structural similarity loss H_SSIM(x, y) and the pixel loss H_P(x, y).
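The additive fusion strategy of the fusion layer can be sketched as a per-channel weighted addition in NumPy. The exact weighting form (λ against 1 − λ) is an assumption read from the text's "weighting coefficient λ"; the original equation is not fully legible in the source.

```python
import numpy as np

def fuse_features(vis_feat, ir_feat, lam=0.5):
    """Additive fusion of encoder feature maps, per channel:
    F_m(x, y) = lam * Vis_m(x, y) + (1 - lam) * Ir_m(x, y).
    The lam/(1 - lam) split is an assumed reading of the weighting
    coefficient; `lam` balances visible detail against IR saliency."""
    assert vis_feat.shape == ir_feat.shape
    return lam * vis_feat + (1.0 - lam) * ir_feat
```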
H(x, y) = γH_SSIM(x, y) + H_P(x, y) = γ(1 − SSIM(Out(x, y), In(x, y))) + ‖Out(x, y) − In(x, y)‖₂ (6)

In (6), Out(x, y) and In(x, y) represent the output image and the input image, respectively. H_P(x, y) is the Euclidean distance between Out(x, y) and In(x, y), and SSIM(·) is the structural similarity. Considering that there is a difference of three orders of magnitude between the pixel loss and the SSIM loss, we set γ to 100 here.
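The combined loss in Eq. (6) can be sketched in PyTorch as follows. To keep the example short, SSIM is computed once over the whole image rather than with the sliding window used in practice, so this is a simplified stand-in, not DenseFuse's exact loss.

```python
import torch

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified single-window SSIM over the whole image.
    Real SSIM averages a windowed version; this keeps the sketch short."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x = x.var(unbiased=False)
    var_y = y.var(unbiased=False)
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def fusion_loss(out, inp, gamma=100.0):
    """H = gamma * H_SSIM + H_P, per Eq. (6): a structural-similarity term
    plus the Euclidean (L2) pixel distance, with gamma offsetting the
    magnitude gap between the two terms."""
    h_ssim = 1.0 - ssim_global(out, inp)
    h_pixel = torch.norm(out - inp)   # Euclidean distance between images
    return gamma * h_ssim + h_pixel
```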
In this experiment, we use the DenseFuse network shown in Figure 9 to obtain the fused KAIST dataset. Figure 10 compares infrared images, visible images, and fused images from KAIST; the detailed information of some objects is marked with red boxes for highlighting. Compared with the infrared image (a1), the fused image (c1) has a clearer body posture and contour, which expands the data features so that the object features are more obvious and easier to extract. Compared with the visible image (b2), the fused image (c2) avoids the influence of the occluding red bus in the background, making the object easier to recognize instead of being filtered out as background. Compared with the infrared image (a3) and the visible image (b3), the number of objects and the human characteristics in the fused image (c3) are clearer.

(Figure caption: Some detection results on the infrared and fused KAIST datasets. We use the proposed YOLO-FIRI to test the two kinds of datasets. The first row shows images from the infrared KAIST dataset and the second row shows images from the fused KAIST dataset. Although YOLO-FIRI achieves good detection results on the infrared KAIST dataset, the fused dataset obtained through data enhancement further improves the performance of the proposed models.)
For the visible, infrared, and fused infrared datasets, the proposed network models are trained separately. Table 4 compares the test results of visible, infrared, and fused infrared images on YOLO-FIR and the improved network model YOLO-FIRI. On the YOLO-FIR detection model, when the fused dataset is used as input, both AP and mAP50 improve, with mAP50 improving by 0.7% compared with a single infrared image. On YOLO-FIRI, the fused infrared dataset achieves the best results, and the mAP increases to 98.5%. The KAIST dataset has serious occlusion problems in its visible images, but the infrared dataset effectively alleviates this problem. Therefore, on the YOLO-FIRI models, the fused infrared dataset performs better than either the visible or the infrared dataset.
After training, the models are further tested on the two different types of datasets. Figure 11 shows some randomly selected images from the infrared and fused datasets. In Column (a), the fused image enables the extraction of pedestrians occluded due to the shooting angle; in Column (b), pedestrians at the image edge are also detected more accurately after fusion. In the densely crowded images of Columns (c) and (d), the number of detected pedestrians increases from two and six to three and seven, respectively. The fused test set thus yields more detected pedestrians than the infrared test set, and the overall accuracy improves.

D. INFRARED OBJECT DETECTION ON THE FLIR DATASET
On the KAIST dataset, we compare different models of the YOLO series. To better test the detection performance of the proposed models on infrared images, we test YOLO-FIRI and compare the different detection approaches on the FLIR dataset [28]. The FLIR dataset provides both visible and infrared images, but the pairs are not well aligned; some infrared images have no corresponding visible images, and only the infrared images are labeled. Therefore, we simply train on the infrared images and do not need to adapt a detector pretrained on RGB images. The training set and test set contain 8,862 and 1,366 images, respectively, at 640×512 resolution, covering a total of three classes. In the experiments, we use the train and test splits as provided in the dataset benchmark, and the experimental settings remain the same as for the KAIST dataset. As shown in Table 5, we compare the per-class AP and the mAP of the different detectors, presented as percentages. In [40], when only Faster R-CNN was used as the detector, the mAP was 53.9%. However, MMTOD-CG used Faster R-CNN as a baseline and combined a pretrained detector, which increased the mAP by 7%. ThermalDet [48] was proposed based on RefineDet [63]; it takes the features of each layer into account for the final detection, and the accuracy increased to 74.6%. The proposed methods YOLO-FIR and YOLO-FIRI in this paper are based on the state-of-the-art detector YOLOv5. Under the premise of ensuring speed, accuracy is further improved by fully using shallow-layer features, attending to important features, and improving the detection head. The values of AP and mAP all outperform those in previous work, and our YOLO-FIRI further reaches 83.5%.
In other words, the region-free framework YOLO-FIRI can learn more features from infrared images, and the problems of false detections and long-distance missed detections are improved.

V. CONCLUSION
To overcome the drawbacks of infrared image object detection, in this paper we propose a one-stage, region-free object detector, YOLO-FIR, for infrared images, based on YOLOv5 and adapted to infrared imagery. Combining the characteristics of infrared images, we further propose an improved YOLO-FIRI based on YOLO-FIR. Precisely, we extend and iterate the shallow CSP module of the feature extraction network and add an improved attention module to the residual blocks to maximize the use of shallow features, forcing the network to learn robust and distinguishable features. Additionally, the network detection head is improved, multiscale object detection layers are added, and the detection accuracy for small infrared objects is improved. Compared with YOLOv4, YOLO-FIRI makes a qualitative leap in various indicators: its mAP increases by approximately 37% on the infrared images of KAIST, the detection time is reduced by approximately 62%, the network parameters are reduced by more than 89%, and the weight size is reduced by more than 93%. Compared with YOLO-FIR, the mAP of YOLO-FIRI reaches 98.3%, an increase of approximately 13% on KAIST. The AP for the bicycle class of YOLO-FIRI on FLIR also reaches 85%, an increase of 15%. Our model's state-of-the-art performance can be attributed to the combination of learned shallow features and attention features, which allows it to detect infrared objects despite their low resolution and unclear features. Because the KAIST dataset provides well-aligned visible and infrared images, we show that data enhancement can be realized, further improving the detection accuracy of infrared images, by using a convolutional neural network in image preprocessing to fuse visible and infrared images.
In this paper, we mainly focus on single infrared images, which are still and cluttered. It would be interesting to use infrared video for object detection, because video sequences have a strong correlation between consecutive frames; the detection performance on infrared video could therefore be even better.