Information Fusion

Volume 63, November 2020, Pages 166-187
Object fusion tracking based on visible and infrared images: A comprehensive review

https://doi.org/10.1016/j.inffus.2020.05.002

Highlights

  • A review of fusion tracking methods via visible and infrared images is presented.

  • Main RGB-infrared trackers are summarized and categorized into several groups.

  • Public RGB-infrared datasets are summarized and compared.

  • Main results on public datasets are summarized and analyzed in detail.

  • Future prospects of RGB-infrared fusion tracking are discussed and suggested.

Abstract

Visual object tracking has attracted widespread interest recently. Because visible and infrared images provide complementary information, fusion tracking based on both modalities can boost tracking performance under adverse and challenging conditions. RGB-infrared fusion tracking has therefore become an active research topic, and various algorithms have been proposed in recent years. In this paper, we present a review of RGB-infrared fusion tracking. We summarize all major RGB-infrared trackers in the literature and categorize them into several major groups for better understanding. We also discuss the development of RGB-infrared datasets and analyze the main results on public datasets. We observe that deep learning-based methods achieve state-of-the-art performance, while graph-based and correlation filter-based methods give slightly worse but still competitive results. Finally, we give some suggestions on future research directions for fusion tracking based on our observations. This review can serve as a reference for researchers in RGB-infrared fusion tracking, image fusion, and related fields.

Introduction

Visual object tracking has received significant attention in recent years due to its wide applications in many areas, such as robotics [1], autonomous vehicles [2], human-computer interface [3] and video surveillance [4]. According to the type of images used, it can be roughly classified into tracking based on visible images, tracking based on infrared images, and RGB-infrared fusion tracking. Among these, tracking based on visible images is the most popular. Note that in this paper we do not distinguish between visible and RGB (Red-Green-Blue) images, although visible images also include gray-scale images.

Currently, the two main kinds of methods in visual object tracking are deep learning (DL)-based methods [5] and correlation filter (CF)-based approaches [6]. Deep learning-based tracking methods mainly exploit the strong representation ability of deep networks to extract better features than handcrafted ones, and can therefore achieve good tracking results in many cases. Here, handcrafted features are those designed manually, such as the histogram of oriented gradients (HOG) [7] and the scale-invariant feature transform (SIFT) [8]. Early deep learning-based methods suffered severely from slow speed [9]. However, with the application of fully convolutional Siamese networks to tracking [5], recent deep learning-based trackers achieve high tracking performance while maintaining real-time speeds [10], [11], [12]. In CF-based tracking algorithms, the model can be updated in real time because the correlation operation can be implemented efficiently via the Fast Fourier Transform (FFT). Therefore, CF-based methods using shallow features can run in real time during tracking. However, some recent CF-based trackers use raw deep convolutional features of high dimensionality [13], [14], [15]. These trackers run more slowly because the computational cost of the correlation filters increases with the feature dimensionality [16].
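
To make the FFT trick concrete, the following is a minimal sketch of a single-channel MOSSE-style correlation filter, which underlies many CF-based trackers. It is an illustration only, not the implementation of any specific tracker reviewed here; the function names and the regularization value are our own choices.

```python
import numpy as np

def gaussian_response(shape, sigma=2.0):
    """Desired correlation output: a Gaussian peaked at the patch centre."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def train_filter(patch, response, reg=1e-2):
    """Closed-form MOSSE-style filter in the Fourier domain.

    H* = (G . conj(F)) / (F . conj(F) + reg), where F and G are the FFTs
    of the training patch and the desired response, and reg avoids
    division by near-zero spectral energy.
    """
    F = np.fft.fft2(patch)
    G = np.fft.fft2(response)
    return (G * np.conj(F)) / (F * np.conj(F) + reg)

def locate(h_conj, search_patch):
    """Correlate the learned filter with a search patch; return the peak."""
    S = np.fft.fft2(search_patch)
    resp = np.real(np.fft.ifft2(S * h_conj))
    return np.unravel_index(np.argmax(resp), resp.shape)
```

In practice the patches are pre-processed (e.g. with a cosine window), the filter is updated online with a running average, and RGB-infrared CF trackers additionally learn or fuse filters across the two modalities.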

However, due to the limitations of the imaging mechanism, tracking algorithms based on visible images may fail in certain circumstances, for example when the illumination conditions are poor or change significantly. Infrared images capture the thermal radiation of objects and are insensitive to these factors, so they can provide complementary information to visible images, as shown in Fig. 1a. In recent years, researchers have also explored object tracking with infrared images [18], [19], [20], [21]. However, infrared images typically have low resolution and poor texture, and are also unreliable in certain conditions, as shown in Fig. 1b. Therefore, more researchers have begun to investigate object tracking methods based on the fusion of visible and infrared images to overcome the inherent shortcomings of methods based on single-modal images. By fusing complementary information from visible and infrared images, the robustness of tracking algorithms can be greatly enhanced. As a result, object tracking based on RGB and infrared images has become a hot research topic in recent years, and a growing number of studies have been published in high-quality journals and well-known conferences [22], [23], [24], [25], [26], [27], [28], [29], [30]. As a consequence, the well-known Visual Object Tracking (VOT) challenge started a new RGB-infrared subchallenge in 2019, aiming to attract researchers to evaluate their methods on the provided video sequences. Note that tracking based on visible and infrared images has not had a consistent name since its emergence. Many researchers used fusion tracking [31], [32] or tracking by fusion [33], [34], [35], and it was not until 2017 that some researchers started using RGBT tracking [27]. In this review, we refer to object tracking based on the fusion of visible and infrared images as RGB-infrared fusion tracking, because we believe this name covers this family of methods better and is thus more suitable for a comprehensive review. Moreover, this name emphasizes the importance of fusion in these methods.

Research on RGB-infrared fusion tracking began in the 2000s, as indicated by the development timeline of this field in Fig. 2. RGB-infrared fusion tracking can be categorized in different ways. According to the primary modality used during fusion tracking, there are infrared-assisted RGB tracking and RGB-assisted infrared tracking. In infrared-assisted RGB tracking, the visible image is the primary modality, and infrared images are employed to assist RGB tracking, especially when the visible images are not reliable [36], [37]; in these works, the evaluation metrics are computed based on the ground truth of the visible images. In contrast, in RGB-assisted infrared tracking, the infrared image is the primary modality and all evaluation metrics are computed based on the infrared ground truth [38]. In this paper, we broadly divide RGB-infrared fusion tracking methods into five categories according to their underlying theories, namely traditional methods and sparse representation (SR)-based, graph-based, correlation filter-based and deep learning-based approaches. It is well known that an effective and robust feature representation is crucial for tracking algorithms. Before sparse representation-, graph-, correlation filter- and deep learning-based methods, researchers performed fusion tracking with traditional techniques such as mean shift, Camshift, the Kalman filter and the particle filter; these traditional methods use handcrafted features to represent the target. Sparse representation-based methods rely on representing the target as a linear combination of bases in an overcomplete dictionary. Graph-based approaches first divide the bounding box around the target into non-overlapping patches, and then model the relationships among these patches to derive a feature representation of the target. CF-based trackers learn correlation filters efficiently online to adapt to appearance variations of the target. Deep learning-based methods leverage the strong representation ability of deep neural networks to learn robust feature representations of the target from large amounts of images. In all these methods, a key to achieving good fusion tracking performance is the effective combination of visible and infrared features, as sketched below.
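
As a simple illustration of such a combination, the following sketch fuses per-modality features with adaptive reliability weights. It is a generic example under our own assumptions (the weighting scheme and function names are hypothetical), not the formulation of any particular tracker discussed in this review.

```python
import numpy as np

def fuse_features(feat_rgb, feat_ir, rel_rgb, rel_ir, eps=1e-8):
    """Adaptively weighted feature-level fusion (illustrative sketch).

    feat_rgb, feat_ir: feature vectors extracted from each modality.
    rel_rgb, rel_ir:   non-negative reliability scores per modality, e.g.
                       derived from the quality of each modality's tracking
                       response in the previous frame (hypothetical scheme).
    """
    # Normalise reliabilities into weights that sum to one.
    w_rgb = rel_rgb / (rel_rgb + rel_ir + eps)
    w_ir = 1.0 - w_rgb
    return w_rgb * feat_rgb + w_ir * feat_ir

# Example: the infrared modality is judged twice as reliable (e.g. at night),
# so it dominates the fused representation.
fused = fuse_features(np.ones(4), np.zeros(4), rel_rgb=1.0, rel_ir=2.0)
```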

As can be seen from Fig. 2, RGB-infrared fusion tracking is developing very fast. However, to the best of our knowledge, the literature lacks a review of RGB-infrared fusion tracking that compares and evaluates the performance of these different techniques. This paper tries to fill this gap. The main contributions of this review are as follows. First, to the best of our knowledge, this is the first review of RGB-infrared fusion tracking. It systematically investigates RGB-infrared fusion tracking methods, benchmark datasets, and evaluation metrics. The main RGB-infrared tracking algorithms are grouped into several types according to their underlying theories, and each type is introduced in detail, including its main principles, representative methods, and pros and cons. Second, the main results on public datasets are presented and analyzed to provide an objective comparison of existing approaches. Third, based on this systematic review and the performance comparison of different trackers, we give detailed discussions of future prospects and suggest promising research directions for this field.

The structure of this review is schematically illustrated in Fig. 3. Section 2 gives some background information. In Section 3, RGB-infrared fusion tracking methods are discussed in detail, including key implementation points and different fusion levels. In Section 4, we summarize the development of RGB-infrared datasets. Section 5 introduces the evaluation metrics. Section 6 presents experimental results and analyzes the performances. Section 7 discusses future prospects. Finally, Section 8 concludes the paper.

Section snippets

Related work

This section discusses related work that is helpful for understanding and performing RGB-infrared fusion tracking.

RGB-infrared fusion tracking

In recent years, many RGB-infrared fusion tracking algorithms have been proposed; some examples are listed in Table 1. In this section, we first discuss the key points for achieving good fusion tracking performance. We then introduce the fusion levels in fusion tracking: according to the stage at which the images are fused, methods can be divided into pixel-level, feature-level and decision-level fusion tracking (sketched below). Finally, we give a comprehensive survey of RGB-infrared fusion tracking methods. These methods …
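
To clarify the three fusion levels, the following schematic contrasts them on a single spatially aligned frame pair. The feature extractor and tracker are deliberately trivial placeholders of our own invention, since the point here is only where the fusion happens in the pipeline.

```python
import numpy as np

def extract_features(img):
    # Placeholder feature extractor; real trackers use e.g. HOG or CNN features.
    return img.ravel().astype(np.float64)

def run_tracker(features):
    # Placeholder "tracker" returning a dummy (x, y, w, h) box; a real system
    # would run a CF- or DL-based tracker on these features.
    return (0.0, 0.0, 32.0, 32.0)

def pixel_level(rgb, ir, alpha=0.5):
    # Fuse the raw (aligned, single-channel) images first, then track.
    fused_img = alpha * rgb + (1.0 - alpha) * ir
    return run_tracker(extract_features(fused_img))

def feature_level(rgb, ir):
    # Extract features per modality, fuse the features, then track.
    fused_feat = np.concatenate([extract_features(rgb), extract_features(ir)])
    return run_tracker(fused_feat)

def decision_level(rgb, ir, w_rgb=0.5, w_ir=0.5):
    # Track each modality independently, then fuse the two outputs.
    box_rgb = run_tracker(extract_features(rgb))
    box_ir = run_tracker(extract_features(ir))
    return tuple(w_rgb * a + w_ir * b for a, b in zip(box_rgb, box_ir))
```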

Available RGB-infrared datasets

Large-scale datasets are of vital importance in RGB-infrared fusion tracking: they are not only beneficial for training algorithms, but also crucial for testing algorithms and comparing performance across trackers. Before large-scale datasets became available, the experiments in most RGB-infrared fusion tracking publications employed only a few visible-infrared video pairs, or even a single video pair, to verify the algorithm. For example, the OTCBVS dataset [112], which …

Evaluation metrics

In recent years, several well-recognized evaluation metrics have been proposed to evaluate tracking performance on visible images, including precision rate (PR), success rate (SR), accuracy, robustness, and expected average overlap (EAO). These metrics can also be applied to RGB-infrared fusion tracking.
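
As a reference point for the two most common metrics, the following sketch computes PR and SR from per-frame predicted and ground-truth boxes. It follows the usual conventions we are aware of (PR with a 20-pixel centre-error threshold; SR here at a fixed overlap threshold, though it is often reported as the area under the success plot); the helper names are our own.

```python
import numpy as np

def center_error(box_a, box_b):
    """Euclidean distance between the centres of two (x, y, w, h) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx, by = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    return np.hypot(ax - bx, ay - by)

def overlap(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    y2 = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def precision_rate(pred, gt, threshold=20.0):
    """Fraction of frames whose centre error is within `threshold` pixels."""
    return float(np.mean([center_error(p, g) <= threshold
                          for p, g in zip(pred, gt)]))

def success_rate(pred, gt, threshold=0.5):
    """Fraction of frames whose IoU exceeds `threshold`."""
    return float(np.mean([overlap(p, g) > threshold
                          for p, g in zip(pred, gt)]))
```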

Benchmark results and analysis

In this section, we present results on the available public fusion tracking datasets. The results are either collected from the published literature or produced by the authors. The aim is to facilitate research in this direction and make it easier for researchers to compare tracking results with the state of the art. It should be mentioned that many RGB-infrared trackers are not open-source and their results on public datasets have not been reported [22], [23], [24], [38], [79], [80]. As a …

Future prospects

Despite the remarkable progress that has been achieved in RGB-infrared fusion tracking, several issues remain for future work. In this section, we give detailed discussions on specific trends of RGB-infrared fusion tracking based on the review of existing approaches.

Conclusion

Fusion tracking based on visible and infrared images (RGB-infrared fusion tracking) has attracted considerable attention and made significant progress in the past few years. In this paper, we comprehensively review existing RGB-infrared fusion tracking methods in the literature. These approaches can be generally divided into five categories: traditional methods, sparse representation-based, graph-based, correlation filter-based, and deep learning-based methods. Each category is introduced and …

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Xingchen Zhang: Conceptualization, Investigation, Writing - original draft, Writing - review & editing. Ping Ye: Visualization, Investigation, Data curation. Henry Leung: Writing - review & editing. Ke Gong: Visualization, Investigation. Gang Xiao: Supervision, Writing - review & editing, Funding acquisition, Project administration.

Acknowledgment

This work was sponsored in part by the National Program on Key Basic Research Project of China under Grant 2014CB744903, in part by the National Natural Science Foundation of China under Grants 61973212 and 61673270, in part by the Shanghai Science and Technology Committee Research Project under Grant 17DZ1204304, and in part by the Shanghai Industrial Strengthening Project under Grant GYQJ-2017-5-08.

References (142)

  • T. Wan et al.

    An application of compressive sensing for image fusion

    Int. J. Comput. Math.

    (2011)
  • S. Li et al.

    Pixel-level image fusion: a survey of the state of the art

    Inf. Fusion

    (2017)
  • Y. Liu et al.

    Deep learning for pixel-level image fusion: Recent advances and future prospects

    Inf. Fusion

    (2018)
  • J. Ma et al.

    Infrared and visible image fusion methods and applications: A survey

    Inf. Fusion

    (2019)
  • H. Hermessi et al.

    Convolutional neural network-based multimodal image fusion via similarity learning in the shearlet domain

    Neural Comput. Appl.

    (2018)
  • X. Yan, S.Z. Gilani, H. Qin, A. Mian, Unsupervised deep multi-focus image...
  • Y. Liu et al.

    Infrared and visible image fusion with convolutional neural networks

    Int. J. Wavelets Multiresolut. Inf. Process.

    (2018)
  • E. Gundogdu et al.

    Comparison of infrared and visible imagery for object tracking: Toward trackers with superior IR performance

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops

    (2015)
  • C. Li et al.

    RGB-T object tracking: benchmark and baseline

    arXiv Preprint arXiv:1805.08982

    (2018)
  • M. Ding et al.

    Visual tracking using locality-constrained linear coding and saliency map for visible light and infrared image sequences

    Signal Process. Image Commun.

    (2018)
  • Y. Wang et al.

    Learning soft-consistent correlation filters for RGB-T Object tracking

    Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

    (2018)
  • C. Luo et al.

    Thermal infrared and visible sequences fusion tracking based on a hybrid tracking framework with adaptive weighting scheme

    Infrared Phys. Technol.

    (2019)
  • X. Yun et al.

    Discriminative fusion correlation learning for visible and infrared tracking

    Math. Probl. Eng.

    (2019)
  • L. Liu et al.

    Hand posture recognition using finger geometric feature

    Proceedings of the 21st International Conference on Pattern Recognition

    (2012)
  • V.A. Laurense et al.

    Path-tracking for autonomous vehicles at the limit of friction

    2017 American Control Conference

    (2017)
  • J. Severson, Human-digital media interaction tracking, 2017, US Patent...
  • A. Ali et al.

    Visual object tracking classical and contemporary approaches

    Front. Comput. Sci.

    (2016)
  • L. Bertinetto et al.

    Fully-convolutional siamese networks for object tracking

    European Conference on Computer Vision

    (2016)
  • J.F. Henriques et al.

    High-speed tracking with kernelized correlation filters

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2015)
  • N. Dalal et al.

    Histograms of oriented gradients for human detection

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2005)
  • T. Lindeberg

    Scale invariant feature transform

    Scholarpedia

    (2012)
  • H. Nam et al.

    Learning multi-domain convolutional neural networks for visual tracking

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)
  • B. Li et al.

    High performance visual tracking with siamese region proposal network

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • B. Li et al.

    SiamRPN++: Evolution of siamese visual tracking with very deep networks

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2019)
  • W. Zhou, L. Wen, L. Zhang, D. Du, T. Luo, Y. Wu, SiamMan: Siamese motion-aware network for visual tracking, arXiv...
  • M. Danelljan et al.

    Convolutional features for correlation filter based visual tracking

    Proceedings of the IEEE International Conference on Computer Vision Workshops

    (2015)
  • M. Danelljan et al.

    ECO: efficient convolution operators for tracking

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • M. Danelljan et al.

    Beyond correlation filters: learning continuous convolution operators for visual tracking

    European Conference on Computer Vision

    (2016)
  • J. Choi et al.

    Context-aware deep feature compression for high-speed visual tracking

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • J.W. Davis et al.

    Fusion-based background-subtraction using contour saliency

    Computer Society Conference on Computer Vision and Pattern Recognition Workshops

    (2005)
  • A. Berg et al.

    A thermal object tracking benchmark

    Proceedings of 12th IEEE International Conference on Advanced Video and Signal Based Surveillance

    (2015)
  • X. Li et al.

    Hierarchical spatial-aware siamese network for thermal infrared object tracking

    Knowl.-Based Syst.

    (2019)
  • X. Lan et al.

    Learning modality-consistency feature templates: a robust RGB-infrared tracking system

    IEEE Trans. Ind. Electron.

    (2019)
  • X. Lan et al.

    Robust collaborative discriminative learning for RGB-infrared tracking

    Thirty-Second AAAI Conference on Artificial Intelligence

    (2018)
  • X. Lan et al.

    Modality-correlation-aware sparse representation for RGB-infrared object tracking

    Pattern Recognit. Lett.

    (2020)
  • C. Li et al.

    Fusing two-stream convolutional neural networks for RGB-T object tracking

    Neurocomputing

    (2018)
  • C. Li et al.

    Cross-modal ranking with soft consistency and noisy labels for robust RGB-T tracking

    Proceedings of European Conference on Computer Vision

    (2018)
  • C. Li et al.

    Learning local-global multi-graph descriptors for RGB-T object tracking

    IEEE Trans. Circuits Syst. Video Technol.

    (2018)
  • S. Zhai et al.

    Fast RGB-T tracking via cross-modal correlation filters

    Neurocomputing

    (2019)
  • H. Liu et al.

    Fusion tracking in color and infrared images using joint sparse representation

    Sci. China Inf. Sci.

    (2012)