
Abstract: The development of single-modality target tracking based on visible light has been limited in recent years because visible light images are highly susceptible to environmental and lighting influences. Thermal infrared images compensate well for this weakness, so RGBT tracking has attracted increasing attention. However, existing studies are limited to aggregating multimodal information through feature fusion and ignore the role of decision-level fusion in tracking; in addition, the original re-detection algorithm in the baseline model is prone to the accumulation of failures. To deal with these problems, we propose the Redetection Multimodal Fusion Network (RMFNet). The network is divided into three branches: the visible light branch, the thermal infrared branch, and the fusion branch. The three-branch structure can fully utilize the complementary advantages of multimodal information as well as the commonalities and specific characteristics of the two modalities. We propose a multimodal feature fusion module (EFM), which adaptively calculates the reliability of each modality and performs a weighted fusion of the two modality features. The existing re-detection algorithm is improved by adding a re-detection mechanism that performs a global search in the current frame, which reduces the accumulation of failures. We have conducted extensive comparative validation on two widely used benchmark datasets, GTOT and RGBT234. The experimental results suggest that RMFNet outperforms other tracking methods.


Introduction
Object tracking is a computer vision task that estimates the position of an object in each subsequent frame of a video sequence, based on the annotation of the object in the first frame, in order to recover the object's motion trajectory. Video tracking technology draws on theories from many fields and has great practical value in areas such as security monitoring, human-computer interaction, and intelligent life. While single-modality target tracking has made significant progress, it often performs poorly in complex scenes, such as those involving background clutter, occluded targets, low light, rain, and snow. In contrast, thermal infrared cameras, which are based on thermal radiation imaging, can effectively compensate for the shortcomings of visible light imaging and are better suited to dealing with such challenges. Conversely, the visible light image provides the detailed texture and color information that may be lost in the thermal infrared image. The two modalities therefore have their respective strengths and weaknesses and exhibit strong complementarity. Combining the information from both modalities in target tracking can provide more comprehensive target information and improve the accuracy and robustness of tracking. Hence, research on target tracking using both visible light and thermal infrared modalities has substantial significance.
The current RGBT tracking algorithms based on deep networks can be broadly classified into two types of research. The first type investigates how to use deep networks to adaptively fuse the features of visible and thermal infrared modality. For example, Zhu et al. [1] proposed a recursive strategy to densely aggregate hierarchical features of the network and reduced redundant features and noise through channel pruning. Xu et al. [2] performed bilinear pooling on any two layers using cross-product (a second-order computation effectively aggregating deep semantic and shallow texture information of the target) and aggregated bilinear pooling features of different modalities using a quality-aware network. Li et al. [3] proposed multi-scale adapters and multi-modality fusion modules to adaptively aggregate features of different scales and modalities. The second direction aims to explore the commonalities and characteristics of the two modalities and fully leverage their potential value. For example, Xiao et al. [4] proposed an enhanced transformer structure for feature fusion, which can further strengthen the fused feature and modality-specific features. Lu et al. [5] proposed a hierarchical structure with a parallel combination of general adapters and modality adapters and integrated hierarchical divergence losses, which enabled single-stage joint learning of modality-sharing and modality-specific features. Additionally, a dynamic fusion module was designed in the instance adapter, which can perform quality-aware fusion of different source data.
Although these tracking networks demonstrate reliable performance and great potential in various environments and challenges, two issues remain. Firstly, these networks only use feature-level fusion and, subsequently, only use the fused feature for classification. However, in certain scenarios (e.g., extreme lighting or thermal crossover), one modality image may fail or be of poor quality, and the fused features are then easily corrupted by this low-quality input, so they are not always superior to single-modality features. Decision-level fusion can address this issue and, to some extent, alleviate the impact of image misalignment. Unfortunately, the importance of decision-level fusion has been overlooked by the aforementioned networks. Secondly, these networks are built on the MDNet architecture, whose target re-detection algorithm suffers from the accumulation of failures.
To overcome these two problems, we propose a network for RGBT tracking: a redetection multimodal fusion network that improves the redetection algorithm and combines feature-level and decision-level fusion. We extracted features using the same network configuration as the first three convolutional layers of VGG-M [6] with the modification of expanding the receptive field. To fully exploit the complementarity of multimodal information and explore the differences between two modalities, we divide the network into three branches, the visible light branch, the thermal infrared branch, and the fusion branch. The two specific modality branches each use three convolutional layers to extract features, with no weight sharing between the convolutional layers of the two modalities. A fusion branch is introduced in parallel between the two modalities, where the two features output by the 7 × 7 convolution are, respectively, input into the second 5 × 5 convolution and third 3 × 3 convolution of the fusion branch to extract shared information between the two modalities. The ECA [7] attention-based feature fusion module designed in this paper can adaptively compute the weights of two modalities and perform a weighted fusion of multimodal features to obtain more robust fused features. Three fully-connected classification layers are added after each of the three obtained features to obtain the sample scores output using each of the three features, and the three scores are summed to obtain the final sample score. We further improved the re-detection mechanism of existing algorithms. Specifically, when the tracker's pre-location is unreliable, we directly expand the search area and perform a global target search to obtain target re-location. We then determine whether the re-location is reliable, and, if so, use the re-location as the target location for the frame, or, if not, use the one with the higher score of pre-location and re-location as the target position for this frame. 
Finally, the Alpha-Refine [8] bounding box refinement module is added in this paper to adjust the coarse positioning of the tracker precisely.
This paper offers several key contributions, summarised as follows:

1. We propose an RGBT tracking network: the Redetection Multimodal Fusion Network (RMFNet), which contains two types of fusion, medium-term feature-level and late-term decision-level fusion, and can be divided into three branches, visible, thermal infrared, and fusion, and can fully utilize the complementarity and correlation of multimodal information to achieve robust RGBT tracking.
2. We have designed an ECA attention-based multimodal feature fusion module that adaptively computes two modalities' reliability and performs a weighted fusion of multimodal features to obtain a more favourable feature representation.
3. We improve the re-detection algorithm of the base tracker by adding a re-detection step that performs a global target search at the current frame, which mitigates the accumulation of failures and can increase the robustness and precision of the tracking algorithm while decreasing the amount of computation and making the tracking algorithm more efficient.
4. Extensive experimental results on the GTOT [9] and RGBT234 [10] datasets show that RMFNet outperforms other advanced RGBT tracking methods and obtains good performance.

RGBT Tracking
Siamese network and MDNet network architectures are two mainstream frameworks for RGBT tracking algorithms, especially the former. Li et al. [11] proposed a fusion tracking method based on the MDNet [12] framework in 2018, and, subsequently, many RGBT trackers based on the MDNet framework appeared. Zhu et al. [1] proposed the dense feature aggregation and pruning network (DAPNet), which uses a recursive strategy to densely aggregate the hierarchical features of the network and applies channel pruning to reduce redundant features and remove noise. Li et al. [13] proposed a multi-adapter convolutional network (MANet) to fully exploit the potential value of mode-sharing, mode-specific, and instance-aware information. Xu et al. [2] proposed the cross-layer bilinear pooling network (CBPNet), which aggregates features at different levels by performing bilinear pooling operations on any two layers through cross-products, and uses a quality-aware network to fuse the bilinear pooled features of different modalities. Zhang et al. [14] proposed the Adaptive Learning Attribute Representation Network (ADRNet), whose core is adaptive attribute representation and adaptive modality fusion: the network learns to distinguish the target from the background based on the target attribute representation.
The first Siamese framework to be applied to RGBT tracking is SiamFT [15], which considers SiamFC [16] as the baseline and uses a doubled Siamese network, visible and infrared network, which can process visual and infrared images separately to meet the real-time requirements. DSiamMFT [17] further employs dynamic online learning transformation strategies and multi-level semantic features based on SiamFT. SiamIVFN [18] utilizes complementary feature fusion networks and contribution aggregation networks to handle not only the misalignment of image pairs but also the fusion of features based on the contributions of the two modalities, giving the network high accuracy and a speed of 147.6 FPS, making it the fastest deep learning-based RGBT tracker available. It is worth mentioning that algorithms based on the Siamese framework rely on a large amount of data to train the network. However, since the existing RGBT dataset does not have such a large amount of data, the current algorithms of the Siamese framework usually use a large-scale RGB dataset and train the network with the grayscale maps generated from RGB images. However, the imaging principle of thermal infrared images differs from that of visible images, resulting in the generated grayscale maps lacking reliability to some extent. In this paper, we conducted a study using the MDNet framework.

Multimodal Fusion Network
A suitable fusion strategy can take full advantage of multimodal information and reinforce the robustness and accuracy of the tracker. Existing multimodal fusion methods fall into three categories: pixel level, feature level, and decision level. Early fusion approaches used pixel-level fusion, directly concatenating the channels of the multiple modalities and feeding the fused data into a convolutional neural network (CNN) for feature extraction, as shown in Figure 1a. Later, motivated by the outstanding performance of feature aggregation in multiple domains, Li et al. [11] designed a feature fusion network that directly integrates RGB and thermal infrared deep features, fusing the two modality features using element-wise addition, as shown in Figure 1b. Yang et al. [19] designed a fusion network that adds the output classification results, as shown in Figure 1c. Li et al. [13], as shown in Figure 1d, separate the multimodal information into shared and exclusive information extracted by different adapters, proposing a multi-adapter fusion network. Gao et al. [20], as shown in Figure 1e, adaptively fuse modality features with hierarchical features in the form of a recursive fusion chain to achieve both high speed and high accuracy. However, the above algorithms only use the fused features in the subsequent processing stage, without considering that the fused features become unreliable when one modality is noisy or fails. Figure 1f shows the RMFNet proposed in this paper: its parallel fusion branch fully exploits the shared information of the two modalities, and the three-branch fusion network takes full advantage of the commonalities and specific characteristics of multimodal features.
Figure 1. (a) Pixel-level fusion. (b) Feature-level fusion, the framework proposed by [11]. (c) Late decision-level fusion, the framework proposed by [19]. (d) Framework proposed by [13]. (e) Framework proposed by [20]. (f) Our framework. "C" and "+" denote concatenation and element-wise summation. "AFM" and "EFM" indicate multimodal feature fusion modules.

Attention Mechanism
The attention mechanism is a powerful tool in computer vision that helps neural networks focus on the most critical information in the input. By enhancing the learning and expression of this information, it has been extensively used for a broad range of tasks such as object tracking, image segmentation, image restoration, etc., and has proven to improve network performance significantly. SENet [21] employs an adaptive channel feature recalibration module to retain valuable features and improve network performance. ECANet [7] models the dependency between channels by using a more lightweight and faster one-dimensional convolution instead of the fully connected layers in SENet. SKNet [22] proposes a convolutional kernel attention mechanism that allows the network to select appropriate convolutional kernels and fuse multiple branches with different kernel sizes. CANet [23] embeds the position information of features into a channel attention mechanism, and enables the network to aggregate features along the horizontal and vertical directions through pooling operations, thereby improving the expressiveness and accuracy of the target. Non-local [24] and PSANet [25] each model the spatial relationships between all pixels in the feature map and embed attention modules after each block in CNN. The attention mechanism is utilized in this paper to adaptively adjust the weight of features to enhance the performance of the tracker under different modalities.

Our Method
We first present the proposed redetection multimodal fusion network and then describe our fusion methodology, the feature fusion module, the improved re-detection algorithm, and the Alpha-Refine bounding box refinement module used in the final stage.

RMFNet Overall Architecture
The RMFNet architecture is shown in Figure 2. To fully leverage the shared and modality-specific information in both images and to exploit their complementary strengths for more accurate and dependable target tracking, RMFNet consists of two modality-specific branches and one fusion branch. A pair of visible and thermal images of arbitrary size is taken as input to the network. In the two modality-specific branches, the backbone comprises the first three convolutional layers of VGG-M [6], which preserves accuracy while avoiding the inefficiency of a deeper network. The backbone was modified to improve the representation quality of the region of interest (ROI) and to obtain a larger receptive field. Specifically, the first layer has a convolutional kernel size of 7 × 7 × 96 and the second layer has a convolutional kernel size of 5 × 5 × 256; both layers are followed by a ReLU activation function and local response normalisation (LRN), and the max pooling layer after the second layer in the original network is removed. The third layer has a convolutional kernel size of 3 × 3 × 512 and is a dilated convolution with dilation rate r = 3.
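To make the effect of the dilated third layer concrete, the short sketch below computes the spatial extent covered by a dilated kernel using the standard relation k_eff = k + (k − 1)(r − 1). The layer table mirrors the kernel sizes stated above; the function name and dictionary layout are illustrative only, and strides and padding are left out because they are not specified here.

```python
# Backbone layers as described in the text: kernel size, output channels, dilation.
# Strides and padding are not given in the text and are therefore omitted.
BACKBONE = [
    {"name": "conv1", "kernel": 7, "channels": 96, "dilation": 1},
    {"name": "conv2", "kernel": 5, "channels": 256, "dilation": 1},
    {"name": "conv3", "kernel": 3, "channels": 512, "dilation": 3},
]


def effective_kernel(kernel: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (r - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


for layer in BACKBONE:
    extent = effective_kernel(layer["kernel"], layer["dilation"])
    print(f'{layer["name"]}: {layer["kernel"]}x{layer["kernel"]} kernel spans {extent}x{extent}')
```

With r = 3, the 3 × 3 kernel of the third layer spans a 7 × 7 window while still using only nine weights per input-output channel pair.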
In addition, to obtain the shared information between the two modalities, the fusion branch adopts second- and third-layer convolutional settings similar to those of the modality-specific branches. Features are first extracted from each of the two input images by the 7 × 7 convolution and are then separately fed into the second and third layers of the fusion branch, whose weights are shared across the two modalities so that more shared information can be mined. The two features output by the fusion branch are then adaptively weighted and fused by the designed feature fusion module to enhance the feature representation of the target and, ultimately, the tracking accuracy.
Finally, each of the three features is passed through three fully connected classification layers that learn an instance-aware representation of the target, followed by a softmax layer that classifies the optimized features. The FC6 layer is a binary classification fully connected layer with K domain-specific branches (one domain corresponds to one video sequence). The three classification scores obtained from the three features are summed to obtain the final sample score, and the mean of the five highest-scoring candidate boxes is taken as the predicted target location.
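The final localisation step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes boxes are stored as (x, y, w, h) tuples and that the per-candidate scores have already been summed over the three branches; the function name `estimate_target` is hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h); this box format is an assumption


def estimate_target(boxes: Sequence[Box], scores: Sequence[float],
                    top_k: int = 5) -> Tuple[Box, float]:
    """Average the top_k highest-scoring candidate boxes.

    Returns the averaged box and the mean score of the selected candidates.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    k = len(order)
    mean_box = tuple(sum(boxes[i][d] for i in order) / k for d in range(4))
    mean_score = sum(scores[i] for i in order) / k
    return mean_box, mean_score


# Toy example: eight candidates shifted along x, with scores increasing with the shift.
boxes = [(10.0 + i, 20.0, 30.0, 40.0) for i in range(8)]
scores = [0.2 + 0.1 * i for i in range(8)]
box, score = estimate_target(boxes, scores)
print(box, round(score, 3))
```

Averaging several high-scoring candidates instead of taking a single maximum makes the estimate less sensitive to one spuriously high score.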

The Fusion Strategy
Visible and thermal infrared images perceive and express target information in different ways and have different advantages under different conditions. Modality fusion can comprehensively utilize the benefits of both modalities, improve the precision and robustness of object tracking, enhance the tracker's resistance to environmental changes and noise, and improve the tracking speed by reducing unnecessary calculations. This paper employs two modality fusion methods: medium-term feature-level fusion and late-term decision-level fusion.
(1) Medium-Term Feature-Level Fusion
A feature-level fusion tracking algorithm first extracts features from the visible and thermal infrared images separately using convolutional neural networks, then fuses them according to a designed fusion rule to obtain the fused feature of the two modalities, and finally tracks using this fused feature. As shown in Figure 2, our fusion branch aims to fully exploit the shared information between the two modalities. To better utilize the shared information, we perform feature-level fusion on the two features output by the fusion branch. In this paper, we propose a multimodal feature fusion module based on ECA attention (EFM), which adaptively computes the contribution of the two modality features to target tracking and obtains the weighted fused feature based on the relative reliability of the modalities. The details of the EFM module are illustrated in Figure 3. The EFM module can reduce noise and redundant information in the fusion process, obtain a more robust feature representation, and improve tracking accuracy.
Figure 3. Detail diagram of the EFM module. $X_r$, $X_t$, and $X_f$ indicate the visible light, thermal infrared, and fused features, respectively. B × C × W × H represents the dimensions of the tensor, where B is the batch size, C is the number of channels, and W × H represents the width and height of the feature map.
Inspired by the attentional fusion module proposed in AFF [26], we propose an improved feature fusion module based on ECA attention [7]. We use fast one-dimensional convolution to capture inter-channel interactions effectively without reducing the number of channels, which keeps computation fast and memory usage low. Unlike the fusion strategy in AFF, we do not add the two features and then weight the resulting fused feature. Instead, we first compute a global representation of each of the two features separately, then apply the channel-interaction convolution to each global representation, and finally calculate the weights after concatenating the two results along the channel dimension.
The input of the EFM module comprises the two features to be weighted and fused. First, global average pooling is performed on the two features separately to extract their global information, denoted as $f_t^c \in \mathbb{R}$ and $f_r^c \in \mathbb{R}$ for channel $c$:

$$f_t^c = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} X_t^c(i, j), \qquad f_r^c = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} X_r^c(i, j),$$

where $W$ and $H$ denote the width and height of the input features. A fast one-dimensional convolution with a kernel size of 3 is then applied to $f_t$ and $f_r$, which allows the network to capture interactions among 3 adjacent channels. The two resulting representations are concatenated along the channel dimension, and the channel weights are calculated through a softmax operation. Finally, the two features are weighted and fused using the obtained weights, resulting in a fused feature more conducive to subsequent tracking:

$$[w_t, w_r] = \mathrm{softmax}\big([\mathrm{C1D}_3(f_t), \mathrm{C1D}_3(f_r)]\big), \qquad X_f = w_t \otimes X_t + w_r \otimes X_r.$$

Here $[\cdot, \cdot]$ denotes channel-wise concatenation, $\mathrm{C1D}_3$ denotes the one-dimensional convolution with kernel size 3, and $w_t$ and $w_r$ are the weights of the two input features. $X_f$, $X_r$, and $X_t$ represent the fused feature and the two original input features, respectively.
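The EFM steps described above (global average pooling, one-dimensional convolution across channels, softmax weighting, weighted sum) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: features are nested lists indexed as [channel][row][column], the convolution kernels are passed in rather than learned, and the softmax is assumed to normalise across the two modalities separately for each channel, which the text does not state explicitly. All function names are hypothetical.

```python
import math
from typing import List

Feature = List[List[List[float]]]  # indexed as [channel][row][column]


def global_avg_pool(x: Feature) -> List[float]:
    """One scalar per channel: the mean over all spatial positions."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]


def conv1d_same(v: List[float], kernel: List[float]) -> List[float]:
    """1-D cross-correlation across channels with zero padding (same length out)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(v) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel)) for i in range(len(v))]


def efm_fuse(x_r: Feature, x_t: Feature,
             kernel_r: List[float], kernel_t: List[float]) -> Feature:
    """Weighted fusion of visible (x_r) and thermal (x_t) features."""
    d_r = conv1d_same(global_avg_pool(x_r), kernel_r)
    d_t = conv1d_same(global_avg_pool(x_t), kernel_t)
    fused = []
    for c, (ch_r, ch_t) in enumerate(zip(x_r, x_t)):
        # Assumption: softmax over the two modalities, separately for each channel.
        m = max(d_r[c], d_t[c])  # subtract the maximum for numerical stability
        e_r, e_t = math.exp(d_r[c] - m), math.exp(d_t[c] - m)
        w_r, w_t = e_r / (e_r + e_t), e_t / (e_r + e_t)
        fused.append([[w_r * a + w_t * b for a, b in zip(row_r, row_t)]
                      for row_r, row_t in zip(ch_r, ch_t)])
    return fused


# Toy example: 2 channels, 1 x 2 spatial size, identity-like kernels of size 3.
x_r = [[[0.0, 0.0]], [[0.0, 0.0]]]
x_t = [[[1.0, 1.0]], [[1.0, 1.0]]]
fused = efm_fuse(x_r, x_t, [0.0, 1.0, 0.0], [0.0, 1.0, 0.0])
print(fused)
```

Under this per-channel reading, the two weights always sum to one, so the fused feature is a convex combination of the two inputs and a modality with a weak response is down-weighted rather than discarded.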
(2) Late Decision-Level Fusion
In target tracking, decision-level multimodal fusion refers to fusing the tracking results from different modalities according to some rule during the tracking process. This approach can effectively utilize the tracking results from different modalities, thereby improving tracking accuracy and robustness. However, most existing RGBT tracking algorithms [3,4,14,27] are based on the feature-level fusion framework and ignore the importance of decision-level fusion in object tracking. In this paper, we adopt a late decision fusion strategy that uses the visible light features, the thermal infrared features, and the fused features for classification, obtaining three tracking results based on different features. We then combine the three results by addition to obtain the final prediction score. The advantage lies in utilizing tracking results from three different features, fully leveraging the modality-specific and shared information for target tracking. This approach also mitigates the impact of imperfect alignment between the two modality images, thereby improving the tracker's accuracy, robustness, and stability. The formula for late decision fusion in this paper is:

$$S = F_{fc}(X_f) + F_{fc}(X_v) + F_{fc}(X_t) \qquad (7)$$

where $S$ is the final prediction score, $F_{fc}$ denotes the fully connected layers, and $X_f$, $X_v$, and $X_t$ represent the fused feature, visible light feature, and thermal infrared feature, respectively. The late fusion process is depicted in Equation (7), in which three fully connected layers are employed to learn feature representations for each feature type. The scores obtained from the three features are then combined using element-wise addition to produce a two-dimensional output for background and target binary classification.
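As a rough illustration of the late fusion step, the sketch below sums the two-dimensional (background, target) scores produced by three separate classification heads. It is a simplification under stated assumptions: each head is collapsed into a single linear layer instead of three fully connected layers, features are flat vectors, and the names `linear` and `late_fusion` are hypothetical.

```python
from typing import List, Sequence, Tuple

Head = Tuple[Sequence[Sequence[float]], Sequence[float]]  # (weight rows, bias)


def linear(x: Sequence[float], weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Plain fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def late_fusion(x_f: Sequence[float], x_v: Sequence[float], x_t: Sequence[float],
                head_f: Head, head_v: Head, head_t: Head) -> List[float]:
    """Element-wise sum of the (background, target) scores of the three branches."""
    s_f = linear(x_f, *head_f)
    s_v = linear(x_v, *head_v)
    s_t = linear(x_t, *head_t)
    return [a + b + c for a, b, c in zip(s_f, s_v, s_t)]


# Toy example with identity heads, so each feature is already a (background, target) pair.
identity: Head = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
score = late_fusion([0.25, 0.75], [0.5, 0.5], [1.0, 0.0], identity, identity, identity)
print(score)
```

In this toy case the thermal branch votes for background, yet the summed score still reflects the evidence from the other two branches, which is the behaviour decision-level fusion is meant to provide when one modality is unreliable.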

Improved Re-Detection Algorithm
Figure 4 shows the flow of the re-detection algorithm of the original MDNet [12]. When the pre-positioning is judged to be unreliable, MDNet still uses it as the target position of the current frame and, when tracking the next frame, generates candidate boxes by expanding the search area around that position. However, deformation, rapid motion, occlusion of the target, illumination changes, or background interference during tracking can cause the tracker to fail and lose the target. When the mean score of the top 5 candidate boxes is less than 0, the pre-positioning is deemed unreliable, and the tracker is judged to have possibly drifted and lost the target. The re-detection process of the original MDNet therefore accumulates errors continuously; this paper improves it as shown in Figure 5. When the pre-positioning is unreliable, a global search is performed in the current frame by expanding the search area around the previous frame's target location, and the target is repositioned in the current frame. Repositioning can detect the target again and reinitialize the tracker, avoiding the continuous accumulation of errors and improving tracking reliability. Moreover, because the target is recovered promptly, the tracker is prevented from continuing to track the wrong target or losing the target, which reduces wasted computation and time and improves tracking efficiency. The reliability of the repositioning is then evaluated: when it is reliable, samples are collected around this position and the tracker is updated. When the repositioning is unreliable, the scores of the pre-positioning and the repositioning are compared, and the one with the higher score is selected as the target position of the current frame.
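The decision logic of the improved re-detection step can be summarised in a few lines. The sketch below is a schematic of the control flow only: `local_search` and `global_search` are hypothetical callables standing in for candidate generation and scoring in the normal and the expanded search area, each returning a box together with the mean score of its top 5 candidates.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]
Locator = Callable[[], Tuple[Box, float]]  # returns (box, mean top-5 score)


def locate_with_redetection(local_search: Locator, global_search: Locator,
                            threshold: float = 0.0) -> Tuple[Box, bool]:
    """Return the target box for the current frame and whether it is reliable.

    A location is unreliable when its score is below `threshold`
    (the text uses 0 for the mean of the top-5 candidate scores).
    """
    pre_box, pre_score = local_search()
    if pre_score >= threshold:
        return pre_box, True              # pre-positioning is reliable: keep it
    re_box, re_score = global_search()    # expanded search in the current frame
    if re_score >= threshold:
        return re_box, True               # repositioning is reliable: update the tracker
    # Neither is reliable: fall back to the higher-scoring of the two.
    if pre_score >= re_score:
        return pre_box, False
    return re_box, False


# Toy example: the local search drifts, the global search recovers the target.
result = locate_with_redetection(lambda: ((0.0, 0.0, 10.0, 10.0), -1.0),
                                 lambda: ((5.0, 5.0, 10.0, 10.0), 0.5))
print(result)
```

The returned flag indicates whether sample collection and the model update should run for this frame; the global search is only invoked when the local result is unreliable, so reliable frames incur no extra cost.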

Alpha-Refine Module
Object tracking aims to accurately estimate a given target's bounding box in consecutive video frames. This paper adopts a multi-stage strategy to improve the prediction accuracy of bounding boxes. First, the re-detection multimodal fusion network is used as the base tracker to provide a coarse bounding box of the target, and then the Alpha-Refine refinement module adjusts it to obtain the final target bounding box. The process is illustrated in Figure 6. Alpha-Refine is a generic and accurate refinement module that significantly improves the quality of the bounding boxes generated by the base tracker. In the data preparation stage, we use the target in the first frame to initialize the reference template of the Alpha-Refine module. During the tracking stage, the coarse bounding box output by the base tracker is expanded to twice its size to form the search area of the Alpha-Refine module. A pixel-wise correlation between the template and the search region produces the response map, and a keypoint prediction head predicts the bounding box; this design preserves detailed spatial information for more accurate bounding box estimation.
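The construction of the refinement search area can be illustrated with a small helper. This sketch assumes that "twice its size" means doubling the width and height about the box centre, and that boxes are (x, y, w, h) with (x, y) the top-left corner; clipping or padding at the image border is omitted. The function name is hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h), top-left corner plus size


def expand_box(box: Box, factor: float = 2.0) -> Box:
    """Scale a box about its centre by `factor` in both width and height."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = w * factor, h * factor
    return (cx - new_w / 2.0, cy - new_h / 2.0, new_w, new_h)


coarse = (10.0, 20.0, 30.0, 40.0)
search_area = expand_box(coarse)
print(search_area)
```

Note that the expanded region can extend beyond the image, as it does for this toy box, so a real pipeline would need to pad or clip before cropping.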

Offline Training
This section outlines the offline training process of RMFNet. First, the weights of the first three convolutional layers of the VGG-M [6] model trained on the ImageNet dataset are loaded and used to initialize the convolutional layers of the two modality-specific branches and the fusion branch. We then train the entire RMFNet using the Adam algorithm, with the learning rate of the convolutional layers set to 0.0001 and the learning rate of the fully connected classification layers set to 0.001, which is 10 times that of the convolutional layers. In each training iteration, 8 frames and their corresponding ground-truth targets are randomly drawn from each video sequence. Samples are then drawn from a Gaussian distribution around the ground-truth targets. In total, 256 positive samples and 768 negative samples are collected from the 8 frames, forming a mini-batch for network training. Positive and negative samples are distinguished solely by the IoU between the sample box and the ground-truth box, with two thresholds set at 0.7 and 0.5: boxes with IoU greater than 0.7 are positive samples, and those with IoU less than 0.5 are negative samples. RMFNet is trained for 120 and 220 rounds on the GTOT and RGBT234 datasets, respectively. The two datasets serve as each other's training and testing sets: RGBT234 is the testing set when GTOT is used as the training set, and vice versa. It is worth noting that, to save training time, we randomly select 78 video sequences for each iteration when training the network on the RGBT234 dataset.
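The IoU-based labelling rule for training samples can be written down directly. The sketch below assumes boxes in (x, y, w, h) format; the function names and the string labels are illustrative, and samples whose IoU falls between the two thresholds are simply returned as `None`, meaning they are used neither as positives nor as negatives.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def label_sample(sample: Box, gt: Box,
                 pos_thr: float = 0.7, neg_thr: float = 0.5) -> Optional[str]:
    """Label a sampled box by its IoU with the ground truth."""
    overlap = iou(sample, gt)
    if overlap > pos_thr:
        return "positive"
    if overlap < neg_thr:
        return "negative"
    return None  # ambiguous overlap: not used for training


gt = (0.0, 0.0, 10.0, 10.0)
print(label_sample((0.0, 0.0, 10.0, 10.0), gt))  # IoU 1.0
print(label_sample((5.0, 0.0, 10.0, 10.0), gt))  # IoU 1/3
print(label_sample((2.0, 0.0, 10.0, 10.0), gt))  # IoU 2/3
```

Leaving a gap between the two thresholds keeps borderline boxes out of both classes, which avoids training the classifier on ambiguous examples.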

Online Tracking
During online tracking with RMFNet, the last multi-domain binary classification fully connected layers of the three branches need to be replaced with new classification layers so that the tracker can learn instance-specific representations of the new video sequence. The network is fine-tuned for a new video sequence using the annotated target in the first frame. The training process is similar to that of MDNet [12]. The tracking process of the tracker is shown in Algorithm 1. During tracking, the parameters of the convolutional layers and of the first two classification layers, FC4 and FC5, trained on the previous dataset, are loaded, and the network is fine-tuned using the first frame. Around the ground-truth box of the target in the first frame, 500 positive and 5000 negative sample boxes, $S_1^+$ and $S_1^-$, are generated and used for 50 iterations of fine-tuning, updating only the parameters of the classification layers. To obtain precise bounding boxes, we also train a bounding box regression network on the first frame's target, which is used to adjust reliable pre-localizations in subsequent frames. We collect samples near the ground truth in the first frame for long-term and short-term network updates. When tracking the target at frame $t$, 256 candidate boxes $X_t^i$ are generated in frame $t$ around the target position estimated in frame $t - 1$, the network calculates their scores $f(X_t^i)$, and the candidate with the highest score gives the target location predicted by the tracker. The score calculation formula is shown in Equation (8):

$$X_t^* = \arg\max_{i = 1, \ldots, 256} f(X_t^i) \qquad (8)$$
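Candidate generation and selection in a frame can be sketched as follows. This is an illustration under assumptions: candidates are drawn by Gaussian perturbation of the previous box in position and scale, the standard deviations `trans_std` and `scale_std` and the scale base 1.05 are hypothetical values chosen for the example, and `score_fn` stands in for the network score.

```python
import random
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def sample_candidates(prev_box: Box, n: int = 256, trans_std: float = 0.3,
                      scale_std: float = 0.5, seed: Optional[int] = None) -> List[Box]:
    """Draw n candidate boxes around prev_box by Gaussian perturbation."""
    rng = random.Random(seed)
    x, y, w, h = prev_box
    cx, cy = x + w / 2.0, y + h / 2.0
    size = (w + h) / 2.0
    candidates = []
    for _ in range(n):
        scale = 1.05 ** rng.gauss(0.0, scale_std)   # hypothetical scale model
        new_w, new_h = w * scale, h * scale
        new_cx = cx + rng.gauss(0.0, trans_std * size)
        new_cy = cy + rng.gauss(0.0, trans_std * size)
        candidates.append((new_cx - new_w / 2.0, new_cy - new_h / 2.0, new_w, new_h))
    return candidates


def best_candidate(candidates: List[Box], score_fn: Callable[[Box], float]) -> Box:
    """Return the candidate with the highest score."""
    return max(candidates, key=score_fn)


# Toy scorer: the closer the box centre is to (60, 60), the higher the score.
def toy_score(box: Box) -> float:
    cx, cy = box[0] + box[2] / 2.0, box[1] + box[3] / 2.0
    return -((cx - 60.0) ** 2 + (cy - 60.0) ** 2)


candidates = sample_candidates((40.0, 40.0, 20.0, 20.0), seed=0)
best = best_candidate(candidates, toy_score)
print(len(candidates), best)
```

Fixing the seed makes the sampling reproducible, which is convenient when comparing tracker variants on the same sequence.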

Performance Evaluation
We have chosen to measure our method against popular tracking algorithms of recent years on two datasets, RGBT234 [10] and GTOT [9]. The validity of our proposed algorithm and each module is fully proved through comparative analysis and ablation experiments on the benchmarks of the above two datasets.

Datasets and Metrics
GTOT Dataset: The GTOT dataset consists of 50 pairs of visible and thermal infrared video sequences, including common targets, such as people, vehicles, ships, and aeroplanes, as well as various indoor, outdoor, daytime, and nighttime environments. The location of the target is manually annotated in each frame, and, based on factors such as the shooting environment, time, and target motion state, the sequences are labelled with seven different attributes. The performance of an RGBT tracker in dealing with different challenges is evaluated based on these seven attribute labels. We use two commonly used tracking evaluation metrics for experimental analysis: precision rate (PR) and success rate (SR). PR is the proportion of frames in which the centre location error between the bounding box output by the tracker and the ground truth is below a preset threshold, and SR is the proportion of frames in which the overlap (intersection over union) between the two boxes exceeds a preset threshold. Since targets in the GTOT dataset are usually small, we set the PR threshold to 5 pixels.
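The two metrics can be computed as in the sketch below, which assumes (x, y, w, h) boxes and evaluates each metric at a single threshold. The function names are illustrative, and the choice of "less than or equal" at the precision threshold is a convention adopted here. Benchmarks commonly report the success rate as the area under the curve obtained by sweeping the overlap threshold; that summary is not shown.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def center_error(a: Box, b: Box) -> float:
    """Euclidean distance between the centres of two boxes."""
    return math.hypot((a[0] + a[2] / 2.0) - (b[0] + b[2] / 2.0),
                      (a[1] + a[3] / 2.0) - (b[1] + b[3] / 2.0))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision_rate(preds: Sequence[Box], gts: Sequence[Box], threshold: float = 5.0) -> float:
    """Fraction of frames whose centre error is within `threshold` pixels."""
    hits = sum(center_error(p, g) <= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


def success_rate(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of frames whose IoU with the ground truth exceeds `threshold`."""
    hits = sum(iou(p, g) > threshold for p, g in zip(preds, gts))
    return hits / len(preds)


preds = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
gts = [(3.0, 3.0, 10.0, 10.0), (10.0, 0.0, 10.0, 10.0)]
print(precision_rate(preds, gts), success_rate(preds, gts, 0.25))
```

In the toy data, the first frame is a hit for both metrics and the second is a miss for both, so each rate evaluates to one half.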
RGBT234 Dataset: The RGBT234 Dataset is a massive RGBT tracking dataset that extends the RGBT210 dataset, containing 234 pairs of time and space-aligned visible and thermal infrared video sequences with approximately 210,000 frames, with the longest video sequence consisting of 8000 frames. Furthermore, the dataset encompasses a variety of challenging attributes, including, but not limited to, camera motion, deformations, motion blur, changes in illumination, and occlusions. These challenges are labelled separately in order to enable a more comprehensive evaluation of the algorithms. The algorithms are fairly evaluated using the maximum precision rate (MPR) and success rate (MSR) metrics. We set the threshold for MPR to 20 pixels. MPR and MSR use the maximum value, while PR and SR use the average value. MPR and MSR focus more on the accuracy and stability of the algorithm than PR and SR.
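One way to read the "maximum" in MPR, which we understand to match the RGBT234 benchmark protocol but state here as an assumption, is that each frame is scored against the ground truth of both modalities and the more favourable result is kept, which tolerates small misalignment between the two annotations. The sketch below implements this reading for the precision metric; boxes are (x, y, w, h) and the function names are illustrative.

```python
import math
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def center_error(a: Box, b: Box) -> float:
    """Euclidean distance between the centres of two boxes."""
    return math.hypot((a[0] + a[2] / 2.0) - (b[0] + b[2] / 2.0),
                      (a[1] + a[3] / 2.0) - (b[1] + b[3] / 2.0))


def max_precision_rate(preds: Sequence[Box], gts_rgb: Sequence[Box],
                       gts_tir: Sequence[Box], threshold: float = 20.0) -> float:
    """Precision using, per frame, the smaller centre error over both modalities."""
    hits = 0
    for pred, gt_v, gt_t in zip(preds, gts_rgb, gts_tir):
        # Assumption: keep the more favourable (smaller) of the two errors.
        error = min(center_error(pred, gt_v), center_error(pred, gt_t))
        hits += error <= threshold
    return hits / len(preds)


preds = [(0.0, 0.0, 10.0, 10.0)]
gts_rgb = [(30.0, 0.0, 10.0, 10.0)]   # 30 pixels away from the prediction
gts_tir = [(10.0, 0.0, 10.0, 10.0)]   # 10 pixels away from the prediction
print(max_precision_rate(preds, gts_rgb, gts_tir))
```

In this example the prediction would count as a miss against the visible-light annotation alone but is accepted because it lies within 20 pixels of the thermal annotation.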

Evaluation on the RGBT234 Dataset
We also compared RMFNet with other trackers on a larger-scale dataset, RGBT234, to further validate its effectiveness. The evaluation results on the RGBT234 dataset are analyzed from three aspects: overall performance, performance under 12 challenging attributes, and visualization results in some highly challenging scenarios.
The RGBT234 dataset comprises 12 challenge attributes: low resolution (LR), partial occlusion (PO), scale variation (SV), background clutter (BC), thermal crossover (TC), deformation (DEF), low illumination (LI), camera motion (CM), motion blur (MB), no occlusion (NO), fast motion (FM), and heavy occlusion (HO). We selected 12 trackers from the aforementioned RGB and RGBT trackers to compare results on these 12 attributes. As the evaluation results in Figures 9 and 10 show, RMFNet outperforms the other trackers in most of these challenges. In the PR evaluation, RMFNet performs well in the BC, CM, DEF, LI, and FM challenges, indicating stable performance in these scenarios. However, in the HO, SV, and NO challenges, RMFNet's performance is inferior to that of MACNet, as MACNet makes better use of appearance information than RMFNet. Moreover, RMFNet is outperformed by HDINet in the LR and PO challenges, implying that our model needs improvement in utilizing complementary information. In future work, motion location prediction and camera motion compensation can be incorporated to enhance the model's appearance modelling capability, and better fusion rules can be designed to fully exploit complementary information. In the SR evaluation, RMFNet is inferior only to ECO, and only in the TC challenge. This is because when thermal crossover occurs, the thermal infrared image can no longer distinguish the target, and feature fusion increases the interference of low-quality information, resulting in a loss of tracking accuracy. In this situation, deep features are not as helpful in locating the target as traditional hand-crafted features, such as color and texture. Notably, SR is a stricter evaluation indicator than PR, since it depends on the overlap of the bounding boxes rather than on the centre position alone.
We compared the visual tracking results on four video sequences (elecbike10, man4, Children4, and car41) in the RGBT234 dataset, with the compared algorithms being MANet [13], ECO [30], SGT [38], and MDNet + RGBT [12]. The tracking results of RMFNet and the other algorithms in Figure 11 show that RMFNet maintains good performance even in challenging environments such as cluttered backgrounds, camera motion, appearance changes, and occlusions. For example, in Figure 11a, the target undergoes obvious appearance changes due to a turn in a low-light environment, and all trackers lose the target in the middle of the video, but RMFNet is able to relocate the target later. In Figure 11b, severe occlusion occurs, and all trackers except RMFNet lose the target. RMFNet accurately locates the target despite repeated occlusions by the background. In Figure 11c, the target faces challenges such as low light, camera lens rotation, and deformation, but RMFNet still adapts to the target shape and tracks it well. In Figure 11d, multiple background individuals similar to the target move across it, and the target is partially occluded. RMFNet can still differentiate the target from other objects under cluttered background and partial occlusion conditions.

Ablation Experiments
We conducted ablation experiments on both the GTOT and RGBT234 datasets, which demonstrated the effectiveness of all the proposed components and module improvements presented in this article.

Component Analysis
As shown in Table 2, Baseline is the basic model used in this article. It uses the RT-MDNet network and takes visible light and thermal infrared images as inputs to extract features separately. It then applies element-wise addition to the features output by the third convolutional layer of both modalities to obtain the fused features. Finally, the fused features are fed into three fully connected classification layers for object tracking. LF stands for late decision-level fusion, which passes the visible light, thermal infrared, and fused features separately through three fully connected classification layers to obtain sample scores. The scores of the three branches are then averaged to obtain the final sample score. FB represents the fusion branch, which adds two convolutional layers in parallel between the two modality-specific branches to extract shared features of the two modalities. EFM represents the ECA-based multimodal feature fusion module, which adaptively weights the feature maps of the two modalities output by the shared branch based on their reliability and fuses them into a single representation. RD stands for our improved re-detection algorithm, which incorporates a re-detection process to mitigate the accumulation of failures. AR represents the Alpha-Refine [8] refinement module, which performs fine adjustment of the tracker's coarse localization. As shown in Table 2, every configuration that includes the proposed components outperforms the baseline model, and each component contributes to the network's performance, which supports the effectiveness of our network design.
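The score averaging performed by LF can be sketched as follows. This is a minimal illustration, assuming each branch has already produced one positive score per candidate sample; the function names and the numbers are hypothetical and do not come from the authors' implementation.

```python
def late_fusion_scores(rgb_scores, tir_scores, fused_scores):
    """Average the per-candidate scores of the three branches."""
    if not (len(rgb_scores) == len(tir_scores) == len(fused_scores)):
        raise ValueError("all branches must score the same candidates")
    return [(r + t + f) / 3.0
            for r, t, f in zip(rgb_scores, tir_scores, fused_scores)]

def best_candidate(scores):
    """Index of the candidate sample with the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)

# Invented scores for three candidate boxes.
rgb = [0.2, 0.9, 0.4]    # visible light branch
tir = [0.7, 0.3, 0.5]    # thermal infrared branch
fused = [0.6, 0.5, 0.9]  # fusion branch

final = late_fusion_scores(rgb, tir, fused)
choice = best_candidate(final)
```

In this invented example the visible light branch alone would pick candidate 1, while the averaged decision picks candidate 2, which shows how one unreliable branch can be outvoted at the decision level.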

Multimodal Fusion Module Analysis
Comparing the performance of III and IV in Table 2 verifies that the fused features obtained using the EFM module enhance the tracker's performance. To further validate the fusion strategy used by the EFM module, the following configurations were compared: (1) baseline, which fuses the two-modality features at the third convolutional layer output through element-wise addition; (2) baseline+EFM-AFF, which replaces the attention mechanism in AFF [26] with ECA attention and uses the feature fusion strategy in AFF for feature fusion; and (3) baseline+EFM, which uses the proposed EFM multimodal fusion module for feature fusion. Table 3 shows the results of the comparison. Baseline+EFM outperformed baseline, demonstrating that our multimodal fusion module effectively integrates the features of the two modalities and improves the tracker's performance. Moreover, the performance of baseline+EFM-AFF was inferior to that of baseline, which indicates that our approach is more suitable for multimodal feature fusion than the fusion strategy employed in AFF.
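The difference between element-wise addition and reliability-weighted fusion can be illustrated with a toy sketch. It is not the authors' EFM: it assumes an ECA-style attention (global average pooling, a 1D convolution across channels, and a sigmoid) whose output is treated as a per-channel reliability, with a fixed placeholder kernel instead of learned weights. All names are hypothetical.

```python
import math

def eca_attention(feat, kernel=(1 / 3, 1 / 3, 1 / 3)):
    """ECA-style channel attention for `feat`, a list of channels (flat lists).

    The kernel is a placeholder; in a trained network it would be learned.
    """
    desc = [sum(ch) / len(ch) for ch in feat]      # global average pooling
    pad = len(kernel) // 2
    padded = [0.0] * pad + desc + [0.0] * pad      # zero padding over channels
    conv = [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(desc))]             # 1D conv across channels
    return [1.0 / (1.0 + math.exp(-v)) for v in conv]  # sigmoid

def add_fusion(feat_rgb, feat_tir):
    """Baseline fusion: element-wise addition of the two feature maps."""
    return [[a + b for a, b in zip(cr, ct)]
            for cr, ct in zip(feat_rgb, feat_tir)]

def weighted_fusion(feat_rgb, feat_tir):
    """Weight each channel by the normalised attention of its modality."""
    att_rgb = eca_attention(feat_rgb)
    att_tir = eca_attention(feat_tir)
    fused = []
    for cr, ct, wr, wt in zip(feat_rgb, feat_tir, att_rgb, att_tir):
        w = wr / (wr + wt)  # share given to the visible modality, in (0, 1)
        fused.append([w * a + (1.0 - w) * b for a, b in zip(cr, ct)])
    return fused

# Invented features: two channels with two activations each; the thermal
# input carries no signal, as might happen during thermal crossover.
feat_rgb = [[4.0, 4.0], [2.0, 2.0]]
feat_tir = [[0.0, 0.0], [0.0, 0.0]]
fused_add = add_fusion(feat_rgb, feat_tir)
fused_weighted = weighted_fusion(feat_rgb, feat_tir)
```

When both modalities produce identical features, the weights reduce to 0.5 each and the weighted fusion returns the input unchanged; when one modality is uninformative, its share drops below one half.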

Efficiency Analysis
RMFNet was implemented in Python 3.6 with PyTorch 1.2.0 and run on a Tesla V100 GPU. We compared the performance and runtime of RMFNet on the RGBT234 and GTOT datasets with those of MDNet + RGBT [12], MANet [13], DAPNet [1], and CBPNet [2]. As shown in Table 4, the evaluation results demonstrate that RMFNet is superior to the other trackers in both speed and performance. RMFNet-noAR, which removes the AR module, achieves the highest speed. Notably, under the same conditions, MANet ran at 0.88 fps and 0.64 fps on the RGBT234 and GTOT datasets, respectively.

Conclusions
This article proposes a new robust RGBT tracking network called the Redetection Multimodal Fusion Network (RMFNet). RMFNet uses a three-branch structure to fully exploit the shared and modality-specific information in multimodal data. By combining mid-level feature fusion with late decision-level fusion, RMFNet utilizes the complementary advantages of multimodal information and compensates for the shortcomings of single-modality information. Additionally, RMFNet uses an improved re-detection algorithm to address the failure accumulation problem of the base algorithm, as well as a multi-stage bounding box estimation strategy to further improve the robustness and accuracy of the tracker. In extensive experiments, RMFNet has demonstrated good performance and stable tracking in various challenging scenarios. In the future, we will further investigate how to leverage the temporal and motion information of videos to improve RMFNet and enhance the performance of the tracker.

Conflicts of Interest:
The authors declare no conflict of interest.