SiamSMN: Siamese Cross-Modality Fusion Network for Object Tracking

The existing Siamese trackers have achieved increasingly successful results in visual object tracking. However, the interactive fusion among multi-layer similarity maps after cross-correlation has not been fully studied in previous Siamese network-based methods. To address this issue, we propose a novel Siamese network for visual object tracking, named SiamSMN, which consists of a feature extraction network, a multi-scale fusion module, and a prediction head.


Introduction
Visual object tracking is one of the fundamental tasks in computer vision. It aims to track a given object in each frame of a video sequence. Object detection, which focuses on identifying and locating objects within individual frames, complements object tracking by providing the initial object localization. Together, detection and tracking form a robust framework for many real-world applications. For instance, detection can identify and locate objects in the initial frame, and tracking can ensure continuous observation of these objects across subsequent frames [1]. Object tracking is widely used in many fields, such as visual surveillance [2], human-computer interaction [3], and augmented reality [4]. Despite recent advances, it is still widely acknowledged to be an extremely difficult task because of background clutter, scale variations, significant variations in illumination, etc.
The currently used object tracking methods can be divided into two categories: correlation filter-based trackers [5][6][7][8][9][10][11][12] and deep learning-based trackers [13][14][15][16][17]. In correlation filter-based tracking, a correlation filter is trained online on the region of interest by minimizing a least-squares loss. The object is then detected in consecutive frames by correlating the trained filter with the search region, which is computed efficiently via the Fast Fourier Transform (FFT) [18]. To estimate the object location in the next frame, the learned filter is applied to the region of interest, and the location of the maximum response is taken as the target location. Early correlation filter-based trackers such as MOSSE [5] and CSK [6] exploited intensity features for object tracking. To achieve a more discriminating image representation, Valmadre et al. [7] proposed a correlation filter-based network (CFNet) that is trained offline in an end-to-end manner. Despite significant advancements, correlation filter-based trackers are less robust to fast-moving objects and low-frame-rate videos, and less flexible with respect to scale changes. In addition, further research has proposed several targeted improvements from a variety of angles, including scale handling (e.g., DSST [8], CSR-DCF [9], etc.) and the elimination of boundary effects (SRDCF [10], C-COT [11], ECO [12], etc.). These trackers have a clear advantage in real-time speed, but still need to be improved in situations such as complex background interference and occlusion by similar objects.
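The detection step of a correlation filter tracker can be sketched in a few lines. The following is a minimal 1-D toy under stated assumptions, not the method of any cited tracker: the filter is assumed to be already learned (here it is simply the template itself), a naive O(n^2) DFT stands in for a real FFT routine, and the names `dft`, `circular_correlation`, and `locate` are illustrative.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT routine)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def circular_correlation(signal, filt):
    """Response r[s] = sum_t signal[(t + s) % n] * filt[t], computed in the Fourier domain."""
    if len(signal) != len(filt):
        raise ValueError("signal and filter must have the same length")
    spec_signal = dft(signal)
    spec_filter = dft(filt)
    # Correlation theorem: multiply by the complex conjugate of the filter spectrum.
    product = [s * f.conjugate() for s, f in zip(spec_signal, spec_filter)]
    return [v.real for v in dft(product, inverse=True)]

def locate(signal, filt):
    """Return the shift with the maximum correlation response."""
    response = circular_correlation(signal, filt)
    return max(range(len(response)), key=response.__getitem__)

# Toy data (assumption): the target pattern moves 3 samples to the right.
template = [0, 1, 2, 1, 0, 0, 0, 0]
shifted = template[-3:] + template[:-3]
shift = locate(shifted, template)
```

Here `shift` recovers the displacement of the pattern, which mirrors how the location of the maximum response is taken as the target location.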
Deep learning technologies have significantly advanced visual tracking by providing a powerful feature representation capacity. A variety of tracking methods based on deep learning have been presented, such as FCNT [13], MDNet [14], STCT [15], AD-Net [16], and SiamFC [17]. Among them, Siamese-based trackers have shown the potential to significantly improve tracking performance. Bertinetto et al. [17] first introduced a Siamese network for visual tracking. Since then, object trackers built on Siamese networks and object detection frameworks have achieved state-of-the-art performance, such as SiamRPN [19], SiamRPN++ [20], and SiamMask [21]. Siamese-based trackers formulate the object tracking task as a similarity matching problem by computing cross-correlation similarities between a template image and a search image, which converts tracking into finding the target object in an image region by locating the highest visual similarity. Therefore, they cast the tracking problem into a Region Proposal Network (RPN)-based detection framework by leveraging Siamese networks, which is the key to boosting the performance of deep trackers.
For most of the popular trackers (such as SiamFC [17], SiamRPN [19], and SiamBAN [22]), multi-level similarity maps can provide different representations. Similarity maps from shallow layers focus on low-level information, such as color and shape, which is essential for localization but lacks semantic content; similarity maps from deeper layers carry rich semantic information that is useful in challenging scenarios such as motion blur and large deformation. Thus, the fusion of different similarity maps plays a critical role in accurate target tracking. Weighted sum and concatenation are the common operations for aggregating multi-layer similarity maps. However, these methods can only combine different levels of similarity maps through a fixed linear approach, failing to fully utilize the complementary information from high-level and low-level similarity maps. This limitation prevents the tracker from achieving an interactive fusion of spatial information and semantic cues. Inspired by the transformer architecture [23], we design a novel multi-scale similarity-map fusion module that models the relationship between spatial information from high-resolution layers and semantic cues from low-resolution layers. The fusion module contains only one layer each of feature encoder and feature decoder. The feature encoder aims to learn the interdependencies between different similarity maps, while the decoder aggregates the low-level and high-level semantic information. Another major problem in the tracking process is the inconsistency between classification and regression: the classification probability can be high while the localization is inaccurate. In experiments, we observed that points near the object boundary were more likely to predict accurate locations. Motivated by this observation, we devised a learnable prediction module to refine the bounding boxes based on the predicted offset map. The proposed SiamSMN achieves robust and precise performance under complex scenarios while maintaining good real-time processing capabilities. This is crucial for practical applications that require both speed and accuracy. The main contributions of this work are as follows:

• We design a transformer-based similarity-map fusion module that fully explores the interdependencies among multiple similarity maps associated with different semantic meanings, which helps the tracker accurately locate objects in complex scenarios.

• We propose a learnable prediction module to generate a boundary point for each side based on the rough bounding box, which can solve the problem of inconsistent classification and regression.

• Our method achieves competitive performance with the state-of-the-art trackers on four different benchmarks, while maintaining real-time processing capabilities.

Related Work

Siamese Network-Based Object Tracking
Recently, the Siamese network-based tracking framework has attracted great attention in the visual tracking community due to its end-to-end training capacity and high efficiency. A Siamese tracker consists of two branches: a template branch and a search branch. The template branch receives the target image patch from the previous frame as input, while the search branch receives the target image patch in the current frame as input. Both branches share CNN parameters so that the two image patches are encoded by the same transformation, which is suitable for tracking.
As one of the pioneering works, SiamFC [17] adopted a fully convolutional Siamese network as a feature extractor and introduced correlation layers to combine feature maps. Inspired by the success of SiamFC, more and more researchers began to pay attention to Siamese network tracking methods. Zhu et al. [24] proposed a distractor-aware Siamese network (DaSiamRPN) that utilized a local-to-global search strategy to deal with the challenges of full occlusion and out-of-view targets. Wang et al. [25] put forward a residual attentional Siamese network (RASNet), which embedded an attention mechanism into Siamese trackers to promote the discriminating ability of the tracking model. Other methods include SiamDW [26], SiamMask [21], SiamFC++ [27], etc. Though the above methods utilize a multi-scale strategy to cope with scale variation, they cannot handle aspect ratio changes due to target appearance variations. In order to make more accurate predictions of target locations, SiamRPN [19] combines a Region Proposal Network (RPN) from object detection with a Siamese network. By jointly training a classification branch and a regression branch for the region proposal, SiamRPN [19] avoids the time-consuming step of extracting multi-scale feature maps for object scale invariance and achieves very efficient results. However, it has difficulty dealing with distractors that have a similar appearance to the object. Based on SiamRPN [19], DaSiamRPN [24] increases the hard-negative training data during the training phase. Through this data augmentation, it improves the discrimination of the tracker and obtains a much more robust result. SiamRPN++ [20] optimizes the network architecture by using ResNet [28] as a backbone. At the same time, it randomly shifts the training object location in the search region during model training to eliminate the center bias. Despite these advancements, existing methods still face challenges in effectively fusing multi-layer similarity maps to fully exploit spatial and semantic information. Our proposed method, SiamSMN, aims to address these shortcomings by introducing a novel multi-scale fusion module and a learnable prediction head, thereby enhancing tracking performance.

Transformer in Object Tracking
The Vision Transformer (ViT) [29] first presented a pure vision transformer architecture, obtaining an impressive performance on image classification. Briefly, a transformer is an architecture for transforming one sequence into another with the help of attention-based encoders and decoders. The attention mechanism observes an input sequence and decides at each step which other parts of the sequence are important, facilitating the capture of global information from the input sequence. In recent years, some studies have attempted to introduce the transformer to object tracking and achieved promising performance. Yu et al. [30] proposed a deformable Siamese attention network, referred to as SiamAttn, by introducing a new Siamese attention mechanism that computes deformable self-attention to improve the discriminating ability of target features before applying depth-wise cross-correlation. CGACD [31] learns attention from the correlation result between the template and search region and then adopts the learned attention to enhance the search region features for further classification and regression. TransT [32] is a transformer-based fusion network for incorporating target and search information. Although these works have improved tracking accuracy with the attention mechanism, they still rely heavily on the correlation operation when fusing the template and search region features. In this work, we exploit a transformer to directly fuse multi-layer similarity maps without using any weighted sum or concatenation operations.

Proposed Method
In this section, we present a detailed description of the proposed SiamSMN framework. As shown in Figure 1, our SiamSMN consists of three components: a feature extraction network, a multi-scale fusion module, and a prediction head. First, the feature extraction network separately extracts the features of the template image and the search image. Second, these features are combined by a depth-wise cross-correlation to produce multiple similarity maps. Then, these similarity maps at different scales are aggregated by the proposed feature fusion network. Finally, the fused feature maps are input into the prediction head, which is responsible for classifying the enhanced features and regressing the bounding boxes to generate the final tracking results.

Feature Extraction Network
Like other Siamese-based trackers, the proposed SiamSMN takes a pair of image patches as the inputs of the backbone network. The Siamese backbone network consists of two identical branches. One is the template branch, which receives the template patch as input (denoted as $Z$). The other is the search branch, which receives the search patch as input (denoted as $X$). The two branches share parameters to embed the inputs $Z$ and $X$ into a common feature space for cross-correlation. The cross-correlation between the template and search regions is implemented in the common feature embedding space as follows:

$$S = \phi(X) \star \phi(Z)$$

where $\star$ denotes the channel-by-channel correlation operation. The generated similarity map $S$ has the same number of channels as $\phi(X)$, and it contains rich information for classification and regression. Object tracking requires rich representations that span levels from low to high, scales from small to large, and resolutions from fine to coarse. Many methods take advantage of fusing both low-level and high-level features to improve tracking accuracy. In our network, multi-layer features are extracted to collaboratively infer the target location. We utilize ResNet-50 as our backbone network and use blocks 3, 4, and 5 of ResNet-50 to extract features from the target template and the search region. Features from different blocks of the backbone focus on different hierarchical information about objects. We use a depth-wise cross-correlation to aggregate the features extracted from the last three residual blocks of the backbone, which helps the tracker produce multiple semantic similarity maps. The cross-correlation between the template feature and the search feature is implemented as follows:

$$S_i = \phi_i(X) \star \phi_i(Z), \quad i \in \{3, 4, 5\}$$

where $\star$ represents the cross-correlation operator, and $\phi_i(\cdot)$ is the embedding function for feature extraction, with the subscript $i$ indexing the backbone block. As shown in Figure 1, $S'_3$, $S'_4$, and $S_5$ are fed into the feature fusion module individually to aggregate the multi-layer similarity maps.
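To make the channel-by-channel correlation concrete, the following is a minimal sketch in plain Python. It assumes features stored as nested lists of shape C x H x W, no padding, and stride 1; the function names `xcorr2d` and `depthwise_xcorr` are illustrative and are not taken from the authors' implementation.

```python
def xcorr2d(search, kernel):
    """Slide `kernel` over `search` (both 2-D lists) and return the response map."""
    H, W = len(search), len(search[0])
    h, w = len(kernel), len(kernel[0])
    out = []
    for i in range(H - h + 1):
        row = []
        for j in range(W - w + 1):
            row.append(sum(search[i + a][j + b] * kernel[a][b]
                           for a in range(h) for b in range(w)))
        out.append(row)
    return out

def depthwise_xcorr(search_feat, template_feat):
    """Correlate channel c of the search feature with channel c of the template feature.

    search_feat: C x H x W, template_feat: C x h x w.
    The output keeps C channels, one similarity map per channel.
    """
    if len(search_feat) != len(template_feat):
        raise ValueError("search and template features need the same channel count")
    return [xcorr2d(s, k) for s, k in zip(search_feat, template_feat)]

# Toy example (assumption): a 2-channel 4x4 search feature and a 2x2 template.
template_feat = [[[1, 2], [3, 4]],
                 [[1, 0], [0, 1]]]
search_feat = [[[0, 0, 0, 0],
                [0, 0, 1, 2],
                [0, 0, 3, 4],
                [0, 0, 0, 0]],
               [[1, 0, 0, 0],
                [0, 1, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 0, 0]]]
similarity = depthwise_xcorr(search_feat, template_feat)
```

In this toy input, channel 0 of the search feature contains an exact copy of the template at row 1, column 2, so the response of that channel peaks there, while the number of output channels equals the number of input channels.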

Multi-Scale Fusion Module
Inspired by the transformer [23], we fuse different levels of similarity maps by designing a novel transformer fusion network. Unlike the original transformer [23], our transformer fusion model contains only one layer each for the feature encoder and the feature decoder. The feature encoder aims to learn the interdependencies among different similarity maps, while the feature decoder aggregates the low-level and high-level semantic information.
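For reference, the attention operation of the original transformer [23], on which both the encoder and the decoder build, is commonly written as below. The symbols $d_k$, $h$, $W_i^Q$, $W_i^K$, $W_i^V$, and $W^O$ follow [23] and are not defined in this paper; which similarity maps SiamSMN assigns to $Q$, $K$, and $V$ is specific to its encoder and decoder.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(H_1, \dots, H_h)\, W^{O},
\qquad
H_i = \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right)
```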
Feature encoder: As shown in Figure 2, a learnable position encoding is first used to encode the similarity maps from the 3rd and 4th layers, $S'_3$ and $S'_4$. The result is used as the K and Q inputs of the multi-head attention module, while $S'_3$ serves as its V input. The encoded information is then calculated through FFT and normalization. The output of the encoder is used by the decoder as an input to its multi-head attention module.
Feature decoder: The feature decoder follows the same structure as the encoder. Unlike the encoder, the feature decoder is built without positional encoding and global average pooling. In addition, the feature decoder contains two multi-head attention modules.
Specifically, the first multi-head attention of the decoder operates on the output of the encoder. In order to further increase the tracking accuracy, the second multi-head attention aggregates the semantic information from the low-layer similarity map to produce the fusion result. The final response map ($R^*$) is then calculated through FFT and normalization.
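As a rough illustration of the attention operation that both the encoder and the decoder rely on, the sketch below implements single-head scaled dot-product attention [23] in plain Python. It is a simplification under stated assumptions: each similarity map is assumed to be flattened into a list of token vectors, and the learned projections, multiple heads, residual connections, and normalization are omitted. The function names are illustrative, not the authors' code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    queries: list of d_k-dim vectors, keys: list of d_k-dim vectors,
    values: list of d_v-dim vectors (one per key).
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output token is a weighted average of the value tokens.
        output.append([sum(w * v[c] for w, v in zip(weights, values))
                       for c in range(d_v)])
    return output

# Toy tokens (assumption): queries from one map, keys and values from another.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[2.0, 0.0], [4.0, 0.0]]
fused = attention(Q, K, V)
```

Because each output token is a data-dependent weighted average of the value tokens, the mixing weights change with the input, which is what distinguishes this kind of fusion from a fixed weighted sum.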

Prediction Head
As shown in Figure 1, the classification branch, regression branch, and centerness branch are applied to localize objects and estimate their shapes. When testing, the final score (used for ranking the detected bounding boxes) is computed by multiplying the predicted centerness with the corresponding classification score, which helps suppress low-quality bounding boxes and improves the overall performance by a large margin. For a response map ($R^*$) obtained using the multi-scale similarity-map fusion network, the classification branch outputs a classification feature map $A^{cls}_{w \times h \times 2}$, the regression branch outputs a predicted offset map $A^{reg}_{w \times h \times 4}$, and the centerness branch outputs a centerness feature map $A^{cen}_{w \times h \times 1}$, where each point value gives the "centerness score" of the corresponding location. Each point $(i, j)$ in $A^{cls}_{w \times h \times 2}$ contains a 2D vector, which represents the foreground and background scores of the corresponding location in the search region. Similarly, each point $(i, j)$ in $A^{reg}_{w \times h \times 4}$ contains a 4D vector $t(i, j) = (l, t, r, b)$, which represents the distances from the corresponding location to the four sides of the bounding box in the input search region.
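The ranking rule described above (classification score multiplied by centerness) can be sketched as follows. This is a minimal illustration, assuming the foreground scores and the centerness scores are given as two 2-D lists of the same size; `best_location` is a hypothetical helper name and the numbers are made up.

```python
def best_location(cls_foreground, centerness):
    """Return ((i, j), score) of the point with the highest cls * centerness."""
    best, best_score = None, float("-inf")
    for i, (cls_row, cen_row) in enumerate(zip(cls_foreground, centerness)):
        for j, (c, n) in enumerate(zip(cls_row, cen_row)):
            score = c * n
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score

# Toy maps (assumption): point (0, 0) has the highest foreground score,
# but its low centerness lets point (0, 1) win the ranking.
cls_foreground = [[0.9, 0.8],
                  [0.2, 0.1]]
centerness = [[0.3, 0.9],
              [0.5, 0.5]]
location, score = best_location(cls_foreground, centerness)
```

In the toy maps, the point with the highest classification score is not selected because its centerness is low, which illustrates how the product suppresses low-quality boxes.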
In the experiments, we observed that points near the object boundary were more likely to predict accurate locations. Inspired by this, we propose a box refinement module to refine the bounding boxes based on the regression branch. As shown in Figure 3, the feature map ($F^*$), regression (Reg), and response map ($R^*$) are the three inputs of this box refinement. First, a set of offsets, denoted as $T_t$, is obtained from the feature map through a set of convolution operations. Next, we perform a reshape function on the regression output, and the result is recorded as $Z_0$, which represents the distances from a given point to the four boundaries. Then, we obtain $T_0$ by Equation (8):

$$T_0 = \theta_c(R^*) \tag{8}$$

where $\theta_c(\cdot)$ maps the points on the response map $R^*$ back to the search patch and obtains the generated points on each layer. After that, we utilize $Z_0$ and $T_0$ as the input of $\phi_c(\cdot)$ to obtain a coarse bounding box $B_b$, which is defined as follows:

$$B_b = \phi_c(Z_0, T_0) \tag{9}$$

where $\phi_c(\cdot)$ decodes the distance prediction into a bounding box. Finally, a boundary point is generated for each side of the coarse bounding box based on the set of offsets produced from the feature map. A finer bounding box is generated by aggregating the prediction results of the four boundary points.
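The two mappings used to form the coarse box can be illustrated with a small sketch. It covers only the coarse stage, not the boundary-point refinement. The linear index-to-pixel mapping with a `stride` and an `offset` is an assumption made for illustration (the paper does not state these values), and the Python names `theta_c` and `phi_c` merely mirror the symbols in the text.

```python
def theta_c(i, j, stride=8, offset=0):
    """Map response-map index (row i, column j) to (x, y) in the search patch.

    `stride` and `offset` are hypothetical values chosen for this sketch.
    """
    return offset + j * stride, offset + i * stride

def phi_c(distances, point):
    """Decode distances (l, t, r, b) at `point` into a box (x1, y1, x2, y2)."""
    left, top, right, bottom = distances
    x, y = point
    return (x - left, y - top, x + right, y + bottom)

# Toy values (assumption): the point at row 2, column 3 predicts (l, t, r, b).
point = theta_c(2, 3, stride=8, offset=4)
coarse_box = phi_c((10, 5, 6, 7), point)
```

With these toy values the point maps to (28, 20) in the search patch, and the decoded coarse box spans from (18, 15) to (34, 27).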

Loss function:
The training loss function in this paper is defined as follows:

$$L = L_{cls} + \lambda_1 L_{cen} + \lambda_2 L_{reg}$$

where $L_{cls}$ represents the focal loss for classification, $L_{cen}$ is the binary cross-entropy loss for centerness, and $L_{reg}$ is the IoU loss for regression. $\lambda_1$ and $\lambda_2$ are the weight parameters of $L_{cen}$ and $L_{reg}$, respectively. During model training, we empirically set $\lambda_1 = 1$ and $\lambda_2 = 3$.
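The weighted combination and the three loss types named in the text can be sketched per sample as follows. Several details are assumptions rather than statements from the paper: the focal loss uses the commonly cited defaults `gamma=2` and `alpha=0.25`, the IoU loss is taken as `1 - IoU` on boxes in `(x1, y1, x2, y2)` form, and probabilities are clamped for numerical safety.

```python
import math

EPS = 1e-12

def _clamp(p):
    """Keep a probability strictly inside (0, 1) so that log() is defined."""
    return min(max(p, EPS), 1.0 - EPS)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for predicted foreground probability p and label y in {0, 1}."""
    p = _clamp(p)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def bce_loss(p, y):
    """Binary cross-entropy for prediction p and target y in [0, 1]."""
    p = _clamp(p)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def iou_loss(pred, gt):
    """1 - IoU for two boxes given as (x1, y1, x2, y2) (assumed form of the IoU loss)."""
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    return 1.0 - inter / union if union > 0 else 1.0

def total_loss(l_cls, l_cen, l_reg, lambda1=1.0, lambda2=3.0):
    """L = L_cls + lambda1 * L_cen + lambda2 * L_reg with the weights from the text."""
    return l_cls + lambda1 * l_cen + lambda2 * l_reg
```

With the stated weights, a unit change in the regression term moves the total three times as much as the same change in the centerness term.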
For the video datasets, we directly sampled image pairs from one video sequence to collect training samples. For the COCO detection dataset, we applied transformations to the original images to generate pairs. Common data augmentation techniques were applied to enlarge the training set. For easy comparison, the input sizes of the search patch and the template patch were 255 × 255 and 127 × 127, respectively. The backbone parameters were pretrained on ImageNet and then used as the initialization to retrain our model.
Training details: In total, there were 20 epochs. For the first 10 epochs, the parameters of the Siamese sub-network were frozen while the classification and regression sub-networks were trained. For the last 10 epochs, the last three blocks of ResNet-50 were unfrozen and trained together with them. In addition, stochastic gradient descent (SGD) was adopted, and the batch size, momentum, and weight decay were set to 32, 0.9, and 0.0001, respectively. Our tracker was trained in Python using PyTorch on a PC with an RTX 2080 Ti. For fair comparison, our approach was trained with only the specified training set provided by the official website.
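The two-stage schedule can be summarized with a small sketch. It only encodes the values stated in the text (20 epochs, 10 frozen epochs, batch size 32, momentum 0.9, weight decay 0.0001); the learning rate is not given in the text and is therefore left out. The group names and the helper `trainable_groups` are hypothetical labels for illustration, not identifiers from the authors' code.

```python
# Optimizer settings stated in the text (the learning rate is not specified there).
SGD_CONFIG = {"batch_size": 32, "momentum": 0.9, "weight_decay": 1e-4}

TOTAL_EPOCHS = 20
FROZEN_EPOCHS = 10

def trainable_groups(epoch):
    """Return the parameter groups updated in a given 0-based epoch."""
    if not 0 <= epoch < TOTAL_EPOCHS:
        raise ValueError("epoch out of range")
    groups = ["classification_and_regression_subnetworks"]
    if epoch >= FROZEN_EPOCHS:
        # Second stage: the last three ResNet-50 blocks are trained as well.
        groups.append("resnet50_last_three_blocks")
    return groups

schedule = [trainable_groups(e) for e in range(TOTAL_EPOCHS)]
```

In a PyTorch training loop, such a schedule would typically be realized by toggling `requires_grad` on the backbone parameters at the start of the eleventh epoch.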
Testing details: During the testing process, we used an offline tracking strategy. Only the object in the initial frame of a sequence was adopted as the template. Consequently, the template branch of the Siamese sub-network could be pre-computed and fixed during the whole tracking period. The search region in the current frame was adopted as the input of the search branch.

Comparison with State-of-the-Art Trackers
We compared our approach with the state-of-the-art trackers on four tracking datasets.

OTB100
The OTB100 [35] dataset is a public tracking benchmark that contains 100 sequences from different scenes. All frames in the sequence are divided into seven categories: camera motion, illumination change, occlusion, size change, motion change, unassigned, and overall. The shortest sequence in OTB100, "Deer", has 71 frames, and the longest sequence, "Doll", has 3872 frames. The average length of each sequence in this benchmark is about 590 frames. We followed the one-pass evaluation (OPE) protocol and report the AUC scores of the success plot.
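For readers unfamiliar with the success plot, the sketch below shows how an AUC score is commonly derived from per-frame overlaps. It is an illustration under assumptions, not the official toolkit: boxes are taken as `(x, y, w, h)`, the success rate is the fraction of frames whose IoU exceeds a threshold, 21 evenly spaced thresholds in [0, 1] are used, and the AUC is the mean success rate over those thresholds.

```python
def iou_xywh(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_curve(pred_boxes, gt_boxes, num_thresholds=21):
    """Success rate (fraction of frames with IoU > threshold) for each threshold."""
    overlaps = [iou_xywh(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    thresholds = [k / (num_thresholds - 1) for k in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return thresholds, rates

def success_auc(pred_boxes, gt_boxes, num_thresholds=21):
    """Area under the success curve, approximated by the mean success rate."""
    _, rates = success_curve(pred_boxes, gt_boxes, num_thresholds)
    return sum(rates) / len(rates)

# Toy sequence (assumption): two ground-truth boxes.
gt = [(0, 0, 10, 10), (5, 5, 20, 10)]
perfect_auc = success_auc(gt, gt)
missed_auc = success_auc([(100, 100, 10, 10), (200, 200, 20, 10)], gt)
```

Under the strict `>` comparison used here, even a perfect tracker scores zero at the threshold 1.0, so its AUC is slightly below 1 rather than exactly 1.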

UAV123
The UAV123 [37] dataset contains a total of 123 video sequences, including more than 110K frames. All sequences are fully annotated with upright bounding boxes. The objects in the dataset mainly suffer from fast motion, large scale variation, large illumination variation, and occlusions, which make tracking challenging.

LaSOT
To further validate the proposed framework on a larger and more challenging dataset, we conducted experiments on LaSOT [34]. The LaSOT dataset provides large-scale, high-quality dense annotations with 1400 videos in total and 280 videos in the testing set. Such a large test dataset brings a great challenge to tracking algorithms. The official website of LaSOT provides 35 algorithms as baselines. Normalized precision plots, precision plots, and success plots in one-pass evaluation (OPE) were used as the indicators.
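The precision-based indicators can be illustrated with a short sketch. The definitions used here are the commonly used ones and are assumptions with respect to this paper: precision is the fraction of frames whose center location error is within a pixel threshold (20 pixels is a frequent choice), and the normalized error divides the center offsets by the ground-truth width and height. Boxes are taken as `(x, y, w, h)` and the helper names are illustrative.

```python
import math

def center(box):
    """Center (cx, cy) of a box given as (x, y, w, h)."""
    x, y, w, h = box
    return x + w / 2.0, y + h / 2.0

def precision(pred_boxes, gt_boxes, threshold=20.0):
    """Fraction of frames whose center error is at most `threshold` pixels."""
    hits = 0
    for p, g in zip(pred_boxes, gt_boxes):
        (px, py), (gx, gy) = center(p), center(g)
        if math.hypot(px - gx, py - gy) <= threshold:
            hits += 1
    return hits / len(gt_boxes)

def normalized_errors(pred_boxes, gt_boxes):
    """Center errors with each axis divided by the ground-truth box size."""
    errors = []
    for p, g in zip(pred_boxes, gt_boxes):
        (px, py), (gx, gy) = center(p), center(g)
        errors.append(math.hypot((px - gx) / g[2], (py - gy) / g[3]))
    return errors

# Toy sequence (assumption): one close prediction and one far prediction.
gt = [(0, 0, 10, 10), (0, 0, 10, 10)]
pred = [(3, 4, 10, 10), (30, 0, 10, 10)]
prec_20 = precision(pred, gt)
norm_err = normalized_errors(pred, gt)
```

Normalizing by the ground-truth size makes the error comparable across targets of different scales, which is the motivation for reporting normalized precision alongside precision.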
We compared our SiamSMN with the top-nine trackers, including SiamBAN [22], ATOM [40], SiamRPN++ [20], SiamMask [21], and so on. The results of SiamBAN [22] are provided on the website of its authors, while the other results are provided by the official website of LaSOT. Figure 6 reports the overall performance of our SiamSMN tracker on the LaSOT testing set. SiamSMN increased the AUC and the normalized precision by 1.6% and 1.4%, respectively, in relative terms over SiamBAN [22], which is the best tracker reported in the original paper.

GOT-10K
GOT-10K [41] is a recent large-scale dataset that contains 10K sequences for training and 180 for testing. After the tracking results are uploaded, the analysis is performed automatically by the official website. The provided evaluation indicators include success plots, average overlap (AO), and success rate (SR). All the results are provided by the official website of GOT-10K. As shown in Table 1, our tracker ranked first in terms of all the indicators. Compared with Ocean [36], our SiamSMN improved the scores by 2.1%, 3.7%, and 5.0%, in relative terms, for AO, $SR_{0.5}$, and $SR_{0.75}$, respectively.
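The two GOT-10K indicators can be computed from per-frame overlaps as sketched below. This is a simplified illustration, not the official evaluation server: AO is taken as the mean overlap and SR at a threshold as the fraction of frames whose overlap exceeds it. The overlap values are made-up toy numbers, not results from this paper.

```python
def average_overlap(overlaps):
    """AO: mean of the per-frame overlaps (IoU values in [0, 1])."""
    return sum(overlaps) / len(overlaps)

def success_rate(overlaps, threshold):
    """SR: fraction of frames whose overlap exceeds `threshold`."""
    return sum(o > threshold for o in overlaps) / len(overlaps)

# Toy per-frame overlaps (assumption).
overlaps = [0.9, 0.6, 0.4, 0.8]
ao = average_overlap(overlaps)
sr_50 = success_rate(overlaps, 0.5)
sr_75 = success_rate(overlaps, 0.75)
```

The stricter threshold of 0.75 rewards tightly fitting boxes, so a gain in that indicator points to better box quality rather than just fewer lost targets.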

Ablation Study
To analyze and verify the effectiveness of each proposed module, an ablation experiment was performed on the UAV123 [37] dataset.

Box Refinement
To verify the effectiveness of the box refinement (BR), an ablation experiment was performed, and the results are shown in Table 2. Without box refinement, our method reached 63.6% and 82.5% in success and precision, respectively. When we added the box refinement, the success and precision improved by 2.4% and 2.2%, respectively. The results in Table 2 demonstrate that the box refinement consistently improves tracking performance.

Multi-Scale Fusion Module
To analyze the effectiveness of the multi-scale fusion module (MFM), we designed three variants: weighted sum, concatenation, and MFM. As shown in Table 3, the MFM yielded better results than the other two methods. When we used the multi-scale fusion module to fuse multi-layer features, our method showed a clear improvement in tracking performance compared to the traditional fusion methods (weighted sum and concatenation).
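For comparison, the two baseline fusion variants can be sketched as follows. The sketch assumes each similarity map is a list of channels and each channel is a flat list of responses with identical sizes across maps; the weights are illustrative values, and the function names are not taken from the authors' code.

```python
def weighted_sum_fusion(maps, weights):
    """Fuse maps element-wise: out[c][k] = sum_m weights[m] * maps[m][c][k]."""
    if len(maps) != len(weights):
        raise ValueError("one weight per map is required")
    channels = len(maps[0])
    size = len(maps[0][0])
    return [[sum(w * m[c][k] for w, m in zip(weights, maps))
             for k in range(size)]
            for c in range(channels)]

def concatenation_fusion(maps):
    """Fuse maps by stacking their channels one after another."""
    fused = []
    for m in maps:
        fused.extend(m)
    return fused

# Toy maps (assumption): two single-channel maps with two responses each.
map_a = [[1.0, 2.0]]
map_b = [[3.0, 4.0]]
summed = weighted_sum_fusion([map_a, map_b], [0.25, 0.75])
stacked = concatenation_fusion([map_a, map_b])
```

Both variants apply the same combination regardless of the input content, which is the fixed linear behavior that the attention-based fusion is intended to replace.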

Speed Analysis
Table 4 reports the success and the speed, in frames per second (FPS), of different methods on OTB100. The reported speed of our method was evaluated on a machine with one RTX 2080 Ti, and the speeds of the other methods are taken from the OTB100 official results. As shown in the table, although TransT [32] was faster than our method, its accuracy was 3.8% lower than that of our method. In addition, our network was much simpler than the others, and no specially designed parameters were needed for training.

Conclusions
In this paper, we exploited the expressive power of the transformer and proposed a simple yet effective visual tracking framework named SiamSMN that fully explores the interdependencies among multi-level similarity maps. SiamSMN directly classifies objects and regresses bounding boxes in a unified network and does not require pre-defined candidate boxes. Experimental results demonstrated that the proposed SiamSMN method could achieve competitive performance and real-time speed on four popular tracking benchmark datasets, confirming its effectiveness and efficiency.

Figure 1. Framework of SiamSMN, which contains a feature extraction network, a multi-scale fusion module, and a prediction head.

Figure 2. Detailed workflow of the multi-scale fusion module. The left sub-window illustrates the feature encoder, in which the position-encoded result is used as the K and Q inputs of the multi-head attention module and $S'_3$ serves as its V input. The right sub-window shows the structure of the decoder.

Figure 3. Illustration of the box refinement.

Figure 6. Comparisons among the top-10 trackers on LaSOT. Our SiamSMN significantly outperforms the state-of-the-art methods.

Table 1. Comparison results on the GOT-10K test set. The best two results are highlighted in red and blue fonts, respectively.

Table 2. The ablation study results of the box refinement (BR). The best results are highlighted in red.

Table 3. The ablation study results of the multi-scale feature fusion. The best results are highlighted in red.

Table 4. The results in terms of success and speed for different methods on OTB100. The best results are highlighted in red.