Decoupled Cross-Modal Transformer for Referring Video Object Segmentation

Referring video object segmentation (R-VOS) is a fundamental vision-language task which aims to segment the target referred to by a language expression in all video frames. Existing query-based R-VOS methods have conducted in-depth exploration of the interaction and alignment between visual and linguistic features but fail to transfer the information of the two modalities to the query vector with balanced intensities. Furthermore, most of the traditional approaches suffer from severe information loss in the process of multi-scale feature fusion, resulting in inaccurate segmentation. In this paper, we propose DCT, an end-to-end decoupled cross-modal transformer for referring video object segmentation, to better utilize multi-modal and multi-scale information. Specifically, we first design a Language-Guided Visual Enhancement Module (LGVE) to transmit discriminative linguistic information to visual features of all levels, performing an initial filtering of irrelevant background regions. Then, we propose a decoupled transformer decoder, using a set of object queries to gather entity-related information from both visual and linguistic features independently, mitigating the attention bias caused by feature size differences. Finally, the Cross-layer Feature Pyramid Network (CFPN) is introduced to preserve more visual details by establishing direct cross-layer communication. Extensive experiments have been carried out on A2D-Sentences, JHMDB-Sentences and Ref-Youtube-VOS. The results show that DCT achieves competitive segmentation accuracy compared with the state-of-the-art methods.


Introduction
Referring video object segmentation (R-VOS) is an emerging subtask in the field of video segmentation. Its objective is to segment the regions of interest within video frames based on provided natural language expressions. In contrast to traditional semi-supervised video object segmentation methods [1][2][3], R-VOS only requires a simple textual description of the target object instead of manually annotating it in the first video frame (or a few frames). Consequently, R-VOS avoids intricate and costly annotation procedures, enhancing its user-friendliness. The academic community has recently exhibited great interest in this task due to its promising potential in applications such as intelligent surveillance video processing, video editing, and human-computer interaction.
Referring video segmentation addresses two crucial issues: how to make the model understand which object is the referred one, and how to accurately segment the target from the background.
The key to addressing the former issue lies in accomplishing fine-grained cross-modal information interaction and aggregation. Most of the early methods involve direct fusion of the visual and linguistic features. Common strategies include concatenation-convolution [4,5], recurrent LSTM [6], dynamic filters [7,8] and attention mechanisms [9][10][11][12]. In contrast, some recent approaches [13][14][15][16] introduce additional variables as a bridge for aggregation. They employ a set of object queries to collect and store entity-related information from multi-modal features. Subsequently, these queries are linked across frames to achieve the tracking effect. Methods of this type can better model the cross-modal dependencies, but often overlook the scale disparity between visual and linguistic features. Since visual features are often tens of times longer than linguistic features, the object queries will be overly biased to focus on visual content.
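This bias can be illustrated with a toy numpy sketch (purely illustrative; the token counts and dimensions are hypothetical): when a single query attends over the concatenation of visual and linguistic tokens drawn from the same distribution, the attention mass each modality receives grows with its token count, so the much longer visual sequence dominates.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 32
n_visual, n_text = 400, 20          # visual tokens vastly outnumber the words

query = rng.normal(size=d)
keys = rng.normal(size=(n_visual + n_text, d))  # concatenated modalities

attn = softmax(keys @ query / np.sqrt(d))
visual_mass = attn[:n_visual].sum()   # attention spent on visual tokens
text_mass = attn[n_visual:].sum()     # attention spent on linguistic tokens
# With similarly distributed keys, the attention mass is roughly
# proportional to the token count, so the visual side dominates.
```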
The latter issue, which is to precisely identify the boundary between the referred object and the background, is also important for improving segmentation accuracy. Most of the existing R-VOS methods [11,13,17] use a conventional Feature Pyramid Network (FPN) [18] to fuse features of multiple levels. Despite showing good performance, there is still a lot of room for improvement because of the information loss in this progressive fusion process.
To address the above problems, we propose DCT, an end-to-end decoupled cross-modal transformer for referring video object segmentation, which adopts a DETR-like [19] framework. In particular, in order to precisely distinguish the referred target, we first propose a Language-Guided Visual Enhancement Module (LGVE). It uses cross-modal attention operations to transmit discriminative linguistic information to visual features of all levels, so as to strengthen the response of the referred area and preliminarily filter out irrelevant background. In addition, a decoupled transformer decoder is designed. In this module, object queries interact with visual and linguistic features in parallel to reduce the attention bias caused by feature size differences. For accurate boundary locating, we leverage the existing Cross-layer Feature Pyramid Network (CFPN) [20] structure to conduct direct communication across multiple layers, which reduces the information loss during the commonly used stage-wise fusion process.
In summary, the main contributions of this paper are as follows:

• An end-to-end unified network termed DCT is proposed to tackle referring video object segmentation, which sufficiently utilizes multi-modal information and aggregates multi-scale visual features.

• The Language-Guided Visual Enhancement Module (LGVE) and the decoupled transformer decoder are constructed to establish coordinated information interactions among object queries, visual features and linguistic features.

• The Cross-layer Feature Pyramid Network (CFPN) is brought in to reduce the information loss in the progressive fusion process.

• Experiments on three benchmarks demonstrate that our proposed method achieves competitive segmentation accuracy compared with the state-of-the-art methods.

Related Work

Referring Video Object Segmentation
Referring video object segmentation (R-VOS) is an important research area at the intersection of computer vision and natural language processing. Existing R-VOS methods can be broadly categorized into three types: propagation-based methods [5,12,21], matching-based methods [22,23], and query-based methods [13,14,16].
Propagation-based methods apply image-level referring object segmentation methods [9,24] on individual video frames and then acquire important temporal context through mask propagation. Seo et al. [12] designed a memory attention module based on a self-attention architecture to propagate spatial and temporal information from memory frames to the current frame, enhancing the temporal consistency of the segmentation results. Hui et al. [21] used textual information to guide the weighted combination of features from the current frame and reference frames, further optimizing the feature representation of the referred target. Propagation-based R-VOS methods are simple and fast, but prone to error accumulation, particularly when there are significant changes in the appearance of the target.
Matching-based methods divide the R-VOS task into two steps: trajectory generation and cross-modal matching. In the trajectory generation step, a model for instance segmentation or object detection is used to identify all the objects. Then, the objects are associated across the entire video to construct a collection of object trajectories. In the cross-modal matching step, the methods compute the relevance scores between each trajectory and the language description and select the pair which matches best. However, methods of this type [22,23] have a more complex training process as they require separate optimization for their multiple submodules.
Query-based methods view R-VOS as a sequence prediction problem. They introduce a set of object queries to represent video entities and link them across frames to achieve natural tracking. Botach et al. [13] propose the first query-based R-VOS method, named MTTR. The method includes no text-related inductive bias modules or post-processing operations, which greatly simplifies the segmentation process. However, unlike object detection and panoptic segmentation, object attributes in R-VOS are more random and difficult to accurately describe by fixed query vectors. In this regard, Wu et al. [14] take the given language expression as a constraint and generate query vectors online to make them more focused on the referred target.

Transformer
Transformer [25] is an attention-based encoder-decoder architecture with a remarkable ability to capture long-term global dependencies. It was originally used for sequence modeling in machine translation and has since been widely applied in natural language processing (NLP) [26,27] and computer vision (CV) [28,29] tasks. More recently, the transformer has also been introduced to the highly regarded multi-modal domain, which provides valuable insights for R-VOS. For example, the large-scale pre-training model CLIP [30] uses transformers to extract visual and text features and accurately align them in the embedding space through contrastive learning. Ding et al. [31] exploit the transformer as a cross-modal decoder for referring image segmentation and propose a query generation module based on multi-head attention, which can comprehend the given language expression from different perspectives under the guidance of visual cues. The proposal of DETR [19], an end-to-end object detector, is a significant milestone in the development of the transformer. It introduces the query-based paradigm and simplifies the conventional pipeline of object detection. MDETR [32] extends this idea to the field of referring expression comprehension, proposing an end-to-end modulated detector that detects objects in an image conditioned on a raw text query. VisTR [33] employs a non-auto-regressive transformer to supervise and segment the video instances in parallel at the sequence level. Considering the simplicity and efficiency of this DETR-like framework, our proposed method also adopts this architecture, but further addresses the undesirable attention bias in the interaction between object queries and multi-modal features.

Overall Pipeline
Given a T-frame video clip V = {v_t}_{t=1}^T with a spatial resolution of H × W and an L-word text expression R = {r_l}_{l=1}^L, the aim of DCT is to generate a binary segmentation mask sequence M = {m_t}_{t=1}^T for the referred object. The overall pipeline of DCT is shown in Figure 1. It consists of four components: Feature Extraction and Enhancement, the decoupled transformer decoder, Instance Segmentation, and Instance Sequence Matching.
Feature Extraction and Enhancement. For the given video-text pair, we first use a visual encoder and a linguistic encoder for feature extraction, then use LGVE to achieve language-guided visual enhancement. Specifically, the Video Swin transformer [34] is adopted to extract the multi-level visual features of the video frames, denoted as F_v^i, i = 1, 2, 3. Meanwhile, a pretrained language model, RoBERTa [35], is employed to extract the word-level linguistic feature F_l ∈ R^{C_l×L}. Considering that the visual features do not include a particular focus on the referred object, we propose the LGVE to highlight language-related visual regions, generating the enhanced visual features F_e^i, i = 1, 2, 3.
Decoupled Transformer Decoder. In this module, we introduce a set of N object queries Q = {q_t}_{t=1}^T, q_t ∈ R^{N×C_q}, to represent the instances for each frame. Firstly, the top-level feature F_e^3 output by LGVE is picked out and combined with fixed positional encodings. Then, we feed Q into stacked transformer decoder layers along with the features F_l and F_e^3 to collect and store entity-related information. Details on this can be found in Section 4.3.
Instance Segmentation. On top of the transformer decoder and LGVE, we build a CFPN spatial decoder and two prediction heads to obtain the mask sequences. In particular, the CFPN spatial decoder takes the multi-level enhanced visual features as inputs and outputs the segmentation feature map F_seg for each frame. The mask head comprises three stacked linear layers and produces dynamic kernels Ω from each object query, which are then convolved with F_seg to obtain N instance sequences. The class head is a single-layer perceptron. It predicts the binary confidence score of each sequence, which indicates whether it matches the referred target.
Instance Sequence Matching and Loss. Having the instance sequences and their class scores, we proceed to find the optimal assignment between the ground truth and the predictions using Hungarian matching [36]. The loss of DCT is the same as the one in [13], which is composed of L_cls and L_mask. Specifically, L_cls is a cross-entropy loss while L_mask is a combination of the Dice [37] and binary Focal loss [38]. The whole loss function is as follows:

L = λ_cls · L_cls + λ_d · L_dice + λ_f · L_focal, (1)

where λ_cls, λ_d and λ_f are three hyperparameters. More detailed settings can be seen in Section 4.2.
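As a hedged sketch of this combination (simplified per-pixel formulations of the Dice and binary focal terms following their common definitions, not necessarily the exact implementation in [13]; all names and the sample values are illustrative):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # Dice loss: 1 minus the soft Dice coefficient of the two masks
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def focal_loss(pred, target, alpha=0.25, gamma=2.0, eps=1e-6):
    # Binary focal loss averaged over pixels (common definition)
    p = np.clip(pred, eps, 1 - eps)
    pt = np.where(target == 1, p, 1 - p)
    a = np.where(target == 1, alpha, 1 - alpha)
    return (-a * (1 - pt) ** gamma * np.log(pt)).mean()

def total_loss(cls_loss, pred_mask, gt_mask, lam_cls=1.0, lam_d=1.0, lam_f=1.0):
    # L = lambda_cls * L_cls + lambda_d * L_dice + lambda_f * L_focal
    return (lam_cls * cls_loss
            + lam_d * dice_loss(pred_mask, gt_mask)
            + lam_f * focal_loss(pred_mask, gt_mask))

pred = np.array([[0.9, 0.8], [0.2, 0.1]])   # predicted mask probabilities
gt = np.array([[1.0, 1.0], [0.0, 0.0]])     # binary ground truth
loss = total_loss(0.1, pred, gt)
```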

Language-Guided Visual Enhancement
Since the visual features initially extracted by the backbone contain no special concentration on the referred object, it is important to convey the target-related linguistic semantics to redistribute their attention. MTTR [13] uses a standard self-attention operation to facilitate information exchange only between the linguistic feature and the top-level visual feature, which may widen the semantic gap within the visual features of different levels. ReferFormer [14] proposes a Cross-modal Feature Pyramid Network to perform multi-scale cross-modal fusion but places it behind the transformer. As a result, the transformer still encounters difficulty in clearly identifying the target. Therefore, we design the Language-Guided Visual Enhancement Module, which incorporates linguistic information into visual features of all levels and puts it ahead. This module acts as a coarse locator and performs an initial filtering of irrelevant background regions.

The structure of LGVE is shown in Figure 2. Here, we use f_v to represent the multi-level visual features of a single frame, and f_v^i is the one of level i. Firstly, our LGVE generates the query (Q) and the intermediate visual feature f_mid^i from f_v^i with 1 × 1 point-wise convolutions followed by 3 × 3 depth-wise convolutions, aggregating pixel-wise cross-channel context and channel-wise spatial context. Simultaneously, the key (K) and value (V) are generated from the linguistic feature F_l with two linear projections. Then, we use cross-attention operations to assemble word-level linguistic features at each spatial location and produce a vision-language correlation filter S, whose dimension is the same as that of f_v^i. The above process is shown in Equations (2) and (3):

Q, f_mid^i = DWConv(PWConv(f_v^i)), K, V = Linear(F_l), (2)

S = Softmax(Q K^T / √d) V, (3)

where d is the channel dimension of the projected features.

Finally, the intermediate feature f_mid^i is multiplied element-wise by the spatial filter S to obtain the i-th level enhanced visual feature f_e^i:

f_e^i = S ⊙ f_mid^i. (4)
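The LGVE attention above can be sketched in numpy for a single frame and level (plain linear maps stand in for the point-wise/depth-wise convolutions, and multi-head details are omitted; all shapes and weights are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
d, hw, n_words = 16, 64, 8            # channels, spatial positions, words

f_v = rng.normal(size=(hw, d))        # flattened level-i visual feature
F_l = rng.normal(size=(n_words, d))   # word-level linguistic feature

# Projections standing in for the 1x1 point-wise + 3x3 depth-wise convs
W_q, W_mid = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_k, W_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))

Q, f_mid = f_v @ W_q, f_v @ W_mid
K, V = F_l @ W_k, F_l @ W_v

# Vision-language correlation filter S: one weighted word mixture per pixel
S = softmax(Q @ K.T / np.sqrt(d)) @ V   # same shape as the visual feature
f_e = S * f_mid                         # Eq. (4): element-wise gating
```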

Decoupled Transformer Decoder
The transformer module is a key component in query-based R-VOS methods. Its objective is to gather entity-related information from the multi-modal features and store it in the object queries. However, the decoding process of existing works ignores the severe imbalance in the sizes of multi-modal features. As the length of the visual feature is much longer than that of the linguistic feature (often 20 times longer or more), the object queries may be greatly biased toward the former, which is not conducive to a fine-grained understanding of the language expression.
To address this problem, we consider separately interacting the object queries with features of the two modalities and merging the results with adaptive weights. As shown in Figure 3, the decoupled transformer decoder has N_d layers. In each decoder layer, we first perform a self-attention operation upon the object queries q to model their inner relationship and output q_s. Then, two cross-attention layers are constructed to collect information from the linguistic feature F_l and the enhanced visual feature f_e^3 independently, generating the single-modal subqueries q_text and q_video. Finally, the updated query q′ is obtained by a weighted sum of the subqueries, where the weights are learned from the subquery embeddings with linear projections. The above process can be represented by Equations (5)-(7):

q_s = SelfAttn(q), (5)

q_text = CrossAttn(q_s, F_l), q_video = CrossAttn(q_s, f_e^3), (6)

q′ = α_text ⊙ q_text + α_video ⊙ q_video, (7)

where α_text and α_video are the adaptive weights predicted from the subqueries by linear projections.
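One decoder layer can be sketched as follows (single-head attention with a shared key/value matrix for brevity; the merge-weight projection `w` is a hypothetical stand-in for the learned linear layers described above):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn(q, kv, d):
    # Simplified attention: kv serves as both keys and values
    return softmax(q @ kv.T / np.sqrt(d)) @ kv

rng = np.random.default_rng(2)
d, n_q, n_text, n_vis = 16, 5, 8, 64
q = rng.normal(size=(n_q, d))         # object queries
F_l = rng.normal(size=(n_text, d))    # linguistic feature
F_e3 = rng.normal(size=(n_vis, d))    # top-level enhanced visual feature

q_s = attn(q, q, d)                   # self-attention among the queries
q_text = attn(q_s, F_l, d)            # language subquery (independent path)
q_video = attn(q_s, F_e3, d)          # visual subquery (independent path)

# Adaptive merge: weights predicted from the subqueries themselves
w = rng.normal(size=(d, 1))           # stand-in for the learned projection
a = softmax(np.concatenate([q_text @ w, q_video @ w], axis=1))
q_new = a[:, :1] * q_text + a[:, 1:] * q_video
```

Decoupling the two cross-attention paths means neither modality's token count affects the other's contribution; the balance is decided only by the learned merge weights.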

Cross-Layer Feature Pyramid Network
As described earlier, FPN is a classical method for fusing multi-scale features in R-VOS methods and brings about obvious improvements in most cases. However, in this kind of progressive fusion process, the low-level visual information (e.g., object texture and boundary details) is accessed only once in the final fusion stage, resulting in low-quality segmentation of the edges. Additionally, as the fusion proceeds, high-level semantic cues are gradually diluted, which diminishes the model's ability to recognize the referred target. In view of the above two issues, we replace the standard FPN with the Cross-layer Feature Pyramid Network (CFPN) [20]. Compared with FPN, CFPN promotes the information exchange among visual features of all layers by aggregating them simultaneously, thus generating a segmentation feature map rich in both semantics and spatial details.
As shown in Figure 4, CFPN first performs global average pooling (GAP) and concatenation operations on the enhanced visual features, resulting in a 1D global representation Z. Then, we transform Z into the layer-wise fusion weights ψ = {ψ_i}_{i=1}^3 with a two-layer perceptron, formulated as

ψ = ReLU(Z θ_1) θ_2, (8)

where θ_1 ∈ R^{D×Y} and θ_2 ∈ R^{Y×3} are fully connected layers, D = C_1 + C_2 + C_3 is the channel number of Z, Y is set to 256 empirically, and ReLU refers to the ReLU activation function. Afterwards, the dynamic fusion weights are used to rescale the original visual features and form the aggregated visual representation f_g:

f_g = (ψ_1 · f_e^1) ⊕ UP(ψ_2 · f_e^2) ⊕ UP(ψ_3 · f_e^3), (9)

where ⊕ denotes the concatenation operation and UP refers to upsampling. Considering that f_g is a naïve concatenation of the multi-level features, CFPN further constructs a cross-layer feature distribution structure. To be more precise, f_g is fed into a set of average pooling layers followed by 3 × 3 convolutions to generate the redistributed features f_d^i, i = 1, 2, 3, which are subsequently merged in a top-down manner to yield the final segmentation feature map f_seg. In contrast to FPN, the feature maps in f_d collect information from the full spectrum of the multi-level representation. This enables the retention of more discriminative and complementary visual information during the fusion process.
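The weighting-and-aggregation step can be sketched as follows (one frame; nearest-neighbour repetition stands in for upsampling, and squashing the weights with a sigmoid is our assumption, since the final activation is not specified; all shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
# Three enhanced visual features (C_i, H_i, W_i), coarser as the level grows
feats = [rng.normal(size=(32, 16, 16)),
         rng.normal(size=(64, 8, 8)),
         rng.normal(size=(128, 4, 4))]

# GAP + concatenation -> 1D global representation Z with D = C1 + C2 + C3
Z = np.concatenate([f.mean(axis=(1, 2)) for f in feats])

# Two-layer perceptron producing the layer-wise fusion weights psi
D, Y = Z.shape[0], 256
theta1, theta2 = rng.normal(size=(D, Y)), rng.normal(size=(Y, 3))
psi = 1 / (1 + np.exp(-(np.maximum(Z @ theta1, 0) @ theta2)))  # sigmoid assumed

def upsample(f, s):
    # Nearest-neighbour stand-in for the UP operator
    return f.repeat(s, axis=1).repeat(s, axis=2)

# Rescale each layer, bring all to the finest resolution, concatenate
f_g = np.concatenate([psi[0] * feats[0],
                      psi[1] * upsample(feats[1], 2),
                      psi[2] * upsample(feats[2], 4)], axis=0)
```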
Datasets and Evaluation Metrics

A2D-Sentences is an extension of the A2D [39] dataset. It consists of 3782 YouTube videos and 6655 textual descriptions, covering eight types of actions performed by seven categories of objects.

JHMDB-Sentences is an extension of the J-HMDB [40] dataset. It comprises 928 video sequences showcasing 21 human actions, each accompanied by a corresponding textual description. Notably, each frame of the videos is labeled with a 2D puppet mask.

Refer-YouTube-VOS is built upon the large-scale video segmentation dataset YouTube-VOS [41]. It consists of a total of 3978 videos, of which 3471 are used for training, 202 for validation, and 305 for testing. Each video in this dataset is annotated with high-quality instance segmentation masks for every fifth frame. Since the test set is accessible only during the competition, our evaluation experiments are conducted on the validation set.
For Refer-YouTube-VOS, the method is evaluated with the criteria of region similarity (J), contour accuracy (F) and their average value (J&F).
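Region similarity J is the intersection-over-union between the predicted and ground-truth masks; a minimal sketch:

```python
import numpy as np

def region_similarity(pred, gt):
    # J: IoU between a binary prediction and the ground-truth mask
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

pred = np.array([[1, 1], [0, 0]])
gt = np.array([[1, 0], [0, 0]])
# region_similarity(pred, gt) == 0.5: one shared pixel, two in the union
```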

Implementation Details
In accordance with previous works [13,14], we train DCT on A2D-Sentences and Refer-YouTube-VOS and use all three datasets for evaluation. For the model settings, the decoupled transformer has 4 decoder layers (N_d = 4) and each layer is configured with 8 attention heads. The number of object queries is set to 50. The hyperparameters of the loss function are set as in [13]. During training, we use sliding windows to crop video clips and the default window size is set to eight. The resolution of the frames is adjusted to ensure that the shorter side is at least 360 pixels and the longer side is at most 640 pixels. Random horizontal flipping, random cropping, and photometric distortion are used for data augmentation. AdamW is used as the optimizer and the weight decay is set to 1 × 10⁻⁴. For A2D-Sentences, we train the model for 60 epochs with a batch size of 6 and a dynamic learning rate, as shown in Equation (10). For Refer-YouTube-VOS, the epoch number and batch size are 30 and 4, respectively, and the learning rate is shown in Equation (11).
At inference time, DCT predicts N instance sequences corresponding to the N queries. For each sequence, we sum the confidence scores output by the class head across frames. Finally, the sequence with the highest total score is identified as the referred object.
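This selection step amounts to a per-sequence score summation followed by an argmax (hypothetical scores for N = 3 sequences over T = 4 frames):

```python
import numpy as np

# scores[n, t]: class-head confidence of instance sequence n at frame t
scores = np.array([[0.1, 0.2, 0.1, 0.3],
                   [0.9, 0.8, 0.7, 0.9],
                   [0.4, 0.3, 0.5, 0.2]])

totals = scores.sum(axis=1)   # total confidence across frames
best = int(totals.argmax())   # sequence reported as the referred object
# best == 1: the second sequence has the highest total score
```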

Ablation Study

Ablation Study on the Main Components
We conduct extensive experiments on A2D-Sentences to evaluate the effectiveness of the key components in our proposed method. The results are presented in Table 1. In the baseline model (as shown in the first line), the LGVE is removed while the decoupled transformer decoder and CFPN are replaced with a conventional transformer decoder and a common FPN similar to the ones used in MTTR [13]. In this case, the OIoU, MIoU and mAP drop significantly by 3.8, 4.1 and 5.4. In experiments No. 2 to No. 4, we introduce the three proposed components on top of the baseline separately. Obviously, each component can bring about an improvement in the segmentation accuracy. In experiment No. 5, we achieve a higher accuracy by simultaneously introducing the LGVE and CFPN. The reason for this may be that the LGVE facilitates the semantic consistency of the visual features, thus reducing the misalignment caused by semantic discrepancies during the global fusion in CFPN.

Analysis of the Temporal Context Size
Modeling the temporal context plays a crucial role in the R-VOS task, for which actions or behaviors in videos cannot be fully understood or derived by analyzing a single frame. In DCT, we use sliding windows to crop video clips and adopt the Video Swin transformer as the spatial-temporal encoder. To study the effect of the temporal context size, we change the window size during training and evaluation on the A2D-Sentences dataset. The results are presented in Table 2. When the window size is 1, the model essentially transforms into an image-level approach, with mAP being only 42.6. As the window size increases, the model captures more time clues, and the segmentation performance gradually improves. However, the metrics reach their peak at the window size of 8. One possible reason for this phenomenon could be that over a longer time span, the target's behavior and spatial position undergo more pronounced changes, which consequently make the target more challenging for the model to comprehend.

Analysis of the Query Number

To investigate the impact of the number of object queries on model performance, we conduct ablation experiments on the A2D-Sentences dataset under the window size of 8, as shown in Table 3. During the training process, randomly initialized object queries gradually converge towards fixed regions or specific categories of targets. As a result, a low number of queries (e.g., 5) is insufficient to cover the complex distribution of objects in the dataset. On the other hand, when queries are too dense (e.g., 75), more similar mask sequences are generated, making it more challenging for the model to optimize during the one-to-one Hungarian matching. Therefore, the number of queries for the final model is set to 50.
Comparison with State-of-the-Art Methods

It can be seen that DCT achieves OIoU, MIoU and mAP values of 70.7, 70.5 and 39.6, respectively, which are 0.7, 1.2 and 0.5 higher than those of the best competing method, ReferFormer. As the results on JHMDB-Sentences are obtained by evaluating the model trained on A2D-Sentences without finetuning, DCT further proves its good generalization.

Results on Refer-YouTube-VOS. Table 6 displays the experimental results on the largest and most challenging R-VOS dataset, Refer-YouTube-VOS. For a fair comparison, we report the performance of ReferFormer trained from scratch and without post-processing operations. It can be observed that our DCT surpasses all the cutting-edge methods with a gain of 0.6 J and 0.5 F and achieves the state of the art.

Visualization and Analysis
In Figure 5, we show some of the visualization results of DCT on Ref-Youtube-VOS. It can be observed that our DCT can accurately comprehend the given language expression and accomplish precise segmentation even in cases of severe object deformation (as shown in row 1), interference from similar objects (as shown in row 2) and partial disappearance of the target (as shown in row 3).

Discussion
In this paper, we specially optimize the modules of multi-modal feature interaction and decoding segmentation, which brings about a great improvement in the segmentation performance. However, the method still has some limitations. For example, we link the masks across frames according to the permutation of the queries, but do not design a memory module to store the historical features of the referred targets, resulting in an insufficient utilization of the temporal information. In addition, during the inference process, we select the mask sequence with the highest total confidence score as the final output. This approach may lead to false-positive results when the referred target does not appear in the video. In summary, how to collect and utilize temporal information from historical frames while maintaining the simplicity of the model framework, as well as improving the strategy for instance matching, remains worthy of further research.

Conclusions
This paper presents an end-to-end decoupled cross-modal transformer for referring video object segmentation. In the proposed model, a Language-Guided Visual Enhancement Module (LGVE) and a decoupled transformer decoder are designed to establish sufficient and balanced information interactions between the object queries and features of different modalities, so as to accurately identify the referred object. Then, we introduce the Cross-layer Feature Pyramid Network, which was originally used for salient object detection, as the spatial decoder, generating high-quality object boundaries by making better use of the visual semantics and details. Extensive experiments are carried out on three benchmark datasets. The experimental results fully demonstrate the effectiveness of the proposed method.


Figure 1. Overall architecture of the proposed method.


Author Contributions: Conceptualization: A.W.; methodology: A.W.; formal analysis and investigation: A.W. and R.W.; writing, original draft preparation: A.W.; writing, review and editing: A.W. and Q.T.; funding acquisition: R.W.; supervision: R.W., A.W. and Z.S. All authors have read and agreed to the published version of the manuscript.

Funding: This research was funded by the Double First-Class Innovation Research Project for the People's Public Security University of China, grant number 2023SYL08.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Table 1. Ablation experiments on the key components.

Table 2. Ablation experiments on the sliding-window size.

Table 3. Ablation experiments on the query number.