Attention-guided cross-modal multiple feature aggregation network for RGB-D salient object detection

Abstract: The goal of RGB-D salient object detection (SOD) is to aggregate the information of the two modalities of RGB and depth to accurately detect and segment salient objects. Existing RGB-D SOD models can extract the multilevel features of a single modality well and can also integrate cross-modal features, but they can rarely handle both at the same time. To tap into and make the most of the correlations of intra- and inter-modality information, in this paper, we propose an attention-guided cross-modal multi-feature aggregation network for RGB-D SOD. Our motivation is that both cross-modal feature fusion and multilevel feature fusion are crucial for the RGB-D SOD task. The main innovation of this work lies in two points: one is the cross-modal pyramid feature interaction (CPFI) module that integrates multilevel features from both RGB and depth modalities in a bottom-up manner, and the other is the cross-modal feature decoder (CMFD) that aggregates the fused features to generate the final saliency map. Extensive experiments on six benchmark datasets show that the proposed attention-guided cross-modal multiple feature aggregation network (ACFPA-Net) achieves competitive performance against 15 state-of-the-art (SOTA) RGB-D SOD methods, both qualitatively and quantitatively.


Introduction
As a fundamental and essential task in computer vision, SOD aims to locate and identify the most visually eye-attracting objects in an image. With the continuous advance of SOD and deep neural network (DNN) technology, SOD has been widely applied in numerous computer vision-related applications, such as object classification and recognition [1,2], target detection [3], semantic segmentation [4,5], video object tracking [6], object discovery [7], image retrieval [8], simultaneous localization and mapping (SLAM) [9], style transfer [10], image translation [11] and image compression [12,13].
Early SOD approaches [14][15][16][17][18] are mainly designed for RGB images and often suffer from performance degradation in more challenging cases, including similar texture and appearance of the foreground and background regions, occlusion, low-contrast lighting, complex and cluttered backgrounds etc. Thanks to the powerful representation ability of convolutional neural networks (CNNs), CNN-based RGB SOD methods [19][20][21][22][23][24] have achieved remarkable success compared to the traditional handcrafted feature-based RGB SOD methods. However, due to the loss of three-dimensional visual information in the two-dimensional RGB image, even CNN-based methods cannot completely solve the abovementioned issues and still yield unsatisfactory performance on complex and challenging scenes. Several surveys on RGB-based SOD [25][26][27] summarize its research progress in detail.
It is well known that a depth map contains spatial geometric and structural information and provides indispensable complementary cues for RGB-based SOD and other vision tasks, including depth super-resolution [28], depth estimation [29,30] etc. Due to the widespread popularity of smartphones and other advanced RGB-D sensors (e.g., Kinect), especially with the rise of portable depth cameras, depth maps are easy to obtain at a minimal cost. To this end, by combining RGB images with auxiliary depth maps, recent research works [31][32][33] take both RGB and depth information as input and have verified its effectiveness in improving the detection and segmentation process. Compared with RGB-based SOD methods, RGB-D SOD methods usually achieve promising performance in various challenging scenes. As shown in Figure 1, although R2Net [34] is the latest SOTA RGB SOD method, it still has problems, such as incomplete detection and false detection, when encountering some challenging scenarios. In addition, even though it uses a depth map, the traditional RGB-D SOD method BEDC [35] still has unsatisfactory detection performance, because its use of depth information is too simple and coarse and the correlation between depth information and RGB information is not well mined. In contrast, the deep learning-based RGB-D SOD approach RD3D [31] performs better. Although a number of previous RGB-D SOD methods have attempted to explore the effect and contributions of the depth map in SOD, the usage of auxiliary depth information brings several problems as follows. First, some low-quality depth maps can actually degrade and impair the detection performance of the RGB-D SOD model. Second, the feature complementarity of the different input modalities must be taken into account, together with how to efficiently compute and represent depth-aware features for the SOD task. An RGB image provides rich semantic and appearance details, such as color and texture, whereas depth maps contain other useful supplementary information including shape, surface normals etc. How to use depth information correctly is an important issue that needs to be addressed. Third, it remains open how to effectively aggregate multilevel deep features from single-modality data, and how to further integrate these cross-modal multilevel features from different modalities, so as to suppress the background regions and completely highlight salient objects.
To address the aforementioned issues, we propose an attention-guided cross-modal feature pyramid aggregation network for RGB-D SOD, named ACFPA-Net. Specifically, to achieve cross-modal multilevel feature aggregation, we design a novel cross-modal pyramid feature interaction module, denoted as CPFI. To address the performance degradation caused by low-quality depth maps, we develop a depth feature filtration module, denoted as DFF. Moreover, we also propose a feature enhancement and amplification module (FEA) to obtain more discriminative depth-aware features. Equipped with these modules, our ACFPA-Net achieves better performance than multiple existing RGB-D SOD methods.
In summary, the major contributions of this paper are as follows: 1) We propose a novel RGB-D SOD network via attention-guided cross-modal feature pyramid aggregation, named ACFPA-Net, which automatically extracts and aggregates the cross-modal multilevel visual information to highlight salient objects from various challenging scenes effectively. 2) By applying a residual convolutional block attention module (RCBAM), the CPFI module is designed to aggregate the cross-modal features for generating more discriminative depth-aware features. 3) We conduct extensive experiments on six RGB-D SOD benchmark datasets under four widely adopted evaluation metrics, which demonstrate the superiority and effectiveness of the proposed ACFPA-Net model against 15 recent SOTA models.
The rest of this paper is organized as follows. Section 2 first reviews related works on SOD. Then, Section 3 describes the proposed attention-guided cross-modal feature pyramid aggregation network for RGB-D SOD. Next, the experimental results and the corresponding discussion and analysis are reported in Section 4. Finally, the conclusion of this paper is drawn in Section 5.

Related works
In the past two decades, SOD has attracted wide attention from researchers due to its extensive applications, and many SOD methods have been presented. In this section, we briefly describe the works most relevant to SOD in terms of their input data. We first briefly review the related RGB SOD methods, then we introduce the development of RGB-D SOD methods and analyze the differences between these methods and our model. Among them, we focus on CNN-based RGB-D SOD approaches, as our proposed approach also falls into this category. In addition, we also discuss the recent SOTA RGB-D SOD methods implemented with deep neural networks other than CNNs.

RGB SOD
In the early days, SOD methods were carried out on RGB images. These existing methods can be simply divided into two categories: traditional RGB SOD methods and CNN-based RGB SOD models.

Traditional RGB SOD methods
The pioneering work on SOD [36] was proposed by Itti et al. This model extracts multiple feature maps including color, brightness, direction and gradient information to predict salient objects in an RGB image. Inspired by this work, early traditional methods, which are mainly based on a wide variety of handcrafted features or intrinsic prior knowledge, such as local or global color contrast [17], center prior [37], spatial priors [16], objectness prior [38,39], texture [40], background prior [14,15] etc., have been proposed to identify and segment salient objects. For example, Cheng et al. [17] proposed a regional contrast-based SOD model, which simultaneously evaluates global contrast differences and spatial coherence. In contrast to [17], Yang et al. [14] ranked the similarity of the image pixels or regions with foreground cues or background cues via graph-based manifold ranking for saliency detection. There are too many such classic SOD methods to enumerate here; please refer to the reviews [26,41,42]. In a nutshell, these methods depend heavily on heuristic low-level features and thus lack the guidance of high-level semantic cues. Although most of them are computationally efficient, their performance can be severely degraded or even invalidated when challenging scenarios are encountered.

DNN-based RGB SOD models
Since 2012, deep learning has played an important role in various computer vision tasks. Naturally, deep learning-based SOD models [25] have been explored. Benefiting from the significant progress of CNNs [43,44], CNN-based RGB SOD methods have quickly become the mainstream and have achieved impressive improvements compared with the traditional SOD methods above. In 2015, several pioneering works [19][20][21] first introduced CNNs into SOD. Among them, Zhao et al. [19] took both global context and local context into account and proposed a multi-context deep learning framework for SOD. Wang et al. [21] proposed two deep neural networks, DNN-L and DNN-G, to learn local patch features for detecting local saliency and to predict the saliency score of each object region based on the global features, respectively. Li et al. [20] integrated handcrafted low-level features with multiscale high-level semantic contrast features extracted using a CNN for saliency detection. These early models only take advantage of the top-level features of the backbone, which can hardly capture the detailed features of the salient objects because of operations such as downsampling.
Afterward, researchers focused on developing deep aggregation models, which aim to fuse multilevel features provided by the backbone. For instance, Liu et al. [45] leveraged the contextual attention module to extract local and global context cues. Chen et al. [46] proposed a reverse attention network to exploit the missing regions by erasing the predicted salient regions in side-output features. Liu et al. [47] designed a feature aggregation module to make the coarse-level semantic information well fused with the fine-level features. Pang et al. [48] designed aggregation interaction modules to fuse multilevel features and proposed self-interaction modules to get more efficient feature representations. In general, such approaches [49][50][51] are usually designed as an end-to-end encoder-decoder architecture with various effective feature manipulation strategies to extract, refine and integrate multi-scale multilevel features from the CNN backbone network [44]. In other words, a CNN (e.g., VGG [52], ResNet [44]) is usually treated as the feature encoder to generate multilevel visual features, and these extracted features are fed into a well-designed decoder for multi-scale feature fusion to produce the final saliency prediction results. It is worth mentioning that because of its excellent feature selection ability, the attention mechanism is very beneficial for SOD tasks. Therefore, some attention-based RGB SOD methods have also been explored and achieved superior performance [45,53,54,55]. Liu et al. [45] utilized pixel-wise contextual attention to select global and local contextual information. Zhao et al. [54] proposed a pyramid feature attention network, which adopts channel-wise attention and spatial attention to focus more on valuable features.
Recently, many deep SOD models [50,53,56,57] have been proposed to predict salient object contours and use them to enhance the object boundaries for SOD. For example, a novel boundary-aware model was presented in BASNet [50] to enhance the boundaries of salient objects by incorporating a boundary localization stream. Zhao et al. proposed an edge guidance network (EGNet) [57] to explicitly model complementary salient object information and salient edge information within the network to preserve the salient object boundaries. Detailed introductions to SOD works can be found in a recent survey [25]. Although the RGB SOD methods perform well, they still struggle when faced with challenging scenarios such as low illumination and contrast, transparent objects and complex and cluttered scenes. This is mainly because RGB images contain only visual appearance information and no rich spatial geometry and structural information. In contrast, depth maps can provide rich spatial geometric cues that RGB images cannot provide and are helpful for detecting salient objects. This type of RGB-D SOD approach is discussed in the next section.

RGB-D SOD
Over the past decade, by leveraging depth information, a large number of RGB-D SOD methods have been proposed. In this section, similar to RGB SOD, we roughly classify previous RGB-D SOD approaches into two categories: traditional RGB-D SOD methods and CNN-based RGB-D SOD models. Especially in recent years, the latter have become mainstream and have achieved encouraging progress compared with the former.

Traditional RGB-D SOD methods
Similar to traditional RGB SOD methods, early RGB-D SOD methods extract handcrafted features (e.g., local and global contrast [58][59][60], spatial prior [60,61], background prior [62] etc.) from image pairs (RGB images and depth maps) and fuse them for detecting salient objects. As a matter of fact, the depth cue is often treated as a complementary prior together with other specific prior information from RGB images to assist saliency detection on RGB-D images. The pioneering RGB-D SOD study [58] computed the global disparity contrast and domain knowledge of stereoscopic images to measure stereo saliency, and built a stereo saliency analysis benchmark dataset, STEREO. Subsequently, Cheng et al. [60] measured saliency values using spatial bias and contrast cues (color contrast and depth contrast), and built the DES dataset for RGB-D SOD. In the same year, Peng et al. [59] proposed a multi-contextual contrast-based saliency detection method and built the NLPR dataset. The latter two datasets, DES and NLPR, directly provide depth maps collected by the Kinect device. The emergence of these datasets largely stimulated the study of RGB-D SOD. For example, Ren et al. [62] integrated region contrast with the background prior, depth prior and surface orientation prior to generate a coarse saliency map, and reconstructed the final saliency map by a saliency restoration stage. Wang et al. [63] developed a multi-stage SOD framework via minimum barrier distance transform and multilayer cellular automata-based fusion with 3-D spatial prior and depth bias. Cong et al. [64] proposed a depth-guided transformation and optimization model to incorporate the depth map into existing RGB SOD methods for boosting the performance of RGB-D SOD. Although these traditional methods have achieved promising performance, the low-quality depth maps and the handcrafted features limit their generalization and performance improvement in complex scenarios.

Deep learning-based RGB-D SOD models
Recently, benefitting from the depth cue, which contains rich spatial structural information, deep learning-based RGB-D SOD methods have achieved significant progress. Among the first such studies, Qu et al. [65] designed a simple CNN-based model to learn the interaction mechanism of RGB and depth-induced saliency features for RGB-D SOD. Piao et al. [66] designed a depth refinement block to extract and fuse multilevel paired features and combined depth cues with multi-scale context features for locating salient objects. Although its model architecture is relatively simple and straightforward, its detection performance has improved significantly over traditional methods. CNN-based RGB-D SOD methods [67,68] can adaptively extract and fuse discriminative complementary features from RGB-D image pairs. Among them, fusion-based approaches have contributed significantly to RGB-D SOD and have achieved excellent performance. In terms of the fusion strategy of RGB and depth modal information in most previous papers [69][70][71], RGB-D SOD methods can be roughly divided into the following three popular categories: data-level fusion, feature-level fusion, and result-level fusion, of which feature-level fusion is the most prominent.
Data-level fusion. This type of fusion strategy directly concatenates the three-channel RGB image and the single-channel depth map into a four-channel input image. Models that employ this fusion strategy usually train a single-stream SOD model with the composited four-channel input. Typically, Wang et al. [72] proposed a data-level recombination strategy to fuse RGB with depth data, and applied a lightweight triple-stream network to predict salient objects. To explore the effect of the depth map, DANet [73] directly uses the depth map to guide the early fusion between the RGB and depth modalities. Subsequently, a joint learning and cooperative fusion model was proposed in [74] to exploit the cross-modality complementary information. With shared parameters for feature extraction, these methods largely reduce the number of parameters while tending to learn compromising features between modalities.
Result-level fusion. This type of approach commonly adopts two-stream CNNs on the RGB image and the depth map, respectively, to obtain two initial RGB-related and depth-related saliency prediction maps, and then fuses them in a variety of ways, such as concatenation, addition, multiplication etc., to generate the final saliency map. As typical examples, Han et al. [75] adopted a transfer learning strategy to integrate the feature representations of RGB and depth to generate the final saliency map. In AFNet [76], a two-stream CNN was designed to extract features and predict a saliency map from the RGB and depth modalities respectively, and a saliency fusion module was used to learn a switch map to fuse the predicted saliency maps. Li et al. [77] proposed an information conversion network (ICNet) with an encoder-decoder architecture for RGB-D SOD, in which an information conversion module was used to fuse high-level RGB and depth features in an interactive and adaptive way. Since feature interactions are only performed on the prediction maps, such methods cannot fully characterize the correlations between the two modalities.
Feature-level fusion. Such fusion methods generally adopt a two-stream network structure to extract multi-scale RGB and depth features separately, and then aggregate cross-modality features at multiple levels by a specially designed cross-modal fusion unit to generate the final saliency prediction map. To better explore the complementary values of the two modalities, both [78] and [79] design a complementary-aware fusion module to select and combine multi-modal features. However, low-quality depth maps may lead to poor fusion results. DQSD [80] integrates a novel depth quality-aware subnet to assess the depth quality before conducting the selective RGB-D fusion. Fan et al. [81] proposed a bifurcated backbone strategy to split the multilevel features into teacher and student features. Ji et al. [66] proposed a depth-induced multi-scale recurrent attention network to learn the internal semantic relation of the fused features and optimize local details with memory-oriented scene understanding. Jin et al. [82] supplemented the depth features with a depth map estimated from RGB images and fused the bimodal features in two stages according to the hierarchy. Wu et al. [70] designed an implicit depth restoration strategy to enhance the learning of features by the backbone network during the training phase. Different attention-based mechanisms are introduced in [68,83,84] to explore complementary cross-modality information for improving the performance.
In general, data-level and result-level fusion strategies are more efficient, and feature-level fusion is more accurate. The model proposed in this paper belongs to the feature-level fusion category. In fact, there are some inspiring related models [85][86][87] based on other neural networks that perform well. Discussing these models in detail is beyond the scope of this paper, so please refer to the recent survey [27] for more details. Although great performance improvements have been made, existing models do not deal with the intra- and inter-feature interaction issues well, as most of them only regard the depth map as a supplement to the RGB image and ignore the correlations between the two, and they still encounter problems such as incomplete feature aggregation and partial boundary loss. For feature fusion itself, in addition to cross-modal feature fusion, cross-level feature fusion and cross-scale feature fusion are also crucial for RGB-D SOD methods. In this context, low-level detail features and high-level semantic cues are fused progressively, and multi-scale features are aggregated to complement contextual information at different levels of detail. Although some methods have explored cross-modal and cross-scale feature fusion, they have not considered the importance of multi-scale features [88][89][90]. Different from these aforementioned methods, in order to balance high accuracy and efficiency, in this paper, we design a cross-modal pyramid feature interaction module to integrate features from RGB and depth modalities under the premise of fully exploiting multi-scale information, and introduce a cross-modal feature decoder to aggregate these fused features to generate the final prediction results.
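The difference between the two simpler fusion strategies can be illustrated on toy data. The following is a minimal sketch, not code from any of the cited methods: the function names are hypothetical, images are plain nested lists, and the averaged form of "addition" is an assumption made only to keep the fused values in [0, 1].

```python
def data_level_fusion(rgb, depth):
    """Stack a 3-channel RGB image ([3][H][W]) and a depth map ([H][W]) into [4][H][W]."""
    if len(rgb) != 3:
        raise ValueError("expected three RGB channels")
    return list(rgb) + [depth]


def result_level_fusion(sal_rgb, sal_depth, mode="multiply"):
    """Fuse two single-modality saliency maps ([H][W], values in [0, 1]) pixel by pixel."""
    combine = {
        "multiply": lambda a, b: a * b,
        # Assumption: addition is averaged so that the result stays in [0, 1].
        "add": lambda a, b: (a + b) / 2.0,
    }[mode]
    return [[combine(a, b) for a, b in zip(row_r, row_d)]
            for row_r, row_d in zip(sal_rgb, sal_depth)]


# Toy 1 x 2 "images" with hypothetical values.
rgb = [[[0.2, 0.4]], [[0.1, 0.3]], [[0.0, 0.5]]]
depth = [[0.9, 0.1]]
four_channel = data_level_fusion(rgb, depth)           # input for a single-stream model
fused_map = result_level_fusion([[0.5, 1.0]], [[0.5, 0.0]], mode="multiply")
```

Feature-level fusion has no equally compact form, since it depends on the design of the cross-modal fusion unit applied at each encoder level.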

Proposed methods
We first briefly describe the overall backbone network architecture of the proposed ACFPA-Net in Section 3.1. Second, we give a detailed explanation of the CPFI module in Section 3.2; then the CMFD and the pyramid dilated convolution module are illustrated in Sections 3.3 and 3.4, respectively. Finally, the stage-wise intermediate supervision is introduced in Section 3.5.

Backbone network
The overall framework of our proposed ACFPA-Net is shown in Figure 2, which follows a standard encoder-decoder architecture. The feature encoder contains RGB-stream and depth-stream backbone networks for separate feature extraction, each constructed from a ResNet-like pretrained model with the last fully-connected layer and the global average pooling layer removed. First, a pair of RGB and depth images is fed into the dual-stream feature encoder for multilevel feature extraction; then the extracted features are progressively integrated and refined by multiple cascading CPFI modules. Subsequently, the fused features from the CPFI modules, along with high-level single-modal semantic features from both the RGB and depth streams, are further fed into the feature decoder for cross-modal fusion. It should be noted that, in order to expand the receptive field and obtain richer high-level semantic features, the single-modal features are enhanced by a pyramid dilated convolution (PDC) module [91] before being fed into the feature decoder. These modules are elaborated in detail in subsequent sections.

CPFI module
The CPFI module is designed to integrate features from RGB and depth modalities, as shown in Figure 3, and it is divided into two parts: single-modal feature enhancement and multi-modal multilevel feature interaction. In the single-modal feature enhancement, feature maps $f^i_r$ and $f^i_d$, $i = 1, 3, 5, 7$ with the same scale, along with shallow features, are taken as inputs. First, the single-modal features are expanded using dilated convolutions to obtain rich features with different levels of receptive fields; then RCBAM, a modification of CBAM [92] that adds an additional residual connection, is applied to better refine the obtained features. In this process, SA and CA represent spatial attention and channel attention, respectively, $f$ denotes the input feature, and ⊗ is the multiplication operation. Specifically, a linear operation is employed to interact with the enhanced single-modal features. We first perform feature aggregation on single-modal features with the same dilation rate, obtaining rich fused features with different receptive fields. Then, we concatenate these features and utilize the previous layer's feature, which contains abundant detailed information, as a supplement to obtain the final fused feature. Here, cat stands for the concatenation operation and $f^{i(\lambda)}$ represents the feature from the i-th layer with a dilation rate of $\lambda$ ($\lambda = 1, 3, 5, 7$).
Electronic Research Archive Volume 32, Issue 1, 643-669.
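The residual attention refinement described above can be sketched in equation form. This is a reconstruction under stated assumptions rather than the paper's own formula: it follows the standard sequential CBAM ordering (channel attention first, then spatial attention) and adds the input back as the residual connection; the intermediate symbols $f_{ca}$ and $f_{sa}$ are introduced here only for illustration.

```latex
\begin{aligned}
f_{ca} &= \mathrm{CA}(f) \otimes f, \\
f_{sa} &= \mathrm{SA}(f_{ca}) \otimes f_{ca}, \\
\mathrm{RCBAM}(f) &= f + f_{sa}.
\end{aligned}
```

With the residual term, the module can only add an attention-weighted correction to $f$, so the original feature is preserved even where the attention weights are close to zero.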

CMFD
Existing research has shown that features at different levels in a network can reflect different characteristics of objects. Specifically, shallow features can provide richer local information, while deep features contain semantic-level global information. In the encoding stage, we obtain high-level features with abundant information. Through experiments, we found that effectively utilizing shallow features in the decoding stage can significantly improve the model's performance. To achieve this, we adopt the mutually guided cross-level decoder [93] and modify it into the CMFD module to aggregate multi-scale and comprehensive features from both top-down and bottom-up pathways. Specifically, as shown in Figure 4, the output features of the CPFI module are first aggregated with the global semantic guidance features $f_{pdcd}$ and $f_{pdcr}$; then the high-level semantic aggregated features $\{f^j_{fusion}\}_{j=2}^{5}$ are fed into the feature fusion (FF) module in the top-down pathway to interact with shallow information and complement the detailed features.
To fully explore the complementary nature of local and global information, the multi-scale feature fusion module (MFM) integrates the outputs $f^j$ from the FF side. The bidirectional pathway aggregates multi-scale cross-modal fusion features from low and high layers, enabling the prediction of complete structures and clear boundaries. Figure 4 illustrates the implementation details of this module. In each FF module, the input consists of the high-level semantic aggregated features $\{f^j_{fusion}\}_{j=2}^{5}$ and the FF features $\{f^j_{ff}\}_{j=2}^{4}$ from the upper layer. In the feature fusion of FF, ⊕ and ⊗ represent element-wise addition and multiplication, U is the bilinear interpolation upsampling operation, and AP denotes the global average pooling operation. The features $\{w^j\}_{j=2}^{4}$ and $w^{j+1}$ are combined to generate multi-scale prediction results, where ⊕ denotes element-wise addition and downsampling is performed by convolution operations. The MFM is used to filter out unnecessary information and obtain multilevel prediction outputs $\{S_k\}_{k=2}^{5}$. These multilevel predictions are concatenated after upsampling and then passed through a convolutional layer to obtain the final saliency map $S_{map}$, where U is the upsampling operation and P is the 3 × 3 prediction convolution layer.
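The final prediction step can be written compactly. The following is a sketch inferred from the verbal description (upsample each multilevel prediction, concatenate, then apply the prediction convolution), reusing the symbols $U$, $P$, cat and $S_k$ defined in the text; the exact form in the original equation may differ.

```latex
S_{map} = P\big(\mathrm{cat}\big(U(S_2),\, U(S_3),\, U(S_4),\, U(S_5)\big)\big)
```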

PDC module
The PDC takes the high-level features $f^5_{rgb}$ and $f^5_d$ as input to capture rich semantic and positional information, providing global semantic guidance in the decoding stage. In real-world scenarios, salient objects may vary in size. To address objects of different scales, the PDC module includes two parts: a dense atrous convolution (DAC) unit and a residual multikernel pooling (RMP) unit. The DAC unit takes inspiration from the structure of Inception-ResNetV2 and consists of four parallel branches, in which convolutions at different scales are replaced with dilated convolutions. Specifically, in DAC, different numbers and dilation rates of atrous convolutions are stacked in the four branches to expand the receptive fields. We use atrous convolutions with dilation rates of one, three, and five, resulting in receptive fields of three, seven, and nine for each branch. Additionally, a 1 × 1 convolution followed by rectified linear activation is applied in each branch. Finally, similar to ResNet's skip connections, the aggregated features from different receptive fields are added to the original features to obtain richer fused features. Inspired by spatial pyramid pooling (SPP) [94], the RMP unit further encodes multi-scale contextual features of objects extracted from the DAC unit without requiring additional learning weights. In this work, we set four receptive field scales: 2 × 2, 3 × 3, 5 × 5 and 6 × 6, resulting in four feature maps of different sizes. After each pooling level, a 1 × 1 convolution is used to reduce the channels of the feature maps by 256 for reducing computational cost; then we upsample the low-dimensional features to the same size as the original features by using bilinear interpolation. Finally, we concatenate the original features with the upsampled features as the output feature.
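The receptive fields quoted for the DAC unit follow from simple arithmetic on stacked dilated convolutions. The sketch below assumes 3 × 3 kernels with stride 1 and a hypothetical branch layout (dilation rates [1], [3], [1, 3] and [1, 3, 5]); the text does not spell out which rates are stacked in which branch, so the layout is an assumption, and 1 × 1 convolutions are omitted because they do not enlarge the receptive field.

```python
def stacked_receptive_field(dilations, kernel_size=3):
    """Receptive field of stride-1 convolutions stacked in sequence.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


# Hypothetical branch layout of the DAC unit (an assumption, see lead-in).
branches = {
    "branch 1": [1],
    "branch 2": [3],
    "branch 3": [1, 3],
    "branch 4": [1, 3, 5],
}
fields = {name: stacked_receptive_field(rates) for name, rates in branches.items()}
```

Under this assumed layout the first three branches reproduce the receptive fields of three, seven and nine stated above, and the deepest branch would reach 19.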

Stage-wise intermediate supervision
To facilitate the training of the network, we apply feature supervision at multiple stages. Specifically, in addition to the predicted maps from the network's output, we perform 3 × 3 convolutions on the feature maps obtained from the CPFI and PDC modules to obtain corresponding predicted maps. All the prediction results are then adjusted to the same resolution as the input image through bilinear interpolation.
We use weighted cross-entropy and weighted intersection over union as the loss functions. In the final loss of the network, G represents the ground truth, $\{\lambda_i\}_{i=1}^{3}$ are hyperparameters, $f_{pdcd}$ and $f_{pdcr}$ are the output features from the PDC module, and $S_{map}$ is the final predicted map of the network.
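A per-map loss of this kind, and its use for stage-wise supervision, can be sketched as follows. This is an illustrative sketch only: maps are flattened to plain lists, the per-pixel weights are passed in because the text does not state how they are computed, and `total_loss` (a hypothetical helper) simply adds λ-weighted side losses to the loss of the final map, which is an assumed combination rather than the paper's exact formula.

```python
import math


def weighted_bce_iou(pred, gt, weights, eps=1e-7):
    """Weighted binary cross-entropy plus weighted soft-IoU loss for one map.

    pred: predicted probabilities in [0, 1]; gt: 0/1 labels;
    weights: positive per-pixel weights (their derivation is an assumption).
    """
    total_w = sum(weights)
    bce = -sum(
        w * (g * math.log(max(p, eps)) + (1 - g) * math.log(max(1 - p, eps)))
        for p, g, w in zip(pred, gt, weights)
    ) / total_w
    inter = sum(w * p * g for p, g, w in zip(pred, gt, weights))
    union = sum(w * (p + g) for p, g, w in zip(pred, gt, weights)) - inter
    iou = 1.0 - (inter + 1.0) / (union + 1.0)  # +1 smoothing avoids division by zero
    return bce + iou


def total_loss(final_pred, side_preds, gt, weights, lambdas):
    """Assumed combination: final-map loss plus lambda-weighted side-output losses."""
    loss = weighted_bce_iou(final_pred, gt, weights)
    for lam, side in zip(lambdas, side_preds):
        loss += lam * weighted_bce_iou(side, gt, weights)
    return loss
```

The cross-entropy term penalizes each pixel independently, while the IoU term scores the overlap of the whole map, which is why the two are commonly paired for saliency prediction.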

Experimental results and discussion
We first describe the implementation details of the proposed ACFPA-Net model in Section 4.1, then introduce the six RGB-D SOD benchmark datasets and the five commonly used evaluation metrics in Sections 4.2 and 4.3, respectively. After that, a quantitative and qualitative comparative analysis with 15 SOTA RGB-D SOD models is conducted in Section 4.4. Finally, we conduct all-round ablation studies in Section 4.5 to validate the effectiveness of our proposed modules.

Implementation details
The proposed ACFPA-Net is implemented in PyTorch and trained on a single NVIDIA GeForce RTX 3090 GPU; the total training time is five hours for 70 epochs. Our model architecture is independent of the backbone network. During the training phase, for the sake of fairness, we adopt ResNet-50 [44] as the backbone network for both the RGB and depth streams, initialized with parameters pretrained on ImageNet. For the depth stream, we adopt gray color mapping to transform the single-channel depth map into a three-channel image as input. Several common transformations including random flipping, rotating and cropping are adopted for data augmentation to prevent model overfitting. Multi-scale training is also applied; that is, all the training input samples are resized to the scales [320, 352, 384]. We set the maximum epoch and batch size to 70 and 10, respectively. The AdamW optimizer [95] is employed for optimizing the proposed network model. The corresponding learning rate is initially set to 1e-5 and dynamically adjusted every 20 epochs with weight decay 0.1. During the inference stage, all the test images are uniformly resized to 352 × 352 resolution, then fed into the network to generate the final saliency prediction map without any other post-processing steps. The inference time of our method is about 0.06 seconds per image.
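One possible reading of the learning-rate schedule above is a step decay. The sketch below assumes that the value 0.1 acts as a multiplicative factor applied every 20 epochs; the text is ambiguous on this point (0.1 may instead be the optimizer's weight decay), so `step_lr` and its parameter names are hypothetical.

```python
def step_lr(epoch, base_lr=1e-5, step=20, gamma=0.1):
    """Learning rate at a given epoch under an assumed step-decay schedule."""
    return base_lr * gamma ** (epoch // step)


# Learning rate for each of the 70 training epochs under this assumption.
schedule = [step_lr(epoch) for epoch in range(70)]
```

Under this reading the rate would drop three times during the 70 epochs, at epochs 20, 40 and 60.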

Benchmark datasets
Training dataset. To verify the effectiveness and generalization ability of the proposed ACFPA-Net and to make a fair comparison with existing RGB-D SOD approaches, we follow recent SOTA methods [70,74,96] and train ACFPA-Net on the conventional training benchmark, which consists of two parts: the training set of the NJUD2K [61] dataset (NJUD2K-train) with 1,485 image pairs and the training set of the NLPR [59] dataset (NLPR-train) with 700 image pairs.
Testing datasets. Six widely used RGB-D SOD benchmark datasets are used as experimental testing datasets, which include STEREO [58], NLPR-test [59], NJUD2K-test [61], DUTLF-D-test [66], LFSD [97] and SIP [98]. Except for DUTLF-D, the other datasets are directly used for testing the performance of our proposed model and competing methods. DES [60] includes 135 images of indoor scenes captured by a Kinect camera. As the test subset of the NLPR dataset, NLPR-test [59] is captured by a Kinect with a resolution of 640 × 480 and contains 300 natural images with multiple salient objects from 11 types of indoor and outdoor scenes under different illumination conditions. NJUD2K-test [61] is what remains of the NJUD2K dataset after NJUD2K-train has been removed; it contains 500 stereo image pairs with diverse objects and complex scenarios from different sources such as the internet and stereo movies, where several depth maps are estimated through an optical flow method. STEREO [58] includes 1,000 stereoscopic images downloaded from the internet, where the depth maps are generated from the stereo images using the SIFT flow method. SIP [98] contains 929 RGB-D images with a high resolution of 744 × 992, covering diverse real-world scenes with various viewpoints, poses, occlusions, illuminations and backgrounds. DUTLF-D [66] contains 1,200 paired RGB-D images captured by a Lytro camera with a resolution of 600 × 400. LFSD [97] is a small-scale dataset including 100 low-resolution RGB-D images and manually labeled ground truths, where the depth maps are captured via a Lytro light field camera.

Evaluation metrics
Following previous works [70,84,85,96], we adopt five generally recognized metrics for quantitative performance evaluation of the proposed method and other SOTA competitors, namely precision-recall curves (PRC), mean absolute error (MAE), mean F-measure ($F_m$), mean E-measure ($E_\xi$) [105] and S-measure ($S_\alpha$) [106]. Given a saliency map $S$ and the ground truth map $G$, these metrics are defined as follows.

The first is the F-measure, a widely used region-based similarity metric that takes into account both precision ($P$) and recall ($R$) to assess the overall performance of the predicted results. In this work, we assess the maximum F-measure ($F_\beta^{Max}$) score across the binary predicted maps obtained with different thresholds.

The second is the MAE, which measures the average pixel-wise absolute difference between the predicted saliency map and the corresponding ground truth. It is denoted as:

$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \left| S(x, y) - G(x, y) \right|$$

where $S$ is the saliency map, $G$ is the ground truth, and $W$ and $H$ indicate the width and height of the saliency map, respectively.

The third is the S-measure ($S_\alpha$), which measures the spatial structure similarity between the predicted saliency map and the corresponding ground truth from region-aware and object-aware perspectives. Mathematically,

$$S_\alpha = \alpha \cdot S_o + (1 - \alpha) \cdot S_r$$

where $S_r$ denotes regional perception, $S_o$ denotes object perception, and $\alpha$ is set to 0.5 to balance the region similarity $S_r$ and the object similarity $S_o$, as suggested in [106].
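The MAE, the thresholded F-measure and the S-measure combination described above can be sketched in a few lines. This is a minimal illustration that stores maps as nested lists with values in [0, 1]; the value $\beta^2 = 0.3$ and the 256 evenly spaced thresholds are conventional choices assumed here rather than settings stated in the text, and the function names are hypothetical. The terms $S_o$ and $S_r$ are passed in as precomputed numbers because their computation is defined in [106].

```python
def mae(sal, gt):
    """Mean absolute error between a saliency map and its ground truth."""
    height, width = len(sal), len(sal[0])
    total = sum(abs(s - g)
                for row_s, row_g in zip(sal, gt)
                for s, g in zip(row_s, row_g))
    return total / (width * height)


def f_measure(sal, gt, threshold, beta_sq=0.3):
    """F-measure of the map binarised at `threshold` (beta_sq is an assumption)."""
    tp = fp = fn = 0
    for row_s, row_g in zip(sal, gt):
        for s, g in zip(row_s, row_g):
            predicted = s >= threshold
            if predicted and g == 1:
                tp += 1
            elif predicted:
                fp += 1
            elif g == 1:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


def max_f_measure(sal, gt, num_thresholds=256):
    """Maximum F-measure over evenly spaced thresholds in [0, 1]."""
    return max(f_measure(sal, gt, t / (num_thresholds - 1))
               for t in range(num_thresholds))


def s_measure(s_object, s_region, alpha=0.5):
    """Combine precomputed object-aware and region-aware similarities."""
    return alpha * s_object + (1 - alpha) * s_region
```

For example, for the 2 × 2 map `[[0.9, 0.2], [0.6, 0.1]]` with ground truth `[[1, 0], [1, 0]]`, the absolute errors are 0.1, 0.2, 0.4 and 0.1, so the MAE is 0.2.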
We also calculate the mean E-measure ($E_\xi^{Mean}$) under the same protocol as the official paper [105]. This metric jointly captures image-level statistics and local pixel-wise correlation information to evaluate the similarity between the saliency prediction and the ground truth. Mathematically,

$$E_\xi = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} \xi_{FM}(x, y)$$

where $\xi_{FM}(x, y)$ stands for the value of the enhanced alignment matrix at pixel location $(x, y)$, and $W$ and $H$ are the width and height of the saliency map. In summary, for the four metrics above, higher $F_\beta$, $E_\xi$ and $S_\alpha$ scores and a lower MAE score indicate better performance. In addition, we report the number of parameters and the running time of each method for efficiency analysis.
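To make the enhanced alignment matrix more concrete, the following sketch computes an E-measure for one binarised foreground map. It follows the commonly used formulation in which both maps are centred by their global means, an alignment value is computed per pixel, and the result is mapped to [0, 1] with a quadratic function. The small constant `eps`, the function name, and the omission of special handling for all-background or all-foreground ground truths are simplifying assumptions; the mean E-measure would additionally average such scores over thresholds.

```python
def e_measure(fm, gt, eps=1e-12):
    """E-measure of a binary foreground map `fm` against ground truth `gt`."""
    height, width = len(fm), len(fm[0])
    n = width * height
    mu_fm = sum(sum(row) for row in fm) / n  # global mean of the prediction
    mu_gt = sum(sum(row) for row in gt) / n  # global mean of the ground truth
    total = 0.0
    for row_f, row_g in zip(fm, gt):
        for f, g in zip(row_f, row_g):
            df, dg = f - mu_fm, g - mu_gt
            # Alignment is +1 where both centred maps agree, -1 where they oppose.
            align = 2.0 * df * dg / (df * df + dg * dg + eps)
            # Enhanced alignment maps [-1, 1] to [0, 1] quadratically.
            total += (align + 1.0) ** 2 / 4.0
    return total / n
```

A prediction identical to the ground truth scores close to 1, while the inverted prediction scores close to 0.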
Table 1. Quantitative comparison with SOTA methods on the DES [60], NLPR [59] and LFSD [97] benchmark datasets. ↑ (↓) denotes that higher (lower) is better. Evaluation metrics include the $F_\beta$, $E_\xi$, $S_\alpha$ and MAE scores. The top three results are marked in red, green and blue, respectively.

Columns: Method, Backbone, Pub'Year, DES, NLPR, LFSD.
Qualitative evaluation and discussion. Figure 5 presents visual comparisons between our approach and the baseline methods in various challenging scenes, including multiple objects (column six), small objects (columns three, four and five), big objects (columns one, two, five and twelve), complex backgrounds (columns four, eight, ten and fourteen), low contrast (columns four, five, seven and thirteen), occluded objects (column nine) and complex objects (columns ten and eleven). It can be clearly seen that the proposed ACFPA-Net consistently produces more accurate and complete saliency maps with sharp boundaries and coherent details, and thus achieves better detection of the salient objects. The samples given in Figure 5 belong to the conventional benchmark datasets, including DES [60], NLPR [59], NJUD2K [61], STEREO [58] and SIP [98].
Quantitative comparison and analysis. The detailed quantitative performance of the proposed ACFPA-Net and previous competitive methods on the DES, NLPR and LFSD datasets can be found in Table 1, and Table 2 shows the quantitative comparison in terms of four evaluation metrics on the NJUD2K, SIP and STEREO datasets. The evaluation values in both Table 1 and Table 2 show that our ACFPA-Net can detect salient objects in complex scenes more accurately and completely by using global semantic relations and multi-scale detail information. Compared with the other 15 methods, our carefully designed ACFPA-Net ranks in the top three in terms of S-measure, E-measure and F-measure. For example, on the SIP dataset, compared to the second-best method (MITF-Net), the percentage gain in E-measure reaches 0.7%, the percentage gain in MAE score reaches 0.2% and the percentage gain in S-measure reaches 1.5%. On the STEREO dataset, the minimum percentage gain in S-measure reaches 0.8%, while the minimum percentage gain in F-measure reaches 1.5%.
Model complexity. As shown in Table 3, we compare the complexity of our model with that of some recent competitive ones. Our model is at a relative disadvantage in terms of computational cost; however, efficiency is not the focus of this article, and Tables 1 and 2 demonstrate the performance advantage of our method. Model lightweighting is nevertheless an important research topic with wide application value, and we will work on more efficient methods in the future.

Ablation study
We provide comprehensive ablation studies to validate the effectiveness of each key component (PDC and CPFI) employed in our ACFPA-Net. The quantitative results for each module combination are shown in Table 4. For a fair comparison, the baseline in this paper directly adds the RGB features and depth features to obtain the fusion features and inputs them into the CMFD module. We gradually add different components to the baseline.

Table 4. Ablation study of the proposed modules. CPFI and PDC denote the cross-modal pyramid feature interaction module and the pyramid dilated convolution module, respectively.

The significance of the PDC module. To prove the effectiveness of the PDC, quantitative comparisons are given in Table 4. It can be observed that after adding the PDC module to the baseline in the first row, there is a significant improvement in various evaluation indicators. This also confirms our idea that by incorporating the PDC module after the high-level features, we can capture broader and deeper semantic features, providing global semantic support for the decoding stage.
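One way to see why dilated convolutions capture broader context is to compute the effective kernel size of each branch: a kernel of size $k$ with dilation rate $d$ spans $k + (k - 1)(d - 1)$ pixels. The short sketch below evaluates this for a hypothetical set of dilation rates; the rates (1, 2, 4, 8) and the function names are assumptions for illustration only and are not the configuration of the PDC module.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated convolution kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def pyramid_kernel_sizes(kernel_size=3, dilations=(1, 2, 4, 8)):
    """Effective kernel size per branch for assumed dilation rates."""
    return {d: effective_kernel_size(kernel_size, d) for d in dilations}
```

With these assumed rates, 3 × 3 kernels cover 3, 5, 9 and 17 pixels per side, so parallel branches see increasingly wide context without additional parameters per kernel.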
This clarity demonstrates the effectiveness of the CPFI module.
The effectiveness of the CPFI module. First, to study the importance of the CPFI module, we list the experiments on the NJUD2K, SIP and STEREO datasets, as shown in Table 4. Compared to the baseline (first row), after adding the CPFI module to replace the direct addition operation (fifth row), the indicators improve on multiple datasets such as NJUD2K and STEREO. This is reasonable because CPFI can enhance the refinement of the single-modal features (RGB features and depth features) through multi-scale convolution, RCBAM and multilevel interaction between the RGB features and depth features. The last row of Table 4 adds both the CPFI and PDC modules on top of the baseline. It can be observed that each module improves performance and that the best results are achieved when they are combined.

Second, we explore the effectiveness of the CPFI structure. The key structures of CPFI are RCBAM and multi-scale aggregation. We replace CPFI with single-scale CBAM (third row) and single-scale RCBAM (fourth row), in both cases with the decoder PDC. The comparison between the third and fourth rows demonstrates that adding residual connections to CBAM benefits the model. The comparison between the fifth and fourth rows demonstrates the effectiveness of multi-scale aggregation. In addition, we compare CPFI with other aggregation modules, and CPFI obtains promising results, as shown in the fifth to seventh rows. CIM and CMIB are the aggregation modules of two SOTA methods, SPNet [88] and DIGRNet [103], respectively.

Failure cases
The proposed model achieves good detection performance in most cases. However, as shown in Figure 6, there are some failure cases. In the first row, the model cannot effectively avoid the interference of occlusion. In the second row, it fails to highlight the object because the object lies at the edge of the frame and is incomplete. In the third row, the detection result is incomplete because of the complexity of the scene and the difficulty of distinguishing the object from other pedestrians. Similar failures are also seen in other SOTA RGB-D SOD methods, as shown in the fifth and sixth columns. In addition to the above reasons, we believe that low-quality depth maps (third row) are also an important factor affecting the performance of the model. These extremely challenging issues will be the focus of our future research.

Conclusions
In this paper, we propose ACFPA-Net for RGB-D SOD. To effectively balance high accuracy and efficiency, we design a CPFI module to integrate multilevel features from both the RGB and depth modalities. Moreover, to improve the intra- and inter-modality aggregation compatibility when fusing the information of the RGB and depth modalities, we introduce a CMFD to aggregate these fused features in an encoder network and generate the final prediction results. Experiments on six benchmark datasets demonstrate that the proposed ACFPA-Net achieves competitive performance compared with 15 SOTA RGB-D SOD models under four widely used evaluation metrics.

Figure 1. Two RGB-D input samples and the corresponding saliency maps generated by different types of SOD methods. (a) Original RGB image; (b) depth map; (c), (d) and (e) saliency maps generated by DNN-based RGB SOD models, traditional RGB-D SOD methods and deep learning-based RGB-D SOD models, respectively; (f) ground truth.

Figure 2. The overall architecture of the proposed ACFPA-Net. It mainly consists of a backbone network, the CPFI module and the cross-modal multilevel feature decoder (CMFD) module.

$\{w_i\}_{i=2}^{5}$ and $\{w_{j+1}\}_{j=2}^{4}$ denote 3 × 3 convolution operations with PReLU activation and 1 × 1 convolution operations with ReLU activation, respectively, both of which transform the channels of $\{f_j^{fusion}\}_{j=2}^{5}$ into 256. A further set of operators, indexed by $j = 2, \dots, 4$, represents 3 × 3 convolution operations without activation. $F_j$ is the output of the FF module and is simultaneously passed to the next layer's FF module as input. The MFM receives the outputs $F_j$ in a bottom-up manner.

Table 3. Complexity comparison with SOTA methods.