Progressive Guided Fusion Network With Multi-Modal and Multi-Scale Attention for RGB-D Salient Object Detection

The depth map contains abundant spatial structure cues, which has led to its wide adoption in saliency detection tasks for improving detection accuracy. Nevertheless, the acquired depth map is often of uneven quality, due to the interference of depth sensors and external environments, which poses a challenge: how to minimize the disturbance of low-quality depth maps during the fusion process. In this article, to mitigate such issues and highlight the salient objects, we propose a progressive guided fusion network (PGFNet) with multi-modal and multi-scale attention for RGB-D salient object detection. Specifically, we first present a multi-modal and multi-scale attention fusion model (MMAFM) to fully mine and utilize the complementarity of features at different scales and modalities for achieving optimal fusion. Then, to strengthen the semantic expressiveness of the shallow-layer features, we design a multi-modal feature refinement mechanism (MFRM), which exploits the high-level fusion feature to guide the enhancement of the shallow-layer original RGB and depth features before they are fused. Moreover, a residual prediction module (RPM) is applied to further suppress background elements. Our entire network adopts a top-down strategy to progressively excavate and integrate valuable information. Experimental results on eight challenging benchmark datasets demonstrate the effectiveness of our proposed method both qualitatively and quantitatively in comparison with state-of-the-art methods.


I. INTRODUCTION
Salient object detection (SOD) aims to locate and segment the most interesting or attractive regions in a scene by imitating the human visual system. It has been applied to various vision tasks, such as image segmentation [1], matching [2], enhancement [3], and weakly supervised learning [4]. With the development of depth sensors, depth cues can be conveniently acquired as a supplement to color appearance information, which helps to better perceive and understand complex and challenging scenes, such as those with similar-looking objects and backgrounds or varying illumination.
The associate editor coordinating the review of this manuscript and approving it for publication was Gustavo Olague.

Therefore, RGB-D salient object detection using depth cues has attracted more and more attention from researchers.
For a given set of RGB-D (RGB + depth) images, the purpose of RGB-D SOD is to predict a saliency map and extract salient regions by exploring the complementary relationship between the color image and the depth data. Traditional RGB-D saliency detection models based on hand-crafted features mainly use depth information to mine a few effective auxiliary feature attributes, such as longitudinal distance, boundaries, shape, surface normals, etc. These properties can improve the ability of models to detect salient objects in complex scenes. Over the past few years, numerous traditional RGB-D models have been developed [5]-[24]. Specifically, some methods treat the depth feature as an explicit supplement to the color feature [5]-[16]. For example, Cheng et al. [8] extend the 2D center bias to a 3D spatial bias using the longitudinal depth distance, and combine it with color contrast and depth contrast to calculate a saliency map. Zhu and Li [13], [14] directly employ the depth map to generate the depth feature saliency and merge it with the color feature saliency; a background elimination model or the center dark channel prior is then applied to optimize the fused saliency map. Others devote themselves to designing depth measurement algorithms that capture implicit attributes such as shape and contour in the depth map [17]-[21]. In [17], instead of using absolute depth distance, Ju et al. propose an anisotropic center-surround difference (ACSD) measure which pops out salient objects from the scene with the help of global depth structure. In [18], Ren et al. present the normalized depth prior and the global-context surface orientation prior. These priors can highlight near objects, weaken distant objects, and decrease the saliency of severely inclined surfaces (such as the ground plane or ceilings). Given that high-contrast areas in the background may cause false positives during detection, Feng et al. [19] design a local background enclosure (LBE) feature that captures the spread of angular directions, quantifying the proportion of the object boundary that is in front of the background in the depth map. In general, depth feature-based methods implement RGB-D saliency detection in an intuitive and simple way, but ignore the potential feature attributes in the depth map. By contrast, depth measurement-based methods aim to refine the saliency result by utilizing implicit attributes. Moreover, to deal with varying depth quality, Cong et al. [22] present a depth confidence measure to assess the reliability of the depth map and control the fusion ratio of depth and color features.
However, due to the limited expressiveness of traditional hand-crafted features, the complementarity between RGB and depth features cannot be fully explored, which greatly restricts the improvement of algorithm performance. To solve this problem, many researchers have introduced convolutional neural networks (CNNs) to better integrate RGB and depth data [25]-[40]. CNN-based models can learn deeper feature representations and thoroughly mine the associations between RGB images and depth cues to improve saliency detection performance. According to the fusion strategy, RGB-D saliency detection networks can be roughly divided into three types [41]: 1) early fusion; 2) late fusion; 3) multi-scale fusion. Early fusion directly concatenates the RGB image and depth map to form a four-channel input [32], or first combines low-level features extracted from independent networks and then feeds them to the subsequent network [25]. Late fusion mainly adopts two parallel networks to learn high-level features from the RGB and depth maps, respectively, and then connects them to generate the final saliency prediction map [40]. Multi-scale fusion integrates cross-modal interaction modules into the feature learning networks and explores the complementarity between deep-layer and shallow-layer features; it is also the most mainstream fusion strategy at present. Typically, Hao et al. [26] propose a multi-scale multi-path fusion network, which diversifies the single fusion path into a global reasoning path and a local capturing path, and meanwhile introduces multi-level cross-modal interactions in multiple layers to achieve sufficient and efficient fusion. Li et al. [37] design an information conversion module to integrate high-level RGB and depth features in an interactive and adaptive way, and a cross-modal depth-weighted combination block to enhance RGB features with depth features at each level. Moreover, Wang et al. [42] propose a fusion method completely different from the above networks. To prevent the dual-stream architecture from favoring the RGB sub-branch in the subsequent fusion process, they design a novel data-level recombination strategy which converts the original four-dimensional RGB-D data into DGB, RDB, and RGD. These reorganized data are then sent to a lightweight three-stream network for complementary fusion.
It has been proved that the depth map, with its rich spatial information, is meaningful for detecting salient objects against a cluttered background. However, due to the limitations of depth sensors, the quality of depth maps varies greatly across environments. A poor depth map often suffers from serious noise or blurred edges, which severely affects detection accuracy and can even lead to detection failure. In response to this problem, some interesting works [43], [44] have emerged. Wang et al. [43] propose a simple yet effective D (depth) quality measurement scheme. The core idea is to design a series of features based on the common attributes of high-quality D regions and then combine them with RGB and D saliency cues to guide selective RGB-D fusion. Chen et al. [44] present a two-stage depth estimation method. First, the correspondence between the input and its similar images is established through retrieval and combined with a designed depth transferring strategy to estimate a coarse depth. Then, they construct fine-grained, object-level correspondences to improve the quality of the estimated depth, and finally feed the estimated depth and the original depth into a selective fusion network. However, these depth measurements and estimations incur additional computation and time.
On the whole, although traditional and deep learning-based approaches have achieved good results, how to further alleviate the impact of poor depth maps and effectively integrate the RGB and depth modalities remains a challenge worth exploring. Therefore, based on the above observations, we further clarify the main purpose of this RGB-D SOD task, which is to design an effective and universal fusion network. The network should excel at extracting salient objects without additional depth measurement schemes, regardless of the depth quality in the scene. In other words, given RGB images and depth images of unknown quality, the network can adaptively complete a valuable and complementary fusion rather than biasing towards a certain modality or fusing at a fixed, rigid ratio. To achieve this goal, in this work, we present a progressive guided fusion network (PGFNet) with multi-modal and multi-scale attention for RGB-D salient object detection. The PGFNet is composed of four key parts. First, we adopt two parallel ResNet-50 [45] or VGG-16 [46] networks to extract RGB and depth features, respectively. Next, a multi-modal and multi-scale attention fusion model (MMAFM) is designed to adaptively fuse the modal features in each layer. Then, we propose a multi-modal feature refinement mechanism (MFRM) to optimize the original RGB and depth features with the assistance of the high-level fusion feature. Finally, the residual prediction module (RPM) is used to predict the saliency map of each layer. We alternately cascade these modules in a top-down manner, which continuously enhances and optimizes the fusion of multi-modal features to obtain the final saliency prediction map.
The main contributions of our paper can be summarized as follows: 1) We construct a novel network, i.e., PGFNet, which aims to adequately and efficiently learn the complementarity of multi-modal and multi-scale features in diverse layers, as well as to detect salient regions more accurately.
2) We design a multi-modal and multi-scale attention fusion model (MMAFM), which utilizes the semantic associations between modalities to adaptively fuse features at different modalities and different scales for selecting and enhancing valuable information.
3) To better express the semantic information of multiple modalities, we propose a multi-modal feature refinement mechanism (MFRM), which combines the high-level fusion feature to further optimize the shallow-layer features, so that they can retain more details while having richer global context information.
4) Comprehensive experiments on eight popular benchmark datasets under five widely used evaluation metrics demonstrate that the proposed PGFNet is highly competitive with the state-of-the-art RGB-D salient object detection models.
The rest of this article is organized as follows: In Section II, we elaborate on the proposed PGFNet. In Section III, we conduct extensive experiments to confirm the superiority and effectiveness of our PGFNet. In Section IV, we provide the conclusion.

II. METHODOLOGY
In this section, we elaborate on the proposed progressive guided fusion network (PGFNet) with multi-modal and multi-scale attention. As illustrated in Fig. 1, the overall framework is based on a symmetrical two-stream encoder-decoder architecture, which mainly consists of four subsections: the feature encoding module, the multi-modal and multi-scale attention feature fusion model, the high-level fusion feature guided multi-modal feature refinement mechanism, and the residual prediction module.

A. FEATURE ENCODING
Considering the computational complexity, we employ the ResNet-50 [45] network for feature encoding. As shown in Fig. 1, the RGB image and depth map are encoded separately through the two-stream encoders. To be concise, we denote the encoding blocks in the RGB branch as E_R^i (i ∈ {1, 2, 3, 4, 5}, where i is the block index) and the corresponding output features as F_R^i, and define the encoding blocks in the depth branch as E_D^i (i ∈ {1, 2, 3, 4, 5}) with the corresponding output features F_D^i. Given an RGB image I_R and an aligned depth map I_D, the encoding blocks produce two feature groups, each containing five features at different levels and scales. The values in each encoding block in Fig. 1 represent the length, width, and channel size of the output feature, respectively. When we adopt ResNet-50 as the backbone, the input resolutions of the RGB image and depth map are set to 352 × 352 × 3 and 352 × 352 × 1, respectively.
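The two-stream encoding above can be sketched at the shape level as follows. This is a minimal stand-in, not the pretrained ResNet-50 backbone: each branch mimics the five ResNet-50 stage outputs (channels 64, 256, 512, 1024, 2048 at strides 2, 4, 8, 16, 32), and the depth branch takes a 1-channel map. All class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class TwoStreamEncoder(nn.Module):
    """Shape-level stand-in for one branch of the two parallel encoders.

    Produces five features F^1..F^5 whose channel sizes and strides match the
    five ResNet-50 stage outputs used in the paper.
    """
    def __init__(self, in_ch):
        super().__init__()
        chans = [64, 256, 512, 1024, 2048]   # ResNet-50 stage channels
        blocks, prev = [], in_ch
        for c in chans:                       # each stage halves the resolution
            blocks.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=2, padding=1),
                nn.BatchNorm2d(c), nn.ReLU(inplace=True)))
            prev = c
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        feats = []
        for blk in self.blocks:
            x = blk(x)
            feats.append(x)                   # collect F^1 .. F^5
        return feats

rgb_enc, depth_enc = TwoStreamEncoder(3), TwoStreamEncoder(1)
f_r = rgb_enc(torch.randn(1, 3, 352, 352))   # RGB feature group
f_d = depth_enc(torch.randn(1, 1, 352, 352)) # depth feature group
```

With 352 × 352 inputs, the deepest features F_R^5 and F_D^5 come out at 11 × 11 with 2048 channels, matching the 32× total stride of ResNet-50.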

B. MULTI-MODAL AND MULTI-SCALE ATTENTIVE FUSION MODEL
For an RGB-D SOD task, how to effectively utilize depth cues is a crucial point. An accurate depth map can provide precise spatial structure clues and improve detection accuracy. In contrast, a poor depth map contains massive disturbance and erroneous information, which is detrimental to detection performance. Therefore, how to adequately aggregate RGB and depth cues at different layers is extremely critical. Inspired by the stereoscopically attentive multi-scale (SAM) module [47], we propose a multi-modal and multi-scale attentive fusion model (MMAFM), as shown in Fig. 2. Different from the SAM module, our model not only adaptively learns the weight factor of each scale according to the characteristics of its own modality, but also globally guides the selection and optimization at the corresponding modal scale by combining multi-modal information. In detail, MMAFM comprises three processes: multi-scale feature extraction, multi-modal and multi-scale attention, and feature fusion, which are explained below.

1) MULTI-SCALE FEATURE EXTRACTION
In order to acquire richer global semantic information and reduce information dilution in the decoder, a multi-scale operation is applied to the deep-layer features. Specifically, as illustrated on the left part of Fig. 2, this portion contains five parallel branches for each modality. In all branches, a 1 × 1 convolution is adopted to compress the channel size to 32, which greatly reduces the calculation and complexity of the network model. Then, four parallel 3 × 3 dilated convolutions with different dilation rates are applied to obtain abundant global context. Without loss of generality, for the i-th layer refined features F̃_R^i and F̃_D^i (i ∈ {1, 2, 3, 4, 5}), the multi-scale operation yields the multi-scale feature groups of RGB and depth, i.e., [f_R^0, . . . , f_R^{N−1}] and [f_D^0, . . . , f_D^{N−1}], which are described as

f_{R/D}^0 = Conv(F̃_{R/D}^i),  f_{R/D}^k = DConv(Conv(F̃_{R/D}^i)),  k ∈ {1, . . . , N − 1},   (1)

where F̃_R^i and F̃_D^i are the refined RGB and depth features, which are described in detail in Section II-C.

FIGURE 1. The overall architecture of our proposed PGFNet, which consists of four key stages: feature encoding, feature fusion, feature refinement, and saliency prediction. First, the feature encoding networks (ResNet-50 [45]) extract features from the RGB and depth images. Next, the multi-modal features of the highest layer are fed to the multi-modal and multi-scale attention fusion model (MMAFM) for adaptive integration and for enhancing the response to beneficial features. Then, the features of the remaining layers are transmitted to a high-level guided multi-modal feature refinement mechanism (MFRM) before fusion, pursuing more details and global context information. Finally, the residual prediction modules optimize and decode the fusion features in each layer to highlight salient objects. Here, MAFM refers to a multi-modal attention fusion model without multi-scale information, which is applied at the shallow layers. Notably, in the training phase, the pixel-level ground truth (GT) is used to supervise all saliency maps generated by the network.
Here, k denotes the branch index and N is the total number of branches. The more branches, the greater the computation required, so N is empirically set to 5 in this article. Conv(·) represents a 1 × 1 convolution operation, and DConv(·) represents a 3 × 3 dilated convolution; for the k-th branch (k ∈ {1, 2, . . . , N − 1}), its dilation rate is 2k − 1. Considering the computational complexity, and since the shallow-layer features may contain more noise, the multi-scale operation is not applied to those layers, that is, k there is only equal to 0.

FIGURE 3. Illustration of the attention mechanism (AM). The parameter r represents the reduction factor, which is set to 4.
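The multi-scale extraction described above can be sketched as follows: branch 0 is a plain 1 × 1 compression to 32 channels, while branches k = 1..4 follow the compression with a 3 × 3 dilated convolution of dilation 2k − 1 (i.e. 1, 3, 5, 7). Module and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class MultiScaleExtraction(nn.Module):
    """Sketch of the N = 5 branch multi-scale extraction in MMAFM."""
    def __init__(self, in_ch, mid_ch=32, n_branches=5):
        super().__init__()
        # one 1x1 compression per branch
        self.compress = nn.ModuleList(
            [nn.Conv2d(in_ch, mid_ch, 1) for _ in range(n_branches)])
        # 3x3 dilated convolutions with dilation 2k - 1 for k = 1..N-1
        self.dilated = nn.ModuleList(
            [nn.Conv2d(mid_ch, mid_ch, 3, padding=2 * k - 1, dilation=2 * k - 1)
             for k in range(1, n_branches)])

    def forward(self, x):
        feats = [self.compress[0](x)]               # f^0: 1x1 conv only
        for k in range(1, len(self.compress)):
            feats.append(self.dilated[k - 1](self.compress[k](x)))  # f^k
        return feats

ms = MultiScaleExtraction(2048)
feats = ms(torch.randn(1, 2048, 11, 11))  # five 32-channel scale features
```

Setting `padding` equal to the dilation rate keeps all branch outputs at the input resolution, so they can later be weighted and fused element-wise.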

2) MULTI-MODAL AND MULTI-SCALE ATTENTION
If all the scale features of the above RGB and depth modalities were directly aggregated by simple element-wise summation or concatenation, the beneficial branches might be weakened or even submerged by the useless ones. In addition, the information provided by each branch may have a different focus, which is often overlooked if the branches are treated equally. To alleviate this issue, modal attention and scale attention are combined in the fusion model. We not only consider the importance of each scale feature within its own modality, but also integrate the information of both modalities to adaptively guide and refine the scale features of each modality. In this way, the complementarity and difference of features are adequately explored from the perspectives of both modality and scale. In detail, as shown on the right part of Fig. 2, we separately concatenate all branches of the RGB and depth modalities and feed them into the attention mechanisms (AM) after a 1 × 1 convolution to obtain the scale attention weights of the corresponding modality. Subsequently, we concatenate the outputs of the two 0-th branches and input them into an attention mechanism after a 1 × 1 convolution to obtain the modal attention weights at the corresponding scale. As shown in Fig. 3, the attention mechanism includes channel attention and spatial attention. It aims to learn the attention of each branch by using all branches of a single modality or the original features of both modalities, suppressing non-informative features and focusing on specific spatial locations in a global manner. The operation of the attention mechanism F_AM(·) is generally defined as

W = F_AM(F) = F_SA(F ⊗ F_CA(F)),   (2)

where W ∈ R^{N×C×H×W} includes the attention weights of the N branches, W = [w^0, w^1, . . . , w^k] (k ∈ {0, 1, . . . , N − 1}), and w^k is the attention weight of the k-th branch. F ∈ R^{C×H×W} is the input feature, and H, W, C represent its length, width, and channel size.
F_CA(·) and F_SA(·) represent the channel attention and spatial attention, respectively, and ⊗ denotes element-wise multiplication. More specifically,

F_CA(F) = F_MLP(F_GAP(F)),   (3)

where F_GAP(·) represents the global average pooling (GAP) operation and F_MLP(·) is a multi-layer (two-layer) perceptron. The spatial attention is defined as

F_SA(F) = Convs(F),   (4)

where Convs(·) represents the execution of four convolution operations in sequence: a 1 × 1 convolution, two 3 × 3 dilated convolutions, and a 1 × 1 convolution. Next, we can formally calculate the scale attention weights U_R^S, U_D^S and the modal attention weights V_R^M, V_D^M according to (2):

U_{R/D}^S = F_AM(Conv(Cat(f_{R/D}^0, . . . , f_{R/D}^{N−1}))),   (5)

where U_R^S = [u_R^0, . . . , u_R^{N−1}] includes the RGB scale weights extracted from all RGB branches, and u_R^k is the scale attention weight of the k-th branch of the RGB modality. Similarly, U_D^S = [u_D^0, . . . , u_D^{N−1}] contains the depth scale weights extracted from all depth branches, and u_D^k is the scale attention weight of the k-th branch of the depth modality. Cat(·) denotes the concatenation operation, and Conv(·) represents a 1 × 1 convolution operation for reducing parameters and computational complexity.
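The attention mechanism of Fig. 3 can be sketched as below: channel attention (GAP followed by a two-layer MLP) re-weights the channels, and spatial attention (a 1 × 1 convolution, two 3 × 3 dilated convolutions, and a 1 × 1 convolution) produces the final weight map. The reduction factor r = 4 follows the paper; the dilation rates and activations are assumptions.

```python
import torch
import torch.nn as nn

class AttentionMechanism(nn.Module):
    """Sketch of the AM in Fig. 3: channel attention then spatial attention."""
    def __init__(self, ch, r=4):
        super().__init__()
        self.mlp = nn.Sequential(                      # F_MLP for F_CA
            nn.Linear(ch, ch // r), nn.ReLU(inplace=True),
            nn.Linear(ch // r, ch), nn.Sigmoid())
        self.convs = nn.Sequential(                    # Convs for F_SA
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch // r, 3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch // r, 3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())

    def forward(self, f):
        w_ca = self.mlp(f.mean(dim=(2, 3)))            # F_CA: GAP then MLP
        f_ca = f * w_ca[:, :, None, None]              # channel re-weighting
        return self.convs(f_ca)                        # per-position weights W

am = AttentionMechanism(32)
w = am(torch.randn(1, 32, 22, 22))  # weight map, same size as the input
```

In MMAFM this module is applied to the 1 × 1 compressed concatenation of branches, so the output weight map is split per branch; the sketch keeps only the single-input case for clarity.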
V_R^M = [v_R^0, . . . , v_R^{N−1}] includes the RGB modal weights at the different RGB scales, which are learned from the original features of both modalities (i.e., the RGB and depth modalities), and V_D^M = [v_D^0, . . . , v_D^{N−1}] contains the depth modal weights at the different depth scales:

V_R^M, V_D^M = F_AM(Conv(Cat(f_R^0, f_D^0))).   (6)

Then, we calculate the final weight of each branch by combining (5) and (6):

H_{R/D} = [h_{R/D}^0, . . . , h_{R/D}^{N−1}],  h_{R/D}^k = u_{R/D}^k ⊗ v_{R/D}^k,   (7)

where H_R and H_D are the compositive attention weights of the RGB and depth branches, respectively, and h_{R/D}^k = u_{R/D}^k ⊗ v_{R/D}^k denotes the attention weight of the k-th RGB or depth branch. The weighted multi-scale features f̃_R and f̃_D can then be described as

f̃_{R/D}^k = h_{R/D}^k ⊗ f_{R/D}^k,  k ∈ {0, 1, . . . , N − 1}.   (8)
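For a single branch, the combination of scale and modal attention described above is plain element-wise arithmetic, as the following illustrative snippet shows (tensor shapes are assumptions):

```python
import torch

# Compositive weight for one branch: the element-wise product of the scale
# attention weight u^k and the modal attention weight v^k rescales the branch
# feature f^k. All shapes below are illustrative.
u_k = torch.rand(1, 32, 44, 44)   # scale attention weight of branch k
v_k = torch.rand(1, 32, 44, 44)   # modal attention weight of branch k
f_k = torch.randn(1, 32, 44, 44)  # branch feature f^k

h_k = u_k * v_k                   # compositive weight h^k
f_tilde_k = h_k * f_k             # weighted multi-scale feature
```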

3) FEATURE FUSION
For multi-scale feature fusion across the two modalities, we adopt a two-step approach: inter-modal fusion and inter-scale fusion. First, we simply aggregate the modal scales from the same branches by element-wise summation:

f^k = f̃_R^k ⊕ f̃_D^k,  f = [f^0, f^1, . . . , f^{N−1}],   (9)

where f is the multi-scale feature group after inter-modal fusion, f^k is the feature merged from the two k-th modality branches, and ⊕ denotes element-wise summation. It is worth noting that dilated convolutions can enlarge the receptive field of features, whereas a large dilation rate may lead to a serious gridding effect and lose spatial details. To overcome this, we conduct a top-down progressive fusion over the scale levels, as shown in Fig. 4. Normally, the smaller the dilation rate, the less the information continuity is destroyed, and the more original details can be retained. Therefore, we continuously propagate the high-level scale feature with a large dilation rate to the lower level, so that the final fusion feature can be supplemented with contextual information while preserving local details. Concretely, the fusion operation between two adjacent scale levels can be expressed as

f^{k,k+1} = Conv(Cat(f^k, f^{k+1,k+2})),   (10)

where k ∈ {0, 1, . . . , N − 2} and f^{k,k+1} represents the fusion feature between the k-th and (k + 1)-th scale levels. When N = 5, the initial fusion feature is f^{N−1,N} = f^{N−1} = f^4. Then, the final fusion feature at the i-th layer is obtained from (10) as F_fusion^i = f^{0,1}.
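The top-down inter-scale fusion of Fig. 4 can be sketched as below: starting from the largest-dilation branch, each step merges the running fusion result with the next smaller-dilation branch. The concatenate-then-convolve merge is an assumption consistent with the paper's other fusion steps, and all names are illustrative.

```python
import torch
import torch.nn as nn

def progressive_scale_fusion(feats, fuse_convs):
    """Top-down fusion over scale levels: f^{N-1,N} = f^{N-1}, then each
    step k = N-2..0 merges the current branch with the running fusion."""
    fused = feats[-1]                        # initial fusion: largest dilation
    for k in range(len(feats) - 2, -1, -1):  # k = N-2 .. 0
        fused = fuse_convs[k](torch.cat([feats[k], fused], dim=1))
    return fused                             # final fusion feature F^i_fusion

# 5 scale levels of 32 channels; 4 merge convolutions (64 -> 32 channels)
convs = nn.ModuleList([nn.Conv2d(64, 32, 3, padding=1) for _ in range(4)])
feats = [torch.randn(1, 32, 44, 44) for _ in range(5)]
out = progressive_scale_fusion(feats, convs)
```

Because the branch with the smallest dilation is merged last, the final feature keeps local detail while being progressively supplemented with the larger-context branches.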

C. MULTI-MODAL FEATURE REFINEMENT MECHANISM
The output features from the deep-layer encoding blocks contain rich high-level semantic information, which can indicate the approximate position and shape of the object in the scene. To this end, we design a multi-modal feature refinement mechanism (MFRM) guided by the high-level feature. Its main goal is to gradually supplement the shallow-layer original features with the strong semantics of the deep-layer multi-modal fusion feature through a top-down cascade. The proposed refinement strategy can effectively improve the global semantic representation ability of each modal feature, reduce the interference of redundant information, automatically select and strengthen important feature cues for saliency detection, and further improve the quality of the subsequent fusion operation. As shown in Fig. 5a, we adopt a symmetrical attention sub-module to capture the relationship between each individual modality and the high-level fusion feature. Specifically, for the RGB feature F_R^i of the i-th (i ∈ {1, 2, 3, 4}) layer, we first feed it into a 1 × 1 convolution to reduce the channel size to 64. Two 3 × 3 convolution blocks follow to enlarge the receptive field and extract more useful unimodal information. In addition, to prevent information loss during the convolution process, the compressed feature is added back through a residual connection to retain more of the original attributes. The same operation is applied to F_D^i to strengthen the depth feature. We then obtain the enhanced RGB and depth features (i.e., F_ER^i and F_ED^i) and input them, together with the upper-layer optimized fusion feature F̃_fusion^{i+1}, into the high-level guided attention mechanism (HGAM) to separately learn supplementary enhancement features. It is worth noting that F̃_fusion^{i+1} is the fusion feature optimized by the residual prediction module (RPM); for the specific calculation process, see Section II-D. The main structure of HGAM is shown in Fig. 5b.
Furthermore, inspired by the successful application of self-attention [48]-[50], we use it in the refinement model to replace the structure in Fig. 5b, as shown in Fig. 5c. Considering the computational complexity of self-attention, it is only applied to the 3rd and 4th layers. The operations of the two attention mechanisms are denoted F_HGAM1 and F_HGAM2, respectively. Then we have

F̃_{R/D}^i = F_HGAM(F_{ER/ED}^i, F̃_fusion^{i+1}),   (11)

where F̃_R^i and F̃_D^i represent the refined RGB and depth features, respectively, which are employed as the input features of the multi-modal and multi-scale attention fusion model (MMAFM). Notably, F̃_{R/D}^5 = F_{R/D}^5, which means that the top layer is fused directly without a refinement mechanism. Mathematically,

F_HGAM1(F_{ER/ED}^i, F̃_fusion^{i+1}) = F_{ER/ED}^i ⊗ F_AM(U(F̃_fusion^{i+1})) ⊕ F_{ER/ED}^i,   (12)

where U(·) denotes the up-sampling operation via bilinear interpolation, applied when the features are not at the same scale. The process of F_HGAM2 can be described as follows:

F_in = U(F̃_fusion^{i+1}),   (13)
w_A = softmax(R_1(Conv(F_in))^T ⊗ R_1(Conv(F_in))),   (14)
F'_{ER/ED} = R_2(R'_1(Conv(F_{ER/ED}^i)) ⊗ w_A),   (15)
F_HGAM2(F_{ER/ED}^i, F̃_fusion^{i+1}) = F'_{ER/ED} ⊕ F_{ER/ED}^i,   (16)

where F_in is the feature after up-sampling, and w_A is an attention weight that considers the pairwise relationship between any two points in the high-level feature map. softmax(·) is an element-wise softmax function, R_1(·) reshapes the input feature to R^{C_1×(HW)}, R'_1(·) reshapes the input feature to R^{C_2×(HW)}, and R_2(·) reshapes the input feature to R^{C_2×H×W}. Notably, C_1 is set to 1/2 of C_2 = C.
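The self-attention variant of Fig. 5c can be sketched as a non-local block in which the pairwise affinity w_A is computed from the up-sampled high-level fusion feature and re-weights the enhanced unimodal feature. The projection layout (C_1 = C/2 query/key projections, a C-channel value projection, and a residual connection) follows the description above; the exact convolutions are assumptions based on standard self-attention [48]-[50], and names are illustrative.

```python
import torch
import torch.nn as nn

class SelfAttentionHGAM(nn.Module):
    """Sketch of the high-level guided self-attention used at layers 3-4."""
    def __init__(self, ch):
        super().__init__()
        c1 = ch // 2                        # C_1 = C_2 / 2
        self.theta = nn.Conv2d(ch, c1, 1)   # query projection on F_in
        self.phi = nn.Conv2d(ch, c1, 1)     # key projection on F_in
        self.g = nn.Conv2d(ch, ch, 1)       # value projection on F_E

    def forward(self, f_e, f_in):
        b, c, h, w = f_e.shape
        q = self.theta(f_in).flatten(2)                     # B x C1 x HW
        k = self.phi(f_in).flatten(2)                       # B x C1 x HW
        w_a = torch.softmax(q.transpose(1, 2) @ k, dim=-1)  # B x HW x HW
        v = self.g(f_e).flatten(2)                          # B x C2 x HW
        out = (v @ w_a).view(b, c, h, w)                    # reshape back
        return out + f_e                                    # residual add

hgam = SelfAttentionHGAM(32)
out = hgam(torch.randn(1, 32, 8, 8), torch.randn(1, 32, 8, 8))
```

Because the HW × HW affinity matrix grows quadratically with resolution, restricting this module to the deeper (smaller) layers, as the paper does, keeps the cost manageable.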

D. RESIDUAL PREDICTION MODULE
As is well known, the shallow layers of deep networks capture low-level structural cues while the deep layers capture high-level semantic information. To take maximum advantage of the complementarity and difference between the feature layers, our network adopts a progressive guided manner to integrate and refine features. It is committed to gradually transferring high-level semantic information from the deep layers to the lower layers and learning a more accurate salient object with clearer edges. Furthermore, we can obtain a rough saliency map from the top-layer multi-modal fusion feature; this map indicates the approximate location and shape of the salient object while effectively suppressing and eliminating most background elements. Based on the above observations, we design a residual prediction module (RPM) built on the saliency map to further optimize the shallow-layer fusion features by combining them with the deep-layer saliency cues. This operation alleviates the gradual sparsening of high-level information during the fusion process and suppresses the background noise in the low-level features. As illustrated in Fig. 6, given an initial fusion feature F_fusion^i of the i-th layer, an optimized fusion feature F̃_fusion^{i+1}, and a prediction map S^{i+1} of the (i + 1)-th layer, the module outputs the optimized fusion feature F̃_fusion^i and the prediction map S^i, which are defined as:

F̃_fusion^i = Conv(Cat(F_fusion^i ⊗ δ(S_up^{i+1}), U(F̃_fusion^{i+1}))),   (17)
S^i = Conv(F̃_fusion^i) ⊕ S_up^{i+1},   (18)

where S_up^{i+1} is the saliency prediction map after up-sampling and δ(·) is the sigmoid function. The outputs of this layer guide the multi-modal feature refinement mechanism and residual prediction module of the next layer, and so on, until the first layer. For the top layer, we first utilize the initial fusion feature F_fusion^5 to directly generate an initial prediction map, i.e., S^6, and then feed them to the residual prediction module.
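A minimal sketch of the RPM, under the assumption that the upper-layer saliency map gates the current fusion feature, the optimized upper-layer fusion feature is merged back in, and each layer's prediction is a residual on top of the up-sampled map (all inputs are assumed to be at the same resolution; names are illustrative):

```python
import torch
import torch.nn as nn

class ResidualPredictionModule(nn.Module):
    """Sketch of the RPM of Fig. 6 (merge details are assumptions)."""
    def __init__(self, ch):
        super().__init__()
        self.merge = nn.Conv2d(2 * ch, ch, 3, padding=1)  # fuse gated + upper
        self.pred = nn.Conv2d(ch, 1, 3, padding=1)        # saliency head

    def forward(self, f_fusion, f_fusion_up, s_up):
        gated = f_fusion * torch.sigmoid(s_up)            # suppress background
        f_opt = self.merge(torch.cat([gated, f_fusion_up], dim=1))
        s = self.pred(f_opt) + s_up                       # residual prediction
        return f_opt, s

rpm = ResidualPredictionModule(32)
f_opt, s = rpm(torch.randn(1, 32, 44, 44),   # F^i_fusion
               torch.randn(1, 32, 44, 44),   # optimized upper fusion feature
               torch.randn(1, 1, 44, 44))    # upper-layer prediction (logits)
```

The residual form means each layer only learns a correction to the coarser prediction, which matches the module's stated goal of refining rather than re-predicting the saliency map.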

E. LOSS FUNCTION
As shown in Fig. 1, we supervise the prediction map at each layer, which clarifies the optimization goal of each step of the network and accelerates the convergence of training. Moreover, to better guide network learning and produce more details, we introduce the pixel position aware (PPA) loss [51], which synthesizes local structure information to generate different weights for all pixels and combines a pixel-level constraint (the weighted binary cross entropy (wBCE) loss) with a global constraint (the weighted intersection over union (wIoU) loss). Mathematically,

L_ppa = L_wBCE + L_wIoU,   (19)

where L_wBCE is defined as

L_wBCE = − (Σ_{i=1}^{H} Σ_{j=1}^{W} (1 + γα_ij) Σ_{l=0}^{1} 1(GT_ij = l) log Pr(S_ij = l|Θ)) / (Σ_{i=1}^{H} Σ_{j=1}^{W} (1 + γα_ij)),   (20)

where H and W are the height and width of the saliency map, 1(·) is the indicator function, and γ is a hyperparameter. l ∈ {0, 1} denotes the two label classes. S_ij and GT_ij are the predicted saliency map and the ground truth at pixel (i, j). Θ represents all the parameters of the model, and Pr(S_ij = l|Θ) is the predicted probability. α_ij is the indicator of pixel importance, defined by the difference between the center pixel and its surroundings. L_wBCE helps the model pay more attention to hard edge areas. Analogously, the wIoU loss is defined as

L_wIoU = 1 − (Σ_{i=1}^{H} Σ_{j=1}^{W} (1 + γα_ij) · S_ij · GT_ij) / (Σ_{i=1}^{H} Σ_{j=1}^{W} (1 + γα_ij) · (S_ij + GT_ij − S_ij · GT_ij)).   (21)

In summary, the total loss function of our network is expressed as

L_total = Σ_i α_i · L_ppa(S^i, GT),   (22)

where α_i is the weight coefficient, simply set to 1 in our experiments.
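The PPA loss of [51] can be sketched compactly in PyTorch. Following the reference implementation of that work, the pixel importance α is the difference between each ground-truth pixel and its local (31 × 31 window) mean; γ = 5 and the window size are assumptions carried over from [51], not values stated in this paper.

```python
import torch
import torch.nn.functional as F

def ppa_loss(pred, gt, gamma=5):
    """Weighted BCE + weighted IoU; pred is a logit map, gt is in {0, 1}."""
    # pixel importance: deviation of gt from its local 31x31 mean
    alpha = torch.abs(F.avg_pool2d(gt, 31, stride=1, padding=15) - gt)
    w = 1 + gamma * alpha                                 # per-pixel weight

    bce = F.binary_cross_entropy_with_logits(pred, gt, reduction='none')
    wbce = (w * bce).sum(dim=(2, 3)) / w.sum(dim=(2, 3))  # weighted BCE

    p = torch.sigmoid(pred)
    inter = (w * p * gt).sum(dim=(2, 3))
    union = (w * (p + gt - p * gt)).sum(dim=(2, 3))
    wiou = 1 - inter / union                              # weighted IoU

    return (wbce + wiou).mean()

loss = ppa_loss(torch.randn(2, 1, 64, 64),
                (torch.rand(2, 1, 64, 64) > 0.5).float())
```

In training, this loss would be evaluated on every supervised prediction map and summed, matching the deep supervision described above.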

III. EXPERIMENTS
A. DATASETS
To demonstrate the effectiveness of our proposed method, we evaluate it on eight public benchmark datasets. NJU2K [17] contains 1985 RGB-D images which are collected from the Internet, 3-D movies and photographs taken by stereo camera, and depth maps are estimated by the optical-flow method.
NLPR [9] includes 1000 RGB-D images, where the depth maps are captured by Microsoft Kinect.
STERE [5] contains 1000 pairs of binocular images with the corresponding pixel-level ground truth. This is the first collection of stereoscopic images in this field.
DES [8] is a small dataset comprising 135 indoor RGB-D images, taken by Kinect at a resolution of 640 × 480.
SSD [13] is built on three stereo movies and includes indoor and outdoor scenes. It has 80 images with the resolution of 960 × 1080.
SIP [30] consists of 929 high-resolution images, which are designed for salient person detection in complex scenes. The depth maps are captured by a smartphone (Huawei Mate10).
LSDF [52] includes 100 light fields captured by a Lytro light field camera.
DUT [29] is a recently released dataset containing 800 indoor and 400 outdoor scenes, some of which are quite challenging.

B. EVALUATION METRICS
Following [30], we use the following five popular evaluation metrics to comprehensively evaluate the performance of the saliency detection methods.
MAE estimates the mean absolute error between a predicted saliency map S and a ground-truth map GT. It is defined as

MAE = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} |S(i, j) − GT(i, j)|.   (23)

The PR curve is formed by a series of precision and recall pairs calculated at fixed thresholds ranging from 0 to 255, which describes the model performance under different situations.
F-measure is a harmonic mean of average precision and recall, defined as

F_β = ((1 + β²) · Precision · Recall) / (β² · Precision + Recall),   (24)

where we empirically set β² = 0.3. S-measure [53] is used to measure spatial structure information and is defined as

S = α · S_o + (1 − α) · S_r,   (25)

where α is a balance parameter between the object-aware structural similarity S_o and the region-aware structural similarity S_r, and it is set to 0.5. S_r is the weighted sum of the structural similarities of multiple image blocks; the greater the proportion of a block covering the GT foreground region, the greater the weight allocated. S_o jointly considers foreground structure similarity and background structure similarity. E-measure [54] evaluates the foreground map (FM) and noise; it combines local pixel values with the image-level mean to jointly capture image-level statistics and local pixel matching information:

E_m = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} φ(i, j),   (26)

where φ is an enhanced alignment matrix capturing the two properties of a binary map.
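The two simplest metrics above can be sketched directly in NumPy; the single-threshold F-measure shown here is the building block that a PR curve repeats over thresholds 0..255 (the binarization threshold 0.5 is illustrative):

```python
import numpy as np

def mae(s, gt):
    """Mean absolute error between a saliency map and ground truth in [0, 1]."""
    return np.abs(s - gt).mean()

def f_measure(s, gt, beta2=0.3, thresh=0.5):
    """F-measure at one binarization threshold, with beta^2 = 0.3."""
    pred = s >= thresh
    tp = np.logical_and(pred, gt > 0.5).sum()
    precision = tp / max(pred.sum(), 1)
    recall = tp / max((gt > 0.5).sum(), 1)
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom > 0 else 0.0
```

S-measure and E-measure involve region matching and alignment matrices and are best taken from the reference implementations of [53] and [54].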

C. EXPERIMENTAL SETTINGS
1) TRAINING/TESTING
Following the same training settings as most models, such as in [25], [30], [55], we employ 1485 images from the NJU2K dataset and 700 images from the NLPR dataset as our training set. The remaining images in the NJU2K and NLPR datasets and the whole datasets of STERE, DES, SSD, SIP, and LSDF are used for testing.

2) IMPLEMENTATION DETAILS
We adopt the PyTorch [56] framework to build our network and conduct extensive experiments on an NVIDIA TITAN Xp GPU. The feature encoder is composed of two parallel ResNet-50 networks. The networks discard the last pooling and fully connected layers, and their parameters are initialized from the ImageNet [57] pre-trained model. The other parameters of our network are initialized with the PyTorch default settings. Following [58], we use the Adam algorithm [59] to optimize our model. The initial learning rate is set to 0.0001 and is multiplied by 0.1 every 60 epochs, for a total of 200 epochs. The images are resized to 352 × 352 for both the training and testing stages. To augment the training samples, we also apply random flipping, rotating, and clipping. Training our model takes about 12 hours with a batch size of 8.

TABLE 1. Quantitative results of ablation studies on three popular datasets. Red indicates the best performance, ↑ denotes that larger is better, and ↓ denotes that smaller is better. BCE: binary cross entropy loss. PPA: pixel position aware loss. MMAFM: multi-modal and multi-scale attention fusion model, where −1/−0 represents using or not using the top-down multi-scale feature fusion strategy, respectively. MFRM: multi-modal feature refinement mechanism, where −1/−0 indicates with or without the self-attention, respectively. RPM: residual prediction module.
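The optimization schedule described above maps directly onto standard PyTorch components (the placeholder model and loop body are illustrative; only the optimizer and schedule settings come from the text):

```python
import torch

# Adam with initial lr 1e-4, multiplied by 0.1 every 60 epochs, 200 epochs total.
model = torch.nn.Conv2d(3, 1, 3, padding=1)  # placeholder for PGFNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

for epoch in range(200):
    # ... one epoch of training with batch size 8 on 352x352 inputs ...
    scheduler.step()
```

After the three decays (at epochs 60, 120, and 180), the learning rate ends at 1e-7.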

D. ABLATION STUDY
Our network combines the multi-modal and multi-scale attention fusion model (MMAFM), the multi-modal feature refinement mechanism (MFRM), and the residual prediction module (RPM).
In this subsection, we provide comprehensive ablation experiments on NJU2K [17], NLPR [9], and SIP [30] to demonstrate the effectiveness of these components. Table 1 summarizes all the results of these experiments. Specifically, we analyze 1) the importance of the multi-modal and multi-scale attention fusion model (MMAFM); 2) the necessity of the multi-modal feature refinement mechanism (MFRM) and the residual prediction module (RPM); and 3) the usefulness of the PPA loss. We change only one component at a time, leaving the other parameters unchanged. As the baseline model, we directly combine the highest-layer RGB and depth features extracted from ResNet-50 to predict a saliency map.

1) THE IMPORTANCE OF MMAFM
The MMAFM plays a very important role in the proposed PGFNet. To study its importance, we explore two variants relative to the baseline model: replacing the top-down fusion strategy in MMAFM with direct element-wise summation (i.e., the 3rd row), and using the designed fusion strategy in MMAFM (i.e., the 4th row). As shown in Table 1, compared with the baseline model (i.e., the 2nd row), all evaluation metrics (in the 3rd and 4th rows) show a clear incremental improvement. Overall, the proposed MMAFM improves the metrics (S_α, E_m, F_β, M) by (2.45∼2.69%, 0.64∼1.4%, 3.53∼4.28%, 0.87∼1.76%) on the three datasets. Conclusively, the results of the 3rd and 4th rows confirm that the top-down fusion strategy is better than the direct summation operation; the reason may be that this strategy better alleviates the gridding effect and reserves more spatial details. With the help of MMAFM, our PGFNet captures a more efficient semantic representation of salient objects by taking full advantage of the complementarity between RGB and depth features in terms of different scales and modalities.

2) THE NECESSITY OF MFRM AND RPM
To verify the necessity of MFRM and RPM in our PGFNet, we provide three variants based on the baseline model combined with MMAFM (i.e., the 4th row): introducing MFRM without self-attention (i.e., the 5th row), adding both MFRM and RPM without self-attention (i.e., the 6th row), and adding MFRM and RPM with self-attention (i.e., the 7th row). In Table 1, the 5th row is overwhelmingly better than the 4th row, confirming that our proposed refinement mechanism helps the optimization of salient objects: it is able to reinforce the semantic expression of shallow-layer features. The performance of the 6th row improves over the 5th row, which indicates that our prediction module further promotes the detection accuracy. Finally, we introduce self-attention into the deep-layer MFRM (i.e., the 3rd and 4th layers) to obtain the final result (i.e., the 7th row), which demonstrates that the high-level guided self-attention mechanism is quite valuable and further optimizes the network performance.

3) THE USEFULNESS OF PPA LOSS
To illustrate the usefulness of the loss function, we provide two variants: BCE loss (i.e., the 1st row) and PPA loss (i.e., the 2nd row). As shown in Table 1, compared with BCE loss, PPA loss has certain advantages in all evaluation metrics (S_α, E_m, F_β, M), with increases of about (0.16%, 0.14%, 0.35%, 0.55%); the gain is especially prominent in MAE. This may be because PPA loss pays more attention to the perception of the edge and structural information of salient objects.
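To give a rough intuition for why PPA loss emphasizes edges and structure, here is a simplified numpy sketch of its weighted-BCE term: each pixel's BCE is reweighted by how much the ground truth differs from its local mean, so pixels near boundaries receive larger weights. This is our own illustration, not the paper's implementation; the full PPA loss also includes a weighted IoU term, and the original uses a much larger pooling window than the 3×3 used here.

```python
import numpy as np

def box_mean(x, k=3):
    """Local mean with a k x k box filter (edge-padded), pure numpy."""
    p = k // 2
    xp = np.pad(x, p, mode='edge')
    out = np.zeros_like(x, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += xp[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (k * k)

def weighted_bce(pred, gt, k=3, eps=1e-8):
    """BCE with position-aware weights: boundary pixels (where gt deviates
    from its local mean) contribute more to the loss than flat regions."""
    w = 1.0 + 5.0 * np.abs(box_mean(gt, k) - gt)
    bce = -(gt * np.log(pred + eps) + (1 - gt) * np.log(1 - pred + eps))
    return (w * bce).sum() / (w.sum() + eps)
```

On a flat background the weight stays at 1, while pixels along object boundaries are up-weighted, which matches the observation above that PPA mainly sharpens edge and structure perception.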
Table 2 and Fig. 7 show the quantitative comparison results of the proposed method on seven datasets. We also report saliency maps for various scenes, as shown in Fig. 8. For a fair comparison, the saliency maps of all compared methods are directly provided by their authors or generated by running their released codes. Moreover, we additionally provide a comparison with 8 of the latest CNN-based methods: DMRA [29], SSF [61], S2MA [62], HDFNet [63], FRDT [35], DANet [36], CoNet [64], and A2dele [65]. These methods share the same training set, which adds 800 images from the DUT dataset to the NJU2K and NLPR subsets mentioned above. Accordingly, we retrain our model on this training set and list all the results in Table 3.

2) QUANTITATIVE COMPARISON
We report the PR curves and F-measure curves on seven datasets in Fig. 7, and list the S-measure (S_α), maximum E-measure (E_m), maximum F-measure (F_β), and MAE (M) in Table 2 and Table 3. As shown in Fig. 7, our method achieves better PR curves and F-measure curves on all datasets. This indicates that the proposed PGFNet obtains higher precision and recall than other methods, and that the saliency maps we generate have better consistency.
As listed in Table 2, it is obvious that the performance of CNN-based models is far superior to that of traditional ones, which proves the status and application value of convolutional neural networks in the image processing field. Undoubtedly, compared with both traditional and deep learning-based models, the proposed algorithm shows powerful competitiveness in terms of all evaluation metrics. Performance gains over the best compared models (D3Net, PGAR, and DQAS) are (0.3%∼5.4%, 0.4%∼5.5%, 1%∼6.5%, 0.5%∼2.5%) for the metrics (S_α, E_m, F_β, M) on all datasets except LFSD. Only on three values of the LFSD dataset do we fail to achieve the best results, and even there our values rank second. Moreover, from Table 3, we can clearly see that our model also performs excellently on the new training set.
In conclusion, our model proves effective from a quantitative point of view.

3) VISUAL COMPARISON
We provide visual comparisons with four classical non-deep-learning models and six CNN-based models in Fig. 8. We observe that the proposed method is able to handle several challenging and complicated scenes. To be more convincing, we compare these methods in the following aspects: (1) the ability to handle boundary contacts; (2) the ability to resolve similar appearances; (3) the detection ability for a poor depth map; and (4) the ability to process a depth map with distractors.
Here we combine the examples in Fig. 8 to illustrate the above four aspects. First, in the 4th, 7th, and 8th rows, only the PGAR method responds well to boundary contact issues, but it may misdetect when the object has low contrast in the scene, especially in the 8th row: it fails to make full use of the depth map with clear contours and mistakes the object shadow for a salient area. In contrast, our saliency maps perform better in this situation. Second, as shown in the 6th and 8th rows, the object appearance is relatively close to the background; especially in the 6th row, the color of the chair is quite similar to the door behind it. None of the CNN-based comparison methods can effectively extract the complete object from the scene, while our model achieves this goal by thoroughly exploring the depth cues, improving the detection accuracy. Third, the quality of the depth maps is very poor in the 3rd and 7th rows, which makes it difficult to distinguish the salient objects by relying on the depth map alone; only ours and the PGAR algorithm show better performance. Meanwhile, we further find that PGAR responds very weakly to objects similar to the background (i.e., the 6th row), which indicates that the algorithm focuses more on the RGB modality and not enough on depth information. Finally, when the object in the depth map is interfered with by a background object near the ground or at close distance (e.g., the 1st and 2nd rows), the results of all comparison methods are not ideal, suffering from false and missed detections. In sharp contrast, our saliency maps possess more complete and more detailed salient objects.

FIGURE 8. Visual comparisons with four latest non-deep learning methods (i.e., DCMC [22], SE [12], CDCP [14] and CDB [16]) and six latest CNN-based methods (i.e., MMCI [26], TANet [31], CPFP [28], D3Net [30], PGAR [38] and DQAS [43]).
We further analyze the experimental results in Fig. 8 at the data fusion level. According to the quality of the input data, there are the following four situations. (1) The depth map is very poor: it does not reflect the structural information of the salient object at all, and the object can only be distinguished from the color image (e.g., the 3rd and 7th rows). (2) Although the depth map is accompanied by distractors, the partial contour structure of the object is very clear compared to the RGB image, and the color information is relatively obvious on the whole (e.g., the 1st and 2nd rows). (3) The quality of the depth map is relatively good, but the object is difficult to distinguish in terms of color appearance (e.g., the 6th and 8th rows). (4) The objects in both the RGB and depth maps are obvious (e.g., the 5th, 9th, and 10th rows). For all four cases, our prediction results are quite satisfactory. This shows that our model is not disturbed by an extremely poor depth map, and that it can extract the structure of the object from a depth map with certain interference when the object is difficult to extract from the RGB image. Moreover, our model can make full use of the structural information from a high-quality depth map, no matter whether the color information of the object is prominent or not. All these phenomena indicate that our model is very flexible and effective.
In conclusion, our algorithm is more robust and adaptable in various complex scenarios. It is more inclined to integrate multi-modal cues adaptively and selectively, rather than simply biasing to a certain data branch.

4) OTHER COMPARISONS
In this part, we further analyze the performance of the proposed model in terms of compatibility and model size.

a: COMPATIBILITY
Most of the deep learning-based RGB-D SOD models adopt ResNet-50 [45] or VGG-16 [46] as the backbone architecture. Therefore, to verify the compatibility of our model, we provide performance comparisons employing different backbones in Table 4. Meanwhile, we use various color marks to better reflect the advantages of using ResNet-50 or VGG-16 as the backbone compared with the other 20 state-of-the-art models in Table 2. Obviously, we can see that our model is the best overall regardless of the backbone type used, which shows that our framework has strong compatibility.

TABLE 4. Performance comparison using different backbones. Red indicates the best. In PGFNet (VGG-16), bold and green indicate the best and second performances compared with the comparison methods in Table 2. ↑ denotes larger is better, and ↓ denotes smaller is better.

FIGURE 9. Comprehensive comparisons between model size and F-measure performance on the NLPR dataset. Our-R and Our-V denote our models based on ResNet-50 and VGG-16, respectively.

b: MODEL SIZE
In Fig. 9, we compare the model sizes of different methods and the corresponding maximum F-measure on the NLPR dataset. Compared with other models, our model based on VGG-16 achieves better accuracy with a smaller model size (30.80M). Our model based on ResNet-50, with an acceptable size (48.98M), significantly improves the detection accuracy. The results illustrate that the designed model realizes satisfactory saliency detection performance with a relatively small number of parameters, achieving a certain balance between lightweight design and accuracy.

IV. CONCLUSION
In this paper, a novel progressive guided fusion network (PGFNet) is proposed for RGB-D salient object detection. PGFNet is based on a symmetrical two-stream encoder-decoder architecture, which is equipped with three highly efficient submodules with a clear division of labor. Specifically, the multi-modal and multi-scale attention fusion model (MMAFM) is designed to obtain optimal fusion features by learning the internal relationships across different scales and modalities. The multi-modal feature refinement mechanism (MFRM) is applied to enhance the unfused shallow-layer features under the guidance of the high-level fusion feature. Moreover, the residual prediction module (RPM) further restrains the background noise according to saliency values. Extensive experimental results on eight datasets prove that our method outperforms most of the state-of-the-art algorithms.
In addition to RGB-D saliency detection, we will consider extending the proposed model to other multi-source data detection or special scenario applications, such as RGB-T salient object detection and underwater computer vision tasks. RGB-T salient object detection [66], [67] integrates RGB and thermal infrared data. Since thermal infrared images are insensitive to illumination, they can provide supplementary information when the salient object suffers from varying light, glare, or shadows. Furthermore, in recent years, saliency detection has been widely used in the field of automated underwater exploration [68]. In view of this, we consider using the dark channel prior [69] instead of depth information as a powerful auxiliary cue to improve the performance of underwater detection. Theoretically, these schemes are feasible and worthy of further research in the future.