AFI-Net: Attention-Guided Feature Integration Network for RGBD Saliency Detection

This article proposes an innovative RGBD saliency model, the attention-guided feature integration network (AFI-Net), which extracts and fuses features and performs saliency inference. Specifically, the model first extracts multimodal and multilevel deep features. Then, a series of attention modules is applied to the multilevel RGB and depth features, yielding enhanced deep features. Next, the enhanced multimodal deep features are fused hierarchically. Lastly, the RGB and depth boundary features, that is, the low-level spatial details, are added to the integrated feature to perform saliency inference. The key points of AFI-Net are the attention-guided feature enhancement and the boundary-aware saliency inference: the attention module coarsely indicates salient objects, and the boundary information equips the deep features with more spatial details. Therefore, salient objects are well characterized and well highlighted. Comprehensive experiments on five challenging public RGBD datasets clearly demonstrate the superiority and effectiveness of the proposed AFI-Net.

However, early RGBD saliency models mainly rely on handcrafted features, such as contrast computation [5,13], minimum barrier distance computation [22], and the cellular automata model [20]. The performance of these RGBD saliency models degrades considerably when handling complex RGBD scenes with attributes such as small salient objects, heterogeneous objects, cluttered backgrounds, and low contrast. This phenomenon can be attributed to the weak representation ability of handcrafted features. Fortunately, significant progress has been achieved in deep learning in the past few years. In particular, convolutional neural networks (CNNs), which provide high-level semantic cues, have been applied successfully to RGBD saliency detection [24-35], such as the three-stream structure in [34], the fluid pyramid integration in [35], and the complementary fusion in [28].
Though the performance of existing deep learning-based RGBD saliency models is encouraging, they still struggle with complex RGBD scenes. Thus, the performance in the area of RGBD saliency detection can still be improved. In addition, some fusion-based RGBD saliency models [5,14,15,19,27,28,32,33] integrate the two modalities, namely, RGB and depth information, through early fusion, middle fusion, or result fusion. These models often suffer from a cross-modal distribution gap or information loss, leading to performance degradation. Meanwhile, the attention mechanism [36] has been widely adopted in many saliency models [37-39], enhancing saliency detection performance in RGB image scenes. Furthermore, boundary information has been applied in salient object detection [40,41], providing more spatial details for salient objects.
Thus, this work proposes an innovative end-to-end RGBD saliency model, the attention-guided feature integration network (AFI-Net), which extracts and fuses features and performs saliency inference. Specifically, our model first extracts multimodal and multilevel deep features, taking a pair of RGB and depth images as input. Then, an attention module, in which the attention mechanism [36] is adopted to generate an attention map, enhances the multilevel RGB and depth features, yielding enhanced deep features. Next, these enhanced features, originating from different modalities and levels, are fused hierarchically. Lastly, the RGB and depth boundary features, that is, the low-level spatial details, are combined with the integrated feature to perform saliency inference, yielding a high-quality saliency map. Our model focuses on RGBD saliency detection, whereas the existing boundary-aware saliency models [40,41] perform saliency detection in RGB images.
More importantly, the key advantages of AFI-Net are the attention module, which coarsely indicates the salient objects, and the boundary information, which provides more spatial details for the features. Thus, salient objects in RGBD scenes can be characterized accurately. The main contributions of AFI-Net are as follows: (1) We propose AFI-Net to highlight the salient objects in RGBD images. AFI-Net has three components: feature extraction, feature fusion, and saliency inference.
(2) To fully utilize deep features from different modalities and levels, the attention module is employed to enhance the deep features and guide the hierarchical feature fusion. Furthermore, spatial details are embedded in the saliency inference step to obtain accurate boundaries. (3) We perform exhaustive experiments on five public RGBD datasets, where our model achieves state-of-the-art performance. The experiments also validate the effectiveness of the proposed AFI-Net.
In [14], color contrast, depth contrast, and spatial bias are combined to generate saliency maps. In [5], luminance-, color-, depth-, and texture-based features are used to produce contrast maps, which are combined into the final saliency map via weighted summation. In [15], feature maps computed from region grouping, contrast, location, and scale are combined to conduct RGBD salient object detection. In [19], compactness saliency maps computed from color and depth information are aggregated into a saliency map via weighted summation. In [20], color- and depth-based saliency maps are integrated and refined via linear summation and cellular automata. In [24], various feature vectors, such as local contrast, global contrast, and background prior, are generated and fused to infer the saliency value of each superpixel.
With the wide deployment of CNNs, the performance of RGBD saliency models has advanced significantly. In [25], depth features are combined with appearance features using fully connected layers, generating high-performance saliency maps. In [27,28], two-stream architectures with a fusion network are proposed to exploit the complementarity of RGB and depth cues. In [31], two networks, namely, a master network and a subnetwork, are used to obtain deep RGB and depth features, respectively. In [29], RGBD salient object detection is performed using a recurrent CNN. In [35], multilevel features are fused via a fluid pyramid network to detect salient objects. In [33], two-stream networks interact to further explore the complementarity of multimodal deep features. In [32], a fusion module is employed to fuse the RGB- and depth-based saliency results.

Methodology
First, the overall architecture of the proposed AFI-Net is introduced in Section 3.1. Then, the feature extraction is presented in Section 3.2. Subsequently, feature fusion and saliency inference are described in Section 3.3. Finally, some implementation details are given in Section 3.4.

Overall Architecture.

Figure 1 summarizes our RGBD saliency model, AFI-Net, which includes a two-stream encoder (i.e., feature extraction), a single-branch decoder (i.e., feature fusion), and a saliency inference module. Specifically, the entire network is built on VGG-16 [46] with an end-to-end structure. RGB image I and depth map D are used as the input to AFI-Net, where the initial depth map is encoded into an HHA map D using [47]. Then, RGB image I and depth map D are fed into the two-stream network, from which we obtain the multilevel initial deep RGB features {AF_i}_{i=1}^5 and deep depth features {DF_i}_{i=1}^5, which are then enhanced by the attention modules. Next, the fusion branch integrates the enhanced RGB and depth features hierarchically, yielding the integrated deep features {IF_i}_{i=1}^5. Finally, the saliency inference module produces a saliency map S by aggregating the boundary information, that is, the low-level spatial details. In the following sections, we elaborate on each component of AFI-Net.

Feature Extraction.
The feature extraction branch, namely, the encoder, is a two-stream network containing RGB and depth branches constructed on VGG-16 [46]. Specifically, each of the RGB and depth branches has five convolutional blocks with 13 convolutional layers (kernel size = 3 × 3, stride = 1) and 4 max pooling layers (pooling size = 2 × 2, stride = 2). Here, considering the inherent difference between I and D, the RGB and depth branches share the same structure but have different weights. Following this pipeline, we obtain the initial multimodal and multilevel features, including the deep RGB features {AF_i}_{i=1}^5 and the deep depth features {DF_i}_{i=1}^5, as shown in Figure 1.
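As a quick illustrative check (not the authors' code), the spatial resolution at each of the five convolutional blocks can be traced for a 288 × 288 input, the test-time image size reported in the implementation details: the 3 × 3, stride-1 convolutions preserve resolution (assuming the usual padding of 1), while each 2 × 2, stride-2 max pooling halves it.

```python
def vgg16_block_resolutions(input_size=288, num_blocks=5):
    """Trace the per-block output resolution of a VGG-16-style encoder.

    3x3 convolutions with stride 1 (padding 1) keep the spatial size;
    each of the 4 max-pooling layers (2x2, stride 2) halves it between blocks.
    """
    sizes = [input_size]
    for _ in range(num_blocks - 1):  # 4 pooling layers between 5 blocks
        sizes.append(sizes[-1] // 2)
    return sizes

print(vgg16_block_resolutions())  # → [288, 144, 72, 36, 18]
```

This is why the fusion branch needs an upsampling layer at each step: the five feature levels live at five different resolutions.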
On the basis of the multimodal features {AF_i}_{i=1}^5 and {DF_i}_{i=1}^5, we first deploy the attention module, as shown in Figure 2, to further enhance the initial deep features. Formally, we denote each initial deep RGB feature AF_i or deep depth feature DF_i as F_i for convenience. According to Figure 2(b), the attention feature AF_i is formulated as follows:

AF_i = Conv(F_i),

where Conv denotes a convolutional layer. Then, we compute the attention weight af_i(w, h) at each spatial location using softmax, as shown in Figure 2(b):

af_i(w, h) = exp(AF_i(w, h)) / Σ_{(w', h')} exp(AF_i(w', h')),

where (w, h) denotes the spatial coordinates of the attention feature AF_i, whose width and height are denoted as W and H. After obtaining the attention map af_i, the initial deep feature F_i is selected, that is, reweighted:

F_i* = F_i ⊗ af_i,

where ⊗ denotes element-wise multiplication broadcast along the channel dimension. Applying this module to both branches yields the enhanced RGB features {AF_i*}_{i=1}^5 and the enhanced depth features {DF_i*}_{i=1}^5. Then, the fusion branch is deployed to fuse the multimodal and multilevel deep features hierarchically, as shown in Figure 1. Specifically, the hierarchical integration is performed as follows:

IF_5 = H([AF_5*, DF_5*]),
IF_i = H([AF_i*, DF_i*, IF_{i+1}]), i = 4, 3, 2, 1,

where H denotes the fusion operation containing one convolutional layer and one upsampling layer, [·] denotes channel-wise concatenation, and IF_i is the i-th integrated deep feature. Following this procedure, we obtain the first integrated deep feature IF_1. On the basis of IF_1, we aggregate it with the low-level spatial detail features, that is, the boundary information, to obtain a saliency map. Specifically, as shown in Figure 1, the boundary features B_A and B_D are obtained from the bottom layer conv1-2 of the RGB and depth branches, respectively, by using a 1 × 1 convolutional layer, that is, the boundary module (the BI box marked in yellow). Subsequently, saliency prediction is performed using two 3 × 3 convolutional layers and one softmax layer. Thus, the saliency inference is written as follows:

S = Fun([IF_1, B_A, B_D]),

where S represents the RGBD saliency map, [·] denotes the channel-wise concatenation operation, and Fun refers to the two convolutional layers and the softmax layer.
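The attention computation above can be sketched as follows. This is a minimal NumPy illustration, not the authors' Caffe implementation: the learned convolution Conv(F_i) is replaced by a fixed random 1 × 1 projection, and reweighting the feature F_i by the spatial-softmax map af_i is our reading of the "selection" step.

```python
import numpy as np

def spatial_attention(feature, seed=0):
    """Sketch of the attention module: project a (C, W, H) feature to a
    single-channel response, softmax it over all spatial locations to get
    the attention map af_i, and reweight the input feature with it."""
    rng = np.random.default_rng(seed)
    c, w, h = feature.shape
    # Stand-in for the learned convolution Conv(F_i): a random 1x1 projection.
    proj = rng.standard_normal(c)
    response = np.tensordot(proj, feature, axes=1)  # shape (W, H)
    # Softmax over all spatial positions yields the attention map af_i.
    exp = np.exp(response - response.max())
    att = exp / exp.sum()
    # "Select" (reweight) the initial feature with the attention map.
    return feature * att[None, :, :], att

feat = np.random.default_rng(1).standard_normal((8, 4, 4))
enhanced, att = spatial_attention(feat)
print(enhanced.shape, round(att.sum(), 6))  # → (8, 4, 4) 1.0
```

Because the softmax normalizes over all W × H positions, the attention map sums to one, so it acts as a spatial probability mask that suppresses background locations across every channel.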

Implementation Details.

AFI-Net includes feature extraction, feature fusion, and saliency inference. Concretely, D_train = {(I_n, D_n, GT_n)}_{n=1}^N is the training dataset, where I_n = {I_n^j}_{j=1}^{N_p}, D_n = {D_n^j}_{j=1}^{N_p}, and GT_n = {GT_n^j}_{j=1}^{N_p} refer to the RGB image, the depth map, and the ground truth with N_p pixels, respectively. In the following, the subscript n is dropped, and {I, D} denotes each RGB image and depth map pair. Thus, the total loss can be written as follows:

L(W, b) = − Σ_{j ∈ GT+} log P(GT_j = 1 | I, D; W, b) − β Σ_{j ∈ GT−} log P(GT_j = 0 | I, D; W, b),

where the kernel weights and biases of the convolutional layers are denoted as W and b, respectively; GT+ and GT− denote the salient object and background pixels, respectively; and β is the ratio of salient object pixels in GT, that is, β = |GT+|/|GT−|. Furthermore, P(GT_j = 1 | I, D; W, b) is the predicted saliency value of each pixel.
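Under our reading of the loss above, a minimal NumPy sketch looks as follows (not the authors' Caffe implementation; placing the weight β on the background term is an assumption, chosen so that β = |GT+|/|GT−| balances the two classes).

```python
import numpy as np

def balanced_bce(pred, gt, eps=1e-7):
    """Class-balanced cross-entropy sketch for saliency training.

    pred: predicted saliency probabilities in [0, 1]
    gt:   binary ground truth (1 = salient, 0 = background)
    The background term is scaled by beta = |GT+| / |GT-| so that salient
    and background pixels contribute comparably regardless of object size.
    """
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    pos, neg = gt == 1, gt == 0
    beta = pos.sum() / max(neg.sum(), 1)
    loss_pos = -np.log(pred[pos]).sum()
    loss_neg = -beta * np.log(1 - pred[neg]).sum()
    return (loss_pos + loss_neg) / max(pos.sum(), 1)  # normalized for readability

gt = np.zeros((4, 4)); gt[1:3, 1:3] = 1
good = gt * 0.9 + (1 - gt) * 0.1   # confident, mostly correct prediction
bad = gt * 0.1 + (1 - gt) * 0.9    # confidently wrong prediction
print(balanced_bce(good, gt) < balanced_bce(bad, gt))  # → True
```

Without the β weighting, small salient objects would contribute almost nothing to the loss compared with the large background region.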
AFI-Net is implemented using the Caffe toolbox [48]. During the training phase, the SGD hyperparameters, namely, momentum, base learning rate, minibatch size, and weight decay, are set to 0.9, 10^-8, 32, and 0.0001, respectively. The total number of iterations is 25,000, and the learning rate is divided by 10 every 12,500 iterations. The VGG-16 model is used to initialize the weights of the RGB and depth branches, and the fusion branch is initialized using the "msra" method [49]. In addition, the training data used by CPFP [35] are also employed to train our model; they contain 1400 pairs of RGB and depth images from NJU2K [16] and 650 pairs from NLPR [15]. Augmentation operations are also adopted, including rotation by 90°, 180°, and 270° and mirroring, so the final number of training samples is 10,250. After training, the final model occupies 131.2 MB. During the test phase, the average processing time per 288 × 288 image is 0.2512 s.
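The reported sample count is consistent with these operations: each of the 1400 + 650 = 2050 base pairs yields the original plus three rotated copies and one mirrored copy (a sketch of the arithmetic; exactly how mirroring combines with rotation is our assumption, chosen to match the reported total).

```python
base_pairs = 1400 + 650          # NJU2K + NLPR training pairs
versions_per_pair = 1 + 3 + 1    # original + rotations (90, 180, 270 degrees) + mirror
total = base_pairs * versions_per_pair
print(total)  # → 10250
```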

Experiments
The public RGBD datasets and the comprehensive evaluation metrics are described in Section 4.1. In Section 4.2, exhaustive quantitative and qualitative comparisons are performed. Lastly, the ablation analysis is presented in Section 4.3.

Experimental Setup.
To validate the proposed AFI-Net, we perform comprehensive experiments on five challenging RGBD datasets, namely, NJU2K [16], NLPR [15], STEREO [13], LFSD [50], and DES [14]. NJU2K includes 2003 samples collected from the Internet, daily life, and so on; 1400 samples are employed for training and 485 for testing (denoted "NJU2K-TE"). NLPR, constructed with a Microsoft Kinect, consists of 1000 samples, some of which contain more than one salient object; 650 samples are selected for training and 300 for testing (denoted "NLPR-TE"). STEREO has 1000 samples, all used for testing. LFSD and DES consist of 100 and 135 samples, respectively, all used for testing. All datasets provide pixelwise annotations. To compare the RGBD saliency models quantitatively, the max F-measure (max F), S-measure (S) [51], mean absolute error (MAE), and max E-measure (max E) [52] are utilized in this paper.
The S-measure simultaneously considers the region-aware value S_r and the object-aware value S_o, measuring the structural similarity between the ground truth and the saliency map. Referring to [51], it is formulated as follows:

Computational Intelligence and Neuroscience
S = α · S_o + (1 − α) · S_r,

where α is a balance parameter (set to 0.5 here). The F-measure is the weighted harmonic mean of precision and recall and is formulated as follows:

F = ((1 + β²) · Precision · Recall) / (β² · Precision + Recall),

where β² is set to 0.3. The max F-measure is obtained by sweeping binarization thresholds over [0, 255]. The E-measure is the enhanced-alignment measure, which considers local details and global information jointly. Referring to [52], the E-measure can be written as follows:

E = (1 / (W · H)) Σ_{w=1}^{W} Σ_{h=1}^{H} f(ξ(w, h)),

where f(·) denotes a convex function, ξ is the alignment matrix computed via the Hadamard product (∘) of the bias matrices of GT and S, and W and H denote the width and height of the saliency map. MAE measures the average pixelwise difference between the ground truth GT and the saliency map S:

MAE = (1 / (W · H)) Σ_{w=1}^{W} Σ_{h=1}^{H} |S(w, h) − GT(w, h)|,

where the obtained saliency maps are scaled to [0, 1].
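These metric definitions can be sketched in NumPy as follows (illustrative code, not an official evaluation toolkit; the E-measure is omitted because its bias matrices are not spelled out here).

```python
import numpy as np

def s_measure(s_object, s_region, alpha=0.5):
    """S-measure: convex combination of object-aware and region-aware terms."""
    return alpha * s_object + (1 - alpha) * s_region

def max_f_measure(saliency, gt, beta2=0.3):
    """Max F-measure: sweep binarization thresholds over [0, 255] and keep
    the best weighted harmonic mean of precision and recall."""
    best = 0.0
    for t in range(256):
        binary = saliency >= t
        tp = np.logical_and(binary, gt).sum()
        if tp == 0:
            continue
        precision = tp / binary.sum()
        recall = tp / gt.sum()
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall)
        best = max(best, f)
    return best

def mae(saliency, gt):
    """Mean absolute error between a [0, 255] saliency map (rescaled to
    [0, 1]) and a binary ground truth."""
    return np.abs(saliency / 255.0 - gt).mean()

gt = np.zeros((8, 8), dtype=bool); gt[2:6, 2:6] = True
perfect = gt.astype(float) * 255.0          # ideal saliency map
print(max_f_measure(perfect, gt), mae(perfect, gt))  # → 1.0 0.0
```

A perfect prediction reaches max F = 1 and MAE = 0, which is why higher max F/S/E and lower MAE indicate better saliency maps in Tables 1 and 2.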

Comparison with the State-of-the-Art Models.
A comparison is first made on NJU2K-TE, NLPR-TE, STEREO, LFSD, and DES between AFI-Net and nine state-of-the-art RGBD saliency models, namely, CDCP [23], ACSD [16], LBE [18], DCMC [19], SE [20], MDSF [21], DF [24], AFNet [32], and CTMF [27]. The first six are traditional heuristic RGBD saliency models, and the last three are CNN-based models. Here, the saliency maps of the other models are provided by the authors or obtained through the released source code. Next, the quantitative and qualitative comparisons are presented. Specifically, Table 1 shows the quantitative comparison results on the five RGBD datasets: AFI-Net outperforms the nine state-of-the-art RGBD saliency models in terms of all evaluation metrics. Figure 3 presents qualitative comparisons on some complex scenes, where AFI-Net achieves superior performance over the nine state-of-the-art models. Specifically, the first example presents a box on the ground, where the box is indistinct in the depth map. The other models, shown in Figures 3(e)-3(m), falsely highlight some background regions and cannot pop out the box completely. In the second example, the vase is a heterogeneous object, and its bottom is unclear in the depth map. Our model (Figure 3(d)) performs better than the other models, though the top part is not popped out completely. Like the first example, the third and fourth examples present not only an unclear depth map but also a cluttered background. Nevertheless, our model still highlights the bird and the cow completely and accurately. The fifth and sixth examples contain multiple salient objects, on which AFI-Net exhibits the best performance, as shown in Figure 3(d). In the seventh example, the man lies at the image boundary, and the corresponding depth is also unclear. Under this condition, our model still performs better than the others, though some background regions are also highlighted mistakenly.
For the 8th and 11th rows, the salient objects occupy most of the image regions. Our model and AFNet achieve comparable performance, as shown in Figures 3(d) and 3(m). The 9th and 10th rows also show cluttered backgrounds; AFI-Net still exhibits the best performance, as shown in Figure 3(d).
Generally, the extensive comparison between AFI-Net and the nine state-of-the-art models demonstrates the effectiveness of the proposed AFI-Net.

Ablation Studies.
Here, an intensive study of the key components of AFI-Net is presented quantitatively and qualitatively. Specifically, the crucial components of AFI-Net are the attention module (AM) and the boundary module (BI), as shown in Figure 1. Therefore, we design two variants of our model, namely, AFI-Net without the attention module (denoted "w/oA") and AFI-Net without the boundary module (denoted "w/oB"), and perform comprehensive comparisons between our model and the two variants.
First, the quantitative comparison results are presented in Table 2. Clearly, AFI-Net consistently outperforms the two variants, "w/oA" and "w/oB," on the two RGBD datasets. Second, the qualitative comparison results are presented in Figure 4. AFI-Net (Figure 4(d)) performs better than the two variants (Figures 4(e) and 4(f)). The results of AFI-Net have well-defined boundaries and highlight the salient objects completely. In contrast, the two variants falsely highlight some background regions and cannot detect the salient objects completely.
Overall, the attention and boundary modules play an important role in AFI-Net, enhancing the deep features and equipping them with more spatial details.

Failure Case Analysis.
In the experiments above, we demonstrate the effectiveness and rationality of the proposed AFI-Net; in particular, Figure 3 shows the qualitative comparison between AFI-Net and the state-of-the-art saliency models. However, for some challenging images, our model cannot detect salient objects well, as shown in Figure 5(d). Specifically, in Figure 5, the first example shows a traffic sign, and all models fail to pop out the salient object. The second example is a pendant, which is highlighted incompletely by most of the models. In the third example, which presents a pavilion, all models falsely pop out background regions. In the last two examples, the car and the potted plant cannot be detected accurately and completely. Although our model fails to fully highlight the salient objects in these examples, it still pops out the main parts of the salient objects (Figure 5(d)) better than the other models (Figures 5(e)-5(m)), because it contains an effective attention module that covers the main parts of the salient objects. Generally, RGBD saliency detection still faces many difficulties, and research on complex scene images deserves further attention.

Conclusion
This work proposes an innovative RGBD saliency model, AFI-Net, which performs feature extraction, feature fusion, and saliency inference. Specifically, the initial multimodal and multilevel features are first promoted by a series of attention modules, which select the initial deep features and coarsely indicate the locations of salient objects. Then, a hierarchical fusion branch is adopted to fuse the enhanced deep features, which are further combined with the low-level spatial detail features (i.e., the boundary information) to perform saliency inference. Thus, the generated saliency maps highlight salient objects and preserve sharp boundaries. Experimental results on five public RGBD datasets indicate that the proposed AFI-Net achieves superior performance over nine state-of-the-art models.

Data Availability
Previously reported data were used to support this study and are available at https://doi.org/ 10

Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this article.