Making BBS-Net More Efficient for RGB-D Salient Object Detection

BBS-Net is a novel bifurcated backbone strategy network based on cascaded refinement, which achieves state-of-the-art performance among RGB-D salient object detection models. However, its multi-scale feature extraction and depth feature enhancement are not strong enough, and the generated saliency maps contain considerable background distraction and blurred edges. In this paper, building on the BBS-Net framework, we adopt a more effective feature extraction network and depth feature enhancement module. Specifically, (1) we use the Res2Net-50 structure, which constructs hierarchical residual-like connections within a single residual block, as the feature extraction network; (2) we use the ECA module as channel attention to extract depth information and enhance cross-modal compatibility. Comprehensive experiments on 6 datasets show that our method outperforms the original BBS-Net and 7 current state-of-the-art methods on 4 evaluation metrics.


Introduction
In recent years, salient object detection (SOD) based on RGB-D has received widespread attention and various models have been proposed. BBS-Net [1] achieves state-of-the-art performance among RGB-D SOD models. It is a bifurcated backbone strategy network using a cascaded refinement strategy. The features extracted by ResNet-50 [2] are divided into two parts: high-level features and low-level features. First, a cascaded decoder aggregates the high-level features to generate an initial saliency map. The low-level features are then refined by element-wise multiplication with the initial saliency map. Finally, the refined low-level features are integrated by another cascaded decoder to predict the final saliency map.
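The cascaded refinement flow described above can be sketched schematically in PyTorch. This is an illustrative outline only, not the authors' code; the decoder callables and tensor shapes are hypothetical stand-ins:

```python
import torch

def cascaded_refinement(low_feats, high_feats, decoder1, decoder2):
    """Schematic of BBS-Net's bifurcated strategy (illustrative sketch).

    high_feats -> decoder1 -> initial saliency map S1;
    low_feats are refined by element-wise multiplication with S1;
    refined features -> decoder2 -> final saliency map S2.
    """
    s1 = decoder1(high_feats)               # initial saliency map from high-level features
    refined = [f * s1 for f in low_feats]   # element-wise refinement of low-level features
    s2 = decoder2(refined)                  # final saliency map
    return s1, s2
```

Here `decoder1` and `decoder2` stand in for the two cascaded decoders; in the real network they upsample and fuse features at multiple resolutions.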
Although BBS-Net has achieved encouraging results, the model's performance is not strong enough. As can be seen from the saliency maps in the fourth column of Fig. 1, the maps still suffer from background distraction and blurred edges, and complete salient objects cannot be detected.

Fig. 1 Saliency maps of BBS-Net

We believe there are two major challenges: (1) A suboptimal feature extraction network. Objects may appear at different sizes in depth maps and RGB images, and perceiving information at different scales is important for understanding objects; richer context information helps the network detect objects more accurately. As a classic feature extraction network, ResNet-50 uses repeated 1×1, 3×3, and 1×1 convolutions to extract features. Its multi-scale feature extraction is not strong enough, which makes the feature representation insufficient.
(2) The original depth-enhanced module is not effective enough. Depth maps are much noisier and less textured than RGB images, which may introduce feature noise and redundancy into the network. Through analysis of the saliency maps generated by the original model and the performance of each module, we found that the depth enhancement module is not effective enough, which affects the compatibility of multi-modal features and the extraction of informative depth cues.
In computer vision tasks, obtaining multi-scale representations requires the convolutional filters to cover a large range of receptive fields to describe objects of different scales. Convolutional neural networks (CNNs) gradually learn multi-scale features from coarse to fine through a stack of convolutional filters. Here, we specifically use Res2Net-50 [3] as the backbone network to extract features. It expresses multi-scale features at a granular level by constructing hierarchical residual-like connections within the residual block.
Compared with the original network, we specifically use ECA-Net (i.e., the ECA module) [4] to realize the connection between depth channel features. ECA-Net is a channel attention mechanism that adopts a local cross-channel interaction strategy. It uses an efficient 1D convolution, and the coverage of the interaction can be adaptively selected according to the number of channels.
We have verified the effectiveness of our model via extensive studies and comparisons. In summary, our main contributions are as follows: (1) To extract the depth and RGB features more fully, we use Res2Net-50 as the backbone network for multi-scale feature extraction. This network extracts features at a granular level and increases the receptive field, enhancing the representation of features.
(2) To enhance the depth features more effectively and improve the compatibility of RGB and depth features, we use ECA-Net as a more effective channel attention module. This attention uses 1D convolution to achieve local cross-channel interaction and strengthen the connection between depth channels.
(3) Without bells and whistles, our model outperforms the original BBS-Net and 7 state-of-the-art alternatives on 6 widely used benchmark datasets.

Methods
For the specific improvements to BBS-Net, we first introduce the motivation for adopting the new feature extraction network and the basic principle of the Res2Net module in Section 2.1, and then explain the motivation for and structure of the ECA module as our channel attention in Section 2.2.

2.1.Feature extraction network
We replace the feature extraction network in the original BBS-Net with Res2Net-50 to enhance the model's multi-scale feature extraction capability, expand the receptive field of the feature maps, and capture richer context information, further improving the feature representation ability of the model.
The Res2Net module is a further improvement of the ResNet block that constructs hierarchical residual-like connections within a single residual block. Its structure is shown in Fig. 2.

Fig. 2 Res2Net module

Within a single residual block, the feature map after the 1×1 convolution is evenly divided into s groups along the channel dimension, and each group uses a 3×3 convolution for feature extraction. The output of each 3×3 convolution is added to the original features of the next group before that group's convolution. To reduce the parameters of the module, the first group omits the 3×3 convolution, which can also be regarded as feature reuse. Finally, the feature maps of all groups are concatenated, and information at different scales is fused through a 1×1 convolution. The Res2Net module can be written as:

$$y_i = \begin{cases} x_i, & i = 1 \\ \mathrm{Conv}(x_i), & i = 2 \\ \mathrm{Conv}(x_i + y_{i-1}), & 2 < i \le s \end{cases} \tag{1}$$

where $x_i$ indicates the $i$-th subset of the feature map, Conv is a 3×3 convolution operation, and $y_i$ indicates the output of the corresponding feature subset. It can be seen from the expression that each additional 3×3 convolution increases the receptive field, and the concatenated output is a multi-scale feature map.
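The hierarchical split-and-add scheme described above can be sketched in PyTorch. This is a simplified illustration of the idea, not the official Res2Net implementation (batch normalization and activations are omitted for brevity):

```python
import torch
import torch.nn as nn

class Res2NetBottleneck(nn.Module):
    """Simplified sketch of a Res2Net bottleneck (illustrative, not the official code).

    After a 1x1 convolution, the feature map is split into `scales` groups along
    the channel dimension. The first group is passed through unchanged (feature
    reuse); each later group is convolved with 3x3, and from the third group on,
    the previous group's output is added before the convolution.
    """
    def __init__(self, channels, scales=4):
        super().__init__()
        assert channels % scales == 0
        self.scales = scales
        width = channels // scales
        self.reduce = nn.Conv2d(channels, channels, kernel_size=1)
        self.convs = nn.ModuleList(
            nn.Conv2d(width, width, kernel_size=3, padding=1)
            for _ in range(scales - 1))
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        out = self.reduce(x)
        xs = torch.chunk(out, self.scales, dim=1)   # split into s groups
        ys = [xs[0]]                                # group 1: no convolution
        for i in range(1, self.scales):
            inp = xs[i] if i == 1 else xs[i] + ys[-1]  # hierarchical residual-like add
            ys.append(self.convs[i - 1](inp))
        # concatenate all groups and fuse multi-scale information with 1x1 conv
        return self.fuse(torch.cat(ys, dim=1)) + x  # residual connection
```

Each pass through a 3×3 convolution enlarges the effective receptive field of the later groups, which is what gives the block its multi-scale behavior.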

2.2.More effective channel attention
Depth maps captured even by state-of-the-art sensors are usually of poor quality, so it is very important to use a high-performance attention module to extract useful information from the depth features and improve multi-modal compatibility. Using the ECA module as our new channel attention greatly reduces the number of parameters while strengthening the connection between depth feature channels, enhancing the detection performance of the model.
The ECA module adopts a local cross-channel interaction strategy on the basis of the SE block [5]. The input feature maps first aggregate global features through global average pooling, then the channel weights are adjusted through a convolution operation, and finally the weights are multiplied with the original feature maps. The model structure is shown in Fig. 3. The module only considers the information of the current channel and its k adjacent channels, so it has only k × C parameters; in addition, letting all channels share the same learning parameters further improves model efficiency. The weight of the i-th channel of feature y is computed as:

$$\omega_i = \sigma\left(\sum_{j=1}^{k} \alpha^{j} y_i^{j}\right), \quad y_i^{j} \in \Omega_i^{k} \tag{2}$$

where $\omega_i$ indicates the weight of channel $y_i$ after the weighting operation, $\alpha^{j}$ is the weight applied to $y_i$ or one of its neighbors, and $\Omega_i^{k}$ represents the set of $y_i$ and its k neighbors.
In addition, the ECA module can adaptively select the coverage of the interaction according to the number of channels. The kernel size is calculated as:

$$k = \psi(C) = \left|\frac{\log_2(C)}{\gamma} + \frac{b}{\gamma}\right|_{odd} \tag{3}$$

where k indicates the size of the adaptive convolution kernel, C is the number of channels, $|t|_{odd}$ denotes the nearest odd number to t, and $\gamma$ and $b$ are parameters recommended to be set to 2 and 1, respectively.
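The whole module, including the adaptive kernel-size rule above, can be sketched in a few lines of PyTorch. This is an illustrative sketch following the description above, not the authors' released code:

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Sketch of the ECA module: GAP -> 1D conv across channels -> sigmoid gate.

    The 1D kernel size k is chosen adaptively from the channel count C via
    k = |log2(C)/gamma + b/gamma|_odd with gamma=2, b=1 (nearest odd number).
    """
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 == 1 else t + 1          # force an odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                        # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                   # global average pooling -> (N, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1) # local cross-channel interaction
        w = self.sigmoid(w)                      # channel weights in (0, 1)
        return x * w.unsqueeze(-1).unsqueeze(-1) # re-weight the original features
```

Because the same 1D kernel slides over all channels, the module needs only k learnable parameters, which is why it is so much lighter than the SE block's fully connected layers.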

Experiments

3.1.Experimental Settings
Implementation Details. Our model is implemented in the PyTorch framework and trained on a single NVIDIA 3080 GPU. The parameters of Res2Net-50 are initialized from the model pretrained on ImageNet, and the other parameters use the default initialization of PyTorch. We train the model with a batch size of 8 for 200 epochs using the Adam optimization algorithm; the initial learning rate is set to 1e-4 and is divided by 10 every 60 epochs. The data augmentation methods include random flipping, border cropping, and random rotation.
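The optimizer and learning-rate schedule described above can be expressed as follows. This is a minimal sketch of the training setup, with a placeholder module standing in for the full network:

```python
import torch
from torch import nn, optim

# Placeholder standing in for the full network (illustrative only).
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)

# Adam with an initial learning rate of 1e-4, divided by 10 every 60 epochs.
optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

for epoch in range(200):
    # ... one epoch of training with batch size 8 would go here ...
    scheduler.step()
```

After 200 epochs the learning rate has been decayed three times (at epochs 60, 120, and 180), ending at 1e-7.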

3.2.Ablation Experiments and Analyses
In this section, we explore the impact of each improved module on the benchmark datasets.
New feature extraction network. To prove the effectiveness of Res2Net-50 as the backbone network, we compare the results of using Res2Net-50 as the backbone with the results of the original model. As shown in Tab. 2, the new feature extraction network brings significant improvements.
More effective channel attention. We replaced the original channel attention with the ECA module and retrained the model. The experimental results are shown in Tab. 3. Numerically, the ECA module is more effective than the original channel attention, indicating that the new channel attention enhances the extraction of depth features and helps cross-modal feature compatibility.

Quantitative Results
As shown in Tab. 3, our model outperforms the state-of-the-art methods on most evaluation metrics, including S-measure, maxF, maxE, and MAE.

Tab. 3 Quantitative comparison of models on six datasets (S-measure, maxF, maxE, and MAE) against DF [12], AFNet [13], MMCI [14], PCF [15], TANet [16], CPFP [17], D3Net [11], and BBS-Net [1]

Qualitative Results
In Fig. 4, we show some visualization results. The images cover five challenging situations: simple scenes, complex scenes, small objects, multiple objects, and low contrast. The first row gives a simple example; all methods perform reasonably well. Two small-object examples are displayed in rows 2-3: the other methods show different degrees of background distraction when detecting people. Multiple-object situations are shown in rows 4-5: our method locates all salient objects in both examples and achieves the best segmentation. Although the quality of the depth map in row 5 is relatively poor, our model still detects all objects. Rows 6-7 show complex scenes: compared with other methods, our model does not mistake background distraction for salient objects. Finally, rows 8-9 show low-contrast scenes: although the contrast of the RGB images is very low, our method effectively uses the information in the depth maps to produce reliable results.