Hierarchical Multimodal Adaptive Fusion (HMAF) Network for Prediction of RGB-D Saliency

Visual saliency prediction for RGB-D images is more challenging than that for their RGB counterparts. Additionally, very few investigations have been undertaken concerning RGB-D saliency prediction. This study presents a method based on a hierarchical multimodal adaptive fusion (HMAF) network to facilitate end-to-end prediction of RGB-D saliency. In the proposed method, hierarchical (multilevel) multimodal features are first extracted from an RGB image and depth map using a VGG-16-based two-stream network. Subsequently, the hierarchical features of the RGB image and depth map are adaptively fused at each level using three two-input attention modules. Finally, the fused saliency features of different levels (hierarchical fusion saliency features) are adaptively combined using a three-input attention module to facilitate high-accuracy RGB-D visual saliency prediction. Comparisons of the proposed HMAF-based approach against other state-of-the-art techniques on two challenging RGB-D datasets demonstrate that the proposed method consistently outperforms competing approaches by a considerable margin.

This study focuses on the second task, namely, saliency prediction.
In the last two decades, saliency prediction methods for RGB images have improved significantly, and various models have been proposed [17][18][19]. However, several extant studies [1,8,11] reveal that features extracted from two modalities, depth maps and RGB images, complement each other. RGB images contain discriminative visual-appearance information, whereas depth maps include geometric features concerning objects. In depth maps, the interestingness of objects degrades with increasing distance from the camera; i.e., an object located closer to the camera attracts greater attention. RGB-D saliency prediction has attracted increased attention in recent years [24][25][26][27][28], and extant studies have concentrated on the design of depth-induced RGB-D saliency prediction methods [29][30][31]. In most such methods, the fusion schemes are inadequate for combining the complementary features obtained from an RGB image and depth map. Therefore, substantial room for performance improvement exists. This study was inspired by the above-mentioned observation. To facilitate the appropriate fusion of features obtained from the RGB image and depth map, a hierarchical multimodal adaptive fusion (HMAF) network for RGB-D saliency prediction is proposed. In the proposed method, three two-input attention modules are adopted to exploit the importance weights of the different modalities instead of simply concatenating feature vectors obtained from the two channels. Additionally, a three-input attention module is used to fuse the hierarchical fusion saliency features adaptively, thereby facilitating accurate RGB-D visual saliency prediction.

Figure 1 is a block diagram of the proposed algorithm flow. In this section, we first introduce the pipeline of our HMAF for RGB-D saliency prediction. We use a two-stream network to extract the hierarchical features of the two modalities.
Then, three two-input attention modules adaptively fuse the hierarchical features of the two modalities. Finally, the hierarchical fusion saliency features are adaptively fused by the three-input attention module. Figure 2 shows the proposed HMAF-based approach.

Figure 4: Three-input attention module (attention module 4).

High-level features contain more semantic information, useful for distinguishing salient areas, but very little spatial and local-context information. In contrast, low-level features contain more spatial information (e.g., textures, edges, and contours). Thus, both high- and low-level features can be considered important and complementary to each other regarding the prediction of visual attention, thereby making multilevel feature extraction even more necessary to achieve high-accuracy saliency prediction. In this study, VGG-16 [32], a widely used pretrained network, was adopted as the backbone of the two-stream network for extracting hierarchical features from the RGB and depth modalities. However, the VGG-16 network was modified by eliminating its fully connected layers. To preserve relatively large spatial sizes within the higher layers, the stride of the last max-pooling layer was decreased, thereby facilitating more accurate saliency prediction. Therefore, the stride of the entire network was reduced to 16 for a fixed input RGB (depth) image size of M × N pixels. The spatial size of the final feature map was M/2^4 × N/2^4. In the proposed study, the output hierarchical features from max-pooling layers Pool3, Pool4, and Pool5 were used to obtain hierarchical fusion saliency feature maps for the adaptive multimodal fusion. High-level features can be used to distinguish between different object classes while being less discriminative of objects belonging to the same class.
In contrast, low-level features include more spatial information that helps distinguish objects within the same class, but they are less robust to dramatic changes in appearance. Both high- and low-level features were used in this study to enhance the performance of saliency prediction. Accordingly, in the proposed HMAF-based approach, hierarchical multimodal features were extracted from layers Pool3, Pool4, and Pool5, as shown in Figure 2.
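The stride arithmetic above can be sanity-checked with a short sketch (a minimal pure-Python illustration; the pooling-stride schedule is assumed from the description of the modified VGG-16 backbone, i.e., the fifth max-pooling layer uses stride 1 instead of 2):

```python
# Spatial sizes of the VGG-16 pooling outputs for an M x N input.
# Assumption: strides follow the modified backbone described above,
# with the last max-pooling stride reduced from 2 to 1.
def pool_output_sizes(m, n, strides=(2, 2, 2, 2, 1)):
    sizes = []
    h, w = m, n
    for s in strides:
        h, w = h // s, w // s
        sizes.append((h, w))
    return sizes  # one (height, width) pair per pooling layer

sizes = pool_output_sizes(224, 224)
pool3, pool4, pool5 = sizes[2], sizes[3], sizes[4]
# Pool3 is 1/8 of the input; Pool4 and Pool5 are 1/16 (overall stride 16),
# matching the M/2^4 x N/2^4 final feature-map size stated above.
```

For a 224 × 224 input this gives 28 × 28 for Pool3 and 14 × 14 for Pool4 and Pool5, so the three hierarchical maps differ by at most a factor of two in resolution.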

Fusion of Multimodal Features.
Attention modules assign different weights to the features extracted from the two modalities [33]. The two-input attention modules employed in this study comprise three operators: Transformation, Fuse, and Select, as illustrated in Figure 3 for the two-input case.
Transformation.
Transformation was first performed with a dilated convolution involving a 3 × 3 kernel and a dilation size of 2.
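As an illustration of this operator, the sketch below applies a 3 × 3 convolution with dilation size 2 to a single-channel map in numpy (a toy stand-in for the learned dilated convolution layers; the kernel values here are arbitrary):

```python
import numpy as np

def dilated_conv2d(x, k, dilation=2):
    """3 x 3 cross-correlation with dilated taps and zero padding,
    so the output has the same spatial size as the input."""
    pad = dilation * (k.shape[0] // 2)
    xp = np.pad(x, pad)
    h, w = x.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            # each tap samples the input `dilation` pixels apart
            out += k[i, j] * xp[i * dilation:i * dilation + h,
                                j * dilation:j * dilation + w]
    return out

x = np.arange(25.0).reshape(5, 5)
k = np.full((3, 3), 1.0 / 9.0)   # arbitrary averaging kernel for illustration
y = dilated_conv2d(x, k)         # same 5 x 5 shape, 5 x 5 receptive field
```

With dilation 2, the 3 × 3 kernel covers a 5 × 5 receptive field without adding parameters, which is the usual motivation for dilated transformations.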

Fuse.
The transformation results obtained from the two modal streams were subsequently fused through an element-wise summation:

U = U1 + U2. (1)

The global information obtained was subsequently embedded using global average pooling to obtain channel-wise features S ∈ R^C. Furthermore, a compact feature Z ∈ R^(d×1) was created to facilitate guidance for precise and adaptive selection. This was accomplished with a fully connected layer, with a decrease in dimensionality for better efficiency, where Z = δ(β(WS)); β is batch normalization, δ denotes the rectified linear unit function, and W ∈ R^(d×C).
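A minimal numpy sketch of the Fuse step under the notation above (the weight matrix W, the reduced dimension d, and the normalization statistics are placeholders for the learned layer):

```python
import numpy as np

def fuse(u1, u2, w, eps=1e-5):
    """u1, u2: (C, H, W) feature maps; w: (d, C) FC weights."""
    u = u1 + u2                       # element-wise summation, eq. (1)
    s = u.mean(axis=(1, 2))           # global average pooling -> S in R^C
    z = w @ s                         # fully connected layer, reduces C -> d
    # stand-in for batch normalization (real BN uses batch statistics)
    z = (z - z.mean()) / np.sqrt(z.var() + eps)
    return np.maximum(z, 0.0)         # ReLU -> compact guidance feature Z

rng = np.random.default_rng(0)
u1 = rng.normal(size=(8, 4, 4))       # toy RGB-stream features, C = 8
u2 = rng.normal(size=(8, 4, 4))       # toy depth-stream features
w = rng.normal(size=(4, 8))           # hypothetical reduced dimension d = 4
z = fuse(u1, u2, w)                   # z has shape (d,), all entries >= 0
```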

Select.
Following [33, 34], soft attention across streams was used to select between the streams under the guidance of the feature Z. In particular, a fully connected layer followed by a channel-wise softmax across the two streams was applied to the stream-wise digits to generate a probability distribution over the two streams:

w1,c = e^(Ac Z) / (e^(Ac Z) + e^(Bc Z)), w2,c = e^(Bc Z) / (e^(Ac Z) + e^(Bc Z)). (2)

Here, Ac (Bc) denotes the c-th row of A (B), and w1,c (w2,c) denotes the c-th element of a (b). Additionally, A, B and a, b denote the attention weight matrices and soft attention vectors for U1 and U2, respectively. The final output feature map Y was obtained by redistributing the weights w1,c and w2,c to the features of the RGB and depth modalities, respectively:

Yc = w1,c · U1,c + w2,c · U2,c, where w1,c + w2,c = 1. (3)

It must be noted that the above equations apply only to the two-input case. However, the corresponding formulae for cases involving more inputs can be deduced by simply extending equations (1)-(3), and the detailed structure of the three-input case is shown in Figure 4.
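The Select step can be sketched as follows (a toy numpy version; the attention matrices A and B stand in for learned fully connected weights, and the softmax over the two streams ensures the per-channel weights sum to one):

```python
import numpy as np

def select(u1, u2, z, A, B):
    """Soft selection between two streams, eqs. (2)-(3).
    u1, u2: (C, H, W) transformed features; z: (d,) guidance feature;
    A, B: (C, d) attention weight matrices (learned in practice)."""
    logits1, logits2 = A @ z, B @ z              # one logit per channel
    m = np.maximum(logits1, logits2)             # for numerical stability
    e1, e2 = np.exp(logits1 - m), np.exp(logits2 - m)
    w1 = e1 / (e1 + e2)                          # w1 + w2 == 1 per channel
    w2 = e2 / (e1 + e2)
    # redistribute the channel weights over the two modality streams
    return w1[:, None, None] * u1 + w2[:, None, None] * u2

rng = np.random.default_rng(1)
C, d = 8, 4
u1, u2 = rng.normal(size=(C, 4, 4)), rng.normal(size=(C, 4, 4))
z = rng.normal(size=d)
A, B = rng.normal(size=(C, d)), rng.normal(size=(C, d))
y = select(u1, u2, z, A, B)                      # y.shape == (8, 4, 4)
```

When A and B coincide, every channel weight is 0.5 and the module degenerates to plain averaging, which is why the learned attention matrices carry the modality preference.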

Saliency Prediction.
Upon completion of the multimodal feature fusion stage, the hierarchical fusion saliency features were upsampled to the same size via bilinear interpolation. Subsequently, the final fusion result was obtained using a three-input attention module (by extending equations (1)-(3)). As mentioned in Section 2.2, the attention module can adaptively assign importance weights to the fused saliency features belonging to different levels.
After obtaining the final fusion result, the overall saliency-prediction result was obtained by applying two convolution layers, a final 1 × 1 convolution layer for prediction, and a sigmoid activation function. To recover the same resolution as the input, bilinear interpolation with a factor of four was performed.
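The bilinear interpolation used here can be sketched for a single-channel map as follows (deep learning frameworks provide this as a built-in resizing layer; this align-corners-style version is only illustrative):

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Upsample a 2-D map to (out_h, out_w) by bilinear interpolation,
    sampling on a grid whose corners align with the input corners."""
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)         # fractional source rows
    xs = np.linspace(0, in_w - 1, out_w)         # fractional source cols
    y0 = np.floor(ys).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                      # row interpolation weights
    wx = (xs - x0)[None, :]                      # column interpolation weights
    a = img[np.ix_(y0, x0)]                      # four neighboring samples
    b = img[np.ix_(y0, x1)]
    c = img[np.ix_(y1, x0)]
    d = img[np.ix_(y1, x1)]
    top = a * (1 - wx) + b * wx
    bot = c * (1 - wx) + d * wx
    return top * (1 - wy) + bot * wy

small = np.array([[0.0, 1.0], [2.0, 3.0]])
big = bilinear_resize(small, 3, 3)               # values vary smoothly
```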

Combined Loss Function.
In this study, the mean square error (MSE) was used as a baseline in combination with the linear correlation coefficient (CC) criterion to train the proposed HMAF network. However, the CC criterion was slightly modified to represent dissimilarity without the need for empirical coefficients. The modified loss mimics the behavior of the cross-entropy loss used in visual classification, approaching zero in the absence of any mistakes. The MSE function measures the deviation between the predicted value (P) and the true value (Q) of the model; the smaller its value, the closer the prediction is to the ground truth. It is calculated as

MSE(P, Q) = (1/N) Σ_{n=1}^{N} (P_n − Q_n)^2, (4)

where N is the number of pixels per image, index n ranges from 1 to N, Q denotes the ground truth, and P represents the predicted saliency map.
The CC criterion calculates the linear correlation coefficient between two distributions, and its value lies within the range [−1, 1], with CC = 1 indicating that the two distributions are perfectly correlated. The CC function is defined as

CC(P, Q) = cov(P, Q) / (σ_P σ_Q), (5)

where cov(P, Q) is the covariance of P and Q, and σ_P and σ_Q represent the standard deviations of the inputs. For the efficient application of root mean square propagation (RMSprop), the CC criterion can be simplified to the modified form CC′ = 1 − CC(P, Q), which converts the similarity criterion into a dissimilarity criterion whose value lies within the range [0, 2]. In this study, a simple summation of MSE and CC′ was used as the loss function:

Loss = MSE(P, Q) + CC′(P, Q). (6)

In Section 3.3, we quantitatively justify the rationale behind selecting this loss combination by comparing our results with those obtained using a single evaluation criterion or other two-criterion combinations as the loss function.
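A minimal numpy version of this combined loss (assuming the modified term takes the form CC′ = 1 − CC; the ε guard against a zero denominator is an implementation choice):

```python
import numpy as np

def mse(p, q):
    """Mean square error between predicted and ground-truth maps."""
    return np.mean((p - q) ** 2)

def cc(p, q, eps=1e-8):
    """Linear correlation coefficient in [-1, 1]."""
    pc, qc = p - p.mean(), q - q.mean()
    return (pc * qc).sum() / (np.sqrt((pc ** 2).sum() * (qc ** 2).sum()) + eps)

def combined_loss(p, q):
    """MSE + CC', where CC' = 1 - CC lies in [0, 2]."""
    return mse(p, q) + (1.0 - cc(p, q))

q = np.linspace(0, 1, 16).reshape(4, 4)   # toy ground-truth saliency map
p = q.copy()                              # perfect prediction
# combined_loss(p, q) is ~0 for a perfect prediction, as the text requires.
```

An anti-correlated prediction (e.g., 1 − q) pushes CC′ to its maximum of 2, so the combined loss penalizes both pixel-wise deviation and distributional mismatch.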

Data Processing.
We randomly sampled 70%, 10%, and 20% of all RGB-D images as the training, verification, and test sets, respectively.
Before being input to the network, all images were resized to 224 × 224 pixels. To boost the generalization performance of the network, each RGB image was mean centered and normalized to unit variance using precomputed parameters before it was input to the network.

Training Methods.
The proposed HMAF method was trained by loading the weights of a pretrained VGG-16 network as initial weights. Eight RGB-D images were used during each iteration. The learning rate was initialized at 1 × 10^−4. The model parameters were learned via backpropagation of the loss function described in (6) using RMSprop. An early-stopping method was used to prevent model overfitting. Furthermore, we applied translations of 0-2 pixels on both axes and random horizontal flips to the input RGB-D images to augment the dataset at training time. All experiments were conducted on a workstation equipped with an NVIDIA TITAN V GPU and 12 GB of RAM.
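The augmentation described above can be sketched as follows (a toy numpy version; the same random translation and flip are applied to the RGB image and its depth map so the pair stays registered, and zero padding after the shift is an assumption):

```python
import numpy as np

def augment(rgb, depth, rng):
    """Apply one random translation (0-2 px on each axis, zero padding)
    and one random horizontal flip, identically to both modalities."""
    dy, dx = rng.integers(0, 3, size=2)      # shift of 0, 1, or 2 pixels
    out = []
    for img in (rgb, depth):
        shifted = np.zeros_like(img)
        h, w = img.shape
        shifted[dy:, dx:] = img[:h - dy, :w - dx]
        out.append(shifted)
    if rng.random() < 0.5:                   # random horizontal flip
        out = [img[:, ::-1] for img in out]
    return out

rng = np.random.default_rng(2)
rgb = np.arange(16.0).reshape(4, 4)          # toy single-channel "RGB" map
depth = rgb * 0.5                            # toy aligned depth map
rgb_a, depth_a = augment(rgb, depth, rng)    # same transform on both
```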

Datasets.
To verify the prediction performance of the proposed HMAF network, all saliency prediction methods were applied to the same datasets. Presently, there are very few public RGB-D datasets for visual saliency prediction research. This study evaluates the performance of saliency-prediction methods on two representative datasets, NUS3D [35] and NCTU [36]: (1) NUS3D contains 600 RGB-D images covering several 2D and RGB-D view scenes; it provides depth images, RGB stimuli, and 2D and RGB-D fixation maps; (2) NCTU comprises 475 RGB-D images along with depth maps; this dataset includes various scenes, most of which have been adapted from existing stereo movies and videos.

Evaluation Criteria.
Four commonly used performance criteria, CC [20], Kullback-Leibler divergence (KLDiv) [19], area under the ROC curve (AUC) [18], and normalized scanpath saliency (NSS) [22], were used to evaluate the prediction performance of the competing saliency-prediction methods [18, 19]. Table 1 compares single (KLDiv, CC′, and MSE) and combined (MSE + CC′, MSE + KLDiv, and CC′ + KLDiv) loss functions. As seen from the table, the combined loss functions, on average, achieve better results than the single ones. Our combined loss (MSE + CC′) obtains competitive prediction results on all criteria, unlike the other loss functions. Based on this, all subsequent results were obtained by training the proposed HMAF with this combination.
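Two of these criteria, NSS and KLDiv, can be sketched directly in numpy (standard saliency-benchmark-style definitions; the ε terms guarding division and logarithm are implementation choices):

```python
import numpy as np

def nss(pred, fixations, eps=1e-8):
    """Normalized scanpath saliency: mean of the standardized prediction
    at fixated pixels (fixations is a binary fixation map)."""
    p = (pred - pred.mean()) / (pred.std() + eps)
    return p[fixations > 0].mean()

def kldiv(pred, gt, eps=1e-8):
    """Kullback-Leibler divergence between the ground-truth and predicted
    saliency distributions (both normalized to sum to 1)."""
    p = pred / (pred.sum() + eps)
    q = gt / (gt.sum() + eps)
    return np.sum(q * np.log(q / (p + eps) + eps))

pred = np.array([[0.1, 0.9], [0.2, 0.8]])    # toy predicted saliency map
fix = np.array([[0, 1], [0, 1]])             # toy binary fixation map
# NSS is positive when predicted saliency is high at fixated pixels;
# KLDiv is ~0 when prediction and ground truth coincide.
```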

Effects of Hierarchical Features.
To demonstrate the effects of hierarchical features, the output features from pooling layers Pool3, Pool4, and Pool5 were individually used for visual saliency prediction (denoted as models A, B, and C, respectively).
Thus, hierarchical features can be considered important and complementary to each other, and combining them achieves high-accuracy saliency prediction.

Effect of VGGNet as Model Backbone.
To further verify the effectiveness of the VGGNet backbone, we kept the pipeline unchanged while replacing only the VGGNet with a ResNet block [37] (denoted as model D). The results are listed in Table 2. We can observe that the VGGNet-based approach generally provides better performance than the ResNet-based approach.

Comparison against State-of-the-Art.
To demonstrate the effectiveness of the proposed HMAF-based saliency-prediction method, it was compared against six state-of-the-art approaches: Fang [24], DeepFix [19], DVA [20], SAM [21], EML [22], and SMI [23]. Table 3 lists the results of the quantitative comparison obtained by applying the competing methods to both the aforementioned datasets in terms of CC, KLDiv, AUC, and NSS. The results in Table 3 show that the proposed HMAF-based method clearly outperformed all others by a considerable margin, thereby demonstrating its superior effectiveness, robustness, and generalization capabilities.
A qualitative comparison of the proposed HMAF method against the other six methods is depicted in Figure 5. Clearly, the proposed HMAF yields more accurate prediction results than the other techniques because the saliency features are fused by three two-input attention modules and a three-input attention module. It can be seen from Figure 5 that the proposed method is less distracted by high-contrast edges and complex backgrounds and that it can easily predict bottom-up saliency while dealing well with both global and local contrast. Another important observation is that the proposed method can highlight many top-down factors, such as human faces, people in complex backgrounds, and objects located at long and short distances from the camera. More specifically, although the images in rows 3, 4, 5, and 6 include complex backgrounds and various attention-grabbing objects, our proposed method can still highlight semantic regions preferentially.
Thus, the qualitative and quantitative experimental results both show that HMAF outperforms all other presently available techniques in terms of both robustness and accuracy.
Failure Case Analysis.
Some typical failure cases are shown in Figure 6. The first row indicates that the proposed HMAF method does not perform well for RGB-D images containing people. The second row indicates that when many objects are present, the proposed HMAF method prefers high-contrast objects and ignores small ones. To address this problem, we intend to explore deeper networks [38, 39] to improve the performance of HMAF in the future.

Computational Complexity.
The computational complexity of the proposed HMAF-based approach and the other approaches was estimated from tests on the NCTU dataset. Training the proposed HMAF takes approximately 1 h using an NVIDIA TITAN V GPU and an Intel i5-8500 3.0 GHz CPU. Inference takes approximately 0.01 s per image of size 640 × 480. In conclusion, our approach has low computational complexity and can satisfy the requirements of real-time image processing systems.

Conclusion
In this study, multimodal fusion of 3D data was studied, and a hierarchical multimodal adaptive fusion network based on attention modules was proposed. The proposed network effectively extracts and combines features from different modalities and levels. The two-input attention modules exploit the importance weights associated with the RGB and depth modalities, adaptively fusing the hierarchical features extracted from the two modalities rather than simply concatenating them. In addition, the three-input attention module assigns different weights to the fused saliency features at different levels for RGB-D saliency prediction. Experimental results show that the HMAF-based RGB-D saliency prediction method is superior to all other state-of-the-art methods. The model exhibits superior performance largely because of its attention mechanism design. It has the potential to mimic human visual systems more closely, which we hope to investigate in the future by using this design to develop adaptive convolution kernels that identify targets quickly, allowing significant compression of the model parameters for a variety of tasks. In addition, in the bottom-up process, if a feature refinement module can be designed for feature enhancement, the prediction errors in feature coding can be further repaired through learned prior knowledge.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.