Article

Robust and Refined Salient Object Detection Based on Diffusion Model

School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(24), 4962; https://doi.org/10.3390/electronics12244962
Submission received: 8 November 2023 / Revised: 30 November 2023 / Accepted: 9 December 2023 / Published: 11 December 2023
(This article belongs to the Special Issue AI Security and Safety)

Abstract

Salient object detection (SOD) networks are vulnerable to adversarial attacks. Because adversarial training is computationally expensive for SOD, existing defense methods instead adopt a noise-against-noise strategy that disrupts the adversarial perturbation and restores the image in either input or feature space. However, their limited learning capacity and their need for network modifications restrict their applicability. Recently, diffusion models, whose generation process aligns with this defense idea, have exhibited excellent purification performance, but an accuracy gap remains between saliency results generated from purified images and those from benign images. In this paper, we propose a Robust and Refined (RoRe) SOD defense framework based on the diffusion model that simultaneously achieves adversarial robustness and improved accuracy for benign and purified images. The proposed RoRe defense consists of three modules: purification, adversarial detection, and refinement. The purification module leverages the powerful generation capability of the diffusion model to purify perturbed input images and thereby achieve robustness. The adversarial detection module uses the guidance classifier of the diffusion model for multi-step voting classification; combined with a similarity condition, it achieves precise adversarial detection, making it possible for benign images to retain their original accuracy. The refinement module uses a simple and effective UNet to enhance the accuracy of purified images. Experiments demonstrate that RoRe achieves superior robustness over state-of-the-art methods while maintaining high accuracy for benign images, and shows good results against backward pass differentiable approximation (BPDA) attacks.

1. Introduction

Salient object detection (SOD) is a task focusing on segmenting the most prominent objects in an image as perceived by humans. Over recent years, SOD methods based on deep learning have demonstrated excellent performance [1,2,3]. However, just like label-level classification networks, deep SOD networks are susceptible to adversarial attacks [4,5]: imperceptible perturbations added to the input can severely degrade the predicted saliency maps.
Among the available defense strategies, adversarial training [6] stands out as the most universal and effective method: adversarial examples are used as training data. However, for SOD tasks, which involve pixel-level classification, the training overhead becomes significant due to the complexity of the decoder component. In addition, the robustness gained does not always generalize to unforeseen attacks. Consequently, existing SOD defenses [4,5] adopt a ‘noise-against-noise’ strategy that disrupts adversarial perturbation through randomness. Robust saliency (ROSA) [5] disrupts the pixels within superpixels of the input image before feeding it into the target network, followed by a conditional random field (CRF) [7] to reverse the disruption operation. Learnable noise (LeNo) [4], conversely, integrates the noise-adding operation at the network’s initial layer and attempts to learn and subtract the noise at the final layer for recovery. However, the former approach lacks the flexibility to handle different types of noise due to its limited learning capacity, while the latter requires network modification and fine-tuning, which can compromise the network’s performance. Despite these limitations, both methods demonstrate the effectiveness of the ‘noise-against-noise’ defense strategy. This raises a question: is there a defense method that requires no network modifications yet possesses strong denoising capabilities? We believe the diffusion model is highly suitable for this purpose.
In recent years, many works have found that diffusion models [8,9], as purifiers, can effectively defend against adversarial attacks, including those on images [10,11], audio [12], and 3D point clouds [13]. Purification based on diffusion models involves adding Gaussian noise to the input data and then iteratively denoising the data back to benign inputs. However, we have observed that although the purified images exhibit improved robustness, their accuracy tends to decrease. Thus, it is necessary to refine the results after purification.
In this paper, we focus on two points: purifying images to achieve robustness, and refining the purified results to regain accuracy. Thus, we propose the Robust and Refined (RoRe) method, an SOD defense framework based on the diffusion model. We present the defense results of RoRe in Figure 1. RoRe consists of three modules: purification, adversarial detection, and refinement. (1) Purification. We conduct a forward diffusion process and a backward denoising process based on the diffusion model to purify the input images and achieve robustness. (2) Adversarial detection. This consists of a multi-step voting adversarial classifier and a similarity condition. The former trains a noisy classifier introduced from the guided diffusion model [14] and predicts labels over multiple steps, thus incorporating the noise-against-noise strategy into the adversarial detection. The latter compares saliency results before and after purification to classify adversarial examples, which helps when the former makes a false judgment. The combination of these two parts enables adversarial detection, allowing benign images to directly enter the target network and maintain their original performance, while adversarial examples are purified and refined. (3) Refinement. We employ a simple yet effective encoder–decoder network to refine the saliency results of purified images. The input and output of the network form a residual structure. This design allows the network to focus on repairing the necessary parts, thereby avoiding overwriting the target network’s performance.
Our contributions are threefold:
  • We propose the plug-and-play Robust and Refined (RoRe) SOD defense framework, which achieves robustness and further improves accuracy through refinement. The experimental results show that RoRe achieves better robustness than the state-of-the-art (SOTA) methods under projected gradient descent (PGD) and backward pass differentiable approximation (BPDA) attacks.
  • We extend the noise-against-noise strategy to adversarial detection by using the guidance classifier in the diffusion model as a multi-step voting adversarial classifier. This improves the robustness of the classifier and provides the possibility for benign images to regain their original accuracy.
  • We provide a refinement network to solve the problem of decreasing accuracy caused by the purification operation. This network employs an encoder–decoder architecture with a residual structure, allowing it to repair the necessary parts.

2. Related Work

2.1. Salient Object Detection

Salient object detection networks aim to identify and localize the most prominent parts of an image. The most commonly used SOD models are based on encoder–decoder structures. These models typically consist of a deep convolutional neural network (CNN) for feature extraction and a series of up-sampling layers that produce a saliency map of the same size as the input image. Zhao et al. [15] propose a gated network (GateNet) that uses multilevel gate units to optimally transmit context information from the encoder to the decoder and a gated dual-branch structure to improve the discriminability of the whole network. Liu et al. [16] propose a pixel-wise contextual attention network (PiCANet) that generates an attention map over the context region of each pixel to selectively attend to informative context locations. Ma et al. [1] propose a pyramidal feature-shrinking network (PFSNet), which aggregates adjacent feature nodes in pairs with layer-by-layer shrinkage to fuse effective details and semantics and discard interference information. Qin et al. [3] propose a boundary-aware segmentation network (BASNet) for highly accurate image segmentation with a predict–refine architecture and a hybrid loss. While these networks are highly effective, they are vulnerable to adversarial attacks. One such attack is projected gradient descent (PGD) [6], an iterative gradient-based method that perturbs an input image by taking steps in the direction of the gradient of the loss function. When the defense involves purification, there is an attack tailored to such defenses known as backward pass differentiable approximation (BPDA) [17], which approximates non-differentiable or hard-to-differentiate transformations during the backward pass, enabling effective attacks against such defenses. These attacks provide a powerful tool for evaluating the robustness of defense methods.
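To make the attack setting concrete, the following is a minimal PyTorch sketch of an $l_\infty$ PGD loop against a saliency network. The attack parameters match those used later in Section 5 (budget 20/255, 10 steps, step size 0.04); `model` is assumed to output a saliency map in [0, 1] of the same shape as `target`, and all names are illustrative rather than taken from any released code.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, target, eps=20/255, alpha=0.04, steps=10):
    # l_inf PGD: step along the sign of the loss gradient, then project the
    # perturbed image back into the eps-ball around the clean input.
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.binary_cross_entropy(model(x_adv), target)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```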

2.2. Adversarial Purification

Adversarial purification aims to transform input images in order to mitigate adversarial noise. Existing defenses for SOD networks generally employ a noise-against-noise strategy, which uses randomness to counteract adversarial noise. ROSA [5] introduces a segment-wise shielding component and a context-aware restoration component before and after the target network, respectively: the former disrupts adversarial noise by shuffling pixels within superpixel blocks, while the latter attempts to restore the image using a CRF [7]. LeNo [4] observes that the noise introduction and denoising operations in ROSA are non-learnable and therefore proposes learnable noise-adding and denoising modules, added to the first layer of the encoder and the last layer of the decoder, respectively. Both methodologies underscore the effectiveness of the noise-against-noise strategy. On the other hand, generative models can also be used to purify the input. Initially, techniques such as generative adversarial networks (GANs) [18] or autoregressive generative models [19] were employed to purify images. Recently, diffusion models have showcased their powerful generative capabilities [8,14,20]. Their generation process, which gradually denoises Gaussian noise, aligns well with the noise-against-noise defense strategy, and recent works have demonstrated superior noise purification abilities in various tasks, including image classification [10], audio classification [12], and 3D point cloud classification [13]. Defense methods based on diffusion models add Gaussian noise up to a specific step to the adversarial examples and then sample from this step with a pre-trained diffusion model. This paper demonstrates the adaptability of this approach to pixel-level classification tasks such as SOD, and highlights a tool within diffusion models, namely the guidance classifier, which has not been emphasized in existing works.

2.3. Saliency Refinement

Saliency refinement aims to further enhance the accuracy of coarse saliency results, including sharpening edges, removing background noise, and preserving the integrity of foreground objects. We categorize these methods into two groups: non-neural-network and neural-network methods. (1) Non-neural-network methods design heuristic algorithms to refine saliency results without relying on neural networks. Xu et al. [21] conducted a statistical multiscale local analysis. Wang et al. [22] manipulated feature contrast between the foreground and background. Spatial structure has been exploited to preserve local structural integrity [23] or remove background noise [24,25]. (2) Neural-network methods train a network model to achieve refinement. Xu et al. [26,27] cascaded modified CRFs to handle multiscale saliency results. In terms of feature learning, image features [28,29], color and texture features [30], complementary and discriminative features [31], as well as motion energy and appearance features [32] have been learned to improve coarse results. In terms of network architecture, the local superpixel-based CNN [33], a fully convolutional network augmented with segmentation hypotheses [34], and a residual refinement block functioning as an internal module within the network [35] are all viable approaches. How the coarse-to-fine process works depends on how the coarse results are produced. In this paper, ‘coarse’ refers to the prediction results of the purified image. Because the purified image is only slightly distorted compared with the original, its prediction results are already close to those of the benign image. We therefore adopt a residual structure in an encoder–decoder network and train it for refinement, which is a straightforward and effective method.

3. Background of DDPM

In this paper, we use only the most basic diffusion model for purification, namely the denoising diffusion probabilistic model (DDPM) [8]. The principle of the diffusion model is to first add noise to the image over multiple steps until it becomes approximately standard Gaussian noise, and then train a denoiser to iteratively remove the noise and restore the original image. This denoiser learns the mean of the noise distribution that is removed at each step.

3.1. Forward Diffusion Process

The forward diffusion process involves adding Gaussian noise step by step. Here, $x_t$ represents the image obtained at step $t$, while $x_0$ denotes the natural, noise-free image. We specify that the forward process is a Markov process and define the distribution at step $t$ as a Gaussian distribution relative to step $t-1$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big),$$

where $\beta_t$ represents the variance at step $t$. As $t$ increases, $\beta_t$ gradually grows until the maximum step $T$ is reached. As long as $T$ is large enough, $x_T$ approximates a standard Gaussian distribution; in [8], $T$ is set to 1000. The equation above only yields $x_t$ iteratively. However, ref. [8] uses the reparameterization trick to derive a closed-form expression for $x_t$ in terms of $x_0$. Letting $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, we obtain

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\, I\big),$$

whose closed-form sample is

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, z_t, \qquad z_t \sim \mathcal{N}(0, I).$$
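As a concrete illustration, the closed-form forward step can be implemented in a few lines of PyTorch. The linear $\beta$ schedule below follows the DDPM defaults [8]; the tensor names are our own.

```python
import torch

# Linear beta schedule from DDPM: beta_t from 1e-4 to 0.02 over T = 1000 steps.
T = 1000
beta = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - beta, dim=0)   # cumulative product of alpha_t

def q_sample(x0, t):
    # Closed-form forward diffusion:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * z_t.
    z = torch.randn_like(x0)
    return alpha_bar[t].sqrt() * x0 + (1.0 - alpha_bar[t]).sqrt() * z
```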

3.2. Backward Denoising Process

The backward denoising process involves learning, via a network $\theta$, the noise to be removed at each step. This is also defined as a Markov process, and it has been shown [20] that when $\beta_t$ is small enough, $q(x_{t-1} \mid x_t)$ approximates a Gaussian distribution. In practice, the network only needs to learn the mean $\mu_\theta$ of this distribution; the variance $\Sigma_\theta$ is fixed, and $\sigma_t$ is used to represent the deterministic schedule. The distribution of $x_{t-1}$ given $x_t$ can be expressed as

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big).$$

Ideally, $p_\theta(x_{t-1} \mid x_t)$ should approximate $q(x_{t-1} \mid x_t)$; however, this is infeasible because it requires integrating over the entire dataset. Instead, we aim to learn the posterior probability $q(x_{t-1} \mid x_t, x_0)$, which satisfies

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\big(x_{t-1};\ \tilde{\mu}(x_t, x_0),\ \tilde{\beta}_t I\big),$$
$$\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\, \beta_t,$$
$$\tilde{\mu}(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}} \Big( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, z_t \Big).$$

Hence, $\mu_\theta(x_t, t)$ aims to predict $\tilde{\mu}(x_t, x_0)$, which essentially amounts to predicting $z_t$. Ultimately, the DDPM sampler outputs $\epsilon_\theta$ to approximate $z_t$.
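A single reverse step then uses the learned $\epsilon_\theta$ to form the posterior mean. A minimal sketch follows, reusing `beta` and `alpha_bar` from the snippet above, with steps indexed 0 … T−1 as an implementation choice; `eps_model` is assumed to accept an image batch and a batch of timesteps.

```python
def p_sample(eps_model, x_t, t):
    # One reverse DDPM step: estimate z_t with eps_theta, compute the posterior
    # mean mu_tilde, and add sigma_t-scaled noise except at the final step.
    alpha_t = 1.0 - beta[t]
    eps = eps_model(x_t, torch.full((x_t.shape[0],), t, dtype=torch.long))
    mean = (x_t - (1.0 - alpha_t) / (1.0 - alpha_bar[t]).sqrt() * eps) / alpha_t.sqrt()
    if t == 0:
        return mean                   # z = 0 at the last step
    sigma_t = beta[t].sqrt()          # fixed variance schedule, as in DDPM
    return mean + sigma_t * torch.randn_like(x_t)
```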

4. Method

Our proposed RoRe (Robust and Refined) method is a comprehensive defense framework tailored for SOD networks, as shown in Figure 2. It consists of three main modules: (1) an adversarial purifier based on a diffusion model; (2) an adversarial detection module, composed of a multi-step voting adversarial classifier and a similarity condition component; and (3) a refinement network, structured as an encoder–decoder architecture. We use the following notation: $x \in \mathbb{R}^{3 \times H \times W}$ represents a benign image; $\hat{x} \in \mathbb{R}^{3 \times H \times W}$ represents a potential adversarial example that may be perturbed by a distortion $\delta$ (or not, when $\delta = 0$); $f$ denotes the target network; $P$ denotes the purifier; $C_m$ denotes the multi-step voting classifier, where $C$ denotes the classifier within it; and $R$ denotes the refinement network.

4.1. Adversarial Purification with Diffusion Model

Taking inspiration from the original image generation process of a diffusion model, we devise a purification strategy using a pre-trained DDPM, thus circumventing the need for any further training. The goal of the proposed DDPM purifier $P: \mathbb{R}^{3 \times H \times W} \to \mathbb{R}^{3 \times H \times W}$ is

$$\tilde{x} \approx x \quad \text{s.t.} \quad \tilde{x} = P(\hat{x}).$$
Unlike the standard sampling procedure in diffusion models, we do not need to generate initial noise. Instead, we subject the input image $\hat{x}$ to a controlled addition of Gaussian noise during the forward process, effectively disrupting the adversarial noise. Then, we employ the backward process to iteratively denoise the image, thereby restoring it closer to its benign state.
We adopt the notation from Section 3 and write the forward diffusion process as

$$x_{T_a} = \sqrt{\bar{\alpha}_{T_a}}\, \hat{x} + \sqrt{1-\bar{\alpha}_{T_a}}\, \psi = \sqrt{\bar{\alpha}_{T_a}}\, x + \sqrt{\bar{\alpha}_{T_a}}\, \delta + \sqrt{1-\bar{\alpha}_{T_a}}\, \psi,$$

where $T_a$ is a hyperparameter representing the number of noise-adding steps, and $\psi$ is standard Gaussian noise. The value of $T_a$ must satisfy two conditions: (a) the noise term $\sqrt{1-\bar{\alpha}_{T_a}}\, \psi$ should be large enough to cover the adversarial term $\sqrt{\bar{\alpha}_{T_a}}\, \delta$, and (b) it should not be so large that the image information $\sqrt{\bar{\alpha}_{T_a}}\, x$ is disrupted beyond recovery. We determine $T_a$ experimentally. Specifically, we selected a portion of the data from a training dataset and plotted the relationship between $T_a$ and accuracy, as shown in Figure 3. If $T_a$ is too small, the noise cannot be effectively removed; if it is too large, the image is excessively distorted, which also lowers accuracy. We therefore choose the $T_a$ at the peak of the curve.
After the forward noise addition is completed, we follow the DDPM method [8] and perform backward denoising on $x_{T_a}$, as shown in Algorithm 1 and Figure 4. By iterating the backward denoising step of Section 3.2 for $T_a$ steps, we obtain the purified image $\tilde{x}$.
Algorithm 1 DDPM Purifier $P$
Require: an input image $\hat{x}$, DDPM sampler $\epsilon_\theta$, noise-adding step $T_a$
Ensure: purified image $\tilde{x}$
1: $\psi \sim \mathcal{N}(0, I)$
2: $x_{T_a} \leftarrow \sqrt{\bar{\alpha}_{T_a}}\, \hat{x} + \sqrt{1-\bar{\alpha}_{T_a}}\, \psi$    ▹ Forward diffusion
3: for $t = T_a, \ldots, 1$ do    ▹ Backward denoise
4:   $z \sim \mathcal{N}(0, I)$ if $t > 1$, else $z \leftarrow 0$
5:   $x_{t-1} \leftarrow \frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \big) + \sigma_t z$
6: end for
7: $\tilde{x} \leftarrow x_0$
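For reference, a direct Python transcription of Algorithm 1 is shown below, reusing the `alpha_bar` schedule and the `p_sample` reverse step sketched in Section 3; the zero-based step indexing is an implementation choice rather than part of the algorithm itself.

```python
def purify(eps_model, x_hat, T_a):
    # Algorithm 1: add T_a steps of Gaussian noise in closed form (forward
    # diffusion), then denoise step by step back to a purified image.
    psi = torch.randn_like(x_hat)
    x_t = alpha_bar[T_a - 1].sqrt() * x_hat + (1.0 - alpha_bar[T_a - 1]).sqrt() * psi
    for t in reversed(range(T_a)):      # t = T_a - 1, ..., 0
        x_t = p_sample(eps_model, x_t, t)
    return x_t                          # purified image x_tilde
```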

4.2. Adversarial Detection

As shown in the middle part of Figure 2, adversarial detection aims to identify whether an input $\hat{x}$ is benign or not, thus allowing different actions to be taken accordingly. It consists of two parts: a multistep voting classifier and a similarity condition. The algorithm is detailed in Algorithm 2.
Multistep voting classifier. The multistep voting classifier $C_m: \mathbb{R}^{4 \times H \times W} \to \{benign, adversarial\}$ executes the classifier $C: \mathbb{R}^{4 \times H \times W} \to \{benign, adversarial\}$ multiple times, as shown in Figure 5. $C_m$ classifies whether $\hat{x}$ is benign or not. The idea behind this design is that $\hat{x}$ may also fool $C$ even though its primary aim is to attack $f$; we therefore extend the noise-against-noise strategy to adversarial detection to alleviate this problem. Inspired by [14], we train $C$ to classify noisy images. However, we found it difficult to train such a classifier using only the RGB image $\hat{x}$: because $C$ introduces Gaussian noise at random steps during training, the adversarial noise $\delta$ is disrupted, making it difficult to distinguish between benign images $x$ and adversarial examples $\hat{x}$ and causing training to struggle to converge. To address this issue, we concatenate the saliency result $f(\hat{x})$ to $\hat{x}$ as an additional input channel. The detailed procedure is given as the function $C_m$ in Algorithm 2. The most crucial steps are lines 11–12: for each step $t$, the input is perturbed with a different Gaussian noise $\sqrt{1-\bar{\alpha}_t}\, \psi$, repeated $T_m$ times to disrupt the adversarial noise, and the classification result $y_t$ is obtained from $C$ at each step; the final outcome is determined by majority rule.
Algorithm 2 Adversarial Detection
Require: an input image $\hat{x}$, target network $f$, adversarial classifier $C$, number of steps $T_m$, DDPM purifier $P$, SSIM threshold $s$
Ensure: label $y_a \in \{benign, adversarial\}$
1: $\tilde{x} \leftarrow P(\hat{x})$
2: if $C_m(\hat{x}, f(\hat{x}); T_m, C)$ is $benign$ and $SimCond(f(\hat{x}), f(\tilde{x}))$ is $benign$ then
3:   $y_a \leftarrow benign$
4: else
5:   $y_a \leftarrow adversarial$
6: end if
7: $C_m(\hat{x}, f(\hat{x}); T_m, C)$:    ▹ Multistep voting classifier
8: for $t = 1, \ldots, T_m$ do
9:   $\psi \sim \mathcal{N}(0, I)$
10:  $x_{in} \leftarrow \mathrm{concat}(\hat{x}, f(\hat{x}))$
11:  $x_t \leftarrow \sqrt{\bar{\alpha}_t}\, x_{in} + \sqrt{1-\bar{\alpha}_t}\, \psi$
12:  $y_t \leftarrow C(x_t; t)$
13: end for
14: if num of labels $benign$ in $\{y_t\}$ > num of labels $adversarial$ in $\{y_t\}$ then
15:   return $benign$
16: else
17:   return $adversarial$
18: end if
19: $SimCond(a, b)$:    ▹ Similarity condition
20: Calculate SSIM between $a$ and $b$
21: return $benign$ if SSIM > $s$, else return $adversarial$
Similarity condition. $C_m$ alone is not sufficient, as it may still make false judgments. We therefore introduce a similarity condition, represented by the function $SimCond: \mathbb{R}^{1 \times H \times W} \times \mathbb{R}^{1 \times H \times W} \to \{benign, adversarial\}$ in Algorithm 2. $SimCond$ also classifies whether $\hat{x}$ is benign. It takes two inputs, $f(\hat{x})$ and $f(\tilde{x})$, the saliency results before and after purification, respectively. If there is a significant difference between $f(\hat{x})$ and $f(\tilde{x})$, we consider $\hat{x}$ an adversarial example; we quantify this difference using SSIM. $SimCond$ addresses the case where $C_m$ misclassifies an adversarial example $\hat{x}$ as benign, which would lower accuracy: if $SimCond$ classifies $\hat{x}$ as adversarial, it rescues the situation by routing $\hat{x}$ to purification and refinement. If $SimCond$ also misclassifies $\hat{x}$ as benign, the final result is $f(\hat{x})$; nevertheless, since $f(\hat{x})$ and $f(\tilde{x})$ are similar in this case, the loss in accuracy is not substantial.
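Putting the two parts together, a compact sketch of Algorithm 2 in PyTorch could look as follows. Here `classifier(x_t, t)` stands in for the noisy guidance classifier (its exact interface is an assumption), `ssim` for any SSIM implementation (e.g., the `pytorch_msssim` package), `alpha_bar` is the schedule from Section 3, and the snippet assumes batch size 1.

```python
import torch

def detect(f, P, classifier, ssim, x_hat, T_m=25, s=0.8):
    # Multistep voting: noise the (image, saliency) pair at T_m different
    # steps, classify each noisy version, and take the majority vote.
    x_in = torch.cat([x_hat, f(x_hat)], dim=1)      # 3 + 1 = 4 channels
    benign_votes = 0
    for t in range(T_m):
        psi = torch.randn_like(x_in)
        x_t = alpha_bar[t].sqrt() * x_in + (1.0 - alpha_bar[t]).sqrt() * psi
        pred = classifier(x_t, t).argmax(dim=1)     # 0 = benign, 1 = adversarial
        benign_votes += int((pred == 0).item())     # assumes batch size 1
    voted_benign = benign_votes > T_m - benign_votes

    # Similarity condition: compare saliency maps before and after purification.
    x_tilde = P(x_hat)
    sim_benign = ssim(f(x_hat), f(x_tilde)).item() > s

    return "benign" if (voted_benign and sim_benign) else "adversarial"
```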

4.3. Saliency Refinement

As mentioned above, purified images suffer a decrease in accuracy. Adversarial detection allows benign images to maintain their original performance, but adversarial images cannot. We therefore design a refinement network $R: \mathbb{R}^{7 \times H \times W} \to \mathbb{R}^{1 \times H \times W}$ to mitigate this decline. As shown in Figure 6, the network takes three inputs: the original image $\hat{x}$, the purified image $\tilde{x}$, and the saliency result of the purified image $f(\tilde{x})$. The input $f(\tilde{x})$ and the network output form a residual structure. Without this residual structure, the refinement network would effectively regenerate the saliency prediction, overwriting the target network's performance; with it, the network only learns to repair the necessary parts, thereby reducing its burden.
When considering whether additional information is needed, we believe that both the original image x ^ and the purified image x ˜ should be incorporated. As the original image x ^ may contain adversarial noise that misleads the refinement process and the purified image x ˜ might suffer from pixel distortion that compromises the image boundaries, these two sources can complement each other. Thus, we combine both inputs with f ( x ˜ ) . The network structure could be any structure that maintains the same size for input and output; in this work, we have opted for a simple yet effective UNet [37] structure.
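The residual wiring can be summarized in a few lines. The sketch below simplifies the threefold encoder described in Section 5 (separate convolutions per input) into a single concatenated 7-channel input, keeping only the residual connection that the text emphasizes; `unet` stands for any encoder–decoder that maps 7 channels to 1, and the clamping is an assumption about how the output is kept in [0, 1].

```python
import torch
import torch.nn as nn

class RefineNet(nn.Module):
    # The UNet predicts a correction that is added to the coarse saliency map
    # f(x_tilde), so regions that need no repair pass through unchanged.
    def __init__(self, unet):
        super().__init__()
        self.unet = unet   # encoder-decoder mapping 7 channels -> 1 channel

    def forward(self, x_hat, x_tilde, coarse):
        inp = torch.cat([x_hat, x_tilde, coarse], dim=1)  # 3 + 3 + 1 channels
        return (coarse + self.unet(inp)).clamp(0.0, 1.0)  # residual connection
```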
To empower the refinement network $R$ to produce better-quality saliency results, we introduce a hybrid loss function $l_{hybrid}$, inspired by BASNet [3]. It consists of three components, BCE (binary cross-entropy), SSIM (structural similarity index measure), and IoU (intersection over union), which supervise pixel-level, patch-level, and map-level information, respectively: $l_{hybrid} = l_{bce} + l_{ssim} + l_{iou}$. The BCE term is the standard binary classification loss,

$$l_{bce} = -\sum_{(i,j)} \big[ g(i,j)\log y(i,j) + (1-g(i,j))\log(1-y(i,j)) \big],$$

where $g(i,j)$ and $y(i,j)$ represent the ground truth and the predicted result at pixel location $(i,j)$, respectively. SSIM evaluates the structural similarity between two images. Let $\mu_g, \sigma_g^2$ and $\mu_y, \sigma_y^2$ denote the mean and variance of the ground truth and of the prediction, and $\sigma_{gy}$ their covariance. The SSIM loss is defined as

$$l_{ssim} = 1 - \frac{(2\mu_g\mu_y + C_1)(2\sigma_{gy} + C_2)}{(\mu_g^2 + \mu_y^2 + C_1)(\sigma_g^2 + \sigma_y^2 + C_2)},$$

where $C_1$ and $C_2$ are small constants that prevent the denominator from becoming zero. IoU measures the degree of overlap between the two maps; to make it differentiable, we use the IoU loss proposed in [38]:

$$l_{iou} = 1 - \frac{\sum_i \sum_j y(i,j)\, g(i,j)}{\sum_i \sum_j \big[ y(i,j) + g(i,j) - y(i,j)\, g(i,j) \big]}.$$
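A compact implementation of the hybrid loss is shown below, assuming predictions and ground truths in [0, 1]. For brevity the SSIM term is computed globally over the whole map, whereas BASNet [3] uses windowed SSIM; the windowed variant follows the same pattern. The constants $C_1$ and $C_2$ use the conventional SSIM defaults.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(y, g, C1=0.01**2, C2=0.03**2, eps=1e-7):
    # l_hybrid = l_bce + l_ssim + l_iou (pixel-, patch-, and map-level terms).
    l_bce = F.binary_cross_entropy(y, g)

    mu_y, mu_g = y.mean(), g.mean()
    var_y, var_g = y.var(), g.var()
    cov = ((y - mu_y) * (g - mu_g)).mean()
    ssim = ((2 * mu_g * mu_y + C1) * (2 * cov + C2)) / (
        (mu_g ** 2 + mu_y ** 2 + C1) * (var_g + var_y + C2))
    l_ssim = 1.0 - ssim

    inter = (y * g).sum()
    union = (y + g - y * g).sum()
    l_iou = 1.0 - inter / (union + eps)

    return l_bce + l_ssim + l_iou
```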

5. Experiments

5.1. Experiment Settings

Datasets and network architectures. We use three widely adopted datasets for evaluating SOD performance: ECSSD [39], DUT-OMRON [40], and HKU-IS [41]. ECSSD contains 1000 images of complex scenes with pixel-level ground truth annotations. DUT-OMRON offers 5168 images of diverse scenes with object masks and region maps. HKU-IS provides 4447 diverse images with pixel-level ground truth annotations. The multi-step voting classifier and the refinement network are trained on DUTS [36], which provides pixel-level ground truth annotations across 10,553 images. To demonstrate the generalization capability of the RoRe defense and its superiority relative to state-of-the-art methods, we select several classic SOD networks proposed in recent years: for robustness under PGD attack, GateNet [15], PFSNet [1], and PiCANet [16]; for robustness under BPDA attack, GateNet, PFSNet, and U²Net [2].
Adversarial attacks. We use PGD to evaluate the defensive performance of RoRe and the SOTA methods. The PGD parameters strictly follow the settings in LeNo: the perturbation budget is 20/255 under the $l_\infty$ norm, the number of steps is 10, and the step size is 0.04. Although the ROSA attack was also employed in LeNo, there is no sufficient theoretical justification for ROSA to exhibit a stronger attack effect than PGD, and the experimental results in LeNo confirm this observation. Furthermore, since RoRe employs gradient obfuscation as a defensive measure, we also use BPDA combined with the expectation over transformation (EOT) [42] algorithm as an adaptive attack. This involves a 10-step iterative process in which each attack gradient is computed over 15 runs using EOT, and the gradients are summed to create a comprehensive adversarial example. The perturbation budget for these attacks is 10/255 under the $l_\infty$ norm.
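For clarity, one attack iteration under this setting can be sketched as follows: BPDA treats the purifier as the identity on the backward pass, and EOT averages gradients over the purifier's randomness. The names and the step size `alpha` are illustrative assumptions, and the eps-ball projection and clamping are left outside the step, as in the PGD sketch of Section 2.1.

```python
import torch
import torch.nn.functional as F

def bpda_eot_step(model, purify_fn, x_adv, target, alpha=2/255, eot_runs=15):
    # BPDA + EOT: run the stochastic purifier eot_runs times, backpropagate
    # through the target network only, and sum the gradients w.r.t. the
    # purified input as an approximation of the gradient w.r.t. x_adv.
    grad_sum = torch.zeros_like(x_adv)
    for _ in range(eot_runs):
        x_pur = purify_fn(x_adv.detach()).requires_grad_(True)
        loss = F.binary_cross_entropy(model(x_pur), target)
        grad_sum += torch.autograd.grad(loss, x_pur)[0]
    return x_adv + alpha * grad_sum.sign()   # one step; project/clamp afterwards
```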
Implementation details. We use one RTX 3090 graphics card, with PyTorch 2.0 and Python 3.11 as the deep learning environment. For the purification module, we select the purification step $T_a$ on DUTS-TR (the training set of DUTS): $T_a$ is set to 15 for GateNet and PFSNet and to 25 for PiCANet. For the adversarial detection module, $T_m$ in the multi-step voting classifier is set to 25 for all networks. The training data for this classifier consist of DUTS-TR and the saliency results obtained from each network; we use a batch size of 8, 10 epochs, a learning rate of $3 \times 10^{-4}$, and a weight decay of 0.05. The SSIM threshold $s$ is set to 0.8. For the refinement module, we employ the classical UNet architecture. To accommodate the three types of inputs, we expand the encoder threefold and apply separate convolutions; the decoder remains unchanged. When providing skip connections to the decoder, we concatenate the respective feature maps from the three inputs. We use a batch size of 20, 5 epochs, an initial learning rate of 0.001 (multiplied by 1/5 every epoch), and no weight decay.
Evaluation metrics. We adopt the mean absolute error (MAE) and the $F_\beta$-measure as our evaluation metrics. MAE evaluates the pixel-level difference between the generated saliency map $S$ and the corresponding ground truth $G$:

$$\mathrm{MAE} = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} \big| S(i,j) - G(i,j) \big|,$$

where $W$ and $H$ are the width and height of the saliency map, respectively. The $F_\beta$-measure, on the other hand, gauges the balance between the precision and recall of the saliency detection results:

$$F_\beta = \frac{(1+\beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}},$$

where $\beta^2$ is set to 0.3 to weight precision more than recall, following conventional practice in the saliency detection literature.
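Both metrics are straightforward to compute. The sketch below uses a fixed binarization threshold for the F-measure; SOD papers often sweep thresholds and report the maximum instead, so the threshold here is illustrative.

```python
import torch

def mae(S, G):
    # Mean absolute error between saliency map S and ground truth G in [0, 1].
    return (S - G).abs().mean()

def f_beta(S, G, beta2=0.3, thresh=0.5, eps=1e-7):
    # F-measure with beta^2 = 0.3, weighting precision more than recall.
    pred = (S >= thresh).float()
    tp = (pred * G).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (G.sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
```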

5.2. Robustness under PGD Attack

We compare our method, RoRe, with two SOTA methods, ROSA and LeNo, as demonstrated in Table 1 and Table 2. Table 1 illustrates the robustness of various network models defended by RoRe under PGD attacks. Note that although we use the same attack parameters as LeNo, the accuracy results obtained under attack vary slightly. This variation can be attributed to two factors: firstly, PGD is inherently random; secondly, LeNo only performed PGD attacks on GateNet and reused those attacked samples as adversarial examples for PiCANet and PFSNet, resulting in weaker attack effects on the latter two models, whereas we launched PGD attacks on all three networks. The “No defense” and “No defense (ours)” columns correspond to the accuracy of the undefended network as reported by LeNo and the accuracy of the undefended network under our own attack execution, respectively. On GateNet, LeNo improves the F-measure by about 30% and reduces MAE by about 0.24, while RoRe improves the F-measure by about 60% and reduces MAE by about 0.5. On PFSNet, LeNo raises the F-measure by 18% and reduces MAE by 0.1, while RoRe raises the F-measure by 40% and reduces MAE by 0.2. On PiCANet, LeNo improves the F-measure by 6% and reduces MAE by 0.01, while RoRe boosts the F-measure by 48% and reduces MAE by 0.1. Even though the attacks LeNo was evaluated against are generally weaker than ours, RoRe still elevates the robustness of the models to a higher level.
Table 2 shows the impact of RoRe on the accuracy of benign images. On GateNet, LeNo exhibits an average drop of 1.7% in F-measure and an average increase of 0.007 in MAE, while RoRe drops the F-measure by only 0.2% and changes MAE by less than 0.002. On PFSNet, LeNo drops the F-measure by 2% and increases MAE by 0.01, while RoRe drops the F-measure by only 0.02% and changes MAE by less than 0.0006. On PiCANet, LeNo drops the F-measure by 0.2% and increases MAE by 0.004, while RoRe drops the F-measure by less than 0.02% and changes MAE by less than 0.002. The refinement module in RoRe even enables some datasets to achieve better results than the original, most noticeably on PiCANet, whose architecture is comparatively old and whose original results were not optimal. In contrast, this advantage of the refinement module is less pronounced on PFSNet.
Figure 7 shows the precision–recall curves for different datasets and target models. For benign images, the curve with RoRe is very close to the original benign curve, indicating that RoRe does not significantly compromise the accuracy of benign images. For adversarial images, the curve with RoRe is still somewhat distant from the benign curve, but compared to the adversarial curve, there is a substantial improvement.
As shown in Figure 8, we present the visualization results of RoRe. For benign images, RoRe preserves the original accuracy. In the first, third, and fifth rows, RoRe selects the original results, while in the second, fourth, and sixth rows, purification and refinement are performed. For adversarial examples, RoRe significantly improves the saliency results. Figure 9 depicts the results of images after applying the DDPM purifier. It is evident that both benign images and adversarial examples become slightly blurred after purification. However, the perturbation in the adversarial examples is effectively reduced.

5.3. Adaptive Attack and Ablation Study

Adaptive attack. Table 3 shows the robustness of RoRe under BPDA attack. The attacker is aware of both the target network and the purification module in RoRe. It can be observed that under the protection of RoRe, the BPDA attack only results in a maximum 14% decrease in F-measure and an increase in MAE of no more than 0.07.
Ablation study. Table 4 shows the impact of each module of RoRe on the accuracy of benign and PGD adversarial examples generated by GateNet. After purification, the accuracy of the PGD images (Robust) has significantly improved, while the accuracy of the benign images (Natural) has slightly decreased. The inclusion of adversarial detection does indeed have an impact on the robust accuracy, as any misclassification would lead to a decrease in accuracy. However, this decrease is not significant, and it also contributes to further improving the accuracy of benign images, approaching the original accuracy.

6. Conclusions

This paper presents RoRe, a defense framework based on the diffusion model, to protect SOD networks against adversarial attacks. RoRe consists of three modules: purification, adversarial detection, and refinement. Purification uses the diffusion model to purify input images. Adversarial detection employs a multi-step voting classifier and a similarity condition to jointly classify inputs: when an input is detected as adversarial, refinement is performed; otherwise, the original prediction results are used. Evaluating accuracy under PGD attack across several typical SOD models and datasets, we find that RoRe outperforms ROSA and LeNo, improving robust accuracy while preserving natural accuracy, and it also demonstrates promising results against BPDA attacks. The main limitation of RoRe is the iterative, multi-step nature of the diffusion model's generation process, which makes purification speed a bottleneck. Looking ahead, we believe that as diffusion models advance, this defense approach will become more widespread and improve further in generation speed and quality.

Author Contributions

Conceptualization, H.Y. and X.Z.; methodology, H.Y.; software, H.Y.; validation, H.Y., Y.Z. and X.Z.; formal analysis, H.Y.; investigation, H.Y.; resources, X.Z.; data curation, H.Y.; writing—original draft preparation, H.Y.; writing—review and editing, H.Y., Y.Z. and X.Z.; visualization, H.Y.; supervision, X.Z.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Scientific and Technological Innovation 2030—Major Project of “New Generation Artificial Intelligence” grant number 2020AAA0109300.

Data Availability Statement

All data included in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ma, M.; Xia, C.; Li, J. Pyramidal Feature Shrinking for Salient Object Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021. [Google Scholar]
  2. Qin, X.; Zhang, Z.; Huang, C.; Dehghan, M.; Zaiane, O.R.; Jagersand, M. U-2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognit. 2020, 106, 107404. [Google Scholar] [CrossRef]
  3. Qin, X.; Zhang, Z.; Huang, C.; Gao, C.; Dehghan, M.; Jagersand, M. BASNet: Boundary-Aware Salient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
  4. Wang, H.; Wan, L.; Tang, H. LeNo: Adversarial Robust Salient Object Detection Networks with Learnable Noise. In Proceedings of the Assoc Advancement Artificial Intelligence (AAAI), Washington, DC, USA, 7–14 February 2023. [Google Scholar] [CrossRef]
  5. Li, H.; Li, G.; Yu, Y. ROSA: Robust Salient Object Detection Against Adversarial Attacks. IEEE Trans. Cybern. 2020, 50, 4835–4847. [Google Scholar] [CrossRef] [PubMed]
  6. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv 2017, arXiv:1706.06083. [Google Scholar] [CrossRef]
  7. Zheng, S.; Jayasumana, S.; Romera-Paredes, B.; Vineet, V.; Su, Z.; Du, D.; Huang, C.; Torr, P.H.S. Conditional Random Fields as Recurrent Neural Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar] [CrossRef]
  8. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. arXiv 2020, arXiv:2006.11239. [Google Scholar] [CrossRef]
  9. Nichol, A.; Dhariwal, P. Improved Denoising Diffusion Probabilistic Models. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021. [Google Scholar] [CrossRef]
  10. Nie, W.; Guo, B.; Huang, Y.; Xiao, C.; Vahdat, A.; Anandkumar, A. Diffusion Models for Adversarial Purification. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022. [Google Scholar] [CrossRef]
  11. Wang, J.; Lyu, Z.; Lin, D.; Dai, B.; Fu, H. Guided Diffusion Model for Adversarial Purification. arXiv 2022, arXiv:2205.14969. [Google Scholar]
  12. Wu, S.; Wang, J.; Ping, W.; Nie, W.; Xiao, C. Defending against Adversarial Audio via Diffusion Model. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  13. Sun, J.; Nie, W.; Yu, Z.; Morley Mao, Z.; Xiao, C. PointDP: Diffusion-driven Purification against Adversarial Attacks on 3D Point Cloud Recognition. arXiv 2022, arXiv:2208.09801. [Google Scholar] [CrossRef]
  14. Dhariwal, P.; Nichol, A. Diffusion Models Beat GANs on Image Synthesis. In Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), Online, 6–14 December 2021. [Google Scholar]
  15. Zhao, X.; Pang, Y.; Zhang, L.; Lu, H.; Zhang, L. Suppress and Balance: A Simple Gated Network for Salient Object Detection. In Proceedings of the Computer Vision-ECCV 2020—16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2020; Volume 12347, pp. 35–51. [Google Scholar] [CrossRef]
  16. Liu, N.; Han, J.; Yang, M.H. PiCANet: Pixel-Wise Contextual Attention Learning for Accurate Saliency Detection. IEEE Trans. Image Process. 2020, 29, 6438–6451. [Google Scholar] [CrossRef] [PubMed]
  17. Athalye, A.; Carlini, N.; Wagner, D. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018. [Google Scholar] [CrossRef]
  18. Samangouei, P.; Kabkab, M.; Chellappa, R. Defense-GAN: Protecting Classifiers Against Adversarial Attacks Using Generative Models. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  19. Song, Y.; Kim, T.; Nowozin, S.; Ermon, S.; Kushman, N. PixelDefend: Leveraging Generative Models to Understand and Defend against Adversarial Examples. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  20. Sohl-Dickstein, J.; Weiss, E.A.; Maheswaranathan, N.; Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; Proceedings of Machine Learning Research. Volume 37, pp. 2256–2265. [Google Scholar]
  21. Xu, J.; Liu, Y. A Coarse-to-Fine Method Based on Saliency Map for Solar Cell Interior Defect Measurement. IEEE Trans. Instrum. Meas. 2022, 71. [Google Scholar] [CrossRef]
  22. Wang, H.; Yan, B.; Wang, X.; Zhang, Y.; Yang, Y. Accurate saliency detection based on depth feature of 3D images. Multimed. Tools Appl. 2018, 77, 14655–14672. [Google Scholar] [CrossRef]
  23. Zhang, M.; Pang, Y.; Wu, Y.; Du, Y.; Sun, H.; Zhang, K. Saliency detection via local structure propagation. J. Vis. Commun. Image Represent. 2018, 52, 131–142. [Google Scholar] [CrossRef]
  24. Wang, C.; Yang, B. Saliency-Guided Object Proposal for Refined Salient Region Detection. In Proceedings of the 2016 30th Anniversary of Visual Communication and Image Processing (VCIP), Chengdu, China, 27–30 November 2016. [Google Scholar]
  25. Pang, Y.; Wu, Y.; Wu, C.; Zhang, M. Salient object detection via effective background prior and novel graph. Multimed. Tools Appl. 2020, 79, 25679–25695. [Google Scholar] [CrossRef]
  26. Xu, Y.; Hong, X.; Zhao, G. Salient Object Detection with CNNs and Multi-scale CRFs. In Proceedings of the Image Analysis, 21st Scandinavian Conference on Image Analysis (SCIA), Norrkoping, Sweden, 11–13 June 2019; Felsberg, M., Forssen, P., Sintorn, I., Unger, J., Eds.; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2019; Volume 11482, pp. 233–245. [Google Scholar] [CrossRef]
  27. Xu, Y.; Xu, D.; Hong, X.; Ouyang, W.; Ji, R.; Xu, M.; Zhao, G. Structured Modeling of Joint Deep Feature and Prediction Refinement for Salient Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference On Computer Vision (ICCV 2019), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3788–3797. [Google Scholar] [CrossRef]
  28. Zheng, Q.; Yu, S.; You, X. Coarse-to-fine salient object detection with low-rank matrix recovery. Neurocomputing 2020, 376, 232–243. [Google Scholar] [CrossRef]
  29. Nwe, T.L.; Min, O.Z.; Gopalakrishnan, S.; Lin, D.; Dong, S.P.S.; Li, Y.; Pahwa, R.S. Improving 3D Brain Tumor Segmentation with Predict-Refine Mechanism Using Saliency and Feature Maps. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Virtual, 25–28 September 2020; pp. 2671–2675. [Google Scholar]
  30. Li, R.; Sun, S.; Yang, L.; Hu, W. Saliency Detection via CNN Coarse Learning and Compactness Based ELM Refinement. In Proceedings of the Computer Vision, Pt II 2nd CCF Chinese Conference on Computer Vision (CCCV), China Comp Federat, Tianjin, China, 11–14 October 2017; Yang, J., Hu, Q., Cheng, M., Wang, L., Liu, Q., Bai, X., Meng, D., Eds.; Communications in Computer and Information Science. Springer: Singapore, 2017; Volume 772, pp. 445–460. [Google Scholar] [CrossRef]
  31. Tang, Y.; Wu, X. Salient Object Detection with Chained Multi-Scale Fully Convolutional Network. In Proceedings of the 2017 ACM Multimedia Conference (MM’17), 25th ACM International Conference on Multimedia (MM), Comp Hist Museum, Mountain View, CA, USA, 23–27 October 2017; pp. 618–626. [Google Scholar] [CrossRef]
  32. Xu, M.; Liu, B.; Fu, P.; Li, J.; Hu, Y.H.; Feng, S. Video Salient Object Detection via Robust Seeds Extraction and Multi-Graphs Manifold Propagation. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2191–2206. [Google Scholar] [CrossRef]
  33. Li, Y.; Cui, F.; Xue, X.; Chan, J.C.W. Coarse-to-fine salient object detection based on deep convolutional neural networks. Signal Process.-Image Commun. 2018, 64, 21–32. [Google Scholar] [CrossRef]
  34. Fu, K.; Zhao, Q.; Gu, I.Y.H. Refinet: A Deep Segmentation Assisted Refinement Network for Salient Object Detection. IEEE Trans. Multimed. 2019, 21, 457–469. [Google Scholar] [CrossRef]
  35. Deng, Z.; Hu, X.; Zhu, L.; Xu, X.; Qin, J.; Han, G.; Heng, P.A. R3Net: Recurrent Residual Refinement Network for Saliency Detection. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 13–19 July 2018; Lang, J., Ed.; pp. 684–690. [Google Scholar]
  36. Wang, L.; Lu, H.; Wang, Y.; Feng, M.; Wang, D.; Yin, B.; Ruan, X. Learning to Detect Salient Objects with Image-level Supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar] [CrossRef]
  37. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention, PT III, Munich, Germany, 5–9 October 2015; Lecture Notes in Computer Science. Volume 9351, pp. 234–241. [Google Scholar] [CrossRef]
  38. Mattyus, G.; Luo, W.; Urtasun, R. DeepRoadMapper: Extracting Road Topology from Aerial Images. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar] [CrossRef]
  39. Shi, J.; Yan, Q.; Xu, L.; Jia, J. Hierarchical Image Saliency Detection on Extended CSSD. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 717–729. [Google Scholar] [CrossRef] [PubMed]
  40. Yang, C.; Zhang, L.; Lu, H.; Ruan, X.; Yang, M.H. Saliency Detection via Graph-Based Manifold Ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR, USA, 23–28 June 2013. [Google Scholar] [CrossRef]
  41. Li, G.; Yu, Y. Visual Saliency Based on Multiscale Deep Features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar] [CrossRef]
  42. Athalye, A.; Engstrom, L.; Ilyas, A.; Kwok, K. Synthesizing Robust Adversarial Examples. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018. [Google Scholar] [CrossRef]
Figure 1. The defense result of our proposed RoRe. From left to right, the columns correspond to the adversarial examples attacked by PGD, the saliency results generated from the adversarial examples, the saliency results after RoRe defense, and the ground truth. The proposed RoRe is able to restore the attacked images back to their benign state.
Figure 2. Overview of the RoRe defense framework. RoRe initially applies purification to the original input images. Subsequently, it performs adversarial detection based on the obtained saliency results. If an adversarial example is detected, a refinement process is carried out. Otherwise, the original saliency results are used as the final outcome.
Figure 3. The influence of purification steps. (a) Purification results using different steps. Choosing an appropriate value for $T_a$, the number of steps, is crucial: if $T_a$ is too small, noise removal is inadequate, resulting in incorrect saliency results; if $T_a$ is too large, the image information deviates from the original, also leading to incorrect results. (b) Purification steps vs. F-measure. The dataset is DUTS-TR [36].
Figure 4. Adversarial purification with diffusion model. The original image undergoes a forward diffusion process, followed by a backward denoising process.
Figure 5. Multistep voting classifier. The original image and its saliency result are concatenated together. Then, multiple noisy images are created through a series of steps, and each image is classified separately. Finally, the results are determined by majority voting.
Figure 6. Refinement network. The original image, the purified image, and its saliency result are jointly fed into the refinement network to refine the saliency result. The input and output form a residual connection, ensuring that the refinement network does not overwrite the performance of the target network.
Figure 7. Precision–recall curves for our RoRe on benign and adversarial samples across different datasets and target models, where the horizontal axis represents the datasets, and the vertical axis represents the target models. The ‘adv’ and ‘benign’ indicate adversarial (PGD) and benign samples. ‘adv+RoRe’ and ‘benign+RoRe’ indicate the results after the defense of our RoRe.
Figure 8. Visualization of RoRe results, where the leftmost column represents the target network. From left to right, the columns correspond to the GTs, the benign images, the saliency results generated by the unprotected target network for the benign images, the saliency results generated by the target network after applying RoRe to the benign images, the images perturbed by PGD attack, the saliency results generated by the unprotected target network for the perturbed images, and the saliency results generated by the target network after applying RoRe to the perturbed images.
Figure 9. Visualization of the purification results. From left to right, the columns correspond to the benign images, the benign images after purification, the images perturbed by PGD attack, and the perturbed images after purification.
Table 1. Comparison with SOTA under PGD attack. The attacker is only aware of the target network. Results borrowed from reference [4] are marked with *. Results are presented as “absolute value (improvement relative to the corresponding No Defense)”. We compare the absolute values, with the best results marked in red and the second-best marked in blue. The ‘↑’ indicates that a higher value is better, while the ‘↓’ indicates that a lower value is better.

| Dataset | Network | Metrics | No Defense * | +ROSA * | +LeNo * | No Defense (Ours) | +RoRe |
|---|---|---|---|---|---|---|---|
| ECSSD | GateNet | F_β (%) ↑ | 27.41 | 47.67 (+20.26) | 59.27 (+31.86) | 28.55 | 93.12 (+64.57) |
| | | MAE ↓ | 0.4591 | 0.2823 (−0.1768) | 0.2179 (−0.2412) | 0.5899 | 0.0455 (−0.5444) |
| DUT-OMRON | | F_β (%) ↑ | 18.48 | 29.30 (+10.82) | 41.70 (+23.22) | 18.48 | 77.13 (+58.65) |
| | | MAE ↓ | 0.4882 | 0.3647 (−0.1235) | 0.2725 (−0.2157) | 0.5993 | 0.0667 (−0.5326) |
| HKU-IS | | F_β (%) ↑ | 31.27 | 55.44 (+24.17) | 65.38 (+34.11) | 23.52 | 92.31 (+68.79) |
| | | MAE ↓ | 0.4293 | 0.2532 (−0.1761) | 0.1842 (−0.2451) | 0.5586 | 0.0343 (−0.5243) |
| ECSSD | PFSNet | F_β (%) ↑ | 59.38 | 73.31 (+13.93) | 80.32 (+20.94) | 42.94 | 90.24 (+47.30) |
| | | MAE ↓ | 0.2075 | 0.1253 (−0.0822) | 0.1079 (−0.0996) | 0.2951 | 0.0589 (−0.2362) |
| DUT-OMRON | | F_β (%) ↑ | 45.71 | 58.79 (+13.08) | 65.11 (+19.40) | 35.44 | 78.01 (+42.57) |
| | | MAE ↓ | 0.2417 | 0.1318 (−0.1099) | 0.1284 (−0.1133) | 0.2547 | 0.0672 (−0.1875) |
| HKU-IS | | F_β (%) ↑ | 69.41 | 77.16 (+7.75) | 82.91 (+13.50) | 49.16 | 90.01 (+40.85) |
| | | MAE ↓ | 0.1625 | 0.0997 (−0.0628) | 0.0809 (−0.0816) | 0.2189 | 0.0440 (−0.1749) |
| ECSSD | PiCANet | F_β (%) ↑ | 77.08 | 84.15 (+7.07) | 85.11 (+8.03) | 28.55 | 85.69 (+57.14) |
| | | MAE ↓ | 0.1112 | 0.0926 (−0.0186) | 0.0913 (−0.0199) | 0.2709 | 0.0877 (−0.1832) |
| DUT-OMRON | | F_β (%) ↑ | 61.67 | 64.16 (+2.49) | 66.58 (+4.91) | 18.48 | 62.33 (+43.85) |
| | | MAE ↓ | 0.1287 | 0.1269 (−0.0018) | 0.1195 (−0.0092) | 0.2030 | 0.1041 (−0.0989) |
| HKU-IS | | F_β (%) ↑ | 80.81 | 83.24 (+2.43) | 84.68 (+3.87) | 23.52 | 72.41 (+48.89) |
| | | MAE ↓ | 0.0892 | 0.0819 (−0.0073) | 0.0782 (−0.0110) | 0.2352 | 0.1179 (−0.1173) |
Table 2. Comparison with SOTA on benign images. Results borrowed from reference [4] are marked with *. Results are presented as “absolute value (improvement relative to the corresponding No Defense)”. We compare the absolute values, with the best results marked in red and the second-best marked in blue. The ‘↑’ indicates that a higher value is better, while the ‘↓’ indicates that a lower value is better.

| Dataset | Network | Metrics | No Defense * | +ROSA * | +LeNo * | No Defense (Ours) | +RoRe |
|---|---|---|---|---|---|---|---|
| ECSSD | GateNet | F_β (%) ↑ | 94.38 | 93.60 (−0.78) | 92.92 (−1.46) | 94.81 | 94.72 (−0.09) |
| | | MAE ↓ | 0.0332 | 0.0425 (+0.0093) | 0.0399 (+0.0067) | 0.0383 | 0.0397 (+0.0014) |
| DUT-OMRON | | F_β (%) ↑ | 81.90 | 79.75 (−2.15) | 79.69 (−2.21) | 81.92 | 81.60 (−0.32) |
| | | MAE ↓ | 0.0545 | 0.0614 (+0.0069) | 0.0606 (+0.0061) | 0.0547 | 0.0548 (+0.0001) |
| HKU-IS | | F_β (%) ↑ | 94.32 | 92.75 (−1.57) | 92.79 (−1.53) | 93.84 | 93.76 (−0.08) |
| | | MAE ↓ | 0.0301 | 0.0377 (+0.0076) | 0.03772 (+0.00762) | 0.0312 | 0.0309 (−0.0003) |
| ECSSD | PFSNet | F_β (%) ↑ | 94.61 | 87.78 (−6.83) | 92.63 (−1.98) | 95.23 | 95.19 (−0.04) |
| | | MAE ↓ | 0.0305 | 0.0698 (+0.0393) | 0.0391 (+0.0086) | 0.0314 | 0.0320 (+0.0006) |
| DUT-OMRON | | F_β (%) ↑ | 81.97 | 77.25 (−4.72) | 79.29 (−2.68) | 82.29 | 82.28 (−0.01) |
| | | MAE ↓ | 0.0553 | 0.0615 (+0.0062) | 0.0656 (+0.0103) | 0.0543 | 0.0536 (−0.0007) |
| HKU-IS | | F_β (%) ↑ | 94.34 | 89.43 (−4.91) | 92.49 (−1.85) | 94.32 | 94.29 (−0.03) |
| | | MAE ↓ | 0.0285 | 0.0491 (+0.0206) | 0.0387 (+0.0102) | 0.0265 | 0.0261 (−0.0004) |
| ECSSD | PiCANet | F_β (%) ↑ | 92.78 | 90.22 (−2.56) | 92.64 (−0.14) | 91.66 | 91.64 (−0.02) |
| | | MAE ↓ | 0.0455 | 0.0536 (+0.0081) | 0.0482 (+0.0027) | 0.0586 | 0.0588 (+0.0002) |
| DUT-OMRON | | F_β (%) ↑ | 77.62 | 76.59 (−1.03) | 77.68 (+0.06) | 76.55 | 76.91 (+0.36) |
| | | MAE ↓ | 0.0661 | 0.0737 (+0.0076) | 0.0727 (+0.0066) | 0.0670 | 0.0671 (+0.0001) |
| HKU-IS | | F_β (%) ↑ | 92.14 | 90.73 (−1.41) | 91.83 (−0.31) | 90.16 | 90.27 (+0.11) |
| | | MAE ↓ | 0.0419 | 0.0514 (+0.0095) | 0.0472 (+0.0053) | 0.0512 | 0.0509 (−0.0003) |
Table 3. Robustness against BPDA attacks, where the attacker is aware of both the target network and the purification module. “Benign” denotes the original accuracy of the target network on benign images. “BPDA+RoRe” denotes the accuracy of RoRe attacked by BPDA. The ‘↑’ indicates that a higher value is better, while the ‘↓’ indicates that a lower value is better.

| Dataset | Network | Metrics | Benign | BPDA+RoRe |
|---|---|---|---|---|
| ECSSD | GateNet | F_β (%) ↑ | 94.81 | 81.06 |
| | | MAE ↓ | 0.0383 | 0.1099 |
| DUT-OMRON | PFSNet | F_β (%) ↑ | 82.29 | 73.75 |
| | | MAE ↓ | 0.0543 | 0.0861 |
| HKU-IS | U²Net | F_β (%) ↑ | 93.53 | 85.61 |
| | | MAE ↓ | 0.0306 | 0.0610 |
Table 4. Ablation study on GateNet to analyze the performance of each module under benign (natural) and adversarial (robust) examples. “pur” denotes the purification module, “det” denotes the adversarial detection module, and “ref” denotes the refinement network module. The ‘↑’ indicates that a higher value is better, while the ‘↓’ indicates that a lower value is better.

| +pur | +det | +ref | Metric | Natural ECSSD | Natural DUT-OMRON | Natural HKU-IS | Robust ECSSD | Robust DUT-OMRON | Robust HKU-IS |
|---|---|---|---|---|---|---|---|---|---|
| | | | F_β (%) ↑ | 94.81 | 81.92 | 93.84 | 28.55 | 18.48 | 23.52 |
| | | | MAE ↓ | 0.0383 | 0.0547 | 0.0312 | 0.5899 | 0.5993 | 0.5586 |
| ✓ | | | F_β (%) ↑ | 89.27 | 80.77 | 93.15 | 88.67 | 76.97 | 92.20 |
| ✓ | | | MAE ↓ | 0.0695 | 0.0561 | 0.0331 | 0.0719 | 0.0699 | 0.0384 |
| ✓ | | ✓ | F_β (%) ↑ | 94.59 | 81.13 | 93.45 | 93.17 | 77.30 | 92.46 |
| ✓ | | ✓ | MAE ↓ | 0.0346 | 0.0526 | 0.0280 | 0.0453 | 0.0654 | 0.0334 |
| ✓ | ✓ | ✓ | F_β (%) ↑ | 94.72 | 81.60 | 93.76 | 93.12 | 77.13 | 92.31 |
| ✓ | ✓ | ✓ | MAE ↓ | 0.0397 | 0.0548 | 0.0309 | 0.0455 | 0.0667 | 0.0343 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
