Purifying Adversarial Images Using Adversarial Autoencoder With Conditional Normalizing Flows

We present a target-agnostic adversarial autoencoder with conditional normalizing flows designed to purify adversarial samples in any unlabeled image dataset, i.e., to remove adversarial noise from the images while preserving their visual quality. In our model interpretation, samples are processed by manifold projection, in which the encoder brings each sample back into a posterior data distribution in latent space so that the sample is less likely to appear irregular to the learned representation of any target classifier. Normalizing flows conditioned on top of our hybrid network structure, together with walk-back training, address common drawbacks of generative-model and autoencoder-based approaches: not only the trade-off between compression loss and over-fitting on training data, but also the structural dependency of the model on dataset classes and labels. Experiments demonstrate that our proposed model is preferable to existing target-agnostic adversarial defense methods, particularly for large, unlabeled image datasets.

I. INTRODUCTION

Research communities have applied deep learning to various image recognition tasks, including face recognition and autonomous driving [1]. Deep learning models are increasingly becoming a technological pillar in critical parts of the world economy. For instance, face recognition has become a leading technology for biometric identification and is widely used in the financial, military, and public security sectors as well as in daily life. However, artificial neural networks are vulnerable to adversarial samples. Adversarial attacks aim to deceive deep neural networks via elaborately designed perturbations of the input samples. Various studies have shown how an input image can be perturbed so as to confuse neural classifiers without the pixel differences being perceptible to humans. Such intrinsic flaws in neural optimization methods, if taken advantage of by malicious attackers, could lead to public misinformation, sabotage, and dangerous security breaches in the real world. In face identification, for example, criminals can cause a security breach by mounting an adversarial attack that makes a face verification system wrongly recognize an intruder as an authorized person.
In the ceaseless fight against data forgery, the defense side would unfortunately be put in a difficult position if the defense mechanism had to learn as much about the target as the attacking side does. Instead of analyzing and reversing the perturbation tailored by attackers, recent studies have focused on the posterior distributions of the authentic data samples and on learning how to bring tampered samples back to the posterior from wherever they were artificially planted on the projected manifold. The philosophy of adversarial purification argues for directly processing the input image dataset before any embedded adversarial component stands a chance of propagating into the learned features of the target neural network. Existing adversarial purification methods are rarely based on any recipe for how adversarial samples are crafted against the target model. However, impractical as it may be, the structure, training process, or computation style of existing purification models is often coupled with the target dataset statistics, namely the number of classes, the hierarchy of data classification, and the distribution of data.
Recent studies have endeavored to unveil the essential principle of adversarial perturbation and have empirically interpreted adversarial samples as compositions of "out-of-distribution" data points with regard to the learned manifold [2], [3], thereby exploiting irregularities in the target neural network. Similar to many adversarial defense methods, our proposed model is built upon the assumption that the decisive structural information for classifying images and/or recognizing objects can, to a certain extent, be distinguished from the adversarial noise components in the learned latent space. This process is often referred to as "purification". We aim for a target-agnostic method for defending image classification models against adversarial samples that can be flexibly applied to different unlabeled datasets, without acquiring any prior knowledge of the target model, such as its network structure, feature representation, or learned parameters. Fig. 1 illustrates an external view of our proposed method. The goal of image dataset purification is to simultaneously (i) minimize the pixel loss of cleaned images (visual quality preservation) and (ii) maximize the prediction accuracy of image classifier neural networks on the dataset. Both are equally important.
Our main contributions in this work are as follows:
- We propose a method for out-of-the-box purification of adversarial images in large, unlabeled image datasets with arbitrary class distributions, without requiring access to target models or attacker information.
- We demonstrate that our approach is scalable, making it a cost-effective tool for preventing malicious poisoning when training reliable neural networks from public and unorganized resources.
- We explore a self-supervised approach using normalizing flows for adaptive, representation-based posterior estimation on various image sets, which achieves performance parity with related works while outperforming them on complex and unlabeled datasets. Additionally, we provide a configurable trade-off between restoration and classification, and highlight the potential uses and principles of self-supervised latent representation from a forensic and information security perspective.

II. RELATED WORKS
Adversarial defense approaches of assorted genres have been proposed, including adversarial training [4], [5], feature robustness [6], stochastic robustness [7], and model distillation [8]. The techniques used by such defense methods are often entangled with the target model architecture or training process; therefore, white-box access to the target model is required, even before the model finishes training, which severely limits generalization to heterogeneous and unknown target models. Generative models and autoencoders, surprisingly, have been able to provide a one-off processing mechanism that operates on the samples themselves instead of on the target models.

A. GENERATIVE MODELS
The idea of "purify once, use anywhere" makes adversarial purification desirable in practice for pre-processing datasets ahead of model learning. The core assumption behind such research, represented by DefenseGAN [3], PixelDefend [2], and DiffPure [9], is that adversarial samples can be interpreted as outliers in the learned representation of the posterior data distribution, which does not rely on the attacker model and often resembles across image classification convolutional neural nets. By training generative models independently from both threat models and classifiers so that they simply learn to sample or restore clean images from the posterior distribution, adversarial noise can be suppressed to non-visually recognizable, irrelevant information. However, the mode collapse in generative adversarial networks (GANs) [10] and the high computational cost in autoregressive network layers [11], Langevin dynamics [12] and diffusion process [9] have diminished the feasibility of applying such generative models at dataset scale.
Variational autoencoders (VAEs) such as PuVAE [13], Gaussian mixture VAEs [14], and analysis-by-synthesis models [15] have been proposed for purifying images by mapping adversarial examples to the latent space of legitimate images after learning class-specific data distributions; they typically incur a lower computation workload than pixel-level generative models at the cost of quality loss. AAEs [16], to the best of our knowledge, have yet to become the cornerstone of any state-of-the-art image purifier model. AAEs blend the autoencoder architecture with the adversarial loss concept introduced by the GAN; the idea is similar to that of the VAE, except that an adversarial loss is used to regularize the latent code instead of a KL divergence, which enables a specific latent data representation to be imposed without a closed form.
Our proposed model extends generative architectures with conditional flows because neither generative nor autoencoder-based adversarial purifiers are readily transferable when the target image dataset changes. Existing methods would restructure and assemble a different model from scratch because the models are structurally dependent on the target dataset. For example, supervised classification information (the class labels of all images) is often used as side input to guide the learning process [13], [17], and this information can vary drastically across different types of image datasets. In other approaches [14], [15], the training set-up directly depends on the data classes, even developing into per-class model instances. Whether such a model's performance would remain consistent across various dataset structures, including the number of data classes and the label distribution, remains an open question.

B. FLOW-BASED MODELS
Normalizing flows have empowered a distinct kind of competitive generative model, usually referred to as flow-based models [18]. Flow-based generative adversarial networks such as Flow-GANs [19] perform exact likelihood evaluation by coupling adversarial and maximum likelihood training through normalizing flows, which can be used to generate conditional synthetic samples for a domain without requiring its labels [20]. Flow-based generative models are considered powerful exact-likelihood models with efficient sampling and inference; however, they generally have much worse density modeling performance than autoregressive models [21], despite their computational efficiency. Flow-based methods have been used to construct more complex posterior distributions in learned models [22] and have even been integrated into flow-based variational autoencoders (f-VAEs) [23]. Studies [24], [25] have shown that combining normalizing flows with an underlying variational autoencoder counteracts the heavy restrictions imposed on flow-based model architectures, i.e., the need to be excessively deep and wide in order to ensure a calculable determinant of the model Jacobian.

III. BACKGROUND

A. ADVERSARIAL AUTOENCODERS
Let $q_\phi(z|x)$ be both the probabilistic encoding model in an autoencoder framework and the generative model in an adversarial framework. A new discriminative model $d_\chi(z)$ is introduced to distinguish between latent samples drawn from $p(z)$ and $q_\phi(z|x)$. The cost function used to train discriminator $d_\chi(z)$ is

$$\mathcal{L}_d = -\frac{1}{N}\sum_{i=1}^{N}\left[\log d_\chi(z_i) + \log\big(1 - d_\chi(\tilde{z}_i)\big)\right],\qquad z_i \sim p(z),\; \tilde{z}_i \sim q_\phi(z \mid x_i),$$

where $N$ is the dataset size. Adversarial training is used to match $q_\phi(z|x)$ to an arbitrarily chosen prior $p(z)$, where $q_\phi(z|x)$ is often implemented as a neural network with input $x$ and output $z$. Unlike VAEs, for which the structure of $q_\phi(z|x)$ is usually limited to a multivariate Gaussian, more arbitrary complexity can be supported in AAEs. Because an adversary is used to match the prior without computing a closed-form KL divergence, the posterior is free of any analytic definition. It has been shown [16] that AAEs are able to match $q_\phi(z|x)$ to multivariate priors $p(z)$ including a mixture of ten two-dimensional Gaussian distributions.
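For concreteness, the following is a minimal PyTorch sketch of this adversarial regularization; the network sizes, the standard Gaussian prior, and the binary cross-entropy form of the discriminator cost are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

latent_dim = 32

encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                        nn.Linear(256, latent_dim))           # q_phi(z|x)
discriminator = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                              nn.Linear(128, 1))              # d_chi(z), logits

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(x):
    """Cross-entropy cost for d_chi: 'real' = prior samples, 'fake' = encodings."""
    z_fake = encoder(x).detach()                  # q_phi(z|x), stop-gradient
    z_real = torch.randn_like(z_fake)             # z ~ p(z), standard Gaussian prior
    real_logits = discriminator(z_real)
    fake_logits = discriminator(z_fake)
    return (bce(real_logits, torch.ones_like(real_logits)) +
            bce(fake_logits, torch.zeros_like(fake_logits)))

def encoder_adversarial_loss(x):
    """Generator-side loss: push q_phi(z|x) to fool d_chi so it matches the prior."""
    fake_logits = discriminator(encoder(x))
    return bce(fake_logits, torch.ones_like(fake_logits))

x = torch.rand(16, 1, 28, 28)                     # dummy batch
print(discriminator_loss(x).item(), encoder_adversarial_loss(x).item())
```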

B. NORMALIZING FLOWS
Flow-based models define $q(x|z) = \delta(x - G(z))$ and map observed data to a standard Gaussian latent variable by using a specially designed $G(z)$ that stacks simple invertible transformations, of which the key component is the coupling layer: split $x$ into two concatenated parts $x_1, x_2$ and let

$$y_1 = x_1, \qquad y_2 = s(x_1) \odot x_2 + t(x_1),$$

for which the Jacobian determinant is $\prod_i s(x_1)_i$, and the inverse is obviously tractable. Such a transformation is typically referred to as affine coupling (or additive coupling if $s(x_1) \equiv 1$) and is denoted as $g$. Successive coupling layers can be concatenated to construct an invertible yet sufficiently complex function $G = g_1 \circ g_2 \circ \cdots \circ g_L$, which is referred to as an (unconditioned) flow. Hence, flow-based models are trained by directly maximizing the log-likelihood

$$\log p(x) = \log p\!\left(G^{-1}(x)\right) + \log\left|\det \frac{\partial G^{-1}(x)}{\partial x}\right|,$$

which is easy to differentiate with respect to the parameters of the flows. Consequently, sampling from flows is efficient: simply draw $z \sim \mathcal{N}(0, I)$ and calculate $x = G(z)$. Practically speaking, flow-based models can grow heavyweight because the stacks are often chained to remarkable depth.
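The coupling mechanics can be made concrete with a short sketch; the layer widths and the tanh-stabilized log-scales below are common implementation choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One affine coupling layer: y1 = x1, y2 = s(x1) * x2 + t(x1)."""
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (dim - self.half)))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)                 # keep scales well-conditioned
        y2 = x2 * torch.exp(log_s) + t
        log_det = log_s.sum(dim=1)                # log|det J| = sum_i log s_i
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(y1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        x2 = (y2 - t) * torch.exp(-log_s)         # exactly invertible
        return torch.cat([y1, x2], dim=1)

# Round-trip check on a dummy batch.
layer = AffineCoupling(dim=8)
x = torch.randn(4, 8)
y, log_det = layer(x)
print(torch.allclose(layer.inverse(y), x, atol=1e-5))  # True
```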

IV. PROPOSED METHOD
We have devised an autoencoder-based defense method for purifying adversarial samples in heterogeneous image datasets. Let $X_{\mathrm{DATA}}$ be a dataset containing data instances $x_{\mathrm{DATA}} \in \mathbb{R}^d$, where $d$ denotes the dimension of the data space. Dataset class labels are denoted by $y_{\mathrm{DATA}} \in \mathbb{R}^c$ in a set of classes $C$, where $c$ is the number of classes. Let target classifier $A$ be the model exposed to an attacker. We assume a set $X_{\mathrm{ADV}}$ consisting of adversarial examples $x_{\mathrm{ADV}} \in \mathbb{R}^d$ is created from the target classifier model. We define a set $X$ consisting of clean samples and adversarial samples; instances $x$ sampled from this set are used at inference time. In the following sections, we explain the processes of training and generating purified samples using our approach. Fig. 2 gives an overview of our model architecture and explains how the functionally different components are jointly optimized. Our purifier model is built upon a convolutional encoder and decoder structure with successive U-Net layers, where the encoder receives a data-label pair and projects the data point to the latent distribution $q_\phi(z|x)$ corresponding to the class-label-dependent normalizing flows.

A. MODEL ARCHITECTURE
The normalizing flows are conditioned in juxtaposition with the convolutional encoder to jointly learn an adaptive posterior distribution in the latent space, so we do not have to assign a dataset-specific, predetermined representation prior to model training. This greatly helps the model generalize to heterogeneous unlabeled datasets and improves the classification and quality metrics when the model is applied across large datasets, as demonstrated in Sections V-B and V-C.
Latent vector $z$ in the latent space is sampled using the learned parameters obtained from the encoder:

$$z = G_\theta(z_0), \qquad z_0 \sim q_\phi(z_0 \mid x),$$

where $G_\theta$ denotes the conditional normalizing flow, and the stochasticity in $q(z)$ comes from both the data distribution and the randomness of the normalizing flow at the output of the encoder. The selective nature of convolutional neural networks, which use pooling and strides to widen the receptive field, is a disadvantage for generative models because the feature selection causes information loss. We use a U-Net [26] structure with successive layers to connect the encoder convolutional features to the decoder deconvolutional features [27], thereby preserving the mid- and high-frequency details of the original image content while imposing AAE noise reduction in latent vector $z$ through the normalizing flow transformation. The sampled $z$ enters the decoder, which produces an output instance with the same dimension $d$ as the input.
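To make the bypass concrete, here is a minimal PyTorch sketch of an encoder/decoder with one U-Net style skip connection; the layer counts, channel widths, and the omission of the flow step at the bottleneck are simplifying assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SkipAutoencoder(nn.Module):
    """Illustrative encoder/decoder with one U-Net style skip connection
    carrying mid/high-frequency features past the latent bottleneck."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU())
        # Decoder input channels doubled: upsampled features + skip features.
        self.dec1 = nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1)

    def forward(self, x):
        f1 = self.enc1(x)                         # skip features (detail-preserving)
        z = self.enc2(f1)                         # bottleneck; the flow would act here
        up = self.dec2(z)
        return self.dec1(torch.cat([up, f1], dim=1))

x = torch.rand(2, 3, 32, 32)
print(SkipAutoencoder()(x).shape)                 # torch.Size([2, 3, 32, 32])
```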
Makhzani et al. [16] demonstrated that AAEs can impose complicated distributions without access to the explicit functional form of the distribution. During training, the AAE in our model is trained to minimize

$$\mathcal{L}_g = -\frac{1}{N}\sum_{i=1}^{N} \log d_\chi\!\left(q_\phi(z \mid x_i)\right),$$

so that the aggregated posterior of the hidden representation vector matches the prior distribution transformed by the normalizing flow.
In a flow-based adversarial autoencoder, the likelihood is well-defined and computationally tractable, enabling exact evaluation of expressive transformations. Therefore, our model can be jointly trained via (6) and (4), in which case the discriminator is redundant. On top of this, we are able to perform ancestral sampling similarly to an ordinary generative adversarial network: we sample a random vector $z \sim p(z)$ and transform it into a model-generated sample via $G_\theta = g_\theta^{-1}$. This makes it feasible to learn a flow-based latent representation using an adversarial learning objective.
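Ancestral sampling reduces to a few lines; `flow_inverse` and `decoder` below are hypothetical stand-ins for the trained flow stack and decoder, shown here with identity dummies only to keep the sketch runnable.

```python
import torch

# Draw z ~ p(z), pull it back through the flow (G_theta = g_theta^{-1}),
# then decode to image space.
def sample_images(flow_inverse, decoder, n, latent_dim):
    z = torch.randn(n, latent_dim)       # z ~ p(z), standard Gaussian
    z_model = flow_inverse(z)            # invert the coupling stack
    return decoder(z_model)              # decode to a model-generated sample

imgs = sample_images(lambda z: z, torch.nn.Identity(), n=4, latent_dim=16)
print(imgs.shape)                        # torch.Size([4, 16])
```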

B. MODEL TRAINING AND INFERENCE
Bengio et al. [28] studied sampling in high-dimensional spaces (like the model inference process in the following section) under a local corruption process (in our case, adversarial noise) and found that autoencoder behavior far from the training examples can create spurious modes in regions insufficiently visited during training. We obtained a generalization boost on unseen samples by training the manifold projection process to "walk back" [28] towards the training samples. We exploit knowledge of the currently learned model $P(\tilde{x}|x)$ to define adversarial corruption, picking values of $\tilde{x}$ that would be obtained by following the generative chain, i.e., by letting the model proceed as if we had sampled starting at training example $x$. We consider $\tilde{x}$ a negative example from which the autoencoder should move away and towards $x$. Such training helps the model generalize to unseen samples in the dataset because it forces the autoencoder to learn to walk back from the random walk it generates, towards the $x$'s in the training set.
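A hedged sketch of one walk-back training step follows, assuming a simple Gaussian corruption along the model's own generative chain and an MSE reconstruction loss; the number of chain steps, the noise scale, and the toy model are illustrative choices, not the paper's settings.

```python
import torch

def walkback_batch(model, x, steps=2, noise_std=0.1):
    """One walk-back round [28]: run the model's own generative chain from a
    training example to produce a corrupted negative x_tilde, then train the
    autoencoder to map it back to x."""
    x_tilde = x
    with torch.no_grad():
        for _ in range(steps):                    # follow the generative chain
            x_tilde = model(x_tilde + noise_std * torch.randn_like(x_tilde))
    recon = model(x_tilde)                        # walk back towards x
    return torch.nn.functional.mse_loss(recon, x)

# Dummy usage with a toy autoencoder stand-in.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 784),
                            torch.nn.Unflatten(1, (1, 28, 28)))
x = torch.rand(8, 1, 28, 28)
print(walkback_batch(model, x).item())
```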
An autoencoder network structure integrated with the normalizing flows is applied at model inference time. Because the purification model learns only the distribution of $z$ from the training dataset, input data containing adversarial noise is mapped to the learned latent space, and the noise is cleaned through projection to the latent variable.
Our model is assembled under the assumption that the inference input samples come from the same data distribution on which the model was trained, which enables the model to adaptively learn the latent distribution specific to the target dataset. Moreover, our model does not require any structural change when applied to different datasets: it can be retrained for heterogeneous datasets and can purify image samples from datasets with various class structures on demand.
Our model framework supports the integration of either a VAE or an AAE for the data sampling and transformation in the latent space. The difference between using variational autoencoders such as PuVAE [13] and AAEs for image purification is that, in order to back-propagate through the KL divergence by Monte-Carlo sampling, VAEs require access to the exact functional form of the prior distribution. In contrast, with an AAE we only need to be able to sample from the normalizing flow outputs to induce $q(z)$ to match $p(z)$. We tested both in our framework and selected the AAE for our evaluation because it showed more consistent performance under our testing scenario and evaluation settings.
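The practical difference fits in a few lines: the VAE regularizer needs the prior's closed form (here a standard Gaussian KL), while the AAE-style regularizer only needs samples. Shapes and names below are illustrative.

```python
import torch

mu, log_var = torch.zeros(16, 32), torch.zeros(16, 32)   # encoder outputs (dummy)

# VAE: a closed-form KL divergence to a standard Gaussian prior is required.
kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1).mean()

# AAE: only *samples* from the (flow-transformed) prior are needed; the match
# is enforced by a discriminator rather than by an analytic divergence.
z_prior = torch.randn(16, 32)            # stand-in for samples through the flow
z_post = mu + torch.randn_like(mu) * (0.5 * log_var).exp()
print(kl.item(), z_prior.shape, z_post.shape)
```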

V. EVALUATION
In this section, we present experimental results for our model compared with state-of-the-art adversarial purification methods: PixelDefend [2], Defense-GAN [3], PuVAE [13], and DiffPure [9]. The experimental settings and ablation studies on the posterior learned by the normalizing flows and on the U-Net structure trade-off are also presented in the following sections.

A. EXPERIMENT SETTINGS
We compared our proposed method with the others on three datasets: CelebA-HQ [29], CIFAR-100 [30], and ImageNet [31], with widely used architectures as classifiers: ResNet [32], WideResNet [33], and VGG [34]. We used the AutoAttack $\ell_2$ and $\ell_\infty$ threat models [35], which contain an ensemble of adversarial attacks with the recommended hyper-parameters in the framework, as listed in Table 1. We also evaluated the performance under the iFGSM [36], Carlini-Wagner (C&W) [37], and DeepFool [38] attacks. We masked the data class labels in all training datasets for a fair comparison between our method and the others.
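For reference, the AutoAttack ensemble can be invoked roughly as follows; the epsilon, batch size, and dummy classifier here are placeholders rather than the exact Table 1 hyper-parameters.

```python
import torch
import torch.nn as nn
from autoattack import AutoAttack  # https://github.com/fra31/auto-attack

# Dummy stand-ins for the evaluated classifier and test data.
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x_test = torch.rand(8, 3, 32, 32)
y_test = torch.randint(0, 10, (8,))

# Standard AutoAttack ensemble under the l_inf threat model.
adversary = AutoAttack(classifier, norm='Linf', eps=8 / 255,
                       version='standard', device='cpu')
x_adv = adversary.run_standard_evaluation(x_test, y_test, bs=8)
```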
We used two kinds of classifier accuracy as performance metrics: standard accuracy and robust accuracy. Standard accuracy, the classifier accuracy on purified clean images, measures how well the defense method preserves clean images; robust accuracy measures the defense performance on adversarial samples. We also measured the perceived quality of purified images by the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM). Due to the demanding computational cost, we evaluated on a fixed subset of 512 images randomly sampled from each test set, which yields no significant difference in results compared with the complete test set [9].
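A small sketch of the quality evaluation, assuming images as HxWxC float arrays in [0, 1] and using scikit-image's metric implementations; the averaging over per-image scores is our assumption.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def quality_metrics(originals, purified):
    """Mean PSNR/SSIM between clean originals and purified adversarial images."""
    psnr = [peak_signal_noise_ratio(o, p, data_range=1.0)
            for o, p in zip(originals, purified)]
    ssim = [structural_similarity(o, p, channel_axis=-1, data_range=1.0)
            for o, p in zip(originals, purified)]
    return float(np.mean(psnr)), float(np.mean(ssim))

a = np.random.rand(4, 32, 32, 3)                          # dummy originals
b = np.clip(a + 0.02 * np.random.randn(*a.shape), 0, 1)   # dummy purified images
print(quality_metrics(a, b))
```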

B. COMPARISON WITH THE STATE-OF-THE-ART
As shown in Table 1, our model slightly outperformed the purifiers based on generative adversarial models and autoregressive generative models on facial and cross-domain image datasets for the ensemble of adversarial attacks provided by the AutoAttack $\ell_2$ and $\ell_\infty$ threat model benchmarks.
Two groups of representative examples produced by our proposed method are displayed in Fig. 3. Our model achieved performance comparable to the best of the other methods, although the metric results reveal that performance differed by data class structure. Under certain parameter combinations, the diffusion model attained higher accuracy in purifying homogeneous data, such as human face images. For such datasets with few superclasses, most latent transformation-based purifiers use a multivariate Gaussian as the prior, with settings sometimes additionally tailored to the data class labels. However, similar approaches do not work well in practice for large datasets consisting of numerous, unstructured, and unlabeled data classes. Our self-supervised approach learns the optimal posterior to exploit the class distribution information as fully as possible, so that samples are represented in the latent space in a way that restores the classification knowledge while the adversarial noise is removed. The evaluation results show that our model produced the highest purification accuracy on the dataset with the most data classes (ImageNet).
Performance under each attack method is shown in Table 2. The results indicate that the favorable performance of our method is not biased toward any particular attack method. In addition to classification accuracy, PSNR and SSIM are used to measure the perceived quality of the purified images; both metrics are calculated between original images and purified adversarial images. Our method obtained the highest scores in classification and image quality metrics for most combinations of classifier architecture and attack method.

C. ABLATION STUDY ON THE POSTERIOR
Normalizing flows are built into the autoencoder structure of our model to represent the inherent dataset class characteristics of the latent data distribution. We conducted an ablation study on the normalizing flow part of our model and observed the latent data distribution under different conditions.
We ran the ablation study on the MNIST [39] handwritten digits dataset, which contains ten classes of digits 0-9. We examined the digit image samples in the latent space by running k-means clustering on the projected data samples, with k set to 10, and obtained a silhouette score of 0.56. We compared the results between our proposed method and a baseline using a default normal distribution (without normalizing flows). Fig. 4 shows that the learned latent distribution clearly reflected the data class characteristics when the normalizing flows were used, despite the fact that no labels were used in training. However, the projected samples in the baseline latent space could barely be distinguished into meaningful class representations (note that the distance scale differs between the radar graphs). To illustrate the difference in latent class representation, we computed the average Euclidean distance between all latent sample pairs in the top 30 (by image count) data classes in the CIFAR dataset. As shown in Fig. 5, the average distance for intra-class sample pairs in the normalizing flow latent space was noticeably lower than that for inter-class sample pairs. This was not the case in the baseline space, where there was no statistical difference by sample class. This demonstrates that integrating normalizing flows into our AAE model improved the latent transformation by enabling the model to infer class information from datasets and use it to purify adversarial noise, leveraging the estimated posterior to balance quality loss against noisy component removal.
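The intra-/inter-class distance comparison can be reproduced in a few lines; the snippet below is an illustrative re-implementation on dummy latents and labels, not the paper's evaluation code.

```python
import numpy as np

def class_distance_gap(latents, labels):
    """Average pairwise Euclidean distance within vs. across classes."""
    dists = np.linalg.norm(latents[:, None, :] - latents[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    intra = dists[same & off_diag].mean()    # same-class pairs, excluding self
    inter = dists[~same].mean()              # cross-class pairs
    return intra, inter

# With flow-conditioned latents we would expect intra < inter; in the
# Gaussian baseline the two averages were statistically indistinguishable.
z = np.random.randn(200, 16)                 # dummy latents
y = np.random.randint(0, 10, size=200)       # dummy class labels
print(class_distance_gap(z, y))
```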

D. ABLATION STUDY ON THE RESOLUTION FEATURES
We conducted another ablation study on the resolution features that bypass the latent posterior from encoder to decoder through the U-Net layers. The hypothesis is that the features learned by the convolutional layers contribute both to reconstructing image details and to hiding adversarial signals.
The parallel data flows of the autoencoder and the U-Net structure in our model architecture support a trade-off between output visual quality (human perception) and classification accuracy (machine perception). The trade-off can be configured by the dropout regularization parameter in the U-Net successive layers. As shown in Fig. 6, our experiment empirically verified that when the regularization parameter is increased, more of both the image details and the adversarial components reach the decoder; therefore, the reconstruction pixel loss decreases while the classification accuracy also decreases. Such monotonic behavior appeared persistently across different datasets.
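A minimal sketch of how such a knob could be wired in, assuming dropout applied to the skip features before they are concatenated into the decoder; the parameter name, placement, and its direction (drop probability vs. keep rate) are assumptions, not the paper's exact code.

```python
import torch
import torch.nn as nn

# Dropout on the U-Net skip path gates how much raw image detail (and, with
# it, residual adversarial signal) bypasses the latent bottleneck.
skip_gate = nn.Dropout2d(p=0.3)

features = torch.rand(2, 32, 16, 16)     # skip features from the encoder
gated = skip_gate(features)              # what gets concatenated in the decoder
print(gated.shape)                       # torch.Size([2, 32, 16, 16])
```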

VI. CONCLUSION
Our proposed adversarial autoencoder-based purification method is uniquely built on a latent representation learned by normalizing flows and features a model architecture with dedicated design elements. It can clean loosely defined large image datasets without the model having to be taught the dataset class specifics and without rebuilding the network structure to match the target dataset. Our experiments demonstrated that structural information about the target dataset can be used to improve the trade-off between image reconstruction quality and adversarial defense accuracy.
Future work includes extending our methodology to other related information security problems such as defending against backdoor attacks in image datasets, improving the integration of various adversarial purifier architectures such as VAEs with conditional flows, and designing a more generic model for different data modalities.