Analysis of Dominant Classes in Universal Adversarial Perturbations

The reasons why Deep Neural Networks are susceptible to being fooled by adversarial examples remain an open discussion. Indeed, many different strategies can be employed to efficiently generate adversarial attacks, some of them relying on different theoretical justifications. Among these strategies, universal (input-agnostic) perturbations are of particular interest, due to their capability to fool a network independently of the input to which the perturbation is applied. In this work, we investigate an intriguing phenomenon of universal perturbations, which has been reported previously in the literature, yet without a proven justification: universal perturbations change the predicted classes for most inputs into one particular (dominant) class, even if this behavior is not specified during the creation of the perturbation. To explain the cause of this phenomenon, we propose a number of hypotheses and experimentally test them using a speech command classification problem in the audio domain as a testbed. Our analyses reveal interesting properties of universal perturbations, suggest new methods to generate such attacks and provide an explanation of dominant classes, from both a geometric and a data-feature perspective.


Introduction
Universal adversarial perturbations [1] are input-agnostic perturbations capable of fooling a Deep Neural Network (DNN) while remaining imperceptible to humans. These perturbations are generally created as untargeted attacks, so that no preference over the (incorrect) output class is assumed [1,2,3,4]. However, previous work [1,5,6,7] has reported a phenomenon regarding the effect of universal perturbations on the attacked model: the tendency of the perturbation to change the class of the inputs into a particular dominant class, without this being specified or imposed in the generation of the perturbation. Thus, some classes (or class regions in the decision space) act as attractors under the effect of universal perturbations.
In this paper, we analyze this phenomenon with the aim of shedding light on the (still poorly understood) vulnerability of DNNs to universal perturbations. The main contributions of our paper are the following:
• First, we propose a number of hypotheses to explain and characterize the existence of dominant classes linked to universal adversarial perturbations, and revisit previous hypotheses and open questions in the related work.
• We experimentally test the proposed hypotheses using a speech command classification task in the audio domain as a testbed. To the best of our knowledge, this is the first work in which dominant classes are analyzed in the audio domain. Apart from providing evidence of the validity of the proposed hypotheses, our results reveal interesting properties of the sensitivity of DNNs to different types of perturbations.
• Finally, we highlight a number of differences between the image domain and the audio domain regarding the analysis of adversarial examples, contributing to a more general understanding of adversarial machine learning.

Related work
Universal adversarial perturbations for DNNs were introduced in [1] for image classification tasks.
The goal of such perturbations is to fool a DNN for "most" natural inputs to which they are applied while, at the same time, remaining imperceptible to humans. Formally, following the notation used in [8], let $\mu$ be the distribution of natural inputs in the $d$-dimensional input space $\mathbb{R}^d$, and let $f(x)$ be the output class assigned to an input $x$ by a classifier $f : \mathbb{R}^d \to \{y_1, \dots, y_k\}$. A perturbation $v$ is said to be $(\xi, \delta)$-universal if the following conditions are satisfied:
$$\|v\|_p \leq \xi,$$
$$\mathbb{P}_{x \sim \mu}\big(f(x + v) \neq f(x)\big) \geq 1 - \delta.$$
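The two conditions above can be checked empirically on a finite sample drawn from $\mu$, replacing the probability with an empirical frequency. The following Python sketch assumes a generic `predict` callable that maps a batch of inputs (one per row) to class indices; `fooling_rate` and `is_universal` are hypothetical helper names, not code from [1] or [8], and the toy linear classifier only serves to make the snippet runnable.

```python
import numpy as np


def fooling_rate(predict, X, v):
    """Empirical estimate of P(f(x + v) != f(x)) over the rows of X."""
    return float(np.mean(predict(X + v) != predict(X)))


def is_universal(predict, X, v, xi, delta, p=2):
    """Check both (xi, delta)-universality conditions on the finite sample X."""
    norm_ok = np.linalg.norm(np.ravel(v), ord=p) <= xi      # ||v||_p <= xi
    rate_ok = fooling_rate(predict, X, v) >= 1 - delta      # fooling rate >= 1 - delta
    return bool(norm_ok and rate_ok)


# Toy usage with a hypothetical 2-class linear classifier.
W = np.eye(2)
predict = lambda X: np.argmax(X @ W.T, axis=1)
X = np.array([[1.0, 0.9], [0.8, 0.5], [0.2, 1.0]])
v = np.array([-0.2, 0.2])
print(fooling_rate(predict, X, v))                    # 2 of the 3 predictions change
print(is_universal(predict, X, v, xi=0.3, delta=0.4))  # True
```

With only a sample, the second condition is of course an estimate; its reliability depends on how many inputs are drawn from $\mu$.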
The discovery of such attacks for state-of-the-art DNNs has led to a deeper study of their properties.
In [1], the vulnerability of DNNs to universal perturbations is studied empirically in the image domain and is attributed in part to the geometry of the decision boundaries learned by the DNNs. In particular, it is shown that, in the vicinity of natural inputs, perturbations normal to the decision boundaries are correlated, in the sense that they approximately span a low-dimensional subspace (compared to the dimensionality of the input space). Thus, denoting by $v_x$ the minimal perturbation capable of changing the output for an input $x$ (hence normal to the decision boundary at $x + v_x$), it is possible to find a subspace $S \subset X$, with $\dim(S) \ll \dim(X)$, such that $v_x \in S$ for $x \sim \mu$. The existence of such a subspace implies that even random perturbations with small norms sampled from $S$ are likely to cause a misclassification for a large number of inputs [1]. This hypothesis is further developed in [8], also for the image domain, where the vulnerability of classifiers to universal perturbations is formalized under the assumption of locally linear decision boundaries in the vicinity of natural inputs. An illustration of a linear approximation of the decision boundary is shown in Figure A.1. However, the assumption of locally linear decision boundaries is insufficient to comprehensively formalize the vulnerability of DNNs to universal perturbations. Indeed, there is a crucial connection between that vulnerability and the curvature of the decision boundaries [8]: there exist common perturbation directions in the input space (i.e., directions that span a low-dimensional subspace) along which, starting from natural inputs, the decision boundaries are positively curved. See Figure A.1 (right) for a comparison between a positively curved and a negatively curved boundary. The positive curvature of the decision boundaries implies small upper bounds on the amount of perturbation required to cross them, as depicted in Figure A.1 (right).
Thus, those positive curvatures increase the vulnerability of DNNs, as smaller perturbations are required to fool the model. At the same time, the fact that those directions are also common for multiple inputs implies the existence of small input-agnostic adversarial perturbations.
In a further analysis developed in [9], it is shown that the directions in the input space along which the decision boundaries are highly curved are indeed associated by the DNN with class identities (the further we move along one such direction, the higher, or lower, the confidence of the model in one particular class). Moreover, it is shown that the class features associated with such directions are the most relevant ones as far as the classification performance of the model is concerned, which links the accuracy of DNNs with their vulnerability to adversarial attacks.
The aforementioned theoretical frameworks focus, in particular, on the vulnerability to universal perturbations. In this paper, we focus instead on one particular property of universal perturbations: the existence of dominant classes that are significantly more frequently predicted for the perturbed (and misclassified) inputs. This phenomenon was first reported in [1] for image classification tasks. Subsequent works have also reported the existence of dominant classes in image classification tasks [6,5] and in text classification tasks [7]. In this paper, we show that this also happens in other domains, such as speech command classification in the audio domain. Although it is hypothesized in [1] that a possible explanation for dominant classes is that they occupy a larger region in the decision space, this is left as an open research question. In this paper, we tackle this research question and test multiple hypotheses in the search for a deeper understanding of this phenomenon.
Outside the particular field of universal perturbations, multiple theoretical frameworks have been proposed to explain adversarial examples. Whereas most of them focus on the properties of DNNs [10,11,12,13], alternative explanations have also been proposed. In this paper, special attention is paid to the one introduced in [14], in which adversarial examples are explained in terms of the robustness of the features in the data. In particular, it is shown that datasets contain non-robust features which, although highly discriminative (i.e., the data is well described by these features), become uncorrelated with the ground-truth classes when they are perturbed by small (adversarial) perturbations. Thus, when a classifier learns to rely on such non-robust features to accurately classify the data, it becomes vulnerable to adversarial perturbations. The low robustness of such features to small perturbations also implies their lack of meaning for humans, which explains the imperceptibility of the attacks. In our paper (Section 5.2), we hypothesize that the higher sensitivity of the model to certain features might explain the existence of dominant classes.

Proposed Framework
Let us consider a machine learning model $f : X \to Y$, with $X \subseteq \mathbb{R}^d$ and $Y = \{y_1, \dots, y_k\}$, trained to classify inputs $x \in X$ drawn from a data distribution $\mu$ into one of the $k$ possible classes in $Y$. To formally describe dominant classes, let us denote by $p^v_j$ the probability of misclassifying an input as the class $y_j$ when a universal perturbation $v$ is added to it. Similarly, let $t^v_{i,j}$ represent the probability that, starting from an input of ground-truth class $y_i$, the model incorrectly predicts the class $y_j$ for the perturbed input. In practice, if the distribution $\mu$ is unknown, these probabilities can be estimated using a finite set of input samples $\mathcal{X}$.
Definition 1. $y_a$ is an attractor class for another class $y_i$ ($i \neq a$) under a perturbation $v$, denoted as $y_i \xrightarrow{v} y_a$, if at least a proportion $\alpha > \frac{1}{k-1}$ of the inputs of class $y_i$ are predicted as $y_a$ when they are perturbed with $v$:
$$y_i \xrightarrow{v} y_a \iff t^v_{i,a} \geq \alpha.$$
Definition 2. $y_b$ is a dominant class for the universal perturbation $v$ if at least a proportion $\beta$ of the inputs are wrongly classified as $y_b$ when they are perturbed with $v$:
$$p^v_b \geq \beta.$$
Alternatively, $y_b$ can also be defined in terms of the number of classes that it attracts. Precisely, $y_b$ is dominant if it is an attractor class for at least a proportion $\zeta > \frac{1}{k-1}$ of the remaining classes:
$$\frac{\left|\{\, y_i \in Y \setminus \{y_b\} : y_i \xrightarrow{v} y_b \,\}\right|}{k-1} \geq \zeta.$$
The choice of the parameters $\alpha$, $\beta$ and $\zeta$ determines whether multiple attractor and dominant classes can exist. In this paper, we assume $\alpha, \beta, \zeta \geq \frac{1}{3}$, since we are interested in classes that are incorrectly predicted for a significant proportion of inputs, or that attract a significant proportion of other classes.
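As a minimal sketch of how these definitions can be applied to a finite sample, the Python code below estimates $p^v_j$ and $t^v_{i,j}$ from three label arrays (ground truth, clean prediction, perturbed prediction) and then tests Definitions 1 and 2. One assumption must be made explicit: following the protocol used for Figure 1 in Section 4, frequencies are normalized over the inputs that were initially correctly classified and then fooled; normalizing over all inputs is another valid reading of the definitions. All function names are hypothetical.

```python
import numpy as np


def misclassification_frequencies(y_true, y_clean, y_adv, k):
    """Empirical p^v_j (vector) and t^v_{i,j} (matrix), normalized over fooled inputs."""
    y_true, y_clean, y_adv = (np.asarray(a, dtype=int) for a in (y_true, y_clean, y_adv))
    # Inputs correctly classified before the attack and misclassified after it.
    fooled = (y_clean == y_true) & (y_adv != y_true)
    p = np.bincount(y_adv[fooled], minlength=k) / max(int(fooled.sum()), 1)
    t = np.zeros((k, k))
    for i in range(k):
        rows = fooled & (y_true == i)
        if rows.any():
            t[i] = np.bincount(y_adv[rows], minlength=k) / rows.sum()
    return p, t


def attractor_pairs(t, alpha):
    """All pairs (i, a), i != a, such that y_i -> y_a, i.e. t[i, a] >= alpha."""
    k = t.shape[0]
    return {(i, a) for i in range(k) for a in range(k) if i != a and t[i, a] >= alpha}


def dominant_classes(p, t, alpha, beta, zeta):
    """Dominant classes under Definition 2 (via beta) and its alternative form (via zeta)."""
    k = len(p)
    by_beta = {b for b in range(k) if p[b] >= beta}
    pairs = attractor_pairs(t, alpha)
    by_zeta = {b for b in range(k)
               if sum((i, b) in pairs for i in range(k) if i != b) / (k - 1) >= zeta}
    return by_beta, by_zeta


y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
y_adv = [2, 2, 0, 2, 2, 1, 0, 2, 2]
p, t = misclassification_frequencies(y_true, y_true, y_adv, k=3)
print(dominant_classes(p, t, 1 / 3, 1 / 3, 1 / 3))  # ({2}, {0, 2})
```

The toy output illustrates that the two forms of Definition 2 need not agree: class 0 attracts half of the remaining classes, but receives only a small share of all fooled inputs.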
To explain the relationship between universal perturbations and dominant classes, we use a speech command classification problem in the audio domain as a testbed. We selected the Speech Commands Dataset [15], in which the underlying task consists of classifying fixed-length audio signals into one of the following classes: silence, unknown, yes, no, up, down, left, right, on, off, stop and go.
We trained a convolutional neural network as a classifier, based on the architecture proposed in [16], which is composed of two convolutional layers with ReLU activations, a fully connected layer and a final softmax layer. This architecture has been used in a number of related works [15,17,18,19]. The audio waveforms in the time domain, which belong to the input space $\mathbb{R}^{16000}$ and take values in the range [−1, 1], are first converted into spectrograms by dividing the audio signals into frames of 20 ms with a stride of 10 ms, and applying the real-valued fast Fourier transform (with 512 points) to each frame. As the frequency spectrum of a real signal is Hermitian symmetric, only the first 257 components are retained, so the resulting spectrogram has dimension 99 × 257. Finally, the Mel-Frequency Cepstral Coefficients (MFCCs) [20] are extracted from the spectrogram, in the space $\mathbb{R}^{99 \times 40}$, before being fed to the network. It is worth pointing out that the adversarial perturbations generated for this model are optimized end-to-end, directly in the audio waveform representation of the signal.
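The framing arithmetic above can be checked with a short NumPy sketch, assuming a 16 kHz sampling rate (consistent with the 16000-dimensional input space): 20 ms frames with a 10 ms stride give 1 + (16000 − 320)/160 = 99 frames, and a 512-point real FFT yields 257 bins per frame. The Hann window and the use of magnitudes are assumptions, since the text does not specify them, and the MFCC step (mel filterbank, logarithm and DCT) is omitted.

```python
import numpy as np

SAMPLE_RATE = 16_000                     # assumed: 16000 samples = 1 s of audio
FRAME_LEN = SAMPLE_RATE * 20 // 1000     # 20 ms -> 320 samples
FRAME_STEP = SAMPLE_RATE * 10 // 1000    # 10 ms -> 160 samples
FFT_SIZE = 512                           # 512-point real FFT -> 257 bins


def spectrogram(wave):
    """Magnitude spectrogram of shape (n_frames, FFT_SIZE // 2 + 1)."""
    wave = np.asarray(wave, dtype=np.float64)
    n_frames = 1 + (len(wave) - FRAME_LEN) // FRAME_STEP
    # Index matrix: row t holds the sample indices of frame t.
    idx = FRAME_STEP * np.arange(n_frames)[:, None] + np.arange(FRAME_LEN)[None, :]
    frames = wave[idx] * np.hanning(FRAME_LEN)          # assumed Hann window
    # Each frame is zero-padded to FFT_SIZE; rfft keeps the non-redundant half.
    return np.abs(np.fft.rfft(frames, n=FFT_SIZE, axis=1))


wave = np.random.default_rng(0).uniform(-1.0, 1.0, SAMPLE_RATE)
print(spectrogram(wave).shape)  # (99, 257)
```

Because every step is differentiable almost everywhere, a framework-based version of this front end is what allows perturbations to be optimized end-to-end on the raw waveform.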
We selected the UAP-HC algorithm introduced in [4] to create the universal perturbations. This algorithm, an audio-domain reformulation of the one proposed in [1], iteratively accumulates individual untargeted adversarial perturbations generated using the DeepFool algorithm [21]. The pseudocode for the UAP-HC and DeepFool algorithms can be found in Algorithm 1 and Algorithm 2, respectively. Both algorithms have been generalized to (optionally) prevent them from reaching certain adversarial classes. This generalization is further described and motivated in Section 4.
Finally, we highlight that the rationale of the DeepFool algorithm relies on a geometric approach. In particular, a first-order approximation of the decision boundaries is used to move the input towards the estimated closest boundary, which makes it an untargeted attack. Thus, the optimization process of the UAP-HC algorithm is not biased towards any particular class, although, in practice, different universal perturbations lead to the same dominant classes in most cases.

Algorithm 1 UAP-HC
Input: A classification model f, a set of input samples X, a projection operator P_{p,ξ}, a fooling rate threshold δ, a maximum number of iterations I_max, a set of restricted classes R ⊂ Y.
Output: A universal perturbation v.
1: v ← 0 ▷ Initialize with zeros.
2: FR ← 0 ▷ Fooling rate.
3: iter ← 0 ▷ Iteration number.
4: while FR < 1 − δ ∧ iter < I_max do
5:     X ← randomly shuffle X
6:     Check that x_i is not already fooled by v:
       ...
       ▷ Projection in the ℓp ball of radius ξ centered at 0.
11:    Update v only if adding v_i increases the FR and if the current class is not in R:
       ...
       iter ← iter + 1
20: end while

Algorithm 2 DeepFool
Input: An input sample x of class y_i, a classifier f, a set of restricted classes R ⊂ Y.
Output: An individual perturbation r.
1: x̃ ← x
2: r ← 0 ▷ Initialize with zeros.
   ...
       for y_j ∈ Y do
       ...
6:     end for
       ...
9:     l ← argmin_{j∈Y} |f_j| / ||w_j||_2
10:    r ← r + (|f_l| / ||w_l||_2^2) w_l
11:    x̃ ← x + r
12: end while
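Because several lines of both listings are missing, the following is only a sketch of how the two algorithms fit together, written for a hypothetical affine classifier f(x) = Wx + b so that the gradients w_j are exact. Three points are assumptions rather than the authors' code: restricted classes are simply excluded from the arg min of DeepFool (in line with Section 4, where the algorithm is prevented from considering directions towards restricted classes), the "current class" checked in UAP-HC is the class predicted for the perturbed input, and P_{p,ξ} is taken to be the ℓ2 projection. The names `deepfool_linear` and `uap_hc_linear` are hypothetical.

```python
import numpy as np


def deepfool_linear(x, W, b, restricted=(), overshoot=0.02, max_iter=50):
    """DeepFool for f(x) = W @ x + b; restricted classes are never chosen as the target boundary."""
    i = int(np.argmax(W @ x + b))                          # current class y_i
    targets = [j for j in range(len(b)) if j != i and j not in restricted]
    r = np.zeros_like(x)
    x_adv = x.copy()
    for _ in range(max_iter):
        scores = W @ x_adv + b
        if np.argmax(scores) != i:                         # boundary crossed: stop
            break
        f = np.array([scores[j] - scores[i] for j in targets])   # f'_j
        w = np.array([W[j] - W[i] for j in targets])             # w'_j (exact gradients)
        l = int(np.argmin(np.abs(f) / np.linalg.norm(w, axis=1)))
        r = r + np.abs(f[l]) / np.linalg.norm(w[l]) ** 2 * w[l]
        x_adv = x + (1 + overshoot) * r                    # small overshoot crosses the boundary
    return (1 + overshoot) * r


def uap_hc_linear(X, W, b, xi, delta, max_iter=5, restricted=(), seed=0):
    """Hill-climbing accumulation of DeepFool perturbations with an l2 projection."""
    rng = np.random.default_rng(seed)
    predict = lambda Z: np.argmax(Z @ W.T + b, axis=1)
    clean = predict(X)
    fool = lambda u: float(np.mean(predict(X + u) != clean))
    v, rate, it = np.zeros(X.shape[1]), 0.0, 0
    while rate < 1 - delta and it < max_iter:
        for n in rng.permutation(len(X)):
            if predict((X[n] + v)[None])[0] != clean[n]:
                continue                                   # already fooled by v
            cand = v + deepfool_linear(X[n] + v, W, b, restricted)
            norm = np.linalg.norm(cand)
            if norm > xi:
                cand = cand * (xi / norm)                  # projection onto the l2 ball
            new_class = predict((X[n] + cand)[None])[0]
            new_rate = fool(cand)
            # Accept only if the fooling rate increases and the new class is not restricted.
            if new_rate > rate and new_class not in restricted:
                v, rate = cand, new_rate
        it += 1
    return v, rate


W, b = np.eye(3), np.zeros(3)
x = np.array([1.0, 0.0, 0.0])
print(np.argmax(W @ (x + deepfool_linear(x, W, b)) + b))                     # 1
print(np.argmax(W @ (x + deepfool_linear(x, W, b, restricted=(1,))) + b))    # 2
```

The second call shows the effect of restricting a class: the closest boundary (towards class 1) is skipped and the next-closest one is used instead, which mirrors how new dominant classes can emerge when the original ones are restricted.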

Dominant classes in speech command classification
In this section, we generate different universal adversarial perturbations for the speech command classification task described in Section 3, in order to investigate whether dominant classes also arise in this domain.
We start by generating 10 different universal perturbations using the UAP-HC algorithm, without restricting any class (R = ∅). We set ξ = 0.1 as the threshold for the ℓ2 norm of the perturbation, and restricted the UAP-HC algorithm to a maximum of five iterations. To generate the perturbations, we used a training set of 100 inputs per class, which makes a total of 1200 inputs. Once the perturbations are generated, their effectiveness is measured on a test set containing samples that were not used during the generation of the perturbations. The initial accuracy of the model on this set is 85.52%. According to the results, the algorithm led to universal perturbations with left and unknown as dominant classes in almost all the experiments. This can be seen in Figure 1 (left), which shows the frequency with which each class is wrongly predicted when the perturbation is applied to the audios in the test set. We only considered those inputs that were initially correctly classified by the model, but misclassified when the perturbation is applied. The frequencies are shown individually for the ten universal perturbations, with each row corresponding to one perturbation. As can be seen, left and unknown arise as dominant classes in 9 of the 10 experiments, sometimes even simultaneously.
It is important to highlight that dominant classes arise without being imposed in the universal perturbation crafting procedure. We also tested whether dominant classes remain dominant even if we explicitly avoid them during the optimization process (see Algorithms 1 and 2). We start by preventing the algorithm from considering those directions that point to the decision boundaries of the class left. The results obtained for ten new perturbations generated with this restriction are shown in Figure 1 (center). As can be seen, the most frequent adversarial class is now unknown for 9 of the 10 perturbations created.
We then went a step further and repeated the experiment, this time restricting the boundaries corresponding to both the left and unknown classes. The results are shown in Figure 1 (right). In this case, the two restricted classes were no longer dominant, but different dominant classes were obtained, namely up, right and go. It is also worth emphasizing that, although dominant classes were obtained in all the experiments, they differed depending on which classes were restricted. For instance, whereas the class up rarely appeared as dominant without restrictions, it is the most frequent dominant class when both left and unknown are restricted.
Regarding the effectiveness of the attacks, the fooling rate of every perturbation (i.e., the percentage of inputs that are misclassified when the perturbation is applied) is shown in Figure 2 for each class independently. The fooling rates have been computed considering only the inputs that were initially correctly classified. As can be seen, the effectiveness of each perturbation is higher for some classes than for others, reaching up to ≈69% in some cases. The fooling rates corresponding to the dominant classes, which are highlighted in the figure, are practically zero for most of the perturbations, which reveals that the perturbation does not change the prediction of the model for inputs belonging to those classes.
For a more complete picture, the mean and maximum fooling rates of all the perturbations are shown in Table 1.
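The per-class fooling rates described above follow a simple count. A minimal sketch, assuming integer label arrays for the ground truth, the clean predictions and the perturbed predictions (the function name is hypothetical):

```python
import numpy as np


def per_class_fooling_rate(y_true, y_clean, y_adv, k):
    """Fooling rate per ground-truth class, over inputs that were initially correct."""
    y_true, y_clean, y_adv = (np.asarray(a) for a in (y_true, y_clean, y_adv))
    rates = np.full(k, np.nan)                   # NaN when a class has no correct inputs
    for c in range(k):
        correct = (y_true == c) & (y_clean == c)
        if correct.any():
            rates[c] = np.mean(y_adv[correct] != c)
    return rates


# Class 0: 1 of 1 correct inputs fooled; class 1: 2 of 3; class 2: no inputs (NaN).
print(per_class_fooling_rate([0, 0, 1, 1, 1], [0, 1, 1, 1, 1], [1, 1, 1, 0, 2], k=3))
```

A rate close to zero for a class, as observed for the dominant classes, means that inputs of that class keep their prediction under the perturbation.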

Hypotheses about the existence of dominant classes
In this section, we propose a number of hypotheses to explain and characterize the relationship between universal adversarial perturbations and dominant classes. The proposed hypotheses are also experimentally tested using the framework described in Section 3.

Dominant classes occupy a larger region in the input space
In [1], the existence of dominant classes is attributed to a larger region of such classes in the image space. Nevertheless, due to the high dimensionality of the input spaces in current machine learning problems, exploring the volume that each decision region occupies in the whole input space is intractable in practice.
Even so, to test this hypothesis, we randomly sampled and classified 1,000,000 inputs from the input space. The values of the inputs were sampled uniformly at random in the range [−1, 1]. We found that all the samples were classified as the class silence, which is not a dominant class in our experiments, as shown in Section 4 (see Figure 1). Therefore, our results suggest that there is not necessarily a connection between the volume occupied by the decision regions of different classes and the frequency with which inputs perturbed by universal perturbations reach the regions corresponding to the dominant classes.
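The sampling experiment can be sketched as follows. Since 1,000,000 inputs of dimension 16000 would not fit in memory at once, the sketch draws them in batches; `predict` is a hypothetical callable returning class indices, and the toy rule in the usage lines only exists to make the snippet runnable.

```python
import numpy as np


def uniform_class_histogram(predict, n_samples, dim, n_classes, batch=10_000, seed=0):
    """Count the predicted classes of inputs drawn uniformly from [-1, 1]^dim."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_classes, dtype=np.int64)
    done = 0
    while done < n_samples:
        m = min(batch, n_samples - done)
        X = rng.uniform(-1.0, 1.0, size=(m, dim))
        counts += np.bincount(predict(X), minlength=n_classes)
        done += m
    return counts


# Toy usage: a hypothetical rule that only ever predicts class 0 or class 1.
toy_predict = lambda X: (X.mean(axis=1) > 0).astype(int)
counts = uniform_class_histogram(toy_predict, n_samples=25_000, dim=100, n_classes=3)
print(counts / counts.sum())  # class 2 never appears; classes 0 and 1 split roughly evenly
```

With the real model, `dim` would be 16000 and `n_samples` 1,000,000; the histogram then directly shows which decision regions uniform sampling reaches.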

Class properties of universal perturbations
Universal perturbations are capable of changing the output class of a large number of inputs, and the majority of the misclassified inputs are moved unintentionally towards a dominant class. In this section, we show that the perturbation itself is predicted by the model as the dominant class with high confidence.
In fact, we observed that the following three factors are positively correlated during the generation of a universal perturbation $v$: the fooling rate ($F_1$), the percentage of inputs misclassified as the dominant class $y_b$ ($F_2$), and the confidence with which the model considers that the perturbation itself belongs to the dominant class ($F_3 = f_b(v)$). Here, $F_1$ and $F_2$ are measured on a set of inputs $\mathcal{X}$, and $f_j : X \to \mathbb{R}$ represents the output confidence of the classifier $f$ corresponding to the class $y_j$. An example of the evolution of these factors during the optimization of a universal perturbation with the UAP-HC algorithm is shown in Figure 3. These results correspond to the first experiment of Section 4, in which no class was restricted.
In particular, the left panel of Figure 3 shows the evolution of the frequency with which each class is (wrongly) predicted for the misclassified inputs, and the right panel shows the output confidences of the model when the universal perturbation itself is classified. The fooling rate of the perturbation is included in both panels as a reference, represented by a dashed line. For the 10 different universal perturbations generated in Section 4 (without restricting any class), the average Pearson correlation coefficient between $F_1$ and $F_3$ during the first iteration of Algorithm 1 is 0.79. Similarly, the average correlation between $F_1$ and $F_2$ is 0.87, and the average correlation between $F_2$ and $F_3$ is 0.91.
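To make the three factors concrete, the sketch below computes them for a hypothetical linear-softmax classifier, where "output confidence" is taken to be the softmax probability and $F_2$ is normalized over all inputs in the sample (both are assumptions). Given one trajectory of $(F_1, F_2, F_3)$ values per perturbation, `np.corrcoef` returns the pairwise Pearson coefficients, which can then be averaged over perturbations in the same way as the values reported above.

```python
import numpy as np


def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def factors(W, b, X, v, dominant):
    """(F1, F2, F3) for a hypothetical linear-softmax classifier softmax(W x + b)."""
    clean = np.argmax(X @ W.T + b, axis=1)
    adv = np.argmax((X + v) @ W.T + b, axis=1)
    f1 = np.mean(adv != clean)                          # fooling rate
    f2 = np.mean((adv == dominant) & (adv != clean))    # inputs pushed into y_b
    f3 = softmax(W @ v + b)[dominant]                   # confidence f_b(v) for v alone
    return f1, f2, f3


def mean_correlations(trajectories):
    """Average 3x3 Pearson matrix over per-perturbation trajectories of shape (T, 3)."""
    return np.mean([np.corrcoef(t, rowvar=False) for t in trajectories], axis=0)


W, b = np.eye(2), np.zeros(2)
X = np.array([[1.0, 0.0], [0.0, 1.0]])
print(factors(W, b, X, np.array([0.0, 2.0]), dominant=1))
```

In practice, each row of a trajectory would be recorded after every accepted update of Algorithm 1.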
Motivated by this finding, we studied whether any perturbation $v$ that the model classifies as one particular class with high confidence is capable of producing the same effect as a universal perturbation, that is, of forcing the misclassification of a large number of inputs by pushing them to the class $f(v)$. For this purpose, we defined an optimization problem in which the objective is to find a perturbation $v$ with a constrained norm that maximizes the confidence of the model in one particular class $y_t$:
$$\max_{v} \; f_t(v) \quad \text{s.t.} \quad \|v\|_2 \leq \xi.$$
We launched 100 trials for each possible target class, starting from random perturbations. We used a gradient descent approach to optimize the perturbation, restricting the search to 100 gradient descent iterations and setting a threshold of ξ = 0.1 for the perturbation norm.
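A minimal sketch of this optimization, under stated assumptions: the classifier is a hypothetical linear-softmax model (so the gradient of $f_t$ has a closed form), the confidence is the softmax probability, and the norm constraint is enforced by projecting onto the ℓ2 ball after every step. This is projected gradient ascent on $f_t(v)$, equivalent to gradient descent on $-f_t(v)$; the step size and the random starting point are illustrative choices.

```python
import numpy as np


def softmax(z):
    z = z - np.max(z)                    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def max_confidence_perturbation(W, b, target, xi, steps=100, lr=0.5, seed=0):
    """Projected gradient ascent on f_t(v) = softmax(W v + b)[target], ||v||_2 <= xi."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=W.shape[1])
    v *= xi / np.linalg.norm(v)          # random starting point on the sphere of radius xi
    for _ in range(steps):
        p = softmax(W @ v + b)
        # d p_t / d v = p_t * (W_t - sum_j p_j W_j)
        grad = p[target] * (W[target] - p @ W)
        v = v + lr * grad
        norm = np.linalg.norm(v)
        if norm > xi:
            v *= xi / norm               # projection back onto the l2 ball
    return v


W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
b = np.zeros(3)
v = max_confidence_perturbation(W, b, target=2, xi=1.0, steps=1000)
print(np.argmax(W @ v + b), np.linalg.norm(v))  # the perturbation alone is classified as the target
```

For a real network, the closed-form gradient would be replaced by automatic differentiation through the input transform and the model.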
The mean and maximum fooling rates obtained with the generated perturbations are shown in Table 2, computed independently for each target class. The fooling rate for each class individually is shown in Figure 4 (left). As can be seen in Table 2, the classes left and unknown, the two most frequent dominant classes associated with the universal perturbations generated using the UAP-HC algorithm (see Figure 1), achieve a significantly higher effectiveness than the rest of the classes. Moreover, regardless of the target class, the majority of the samples fooled by these perturbations were classified as the target class. This is shown in Figure 4 (right), in which the average frequency with which each class is predicted under the effect of the perturbations is computed independently for each target class.
Based on these results, we can hypothesize that the model is more sensitive to some class features than to others, and that, ultimately, the degree of sensitivity to each class feature is what determines the dominant classes. In other words, a class $y_j$ will be more dominant the more sensitive the model is to the patterns in the data distribution that are associated with $y_j$ (by the model itself).
Table 2: Effectiveness of the perturbations generated by maximizing the confidence $f_t(v)$ for each target class, averaged over the 100 perturbations generated for each target class.

Singular Value Decomposition
In [1], the existence of universal perturbations for image classification DNNs is attributed, in part, to similar patterns in the geometry of decision boundaries around different points of the decision space. In particular, as described in Section 2, perturbations normal to the decision boundaries in the vicinity of natural inputs approximately span a very low-dimensional subspace, revealing that similar perturbations are capable of changing the output class of different input samples. This was assessed experimentally for state-of-the-art DNNs by computing the Singular Value Decomposition (SVD) of a matrix collecting normalized individual untargeted perturbations generated using the DeepFool algorithm. The results show that the decay of the singular values was considerably faster than the decay obtained from the decomposition of random perturbations (sampled from the unit sphere). This implies that the subspace spanned by just the first $d' \ll d$ singular vectors (i.e., those corresponding to the highest singular values) contains vectors normal to the decision boundaries in the vicinity of natural samples. Indeed, random perturbations sampled from such a subspace were capable of achieving a fooling rate of nearly 38% on unseen inputs, whereas random perturbations (of the same norm) in the input space only achieved a fooling rate of approximately 10% [1].
In this section, we take this approach as a framework to study the existence of dominant classes. First, we will replicate the previous experiment to assess whether, in the audio domain, it is also possible to find a low-dimensional subspace of the input space collecting vectors normal to the decision boundaries of DNNs. Nevertheless, due to the input transformation process required to convert the raw audio signal into the MFCC representation (see Section 3), the results might differ depending on the data representation in which the analysis is done. Thus, we computed the SVD for a set of individual perturbations and different sets of random perturbations, under the three main representations for audio signals: raw audio waveform, spectrogram and MFCC coefficients.

Analysis of the SVD of audio perturbations
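Let us consider a set of $n$ natural input samples $\mathcal{X} = \{x_1, \dots, x_n\}$. The individual perturbations were generated using the DeepFool algorithm in the raw audio waveform representation:
$$v_i = \mathrm{DeepFool}(x_i, f), \quad i = 1, \dots, n. \tag{13}$$
The perturbations that these raw waveform perturbations produce in the spectrogram and MFCC representations are computed as the difference between the transformed perturbed and clean inputs, where $g_{\mathrm{SPEC}}$ and $g_{\mathrm{MFCC}}$ denote the input transform functions that map raw audio waveforms into a spectrogram or into the MFCC features, respectively:
$$v^{\mathrm{SPEC}}_i = g_{\mathrm{SPEC}}(x_i + v_i) - g_{\mathrm{SPEC}}(x_i), \tag{14}$$
$$v^{\mathrm{MFCC}}_i = g_{\mathrm{MFCC}}(x_i + v_i) - g_{\mathrm{MFCC}}(x_i). \tag{15}$$
The random perturbations were sampled uniformly at random from the raw input space:
$$r_i \sim \mathcal{U}\big([-1, 1]^{16000}\big), \quad i = 1, \dots, n. \tag{16}$$
As in the case of adversarial perturbations, the corresponding perturbations in the frequency-domain representations are computed as:
$$r^{\mathrm{SPEC}}_i = g_{\mathrm{SPEC}}(x_i + r_i) - g_{\mathrm{SPEC}}(x_i), \tag{17}$$
$$r^{\mathrm{MFCC}}_i = g_{\mathrm{MFCC}}(x_i + r_i) - g_{\mathrm{MFCC}}(x_i). \tag{18}$$
In this case, the random perturbations were scaled to have a fixed ℓ2 norm of 0.1 before being applied to the inputs in Equations (17) and (18).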
Finally, for a more representative analysis, we considered two additional sets of random perturbations, sampled uniformly at random from the space of spectrograms ($\mathbb{R}^{99 \times 257}$) and the space of MFCC coefficients ($\mathbb{R}^{99 \times 40}$). Figure 5 shows the results of the SVD for the perturbation sets defined in Equations (13)-(20). The results corresponding to the raw waveform, spectrogram and MFCC representations are shown in the first, second and third rows of the figure, respectively. The left column shows the singular values obtained with the SVD for each data representation, while in the right column the decays are characterized by fitting exponential curves (depicted as dashed lines) governed by a decay factor λ; a higher value of λ represents a faster decay. Note that the singular values have been scaled to the range [0, 1] before fitting the exponential curves, for a more uniform comparison.
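A minimal sketch of this analysis in NumPy: the first helper builds the matrix of induced perturbations g(x_i + v_i) − g(x_i), one flattened row per input, for an arbitrary transform `g` (for instance a spectrogram or MFCC front end), and the second returns the singular values of the row-normalized matrix. Dividing by the largest singular value is one assumed way of scaling them to [0, 1]; the helper names are hypothetical.

```python
import numpy as np


def induced_perturbations(X, V, g):
    """Stack g(x_i + v_i) - g(x_i), flattened, as the rows of a matrix."""
    return np.stack([np.ravel(g(x + v) - g(x)) for x, v in zip(X, V)])


def scaled_singular_values(P):
    """Singular values of the row-normalized matrix P, divided by the largest one."""
    P = P / np.linalg.norm(P, axis=1, keepdims=True)   # unit-norm perturbations
    s = np.linalg.svd(P, compute_uv=False)             # sorted in decreasing order
    return s / s[0]


# Toy usage with the identity transform: two of the three rows share a direction.
X = np.zeros((3, 4))
V = np.array([[1.0, 0, 0, 0], [2.0, 0, 0, 0], [0, 1.0, 0, 0]])
print(scaled_singular_values(induced_perturbations(X, V, lambda z: z)))  # approx. [1, 0.707, 0]
```

Correlated perturbations concentrate their energy in the first singular values, which is exactly what a fast (exponential) decay of the scaled curve indicates.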
Regarding the results in the raw waveform representation, the decay of the singular values is mainly linear for both individual and random perturbations, with a very similar decay in both cases. This means that there is no set of singular vectors that is significantly more informative than the rest, and, as a consequence, a large set of vectors would be needed to provide an approximate basis for the perturbations. Thus, the perturbations do not show significant correlations in this representation. Regarding the frequency-domain representations, the decays of the singular values corresponding to the perturbations sampled uniformly at random in the space of spectrograms ($R_{\mathrm{SPEC}}$) and in the space of MFCC coefficients ($R_{\mathrm{MFCC}}$) are also clearly linear.
However, for the frequency-domain perturbations produced by the raw waveform perturbations, either random or adversarial, the singular values decay exponentially. These results indicate, first, that even if the perturbations are generated in the raw audio waveform representation, it is necessary to move to the frequency domain to observe informative patterns. This might be a fundamental difference between the image domain and the audio domain, as most of the analyses in the former can be carried out directly in the raw image space. Second, the effect of audio perturbations in the frequency domain can be characterized by a small number of singular vectors (compared to the dimensionality of the corresponding spaces). For instance, for the MFCC representation, the most relevant information is captured within the first ∼150 singular vectors (i.e., those corresponding to the highest singular values). The fact that this happens for both random and adversarial perturbations could imply, however, that the captured correlations are uninformative about the geometry of the decision boundaries around natural inputs or about the vulnerability of the network to adversarial attacks. Nevertheless, in the remainder of this section we show that the SVD of individual adversarial perturbations not only provides a representative basis for input-agnostic perturbations, but also that this basis is strongly connected with the dominant classes. For these reasons, the rest of the analysis focuses on the MFCC feature space.
We start by evaluating the fooling rate of randomly sampled perturbations in the subspace spanned by the first N ∈ {10, 50, 100, 200, 500} singular vectors, for the cases in which the SVD is computed for individual perturbations ($V_{\mathrm{MFCC}}$) and for random perturbations ($R_{\mathrm{MFCC}}$). All the sampled perturbations were normalized, and the fooling rate was evaluated for different scaling factors in the range [−200, 200], which determine the ℓ2 norm of the perturbation: given a unit vector $v$, for any scalar $c \in \mathbb{R}$, $\|c\,v\|_2 = |c|$. For reference, the median ℓ2 norm of the perturbations (in the MFCC representation) produced by the 10 universal attacks generated in Section 4, measured on the test set, is approximately 100. Figure 6 shows the average fooling rates obtained over 100 trials for each value of N. The results clearly show that, when the SVD is computed for individual perturbations, the fooling rates are significantly higher than for random perturbations, even for norms close to zero. For instance, taking as reference the results for an ℓ2 norm of 100, the average fooling rate is approximately 48% for individual perturbations when N ≤ 100, whereas for random perturbations, under the same conditions, it is only 17%.
However, the fooling rate corresponding to individual perturbations decreases significantly when a large number of singular vectors is considered. Indeed, for N ≥ 200, the fooling rates get closer to those obtained for random perturbations. For instance, when N = 500, the average fooling rate (with an ℓ2 norm of 100) is approximately 18%. This reveals that, whereas the singular vectors corresponding to the highest singular values capture directions normal to the decision boundaries around natural inputs (and are, therefore, effective in fooling the model for a large number of inputs), the remaining singular vectors do not provide additional relevant information.
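The sampling procedure can be sketched as follows, assuming the perturbations are stored as the rows of a matrix `P` and that `predict` is a hypothetical callable operating on flattened MFCC inputs `F`. Drawing Gaussian coefficients in an orthonormal basis gives a direction that is uniformly distributed on the unit sphere of the subspace; multiplying it by a scalar c then yields an ℓ2 norm of |c|. All helper names are hypothetical.

```python
import numpy as np


def top_singular_basis(P, n):
    """Orthonormal basis (n x d) spanned by the first n right singular vectors of P."""
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    _, _, Vt = np.linalg.svd(P, full_matrices=False)
    return Vt[:n]


def sample_in_subspace(basis, c, rng):
    """Random direction in span(basis), scaled so that its l2 norm equals |c|."""
    u = rng.normal(size=basis.shape[0]) @ basis
    return c * u / np.linalg.norm(u)


def subspace_fooling_rate(predict, F, basis, c, trials=100, seed=0):
    """Average fooling rate over `trials` random perturbations drawn from the subspace."""
    rng = np.random.default_rng(seed)
    clean = predict(F)
    rates = [np.mean(predict(F + sample_in_subspace(basis, c, rng)) != clean)
             for _ in range(trials)]
    return float(np.mean(rates))
```

In the setting of Figure 6, `P` would collect the perturbations of $V_{\mathrm{MFCC}}$ or $R_{\mathrm{MFCC}}$, `n` would range over {10, 50, 100, 200, 500}, and `c` would sweep [−200, 200].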

Connection with dominant classes
In the previous section, we have shown that, also for speech command classification models, it is possible to find a low dimensional subspace S containing (input-agnostic) vectors normal to the decision boundaries in the vicinity of natural inputs. Therefore, a reasonable hypothesis is that dominant classes can be explained in terms of the geometric similarity of the decision boundaries in regions surrounding natural inputs, information that is captured by the basis of S, that is, by the singular vectors obtained from the SVD of individual perturbations.
The first hypothesis is that the first singular vectors are also normal to the decision boundaries corresponding to the dominant classes. To validate this hypothesis, we first computed the fooling rate that each singular vector achieves individually. This is shown in Figure 7. To continue with the analysis, we computed the frequency with which each class is (wrongly) predicted, considering only the inputs that were misclassified when the singular vectors were used as perturbations. The aim of this analysis is to assess whether there is a direct connection with the dominant classes. The results are shown in Figure 8, considering the first 100 singular vectors, scaled to have a Euclidean norm of 100. As can be seen, for the singular vectors with the highest fooling rate (approximately those in the range [1, 50]), the most frequent wrong classes are unknown and left. Indeed, for 84% of the singular vectors in [1, 50], the sum of the frequencies corresponding to those two classes exceeds 50%, that is, at least 50% of the misclassified inputs are classified as left or unknown. Moreover, for 62% of the singular vectors, the total frequency corresponding to those two classes exceeds 80%. Therefore, the singular vectors with a high fooling rate not only point towards decision boundaries in the close vicinity of natural inputs, but those decision boundaries also correspond mainly to the dominant classes.
We repeated the experiment using the singular vectors obtained when the SVD is computed for random perturbations. The results are shown in Figure 9. In this case, the wrong classes are clearly distributed more uniformly across the singular vectors, particularly for those with a higher fooling rate (precisely, those in the range [1,50], as shown in Figure 7). For reference, in this case the combined frequency of unknown and left exceeds 50% for only 32% of the singular vectors in the range [1,50], and exceeds 70% for only 2% of the singular vectors.
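A random baseline of this kind can be sketched as follows: draw random directions, normalize them, and take their SVD; the resulting singular vectors would then be scaled by ±100 and evaluated in the same way as those obtained from the adversarial perturbations. As an extra diagnostic that is not reported in the text, the sketch also compares how much energy the first few singular values capture, which is high when perturbations share a low-dimensional subspace and low for i.i.d. random directions. The Gaussian distribution, the function names, and the synthetic data are assumptions; the paper's own construction of the random perturbations may differ.

```python
import numpy as np


def unit_rows(m: np.ndarray) -> np.ndarray:
    """Scale every row of m to unit Euclidean norm."""
    return m / np.maximum(np.linalg.norm(m, axis=1, keepdims=True), 1e-12)


def random_singular_vectors(n: int, d: int, k: int, rng: np.random.Generator):
    """First k right singular vectors (and all singular values) of n random
    unit-norm perturbations in R^d.

    Assumption of this sketch: i.i.d. Gaussian directions.
    """
    noise = unit_rows(rng.normal(size=(n, d)))
    _, s, vt = np.linalg.svd(noise, full_matrices=False)
    return vt[:k], s


def energy_share(s: np.ndarray, k: int) -> float:
    """Fraction of the squared singular values captured by the first k."""
    energy = s ** 2
    return float(energy[:k].sum() / energy.sum())


# Toy comparison: structured perturbations (near a 3-D subspace) vs. random ones.
rng = np.random.default_rng(1)
true_basis = np.linalg.qr(rng.normal(size=(1000, 3)))[0].T
structured = unit_rows(rng.normal(size=(200, 3)) @ true_basis + 0.01 * rng.normal(size=(200, 1000)))
s_structured = np.linalg.svd(structured, compute_uv=False)
random_vectors, s_random = random_singular_vectors(200, 1000, k=50, rng=rng)
share_structured = energy_share(s_structured, 3)
share_random = energy_share(s_random, 3)
```

With random directions the singular values decay slowly, so no small set of singular vectors is shared by many inputs; this is consistent with the more uniform distribution of wrong classes reported for Figure 9.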
Overall, the SVD of individual perturbations has shown that the obtained singular vectors are input-agnostic perturbation directions to which the model is highly vulnerable: even when the inputs are only slightly pushed in those directions, they cross the decision boundary of the model. This reveals that the geometry of the decision boundary has patterns that are repeated in the vicinity of multiple natural inputs. Moreover, we have shown that these adversarial directions mainly point towards the decision boundaries corresponding to the dominant classes. Therefore, it can be concluded that universal perturbation optimization algorithms implicitly exploit the shared geometric patterns of the decision boundaries to increase the effectiveness of the perturbations, which leads to the same dominant classes in the majority of cases.
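Why a direction close to a boundary normal needs only a short step can be illustrated with a first-order (linear) approximation of the classifier around an input, in the spirit of DeepFool-style linearizations. This is an illustration under that linearity assumption, not a derivation taken from the paper; the notation is introduced here: f_j(x) is the score (logit) of class j, k is the class predicted for x, and v is a unit direction.

```latex
\begin{align*}
f_j(x + t\,v) - f_k(x + t\,v) &\approx \big(f_j(x) - f_k(x)\big) + t\,\big(\nabla f_j(x) - \nabla f_k(x)\big)^{\top} v,\\
t_j^{*}(v) &= \frac{f_k(x) - f_j(x)}{\big(\nabla f_j(x) - \nabla f_k(x)\big)^{\top} v}
  \quad \text{if } \big(\nabla f_j(x) - \nabla f_k(x)\big)^{\top} v > 0,\\
t_j^{*}(v) &\ge \frac{f_k(x) - f_j(x)}{\big\lVert \nabla f_j(x) - \nabla f_k(x)\big\rVert_2}
  \quad \text{(Cauchy-Schwarz, since } \lVert v \rVert_2 = 1\text{)}.
\end{align*}
```

Equality in the last line holds only when v is the unit normal of the boundary between classes k and j, and under this approximation the first wrong class reached is the j with the smallest t_j^*(v). A single direction that is close to the normals of the boundaries with the same class j around many inputs would therefore send many of them into j after a short step, which matches the dominant-class behavior described above.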

Figure 8: Frequency with which each class is assigned to the misclassified inputs under the effect of singular vectors (computed for individual perturbations, see Equation (15)). The (unit) singular vectors have been scaled using two different scale factors: 100 (left) and −100 (right). For the sake of clarity, the frequencies are shown individually for the classes unknown and left, while the total frequency corresponding to the rest of classes has been grouped (others).

Figure 9: Frequency with which each class is assigned to the misclassified inputs under the effect of singular vectors (computed for random perturbations, see Equation (18)). The (unit) singular vectors have been scaled using two different scale factors: 100 (left) and −100 (right). For the sake of clarity, the frequencies are shown individually for the classes unknown and left, while the total frequency corresponding to the rest of classes has been grouped (others).

Conclusion
In this paper, we have proposed and experimentally validated a number of hypotheses to explain the intriguing phenomenon whereby universal adversarial perturbations for DNNs send the majority of inputs towards the same wrong class (i.e., a dominant class), even if such behavior is not specified during the optimization of the perturbations. These hypotheses were studied in the audio domain, using a speech command classification task as a testbed. […] for which the model has a higher sensitivity. On the other hand, we demonstrated that the geometry of the decision boundaries of audio DNNs contains similar patterns in the vicinity of natural inputs, and that the most vulnerable directions in the decision space point to the regions corresponding to the dominant classes. Finally, our work highlights a number of differences between the image domain and the audio domain, which contribute to a better and more general understanding of the field of adversarial machine learning.

Future research lines
Although the frameworks proposed in this paper have proven effective in revealing the connections between dominant classes and universal perturbations, a number of open lines could be investigated to achieve a deeper understanding of the behavior of universal perturbations.
First, focusing on the framework proposed in Section 5.2, a promising line of future research would be to identify the data features that the model associates with each class with high confidence, for instance, following the methodologies proposed in recent related works [14]. Similarly, the analysis of the geometry of the decision space carried out in Section 5.3 could be extended by considering the curvature of the decision boundaries, which has proven to be highly informative for the analysis of universal perturbations [8,9]. Moreover, it would be interesting to unify the data-feature perspective of Section 5.2 with the perspective of Section 5.3, which relies on the geometry of the decision space of the DNN. Finally, a deeper understanding of the decision spaces of DNNs is necessary to comprehensively explain why decision boundaries show strong geometric correlations around natural inputs, as well as many other fundamental questions regarding the learning process of DNNs.
Advances in all these research lines could bring a deeper understanding of the vulnerability of DNNs to adversarial attacks, which could be used, for instance, to create more effective attacks. Indeed, as shown in Section 4, the existence of dominant classes reduces the effectiveness of universal perturbations, since the fooling rate for inputs belonging to those classes is practically zero. Therefore, preventing the appearance of dominant classes during the generation of the perturbation could lead to more effective attacks. At the same time, understanding the vulnerabilities of DNNs to adversarial attacks also contributes to the development of more effective defensive strategies and, ultimately, more robust models.
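The loss of effectiveness caused by a dominant class follows from simple arithmetic: the overall fooling rate is the average of the per-class fooling rates weighted by the share of inputs in each class, so a class whose inputs are practically never fooled caps the overall rate at one minus its share. The sketch below uses hypothetical names and toy predictions, and groups inputs by their clean prediction (an assumption; grouping by ground-truth label works the same way).

```python
import numpy as np


def per_class_fooling_rates(clean_pred, adv_pred, num_classes: int):
    """Fooling rate restricted to the inputs of each class, plus class shares.

    Inputs are grouped by their clean prediction (assumption of this sketch).
    """
    clean_pred = np.asarray(clean_pred)
    fooled = (np.asarray(adv_pred) != clean_pred).astype(float)
    counts = np.bincount(clean_pred, minlength=num_classes)
    fooled_counts = np.bincount(clean_pred, weights=fooled, minlength=num_classes)
    # Classes with no inputs get a rate of 0 instead of a division by zero.
    rates = np.divide(fooled_counts, counts, out=np.zeros(num_classes), where=counts > 0)
    shares = counts / counts.sum()
    return rates, shares


# Toy case: a universal perturbation that sends every input to class 2.
rates, shares = per_class_fooling_rates([0, 0, 1, 1, 2, 2, 3, 3], [2] * 8, num_classes=4)
overall = float(rates @ shares)  # equals the plain fooling rate over all inputs
```

Here class 2 plays the role of the dominant class: its inputs are never fooled, so the overall fooling rate cannot exceed one minus the share of class 2, no matter how effective the perturbation is on the remaining classes.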

Acknowledgments
Appendix C Detailed analysis of the effectiveness of universal perturbations (UAP-HC)