A quantum active learning algorithm for sampling against adversarial attacks

Adversarial attacks represent a serious menace for learning algorithms and may compromise the security of future autonomous systems. A theorem by Khoury and Hadfield-Menell (KH) provides sufficient conditions to guarantee the robustness of active learning algorithms, but comes with a caveat: it is crucial to know the smallest distance among the classes of the corresponding classification problem. We propose a theoretical framework that allows us to think of active learning as sampling the most promising new points to be classified, so that the minimum distance between classes can be found and the KH theorem applied. The complexity of the quantum active learning algorithm is polynomial in the variables used, such as the dimension of the space $m$ and the size of the initial training data $n$. On the other hand, if one were to replicate this approach with a classical computer, we expect that it would take time exponential in $m$, an example of the so-called 'curse of dimensionality'.


I. INTRODUCTION
Supervised learning is one of the subareas of machine learning; it consists of techniques to learn to classify new data taking a training set as example. More specifically, the computer is given a training set X, consisting of n pairs of point and label, (x, y). With this information, the computer is supposed to extract or infer the conditional probability distribution p(y|x) and use it to classify new points x. This paradigm is in contrast with unsupervised learning, which, as in the case of clustering, attempts to find structure in a set of points without labels; and reinforcement learning, where an agent has to figure out the best policy or action for each situation it may face.
An important kind of supervised learning is what is usually called active learning. To introduce this concept suppose that we have a supervised learning algorithm, with its corresponding training set. However, instead of directly trying to predict the label of new points, we give the classifier the option to pose us interesting questions in order to reduce the uncertainty in p(y|x). In this setting, the algorithm will add to its training set new points in areas where it has a lot of uncertainty.
To explain the concept better, let us give an example. Suppose we have an image classifier used for a self-driving car, which has to be able to distinguish between cars, pedestrians and other objects. Internally, images are decomposed into pixels that can be characterised by their combination of red, blue and green. Thus, an image can be expressed as an array of 3-dimensional vectors, or as a gigantic vector if the resolution of the images is always the same. The set of images of the different classes forms a single manifold M in a high-dimensional space. The key feature of an active learning algorithm is that it is able to understand what kind of images it is most uncertain about, and request additional examples to be labeled and added to its training set.

Another important concept in the context of supervised learning is that of adversarial attacks or adversarial examples, the name given to a phenomenon where a trained and accurate classifier (usually a neural network) can be misled into a wrong classification by a carefully chosen, slightly modified version of a point that is otherwise classified correctly. They were discovered quite recently [1,2], and have received a lot of attention, leading to models robust to particular attacks [3][4][5][6].
After the discovery of such adversarial examples, considerable effort has been put into explaining why they happen and how they can be avoided. One of the theoretical reasons given links their existence to a high codimension: the difference in dimension between the structure of the classes we are trying to separate and the high-dimensional space in which they are embedded [7]. This suggests a strategy, proposed in theorem 5 of [7], that makes use of several definitions:

Definition 1. Given a manifold M, a δ-cover of such manifold is a set of balls of radius δ, and will be a key ingredient to avoid adversarial examples. The radius δ measures how fine the coarse-graining of the classes is. With perhaps an abuse of notation, we will also call the set of points x_i at the centers of these balls a δ-cover. This means that a δ-cover is a coarse-grained sampling of each class, where δ is the parameter that controls how coarse or fine the sampling is. Following our previous example, a δ-cover could be a set of images of cars such that any possible image of a car is no further than δ from one of the training set. Notice that one can measure the distance between two images as the p-norm distance between the vectors containing the amount of green, blue and red of each pixel.
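The δ-cover condition is simple to state numerically. As an illustration (a minimal sketch with toy data of our own, not part of the algorithm itself), one can check whether a candidate set of centers δ-covers a sampled class:

```python
import numpy as np

def is_delta_cover(centers, points, delta, p=2):
    """Return True if every sampled point of the class lies within
    p-norm distance delta of some center, i.e. the centers form a
    delta-cover of the sampled class."""
    for x in points:
        if np.linalg.norm(centers - x, ord=p, axis=1).min() > delta:
            return False
    return True

# Toy class: the segment [0, 1] sampled densely, covered by 3 centers.
points = np.linspace(0.0, 1.0, 101).reshape(-1, 1)
centers = np.array([[0.0], [0.5], [1.0]])
print(is_delta_cover(centers, points, delta=0.26))  # True
print(is_delta_cover(centers, points, delta=0.10))  # False
```

The farthest sampled point sits at distance 0.25 from its nearest center, so the cover succeeds for δ = 0.26 and fails for δ = 0.10.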
Definition 2. Given a manifold M ⊂ R^m, an ε-neighbour of M in the norm p, M_ε, is the set of points x ∈ R^m such that the p-distance of x to M is smaller than or equal to ε.
Here, ε has the meaning of the robustness against perturbations. For example, take an image of a car and its corresponding vector. ε is the amount one can perturb the vector without fooling the classifier. Then, M_ε is the space of such perturbations of size ε, for each class within M.
For the next definition we will need the notion of a classifier: a function f that, given a point x, is able to predict a label y.

Definition 3. Let M ⊂ R^m be a manifold containing several disjoint parts called classes C_i, separated by a distance r_p in the norm p, and let f be a classifier reasonably well trained to distinguish between those classes. An ε-adversarial example of such classifier in the norm p is a point x whose p-distance to a given class C_0 is smaller than ε, but which is classified to be in class C_1. That is, d(x, C_0) ≤ ε and f(x) = C_1.
An example of an adversarial example is pictured in figure 1. The blue point is close to the red class but classified as blue.
Theorem 1 (Khoury and Hadfield-Menell) [7]. Let M ⊂ R^m be a k-dimensional manifold that contains each of the classes, and let ε < r_p, for r_p the minimum separation distance between two classes in norm p. Let L be a learning algorithm, and f_L the classifier it produces. Assume that for any point x in the training set X_L with label y, and for every point x̃ inside a ball centered at x with radius r_p, x̃ ∈ B(x, r_p), the learning algorithm classifies x̃ in the same way it does x: f(x̃) = f(x) = y. We then have the following guarantee: if X_L is a δ-cover for δ < r_p − ε, then f_L correctly classifies M_ε, that is, an ε-neighbour of M.
FIG. 1. Manifold M made of two classes separated by a distance r_p. The space near each class is M_ε. We also represent a δ-cover of the blue class. The red point is in the red class, but when ε-shifted, can be mistakenly classified as blue. It would be an adversarial example that we are trying to avoid.

Proof omitted.
Given these definitions and the theorem, we can think of this as a procedure to establish an ε-tolerance to perturbations. The robustness provided by the δ-cover qualitatively means that if we provide a cover of the class with balls of size δ, then whenever we get ε-near the class, in M_ε, we are sure that the point will be correctly classified. Then, our classifier will be robust to perturbations. Therefore, we can see that if we knew the minimum distance of separation of two classes, r_p, we would be able to produce a cover of the two classes that avoids the adversarial examples. However, finding r_p is not easy: from the samples we only have an upper bound on the minimum distance between the two classes. The problem we aim to solve in this paper is finding this minimum distance between the two classes, r_p, because if we overestimated δ we would not be able to use Theorem 1, and if we underestimated it, the δ-cover would be more expensive to establish.
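To make the difficulty concrete: from a finite sample one can only compute an upper bound on r_p, namely the minimum pairwise distance between the sampled classes. A small numpy sketch with toy data of our own:

```python
import numpy as np

def sample_min_distance(class_a, class_b, p=2):
    """Minimum pairwise p-distance between two sampled classes.
    This only UPPER-bounds the true separation r_p: unseen points
    of either class may lie closer together."""
    diffs = class_a[:, None, :] - class_b[None, :, :]
    return np.linalg.norm(diffs, ord=p, axis=2).min()

rng = np.random.default_rng(0)
a = rng.normal(loc=-2.0, scale=0.3, size=(50, 2))  # toy class A
b = rng.normal(loc=+2.0, scale=0.3, size=(50, 2))  # toy class B
r_upper = sample_min_distance(a, b)
# Adding samples can only tighten (never loosen) this upper bound,
# which is what the active learner aims to do.
```

This is exactly why overestimating r_p is the dangerous direction: the sampled estimate always errs on the side of overestimation.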
With this purpose we propose an active learning algorithm that allows for fast sampling of the most informative points that could be added to the training set X_L to find out this minimum distance r_p. We will later define what we mean exactly by 'informativeness'. Also, notice that this means that the level of robustness ε is something we choose, and r_p is the unknown we are looking for that would allow us to calculate δ.

A. Related work
As we have seen, adversarial examples are a danger for any classifier that needs to be robust to perturbation. For instance, adversarial examples can be dangerous when an autonomous car has to recognise traffic signals. Since adversarial examples were discovered [1], there has been a lot of work to explain why they happen [2] and also to obtain provably robust models [3]. In particular, some ideas to avoid them are related to adversarial training: training against those adversarial examples before the actual adversary has time to pose them to the classifier. This is for instance the model explained in [3], where they use projected gradient descent to minimize the maximum expected loss

min_θ E_{(x,y)∼D} [ max_{s∈S} L(θ, x + s, y) ],   (1)

where D is the initial population of points x, with true label y, and perturbations s can be taken from a small set S. L is the loss function: the function that measures the difference between the predicted and actual classification, with adjustable parameters θ, and E indicates expected value. The max represents the work of the adversary, whereas the minimization represents the work to make the classifier robust. This setup certainly works, as long as you can minimize (1). However, as pointed out in [7], in order to work perfectly it would require a number of adversarial examples added to the training set that is exponential in the dimension of the problem. Thus, additional strategies are worth exploring. It is also worth noticing that adversarial attacks are gaining attention in the quantum community. Recently, it has been indicated that this phenomenon is also present in the case of quantum classifiers [8], where the dimension plays a very important role: the higher the dimension, the easier it is to carry out those adversarial attacks.
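To illustrate the adversary's inner maximisation, here is a one-step, sign-of-gradient attack in the spirit of the fast gradient sign method (a single-step version of the projected-gradient attack of [3]); the linear classifier and all the numbers are our own toy choices:

```python
import numpy as np

def fgsm_attack(w, b, x, y, eps):
    """One step of the adversary's inner maximisation: move x by eps
    (componentwise, i.e. an infinity-norm set S) along the sign of
    the gradient of the logistic loss of the classifier sign(w.x + b)."""
    margin = y * (w @ x + b)
    grad = -y / (1.0 + np.exp(margin)) * w  # d(log-loss)/dx
    return x + eps * np.sign(grad)

w, b = np.array([1.0, -1.0]), 0.0
x, y = np.array([0.2, 0.1]), 1.0          # correctly classified: w.x + b = 0.1 > 0
x_adv = fgsm_attack(w, b, x, y, eps=0.2)  # now w.x_adv + b = -0.3 < 0: fooled
```

A perturbation of size 0.2 per coordinate flips the sign of the decision function: x_adv is an ε-adversarial example in the sense of Definition 3.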
This work is additionally strongly related to several forms of active learning algorithms. Active learning algorithms are those where the classifier can ask for new points to be classified and added to its training data base. This field can be divided into two main branches [9]: query synthesis, where new examples are created, and sampling. The latter is subdivided into stream-based and pool-based sampling. In stream-based sampling one selects one item at a time and decides whether it is worth the cost of classifying it [10]. In pool-based sampling, examples are sampled from a large pool of unlabelled data [9]. As we will see, our algorithm can be used both for pool-based sampling and for query synthesis, depending on the points used. The most typical strategies to select the samples are uncertainty sampling, which selects points with maximum uncertainty about which class they belong to [11], and query-by-committee, where the space of classifiers that agree with the data is halved sequentially [12]. Finally, there is also the strategy of expected error reduction [13], which selects those points that on average, weighted according to probability, make the loss function as low as possible. This last procedure is similar to what we are using, except that instead of minimizing the loss function, we select points that in expectation would make the margin of the SVM as small as possible.
Finally, we want to mention that our work relies heavily on reference [14]. At the time that article was written, there was no known method to solve low-rank linear systems of equations in polylogarithmic time. However, [15] has very recently proposed a method that achieves precisely that, using the techniques from [16], which found a classical algorithm taking inspiration from [17]. Therefore, one could also solve the linear system of equations that appears in this case in time polylogarithmic in the dimension n. There are two reasons why we have not explored that result here. Firstly, although [15] is efficient in n, it behaves badly in other parameters involved, such as the error ε. The second, perhaps more important, argument is that here we are not solving a single linear system of equations, but rather many of them in quantum parallel. This, to the best of our knowledge, makes the proposed algorithm quite different from what is usually done. If one wanted to solve the proposed problem with classical algorithms, one might try something like gradient ascent, where in each step the algorithm from [15] is used to find how interesting a given point is.

II. MODELLING THE ACTIVE LEARNING PROBLEM
The problem we want to solve is the following. Suppose we are given a set S = {(x_i, c_i)}_{i∈{1,...,n}}, where x_i ∈ R^m are the n data points already classified, and c_i are label vectors that indicate the probability of membership to the possible classes where the point might be classified. Suppose also that these memberships sum up to, at most, 1 for each point, and for simplicity assume that there are just two classes. The problem is: what are the most informative points to be added to S in order to learn the Support Vector Machine, and in particular r_p, with better precision?
The concept of 'informativeness' of a point will be rigorously defined later.

FIG. 2. Initially we have the SVM in black, but we want to obtain the green one, which is more accurate. The dotted lines are the ones that we cannot see initially and must find out. The equation of the SVM is the one depicted, and the margin 1/|w| is chosen to be equal to r_p. This is achieved when we make the two parallel hyperplanes that indicate the margin fulfill the equations w·x − b = ±1. Since making the SVM more precise implies making the margin smaller, we want to find an SVM that maximizes |w|.

In our particular problem,
we are interested in finding the minimum distance between two classes, in order to provide the required covering of the two classes that avoids having adversarial attacks. However, it might be the case that both classes have some points in common. In order to avoid this, we establish that a point x_i is in class j if (c_i)_j > 0.8. This value is arbitrary but should be strictly greater than 0.5 to avoid overlapping of classes. This way we clearly separate the two sets, and the minimum distance between classes is greater than 0. With this condition one makes the classes clearly separated, since otherwise the concept of adversarial example, which will be central to our discussion, does not make sense. Now let us look at figure 2. We can see that the initially calculated SVM has a greater margin than that of the SVM we would have if we were able to perfectly know both classes. If we used such a margin in Theorem 1, we would not find a good cover that allows us to avoid adversarial examples. Thus, we are interested in minimizing the maximum margin, which is given by the true SVM we would obtain if we knew the actual classes and not just some samples.
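The membership rule just described can be sketched directly (the 0.8 threshold is the one chosen in the text; the membership vectors below are made up for illustration):

```python
import numpy as np

def hard_labels(memberships, threshold=0.8):
    """Assign point i to class j only if (c_i)_j > threshold; any
    threshold > 0.5 guarantees the resulting classes are disjoint.
    Points with no clear membership are left out (label -1)."""
    labels = np.full(len(memberships), -1)
    for i, c in enumerate(memberships):
        j = int(np.argmax(c))
        if c[j] > threshold:
            labels[i] = j
    return labels

c = np.array([[0.90, 0.10],   # clearly class 0
              [0.55, 0.45],   # ambiguous: discarded
              [0.15, 0.85]])  # clearly class 1
print(hard_labels(c))  # [ 0 -1  1]
```

Discarding the ambiguous middle point is what guarantees a strictly positive minimum distance between the two resulting sets.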
Our problem is, however, in contrast with usual pool-based sampling, where the algorithm usually seeks to classify new points near the SVM boundary. This is not the case for our problem, since we want points that have a membership to a class greater than 0.8. Points that have no clear membership to a single class are not so interesting, and we do not include them in our data set.
How, then, do we measure how interesting it would be to classify an arbitrary point? A good heuristic for our problem is to look for points that, with high probability, would decrease the margin r_p a lot. This is because there are two competing conditions on r_p. On the one hand, in order to fulfill the condition of the problem, we cannot take r_p larger than it really is, as that would make us choose a δ that does not fulfill the conditions of the theorem. On the other hand, the smaller δ is, the more expensive it is to establish the δ-cover of the classes.
The previous paragraph suggests measuring the 'informativeness' of a point by dividing the problem into two parts: we first find the probability that a given point x_{n+1} is in class c, and then multiply this probability by the inverse of the margin distance of the updated SVM, taking into account this would-be newly classified point.
Therefore, we could solve this problem in two different ways. Firstly, one could attempt to search for the point with the maximum 'informativeness', which is costly but can be sped up using Grover-like techniques to a certain degree. On the other hand, one can aim to obtain the top 1/C of points with the most 'informativeness', for example the best 1%, in which case C = 100. Informativeness will be measured to be directly proportional to P_c(x_{n+1}), the probability that point x_{n+1} is classified in class c, and to ||w_{x_{n+1}}||, the inverse of the margin distance of the Support Vector Machine, with the idea that the smaller we make the margin, the more we have learned with the newly classified point, as can be seen in Fig. 2. Notice here that C is really a hyperparameter of the algorithm and has no direct relation with δ or r_p.
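The resulting score can be written as a single function. In the sketch below, `prob_in_class` and `inv_margin` are hypothetical stand-ins for the quantities computed later (P_c(x) in Section III and ||w_x|| in Section V); only their product matters here:

```python
import numpy as np

def informativeness(points, prob_in_class, inv_margin):
    """Score P_c(x) * ||w_x||: a high score means the point is both
    likely to be in the class and, once added, would shrink the margin
    (i.e. grow ||w||) of the retrained SVM."""
    return np.array([prob_in_class(x) * inv_margin(x) for x in points])

# Toy stand-ins: membership decays away from the class, while the
# inverse margin grows only slowly with distance.
prob = lambda x: np.exp(-abs(x))
invm = lambda x: 1.0 + 0.1 * abs(x)
scores = informativeness(np.array([0.0, 1.0, 3.0]), prob, invm)
```

With these toy stand-ins the score decreases with distance from the class: points far outside the class are penalised even if they would shrink the margin, matching the discussion above.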

A. Main results
In this article we propose a theoretical framework that allows us to think of active learning as sampling the most promising new points to be classified, so that the minimum distance between classes can be found and Theorem 1 used. We also propose a quantum algorithm that would allow us to perform the sampling efficiently, at polynomial cost in all the variables involved, under certain conditions.
In the quantum algorithm we do not use neural networks but rather Support Vector Machines (SVM) [18], although it might be possible to use the general strategy in other setups, such as quantum neural networks. However, the main quantum advantage we propose does not come from the use of the quantum SVM, but rather from Amplitude Amplification (AA) and Amplitude Estimation (AE) [19], the application of Quantum Phase Estimation [20] to Amplitude Amplification. In particular, we claim that we obtain an exponential advantage with respect to what we would get if we tried to solve this problem of sampling the most informative points classically. Let us explain this further: in the usual case of Amplitude Amplification and Amplitude Estimation we have an oracle that tells us whether an element is marked or not, and this yields a quadratic advantage with respect to the classical case. However, in our situation, being a good point or not depends on its relative 'informativeness' with respect to the other points. This means that, classically, in order to assert that a given point is within the top 1/C = 1% you should first calculate the informativeness of 0.99N points, out of N points. This is clearly prohibitive, since N = O(l^m), where m is the dimension of the space, and l its discretisation. Notice that for an n_0 × n_0 image, the dimension of the image would be m = 3n_0^2, due to the three colours or channels needed to define each pixel.
In contrast, if we want to solve this problem using Amplitude Estimation, we do not incur such a cost. What we do is find, using bisection, an informativeness threshold above which there is only 1% of the most informative points. Once that is the case, we can mark those items and use Amplitude Amplification to find them. The cost will then be O(Cε^{-1}), as will be explained later on, for a given precision ε in the threshold, and, crucially, C is independent of N. In the next subsection we lay out the general strategy of our paper to solve this problem.

B. Algorithm strategy and background
As we have seen, our aim is to sample points to be added to the training set, x n+1 , with the highest possible informativeness, defined as P c (x n+1 )||w xn+1 ||, where the first term indicates the probability that a given point is in a class, and the second measures how much that would improve the classifier.
Thus, we will employ the following strategy to select the 1/C most relevant points, for example the 1% most relevant points, in which case C = 100:

1. Create a superposition of all points of the space R^m, discretising it so that in the end there are l^m points to test. This can be done with Hadamard gates in a register |0⟩^{⊗ m log l}. Notice that the values of the coordinates of each point are encoded in the computational basis.
2. Use the current SVM and a reasonable function such as a sigmoid to assess how likely it is that each of those points is in a given class, using Amplitude Estimation to write it in a register in the computational basis.
3. Set up a given threshold T for the informativeness.

4. For each point in the superposition, solve the quantum SVM that includes that point to obtain ||w_{x_{n+1}}|| in a register in the computational basis.

5. Multiply the registers containing P_c(x_{n+1}) and ||w_{x_{n+1}}|| to obtain the informativeness of each point.

6. Compare the informativeness register with T, writing the result in a flag qubit.

7. Amplitude Estimate the percentage of points whose informativeness is above the threshold.
Since we want the precision to be of the order of ε/C, the cost will be O(Cε^{-1}).
8. If such percentage is lower than 1/C, lower the threshold; if it is higher, raise the threshold. In both cases go to step 4.

9. Once T is such that the informativeness of 1/C of the points is above it, mark such points.
10. Use Amplitude Amplification to find one of such points, at cost O( √ C).
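The threshold-bisection loop above (steps 3, 7 and 8) has a direct classical analogue, which may help fix ideas; the quantum version replaces the exact counting in step 7 by Amplitude Estimation:

```python
import numpy as np

def top_fraction_threshold(scores, frac, tol=1e-9):
    """Bisect a threshold T until the fraction of points with
    informativeness above T is as close as possible to `frac`."""
    lo, hi = scores.min(), scores.max()
    while hi - lo > tol:
        T = 0.5 * (lo + hi)
        if np.mean(scores > T) > frac:  # step 7: fraction above T
            lo = T                      # step 8: raise the threshold
        else:
            hi = T                      # step 8: lower the threshold
    return hi

rng = np.random.default_rng(1)
scores = rng.random(10_000)
T = top_fraction_threshold(scores, frac=0.01)  # top 1%, i.e. C = 100
# roughly 100 of the 10,000 toy scores now lie above T
```

Each bisection step halves the search interval, so the number of iterations is logarithmic in the required threshold precision.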
Since we will rely on them heavily, it is useful to recall some results from [14]. They assume that each point x_i is labeled with a single class y_i, and there are only two classes, y_i = ±1. Given the pairs of data and label (x_i, y_i)_{i∈{1,...,n}}, they state that calculating the SVM is equivalent, in the dual formulation, to maximizing over the multipliers α the Lagrangian

L(α) = Σ_j y_j α_j − (1/2) Σ_{j,k} α_j K_{j,k} α_k,

with constraints Σ_i α_i = 0 and α_i y_i ≥ 0 ∀i. After this, the result of the classifier is given, for a new point x̃, by y(x̃) = sgn(Σ_j α_j k(x_j, x̃) + b). Notice that the matrix K_{l,k} = k(x_l, x_k) is a kernel matrix. Since the margins are of length at least 1, this means that for the training data y_i(w·x_i + b) ≥ 1. We can transform those inequality constraints into equalities by adding slack variables e_i: y_i(w·x_i + b) = 1 − e_i. Since there are only two classes, y_i = ±1, and so y_i^2 = 1. We will later say, with a bit of abuse of notation, that a given point is in class c to mean any of the two possible values of y_i.
Solving the problem is then equivalent to solving the following system of equations [14]:

F (b, α⃗)^T = (0, y⃗)^T,  with  F = ( 0  1⃗^T ; 1⃗  K + γ^{-1} I ),   (6)

where γ is the hyperparameter controlling the slack. To do that without incurring prohibitive costs, [14] performs a low-rank approximation that imposes a small ε_K^{-1} = κ_eff = O(1), taking into account only the eigenvalues λ_i such that ε_K ≤ λ_i ≤ 1.
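For reference, this linear system is cheap to write down classically. A small numpy sketch of the least-squares SVM system used in [14], with a linear kernel and toy data and γ of our own choosing:

```python
import numpy as np

def ls_svm_solve(X, y, gamma=10.0):
    """Solve the least-squares SVM system
    F (b, alpha)^T = (0, y)^T with F = [[0, 1^T], [1, K + I/gamma]],
    K being the (here linear) kernel matrix. Returns (b, alpha)."""
    n = len(y)
    F = np.zeros((n + 1, n + 1))
    F[0, 1:] = 1.0
    F[1:, 0] = 1.0
    F[1:, 1:] = X @ X.T + np.eye(n) / gamma
    sol = np.linalg.solve(F, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]

X = np.array([[-1.0], [-0.8], [0.8], [1.0]])   # two 1-D clusters
y = np.array([-1.0, -1.0, 1.0, 1.0])
b, alpha = ls_svm_solve(X, y)
w = alpha @ X                                   # w = sum_j alpha_j x_j
# sign(w.x + b) classifies the training points; the margin is 1/||w||
```

The first row of F enforces the constraint Σ_i α_i = 0, and the remaining rows encode y_i(w·x_i + b) = 1 − e_i with slack e_i = α_i/γ.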

III. CALCULATING P_c(x_{n+1}).
The first thing we should care about is calculating P_c(x_{n+1}), the probability that point x_{n+1} is in class c. Notice that |x_{n+1}⟩ is encoded in the computational basis, so that we can use it more easily and 'copy' it (|x_{n+1}⟩ → |x_{n+1}⟩|x_{n+1}⟩). This is not quantum cloning, since the information is encoded in the computational basis: one can rather perform this operation by taking an ancilla register and applying a CNOT from each original qubit to one ancilla qubit. Notice also that the states presented in this section and the next one are not normalized. The simplest way to calculate the probability would be to solve the SVM and use it to define a sigmoid function along the margin. Solving the SVM can be done using [14] and reading each entry using Amplitude Estimation. It would return the vector (b, α⃗), which can be used to create w⃗ = Σ_j α_j x⃗_j, and therefore the SVM.
A second, more elegant solution is the following. In [14], they propose a method to estimate, using the Swap Test and the output state |b, α⃗⟩, to which class a given point |x̃⟩ belongs. They calculate that the success probability of measuring |−⟩ in the ancilla is P = (1 − ⟨ũ|x̃⟩)/2, for |ũ⟩ and |x̃⟩ defined as in (7) and (8). The really interesting thing to notice is that if P > 1/2 the classification is in one class, and if P < 1/2 it is in the other. We modify this protocol slightly so that we perform Amplitude Estimation on this result instead of repeatedly measuring the expected value. This allows us to obtain the amplitude for each point, A_{x_{n+1}} (first arrow in the next equation), and then apply whatever mathematical function we consider appropriate to shape the function P_c(x_{n+1}) from A_{x_{n+1}}, like a sigmoid for example (second arrow).
Notice that the last step is possible because the amplitude A_{x_{n+1}} is encoded in the computational basis for each point x_{n+1}.
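The shaping step (the second arrow) is just a classical function applied in the computational basis. For instance, a sigmoid centred at the decision value 1/2; the steepness below is an arbitrary choice of ours:

```python
import numpy as np

def membership_from_amplitude(A, steepness=10.0):
    """Shape the estimated quantity A (A > 1/2: one class, A < 1/2:
    the other) into a smooth membership probability P_c via a
    sigmoid centred at 1/2."""
    return 1.0 / (1.0 + np.exp(-steepness * (A - 0.5)))

print(membership_from_amplitude(0.5))  # 0.5: exactly on the boundary
print(membership_from_amplitude(0.9))  # close to 1: confidently in the class
```

The steepness controls how sharply the membership saturates away from the boundary, playing a role analogous to the 0.8 threshold of Section II.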

IV. THE CONDITION NUMBER.
Now let us turn to the condition number of the system of equations that appears in the SVM we are solving, that is, the condition number of the matrix F in (6). Recall that the condition number is defined as κ = σ_max/σ_min, where σ_min and σ_max are the minimum and maximum singular values respectively. We need to take care of it because, as we will see, when we postselect on the ancilla of the HHL algorithm used in section V to solve the SVM, we will be able to recover the norm ||w_{x_{n+1}}|| as long as we can correct the amplitudes for the condition number that would correspond to each point x_{n+1}, as can be seen in appendix A of [21]. However, trying to first calculate the actual condition number might be too expensive: O(κ^3) = O(n^3), where n is the number of already classified points, if we try to use it in the Quantum SVM, and O(κ) = O(n) to calculate it using Quantum Singular Value Estimation. [14] proposes that when the kernel matrix has O(1) eigenvalues of size O(1), and O(n) eigenvalues with values O(1/n), as is the case for low-rank matrices, we can choose a condition number κ_eff = O(1) such that in the end we will get an additional error of order O(1/√n) (apart from ε) and a complexity of the algorithm O(ε^{-3} κ_eff^3 log(mnN)), where m is the dimension of the space. In such a case, notice that we have imposed an effective condition number for all x_{n+1}: κ_{x_{n+1}} = κ_eff, so we no longer have to care about the condition number. Since we suppose n is large enough, the additional error we introduce will be small, of order O(n^{-1/2}), according to the supplementary material of [14].
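The effect of the κ_eff truncation can be mimicked classically by clipping the spectrum; a sketch with a toy matrix of our own ([14] performs this truncation through quantum phase estimation rather than an explicit eigendecomposition):

```python
import numpy as np

def effective_inverse(F, kappa_eff):
    """Invert F only on the well-conditioned subspace: after scaling
    so the largest |eigenvalue| is 1, keep eigenvalues with
    |lambda| >= 1/kappa_eff and discard the rest."""
    lam, V = np.linalg.eigh(F)
    lam = lam / np.abs(lam).max()
    inv = np.zeros_like(lam)
    mask = np.abs(lam) >= 1.0 / kappa_eff
    inv[mask] = 1.0 / lam[mask]
    return V @ np.diag(inv) @ V.T

# A matrix with a few O(1) eigenvalues and many tiny ones, as in the
# low-rank kernel case described above:
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))
spec = np.concatenate(([1.0, 0.8, 0.5], np.full(47, 1e-3)))
F = Q @ np.diag(spec) @ Q.T
F_inv = effective_inverse(F, kappa_eff=10.0)
# F_inv inverts F on the three large directions and annihilates the rest
```

The truncated inverse behaves like F^{-1} on the kept subspace while its own effective condition number is bounded by κ_eff, which is exactly the trade-off described above: a bounded cost at the price of an O(n^{-1/2}) error.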

V. QUANTUM SUPPORT VECTOR MACHINE AND THE NORM OF α_{x_{n+1}}.

The next step consists of calculating |α⃗_{x_{n+1}}⟩, that is, the solution of the SVM if we accepted that point x_{n+1} is in class c as assumed. To do that, recall that the Quantum SVM algorithm [14] provides it to us encoded in the amplitudes, but we need to do postselection in exactly the same way as is done for the original HHL algorithm [22]. Now, the success probability during postselection will depend on the norm of the solution to the Linear System of Equations. In particular, if we are trying to solve Ax = b, then we can calculate the norm of the solution as ||x|| ∝ √p_1 κ ||b|| (see appendix A of [21]), where √p_1 represents the amplitude of the ancilla of the HHL algorithm being in the correct state, usually |1⟩. Notice that this is good news: in our case b is the vector (0, y⃗), which will be the same regardless of the point x_{n+1}, as we are just working with the assumption that x_{n+1} is in class c. Therefore, we do not have to correct each solution of the linear system of equations with the factor ||(0, y⃗)||.
Also, we have made the low-rank approximation indicated in [14], so that κ_{x_{n+1}} is the same for all points x_{n+1}. Therefore, the only thing that matters is the amplitude of the ancilla indicating success, and it will be easy to perform Amplitude Estimation over the amplitude of the ancilla being in state |1⟩ and recover ||(b, α⃗)_{x_{n+1}}||. We will also perform Amplitude Estimation to estimate the absolute value of each entry of the solution vector α⃗_{x_{n+1}} = (b, α_1, ..., α_{n+1})_{x_{n+1}}. Notice that there are n + 1 points in the training set, as the newly added one is the (n + 1)-th, and so there is the same number of entries of α⃗.
To estimate the sign we can do the same as we did in [23]: once we have the absolute value of the entries, we can prepare a state C_{ij}(α_i|j⟩ + α_j|i⟩) and perform Amplitude Estimation over the result of a Swap Test with the solution vector |α⃗⟩. If the relative sign of entries i and j is the same, we will get a nonzero result proportional to 2C_{ij}α_iα_j, but if the relative sign is opposite, the inner product will cancel out. The only difference with the procedure of [23] is that we are still working on a superposition of points |x_{n+1}⟩.
The encoding of the states (0, y⃗) and C_{ij}(α_i|j⟩ + α_j|i⟩) can be done easily, since the length of those vectors is n. Therefore the maximum cost of encoding will be O(n), but it can be done much more efficiently with, for example, the methods of [24] and the QRAM method of [25], or the method described in theorem 1 of [26], for 'dense' and 'sparse' vectors respectively, in time polylogarithmic in n.
In the end we get the state

Σ_{x_{n+1}} |P_c(x_{n+1})⟩ |||α⃗_{x_{n+1}}||⟩ |b_{x_{n+1}}⟩ |sgn(b_{x_{n+1}}, α_{1,x_{n+1}})⟩ ... |sgn(α_{n,x_{n+1}}, α_{n+1,x_{n+1}})⟩ |α_{n+1,x_{n+1}}⟩ |x_{n+1}⟩.   (14)

Notice that there are n + 1 entries in the vector α⃗ since we have added a new point to the classifier. Thus the new w⃗ would be w⃗_{x_{n+1}} = Σ_j α_{j,x_{n+1}} x⃗_j. Using the previous equation, it is possible to calculate ||w_{x_{n+1}}||² = Σ_i (Σ_j x_{ij,x_{n+1}} α_{j,x_{n+1}})² in the computational basis, but in superposition, and uncompute the rest. That means that the information is encoded in the computational basis for each point x_{n+1}, but there is a superposition of these. Therefore, we get

Σ_{x_{n+1}} |P_c(x_{n+1})⟩ |||w_{x_{n+1}}||⟩ |x_{n+1}⟩.   (15)

By the previous sentence we mean that the information is encoded in the computational basis (for example, in binary in the registers), but there are multiple vectors in superposition. Nevertheless, this does not represent a problem for performing operations that are the reversible equivalent of classical gates. We are therefore performing arithmetic operations in the computational basis, controlled by the register indicating the point. Multiplying the first two registers, again in the computational basis:

Σ_{x_{n+1}} |P_c(x_{n+1}) ||w_{x_{n+1}}||⟩ |x_{n+1}⟩.   (16)

The rest of the registers are uncomputed as they are no longer needed. This means that once we have |P_c(x_{n+1})||w_{x_{n+1}}||⟩ in a register, we perform the inverse circuit on all the intermediate results that we had obtained, following the scheme depicted in figure 4.

VI. FINDING THE TARGET POINT x_{n+1}.
We now have the state (16). We could try to find the point x_{n+1} with the best informativeness P_c(x_{n+1})||w_{x_{n+1}}||. In this case, using Grover search ensures that we can do it in time O(√N) [28], which is less than doing it by brute force classically in a landscape with many local maxima, but still exponential in the dimension of the system. Also, the problem with aiming to find the absolute maximum is that we introduced an approximation when we made κ = κ_eff, so we might not get the absolute best point. This procedure is thus very expensive and without a success guarantee, due to the low-rank approximation. Thus, we instead aim to find one point in the top 1/C portion of the points, which is much cheaper.

FIG. 4. Circuit that uncomputes the second register. This procedure is used for example in the Quantum Singular Value Estimation subroutine described in [27].

One can then perform the following procedure:

• Use bisection search and Amplitude Estimation to figure out from what value of P_c(x_{n+1})||w_{x_{n+1}}|| the points represent this fraction of the N points. Since we want Amplitude Estimation to have a reasonable error, we need the error to be ε = O(1/C), and therefore the cost of the Amplitude Estimation procedure is O(ε^{-1}) = O(C).
• Once the threshold of P_c(x_{n+1})||w_{x_{n+1}}|| above which we consider a point to be interesting has been found, perform Amplitude Amplification on all those points that are above it, and finally sample one at random from the remaining ones. The cost of Amplitude Amplification is O(√C).
Therefore, even if N is a very large number (exponential in the dimension of the space m), the cost will be logarithmic in it and therefore just polynomial in m. This procedure will only be efficient if C is independent of N, since otherwise it is just better to use [28] to find the top one, at cost O(√N). This means, for example, that we cannot make C so large that we are attempting to find the best point, as this would make C = N, and there would be a dependence on N. Rather, we want C to be a constant independent of N, like the 1% mentioned previously. However, as said previously, this is in contrast to what happens in the case where we try to do this classically. To obtain one of the 1% most informative points, one would need to sample heavily from the N = O(l^m) points.

VII. COMPLEXITY
In this section we calculate the complexity of the proposed quantum algorithm.
The first step, as we said earlier, is calculating the probability that a given item or point of the high-dimensional space is in a certain class, $P_c(x_{n+1})$. This implies solving a quantum SVM [14], with complexity $O(d\,\epsilon^{-3}\kappa_{\rm eff}^{3}\log(mn))$, where $d$ is the order of the kernel, $m$ the dimension of the space, $n$ the number of already classified points, and $\kappa_{\rm eff}$ an effective condition number, which can be taken $O(1)$ for this problem. Reading out the solution using Amplitude Estimation in the Swap Test costs an additional $O(\epsilon^{-1})$. Since we assume that the matrix has been rescaled so that its largest eigenvalue equals 1, the procedure of [14] implies that the smallest eigenvalue we take into consideration is $\lambda = \kappa_{\rm eff}^{-1}$. The low-rank approximation induces an additional error of $O(n^{-1/2})$, as explained in the Supplementary Material of [14].
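The Swap Test readout mentioned here can be illustrated with a small classical simulation. The sketch below assumes the standard swap-test acceptance probability $p = 1/2 + |\langle u|v\rangle|^2/2$; the vectors and the Monte Carlo loop are illustrative only. Plain sampling needs $O(\epsilon^{-2})$ shots to reach precision $\epsilon$, which is why Amplitude Estimation, at $O(\epsilon^{-1})$, is used instead.

```python
import math
import random

# Swap-test readout: the test measures |0> with probability
# p = 1/2 + |<u|v>|^2 / 2, so the overlap is recoverable from p.
def swap_test_accept_prob(u, v):
    overlap = sum(a * b for a, b in zip(u, v))  # real amplitudes for simplicity
    return 0.5 + 0.5 * overlap ** 2

def estimate_overlap(u, v, shots, rng):
    """Monte Carlo estimate of |<u|v>| from simulated swap-test shots.
    Needs O(eps^-2) shots for precision eps, versus O(eps^-1) queries
    with Amplitude Estimation."""
    accepts = sum(rng.random() < swap_test_accept_prob(u, v)
                  for _ in range(shots))
    p_hat = accepts / shots
    return math.sqrt(max(0.0, 2 * p_hat - 1))
```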
The next step is calculating the quantum SVM for the system with the added point, again with complexity $O(d\,\epsilon^{-3}\kappa_{\rm eff}^{3}\log(mnN))$. The $\log N$ term comes from performing operations in superposition, as $\log N$ qubits are needed to represent each point. Then, performing Amplitude Estimation to estimate the absolute value and sign of the solution costs $O(\epsilon^{-1}(n+1))$. The $\epsilon^{-1}$ is inherent to the Amplitude Estimation procedure, whereas the $n+1$ factor comes from needing to read out $n+1$ entries and also compare the relative signs.
Finally, if we are interested in finding the top $1/C$ portion of the scanned points with error $\epsilon/C$, Amplitude Estimation implies a cost of $O(C\epsilon^{-1})$ and is independent of $m$, the dimension of the space. Amplitude Amplification only contributes an additive $O(\sqrt{C})$ to the complexity of Amplitude Estimation. Thus, in general, the complexity will be $O(Cnd\,\kappa_{\rm eff}^{3}\epsilon^{-5}\,\mathrm{poly}\log(nmN))$, with error $O(\epsilon\, n^{-1/2})$, as long as we are interested in the top $1/C$ portion of most promising points of the set of $N$ points. The $n^{-1/2}$ comes from the low-rank approximation of [14], as we mentioned. This allows us to probe a large number of points at the same time, with polynomial dependence on the number of already classified points $n$ and the dimension of the space $m$, as $\log N = O(m)$. This means that we have avoided the exponential cost in highly dimensional spaces that we would incur if we tried to sample directly from such a space.
Finally, notice that the main speedup comes from the ability of Amplitude Estimation to estimate the fraction of points whose value $P_c(x_{n+1})\|w_{x_{n+1}}\|$ lies above a certain threshold. The theorem is

Theorem 2 (Amplitude Estimation) [19]: For any positive integer $k$, the Amplitude Estimation algorithm outputs an estimate $0 \le \tilde{a} \le 1$ of the desired amplitude $a$ such that $|\tilde{a} - a| \le 2\pi k \sqrt{a(1-a)}/J + k^2\pi^2/J^2$, with success probability at least $8/\pi^2$ for $k = 1$, and with success probability greater than $1 - \frac{1}{2(k-1)}$ for $k \ge 2$. Here $J$ is defined as the number of implementations we need of the oracle that tells whether an element is marked. Also, if $a = 0$ then $\tilde{a} = 0$, and if $a = 1$ and $J$ is even, then $\tilde{a} = 1$.
Proof omitted. In our case, notice that $a = O(C^{-1})$, and we want the error $\epsilon = |a - \tilde{a}| = O(C^{-1})$ too. Therefore, we will need $J = O(C)$ implementations of the procedure, which is independent of the number of analyzed points $N$.
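The error bound of Theorem 2 can be checked numerically. The sketch below evaluates the bound $2\pi k\sqrt{a(1-a)}/J + k^2\pi^2/J^2$ for $a = 1/C$ and $J$ a modest constant multiple of $C$, confirming that $J = O(C)$ oracle calls suffice for the $O(1/C)$ error regime used in the text (the constants 100 and 40 are illustrative choices, not from the paper).

```python
import math

# Amplitude Estimation error bound of Theorem 2:
#   |a~ - a| <= 2*pi*k*sqrt(a*(1-a))/J + k^2 * pi^2 / J^2
def ae_error_bound(a, J, k=1):
    return 2 * math.pi * k * math.sqrt(a * (1 - a)) / J + (k * math.pi / J) ** 2

C = 100          # illustrative constant: target the top 1/C fraction
a = 1.0 / C      # amplitude to estimate, a = O(1/C)
J = 40 * C       # J = O(C) oracle implementations
assert ae_error_bound(a, J) < 1.0 / C  # error is below the O(1/C) target
```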
In contrast, if we wanted to estimate classically the number of points whose value $P_c(x_{n+1})\|w_{x_{n+1}}\|$ is above a certain threshold, we would need to sample heavily from a highly dimensional space, a process that is exponential in the dimension of the space $m$.

VIII. CONCLUSION
In the previous sections we have seen that, given a large and error-resistant quantum computer, it would be possible to establish a polynomial-complexity sampling procedure which lets us know which are the most promising points to be added to the training set in order to get a better approximation of the minimum distance between classes, $r_p$. Recall that knowing $r_p$ is a requisite for applying a $\delta$-cover and, using Theorem 1, avoiding adversarial examples. This protocol is heuristic, which means that in order to check its actual performance we would have to run it on a realistic quantum computer.

Nevertheless, we believe that the same strategy we have used to sample from the top $1/C$ most useful points could be employed in other setups. A particularly interesting technique is the ability to perform arithmetic and other operations on information encoded in the computational basis, in superposition, carrying out the same procedure on several vectors at once. For a more detailed explanation of how to do this, see Appendix A. The problem we solved in this paper is therefore an example of a more general one:

Problem 2 (General Informative Points Sampling Problem). Given a high-dimensional space $\mathbb{R}^m$ and a highly non-linear function $f: \mathbb{R}^m \to \mathbb{R}$, sample a point $x \in \mathbb{R}^m$ such that with high probability $f(x)$ is in the top $1/C$ fraction of all points in the space. That is, if $\tilde{x} \in \mathbb{R}^m$ is another point sampled uniformly at random, then $P(f(\tilde{x}) \le f(x)) = 1 - 1/C$.
In this paper we have presented a procedure that solves this problem efficiently, in time polynomial in the dimension of the space $m$ and in $C$, whenever $C$ is a constant independent of the number of points $N = l^m$ in the space. In contrast, classically one would need to sample $O(N)$ points to be able to choose one of the desired solutions. Therefore, one may say that for this problem we have avoided what is usually called 'the curse of dimensionality'.

Appendix A

The first step would be calculating in the basis, to a given finite precision, the state $\frac{1}{2}\left(|00\rangle|\sqrt{0}\rangle + |01\rangle|\sqrt{1/2}\rangle + |10\rangle|\sqrt{2/3}\rangle + |11\rangle|\sqrt{3/4}\rangle\right)|0\rangle = \frac{1}{2}\left(|00\rangle|00000000\rangle|0\rangle + |01\rangle|10110101\rangle|0\rangle + \dots\right)$, where the second register holds the binary representation of the decimal number. To perform the rotation of the third register, one can apply a series of amplitude amplification rotations (with target $|1\rangle$), controlled on the qubits of the second register, where the angle $\theta$ of the subroutine starts at $\pi/8$ and is halved in each rotation. Notice that we chose $\theta = \pi/8$ because after the first rotation the third register would then be $\cos(2\theta)|0\rangle + \sin(2\theta)|1\rangle$. This method allows us to quickly, in $O(\mathrm{poly}\log \epsilon^{-1})$, encode in the amplitude whatever is already encoded in the basis. Therefore the encoding is efficient, because the calculations in the basis were also efficient. After the rotations are completed, one can uncompute the second register, leaving the desired result.
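Our reading of this halved-angle rotation scheme can be checked numerically. The sketch below (plain Python; `nbits` plays the role of the finite precision) composes one controlled rotation per bit of a binary fraction $x = 0.b_1 b_2\ldots$ and verifies that the resulting $|1\rangle$ amplitude is the fixed monotone function $\sin\!\left(\frac{\pi}{2}x\right)$ of the basis-encoded value, using only $O(\#\text{bits})$ rotations.

```python
import math

# Halved-angle rotation scheme: the first controlled rotation (theta = pi/8)
# rotates the third register by 2*theta = pi/4; theta is halved for each
# subsequent bit. The total angle is then (pi/2) * x for x = 0.b1 b2 ...,
# so the third register ends in cos((pi/2)x)|0> + sin((pi/2)x)|1>.
def bits_of(x, nbits):
    """Binary-fraction digits of x in [0, 1), most significant first."""
    bits = []
    for _ in range(nbits):
        x *= 2
        bits.append(int(x))
        x -= int(x)
    return bits

def encoded_amplitude(x, nbits=16):
    """Amplitude of |1> after composing the controlled rotations."""
    theta = math.pi / 8
    total = 0.0
    for b in bits_of(x, nbits):
        if b:
            total += 2 * theta  # controlled rotation by 2*theta
        theta /= 2
    return math.sin(total)

# e.g. x = sqrt(1/2), as in the superposition above
x = math.sqrt(0.5)
assert abs(encoded_amplitude(x) - math.sin(math.pi * x / 2)) < 1e-4
```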
From this it is very simple to see that one can also perform Amplitude Estimation of $|1\rangle$ in superposition. To do so, call $A$ the whole previous encoding procedure and insert it in the usual Amplitude Estimation procedure depicted in Figure 5. Notice that the first register is not depicted, although it plays an important role as a controller for $A$.
Thus, in the end one expects to get $|i\rangle|y\rangle$, which can be transformed in the basis to $|i\rangle|\sin(\pi y/J)\rangle \approx |i\rangle|\sqrt{i/(i+1)}\rangle$. This example proves that it is possible to perform the quantum and classical subroutines in quantum superposition, as needed for our algorithm.
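This post-processing step admits a small classical check, assuming the standard Amplitude Estimation relation in which outcome $y$ yields the amplitude estimate $\sin(\pi y/J)$ (the value of $J$ below is an illustrative choice).

```python
import math

# Amplitude Estimation post-processing: outcome y maps back to the
# amplitude estimate sin(pi * y / J). For an encoded amplitude amp, the
# most likely outcome is the integer nearest to J * arcsin(amp) / pi,
# so the recovered estimate lands within O(1/J) of amp.
def most_likely_outcome(amp, J):
    return round(J * math.asin(amp) / math.pi)

def amplitude_from_outcome(y, J):
    return math.sin(math.pi * y / J)

J = 1024  # illustrative number of oracle applications
for i in range(1, 5):
    amp = math.sqrt(i / (i + 1))  # the amplitudes sqrt(i/(i+1)) above
    est = amplitude_from_outcome(most_likely_outcome(amp, J), J)
    assert abs(est - amp) < math.pi / (2 * J) + 1e-12
```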