Hardening Quantum Machine Learning Against Adversaries

Security for machine learning has begun to become a serious issue for present day applications. An important question remaining is whether emerging quantum technologies will help or hinder the security of machine learning. Here we discuss a number of ways that quantum information can be used to help make quantum classifiers more secure or private. In particular, we demonstrate a form of robust principal component analysis that, under some circumstances, can provide an exponential speedup relative to robust methods used at present. To demonstrate this approach we introduce a linear combinations of unitaries Hamiltonian simulation method that we show functions when given an imprecise Hamiltonian oracle, which may be of independent interest. We also introduce a new quantum approach for bagging and boosting that can use quantum superposition over the classifiers or splits of the training set to aggregate over many more models than would be possible classically. Finally, we provide a private form of $k$--means clustering that can be used to prevent an all powerful adversary from learning more than a small fraction of a bit from any user. These examples show the role that quantum technologies can play in the security of ML and vice versa. This illustrates that quantum computing can provide useful advantages to machine learning apart from speedups.


I. INTRODUCTION
The use of machine learning in mission-critical industrial applications, from self-driving cars [1] to detecting malignant tumors [2] to detecting fraudulent credit card transactions [3], has risen sharply: machine learning is now crucial in decision making. However, classical machine learning algorithms such as principal component analysis (used heavily in anomaly detection scenarios), clustering (used in unsupervised learning) and support vector machines (used in classification scenarios), as commonly implemented, are extremely vulnerable to changes to the input data, the features and the final model parameters or hyper-parameters that have been learned. Essentially, an attacker can exploit any of these vulnerabilities to subvert ML algorithms. As a result, an attacker can pursue a variety of goals: increasing the false negative rate so as to remain undetected (for instance, in the case of spam, junk emails are classified as normal) [4]; increasing the false positive rate (for instance, in the case of intrusion detection systems, attacks get drowned in a sea of noise, which causes the system to shift the baseline activity) [5,6,7]; stealing the underlying model itself by exploiting membership queries [8]; and even recovering the underlying training data, breaching privacy contracts [8].
In the machine learning literature, this study is referred to as adversarial machine learning and has largely been applied to security-sensitive areas such as intrusion detection [9,10] and spam filtering [4]. To combat the problem of adversarial machine learning, solutions have been studied from different vantage points. From a statistical standpoint, adversaries are treated as noise and thus the models are hardened using robust statistics to overcome the malicious outliers [11]. Adversarial training is a promising trend in this space, wherein defenders train the system with adversarial examples from the start so that the model is acclimatized to such threats. From a security standpoint, there has been substantial work surrounding threat modeling of machine learning systems [12,13] and frameworks for anticipating different kinds of attacks [14].
Quantum computing has experienced a similar surge of interest of late and this has led to a synergistic relationship wherein quantum computers have been found to have profound implications for machine learning [15][16][17][18][19][20][21][22][23][24][25][26][27] and machine learning has been shown to be invaluable for characterizing, controlling and error correcting such quantum computers [28][29][30]. However, the question of what quantum computers can do to protect machine learning from adversaries remains relatively underdeveloped. This is perhaps surprising given that applications of quantum computing to security have a long history.
The typical nexus of quantum computing and security is studied in the light of quantum cryptography and its ramifications for key exchange and management. This paper takes a different tack. We explore this intersection by asking questions about the use of quantum subroutines to analyze patterns in data and about security assurance; specifically, we ask the question: are quantum machine learning systems secure?
Our aim is to address this question by investigating the role that quantum computers may have in protecting machine learning models from attackers. In particular, we introduce a robust form of quantum principal component analysis that provides exponential speedups while making the learning protocol much less sensitive to noise introduced by an adversary into the training set. We then discuss bagging and boosting, which are popular methods for making models harder to extract by adversaries and also serve to make better classifiers by combining the predictions of weak classifiers. Finally, we discuss how to use ideas originally introduced to secure quantum money to boost the privacy of k-means by obfuscating the private data of participants from even an all-powerful adversary. From this we conclude that quantum methods can be used to have an impact on security and that, while we have shown defences against some classes of attacks, more work is needed before we can claim to have a fully developed understanding of the breadth and depth of adversarial quantum machine learning.
arXiv:1711.06652v1 [quant-ph] 17 Nov 2017

II. ADVERSARIAL QUANTUM MACHINE LEARNING
As machine learning becomes more ubiquitous in applications so too do attacks on the learning algorithms that they are based on. The key assumption usually made in machine learning is that the training data is independent of the model and the training process. For tasks such as classification of images from ImageNet such assumptions are reasonable because the user has complete control of the data. For other applications, such as developing spam filters or intrusion detection, this may not be reasonable because in such cases training data is provided in real time to the classifier and the agents that provide the information are likely to be aware of the fact that their actions are being used to inform a model.
Perhaps one of the most notable examples of this is the Tay chat bot incident. Tay was a chat bot designed to learn from users that it could freely interact with in a public chat room. Since the bot was programmed to learn from human interactions it could be subjected to what is known as a wolf pack attack, wherein a group of users in the chat room collaborated to purposefully change Tay's speech patterns to become increasingly offensive. After 16 hours the bot was pulled from the chat room. This incident underscores the need to build models that can learn while at the same time resist malfeasant interactions on the part of a small fraction of users of a system.
A major aim of adversarial machine learning is to characterize and address such problems by making classifiers more robust to such attacks or by giving better tools for identifying when such an attack is taking place. There are several broad classes of attacks that can be considered, but perhaps the two most significant in the taxonomy of attacks against classifiers are exploratory attacks and causative attacks. Exploratory attacks are not designed to explicitly impact the classifier but instead are intended to give an adversary information about the classifier. Such attacks work by the adversary feeding test examples to the classifier and then inferring information about it from the classes (or meta data) returned. The simplest such attack is known as an evasion attack, which aims to find test vectors that when fed into a classifier get misclassified in a way that benefits the adversary (such as spam being misclassified as ordinary email). More sophisticated exploratory attacks may even try to identify the model used to assign the class labels or in extreme cases may even try to identify the training set for the classifier. Such attacks can be deployed as a precursor to causative attacks or can simply be used to violate privacy assumptions that the users that supplied data to train the classifier may have had.
Causative attacks are more akin to the Tay example. The goal of such attacks is to change the model by providing it with training examples. Again there is a broad taxonomy of causative attacks but one of the most ubiquitous attacks is the poisoning attack. A poisoning attack seeks to control a classifier by introducing malicious training data into the training set so that the adversary can force an incorrect classification for a subset of test vectors. One particularly pernicious type of attack is the boiling frog attack, wherein the amount of malicious data introduced in the training set is slowly increased over time. This makes it much harder to identify whether the model has been compromised by users because the impact of the attack, although substantial, is hard to notice on the timescale of days.
While a laundry list of attacks is known against machine learning systems, the defences that have been developed thus far are somewhat limited. A commonly used tactic is to replace classifiers with robust versions of the same classifier. For example, consider k-means classification. Such classifiers are not necessarily robust because the intracluster variance is used to decide the quality of a clustering. This means that an adversary can provide a small number of training vectors with large norm that can radically impact the cluster assignment. On the other hand, if robust statistics such as the median are used then the impact that the poisoned data can have on the cluster centroids is minimal. Similarly, bagging can also be used to address these problems by replacing a singular classifier that is trained on the entire training set with an ensemble of classifiers that are trained on subsets of the training set. By aggregating over the class labels returned by this process we can similarly limit the impact of an adversary that controls a small fraction of the data set.
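The sensitivity of mean-based centroids to a few large-norm vectors can be illustrated with a minimal classical sketch. The data set, the 5% poisoning fraction and the outlier magnitude below are assumptions chosen purely for illustration, and the coordinate-wise median stands in for whichever robust statistic is used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single cluster: 95 honest points near the origin.
honest = rng.normal(loc=0.0, scale=1.0, size=(95, 2))
# Assumed attack: 5 poisoned points with very large norm.
poison = np.full((5, 2), 1000.0)
data = np.vstack([honest, poison])

# Centroid estimates on the poisoned data.
mean_centroid = data.mean(axis=0)
median_centroid = np.median(data, axis=0)  # coordinate-wise median

# How far each estimate moved relative to the honest data alone.
mean_shift = np.linalg.norm(mean_centroid - honest.mean(axis=0))
median_shift = np.linalg.norm(median_centroid - np.median(honest, axis=0))
print(mean_shift, median_shift)
```

In this sketch the mean-based centroid is dragged far from the honest cluster, whereas the median-based centroid moves only slightly.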
Given such examples, two natural questions arise: 1) "How should we model threats in a quantum domain?" and 2) "Can quantum computing be used to make machine learning protocols more secure?". In order to address the former problem it is important to consider the access model used for the training data for the quantum ML algorithm. Perhaps the simplest model to consider is a QRAM wherein the training data is accessed using a binary access tree that allows, in depth $O(n)$ but size $O(2^n)$, the training data to be accessed as bit strings. For example, up to isometries, $U_{\rm QRAM}|j\rangle|0\rangle = |j\rangle|[v_j]\rangle$ where $[v_j]$ is a qubit string used to encode the $j$th training vector. Similarly such an oracle can be used to implement a second type of oracle which outputs the vector as a quantum state vector (we address the issue of non-unit norm training examples below): $U|j\rangle|0\rangle = |j\rangle|v_j\rangle$. Alternatively, one can consider a density matrix query for use in an LMR algorithm for density matrix exponentiation [31,32] wherein a query to the training data takes the form $B|0\rangle \rightarrow \rho$ where $\rho$ is a density operator that is equivalent to the distribution over training vectors. For example, if we had a uniform distribution over training vectors in a training set of $N$ training vectors then $\rho = \sum_{j=1}^{N} |v_j\rangle\langle v_j| / \mathrm{Tr}\left(\sum_{j=1}^{N} |v_j\rangle\langle v_j|\right)$. With such quantum access models defined we can now consider what it means to perform a quantum poisoning attack. A poisoning attack involves an adversary purposefully altering a portion of the training data in the classical case, so in the quantum case it is natural to define a poisoning attack similarly. In this spirit, a quantum poisoning attack takes some fraction of the training vectors, $\eta$, and replaces them with training vectors of the adversary's choosing.
That is to say, if without loss of generality an adversary replaces the first $\eta N$ training vectors in the data set then the new oracle that the algorithm is provided is of the form $U_{\rm QRAM}|j\rangle|0\rangle = |j\rangle|[\phi_j]\rangle$ if $j \le \eta N$ and $U_{\rm QRAM}|j\rangle|0\rangle = |j\rangle|[v_j]\rangle$ otherwise, where $|[\phi_j]\rangle$ is a bit string of the adversary's choice. Such attacks are reasonable if a QRAM is stored on an untrusted machine that the adversary has partial access to, or alternatively if the data provided to the QRAM is partially under the adversary's control. The case where the queries provide training vectors, rather than qubit vectors, is exactly the same. The case of a poisoning attack on density matrix input is much more subtle; in such a case we define the poisoning attack to take the form $B|0\rangle = \rho' : |\mathrm{Tr}(\rho - \rho')| \le \eta$, although other definitions are possible. Such an example is a quantum causative attack since it seeks to cause a change in the quantum classifier. An example of a quantum exploratory attack could be an attack that tries to identify the training data or the model used in the classification process. For example, consider a quantum nearest neighbor classifier. Such classifiers search through a large database of training examples in order to find the closest training example to a test vector that is input. By repeatedly querying the classifier using adversarially chosen quantum training examples it is possible to find the decision boundaries between the classes and from this even the raw training data can, to some extent, be extracted. Alternatively one could consider cases where an adversary has access to the network on which a quantum learning protocol is being carried out and seeks to learn compromising information about data that is being provided for the training process, which may be anonymized at a later stage in the algorithm.
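The effect of replacing the first $\eta N$ training vectors on the density matrix access model can be simulated classically. The sketch below is only illustrative: the helper names `density_matrix`, `poison` and `trace_distance` are hypothetical, the adversary's replacement vector is an arbitrary choice, and the trace distance is used as one possible measure of how far the poisoned state moves, which is an assumption rather than the definition used in the text.

```python
import numpy as np

def density_matrix(vectors):
    """rho = sum_j |v_j><v_j| / Tr(sum_j |v_j><v_j|) for rows v_j."""
    v = np.asarray(vectors, dtype=complex)
    unnormalized = v.T @ v.conj()  # entry (a, b) = sum_j v_j[a] * conj(v_j[b])
    return unnormalized / np.trace(unnormalized).real

def poison(vectors, replacements):
    """Replace the first len(replacements) training vectors."""
    out = np.array(vectors, dtype=float, copy=True)
    out[: len(replacements)] = replacements
    return out

def trace_distance(rho, sigma):
    """Half the sum of absolute eigenvalues of the Hermitian difference."""
    return 0.5 * np.abs(np.linalg.eigvalsh(rho - sigma)).sum()

rng = np.random.default_rng(1)
N, dim, eta = 20, 4, 0.1
train = rng.normal(size=(N, dim))
train /= np.linalg.norm(train, axis=1, keepdims=True)  # unit training vectors

# Assumed adversarial choice: every poisoned vector is the first basis vector.
n_bad = round(eta * N)
bad = np.tile(np.eye(dim)[0], (n_bad, 1))

rho = density_matrix(train)
rho_poisoned = density_matrix(poison(train, bad))
print(trace_distance(rho, rho_poisoned))
```

For unit-norm training vectors, replacing a fraction $\eta$ of them can move the state by at most $\eta$ in trace distance, which the example respects.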
Quantum computing can help, to some extent, with both of these issues. Poisoning attacks can be addressed by building robust classifiers, that is, classifiers that are insensitive to changes in individual vectors. We show that quantum technologies can help with this by illustrating how quantum principal component analysis and quantum bootstrap aggregation can be used to make the decisions more robust to poisoning attacks. We do so by proposing new variants of these algorithms that are amenable to fast quantum algorithms for medians, which are inserted in place of the expectation values typically used in such cases. We also illustrate how ideas from quantum communication can be used to thwart exploratory attacks by proposing a private version of k-means clustering that allows the protocol to be performed without (substantially) compromising the private data of any participants and also without requiring that the individual running the experiment be authenticated.

III. ROBUST QUANTUM PCA
The idea behind principal component analysis is simple. Imagine you have a training set composed of a large number of vectors that live in a high dimensional space. However, often even high dimensional data can have an effective low-dimensional approximation. Finding such representations in general is a fine art, but quantum principal component analysis provides a prescriptive way of doing this. The idea of principal component analysis is to examine the eigenvectors of the covariance matrix for the data set. These eigenvectors are the principal components, which give the directions of greatest and least variance, and their eigenvalues give the magnitude of the variation. By transforming the feature space to this eigenbasis and then projecting out the components with small eigenvalue one can reduce the dimensionality of the feature space.
One common use for this, outside of feature compression, is to detect anomalies in training sets for supervised machine learning algorithms. Imagine that you are the administrator of a network and you wish to determine whether your network has been compromised by an attacker. One way to detect an intrusion is to look at the packets moving through the network and use principal component analysis to determine whether the traffic patterns are anomalous based on previous logs. The detection of an anomalous result can be performed automatically by projecting the traffic patterns onto the principal components of the traffic data. If the new traffic is consistent with the logged data it should have high overlap with the eigenvectors with large eigenvalue, and if it is not then it should have high overlap with the eigenvectors with small eigenvalue.
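A minimal classical sketch of this projection test follows. The function name `low_variance_score`, the synthetic "traffic logs" and the choice of keeping a single principal component are assumptions made only for illustration.

```python
import numpy as np

def low_variance_score(train, x, n_keep):
    """Fraction of the centered test vector's squared norm lying outside
    the n_keep highest-variance principal components of the training data."""
    mean = train.mean(axis=0)
    cov = np.cov(train - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, -n_keep:]              # highest-variance directions
    centered = x - mean
    residual = centered - top @ (top.T @ centered)
    return float(residual @ residual / (centered @ centered))

rng = np.random.default_rng(2)
# Hypothetical logs: large variance along axis 0, small variance elsewhere.
logs = rng.normal(size=(500, 3)) * np.array([10.0, 0.1, 0.1])

typical = np.array([8.0, 0.05, -0.02])   # mostly along the dominant direction
anomalous = np.array([0.0, 3.0, 2.0])    # mostly along low-variance directions

typical_score = low_variance_score(logs, typical, n_keep=1)
anomalous_score = low_variance_score(logs, anomalous, n_keep=1)
print(typical_score, anomalous_score)
```

A score near zero indicates traffic that lies in the high-variance subspace of the logs, while a score near one flags traffic concentrated in the low-variance subspace.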
While this technique can be used in practice it can have a fatal flaw. The flaw is that an adversary can inject spikes of usage in directions that align with particular principal components of the classifier. This allows them to increase the variance of the traffic along principal components that can be used to detect their subsequent nefarious actions. To see this, let us restrict our attention to poisoning attacks wherein an adversary controls some constant fraction of the training vectors.
Robust statistics can be used to help mitigate such attacks. The way that it can be used to help make PCA secure is by replacing the mean with a statistic like the median. Because the median is insensitive to rare but intense events, an adversary needs to control much more of the traffic flowing through the system in order to fool a classifier built to detect them. For this reason, switching to robust PCA is a widely used tactic for making principal component analysis more secure. To formalize this, let us define what the robust PCA matrix is first.
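Definition 1. Let $x_j : j = 1, \ldots, N$ be a set of real vectors and let $e_k : k = 1, \ldots, N_v$ be basis vectors in the computational basis. Then we define the robust PCA matrix as $M_{k,\ell} := \mathrm{median}\left([e_k^T x_j - \mathrm{median}(e_k^T x_j)][e_\ell^T x_j - \mathrm{median}(e_\ell^T x_j)]\right)$. This is very similar to the PCA matrix, which we can express as $\mathrm{mean}\left([e_k^T x_j - \mathrm{mean}(e_k^T x_j)][e_\ell^T x_j - \mathrm{mean}(e_\ell^T x_j)]\right)$. Because of the similarity that it has with the PCA matrix, switching from a standard PCA algorithm to a robust PCA algorithm in classical classifiers is a straightforward substitution.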
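The substitution can be sketched classically as follows. The sketch assumes that the robust PCA matrix is obtained from the usual covariance construction by replacing every mean with a median, taken entrywise over the samples; the function names and the 5% spike attack are hypothetical.

```python
import numpy as np

def pca_matrix(x):
    """Entry (k, l): mean over samples of products of mean-centered components."""
    c = x - x.mean(axis=0)
    return (c[:, :, None] * c[:, None, :]).mean(axis=0)

def robust_pca_matrix(x):
    """Same construction with every mean replaced by a median (assumed form)."""
    c = x - np.median(x, axis=0)
    return np.median(c[:, :, None] * c[:, None, :], axis=0)

rng = np.random.default_rng(3)
clean = rng.normal(size=(400, 3))
poisoned = clean.copy()
poisoned[:20, 0] = 100.0  # assumed attack: 5% of samples spike along axis 0

# Spectral-norm change of each matrix caused by the poisoned samples.
mean_change = np.linalg.norm(pca_matrix(poisoned) - pca_matrix(clean), 2)
median_change = np.linalg.norm(
    robust_pca_matrix(poisoned) - robust_pca_matrix(clean), 2
)
print(mean_change, median_change)
```

In this example the mean-based matrix changes by orders of magnitude more than the median-based one under the same attack.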
In quantum algorithms, building such robustness into the classifier is anything but easy. The challenge we face in doing so is that the standard approach to this uses the fact that quantum mechanical state operators can be viewed as a covariance matrix for a data set [31,32]. By using this similarity in conjunction with the quantum phase estimation algorithm the method provides an expedient way to project onto the eigenvectors of the density operator and in turn the principal components. However, the robust PCA matrix given in Definition 1 does not have such a natural quantum analogue. Moreover, the fact that quantum mechanics is inherently linear makes it more challenging to apply an operation like the median which is not a linear function of its inputs. This means that if we want to use quantum PCA in an environment where an adversary controls a part of the data set, we will need to rethink our approach to quantum principal component analysis.
The challenges faced by directly applying density matrix exponentiation also suggest that we should consider quantum PCA within a different cost model than that usually used in quantum principal component analysis. We wish to examine whether data obtained from a given source, which we model as a black box oracle, is typical of a well understood training set or not. We are not interested in the principal components of the vector, but rather are interested in its projection onto the low variance subspace of the training data. We also assume that the training vectors are accessed in an oracular fashion and that nothing a priori is known about them.
Definition 2. Let $U$ be a self-adjoint unitary operator acting on a finite dimensional Hilbert space such that $U|j\rangle|0\rangle = |j\rangle|v_j\rangle$ where $v_j$ is the $j$th training vector.
Lemma 3. There exists a set of unit vectors in a Hilbert space of dimension $N(2N_v + 1)$ such that for any $|x_j\rangle$, $|x_k^\dagger\rangle$ within this set $\langle x_j^\dagger | x_k\rangle = \langle x_j, x_k\rangle / R^2$.
Proof. The proof is constructive. Let us assume that $R > 0$. First let us define a state for any $x_j$ and then encode it; if $R = 0$ a separate encoding is used. It is then easy to verify that $\langle x_j^\dagger | x_k\rangle = \langle x_j, x_k\rangle / R^2$ for all pairs of $j$ and $k$, even if $R = 0$. The resultant vectors are defined on a tensor product of two vector spaces. The first is of dimension $N$ and the second is of dimension at least $2N_v + 1$. Since the dimension of a tensor product of Hilbert spaces is the product of the subsystem dimensions, the dimension of the overall Hilbert space is $N(2N_v + 1)$ as claimed.
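One construction that achieves the stated inner products can be checked numerically. The encoding below is an assumption made for illustration and is not claimed to be the one used in the proof: each vector is padded with an orthogonal "flag" component so that it has unit norm, and the daggered copy uses a disjoint set of flags so that the padding never contributes to $\langle x_j^\dagger | x_k\rangle$. Here `d` plays the role of the vector dimension and `n` the number of vectors, giving a space of dimension `d * (2 * n + 1)`; only the case $R > 0$ is handled.

```python
import numpy as np

def encode(vectors, R):
    """Return (kets, daggers): unit vectors of dimension d * (2n + 1), R > 0."""
    x = np.asarray(vectors, dtype=float)
    n, d = x.shape
    kets = np.zeros((n, d, 2 * n + 1))
    daggers = np.zeros((n, d, 2 * n + 1))
    for j in range(n):
        norm = np.linalg.norm(x[j])
        # Any unit direction works for a zero vector; use the first basis vector.
        unit = x[j] / norm if norm > 0 else np.eye(d)[0]
        pad = np.sqrt(max(0.0, 1.0 - (norm / R) ** 2))
        kets[j, :, 0] = daggers[j, :, 0] = x[j] / R  # shared component
        kets[j, :, 1 + j] = pad * unit               # flag index j
        daggers[j, :, 1 + n + j] = pad * unit        # disjoint flag index j + n
    return kets.reshape(n, -1), daggers.reshape(n, -1)

rng = np.random.default_rng(4)
x = rng.normal(size=(5, 3))
R = np.linalg.norm(x, axis=1).max()  # upper bound on the vector norms
kets, daggers = encode(x, R)

# Gram matrix between the daggered and undaggered encodings.
gram = daggers @ kets.T
print(np.allclose(gram, x @ x.T / R**2))
```

The design choice of disjoint flag indices is what makes the identity hold for $j = k$ as well, since the padding of a vector never overlaps with the padding of its own daggered copy.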
For the above reason, we can assume without loss of generality that all training vectors are unit vectors. Also for simplicity we will assume that each $v_j$ is real-valued, but the complex case follows similarly and only differs in that both the real and imaginary components of the inner products need to be computed.
Threat Model 1. Assume that the user is provided an oracle, $U$, that allows the user to access real-valued vectors of a fixed dimension as quantum states yielded by an oracle of the form in Lemma 3. This oracle could represent an efficient quantum subroutine or it could represent a QRAM. The adversary is assumed to control, with probability $\alpha < 1/2$, the vector $|x_j\rangle$ yielded by the algorithm for a given coherent query, subject to the constraint that $\|x_j\| \le R$. The aim of the adversary is to affect, through these poisoned samples, the principal components yielded from the data set yielded by $U$.
Within this threat model, the maximum impact that the adversary could have on any of the expectation values that form the principal component matrix is $O(\alpha R)$. Thus any given component of the principal component matrix can be controlled by the adversary provided that $\alpha \gg \mu/R$, where $\mu$ is an upper bound on the componentwise means of the data set. In other words, if $\alpha$ is sufficiently large and the maximum vector length allowed within the algorithm is also large then the PCA matrix can be compromised. This vulnerability comes from the fact that the mean can be dominated by inserting a small number of vectors with large norm.
Switching to the median can dramatically help with this problem as shown in the following lemma, whose proof is trivial but which we present for completeness.
Lemma 4. Let $P(x)$ be a probability distribution on $\mathbb{R}$ with invertible cumulative distribution function CDF. Further, let $\mathrm{CDF}^{-1}$ be its inverse cumulative distribution function and let $\mathrm{CDF}^{-1}$ be Lipschitz continuous with Lipschitz constant $L$ on $[0, 1]$. Assume that an adversary replaces $P(x)$ with a distribution $Q(x)$ that is within total variational distance $\alpha < 1/2$ from $P(x)$. Under these assumptions $|\mathrm{median}(P(x)) - \mathrm{median}(Q(x))| \le \alpha L$.
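Thus we have that $\tfrac{1}{2} - \alpha \le \mathrm{CDF}(y_1)$, where $y_0$ and $y_1$ denote the medians of $P$ and $Q$ respectively. The inverse cumulative distribution function is a monotonically increasing function on $[0, 1]$, which implies, along with our assumption of Lipschitz continuity with constant $L$, that $y_0 - \alpha L \le \mathrm{CDF}^{-1}\left(\tfrac{1}{2} - \alpha\right) \le y_1$. Using the exact same argument but applying the reverse triangle inequality in the place of the triangle inequality gives $y_1 \le y_0 + \alpha L$, which completes the proof.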
Thus under reasonable continuity assumptions on the unperturbed probability distribution Lemma 4 shows that the maximum impact that an adversary can have on the median of a distribution is negligible. In contrast, the mean does not enjoy such stability and as such using robust statistics like the median can help limit the impact of adversarially placed training vectors within a QRAM or alternatively make the classes less sensitive to outliers that might ordinarily be present within the training data.
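The bound of Lemma 4 can be checked on a simple example. The sketch below is an illustration under assumed choices: $P$ is uniform on $[0, L]$, whose inverse cumulative distribution function $p \mapsto Lp$ is Lipschitz with constant $L$, and the adversary replaces each sample with probability $\alpha$ by a distant outlier, which yields a distribution $Q$ at total variation distance $\alpha$ from $P$.

```python
import numpy as np

rng = np.random.default_rng(5)
L, alpha, n = 2.0, 0.1, 200_000

# P: uniform on [0, L]; its inverse CDF p -> L * p has Lipschitz constant L.
clean = rng.uniform(0.0, L, size=n)

# Q: each sample is replaced by a far outlier with probability alpha (assumed attack).
is_poisoned = rng.random(n) < alpha
poisoned = np.where(is_poisoned, 1.0e6, clean)

# Compare how far the median and the mean move under the attack.
median_shift = abs(np.median(poisoned) - np.median(clean))
mean_shift = abs(poisoned.mean() - clean.mean())
print(median_shift, alpha * L, mean_shift)
```

The empirical median moves by less than $\alpha L$, while the shift in the mean grows with the magnitude of the outliers and is unbounded in that sense.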
Theorem 5. Let the minimum eigenvalue gap of the $d$-sparse Hermitian matrix $M$ within the support of the input state be $\lambda$, let each training vector $x_j$ be a unit vector provided by the oracle defined in Lemma 3, and assume that the underlying data distribution for the components of the training data has a continuous inverse cumulative distribution function with Lipschitz constant $L > 0$. We have that
1. The number of queries to the oracle given in Lemma 3 needed to sample from a distribution $P$ over $O(\lambda)$-approximate eigenvalues of $M$ such that for all training vectors
2. Assuming an adversary replaces the data distribution used for the PCA task by one within variational distance $\alpha$ from the original such that $\alpha L \le 1$ and $\alpha < 1/2$, if $M'$ is the poisoned robust PCA matrix then $\|M - M'\|_2 \le 5\alpha L(d + 2)$.
The proof is somewhat involved. It involves the use of linear-combinations-of-unitaries simulation techniques and introduces a generalization of these methods that shows that they can continue to function when probabilistic oracles are used for the matrix elements of the matrix $M$ (see Lemma 16) without entanglement with the control register creating problems for the algorithm. Furthermore, we need to show that the errors in simulation do not have a major impact on the output probability distributions of the samples drawn from the eigenvalues and eigenvectors of the matrix. We do this using an argument based on perturbation theory under the assumption that all relevant eigenvalues have multiplicity 1. For these reasons, we encourage the interested reader to look at the appendix for all technical details regarding the proof.
From Theorem 5 we see that the query complexity required to sample from the eigenvectors of the robust PCA matrix is exponentially lower in some cases than what would be expected from classical methods. While this opens up the possibility of profound quantum advantages for robust principal component analysis, a number of caveats exist that limit the applicability of this method:
1. The cost of preparing the relevant input data $|x\rangle$ will often be exponential unless $|x\rangle$ takes a special form.
2. The proofs we provide here require that the gaps in the relevant portion of the eigenvalue spectrum are large in order to ensure that the perturbations in the eigenvalues of the matrix are not so large as to distort the support of the input test vector.
3. The matrix M must be polynomially sparse.
4. The desired precision for the eigenvalues of the robust PCA matrix is inverse polynomial in the problem size.
5. The eigenvalue gaps are polynomially large in the problem size.
These assumptions preclude the method from providing exponential advantages in many cases; however, they do not preclude exponential advantages in quantum settings where the input is given by an efficient quantum procedure. Some of these caveats can be relaxed by using alternative simulation methods or exploiting stronger promises about the structure of $M$. We leave a detailed discussion of this for future work. The final two assumptions may in practice be the strongest restrictions for our method, or at least the analysis we provide for the method, because the complexity diverges rapidly with both quantities. It is our hope that subsequent work will improve upon this, perhaps by using more advanced methods based on LCU techniques to eschew the use of amplitude estimation as an intermediate step.
Looking at the problem from a broader perspective, we have shown that quantum PCA can be applied to defend against Threat Model 1. Specifically, we have resilience in a manner that helps defend against attacks on the training corpus, which allows the resulting classifiers to be made robust to adversaries who control a small fraction of the training data. This shows that ideas from adversarial quantum machine learning can be imported into quantum classifiers. In contrast, before this work it was not clear how to do this because existing quantum PCA methods cannot be easily adapted from using a mean to a median. Additionally, this robustness can also be valuable in non-adversarial settings because it makes estimates yielded by PCA less sensitive to outliers which may not necessarily be added by an adversary. We will see this theme repeated in the following section where we discuss how to perform quantum bagging and boosting.

IV. BOOTSTRAP AGGREGATION AND BOOSTING
Bootstrap aggregation, otherwise known as bagging for short, is another approach that is commonly used to increase the security of machine learning as well as improve the quality of the decision boundaries of the classifier. The idea behind bagging is to replace a single classifier with an ensemble of classifiers. This ensemble of classifiers is constructed by randomly choosing portions of the data set to feed to each classifier. A common way of doing this is bootstrap aggregation, wherein the training vectors are chosen randomly with replacement from the original training set. Since each vector is chosen with replacement, with high probability each training set will exclude a constant fraction of the training examples. This can make the resultant classifiers more robust to outliers and also make it more challenging for an adversary to steal the model.
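The claim that each bootstrap sample excludes a constant fraction of the training examples can be checked with a short classical simulation. The training set size and number of resampled sets below are arbitrary assumptions. When $n$ indices are drawn with replacement from $n$ examples, a given example is missed with probability $(1 - 1/n)^n$, which approaches $1/e$ for large $n$.

```python
import numpy as np

rng = np.random.default_rng(6)
n_train, n_models = 1000, 50  # assumed sizes, for illustration only

excluded = []
for _ in range(n_models):
    # Bootstrap sample: n_train indices drawn with replacement.
    sample = rng.integers(0, n_train, size=n_train)
    # Fraction of training examples that never appear in this sample.
    excluded.append(1.0 - np.unique(sample).size / n_train)

avg_excluded = float(np.mean(excluded))
print(avg_excluded, (1.0 - 1.0 / n_train) ** n_train)
```

Each member of the ensemble therefore sees a different subset of the data, so a small set of poisoned examples is absent from the training set of a sizeable share of the classifiers.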
We can abstract this process by instead looking at an ensemble of classifiers that are trained using some quantum subroutine. These classifiers may be trained with legitimate data or instead may be trained using data from an adversary. We can envision a classifier, in the worst case, as being compromised if it receives the adversary's training set. As such for our purposes we will mainly look at bagging through the lens of boosting, which uses an ensemble of different classifiers to assign a class to a test vector each of which may be trained using the same or different data sets. Quantum boosting has already been discussed in [25], however, our approach to boosting differs considerably from this treatment.
The type of threat that we wish to address with our quantum boosting protocol is given below.
Threat Model 2. Assume that the user is provided a programmable oracle, $C$, such that $C|j\rangle|x\rangle = |j\rangle C_j|x\rangle$ where each $C_j$ is a quantum classifier that acts on the test vector $|x\rangle$. The adversary is assumed to control a fraction $0 \le \alpha < 1$ of all classifiers $C_j$ that the oracle uses to classify data and wishes to affect the classes reported by the user of the boosted quantum classifier that implements $C$. The adversary is assumed to have no computational restrictions and complete knowledge of the classifiers and training data used in the boosting protocol.
While the concept of a classifier has a clear meaning in classical computing, if we wish to apply the same ideas to quantum classifiers we run into certain challenges. The first such challenge lies with the encoding of the training vectors. In classical machine learning training vectors are typically bit vectors. For example, a binary classifier can then be thought of as a map from the space of test vectors to $\{-1, 1\}$ corresponding to the two classes. That is, if $C$ were a classifier then $Cv = \pm v$ for all vectors $v$ depending on the membership of the vector. Thus every training vector can be viewed as an eigenvector with eigenvalue $\pm 1$. This is illustrated for unit vectors in Figure 1.
Now imagine instead that our test vector is a quantum state $|\psi\rangle$ in $\mathbb{C}^2$. Unlike the classical setting, it may be physically impossible for the classifier to know precisely what $|\psi\rangle$ is because measurement inherently damages the state. This makes the classification task fundamentally different than it is in classical settings where the test vector is known in its entirety. Since such vectors live in a two-dimensional vector space and they comprise an infinite set, not all $|\psi\rangle$ can be eigenvectors of the classifier $C$. However, if we let $C$ be a classifier that has eigenvalue $\pm 1$ we can always express $|\psi\rangle = a|\phi_+\rangle + \sqrt{1 - |a|^2}\,|\phi_-\rangle$, where $C|\phi_\pm\rangle = \pm|\phi_\pm\rangle$. Thus we can still classify in the same manner, but now we have to measure the state repeatedly to determine whether $|\psi\rangle$ has more projection onto the positive or negative eigenspace of the classifier. This notion of classification is a generalization of the classical idea and highlights the importance of thinking about the eigenspaces of a classifier within a quantum framework.
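This eigenspace picture can be simulated classically for a single qubit. In the sketch below the classifier is taken to be the Pauli-Z matrix and the amplitude $a = 0.8$ is arbitrary; both are assumptions for illustration, as is the helper name `eigenspace_probabilities`. A quantum device would estimate these probabilities from repeated measurements rather than compute them from a known state.

```python
import numpy as np

def eigenspace_probabilities(C, psi):
    """Probabilities of projecting psi onto the +1 and -1 eigenspaces of C."""
    psi = psi / np.linalg.norm(psi)
    eigvals, eigvecs = np.linalg.eigh(C)  # C is Hermitian with eigenvalues +-1
    amplitudes = eigvecs.conj().T @ psi
    p_plus = float(np.sum(np.abs(amplitudes[eigvals > 0]) ** 2))
    return p_plus, 1.0 - p_plus

# Hypothetical classifier on C^2: |0> is the +1 eigenvector, |1> the -1 eigenvector.
C = np.array([[1.0, 0.0], [0.0, -1.0]])

a = 0.8
psi = np.array([a, np.sqrt(1 - a**2)])  # a|phi_+> + sqrt(1 - |a|^2)|phi_->

p_plus, p_minus = eigenspace_probabilities(C, psi)
label = 1 if p_plus > 0.5 else -1
print(p_plus, p_minus, label)
```

Here the probability of the positive outcome is $|a|^2$, so the assigned class is determined by whether $|a|^2$ exceeds $1/2$.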
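Our approach to boosting and bagging embraces this idea. The idea is to combine an ensemble of classifiers to form a weighted sum of classifiers $C = \sum_j b_j C_j$, where $b_j \ge 0$ and $\sum_j b_j = 1$. If $|\phi\rangle$ is a simultaneous $+1$ eigenvector of each $C_j$ then $C|\phi\rangle = \sum_j b_j C_j|\phi\rangle = \sum_j b_j |\phi\rangle = |\phi\rangle$. The same obviously holds for any $|\phi\rangle$ that is a negative eigenvector of each $C_j$. That is, any vector that is in the simultaneous positive or negative eigenspace of each classifier will be deterministically classified by $C$.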
The simplest way to construct a classifier out of an ensemble of classifiers is to project the test vector onto the eigenvectors of the sum of the classifiers and compute the projection of the state into the positive and negative eigenvalue subspaces of the classifier. This gives us an additional freedom not observed in classical machine learning. While the positive and negative eigenspaces are fixed for each classifier in the classical analogues of our classifier, here they are not. We also wish our algorithm to function for states that are fed in a streaming fashion. That is to say, we do not assume that when a state $|\psi\rangle$ is provided we can prepare a second copy of this state. This prevents us from straightforwardly measuring the expectation of each of the constituent $C_j$ to obtain a classification for $|\psi\rangle$. It also prevents us from using quantum state tomography to learn $|\psi\rangle$ and then applying a classifier to it. We provide a formal definition of this form of quantum boosting or bagging below.
Definition 6. Two-class quantum boosting or bagging is defined to be a process by which an unknown test vector is classified by projecting it onto the eigenspace of $C = \sum_j b_j C_j$ and assigning the class to be $+1$ if the probability of projecting onto the positive eigenspace is greater than $1/2$ and $-1$ otherwise. Here each $C_j$ is unitary with eigenvalues $\pm 1$, $\sum_j b_j = 1$, $b_j \ge 0$, and at least two $b_j$ that correspond to distinct $C_j$ are positive.
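As a toy illustration of Definition 6, the following sketch builds $C = \sum_j b_j C_j$ from single-qubit reflections and classifies a real test vector by its overlap with the positive eigenvector. The reflection family, the angles, the weights, and all function names are assumptions made for this example; the eigenvector is computed classically in closed form rather than by phase estimation.

```python
import math

def reflection(theta):
    """2x2 Hermitian unitary with eigenvalues +1 and -1 (a toy classifier C_j)."""
    return [[math.cos(theta), math.sin(theta)],
            [math.sin(theta), -math.cos(theta)]]

def ensemble(thetas, weights):
    """Weighted sum C = sum_j b_j C_j of the toy classifiers."""
    C = [[0.0, 0.0], [0.0, 0.0]]
    for theta, b in zip(thetas, weights):
        Cj = reflection(theta)
        for r in range(2):
            for c in range(2):
                C[r][c] += b * Cj[r][c]
    return C

def positive_eigenspace_probability(C, psi):
    """Probability that the real unit vector psi projects onto the positive eigenvector.

    C has the traceless symmetric form [[a, b], [b, -a]], so its eigenvalues are
    +/- sqrt(a^2 + b^2) and the positive eigenvector is (cos(phi/2), sin(phi/2))
    with phi = atan2(b, a).
    """
    a, b = C[0][0], C[0][1]
    phi = math.atan2(b, a)
    overlap = psi[0] * math.cos(phi / 2) + psi[1] * math.sin(phi / 2)
    return overlap ** 2

def classify(C, psi):
    """Definition 6: class +1 if the positive-eigenspace probability exceeds 1/2."""
    return 1 if positive_eigenspace_probability(C, psi) > 0.5 else -1

thetas = [0.0, 0.4, -0.3]        # hypothetical classifier parameters
weights = [0.5, 0.25, 0.25]      # b_j >= 0, summing to 1
C = ensemble(thetas, weights)
label = classify(C, [1.0, 0.0])
```

Note that the eigenvectors of $C$ generally differ from those of every individual $C_j$, which is the additional freedom mentioned above.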
We realize such a classifier via phase estimation. But before discussing this, we need to abstract the input to the problem for generality. We do this by assuming that each classifier, $C_j$, being included is specified only by a weight vector $w_j$. If the individual classifiers were neural networks then the weights could correspond to edge and bias weights for the different neural networks that we would find by training on different subsets of the data. Alternatively, if we were forming a committee of neural networks of different structures then we simply envision taking the register used to store $w_j$ to be of the form $[\mathrm{tag}, w_j]$ where the tag tells the system which classifier to use. This allows the same data structure to be used for arbitrary classifiers.
One might wonder why we choose to assign the class based on the mean class output by $C$ rather than the class associated with the mean of $C$. In other words, we could have defined our quantum analogue of boosting such that the class is given by the expectation value of $C$ in the test state, $\mathrm{sign}(\langle\psi|C|\psi\rangle)$. In such a case, we would have no guarantee that the quantum classifier would be protected against the adversary. The reason for this is that the mean that is computed is not robust. For example, assume that the ensemble consists of $N$ classifiers such that for each $C_j$, $|\langle\psi|C_j|\psi\rangle| \le 1/(2N)$, and assume the weights are uniform (if they are not uniform then the adversary can choose to replace the most significant classifier and have an even greater effect). Then if an adversary controls a single classifier and knows the test example, they could replace $C_1$ such that $|\langle\psi|C_1|\psi\rangle| = 1$. The assigned class is then $\mathrm{sign}\big(\frac{1}{N}\langle\psi|C_1|\psi\rangle + \frac{1}{N}\sum_{j\ge 2}\langle\psi|C_j|\psi\rangle\big) = \mathrm{sign}(\langle\psi|C_1|\psi\rangle)$, because the remaining $N-1$ expectation values sum to at most $(N-1)/(2N) < 1$ in absolute value. In such an example, even a single compromised classifier can impact the class returned because by picking $\langle\psi|C_1|\psi\rangle = \pm 1$ the adversary has complete control over the class returned. In contrast, we show that compromising a single quantum classifier does not have a substantial impact on the class if our proposal is used, assuming the eigenvalue gap between positive and negative eigenvalues of $C$ is sufficiently large in the absence of an adversary.
We formalize these ideas in a quantum setting by introducing quantum blackboxes, $T$, $B$ and $C$. The first takes an index and outputs the corresponding weight vector. The second prepares the weights for the different classifiers in the ensemble $C$. The final blackbox applies the classifier, programmed via the weight vector, to the data vector in question. We formally define these black boxes below.
Definition 7. Let $C$ be a unitary operator such that if $|w\rangle \in \mathbb{C}^{2^{n_w}}$ represents the weights that specify the classifier then $C|w\rangle|\psi\rangle = |w\rangle C(w)|\psi\rangle$ for any state $|\psi\rangle$. Let $T$ be a unitary operator that, up to local isometries, performs $T|j\rangle|0\rangle = |j\rangle|w_j\rangle$, which is to say that it generates the weights for classifier $j$ (potentially via training on a subset of the data). Finally, define the unitary $B$ by $B|0\rangle = \sum_j \sqrt{b_j}\,|j\rangle$ for non-negative $b_j$ that sum to 1.
In order to apply phase estimation on $C$, for a fixed and unknown input vector $|\psi\rangle$, we need to be able to simulate $e^{-iCt}$. Fortunately, because each $C_j$ is unitary with eigenvalues $\pm 1$, it is easy to see that $C_j^2 = 1$ and hence that each $C_j$ is Hermitian. Thus $C$, being a real linear combination of Hermitian operators, is itself Hermitian and can be treated as a Hamiltonian. We can therefore apply Hamiltonian simulation ideas to implement this operator and formally demonstrate this in the following lemma.
Lemma 8. Let $C = \sum_j b_j C_j$ where each $C_j$ is Hermitian and unitary and has a corresponding weight vector $w_j$, with $b_j \ge 0\ \forall j$ and $\sum_j b_j = 1$. Then for every $\epsilon > 0$ and $t \ge 0$ there exists a protocol for implementing a non-unitary operator, $W$, such that for all $|\psi\rangle$, $\|e^{-iCt}|\psi\rangle - W|\psi\rangle\| \le \epsilon$ using a number of queries to $C$, $B$ and $T$ that is in $O(t\log(t/\epsilon)/\log\log(t/\epsilon))$.
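Proof. The proof follows from the truncated Taylor series simulation algorithm. By definition we have that $B|0\rangle = \sum_j \sqrt{b_j}\,|j\rangle$, and from Definition 7 the operation select($V$), which applies $C_j$ to $|\psi\rangle$ conditioned on the index register holding $|j\rangle$, can be enacted as $T^\dagger C T$. Next, by assumption we have that $\sum_j b_j = 1$ and therefore the rescaled evolution time used in the algorithm, $t\sum_j b_j$, is equal to $t$. Thus the number of queries to select($V$) and $B$ in the algorithm is in $O(t\log(t/\epsilon)/\log\log(t/\epsilon))$. The result then follows after noting that a single call to select($V$) requires $O(1)$ calls to $T$ and $C$.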
Note that in general purpose simulation algorithms, the application of select(V ) will require a complex series of controls to execute; whereas here the classifier is made programmable via the T oracle and so the procedure does not explicitly depend on the number of terms in C (although in practice it will often depend on it through the cost of implementing T as a circuit). For this reason the query complexity cited above can be deceptive if used as a surrogate for the time-complexity in a given computational model.
With the result of Lemma 8 we can now proceed to finding the cost of performing two-class bagging or boosting where the eigenvalue of C is used to perform the classification. The main claim is given below.
Theorem 9. Under the assumptions of Lemma 8, the number of queries needed to $C$, $B$ and $T$ to project $|\psi\rangle$ onto an eigenvector of $C$ with probability at least $1 - \delta$ and estimate the eigenvalue within error $\epsilon$ is in $O\left(\frac{1}{\delta\epsilon}\log(1/\epsilon)/\log\log(1/\epsilon)\right)$.
Proof. Since $\|C\| \le 1$, it follows that we can apply phase estimation on $e^{-iC}$ in order to project the unknown state $|\psi\rangle$ onto an eigenvector. The number of applications of $e^{-iC}$ needed to achieve this within accuracy $\epsilon$ and error $\delta$ is $O(1/\delta\epsilon)$ using coherent phase estimation [20]. Therefore, after taking $t = 1$ and using the result of Lemma 8 to simulate $e^{-iC}$, we find that the overall query complexity for the simulation is in $O\left(\frac{1}{\delta\epsilon}\log(1/\epsilon)/\log\log(1/\epsilon)\right)$.
Finally, there is the question of how large an impact an adversary can have on the eigenvalues and eigenvectors of the classifier. The answer can be found using perturbation theory to estimate the derivatives of the eigenvalues and eigenvectors of the classifier $C$ as a function of the maximum fraction of the classifiers that the adversary has compromised. This leads to the following result.
Corollary 10. Assume that $C$ is a classifier defined as per Definition 6 that is described by a Hermitian matrix such that the absolute value of each eigenvalue is bounded below by $\gamma/2$, and that one wishes to perform a classification of a data set based on the mean sign of the eigenvectors in the support of a test vector. Given that an adversary controls a fraction of the classifiers equal to $\alpha < \gamma/4$, the adversary cannot affect the classes output by this protocol in the limit where $\epsilon \to 0$ and $\delta \to 0$.
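Proof. If an adversary controls a fraction of the classifiers equal to $\alpha$ then we can view the perturbed classifier as $C' = C + \Delta$ with $\Delta = \sum_{j \in S} b_j (C'_j - C_j)$, where $S$ is the set of compromised classifiers, whose weights sum to $\alpha$, and the $C'_j$ are the adversary's replacements. This means that we can view $C'$ as a perturbation of the original matrix by a small amount. In particular, this implies that because each $C_j$ and $C'_j$ is Hermitian and unitary, $\|\Delta\| \le \sum_{j \in S} b_j(\|C'_j\| + \|C_j\|) \le 2\alpha$. Using perturbation theory (see Eq. (A32) in the appendix), we show that the maximum shift in any eigenvalue due to this perturbation is at most $2\alpha$. By this we mean that if $E_j(\sigma)$ is the $j$th eigenvalue of $C(\sigma) = C + \sigma\Delta$ then $|\partial_\sigma E_j(\sigma)| \le \|\Delta\| \le 2\alpha$. By integrating this over $\sigma \in [0, 1]$ we find that $E_j(1)$ (informally, the eigenvalue of $C'$ corresponding to the $j$th eigenvalue of $C$) obeys $|E_j(1) - E_j(0)| \le 2\alpha$. Given that $|E_j(0)| \ge \gamma/2$, the above argument shows that $\mathrm{sign}(E_j(1)) = \mathrm{sign}(E_j(0))$ if $2\alpha < \gamma/2$. Thus if the value of $\alpha$ is sufficiently small then none of the eigenvectors returned will have the incorrect sign. This implies that if we neglect errors in the estimation of eigenvalues of $C$ then the adversary can only impact the class output by $C$ if they control a fraction greater than $\gamma/4$. In the limit as $\delta \to 0$ and $\epsilon \to 0$ these errors become negligible and the result follows.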
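The eigenvalue-shift bound used in this proof can be checked numerically on a toy ensemble. The sketch below is an assumption-laden illustration, not part of the proof: the classifiers are single-qubit reflections, the angles and uniform weights are hypothetical, and the adversary is modeled as negating one classifier of weight $\alpha$.

```python
import math

def ensemble_gap(thetas, weights):
    """Positive eigenvalue of C = sum_j b_j C_j for 2x2 reflections C_j.

    Each C_j = [[cos t, sin t], [sin t, -cos t]], so C = [[a, s], [s, -a]]
    and its eigenvalues are +/- sqrt(a^2 + s^2).
    """
    a = sum(b * math.cos(t) for t, b in zip(thetas, weights))
    s = sum(b * math.sin(t) for t, b in zip(thetas, weights))
    return math.hypot(a, s)

thetas = [0.0, 0.2, -0.1, 0.1, 0.05, -0.05, 0.15, -0.2, 0.0, 0.1]
weights = [0.1] * 10
alpha = 0.1                       # total weight of the compromised classifier

clean = ensemble_gap(thetas, weights)
# The adversary replaces classifier 0 by its negation (theta -> theta + pi).
attacked = ensemble_gap([thetas[0] + math.pi] + thetas[1:], weights)
shift = abs(clean - attacked)     # should not exceed 2 * alpha
```

Here the clean eigenvalue magnitude plays the role of $\gamma/2$; since $\alpha$ is well below $\gamma/4$ in this example, the sign structure of the spectrum survives the attack.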
From this we see that adversaries acting in accordance with Threat Model 2 who control only a small fraction of the quantum classifiers used in a boosting protocol can be thwarted using quantum boosting or bagging. This illustrates that by generalizing the concept of boosting to a quantum setting we can not only make our classifiers better but also make them more resilient to adversaries who control a small fraction of the classifiers provided. This result, combined with our previous discussion of robust quantum PCA, shows that quantum techniques can be used to defend quantum classifiers against causative attacks by quantum adversaries.
However, we have not discussed exploratory attacks, which often could be used as precursors to such attacks or in some cases may be goals in and of themselves. We show below that quantum methods can be used in a strong way to defend against some classes of these attacks.

V. QUANTUM ENHANCED PRIVACY FOR CLUSTERING
Since privacy is one of the major applications of quantum technologies, it should come as no surprise that quantum computing can help boost privacy in machine learning as well. As a toy example, we will first discuss how quantum computing can be used to allow k-means clustering to be performed without leaking substantial information about any of the training vectors.
The k-means clustering algorithm is perhaps the most ubiquitous algorithm used to cluster data. While there are several variants of the algorithm, the most common variant attempts to break up a data set into $k$ clusters such that the sum of the intra-cluster variances is minimized. In particular, let $f : (x_j, \{\mu\}) \mapsto \{1, \ldots, k\}$ give the index of the centroid, within the set of centroids $\{\mu\} = \{\mu_1, \ldots, \mu_k\}$, to which $x_j$ is assigned, where the $x_j$ are the vectors in the data set. The k-means clustering algorithm then seeks to minimize $\sum_j \|x_j - \mu_{f(x_j,\{\mu\})}\|_2^2$. Formally, this problem is NP-hard, which means that no efficient clustering algorithm (quantum or classical) is likely to exist. However, most clustering problems are not hard examples, which means that clustering is typically not prohibitively challenging in practice.
The k-means algorithm for clustering is simple. First begin by assigning the $\mu_p$ to data points sampled at random. Then for each $x_j$ find the $\mu_p$ that is closest to it, and assign that vector to the $p$th cluster. Next set $\mu_1, \ldots, \mu_k$ to be the cluster centroids of each of the $k$ clusters and repeat the previous steps until the cluster centroids converge.
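For reference, the classical procedure just described can be sketched in a few lines. This is a minimal illustration of the standard (non-private) iteration; the function name, the iteration cap, the fixed random seed, and the handling of empty clusters are choices made for this example.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Classical k-means: alternate nearest-centroid assignment and centroid updates."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]   # random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            dists = [sum((xi - mi) ** 2 for xi, mi in zip(x, mu)) for mu in centroids]
            clusters[dists.index(min(dists))].append(x)     # ties go to the lowest index
        new_centroids = []
        for mu, members in zip(centroids, clusters):
            if members:
                dim = len(mu)
                new_centroids.append(
                    [sum(m[i] for m in members) / len(members) for i in range(dim)])
            else:
                new_centroids.append(mu)    # keep the centroid of an empty cluster
        if new_centroids == centroids:      # converged
            break
        centroids = new_centroids
    return centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids = sorted(kmeans(points, k=2))
```

The assignment step is the part that requires access to every $x_j$, which is what motivates the private variant developed in this section.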
A challenge in applying this algorithm to cases such as clustering medical data is that the individual performing the algorithm typically needs to have access to the users' information. For sensitive data such as this, it is difficult to apply such methods to understand structure in data sets that are not already anonymized. Quantum mechanics can be used to address this.
Imagine a scenario where an experimenter wishes to collaboratively cluster a private data set. The experimenter is comfortable broadcasting information about the model to the N owners of the data, but the users are not comfortable leaking more than m bits of information about their private data in the process of clustering the data. Our approach, which is related to quantum voting strategies [33], is to share an entangled quantum state between the recipients and use this to anonymously learn the means of the data.
Threat Model 3. Assume that a group of N users wish to apply k-means clustering to cluster their data and that an adversary is present that wishes to learn the private data held by at least one of the users as an exploratory attack and has no prior knowledge about any of the users' private data before starting the protocol. The adversary is assumed to have control over any and all parties that partake in the protocol, authentication is impossible in this setting and the adversary is unbounded computationally.
The protocol that we propose to thwart such attacks is given below.
1. The experimenter broadcasts k cluster centroids over a public classical channel.
3. Each participant that decides to contribute applies $e^{-iZ/2N}$ to the qubit corresponding to the cluster that is closest to their vector. If two clusters are equidistant, the closest cluster is chosen randomly.
4. The experimenter repeats the above two steps $O(1/\epsilon)$ times in a phase estimation protocol to learn $P(f(x_j, \{\mu\}) = p)$ within error $\epsilon$. Note that because participants may not participate, $\sum_{p=1}^{k} P(f(x_j, \{\mu\}) = p) \le 1$.
5. Next the experimenter performs the same procedure except phase estimation is now performed for each of the $d$ components of $x_j$, a total of $O(d/[\epsilon P(f(x_j) = p)])$ times for each cluster $p$.
6. From these values the experimenter updates the centroids and repeats the above steps until convergence or until the privacy of the users' data can no longer be guaranteed.
Intuitively, the idea behind the above method is that each time a participant interacts with their qubits they do so by performing a tiny rotation. These rotations are so small that individually they cannot be easily distinguished from performing no rotation, even when the best test permitted by quantum mechanics is used. However, collectively the rotation angles added by the participants sum to a non-negligible rotation angle if a large number of participants are pooled. By encoding their private data as rotation angles, the participants allow the experimenter to apply this trick to learn cluster means without learning more than a small fraction of a bit of information about any individual participant's private data.
Proof. Let us examine the protocol from the perspective of the experimenter. From this perspective the participants' actions can be viewed collectively as enacting blackbox transformations that apply an appropriate phase on the state $(|0\rangle^{\otimes N} + |1\rangle^{\otimes N})/\sqrt{2}$. Let us consider the first phase of the algorithm, corresponding to steps 2 and 3, where the experimenter attempts to learn the probabilities of users being in each of the $k$ clusters.
First, when the cluster centroids are announced to participant $j$, they can classically compute the distance between $x_j$ and each of the cluster centroids efficiently. No quantum operations are needed to perform this step. Next, for the qubit corresponding to cluster $p$, user $j$ performs a single qubit rotation and uses a swap operation to send that qubit to the experimenter. This requires $O(1)$ quantum operations. The collective phase incurred on the state $(|0\rangle^{\otimes N} + |1\rangle^{\otimes N})/\sqrt{2}$ from each such rotation results in (up to a global phase) $(|0\rangle^{\otimes N} + e^{-iP(f(x_j)=p)}|1\rangle^{\otimes N})/\sqrt{2}$. It is then easy to see that after querying this black box $t$ times the state obtained by the participants applying their rotations $t$ times is $(|0\rangle^{\otimes N} + e^{-iP(f(x_j)=p)t}|1\rangle^{\otimes N})/\sqrt{2}$. Upon receiving this, the experimenter performs a series of $N$ controlled-not operations to reduce this state to $(|0\rangle + e^{-iP(f(x_j)=p)t}|1\rangle)/\sqrt{2}$ up to local isometries. Then after applying a Hadamard transform the state can be expressed, up to phases, as $\cos(P(f(x_j) = p)t/2)|0\rangle + \sin(P(f(x_j) = p)t/2)|1\rangle$. Thus the probability of measuring this qubit to be 0 exactly matches that of phase estimation, wherein $P(f(x) = p)$ corresponds to the unknown phase. This inference problem is well understood and solutions exist such that if $\{t_q\}$ are the values used in the steps of the inference process then $\sum_q |t_q| \in O(1/\epsilon)$ if we wish to estimate $P(f(x_j) = p)$ within error $\epsilon$.
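The phase accumulation in this derivation can be reproduced with a short classical simulation that tracks only the two amplitudes of the entangled state. It is a sketch under assumptions: the cluster labels and the value of `t` are hypothetical, every participant is taken to contribute, and the function name is introduced for illustration.

```python
import cmath
import math

def prob_zero(assignments, cluster, t):
    """P(measure 0) after each participant in `cluster` applies exp(-i Z/2N) t times
    to their qubit of (|0...0> + |1...1>)/sqrt(2)."""
    n = len(assignments)
    amp0 = amp1 = 1 / math.sqrt(2)
    for a in assignments:
        if a == cluster:
            amp0 *= cmath.exp(-1j * t / (2 * n))   # Z acts as +1 on |0>
            amp1 *= cmath.exp(+1j * t / (2 * n))   # Z acts as -1 on |1>
    # The controlled-nots disentangle the register; the Hadamard then interferes
    # the two branches. The sign of the relative phase does not affect P(0).
    return abs((amp0 + amp1) / math.sqrt(2)) ** 2

assignments = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]      # hypothetical cluster labels
p_cluster1 = assignments.count(1) / len(assignments)
t = 3
p0 = prob_zero(assignments, cluster=1, t=t)
```

With these inputs the simulated probability agrees with $\cos^2(P t/2)$ for $P = 0.6$, matching the expression derived above.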
While it may naively seem that the users require $O(k/\epsilon)$ rotations to perform this protocol, each user in fact only needs to perform $O(1/\epsilon)$ rotations. This is because each vector is assigned to only one cluster using the above protocol. Thus only $O(1/\epsilon)$ rotations are needed per participant.
In the next phase of the protocol, the experimentalist follows the same strategy to learn the means of the clustered data from each of the participants. In this case, however, the blackbox function is of the form where [x j ] q is the q th component of the vector. By querying this black box t times and performing the exact same transformations used above, we can apply phase estimation to learn j [x j ] q δ f (xj ),p /N within error δ using O(1/δ) queries to the participants. Similarly, we can estimate If P (f (x j ) = p) is known within error and P (f (x j ) = p) > then it is straight forward to see that Then because it follows that the overall error is at most given our assumption that min p P (f (x j ) = p) > . Each component of the centroid can therefore be learned using O(1/[min p P (f (x j ) = p) ]) rotations. Thus the total number of rotations that each user needs to contribute to provide information about their While this lemma shows that the protocol is capable of updating the cluster centroids in a k-means clustering protocol, it does not show that the users' data is kept secret. In particular, we need to show that even if the experimenter is an all powerful adversary then they cannot even learn whether or not a user actually participated in the protocol. The proof of this is actually quite simple and is given in the following theorem.
Theorem 12. Let B be a participant in a clustering task that repeats the above protocol $R$ times, where B holds $x_j$, and let $\{\mu^{(r)}\}$ be the cluster centroids at round $r$ of the clustering. Assume that the eavesdropper E assigns a prior probability of $1/2$ that B participated in the above protocol and wishes to determine whether B contributed $x_j$ (which may not be known to E) to the clustering. The maximum probability, over all quantum strategies, of E successfully deciding this is $1/2 + O(Rd/[\min_{p,r} P(f(x_j, \{\mu^{(r)}\}) = p) N\epsilon])$ if $\min_{p,r} P(f(x_j, \{\mu^{(r)}\}) = p) > \epsilon$.
Proof. The proof of the theorem proceeds as follows. Since B does not communicate classically to the experimenter, the only way that E can solve this problem is by feeding B an optimal state to learn B's data. Specifically, imagine E usurps the protocol and provides B a state ρ. This quantum state operator is chosen to maximize the information that E can learn about the state. The state ρ, for example, could be entangled over the multiple qubits that would be sent over the protocol and could also be mixed in general. When this state is passed to B a transformation U ρU † is enacted in the protocol. Formally, the task that E is faced with is then to distinguish between ρ and U ρU † .
Fortunately, this state discrimination problem is well studied. The optimal probability of correctly distinguishing the two, given a prior probability of 1/2, is [34] Let us assume without loss of generality that B's x j is located in the first cluster. Then U takes the form where 1 refers to the identity acting on the qubits used to learn data about clusters 2 through k (as well as any ancillary qubits that E chooses to use to process the bits) and q 1 and q 2 are the number of rotations used in the first and second phases discussed in Lemma 11. Z (p) refers to the Z gate applied to the p th qubit. Using Hadamard's lemma and the fact that Tr(|ρA|) ≤ A 2 for any density operator ρ we then see that provided Let {µ (r) } be the cluster cetnroids in round r of the R rounds. From Lemma 11 we see that the number of rotations used in the R rounds of the first phase obeys q 1 ∈ O(R/ ) and the number of rotations in the R rounds of the second phase obeys q 2 ∈ O(Rd/[min p,r P (f (x j , {µ (r) }) = p) ]). We therefore have from Eq. (13) that if min p,r P (f (x j , {µ (r) }) = p) > then If the minimum probability of membership over all k clusters and R rounds obeys which is what we expect in typical cases, the probability of a malevolent experimenter discerning whether participant j contributed data to the clustering algorithm is O(dk/N ). It then follows that if a total probability of 1/2 + δ for the eavesdropper identifying whether the user partook in the algorithm can be tolerated then R rounds of clustering can be carried out if N ∈ Ω Rdk δ . This shows that if N is sufficiently large then this protocol for k-means clustering can be carried out without compromising the privacy of any of the participants.
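To give a feel for the scaling in this argument, the sketch below evaluates the optimal discrimination probability in a deliberately simplified setting. The simplifications are assumptions of the example, not of the theorem: the eavesdropper uses the single-qubit probe $|+\rangle$, the participant applies $q$ rotations $e^{-iZ/2N}$ to it, and the pure-state form of the Helstrom bound, $1/2 + \sqrt{1 - |c|^2}/2$ with overlap $c$, is used.

```python
import math

def helstrom_success(q, n_users):
    """Optimal probability of telling |+> from exp(-i q Z / 2N)|+> with equal priors.

    The overlap of the two states is cos(q / 2N), so the pure-state Helstrom
    bound 1/2 + sqrt(1 - overlap^2)/2 reduces to 1/2 + |sin(q / 2N)|/2.
    """
    return 0.5 + 0.5 * abs(math.sin(q / (2 * n_users)))

# Hypothetical numbers: 10 rotations contributed by the participant.
for n_users in (10**2, 10**4, 10**6):
    advantage = helstrom_success(10, n_users) - 0.5
    print(n_users, advantage)
```

In this toy case the eavesdropper's advantage over random guessing is approximately $q/4N$, which shrinks linearly in the number of users, in line with the $O(\cdot/N)$ dependence in Theorem 12.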
An important drawback of the above approach is that the scheme does not protect the model learned from the users. Such protocols could be enabled by only having the experimenter reveal hypothetical cluster centroids and then from the results infer the most likely cluster centroids, however this diverges from the k-means approach to clustering and will likely need a larger value of N to guarantee privacy given that the information from each round is unlikely to be as useful as it is in k-means clustering.
Also, while the scheme is private it is not secure. This can easily be seen from the fact that an eavesdropper could intercept a qubit and apply a random phase to it. Because the protocol assumes that the sum of the phases from each participant adds up to at most 1, this can ruin the information sent. While this approach is not secure against an all-powerful adversary, the impact that individual malfeasant participants could have in the protocol can be mitigated. One natural way is to divide the N participants into a constant number of smaller groups, and compute the median of the cluster centroids returned. While such strategies will be successful at thwarting a constant number of such attacks, finding more general secure and private quantum methods for clustering data remains an important goal for future work.

VI. CONCLUSION
We have surveyed here the impacts that quantum computing can have on security and privacy of quantum machine learning algorithms. We have shown that robust versions of quantum principal component analysis exist that retain the exponential speedups observed for its non-robust analogue (modulo certain assumptions about data access and output). We also show how bootstrap aggregation or boosting can be performed on a quantum computer and show that it is quadratically easier to generate statistics over a large number of weak classifiers using these methods. Finally, we show that quantum methods can be used to perform a private form of k-means clustering wherein no eavesdropper can determine with high-probability that any participant contributed data, let alone learn that data.
These results show that quantum technologies hold promise for helping secure machine learning and artificial intelligence. Going forward, an important goal is to provide a better understanding of the impact that technologies such as blind quantum computing [35,36] may have, both for securing quantum machine learning algorithms and for blinding the underlying data sets from any adversary. Another important question is whether tools from quantum information theory could be used to bound the maximum information about the models used by quantum classifiers, such as quantum Boltzmann machines or quantum PCA, that adversaries can learn by querying a quantum classifier that is made publicly available in the cloud. Given the important contributions that quantum information theory has made to our understanding of privacy and security in a quantum world, we have every confidence that these same lessons will one day have an equal impact on the security and privacy of machine learning.
There are of course many open questions left in this field and we have only attempted to give a taste here of the sorts of questions that can be raised when one looks at quantum machine learning in adversarial settings. One important question surrounds the problem of generating adversarial examples for quantum classifiers. Due to the unitarity of quantum computing, many classifiers can be easily inverted from the output quantum states. This allows adversarial examples to be easily created that would generate false positives (or negatives) when exposed to the classifier. The question of whether quantum computing offers significant countermeasures against such attacks remains an open problem. Similarly, gaining a deeper understanding of what sample lower bounds on state/process tomography imply for stealing quantum models, and hence for the security of quantum machine learning solutions, could be an interesting question for further research. Understanding how quantum computing can provide ways of making machine learning solutions more robust to adversarial noise of course brings more than security: it helps us understand how to train quantum computers to understand concepts in a robust fashion, similar to how we understand concepts. Thus thinking about such issues may help us address what is perhaps the biggest open question in quantum machine learning: "Can we train a quantum computer to think like we think?"

Appendix A: Proof of Theorem 1 and Robust LCU Simulation
Our goal in this approach to robust PCA is to examine the eigendecomposition of a vector in terms of the principal components of M. This has a major advantage over standard PCA in that it is manifestly robust to outliers. We will see that, within this oracle query model, quantum approaches can provide great advantages for robust PCA if we assume that M is sparse. Now that we have an algorithm for coherently computing the median, we can apply this subroutine to compute the components of M. First we need a method to be able to compute a representation of e T k x j for any k and j. Lemma 13. For any k and j define |Ψ j,k such that, for integer y, if y|Ψ j,k = 0 then |y − e T k v j | ≤ there exists a coherent quantum algorithm that maps, up to isometries, |j, k |0 → |j, Proof. First let us assume that all the v j are not unit vectors. Under such circumstances we can use Lemma 3 to embed these vectors as unit vectors in a high dimensional space. Therefore we can assume without loss of generality that v j are all unit vectors.
Since $v_j$ can be taken to be a unit vector, we can use quantum approaches to compute the inner product. Specifically, the following circuit can estimate the inner product $\langle v_j|k\rangle$.

[Circuit diagram: Hadamard test on the unitary $U$, with the top-most (control) qubit prepared in $|0\rangle$ and acted on by Hadamard gates $H$.]
The circuit implements the Hadamard test on the unitary $U$, which prepares the state $|v_j\rangle$ and computes the bitwise exclusive or on the resultant vector and the basis vector $k$. Specifically, the probability of measuring the top-most qubit to be 0 is $(1 + \langle v_j|k\rangle)/2$, where recall that we have assumed that the training vectors $|v_j\rangle$ are real valued. To see this, note that the circuit performs the following transformation for some state vector $|\phi^\perp\rangle$. The probability of measuring the first qubit to be 0 is $\mathrm{Tr}(|\psi\rangle\langle\psi|(|0\rangle\langle 0| \otimes 1))$, which after some elementary simplifications is $(1 + \langle v_j|k\rangle)/2$ given that $v_j$ is real valued.
Following the argument for coherent amplitude estimation in [20] we can use this process to produce a circuit that maps $|j,k\rangle|0\rangle \mapsto |j,k\rangle\big(\sqrt{1 - \chi_{j,k}}\,|\Psi_{j,k}\rangle + \sqrt{\chi_{j,k}}\,|\Psi^{\perp}_{j,k}\rangle\big)$ for $0 \le \chi_{j,k} \le \Delta/2 \le 1$ using $O(\log(1/\Delta)/\epsilon)$ invocations of the above circuit where, for integer $y$, $\langle y|\Psi_{j,k}\rangle \ne 0$ only if $|y - (1 + e_k^T v_j)/2| \le \epsilon/2$. Each query requires 1 invocation of controlled $U$ and hence the query complexity is $O(\log(1/\Delta)/\epsilon)$. By subtracting off $1/2$ from the resulting bit string and multiplying the result by 2 we obtain the desired answer. The result is computed to precision $\epsilon$ because we demanded that $(1 + e_k^T v_j)/2$ is computed with accuracy $\epsilon/2$. The arithmetic requires no additional queries, so the overall query complexity is $O(\log(1/\Delta)/\epsilon)$ as claimed. Note that in this case there are several junk qubit registers that are invoked by this algorithm that necessarily extend the space beyond that claimed in the lemma statement; however, since we only require the result to be true up to isometries we neglect such registers above for simplicity.
Here $|\mathcal{P}\rangle$ is a quantum state on a Hilbert space $H_{\rm junk} \otimes H_{N_v}$ such that $\mathrm{Tr}(|y\rangle\langle y|\,\mathrm{Tr}_{\rm junk}(|\mathcal{P}\rangle\langle\mathcal{P}|)) \ne 0$ if and only if $|P(\langle x_k|x_j\rangle < y) - 1/2| \le \epsilon < 1/4$. Furthermore, this state can be prepared using a number of queries to $U$ that scales as $O(\epsilon^{-2}\log(1/\epsilon)\log(L/\epsilon)\log(\log(1/\epsilon)/\delta))$.
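Returning to the Hadamard test used in the proof of Lemma 13, the relation between the measurement probability $(1 + e_k^T v_j)/2$ and the estimated component can be illustrated classically. The sketch below is an assumption-based stand-in: it samples outcomes directly from that probability instead of running coherent amplitude estimation, and the vector `v`, the shot count, and the function names are hypothetical.

```python
import random

def hadamard_test_p0(v, k):
    """P(top qubit = 0) = (1 + <v|k>)/2 for a real unit vector v and basis index k."""
    return (1.0 + v[k]) / 2.0

def estimate_component(v, k, shots, rng):
    """Estimate v[k] from `shots` simulated runs of the Hadamard test."""
    p0 = hadamard_test_p0(v, k)
    zeros = sum(1 for _ in range(shots) if rng.random() < p0)
    # Subtract 1/2 from the estimated probability and multiply by 2.
    return 2.0 * zeros / shots - 1.0

v = [0.6, 0.0, -0.8]                 # hypothetical real unit vector
rng = random.Random(1)               # fixed seed for reproducibility
estimate = estimate_component(v, 2, shots=20000, rng=rng)
```

Direct sampling of this kind has error scaling as the inverse square root of the number of shots, whereas the coherent amplitude estimation invoked in the proof achieves error $\epsilon$ with $O(\log(1/\Delta)/\epsilon)$ invocations.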
Proof. Similar to the proof of Grover's algorithm for the median [37] and that of Nayak and Wu [38], our approach reduces to coherently applying binary search. Their algorithms cannot be directly used here because they utilize measurement, which prevents their use within a coherent amplitude estimation algorithm. Furthermore, the algorithms of [38] solve a harder problem, namely that of outputting an approximate median from the list rather than simply providing a value that approximately separates the data into two halves of equal length. This value, however, need not actually be contained in either list, unlike the output of the algorithm of [38]. For this reason we propose a slight variation on Nayak and Wu's ideas here. Consider the following algorithm for some value of $\epsilon_0 > 0$.
1. Prepare the state $|L_1\rangle|R_1\rangle$ such that $L_1$ corresponds to 0 and $R_1$ corresponds to 1.

Repeat steps 3 through 8 for
4. Repeat the following step within a coherent amplitude estimation with error $\epsilon_0/2$ and error probability $\delta_0$ on the fifth register using a projector that marks all states in that register, $|L_p\rangle$, such that $L_p < \mu_p$ and store the probabilities in the sixth register.
6. Use a reversible circuit to set, conditioned on the probability computed in the above steps being less than $1/2$, $|L_{p+1}\rangle \leftarrow |\mu_p\rangle$ and $|R_{p+1}\rangle \leftarrow |R_p\rangle$.
7. Use a reversible circuit to set, conditioned on the probability computed in the above steps being greater than $1/2$, $|R_{p+1}\rangle \leftarrow |\mu_p\rangle$ and $|L_{p+1}\rangle \leftarrow |L_p\rangle$.
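The update rules in the steps above follow the pattern of a binary search for a value that splits the data in half. A classical, measurement-based analogue can be sketched as follows. Several details here are assumptions made for the example rather than taken from the steps listed: the test point $\mu_p$ is chosen as the midpoint of the current interval, the fraction of entries below it is computed exactly instead of by amplitude estimation, and a fraction of exactly $1/2$ moves the right end point.

```python
def median_by_bisection(values, rounds):
    """Binary search on [0, 1] for a value that splits `values` into two halves."""
    left, right = 0.0, 1.0
    for _ in range(rounds):
        mid = (left + right) / 2              # assumed choice of the test point mu_p
        frac_below = sum(1 for v in values if v < mid) / len(values)
        if frac_below < 0.5:
            left = mid                        # too few entries below: raise L
        else:
            right = mid                       # otherwise lower R
    return (left + right) / 2

values = [0.1, 0.2, 0.3, 0.7, 0.9]            # hypothetical data in [0, 1]
approx_median = median_by_bisection(values, rounds=20)
```

Each round halves the interval, so $O(\log(1/\epsilon))$ rounds suffice for precision $\epsilon$ when the fractions are exact; the analysis that follows accounts for the additional error introduced when they are only estimated.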
Let us focus on a single update step in the outer loop, which constitutes estimation of the inner products and computation of a median. Lemma 13 shows that there exists a unitary process, that performs (up to isometries) Now define an operator O µ such that for integers µ and y This oracle serves to mark all states that have support on values less than or equal to y. We can then express the state O ip |j, k |0 as where O µ |Ψ < j,k |µ = −|Ψ < j,k |µ , O µ |Ψ > j,k |µ = |Ψ > j,k |µ and |φ ⊥ j,k is not necessarily an eigenvector of O µ and we formally express O µ |φ ⊥ j,k |µ = |φ ⊥ j,k |µ . If we apply coherent amplitude estimation using the oracles O µ and O ip with error at most 0 /2 and failure probability at most δ 0 ≥ ∆ j,k we obtain a state of the form where P j,k good |x = 0 only if This is equivalent to saying that the estimate of the probability output by the process, in the success branch of the wave function, matches the exact probability within error at most 0 /2. Now we further have from the fact that χ j,k ≤ 0 /2 that Thus the errors in the probability estimates that the median, or more generally any percentile, is based on are at most 0 with the assumed level of precision. Now without loss of generality, let us assume that the right end point is updated in a given step of the protocol. Let R be the value of the right-endpoint in the event that there was zero error in the estimate of the probability and letR be the approximation to the right end point that arises from errors in the median probability and also from estimation of the inner products. If we then defineR p to be the error in the estimate of R that arises when we only consider errors from the estimate of the median probability we see that The maximum deviation that can occur in the p th percentile from perturbing each element of a list by is . Similarly, since Q is Lipshitz with constant L we have that |R −R p | ≤ L 0 . 
Thus Note that in many cases, the error in the right end point will be zero because the uncertainty in the probability estimation will be less than the gap between the percentiles at $R$ and $L$; however, we will track this propagation of uncertainty in each step for simplicity. Thus if t (p) is the total uncertainty in the median at step $p$ of the binary search process, we have in the worst-case scenario that Solving this recurrence relation with t (0) = 1/2 yields, Thus if we take $\epsilon_0 = \epsilon/L$, we find that if we want the total uncertainty to be $\epsilon$ at the end of the protocol then it suffices to pick .
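As a purely classical illustration of the interval update analysed above, the following sketch bisects for the median of a list of numbers in $[0,1]$ when every probability estimate is perturbed by a bounded amount. The function names, the uniform noise model and the parameter `eps0` (a stand-in for the error in the probability estimates) are assumptions made for this example; it does not simulate the quantum circuit or amplitude estimation.

```python
import random

def noisy_fraction_below(values, y, eps0, rng):
    """Fraction of values strictly below y, perturbed by at most eps0.

    The bounded uniform noise is an assumed stand-in for the error of
    the probability estimate; it is not part of the source protocol.
    """
    exact = sum(1 for v in values if v < y) / len(values)
    return exact + rng.uniform(-eps0, eps0)

def bisect_median(values, eps, eps0=0.0, seed=0):
    """Bisection on [0, 1] until the bracketing interval has width <= eps."""
    rng = random.Random(seed)
    left, right = 0.0, 1.0
    steps = 0
    while right - left > eps:
        mid = (left + right) / 2
        # Same rule as the update step: if more than half of the mass
        # lies below the midpoint, the midpoint becomes the new right end.
        if noisy_fraction_below(values, mid, eps0, rng) > 0.5:
            right = mid
        else:
            left = mid
        steps += 1
    return (left + right) / 2, steps

if __name__ == "__main__":
    data = [0.1, 0.2, 0.3, 0.7, 0.9]
    estimate, steps = bisect_median(data, eps=1 / 64, eps0=0.05)
    print(estimate, steps)
```

In this toy data set the exact fractions are multiples of $1/5$, so noise of size $0.05$ never flips a comparison; this mirrors the remark that the end-point error is often zero when the estimation uncertainty is smaller than the gap between percentiles. The loop halves the interval once per step, so it runs $\lceil \log_2(1/\mathrm{eps}) \rceil$ times.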
If > 4 and < 1/4 then it is clear that it suffices to take $p_{\max} \in O(\log(1/\epsilon))$ in order to obtain an uncertainty in the mean of $\epsilon$ at the end of the binary search. We assume in the theorem statement that < 1/4 and choose < /4. This validates the claim that the binary search process need only be repeated $O(\log(1/\epsilon))$ times. We then have that the overall query complexity of the algorithm is the product of the query complexities of performing the inner product calculation, calculating the median and the number of repetitions needed in binary search. The final complexity, $C$, therefore scales from Lemma 13 and from bounds on the query complexity of coherent amplitude estimation [20] as (A13) Here we have used the fact that the success probability for the overall algorithm is at least $1-\delta$; therefore we need to pick $\delta_0 \in O(\delta/\log(1/\epsilon))$ since there are $O(\log(1/\epsilon))$ distinct rounds and the failure probability grows at most additively from the union bound. Finally, note that in none of the steps in the algorithm has any garbage collection of ancillary qubits been performed. For this reason it is necessary to explicitly write the resultant state as where $|P\rangle$ is a quantum state on a Hilbert space $\mathcal{H}_{\rm junk} \otimes \mathcal{H}_{N_v}$ such that ${\rm Tr}(|y\rangle\langle y|\,{\rm Tr}_{\rm junk}(|P\rangle\langle P|)) \neq 0$ if and only if $|P(\langle x_k|x_j\rangle < y) - 1/2| \le \epsilon$.
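The union-bound bookkeeping used in this argument can be sketched numerically. The helper names below are hypothetical, and the choice of exactly $\lceil \log_2(1/\epsilon) \rceil$ rounds is an assumption made for illustration; the text only fixes the number of rounds up to constant factors.

```python
import math

def per_round_failure_budget(delta, eps):
    """Split an overall failure budget delta evenly over the rounds.

    Assumes ceil(log2(1/eps)) rounds, an illustrative choice of the
    constant hidden in the O(log(1/eps)) scaling.
    """
    rounds = max(1, math.ceil(math.log2(1 / eps)))
    return delta / rounds, rounds

def success_lower_bound(delta0, rounds):
    """Union bound: failure probabilities add at most linearly."""
    return 1 - rounds * delta0

if __name__ == "__main__":
    delta0, rounds = per_round_failure_budget(delta=0.1, eps=1 / 64)
    print(rounds, delta0, success_lower_bound(delta0, rounds))
```

The union bound needs no independence assumption between rounds. If the rounds did fail independently, the success probability $(1-\delta_0)^{\rm rounds}$ would be at least as large as the bound computed here.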
Corollary 15. Given that the assumptions of Theorem 14 hold with $L \in \Theta(1)$, an oracle that provides $\gamma$-approximate queries to $M_{k,\ell}$ can be implemented in the form of a unitary that maps for $1 \ge \delta \ge \delta_{k,\ell} \ge 0$ using a number of queries to $U$ that scales as $O(\gamma^{-2}\log^2(1/\gamma)\log(\log(1/\gamma)/\delta))$, where $|M_{k,\ell}\rangle \in \mathbb{C}^{n+a}$ has projection only on states that encode $M_{k,\ell}$ up to $\gamma$-error, meaning that for any computational basis state $|y\rangle$, ${\rm Tr}(|y\rangle\langle y|\,{\rm Tr}_{\rm junk}(|M_{k,\ell}\rangle\langle M_{k,\ell}|)) = 0$ if $|y - M_{k,\ell}| > \gamma$.
Proof. The proof is a direct consequence of Theorem 14. The computation of the medians involves three steps. First we need to compute the median for fixed $k$ and the median for fixed $\ell$, and then compute the median of the products of differences between these terms and the inner products. From Theorem 14 the number of queries to the vectors needed to estimate this is in $O(\gamma^{-2}\log^2(1/\gamma)\log(\log(1/\gamma)/\delta))$ because the costs of both of these processes are additive. Because the inner products are bounded above by 1, the error from estimating the medians in the products of terms is $O(\gamma)$. Thus the error from the computation of both medians sums to a value of $O(\gamma)$ if $O(\gamma^{-2}\log^2(1/\gamma)\log(\log(1/\gamma)/\delta))$ queries are used for each of the three procedures.
Now that we have established the cost of implementing an approximate query to the matrix $M$, the only step that remains is to diagonalize the matrix in order to be able to find the support of an input vector within a subspace. For simplicity, we will assume in the following that the error tolerance for the median, , required by the algorithms is constant. However, we will assume that the error tolerance required for the inner products, , is not a constant and will demand increased precision as the size of the problem grows. For this reason, existing methods that show that the errors in the unitary implementation translate into error in the eigenphase do not necessarily hold (as they would for Trotter-based methods). We therefore need to explicitly demonstrate that phase estimation can be employed using this approach with finite error.
Before concluding with our main theorem on robust quantum PCA, there is one more technical lemma that needs to be proven. We need to demonstrate that, when using non-unitary simulation methods for $e^{-iM}$, the errors that arise from the truncated Taylor simulation method do not invalidate the phase estimation protocol.
Lemma 16. Let $M \in \mathbb{C}^{2^n \times 2^n}$ and $\tilde{M} \in \mathbb{C}^{2^n \times 2^n}$ be $d$-sparse Hermitian matrices and let $M^{(j)} : j = 1, \ldots, \kappa d^2$ be a set of one-sparse matrices such that each non-zero matrix element in $M$ is assigned to precisely one $M^{(j)}$, where $f(p, j)$ yields the non-zero matrix element in row $p$ of $M^{(j)}$. If we then define $V$ to be the set of outcomes for which amplitude estimation estimates the desired matrix element of Y within error and
Proof. We prove this lemma using the fact that if $A$ and $B$ are $d$-sparse matrices with the same sparsity pattern then To see this, note from Vizing's theorem that at most $d + 1$ colors are required to edge color a degree-$d$ graph. Since the diagonal elements can be viewed as self loops, in the worst-case scenario one additional color may be needed; hence $d + 2$ colors is an upper bound. Since we can view any Hermitian matrix as a weighted adjacency matrix of a graph, and since the adjacency matrix for any graph that has degree at most 1 is one-sparse, Vizing's theorem implies that $A - B$ can be decomposed into at most $d + 2$ one-sparse matrices since $A$ and $B$ are $d$-sparse and share the same sparsity pattern.
Next, for any one-sparse matrix $C$ such that $C_{x,y} \neq 0$, the irreducible sub-matrix in the span of $|x\rangle$ and $|y\rangle$ is either of the form $\begin{pmatrix} 0 & C_{x,y} \\ C^*_{x,y} & 0 \end{pmatrix}$ or $C_{x,x}$, if $x \neq y$ and if $x = y$ respectively. In both cases, exact diagonalization shows that the eigenvalues of $C$ within this span are $\pm|C_{x,y}|$. It therefore follows from this observation, the triangle inequality and the fact that Vizing's coloring assigns each matrix element (edge in the adjacency matrix) to a unique one-sparse matrix Now let y be a boolean function that evaluates to Then a quantum circuit that implements $F$ can be executed using $O(1)$ queries to the matrix elements of $M$ and a polynomial-sized arithmetic circuit to compute Eq. (A21) and kick the result back into the phase. Specifically this allows us to perform a transformation that we denote PrepareW$^{(1)}$ where $N$ is a normalizing constant. This transformation takes the form of the PrepareW oracle used in the LCU lemma [40]. By invoking PrepareW$^{(1)}$, $F$, an appropriate phase correction to add a factor of $-i$ to the linear term, and the inverse of PrepareW, the LCU lemma states that, upon success, we apply where $\tilde{M}$ is given by Lemma 16. This illustrative example shows why when non-deterministic oracles are used within an LCU Hamiltonian simulation Now let $f(p, \vec{j})$ be defined to be the oracle that yields the column index of the non-zero element in the $p^{\rm th}$ row of the product of several one-sparse matrices whose $j$ indices are stored in the vector $\vec{j}$. This is given recursively via For simplicity we also define $f(p, \emptyset) = p$. Then for $\dim(\vec{j}) = J$, if we call the approximate oracle $J$ times to compute the matrix element corresponding to a fixed $\vec{j}$ and $p$ we obtain It then follows that by acting on a superposition over $\vec{j}$ vectors, we can prepare a state using $J$ queries to the approximate oracle that is, up to isometries, where above we use the assumption that $\|M\|_{\max} \le 1$.
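The claim above that an off-diagonal one-sparse block with entries $C_{x,y}$ and $C^*_{x,y}$ has eigenvalues $\pm|C_{x,y}|$ can be checked directly. The sketch below is illustrative only: the function names are hypothetical, the eigenvectors are left unnormalised, and the entry `c` is assumed to be non-zero.

```python
def block_eigenpairs(c):
    """Eigenpairs of the 2x2 Hermitian block [[0, c], [conj(c), 0]].

    Returns (eigenvalue, unnormalised eigenvector) pairs. Assumes c != 0.
    """
    c = complex(c)
    r = abs(c)
    phase = c / r  # c = r * phase with |phase| = 1
    return [(r, (phase, 1.0)), (-r, (phase, -1.0))]

def residual(c, eigenvalue, vector):
    """Largest component of (B - eigenvalue * I) v for the block B."""
    c = complex(c)
    v0, v1 = vector
    w0 = c * v1              # first row of B applied to v
    w1 = c.conjugate() * v0  # second row of B applied to v
    return max(abs(w0 - eigenvalue * v0), abs(w1 - eigenvalue * v1))

if __name__ == "__main__":
    entry = 0.3 - 0.4j
    for eigenvalue, vector in block_eigenpairs(entry):
        print(eigenvalue, residual(entry, eigenvalue, vector))
```

Both residuals vanish up to rounding, confirming that the spectral norm of such a block equals the magnitude of its single independent matrix element.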
This takes the same form as the simulation error for linear combination of unitary expansions that have no error. Thus at this point we can follow the remainder of the derivation in [39]. Robust oblivious amplitude amplification is used to boost the probability to nearly 100%; however, there are limits on the effectiveness of this technique based on the non-unitarity of the underlying operations. By taking $J \in O(\log(d^2/\epsilon)/\log\log(d^2/\epsilon))$ we can guarantee that the error in the truncated Taylor series simulation obeys, for a given number of segments $r$ [39],
$$\max_{|\psi\rangle} \left\| e^{-i\tilde{M}/r}|\psi\rangle - \frac{\sum_{q=0}^{J} (-i\tilde{M}/r)^q/q!}{\left\|\sum_{q=0}^{J} (-i\tilde{M}/r)^q/q!\,|\psi\rangle\right\|}|\psi\rangle \right\| \le \frac{\epsilon}{r}.$$
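To give a feel for how quickly the truncation order grows with the target precision, the following scalar sketch replaces the operator in the exponent by a single real number `x` with $|x| \le 1$, standing in for one eigenvalue of a simulation segment. This simplification, the function names, and the omission of the renormalisation that appears in the operator bound are all assumptions of the example.

```python
import cmath

def truncated_exp(x, order):
    """Taylor polynomial sum_{q=0}^{order} (-i x)^q / q! of exp(-i x)."""
    term = 1 + 0j
    total = 0j
    for q in range(order + 1):
        total += term
        term *= (-1j * x) / (q + 1)  # next term: multiply by (-i x)/(q+1)
    return total

def truncation_error(x, order):
    """Absolute error of the degree-`order` Taylor polynomial at x."""
    return abs(cmath.exp(-1j * x) - truncated_exp(x, order))

def smallest_order(x, target):
    """Smallest truncation order whose error at x is at most target."""
    order = 0
    while truncation_error(x, order) > target:
        order += 1
    return order

if __name__ == "__main__":
    for target in (1e-3, 1e-6, 1e-9):
        print(target, smallest_order(1.0, target))
```

Because the remainder decays factorially in the order, tightening the target by several orders of magnitude raises the required order only slightly, which is the behaviour behind the $\log/\log\log$ scaling of $J$.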
If we let $Q$ be the non-unitary approximation that arises from potential failures in the LCU method leading to errors in the expansion, then we obtain Thus if we use the form of robust oblivious amplitude amplification discussed in [39] and take $M \in \Theta(d^{2J} r/\epsilon)$ then the total non-unitary error of implementing the $r \in \Theta(d^2)$ segments of the simulation [39] is in $O(\epsilon)$. The overall query complexity is $O(Jr)$, which is in $O(d^2 \log(d^2/\epsilon)/\log\log(d^2/\epsilon))$ if simulation error $\epsilon$ is desired for a simulation of $e^{-i\tilde{M}}$. Similarly, we have from Lemma 16, the triangle inequality and Box 4.1 from Nielsen and Chuang [41] that $\|e^{-iM} - e^{-i\tilde{M}}\| \in O(d[\delta + \gamma])$. Thus we require that $\delta \in O(\epsilon/d)$ to ensure that the error is $O(d\gamma + \epsilon)$. It then follows that the error in the approximation can be made at most $\epsilon$ for $M \in O(d^{2J}/\epsilon)$, which implies that the circuit used to prepare the $M$ register (which is a uniform superposition) is of size at most $O(J\log(d) + \log(1/\epsilon))$ qubits. Thus the constraint that $\delta \in O(1/M)$ directly implies $\delta \in O(\epsilon/d)$ and furthermore the space and time required to prepare such a register is polynomial.
This shows that we can perform a simulation using LCU methods with a non-deterministic oracle for the matrix elements of the underlying Hamiltonian. Note that the above reasoning explicitly holds for the case of simulating $e^{-iHt}$ for sparse Hermitian $H$ by substituting $M \to Ht$.
Theorem 18. Under the assumptions of Lemma 16 and given that the minimum eigenvalue gap of $M$ within the support of the input state obeys λ ≥ /4 > 0 . We then have the following: asymptotically subdominant to the query complexity required to simulate $V_M$. Thus the total number of queries needed to the training data is, for $\gamma \in O(\epsilon/d)$,
$$O\!\left((d^4/\epsilon^2)\log(d^2/\epsilon)\log^2(d/\epsilon)\log(d\log(d/\epsilon)/\epsilon)/\log\log(d^2/\epsilon)\right) \subseteq O\!\left((d^4/\epsilon^2)\log^3(d^2/\epsilon)\right). \quad \text{(A30)}$$
Next we need to consider the effect of such errors in $M$ on the eigenvalues of the robust covariance matrix. In order to understand the impact of such errors we must ensure that the errors incurred in the principal components of the approximated robust density matrix $M$ are small. In general, if the perturbations to the robust covariance matrix are large relative to the eigenvalue gaps then we can expect that the impact will be substantial for both the eigenvalue and eigenvector in question. To this end, we will have to bound the impact that these perturbations will have on the result.
Let $\|M - \tilde{M}\| \le \sigma$; then there exists a matrix-valued function with norm at most 1, $\Delta(\sigma)$, such that $\tilde{M} = M + \sigma\Delta(\sigma)$. Now if $\tilde{M}$ is an analytic matrix-valued function of $\sigma$ then so is $\Delta(\sigma)$. In this context, the path that we choose to perturb $M$ along to reach $\tilde{M}$ is arbitrary. We therefore choose it to be analytic. Thus we can assume without loss of generality that $\Delta(\sigma) = \Delta + O(\sigma)$ for a constant operator $\Delta$ such that $\|\Delta\| \le 1$. The latter point follows by contradiction: if $\|\Delta\| > 1$ then the requirement that $\|\Delta(\sigma)\| \le 1$ will fail to hold in some $\sigma$-neighborhood about 0. Now we will investigate the role that such errors have in the shifts in the eigenvalues and eigenvectors of $M$. Since Λ > 0 we can apply non-degenerate perturbation theory to estimate the shift in the eigenvalues and eigenvectors. If we define $E_n$ to be the $n^{\rm th}$ eigenvalue of $M$, $|E_n\rangle$ to be the corresponding eigenvector, and $E_n(\sigma)$ to be the corresponding perturbed eigenvalue, we have from non-degenerate perturbation theory that [42]
$$E_n(\sigma) = E_n + \langle E_n|\Delta|E_n\rangle\sigma + O(\sigma^2). \quad \text{(A31)}$$
This implies that the directional derivative of the eigenvalue in the direction of the perturbation is Note that because the result is insensitive to the initial value, the same argument can be used to bound the derivative for any value of $\sigma$. Similarly, the derivative of the projection of the eigenstate onto any fixed vector $|x\rangle$ is Now let us define the space to be the direct sum of three disjoint eigenspaces of $M$ defined by the projectors $P_+$, $P_-$ and $P_?$, where $\mathbb{1} = P_+ + P_- + P_?$. Furthermore, assume that for any $|\psi\rangle$ in the $+1$ eigenspace of $P_+$ and $|\phi\rangle$ in the $+1$ eigenspace of $P_-$, $|\langle\psi|M|\psi\rangle - \langle\phi|M|\phi\rangle| \ge \lambda$. Similarly let $P_+(\epsilon)$ be the projector onto the perturbed $+1$ eigenspace. Under the assumption that $M \in \mathbb{R}^{N \times N}$ we have that the eigenvectors can always be chosen to be real valued and hence for any $|\psi\rangle$, where $n \in P_+$ implies $P_+|E_n\rangle = |E_n\rangle$ and the last line follows from $|E_p - E_n| \ge \lambda$. Thus if the eigenvalues are computed within an error of at most $\lambda/4$ then it follows that the maximum error that we can observe in the derivative is $4/\lambda$. Thus
$$|\langle\phi|P_+(\epsilon) - P_+|\phi\rangle| \le \int_0^{\epsilon} \max_{\psi} |\partial_\sigma \langle\psi|P_+(\sigma)|\psi\rangle|\,{\rm d}\sigma \le \frac{4\epsilon}{\lambda}. \quad \text{(A35)}$$
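The first-order eigenvalue shift in Eq. (A31) can be checked numerically in the smallest non-trivial case. The sketch below takes a diagonal $2 \times 2$ real matrix with eigenvalues `e0 < e1` and a symmetric perturbation with entries `d00`, `d01`, `d11`; these names, the restriction to two dimensions and the closed-form eigenvalue formula are assumptions of the example rather than part of the proof.

```python
import math

def sym2_eigenvalues(a, b, c):
    """Eigenvalues (low, high) of the real symmetric matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    return mean - radius, mean + radius

def first_order_errors(e0, e1, d00, d01, d11, sigma):
    """Gap between exact and first-order eigenvalues of diag(e0, e1) + sigma * D.

    Assumes e0 < e1 and sigma small enough that the ordering is preserved.
    The unperturbed eigenvectors are the basis vectors, so the first-order
    shifts are sigma * d00 and sigma * d11.
    """
    low, high = sym2_eigenvalues(e0 + sigma * d00, sigma * d01, e1 + sigma * d11)
    return abs(low - (e0 + sigma * d00)), abs(high - (e1 + sigma * d11))

if __name__ == "__main__":
    for sigma in (0.01, 0.005):
        print(sigma, first_order_errors(0.0, 1.0, 0.5, 0.5, -0.5, sigma))
```

Halving `sigma` reduces both residuals by roughly a factor of four, as expected for an $O(\sigma^2)$ remainder; the size of that remainder is set by the off-diagonal entry squared divided by the eigenvalue gap, which is why the gap enters the bounds that follow.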