Key questions for the quantum machine learner to ask themselves

Within the last several years quantum machine learning (QML) has begun to mature; however, many open questions remain. Rather than review these open questions, in this perspective piece I will discuss my view about how we should approach problems in QML. In particular, I will list a series of questions that I think we should ask ourselves when developing quantum algorithms for machine learning. These questions focus on what the definition of quantum ML is, what the proper classical analogue of a QML algorithm is, how one should compare QML to traditional ML, and what fundamental limitations emerge when trying to build QML protocols. As an illustration of this process I also provide information theoretic arguments showing that amplitude encoding can require exponentially more queries to a quantum model than classical bit-encodings would require to determine membership of a vector in a concept class; however, if the correct analogue is chosen then the quantum and classical complexities become polynomially equivalent. This example underscores the importance of asking ourselves the right questions when developing and benchmarking QML algorithms.

protocols and in some cases can support exponential speedups (unless of course BQP = BPP, which is highly unlikely).
The wide range of input models that can be considered in QML complicates the search for quantum advantages in machine learning tasks, as well as the search for new forms of machine learning that may have no clear classical analogue. Thinking critically about the definition of the input for purely QML problems is absolutely vital both to understand an algorithm and to contextualize its performance.

Question 2: what is the classical analogue of my quantum machine learning algorithm?
The importance of this question was raised recently in a very visible manner by Ewin Tang and others [9,35,38,39] in a series of papers that criticized the claims of exponential speedups in algorithms such as quantum principal component analysis, nearest centroid classification, quantum recommendation systems and related algorithms. The central issue behind most of the claims refuted in these works was that the exponential speedups were not justified using complexity-theoretic or query-complexity arguments, or even by the community's inability to provide an efficient classical algorithm for the problem (as in the case of factoring); rather, they were justified by comparison to a classical algorithm that was deemed to be analogous to the quantum algorithm. The central problem with such a comparison is that there are often multiple classical analogues of a quantum algorithm. In all these cases, the claims of exponential speedups evaporated once the correct classical analogue was used for comparison.
I find that I struggle more giving clear answers to question 2 than I do for question 1. The struggle is very much akin to understanding the classical limit of quantum mechanics. There are two main correspondences that are often used to understand the classical limit of quantum mechanics: the Ehrenfest and the Liouville correspondence [40]. The former states that a single Newtonian trajectory should emerge from the correct classical limit of quantum dynamics. The Liouville correspondence states instead that the classical limit of quantum dynamics occurs when the quantum predictions are practically indistinguishable from those of a classical probability distribution. This exact same issue arises in QML. Philosophically, we can choose the classical analogues of a QML algorithm to be either deterministic or randomized classical algorithms. Both options need to be considered when comparing a quantum ML algorithm to a classical counterpart. I detail below some of the thought process that I have found useful for clarifying these issues in my own research.

Question 2.1: what is my input model?
An input model is an abstraction that describes how the data is represented and fed into a quantum algorithm. Classically, there are a number of such input models. The first is a database model, wherein the data is assumed to be provided by a database that holds all the training data. Another popular model is the streaming, or online learning, model, wherein the algorithm does not have explicit access to the underlying training vectors but instead is fed the data one example at a time. Examples include perceptron training and various forms of kernel learning [41,42].
In either of the above cases, there are also multiple ways that the data can be represented. In quantum computing, however, there are further complications because the feature vectors that are used classically can be encoded directly as qubit strings that represent vectors or as a quantum state vector. These two encoding schemes are known as bit-encodings and amplitude-encodings. I will also discuss a third input model wherein the data fed into the system is an unknown quantum state and will refer to it as a quantum input model (despite the fact that all three input models are quantum).

Bit encodings
The most concise way to think of input models that use a bit encoding is in an oracular model. In order to understand this model, let us assume that the learning problem has a training set composed of training vectors of the form $\{v_i : i = 0, \ldots, N-1\}$. I will also denote a bit-string representation of the training vectors by $[v_i]$ in the following. These bit vectors can be accessed on demand through a self-inverse quantum oracle of the form

$U : |i\rangle|y\rangle \mapsto |i\rangle|y \oplus [v_i]\rangle.$

I call this form of the input oracle a coherent bit-oracle. This is perhaps the most commonly used abstraction for how to get training data into a quantum algorithm. It is used in quantum algorithms for perceptron training, support vector machines and nearest neighbor classification, and can be instantiated in low depth using a data structure known as a QRAM [43,44]. The salient feature of such oracles is that they allow us to query the training data in superposition, which is necessary in order to achieve a quantum speedup over the size of the data set. Another sub-class of a bit-oracle is the incoherent bit-oracle. In this case the input data is yielded by a subroutine that acts as a classical function, $i \mapsto [v_i]$, returning the bit string for a single, classically specified index. This form of input precludes the use of quantum superpositions over the training data and, in turn, the speedups associated with them. This feature is perhaps most dramatically demonstrated in the recent proof that at best constant-factor improvements in sample (i.e. query) complexity relative to classical learning can be attained within the quantum PAC model, which is a model that precludes coherent access to the training data [22]. On the other hand, this input model does accurately reflect the use cases that emerge from QML algorithms that do not depend on superpositions over the input data, such as variational autoencoders [27,28], feed-forward quantum neural networks [23,24] and some implementations of quantum Boltzmann machines [15].
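As a minimal NumPy sketch of the self-inverse (XOR-based) bit-oracle described above — the data values and register sizes here are illustrative assumptions, and a real implementation would compile this unitary into a circuit:

```python
import numpy as np

# Toy coherent bit-oracle: U|i>|y> = |i>|y XOR [v_i]>, built as a permutation
# matrix over the joint index/data register. The training data is hypothetical.
bit_vectors = [0b01, 0b10, 0b11, 0b00]  # 2-bit strings [v_i] for 4 vectors
n_index, n_data = len(bit_vectors), 4   # 4 index states, 4 data-register states

dim = n_index * n_data
U = np.zeros((dim, dim))
for i in range(n_index):
    for y in range(n_data):
        U[i * n_data + (y ^ bit_vectors[i]), i * n_data + y] = 1.0

# The XOR construction makes the oracle an involution, hence self-inverse.
assert np.allclose(U @ U, np.eye(dim))

# Querying |i=1>|0> writes [v_1] = 0b10 into the data register.
state = np.zeros(dim)
state[1 * n_data + 0] = 1.0
out = U @ state
assert out[1 * n_data + 0b10] == 1.0
```

Because the oracle is a unitary acting on the index register, it can equally well be applied to a superposition of indices, which is exactly the feature the incoherent bit-oracle lacks.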
The main advantage of bit-encodings is that they provide the data in exactly the same manner that it is provided to most classical machine learning algorithms. This means that a query to a subroutine that provides the data is analogous to an access to the classical training set.
The primary disadvantage of these approaches is that the space overheads needed to store the quantum state can be prohibitive. In particular, for the case of the MNIST example considered earlier the training images consist of 784 pixels, which means that a bit encoding of the input would require minimally 784 qubits. This makes most applications in vision out of reach of existing quantum computers, which typically have fewer than 100 quantum bits [45].

Amplitude encodings
With the exception of experiments performed on quantum annealers [13,15], the vast majority of QML experiments on real-world data sets have been performed on amplitude-encoded data. The advantage of amplitude encodings is that they represent their data using only logarithmically many quantum bits, by embedding the vectors in the training set as unit vectors in the quantum computer [25,46]. In particular, let $\{v_i : i = 1, \ldots, N\}$ be a set of training vectors in $\mathbb{R}^{2^n}$ such that $\max_i \|v_i\|_2 \le \Lambda$. Then we can construct an isomorphism between the training vectors and unit vectors in $\mathbb{R}^{2^{n + \lceil \log_2 N \rceil + 1}}$:

$v_i \mapsto u_i := \frac{1}{\Lambda}\left(v_i \oplus \sqrt{\Lambda^2 - \|v_i\|_2^2}\, e_i\right),$

where $e_i$ is a basis vector of the auxiliary padding space that is unique to index $i$. Note that this is an isomorphism because the inner products between distinct vectors remain the same up to a factor of $\Lambda^{-2}$, which is a known constant. The additional qubits in this mapping are needed to guarantee that the junk space, which is used to pad the non-unit vector into a unit vector, is unique and thus has no projection onto any of the other encodings of the training vectors. These additional space requirements needed to maintain the correct inner products between amplitude-encoded vectors can be prohibitive, and so some in the QML community use such encodings without guaranteeing the orthogonality of the junk components. This, in effect, applies a feature map to the vectors that are loaded in. This feature map means that the training data provided to the quantum algorithm differs from that provided to the classical algorithm, and care should be taken, especially if the error rate for the quantum model is higher than for classical methods.
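The padded embedding and its inner-product-preservation property can be checked numerically. In this sketch the vectors, the choice of $\Lambda$, and the function names are illustrative assumptions:

```python
import numpy as np

# Sketch of the padded amplitude-encoding isomorphism: each training vector
# v_i is extended by a junk component along a direction unique to i, so that
# every encoded vector is a unit vector and distinct vectors keep their
# inner products up to the known factor 1/Lambda**2.
rng = np.random.default_rng(0)
V = rng.normal(size=(3, 4))               # 3 hypothetical training vectors
Lam = max(np.linalg.norm(v) for v in V)   # Lambda >= max_i ||v_i||_2

def encode(v, i, n_train):
    junk = np.zeros(n_train)
    junk[i] = np.sqrt(Lam**2 - np.linalg.norm(v)**2)  # unique junk direction
    return np.concatenate([v, junk]) / Lam

encoded = [encode(v, i, len(V)) for i, v in enumerate(V)]
for u in encoded:
    assert np.isclose(np.linalg.norm(u), 1.0)         # valid quantum states
# Inner products of distinct vectors are preserved up to 1/Lambda**2.
assert np.isclose(encoded[0] @ encoded[1], (V[0] @ V[1]) / Lam**2)
```

Dropping the per-index junk directions (using a shared padding direction instead) would add a spurious positive term to every cross inner product, which is precisely the implicit feature map discussed above.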
In direct analogy to the bit-encodings, there are two main classes of amplitude encoding: coherent and incoherent encodings. For the case of a coherent encoding, a unitary prepares the amplitude-encoded states in superposition over the index register,

$\sum_i \alpha_i |i\rangle|0\rangle \mapsto \sum_i \alpha_i |i\rangle|u_i\rangle,$

and the incoherent encoding is defined in precisely the same manner, except that the state $|u_i\rangle$ can only be prepared for one classically specified index $i$ at a time. The coherent version of this encoding is used, for example, in nearest neighbor classification algorithms, and the incoherent version is more commonly used in quantum neural networks.

Quantum input
Quantum input models are the final category of inputs commonly used in QML algorithms. They appear most often in unsupervised training examples, but they have also been proposed for other approaches such as quantum PCA [30], quantum Hamiltonian learning [36] and generative training of quantum neural networks [26]. There are many ways such models can be formalized, but here we will consider two concrete versions of access models that generalize the models often implicitly used in QML algorithms that take quantum inputs. In the first such model, a subroutine exists that allows the user to prepare one of an indexed family of density operators $\rho_i$, $i = 1, \ldots, N$. This procedure can be viewed as an explicit function $\Lambda$ mapping the indices for the training set to copies of the distinct density operators, $\Lambda : i \mapsto \rho_i$. A further refinement of this model is sometimes used wherein the quantum data is provided in the form of a pure state. In particular, if we assume that the Hilbert space factorizes into a tensor product of spaces $A$ and $B$, then we can define purifications $|\psi_i\rangle$ with $\mathrm{Tr}_B(|\psi_i\rangle\langle\psi_i|) = \rho_i$, and the purified state preparation algorithm can be written as $U : |i\rangle|0\rangle \mapsto |i\rangle|\psi_i\rangle$. This model is relevant in cases where density matrix exponentiation [30] is used as a tool in QML, because the fundamental limits that the technique imposes on the query complexities can be alleviated by using a purified quantum input oracle [47]. Coherent analogues of the purified state preparation algorithm can also be considered, in which $U$ is applied to a superposition over the index register: $\sum_i \alpha_i |i\rangle|0\rangle \mapsto \sum_i \alpha_i |i\rangle|\psi_i\rangle$. Such models are necessary for achieving quantum speedups over the size of the training set and are often instantiated using a QRAM quantum data structure [44].
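The purified-input refinement rests on the fact that every density operator admits a purification on a doubled space. A minimal NumPy sketch, purifying via the eigendecomposition (the state here is randomly generated for illustration):

```python
import numpy as np

# Any density operator rho can be lifted to a pure state |psi> on A (x) B with
# Tr_B(|psi><psi|) = rho. Writing |psi> = sum_k sqrt(w_k) |u_k>_A |k>_B, the
# coefficient matrix psi satisfies Tr_B(|psi><psi|) = psi @ psi^dagger.
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
rho = A @ A.conj().T
rho /= np.trace(rho).real                 # random 4x4 density operator

w, vecs = np.linalg.eigh(rho)             # spectral decomposition of rho
psi = vecs @ np.diag(np.sqrt(np.clip(w, 0, None)))  # 4x4 coefficient matrix

# Partial trace over B recovers the original density operator.
assert np.allclose(psi @ psi.conj().T, rho)
```

The purification is not unique (any isometry on the $B$ register gives another one), which is why the purified input oracle is a strictly stronger resource than copies of $\rho_i$ alone.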

Question 2.2: what does my algorithm output?
In order to properly compare quantum and classical algorithms, we also need to be able to clearly articulate what form the outputs of the quantum algorithm take. For example, in supervised learning tasks the output could be a class label or potentially an estimate for a gradient for the weights (i.e. the parameters) of the model. For other quantum algorithms, the output is often taken to be a quantum state [33,48]. Such cases often arise when the QML algorithm is intended to be a subroutine in a larger algorithm, since there are no clear classical outputs for such algorithms. In order to fairly compare quantum to classical, we therefore need to agree upon a classical task that both need to solve, or at the very least establish a correspondence between the outputs of the two models. Fortunately, this is typically much easier to do for the output of quantum algorithms than it is for their input.
An answer to question 2.2 is vital for practitioners of QML to have in hand for any algorithm that they propose, since several of the most significant quantum algorithms in the field suffer from the problem of extracting classically meaningful data from the quantum output [30,33]. This problem has been referred to in the literature as the output problem [1], wherein the cost of extracting the desired information from the solution dominates the cost of a quantum algorithm. Addressing this problem, along with the issue of efficiently loading the data into the algorithm (the input problem), remains one of the two most pressing issues when trying to devise quantum algorithms that offer exponential speedups for ML on classical datasets [49].
Answer 1: I am using a bit encoding so standard classical algorithms are a natural analogue to my quantum algorithm.
This is the simplest case that we can consider for addressing question 2. Here the inputs to the quantum algorithm are specified as bit strings which makes comparison between the quantum algorithm and its classical counterparts straight forward. Examples of quantum algorithms that take this tack are given in [10,15,24].
There are also several important properties of bit-encoded quantum data that are infrequently remarked on. The first is that while we can have quantum superpositions over the input bit-vectors, when we measure the result we will get a unique training vector back. Second, if we are provided a quantum bit string corresponding to a training vector then we can clone the vector to our heart's content, in both the quantum and the classical case. The third property is that subtle differences between training vectors can be immediately deduced by looking at their binary representations. We will see that none of these properties hold for amplitude-encoded data.

Answer 2: I am using an amplitude encoding so randomized classical ML algorithms are a natural analogue of my quantum algorithm.
While it may be tempting to think that the bit-encoded classical training vectors used in most ML protocols are analogous to amplitude-encoded quantum state vectors, they are not. To see this, let $v_k$ be a training vector in $\mathbb{R}^n_+$. We can represent this vector as a probability distribution using the exact same tricks employed in amplitude encoding. Specifically, consider the following probability distribution (represented as a density operator):

$\rho_{v_k} = \sum_{j=1}^{n} \frac{(v_k)_j}{\|v_k\|_1}\, |j\rangle\langle j|. \qquad (10)$

The corresponding amplitude-encoded quantum state would be

$|\psi_{v_k}\rangle = \sum_{j=1}^{n} \sqrt{\frac{(v_k)_j}{\|v_k\|_1}}\, |j\rangle.$

Thus when the classical analogue of an amplitude-encoded state is measured, it yields a single outcome that is distributed according to the training vector, just as the amplitude-encoded state does. Similarly, given a quantum state, the no-cloning theorem prevents one from generating two copies from a single copy of $|\psi_{v_k}\rangle$, and the same restriction clearly holds for $\rho_{v_k}$. Thus amplitude-encoded states have few of the salient features of bit encodings considered above and correspond more closely to the probabilistic classical encoding given in (10). While amplitude encodings are frequently used in QML [12,23,25,30], this correspondence is seldom commented on in the literature. Ewin Tang and others, however, explicitly invoke this correspondence in works such as [9,38] to show that classical algorithms exist that are only polynomially slower than quantum algorithms for clustering and nearest neighbor classification. In turn, it has been noted that after this correspondence is used, the apparent exponential advantages of the quantum algorithms no longer persist. This underscores the importance of considering classical randomized approaches to ML when comparing quantum to classical, and it also reinforces why we need to think clearly about what the classical analogues of our quantum algorithms are, especially in cases where the input data is amplitude encoded.
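The correspondence between the probability encoding and a measured amplitude encoding can be demonstrated in a few lines; the vector below is an arbitrary illustrative choice:

```python
import numpy as np

# For a nonnegative training vector v, the classical probability encoding and
# a computational-basis measurement of its amplitude encoding induce the same
# single-shot outcome distribution.
v = np.array([3.0, 1.0, 4.0, 2.0])          # hypothetical v in R^n_+

p_classical = v / v.sum()                    # diagonal of rho_v: Pr(j) = v_j/||v||_1
psi = np.sqrt(v / v.sum())                   # amplitude-encoded state |psi_v>
p_quantum = np.abs(psi) ** 2                 # Born-rule outcome probabilities

assert np.allclose(p_classical, p_quantum)   # identical sampling statistics
```

Seen this way, a single query to the amplitude encoding buys you one sample from the same distribution a classical randomized algorithm could draw from, which is why randomized classical algorithms are the natural analogue here.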
Answer 3: I am learning directly from quantum states and so classical analogues of my algorithm may exist in the quantum tomography literature.
Perhaps even more than with amplitude encoding, our aim to find a classical analogue for our QML protocols is stymied when the input itself is quantum. This is because classical machine learning algorithms take classical information as input, whereas quantum algorithms allow us to manipulate quantum information. It is therefore difficult to see how such quantum algorithms can ever truly have a classical analogue.
While I am not aware of a direct analogue for such methods, strong parallels with such algorithms exist with tomographic protocols [3,26]. This point has been made a few times in the literature and explicit comparisons have been drawn with partial tomography, which is a procedure that attempts to reconstruct features of the quantum state (of which channel identification can be viewed as an important sub-problem) rather than the entire state.
As an example of a classical analogue for a quantum protocol, consider the following. If we assume that we are using a direct quantum access model and that we are given an informationally complete POVM F = {F_j : j = 1, . . . , M}, then in the classical setting we allow ourselves to draw samples from the distribution P(j|i) = Tr(F_j Λ(i)). This means that the classical version of the learning algorithm can access the quantum data and, in principle, reconstruct the entire training set. Giving the classical computer access to this information allows it, in principle, to simulate any quantum algorithm's action on the training data (at exponential cost, unless BQP = BPP).
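This access model is easy to make concrete for a single qubit. The sketch below uses the qubit SIC-POVM (tetrahedral Bloch vectors) as the informationally complete POVM and a hypothetical training state; both choices are assumptions for illustration:

```python
import numpy as np

# Classical access to quantum training data via a fixed informationally
# complete POVM: draw samples j with probability P(j|i) = Tr(F_j rho_i).
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.diag([1.0, -1.0]).astype(complex)

s = 1 / np.sqrt(3)                           # tetrahedral Bloch directions
bloch = [(s, s, s), (s, -s, -s), (-s, s, -s), (-s, -s, s)]
F = [(I2 + nx * X + ny * Y + nz * Z) / 4 for nx, ny, nz in bloch]
assert np.allclose(sum(F), I2)               # valid POVM: elements sum to I

rho = np.array([[0.75, 0.25], [0.25, 0.25]], dtype=complex)  # example rho_i
P = np.array([np.trace(Fj @ rho).real for Fj in F])
assert np.isclose(P.sum(), 1.0)

# The classical learner sees i.i.d. samples from P(.|i), nothing more.
samples = np.random.default_rng(2).choice(4, size=5, p=P)
```

Because the POVM is informationally complete, enough samples determine $\rho_i$ exactly in principle; the quantum advantage, if any, must therefore come from sample or time complexity, not from information the classical learner cannot reach.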
Once an approach such as this is taken, the inputs in the quantum and classical cases are analogous in that both algorithms are allowed to directly make queries of the quantum state and in principle both algorithms can learn a perfect model for the data. Work by Aaronson and others has also shown that this type of reasoning leads to a protocol called quantum shadow tomography [37], which gives a way of predicting the values of most quantum observables using a small number of samples from a quantum system.
The difference between quantum and classical machine learning methods given quantum inputs arises from the way in which we are allowed to interact with the quantum data. In the classical case, we can envision that we are allowed to sample from a training corpus that consists of measurements of the quantum states using the fixed POVM. In the quantum setting, we can apply arbitrary quantum transformations and entangle the data with ancillary quantum bits.
A challenge with this definition of a classical analogue to QML techniques is that the ability to measure an arbitrary POVM can in some cases obfuscate substantial quantum computational power because the POVM elements themselves may require a substantial number of quantum gates to implement. In order to fairly compare the quantum and the classical ML algorithms under this correspondence then the number of classical and quantum operations needed in both the quantum algorithm and the classical algorithm's implementation of P(j|i) must also be compared.

Question 3: how well does my QML algorithm compare to corresponding classical approaches?
This question is perhaps the most important, yet hardest to address, question in QML. There are a number of criteria that are relevant for benchmarking the performance of a QML protocol.
• The value of a loss function attainable through quantum or classical means, such as prediction loss, KL-divergence or intra-cluster variance for clustering algorithms.
• The time complexity of an algorithm, specified by the number of operations needed to successfully execute the algorithm within a given error tolerance.
• The number of samples (for incoherent access models or streaming models) or the number of quantum queries to the training data needed by the training protocol.

All of these issues must be addressed in order to make a compelling argument for a QML protocol. After all, if a quantum model yields greater prediction accuracy than a classical analogue but requires exponentially more samples from the training set, then the advantages yielded by the algorithm may be minimized. In order to assess the value of any quantum algorithm, this question needs to be addressed and the above points should always be considered in a response.
Answer 1: my QML algorithm, which uses amplitude encoding, may be exponentially worse than classical algorithms that use a bit encoding.
Perhaps in light of question 2, this point should not come as a surprise, since the natural analogue of amplitude encoding is a classical probability encoding, and while a bit-oracle can efficiently simulate a query to a probability oracle, the converse is not true.
In order to understand the power of these two methods, let us first focus on the classical case. In the classical case, given in (10), bounds on the distinguishability of distributions cause models that inherit this encoding to be intrinsically weaker than conventional models where the input is given by a bit string. Let us consider two concept classes $C_+$ and $C_-$ for a binary classifier that are separated and compact. Specifically, let $v_+$ and $v_-$ be two vectors taken from these two classes such that $\|v_+ - v_-\|_1 = \min_{x \in C_+, y \in C_-} \|x - y\|_1 = \Delta > 0$. Given access to an oracle that yields bit strings encoding $v_+$ and $v_-$, even these two nearby vectors can be distinguished in a single query. On the other hand, the total variational distance between $\rho_{v_+}$ and $\rho_{v_-}$ is (note that the $k$ labels of both vectors cannot be assumed to be distinct in this context, as otherwise they would hold information that could distinguish the classes)

$\frac{1}{2}\left\|\rho_{v_+} - \rho_{v_-}\right\|_1 = \frac{1}{2}\sum_{j} \left|\frac{(v_+)_j}{\|v_+\|_1} - \frac{(v_-)_j}{\|v_-\|_1}\right| \in O\!\left(\frac{\Delta}{\max_k \|v_k\|_1}\right).$

For fixed $\max_k \|v_k\|_1$, the number of samples needed to distinguish the two distributions using the optimal estimator scales as $O(1/\Delta)$. Thus $O(1/\Delta)$ samples are needed to distinguish between the concept classes $C_+$ and $C_-$, as opposed to a single query in cases where the entire training vector is provided. Conversely, a sample can be generated from the bit string representing $v_+$ or $v_-$, so this model is strictly weaker than the corresponding bit-encoded classical model. A similar result holds for amplitude encoding. The distinguishability of the two encoded states is again given by the trace distance (or Schatten 1-norm), which reads

$\frac{1}{2}\left\| |\psi_{v_+}\rangle\langle\psi_{v_+}| - |\psi_{v_-}\rangle\langle\psi_{v_-}| \right\|_1 = \sqrt{1 - |\langle \psi_{v_+} | \psi_{v_-} \rangle|^2}.$

Thus the two concept classes $C_+$ and $C_-$ cannot be distinguished easily from amplitude encodings if the margin between the two classes is sufficiently small. This shows that, in principle, amplitude encodings that use a small number of qubits can be weaker than classical classifiers for some problems. Note that the sample complexity for learning the concept class scales as $O(1/\Delta)$ in this setting as well.
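The margin dependence of both distances can be checked numerically. This sketch assumes the 1-norm probability encoding of (10) and its square-root amplitude encoding, with two illustrative nearby vectors:

```python
import numpy as np

# Distinguishability of two nearby concept representatives: total variation
# distance for the probability encoding and trace distance
# sqrt(1 - |<psi+|psi->|^2) for the corresponding amplitude encoding.
v_plus = np.array([5.0, 3.0, 2.0])
v_minus = np.array([5.0, 2.9, 2.1])               # l1 margin Delta = 0.2

p = v_plus / v_plus.sum()                          # probability encodings
q = v_minus / v_minus.sum()
tv = 0.5 * np.abs(p - q).sum()                     # shrinks with the margin

overlap = np.sqrt(p) @ np.sqrt(q)                  # <psi+|psi-> for sqrt states
trace_dist = np.sqrt(1 - overlap**2)

# Both distances vanish as Delta -> 0, so many samples or queries are needed
# to separate the classes, in contrast to one bit-oracle query.
assert tv < 0.02 and trace_dist < 0.15
```

Note that the trace distance exceeds the total variation distance here, reflecting the quadratically better distinguishability of the quantum encoding that amplitude amplification can exploit.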
In principle, the number of samples needed (i.e. queries) can be quadratically boosted by amplitude amplification [50] but this does not change the fact that exponentially many state preparations will be needed.
As an example of a challenging concept to learn using an amplitude encoding that uses a small number of quantum bits, let us consider learning the parity concept. Specifically, given $n$ bits, the parity of the bits is 0 if there is an even number of 1s in the string and 1 otherwise. The most concise amplitude encoding uses only $O(\log_2(n))$ qubits (for a constant number of training vectors) to store a training vector. Consider two bit strings that differ in a single bit, and hence have opposite parities, such as $0^n$ and $10^{n-1}$. Under a natural amplitude encoding that maps the bits of a string $x$ to signs,

$|\psi_x\rangle = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} (-1)^{x_j} |j\rangle,$

the inner product between the two states is $1 - O(1/n)$, and therefore even if one is able to find an optimal classifier, a number of repetitions that is exponential in the number of qubits will be needed in order to decide on a class. Further, identifying an optimal classifier using a simple circuit remains a challenge. This can be addressed, in part, by simply using more qubits. If $O(n)$ qubits are used then the parity can be computed using a string of CNOT gates, which allows the class to be trivially determined. Thus a compact amplitude encoding may not always be desirable in terms of the sample complexity. Despite these theoretical challenges, the majority of the real-world problems that we consider are not as pathological as the parity problem. This suggests that relatively large margins exist when such problems are represented within the amplitude encodings that we use. However, it is important to keep these information theoretic restrictions in mind when considering how to use amplitude encoding when building classifiers, as well as the potential impacts that choosing an improper amplitude encoding can have on the decision boundaries for a problem and, in turn, on the sample complexity needed to classify data.
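The near-unit overlap for opposite-parity strings is easy to verify. The sign-based encoding below is one natural choice, assumed here for illustration rather than taken from the original analysis:

```python
import numpy as np

# Compact amplitude encoding of a bit string: map each bit to a sign and
# normalize, so n bits occupy only O(log n) qubits' worth of amplitudes.
def sign_encode(bits):
    amps = np.array([(-1) ** b for b in bits], dtype=float)
    return amps / np.sqrt(len(bits))

n = 100
x = [0] * n                   # parity 0
y = [1] + [0] * (n - 1)       # parity 1, differs in a single bit

overlap = sign_encode(x) @ sign_encode(y)
assert np.isclose(overlap, 1 - 2 / n)   # nearly indistinguishable for large n
```

Since the overlap approaches 1 as $n$ grows while the register size grows only logarithmically, the number of repetitions needed to tell the classes apart is exponential in the number of qubits used.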
Answer 2: I really do not know, but I have uncompelling numerical/experimental results that may show advantages for an objective function I chose on a few small data sets.
The face of machine learning has changed within the last decade: rather than focusing on models that are based on a solid mathematical bedrock, machine learning (particularly in vision and speech) has begun to rely more heavily on models whose theoretical performance may be poorly understood but whose empirical performance often can be quite impressive. The success of this endeavor has been a result of several factors:

• The availability of sufficient computing power to train and validate models [51].
• The presence of standardized metrics by which to evaluate the performance of ML algorithms [52][53][54].
• Open source implementations through packages such as Tensorflow [55] or PyTorch [56] that allow (in principle) easy comparison to other state-of-the-art ML methods.

This methodology has begun to influence QML as well, empowered in part by the easy availability of quantum hardware in the form of the IBM quantum experience as well as a plethora of programming languages [57][58][59]. As a consequence, less emphasis is being placed on examining QML from the perspective of computer science and more on examining it from a data science perspective. In my opinion, this is perhaps a misstep, because many of the features that make empirical approaches so successful for classical ML are absent from QML.
Present quantum computers are limited in scale and gate fidelity [60], which means that we often do not have sufficient quantum computing resources to test heuristic QML algorithms. Even simply evaluating the performance of the final model belies the much more substantial cost of learning reasonable hyperparameters, such as learning rates or architectures for the parameterized quantum gates that comprise the algorithms. This means that, even numerically, it is very difficult with existing computing resources to train models beyond 20 qubits, and in experiment, Herculean efforts are needed to perform QML/variational algorithms on 12 or fewer quantum bits [61]. It should also be noted that these memory restrictions are not present in existing quantum annealers, although because of the absence of so-called non-stoquastic couplings such approaches can rigorously be simulated in polynomial time on classical computers [62,63]. While these simulation methods are far from practical, they reveal that it remains unclear whether existing quantum annealers will yield advantages for typical ML problems.

Table 1. Benchmarks of the circuit-centric classification algorithm of [25]. The cells are of the format 'training error/validation error'. The variance between the 50 repetitions of each experiment was of the order of 0.01-0.001 for the training and test error. *For multilabel classification problems with d labels, the average train and test errors of all d one-versus-all problems were taken. Note that the quantum algorithm in question outperforms, in terms of test error, all other reference models only on the cancer data set.

Quantum data sets have also yet to go through the standardization process that allowed the empirical performance of classical ML algorithms to be gauged. Excellent examples of such classical benchmarks include the MNIST dataset, the UCI repository and ImageNet.
These benchmarks, while appropriate for amplitude encodings or applications on annealers, often cannot be evaluated for bit encodings on existing quantum computers or classical simulators. This has meant that most comparisons are done on very small synthetic data sets that are easily scalable, such as the bars and stripes set [64]. As a result, while small-scale experiments can be performed at present, it is unclear whether any purported performance advantage will persist as we scale the problem size up.
If amplitude encoding is used then the above criticisms do not apply. However, a new problem emerges involving benchmarking the algorithm against its appropriate classical analogue. In order to fairly compare the results, we need to train our classical algorithms on the exact same training data provided to the quantum algorithm. If we do not, then, as demonstrated by the parity example above, quantum approaches can appear much weaker than classical ones when in fact such differences may only arise from the use of a different input model.
A further challenge in providing appropriate benchmarks arises from the lack of standardized metrics for comparison. The lack of a common set of benchmarks, as well as of reference implementations, means that the author is responsible not only for implementing and optimizing their algorithm but also its competition. The problem is that the author faces an incentive to show an advantage for their algorithm over the competition, and thus it is difficult, even for the most high-minded researcher, to fairly compare their proposal to similarly optimized competitive approaches. Standardized benchmarks, as well as open source reference implementations of competitive algorithms, will be needed to meaningfully answer the question of whether a QML algorithm provides an advantage over its classical analogue.
In order to publish a set of numerical experiments, it is often essential to show evidence of an advantage over analogous or competitive classical techniques. In the event that an advantage is not seen then the result will have a hard time being published in a prestigious venue. For this reason, the machine learner faces two main temptations. The first, is to selectively choose the test sets that the model is evaluated on in order to show an advantage. The second temptation is to spend more effort optimizing the quantum algorithm than is spent optimizing the classical algorithms.
A good example of a thorough effort to benchmark quantum classifiers is given in table 1, wherein the performance of the circuit-centric classification algorithm of [25] is examined for a wide range of test cases (although the work does conflate amplitude encodings with bit encodings to some extent). In such cases a temptation exists to publish only the favorable data, in this case the performance of the algorithm on the cancer data. Instead, by comparing the performance of the algorithm over a wide range of data sets, a more nuanced view of the quantum algorithm's performance is possible; in this case it revealed that the advantage of the proposed heuristic algorithm was not in the accuracy attainable but rather in the dramatically smaller number of model parameters required to achieve performance comparable to the reference algorithms.
In short, if we are to provide compelling numerical studies of the performance of QML, we must continue to emulate the best practices of the classical machine learning community. Given the present excitement surrounding both quantum algorithms and machine learning, we owe it to the scientific community to do our best to critically evaluate QML algorithms. Moreover, since the classical community does not (yet) test our protocols against their methods, we cannot be sure that the reference examples we test against are representative of the best that that community can provide. For this reason, until the field matures, great effort will be needed in order to understand the true performance of existing and future heuristic quantum ML algorithms.

Answer 3: I have not done a solid empirical comparison but I see polynomial advantages in either query/sample/time complexity or have complexity theoretic arguments for why my scheme cannot be efficiently simulated classically.
This answer is another common type of response to this question, and to me it is often the reply to give yourself when considering a new heuristic quantum algorithm. Examples include heuristics whose underlying quantum circuits form a universal gate set [25], or whose efficient classical simulation would imply a partial collapse of the polynomial hierarchy [65]. Such arguments provide confidence that, regardless of the input model, the quantum algorithm is unlikely to be simulatable classically.
In the event that the quantum algorithm is an accelerated form of a classical protocol, such as the first Boltzmann training algorithms [14], clustering [9,17], recommendation systems algorithms [12] and more, then query complexity separations can also be used to suggest that there may be a speedup to be gleaned from such an application.
Without one of these facts, or a related compelling justification for the algorithm, the impact of small-scale implementations of the algorithm is diminished. For this reason, it is important to consider this question well before any experiments are performed.
A nice side effect of trying to answer the question in this fashion is that it forces you to think about what metrics you will use to compare your quantum algorithm to its classical analogue. As mentioned, a host of different types of resources emerge in such algorithms, and by clearly articulating the tradeoffs between them it often becomes easier to contextualize any advantages once they become apparent, or to appreciate other advantages that may appear even if the quantum model affords no greater prediction accuracy than the classical method.

Question 4: what fundamental limitations exist for my quantum machine learning task?
Even if the previous questions are answered, an important last question to ask involves determining what fundamental obstacles stand in the way of QML algorithms. I have found this sort of question to be an excellent sanity check and a challenge to see how far a quantum algorithm can be improved.
Answer 1: there are complexity theoretic reasons why I cannot solve this QML task in polynomial time.
This answer arises frequently in QML. Many problems in machine learning can be recast as optimization problems that are NP-hard. K-means clustering is an excellent example of such a protocol [4], and as a result we strongly suspect that quantum computers will be unable to provide exponential speedups for such problems [9]. On the other hand, other protocols such as training quantum Boltzmann machines can reduce to problems that are QMA-hard [34]. Other tasks, such as preparing Gibbs states using quantum rejection sampling, implicitly involve estimating partition functions [66]. This task is #P-hard, which means that efficient exact algorithms for these protocols are highly unlikely to exist.
Answer 2: there are information theoretic limitations that impact the performance of my algorithm.
Even if there are no compelling reasons why improvements to an algorithm ought to be hard from complexity theoretic arguments, information theoretic arguments can also place limitations on QML protocols. These limitations can be especially pronounced in amplitude encodings. An example of such a limitation is given above, where we showed that bounds on state discrimination can prevent vectors from being unambiguously classified using a sub-exponential number of measurements. Similarly, bounds on approximate cloning place limitations on the number of samples needed to learn an accurate generative model for an unknown quantum state, and results on channel compression have been used to argue about the maximum accuracy attainable for quantum autoencoders [27]. I find information theoretic bounds particularly important because they not only provide simple arguments for the limitations of QML algorithms but also build bridges between the young field of QML and the much more mature field of quantum information theory. These bridges can, at times, allow us to leverage decades worth of development within that space, and sometimes lead to easy answers to questions that we would otherwise be unable even to articulate.
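The state-discrimination limitation on amplitude encodings can be made concrete in a few lines of numpy. The sketch below is my own toy illustration, not a protocol from the literature: it phase-encodes a bit string x into the signs of a uniform superposition, so that two strings differing in a single bit produce states with fidelity ((d-2)/d)^2 for d = 2^n amplitudes. Using the standard ~1/(1-F) scaling for the number of copies needed to reliably discriminate two states of fidelity F, the copy count grows like 2^(n-2), even though a classical bit-encoding reveals the differing bit with a single query.

```python
import numpy as np

def phase_encode(bits):
    """Encode bit string x as |psi_x> = 2^{-n/2} sum_i (-1)^{x_i} |i>."""
    d = len(bits)
    return np.where(np.asarray(bits) == 0, 1.0, -1.0) / np.sqrt(d)

rng = np.random.default_rng(1)
copies = {}
for n in (4, 8, 12):
    d = 2 ** n
    x = rng.integers(0, 2, d)
    y = x.copy()
    y[0] ^= 1                                       # differ in exactly one bit
    F = np.dot(phase_encode(x), phase_encode(y)) ** 2   # fidelity ((d-2)/d)^2
    copies[n] = 1.0 / (1.0 - F)    # ~ copies needed for reliable discrimination
    print(f"n={n:2d}  F={F:.6f}  copies ~ {copies[n]:.1f}")
```

The printed copy counts roughly quadruple with every two extra qubits, matching the exponential query separation discussed above.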
Answer 3: there are arguments from random matrix theory that training my model will be generically hard.
A final limitation that often emerges involves insights gleaned from random matrix theory. Random matrix theory has a long history of being used to model the distributions produced by complex quantum dynamics, arising in quantum chaos, in quantum supremacy experiments and, most recently, in the training of quantum neural networks. These arguments, used to great effect in [65], show that with high probability the gradients of training objective functions in large unstructured quantum circuits are concentrated exponentially close to zero. Given that even quantum algorithms for gradient estimation require O(√D/ε) queries to estimate the gradient of a model with D parameters acting on n qubits within error ε (in the infinity-norm) [48], this implies that with high probability O(poly(2^n)) applications of the model will be needed even to learn the sign of any component of the gradient accurately.
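This gradient concentration can be observed numerically even at small scales. The following self-contained sketch is my own illustration, using a generic hardware-efficient ansatz (layers of Ry rotations followed by a ring of CNOTs) as a stand-in for the unstructured circuits analyzed in [65]: it estimates, over randomly initialized circuits, the variance of the parameter-shift gradient of <Z_0> with respect to the first angle, which shrinks rapidly as qubits are added.

```python
import numpy as np

def apply_1q(state, gate, q, n):
    """Apply a 2x2 gate to qubit q of an n-qubit statevector."""
    psi = np.moveaxis(state.reshape([2] * n), q, 0)
    psi = (gate @ psi.reshape(2, -1)).reshape([2] * n)
    return np.moveaxis(psi, 0, q).reshape(-1)

def apply_cnot(state, ctrl, targ, n):
    """Apply a CNOT with the given control and target qubits."""
    psi = state.reshape([2] * n).copy()
    idx = [slice(None)] * n
    idx[ctrl] = 1                          # restrict to the control=1 subspace
    t = targ if targ < ctrl else targ - 1  # target axis after fixing ctrl
    psi[tuple(idx)] = np.flip(psi[tuple(idx)], axis=t)  # X on the target
    return psi.reshape(-1)

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def expect_z0(theta, n, layers):
    """<Z_0> after `layers` of per-qubit Ry rotations plus a CNOT ring."""
    psi = np.zeros(2 ** n); psi[0] = 1.0
    k = 0
    for _ in range(layers):
        for q in range(n):
            psi = apply_1q(psi, ry(theta[k]), q, n); k += 1
        for q in range(n):
            psi = apply_cnot(psi, q, (q + 1) % n, n)
    p = np.abs(psi.reshape(2, -1)) ** 2
    return p[0].sum() - p[1].sum()

def grad0(theta, n, layers):
    """Parameter-shift gradient of <Z_0> w.r.t. the first angle."""
    tp, tm = theta.copy(), theta.copy()
    tp[0] += np.pi / 2; tm[0] -= np.pi / 2
    return 0.5 * (expect_z0(tp, n, layers) - expect_z0(tm, n, layers))

rng = np.random.default_rng(0)
layers, samples = 20, 200
variances = {}
for n in (2, 4, 6):
    g = [grad0(rng.uniform(0, 2 * np.pi, n * layers), n, layers)
         for _ in range(samples)]
    variances[n] = np.var(g)
    print(f"n={n}  Var[dE/dtheta_0] = {variances[n]:.5f}")
```

Pushing n higher makes the concentration increasingly stark, consistent with the exponential barren plateau scaling; here the toy sizes are kept small so the dense statevector simulation stays fast.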
It should be noted that such problems do not usually arise in classical bit encodings because of back propagation, which can rely on double precision to apply the chain rule implicitly and compute gradients of the training objective function. In quantum settings, the development of quantum backpropagation algorithms does not address this issue in most cases (in particular for quantum or amplitude-encoded input models) because of the precision requirements on the outputs. For these reasons, it is my belief that older approaches to the gradient decay problem, such as generative pre-training, may play a larger role in QML than they presently play in classical machine learning. An example of such generative pre-training would be to use a model like a Boltzmann machine, wherein the gradients of the cross-entropy term in the KL-divergence (or more generally the quantum relative entropy) [16,26] are used to train the model before switching over to a loss function appropriate for a supervised task. This can yield substantial advantages because the objective function is logarithmic, which can dramatically amplify the otherwise exponentially small gradients that such barren plateaus impose.
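As a purely classical illustration of this style of generative pre-training (my own toy construction, not a protocol from [16,26]), the snippet below trains a tiny fully visible Boltzmann machine by exact gradient ascent on the log-likelihood: the gradient of the KL term reduces to the difference between data and model correlation matrices, and the learned weights could then seed a supervised model.

```python
import numpy as np

rng = np.random.default_rng(2)
n_v = 4                        # visible units of a tiny fully visible BM
# all 2^n_v spin configurations in {-1, +1}
states = np.array([[(s >> i) & 1 for i in range(n_v)]
                   for s in range(2 ** n_v)]) * 2 - 1

def model_probs(W):
    """Exact Boltzmann distribution p(s) proportional to exp(0.5 s^T W s)."""
    E = 0.5 * np.einsum('si,ij,sj->s', states, W, states)
    p = np.exp(E - E.max())    # subtract max for numerical stability
    return p / p.sum()

# synthetic "data": a Boltzmann distribution with hidden true couplings
W_true = rng.normal(0, 0.5, (n_v, n_v))
W_true = (W_true + W_true.T) / 2
np.fill_diagonal(W_true, 0)
p_data = model_probs(W_true)

W = np.zeros((n_v, n_v))
for step in range(500):
    p = model_probs(W)
    # KL gradient: data correlations minus model correlations
    corr_data = np.einsum('s,si,sj->ij', p_data, states, states)
    corr_model = np.einsum('s,si,sj->ij', p, states, states)
    W += 0.1 * (corr_data - corr_model)
    np.fill_diagonal(W, 0)

kl = float(np.sum(p_data * np.log(p_data / model_probs(W))))
print(f"final KL(p_data || p_model) = {kl:.6f}")
```

Because the model is small enough to enumerate exactly, no sampling is needed; in realistic (quantum or classical) settings the model correlations would instead be estimated from Gibbs samples or measurements.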
While generative pre-training may sometimes be used to surmount these problems, more commonly insight into the form of the data set is used to try to ensure the model weights are chosen so as to avoid the 'chaotic regime' where the gradients vanish [65]. Nonetheless, recognizing that many approaches to quantum neural networks may have gradients that are hard to evaluate is vitally important and stresses the significance of beginning the training process with an intelligently chosen weight distribution.

Conclusion
I have given a short list of the questions that I often ask myself when designing a quantum algorithm for a machine learning task. The key message behind these questions should be seen as a call for increased self-criticality. We need to know more than just how to execute our quantum algorithms: we need to know why to execute them. We need to know what the natural classical analogues are to our algorithms and we need to know what limitations physics places against us in the face of our attempts to bend the laws of nature to strengthen our machine intelligences. While QML may provide a path towards hitherto unrealized possibilities for understanding both quantum and classical datasets, we need to critically engage with ourselves to help understand the role that quantum will play in this future and provide the information needed to help those in the broader machine learning community both assess the significance of our developments and contextualize them with respect to existing knowledge. Until a deeper back and forth dialogue begins between these two camps, we will have to be our own gatekeepers and to that end ask ourselves the hard questions that will not only strengthen the science but help propel us closer to achieving the grand dreams of QML.