Quantum speedups of some general-purpose numerical optimisation algorithms

We give quantum speedups of several general-purpose numerical optimisation methods for minimising a function $f:\mathbb{R}^n \to \mathbb{R}$. First, we show that many techniques for global optimisation under a Lipschitz constraint can be accelerated near-quadratically. Second, we show that backtracking line search, an ingredient in quasi-Newton optimisation algorithms, can be accelerated up to quadratically. Third, we show that a component of the Nelder-Mead algorithm can be accelerated by up to a multiplicative factor of $O(\sqrt{n})$. Fourth, we show that a quantum gradient computation algorithm of Gily\'en et al. can be used to approximately compute gradients in the framework of stochastic gradient descent. In each case, our results are based on applying existing quantum algorithms to accelerate specific components of the classical algorithms, rather than developing new quantum techniques.


Introduction
Quantum computers are designed to use quantum mechanics to outperform their classical counterparts. As well as the remarkable exponential speedups that are known for specialised problems such as integer factorisation and simulation of quantum-mechanical systems, there are also quantum algorithms which speed up general-purpose classical algorithms in the domains of combinatorial search and optimisation. These algorithms may achieve relatively modest speedups, but make up for this by having very broad applications. The most famous example is Grover's algorithm [26], which achieves a quadratic speedup of classical unstructured search, and can be used to accelerate classical algorithms for solving hard constraint satisfaction problems such as Boolean satisfiability.
Here our focus is on quantum algorithms that accelerate classical numerical optimisation algorithms: that is, algorithms that attempt to solve the problem of finding x ∈ R n such that f (x) is minimised, for some function f : R n → R. (We use boldface throughout for elements of R n .) A vast number of optimisation algorithms are known. Some algorithms seek to find (or approximate) a global minimum of f , given some constraints on f ; others only attempt to find a local minimum. Some algorithms have provable correctness and/or performance bounds, while the performance of others must be verified experimentally. Whether or not an algorithm has good theoretical properties, its performance on a given problem often can only be determined by running it. These factors have led to the development and use of many numerical optimisation algorithms based on varied techniques.
Here we consider some prominent general-purpose numerical optimisation techniques, and investigate the extent to which they can be accelerated by quantum algorithms. We stress that our goal is not to develop new quantum optimisation techniques (that perhaps would not have rigorous performance bounds), but rather to find quantum algorithms that speed up existing classical techniques, while retaining the same performance guarantees. That is, if the classical algorithm performs well in terms of solution quality or execution time on a given problem instance, the quantum algorithm should also perform well. We assume throughout that the quantum algorithm has access to an oracle that computes f (x) exactly on particular inputs x, implemented as a quantum circuit 1 . That is, we assume we have access to the map |x |0 → |x |f (x) . This contrasts with a model sometimes used elsewhere in the literature, where x is assumed to be provided to the quantum algorithm as a quantum state of log 2 n qubits [34,47] stored in a quantum RAM, and the goal is to produce a quantum state corresponding to arg min x f (x).
Our results can be summarised as follows, where we use the notation (as in the rest of the paper) T (f ) for an upper bound on the time required to evaluate the function f . See Table 1 for a summary of the speedups we obtain.
• Section 2: We show that a number of techniques for global optimisation under a Lipschitz constraint can be accelerated near-quadratically, and also discuss some challenges associated with speeding up the related and well-known classical algorithm DIRECT [31]. In Lipschitzian optimisation, one assumes that |f (x) − f (y)| ≤ K x − y for some K that is known in advance (the Lipschitz constant of f ), where · is the Euclidean norm. Many techniques for Lipschitzian optimisation can be understood in the framework of branch-and-bound algorithms [28]. These algorithms are based on dividing f 's domain into subsets, and using a lower-bounding procedure to rule out certain subsets from consideration. This enables the use of a quantum algorithm for speeding up branch-and-bound algorithms [43]. The complexity of branch-and-bound algorithms is controlled by a parameter T min discussed below; the quantum algorithm achieves a quadratic reduction in complexity in terms of this parameter. A simple representative example of an algorithm fitting into this framework is Galperin's cubic algorithm [21]. In this case, the quantum algorithm's complexity is then O( √ T min d 3/2 2 n T (f )), where d is the depth of the branch-and-bound tree, whereas the classical complexity is O(T min 2 n T (f )).
• Section 3: We show that backtracking line search [45, Algorithm 3.1], a subroutine used in many quasi-Newton optimisation algorithms such as the BFGS algorithm, can be accelerated using a quantum algorithm which is a variant of Grover search [39]. Backtracking line search is based on choosing a direction d and searching along that direction. If the overall algorithm makes k iterations, the complexity of choosing d is τ (d), and the number of search steps taken by the classical algorithm is m 0 , the complexity of one iteration of this classical routine is O(τ (d) + m 0 T (f )), while the complexity of the quantum algorithm is O(τ (d) + √ m 0 (log k)T (f )).
• Section 4: We show that the Nelder-Mead algorithm [44], a widely-used derivative-free numerical optimisation algorithm, can be accelerated using quantum minimum-finding [17]. The algorithm is an iterative procedure based on maintaining a simplex. Assume that T (f ) = Ω(n 3/2 ), and that the algorithm performs k iterations, s of which are "shrink" steps (qv). Then the complexity of the quantum algorithm is O(((s + 1) √ n log k + k)T (f ))), as compared with the classical complexity, O(((s + 1)n + k)T (f )). So if the number of shrink steps is large with respect to k, or k is small, the quantum speedup can be relatively substantial (up to a O( √ n) factor).
• Section 5: Approximate computation of a gradient is a key subroutine in many optimisation algorithms, including the very widely-used gradient descent algorithm [8]. We show that the gradient of § Gradients of averaged functions Quantum gradient computation [22] can be computed more efficiently using a quantum algorithm of Gilyén, Arunachalam and Wiebe [22]. Given that each individual function f i is bounded and can be computed in time T (f ) (and satisfies some technical constraints on its partial derivatives), the quantum algorithm outputs an approximation of the gradient that is accurate up to in the ∞ norm, in time O( √ nT (f ) −1 ), as compared with the classical complexity O(nT (f ) −2 ). (The O notation hides polylogarithmic factors in N , n and 1/ .) However, as we will discuss, it is not clear whether this notion of approximation is sufficient to accelerate classical stochastic gradient descent algorithms.
In each case, the quantum speedups we find are based on the use of existing quantum algorithms, rather than the development of new algorithmic techniques. We believe that there are many more quantum speedups of numerical optimisation algorithms to be discovered. We remark that, in many of the cases we consider, the extent of the quantum speedup achieved depends on the interplay of various parameters governing the optimisation algorithm's runtime, so not every problem instance will yield a speedup.
Prior work on quantum speedups of numerical optimisation algorithms (as opposed to the analysis of new quantum algorithms such as the adiabatic algorithm [20] or quantum approximate optimisation algorithm [30,19]) has been relatively limited. Dürr and Høyer [17] gave a quantum algorithm to find a global minimum of a function f on a discrete space of size N , which is based on the use of Grover's algorithm and uses O( √ N ) evaluations of f . Arunachalam [5] applied Dürr and Høyer's algorithm to improve the generalised pattern search and mesh-adaptive direct search optimisation algorithms. A sequence of papers has found quantum speedups of linear programming and semidefinite programming algorithms [10,3,2,35,9]; quantum speedups of more general convex optimisation algorithms are also known [51,14]. Quantum speedups are known for computing gradients [32,22,15], an important subroutine in many optimisation algorithms; larger (exponential) speedups could be available in gradient descent-type algorithms if the inputs to the optimisation algorithm are available in a quantum RAM (qRAM) [34,47]. Recently, it was shown that classical algorithms based on the general technique known as branch-and-bound can be accelerated near-quadratically [43].

Branch-and-bound algorithms for global optimisation with a Lipschitz constraint
Finding a global minimum of an arbitrary function f : R n → R can be a very challenging (or indeed impossible) task. One way to make this problem more tractable is to assume that f satisfies a Lipschitz condition: |f (x) − f (y)| ≤ K x − y for some K that is known in advance, where · is the Euclidean norm.
Finding a global minimum of f under this condition is known as Lipschitzian optimisation. Lipschitzian optimisation is very general and hence can be applied in many contexts. Hansen and Jaumard [28] describe a selection of applications of Lipschitzian optimisation, including solution of nonlinear equations and inequalities; parametrisation of statistical models; black box system optimisation; and location problems.
It is natural to restrict the domain of f to [0, 1] n , and to assume that f is bounded such that f (x) ∈ [0, 1] for all x ∈ [0, 1] n . Finally, we can relax to solving the approximate optimisation problem of finding y such that f (y) − min x∈[0,1] n f (x) ≤ , for some accuracy parameter that is determined in advance. Even in the case n = 1 and with these restrictions, this problem is far from trivial. One class of algorithms that can solve Lipschitzian optimisation problems are branch-and-bound algorithms. Generically, a branch-andbound algorithm solves a minimisation problem using the following procedures: • A branching procedure which, given a subset S of possible solutions, divides S into two or more smaller subsets, or returns that S should not be divided further.
• A bounding procedure which, when given a subset S produced during the branching process, returns a lower bound L(S) such that L(S) ≤ min x∈S f (x).
Branch-and-bound algorithms can be seen as exploring a tree, whose vertices correspond to subsets S. The children of a subset S correspond to the subsets which S was divided into, and leaves are subsets that should not be divided further. For a leaf, one should additionally have that L(S) = min x∈S f (x).
Branch-and-bound algorithms use the additional information provided by the branch and bound procedures to explore the most promising sets S early on, and to avoid exploring subsets S such that L(S) is larger than the best solution found so far. One can show that the complexity of an optimal classical branch-and-bound algorithm based on these generic procedures is controlled by the size of the branch-and-bound tree, truncated by deleting all vertices whose corresponding lower bounds are less than the optimal cost min x f (x): if the size of this tree is T min , the optimal classical algorithm makes Θ(T min ) calls to the branch and bound procedures [33]. It is not required to know T min in order to apply this bound.
A generic framework for branch-and-bound algorithms in the context of Lipschitzian optimisation was given by Hansen and Jaumard [28,Section 3.3], and we describe it as Algorithm 1. The algorithm splits [0, 1] n into hyperrectangles I, each of which is recursively split again. Each hyperrectangle has an associated upper bound (obtained by evaluating f at a discrete set of points in that hyperrectangle) and lower bound (obtained via a separate lower-bounding function), and the algorithm terminates when it finds a hyperrectangle whose upper bound is sufficiently close to its lower bound. Convergence is guaranteed if some simple criteria are satisfied, discussed in [28] (for example, the upper bound and lower bound should converge as the interval size tends to 0). Hansen and Jaumard show that many previously known algorithms for Lipschitzian optimisation can be understood as particular cases of Algorithm 1. These include Galperin's cubic algorithm [21], which proceeds by dividing the search space into hypercubes, and algorithms of Pijavskii [46], Shubert [48] and Mladineo [40].
The branching procedure of Algorithm 1 fits into the standard branch-and-bound framework. Given a subset I j , an upper bound is obtained by evaluating f (x) at a discrete set of positions x, and a lower bound is obtained using the bounding function F j . If the two are within , I j should not be expanded further. Otherwise, I j is split into subsets. Algorithm 1 has a notion of selecting the next subset in L using a selection rule, but it is shown in [33] that the best possible selection rule in branch-and-bound procedures (in a query complexity sense) is to expand the subset whose bounding function is smallest 2 . i. Partition I into hyperrectangles I 1 , . . . , I p according to a branching rule [Branch] ii. For j = 1, . . . , p: Algorithm 1: Generic branch-and-bound algorithm for Lipschitzian optimisation problems [28] There is a quantum algorithm that can achieve a near-quadratic speedup of classical branch-and-bound algorithms [43]. The algorithm is based on the use of quantum procedures for estimating the size of an unknown tree [1], and searching within such a tree [6,7,42]. The algorithm achieves a complexity of O( √ T min d 3/2 ) uses of the branch and bound procedures for finding the minimum of f up to accuracy . In this bound d is the maximal depth of the branch-and-bound tree and the O notation hides polylogarithmic factors in d, 1/ , and 1/δ, where δ is the probability of failure. (We remark that the algorithm as presented in [43] assumes knowledge of an upper bound on d in advance, but such a bound can be found efficiently by applying the quantum tree search algorithms of [42,6,7] to the branch-and-bound tree obtained by truncating at depth d , with exponentially increasing choices of d , until d is found where the corresponding tree does not contain any internal vertices that have not been expanded.) The quantum branch-and-bound algorithm can immediately be applied to Algorithm 1. If the time complexity of the branching and bounding rules is upper-bounded by C, the cost of the quantum algorithm is O( √ T min d 3/2 C), as compared with the classical complexity, which is O(T min C). If T min d, the speedup of the quantum algorithm over its classical counterpart in terms of the number of uses of the branching and bounding rules is near-quadratic. If these rules in turn are relatively simple to compute compared with T min (as is likely to be the case for challenging optimisation problems that occur in practice), this translates into a near-quadratic runtime speedup.
To illustrate how this approach could be applied in practice, a simple example of an algorithm fitting into this framework is Galperin's cubic algorithm [21]. The branch and bound procedures are defined as The result of a few steps of splitting into subintervals is shown. The centres of intervals are labelled below with the step at which they are divided into subintervals (red), and the lower bound in that interval (blue). Endpoints are labelled above with the evaluated function values, shown to two decimal places.
follows, recalling that K is the Lipschitz constant of f : • Branch: the subproblem I corresponding to a hypercube is divided into p = q n equal hypercubes, for some q ≥ 2, by dividing each side into q equal parts.
• Lower bounding rule: Let x 0 be an extreme point of I. I has side length 1/q k for some integer k.
n, maximised over extreme points of I.
• Upper bounding rule: Evaluate f on the extreme points of I and return the minimum value found.
Galperin's algorithm is illustrated in Figure 2 for the case n = 1. The complexity of the branch and bounding steps is dominated by the cost of evaluating f at the extreme points of each hypercube I, which is O(2 n T (f )). The quantum complexity is then O( √ T min d 3/2 2 n T (f )), whereas the classical complexity is O(T min 2 n T (f )); so we see that the speedup is largest for small n, e.g. n = O(1).

The DIRECT algorithm
A prominent algorithm proposed to handle Lipschitzian optimisation for n-variate functions where one does not know the Lipschitz constant in advance is known as DIRECT [31] (for "dividing rectangles"). The basic concept is to divide [0, 1] n into (hyper)rectangles, and at each step of the algorithm to produce a list of potentially optimal rectangles, which are those that should be expanded further; see Appendix A for more details. This is similar to the branch-and-bound algorithms of the previous section, but with the additional complication of generating the list of potentially optimal rectangles, which involves interaction across several nodes of the branch-and-bound tree. This creates a difficulty for the quantum branch-andbound algorithm, as it can only use branch and bound procedures based on only local information from the tree. Therefore it is unclear whether a similar quadratic speedup can be obtained.
To identify the potentially optimal vertices, the DIRECT algorithm uses a 2d convex hull algorithm. It is a natural idea to speed this up via a quantum convex hull algorithm. Lanzagorta and Uhlmann [38] have described a quantum algorithm based on Grover's algorithm for computing a convex hull of m points in 2d with complexity O( √ mh), where h is the number of points in the convex hull; they also give an algorithm based on a heuristic whose runtime may be O( √ mh) for practically relevant problems. However, the special 1. Choose a starting point x 0 and constants γ ∈ (0, 1) and β ∈ (0, 1). Set x ← x 0 .

Choose a direction d such that
3. Compute the step size: We apply this result to step 3 of the classical algorithm to achieve a square-root reduction in the dependence on m 0 . To achieve a final probability of failure bounded by a small constant, by a union bound over the k iterations, it is sufficient to repeat the algorithm of Theorem 1 O(log k) times to achieve O(1/k) failure probability at each iteration. This gives an overall complexity of the quantum algorithm which is O(τ (d) + √ m 0 (log k)T (f )) per iteration. If the overall algorithm makes k iterations, and m max is the largest value of m 0 for any iteration, we have an overall complexity of O(k(τ (d) + √ m max (log k)T (f ))).
In cases where τ (d) = O(n) (such as the steepest descent method), T (f ) = Ω(n), and k is not exponentially large in n, the dominant term in this complexity bound is the second one, and we always achieve a quantum speedup.
The assumption T (f ) = Ω(n) is natural if f depends on all n variables.
. Therefore, the speedup achieved by the quantum algorithm (based on this worst-case bound) will be greatest when L is large (representing that ∇f could change rapidly), yet |D d f (x)| is small (representing that f does not change rapidly in direction d).
Another way in which one might hope to speed up Algorithm 3 is computing D d f (x) more efficiently. For example, a quantum algorithm was presented by Gilyén, Arunachalam and Wiebe [23], based on a detailed analysis of and modifications to an earlier algorithm of Jordan [32], that approximately computes ∇f (x) for smooth functions quadratically more efficiently than classical methods (that are based e.g. on finite differences). However, it seems challenging to prove that such an approximation can be inserted in the backtracking line search framework without affecting the performance of the overall algorithm, in the worst case. This is because even a small change in the direction d can significantly change the behaviour of the algorithm, as the definition of Step 3 of Algorithm 3 is such that an arbitrarily small change to the values taken by f along the direction d can change m 0 substantially. See Section 5 below for a further discussion of this algorithm.
Finally, we remark that one simple way to find a direction d such that D d (f ) is nonzero, as required for the line search procedure, is to choose i such that ∂f /∂x i is nonzero. Although a valid choice, in practice this could be less efficient than (for example) moving in the direction of steepest descent. The use of Grover's algorithm would reduce the complexity of this step to O( √ n(log k)T (f )), as compared with the classical O(nT (f )).

Nelder-Mead algorithm
The Nelder-Mead algorithm is a direct search optimisation algorithm; that is, one which does not require information about the gradient of the objective function. It is commonly-used and implemented within many computer algebra packages. However, little convergence theory exists and in practice it is ineffective in higher dimensions 4 [37,27]. The Nelder-Mead algorithm uses expansion, reflection, contraction and shrink steps to update a simplex in R n . A number of variants of the algorithm have been proposed. The variant we will use was analysed by Lagarias et al. [37], and is presented as Algorithm 4. Algorithm 4 does not specify a termination criterion. Termination criteria that could be used include the function values at the simplex points becoming sufficiently close; the simplex points themselves becoming sufficiently close; or an iteration limit being reached.

2.
Sort. Order and relabel the vertices of the simplex such that f (x 0 ) ≥ f (x 1 ) ≥ · · · ≥ f (x n ) and let x 0 be the worst vertex, x 1 the next-worst vertex and x n the best vertex. Set c = 1 n n i=1 x i .

3.
Reflection. Calculate the reflection point, accept reflection, replace x 0 with x r and return to step 2.
f (x r ) accept the expansion point and replace x 0 with x e , otherwise accept the reflection point and replace x 0 with x r . Return to step 2.
, accept the outside contraction point, replace x 0 with x c1 and return to step 2. Else go to step 7.
accept inside contraction, replace x 0 with x c2 and return to step 2. Else go to step 7.

7.
Shrink. For all points other than the best point, replace it with its shrink point, Algorithm 4: Nelder-Mead algorithm (see e.g. [37]) to write down the n + 1 points. To analyse step 2, observe that a complete ordering of the points is never required; the only information about the ordering needed is the worst vertex x 0 , the next-worst vertex x 1 , and the best vertex x n . Knowledge of the identities of these points is sufficient to compute the centroid c, and to carry out all the updates required, including the shrink step. So the first time that step 2 is executed, its complexity is O(n 2 + nT (f )), where the O(n 2 ) comes from computing the centroid. Each time step 2 is executed subsequently, except following a shrink step, the required updates can be made in time O(n).
The complexity of step 2, when executed for the first time or following a shrink step, can be improved using quantum minimum-finding: Thus a quantum algorithm using Theorem 2 can find the worst, next-worst and best vertices with failure probability O(1/k) at each iteration in time O( √ nT (f ) log k) in total. This choice of failure probability is so that, by a union bound, the total probability of failure can be bounded by an arbitrarily small constant. Further, observe that the centroid can be updated in time O(n) following a shrink step, as if c denotes the updated centroid, then c = δc + (1 − δ)x n . This does not give a quantum speedup of step 2 in all cases; the first time that step 2 is executed, if T (f ) = O(n 3/2 ), its complexity is dominated by the O(n 2 ) cost of computing the centroid. There also remains an O(n 2 ) cost for updating the points at each shrink step. (There may be a more efficient way of keeping track of these shrink steps; however, we do not pursue this further here.) Then the overall complexity of the quantum algorithm is O((s + 1)(n 2 + √ nT (f ) log k) + k(n + T (f ))), and using a union bound over the k steps, the algorithm's failure probability is bounded above by an arbitrarily small constant. If T (f ) = Ω(n 3/2 ), this simplifies to O(((s + 1) √ n log k + k)T (f )). Comparing with the classical complexity, we see that the quantum speedup is largest when s is large compared with k.
However, in practice shrink steps appear to be rare; in one set of experiments, only 33 shrink steps were observed in 2.9M iterations [50], and shrink steps never occur when Nelder-Mead is applied to a strictly convex function [37]. If there are no shrink steps and T (f ) = Ω(n 3/2 ), the complexity of the quantum algorithm is O(( √ n log k + k)T (f )), while the complexity of the classical algorithm is O((n + k)T (f )). This is still a quantum speedup if k = o(n); on the other hand, if k = Ω(n), the complexity is dominated by evaluating f once at each iteration, and it is difficult to see how a quantum speedup could be achieved.
To be able to use quantum minimum-finding, we have assumed the ability to construct superpositions of the form 1 √ n+1 n i=0 |i |x i , which enables us to evaluate f in superposition. This is a quantum RAM [24], and quantum RAMs are often assumed to be difficult to construct; however, our requirements are very weak, because we only need the addressing to be performed in time O(n), rather than O(log n), which can be achieved using an explicit quantum circuit.
Finally, we consider the possibility of accelerating calculation of the centroid c using a quantum algorithm. If each component of each vector x i is suitably bounded (e.g. x i ∞ ≤ 1) we could use quantum mean estimation [29,11,41] to estimate each component of c up to accuracy in time O((n/ ) log(n/ )) with failure probability bounded by a small constant, where the log(n/ ) term comes from reducing the failure probability for each component to /n. Classical mean estimation could be used instead with an overhead of an additional O(1/ ) factor. This would give an overall time complexity similar to that derived above, but it is not obvious what the effect of replacing the centroid with an approximate centroid would be on the overall algorithm. For example, it is argued in [18] that random perturbations to the centroid throughout the algorithm can be beneficial.

Stochastic gradient descent
One of the most widely-used, effective and simple methods for finding a local minimum of a function is gradient descent. Given a function f : R n → R and an initial point x ∈ R n , the algorithm moves to the point x = x − η∇f (x), where η > 0. In application areas such as machine learning [8], one often encounters functions f of the form for some "simple" functions f i (x), where N is large. (For example, f i (x) could be the error of a neural network parametrised by x on the i'th item of training data, and we might seek to minimise the average error.) Rather than computing the exact gradient ∇f (x) by summing ∇f i (x) over all N choices for i, it is natural to approximate ∇f by sampling k random indices i 1 , . . . , i k ∈ [N ] with replacement and outputting 1 k (∇f i 1 (x) + · · · + ∇f i k (x)). (The case k = 1 is known as stochastic gradient descent; the sample i 1 , . . . , i k is sometimes known as a mini-batch.) If f satisfies the Lipschitz condition that ∇f i (x) ∞ ≤ 1, to approximate ∇f (x) up to additive error in the ∞ norm with failure probability δ it is sufficient to take k = O( −2 log(n/δ)) by a Chernoff bound argument. Let T (f ) denote an upper bound on the time required to compute f i (x) for all i. If we approximate ∇f i (x) using the finite difference method, then each approximation to ∇f i (x) can be computed in time O(nT (f )), giving a total complexity of O(nT (f ) −2 log(n/δ)).
The use of quantum amplitude estimation [12] would improve the dependence on quadratically. Here we observe that the dependence on n can also be improved quadratically, using a result of Gilyén, Arunachalam and Wiebe [22]. We will impose the restriction (for technical reasons) that the range of each function f i is within [1/10, 9/10], where these numbers could be replaced with any constants between 0 and 1. Given the more typical constraint that f i : R n → [0, 1] (e.g. if the output of f i represents a probability), f i can easily be modified to satisfy this constraint by a simple linear transformation, which does not change arg min x f (x).
The results of [22] use two somewhat nonstandard oracle models which we now define. First we will consider probability access, and define what a probability oracle is.
Essentially, within this model our objective function corresponds precisely to the probability of a certain outcome being observed upon measurement (in particular, the probability of seeing |1 when measuring the final qubit). Indeed, given a classical description of the function g(z), an oracle of this form can be constructed without a significant overhead [13]. The next access model we consider is access via a phase oracle. The authors of [22] showed that a probability oracle is capable of simulating a phase oracle, and vice versa, with only logarithmic overhead: Theorem 3 (Converting between probability and phase oracles [22]). Suppose g : Z → [0, 1] is given by access to a probability oracle U g which makes use of a auxiliary qubits. Then we can simulate anapproximate phase oracle using O(log(1/ )) queries to U g ; the gate complexity is the same up to a factor of O(a). Similarly, suppose g : Z → [δ, 1 − δ] is given by access to a phase oracle O g . Then we can construct an -approximate probability oracle for g using O(log(1/ )/δ) queries to O g . The gate complexity is the same up to a factor of O(log(1/ )(log log(1/ ) + log(1/δ))).
What this shows is that the two access models are more-or-less equivalent in power. Now we have defined probability oracles, we can show that access to probability oracles for the individual f i functions immediately gives such access for f itself. Proof. We start with the superposition 1 √ N N i=1 |i |x |0 , where |x denotes a description of the real vector x in terms of binary, up to some digits of precision, leading to an orthonormal basis. If N is a power of 2, this state can be constructed easily by applying Hadamard gates to each qubit in a register of log 2 N qubits.
If not, the state can be constructed in circuit complexity O(log N ) as follows: attach a register of log 2 N qubits; apply Hadamard gates to produce |i ; compute the function "i ≤ N " into an ancilla qubit using an efficient comparison circuit (e.g. [16]); measure the ancilla qubit; and proceed only if the answer is 1. If not (which occurs with probability at most 1/2), repeat this step. We then apply the controlled operation |i |ψ → |i U f i |ψ . This produces for some sequences of normalised states |ψ Rearranging subsystems, we can write this as for some unnormalised states |ψ 0 , |ψ 1 where as required by the definition of a probability oracle for f . We will use this probability oracle within the framework of the fast quantum algorithm of [22] for computing gradients. This algorithm is applicable to functions that satisfy a certain smoothness condition. Given some analytic function h : The following result shows that if each function f i satisfies the required smoothness condition [22], we have that the overall function f also satisfies the same condition.
Claim 5. Let c be a real constant, and fix some x ∈ R n . Suppose that for all i ∈ [N ] the function f i : R n → R is analytic, and that for every natural number k, and α ∈ [n] k , we have that then we have that f also satisfies the same condition.
Proof. We apply the linearity of ∂ α . Observe that and we are done.
In fact it's not too hard to see that this claim generalises to more-or-less any bound on the partial derivatives. We can now state the result we will need from [22].
Theorem 6 (Gilyén, Arunachalam and Wiebe [22,Theorem 25]). Suppose that g : R n → R is an analytic function such that, for all r ∈ N and α ∈ [n] r , |∂ α g(x)| ≤ c r r r/2 . Assume access to g is given by a phase oracle O g . Then there exists an algorithm that outputs a vector ∇f (x) ∈ R n such that ∇f (x) − ∇f (x) ∞ ≤ with 99% probability, using O( √ n/ ) queries to the oracle and additional time O(n 3/2 / ).
Note that, if the time complexity of evaluating O g is Ω(n), this dominates the overall runtime bound. We can encapsulate the combination of these results in the following theorem.
Theorem 7. Let f be defined as in (1), and assume that each function f i satisfies the conditions required for Theorem 6 and can be computed in time T (f ), for some bound T (f ) such that T (f ) = Ω(n). Then there is a quantum algorithm that outputs ∇f (x) such that ∇f (x) − ∇f (x) ∞ ≤ with 99% probability, in time Proof. Given the ability to compute each f i function in time T (f ), we can produce a phase oracle computing f i in time O(T (f )). By Theorem 3, and using that f i : R n → [1/10, 9/10], we can then obtain an operation approximating a probability oracle for f i up to error in time O(T (f )). By Lemma 4, this gives a probability oracle for f , at additional cost O(log N ). By Theorem 3, we then obtain a phase oracle for f at additional cost poly log (N, 1/ ). This finally allows us to apply Theorem 6 to achieve the stated complexity.
Despite Theorem 7 giving a more efficient quantum algorithm for approximately computing ∇f , it is not clear whether this translates into a more efficient quantum algorithm for stochastic gradient descent, or a quantum speedup of other algorithms making use of ∇f . This is because the algorithm of [22] only outputs an approximate gradient, and one which may not be an unbiased estimate of ∇f . To prove approximate convergence of stochastic gradient descent, it is not essential for the gradient estimates to be unbiased [8], and it is plausible that an approximate estimate of the gradient should lead to an approximate minimiser for f being found. However, the technique used in [8] to show approximate convergence in this scenario requires the 2-norm of the approximate gradient to be close to that of ∇f . The algorithm of [22] provides accuracy in the ∞-norm, which would only give accuracy √ n in the 2-norm. Further, it was shown by Cornelissen [15] that if f is picked from a certain class of smooth functions, approximating ∇f up to 2-norm accuracy requires Ω(n/ ) uses of a phase oracle for f in the worst case, so this is not merely a technical restriction. Nevertheless, it is possible that quantum gradient estimation may be more efficient than stochastic gradient descent in practice.
2. Let S be the set of potentially optimal hyperrectangles.
4. Evaluate hyperrectangle j and decide where to divide it using the following procedure: (a) Let I be the set of dimensions with maximal side length. Let δ be one-third of this maximal side length. Let c be the centre of hyperrectangle j.
(b) Evaluate f at the points c ± δe i for all i ∈ I, where e i is the i'th vector in the standard basis.
(c) Divide the hyperrectangle containing c into thirds along the dimensions i ∈ I, in ascending order of w i = min{f (c + δe i ), f (c + δe i )}.
Let ∆m be the number of new points evaluated. Update m ← m + ∆m, f min ← new best min.
6. t ← t + 1. If t = T , where T is the iteration limit, then stop, if not go to step 2.
We think of K in Definition 3 as a surrogate for the Lipschitz constant of f (which is not assumed to be known in advance). An example of the first couple of steps of dividing [0, 1] 2 into rectangles is shown in Figure 6a. The set of potentially optimal hyperrectangles can be determined in time O(m ), where m ≤ m is the number of distinct interval lengths, using a convex hull technique described in [31] and illustrated in Figure 6b. The conditions (2) and (3) are satisfied by the points that lie on the lower convex hull when f (c j ) is plotted against d j for each hyperrectangle, and we also include the point (0, f min − |f min |). In Figure 6b the red dots represent potentially optimal hyperrectangles whereas the black dots represent hyperrectangles that are not potentially optimal.