Quantum gradient descent and Newton's method for constrained polynomial optimization

Optimization problems in disciplines such as machine learning are commonly solved with iterative methods. Gradient descent algorithms find local minima by moving along the direction of steepest descent while Newton's method takes into account curvature information and thereby often improves convergence. Here, we develop quantum versions of these iterative optimization algorithms and apply them to polynomial optimization with a unit norm constraint. In each step, multiple copies of the current candidate are used to improve the candidate using quantum phase estimation, an adapted quantum principal component analysis scheme, as well as quantum matrix multiplications and inversions. The required operations perform polylogarithmically in the dimension of the solution vector and exponentially in the number of iterations. Therefore, the quantum algorithm can be beneficial for high-dimensional problems where a small number of iterations is sufficient.

phase estimation, an adapted quantum principal component analysis scheme, as well as quantum matrix multiplications and inversions. The required operations perform polylogarithmically in the dimension of the solution vector and exponentially in the number of iterations. Therefore, the quantum algorithm can be beneficial for highdimensional problems where a small number of iterations is sufficient.

Introduction
Optimization plays a vital role in various fields. In machine learning and artificial intelligence, common techniques such as regression, support vectors machines, and neural networks rely on optimization. Often in these cases the objective function to minimize is a least-squares loss or error function f (x) that takes a vector-valued input to a scalar-valued output [1]. For strictly convex functions on a convex set, there exists a unique global minimum, and, in the case of equality-constrained quadratic programming, solving the optimization problem reduces to a matrix inversion problem. In machine learning, one often deals with either convex objective functions beyond quadratic programming, or non-convex objective functions with multiple minima. In these situations, no single-shot solution is known in general and one usually has to resort to iterative search in the landscape defined by the objective function.
One popular approach focusses on gradient descent methods such as the famous backpropagation algorithm in the training of neural networks. Gradient descent finds a local minimum starting from an initial guess by iteratively proceeding along the negative gradient of the objective function. Because it only takes into account the first derivatives of f (x), gradient descent may involve many steps in cases when the problem has an unfortunate landscape ‡. For such objective functions, second-order methods, which model the local curvature and correct the gradient step size, have been shown to perform well [2]. One such method, the so-called Newton's method, multiplies the inverse Hessian to the gradient of the function. By taking into account the curvature information in such a manner, the number of steps required to find the minimum often greatly reduces at the cost of computing and inverting the matrix of the second derivatives of the function with respect to all coordinates. Once the method arrives in the vicinity of a minimum, the algorithm enters a realm of quadratic convergence, where the number of correct bits in the solution doubles with every step [3,4].
As quantum computation becomes more realistic and ventures into the field of machine learning, it is worthwhile to consider in what way optimization algorithms can be translated into a quantum computational framework. Optimization has been considered in various implementation proposals of quantum computing [5,6]. The adiabatic quantum computing paradigm [7] and its famous sibling, quantum annealing, are strategies to find the ground state of a Hamiltonian and can therefore be understood as 'analogue' algorithms for optimization problems. The first commercial implementation of quantum annealing, the D-Wave machine, solves certain quadratic unconstrained optimization problems and has been tested for machine learning applications such as classification [8,9] and sampling for the training of Boltzmann machines [10]. In the gate model of quantum computation, quadratic optimization problems deriving from machine learning tasks such as least-squares regression [11,12] and the least-squares support vector machine [13] were tackled with the quantum matrix inversion technique [14], demonstrating the potential for exponential speedups for such single-shot solutions under certain conditions. Variational methods that use classical optimization while computing an objective function on a quantum computer have become popular methods targeting near-term devices with limited coherence times [15,16].
In this work, we provide quantum algorithms for iterative optimization, specifically the gradient descent and Newton's methods. We thereby extend the quantum machine learning literature by techniques that can be used in non-quadratic convex or even nonconvex optimization problems. The main idea is that at each step we take multiple copies of a quantum state |x (t) to produce multiple copies of another quantum state |x (t+1) by considering the gradient vector and the Hessian matrix of the objective function. Depending on the step size taken, this quantum state improves the objective function.
We consider optimization for the special case of polynomials with a normalization constraint. The class of polynomials we discuss in detail for optimization contains homogeneous polynomials of even order over large-dimensional spaces with a relatively small number of monomials. We also sketch the polynomial optimization with a relatively small number of inhomogeneous terms added. We show how to make the gradient of the objective function and the Hessian matrix operationally available in the quantum computation using techniques such as density matrix exponentiation [17]. The objective function is assumed to be given via oracles providing access to the coefficients of the polynomial and allowing black-box Hamiltonian simulation as described in [18].
Since at each step we consume multiple copies of the current solution to prepare a single copy of the next step, the algorithms scale exponentially in the number of steps performed. While this exponential scaling precludes the use of our algorithms for optimizations that require many iterations, it is acceptable in cases when only a few steps are needed to arrive at a reasonably good solution, especially in the case of the Newton method, or when performing a local search. We note that the computation of gradients on a quantum computer has been investigated before, for a setting in which the coordinates of the input vectors to the function are encoded as binary numbers rather than as the amplitudes of a quantum system [19]. We also note subsequent work on quantum gradient descent in [20] and [21].

Problem statement
Gradient descent is formulated as follows. Let f : R N → R be the objective function one intends to minimize. Given an initial point x (0) ∈ R N , one iteratively updates this point using information on the steepest descent of the objective function in a neighborhood of the current point, where η > 0 is a hyperparameter (called the learning rate in a machine learning context) which may in general be step-dependent. Newton's method extends this strategy by taking into account information about the curvature, i.e. the second derivatives of the objective function, in every step. The iterative update therefore includes the Hessian matrix H of the objective function, with the entries H ij = ∂ 2 f ∂x i ∂x j evaluated at the current point x (t) . These iterative methods are in principle applicable to any sufficiently smooth function. In this work, we restrict ourselves to the optimization of multidimensional homogeneous polynomials of even order and under spherical constraints, where the polynomials only consists of a small number of terms (a property which we refer to as "sparse"). We also discuss the extension to a small number of inhomogeneities.

Sparse, even homogenous polynomials
The homogeneous objective function we seek to minimize here is a polynomial of order 2p defined over x ∈ R N , with the N 2p coefficients A i 1 ...i 2p ∈ R and x = (x 1 , . . . , x N ) T . As we will see, the number of non-zero coefficients, or "sparsity", will appear in the final resource analysis of the quantum algorithm. By mapping the inputs to a higher dimensional space constructed from p tensor products of each vector, we can write this function as an algebraic form, The coefficients become entries of a p-dimensional tensor A ∈ R N ×N ⊗ · · · ⊗ R N ×N that can be written as a N p × N p dimensional matrix. Let Λ A > 0 be given such that the matrix norm of A is A ≤ Λ A . Equations (3) and (4) describe a homogeneous polynomial of even order. For the more familiar case of p = 1, the objective function reduces to f (x) = x T Ax and a common quadratic optimization problem. For p = 2 and N = 2, the two-dimensional input x = (x 1 , x 2 ) T gets projected to a vector containing the polynomial terms up to second order It will be helpful to formally decompose A into a sum of tensor products on each of the p spaces that make up the 'larger space', where each A α i is a N × N matrix for i = 1, . . . , p and K is the number of terms in the sum needed to specify A. Since the quadratic form remains the same by x/2 we can assume without loss of generality that the A α i are symmetric and also that A is symmetric. Note that the representation of Eq. (5) is useful to simplify the computation of the gradient and the Hessian matrix of f (x). However, our quantum algorithm does not explicitly require this tensor decomposition of A, which may be hard to compute in general. We only require access to the matrix elements of A for the use in standard Hamiltonian simulation methods [18] (see the data input discussion below). In this work, we consider general but sparse matrices A, where the sparsity requirement arises from the quantum simulation methods. A sparse matrix A represents a polynomial with a relatively small number of monomials. For the gradient descent and Newton's methods, we need to compute expressions for the gradient and the Hessian matrix of the objective function. In the tensor formulation of Eq. (5), the gradient of the objective function at point x can be written as which can be interpreted as an operator D ≡ D(x) which is a sum of matrices A α j with x-dependent coefficients i =j x T A α i x, applied to a single vector x. Note that of course the ordering of the scalar factors in the product p i=1 i =j x T A α i x is unimportant. In addition, for the matrix norm of the operator D, note that with |x| ≤ 1 we have In a similar fashion, the Hessian matrix at the same point reads, defining the operator H 1 . For the matrix norm of the operator H, note that with |x| ≤ 1 we have Thus we can define a bound for the norm of H with Λ H > 0 for which H ≤ Λ H and Λ H = O (p 2 Λ A ). The core of the quantum algorithms for gradient descent and Newton's method will be to implement these matrices as quantum operators acting on the current state and successively shifting the current state towards the desired minimum.

Spherical constraints
Since in this work we represent vectors x as quantum states, the quantum algorithm naturally produces normalized vectors with x T x = 1, thereby implementing a constraint known in the optimization literature as a spherical constraint. Applications of such optimization problems appear in image and signal processing, biomedical engineering, speech recognition and quantum mechanics [22]. In addition, we include further standard assumptions [4]. We assume an initial guess x 0 reasonably close to the constraint local minimum. We assume that the Hessian of the polynomial in Eq. (7) in the vicinity of the solution is positive semidefinite. We also include an assumption about the smoothness/continuity of the polynomial, which is straightforwardly satisfied as long as the polynomial does not diverge unreasonably in  Figure 3. The quantum algorithms can be adapted to optimize polynomials that include inhomogeneities. Here, we consider a quadratic form, where the parameters including the inhomogeneity are the same as in Figure 1. The panels are analogous to Figure 2. Newton's method arrives at the solution faster than gradient descent.
the optimization region under consideration. These assumptions guarantee saddle-point free optimization and monotonic convergence to the minimum. The problem we attempt to solve is therefore defined as follows: Problem 1. Let f (x) be a homogeneous polynomial of even order 2p ∈ Z > 0 as in Eq. (4). Let the matrix A defining f (x) be symmetric and have sparsity s A , defined as the number of non-zero matrix elements in each row and column. In addition, let Λ A > 0 be given such that the matrix norm of A is A ≤ Λ A , and without loss of generality Λ A ≥ 1. Starting with an initial guess x 0 , solve the problem subject to the constraint x T x = 1 by finding a local minimum x * . We assume that the Hessian is positive semidefinite at the solution, i.e., H(f (x * )) ≥ 0, and that the initial guess is sufficiently close to the solution. In addition, the polynomial is smooth and we assume that it is Lipschitz continuous in the region of optimization.
A well-known adaptation of gradient descent for such constrained problems is projected gradient descent [23,24], where after each iteration the current solution is projected onto the feasible set (here corresponding to a renormalization to x T x = 1). It turns out that our quantum algorithms naturally follow this procedure and the renormalization is realized in each update of the quantum state. We therefore obtain the general convergence properties of the projected gradient descent and projected Newton's methods. Note that for the simple case of the quadratic function x T Ax, i.e., for p = 1, gradient descent with a unit norm constraint finds one of the eigenstates of A.
Although the choice of the objective function allows for an elegant implementation of the operators D and H −1 by means of quantum information processing, it is in some cases not suited for Newton's method. For example, if p = K = 1 the objective function reduces to a quadratic form and H −1 ∇f (x) = D −1 Dx = x. The direction of search is consequently perpendicular to the unit sphere and Newton's method does not update the initial guess at all (see Figure 2). For this reason and, more generally, to increase the class of functions that can be optimized, it is interesting to include inhomogeneous terms to the polynomial. For example, a simple linear inhomogeneity can be added by considering the polynomial function f (x) + c T x, where f (x) is the homogeneous part as before and c is a vector specifying the inhomogeneous part (see Figure 3). We will sketch a method to include such inhomogeneities in Section 5.

Data input model
To implement a quantum version of the gradient descent algorithm we assume without loss of generality that the dimension N of the vector x is N = 2 n where n is an integer. A vector not satisfying this condition can always be padded to 2 n as long as the additional coordinates do not play a role for the optimization through matrix A. As mentioned, we consider the case of a normalization constraint x T x = 1. Following previous quantum algorithms [14,13,17] we represent the entries of x = (x 1 , . . . , x N ) T as the amplitudes of a n-qubit quantum state |x = N j=1 x j |j , where |j is the j'th computational basis state.
In this work, we assume the following oracles for the data input. First, we define the oracle for providing copies of the initial state. Let N ) T be the initial point we choose as a candidate for finding the minimum x * with i |x Oracle 1 (Initial state preparation oracle). There exists an oracle that performs the operation |0 → |x (0) on n qubits.
We assume that the initial quantum state |x (0) corresponding to x (0) via amplitude encoding can be prepared efficiently, either as an output of a quantum algorithm or by preparing the state from quantum random access memory [25,26,27]. For example, efficient algorithms exist for preparing states corresponding to integrable [28] and bounded [29] distributions. In addition, we assume an oracle for the matrix A [18].
We assume that the error χ is much smaller than other errors and does not affect the analysis [18]. Note that the indices j, k can isomorphically be described by numbers i 1 , · · · , i 2p = 1, · · · , N . Thus, the oracle can be equivalently given as |i 1 , · · · , i 2p |0 → |i 1 , · · · , i 2p |A i 1 ,···,i 2p . To take advantage of sparsity, we also assume the following oracle which allows us to choose the non-zero matrix elements.
Oracle 3 (Sparse input oracle). Let j = 1, · · · , N p and l = 1, · · · , s A . There exists an oracle that performs the operation |j, l → |j, g A (j, l) on 2pn qubits where the efficiently computable function g A (j, l) gives the column index of the l-th nonzero element of row j of matrix A.
These Oracles (2) and (3) allow for an efficient simulation of e −i At via the methods in [18] and are often a standard assumption in the literature. In the remainder of this section, we will first describe how to quantumly simulate the gradient operator as e −iDt and how to implement a quantum state |∇f (x) representing the gradient ∇f (x). We then describe how to obtain the update of the current candidate |x (t) at the t-th step of the gradient descent method via the current-step gradient quantum state. Finally, we discuss the run time of multiple steps.

Computing the gradient
In this section, we present the quantum simulation of the gradient operator D and the preparation of the gradient state D |x , which may be of independent interest. For the definition of D, see Eq. (6). We omit the index t indicating the current step for readability. For the diagonalizable matrices discussed here, it sufficies to use the operator norm · = max j |λ j (·)|, where λ j are the eigenvalues of the matrix under consideration.
Result 1 (Quantum simulation of the gradient operator). Given the exact quantum states |x as well as Oracles (2) and (3). Let the gradient operator corresponding to the state |x be D Then there exists a quantum algorithm that simulates e −i Dτ to accuracy using O queries to the oracles for A, andÕ First, note that the classical scalar coefficients x T A α i x that occur as weighing factors in Equations (6) and (7) can be written as the expectation values of operators A α i and quantum states |x , or x| A α i |x = tr{A α i ρ} where ρ = |x x| is the corresponding density matrix. With this relation, we show that the gradient operator can be implemented as a Hamiltonian in a quantum algorithm. The gradient operator D can be represented as where ρ ⊗(p−1) = ρ ⊗ · · · ⊗ ρ is the joint quantum state of p − 1 copies of ρ. This operator D acts on another copy of ρ. The auxiliary operator M D is independent of ρ and given by More informally stated, the operator M D can be represented from the A α j , j = 1 . . . p, of Eq. (5) in such a way that for each term in the sum the expectation values of the first p − 1 subsystems correspond to the desired weighing factor, and the last subsystem remains as the operator acting on another copy of ρ. Note that the order of the factors in the product i only changes the matrix M D but not the operator D. The matrix M D is a sum over p matrices that have the same sparsity as A, hence its sparsity is O (ps A ). The matrix exponential e −i M D τ can be simulated efficiently on a quantum computer with access to the above oracles, as shown in Appendix A, Lemma 1. Simulating a time τ to accuracy , we require O (p 3 Λ A s A τ 2 / ) queries to the oracles for A andÕ (p 4 Λ A s A τ 2 log N/ ) quantum gates.
With this mapping from D to M D and the exponentiation of M D , we can exponentiate D. The idea is to implement the matrix exponentiation e −i D∆t adapting the quantum principal component analysis (QPCA) procedure outlined in [17]. Since D depends on the current state |x , we cannot simply use the oracular exponentiation methods of [18]. Instead we exponentiate the sparse matrix M D given in Eq. (9). For a short time ∆t, use multiple copies of ρ = |x x| and perform a matrix exponentiation of M D . In the reduced space of the last copy of ρ, we observe the operation where Here, the error term E contains contributions from the erroneous simulation of e −i M D ∆t given by E M D and the intrinsic error of the sample-based Hamiltonian simulation given by E samp . We use the operator norm · to bound the error terms.
Regarding the error E M D , we can choose E M D = O (p 2 ∆t 2 ), using O (pΛ A s A ) queries to the oracle for A and usingÕ (p 2 Λ A s A log N ) gates, see Appendix A, Lemma 1. Regarding the error E samp , its size is given by E samp = O (Λ 2 D ∆t 2 ) similar to [17]. The total error for a small time step ∆t can be upper bounded by . For a total time τ = n∆t and given accuracy , we have steps are required and for each step This results assumes perfect states |x . However, within the gradient descent routine, the states pick up errors due to the imprecise simulation and phase estimation methods. We discuss a generic result regarding the simulation with such imperfect states using a particular error model. The error model assumes identically, independently distributed (i.i.d.) random error vectors. Such random vectors result in scalar-valued random variables when used to form scalar-valued quantities such as x| A α j v of the matrices A α j . One can apply the central limit theorem for a sum of such random variables. Such i.i.d. models are generally a natural start for an error discussion. The context of the gradient descent method begs the question how applicable such a model is because it is reasonable that the errors do not arise in an independent fashion but rather in a somewhat correlated sense as each step depends on the previous steps. We note that central limit theorems have been shown also for correlated random variables [30]. In addition, we note that certain errors are tolerable. For example, in stochastic gradient descent, a gradient is computed from randomly sampling the training set. This leads to surprisingly good learning behavior for neural networks [31,32]. More generally it has even been shown that the stochasticity in learning problems leads to better generalization ability through an indirect regularization of the problem. In many practical situations, sufficiently close to the minimum of a convex optimization problem, a small error in the gradient can nevertheless lead to convergence if the step size is chosen appropriately. A more general and quantitative analysis of correlated errors can give further insight into the algorithms behavior and is left for future work. Under this error model the simulation performance decreases as 1/ → 1/ 2 .
Result 2 (Quantum simulation of the gradient operator with i.i.d. imperfect states). Assume Oracles (2) and (3). Let the quantum states |x be given with errors according to Assumption 1. Let the gradient operator corresponding to the state |x be D = Then there exists a quantum algorithm that simulates e −i Dτ to accuracy using queries to the oracles for A andÕ The modification here is that the error terms for a simulation of time ∆t becomes in comparison to Eq. (11). The additional error term is E |x . An error in |x of size < 1 directly translates into the error E |x = O (p∆t) linear in ∆t. However, the norm E |x omits the sign of the error term. For multiple steps of size ∆t the error terms add up with their respective signs. Under the assumed error model, such addition of errors leads to the central limit theorem and the error averages to certain extend. We refer to Appendix B, Lemma 3, for further discussion. The next result shows how to prepare a quantum state D |x .
Result 3 (Gradient state preparation with imperfect states). Given Oracles (2) and (3). Let quantum states |x be given according to the error model of Assumption 1 with error bound O ( D ) 1. Let the gradient operator D corresponding to the state |x have condition number κ D . Then there exists a quantum algorithm that applies the gradient operator to the state |x . That is, a state ∝ D |x can be prepared to accuracy O ( D ) with the required resources given by setting → D and τ → κ D / D in Result 2 and O( log(2 + 1/2 D ) ) additional ancilla qubits. We require O (κ 2 D ) repetitions to success with high probability.
We can implement the multiplication D |x = |∇f (x) via phase estimation. In order to perform phase estimation and extract the eigenvalues of D, we need to implement a series of powers of the exponentials in Equation (10) controlled on a time register. The infinitesimal application of the M D operator can be modified as in Refs. [17,33], as |0 0| ⊗ I + |1 1| ⊗ e −i M D ∆t and applied to a state |q q| ⊗ ρ where |q is an arbitrary single control qubit state. For the phase estimation, use a multi-qubit register with O( log(2+1/2 D ) ) control qubits forming an eigenvalue register to resolve eigenvalues to accuracy D . Then apply S ph l=1 |l∆t l∆t|(e −i D∆t ) l for S ph steps. If S ph is chosen appropriately, see below, this conditioned application of the matrix exponentials allows to prepare a quantum state proportional to Here, |x = j β j |u j (D) is the original state |x written in the eigenbasis {|u(D) j } of D, and |λ(D) j is the additional register encoding the corresponding eigenvalue λ(D) j in binary representation with accuracy D . As in [14,11], we conditionally rotate an extra ancilla qubit, uncompute the phase estimation, and measure the ancilla qubit, which results in This performs the matrix multiplication with D. The accuracy of this matrix multiplication is D if the total time of the phase estimation is taken to be S ph ∆t = O(κ D / D ) [11]. As the main algorithmic step is controlled Hamiltonian simulation, we obtain the resource requirements from Result 2 by replacing → D and τ → κ D / D . In addition, the ancilla measurement success probability is 1/κ 2 D [11]. Thus, we require O (κ 2 D ) repetitions for a high probability of success, which completes Result 3.

Single gradient descent step
In this section, we establish the gradient descent procedure. To decouple the simulation method with the main result, we use the following assumption for the simulation method of D (t) .

Assumption 2.
There exists a quantum simulation method for the operator D (t) with D (t) ≤ Λ D . We can simulate the controlled operator e −i |1 1|⊗D (t) τ to accuracy D using O (poly(p, s A , Λ D , τ, 1/ D )) copies of the state |x (t) and queries to the oracles for A. In addition, the method requiresÕ (poly(p, s A , Λ D ) log N ) basic quantum gates for each copy.
Given the simulation method, we can perform a single gradient descent step.
Result 4 (Single gradient descent step). Given quantum states |x (t) encoding the current solution at time step t to Problem 1 in amplitude encoding to accuracy (t) > 0, as well as ancilla qubits. Let the gradient operator corresponding to the state |x (t) be given by D (t) with D (t) ≤ Λ D . Assume a quantum algorithm for the simulation of D (t) according to Assumption 2 exists. Let a step size be given by 0 < η (t) < 1/(2Λ D ). Then there exists a quantum algorithm using Oracles (2) and (3) such that a single step of the gradient descent method of step size η (t) to prepare an improved normalized D + (t) ) can be performed. This quantum algorithm requires O (poly(p, s A , Λ D , 1/ D )) copies of the state |x (t) and queries to the oracle of A, andÕ (poly(p, s A , Λ D , 1/ D ) log N ) basic quantum gates. The success probability of this algorithm is lower-bounded by 1/16.
We shorthand D := D (t) in the following. Starting with the current solution, prepare the state where θ is an external parameter determining the gradient step size, the first register contains the state |x (t) , the second register contains a single ancilla qubit used for the gradient step (subscript g), and the final register contains ancilla qubits for the gradient matrix multiplication. The operator D can be multiplied to |x (t) conditioned on the first ancilla being in state |1 g . Let the eigenstates of D be given by |u j (D) and the eigenvalues by λ j (D). After the conditional phase estimation, we obtain: where β j = u j (D) |x andλ j (D) is an approximation to λ j (D) with accuracy D Now use another ancilla (subscript d.) For the |0 g path this ancilla is initialized in the |1 d state and for the |1 g path perform a conditional rotation of this ancilla. Uncomputing the eigenvalue register arrives at the state Chose a constant ξ D such that 1/Λ D > ξ D ≥ η, such that the rotation is well defined, where Λ D ≥ max j |λ j (D)|. Now perform a joint measurement of the final two ancillas. The first of these ancillas is required for the vector addition of the current solution and the state related to the first derivative, the second for the matrix multiplication. The first ancilla is measured in the basis |yes = 1 √ 2 (|0 g + i |1 g ) and |no = 1 √ 2 (i |0 g + |1 g ). The second ancilla is measured in the state |1 d . Thus, if |yes |1 d is measured, we arrive at (omitting normalization) Choose θ such that to obtain the state with Here, x (t) |D|x (t) refers to the inner product of the current solution |x (t) with the gradient at this point and where ϕ D is the angle between |x (t) and D |x (t) . The success probability of the measurement of |yes |1 is given by For sufficiently bounded step size, this success probability can be lower bounded. The angle ϕ D = 0 achieves the lower bound With 0 < η ≤ 1/(2Λ D ) by assumption and ∆ D ≤ Λ D by definition, we have C (t+1) D 2 ≥ 1/4. In addition, 1 + η 2 /ξ 2 D ≤ 2 because η ≤ ξ D by the choice of ξ D . Thus P yes,1 > 1/16. Thus, the upper bound for the number of repetitions needed, which is O (1/P yes,1 ), is at most O (1) in this case.
The quantum state |x (t+1) is an approximation to the classical vector x (t+1) , which would be the result of a classical projected gradient descent update departing from a normalized vector x (t) . The error of the updated step is bounded by the error of the previous step (t) plus the error of the conditioned matrix multiplication, i.e. O (t) + η (t) D . Given the assumptions stated above, this state can be prepared via phase estimation, an ancilla rotation and measurement. The measurement and postselection normalizes the quantum state at each step. The resource requirements are obtained by setting τ → O (1/ D ) from phase estimation into the simulation method Assumption 2. Taking a step size according to the stated assumption, the probability of success and thus the number of repetitions to success are constants.
Result 4 uses generically the Assumption 2 for the simulation of D. We now provide a concrete estimate of the resources needed to implement the single-step algorithm using the error model Assumption 1 and our sample-based simulation method Lemma 3.  (1) |x (1) |x (1) |x (2) |x (2) garbage correct iterations Figure 4. Both the quantum gradient descent and Newton's method 'consume' copies in every step and have only a probability of less than one to produce the correct updated state. This limits their application to local searches with a small amount of iterations and is in principle an underlying feature of iterative quantum routines with probabilistic outcomes.
The resource requirements for this result are given from Lemma 3 for the erroneous sample-based Hamiltonian simulation analyzed in Appendix B. The extension to the controlled simulation as required by phase estimation is done in an overhead that factors in as a constant into the resource requirements [34].
gives the stated Result 5.

Multiple steps
We have presented a single step of the quantum version of gradient descent for polynomial optimization. We now estimate the resource requirements for multiple steps. It is reasonable to bound the space explored with the quantum gradient descent algorithm by a constant, as all quantum states live on the surface of a unit sphere.
Result 6 (Multiple gradient steps). Assume the setting of optimization of polynomials given in Problem 1. Let the task be to use the quantum gradient descent method for T ≥ 0 steps to prepare a solution |x (T ) to final accuracy δ > 0, with step-size schedule η (t) > 0 and gradient operator norms Λ (t) D > 0. Assume that the space explored with the algorithm is bounded by a constant, i.e. T max t η (t) = O (1). Assume a quantum algorithm for the simulation of D (t) according to Assumption 2 exists at every step t. There exists a quantum algorithm for this task that requires copies of the initial state |x (0) , where Λ = max t Λ (t) D . The gate requirement is O poly(p, s A , Λ, 1/δ) T log N quantum gates. In addition, if the quantum states |x (t) at every step satisfy the error model given in Assumption 1 and a simulation method for the operator D (t) is provided according to Lemma 3, then there exists a quantum algorithm that requires copies of the initial state |x (0) andÕ log N quantum gates.
When performing multiple steps, t = 0, . . . , T , the step size parameter η is usually decreased as one gets closer to the target, i.e. η → η (t) . Let δ > 0 be the final desired accuracy after T steps. Each step incurs the error (t) = (t−1) + η (t) (t) D , i.e., the error from the previous step and the gradient error. Hence the accumulated error At step T this error shall be (T ) ≤ δ. Assume for the discussion (0) = 0. To achieve this final error δ, choose the desired error of each gradient multiplication (t) D = δ/T η (t) . Using this (t) D , we have for the accumulated error (t) = tδ/T ≤ δ for t ≤ T . For a single gradient descent step according to Result 4, the number of copies required is then, using D = δ/T η, where we have used the assumption T η ≤ T max t η (t) = O (1) and Λ = max t Λ copies of the initial state |x (0) . Each copy requiresÕ (poly(p, s A , Λ D ) log N ) gates thus the overall gate requirement isÕ poly(p, s A , Λ, 1/δ) T log N quantum gates. Now, take the Assumption 1 and the simulation method provided by Lemma 3. For a single gradient descent step according to Result 5, the number of copies required is then For T iterations of the gradient descent method, we need at most, copies of the initial state |x (0) . The gate requirement isÕ log N . Thus, for multiple steps a number of copies that is exponentiated by the number of steps is required. While this upper bound potentially can be improved significantly, it is obvious that any exponential growth of the resources in both space and run time with the number of steps is prohibitive. A possible solution is to look at extensions of the gradient descent method which exhibit faster convergence to the minimum in only a few steps. One such option is Newton's method, which requires inverting the Hessian matrix of f (x) at point x, an operation that can become expensive on a classical computer while being well suited for a quantum computer.

Quantum Newton's method
A quantum algorithm for Newton's method follows the quantum gradient descent scheme, but in addition to the conditional implementation of D |x = |∇f (x) , one also applies an operator H −1 to |∇f (x) which represents the inverse Hessian matrix. Our aim (classically and quantumly) here is to invert the well-conditioned subspace of H and not take into account zero eigenvalues or eigenvalues that are extremely (exponentially) small. Thus in this work, by H −1 we denote the inverse on this wellconditioned subspace. More formally, we define the well-conditioned subspace of H such that Λ H −1 := H −1 is a constant. Without loss of generality, we also assume that H = O (1) [14,11].
First, we use an assumption about the simulation of the Hessian.
Assumption 3. There exists a quantum simulation method for the operator H (t) with H ≤ Λ H . We can simulate the controlled operator e −i |1 1|⊗H (t) τ to accuracy H using O (poly(p, s A , Λ H , τ, 1/ H )) copies of the state |x (t) and queries to the oracles for A.
In addition, the method requiresÕ (poly(p, s A , Λ H ) log N ) basic quantum gates for each copy.
A Newton step can be performed, assuming this simulation method for the Hessian and the gradient simulation as before.
Result 7 (Single Newton step). Given quantum states |x (t) encoding the current solution at time step t to Problem 1 in amplitude encoding to accuracy (t) > 0, as well as ancilla qubits. Let the gradient operator corresponding to the state |x (t) be given by D (t) and the Hessian matrix corresponding to the state |x (t) be given by H (t) . Assume quantum algorithms for the simulation of D (t) and H (t) according to Assumptions 2 and 3 exist. Let D (t) ≤ Λ D , := H (t) ≤ Λ H , and (H (t) ) −1 ≤ Λ H −1 . Let a step size be given by 0 < η (t) < 1/(2 max{Λ D , Λ H −1 , Λ D Λ H −1 }). Additionally, let nwt > 0. Then there exists a quantum algorithm using Oracles (2) and (3) such that a single step of the Newton's method of step size η (t) to prepare an improved normalized quantum state nwt + (t) ) can be performed. This quantum algorithm requires O (poly(p, s A , Λ, 1/ nwt )) copies of the state |x (t) and queries to the oracle of A, andÕ (poly(p, s A , Λ, 1/ nwt ) log N ) basic quantum gates, where Λ := max{Λ D , Λ H }. The success probability of this algorithm is lowerbounded by 1/16.
The quantum Newton step continues as follows, similar to the steps for the gradient as above. Starting with the current solution, prepare the state where the final registers contain ancilla qubits for the gradient matrix multiplication and the Hessian matrix inversion. With the eigenstates |u j (D) and the eigenvalues by λ j (D) of D, the gradient operator phase estimation obtains: where β j = u j (D) |x . Here,λ j (D) is an approximation to λ j (D) with accuracy nwt . With the eigenstates of H given by |u j (H) and the eigenvalues by λ j (H), the Hessian operator phase estimation conditioned on the ancilla being |1 g obtains Here, β jj = u j (H)| u j (D) andλ j (H) is an approximation to λ j (H) with accuracy nwt . Perform the conditional rotation of an ancilla (subscript d) for the derivative operator with ξ D such that 1/Λ D > ξ D ≥ η as before. This ancilla is set to |1 d for the |0 g path. In addition, perform a conditional rotation of another ancilla (subscript h) for the eigenvalues in the well-conditioned subspace of H [14] Here, ξ H is chosen such that 1/Λ H −1 > ξ H ≥ η, from ξ H λ j (H) ≤ 1 for all j . The next step uncomputes the eigenvalue registers by running the phase estimations in reverse. A combined measurement of the ancillas in the state |yes |1 d |1 h arrives at Similar to before, choosing θ such that results in with a normalization factor This corresponds to the expressions of the gradient descent method, see Eq. (20), with the only difference that instead of the gradient, Newton's direction is used to update the previous state. The probability of success of the |yes |1 d |1 h measurement is given by With the same argument as in Section 3.3 we can bound the number of repetitions needed due to the non-deterministic outcome by the eigenvalues of the operators. One and ϕ H is the angle between |x (t) and H −1 D |x (t) . For a small step size, this success probability can be lower bounded. The angle ϕ H = 0 achieves the lower bound With 0 < η ≤ 1/(2Λ D Λ H −1 ) by assumption and ∆ H ≤ Λ D Λ H −1 by definition, we have C (t+1) H 2 ≥ 1/4. In addition, 1 + η 2 /(ξ 2 D ξ 2 H ) ≤ 2 because η ≤ min{ξ D , ξ H } by the choice of ξ D and ξ H . Thus P nwt yes,1,1 > 1/16. Thus, the upper bound for the number of repetitions needed, which is O 1/P nwt yes,1,1 , is at most O (1) in this case. As before, we can concretely specify the resources via the sample-based simulation method of the gradient operator and the Hessian.
with H 1 and D as above. Let the norm of the Hessian H be Λ H := H , which also bounds the norm of H 1 . To obtain the eigenvalues of H via phase estimation, we exponentiate H via exponentiating the matrices H 1 and D sequentially using the standard Lie product formula [35] e i H∆t = e i H 1 ∆t e i D∆t + O Λ 2 ∆t 2 .
To implement the individual exponentiations themselves we use a similar trick as before.
We associate a simulatable auxiliary operator M H 1 with H 1 and, as before, the operator M D with D. For the part H 1 , the corresponding operator M H 1 is given by Here, S is the swap matrix between the last two registers. Let σ be an arbitrary state on which the matrix exponential e i H 1 ∆t shall be applied on and ρ = |x x| be the current state. The relationship between the operator H 1 and M H 1 is given by  (2) and (3), see Appendix A, Lemma 2. We can use multiple copies of the current state ρ = |x x| to perform The error for a small time step arises from the sample-based simulation method, the Lie product formula, and from errors from the current solution |x . As shown in Appendix B, Lemma 4, we require O p 9 Λ 2 H 4 nwt copies of the current state to perform phase estimation of H to accuracy nwt .
For the Newton's method the performance is determined from computing the gradient, inverting the Hessian, and subsequent vector addition as before. If the stepsize is chosen appropriately, the post-selection for the gradient, Hessian, and taking the Newton step succeeds with constant probability. The phase estimation for the gradient to accuracy nwt requires O copies. Thus, the number of copies for a single Newton step of step size η < 1/(2 max{Λ D , The two phase estimation requirements contribute additively and we use Λ = max{Λ D , Λ H }. The required number queries to the oracle for A is given by O (p 8 Λ 3 s A / 4 nwt ) and the number of elementary operations is given bỹ O (p 10 Λ 3 s A log N/ 4 nwt ), adding the requirements for gradient and Hessian and using the maximized Λ. The improved solution is prepared to accuracy O(η nwt + ), as the current solution is given to accuracy .
Similar to the discussion on multiple steps for gradient descent above, for T iterations of Newton's method to final accuracy δ, we need at most copies of the initial state |x (0) , where we choose Λ D and Λ H as their respective maximum for all time steps t. The gate complexity isÕ This means that the required number of copies of the initial state depends exponentially on the number of steps T . However, recall that in Newton's method in the vicinity of an optimal solution x * the accuracy ∆ := |x * − x (T ) | of the estimate often improves quadratically with the number of iterations, ∆ ∝ O(exp(−T 2 )). This convergence is for example discussed for unconstrained convex problems in [3]. Theorem 3.5 in Ref. [4] shows that if the function is twice differentiable, Lipschitz continuous in the neighborhood of a solution, and the Hessian positive definite at the solution then the convergence of Newton's method is quadratic. See also second-order sufficient conditions in Theorem 2.4 therein. For projected methods, the convergence properties often translate under similar conditions. For example for optimizing inequality constraints via Newton's method one obtains a quadratic convergence [36]. For proximal Newton methods, which are generalized versions of projected Newton methods, one obtains quadratic convergence under the assumption of strong convexity around the optimum and Lipshitz continuity of the Hessian [37]. Thus in some cases, even though the quantum algorithm requires a number of initial copies that grows exponentially in the number of iterations required, the accuracy of the approximate solution yielded by the algorithm can improve even faster in the number of iterations.

Inhomogeneous polynomials
Our methods can be extended to polynomials that are of odd order and also inhomogeneous. An example of a homogeneous polynomial of odd order is x 5 and an example of an inhomogeneous polynomial is x 5 1 x 2 2 + x 2 1 x 2 2 . A detailed discussion of such polynomials will be left for future work. We provide a discussion for a subset of polynomials we can solve in a similar fashion to the homogeneous, even polynomials in Problem 1. The problem can be posed as: where f hom,even (x) is as in Problem 1. Let the inhomogeneous part of the objective function be given by where the vector c j and the symmetric matrices B ij define the polynomial. Given an initial guess x 0 , solve the problem subject to the constraint x T x = 1 by finding a local minimum x * . Let the assumptions of Problem 1 also apply to the problem defined by f (x).
The term f inhom (x) allows us to represent a class of monomials of uneven degree ≤ 2p − 1. This class is essentially a sum of homogeneous even polynomials x T B ij x , with the term c T j x adding inhomogeneities. Thus, with f hom,even (x) from above, we can represent combinations of homogeneous even polynomials with additional inhomogeneous terms. In practice, the efficient sparse Hamiltonian simulation methods impose restrictions on the number of inhomogeneous terms that can be efficiently optimized. Sparsity of the matrices B ij implies a relatively small number of inhomogeneous monomials.
The optimization of functions containing terms of the form f inhom (x) can be performed via additional simulation terms and vector additions. In order to perform the inhomogeneous updates we require the following ingredients: • We analytically compute the respective gradient and Hessian of the function f inhom (x) similar to the homogeneous part.
• We simulate the time evolution under the respective gradient and Hessian of the function f inhom (x), using similar quantum state exponentiation methods as described in the homogeneous setting. We have to use additional subroutines to simulate terms that contain inner products of the form x| c j , and outer products such as |x c j | + |c j x|.
• We perform additional vector additions related to the states c j . For example the function f hom,even (x) + c T x requires one additional vector addition. Before adding the vectors we have to conditionally apply the gradient and Hessian operators to the x and c j respectively. This requires in each iteration a number of copies of the c j and x. As we require hence in every state another copy of the states this adds another tree of states to the resource requirements, since we need to build these in parallel to the main algorithm. However, this does not change the overall scaling of the runtime of the whole algorithm.
• We perform similar to the homogeneous step matrix-vector multiplications and matrix inversions via a conditional rotations on the eigenvalue registers and postselection. This will result in different success probabilities for the gradient and the Hessian operator than presented above.
• Finally we perform a measurement in the yes/no-basis as before and perform the vector addition of the current solution and the step update.
All of these steps will add a computational overhead due to additional matrix multiplications and vector additions. This computational overhead will be discussed in a future work. Similarly to the homogeneous case the number of computational steps scales exponentially with the number of iterations T and logarithmically in the dimension of the solution vector.

Discussion and conclusion
The present work has considered iterative polynomial optimization in a quantum computing framework. We have developed quantum algorithms for the optimization of a class of polynomials under spherical constraints, for which we can find expressions for the gradient and the Hessian matrix. The class of polynomials we can optimize is constrained by sparsity conditions of Hamiltonian simulation methods used here. Beyond polynomials, one can envision a setting where copies of quantum states representing the current solution are consumed for evaluating the first derivative of the objective function. If we can implement the operation |x ⊗ · · · ⊗ |x |0 → |ψ |∇f (x) , where |ψ is an arbitrary "garbage" state, we can use the same basic gradient descent steps as discussed in this work with a similar performance as presented here for polynomial optimization.
In reference [38], Childs et al. presented an exponential improvement of the matrix inversion error dependence, 1/ → log 1/ by using approximation polynomials instead of phase estimation. Further exponentially-precise Hamiltonian simulation methods were shown in [39,40], which may find application in the gradient descent problem. In addition, variable-time amplitude amplification can lead to further quadratic speedups [41,42,38]. The connection of the present work to the semi-definite programming quantum solver in [43,44] is another potential avenue of study.
Our optimization algorithms scale exponentially in the number of steps performed. Such a performance is in general expected since the problem we attempt to solve is QMA-hard, since we can reduce certain k-local Hamiltonian decision problem to it, see e.g. [45] for a definition. However, we envision a few scenarios where the algorithm can nevertheless be useful. One scenario is when the size of the vectors dominates the dependence on other parameters such as condition number and error raised to the power of the number of steps. In this case, classical computation is prohibitively expensive, while our quantum method scales logarithmically in the size of the vectors and could allow optimization. Often, a constant number of steps can yield a significant improvement on the initial guess, even without finding a local minimum, whereas the problem would be intractable on a classical computer. Another case where our quantum algorithm yields potential speed ups is the application of Newton's method to (locally) convex problems. In such problems, the number of iterations of Newton's method required to find a highly accurate solution is only weakly dependent on the dimension of the system [3], and often the minimum can be found in around 5-20 steps. Yet the standard Newton's method is also well known to fail in high-dimensional spaces due to the prevalence of saddle-points, see for example [2]. While this issue is classically not trivial to solve, our quantum method allows for an easy extension to the saddle-free Newton's method [46]. This is done by simply replacing the inverse of the eigenvalues with the inverse of the absolute values of the eigenvalues. Thereby the sign of the eigenvalues around a saddle-point is not changed and hence the algorithm takes steps in the correct direction. Therefore, our quantum algorithm is applicable to a wider range of problems with only slight adaptions.
In summary, the optimization algorithms presented here yield a performance O(polylog(N )) in the dimension N of the vector space over which the optimization is performed, as long as the number of iterations required is small. When searching for a small number of solutions in a featureless landscape, a large number iterations is required, and the algorithms scale polynomially in N : in particular, the algorithms cannot do better than Grover search, which finds M solutions in an essentially 'gradient-free' landscape in time O( N/M ). Further research may find a possible application of quantum gradient algorithms in machine learning for the training of deep neural networks [31,32], which exhibit a large number of parameters. Gradient descent methods indeed lead to good solutions after a few iterations, and quantum gradient descent algorithms may provide exponential improvements over their classical counterparts.
ordering is not important (after taking the partial trace). Note that there exist (unitary) permutation matrices Q j , such that with M j = Q j AQ † j we have Thus, the M j matrix elements can be obtained from A matrix elements. The permutation matrices are specified via to be To simulate a small step e −i M D ∆t , use the Lie product formula for the sum in Eq. (A.3), where the sum of p terms leads to an error scaling as p 2 , as discussed for example in [35]. Hence M D can be simulated for ∆t via a sequence of simulations of A sandwiched between the permutation matrices. Each of the p permutation matrices Q j can be simulated by swapping the two log N -qubit registers corresponding to the respective permutation, i.e. involving log N swap operations. By Eq. (A.10), a single step thus requires 2p log N swap operations. Also, we require p simulations of A of time ∆t. The error of these simulations shall be p∆t 2 to accumulate to a total error O (p 2 ∆t 2 ).
using τ = m∆t, we take m = O (p 2 τ 2 / ) to achieve final accuracy . This requires in total O (mpΛ A s A ) queries to the oracles for A andÕ (mp 2 ( Recall that the simulation of M D is a means to simulate the gradient operator D. The additional error incurred for M D will be included in the error incurred by the quantum state exponentiation for simulating D, which will be discussed in Appendix B. In addition, we note that improved methods for simulating M D may be found via efficiently computing the sparsity function g M D (k, l) from the sparsity function of A.
We can provide a similar result for the operator appearing in Newton's method.
Lemma 2. Let the Hessian auxiliary operator part M H 1 be given by Eq. (43), with the vector space dimension N p . Let A ≤ Λ A and Λ A ≥ 1. There exists an efficient quantum computation using Oracles (2) and (3) such that e −i M H 1 t can be simulated. For simulating a time τ to accuracy , we require O (p 6 Λ A s A τ 2 / ) queries to the oracles for A andÕ (p 6 Λ A s A τ 2 log N/ ) quantum gates.
(A. 15) The permutation between the p − 1 and p part is owed to the additional swap matrix in Eq. (A.12). Each required permutation to relate M H 1 to A can be performed with 2 log N swap operators. We use the Lie product formula as in Lemma 1, to simulate e −i M H 1 ∆t efficiently with error O (p 4 ∆t 2 ) using p 2 small-time simulations of A. The error of the simulations of A shall be p 2 ∆t 2 to accumulate to a total error O (p 4 ∆t 2 ). Thus, for each simulation of A, we require O Λ A s A ∆t/ p 2 ∆t 2 ≤ O (Λ A s A ) queries to the oracles for A and usingÕ Λ A s A ∆t p log(N )/ p 2 ∆t 2 ≤Õ (Λ A s A log(N )) gates [18]. We also have 2p 2 log N swap operations for a small time step. For multiple steps m = p 4 τ 2 / to given , we thus require O (mp 2 Λ A s A ) queries to the oracles of A andÕ (m(Λ A + 1)s A p 2 log N ) ≤Õ (mΛ A s A p 2 log N ) quantum gates, if Λ A ≥ 1.
where V li := |x x| ⊗i−1 ⊗ |x v T li + v li x| ⊗ |x x| ⊗(p−1)−i ⊗ I. To order O (β) only one state is erroneous and we have O (p) terms with the error at a different location in the p − 1 state register. Here, we can analyze the term where only the first register is erroneous and assume that the other terms are of the same size. That is we use Here, we have another O (p) terms that are similar to the one explicitly written out.
Because v l1 is a random vector, we have random variables Y α l given by (B.8) These random variables have support on [−∞, +∞], a symmetric distribution and standard deviation O (β). To get the final error we will use a central limit argument.
As we care about the quantum simulation of the Hamiltonian D, we need to evaluate the difference in desired and actual time evolution. Note Equation (12) for the error of a single time step ∆t = t/m: (B.9) As before, we can choose E M D = O (p 2 ∆t 2 ), using O (pΛ A s A ) = O (Λ D s A ) queries to the oracle for A and usingÕ (p 2 Λ A s A log N ) =Õ (pΛ D s A log N ) gates, see Appendix A, Lemma 1. Regarding the error E samp , its size is given by E samp = O (Λ 2 D ∆t 2 ) similar to [17]. The error for a small time step ∆t from these two sources can be upper bounded by E ≤ E samp + E M D = O (p 2 Λ 2 D ∆t 2 ). For m steps this error is m E = O (p 2 Λ 2 D τ 2 /m) As in [35], define the remainder of an analytic function f = ∞ j=0 a j x j to be R k (f ) := ∞ j=k+1 a j x j , for k ∈ N. Then the error between desired and actual time Here, the central limit theorem allows to bound the size of the sum of the random variables Y l by O ( √ m) and the standard deviation O (K 2 Λ D β). To simplify, we assume K = O (1). Putting all together, neglecting higher-order β terms, we obtain the