A parallel Newton-type method for nonlinear model predictive control

A parallel Newton-type method for nonlinear model predictive control is presented that exploits the particular structure of the associated discrete-time Euler–Lagrange equations obtained by utilizing an explicit discretization method in the reverse-time direction. These equations are approximately decoupled into single-step subproblems along the prediction horizon for parallelization. The coupling variable of each subproblem is approximated to its optimal value using a simple, efficient, and effective method at each iteration. The rate of convergence of the proposed method is proved to be superlinear under mild conditions. Numerical simulation of quadrotor control using the proposed method showed that the method is highly parallelizable and converges in only a few iterations, even to a high accuracy. Comparison of the proposed method's performance with that of several state-of-the-art methods showed that it is faster. © 2019 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).


Introduction
Nonlinear model predictive control (NMPC) is an optimal control method for controlling nonlinear systems. A finite horizon optimal control problem (FHOCP) is solved at each sampling instant, and only the first optimal control input is applied. With its ability to handle constraints and nonlinear dynamics systematically, NMPC has become widely used in process and chemical applications (Qin & Badgwell, 2000), which typically have slow dynamics. However, the computational demand for solving FHOCPs in real time, especially ones with fast dynamics, large dimensionalities, and complicated constraints, prevents its broader application. One way to avoid online optimization is to shift it offline, e.g., by using explicit NMPC. In explicit NMPC, the optimal control law is computed as a function of the initial state and stored. Since only the stored function has to be evaluated on-line, the amount of computation is reduced. However, NMPC, unlike linear MPC (Bemporad, Morari, Dua, & Pistikopoulos, 2002), does not have a general explicit state feedback law, and only an approximate explicit solution can be found, e.g., by multi-parametric nonlinear programming (Johansen, 2004). Although an exact solution can be obtained using the principle of dynamic programming (Bellman, 1957), that is, by solving the Hamilton–Jacobi–Bellman (HJB) equation, it is often computationally impractical because the amount of computation and the storage required grow prohibitively with the state-space dimension and horizon length, which is known as the ''curse of dimensionality''. One way to overcome this drawback is to use adaptive/approximate dynamic programming (ADP) to obtain an approximate solution of the HJB equation. A very good survey of ADP is given in Wang, Zhang, and Liu (2009). However, these methods require much data storage for high-dimensional systems and suffer performance loss due to sub-optimality.

✩ This work was partly supported by JSPS KAKENHI Grant Number 15H02257. The material in this paper was partially presented at the 6th IFAC Conference on Nonlinear Model Predictive Control, August 19-22, 2018, Madison, Wisconsin, USA. This paper was recommended for publication in revised form by Associate Editor Gabriele Pannocchia under the direction of Editor Ian R. Petersen.
When on-line optimization is inevitable, it has to be solved in real time. On-line methods for the numerical solution of OCPs can typically be categorized into indirect and direct methods (Von Stryk & Bulirsch, 1992). Indirect methods use the necessary condition for optimality known as Pontryagin's Maximum Principle (Pontryagin, Boltyanskii, Gamkrelidze, & Mishchenko, 1961). Methods that solve the continuous-time Euler–Lagrange equations obtained by the calculus of variations are indirect methods and can be referred to as ''first optimize, then discretize'' methods. Methods that solve an FHOCP by first discretizing it into a nonlinear programming problem (NLP) that can be solved using an off-the-shelf NLP solver are direct methods. The latter are the most widely used nowadays, and there have been numerous studies on them. Newton-type methods are direct methods that can be categorized, depending on how the inequalities are treated, into sequential quadratic programming (SQP) methods (Powell, 1978) and interior-point (IP) methods (Byrd, Hribar, & Nocedal, 1999). Efficient implementations of SQP and IP methods have been developed that exploit the structure of NMPC, such as Riccati recursion based on the banded structure of the Hessian matrix (Frasch, Sager, & Diehl, 2015; Glad & Jonson, 1984; Jørgensen, Rawlings, & Jørgensen, 2004; Rao, Wright, & Rawlings, 1998), generalized Gauss–Newton methods for nonlinear least-squares NLPs (Bock, 1983; Houska, Ferreau, & Diehl, 2011b), and solution tracing by continuation (Ohtsuka, 2004), just to name a few. An excellent review of Newton-type methods can be found in Diehl, Ferreau, and Haverbeke (2009). In addition to the Newton-type methods, first-order methods for NMPC have been proposed that deal directly with the nonlinear dynamics (Graichen & Käpernick, 2012) or solve the underlying quadratic programming (QP) problems (Kalmari, Backman, & Visala, 2015; Kouzoupis, Ferreau, Peyrl, & Diehl, 2015).
Most real-time methods can only be parallelized to a certain extent, such as parallel system simulation in the context of multiple shooting (Bock & Plitt, 1984) or parallel matrix operations. However, there is growing demand for tailored parallel NMPC methods due to the rapid development of parallel devices such as multi-core processors, graphics processing units, and field-programmable gate arrays (FPGAs). Although first-order methods such as the alternating direction method of multipliers (Jerez et al., 2014; O'Donoghue, Stathopoulos, & Boyd, 2013) and the fast gradient method (Jerez et al., 2014) can be easily parallelized and efficiently implemented for linear MPC, they suffer from slow rates of convergence compared with Newton-type methods, from difficulty in dealing with complicated constraints, and from the time-varying dynamics of the underlying linearized problems when used within SQP for NMPC. Although Newton-type methods are basically able to deal with general constraints and have fast rates of convergence, parallelization is still challenging. Methods based on parallel Riccati recursion (Frasch et al., 2015; Nielsen & Axehill, 2015) have been proposed; a computational complexity of O(log N) can be achieved with N threads, where N is the number of prediction steps. Several tailored parallel methods for NMPC have been reported. The advanced multi-step NMPC (Yang & Biegler, 2013), which is parallelized along the time axis, concurrently solves several predicted OCPs beforehand, and then the input is updated on the basis of the state measurement. Particle swarm optimization (Xu, Chen, Gong, & Mei, 2016), with its inherent parallelism, has been implemented on an FPGA for NMPC. An augmented Lagrangian-based method tailored to nonlinear OCPs has been reported (Kouzoupis, Quirynen, Houska, & Diehl, 2016) that concurrently solves subproblems along the prediction steps and then solves a centralized consensus QP to update the dual variables at each iteration.
This method has a linear rate of convergence and its speed-up depends on the computation time of the consensus step in accordance with Amdahl's law (Amdahl, 1967).
In this paper, a highly parallelizable Newton-type method is proposed. The FHOCP is first discretized using an explicit integration scheme performed in the reverse-time direction, enabling the problem to be split into linearly coupled subproblems. Then, a sequential method, which we call the backward correction method, is formulated. The backward correction method is proved to be exactly identical to Newton's method but with a clearer structure that opens the possibility of parallelization. In the backward correction method, the coupling variable in each subproblem is calculated recursively in a backward manner, which is time-intensive. Reasonable approximations of the coupling variables are used to break down the recursion so that the calculation can be done in parallel. The resulting algorithm is then proved to converge superlinearly. Numerical simulation of controlling a quadrotor demonstrated that the proposed method is highly parallelizable and can converge to a specified tolerance in only a few iterations. Compared with our earlier work (Deng & Ohtsuka, 2018), a more general discretization method is considered, and the rate of convergence is characterized in this paper. This paper is organized as follows. First, the NMPC problem and its discretization are presented in Section 2. The backward correction method is formulated and proved to be exactly identical to Newton's method in Section 3. The proposed method and its rate of convergence are presented in Section 4. The numerical experiment is described and the results are discussed in Section 5.
The key points are summarized and future work is mentioned in Section 6.

Nonlinear model predictive control
Consider a continuous-time nonlinear system

ẋ(t) = f(u(t), x(t), p(t)), (1)

where x ∈ R^{n_x}, u ∈ R^{n_u}, and p ∈ R^{n_p} are the system state, control input, and given time-dependent parameter, respectively.
In NMPC, a cost function over the finite horizon [t, t + T] is minimized subject to the dynamics (1), an initial state x(t) = x̄_0, and an inequality constraint (2). In solving the FHOCP above, the inequality constraint (2) can be transformed approximately into a barrier cost function within the framework of the interior-point method or, alternatively, into a nonlinear algebraic equality constraint by introducing additional slack variables (Ohtsuka, 2004).
We thus consider solving the discretized FHOCP (3) with only equality constraints, where X = (x_0, x_1, x_2, . . . , x_N) and U = (u_1, u_2, . . . , u_N) are, respectively, the sequences of states and inputs along the horizon, and L, ϕ, F, and C are twice continuously differentiable functions.
Remark 1. The discretization method for (1) formulated in (3) is herein called the ''reverse-time discretization method'', which is a class of implicit discretization methods. For example, the backward Euler method, with F(u_i, x_i, p_i) := x_i − ∆τ f(u_i, x_i, p_i), corresponds to the forward Euler method performed backward in time. We do not assume any particular discretization method in this paper. Any explicit integration method, e.g., the Runge–Kutta method, can be used backward in time to obtain the reverse-time discretization, and the obtained results, including the parallelization and convergence, apply as well. Discretization using a conventional explicit method is discussed in Remark 2.
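To make Remark 1 concrete, the following minimal sketch (not taken from the paper's toolkit; the scalar dynamics f and the step size are hypothetical) checks numerically that one forward Euler step taken backward in time is, when read forward in time, exactly the implicit (backward) Euler relation:

```python
# Reverse-time discretization sketch: the forward Euler method run
# backward in time, x_{i-1} = x_i - dt * f(u_i, x_i), is explicit in the
# reverse-time direction but, read forward in time, coincides with the
# implicit (backward) Euler method, since x_i appears on the right-hand side.

def f(u, x):
    # hypothetical scalar dynamics dx/dt = -x + u (not the paper's model)
    return -x + u

dt = 0.1
u_i, x_i = 1.0, 2.0

# explicit step in the reverse-time direction
x_prev = x_i - dt * f(u_i, x_i)

# read forward in time, the same pair satisfies the backward Euler relation
residual = x_i - (x_prev + dt * f(u_i, x_i))
print(residual)  # 0.0
```

The step is explicit when sweeping from stage N down to stage 0, which is exactly what makes the single-step subproblems cheap to form.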
To solve the NLP (3), we first introduce Lagrange multipliers λ_i and µ_i for the difference equations and the equality constraints, respectively. The sequences of optimal control inputs, states, and multipliers satisfy the nonlinear algebraic equations (4), given x*_0 = x̄_0 and λ*_{N+1} = 0, where H denotes the Hamiltonian. Here, H_u := ∂H/∂u and H_x := ∂H/∂x. Note that Eqs. (4) are the discrete-time Euler–Lagrange equations (also called the Karush–Kuhn–Tucker (KKT) conditions), which are the first-order conditions for optimality of problem (3). The nonlinear model predictive controller is implemented at each sampling instant by solving the Euler–Lagrange equations (4) given the current state x̄_0 and parameter sequence {p_i}_{i=1}^N, and then using u*_1 as the actual input.
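The explicit forms of the Hamiltonian and of Eqs. (4) are lost in the extracted text; a plausible reconstruction, consistent with the linear coupling in x_{i−1} and λ_{i+1} noted in Remark 2 (sign conventions and the treatment of the terminal cost are assumptions here), is:

```latex
H(\lambda, \mu, u, x, p) := L(u, x, p) + \lambda^{\top} F(u, x, p) + \mu^{\top} C(u, x, p),
\qquad
\begin{cases}
F(u_i, x_i, p_i) - x_{i-1} = 0,\\
H_x(\lambda_i, \mu_i, u_i, x_i, p_i) - \lambda_{i+1} = 0,\\
H_u(\lambda_i, \mu_i, u_i, x_i, p_i) = 0,\\
C(u_i, x_i, p_i) = 0,
\end{cases}
\quad i = 1, \dots, N,
```

with x_0 = x̄_0 and λ_{N+1} = 0. In this form, x_{i−1} and λ_{i+1} indeed enter the stage-i equations linearly, which is the structure exploited in the sequel.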
Remark 2. Discretization using the reverse-time discretization method minimizes the complexity of coupling between neighboring stages of the Euler-Lagrange equations. Namely, x i−1 and λ i+1 enter the Euler-Lagrange equations with index i linearly. In the case of explicit discretization, e.g., using the forward Euler method, a similar structure with linear couplings can also be obtained for the corresponding reordered Euler-Lagrange equations (Biegler & Thierry, 2018), and the results of parallelization and convergence can in principle be extended. However, it can be seen from the similar linear coupling structure that explicit discretization will not lead to better computational performance; thus, we herein focus on the reverse-time discretization.

Notations
Let us denote by V_i the vector of unknown variables at stage i and by V the stacked vector over all stages. Then U(V) = 0 is the compact form of Eqs. (4), with the solution denoted by V*. For a variable v, v^k stands for the value of v at the kth iteration. For simplicity, we denote by J(V) := U′(V) the Jacobian matrix of U with respect to V, and write U^k := U(V^k) and J^k := J(V^k). The symbol ∥·∥ denotes the Euclidean norm for a vector and the induced p-norm for a matrix, where p can be any positive integer.

Backward correction method
Newton's method for solving Eqs. (4) is difficult to parallelize due to the coupling between stages. In this section, we will take a look at the iteration of each stage and investigate the coupling between stages. We formulate the backward correction method and prove that it is exactly identical to Newton's method.

Motivation
Newton's method is practical and powerful for solving the nonlinear algebraic equation U(V) = 0, as it is simple to implement and generally converges quadratically. The full-step Newton's method performs the iteration V^{k+1} = V^k − J(V^k)^{−1} U(V^k), starting from an initial guess V^0 that is sufficiently close to V*. However, performing Newton's iteration is computationally expensive because one has to first evaluate U(V^k) and its Jacobian J(V^k) and then solve a linear equation. The computational complexity of computing J(V^k)^{−1} U(V^k), in terms of N, can be made either O(N^2) by condensing (Frison & Jorgensen, 2013) or O(N) by exploiting the banded structure of J(V), as proposed in Rao et al. (1998). Due to the coupling between stages (J(V^k) is not block-diagonal), parallelization is challenging. Although methods using parallel Riccati recursion (Frasch et al., 2015; Nielsen & Axehill, 2015) can have complexities of O(log N), they are still computationally expensive. Note that, thanks to the reverse-time discretization of the state equation, the Euler–Lagrange equations (4) at stage i are coupled with the neighboring stages linearly; that is, they can be formulated as the necessary conditions for a single-step OCP with initial state x_{i−1} and a terminal penalty determined by λ_{i+1}, provided that x*_{i−1} and λ*_{i+1} are given for each stage i in advance. It should be noted that this resembles the principle of optimality, which states that any part of the optimal trajectory must itself be optimal. Unfortunately, x*_{i−1} and λ*_{i+1} cannot be known in advance, and only suboptimal values are available in general. When iterative methods are used, the iterations at the stage level should be investigated in order to parallelize them.
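For reference, the full-step iteration V^{k+1} = V^k − J(V^k)^{−1} U(V^k) can be sketched on a self-contained toy 2-D root-finding problem (the system U and starting point below are illustrative, not the NMPC equations):

```python
# Minimal full-step Newton iteration for U(V) = 0, sketched on a toy
# 2-D system: V^{k+1} = V^k - J(V^k)^{-1} U(V^k).

def U(v):
    x, y = v
    return [x**2 + y - 2.0, x + y**2 - 2.0]

def J(v):
    # Jacobian of U
    x, y = v
    return [[2.0 * x, 1.0], [1.0, 2.0 * y]]

def newton_step(v):
    (a11, a12), (a21, a22) = J(v)
    u1, u2 = U(v)
    det = a11 * a22 - a12 * a21
    dx = (u1 * a22 - u2 * a12) / det  # solve J d = U by Cramer's rule
    dy = (a11 * u2 - a21 * u1) / det
    return [v[0] - dx, v[1] - dy]

v = [2.0, 0.5]
for _ in range(10):
    v = newton_step(v)
print(v)  # converges to the root (1, 1)
```

In the NMPC setting, the linear solve is the expensive step because J is a large banded matrix rather than a 2-by-2 block.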

Consider Newton's iteration (5) for solving the stage-i Euler–Lagrange equations, where x̂_{i−1} and λ̂_{i+1} are the estimates of x*_{i−1} and λ*_{i+1}, respectively. One of the simplest methods takes x̂_{i−1} = x^k_{i−1} and λ̂_{i+1} = λ^k_{i+1}. This method has complexity O(n^3) in parallel, but it might diverge. The Gauss–Seidel scheme performs (5) recursively from i = 1 to N, with x̂_{i−1} = x^{k+1}_{i−1} and λ̂_{i+1}, for example, estimated with a coarse-grained high-level controller in advance, as proposed in Zavala (2016). The convergence of the Gauss–Seidel scheme cannot be guaranteed either, unless λ*_{i+1} is well estimated.
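The contrast between the two update orders, updating every subproblem from iteration-k values (fully parallel) versus sweeping through the subproblems and reusing freshly updated values, can be sketched on a toy diagonally dominant linear system (an illustration of the orderings only, not the paper's nonlinear stage subproblems):

```python
# A Jacobi-style update uses only iteration-k values of the coupling
# variables (fully parallel), while a Gauss-Seidel sweep reuses values
# already updated in the current pass.

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]  # the solution of A x = b is (1, 1, 1)

def jacobi_sweep(x):
    # every entry computed from the previous iterate only
    return [(b[i] - sum(A[i][j] * x[j] for j in range(3) if j != i)) / A[i][i]
            for i in range(3)]

def gauss_seidel_sweep(x):
    x = list(x)
    for i in range(3):  # overwrite in place: later rows see fresh values
        x[i] = (b[i] - sum(A[i][j] * x[j] for j in range(3) if j != i)) / A[i][i]
    return x

xj = [0.0, 0.0, 0.0]
xg = [0.0, 0.0, 0.0]
for _ in range(30):
    xj = jacobi_sweep(xj)
    xg = gauss_seidel_sweep(xg)
print(xj, xg)  # both approach (1, 1, 1)
```

The Jacobi-style sweep parallelizes trivially but is the least accurate use of the coupling variables; the Gauss-Seidel sweep is sequential by construction, which is the tension the backward correction method resolves.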
Next, we present a new method that estimates λ*_{i+1} on the basis of the current kth iteration's information. In the Gauss–Seidel method, stages 2, . . . , N are updated after the update of stage 1. As x_1 enters U_{2:N} linearly, Eq. (6) holds. Therefore, the Jacobian matrix of U_{2:N} with respect to V_{2:N} does not depend on x_1, and the update of λ_2 can be extracted from the first n_x elements of (6) and expressed as λ^{k+1}_2(x_1) in (7). Although λ*_2 cannot be known in advance, λ^{k+1}_2(x_1) can be regarded as the correction of λ^k_2 by (7), so that U_1(x̄_0, V_1, λ^{k+1}_2(x_1)) = 0 can be solved with respect to V_1. After x^{k+1}_1 is obtained, stages 2, . . . , N can be further split and iterated recursively, which results in the algorithm described in the next subsection.

Algorithm
As stage N is the last stage and cannot be further split, the update of λ_N as a function of x_{N−1} can be formulated directly. The stages are then updated from 1 to N following the Gauss–Seidel scheme. The resulting method is called the backward correction method (Algorithm 1), which has a computational complexity of O(Nn^3).

Algorithm 1: kth iteration of the backward correction method.
Next, we show that the backward correction method results in exactly the same update as Newton's method.
Theorem 1. The backward correction method in Algorithm 1 is identical to Newton's method, that is, starting from the same V k at the kth iteration, the updated value obtained using Newton's method denoted by V k+1 (nt) is equal to the value obtained using the backward correction method denoted by V k+1 (bc) .
The proof of Theorem 1 is provided in Appendix A.
The backward correction method is one of the methods exploiting the banded structure of J(V) using block Gauss elimination. However, discretization of the system dynamics using the reverse-time method leads to linear couplings between neighboring stages in the Euler-Lagrange Equations (4), which makes the recursion in the backward correction method extremely simple compared with that in other structure-exploiting methods (Glad & Jonson, 1984;Jørgensen et al., 2004;Rao et al., 1998).
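The backward-elimination/forward-substitution pattern that the backward correction method shares with block Gauss elimination can be sketched in scalar form on a tridiagonal system (the NMPC case eliminates block rows, one per stage; this toy solver only illustrates the sweep structure, it is not the paper's implementation):

```python
def solve_tridiagonal_backward(lower, diag, upper, rhs):
    """Solve a tridiagonal system by eliminating rows from the last one
    up, then substituting forward, mirroring the backward recursion and
    forward sweep of the backward correction method."""
    n = len(diag)
    d = list(diag)
    r = list(rhs)
    # backward elimination: fold row i+1 into row i
    for i in range(n - 2, -1, -1):
        w = upper[i] / d[i + 1]
        d[i] -= w * lower[i]  # lower[i] couples row i+1 back to x_i
        r[i] -= w * r[i + 1]
    # forward substitution
    x = [0.0] * n
    x[0] = r[0] / d[0]
    for i in range(1, n):
        x[i] = (r[i] - lower[i - 1] * x[i - 1]) / d[i]
    return x

# 4 on the diagonal, 1 on both off-diagonals; solution is about (1, 1, 1, 1)
x = solve_tridiagonal_backward([1.0, 1.0, 1.0], [4.0, 4.0, 4.0, 4.0],
                               [1.0, 1.0, 1.0], [5.0, 6.0, 6.0, 5.0])
print(x)
```

Because each stage couples only to its neighbors, and linearly, the per-row work in the NMPC analogue is a fixed-size block operation, which is what keeps the recursion simple.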

Proposed method
The backward correction method is inherently nonparallelizable due to the recursion for calculating H^k_i in (8) from i = N to 1 in order to formulate the expression of λ^{k+1}_{i+1}(x_i) in (10) for each stage, which is the most computationally expensive part. Although the iterations are sequential, the iteration of each stage is clearly shown in (13). We show next that the backward correction method can be parallelized by using a reasonable approximation.
Recall that, in the iteration (5) for solving the Euler–Lagrange equations (4), convergence depends on the estimation of the coupling variables x*_{i−1} and λ*_{i+1} for each stage i. The estimation of λ*_{i+1} in the backward correction method is performed using (10) and can be seen as an accurate estimation of λ*_{i+1}; consequently, a quadratic rate of convergence is obtained. A slower rate of convergence results from a relatively coarse estimate. To see that the variations of {Λ^k_i}_{i=1}^N are small as k increases, consider the situation in which the cost function L(u, x, p) is quadratic, F(u, x, p) and the constraint C(u, x, p) are linear, and the time-varying parameter p is fixed. Under these conditions, the Jacobian matrix of U with respect to V is constant, which means that the matrices {Λ^k_i}_{i=1}^N in (11) are constant during the iteration. For general problems, we can expect small variations of {Λ^k_i}_{i=1}^N as k increases when the problem is less nonlinear and p varies slowly.
Therefore, we propose estimating λ*_{i+1} on the basis of Λ^{k−1}_{i+1} at the kth iteration. That is, (10) in the backward correction method is replaced by (14). This removes the recursion for calculating H^k_i, so that the stages can be processed in parallel using (15).

Parallelized implementation
Carrying out these calculations for all stages in parallel leads to an algorithm with complexity O(n^3 + Nn^2) for N threads, where n^3 is the complexity of the matrix inverse in (15), and Nn^2 indicates the complexities of carrying out (12) and (13) backward and forward, respectively.
Moreover, further parallelization is achieved by arranging the order of computations as shown in Algorithm 2.
Each stage is first updated coarsely in parallel (Step 1 of Algorithm 2). This coarse update introduces approximations of the coupling variables (x_{i−1} and λ_{i+1}) compared with (13). The correction due to the approximation of λ_{i+1} is conducted in a backward manner (Step 2 of Algorithm 2). In the sequential part of Step 2, the approximation error d_{λ_{i+1}} of λ_{i+1} is calculated for all stages, and then the other variables are corrected on the basis of the approximation error in the parallel part. Likewise, the correction due to the approximation of x_{i−1} is conducted in a forward manner (Step 3 of Algorithm 2). As a result of these two correction steps, the further parallelized implementation produces exactly the same update as the proposed method and reduces the complexity to O(n^3 + Nn_x^2) for N threads.

Convergence
In this subsection, the rate of convergence is shown for the proposed method by measuring the distance to the backward correction method.
Algorithm 2: Proposed parallelized implementation of the backward correction method (kth iteration). Step 1: coarse update. Step 2: backward correction due to the approximation of λ. Step 3: forward correction due to the approximation of x.
At the kth iteration, the proposed method differs from the backward correction method only in how {λ^{k+1}_{i+1}(x_i)} is formulated. To distinguish the two methods, let v_(bc) denote, for a variable v, the corresponding variable in the backward correction method. It can be seen that the proposed method approximates the backward correction method, the approximation error being defined in (16). Theorem 1 has shown that the backward correction method is equivalent to Newton's method, so its iteration can be written as a Newton update. The kth iteration of the proposed method is expressed through G(V, h), a mapping that defines the increment of the proposed method. We make the following assumptions:

Assumption 1. The solution V* is an isolated solution to U(V) = 0.
Note that Assumption 1 guarantees the nonsingularities for not only the Jacobian matrix J 1:N but the Jacobian matrices {J i:N } N i=2 in a neighborhood of V * .

Assumption 2. The function J(V) is Lipschitz continuous on a neighborhood D_V of V*, and ∥J(V)^{−1}∥ is bounded on D_V.
Recall that the proposed method has the same algorithm as the backward correction method, except that λ * i+1 is estimated by (14) instead of (10). Therefore, in the following proof, we refer to the equations in Algorithm 1 to show the convergence of Algorithm 2 for convenience.
The proof is structured as follows. Lemma 3 shows that lim_{k→∞} h^k_i = 0 for all i ∈ {1, . . . , N} when the iteration converges. This implies that the proposed method approaches the backward correction method when it converges. To further show this, the explicit expression of G(V, h) is given in Lemma 4 by G(V, h) = P(V, h) · U(V), and the distance between P(V, h)^{−1} and J(V) is shown to be arbitrarily boundable in Lemma 4. Finally, it is shown in Theorem 2 that if V^0 is chosen close to V* and h^0 close to 0, the convergence of the proposed method can be guaranteed. Moreover, since the proposed method approaches the backward correction method, which converges quadratically, the rate of convergence is shown to be superlinear in Theorem 2.

Lemma 3. If the iterates {V^k} given by the proposed method converge to V*, then h^k → 0 as k → ∞.

Proof. See Appendix B.

Lemma 4. G(V, h) can be expressed as G(V, h) = P(V, h) · U(V), where ∥P(V, h)^{−1} − J(V)∥ can be bounded arbitrarily for h sufficiently close to 0.
Proof. See Appendix C.

Theorem 2. There are balls S_V := {V | ∥V − V*∥ ≤ δ_V} ⊂ D_V and S_h := {h | ∥h_i∥ ≤ δ_h, i ∈ {1, . . . , N}} ⊂ D_h such that, for any V^0 ∈ S_V and h^0 ∈ S_h, the iterates {V^k} given by the proposed method remain in S_V and converge to V* superlinearly.
Proof. Set β = ∥H(V*)∥. For any ϵ ∈ (0, ½β^{−1}), there exists δ_1 > 0 such that (17) and (18) hold by the continuity of J at V* and the Lipschitz continuity in Assumption 2, respectively. For such ϵ, from Lemma 4, we can choose δ_2 > 0 such that h^k ∈ S_2 := {h | ∥h_i∥ < δ_2, i ∈ {1, . . . , N}} ⊂ D_h and (19) holds. From (17) and (19), we obtain (20). From (18) and (20), ω(V^k, h^k) is the upper bound on the rate of convergence. Therefore, if V^0 ∈ S_V and h^0 ∈ S_h, then the iterates {V^k} converge and remain in S_V. Moreover, since V^k → V* as k → ∞, we have h^k → 0 as k → ∞ from Lemma 3. Therefore, ω(V^k, h^k) → 0 as k → ∞, which indicates that the proposed method has a superlinear rate of convergence. □

Remark 3. Theorem 2 gives conditions of convergence only for the update at a single sampling instant. However, the convergence of the proposed method should be guaranteed at every sampling instant. When the iteration at one sampling instant converges, a sufficiently large number of iterations k guarantees sufficiently small ∥h^k_i∥ for all i ∈ {1, . . . , N} from Lemma 3.
Moreover, if the time-dependent parameter p varies smoothly with respect to time, a small sampling interval can ensure that the new state x̄_0 and the time-dependent parameter p are close to their previous values even if there are model mismatches or disturbances in the actual system, which implies that the initial guess V^0 is close to the optimal value V* when warm-start is used. Therefore, the conditions in Theorem 2 can always be satisfied at the next sampling instant.
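Warm-starting across sampling instants, as described above, reuses the previous solution as the initial guess; a common variant (assumed here for illustration only, the text does not specify a shifting rule) shifts the stage sequence by one step and duplicates the last stage:

```python
# Hypothetical warm-start rule: shift the previous instant's per-stage
# solution by one step and duplicate the last stage, so the guess is
# aligned with the receded horizon.

def warm_start(prev_solution):
    # prev_solution: list of per-stage variable vectors, stages 1..N
    return prev_solution[1:] + [prev_solution[-1]]

V_prev = [[0.1], [0.2], [0.3], [0.4]]
print(warm_start(V_prev))  # [[0.2], [0.3], [0.4], [0.4]]
```

Whatever shifting rule is used, the point of Remark 3 is that smooth p and a small sampling interval keep this guess inside the convergence ball S_V of Theorem 2.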

NMPC for a quadrotor
In order to demonstrate the computation time and rate of convergence of the proposed method, reference tracking of a quadrotor is considered. The state vector of the quadrotor is x = [X, Ẋ, Y, Ẏ, Z, Ż, γ, β, α]^T ∈ R^9, where (X, Y, Z) and (γ, β, α) are the position and angles of the quadrotor, respectively. The input vector is u = [a, ω_X, ω_Y, ω_Z]^T, where a represents the thrust and (ω_X, ω_Y, ω_Z) the rotational rates. We use a previously defined nonlinear model (Hehn & D'Andrea, 2011). The stage cost L(u, x, p) is quadratic with weighting matrices Q = diag(10, 1, 2, 1, 10, 1, 1, 1, 1) and R = I_4, and no terminal cost ϕ(x, p) is imposed. The prediction horizon is T = 1 s and is divided into N = 8, 16, 24, 48, or 96 steps for comparison. Simulation is performed for 20 s with a sampling interval of 0.01 s.

Computation time
The proposed method was implemented using the open-source toolkit ParNMPC (Deng, 2018), which can automatically generate parallel C/C++ code with OpenMP (Dagum & Menon, 1998). For comparison, the code generation tool (Houska et al., 2011b) in the ACADO Toolkit (Houska, Ferreau, & Diehl, 2011a) and the AutoGenU (Ohtsuka, 2015) code generation toolkit were used to perform closed-loop NMPC simulation. ACADO Code Generation is based on the SQP method with Gauss–Newton Hessian approximation (GNHA); the qpOASES (Ferreau, Kirches, Potschka, Bock, & Diehl, 2014) dense QP solver and the qpDUNES (Frasch et al., 2015) sparse QP solver are used to solve the underlying QP problems. AutoGenU is based on the C/GMRES method. The simulation was performed on a desktop computer with dual 2.2 GHz Intel Xeon E5-2650 v4 processors (24 cores in total), with the Hyper-Threading and Turbo Boost features disabled.
In principle, ACADO Code Generation is based on the real-time iteration (RTI) scheme, in which only one SQP iteration is performed. Here, the KKT tolerance is controlled by performing several SQP iterations, and the computation time is the sum of the CPU times returned by the solver. The KKT tolerances of the proposed method and the SQP methods are set to 5 × 10^{−3}, while in the C/GMRES method, only one iteration is performed per update. It should be noted that these methods do not converge to the same solution even when the same KKT tolerances are achieved because of the differences in handling inequality constraints. Namely, the SQP methods converge to the optimal solution of the inequality-constrained optimization problem, while the proposed method and C/GMRES take the inequality constraints into account only approximately by introducing dummy inputs (Ohtsuka, 2004). The resulting suboptimality e_o is defined as the relative error between U*_dummy and U*_ieq, where U*_dummy and U*_ieq denote the control input sequences obtained by introducing dummy inputs and by dealing directly with the inequality constraints, respectively. The discretization within the SQP methods is based on the explicit Euler method with multiple shooting. The QP problems solved by qpOASES are condensed, and warm-start is enabled. Although SQP methods can, in principle, be parallelized to a certain degree, e.g., for multiple shooting and the qpDUNES QP solver, they are run in a fully sequential fashion. The explicit Euler method is used for the C/GMRES method, and the number of iterations in GMRES is set to 7. The proposed method is based on the implicit Euler method.
The approximation to the backward correction method is defined in (16). Let us denote by e_h the relative approximation error, which measures the distance to Newton's method. The simulation results for the inputs, position, suboptimality e_o, and relative approximation error e_h after each update with the proposed method for N = 24 are shown in Fig. 1. The results obtained by introducing dummy inputs are almost the same as those with the SQP methods; i.e., active input bound constraints can be achieved when tracking the position reference. Since the degree of parallelism of the proposed method is N, the program can be run on min{24, N} cores concurrently. Table 1 lists the average computation time and number of iterations per update. Although the computation time of the proposed method is long when running on one core in a fully sequential fashion, it becomes faster than the other methods in the medium and high ranges of N when running on multiple cores.
As the computation time of the proposed method is in the µs range, the overhead time is non-negligible. Let t_E denote the average time per update when running on min{24, N} cores in practice, and let t_p and t_s denote the times of the parallelizable and sequential parts when running on one core, respectively. The theoretical minimum execution time is given by t_A := t_s + t_p / min{24, N} according to Amdahl's law. From Table 2, we can see that t_s is less than 1% of the total elapsed time, indicating that the proposed method is highly parallelizable. In this example, the overhead time accounts for a substantial part of the total elapsed time, even greater than 60%. The overhead time can also be observed from the parallel scaling results in Table 3. Although a speed-up of more than 10× can be achieved, the efficiency (speed-up per core) degenerates as the number of cores grows. One reason for these overheads is the communication between the dual processors when carrying out the backward and forward correction steps; this overhead can be reduced by running on a single-processor-based computer. Also, interruptions by other threads on Windows unbalance the parallel tasks, which also explains why the overhead time is proportionately the highest for N = 24 in Table 2 and why the efficiency is lowest when using all 24 cores in Table 3. These overhead results show the potential of the proposed method if it is run on platforms with less overhead.
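The Amdahl's-law estimate t_A = t_s + t_p / min{24, N} used above can be sketched as follows (the timing numbers are hypothetical, not the measurements in Tables 2 and 3):

```python
# Amdahl's law: with sequential time t_s and parallelizable time t_p
# (both measured on one core), the best achievable time on c cores is
# t_A = t_s + t_p / c, giving a speed-up of (t_s + t_p) / t_A.

def amdahl_time(t_s, t_p, cores):
    return t_s + t_p / cores

def speedup(t_s, t_p, cores):
    return (t_s + t_p) / amdahl_time(t_s, t_p, cores)

t_s, t_p = 1.0, 99.0  # hypothetical split: 1% sequential, 99% parallelizable
print(round(speedup(t_s, t_p, 24), 2))  # 19.51 on 24 cores
```

Even with only 1% sequential work, the ideal 24-core speed-up is below 20×, which is why the measured efficiency degrades before overhead is even counted.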

Number of iterations for different tolerances
Although the proposed method is proven to converge superlinearly, the rate of convergence is characterized only as k → ∞. The proposed method was therefore compared, in terms of the number of iterations for different tolerances, with Newton's method (the backward correction method) and the SQP method with GNHA (qpOASES was used). The proposed method and Newton's method converge to the same solution, and the suboptimalities e_o for both methods under the set tolerances are nearly the same as the suboptimality shown in Fig. 1. The numbers of iterations shown in Fig. 2 are filtered using a 10th-order moving average filter to remove the jitter arising at the critical point of satisfying the KKT tolerance, so that the numbers of iterations for the different methods become distinguishable. Fig. 2 shows that the proposed method converges to the specified tolerance within several iterations, even to a high accuracy. Moreover, it converges faster than the SQP method with GNHA. The relative approximation error e_h in Fig. 1 shows that the proposed method is extremely close to Newton's method; consequently, it is only slightly slower than Newton's method. Examples handling inequalities using the interior-point method and with complicated constraints, such as state and terminal constraints, can be found on the website of the toolkit (Deng, 2018).

Conclusion
This paper presents a highly parallelizable Newton-type method for NMPC. First, the original FHOCP is discretized using the reverse-time integration method so that the Euler–Lagrange equations of the resulting NLP are linearly coupled between neighboring stages. Next, the backward correction method, which predicts the coupling variables recursively from the last stage to the first, is formulated and proved to be identical to Newton's method. Since this method is computationally expensive, it is approximated using the previous iteration's information so that the recursion is broken down. The proposed method (parallelized implementation) has a complexity of O(n^3 + Nn_x^2) for N threads because the evaluations of the Euler–Lagrange equations and their Jacobians are performed in parallel. Numerical simulation of controlling a quadrotor shows that the proposed method is highly parallelizable (sequential running time ≤ 1% of the total) and faster than conventional methods. Since the experimental environment has inherently high overhead and a limitation on the number of cores, implementation of the proposed method on other platforms, such as ones with FPGAs and many-core processors, would improve its efficiency. This remains for future work.

Appendix A. Proof of Theorem 1
First, the calculation of ∆V^k_(nt) is given. Note that stage i is coupled only linearly with its neighboring stages, so ∆V^k_(nt) can be obtained by block recursions with ∆V^k_{0(nt)} = 0.
Next, we show that the recursion in Algorithm 1 matches the one in block Gaussian elimination, which proves the identity between ∆V^k_(bc) and ∆V^k_(nt). Combining (8) and (11) yields the recursion for calculating H^k_i. From the expression of λ^{k+1}_{i+1}(x_i) in (10) and from (12), the corresponding equalities in the elimination hold, which completes the proof. □

Appendix B. Proof of Lemma 3
This lemma can be proved by showing that the bound (B.1) on h^k_i holds. As the inequality (B.1) also holds for j = k, the desired limit follows with the aid of the preceding lemmas. From the backward correction method, Θ^k_{i(bc)} is obtained from the last M stages (stages i : N). Similarly to Θ^k_{i(bc)}, Θ^{k−1}_i has a recursive formulation. □

Appendix C. Proof of Lemma 4
First, we find the expression of P(V, h). As the proposed method has the same algorithmic structure as the backward correction method, the iteration of V_i can be found from (13) as (C.1), and the recursion for calculating d^k_{λ_i} is given in (12) by (C.2). From (C.1) and (C.2), recursively, we obtain the expression of P(V, h) for all V ∈ D_V. It can be proved by induction on l, together with the boundedness of ∥J(V)^{−1}∥ in Assumption 2, that, for any ϵ > 0, there exists δ_h > 0 such that the bound (C.5) holds for any α ∈ {1, −1}, r_l ∈ {1, . . . , N}, l ∈ {0, 1, . . . , 2N}, m ∈ {0, 1, . . . , 2N}, and B_l ∈ {B_U, B_L}. Since P_{i,j}(V^k, h^k) is the summation of at most N terms of the form (C.5) satisfying (C.7), we obtain a bound on ∥P(V^k, h^k) − H(V^k)∥. Since ϵ > 0 is arbitrary, ∥P(V^k, h^k) − H(V^k)∥ can be bounded arbitrarily. Thus, ∥P(V^k, h^k)^{−1} − J(V^k)∥ can also be bounded arbitrarily due to the continuity of the matrix inverse on D_V; that is, (C.6) holds. □