Approximate Dynamic Programming for Constrained Linear Systems: A Piecewise Quadratic Approximation Approach

Approximate dynamic programming (ADP) faces challenges in dealing with constraints in control problems. Model predictive control (MPC) is, in comparison, well-known for its accommodation of constraints and stability guarantees, although its computation is sometimes prohibitive. This paper introduces an approach combining the two methodologies to overcome their individual limitations. The predictive control law for constrained linear quadratic regulation (CLQR) problems has been proven to be piecewise affine (PWA) while the value function is piecewise quadratic. We exploit these formal results from MPC to design an ADP method for CLQR problems. A novel convex and piecewise quadratic neural network with a local-global architecture is proposed to provide an accurate approximation of the value function, which is used as the cost-to-go function in the online dynamic programming problem. An efficient decomposition algorithm is developed to speed up the online computation. Rigorous stability analysis of the closed-loop system is conducted for the proposed control scheme under the condition that a good approximation of the value function is achieved. Comparative simulations are carried out to demonstrate the potential of the proposed method in terms of online computation and optimality.


Background
Reinforcement learning (RL) [1] provides powerful tools for the synthesis of controllers through direct interaction with the environment. When RL is applied to physical systems, model-based RL, also known as (approximate) dynamic programming ((A)DP) [2], has received much attention since it can make effective use of the state transition information provided by the model. Combined with control techniques, model-based RL has successfully been applied to a broad range of control engineering problems.
Compared to its diverse industrial applications, the theoretical analysis of stability and safety for RL faces great challenges [3]. Having originated in the artificial intelligence (AI) community, RL was long developed only for AI applications (e.g., board games), in which researchers focused almost entirely on performance with respect to a designated reward function while stability and safety were ignored. Stability and safety, on the other hand, are central to the control community, since unstable modes or risky control policies may lead to destructive consequences in physical systems. In the control community, the earliest ADP methods can be traced back to the 1990s, when Werbos [4] introduced function approximators to construct actor-critic structures for the synthesis of feedback control laws. Thanks to Lyapunov stability theory, the stability issue of ADP has been comprehensively addressed, both for linear [5] and nonlinear systems [6].
Although ADP approaches consider stability, safety is another important issue that needs further study. From the control viewpoint, safety means that the behavior of the closed-loop system, including states, inputs, and outputs, satisfies some hard or soft constraints. In summary, the techniques employed by RL or ADP approaches for dealing with constraints can be grouped into two categories: policy-projection-based methods and policy-optimization-based methods. Policy-projection-based methods consider the RL formulation in the unconstrained case and involve a regulator to check and modify the policies that may violate the constraints. Such a regulator can be a single projection operator [7,8] that handles linear state and input constraints, a predictive safety filter [9] for nonlinear state and input constraints, or it can contain control barrier functions to guarantee safety [10,11]. These indirect methods, however, sometimes fail to capture the optimal solution of the constrained problem and lack stability guarantees. In comparison, policy-optimization-based methods intend to directly obtain the optimal value function for the constrained problem by solving the constrained Hamilton-Jacobi-Bellman (HJB) equation. With the optimal value function available, the optimal control policy can then be produced by solving a constrained policy optimization problem. [12] was the first investigation of this kind of method, although it was limited to searching for the best linear feedback control law. The authors in [13] focused on using deep neural networks (NNs) to estimate the value function and the optimal policy for a nonlinear control problem with state constraints.
It should be mentioned that in constrained cases, neither the optimal policy nor the optimal value function is readily available, even for the most basic infinite-horizon linear quadratic regulation (LQR) problems. This to some extent restricts the development of policy-optimization-based RL methods for constrained control problems. In comparison, model predictive control (MPC) [14], an optimization-based control scheme widely adopted in the control community, has a mature stability and robustness theory as well as inherent constraint handling. MPC is distinguished by its enlargement of the horizon in the online policy optimization problem. From RL's point of view, using multi-step policy optimization diminishes the need for exact knowledge of the cost-to-go function [15]. For infinite-horizon LQR problems, MPC can render the exact optimal control law due to the equivalence of finite-horizon and infinite-horizon optimal control as long as the horizon is sufficiently long [16]. With this fundamental property, infinite-horizon LQR problems can be solved by either implementing MPC online or explicit MPC [17], and the solution is proven to be piecewise affine (PWA) in the state space [17]. However, the computational complexity of online MPC and the storage requirements of explicit MPC grow dramatically with the increase of the horizon. This is one of the main drawbacks of MPC compared with RL or ADP [18].
Dedicated to overcoming this drawback, approximation methods for predictive control laws have received much attention in recent years. An emerging methodology uses specific function regression techniques such as polynomial approximators [19] and PWA neural networks [8,20,21]. PWA neural networks, or more specifically, neural networks with rectified linear units, can represent PWA functions [22], which provides the opportunity to use them to accurately fit predictive control laws. In [20,21], this approach is shown to alleviate storage demands. Nevertheless, no guarantees of closed-loop stability are conveniently available, even though the final approximation error is small. Using NN-based controllers to warm-start the solver in online MPC [23] can inherently guarantee stability, and learning Lyapunov functions to verify the stability of NN-based controllers [24] is an alternative. However, as additional computation is required in the optimization or learning procedure, the advantage of lower computational complexity brought by approximation methods is less evident.

Novelties and contributions
Based on these observations, we aim to benefit from both MPC and RL (ADP) to attain a computationally inexpensive control scheme for linear systems with stability and feasibility guarantees. We focus on the infinite-horizon LQR problem with state and input constraints. Note that, usually, the optimal control law can be steep in some regions, while the value function is continuous and convex [18]. So, in general, the value function may be easier to approximate accurately than the control law. Different from research on approximating the MPC policy [8,19-21], we propose a value function approximation scheme. The optimal value function, which has been extensively characterized in explicit MPC, is approximated by a piecewise quadratic (PWQ) neural network. The synthesis of the control law is carried out in a DP problem, where stability, feasibility, and sub-optimality can be guaranteed if a good approximation is obtained. Meanwhile, note that one disadvantage of approximating the value function is that it requires online policy optimization, which is not the case in policy approximation methods [8,19-21]. To address this issue, we develop algorithms so that the online optimization can be done efficiently, as will be shown in Section 3.
The contributions of the paper are highlighted as follows: (1) We present a new approach to approximate MPC for linear systems based on value function approximation and DP. Different from the policy-based approximation approaches, the proposed approach achieves suboptimal control performance with stability and feasibility guarantees.
(2) To obtain the MPC value function, we design a PWQ neural network that can fully capture the PWQ and convex properties of the value function. As a result, better approximation quality can be realized with a simple network structure.
(3) We develop a policy optimization algorithm to solve the ADP problem efficiently. In particular, the nonlinear dynamic program is simplified to a quadratic program (QP) in which decision variables and constraints are much fewer than those in traditional MPC problems.
(4) Compared to [12], the first exploration of ADP in constrained LQR setting, the proposed approach eliminates the restriction of searching for a linear feedback law and does not require an initially stabilizing policy or an initial state belonging to an invariant set.

Preliminaries
Notations: The identity matrix with dimension n × n is written as I_n. The set N denotes the set of nonnegative integers (i.e., it contains zero). The symbol ∇f(·) is used to represent the gradient of a function f(·). In addition, λ_max(P) and λ_min(P) represent the maximum and minimum eigenvalues of a symmetric positive definite matrix P, respectively. The boundary of a set P is ∂P, and int(P) stands for the interior of P. We use A_{i,·} to represent the ith row of the matrix A.

Problem formulation
We study the infinite-horizon constrained linear quadratic regulation (CLQR) problem
J*_∞(x_0) = min_{U_∞} Σ_{k=0}^{∞} (x_k^T Q x_k + u_k^T R u_k)
s.t. x_{k+1} = A x_k + B u_k, x_k ∈ X, u_k ∈ U, k = 0, 1, . . . ,   (1)
where x_k ∈ R^n, u_k ∈ R^m, A ∈ R^{n×n}, B ∈ R^{n×m}, Q ⪰ 0 and R ≻ 0 are weighting matrices, X and U are polyhedra that contain the origin in their interior, and U_∞ = {u_k}_{k=0}^{∞} is the infinite-dimensional decision variable. As in other papers [8,25], the following is assumed.
Assumption A1: (A, B) is stabilizable and (Q^{1/2}, A) is observable. Moreover, for any possible initial state x, there exists a sequence of admissible input vectors u_0, u_1, . . . that stabilizes the state to the origin, i.e., x belongs to the set X̄ of initial states from which the origin can be reached while the constraints are respected. In some existing literature [14,25], X̄ is also called the maximal stabilizable set.
With Assumption A1, let K ∈ R^{m×n} be a stabilizing gain matrix for the unconstrained plant x_{k+1} = Ax_k + Bu_k. As in [16], we consider the admissible set of states under a given stabilizing control gain K, X̄_K = {x ∈ R^n | x ∈ X, −Kx ∈ U}.
If the constraints in the CLQR problem (1) are removed, the problem admits a unique linear solution u_k^LQR = −K* x_k, k = 0, 1, . . ., where K* = (R + B^T P* B)^{-1} B^T P* A and P* = P*^T is the unique positive-definite solution of the algebraic Riccati equation [5]
P* = A^T P* A − A^T P* B (R + B^T P* B)^{-1} B^T P* A + Q.   (2)
In the constrained case, problem (1) is not as easy to solve as the unconstrained one due to the infinite number of decision variables. However, if the state is close to the origin, the unconstrained LQR solution u_k^LQR = −K* x_k, k = 0, 1, . . . may not violate the constraints, so that the solution to problem (1) is identical to the unconstrained LQR solution −K* x_k, as if there were no constraints at all. This motivates researchers to consider the positively invariant set O_∞^LQR, i.e., the largest set of states for which the closed-loop trajectory under u_k = −K* x_k remains admissible for all future time. With this definition, the existing literature [16,26,27] considers using a finite-horizon problem
J*_N(x) = min_U x_N^T P* x_N + Σ_{k=0}^{N−1} (x_k^T Q x_k + u_k^T R u_k)
s.t. x_{k+1} = A x_k + B u_k, x_k ∈ X, u_k ∈ U, x_0 = x,   (3)
to approximate the infinite-horizon LQR problem (1). The rationale behind this design is that, after a sufficiently long horizon, the resulting x_N will fall into O_∞^LQR, where the unconstrained LQR solution u_N = −K* x_N does not violate the constraints. Since O_∞^LQR is positively invariant, the subsequent trajectory x_{N+1}, x_{N+2}, . . . will always stay in O_∞^LQR, and the constraints will automatically be satisfied with the control gain K*. As a result, the terminal cost x_N^T P* x_N is an exact representation of J*_∞(x_N). The following theorem, proposed in [16], formally demonstrates the validity of this design.
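As an illustration of the quantities in (2), the following minimal Python sketch computes P* and K* for a placeholder system; the matrices A, B, Q, R below are illustrative and not the ones used later in the case studies.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Placeholder system and weights (any stabilizable (A, B) with Q >= 0, R > 0 works).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = 0.1 * np.eye(1)

# P* is the stabilizing solution of the discrete-time algebraic Riccati equation (2).
P_star = solve_discrete_are(A, B, Q, R)

# Unconstrained LQR gain: K* = (R + B^T P* B)^{-1} B^T P* A, so that u = -K* x.
K_star = np.linalg.solve(R + B.T @ P_star @ B, B.T @ P_star @ A)

# The closed-loop matrix A - B K* should be Schur stable (eigenvalues inside the unit circle).
print(np.abs(np.linalg.eigvals(A - B @ K_star)))
```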

Theorem 2 [16] Let J̄ be an upper bound on J*_∞(x) over X_0, and let p_m and q_m be the constants defined in [16]. If the horizon N satisfies N ≥ (J̄ − p_m)/q_m, then the solution of the finite-horizon problem (3) coincides with that of the infinite-horizon problem (1) for all x ∈ X_0, i.e., the resulting terminal state satisfies x_N ∈ O_∞^LQR.
According to Theorem 2, the horizon N is essential for the equivalence of the finite-horizon and infinite-horizon problems. For a given x, one can repeatedly solve problem (3) with an increasing N until the terminal state falls into O_∞^LQR [26]. This approach, however, only works for one specific state value. Alternatively, [16] gives a conservative estimate of the horizon for all x belonging to a polyhedral set X_0 based on the result in Theorem 2. Specifically, [16] computes p_m, q_m, and J*_∞(·) at the vertices of X_0. Next, J̄ is set to the maximum of all these J*_∞(·), and N is selected as the smallest integer such that N ≥ (J̄ − p_m)/q_m.
It is not necessary to compute O_∞^LQR explicitly. A more mathematically convenient routine is to compute an ellipsoidal subset of O_∞^LQR [16]. This approach, on the other hand, may lead to a conservative choice of N.
With the equivalence in Theorem 2 established, we can regulate the state to minimize the infinite-horizon cost by solving problem (3). Problem (3) can be solved in a receding-horizon manner. Specifically, at each time step t, t ∈ N, x in (3) is set to the current state x_t, and only the first m elements of the corresponding solution, i.e., the first input of the optimal sequence, are applied to the system.

Explicit MPC
By substituting the state update equation into J*_N(x) and the constraints, (3) can be reformulated as a multi-parametric quadratic program (mpQP) that only depends on the current state x:
J*_N(x) = min_U (1/2) U^T H U + x^T F U + (1/2) x^T Y x
s.t. G U ≤ w + S x,   (4)
where the matrices H, F, Y, G, w, S can be readily obtained through matrix multiplication operations, and U = {u_k}_{k=0}^{N−1} is the finite-dimensional decision variable. Problem (4) can be solved online or offline. The online implementation requires solving a quadratic program (QP), with x in (4) fixed to the current state, at each time step. Although efficient QP solvers based on interior-point and active-set methods are applicable, the computational complexity is cubic in the horizon N and in the dimension m of the input vector. In comparison, the offline method, also called explicit MPC, aims to directly solve the mpQP in (4) for all possible x in a feasible set and to find the relationship between the optimizer U*(x) and x.
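The condensing step that produces the cost matrices in (4) can be sketched as follows; this is a standard construction under the setting above, all names are illustrative, and the analogous stacking of the constraints into G, w, S is omitted for brevity.

```python
import numpy as np

def condense_mpc(A, B, Q, R, P_star, N):
    """Build the condensed cost matrices of the mpQP (4):
    J_N(x) = 1/2 U^T H U + x^T F U + const, with U = [u_0; ...; u_{N-1}]."""
    n, m = B.shape
    # Prediction matrices: [x_1; ...; x_N] = Phi x_0 + Gamma U
    Phi = np.vstack([np.linalg.matrix_power(A, k) for k in range(1, N + 1)])
    Gamma = np.zeros((N * n, N * m))
    for k in range(1, N + 1):
        for j in range(k):
            Gamma[(k - 1) * n:k * n, j * m:(j + 1) * m] = \
                np.linalg.matrix_power(A, k - 1 - j) @ B
    Qbar = np.kron(np.eye(N), Q)
    Qbar[-n:, -n:] = P_star          # terminal cost x_N^T P* x_N
    Rbar = np.kron(np.eye(N), R)
    H = 2.0 * (Gamma.T @ Qbar @ Gamma + Rbar)
    F = 2.0 * Phi.T @ Qbar @ Gamma
    return H, F, Phi, Gamma
```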
The results from explicit MPC show that the optimizer U * (x) and the optimal MPC control law are PWA. Therefore, the solution to the infinite-horizon problem (1) is also PWA. The following theorem, proposed in [17], summarizes the properties of the explicit solution to the CLQR problem.
Theorem 3 [17] With Assumption A1 and the equivalence in Theorem 2 satisfied, the state feedback solution to the CLQR problem (1) in a compact polyhedral set of the initial conditions X_0 ⊆ X̄ is time-invariant, continuous, and PWA:
u*(x) = F_j x + g_j if x ∈ R_j, j = 1, . . . , N_r,   (5)
where the polyhedral sets R_j = {x ∈ R^n : H_j x ≤ h_j}, j = 1, . . . , N_r, constitute a finite partition of X_0.
To implement the PWA feedback law (5), the direct way is to store the matrices H_j, h_j, F_j, g_j that define all the polyhedra R_j and the corresponding affine feedback laws, perform an online search to locate the polyhedron that contains x, and finally apply the corresponding affine feedback law to the system.
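A minimal sketch of this lookup-based implementation is given below, assuming the regions and gains have been stored as a list of tuples (H_j, h_j, F_j, g_j); the function name is illustrative.

```python
import numpy as np

def explicit_mpc_control(x, regions):
    """Evaluate the stored explicit PWA law (5): find the region R_j = {x : H_j x <= h_j}
    that contains x and return u = F_j x + g_j. `regions` is a list of (H_j, h_j, F_j, g_j)."""
    for H_j, h_j, F_j, g_j in regions:
        if np.all(H_j @ x <= h_j + 1e-9):   # small tolerance for points on the boundary
            return F_j @ x + g_j
    raise ValueError("x lies outside the stored partition (infeasible or outside X_0)")
```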
The explicit controller is easy to implement offline in the case of a short horizon and a low-dimensional input vector. However, with the increase of the horizon and the input's dimension, the number of regions in (5) grows exponentially [17], which may make the offline computation almost intractable.
It has been verified that artificial neural networks with at least one hidden layer have the capability of universal approximation of any continuous function [28], provided that the hidden layer is given enough units. Hence, one direct intuition is that we can use neural networks to represent the explicit MPC law, without any need to identify the regions in (5). With the knowledge of the PWA form of the MPC law, one can construct specific neural architectures to achieve exact and computationally inexpensive approximation. Related work has been reported in [8,20,29], combined with supervised learning or policy gradient methods. The neural-network-based controllers proposed in these papers require an online feasibility certificate to determine whether the outputs of the neural networks are safe. Usually, a feasibility recovery technique based on projection is used to regulate the control inputs. These neural-network-based policy approximators, on the other hand, inherently lack stability guarantees. The authors in [23] propose an explicit-implicit MPC scheme, where neural networks provide a control input initialization and a primal active-set method is executed to solve a QP online to ensure feasibility and asymptotic stability. However, this explicit-implicit MPC scheme needs large-scale neural networks whose outputs are supposed to contain all primal and dual variables in (4), and the online computational burden still exists since the decision variable in the QP is still U.
The optimal value function, on the other hand, is a scalar, positive, and PWQ function [17]. If we could get an approximation Ĵ(·) of the optimal value function, the control input could thereby be generated through a dynamic programming approach, which can rigorously guarantee recursive feasibility and stability [2]. The approach of combining value function approximation and dynamic programming is called approximate dynamic programming (ADP), which has been widely studied for optimal control of unconstrained linear and nonlinear systems [2]. However, in constrained cases, even the most basic LQR problem becomes challenging from the ADP point of view. A relevant paper [12], published in 2019, simplifies the problem by searching for a quadratic approximation of the value function, and thus obtains a linear feedback law.
Before establishing our approximation structure, we concentrate on the properties of the optimal value function J * ∞ (·).
Theorem 4 [14] With Assumption A1 and the equivalence in Theorem 2 satisfied, in a compact polyhedral set of the initial conditions X_0 ⊆ X̄, the optimal value function J*_∞(·) obtained from J*_N(·) is continuous, convex, and PWQ over polyhedra:
J*_∞(x) = x^T P_j x + q_j^T x + c_j if x ∈ R_j, j = 1, . . . , N_r.   (6)

Theorem 5 [30] Assume that the infinite-horizon problem (1) results in a non-degenerate mpQP. Let R_i, R_j be two neighboring polyhedra and A_i, A_j be the corresponding sets of active constraints at the optimum of (4), i.e., A_i = {l : G_{l,·} U*(x) = w_l + S_{l,·} x, ∀x ∈ R_i}. Then J*_∞(·) is continuously differentiable on X_0.

Problem (4) is said to be (primal) degenerate if G_{A_j,·} is not full row rank in some regions R_j. Note that some CLQR problems may result in degenerate mpQPs, which means that continuous differentiability of J*_∞(·) may not always hold. For example, degeneracy could happen when more than Nm constraints in (4) are active at the optimizer U*(x). Nevertheless, degeneracy does not necessarily cause the loss of continuous differentiability of the value function [30]. Also notice that, currently, there are no analytical results on how to identify a class of control problems that ensure non-degeneracy in advance. Adding a terminal constraint may make (4) become degenerate [30]. If degeneracy is detected when solving (4), feasible methods to avoid it include removing redundant constraints from GU ≤ w + Sx [31] or slightly tuning the weight matrices Q and R.

Design of neural network for approximation in value space
According to Theorems 4 and 5, the value function approximator, denoted by Ĵ(·, θ) where θ refers to the parameters, is expected to have the following features:
• (F1) It can partition its input space into polyhedral regions.
• (F2) It can produce a convex and PWQ function partitioned by polyhedra.
• (F3) For x in a small region containing the origin, the approximator can provide the exact representation of the value function, i.e., Ĵ(x, θ) = x^T P* x.
We intend to use a feed-forward neural network to capture the relationship between the state x and the value J*_∞(x) of the optimal value function. A feed-forward neural network is composed of one or more hidden layers and an output layer, where each hidden layer contains an affine map
z_l = W_l κ_{l−1} + b_l,   (8)
followed by a nonlinear map
κ_l = g_l(z_l).   (9)
In (8) and (9), M is the width of each layer, referring to the number of units in each layer, κ_{l−1} ∈ R^M is the output of the previous layer, with κ_0 the input of the network, W_l ∈ R^{M×M} and b_l ∈ R^M are the weights and biases, respectively, and g_l(·) : R^M → R^M is a nonlinear activation function that is usually applied element-wise to its input.
In the output layer, the outputs of the last hidden layer are linearly combined with a weight W_{L+1} ∈ R^{1×M} to produce the final output of the network:
W_{L+1} κ_L.   (10)
Based on these definitions, a neural network with L hidden layers and M units in each layer can be represented by
(W_{L+1}) ◦ (g_L ◦ W_L) ◦ · · · ◦ (g_1 ◦ W_1),   (11)
where θ contains all the weights and biases of the affine functions in all the layers, and the symbol ◦ means that the layers are connected in series.
The activation function plays an important role in the approximation capability of a neural network. In this paper, we consider a popular activation function, the rectified linear unit (ReLU), which is defined as f_ReLU(x) = max{0, x}. Here, max(0, x) computes the element-wise maximum between the vector x and a zero vector of the same dimension. A remarkable property of the ReLU is that, combined with an affine transformation (8), it can produce a series of PWA functions with polyhedral partitions [22]. In fact, the output of a neural network with ReLUs as activation functions is PWA. Based on these observations, [8,20,29] contemplate using ReLU neural networks to represent the explicit MPC law.
Now we know that the optimal value function is PWQ, and we are interested in producing a class of PWQ basis functions that can efficiently represent it. Let us focus on the last hidden layer in (11). All activation units in this layer can be written as continuous PWA functions over the input space of the network [32]. The product of any two continuous PWA functions is a continuous PWQ function. This indicates that such a product can be a possible basis to capture the features of the MPC optimal value function. We are now in a position to state our design. We calculate the element-wise product ⊙ of the units κ_L in the last hidden layer with themselves to generate a series of PWQ functions
p(κ_L) = κ_L ⊙ κ_L.   (12)
This calculation can be viewed as a layer in the network, denoted by the product layer p(κ_L).
Finally, the output layer is a simple weighted sum of the outputs of the product layer:
r^T p(κ_L),   (13)
where r = (r_1, r_2, . . . , r_M)^T ∈ R^M is the weight vector of the output layer.
To make the proposed approximator satisfy F3, we adopt a "local-global" architecture [33], which decomposes the output of the approximator into two parts:
Ĵ(x, θ) = x^T P* x + r^T p(κ_L(x)),   (14)
with P* being the solution to the ARE (2). The term x^T P* x is included to capture the "global" aspects of J*_∞(·), while the neural-network-based term is exploited to identify the polyhedral partition in (6) and to capture the local residual J*_∞(x) − x^T P* x. Since the known term x^T P* x, which dominates the value function, is extracted and fixed, such a "local-global" architecture makes it possible to enhance the quality of the approximation.
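For concreteness, a minimal sketch of the forward pass of the local-global approximator (14), with a single ReLU hidden layer, is given below; the function name and arguments are illustrative.

```python
import numpy as np

def pwq_value(x, P_star, W, b, r):
    """Forward pass of the local-global PWQ approximator (14):
    J_hat(x, theta) = x^T P* x + sum_i r_i * max(0, W_i x + b_i)^2.
    W has shape (M, n); b and r have shape (M,); r >= 0 keeps the function convex."""
    kappa = np.maximum(0.0, W @ x + b)      # single ReLU hidden layer
    return float(x @ P_star @ x + r @ (kappa * kappa))
```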
The reason why we use only one hidden layer is twofold. First, by using one hidden layer, it is possible to construct a convex Ĵ(·, θ), which is desirable since the real optimal value function is convex. This property will be discussed in the next section. More importantly, using one hidden layer provides the opportunity to solve the dynamic programming problem through simple QP methods. It should be noticed that in our case the dynamic programming problem is a nonlinear program, since the objective function is PWQ. With one hidden layer, we can explicitly compute the coefficients of Ĵ(·, θ) by extracting the activated units in the hidden layer. The online dynamic programming can then be conducted by solving a sequence of quadratic programs, without the need to calculate any gradients or Hessians.

Network training and convexity analysis
So far, we have finished the design of the architecture of the proposed neural network. This subsection discusses the optimization strategy over the parameters in θ in order to approximate the optimal value function.
Firstly, training data needs to be generated offline by solving the MPC problem (3) for N_x sampled initial states x^(i), i = 1, . . . , N_x. Let {u_k^(i)*}_{k=0}^{N−1} and {x_k^(i)*}_{k=0}^{N} denote the solution to the MPC problem for the initial state x^(i) and the corresponding trajectory of the closed-loop controlled system. Existing research [20,29] on the approximation of MPC only uses the initial states {x^(i)}_{i=1}^{N_x} as the input training data, which means that solving (3) N_x times only generates N_x training pairs.
In view of this, we present a more efficient strategy for data generation by leveraging the equivalence between the finite-horizon and the infinite-horizon problems. Suppose that we have obtained the optimal control sequence {u_k^(i)*}_{k=0}^{N−1} and the corresponding optimal value J*_N(x^(i)) for different x^(i), and consider the sub-problems whereby we start at x_k^(i)*, k = 1, . . . , N − 1. According to the principle of optimality [15], the truncated optimal control sequence {u_j^(i)*}_{j=k}^{N−1} is also optimal for these sub-problems. As a result, the optimal values for the subsequent states x_k^(i)* are obtained by subtracting the accumulated stage costs from J*_N(x^(i)), with no need to solve (3) repeatedly. This yields the training data set
{(x_k^(i)*, J*_∞(x_k^(i)*)) : i = 1, . . . , N_x, k = 0, . . . , N − 1}.   (15)
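A minimal sketch of this tail-cost construction, assuming the optimal trajectory and optimal value for one initial state are already available, is given below; the function name is illustrative.

```python
import numpy as np

def tail_values(x_traj, u_traj, J_opt, Q, R):
    """Given the optimal trajectory {x_k*}, {u_k*} and the optimal value J*_N(x^(i)),
    return the state-value training pairs (x_k*, J*_inf(x_k*)) for k = 0,...,N-1, using
    the principle of optimality: the tail cost equals J* minus the accumulated stage cost."""
    pairs, accumulated = [], 0.0
    for x_k, u_k in zip(x_traj, u_traj):
        pairs.append((x_k, J_opt - accumulated))
        accumulated += float(x_k @ Q @ x_k + u_k @ R @ u_k)
    return pairs
```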
The subsequent states {x_k^(i)*}_{k=0}^{N} are prone to congregating near the origin if the initial state x^(i) is chosen too close to the origin. Therefore, to make the training data cover the whole region X_0 as well as possible, the initial states x^(i), i = 1, . . . , N_x are recommended to be placed near the boundary of X_0.
With the N_x N state-value pairs available, the neural network is trained so that its parameters approximate the solution to the following problem:
min_θ Σ_{i,k} (Ĵ(x_k^(i)*, θ) − J*_∞(x_k^(i)*))², subject to constraints on (W, b) that keep all hidden units inactive for x near the origin, and r ≥ 0,   (16)
where each squared term is the square of the approximation error for one training pair. The constraint on (W, b) is introduced to guarantee that no units in the hidden layer are activated when x is near the origin, i.e., to fulfill F3, while the constraint r ≥ 0 is responsible for maintaining convexity of Ĵ(·, θ).
Problem (16) is a nonlinear least-squares problem, which can be solved by the gradient descent method [34] with a learning rate α > 0. In each iteration s, s = 1, 2, . . ., θ is updated along the negative gradient direction and then projected onto the feasible set of (16).
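A minimal sketch of one such projected update is given below; here the projection only enforces r ≥ 0, and the additional constraint that keeps the hidden units inactive near the origin is omitted for brevity. The gradients are assumed to have been computed beforehand (e.g., by automatic differentiation or by hand from (14)).

```python
import numpy as np

def projected_gradient_step(theta, grad, alpha):
    """One projected gradient-descent update for problem (16): take a gradient step on all
    parameters, then project r onto the nonnegative orthant to preserve convexity of J_hat.
    theta = (W, b, r); grad holds the corresponding gradients of the least-squares loss."""
    W, b, r = theta
    dW, db, dr = grad
    W_new = W - alpha * dW
    b_new = b - alpha * db
    r_new = np.maximum(0.0, r - alpha * dr)   # projection onto r >= 0
    return W_new, b_new, r_new
```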
After the neural network is trained, the system can be run and the control signals are computed by solving a dynamic programming problem. It is desirable that the objective of this problem be convex, so that any locally optimal point is also globally optimal. It should be noticed that in most cases a neural network cannot provide a function that is convex w.r.t. its inputs, even though it uses convex activation functions. Thanks to its specific structure, our proposed neural network, on the other hand, can produce a convex function Ĵ(·, θ).
Theorem 6 If r ≥ 0, the approximator Ĵ(·, θ) defined in (14) is convex and continuously differentiable.

To see this, note that the output of the proposed approximator has the following PWQ form:
Ĵ(x, θ) = x^T P̂_j x + q̂_j^T x + ĉ_j if x ∈ R̂_j, j = 1, . . . , N̂_r,   (17)
where the R̂_j are polyhedra whose boundaries are defined by the hyperplanes {W_{l,·} x + b_l = 0}_{l=1}^{M}. Define R̂ = diag(r); then we can rewrite (14) as
Ĵ(x, θ) = x^T P* x + κ^T R̂ κ,   (18)
where κ, the output of the hidden layer, is PWA w.r.t. x. Therefore, by applying the chain rule for the second derivative, the Hessian ∇²_x Ĵ is positive definite in the interior of any R̂_j if the weight matrix satisfies R̂ ≥ 0, i.e., r_i ≥ 0 for i = 1, . . . , M. In the following, we will show that the gradient descent method with projection guarantees the positive semi-definiteness of R̂.
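For completeness, the Hessian computation inside a fixed region can be written compactly as follows; this is a sketch consistent with (14) and (18), with Ā denoting the set of units active on the region.

```latex
% Inside a region \hat{R}_j the activation pattern \bar{\mathcal{A}} is fixed, so
\hat{J}(x,\theta) = x^{\top} P^{*} x
  + \sum_{i \in \bar{\mathcal{A}}} r_i \bigl(W_{i,\cdot}\, x + b_i\bigr)^{2},
\qquad
\nabla_x^{2} \hat{J}(x,\theta)
  = 2 P^{*} + 2 \sum_{i \in \bar{\mathcal{A}}} r_i \, W_{i,\cdot}^{\top} W_{i,\cdot}
  \succ 0 \quad \text{whenever } r_i \ge 0 \ \forall i .
```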
At the boundary of any two neighboring regions R̂_i and R̂_j, without loss of generality, suppose that R̂_i and R̂_j are separated by the hyperplane W_{l,·} x + b_l = 0, so that the lth ReLU unit is inactive on R̂_i and active on R̂_j. The two quadratic pieces then differ only by the term r_l (W_{l,·} x + b_l)², which vanishes on the separating hyperplane:
x^T P̂_j x + q̂_j^T x + ĉ_j = x^T P̂_i x + q̂_i^T x + ĉ_i + r_l (W_{l,·} x + b_l)².   (19)
Differentiating both sides of (19), and noting that the gradient of r_l (W_{l,·} x + b_l)², namely 2 r_l (W_{l,·} x + b_l) W_{l,·}^T, also vanishes on the hyperplane, yields
∇Ĵ(x, θ)|_{R̂_i} = ∇Ĵ(x, θ)|_{R̂_j} on the boundary,   (20)
which proves continuous differentiability of Ĵ(x, θ) at the boundary.
Furthermore, since Ĵ(x, θ) is differentiable, convexity of Ĵ(x, θ) at the boundary can be checked through first-order conditions [35]. Formally speaking, Ĵ(x, θ) is convex if and only if
Ĵ(x_1, θ) ≥ Ĵ(x_2, θ) + ∇^T Ĵ(x_2, θ)(x_1 − x_2)   (21)
holds for all x_1, x_2 belonging to the domain of Ĵ(x, θ).
As convexity of Ĵ(x, θ) in the interior of any R̂_j, j = 1, . . . , N̂_r has been verified, we only need to focus on the boundary. Without loss of generality, let x_1 ∈ R̂_i and x_2 ∈ R̂_j. Since Ĵ(·, θ) is convex on each region and, by (19) and (20), its value and gradient are continuous across the common boundary, the first-order inequality (21) also holds for such x_1 and x_2, which demonstrates that Ĵ(·, θ) satisfies (21) at the boundary. This completes the proof of Theorem 6. ✷

Remark 7
In [12], a quadratic value function approximator is proposed and updated online. In contrast, our method retains the PWQ property of the real value function and moves the training process offline. As a result, the proposed control framework can more accurately approximate the optimal control law. Besides, [12] assumes that the initial state lies in an invariant set, while our method relaxes this restriction. In particular, the domain of attraction under the proposed control law can be equal to the maximal stabilizable set X̄.

Suboptimal control law based on dynamic programming
With a well-fitted Ĵ(·, θ) available, at each time step t, t ∈ N, we can obtain a suboptimal control policy by solving the following one-step dynamic programming problem:
û*_t = arg min_{u_t} Q̂(x_t, u_t) ≜ x_t^T Q x_t + u_t^T R u_t + Ĵ(A x_t + B u_t, θ)
s.t. u_t ∈ U, A x_t + B u_t ∈ C,   (23)
where Q̂(x_t, u_t) can be viewed as the approximated optimal Q-function [15] for problem (1). We denote the solution to (23) by û*_t and use the subscript (·)_t to indicate that problem (23) is solved online at each time step t. In problem (23), C is chosen as X̄ or R^n, depending on whether X̄ is computable. In particular, if the iterative algorithm (Algorithm 10.3) in [14] does not terminate in finite time, we drop the constraint on A x_t + B u_t. This could happen, e.g., when there is no state constraint in (1). Besides, in both cases problem (23) is recursively feasible since X̄ is a control invariant set (CIS) [14]. Moreover, it should be noted that some approximation methods for predictive control laws specify C as the maximal CIS C_∞ [8] or an arbitrary polytopic CIS [20], which may make the approximated controllers non-stabilizing, since usually X̄ ⊆ C_∞ and the set C_∞ \ X̄ contains initial states that cannot be steered to the origin. The dynamic programming solution in (23), on the other hand, can prevent x_t from falling into C_∞ \ X̄, as we will prove that Ĵ(·, θ) can serve as a Lyapunov function for the closed-loop system.
Let m_c and l_c denote the minimum numbers of linear inequalities that define U and C, respectively. Note that problem (23) contains m decision variables and m_c + l_c constraints, both of which are far fewer than those in problem (4). This is one of the main benefits of ADP or RL. Nevertheless, the objective function in (23) is nonlinear and contains a neural network. Standard solvers such as the ellipsoid algorithm or interior-point algorithms require the computation of the Hessian ∇²_u Q̂(x, u) or the gradient ∇_u Q̂(x, u) in each iteration. Such computation can only be carried out by visiting all units in the hidden layer and extracting the activated ones. The advantage of low computational complexity brought by dynamic programming would then inevitably diminish.
In view of this, we intend to avoid frequently calculating ∇²_u Q̂(x, u) and ∇_u Q̂(x, u) by decomposing the Q-function Q̂(x_t, u_t) into quadratic pieces. In particular, we develop two optimization algorithms in which problem (23) is reduced to a QP in each iteration. For a given x_t, t ∈ N, consider the set of activated ReLU units in Ĵ(A x_t + B u, θ):
Ā(u) = {i ∈ {1, . . . , M} : W_{i,·}(A x_t + B u) + b_i > 0},   (24)
where M is the width of the hidden layer. With (24), Q̂(x_t, u) can be computed as
Q̂(x_t, u) = u^T P̂(Ā(u)) u + q̂^T(Ā(u)) u + const,   (25)
where the coefficients P̂(Ā(u)) and q̂^T(Ā(u)) are obtained by substituting z = A x_t + B u into (14), fixing the activation pattern Ā(u), and expanding the resulting quadratic form (26). Since P̂(Ā(u)) > 0, the right-hand side of (25) is a convex quadratic function if Ā(u) is fixed. An algorithm that can cope with general piecewise convex programs (PCP) is proposed in [36]. We adapt it to solving problem (23) and summarize it in Algorithm 1.
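Before turning to the algorithms, a minimal numpy sketch of (24)-(26) is given below; the coefficient formulas follow from substituting z = Ax_t + Bu into (14) and expanding with the activation pattern frozen, and all function names are illustrative.

```python
import numpy as np

def active_set(x_t, u, A, B, W, b):
    """Set of activated ReLU units (24) of J_hat evaluated at z = A x_t + B u."""
    z = A @ x_t + B @ u
    return np.flatnonzero(W @ z + b > 0.0)

def qp_coefficients(x_t, act, A, B, W, b, r, P_star, R):
    """For a fixed activation set `act`, Q_hat(x_t, u) is the convex quadratic
    u^T P_hat u + q_hat^T u + const in (25); the coefficients below follow from
    expanding J_hat(A x_t + B u, theta) with the activation pattern frozen (cf. (26))."""
    W_a, b_a, r_a = W[act], b[act], r[act]
    M_a = P_star + W_a.T @ (r_a[:, None] * W_a)        # P* + sum_i r_i W_i^T W_i
    P_hat = R + B.T @ M_a @ B
    q_hat = 2.0 * B.T @ (M_a @ (A @ x_t) + W_a.T @ (r_a * b_a))
    return P_hat, q_hat
```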
Algorithm 1 is a modified version of the PCP algorithm in [36]. In each iteration, Algorithm 1 solves two auxiliary QPs and terminates if the minimizers d^(s+1) and u^(s+1) are identical. The parameter ε > 0 is a tolerance: if ‖d^(s+1) − u^(s+1)‖ ≤ ε is recorded, we consider the two problems in Algorithm 1 to admit the same optimal point. It is desirable that d^(s_m) or u^(s_m) be the exact solution to problem (23) and that Algorithm 1 stop in finite time. [36] provides these guarantees, which are summarized in the following theorem.
Algorithm 1 PCP algorithm for solving the piecewise quadratic program (23) [36]
Input: State x_t at time step t, input u_{t−1} at the last time step t − 1 (if t > 0), the approximate Q-function Q̂(·, ·), C
Output: Control input u_t that will be applied to the system
Initialize a starting point u^(1) ← u_{t−1}
Initialize a set U_1 ← {u | u ∈ U, A x_t + B u ∈ C}
for s = 1, 2, . . . do
  Update the set of activated units Ā(u^(s)) at u^(s) by (24), and compute the coefficients P̂(Ā(u^(s))) and q̂^T(Ā(u^(s))) associated with Ā(u^(s)) through (26)
  Solve the two auxiliary QPs of [36] to obtain d^(s+1) and u^(s+1) ← arg min_{u ∈ U_s} u^T P̂(Ā(u^(s))) u + q̂^T(Ā(u^(s))) u
  if ‖d^(s+1) − u^(s+1)‖ ≤ ε then set s_m ← s + 1, return u_t ← u^(s_m), break; else update U_{s+1}
end for

As shown in [36], using Algorithm 1 to solve piecewise quadratic programs can be effective, since it only needs to recompute ∇²_u Q̂(x, u) and ∇_u Q̂(x, u) when s is updated. On the other hand, Algorithm 1 is not the ideal choice for solving our DP problem, because its auxiliary QPs contain numerous constraints if the neural network has a large number of hidden units. This motivates us to consider the following design. In each iteration s, s ∈ N_+, let u^(s) be an initial input (for s = 1) or the input calculated in the previous iteration (for s > 1). We compute the set of activated units Ā(u^(s)) at u^(s) and thereby obtain P̂(Ā(u^(s))) and q̂^T(Ā(u^(s))) from (26). Then, we solve the following QP:
u^(s+1) = arg min_u u^T P̂(Ā(u^(s))) u + q̂^T(Ā(u^(s))) u
s.t. u ∈ U, A x_t + B u ∈ C,   (27)
which returns u^(s+1) for the next iteration. In the next iteration, after computing Ā(u^(s+1)), we can terminate and output u^(s+1) if Ā(u^(s+1)) = Ā(u^(s)). If a cycle occurs, i.e., Ā(u^(s+1)) = Ā(u^(k)) for some k ∈ {1, . . . , s − 1}, we arbitrarily choose another Ā(u^(s+1)) that has not been involved in previous iterations. The proposed method for obtaining the solution to (23) is summarized in Algorithm 2.
Algorithm 2 Decomposition algorithm for solving the piecewise quadratic program (23)
Input: State x_t at time step t, input u_{t−1} at the last time step t − 1 (if t > 0), the approximate Q-function Q̂(·, ·), C
Output: Control input u_t that will be applied to the system
Initialize a starting point u^(1) ← u_{t−1}
for s = 1, 2, . . . do
  Update the set of activated units Ā(u^(s)) by (24)
  if s > 1 and Ā(u^(s)) = Ā(u^(s−1)) then
    Set s_m ← s, return u_t ← u^(s_m), break
  else
    if Ā(u^(s)) = Ā(u^(k)) for some k ∈ {1, . . . , s − 2} then
      Reset Ā(u^(s)) to a set of activated units that has not occurred previously
    end if
    Compute the coefficients P̂(Ā(u^(s))) and q̂^T(Ā(u^(s))) associated with Ā(u^(s)) through (26)
    Update the policy through (27) and get u^(s+1)
  end if
end for

In Algorithm 2, the starting point u^(1) = u_{t−1} is initialized with the last control input, which can be viewed as a warm start for the algorithm. To understand the rationale of this design, suppose that x_t is close to the origin after enough time steps. Then u_t will vary only slightly around the origin, and choosing u^(1) = u_{t−1} is beneficial for reducing the number of iterations.
Compared to Algorithm 1, Algorithm 2 only needs to solve one QP per iteration, in which the constraints are the same as those in (23). Additionally, Algorithm 2 circumvents the calculation of U_{s+1}. Meanwhile, Algorithm 2 can also achieve finite termination as well as optimality for problem (23). However, the number of iterations in Algorithm 2 may be larger than that in Algorithm 1, and we need to store all the previous sets of activated units Ā(u^(k)), k = 1, . . . , s − 1. Even so, we can still rely on Algorithm 2, because we have found that cycles rarely occur in our numerical experiments on several examples. Besides, we can also switch to Algorithm 1 by taking U_{s_c} ← {u | u ∈ U, A x_t + B u ∈ C} if a cycle is detected at s = s_c.
Theorem 9 Consider the DP problem (23). For any x_t that makes problem (23) feasible, the decomposition algorithm (Algorithm 2) terminates in a finite number of iterations, i.e., there exists a finite s_m such that Ā(u^(s_m)) = Ā(u^(s_m−1)) holds. Moreover, if Ā(u^(s_m)) = Ā(u^(s_m−1)), then u^(s_m) is the solution to problem (23).
Proof We first show that if Algorithm 2 stops, it outputs a solution to problem (23). Suppose Ā(u^(s_m)) = Ā(u^(s_m−1)). Then u^(s_m) minimizes the convex quadratic function u^T P̂(Ā(u^(s_m−1))) u + q̂^T(Ā(u^(s_m−1))) u over the feasible set of (23), and this quadratic piece coincides with Q̂(x_t, ·) around u^(s_m), where the activation pattern equals Ā(u^(s_m−1)). Hence u^(s_m) satisfies the first-order optimality conditions of (23), and since Q̂(x_t, ·) is convex, u^(s_m) is a global minimizer of (23).
Furthermore, if a cycle occurs, i.e., Ā(u^(s)) = Ā(u^(k)) for some k ∈ {1, 2, . . . , s − 2}, then, according to Algorithm 2, we select another set of activated units that has never been considered in problem (27). As the combinations of activated units are finite, which means the number of different Ā(·) is limited, the algorithm must stop after finitely many iterations. ✷

Stability analysis
The recursive feasibility of the DP problem (23) is inherently guaranteed. In this section, we investigate the stability of the proposed control law.
If problem (4) is non-degenerate, all R_i, i = 1, . . . , N_r are full-dimensional [14]. In contrast, in the degenerate case, some R_i are lower-dimensional and, in general, correspond to common boundaries between full-dimensional regions. Since both J*_∞(·) and Ĵ(·, θ) are PWQ on polyhedra, the intersection of any R_i and R̂_j, i = 1, . . . , N_r, j = 1, . . . , N̂_r is still polyhedral. Let R_{i,j} denote this intersection whenever it is a full-dimensional region, i.e., R_{i,j} ≜ R_i ∩ R̂_j, i ∈ {1, . . . , N_r}, j ∈ {1, . . . , N̂_r}, with dim(R_i ∩ R̂_j) = n. It is clear that the sets R_{i,j} form a partition of X_0.
Upper bound of the approximation error. To certify stability, we need an upper bound of the approximation error for all x belonging to X_0. Computing the maximum of |Ĵ(·, θ) − J*_∞(·)|/J*_∞(·) over each R_{i,j} is cumbersome since (i) it needs the analytical form of J*_∞(·) and (ii) it requires solving at most N_r N̂_r non-convex optimization problems, which is computationally demanding.
For this reason, we develop a procedure to obtain an upper bound of |Ĵ(·, θ) − J*_∞(·)|/J*_∞(·) by leveraging the Lipschitz continuity of ∇Ĵ(·, θ) and ∇J*_∞(·). Since we can compute the approximation error of the proposed neural network at the training points, it is possible to obtain an upper bound of the approximation error for all x in X_0. In the interior of each R_{i,j}, both J*_∞(·) and Ĵ(·, θ) are quadratic and hence twice continuously differentiable. According to [14, Lemma 3.2], the gradients ∇_x J*_∞(·) and ∇_x Ĵ(·, θ) are locally Lipschitz on int(R_{i,j}), i.e., there exist non-negative constants L_i and L̂_j such that for all pairs (x, y) ∈ int(R_{i,j}) × int(R_{i,j}),
‖∇J*_∞(x) − ∇J*_∞(y)‖ ≤ L_i ‖x − y‖, ‖∇Ĵ(x, θ) − ∇Ĵ(y, θ)‖ ≤ L̂_j ‖x − y‖.   (30)
The Lipschitz constants L_i and L̂_j can be chosen as the largest eigenvalues of P_i and P̂_j [14, Lemma 3.2], respectively. In the following, we will show that there exists a positive constant ζ such that
|Ĵ(x, θ) − J*_∞(x)| ≤ ζ J*_∞(x) for all x ∈ X_0.   (31)
To achieve this, we need the following assumption.
Assumption A2: For the partition R i,j of X 0 , suppose that there exists at least one training point in each int(R i,j ).
Meanwhile, let f(·) denote the relative approximation error appearing in (31), define a measure g(·) of the error between the gradients ∇Ĵ(·, θ) and ∇J*_∞(·), and let f̄ and ḡ stand for the maximum values of |f(·)| and g(·) over all training data, which are computable. It is then straightforward to see that
|f(y)| ≤ f̄, g(y) ≤ ḡ   (32)
hold whenever y is any training point specified in (15).

Remark 10
If Assumption A2 is not satisfied after the neural network is trained, we can add some testing points in int(R_{i,j}), evaluate |f(·)| and g(·) at these testing points, and adjust f̄ and ḡ if necessary so that |f(·)| ≤ f̄ and g(·) ≤ ḡ also hold at these testing points.
Let R_1 refer to the polyhedron where no constraints in (4) are active, i.e., R_1 = O_∞^LQR, and accordingly let R̂_1 represent the polyhedron where no ReLU units in Ĵ(·, θ) are activated. In the region R_{1,1} = R_1 ∩ R̂_1 we have |f(x)| ≡ 0 due to (14). For every R_{i,j} except R_{1,1}, consider the following cases.
Case 1: f(x) ≥ 0, ∀x ∈ int(R_{i,j}). In this case, for any x ∈ int(R_{i,j}), let y ∈ int(R_{i,j}) denote the training or testing point closest to x. Substituting x and y into (30) results in (33), where β(y) = ‖∇J*_∞(y)‖_2 / J*_∞(y). The first inequality of (33) holds due to the second inequality of (30) and the convexity of J*_∞(·). Let d_{i,j} denote the maximum Euclidean distance between x and its nearest training or testing point subject to x ∈ int(R_{i,j}). Using the fact that (32) holds at y, we obtain from (33) the bound (34), which holds for any x ∈ int(R_{i,j}). Herein, it should be mentioned that β(y) is bounded on all R_{i,j} except R_{1,1}, since J*_∞(·) can only equal zero at the origin, which is contained in R_{1,1}. Therefore, one can always make d_{i,j} sufficiently small so that 1 − β(y) d_{i,j} > 0.
Case 2: f(x) ≤ 0, ∀x ∈ int(R_{i,j}). Similarly to Case 1, we can readily obtain (35), where the first inequality of (35) comes from the first inequality of (30) and the convexity of Ĵ(·, θ). The last line of (35) is almost the same as the right-hand side of (34); the only difference is that L̂_j is replaced by L_i.
Consequently, combining the above cases, it follows that for any x ∈ int(R_{i,j}), |f(x)| is upper bounded by the constant ζ_{i,j} given by the right-hand side of (36). Finally, since f(x) is continuous on the closure of R_{i,j}, the bound ζ_{i,j} applies to all x in R_{i,j}. In addition, thanks to Assumption A2, the value of ζ in (31) can be determined by computing the right-hand side of (36) at all training/testing points and choosing the largest value.
Computation of ∇Ĵ(y, θ), ∇J*_∞(y), L_i, and L̂_j. The computation of ∇Ĵ(y, θ) and L̂_j is straightforward since the analytical form of Ĵ(·, θ) is known. The values of ∇J*_∞(y) and L_i can be collected during the training process. Specifically, note that all y are contained in the interior of full-dimensional regions R_i, in which the rows of G_{A_i,·} are linearly independent. In this case, the Lagrange multipliers λ*(·) of problem (4) are affine functions of x on R_i [17] and thus continuously differentiable. Then, similarly to the analysis in [14, Theorem 6.9], the gradient ∇J*_∞(y) can be computed from the optimizer and the Lagrange multipliers at y according to (37). To obtain L_i, let us restrict our attention to the set of active constraints at y. The set A_i defined in Theorem 5 can be determined by checking whether the elements of λ*(y) are zero or not. Then, as illustrated in the proof of [30, Lemma 1] (Theorem 5 in this paper), substituting A_i into (38), where Γ = G_{A_i,·} H^{−1} G_{A_i,·}^T ≻ 0, we can obtain P_i and hence select L_i as the largest eigenvalue of P_i.
With the boundedness of f(x) established, we can assess the stability of the closed-loop system under the approximate controller û*_t. Corresponding to the selection of C in problem (23), we consider two cases: (1) C = X̄ and (2) C = R^n.
Theorem 11 Let û*_t be the solution to problem (23) with C = X̄. Suppose that Assumptions A1-A2 hold with X_0 = X̄. If ζ satisfies condition (39), then the origin of the closed-loop system x_{t+1} = A x_t + B û*_t, t = 0, 1, . . . is asymptotically stable with domain of attraction X̄.
Proof Let us first consider the solution to the MPC problem (3). For every x_t, t = 0, 1, . . ., MPC finds the solution U*_t to problem (3), and only its first m elements, i.e., u*_t = [I_m 0_m · · · 0_m] U*_t (with N − 1 zero blocks), are applied to the system. From the equivalence between problem (1) and problem (3), {u*_t}_{t=0}^{∞} is also a solution to problem (1). As a result, A x_t + B u*_t ∈ X̄ for any x_t ∈ X̄. Besides, in view of (36), ζ is determined mainly by f̄, ḡ, and d_{i,j}. The condition (39) or (44) can thereby be guaranteed in two ways. One is to add more hidden units to the neural network so that f̄ and ḡ become smaller, according to the universal approximation theorem [38]. Another possibility is to involve more state-value/gradient data when testing (32), so that d_{i,j} is reduced.
If the data generation strategy in Section 3.2 is used, it is sometimes not straightforward to satisfy Assumption A2, since the positions of the subsequent states x_k^(i)*, i = 1, . . . , N_x, k = 1, . . . , N − 1 cannot be determined explicitly. An alternative approach is to use only initial states {x^(i)}_{i=1}^{N_x} with a much larger N_x if we find that a number of regions of the partition R_{i,j} are not visited by the training points.
Approximation methods for MPC with stability guarantees have been investigated in, e.g., [23,37,39]. The stability results in these three papers are based on a common fundamental theorem, stating that if the approximated control law is close enough to the original MPC law, then the MPC value function is still a Lyapunov function for the system controlled by {û*_t}_{t=0}^{∞}. Our method, on the contrary, first obtains an estimate Ĵ(·, θ) of J*_∞(·). If Ĵ(·, θ) approximates J*_∞(·) well (as characterized by conditions (39) and (44)), the approximated control law generated by the DP problem (23) is inherently stabilizing. In addition, note that the Lyapunov functions in Theorems 11 and 12 are Ĵ(·, θ), instead of J*_∞(·).
Remark 13 It is worth mentioning that our approach can be extended to the degenerate cases. Even if J * ∞ (·) is not continuously differentiable, we can still use the proposed PWQ neural network to approximate it. In addition, all the results of the proposed method apply to the degenerate cases.

Complexity analysis
We analyze the offline storage requirement as well as the online computational complexity of the proposed control scheme, and compare them with other methods, such as implicit MPC, explicit MPC, and the policy approximation methods of MPC [20,23].
Storage space is dominated by the number of regions and control laws (for explicit MPC) or by the structure of the neural network (for approximate MPC). The storage of system parameters such as A, B, X, U, Q, and R is neglected for consistency. Implicit MPC does not need to store any data except for these system parameters. In comparison, to implement explicit MPC, the straightforward way is to store all polyhedral regions R_j, j = 1, . . . , N_r and all control laws F_j x + g_j, identify online the region that contains the current state, and then apply the corresponding control law. From the explicit control law given in (5), explicit MPC requires the storage of N_r regions and affine feedback laws. Suppose that each region R_j is defined by n_c^(j) constraints. Then, explicit MPC needs to store (n + 1) Σ_{j=1}^{N_r} n_c^(j) + N_r (mn + m) real numbers. As for the proposed method, it needs to construct a PWQ neural network before running the system. The neural network contains 3 groups of parameters, W, b, and r, so the storage of the proposed neural network requires nM + 2M real numbers in total. In addition, for the policy approximation methods reported in [20,23], the total storage demand of the neural networks is (n + n_0 + 1) M + (L − 1)(M + 1) M, with n_0 = m for [20] and n_0 = N(m + 2n + n_c + m_c) for [23], respectively. Here, L denotes the number of hidden layers, and n_c and m_c denote the numbers of constraints specifying X and U.
Online computation time will be evaluated in terms of floating-point operations (flops) for the computations that must be performed online. Implicit MPC needs to solve the QP (3) or (4) at each time step. Solving (4) for a given x requires f_QP(Nm, N(m_c + n_c)) flops in the worst case (i.e., when (4) has no redundant constraints). Here, f_QP(n_D, n_I) represents the number of flops needed to solve a QP with n_D decision variables and n_I linear inequalities. In an interior-point method, solving (4) thus requires O(N³m³) flops per iteration. In comparison, the number of flops for explicit MPC is 2n Σ_{j=1}^{N_r} n_c^(j) [14], far less than that for implicit MPC.
In our proposed control scheme, the online computational load comes from solving the DP problem (23). In Algorithms 1 and 2, the number of flops needed to determine Ā(u^(s)) and to compute the coefficients P̂(Ā(u^(s))) and q̂^T(Ā(u^(s))) is bounded by f_act = M(2n² + n + m² + 5m + 2mn + 2) + 2mn + m(m + 3)/2 in total. Then, in each iteration, Algorithm 1 has to solve 2 different QPs, which need f_QP(m, n_c + l_c + s − 1) and f_QP(m, n_s) flops; here, n_s stands for the number of inequalities that define U_s. Algorithm 2 needs only f_QP(m, n_c + l_c) flops to find u^(s+1) in each iteration. Some additional calculations include the update of U_s (2m² + 2m − 1 flops per iteration, for Algorithm 1) and the comparison of Ā(u^(s)) and Ā(u^(s−1)) (M flops per iteration). Therefore, the total number of flops needed to implement Algorithm 1 is obtained by summing these per-iteration counts over all iterations, and the corresponding total for Algorithm 2, which only involves f_act, one QP, and the comparison of activation sets per iteration, can be much smaller.

Numerical examples
Two case studies are presented to assess the feasibility and effectiveness of the proposed control scheme. In addition, some other methods including implicit MPC and the policy approximation method of MPC [20], are also compared with the proposed method in the case studies. The simulations are conducted in MATLAB 2021a.
We are interested in stabilizing the system at the origin while minimizing the cost Σ_{k=0}^{∞} (x_k^T Q x_k + u_k^T R u_k) with Q = I_2 and R = 0.1 I_2. We choose the region of interest X_0 = {x ∈ R² : ‖x‖_∞ ≤ 3}. By applying the algorithm in [16], it can be verified that the vertices of X_0 can be steered to an ellipsoidal subset of O_∞^LQR with the horizon N = 10. By solving the explicit MPC problem (4) with N = 10, the partition of the state space for the optimal control law on the region X_0 can be obtained; it is depicted in Fig. 2(a). If some regions share the same form of the control law and their union is convex, the union of these regions is computed and shown in the same color.
Fig. 2. (a) Regions of optimal explicit MPC; (b) polyhedral partition of the proposed neural network; (c) training errors; (d)-(f) comparison of the PWQ neural network output with the real optimal value function, including (e) the relative approximation error and (f) the relative error of the gradient at the training points.

To illustrate that the proposed PWQ neural network has good approximation performance, we compare it with standard ReLU neural networks (with 1 or 2 hidden layers). The widths of the neural networks are taken as 15 (for the proposed and the shallow ReLU neural networks) and 6 (for the deep ReLU neural networks), so that all of them have a comparable number of parameters. To prevent the networks from converging to local minima that are not globally optimal, multi-start local optimization [40] is employed. In particular, each network is initialized with multiple values of θ that are randomly selected in the feasible set; we then evaluate the values of the objective function in (16) and select the initial network parameters that yield the lowest function value. Choosing α = 0.1, we compare the absolute mean square errors, i.e., the values of the objective function in (16), during the training process and show them in Fig. 2(c). From Fig. 2(c), it is observed that the training objectives for the 3 neural networks decrease consistently over the epochs, while the proposed approach has the smallest mean square error during the training process.
The polyhedral partition of the proposed neural network is also depicted in Fig. 2(b). From Fig. 2(b) it is noticed that the partition of the neural network concentrates in the areas where the explicit MPC law changes rapidly. We also compare the output of the PWQ neural network with the real optimal value function in Figs. 2(d)-2(f), from which one can observe that the proposed method is able to closely approximate the optimal value function with a very simple network architecture.
To verify the stability of the closed-loop system, according to (36), we compute the value of the right-hand side of (36) at all training points and choose the largest one: ζ = 0.188. Next, the right-hand side of (44) can be computed by solving the constrained nonlinear problem (47), where ǫ is a small positive constant introduced to avoid singularity. Problem (47) returns a maximum of 4.904. As a result, it can be readily verified that (44) holds.
In the closed-loop simulation, the proposed method, implicit MPC, and the policy-approximation method of [20] are compared. To implement the policy-approximation method, we construct a deep ReLU neural network with 2 hidden layers to learn the explicit MPC law. Normally, the width of the deep ReLU neural network should be set to 6 so that it has a comparable number of parameters. However, we found that this yields a poor approximation performance, so we increase the width of the deep ReLU neural network to 8. When the system is running online, the output of the deep ReLU neural network may violate the input constraint, and it is therefore projected onto U. All QPs during the simulation are solved via the MATLAB function "quadprog" with an interior-point algorithm. We choose the initial point x_0 = [0 2]^T and run the simulation for 100 steps. The closed-loop behaviors in the first 30 time steps are plotted in Fig. 3. From Fig. 3, it can be seen that the proposed method properly approximates the MPC controller and that the corresponding trajectory is stabilized at the origin. In comparison, the policy-approximation method experiences some fluctuations when the state is close to the origin. This is probably because the value of the optimal control input is very small when x is around the origin, so a slight approximation error in the policy may lead to drastic changes in the dynamic response. Table 1 shows the total simulation time of the three methods, as well as their total cost Σ_{t=0}^{100} (x_t^T Q x_t + u_t^T R u_t). Besides, we also compare the simulation times of Algorithm 1, Algorithm 2, and a nonlinear programming method (the MATLAB function "fmincon" with the interior-point algorithm) when implementing the proposed method. The policy-approximation method needs the least simulation time, since it only needs to compute a single projection (solve a QP) at each time step. The proposed method with Algorithm 2 requires a shorter computation time than online MPC. Besides, compared with the nonlinear solver and Algorithm 1, Algorithm 2 significantly reduces the computational complexity. In terms of optimality, the control law generated by the proposed method is nearly optimal, since its total cost is very close to that of MPC, while the total cost of the policy-approximation method is relatively large (about 126% of that of MPC), due to the fluctuations near the origin.
In the second case study, we concentrate on the case X_0 = X̄. To this end, we consider a linear system with state and input constraints, where the weight matrices Q and R are taken as I_2 and 0.1, respectively. We intend to compare the proposed method with an existing ADP method for constrained linear systems developed in [12]. With the help of the Multi-Parametric Toolbox, X̄ can be efficiently computed, as depicted in Fig. 4.
As in the previous example, it is verified that N = 6 leads to the equivalence of (1) and (4). In contrast to using the initial states and their subsequent trajectories as training data, as in the previous example, a 121 × 121 grid of states is constructed to cover the region {x ∈ R² : ‖x‖_∞ ≤ 3}, and only the points contained in X̄ are selected as training points. The distance between adjacent grid points is therefore 0.05. These training points are used to train the PWQ neural network with width 15. After 15000 training iterations, the value of the objective function in (16) is less than 10^{−1}, and we obtain f̄ = 0.0552, ḡ = 0.355, and hence ζ = 0.0745. Meanwhile, we can compute the right-hand side of (39), which equals 9.853. Consequently, condition (39) is satisfied, and the offline analysis hence provides a stability guarantee.
The proposed control law û*_t is expected to be stabilizing for all x_0 in X̄. To test this, Fig. 4 plots the trajectories of the closed-loop system starting from the vertices of X̄. Owing to the convexity of J*_∞(·), these vertices have the largest J*_∞(·) in their neighborhoods, and therefore the system starting from these vertices is the most prone to diverging. However, Fig. 4 shows that the trajectories starting from all vertices converge rapidly to the origin and that the proposed control law closely approximates the MPC law, which is optimal for the infinite-horizon problem. Table 2 compares the proposed method (using Algorithm 2) with the ADP method in [12] and online MPC in terms of total simulation time and total cost over the first 100 time steps. As the ADP method in [12] is infeasible for some vertices of X̄, we choose an initial point x_0 = [2 −2]^T that is much closer to the origin and run the system for 100 time steps. The proposed method combined with Algorithm 2 takes the least computation time, while the ADP method in [12] needs the most, mainly because [12] must solve an extra semidefinite program online to estimate the value function. In terms of optimality, the proposed method produces the same solution as MPC, since their total costs are equal.

Conclusions and future work
In this paper, we have developed an ADP control framework for infinite-horizon optimal control of linear systems subject to state and input constraints. Compared to common neural networks such as ReLU networks, the proposed neural network maintains the PWQ property and convexity of the real value function and achieves a much better approximation performance. These properties contribute to the reduction of the online computation as well as to the construction of explicit stability criteria. Therefore, the advantages of our method include low computational requirements, stability assurance, and excellent approximation of the optimal control law.
The main limitations of the proposed scheme are that the model considered is linear and that modeling uncertainty is not addressed in the proposed framework. Besides, the supervised learning method needs considerable offline computation for the preparation of the training data. Future work includes using temporal-difference learning to train the neural network and extending the proposed method to constrained PWA systems, which can model a wide range of hybrid processes and nonlinear systems.