Inverse reinforcement learning for identification of linear–quadratic zero-sum differential games

In this paper, we address the inverse problem in the case of linear–quadratic zero-sum differential games. The problem is to evaluate an unknown cost function given observed trajectories that are known to be generated by a stationary linear feedback Nash equilibrium pair. Using the observed data, we construct a game that is equivalent to the game that generated the observed trajectories, in the sense that the equilibrium feedback law of either of the two players is the same in the original and constructed games. Towards this end, we introduce a model-based algorithm that uses the given trajectories to accomplish this task. The algorithm combines inverse optimal control and reinforcement learning methods, making extensive use of gradient-descent optimization for the latter. The analysis of the algorithm focuses on proving its convergence and stability. Simulation results validate the effectiveness of the proposed algorithm.


Introduction
Dynamic game theory brings together four components that are key to many situations in economics, ecology, and other related disciplines: optimizing behavior, the presence of multiple agents/players, enduring consequences of decisions, and robustness with respect to a changing environment [1]. Noncooperative differential games were first introduced in [2] within the framework of zero-sum games. This type of game attracted considerable attention from the control community because quadratic differential games provide new angles from which to examine the performance of control laws. The applications of differential games are far-reaching [3][4][5]. Although most of the literature has focused on determining the outcome of a game given the players' objective functions, an increasing interest has recently appeared in the inverse problem, where, given the players' game-playing behavior, one wants to reverse-engineer the objective of a player.
Inverse problems have attracted considerable attention due, in part, to their application in guiding systems to desired behavior outcomes. Significant research has been done in the area of inverse optimal control (IOC) [6,7]. Another closely related research area is inverse reinforcement learning (IRL) [8]. Although these two areas are concerned with similar problems, they differ in structure: IOC aims to reconstruct an objective function from state/action samples under the assumption that the underlying control system is stable, while IRL recovers an objective function from expert demonstrations under the assumption that the expert behavior is optimal [9]. There is a close relationship between IOC and the inverse problem for linear-quadratic differential games. Various works are dedicated to the inverse problem for noncooperative linear-quadratic differential games. Some of them use purely IRL approaches [10,11], while others are based on IOC [12]. However, not much attention has been paid to linear-quadratic zero-sum differential games, despite the fact that they can be used to solve the $L_2$-gain problem [13]. Some results for the inverse problem in such games were achieved via inverse Q-learning in the context of the imitation learning problem [14].
For linear-quadratic zero-sum differential games, finding the Nash equilibrium amounts to solving the so-called Generalized Algebraic Riccati Equation (GARE) [15,16]. In this work, we use reinforcement learning and inverse optimal control methods to solve the GARE. The developed algorithm is model-based, i.e., in addition to knowing the equilibrium trajectories, we also know the matrices of the dynamics. Instead of seeking the cost function that, together with the dynamics, generated the observed behavior, we look for an equivalent cost function that, together with the given dynamics, constitutes a game sharing the same feedback law with the original game.
The paper is structured as follows. Section 2 provides preliminary results on linear-quadratic zero-sum differential games and formulates the problem addressed in the paper. In Section 3, we describe each step of the algorithm. Section 4 is dedicated to the analysis of the algorithm; we show its convergence and stability and characterize the possible solutions. Sections 5 and 6 provide simulation results and conclusions, respectively.

Notation: For a matrix $P \in \mathbb{R}^{m \times n}$, $P^k$ and $P^{(k)}$ denote $P$ to the power of $k$ and the matrix $P$ at the $k$th iteration, respectively. In addition, $P > 0$, $P \ge 0$, $P \le 0$, and $P < 0$ denote positive definiteness, positive semi-definiteness, negative semi-definiteness, and negative definiteness of the matrix $P$, respectively. $\mathrm{Tr}\, P$ denotes the trace of the matrix $P$. $I_k$ is the $k \times k$ identity matrix. $\mathbb{R}_+$ denotes the set of positive real numbers. $\mathbb{Z}_+$ denotes the set of positive integers.

Problem formulation
This section introduces linear-quadratic (LQ) zero-sum differential games and defines the stationary linear feedback Nash equilibrium (referred to as NE). We clarify what optimal behavior in the game is and introduce the inverse problem.

LQ zero-sum differential game
Consider a differential game with the continuous-time dynamics
$$\dot{x} = Ax + Bu + Dd, \quad x(0) = x_0, \tag{1}$$
where $x \in \mathbb{R}^n$ is the state, and $u \in \mathbb{R}^m$ and $d \in \mathbb{R}^p$ are the control inputs of players 1 and 2, respectively; the plant matrix $A$ and the control input matrices $B$ and $D$ have appropriate dimensions.
We consider that the players select their controls to be linear time-invariant feedback laws of the form
$$u = Fx, \qquad d = Lx, \tag{2}$$
where $F$ and $L$ are linear time-invariant feedback matrices of players 1 and 2, respectively. Further, to ease notation, we write $u$ and $d$ for the feedback laws in (2).

Within the game, player 1 aims to find a controller that minimizes a cost function, and player 2, on the opposite side, looks for a controller that maximizes it. The cost function is quadratic and given as follows:
$$J(u, d) = \int_0^{\infty} \big( x^\top Q x + u^\top R u - d^\top M d \big)\, \mathrm{d}t, \tag{5}$$
where $Q = Q^\top$, $R = R^\top > 0$, and $M = \gamma^2 I_p$ with $\gamma > 0$. In the game, we are interested in finding a Nash equilibrium, i.e., a pair $(u^*, d^*)$ such that
$$J(u^*, d) \le J(u^*, d^*) \le J(u, d^*)$$
for all admissible $u$ and $d$. The optimal value function in the game is defined by
$$V^*(x) = x^\top K x,$$
where $K$ is a symmetric matrix, sometimes referred to as the value matrix. Define $\nabla V^* := \partial V^* / \partial x = 2Kx$. The Hamiltonian function is
$$H(x, u, d) = x^\top Q x + u^\top R u - d^\top M d + (\nabla V^*)^\top (Ax + Bu + Dd).$$
Using the stationarity conditions
$$\frac{\partial H}{\partial u} = 0, \qquad \frac{\partial H}{\partial d} = 0,$$
we obtain
$$u^* = -R^{-1} B^\top K^* x =: F^* x, \qquad d^* = M^{-1} D^\top K^* x =: L^* x,$$
where $K^*$ satisfies the following Generalized Algebraic Riccati Equation (GARE) [1]
$$A^\top K + K A + Q - K B R^{-1} B^\top K + K D M^{-1} D^\top K = 0. \tag{12}$$
In this game, we restrict the set of admissible controllers $(F, L)$ to belong to the following set
$$\mathcal{F} := \big\{ (F, L) : A + BF + DL \ \text{is stable} \big\},$$
since $(u^*, d^*)$ need to stabilize the trajectories to qualify as the unique NE in this game [1]. This restriction is essential because, as shown in [17], without it one can construct an example where a non-stabilizing feedback yields a lower cost for one of the players while the other player sticks to the stabilizing feedback law. Thus, besides satisfying (12), $K$ should also be stabilizing for $(F, L)$ to qualify as a unique NE [1]. The following assumption guarantees the non-emptiness of the set.

Assumption 1. The system (1) is stabilizable.
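As a concrete companion to the relations above, the following minimal Python sketch (not from the paper; the function names are ours) computes the NE feedback gains from a candidate value matrix $K$ and evaluates the GARE residual, which should vanish at a solution of (12).

```python
import numpy as np

def ne_feedback(K, B, D, R, M):
    """NE feedback gains F = -R^{-1} B^T K and L = M^{-1} D^T K."""
    F = -np.linalg.solve(R, B.T @ K)
    L = np.linalg.solve(M, D.T @ K)
    return F, L

def gare_residual(K, A, B, D, Q, R, M):
    """Left-hand side of GARE (12); (near-)zero iff K solves it."""
    return (A.T @ K + K @ A + Q
            - K @ B @ np.linalg.solve(R, B.T @ K)
            + K @ D @ np.linalg.solve(M, D.T @ K))
```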

Inverse problem
We formulate the inverse problem for LQ zero-sum differential games in this subsection.
Consider an LQ differential game (referred to as the observed LQ game) with the continuous-time system dynamics
$$\dot{x}_o = A x_o + B u_o + D d_o, \tag{14}$$
$$x_o(0) = x_{0,o}, \tag{15}$$
where $x_o \in \mathbb{R}^n$, $u_o \in \mathbb{R}^m$, and $d_o \in \mathbb{R}^p$ are the NE trajectories of the observed LQ game, with $u_o$ and $d_o$ being the trajectories of players 1 and 2, respectively; $A$, $B$, $D$ have appropriate dimensions and satisfy Assumption 1. The cost function of the game has the following known quadratic structure
$$J_o(u_o, d_o) = \int_0^{\infty} \big( x_o^\top Q_o x_o + u_o^\top R_o u_o - d_o^\top M_o d_o \big)\, \mathrm{d}t \tag{16}$$
with the unknown matrices $Q_o = Q_o^\top$, $R_o = R_o^\top > 0$, and $M_o = \gamma_o^2 I_p$. The observed NE feedback laws are
$$u_o = F_o x_o = -R_o^{-1} B^\top K_o x_o, \qquad d_o = L_o x_o = M_o^{-1} D^\top K_o x_o, \tag{17}$$
where $K_o$ is the unique stabilizing symmetric solution of the following GARE
$$A^\top K_o + K_o A + Q_o - K_o B R_o^{-1} B^\top K_o + K_o D M_o^{-1} D^\top K_o = 0. \tag{19}$$

Assumption 2. The unique stabilizing solution $K_o$ of (19) is positive definite, i.e., $K_o > 0$.

The above assumption is the only restriction we impose on the game that leads to the observed equilibrium trajectories. In fact, the assumption that the unique stabilizing solution of (19) is positive definite, i.e., $K_o > 0$, leads to $A + BF_o$ being stable. Although the result is known [16], it lacks a proof, which is presented below.

Lemma 1. Consider the observed LQ game (14) with the cost function (16) under Assumption 2. Then $A + BF_o$ is stable.

Proof. To see this, we use the fact that $K_o$ satisfies GARE (19), which can be rewritten in the closed-loop form
$$(A + BF_o + DL_o)^\top K_o + K_o (A + BF_o + DL_o) = -Q_o - F_o^\top R_o F_o + L_o^\top M_o L_o. \tag{20}$$
Moving the $DL_o$ terms to the right-hand side, we get
$$(A + BF_o)^\top K_o + K_o (A + BF_o) = -Q_o - F_o^\top R_o F_o + L_o^\top M_o L_o - L_o^\top D^\top K_o - K_o D L_o = -Q_o - F_o^\top R_o F_o - L_o^\top M_o L_o, \tag{21}$$
where the last equality uses $L_o = M_o^{-1} D^\top K_o$. From the inequality implied by (21), one can conclude that
$$(A + BF_o)^\top K_o + K_o (A + BF_o) < 0,$$
and, as a result of $K_o > 0$, the matrix $A + BF_o$ is stable by the Lyapunov stability theorem. ■

Positive definiteness of the value matrix is a common assumption for differential games [18]. Note that whether $A + BF_o$ is stable or not can be checked using the estimate of $F_o$, which can be computed via the procedure described in Section 3.2.
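The claim of Lemma 1 is easy to probe numerically. The following sketch uses hypothetical matrices, and applies SciPy's CARE solver to a stacked-input reformulation of the GARE as a stand-in for the Lyapunov-iteration solver of [18]; it computes a positive definite stabilizing $K_o$ and checks that $A + BF_o$ is Hurwitz.

```python
import numpy as np
from scipy import linalg

A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
D = np.array([[0.0], [0.5]])
Q_o, R_o, gamma_o = np.eye(2), np.eye(1), 2.0
M_o = gamma_o ** 2 * np.eye(1)

# Solve the zero-sum GARE (19) as a standard CARE with stacked inputs [B D]
# and the sign-indefinite weight blkdiag(R_o, -M_o).
K_o = linalg.solve_continuous_are(A, np.hstack([B, D]), Q_o,
                                  linalg.block_diag(R_o, -M_o))
F_o = -np.linalg.solve(R_o, B.T @ K_o)

print(np.linalg.eigvals(K_o))                # all positive: K_o > 0
print(np.linalg.eigvals(A + B @ F_o).real)   # all negative: A + B F_o stable
```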
Notation: we use the tuple $(A, B, D, Q, R, M)$ to describe an LQ differential game with the dynamics' matrices $A$, $B$, $D$ and the cost function weights $Q$, $R$, $M$ as in (5). We call two such games equivalent if the NE feedback law of player 1 is the same in both. In other words, the games are equivalent if they share the same equilibrium feedback law of the player that minimizes the cost function (16), i.e., player 1. Now, we are ready to formulate the inverse problem to be addressed in this paper.

Inverse Problem: Given the dynamics' matrices $A$, $B$, $D$ and the observed trajectories $(x_o, u_o)$, we want to derive a game equivalent to the $(A, B, D, Q_o, R_o, M_o)$ game.

Remark 1. Since no assumptions on the definiteness of $Q_o$ are made, the problem can be reformulated for player 2, and the solution proposed further is still valid.
The goal is to be accomplished via a model-based inverse reinforcement learning algorithm described in the following section.

Model-based inverse learning
In this section, we describe the algorithm that uses the trajectories $(x_o, u_o)$ generated by the known dynamics in (14) to learn a cost function equivalent to the one parametrized by $(Q_o, R_o, M_o)$. The procedure is as follows: firstly, we initialize an LQ differential game with the dynamics $(A, B, D)$. We generate an initial $Q^{(0)}$, updated at each iteration; $M^{(0)} = (\gamma^{(0)})^2 I_p > 0$, updated when necessary; and the control input weight $R > 0$, which remains the same at each iteration. The next step is to compute an estimate of $F_o$ using the observed trajectories. Then, to solve the resulting LQ game, we solve GARE (12) to derive the unique stabilizing solution $K^{(0)}$. After that, we start the iterative update of $K^{(0)}$ using the gradient descent method [19] and the update of $Q^{(0)}$ using the inverse optimal control method [20].

Optimal control on given cost function parameters
Following the first step, we need to initialize the cost function parameters $Q^{(0)}$, $R$, and $M^{(0)} = (\gamma^{(0)})^2 I_p$. With the known dynamics $A$, $B$, $D$, one needs to solve the following GARE with respect to the symmetric $K^{(0)}$, which is the unique stabilizing solution:
$$A^\top K^{(0)} + K^{(0)} A + Q^{(0)} - K^{(0)} B R^{-1} B^\top K^{(0)} + K^{(0)} D (M^{(0)})^{-1} D^\top K^{(0)} = 0. \tag{24}$$
Solving (24) might not be straightforward because the GARE is not guaranteed to have the desired solution due to the term
$$- K^{(0)} B R^{-1} B^\top K^{(0)} + K^{(0)} D (M^{(0)})^{-1} D^\top K^{(0)},$$
which might be indefinite. However, since we have the freedom to choose $Q^{(0)}$, $R$, and $\gamma^{(0)}$, referring again to [18], we initialize $Q^{(0)} > 0$ such that $((Q^{(0)})^{1/2}, A)$ is observable. Note that if the desired solution exists, it is unique [21,22].
Then, using the algorithm presented in [18] (Algorithm 3) with the initialized parameters $Q^{(0)}$, $R$, $\gamma^{(0)}$, the iterative procedure is guaranteed to converge to the unique stabilizing positive definite solution $K^{(0)} > 0$. Using the derived solution, we calculate the state feedback law of player 1 as
$$F^{(0)} = -R^{-1} B^\top K^{(0)}.$$
Together with $L^{(0)} = (M^{(0)})^{-1} D^\top K^{(0)}$, $F^{(0)}$ forms the NE pair for the initialized game $(A, B, D, Q^{(0)}, R, M^{(0)})$.
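For reference, a minimal sketch of this initialization step is given below. It solves the zero-sum GARE (24) by rewriting it as a standard CARE with the stacked input matrix $[B \ D]$ and the sign-indefinite weight $\mathrm{blkdiag}(R, -M)$; assuming SciPy's deflating-subspace solver accepts this sign-indefinite block (it does in practice, since only nonsingularity is required), it returns the stabilizing solution whenever one exists. This is a substitute for Algorithm 3 of [18], and the function names are ours.

```python
import numpy as np
from scipy import linalg

def solve_gare(A, B, D, Q, R, gamma):
    """Stabilizing solution K of A^T K + K A + Q - K B R^{-1} B^T K
    + gamma^{-2} K D D^T K = 0, obtained from the equivalent CARE with
    stacked inputs [B D] and weight blkdiag(R, -gamma^2 I)."""
    M = gamma ** 2 * np.eye(D.shape[1])
    K = linalg.solve_continuous_are(A, np.hstack([B, D]), Q,
                                    linalg.block_diag(R, -M))
    return K, M

def player1_gain(K, B, R):
    """NE feedback law of player 1: F = -R^{-1} B^T K."""
    return -np.linalg.solve(R, B.T @ K)
```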

Gradient descent update
We aim at tracking the difference between the feedback law $F$ that is the NE feedback law for the current iteration's game and the desired feedback law $F_o$. Towards this end, we need to derive an estimate $\hat{F}_o$ of $F_o$. Given the $(x_o, u_o)$ trajectories, we use the batch least-squares (LS) method [23]. To estimate that matrix, we need $k \ge n$, $k \in \mathbb{Z}_+$ data samples from the trajectories, i.e., pairs $(x_o(t_j), u_o(t_j))$, $j = 1, \dots, k$, which are stacked into the matrices $\hat{x}_o$ and $\hat{u}_o$ and used to estimate $F_o$ via relation (17) in the LS sense. When tracking the difference between $\hat{F}_o$ and $F^{(i)}$ at the $i$th iteration of the algorithm, we denote the difference function by
$$s^{(i)} := \hat{F}_o - F^{(i)}.$$
Next, we define an error function as
$$E(K^{(i)}) := \frac{1}{2} \big\| s^{(i)} \big\|^2,$$
which is a function of $K$ that we aim to minimize. Employing the gradient descent method [19], we introduce the following update rule
$$K^{(i+1)} = K^{(i)} - \alpha \frac{\partial E}{\partial K} \Big|_{K = K^{(i)}},$$
where $\alpha > 0$ is the learning rate and the partial derivative is computed via the chain rule using $\partial s^{(i)} / \partial K^{(i)} = R^{-1} B^\top$. The starting point $K^{(0)}$ is a solution of the initialized GARE (24). In that case, using the fact that $\|s^{(i)}\|_{2,1} > \|s^{(i+1)}\|_{2,1}$ for $i = 0, 1, \dots$ due to the gradient descent update, we have $E(K^{(i)}) > E(K^{(i+1)})$ for $i = 0, 1, \dots$. Also, as explained in Section 2.1, we want to guarantee the stability of the resulting solution, i.e.,
$$A + BF^* + DL^* \ \text{is stable}, \tag{34}$$
where $K^*$ is the goal of the optimization procedure described before, i.e., $-R^{-1} B^\top K^* = F^* = F_o$, and $\gamma^* > 0$ is a parameter that we might need to update, starting from $\gamma^{(0)}$, to guarantee the stability of the resulting dynamics. Thus, when updating $K^{(i)}$, we need to always check whether $A + BF^{(i+1)} + DL^{(i+1)}$ is stable, and if it is not the case, $\gamma^{(i)}$ needs to be increased. This update can be performed, for example, linearly, i.e., $\gamma^{(i+1)} = c\gamma^{(i)}$ with $c > 1$. As shown in Section 4.2, which is dedicated to the stability analysis, such a $c$ always exists.
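The following sketch implements the two ingredients above under stated assumptions: the batch LS estimate of $F_o$ from stacked samples, and one gradient step on $K$. The explicit gradient $B R^{-1} s$ corresponds to the Frobenius-norm error $E = \frac{1}{2}\|s\|_F^2$ and is our reading of the update rule, not a formula quoted from the paper.

```python
import numpy as np

def estimate_F(X, U):
    """Batch LS estimate F_o_hat = U X^T (X X^T)^{-1} from k >= n samples;
    the columns of X and U are the samples x_o(t_j) and u_o(t_j)."""
    return U @ X.T @ np.linalg.inv(X @ X.T)

def gradient_step(K, F_hat, B, R, alpha):
    """One update K <- K - alpha * dE/dK, symmetrized to keep K symmetric.
    Here dE/dK = B R^{-1} s for E = (1/2)||s||_F^2 (our assumption)."""
    s = F_hat + np.linalg.solve(R, B.T @ K)   # s = F_o_hat - F(K)
    grad = B @ np.linalg.solve(R, s)          # gradient of the Frobenius error
    K_next = K - alpha * grad
    K_next = 0.5 * (K_next + K_next.T)
    return K_next, 0.5 * np.linalg.norm(s, 'fro') ** 2
```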

Inverse optimal control update
The last step is to update $Q^{(i)}$ using the $K^{(i)}$ received via the gradient descent update. We simply substitute the updated value matrix into GARE (12), which yields
$$Q^{(i)} = -\Big( A^\top K^{(i)} + K^{(i)} A - K^{(i)} B R^{-1} B^\top K^{(i)} + K^{(i)} D (M^{(i)})^{-1} D^\top K^{(i)} \Big). \tag{36}$$
We repeat the presented steps until $0 \le E^{(i)} < \epsilon$, where $\epsilon \in \mathbb{R}_+$ is a desired precision.
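A one-function sketch of this inverse-optimal-control step (the function name is ours), solving (36) for $Q$ given the current $K$, $R$, and $\gamma$:

```python
import numpy as np

def ioc_update_Q(K, A, B, D, R, gamma):
    """Q consistent with K via (36): Q = -(A^T K + K A - K B R^{-1} B^T K
    + gamma^{-2} K D D^T K)."""
    M = gamma ** 2 * np.eye(D.shape[1])
    return -(A.T @ K + K @ A
             - K @ B @ np.linalg.solve(R, B.T @ K)
             + K @ D @ np.linalg.solve(M, D.T @ K))
```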
Upon convergence, the output tuple $(Q^*, R, M^*)$, together with the corresponding value matrix $K^*$, satisfies
$$A^\top K^* + K^* A + Q^* - K^* B R^{-1} B^\top K^* + K^* D (M^*)^{-1} D^\top K^* = 0, \tag{37}$$
where $M^* = (\gamma^*)^2 I_p$. To summarize, we present the whole procedure in Algorithm 1. The section thereafter provides the analysis of the proposed algorithm.
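Putting the pieces together, a compact driver mirroring Algorithm 1 might look as follows; it assumes the helper functions sketched above (solve_gare, player1_gain, estimate_F, gradient_step, ioc_update_Q) are in scope, and the tolerance eps, the learning rate alpha, and the growth factor c are tuning choices, not values from the paper.

```python
import numpy as np

def inverse_zero_sum_game(A, B, D, X, U, R, gamma0, Q0,
                          alpha=0.1, c=1.5, eps=1e-8, max_iter=50_000):
    """Model-based inverse learning (a sketch of Algorithm 1)."""
    Q, gamma = Q0, gamma0
    K, M = solve_gare(A, B, D, Q, R, gamma)   # step 1: initial GARE (24)
    F_hat = estimate_F(X, U)                  # step 2: LS estimate of F_o
    for _ in range(max_iter):
        K, err = gradient_step(K, F_hat, B, R, alpha)   # steps 3-4
        for _ in range(100):  # step 5: grow gamma until stable (safeguarded)
            M = gamma ** 2 * np.eye(D.shape[1])
            F = player1_gain(K, B, R)
            L = np.linalg.solve(M, D.T @ K)
            if np.all(np.linalg.eigvals(A + B @ F + D @ L).real < 0):
                break
            gamma *= c
        Q = ioc_update_Q(K, A, B, D, R, gamma)          # step 6: IOC update
        if err < eps:                                   # step 7: stopping rule
            break
    return Q, R, gamma ** 2 * np.eye(D.shape[1]), K
```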

Remark 2.
From the complexity point of view, the demanding parts of the algorithm are finding the solution of the game with the initialized parameters $(Q^{(0)}, R, M^{(0)})$ and the matrix multiplications performed in the subsequent steps. The algorithm proposed in [18] is used in our work to solve the initialized GARE. This algorithm is based on so-called Lyapunov iterations. Methods for solving Lyapunov equations with respect to $K \in \mathbb{R}^{n \times n}$ usually have complexity $O(n^3)$ [24]. The steps of the algorithm that require matrix multiplications via standard methods have complexity $O(n^3)$ as well. Hence, the overall computational complexity per iteration is $O(n^3)$.

Analysis of the algorithm
In this section, we derive a few analytical results for the presented algorithm. Firstly, we show the convergence of the algorithm. Next, we show that $F^* = F_o$ and $L^*$ constitute the equilibrium for the synthesized game. Finally, we provide some results on the characterization of possible solutions to the addressed inverse problem.
Algorithm 1 Model-based inverse learning
1: Initialize $Q^{(0)} > 0$ such that $((Q^{(0)})^{1/2}, A)$ is observable for the known $A$, choose $R > 0$ and $\gamma^{(0)} > 0$, and set $i = 0$. Solve GARE (24) with respect to $K^{(0)}$.
2: Estimate $F_o$ using the observed trajectories as
$$\hat{F}_o = \hat{u}_o \hat{x}_o^\top (\hat{x}_o \hat{x}_o^\top)^{-1}. \tag{40}$$
3: Compute
$$F^{(i)} = -R^{-1} B^\top K^{(i)}, \tag{41}$$
$$s^{(i)} = \hat{F}_o - F^{(i)}, \tag{42}$$
and the error $E^{(i)}$.
4: Update
$$K^{(i+1)} = K^{(i)} - \alpha \frac{\partial E}{\partial K} \Big|_{K = K^{(i)}}. \tag{43}$$
5: If $A + BF^{(i+1)} + DL^{(i+1)}$ is not stable, set $\gamma^{(i+1)} = c\gamma^{(i)}$ with $c > 1$; otherwise, set $\gamma^{(i+1)} = \gamma^{(i)}$.
6: Update $Q^{(i+1)}$ via (36).
7: If $E^{(i)} \ge \epsilon$, set $i \leftarrow i + 1$ and go to step 3; otherwise, output $(Q^{(i)}, R, M^{(i)})$ and $K^{(i)}$.

Convergence analysis
The first result claims the convergence of the proposed algorithm.

Theorem 1. In Algorithm 1, the reward weight $Q^{(i)}$ converges as $i \to \infty$ to a matrix $Q^*$ that, together with $R$, $M^* = (\gamma^*)^2 I_p$, and the value matrix $K^* = \lim_{i \to \infty} K^{(i)}$, satisfies GARE (37).
Proof. Let us consider (43). Using the gradient descent method to update $K^{(i)}$, we drive $F^{(i)}$ to $\hat{F}_o$. Hence, the error decreases with each iteration, i.e.,
$$E(K^{(i)}) > E(K^{(i+1)}), \quad i = 0, 1, \dots,$$
and we have
$$\lim_{i \to \infty} E(K^{(i)}) = 0.$$
Then,
$$\lim_{i \to \infty} F^{(i)} = \hat{F}_o.$$
Considering the effectiveness of the LS estimation, i.e., $\hat{F}_o = F_o$, we have $\lim_{i \to \infty} F^{(i)} = F_o$. We denote
$$\lim_{i \to \infty} K^{(i)} = K^*. \tag{50}$$
It is clear that when $K^{(i)}$ converges to $K^*$, $\gamma^{(i)}$ also converges to some $\gamma^* \ge \gamma^{(i)}$ for $i = 1, 2, \dots$, i.e., $\lim_{i \to \infty} \gamma^{(i)} = \gamma^*$ or, equivalently, $\lim_{i \to \infty} c^{(i)} = 1$. Next, using the gradient update rule, we expand $K^{(i)}$ in (36) and get
$$Q^{(i)} = -\Big( A^\top K^{(i)} + K^{(i)} A - K^{(i)} B R^{-1} B^\top K^{(i)} + K^{(i)} D (M^{(i)})^{-1} D^\top K^{(i)} \Big).$$
Taking the limit of both sides and using (50), we get
$$\lim_{i \to \infty} Q^{(i)} = -\Big( A^\top K^* + K^* A - K^* B R^{-1} B^\top K^* + K^* D (M^*)^{-1} D^\top K^* \Big).$$
Then, we denote $\lim_{i \to \infty} Q^{(i)} = Q^*$. Thus, one can conclude the following:
$$A^\top K^* + K^* A + Q^* - K^* B R^{-1} B^\top K^* + K^* D (M^*)^{-1} D^\top K^* = 0,$$
which shows that $(Q^*, R, M^*)$ and $K^*$ satisfy GARE (37). ■

Stability analysis
In this section, we show the stability of the proposed algorithm.
Since $A + BF_o$ is a stable matrix, for any $K^{(i)}$ it will always be possible to find a $\gamma^{(i)}$ large enough that $A + BF^{(i)} + DL^{(i)}$ is stable. We give some more details on the initial choice of $\gamma^{(0)}$. Notice that the check in step 5 and the iterative update of $\gamma^{(i)}$ need not be relied upon blindly: we can always evaluate $K^*$ and see whether, for the initialized $\gamma^{(0)}$, the resulting solution $K^*$ is a stabilizing one. Note that if GARE (12), for some $\gamma_1, \gamma_2$ such that $0 < \gamma_2 < \gamma_1$ and some fixed $A$, $B$, $D$, $R$, $Q$, has a solution, then $K(\gamma_1) < K(\gamma_2)$ [21].
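A quick numerical illustration of the monotonicity fact cited from [21] (with illustrative matrices, not taken from the paper):

```python
import numpy as np
from scipy import linalg

A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
D = np.array([[0.0], [0.5]])
Q = np.eye(2); R = np.eye(1)

def K_of_gamma(gamma):
    # zero-sum GARE as a CARE with stacked inputs and blkdiag(R, -gamma^2 I)
    return linalg.solve_continuous_are(
        A, np.hstack([B, D]), Q, linalg.block_diag(R, -gamma ** 2 * np.eye(1)))

K1, K2 = K_of_gamma(4.0), K_of_gamma(2.0)
print(np.linalg.eigvals(K2 - K1))   # all positive: K(4) < K(2)
```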
Before presenting the result on the stability of the proposed algorithm, we use the following result from [1].
Theorem 2. Consider an LQ zero-sum differential game described by (1) with the cost function given by (5). The game has, for every initial state, a feedback NE if and only if the following Riccati equation
$$A^\top K + K A + Q - K B R^{-1} B^\top K + K D M^{-1} D^\top K = 0 \tag{57}$$
has a stabilizing solution.

Finally, the following can be concluded for the algorithm.

Theorem 3. The output of Algorithm 1, given the observed trajectories $(x_o, u_o)$ generated by a game $(A, B, D, Q_o, R_o, M_o)$ described in Section 2, is the tuple $(Q^*, R, M^*)$ such that, combined with the known dynamics $(A, B, D)$, it forms a game with the unique NE feedback law for player 1 identical to that of the $(A, B, D, Q_o, R_o, M_o)$ game, i.e., $F^* = F_o$.

Proof. In view of Theorem 1 and the validity of (34) via step 5 in Algorithm 1, one concludes that $K^*$ is both a solution of GARE (57) and stabilizing. ■
Corollary 1. There exist $\bar{\alpha} > 0$ and $N$ such that, for $i = N, N+1, \dots$, the tuple $(Q^{(i)}, R, M^{(i)})$, where $M^{(i)} = (\gamma^{(i)})^2 I_p$, forms a GARE for which $K^{(i)}$ is a stabilizing solution, i.e.,
$$\mathrm{Re}\, \lambda_j \big( A + BF^{(i)} + DL^{(i)} \big) < 0 \quad \text{for all } j. \tag{60}$$

Proof. Note that, using the definition of $s^{(i)}$ in (42), $A + BF^{(i)} + DL^{(i)}$ can be rewritten as
$$A + BF^{(i)} + DL^{(i)} = A + B\hat{F}_o - Bs^{(i)} + DL^{(i)}.$$
As shown in Lemma 1, $A + BF_o$ is stable; and $\gamma^{(i)}$ in $L^{(i)}$ is updated, if needed, in a way that guarantees the stability of $A + BF_o + DL^{(i)}$. Thus, the term $Bs^{(i)}$ is the only one that might violate the stability condition (60). However, $s^{(i)}$ decreases with each $i$ and, starting from some $i = N$, $Bs^{(i)}$ is small enough that (60) is satisfied. Now, we have that $(F^{(N)}, L^{(N)})$ and $(F^*, L^*)$, where $F^* = F_o$ is the terminal feedback law of player 1, are both stabilizing pairs. Since $K^{(i)}$ linearly affects $(F^{(i)}, L^{(i)})$ for $i = N, N+1, \dots$, and $(F^{(i)}, L^{(i)})$ is a result of the gradient descent update from $(F^{(N)}, L^{(N)})$ in the direction of $(F^*, L^*)$, there exists $\alpha = \bar{\alpha}$ in (43) that guarantees the stability of $(F^{(i)}, L^{(i)})$ [19]. Hence, $Q^{(i)}$, updated via (36) using $K^{(i)}$, is stabilizing. This completes the proof. ■

Remark 3. $N$ might be reduced by increasing $\gamma^{(i)}$, since a bigger $\gamma^{(i)}$ changes the eigenvalues of $A + BF^{(i)} + DL^{(i)}$ (whose real parts are all negative) in a non-increasing way. In fact, picking $\gamma^{(0)}$ such that $K^{(0)}$ and the resulting $K^*$ for $\gamma^* = \gamma^{(0)}$ are both stabilizing guarantees that every $K^{(i)}$ is stabilizing, i.e., $N = 0$. Practical advice for implementing the algorithm is therefore to choose a ''high'' $\gamma^{(0)}$ from the beginning.

Characterization of the solutions
In this section, we provide a discussion and results on the characterization of the possible output of the algorithm.
Note that we are looking for $(Q^*, R, M^*)$ such that, with the known $(A, B, D)$, they form GARE (12) with a stabilizing solution $K^*$ satisfying $-R^{-1} B^\top K^* = F_o$. If $B$ does not have full rank, there might be an infinite number of possible $K^*$ [14].

Remark 4. All possible outputs of Algorithm 1, i.e., $Q^*$, $\gamma^*$, and $K^*$, satisfy the following equality
$$A^\top (K_o - K^*) + (K_o - K^*) A + (Q_o - Q^*) + (K_o - K^*) B F_o + K_o D M_o^{-1} D^\top K_o - K^* D (M^*)^{-1} D^\top K^* = 0, \tag{63}$$
which is obtained by subtracting (37) from (19) and using $F_o = -R_o^{-1} B^\top K_o = -R^{-1} B^\top K^*$.

Simulations
In this section, we present simulation results of the model-based algorithm developed in this paper.

Simulation results 1
Consider the following continuous-time system dynamics
$$\dot{x} = Ax + Bu + Dd.$$
The observed NE trajectories are generated for the game with the weight matrices $(Q_o, R_o, M_o)$; given this game, $F_o$ and $K_o$ follow from (17) and (19). The initialized parameters are $Q^{(0)}$, $R$, and $M^{(0)}$ with $\gamma^{(0)} = 2$. The learning rate is set to $\alpha = 0.1$. The solution generated by the algorithm is the tuple $(Q^*, R, M^*)$ with the corresponding value matrix $K^*$. The resulting dynamics $A + BF^* + DL^*$ is stable, as shown in Fig. 1(a). The convergence of the iterations of $F^{(i)}$, $K^{(i)}$, and $Q^{(i)}$ is shown in Fig. 1(b).

Simulation results 2
In this example, we use the dynamics and the cost function provided in [18]. Consider the continuous-time system dynamics $\dot{x} = Ax + Bu + Dd$ with the matrices taken from [18]. The observed NE trajectories are generated for the game with the corresponding weight matrices $(Q_o, R_o, M_o)$; given this game, $F_o$ and $K_o$ follow from (17) and (19). The initialized parameters are $Q^{(0)}$, $R$, and $M^{(0)}$ with $\gamma^{(0)} = \sqrt{2}$. The learning rate is set to $\alpha = 0.1$. The solution generated by the algorithm is the tuple $(Q^*, R, M^*)$ with the corresponding value matrix $K^*$. The resulting dynamics $A + BF^* + DL^*$ is stable, as shown in Fig. 2(a). The convergence of the iterative procedure is shown in Fig. 2(b).
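For illustration, the following hypothetical end-to-end run (not the paper's example; it assumes the helper functions sketched in Section 3 are in scope) fabricates an observed game, samples state/input pairs from its NE feedback law $u_o = F_o x_o$, and recovers an equivalent game.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
D = np.array([[0.0], [0.5]])

# "Observed" game (unknown to the algorithm): Q_o = diag(2, 1), R_o = I,
# gamma_o = 3; its NE feedback law generates the data.
K_o, _ = solve_gare(A, B, D, np.diag([2.0, 1.0]), np.eye(1), 3.0)
F_o = player1_gain(K_o, B, np.eye(1))

X = rng.standard_normal((2, 20))   # k = 20 >= n sampled states
U = F_o @ X                        # player 1's NE inputs at those states

Q_star, R, M_star, K_star = inverse_zero_sum_game(
    A, B, D, X, U, R=np.eye(1), gamma0=2.0, Q0=np.eye(2))
print(player1_gain(K_star, B, R) - F_o)   # should be close to zero
```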

Conclusion
In this paper, we provided an algorithm that solves the inverse problem for linear-quadratic zero-sum differential games. We showed that the algorithm's output is a set of weight matrices that, together with the known dynamics, forms an equivalent game for one of the players. After proving the convergence of the algorithm to a desired output, we provided simulations to demonstrate the effectiveness of the proposed method.
The presented algorithm has the potential for extension to model-free (neither the plant matrix A nor the control input matrices B, D are known) or partially model-free (the plant matrix A is unknown) settings. The steps of the algorithm that require the plant matrix A are related to solving the initialized GARE and the inverse update of the matrix Q. These steps might be implemented in different ways if methods that find the optimal controller for the ARE without knowledge of the dynamics are exploited [25,26]. Note that in the case of unknown control input matrices, the gradient update step might require changes in order to avoid using the matrix B (or D in the case of player 2).
For future work, the case of a general-sum game will be considered, where, instead of the GARE described in this work, coupled algebraic Riccati equations arise [18].

Declaration of competing interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Ming Cao reports that financial support was provided by the European Research Council. Co-author Prof. Ming Cao serves as a senior editor for the journal Systems & Control Letters.


Fig. 1. (a) The stability of the observed and resulting dynamics. (b) Convergence of the norm over the iterations of $F^{(i)}$, $K^{(i)}$, and $Q^{(i)}$.

Fig. 2. (a) The stability of the observed and resulting dynamics. (b) Convergence of the norm over the iterations of $F^{(i)}$, $K^{(i)}$, and $Q^{(i)}$.