Statistically consistent inverse optimal control for discrete-time indefinite linear-quadratic systems

The Inverse Optimal Control (IOC) problem is a structured system identification problem that aims to identify the underlying objective function from observed optimal trajectories. This provides a data-driven way to model experts' behavior. In this paper, we consider the case of discrete-time finite-horizon linear-quadratic problems where: the quadratic cost term in the objective is not necessarily positive semi-definite; the planning horizon is a random variable; there is both process noise and observation noise; the dynamics can have a drift term; and the objective can contain a linear cost term. In this setting, we first formulate necessary and sufficient conditions for when the forward optimal control problem is solvable. Next, we show that the corresponding IOC problem is identifiable. Using the conditions for existence of an optimum of the forward problem, we then formulate an estimator for the parameters in the objective function of the forward problem as the globally optimal solution to a convex optimization problem, and prove that the estimator is statistically consistent. Finally, the performance of the algorithm is demonstrated on two numerical examples.


Introduction
Optimal control is a powerful framework in which control decisions are made in order to minimize a given objective function; see, e.g., one of the monographs [2,6]. In fact, many processes in nature can be modeled as optimal control problems with respect to some criterion [1]. However, in applications of optimal control, a fundamental problem is to design an appropriate objective function. In order to induce an appropriate control response, the objective function needs to be adapted to the contextual environment in which the system is operating. This is a difficult task, which relies heavily on the designer's experience and imagination.
Instead of designing the cost criterion, one way to overcome this difficulty is to identify the cost function from observations of an expert system that behaves "optimally" in the environment, and thus to "imitate" the expert behaviour. The latter is known as Inverse Optimal Control (IOC) [17], and has received considerable attention. In particular, IOC reconstructs the objective function of the expert system, and hence predicts the closed-loop system's behaviour, using observed data as well as knowledge of the underlying system dynamics. The problem can be categorized as gray-box system identification [22, p. 13], and is also closely related to inverse reinforcement learning [23].
As one of the most classical optimal controller designs, linear-quadratic optimal regulators have been widely used in engineering. Though most of the literature considers the case of Q (the penalty parameter for the states) being positive semi-definite, indefinite linear-quadratic optimal control [8,10,11,27,28] has found applications in, e.g., mathematical finance [43], crowd evacuation [34,35], and controller design for automatic steering of ships [30]. To this end, we are motivated to develop an IOC framework for general indefinite linear-quadratic optimal control. The linear-quadratic IOC problem has been studied under many different settings, including the infinite-horizon case in both continuous time [2,7] and discrete time [26], as well as the finite-horizon case in both continuous time [19,20] and discrete time [18,39-42]. IOC is also closely connected to inverse reinforcement learning [23], and this perspective has been used in [21,37,38] to consider infinite-horizon discrete-time and continuous-time linear-quadratic set-ups for regulation, tracking, and adversary scenarios, respectively. However, to the best of our knowledge, IOC frameworks for general indefinite linear-quadratic optimal control have not been considered yet. Furthermore, there are also other important aspects that have not been fully investigated in the aforementioned literature. More precisely, any real-world data inevitably contains noise: process noise, observation noise, or both. Therefore, from robustness and accuracy perspectives, it is important to have a statistically consistent estimator, i.e., one that converges to the true underlying parameter values as the number of observations grows. Moreover, many of the aforementioned IOC algorithms that are based on optimization either suffer from the fact that the estimation problems are nonconvex [18,39,42], and can therefore have issues with local minima, or
suffer from the fact that the estimation procedure needs to know the control gain a priori [2,7,19,20,26]. In the latter case, the estimation normally needs to be done in a two-stage procedure, and the information is thus not used in the most efficient way. Furthermore, most of the literature on linear-quadratic IOC considers the regulation problem. However, in many experimental set-ups, an expert agent may have more complicated tasks than regulation, e.g., tracking a reference signal. Finally, real-world data can be of different time lengths, and this needs to be handled in a systematic way in order not to deteriorate the estimates.
In this work, we address these issues. More specifically, we consider the generalized linear-quadratic, indefinite, discrete-time IOC problem. We extend our previous work [40], where we considered the standard linear-quadratic IOC problem for indistinguishable agents with noisy measurements and a fixed time-horizon length only. We also extend the conference paper [41], where we only considered the linear-quadratic positive semi-definite tracking case with random planning horizons. In particular, the contribution of this work is three-fold: (1) We give necessary and sufficient conditions for the well-posedness of the generalized discrete-time finite-horizon indefinite linear-quadratic optimal control problem. Specifically, we do not assume that the running cost is positive semi-definite, and we include a linear term in the objective function, a forcing term in the dynamics, and process noise. (2) We prove identifiability of the parameters in the objective function. (3) We construct an IOC algorithm that works for both the positive semi-definite and the indefinite case.
The algorithm is based on convex optimization, and we show that its unique optimal solution is the "true" underlying parameters in the objective function of the forward problem. Moreover, in practice the constructed optimization problem is approximated by an empirical mean, and we show that the estimator based on the approximated optimization problem is statistically consistent. This means that the estimator converges in probability to the true underlying parameters as the number of observations goes to infinity. In addition, the convex optimization formulation guarantees that the statistically consistent estimate, which corresponds to the global optimum, can actually be attained in practice.
The article is organized as follows. In Section 2, we formulate the forward and inverse problems, and specify the assumptions we use. Next, in Section 3, we analyze the forward problem and prove necessary and sufficient conditions for when it is well-posed. Section 4 investigates the identifiability of the formulated IOC problem, and in particular proves that it is identifiable. In Section 5, we construct the IOC algorithm and the estimator, and prove that the latter is statistically consistent. To illustrate the performance of the proposed algorithm, in Section 6 we include a numerical example. Finally, we conclude the paper in Section 7. For improved readability, some proofs are deferred to the Appendix.
Notation: S^n denotes the set of n×n symmetric matrices, while S^n_+ denotes the set of n×n positive semi-definite matrices. ‖·‖ denotes the l_2-norm, B^n_ϕ(x) denotes the open ball of radius ϕ centered at x, and B̄^n_ϕ(x) denotes its closure. Further, G† denotes the Moore-Penrose pseudo-inverse of G. For integers ν_1, ν_2, we let ν_1 : ν_2 denote ν_1, ν_1 + 1, . . . , ν_2, and define the expression as empty if ν_2 < ν_1. In addition, by Im and ker we denote the image space and kernel, respectively, by ⊥ we denote the orthogonal complement, and by ¬ we denote the (mathematical) negation of a statement. Finally, we use italic bold font to denote stochastic elements, and use →^p to denote convergence in probability, i.e., for a sequence of random elements {a_i}_{i=1}^∞ and a random element a, a_i →^p a means that for all ε > 0, lim_{i→∞} P(‖a_i − a‖ > ε) = 0.

Problem formulation
We start by introducing the mathematical formulation of the forward optimal control problem. To this end, let (Ω, F, P) be a probability space that carries the random vectors of the problem (in particular the initial state x̄) and a random variable N ∈ {2, 3, . . . , ν} ⊂ Z_+. It is assumed that for each realization (x, N) of the random element (x̄, N) (corresponding to the initial position and planning horizon length), the agent's control decision u_t is determined by a stochastic generalized linear-quadratic control problem (1), minimized over x_{1:ν}, u_{1:ν}, where q, d ∈ R^n. More specifically, the minimization in (1) is over admissible control strategies with complete state information, i.e., u_t is a function that maps from R^n to R^m, u_t : x_t → u_t(x_t) (see, e.g., [29, Chp. 8]). Furthermore, the reason why we only consider a time-invariant external forcing term d in (1b) is to simplify the presentation. The results of the paper also hold for time-varying forcing terms d_t, provided that the agent knows all the future forcing terms for t = ν − N + 1 : ν − 1. Similarly, this formulation can be extended to tracking problems by letting the linear cost term be time-varying, q_t = Qx^r_t, where x^r_t is the reference signal, and the results in the paper follow analogously (cf. [41]). In addition, we will in general make the following assumptions.
The rationale behind the first assumption is that we are considering a controllable discrete-time linear system that is not over-actuated, and that is sampled from a continuous-time system.
The formulation (1) is motivated by the fact that an agent can have different time-horizon lengths (and initial values) to complete different tasks, while the underlying decision principle (i.e., the running cost) remains unchanged, since the principle is connected to the agent's characteristics. In particular, given a realization of the time-horizon length N = N and initial value x̄ = x, the agent starts to apply its control from the initial value x at the time instant t = ν − N + 1, and the agent maintains the same running cost for each control execution. This formulation gives a systematic way to handle real-world data with different time lengths (see [41]). However, due to the fact that Q might not be positive semi-definite, (1) might not admit an optimal solution (see, e.g., [10,11]). We analyze the well-posedness of the forward problem (1) in depth in Section 3. To proceed, we first make the following mild assumptions regarding the stochastic planning horizon N.
The above assumption means that the longest possible planning horizon is known, that it is sufficiently long, and that the longest horizon ν can be realized, i.e., it has nonzero probability. In addition, we have the following proposition, which illustrates why we emphasize the longest time-horizon length in Assumption 2.3.

Proposition 2.4 Under Assumptions 2.1 and 2.2, if the optimal control problem (1) with the objective function given by (Q̄, q̄, R̄) admits a solution for planning horizon N = ν for any x ∈ R^n, then it admits a solution for all N = 2 : ν for any x ∈ R^n.
Proof. See appendix. □

With Assumption 2.3 and Proposition 2.4 in mind, we are thus interested in parameters that belong to the following set: F(R̄) := {(Q, q) : the optimal control problem (1) with the objective function given by (Q, q, R̄) admits solutions a.s. under the distribution of x̄ and for all N ∈ {2, 3, . . . , ν}}.
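Membership in F(R̄) can be checked numerically by running a backward recursion of the generalized Riccati type and monitoring positive semi-definiteness of R̄_t at every step. The following is a minimal sketch; the function name, the tolerance, and the terminal condition P_ν = Q are our illustrative choices, since the exact recursion (3) involves further terms not reproduced here.

```python
import numpy as np

def riccati_check(A, B, Q, R, nu, tol=1e-9):
    """Backward Riccati-type recursion for a finite-horizon LQ problem with
    possibly indefinite Q. Returns the P_t sequence (forward-in-time order)
    and whether R_t = R + B^T P_{t+1} B stayed positive semi-definite."""
    P = Q.astype(float).copy()      # terminal condition (illustrative choice)
    Ps = [P.copy()]
    solvable = True
    for _ in range(nu - 1):
        Rt = R + B.T @ P @ B
        St = B.T @ P @ A
        # solvability condition: R_t must be positive semi-definite
        if np.min(np.linalg.eigvalsh(Rt)) < -tol:
            solvable = False
        # pseudo-inverse handles a possibly singular R_t
        P = Q + A.T @ P @ A - St.T @ np.linalg.pinv(Rt) @ St
        P = (P + P.T) / 2           # re-symmetrize against round-off
        Ps.append(P.copy())
    return Ps[::-1], solvable
```

For a mildly indefinite Q the check may still pass, which is exactly the regime the paper targets; for a strongly negative Q the condition fails already at the terminal step.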
More specifically, if (Q̄, q̄) ∈ F(R̄), then the forward problem is well-posed for any N ∈ {2, . . . , ν}, almost surely under the distribution of x̄. In addition, for the inverse problem we assume that the observations of the optimal states x_{1:ν} are contaminated by observation noise, as in (2), and that we observe M trials of the agent. More precisely, let y^i_t have the same distribution as y_t, for all i = 1 : M and all t = 1 : ν. Then the observed M trajectories of the agent's trials, {y^i_t}_{i=1}^M, are realizations of the I.I.D. random vectors {y^i_t}_{i=1}^M.
Before we formulate the IOC problem, we also make the following assumptions.
When solving the corresponding inverse problem in practice, we can always set ϕ arbitrarily large if we have no prior knowledge of a norm bound on the parameters. With the set-up presented in this section, we can now summarize the IOC problem considered in this paper.
For the sake of simplicity, we consider the case R = I when designing the IOC algorithm.
Problem 2.7 (General stochastic linear-quadratic IOC) Suppose the unknown (Q̄, q̄) ∈ F(I). Given the optimal state trajectory observations {y^i_t}_{t=1}^ν of the agent's trials i = 1 : M that are governed by (1), estimate the corresponding (Q̄, q̄) in the objective function (1a) that governs the agent's motion.

Forward problem analysis
Before we present the IOC algorithm set-up, we first need to analyze the forward problem. More precisely, we need to characterize the set F(R̄) and find necessary and sufficient conditions for the existence of an optimal control in this generalized indefinite linear-quadratic setting. This is not only because we want to ensure that the forward problem is well-behaved, but also because the IOC algorithm is constructed based on the optimality conditions. Moreover, we also analyze the properties of the time-varying closed-loop system matrices that will be useful in developing the IOC algorithm. For the theoretical development in this section, we do not assume that R = I.

Necessary and sufficient conditions for existence of optimal control
To this end, we first derive the necessary and sufficient conditions for existence of an optimal control to (1). The results build on [10]; in particular, some of the proof ideas are inspired by the proof of [10, Thm. 2.1]. However, we not only extend the result to a more general setting of stochastic linear-quadratic problems, but also show that solvability of the forward problem, i.e., that (Q̄, q̄) ∈ F(R̄), can be characterized in different, but equivalent, ways. The main result of this section is the following.
In addition, if any of the four above conditions holds, then the optimal control signal u_t for (1) is parametrized by an arbitrary vector λ_t ∈ R^m and is given by (9), where P^{ker(R̄_t)} = I − R̄_t† R̄_t is the projection operator onto the kernel of R̄_t.

Proof. First, assume that 4) holds. Note that for an agent with a planning horizon realization N = ν, by the principle of optimality in dynamic programming (see, e.g., [6, p. 18]), we can easily show that 4) =⇒ 1) (cf. [6, Prop. 1.3.1]). For the case N < ν, note that the HJBE is still valid for t = ν − N + 1 : ν, and thus the agent's behavior is still optimal on t = ν − N + 1 : ν. Since the system's behavior for t = 1 : ν − N is completely determined by (1c) and (1e), by Bellman's principle of optimality the solution to the HJBE still gives the optimal control. This implies 1), and hence shows that 4) =⇒ 1).
Before we proceed, first note that (10) holds for any N ∈ {2, . . . , ν}. Taking the expectation with respect to w_{ν−N+1:ν−1} on both sides of (10), adding the latter expression to (1a), and using (1b), (3a), (3c), and Assumption 2.2, we can rewrite the objective function accordingly, with H̄_t in the form of (5b) and ξ̄_t = ḡ_t^T R̄_t† ḡ_t in H̄_t. Note that the term Υ_t in the resulting expression is constant with respect to the state and the control, and hence can be discarded in the optimization problem. On the other hand, since {P̄_t}_{t=1}^ν and {η̄_t}_{t=1}^ν are generated by (3), H̄_t can also be written as in (12). Now, we use the above trick to prove that ¬2) =⇒ ¬1). Suppose (4d) and (4e) cease to hold at the Nth backward iteration of (3), where N ∈ {2, . . . , ν}. Namely, R̄_{ν−N+t} ⪰ 0 and ker(R̄_{ν−N+t}) ⊂ ker(S̄^T_{ν−N+t}) ∩ ker(ḡ^T_{ν−N+t}) still hold for t = 2 : N but not for t = 1. We proceed by showing that this implies that, for this planning-horizon-length realization N, (1) is not bounded from below. In particular, using (12) and ker(R̄_{ν−N+t}) ⊂ ker(S̄^T_{ν−N+t}) for t = 2 : N, and in view of (1d), (1c), and the fact that {w_t}_{t=1}^∞ is independent of the other random elements by Assumption 2.2, for the given "initial state" realization x_{ν−N+1} = x ∈ R^n from which the agent starts tracking, the objective function can be rewritten accordingly. Notably, a bound in terms of σ_max(·), the largest eigenvalue of a matrix, holds. Also note that, for any u_{ν−N+1} and x_{ν−N+1} = x, by the dynamics (1b), if the subsequent controls are chosen as in (9) with some arbitrary λ_t, then all terms in the summation in (14) become zero. Therefore, the objective function J_N satisfies the bound in (15). If R̄_{ν−N+1} ⋡ 0, it is clear that we can choose u_{ν−N+1} = αv_−(R̄_{ν−N+1}), where v_−(R̄_{ν−N+1}) is an eigenvector that corresponds to a negative eigenvalue of R̄_{ν−N+1}. In this case, since this choice of u_{ν−N+1} does not depend on the random vectors w_{ν−N+1:ν−1}, the expectation is marginalized out. Letting α → +∞ makes J_N tend to minus infinity. Thus, unless (4d) holds, irrespective of the initial condition x there is a planning-horizon-length realization N ∈ {2, . . . , ν} such that the cost function is not bounded from below.
On the other hand, if (4d) holds but ker(R̄_{ν−N+1}) ⊄ ker(S̄^T_{ν−N+1}) ∩ ker(ḡ^T_{ν−N+1}), then there exists a vector v ∈ R^m with norm one such that R̄_{ν−N+1}v = 0 but v does not belong to the latter kernel intersection. Without loss of generality, assume that ḡ^T_{ν−N+1}v ≥ 0; otherwise we instead consider −v. By Assumption 2.5, it follows that we can find ρ > 0 such that the corresponding set of initial values has positive probability, where ṽ ∈ B^n_1(0) and where ε_1 > 0 will be determined shortly. Note that with this choice, the expectation in (15) is again marginalized out. The objective function then admits an upper bound that is linear in α. In particular, we can always make ε_1 > 0 small enough so that the term involving ρS̄^T_{ν−N+1}v dominates. By taking α → −∞, the upper bound goes to minus infinity and hence pushes J_N to minus infinity. Recalling that P(x̄ ∈ B^n_ε(ρS̄^T_{ν−N+1}v)) > 0 for all ε ∈ (0, ε_1), this shows that there exists a set of initial-value realizations x with nonzero probability such that the forward problem is ill-posed. This proves that ¬2) =⇒ ¬1), and thus that 1) =⇒ 2).
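The first unboundedness argument above can be illustrated numerically: along a ray u = αv, where v is an eigenvector of a negative eigenvalue of an indefinite R̄_t, the quadratic control cost decreases without bound as α grows. This is a toy illustration of that one mechanism, not a computation of the full cost J_N.

```python
import numpy as np

def cost_along_ray(Rt, v, alphas):
    """Quadratic control cost (1/2) u^T Rt u along the ray u = alpha * v."""
    return [0.5 * (a * v) @ Rt @ (a * v) for a in alphas]
```

For Rt = diag(1, −1) and v = (0, 1), the cost is −α²/2, so scaling α by 10 scales the cost by 100 toward minus infinity.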
Next, we prove that 2) =⇒ 4). We make the ansatz that the solution to the HJBE (6) is V_t(χ_t) = ½ χ_t^T P̄_t χ_t + η̄_t^T χ_t + γ̄_t, where P̄_{1:ν}, η̄_{1:ν}, and γ̄_{1:ν} are generated by (3) and (8), respectively. The ansatz fulfils (6a) and (8). Plugging the ansatz into (6b) gives (16). To verify that it has a solution, and to work out the optimal control, we take the derivative of (16) with respect to µ_t and equate it to zero. Since (4e) holds, S̄_t χ_t + ḡ_t lies in ker(R̄_t)^⊥ = Im(R̄_t) (see, e.g., [9, Prop. 2.3] for the last equality). Hence, the stationarity equation has a solution, and the control signal takes the form (17), which minimizes the right-hand side of (16). Plugging (17) into (16), using the property (4e), and in view of (3) and (8), the quadratic, first-order, and constant terms in χ_t equate between the left- and right-hand sides. Thus the ansatz is indeed a solution to the HJBE.
Now we prove 3) =⇒ 2). If (5) holds, then since rank(H̄_t) = rank(R̄_t), we must have that H̄_t\R̄_t = 0, and hence (3) follows. This finishes the proof of the entire theorem. □

Before we continue, we make a few remarks related to Theorem 3.1 and the proof.
Remark 3.2 From the above proof, it is interesting to note that, in the equivalence between points 2) and 3) in Theorem 3.1, (5a) and (5c) are equivalent to the Riccati recursions in (3) being satisfied, while (5b), i.e., H̄_t ⪰ 0, is equivalent to (4d) and (4e).
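The parametrized optimal control in (9) can be sketched directly with the Moore-Penrose pseudo-inverse. The function below is our own sketch with the theorem's symbol names; when R̄_t is singular, the kernel projection makes every choice of λ_t an optimizer, and when R̄_t is invertible the second term vanishes.

```python
import numpy as np

def optimal_control(Rt, St, gt, x, lam):
    """u_t = -Rt^†(St x + gt) + (I - Rt^† Rt) lam.
    The second term projects lam onto ker(Rt), so it only contributes
    when Rt is singular."""
    Rp = np.linalg.pinv(Rt)
    P_ker = np.eye(Rt.shape[0]) - Rp @ Rt   # projector onto ker(Rt)
    return -Rp @ (St @ x + gt) + P_ker @ lam
```

In the singular example Rt = diag(2, 0), the first control component is pinned down while the second is free and equal to the corresponding component of λ_t.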
Remark 3.3 By introducing the extended state x̃_t = [x_t^T, 1]^T, with system matrices (Ã, B̃) and cost matrices given by (18), we can indeed rewrite the forward problem (1) as a "standard" LQ problem (albeit a possibly indefinite one), namely (19), with initial value x̃ = [x^T, 1]^T and with w̃_t = [w_t^T, 0]^T. However, note that (19) is not a classic LQR problem, in the sense that (Ã, B̃) is not controllable. Moreover, x̃ does not satisfy Assumption 2.5. Furthermore, extending (19) to time-varying d_t, or to the tracking problem where q_t = Qx^r_t and x^r_t is the reference signal, is not possible without exploiting the problem structure (18). Therefore, existing IOC results cannot be directly applied to (19).

Analysis of the closed-loop system matrices
In view of (9), it seems like there might be infinitely many optimal choices of the control signal. However, as we shall see, under the assumptions we impose, the optimal control signal for the considered forward problem (1) is unique.

Proposition 3.4 Let R ≻ 0. Under Assumptions 2.1, 2.2 and 2.5, (Q̄, q̄) ∈ F(R) is equivalent to R̄_t ≻ 0 for t = 1 : ν − 1.
Proof. The implication "⇐=" follows from Theorem 3.1. To prove the implication "=⇒": by Theorem 3.1, since (Q̄, q̄) ∈ F(R), there exist matrices and vectors that fulfil (3) and (4). In particular, since A is invertible, (4a) and (4b) give a relation that holds for t = 1 : ν − 1. Now, by (4e) we have that ker(R̄_t) ⊂ ker(S̄_t^T) holds for t = 1 : ν − 1. In particular, this means that z^T R z = 0 for any z ∈ ker(R̄_t), and since R ≻ 0 this means that z = 0. Therefore, the only vector in ker(R̄_t) is the zero vector, and since R̄_t is positive semi-definite (see (4d)), this implies that R̄_t is in fact strictly positive definite for t = 1 : ν − 1. □

Corollary 3.5 Under the assumptions in Proposition 3.4, the optimal control signal for problem (1) takes the form (9) with the kernel term vanishing, i.e., u_t = −R̄_t^{-1}(S̄_t x_t + ḡ_t).

Remark 3.6 Even if R is not strictly positive definite, by the proof of Proposition 3.4 we can still partially characterize ker(R̄_t). In particular, z ∈ ker(R̄_t) implies that z^T R z = 0. If R is full rank, this is the neutral subspace in the indefinite inner product space defined by R; see, e.g., [12, Chp. 2].
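With the unique affine feedback of Corollary 3.5, the closed-loop behavior can be simulated by a simple rollout. In the sketch below, the gain K_t = R̄_t^{-1} S̄_t and offset k_t = R̄_t^{-1} ḡ_t are assumed to have been precomputed from the backward recursion; the function and argument names are ours.

```python
import numpy as np

def simulate(A, B, K_seq, k_seq, x0, d, rng, sigma_w=0.01):
    """Roll out x_{t+1} = A x_t - B (K_t x_t + k_t) + d + w_t under the
    time-varying affine optimal feedback u_t = -(K_t x_t + k_t)."""
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for K, k in zip(K_seq, k_seq):
        w = sigma_w * rng.standard_normal(len(x))   # process noise
        x = A @ x - B @ (K @ x + k) + d + w
        traj.append(x.copy())
    return np.array(traj)
```

Setting sigma_w = 0 recovers the noise-free closed-loop trajectory, which is convenient for sanity-checking the gains.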
By Corollary 3.5, under the conditions in Proposition 3.4 there is a unique solution to the forward optimal control problem (1). In this case, the system's behavior (after it starts applying control) is determined by x̃_{t+1} = Ãcl(t; Q̄, q̄) x̃_t + w̃_t for t = ν − N + 1 : ν, where Ãcl(t; Q̄, q̄) is the closed-loop system matrix at time t for the extended state-space model (see Remark 3.3). In particular, Ãcl(t; Q̄, q̄) is given by (21), with R̄_t, S̄_t, and ḡ_t as in (4), and hence it implicitly depends on (Q̄, q̄) in the objective function (1a). This means that the (conditional) distributions of the agent's optimal trajectory, P(x_{1:ν} | N = N, x̄ = x), and optimal control, P(u_{1:ν−1} | N = N, x̄ = x), are implicitly given by solving (1). Moreover, under mild regularity conditions on the probability distribution of (x̄, N), the formulation in (1) then defines joint probability distributions for (x_{1:ν}, N, x̄) and (u_{1:ν−1}, N, x̄) (cf. [32, Prop. 1.15] or [16, Thm. 5.3]). Before we continue with the identifiability analysis, we present the following corollary, which is useful in the analysis to come.
Corollary 3.7 Under the assumptions in Proposition 3.4, given any R ≻ 0, for any (Q, q) ∈ F(R), let {P_t}_{t=1}^ν be the solution to (3) that corresponds to Q. Accordingly, let R_t, S_t, and g_t be defined as in Theorem 3.1 for t = 1 : ν − 1. Then the matrix A_cl(t; Q), as well as the matrix Ãcl(t; Q, q) in (21), is invertible for all t = 1 : ν − 1.
Proof. The fact that the matrix A_cl(t; Q) is invertible for t = 1 : ν − 1 follows by an argument similar to the proof of [42, Thm. 2.1], since R ≻ 0 and R_t ≻ 0 hold for t = 1 : ν − 1. To show that Ãcl(t; Q, q) has full rank, we simply note that it has the upper block-triangular form (21), and since A_cl(t; Q) has full rank, so does Ãcl(t; Q, q). □

Identifiability analysis and persistent excitation

Next, we investigate the inverse problem of recovering (Q̄, q̄) from observations of optimal trajectories to problem (1). Throughout the rest of the paper we will therefore, unless explicitly stated otherwise, assume that R = I and (Q̄, q̄) ∈ F(R = I). We start by considering the identifiability of the problem.
To this end, first note from the analysis in Section 3 that, for any parameters (Q, q) ∈ F(I), the agent's behavior is completely determined by the time-varying closed-loop system matrices Ãcl(t; Q, q). Therefore, the fundamental question for identifiability is whether there exist two different sets of parameters (Q, q) and (Q′, q′) such that Ãcl(t; Q, q) = Ãcl(t; Q′, q′) for all t = 1 : ν − 1.
Writing (22) in compact form gives a linear system of equations in ∆q with coefficient matrix Γ. Since (A, B) is controllable by Assumption 2.1, and ν ≥ n + 1 by Assumption 2.3, the matrix Γ has full column rank, and hence ∆q = 0. The facts that ∆Q = 0 and ∆q = 0 imply that the two parameter sets coincide. This means that the parameters (Q, q) that characterize the closed-loop system matrices are identifiable. Moreover, in view of (1), we can see (x̄, N) as the "input" of the model and y_{1:ν} as the "output". To this end, in order to uniquely identify the parameters (Q̄, q̄), the "input" (x̄, N) needs to be "persistently exciting". Notably, Assumption 2.3 gives the persistent excitation assumption regarding N. Moreover, Assumption 2.5 turns out to give a persistent excitation condition on the initial value x̄. In fact, we have the following result, the proof of which is deferred to the appendix. This result can in turn be used to prove the following lemma, which is useful in the IOC algorithm construction to come; its proof is also deferred to the appendix, and it in particular guarantees that the relevant second moments are finite, E[·] < ∞, for all t = 1 : ν.
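The full-column-rank argument for Γ rests on controllability of (A, B) (Assumption 2.1), which can be verified with the standard Kalman rank test. This is a generic textbook check, not code from the paper.

```python
import numpy as np

def is_controllable(A, B):
    """Kalman rank test: (A, B) is controllable iff the controllability
    matrix [B, AB, ..., A^{n-1}B] has full row rank n."""
    n = A.shape[0]
    C = np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(n)])
    return np.linalg.matrix_rank(C) == n
```

A double integrator with force input is controllable; a pair in which the input never reaches one state direction is not.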

The IOC algorithm
In this section, we construct the IOC algorithm for general linear-quadratic systems with different time-horizon lengths. In particular, we show that the algorithm is statistically consistent, i.e., that it converges in probability to the true underlying parameters. For the sake of brevity, in what follows we sometimes use the notation (•) for the arguments of some functions.

Construction and empirical approximation
To this end, the algorithm is constructed based on the necessary and sufficient optimality conditions in Theorem 3.1. More precisely, we are interested in finding the (Q*, q*) ∈ F(I) that best fits the collected optimal state trajectory observations {y^i_t}_{t=1}^ν of the agent's trials i = 1 : M. Thus, the IOC algorithm is built upon an optimization problem whose optimal solution (Q*, q*) is constructed to be unique and equal to the "true" (Q̄, q̄) ∈ F(I).
First, we construct the objective function for the optimization problem. Given a realization of the planning horizon N, a state χ_t, and a control signal µ_t, we first define the "violation" of the HJBE at time step t by (23), where V_t(•) has the form (7), and where P_{t:t+1}, η_{t:t+1}, and γ_{t:t+1} in V_t(•) are determined by (Q, q) via (3). Note that by (6), if µ_t is not optimal with respect to (Q, q), the "violation" ψ_{t,N}(Q, q) is greater than zero. This is the main idea behind the construction of the objective function.
Now, plugging (7) and (8) into the above equation, we obtain an explicit expression for the violation. Given a realization of the planning horizon N, let x_{ν−N+1:ν} and u_{ν−N+1:ν−1} be the optimal trajectory and control. We let χ_t = x_t and µ_t = u_t, and take the expectation of ψ_{t,N}(Q, q; x_t, u_t) with respect to x_t | N = N; in view of (1b), this gives (24). However, the above expression is constructed based on x_t, while the observations are in terms of y_t. To rewrite it in terms of y_t, we first simply add and subtract some terms in the expression. On the other hand, by Assumption 2.2, {v_t}_{t=1}^∞ is independent of all other stochastic elements. Using the cyclic permutation property of the matrix trace operator, expectations of the quadratic terms can be expressed via traces against second moments; similarly, d^T P_{t+1} d = tr(P_{t+1} dd^T). In view of (8), (2), and the facts that E[w_t] = 0 and E[v_t] = 0, using (24) and (25) we can rewrite E[ψ̃_{t,N}(Q, q, P_{t:t+1}, η_t, ξ_t; y_{t:t+1})], where we introduce ξ_t := g_t^T R_t† g_t. We construct the objective function Ψ(Q, q, P_{1:ν}, η_{1:ν}, ξ_{1:ν−1}) by summing the above expression over t, discarding terms that are constants, and taking the expectation over N; this yields (26). The objective function (26) can be rewritten as a joint expectation over y_{1:ν} and N. However, we find the form in (26) more useful both in the analysis and in the implementation.
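The rewriting above relies on the identity E[y^T P y] = tr(P E[y y^T]), i.e., expectations of quadratic forms reduce to traces against second moments. A one-line numerical form of the identity, with μ and Σ denoting the mean and covariance of y:

```python
import numpy as np

def quad_expectation(P, mu, Sigma):
    """E[y^T P y] = tr(P Sigma) + mu^T P mu for y with mean mu, covariance Sigma."""
    return np.trace(P @ Sigma) + mu @ P @ mu
```

This is what lets the objective be written in terms of the observable moments E[y_t] and E[y_t y_t^T] rather than the unobserved states.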
The objective function (26) is constructed with the idea that Ψ(•) plus the discarded constant should be bounded from below by 0 for all (Q, q) ∈ F(I). Therefore, ideally we would consider the problem of finding the point (Q*, q*) that minimizes (26) subject to (5) and ξ_t = g_t^T R_t† g_t for t = 1 : ν − 1. However, while (26) is linear, and hence a convex function, neither the constraint (5) nor the constraint ξ_t = g_t^T R_t† g_t for t = 1 : ν − 1 is convex. Even so, note that the only nonconvex part of (5) is (5c). We therefore consider the relaxed, convex problem obtained by removing the constraints (5c) and ξ_t = g_t^T R_t† g_t for t = 1 : ν − 1. In particular, the optimization problem for IOC reads (28), where H_t has the same form as (5b). In Section 5.2, we prove that the unique optimal solution to this optimization problem is indeed (Q̄, q̄).
Nevertheless, the distributions of x̄, w_t, v_t, and N are usually not known a priori in practice, and hence the distributions of y_t and N are not known either. Therefore, it is not possible to calculate the objective function (26) explicitly, and hence we cannot solve (28) directly. But since we have the optimal state trajectory observations {y^i_t}_{t=1}^ν of the agent's trials, i.e., realizations of the I.I.D. random processes {y^i_t}_{t=1}^ν for i = 1 : M, we can instead estimate the objective function empirically. To this end, let M_N denote the number of observations that have a planning horizon of N time steps; clearly, the M_N sum to M. Then, for each value of N, the expectation in (26) is approximated by the corresponding empirical mean. On the other hand, approximating the expectation over N amounts to estimating the probabilities P(N = N) by the empirical frequencies M_N/M. This, together with the above expression, gives the empirical objective Ψ_E^y(•). We therefore consider the estimator (30). In practice, an estimate is obtained by solving (30) for a given realization {y^i_{1:ν}}_{i=1}^M of {y^i_{1:ν}}_{i=1}^M. We will use the notation Ψ_E^y(•)|_{y=y} to denote the objective function at the given realization.
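Estimating P(N = N) by the empirical frequencies M_N/M is a simple counting operation over the observed horizon lengths; the sketch below (with our own function name) returns the weights used to mix the per-horizon empirical means.

```python
import numpy as np

def empirical_weights(horizons, nu):
    """Empirical estimates M_N / M of P(N = N) for N = 2, ..., nu,
    from a list of observed planning-horizon lengths."""
    horizons = np.asarray(horizons)
    M = len(horizons)
    return {N: np.sum(horizons == N) / M for N in range(2, nu + 1)}
```

Horizons that never occur simply get weight zero, so the corresponding per-horizon terms drop out of the empirical objective.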

Statistical consistency analysis
In this section, we analyze the statistical consistency of the IOC algorithm. To proceed, we first show that the optimization problem (28) is well-posed, i.e., that the objective function (26) is bounded from below on the feasible domain (28a), (28b), and (28c). In addition, we show that the "true" (Q̄, q̄) is actually the unique global minimizer.

Theorem 5.1 Let (Q̄, q̄) ∈ F(I) be the "true" parameters of the stochastic linear-quadratic control problem (1) that governs the agent, and let x_{1:ν}, u_{1:ν}, and y_{1:ν} be distributed accordingly. Under Assumptions 2.1, 2.2, 2.3, 2.5, and 2.6, for any feasible solution (Q, q, P_{1:ν}, η_{1:ν}, ξ_{1:ν−1}) of the optimization problem (28), the objective function (26) is bounded from below. Moreover, for ϕ large enough, (Q̄, q̄, P̄_{1:ν}, η̄_{1:ν}, ξ̄_{1:ν−1}) is the unique globally optimal solution achieving the lower bound, where P̄_{1:ν}, η̄_{1:ν} are generated by (3).

Proof. As can be seen from the construction of the objective function in Section 5.1, and in view of (1e), the objective can be related back to the expected HJBE violations. Now, by the definition of ψ_{t,N}(Q, q; x_t, u_t) in (23), we have an expression as in (24). By expanding the parentheses, computing the expectation with respect to w_t, and using Assumption 2.2, we obtain an expression in which H_t has the form (5b). On the other hand, since (Q, q, P_{1:ν}, η_{1:ν}, ξ_{1:ν−1}) is feasible, by constraint (28c) we have that H_t ⪰ 0. Therefore, the claimed lower bound holds. This proves the first part of the theorem.
Finally, we show that the "true" (Q̄, q̄, P̄_{1:ν}, η̄_{1:ν}, ξ̄_{1:ν−1}) is actually the unique global optimizer of (28). To this end, let (Q*, q*, P*_{1:ν}, η*_{1:ν}, ξ*_{1:ν−1}) be an optimal solution to (28), and let us also use stars to denote the other vectors and matrices obtained from this optimal solution. Since the solution is optimal, it must be feasible, which implies that H*_t ⪰ 0 for all t = 1 : ν − 1. Hence it follows that R*_t ⪰ 0, ker(R*_t) ⊂ ker((S*_t)^T) ∩ ker((g*_t)^T), and H*_t\R*_t ⪰ 0; see [14, Thm. 1.20, p. 43]. In view of the above kernel containment and (32), the optimal objective value can be further rewritten. Recalling the notation x̃_t = [x_t^T, 1]^T and the fact that R*_t ⪰ 0, we in turn get that the corresponding terms must vanish for all N such that P(N = N) > 0. In particular, by Assumption 2.3 this must be true for N = ν. From Lemma 4.3, we know that E_{x_t|N=ν}[x̃_t x̃_t^T] ≻ 0. Thus it follows that, for N = ν, (33b) implies that H*_t\R*_t = 0 holds for t = 1 : ν − 1. Using the observation in Remark 3.2, we therefore have that (Q*, q*, P*_{1:ν}, η*_{1:ν}) satisfies the generalized Riccati iterations (3). Now, to show that the optimal solution to (28) is unique, first assume that (Q*, q*, P*_{1:ν}, η*_{1:ν}) is an optimal solution such that R*_t ≻ 0 for t = 1 : ν − 1. In this case, it follows by (21) that Ãcl(t; Q*, q*) = Ãcl(t; Q̄, q̄) for t = 1 : ν − 1.
Proof. It has already been established that the feasible region is convex, see the arguments in Section 5.1. Moreover, the feasible region is bounded by construction. Next, for any realization ȳ, Ψ(Q, q, P_{1:ν}, η_{1:ν}, ξ_{1:ν−1})|_{y=ȳ} is a linear function, and hence (30) is a convex problem. Moreover, since the cost function is linear, it is bounded on the compact feasible domain. Finally, by Weierstrass' theorem (see, e.g., [5, Prop. A.8]) the optimization problem (30) attains an optimal solution. 2

Theorem 5.4 Let (Q_M, q_M, P^M_{1:ν}, η^M_{1:ν}, ξ^M_{1:ν−1}) be a corresponding optimal solution to (30). Then Q_M → Q̄ and q_M → q̄ in probability as M → ∞.
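The Weierstrass argument used here, that a linear objective over a compact convex feasible set always attains its minimum, can be illustrated numerically. This is a reading aid only; the objective and constraints below are arbitrary placeholders, not the paper's problem (30):

```python
import numpy as np
from scipy.optimize import linprog

# Minimize the linear objective c^T z over a compact polytope:
# box bounds z in [0,1]^2 make the feasible region compact,
# plus one extra halfspace constraint z1 + z2 <= 1.5.
c = np.array([1.0, -2.0])
res = linprog(c, A_ub=[[1.0, 1.0]], b_ub=[1.5],
              bounds=[(0.0, 1.0), (0.0, 1.0)], method="highs")

# Compactness guarantees an optimal solution exists; here it is z = (0, 1)
# with objective value -2.
assert res.success
```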
Proof. The result follows by verifying the conditions in [36, Thm. 5.7]. In particular, since (Q_M, q_M, P^M_{1:ν}, η^M_{1:ν}, ξ^M_{1:ν−1}) is a globally optimal solution to (30), the corresponding optimality gap converges to 0 in probability as M → ∞. Finally, since the feasible regions of (28) and (30) are compact (see Lemma 5.2), and since (28) has a unique optimal solution (see Theorem 5.1), by [36, p. 46] the last condition also holds. Hence, the result follows. 2

6 Numerical example: Identification of cost in non-zero-sum pursuit-evasion game

In this section, we demonstrate the performance of the proposed IOC algorithm on a non-zero-sum two-dimensional finite-horizon linear-quadratic pursuit-evasion game, cf. [33]. For a more extensive treatment of pursuit-evasion games, see, e.g., [3]. To this end, let x_t ∈ R^2 be the distance between the pursuer and the evader, and let u^p_t, u^e_t ∈ R^2 be the control signals of the pursuer and the evader, respectively. In particular, for each realization (x̄, N̄) of (x, N), we assume that the evader solves the problem (34) of minimizing over x_{1:ν} and u^e_{1:ν}, where A = e^{Â∆t} and B = (∫_0^∆t e^{Ât} dt)B̂ are sampled from the continuous-time dynamics ẋ = Âx + B̂u^e + B̂u^p using the sampling period ∆t = 0.1, and where ν = 20. Notably, Q^e ≺ 0. If Q^e remains unknown to the pursuer, the pursuer can first use some "dummy" movements u^p_{ν−N̄+1:ν} that are easy for the evader to predict (i.e., known by the evader). For instance, if u^p_t is chosen to be constant, then it can be viewed by the evader as a constant forcing term d = Bu^p_t (cf. (1b)). The pursuer observes the noisy distance (see (2)) between the pursuer and the evader, which is the optimal solution to (34), and can use the proposed IOC algorithm to estimate Q^e. This estimate can then be used to model and predict the future movement of the evader.
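The sampling step A = e^{Â∆t}, B = (∫_0^∆t e^{Âs} ds)B̂ can be computed without explicit integration via the standard block-matrix-exponential trick. A minimal sketch; the matrices Â, B̂ below are illustrative placeholders, not the game data used in the paper:

```python
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A_c, B_c, dt):
    """Zero-order-hold discretization of xdot = A_c x + B_c u.

    Returns A = exp(A_c*dt) and B = (int_0^dt exp(A_c*s) ds) B_c,
    read off from the top rows of a single block-matrix exponential."""
    n, m = B_c.shape
    M = np.zeros((n + m, n + m))
    M[:n, :n] = A_c
    M[:n, n:] = B_c
    E = expm(M * dt)
    return E[:n, :n], E[:n, n:]

# Example with a placeholder double integrator and dt = 0.1 as in the text.
A_c = np.array([[0.0, 1.0], [0.0, 0.0]])
B_c = np.array([[0.0], [1.0]])
A, B = zoh_discretize(A_c, B_c, 0.1)
```

For the double integrator the result can be checked by hand: A = [[1, 0.1], [0, 1]] and B = [[0.005], [0.1]].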
The latter matrices were randomly generated by drawing two elements from a Wishart distribution with 2 degrees of freedom, i.e., the same number of degrees of freedom as the dimension of the state. Moreover, the Wishart distribution used a random scale matrix 0.01GG^T, where each element of G ∈ R^{2×2} was drawn from a standard normal distribution. As "dummy" movements for the pursuer, we choose u^p_t = [−1, −1]^T, t = ν − N̄ + 1 : ν, and hence the constant forcing term in the dynamics is given by d = B[−1, −1]^T. Finally, the random variable N is taken to be uniformly distributed on the integers between 2 and ν = 20.
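One way to reproduce this kind of randomization (a sketch; the paper does not specify its random-number generator or seed) is to draw W = XX^T with the columns of X i.i.d. N(0, 0.01GG^T), which is a Wishart sample with 2 degrees of freedom:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2  # state dimension = degrees of freedom

# Random scale matrix 0.01 * G G^T, with G having standard-normal entries.
G = rng.standard_normal((n, n))
scale = 0.01 * G @ G.T

# Wishart draw with n degrees of freedom: W = X X^T,
# where the n columns of X are i.i.d. N(0, scale).
L = np.linalg.cholesky(scale)
X = L @ rng.standard_normal((n, n))
W = X @ X.T
```

By construction W is symmetric positive semi-definite, as required for a quadratic-cost weight drawn this way.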
To test the performance of the algorithm, we generate 100 batches of trajectories, where each batch consists of 50000 trajectories. For each batch, we divide the trajectories into groups of size M = 100 + 100(k − 1), for k = 1, ..., 500, where each larger group contains all the trajectories of a smaller group. For each such group of trajectories, we solve the IOC problem (with ϕ set to 10^6), and this procedure is repeated for all the 100 batches. This means that we obtain estimates Q^{ℓ,M}_est, for ℓ = 1, ..., 100 and M = 100, 200, ..., 50000. For each value of M, the relative error ||Q^{ℓ,M}_est − Q^e||_F / ||Q^e||_F is averaged over the batches, and the resulting empirical mean and empirical standard deviation (as a function of M) are shown in Figure 1. From the figure we see that, in line with the statistical consistency proved in Theorem 5.4, both the mean and the standard deviation decrease with increasing M. Moreover, in Figure 1 the logarithm of the mean and the logarithm of the standard deviation appear to be (approximately) affine in log(M). The figure also shows the corresponding lines obtained by fitting an affine model to each of the two sets of logarithmic data. From this fit, we see that the mean of the relative error is approximately O(M^{−0.53}) and the standard deviation of the relative error is approximately O(M^{−0.51}). We hence suspect that the convergence rate is O(M^{−0.5}), and that √M(Q_M − Q̄) is asymptotically normal, just like most M-estimators such as maximum likelihood [36, p. 51]. However, a theoretical analysis of this is left for future work.
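The rate estimates above come from an affine fit in log-log coordinates. The following sketch shows the fitting step on synthetic errors that decay exactly as M^{-1/2}; the numbers are illustrative, not the paper's data:

```python
import numpy as np

# Synthetic relative errors with an exact M^(-1/2) decay.
M = np.array([100.0, 400.0, 1600.0, 6400.0, 25600.0])
err = 3.0 / np.sqrt(M)

# Fit log(err) = slope * log(M) + intercept;
# the slope estimates the convergence-rate exponent.
slope, intercept = np.polyfit(np.log(M), np.log(err), 1)
# Here the fitted slope is exactly -0.5, matching the suspected O(M^-0.5) rate.
```

On real Monte Carlo data the fitted slope is only approximate (hence the reported −0.53 and −0.51), since the errors contain sampling noise.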

Conclusion
In this work, we have considered the inverse optimal control problem for discrete-time finite-horizon general indefinite linear-quadratic problems with stochastic planning horizons. We first investigate the necessary and sufficient conditions for when the forward problem is solvable. The identifiability of the corresponding inverse optimal control problem is then analyzed and proved. Furthermore, based on the underlying necessary and sufficient conditions, we construct the estimator of the inverse optimal control problem as the solution to a convex optimization problem, and prove that the estimator is statistically consistent. The performance of the estimator is illustrated on a numerical example of identifying the evader's cost in a non-zero-sum pursuit-evasion game.

A Deferred proofs
Proof of Proposition 2.4. We show the contrapositive of the statement, i.e., that if (Q̄, q̄, R̄) is such that there exists an initial value x̄ ∈ R^n and a time-horizon length N̄ ∈ {2, ..., ν} so that the optimal control problem (1) is unbounded from below, then there exists an initial value x̄ so that (1) is unbounded from below for planning horizon length ν. To this end, consider planning horizon N̄ = ν. By Theorem 3.1, if (Q̄, q̄) ∉ F(R̄), then there exists an N̄ such that (4d) or (4e) does not hold for t = ν − N̄ + 1. Splitting the summation in the objective function accordingly, and following along the lines of the proof of "1) =⇒ 2)" in Theorem 3.1, similarly to (15) we get a corresponding inequality. Moreover, from the same proof we know that we can select x_{ν−N̄+1} so as to make the terms outside of the last summation unbounded from below. Now, by invertibility of A, we can select u_t = 0 and x_t = A^{−1}(x_{t+1} − d − w_t) for t = 1 : ν − N̄. For any value of x_{ν−N̄+1}, this gives an initial condition and a sequence of states and controls that fulfill the constraints (1b)-(1e) (note that (1c) and (1e) are vacuous since N̄ = ν). Moreover, it is easily seen that J_ν is bounded from above by an expression similar to the one in the previous proof, but containing an additional constant τ(x_{ν−N̄+1}). Following a logic similar to the remainder of the proof of "1) =⇒ 2)" in Theorem 3.1, the result follows. 2

Proof of Lemma 4.2. Let N̄ ∈ {2, ..., ν} be such that P(N = N̄) > 0. The matrix cov_{x̄|N=N̄}(x̄, x̄) is symmetric. Let cov_{x̄|N=N̄}(x̄, x̄) = Σ_{i=1}^n λ_i v_i v_i^T be an orthonormal eigendecomposition of the matrix. This means that we can write x̄ = Σ_{i=1}^n α_i v_i for some real-valued random variables α_i, i = 1, ..., n.
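The backward construction in the proof, u_t = 0 and x_t = A^{−1}(x_{t+1} − d − w_t), can be sanity-checked numerically: running the dynamics forward from the constructed states reproduces the chosen next state at every step. A sketch with arbitrary placeholder data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
A = rng.standard_normal((n, n)) + 3.0 * np.eye(n)  # invertible for this draw
d = rng.standard_normal(n)
w = rng.standard_normal((5, n))  # process-noise realizations w_1, ..., w_5

# Build states backwards from an arbitrary terminal point using u_t = 0 and
# x_t = A^{-1}(x_{t+1} - d - w_t), then verify that the forward dynamics
# x_{t+1} = A x_t + d + w_t hold along the constructed trajectory.
x_next = rng.standard_normal(n)
for t in reversed(range(5)):
    x_t = np.linalg.solve(A, x_next - d - w[t])
    assert np.allclose(A @ x_t + d + w[t], x_next)
    x_next = x_t
```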
Assume that cov_{x̄|N=N̄}(x̄, x̄) is not (strictly) positive definite. Then at least one eigenvalue is zero; without loss of generality, let λ_1 = 0, i.e., the variance of α_1 vanishes. By Jensen's inequality [4, Prop. 9.24], we know that (E_{x̄|N=N̄}(α_1))^2 ≤ E_{x̄|N=N̄}(α_1^2), and by following the proof of [4, Prop. 9.24], we see that equality holds if and only if α_1 is constant a.s. (otherwise, the inequality from [4, Thm. 9.23] is strict at some points). To this end, let α_1 = c a.s. for some constant c. If c = 0, then for χ = v_1, Assumption 2.5 does not hold. If c ≠ 0, then the probability mass of x̄ is located on a hyperplane defined by α_1 = c, which does not pass through the origin. In this case, let χ = v_2 and note that for radii smaller than |c| we have that P(x̄ ∈ B^n(ρv_2)) = 0 for all ρ, hence violating Assumption 2.5. Therefore, we must have cov_{x̄|N=N̄}(x̄, x̄) ≻ 0.
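The degenerate case ruled out in the proof, probability mass concentrated on a hyperplane {α_1 = c}, corresponds to a singular conditional covariance. A small numerical illustration with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
c = 0.5
# Samples whose first coordinate is constant (= c): all mass lies on a
# hyperplane not through the origin, so the sample covariance is singular
# in that direction (zero eigenvalue for the eigenvector e_1).
samples = np.column_stack([
    np.full(1000, c),
    rng.standard_normal(1000),
    rng.standard_normal(1000),
])
cov = np.cov(samples, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)  # ascending; smallest eigenvalue is 0
```

This is exactly the situation excluded by Assumption 2.5, which forces the conditional covariance to be strictly positive definite.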

Remark 3.3 By extending the state space to x̃_t = [x_t^T, 1]^T and with the corresponding augmented state-space matrices.

Fig. 1. Log-log plot of the mean and standard deviation of the relative error of Q_est as a function of the number of agents. The estimates are obtained using noisy data, as described in Section 6.

Proof of Lemma 4.3. Note that, by construction, x̃ = [x̄^T, 1]^T. Hence, by Lemma 4.2 and [14, Thm. 1.12], we know that for all N̄ such that P(N = N̄) > 0, E_{x̄|N=N̄}[x̃x̃^T] ≻ 0. On the other hand, by Assumption 2.2, w_t is independent of the noiseless z_t := Ax_t + Bu_t + d, for t = ν − N̄ + 1 : ν − 1, and for such t it thus holds that z_t and w_t are uncorrelated.