Optimal adaptive leader-follower consensus of linear multi-agent systems: Known and unknown dynamics

In this paper, the optimal adaptive leader-follower consensus of linear continuous-time multi-agent systems is considered. The error dynamics of each player depend on its neighbors' information. A detailed analysis of online optimal leader-follower consensus under known and unknown dynamics is presented. The introduced reinforcement learning-based algorithms learn online the approximate solution to algebraic Riccati equations. An optimal adaptive control technique is employed to iteratively solve the algebraic Riccati equation based on the online measured error state and input information for each agent, without requiring a priori knowledge of the system matrices. The decoupling of the multi-agent system global error dynamics facilitates the employment of policy iteration and optimal adaptive control techniques to solve the leader-follower consensus problem under known and unknown dynamics. Simulation results verify the effectiveness of the proposed methods.


Introduction
In recent decades, multi-agent systems (MASs) have been applied as new methods for solving problems that cannot be solved by a single agent. An MAS contains agents that form a network and exchange information through it to satisfy a predefined objective. Information exchange among agents can be divided into centralized and distributed approaches. Centralized approaches have mainly been discussed for settings in which all agents have to communicate continuously with a central agent. This kind of communication results in heavy traffic, information loss and delay. Also, the central agent must be equipped with large computational capabilities to receive all the agents' information and provide them with a command in response. These challenges have recently shifted the stream of studies toward distributed techniques, where agents only need to communicate with their local neighbors.

A main problem in the cooperative control of MASs is consensus, or synchronization. In consensus problems, it is desired to design a simple control law for each agent, using local information, such that the system can achieve prescribed collective behaviors. In the field of control, consensus of MASs is categorized into cooperative regulation and cooperative tracking. In cooperative regulator problems, known as leaderless consensus, distributed controllers are designed for each agent such that all agents are eventually driven to an unprescribed common value [1]. This value may be constant or time varying, but is generally a function of the initial states of the agents in the communication network [2]. Alternatively, in a cooperative tracking problem, which is considered in this paper, there exists a leader agent. The leader agent acts as a command generator, which generates the desired reference trajectory. The leader ignores information from the follower agents, and all other agents are required to follow the leader agent [3,4]. This problem is known as leader-follower consensus [5], model reference consensus [6], or pinning control [7].

In MASs, the network structure and the agents' communications can be described with graph theory tools. Multi-player linear differential games rely on solving coupled algebraic Riccati equations (AREs). The solution of each player's coupled equations requires knowledge of the strategies of the player's neighbors. Since AREs are nonlinear, it is difficult to solve them directly. To solve the ARE, the following approaches have been proposed and extended: backwards integration of the differential Riccati equation, or Chandrasekhar equations [8]; eigenvector-based algorithms [9,10] and the numerically advantageous Schur-vector-based modification [11]; matrix-sign-based algorithms [12][13][14]; and Newton's method [15][16][17][18]. These methods are mostly offline procedures and are proven to converge to the desired solution of the ARE. They either operate on the Hamiltonian matrix associated with the ARE (eigenvector and matrix-sign-based algorithms) or require solving Lyapunov equations (Newton's method). In all of these methods, the system dynamics must be known, and a preceding identification procedure is always necessary.

Adaptive control [19,20] allows the design of online stabilizing controllers for uncertain dynamic systems. A conventional way to design an adaptive optimal control law is to identify the system parameters first and then solve the related algebraic Riccati equation. However, such adaptive systems are known to respond slowly to parameter variations in the plant. Optimal adaptive controllers can be obtained by designing adaptive controllers with the ability to learn online the solutions to optimal control problems. Reinforcement learning (RL) is a sub-area of machine learning concerned with how to methodically modify the actions of an agent (player) based on observed responses from its environment [21]. RL is a class of methods that provides online solutions to optimal control problems by means of a scalar reinforcement signal measured from the environment, which indicates the level of control performance. A number of RL algorithms [22][23][24] do not require knowledge or identification/learning of the system dynamics, and RL is strongly connected with direct and indirect optimal adaptive control methods. In this paper, optimal adaptive control means the RL-based algorithms that provide online synthesis of optimal control policies. The scalar value associated with the online adaptive controller acts as a reinforcement signal to optimally modify the adaptive controller in an online fashion.

RL algorithms can be employed to solve optimal control problems by means of function approximation structures that can learn the solution of the ARE. Since function approximation structures are used to implement these online iterative learning algorithms, the employed methods can also be referred to as approximate dynamic programming (ADP) [24]. Policy iteration (PI), a computational RL technique [25], provides an effective means of learning solutions to AREs online. PI is a class of algorithms with two steps, policy evaluation and policy improvement. In control theory, the PI algorithm amounts to learning the solution to a nonlinear Lyapunov equation and then updating the policy by minimizing a Hamiltonian function. Using the PI technique, a nonlinear ARE is solved successively by breaking it into a sequence of linear equations that are easier to handle. Although PI has primarily been developed for discrete-time systems [24,25], recent research presents policy iteration techniques for continuous-time systems [26]. ADP and RL methods have been used to solve multi-player games for finite-state systems [27,28]. In [29][30][31][32], RL methods have been employed to learn online, in real time, the solutions of optimal control problems for dynamic systems and differential games.

Leader-follower consensus has been an active area of research. Jadbabaie et al. considered a leader-follower consensus problem and proved that if all the agents were jointly connected with their leader, their states would converge to that of the leader over the course of time [33]. To solve the leader-follower problem, Hong et al. proposed a distributed control law using local information [34], and Cheng et al. provided a rigorous proof of consensus using an extension of LaSalle's invariance principle [35]. Cooperative leader-follower attitude control of multiple rigid bodies was considered in [36]. Leader-follower formation control of nonholonomic mobile robots was studied in [37]. Peng et al. studied leader-follower consensus for an MAS with a varying-velocity leader and time-varying delays [38]. The consensus problem in networks of dynamic agents with switching topology and time delays was addressed in [39]. In the research on leader-follower consensus of MASs so far, the mentioned methods were mostly offline and non-optimal, and required complete knowledge of the system dynamics.

Optimal adaptive control comprises the algorithms that provide online synthesis of optimal control policies [40]. For a single system, [26] introduced an online iterative PI method that does not require knowledge of the internal system dynamics but does require knowledge of the input dynamics to solve the linear quadratic regulator (LQR) problem. Vrabie et al. showed that each time the control policy is updated, the state and input information must be recollected for the next iteration [26]. Jiang et al. introduced a computational adaptive optimal control method for the LQR problem that requires neither the internal nor the input dynamics [41]. For MASs, [42] introduced an online synchronous PI for optimal leader-follower consensus of linear MASs with known dynamics. Based on the previous studies, online optimal leader-follower consensus of MASs under unknown linear dynamics has remained an open problem.

This paper presents an online optimal adaptive algorithm for continuous-time leader-follower consensus of MASs under known and unknown dynamics. The main contribution of the paper is the introduction of a direct optimal adaptive algorithm (a data-based approach) that converges to the optimal control solution without using an explicit, a priori obtained model of the matrices (drift and input matrices) of the linear system. We decouple the multi-agent global error dynamics, which facilitates the employment of policy iteration and optimal adaptive control techniques to solve the leader-follower consensus problem under known and unknown dynamics. The introduced method employs the PI technique to iteratively solve the ARE of each agent using online error state and input information, without requiring prior knowledge of the system matrices. For each agent, all iterations are implemented by repeatedly using the same error state and input information on some fixed time intervals. The online optimal adaptive computational tool employed in this paper is motivated by [41], whose method is generalized here to leader-follower consensus in MASs.

The paper is organized as follows. Section 2 contains the results from graph theory; the problem formulation, node error dynamics and leader-follower error dynamics decoupling are also clarified in this section. Section 3 introduces the policy iteration algorithm for leader-follower consensus under known dynamics. Optimal adaptive control design for leader-follower consensus under unknown dynamics is presented in Section 4. Simulation results are discussed in Section 5. Finally, conclusions are drawn in Section 6.

Problem formulation and preliminaries

Graphs
Graph theory is a useful mathematical tool in multi-agent systems research, where the information exchange between the agents and the leader is represented by a graph. The topology of a communication network can be expressed by either a directed or an undirected graph, according to whether the information flow is unidirectional or bidirectional. The topology of information exchange between $N$ agents is described by a graph $Gr = (V, E)$, where $V = \{1, 2, \dots, N\}$ is the set of vertices representing the $N$ agents and $E \subseteq V \times V$ is the set of edges of the graph. $(i, j) \in E$ means there is an edge from node $i$ to node $j$. We assume the graph is simple, i.e., it has no repeated edges and no self-loops. The topology of a graph is often represented by an adjacency matrix $A_G = [a_{ij}] \in R^{N \times N}$, with $a_{ij} > 0$ if $(j, i) \in E$ and $a_{ij} = 0$ otherwise. The set of neighbors of node $i$ is $N_i = \{ j : (j, i) \in E \}$, i.e., the set of nodes with arcs incoming to $i$. If node $j$ is a neighbor of node $i$, then node $i$ can get information from node $j$, but not necessarily vice versa for directed graphs. In undirected graphs, being a neighbor is a mutual relation. Define the in-degree matrix as the diagonal matrix $D = \mathrm{diag}(d_1, \dots, d_N)$, with $d_i = \sum_{j \in N_i} a_{ij}$ the weighted in-degree of node $i$.
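As a minimal sketch of these graph quantities, the following Python snippet builds the adjacency matrix, the in-degree matrix and the Laplacian $L = D - A_G$ used in this paper for a small undirected graph. The four-node path graph, the unit edge weights and the function names are illustrative assumptions and are not taken from the paper.

```python
def adjacency_matrix(n, edges):
    """Adjacency matrix of an undirected simple graph with unit weights."""
    a = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        if i == j:
            raise ValueError("self-loops are not allowed in a simple graph")
        a[i][j] = 1.0
        a[j][i] = 1.0  # undirected graph: being a neighbor is mutual
    return a


def in_degree_matrix(a):
    """Diagonal matrix whose i-th entry is the i-th row sum of a."""
    n = len(a)
    return [[sum(a[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]


def laplacian(a):
    """Graph Laplacian L = D - A_G."""
    d = in_degree_matrix(a)
    n = len(a)
    return [[d[i][j] - a[i][j] for j in range(n)] for i in range(n)]


# Hypothetical example: path graph 0 - 1 - 2 - 3 (zero-based node labels).
A_G = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3)])
L = laplacian(A_G)
row_sums = [sum(row) for row in L]
is_symmetric = all(L[i][j] == L[j][i] for i in range(4) for j in range(4))
```

For this example every row of `L` sums to zero and `L` is symmetric, in line with the stated properties of the Laplacian of an undirected graph.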

Synchronization and node error dynamics
In cooperative tracking control of networked linear systems, we wish to achieve synchronization in the multi-agent system while simultaneously optimizing some performance specifications on the agents. Consider an MAS consisting of $N$ agents and a leader, which communicate through an undirected graph. The dynamics of each agent are

$\dot{x}_i = A x_i + B_i u_i, \quad i = 1, \dots, N,$  (1)

where $x_i \in R^n$ is the measurable state of agent $i$ and $u_i$ is the input of player $i$. In this section, we assume that $A$ and $B_i$ are accurately known. The matrix $B_i$ has full column rank. The leader, labeled as $i = 0$, has the linear dynamics

$\dot{x}_0 = A x_0,$  (2)

where $x_0 \in R^n$ is the measurable state of the leader. Obviously, the leader's dynamics are independent of the others. We take the internal dynamics matrix $A$ to be identical for all the agents and the leader, because this case has practical background such as flocks of birds and schools of fish. The following assumption is used throughout the paper.

Assumption 1. The pair [...].

The dynamics of each agent (node) can describe the motion of a robot, unmanned autonomous vehicle, or missile that satisfies a performance objective.

Definition 1. The leader-follower consensus of system (1)-(2) is said to be achieved if, for each agent $i$, there exists a control input $u_i$ such that the closed-loop system satisfies $\lim_{t \to \infty} \| x_i(t) - x_0(t) \| = 0$.

The design objective is to employ the distributed control law (3) for agent $i$, $i = 1, \dots, N$, where $K_i$ is a feedback matrix to be designed and $g_i$ is defined to be 1 when the leader is a neighbor of agent $i$, and 0 otherwise. Since the proposed feedback controller $u_i$ depends on both the states of its neighbors and the leader agent's state, $u_i$ is a distributed controller. In order to analyze the leader-follower consensus problem, we denote the error state between agent $i$ and the leader as $\varepsilon_i = x_i - x_0$. The error dynamics of agent $i$ are then given by (4), and the global error dynamics by (5), where $\otimes$ is the Kronecker product. The graph topology has the following properties, which are proved in [43]:

1. The matrix $H$ has nonnegative eigenvalues.
2. The matrix $H$ is positive definite if and only if the graph $Gr$ is connected.

Assumption 2. The graph $Gr$ is connected.

The design objective for each agent $i$, $i = 1, 2, \dots, N$, is to find the feedback matrix $K_i$ that minimizes the performance index (6) for the linear system (4). Before we proceed to the design of online controllers, we need to decouple the global error dynamics (5), as discussed in the following.
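Property 2 can be illustrated numerically before moving on to the decoupling. The sketch below assumes the common pinning form $H = L + G$ with $G = \mathrm{diag}(g_1, \dots, g_N)$; this expression for $H$, the two three-agent graphs and the function names are assumptions made for illustration only. Positive definiteness is tested by attempting a Cholesky factorization.

```python
import math


def is_positive_definite(m, tol=1e-12):
    """Return True if the symmetric matrix m admits a Cholesky factorization."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                pivot = m[i][i] - s
                if pivot <= tol:
                    return False  # zero or negative pivot: not positive definite
                low[i][j] = math.sqrt(pivot)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return True


def pinned_matrix(laplacian, pinning_gains):
    """Assumed form H = L + G with G = diag(g_i)."""
    n = len(laplacian)
    return [[laplacian[i][j] + (pinning_gains[i] if i == j else 0.0)
             for j in range(n)] for i in range(n)]


# Hypothetical data: the leader is a neighbor of agent 1 only, so g = (1, 0, 0).
g = [1.0, 0.0, 0.0]

# Connected path graph 1 - 2 - 3.
L_connected = [[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]]
# Disconnected graph: edge 1 - 2 only, agent 3 is isolated.
L_disconnected = [[1.0, -1.0, 0.0], [-1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

H_connected = pinned_matrix(L_connected, g)
H_disconnected = pinned_matrix(L_disconnected, g)
connected_is_pd = is_positive_definite(H_connected)
disconnected_is_pd = is_positive_definite(H_disconnected)
```

Under these assumptions the connected graph yields a positive definite $H$, while the isolated agent in the second graph produces a zero pivot, so that $H$ is only positive semidefinite.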

Decoupling of leader-follower error dynamics
Since $H$ is symmetric, there exists an orthogonal matrix that diagonalizes it, with the eigenvalues $\lambda_1, \dots, \lambda_N$ of $H$ on the diagonal. Based on Assumption 2, $Gr$ is connected; therefore $H$ is a positive definite matrix and $\lambda_i > 0$, $i = 1, 2, \dots, N$. Applying this transformation to (5) yields the global error dynamics (7). Since the obtained global error dynamics (7) are block diagonal, they can be easily decoupled for each agent $i$, which gives the error system (8) and the performance index (9) for $i = 1, 2, \dots, N$. In order to find the optimal $K_i$ that guarantees leader-follower consensus for every agent $i$, we can minimize (9) with respect to (8), which is easier than minimizing (6) with respect to (4). Based on linear optimal control theory, minimizing (9) with respect to (8) to find the feedback matrix $K_i$ can be done by solving the algebraic Riccati equation (10) for each agent. Based on the mentioned assumptions, (10) has a unique symmetric positive definite solution $P_i^*$, from which the optimal feedback gain matrix $K_i^*$ can be determined. Due to the dependence of $K_i$ on $\lambda_i$, each feedback gain depends on the graph topology. Since the ARE is nonlinear in $P_i$, it is usually difficult to solve for $P_i^*$ directly from (10), especially for large matrices. Furthermore, solving (10) and obtaining $K_i^*$ requires knowledge of the $A$ and $B_i$ matrices.

Policy iteration algorithm for leader-follower consensus of continuous-time linear systems under known dynamics
One of the efficient algorithms to numerically approximate the solution of the ARE is the Kleinman algorithm [17]. Here we employ the Kleinman algorithm to numerically solve the corresponding ARE for each agent. The Kleinman method acts as a PI algorithm, as discussed in the following.
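To make the two PI steps concrete, the following is a minimal scalar sketch of a Kleinman-type iteration. It assumes the standard scalar LQR Riccati equation $2ap + q - b^2 p^2 / r = 0$ rather than the per-agent ARE (10), and the numerical values and the function name are hypothetical. Policy evaluation solves a (scalar) Lyapunov equation for the current gain, and policy improvement updates the gain from the resulting $p$.

```python
import math


def kleinman_scalar(a, b, q, r, k0, tol=1e-12, max_iter=50):
    """Kleinman (policy) iteration for the scalar ARE 2*a*p + q - (b*p)**2 / r = 0."""
    k = k0
    p_prev = None
    for iteration in range(1, max_iter + 1):
        a_closed = a - b * k
        if a_closed >= 0:
            raise ValueError("the current gain is not stabilizing")
        # Policy evaluation: scalar Lyapunov equation
        #   2 * a_closed * p + q + r * k**2 = 0
        p = (q + r * k * k) / (-2.0 * a_closed)
        # Policy improvement: k = b * p / r
        k = b * p / r
        if p_prev is not None and abs(p - p_prev) < tol:
            return p, k, iteration
        p_prev = p
    return p, k, max_iter


# Hypothetical stable plant, so the zero gain is a valid initial stabilizing policy.
p_star, k_star, n_iter = kleinman_scalar(a=-1.0, b=1.0, q=1.0, r=1.0, k0=0.0)
p_exact = math.sqrt(2.0) - 1.0  # positive root of p**2 + 2*p - 1 = 0
```

In this scalar case the iteration coincides with Newton's method applied to the Riccati equation, which is why only a few iterations are needed to reach the positive root.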

Optimal adaptive control for leader-follower consensus under unknown dynamics
To solve (11) without knowledge of $A$, the relation given in [40] can be used. By online measurements, the solution of (11) is uniquely determined under some persistence of excitation (PE) condition, although the matrix $B_i$ is still needed to calculate (12). To solve (11) and (12) without knowledge of either $A$ or $B_i$, the result of [41] is generalized here to MAS leader-follower consensus. The resulting online learning algorithm for the leader-follower consensus problem does not rely on either $A$ or $B_i$.

For each agent $i$, we assume a stabilizing $K_i^0$ is known. Then we seek a symmetric positive definite matrix $P_i^k$ and a feedback gain matrix $K_i^{k+1}$ without requiring the $A$ and $B_i$ matrices to be known. System (8) is rewritten as (15). Then, using (14) along the solutions of (15), (11) and (12) lead to (16). Note that the terms in (16) that depend on the error state can be obtained by measuring $\varepsilon_i$ online. Also, the term involving $K_i^{k+1}$ is treated as another unknown matrix to be solved together with $P_i^k$ [41]. Therefore, (16) plays an important role in separating the system dynamics from the iterative process. As a result, the requirement for the system matrices in (11) and (12) can be replaced by the error state $\varepsilon_i$ and input $u_i$ information measured online. In other words, the information regarding the system dynamics (the $A$ and $B_i$ matrices) is embedded in the error states and inputs, which are measured online.

We employ an input formed from the stabilizing gain $K_i^0$ and an exploration noise $e$ (for satisfying the PE condition) as the input signal for learning in (15), without affecting the convergence of the learning process. Given a stabilizing $K_i^k$, a pair of matrices $(P_i^k, K_i^{k+1})$ satisfying (11) and (12) can be uniquely determined without knowing $A$ or $B_i$, under a certain condition (Equation (27)).
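The idea of replacing the system matrices by measured data can be illustrated with a scalar sketch in the spirit of [41]. It is not the paper's Algorithm 2: the plant values, the input convention $u = -kx + e$, the noise frequencies and all function names are assumptions. For a scalar system, integrating $d(p x^2)/dt$ along a trajectory gives $p\,[x^2(t+\delta) - x^2(t)] - 2 r k_{next} \int x e \, d\tau = -(q + r k^2) \int x^2 \, d\tau$, which is linear in the unknowns $p$ and $k_{next}$ and contains no plant parameter. The learner below solves this relation in the least-squares sense; the true plant parameters appear only inside the simulator that generates the data.

```python
import math

A_TRUE, B_TRUE = -1.0, 1.0  # used only by the simulator, never by the learner
Q, R = 1.0, 1.0


def noise(t):
    """Sum-of-sinusoids exploration signal (frequencies are arbitrary choices)."""
    return math.sin(t) + math.sin(3.0 * t) + math.sin(7.0 * t)


def simulate_interval(x0, t0, k, delta, steps=100):
    """RK4 over [t0, t0 + delta]; returns [x(t0 + delta), int x^2, int x*e]."""
    def deriv(t, z):
        x, e = z[0], noise(t)
        u = -k * x + e
        return [A_TRUE * x + B_TRUE * u, x * x, x * e]

    h = delta / steps
    z = [x0, 0.0, 0.0]
    for n in range(steps):
        t = t0 + n * h
        k1 = deriv(t, z)
        k2 = deriv(t + h / 2, [zi + h / 2 * d for zi, d in zip(z, k1)])
        k3 = deriv(t + h / 2, [zi + h / 2 * d for zi, d in zip(z, k2)])
        k4 = deriv(t + h, [zi + h * d for zi, d in zip(z, k3)])
        z = [zi + h / 6 * (d1 + 2 * d2 + 2 * d3 + d4)
             for zi, d1, d2, d3, d4 in zip(z, k1, k2, k3, k4)]
    return z


def evaluate_policy_from_data(k, x0=1.0, delta=0.1, intervals=20):
    """Least-squares estimate of (p, k_next) from measured data only."""
    rows, rhs = [], []
    x, t = x0, 0.0
    for _ in range(intervals):
        x_end, int_xx, int_xe = simulate_interval(x, t, k, delta)
        rows.append((x_end ** 2 - x ** 2, -2.0 * R * int_xe))
        rhs.append(-(Q + R * k * k) * int_xx)
        x, t = x_end, t + delta
    # Normal equations for the two unknowns, solved by Cramer's rule.
    m11 = sum(row[0] * row[0] for row in rows)
    m12 = sum(row[0] * row[1] for row in rows)
    m22 = sum(row[1] * row[1] for row in rows)
    v1 = sum(row[0] * y for row, y in zip(rows, rhs))
    v2 = sum(row[1] * y for row, y in zip(rows, rhs))
    det = m11 * m22 - m12 * m12
    if abs(det) < 1e-12:
        raise ValueError("data are not rich enough (rank condition fails)")
    p = (v1 * m22 - v2 * m12) / det
    k_next = (m11 * v2 - m12 * v1) / det
    return p, k_next


p_est, k_next_est = evaluate_policy_from_data(k=0.0)
```

For the zero initial gain, the model-based policy evaluation of this hypothetical plant gives $p = 0.5$ and an improved gain of $0.5$, so the data-based estimates can be compared against those values. The determinant check plays the role of a rank condition: without sufficiently rich excitation the two regression columns become nearly collinear and the unknowns cannot be separated.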
Furthermore, by using the Kronecker product representation, the terms of (16) are written in vectorized form. Also, for a positive integer $l$, matrices collecting the measured data over $l$ time intervals are defined. Inspired by [41], (16) implies the matrix form of linear equations (24) for any given stabilizing gain matrix $K_i^k$, with coefficient matrix $\Theta_i^k$. Notice that if $\Theta_i^k$ has full column rank, (24) can be directly solved. The steps of the proposed optimal adaptive control algorithm for practical online implementation are presented as follows.

Algorithm 2 (Optimal adaptive learning algorithm):
Step 1: For agent $i$, employ the input formed from the stabilizing gain $K_i^0$ and the exploration noise $e$ (to satisfy the PE condition). Compute (26).
Step 2: Solve (24) for $P_i^k$ and $K_i^{k+1}$.
Step 3: Let $k \leftarrow k + 1$, and repeat Step 2 until $P_i^k$ converges.
Step 4: Use the resulting gain as the approximated optimal control policy for each agent $i$.

It must be noted that in the cases where the solution of (24) does not exist due to numerical error, (26) can be obtained by employing the least-squares solution of (24).

Lemma 1. As proved in [41], convergence is guaranteed if the rank condition (27) holds: the iterative process of Algorithm 2 then results in a sequence of $P_i^k$ and $K_i^k$, $i = 1, 2, \dots, N$, that converges to $P_i^*$ and $K_i^*$.

Proof: See [41] for a similar proof.

Several types of exploration noise, such as random noise [44,45], exponentially decreasing probing noise [32] and sum-of-sinusoids noise [41], are added to the input in reinforcement learning problems. The input signal should be persistently exciting, so that the signals generated by the system, which contain the information of the unknown system dynamics, are rich enough to lead to the exact solution. Here, a sum-of-sinusoids noise is applied in the simulations to satisfy the PE condition.

Remark 1. In comparison with the previous research on MAS leader-follower consensus, which is mostly offline and requires complete knowledge of the system dynamics, this paper presents an online optimal adaptive controller for leader-follower consensus that does not require knowledge of the drift and input matrices of the linear agents.

Remark 2. The main advantage of the proposed method is that the introduced optimal adaptive learning method is an online model-free ADP algorithm. Moreover, this technique iteratively solves the algebraic Riccati equation using online state and input information, without requiring a priori knowledge of the system matrices, and all iterations can be conducted by repeatedly using the same state and input information on some fixed time intervals. However, the main burden in implementing the introduced optimal adaptive method (Algorithm 2) lies in the integrators needed in the learning system to collect information of the error state and the input.

Simulation results
In this section, we give an example to illustrate the validity of the proposed methods. Consider the graph structure shown in Figure 1, similar to [42], focusing on the dynamics of each agent, which are as follows: [...]. The Laplacian $L$ and matrix $G$ are as follows: [...]. The cost function parameters for each agent, namely the $Q$ and $R$ matrices, are chosen to be identity matrices of appropriate dimensions. Since the agents' dynamics are already stable, the initial stabilizing feedback gains are considered as [...].

First, we assume that the $A$ and $B_i$ matrices are precisely known and we employ the Kleinman policy iteration (Algorithm 1) to reach leader-follower consensus. Figure 2 shows the convergence of the agents' $\varepsilon_i$, $i = 1, \dots, 5$, trajectories to zero over time in 6 iterations, which confirms the synchronization of all agents to the leader. The difference between the parameters of the solution obtained by the iterations and those obtained by directly solving the ARE is in the range of $10^{-4}$.

Now we assume that the $A$ and $B_i$ matrices are unknown and we employ the optimal adaptive learning method (Algorithm 2). The $\varepsilon_i$ and $u_i$ information of each agent is collected over each interval of 0.1 s. The policy iteration started at [...]. The convergence of $P_i^k$ to $P_i^*$ and of $K_i^k$ to $K_i^*$ during learning is shown in Figures 3 and 4. The convergence of the $\varepsilon_i$ components to zero is depicted in Figure 5, where the synchronization of all agents to the leader is guaranteed.

As mentioned in Table 1, the Kleinman PI method results in leader-follower consensus in 6 seconds after 6 iterations under known dynamics. The introduced optimal adaptive PI learns the optimal policy and guarantees leader-follower consensus in 12 seconds after 10 iterations under unknown dynamics. Clearly, the introduced optimal adaptive method for unknown dynamics requires more time and more iterations than the method for known dynamics to converge to the optimal control policies. As illustrated in the simulation results, all agents synchronize to the leader when the PI technique and the optimal adaptive learning algorithm are employed.

Figure 2. Agents' $\varepsilon_i$, $i = 1, \dots, 5$, trajectories

Figure 3. Convergence of $P_i^k$ to $P_i^*$ during learning

Figure 4. Convergence of $K_i^k$ to $K_i^*$ during learning

Table 1. Online PI methods comparison under known and unknown dynamics.
Online method | $\varepsilon_i$ convergence time to zero | $A$ and $B_i$
Kleinman PI (Algorithm 1) | 6 s | Known
Optimal adaptive PI (Algorithm 2) | 12 s | Unknown

Conclusions
In this paper, the online optimal leader-follower consensus problem for linear continuous-time systems under known and unknown dynamics is considered. The multi-agent global error dynamics are decoupled to simplify the employment of policy iteration and optimal adaptive control techniques for leader-follower consensus under known and unknown dynamics, respectively. The online optimal adaptive control solves the algebraic Riccati equation iteratively using system error state and input information collected online for each agent, without knowing the system matrices. Graph theory is employed to describe the network topology of the multi-agent system, where the connectivity of the network graph is assumed as a key condition to ensure leader-follower consensus. Simulation results indicate the capabilities of the introduced algorithms.

doi:10.5829/idosi.JAIDM.2015.03.01.11

The weighted in-degree of node $i$ is the $i$th row sum of $A_G$. Define the graph Laplacian matrix as $L = D - A_G$, which has all row sums equal to zero. In bidirectional (undirected) graphs, $L$ is a symmetric matrix. A path is a sequence of connected edges in a graph. A graph is connected if there is a path between every pair of vertices. The leader is represented by vertex 0. Information is exchanged between the leader and the agents that are neighbors of the leader (see Figure 1).