Learning generalized Nash equilibria in multi-agent dynamical systems via extremum seeking control

In this paper, we consider the problem of learning a generalized Nash equilibrium (GNE) in strongly monotone games. First, we propose a novel continuous-time solution algorithm that uses regular projections and first-order information. As a second main contribution, we design a data-driven variant of the former algorithm in which each agent estimates their individual pseudo-gradient via zero-order information, namely, measurements of their individual cost function values, as is typical of extremum seeking control. Third, we generalize our setup and results to multi-agent systems with nonlinear dynamics. Finally, we apply our algorithms to connectivity control in robotic sensor networks and to distributed wind farm optimization.


Introduction
Multi-agent optimization problems and games with self-interested decision makers, or agents, appear in many engineering applications, such as demand-side management in smart grids [42], [47], charging/discharging coordination for plug-in electric vehicles [37], [23], thermostatically controlled loads [33], [34], [24], and robotic formation control [35]. Typically, in these games, the cost functions and the constraints of the agents are coupled together, e.g. due to common congestion penalties and shared resource capacities, respectively. Since the agents are self-interested, their interaction might be unstable. Thus, one main research area is that of finding (seeking) agent decisions that are self-enforceable, i.e., decisions from which no agent has an incentive to unilaterally deviate, the so-called generalized Nash equilibrium (GNE) [16]. From a control-theoretic perspective, in the presence of dynamical agents, the main challenge is to design distributed, possibly almost decentralized, control laws that ensure both the convergence of the agent decisions to a GNE and the asymptotic stability of the equilibrium for the coupled agent dynamics.
Literature review: The literature on generalized Nash equilibrium problems (GNEPs) is vast; see [16] for a survey. Traditional GNE seeking approaches assume that the agents have complete game information, which requires the exchange of large amounts of data over a communication network. Thus, one prominent research focus in recent years has been on semi-decentralized algorithms in which the agents have very limited game information. The class of aggregative games is naturally suited to such a scenario, since each agent has a cost function that depends on their own decision and on an aggregate (e.g. the average) of the decisions of the other agents. The GNEP in aggregative games can be efficiently solved via semi-decentralized algorithms where the aggregate variable is broadcast by a central coordinator [4], [5], or via distributed algorithms where the agents locally estimate the aggregate variable [29], [19], [7]. Continuous-time GNE seeking algorithms have been recently proposed as well, all with projections onto the state-dependent tangent cone of the local constraints [11], [8].
We note that in most of the literature, GNE seeking algorithms are designed for static agents, i.e., agents whose costs instantaneously reflect the chosen decisions. However, this is not the case when the cost functions depend on the internal states of the agents rather than on their decisions (control inputs). Let us refer to this class of agents as dynamical agents. The two main approaches to reach a GNE for dynamical agents are passivity-based first-order algorithms and payoff-based zero-order algorithms. By using a passivity property, in [18], [46], Pavel and co-authors design a control law that guarantees convergence to a Nash equilibrium (NE) in a multi-agent system with single- and multi-integrator dynamics over a time-invariant network. With the same goal, in [12], De Persis and Grammatico relax the network connectivity assumption of [18] by designing a network weight adaptation scheme. In [8], the authors extend the convergence results to GNEPs for the first time. In payoff-based algorithms, each agent can only measure the value of their own cost function, but does not know its analytic form. Most such algorithms are designed for NEPs with static agents and finite action spaces, e.g. [22], [38], [40]. In the case of continuous action spaces, the measurements of the cost functions are often used to estimate the pseudo-gradients. Perhaps the most popular class of control algorithms that exploits this principle is that of extremum seeking control (ESC). The main idea is to use perturbation signals to "excite" the cost function and estimate its gradient. Since the first general proof of convergence [31], there has been a strong research effort to extend the original ESC approach [51], [20], as well as to conceive diverse variants, e.g. [13] and [25]. ESC was used for NE seeking in [17], where the proposed ESC algorithm is proven to converge to a neighborhood of a NE for nonlinear dynamical agents.
The results are extended in [36] to include stochastic perturbation signals. In [45], Poveda and Teel propose a framework for the synthesis of hybrid controllers that could also be used for NEPs with nonlinear dynamical agents. Unfortunately, none of the available ESC algorithms can be used for GNE learning/seeking. The main technical issue is that the primal-dual Lagrangian reformulation of the GNEP, which is necessary to decouple the coupling constraints, does not preserve the strong monotonicity of the (extended) pseudo-gradient, the usual background assumption in the literature. This is an important technical challenge and, in fact, there is still no research on data-driven (zero-order) GNE learning in monotone games with nonlinear dynamical agents.
Contribution: Motivated by the above literature and open research problem, to the best of our knowledge, we consider and solve for the first time the problem of learning a GNE in monotone games with nonlinear dynamical agents. Specifically, our main technical contributions are summarized next:
• We design a novel, full-information, semi-decentralized continuous-time GNE seeking algorithm, which uses projections onto fixed convex sets instead of projections onto state-dependent tangent cones as in [8]. In this way, the state flow is Lipschitz continuous and admits solutions in the classical sense. We overcome the lack of strong monotonicity of the primal-dual pseudo-gradient thanks to a suitable preconditioning of the operators defining the optimality conditions.
• We design an extremum seeking scheme that learns GNEs in (strongly) monotone games with static agents who perform local computations and communicate with a central coordinator only. Differently from [25], where the authors consider an optimization problem, we consider a noncooperative game. Furthermore, we prove that, with a time-scale separation, our algorithm learns GNEs in (strongly) monotone games with nonlinear dynamical agents.
We also apply for the first time semi-decentralized GNE learning to the robot connectivity problem and to wind farm optimization.
Notation: R denotes the set of real numbers. For a matrix A ∈ R^{n×m}, A^⊤ and ‖A‖ denote its transpose and its maximum singular value, respectively. For vectors x, y ∈ R^n, x^⊤y and ‖x‖ denote the Euclidean inner product and norm, respectively. We denote the unit ball by B := {x ∈ R^n | ‖x‖ ≤ 1}. Collective vectors are defined as x := col(x_1, . . . , x_N) and, for matrices A_1, . . . , A_N, diag((A_i)_{i}) denotes the block diagonal matrix with A_i on its diagonal. For a function v : R^n × R^m → R differentiable in the first argument, we denote the partial gradient vector by ∇_x v(x, y) ∈ R^n. The maximal and minimal eigenvalues of a matrix A are denoted by σ_max(A) and σ_min(A), respectively. The mapping proj_S : R^n → S denotes the projection onto a closed convex set S, i.e., proj_S(v) := argmin_{y∈S} ‖y − v‖. The set-valued mapping N_S : R^n ⇒ R^n denotes the normal cone operator of the set S ⊆ R^n, i.e., N_S(x) = ∅ if x ∉ S, and N_S(x) = {v ∈ R^n | sup_{z∈S} v^⊤(z − x) ≤ 0} otherwise. Id is the identity operator. I_n is the identity matrix of dimension n and 0_n is the column vector of n zeros. The non-negative orthant is denoted by R^q_{≥0}, and (x_i)_{i∈M} denotes a collection indexed by the set M.

Multi-agent dynamical systems
We consider a multi-agent system with N agents, indexed by i ∈ I := {1, 2, . . . , N}, each with the following dynamics:

ẋ_i = f_i(x_i, u_i),   (1a)
y_i = h_i(x_i, x_{−i}),   (1b)

where x_i ∈ X_i ⊂ R^{n_i} is the state variable, u_i ∈ Ω_i ⊂ R^{m_i} is the control input (decision variable), y_i ∈ R is the output variable, which evaluates the cost function h_i : R^{n_i} × R^{n_{−i}} → R, and f_i : X_i × Ω_i → R^{n_i} is the state flow mapping. Let us also define n := Σ_{i∈I} n_i and n_{−i} := Σ_{j≠i} n_j.
To ensure the existence and uniqueness of the solutions to the state equations, we make a common assumption in the nonlinear systems literature [28, Thm. 3.3]:

Assumption 1 (Existence and uniqueness) For each i ∈ I, f_i is locally Lipschitz continuous on X_i × Ω_i, where X_i is a compact set such that any solution to (1a) with x_i(t_0) ∈ X_i lies entirely in X_i. □

Furthermore, we assume that the decision variables of the agents are subject to local constraints u_i ∈ Ω_i and coupling constraints Au ≤ b, where A ∈ R^{q×m}, b ∈ R^q, and u := col((u_i)_{i∈I}) collects all the control inputs. Let us denote the collection of local constraint sets by Ω := Ω_1 × · · · × Ω_N. As the decision variables are also coupled together, the overall feasible decision set

U := {u ∈ Ω | Au ≤ b}   (3)

is contained in Ω.
Let us also denote the feasible decision set of each agent i, given the decisions of the others, by U_i(u_{−i}) := {u_i ∈ Ω_i | (u_i, u_{−i}) ∈ U}. For each i ∈ I, let π_i(u_i) denote the steady state of (1a) under the constant input u_i, and define the collective steady-state mapping π(u) := col((π_i(u_i))_{i∈I}). In this paper, we assume that the goal of each agent is to minimize its own steady-state cost function, i.e.,

∀i ∈ I :  min_{u_i ∈ U_i(u_{−i})}  J_i(u_i, u_{−i}),   (6)

where

J_i(u_i, u_{−i}) := h_i(π_i(u_i), π_{−i}(u_{−i})),   (7)

which depends on the decision variables of the other agents as well. From a game-theoretic perspective, we consider the problem of computing a generalized Nash equilibrium (GNE), as formalized next.

Definition 1 (Generalized Nash equilibrium)
A set of control actions u* := col((u*_i)_{i∈I}) is a generalized Nash equilibrium if, for all i ∈ I,

u*_i ∈ argmin_{v_i} J_i(v_i, u*_{−i})  s.t. (v_i, u*_{−i}) ∈ U,   (8)

with U as in (3) and J_i as in (7). □

In plain words, a set of inputs is a GNE if no agent can improve their steady-state cost by unilaterally changing their input. To ensure the existence of a GNE, we postulate the following basic assumption [15, Thm. 2]:

Standing Assumption 2 (Regularity) For each i ∈ I, the function J_i in (7) is continuous and the function J_i(·, u_{−i}) is convex for every u_{−i}. For each i ∈ I, the set Ω_i is non-empty, closed and convex; U is non-empty and satisfies Slater's constraint qualification. □

More precisely, in this paper we focus on a subclass of GNEs called variational GNEs (v-GNEs) [15, Def. 3.11]. A collective decision u* is a v-GNE in (8) if and only if there exists a dual variable λ* ∈ R^q such that the following KKT conditions are satisfied [15, Th. 4.8]:

0 ∈ ∇_{u_i} J_i(u*_i, u*_{−i}) + A_i^⊤ λ* + N_{Ω_i}(u*_i), for all i ∈ I,
0 ∈ −(A u* − b) + N_{R^q_{≥0}}(λ*).

By stacking the partial gradients ∇_{u_i} J_i(u_i, u_{−i}) into a single vector, we form the pseudo-gradient mapping

F(u) := col((∇_{u_i} J_i(u_i, u_{−i}))_{i∈I}).   (10)

We assume that there exists a central coordinator capable of bidirectional communication with the agents over a star-shaped network, which is a frequent assumption in semi-decentralized algorithms [5], [6]. The central coordinator is tasked with the computation of the dual variable. For ease of notation, we index the coordinator with 0 and define I_0 := I ∪ {0}. Let us postulate additional common assumptions ([11, Std. Ass. 2], [12, Ass. 1]) in order to ensure the convergence of the algorithm we propose later on.
Standing Assumption 3 (Well-behavedness) For each i ∈ I, J_i in (7) is twice differentiable and Lipschitz continuous, and its gradient ∇J_i is L-Lipschitz continuous, with L > 0. The pseudo-gradient mapping F in (10) is μ-strongly monotone, i.e., for any pair u, v, it holds that (F(u) − F(v))^⊤(u − v) ≥ μ ‖u − v‖². □

Another common assumption is the (local) exponential stability of the equilibrium points [17, Ass. 4.2]. Thus, with the change of coordinates z_i := x_i − π_i(u_i), we also adopt the following assumption throughout the paper:

Standing Assumption 4 (Lyapunov stability)
For each i ∈ I, there exists a smooth Lyapunov function V_i, with Lipschitz continuous partial derivatives, such that, for every constant u_i ∈ U_i, it holds that

α̲_i ‖z_i‖² ≤ V_i(z_i) ≤ ᾱ_i ‖z_i‖²,  ∇V_i(z_i)^⊤ ż_i ≤ −κ_i ‖z_i‖²,

for some positive constants α̲_i, ᾱ_i and κ_i. Moreover, for every constant u_i ∈ U_i, the gradient of V_i satisfies a linear growth bound of the form ‖∇V_i(z_i)‖ ≤ κ̄_i ‖z_i‖, for some κ̄_i > 0. □

Generalized Nash equilibrium seeking for static agents

Let us start from the case of static agents, to highlight the proposed algorithm and its integration with the zero-order gradient scheme.

Assumption 2 (Static agents)
For each i ∈ I, x_i = u_i (in place of (1a)). □

We propose two control schemes for GNE seeking with static agents. In the first, the agents have perfect information about the decisions of the other agents and know the analytic expression of their partial gradient. The second scheme is data-driven, i.e., the agents have access to the output of their own cost function only.

Gradient-based case
In our first GNE seeking algorithm, each agent updates their decision u_i based on the decisions of all other agents and the dual variable, while a central coordinator updates the dual variable and broadcasts it to the agents, as in (13), where λ ∈ R^q, (γ_i)_{i∈I} are the step sizes chosen by the agents, Γ := diag((γ_i I_{m_i})_{i∈I}), and γ_0 is the step size chosen by the central coordinator. Now we state our first theorem:

Theorem 1 (v-GNE seeking) Let the Standing Assumptions and Assumption 2 hold and let (u(t), λ(t))_{t≥0} be the solution to (13). Then, there exist small enough step sizes (γ_i)_{i∈I_0} such that the pair (u(t), λ(t))_{t≥0} converges to some (u*, λ*) ∈ U × R^q_{≥0}, where u* is the variational generalized Nash equilibrium of the game in (8). □

The scheme in (13) is semi-decentralized, meaning that the update and the broadcast of the dual variable are managed by a central coordinator. A fully distributed variant, where each agent keeps a local copy of the dual variable and consensus among the copies is imposed, can be derived similarly to [8, Sec. 3].
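As a minimal numerical sketch of projected primal-dual GNE seeking (a toy two-player quadratic game with one coupling constraint, with illustrative costs, gains and sets; not the paper's exact preconditioned flow (13)), consider:

```python
import numpy as np

# Toy game: J_i(u) = 0.5*u_i^2 + 0.5*u_i*u_{-i}, coupling constraint u_1 + u_2 <= 1.
# The pseudo-gradient F(u) = M u with M = [[1, .5], [.5, 1]] is strongly monotone.
A = np.array([[1.0, 1.0]])   # coupling-constraint matrix (q = 1, m = 2)
b = np.array([1.0])

def pseudo_gradient(u):
    return np.array([u[0] + 0.5 * u[1], u[1] + 0.5 * u[0]])

def proj_box(u, lo=-2.0, hi=2.0):   # projection onto the local sets Omega_i
    return np.clip(u, lo, hi)

# Euler discretization of projected primal-dual dynamics:
#   u_dot   = proj_Omega(u - gamma (F(u) + A^T lam)) - u
#   lam_dot = proj_{>=0}(lam + gamma_0 (A(2 u_half - u) - b)) - lam
u, lam = np.array([1.5, -1.0]), np.zeros(1)
dt, gamma, gamma0 = 0.01, 0.1, 0.1
for _ in range(20000):
    u_half = proj_box(u - gamma * (pseudo_gradient(u) + A.T @ lam))
    lam_half = np.maximum(lam + gamma0 * (A @ (2 * u_half - u) - b), 0.0)
    u = u + dt * (u_half - u)
    lam = lam + dt * (lam_half - lam)
print(u, lam)  # approaches the v-GNE (0, 0); the constraint is inactive, so lam = 0
```

For this toy game the unconstrained Nash equilibrium (0, 0) already satisfies the coupling constraint, so the dual variable remains at zero; with an active constraint, λ converges to the multiplier in the KKT conditions.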

Data-driven case
In the limited-information case, we consider that the agents have access to the cost output only. We emphasize that, in this case, they neither know the actions of the other agents, nor do they know the analytic expressions of their partial gradients. In fact, this is a standard setup in extremum seeking ([31], [26], [45], among others). However, the agents can communicate with a central coordinator, to whom they send their decision variable and its derivative. Let us first evaluate the time derivative of the cost output l_i = J_i(u_i, u_{−i}) along the trajectories of u:

l̇_i = ∇_{u_i} J_i(u)^⊤ u̇_i + ∇_{u_{−i}} J_i(u)^⊤ u̇_{−i} =: (θ_i^1)^⊤ u̇_i + θ_i^0,   (14)

where we define θ_i^1 := ∇_{u_i} J_i(u) and θ_i^0 := ∇_{u_{−i}} J_i(u)^⊤ u̇_{−i}. The variable θ_i^0 measures the influence of the decision variables of the other agents on the cost output of agent i. The variable θ_i^1 measures the effect of the decision variable of agent i on the cost output l_i, which is needed for (13). To estimate the local θ_i^0 and θ_i^1, we use a time-varying parameter estimation approach, as proposed in [26] for centralized optimization. Let us provide a basic intuition first.
Let l̂_i and θ̂_i be estimates of the output l_i and of the variable θ_i, respectively, and let e_i = l_i − l̂_i be the estimation error. Then, the estimator model for (14) of agent i is given by (18), where K_i is a free design parameter. Note that the first two terms on the right-hand side resemble high-gain observer schemes [44]. As the structure of the problem does not directly allow the use of high-gain observers, it is necessary to introduce additional dynamics into the estimation; this is the primary role of the third term in (18). Accordingly, the dynamics of c_i(t) are chosen as in (19). Let us also introduce the auxiliary variable η_i, as in (20). It is also necessary to define a symmetric, positive definite matrix variable Σ_i ∈ R^{(m_i+1)×(m_i+1)} with dynamics (21), where ρ_i, σ_i and Σ⁰_i are free design parameters. We note that, originally, in [1], Σ̇_i = c_i c_i^⊤, but this proved to be inconvenient in practical implementations, as the elements of Σ_i grow unbounded. Instead, as in (21), the dynamics of Σ_i behave as a first-order system; the third term is added so that the matrix is always invertible. Equations (18)-(21) form the parameter update law presented in [1], i.e. (22), where Π_{Θ_i}(θ̂, v) denotes the projection of the vector v onto the tangent cone of the set Θ_i at θ̂, as defined by Equation 2.14 in [43]. This implies that, if the initial estimate lies in Θ_i, then θ̂_i(t) remains in Θ_i at all times. The set Θ_i represents the expected (possible) values of θ_i. Let us also define Θ := Θ_1 × · · · × Θ_N.
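The intuition behind the regression (14) can be illustrated offline: along a dithered trajectory u(t), the measured l̇ is linear in the regressor col(1, u̇), so (θ⁰, θ¹) can be recovered by least squares. The sketch below uses a hypothetical single-agent cost (the paper instead runs the recursive estimator (18)-(22) online):

```python
import numpy as np

# Hypothetical scalar cost; the agent can only measure its values.
J = lambda u: (u - 2.0) ** 2   # true gradient at u = 0.5 is -3

a, w, dt = 0.05, 10.0, 1e-4
t = np.arange(0.0, 1.0, dt)
u = 0.5 + a * np.sin(w * t)          # dithered input trajectory
u_dot = a * w * np.cos(w * t)
l_dot = np.gradient(J(u), dt)        # measured cost derivative (numerical)

# l_dot = theta1 * u_dot + theta0: recover both by least squares.
Phi = np.column_stack([np.ones_like(t), u_dot])
theta0, theta1 = np.linalg.lstsq(Phi, l_dot, rcond=None)[0]
print(theta0, theta1)  # theta1 close to the true gradient -3
```

The recursive scheme (18)-(22) performs, in effect, this regression continuously in time, with forgetting governed by σ_i and excitation supplied by the perturbation signals.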
We are finally ready to propose our semi-decentralized v-GNE learning algorithm (23), where A_i denotes the i-th block row of A^⊤, which contains the constraints of agent i, and d_i denotes the perturbation signal of agent i. In collective form, the algorithm can be written compactly, where c_i is the solution to (19). □ We conclude the section with the convergence result.
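For intuition on why sinusoidal perturbations reveal gradient information from cost measurements alone (the classical ESC demodulation principle, shown here in a hypothetical single-agent setting rather than the paper's multi-agent scheme):

```python
import numpy as np

J = lambda u: (u - 2.0) ** 2             # unknown cost; only values measurable
a, w, dt = 0.05, 2.0 * np.pi, 1e-3
t = np.arange(0.0, 1.0, dt)              # exactly one dither period

# Perturb around u_hat = 0.5, measure the cost, demodulate with the dither:
measured = J(0.5 + a * np.sin(w * t))
grad_est = (2.0 / a) * np.mean(measured * np.sin(w * t))
print(grad_est)  # recovers dJ/du at u = 0.5, i.e. -3, with no gradient access
```

Averaging over a full dither period cancels the zeroth- and second-order terms of the Taylor expansion, leaving the first-order (gradient) term; this is the mechanism that the perturbation signals d_i exploit in (23).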
Informally, for any initial condition of the decision variables, there exist gains such that the control variables converge to an arbitrarily small neighborhood of the v-GNE. Formally:

Theorem 2 (v-GNE static learning) Let the Standing Assumptions and Assumptions 2 and 3 hold, and let (s(t) := (η(t), θ̂(t), u(t), λ(t)))_{t≥0} be the closed-loop solution to (18)-(21), (23). Then, for any compact set K and any ε > 0, there exist small enough parameters (1/K_i, 1/ρ_i, σ_i, γ_i)_{i∈I} and γ_0 such that, for every solution with s(0) ∈ K, u(t) converges to the set {u*} + εB, where u* is a variational generalized Nash equilibrium of the game in (8). □

Generalized Nash equilibrium learning for dynamical agents

In this section, we propose a control scheme for generalized Nash equilibrium learning for dynamical agents.
We consider a data-driven scenario only, i.e. the agents have access to the cost output measurements and the information that is given to them by a central coordinator. They are not aware of the analytic expression of their steady-state cost function, nor of their pseudogradient, nor can they observe the states and decisions of the other agents.
For the multi-agent dynamical system (25), where ε > 0 is a time-scale separation constant, with the objective of reaching a neighborhood of a v-GNE, we propose the same control law as in (23), with the distinction that θ̂¹ is estimated by the parameter estimation scheme (18)-(22), where we collect the measurements of the output y_i in (1b) instead of J_i(u_i, u_{−i}) directly. Thus, the estimation error is hereby redefined as in (26). We conclude the section with the most general theoretical result of the paper (Theorem 3), namely, the convergence of the closed-loop dynamics to a neighborhood of a v-GNE of the game in (6).

Connectivity control in robotic swarms
The problem of connectivity control has been considered in [50] as a Nash equilibrium problem. In many practical scenarios, multi-agent systems, besides their primary objective, are designed to uphold a certain level of connectivity as a secondary objective. In what follows, we consider a similar problem in which each agent is tasked with finding the source of an unknown signal while maintaining a certain level of connectivity. Unlike [50], we require the existence of a central coordinator and we allow for coupled constraints on the decision variables. Additionally, we model the agents as unicycles with setpoint regulators, which does not require a constant angular velocity as in [50].
Consider a multi-agent system consisting of unicycle vehicles, indexed by i ∈ {1, . . . , N}, with the following dynamics:

ẋ_i = v_i cos(θ_i),  ẏ_i = v_i sin(θ_i),  θ̇_i = ω_i,

where x_i, y_i, θ_i represent the state variables and v_i, ω_i represent the control inputs. Each unicycle implements the feedback controller for setpoint regulation studied in [32], parametrized by gains K¹_i, K²_i > 0. A graphical representation of the variables R_i and φ_i is shown in Figure 1. Therefore, we rewrite the closed-loop nonlinear dynamics in the form (1a), where r_i = col(x_i, y_i) and u_i = col(uˣ_i, uʸ_i) is the input of the transformed system, which represents the coordinates of the setpoint. For each i, the steady-state mapping is then given by π_i(u_i) = col(u_i, 0).
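A quick sketch of one such closed loop (the exact regulator of [32] is not reproduced here; this is a standard polar-coordinates setpoint controller of the same flavor, with illustrative gains):

```python
import numpy as np

def step(state, setpoint, K1=1.0, K2=2.0, dt=1e-3):
    """One Euler step of a unicycle under a polar-coordinates setpoint regulator."""
    x, y, th = state
    dx, dy = setpoint[0] - x, setpoint[1] - y
    R = np.hypot(dx, dy)                         # distance to the setpoint
    phi = np.arctan2(dy, dx) - th                # heading error
    phi = np.arctan2(np.sin(phi), np.cos(phi))   # wrap to (-pi, pi]
    v = K1 * R * np.cos(phi)
    w = K2 * phi + K1 * np.sin(phi) * np.cos(phi)
    return np.array([x + dt * v * np.cos(th),
                     y + dt * v * np.sin(th),
                     th + dt * w])

state, target = np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0])
for _ in range(20000):
    state = step(state, target)
print(state[:2])  # position approaches the setpoint (1, 1)
```

In the paper's setup, the GNE learning scheme supplies the setpoint u_i, and the regulated unicycle plays the role of the nonlinear agent dynamics (1a) with steady state π_i(u_i).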
Each agent is tasked with locating the source of a unique unknown signal. The strength of each signal abides by the inverse-square law, i.e., it is proportional to 1/r²; therefore, the inverse of the signal strength can be used in the cost function. Additionally, the agents must not drift too far apart from each other, as they should provide quick assistance to each other in case of a critical failure. This is enforced in two ways: by incorporating the signal strength of the fellow agents into the cost functions and by communicating with the central coordinator. Thus, we design the cost output and the position constraints, for all i ∈ I, in terms of I_{−i} := I \ {i}, constants c, b > 0, and the position r^s_i of the source assigned to agent i; the safe traversing area is described by a rectangle. For our numerical simulations, we choose the parameters as in Table 1 (simulation parameters for connectivity control) and randomly choose the perturbation frequencies of the agents, e.g. (ω¹₁, ω²₁) = (5.11, 6.38). In the simulation results, we see that smaller perturbations and frequency factors bring the system closer towards the v-GNE; however, as shown in Figure 3, the convergence rate then slows down significantly. Thus, there is a trade-off between convergence speed and closeness to the solution. In Figure 4, we see a representative example of agent trajectories for d_i = 0.49. We note that the trajectories of agents 1 and 3 saturate in their decision variables, while agents 2 and 4 are limited mostly by the coupling constraints.

Wind farm optimization
As one of the main sources of renewable energy, wind farms and their optimization have been addressed extensively from different perspectives, such as power tracking for single turbines [30], [10], power tracking via extremum seeking [21], power tracking with load reduction [49], [48], distributed optimization of wind farms [41], [39], [2], and distributed optimization via extremum seeking [14]. While in the power-tracking case the torque or some other related variable is often taken as the control input, in the distributed-optimization case the axial induction factor (AIL) is usually taken as the control input.
In what follows, we consider a similar problem in which a wind farm tries to maximize its power output with the AIL as the control input. The control variables are subject to local constraints (feasible values of the AIL). Also, we require that the turbines experience a similar amount of mechanical stress; to this end, we impose that the AILs of a row of wind turbines cannot differ too much from the AILs of the succeeding row, which introduces coupling constraints into the optimization problem. Unlike the previously mentioned literature on distributed wind farm optimization, here we also allow for AIL dynamics, in order to reflect the turbine time constant and its effect on the power output. One possible way of solving the problem would be via global optimization, where a central coordinator minimizes a global cost function and sends AIL commands to the turbines. To avoid having a single critical node for computation and communication, an alternative approach is to pose the problem as a potential game, where the cost functions of the turbines are "aligned" with a global utility function; in our case, the potential function is the sum of all power outputs. We let each individual cost function be equal to the potential function, and each agent minimizes their cost function on their own, with limited information from the central coordinator.
In this setup, a v-GNE corresponds to an optimal solution of the global power maximization problem.
Technically speaking, we consider N wind turbines, indexed by i ∈ {1, . . . , N}, each with first-order AIL dynamics with time constant τ and power output, where a_i represents the state variable, u_i represents the control input, namely the AIL reference, y_i is the measured power output of the wind farm, which is broadcast by the central coordinator, ρ is the air density, A is the surface area encompassed by the blades of a single turbine, C_P(a_i) := a_i(1 − a_i)² is the power efficiency coefficient, and V_i is the average wind speed experienced by wind turbine i, as in [39, Equ. 5]. The wind turbines are placed in R rows and C columns with coordinates x_i, and their indices can be written as i = i_c + i_r C, where i_c ∈ {1, . . . , C} and i_r ∈ {0, . . . , R − 1}. They are tasked to maximize the wind farm power output under local constraints a_i ∈ [a_min, a_max] and coupling constraints |a_i − a_j| ≤ b for all i, j such that j_r = i_r + 1. For our numerical simulation, we choose a setup similar to that in [39]. The geometric setup of the wind farm is shown in Figure 5 and the following parameters are chosen: ρ = 1.225, U_∞ = 8, τ = 10, γ_i = γ_0 = 0.05, ε = 0.005, b = 0.03, a_min = 0.1, a_max = 1/3. We take the same parameter estimation scheme as in the previous numerical simulation, apart from the perturbation frequencies, which we randomly choose in the interval [3, 11], and the perturbation amplitudes, which we take as d_i = 0.01. All initial conditions, apart from a_i, were set to zero. The initial condition for a_i was set to 1/3, which corresponds to the greedy strategy in [39]. In our simulations, we use three different wind directions, applied over consecutive time intervals starting with [0, 50000). We assume that the wind turbines instantly adjust their orientation towards the wind direction, as this process is relatively fast compared with the GNE learning process.
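As a quick sanity check of the power model (rotor area and wind speed are illustrative values), C_P(a) = a(1 − a)² is maximized at a = 1/3, which is why 1/3 corresponds to the greedy single-turbine strategy used as the initial condition:

```python
import numpy as np

rho, Ar, V = 1.225, np.pi * 40.0 ** 2, 8.0   # illustrative rotor area and wind speed

def C_P(a):
    """Power efficiency coefficient from the text: C_P(a) = a (1 - a)^2."""
    return a * (1.0 - a) ** 2

def power(a):
    """Single-turbine power: P = 0.5 * rho * A * C_P(a) * V^3."""
    return 0.5 * rho * Ar * C_P(a) * V ** 3

a_grid = np.linspace(0.0, 1.0, 1001)
a_star = a_grid[np.argmax(power(a_grid))]
print(a_star)  # ~1/3: the greedy single-turbine optimum
```

With wake interactions, the farm-wide optimum deviates from the greedy value, which is precisely what the GNE learning scheme discovers from power measurements alone.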
The simulation results are shown in Figure 6. We can see that the wind turbines reach a neighborhood of the v-GNE, even with the delay introduced by the AIL dynamics.

Conclusion
Generalized Nash equilibrium problems with nonlinear dynamical agents can be solved via a preconditioned forward-backward algorithm that uses estimates of the pseudo-gradient mapping, provided that the pseudo-gradient is strongly monotone and Lipschitz continuous and that the dynamical agents enjoy a certain exponential stability property. Regular projections enable the use of a parameter estimation scheme. Numerical simulations show that there is a trade-off between closeness to the equilibrium solution and the speed of convergence.
A Proof of Theorem 1

We rewrite the dynamics as a preconditioned forward-backward algorithm, whose convergence is proven using well-known properties of monotone operators. First, we show the equivalence. Let us denote ω = col(u, λ) and write Equation (23) as (A.1), where Γ̄ = diag(Γ, γ_0 I_q). Next, by the property of the projection operator in [3, Prop. 6.47], Equation (A.1) reads as a differential equation that is equivalent to the inclusion (A.2). When the elements of the last matrix product in (A.2) are rearranged and the equation is premultiplied by Γ̄⁻¹, we obtain (A.3)-(A.5). Next, the identity (A.6) holds for the matrices involved. From Equations (A.5) and (A.6), it follows (A.7). By inverting the operator on the left-hand side of the previous expression, we finally arrive at the desired equation (A.8). Before proving convergence, we have to prove an additional result.
Lemma 1 Consider the operator T := J_A = (Id + A)⁻¹, where A is maximally monotone. Then it holds that

⟨T x − x*, x − x*⟩ ≥ ‖T x − x*‖²,

for all (x, x*) ∈ dom(T) × fix(T). □

PROOF. Let x* = T x* be a fixed point of the operator T. Then the claimed inequality holds due to the properties of firmly nonexpansive operators, since the resolvent of a maximally monotone operator is firmly nonexpansive.
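Lemma 1 can be checked numerically on a simple instance (a linear maximally monotone operator given by an illustrative positive definite matrix, whose resolvent has the unique fixed point 0):

```python
import numpy as np

M = np.array([[2.0, 1.0], [1.0, 2.0]])   # positive definite => maximally monotone
T = np.linalg.inv(np.eye(2) + M)         # resolvent J_A = (I + A)^{-1}
x_star = np.zeros(2)                     # fix(T) = {0}, since M x = 0 only at 0

# Firm nonexpansiveness gap: <Tx - x*, x - x*> - ||Tx - x*||^2 >= 0.
rng = np.random.default_rng(1)
gaps = []
for _ in range(1000):
    x = rng.standard_normal(2)
    Tx = T @ x
    gaps.append((Tx - x_star) @ (x - x_star) - (Tx - x_star) @ (Tx - x_star))
min_gap = min(gaps)
print(min_gap)  # nonnegative on every sample, as Lemma 1 asserts
```

For a linear operator the gap equals z^⊤ M z with z = Tx − x*, which is nonnegative exactly because M is monotone.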
Then, the dynamics in (A.8) read as (A.10). We propose the Lyapunov function candidate (A.11), where ω* is a fixed point of the operator T. Its derivative along the trajectories of (A.8) is given by (A.12). From Lemma 1, it follows that (A.13) holds. Bounds on the eigenvalues of Φ can be estimated with Gershgorin's theorem: for small enough step sizes, we denote the lower and upper bounds on the eigenvalues by σ_min = 1/(max_{i∈I₀}(γ_i⁻¹) + ‖A‖) and σ_max = 1/(min_{i∈I₀}(γ_i⁻¹) − ‖A‖), respectively. We then bound (A.13) as (A.14). Since F(u) is strongly monotone and Lipschitz continuous, it is also cocoercive [52, Lemma 5]; therefore, the operator B is cocoercive with constant β = μ/L². Since it is always possible to choose the parameters γ_i and γ_0 small enough such that βσ_min ≥ σ²_max and the matrix in (A.15) is negative definite, Equation (A.14) implies that the Lyapunov derivative is nonpositive, and vanishes only if ω ∈ fix(T). As ω* was chosen as an arbitrary fixed point of the operator T, it follows that u* is a singleton and ζ₁ = fix(T). From the dynamics (A.10), it follows that the largest invariant set is ζ_inv = ζ₁ = fix(T); therefore, ζ₀ ⊆ ζ_inv. By LaSalle's invariance principle [28, Thm. 4.4], we conclude that the state trajectories converge to the set ζ₀, in which u = u* is a singleton and it holds that col(u*, λ) ∈ fix(T).

B Proof of Theorem 2
Let us consider a Lyapunov function candidate of the form V = V_θ + V_ω, where V_θ represents a parameter estimation error term (B.1) and V_ω represents a primal-dual convergence error term (B.2). Since we anticipate that the derivative of the projection function does not exist at some corner points, we use Lyapunov theory for differential inclusions as in [9, Chp. 2]; namely, we use upper Dini derivatives (D⁺) instead of regular time derivatives. For ease of notation, we use regular derivatives whenever they are equal to the Dini derivatives.
Outline of the proof: We first bound all the positive terms in D⁺V_θ by functions of the variables (η, θ̂, ω); then we similarly bound all the terms in D⁺V_ω introduced by the parameter estimates. Finally, we use the quadratic terms of D⁺V to show that the positive terms are majorized by the negative terms.
Parameter estimation term: We bound the Dini derivative of V_θ similarly to [26, Thm. 1] and [27, Eq. 31], with the only difference that we let each agent choose their own parameters (σ_i, K_i, ρ_i). The Lyapunov derivative then reads as (B.3), with σ̄ := max_i σ_i.
Next, we bound the positive terms in (B.3). The analysis starts with θ = col(θ⁰, θ¹), for which we derive the bounds below. The term ½‖e − η‖² was omitted in [26] (see page 4, first column, second-to-last equation) and in [27], as it was not required there.
where L_0 := max_{u∈U} ‖∇J(u)‖ < ∞, since U is bounded. Then, we bound ‖θ‖ for suitable constants L_1, L_2 and L_3. In order to bound D⁺‖θ‖, we consider the Dini derivatives of θ⁰ and θ¹. On a compact set Ω_c, the Hessian of J is bounded; therefore, by combining (B.6), (B.7), (B.8) and the arithmetic-mean/quadratic-mean inequality, the resulting bound holds for some positive constants L_4 to L_9. By using the previously calculated bounds, we obtain an upper bound on D⁺V_θ. Primal-dual term: Unlike the full-information case, our agents use the estimate θ̂¹ instead of F(u) in the control law (23). Therefore, by adding and subtracting the full-information derivative in the Dini derivative of (B.2), we obtain a bound whose last line follows from the inequality (B.10). Complete Lyapunov candidate: Finally, we bound the Dini derivative of V. Now, we can make the last three norms arbitrarily small, and b_1, b_2, b_3 positive, by choosing k_c, σ̄ and k_4 small enough; we can make b_4 positive by choosing (σ_i)_{i∈I} small enough; and we can make b_5 positive by making k_b large enough. Of the mentioned parameters, only k_b and k_c cannot be chosen arbitrarily: to make k_b large enough, we choose (K_i)_{i∈I} and (ρ_i)_{i∈I} large enough; to make k_c small enough, we have to choose the parameters k_1 and k_2 small enough. Since Ω_c was chosen for arbitrary c, it follows that, for any compact set K, it is possible to find control parameters such that, for (η(0), θ̂(0), u(0)) ∈ K, (η, θ̂, u) converges to an arbitrarily small neighborhood of (0, 0, u*), which concludes the proof. For λ, we can only claim boundedness.

C Proof of Theorem 3
We have to prove that there exists a time-scale separation between the GNE learning scheme described in Section 3 and the dynamics of the multi-agent system in (25) such that the interconnection is also stable. Let us consider a Lyapunov function candidate V = V_θ + V_ω + V_z, where V_θ and V_ω are the same as in (B.1), (B.2), and V_z is formed using Standing Assumption 4. Outline of the proof: We first bound all the terms in D⁺V_z introduced by the nonconstant inputs by functions of the variables (η, θ̂, ω, z); then we bound all the terms in D⁺V_θ introduced by the redefinition of the error e_i in (26) in the same manner. At the end, we use the quadratic terms of the complete D⁺V to show that the additional terms are majorized by the negative terms.
Parameter estimation term: The GNE learning scheme is identical to the static case, apart from the measurements of the cost function. Let us denote l := col((l_i)_{i∈I}), y := col((y_i)_{i∈I}), h(x) := col((h_i(x_i, x_{−i}))_{i∈I}).
Complete Lyapunov candidate: Finally, we bound the Dini derivative of the complete Lyapunov function candidate. Now, we can make b_8, b_9, b_10 arbitrarily small, and b_2, b_7 positive, by choosing k_c, σ̄ and k_4 small enough; we can make b_1 positive by choosing k_5, k_7 large enough; b_3 positive by choosing k_a large enough; b_4 positive by choosing k_b large enough; b_5 positive by choosing (σ_i)_{i∈I} small enough; and b_6 positive by choosing ε small enough. Of the mentioned parameters, only k_a, k_b and k_c cannot be chosen arbitrarily: to make k_a and k_b large enough, we choose (K_i)_{i∈I} and (ρ_i)_{i∈I} large enough; to make k_c small enough, we have to choose the parameters k_1 and k_2 small enough. Since Ω_c was chosen for arbitrary c, it follows that, for any compact set K, it is possible to find control parameters such that, for (η(0), θ̂(0), u(0), z(0)) ∈ K, (η, θ̂, u, z) converges to an arbitrarily small neighborhood of (0, 0, u*, 0), which concludes the proof. For λ, we can only claim boundedness.