Dynamically analyzing cell interactions in biological environments using multiagent social learning framework

Background: Biological environments are uncertain, and their dynamics resemble those of multiagent environments. Results from multiagent systems research can therefore provide valuable insights for the study of biology. Learning in a multiagent environment is highly dynamic because the environment is no longer stationary: each agent's behavior changes adaptively in response to the other coexisting learners, and vice versa. The dynamics become even less predictable when we move from fixed-agent interaction environments to a multiagent social learning framework. An analytical understanding of the underlying dynamics is both important and challenging.
Results: In this work, we present a social learning framework with homogeneous learners (e.g., Policy Hill Climbing (PHC) learners) and model the behavior of the players in this framework as a hybrid dynamical system. By analyzing the dynamical system, we obtain sufficient conditions for convergence and for non-convergence. We experimentally verify the predictive power of our model on a number of representative games, and the experimental results confirm the theoretical analysis.
Conclusion: Under the multiagent social learning framework, we modeled the behavior of agents in a biological environment and theoretically analyzed the dynamics of the model. We present sufficient conditions for convergence and for non-convergence and prove them theoretically. These conditions can be used to predict whether the system converges.


Background
All living systems live in environments that are uncertain and dynamically changing. Remarkably, these systems survive and achieve their goals by exhibiting intelligent features such as adaptation and robustness. Biological system behaviors [1] and human diseases [2] are often the outcome of complex interactions among a very large number of cells and their environments [3,4]. Similarly, in a multiagent system [5-9], an important ability of an agent is to adjust its behavior adaptively to facilitate efficient coordination among agents in unknown and dynamic environments. If we regard the cells in a biological system as the agents in a multiagent system, we can analyze the cells' behavior using the theory of multiagent systems. Understanding the collective decisions made by such intelligent multiagent systems is therefore an interesting research topic not only for artificial intelligence but also for biology. The conclusions of the theoretical analysis can be applied to biological research; for example, convergence results can be used to explain the group behavior of cells. Computational methods are now widely used to solve biological problems [10,11]. Many researchers have investigated biological systems composed of cells and their environments via modeling and simulation [1,12]. There are two principal approaches: population-based modeling and discrete agent-based modeling. Population-based modeling approximates the cells within any grid box by a set of variables associated with the grid box [13,14]. Discrete agent-based modeling maps each cell to a discrete simulation entity [13,15,16].
We use multiagent learning techniques to model the behavior of each cell agent; such techniques are an important means of achieving efficient coordination in multiagent systems [9,17-19]. A significant amount of effort has been devoted to developing effective learning techniques for different multiagent interaction environments [20-23]. In these environments, each agent interacts in each round with an agent selected randomly from its neighborhood and updates its strategy based on the feedback received in the current round. To describe the behavior of an agent, one common line of research extends existing reinforcement learning techniques from single-agent environments to multiagent interaction environments. However, because the Markov property is violated, the existing theoretical guarantees no longer hold in a multiagent environment. It is therefore both important and challenging to model the multiagent environment and analyze its learning dynamics.
This paper presents a social learning framework that simulates the dynamics of a multiagent system in a biological environment, together with a theoretical analysis of the learning dynamics of this model. The analysis sheds light on how and when consistent knowledge, in the form of an equilibrium, can or cannot evolve among the population of agents. In the social learning framework, all agents use the PHC strategy [24] for decision making and a weighted graph model for neighbor selection. In the theoretical part, we present a model for analyzing the learning dynamics of the framework. The purpose of analyzing the learning dynamics is to judge whether the learning algorithm adopted by the agents converges. The motivation is that convergence to an equilibrium is the most commonly accepted goal in the multiagent learning literature. First, we model the overall dynamics among the agents as a system of differential equations. Then, we prove that certain conditions are sufficient for convergence or for non-convergence; these conditions can be used to predict the convergence of the system. Finally, we evaluate the predictions through simulation experiments. The experimental results confirm the predictions of our theoretical analysis.
The remainder of the paper is organized as follows. The "Method" section first reviews normal-form games and the basic gradient ascent approach, together with a GA-based algorithm named PHC, and then introduces the multiagent learning framework in which all agents are PHC learners. In the "Result and discussion" section, we present the theoretical model of the agents' learning dynamics and prove convergence and non-convergence conditions by analyzing the geometrical behavior of the hybrid dynamical system with the help of nonlinear dynamics theory. In the "Experimental simulation" section, we evaluate the predictive ability of our theoretical model by comparing it with simulation results. Lastly, we conclude the paper and point out future directions in the "Conclusion" section.

Normal-form games
In a two-player, two-action, general-sum normal-form game, the payoff for each player $i \in \{k, l\}$ can be specified by a matrix as follows,

$$R_i = \begin{pmatrix} r_i^{11} & r_i^{12} \\ r_i^{21} & r_i^{22} \end{pmatrix}.$$

Each player $i$ simultaneously selects an action from its action set $A_i = \{1, 2\}$, and the payoff of each player is determined by the joint action. For example, if player $k$ selects the pure strategy of action 1 while player $l$ selects the pure strategy of action 2, then player $k$ receives a payoff of $r_k^{12}$ and player $l$ receives a payoff of $r_l^{21}$. Apart from pure strategies, each player can also employ a mixed strategy to make decisions. A mixed strategy is a probability distribution over the action set, and a pure strategy is a special case of a mixed strategy. Let $p_k \in [0, 1]$ and $p_l \in [0, 1]$ denote the probability of choosing action 1 by player $k$ and player $l$, respectively. Given a joint mixed strategy profile $(p_k, p_l)$, the expected payoffs of player $k$ and player $l$ can be computed as follows,

$$V_k(p_k, p_l) = r_k^{11} p_k p_l + r_k^{12} p_k (1 - p_l) + r_k^{21} (1 - p_k) p_l + r_k^{22} (1 - p_k)(1 - p_l),$$

$$V_l(p_k, p_l) = r_l^{11} p_l p_k + r_l^{12} p_l (1 - p_k) + r_l^{21} (1 - p_l) p_k + r_l^{22} (1 - p_l)(1 - p_k).$$

A strategy profile $(p_k^*, p_l^*)$ is a Nash equilibrium (NE) if no player can obtain a better expected payoff by unilaterally changing its current strategy. Formally,

$$V_k(p_k^*, p_l^*) \ge V_k(p_k, p_l^*) \quad \text{and} \quad V_l(p_k^*, p_l^*) \ge V_l(p_k^*, p_l) \quad \text{for all } p_k, p_l \in [0, 1].$$
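The expected payoff and the Nash equilibrium condition can be illustrated with a short Python sketch. The function names `expected_payoff` and `is_nash` are hypothetical and chosen only for this example. The check relies on the fact that the expected payoff is linear in a player's own strategy, so it is enough to compare against the two pure-strategy deviations.

```python
from typing import List

Matrix = List[List[float]]


def expected_payoff(R: Matrix, p_self: float, p_other: float) -> float:
    """Expected payoff of a player with payoff matrix R.

    R[m][n] is the payoff when the player picks action m+1 and the opponent
    picks action n+1; p_self and p_other are the probabilities of action 1.
    """
    return (R[0][0] * p_self * p_other
            + R[0][1] * p_self * (1 - p_other)
            + R[1][0] * (1 - p_self) * p_other
            + R[1][1] * (1 - p_self) * (1 - p_other))


def is_nash(R_k: Matrix, R_l: Matrix, p_k: float, p_l: float,
            tol: float = 1e-9) -> bool:
    """Check that neither player gains by deviating to a pure strategy."""
    v_k = expected_payoff(R_k, p_k, p_l)
    v_l = expected_payoff(R_l, p_l, p_k)
    return all(expected_payoff(R_k, d, p_l) <= v_k + tol
               and expected_payoff(R_l, d, p_k) <= v_l + tol
               for d in (0.0, 1.0))


# Matching pennies: player k wants to match, player l wants to mismatch.
R_k = [[1.0, -1.0], [-1.0, 1.0]]
R_l = [[-1.0, 1.0], [1.0, -1.0]]
print(is_nash(R_k, R_l, 0.5, 0.5), is_nash(R_k, R_l, 1.0, 1.0))
```

In matching pennies the only equilibrium is the fully mixed profile, which is why the pure profile in the usage lines fails the check.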

Gradient ascent (GA) and PHC algorithm
When a game is played repeatedly, an individually rational agent updates its strategy with the purpose of maximizing its expected payoff. Because the gradient is the direction of fastest increase, it is natural to model the behavior of an agent with a gradient ascent algorithm. An agent $i$ that employs a GA-based algorithm updates its policy in the direction of the gradient of its expected reward, as shown in the following equations.

$$\Delta p_i^{t+1} \leftarrow \eta \frac{\partial V_i(p^t)}{\partial p_i} \qquad (4)$$

$$p_i^{t+1} \leftarrow \Pi_{[0,1]}\left(p_i^t + \Delta p_i^{t+1}\right) \qquad (5)$$

The parameter $\eta$ is the size of the gradient step. $\Pi_{[0,1]}$ is the projection function that maps its input to the valid probability range $[0, 1]$, which prevents the gradient step from moving the strategy out of the valid probability space. Formally, we have

$$\Pi_{[0,1]}(x) = \arg\min_{z \in [0,1]} |x - z|.$$

To simplify the notation, let us define $u_i = r_i^{11} + r_i^{22} - r_i^{12} - r_i^{21}$ and $c_i = r_i^{12} - r_i^{22}$. For the two-player case, Eqs. 4 and 5 can be represented as follows,

$$p_k^{t+1} \leftarrow \Pi_{[0,1]}\left(p_k^t + \eta (u_k p_l^t + c_k)\right), \qquad p_l^{t+1} \leftarrow \Pi_{[0,1]}\left(p_l^t + \eta (u_l p_k^t + c_l)\right).$$

In the case of an infinitesimal gradient step size ($\eta \to 0$), the learning dynamics of the agents can be modeled as a system of differential equations and analyzed using dynamical systems theory [25]. It has been proved that the strategies of the agents converge to a Nash equilibrium or, if the strategies do not converge, that the agents' average payoffs converge to the average payoffs of a Nash equilibrium [26]. The policy hill-climbing (PHC) algorithm combines gradient ascent with Q-learning: each agent $i$ adjusts its policy $p$ to follow the gradient of the expected payoff (or the value function $Q$). It is shown in Algorithm 1.
Here, $\alpha \in (0, 1]$ and $\delta \in (0, 1]$ are learning rates, and the Q values are maintained just as in normal Q-learning. The policy is improved by increasing the probability of selecting the highest-valued action according to the learning rate $\delta$.
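A minimal sketch of a PHC learner for a stateless repeated game is given below in Python. Several details are assumptions made for this illustration and are not stated in the text above: the class name `PHCLearner`, the ε-greedy exploration scheme, the initialization (zero Q values, uniform policy), the stateless Q update $Q(a) \leftarrow (1 - \alpha) Q(a) + \alpha r$, and the policy step in which every non-greedy action loses at most $\delta / (|A_i| - 1)$ probability mass, following the commonly used formulation of PHC.

```python
import random
from typing import List, Optional


class PHCLearner:
    """Stateless policy hill-climbing learner for a repeated game (sketch)."""

    def __init__(self, n_actions: int = 2, alpha: float = 0.1,
                 delta: float = 0.01, epsilon: float = 0.05,
                 seed: Optional[int] = None) -> None:
        self.alpha = alpha        # Q-value learning rate
        self.delta = delta        # policy learning rate
        self.epsilon = epsilon    # exploration probability (assumed scheme)
        self.q: List[float] = [0.0] * n_actions
        self.policy: List[float] = [1.0 / n_actions] * n_actions
        self.rng = random.Random(seed)

    def select_action(self) -> int:
        """Sample from the mixed strategy; explore uniformly with prob. epsilon."""
        n = len(self.policy)
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(n)
        return self.rng.choices(range(n), weights=self.policy)[0]

    def update(self, action: int, reward: float) -> None:
        """Update Q for the played action, then step the policy toward argmax Q."""
        self.q[action] = (1 - self.alpha) * self.q[action] + self.alpha * reward
        n = len(self.policy)
        best = max(range(n), key=self.q.__getitem__)
        moved = 0.0
        for a in range(n):
            if a != best:
                # Never remove more mass than the action currently holds.
                step = min(self.policy[a], self.delta / (n - 1))
                self.policy[a] -= step
                moved += step
        self.policy[best] += moved


learner = PHCLearner(seed=0)
for _ in range(200):
    learner.update(0, 1.0)   # action 1 always pays 1
    learner.update(1, 0.0)   # action 2 always pays 0
print(learner.policy)
```

Capping each step by the current probability keeps the policy a legal distribution without a separate normalization pass.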

Modeling multiagent learning
Algorithm 1
1: Let $\alpha \in (0, 1]$ and $\delta \in (0, 1]$ be learning rates. Initialize the Q values and the policy $p_i$.
2: repeat
3:   Select action $a \in A_i$ according to mixed strategy $p_i$ with suitable exploration.
4:   Observe reward $r$ and update the Q value.
5:   Step $p_i$ closer to the optimal policy w.r.t. $Q$: $p_i(a) \leftarrow p_i(a) + \Delta_a$, while constrained to a legal probability distribution.
6: until the repeated game ends

Under the multiagent social learning framework with $N$ agents, each agent interacts in each round with one neighbor selected randomly from its neighborhood. The neighborhood of each agent is determined by the underlying network topology. The interaction between each pair of agents is modeled as a two-player normal-form game. During each interaction, each agent selects its action following a specified learning strategy, which is updated repeatedly based on the feedback from the environment at the end of the interaction. The framework is presented in Algorithm 2.

Algorithm 2
Overall interaction protocol of the social learning framework
1: repeat
2:   for each agent in the population do
3:     Choose one of its neighbors with a certain probability.
4:     Play a two-player normal-form game with this neighbor.
5:     Select an action according to its mixed strategy with suitable exploration.
6:   end for
7:   Environmental feedback.
8:   for each agent in the population do
9:     Observe reward $r$ and update its policy based on its past experience according to specific policies.
10:   end for
11: until the repeated game ends

We use a graph $G = (V, E)$ to model the underlying neighborhood network, which is composed of $N = |V|$ agents. The edges $E = \{e_{ij}\}$, $i, j \in V$, represent social contacts among agents, where $e_{ij}$ denotes the probability that agent $i$ chooses agent $j$ to interact with. We have $\sum_{j \in V} e_{ij} = 1 \wedge e_{ii} = 0$. Here, we propose an adaptive strategy with which agents make their decisions in the social learning framework using the PHC learning strategy, as shown in Algorithm 3.
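The constraints on the edge weights and the neighbor selection step can be sketched in Python as follows. The helper names `validate_edges` and `choose_neighbor` are hypothetical; the weight matrix is represented as a list of rows, with `E[i][j]` standing for $e_{ij}$.

```python
import random
from typing import List


def validate_edges(E: List[List[float]], tol: float = 1e-9) -> bool:
    """True if every row of E is a probability vector with a zero diagonal."""
    return all(abs(sum(row) - 1.0) <= tol       # sum_j e_ij = 1
               and abs(row[i]) <= tol           # e_ii = 0
               and min(row) >= 0.0              # probabilities are non-negative
               for i, row in enumerate(E))


def choose_neighbor(E: List[List[float]], i: int, rng: random.Random) -> int:
    """Sample the interaction partner j of agent i with probability e_ij."""
    return rng.choices(range(len(E)), weights=E[i])[0]


E = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.3, 0.7, 0.0]]
rng = random.Random(1)
print(validate_edges(E), choose_neighbor(E, 0, rng))
```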

Algorithm 3
Learning process in the multiagent framework for agent $i \in V$
1: Let $\alpha \in (0, 1]$ and $\delta \in (0, 1]$ be learning rates. Initialize the Q values and the policy $p_i$.
2: repeat
3:   Select agent $j \in V$ according to $E$ with probability $e_{ij}$.
4:   Select action $a \in A_i$ according to mixed strategy $p_i$ with suitable exploration.
5:   Observe reward $r$ according to the interaction between $i$ and $j$.
6:   Update the Q value.
7:   Step $p_i$ closer to the optimal policy w.r.t. $Q$: $p_i(a) \leftarrow p_i(a) + \Delta_a$, while constrained to a legal probability distribution.
8: until the repeated game ends
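As a rough illustration of Algorithm 3, the following Python sketch simulates a population of two-action PHC learners on a weighted graph. The function name `simulate_social_phc`, the ε-greedy exploration, the stateless Q update $Q(a) \leftarrow (1 - \alpha) Q(a) + \alpha r$, the uniform initial policy, and the choice that each agent draws a single action per round and is rewarded against the action of its selected neighbor are all assumptions made for this example.

```python
import random
from typing import List


def simulate_social_phc(E: List[List[float]], R: List[List[List[float]]],
                        rounds: int, alpha: float = 0.1, delta: float = 0.01,
                        epsilon: float = 0.05, seed: int = 0) -> List[float]:
    """Return p_i (probability of action 1) for every agent after `rounds`.

    E[i][j] is the probability that agent i selects agent j; R[i][m][n] is
    the payoff of agent i when it plays action m+1 and its neighbor n+1.
    """
    rng = random.Random(seed)
    n = len(E)
    q = [[0.0, 0.0] for _ in range(n)]   # Q values per agent and action
    p = [0.5] * n                        # initial mixed strategies
    for _ in range(rounds):
        # Every agent draws one action (index 0 is action 1) with exploration.
        actions = []
        for i in range(n):
            if rng.random() < epsilon:
                actions.append(rng.randrange(2))
            else:
                actions.append(0 if rng.random() < p[i] else 1)
        for i in range(n):
            # Select a neighbor j with probability e_ij and observe the reward.
            j = rng.choices(range(n), weights=E[i])[0]
            a = actions[i]
            r = R[i][a][actions[j]]
            # Assumed stateless Q update for the repeated game.
            q[i][a] = (1 - alpha) * q[i][a] + alpha * r
            # PHC step: move probability mass toward the greedy action.
            if q[i][0] >= q[i][1]:
                p[i] = min(1.0, p[i] + delta)
            else:
                p[i] = max(0.0, p[i] - delta)
    return p


ring = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
dominant_first = [[[1.0, 1.0], [0.0, 0.0]] for _ in range(3)]
print(simulate_social_phc(ring, dominant_first, rounds=200))
```

In the usage example action 1 is dominant for every agent, so all strategies are driven to the pure strategy of action 1.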

Analysis of the multiagent Learning Dynamics
In this section, we present a theoretical model to estimate and analyze the learning dynamics of the multiagent learning framework in Algorithm 3. We extend the notation of the "Normal-form games" section to the multiagent environment. Without loss of generality, we consider the two-action case only. Assume that the payoff an agent receives depends only on the joint action; then the payoff for agent $i \in V$ can be defined as a fixed matrix $R_i$,

$$R_i = \begin{pmatrix} r_i^{11} & r_i^{12} \\ r_i^{21} & r_i^{22} \end{pmatrix},$$

where $r_i^{mn}$ denotes the payoff received by agent $i$ when $i$ selects action $m$ and its neighbor selects action $n$. Here, we use $p_i$ to denote the probability that player $i$ selects action 1. The mixed strategy profile $(p_1, p_2, \ldots, p_N)$ in the multiagent framework can then be considered as a point in $\mathbb{R}^N$ constrained to the unit hypercube $[0, 1]^N$. The expected payoff $V_i(p_1, p_2, \ldots, p_N)$ of player $i$ can be computed as follows,

$$V_i(p_1, \ldots, p_N) = \sum_{j \in V} e_{ij} \left[ r_i^{11} p_i p_j + r_i^{12} p_i (1 - p_j) + r_i^{21} (1 - p_i) p_j + r_i^{22} (1 - p_i)(1 - p_j) \right],$$

where $e_{ij}$ is the probability that agent $i$ selects agent $j$ to interact with.
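Each agent $i$ updates its strategy in order to maximize the value of $V_i$. Recalling Eqs. 4 and 5, we obtain

$$p_i^{t+1} = \Pi_{[0,1]}\left(p_i^t + \eta \frac{\partial V_i(p^t)}{\partial p_i}\right), \quad i \in V, \qquad (11)$$

where the parameter $\eta$ is the size of the gradient step. As $\eta \to 0$, Eq. 11 becomes a differential equation. Considering the step size to be infinitesimal, the unconstrained dynamics of all players' strategies can be modeled by the following differential equations.

$$\dot{p}_i = \frac{\partial V_i}{\partial p_i} = \sum_{j \in V} e_{ij} \left[ (r_i^{11} + r_i^{22} - r_i^{12} - r_i^{21})\, p_j + (r_i^{12} - r_i^{22}) \right], \quad i \in V. \qquad (12)$$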
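Equation 12 can be simplified as follows using some notation,

$$\dot{P} = UEP + C, \qquad (13)$$

where $P = (p_1, \ldots, p_N)^T$, $U = \mathrm{diag}(u_1, \ldots, u_N)$ with $u_i = r_i^{11} + r_i^{22} - r_i^{12} - r_i^{21}$, $E = (e_{ij})_{N \times N}$, and $C = (c_1, \ldots, c_N)^T$ with $c_i = r_i^{12} - r_i^{22}$. The constrained dynamics of the strategies can be modeled by the following equations,

$$\dot{p}_i = \begin{cases} 0 & \text{if } p_i = 0 \text{ and } G_i < 0, \\ 0 & \text{if } p_i = 1 \text{ and } G_i > 0, \\ G_i & \text{otherwise,} \end{cases} \qquad (14)$$

where $G_i = u_i \sum_{j \in V} e_{ij} p_j + c_i$. Notice that Eq. 14 is a hybrid system composed of two parts: a series of continuous linear differential dynamical systems in their respective domains, and a switching mechanism between these systems when the dynamics touch the boundary. In general, it is hard to obtain complete conclusions by analyzing the dynamics of a general hybrid system, even when the differential system is linear. However, we can still find some convergence and non-convergence conditions for certain instances (i.e., Eq. 14).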
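The hybrid dynamics of Eq. 14 can be explored numerically. The Python sketch below uses a forward Euler discretization, which is an assumption of this example and only approximates the continuous-time system; the function names `constrained_step` and `integrate` are hypothetical. The diagonal of $U$ is passed as the list `u`.

```python
from typing import List


def constrained_step(P: List[float], u: List[float], E: List[List[float]],
                     c: List[float], dt: float) -> List[float]:
    """One forward Euler step of the constrained dynamics."""
    n = len(P)
    # G_i = u_i * sum_j e_ij p_j + c_i, evaluated at the current point.
    G = [u[i] * sum(E[i][j] * P[j] for j in range(n)) + c[i] for i in range(n)]
    new_P = []
    for p, g in zip(P, G):
        if (p <= 0.0 and g < 0.0) or (p >= 1.0 and g > 0.0):
            new_P.append(p)                       # switch: stay on the boundary
        else:
            new_P.append(min(1.0, max(0.0, p + dt * g)))
    return new_P


def integrate(P0: List[float], u: List[float], E: List[List[float]],
              c: List[float], dt: float = 0.01, steps: int = 1000) -> List[float]:
    """Iterate constrained_step and return the final strategy vector."""
    P = list(P0)
    for _ in range(steps):
        P = constrained_step(P, u, E, c, dt)
    return P


pair = [[0.0, 1.0], [1.0, 0.0]]   # two agents that always interact with each other
print(integrate([0.6, 0.4], [-1.0, -1.0], pair, [0.5, 0.5], steps=2000))
```

In the usage example $UE$ is symmetric with eigenvalues $\pm 1$, so the trajectory leaves the interior point $(0.5, 0.5)$ and settles on a corner of the unit square.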

Non-convergence condition of the multiagent learning framework
According to the above definition, we have the following general result under which non-convergence is guaranteed.

Theorem 1
In an $N$-agent, two-action, integrated general-sum game, suppose every player follows the constrained strategy dynamics defined in Eq. 14. If the following two conditions are met,
1. there exists a point $P^* = (p_1^*, p_2^*, \ldots, p_N^*)^T \in (0, 1)^N$ such that $UEP^* + C = 0$,
2. the matrix $UE$ has a pair of pure imaginary eigenvalues,
then there exists a set $\mathcal{P} \subset [0, 1]^N$ such that the solution of the initial value problem of Eq. 14 with $P(0) \in \mathcal{P}$ does not converge.
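Proof Considering the complexity of the hybrid system represented by Eq. 14, we begin with the unconstrained dynamics. Based on the theory of differential equations and dynamical systems [25], we calculate the analytic solution of Eq. 13. Homogenizing the inhomogeneous equation by substituting $P = X + P^*$, where $UEP^* + C = 0$, we get $\dot{X} = UEX$.
Let $T$ be the transformation matrix such that $T^{-1} UE\, T = J = \mathrm{diag}(J_1, \ldots, J_m)$. Each $J_i$ is a square matrix whose form is one of the following two,

$$(1)\quad J_i = \begin{pmatrix} \lambda_i & 1 & & \\ & \lambda_i & \ddots & \\ & & \ddots & 1 \\ & & & \lambda_i \end{pmatrix}, \qquad (2)\quad J_i = \begin{pmatrix} D_i & I_2 & & \\ & D_i & \ddots & \\ & & \ddots & I_2 \\ & & & D_i \end{pmatrix}, \quad D_i = \begin{pmatrix} a_i & b_i \\ -b_i & a_i \end{pmatrix} \text{ for } \lambda_i = a_i + b_i \mathrm{i}.$$

Here, $J$ is the Jordan normal form of matrix $UE$, and $J_i$ is the Jordan block corresponding to $\lambda_i$, which is a repeated eigenvalue of $UE$ with multiplicity $n_i$. If the eigenvalue $\lambda_i$ is a real number, then $J_i$ is of form (1); otherwise $J_i$ is of form (2). Suppose that $\lambda_1, \ldots, \lambda_k$ are the real eigenvalues of $UE$ and $\lambda_{k+1}, \ldots, \lambda_m$ are its complex eigenvalues; then we have $n_1 + \ldots + n_k + 2(n_{k+1} + \ldots + n_m) = N$.
The analytic solution of $\dot{X} = UEX$ with initial value $X(0)$ is then

$$X(t) = T e^{tJ} T^{-1} X(0).$$

Using the notation $Y(t) = T^{-1} X(t)$, we have $Y(t) = e^{tJ} Y(0)$. Suppose that $\lambda_k = \beta \mathrm{i}$ is a pure imaginary eigenvalue of $UE$ with multiplicity $n_k$, so that $\bar{\lambda}_k = -\beta \mathrm{i}$ is also an eigenvalue of $UE$ with multiplicity $n_k$. Then $J$ has a block $J_k$ of form (2) with

$$D_2 = \begin{pmatrix} 0 & \beta \\ -\beta & 0 \end{pmatrix}.$$

Since $e^{tD_2} = \exp\left( t \begin{pmatrix} 0 & \beta \\ -\beta & 0 \end{pmatrix} \right) = \begin{pmatrix} \cos \beta t & \sin \beta t \\ -\sin \beta t & \cos \beta t \end{pmatrix}$, there must exist a pair of components of the vector $Y(t)$ of the following form,

$$y_i(t) = y_i(0) \cos \beta t + y_{i+1}(0) \sin \beta t, \qquad y_{i+1}(t) = -y_i(0) \sin \beta t + y_{i+1}(0) \cos \beta t.$$

If $y_i(0) \neq 0 \vee y_{i+1}(0) \neq 0$, then Eq. 14 has a periodic solution. Let $v_i$ and $v_{i+1}$ denote the columns of $T = (v_1, \ldots, v_N)$ corresponding to $\lambda_k$ and $\bar{\lambda}_k$, respectively. Note that $X(t) = TY(t)$; then the solution of Eq. 13 with initial value $P(0) \in S$ is cyclical, where

$$S = \{ P^* + a v_i + b v_{i+1} : a, b \in \mathbb{R},\ (a, b) \neq (0, 0) \}.$$

Because $P^* \in (0, 1)^N$, there must exist an $\varepsilon > 0$ such that the deleted neighborhood $B(P^*; \varepsilon)$ of $P^*$ satisfies $B(P^*; \varepsilon) \subset (0, 1)^N$. Let $\mathcal{P} = S \cap B(P^*; \varepsilon)$; the solution of Eq. 14 with any initial value belonging to $\mathcal{P}$ is cyclical, which means that the algorithm corresponding to Eq. 14 cannot converge.
Theorem 1 shows that there exist situations in which the agents fail to converge under the multiagent social learning framework. Before giving the details of those situations, we need to introduce the following notation.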
According to Theorem 1, $T = (v_1, v_2, \ldots, v_N)$ is the transformation matrix for which $T^{-1} UE\, T = J$. Let $v_{j1}, v_{j2}, \ldots, v_{jn_j}$ denote the eigenvectors associated with eigenvalue $\lambda_j$, $j = 1, 2, \ldots, m$. According to the properties of matrix transformations [27], $v_{j1}, v_{j2}, \ldots, v_{jn_j}$ are linearly independent. Classify the column vectors of the transformation matrix $T$ into three parts according to the corresponding eigenvalue $\lambda$, where $V_1$ collects the vectors $v_i$ whose eigenvalues satisfy $\mathrm{Re}(\lambda_i) < 0$. Now we are ready to give a precise description of the subspace in which the agents fail to converge, which is summarized in the following theorem.
Theorem 2
that the solution of the initial value problem of Eq. 14 with $P(0) \in \mathcal{P}$ does not converge.
Proof According to Theorem 1, the solution of the initial value problem of Eq. 14 with $P(0) \in S \cap B(P^*; \varepsilon)$ does not converge. Here For every eigenvalue $\lambda_i$ associated with a vector $v_i \in V_1$, we have $\mathrm{Re}(\lambda_i) < 0$. According to results from bifurcation theory [25], the subspace $\mathrm{span}(V_1)$ is a stable submanifold of the unconstrained dynamics (13), which means that every trajectory starting from S will eventually converge to $P^*$, where Then trajectories starting from S will eventually converge to S, and we obtain the final conclusion that the solution of the initial value problem of Eq. 14 with $P(0) \in \mathcal{P}$ does not converge.
Note that Theorems 1 and 2 are only sufficient conditions for non-convergence.

Convergence condition of the multiagent learning framework
In most cases, conditions that guarantee the convergence of an algorithm are more valuable.

Theorem 3
In an $N$-agent, two-action, integrated general-sum game, suppose every player follows the constrained strategy dynamics defined in Eq. 14. If the following two conditions are met,
1. there exists a point $P^* = (p_1^*, p_2^*, \ldots, p_N^*)^T \in (0, 1)^N$ such that $UEP^* + C = 0$,
2. all eigenvalues of matrix $UE$ have negative real parts,
then all solutions of the initial value problem of Eq. 14 with $P(0) \in [0, 1]^N$ will eventually converge.
Proof It is known that a linear dynamical system is stable when all eigenvalues of its system matrix have negative real parts. Hence, if all eigenvalues of matrix $UE$ have negative real parts, then the point $P^*$ is a stable equilibrium point. It follows that all solutions of the initial value problem of Eq. 14 with $P(0) \in [0, 1]^N$ converge to $P^*$.
Theorem 3 gives a sufficient condition for the convergence of the dynamics in Eq. 14. However, it is hard to calculate the eigenvalues of a high-dimensional matrix. Here, we propose a more practical convergence condition that is suitable for the multiagent learning framework shown in Algorithm 3.

Theorem 4
In an $N$-agent, two-action, integrated general-sum game, suppose every player follows the constrained strategy dynamics defined in Eq. 14. If matrix $UE$ is symmetric, then all solutions of the initial value problem of Eq. 14 with $P(0) \in [0, 1]^N$ will eventually converge.
Proof It is known that the eigenvalues of a real symmetric matrix are real numbers [27]. We analyze all the cases of Eq. 14 when all eigenvalues of matrix $UE$ are real:
1. There exists a point $P^* = (p_1^*, p_2^*, \ldots, p_N^*)^T \in (0, 1)^N$ such that $UEP^* + C = 0$.
2. There is no such point $P^*$ with $UEP^* + C = 0$.
For case 1, if all eigenvalues of matrix $UE$ are negative, then $P^*$ is a stable equilibrium point; otherwise, the solutions of the initial value problem of the hybrid system with $P(0) \in [0, 1]^N$ move away from $P^*$ toward the boundary of the hybrid system's domain [25]. Because the domain of the hybrid system represented by Eq. 14 is bounded (i.e., $P(t) \in [0, 1]^N$), there must exist a point $\bar{P} = (\bar{p}_1, \ldots, \bar{p}_N)^T$ on the boundary of the domain to which the dynamics $P(t)$ will eventually converge. Similarly, in case 2 we can find a point $\bar{P} = (\bar{p}_1, \ldots, \bar{p}_N)^T$ on the boundary of the hybrid system's domain to which the dynamics $P(t)$ will eventually converge. Hence the theorem holds.
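The symmetry condition of Theorem 4 is cheap to verify because it needs no eigenvalue computation. A minimal Python sketch follows; the helper names `ue_matrix` and `is_symmetric` are hypothetical, and the diagonal of $U$ is passed as the list `u`.

```python
from typing import List


def ue_matrix(u: List[float], E: List[List[float]]) -> List[List[float]]:
    """Product UE for diagonal U: row i of E scaled by u_i."""
    n = len(u)
    return [[u[i] * E[i][j] for j in range(n)] for i in range(n)]


def is_symmetric(M: List[List[float]], tol: float = 1e-12) -> bool:
    """True if M equals its transpose up to the tolerance."""
    n = len(M)
    return all(abs(M[i][j] - M[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))


# Four agents on a complete graph with uniform neighbor selection.
complete = [[0.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
print(is_symmetric(ue_matrix([2.0, 2.0, 2.0, 2.0], complete)))
```

Note that a symmetric $E$ is not enough on its own: unequal values of $u_i$ scale the rows differently and can break the symmetry of $UE$.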
Based on the conclusions of the subsections "Non-convergence condition of the multiagent learning framework" and "Convergence condition of the multiagent learning framework", we can determine the learning dynamics of the cases defined by Eqs. 13 and 14. However, the computational cost may become prohibitive when the model is large. In the next section, we consider a special case with an interesting network structure that can be analyzed at relatively low computational cost for any network size.

The simplest case whose underlying topology is a ring
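We consider the case in which the underlying topology is a ring and each agent interacts only with the neighbor on its right-hand side in each interaction. As defined in the previous section, the adjacency matrix $E$ is

$$E = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ 1 & 0 & 0 & \cdots & 0 \end{pmatrix}.$$

According to Eq. 14, the constrained dynamics of this special case can be modeled as follows:

$$\dot{p}_i = \begin{cases} 0 & \text{if } p_i = 0 \text{ and } G_i < 0, \\ 0 & \text{if } p_i = 1 \text{ and } G_i > 0, \\ G_i & \text{otherwise,} \end{cases} \qquad (15)$$

where $G_i = u_i p_{i+1} + c_i$ for $i \in \{1, 2, \ldots, N - 1\}$, and $G_N = u_N p_1 + c_N$. By analyzing the dynamics of this model, we obtain the following conclusion.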
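The ring structure can be written down directly. The Python sketch below builds the adjacency matrix and evaluates the gradient terms $G_i$ of the ring model with zero-based indices; the function names `ring_adjacency` and `ring_gradients` are hypothetical.

```python
from typing import List


def ring_adjacency(n: int) -> List[List[float]]:
    """Agent i interacts only with its right-hand neighbor (i + 1) mod n."""
    return [[1.0 if j == (i + 1) % n else 0.0 for j in range(n)]
            for i in range(n)]


def ring_gradients(p: List[float], u: List[float], c: List[float]) -> List[float]:
    """G_i = u_i * p_{i+1} + c_i, with the last agent wrapping around to the first."""
    n = len(p)
    return [u[i] * p[(i + 1) % n] + c[i] for i in range(n)]


print(ring_adjacency(3))
print(ring_gradients([0.2, 0.5, 0.9], [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]))
```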

Theorem 5
In an $N$-player, two-action, integrated general-sum game, suppose every agent follows the constrained dynamics of the model in Eq. 15. If one of the agents converges to a strategy, then every agent will eventually converge.
Proof Suppose agent $k$ converges at some time; by definition, its strategy $p_k$ is then a constant. In Eq. 15, $G_{k-1} = u_{k-1} p_k + c_{k-1}$ is therefore also a constant, so $p_{k-1}$ changes monotonically within $[0, 1]$ and must converge. Hence convergence of player $k$ implies convergence of player $k - 1$. By induction, every agent will eventually converge.
According to the above theorem, we can easily obtain the following proposition.

Proposition 1 In Eq. 15, if there exists a dominant strategy for some players, then their strategies will asymptotically converge to a Nash equilibrium.
Based on the above conclusions, we finally present the following non-convergence result.

Theorem 6
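In an $N$-agent, two-action, integrated general-sum game, suppose every player follows the constrained strategy dynamics defined in Eq. 15. If no player has a dominant strategy and one of the following conditions is met,
1. $N = 4k$, $k \in \mathbb{N}$, and $\prod_{i=1}^{N} u_i > 0$,
2. $N = 4k + 2$, $k \in \mathbb{N}$, and $\prod_{i=1}^{N} u_i < 0$,
then there exists a set $\mathcal{P} \subset [0, 1]^N$ such that the solution of the initial value problem of Eq. 15 with $P(0) \in \mathcal{P}$ does not converge.
Proof According to the definitions above, the payoff matrix of player $i$ is

$$R_i = \begin{pmatrix} r_i^{11} & r_i^{12} \\ r_i^{21} & r_i^{22} \end{pmatrix}.$$

Since no agent has a dominant strategy, we have

$$(r_i^{11} - r_i^{21})(r_i^{12} - r_i^{22}) < 0.$$

Because $r_i^{11} - r_i^{21} = u_i + c_i$ and $r_i^{12} - r_i^{22} = c_i$, it follows that $u_i c_i < 0$ and

$$0 < -\frac{c_i}{u_i} < 1.$$

Set $p_{i+1}^* = -\frac{c_i}{u_i}$ (with indices taken modulo $N$, so that $p_{N+1}^* = p_1^*$) and $P^* = (p_1^*, p_2^*, \ldots, p_N^*)^T$; then we have $P^* \in (0, 1)^N$ and $UEP^* + C = 0$. Considering Eq. 15, by calculating the eigenvalues $\lambda$ of matrix $UE$, we have

$$\lambda^N = \prod_{i=1}^{N} u_i.$$

If $N = 4k$, $k \in \mathbb{N}$, and $\prod_{i=1}^{N} u_i > 0$, then matrix $UE$ has a pair of pure imaginary eigenvalues. Likewise, if $N = 4k + 2$, $k \in \mathbb{N}$, and $\prod_{i=1}^{N} u_i < 0$, then matrix $UE$ has a pair of pure imaginary eigenvalues. According to Theorem 1, there exists a set $\mathcal{P} \subset [0, 1]^N$ such that the solution of the initial value problem of Eq. 15 with $P(0) \in \mathcal{P}$ does not converge.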
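The ring conditions can be checked without a numerical eigenvalue solver. The Python sketch below assumes the ring structure of Eq. 15, for which an eigenvector $x$ of $UE$ satisfies $u_i x_{i+1} = \lambda x_i$ around the ring and hence $\lambda^N = \prod_i u_i$. The function names are hypothetical, all $u_i$ are assumed to be nonzero, and the interior equilibrium is computed from $G_i = u_i p_{i+1} + c_i = 0$ with zero-based indices.

```python
import cmath
import math
from typing import List


def ring_eigenvalues(u: List[float]) -> List[complex]:
    """Eigenvalues of UE on the ring: the N complex roots of lambda^N = prod(u)."""
    n = len(u)
    prod = math.prod(u)
    radius = abs(prod) ** (1.0 / n)
    phase = 0.0 if prod >= 0 else math.pi   # argument of prod(u)
    return [cmath.rect(radius, (phase + 2.0 * math.pi * m) / n) for m in range(n)]


def has_pure_imaginary_pair(u: List[float], tol: float = 1e-9) -> bool:
    """True if some nonzero eigenvalue of UE lies on the imaginary axis."""
    return any(abs(lam.real) < tol and abs(lam.imag) > tol
               for lam in ring_eigenvalues(u))


def interior_equilibrium(u: List[float], c: List[float]) -> List[float]:
    """Solve G_i = 0 for all i: p*_{i+1} = -c_i / u_i, indices modulo N."""
    n = len(u)
    p = [0.0] * n
    for i in range(n):
        p[(i + 1) % n] = -c[i] / u[i]
    return p


print(has_pure_imaginary_pair([1.0, 1.0, 1.0, 1.0]))   # N = 4k, positive product
print(has_pure_imaginary_pair([1.0, -1.0]))            # N = 4k + 2, negative product
print(interior_equilibrium([2.0, 4.0], [-1.0, -1.0]))
```

Because complex roots of a real matrix come in conjugate pairs, finding one eigenvalue on the imaginary axis is enough to establish the pair required by Theorem 1.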

Experimental simulation
In this section, we compare the empirical dynamics of the multiagent social learning framework composed of PHC learners with the theoretical predictions of our hybrid dynamic model. We perform two experiments that satisfy the conditions of Theorems 1 and 4, respectively.
In Fig. 1, the dynamic solution of the game with initial value $P(0)$ is plotted, where $k_1 = k_2 = 0.1$. Each of the four lines in Fig. 1 shows how the strategy of one agent changes over time. We can see that the strategies of the agents do not converge, so the simulation results are consistent with the theoretical prediction.

A convergent multiagent game
In this subsection, we consider a 4-player, two-action game. The game is defined as follows. Because matrix $UE$ is symmetric, according to Theorem 4, the solution of the initial value problem of this game with any $P(0) \in [0, 1]^4$ will eventually converge. Figure 2 illustrates the dynamics of the PHC learners' strategies for the game with initial value $P(0) = (1/2, 1/2, 1/2, 1/2)^T$. Each of the four lines in Fig. 2 shows how the strategy of one agent changes over time. We can see that the strategies of the agents eventually converge, which is consistent with the theoretical prediction.

Conclusion
In this work, we proposed a multiagent social learning framework to model the behavior of agents in a biological environment, and we theoretically analyzed the dynamics of this framework using nonlinear dynamics theory. We presented sufficient conditions for convergence and for non-convergence and proved them theoretically. These conditions can be used to predict the convergence of the system. Experimental results show that the predictions of our dynamic model are consistent with the simulation results.
As future work, a more extensive study of the dynamics of the multiagent social learning framework with PHC learners is needed. Other worthwhile directions include improving the PHC algorithm, developing a more realistic multiagent social learning framework to model the interactions among cells in biological environments, and achieving better convergence performance based on our theoretical findings.