Nash equilibrium realization of population games based on social learning processes

Abstract: In the two-population game model, we assume that the players have certain imitative learning abilities. To simulate the players' learning process, we propose a new swarm intelligence algorithm built on the particle swarm optimization algorithm, in which each player is treated as a particle. We conduct simulations for three typical games: the prisoner's dilemma game (with only one pure-strategy Nash equilibrium), the coin-flip game (with only one fully-mixed Nash equilibrium), and the coordination game (with two pure-strategy Nash equilibria and one fully-mixed Nash equilibrium). The results show that when the game has a pure-strategy Nash equilibrium, the algorithm converges to that equilibrium. However, if the game does not have a pure-strategy Nash equilibrium, the algorithm exhibits periodic convergence to the only mixed-strategy Nash equilibrium. Furthermore, the magnitude of the periodic convergence is inversely proportional to the introspection rate. In our experiments, the algorithm outperforms the Meta Equilibrium Q-learning algorithm in realizing mixed-strategy Nash equilibrium.


Introduction
Game theory is an important branch of operations research, in which population games are a classical game model. Because population games not only reveal the essential features of collaboration and competition but also provide profound insights into interactions among populations, they have been widely used in social sciences, biology, economics, and other fields. Therefore, population games have been a hot topic of academic research. In game theory, Nash equilibrium [1,2] is an important concept. However, the traditional Nash equilibrium requires that players are perfectly rational and have complete information. Fudenberg and Levine [3] propose an alternative interpretation of equilibrium: "The equilibrium is the long-term outcome of the process by which imperfectly rational players seek to optimize over time". Influenced by this interpretation, we consider how to find the path to Nash equilibrium under conditions of imperfect rationality and incomplete information. In reality, players aim to maximize their benefits, and equilibrium emerges after repeated games. Nash equilibrium is an integral component of this equilibrium, and because it is difficult to establish, investigating the process of Nash equilibrium formation holds intrinsic value. The players considered here are boundedly rational, and their abilities are limited. To depict the strategic interactions among these players, we develop an algorithm to simulate their gaming processes. Among swarm intelligence algorithms, the particle swarm optimization (PSO) algorithm [4,5] is based on the feeding behavior of a bird flock. Both the PSO algorithm and the realization of Nash equilibrium are based on the concept of optimization, albeit with distinct approaches. The PSO algorithm emphasizes collective optimization, whereas Nash equilibrium realization centers on individual optimization. Therefore, we can draw on the PSO algorithm to develop an algorithm suitable for achieving Nash equilibrium. The algorithmic realization of Nash equilibrium is rooted in the decisions made by imperfectly rational players, with algorithms developed to simulate the evolution of equilibrium. However, limited research exists on achieving Nash equilibrium in population games using swarm intelligence algorithms.
In the field of game theory, Nash equilibrium theory holds significant importance. Learning rules provide a perspective for studying Nash equilibrium from the players' viewpoint. Currently, three primary types of learning models exist. The first type is the virtual action learning theory [6–13], which was first proposed by Fudenberg and Levine [3]. It holds that the opponent's strategy remains uncertain in each game, so the opponent's moves must be anticipated. The theory of virtual action learning considers the opponent's prior strategy choices, assigns weights to these choices, and uses the weighted outcome to predict the opponent's subsequent strategy. The second type is the social learning model [14–18] for population games, which was also proposed by Fudenberg and Levine [19]. Within this model, players can obtain information about fellow players who achieve superior benefits within the population. This collective learning process eventually drives the system to a stable state. The third type is the reinforcement learning model [20–24], for which Littman [25] proposed a two-player model in zero-sum games. The model assumes that players can remember the strategies and associated benefits from previous games. Through continuous reflective learning, players strive to achieve Nash equilibrium. In addition, Borgers and Sarin [26] proposed the stimulus-response learning model based on the reinforcement learning model. In this model, players can recall only their own past strategy selections and the associated benefits. Consequently, they are inclined to use their previous actions to guide future strategic decisions. The model posits that well-performing actions are positively reinforced, while poorly performing actions are negatively reinforced. Jordan [27] proposed the Bayesian learning model, and Camerer and Ho [28] proposed the experience-weighted attraction (EWA) model. These two models are also important learning models based on virtual, social, and reinforcement learning models.
Previous research on realizing Nash equilibrium has been mainly based on reinforcement learning theory. In 2000, Singh et al. [29] first proposed the infinitesimal gradient algorithm (IGA), which enables each player to adjust its strategy based on the gradient of its expected benefit. This algorithm converges to a particular Nash equilibrium. After that, Zinkevich [30] proposed the generalized infinitesimal gradient algorithm (GIGA), which extends the applicability of the IGA algorithm from two strategies to multi-strategy scenarios. Wang et al. [31] extended the investigation to meta-games, achieving a path towards meta-equilibrium using Q-learning. However, the existing realizations of Nash equilibrium mainly focus on inter-player games, and further exploration and refinement are necessary for the realization of Nash equilibrium in population games. This paper develops the population game particle swarm optimization (PGPSO) algorithm, which uses social learning and population imitation as theoretical sources. We theoretically prove the convergence of the PGPSO algorithm, and we prove that a unique mixed-strategy Nash equilibrium is the center of a stable limit ring of the algorithm. Using the PGPSO algorithm, we simulate the evolution of three two-population games and search for the realization paths of their Nash equilibria. The experimental outcomes validate the efficacy of the PGPSO algorithm in uncovering Nash equilibria. Additionally, the effects of the introspection rate and the initial state on the realization of Nash equilibrium by the PGPSO algorithm are further explored.

Preliminaries
This section will mainly introduce the fundamental concepts of population games, social learning theory, and population imitation theory.

Concepts related to two-population game
The two-population game is denoted by $\{\Gamma, X, F\}$ [32].
1) $\Gamma = \{1, 2\}$ denotes the two populations. For each population $p \in \Gamma$, $S^p = \{1, 2\}$ denotes the set of pure strategies available to population $p$.
2) For population 1, $x_1$ denotes the proportion of players who choose strategy 1; for population 2, $y_1$ denotes the proportion of players who choose strategy 1. Further, $x = (x_1, x_2)$ denotes the pure strategy distribution state of population 1, and $y = (y_1, y_2)$ denotes the pure strategy distribution state of population 2. The strategy choice of the $i$-th player in population 1 is denoted as $x_i$, and likewise, the strategy choice of the $j$-th player in population 2 is denoted as $y_j$. Given that players can only select pure strategies, $x_i, y_j \in \{0, 1\}$. If $x_i$ ($y_j$) $= 1$, the $i$-th ($j$-th) player has chosen strategy 1; if $x_i$ ($y_j$) $= 0$, the $i$-th ($j$-th) player has chosen strategy 2. The combined set $X = (x, y)$ represents the social state of the two populations $\Gamma$.
3) For any given population $p$, $F^p_s : X \to \mathbb{R}$ denotes the expected benefit associated with pure strategy $s \in S^p$. Accordingly, $F^p : X \to \mathbb{R}^2$ represents the expected benefit of population $p$ over its pure strategy set $S^p$. The overall expected benefit function of the entire society $\Gamma$ is denoted as $F = (F^1, F^2) : X \to \mathbb{R}^4$.
The following definition is the Nash equilibrium definition of the two-population game $\{\Gamma, X, F\}$ [32].
Definition 1. Let $\{\Gamma, X, F\}$ be a two-population game. If the social state $z = (\bar{x}, \bar{y}) \in X$ satisfies, for all $p \in \Gamma$ and all $s \in S^p$,
$$\bar{x}_s > 0,\ \bar{y}_s > 0 \ \Rightarrow\ F^p_s(z) = \max_{r \in S^p} F^p_r(z),$$
then we define $z = (\bar{x}, \bar{y})$ as the Nash equilibrium of the population game $\{\Gamma, X, F\}$, and denote the set containing all Nash equilibria as $E(F)$.
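Definition 1 can be checked numerically for a given social state. The following is a minimal sketch, not the paper's code. It assumes the usual bimatrix convention, which the text does not fix: `A[s][r]` and `B[s][r]` are the benefits to a population-1 player and a population-2 player when population 1 uses strategy `s` and population 2 uses strategy `r`. The function names and the matching-pennies matrices are hypothetical and chosen only for illustration.

```python
def strategy_benefits(A, B, x1, y1):
    """Expected benefit of each pure strategy in each population.

    Assumed convention: A[s][r] / B[s][r] are the benefits to population 1 /
    population 2 when population 1 plays s and population 2 plays r
    (index 0 = strategy 1, index 1 = strategy 2).
    """
    x = (x1, 1.0 - x1)
    y = (y1, 1.0 - y1)
    F1 = [sum(A[s][r] * y[r] for r in range(2)) for s in range(2)]
    F2 = [sum(B[s][r] * x[s] for s in range(2)) for r in range(2)]
    return F1, F2


def is_nash(A, B, x1, y1, tol=1e-9):
    """Definition 1: every strategy used with positive share must be a best reply."""
    F1, F2 = strategy_benefits(A, B, x1, y1)
    x = (x1, 1.0 - x1)
    y = (y1, 1.0 - y1)
    pop1_ok = all(F1[s] >= max(F1) - tol for s in range(2) if x[s] > tol)
    pop2_ok = all(F2[r] >= max(F2) - tol for r in range(2) if y[r] > tol)
    return pop1_ok and pop2_ok


# Illustrative matching-pennies matrices (hypothetical, not taken from the paper).
A = [[1, -1], [-1, 1]]
B = [[-1, 1], [1, -1]]
print(is_nash(A, B, 0.5, 0.5))  # fully mixed state
print(is_nash(A, B, 1.0, 1.0))  # both populations on strategy 1
```

With these illustrative matrices, the fully mixed state passes the check while the pure state does not, because population 2 would gain by switching.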

Social learning theory
Fudenberg and Levine [3] proposed the social learning theory to explain the formation of the Nash equilibrium of the population game. In a single iteration, the initial population is called the "parent", denoted as $q(t)$. The population that completes strategy adjustment is called the "offspring", denoted as $q(t+1)$. Between the parent and the offspring there is a transitional generation, called the pending generation, denoted as $q(t')$. It is crucial that the overall strategy distribution of the pending generation is the same as that of the parent, i.e., $x_s(t') = x_s(t)$, with pending players corresponding one-to-one with their parent players.
During each iteration, for one of the populations $p$, a proportion $\alpha$ of pending players chooses to adjust their strategies, while the remaining pending players keep their original strategy. Björnerstedt and Weibull [15] interpret this as an introspective phenomenon, where certain players in the population actively imitate and learn from others. Players who adjust their strategies are called "introspective players", whereas those who keep their original strategies are called "non-introspective players". In the game, following the principle of random matching, all players choose a pure strategy. This process can be illustrated using the game model presented in [3] as an example.
The model is based on a game framework featuring a virtual population 2, as proposed by Fudenberg and Levine [3]. The social learning theory for strategy updating is as follows: $x_1(t)$ denotes the proportion of parents in population 1 who choose strategy U, $x_2(t)$ denotes the proportion of parents in population 1 who choose strategy D, $y_1(t)$ denotes the proportion of parents in population 2 who choose strategy L, and $y_2(t)$ denotes the proportion of parents in population 2 who choose strategy R. For population 1, the proportion that chooses strategy U directly, without introspection, is $(1 - \alpha)x_1(t)$. According to the social learning theory, an introspective player's strategy remains unchanged if it is consistent with the parent's strategy, and the corresponding proportion is $\alpha x_1(t)^2$.
When a player's strategy does not match the parent's strategy, the players are divided into two small populations according to the opponents they encounter, and the player imitates the strategy of the small population with the highest expected benefit. For example, when a player encounters an opponent who chooses strategy L, he will choose strategy U, and the proportion is $2\alpha y_1(t)x_1(t)x_2(t)$. Similarly, if he encounters an opponent who chooses strategy R, he will choose strategy D, and the proportion is $2\alpha y_2(t)x_1(t)x_2(t)$.
Summing the three contributions to strategy U, the proportional update formula for the offspring of population 1 selecting strategy U is obtained as:
$$x_1(t+1) = (1 - \alpha)\,x_1(t) + \alpha\,x_1(t)^2 + 2\alpha\,y_1(t)\,x_1(t)\,x_2(t). \tag{2.1}$$
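As a small worked sketch of this update, the three proportions described above (non-introspective players already using U, introspective players whose strategy matches the parent's, and mismatched introspective players who meet an opponent playing L) can be summed in code. The function name is hypothetical; the arithmetic only restates the proportions given in the text.

```python
def social_learning_step(x1, y1, alpha):
    """One social-learning update of the share of population 1 playing U."""
    x2 = 1.0 - x1
    keep = (1.0 - alpha) * x1              # non-introspective players on U
    same = alpha * x1 ** 2                 # introspective, strategy matches the parent's
    switch = 2.0 * alpha * y1 * x1 * x2    # introspective, mismatched, opponent plays L
    # Algebraically, keep + same + switch = x1 + alpha * x1 * x2 * (2 * y1 - 1).
    return keep + same + switch


print(social_learning_step(0.3, 1.0, 0.1))  # share of U grows when every opponent plays L
print(social_learning_step(0.3, 0.5, 0.1))  # share of U is unchanged when y1 = 1/2
```

The simplified form in the comment shows that the share of U increases exactly when more than half of population 2 plays L.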

Population imitation theory
For the population game model $\{\Gamma, X, F\}$, Schlag [14] presents an alternative perspective, asserting that each player can observe the strategies and expected benefits of others within their population. This perspective eliminates the notion of parent and offspring from social learning theory. Schlag argues that the emergence of Nash equilibrium in population games hinges solely on the phenomenon of imitation. The rules of imitation, as defined by Schlag, are as follows:
1) Follow imitative behavior, i.e., change behavior exclusively by imitating others.
2) Never imitate someone who performs worse than you do.
Rule 2) means that players will only imitate those who exhibit superior expected benefits. This implies that players evaluate their strategies against the expected benefits of other players and imitate players with superior expected benefits to improve their own benefit. The essence of the imitation rule is that players adjust their strategies based on the strategies and benefits of other players.
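The two rules can be sketched as a single decision function. This is an illustrative reading rather than Schlag's formal rule: the function name is hypothetical, and keeping one's own strategy on a tie is an assumption, since rule 2) only forbids imitating a worse performer.

```python
def imitate(own_strategy, own_benefit, observed_strategy, observed_benefit):
    """Strategy held after observing one other player of the same population."""
    # Rule 1: the only way to change strategy is to copy the observed player.
    # Rule 2: never copy a player who performs worse (ties keep the own
    # strategy; this tie-breaking is an assumption of the sketch).
    if observed_benefit > own_benefit:
        return observed_strategy
    return own_strategy


print(imitate(0, 1.0, 1, 2.0))  # the observed player does better, so it is imitated
print(imitate(0, 2.0, 1, 1.0))  # the observed player does worse, so nothing changes
```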

The idea of the PGPSO algorithm
In 1995, inspired by the regularity of birds' flock feeding behavior, Kennedy and Eberhart [4,5] developed a simplified algorithm model, which later evolved into the particle swarm optimization (PSO) algorithm through subsequent enhancements. The idea of the PSO algorithm originated from studying the birds' flock feeding behavior, where the birds share information collectively so that the flock can find the optimal destination. In the PSO algorithm, a feasible solution of an optimization problem can be considered as a point in the $d$-dimensional search space. Let the position of the $i$-th particle be denoted as $l_i = (l_{i1}, l_{i2}, \ldots, l_{id})$, and let the best position it has experienced be denoted as $p_i = (p_{i1}, p_{i2}, \ldots, p_{id})$, also known as $p_{best}$. The index number of the best position experienced by all particles is denoted by the symbol $g_{best}$. The velocity of the $i$-th particle is denoted as $v_i = (v_{i1}, v_{i2}, \ldots, v_{id})$. For each iteration, the velocity and position of the particle change according to the following:
$$v_{id}^{k+1} = w\,v_{id}^{k} + c_1 r_1 \left(p_{id} - l_{id}^{k}\right) + c_2 r_2 \left(p_{gd} - l_{id}^{k}\right), \tag{3.1}$$
$$l_{id}^{k+1} = l_{id}^{k} + v_{id}^{k+1}, \tag{3.2}$$
where $c_1, c_2$ are learning factors; $r_1, r_2$ are random numbers varying within $(0, 1)$; $w$ is the inertia weight; $k$ is the number of iterations; and $p_{gd}$ is the $d$-th component of the best position experienced by all particles, i.e., of the particle with index $g_{best}$.
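For readers who want to see the velocity and position update in action, here is a minimal sketch of the standard inertia-weight PSO applied to a toy minimization problem. The objective (a sphere function), the parameter values, the search bounds, and all names are assumptions made for illustration; none of them come from the paper.

```python
import random


def pso(f, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5,
        lo=-5.0, hi=5.0, seed=0):
    """Minimize f with a basic inertia-weight PSO; returns best point, value, history."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # best position seen by each particle
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # best position seen by the swarm
    history = [gbest_val]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # velocity update: inertia + pull to own best + pull to swarm best
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # position update
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
        history.append(gbest_val)
    return gbest, gbest_val, history


def sphere(p):
    """Toy objective with its minimum at the origin."""
    return sum(c * c for c in p)


best, best_val, history = pso(sphere, dim=2)
print(best_val)
```

The swarm-best value stored in `history` never increases, which is the collective optimization behavior that PGPSO later replaces with imitation of the best-performing player.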
The PGPSO algorithm builds upon the framework of the PSO algorithm to realize Nash equilibrium and find its realization path by introducing social learning and population imitation theory into the PSO algorithm. In the PGPSO algorithm, each Nash equilibrium represents a solution to the problem, and each player in the population is considered a particle of the PSO algorithm. Players with different strategies reflect the differences in position among particles, and different benefits reflect the differences in expected benefits among particles; together these constitute particle diversity.
According to population imitation theory, in the population updating process, particles (players) with high expected benefits are imitated. However, if all players with low benefits change their strategies by imitation in the first iteration, the algorithm stops immediately, and the Nash equilibrium loses the opportunity to be learned. Therefore, this paper introduces the introspection rate into the PSO algorithm. This serves two purposes. First, only some particles (players) choose to introspect, which effectively maintains the diversity of particles (players) in each iteration. Second, the introspection rate captures the lag in strategy updating by players in an actual game.
During each iteration, the player chooses a pure strategy, i.e., the position of the particle. Given the nature of the two-population game, the PGPSO algorithm accommodates two distinct particle populations. The benefit matrices for a two-population game are defined as follows:
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}.$$
At the $k$-th iteration, the expected benefit of the $i$-th player in population 1 is calculated as per references [33,34]:
$$F^1(x_i) = x_i\left[a_{11}y_1 + a_{12}(1 - y_1)\right] + (1 - x_i)\left[a_{21}y_1 + a_{22}(1 - y_1)\right], \tag{3.3}$$
where $x_i$ denotes the strategy choice of the $i$-th player in population 1, $x_i \in \{0, 1\}$. If $x_i = 1$, the player has chosen strategy 1; if $x_i = 0$, the player has chosen strategy 2. $y_1$ indicates the proportion of players in population 2 who choose strategy 1.
The expected benefit of the $j$-th player in population 2 is
$$F^2(y_j) = y_j\left[b_{11}x_1 + b_{21}(1 - x_1)\right] + (1 - y_j)\left[b_{12}x_1 + b_{22}(1 - x_1)\right], \tag{3.4}$$
where $y_j$ denotes the strategy choice of the $j$-th player in population 2, $y_j \in \{0, 1\}$. If $y_j = 1$, the player has chosen strategy 1; if $y_j = 0$, the player has chosen strategy 2. $x_1$ denotes the proportion of players in population 1 who choose strategy 1.
The Nash equilibrium of the population game represents a state where all players maximize their benefits. The players are selfish and aim to maximize their benefits, so the benefits defined in Eqs (3.3) and (3.4) are obtained based on that consideration. Since the benefits of players in the population who choose the same strategy are indistinguishable, determining the benefits corresponding to each strategy is treated as an optimization problem. In this case, solving for the Nash equilibrium is equivalent to finding the optimal solution of this optimization problem. The key difference is that an ordinary optimization problem is a single-player optimization problem, whereas the game is a multi-player optimization problem; this is the essential distinction between the two.
In the iterative process of particles, population imitation theory guides introspective players to adopt the strategy associated with the highest expected benefit. In the $k$-th iteration of a population, the particle with the highest expected benefit is determined by Eqs (3.3) and (3.4) and denoted as $i^k_{best}$; each introspective particle changes its strategy to that of $i^k_{best}$, while each non-introspective particle keeps its strategy unchanged. At the $k$-th iteration, let the set of all particles' ordinal numbers be $I^k$, and let the set of introspective players' ordinal numbers be $I^k_\alpha$. Taking Eqs (3.1) and (3.2) as the basis, the PGPSO iteration function (Eqs (3.5) and (3.6)) assigns the strategy of $i^k_{best}$ to every particle in $I^k_\alpha$ and leaves the strategy of every other particle in $I^k$ unchanged. This iterative formulation borrows the form of Eqs (3.1) and (3.2). However, the idea is derived from the theory of population imitation and social learning, where players with relatively lower benefits adopt strategies observed from the players with the highest benefit.

Algorithm construction
The implementation steps of the PGPSO algorithm are as follows.
(a) The PGPSO algorithm initializes the parameter values. These include the introspection rates $\alpha$, $\beta$, the lower bound of the search space popmin, the upper bound of the search space popmax, the population sizes $m$, $n$, and the number of iterations genmax.
(b) Two populations are created, containing $m$ and $n$ particles, respectively. The algorithm randomly generates the strategy $x_i$ for each of the $m$ particles of population 1, where $x_i = 0$ or $1$. Then the algorithm randomly generates the strategy $y_j$ for each of the $n$ particles of population 2, where $y_j = 0$ or $1$.
(c) The expected benefits of all particles in both populations are computed using Eqs (3.3) and (3.4), and the particles with the highest expected benefit are found, denoted as $i_{best1}$ and $i_{best2}$, respectively.
(d) In each population, the introspective particles (a proportion $\alpha$ in population 1 and $\beta$ in population 2) change their strategies to those of $i_{best1}$ and $i_{best2}$, respectively, while the non-introspective particles keep their strategies unchanged.
(e) Whether to end the iteration is determined according to the number of iterations genmax. If the iteration ends, the algorithm outputs the two populations' optimal benefits and particle position figures. Otherwise, the algorithm returns to (c).
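The steps above can be sketched as a short simulation. This is a hedged illustration rather than the authors' implementation. It assumes the bimatrix convention `A[s][r]`, `B[s][r]` (population 1 plays `s`, population 2 plays `r`, index 0 = strategy 1), draws the introspective players uniformly at random in every iteration, and lets them copy the strategy of the best-performing particle currently present. The function name, the default parameters, and the prisoner's dilemma matrices (picked so that the equilibrium benefit is -5) are all hypothetical.

```python
import random


def pgpso(A, B, m=48, n=48, alpha=0.1, beta=0.1, genmax=50,
          x0=0.5, y0=0.5, seed=0):
    """Return the path of (x1, y1): shares of each population on strategy 1."""
    rng = random.Random(seed)
    # (b) random pure strategies: 1 means strategy 1, 0 means strategy 2
    xs = [1 if rng.random() < x0 else 0 for _ in range(m)]
    ys = [1 if rng.random() < y0 else 0 for _ in range(n)]
    path = [(sum(xs) / m, sum(ys) / n)]
    for _ in range(genmax):
        x1, y1 = path[-1]
        # (c) expected benefit of each pure strategy against the other population
        f1 = {1: A[0][0] * y1 + A[0][1] * (1 - y1),
              0: A[1][0] * y1 + A[1][1] * (1 - y1)}
        f2 = {1: B[0][0] * x1 + B[1][0] * (1 - x1),
              0: B[0][1] * x1 + B[1][1] * (1 - x1)}
        best1 = max(xs, key=lambda s: f1[s])   # strategy of the best particle present
        best2 = max(ys, key=lambda s: f2[s])
        # (d) a random share alpha (beta) of players introspects and imitates
        for i in rng.sample(range(m), round(alpha * m)):
            xs[i] = best1
        for j in rng.sample(range(n), round(beta * n)):
            ys[j] = best2
        path.append((sum(xs) / m, sum(ys) / n))
    # (e) stop after genmax iterations
    return path


# Hypothetical prisoner's dilemma benefits; strategy 1 is the dominant strategy.
A_pd = [[-5, 0], [-10, -1]]
B_pd = [[-5, -10], [0, -1]]
path = pgpso(A_pd, B_pd, genmax=200, seed=1)
print(path[0], path[-1])
```

Because only strategies still present in a population can be imitated, a strategy that dies out cannot reappear in this sketch, so the initial state should contain both strategies.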

PGPSO algorithm update formula
In the PGPSO algorithm, the update rule of players' strategies is as follows: in population 1, a proportion $\alpha$ of the players choose to adjust their strategy, and the remaining players do not; in population 2, a proportion $\beta$ of the players choose to adjust their strategy, and the remaining players do not. The players who adjust their strategies can observe the expected benefits of the players in their own population, i.e., the benefits given by Eqs (3.3) and (3.4). Subsequently, these players apply population imitation theory to select an optimal strategy. Taking Eq (2.1) as the basis, this adjustment yields the update formula (4.1) for the proportions of the two populations choosing strategy 1, where $x_1(t+1)$ and $y_1(t+1)$ are the proportions of players choosing strategy 1 at iteration $t+1$ in populations 1 and 2, respectively, with $x_1(t+1), y_1(t+1) \in [0, 1]$. If $x_1(t+1)$ and $y_1(t+1)$ are both 0, both populations choose strategy 2; according to the benefit matrices $A$ and $B$, all players in the first population then receive the benefit $a_{22}$, and all players in the second population receive the benefit $b_{22}$.
The updating Eq (4.1) essentially represents a discretized form of the differential Eqs (3.5) and (3.6). Compared with Eq (2.1) in social learning theory, Eq (4.1) differs in two respects. First, Eq (4.1) can be applied to any two-population, two-strategy game model. Second, the social learning rule relies on the concepts of a parent, a pending generation, and an offspring, and the parent influences strategy updating; Eq (4.1) removes the concepts of a parent and a pending generation, thereby reducing the dependency on heritability as a condition for strategy adaptation.

PGPSO algorithm convergence analysis
The benefit matrices $A$ and $B$ of the two-population game simplify to matrices with parameters $a_1, a_2, b_1, b_2$, where $a_1 \neq 0$, $a_2 \neq 0$, $b_1 \neq 0$, $b_2 \neq 0$.
Theorem 4.1. In a non-cooperative repeated two-population and two-strategy game with benefit matrices $A$ and $B$, when each population updates its strategy following Eq (4.1), the convergence outcome corresponds to either the Nash equilibrium or its stable limit ring.
Proof. The updating Eq (4.1) is transformed to continuous time, and the imitation dynamic Eq (4.2) is obtained using the difference method. The two-population game is classified into three types according to its equilibria: one pure strategy Nash equilibrium, one mixed strategy Nash equilibrium, and three Nash equilibria (two pure strategy Nash equilibria and one mixed strategy Nash equilibrium). Next, we discuss each type in turn.
For the second type, Nash equilibrium ( x, ȳ) For the Eq (4.1), we get: , the imitation dynamic Eq (4.2) will converge to Nash equilibrium ( a 1 +a 2 , then we can obtain that the Eq (4.2) is a differential equation with zero solutions.
1) When $0 < \theta < \frac{\pi}{2}$, we get $x_1(t), y_1(t) > 0$. Write
$$r^*(\theta) = \frac{-\alpha b' \cos\theta + \beta(1 - a')\sin\theta}{\alpha\cos^2\theta + \beta\sin^2\theta}.$$
From Eq (4.4), it follows that when $r = r^*(\theta)$, $\dot{r} = 0$, i.e., there is a special solution. The solution is a curve in the phase plane centered at the origin. When $r > r^*(\theta)$, we get $\dot{r} < 0$ from Eq (4.4); that is, the trajectory converges to the curve from outside the curve. When $r < r^*(\theta)$, we get $\dot{r} > 0$ from Eq (4.4); that is, the trajectory converges to the curve from inside the curve. Therefore the system has a stable limit ring centered at the origin for $0 < \theta < \frac{\pi}{2}$. A stable limit ring (usually called a stable limit cycle) is an isolated periodic orbit that neighboring trajectories approach. When the solution trajectory evolves from a point in the solution space, it converges to this limit ring and makes a periodic movement on it.
After the above analysis, the proof for the second type is complete.
If and only if According to the above analysis, the solution trajectory of the Eq (4.2) starts from any position in the solution space.It will converge to an element of the Nash equilibrium set E(F).
For the remaining case of the Nash equilibrium set, the analysis is similar. Eq (4.2) converges to Nash equilibrium, which completes the proof.
Example 4.1. Take the prisoner's dilemma game with given benefit matrices $A$ and $B$. We set the introspection rates $\alpha = 0.1$, $\beta = 0.2$; the horizontal or vertical separation of the initial positions is 0.2, and the arrows represent the direction of the solution trajectory. The solution trajectory is shown in Figure 1.
Figure 1. The solution trajectory of Eq (4.2) for the prisoner's dilemma game.
Example 4.2. Take the coin-flip game with given benefit matrices $A$ and $B$. We set the introspection rates $\alpha = 0.1$, $\beta = 0.2$; the horizontal or vertical separation of the initial positions is 0.25, and the arrows represent the direction of the solution trajectory. The solution trajectory is shown in Figure 2. From Figure 2, we can see that when the initial point is (0.5, 0.5), the solution trajectory stays at (0.5, 0.5), i.e., at the Nash equilibrium $(\bar{x}, \bar{y}) = ((\frac{1}{2}, \frac{1}{2}), (\frac{1}{2}, \frac{1}{2}))$; when the initial point is other than (0.5, 0.5), the solution trajectory evolves counterclockwise around (0.5, 0.5) and converges to the limit ring centered at (0.5, 0.5), i.e., a stable limit ring centered at the Nash equilibrium $(\bar{x}, \bar{y}) = ((\frac{1}{2}, \frac{1}{2}), (\frac{1}{2}, \frac{1}{2}))$.
Example 4.3. Take the coordination game with given benefit matrices $A$ and $B$. We set the introspection rates $\alpha = 0.1$, $\beta = 0.2$; the horizontal or vertical separation of the initial positions is $\frac{1}{6}$, and the arrows represent the direction of the solution trajectory. The solution trajectory is shown in Figure 3.
Three games are tested in the simulations: the prisoner's dilemma game has one pure strategy Nash equilibrium, the coin-flip game has one mixed strategy Nash equilibrium, and the hawk-dove game has three Nash equilibria, i.e., two pure strategy Nash equilibria and one mixed strategy Nash equilibrium.

Test 1: Prisoner's dilemma game
The prisoner's dilemma game, typically a two-person game, has been employed by Wang [38] to corroborate the presence of the "baiting effect" within social populations. Notably, the benefit matrices characteristic of the inter-player game can be applied directly to the population game model. Therefore, this game model can be used as an example of the population game. The prisoner's dilemma game has only one pure-strategy Nash equilibrium, in which both populations choose strategy 1, i.e., $(\bar{x}, \bar{y}) = ((1, 0), (1, 0))$. In the PGPSO algorithm for the prisoner's dilemma game, the two populations use equal introspection rates, $\alpha = \beta$.
From Figure 4(a),(b), we can see that all the particles of population 1 converge to $x_1 = 1$, i.e., all players of population 1 choose strategy 1, and the best benefit of population 1 converges to $-5$. All particles of population 2 converge to $y_1 = 1$, i.e., all players of population 2 choose strategy 1, and the best benefit of population 2 converges to $-5$. This outcome is consistent with the Nash equilibrium $((1, 0), (1, 0))$ of the prisoner's dilemma game. Therefore, the PGPSO algorithm accurately finds the particle positions and benefits corresponding to the Nash equilibrium strategy, and it records the complete path to the Nash equilibrium of the prisoner's dilemma game.

Test 2: Coin-flip game
Consider the coin-flip game as a population game, where two populations play against each other according to the benefit matrices of the coin-flip game. The coin-flip game has a mixed-strategy Nash equilibrium, i.e., $(\bar{x}, \bar{y}) = ((\frac{1}{2}, \frac{1}{2}), (\frac{1}{2}, \frac{1}{2}))$. In the PGPSO algorithm for the coin-flip game, the two populations use equal introspection rates, $\alpha = \beta$.
From Figure 5(a),(b), we can see that all particles of population 1 converge cyclically to $x_1 = \frac{1}{2}$, i.e., nearly half of the players in population 1 choose strategy 1.
Figure 6. The figures of two populations in the hawk-dove game. Panel (f): the optimal benefit, $((\frac{2}{3}, \frac{1}{3}), (\frac{2}{3}, \frac{1}{3}))$.

Effect of introspection rate on Nash equilibrium realization
To test the sensitivity of the PGPSO algorithm to the introspection rate in the three games, we set the introspection rates $\alpha = \beta$, the population sizes $m = n = 48$, the search space range from popmin = 0 to popmax = 1, and the number of iterations genmax = 20. The initial states of the prisoner's dilemma and the hawk-dove game are $x_1(0) = y_1(0) = 0.5$, and the initial states of the coin-flip game are $x_1(0) = y_1(0) = 0.4$. The position figures of the two populations are shown in Figure 7. From Figure 7(a),(c), it can be seen that in the prisoner's dilemma and the hawk-dove game, for both populations, the number of iterations needed to converge to the Nash equilibrium decreases as the introspection rate $\alpha$ increases, i.e., increasing the introspection rate $\alpha$ speeds up the realization of the Nash equilibrium. From Figure 7(b), it can be seen that in the coin-flip game, the magnitude of the cyclic convergence increases with the introspection rate $\alpha$, which means that increasing the introspection rate $\alpha$ expands the range of the cyclic fluctuations.

Effect of initial state on Nash equilibrium realization
To test the sensitivity of the PGPSO algorithm to the initial state in the three games, we fix the introspection rate and set the population sizes $m = n = 48$, the search space range from popmin = 0 to popmax = 1, and the number of iterations genmax = 50. For the prisoner's dilemma and the hawk-dove game, the initial state of population 1 ranges over 0.1-0.9 in steps of 0.1. The initial state of population 2 is chosen by the same rule. Combining the two populations gives a total of 81 parameter configurations. For each parameter configuration, a series of 50 experiments is conducted to observe the system's steady state. Two initial states, (0.1, 0.1) and (0.9, 0.9), are selected for the coin-flip game. The location figures of the two populations are shown in Figure 8.
From Figure 8(a), it can be seen that in the prisoner's dilemma game, starting from any initial state, the two populations converge to the Nash equilibrium $((1, 0), (1, 0))$, which means that changing the initial state does not affect the Nash equilibrium realization. From Figure 8(b), it can be seen that in the coin-flip game, starting from either of the two initial states, the two populations converge cyclically to the mixed strategy Nash equilibrium $((\frac{1}{2}, \frac{1}{2}), (\frac{1}{2}, \frac{1}{2}))$, so changing the initial state again does not affect the Nash equilibrium realization. From Figure 8(c), it can be seen that in the hawk-dove game, the two populations converge to different Nash equilibria from different initial states. Specifically, when the initial state is on the right side of the diagonal $x = y$, the two populations converge to the Nash equilibrium $((1, 0), (0, 1))$. Conversely, when the initial state is on the left side of this diagonal, the two populations converge to the Nash equilibrium $((0, 1), (1, 0))$. Subtle differences emerge for positions along the diagonal. Specifically, for the first five positions, the populations converge to $((0, 1), (1, 0))$, while the last four positions lead to $((0, 1), (1, 0))$. These results show that the initial state affects the Nash equilibrium realization.
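The grid of initial states described above is easy to enumerate, which also makes the count of 81 configurations and the nine diagonal positions explicit. This is a small sketch; the function name is hypothetical.

```python
from itertools import product


def initial_state_grid(start=1, stop=9, step=0.1):
    """All (x1(0), y1(0)) pairs with both coordinates in 0.1, 0.2, ..., 0.9."""
    levels = [round(step * k, 1) for k in range(start, stop + 1)]
    return list(product(levels, levels))


configs = initial_state_grid()
diagonal = [(a, b) for a, b in configs if a == b]
print(len(configs), len(diagonal))
```

Each pair in `configs` would then be run repeatedly (50 times in the experiment described above) to observe the steady state reached from that initial state.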

Comparison with Meta Equilibrium Q-learning algorithm
Taking the welfare game in [31] as an example, for finding its mixed-strategy Nash equilibrium realization path, we compare the difference between the PGPSO algorithm and the Meta Equilibrium Q-learning algorithm.
Example 5.1.This game model has a mixed-strategy Nash equilibrium, i.e., ( x, ȳ) = (( 1 2 , 1 2 ), ( 1 4 , 3 4 )), and the followings are the benefit matrices and expected benefits of the welfare game: The Meta Equilibrium Q-learning algorithm represents an enhancement over the Nash Q-learning algorithm.The rationale behind this improvement stems from the Nash Q-learning algorithm's limitation C specifically, its inability to devise a pathway to realize a mixed-strategy Nash equilibrium when each player opts for a pure strategy.In order to address this problem, the Meta Equilibrium Q-learning algorithm transforms the welfare game into a meta-game.It uses the pure strategy meta-equilibrium to represent the mixed-strategy Nash equilibrium in the welfare game.The meta-equilibrium's realization path replaces the mixed strategy's realization path.Thus, we can obtain the path to find the mixed-strategy Nash equilibrium.However, it is important to note that this transformation comes at the expense of increased complexity in locating the realization path for the mixed-strategy Nash equilibrium.
In the PGPSO algorithm of the welfare game, we set the introspection rate $\alpha = \beta = \begin{cases} \frac{1}{12}, & k \le 25 \\ \frac{1}{24}, & k > 25 \end{cases}$, the population size m = n = 48, the search space range from popmin = 0 to popmax = 1, and the number of iterations genmax = 50. The computer system randomly selects the initial states of the populations. The position figures of the two populations are shown in Figure 9.

Figure 9(a) shows that after 25 iterations, all particles of population 1 converge cyclically to $x_1 = \frac{1}{2}$ and all particles of population 2 converge cyclically to $y_1 = \frac{1}{4}$. The difference between the maximum and minimum values of the cyclic convergence is 0.0625. Figure 9(b) shows that when the initial strategy distribution of the two populations is the Nash equilibrium, all particles of population 1 converge to $x_1 = \frac{1}{2}$ and all particles of population 2 converge to $y_1 = \frac{1}{4}$. This outcome is consistent with the Nash equilibrium of the welfare game. Therefore, for players who choose a pure strategy in the welfare game, the PGPSO algorithm converges to the mixed-strategy Nash equilibrium and finds the path that realizes the equilibrium.
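The piecewise introspection rate used in this experiment (1/12 up to iteration 25, 1/24 afterwards) can be written as a small schedule function. This is a minimal sketch under the assumption that k is the iteration counter and starts at 1; the names `introspection_rate`, `GENMAX`, and `SWITCH_ITERATION` are hypothetical.

```python
from fractions import Fraction

GENMAX = 50            # number of iterations (from the text)
SWITCH_ITERATION = 25  # last iteration that uses the larger rate


def introspection_rate(k):
    """Introspection rate alpha = beta at iteration k (assumed 1-based)."""
    if k <= SWITCH_ITERATION:
        return Fraction(1, 12)
    return Fraction(1, 24)


# Rate applied at each of the 50 iterations.
schedule = [introspection_rate(k) for k in range(1, GENMAX + 1)]
print(schedule[24], schedule[25])
```

Halving the rate in the second half of the run fits the earlier observation that a smaller introspection rate narrows the cyclic fluctuation around a mixed-strategy equilibrium.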
To find the realization path of the mixed-strategy Nash equilibrium in the welfare game, the Meta Equilibrium Q-learning algorithm transforms the welfare game into a meta-game, and the realization path of the meta-equilibrium replaces the realization path of the mixed strategy. In contrast, Figure 9(a) shows that the PGPSO algorithm converges to the stable limit ring centered on the mixed-strategy Nash equilibrium. The PGPSO algorithm thus finds the realization path directly, which reduces complexity by eliminating the transformation of the welfare game into a meta-game.

Conclusions
Nash equilibria of populations are common in human societies and biological populations, and their realization is crucial from the perspective of individual players within a population. In a population of finitely many players, players can optimize their strategies by imitating other players with higher benefits, depending on their knowledge of their strategic environment. So far, the rule of imitation has been widely studied in different game models. In this paper, we combine social learning theory and population imitation theory, develop the PGPSO algorithm, and apply it to the Nash equilibrium realization of three two-population game models.
The motivation for studying the imitation learning rule is to explore whether such rules can realize the Nash equilibrium of population games. Specifically, the new learning rule is transformed into a swarm intelligence algorithm, which is used to simulate the behavioral dynamics of the players in the game. For the iterative formulation of the PGPSO algorithm, the convergence analysis is performed from the perspective of differential equations. The result is that the solution trajectory of the differential equations converges completely to the pure-strategy Nash equilibrium. The solution trajectory converges completely to the mixed-strategy Nash equilibrium when the initial position is the mixed-strategy Nash equilibrium. Also, in the coin-flip game, the mixed-strategy Nash equilibrium is the center of a stable limit ring of the differential equation. When the initial position is not at the mixed-strategy Nash equilibrium, all initial points converge to this limit ring.
Using the PGPSO algorithm, we simulate the Nash equilibrium realization process for three two-population games. The simulation outcomes demonstrate that the PGPSO algorithm successfully realizes the Nash equilibrium and clearly shows the path by which it is reached. According to the analysis of the effect of the introspection rate and the initial state on the realization of the Nash equilibrium, increasing the introspection rate accelerates the realization of a pure-strategy Nash equilibrium. However, it also widens the range of the cyclic fluctuations around a mixed-strategy Nash equilibrium. Changing the initial state does not affect the Nash equilibrium realization in the prisoner's dilemma and the coin-flip game, but it causes the hawk-dove game to converge to different Nash equilibria.

Figure 1. Solution trajectories of the imitation dynamic Eq (4.2) for the prisoner's dilemma game.

Figure 2. Solution trajectory of the imitation dynamic Eq (4.2) for the coin-flip game.

$\begin{cases} \frac{1}{12}, & k \le 25 \\ \frac{1}{24}, & k > 25 \end{cases}$, the population size m = n = 48, the search space range from popmin = 0 to popmax = 1, and the number of iterations genmax = 50. The initial states of the populations are chosen randomly by the system. The two populations' optimal benefit and location figures are shown in Figure 4.

Figure 4. The figures of two populations in the prisoner's dilemma game.

Figure 7. The relationship between Nash equilibrium realization and introspection rate for the three games. The solid line represents population 1, and the dashed line represents population 2.

Figure 8. The relationship between Nash equilibrium realization and initial state for the three games. The solid line represents population 1, and the dashed line represents population 2.

$\begin{cases} \frac{1}{12}, & k \le 25 \\ \frac{1}{24}, & k > 25 \end{cases}$, the population size m = n = 48, the search space range from popmin = 0 to popmax = 1, and the number of iterations genmax = 50. The computer system randomly selects the initial states of the populations. The position figures of the two populations are shown in Figure 9.

Figure 9. The figures of two populations in the welfare game.