Evolutionary dynamics of networked multi-person games: mixing opponent-aware and opponent-independent strategy decisions

How rational individuals make strategic decisions when confronted with the temptation of defection is a longstanding conundrum. In particular, little is known about the evolutionary dynamics of networked multi-person games in a heterogeneous environment incorporating multiple decision rules. To address this issue, we propose an original theoretical framework to investigate the hybrid dynamics of mixed opponent-aware and opponent-independent strategy decisions. We equip each agent with an individualized decision-making function, by which decision-makers can not only select the information type but also process it individually. Under weak selection, we theoretically derive a condition under which one strategy wins over the other, and we demonstrate that an analogous criterion also holds in a mutation-selection process. For a hybrid system of two decision-making functions, we specifically prove that this condition is robust against population structures. Beyond weak selection, however, we find that the co-evolutionary dynamics induced by strategy adoption and decision-rule imitation are sensitive to changes in population structure. Our work thus elucidates how the diversity and heterogeneity of individual decision-making affect the fate of strategy evolution, which may offer insights into the evolution of cooperation.


Introduction
Understanding how the cooperative behaviours that exist extensively in human societies and biological organisms emerge and persist is a longstanding conundrum across various fields [1,2]. In particular, this problem has been widely accepted as a physics research agenda [3-5]. As a general mathematical framework, evolutionary game theory offers a promising avenue to examine and interpret the mechanisms for the evolution of cooperation at different levels of analysis [6-8]. Since the long-run behaviour of cooperation is sensitively influenced by microscopic evolutionary details, a problem of paramount importance is how individuals rationally make strategic decisions when faced with the temptation of defection [3,4].
Traditionally, one of the most classical modes of strategic decision-making is the powerful tit-for-tat rule (start with cooperation and then do whatever the opponent did in the previous round), which achieved great success in Axelrod's tournaments [1]. Soon afterwards, however, it was outperformed by a self-learning rule, win-stay-lose-shift [9]. Motivated by these works, a large body of rules has been proposed since then, such as imitation [10,11], best-response [12], aspiration [13-16], and Q-learning [17], to name a few. Typically, the probability that an individual chooses a certain strategy is closely related to two kinds of information: the personal information of the decision-maker, and the information from its opponents. When both are employed as the basis of decisions, as in tit-for-tat, imitation, and best-response, we call the rule the opponent-aware type. When only the decision-maker's own information is used and the information of opponents is not, as in win-stay-lose-shift, aspiration, and Q-learning, we call the rule the opponent-independent type. Accordingly, we mainly consider two representative categories. One is the opponent-aware type, where individuals update strategies depending on the payoffs and traits of both opponents and themselves. The other is the opponent-independent type, where only personal payoffs and traits are used by the decision-maker. Therefore, without loss of generality, we denote the payoff information used by decision-makers as Π_l, where the subscript l denotes the type, l=1 for the opponent-aware and l=2 for the opponent-independent. In this way, the decision-making function equipped by each player i (i=1, 2, ..., N) can be denoted by φ_i(β, Π_l), where β represents the intensity of selection, with weak selection corresponding to β → 0 [45,46].
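To make the two information types concrete, the following sketch implements them together with a Fermi-type decision function. The helper names are ours and the Fermi form is only one admissible choice of φ_i(β, Π_l); the model allows any function satisfying the conditions stated below.

```python
import math

def info_opponent_aware(pi_self, pi_opp):
    """Pi_1: payoff comparison with an opponent (e.g. proportional imitation)."""
    return pi_opp - pi_self

def info_opponent_independent(pi_self, aspiration):
    """Pi_2: comparison with one's own aspiration level e_i (no opponent info)."""
    return aspiration - pi_self

def decision_function(beta, pi_l):
    """A Fermi-type phi_i(beta, Pi_l): a switching probability strictly in (0, 1)
    that reduces to a constant (1/2) when beta = 0, as the model requires."""
    return 1.0 / (1.0 + math.exp(-beta * pi_l))
```

Note that decision_function(0, Π) = 1/2 regardless of Π, matching the assumption that the update is payoff-independent at vanishing selection intensity.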
This function means that, when updating strategies, player i changes its current strategy to the opposite one with probability φ_i(β, Π_l), and otherwise sticks to its current strategy with probability 1 − φ_i(β, Π_l). Roughly speaking, this function provides a unified, individualized framework to characterize the strategic decisions of players. To keep the physical sense clear, we further assume that it is always possible for a decision-maker to change strategies, which means that 0 < φ_i(β, Π_l) < 1 must be satisfied. On the other hand, when the selection intensity vanishes, we assume that the strategy update is independent of the payoff information, i.e. φ_i(0, Π_l) = constant. The process of strategy evolution is modeled by a Markov chain whose finite state space, S = {0, 1, 2, ..., N}, is the number of A players in the population. At each time step, depending on the individualized decision-making function φ_i(β, Π_l), each individual has a chance to switch from its current strategy to the opposite one. Note that the state transition is uniquely determined by the decision-making function. Hence, the transition probability p_{u,v} from state u ∈ S to state v ∈ S can be written down accordingly. In addition, since it is always possible for an individual to change strategies, the Markov chain has no absorbing state and is thus ergodic, with a unique stationary distribution X = (x_s), s ∈ S [47,48]. To obtain this stationary distribution, one needs to find a left eigenvector corresponding to eigenvalue 1 of the transition probability matrix P = [p_{u,v}], that is, X(P − I) = 0. In particular, we are interested in the condition under which strategy A is favoured over strategy B; in other words, the condition under which the average abundance of strategy A in the stationary distribution exceeds that of strategy B, i.e. Σ_s s x_s > N/2. In the following, we primarily focus on the limit of weak selection.
In the limit β → 0, we begin by assuming that φ_i(β, Π_l) is differentiable at β=0. Then, under weak selection, we can expand it as a first-order Taylor series. Similar to the pairwise interaction of two-person games [39,41], we additionally notice that the expected (or accumulated) payoff of each player in multi-person games remains a linear combination of a_j and b_j without constant terms [49-51]. Thus, we generally express the payoff information Π_l used by player i as a linear combination with coefficients a_{ij}^l and b_{ij}^l, plus e_i^l, the personal trait of player i, which does not rely on the payoff obtained from game interactions. For instance, when the strategy update is based on the proportional imitation rule (an opponent-aware type) [4,10,21], the payoff information is Π_l = π_{−i} − π_i, where π_i is the payoff of the focal player i and π_{−i} is the payoff of one of its co-players. In this case, the information subject to personal traits is absent and accordingly e_i^l = 0. However, if the strategy update is based on the self-learning of success [21,23] or aspiration [13-15,24] (an opponent-independent type), Π_l is defined by Π_l = e_i^l − π_i, where e_i^l denotes the aspiration level of the focal player i. Moreover, we assume that the first-order derivative of the decision-making function at β=0 can be expressed as a linear combination of the payoff information, ∂φ_i(β, Π_l)/∂β|_{β=0} = k_i Π_l + c_i, where k_i and c_i are two coefficients determined by the decision-making function φ_i(β, Π_l). In fact, such a condition is not harsh and holds for most strategy update rules, such as the Moran-like rule [45], the pairwise comparison or Fermi rule [3,52], and self-learning with an updating function φ_i(βΠ_l) [24].
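The linearity assumption on the first-order derivative can be checked numerically. For a Fermi-type function one expects ∂φ/∂β|_{β=0} = Π/4, i.e. k_i = 1/4 and c_i = 0; the helper names below are ours, and this is an illustrative check rather than part of the derivation.

```python
import math

def fermi(beta, pi_l):
    """A Fermi-type decision function phi(beta, Pi) = 1 / (1 + e^{-beta * Pi})."""
    return 1.0 / (1.0 + math.exp(-beta * pi_l))

def dphi_dbeta_at_zero(phi, pi_l, h=1e-6):
    """Central-difference estimate of the first derivative in beta at beta = 0."""
    return (phi(h, pi_l) - phi(-h, pi_l)) / (2.0 * h)
```

Evaluating the derivative at two values of Π confirms it scales linearly in Π, consistent with the assumed form k_i Π_l + c_i with c_i = 0.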
Substituting equation (2) into this expression yields the transition probabilities under weak selection. In appendix A, we prove that strategy A is more abundant than strategy B in the stationary distribution if a linear inequality in the coefficients m_i^l and c_i^l holds, where s_i^m and s_i^n are two sets of coefficients that are independent of m_i^l and c_i^l. Inserting m_i^l into this inequality and exchanging the two summation signs leads to condition (5). Furthermore, we notice that the decision-making process is symmetric for the two strategies A and B. That is, exchanging the labels A and B and the corresponding payoff entries does not influence the criterion of strategy dominance, but merely modifies the notation of the payoff table [24,39,42]. By implementing this swap operation on inequality (5), it follows that there is an analogous condition under which strategy B is more abundant than strategy A. Since both strategies cannot be favoured at the same time, the condition for strategy A to be more abundant than strategy B follows. Accordingly, we summarize it in the form of the following theorem.
Theorem 1. Consider a graph-structured population in which a d-person game is played with decision-making functions φ_i(β, Π_l) satisfying the following conditions: (i) the decision-making functions are differentiable at β=0; (ii) the decision process is symmetric for the two strategies A and B. Then, in the limit of weak selection, strategy A is favoured over strategy B if a linear combination of the k_i, without constant terms, is positive; the combination coefficients depend on the individualized decision-making function, the payoff information type, the population structure, and the population size, but not on the entries of the payoff table, a_j and b_j.
In particular, when all individuals adopt a homogeneous type of decision-making function, the k_i s and c_i s are independent of the index i and collectively identical. We therefore denote them by k_0 and c_0, respectively. In this way, the inequality simplifies, and we have the following corollary.
Corollary 1. Consider a graph-structured population where a homogeneous decision-making function φ(β, Π_l) is adopted, and the same conditions as in theorem 1 are satisfied. Then, in the limit of weak selection, strategy A is favoured over strategy B if the product of k_0 and a structural constant is positive, where the constant depends on the population structure, population size, and payoff information type, but not on the decision-making function or on the entries of the payoff table, a_j and b_j.
It is clear that, for a given multi-person matrix game, the condition determining whether one strategy dominates the other is influenced not only by the population properties and the decision-making information, but also by the form of the decision-making function. The former is captured by the coefficients σ_j, whereas the latter is characterized by the constant k_0. Similar to a recent finding in two-person games [44], the constant k_0 controls the direction of strategy selection. When the decision-making function is such that k_0 is positive, the condition is analogous to a multi-person version of the σ-rule [42], and it predicts that strategy A is favoured over strategy B. On the contrary, when k_0 is negative, it reduces to the opposite of the σ-rule and predicts that selection favours B to dominate A. Moreover, we further demonstrate that such an effect still holds in the mutation-selection process, in which the effective payoff function plays a role similar to that of the decision-making function (see supplementary information available online at stacks.iop.org/NJP/21/063013/mmedia).
To shed more light on how evolutionary dynamics are affected by hybrid decision-making information, in what follows we study a binary system in which only two kinds of decision-making functions are available to players. For finite well-mixed and structured populations, we derive the conditions under which one strategy outperforms the other.

Hybrid dynamics for two decision-making functions
In a graph-structured population with N individuals, we consider a minimal model where each player can only choose a decision-making function from φ_1(β, Π_1) and φ_2(β, Π_2). Following the notation above, Π_1 and Π_2 denote the payoff information of the opponent-aware type and the opponent-independent type, respectively. Specifically, for the former, we assume that an individual switches from its current strategy to the opposite one by comparing its current payoff with that of the opponents using the opposite strategy. In contrast, for the latter, we assume that the decision-maker updates its strategy by comparing its current payoff with an aspiration value. If we define Π_l^Y(k_A) as the payoff information that a focal Y player acquires when its d−1 co-players include exactly k_A agents of type A, then for the opponent-aware type (l = 1) the information is the payoff difference between the focal player and an opposite-strategy co-player. In particular, when all co-players use the same strategy as the focal player, we assign a_{−1} = b_0 and b_d = a_{d−1}. For the opponent-independent type (l = 2), the payoff information is the difference between the aspiration value e and the focal payoff.
In addition, we denote the number of players using φ_1(β, Π_1) as their decision-making function by N_1. The number of players choosing φ_2(β, Π_2) is then N_2 = N − N_1. In particular, we denote the A (resp. B) players using φ_l(β, Π_l) by A_l (resp. B_l), and the number of A_l players by n_l (0 ≤ n_l ≤ N_l, l=1,2). In figure 1, we give a diagram illustrating this population configuration, which is similar to a binary hierarchical society [53,54]. At each time step, we randomly choose an individual as the focal player, and d−1 co-players are then selected at random from its neighbors to play a d-person game. Based on the payoff table given in table 1, each player participating in the game interaction gains a payoff. Subsequently, depending on its decision-making function, the focal player decides whether to switch its current strategy to the opposite one.
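One elementary update step of this hybrid model can be sketched as follows. This is a minimal, hypothetical implementation: the function names, the Fermi form of the switching probability, and the handling of the case with no opposite-strategy co-player (the focal player simply keeps its strategy, instead of the a_{−1}=b_0 convention) are our simplifications.

```python
import math
import random

def play_group(strategies, group, payoff_A, payoff_B):
    """Payoff of every group member in one d-person interaction.
    payoff_A[k] / payoff_B[k] is the payoff of an A- / B-player whose
    d-1 co-players contain k A-players (the a_k, b_k of table 1)."""
    payoffs = {}
    for i in group:
        k = sum(strategies[j] for j in group if j != i)  # number of A co-players
        payoffs[i] = payoff_A[k] if strategies[i] == 1 else payoff_B[k]
    return payoffs

def update_step(strategies, func_type, beta, aspiration, payoff_A, payoff_B, d):
    """One time step: pick a focal player, form a random d-person group,
    then let the focal player switch with a Fermi-type probability."""
    N = len(strategies)
    focal = random.randrange(N)
    others = random.sample([i for i in range(N) if i != focal], d - 1)
    group = [focal] + others
    pay = play_group(strategies, group, payoff_A, payoff_B)
    if func_type[focal] == 1:
        # opponent-aware: compare with an opposite-strategy co-player
        rivals = [i for i in others if strategies[i] != strategies[focal]]
        if not rivals:  # simplification: no comparison partner, no update
            return
        info = pay[random.choice(rivals)] - pay[focal]
    else:
        # opponent-independent: compare with the aspiration value e
        info = aspiration - pay[focal]
    if random.random() < 1.0 / (1.0 + math.exp(-beta * info)):
        strategies[focal] = 1 - strategies[focal]
```

The well-mixed case corresponds to sampling co-players from the whole population, as done here; on a graph, `others` would instead be drawn from the focal player's neighbors.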

Well-mixed and structured populations
When the population structure is modeled by a complete graph (well-mixed) or a regular graph of degree d−1 (structured), under weak selection we analytically derive the average abundance of strategy A in the stationary state. Specifically, when the linearity condition on k_l and c_l is satisfied and the population size is sufficiently large, the average abundance of strategy A within the entire population is given by equation (14) (see appendix B for details), where z_l, l=1,2, are the frequencies of players using the decision-making function φ_l(β, Π_l) in the population, and ζ_l, l=1,2, are constants. This result holds in both well-mixed and structured populations. Namely, differences in the spatial geometry of the population structure do not give rise to any variation in the eventual strategy equilibrium, which is in stark contrast to most previous findings [31,36,37]. Based on equation (14), it is then easy to obtain the condition under which strategy A is more abundant than strategy B in the stationary distribution, condition (16), where ω_j is defined by equation (15). This criterion is fully consistent with theorem 1, and the formulation of ω_j remains a linear combination of the k_l, which validates the generality of our results. Although equation (14) indicates that the population structure does not affect the stationary distribution of strategies, the chosen game paradigm and the ratio of the two decision-making functions do matter. In particular, for a given game paradigm, there always exists a critical value z_1* ∈ [0,1] for the proportion of players using φ_1(β, Π_1), above which one strategy can outperform the other. To shed light on this relation, we consider an example of a three-person game whose payoff table is given by matrix (17). At the same time, the probability that the system stays in the current state is one minus the sum of the transition probabilities. By virtue of condition (16), we obtain the critical ratio z_1* as an explicit function of k_1, k_2, and a_0.
By theoretical calculations and computer simulations, figure 2 shows the average abundance of strategy A for different values of the payoff entry a_0 and different pairs of decision-making functions. When the decision-making functions and the payoff entry a_0 are such that the coefficient combination in condition (16) is positive, increasing z_1 monotonically enhances the average abundance of strategy A, and the critical ratio z_1* prescribes a threshold above which strategy A is more abundant than strategy B. Conversely, when this combination is negative, the average abundance of strategy A decreases with increasing z_1, and above the critical ratio z_1* strategy A is less abundant than strategy B. Furthermore, we find that changing a_0 forces the critical ratio z_1* towards a limit value as a_0 → ∞. In particular, for a given a_0, the values of k_l (l=1,2) determined by the decision-making functions set the direction of strategy evolution. In other words, by changing the decision-making functions we can alter the fate of strategy evolution (compare (a) with (c) and (b) with (d) in figure 2). These results are validated by simulations on a complete graph and a ring graph, which are in good agreement with the theoretical calculations.

Co-evolutionary dynamics beyond weak selection
Up to now, we have constrained our research to the limit of weak selection, with the proportion of players using each decision-making function held fixed. Here, we relax these constraints and study co-evolutionary dynamics beyond weak selection, where the decision-making function is also a trait that can be learnt. In general, for a sufficiently large population, the master equation describing the evolution of the proportions of decision-making functions, z_l (l=1,2 and z_1 + z_2 = 1), can be written in terms of ρ_IJ, the conditional switch rate from decision-making function I to function J (I ≠ J ∈ {1, 2}), as equation (18). Since the payoff of players is not directly influenced by the form of the decision-making function, the individual payoff information has no impact on learning a decision-making function. For the sake of simplicity, we therefore consider a frequency-based learning mechanism: individuals attend to what the majority is doing [55]. We model this learning process with a Fermi function.

(Figure 2 caption: one pair of decision-making functions is applied in the top row and another in the bottom row; given these functions, ζ_1 = ζ_2 = 1/2 in all panels. Moreover, k_1 = −1/6 and k_2 = 1/4 for the first pair lead to one limit value of z_1* ((a) and (b)), whereas k_1 = 1/4 and k_2 = −1/6 for the second pair lead to another ((c) and (d)). Since the critical ratio z_1* belongs to [0,1], this restricts the admissible values of a_0; when a_0 → ∞, the critical ratio z_1* moves towards its limit value. Lines are analytical results, whereas symbols are simulations, with the well-mixed population modeled by a complete graph ((a) and (c)) and the structured population by a ring graph ((b) and (d)). All simulation results are obtained by averaging over 100 network realizations and 10^8 time steps after a transient time of 10^7. The initial frequency of strategy A is set to 0.5. Parameters are N=1000, β=0.01, e=2.0, and the payoff table is given by matrix (17).)
An individual using function I learns to adopt function J (J ≠ I) with probability 1/(1 + exp[−(n_J − n_I)]), where n_I (resp. n_J) is the number of players using function I (resp. J) within the d-person game in which the individual participates. Although we do not assume a separation of time scales between strategy evolution and function learning, unlike most previous studies on co-evolutionary dynamics [33,34,56], such a property arises naturally in our model, because function learning here is independent of the evolution of strategy (see appendix C).
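The frequency-based learning rule above can be sketched directly. The exact argument of the exponential is our reconstruction of the garbled formula; the function name is ours.

```python
import math

def switch_probability(n_I, n_J):
    """Probability that a player using function I adopts function J,
    given the counts n_I and n_J inside its d-person group
    (assumed Fermi form: more J users means a higher switch probability)."""
    return 1.0 / (1.0 + math.exp(-(n_J - n_I)))
```

Note the complementarity switch_probability(a, b) + switch_probability(b, a) = 1, which underlies the symmetry of the learning process discussed next.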
In well-mixed populations, we find that the learning process between the two distinct decision-making functions is symmetric, which makes the conditional switch rate ρ_12 equal to ρ_21 (see appendix C for details). Setting the right-hand side of equation (18) to zero then shows that z_1 = 1/2 is the only stable equilibrium. That is, for an arbitrary initial distribution of the proportions of decision-making functions, the population ultimately evolves into a steady regime in which half of the individuals choose function φ_1(β, Π_1) and the remaining half choose function φ_2(β, Π_2). Similarly, by numerical calculations, we find that the equilibrium z_1 = 1/2 also holds in structured populations (see figure 3).
By numerically solving the co-evolutionary dynamical equations of the system, figure 3 shows the equilibrium frequency of strategy A and the equilibrium proportions of A_l (l=1,2) players within the entire population. Overall, whether in well-mixed or in structured populations, the abundance of A_2 players is always higher than that of A_1 players. Moreover, compared with A_2 players, the equilibrium frequency of A_1 players is more sensitive to changes in the selection intensity β and the payoff entry a_0. Increasing the selection intensity results in a dramatic decline in the level of A_1 players, whereas the level of A_2 players remains relatively stable. For the average abundance of strategy A, however, the selection intensity acts as an amplifier of selection: when strategy A is more abundant than strategy B, increasing the selection intensity enhances the average abundance of strategy A (see the region above the black lines with dots in figure 3).

(Figure 3 caption: a pair of Fermi-type functions is used to describe the decision-making functions φ_1(β, Π_1) and φ_2(β, Π_2), chosen such that the critical ratio lies in (0,1). Corresponding to the average abundance of strategy A_l (l=1,2) in (a)-(d), beneath each panel we show the average abundance of strategy A within the population for different values of a_0 and β. The black lines with dots represent the critical a_0 as a function of the selection intensity β (dots from numerical calculations, lines from fitting), above which the average abundance of strategy A exceeds that of strategy B. The equilibrium proportion of players using function φ_1(β, Π_1) is z_1 = 1/2 in both the well-mixed and structured cases. Parameters are N=100, e=2.0, and the payoff table is given by matrix (17).)
On the contrary, when strategy B is more abundant than strategy A, increasing the selection intensity lowers the average abundance of strategy A (see the region below the black lines with dots in figure 3). Such an amplification effect is not reflected in the payoff entry a_0, however, which monotonically promotes the average level of strategy A. These results are robust to changes in the population structure and in the forms of the decision-making functions.
Nevertheless, the critical value of a_0 (the black lines with dots in figure 3), above which strategy A is more abundant than strategy B, is markedly sensitive to the population structure and the selection intensity. Under weak selection, well-mixed and structured populations lead to the same critical value of a_0, in line with the result obtained from condition (16). Beyond weak selection, however, structured populations hinder the evolution of strategy A and hence push the critical a_0 to a larger value than in the well-mixed case. As a whole, therefore, the average abundance of strategy A in well-mixed populations is higher than in structured populations.

Conclusion and discussion
Among the factors determining the fate of strategy evolution, the rule by which players make strategic decisions or update subsequent actions is among the most important [4,16]. Based on the information source that individuals utilize to modify their strategies, we summarize the existing mainstream rules into two classes. One is the opponent-aware type, in which both the opponent's information and one's own information are employed by decision-makers. The other is the opponent-independent type, where decision-makers update strategies based only on their personal information. Different from most previous studies in which all players use the same update rule [3,4,39,43], here we have investigated the evolutionary dynamics in a heterogeneous population where each agent can individually choose both the information type and the decision-making manner. Mathematically, for a networked multi-person game with two strategies, we give a generic criterion under which one strategy wins over the other under weak selection. At the same time, we find that this criterion also holds in a mutation-selection system. In particular, when the heterogeneous system incorporates only two kinds of decision-making functions, we further demonstrate that this criterion is robust against the population structure. When introducing co-evolutionary dynamics beyond weak selection, however, we find that strategy evolution is sensitively influenced by the population structure.
Most existing studies of how individuals make strategic decisions adopt the homogeneous hypothesis that all agents update strategies in the same way [3,4,39,43]. Even when two distinct decision modes are combined, the heterogeneity is mainly concentrated in a single information source or decision manner [27-30]. Given the diversity and self-organization of individual interactions [18,31], however, it is reasonable to expect that both kinds of heterogeneity can dramatically affect the evolutionary outcomes. In contrast to a recent finding that the heterogeneous decision-making function is a trivial element in the condition of strategy dominance [24,25], we demonstrate here that it plays a key role in determining the fate of strategy evolution. One important reason for this difference is that the decision-making function we adopt here takes a more general form. Consistent with results in aspiration dynamics [16,24,25], however, personal traits do not enter the condition of strategy dominance. Two reasons account for this. First, the assumption of weak selection guarantees the validity of condition (5). Second, the symmetry of the strategic decision process ensures that exchanging the labels A and B merely alters the notation of the payoff entries and does not affect the personal traits.
As a paramount theoretical framework, the mutation-selection process has been widely applied to investigate the evolution of different traits across various disciplines [39,57]. Compared with the opponent-aware and opponent-independent decision-making processes, it lays more emphasis on the properties of biological inheritance or social learning. Nontrivially, we find that the decision-making function plays a role similar to that of the effective payoff function (i.e. the fitness function) (see supplementary information available online at stacks.iop.org/NJP/21/063013/mmedia). Although both can be described mathematically by a Markov chain for the dynamics of strategy evolution and follow a similar proof path to derive the condition of strategy dominance, there remain some striking differences between them. Notably, since the decision-making of individuals in our model is heterogeneous, our results cover the homogeneous system as a special case and thus cannot be obtained directly from previous homogeneous conclusions [39,42,53].
Compared with well-mixed settings, spatial structure is conventionally regarded as a powerful mechanism for promoting the evolution of cooperation [3,4,36]. However, recent studies conclude that the cooperation level cannot always be enhanced effectively in spatial structures [15,24,37,38], in stark contrast to the positive effect of network reciprocity [36]. Similarly, under weak selection, we find that when two distinct decision manners are mixed, the evolutionary outcomes obtained in well-mixed and structured populations are identical. This arises mainly because weak selection significantly inhibits the influence of population structure [4]. Beyond weak selection, nevertheless, population structures have a striking impact on the evolutionary dynamics. Even though the results derived here are based on a complete graph and a regular graph, our theorem is not confined to these cases and can be applied to other population structures on which a multi-person game is played. In addition, it is easy and natural to apply our theoretical framework to combine other types of individual decision-making, not just the mixture of opponent-aware and opponent-independent types.

Acknowledgments
Appendix A

Denote by x_s the stationary probability that the system stays in state s, which is a function of the selection intensity β. Since the differentiability of the decision-making function guarantees that the stationary probability x_s is differentiable at β=0 [24,39,41], we can rewrite it as a first-order Taylor expansion in β. In the following, we first prove that, for vanishing selection intensity, the average abundance of each strategy is one half.

A1. The average abundance is one half at β=0
First, we denote the strategy state of individual i by q_i (i=1, 2, ..., N): q_i = 1 if strategy A is played, and q_i = 0 otherwise. Hence, q_i is a random variable with two states. In this case, the number of individuals using strategy A is Σ_i q_i and the average abundance of strategy A is E(Σ_i q_i)/N. In addition, based on the model described in the main text, individual i is selected with probability 1/N and switches from its current strategy to the opposite one with probability φ_i(β, Π_l). In particular, when the selection intensity vanishes (i.e. β=0), the assumption given in the main text yields φ_i(0, Π_l) = constant, which we denote by φ_i(0). The per-step transition probability matrix for the strategy state of individual i is then symmetric, with off-diagonal entries φ_i(0)/N. Note that 0 < φ_i(0) < 1 always holds. Therefore, the Markov chain is irreducible and aperiodic, with a unique stationary distribution given by (1/2, 1/2). This leads to E(q_i) = 1/2, and thus the average abundance of strategy A is 1/2.
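The two-state argument above can be verified mechanically. The helper names are ours, and the stationary formula is the standard closed form for a two-state chain:

```python
def individual_chain(phi0, N):
    """Per-step transition matrix of one player's strategy state at beta = 0:
    the player is picked with probability 1/N, then flips with phi_i(0)."""
    p = phi0 / N
    return [[1 - p, p], [p, 1 - p]]

def stationary_2state(P):
    """Stationary distribution of a two-state chain:
    pi_0 = P[1][0] / (P[0][1] + P[1][0])."""
    pi0 = P[1][0] / (P[0][1] + P[1][0])
    return [pi0, 1 - pi0]
```

Because the matrix is symmetric for any 0 < φ_i(0) < 1 and any N, the stationary distribution is always (1/2, 1/2), as claimed.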

A2. The linear relation
Then, under weak selection, whenever a term of degree k in β appears, it must be accompanied by a term of degree k in the m_i^l and c_i^l, and not by a constant term. Furthermore, because the matrix P is stochastic and primitive, its largest eigenvalue is 1 and it is simple. Hence, the equation XP = X that we need to solve has only one degree of freedom. To find the solution, using a method similar to [39], we perform Gaussian elimination and, without loss of generality, take x_N as the free variable. In this way, there exists a set of parameters h_s, s=0, 1, 2, ..., N−1, relating each x_s to x_N. Moreover, note that the stationary distribution is a vector with non-negative entries summing to 1. In view of the nature of Gaussian elimination, the elements of the reduced matrix must be fractions of polynomials in β; thus the h_s are all rational functions of β. Based on equations (A.3) and (A.4), it follows that the stationary probability x_s is a ratio of polynomials in β, which we write in an irreducible form. Herein, ℓ_0 and λ_0 are evaluated from the transition probabilities at vanishing selection intensity and are independent of the parameters m_i^l and c_i^l, whereas ℓ_1 and λ_1 are linear combinations of the m_i^l and c_i^l, respectively. Under weak selection, differentiating x_s and evaluating it at β=0 yields the desired first-order coefficient.
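The rational-in-β structure of x_s can be seen concretely in a two-state toy chain (not the full model): with switching probabilities of the form 1/2 + β·m, the stationary probability is an exact ratio of polynomials in β, and its derivative at β=0 is linear in the m coefficients. A sketch using exact fraction arithmetic:

```python
from fractions import Fraction as F

def stationary_x0(beta, m0, m1):
    """Exact stationary probability of state 0 for a two-state chain whose
    switching probabilities are 1/2 + beta*m (a toy analogue of phi_i).
    Result: (1/2 + beta*m1) / (1 + beta*(m0 + m1)), a ratio of polynomials."""
    p0 = F(1, 2) + beta * m0   # switch rate out of state 0
    p1 = F(1, 2) + beta * m1   # switch rate out of state 1
    return p1 / (p0 + p1)
```

Differentiating this ratio at β=0 gives (m1 − m0)/2, i.e. a linear combination of the m coefficients with no constant term, mirroring the structure of ℓ_1 and λ_1 above.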

Appendix B. Hybrid dynamics in well-mixed and structured populations
In both well-mixed and structured populations, we derive the hybrid dynamical equations when only two decision-making functions are available to players, and we calculate the average abundance of strategy A in the steady state. The basic idea is first to obtain a set of system equations from the microscopic evolutionary process, and then to apply the perturbation method to estimate the coefficients of the selection intensity [15,24,25].

B1. Well-mixed populations
As illustrated in figure 1, there are four possible transitions leaving the current system state $(n_1,n_2)$ for a new one, with transition probabilities $T^{l\pm}(n_1,n_2)$, $l=1,2$, respectively. When the population structure is well-mixed (i.e. a complete graph), we obtain the probabilities that the number of $A_1$ players, $n_1$, increases and decreases by one as equations (12) and (13). Then, the master equation describing the evolution of the microscopic process can be written as
$$\begin{aligned}
P(n_1,n_2,\tau+1) ={}& P(n_1-1,n_2,\tau)\,T^{1+}(n_1-1,n_2) + P(n_1+1,n_2,\tau)\,T^{1-}(n_1+1,n_2) \\
&+ P(n_1,n_2-1,\tau)\,T^{2+}(n_1,n_2-1) + P(n_1,n_2+1,\tau)\,T^{2-}(n_1,n_2+1) \\
&+ P(n_1,n_2,\tau)\big[1 - T^{1+}(n_1,n_2) - T^{1-}(n_1,n_2) - T^{2+}(n_1,n_2) - T^{2-}(n_1,n_2)\big],
\end{aligned} \tag{B.5}$$
where $P(n_1,n_2,\tau)$ is the probability that the population contains $n_1$ individuals of type $A_1$ and $n_2$ individuals of type $A_2$ at time $\tau$. To perform a diffusion approximation of the master equation [48], we first introduce the notations $y_l=n_l/N$ ($l=1,2$) and $t=\tau/N$, and substitute them into equation (B.5). Subsequently, we expand the transition rates $T^{l\pm}(y_1,y_2)$, which yields equations (B.6), where $z_l=N_l/N$ ($l=1,2$) are the frequencies of players using $j_l(\beta,\Pi_l)$ in the population, and $z_1+z_2=1$ always holds. In order to normalize $y_l$, we additionally rescale the frequency $y_l$ by $z_l$, writing $x_l = y_l/z_l \in [0,1]$. Thus, equations (B.6) are changed to the system equations (B.9), where $Y=A$ or $B$ and the assumption on the payoff parameters stated in the main text is applied. In addition, since the frequency of $A$ players in the population stays at one half when selection intensity vanishes ($\beta=0$), the fixed points of the system equations (B.9) can be written as $1/2$ plus a perturbation under weak selection [15,24]:
$$x_l^* = \tfrac{1}{2} + \delta_l\beta + O(\beta^2), \quad l=1,2. \tag{B.11}$$
Therefore, in the steady state, the average abundance of strategy $A$ in the whole population is
$$\langle x_A\rangle = z_1 x_1^* + z_2 x_2^* = \tfrac{1}{2} + (z_1\delta_1 + z_2\delta_2)\beta + O(\beta^2). \tag{B.14}$$
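A minimal stochastic simulation of the neutral case ($\beta=0$) illustrates why the frequency of $A$ players settles at one half regardless of which decision function each player carries. The switch probabilities 0.3 and 0.6 below are arbitrary stand-ins for the constants $j_1(0)$ and $j_2(0)$:

```python
import numpy as np

# Neutral (beta = 0) well-mixed dynamics: in each step a randomly chosen
# individual switches strategy with a constant probability that depends
# only on its decision function (values 0.3 and 0.6 are illustrative).
rng = np.random.default_rng(1)
N = 200
j0 = np.where(rng.random(N) < 0.5, 0.3, 0.6)  # per-individual switch prob.
q = rng.integers(0, 2, size=N)                # 1 = strategy A, 0 = B

total, steps, burn_in = 0.0, 200_000, 50_000
for t in range(steps):
    i = rng.integers(N)
    if rng.random() < j0[i]:
        q[i] ^= 1  # flip the strategy of the selected individual
    if t >= burn_in:
        total += q.mean()
avg = total / (steps - burn_in)
print(avg)  # time-averaged abundance of A, close to 1/2
```

Because each $q_i$ is an independent symmetric two-state chain at $\beta=0$, the time-averaged abundance converges to $1/2$ whatever the mixture of decision functions, matching the zeroth-order term of equation (B.11).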

B2. Structured populations
In contrast to the well-mixed situation, a structured population, whose topology is modeled by a regular or irregular graph, can normally evoke some novel dynamics [3,4,19,37]. Here, we consider a regular graph with degree $d-1$. To describe the evolutionary dynamics of the system, we begin by defining the state variables $P_{A_l} := n_l/N_l$ ($l=1,2$) as the frequencies of $A_l$ players in the subpopulation where $j_l(\beta,\Pi_l)$ is used. Correspondingly, the frequency of $B_l$ ($l=1,2$) players in the subpopulation is $P_{B_l} = 1 - P_{A_l}$. Based on the microscopic evolutionary process described in the main text, the probabilities that $P_{A_1}$ increases and decreases by $1/N_1$ are denoted by $r_1^+$ and $r_1^-$, where $k_A$ is the number of $A$ players among the $d-1$ neighbours. Similarly, the probabilities that $P_{A_2}$ increases and decreases by $1/N_2$ are given by $r_2^+$ and $r_2^-$, respectively. As only one replacement event takes place in one unit of time, for sufficiently large $N$ the dynamic equations of $P_{A_1}$ and $P_{A_2}$ can be given by
$$\dot{P}_{A_l} = r_l^+ - r_l^-, \quad l=1,2.$$
Moreover, we notice that strategies $A$ and $B$ share the same frequency in the stable state when selection intensity vanishes. Thus, in the limit of weak selection, based on the perturbation method [15,24,25], the frequency of $A_l$ ($l=1,2$) in the steady state can be given by
$$P_{A_l} = \tfrac{1}{2} + \delta_l\beta + O(\beta^2), \quad l=1,2.$$
Clearly, these perturbation coefficients are completely identical to $\delta_1$ and $\delta_2$ given by equations (B.13). Therefore, in the steady state, we have the average abundance of strategy $A$ in the whole population as shown in equation (B.14). The dynamics above also involve $P_{11}$, the frequency of pairs of players both using function 1. Therefore, to obtain the co-evolutionary dynamics, we additionally need to acquire the dynamics of $P_{11}$. Here, $P_{AA}$ denotes the frequency of $AA$-pairs in the entire population. Overall, there are two events that can quantitatively change the number of $AA$-pairs.
When a focal $B$ player with $k_A$ neighbours playing $A$ updates its strategy to $A$, the number of $AA$-pairs increases by $k_A$, and therefore $P_{AA}$ increases by $2k_A/\big((d-1)N\big)$.
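This pair bookkeeping can be checked directly on a small graph. The sketch below uses a ring (a regular graph of degree 2, a simpler special case than the general graph of the text) and verifies that flipping a $B$ player with $k_A$ $A$-neighbours to $A$ creates exactly $k_A$ new $AA$-edges:

```python
import numpy as np

# Pair bookkeeping on a ring of N nodes: flipping a focal B player to A
# adds exactly k_A new AA-edges, where k_A is its number of A neighbours.
rng = np.random.default_rng(0)
N = 50
s = rng.integers(0, 2, size=N)  # 1 = strategy A, 0 = B

def aa_edges(s):
    # Count undirected AA-edges of the ring: edges are (i, i+1 mod N).
    return sum(int(s[i] & s[(i + 1) % N]) for i in range(len(s)))

i = int(np.flatnonzero(s == 0)[0])         # pick a focal B player
k_A = int(s[(i - 1) % N] + s[(i + 1) % N])  # its number of A neighbours
before = aa_edges(s)
s[i] = 1                                    # the B -> A strategy update
after = aa_edges(s)
print(after - before == k_A)  # True
```

Dividing the edge-count increment by the total number of edges of the graph then gives the corresponding increment of $P_{AA}$.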