Introspection dynamics: a simple model of counterfactual learning in asymmetric games

Social behavior in human and animal populations can be studied as an evolutionary process. Individuals often make decisions between different strategies, and those strategies that yield a fitness advantage tend to spread. Traditionally, much work in evolutionary game theory considers symmetric games: individuals are assumed to have access to the same set of strategies, and they experience the same payoff consequences. As a result, they can learn more profitable strategies by imitation. However, interactions are oftentimes asymmetric. In that case, imitation may be infeasible (because individuals differ in the strategies they are able to use), or it may be undesirable (because individuals differ in their incentives to use a strategy). Here, we consider an alternative learning process that applies to arbitrary asymmetric games: introspection dynamics. According to this dynamics, individuals regularly compare their present strategy to a randomly chosen alternative strategy. If the alternative strategy yields a payoff advantage, it is more likely to be adopted. In this work, we formalize introspection dynamics for pairwise games. We derive simple and explicit formulas for the abundance of each strategy over time and apply these results to several well-known social dilemmas. In particular, for the volunteer's timing dilemma, we show that the player with the lowest cooperation cost learns to cooperate without delay.


Introduction
Social behaviors often follow an evolutionary process. Behaviors that yield a high payoff proliferate, whereas inferior strategies go extinct [1,2]. Providing a quantitative description of how individuals learn, make choices, and interact, however, is non-trivial. Actions of one group member may affect the behaviors of others; and, in a complex chain of interdependencies, this may, in turn, affect the group's surrounding environment [3][4][5]. Researchers aim to capture this complex dynamics with models of learning [6][7][8][9][10][11][12][13] and the tools of evolutionary game theory [14][15][16]. One key application of this theory is the study of cooperation in social dilemmas, where an individual's self-interest is at odds with the common good [17,18]. For example, to be physically isolated during a pandemic is an action that is simultaneously costly for the individual and valuable for public health. Game theory is a standard mathematical framework to model decision-making in this context [19]. A game, defined by the players, their possible strategies, and their payoffs, captures how individuals interact and what the resulting consequences are.
One of the more subtle aspects of such a model is how it specifies the strategy updating process [20,21]. In evolutionary game theory, individuals do not act optimally from the outset. Rather, they dynamically adjust their strategies based on their current payoffs. To capture this dynamics, the model needs to specify how individuals adopt new strategies, which is governed by the players' strategy update rule or selection rule. One example of such an update rule is imitation by pairwise comparison [22][23][24]: occasionally, individuals compare their own payoff to the payoff of a co-player. The larger the co-player's payoff, the more likely the focal individual imitates the co-player's strategy. Alternatively, there is also work that considers update rules that do not rely on imitation.

Figure 1. A schematic illustration of introspection dynamics. (a) We consider two players who interact in a matrix game over several time steps. Suppose in time step τ, the orange row player chooses strategy S_2, whereas the blue column player chooses S_1 (as indicated by the arrows). Then the row player obtains a payoff of 4, whereas the column player obtains a payoff of 1. (b) After their interaction, the column player is randomly chosen to update their strategy. To this end, the player randomly picks an alternative strategy from their strategy set. In this case, the alternative strategy is S_3. The column player then compares its previously realized payoff π = 1 with the hypothetical payoff π̂ = 2 that the player could have obtained by playing S_3 at time τ. (c) Depending on the payoff difference Δπ = π̂ − π, the column player decides whether to switch to the alternative strategy S_3. Throughout this article, we assume that the switching probability is parametrized by (2). In the example, the payoff difference is positive, and thus the switching probability is larger than the neutral probability 1/2. (d) If the column player switches to the alternative strategy, the outcome of the game at time τ + 1 changes accordingly.

Model
To model asymmetric interactions, we consider two individuals, player 1 and player 2, who interact in a normal-form game. Player 1 can choose among m pure strategies, S_1, ..., S_m, and player 2 can choose among n strategies, S̃_1, ..., S̃_n. If player 1 chooses strategy S_i and player 2 chooses strategy S̃_j, the resulting payoffs are π_ij and π̃_ij, respectively. We represent such a game by a bi-matrix,

\[
\begin{pmatrix}
(\pi_{11}, \tilde\pi_{11}) & \cdots & (\pi_{1n}, \tilde\pi_{1n}) \\
\vdots & \ddots & \vdots \\
(\pi_{m1}, \tilde\pi_{m1}) & \cdots & (\pi_{mn}, \tilde\pi_{mn})
\end{pmatrix}.
\tag{1}
\]

In this representation, player 1 can choose one of the m rows, whereas player 2 chooses one of the n columns. Thus, we can also refer to player 1 as the 'row player', and to player 2 as the 'column player'. The first entry of each pair in the bi-matrix is the payoff to player 1, whereas the second entry is the payoff to player 2. The game is called symmetric if the two players have the same set of strategies (that is, if m = n and S̃_i = S_i for all i), and if corresponding strategies yield the same payoffs (if π̃_ij = π_ji for all i, j). In what follows, we mainly focus on asymmetric games.
We assume that the two players interact over many time steps, engaging each time in the game defined by the above bi-matrix. In each time step, they can dynamically adjust their strategies. A possible learning dynamics may posit that players compare their current strategy with another one, assuming that the co-player's strategy remains fixed. The process, which we refer to as introspection dynamics, proceeds as in figure 1. At each time step, one player is randomly chosen to reconsider their strategy. To this end, the player randomly draws an alternative strategy from the set of all its possible strategies. The player then compares its realized payoff π with the payoff π̂ the player could have obtained by playing the alternative strategy instead (keeping the co-player's strategy fixed). Let Δπ := π̂ − π be the difference between the counterfactual payoff and the realized payoff. The player switches to the alternative strategy with probability φ_β(Δπ), given by the Fermi function [23,24,68],

\[
\varphi_\beta(\Delta\pi) = \frac{1}{1 + e^{-\beta\,\Delta\pi}}.
\tag{2}
\]

Here, β ≥ 0 is a parameter that measures the intensity of selection, or selection strength. In one limiting case, β → 0, payoffs are irrelevant to the learning process, and any alternative strategy is adopted with probability 1/2. In the other limiting case, β → ∞, players adopt strictly better alternative strategies with certainty and strictly worse ones never (ties are resolved by a coin flip). We refer to these two limits as the cases of weak selection and strong selection, respectively. As we iterate this elementary updating step over time, we obtain a stochastic process on the space of all strategy profiles (S_1, S̃_1), (S_1, S̃_2), ..., (S_m, S̃_n). Each state corresponds to the strategies that the two players currently adopt. Simulations of this process are straightforward [54,55]. Here, our goal is to analyze the mathematical properties of this process.
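The elementary updating step is straightforward to simulate. The following sketch (in Python; the 2 × 2 bi-matrix and the parameter values are hypothetical choices for illustration, not taken from the text) iterates the process and records how often each strategy profile is visited:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 2x2 bi-matrix (payoff values are hypothetical):
# P1[i, j] is the row player's payoff, P2[i, j] the column player's payoff.
P1 = np.array([[3.0, 0.0], [4.0, 1.0]])
P2 = np.array([[3.0, 4.0], [0.0, 1.0]])

beta = 2.0        # selection strength
steps = 200_000   # number of elementary updating steps

def fermi(delta, beta):
    """Switching probability for payoff difference delta = (counterfactual - realized)."""
    return 1.0 / (1.0 + np.exp(-beta * delta))

i, j = 1, 1                  # current strategies of row and column player
counts = np.zeros((2, 2))    # time spent in each strategy profile
for _ in range(steps):
    counts[i, j] += 1
    if rng.random() < 0.5:   # the row player is chosen to reconsider
        k = 1 - i            # with m = 2, there is a single alternative strategy
        if rng.random() < fermi(P1[k, j] - P1[i, j], beta):
            i = k
    else:                    # the column player is chosen to reconsider
        l = 1 - j
        if rng.random() < fermi(P2[i, l] - P2[i, j], beta):
            j = l

freq = counts / steps        # empirical distribution over strategy profiles
print(freq)
```

Running the sketch shows the players spending most of their time in the profile where both use their second strategy, while still occasionally visiting all other profiles.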

An explicit formula for the stationary strategy distribution
To derive explicit results, we note that each player's updating behavior only depends on the players' current strategies (whereas it is independent of the players' previous strategies). As a result, we can represent the dynamics by a Markov chain. Given the current state (S_i, S̃_j), we can compute the probability that the state changes to (S_k, S̃_l) in one time step. This transition probability is

\[
T_{ij,kl} = \begin{cases}
\dfrac{1}{2}\cdot\dfrac{1}{m-1}\,\varphi_\beta(\pi_{kj} - \pi_{ij}) & \text{if } k \neq i \text{ and } l = j,\\[6pt]
\dfrac{1}{2}\cdot\dfrac{1}{n-1}\,\varphi_\beta(\tilde\pi_{il} - \tilde\pi_{ij}) & \text{if } k = i \text{ and } l \neq j,\\[6pt]
1 - \sum_{(k',l')\neq(i,j)} T_{ij,k'l'} & \text{if } k = i \text{ and } l = j,\\[6pt]
0 & \text{otherwise.}
\end{cases}
\tag{3}
\]

In these expressions, the factor 1/2 corresponds to randomly drawing one of the two players. Similarly, the factors 1/(m − 1) and 1/(n − 1) correspond to randomly drawing an alternative strategy. The expressions φ_β(Δπ) then give the probability that the alternative strategy is adopted, according to Eq. (2). We can collect these transition probabilities in an mn × mn matrix T = (T_ij,kl). Here, the first double index denotes the row that corresponds to the previous state (S_i, S̃_j) of the Markov chain, whereas the second double index corresponds to the next state (S_k, S̃_l). By construction, T is nonnegative and row stochastic (see appendix A).
We can use the matrix T to describe how likely we are to find the process in a given state at some time t, given the initial distribution of the process. Let v_ij(t) denote the probability that at time t, the process is in state (S_i, S̃_j). The mn-dimensional row vector v(t) collects these probabilities in the same order of states as in the transition matrix. By the theory of Markov chains, the strategy distribution at time t is

\[
v(t) = v(0)\,T^{t},
\tag{4}
\]

where v(0) is the initial strategy distribution.
To describe the long-run dynamics, we assume in the following that the selection strength β is finite (even if it may be arbitrarily large). In that case, the transition matrix T is primitive (see proposition 1 in appendix A). Thus, it follows from the Perron–Frobenius theorem [69] that v(t) converges in time to a unique and positive stationary distribution u, which is independent of the initial distribution v(0). The stationary distribution u solves the eigenvector problem

\[
u\,T = u, \qquad u\,e^{\top} = 1.
\tag{5}
\]

Here, e denotes the mn-dimensional row vector where each entry is equal to 1, and the superscript ⊤ indicates transposition. Hence, the second equation in (5) is the usual normalization for a probability vector (requiring that the sum of all components of u is equal to 1). While (5) defines the stationary distribution implicitly, we can also obtain an explicit formula. To this end, we multiply the second equation in (5) by the row vector e from the right, and add the result to the first equation. After some rearranging, this yields the relationship u(I + U − T) = e. Here, U = e^⊤e is the mn × mn matrix with all entries equal to 1, and I is the identity matrix. The matrix (I + U − T) is invertible (see proposition 2 in appendix A, with (I + U − T)^{-1} being a fundamental matrix of the ergodic chain [70]). Thus, we obtain the explicit representation

\[
u = e\,(I + U - T)^{-1}.
\tag{6}
\]

This equation allows us to numerically compute the stationary distribution u = (u_ij) of introspection dynamics for any finite normal-form game.
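Both the iteration (4) and the explicit formula (6) are easy to implement numerically. The sketch below (in Python; the payoff matrices are randomly generated placeholders) builds the transition matrix (3) for an arbitrary bi-matrix game, evaluates u = e(I + U − T)^{-1}, and cross-checks the result against the long-run limit of v(t):

```python
import numpy as np

def fermi(delta, beta):
    return 1.0 / (1.0 + np.exp(-beta * delta))

def transition_matrix(P1, P2, beta):
    """Transition matrix (3) of introspection dynamics; requires m, n >= 2.

    P1, P2 are the m x n payoff matrices of the row and the column player;
    the state (S_i, S_j) is mapped to the flat index i * n + j."""
    m, n = P1.shape
    T = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            s = i * n + j
            for k in range(m):                 # row player switches i -> k
                if k != i:
                    T[s, k * n + j] = 0.5 / (m - 1) * fermi(P1[k, j] - P1[i, j], beta)
            for l in range(n):                 # column player switches j -> l
                if l != j:
                    T[s, i * n + l] = 0.5 / (n - 1) * fermi(P2[i, l] - P2[i, j], beta)
            T[s, s] = 1.0 - T[s].sum()         # probability that nothing changes
    return T

def stationary(T):
    """Explicit formula u = e (I + U - T)^(-1), with e the all-ones row vector."""
    N = T.shape[0]
    return np.ones(N) @ np.linalg.inv(np.eye(N) + np.ones((N, N)) - T)

# Example: a random 3 x 4 asymmetric game (placeholder payoffs)
rng = np.random.default_rng(7)
P1, P2 = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
T = transition_matrix(P1, P2, beta=1.5)
u = stationary(T)

# Cross-check: v(t) = v(0) T^t converges to u for any initial distribution
v0 = np.zeros(12)
v0[0] = 1.0
v_limit = v0 @ np.linalg.matrix_power(T, 100_000)
print(np.abs(u - v_limit).max())
```

The two routes agree to numerical precision, and the computed distribution is positive and normalized, as the Perron–Frobenius argument requires.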
To derive expressions that more immediately relate u to the payoffs of the game, we consider two important special cases in the following. First, for any number of strategies, we derive an approximation in the limit of weak selection (β ≪ 1). After that, we provide exact formulas for the case where each player can choose among two strategies (m = n = 2).

Weak selection
To directly relate the long-run abundance of strategies to the game's payoffs, we consider the limit of weak selection. For this, we first note that both the transition matrix (3) and its stationary distribution (6) depend on β; to make this dependence more explicit, we write T(β) for the transition matrix and u(β) for the stationary distribution. We then expand u(β) and T(β) as a Taylor series around β = 0,

\[
u(\beta) = u_0 + u_1\,\beta + O(\beta^2), \qquad
T(\beta) = T_0 + T_1\,\beta + O(\beta^2).
\tag{7}
\]

Here, T_0 := T|_{β=0} and T_1 := ∂T/∂β|_{β=0} are the constant and the linear term of the Taylor expansion of the transition matrix. Both terms can be computed explicitly, see appendix B. Similarly, u_0 := u|_{β=0} and u_1 := ∂u/∂β|_{β=0} are the constant and the linear term of the stationary distribution; those are the terms we wish to compute. To this end, we take the expressions (7) and plug them into the eigenvector problem (5). This yields the relation

\[
\big(u_0 + u_1\,\beta + O(\beta^2)\big)\big(T_0 + T_1\,\beta + O(\beta^2)\big) = u_0 + u_1\,\beta + O(\beta^2).
\tag{8}
\]

By setting β = 0 in (8), we note that u_0 needs to satisfy the linear system

\[
u_0\,T_0 = u_0, \qquad u_0\,e^{\top} = 1.
\tag{9}
\]

Similarly, by taking the first derivative of both sides of (8) with respect to β, and then setting β = 0, it follows that u_1 needs to satisfy

\[
u_1\,(I - T_0) = u_0\,T_1, \qquad u_1\,e^{\top} = 0.
\tag{10}
\]

As we show in proposition 3 in appendix B, both systems (9) and (10) can be solved explicitly.
In the special case that both players have the same number of strategies, m = n, the solution becomes particularly simple. In that case, we can approximate the abundance of strategy profile (S_i, S̃_j) in the stationary strategy distribution by

\[
u_{ij} = \frac{1}{n^2} + \frac{\beta}{2n^2}\left(
\pi_{ij} - \frac{1}{n}\sum_{k}\pi_{kj}
+ \frac{1}{n}\sum_{l}\pi_{il} - \frac{1}{n^2}\sum_{k,l}\pi_{kl}
+ \tilde\pi_{ij} - \frac{1}{n}\sum_{l}\tilde\pi_{il}
+ \frac{1}{n}\sum_{k}\tilde\pi_{kj} - \frac{1}{n^2}\sum_{k,l}\tilde\pi_{kl}
\right) + O(\beta^2).
\tag{11}
\]

To interpret this formula, let us introduce the following shortcut notation for some payoff averages for player 1,

\[
\pi_{\bullet j} := \frac{1}{n}\sum_{k}\pi_{kj}, \qquad
\pi_{i\bullet} := \frac{1}{n}\sum_{l}\pi_{il}, \qquad
\pi_{\bullet\bullet} := \frac{1}{n^2}\sum_{k,l}\pi_{kl}.
\tag{12}
\]

Here, π_{•j} is the average payoff player 1 obtains when randomly sampling a strategy against a co-player with strategy S̃_j. The next expression, π_{i•}, is player 1's average payoff when using strategy S_i against a randomly sampling co-player. Finally, π_{••} is player 1's average payoff if both players sample randomly. Analogous averages can be defined with respect to player 2's payoffs π̃. Using this notation, we can rewrite the weak-selection formula (11) as

\[
u_{ij} = \frac{1}{n^2} + \frac{\beta}{2n^2}\Big[
(\pi_{ij} - \pi_{\bullet j}) + (\pi_{i\bullet} - \pi_{\bullet\bullet})
+ (\tilde\pi_{ij} - \tilde\pi_{i\bullet}) + (\tilde\pi_{\bullet j} - \tilde\pi_{\bullet\bullet})
\Big] + O(\beta^2).
\tag{13}
\]

We say the strategy profile (S_i, S̃_j) is favored by selection if its abundance is larger than neutral, u_ij > 1/n². For weak selection, expression (13) suggests the following two mechanisms for a strategy profile to be favored: either (i) player 1's payoff from using strategy S_i against S̃_j is better than average, π_ij > π_{•j}; or (ii) player 1's payoff from using strategy S_i against a random strategy of the co-player is better than average, π_{i•} > π_{••}. Two analogous mechanisms apply to player 2.
Interestingly, similar results have been previously derived for a birth-death model [60]. In that model, members of two separate populations engage in a bi-matrix game. The members of the first population act in the role of the row player, whereas the members of the second population act as the column player. In the special case that mutations are rare and selection is weak, the results of that model coincide with ours. In that case, the likelihood that population 1 uses strategy S i and that population 2 uses strategy S j simplifies to our Eq. (13). This agreement can be regarded as another instance of a more general observation: in the limit of weak selection, different selection rules often turn out to be equivalent [71,72].
The previous expressions (11) and (13) tell us how often we are to observe a strategy profile (S_i, S̃_j) on average. In many applications, it is also relevant to know how often player 1 adopts a given strategy S_i, irrespective of the co-player's strategy. Based on (11), we can approximate the corresponding marginal probability ξ_i := Σ_{j=1}^{n} u_ij as

\[
\xi_i = \frac{1}{n} + \frac{\beta}{n}\left(\frac{1}{n}\sum_{l}\pi_{il} - \frac{1}{n^2}\sum_{k,l}\pi_{kl}\right) + O(\beta^2).
\tag{14}
\]

Using the notation (12), this expression simplifies to ξ_i = 1/n + β(π_{i•} − π_{••})/n + O(β²). In particular, we can use this formula to predict which of two strategies S_i and S_k is more likely to be played over time. We obtain that ξ_i > ξ_k if and only if π_{i•} > π_{k•} (i.e., if and only if S_i performs better than S_k against a uniform sample of the co-player's strategies). Interestingly, this condition naturally depends on player 1's own payoffs π_pq, but it is independent of the co-player's payoffs π̃_pq. While such a result may appear intuitive, we show that it only holds under weak selection. Once selection becomes stronger, the co-player's payoffs can have a major impact on how likely the focal player is to play a certain strategy (for an explicit example, see appendix B).
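The first-order formulas can be checked against the exact stationary distribution. The sketch below (Python; the 3 × 3 payoff values are hypothetical, chosen so that player 1's row averages are distinct) rebuilds the transition matrix, computes u exactly for a small β, and compares it with the approximation (13); it also confirms that the ranking of the exact marginals ξ_i follows the ranking of the averages π_{i•}:

```python
import numpy as np

def fermi(delta, beta):
    return 1.0 / (1.0 + np.exp(-beta * delta))

def transition_matrix(P1, P2, beta):
    """Transition matrix (3) of introspection dynamics (m, n >= 2)."""
    m, n = P1.shape
    T = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            s = i * n + j
            for k in range(m):
                if k != i:
                    T[s, k * n + j] = 0.5 / (m - 1) * fermi(P1[k, j] - P1[i, j], beta)
            for l in range(n):
                if l != j:
                    T[s, i * n + l] = 0.5 / (n - 1) * fermi(P2[i, l] - P2[i, j], beta)
            T[s, s] = 1.0 - T[s].sum()
    return T

def stationary(T):
    N = T.shape[0]
    return np.ones(N) @ np.linalg.inv(np.eye(N) + np.ones((N, N)) - T)

# Hypothetical 3x3 payoffs with distinct row averages for player 1
P1 = np.array([[2.0, 0.0, 1.0], [1.0, 3.0, 0.0], [0.0, 1.0, 1.0]])
P2 = np.array([[1.0, 2.0, 0.0], [0.0, 1.0, 2.0], [2.0, 0.0, 1.0]])
n, beta = 3, 0.005

u_exact = stationary(transition_matrix(P1, P2, beta)).reshape(n, n)

# Weak-selection approximation (13), using the payoff averages (12)
col1, row1, all1 = P1.mean(axis=0), P1.mean(axis=1), P1.mean()
col2, row2, all2 = P2.mean(axis=0), P2.mean(axis=1), P2.mean()
u_weak = np.empty((n, n))
for i in range(n):
    for j in range(n):
        u_weak[i, j] = 1 / n**2 + beta / (2 * n**2) * (
            (P1[i, j] - col1[j]) + (row1[i] - all1)
            + (P2[i, j] - row2[i]) + (col2[j] - all2))

err = np.abs(u_exact - u_weak).max()
xi = u_exact.sum(axis=1)       # player 1's exact marginal probabilities
print(err, xi)
```

For β of this magnitude, the deviation between exact and approximate distribution is of order β², well below the first-order differences that determine the strategy ranking.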
In this section, we have derived the first-order approximation of the stationary distribution as a function of the selection strength β. Using similar methods, we can recursively compute all higher-order terms in the Taylor expansion of u. The respective expressions are derived in appendix C.

Games with two strategies
We now apply our model of introspection dynamics to general asymmetric 2-player, 2-strategy games. In contrast to the previous section, the results here are valid for any intensity of selection. For m = n = 2, the payoff matrix (1) simplifies to

\[
\begin{pmatrix}
(\pi_{11}, \tilde\pi_{11}) & (\pi_{12}, \tilde\pi_{12}) \\
(\pi_{21}, \tilde\pi_{21}) & (\pi_{22}, \tilde\pi_{22})
\end{pmatrix}.
\tag{15}
\]

Since the transitions in introspection dynamics only depend on payoff differences between strategies, we can further simplify this payoff matrix. Appendix A shows that the transition probabilities in (3) remain unchanged if we add a constant to all payoffs π_ij in a given column j, or to all payoffs π̃_ij in a given row i. We can therefore assume without loss of generality that the payoff matrix (15) takes the form

\[
\begin{pmatrix}
(A, \tilde A) & (0, 0) \\
(0, 0) & (B, \tilde B)
\end{pmatrix},
\tag{16}
\]

where A := π_11 − π_21, Ã := π̃_11 − π̃_12, B := π_22 − π_12, and B̃ := π̃_22 − π̃_21. Thus, the number of free parameters reduces from eight in (15) to four in (16). For general m × n payoff matrices, this approach reduces the number of free parameters from 2mn to 2mn − m − n.
Using the payoff matrix (16), we can write the transition matrix T according to (3) explicitly. If we order the states as (S_1, S̃_1), (S_1, S̃_2), (S_2, S̃_1), (S_2, S̃_2), the transition matrix reads

\[
T = \begin{pmatrix}
\ast & \tfrac{1}{2}\varphi_\beta(-\tilde A) & \tfrac{1}{2}\varphi_\beta(-A) & 0 \\
\tfrac{1}{2}\varphi_\beta(\tilde A) & \ast & 0 & \tfrac{1}{2}\varphi_\beta(B) \\
\tfrac{1}{2}\varphi_\beta(A) & 0 & \ast & \tfrac{1}{2}\varphi_\beta(\tilde B) \\
0 & \tfrac{1}{2}\varphi_\beta(-B) & \tfrac{1}{2}\varphi_\beta(-\tilde B) & \ast
\end{pmatrix},
\]

where each diagonal entry (denoted ∗) is chosen such that the corresponding row sums to one. We then obtain the stationary strategy distribution u = (u_11, u_12, u_21, u_22) by either solving (5) or using (6),

\[
\begin{aligned}
u_{11} &= C^{-1}\Big(\varphi_\beta(-B)\,\varphi_\beta(\tilde A)\big[\varphi_\beta(A)+\varphi_\beta(\tilde B)\big] + \varphi_\beta(A)\,\varphi_\beta(-\tilde B)\big[\varphi_\beta(\tilde A)+\varphi_\beta(B)\big]\Big),\\
u_{12} &= C^{-1}\Big(\varphi_\beta(A)\,\varphi_\beta(-\tilde A)\big[\varphi_\beta(-B)+\varphi_\beta(-\tilde B)\big] + \varphi_\beta(-B)\,\varphi_\beta(\tilde B)\big[\varphi_\beta(-A)+\varphi_\beta(-\tilde A)\big]\Big),\\
u_{21} &= C^{-1}\Big(\varphi_\beta(-A)\,\varphi_\beta(\tilde A)\big[\varphi_\beta(-B)+\varphi_\beta(-\tilde B)\big] + \varphi_\beta(B)\,\varphi_\beta(-\tilde B)\big[\varphi_\beta(-A)+\varphi_\beta(-\tilde A)\big]\Big),\\
u_{22} &= C^{-1}\Big(\varphi_\beta(-A)\,\varphi_\beta(\tilde B)\big[\varphi_\beta(\tilde A)+\varphi_\beta(B)\big] + \varphi_\beta(B)\,\varphi_\beta(-\tilde A)\big[\varphi_\beta(A)+\varphi_\beta(\tilde B)\big]\Big).
\end{aligned}
\tag{17}
\]

Here, C is a normalization constant ensuring that the components of u add up to one. Accordingly, the respective marginal probabilities are ξ_1 = u_11 + u_12 and ξ̃_1 = u_11 + u_21, with the components u_ij taken from (17). The remaining marginal probabilities are ξ_2 = 1 − ξ_1 and ξ̃_2 = 1 − ξ̃_1. We note that, in general, the stationary distribution u does not factorize over these marginal distributions. That is, in general, u_ij ≠ ξ_i · ξ̃_j for i, j ∈ {1, 2} and β > 0. However, one can verify that such a factorization does hold when the game's payoffs satisfy A = −B and Ã = −B̃. These latter two conditions imply that each player's incentive to choose one strategy (rather than the other) is independent of the opponent's strategy. In such a case, this independence is also reflected in the game's stationary distribution.
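A closed-form stationary distribution for the four-state chain can be written as sums of products of Fermi functions, obtained here via the Markov chain tree theorem (a standard device for small chains; the grouping below is this sketch's own). The Python code checks the closed form against a direct numerical solution, and illustrates the factorization when A = −B and Ã = −B̃; all parameter values are arbitrary placeholders:

```python
import numpy as np

def fermi(x, beta):
    return 1.0 / (1.0 + np.exp(-beta * x))

def stationary_2x2(A, At, B, Bt, beta):
    """Stationary distribution (u11, u12, u21, u22) for the reduced payoff
    matrix with parameters A, B (row player) and At, Bt (column player).
    The products of Fermi functions follow from the Markov chain tree theorem."""
    p = lambda x: fermi(x, beta)
    u11 = p(-B) * p(At) * (p(A) + p(Bt)) + p(A) * p(-Bt) * (p(At) + p(B))
    u12 = p(A) * p(-At) * (p(-B) + p(-Bt)) + p(-B) * p(Bt) * (p(-A) + p(-At))
    u21 = p(-A) * p(At) * (p(-B) + p(-Bt)) + p(B) * p(-Bt) * (p(-A) + p(-At))
    u22 = p(-A) * p(Bt) * (p(At) + p(B)) + p(B) * p(-At) * (p(A) + p(Bt))
    u = np.array([u11, u12, u21, u22])
    return u / u.sum()

def stationary_numeric(A, At, B, Bt, beta):
    """Brute-force check: build the 4-state transition matrix and use (6)."""
    P1 = np.array([[A, 0.0], [0.0, B]])
    P2 = np.array([[At, 0.0], [0.0, Bt]])
    T = np.zeros((4, 4))
    for i in range(2):
        for j in range(2):
            s = 2 * i + j
            T[s, 2 * (1 - i) + j] = 0.5 * fermi(P1[1 - i, j] - P1[i, j], beta)
            T[s, 2 * i + (1 - j)] = 0.5 * fermi(P2[i, 1 - j] - P2[i, j], beta)
            T[s, s] = 1.0 - T[s].sum()
    return np.ones(4) @ np.linalg.inv(np.eye(4) + np.ones((4, 4)) - T)

# Arbitrary placeholder parameters
u_closed = stationary_2x2(0.7, -0.3, 1.2, 0.4, beta=2.0)
u_num = stationary_numeric(0.7, -0.3, 1.2, 0.4, beta=2.0)

# Factorization when A = -B and At = -Bt
u_f = stationary_2x2(-0.5, -0.2, 0.5, 0.2, beta=2.0)
xi1, xi1t = u_f[0] + u_f[1], u_f[0] + u_f[2]
print(u_closed, u_num)
```

The closed form and the brute-force solution agree to machine precision, and in the factorizing case the joint distribution is the outer product of the two marginals.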
The previous results hold for arbitrary selection strength. We can also look at the weak-selection approximation. Using (13), we obtain

\[
u = \tfrac{1}{4}\,(1,1,1,1) + \tfrac{\beta}{32}\,\big(3A - B + 3\tilde A - \tilde B,\;\; A - 3B + \tilde B - 3\tilde A,\;\; B - 3A + \tilde A - 3\tilde B,\;\; 3B - A + 3\tilde B - \tilde A\big) + O(\beta^2).
\]

In this case, the resulting marginal probabilities become particularly simple, ξ_1 = 1/2 + β(A − B)/8 + O(β²) and ξ̃_1 = 1/2 + β(Ã − B̃)/8 + O(β²). Player 1 favors strategy S_1 if and only if A > B, and the analogous result holds for player 2. In the following, we apply these formulas to discuss the introspection dynamics of some classical 2-strategy games.

Social dilemmas with two strategies
As a first application of our model, we explore the introspection dynamics of asymmetric social dilemmas. In social dilemmas, players can choose whether to cooperate (C) or to defect (D). When players choose the same action, they prefer mutual cooperation to mutual defection; yet when they choose different actions, a defector gets a higher payoff than a cooperator [17].

Prisoner's dilemma
To start with, we consider the most stringent form of a social dilemma, the prisoner's dilemma [18]. For a simple instantiation of this game, we assume that a cooperating player pays a cost in order for the co-player to get a benefit. We incorporate asymmetry by assuming that the cooperation costs c_i > 0 may differ between players (whereas the benefit b of cooperation is the same for both). Therefore, the payoff matrix takes the form

\[
\begin{pmatrix}
(b - c_1,\; b - c_2) & (-c_1,\; b) \\
(b,\; -c_2) & (0,\; 0)
\end{pmatrix},
\]

where the first row and column correspond to cooperation. The unique Nash equilibrium of this game is for both players to defect, independent of the players' exact costs. For an easier interpretation, however, we assume without loss of generality that cooperation tends to be more costly for the first player, c_1 ≥ c_2.
In the following, we ask how likely players are to cooperate when they update their strategies according to introspection dynamics. To this end, we first consider the case of a fixed benefit of b = 1. Moreover, we assume that the first player faces considerable cooperation costs, c 1 = 0.6, whereas the second player's costs are negligible, c 2 = 0.1. To illustrate the workings of introspection dynamics, we start by simulating the basic process described in section 2. Figure 2(a) shows a representative realization. Over time, the two players independently switch between cooperation and defection. As a result, they experience all possible outcomes: there are times in which both players defect, but also instances in which either one or both players cooperate. Overall, however, mutual defection appears to be most abundant, as one may expect.
To obtain a more quantitative understanding, we compute how likely we are to observe each of the four possible outcomes over time. To this end, we assume that initially both players defect, such that (v_CC(0), v_CD(0), v_DC(0), v_DD(0)) = (0, 0, 0, 1). Then we use (4) to compute v(t) for all future time steps. In figure 2(b), we show the resulting cooperation probability for each player, as defined by ξ_C(t) = v_CC(t) + v_CD(t) for the first player and ξ̃_C(t) = v_CC(t) + v_DC(t) for the second player. As before, we observe that both players are most likely to defect. However, while player 1's cooperation probability remains low (approximately 5%), player 2's cooperation probability quickly reaches a stable value of about 38%.
To further explore this limiting behavior, we compute the game's stationary distribution. Setting the four game parameters in the payoff matrix (16) to A = −c_1, B = c_1, Ã = −c_2, and B̃ = c_2, we obtain

\[
u = \frac{1}{(1 + e^{\beta c_1})(1 + e^{\beta c_2})}\,\big(1,\;\; e^{\beta c_2},\;\; e^{\beta c_1},\;\; e^{\beta(c_1 + c_2)}\big).
\tag{20}
\]

In particular, we note that the invariant distribution does not depend on the benefit of cooperation (since b does not enter the payoff differences A, Ã, B, B̃). Moreover, because β ≥ 0 and c_1 ≥ c_2 > 0, the abundances of the four possible states always obey the same relationship as in figure 2(c),

\[
u_{DD} \;\geq\; u_{DC} \;\geq\; u_{CD} \;\geq\; u_{CC}.
\tag{21}
\]

For β > 0 and c_1 > c_2, all inequalities in (21) are strict. That is, among the two players, the player with the lower cooperation cost can be expected to cooperate more often. However, mutual defection is always the most abundant state.
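The stationary distribution (20) and the resulting marginals are a one-liner to evaluate. The sketch below (Python; β = 5 is an illustrative choice, since the selection strength used in figure 2 is not restated here) reproduces cooperation probabilities close to the values reported above for c_1 = 0.6 and c_2 = 0.1:

```python
import numpy as np

b, c1, c2, beta = 1.0, 0.6, 0.1, 5.0   # beta is an illustrative choice

# Stationary distribution of the prisoner's dilemma; states ordered (CC, CD, DC, DD).
# Note that the benefit b does not appear: only the costs enter.
u = np.array([1.0, np.exp(beta * c2), np.exp(beta * c1), np.exp(beta * (c1 + c2))])
u /= (1.0 + np.exp(beta * c1)) * (1.0 + np.exp(beta * c2))

xi1 = u[0] + u[1]   # cooperation probability of the high-cost player 1
xi2 = u[0] + u[2]   # cooperation probability of the low-cost player 2
print(u, xi1, xi2)
```

With these parameters, xi1 ≈ 0.047 and xi2 ≈ 0.377, in line with the roughly 5% and 38% quoted above; because the distribution factorizes here, each marginal reduces to 1/(1 + e^{βc_i}).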
We can also use the stationary distribution (20) to study how parameter changes affect cooperation. In figure 2(d), we explore the effect of asymmetry. To this end, we increase the difference in the players' costs c 1 − c 2 while keeping the average cost (c 1 + c 2 )/2 fixed. Interestingly, we observe a weakly positive effect. As we increase c 1 and decrease c 2 , the first player's reduced cooperation is more than compensated by the second player.
Similarly, in figure 2(e) we explore the impact of selection strength. When selection is weak, such that payoff differences have a negligible impact, the stationary distribution simplifies to

\[
u = \tfrac{1}{4}\,(1,1,1,1) + \tfrac{\beta}{8}\,\big(-(c_1 + c_2),\;\; -(c_1 - c_2),\;\; c_1 - c_2,\;\; c_1 + c_2\big) + O(\beta^2).
\]

In particular, the last two states (D, C) and (D, D) are favored by selection, whereas the other two states are disfavored.
In contrast, for the strong selection regime, we take the limit of u as β becomes arbitrarily large. We obtain

\[
\lim_{\beta\to\infty} u = (0, 0, 0, 1).
\]
Thus, we recover the classical prediction that both players learn to defect at all times.

Stag-hunt game
As another instance of a social dilemma, we explore a version of the stag-hunt game [73]. In contrast to the prisoner's dilemma, we now assume that players only derive a benefit if they both cooperate (which could reflect the advantage of collective action in hunting expeditions). Under this assumption, the payoff matrix becomes

\[
\begin{pmatrix}
(b - c_1,\; b - c_2) & (-c_1,\; 0) \\
(0,\; -c_2) & (0,\; 0)
\end{pmatrix}.
\tag{24}
\]

As before, we assume the payoff parameters satisfy b > c_1 ≥ c_2 > 0. Under this assumption, the stag-hunt game becomes a coordination game. There are two pure equilibria, according to which either both players cooperate, or both players defect. Mutual cooperation is always payoff-dominant (it is the equilibrium that gives a higher payoff to both players). However, when the benefit of cooperation is comparably small, b < c_1 + c_2, mutual defection is risk-dominant (loosely meaning that mutual defection is the safer option when there is uncertainty about the co-player's actions [74]). We can analyze the introspection dynamics of the stag-hunt game in the same way as the prisoner's dilemma. This time, the parameters of the reduced payoff matrix (16) are A = b − c_1, B = c_1, Ã = b − c_2, and B̃ = c_2. As a result, the stationary distribution according to (17) becomes

\[
u = \frac{1}{e^{\beta b} + e^{\beta c_1} + e^{\beta c_2} + e^{\beta(c_1 + c_2)}}\,\big(e^{\beta b},\;\; e^{\beta c_2},\;\; e^{\beta c_1},\;\; e^{\beta(c_1 + c_2)}\big).
\]
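The stated stationary distribution can be confirmed by solving the four-state chain directly. The following sketch (Python; baseline parameters as in the text, with β = 5 chosen for illustration) builds the stag-hunt transition matrix from the payoff matrix (24) and compares the numerical result with the exponential form above:

```python
import numpy as np

def fermi(d, beta):
    return 1.0 / (1.0 + np.exp(-beta * d))

b, c1, c2, beta = 1.0, 0.6, 0.1, 5.0   # beta chosen for illustration

# Payoff matrices of the stag-hunt game: a benefit accrues only under mutual cooperation
P1 = np.array([[b - c1, -c1], [0.0, 0.0]])
P2 = np.array([[b - c2, 0.0], [-c2, 0.0]])

# Transition matrix of introspection dynamics; states ordered (CC, CD, DC, DD)
T = np.zeros((4, 4))
for i in range(2):
    for j in range(2):
        s = 2 * i + j
        T[s, 2 * (1 - i) + j] = 0.5 * fermi(P1[1 - i, j] - P1[i, j], beta)
        T[s, 2 * i + (1 - j)] = 0.5 * fermi(P2[i, 1 - j] - P2[i, j], beta)
        T[s, s] = 1.0 - T[s].sum()
u_num = np.ones(4) @ np.linalg.inv(np.eye(4) + np.ones((4, 4)) - T)

# Stated closed form: u proportional to (e^{b beta}, e^{c2 beta}, e^{c1 beta}, e^{(c1+c2) beta})
u_formula = np.exp(beta * np.array([b, c2, c1, c1 + c2]))
u_formula /= u_formula.sum()
print(u_num, u_formula)
```

Since b > c_1 + c_2 for these baseline parameters, mutual cooperation comes out as the most abundant state.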
It follows that there are two possible orderings of the abundance of the four states. For β > 0, which ordering applies depends on how the benefit b relates to the total costs of cooperation c_1 + c_2,

\[
u_{CC} > u_{DD} \geq u_{DC} \geq u_{CD} \quad \text{if } b > c_1 + c_2,
\qquad
u_{DD} > u_{CC} > u_{DC} \geq u_{CD} \quad \text{if } b < c_1 + c_2.
\]

Irrespective of the precise ordering, players are always most likely to settle at one of the two equilibrium outcomes. Moreover, introspection dynamics lends further support to the static notion of risk-dominance: players only coordinate on mutual cooperation if the benefit exceeds the sum of the costs. In figure 3, we illustrate this result for our baseline parameters, b = 1, c_1 = 0.6, and c_2 = 0.1. In particular, since c_1 + c_2 < b, mutual cooperation is by far the most abundant outcome.

As in the prisoner's dilemma, we can also use the stationary distribution to discuss the impact of selection strength on cooperation, see figure 3. If selection is weak, we can approximate the stationary distribution by

\[
u = \tfrac{1}{4}\,(1,1,1,1) + \tfrac{\beta}{16}\,\big(3b - 2c_1 - 2c_2,\;\; -(b + 2c_1 - 2c_2),\;\; -(b - 2c_1 + 2c_2),\;\; -(b - 2c_1 - 2c_2)\big) + O(\beta^2).
\]

In particular, the second state (C, D), in which only the high-cost player cooperates, is always disfavored by selection. The other three states may be favored or disfavored, depending on the exact values of the benefit and the costs. For strong selection, we instead obtain

\[
\lim_{\beta\to\infty} u = (1, 0, 0, 0) \ \text{ if } b > c_1 + c_2,
\qquad
\lim_{\beta\to\infty} u = (0, 0, 0, 1) \ \text{ if } b < c_1 + c_2.
\]

That is, in the limit of strong selection, introspection dynamics always selects the risk-dominant equilibrium (even though mutual cooperation is payoff-dominant for all parameter values).

Volunteer's dilemma
As a final example of a social dilemma, we consider the volunteer's dilemma [75]. Here, already one cooperating player is sufficient for both players to get a benefit. The payoff matrix is

\[
\begin{pmatrix}
(b - c_1,\; b - c_2) & (b - c_1,\; b) \\
(b,\; b - c_2) & (0,\; 0)
\end{pmatrix},
\tag{28}
\]

where b > c_1 ≥ c_2 > 0. The game has two pure equilibria in which either only the first player or only the second player cooperates, (C, D) and (D, C). In both cases, the dilemma arises because each player prefers the other player to volunteer as a cooperator. For the volunteer's dilemma, we set A = −c_1, B = c_1 − b, Ã = −c_2, and B̃ = c_2 − b. By (17), the stationary distribution is

\[
u = \frac{1}{e^{\beta b} + e^{\beta(b + c_1)} + e^{\beta(b + c_2)} + e^{\beta(c_1 + c_2)}}\,\big(e^{\beta b},\;\; e^{\beta(b + c_2)},\;\; e^{\beta(b + c_1)},\;\; e^{\beta(c_1 + c_2)}\big).
\]
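The same cross-check works for the volunteer's dilemma. The sketch below (Python; β = 5 for illustration, with a larger β = 25 to probe the strong-selection regime) verifies the stated exponential form and the prediction that, as selection becomes strong, the low-cost player ends up volunteering:

```python
import numpy as np

def fermi(d, beta):
    return 1.0 / (1.0 + np.exp(-beta * d))

def stationary_vd(b, c1, c2, beta):
    """Numerical stationary distribution of the volunteer's dilemma (states CC, CD, DC, DD)."""
    P1 = np.array([[b - c1, b - c1], [b, 0.0]])
    P2 = np.array([[b - c2, b], [b - c2, 0.0]])
    T = np.zeros((4, 4))
    for i in range(2):
        for j in range(2):
            s = 2 * i + j
            T[s, 2 * (1 - i) + j] = 0.5 * fermi(P1[1 - i, j] - P1[i, j], beta)
            T[s, 2 * i + (1 - j)] = 0.5 * fermi(P2[i, 1 - j] - P2[i, j], beta)
            T[s, s] = 1.0 - T[s].sum()
    return np.ones(4) @ np.linalg.inv(np.eye(4) + np.ones((4, 4)) - T)

b, c1, c2, beta = 1.0, 0.6, 0.1, 5.0
u_num = stationary_vd(b, c1, c2, beta)

# Stated closed form: u proportional to (e^{b beta}, e^{(b+c2) beta}, e^{(b+c1) beta}, e^{(c1+c2) beta})
u_formula = np.exp(beta * np.array([b, b + c2, b + c1, c1 + c2]))
u_formula /= u_formula.sum()

# For larger beta, the mass concentrates on (D, C): the low-cost player volunteers
u_strong = stationary_vd(b, c1, c2, beta=25.0)
print(u_num, u_strong[2])
```

Already at β = 5 the state (D, C) is the most abundant one, and at β = 25 it absorbs virtually all of the probability mass.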
The impact of selection strength can be discussed analogously to the previous cases (figure 3). When selection is weak, the stationary distribution simplifies to

\[
u = \tfrac{1}{4}\,(1,1,1,1) + \tfrac{\beta}{16}\,\big(b - 2c_1 - 2c_2,\;\; b - 2c_1 + 2c_2,\;\; b + 2c_1 - 2c_2,\;\; -(3b - 2c_1 - 2c_2)\big) + O(\beta^2).
\]

Here, the third state (D, C) is always favored by selection. The other three states may be favored or disfavored, depending on the magnitudes of b, c_1, and c_2. For strong selection and c_1 > c_2,

\[
\lim_{\beta\to\infty} u = (0, 0, 1, 0).
\]

That is, the low-cost player volunteers with certainty. We briefly summarize the key results of this section in table 1. For each of the three social dilemmas we considered, we describe (i) the respective game parameters, (ii) the resulting stationary distribution, (iii) the respective weak-selection limit, (iv) the strong-selection limit, and (v) the game's Nash equilibria.

Comparing introspection and imitation in symmetric games
While the previous section focused on asymmetric social dilemmas, introspection dynamics is equally applicable to symmetric games. In the special case that the game has two strategies only, the assumption of symmetry implies A = Ã and B = B̃, and the payoff matrix (16) simplifies to

\[
\begin{pmatrix}
(A, A) & (0, 0) \\
(0, 0) & (B, B)
\end{pmatrix}.
\tag{33}
\]

Similarly, the stationary distribution u = (u_CC, u_CD, u_DC, u_DD) reduces to

\[
u = \frac{1}{2 + e^{A\beta} + e^{B\beta}}\,\big(e^{A\beta},\;\; 1,\;\; 1,\;\; e^{B\beta}\big).
\tag{34}
\]
In particular, we can immediately see that u_CD = u_DC for any A and B, as one would expect from a symmetric game. Moreover, the average cooperation probability of player 1 (and therefore also of player 2) becomes

\[
\xi_C = u_{CC} + u_{CD} = \frac{1 + e^{A\beta}}{2 + e^{A\beta} + e^{B\beta}}.
\tag{35}
\]
Figure 4. For each parameter combination, we measure the average cooperation rate if players either adopt strategies according to (a) introspection dynamics or (b) pairwise imitation. We observe that the two dynamics generally yield similar results, unless the interaction takes the form of a snowdrift game. Here, introspection dynamics typically leads to a cooperation rate of approximately 50%, irrespective of the exact game parameters. In contrast, imitation dynamics depends more gradually on the game parameters. For this figure, we use a selection strength of β = 8 in each case. For the pairwise imitation dynamics, we additionally need to specify the population size (Z = 50) and the mutation rate (μ = 0.05). Moreover, the shown results for imitation dynamics assume well-mixed populations; structured populations tend to yield a different dynamics [51,77].

For a given selection strength, this formula has only two free parameters. Hence, we can use this formula to explore the expected cooperation rate across all symmetric 2 × 2 games, by simultaneously varying both A and B, as in figure 4(a). Depending on the signs of A and B, we recover the four classical symmetric games [76]: the prisoner's dilemma (A < 0 and B > 0), the stag-hunt game (A > 0 and B > 0), the snowdrift game (A < 0 and B < 0), and the harmony game (A > 0 and B < 0). For each possible combination of (A, B), we plot the resulting cooperation probability according to (35). We observe that there are three qualitative regions: (i) for B > 0 and A < B, defection is either a dominant strategy, or it is risk-dominant.
In this parameter region, we, therefore, observe comparably little cooperation. (ii) Conversely, for A > 0 and B < A, cooperation is either dominant or risk-dominant. As a consequence, introspection dynamics leads to almost full cooperation. (iii) If both A < 0 and B < 0 (i.e., in the snowdrift game), players have an incentive to choose the opposite strategy of the opponent. In this parameter region, we, therefore, observe an approximately equal share of cooperators and defectors. For symmetric games, we can compare introspection dynamics to the classical pairwise imitation rule [24], see figure 4(b) (for details of the implemented imitation model, see appendix D). In most parameter regions, the corresponding results are strikingly similar. Only in the snowdrift game regime, imitation leads to a more gradual change from full cooperation (when B < 0 and A ≈ 0) to full defection (when A < 0 and B ≈ 0). To explain this difference, we note that the imitation process takes place in an entire population of players. As a result, individuals do not adapt their strategy to a specific opponent, but rather to the population average. The resulting imitation dynamics can lead to a mixed equilibrium in which cooperators and defectors coexist. The exact position of this equilibrium changes gradually in the game parameters, as observed in figure 4(b). Overall, these results suggest that the two dynamics are comparable when the game tends to converge toward a homogeneous population. However, if the game has at least one stable mixed equilibrium (as in the snowdrift game), the predictions of the two models may diverge.
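The cooperation formula (35) can be verified against the full four-state chain. The sketch below (Python; snowdrift-like parameters A < 0 and B < 0, chosen for illustration) computes the stationary distribution of the symmetric game and compares the resulting cooperation rate with (35):

```python
import numpy as np

def fermi(d, beta):
    return 1.0 / (1.0 + np.exp(-beta * d))

def xi_C(A, B, beta):
    """Cooperation probability (35) in a symmetric 2x2 game."""
    return (1.0 + np.exp(A * beta)) / (2.0 + np.exp(A * beta) + np.exp(B * beta))

A, B, beta = -0.4, -0.7, 8.0         # snowdrift region: A < 0 and B < 0

# Reduced symmetric payoff matrix: both players face the same parameters A and B
P1 = np.array([[A, 0.0], [0.0, B]])
P2 = P1.copy()                        # symmetry: the column player's payoffs coincide
T = np.zeros((4, 4))
for i in range(2):
    for j in range(2):
        s = 2 * i + j
        T[s, 2 * (1 - i) + j] = 0.5 * fermi(P1[1 - i, j] - P1[i, j], beta)
        T[s, 2 * i + (1 - j)] = 0.5 * fermi(P2[i, 1 - j] - P2[i, j], beta)
        T[s, s] = 1.0 - T[s].sum()
u = np.ones(4) @ np.linalg.inv(np.eye(4) + np.ones((4, 4)) - T)
print(u, u[0] + u[1], xi_C(A, B, beta))
```

For these snowdrift-like parameters, the cooperation rate comes out near 50%, consistent with the qualitative picture described above; the mixed states (C, D) and (D, C) carry equal weight, as symmetry requires.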

The volunteer's timing dilemma
In the previous examples, we have considered social dilemmas with two strategies only. In this section, we illustrate how introspection dynamics can be applied to a game with arbitrarily many strategies. To this end, we take the volunteer's dilemma and turn it into a timing dilemma. Here, players no longer only determine whether or not to volunteer. Instead, the game takes place over time, and players determine how long they wait for the co-player to volunteer before they volunteer themselves. This kind of game was first studied by Weesie [78] as a model to explore the emergence of 'wait and see' behaviors in the context of voluntary action. Here, we explore the game's introspection dynamics.
To formalize the game, we assume that within a given time interval [t_0, t_max] := [0, 1], players need to decide whether to volunteer. For simplicity, we assume that time is discretized, such that there are n + 1 evenly spaced time points {t_0, t_1, ..., t_n} = {0, 1/n, ..., 1} at which players may volunteer. As before, we assume that players may have different costs to volunteer and that player 1's cost tends to be larger, c_1 ≥ c_2. If at least one player volunteers during the time interval, both players derive some benefit. However, to add a component of time pressure, we assume that the benefit of cooperation decays linearly in time. That is, if one of the players cooperates immediately at time t_0, both players get a benefit of b > c_i. However, if the first player to volunteer only cooperates at time t_max, the resulting benefit is zero.
A strategy for the volunteer's timing dilemma is now a rule that tells the player at which point to volunteer. To this end, we associate each strategy S_i ∈ {S_0, S_1, . . . , S_n} with a waiting time t_i ∈ {t_0, t_1, . . . , t_n}. For i < n, a player with strategy S_i volunteers at time t_i, unless the co-player has already volunteered earlier (in which case the focal player's cooperation is no longer required). For i = n, we associate the respective strategy S_n with not cooperating at all. If player 1 and player 2 adopt the strategies S_i and S_j, respectively, the resulting payoffs are

π¹_ij = (1 − t_min(i,j)) b − c_1 if i ≤ j and i < n,   π¹_ij = (1 − t_j) b if j < i,   π¹_ij = 0 if i = j = n,

with the analogous expression for π²_ij, in which c_2 is paid whenever j ≤ i and j < n (on a tie, both players volunteer and both pay their cost). Equivalently, the game can be represented by the payoff matrix (37). Note that for n = 1, we recover the payoffs of the original volunteer's dilemma.

[Figure 5. (a) Stationary distribution of the volunteer's timing dilemma. We assume that the benefit of immediate cooperation is b = 1, whereas the players' costs of volunteering are c_1 = 0.6 and c_2 = 0.1, respectively. Using a strength of selection of β = 10, we compute the stationary distribution with (6). We find that the most likely outcome is that the low-cost player cooperates without delay, whereas the high-cost player waits as long as possible. (b) Maintaining n = 4 and b = 1, we compute the average volunteer time for varying cost difference (c_1 − c_2) and intensity of selection (β = 5, 10 and 50). The average cost is kept constant, (c_1 + c_2)/2 = 0.5. We verify that for an increasing cost asymmetry, the time to act decreases; the impact is stronger for high intensity of selection. (c) Finally, we have computed the stationary distribution for varying discretizations of time (dots with solid lines), the other parameters being the same as in (a). In addition, we have simulated the basic process described in section 2, for which alternative strategies are uniformly drawn from the entire interval [0, 1] (dashed lines). We observe that for large n the numerically computed cooperation probabilities approach the time average of the simulations.]

We explore the introspection dynamics of the volunteer's timing dilemma numerically, by computing the stationary distribution with (6). To this end, we first consider a case in which n = 4, such that the players' possible waiting times are t ∈ {0, 1/4, 1/2, 3/4, 1}. Moreover, we consider a normalized benefit of b = 1, and the players' cooperation costs are c_1 = 0.7 and c_2 = 0.3, respectively. The resulting stationary distribution is displayed in figure 5(a). As one may expect, we observe that the low-cost player is more cooperative; more surprisingly, this player typically cooperates without any delay.
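The payoff construction just described can be assembled mechanically. The sketch below builds both players' payoff matrices from the rules above, under the tie convention that two simultaneous volunteers both pay their cost; the function name and parameter names are ours.

```python
import numpy as np

def timing_payoffs(n, b, c1, c2):
    """Payoff matrices of the volunteer's timing dilemma with n + 1 strategies.

    Strategy S_i means: volunteer at time t_i = i / n unless the co-player
    volunteered strictly earlier; S_n means never volunteering. The benefit
    decays linearly, so a first volunteer at time t yields b * (1 - t) for
    both players, and the volunteer additionally pays their own cost.
    Assumes that on a tie both players volunteer and both pay.
    """
    P1 = np.zeros((n + 1, n + 1))
    P2 = np.zeros((n + 1, n + 1))
    for i in range(n + 1):
        for j in range(n + 1):
            first = min(i, j)                           # time index of the first volunteer
            benefit = b * (1.0 - first / n) if first < n else 0.0
            P1[i, j] = benefit - (c1 if i <= j and i < n else 0.0)
            P2[i, j] = benefit - (c2 if j <= i and j < n else 0.0)
    return P1, P2
```

For n = 1 this recovers the ordinary volunteer's dilemma: volunteering yields b − c_i regardless of the co-player, while defecting yields b against a volunteer and 0 otherwise.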
A natural question, then, is whether the cost asymmetry helps to solve the timing dilemma. In figure 5(b), we show that the higher the difference in costs, the faster one of the players volunteers. This positive effect of asymmetry is particularly pronounced when selection is strong. To explore how these results depend on our discretization of the time interval, we have repeated the above analysis for different n ∈ {1, . . . , 15}. In addition, we have run simulations in which players are able to volunteer at any time in [0, 1]. The results of this analysis are displayed in figure 5(c). Interestingly, the low-cost player is most likely to volunteer when n = 1 (the original volunteer's dilemma), in which case a 'wait and see' approach is ruled out by the design of the game. However, across all values of n considered, the low-cost player always cooperates with a probability of at least 80%, implying that this player remains the most reliable volunteer.

Discussion and conclusion
Herein, we present a simple model of learning in social interactions. The model considers individuals who can choose among several strategies using introspection, that is, by reasoning about their strategies' prospective consequences. Compared to imitation models [24], this approach has the advantage that it can be applied to symmetric and asymmetric games alike. As another advantage, we can derive explicit formulas that describe how the system evolves in time, and what the long-run abundance of each strategy is. These formulas become particularly simple when players can only choose among two strategies, or when selection is weak [79][80][81].
Mathematically, the model takes the form of a Markov chain, whose states are the possible combinations of the strategies of the two players. In particular, if the players can choose among m and n strategies, respectively, there are mn states, arguably the minimum number of states that any learning model for such asymmetric games must have. While a similar Markov approach can also be used to analyze imitation processes in populations of players, the two approaches differ in their computational complexity. Population models need to record how many players apply any given strategy at any point in time. As a result, if a symmetric game with n strategies is played in a population of size Z, there are C(Z + n − 1, n − 1) possible states [82]. Since this number of states increases exponentially in n, numerically exact results are only feasible in games with a few strategies [83], or when mutations are rare [84][85][86][87]. In contrast, the computational complexity of introspection dynamics only depends polynomially on the number of strategies. As a result, the stationary distribution can be easily computed even in complex games with many possible moves [67].
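The two state counts can be compared directly; a small sketch (the helper names are ours):

```python
from math import comb

def population_states(Z, n):
    """States of an imitation process in a population of size Z with n
    strategies: one state per way of splitting Z players among n strategies."""
    return comb(Z + n - 1, n - 1)

def introspection_states(m, n):
    """Introspection dynamics only tracks the two players' current strategies."""
    return m * n

# Even a modest setting illustrates the gap between the two approaches.
print(population_states(100, 10))    # combinatorially large
print(introspection_states(10, 10))  # just 100
```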
Despite these differences in terms of computational complexity, the results of introspection dynamics are often in remarkable agreement with other evolutionary processes. For example, in the limit of weak selection, the introspection dynamics of two adapting learners becomes equivalent to a birth-death model of two co-evolving populations [60] (section 3.2). Similarly, across all symmetric 2 × 2 games, we find that introspection dynamics often recovers the results of pairwise imitation [24] (figure 4). The only notable exception occurs for the snowdrift game (as in [10]). In the snowdrift game, pairwise imitation typically selects for the mixed symmetric equilibrium, in which cooperators and defectors coexist. In contrast, introspection dynamics selects pure but asymmetric equilibria in which one player cooperates and the other defects.
To illustrate our analytical results, we apply our framework to a number of asymmetric games. As one particular example, we study the dynamics of the volunteer's timing dilemma [78]. In this game, one player is required to volunteer as quickly as possible to create a benefit for the whole group; yet each player may be tempted to wait, hoping the other player would give in first. When players differ in their costs to volunteer, we observe not only that the player with the lower cost is more likely to volunteer; we also note that this player usually volunteers without any delay (figure 5). This game thus illustrates how players may sometimes benefit from asymmetry because it helps them to coordinate more efficiently. Herein, we have studied this advantage of asymmetry by assuming that players have different costs. However, similar results could be obtained for other sources of asymmetry, such as when people differ in their endowments [43,44], their productivities [54], or their strategic options more generally [41].
Here, we consider a comparatively simple setup of introspective learning: there are two individuals who continually interact with each other. However, one could also imagine an entire population of introspective learners who interact with one another in such pairwise games. The results of such a process are likely to depend on the population's network topology [51,77,[88][89][90]. Alternatively, one may also imagine that introspective learners engage in interactions that involve more than two players at a time [91,92]. The introspection dynamics of such multiplayer games can easily be studied with simulations [54], similar to the 2-player games considered in this paper. Especially for asymmetric games with many unequal players, introspection dynamics can serve as a simple model to study the resulting learning processes.

Data availability statement
The data that support the findings of this study are available upon reasonable request from the authors.

Appendix A. General properties of introspection dynamics
Here, we derive some properties of the transition matrix T defined in (3) in the main text and its stationary distribution u.

Proposition 1 (Properties of the transition matrix T).
(a) For a fixed j, T is unchanged by adding an arbitrary constant d_j to all payoffs π_ij of player 1. (b) For a fixed i, T is unchanged by adding an arbitrary constant d̃_i to all payoffs π̃_ij of player 2. (c) If β is finite, T is primitive.
Proof. (a) and (b) follow because the entries of T defined in (3) depend on the payoffs only through the differences π_kj − π_ij and π̃_il − π̃_ij, which are unchanged by such shifts. (c) T is primitive if there is a ∈ ℕ such that T^a is positive. Let a = 2. For arbitrary i, j, k, l, we have (T²)_{ij,kl} ≥ T_{ij,kj} · T_{kj,kl}, since the right-hand side corresponds to one particular two-step path from (S_i, S_j) to (S_k, S_l). For finite β, the entries of T as defined in (3) in the main text imply that both factors on the right-hand side of the above inequality are positive. Therefore, T² is positive and T is primitive under finite selection intensity.
The first two properties described in proposition 1 are useful as they allow us to simplify the payoff matrices we need to consider (as illustrated in section 3.3). The last property helps us to analyze the long-term abundances of each strategy. Because T is primitive for finite β, the Perron-Frobenius theorem implies that the likelihood u_ij to observe the two players in a given state (S_i, S_j) converges as a function of time. Moreover, this likelihood can be determined by finding the unique solution of (5) in the main text,

u T = u,   u e^⊤ = 1,   (A1)

where e denotes the row vector of ones. While (A1) provides an implicit characterization of the stationary distribution u, we can also provide an explicit representation.

Proposition 2 (An explicit representation of the stationary distribution).
For p ∈ ℕ, let T be a row-stochastic and primitive p × p matrix, let I denote the p × p identity matrix, and let U = e^⊤ e denote the p × p matrix whose entries are all equal to 1, where e is the 1 × p row vector of ones. Then (I + U − T) is invertible, and the unique solution of (A1) is given by u = e (I + U − T)^{−1}.
Proof. One proof of this result can be found in [70], where (I + U − T)^{−1} is shown to be a fundamental matrix of the ergodic chain. In the following, we provide an independent proof. Suppose u satisfies (A1). If we multiply the second equation in (A1) by e from the right, and subtract the result from the first equation, we obtain, after rearranging some of the terms,

u (I + U − T) = e.

Thus, for the proposition to hold, we only need to verify that (I + U − T) is invertible. To this end, let λ_1, . . . , λ_p be the eigenvalues of T and w_1, . . . , w_p the corresponding right eigenvectors. Since T is row-stochastic, it has an eigenvalue λ_1 = 1 with corresponding right eigenvector w_1 = (1, 1, . . . , 1)^⊤. By T's primitivity, it follows that I − T has a unique eigenvalue equal to 0 with corresponding right eigenvector w_1, while 1 − λ_k ≠ 0 for k = 2, . . . , p. Because U = w_1 w_1^⊤, it follows by Brauer's theorem (example 1.2.8, p 51 of [93]) that the eigenvalues of (U + I − T) are w_1^⊤ w_1, 1 − λ_2, . . . , 1 − λ_p. That is, in general, the eigenvalues are the same as for the matrix I − T; only the eigenvalue corresponding to w_1 gets replaced by w_1^⊤ w_1. Since w_1^⊤ w_1 = p > 0 and 1 − λ_k ≠ 0 for k = 2, . . . , p, the matrix (U + I − T) has no eigenvalue equal to 0. Therefore, it is invertible.
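Proposition 2 is straightforward to probe numerically: for a random row-stochastic matrix with strictly positive entries (hence primitive), the row vector e (I + U − T)^{−1} should be stationary, normalized, and positive. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

p = 6
T = rng.random((p, p)) + 0.01        # positive entries => primitive
T /= T.sum(axis=1, keepdims=True)    # make the rows sum to one

I = np.eye(p)
U = np.ones((p, p))                  # U = e^T e, the all-ones matrix
e = np.ones(p)                       # row vector of ones

u = e @ np.linalg.inv(I + U - T)     # proposition 2
```

One can check that u indeed solves (A1).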
While the stationary distribution u is well-defined for any finite β, in section 4.1 we also study which strategies are played as β approaches infinity. To this end, we first compute an expression for u(β) that is valid for finite β. Thereafter, we take the limit lim_{β→∞} u(β) in ℝ^{mn}. For all considered 2 × 2 games, this yields a unique prediction of the strong selection limit, even for those games for which the respective limiting transition matrix lim_{β→∞} T(β) allows for several absorbing states. One example is the stag-hunt game (see table 1). Here, introspection dynamics predicts that if (C, C) is risk-dominant, players coordinate on mutual cooperation. Similarly, if (D, D) is risk-dominant, players coordinate on mutual defection. In comparison, according to the limiting transition matrix lim_{β→∞} T(β), both (C, C) and (D, D) are absorbing, irrespective of which one is risk-dominant. These results suggest that the strong selection limit of introspection dynamics can serve as an equilibrium selection device for 2 × 2 games, similar to other evolutionary dynamics [94,95].
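This equilibrium-selection effect is easy to see numerically. The sketch below builds the introspection transition matrix for a symmetric 2 × 2 stag-hunt (the specific payoff values are our own choice; (C, C) is risk-dominant since R + S > T + P), assuming the update scheme of section 2 with one uniformly chosen player revising per step, and checks that for strong selection almost all stationary mass sits on mutual cooperation:

```python
import numpy as np

def fermi(beta, d):
    return 1.0 / (1.0 + np.exp(-beta * d))

def stationary_2x2(P, beta):
    """Stationary distribution of introspection dynamics for a symmetric
    2x2 game with row-player payoffs P; states ordered CC, CD, DC, DD.
    Player 2's payoff in state (i, j) is P[j, i]."""
    T = np.zeros((4, 4))
    for s, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
        # player 1 compares its strategy i with the alternative 1 - i
        T[s, (1 - i) * 2 + j] = 0.5 * fermi(beta, P[1 - i, j] - P[i, j])
        # player 2 compares its strategy j with the alternative 1 - j
        T[s, i * 2 + (1 - j)] = 0.5 * fermi(beta, P[1 - j, i] - P[j, i])
        T[s, s] = 1.0 - T[s].sum()
    return np.ones(4) @ np.linalg.inv(np.eye(4) + np.ones((4, 4)) - T)

# Stag hunt: R = 5, S = 0, T = 3, P = 1, so both (C, C) and (D, D) are
# strict equilibria, but (C, C) is risk-dominant (5 + 0 > 3 + 1).
u = stationary_2x2(np.array([[5.0, 0.0], [3.0, 1.0]]), 20.0)
```

Although both (C, C) and (D, D) are absorbing in the β → ∞ limit, at β = 20 the distribution is already concentrated on the risk-dominant equilibrium.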

Appendix B. Introspection dynamics under weak selection
Here, we derive an explicit expression for the linear approximation of the stationary distribution when selection is weak, u ≈ u_0 + u_1 β. As we show in the main text, in (9) and (10), the constant term u_0 of the stationary distribution obeys

u_0 = u_0 T_0,   u_0 e^⊤ = 1,   (B1)

whereas the linear term u_1 satisfies

u_1 = u_0 T_1 + u_1 T_0,   u_1 e^⊤ = 0.   (B2)

Both systems can be solved explicitly. For that, we first compute T_0 := T|_{β=0} and T_1 := ∂T/∂β|_{β=0}, which are the constant and the linear term of the transition matrix. Based on (3), we obtain the explicit entries of T_0 and T_1 given in (B3). In the special case that both players have the same number of strategies, the stationary distribution takes a particularly simple form, as the following result shows.
Proposition 3 (Stationary distribution for weak selection). Suppose both players have the same number of strategies, m = n. Then (a) the constant term of the stationary distribution is uniform,

u_0 = (1/(mn)) e,   (B4)

and (b) the linear term is uniquely given by u_1 = u_0 T_1 (I + U − T_0)^{−1}.

Proof. (a) To prove the first part, we note that T_0 is symmetric; indeed, swapping i with k and j with l in the expression for T_0 in (B3) shows that (T_0)_{ij,kl} = (T_0)_{kl,ij}. The stationary distribution of a Markov chain with symmetric transition matrix is uniform [70]. Since T is of size mn, the result in (B4) immediately follows.
(b) To show that any solution to (B2) must be unique, we multiply the second equation in (B2) by the row vector e from the right, and add the result to the first equation. This yields

u_1 (I + U − T_0) = u_0 T_1,

where again U = e^⊤ e. By proposition 2, the matrix (I + U − T_0) is invertible, and hence any solution to (B2) is uniquely determined by u_1 = u_0 T_1 (I + U − T_0)^{−1}.
Next, we derive (14) in the main text. For a given stationary distribution u, we define the respective marginal distributions by ξ_i := Σ_{j=1}^n u_ij and ξ̃_j := Σ_{i=1}^m u_ij. Using proposition 3, we can derive the following weak selection formulas for ξ_i and ξ̃_j when the two players have the same number of strategies (m = n).

Corollary 1 (Marginal distributions for weak selection).
For m = n and small β, the abundances of the strategies S_i and S̃_j are

ξ_i = 1/n + (β/n²) [ Σ_{q=1}^n π_iq − (1/n) Σ_{p=1}^n Σ_{q=1}^n π_pq ] + O(β²),   (B13)

ξ̃_j = 1/n + (β/n²) [ Σ_{p=1}^n π̃_pj − (1/n) Σ_{p=1}^n Σ_{q=1}^n π̃_pq ] + O(β²).   (B14)

Proof. To obtain the first equation, we use the weak selection formula in proposition 3 and sum over all of the co-player's strategies S̃_j. The analogous formula for ξ̃_j follows by symmetry.
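The first-order prediction of corollary 1 can be probed numerically. The sketch below (our reading of (B13), under the assumption that one uniformly chosen player revises per step as in section 2; all function names are ours) compares the numerical slope of ξ_1 in β with the predicted coefficient, which should be independent of the co-player's payoffs:

```python
import numpy as np

def fermi(beta, d):
    return 1.0 / (1.0 + np.exp(-beta * d))

def stationary_intro(P1, P2, beta):
    """Introspection dynamics for a 2x2 asymmetric game; states ordered
    (S_1, S~_1), (S_1, S~_2), (S_2, S~_1), (S_2, S~_2)."""
    T = np.zeros((4, 4))
    for s, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
        T[s, (1 - i) * 2 + j] = 0.5 * fermi(beta, P1[1 - i, j] - P1[i, j])
        T[s, i * 2 + (1 - j)] = 0.5 * fermi(beta, P2[i, 1 - j] - P2[i, j])
        T[s, s] = 1.0 - T[s].sum()
    return np.ones(4) @ np.linalg.inv(np.eye(4) + np.ones((4, 4)) - T)

P1 = np.array([[2.0, 1.0], [0.0, 0.0]])  # row player's payoffs
P2 = np.array([[1.0, 0.0], [3.0, 2.0]])  # column player's payoffs (arbitrary)
beta = 0.01

u = stationary_intro(P1, P2, beta)
xi1 = u[0] + u[1]                              # marginal abundance of S_1
slope = (xi1 - 0.5) / beta                     # numerical first-order coefficient
predicted = (P1[0].sum() - P1.sum() / 2) / 4   # (B13) with n = 2
```

With these payoffs the predicted coefficient is 3/8, matching the example discussed below; replacing P2 by any other matrix leaves the slope unchanged to first order.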
Three remarks are in order.
(a) Independence of the co-player's payoffs. According to corollary 1, the abundance of a player's strategy in the limit of weak selection only depends on that player's payoffs. For instance, in (B13), the abundance ξ_i depends on the first player's payoffs π_pq, but it is independent of the second player's payoffs π̃_pq. While such a result may appear intuitive, it is important to note that it only holds in the limit of weak selection. As an example, consider the game with the 2 × 2 payoff matrix given in (B15). For this game, (B13) implies that for sufficiently small β, the marginal abundance of S_1 is approximately given by ξ_1 = 1/2 + 3β/8, irrespective of the value of x. However, in the limit of strong selection, β → ∞, one can show that ξ_1 = 1 if x is positive, but ξ_1 = 0 if x is negative. Therefore, the co-player's payoffs do in general affect how likely a player is to adopt a certain strategy. Only when selection is weak (such that each co-player strategy is played with approximately equal frequency) does this dependency disappear.

(b) Comparing the abundance of different strategies. Corollary 1 also allows us to rank the different strategies of a player according to how often they are played in the stationary distribution. To this end, we say strategy S_i is favored over S_k, and we write S_i ≻ S_k, if ξ_i > ξ_k. An analogous notation can be defined for the column player. By corollary 1, we find that when selection is weak, S_i ≻ S_k if and only if Σ_{q=1}^n π_iq > Σ_{q=1}^n π_kq. Again, the respective condition only depends on the focal player's payoffs π_pq and not on the co-player's payoffs π̃_pq. However, as before, this independence vanishes for stronger selection (which can be shown with the same example (B15)).

(c) Comparative statics. Finally, we can use corollary 1 to compute how changes in the players' payoffs affect the resulting strategy abundances in the limit of weak selection.
By (B13), we note that ∂ξ_i/∂π_pq is positive if p = i, while it is negative if p ≠ i. Hence, increasing any one of the payoffs π_iq has a positive effect on the long-run abundance of strategy S_i, whereas increasing any one of the other payoffs π_pq with p ≠ i has a negative effect.

[Table residue; rows n = 1, . . . , 5, columns k = 0, . . . , 5:

n\k  0   1   2   3   4   5
1    1   0   0   0   0   0
2    1   1   0   0   0   0
3    1   4   1   0   0   0
4    1   11  11  1   0   0
5    1   26  66  26  1   0]

For introspection dynamics, the first factor on the right-hand side can be further simplified. We note that T_i is the zero matrix for all even i ≥ 2, as the following result shows. (C8)

Appendix D. Pairwise imitation dynamics
Pairwise imitation dynamics is a frequency-dependent update rule where players can imitate other players' strategies [24]. The dynamics takes place in an entire population of size Z. For symmetric games with two strategies, as considered in section 4.2, the state of the population can be defined by the number of individuals i who use strategy C (the number of individuals who use strategy D is then Z − i). Given the current population state i, one can calculate the expected payoffs of cooperators and defectors, respectively, as

π_C(i) = [(i − 1) π_CC + (Z − i) π_CD] / (Z − 1),
π_D(i) = [i π_DC + (Z − i − 1) π_DD] / (Z − 1),

where π_XY denotes the payoff of an X-player matched with a Y-player, and self-interactions are excluded. Similar to introspection dynamics, pairwise imitation dynamics assumes that at regular time intervals, a randomly chosen player is given the opportunity to revise their strategy. With probability μ (the mutation rate), this player simply adopts a randomly chosen strategy. With probability 1 − μ, the player instead picks a randomly chosen role model from the population. If the focal player's payoff is π and the role model's payoff is π̃, the focal player adopts the role model's strategy with probability ϕ_β(π̃ − π), where ϕ_β is again the Fermi function defined by (2).
This process can also be described by a Markov chain. Given that the current population state is i, the probability that a D-player changes to C in the next time step is

T_i^+ = [(Z − i)/Z] · [μ/2 + (1 − μ) · (i/(Z − 1)) · ϕ_β(π_C(i) − π_D(i))].

Similarly, the probability that a C-player changes to D is

T_i^− = (i/Z) · [μ/2 + (1 − μ) · ((Z − i)/(Z − 1)) · ϕ_β(π_D(i) − π_C(i))].

Since only one player is allowed to update at each time step, all transitions that change the number of cooperators by more than one have probability zero. Finally, for the normalization condition to hold, the probability of remaining in the same state is 1 − T_i^+ − T_i^−. As before, we can characterize the long-run dynamics of this process by computing the stationary distribution u = (u_0, u_1, . . . , u_Z). In this case, however, u_i denotes the prevalence of a population state with i cooperators and Z − i defectors.
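For reference, the chain just described can be set up as follows; this is a sketch of the standard construction, with payoff entries written as π_CC = R, π_CD = S, π_DC = T, π_DD = P, and all names are ours:

```python
import numpy as np

def fermi(beta, d):
    return 1.0 / (1.0 + np.exp(-beta * d))

def imitation_chain(R, S, Tp, P, Z, beta, mu):
    """Tridiagonal transition matrix of pairwise imitation with mutation.

    State i is the number of C-players; R, S, Tp, P are the payoffs of
    C vs C, C vs D, D vs C and D vs D (self-interactions excluded)."""
    M = np.zeros((Z + 1, Z + 1))
    for i in range(Z + 1):
        piC = ((i - 1) * R + (Z - i) * S) / (Z - 1) if i > 0 else 0.0
        piD = (i * Tp + (Z - i - 1) * P) / (Z - 1) if i < Z else 0.0
        up = (Z - i) / Z * (mu / 2 + (1 - mu) * i / (Z - 1) * fermi(beta, piC - piD))
        down = i / Z * (mu / 2 + (1 - mu) * (Z - i) / (Z - 1) * fermi(beta, piD - piC))
        if i < Z:
            M[i, i + 1] = up
        if i > 0:
            M[i, i - 1] = down
        M[i, i] = 1.0 - up - down
    return M

# Prisoner's dilemma (T > R > P > S): cooperation should be rare.
Z, beta, mu = 50, 1.0, 0.01
M = imitation_chain(3.0, 0.0, 4.0, 1.0, Z, beta, mu)
u = np.ones(Z + 1) @ np.linalg.inv(np.eye(Z + 1) + np.ones((Z + 1, Z + 1)) - M)
avg_coop = sum(i / Z * u[i] for i in range(Z + 1))
```

The stationary distribution is again obtained via the explicit formula of proposition 2 (the mutation rate μ > 0 keeps the chain ergodic), and the average fraction of cooperation stays close to zero, as expected for a prisoner's dilemma under imitation.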
To construct a metric that is comparable to the one used for introspection dynamics, we compute the average fraction of cooperation, ⟨x_C⟩ = Σ_{i=0}^Z (i/Z) u_i, which we plot in figure 4(b). Additionally, one can also obtain the average probability of drawing CC, CD, DC, and DD pairs (without replacement) from the population,

p_CC = Σ_{i=0}^Z u_i · i(i − 1) / [Z(Z − 1)],
p_CD = p_DC = Σ_{i=0}^Z u_i · i(Z − i) / [Z(Z − 1)],
p_DD = Σ_{i=0}^Z u_i · (Z − i)(Z − i − 1) / [Z(Z − 1)].

Again, p_CD = p_DC because players are indistinguishable in the symmetric case. These quantities are comparable to the state frequencies of our stationary distribution, u = (u_CC, u_CD, u_DC, u_DD).