Emergence of specialized third-party enforcement

Significance. In a social dilemma, total resources are maximized if everyone cooperates, but each individual is tempted not to cooperate. Cooperation may be impossible if the parties interact infrequently or do not have sufficient information about each other's past behavior. In such cases, human cooperation is often maintained by third parties who specialize in enforcement (e.g., a police force). However, if enforcers are powerful enough to punish noncooperators, they may also be powerful enough to extract resources from their clients without providing any services in return. We show that by using a reputation system, enforcers can police each other and sanction those who fail to punish noncooperation in the social dilemma.


Histories
The history of a producer in a stage game consists of the action profile in the producer's match in step 1 and the action by the enforcer in the producer's match in step 2, as well as the identities of the enforcer and the other producer. The set of all such stage game histories is $H_P = A_1 \times A_1 \times A_2 \times N_P \times N_E$. The set of possible histories of a producer at the beginning of step 1 of the stage game in round $r$ is $\mathcal{H}_P(r) = H_P^{r-1}$. The set of all possible histories of a producer at the beginning of step 1 of some stage game is $\mathcal{H}_P = \bigcup_{r=1}^{\infty} \mathcal{H}_P(r)$. Let $M_1$ be a symmetric $n_P \times n_P$ matrix such that entry $ij$ is equal to 1 if producers $i$ and $j$ are matched and 0 otherwise, and let $\mathcal{M}_1$ denote the set of all such matrices, i.e. all possible matchings in step 1 of the stage game. Let $M_2$ be an $n_P \times n_E$ matrix such that entry $ij$ is equal to 1 if $i$ is a client of $j$ and 0 otherwise, and let $\mathcal{M}_2$ denote the set of all such matrices, i.e. all possible matchings in step 2 of the stage game. Let $M_3$ be a symmetric $n_E \times n_E$ matrix such that entry $ij$ is equal to 1 if enforcers $i$ and $j$ are matched and 0 otherwise, and let $\mathcal{M}_3$ denote the set of all such matrices, i.e. all possible matchings in step 3 of the stage game.
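To make the matching objects concrete, the following is a minimal sketch that draws uniform random matchings of the three kinds described above (Python; the function names, the use of NumPy, and the convention that a matched producer pair shares a single enforcer are our illustrative assumptions, not part of the model specification):

```python
import numpy as np

rng = np.random.default_rng(0)

def step1_matching(n_p):
    """Draw a random perfect matching M1 among n_p producers (n_p even):
    entry ij is 1 iff producers i and j are matched."""
    perm = rng.permutation(n_p)
    m1 = np.zeros((n_p, n_p), dtype=int)
    for k in range(0, n_p - 1, 2):
        i, j = perm[k], perm[k + 1]
        m1[i, j] = m1[j, i] = 1
    return m1

def step2_matching(m1, n_e):
    """Draw M2: assign each matched producer pair to a uniformly drawn
    enforcer, so entry ij is 1 iff producer i is a client of enforcer j."""
    n_p = m1.shape[0]
    m2 = np.zeros((n_p, n_e), dtype=int)
    for i in range(n_p):
        for j in range(i + 1, n_p):
            if m1[i, j] == 1:
                e = rng.integers(n_e)
                m2[i, e] = m2[j, e] = 1
    return m2

def step3_matching(n_e):
    """Draw M3: a random perfect matching among n_e enforcers (n_e even)."""
    return step1_matching(n_e)
```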
The history of an enforcer in a stage game consists of all actions taken in the game, as well as the matchings of all steps. The set of all such stage game histories is $H_E = A_1^{n_P} \times A_2^{n_E} \times A_3^{n_E} \times \mathcal{M}_1 \times \mathcal{M}_2 \times \mathcal{M}_3$.

B. Cooperative Equilibrium.

Reputation. We will define our cooperative strategy profile with the help of a reputation system, as follows. Each agent $i$ has a label $z_i \in \{0, 1, 2, \ldots, \kappa\}$, with the interpretation that if $z_i = 0$ then $i$ is in good standing and if $z_i = k > 0$ then $i$ is in bad standing and shall be punished/attacked for the following $k$ rounds, including the current round. Labels are updated as follows: If an agent is in good standing ($z_i = 0$) and follows the proposed strategy, then she remains in good standing. If an agent does not follow the proposed strategy then she enters bad standing with $z_i = \kappa$. If a player with $z_i = k > 0$ follows the proposed strategy then her label is updated to $z_i = k - 1$. Thus, if a player with $z_i = 1$ follows the proposed strategy then she enters good standing ($z_i = 0$).
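The label dynamics can be summarised in a few lines; a minimal sketch (the function name and the boolean encoding of compliance are ours):

```python
def update_label(z, complied, kappa):
    """Update a standing label z in {0, ..., kappa}: any deviation
    (re)starts the punishment phase at kappa; compliance works bad
    standing down by one round per round and preserves good standing."""
    if not complied:
        return kappa
    return max(z - 1, 0)
```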
Note that since enforcers know the complete history (i.e. they know all actions taken by everyone in the repeated game so far), they can derive the label of any enforcer.

Equilibrium
Let $s^* = (s^{CP}, s^{CE})$ denote the strategy profile in which each producer $i$ follows strategy CP and each enforcer $i$ follows strategy CE. We can show that $s^*$ is a subgame perfect Nash equilibrium if the continuation probability $\delta$ is high enough and the punishment phase for enforcers is long enough ($\kappa$ large), provided that the punishment $p$ of producers is severe enough. This result and its proof (except the part concerning unconditional strategies) are similar to Theorem 2 of (1), adjusted for the fact that we have a sequential stage game and two kinds of players, producers and enforcers.
We perturb the environment by assuming that producers make a mistake and fail to play their intended action with probability $\mu_P$ (i.i.d. across players and rounds), and we look for subgame perfect Nash equilibria as $\mu_P \to 0$.
Proof. First consider producers, and ignore mistakes for now. A producer complying with $s^{CP}$ earns $(1-\tau)(b - c + w)$ in a round in which she faces a cooperating co-player and $(1-\tau)(-c + w)$ in a round in which she faces a defecting co-player. If she deviated from $s^{CP}$ and played D instead of C she would earn $(1-\tau)(b + w) - p$ in a round in which she faces a cooperating co-player and $(1-\tau)w - p$ in a round in which she faces a defecting co-player. Her deviation does not affect the behaviour of other players. Thus deviation to defection is unprofitable if $(1-\tau)c < p$. This is true for any subgame since the strategy $s^{CE}$ always prescribes punishment of defection. The above argument continues to hold in the presence of sufficiently small mistakes, since all the inequalities governing profitability are strict.
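Explicitly, for the case of a cooperating co-player (the defecting co-player case yields the same condition):

$$(1-\tau)(b - c + w) > (1-\tau)(b + w) - p \iff p > (1-\tau)c.$$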
Next we examine whether enforcer $i$ has any profitable deviation from the proposed strategy, given that the remaining players comply with it. For ease of exposition we assume that $n_P = 2 n_E$, so that each enforcer has two clients. This is without loss of generality.
Recall that strategies that do not incur the fixed cost $f$ can condition neither on whether a producer defected nor on whether an enforcer is in good or bad standing. To begin with we exclude these strategies and show that it is not profitable to deviate to a strategy that has paid the fixed cost $f$ and conditions on producer behaviour or enforcer standing. In this case we can use the one-shot deviation principle, which states that a strategy profile of a repeated game (with $\delta < 1$) is a subgame perfect Nash equilibrium if and only if the following holds: at every information set, the player acting there cannot increase her payoff by deviating at that information set and then returning to her strategy for the rest of the game, given that all other players stick to their strategies throughout the game. The one-shot deviation principle is not valid when we examine deviations to strategies that cannot condition on the same events, so we perform that comparison separately.
Strategies paying the fixed cost $f$: We ignore mistakes, noting that the argument presented below continues to hold in the presence of sufficiently small mistakes, since all the inequalities governing profitability are strict.
Note that regardless of whether other players are in good or bad standing they will behave in the same way towards $i$, since they are assumed to comply with the proposed strategy, and the proposed strategy only conditions behaviour on the label of the co-player and not on one's own label. Also note that if all players start complying with the proposed strategy profile in round 1, then from round $\kappa + 1$ and onwards everyone is in good standing. In what follows, let the action that the proposed strategy profile prescribes for player $i$ in the meta-enforcement step of round $t$ be denoted $a_i^t$. Let $R^{t,\mathrm{comp}}$ be the maximum amount of resources obtained in the enforcer step by an enforcer who punishes defections in round $t$, and let $R^{t,\mathrm{dev}}$ be the maximum amount of resources that can be obtained in the enforcer step by an enforcer who does not punish defections in round $t$. We also define the dummy variables $A_i^t$ and $B_i^t = 1 - A_i^t$ used in (S1) and (S2) below. There are $\kappa + 1$ different kinds of subgames to consider, corresponding to $z_i \in \{0, 1, 2, \ldots, \kappa\}$.

Case $\kappa$: If $i$ is in bad standing with $z_i = \kappa$, her payoff from complying with the proposed strategy in both the enforcement and meta-enforcement steps is at least the expression in (S1), since from round $\kappa + 1$ and onwards everyone is in good standing. Her payoff from deviating (a one-shot deviation in round one) is at most the expression in (S2); again we use the fact that from round $\kappa + 1$ and onwards everyone is in good standing, but $i$ remains in bad standing until the end of round $\kappa + 1$. Subtracting (S2) from (S1) we find that deviation is unprofitable if condition (S3) holds. As $\delta^{\kappa} \to 1$ the left-hand side of (S3) approaches a limit that is strictly positive whenever $l > 2v$; thus if $l > 2v$ then (S3) is satisfied as $\delta^{\kappa} \to 1$.

Cases $1, 2, 3, \ldots, \kappa - 1$: If $i$ is in bad standing with $0 < z_i < \kappa$, her payoff from complying with the proposed strategy in both the enforcement and meta-enforcement steps is strictly higher than the payoff from complying in the case of $z_i = \kappa$. In contrast, her payoff from deviating (a one-shot deviation in round one) is the same as in the case of $z_i = \kappa$, since any deviation resets her label to $\kappa$. Thus, if (S3) holds then an agent $i$ with $0 < z_i < \kappa$ will also find it unprofitable to deviate.

Case 0: If $i$ is in good standing, $z_i = 0$, her payoff from complying with the proposed strategy is given by (S4). Deviation may begin in the enforcement step or in the meta-enforcement step. Let $R^{1,\max}$ denote the highest payoff that any of these deviations can yield, so that the payoff from deviating (a one-shot deviation) is at most the expression in (S5). Thus, subtracting (S5) from (S4) we find that deviation is unprofitable if condition (S6) holds. Note that for a given value of $\delta^{\kappa}$, if $\delta \to 1$ then the left-hand side of (S6) approaches a strictly positive limit.

Cases 0 and $\kappa$ together: We now show that (S3) and (S6) can be satisfied for some $\delta$ and some $\kappa$. We have noted that condition (S3) holds for $\delta^{\kappa}$ sufficiently close to one. Fixing $\delta^{\kappa}$ and letting $\delta \to 1$, the left-hand side of (S6) becomes strictly positive, as we have noted. More precisely, there is some $\nu \in (0,1)$ such that if $\delta^{\kappa} \ge \nu \iff \kappa \le \log\nu / \log\delta$ then (S3) holds. Set $\kappa = \log\nu / (2\log\delta)$, implying that $\delta^{\kappa} = \sqrt{\nu} > \nu$ for any choice of $\delta$, thereby satisfying (S3), and let $\delta \to 1$ so that (S6) is satisfied.
Strategies not paying the fixed cost $f$: Fix $\mu_P > 0$. The probability that at least one of two clients defects in a given round is at least $1 - (1-\mu_P)^2$. Divide time into blocks with a length of $\kappa/2$ rounds. Fix an enforcer. Let $K_t$ be a random variable that takes the value 1 if the enforcer in question faces at least one defecting client in the $t$th block of $\kappa/2$ rounds, and takes the value 0 otherwise. The probability that an enforcer faces at least one defecting client in a block of $\kappa/2$ rounds is at least $1 - \left((1-\mu_P)^2\right)^{\kappa/2} = 1 - (1-\mu_P)^{\kappa}$. It follows that for any $\varepsilon_A \in (0,1)$ there is some $\kappa^*_{\varepsilon_A}$ such that if $\kappa > \kappa^*_{\varepsilon_A}$ then the probability that an enforcer faces at least one defecting client in a block of $\kappa/2$ rounds is at least $1 - \varepsilon_A$, i.e. $(1-\mu_P)^{\kappa} \le \varepsilon_A$.
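The bound on the block probability follows from independence across rounds:

$$\Pr(\text{no defecting client during } \kappa/2 \text{ rounds}) \le \left((1-\mu_P)^2\right)^{\kappa/2} = (1-\mu_P)^{\kappa},$$

so the complementary event has probability at least $1 - (1-\mu_P)^{\kappa}$, which exceeds $1 - \varepsilon_A$ once $\kappa$ is large enough.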
Note that $\{K_t\}_{t=1}^{\infty}$ is a sequence of i.i.d. random variables. Let $k$ be the number of blocks. By the weak law of large numbers the average $\bar K = \sum_{t=1}^{k} K_t / k$ converges in probability to $\mathrm{E}[K_t] = \Pr(K_t = 1)$ as $k \to \infty$. That is, for any $\varepsilon_B \in (0,1)$ and $\varepsilon_C \in (0,1)$ there is some $k^*_{\varepsilon_B, \varepsilon_C}$ such that if $k > k^*_{\varepsilon_B, \varepsilon_C}$ then $\Pr\left(\bar K > \Pr(K_t = 1) - \varepsilon_B\right) > 1 - \varepsilon_C$. Let $T$ denote the number of rounds in the repeated game. The number of blocks of $\kappa/2$ rounds is $2T/\kappa$. Let $M_\delta(T)$ denote the median number of rounds given constant repetition probability $\delta$. Note that $T$ has a geometric distribution, so there is some $\delta^*$ such that if $\delta > \delta^*$ then, with probability at least $1/2$, the number of blocks $2T/\kappa$ exceeds $k^*_{\varepsilon_B, \varepsilon_C}$. Consider an enforcer who has not invested in information networks (saving the fixed cost $f$) and never punishes defecting clients. If $\kappa > \kappa^*(\varepsilon_A)$ and $\delta > \delta^*$ then with probability at least $(1 - \varepsilon_C)/2$ the enforcer will enter or restart bad standing in at least a fraction $1 - \varepsilon_A - \varepsilon_B$ of the blocks of $\kappa/2$ rounds. Thus, with probability at least $(1 - \varepsilon_C)/2$ the enforcer will spend at least a fraction $1 - \varepsilon_A - \varepsilon_B$ of all rounds in bad standing.
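For concreteness, under the convention that $\Pr(T \ge t) = \delta^{t-1}$ (a geometric number of rounds), the median satisfies

$$M_\delta(T) = \left\lceil \frac{\log(1/2)}{\log \delta} \right\rceil \longrightarrow \infty \quad \text{as } \delta \to 1,$$

so the median number of blocks $2 M_\delta(T)/\kappa$ eventually exceeds any fixed $k^*_{\varepsilon_B, \varepsilon_C}$.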
The payoff to $i$ of complying in any subgame is at least as high as the payoff from complying in a subgame where $i$ is in bad standing with $z_i = \kappa$. If $\kappa > \kappa^*$ and $\delta > \delta^*$ then, by the argument above, a strategy that never punishes defection spends at least a fraction $1 - \varepsilon_A - \varepsilon_B$ of all rounds in bad standing with probability at least $(1 - \varepsilon_C)/2$, which bounds its payoff from above. Since $l/2 > f$ we can pick $\mu_P$ such that $\frac{1}{2}(Q + l) > f$, and then pick $\varepsilon_A$, $\varepsilon_B$, $\varepsilon_C$ small enough that the comparison is preserved. This establishes that at every subgame a strategy that does not pay $f$, and hence always fails to punish a defection that occurs with some arbitrarily small probability $\mu_P$, earns less than strategies that do pay $f$, provided that the punishment phase is long enough and that the repetition probability is high enough (relative to the punishment phase).

S2. Analytical Results for Dynamics
A. Assumptions. In order to obtain analytical results, we need to make a number of simplifying assumptions. We only consider the CE and DE strategies for the enforcers, in addition to the CP and DP strategies for the producers. If the population is of size $N$ then the set of states is $\{n = (n_{CP}, n_{DP}, n_{CE}, n_{DE}) \in \mathbb{Z}_{\ge 0}^4 : n_{CP} + n_{DP} + n_{CE} + n_{DE} = N\}$. We abstract away from action mistakes and the variable cost (by setting $\mu = v = 0$) and focus on the limit where the (expected) number of rounds per period and the number of punishment rounds get arbitrarily large ($\delta \to 1$ and $\kappa \to \infty$). Furthermore, we assume that punishment is sufficiently severe, $p > (1-\tau)c$, and sufficiently cheap, $f < l$. We assume $\tau w > v$, which implies positive tax revenue $R_i > 0$.
We define our strategies such that if $n_P = 0$ then the DE still attacks her co-player in the enforcement step, and the CE still has to pay the fixed cost. Reversing these assumptions does not change our results.
B. Remarks on Our Dynamics. The speed of convergence to equilibria can be increased dramatically by making interactions local (2,3). Local interactions would also be perfectly realistic. However, we know that imposing spatial structure and local interactions is favourable for the evolution of cooperation, and hence would confound our results, making it unclear whether cooperation emerged due to specialised enforcement or due to local interactions.
We have examined the relative stability of the cooperation and defection equilibria by means of stochastic stability analysis. Alternatively one may consider the effect of group selection on which equilibrium is likely to be observed. It is plausible to assume that groups that manage to coordinate on the more efficient cooperation equilibrium will be favoured relative to groups that coordinate on the less efficient defection equilibrium, as suggested by (4) for the case of so-called altruistic peer-punishment. However, the introduction of several groups also opens up the possibility for new enforcer strategies, including those that try to tax or plunder the producers of other populations. For this reason an examination of the effect of group selection will have to be relegated to future research.
In our model we assume that revision across professions (i.e. between producer and enforcer roles) occurs more rarely than revision within professions. One could object that enforcers will have an incentive to limit entry into the enforcement business, thereby maintaining higher payoffs for themselves. Such an attempt by enforcers can be modelled by adding a fixed cost of becoming an enforcer. It would reduce the fraction of enforcers in equilibrium but would not substantially alter our conclusions.
C. Payoff Calculations. We calculate payoffs under the limiting assumption of no action mistakes ($\mu = 0$), no monitoring mistakes ($\rho = 0$), infinite repetition ($\delta \to 1$), and infinite punishment ($\kappa \to \infty$). For now we allow for $v > 0$, though we will (for reasons of tractability) set $v = 0$ in the stochastic stability analysis below.

C.1. Production. Let us first consider states where there are at least two producers and at least one enforcer present in the population ($n_P \ge 2$ and $n_E \ge 1$). Consider a cooperative producer. When she faces a cooperating co-player her payoff is $(1-\tau)(b - c + w)$ and when she faces a defecting co-player her payoff is $(1-\tau)(w - c)$, independently of whether she is matched with a DE or CE. Since her co-player cooperates with probability $(n_{CP} - 1)/(n_P - 1)$, the expected payoff of a cooperating producer is
$$\pi(CP, n) = (1-\tau)\left(w - c + \frac{n_{CP} - 1}{n_P - 1}\, b\right).$$
Next consider a defecting producer. Her payoff in the production step is 0 when she faces another defector, and $b$ when she faces a cooperating co-player. Regardless of the kind of enforcer she is matched with she has background resources $w$ and is taxed at rate $\tau$. If she faces a DE she is not punished, but if she faces a CE she is punished, thereby suffering a loss of $p$. In total the expected payoff of a defecting producer is
$$\pi(DP, n) = (1-\tau)\left(w + \frac{n_{CP}}{n_P - 1}\, b\right) - \frac{n_{CE}}{n_E}\, p.$$
Now let us consider states where there is a single producer, i.e. $n_P = 1$ (implying $n_E \ge 1$ for $N > 2$); then there is no pair of producers that can play the PD, so the lone producer simply earns her taxed background resources, $(1-\tau)w$. Finally, if there are no enforcers, $n_E = 0$, then producers are neither taxed nor punished.

C.2. Enforcement. First, assume that at least one enforcer and at least two producers are present in the population ($n_E \ge 1$ and $n_P \ge 2$), so that there is at least one producer-producer interaction. Consider a cooperation enforcer and suppose she is matched with only one pair of producers. The expected payoff the enforcer obtains from one pair is the tax revenue from that pair, net of the expected punishment cost $v$. The expected number of client pairs per enforcer is $\frac{n_P}{2 n_E}$, which scales the expected payoff from the enforcement stage accordingly. Consider a defection enforcer. She receives the same payoff as a cooperation enforcer, but does not have to incur the cost $v$. Now let us consider states where $n_E < 1$ or $n_P < 2$. If $n_P = 1$ (implying $n_E \ge 1$) then there is no pair of producers that can play the PD, so the enforcers' tax revenue derives from the lone producer's background resources only.

We have argued heuristically that as $\delta \to 1$ the fraction of rounds in which almost all DE are in bad standing goes to one. The argument can be formalised as follows. Let $\beta$ be the fraction of DE who are in bad standing. A DE ends up in bad standing as soon as she has been matched with a DP in the production step or matched with a CE in the meta-enforcement step. Thus, provided we are in a state where at least one CE is present, the probability that a DE enters bad standing in any given round is bounded above zero. It follows that for any $\bar\beta < 1$ and $\varepsilon < 1$ there is some round $\bar t(\varepsilon, \bar\beta) \ge 1$ such that if $t > \bar t(\varepsilon, \bar\beta)$, then $\Pr(\beta > \bar\beta) > 1 - \varepsilon$. That is, for any round after $\bar t(\varepsilon, \bar\beta)$, the fraction of the DE that are in bad standing is at least $\bar\beta$ with probability at least $1 - \varepsilon$.
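For reference, the producer payoffs just derived can be written out as a short sketch (Python; these expressions follow the reconstructed formulas above and are illustrative only):

```python
def pi_cp(n_cp, n_p, n_ce, n_e, b, c, w, tau, p):
    """Expected payoff of a cooperating producer (n_p >= 2, n_e >= 1):
    the co-player cooperates with probability (n_cp - 1)/(n_p - 1)."""
    q = (n_cp - 1) / (n_p - 1)
    return (1 - tau) * (w - c + q * b)

def pi_dp(n_cp, n_p, n_ce, n_e, b, c, w, tau, p):
    """Expected payoff of a defecting producer: she meets a cooperator
    with probability n_cp/(n_p - 1) and is punished (by a CE) with
    probability n_ce/n_e."""
    q = n_cp / (n_p - 1)
    return (1 - tau) * (w + q * b) - p * (n_ce / n_e)
```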
Let $T$ be the realised length of the repeated game. For any $\varepsilon$ and $\bar T$ there is some $\bar\delta(\bar T, \varepsilon) < 1$ such that if $\delta > \bar\delta(\bar T, \varepsilon)$ then $\Pr(T > \bar T) > 1 - \varepsilon$. That is, the period lasts at least $\bar T$ rounds with probability at least $1 - \varepsilon$.
Now, fix $\bar\beta$ and $\varepsilon$, and pick any $\bar T > \bar t(\varepsilon, \bar\beta)$. By making $\delta$ large enough we can ensure that $\Pr(T > \bar T)$ is high enough. More precisely, if $\delta > \bar\delta(\bar T, \varepsilon)$, then with a probability of at least $1 - \varepsilon$ we have $T > \bar T > \bar t(\varepsilon, \bar\beta)$. This implies that with a probability of at least $(1 - \varepsilon)^2$ the fraction of rounds in which $\beta > \bar\beta$ is at least $(T - \bar t(\varepsilon, \bar\beta))/T$. This shows that $\delta \to 1$ implies that $(T - \bar t(\varepsilon, \bar\beta))/T \to 1$. Furthermore, this holds for $\bar\beta$ arbitrarily close to 1. Once a DE agent is in bad standing, she will remain there for an arbitrarily long time, since $\kappa \to \infty$. Hence, average payoffs will be approximately equal to the payoffs obtained when all CE are in good standing and all DE are in bad standing.
D. (Exact) Best-Reply Learning. As mentioned above, we consider a three-speed myopic best response process where agents may switch strategy within professions more often than they may switch profession (move between producer and enforcer strategies). The main idea behind this assumption is that it takes more time to learn a new trade than it takes to change one's behaviour given the current occupation. At the beginning of each of the first $n_P$ periods one producer is drawn at random (without replacement) and may choose whether to become a different kind of producer. When such an opportunity arises she chooses the strategy that would have maximised her previous per-round payoff. More formally, in period $t$ player $i$ chooses a strategy $s_i^t \in \arg\max_{s \in \{CP, DP\}} \pi(s, n^{t-1})$, where $n^{t-1}$ is the state of the population (distribution of strategies) in period $t - 1$.
Once all producers have decided on their strategy, one enforcer is drawn at random and may decide whether to become an enforcer of the other type. She is assumed to choose a myopic best response to the distribution of play in the previous period, i.e. she chooses $s_i^t \in \arg\max_{s \in \{CE, DE\}} \pi(s, n^{t-1})$.
After this, all enforcers again receive the opportunity to update their strategy. Once all enforcers have updated their strategy, and all producers have had the opportunity to update following the last producer, one player is selected at random and may choose her strategy from the full set of strategies; i.e., in addition to the choice of cooperating or defecting she may decide on her profession. More formally, she chooses $s_i^t \in \arg\max_{s \in S} \pi(s, n^{t-1})$.
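A minimal sketch of one revision under this protocol (Python; the uniform randomisation over ties matches the indifference cases discussed below, while the payoff function is left abstract):

```python
import random

def myopic_best_reply(options, payoff, state):
    """Return a myopic best reply: an argmax of last period's per-round
    payoff over the admissible strategy set, randomising over ties."""
    best = max(payoff(s, state) for s in options)
    return random.choice([s for s in options if payoff(s, state) == best])

# Within-profession revision for a producer, and a full revision:
# s_new = myopic_best_reply(["CP", "DP"], pi, n_prev)
# s_new = myopic_best_reply(["CP", "DP", "CE", "DE"], pi, n_prev)
```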
Under the best response process agents take into account that their strategy choice will have an impact on the distribution of strategies in the overall population. In order to represent this we introduce the following notation. Given an agent with current strategy $s$ who switches to $s'$ and an initial population profile $n$, we denote the resulting population profile by $n^{s|s'}$, with $n^{s|s'}_{s'} = n_{s'} + 1$ and $n^{s|s'}_{s} = n_s - 1$.

E. Analysis of the Unperturbed Dynamic. In a first step we will examine the circumstances under which producers will switch between cooperation and defection, when $n_P \ge 2$ and $n_E \ge 1$. A defecting producer will switch if $\pi(CP, n^{DP|CP}) \ge \pi(DP, n)$. Rewriting this condition reveals that a defecting producer will become a cooperator with positive probability if $\frac{n_{CE}}{n_E} \ge \frac{(1-\tau)c}{p}$; she will switch with certainty in case the inequality is strict, and will not switch in case it is violated. Likewise, a cooperating producer will become a defector with positive probability if $\pi(DP, n^{CP|DP}) \ge \pi(CP, n)$. She will stay in case this inequality is violated and will switch with certainty in case it holds strictly. Thus a cooperating producer will switch with positive probability whenever $\frac{n_{CE}}{n_E} \le \frac{(1-\tau)c}{p}$.

Now consider enforcers. A defection enforcer will switch to cooperation enforcement with positive probability if $\pi(CE, n^{DE|CE}) \ge \pi(DE, n)$, or equivalently if $\frac{n_{CE}}{n_E - 1} \ge \frac{f}{l}$. She will switch with probability one in case of a strict inequality, and will stay in case the above inequality fails to hold. Similarly, a cooperation enforcer will switch with positive probability if $\pi(DE, n^{CE|DE}) \ge \pi(CE, n)$, will switch with certainty in case of a strict inequality, and will remain a cooperation enforcer otherwise. This condition can be rewritten as $\frac{n_{CE} - 1}{n_E - 1} \le \frac{f}{l}$.

Intuitively, a player who considers switching will have one more co-player of her new type than a player who remains at her strategy, so the switching and staying conditions differ by one agent. Thus, if players of one strategy strictly prefer to stay, players with the other strategy will switch.
We define $E_C$ and $E_D$ as the sets of states in which only cooperative players (CP and CE), respectively only defecting players (DP and DE), are present, with the number of enforcers given by the fractions $\alpha_C$ and $\alpha_D$ of the population that appear in the proof below. In the next lemma it will be shown that $E_C$ and $E_D$ correspond to the cooperative and the defecting absorbing sets of our dynamic process. In both sets there are only enforcers and producers of one type present in the population, and the relative size of the two groups is determined by the fundamentals of the model.

Lemma S1. Consider states where at least two enforcers and at least two producers are present in the population ($n_E \ge 2$ and $n_P \ge 2$). If $p > (1-\tau)c$ and if $N$ is sufficiently large, the following holds: 1. from any state with $n_{CE} > \frac{f}{l}(n_E - 1) + 1$ the process converges to $E_C$ with probability one, 2. from any state with $n_{CE} < \frac{f}{l}(n_E - 1)$ the process converges to $E_D$ with probability one, and 3. from any state with $\frac{n_{CE} - 1}{n_E - 1} \le \frac{f}{l} \le \frac{n_{CE}}{n_E - 1}$ the process converges to either $E_C$ or $E_D$ with positive probability.
Proof. First, consider states with $n_{CE} > \frac{f}{l}(n_E - 1) + 1$. After all producers have updated their strategy an enforcer is drawn to decide whether to become an enforcer of the other type. For $n_{CE} > \frac{f}{l}(n_E - 1) + 1$, enforcers who are currently cooperative will stay and defecting enforcers will become cooperation enforcers. Thus, with probability one the process converges to a state where $n_{CE} = n_E$. As $\frac{n_{CE}}{n_E} = 1 > \frac{(1-\tau)c}{p}$ (due to the assumption that $p > (1-\tau)c$), all producers will either switch to become cooperative producers or remain cooperative producers. We have reached a state with $n_{CE} = n_E$ and $n_{CP} = n_P$. Now consider the stage where agents may choose among all four different strategies (at a state with $n_{CE} = n_E$ and $n_{CP} = n_P$). If a cooperation enforcer is drawn, she will not decide to become a defecting producer, since $\frac{n_{CE}}{n_E} = 1$ implies that a cooperative producer will always earn a higher payoff. Likewise, she will not decide to become a defection enforcer since cooperation enforcers earn more. Thus, a cooperation enforcer may either decide to stay a cooperation enforcer or become a cooperative producer. She will switch strategies if $\pi(CP, n^{CE|CP}) > \pi(CE, n)$, will remain if $\pi(CP, n^{CE|CP}) < \pi(CE, n)$, and will randomise between the two options if $\pi(CP, n^{CE|CP}) = \pi(CE, n)$. One can check that a cooperation enforcer will switch with certainty if $n_{CE} > \alpha_C N$.
(Note that the system moves towards the rest point and away from the vertices, so the system remains in the set of states where $n_E \ge 2$ and $n_P \ge 2$.) If a cooperative producer is drawn she will never decide to become a defecting producer or a defection enforcer (since $\frac{n_{CE}}{n_E} = 1$). A cooperative producer will become a cooperation enforcer with certainty if $\pi(CE, n^{CP|CE}) > \pi(CP, n)$. First, consider the case where after the switch there are at least two producers left, $n_{CP} = n_P \ge 3$. Note that in this case $\pi(CE, n^{CP|CE}) = \frac{n_P - 1}{n_E + 1}\tau(b - c + w) - f$. Now, consider the special case where after the switch there is only one producer left, $n_{CP} = n_P = 2$. Since this lone producer receives less payoff, the income of the switching enforcer also changes, to $\pi(CE, n^{CP|CE}) = \frac{n_P - 1}{n_E + 1}\tau w - f$, and we have that a producer becomes an enforcer with certainty whenever this exceeds $\pi(CP, n)$. [S8] Thus, provided the population is sufficiently large (so that the previous inequality is violated), cooperating producers will never choose to become cooperating enforcers if that would leave only one producer left. Thus, for all states where $n_{CE} > \alpha_C N$ the cooperation enforcers will switch and the cooperative producers will stay, thus reducing the number of enforcers. Similarly, for $n_{CE} < \alpha_C N - 1$ enforcers will stay and producers will switch, increasing the number of enforcers. Now let us consider states with $\alpha_C N - 1 \le n_{CE} \le \alpha_C N$. If $\alpha_C N \notin \mathbb{Z}$, both enforcers and producers will stay with probability one, implying that the state $n_{CE} = \lfloor \alpha_C N \rfloor$ constitutes a singleton absorbing set. If $\alpha_C N \in \mathbb{Z}$ and $n_{CE} = \alpha_C N - 1$ the enforcers will stay with certainty and the producers will randomise. Thus, we might move to a state with $n_{CE} = \alpha_C N$. At this state producers will stay with probability one and the enforcers will randomise, thus moving us back to a state with $n_{CE} = \alpha_C N - 1$ with positive probability. Hence, in the (non-generic) case $\alpha_C N \in \mathbb{Z}$ there exists a non-singleton absorbing set.
Finally, we need to check that in these absorbing sets there are at least two producers and at least two enforcers. First consider $\alpha_C N \notin \mathbb{Z}$. To make sure that there are at least two enforcers, we require $n_{CE} = \lfloor \alpha_C N \rfloor \ge 2$, which translates into $N \ge \frac{2}{\alpha_C}$. To ensure there are at least two producers, we need $n_{CE} = \lfloor \alpha_C N \rfloor \le N - 2$, which holds if $N > \frac{1}{1 - \alpha_C}$. If $\alpha_C N \in \mathbb{Z}$ the number of enforcers fluctuates between $\alpha_C N$ and $\alpha_C N - 1$. If $N \ge \frac{3}{\alpha_C}$ there are always at least two enforcers ($\alpha_C N - 1 \ge 2$) and if $N \ge \frac{2}{1 - \alpha_C}$ there are at least two producers ($\alpha_C N \le N - 2$). Provided the population is sufficiently large, the previous four inequalities hold. Finally, note that for $N$ sufficiently large inequality [S8] is also violated, so that cooperating producers will never choose to become cooperating enforcers if this would leave only one producer left.
Consider now the second case, $n_{CE} < \frac{f}{l}(n_E - 1)$. An argument akin to the one used above shows that the process converges to a state where $n_{DE} = n_E \ge 2$ and $n_{DP} = n_P \ge 2$ when agents can only decide on their behaviour within professions. Further, as above, defecting producers and defecting enforcers will never become cooperative producers or enforcers. A defection enforcer will switch strategies with probability one if $\pi(DP, n^{DE|DP}) > \pi(DE, n)$, which can be rewritten as $n_{DE} > \alpha_D N$. (Note that the system moves towards the rest point and away from the boundary, so the system remains in the set of states where $n_E \ge 2$ and $n_P \ge 2$.) Likewise, a defecting producer will become a defection enforcer with certainty if $\pi(DE, n^{DP|DE}) > \pi(DP, n)$, or $n_{DE} < \alpha_D N - 1$.
As above, if $\alpha_D N \notin \mathbb{Z}$ the state $n_{DE} = \lfloor \alpha_D N \rfloor$ corresponds to the (singleton) absorbing set. If $\alpha_D N \in \mathbb{Z}$ the absorbing set contains the two states $n_{DE} = \alpha_D N - 1$ and $n_{DE} = \alpha_D N$.
Again, we need to make sure that at the absorbing set there are at least two enforcers and at least two producers. The argument mirrors the one given above, with $\alpha_D$ in place of $\alpha_C$, and all of these inequalities hold for $N$ sufficiently large.
Finally, consider the case $\frac{n_{CE} - 1}{n_E - 1} \le \frac{f}{l} \le \frac{n_{CE}}{n_E - 1}$, and suppose first that $\frac{f}{l}(n_E - 1) \notin \mathbb{Z}$. If a cooperation enforcer is first drawn, she will switch to become a defection enforcer, implying that we move to a state with $n_{CE} - 1 < \frac{f}{l}(n_E - 1)$. Thus, we will eventually converge to $E_D$. Conversely, if a defection enforcer is first drawn she will switch to become a cooperation enforcer, implying that we eventually converge to $E_C$. Now consider $\frac{f}{l}(n_E - 1) \in \mathbb{Z}$. If $n_{CE} = \frac{f}{l}(n_E - 1)$, a revising cooperation enforcer will switch and a revising defection enforcer will randomise. If the former agent is drawn to revise, we end up in $E_D$. If the latter is drawn we may end up in either $E_C$ or $E_D$. Likewise, if $n_{CE} = \frac{f}{l}(n_E - 1) + 1$ the process may end up in either $E_C$ or $E_D$, depending on who is drawn to revise and how an indifferent cooperation enforcer decides.
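The case distinction of Lemma S1 can be summarised as follows (Python; a direct transcription of the three cases, with the function name ours):

```python
def basin(n_ce, n_e, f, l):
    """Classify a state with n_e >= 2 enforcers and n_p >= 2 producers
    according to Lemma S1."""
    threshold = (f / l) * (n_e - 1)
    if n_ce > threshold + 1:
        return "E_C with probability one"
    if n_ce < threshold:
        return "E_D with probability one"
    return "E_C or E_D, each with positive probability"
```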
If $f > l$ then $n_{CE} < \frac{f}{l}(n_E - 1) + 1$ holds at every state, so the above lemma implies that the process converges to $E_D$ with probability one. In this case the analysis is trivial, so in what follows we assume that $f < l$. In a similar vein, if $p < (1-\tau)c$ then it is easy to see that CP will be outperformed by DP at all states. Thus we focus on the case $p > (1-\tau)c$.
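To see the first claim, note that for any state with $n_E \ge 2$ enforcers,

$$f > l \implies \frac{f}{l}(n_E - 1) + 1 > (n_E - 1) + 1 = n_E \ge n_{CE},$$

so condition 1 of the lemma can never hold.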
The next lemma characterises the dynamic process in the absence of enforcers, nE = 0, and when there is only one enforcer, nE = 1.

Lemma S2
Consider states with nE < 2. If p > (1 − τ ) c, and if N is large enough, then the process converges to E D with probability one.
Proof. Suppose $n_E = 0$, so $n_P = N$. First, note that in the absence of enforcers we will move to a state where all producers defect, since $\pi(DP, n^{CP|DP}) > \pi(CP, n)$ when defection goes unpunished. Thus, we will move to a state where there are only defecting producers, $n_{DP} = N$, and each will earn a payoff of $w$. Now consider the stage where producers may decide among the full set of strategies. In this case we have $\pi(DE, n^{DP|DE}) = \tau(N-1)w$ and $\pi(CE, n^{DP|CE}) = \tau(N-1)w - f$. (Note that we do not subtract $l$ from DE's payoff since there is only one DE in the population.) By assumption $\tau > 1/(N-1)$. It follows that $\pi(DE, n^{DP|DE}) = \tau(N-1)w > w = \pi(DP, n)$. Thus, if a producer is drawn to choose a strategy from the full set of strategies she chooses to become DE, and the system moves to a state with $n_E = n_{DE} = 1$.

Suppose $n_E = n_{DE} = 1$. All producers choose to become DP because the single enforcer never punishes defection (assuming $N$ is large enough so that $b \le c(N-1)$). Once all producers are DP, the enforcer gets to choose between DE and CE, and clearly prefers DE, for the same reasons as above. Next someone (one of the producers or the only enforcer) is allowed to choose among all four strategies. If it is the only enforcer that gets this choice, she chooses to remain DE. What happens if one of the producers is allowed to choose among all four strategies? The state $n$ is such that $n_E = n_{DE} = 1$ and $n_P = n_{DP} = N - 1$. (Note that if there is one CE and one DE then they will fight in the meta-enforcement stage.) We have $\pi(DE, n^{DP|DE}) > \pi(CE, n^{DP|CE})$, and if $N$ is large enough ($N > 2\frac{w + l}{\tau w}$) then we have $\pi(DE, n^{DP|DE}) > \pi(DP, n)$. Thus a producer chooses to become DE. We arrive at a state with $n_E = n_{DE} = 2$ and $n_P = n_{DP} = N - 2$. From the previous lemma we know that this state is in the basin of attraction of $E_D$.
Suppose instead $n_E = n_{CE} = 1$. We have $\pi(CP, n) > \pi(DP, n^{CP|DP})$ and $\pi(CP, n^{DP|CP}) > \pi(DP, n)$ if $p > (1-\tau)c - \frac{1}{n_P}b$, which is implied by the assumption that $p > (1-\tau)c$. Thus all producers become CP, so we are at a state where $n_E = n_{CE} = 1$ and $n_P = n_{CP} = N - 1$. Next the only enforcer chooses between CE and DE. She chooses to become DE, so we are at a state where $n_E = n_{DE} = 1$ and $n_P = n_{CP} = N - 1$. Next all producers get to adjust and they all move from CP to DP, as explained above (assuming $N$ is large enough so that $b \le c(N-1)$). Following the above reasoning (for the case of $n_E = n_{DE} = 1$ and $n_P = n_{DP} = N - 1$) we arrive at a state with $n_E = n_{DE} = 2$ and $n_P = n_{DP} = N - 2$, and from the previous lemma we know that this state is in the basin of attraction of $E_D$.

The next lemma characterises the dynamic process in the absence of producers, $n_P = 0$, and when there is only one producer, $n_P = 1$.
In either case the payoffs to DP are strictly larger than the payoffs to CP. Moreover, the payoff to DP is larger than what is obtained by remaining an enforcer, provided that $N$ is large enough ($N > \frac{\tau w}{(1-\tau)w + l} + 1$). Thus, we arrive at a state with $n_P = 2$ and $n_E = n_{DE} = N - 2$. From the first lemma we know that this is in the basin of attraction of $E_D$.
Observation S1. Under the maintained assumption that $f < l$, it is always the case that $\alpha_C > \alpha_D$. Moreover, as $N \to \infty$, the payoff in the cooperation equilibrium is $\pi_C = (1-\tau)(b - c + w)$ and the payoff in the defection equilibrium is $\pi_D = (1-\tau)w$.

The payoff in the absence of enforcers (where all producers defect) is $\pi_P = w$.
Note that we always have $\pi_P > \pi_D$ (since $\tau > 0$) and $\pi_C > \pi_D$ (since $b > c$). Finally, $\pi_C > \pi_P$ if and only if $(1-\tau)(b - c) > \tau w$.

F. Analysis of the Perturbed Dynamic: Stochastic Stability. The following lemma characterises the set of stochastically stable states.

Lemma S4. Suppose $N$ is large enough, $p > (1-\tau)c$, and $f < l$. Then the radius and coradius of the defection equilibrium $E_D$ and the cooperation equilibrium $E_C$ can be characterised as follows.
Proof. First, consider the defection equilibrium $E_D$, with $n_{CE} = 0$. There is only one way to move into the basin of attraction of $E_C$, namely to move to a state where $n_{CE} \ge \frac{f}{l}(n_E - 1)$, or equivalently $\frac{n_{CE}}{n_E - 1} \ge \frac{f}{l}$ (Lemma S1 and Lemma S3). We want to move from a state where $\frac{n_{CE}}{n_E - 1} = 0$ to a state where $\frac{n_{CE}}{n_E - 1} \ge \frac{f}{l}$. Mutations may change $n_{CE}$, $n_E$, or both. We are looking for the least costly way of increasing the fraction $\frac{n_{CE}}{n_E - 1}$. If one defection enforcer mutates and becomes a cooperation enforcer we move to a new state which is characterised by $\frac{n_{CE} + 1}{n_E - 1}$. If a defection enforcer mutates to become a producer we move to a state with $\frac{n_{CE}}{n_E - 2}$. If a producer switches to become a cooperation enforcer we move to a state with $\frac{n_{CE} + 1}{n_E}$. Thus, it is more cost-effective to switch defection enforcers to cooperation enforcers than to switch producers to cooperation enforcers. Further, we have $\frac{n_{CE} + 1}{n_E - 1} \ge \frac{n_{CE}}{n_E - 2}$ whenever $n_{CE} \le n_E - 2$. Thus, up until the point where all but one enforcer is cooperative, $n_{CE} = n_E - 1$, a mutation from a defection enforcer to a cooperation enforcer increases the fraction $\frac{n_{CE}}{n_E - 1}$ more than any other mutation could. Note that at the point $n_{CE} = n_E - 1$ the remaining defection enforcer will switch with certainty, since $n_{CE} = n_E - 1 > \frac{f}{l}(n_E - 1)$. Thus, when calculating the number of mutations required for a transition we only need to consider mutations from defection enforcers to cooperation enforcers.
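The comparison invoked above can be verified directly (valid for $n_E \ge 3$):

$$\frac{n_{CE}+1}{n_E-1} \ge \frac{n_{CE}}{n_E-2} \iff (n_{CE}+1)(n_E-2) \ge n_{CE}(n_E-1) \iff n_{CE} \le n_E - 2.$$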
To move out of the basin of attraction with positive probability we need $n_{CE} \ge \frac{f}{l}(n_E - 1)$. Let $m_{DC}$ be the number of mutations required to move from $E_D$ to $E_C$, via increasing $\frac{n_{CE}}{n_E - 1}$ (Lemma S1), with positive probability. Note that at the absorbing set $E_D$ the number of enforcers is $\alpha_D N$ if $\alpha_D N \notin \mathbb{Z}$; if $\alpha_D N \in \mathbb{Z}$ we fluctuate between states with $\alpha_D N - 1$ and $\alpha_D N$ enforcers. If the mutations happen when the number of enforcers is at its lowest we have $m_{DC} = \lceil \frac{f}{l}(\alpha_D N - 2) \rceil$. Consequently, regardless of whether $\alpha_D N \in \mathbb{Z}$ or not, $m_{DC} = \lceil \frac{f}{l}(\alpha_D N - 2) \rceil$.

Now consider the cooperation equilibrium $E_C$, with $n_{CE} = n_E$. There are two possible ways to escape its basin of attraction: (i) move to a state where $n_{CE} \le \frac{f}{l}(n_E - 1) + 1$, or equivalently $\frac{n_{CE} - 1}{n_E - 1} \le \frac{f}{l}$ (Lemma S1), or (ii) move to a state with $n_E \le 1$ enforcers (Lemma S2). First consider the former alternative. If a cooperation enforcer switches to become a defection enforcer we move to a state with $\frac{n_{CE} - 2}{n_E - 1}$; if a cooperation enforcer becomes a producer we move to a state with $\frac{n_{CE} - 2}{n_E - 2}$.

As the noise parameter $\eta$ goes to infinity, all strategies are chosen with equal probability. In models of genetic evolution, the inverse of $\eta$ is often referred to as the intensity of selection (5). One common approach, which we follow, is to consider the limit of rare mutations, i.e. to let $\varepsilon$ vanish. In the case of imitative (or birth-death) processes it is particularly convenient to study the limit of rare mutations, since the invariant distribution then puts almost all weight on a monomorphic state (6). While this approach is suitable when the equilibria of interest are monomorphic, our paper is concerned with the stability of mixed equilibria (i.e., polymorphic states), where a producer strategy and an enforcer strategy co-exist. One could attempt to overcome this difficulty by assuming that agents can be programmed to play mixed strategies (7), thereby making mixed equilibria correspond to monomorphic population states. However, this is not an attractive option in our setting, since it would imply that our cooperative equilibria are maintained not by the co-existence of enforcers and producers, but by the presence of a single type that mixes between being an enforcer and a producer. By contrast, in the case of a (smooth or exact) best-response dynamic like the one we use, mixed equilibria may correspond to polymorphic absorbing states.
Other approaches create scope for analytical solutions by taking the payoff sensitivity of revisions to some limit. By letting the payoff sensitivity of revisions go to zero (which in our model would imply taking the noise parameter $\eta$ to infinity), one obtains what is known as the limit of weak selection (5). Typically, this assumption is invoked in the context of imitative processes or birth-death processes. One then proceeds by identifying the strategies whose share in the invariant distribution is above (favoured) or below (not favoured) their respective share in the uniform distribution (8). Again, the fact that we are interested in genuinely polymorphic mixed equilibria makes this approach unappealing. We are interested in whether a combination of strategies has a selective advantage together, not whether a single strategy has an advantage on its own (as in (7)). For this, instead of making revisions arbitrarily insensitive to payoff, we make them maximally sensitive to payoff by letting $\eta$ go to zero, so that the payoff-sensitive revisions become exact best responses.
It should be noted that our simulations approximate the invariant distribution without imposing any limiting assumptions on either mutation rates or the intensity of selection. The limiting assumptions are only made for analytical tractability. Nevertheless, we find coherence between our simulation results and the analytical results derived under the limiting assumptions.

S4. Numerical Approach
A. Simulation Methods. In order to approximate the invariant distribution we simulate the learning process for seven different (randomly drawn) initial conditions and compute the time average of the different strategies over the seven runs. We iterate the repeated game over $10^6$ periods (i.e. $10^6$ instances of the repeated game and equally many revisions), with one agent revising in each iteration, in a population of 50 agents. Each round of the repeated game is played out as described in fig. 1 of the main text. We use the baseline parameter values reported in Table S1.

In the case of an odd number of producers or an odd number of enforcers some agents will not be matched. In the production step unmatched producers (in case of an odd number of producers) simply earn the autarky payoff $w$. In the enforcement step unmatched enforcers do not earn any tax revenue and cannot punish any producers. In the meta-enforcement step unmatched enforcers (in case of an odd number of enforcers) keep the payoff they earned in step 2, and their reputation is unaltered.
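The structure of this approximation can be sketched as follows (Python; the `step` callback, which plays one repeated game and performs one revision, and the uniform multinomial initial condition are our illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

N, PERIODS, RUNS = 50, 10**6, 7
STRATEGIES = ("CP", "DP", "CE", "DE")

def approximate_invariant_distribution(step):
    """Average strategy shares over RUNS runs of PERIODS iterations each,
    where `step` maps a strategy-count vector to the next one (one
    repeated game plus one revision, as described in the main text)."""
    totals = np.zeros(len(STRATEGIES))
    for _ in range(RUNS):
        n = rng.multinomial(N, np.full(len(STRATEGIES), 1 / len(STRATEGIES)))
        acc = np.zeros(len(STRATEGIES))
        for _ in range(PERIODS):
            n = step(n)
            acc += n / N
        totals += acc / PERIODS
    return totals / RUNS
```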
Revisions occur as described in the main text in the Dynamic Framework section. When evaluating payoffs for revisions, the various $\pi_Y$ in eq. (6) of the main text are the realised average payoffs of $Y$-strategy agents in the current round.

Figure S2 shows the effect of adding PE to the set of strategies, under the same conditions as in fig. 4 of the main text. Overall the results are similar to those obtained in the absence of PE. Naturally, when the variable cost $v$ is increased sufficiently, PE replaces CE, since the cost is only paid by the latter. For very high levels of the continuation probability $\delta$ there is a sharp increase in PE and a corresponding drop in CE. We believe that this is because a population of only DE and PE becomes an equilibrium when $\delta$ is sufficiently high (higher than what is needed for an equilibrium consisting of only DE and CE to exist).

Figure S3 displays the results of varying the action mistake probability $\mu$, the logit precision $\eta$, and the revision mistake rate $\varepsilon$. We consider both the case where the set of enforcer strategies consists of only CE and DE, and the case where PE is added to the mix. For modest levels of noise, the effect on cooperation is small. When action mistakes increase beyond 2.5% cooperation deteriorates substantially. We stress that this is the probability of a single mistake. A typical enforcer faces four producers and one enforcer in each round, meaning that, in our baseline, the probability of making at least one mistake is approximately 12%. So, the typical CE has a 12% risk of acting in a way that confers bad standing.
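With $\mu = 0.025$ and five interactions per round (four producers and one enforcer):

$$1 - (1-\mu)^5 = 1 - 0.975^5 \approx 0.119 \approx 12\%.$$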

D. Robustness to Initial Reputations.
It is desirable that the reputation system be robust to mistakes. To this end, our simulations have included action (execution) mistakes, which ensure that all kinds of enforcers may end up in bad standing. As a further robustness check, we now consider starting our simulations in a state where all enforcers are in bad standing. The results are shown in Figure S4. Figure S4a shows the results from simulations where enforcers begin with maximal bad standing (i.e. with bad standing equal to $\kappa$), for varying values of $\kappa$. Interestingly, this panel illustrates a trade-off. On the one hand, higher values of $\kappa$ make it harder for CEs to shed their bad reputation. On the other hand, higher values of $\kappa$ make it harder for DEs to gain good standing (through mistakes, or through being randomly matched to CPs in step 2 and to enforcers in bad standing in step 3) and consequently avoid punishment. This can be seen in Figure S4a, where maximal cooperation is reached for $\kappa = 2$. Figure S4b shows that as enforcers start off with increasingly worse standing, the advantage conferred to the CE strategy decreases and cooperation decreases in the invariant distribution (numerically approximated by time averages). This is to be expected. Note that, in our baseline, the expected length of a period (supergame) is 10 rounds and the number of punishment rounds is $\kappa = 8$. This means that if everyone starts out in bad standing, then for at least 4/5 of their expected lifespan both CE and DE agents will be in bad standing. So, there are very few rounds in which CEs can earn a higher payoff than DEs. With less extreme initial reputations, the CEs spend a smaller fraction of their expected lifespan in bad standing, hence they can, and eventually do, earn higher payoffs than DEs.