Tit for Tattling: Cooperation, communication, and how each could stabilize the other

Indirect reciprocity is a mechanism by which individuals cooperate with those who have cooperated with others. This creates a regime in which repeated interactions are not necessary to incent cooperation (as would be required for direct reciprocity). However, indirect reciprocity creates a new problem: how do agents know who has cooperated with others? To know this, agents would need to access some form of reputation information. Perhaps there is a communication system to disseminate reputation information, but how does it remain truthful and informative? Most papers assume the existence of a truthful, forthcoming, and informative communication system; in this paper, we seek to explain how such a communication system could remain evolutionarily stable in the absence of exogenous pressures. Specifically, we present three conditions that together maintain both the truthfulness of the communication system and the prevalence of cooperation: individuals (1) use a norm that rewards the behaviors that it prescribes (an aligned norm), (2) can signal not only about the actions of other agents, but also about their truthfulness (by acting as third party observers to an interaction), and (3) make occasional mistakes, demonstrating how error can create stability by introducing diversity.


Introduction
We will argue that a stable state of cooperation and effective communication can be maintained when agents:
1. act and signal according to an 'aligned' norm, that is, a norm that rewards the actions that it prescribes. For example, if the norm prescribes defection against agents in 'bad moral standing', it must also reward defection against those agents (Section 3.3.3).
2. occasionally deviate from their strategy, which prescribes particular actions in particular situations. If everyone employs the same strategy, the environment becomes homogenized to the point where agents become unprepared for novel threats. Deviations keep the environment sufficiently variable that agents can remain prepared (Section 4.1).
3. exert normative pressure on each other's signals, which allows benefits to be disproportionately distributed not only to cooperators in the prisoner's dilemma, but also to truthful communicators (Section 4.1.1).
In Section 2, we will provide background on relevant concepts, which some may choose to skip. In Section 3, we present a basic model of indirect reciprocity, showing why communication does not remain stable in such a regime. We then present an altered model in Section 4 that does have a stable, cooperative state. Finally, in Section 5, we discuss the broader significance of our results.

Indirect reciprocity stabilizes cooperation in large groups...
Indirect reciprocity can help to explain pairwise altruism when acts are unlikely to be directly reciprocated, such as when groups are large, because third parties can withhold cooperation specifically from those with bad reputations (creating targeted punishment [Boyd and Richerson, 1988]). It has also been observed experimentally: people often cooperate disproportionately with those who have cooperated with others even when there is no possibility of having their act directly reciprocated, suggesting that reputation is not just an aid for estimating one's own future payoff [Wedekind and Milinski, 2000]. And ethnographically, some societies tolerate stealing from families with bad reputations [Muthukrishna, 2021, Bhui et al., 2019].
A strategy is a set of rules for acting and signaling. Sometimes we refer to a strategy as a 'norm'; we do this when the strategy encodes a set of rules that is followed by the large majority of a social group. The simplest norm in indirect reciprocity is image-scoring, which says to 'cooperate with cooperators, defect with defectors' [Nowak and Sigmund, 1998]. However, this norm punishes those who defect against defectors, even though this is precisely (1) what the norm prescribes and (2) what makes defection costly [Nowak, 2005].
A more nuanced approach might be to introduce the notion of standing (defined recursively: someone is in good standing if they cooperated, or if they defected against someone in bad standing), and to cooperate with agents if and only if (iff) they are in good standing [Ohtsuki and Iwasa, 2005]. Stern judging, similar to standing, additionally puts those who cooperate with those in bad standing into bad standing [Pacheco et al., 2006].
Both standing and stern judging create an incentive to heed an opponent's reputation when choosing how to act, unlike in the image-scoring case, where cooperation is good and defection is bad regardless of your opponent's reputation [Leimar and Hammerstein, 2000, Panchanathan and Boyd, 2004]. In part for this reason, these strategies have been shown capable of stabilizing states of high cooperation, but all under the assumption that agents have perfect information about how others act, either by direct observation or a completely truthful communication system.

...but relies on effective communication
The stability of honest communication systems usually relies on some kind of pressure on the signaler to be truthful [Grafen, 1990, Oliphant, 1996]: for example, when there is a common interest between receiver and signaler [Blume et al., 2001], when signals are costly [Gintis et al., 2001], when there is a cost differential between truthful and non-truthful signaling (even if the equilibrium is cost-free) [Lachmann et al., 2001], when the interaction is a coordination game (where both parties benefit from knowing the world-state of the other) [McElreath et al., 2003, Young, 1998], or when direct observation and partner-choice supplement the signals themselves [Robinson-Arnull, 2018].
In its basic form, communicating the reputations of others places no such pressure on the signaler. Solutions have been offered: for instance, a truth-for-truth reciprocity system can be successful in sparking the rise of a truthful communication system [Oliphant, 1996]. The problem here, however, is that reputations and indirect reciprocity become important precisely when repeated interactions, on which reciprocity relies, become unlikely. But could an indirect reciprocity mechanism enforce truthfulness (Section 4.1.1)?
Truthfulness is not the only requirement of the communication system: agents also need to be forthcoming, that is, to actually share the information they have, despite, for instance, fear of reprisal. While in this model there is no mechanism directly representing such reprisal*, it is still important to mitigate it as an exogenous risk by making signal-withholding costly. The stable strategy deals with this by treating failures to signal in the same way it treats lying (Section 4.1.1).

Model 1: a first pass
In this section, we will define Model 1, which will provide us insight into what conditions are required for a stable state of altruistic cooperation and effective communication. It provides a necessary foil for interpreting our primary model (Model 2; Section 4). Classic models of indirect reciprocity lock the signaling system, restricting an agent's strategies to the actions they take (whether to cooperate or not). Our first model will remedy this by allowing for agency in both action and signaling.

Effective communication
We say that a communication system is a set of mappings from meanings to symbols, and we define an effective communication system as one that is
1. uniform, that is, everyone abides by the same mapping from meanings to symbols,
2. forthcoming, that is, agents never fail to signal when they possess relevant information, and
3. informative, that is, everyone's mapping distinguishes between at least two meanings (in our case, the 'meanings' are facts about the actions of other agents), guaranteeing that at least some information gets transmitted.

* For more on this, see [Wiessner, 2009], who has examined how well the assumptions of game-theoretic formulations like the ultimatum and dictator games align with living social norms. We might expect similar alignments and deviations for our communication model.
The uniformity criterion corresponds to truthfulness: an untruthful agent would be one that uses a different mapping to the one everyone has agreed on. If everyone agrees that 0 means 'bad standing', an agent that uses 0 to map to good standing is untruthful.

Cooperation
Cooperation means paying a cost, γ, to confer a benefit, β, on someone else, with the stipulation that β > γ, that is, cooperation produces a net benefit for the aggregate of agents (there are of course interesting cases, beyond our scope, where γ > β). If an interaction consists of two agents coming together and deciding whether to cooperate, we obtain the payoff matrix in Table 1. This is a prisoner's dilemma, because β > β − γ > 0 > −γ. Agents in our model will interact repeatedly in prisoner's dilemmas, each round with a new partner.
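As a concrete illustration, the game can be written down in a few lines. The numeric values of β and γ below are arbitrary placeholders satisfying β > γ > 0, not values from the paper:

```python
# A sketch of the payoff structure in Table 1, with illustrative values
# BETA (benefit) and GAMMA (cost); the model only requires BETA > GAMMA > 0.
BETA, GAMMA = 3.0, 1.0  # hypothetical numbers, not from the paper

def payoff(my_move: str, partner_move: str) -> float:
    """My payoff: I pay GAMMA if I cooperate ('c'); I gain BETA if my partner does."""
    return (BETA if partner_move == "c" else 0.0) - (GAMMA if my_move == "c" else 0.0)

# The prisoner's-dilemma ordering from the text: beta > beta - gamma > 0 > -gamma
assert payoff("d", "c") > payoff("c", "c") > payoff("d", "d") > payoff("c", "d")
```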

Stability
Strategy A is stable if no other strategy can invade it, that is, no other strategy can proliferate in a population of 100% A-type agents (when A is said to 'predominate').
To make this precise, let R_{x,1}(y, z) be the payoff of strategy y against strategy z when a randomly selected agent has strategy x with probability 1. Our requirement of stability can then be stated as:

E[R_{A,1}(A, A)] > E[R_{A,1}(B, A)] for all strategies B ≠ A.

In words: As do better against As than Bs do, when A is predominant in the population. Since interactions will be with an agent of type A with probability 1, this is equivalent to the requirement that A strictly outperforms B. We must take expectations since payoffs are random.
This condition is stronger than Maynard Smith's [Maynard Smith and Price, 1973]: we require that A strictly outperforms any potential invader. Meeting this stronger requirement means that A will be stable not only against invasions by a single strategy, but also by a combination of strategies. For instance, Maynard Smith's conditions allow for the possibility of a third, spoiler strategy C, against which B does so well that its relative disadvantage is negated. Our definition of stability is robust against such 'spoiler' strategies because if A strictly outperforms all other strategies from the start, there is no possibility that a third strategy would increase its population enough to become a spoiler.

The model (Model 1)
The model unfolds in an infinite number of rounds in which agents interact in prisoner's dilemmas.† Agents also signal by 'tagging' each other with a single bit of information, 0 or 1. Tags indirectly affect an agent's payoff because their partners might act differentially based on their tag. For simplicity, agents can only have one tag at a time, so whenever they are tagged, their previous tag is overwritten. A round consists of two steps:

Model 1 description
1. Everyone pairs up with a partner at random, and makes a choice about whether to cooperate or defect in a prisoner's dilemma. An agent's choice may depend on their partner's tag from the previous round (Figure 1(1)).
2. Everyone signals a new 'tag' for their partner. They can use whatever mapping they desire from the information they have (for instance, their partner's action, their action, their partner's old tag, their old tag, etc.) to the available symbols (1 and 0). The mapping can even be random, but we assume agents tag each other simultaneously and thus do not know how the other agent has tagged them. This process can be imagined as your partner writing a 0 or 1 on your forehead for your next partner to see, as the signal need only be known to one's next partner (Figure 1(2)).
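A round of this kind can be sketched as a minimal simulation. The paper specifies no implementation, so all function and variable names here are our own:

```python
import random

def model1_round(tags, act, signal, rng):
    """One Model 1 round: (1) random pairing and simultaneous action based on
    the partner's tag; (2) simultaneous tagging, where an agent's new tag
    depends on their own action and the signaler's (their partner's) old tag."""
    ids = list(tags)
    rng.shuffle(ids)
    new_tags = dict(tags)
    for a, b in zip(ids[::2], ids[1::2]):
        action_a, action_b = act(tags[b]), act(tags[a])  # each acts on partner's tag
        new_tags[a] = signal(action_a, tags[b])  # b writes a's new tag
        new_tags[b] = signal(action_b, tags[a])  # a writes b's new tag
    return new_tags

# Usage: stern discriminators starting from arbitrary tags all end up 1-tagged,
# since both cooperating with a 1 and defecting with a 0 conform to the norm.
act_dc = lambda partner_tag: "c" if partner_tag == 1 else "d"
signal_stern = lambda action, tag: 1 if (action, tag) in {("c", 1), ("d", 0)} else 0
final = model1_round({i: i % 2 for i in range(10)}, act_dc, signal_stern, random.Random(0))
assert all(t == 1 for t in final.values())
```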
The major difference between this indirect reciprocity model and others is in step (2): most, with the exception of [Smead, 2010], 'lock' the communication system, pre-ordaining that agents must, for example, signal 0 about defectors and 1 about cooperators [Leimar and Hammerstein, 2000, Nowak, 2005, Nowak and Sigmund, 1998, Ohtsuki and Iwasa, 2005, Pacheco et al., 2006, Yamamoto et al., 2017]. However, we are interested in the conditions required to stabilize a particular communication system, and must therefore allow it to vary.

Strategies
To answer our question, we will seek to find a stable strategy (that includes both a prescription for how to act and how to signal) that, when followed by everyone in the population, leads to a stable state of (1) high cooperation and (2) effective communication. We will call this the focal strategy.
We want a relatively parsimonious strategy that is stable; for this reason, we keep the set of example strategies in the coming sections simple. We will show, however, that the focal strategy is stable against all potential invaders, not just invaders from the restricted set we present in the coming sections.

† Following the infinite number of rounds, there is an implicit reproductive step that we do not directly model, such that one's representation in the next generation is directly proportional to one's success (i.e., payoffs).

Action
One can imagine many possible action strategies. For example, the set of action strategies that take into account only the tag of the actor's partner are (1) cooperator (cooperate with everyone), (2) defector (defect with everyone), (3) discriminator (defect with 0s, cooperate with 1s), and (4) reverse-discriminator (defect with 1s, cooperate with 0s). These possibilities are shown in Table 2. Note that (3) and (4) could just as easily have swapped names, because the symbols 0 and 1 are, a priori, meaningless. Furthermore, one could imagine action strategies with many more possible inputs, such as one's own tag or one's previous action. In the end, we will find a strategy stable against any other strategy, including these more complex ones.
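This strategy space is small enough to enumerate directly. The sketch below uses our own naming; the two-character IDs follow the convention described under Table 2:

```python
# The four tag-conditioned action strategies of Table 2, keyed by their
# two-character ID: (action toward a 0-tagged partner, action toward a 1-tagged one).
ACTION_STRATEGIES = {
    "cc": ("c", "c"),  # cooperator
    "dd": ("d", "d"),  # defector
    "dc": ("d", "c"),  # discriminator
    "cd": ("c", "d"),  # reverse-discriminator
}

def act(strategy_id: str, partner_tag: int) -> str:
    """Look up the prescribed action ('c' or 'd') given the partner's tag (0 or 1)."""
    return ACTION_STRATEGIES[strategy_id][partner_tag]

assert act("dc", 1) == "c" and act("dc", 0) == "d"  # the discriminator
```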

Action Strategies
Name                   ID  Action for 0s  Action for 1s
Cooperator             cc  c              c
Defector               dd  d              d
Discriminator          dc  d              c
Reverse-discriminator  cd  c              d

Table 2: Some possible action strategies. Each row represents a possible action strategy. The first and second bit in the ID specify how to act towards a partner with tags of 0 and 1, respectively.

Signal
After acting, agents may signal about their partner by giving them a one-bit tag. An agent's tag is observed by their subsequent partner (unless it is overwritten before then).
The signaling scheme we consider as a candidate for a stable state takes into account the action of the agent being tagged and the tag of their partner (who also happens to be the signaler, since agents tag their partners). Thus, there are four situations to be distinguished: cooperating with a 0, cooperating with a 1, defecting with a 0, and defecting with a 1. Because strategies in this class take into account both the actor's present action and their partner's action from one time step back (preserved by the tag), they are sometimes called 'second-order norms'.
There are 16 possible signaling strategies in this class, as there are two possible signals for each of these four outcomes. We will sometimes denote a signal strategy as a string of four binary digits (obtained by reading off the columns in Table 3 vertically). For instance, 1001 corresponds to the stern judging signal strategy.
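These strings can be unpacked mechanically into signal strategies. Note that the column order of Table 3 is not recoverable from the text, so the ordering below is an assumption of ours, chosen so that stern judging comes out as the string 1001 quoted above:

```python
from itertools import product

# Situations in an assumed column order: (cooperate-with-1, cooperate-with-0,
# defect-with-1, defect-with-0). This ordering is a guess, picked so that
# stern judging reads as '1001', as stated in the text.
SITUATIONS = [("c", 1), ("c", 0), ("d", 1), ("d", 0)]

def make_signal(bits: str):
    """Turn a 4-bit string into a second-order signal strategy:
    (partner's action, signaler's own tag) -> tag to give the partner."""
    table = dict(zip(SITUATIONS, (int(b) for b in bits)))
    return lambda action, signaler_tag: table[(action, signaler_tag)]

all_strategies = ["".join(bits) for bits in product("01", repeat=4)]
assert len(all_strategies) == 16

stern = make_signal("1001")
assert stern("c", 1) == 1 and stern("d", 0) == 1  # norm-conforming acts get a 1
assert stern("c", 0) == 0 and stern("d", 1) == 0  # non-conforming acts get a 0
```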
We will also consider the possibility that invading strategies might not signal at all in certain situations; this will pose no problems to our stability results.

Table 3: Three examples of signal strategies. There are 16 such strategies, too many to reproduce here, so we show three possibilities. This class of signal strategies can take into account the tag of the actor's partner (the person affected by the action). Image-scoring does not care about the partner's tag, while standing and stern judging do.

Figure 1: Agents act in step (1) and then send signals about each other in step (2), producing new tags for the next round. Pictured is the behavior of the stern discriminator, which cooperates with 1s and defects with 0s. All agents who follow this action protocol receive a 1-tag from stern discriminators, and get cooperated with in the following round. Others receive a 0-tag, and get defected with. Thus the strategy rewards the behaviors it prescribes; it is an aligned norm.

Now it is time to select a candidate focal strategy (the strategy whose stability we will analyze) from the set we considered above. Consider the stern-judging discriminator:

The focal strategy
1. Action: cooperates with 1s, defects with 0s.
2. Signaling: tags cooperation with 1-agents and defection with 0-agents with a 1, and cooperation with 0s and defection with 1s with a 0.
We will henceforth call these agents 'stern discriminators', or use the strategy-string notation dc1001. Why is this strategy particularly promising? Partially because we have the benefit of hindsight. But there are good reasons to believe that stern discriminators may successfully fend off incursions of other strategies: the signal strategy tags everyone who follows the action strategy's dictates with a 1, and the action strategy cooperates with 1s, so anyone who follows the strategy gets cooperated with. Conversely, anyone who doesn't follow the action strategy gets defected with. In other words, the strategy is admirably self-consistent: it rewards all of the behaviors it prescribes and punishes those that it doesn't. If everyone follows it, it is an aligned norm. The intuition, then, is that any alternative will do worse in a world dominated by stern discriminators because it will be punished more and rewarded less.
In the following section we will see if this intuition bears out, by analytically determining the stability of the stern discriminators. To make the mathematics tractable, we will make two simplifying assumptions: first, we will assume an infinite population; second, as already described in Section 3.2, we will assume infinite rounds per generation, so that agents' actual payoffs converge to their expected payoffs (for further justification, see Appendix B).

Analysis
The analysis will proceed by checking if the stern discriminator state is
1. cooperative (where we define 'cooperative' as a world in which at most an ϵ-fraction of agents defect, where ϵ > 0 but can be made arbitrarily small),
2. effectively communicative, and
3. stable.

Cooperative and communicative: check
First, cooperative. Let us suppose agents start with an arbitrary assignment of tags (could be all 0, all 1, or any mix). Then, in the first round, stern discriminators will cooperate with 1s and defect with 0s. Because all who follow the stern discriminator strategy receive a tag of 1, everyone will receive a 1-tag to start the second round, and thus, everyone will cooperate with their partners in the second round. Everyone will thus receive a 1-tag again and the cycle will continue, so all rounds except the first are fully cooperative (Figure 2(A)). Clearly the state consisting entirely of stern discriminators is effectively communicative, since all agents abide by a separating, uniform mapping (as specified by the stern judging norm), and never withhold information (see Section 3.1.1).

Stability: fail
It remains to check if the state is stable (spoiler: it won't be). In this section, we will identify what the instabilities of this state are, paving the way for the additional mechanisms we propose in the next section.
It turns out that many alternative strategies can invade this model. Two instabilities (that is, vulnerabilities that allow alternative strategies to perform just as well) lie at the root of these invading strategies:

Figure 2: (A) Proportion of 1-tagged agents. In the first round, agents are arbitrarily assigned 1- and 0-tags. Stern discriminators cooperate with the 1s and defect with the 0s, and all receive a 1-tag (according to the stern judging signal strategy) as a result. This leads to cooperation forever after. (B) Factors influencing an agent's present payoff. Many factors from previous rounds determine an agent's payoff in the current round; however, none of them have to do with the agent's own signaling strategy.

Instabilities in Model 1
1. Unpunishability of language -In the base model, there is no payoff difference between individuals with the same action strategy but different languages. This is because agents' payoffs depend on two things: their own actions and their partners' actions, neither of which depend on their own signal strategies (see Figure 2(B)). Thus, given an action strategy that can perform as well as the stern discriminator, swapping out its signal strategy will not hamper its ability to invade (e.g. a 'pushover' discriminator that signals 1 about everyone does just as well as a stern discriminator).
2. Unexpressed traits - We know that in the stern discriminator world, everyone cooperates, and therefore everyone receives a tag of 1. Therefore, no one ever has to decide how to act against an agent with a tag of 0, and no one has to decide how to tag an agent who has defected (assuming that agents interact with mutants with negligible frequency). This means that the part of the action strategy that prescribes how to act against 0s, and the part of the signal strategy that prescribes how to signal about defectors (along with the part that prescribes how to signal about cooperation against a 0-tagged agent), remain latent. Thus, for example, unconditional cooperators are indistinguishable from discriminators.

Referring to our strategy-string nomenclature, unexpressed traits imply that there are four strategy clusters, each of whose members have identical payoffs (where a dash indicates that the 'slot' can be filled with either of its two possibilities): -d---0, -d---1, -c---0, and -c---1. Payoffs within each cluster are identical because members differ only on traits that never get expressed. And because the last two clusters differ visibly only in their signal strategies, the last two clusters also have identical payoffs. We conclude that any member of the last two clusters can invade.

Still, the stern discriminators can at least fend off the defecting clusters, -d---1 and -d---0: their defections will provoke stern discriminators to 0-tag them (since defection against 1-tagged agents does not conform to the stern-discriminator norm). This will lead their subsequent partners to defect against them, generating a payoff of zero (compared to β − γ for stern discriminators).
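The unpunishability of language can be seen in a two-line calculation. The payoff values are placeholders, and `round_outcome` is our own simplification of the steady state, not anything specified in the paper:

```python
BETA, GAMMA = 3.0, 1.0  # hypothetical payoff values with BETA > GAMMA

stern = lambda action, tag: 1 if (action, tag) in {("c", 1), ("d", 0)} else 0
pushover = lambda action, tag: 1  # tags everyone with a 1, regardless

def round_outcome(partner_signal):
    """Steady-state Model 1 round for a 1-tagged agent paired with a 1-tagged
    stern discriminator: both cooperate, and the partner writes the new tag.
    Note that the focal agent's own signal strategy appears nowhere."""
    return BETA - GAMMA, partner_signal("c", 1)

# A pushover-signaling discriminator earns, and is tagged, exactly like a stern
# one, because payoff and tag never depend on one's own signal strategy.
assert round_outcome(stern) == (2.0, 1)
```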

Model 2: next!
In this section, we address the two instabilities of the base model with two modifications. First, to address the unpunishability of language, we add a mechanism called meta-signaling to the model. This allows agents to signal not only about each other's actions but also about each other's signals, thereby making language relevant by folding the truthfulness of an agent into their reputation.

Instability                  Remedy
Unpunishability of language  Meta-signaling tags agents based on their signal
Unexpressed traits           Error creates diversity

Table 4: Instabilities in Model 1 and their remedies in Model 2. Observers who can tag agents based on their signal are called meta-signalers. Their presence means that each agent has a probability p of receiving a tag based on their signal (even meta-signalers themselves), which makes signaling relevant to their own payoff. The addition of error means that agents deviate from their prescribed strategy with probability ϵ (in the case of action) or δ (in the case of signal). Error introduces the heterogeneity required to bring all traits into expression at least some of the time.
But meta-signaling alone is not enough. Even if agents can signal about each other's truthfulness, there are still no defections, and no agents with a 0-tag, so any components of the strategy relevant to those cases will remain latent. To remedy this second instability, we add error: agents deviate from their prescribed strategies with some small probability (ϵ for actions, δ for signals). This introduces the diversity required to bring out latent traits. We summarize the instabilities and their remedies in Table 4.

Model 2 description
1. In each round, agents are assigned either to the role of actor (with probability 1 − p) or observer (with probability p).
2. Observers are assigned randomly to another agent; note that the agent being observed can be either an actor or another observer. Thus, assuming an infinite population, and that each agent has a maximum of a single observer, one's probability of being observed in any given round is p (we will show in Section 5.2 that the single-observer requirement is not strictly necessary).
3. Actors interact in a prisoner's dilemma, making cooperation and defection choices based on what their strategies prescribe (Figure 3(1)), with error rate ϵ (see Section 4.1).
4. Actors signal by tagging each other, either a 0 or a 1, based on their signaling strategies (with error rate δ). An agent's tag will be observable by whoever next interacts with them (as long as it is not overwritten before then), in addition to the observers of the interaction (more on this in Section 4.1.1). Depicted in Figure 3(2).
5. Observers 'meta-signal' by determining whether they agree with their observee's signal, and they tag their observee according to their meta-signaling strategy (also with error rate δ), overwriting the observee's pre-existing tag (Figure 3(2)).
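The two stochastic ingredients of a round, role assignment and error, can be sketched directly (function names are ours, not the paper's):

```python
import random

def with_error(prescribed: int, err: float, rng: random.Random) -> int:
    """Follow a binary prescription with probability 1 - err; otherwise do
    the opposite (the error mechanism of Section 4.1)."""
    return prescribed if rng.random() >= err else 1 - prescribed

def assign_roles(n: int, p: float, rng: random.Random) -> list:
    """Step 1: each agent independently becomes an observer with probability p,
    and an actor otherwise."""
    return ["observer" if rng.random() < p else "actor" for _ in range(n)]

# With err = 0 the prescription is always followed; with err = 1 it is
# always flipped.
assert with_error(1, 0.0, random.Random(0)) == 1
assert with_error(1, 1.0, random.Random(0)) == 0
```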

Strategies and error
Action and signaling work the same in this new model, with the exception that agents make errors: that is, agents follow the prescriptions of the action and signal strategies only with probability 1 − ϵ and 1 − δ, respectively -otherwise, they do the opposite. Apart from the addition of error, observers make the difference between Model 2 and Model 1, so we will first describe their behavior in more detail, and then define an extended focal strategy that includes a meta-signaling strategy (the stern discriminators have a defined action and signal strategy, but we've not yet said how they should meta-signal).

Observers (meta-signalers)
Every individual, whether actor or observer, has someone watching them with probability p. There can be a chain of observers: perhaps a first-level observer is observing one of the actors, and some second-level observer is observing the first-level observer (see Figure 3). If p is small, however, these chains are unlikely to be very long: their expected length is 1/(1 − p) − 1, derived from the waiting time for an event (not having an observer) that has probability 1 − p.
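A quick Monte Carlo check of this expectation (our own sketch; the paper derives it analytically):

```python
import random

def chain_length(p: float, rng: random.Random) -> int:
    """Sample the number of observers stacked above one agent: each further
    link in the chain exists independently with probability p."""
    length = 0
    while rng.random() < p:
        length += 1
    return length

# Check the expected chain length 1/(1-p) - 1 (= 0.25 for p = 0.2).
rng = random.Random(0)
p = 0.2
mean = sum(chain_length(p, rng) for _ in range(100_000)) / 100_000
assert abs(mean - (1 / (1 - p) - 1)) < 0.01
```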
We assume the observer can see the chain of events (actions, signals, and meta-signals) before it. Using this information, observers can 'meta-signal' according to some strategy (named so because they are signaling about someone's signal). This meta-signal then overwrites the observee's previous tag.
One way to specify meta-signaling strategies is to consider whether an observer would have signaled in the same way as the agent they are observing (they agree) or in a different way (they disagree). If the signaler fails to signal when the observer would have, this counts as a 'disagree'. An agent follows the so-called 'separator' meta-signaling strategy if they signal 1 when they agree and 0 when they disagree.
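The separator strategy reduces to a small function. Representing a withheld signal as `None` is our own encoding choice:

```python
def separator_meta_signal(observee_signal, my_would_be_signal) -> int:
    """The 'separator' meta-signaling strategy: tag 1 on agreement, 0 on
    disagreement. A withheld signal (None) always counts as a disagreement."""
    if observee_signal is None:
        return 0
    return 1 if observee_signal == my_would_be_signal else 0

assert separator_meta_signal(1, 1) == 1   # agreement
assert separator_meta_signal(0, 1) == 0   # disagreement
assert separator_meta_signal(None, 1) == 0  # withholding counts as disagreement
```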
Of course, one could imagine a wealth of much more complex meta-signaling strategies; we will, however, show the new focal strategy to be stable against all such alternatives.

Figure 3: In stage (1), agents make decisions on how to act. The asterisk indicates that the defection from the right-hand agent is an action that doesn't conform to the stern discriminator strategy. Every agent is observed with probability p (the decreasing opacity of the observers represents the geometric fall-off in the probability of their presence). Then, in stage (2), the actors tag each other, with the non-conforming actor getting a 0-tag. Agents also receive a tag from their observer (if they have one), based on whether their observer agrees or disagrees with their tag (e.g. the second agent from the bottom on the left sends a non-conforming signal, and hence in turn receives a 0-tag).

The new focal strategy
In the spirit of the rest of the stern discriminator strategy, we say that the new focal strategy uses the 'separating' meta-signaling strategy: 0 if they disagree, 1 if they agree. This selection maintains the property that the strategy rewards the behaviors it prescribes: anyone who is observed following the prescribed signaling and meta-signaling strategies gets a 1-tag from the observers, leading to cooperation in the subsequent round. By contrast, anyone who is observed failing to follow the signaling and meta-signaling strategies gets a 0-tag and a subsequent defection.

Analysis
In this section, we will test whether Model 2 is cooperative, effectively communicative, and stable (it will be). We assume a population of mostly stern discriminators (future work could address arbitrary mixes of strategies).

Cooperative and communicative: check
The state is clearly effectively communicative since all agents abide by a separating mapping from meanings to symbols, never withholding information. To see that it is cooperative, we show in Appendix C.1 that while it is true that ϵ > 0 and δ > 0 imply that there are a nonzero portion of defections, that portion can be made arbitrarily small by shrinking ϵ and δ.

Stability: check
Learning from the failures of Model 1, for stability we must check that
1. any strategy that deviates from the stern discriminators, in a world dominated by stern discriminators, does strictly worse, and
2. no latent traits exist that would cause invading strategies to be indistinguishable from stern discriminators.

Any alternative does worse
We start with the first step. Whenever an agent makes any decision, whether on how to signal or how to act, there are three possible consequences of that decision:
1. first-order costs: immediate costs paid for a decision (for instance, paying the cost γ of cooperation),
2. second-order costs: costs borne for a decision in the subsequent action round (for instance, their subsequent partner defects with them for having a 0-tag), and
3. effects on future scenarios: the present decisions of agents may affect the conditions surrounding their future decisions, potentially affecting payoffs.
These costs are the changes in payoffs relative to following the stern discriminator strategy. Thus, to show that deviating is always costly, we need to show that
i. given a particular decision-making scenario in a world dominated by stern discriminators, deviating from the stern discriminator strategy is costly (the sum of the first-order (1) and second-order (2) costs is positive), and
ii. in no case does deviating in a particular scenario produce future scenarios with higher expected payoffs (3).
The combination of these two facts implies that deviating is always, on net, costly. The intuition behind part (i) of the proof is that the first-order benefits of deviating are at most around γ (since, by deviating, you can potentially save the cost of cooperating), but the second-order costs can always exceed this since they are proportional to β (stern discriminators will withhold cooperation from those who deviate, costing a deviant β), and β is greater than γ. For part (ii), we will show that deviating often has no effect on future decision-making scenarios, but even when it does, it cannot increase one's payoff.
We will conclude that there is a set of parameters p, ϵ, k, and δ for which any alternative strategy, pure or mixed, does worse.
Let us be precise with these bounds (shown in Figure 4). To meet our stability criterion, we need our effective error rate (defined as ϵ′ := ϵ + δ − 2ϵδ) to satisfy

ϵ′ < (1/2)(1 − γ/β).

The higher the benefit is relative to the cost, the closer the error rate can be to 1/2.
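The effective error rate can be sanity-checked in code. Reading ϵ′ as the probability that exactly one of two independent errors occurs is our interpretation of the formula, not a claim from the text:

```python
def effective_error(eps: float, delta: float) -> float:
    """epsilon' = eps + delta - 2*eps*delta. Algebraically, this equals the
    probability that exactly one of two independent error events (rates eps
    and delta) fires -- offered here as an interpretation of the formula."""
    return eps * (1 - delta) + delta * (1 - eps)

# Identical to the expanded form used in the text:
assert abs(effective_error(0.01, 0.02) - (0.01 + 0.02 - 2 * 0.01 * 0.02)) < 1e-12
```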
Then we define r = (1 − 2ϵ′)β/γ, so named because it is the ratio between the expected second-order benefit of cooperating and the first-order cost (the benefit is not quite β because of potential error). The probability of being an observer, p, must satisfy a corresponding bound, and the discounting factor, k, must satisfy a bound of its own. If ϵ′ = p = 0, the latter collapses down to kβ > γ, which makes sense, as the discounted benefit would need to exceed the immediate cost (this relation is represented by the dotted line in the left panel of Figure 4). The added complexity in the denominator stems from the fact that not every defection is punished. Some readers may choose to jump straight to Section 4.2.2.2, skipping the proof.

In a given scenario, deviation is costly
In this part, we show that in a world of stern discriminators, the total cost of deviating in a particular scenario (the sum of first and second order costs) is positive. We define the cost of doing x as the payoff difference between doing x and doing what a stern discriminator would have done.
First-order costs are borne immediately, and second-order, if ever, in the subsequent action round. Thus, we can calculate the expected costs of an agent's choice in action round n simply by looking at their payoffs in action rounds n and n + 1.
First-order costs (round n) There is no immediate cost or benefit to signaling in any particular way. All potential signaling costs are second-order. Action, however, is a different story: there are immediate payoff differences between acting conformingly and not, because cooperation is always more costly than defection.
These costs are quite simple. There are two ways of acting non-conformingly: cooperating with a 0 or defecting with a 1. Cooperating with a 0 comes with an additional cooperation cost γ. On the other hand, defecting with a 1 produces some upfront benefit (negative cost), as the agent avoids paying γ. Thus, defining F x as the first-order cost of deviating when doing x (e.g. F def is the first-order cost of deviating by defecting), we have F coop = γ and F def = −γ.

Second-order costs (round n + 1) Now we calculate the second-order costs of deviating from the stern discriminator strategy. Conveniently, this looks quite similar for action, signal, and meta-signal deviations. To pay a second-order cost for deviating, three things must happen.
Tag - The deviant must receive a tag for the behavior in which they strayed from the norm; for instance, if the agent deviated in action, the agent should receive a tag for their action and not their signal.
Stick - That tag must stick until the subsequent action round, so that their partner has an opportunity to punish them for it.
Norm - Both the agent tagging the deviant and the agent subsequently acting with the deviant must not err, because if they do, they will fail to punish the deviation - that is, the norm must be effectively followed.
We will start by assuming all three of these conditions, then gradually roll back to get the actual expected second-order cost of deviation. First, assuming all three conditions, the second-order cost of deviating is simply β, because if all the conditions are satisfied, the deviant's subsequent partner will defect, costing them the benefit from cooperation β (there is technically a discount factor here, but we leave it out since we do not yet know how many rounds the agent waited between action rounds). If we then roll back the assumption that the norm was effectively followed (because someone erred) in the pipeline, what is the expected cost of deviation? Where S tag, stick is this quantity, we have

S tag, stick = [(1 − δ)(1 − ϵ) + δϵ]β − [δ(1 − ϵ) + ϵ(1 − δ)]β = (1 − 2ϵ′)β.     (7)

In Equation 7 we distinguish between four scenarios: (1) neither the deviant's tagger nor their current partner erred, (2) both erred, or (3, 4) one erred and not the other. If neither the deviant's tagger nor their subsequent action partner errs, or if they both err, the deviant forgoes the benefit of cooperation (paying cost β). However, if one of the two makes an error, the deviant's subsequent partner actually cooperates, giving them a benefit of β. The equation calculates the cost, so benefits are negative and costs are positive. Notice that we defined the effective error rate in Equation 8, ϵ′ = ϵ + δ − 2ϵδ. Here is why. Intuitively, an error in this scenario is when someone who deviated doesn't get the response that one would expect. This happens either if the deviant's signaler makes an error and the subsequent partner acts faithfully on the erroneous signal, or if the signaler is accurate but the subsequent partner makes an error. These scenarios respectively have probabilities δ(1 − ϵ) and ϵ(1 − δ), and adding them up we obtain ϵ + δ − 2ϵδ.
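The four-scenario bookkeeping behind Equation 7 can be checked by brute-force enumeration. The sketch below follows the setup above: the tagger errs with probability δ, the subsequent partner with probability ϵ.

```python
# Sanity check of the effective error rate eps' = eps + delta - 2*eps*delta by
# enumerating the four tagger/partner error combinations from Equation 7.

def expected_cost(beta, eps, delta):
    """Expected second-order cost when the tagger errs w.p. delta and the
    subsequent partner errs w.p. eps (costs positive, benefits negative)."""
    cost = 0.0
    for tagger_errs in (False, True):
        for partner_errs in (False, True):
            prob = ((delta if tagger_errs else 1 - delta)
                    * (eps if partner_errs else 1 - eps))
            # zero or two errors -> deviant is punished (forgoes beta);
            # exactly one error -> deviant is cooperated with (gains beta)
            punished = tagger_errs == partner_errs
            cost += prob * (beta if punished else -beta)
    return cost

beta, eps, delta = 2.0, 0.05, 0.1
eps_eff = eps + delta - 2 * eps * delta
assert abs(expected_cost(beta, eps, delta) - (1 - 2 * eps_eff) * beta) < 1e-12
```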
But what if we roll back the sticking assumption? This requires us to account for two things. First, the tag might be overwritten between the deviation in behavior and the subsequent action round. And second, the longer we wait between the deviation and the subsequent action round, the more the resulting cost will be discounted.
The first term in Equation 9,

S tag = (1 − p)k S tag, stick + p(1 − p)k S tag,     (9)

corresponds to the scenario where the subsequent round is an action round, with probability 1 − p, and therefore the agent pays S tag, stick, but discounted by a factor k since a single round has passed. The second term corresponds to the case where the agent is an observer in the subsequent round (which happens with probability p) but is herself unobserved (which happens with probability 1 − p), and thus the tag persists for another round, and we are back to the initial question: what is the cost assuming you have been tagged for a deviation (of course, we need to discount by a factor k, since in this case a round has passed). Solving, we obtain S tag = (1 − p)k S tag, stick / (1 − p(1 − p)k), the second-order cost of deviating given that you are being tagged by someone on that behavior. Note that S tag > 0 when ϵ′ < 1/2, which makes sense: as long as errors occur less often than not, agents must pay some expected second-order cost if they will be tagged based on their deviation. But we are interested in S x, the second-order costs of deviating when doing x, when an agent may or may not be tagged for it.
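The recursion in Equation 9 can be verified numerically against its solved form; the solved expression here is our reconstruction from the two terms described above.

```python
# Fixed-point iteration of the Equation 9 recursion, checked against the
# closed form S_tag = (1-p)*k*S_ts / (1 - k*p*(1-p)). Parameter names are ours.

def s_tag_iterative(s_ts, p, k, n_iter=10_000):
    s = 0.0
    for _ in range(n_iter):
        # next round is an action round (prob 1-p): pay discounted S_tag,stick;
        # observer but unobserved (prob p*(1-p)): tag persists, so recurse.
        s = (1 - p) * k * s_ts + p * (1 - p) * k * s
    return s

def s_tag_closed(s_ts, p, k):
    return (1 - p) * k * s_ts / (1 - k * p * (1 - p))

s_ts, p, k = 1.6, 0.2, 0.9   # e.g. S_tag,stick = (1 - 2*eps')*beta = 0.8 * 2
assert abs(s_tag_iterative(s_ts, p, k) - s_tag_closed(s_ts, p, k)) < 1e-9
```

The iteration converges because the self-referential coefficient p(1 − p)k is strictly below one.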
First, the signaling case. Agents are tagged based on their signal with probability p. So we obtain S sig = p S tag. Meta-signaling in a non-conforming way has exactly the same second-order cost, S meta-sig = S sig, since the probability of being observed p, the probability of sticking p stick, and the cost S dev, tag are all equal to their corresponding values in the regular signaling case.
In the action case, agents are tagged based on their action with probability 1 − p (in the event that their signal is not being observed). This gives S act = (1 − p) S tag.

Total costs We now would like to calculate the total cost, C x = F x + S x, of deviating from the norm in each scenario, so that we can set bounds on the parameters that will allow for a stable stern discriminator state. We will start with the signaling case: C sig = p S tag. For C sig > 0, we just need p > 0 and ϵ′ < 1/2. And since C coop > C def, if we can find the values for which C def > 0 (that is, it is costly to deviate by defecting), then it will also be costly to deviate by cooperating. Solving for the values of p, k, and ϵ′ that satisfy that inequality, we obtain the bounds we gave at the beginning of Section 4.2.2.1.
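Assembling the pieces, the sign claims about the total costs can be checked on a small parameter grid. The formulas restate the derivation above; the grid values are illustrative.

```python
# Signs of the total deviation costs C_x across a small parameter grid.

def costs(beta, gamma, k, p, eps_eff):
    s_tag = (1 - p) * k * (1 - 2 * eps_eff) * beta / (1 - k * p * (1 - p))
    c_sig = p * s_tag                 # no first-order cost to signaling
    c_def = -gamma + (1 - p) * s_tag  # save gamma now, punished later
    c_coop = gamma + (1 - p) * s_tag  # pay gamma now AND get punished later
    return c_sig, c_def, c_coop

for p in (0.05, 0.2):
    for eps_eff in (0.01, 0.1):
        c_sig, c_def, c_coop = costs(beta=5.0, gamma=1.0, k=0.8,
                                     p=p, eps_eff=eps_eff)
        assert c_sig > 0       # deviant signaling is costly (p > 0, eps' < 1/2)
        assert c_coop > c_def  # defection is the cheapest way to deviate
        assert c_def > 0       # even the cheapest deviation is costly here
```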

Deviations cannot produce beneficial future scenarios
We've shown that deviating in a given scenario is always costly, but can deviating produce future scenarios that are beneficial? Figure 5 shows ways that this might happen.

Figure 5: Ways that non-conforming decisions can create alternative downstream scenarios.
On the left side, an action deviation in round n can lead to a difference in one's signal in round n + 1 which can lead to a payoff difference in round n + 2. Similar downstream effects can occur in signal deviations. However, we show that none of the downstream consequences of a deviation can actually benefit an agent (either they are neutral or detrimental).
While deviant actions may alter the conditions surrounding future decisions, they can't improve future expected payoffs. Recall that, for each type of decision, stern discriminators distinguish between the following world scenarios:
a. Action - the tag of one's partner (two possible scenarios)
b. Signal - one's own tag and the action of one's partner (four possible scenarios)
c. Meta-signal - whether one agrees with the signal of the agent they are observing (two possible scenarios).
We need to show that for each of these cases, a prior deviation cannot produce more beneficial subsequent scenarios. In part (a), it is true that interacting with agents with a 0-tag is more beneficial, because it means that an agent can conformingly defect, saving on the cost of cooperation. However, an agent's previous actions have no effect on the tag of their partner, which is determined by the partner's previous interactions. So deviating cannot create more favorable action scenarios.
In part (b), it is possible for an agent's previous actions to influence both their own tag and their partner's action. However, conforming in both action and signal is the highest payoff action, since it leads to an expected subsequent payoff of (1 − 2ϵ ′ )β. Thus, deviating cannot lead to more beneficial signaling scenarios.
In part (c), an agent's previous actions do not affect the signal of the agent they are observing (which is independent of the observer), so deviating in some prior round cannot create more favorable meta-signaling scenarios.
Thus, no matter the kind of decision, prior deviations cannot produce more beneficial future scenarios. This completes the proof.

All traits are expressed with error
We've found parameter values for which it is always costly to deviate from the stern discriminator strategy. But could the homogeneity of the stern discriminator world restrict the set of possible world states in such a way that an invading strategy never deviates from stern discriminators on the restricted set but is nonetheless different on the full set? If this is the case, that strategy could invade by drift, causing vulnerabilities. We saw how, in Model 1, this vulnerability allowed any of the -c---1 strategies to invade. However, this is no longer a problem in our current model, because of three safeguards.

Any bounded world state comes to pass, due to error. Let us consider an agent's world state to be some sequence of actions, signaling events, and tags that an agent has seen, either as an observer or an actor. The sequence is of bounded length (we assume agents have bounded memory).
Any arbitrary string of events of this form has nonzero probability, because a world state is a conjunction of finitely many agent decisions, all of which have nonzero probability due to error. Therefore, even a strategy that only deviates from stern discriminators in the most esoteric of world states would be caught deviating on occasion -and therefore, would receive a lower payoff.
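A back-of-the-envelope sketch of this argument, with illustrative numbers: each of the L decisions in a bounded world state occurs with probability at least min(ϵ, δ), so the state has a small but nonzero probability per window, and over enough rounds its occurrence is near-certain.

```python
# Lower bound on the probability that an arbitrary bounded world state is
# eventually realized. All numbers here are illustrative assumptions.
import math

eps, delta, L = 0.01, 0.01, 8
q = min(eps, delta) ** L        # lower bound per L-decision window
assert q > 0                    # nonzero, which is all the argument needs

# Probability the state occurs at least once in n independent windows.
n = 10**18
p_ever = 1 - math.exp(n * math.log1p(-q))
assert p_ever > 0.5             # with enough rounds, near-certain
```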
Variables defining world states are limited. Of course, likelihoods for such esoteric events quickly become vanishingly small, which matters in practice since we don't actually have infinite rounds. However, also in practice, agents are under cognitive constraints that mean they will likely not be able to use very long sequences as world states, and instead would probably make use of the information relevant to the event. The world state for a given decision, then, would likely be determined by variables relevant to the situation at hand, such as one's own tag and action and one's partner's tag and action. When world states are determined by this smaller set of variables, any given one is very likely to occur at some point, even if we relax the infinite-round approximation.
Taking advantage of rare world states is hard. But suppose there is a world state that is extremely unlikely to occur once the infinite-round approximation is relaxed, and that strategy B deviates in this world state but no others. The problem for the cooperative equilibrium is not that strategy B, which is indistinguishable from stern discriminators, can invade; the problem is that a third strategy, C, may then invade by taking advantage of the scenario on which stern discriminators and strategy B differ.
Here is the classic example: in the model of indirect reciprocity with no error, unconditional cooperators can invade the world of image-scorers (those who defect with defectors and cooperate with cooperators), because no one in this world is a defector (and thus the situation in which image-scorers and cooperators differ is never realized). This first wave of invaders doesn't perturb the cooperative equilibrium, but they create the opportunity for a third strategy, unconditional defectors, to enter without suffering any consequences (since cooperators cooperate with defectors). That is, the defectors take advantage of the situation on which image-scorers and cooperators differ - the scenario where someone defects - by creating it: by themselves defecting.
In this model, it would be impossible for a lone invader to single-handedly create such rare situations, which consist of unlikely combinations of one's own tag and action, one's partner's tag and action, and for meta-signalers, the chain of signals leading up to them, in addition to the tag of the agent they're observing. Out of these, the only levers an individual agent may pull to create an unlikely world state are their own tag and their own action. Unlike the unconditional defectors in the image-scoring case, who can simply, by their own choices, produce the scenario they are able to take advantage of, these invaders must rely on serendipitous combinations of their own actions and errors of other agents (who will be of the predominant strategy). Thus, these agents cannot reliably create the scenario, making it extremely difficult to exploit.
In summary, there are three safeguards against the problem of unexpressed traits. First, any bounded world state is bound to occur eventually. Second, due to the structure of the interactions, minimal relevant information is available to an agent in any given round, paring down the plausible world states to a set whose members are all reasonably likely to occur. And third, even if there was an extremely rare world state that doesn't occur once the infinite rounds assumption is relaxed, it would be very difficult to exploit because it relies on the confluence of deviations of many agents, not just the agent that would stand to benefit.

Discussion
The three conditions under which communication and cooperation could remain stable, without assuming that any exogenous machinery was doing the work for us, were:
1. Aligned norms, rewarding conforming action strategies
2. Meta-signaling, rewarding conforming signaling strategies
3. Error, ensuring that all components of the strategy specification (both signal and action) will find expression at least some of the time - important because when certain components of the strategy remain unexpressed, there will be clusters of indistinguishable strategies, which can then invade by drift.
We should note that these conditions apply equally to genetically and culturally transmitted strategies, and they are indifferent to whether alternative strategies invade an existing equilibrium by way of mutation, migration, or agents simply choosing to change their strategies. We will first examine these three stabilizers in more detail, and then go on to analyze the resulting equilibrium and its downstream consequences.

Meta-signaling
The concept of meta-signaling makes it clear why it makes sense to study the evolution of communication in the context of cooperation. Research suggests that there must be some kind of pressure on the signaler for a uniform communication system to emerge [Oliphant, 1996]. This pressure may be intrinsic to the signal, or it may result from differential allocations of social benefits and costs to truth-tellers and deceivers. And this differential allocation - whether through kin selection, reciprocity, punishment, or indirect reciprocity - is also how altruism becomes stable, at least conventionally. Because the machinery for such differential allocation is already present for cooperation, it is plausible for it to have been co-opted for the purposes of communication. Meta-signaling co-opts the indirect reciprocity mechanism for the purposes of maintaining the communication system, but one can equally imagine a co-optation of other mechanisms, such as reciprocity (e.g., lying to those who have lied to you) and punishment (e.g., ostracizing those who miscommunicate) for the same purpose.
Meta-signaling helps point at why exaptation through analogy-making is so powerful. We humans build mechanisms that do a lot of heavy lifting: tool-making technologies, and yes, reputation systems to enforce cooperation. When we can co-opt existing systems [Villani et al., 2007], we save ourselves the effort of both building and maintaining a separate system. But here, something extra special is happening. The innovation was not simply horizontal - repurposing a technology for an analogous purpose - but vertical - repurposing a technology for the analogous purpose of maintaining itself. The slippage not only allows the system to perform a new task, but it also enhances its performance on its old task. This vertical co-optation is extra powerful because it sets up a positive feedback loop: enhancements to the system lead to further enhancements, since the system 'gives back' to itself. For instance, in the case where tools are used for tool-making, advances in tool-making technology not only lead to better tools, but better tool-making tools, which in turn could lead to even better tools. The same goes for compilers written in the language that they compile. For this reason, vertical co-optation seems to be at the root of leaps in complexity [Hofstadter, 1979].
On a technical level, our model combines signaling with an assurance that any given choice has some probability of being observed. This resolves a common problem in evolutionary models: explanations that continuously kick the can down the road until they stop at a set of assumptions that one might question, either for fundamental or practical reasons. This is not to say our model has no assumptions, but rather that its assumptions are as 'light' as possible (at least, we have strived for this), handling certain of the complications of kicking the can down the road by letting observers observe observers, and so on.
To be completely candid about the origin of this recursive structure, two reviewers in an initial round of reviews pointed out that our original model (which involved meta-signaling but no possibility of observation of meta-signalers) used meta-signaling as a deus ex machina, questioning the realism of simply signaling (notably since signaling has no direct cost). Our new model -that is, Model 2 described in this article -does not contain direct costs for signaling or meta-signaling, but it does have indirect (second order) costs.

Error
Error has the advantage of keeping the 'environment' varied. As social beings, other people make up a large part of our environment, and as a result, our adaptations often deal with navigating social situations. But a homogeneous population creates homogeneous social situations and correspondingly brittle social adaptations that cannot handle situations outside the monoculture. In this case, when we live in a completely homogeneous environment of cooperators, the selection pressure to maintain defense mechanisms against defectors is absent. This homogeneity problem is the reason that tit-for-tat fails to be evolutionarily stable: tit-for-taters look like unconditional cooperators when interacting with each other, so cooperative (but non-conforming) strategies are indistinguishable, and can invade by drift. In our model, error, by introducing heterogeneity, provides the pressure needed to maintain defense mechanisms against potential invading strategies.

Norms in communication
This model demonstrates the benefits and dangers of communicating moral information rather than simply factual information. With the image-scoring norm, the simple morality 'defection is bad, cooperation is good' had a one-to-one correspondence between factual states of the world (cooperate/defect) and moral states (good/bad). But with the more complex stern judging norm in our model, 'good' could mean cooperation with a cooperator, cooperation with someone who defected against a cooperator, defection against someone who cooperated with someone who cooperated with a defector... An infinite number of states like this collapse into 'good' and 'bad.' And this is the power of morality in communication: it collapses infinite factual information into (sometimes) a single bit. Agents need not know the infinite chain of who cooperated with which defectors and whether they defected against cooperators; they need only know two things: (1) what their partner did (cooperate or defect) and (2) whether that was against someone with a 'good' or 'bad' reputation. When Henrich objected to Boyd's recursive punishment strategy (punish those who are in bad standing, where someone is in bad standing if they fail to cooperate, or if they fail to punish someone in bad standing) [Henrich, 2004, Boyd and Richerson, 1992], he objected to agents having to track an infinite chain of possible transgressions - but moral communication solves this issue, by collapsing that infinite chain at the second step.
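The collapse of an infinite factual chain into one bit can be made concrete with a two-line judging function. The encoding (1 = good, 0 = bad) and the helper names are ours, not the paper's notation.

```python
# A minimal sketch of how a stern-judging norm collapses an arbitrarily deep
# history into a single bit of moral standing.

def stern_judging(action, partner_tag):
    """Good iff you cooperate with the good or defect against the bad."""
    if partner_tag == 1:
        return 1 if action == "cooperate" else 0
    return 1 if action == "defect" else 0

# The observer never needs the chain behind partner_tag -- only the one bit.
assert stern_judging("cooperate", 1) == 1   # cooperated with the good
assert stern_judging("defect", 0) == 1      # punished someone in bad standing
assert stern_judging("defect", 1) == 0      # defected against the good
assert stern_judging("cooperate", 0) == 0   # helped someone in bad standing
```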
Embedding moral norms in the communication system comes with danger. When dealing with norms and cooperation, one can ask three questions:
1. Does a given norm lead to high levels of cooperation?
2. Is that norm stable against other norms?
3. Is that norm stable against other norms when embedded in a communication system?
Clearly the first question is of a different class than the next two. But even questions 2 and 3 deal with profoundly different dynamics. In scenario 2, someone following different norms acts according to their private rule that deviates from the prevailing norm, but otherwise minds their own business. In scenario 3, however, a signaler subscribing to different norms 'pollutes' everyone else's reputation information with their own normative judgment, thus impacting not only their own actions but the actions of others. This can lead to domino effects that do not exist in case 2, which will therefore lead to different equilibria [Yamamoto et al., 2017].

Disagreements in standing
A reviewer remarked that we assume that there are no disagreements about an agent's tag. This is true, but only in a weak sense, as the agreement of taggers has an explanation that is endogenous to the model: that is, the state where all agents are stern discriminators is stable, and that world is a world in which all agents agree on standing, modulo error. That 'modulo error' is of course an important caveat - we do allow for disagreements, in the sense that agents can err in their tagging, and thus tag differently from how a non-erring stern discriminator would tag. However, in this setting, as long as error is not too high, the stern discriminator norm is nevertheless stable. This is partially because, unlike in cases such as image scoring, the stern discriminator norm, being aligned, does not lead to error propagation‡.
Because the model has only one observer for an action, there is only one case in the model where an agent may disagree with a tag based on her own observations: that is, if an agent disagrees with her own tag. Why, in this case, should she nonetheless signal conformingly? The simple answer is that it is potentially costly not to, as signaling non-conformingly always is (and signaling conformingly is cost-free). But perhaps this explanation for why an agent would so easily accept an erroneous tag seems overly pat -this is an issue that could be addressed in elaborations of this model.
Of course, one might ask why we don't allow for an action or signal being witnessed by multiple agents and getting tagged differently. Our reason is simple: the single-tagger case is in some sense the 'worst-case', because multiple taggers would only increase the fidelity of the average signal, by the law of large numbers.

Relaxation of assumptions
A reasonable question might be: how much of the structure of the model is necessary for the proofs to go through? Certain parts of the model, including the way observers are assigned, are quite specific. What is actually needed?
1. Any agent has a nonzero, possibly variable, probability of being observed, p.
2. When deciding how to signal, agents are unaware of whether they are being observed or not (though they may know their probabilities of being observed).
3. There is some expected discounting factor, k tot < 1, that accounts for a round-to-round discount rate, in addition to the possibility that one's tag might not stick, allowing the agent to escape the second-order cost entirely.
4. Agents err with some probability in actions, signals, and meta-signals (leading to some effective error rate ϵ′).
In this more general setting, it is still true that the cost of defecting non-conformingly, C def, is the lowest out of all ways of not conforming, because it allows the agent to save on the first-order cost γ of cooperation. So if we can find a set of parameters k tot, p, and ϵ′ for which C def > 0, then we can conclude that it is always costly to deviate. We need, therefore, the corresponding inequality to hold in this generalized setting. In other words, with p stick sufficiently large, and ϵ′ and p sufficiently small (but nonzero) - and all of them possibly variable - it is costly to act in a way that deviates from the stern discriminator strategy, even if the exact details of the model laid out above are not adhered to.
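As a sketch of the generalized condition (the exact inequality is not reproduced here, so the functional form below is our assumption, with the tag-persistence probability folded into the expected discount):

```python
# Generalized deviation cost under the relaxed assumptions. The functional
# form is our RECONSTRUCTION: p_stick and the per-round discount combine into
# an effective discount k_tot on the second-order punishment (1-2*eps')*beta.

def c_def_general(beta, gamma, p, p_stick, k_round, eps_eff):
    """Cost of deviating by defecting when observation probability is p,
    tags stick with probability p_stick, and payoffs discount by k_round."""
    k_tot = p_stick * k_round   # expected discount, including tag persistence
    punishment = (1 - p) * k_tot * (1 - 2 * eps_eff) * beta
    return punishment - gamma

# p_stick large, eps' and p small (but p nonzero): deviation stays costly.
assert c_def_general(beta=4.0, gamma=1.0, p=0.05, p_stick=0.9,
                     k_round=0.8, eps_eff=0.05) > 0
# Too much error (eps' near 1/2) destroys the second-order punishment.
assert c_def_general(beta=4.0, gamma=1.0, p=0.05, p_stick=0.9,
                     k_round=0.8, eps_eff=0.45) < 0
```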

The strength of the equilibrium
In the direct reciprocity case, tit-for-tat is only collectively stable; that is, it performs at least as well as any other strategy when it is dominant, a condition weaker than evolutionary stability (for more background, see Appendix A.1). In fact, no pure strategy can be evolutionarily stable in direct reciprocity models [Boyd and Lorberbaum, 1987]. Our equilibrium, by contrast, is stronger than evolutionary stability [Maynard Smith and Price, 1973], because in it, stern discriminators strictly outperform all other strategies.
Importantly, the equilibrium we found is not the only stable point in the system. Trivially, there is the symmetrical stern reverse-discriminator state that would clearly also be stable (cd0110). But even if there are stable states with defection, weak group selection would be enough to select for the cooperative equilibrium. Usually, group selection is under intense time pressure, as it must quickly kill off groups that tend towards defection before they propagate. In our case, groups do not tend towards defection, since the forces we laid out produce a cooperative equilibrium. Thus, group selection no longer has such time pressure, and it need only select, rather than maintain, the cooperative equilibrium [Henrich, 2004].

The benefits of social enforcement
We have demonstrated a mechanism for the social enforcement of truthful communication, which has a number of advantages:
1. Social enforcement is more flexible in the determination of costs. This flexibility not only allows for the truthful signal to be the least costly, but also to be potentially cost-free. Furthermore, it allows for more complex, combinatorial communication systems (where meaning is determined as a function of a set of symbols, rather than having a straightforward symbol-to-world-state mapping [Lachmann et al., 2001]), because costs can be differentially determined on the level of a statement, rather than, say, associating each symbol with a cost. This kind of flexibility would be unlikely if costs were determined physiologically, for instance [Lachmann et al., 2001].
2. It has stability benefits. When costs are not socially determined, but instead determined physiologically, there is selection pressure on the signaler to evolve ways of producing (perhaps untruthful) signals in a cheaper fashion, though there are ways to ensure that even physiological signals are honest, such as in Grafen's costly signaling model [Grafen, 1990]. But in the social enforcement case, costs are determined by recipients and, therefore, they are less likely to be destabilized by natural selection [Lachmann et al., 2001]. This is not to say that they are completely immune to such disturbance, however - one could imagine that signalers might evolve ways to evade social enforcement, by making their deceitful signals harder to verify.
There are constraints we place on our strategy set that make the social enforcement mechanism's job easier: for instance, signalers cannot adjust their signaling strategy based on whether they are being observed. This is not such a great problem. While there is sometimes an absence of pressure to signal in a conforming fashion, there is never any positive pressure for an individual to signal in a non-conforming fashion - there is no benefit to smearing or praising untruthfully, because it is not the reputation of others, but one's own reputation, that determines one's fitness. Thus, even if an agent could signal differentially based on whether they were being observed, if they entertained even a minute probability of being caught, they would always signal in a conforming fashion.

The ubiquity of reputation
Reputation -and the communication system that underlies it -is vital for all kinds of decision-making. In indirect reciprocity, it indicates who to cooperate with. In third party punishment, reputation would likely be necessary for the third party to know who to punish (see Appendix A.2 and [Boyd et al., 2010]). And for cooperation to function as a costly signal for mate quality or alliance potential [Gintis et al., 2001], there must be some kind of reputation mechanism that propagates information about who is cooperating -unless every cooperative act is observed directly by everyone, which seems unlikely.
For these cases, our results provide two alternative explanations. Either (1) a communication system for disseminating reputation information might have evolved for the indirect reciprocity case, and was then co-opted for use by third-party punishers, or (2) selection pressures analogous to the indirect reciprocity case exist in these other situations, and our research can inform their exploration. Third party punishment is especially similar to indirect reciprocity, since the latter is punishment by withholding of cooperation while the former is punishment by direct imposition of cost. While this creates a slightly different payoff matrix, it is likely that the three principles underlying the viability of communication in our case could be extended to third-party punishment.

Conclusions
Human relationships cannot be reduced to pairwise interactions. Instead, we are submerged in a tangle of observation and judgment, where people form opinions and base decisions on interactions they are not involved in, using gossip to gather information they have not observed directly. As institutions, legal systems, and governments develop, the role of the third party, and therefore of communication, only increases [Nowak, 2005, Alexander, 1986].
It is therefore vital to understand how a communication system maintaining all of this social information could possibly be stable. Here we found three sufficient conditions of stability for a simple binary-valued communication system: meta-signaling, error, and stern judging norms.
There are a number of directions to pursue in further work.
1. Alternative equilibria - We have found an equilibrium point. However, knowing the full set of equilibria, in addition to the probabilities of fixation at each given some initial conditions, will shed light on how likely a society is to attain the equilibrium we've identified.
2. Elaborate communication - human language involves combining symbols to create meaning. This type of combinatorial communication lacks some of the stability properties of more basic symbol-to-meaning systems (like ours) and therefore needs to be analyzed further [Lachmann and Bergstrom, 2004]. Relatedly, if communication is too elaborate, cognitive and informational bounds may place limits on communication and decision making [Price and Jones, 2020]. One intriguing way to address this is to use reinforcement learning to set agents' strategies [Köster et al., 2022].

Figure 6: Tit-for-taters always cooperate at least as many times as their partners. A possible sequence of moves between a tit-for-tater and a randomly chosen strategy. Every tit-for-tat defection is the result of the opponent's defection in the previous round - so the opponent will always have defected at least as many times as the tit-for-tater.
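The Figure 6 claim can be checked by simulation; the move encoding ('C'/'D') is ours.

```python
# Check that tit-for-tat never defects more often than its opponent: each of
# its defections echoes an opponent defection from the previous round.
import random

def play(opponent_moves):
    """Count defections for tit-for-tat vs a fixed opponent move sequence."""
    tft_defections, opp_defections, last_opp = 0, 0, "C"
    for opp in opponent_moves:
        if last_opp == "D":          # TFT defects only after being defected on
            tft_defections += 1
        opp_defections += (opp == "D")
        last_opp = opp
    return tft_defections, opp_defections

random.seed(0)
for _ in range(1000):
    moves = [random.choice("CD") for _ in range(50)]
    tft_d, opp_d = play(moves)
    assert tft_d <= opp_d            # TFT cooperates at least as often
```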
Still, the literature is informative for the cases in which punishment is actually costly. In these cases, the problem appears similar to the initial problem of costly cooperation. It is not quite the same, however, because when punishment is common, it becomes less costly, as there are fewer defectors to punish. Such is not the case with cooperation, which remains equally costly no matter how common it is [Boyd et al., 2003].
Partially due to this asymmetry, many solutions to the costly punishment problem have been proposed. If the punishers can recoup the costs of punishment by coercing those around them to cooperate, they can proliferate. If a recursive strategy exists that punishes non-cooperators, and punishes those that fail to punish non-cooperators, and so on, ad infinitum, then punishment can stabilize almost any behavior [Boyd and Richerson, 1992]. This is reminiscent of the behavior in our model. If some modicum of conformist transmission exists, that is, a tendency to absorb norms from the majority behavior of the group [Henrich and Boyd, 1998], punishment can also emerge [Henrich and Boyd, 2001]. Furthermore, if the costs of punishment are spread across the group, as is the case for ostracism, punishment can be stable [Hirshleifer and Rasmusen, 1989].
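The cost asymmetry noted above can be made concrete with a toy calculation (parameter values and function names are ours, purely illustrative): a punisher's expected cost scales with the frequency of defectors it encounters, while a cooperator's cost is fixed.

```python
def expected_punisher_cost(defector_freq, cost_per_punishment=1.0):
    # A punisher pays only when it meets a defector, so its expected
    # per-interaction cost shrinks as defection becomes rare.
    return cost_per_punishment * defector_freq

def expected_cooperator_cost(cost_of_cooperating=1.0):
    # A cooperator pays the same cost no matter how common cooperation is.
    return cost_of_cooperating

# When punishment is common, defectors are rare and punishing is cheap;
# cooperating costs the same either way.
assert expected_punisher_cost(0.01) < expected_punisher_cost(0.5)
assert expected_cooperator_cost() == expected_cooperator_cost()
```

This is why, once punishment is widespread, maintaining it is nearly free, whereas cooperation never stops being a sacrifice.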
are small [Maynard Smith, 1976, Brown et al., 1982]. Because the mathematical analysis of our model is feasible for very small and very large (infinite) groups, we chose the latter in order to fill the explanatory gap for cooperation in large groups.
• well-mixed groups - This means that any two individuals are equally likely to interact; agents do not preferentially interact with a subset of the group. Selective assortment, in which cooperators preferentially interact with other cooperators, is often used as an explanatory mechanism for cooperation in groups, since it reduces the payoff disadvantage of cooperators relative to defectors. Similarly, spatial structure, in which agents interact with nearby agents, has the effect of lowering the effective group size [Bowles et al., 2003, Wang et al., 2012]. These mechanisms are generally a tailwind in favor of cooperation, and by keeping our model well-mixed, we can examine the effect of indirect reciprocity in isolation, without the help of these other forces.
In many ways, therefore, our approximations, while making the math easier, make the evolution of cooperation harder. This is not true in all respects, however: a zero probability of invasion in the infinite case translates to a small but nonzero probability of invasion in the finite case (a probability that becomes negligible as n grows large).

Appendix C Proofs
Appendix C.1 Cooperation can be made arbitrarily close to 100%
Let us select a random agent, agent A, in an action round with her interaction partner, agent B. Will A cooperate? There are two cases: either B's tag is based on his signal (call the probability of this occurring p_s), in which case B followed his prescribed strategy with probability 1 − δ, or B's tag is based on his action (with probability p_a), in which case he followed his prescribed strategy with probability 1 − ϵ. In either case, if B followed his prescribed strategy, he will have a 1-tag.
If A doesn't err, she will cooperate if B has a 1-tag, and if she does err, she will cooperate if B has a 0-tag. We obtain therefore that if B's tag is based on his signal, A will cooperate with probability (1−δ)(1−ϵ) + δϵ, and if B's tag is based on his action, A will cooperate with probability (1−ϵ)² + ϵ². A's total probability of cooperating is

P(A cooperates) = p_s [(1−δ)(1−ϵ) + δϵ] + p_a [(1−ϵ)² + ϵ²].

Note that this approaches probability 1 as ϵ and δ approach 0, since the two bracketed expressions approach 1 in this limit, and p_s + p_a = 1 (one's tag must be based on either one's action or one's signal). We conclude that the percentage of cooperation in this state can be made arbitrarily close to 1.
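The cooperation probability derived above is easy to check numerically. Here is a minimal sketch (ours; function and variable names are not from the paper) that evaluates the two-case expression and confirms the limit:

```python
def coop_prob(p_s, eps, delta):
    # P(A cooperates) = p_s[(1-delta)(1-eps) + delta*eps]
    #                 + p_a[(1-eps)^2 + eps^2],  with p_a = 1 - p_s.
    p_a = 1.0 - p_s
    return (p_s * ((1 - delta) * (1 - eps) + delta * eps)
            + p_a * ((1 - eps) ** 2 + eps ** 2))

# With no errors, cooperation is certain; with small error rates it is
# arbitrarily close to 1, regardless of the split between p_s and p_a.
assert coop_prob(0.5, 0.0, 0.0) == 1.0
for p_s in (0.0, 0.3, 1.0):
    assert abs(coop_prob(p_s, 1e-4, 1e-4) - 1.0) < 1e-3
```

Both bracketed terms are continuous in ϵ and δ and equal 1 at ϵ = δ = 0, which is all the limit argument requires.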